October 07, 2008

David Cournapeau

The links of the week


A few articles which I have recently read:

  • The worst academic job . An article summarizing what’s wrong with academic career paths today in the US and in Europe.
  • Is Google making us stupid ?: . A bit late, but it is an interesting article on how Google, and more generally easy access to a vast knowledge base may influence how we think. I can’t help linking this to a recent work from James Evans on the effects of open access to science (discussed in The Economist here).
      

by cournape at October 07, 2008 06:31 AM

October 05, 2008

Gaël Varoquaux

My travels this summer

This summer has been hectic (life is hectic anyhow!). As I was switching fields from physics to neuro-imaging, I took the chance to travel to the US and to spend the summer doing Python-related stuff.

Austin - Enthought

I spent most of my time this summer at Enthought, in Austin, Texas. Actually it was the Enthought guys who really made my fantastic summer possible by paying me a good salary and thus indirectly funding my travels. Thanks Enthought, you guys rock.

Austin is a nice city. It’s got a very nice night life (thought the tequilla doesn’t match Jarrod’s expectations, and they don’t always accept South Africans in bars because they look like kids and “any college kid could fake this ID”).

I don’t know if it is particular to the Enthought guys, or it is just Texans in general, but the hospitality was simply incredible. I spent the summer at Eric Jones’ place (I had a separate little cabin for me, fantastic). I enjoyed a lot interacting with Eric’s family: Courtney and the kids, Zach and Liz. It was fun to be at a place with young children. By the way, they go around in scipy T-shirts and love when daddy’s friends (aka geeks) come around at the house. I bet they know the scipy community better than I do.

Of course, there is a catch: Eric lives out in the boonies (just in case you don’t speak Texan, this means “out in the bush”, ie in a remote location). It might even have been a trap: he lives in a dry county. Moreover, I woke up one morning to find this in my living room:

It’s not deadly, I have been told, it just hurts a lot. On top of that, the kids brought back a dead black widow one day. Now it may seem to some that I am making a stupid fuss about nothing, but the scariest animals we have where I live are probably cats.

The Enthought office is a very pleasant place to work. It large with a lot of space for everybody. It is full of very nice people (I knew that before coming, but it was nice to have a confirmation). I had the nicest office I have ever had so far (they seem to be improving each time I change jobs, that a good sign). I’ll talk about what I did there in another post (I actually started a screen cast about this, but the state of video-editing software under Linux is abymissimal, and I could never edit the sound track).

Dave Peterson: ETS operator

Visits to California

I made two trips to California. First I went to UCLA for a summer school on mathematical method in neuro-imaging. It was the occasion for me to discover the field. One thing that stroke me was the importance of software in the field, and how little people are organized to limit duplication of efforts. On the other hand, we had a very nice presentation about a beautiful software engineering effort (slicer) trying to build a platform to unite tools by Steve Pieper. Fernando and I where gritting our teethes during the talk, wondering what the license of the tools would be, and it turned out that Steve ended his talk with a discussion about why it seemed to him that the BSD was most suited. We were delighted.

My second visit was for the SciPy conference which was great fun. Fun to meet new people in real life, fun to meet old friends. I had the feeling the talks where excellent and I learned a lot of things. After the conference I went to Berkeley with the nipy team. Nipy stands for NeuroImaging in Python. It is a project led by a crack team at Berkeley with Chris Burns, Jarrod Millman, Fernando Perez, Tom Waite and now Matthew Brett. The team I am starting my new job with, in France, hopes to be able to integrate their software with nipy.

Berkeley is a nice place. I stayed at Fernando’s and Jarrod’s and it is always a pleasure to hang out with these guys. As far as work goes, we tried to do some 3D visualization of neuro-imaging data with Mayavi. We got some things done and the week ended in a party at Chris’s place where we greated Jarrod with a mac-book air displaying a really cool view of a brain. However I have the feeling I stayed just long-enough to understand the problems, and not to solve them. Damn, software is hard. Hopefuly my work will allow me to move further in this direction.

Back in France, and off to Prague

Well, after all these travels, I got back to France. Off course the first thing I did with my girlfriend, Emmanuelle, was to shoot out of the country. I need to get away of a computer, sometimes. We went to Prague. It is a very beautyful city, it has good beers, and it is the home town of Ondrej Certik, so that made three good reasons to go there. We had a great week end strolling through the city, in the old streets or in the castles (photos here). Most people there speak English, but I was lucky-enough to order fried cheese (doesn’t that sound nice?) to a lady not speaking English, so we fell back on Russian as a common language, and it is always fun to me to speak Russian (now that’s a long sentence).

We had a bunch of beers with Ondrej. We spent some times talking the world into a better place. We discussed licensing, and agreed that Cython was sooooo cool, and talked about all things under the (scientific Python) sun.

In the past year I have the feeling I have been all over the place. I changed jobs three times, moved houses once in France and twice abroad, and visited a lot of countries for the fun of it. In the near future I am planning to settle down in France to get some work done.

We need help

Speaking of work, I am starting a new career in a new academic field. This is going to require a lot of focus from me, and it will not leave too much time for open source work in the near future. We need help! There are many ways to help and not all of them involve coding. I think I spend not more than 50% of my open-source-devoted time on coding. We need better docs. We need more marketing (we are really bad at this: we have fantastic tools, but it is hard to see them). We need people to help each-others on the mailing lists. We need packaging. All this is paramount and takes a lot of time… And we need coding.

Jarrod, cherry picking ... patches?Indian cuisineSugar in your coffe?

by gael at October 05, 2008 10:24 PM

October 03, 2008

OpenOpt (Dmitrey Kroshko)

Very important bugfix for nonlinear problems

I have committed very important bugfix for nonlinear problems.
I don't know how many days the nasty bug was hiding and lurkering (I guess several months), did the one affects some solvers work or no, because it has been somehow (but maybe not everytime) overwritten by other code, and only my latest changes (committed to subversion repository some days ago) have revealed the one.

by Dmitrey (noreply@blogger.com) at October 03, 2008 02:18 AM

new Python optimization soft: Pyomo

I've got to know about Pyomo (thanks to Nils Wagner) that is made by Sandia Labs (Trilinos/PyTrilinos developers). In the article they say "However, Coopr Opt is not as mature as the OpenOpt package", I guess that's the reason why my site daily visitors number has jumped from ~ 50-60 to ~ 80-90 this week.

Also, in the paper they inform of willing to collaborate with OO. I had contacted them long time ago but the provided examples of their soft usage were too difficult for me to understand and connect to OO during an adequate time. I guess it will be much easier for them to understand my code instead.

by Dmitrey (noreply@blogger.com) at October 03, 2008 02:14 AM

October 02, 2008

Enthought

Plot highlighting using alpha values

I recently added support for changing the alpha of a plot after it has been constructed. Combined with a selection tool you get a really nice highlight feature. In this example, when a plot is selected I double its width and set the alpha of the other plots to 1/3 of their normal value.

plot with alpha

    def _select(self, selected):
        """ Decorates a plot to indicate it is selected """
        for plot in reduce(operator.add, selected.container.plots.values()):
            if plot != selected:
                plot.alpha /= 3
            else:
                plot.line_width *= 2

        plot.request_redraw()

    def _deselect(self, selected):
        for plot in reduce(operator.add, selected.container.plots.values()):
            if plot != selected:
                plot.alpha *= 3
            else:
                plot.line_width /= 2

        plot.request_redraw()

by bhendrix at October 02, 2008 09:28 PM

October 01, 2008

Ondřej Čertík

Gael and Emmanuelle in Prague

Last weekend Gaël Varoquaux together with Emmanuelle came to Prague, so it was really awesome to meet again (we first met with Gaël at the SciPy 2007 conference at Caltech).

Because I recently finished my master, I was basically visiting pubs each evening, so we first met on Friday at 10pm and just had couple Pilsens (I already had some beers that evening with other friends), then I had to go, as I was doing TOEFL on Saturday. At the exam I met a very beautiful girl and other interesting people, so we went to lunch together and as a result I arrived half an our later than we agreed with Gaël and Emmanuelle, but I think they understood my situation. :) Then we went around Vyšehrad, came to I.P.Pavlova, had some Czech meal and beers, then went on foot pass the Church of Saints Cyril and Methodius, where the Heydrich attackers were cornered (there is still the bullet-scarred window visible from the street), then continued over the bridge to the hotel, Emmanuelle went in, I then came with Gaël to another pub for couple of more beers.

The Python scientific community is very cool and I always enjoy meeting people from it and discussing things like cython, scipy, ipython, mayavi, sympy, matplotlib, sage, what license is best for each project, etc. in Prague pubs. Python has a lot of high quality libraries and tools for scientific computing, so things look very promising.

It was fun I really enjoyed that.

by Ondřej Čertík (noreply@blogger.com) at October 01, 2008 02:51 PM

September 30, 2008

Enthought

Texas Python Regional Unconference - Austin, TX

We’re excited about meeting over at the University of Texas (Enthought’s backyard) for the Texas Python Regional Unconference this weekend (October 4-5).

2007 Unconference Attendees

Program

It’s absolutely free to attend the Unconference.  It’s not too late to register, so add yourself to the list of attendees if you can make it.  There are open slots in the self-organizing program as well, so feel free to add yourself to the schedule on the wiki.

Wireless

It’s also important to note you’ll have to email me (travis@enthought.com) with the following information if you’d like to have wireless access at the meeting venue.

  • Full Name
  • Phone or Email
  • Address
  • Affiliation

Friday Dinner

Anyone who happens to be in town on Friday evening is welcome to come over to our offices and walk with us to grab dinner in downtown Austin. Feel free to come over anytime after 5:30pm to hang out and get the nickel tour — we’ll leave at 7:00pm to eat.

by travis at September 30, 2008 04:48 PM

Matthieu Brucher

Book review: An Introduction to Design Patterns in C++ with Qt4

Contrary to what the title may hint to, this book is an introduction to C++ and the Qt library. And in the process, the authors tried to teach some good practices through design patterns. So if you’re a good C++ or Qt programer, this book is not for you. If you’re a beginner, the answer is in my review.

Content and opinions

The first part is an introduction to C++ and Qt4. How to install and compile Qt, Basic types and operators are explained, as well as small other things. First red alert, using namespace std; is written in some headers, one bad point for good practices… Then objects are presented with classes (a yellow warning, some things are not correct, struct can be used to declare a class, and there is no need for a member access specifier) before the first short Qt chapter. It was very short, but one cannot explain Qt without some C++ lugagge, so this was to be expected. Strangely, lists were dealt with next and then functions. Well, why not. Finally, the part ends with polymorphism and inheritance. If the chapter is globally OK, there are some big red alerts when I read it. For instance, in their example, Square inherits from Rectangle, Ellipsis from Circle… This is not good practice at all. Square IS NOT A Rectangle. This is well explained in a lot of tutorials and books, and it should never be written as an example in a book about good C++ and Qt practice.
On a side note, some terms are used before their definitions… in the last part. So the organization of that part was not the best one.

The second part deals with Qt itself (there is still no complete C++ presentation…). After some reuse speach, design patterns are introduced. Well, there is only one in the chapter, so this chapter has only one section (surprising when you read the content of that chapter). The Qt core class QObject has a whole chapter, but there are some mistakes and some missing explanations. First with signal and slots, slots can be called directly, as well as signals (for the moment). And with the normal way of signaling, there is no thread waiting for another, it’s just a “stupid” function call (the authors claim that the signal blocks the thread until the slot is executed, but the slot runs in the same thread, so this is not possible). And finally, there are no correct explanation of the Q_OBJECT macro. The remaining of this part is OK (compared to what I’ve said before, it only means that I didn’t find something as big), going from containers to widgets, XML, regular expressions, but also concurrency. Other design patterns are presented in several chapters.

The third chapter should have been the second one, it’s a C++ reference beginning with C++ type mecanisms (types themselves, casts, …), scope and namespaces (at last), what is a statement and control structures (how is it possible to make an introduction to a language and talk about those important pieces of the language in the last chapter ??), exceptions (not that well explained, like for exception specification, the authors should read some of Herb Sutter’s books), memory handling (memory leaks, segmentation faults, …) and finally how does inheritance work.

Oh, I forgot, a macro does not have a namespace, macros are managed by the preprocessor. I think the authors meant “functions” instead.

Conclusion

Sadly, the book missed its main goal, introducing novices to C++ and Qt and teach them good coding practices. They would be better with a good C++ book and the Qt4 book.

An Introduction to Design Patterns in C++ with Qt 4 (Bruce Perens’ Open Source Series) (Paperback)
by Alan Ezust, Paul Ezust
ISBN: 0131879057

Price: USD 45.51
39 used & new available from USD 37.00

| 4 | 8

by Matt at September 30, 2008 08:25 AM

September 29, 2008

OpenOpt (Dmitrey Kroshko)

Changes for converters

I have committed lots of changes for (all) converters; also, the issue with splitted funcs (for nlps2nlp and lsp2nlp) has been fixed.

by Dmitrey (noreply@blogger.com) at September 29, 2008 01:47 PM

PSwarm v 1.2: bugfixes for Linux

After another one OO user (Nils Wagner) has informed PSwarm developer Ismael Vaz about same bugs ("use -fPIC" and "undefined symbol 'opt'") the bugs have been fixed.

PSwarm v. 1.2 is already available with the bugfixes.

I have removed mention of the bug from GLP docpage entry.

by Dmitrey (noreply@blogger.com) at September 29, 2008 01:31 PM

September 27, 2008

OpenOpt (Dmitrey Kroshko)

new converter: nlsp2nlp

I have committed nlsp2nlp converter, see NLSP page for details. It tries to minimize
   ||F(x)||2
and, of course, it is capable of handling constrained problems.

You can try updated nlsp_1.py or nlsp_constrained.py from /examples.

However, a requirement should be satisfied: non-linear functions shouldn't be splitted.

This minor issue is intended to be fixed in future (BTW, splitting can't yield benefits for nlsp2nlp converter with any NLP solver).

by Dmitrey (noreply@blogger.com) at September 27, 2008 02:24 PM

Enthought

EPD with Py2.5 v4.0.300 Beta 3 released

We’ve recently posted the third beta release of EPD (the Enthought
Python Distribution) with Python 2.5 version 4.0.300.  You may download
the beta from here:

http://www.enthought.com/products/epdbeta.php

Please help us test it out and provide feedback on the EPD Trac
instance: https://svn.enthought.com/epd  You can check out the release
notes here: http://www.enthought.com/products/epdbetareleasenotes.php

About EPD
The Enthought Python Distribution (EPD) is a “kitchen-sink-included”
distribution of the Python™ Programming Language, including over 60
additional tools and libraries. The EPD bundle includes NumPy, SciPy,
IPython, 2D and 3D visualization, database adapters, and a lot of
other tools right out of the box.

http://www.enthought.com/products/epd.php

It is currently available as a single-click installer for Windows XP
(x86), Mac OS X (a universal binary for OS X 10.4 and above), and
RedHat 3 and 4 (x86 and amd64).

EPD is free for academic use.  An annual subscription and installation
support are available for individual commercial use.  An enterprise
subscription with support for particular deployment environments is also
available for commercial purchase.  The beta versions of EPD are
available for indefinite free trial.

by dpeterson at September 27, 2008 06:14 AM

September 23, 2008

Barry Wark

SciPy 2008 Proceedings online

The proceedings of the 2008 SciPy conference are now online. The articles provide a snapshot of some of the great work being done using SciPy as well as valuable references for SciPy users. Gaël Varoquaux has done a great job getting the whole collection in shape. Thanks Gaël!

by Barry (noreply@blogger.com) at September 23, 2008 12:40 PM

OpenOpt (Dmitrey Kroshko)

Some more openopt install issues

Latest NumPy (taken from subversion repository) forced some OO users to inform me about some more openopt's install issues.

First of all, sometimes openopt's "python setup.py install" starts to search and download numpy 1.1.x from internet, while recent numpy from svn is already installed.

At second (it's especially inconvenient to me), modifying any single line in openopt's sources and then running "python setup.py install" now recompiles all openopt's files. Even if there were no changes at all, anyway all OO files are recompiled. Taking into account that I run this dozens times per day it's very annoying, and my increases danger for my HDD. Of course I could use OO without "python setup.py install" but this is inconvenient because of some reasons (I had already tried).

So, please take into account: these issues are not up to me, they are up to numpy.distutils developers. I hope they will fixed it ASAP.

by Dmitrey (noreply@blogger.com) at September 23, 2008 07:12 AM

September 22, 2008

Gaël Varoquaux

SciPy Conference proceedings

The SciPy conference proceedings are finally available online: http://conference.scipy.org/proceedings/SciPy2008 .

I hope you enjoy them. I find it great to have this set of excellent articles talking about works done with, or for, Python in science. For me, it is a reference to remember what was said at the conference. I hope it can also be interesting for people who were not present at the conference.

I apologize for being so slow at publishing them. In addition to the round trip between authors and editors taking a while, I have been travelling back home and spent way too much time last week finishing off administrative duties in the US.

by gael at September 22, 2008 01:54 PM

Ondřej Čertík

master studies

I did it! :) I defended my thesis and passed master finals from theoretical physics at Charles University in Prague couple hours ago.

After almost dropping out of my school exactly a year ago for not having enough credits to go to the next year, I gave myself an obligation to finish my school on time. I worked very hard the last year, I had to do 8 exams, some of them very hard, requiring more than a week of thorough learning and 9 seminars, requiring a lot of work too and also a master thesis, for which I had to had working finite element solvers and together with SymPy it took all my time and energy. I even had to cancel my trip to Austin and Caltech for the SciPy conference. But I finished my school after all, I am very happy about it.

Last year, two of my friends bet $100 between themselves that I will not finish on time. Jarda, who believed in me is now at Princeton doing his Ph.D. Matouš, who didn't believe in me, will now pay $100 to Jarda. I think that life is fair.:)

Now I'll be visiting pubs quite often and then I'll fix some long standing issues in SymPy, hopefully finish my Debian task & skills (to finally become a Developer couple months after that) and finally do useful stuff for my research with a fresh head now.

by Ondřej Čertík (noreply@blogger.com) at September 22, 2008 10:34 AM

OpenOpt (Dmitrey Kroshko)

Install issues with latest numpy

Recently some changes have been committed to numpy, and previous openopt versions can't work with those ones.

So I have committed some changes to openopt, and AFAIK now openopt runs with all known to me numpy versions >= 1.1.0 correctly.

Hence, install/update OpenOpt from latest tarball or subversion repository is recommended.

by Dmitrey (noreply@blogger.com) at September 22, 2008 08:42 AM

September 21, 2008

Prabhu Ramachandran

Python list (in Cython) vs. NumPy

Taking my previous benchmark a little further I decided to see how well iterating over a Python list of doubles compares with using NumPy arrays. Here is an extremely simple example that implements the sum function in Cython and compares the result with NumPy's sum method.

Here is the Cython code:

# --- csum.pyx ---
def csum(list array):
cdef int i, N=len(array)
cdef double x, s=0.0
for i in range(N):
x = array[i]
s += x
return s


Here is a setup.py to build it:

# --- setup.py ---
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
setup(cmdclass={'build_ext': build_ext},
ext_modules = [Extension("csum", ["csum.pyx"])])


To time it in IPython I created a simple file called test.ipy like so:

# --- test.ipy ---
import csum
import numpy
for i in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 250000000]:
print '-'*80
print 'N =', i
a = numpy.linspace(0, 1, i)
b = a.tolist()
print "Cython:",
%timeit csum.csum(b)
print "NumPy:",
%timeit a.sum()


I run it using IPython and here are the results (formatted a little):

N Cython NumPy
10 534 ns 10.1 micros
100 1.76 micros 10.8 micros
1000 15.3 micros 19.3
10000 150 micros 101 micros
100000 1.75 ms 933 micros
1000000 19.7 ms 9.24 ms
10000000 198 ms 92.1 ms
25000000 499 ms 231 ms


This was done on a P4 3 Ghz machine and clearly lists do quite well. At small sizes they outperform NumPy and at really large sizes they are about twice as slow. This is very good considering how general lists are.

by Prabhu Ramachandran (noreply@blogger.com) at September 21, 2008 03:57 AM

September 20, 2008

Titus Brown

The Future of Bioinformatics (in Python), part 1 (b)

My last post initiated a discussion on the biology-in-python mailing list about BioPython, among other things. (Here is a link to the discussion, which is kind of long and unfocused.)

I'm happy that the bip list is serving as a place for people to interact with the BioPython maintainers to discuss the future of BioPython. Hopefully it will lead to more involvement with BioPython, which would be a good thing.

However, I would like to take the time to question the longer-term utility of the BioPython/Perl approach.

Bioinformatics -- by which I mostly mean sequence analysis -- has predominantly followed the UNIX scripting/pipeline model, in which data is kept in simple, easily-manipulated formats (comma- or tab-separated values, or CSV) and then processed incrementally. This approach has a number of advantages:

  • Each step is isolated and so easier to understand.
  • Each step produces a simple, easy-to-parse kind of data.
  • Each step is language neutral (anything can read CSV).
  • New programmers can learn to use each step in isolation.
  • The components are re-usable.

I've used this exact approach for well over a decade: first for analyzing Avida data, then Earthshine, and most recently scads of genome data.

This scripting & pipeline approach is what BioPython and BioPerl facilitate. They have a lot of tools for running programs to produce data and loading in different formats, and they serve as a good library for this purpose.

The scripting/pipeline approach does have some deficiencies as a general data-analysis approach:

  • Poor (O(n)) scalability: processing CSV files is hard to do supra-linearly, and often the easiest analysis approach is actually O(n**2)
  • Hard to test: generally people do not test their scripts. Even now that I've become test infected, I find scripts to be more difficult to test than modules and libraries. I can do it, but it's not natural for me, and empirical evidence suggests that it's not natural for most people.
  • Hard to re-use: scripts are often quite fragile with respect to assumptions about input data, and these assumptions are rarely spelled out or asserted within the code. This leads to hard-to-diagnose errors that often occur deep within the tool chain (if they ever show up explicitly).
  • Poor metadata support: try attaching metadata to a CSV file. You'll end up with something like GFF3, which overloads the metadata field to mean something slightly different with each database. Awesome.
  • Too easy to map into SQL databases: yes, you can load CSV files into SQL databases, but JOINs are a relatively rare form of actual data analysis -- and that's what SQL databases are best at. SQL databases do a particular poor job of interval analysis (overlap/nearest neighbor extraction/etc.)
  • Poor abstraction: when you load something into memory from a CSV file, it's easy to treat it as a list. Lists are, generally, a poor way to interact with sequence annotations. (This is really the same problem as the SQL database problem.)
  • Poor user interface: it's hard to put lipstick on a script! People who aren't comfortable with UNIX and file munging (i.e. most biologists) have a hard time using scripts, and it's rather difficult to wrap a script in a GUI or Web site.
  • Poor reproducibility: every scientist I know has trouble keeping track of what parameters they used last time they ran a script. Even if they keep track of things in a lab notebook, that's a poor medium for reference; logging and notebook software don't seem to work very well for this, either.

These deficiencies didn't bother me too much when I was first interacting with genomic data, but they've become glaringly apparent in the face of massively parallel sequencing data. The advent of 454 and (particularly) Solexa sequencing data, where you can get tens of millions of short reads from a DNA sample, means that scalability concerns dominate; the ready availability of such data means that everyone has some and needs to analyze it, and they want good, fast, correct tools to do so. In the struggle to cope with this data, things like maq emerge, which uses a largely opaque intermediate data format to make Solexa data analysis scalable; this ends up being a bit of an intermediate model, where you query and manipulate maq databases from the command line. It can be scripted, but it doesn't have the advantages of language neutrality or easy parse-ability, and so you lose some of the advantages of scripting. Since maq doesn't really work as a programming library, either, you don't gain the benefits of abstraction (it's designed on extract-transform-load model where you run each command as an isolated operation). There are lots of pieces of bioinformatics software like this: they solve one problem well, but they're not built to output data that can be easily combined with data from another program -- at which point you run into format and scripting issues.

For me, the deficiencies of the scripting model largely come down to the lack of an abstraction layer that separates how the data is stored from how I want to query and manipulate the data. The introduction of a good abstraction layer immediately potentiates re-usability, because now I can separate data loading from data query and start using objects to build queries. It also makes scalability a matter of building a good, general solution once, or perhaps building specific solutions that all look the same at the API level. Once the API is firm, it's relatively straightforward to test; once I can separate the API from implementation I can implement different backend storage and retrieval mechanisms as I like (pickle, SQL, whatever); and I can build a GUI interface without having to change the internals every time I change data storage types or analysis algorithms.

On the flip side, once move into a framework, you now have the problem that you're coding at a level well above most newbie programmers and biologists, so ease-of-use becomes a real issue. This means that people need good documentation and good tutorials, in particular -- the Achilles Heel of open source & academic software. And, of course, the framework has to actually work well and solve problems well enough to reward the casual scientist who needs a tool.

So, with respect to BioPython, I appreciate the functionality it has, but I think the model is wrong for my work (and for work in a world full of genomes and sequence). What I really want is a complete solution stack for sequence analysis and annotation:

   data storage
        --
   object layer
        --
  scripting layer
        --
user interface tools

I'm out of time now, but next installment, I'll talk about how pygr provides much of this "solution stack" for me.

If you're interested in a longer, more detailed version of much the same argument, see Chris Lee's paper with Stott Parker and Michael Gorlick, Evolving from Bioinformatics-in-the-Small to Bioinformatics-in-the-Large.

For a recent overview of pygr's functionality, see the draft paper, Pygr: A Python graph framework for highly scalable comparative genomics and annotation database analysis.

--titus

September 20, 2008 07:58 PM

Prabhu Ramachandran

Python vs. Cython vs. D (PyD) vs. C++ (SWIG)

In April 2008 there was a thread on the scipy-dev list regarding the inclusion of Cython code in SciPy. In that thread, I mentioned a particular use case of interest to me -- creating and manipulating an array of objects (rather than arrays of elementary data types) and being able to do that with Cython easily and efficiently.

The problem I was considering is a simple one. I create a list of "Vortex" objects and compute (naively) the velocity of a collection of these particles on one another. This is an O(N^2) computation since every particle influences every other. The idea was to create simple OO code to be able to perform these computations. Here is an outline of the Python code for doing this:

class Vortex(object):
def __init__(self, pos=0.0, strength=1.0):
# ...
def eval_velocity(self, pos):
return -1j*self.strength/(2*pi*(pos - self.position))

class VortexManager(object):
def __init__(self, vortices=None):
# vortices is a list of vortex objects.
self.vortices = vortices

def set(self, pos, str):
# ...

def velocity(self):
for va in self.vortices:
vel = complex(0, 0)
for vb in self.vortices:
if va is vb:
continue
else:
vel += vb.eval_velocity(va.position)
va.velocity = vel


Very straightforward code. Now, back in April I implemented this in pure Python, C++ and wrapped the C++ code to Python using SWIG. I also implemented it in D and wrapped that using PyD. I found that D was about 1.7 times slower than C++. C++ was about 300-400 times faster than the pure Python version.

I attended Robert Bradshaw's Cython tutorial at SciPy08 and really liked it. About 10 days ago I finally found the time to create a Cython version and the winner is ...

Cython!

I've put up all of the code here. To use the code, untar the tarball and do the following:

$ cd cpython_d_cpp
$ python setup.py build_ext --inplace

This requires SWIG, numpy and Cython to build. If you have PyD installed it will build the PyD extension also. To test the performance do this:

$ python test_perf.py

This produces the following output for me on a P4 (3 Ghz):

dee(4000): 1.87730193138
(1411.53285812+1411.53285812j) (945.091286479+945.091286479j)
swig(4000): 1.10782289505
(1411.53285812+1411.53285812j) (945.091286479+945.091286479j)
cython(4000): 1.15034103394
(1411.53285812+1411.53285812j) (945.091286479+945.091286479j)
Pure Python(200): 1.14771318436
# N SWIG Cython Ratio
1000 0.071 0.069 0.967
2000 0.283 0.274 0.968
3000 0.638 0.619 0.970
4000 1.135 1.100 0.970
5000 1.767 1.720 0.973
6000 2.517 2.473 0.983
7000 3.474 3.370 0.970
8000 4.541 4.403 0.970
9000 5.698 5.575 0.978
10000 7.000 6.879 0.983


The first few numbers just test one single case of 4000 particles. D is slower than both C++ and Cython. Python is dog slow (or donkey slow as I like to say it)! For some reason I was getting segfaults when I tried to test the D wrapper for more than 3000 particles. On my Macbook the Cython version is actually 30% faster than the C++ version and on a Sempron 2800 Cython is about 25% slower. So different machines produce different numbers. However, C++ and Cython are both in the same ballpark.

What I loved about the Cython code is that I use a Python list to manage the Vortex objects. This shows that we can use the normal Python containers to manage objects. This is extremely convenient. This isn't very surprising either since Python containers are also heavily optimized. "cython -a" was a huge help when getting things done. For more details on the Cython code look at the tarball.

Clearly, if you are building code from scratch and need speed, Cython is an excellent option. For this I really must congratulate the Cython and Pyrex developers.

by Prabhu Ramachandran (noreply@blogger.com) at September 20, 2008 12:29 PM

September 19, 2008

Prabhu Ramachandran

Mayavi Screencast now on blogger

Finally, I've managed to upload the screencast I posted last about on blogger.
If you are on a Mac or on Windows and downloaded the file I put up earlier, you might need to install the theora component from here.

I have no idea how it will eventually show up with the processing that goes into it. Depending on how this goes I'll put up new screencasts either here or perhaps at Showmedo. Anyways, here it is:


by Prabhu Ramachandran (noreply@blogger.com) at September 19, 2008 03:25 PM

Mayavi2-3.x screencast

Here is a screencast (16 Mb Ogg Theora) of some of the new Mayavi2 features in the 3.x series.

http://www.aero.iitb.ac.in/~prabhu/tmp/videos/mayavi2-3-screencast.avi


Its an ogg theora video. This is my first screencast so its going to be a little rough. There is too much noise in the sound recording and I'll try and fix that next time. BTW, I used the excellent istanbul for the video recording. I recorded the sound track separately and mixed the two with a gst script I got from here.

I was unable to upload this video on this blog because the IITB network at this time is unbelievably slow. If you know of another place I should consider uploading the video or would like to host the video elsewhere please let me know. Feel free to download it and host it elsewhere.

by Prabhu Ramachandran (noreply@blogger.com) at September 19, 2008 12:25 PM

OpenOpt (Dmitrey Kroshko)

new converter: lp2nlp

I have committed lp2nlp converter, see LP page for details.
See also the remark about ipopt.opt file here

by Dmitrey (noreply@blogger.com) at September 19, 2008 06:00 AM

September 18, 2008

Prabhu Ramachandran

Announcing the Mayavi2-3.x series

This is a long overdue announcement. ETS-3.0.0 was released just before the SciPy conference in August 2008. Mayavi-3.0.0 was released as part of this. There are a huge number of significant changes to Mayavi as compared to 2.x. Note that we are still calling this Mayavi2-3.x.y since the Mayavi2 represents a departure from the older Mayavi-1.x series and Mayavi2-3.x is simply the next version of Mayavi2. The full details of the changes from the 2.x series are documented in the CHANGES.txt file in the mayavi documentation directory. Here is a summary of the major changes to this series thus far. The current release is 3.0.3.

Core Mayavi:
  1. I've added all the modules and filters that were available in Mayavi1 into Mayavi2. The only module I didn't port is the Locator module which didn't seem very useful. Mayavi2 now has more modules and filters than Mayavi1 had. Now there isn't an excuse to continue using Mayavi1.5.
  2. Users can now right-click on the nodes on the tree view to create new sources, filters and modules.
  3. The menu entries for the modules and filters (on the app and on right-click) are all context sensitive. So if your data doesn't support a particular module you shouldn't be able to add it from the UI.
  4. The file->open menu is far cleaner and exposes just one "Open" item that automatically lets you open any supported data.
  5. Added a toolbar to the engine view that offer icons to make it easy to add new sources/filters and modules. Special "Adder nodes" are added to the tree view when a scene/source is empty that makes it easy for new users to use the mayavi pipeline.
  6. Added a -o/--offscreen option to the mayavi2 application so you can run mayavi offscreen if your VTK version supports it. This in combination with the -x command line option makes for a powerful combination.
  7. New and much easier extension mechanism for the mayavi library and app via a user_mayavi.py and site_mayavi.py.
  8. Added a tvtk_doc module/script that lets you search through the TVTK classes (with and/or keyword support), this is similar to Mayavi1's vtk_doc script.
  9. Added a GenericModule that makes it very easy to put together a bunch of components/filters to create a new module.
  10. Added Optional, Collection filters that let you easily build filters out of combinations of existing components or filters.
  11. Added a new SetActiveAttribute filter that lets you choose the active scalar/vector/tensor attribute, this lets you do neat things like plot iso-contours of one scalar on top of the iso-contour of another, see examples/mayavi/contour_contour.py for an example.
  12. Gaël sphinxified the documentation to make it look much nicer and fully searchable.
  13. Better and more complete testing, these are unfortunately integration tests currently and will slowly be made into proper unit tests.
  14. The mayavi2 application and plugins are now ported to use Envisage3 which is much cleaner and nicer to work with than Envisage2.
  15. There is now a full-fledged preferences framework for Mayavi (to access the preferences use, from enthought.mayavi.preferences.api import preference_manager).
  16. Some parts of the API and file organization has been cleaned up. This is mostly related to the location of some modules, the core scripting API hasn't really changed.
  17. The project is now called Mayavi and not MayaVi as before. This avoids unnecessary confusion on how to pronounce the name and avoids any comparison with either Maya or Vi.
  18. ETS itself is reorganized into a much smaller set of packages unlike the 40 odd packages in the ETS-2.x series. This makes dependency handling, packaging and installing much easier.

Mlab:

  1. The enthought.mayavi.mlab.pipeline is complete and can be used to fully script mayavi.
  2. The mlab API has changed to be more consistent with the naming style used in ETS, for example isosurface has become iso_surface, extractedges becomes extract_edges etc.
  3. Added a show() function and decorator to allow users to easily create standalone scripts.
  4. mlab.pipeline.open lets you open any supported data.
  5. The mlab API can now take either engine or figure keyword arguments. This allows to avoid the use of the global sate set in the mlab engine. Mlab also now exposes a set_engine function.
  6. It is easy to change visualized data using the .mlab_source attribute on objects created from mlab. This makes it very efficient and easy to create animations from mlab. See here for more details.
  7. Mlab by default uses a MayaviScene that features a convenient Mayavi icon which brings up the engine view using which you can edit the pipeline from the UI (using the toolbar or right clicks). This gives mlab the full power of mayavi.
Apart from these significant feature additions there have been the usual round of bug fixes and new bugs introduced.

As you can see this is a very significant release that marks a very important phase for mayavi2. All the additions made at the sprint went into this release.

Currently it is probably easiest to install mayavi via either enthought's EPD or Python(x,y). Gaël has made available Ubuntu packages that are available at a link he mentions here. Dave Peterson has been making all of ETS available from PyPI, however you'll have to get all the dependencies installed (numpy, VTK, wxPython or Qt4). The best place to look for installation instructions is here, https://svn.enthought.com/enthought/wiki/Install

Enjoy.

by Prabhu Ramachandran (noreply@blogger.com) at September 18, 2008 12:06 PM

OpenOpt (Dmitrey Kroshko)

new converter: qp2nlp

I have committed qp2nlp converter, see QP page for details.
Let me note that if a problem have been converted to NLP from QP, LLSP, LP (latter will be added soon), then the following lines are automatically appended to ipopt.opt file:

jac_c_constant yes
jac_d_constant yes
hessian_constant yes

Also, bugfix for example qp_1.py has been committed:

Instead of
x1^2 + 2x2^2 + 3x3^2 + 15x1 + 8x2 + 80x3 -> min

we have
0.5 * (x1^2 + 2x2^2 + 3x3^2) + 15x1 + 8x2 + 80x3 -> min

I.e. if H = diag((1,2,3)), instead of
xHx + fx -> min

we search for
0.5 * xHx + fx -> min

by Dmitrey (noreply@blogger.com) at September 18, 2008 08:20 AM

September 17, 2008

Gaël Varoquaux

My computer wants a break

Hell, I had a crazy summer of hacking (it not quite over yet, and I really need to blog about it). I am starting too feel a bit tired, but apparently my laptop feels even more so.

My DELL laptop has decided it would die on me: it no longer charges. The good news is that I still have the dekstop at work, and I am going back to France at the end of the week, but in the mean time, my efficiency on hacking on the various exciting projects I am involved with is going to take a hit.

by gael at September 17, 2008 04:41 AM

September 16, 2008

Enthought

I Hate Web Browsers

I just wrote a long and brilliant post into a text box in a web browser.  I hit Command-Left Arrow to go to the beginning of the line.  (For those without MacBooks, Fn+Left Arrow is Home, which *should* take you to the beginning of the line, but for some reason text boxes in Firefox on OS X don’t actually respond to Home, so you have to use Command-Left Arrow.)

The problem is that not all javascript-y AJAX-y sexy WYSIWIG Web 2.0 text areas actually *implement* Command-Left Arrow.  A basic Firefox text area *will* move to the start and end of line with Command-Left and Command-Right arrows.  But for some reason, when a javascript-y AJAX-y WYSIWIG Web 2.0 editor wraps a text area, sometimes they don’t handle Command-Left, and instead pass that straight through to Firefox.  What does Firefox do with Command-Left?  Why, return you to the previous page, of course!

This is about the 3rd or 4th time this has happened to me.  In fact, I have a habit now of doing “select all; copy” so I at least have the text stored in the system clipboard.  I know I am not the only one.  Why doesn’t Firefox just have an internal flag on text areas that triggers whenever the user actually enters more than, say, 10 words, and automatically prompt the user if they try to close or navigate away from a page with unsaved text in those text areas?  Can someone more knowledgeable about web browsers and plugins and DOMs tell me if this is a difficult thing to implement as an add-on?  I know there are some plugins that allow you to manually save the text in an edit window to disk, but that’s way too heavyweight and manual of a process.  Does anyone have a suggestion for a plugin or add-on that actually solves this problem?

by pwang at September 16, 2008 02:54 PM

Matthieu Brucher

Book review: Beautiful Code: Leading Programmers Explain How They Think

I got this book from a partnership between http://www.developpez.com/ and O’Reilly. Thanks to both of them.

What defines “beautiful code”? How do people think a beautiful code should look like? This isn’t a simple question to answer, so this book asked several lead programmers (Ruby, Python, C, C++, Java, Perl, …) some beautiful code they wrote or they encountered. And if some want to answer “think about a robust, simple to extend code and that will be it” (and I would be one of them before I read the book), there are some code that would not fit this profile.

Content and opinions

I won’t speak about each chapter because there are numerous and my goal is not to analyze what programmer wrote what. The book has no parts per se, but some chapters can be grouped together.

There is a progression through the chapters, from pure code to architecture and design. The study domains are very wide, regular expression, subversion core, the quick sort algorithm, the evolution of the fast mathematical routines, but also how to test, some pitfalls of concurrency programming, how to design an efficient GUI for bioinformatics or simply an UI for disabled people, … In every case, there is some thing close to what one is doing.

Some may think that there are a lot of books on code writing or code architecture, code design, … but there is not another “Beautiful code”, at least at this time of blogging. It melds these aspects of programming together, and this is what is interesting in this book. It’s not meant to be a reference like the Cormen for “beautiful” algorithms, or Martin Fowler’s book on design, Kent Beck’s on developement, so don’t wait for the same things.

Conclusion

If you are looking for theory about good code practices, this is not a book for you. If you are looking for culture, for some applied code pratices (how they interact with the code, how code can evolve, …), go and read this book.

Beautiful Code: Leading Programmers Explain How They Think (Theory in Practice (O’Reilly)) (Paperback)
by
ISBN: 0596510047

Price: USD 38.31
43 used & new available from USD 28.75

| 3.5 | 31

by Matt at September 16, 2008 08:24 AM

September 15, 2008

Enthought

ETS 3.0.2 Released!

I’m pleased to announce that the Enthought Tool Suite (ETS) 3.0.2 has
just been tagged and released!

Source distributions (.tar.gz) have been pushed to PyPi.   Window’s
binaries will be built and uploaded to PyPi over the next 24 hours or so.

You can update to ETS 3.0.2 like so:
easy_install -U ets[nonets]>=3.0.2

Changes
ETS 3.0.2 is an update to ETS 3.0.1 that includes the following changes:
* Update of Enable to fix problems doing ’setup.py install’.
* Update of ETSProjectTools to fix bugs and improve the help messages.
* Update of Mayavi to fix bugs found during the SciPy conference.
* Update of Traits, TraitsGUI, and TraitsBackend* to fix a number of
issues (see https://svn.enthought.com/enthought/query?milestone=Traits+3.0.2)

What is ETS?
The Enthought Tool Suite (ETS) is a collection of components developed
by Enthought and our partners, which we use every day to construct
custom scientific applications. It includes a wide variety of
components, including:* an extensible application framework
* application building blocks
* 2-D and 3-D graphics libraries
* scientific and math libraries
* developer toolsThe cornerstone on which these tools rest is the Traits package, which
provides explicit type declarations in Python; its features include
initialization, validation, delegation, notification, and visualization
of typed attributes.

More information is available for all these packages from the Enthought
Tool Suite development home page: http://code.enthought.com/projects/tool-suite.php

by dpeterson at September 15, 2008 09:11 PM

OpenOpt (Dmitrey Kroshko)

OpenOpt release 0.19

Hi all,
I'm glad to inform you about new OpenOpt release: v 0.19.

Changes since previous release 0.18 (June 15, 2008):
  • Some changes for NLP/NSP solver ralg (especially related to handling linear constraints Ax = b, Aeq x = beq, lb = x = ub)
  • Bugfix for ralg and IPOPT linear constraints handling
  • ALGENCAN v 2.0.x has been connected (v 1.0 is no longer supported, v 2.0.3 or later is required)
  • Bugfix for constrained NLSP graphic output (constrained nssolve isn't turned to latest ralg version yet)
  • Scale parameter for lpSolve (p.scale = {False} | True | 0 | 1)
  • New OO class LLAVP (linear least absolute values aka linear least deviations)
  • Improved handling of non-linear functions with restricted dom
  • GLP (global) solver galileo now can handle integer problems (via p.useInteger = 1 or True)
  • Another one GLP solver connected: pswarm
  • Lots of work related to oofun concept (see OO Doc page for details)
  • Add converters llsp2nlp, llavp2nsp
  • Convenient handling of maximization problems (via p.goal = 'max' or 'maximum')
  • Some code clean up and bugfixes

Backward incompatibility:
  • Changed objective function in LLSP
  • MATLAB-style gradtol renamed to gtol (for to provide same syntax to scipy.optimize fmin_bfgs, fmin_cg and less-to-type)
Regards, Dmitrey.

by Dmitrey (noreply@blogger.com) at September 15, 2008 12:28 PM

changes for ralg and oofun derivatives example

  • some ralg changes, mainly for linear constraints (A, Aeq, lb, ub)
  • example of oofun derivatives has been attached to OO Doc page
  • some code cleanup and minor bugfixes

by Dmitrey (noreply@blogger.com) at September 15, 2008 06:15 AM

September 14, 2008

Titus Brown

The Future of Bioinformatics (in Python), part 1 (a)

Chris Lasher wrote a nice blog post naming me as a rabble rouser in the area of "Python in bioinformatics". His post raised a number of interesting points, some of which I'd like to discuss here on my blog.

First, why is Python not more dominant in bioinformatics? I really lay this at the feet of Lincoln Stein, who (from what I can tell) was the dominant force behind BioPerl in the early days. So it worked really well and attracted all sorts of attention and users and actual use. However, I think the tide is shifting away from Perl: from the not-so-imminent release of a complex, backwardsly-incompatible Perl 6, to the massive quantities of completely non-reusable Perl code that have been flung in every direction, people are starting to get sick of Perl. also, a lot of people in academia are moving towards Python for bioinformatics, if not in a very coordinated way: when I left Caltech, two of the three heavy bioinformatics groups were using Python, and when I arrived at MSU I found several groups doing bioinformatics in Python and only one using Perl (and, at that, mainly because they rely on GMOD).

Heck, there are a lot of Python-in-bio sightings these days. I just went to a talk by Rob Knight, who works on the human microbiome project, and he mentioned developing PyCogent with some collaborators. A lab on campus uses TAMO for motif searching. Cistematic and a variety of tools from the Wold Lab use Python. James Taylor is working hard on developing Galaxy into a general purpose tool. So I don't despair for Python's presence in biology.

I think the world is moving, medium-to-long-term, towards the use of Perl for scripting-level work, Python for frameworks and re-usable software, and R for statistical analysis of data sets (BioConductor is also popping up a lot these days). Personally I think this is the right approach and bodes well in the long term.

--

Second, Chris says,

I think I have not worked with Biopython because I am not encouraged to do so, and am actually discouraged, because of research, and the current culture of academia.

I, too, am struggling with the problem that research scientists, somewhat shockingly, are more interested in doing (and funding) novel research than in building re-usable software. OK, I'm being a bit sarcastic, but that's only a mildly sarcastic statement, really; while it's understandable that researchers want to do research, the rise of large-scale data and computational methods in biology unambiguously argues for computational competence in the next generation of researchers. Part of computational competence is knowing how to get stuff done effectively and correctly, not to mention with reusable software when possible. I am actually shocked that there's so little focus on Software Carpentry-like skills in science and education, and I'm doing my best to push on that front here at MSU (see my very first course here, which is introducing Python, Subversion and automated testing to CSE undergrads).

That computing in biology sucks is not by any means a novel observation; see this nice article, Computational Biology Resources Lack Persistence and Usability, for example. My take on things is that the funding bodies simply need to recognize the utility of software maintenance, which is slowly happening, and that the undergrad and graduate departments need to adapt to the future by teaching this stuff. But there's no question it's going to be Darwinian out there -- as Stewart Brand says, "Once a new technology starts rolling, if you're not part of the steamroller, you're part of the road." Hopefully some of us can be the steamroller and not the road, yeh?

So what's my solution, you might ask? Well, now that I'm a bigshot professor, I'm going to be encouraging (well, demanding where possible) that my students and collaborators use good software development techniques and release their source code and data. But my real "secret" -- and please steal it if you can :) -- is that I hope to continue building a real infrastructure that can underlie solutions to my various research problems. If I can build a re-usable core of well-tested tools on top of a solid framework, I should be able to do research faster, better, smarter, and more reliably than my colleagues and competitors. That should translate into more publications, more grants, and more problems actually solved. (I'll let you know how that goes; it's early days still.)

That, incidentally, is why you should ignore people who tell you not work on your coding or on general-purpose libraries: because if it's useful to you, it's worth doing right and making it useful to yourself in the future.

This is also one of the reasons why I'm investing a substantial amount of my scarcest resource (time) in pygr. pygr is a solution for scalable storage, retrieval, and named persistence of sequence-associated data, and it works fantastically well. The real problem with pygr is the high barrier to entry, and that's what we're working on lowering, if only so that my own students will have less trouble learning it.

Some other time I'll talk about why pygr and pygr-like solutions are the right solution to reusability in bioinformatics.

So, in summary: don't worry, be happy, Python is coming to bioinformatics one way or another. And don't worry, just work hard at becoming the steamroller (and not the road) by improving your coding skills and becoming a general-purpose computational scientist, or at least general-purpose bioinformatician. You won't regret it.

Heck, you can always come work for me, right? ;)

--titus

September 14, 2008 09:24 PM

September 10, 2008

Enthought

NumPy arrays with pre-allocated memory

A common need whenever NumPy is used to mediate the Python level access to another library
is to wrap the memory that the library creates using its own allocator into a NumPy array. This allows easy Python-side manipulation of the data already available without requiring an un-necessary copy. Fundamentally this is easy to do using PyArray_SimpleNewFromData. The C-level calling syntax is

int nd;
npy_intp *dims
void *data;

arr = PyArray_SimpleNewFromData(nd, dims, typenum, data);

In this code block, nd is the number of dimensions, dims is a C-array of integers describing the number of elements in each dimension of the array, typenum is the simple data-type of the NumPy array (e.g. NPY_DOUBLE), and data is the pointer to the memory that has been previously allocated.

By default, the memory for the NumPy array will be interpreted as a C-ordered contiguous array. If you need more control over the data-type or the striding of the array, then you can also use PyArray_NewFromDescr.

This is the simple part and code like this has been possible in Numeric for more than a decade. The tricky part, however, is memory management. How does the memory get deallocated? The suggestions have always been something similar to “make sure the memory doesn’t get deallocated before the NumPy
array disappears.” This is nice advice, but not generally helpful as it basically just tells you to create a memory leak.

All that NumPy does internally is to un-set a flag on the array object indicating that the array doesn’t own its memory pointer and so NumPy won’t free the memory when the last reference to the array disappears.

The key to managing memory correctly is to recognize that every NumPy array that doesn’t own its own memory can also point to a “base” object from which it obtained the memory. This base object is usually another NumPy array or an object exposing the buffer protocol — but it can be any object (even one we create on the fly). This object is DECREF’d when the NumPy array is deallocated and if the NumPy array contains the only reference to the object, then it will also be deallocated when the NumPy array is deallocated.

Thus, a good way to manage memory from another allocator is to create an instance of a new Python type. You then store the pointer to the memory (and anything else you may need to call the deallocator correctly) in the instance. Finally, you call the deallocator in the tp_dealloc function of the new Python type you’ve created. Then, you point the base member of your new NumPy array to the new object you’ve created.

The concept is relatively simple, but there are enough moving parts that an example is probably useful. Let’s say I want to create an extension module that only uses NumPy arrays allocated on 16-byte boundaries (maybe I’m experimenting with some SIMD instructions). I want to use arrays whose data is allocated using the aligned allocator defined below (borrowed from a patch to NumPy by David Cournapeau):

#include <errno.h>
#define uintptr_t size_t

#define _NOT_POWER_OF_TWO(n) (((n) & ((n) - 1)))
#define _UI(p) ((uintptr_t) (p))
#define _CP(p) ((char *) p)

#define _PTR_ALIGN(p0, alignment)			
((void *) (((_UI(p0) + (alignment + sizeof(void*)))	
& (~_UI(alignment - 1)))))

/* pointer must sometimes be aligned; assume sizeof(void*) is a power of two */
#define _ORIG_PTR(p) (*(((void **) (_UI(p) & (~_UI(sizeof(void*) - 1)))) - 1))

static void *_aligned_malloc(size_t size, size_t alignment)
{
void *p0, *p;

if (_NOT_POWER_OF_TWO(alignment)) {
errno = EINVAL;
return ((void *) 0);
}
if (size == 0) {
return ((void *) 0);
}
if (alignment < sizeof(void *)) {
alignment = sizeof(void *);
}

/* including the extra sizeof(void*) is overkill on a 32-bit
machine, since malloc is already 8-byte aligned, as long
as we enforce alignment >= 8 ...but oh well */

p0 = malloc(size + (alignment + sizeof(void *)));
if (!p0) {
return ((void *) 0);
}
p = _PTR_ALIGN(p0, alignment);
_ORIG_PTR(p) = p0;
return p;
}

static void _aligned_free(void *memblock)
{
if (memblock) {
free(NPY_ALIGNED_ORIG_PTR(memblock));
}
}

Now, to create arrays using this allocator we just need to allocate the needed memory and use SimpleNewFromData. Then we create a new object encapsulating the deallocation and set this as the base object of the ndarray.

int nd=2
npy_intp dims[2]={10,20};
size_t size;
PyObject newobj, arr=NULL;
void *mymem;

size = PyArray_MultiplyList(dims, nd);
mymem = _aligned_malloc(size, 16);
arr = PyArray_SimpleNewFromData(nd, dims, NPY_DOUBLE, mymem);
if (arr == NULL) goto fail;
newobj = PyObject_New(_MyDeallocObject, &_MyDealloc_Type);
if (newobj == NULL) goto fail;
((_MyDeallocObject *)newobj)->memory = mymem;
PyArray_BASE(arr) = newobj;

fail:
_aligned_free(size);
Py_XDECREF(arr);

Now, all that is missing is the code to create the new Python Type. That code is

typedef struct {
PyObject_HEAD
void *memory;
} _MyDeallocObject;

static void
_mydealloc_dealloc(_MyDeallocObject *self)
{
_aligned_free(self->memory);
self->ob_type->tp_free((PyObject *)self);
}

static PyTypeObject _myDeallocType = {
PyObject_HEAD_INIT(NULL)
0,                                          /*ob_size*/
"mydeallocator",                   /*tp_name*/
sizeof(_MyDeallocObject),    /*tp_basicsize*/
0,                                          /*tp_itemsize*/
_mydealloc_dealloc,             /*tp_dealloc*/
0,                         /*tp_print*/
0,                         /*tp_getattr*/
0,                         /*tp_setattr*/
0,                         /*tp_compare*/
0,                         /*tp_repr*/
0,                         /*tp_as_number*/
0,                         /*tp_as_sequence*/
0,                         /*tp_as_mapping*/
0,                         /*tp_hash */
0,                         /*tp_call*/
0,                         /*tp_str*/
0,                         /*tp_getattro*/
0,                         /*tp_setattro*/
0,                         /*tp_as_buffer*/
Py_TPFLAGS_DEFAULT,        /*tp_flags*/
"Internal deallocator object",           /* tp_doc */
};

Don’t forget to add the following to the extension type initialization module in order to initialize the new Python type that has been created.

_MyDeallocType.tp_new = PyType_GenericNew;
if (PyType_Ready(&_MyDeallocType) < 0)
return;

This simple pattern should allow you to seamlessly integrate NumPy arrays with all kinds of memory allocation strategies. I think this pattern is common enough that we should probably add something to NumPy itself to make it easier to do this sort of thing in a few lines of code. Perhaps a new C-API call is justified with a new internal Python type that allows different allocators and deallocators to be used. Subscribe and post to the numpy-discussion@scipy.org list if you are interested in staying tuned.

by teoliphant at September 10, 2008 04:26 AM

September 09, 2008

David Cournapeau

Is science obsolete ?


That’s basically what is argued in the Wired article“the end of theory” (found through the article “La fin de la théorie?”, on the excellent Econ/French blog econoclastes). The article itself is not that interesting: it tries to be provocative, but fails at giving good arguments for his case. The main argument is that thanks to enormous amount of data, number crunching, and computer farms as available for example at Google, it will be more effective to find patterns with just data, and without models. But the analysis is fundamentally flawed at several levels, both scientific and philosophical. It is true that some sciences, or more exactly some activities traditionally labeled as science are endangered by number crunching; but computers already made some repetitive activities obsolete, and it is hardly big news, except for the people concerned by the changes.

First, for the epistemological arguments against this thesis: the reasons why science is first about making theories, and then making experiments to confront the theory to reality is not just about practicality. There are some fundamental reasons why this is the case, as mentioned in the article by econoclaste (warning, in French): data gathering itself is subject to various biases, and theory can somewhat alleviate this bias. There is also an ambiguity on what Chris Anderson means by scientific models: he argues that google succeeded in getting reliable search results by avoiding using a model. But you could argue on the contrary that Google did better than everyone else because it had a better model for getting interesting pages related to keywords; indeed, the PageRank algorithm, which is the foundation of Google search engine, is a better algorithm than what other search engines used to do. And the PageRank algorithm is based upon some other works and theories, in particular citation analysis, which trace back to at least 1950. Another example given by Anderson is translation: he argues that translation with any language model can work better than with any linguistic knowledge. But arguing that translation can be done better without knowing the language is different than arguing it can be done without any model at all. For people who work on machine translation, it is actually quite well known that you don’t need to know a language to be successful in translating to it.

Reductionism and the curse of dimensionality

But more significantly in my opinion, the number crunching approach is fundamentally reductionist: it assumes you can explain the whole phenomenon from its smallest parts. A typical example of the failure of reductionism in science is fluid mechanics: you could explain the behavior of a fluid from the behavior from each particle in your fluid, but actually, you can’t, because once you have a reasonable number of particles, it becomes intractable to do it at the particle level. Tom Roud made a similar argument (in French) for the flaws in reductionist approach in biology.

There are some theoretical reasons why you will never manage to make a model of everything just from data. Most number crunching data methods are statistical in nature, and rely on estimating some probability distribution. From this point of view, Anderson’s argument can be understood as “with enough data, you can estimate any probability distribution”. But this is not true for several reasons; one is that complex problems often require computation in high-dimension spaces, and high-dimension spaces have some funky properties which are not intuitive and do not map well to our fundamentally three dimension world. One particularly significant property is the localization of volume in smooth solids. For example, in high dimension, most of the volume of a sphere is on a very thin shell, e.g. really near the surface. In dimension 1000, for a sphere of radius one, the volume contained in the sub-sphere of radius 0.5 is only 1/10^300 of the total volume. This means that if you could put uniformly all the atoms of the universe in the sphere, you would not even get one in the subsphere. In statistics, this phenomenon is known as the curse of dimensionality: the number of data necessary for estimation in high dimensions grows exponentially with the number of dimension.

Also, more data does not always mean you will get better information: a common quote in the data-mining community is “there is no better data than more data”, but this is a fallacy. You want data which brings more information, and in some cases, you can only easily get data which are not very informative. For example, when transcribing speech with computers machine translation (Automatic Speech Recognition, e.g. “speaking to your computer instead of typing”), you need to estimate the probability distribution of the words, you are interested in the probability of the words which do not appear often. After analyzing a few thousand examples, you will get a pretty good estimation of the “behavior” of common words like “the” and the likes, but maybe not for words like “hermeneutic”. And for practical applications, those rare words are the one which matter: if you miss “the” in a sentence, you can still understand it, but if you miss “hermeneutic”, this is much less likely.

Is number crunching new ?

So is this number crunching really the beginning of something new ? Actually, similar thesis have been argued before Anderson; the fields of data-mining and artificial intelligence (AI) have since their inception an history of making claims which never really materialize (AI, for example, has known several “AI winter”, for periods of low-funding, generally after periods of high-funding and high claims about what AI could do). Anyone familiar with the data-mining and artificial intelligence communities should be skeptical about big announcements like paradigm shift, or like here claiming to make science obsolete. I would not be surprised if AI/data-mining/associated fields are the ones which use the expression  “paradigm shift” the most often.

It baffles me that people still argue similar points with similar claims as 50 years ago.

Linked articles

For more on this, you can also see on Cosma’s blog. Also, in French:

by cournape at September 09, 2008 09:42 AM

Matthieu Brucher

Dimensionality reduction: videos in regression algorithms

Two months ago, my last post was on regression. I’d like to start this new year with some videos on how my algorithms behave.

The first video shows the manifold being regressed with a color mapping. Each color symbolizes a model in the piecewise linear function.

Here is the evolution of the mapping with the first algorithm. Only a part of the manifold is regressed each time, and there is no way of tuning the result.

The following video was made with the second algorithm. Each time, the whole manifold is regressed and the model enhanced. I didn’t put the whole optimization, at the end nothing could be seen on such a video.
After the first iteration, the original model is split in two other models, and this happens again after the second iteration.

I’ll make other videos on the other parts of my work.

by Matt at September 09, 2008 08:04 AM

September 07, 2008

Gaël Varoquaux

Rendering static pages with Turbogears

Yes, the topic of this post is a bit unlike what I usually talk about. But, hey, a man’s got to do what a man’s got to do.

Turbogears hack

So, suppose you have a dynamic website using turbogears, and you want to publish part of the content of this dynamic site to a static website, for instance to garanty its perenity. Well turbogears makes it really hard for you to do this. On the mailing lists they pretty much advise you to create a webserver and crawl it. Ugly. So here is the code required to render the kid templates that you have been using with turbogears to an html string (consider this as a brain dump, so that the almighty Google picks this up, hopefuly to help somebody not to loose a day like I did):

# First set up the environment you need for your webapp:
import turbogears
turbogears.update_config(configfile="dev.cfg",
                         modulename="sanum.config")

from itertools import izip
import turbogears.view
turbogears.view.load_engines()

import turbogears.util as tg_util
from turbogears.widgets import js_location

engine = turbogears.view.engines.get('kid')

def render_static(data_dict, template):
    """ Render a given template + its data dictionnary to a static html.
    """
    data_dict['tg_css'] = tg_util.setlike()
    data_dict['tg_flash'] = False
    js = dict(izip(js_location, iter(tg_util.setlike, None)))

    for l in iter(js_location):
        data_dict["tg_js_%s" % str(l)] = js[l]

    return engine.render(data_dict, template=template)

You can call this function with data_dict being a dictionary as returned by your controller methods, and template the path to your template, as in the expose decorator.

No update on my life

Yes, I have been really bad on the blogging side lately. A lot  of things have happened, both personal as I discovered wild Texas and traveled and geeked in the States, and on the scientific computing side. I promise I’ll blog a bit more about all this soon. Right now I am struggling to finish things before I leave the states.

My fantastic girlfriend, Emmanuelle, has published a detailed account of our trip in central Asia in June on the static part of this site. I find it sooooo nice to browse through it, as it reminds me of these great times.

by gael at September 07, 2008 06:12 AM

September 04, 2008

David Cournapeau

A python 2.5.2 binary for Mac OS X with dtrace enabled


As promised a few days ago, I took the time to build a .dmg of python from the official sources + my patch for dtrace. The binary is built with the build-script.py script in the Mac/ directory of python, and except the dtrace patch, no other modification has been done, so it should be usable as a drop-in replacement for the official binary on python.org. You can find the binary here

Again, use it at your own risk. If you prefer building it yourself, or with different options, the patch can be found here

by cournape at September 04, 2008 01:14 PM

September 02, 2008

David Cournapeau

How to embed a manifest into a dll with mingw tools only


(DISCLAIMER: I am not a windows guy; all the discussion here is how I understand things from various sources).

With Visual Studio 2005, MS introduced a mechanism called side by side assemblies and C/C++ isolated applications. Assembly is the MS term which encompasses usual dll, as well as .Net modules implemented in CLR, the .Net bytecode (e.g. anything programmed in C#). The idea is to provide a mechanism to deal with the well known dll hell, because there was no proper versioning scheme with dll in windows. You can read more here:

Why should you care as a python developer ? Concretely, starting from VS 2005, if you build a python extension with the mingw compiler, it will link against a runtime which is not available system-wise (such as in C:\Windows\system32 by default), causing a runtime error when loading the extension (msvcr80.dll not found). A simple way to reproduce the result is to have a small dll, and try to link it to a simple executable with the ms runtime:

# This works:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe
# This does not:
gcc -shared hello.c -o hello.dll
gcc main.c hello.dll -o main.exe -lmsvcr90

If you build the 2nd way, explicitely linking the msvcr90, you will get a dll not found error when running the executable, because the dll is not in the system paths (and should not be; the dll is not redistributable). Starting from VS 2005, the only way to refer to VS libraries is to use manifest, which are xml files embedded in the binary. Those manifest are automatically generated by the MS compiler. Assuming you already have the manifest, how can you generate a binary using it without using MS compilers ?

Build the object of the dll:

    gcc -c hello.c

    Have a hello.rc file which refers to the manifest file  (2 seems to refer to dll, vs 1 for exe, but I am not sure):

      #include "winuser.h"
      2 RT_MANIFEST hello.dll.manifest

      Build the .res file, which will embed the xml file into the resource file (.res)

        windres --input hello.rc --output hello.res --output-format=coff

        Link the whole thing together:

          gcc -shared -o hello.dll hello.o hello.res -lmsvcr90

          Now, executing main.exe should be possible. There is still the problem of generating the manifest file. Since in our case, the problem is mainly with the MSVC runtime, to stay compatible with the python.org binary, we may just reuse the same manifest all the time ?

          A few more links on the topic:

          http://www.ddj.com/windows/184406482

          http://msdn.microsoft.com/en-us/library/ms235591(VS.80).aspx

          http://www.codeproject.com/KB/COM/regsvr42.aspx

          by cournape at September 02, 2008 11:10 AM

          Building dtrace-enabled python from sources on Mac OS X


          One highlight of Mac OS X Tiger is dtrace. Providers for ruby and python are also available, but only with the “system” interpreters (the one included out of the box). If you install python from http://www.python.org, you can’t use dtrace anymore. Since the code to make dtrace enable python is available on the open source corner of Apple, I thought it would be easy to apply it to pristine sources available on python.org.

          Unfortunately, for some strange reasons, Apple only provides changes in the form of ed scripts applied through a Makefile; the changes are not feature specific, you just have a bunch of files for each modified file: dtrace, Apple-specific changes are all put together. I managed to extract the dtrace part of this mess, so that I can apply only dtrace related changes. The goal is to have a python as close as possible to the official binary available on python.org. The patch can be found there.

          How to use ?

          • Untar the python 2.5.2 tarball
          • Apply the patch
          • Regenerate the configure scripts by running autoconf
          • configure as usual, with the additional option –enable-dtrace  (the configuration is buggy, and will fail if you don’t enable dtrace, unfortunately)
          • build python (make, make install).
          It time permits, I will post a .dmg. Needless to say, you run this at your own risk.

          by cournape at September 02, 2008 07:14 AM

          September 01, 2008

          OpenOpt (Dmitrey Kroshko)

          changes for ralg, oofun, handling NaNs

          I have committed some changes related to
          • ralg: change for matrix B rejuvenation criterium, now by default it performs check "cond(b) less than 1e5" each 250th iter or provided a stop criterion has triggered on. However, for large nVars evaluation of numpy.linalg.cond(b) takes a time (for nVars = 1000 I got ~5 seconds at my AMD 3800+ X2). So it would be good to get estimation of cond(b) instead of exact value (like MATLAB's condest for sparse matrices; however, b is dense).
          • oofun: recursive 1st derivatives, appropriate example will be committed soon
          • handling NaNs: if right-derivative (f(x+dx)-f(x))/dx is NaN then left-derivative (f(x)-f(x-dx))/dx will be used; it's valid for both oofun and classic style, however, it doesn't resolve all possible issues. Also, I intend to add double-side derivative possibility in future: (f(x+dx)-f(x-dx))/(2*dx) via an oofun field, for example c2 = oofun(..., useDoubleSideDerivative = True,...)
          • some other (minor) ralg changes

          by Dmitrey (noreply@blogger.com) at September 01, 2008 03:22 AM

          August 30, 2008

          Enthought

          SciPy 2008 Swarm

          During the “State of SciPy” presentation that Jarrod Millman and I gave this year at the SciPy 2008 Conference, we showed a “code swarm” video that was put together by David Martin and Chris Galvan.  It’s a first-of-it’s-kind, because it combines repositories from several projects; namely NumPy, SciPy, Mayavi, Chaco, Traits, and Matplotlib.

          Among other things, you can see how the number of people involved and the level of code