August 31, 2010

Matthieu Brucher

Book review: Beautiful Testing: Leading Professionals Reveal How They Improve Software

Testing is one of the basis to create robust and correct code. O’Reilly has published in its “Beautiful” series a lot of books on different parts of the development process. This is the testing part.

Content and opinions

According to this book, testing has three aspects: testers, process and tools.

Perhaps the biggest issue in testing is getting motivated testers. The first part of the book has only three chapters. I guess there is no silver bullet for getting a beautiful team of testers, it’s mainly about people getting along.

The second part is the one that got the most attention in the book. In retrospective, thanks to this book, I’ve discovered many aspects of testing are never addressed , and I thought that some code couldn’t even be tested. I was obviously very wrong. For instance, testing mathematical code can be done, a chapter gives a pretty good direction to follow. Several chapters address fuzzing and thus an aspect of testing that is mainly done after discovering a flaw (whereas fuzzing may give a shot at fixing the bug before it is discovered), or also software testing over a network. In fact the broad scope of projects that participated in this part gives me hope and shame at the same time (because there are many things I didn’t test for, didn’t anticipate, …).

The last part handles tools that are used during testing. It’s not about the usual xUnit, but more about some of the once-upon-a-time small tools that morphed into one of the pillars of testing in its project, or Open Source tools (valgrind, …) or even others.

Conclusion

A very broad scope of projects, of people and of workflow are depicted inside this book. If there are many beautiful processes out there, there are not so many ways of getting beautiful people inside test teams. Also, it seems that many beautiful tools were mainly small project tools that were written for a simple crude purpose and that were progressively refactored into something more powerful.

I especially liked some chapters in the process part, for instance the statistical one. It’s something very difficult to test, and although the chapter doesn’t solve the ultimate issue of statistics, it helps creating something that looks like real random tests.

Also, I won’t see Q&A the same way now.

Beautiful Testing: Leading Professionals Reveal How They Improve Software
Beautiful Testing: Leading Professionals Reveal How They Improve Software
Price: $44.99
This unique book offers essays from 25 leading software testers that illustrate the qualities and techniques necessary to make software testing an art in itself. The latest entry in O’Reilly’s successful series that includes “Beautiful Code” (9780596510046) and “Beautiful Teams” (9780596518028), this book demonstrates through personal stories and many examples the simplicity, maintainability, flexibility, and efficiency required to test every aspect of a software project.

Beautiful Testing: Leading Professionals Reveal How They Improve Software (Theory in Practice) (Paperback)
by Tim Riley, Adam Goucher
ISBN: 0596159811

Price: USD 43.76
39 used & new available from USD 33.34

| 5 | 5


by Matt at August 31, 2010 07:33 AM

August 30, 2010

David Cournapeau

cournape

I got bitten by this several times already, here is what usually works if this error happens for user foo

  • delete from mysql.user where user=foo;
  • delete from mysql.db where user=foo

by cournape at August 30, 2010 08:54 AM

August 27, 2010

Enthought

A Renewed ReStructuredText Editor!

Remember the ReStructuredText editor? Well I just spent my Google Summer of Code adding a bunch of new features to it: A new toolbar with nice icons from the Tango icon library to give quick access to old and new features alike. Buttons to apply inline markup for *emphasis*, **strong emphasis** and “inline literal“ to [...]

by Kevin Salvesen at August 27, 2010 07:25 PM

August 25, 2010

Fwrap blog

freemalloc

From mercurial to git

After speaking with some folks at SciPy-Austin this year (in particular, Fernando Perez), I wanted to try out git & github as an alternative to mercurial & bitbucket.  As with anything with so much overlap, there are pros and there are cons, and it comes down to small differences.  The deciding factor, ultimately, was the hg-git mercurial plugin, which allows someone to use a mercurial client with a git repository, with improved branch functionality.  So all of the nice, smooth, user-friendly mercurial commands you’re used to are available, and you can push/pull bookmarks a-la git branches.  (I’ve not used this extensively, so caveat emptor.)

Some of my reasons for the move:

  • As mentioned above, the hg-git plugin allows both mercurial and git users to collaborate using the same repository, so there is no need for current mercurial users to have to change to git.
  • The numpy/scipy ecosystem is converging on git & github for SCM.  Arguably Cython (which uses mercurial) would be more influential here, but pretty much every fwrap user would be coming from a numerical & Fortran background, which would be more numpy/scipy-centric, hence git & github.
  • The way Git handles branches fits my style much better.  And I really like the index.  Mercurial seems to encourage enabling the ‘queue’ extension for advanced stuff, but it just plain doesn’t fit my brain, and I always screw it up.  Granted, it’s plenty easy to blow off your hand using git, although I’ve become quite proficient at messing up whatever SCM I’m using at the time.
  • It seems that mercurial goes to great lengths to keep their design pure, which I can respect.  Unfortunately by not allowing bookmarks to be pushed/pulled as part of the wire protocol, they force you to use mercurial branches, which are much heavier than I need them to be.  If mercurial would just allow bookmarks (which are the closest analogue to git branches) to be pushed-pulled, I’d have a much harder time justifying the switch to git.

Github repository

Fwrap’s new home:

http://github.com/kwmsmith/fwrap


by Kurt Smith at August 25, 2010 03:17 PM

August 24, 2010

Gaël Varoquaux

SVG Word map of countries

To be able to visualize some quantities attached to countries all over the world, I needed a image with various countries color-coded. The fantastic matplotlib basemap package was not an option as I really needed a static image.

So I generated an SVG image with all the countries. It was generating by tracing a bitmap, so it has a lot of imperfections, but being an SVG with each (major) country as a different object, it can be used to create a colored-code world map. I am uploading it here under a public-domain license. Enjoy!



by gael at August 24, 2010 09:55 AM

August 23, 2010

Fabian Pedregosa

Support for sparse matrices in scikits.learn

I recently added support for sparse matrices (as defined in
scipy.sparse) in some classifiers of scikits.learn.

In those classes, the fit method will perform the algorithm without
converting to a dense representation and will also store parameters in
an efficient format.

Right now, the only classese that implements this is SVC and LinearSVC
in scikits.learn.sparse, although the plan is to add more classes in
the future. These are capable of taking sparse matrices in the fit()
method and will also store support vectors as sparse matrices.

Here is an example. We first create a toy dataset and import relevant
modules:

In [1]: import scipy.sparse

In [2]: from scikits.learn.sparse import svm

In [3]: X, Y = scipy.sparse.csr_matrix([[0, 0], [0, 1]]), [0, 1]

In [4]: clf = svm.SVC(kernel=‘linear’)

now we will fit the model and query some of its parameters:

In [5]: clf.fit(X, Y)
Out[5]:
SVC(kernel=‘linear’, C=1.0, probability=0, shrinking=1, eps=0.001,
  cache_size=100.0,
  coef0=0.0,
  gamma=0.0)

In [6]: clf.support_
Out[6]:
<2×2 sparse matrix of type ‘<type ‘numpy.float64‘>’
        with 1 stored elements in Compressed Sparse Row format>

In [7]: clf.coef_
Out[7]:
<1×2 sparse matrix of type ‘<type ‘numpy.float64‘>’
        with 1 stored elements in Compressed Sparse Row format>

For a more complete example, you can look at Classification
of text documents using sparse features, contributed by Olivier Grisel.

by fabian at August 23, 2010 03:47 PM

August 18, 2010

William Stein

*-Overflow

I spent some time obsessively browsing http://mathoverflow.net during the last few days (and posting too), and it is a really awesome site.  I found out about it last December, when I visited Berkeley to give a talk, and had lunch with one of the guys that runs the site, but didn't pay much attention to it until recently.  It's much more suitable for discussion of research-level mathematics than general use and development of Sage.

Mathoverflow is interesting in that their comment system supports jsmath with realtime preview, which is something the TinyMCE editor in the Sage notebook doesn't do. It's also something that http://stackoverflow.com doesn't do, as far as I can tell. The  grad students at Berkeley that organize mathoverflow probably somehow hacked jsmath into the comment system.

I was very surprised to learn that Mathoverflow and Stackoverflow (and all the dozens and dozens of stackoverflow-based sites) are closed source. There is a company that runs stackoverflow and hosts related sites, and keeping their code closed is part of their business model.

Some people sort of forked something related to stackexchange at some point, and created this: CNPROG, which is fortunately based on Python.  There are about 20 sites using CNPROG now, including one for all questions about the use of Python in science, mathematics, and engineering.
I tried this site, answering a question about Cython, and it is snappy and clean.

Perhaps I will add jsmath or mathjax support to cnprog and start another site (http://q.sagemath.org ?), say on using math software as an aid to mathematical research. The site would not be specific to Sage or restricted to open source. The FAQ would say that all questions must be of the form: "(How) can I use a computer to compute XYZ, where XYZ should be some sort of advanced mathematical question." The emphasis on "advanced" and "research" is that there are already other sites about how to do elementary stuff (=people's homework), and it will provide an excuse to keep that out, which will greatly raise the quality.   It is, after all, exactly this uncompromising focus on research that makes http://mathoverflow.net so interesting.

EDIT: I've now created http://ask.sagemath.org which is based on http://askbot.org, which is in turn based on CNPROG. 

by William Stein (wstein@gmail.com) at August 18, 2010 04:29 PM

Fabian Pedregosa

Flags to debug python C extensions.

I often find myself debugging python C extensions from gdb, but usually some variables are hidden because aggressive optimizations that distutils sets by default. What I did not know, is that you can prevent those optimizations by passing flags -O0 -fno-inline to gcc in keyword extra_compile_args (note: this will only work in GCC). A complete example would look like:

config.add_extension(‘foo’,
                         sources=[‘a.c’],
                         # add this for gdb debug
                         extra_compile_args=[‘-O0 -fno-inline’])

and your extension becomes much easier to debug from gdb.

by fabian at August 18, 2010 11:40 AM

August 17, 2010

Gaël Varoquaux

Git…

No point me launching in a rant against Git, it’s the future anyhow… but it’s the only DVCS that I know with which I routinely loose work, and it is called a feature (I lost work a few times with bzr, but these were bugs)!

by gael at August 17, 2010 10:08 PM

Fwrap blog

freemalloc

FYI, fwrap has a wiki that is slowly filling-in.  There are (or will be) tutorials that guide you through getting everything set-up correctly, FAQs, etc.

https://sourceforge.net/apps/trac/fwrap/wiki/WikiStart

Specifically, I just posted a rough roadmap on the fwrap wiki, outlining what fortran features are currently supported and what will be supported in upcoming versions:

https://sourceforge.net/apps/trac/fwrap/wiki/FortranSupport

I commend it to your attention.


by Kurt Smith at August 17, 2010 06:10 PM

August 10, 2010

Matthieu Brucher

Optimization scikit: separation of orthogonally convoluted signals

My last blog post on optimization helped me generate orthogonal sequences. Now, I will use those sequences to separate two signals. The basic use case is a linear system with two inputs, one output, and instead of recording the response of one input at a time, one plays both inputs simultaneously with specific sequences so that they can be separated in another process.

A separation cost function

In fact the process is really easy. Each input will be convoluted with one sequence generated by last time’s genetic algorithm. Both sequences are not orthogonal, so the resulting separated signal will have a signal to noise ratio (SNR) probably around the orthogonality amount of the sequences.

The cost function will be the squared error between the recorded signal and the sum of the convolutions of the estimated signals with their associated combs plus a fraction of the sum of the absolute values of the signals. This additional terms can be seen as regularisation, but also the whole function can be interpreted as the likelihood of an error following a Gaussian law and the input signals following Laplacian laws. The function is thus written this way:

class Function(object):
  def __init__(self, signal, combs):
    self.signal = signal
    self.combs = combs
    self.mu = 20000
 
  def create_estimation(self, x):
    length = len(x) / len(self.combs)
    return numpy.convolve(x[:length], self.combs[0])[:length] + numpy.convolve(x[length:], self.combs[1])[:length]
 
  def __call__(self, x):
    return numpy.sum((self.signal - self.create_estimation(x))**2) + numpy.sum(numpy.abs(x)) / self.mu
 
  def gradient(self, x):
    error = self.signal - self.create_estimation(x)
    grad = numpy.zeros(len(x))
 
    length = len(x) / len(self.combs)
    grad[:length] = - 2 * numpy.convolve(self.combs[0], error[::-1])[:length][::-1]
    grad[length:] = - 2 * numpy.convolve(self.combs[1], error[::-1])[:length][::-1]
 
    return grad + numpy.sign(x) / self.mu

Besides the cost, this class also returns the correct analytical gradient (it is not easy to derive, but it is in fact a simple cross correlation between the error each estimated input signal).

Application

Now that this is in place, an optimizer can be designed:

  from scikits.optimization import *
 
  fun = Function(signal, combs)
 
  mystep = step.FRPRPConjugateGradientStep()
  mylinesearch = line_search.WolfePowellRule()
  mycriterion = criterion.criterion(ftol = 0.0001, iterations_max = 500)
  myoptimizer = optimizer.StandardOptimizer(function = fun,
                                            step = mystep,
                                            line_search = mylinesearch,
                                            criterion = mycriterion,
                                            x0 = numpy.zeros(2*len(signal)))
 
  xf = myoptimizer.optimize()

To separate both signals correctly, the optimizer will consists of a Polak-Ribière-Polyak conjugate gradient with a Fletcher-Reeves variant (it always finds the best conjugate factor and has an auto-restart behavior), a Wolfe-Powell line search and the stop criterion will stop the optimization after 500 iterations or if the relative cost doesn’t vary more than 0.0001.

Now, I’ve convoluted two signals (drawn for a Laplacian distribution) with two combs (10 impulses each). I’ve added a small amount of white noise. In the end, with combs not entirely orthogonal and white noise, the SNR may not be lower than 10dB, but with the additional hypothesis on the distribution of the input signals, it might just be enough. The optimization looks like this:

As you can see, the crude strating estimate is efficiently corrected but the estimation degrades a the end of the signals. In the end, we have a little more than 10dB, but with other signals, the SNR is lower than 10 dB.

Conclusion

With a more complex comb, better SNR would be achieved, but at the cost of longer output signals. It means that sometimes, it’s better to record each input separately. Of course, if you need very long input sequences, you can use longer orthogonal sequences. You can also combine more than two signals!

As usual, the code may be found on Launchpad.

Buy Me a Coffee!



Other Amount:



Your Email Address :




by Matt at August 10, 2010 07:18 AM

August 08, 2010

Fwrap blog

freemalloc

Some have asked how fwrap relates to f2py.

Broadly, fwrap can be seen as an update of f2py for all of Fortran 90/95, addressing the features that f2py does not support at the present time.  My understanding is that the 3rd version of f2py intends to address many of these features, but that version of f2py is far from complete.  This is not intedended as a critique of f2py, merely a brief comparison of the two projects.

Fwrap focuses on generating Cython wrappers for Fortran code, and the Python-level wrappers come for free, in a sense.  Fwrap uses the Fortran parser–named ‘fparser’–that Pearu Peterson (the author of f2py) created for the latest incarnation of f2py.  In contrast to Fwrap using Cython, f2py generates the C extension module directly–quite an undertaking.  Fwrap focuses on all 3 levels of wrapping: C, Cython and Python; and the C wrappers can be used without any Cython or Python dependency.

Fwrap generates one more wrapping layer than f2py; this allows fwrap to eventually support every desirable feature in Fortran 90/95 (and, in principle, 2003).

F2PY has a very nice interface file functionality, and supports comment directives in the fortran source to specify how it is to handle different arguments.  These features are planned for fwrap, but currently it is easier to make ‘pythonic’ wrappers with f2py, while fwrap just wraps every argument directly, taking into account the argument’s intent.

Fwrap has planned some features that are not in f2py: for instance, a fortan inline module that is similar to scipy’s weave, user derived types, automatic type discovery, and (way in the future) wrapping a Fortran module in a Cython class.

Pearu and I are on good terms and we both collaborate to make fparser better along with the other fparser & f2py developers.  F2PY is an impressive project and provides the skeleton for many Python & Fortran projects, most prominently among them being SciPy.

Fwrap would not exist if f2py had not blazed the trail, and paved over half of it :-)


by Kurt Smith at August 08, 2010 07:42 PM

freemalloc

Fwrap v0.1.0

I am pleased to announce the first release of Fwrap v0.1.0, a utility for
wrapping Fortran code in C, Cython and Python. Fwrap focuses on supporting the
features of Fortran 90/95, but will work with most well-behaved Fortran 77
code, too.

Fwrap is BSD licensed.

Fwrap is beta-level software; all public APIs and commandline options are
subject to change.

Features in the v0.1.0 release:

  • Fortran source parsing and automatic wrapper generation;
  • Top-level (non-module) Fortran subroutines and functions;
  • Supports all intrinsic types (integer, real, complex, logical &
    character);
  • Default and non-default intrinsic types properly wrapped;
  • Scalar and array arguments supported;
  • Supports assumed-shape, assumed-size, and explicit-shape array arguments;
  • Intent ‘in’, ‘inout’, ‘out’ and no intent arguments supported;
  • Automatic docstring generation for extension module and functions;
  • Wrappers are portable w.r.t. both Fortran and C compilers.

Upcoming features:

  • “Use” statement support;
  • Python & Cython callback support;
  • Kind-parameters in type specification;
  • Module-level data and parameters;
  • User-derived types.

See the README and documentation for requirements, compiler support, etc.

Download

You can get fwrap from pypi or from its sourceforge download page:

https://sourceforge.net/projects/fwrap/files/

More Info

Project homepage, including links to wiki & bug tracker:

http://fwrap.sourceforge.net/

Mailing list:

http://groups.google.com/group/fwrap-users

Development blog:

http://fortrancython.wordpress.com/


by Kurt Smith at August 08, 2010 12:59 AM

August 06, 2010

Ondřej Čertík

Week Aug 2 - 6

This week I essentially only worked on my LLNL poster, which I finally finished about two hours ago. I have created a web page for it:

http://certik.github.com/ccms-10-poster/

where you can download pdf, sources, I also put there some relevant info and links.


It turned out to be a lot more work than I expected (well, as usual), but we were very thorough with John and in the process I discovered several bugs in my program, so I am glad we did it. I used to generate all the plots by hand, by manually adjusting all the parameters in the Python script (like atomic number, mesh parameters, element orders, adaptivity parameters, error tolerance and so on). Essentially I had to remember all these parameters for each of the plots (about 10 of them). Then I settled to have a Python dictionary, that holds all the parameters, and then I just pass them to a radial_schroedinger_equation_adapt(params, error_tol=1e-8) function.

Here are example of the parameters:

params_hydrogen_p_L = dict(l=0, Z=1, a=0, b=100, el_num=4, el_order=1,
eig_num=3, mesh_uniform=False, mesh_par1=20, adapt_type="p",
eqn_type="R")
params_hydrogen_p_U = dict(l=0, Z=1, a=0, b=100, el_num=4, el_order=2,
eig_num=3, mesh_uniform=True, adapt_type="p", eqn_type="R")
params_hydrogen_hp_U = dict(l=0, Z=1, a=0, b=100, el_num=4, el_order=2,
eig_num=3, mesh_uniform=True, adapt_type="hp", eqn_type="R")
params_hydrogen_h_U = dict(l=0, Z=1, a=0, b=100, el_num=4, el_order=6,
eig_num=3, mesh_uniform=True, adapt_type="romanowski",
eqn_type="rR")

params_silver_p_L = dict(l=0, Z=47, a=0, b=150, el_num=4, el_order=13,
eig_num=50, mesh_uniform=False, mesh_par1=35, adapt_type="p",
eqn_type="R")
params_silver_hp_L = dict(l=0, Z=47, a=0, b=150, el_num=4, el_order=13,
eig_num=50, mesh_uniform=False, mesh_par1=35, adapt_type="hp",
eqn_type="R")

I mean, this is kind of obvious if you think about it, but for some reason I didn't do that at all at the beginning, because I thought --- I'll just run this once and I am done with it. But I had to run it like 20x, e.g. regenerating he plots, then creating a table about meshes, then redoing the table after changing the error tolerance, and so on.

Besides that I also got permission to release my code, so I'll go over it in the coming days and generate nice patches against Hermes1D.

Also in the process of creating the poster, I played a lot with p-FEM, uniform-p-FEM, hp-FEM and h-FEM and I will keep playing with that. It's clear to me now, that our current Hermes1D is not optimal. Especially the convergence of hp-FEM and p-FEM (as it is implemented right now) greatly depends on the initial mesh.

Nevertheless, even with the above limitations, hp-FEM seems to be really good if you don't a-priori know anything about the problem/mesh. One should not make any deep conclusions in 1D (it might be a bit different in 2D and 3D, and also I only did couple test problems), but from my experience so far, hp-FEM is a really good choice, if you just want to solve the problem and get a decent convergence (way better than h-FEM, and in general about the same as uniform-p-FEM with optimized mesh).

Another conclusion is that uniform-p-FEM (also called spectral element method), if you optimize the mesh for the problem, is very fast. All you have to do is increase the polynomial order and it goes very straight on the convergence graphs, it's very hard to beat. Also, and that I would like to write in the coming days, the algorithm for optimizing the mesh is really simple: just solve it with high "p", then play with the mesh parameters (for logarithmic mesh, there are only 2 parameters --- number of elements, and a ratio of first vs. last element), so that the eigenvalues (that one is interested in) are converged (with given accuracy) and optimize it wrt DOFs. The algorithm can also "look" at the convergence graphs and make sure it's steep enough. For atomic problems, my experience shows that the logarithmic mesh is good enough (as long as you optimize it). The advantage is that you do this once, and then (for close enough potentials in the Schroedinger equation), you just increase "p", and it's very robust and fast (no need for reference mesh, or trial refinements and so on).

When I get back to Reno, we'll do more research on hp-FEM with Pavel and I think this is not the last word to say. We need to review how we choose the candidates for eigenvectors, especially "p" vs "hp" and make it more robust. We'll see.

by Ondřej Čertík (noreply@blogger.com) at August 06, 2010 07:27 PM

August 05, 2010

Fwrap blog

freemalloc

PGI has some very nice CUDA support in the latest version of their fortran compiler.  Here’s the article. [www.pgroup.com]  This may be old news to some of you, but I’m impressed.  Hopefully CUDA support is in the pipeline for the other fortran vendors as well, and something of a standard can emerge from this.

My first impression is that this is an ideal sweet spot, allowing the usage of nice high-level fortran array expressions on the host when convenient, and dropping into CUDA code where speed is a must.  Granted the host-device bandwith is painfully slow, so there is a significant overhead, but ideally it would allow incremental speedups, all within fortran.

I’m glad to see Fortran adapting to the latest technology, as it has done every generation or so; brings to mind the quip:

“What language will they be using 30 years from now?”

“I don’t know, but they’ll call it ‘Fortran’.”

And yet the standard is still backwards compatible with code from almost 35 years ago.  AFAICT, it’s the combination of adaptibility and backwards compatibility that keeps Fortran around.

Now if they would only get better I/O syntax, named goto labels (not ‘assign 10 to n’, more like C’s goto), etc…


by Kurt Smith at August 05, 2010 10:24 PM

August 03, 2010

Matthieu Brucher

Book review: Masterminds of Programming

When twenty or so langage creators are put together to make a book, it can only be interesting. It’s a good revealer of character, as they tend to open their heart. In fact I think that’s exactly what happened in this book.

Content and opinions

I won’t make the list of the chosen langages. They are taken from the successful ones, in terms of quality as well as fame, which means that some are less known than others. Each chapter is very different, and, as I’ve said, very revealing.

Depending on the interviewed, the addressed topics are different. For some authors, you get questions and answers very close to their langages. For others, you have some feedback on how they see the programming field and its future evolution. Some will tell you how much their langage is great and that is a shame it is not more used or that it could replace almost everything. Other will tell you that they fill their langage slipped through their hands.

There is no single line of code, it’s really a thoughts book, and this is the point of the book: you don’t get advice on how you will write code or design your application. You get inner reasons for some langages and their success. You get leads for the future of your program, how langages may evolve and on what you may focus your attention. Unfortunately, the chapter I almost liked the most is the last one (on Eiffel). It’s funny that in this precise chapter, Bertrand Meyer talks about the treasures he found in the last chapter of Structured Programming (Dijkstra et al).

Conclusion

From a cultural point of view, this book is a gold mine. For an ego point of view, this book is a revealer. All things considered, a lot of things can be learnt by reading this book on the philosophical approach of some langages.

Masterminds of Programming: Conversations with the Creators of Major Programming Languages (Theory in Practice (O’Reilly)) (Paperback)
by Federico Biancuzzi, Chromatic
ISBN: 0596515170

Price: USD 26.39
53 used & new available from USD 4.96

| 4 | 11


by Matt at August 03, 2010 07:13 AM

August 01, 2010

Gaël Varoquaux

Software design for maintainability

I have just spent the best part of my Sunday fixing a bug that turned out being a seemingly-trivial two-liner. Such unpleasant experiences are all too frequent, and weight a lot on my view of code design.

My stance on code design

I call code design the process of designing the architecture of a piece of software: what are the objects it uses? how do they interact? how is the information passed around?…

My view of code design and software engineering has progressively evolved to favor extreme simplicity over sophistication. I believe that a good programmer should know design patterns, powerful language features, libraries dark corners, and not use them unless absolutely necessary.

Some rules of thumb

Here are some rules that I apply nowadays when writing code that I would like to last (I am aware that some of them go against well-advertised best practices).

  • Keep it as simple a possible, really! Experimental results have shown that the tractability of a code base goes down as the square of the number of interactions, and thus much quicker than the number of lines in a project. Each time you add a line, think about it: can you make simpler? If not you’ll have to find resources to maintain your project as fixing bugs or adding features will grow harder.
  • Design for the 80% usecases. In the same vein, a small decrease in the requirements can make your project much simpler [Woodfield1979]. Corner cases and minor usecases should not make the whole project complex and hard to maintain. If you can, give up on what is bringing in complexity. If you cannot, isolate it, and don’t let it sit at the core of your design.
  • Don’t design for the future. Again the same core idea: don’t start planing for all the usecases, and all the difficulties that you haven’t encountered, you will most certainly design wrong, and chances are that you’ll add complexity that you do not use. Design simple, design cleanly and refactor as you go, based on concrete problems.
  • Don’t be clever. Each time you do a clever trick, whoever has to read and maintain this code will have to understand it (that person may be you, in a few years). Chances are that they’ll get it wrong and start by loosing a lot of time.
  • Repeating yourself may actually be OK. This is a case of practicality beats purity. Repeating code is really a bad thing in software design, because it leads to an increased number of lines to debug, and tends to hinder reusability. However, adding complexity in order to save a few lines of duplicated code will cost you more in the long run.
  • Use objects sparingly. Object are great, but are they always need? An object with a single method eval can probably simply be implemented by a function. The limitation of objects is that they all have a different behavior. As a result, the users and maintainers of your codebase will first have to understand how all your classes interact before understanding your code. This also means that there is a lot of benefit in making many different classes that have the same interface.
  • Avoid abstractions and levels of indirection. The more levels of code piled on top one of the other, the more layers your maintainer is going to have to inspect to find were the bug might be. An abstraction hides another object or algorithm. To debug code, chances are that all the black boxes will first have to be opened.

Coding for others to debug

“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” - Brian W. Kernighan

You may think that I am overemphasizing simplicity at the cost of functionality. Well, think about the future of your code. The net is full of unmaintained and abandoned code. If you want your project to grow and have a future, you will probably need people to help you. For a given purpose, the easiest the code is to read and debug, the more chances you will have to pick momentum.


Some external references I like (about software engineering, rather than debugging):

by gael at August 01, 2010 10:47 PM

July 31, 2010

Fwrap blog

freemalloc

In the past 2 weeks, we’ve been able to put in some major front-facing features — features that make fwrapped python modules friendly (or, at least, friendlier) and discoverable.

The most significant features for users are module- and function-level docstrings, but there are many more features you’ll like.

For this trivial fortran function:

function one_arg(a)
    implicit none
    ! Automatic type discovery: the C type correspondances
    ! will be automatically determined at compile-time.
    integer :: one_arg
    ! Assumed-shape arrays handled seamlessly
    real(kind=8), dimension(:, : ), intent(inout) :: a

    a = 1.61803399_8
    ! bonus if you know 1.618... without googling
    one_arg = 42

end function

Fwrapping it:

$ fwrapc source.f90 --build \
--name=one_arg --fcompiler=gnu95 \
--f90exec=/usr/bin/gfortran-4.3 \
-L/usr/lib/gcc/x86_64-linux-gnu/4.3.2 -lgfortran

(Admittedly the commandline can get kinda long, but the -L, -l and --f90exec options can be set to environment options $LDFLAGS and $F90, resp. The fwrapc command needs to know where to find the fortran compiler and how to link in the fortran runtime libraries.)

This will generate a million files in ./one_arg, including an extension module, ./one_arg/one_arg.so. All those other files are useful if and when you want to use the wrappers from C or Cython. For now we’ll look at using the code from Python. Importing that extension module, we see some interesting stuff:

$ cd one_arg
$ ipython
Python 2.5.2 (r252:60911, Jan 24 2010, 17:44:40)
Type "copyright", "credits" or "license" for more information.

IPython 0.10 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: import one_arg

In [2]: one_arg?
Type:		module
Base Class:	<type 'module'>
String Form:	<module 'one_arg' from 'one_arg.so'>
Namespace:	Interactive
File:		/home/ksmith/Devel/fwrap/fwrap-dev/testdir/one_arg/one_arg.so
Docstring:
    The one_arg module was generated with Fwrap v0.1.0dev_a0bf087f24a0+.

    Below is a listing of functions and data types.
    For usage information see the function docstrings.

    Functions
    ---------
    one_arg(...)

    Data Types
    ----------
    fw_character
    fwi_integer
    fwi_npy_intp
    fwr_real_8

In [3]:

You’ll notice a list of functions and data types in the one_arg module. The datatypes fwr_real_8 and fwi_integer correspond to the function argument and return value types, resp. (automatic type-discovery, like I mentioned).

We can take a look at the one_arg.one_arg() docstring:

In [3]: one_arg.one_arg?
Type:		builtin_function_or_method
Base Class:	<type 'builtin_function_or_method'>
String Form:	<built-in function one_arg>
Namespace:	Interactive
Docstring:
    one_arg(a) -> (fw_ret_arg, a,)

    Parameters
    ----------
    a : fwr_real_8, 2D array, dimension(:, : ), intent inout

    Returns
    -------
    fw_ret_arg : fwi_integer, intent out
    a : fwr_real_8, 2D array, dimension(:, : ), intent inout

In [4]:

It tells us the function signature, the argument types and the return tuples.

Let’s kick the tires:

In [4]: one_arg.one_arg([[1,2,3], [4,5,6]])
Out[4]:
(42,
 array([[ 1.61803399,  1.61803399,  1.61803399],
       [ 1.61803399,  1.61803399,  1.61803399]]))

And we find that the fortran function return value is the first element of the return tuple, and the array is the second. In this case, we passed in a python list, and fwrap made an internal copy seeing that it wasn’t compatible with the type and ordering of the fortran dummy argument. If we pass in an array of the proper ordering and type, no copy will be made, and it will be modified in-place (there will be options to have finer control over this behavior, including warnings or exceptions raised if you don’t want any copies made, ever):

In [5]: import numpy as np

In [6]: a = np.empty((5,5), dtype=one_arg.fwr_real_8, order='F')

In [7]: one_arg.one_arg(a)
Out[7]:
(42,
 array([[ 1.61803399,  1.61803399,  1.61803399,  1.61803399,  1.61803399],
       [ 1.61803399,  1.61803399,  1.61803399,  1.61803399,  1.61803399],
       [ 1.61803399,  1.61803399,  1.61803399,  1.61803399,  1.61803399],
       [ 1.61803399,  1.61803399,  1.61803399,  1.61803399,  1.61803399],
       [ 1.61803399,  1.61803399,  1.61803399,  1.61803399,  1.61803399]]))

In [8]: (ret, a_ret) = one_arg.one_arg(a)

In [9]: a_ret is a
Out[9]: True

So a_ret and a are the same python object, no copying took place!

I hope this was instructive. A release is coming soon!


by Kurt Smith at July 31, 2010 07:23 PM

David Cournapeau

cournape

I have spent the last few days on a relatively big feature for bento: recursive package description. The idea is to be able to simply describe packages in a deeply nested hierarchy without having to write long paths, and to split complicated packages descriptions into several files.

At the bento.info level, the addition is easy:

Subento: numpy/core, numpy/lib ...

It took me more time to figure out a way to do it in the hook file. I ended up with a recurse decorator:

@recurse(["numpy/core/bscript", "numpy/lib/bscript"])
@post_configure
def some_func(ctx):
    ....

I am not sure it is the right solution yet, but it works for now. My first idea was to simply use a recurse function attached to hook contexts (the ctx argument), but I did not find a good way to guarantee an execution order (declaration order == execution order), and it was a bit unintuitive to integrate both hook decorator and the recurse together.

The reason why I tackle this now is that bento is at a stage where it need to be used on “real” builds to get a feeling of what works and what does not. The target is numpy and hopefully later scipy. Although I still hope to integrate waf or scons in bento as the canonical way of building numpy/scipy with bento, this also gives a good test for yaku (my simple build system).

It took me less than half a day to port the scons scripts to bento/yaku. A full build, unnoptimized build of numpy with clang is less than 10 seconds. A no-op build is ~ 150 ms, but as yaku does not have all the infrastructure for header dependency tracking yet, the number for no-op build is rather meaningless.


by cournape at July 31, 2010 06:33 PM

July 29, 2010

Fabian Pedregosa

July in Paris

One of the best things of spending summer in Paris: its parcs (here, with friends @ Parc Montsouris).

Con Julia en Montsouris

by fabian at July 29, 2010 10:11 PM

July 27, 2010

Matthieu Brucher

Genetic algorithms in Python

Although I’m fond of numerical optimization through gradients, … there are some times where a global optimization is much more powerfull. For instance, I have to generate two sequences/combs that are orthogonal and for which their autocorrelation is almost an impulse. The two combs have a fixed number of impulse, so it’s a perfect job for genetic algorithms.

Genetic algorithms are a global optimization technique. The parameters are encoded in a genome, and then different populations are grown. The fittest individuals survive and give new individuals. The usual implementation in Python is PyEvolve, a pure Python package that isn’t depended on anything except if you want to save and plot populations.

Creating a genome and a cost function

So my goal is to minimize crosscorrelation and autocorrelation, safe for the middle value of autocorrelation. I won’t use a fixed-size sequence, I will encode in the genome the distance between two impulses. This will allow for a really small genome with a lot of possibilities.

For this purpose, I create a class named OrthogonalSequences:

?Download GA.py
class OrthogonalSequences(object):
  def __init__(self, size, number):
    self.size = size
    self.number = number
    self.genome_length = size * number
 
  @staticmethod
  def convert_slice_to_time(chromosome):
    """"
    Converts a chromosome in a time sequence
    """"
    total = numpy.cumsum(chromosome)
    array = numpy.zeros(total[-1]+1)
    array[total] = 1
    array[0] = 1
    return array
 
  def evaluate(self, chromosomes):
    """"
    Evaluate the cost (square of error) of the genome by evaluating cross- and autocorrelation.
    """"
    value = 0
    for i in range(self.number):
      slice_i = self.convert_slice_to_time(chromosomes[i * self.size:(i+1) * self.size])
      value += .1 * len(slice_i)
      for j in range(i, self.number):
        slice_j = self.convert_slice_to_time(chromosomes[j * self.size:(j+1) * self.size])
        correlation = numpy.correlate(slice_i, slice_j, mode = "full")
        if (i == j):
          correlation[len(slice_i)-1] -= self.size
 
        value += numpy.sum(correlation ** 2)
 
    return value

In fact this class will only be used to evaluate a genome. As I have two combs (and more if you’d like), I need to pass this information to the evaluation function, and thus nothing is more easy than Python for this. I’ve also added a small factor that adds a cost for lengthy combs.

Setting up the optimizer

Now, the only thing remaining is to set all objects and launch the optimization:

?Download GA.py
if __name__ == "__main__":
  from pyevolve import G1DList, GSimpleGA, Consts
  import sys
 
  size = int(sys.argv[1]) if len(sys.argv) > 1 else 10
  size -= 1
  number = int(sys.argv[2]) if len(sys.argv) > 2 else 2
  rangemax = int(sys.argv[3]) if len(sys.argv) > 3 else 100
  generations = int(sys.argv[4]) if len(sys.argv) > 4 else 50
  population = int(sys.argv[5]) if len(sys.argv) > 5 else 100
 
  genome = G1DList.G1DList(number*size)
  sequences = OrthogonalSequences(size, number)
  ga = GSimpleGA.GSimpleGA(genome)
 
  genome.setParams(rangemin=1, rangemax=rangemax)
  genome.evaluator.set(sequences.evaluate)
 
  ga.setMinimax(Consts.minimaxType["minimize"])
 
  ga.setPopulationSize(population)
  ga.setElitism(True)
  ga.setGenerations(generations)
  ga.evolve(freq_stats = generations/10)
  print ga.bestIndividual()

Results

Now the result is the following for 2 sequences of 10 impulses:

Gen. 1 (2.00%): Max/Min/Avg Fitness(Raw) [436.73(387.20)/317.81(349.20)/363.94(363.94)]
Gen. 5 (10.00%): Max/Min/Avg Fitness(Raw) [417.89(375.20)/340.39(345.20)/348.24(348.24)]
Gen. 10 (20.00%): Max/Min/Avg Fitness(Raw) [419.21(379.20)/339.65(345.20)/349.34(349.34)]
Gen. 15 (30.00%): Max/Min/Avg Fitness(Raw) [417.72(377.20)/341.16(345.20)/348.10(348.10)]
Gen. 20 (40.00%): Max/Min/Avg Fitness(Raw) [419.40(375.20)/337.80(345.20)/349.50(349.50)]
Gen. 25 (50.00%): Max/Min/Avg Fitness(Raw) [419.86(389.20)/341.55(345.20)/349.88(349.88)]
Gen. 30 (60.00%): Max/Min/Avg Fitness(Raw) [418.25(381.20)/341.41(345.20)/348.54(348.54)]
Gen. 35 (70.00%): Max/Min/Avg Fitness(Raw) [418.08(375.20)/340.08(345.20)/348.40(348.40)]
Gen. 40 (80.00%): Max/Min/Avg Fitness(Raw) [418.42(375.20)/339.53(345.20)/348.68(348.68)]
Gen. 45 (90.00%): Max/Min/Avg Fitness(Raw) [418.80(381.20)/340.76(345.20)/349.00(349.00)]
Gen. 50 (100.00%): Max/Min/Avg Fitness(Raw) [417.94(371.20)/338.92(345.20)/348.28(348.28)]
Total time elapsed: 2.478 seconds.
- GenomeBase
        Score:                   345.200000
        Fitness:                 338.919595

        Slot [Evaluator] (Count: 1)
                Name: evaluate
        Slot [Initializator] (Count: 1)
                Name: G1DListInitializatorInteger
                Doc:  Integer initialization function of G1DList

   This initializator accepts the *rangemin* and *rangemax* genome parameters.

        Slot [Mutator] (Count: 1)
                Name: G1DListMutatorSwap
                Doc:  The mutator of G1DList, Swap Mutator
        Slot [Crossover] (Count: 1)
                Name: G1DListCrossoverSinglePoint
                Doc:  The crossover of G1DList, Single Point

   .. warning:: You can't use this crossover method for lists with just one element.

- G1DList
        List size:       18
        List:            [36, 48, 1, 11, 95, 37, 34, 38, 25, 10, 29, 64, 19, 11, 62, 68, 7, 15]

Here are the visual results for this optimization:

The first comb

The second comb

Correlations between the combs

They are not perfect, far from it, but remember that they are only combs with positive impulses. This means that you can’t get a perfect result. An almost-perfect result would be that the crosscorrelation is below 1, but with only one value of 2, it’s still pretty good.

Before the end

The process is very fast for 2 sequences of 10 impulses. If you add more sequences and more impulses, the computation time will rise fastly, but PyEvolve has now the ability to use the multiprocessing module, which means it can do parallel optimizations now.

PyEvolve can do much more, I only barely scratch the surface of its capablities, but I hope this example will make you try this package.

Buy Me a Coffee!



Other Amount:



Your Email Address :




by Matt at July 27, 2010 07:04 AM

July 24, 2010

EuroSciPy 2010

Euroscipy 2010 is over: please fill in the feedback form!

If you attended Euroscipy 2010 (tutorials and/or conference), we're very interested in getting your feedback on the event. We have set up an anonymous feedback survey on http://spreadsheets.google.com/viewform?formkey=dGpqVGRBN1VHdEMxZVJmTW9paG5ycWc6MQ. It will take you only a couple of minutes to answer a dozen of questions on the tutorials, the talks, the practical organization, your wishlist for Euroscipy 2011, etc.

Up to now, 35 people have filled in the form (many thanks to them!). If you have not yet filled in the feedback survey (there were 170+ participants to Euroscipy 2010!), make your voice heard now!

by Emmanuelle Gouillart at July 24, 2010 01:28 PM

July 23, 2010

Gaël Varoquaux

Sprint Scikit learn in Paris

We are organizing a coding sprint in Paris on the scikit learn, machine learning in Python. The goal of this sprint is to set the API and the general coding guidelines of the scikit to be able to tackle many different statistical learning problems in a consistent framework.

This is why we would like to have people with different problems, applications, and backgrounds to pitch in.

It will be a two-days sprint. Everyone is welcome, so just fill in the doodle, so that we can choose the date?

And do not hesitate to suggest some topics that you would like to be addressed during the sprint, and to discuss them on the mailing-list!

Vincent Michel is organizing the sprint. If you have questions about the sprint, you are welcomed to contact me, but please do put him in Cc to.

by gael at July 23, 2010 01:31 PM

July 20, 2010

Matthieu Brucher

Book review: Hacking Roomba: ExtremeTech

I’ve recently bought a fourth generation Roomba, which is a vacuum cleaning robot. I bought this brand because it is well-known and has a good history of hackable robots. So the next step was to figure out how to hack it, and hence this book.

Content and opinions

The book is split in three parts with progress hacking solutions. It is to be mentioned that it is dedicated to hacking the third generation, but the interface should be the same, although some instructions may have changed. Perhaps a new edition of the book may be needed.

iRobot had the good idea of opening their pets to hackers. It is done through a serial interface called ROI. The first encompasses the creation (or the buying) of a connector between a PC and the Roomba. There are several solutions, from a simple serial cable to a Bluetooth interface. Once you have a connector, you need to learn and use the interface to make the robot move and feel. Each captor and motor is explained as well as how it can be interacted with. The main examples are given through a Java library (I would have prefered a Python library, of course).

The second part is perhaps the least interesting. It starts with making an application to steer the robot. It is done with Processing, an obscur language that runs on the JVM. Why not one of the top five script language on it? You may then use the inner beeper to make some music. Indeed, it can be fun, but it’s also not that easy, even with the help of the book. More fun is making art drawings with addition of pens sticked on the Roomba. I was a big fan of these kind of drawngs when I was young. The last chapter exposes the Roomba as a big input device such as a mouse. Of course, it can help doing some sports…

The last part is about using additional platforms to make a more complicated robot. First, you may connect it to the Internet to steer it from the other end of the world, the next step being using WiFi. Then, you may upgrade its brain with a classic Linux board, adding a camera, … OK, this is fun, but it’s not a cleaner anymore. It begins to be even very ugly and not that autonomous unless you stick a big battery pack. But I have to be agree on some aspects of the mod: it’s a good basis for building one’s custom robot. But you may find a skeletton for less (or not).

Conclusion

I intend to keep my Roomba as a vacuum cleaner, not a complete robot with a wifi router on top of it. Still, it is interesting to know that you can do a lot with it, and first of all with a simple serial cable or a Bluetooth connection.

Hacking Roomba
Hacking Roomba
Price: $20.22
By Kurt – John Wiley & Sons, Inc. (00) – Paperback – ISBN 10 0470072717 – ISBN 13 9780470072714

Hacking Roomba: ExtremeTech (Paperback)
by Tod E. Kurt
ISBN:

Price:
22 used & new available from USD 30.00

| 4.5 | 7


by Matt at July 20, 2010 06:44 AM

July 17, 2010

Fwrap blog

freemalloc

Fwrap sucessfully wraps upwards of 97% of the more than 1500 routines in netlib’s LAPACK.  Oh yeah.  Granted this is pretty much all F77 code, but it works.

The remaining 3% are related to a trivial feature, namely dimension(0:N) style array declarations that will be supported very soon.

Much work has gone into the front end recently.  The commandline options are taking their final shape and I’ll be getting some improved documentation in the README tomorrow.  Feel free to take it out for a spin.

To try out fwrap from the mercurial repo, see http://fwrap.sourceforge.net/ for instructions and dependencies (numpy, cython & a “modern-enough” Fortran 90 compiler).  Let me know any problems you may have.

Enjoy!


by Kurt Smith at July 17, 2010 05:28 AM

July 16, 2010

Gaël Varoquaux

Simple object signatures

A signature pattern

There are many libraries around to specify what I call a ’signature’ for an object, in other words a list of attributes that define its parameter set. I have heavily used Enthought’s Traits library for this purpose, but the concept is fairly general and can be found eg in ORMs (Object Relational Mappers) or web frameworks.

Specification of this interface of parameters may be used to answer a variety of needs:

  • Typing: in the case of an ORM, to generate UIs, or for better error management, it may be desirable to have some control on the types of certain attributes of an object. In this case, specifying the signature corresponds to laying out a data model for the object.
  • Reactive programming: using properties to react to changes to attributes, one can fully specify the API of an object in terms of these attributes. This gives a message-passing like programming style that can be very well suited to parallel-computing in particular because it can easily be made thread-safe.

Signatures for statistical learning objects

Recently, I considered the signature pattern in a new context. In the scikit-learn, we are interested in statistical learning. This entails fitting models to data and often tuning parameters to select a model that fits best (a problem called model selection). Each of our models is an object that implements a couple of key methods to fit to the data and to apply to new data (fit and predict).

The approach that we are currently taking for model selection is (more or less) to generate a list of models with different parameters and fit and test them on the data.

A very nice feature would be to find out the parameters to vary simply by inspecting the objects, and such a desire recently got us discussing of defining signatures for our objects. I must confess that I am a bit weary as this means either depending on a signature library, or building one. We don’t want to grow our dependencies, and most signature-definition code that I know involve meta-programming tricks to avoid code duplication.

Solving the simple problem: avoiding type checking

Today, I had to bite the bullet, because we were in a situation in which we had to instantiate new models from the existing one during model selection. For technical reasons, using a copy.copy to create these new models was not a great idea, and it was better to have the minimal list of parameters required. Here come signatures again.

After a bit of messing around with the code, I realized that typing information was useless, and most probably harmful, to our immediate goals and that I just needed the names of the relevant attributes. I finally settled down to the following solution (which might still change):

  • All parameters need to be specified as keyword arguments of the __init__. The __init__ may not have positional arguments or ‘*’ arguments. Attributes on the objects have the same names as the __init__ parameters.
  • A simple base class, with couple of methods relying on a simple use of the inspect module to find the signature of the __init__.

class BaseEstimator(object):
    @classmethod
    def _get_param_names(cls):
        args, varargs, kw, default = inspect.getargspec(cls.__init__)
        assert varargs is None, (
            'scikit learn estimators should always specify their '
            'parameters in the signature of their init (no varargs).'
            )
        # Remove 'self'
        args.pop(0)
        return args

    def _get_params(self):
        out = dict()
        for key in self._get_param_names():
            out[key] = getattr(self, key)
        return out

    def _set_params(self, **params):
        valid_params = self._get_param_names()
        for key, value in params.iteritems():
            assert key in valid_params, (’Invalid parameter %s ‘
                ‘for estimator %s’ %
                (key, self.__class__.__name__))
            setattr(self, key, value)

The full code can be seen here and adds a bit more features, such as a clever __repr__.

What I like about this solution is that it (almost) does not use metaprograming, and avoids code duplication without forcing any specific pattern on the developer subclassing BaseEstimator.

The next step

This approach solves my immediate problem, but not the bigger one of finding what values can the different parameters take when varied for model selection. Of course this second problem is much more complicated, and maybe it is not worth solving it: the framework could very easily be bringing in more problems than it solves.

However, it seems that a fairly easy way of specifying possible values for parameters would be to decorate the __init__, giving the possible parameters to be tested during the model selection:

    @cv_params(l1=np.logspace(1e-4, 1, 10))
    def __init__(self, l1=.5, fit_intercept=True)
	# ...

All the decorator has to do is to store the information in an attribute attached to the __init__ (and probably to check that the parameters it was given are valid arguments, in order to raise errors early). Methods on the class can later inspect this information for model selection, or GUI building (data-model specification will probably require some typing language, rather than a simple list of possible parameters).

Once again, here we would be avoiding the difficulty of specifying type information in a non restrictive way, but avoiding a problem that we don’t have to solve is probably a good idea.

by gael at July 16, 2010 10:31 PM

July 13, 2010

EuroSciPy 2010

EuroSciPy 2010 is over

http://www.euroscipy.org/image/3737?vid=download

EuroSciPy 2010 was a fun conference to organize and we think we can call over 150 participants a success.

Thank you to every delegate and speaker that joined us to face the heat.

Thank you to Ecole Normale Supérieure for hosting us.

Read about the reports and send us yours if you want it added to the list.

EuroSciPy 2011 will be in Paris again. We hope to see you there!

by Admin Istrator at July 13, 2010 07:56 PM

Gaël Varoquaux

Euroscipy 2010: code, science, and a lot of fun

Euroscipy 2010, the third European conference for the use of Python in science, is just over, and I think it was a great success.

Euroscipy in numbers

The attendance this year was huge: there was a grand total of 160 who came to EuroScipy, with 140 that came only to the tutorials, and 130 only the conference. This up by almost a factor of 3 compared to last year’s
EuroScipy, more than last year’s SciPy conference in Passadena, and almost as much as this year’s SciPy conference in Austin that hosted 180 person. We had people coming from 16 country, and as far as New Zealand, the US, or Turkey. Research lab, education, and industry (small to large companies) were all well represented, with approximately a third of the delegates coming from the industry. Similarly, many different scientific field were discussed, ranging from landscape ecology to pure math.

There were 2 tutorial tracks with 10 tutorial slots in each track. We had 2 keynotes from Hans Petter Langtangen and Konrad Hinsen. With regards to the contributed talks, the conference this year was highly selective. We received 52 propositions. We unfortunately could accept only 30 of them, which corresponds to an acceptance rate of 58%. Finally, we had 18 lightning talks.

A warm and friendly atmosphere

As an organizer, I was really pleased to find out how much people were relaxed and friendly. This certainly facilitates discussions during the breaks. And the ambiance was undoubtedly warm: 140 people with laptops in a room without air conditioning in the Paris summer :).

Of course during the evenings, many people met to continue the passionate discussions in restaurants and bars.

Trends I noticed

What one remembers from a conference is obviously biased by personal interests. With that disclaimer, here are the recurrent and important topics that I noticed, both in the talks, but also in the coffee break discussions:

Finally installation problems of scientific tools were the subject of many discussions, as each year. One thing that I did notice, is that people stopped simply blaming each others and acknowledged that nobody knew how to fix the problem. Somebody even pointed out that installing any major scientific code was not a piece of cake. Hans Petter and others said that they had solved the problem by relying on a virtual machine and Ubuntu.

Konrad has also blogged, giving his own view of the conference.

Thanks

The conference could happen only because of the help of many people. First we need to thank our sponsors: Enthought, Python Academy, Pytables, and especially our host Ecole Normale Supérieure, which not only provided us with the rooms, but also made sure that everything was going well with the sound system, the projection, or the access to the building. With regards to organization and planing, Nicolas and I received a lot of help from Emmanuelle Gouillart.

by gael at July 13, 2010 04:31 PM

Matthieu Brucher

Dimensionality reduction: Refactoring the manifold module

It’s been a while since I last blogged about manifold learning. I don’t think I’ll add much in terms of algorithms to the scikit, but now that a clear API is being defined (http://sourceforge.net/apps/trac/scikit-learn/wiki/ApiDiscussion), it’s time for the manifold module to comply to it. Also, documentation will be enhanced and some dependencies will be removed.

I’ve started a branch available on github.com, and I will some examples in the scikit as well. I may explain them here, but I won’t rewrite what is already published. A future post will explain the changes, and I hope that interested people will understand the modifications and apply them to my former posts. It’s just that I don’t have much time to change everything…


by Matt at July 13, 2010 07:05 AM

July 12, 2010

Fwrap blog

freemalloc

It’s been a while.  You’re wondering what’s been happening with fwrap.

Thanks to a sprint at SciPy 2010, fwrap is quickly gaining parity with f2py.  Fwrap doesn’t yet have interface files working (and it won’t for the first release); neither does it support ‘common’ blocks.  I’ve been focusing on F77-style function & subroutine functionality.

The main Fortran features currently supported:

  • All intrinsic scalar datatypes, including with literal integer kind parameters
  • Arrays of intrinsic types & literal kind params.
  • All dimension declarations (assumed shape, assumed size, explicit shape)
  • Runtime array dimension checking
  • Automatic type discovery

Fwrap has a pretty good testsuite in place, both unittests & integration tests.  They’re run frequently to ensure no regressions are introduced.

Kyle Mandli helped improve fwrap’s commandline interface during the fwrap sprint at SciPy 2010.  Those changesets have yet to be merged with the main repo, but when they are it will be a *big* improvement, and move fwrap much closer to the 0.1 release.  He also added logging functionality to fwrap and fparser — another big improvement.

I’ve been throwing some fairly big F77 (clawpack, lapack) and F90 (a UW-Madison inertial confinement simulation; thanks, Matt!) codebases at fparser to thresh out glaring bugs, and thus far I’ve been pleased.  No major problems, although I’d very much like to figure out how to test the AST generated by fparser for correctness.  Not sure how to do that yet.

Anyway, just a quick update.  Stay tuned!


by Kurt Smith at July 12, 2010 05:22 PM

July 10, 2010

NeuralEnsemble

Sumatra 0.2 released

We would like to announce the release of version 0.2 of Sumatra, a tool for automated tracking of simulations and computational analyses so as to be able to easily replicate them at a later date.


The main changes are:
  • expanded the focus from just simulations to any command-line driven computation, e.g. analyses, graphing. This simply involved changes to the documentation and some renaming, e.g. SimProject is now just Project.
  • RecordStores can now contain records from multiple projects and multiple users. This makes it possible to keep all your records in a single database, and for different people to collaborate on the same project.
  • added support for the Git version control system. Sumatra requires your code to be stored in a version control system to ensure reproducibility, and now supports Git, Subversion and Mercurial.
  • removed the concept of record groups, since grouping can easily be achieved using tags.
  • Sumatra can now pass the record label to your main script, by appending it either to the command line or to the parameter file. This is very useful for separating the output files of different experiments into their own directories such that Sumatra can correctly link to them. 
  • you can now tag a simulation/analysis at the same time you run it, using smt run, rather than having to remember to do this afterwards with smt tag.
  • added a @capture decorator, to make it easier to use Sumatra in your own Python scripts.
  • the web interface will now display the contents of any CSV files generated during your experiment as an HTML table.
  • added ConfigParserParameterSet. If you pass parameters to your simulation/analysis in a separate file, then Sumatra can store these parameters for future searching, provided it understands  the parameter file format. The new class adds support for parameters stored in ConfigParser-style files (the existing supported formats are simple one-per-line key=value files and hierarchical, JSON-like NeuroTools parameter sets).
Sumatra 0.2 may be downloaded from the INCF Software Center or from PyPI.

by Andrew Davison (noreply@blogger.com) at July 10, 2010 09:52 AM

July 07, 2010

Enthought

Fast Numpy/Protobuf Deserialization Example

In my last post I wrote about how I was able to improve protobuf deserialization using a C++ extension. I’ve boiled down the code to its essence to show how I did it. Rather than zip everything up in a file, the code is short enough to show in its entirety.

Here’s the simplified protobuf message which is used to represent a time series as 2 arrays:

package timeseries;

message TimeSeries {
    repeated double times = 2;
    repeated double values = 3;
}

I then wrote a test app in Python and C++ to provide a benchmark. Here is the Python version:

import numpy
import time_series_pb2

def write_test():
    ts = time_series_pb2.TimeSeries()
    for i in range(10000000):
        ts.times.append(i)
        ts.values.append(i*10.0)

    import time
    start = time.time();

    f = open("ts.bin", "wb")
    f.write(ts.SerializeToString())
    f.close()

    print time.time() - start

def read_test():
    ts = time_series_pb2.TimeSeries()

    import time
    start = time.time();

    f = open("ts.bin", "rb")
    ts.ParseFromString(f.read())
    f.close()

    t = numpy.array(ts.times._values)
    v = numpy.array(ts.values._values)

    print 'Read time:', time.time() - start
    print "Read time series of length %d" % len(ts.times)

if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print "usage:   %s <--read> | <--write>" % sys.argv[0]

    if sys.argv[1] == "--read":
        read_test()
    else:
        write_test()

I will spare you the C++ standalone code, since it was only a stepping stone. Instead here is the C++ extension, with 2 exposed methods, one which deserializes a string and the other which operates on a file.

#include <fcntl.h>

#include <Python.h>
#include <numpy/arrayobject.h>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

#include "time_series.pb.h"

static PyObject* construct_numpy_arrays(timeseries::TimeSeries* ts)
{
    // returns a tuple (t,v) where t and v are double arrays of the same length

    PyObject* data_tuple = PyTuple_New(2);

    long array_size = ts->times_size();
    double* times = new double[array_size];
    double* values = new double[array_size];

    // the data must be copied because the tsid will go away and its mutable data
    // will too
    memcpy(times, ts->times().data(), ts->times_size()*sizeof(double));
    memcpy(values, ts->values().data(), ts->values_size()*sizeof(double)); 

    // put the arrays into numpy array objects
    npy_intp dims[1] = {array_size};
    PyObject* time_array = PyArray_SimpleNewFromData(1, dims, PyArray_DOUBLE, times);
    PyObject* value_array = PyArray_SimpleNewFromData(1, dims, PyArray_DOUBLE, values);

    PyTuple_SetItem(data_tuple, 0, time_array);
    PyTuple_SetItem(data_tuple, 1, value_array);

    return data_tuple;
}

static PyObject* TimeSeries_load(PyObject* self, PyObject* args)
{
    char* filename = NULL;

    if (! PyArg_ParseTuple(args, "s", &filename))
    {
        return NULL;
    }

    timeseries::TimeSeries ts;

    int fd = open(filename, O_RDONLY);
    google::protobuf::io::FileInputStream fs(fd);
    google::protobuf::io::CodedInputStream coded_fs(&fs);
    coded_fs.SetTotalBytesLimit(500*1024*1024, -1);
    ts.ParseFromCodedStream(&coded_fs);
    fs.Close();
    close(fd);

    return construct_numpy_arrays(&ts);
}

static PyObject* TimeSeries_deserialize(PyObject* self, PyObject* args)
{
    int buffer_length;
    char* serialization = NULL;

    if (! PyArg_ParseTuple(args, "t#", &serialization, &buffer_length))
    {
        return NULL;
    }
    google::protobuf::io::ArrayInputStream input(serialization, buffer_length);

    google::protobuf::io::CodedInputStream coded_fs(&input);
    coded_fs.SetTotalBytesLimit(500*1024*1024, -1);

    timeseries::TimeSeries ts;
    ts.ParseFromCodedStream(&coded_fs);
    return construct_numpy_arrays(&ts);
}

static PyMethodDef TSMethods[] = {
    {"load", TimeSeries_load, METH_VARARGS, "loads a TimeSeries from a file"},
    {"deserialize", TimeSeries_deserialize, METH_VARARGS, "loads a TimeSeries from a string"}
};

#ifndef PyMODINIT_FUNC  /* declarations for DLL import/export */
#define PyMODINIT_FUNC void
#endif
PyMODINIT_FUNC inittimeseries(void)
{
    import_array();
    (void) Py_InitModule("timeseries", TSMethods);
}

Calling the exension from python is trivial:

import time
import timeseries
start = time.time()
t, v = timeseries.load('ts.bin')
print "read and converted to numpy array in %f" % (time.time()-start)
print "timeseries contained %d values" % len(v)

by Bryce Hendrix at July 07, 2010 07:17 PM

Titus Brown

A memory efficient way to remove low-abundance k-mers from large (metagenomic?) DNA data sets

I've spent the last few weeks working on a simple solution to a challenging problem in DNA sequence assembly, and I think we've got a nice simple theoretical solution with an actual implementation. I'd be interested in comments!

Introduction

Briefly, the algorithmic challenge is this:

We have a bunch of DNA sequences in (let's say) FASTA format,

>850:2:1:1145:4509/1
CCGAGTCGTTTCGGAGGGACCCCGCCATCATACTCGGGGAATTCATCTGAAAGCATGATCATAGAGTCACCGAGCA
>850:2:1:1145:4509/2
AGCCAAGAGCACCCCAGCTATTCCTCCCGGACTTCATAACGTAACGGCCTACCTCGCCATTAAGACTGCGGCGGAG
>850:2:1:1145:14575/1
GACGCAAAAGTAATCGTTTTTTGGGAACATGTTTTCATCCTGATCATGTTCCTGCCGATTCTGATCTCGCGACTGG
>850:2:1:1145:14575/2
TAACGGGCTGAATGTTCAGGACCACGTTTACTACCGTCGGGTTGCCATACTTCAACGCCAGCGTGAAAAAGAACGT
>850:2:1:1145:2219/1
GAAGACAGAGTGGTCGAAAGCTATCAGACGCGATGCCTAACGGCATTTTGTAGGGCGTCTGCGTCAGACGCCAACC
>850:2:1:1145:2219/2
GAAGGCGGGTAATGTCCGACAAACATTTGACGTCAAAGCCGGCTTTTTTGTAGTGGGTTTGACTCTTTCGCTTCCG
>850:2:1:1145:5660/1
GATGGCGTCGTCCGGGTGCCCTGGTGGAAGTTGCGGGGATGCGGATTCATCCGGGACGCGCAGACGCAGGCGGTGG

and we want to pre-filter these sequences so that only sequences containing high-abundance DNA words of length k ("k-mers"), remain. For example, given a set of DNA sequences, we might want to remove any sequence that does not contain a k-mer present at least 5 times in the entire data set.

This is very straightforward to do for small numbers of sequences, or for small k. Unfortunately, we are increasingly confronted by data sets containing 250 million sequences (or more), and we would like to be able to do this for large k (k > 20, at least).

You can break the problem down into two basic steps: first, counting k-mers; and second, filtering sequences based on those k-mer counts. It's not immediately obvious how you would parallelize this task: the counting should be very quick (basically, it's I/O bound) while the filtering is going to rely on wide-reaching lookups that aren't localized to any subset of k-mer space.

tl; dr? we've developed a way to do this for arbitrary k, in linear time and constant memory, efficiently utilizing as many computers as you have available. It's open source and works today, but, uhh, could use some documentation...

Digression: the bioinformatics motivation

(You can skip this if you're not interested in the biological motivation ;)

What we really want to do with this kind of data is assemble it, using a De Bruijn graph approach a la Velvet, ABySS, or SOAPdenovo. However, De Bruijn graphs all rely on building a graph of overlapping k-mers in memory, which means that their memory usage scales as a function of the number of unique k-mers. This is good in general, but Bad in certain circumstances -- in particular, whenever the data set you're trying to assemble has a lot of genomic novelty. (See this fantastic review and my assembly lecture for some background here.)

One particular circumstance in which De Bruijn graph-based assemblers flail is in metagenomics, the isolation and sequencing of genetic material from "the wild", e.g. soil or the human gut. This is largely because the diversity of bacteria present in soil is so huge (although the relatively high error rate of next-gen platforms doesn't help).

Prefiltering can help reduce this memory usage by removing erroneous sequences as well as not-so-useful sequences. First, any sequence arising as the result of a sequencing error is going to be extremely rare, and abundance filtering will remove those. Second, genuinely "rare" (shallowly-sequenced) sequences will generally not contribute much to the assembly, and so removing them is a good heuristic for reducing memory usage.

We are currently playing with dozens (probably hundreds, soon) of gigabytes of metagenomic data, and we really need to do this prefiltering in order to have a chance at getting a useful assembly out of it.

It's worth noting that this is in no way an original thought: in particular, the Beijing Genome Institute (BGI) did this kind of prefiltering in their landmark Human Microbiome paper (ref). We want to do it, too, and the BGI wasn't obliging enough to give us source code (booooooo hisssssssssssssss).

Existing approaches

Existing approaches are inadequate to our needs, as far as we can tell from a literature survey and private conversations. Everyone seems to rely on big-RAM machines, which are nice if you have them, but shouldn't be necessary.

There are two basic approaches.

First, you can build a large table in memory, and then map k-mers into it. This involves writing a simple hash function that converts DNA words into numbers. However, this approach scales poorly with k: for example, there are 4**k unique DNA sequences of length k (or roughly (4**k) / 2 + (4**(k/2))/2, considering reverse complements). So the table for k=17 needs 4**17 entries -- that's 17 gb at 1 byte per counter, which stretches the memory of cheaply available commodity hardware. And we want to use bigger k than 17 -- most assemblers will be more effective for longer k, because it's more specific. (We've been using k values between 30 and 70 for our assemblies, and we'd like to filter on the same k.)

Second, you can observe that k-mer space (for sufficiently large k) is likely to be quite sparsely occupied -- after all, there's only so many actual 30-mers present in a 100gb data set! So, you can do something clever like use a tree representation of k-mers and then attach counters to the end nodes of the tree (see, for example, tallymer. The problem here that you need to use pointers to connect nodes in the tree, which means that while the tree size is going to scale with the amount of novel k-mers -- ok! -- it's going to have a big constant in front of it -- bad!. In our experience this has been prohibitive for real metagenomic data filtering.

These seem to be the two dominant approaches, although if you don't need to actually count the k-mers but only assess presence or absence, you can use something like a Bloom filter -- for example, see Classification of DNA sequences using a Bloom filter, which uses Bloom filters to look for novel sequences (the exact opposite of what we want to do here!). References to other approaches welcome...

Note that you really, really, really want to avoid disk access by keeping everything in memory. These are ginormous data sets and we would like to be able to quickly explore different parameters and methods of sequence selection. So we want to come up with a good counting scheme for k-mers that involves relatively small amounts of memory and as little disk access as possible.

I think this is a really fun kind of problem, actually. The more clever you are, the more likely you are to come up with a completely inappropriate data structure, given the amount of data and the basic algorithmic requirements. It demands KISS! Read on...

An approximate approach to counting

So, we've come up with a solution that scales with the amount of genomic novelty, and efficiently uses your available memory. It can also make effective use of multiple computers (although not multiple processors). What is this magic approach?

Hash tables. Yep. Map k-mers into a fixed-size table (presumably one about as big as your available memory), and have the table entries be counters for k-mer abundance.

This is an obvious solution, except for one problem: collisions. The big problem with hash tables is that you're going to have collisions, wherein multiple k-mers are mapped into a single counting bin. Oh noes! The traditional way to deal with this is to keep track of each k-mer individually -- e.g. when there's a collision, use some sort of data structure (like a linked list) to track the actual k-mers that mapped to a particular bin. But now you're stuck with using gobs of memory to keep these structures around, because each collision requires a new pointer of some sort. It may be possible to get around this efficiently, but I'm not smart enough to figure out how.

Instead of becoming smarter, I reconfigured my brain to look at the problem differently. (Think Different?)

The big realization here is that collisions may not matter all that much. Consider the situation where you're filtering on a maximum abundance of 5 -- that is, you want at least one k-mer in each sequence to be present five or more times across the data set. Well, if you look at the hash bin for a specific k-mer and see the number 4, you immediately know that whether or not there are any collisions, that particular k-mer isn't present five or more times, and can be discarded! That is, the count for a particular hash bin is the sum of the (non-negative) individual counts for k-mers mapping to that bin, and hence that sum is an upper bound on any individual k-mer's abundance.

http://ivory.idyll.org/permanent/kmer-hashtable.png

The tradeoff is false positives: as k-mer space fills up, the hash table is going to be hit by more and more collisions. In turn, more of the k-mers are going to be falsely reported as being over the minimum abundance, and more of the sequences will be kept. You can deal with this partly by doing iterative filtering with different prime hash table sizes, but that will be of limited utility if you have a really saturated hash table.

Note that the counting and filtering is still O(N), while the false positives grow as a function of k-mer space occupancy -- which is to say that the more diversity you have in your data, the more trouble you're in. That's going to be a problem no matter the approach, however.

You can see a simple example of approximate and iterative filtering here, for k=15 (a k-mer space of approximately 1 billion) and hash table sizes ranging from 50m to 100m. The curves all approach the correct final number of reads (which we can calculate exactly, for this data set) but some take longer than others. (The code for this is here.)

http://ivory.idyll.org/permanent/kmer-filtering-iterative.png

Making use of multiple computers

While I was working out the details of the (really simple) approximate counting approach, I was bugged by my inability to make effective use of all the juicy computers to which I have access. I don't have many big computers, but I do have lots of medium sized computers with memory in the ~10-20 gb range. How can I use them?

This is not a pleasantly parallel problem, for at least two reasons. First, it's I/O bound -- reading the DNA sequences in takes more time than breaking them down into k-mers and counting them. And since it's really memory bound -- you want to use the biggest hash table possible to minimize collisions -- it doesn't seem like using multiple processors on a single machine will help. Second, the hash table is going to be too big to effectively share between computers: 10-20 gb of data per computer is too much to send over the network. So what do we do?

I was busy explaining to a colleague that this was impossible -- always a useful way for me to solve problems ;) -- when it hit me that you could use multiple chassis (RAM + CPU + disk) to decrease the false positive rate with only a small amount of communication overhead. Basically, my approach is to partition k-mer space into Z subsets (one for each chassis) and have each computer count only the k-mers that fall into their partition. Then, after the counting stage, each chassis records a max k-mer abundance per partition per sequence, and communicates that to a central node. This central node in turn calculates the max k-mer abundance across all partitions.

The partitioning trick is a more general form of the specific 'prefix' approach -- that is, separately count and get max abundances/sequence for all k-mers starting with 'A', then 'C', then 'G', and then 'T'. For each sequence you will then have four values (the max abundance/sequence for k-mers start with 'A', 'C', 'G', and 'T'), which can be cheaply stored on disk or in memory. Now you can do a single-pass integration and figure out what sequences to keep.

This approach effectively multiplies your working memory by a factor of Z, decreasing your false positive rate equivalently - no mean feat!

http://ivory.idyll.org/permanent/kmer-hashtable-par.png http://ivory.idyll.org/permanent/kmer-filter-process-2.png

The communication load is significant but not prohibitive: replicate a read-only sequence data set across Z computers, and then communicate max values (1 byte for each sequence) back -- 50-500 mb of data per filtering round. The result of each filtering round can be returned to each node as a readmask against the already-replicated initial sequence set, with one bit per sequence for ignore or process. You can even do it on a single computer, with a local hard drive, in multiple iterations.

You can see a simple in-memory implementation of this approach here, and some tests here. I've implemented this using readmask tables and min/max tables (serializable data structures) more generally, too; see "the actual code", below.

Similar approaches

By allowing for false positives, I've effectively turned the hash table into a probabilistic data structure. The closest analog I've seen is the Bloom filter which records presence/absence information using multiple hash functions for arbitrary k. The hash approach outlined above devolves into a maximally efficient Bloom filter for fixed k when only presence/absence information is recorded.

The actual code

Theory and practice are the same, in theory. In practice, of course, they differ. A whole host of minor interface and implementation design decisions needed to be made. The result can be seen at the 'khmer' project, here: http://github.com/ctb/khmer. It's slim on documentation, but there are some example scripts and a reasonable amount of tests. It requires nothing but Python 2.6 and a compiler; nose if you want to run the tests.

The implementation is in C++ with a Python wrapper, much like the paircomp and motility software packages.

It will filter 1m 70-mer sequences in about 45 seconds, and 80 million sequences in less than an hour, on a 3 GHz Xeon with 16 gbs of RAM.

Right now it's limited to k <= 32, because we encode each DNA k-mer in a 64-bit 'long long'.

You can see an exact filtering script here: http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py . By varying the hash table size (second arg to new_hashtable) you can turn it into an inexact filtering script quite easily.

Drop me a note if you want help using the code, or a specific example. We're planning to write documentation, doctests, examples, robust command line scripts, etc. prior to publication, but for now we're primarily trying to use it.

Other uses

It has not escaped our notice that you can use this approach for a bunch of other k-mer based problems, such as repeat discovery and calculating abundance distributions... but right now we're focused on actually using it for filtering metagenomic data sets prior to assembly.

Acks

I talked a fair bit with Prof. Rich Enbody about this approach, and he did a wonderful job of double-checking my intuition. Jason Pell and Qingpeng Zhang are co-authors on this project; in particular, Jason helped write the C++ code, and Qingpeng has been working with k-mers in general and tallymer in specific on some other projects, and provided a lot of background help.

Conclusions

We've taken a previously hard problem and made it practically solvable: we can filter ~88m sequences in a few hours on a single-processor computer with only 16gb of RAM. This seems useful.

Our main challenge now is to come up with a better hashing function. Our current hash function is not uniform, even for a uniform distribution of k-mers, because of the way it handles reverse complements.

The approach scales reasonably nicely. Doubling the amount of data doubles the compute time. However, if you have double the novelty, you'll need to do double the partitions to keep the same false positive rate, in which case you quadruple the compute time. So it's O(N^2) for the worst case (unending novelty) but should be something better for real-world cases. That's what we'll be looking at over the next few months.

I haven't done enough background reading to figure out if our approach is particularly novel, although in the space of bioinformatics it seems to be reasonably so. That's less important than actually solving our problem, but it would be nice to punch the "publication" ticket if possible. We're thinking of writing it up and sending it to BMC Bioinformatics, although suggestions are welcome.

It would be particularly ironic if the first publication from my lab was this computer science-y, given that I have no degrees in CS and am in the CS department by kind of a fluke of the hiring process ;).

July 07, 2010 03:09 PM

July 06, 2010

Matthieu Brucher

QtVST: a Chamberlin Variable Filter

After my last post on QtAgain, I’ve decided to test a few simple digital filters. I’ve tried to make them as generic as possible, and with a VST interface.

A templated Chamberlin filter

A Chamberlin filter is a IIR (Infinite Impulse Response) filter, and it is possible to make it output high-, band- or low-pass signals. It has only two parameters that are independent.

The equations are very simple:

Each of the series is either the low-pass, the band-pass or the high-pass output.

There are some factors that are pre-computed:

The first is a function of the Cut frequency and the Sampling frequency, a “numerical frequency”. The second one is the numerical attenuation, between 0 and 2.

This is an excerpt of the actual code, I’ve only displayed the relevant code.

template<class Data_Type>
class VariableFilter
{
public:
  typedef Data_Type DataType;
 
  void process(const DataType* in, DataType* out, unsigned long nb_samples)
  {
    for(unsigned long i = 0; i < nb_samples; ++i)
    {
      yh = in[i] - yl - numerical_attenuation * yb;
      yb = numerical_frequency * yh + yb;
      yl = numerical_frequency * yb + yl;
      if(selected == 0)
      {
        out[i] = yl;
      }
      else if(selected == 1)
      {
        out[i] = yb;
      }
      else
      {
        out[i] = yh;
      }
    }
  } 
 
protected:
  void compute_factors()
  {
    numerical_frequency = 2 * std::sin(M_PI * cutoff_frequency / sampling_frequency);
    numerical_attenuation = 2 * attenuation;
  }
};

Now, I’ve written a simple GUI for this plugin.

GUI with Qt

I’ve used the QtAgain skeletton to wrap the filter and to make it available to PyVST. It’s really easy to add other parameters to the skeletton, so I will not write everything here again. There are three parameters, 2 numerical ones, and one three-state variable.

Results

I’ve recorded the filter’s outputs for 6kHz as central frequency, and an attenuation factor of .3. The output signal was generated from a random signal, and then an FFT was performed on the inputs and the outputs and displayed.

Note: I didn’t convert the scale to dB.



As you may have noted, the filter amplifies the signal near the central frequency for each output for the attenuation I’ve selected. The issue is that selecting a higher one has also an impact on the stability of the filter at higher frequencies. The sum of the numerical attenuation and the numerical frequency should always be inferior to 2.

Conclusion

This numerical is really simple to implement, and it has few parameters. The dark side of this filter is that is not always stable and the central frequency seems to be always amplified. Other filters can be better balanced.

The code is available on Launchpad.


by Matt at July 06, 2010 07:17 AM

Titus Brown

Teaching scientists how to use computers - hub & spokes

After my recent next-gen sequencing course, which was supposed to tie into the whole software carpentry (SWC) effort but didn't really succeed in doing so the first time through, I started thinking about the Right Way to tie in the SWC material. In particular, how do you both motivate scientists to look at the SWC material, and (re)direct people to the appropriate places?

It's not clear that a Plan is in place. Greg Wilson seems to assume that scientists will find at least some of the material immediately obviously usable, but I think he's targetted at a more sophisticated population of users -- physicists and the like. My experience with bioinformaticians, however, is that they either come from straight biology backgrounds (with little or no computational background and rather limited on-the-job training), straight computation backgrounds (with very little biology), or physics (gonzo programming skills, but no biology). The latter fit neatly into the SWC fold, but they (we ;) are rare in biology. I think computer scientists and biologists are going to need guidance to dive into SWC at an early enough time for it to be the most rewarding.

So, what's a good model for SWC to guide scientists from multiple disciplines into the appropriate material? It's obviously not going to be possible to have Greg et al. tailor the SWC material to individual subgroups -- he doesn't know much (any ;) biology, for example. I don't have the time, patience, or skillset to integrate my next-gen notes into his SWC material, either. So, instead, I propose the hub & spokes model!

http://ivory.idyll.org/permanent/hub-spokes.png

Here, the "hub" is the SWC material, and the spokes are all of the individual disciplines.

Basically, the idea is that individual sites (like my own ANGUS site on next-gen sequencing, http://ged.msu.edu/angus/) will develop their own field-specific content, and then link from that content into the SWC notes. This way the experts with feet in both fields can link appropriately, and Greg only has to worry about making the central content general -- which he's already doing quite well, I think. Yes, It's more work than asking Greg to do it, but frankly I'm going to be happy with a kick-ass central SWC site to which I can link -- right now it's dismayingly challenging to teach students why this stuff matters and how to learn it.

From the psychosocial perspective, it's a great fit. Students can get hands on tutorials on how to do X, Y, and Z in their own field -- and then connect into the SWC material to learn the background, or additional computational techniques in support of it. Motivation first!

What do we need SWC to do to support this? Not much -- basically, the central SWC notes need to be stable enough (with permalinks) that I can link into them from my own site(s) and not have to worry about the links becoming broken or (worse) silently migrating in topic. There are other solutions (wholesale incorporation of SWC into my own notes, for example) but I think the permalink idea is the most straightforward. Oh, and we should have a Greg-gets-hit-by-a-bus plan, too; at some point he's going to move on from SWC (perhaps when his lovely wife decide she's had enough and he needs to stop obsessing over it, or perhaps under more dire circumstances ;( and it would be good to know who holds the domain and site keys.

Thoughts? Comments?

--titus

July 06, 2010 03:12 AM

July 04, 2010

EuroSciPy 2010

Euroscipy tutorials: practical information

http://farm3.static.flickr.com/2726/4351345406_aa4e0c0f9a_m_d.jpg

The tutorials are closing in, here are a few practical details!

Venue and schedule

The conference will take place at the Ecole Normale Superieure (ENS), 45 rue d'Ulm, 75005 Paris. The registration desk will be located in front of the Dussane room where the Advanced tutorials will also be given. All participants should first check in at the registration desk, which will be opened from 8.15 on Thursday 8th. We expect 140 participants to show up, so please arrive well before the the first tutorial at 9.00! The introduction tutorials will be given in a different building (Jules Ferry room, 29 rue d'Ulm, 75005 Paris) but all participants should go first to the registration desk in front of the Dussane room.

To reach the Dussane room, enter the main building of the Ecole Normale Supérieure, and take the corridor on your left.

There will be some wifi access in the Dussane room (although we don't know how it will stand up to dozens of computers connecting at the same time), but not in the Jules Ferry room.

All coffee breaks will take place in front of the Dussane room.

Pre-requisites

As most tutorials will be hands-on, you need to install all the required software on your computer before next Thursday. For the introduction tutorials, a list of required softwares and Python modules (Python, Ipython, NumPy, SciPy, Matplotlib) is given on http://www.euroscipy.org/conference/euroscipy2010, together with links to distributions (Python(x,y) and EPD) that install all these packages at once (and many others). For Windows users, we recommend installing one of these distributions. Would you install Python(x,y), make sure to install the full version. If you are using Linux, all these modules are packaged for most Linux distributions and can be installed in a few clicks with the package manager.

If you have problems installing the software, you may contact us for help, but please try first to get help around you or browsing the Internet, as we expect to be quite overwhelmed as the conference is closing in. Nevertheless, it will a save a lot of time if everybody arrives with a laptop well setup, so don't hesitate to ask questions.

For the advanced track, please check on the tutorials description if other packages or softwares are mentioned (Mayavi, SymPy, PyTables, Cython, etc.), and if so, install them in addition to the base packages listed above (or check if they come with your scientific Python distribution).

Also try to take a computer with a long battery lifetime, as we won't be able to plug more than a fraction of the computers at the same time. (Yes, we do have some extension cables, but the electrical circuits of the rooms are not sized to supply power to 80 laptops at the same time).

For any questions, please write to info@euroscipy.org.

We are looking forward to seeing you next Thursday!

photography under CC by Valentin Ottone

by Emmanuelle Gouillart at July 04, 2010 11:17 PM

July 02, 2010

David Cournapeau

cournape

I have just released bento 0.0.3 (see here for details).

The release is available on github


by cournape at July 02, 2010 02:56 AM

SciPy 2010

Travis Oliphant announces...

(reposted from the Enthought blog)
Travis announces project to extend NumPy/SciPy to .Net(photo: Paul Ivanov)
Travis Oliphant, President of Enthought, Inc., kicked off today's SciPy 2010 Day 2 with a great keynote talk. He told the story of his own path to Python, filling his slides with the faces and work of other developers, scientists, and mathematicians — inspiration, teachers, and collaborators. He explained how his academic trajectory, from electrical engineering, through a brief affair with neuroscience, to a biomedical engineering PhD, both drove and competed with his work creating NumPy.
Last, but not least, Travis closed his talk with rather large announcement: Enthought has undertaken the extension of NumPy and SciPy to the .NET framework. For details on the project refer to the official release.

by Amenity Applewhite (noreply@blogger.com) at July 02, 2010 02:27 AM

July 01, 2010

Enthought

Fast Protocol Buffers in Python

A couple of years ago I worked on a project which needed to transport a large dataset over the wire. I looked at a number of technologies, and Google Protocol Buffers looked very interesting. Over the past week, I’ve been asked about my experience a couple of times, so I hope this provides a little bit of insight into how to use Protocol Buffers in Python when performance matters.

I wrote a little test case to model the serialization of the data I wanted to send, a list of 100 pairs of arrays, where each array contained 250,000 elements. The raw data size was 381 MB.

First, I ran the pure python test: the write took 83 seconds, the read took 202 seconds. Not good.

Next I tested the same data in C++: the write took 4.4 seconds and the read took 2.8 seconds. Impressive.

The obvious path then was to write the serialization code in C++ and expose it through an extension point. The read function, including putting all of the data into numpy arrays now takes 7.5 seconds. I only needed the read function from Python, but the write function should take about the same time.

by Bryce Hendrix at July 01, 2010 10:02 PM

Travis Oliphant announces…

Travis announces project to extend NumPy/SciPy to .Net

Travis announces project to extend NumPy/SciPy to .Net

Travis Oliphant kicked off today’s SciPy 2010 Day 2 with a great keynote talk. He told the story of his own path to Python, filling his slides with the faces and work of other developers, scientists, and mathematicians — inspiration, teachers, and collaborators. He explained how his academic trajectory, from electrical engineering, through a brief affair with neuroscience, to a biomedical engineering PhD, both drove and competed with his work creating NumPy.
Last, but not least, Travis closed his talk with rather large announcement: Enthought has undertaken the extension of NumPy and SciPy to the .NET framework. For details on the project refer to the official release.

by amenity at July 01, 2010 08:58 PM

SciPy 2010 underway!

Everyone minus Ian, the most valiant photographer!

Everyone minus Ian, the most valiant photographer!

We were thrilled to host SciPy 2010 in Austin this year. Everyone seems to be enjoying the cool weather (so what if it’s borne of thunderstorms?) and the plush conference center/hotel (even if we had to retrain their A/V team).
After two days of immensely informative Tutorials, the General Session began yesterday with speaker Dave Beazley’s awesome keynote on Python concurrency. In addition to the solid line-up of talks at the main conference, we had two very well-attended specialized tracks: Glen Otero, chaired the Bioinformatics track, while Brian Granger and Ken Elkabany coordinated the Parallel Processing & Cloud Computing talks. The day then closed with a conference reception and guacamole-fueled Birds of a Feather sessions.
SciPy 2010

by amenity at July 01, 2010 08:28 PM

SciPy 2010

SciPy 2010 underway!

(reposted from the Enthought blog)
We were thrilled to host SciPy 2010 in Austin this year. Everyone seems to be enjoying the cool weather (so what if it’s borne of thunderstorms?) and the plush conference center/hotel (even if we had to retrain their A/V team).
After two days of immensely informative Tutorials, the General Session began yesterday with speaker Dave Beazley's awesome keynote on Python concurrency. In addition to the solid line-up of talks at the main conference, we had two very well-attended specialized tracks: Glen Otero, chaired the Bioinformatics track, while Brian Granger and Ken Elkabany coordinated the Parallel Processing & Cloud Computing talks. The day then closed with a conference reception and guacamole-fueled Birds of a Feather sessions.
SciPy 2010

by Amenity Applewhite (noreply@blogger.com) at July 01, 2010 03:23 PM

SciPy 2010 Group Photo

DSC_9860

It took a couple attempts to find a good location for such a large group photo -- but fortunately the conference is only a block or two from the entrance to the beautiful University of Texas Austin campus! Thankyou to everyone for being patient while we moved from spot to spot. You can download large (1024 x 680) and original (4256 x 2828) files on the photo page.

by Ian Rees (noreply@blogger.com) at July 01, 2010 02:46 PM

June 29, 2010

SciPy 2010

SciPy 2010: Basic Tutorials -- IPython and virtualenv

The basic tutorials at SciPy were well-taught and allowed someone unfamiliar with numpy, scipy, and ipython (like me!) to get up to speed and productive very quickly. I now feel confident with the new tools in my python toolbox. Thanks!

I use virtualenv extensively to control my python site-packages environment. This allows me to experiment with different versions of the python packages I use, including repository sources, without polluting or corrupting my system's overall python environment. It's a great tool and easy to use, at least in the Red Hat and Debian-based Linux systems I use daily.

However, one thing I discovered was that ipython unfortunately did not work properly in my virtualenv'ed python environments. Regardless if I had activated a virtualenv environment, ipython always used the system site packages only.

In order for virtualenv to do it's magic in creating a virtual python environment, it relocates python (so that the python path, etc. are set correctly). For example, if there is a virtualenv in the directory "~/dev/pyenv", "which python" will return "~/dev/pyenv/bin/python" if that virtualenv is active.

The problem is that the shebang in the ipython script (/usr/bin/ipython on my system) points to the system python, not the virtualenv one. So whenever ipython is executed, it will use the system python, not the virtualenv one.

There are a number of ways to address this. One could "easy_install ipython" in every virtualenv where ipython is desired. This seems excessive, and redundant in virtualenv's where the system site packages are included. Note, you'll still need to "pip install ipython" in virtualenv's without the system site packages included (--no-site-packages option) as that creates a clean and empty python environment with nothing previously installed.

One could instead edit the ipython script to shebang "python", removing the absolute path. Then, when ipython is executed, it will use whatever python command is in the path. But if ipython is upgraded, that edit could be lost. It also directly changes a system package, which is generally not preferred.

Therefore, the best solution is to is just bypass the shebang by running "python /usr/bin/ipython" rather than just "ipython". This way, the shebang will be ignored, and the path's python will get run. This will be the system python if not in a virtualenv environment, or the virtualenv's python if active. However, remembering to type "python /usr/bin/ipython" instead of "ipython" is a pain. An alias in .bash_aliases like "alias ipython='python /usr/bin/ipython'" is the answer here.

Hopefully this helps others who want to use ipython in their virtualenv environments. I am looking forward to using these new tools and hear all the cool things people are using python for at the conference proper tomorrow and Thursday.

With this, ipython will work perfectly in both system and virtualenv environments.

by Eric Floehr (noreply@blogger.com) at June 29, 2010 09:06 PM

SciPy 2010: Tutorials Day 2 - Mayavi

Prabhu Ramachandran is delivering an overview of Mayavi 2 and some of the finer points of mlab in particular. The first exercise was a quick Lorenz attractor visualization (which I got mostly right on the first try). mlab.quiver3d provides a nice one-liner to pop up a view of the data. Here's a video of the result (with a little tweaking of parameters through the nice UI that comes up).

by Travis (noreply@blogger.com) at June 29, 2010 04:06 PM

SciPy 2010: Tutorials Day 2 - Traits/TraitsUI

Corran Webster of Enthought presented a session on Traits in the afternoon. In particular, he covered TraitsUI with some detail. Good hands-on exercises resulted in a nice GUI for manipulating data in just a few minutes. It's good to see how this library is incrementally improving.

by Travis (noreply@blogger.com) at June 29, 2010 02:20 PM

Tutorials Day 1 Cont'd #2: HPC with Python


Brian Granger answers questions at the end of an excellent 4 hour tutorial yesterday on doing parallel computing in Python. Brian started by covering the painful but often overlooked realities of Amdahl’s Law - namely the fact that the maximum speedup is bounded by the fraction of parallel code (see the figure, taken from the tutorial with permission).

After that, attendees were taken on a whirlwind tour through the lands of Cython, Threading and Multiprocessing modules of the python standard library (with the requisite discussion of the GIL), interactive parallel usage of IPython using ipcluster, PiCloud's cloud package, with brief overviews of Mpi4Py, and PyZMQ, the python wrappers for ØMQ (Brian pointed to this blog post by Nicholas Peil for more on ØMQ)

by Paul Ivanov (noreply@blogger.com) at June 29, 2010 01:50 PM

Matthieu Brucher

Optimization scikit: a conjugate-gradient optimization

In my last post about optimization, I’ve derived my function analytically. Sometimes, it’s not as easy. Sometimes also, a simple gradient optimization is not enough.

scikits.optimization has a special class for handling numerical differentiation, and several tools for conjugate gradients.

Numerical Differentiation

Differentiation is accomplished through helper class. The new function has to be derived from of two classes,
ForwardFiniteDifferences or CenteredFiniteDifferences. In these classes, gradient and Hessian are computed based on the actual cost function. You may also override the gradient so that only the Hessian is numerically computed.

A single parameter, difference, is the delta that will be applied on each parameter to compute the derivatives. This number is arbitrally fixed, and may be changed in numerical differentiation fails.

Here is what a weighted sinc function looks like (the weights are given by self.a):

?View Code PYTHON
from scikits.optimization.helpers import ForwardFiniteDifferences
 
class Function(ForwardFiniteDifferences):
  def __init__(self):
    ForwardFiniteDifferences.__init__(self)
    self.a = numpy.array((1, 2))
  def __call__(self, x):
    t = numpy.sqrt(numpy.sum((self.a * x)**2))
    return -numpy.sinc(numpy.pi * t)

Conjugate Gradient

Conjugate gradient is an optimization technique that uses the new gradient as well as the last descent direction or gradient to create a new descent direction. Besides this, the line search must have some properties. You don’t need an exact one, but it must obey some rules, known as the Wolfe-Powell rules. There are two formulations, both available in the scikit. I suggest you read some of the reference book I give at the end of the topic to choose the right line search, as well as the appropriate conjugate gradient.

Besides the well-known Fletcher Reeves conjugate gradient, there are several other factors that can be used, as the Polak-Ribière-Polyak that can also be combined with Fletcher-Reeves one to form what is perhaps the best formulation. The different available conjugate gradients are available in the documentation.

Application

OK, let’s see what it can do. I’ve assembled a Fletcher-Reeves conjugate gradient with Wolfe-Powell rules line search.

?View Code PYTHON
mystep = step.FRConjugateGradientStep()
mylinesearch = line_search.WolfePowellRule()
mycriterion = criterion.criterion(ftol = 0.0001, iterations_max = 100)
myoptimizer = optimizer.StandardOptimizer(function = fun,
                                          step = mystep,
                                          line_search = mylinesearch,
                                          criterion = mycriterion,
                                          x0 = numpy.array((.35, .45)))

I didn’t display the result, because in fact it didn’t finished at 100 iterations.

Here are animations on the evolution of the value of the cost function and the position of the parameters.

Knowing that Fletcher-Reeves adds to the new gradient a fraction of the last direction, this circular evolution can be understood. In fact, with a simple gradient, it works better, as well as if a Polak-Ribière-Polyak is used. It all depends on how the different ingredients are mixed, as well as if the Wolfe-Powell rules are close to an exact line search.

Conclusion

I think that trying different solutions is what is fun about optimizations. With conjugate gradient, a lot of different optimizers can be built. With numerical differentitation, you don’t have to rely on analytical gradient. Know though that the more the parameters, the longest it takes to compute the gradient (O(n)) and I don’t even speak about the Hessian (O(n²)).

Here are some useful books that explain the theory and the ideas behind these topics:

Optimization Theory and Methods: Nonlinear Programming

Numerical Optimization

Engineering Optimization: Theory and Practice

Link to the tutorial code: Launchpad.

Buy Me a Coffee!



Other Amount:



Your Email Address :




by Matt at June 29, 2010 07:36 AM

June 28, 2010

SciPy 2010

Tutorials Day 1 Cont'd: Signals and Systems

Gunnar Ristroph gave a tutorial in the advanced track on the basics of doing signal processing in Python using scipy. Here's a link to the materials for his talk: slides and code with sample data. He also plugged the work-in-progress python-control module which goes beyond the functionality provided by the scipy.signal module.

by Paul Ivanov (noreply@blogger.com) at June 28, 2010 02:08 PM

Tutorials Day 1: Advanced NumPy

The advanced track tutorials kicked off today with Stéfan van der Walt's advanced NumPy tutorial, "Kittens and Dragons".

Among other things, he gave an "under the hood" introduction to the numpy.ndarray object, broadcasting and indexing tricks, an introduction to profiling Python code and visualizing your profiling results, a crash course in speeding up your code with Cython, and an example of using the array interface to expose foreign memory to NumPy.

His slides and code examples are now online.


by David Warde-Farley (noreply@blogger.com) at June 28, 2010 10:00 AM

June 27, 2010

SciPy 2010

SciPy Salsa!

A truly Texan treat awaits this year's participants at the registration desk:


by Paul Ivanov (noreply@blogger.com) at June 27, 2010 05:06 PM

June 25, 2010

EuroSciPy 2010

EuroSciPy 2010 Tutorials in the works

http://www.europython.eu/images/euro_compo.jpg

The tutorials are being worked on at github and discussed on a mailing list. If you want to join us and contribute, subscribe to euroscipy-tutorials and fetch the source from the repositories: first and second.

photo borrowed from the EuroPython website

by Nicolas Chauvat at June 25, 2010 06:20 PM

June 24, 2010

Titus Brown

Which functional programming language(s) should we teach?

Laurie Dillon just posted the SIGPLAN eduction board article on Why Undergraduates Should Learn the Principles of Programming Languages to our faculty mailing list at the MSU Computer Science department. One question that came up in the ensuing conversation was: what functional programming language(s) would/should we teach?

I mentioned OCaml, Haskell, and Erlang as reasonably pure but still pragmatic FP languages. Anything else that's both "truly" functional and used somewhat broadly in the real world?

thanks!

--titus

June 24, 2010 06:31 PM

June 22, 2010

Skipper Seabold

Statsmodels: GSoC Week 4 Update

I spent the last week finishing up the paper that I submitted to accompany my talk at the SciPy conference. I am really looking forward to going to Austin and hearing all the great talks (plus I hear the beer is cheap and the food and music are good, which doesn't hurt). In addition to finishing up the paper, I have started to clean up our time series code.

So far this has included finishing the augmented Dickey-Fuller (ADF) test for unit roots. The big time sink here is that the ADF test-statistic has a non-standard distribution in most cases.  The ADF test statistic is obtained by running the following regression



One approach to testing for a unit root means testing the t-stat on the coefficient on the lagged level of y.  The actual distribution for this statistic, however, is not Student's t.  Many software packages use the tables in Fuller (1976, updated to 1996 version or not) in order to get the critical values for the test statistic depending on the sample size.  They use linear interpolation for sample sizes not included in the table.  The p-values for the obtained test statistic are usually obtained using MacKinnon's (1994) study that estimated regression surfaces of these distributions via Monte Carlo simulation.

While we do use MacKinnon's approximate p-values from the 1994 paper, MacKinnon wrote a note updating this paper in early 2010, which gives new regression surface results for obtaining the critical values.  We use these new results for the critical values.  Therefore, when using our ADF test, it is advised that if the p-value is close to the reject/accept region then the critical values should be used in place of the p-value to make the ultimate decision.

We can illlustrate the use of ADF.  Note that this version is only in my branch and that it is still in the sandbox, even though it has now been tested, because the API and returned results may change.  We will demonstrate on a series that we can easily guess is non-stationary, real GDP.


In [1]: import scikits.statsmodels as sm

In [2]: from scikits.statsmodels.sandbox.tsa.stattools import adfuller

In [3]: data = sm.datasets.macrodata.load()

In [4]: realgdp = data.data['realgdp']

In [5]: adf = adfuller(realgdp, maxlag=4, autolag=None, regression="ct")

In [6]: adf
Out[6]:
(-1.8566384063254346,
 0.67682917510440099,
 4,
 198,
 {'1%': -4.0052351400496136,
  '10%': -3.1402115863254525,
  '5%': -3.4329000694218998})


The return values are the test statistic, its p-value (the null-hypothesis here is that the series does contain a unit root), the number of lags of the differences used, the number of observations for the regression, and a dictionary containing the critical values at the respective confidence levels.  The regression option controls the type of regression (ie., whether to include a constant or a linear or quadratic time trend), and the autolag option has three options for choosing the lag length to help correct for serial correlation in the regression.  There are 'AIC', 'BIC', and 't-stat'.  The former two choose the lag length that maximizes the infofrmation criterion, the latter chooses the lag length based on the significance of the lag.  This starts with maxlag and works its way down.  The docstring has more detailed information.




Beyond this, I have been working on an autocorrelation function (acf), a  partial autocorrelation function (pacf), and Q-Statistics (Box-Ljung test). Next up for this week is finishing my VAR class with identification schemes.  After this, I will work to integrate post-estimation tests into our results classes, most likely using some sort of mix-in classes and attach test containers to the results objects for test results.  Then it's off to the SciPy conference. There I will hopefully be participating in the stats sprint, helping out with the docs marathon and discussing what we need for the future of statistics and Python.


Fuller, W.A.  1996.  Introduction to Statistical Time Series. 2nd ed.  Wiley.


MacKinnon, J.G. 1994.  "Approximate asymptotic distribution functions for
    unit-root and cointegration tests.  Journal of Business and Economic
    Statistics
12, 167-76.

MacKinnon, J.G. 2010. "Critical Values for Cointegration Tests." 
    Queen's University, Dept of Economics, Working Papers.  Available at
    http://ideas.repec.org/p/qed/wpaper/1227.html

by jseabold (noreply@blogger.com) at June 22, 2010 11:42 AM

June 15, 2010

Matthieu Brucher

Optimally use massively parallel clusters resources

We have now several petaflopic clusters available in the Top500. Of course, we are trying to get the most of their peak computational power, but I think we should sometimes also look at optimal resource allocation.

I’ve been thinking about this for several months now, for work that has thousands of tasks, each task being massively data parallel. Traditionnally, one launches a job through one’s favorite batch scheduler (favorite or mandatory…) with fixed resources and during an estimated amount of time. This may work well in research, but in the industrial world, there often a new job that arises and that needs part of your scarce resources. You may have to stop your work, loose your current advances and/or restart the job with less resources. And then the cycle goes on.

Static resource allocation

How can resource allocation work? Let’s start with a simple case where you have 2 applications with different priorities. One of them has a priority of 70 (it’s supposed to finish in three days) whereas the other one has a priority of 50 (four days left). They share the cluster so that 66% is allocated to the first application and 33% to the second one.

What happens if a third application must be launched with a higher priority, because it has to ne finished by tomorrow? You may stop the other two programs, you may loose a lot of work if you didn’t implement checkpoints (besides, one of them may be an of-the-shelf program you bought yesterday) or suspend it. Either way, this is what you will get:

In fact, even if you use dynamic resource allocation, this is what you must get to have your results by the time you need them, but obviously, you have lost your two other applications. Some batch schedulers allow applications to be suspended, but this is a double-edge sword:

  • your cluster must support job suspension, and thus have access to drives to save the job state (which is not possible for medium to large-scaled clusters)
  • if your application does not scale to your entire cluster (it happens), although one of the other two applications could go on, it is not possible, all processes are put to sleep

So all things considered, you have to implement dynamic resource allocation.

Dynamic resource allocation

How does this work? Each application must be aware that it can be allocated more resources or deallocated some at all time. To be portable on all clusters, you cannot suspend part of your program, it must really go away. The batch scheduler must also notice that your application has freed some of its resources. You thus have to allocate small jobs that will communicate together (this can be done with MPI-2).

This means that you will have hundreds or thousands of small works. All of them will not have to be connected to the scheduler, only one master must be. Of course, this can easilly be done by using a specific queue. Each application on this queue will thus receive orders from the batch scheduler and act upon it. Another advantage is that also the application gets no resource at one point, it still has a saved state that enable the continuation of a run.

Of course, this is not easy to do. How can this be applied to an of-the-shelf application? Well, in this case, you may create a bogus application on the master queue that will at least allow other applications to be allocated resources beside it.

You do not have to implement this on top of MPI. It can be really hard to do (handling data moves between processors, change the decomposition, …), and you may implement another solution. In my case, I have thousands different tasks that can be run on very few cores, so this is my elementary unit. I don’t need all tasks to communicate between them, so I create each time brand new independent jobs and I also can tell the scheduler it can kill jobs that are not responding before the next allocation phase.

Conclusion

To finish, I’ll say that I know that LSF allows plugins that help dispatch jobs on specific hosts of your cluster (to have the best communication location). There seems to be a way of implementing the needs gathering and the resource assignment, but the documentation is not clear (at all). A specific daemon may be needed. I don’t know if other batch scheduler allow plugins to modify their behavior, if you know of them and their API, please do tell ;)


by Matt at June 15, 2010 07:53 AM

June 14, 2010

Titus Brown

Teaching next-gen sequencing data analysis to biologists

Our sequencing analysis course ended last Friday, with an overwhelmingly positive response from the students. The few negative comments that I got were largely about organizational issues, and could be reshaped as suggestions for next time rather than as condemnations of this year's course.

http://ivory.idyll.org/permanent/ngs-2010-group.png

The 23 students -- most with no prior command-line experience -- spent two weeks experiencing at first hand the challenges of dealing with dozens of gigabytes of sequencing data. Each of the students went through genome-scale mapping, genome assembly, mRNAseq analysis on an "emerging model organism" (a.k.a "one with a crappy genome", lamprey), resequencing analysis on E. coli, and ChIP-seq analysis on Myxococcus xanthus. By the beginning of the second week, many students were working with their own data -- a real victory. Python programming competency may take a bit longer, but many of them seem motivated.

If you had told me three weeks ago that we could pull this off, I would have told you that you were crazy. This does beg the question of what I was thinking when I proposed the course -- but don't dwell on that, please...

The locale was great, as you can see:

http://ivory.idyll.org/permanent/ngs-2010-beach.png

One of the most important lessons of the course for me is that cloud computing works well to backstop this kind of course. I was very worried about the suitabiliy and reliability and ease of use, but AWS did a great job, providing an easy-to-use Web interface and a good range of machine images. I have little doubt that this course would have been nearly impossible (and either completely ineffective or much more expensive) without it.

In the end, we spent more on beer than on computational power. That says something important to me :)

The course notes are available under a CC license although they need to be reworked to use publicly available data sets before they become truly useful. At that point I expect them to become awesomely useful, though.

From the scientific perspective, the students derived a number of significant benefits from the course. One that I had not really expected was that some students had no idea what went in to computational "sausage", and were kind of shocked to see what kinds of assumptions us comp bio people made on their behalf. This was especially true in the case of students from companies, who have pipelines that are run on their data. One student lamented that "we used to look at the raw traces... now all we get are spreadsheet summaries!" Another student came to me in a panic because they didn't realize that there was no one true answer -- that that was in fact part of the "fun" of all biology, not just experimental biology. These reactions alone made teaching the course worthwhile.

Of course, the main point is that many of the students seem to be capable of at least starting their own analyses now. I was surprised at the practical power of our cut-and-paste approach -- for example, if you look at the Short-read assembly with ABySS tutorial, it turns out to be relatively straightforward to adapt this to doing assemblies of your own genomic or transcriptomic data. I based our approach on Greg Wilson's post on the failure of inquiry-based teaching and so far I like it.

I am particularly amused that we have now documented, in replicable detail, the Kroos Lab MrpC ChIP analysis. We also have the best documentation for Jeff Barrick's breseq software, I think; this is what is used to analyze the Long Term Evolution Experiment lines -- and I can't wait for the anti-evolutionists to pounce on that... "Titus Brown -- making evolution experiments accessible to creationists." Yay?

There were a number of problems and mistakes that we had to steamroller through. In particular, more background and more advanced tutorials would have be great, but we just didn't have time to write them. Some 454, Helicos, and SOLiD data sets (and next year, PacBio?) would be a good addition. We had a general lack of multiplexing data, which is becoming a Big Thing now that sequencing is so ridiculously deep. I would also like to introduce additional real data analyses next year, reprising things like the Cufflinks analysis and whole-vertebrate-genome ChIP-seq/mRNAseq a la the Wold Lab. I'm weighing adding metagenomics data analysis in for a day, although it's a pretty separate field of inquiry (and frankly much harder in terms of "unknown unknowns"). We also desperately need some plant genomics expertise, because frankly I know nothing about plant genomes; my last-minute plant genomics TA fell through due to lack of planning on my part. (Conveniently, plant genomics is something MSU is particularly good at, so I'm sure I can find someone next year.)

Oops, did I say next year? Well, yes. If I can find funding for my princely salary, then I will almost certainly run the course again next year. I can cover TAs and my own room/board and speakers with workshop fees, but if I'm going to keep room+board+fees under $1000/student -- a practical necessity for most -- there's no way I can pay myself, too. And while this year I relied on my lovely, patient, and frankly long-suffering wife to hold down the home fort while I was away for two weeks, I simply can't put her through that again, so I will need to pay for a nanny next year. So doing it for free is not an option.

In other words, if you are a sequencing company, or an NIH/NSF/USDA program director, interested in keeping this going, please get in touch. I plan to apply for this Initiative to Maximize Research Education in Genomics in September, but I am not confident of getting that on the first try, and in any case I will need letters of support from interested folks. So drop me a note at ctb@msu.edu.

Course development this year was sponsored by the MSU Gene Expression in Disease and Development, to whom I am truly grateful. The course would simply not have been possible without their support.

My overall conclusion is that it is possible to teach bench biologists with no prior computational experience to achieve at least minimal competency in real-world data analysis of next-generation sequencing data. I can't conclusively demonstrate this without doing a better job of course evaluation, and of course only time will tell if it sticks for any of the students, but right now I'm feeling pretty good about the course overall. Not to mention massively relieved.

--titus

p.s. Update from one student -- "It's not even 12 o'clock Monday morning and I'm already getting people asking me how to run assemblies and analyze data." Heh.

June 14, 2010 03:38 PM

Skipper Seabold

Statsmodels: GSoC Week 3 Update

[Edit: Formatting should be fixed now. I will not be reformatting old posts though, so that they don't get reposted at Planet SciPy]



Last week was spent mainly ensuring that I pass my comps and remain a PhD student. This week was much more productive for coding. For now, all changes are in my branch and have not been merged to trunk, but I will describe the two big changes.



The first concerns the datasets package. This one is not all that exciting, but suffice it to say that the datasets are now streamlined and use the Bunch pattern to load the data. Thanks, Gaël, for pointing this out. I also rewrote a bit of David's datasets proposal from scikits-learn to reflect the current design of our datasets and thoughts. You can see it here (soon to be on the docs page). We are making an effort to ensure that our datasets are going to be similar to those of scikits-learn.



The second change was an improvement of the fitting of maximum likelihood models and the start of a GenericLikelihoodModel class. Maximum likelihood based models (mainly discrete choice models in the main code base right now) can now be fit using any of the unconstrained solvers from scipy.optimize (Nelder-Mead, BFGS, CG, Newton-CG, Powell) plus Newton-Raphson. To take a simple example to see how it works, we can fit a Probit model.



In [1]: import scikits.statsmodels as sm

In [2]: data = sm.datasets.spector.load()

In [3]: data.exog = sm.add_constant(data.exog)

In [4]: res_newton = sm.Probit(data.endog, data.exog).fit(method="newton")
Optimization terminated successfully.
Current function value: 12.818804
Iterations 6

In [5]: res_nm = sm.Probit(data.endog, data.exog).fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [6]: res_bfgs = sm.Probit(data.endog, data.exog).fit(method="bfgs")
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 15
Function evaluations: 21
Gradient evaluations: 21

In [7]: res_cg = sm.Probit(data.endog, data.exog).fit(method="cg", maxiter=250)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 188
Function evaluations: 428
Gradient evaluations: 428

In [8]: res_ncg = sm.Probit(data.endog, data.exog).fit(method="ncg", avextol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 14
Gradient evaluations: 12
Hessian evaluations: 12

In [9]: res_powell = sm.Probit(data.endog, data.exog).fit(method="powell", ftol=1e-8)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 12
Function evaluations: 568




All of the options for the solvers are available and are documented in the fit method. As you can see, some of the default values need to be changed to ensure (accurate) convergence. The Results objects that are returned have two new attributes.




In [10]: res_powell.mle_retvals
Out[10]:
{'converged': True,
'direc': array([[ 7.06629660e-02, -3.07499922e-03, 5.38418734e-01,
-4.19910465e-01],
[ 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
0.00000000e+00],
[ 1.49194876e+00, -6.64992809e-02, -6.96792443e-03,
-3.22306873e+00],
[ -5.36227277e-02, 1.18544093e-01, -8.75205765e-02,
-2.42149981e+00]]),
'fcalls': 568,
'fopt': 12.818804069990534,
'iterations': 12,
'warnflag': 0}

In [11]: res_powell.mle_settings
Out[11]:
{'callback': None,
'disp': 1,
'fargs': (),
'ftol': 1e-08,
'full_output': 1,
'maxfun': None,
'maxiter': 35,
'optimizer': 'powell',
'retall': 0,
'start_direc': None,
'start_params': [0, 0, 0, 0],
'xtol': 0.0001}

The dict mle_retvals contains all of the values that are returned from the solver if the full_output keyword is True. The dict mle_settings contains all of the arguments passed to the solver, including the defaults so that these can be checked after the fit. Again, all settings and returned values are documented in the fit method and in the results class, respectively.



Lastly, I started a GenericLikelihoodModel class. This is currently unfinished, though the basic idea is laid out. Take again the Probit example above using Lee Spector's educational program data. And assume we didn't have the Probit model from statsmodels. We could use the new GenericLikelihoodModel class. There are two ways (probably more) to proceed. For those comfortable with object oriented programming and inheritance in Python, we could subclass the GenericLikelihoodModel, defining our log-likelihood method.



from scikits.statsmodels import GenericLikelihoodModel as LLM
from scipy import stats
import numpy as np

class MyModel(LLM):
def loglike(self, params):
"""
Probit log-likelihood
"""
q = 2*self.endog - 1
X = self.exog
return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))




Now this model could be fit, using any of the methods that only require an objective function, i.e., Nelder-Mead or Powell.



In [43]: mod = MyModel(data.endog, data.exog)

In [44]: res = mod.fit(method="nm", maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [45]: res_nm.params
Out[45]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])

In [46]: res.params
Out[46]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




The main drawback right now is that all statistics that rely on the covariance of the parameters, etc. will use numeric gradients and Hessians, which can lessen that accuracy of those statistics. This can be overcome by providing score and hessian methods as loglike was provided above. Of course, for more complicated likelihood functions this can soon become cumbersome. We are working towards more accurate numerical differentiation and discussing options for automatic or symbolic differentiation.



The main advantage as opposed to just writing your likelihood and passing it to a solver is that you have all of the (growing number of) statistics and tests available to statsmodels right in the generic model.



I would also like to accommodate those who are less familiar with OOP and inheritance in Python. I haven't quite worked out the final design for how this would go yet. Right now, you could do the following, though I don't think it quite meets the less complicated goal.


In [4]: from scikits.statsmodels.model import GenericLikelihoodModel as LLM

In [5]: import scikits.statsmodels as sm

In [6]: from scipy import stats

In [7]: import numpy as np

In [8]:

In [9]: data = sm.datasets.spector.load()

In [10]: data.exog = sm.add_constant(data.exog)

In [11]:

In [12]: def probitloglike(params, endog, exog):
....: """
....: Log likelihood for the probit
....: """
....: q = 2*endog - 1
....: X = exog
....: return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))
....:

In [13]: mod = LLM(data.endog, data.exog, loglike=probitloglike)

In [14]: res = mod.fit(method="nm", fargs=(data.endog,data.exog), maxiter=500)
Optimization terminated successfully.
Current function value: 12.818804
Iterations: 439
Function evaluations: 735

In [15]: res.params
Out[15]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])




There are still a few design issues and bugs that need to be worked out with the last example, but the basic idea is there. That's all for now.

by jseabold (noreply@blogger.com) at June 14, 2010 12:54 PM

June 10, 2010

NeuralEnsemble

Program & reg. deadline extension - FACETS CodeJam #4

A preliminary program for the 4th annual FACETS CodeJam meeting (http://neuralensemble.org/codejam4) which will take place June 22nd-24th, 2010 in Marseille, France is now available here:

http://neuralensemble.org/meetings/CJ4_Preliminary_Program_v2.pdf

In addition, the registration deadline has been extended to June 13th, 2010. What are you waiting for? Register now!

The goal of the FACETS CodeJam workshops is to catalyze open-source, collaborative software development in computational and systems neuroscience and neuroinformatics, by bringing together researchers, students and engineers to share ideas, present their work, and write code together. The general format of the workshops is to dedicate the mornings to invited and contributed talks, leaving the afternoons free for discussions and code sprints.

For the 4th FACETS CodeJam, the main theme of the meeting will be workflows: what are the best practices for combining different tools (simulators, analysis tools, visualization tools, databases etc.) to ensure the efficient and reproducible flow of data and information from experiment conception to publication and archiving?

The meeting is being organised by:
Andrew Davison (UNIC-CNRS, Gif-sur-Yvette, France)
Abigail Morrison (BCCN-Freiburg, Germany)
Eilif Muller (BBP-EPFL, Lausanne, Switzerland)
Laurent Perrinet (INCM-CNRS, Marseille, France)

Please consult the meeting website at

http://neuralensemble.org/codejam4

for registration and further information.

by eilif (noreply@blogger.com) at June 10, 2010 12:26 AM