January 27, 2012

NeuralEnsemble

BrainScaleS/FACETS CodeJam #5 registration now open

Registration is now open for the 5th BrainScaleS/FACETS CodeJam workshop, which will take place March 14th-16th, 2011 in Edinburgh.

The goal of the CodeJam workshops is to catalyze open-source, collaborative software development in computational and systems neuroscience and neuroinformatics (especially, but not exclusively, using Python), by bringing together researchers, students and engineers to share ideas, present their work, and write code together. The general format of the workshops is to dedicate the mornings to invited and contributed talks, leaving the afternoons free for discussions and code sprints.

For the 5th BrainScaleS/FACETS CodeJam, the main theme of the meeting will be convergence in computational neuroscience software: recent developments to promote interoperability of modelling, simulation and data analysis tools and future efforts to develop common tools and libraries. We are planning sessions on:

  • Neuronal network modelling
  • Code generation for neuron and synapse models
  • Multicompartmental neuron modelling in Python
  • Data analysis tools for computational and systems neuroscience
  • Best practices for running an open-source software project

More details on the program and invited speakers will follow soon.

We invite contributions on any topic related to software in neuroscience, but especially on topics related to the main theme and planned sessions. If you have ideas for organising code sprints, whether a feature that you would like to see added to an existing tool or an idea for new software, please also let us know.

The meeting is being organised by Andrew Davison, Mike Hull, Abigail Morrison, Eilif Muller, Miha Pelko and Laurent Perrinet.

Registration & Further Information

The registration deadline is 19th February 2012, and is limited to 50 participants.

Please consult the meeting website at http://neuralensemble.org/meetings/CodeJam5 for registration and further information.

by Andrew Davison (noreply@blogger.com) at January 27, 2012 02:36 PM

January 26, 2012

Ondřej Čertík

When double precision is not enough

I was doing some finite element (FE) calculation and I needed the sum of the lowest 7 eigenvalues of a symmetric matrix (that comes from the FE assembly) to converge to at least 1e-8 accuracy (so that I can check calculation done by some other solver of mine, that calculates the same but doesn't use FE). In reality I wanted the rounded value to 8 decimal digits to be correct, so I really needed 1e-9 accuracy (but it's ok if it is let's say 2e-9, but not ok if it is 9e-9). With my FE solver, I couldn't get it to converge more than to roughly 5e-7 no matter how hard I tried. Now what?

When doing the convergence, I take a good mesh and keep increasing "p" (the polynomial order) until it converges. For my particular problem, it is fully converged for about p=25 (the solver supports the order up to 64). Increasing "p" further will not increase the accuracy anymore, and the accuracy stays at the level 5e-7 for the sum of the lowest 7 eigenvalues. For optimal meshes, it converges at p=25, for not optimal meshes, it converges for higher "p", but in all cases, it doesn't get below 5e-7.

I know from experience, that for simpler problems, the FE solver can easily converge to 1e-10 or more using double precision. So I know it is doable, now the question is what the problem is: there
are a few possible reasons:

  • The FE quadrature is not accurate enough
  • The condition number of the matrix is high, thus LAPACK doesn't return very accurate eigenvalues
  • Bug in the assembly/solver (like single/double corruption in Fortran, or some other subtle bug)
When using the same solver for simpler potential, it converged nicely to 1e-10. So this suggests there is no bug in the assembly or solver itself. It is possible that the quadrature is not accurate enough, but again, if it converges for simple problem, it's probably not it. So it seems it is the ill conditioned matrix, that causes this. So I printed the residuals (that I simply calculated in Fortran using the matrix and the eigenvectors returned by LAPACK), and it only showed 1e-9. For simpler problems, it can go to 1e-14 easily. So that must be it. How do we fix it?

Obviously by making the matrix less ill conditioned, which is caused by the mesh for the problem (the ratio of the longest/shortest elements is 1e9) but for my problem I really needed such a mesh. So the other option is to increase the real number accuracy.

In Fortran all real variables are defined as real(dp), where dp is an integer defined at a single place in the project. There are several ways to define it, but it's value is 8 for gfortran and it means double precision.So I increased it to 16 (quadruple precision), recompiled. Now the whole program calculates in quadruple precision (more than 30 significant digits). I had to recompile LAPACK using the "-fdefault-real-8" gfortran option, that promotes all double precision numbers to quadruple precision, and I used the "d" versions (double precision, now promoted to quadruple) of LAPACK routines.

I rerun the calculation ---- and suddenly LAPACK residuals are around 1e-13, and the solver converges to 1e-10 easily (for the sum of the lowest 7 eigenvalues). Problem solved.

Turning my Fortran program to quadruple precision is as easy as changing one variable and recompiling. Turning LAPACK to quadruple precision is easy with a single gfortran flag (LAPACK uses the old f77 syntax for double precision, if it used real(dp), then I would simply change it as for my program). The whole calculation got at least 10x slower with quadruple. The reason is that gfortran runtime uses the libquadmath library, that simulates quadruple precision (as current CPUs only support double precision natively).

I actually discovered a few bugs in my program (typically some constants in older code didn't use the "dp" syntax, but had the double precision hardwired). Fortran warns about all such cases, when the real variables have incompatible precision.

It is amazing how easy it is to work with different precision in Fortran (literally just one change and recompile). How could this be done with C++? This wikipedia page suggests, that "long double" is only 80bit in most cases (quadruple is 128bit), but gcc offers __float128, so it seems I would have to manually change all "double" to "__float128" in the whole C++ program (this could be done with a single "sed" command).

by Ondřej Čertík (noreply@blogger.com) at January 26, 2012 10:51 AM

January 24, 2012

Matthieu Brucher

Book review: Canon 7D: From Snapshots to Great Shots

It’s been a while since I’ve started considering buying a real photo camera. And I’ve decided on a Canon 7D. As usual, the user guide covers the camera, but I thought it would be better to have a more complete guide to the 7D.

Content and opinions

You don’t have to read the book to know how to take a picture, but each chapter adds a new angle o the way one can use the 7D. The first chapter is a ten configuration tips list, and I think each one of them is relevant. They have already helped me enhance my pictures. The second chapter is a list of things to check and understand before shooting. I have to say I didn’t apply them all (updating the firmware for instance), and there is also an overview of the basic technical elements in a camera (shutter, iso…).

Then the serious work begins. The third chapter browses through the different capture modes. Each mode is explained with the associated available changes (exposure, ISO…). Then two chapters tackle what I would say are the most common styles: portrait and landscape. I learnt about a lot of things offered by the 7D in those chapters, like exposure metering or the electronic level. There are a lot of tips that are available in those two chapters. The sixth chapter is about moving targets, a subject that is hard, but it seems that the camera has some useful functions for that.

Then, two chapters tackle subjects that are common to all modes: light and composition. There are more advanced books on these topics, but the author manages to cover the most useful ground, from hints to warnings. The last chapter but one is on video, it is interesting, but I didn’t get my 7D for video. Still, there is a warning that I did find very interesting on this subject. I won’t tell it here, go and grab the book instead.

The ending chapter is about some “advanced” techniques, but it’s more about explaining some things you can do with the 7D, but there is 90% chance that you won’t ever use any of them.

Finally, it is worth mentioning that each chapter starts with one or two pictures and some associated comments. This is a great help when you don’t have much experience. After each chapter, a small list sums up all new information.

Conclusion

This book is a perfect complement to the user guide. The latter explains one button or setting after the other, whereas the former links everything together. When you don’t know about the different modes of the camera (aperture or shutter for instance), it is a must have to take great pictures.


by Matt at January 24, 2012 08:08 AM

January 17, 2012

Matthieu Brucher

QtMosaic 0.2: faster mosaics

Just after the 0.1 release, I’ve worked to add some few tricks and fix a few bugs (see QtMosaic on Github). The most important change is a better image search.

The Antipole Tree

An antipole tree is a binary search tree that is constructed on antipole pairs. An antipole pair is composed of the furthest points in a set. When constructing the tree, one starts with the full set of images/thumbnails, then recursively, an antipole pair is chosen from the (sub)set (figure 1) and two new clusters are created, images are aggregated to the nearest image of the pair (figure 2). When the subset is small enough, the recursion stops.

Figure 1: Original set with the first antipole pair


Figure 2: Two subsets based on the antipole pair in the original set

The search uses a distance-sorted list. When a leaf node is visited, one looks for the nearest image and stores the result. When an internal node is visited, its children will be added to the sorted list, and the distance will be the distance to the center of the cluster minus the radius (figure 3). This way, if the current best is nearer to the reference than the nearest cluster, no nearest image will be found. It’s a usual distance search in a binary tree.


Figure 3: distance from a new image to the different subsets

Result

Of course, building the tree is costly. Although it can be parallelized, it is still slow. The upper side is that using a tree for the search is O(log(n)) instead of the usual O(n). It is actually more efficient for mosaic databases with several thousands of images or more. The result is of course exactly the same as with the linear search.

There are some differences between the original publication (see the reference) and the current code. In the reference, the antipole pair has a special status, as the center of the clusters and they are not tested with the rest of the subset in a leaf node. So in the original code, each tested cluster tests at least one image with the reference, whereas only leaf nodes give the closest images search in my case. The difference is that the center of each cluster will be the real center of the cluster, meaning that the minimum distance will be more close to the reality, and I hope that this will translate into speed-up. The other addition is that the visiting code is also simpler, which is better for maintenance.

Conclusion

This tool has now a logarithmic image search. It should be faster than linear search for huge mosaic databases. Also, the final photomosaic was enhanced by changing the result mosaic mean to the original image mosaic.

Reference : Antipole Tree Indexing to Support Range Search and K-Nearest Neighbor Search in Metric Spaces

Buy Me a Coffee!



Other Amount:



Your Email Address :




by Matt at January 17, 2012 08:09 AM

January 14, 2012

Wes McKinney

Contingency tables and cross-tabulations in pandas

Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.

When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

> head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7

> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph))
                ncases ncontrols
agegp tobgp                     
25-34 0-9g/day       0        70
      10-19          1        19
      20-29          0        11
      30+            0        16
35-44 0-9g/day       2       109
      10-19          4        46
      20-29          3        27
      30+            0        17
45-54 0-9g/day      14       104
      10-19         13        57
      20-29          8        33
      30+           11        19
55-64 0-9g/day      25       117
      10-19         23        65
      20-29         12        38
      30+           16        22
65-74 0-9g/day      31        99
      10-19         12        38
      20-29         10        20
      30+            2         4
75+   0-9g/day       6        26
      10-19          5        11
      20-29          0         3
      30+            2         4

> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph))
       ncases ncontrols
agegp                  
25-34       1       116
35-44       9       199
45-54      46       213
55-64      76       242
65-74      55       161
75+        13        44

Here are the same examples using pandas:

In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'], 
                                    rows=['agegp', 'tobgp'], 
                                   aggfunc=np.sum)

In [20]: table
Out[20]: 
                ncases  ncontrols
agegp tobgp                      
25-34 0-9g/day  0.      70.      
      10-19     1.      19.      
      20-29     0.      11.      
      30+       0.      16.      
35-44 0-9g/day  2.      109      
      10-19     4.      46.      
      20-29     3.      27.      
      30+       0.      17.      
45-54 0-9g/day  14      104      
      10-19     13      57.      
      20-29     8.      33.      
      30+       11      19.      
55-64 0-9g/day  25      117      
      10-19     23      65.      
      20-29     12      38.      
      30+       16      22.      
65-74 0-9g/day  31      99.      
      10-19     12      38.      
      20-29     10      20.      
      30+       2.      4.0      
75+   0-9g/day  6.      26.      
      10-19     5.      11.      
      20-29     0.      3.0      
      30+       2.      4.0     

In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'], 
                                    rows='agegp', aggfunc=np.sum)
In [23]: table2
Out[23]: 
       ncases  ncontrols
agegp                   
25-34  1.      116      
35-44  9.      199      
45-54  46      213      
55-64  76      242      
65-74  55      161      
75+    13      44.

One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:

In [26]: table['ncases'].unstack('agegp')
Out[26]: 
agegp     25-34  35-44  45-54  55-64  65-74  75+
tobgp                                           
0-9g/day  0      2      14     25     31     6  
10-19     1      4      13     23     12     5  
20-29     0      3      8.     12     10     0  
30+       0      0      11     16     2.     2

Here is another fun example:

In [31]: import pandas.rpy.common as com
In [32]: wp = com.load_data('warpbreaks')
In [33]: wp['replicate'] = np.tile(range(1, 10), 6)

In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'], 
                        cols='replicate', aggfunc=np.sum)
Out[34]: 
replicate     1   2   3   4   5   6   7   8   9 
wool tension                                    
A    H        36  21  24  18  10  43  28  15  26
     L        26  30  54  25  70  52  51  26  67
     M        18  21  29  17  12  18  35  30  36
B    H        20  21  24  17  13  15  15  16  28
     L        27  14  29  19  29  31  41  20  44
     M        42  26  19  16  39  28  21  39  29

Here is the equivalent R code:

> warpbreaks$replicate <- rep(1:9, len = 54)
> ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks))
             replicate  1  2  3  4  5  6  7  8  9
wool tension                                     
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28

As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

by Wes McKinney at January 14, 2012 09:18 PM

January 11, 2012

Wes McKinney

NYCPython 1/10/2012: A look inside pandas design and development

I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

Wes McKinney: pandas design and development from Adam Klein on Vimeo.

Here’s some more video footable shot by my awesome friend Emily Paup with her HDSLR.

I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

Link to PDF output of Demo

Link to IPython Notebook file

If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).

by Wes McKinney at January 11, 2012 04:46 PM

January 10, 2012

Gaël Varoquaux

Book review: NumPy 1.5 Beginner’s guide

Packt publishing sent me a copy of NumPy 1.5 Beginner’s guide by Ivan Idris.

The book actually covers more than only numpy: it is a full introduction to numerical computing with Python. The table of contents is the following:

  • NumPy Quick Start
  • Beginning with NumPy Fundamentals
  • Get into Terms with Commonly Used Functions
  • Convenience Functions for Your Convenience
  • Working with Matrices and ufuncs
  • Move Further with NumPy Modules
  • Peeking Into Special Routines
  • Assure Quality with Testing
  • Plotting with Matplotlib
  • When NumPy is Not Enough: SciPy and Beyond

The book is easy to read, as it requires no specific expertise other than knowing basic Python programming. It is full of examples and exercises, which is really great for learning. I find the style of the author, Ivan Idris, particularly amusing and relaxing, engaging the reader with questions, challenges, or even jokes (“Have a go hero”).

With regards to the formatting and the print, the book is written in large fonts, with sectioning information, tips and exercises clearly standing out.

It is full of practical information, such as how to install the software, or where to get help. Finally, One thing that I appreciated, is that the examples are typed in IPython. Each time I teach, I like to use IPython, because it is full of features to help plotting, debugging and profiling numerical code. The book even has a little introduction to some useful IPython features.

After an introduction to the work flow, the book explores array manipulation such as creation or reshaping, followed by some simple numerics and the battery of array-based operations on functions and polynomials. Then it presents linear algebra and signal processing basics (FFT). It also covers the financial functions that are present in numpy and mentions testing, which is very important to achieve quality code. The book finishes with matplotlib and scipy, two modules that are important to know to go further.

The examples are mostly drawn from statistics or financial applications, such as computing running averages on stock quotes. Basic math explanations, such as the definition of the Moore-Penrose pseudo-inverse, are given when needed.

To conclude, I enjoyed this book and I think that it is a nice addition to my library. It answers exactly it’s title: it is well-suited for beginners wanting to learn numpy. On the other hand, I would not recommend it as a reference material, or as a book to learn more general scientific or numerical computing with Python.

by gael at January 10, 2012 07:57 AM

January 09, 2012

Fernando Perez

The IPython notebook: a historical retrospective

On December 21 2011, we released IPython 0.12 after an intense 4 1/2 months of development.  Along with a number of new features and bug fixes, the main highlight of this release is our new browser-based interactive notebook: an environment that retains all the features of the familiar console-based IPython but provides a cell-based execution workflow and can contain not only code but any element a modern browser can display.  This means you can create interactive computational documents that contain explanatory text (including LaTeX equations rendered in-browser via MathJax), results of computations, figures, video and more.  These documents are stored in a version-control-friendly JSON format that is easy to export as a pure Python script, reStructuredText, LaTeX or HTML.

For the IPython project this was a major milestone, as we had wanted for years to have such a system, and it has generated a fair amount of interest online. In particular, on our mailing list a user asked us about the relationship between this effort and the well-known and highly capable Sage Notebook.  In responding to the question, I ended up writing up a fairly detailed retrospective of our path to get to the IPython notebook, and it seemed like a good idea to put this up as a blog post to encourage discussion beyond the space of a mailing list, so here it goes (the original email that formed the base of this post, in case anyone is curious about the context).

The question that was originally posed by Oleg Mikulchenklo was: What is the relation and comparison between the IPython notebook and the Sage notebook? Can someone provide motivation and roadmap for the IPython notebook as an alternative to the Sage notebook?  I'll try to answer that now...

Early efforts: 2001-2005

Let me provide some perspective on this, since it's a valid question that is probably in the minds of others as well.  This is a long post, but I'm trying to do justice to over 10 years of development, multiple interactions between the two projects and the contributions of many people.  I apologize in advance to anyone I've forgotten, and please do correct me in the comments, as I want to have a full record that's reasonably trustworthy.

Let's go back to the beginning: when I started IPython in late 2001, I was a graduate student in physics at CU Boulder, and had used extensively first Maple, then Mathematica, both of which have notebook environments.  I also used Pascal (earlier) then C/C++, but those two (plus IDL for numerics) were the interactive environments that I knew well, and my experience with them shaped my views on what a good system for everyday scientific computing should look like.  In particular, I was a heavy user of the Mathematica notebooks and liked them a lot.

I started using Python in 2001 and liked the language, but its interactive prompt felt like a crippled toy compared to the systems mentioned above or to a Unix shell.  When I found out about sys.displayhook, I realized that by putting in a callable object, I would be able to hold state and capture previous results for reuse.  I then wrote a python startup file to provide these features and some other niceties such as loading Numeric and Gnuplot, giving me a 'mini-mathematica' in Python (femto- might be a better description, in fairness).  Thus was my 'ipython-0.0.1' born, a mere 259 lines to be loaded as $PYTYHONSTARTUP.

I also read an article that mentioned two good interactive systems for Python, LazyPython and IPP, not surprisingly also created by scientists.  I say this because the natural flow of scientific computing pretty much mandates a solid interactive environment, so while other Python users and developers may like having occasional access to interactive facilities, scientists more or less demand them.  I contacted their authors,  Nathan Gray and Janko Hauser, seeking to join forces to create IPython;  they were both very gracious and let me use their code, but didn't have the time to participate in the effort.  As any self-respecting graduate student with a dissertation deadline looming would do, I threw myself full-time into building the first 'real' IPython by merging my code with both of theirs (eventually I did graduate, by the way).

The point of this little trip down memory lane is to show how from the very beginning, Mathematica and its notebooks (and the Maple worksheets before) were in my mind as the ideal environment for daily scientific work. In 2005 we had two Google SoC students and we took a stab at building, using Wx, a notebook system.  Robert Kern then put some more work into the problem, but unfortunately that prototype never really became fully usable.

Sage bursts into the scene

In early 2006, William Stein organized the first Sage Days at UCSD and invited me; William and I had been in touch since 2005 as he was using IPython for the Sage terminal interface.  I  suggested Robert Kern come as well, and he demoed the notebook prototype he had at that point. It was very clear that the system wasn't production ready, and William was already starting to think about a notebook-like system for Sage as well. Eventually he started working on a browser-based system, and by Sage Days 2 in October 2006, as shown by the coding sprint topics, the Sage notebook was already usable.

For Sage, going at it separately was completely reasonable and justified: we were moving slowly and by that point we weren't even convinced the Wx approach would go anywhere. William is a force of nature and was trying to get Sage to be very usable very fast, so building something integrated for his needs was certainly the right choice.

We continued slowly working on IPython, and actually had another attempt at a notebook-type system in 2006-2007. By that point Brian Granger and Min Ragan-Kelley had come on board and we had built the Twisted-based parallel tools. Using this, Min got a notebook prototype working using an SQL/SQLAlchemy backend.  We had the opportunity to work on many of these ideas during a workshop on Interactive Parallel Computation that William and I co-organized (along with others).  Like Sage, this prototype used a browser for the client but it tried to retain the 'IPython experience', something the Sage notebook didn't provide.

Keeping the IPython experience in the notebook

This is a key difference of our approach and the Sage notebook, so it' worth clarifying what I mean, the key point being the execution model and its relation to the filesystem.  The Sage notebook took the route of using the filesystem for notebook operations, so you can't meaningfully use 'ls' in it or move around the filesystem yourself with 'cd', because Sage will always execute your code in hidden directories with each cell actually being a separate subdirectory.  This is a perfectly valid approach and has a number of very good consequences for the Sage notebook, but it is also very different from the IPython model where we always keep the user very close to the filesystem and OS.  For us, it's really important that you can access local scripts, use %run, see arbitrary files conveniently, etc., as these are routine needs in data analysis and numerical simulation.

Furthermore, we wanted a notebook that would provide the entire IPython experience, meaning that magics, aliases, syntax extensions and all other special IPython features worked the same in the notebook and terminal.  The Sage notebook reimplemented some of these things in its own way: they reused the % syntax but it has a different meaning, they took some of the IPython introspection code and built their own x?/?? object introspection system, etc. In some cases it's almost like IPython but in others the behavior is fairly different; this is fine for Sage but doesn't work for us.

So we continued with our own efforts, even though by then the Sage notebook was fairly mature.  For a number of reasons (I honestly don't recall all the details), Min's browser-based notebook prototype also never reached production quality.

Breaking through our bottleneck and ZeroMQ

Eventually, in the summer of 2009 we were able to fund Brian to work full-time on IPython, thanks to Matthew Brett and Jarrod Millman, with resources from the NiPy project.  Brian could then dig into the heart of the beast, and attack the fundamental problem that made IPython development so slow and hard: the fact that the main codebase was an outgrowth of that original merge from 2001 of my hack, IPP and LazyPython, by now having become an incomprehensible and terribly interconnected mess with barely any test suite.  Brian was able to devote a summer full-time to dismantling these pieces and reassembling them so that they would continue to work as before (with only minimal regressions), but now in a vastly more approachable and cleanly modularized codebase.

This is where early 2010 found us, and then zerendipity struck: while on a month-long teaching trip to Colombia I read an article about ZeroMQ and talked to Brian about it, as it seemed to provide the right abstractions for us with a simpler model than Twisted.  Brian then blew me away, coming back in two days with a new set of clean Cython-based bindings: we now had pyzmq! It became clear that we had the right tools to build a two-process implementation of IPython that could give us the 'real IPython' but communicating with a different frontend, and this is precisely what we wanted for cleaner parallel computing, multiprocess clients and a notebook.

When I returned from Colombia I had a free weekend and drove down from Berkeley to San Luis Obispo.  Upon arriving at Brian's place I didn't even have zeromq installed nor had I read any docs about it.  I installed it, and Brian simply told me what to type in IPython to import the library and open a socket, while he had another one open on his laptop.  We then started exchanging messages from our IPython sessions.  The fact that we could be up and running this fast was a good sign that the library was exactly what we wanted.  We coded frantically in parallel: one of us wrote the kernel and the other the client, and we'd debug one of them while leaving the other running in the meantime.  It was the perfect blend of pair programming and simultaneous development, and in just two days we had a prototype of a python shell over zmq working, proving that we could indeed build everything we needed.  Incidentally, that code may still be useful to someone wanting to understand our basic ideas or how to build an interactive client over ZeroMQ, so I've posted it for reference as a standalone github repository.

Shortly thereafter, we had discussions with Eric Jones and Travis Oliphant at Enthought, who offered to support Brian and I to work in collaboration with Evan Patterson, and build a Qt console for IPython using this new design. Our little weekend prototype had been just a proof of concept, but their support allowed us to spend the time necessary to apply the same ideas to the real IPython. Brian and I would build a zeromq kernel with all the IPython functionality, while Evan built a Qt console that would drive it using our communications protocol.  This worked extremely well, and by late 2010 we had a more or less complete Qt console working:



Over the summer of 2010, Omar Zapata and Gerardo Gutierrez worked as part of the Google Summer of Code project and started building both terminal- and Qt-based clients for IPython on top of ZeroMQ.  Their task was made much harder because we hadn't yet refactored all of IPython to use zmq, but the work they did provided critical understanding of the problem at this point, and eventually by 0.12 much of it has been finally merged.

The value and correctness of this architecture became clear when Brian, Min and I met with the Enthought folks and Shahrokh Mortazavi and Dino Viehland from Microsoft.  After a single session explaining to Dino and Shahrokh our design and pointing them to our github repository, they were able to build support for IPython into the new Python Tools for Visual Studio, without ever asking us a single question:


In October 2010 James Gao (a Berkeley neuroscience graduate student) wrote up a quick prototype of a web notebook, demonstrating again that this design really worked well and could be easily used by a completely different client:


And finally, in the summer of 2011 Brian took James' prototype and built up a fully working system, this time using websockets, the Tornado web server, JQuery for Javascript, CodeMirror for code editing, and MathJax for LaTeX rendering.  Ironically, we had looked at Tornado in early 2010 along with ZeroMQ as a candidate for our communications, but dismissed it as it wasn't really the tool for that job; it now turned out to be the perfect fit for an asynchronous http server with Websockets support.

We merged Brian's work in late August while working on IRC from a boarding room at the San Francisco airport, just in time for me to present it at the EuroSciPy 2011 conference.  We  then polished it over the next few months to finally release it as part of IPython 0.12:



Other differences with the Sage notebook

We deliberately wrote the IPython notebook to be a lightweight, single-user program that feels like any other local application.  The Sage notebook draws many parallels with the google docs model, by default requiring a login and showing all of your notebooks together, kept in a location separate from the rest of your files.  In contrast, we want the notebook to just start like any other program and for the ipynb files to be part of your normal workflow, ready to be version-controlled just like any other, stored in your normal folders and easy to manage on their own. Update: as noted by Jason Grout, the Sage notebook was designed from the start to scale to big centralized multi-user servers (sagenb.org, with about 76,000 accounts, is a good example).  The notebook that runs in the local user's computer is the same as the one in these large public servers.

There are other deliberate differences of interface and workflow:

  • We keep our In/Out prompts explicit because we have an entire system of caching variables that uses those numbers, and because those numbers give the user a visual clue of the execution order of cells, which may differ from the document's order.
  • We deliberately chose a structured JSON format for our documents. It's clear enough for human reading while allowing easy and powerful machine manipulation without having to write our own parsing.  So writing utilities like a reStructuredText or LaTeX converter is very easy, as we recently showed.
  • Our move to zmq allowed us (thanks to Thomas Kluyver's tireless work) to ship the notebook working both on Python2 and Python3 out of the box.  The current version of the  Sage notebook only works on Python2, in part due to its use of Twisted.  Update: William pointed out to me that the upcoming 5.0 version of the notebook will have a vastly reduced dependency on Twisted, so this will soon be less of an issue for Sage.
  • Because our notebook works in the normal filesystem, and lets you create .py files right next to the .ipynb just by passing --script at startup, you can reuse your notebooks like normal scripts, import one notebook from another or a normal python script, etc.  I'm not sure how to import a Sage notebook from a normal python file, or if it's even possible.
  • We have a long list of plans for the document format: multi-sheet capabilities, LaTeX-style preamble, per-cell metadata, structural cells to allow outline-level navigation and manipulation such as in LyX, improved literate programming and validation/reproducibility support, ... For that, we need to control the document format ourselves so we can evolve it according to our needs and ideas.

As you see, there are indeed a number of key differences between our notebook and the sage one, but there are very good technical reasons for this.  The notebook integrates with our architecture and leverages it; you can for example use the interactive debugger via a console or qtconsole against a notebook kernel, something not possible with the sage notebook.

In addition, Sage is GPL licensed while IPython is BSD licensed.  This means we can not directly reuse their code, though when we have asked them to relicense specific pieces of code to us, they have always agreed to do so. But large-scale reuse of Sage code in IPython is not really viable.

The value of being the slowest in the race

As this long story shows, it has taken us a very long time to get here. But what we have now makes a lot of sense for us, even considering the existence of the Sage notebook and how good it is for many use cases. Our notebook is just one particular aspect of a large and rich architecture built around the concept of a Python interpreter abstracted over a JSON-based, explicitly defined communications protocol.  Even considering purely http clients, the notebook is still just one of many possible: you can easily build an interface that only evaluates a single cell with a tiny bit of javascript like the Sage single cell server, for example.

Furthermore, since Min also reimplemented our parallel machinery completely with pyzmq, now we have one truly common codebase for all of IPython. We still need to finish up a bit of integration between the interactive kernels and the parallel ones, but we plan to finish that soon.

In many ways, our slow pace of development paid off:
  • We had multiple false starts that helped us much to better understand the hard parts of the problem and where the dead ends would lie.
  • We were still thinking about this all the time: even when we couldn't spare the time to actively work on it, we had no end of discussions on these things over the years (esp. Brian, Min and I, but also with others at meetings and conferences).
  • The Sage notebook was a great trailblazer showing both what could be done, and also how there were certain decisions that we wanted to make differently.
  • The technology of some critical third-party tools caught up in an amazing way: ZeroMQ, Tornado, WebSockets, MathJax, and the fast and capable Javascript engines in modern browsers along with good JS libraries. Without these tools we couldn't possibly have implemented what we have now.
As much as we would have loved to have a solid notebook years ago in IPython, I'm actually happy at how things turned out.  We have now a very nice mix of our own implementation for the things that are really within our scope, and leveraging third party tools for critical parts that we wouldn't want to implement ourselves.

What next?

We have a lot of ideas for the notebook, as we want it to be the best possible environment for modern computational work (scientific work is our focus, but not its only use), including research, education and publication, with consistent support for clean and reproducible practices throughout.  We are fairly confident that the core design and architecture are extremely solid, and we already have a long list of ideas and improvements we want to make.  We are limited only by manpower and time, so please join us on github and pitch in!

Since this post was motivated by questions about Sage, I'd like to emphasize that we have had multiple, productive collaborations with William and other Sage developers in the past, and I expect that to continue to be the case.  On certain points that collaboration has already led to convergence; e.g. the new Sage single cell server uses the IPython messaging protocol, after we worked closely with Jason Grout during Sage Days 29 in March 2011 thanks to William's invitation.  Furthermore, William's invitations to several Sage Days events, as well as the workshops we have organized together over the years, offered multiple opportunities for collaboration and discussion that proved critical on the way to today's results.

In the future we may find other areas where we can reuse tools or approaches common to Sage and IPython.  It is clear to us that the Sage notebook is a fantastic system, it just wasn't the right fit for IPython. I hope this very long post illustrates why, as well as providing some insights into our vision for scientific computing.

Last, but not least

From this post it should be obvious that what today's IPython is the result of the work of many talented people over the years, and I would like to thank all the developers and users who contribute to the project.  But it's especially important to recognize the stunning quality and quantity of work that Brian Granger and Min Ragan-Kelley have done for this to be possible.  Brian and I did our PhDs together at CU and we have been close friends since then. Min was an undergraduate student of Brian's while he was a professor at U. Santa Clara and the first IPython parallel implementation using Twisted was his senior thesis project; he is now a PhD student at Berkeley (where I work) so we continue to be able to easily collaborate.  Building a project like IPython with partners of such talent, dedication, tenacity and generous spirit is a wonderful experience. Thanks, guys!

Please notify me in the comments of any inaccuracies in the above, especially if I failed to credit someone.

by Fernando (noreply@blogger.com) at January 09, 2012 06:46 PM

January 08, 2012

Travis Oliphant

Transition to Continuum


Our lives are punctuated by transformational events:  the birth of a child, finishing school, the passing of a loved one, meeting someone special.   Even without the regular beating of celestial rhythms to provide opportunities for renewal we would have these moments to measure our lives by.   Once in a while, the rhythmic and asynchronous coincide providing a particularly poignant opportunity for change.   Jan 1, 2012 was just such a time for me as I left my position as President of Enthought to start a new venture with Peter Wang (author of Chaco) and others.    Our new company is Continuum Analytics, Inc. (or just Continuum).  Our nascent website initially targeted only to the Python initiate is http://www.continuum.io.  

While I am ecstatic about the new venture, I will definitely miss the team that we've built around the world that has delivered Enthought's second consecutive record year.   This team of exceptional individuals has been very successful at improving and expanding the Python story in a few targeted companies inside of the Fortune 50 as well as making it easy to install Python for the masses.    Those who have taken the time to first install and then learn Traits, TraitsUI, Chaco, MayaVi, and the rest of the Enthought Tool Suite have had their efforts rewarded with increased productivity in the creation of rich client UIs and improved pluggable, scriptable, and component-based architecture.     It has been a highly educational experience to participate with Enthought.  There is much you learn about business, people, and the world when a software consulting company grows from 1 office with fewer than 17 people to 4 offices around the world and nearly 50 people.   I will always be grateful to the Enthought founders, employees (past and present), and customers (past and present) for the relationships, the trust, and the thoughtful times we shared in learning, growing, and serving each other.

My heart, however, has always been and continues to be with NumPy and SciPy which need more support than Enthought can currently provide --- so I must move on.   It took a lot of trust from my wife when (with 3 small children at home) she patiently waited for me while I spent all of 1999 writing Multipack (which in 2001 formed the bulk of SciPy).  It also took trust when in 2005 (with now 5 children at home) she watched me sacrifice my tenure-track position by writing NumPy instead of more papers.   In 2012 (with now 6 children at home), I'm asking her to trust me one more time while I leave a comfortable salary with a good company to put more effort full time in helping take NumPy and SciPy to the next level.

Over the past 4 1/2 years consulting with large companies I have learned a great deal about what NumPy (and SciPy) can and should be.  These and related tools in the Python ecosystem need to become significant pieces to real solutions to the data analytics challenges that face us.   R, Hadoop, and other (proprietary) solutions are already staking their claim on the space that Python should be dominating.   Python has significant traction in science and analysis but too little publicity in the nascent nomenclature of data analytics.    In order to accelerate the processing capabilities of Python and related tools, much progress needs to be made.   My New Year's resolution this year is to begin to contribute more substantially to that progress by organizing a new company that will hopefully allow many people to spend significant time directly on NumPy and SciPy during working hours.  I also hope to assist any public, non-profit efforts towards that mutual goal as well.   I also hope to be able to spend more time myself on NumPy and SciPy.

To realize my hopes long term, the company must succeed.   For the company to succeed it must find customers --- people willing to buy something that it sells.   People are appropriately particular about what they buy.  Making products that delight will require a lot of work from Continuum, but I am excited to help organize and work alongside the best team we can put together to do it.   This may also mean different business models and licensing around some of the NumPy-related code that the company writes.   I recognize this may cause some raised eye-brows.   I deeply value making code freely available.    I'm a Jeffersonian at heart and believe that ideas (including code) should be shared freely.   Six years ago I experimented by selling my "Guide to NumPy" long enough to make sufficient money to justify the effort.   The book ended up in the public domain and contributed substantially to the current NumPy documentation.   This is an illustration of how resources can be allocated to full-time attention and then later made available for all to enjoy.   Of course there are other models that also work to accomplish similar ends and we will be actively exploring a few of them.

Despite my ideals, my wife thanks me that I'm a pragmatist with children to provide for.   In addition, I have watched wearily as it's been difficult to find volunteer labor (including my own) to turn NumPy into the data-management and data-analytics substrate that it should already be.   All of this happens while huge sums of money are wasted at companies large and small inefficiently transforming raw, but inaccessible data into something closer to information that can be used for decisions by the domain experts.    The information available is not what it can be.  The amount of effort it takes to transform the data to actionable information is not where it can be.  The wide-spread understanding about how to program parallel and distributed machines is not where it can be.  We can and must do better in figuring out how to get full-time attention on NumPy and related tools while still making them widely available.

At Continuum, we have a vision for significantly changing how people manipulate, transform, and uncover their data.   We also have customer-driven plans to achieve it, and we are going to put our full energy into it.   So far, the development team consists of Peter Wang, me, Mark Wiebe, Francesc Alted (PyTables), and Bryan Van de Ven.   We will also be getting part-time but important development help from Hugo Shi and Andy Terrel.  In addition, we are building an initial support/business staff to help us build and grow the business.     We plan to continue to collaborate with others in the community both commercial (e.g. Wes McKinney in his new startup: Lambda Foundry) and open (e.g. Fernando Perez, Brian Granger, Min Ragan-Kelley of IPython fame). If you are interested in either joining us or collaborating with us, please send us an email at info@continuum.io.  Also, please follow us on Twitter @ContinuumIO or Like us on Facebook.  

We are actively looking for customer partners, as well.  If you are interested in learning more about where we are heading and how that might help you, please drop us a line, or come see us at PyCon this year.   We will also be at Strata, and afterwords we will be hosting a Python Data Workshop ("PyData") at the Googleplex.   Please sign up for the PyData workshop wait-list at http://pydataworkshop.eventbrite.com/ (we could only find room for 50 people at the Googleplex).  However, given that the event is free of charge, I'm expecting some people who have reserved their spot may not actually be able to attend.  So, signing up on the wait-list is still worthwhile.

This year will be an exciting one for us.  When I get a spare moment, I still hope to finish a few of the blogs that I've started and possibly include some more that describe more of what I've learned over the past several years as a scientist/engineer-turned-software developer, lessons about running a software company, more of where we are headed at Continuum, reflections on open source, and other more technical ramblings.

by Travis Oliphant (noreply@blogger.com) at January 08, 2012 12:32 PM

January 07, 2012

Gaël Varoquaux

Joblib beta release: fast compressed persistence + Python 3

Joblib 0.6: better I/O and Python 3 support

Happy new year, every one. I have just released Joblib 0.6.0 beta. The highlights of the 0.6 release are a reworked enhanced pickler, and Python 3 support.

Many thanks go to the contributors to the 0.5.X series (Fabian Pedregosa, Yaroslav Halchenko, Kenneth C. Arnold, Alexandre Gramfort, Lars Buitinck, Bala Subrahmanyam Varanasi, Olivier Grisel, Ralf Gommers, Juan Manuel Caicedo Carvajal, and myself). In particular Fabian made sure that Joblib worked under Python 3.

In this blog post, I’d like to discuss a bit more the compressed persistence engine, as it illustrates well key factors in implementing and using compressed serialization.

Fast compressed persistence

One of the key components of joblib is it’s ability to persist arbitrary Python objects, and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save in separate files the numpy arrays, and load them via memmapping.

However, one drawback of joblib, is that the caching mechanism may end up using a lot of disk space. As a result, there is strong interest in having compressed storage, provided it doesn’t slow down the library too much. Another use case that I have in mind for fast compressed persistence, is implementing out of core computation.

There are some great compressed I/O libraries for Python, for instance Pytables. You may wonder why the need to code yet another one. The answer is that joblib is pure Python, depending only on the standard library (numpy is optional), but also that the goal here is black-box persistence of arbitrary objects.

Comparing I/O speed and compression to other libraries

Implementing efficient compressed storage was a bit of a struggle and I learned a lot. Rather than going into the details straight away, let me first discuss a few benchmarks of the resulting code. Benching such feature is very hard, first because you are fighting with the disk cache, second because they performances depends very much on the data at hand (some data compress better than others), last because they are three interesting metrics: disk space used, write speed, and read speed.

Dataset used - I chose to compare the different strategies on some datasets that I work with, namely the probabilistic brain atlases MNI 1mm (62Mb uncompressed) and Juelich 2mm (105Mb uncompressed). Whether the data is represented as a Fortran-ordered array, or a C-ordered array is important for the I/O performance. This data is normally stored to disk compressed using the domain-specific Nifti format (.nii files), accessed in Python with the Nibabel library.

Libraries used - I benched different compression strategies in joblib against Nibabel’s Nifti I/O, compressed or not, and against using Pytables to store the data buffer (without the meta-informations). Pytables exposed a variety of compression strategies, with different speed compromises. In addition, I benched numpy’s builtin save_compressed.

I would like to stress that I am comparing a general purpose persistence engine (joblib) to specific I/O libraries either optimized for the data (Nifti), or requiring some massaging to enable persistence (pytables).






Comparing to other libraries

Actual numbers can be found here.

Take home messages - The graphs are not crystal-clear, but a few tendencies appear:

  • Pytables with LZO or blosc compression is the king of the hill for read and write speed.
  • I/O of compressed data is often faster than with uncompressed data for a good compression algorithm.
  • Joblib with Zlib compression level 1 performs honorably in terms of speed with only the Python standard library and no compiled code.
  • Read time of memmapping (with nibabel or joblib) is negligeable (it is tiny on the graphs), however the loading time appears when you start accessing the data.
  • Passing in arrays with a memory layout (Fortran versus C order) that the I/O library doesn’t expect can really slow down writing.
  • Compressing with Zlib compression-level 1 gets you most of the disk space gains for a reasonable cost in write/read speed.
  • Compressing with Zlib compression-level 9 (not shown on the figures) doesn’t buy you much in disk space, but costs a lot in writing time.

Benching datasets richer than pure arrays

The datasets used so far are pretty much composed of one big array, a 4D smooth spatial map. I wanted to test on more datasets, to see how the performances varied with data type and richness. For this, I used the datasets of the scikit-learn, real life data of various nature, described here:

  • 20 news - 20 usenet news group: this data mainly consists of text, and not numpy arrays.
  • LFW people - Labeled faces in the wild, many pictures of different people’s face.
  • LFW pairs - Labeled faces in the wild, pairs of pictures for each individual. This is a high entropy dataset, it does not have much redundant information.
  • Olivetti - Olivetti dataset: centered pictures of faces.
  • Juelich(F) - Our previous Juelich atlas
  • Big people - The LFW people dataset, but repeated 4 times, to put a strain on memory resources.
  • MNI(F) - Our previous MNI atlas
  • Species - Occurence of species measured in latin America, with a lot of missing data.

Testing compression strategies on various datasets

Actual numbers can be found here.

What this tells us - The main message from these benchmarks is that datasets with redundant information, i.e. that compress well, give fast I/O. This is not surprising. In particular, good compression can give good I/O on text (20 news). Another result, more of a sanity check, is that compressed I/O on big data (Big people, ) works as well as on smaller data. Earlier code would start to swap. Finally, I conclude from these graphs, that compression levels from 1 to 3 buy you most of the gains for reasonable costs, and that going up to 9 is not recommended, unless you know that your data can be compressed a lot (species).

Lessons learned

I’ll keep this paragraph short, because the information is really in joblib’s code and comments. Don’t hesitate to have a look, it’s BSD-licenced, so you are free to borrow what you please.

  1. Memory copies, of arrays, but also of strings and byte streams can really slow you down with big data.
  2. To avoid copies with numpy arrays, fully embrace numpy’s strided memory model. For instance, you do not need to save arrays in C order, if they are given to you in a different order. Accessing the memory in the wrong striding direction explains the poor write performance of pytables on Fortran-ordered Juelich.
  3. When dealing with the file system, the OS makes so much magic (e.g. prefetching) that clever hacks tend not to work: always benchmark.
  4. Depending on the size of the data, it may be more efficient to store subsets in different files: it introduces ‘chunk’ that avoid filling in the memory too much (parameter cache_size in joblib’s code). In addition, data of a same nature tends to compress better.
  5. The I/O stream or file object interfaces are abstractions that can hide the data movement and the creation of large temporaries. After experiments with GZipFile and StringIO/BytesIO I found it more efficient to fall back to passing around big buffer object, numpy arrays, or strings.
  6. For reasons 4 and 5, I ended up avoiding the gzip module: raw access to the zlib with buffers gives more control. This explains a good part of the differences in read speed for pure arrays with numpy’s save_compressed.

One of my conclusions for joblib, is that I’ll probably use Pytables as an optional backend for persistence in a future release.

Details on the benchmarks

These benchmarks where run on a Dell Lattitude D630 laptop. That’s a dual-core Intel Core2 Duo box, with 2M of CPU cache.

The code for the benchmarks below can be found on a gist.

Thanks

I’d like to that Francesc Alted for very useful feedback he gave on this topics. In particular, the following thread on the pytables mailing-list may be of interest to the reader.

by gael at January 07, 2012 06:27 PM

January 05, 2012

Vlad Niculae

Moving out

Happy new year, friends!

I’ve made a New Year’s resolution to build a better web presence and make better use of the domain that I previously only used for mail.

This has prompted me to move my blog over to http://blog.vene.ro which hopefully is shorter, better, faster and stronger!

by vene at January 05, 2012 10:16 PM

January 03, 2012

Wes McKinney

Some pandas Database Join (merge) Benchmarks vs. R base::merge

Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame which I’ve used many times in R. So I just ran a little benchmark for joining a 100000-row DataFrame with a 10000-row DataFrame on two keys with about 10000 unique key combinations overall. Simulating a somewhat large SQL join.

Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

There is a major factor affecting performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R also offers the same option so it’s completely apples to apples. These are mean runtimes in seconds:


Sort by key columns
pandas R Ratio
inner 0.07819 0.2286 2.924
outer 0.09013 1.284 14.25
left 0.0853 0.7766 9.104
right 0.08112 0.3371 4.156


Don’t sort by key columns
pandas R Ratio
inner 0.02808 0.2297 8.18
outer 0.03879 1.181 30.45
left 0.0365 0.6706 18.37
right 0.03021 0.2995 9.912

As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast–it was not easy!

Here is the R code.

N <- 10000
indices = rep(NA, N)
for (i in 1:N)
  indices[i] <- paste(sample(letters, 10), collapse="")

left <- data.frame(key=rep(indices, 10),
                   key2=sample(rep(indices, 10)),
                   value=rnorm(100000))
right <- data.frame(key=indices,
                    key2=sample(indices),
                    value2=rnorm(10000))
timeit <- function(func, niter=10) {
  timing = rep(NA, niter)
  for (i in 1:niter) {
    gc()
    timing[i] <- system.time(func())[3]
  }
  mean(timing)
}

left.join <- function(sort=TRUE) {
  result <- merge(left, right, all.x=TRUE, sort=sort)
}
right.join <- function(sort=TRUE) {
  result <- merge(left, right, all.y=TRUE, sort=sort)
}
outer.join <- function(sort=TRUE) {
  result <- merge(left, right, all=TRUE, sort=sort)
}
inner.join <- function(sort=TRUE) {
  reuslt <- merge(left, right, sort=sort)
}

sort.options <- c(FALSE, TRUE)

results <- matrix(nrow=4, ncol=2)
colnames(results) <- c("dont_sort", "sort")
rownames(results) <- c("inner", "outer", "left", "right")

join.functions <- c(inner.join, outer.join, left.join, right.join)
for (i in 1:4) {
  results[1, 1] <- timeit(function() {inner.join(sort=sort.options[1])})
  results[1, 2] <- timeit(function() {inner.join(sort=sort.options[2])})
  results[2, 1] <- timeit(function() {outer.join(sort=sort.options[1])})
  results[2, 2] <- timeit(function() {outer.join(sort=sort.options[2])})
  results[3, 1] <- timeit(function() {left.join(sort=sort.options[1])})
  results[3, 2] <- timeit(function() {left.join(sort=sort.options[2])})
  results[4, 1] <- timeit(function() {right.join(sort=sort.options[1])})
  results[4, 2] <- timeit(function() {right.join(sort=sort.options[2])})
}

And the Python code

from collections import defaultdict
import gc
import time
from pandas.util.testing import rands
N = 10000
indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

key = np.tile(indices, 10)
key2 = key.copy()
random.shuffle(key2)
indices2 = indices.copy()
random.shuffle(indices2)
left = DataFrame({'key' : key, 'key2':key2,
                  'value' : np.random.randn(100000)})
right = DataFrame({'key': indices, 'key2':indices2,
                   'value2' : np.random.randn(10000)})
join_methods = ['inner', 'outer', 'left', 'right']
results = DataFrame(index=join_methods, columns=[False, True])
niter = 10
for sort in [False, True]:
    for join_method in join_methods:
        f = lambda: merge(left, right, how=join_method, sort=sort)
        gc.disable()
        start = time.time()
        for _ in xrange(niter):
            f()
        elapsed = (time.time() - start) / niter
        gc.enable()
        results[sort][join_method] = elapsed
results.columns = ['dont_sort', 'sort']

# R results
from StringIO import StringIO
r_results = read_table(StringIO("""dont_sort   sort
inner    0.2297 0.2286
outer    1.1811 1.2843
left     0.6706 0.7766
right    0.2995 0.3371
"""
), sep='\s+')

sort_results = DataFrame.from_items([('pandas', results['sort']),
                                     ('R', r_results['sort'])])
sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                       ('R', r_results['dont_sort'])])
nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']

by Wes McKinney at January 03, 2012 05:25 AM

December 28, 2011

Enthought

EPD 7.2 released

I am pleased to announce the release of Enthought Python Distribution, EPD version 7.2, along with its “EPD Free” counterpart. The highlights of this release are: the addition of GDAL and updates to over 30 packages, including SciPy, matplotlib and IPython. The new IPython 0.12 includes the HTML notebook, which is why the Tornado web-server [...]

by Ilan Schnell at December 28, 2011 05:51 PM

December 26, 2011

Gökhan Sever

NumPy 1.5 Beginner's Guide review

This new NumPy book provides an easy start to NumPy for those who want to churn numbers in Python programming language, whether you don't have much programming background or you are switching from your favorite language. The author of the book uses quite a friendly tone throughout the book. Most of the important aspects of NumPy is well covered with well explained examples. The examples provided are step by step explained, starting from the basic array/matrix creation to more complex tasks like signal analysis and linear algebra related calculations.

To get best out of this book, it is recommended that you would try out the examples and challenge yourself with the exercises given in have a go sections of the book. If you are feeling lazy for some reason, you can get the source code from the book's page. I finished perusing the book in about a week (having a few years of NumPy experience has definitely role in this). For me reading and understanding through stock market related examples were a bit boring, but if you are in to financial business you might enjoy putting NumPy in your toolstack with the help of this book. Notably, besides the basic NumPy beginners chapters, the book extends into basic Matplotlib and SciPy lands. A lot of examples use Matplotlib to create plots and better illustrate the operations at hand (e.g. curve fitting, statistical distributions) It is a bit surprising to see MaskedArray module didn't get any mention in the book. However, with the basics you gained, it should be fairly easy to start experimenting with masked array functions of NumPy.

Overall, if you are looking for a book to get started in NumPy in about 200 pages, you might give this one a chance. It is available in both print and electronic formats. Further on, you can try their advanced matplotlib and and Sage books, if you are willing to enhance your scientific Python skills.

Final Note: I received a review copy from the publisher. Thanks to Packt publishing for their contributions to open-source literature and providing me a free copy of the book.

by Gökhan Sever (noreply@blogger.com) at December 26, 2011 09:48 AM

December 18, 2011

Wes McKinney

Introducing vbench, new code performance analysis and monitoring tool

Do you know how fast your code is? Is it faster than it was last week? Or a month ago? How do you know if you accidentally made a function slower by changes elsewhere? Unintentional performance regressions are extremely common in my experience: it’s hard to unit test the performance of your code. Over time I have gotten tired of playing the game of “performance whack-a-mole”. Thus, I started hacking together a little weekend project that I’m calling vbench. If someone thinks up a cleverer name, I’m all ears.

Link to pandas benchmarks page produced using vbench

What is vbench?

vbench is a super-lightweight Python library for running a collection of performance benchmarks over the course of your source repository’s history. Since I’m a GitHub user, it only does git for now, but it could be generalized to support other VCSs. Basically, you define a benchmark:

common_setup = """
from pandas import *
import pandas.util.testing as tm
import random
import numpy as np
"""


setup = common_setup + """

N = 100000
ngroups = 100

def get_test_data(ngroups=100, n=N):
    unique_groups = range(ngroups)
    arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object)

    if len(arr) < n:
        arr = np.asarray(list(arr) + unique_groups[:n - len(arr)],
                         dtype=object)

    random.shuffle(arr)
    return arr

df = DataFrame({'key1' : get_test_data(ngroups=ngroups),
                'key2' : get_test_data(ngroups=ngroups),
                'data' : np.random.randn(N)})
def f():
    df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum())
"""

stmt2 = "df.groupby(['key1', 'key2']).sum()"
bm_groupby2 = Benchmark(stmt2, setup, name="GroupBy test 2",
                        start_date=datetime(2011, 7, 1))

Then you write down the information about your repository and how to build any relevant DLLs, etc., that vary from revision to revision:

REPO_PATH = '/home/wesm/code/pandas'
REPO_URL = 'git@github.com:wesm/pandas.git'
DB_PATH = '/home/wesm/code/pandas/gb_suite/benchmarks.db'
TMP_DIR = '/home/wesm/tmp/gb_pandas'
PREPARE = """
python setup.py clean
"""
BUILD = """
python setup.py build_ext --inplace
"""
START_DATE = datetime(2011, 3, 1)

Then you pass this info, plus a list of your benchmark objects, to the BenchmarkRunner class:

runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
                         BUILD, DB_PATH, TMP_DIR, PREPARE,
                         run_option='eod', start_date=START_DATE)
runner.run()

Now, the BenchmarkRunner makes a clone of your repo, then runs all of the benchmarks once for each revision in the repository (or some other rule, e.g. I’ve set run_option='eod' to only take the last snapshot on each day). It persists the results in a SQLite database so that you can rerun the process and it will skip benchmarks it’s already run (this is key when you add new benchmarks, only the new ones will be updated). Benchmarks are uniquely identified by the MD5 hash of their source code.

This is the resulting plot over time for the above GroupBy benchmark related to some Cython code that I worked on late last week (where I made a major performance improvement in this case):

Here is a fully-formed vbench suite in the pandas git repository.

Kind of like codespeed and speed.pypy.org?

Before starting to write a new project I looked briefly at codespeed and the excellent work that the PyPy guys have done with speed.pypy.org. But then I started thinking, you know, Django web application, JSON requests to upload benchmark results? Seemed like far too much work to do something relatively simple. The dealbreaker is that codespeed is just a web application. It doesn’t actually (to my knowledge, someone correct me if I’m wrong?) have any kind of a framework for orchestrating the running of benchmarks throughout your code history. That is what this new project is for. I actually see a natural connection between vbench and codespeed, all you need to do is write a script to upload your vbench results to a codespeed web app!

At some point I’d like to build a simple web front end or wx/Qt viewer for the generated vbench database. I’ve never done any JavaScript, but it would be a good opportunity to learn. Knowing me, I might break down and hack out a stupid little wxPython app with an embedded matplotlib widget anyway.

Anyway, I’m really excited about this project. It’s very prototype-y at the moment but I tried to design it in a nice and extensible way. I also plan to put all my git repo analysis tools in there (like code churn graphs etc.) so it should become a nice little collection of tools.

Other ideas for extending

  • Dealing with API changes
  • Multiple library versions (e.g. NumPy) or Python versions
  • by Wes McKinney at December 18, 2011 11:24 PM

    December 15, 2011

    William Stein

    How big is the core Sage library?

    I just did the following with Sage-4.8.alpha5:
    1. "sudo apt-get install sloccount".
    2. "cp -rv SAGE_ROOT/devel/sage-main /tmp/x"
    3. Use a script [1] to rename all .pyx and .pxi files to .py.
    4. Ran "sloccount *" in the /tmp/x directory, which ignores autogenerated .c/.cpp files coming from Cython.

    Here's the result for the full Sage library, which does not distinguish between Python and Cython. Note that sloccount really only counts lines of code -- comments are blank lines are ignored.

    Totals grouped by language (dominant language first):
    python: 530370 (96.41%)
    ansic: 14538 (2.64%)
    cpp: 5188 (0.94%)

    This suggests that the core Sage library is just over a "half million lines of Python and Cython source code, not counting comments and whitespace".

    Here's the breakdown by module:
    SLOC    Directory       SLOC-by-Language (Sorted)
    88903 rings python=87720,cpp=1183
    72913 combinat python=71629,cpp=1284
    47747 schemes python=46255,cpp=1492
    39815 graphs python=28377,ansic=11438
    31540 matrix python=31540
    31019 modular python=31012,ansic=7
    24475 libs python=21171,ansic=2845,cpp=459
    20517 misc python=20383,ansic=134
    18006 interfaces python=18006
    17577 geometry python=16936,cpp=641
    12775 categories python=12775
    12093 server python=12093
    11971 groups python=11971
    11961 plot python=11961
    10686 crypto python=10686
    9920 modules python=9920
    8389 symbolic python=8260,cpp=129
    8150 algebras python=8150
    7260 ext python=7198,ansic=62
    7093 structure python=7093
    6364 coding python=6364
    5670 functions python=5670
    5249 homology python=5249
    4798 numerical python=4798
    4323 quadratic_forms python=4323
    3919 gsl python=3919
    3911 calculus python=3911
    3879 sandpiles python=3879
    3003 sets python=3003
    2647 databases python=2647
    2074 logic python=2074
    1736 finance python=1736
    1608 games python=1608
    1465 monoids python=1465
    1435 tests python=1383,ansic=52
    1370 stats python=1370
    971 interacts python=971
    959 tensor python=959
    906 lfunctions python=906
    308 parallel python=308
    275 probability python=275
    219 media python=219
    197 top_dir python=197

    Here is the script [1]:
    #!/usr/bin/env python

    import os, shutil

    for dirpath, dirnames, filenames in os.walk('.'):
    for f in filenames:
    if f.endswith('.pyx') or f.endswith('.pxi'):
    print f
    shutil.move(os.path.join(dirpath, f),
    os.path.join(dirpath, os.path.splitext(f)[0] + '.py'))

    by William Stein (noreply@blogger.com) at December 15, 2011 02:42 PM

    December 14, 2011

    Matthieu Brucher

    Interactive Raytracer: Reboot?

    During SuperComputing 2011, I stumbled across a paper on parallel random number generation. And it brought back some memories of a blog post I wanted to do here on oversampling in my Interactive Raytracer.

    Random numbers are important in a raytracer because one needs oversampling to have anti-aliasing, and to have a more realistic rendering, we need to add some jitter inside the oversampling. I won’t talk about this here, it is not my point (yet). So I’ve decided to resurrect IRT from the dead, at least long enough to test parallel random number generators. It is now on Github, and I’ve also changed the matrix library to Eigen. Some optimizations later, it is slightly faster than before, although it is still very far from Bikker’s work.

    The Python wrappers are still working BTW.

    Let’s hope I find some time to finish this before getting back at VSTs :)

    Buy Me a Coffee!



    Other Amount:



    Your Email Address :




    by Matt at December 14, 2011 08:53 AM

    December 13, 2011

    William Stein

    Using Sage to Support Research Mathematics

    When using Sage to support research mathematics, the most important point to make is to strongly encourage people to do the extra work to turn their "scruffy research code" into a patch that can be peer reviewed and included in Sage. They will have to 100% doctest it, and the quality of their code may improve dramatically as a result. Including code in Sage means that the code will continue to work as Sage is updated. Also, the code is peer reviewed and has to have examples and documentation for every function. That's a much higher bar than just "reproducible research".

    Moreover, getting code up to snuff to include in Sage will often also reveal mistakes that will avoid embarrassment later. I'm fixing some issues related to a soon-to-be-done paper right now that I found when doing just this for trac 11975.

    This final step of turning snippets of research code into a peer-reviewed contribution to Sage is: (1) a surprisingly huge amount of very important useful work, (2) something that is emphasized as an option for Sage more than with Magma or Mathematica or Pari (say), (3) something whose value people have to be sold on, since they get no real extra academic credit for it, at present, usually, and journal referees often don't care either way (I do, but I'm probably in the minority there), and (4) something that a *lot* of research mathematicians do not do. As an example of (4), in the last two months I've seen a ton of (separate!) bodies of code which is all sort of secret research code in various Dropbox repos, and which isn't currently squarely aimed at going into Sage. I've also seen a bunch of code related to Edixhoven et al.'s algorithm for computing Galois representation with a similar property (there is now trac 12132, due to my
    urging).

    I did *not* do this step yet with this recently accepted paper. Instead I used "scrappy research code" in psage to do the fast L-series computations. The referee for Math Comp didn't care either way, actually... I hope this doesn't come back to haunt me, though there are many double checks here (e.g., BSD) so I'm not too worried. I will do this get-it-in-Sage step at some point though.

    This will be better for the community in the long run, and better for individual researcher's credibility too. And there is a lot of value in having a stable refereed snapshot of code on which a published (=very stable) paper is based.

    by William Stein (noreply@blogger.com) at December 13, 2011 02:07 PM

    December 12, 2011

    Official SymPy blog

    Google Summer of Code 2011 Wrap-Up

    (Cross-posted on the Google Open Source Blog)

    SymPy is a computer algebra system (CAS) written in pure Python. The core allows basic manipulation of expressions (like differentiation or expansion) and it contains many modules for common tasks (limits, integrals, differential equations, series, matrices, quantum physics, geometry, plotting, and code generation).

    SymPy has participated in the Google Summer of Code program in previous years under the umbrellas of Python Software Foundation, Portland State University, and the Space Telescope Science Institute, where we were very successful. In fact, several of our core developers, including four of the mentors from this year, started working with SymPy as Google Summer of Code students. This was our first year participating as a standalone organization, and we would like to share our experience.

    As part of the application process we required each student to submit a patch (as a GitHub pull request) that had to be reviewed and accepted. This allowed us to see that each applicant knew how to use git as well as communicate effectively during the review process.This also encouraged only serious applicants to apply. We had over 10 mentors available and we ended up with 9 students, all of whom were successful at final evaluations.

    Tom Bachmann - Definite Integration using Meijer G-functions, mentored by Aaron Meurer
    Tom implemented an algorithm for computing symbolic definite integrals that uses so-called Meijer G-functions. This is the state-of-the-art algorithm for computing definite integrals, and indeed the results of his project are very impressive. This project has pushed SymPy forward a long way to becoming the strongest open source computer algebra system with respect to symbolic definite integration.

    Vladimir Peric - Porting to Python 3, mentored by Ronan Lamy
    Vladimir ported SymPy to work on Python 3 and ported all testing infrastructure so that SymPy gets regularly tested in Python 2.x, 3.2 and PyPy. Thanks to Vladimir’s work, the next version of SymPy, 0.7.2, which will hopefully be released later this year, will work in both Python 2 and Python 3, and it may support PyPy as well.

    Gilbert Gede - PyDy, mentored by Luke Peterson
    Gilbert implemented a physics module to assist in generating symbolic equations of motion for complex multibody systems using Kane's Method. He expanded on the code written by his mentor, Luke, in 2009, and the module can now generate equations of motion for a bicycle. Gilbert also wrote very thorough documentation both for the Kane’s Method and the module in SymPy.

    Tomo Lazovich - Position and Momentum Bases for Quantum Mechanics, mentored by Brian Granger
    Tomo has greatly improved the quantum mechanics module by implementing position/momentum representations for operators and eigenstates in various coordinate systems (including cartesian, cylindrical, and spherical) that allows you to easily represent many of the "textbook" quantum mechanics systems, including particle in a box, simple harmonic oscillator, hydrogen atom, etc.

    Saptarshi Mandal - Combinatorics package for Sympy, mentored by Christian Muise
    Saptarshi’s project was to mimic the various capabilities of Combinatorica, a Mathematica package for combinatorics. Most of the functionality involving elementary combinatorial objects such as Permutations, Partitions, Subsets, Gray codes and Prufer codes are complete.

    Sherjil Ozair - Symbolic Linear Algebra, mentored by Vinzent Steinberg
    Sherjil improved the speed of the linear algebra module by using efficient coefficient types for values of entries of matrices. Previously, SymPy used generic expressions in this place, which slowed down computations considerably and caused trouble with solving of the zero equivalence problem. He also implemented sparse matrix representation and unified the API with dense matrices. In addition, Sherjil also added a few linear algebra related algorithms (e.g. Cholesky decomposition).

    Matthew Rocklin - SymPy Stats: Random Variables, mentored by Andy Terrel
    Matthew improved the statistics module to use symbolics and introduced a Random Variable type, with support for finite, continuous, and multivariable normal random variables. With these you can symbolically compute things like probabilities of a given condition, conditional spaces, and expectation values. As a side consequence of this project, he also improved some of our Sets classes and implemented a MatrixExpr class, which allows you to compute with matrices symbolically, including computing with block matrices.

    Sean Vig - Symbolic Clebsch-Gordon coefficients/Wigner symbols and Implementing Addition of Spin Angular Momenta, mentored by Ondřej Čertík
    Sean was working on the quantum mechanics module and has implemented symbolic Clebsch-Gordan coefficients, Wigner D function, and related mathematical concepts that are used very often in quantum physics when dealing with angular momentum and then the necessary classes to support coupled spin algebra.

    Jeremias Yehdegho - Implementing F5, mentored by Mateusz Paprocki
    Jeremias worked on implementing algorithms related to Groebner bases. Groebner bases are a useful tool in many areas of computer algebra. He implemented the F5B algorithm, which is an improved version of the classical Buchberger’s algorithm that was previously implemented in SymPy, and an algorithm for converting Groebner bases between different orders of monomials and worked on applications of Groebner bases. This allowed for handling problems of much larger size in SymPy.

    The full report can be found here, where each student wrote a wiki page about their experience during the summer and you can also find their blogs and links to applications. Each student was required to blog about their progress each week and all blogs were synchronized at planet.sympy.org.

    In previous years, there was usually one student from each summer who became a regular contributor and also a mentor for the next year. It has been a rewarding experience for the whole SymPy community.

    by Aaron Meurer (noreply@blogger.com) at December 12, 2011 04:10 PM

    December 11, 2011

    Titus Brown

    Data Intensive Science, and Workflows

    I'm writing this on my way back from Stockholm, where I attended a workshop on the 4th Paradigm. This is the idea (so named by Jim Gray, I gather?) that data-intensive science is a distinct paradigm from the first three paradigms of scientific investigation -- theory, experiment, and simulation. I was invited to attend as a token biologist -- someone in biology who works with large scale data, and thinks about large scale data, but isn't necessarily only devoted to dealing with large scale data.

    The workshop was pretty interesting. It was run by Microsoft, who invited me & paid my way out there. The idea was to identify areas of opportunity in which Microsoft could make investments to help out scientists and develop new perspectives on the future of eScience. To do that, we played a game that I'll call the "anchor game", where we divvied up into groups to discuss the blocks to our work that stemmed from algorithms and tools, data, process and workflows, social/organizational aspects. In each group we put together sticky notes with our "complaints" and then ranked them by how big of an anchor they were on us -- "deep" sticky notes held us back more than shallow sticky notes. We then reorganized by discipline, and put together an end-to-end workflow that we hoped in 5 years would be possible, and then finally we looked for short- and medium-term projects that would help get us there.

    The big surprise for me in all of this was that it turns out I'm most interested in workflows and process! All of my sticky notes had the same theme: it wasn't tools, or data management, or social aspects that were causing me problems, but rather the development of appropriate workflows for scientific investigation. Very weird, and not what I would have predicted from the outset.

    Two questions came up for me during the workshop:

    Why don't people use workflows in bioinformatics?

    The first question that comes to mind is, why doesn't anyone I know use a formal workflow engine in bioinformatics? I know they exist (Taverna, for one, has a bunch of bioinformatics workflows); I'm reasonably sure they would be useful; but there seems to be some block against using them!

    What would the ideal workflow situation be?

    During the anchor game, our group (which consisted of two biologists, myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few computer scientists, including Alex Wade) came up with an idea for a specific tool. The tool would be a bridge between Datascope for biologists and a workflow engine. The essential idea is to combine data exploration with audit trail recording, which could then be hardened into a workflow template and re-used.

    ---

    Thinking about the process I usually use when working on a new problem, it tends to consist of all these activites mixed together:

    1. evaluation of data quality, and application of "computational" controls
    2. exploration of various data manipulation steps, looking for statistical signal
    3. solidifying of various subcomponents of the data manipulation steps into scripted actions
    4. deployment of the entire thing on multiple data sets

    Now, I'm never quite done -- new data types and questions always come up, and there are always components to tweak. But portions of the workflow become pretty solid by the time I'm halfway done with the initial project, and the evaluation of data quality accretes more steps but rarely loses one. So it could be quite useful to be able to take a step back and say, "yeah, these steps? wrap 'em up and put 'em into a workflow, I'm done." Except that I also want to be able to edit and change them in the future. And I'd like to be able to post the results along with the precise steps taken to generate them. And (as long as I'm greedy) I would like to work at the command line, while I know that others would like to be able to work in a notebook or graphical format. And I'd like to be able to "scale out" the computation by bringing the compute to the data.

    For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

    I'm guessing no tool out there does this, although I know other people are thinking this way, so maybe I'm wrong. The Microsoft folk didn't know of any, though, and they seemed pretty well informed in this area :).

    The devil's choice I personally make in all of this is to go for workflow agility, and ignore versioning and tracking and the component model, by scripting the hell out of things. But this is getting old, and as I get older and have to teach my wicked ways to grad students and postdocs, the lack of versioning and tracking and easy scaling out gets more and more obvious. And now that I'm trying to actually teach computational science to biologists, it's simply appallingly difficult to convey this stuff in a sensible way. So I'm looking for something better.

    One particularly intriguing type of software I've been looking at recently is the "interactive Web notebook" -- ipython notebook and sagenb. These are essentially Mathematica or matlab-style notebook packages that work with IPython or Sage, scientific computing and mathematical computing systems in Python. They let you run Python code interactively, and colocate it with its output (including graphics); the notebooks can then be saved and loaded and re-run. I'm thinking about working one or the other into my class, since it would let me move away from command-line dependence a bit. (Command lines are great, but throwing them at biologists, along with Python programming, data analysis, and program execution all together, seems a bit cruel. And not that productive.)

    It would also be great to have cloud-enabled workflow components. As I embark on more and more sequence analysis, there are only about a dozen "things" we actually do, but mixed and matched. These things could be hardened, parameterized into components, and placed behind an RPC interface that would give us a standard way to execute them. Combined with a suitable data abstraction layer, I could run the components from location A on data in location B in a semi-transparent way, and also record and track their use in a variety of ways. Given a suitably rich set of components and use cases, I could imagine that these components and their interactions could be executed from multiple workflow engines, and with usefully interesting GUIs. I know Domain Specific Languages are already passe, but a DSL might be a good way to find the right division between subcomponents.

    I'd be interested in hearing about such things that may already exist. I'm aware of Galaxy, but I think the components in Galaxy are not quite written at the right level of abstraction for me; Galaxy is also more focused on the GUI than I want. I don't know anything about Taverna, so I'm going to look into that a bit more. And, inevitably, we'll be writing some of our own software in this area, too.

    Overall, I'm really interested in workflow approaches that let me transition seemlessly between discovery science and "firing for effect" for hypothesis-driven science.

    A few more specific thoughts:

    In the area of metagenomics (one of my research focuses at the moment), it would be great to see img/m, camera, and MG-RAST move towards a "broken out" workflow that lets semi-sophisticated computational users (hi mom!) run their stuff on the Amazon Cloud and on private HPCs or clouds. While I appreciate hosted services, there are many drawbacks to them, and I'd love to get my hands on the guts of those services. (I'm sure the MG-RAST folk would like me to note that they are moving towards making their pipeline more usable outside of Argonne: so noted.)

    In the comments to my post on Four reasons I won't use your data analysis pipeline, Andrew Davison reminds me of VisTrails, which some people at the MS workshop recommended.

    I met David De Roure from myExperiment at the MS workshop. To quote, "myExperiment makes it easy to find, use, and share scientific workflows and other Research Objects, and to build communities."

    David put me in touch with Carole Goble who is involved with Taverna. Something to look at.

    In the cloud computing workshop I organized at the Planet and Animal Genome conference this January, I will get a chance to buttonhole one of the Galaxy Cloud developers. I hope to make the most of this opportunity ;).

    It'd be interesting to do some social science research on what difficulties users encounter when they attempt to use workflow engines. A lot of money goes into developing them, apparently, but at least in bioinformatics they are not widely used. Why? This is sort of in line with Greg Wilson's Software Carpentry and the wonderfully named blog It will never work in theory: rather than guessing randomly at what technical directions need to be pursued, why not study it empirically? It is increasingly obvious to me that improving computational science productivity has more to do with lowering learning barriers and changing other societal or cultural issues than with a simple lack of technology, and figuring out how (and if) appropriate technology could be integrated with the right incentives and teaching strategy is pretty important.

    --titus

    p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the workshop, and especially for paying my way. 'twas really interesting!

    December 11, 2011 03:00 PM

    December 07, 2011

    NeuralEnsemble

    5th INCF Congress of Neuroinformatics: Call for Workshop Proposals


    5th INCF Congress of Neuroinformatics: Munich, Sept 10 - 12, 2012

    Call for Workshop Proposals!



    Neuroinformatics 2012 will feature three workshops organized by the program committee and an additional three parallel workshops which will be selected from submitted proposals. This is a tremendous opportunity to host your own workshop with the topic and speakers of your choice!

    MORE INFORMATION

    WORKSHOP PROPOSAL SUBMISSION

    Neuroinformatics 2012 regular workshops:
    1. Function-structure relationship in microcircuitry (Chair: Keiji Tanaka)
    2. Systems Biology of the Neuron (Chair: Mary Kennedy)
    3. If there is a data deluge, where are the data? (Chair: Tim Clark)

    by Andrew Davison (noreply@blogger.com) at December 07, 2011 01:50 PM

    December 01, 2011

    Prabhu Ramachandran

    SciPy.in 2011 coming soon to IIT Bombay


    If you aren't aware of it yet, SciPy.in 2011 is just round the corner.
    This year it is going to be held at IIT Bombay's new Victor Menezes
    Convention Center (VMCC).  We have conference talks on 4th and 5th and
    tutorials on 6th and 7th.  We have invited several key people from
    the "scientific computing with Python" community.  We have an awesome set
    of talks and tutorials.  Eric Jones from Enthought is our keynote
    speaker.

    In 2009 and 2010 we did workshops on basic scientific computing with
    Python.  This year our tutorial track features several advanced topics
    including, image processing, machine learning, symbolic computing,
    mapping and geoprocessing, Traits/TraitsUI, Mayavi, Sage and Cython.

    If you wish to attend the conference,  feel free to drop in and attend,
    free of cost -- but you'll miss out on the awesome food, t-shirt and
    other goodies if you do not register.  It would be best if you did
    register though.

    The main organizers of SciPy.in 2011 are FOSSEE.  Enthought (India) is
    also helping organize the conference.  MHRD and Enthought are our
    financial sponsors and we are grateful to them for their continued
    support.

    Given the nature of the talks, the tutorials, the location, you are
    going to miss out if you choose to not attend.  Register before 2nd to
    avail of reduced rates.  Spot registration is certainly available but is more
    expensive.

    by Prabhu Ramachandran (noreply@blogger.com) at December 01, 2011 03:28 PM

    November 27, 2011

    Prabhu Ramachandran

    Enthought ... now in India

    Enthought, as most of you (following my blog anyway) know,
    is a leading scientific computing solutions provider and an extra-ordinary
    champion of scientific computing with Python. Enthought is headquartered
    at Austin, Texas and has offices in New York, Cambridge (UK) and Brussels.
    Enthought has started a new development office in Mumbai, India. Starting July 2010,
    I am taking two years of long leave (lien) from IITB to serve as the
    Managing Director of Enthought India. We have a nice office located at the Supreme
    business park in Hiranandani, Powai.



    We right now have three people (including yours truly) and are looking
    for more people. We aim to build a strong development team. We are
    currently working on an exciting new product that we will release over
    the course of next year. Stay tuned!

    If you are a strong Python developer with a penchant for challenging
    problems and building scientific computing applications, do send us your
    resume.

    I can be reached on my usual email addresses and my Enthought email is
    prabhu at enthought dot com. Possibly the best place to send in resumes is:
    india-jobs@enthought.com

    by Prabhu Ramachandran (noreply@blogger.com) at November 27, 2011 08:08 PM

    ... and they lived happily ever after


    It has been two years since I blogged about anything significant. A lot
    has happened and I thought I might as well give everyone a news update.

    In November 2009, Kadambari Devarajan (KD) and I were married. I'm not
    getting into the details here but in short, she stole my heart, I
    proposed, we were married in November and we lived happily ever after.
    :)

    The marriage ceremony was smooth and memorable thanks in very little
    part to the two of us. My parents organized a wonderful, short and
    thoughtful marriage ceremony and KD's parents a wonderful reception. We
    cannot thank our parents enough! My cousins thoughtfully organized and
    hosted a mehendi party which was very memorable. The support of our
    siblings at this time was touching. Close friends and relatives
    travelled from far and wide to attend the marriage and made it very
    special for us. Many friends and relatives offered us their advice,
    support and help. Many of you attended the wedding at short notice and
    showered us with gifts and good wishes. Our heartfelt thanks to one and
    all!


    Ever since, KD and I have been extremely happy together. Life has
    changed in many ways. We've travelled the world (the picture above was
    at Yosemite, courtesy Jarrod Millman), done many things together and
    are having a wonderful time. Thanks again to all of you who got
    us together. :)


    by Prabhu Ramachandran (noreply@blogger.com) at November 27, 2011 07:46 PM

    FOSSEE updates

    The FOSSEE project started in mid 2009 and has been doing fairly well.
    We've conducted several Python and Scilab workshops all over India. We
    organized two SciPy.in conferences (2009 and 2010) and hosted Sage days
    25. This year we are organizing SciPy.in 2011 at IIT Bombay.

    We've generated plenty of introductory material on elementary Python for
    scientific computing, version control and using GNU/Linux tools. Using
    this material we are conducting a "Software development techniques for
    engineers and scientists" course at IITB. There are numerous other
    initiatives including a textbook conversion project that we are working
    on.

    Recently, and rather significantly, we've have embarked on an ambitious
    effort to train about 650 teachers all over India over five weekends
    this month on "Software development techniques for engineers and
    scientists". This is part of Prof. Phatak's IIT e-Outreach program
    funded generously by the MHRD, India. More details are available here.

    The basic idea is that we conduct workshops at IIT Bombay and the
    lectures are streamed online. Students are able to interact with the
    faculty via the internet. We are very excited about this since this
    allows us to reach out to a large number of people in one shot. It is
    also a tad scary since the numbers are so large. So far it has worked fairly well.

    I did write a simple Django based application to take online exams and automatically
    grade the submitted solutions and will probably talk about that in more detail
    at a future point.

    by Prabhu Ramachandran (noreply@blogger.com) at November 27, 2011 07:46 PM

    November 18, 2011

    Vlad Niculae

    The nasty bug crawling in my Orthogonal Matching Pursuit code

    A while back, Bob L. Sturm blogged about a similar implementation of OMP to the one in scikit-learn. Instead of using the Cholesky decomposition like we did, his Matlab code uses the QR decomposition, to a similar (or maybe even identical) outcome, in theory. So lucky that Alejandro pointed out to him the existence of the scikit-learn implementation, and that Bob’s code exposed a bug that all the test coverage didn’t catch! This plot should increase, certainly not decrease! Something is clearly wrong here.
    OMP buggy phase transition, decreasing instead of increasing
    Luckily we were able to find it and fix it very quickly. I have updated the old entries I wrote on the OMP optimizations, so they no longer include the bug. But I take this opportunity to explain what exactly went wrong.

    A key part of the optimization was that slicing out arbitrary columns out of an array is slow when they are passed to BLAS functions like matrix multiplication. In order to make the most out of your code, the data should have a contiguous layout. We achieved this by swapping active dictionary atoms (columns) to the beginning of the array.

    Something that can happen, but won’t happen very often, is that after an atom is selected as active, the atom that takes its place after swapping needs to be selected. This is rare because dictionaries have many columns, out of which only very very few will be active. But when it happens, because the code didn’t keep track of swapped indices, the corresponding coefficient of the solution would get updated twice, leading to more zero entries than we should have. A keen eye could have noticed that the first `n_nonzero_coefs` entries in OMP solution vectors were never non-zero. But alas, my eye was not a keen one at all.

    In other words, the following test (that was written after the bug was found, unfortunately) was failing:

    def test_swapped_regressors():
        gamma = np.zeros(n_features)
        # X[:, 21] should be selected first, then X[:, 0] selected second,
        # which will take X[:, 21]'s place in case the algorithm does
        # column swapping for optimization (which is the case at the moment)
        gamma[21] = 1.0
        gamma[0] = 0.5
        new_y = np.dot(X, gamma)
        new_Xy = np.dot(X.T, new_y)
        gamma_hat = orthogonal_mp(X, new_y, 2)
        gamma_hat_gram = orthogonal_mp_gram(G, new_Xy, 2)
        # active indices should be [0, 21], but prior to the bugfix
        # the algorithm would update only [21] but twice
        assert_equal(np.flatnonzero(gamma_hat), [0, 21])
        assert_equal(np.flatnonzero(gamma_hat_gram), [0, 21])
    

    Note that this bug has been fixed for a while, but I didn’t get the free time to write this post until now. Good news is: we fixed it, and did so very quickly after the report. So you can still trust me, I guess!

    by vene at November 18, 2011 06:51 PM

    Gaël Varoquaux

    Scikit-learn NIPS 2011 sprint: international thanks to our sponsors

    The NIPS conference: time for a sprint. The NIPS conference, one of the major conferences in machine learning, is hosted in Granada this year. I believe that it is the first time that it is hosted in Europe. As many of the scikit-learn developers are part of the wider NIPS community, but also many live in Europe, we jumped on the occasion to organize a truly international sprint: the NIPS 2011 scikit-learn sprint.

    Finding money. As often with open source development, a lot of our contributors are young people, investing their free time outside of any request from their hierarchy. In such a situation, it can be hard to find travel money. So we started looking for sponsors. We needed to find a decent sum of money, as we were flying people in from places such as the West coast of the US, or even Japan. The good news is that we found money, and between supervisors pitching in, universities giving travel grants, and our generous sponsors, there will be an impressive list of contributors from all over the world at the sprint.

    Thanks to our sponsors. The first people that we need to thank are Google, who gave us a sizable sponsorship, and the PSF, who made Google’s sponsorship possible through their accounting and sprints programs. We also need to thanks our other sponsors, namely Tinyclues. Thanks to these sponsors, and additional investment from many universities and research group, we have been able to gather a total of 12 contributors in Granada, a handful coming from overseas. Also, we are indebted to the University of Granada, and the Gnu/Linux Granada Group (GGG), who are providing hosting for the sprint, as well as Régine Bricquet, from INRIA, who did a lot of the trip planing for the sponsored people.

    I am very much looking forward to the sprint. It will be the first time that meet in real life many of the contributors, and judging by the warmness of the on-line exchanges, it will be a great moment. Besides, Granada is known to be a lively and historical city.

    If you are around and want to join us, to work on Python in machine learning, send us a mail on the mailing list.

    by gael at November 18, 2011 01:47 PM

    November 15, 2011

    William Stein

    Is the time ripe for http://sagenb.com?

    On a Sage mailing list, Karl Crisman wrote: "Regarding the downtime issue [of http://sagenb.org], there have occasionally been rumors of someone or some organization starting a service which would guarantee uptime and provide support."

    I might do this. It might be at http://sagenb.com (right now sagenb.com points to sagenb.org). Here's the "business plan". Probably, sagenb.com would appear almost the same as sagenb.org, except there would be Google (?) ads on the side of your worksheets, and the revenue would go toward paying for:
    1. Commercial server hosting:  Some Amazon EC2 instances
    2. An employee (or later, employees) to maintain the servers: at the beginning, this would be me in my free time, since I have a lot of experience with this.
    3. Advertising that sagenb.com exists and we want users!:  Unlike sagenb.org, we would very strongly encourage as many people as possible to use sagenb.com.   The advertising and landing page would explain that though sagenb.com generates money, all that money is all given back to the Sage development community (see below).
    4. Hire employees to improve the notebook: Fix bugs, implement features, etc. There would be a public commitment that all such work would be open sourced and get included with Sage. This would include adding support for a for-pay "pro" subscription version, adding nicer "offline support" (via a Sage install on the user's computer), integrated spreadsheets and better data visualization and manipulation tools for Science and business applications, and enabling development of Sage's core library and patches submission entirely through the notebook, etc.


    At some point, there could be a $X/year subscription version that would remove all ads, and increase disk space available to that user. There would also be a $Y/year university site license version with customized authentication (e.g., LDAP?) for each university.   The university site license version might also include Maple/Mathematica/Matlab/Magma pre-installed in their notebook server, assuming appropriate site licenses are honored, so sagenb.com could be something that goes beyond just a platform for "sage as math software".   We can of course also tap into the R market, given that Sage includes R.

    I imagine the above being done as a not-for-profit effort, so if it brought in a lot of revenue (e.g., more than needed for hosting and employees), excess money would go to the Sage Foundation to support other Sage development activities. Regarding numbers, according to Google Analytics, right now on average well over 1000 people use sagenb.org every day.    Standard commercial hosting costs for EC2 to support this load would be roughly $100/month. If each visitor generates on average of 1 penny of ad revenue per day (is this a reasonable estimate -- I have no clue?), then we would expect to make $3,650 in one year, which would be enough to fund the EC2 service (at about $2000/year), with a profit of $1,650.  

    Now let's dream big!  There might be 1,000,000 potential daily users out there for a service like this, if it worked sufficiently well, since there are many college students (and people that use math and stats programs like R in the sciences, and R is part of Sage) in the world.   Scaling up by a factor of 1,000 would yield over $1 million/year, after paying for hosting fees.   This would be enough to fund substantial work to improve Sage, the notebook, and have a paid Patch Editor position (imagine buying out a top Sage developer professor, e.g., John Palmierri, from 50% of his teaching obligations in exchange for him spending the time he would spend teaching instead organizing that the patches to Sage get properly refereed).  Maybe we could even hire back some of the (many!) people who were Sage developers when they were mathematics grad student or postdocs, but who then went to work at a top web technology company and acquired useful experience there (and are now way too busy to work on Sage).

    This has the potential to make Sage a more longterm self-sustaining project. It's probably not possible to get traditional venture capital for a not-for-profit plan like the one above, but fortunately that is not needed due to (1) the generous support the National Science Foundation is currently providing toward development on the Sage notebook, and (2) private donations to the Sage Foundation.  In particular, (2) provides enough money to bootstrap a sagenb.com for a while.

    I think the main potential downside is competition.  If somebody else does the same thing right now for profit without giving back their changes, and captures the market it's hard to imagine how the above would work.  Since we don't use the Affero GPL for the Sage notebook, it is legal for somebody to do a lot of customization work to Sage and notebook, create a web-app using this customized version, and give back nothing to the community, so long as they don't redistribute their modified versions publicly. This isn't crazy -- not so long ago, I had a major company (I won't say who out of respect) tell me they planed to do something like that.    And "random people" suggest it somewhat regularly when I give talks about Sage.   I'm surprised it hasn't happened yet, but it definitely hasn't.

    So the time to do this is today.  The notebook software we have is now finally reasonably scalable, primarily due to work of Mike Hansen and Rado Kirov.  Our funding situation is good this year.  We have strong good will and interest right now from both the Mathematical Association of America and WebWork developers.   If we wait longer, the one chance to truly make Sage financially self-sustaining in a way that fits with my dreams and values will pass.  

    I hope that by making all plans open, and by having a commitment to put the profits back into Sage development, an enterprise like I describe here will be in a better position to attract users than a purely for profit venture by somebody else.

    by William Stein (noreply@blogger.com) at November 15, 2011 08:58 AM

    November 06, 2011

    Fabian Pedregosa

    Low rank approximation

    A little experiment to see what low rank approximation looks like. These are the best rank-k approximations (in the Frobenius norm) to the a natural image for increasing values of k and an original image of rank 512.

    Python code can be found here. GIF animation made using ImageMagic’s convert script.

    by fabian at November 06, 2011 10:05 AM

    November 04, 2011

    Spyder

    Spyder v2.1 is available

    After a year of hard work on stability, performance and new features, Spyder v2.1 is now available for Windows XP/Vista/7, GNU/Linux and MacOS X.

    This release introduces major enhancements and new features:
    • Large performance and stability improvements
    • PySide support (PyQt is no longer exclusively required)
    • PyQt API v1 and v2 support
    • New profiler plugin (thanks to Santiago Jaramillo, a new contributor)
    • Experimental support for IPython v0.11+
    • And many other changes

    We [the Spyder development team] are all very excited by this new release: Spyder is now becoming the [GUI-based] scientific development environment. Its future seems very promising. First there is the PySide support and the PyQt API v1/v2 support which were strategically essential (this will also help us supporting Python 3 in time). Second there is a cleaner and optimized code: this was critical for Spyder's behavior to be the closest possible to a standard Python interpreter. Third there is a very stable application: a lot of crashes and segmentation faults were fixed during the last year. Last, but not least, the future looks bright thanks to the new project members: we are now 5 committers and 6 contributors.

    Stand-alone executable on Windows platforms
    Note that, on Windows platforms, Spyder is also available as a stand-alone executable (don't forget to disable UAC on Vista/7). This all-in-one portable version is still experimental (for example, it does not embed sphinx -- meaning no rich text mode for the object inspector) but it should provide a working version of Spyder for Windows platforms without having to install anything else (except Python 2.x itself, of course).

    Help needed
    More than ever, we welcome any contribution that helps making Spyder an efficient scientific development/computing environment. Join us to help creating your favourite environment!

    by Pierre (noreply@blogger.com) at November 04, 2011 05:10 AM

    October 31, 2011

    October 30, 2011

    Jarrod Millman

    SciPy India 2011 abstracts due November 2nd

    The third SciPy India Conference will be held from December 4th through the 7th at the Indian Institute of Technology, Bombay (IITB) in Mumbai, Maharashtra India.

    At this conference, novel applications and breakthroughs made in the pursuit of science using Python are presented. Attended by leading figures from both academia and industry, it is an excellent opportunity to experience the cutting edge of scientific software development.

    The conference is followed by two days of tutorials and a code sprint, during which community experts provide training on several scientific Python packages.

    We invite you to take part by submitting a talk abstract on the conference website.

    Talk/Paper Submission

    We solicit talks and accompanying papers (either formal academic or magazine-style articles) that discuss topics regarding scientific computing using Python, including applications, teaching, development and research. We welcome contributions from academia as well as industry.

    Important Dates

    • November 2, 2011, Wednesday: Abstracts Due
    • November 7, 2011, Monday: Schedule announced
    • November 28, 2011, Monday: Proceedings paper submission due
    • December 4-5, 2011, Sunday-Monday: Conference
    • December 6-7 2011, Tuesday-Wednesday: Tutorials/Sprints

    Organizers

    by Jarrod Millman (noreply@blogger.com) at October 30, 2011 09:34 PM

    October 25, 2011

    Matthieu Brucher

    Book review: Version Control by Example

    A colleague of mine give me this book, as I use “third generation” VCSs. I decided to check on the author approach on Version Control and his opinion on the matter. The book explores the different approaches of the latest VCS tools, with their advantages and drawbacks. Also, it delves into some algorithmic designs of Distributed VCSs. I’ve already discussed some of these tools, but the book is not a flame war against one DVCS, but more an explanation of all of them.

    Content and opinions

    The focus on the book is how to use a Version Control system, so several chapters are dedicated to how to use the most well-known one – and the one the author develops.

    The book is split in three parts: “old” VCSs, new VCSs and finally how they work. The first chapter is a glossary of the special VCS words and their meaning. Once this is done, the first example chapter starts with Subversion. A small situation is depicted with two coders that are locally closed.

    The third chapter starts the second part and it introduces new concepts for distributed VCSs. A complete chapter is dedicated to the advantages of DVCSs, and it doesn’t fall in the common pitfalls (those are in fact a reason to some humor from the author). The next chapter does exactly the opposite, i.e. explains the drawbacks of a DVCS. Although I disagree on GUI part (there are wonderful GUIs for both Mercurial and GIT on Windows, and on Linux they have a great integration in Eclipse for instance), I do agree with the author on the main topics here. The next four chapters are the practical application seen in the second chapter applied to Mercurial, Git and Veracity (two chapter for this one, one to explain the differences with Mercurial and Git, and one for the scenario), the tool the author develops.

    The last part can be called “how to use them”. Even for the second generation tools like SVN, I was flabbergasted to see that a lot of coders didn’t even know how to use it, and even worse, that experienced managers didn’t know how to implement an appropriate workflow. So I think that the first chapter in this part should be something that everyone HAS to read if one wants to be called a coder. Seriously. Then, as I’ve said, the author explains how the distributed tools work and some of their internals. It’s not for the average coder, but it’s really interesting – for the culture. The last chapter is a set of golden rules, and like the first chapter, it’s something mandatory to read and use.

    Conclusion

    One of the most easy-to-read introductory books and a book on a sensible subject, this book should be read by everyone searching for a robust tool to support one’s development. It is also a very good argumentation on why you should use one.


    by Matt at October 25, 2011 07:23 AM

    October 23, 2011

    Wes McKinney

    Fast and easy pivot tables in pandas 0.5.0

    Over the last several months, I’ve invested a great deal in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still use more work to be made even higher performance, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or even in other data analysis packages, e.g. those in R, which I’ll talk a bit about in this article.

    An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it to form a 2D table with the stratified aggregated values. I added a new function “pivot_table“ which provides a high level interface for creating pivot tables. Let’s consider the tips data set used in some examples of the Hadley Wickham’s excellent R reshape2 package:

    In [1]: import pandas.rpy.common as com

    In [2]: tips = com.load_data('tips', package='reshape2')

    In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

    In [4]: tips.head()
    Out[4]:
       total_bill  tip   sex     smoker  day  time    size  tip_pct
    1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
    2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
    3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
    4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
    5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

    Since most readers won’t have rpy2, you can get the data file here and read it with:

    tips = read_csv('tips.csv')

    All of this new functionality is available now on GitHub but will be a part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python sesson– all of the below code is implemented in pure Python (with a little Cython magic to make things fast).

    Pivot table basics

    To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate and the columns to group on the rows and columns of the table. Note the default aggregation is mean, though this can be specified:

    In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
                                cols='smoker')

    In [10]: table
    Out[10]:
      smoker       No      Yes  
    time   sex                  
    Dinner Female  0.1568  0.1851
           Male    0.1594  0.1489
    Lunch  Female  0.1571  0.1753
           Male    0.1657  0.1667

    In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
                         cols='sex')
    Out[17]:
      sex        Female  Male  
    day  time                  
    Fri  Dinner  0.1991  0.1302
         Lunch   0.1997  0.1741
    Sat  Dinner  0.1565  0.1516
    Sun  Dinner  0.1816  0.1623
    Thur Dinner  0.1597  NaN  
         Lunch   0.1575  0.1653

    Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

    In [22]: table.ix['Dinner']
    Out[22]:
      smoker  No      Yes  
    sex                    
    Female    0.1568  0.1851
    Male      0.1594  0.1489

    If you don’t specify a particular values column, it will try to aggregate all the other columns:

    In [24]: pivot_table(tips, rows=['sex', 'smoker'])
    Out[24]:
                   size   tip    tip_pct  total_bill
    sex    smoker                                  
    Female No      2.593  2.774  0.1569   18.11    
           Yes     2.242  2.932  0.1822   17.98    
    Male   No      2.711  3.113  0.1607   19.79    
           Yes     2.5    3.051  0.1528   22.28

    Other aggregations can be performed, like using len to get group counts:

    In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
    Out[26]:
      smoker  No  Yes
    sex              
    Female    54  33
    Male      97  60

    Once you’ve done the aggregation, you’re free to use the built-in reshaping capability to reshape the data in the table:

    In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
                                 cols='smoker', aggfunc=len)

    In [30]: table
    Out[30]:
      smoker     No  Yes
    sex    day          
    Female Fri   2   7  
           Sat   13  15
           Sun   14  4  
           Thur  25  7  
    Male   Fri   2   8  
           Sat   32  27
           Sun   43  15
           Thur  20  10

    In [31]: table.unstack('sex')
    Out[31]:
      smoker  No            Yes        
      sex     Female  Male  Female  Male
    day                                
    Fri       2       2     7       8  
    Sat       13      32    15      27  
    Sun       14      43    4       15  
    Thur      25      20    7       10

    If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

    In [59]: pivot_table(tips, rows=['sex', 'smoker'],
                         aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
    Out[59]:
                   size  tip_pct
    sex    smoker              
    Female No      140   0.1569
           Yes     74    0.1822
    Male   No      263   0.1607
           Yes     150   0.1528

    I thought that was pretty cool.

    Comparison with reshape2

    R’s reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, a slightly more ad hoc approach than in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing based reshaping). However, it can take advantage of R’s built-in formulas which makes the syntax very intuitive. Here are some of the same examples using reshape2:

    > tips$tip.pct <- tips$tip / tips$total_bill
    > mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
    > acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
         Female_No Female_Yes Male_No Male_Yes
    Fri          2          7       2        8
    Sat         13         15      32       27
    Sun         14          4      43       15
    Thur        25          7      20       10

    > acast(mtips, sex + smoker ~ variable, mean)
               total_bill      tip     size   tip.pct
    Female_No    18.10519 2.773519 2.592593 0.1569210
    Female_Yes   17.97788 2.931515 2.242424 0.1821504
    Male_No      19.79124 3.113402 2.711340 0.1606687
    Male_Yes     22.28450 3.051167 2.500000 0.1527712

    R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.

    In R there is also the xtabs function for doing cross-tabulation:

    > ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                         day Fri Sat Sun Thur
    time   sex    smoker                    
    Dinner Female No           2  30  43    2
                  Yes          8  33  10    0
           Male   No           4  85 124    0
                  Yes         12  71  39    0
    Lunch  Female No           3   0   0   60
                  Yes          6   0   0   17
           Male   No           0   0   0   50
                  Yes          5   0   0   23

    ftable just creates a flat table which can be more easily viewed. This table could be created with pandas by:

    In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
       ....:             cols='day', aggfunc=np.sum, fill_value=0)
    Out[54]:
      day                 Fri  Sat  Sun  Thur
    time   sex    smoker                    
    Dinner Female No      2    30   43   2  
                  Yes     8    33   10   0  
    Dinner Male   No      4    85   124  0  
                  Yes     12   71   39   0  
    Lunch  Female No      3    0    0    60  
                  Yes     6    0    0    17  
    Lunch  Male   No      0    0    0    50  
                  Yes     5    0    0    23

    Performance

    pandas is very fast as I’ve invested a great deal in optimizing the indexing infrastructure and other core algorithms related to things such as this. I don’t have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100000 entries and 25 groups. Here is the R code for the benchmark:

    library(reshape2)
    n <- 100000
    a.size <- 5
    b.size <- 5
    data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                       b=sample(letters[1:b.size], n, replace=T),
                       c=rnorm(n),
                       d=rnorm(n))
    timings <- numeric()

    for (i in 1:10) {
      gc()
      tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
      timings[i] = tim[3]
    }
    mean(timings)

    Here is what the actual table looks like:

    > table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
    > head(table)
                   c            d
    a_a  0.002761963 -0.008793516
    a_b -0.004618980 -0.026978744
    a_c  0.022491767 -0.002163986
    a_d  0.005027362 -0.002532782
    a_e  0.006146438  0.031699238
    b_a -0.025094899  0.003334301

    Here’s the equivalent Python code (sans timings, which I’ll do in IPython):

    from pandas import *
    import string
    n = 100000
    asize = 5
    bsize = 5

    letters = np.asarray(list(string.letters), dtype=object)

    data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                          bar=letters[:bsize][np.random.randint(0, bsize, n)],
                          baz=np.random.randn(n),
                          qux=np.random.randn(n)))

    table = pivot_table(data, rows=['foo', 'bar'])

    Similarly:

    In [36]: table = pivot_table(data, rows=['foo', 'bar'])

    In [37]: table[:10]
    Out[37]:
             baz       qux      
    foo bar                    
    a   a   -0.02162   0.02186  
        b   -0.002155  0.01173  
        c   -0.002331  0.006608
        d   -0.01734   0.01109  
        e    0.02837   0.01093  
    b   a    0.01703  -0.005845
        b   -0.001043  0.01283  
        c   -0.006015 -0.0005866
        d    0.009683  0.02211  
        e    0.02347  -8.346e-05

    And the R timing

    > mean(timings)
    [1] 0.42

    And pandas:

    In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
    10 loops, best of 3: 117 ms per loop

    So it’s about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here’s a couple slightly different benchmarks on the same data:

    > acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
                 a            b            c           d            e
    a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
    b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
    c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
    d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
    e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

    Timing
    > mean(timings)
    [1] 0.3036

    and pandas:

    In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
    Out[41]:
      bar  a         b          c         d         e      
    foo                                                    
    a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
    b      0.01703  -0.001043  -0.006015  0.009683  0.02347
    c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
    d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
    e     -0.01656  -0.01276   -0.02481   0.008629  0.011  

    In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
    10 loops, best of 3: 55 ms per loop

    I suspect one source of slowdown reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, it enables easier computation of group margins, something which I have not yet addressed (but will, eventually) in pandas.

    by Wes McKinney at October 23, 2011 12:11 AM

    October 19, 2011

    Matthieu Brucher

    Announcement: QtSimpleEQ 1.0 (QtVST)

    I’m pleased to announce the 1.0 version of QtSimpleEQ, a plugin with one low-pass, two peak and one high pass second-order filters. Nothing fancy in the algorithms, it’s mainly another show case for Qt VST plugins.

    The code is available under the GPL2 on github and on Sourceforge.

    The plugin can be download on the Sourceforge project page.

    The plugin was tested with Tracktion 3 (Windows XP).

    Snapshot:

    QtSimpleEQ UI

    Buy Me a Coffee!



    Other Amount:



    Your Email Address :




    by Matt at October 19, 2011 01:00 PM

    October 16, 2011

    Travis Oliphant

    Thoughts on porting NumPy to PyPy

    Last weekend, I attended GitHub's PyCodeConf in Miami Florida and had the opportunity to give a talk on array-oriented computing and Python.  I would like to thank Tom and Chris (GitHub founders) for allowing me to come speak.  I enjoyed my time there, but I have to admit I felt old and a bit out of place.   There were a lot of young people there who understood a lot more about making web-pages and working at cool web start-ups than solving partial differential equations with arrays.  Fortunately, Dave Beazley and Raymond Hettinger were there so I didn't feel completely ancient.   In addition, Wes McKinney and Peter Wang helped me represent the NumPy community.

    At the conference I was reminded of PyPy's recent success at showing the speedups some said weren't possible from a dynamic language like Python --- speed-ups which make it possible to achieve C-like speeds using Python constructs.   This is very nice as it illustrates again that high-level languages can be compiled to low-level speeds.    I also became keenly aware of the enthusiasm that has cropped up in porting NumPy to PyPy.   I am happy for this enthusiasm as it illustrates the popularity of NumPy which pleases me.   On the other hand, in every discussion I have heard or read about this effort, I'm not convinced that anyone excited about this porting effort actually understands the complexity of what they are trying to do nor the dangers that it could create in splitting the small community of developers who are regularly contributing to NumPy and SciPy and causing confusion for the user base of Python in Science.

    I'm hopeful that I can provide some perspective.   Before I do this, however, I want to congratulate the PyPy team and emphasize that I have the utmost respect for the PyPy developers and what they have achieved.   I am also a true-believer in the ability for high-level languages to achieve faster-than-C speeds.   In fact, I'm not satisfied with a Python JIT.  I want the NumPy constructs such as vectorization, fancy indexing, and reduction to be JIT compiled.    I also think that there are use-cases of NumPy all by itself that makes it somewhat interesting to do a NumPy-in-PyPy port.    I would also welcome the potential things that can be learned about how to improve NumPy that would come out of trying to write a version of NumPy in RPython.

    However, to avoid detracting from the overall success of Python in Science, Statistics, and Data Analysis,  I think it is important that 3 things are completely clear to people interested in the NumPy on PyPy idea. 
    1. NumPy is just the beginning (SciPy, matplotlib, scikits, and 100s of other packages and legacy C/C++ and Fortran code are all very important)
    2. NumPy should be a lot faster than it is currently.
    3. NumPy has an ambitious roadmap and will be moving forward rather quickly over the coming years.  

      NumPy is just the beginning 

      Most of the people who use NumPy use it as an entry-point to the entire ecosystem of Scientific Packages available for Python.    This ecosystem is huge.   There are at least 1 million unique visitors to the http://www.scipy.org site every year and that is just an entry point to the very large and diverse community of technical computing users who rely on Python. 

      Most of the scientists and engineers who have come to Python over the past years have done so because it is so easy to integrate their legacy C/C++ and Fortran code into Python.   National laboratories, large oil companies, large banks and many other Fortune 50 companies all must integrate their code into Python in order for Python to be part of their story.   NumPy is part of the answer  that helps them seamlessly view large amounts of data as arrays in Python or as arrays in another compiled language without the non-starter of copying the memory back and forth. 

      Once the port of NumPy to PyPy has finished, are you going to port SciPy?  Are you going to port matplotlib?  Are you going to port scikits.learn, or scikits.statsmodels?   What about Sage?   Most of these rely on not just the Python C-API but also the NumPy C-API which you would have to have a story for to make a serious technical user of Python get excited about a NumPy port to PyPy.  

      To me it is much easier to think about taking the ideas of PyPy and pulling them into the Scientific Python ecosystem then going the other way around.    It's not to say that there isn't some value in re-writing NumPy in PyPy, it just shouldn't be over-sold and those who fund it should understand what they aren't getting in the transaction.  

      C-speed is the wrong target

      Several examples including my own previous blog post has shown that vectorized Fortran 90 can be 4-10 times faster than NumPy.   Thus, we know there is room for improvement even on current single-core machines.   This doesn't even take into account the optimizations that should be possible for multiple-cores, GPUs and even FPGAs all of which are in use today but are not being utilized to the degree they should be.   NumPy needs to adapt to make use of this kind of hardware and will adapt in time.

      NumPy will be evolving rapidly over the coming years

      The pace of NumPy development has leveled off in recent years, but this year has created a firestorm of new ideas that will be coming to fruition over the next 1-2 years and NumPy will be evolving fairly rapidly during that time.    I am committed to making this happen and will be working very hard in 2012 on the code-base itself to realize some of the ideas that have emerged.   Some of this work will require some re-factoring and re-writing as well.   I would honestly rather collaborate with PyPy than compete, but my constraints are that I care very much about backward compatibility and very much about the entire SciPy ecosystem.    I sacrificed a year of my life in 1999 (delaying my PhD graduation by at least 6-12 months) bringing SciPy to life.    I sacrificed my tenure-track position in academia bringing NumPy to life in 2005.   Constraints of keeping my family fed, clothed, and housed seem to keep me on this 6-7 year sabbatical-like cycle for SciPy/NumPy but it looks like next year I will finally be in a position to spend substantial time and take the next steps with NumPy to help it progress to the next stage.

      Some of the ideas that will be implemented include:
      • integration of non-contiguous memory chunks into the NumPy array structure (generalization of strides)
      • addition of labels to axes and dimensions (generalization of shape)
      • derived fields, enumerated data-types, reference data-types, and indices for structured arrays
      • improvements to the data-type infrastructure to make it easier to add new data-types
      • improvements to the calculation infrastructure (iterators and fast general looping constructs)
      • fancy-indexing as views
      • integration of Pandas group-by features
      • missing data bit-patterns
      • distributed arrays 
      Over conversations with many people this year, more ideas than room to talk about them have emerged and I am excited to start seeing these ideas come to fruition to make NumPy and Python the best solution for data-analysis.    Beginning next year, I will be pushing hard for their introduction into the NumPy/SciPy ecosystem --- with a careful eye on backward compatibility which has long been one of NumPy's strengths.

      A way forward

      I would love to see more scientific code written at a high-level without sacrificing run-time performance.   The high-level intent allows for the creation of faster machine code than lower-level translations of the intent often does. I know this is possible and I intend to do everything I can professionally to see that happen (but from within the context of the entire SciPy ecosystem). As this work emerges, I will encourage PyPy developers to join us using the hard-won knowledge and tools they have created.

      Even if PyPy continues as a separate eco-system, then there are points of collaboration that will benefit both groups. One of these is to continue the effort Microsoft initially funded to separate the C-parts of NumPy away from the CPython interface to NumPy. This work is now in a separate branch that has diverged from the main NumPy branch and needs to be re-visited. If people interested in NumPy on PyPy spent time on improving this refactoring into basically a NumPy C-library, then PyPy can call this independent library using its methods for making native calls just as CPython can call it using its extension approach. Then IronPython, Jython (and for that matter Ruby, Javascript, etc.) can all call the C-library and leverage the code. There is some effort to do this and it's not trivial. Perhaps, there is even a way for PyPy to generate C-libraries from Python source code --- now that would be an interesting way to collaborate.

      The second way forward is for PyPy to interact better with the Cython community. Support in PyPy for Cython extension modules would be a first step. There is wide agreement among NumPy developers that more of NumPy should be written at a high-level (probably using Cython). Cython already is used to implement many, many extension modules for the Sage project. William Stein's valiant efforts in that community have made Cython the de-facto standard for how most scientists and engineers are writing extensions modules for Python these days. This is a good thing for efforts like PyPy because it adds a layer of indirection that allows PyPy to make a Cython backend and avoid the Python C-API.  

      I was quite astonished that Cython never came up in the panel discussion at the last PyCon when representatives from CPython, PyPy, IronPython, and Jython all talked about the Python VMs.   To me that oversite was very troublesome.   I was left doubting the PyPy community after Cython was not mentioned at all --- even when the discussion about how to manage extensions to the language came up during the panel discussion.    It shows that pure Python developers on all fronts have lost sight of what the scientific Python community is doing.   This is dangerous. I encourage Python developers to come to a SciPy conference and take a peek at what is going on.   I hope to be able to contribute more to the discussion as well. 

      If you are a Python developer and want to extend an olive leaf, then put a matrix infix operator into the language.  It's way past time :-)

      by Travis Oliphant (noreply@blogger.com) at October 16, 2011 05:15 PM

      October 14, 2011

      Fabian Pedregosa

      qr_multiply function in scipy.linalg

      In scipy’s development version there’s a new function closely related to the QR-decomposition of a matrix and to the least-squares solution of a linear system.

      What this function does is to compute the QR-decomposition of a matrix and then multiply the resulting orthogonal factor by another arbitrary matrix. In pseudocode:

      def qr_multiply(X, Y):
          Q, R = qr(X)
          return dot(Q.T, Y)

      but unlike this naive implementation, qr_multiply is able to do all this without explicitly computing the orthogonal Q matrix, resulting both in memory and time saving. In the following picture I measured the memory consumption as a function of time of running this computation on a 1.000 x 1.000 matrix X and a vector Y (full code can be found here):

      It can be seen that not only qr_multiply is almost twice as fast as the naive approach, but also that the memory consumption is significantly reduced, since the orthogonal factor is never explicitly computed.

      Credit for implementing the qr_multiply function goes to Martin Teichmann.

      by fabian at October 14, 2011 02:44 PM

      October 11, 2011

      Matthieu Brucher

      QtMosaic 0.1: easy photomosaics

      Just for fun, I’m pleased to announce a working version of QtMosaic.
      It allows to create a mosaics database and then generate a photomosaic from this database.

      For instance, here is an original image:

      Original photo


      The used database is freely available on the Internet, and the photomosaic is generated with 18×24 thumbnails:

      Rendered photomosaic

      The code is available on GitHub, as usual, and it has been tested on Windows and Linux.

      Buy Me a Coffee!



      Other Amount:



      Your Email Address :




      by Matt at October 11, 2011 07:25 AM

      October 10, 2011

      Stefan van der Walt

      scikits.image 0.3 release

      After a brief (!) absence, we’re back with a new and shiny version of scikits.image, the image processing toolbox for SciPy.

      This release runs under all major operating systems where Python (>=2.6 or 3.x), NumPy and SciPy can be installed.

      For more information, visit our website

      http://scikits-image.org

      or the examples gallery at

      http://scikits-image.org/docs/0.3/auto_examples/

      New Features

      • Shortest paths
      • Total variation denoising
      • Hough and probabilistic Hough transforms
      • Radon transform with reconstruction
      • Histogram of gradients
      • Morphology, including watershed, connected components
      • Faster homography transformations (rotations, zoom, etc.)
      • Image dtype conversion routines
      • Line drawing
      • Better image collection handling
      • Constant time median filter
      • Edge detection (canny, sobel, etc.)
      • IO: Freeimage, FITS, Qt and other image loaders; video support.
      • SIFT feature loader
      • Example data-sets

      … as well as many bug fixes and minor updates.

      Contributors for this release

      • Martin Bergtholdt
      • Luis Pedro Coelho
      • Chris Colbert
      • Damian Eads
      • Dan Farmer
      • Emmanuelle Gouillart
      • Brian Holt
      • Pieter Holtzhausen
      • Thouis (Ray) Jones
      • Lee Kamentsky
      • Almar Klein
      • Kyle Mandli
      • Andreas Mueller
      • Neil Muller
      • Zachary Pincus
      • James Turner
      • Stefan van der Walt
      • Gael Varoquaux
      • Tony Yu

      Permalink | Leave a comment  »

      October 10, 2011 06:45 PM

      October 09, 2011

      Vlad Niculae

      Sampling Gamma random variates through the ratio-of-uniforms method

      One year ago I had the chance to take a class on Monte Carlo simulation with prof. Ion Văduva, and my assignment for the class was to implement exactly what it says in the title of the blog post. I am going to walk you through the idea behind this.

      General formulation

      The ratio-of-uniforms is a method that can be applied to many density functions. Essentially, given a density function over \( \mathbb{R}^m\), \( f(x) = \frac{h(x)}{H}\) where \(H\) is a normalization constant (ie. \( h(x) \geq 0\), \( H = \int h(x)dX\)). Given a parameter \( c > 0 \) and a parametrization \(\phi\) from \( \mathbb{R}^{m+1}\) to \( \mathbb{R}^{m}\) expressed as: $$ \phi(v_0, v_1, …, v_m) = \left ( \frac{v_1}{v_0^c}, \frac{v_2}{v_0^c}, …, \frac{v_m}{v_0^c} \right )$$
      Define the set \( \mathcal{C} = \{\mathbf{v} \big | \gamma(\mathbf{v}) \leq 0, v_0 > 0\} \in \mathbb{R}^{m+1}\) where
      $$\gamma(\mathbf{v}) = \log v_0 – \frac{\log h(\phi(v_0, v_1, …, v_m))}{mc + 1}$$ If \( \mathcal{C}\) is bounded and we sample a uniform vector \( \mathbf{V} \sim \text{Uniform}(\mathcal{C})\) then \( \phi(\mathbf{V}) \sim f(x)\). Also note that the measure (volume) of the set \( \mathcal{C}\) is \( \frac{H}{mc + 1}\). I do not have any references for the proof, except for a book in Romanian, but if you are interested, just leave me a comment and I’ll do a follow-up post with the proofs.

      Univariate scenario

      For the univariate case, all the above simplifies to $$ \mathcal{C} = \left \{(u, v) \Big | 0 u \sqrt
      {h\left (\frac{v}{u^c}\right )} \right \} $$. We generate \( (U, V) \sim \text{Uniform}(\mathcal{C})\) and take \( \frac{V}{U^c} \sim f(x)\).
      Since we are looking at the (univariate) Gamma distribution, described by: $$ f(x; \nu, \theta) = \frac{x^{\mu - 1} \exp(\frac{-x}{\theta})}{\theta^k\Gamma(k)}$$ \( \nu\) is the shape parameter and \( \theta\) is the scale parameter.
      But because of the property that if \( X \sim \text{Gamma}(\nu, \theta)\), then for any \( k > 0\), \( kX \sim \text{Gamma}(\nu, k\theta)\), we conclude that we can fix \( \theta\) to 1 without loss of generality. Replacing in the style of the definition in the previous section, we have \( h(x; \nu) = x^{\nu-1}e^{-x}\) and \( H_\nu = \Gamma(\nu)\).
      This allows us to compute the equation of the boundary of the set \( \mathcal{C}\) which ends up being described by \(\gamma(u, v) = \log{u} – \frac{\nu – 1}{c + 1} \log{\left(\frac{v}{u^c}\right)} + \frac{1}{c+1} \frac{v}{u^c}\). For visualisation purposes, here is how it would look like for \( \nu=6, c=1\) (plotted using Wolfram Alpha):

      Sampling algorithm

      In order to uniformly sample from this set, we can apply basic rejection sampling: just uniformly sample from a rectangular region surrounding the set, and reject the points that do not satisfy the condition. In order to do this as efficiently as possible, we need to compute the minimal bounding box, which can be done by solving a couple of optimization problems using Lagrange multipliers and the KKT conditions. Also by looking closely at the image, you can see that the lower left corner is exactly the origin: this turns out not to be a coincidence. I won’t go into detail here, but here are the bounds I derived:
      $$ 0 u (\nu - 1)^\frac{\nu - 1}{c + 1} e ^ {-\frac{\nu - 1}{c + 1}} \text{ and } 0 v \left(\frac{c\nu + 1}{c}\right)^{\frac{c\nu + 1}{c + 1}} e ^ {- \frac {c\nu + 1}{c+1}}$$

      The probability of acceptance (which can be seen as the efficiency) of the rejection sampling method is given by the ratio of the areas of the set \( \mathcal{C}\) and the bounding box. The larger this probability, the less points we throw away and the more efficient the algorithm is. Using the values derived above, this probability is: $$ p(\nu, c) = \frac{\Gamma(\nu)e^{\nu}}{(c+1) (\nu - 1)^{\frac{\nu - 1}{c + 1}} \left(\frac{c\nu + 1}{c}\right)^{\frac{c\nu + 1}{c + 1}}}$$

      Personally I got stumped here. The idea would be to determine the ideal \( c\) for a given \( \nu\) in order to maximize the probability, but I didn't manage to do it (I leave it as an exercise for the reader ;) ). Anyway, this is enough to proceed with an implementation, so I'm gonna give the Python code for it. Note that I used the name k for the shape parameter instead of \( \nu\). Also note that the case when \( 0 \nu 1\) needed to be treated separately, which I did using the following property: Let \( \nu \in (0, 1)\). If \( X' \sim \text{Gamma}(1+\nu, 1), U \sim \text{Uniform}(0, 1)\) then $$ X = X' \cdot \sqrt[\nu]{U} \sim \text{Gamma}(\nu, 1)$$ For a proof of this fact, see [1], which is a great article on generating Gamma variates.

      Implementation

      from import numpy as np
      
      def _cond(u, v, k, c):
          """Identity function describing the acceptance region"""
          x = v / u ** c
          return (c + 1) * np.log(u) <= (k - 1) * np.log(x) - x
      
      def vn_standard_gamma(k, c=1.0, rng=np.random):
          """Generates a single standard gamma random variate"""
          if k <= 0:
              raise ValueError("Gamma shape should be positive")
          elif k < 1:
              return vn_standard_gamma(1 + k, c, rng) * rng.uniform() ** (1 / k)
          elif k == 1:
              return rng.standard_exponential()
          else:
              a, b = get_bounds(k, c)
              while True:
                  u, v = rng.uniform(0, a), rng.uniform(0, b)
                  if _cond(u, v, k, c):
                      break;
              return v / u ** c
      
      def vn_gamma(k, t, shape=1, c=1.0, rng=np.random):
          """Vectorized function to generate multiple gamma variates"""
          generator = lambda x: t * vn_standard_gamma(k, c, rng)
          generator = np.vectorize(generator)
          return generator(np.empty(shape))
      
      def get_bounds(k, c=1.0):
          """Computes the minimal upper bounds surrounding the acceptance region"""
          a = ((k - 1) / np.e) ** ((k - 1) / (c + 1))
          b = ((c * k + 1) / (c * np.e)) ** ((c * k + 1) / (c + 1))
          return a, b
      
      def prob_acc(k, c=1.0):
          """Calculates the probability of acceptance for the given parameters"""
          from scipy.special import gamma
          a, b = get_bounds(k, c)
          return gamma(k) / ((c + 1) * a * b)
      

      Results

      And of course I should show you that it works. Here are some histograms for various values of \( \nu\), with the theoretical density plotted in dotted red, after sampling \( 10^5\) values. The y-axis is the frequency (sorry for labeling in Romanian), and for the red dotted line it can be interpreted as the theoretical probability. You can clearly see the goodness of fit is excellent.

      [1]: George Marsaglia and Wai Wan Tsang. 1998. The Monty Python method for generating random variables. ACM Trans. Math. Softw. 24, 3 (September 1998), 341-350. DOI=10.1145/292395.292453

      by vene at October 09, 2011 01:40 PM

      October 08, 2011

      Paul Ivanov

      Ada Lovelace Day: remembering Shirley Theis and Evelyn Silvia

      In case you didn’t know it – today is Ada Lovelace Day!

      Now, as any self-respecting Computer Science degree-wielding person should, I, too, think it’s important to celebrate the day named after the world’s very first programmer.

      For me, the first math teacher I remember making a big difference was Shirley Theis – who taught me Algebra in 8th grade at McKinley Middle School in Redwood City, CA. Mrs Theis, an energetic dynamo in her mid fifties, was a deeply motivated and caring teacher, who expected a lot out of her students, but never in a disciplinary manner. She was full of enthusiasm, which projected out and infected even the most timid or disaffected student: in her class, you couldn’t be just a sack of potatoes planted in your seat.

      She often lead class in a nearly theatrical manner – pacing back and forth, egging students on by eagerly repeating their partial responses, getting exponentially more excited if the student was on the right track, barely containing herself from jumping up and down in anticipation of that lightbulb going off — and yet just as quickly waning in her enthusiasm,becoming a personified caricature of hopelessness and despair to let you know the instant a response was starting to go astray.

      It may have been the only math class I’ve ever taken where there were group assignments – we would work with a partner or a few classmates in trying to figure out an assignment, first trying it solo, and then putting our heads together to figure out why our answers disagree and which is the right one. I believe it was Mrs. Theis who succinctly captured a value I hold in high regard: “it’s not about how far you go – it’s about how many people you bring with you.”

      There was one other mathematics teacher I had in my life who clearly stands out: it was Professor Evelyn Silvia who had a comparable level of enthusiasm and energy, and from whom I had the pleasure of taking the first upper-division math course (Math 108 – Intro to Abstract Math) during my second quarter at UC Davis. Dr. Silvia was the real deal – she cared, gesticulated, encouraged us to question why something was true, and had an approach to the demanded we each take ownership of our education. The book for the course, Introduction to Abstract Mathematics: A Working Excursion by D.O. Cutler and E.M. Silvia was a blue workbook – each of us had our own copy, and there were blanks left out for us to write our own answers to the exercises. The fact that the book had blanks for me to fill in was so inviting, there was a kind of “working mathematician” approach that came with it with that it made me really enjoy and look forward to working through the material. I still have mine.

      Dr. Silvia was incredibly sharp, not just intellectually but also interpersonally. Not only could she could gauge when the class was lost, but she also had a knack for spotting if something was affecting you outside of class. She was really committed to helping you not just as a student, but as a person. I remember spending hours at Mishka’s, or Cafe Roma, or the CoHo, reading and writing, wanting to do well and not let Silvia down, because she invested so much energy in placed a great deal of trust in us.

      So thank you both, Shirley Theis and Evelyn Silvia – you both encouraged me to grow a lot as a person, challenged my concept of what it means to be a student, and by your example provided a template of what it means to be an effective teacher, which I’ve imitated and embraced with pleasure in my own teaching.

      (tagged scipy to spread word of Ada Lovelace day to Planet SciPy)

      by Paul Ivanov at October 08, 2011 12:01 AM

      October 04, 2011

      Matthieu Brucher

      QtVST, Qt and QtSimpleEQ

      Today, I wanted to announce my new plugin, a 4-bands EQ, but when I started a test with pyVST, I encountered strange things:

      • The first is my fault, as the code of the EQ disappeared from my Git repository, so I have to code it again. Mainly it is just plugin the correct actions between the filters and the GUI.
      • The second is an error at the end of the test. I’ve updated my Qt version from 4.7.1 to 4.7.4, and since this update (or perhaps since I updated to Python 2.7 for pyVST), I found that even a recompiled QtSimpleOverdrive has the same behavior. It did not when I released it. It seems that Qt is complaining about events that are bounced between different threads, but the actual error message is more cryptic, and impossible to debug the application at this point.

      I hope to fix these mistakes this month, I really hope I can get QtVST to work again.

      Buy Me a Coffee!



      Other Amount:



      Your Email Address :




      by Matt at October 04, 2011 10:24 AM

      October 02, 2011

      Fabian Pedregosa

      scikit-learn 0.9

      Last week we released a new version of scikit-learn. The Changelog is particularly impressive, yet personally this release is important for other reasons.

      This will probably be my last release as a paid engineer. I’m starting a PhD next month, and although I plan to continue contributing to the project and make a few more releases, I will certainly have less time to devote to it. Luckily, I received a lot of help from the community while preparing the release, from Changelog writing to build of Windows binaries, thus I expect the transition to go smoothly.

      Almost two years have elapsed since the first 0.1 release. During this time, we did a lot of refactoring and broke the API several times. However, I’ve seen some concerns about API stability both at the EuroScipy conference and in the mailing list where I’ve realized we need to provide an API that does not break in every release, and do this in a way that the project remains fun for developers.

      That’s why I’m extremely glad to see that although this release is big in changes, these have been made in a more organized manner. Yes, we’ve broken the API once again, but now there’s a compatibility layer that ensures that code written for 0.8 will continue working with the new release.

      by fabian at October 02, 2011 09:19 AM

      September 30, 2011

      Wes McKinney

      Python, R, and the allure of magic

      R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

      > a <- rnorm(10)
      > b <- rnorm(10)
      > cbind(a, b)
                     a          b
       [1,]  0.8729978  0.5170078
       [2,] -0.6885048 -0.4430447
       [3,]  0.4017740  1.8985843
       [4,]  2.1088905 -1.4121763
       [5,]  0.9375273  0.4703302
       [6,]  0.5558276 -0.5825152
       [7,] -2.1606252  0.7379874
       [8,] -0.7651046 -0.4534345
       [9,] -4.2604901  0.9561077
      [10,]  0.3940632 -0.8331285

      If you’re a seasoned Python programmer, you might have the sort of visceral negative reaction that I do to this. Seriously, just where in the hell did those variable names come from? So when I say magic here I’m talking about abusing the language’s parser. There is nothing special about R that makes the above behavior possible, but rather taking a fundamentally different design philosophy to, say, Python. As any Python programmer knows: Explicit is better than implicit. I happen to agree. There is also a bit of a semantic difference in R versus Python in that assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.

      While building pandas over the last several years, I occasionally grapple with issues like the above. Maybe I should just break from Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module after all. So I went down the rabbit hole and wrote this little code snippet:

      While this is woefully unpythonic, it’s also kind of cool:

      In [27]: merge(a, b)
      Out[27]:
                  a         b      
      2000-01-03 -1.35      0.8398
      2000-01-04  0.999    -1.617  
      2000-01-05  0.2537    1.433  
      2000-01-06  0.6273   -0.3959
      2000-01-07  0.7963   -0.789  
      2000-01-10  0.004295 -1.446

      This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

      In [30]: merge(a, np.log(b))
      Out[30]:
                  a        np.log(b)
      2000-01-03  0.6243   0.7953  
      2000-01-04  0.3593  -1.199    
      2000-01-05  2.805   -1.059    
      2000-01-06  0.6369  -0.9067  
      2000-01-07 -0.2734   NaN      
      2000-01-10 -1.023    0.3326

      Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

      In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
      Out[32]:
                  a        log(b)
      2000-01-03  0.6243   0.7953
      2000-01-04  0.3593  -1.199  
      2000-01-05  2.805   -1.059  
      2000-01-06  0.6369  -0.9067
      2000-01-07 -0.2734   NaN    
      2000-01-10 -1.023    0.3326

      by Wes McKinney at September 30, 2011 01:52 AM

      September 28, 2011

      Gaël Varoquaux

      Cython example of exposing C-computed arrays in Python without data copies

      Colleagues who are exposing a numerical C code in Python asked me for some advice on the best way to pass arrays from C to Python avoiding copies. They had Cython in mind, and I must agree with them that I have found the Cython code to be more maintainable than hand-written Python C-API code.

      When writing my answer, I found out that there was no self-contained example of creating numpy arrays from existing data in Cython. Thus I created my own. The full code with readme build and demo scripts is available on a gist. Here I only give an executive summary.

      The core functionality is implemented by the PyArray_SimpleNewFromData function of the C API of numpy that can create an ndarray from a pointer to the data, a simple data type, and the shape of the data. The Cython file just builds around that function:

      by gael at September 28, 2011 10:42 PM

      September 23, 2011

      Wes McKinney

      NumPy indexing peculiarities

      Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy” as I call it) indexing.

      import numpy as np
      import random

      arr = np.random.randn(10000, 5)
      indexer = np.arange(10000)
      random.shuffle(indexer)

      In [26]: timeit arr[indexer]
      1000 loops, best of 3: 1.25 ms per loop

      In [27]: timeit arr.take(indexer, axis=0)
      10000 loops, best of 3: 127 us per loop

      It’s actually kind of unbelievable when you think about it. What’s going on here that take is almost 10x faster? I really should take a closer at the internals of what __getitem__ does because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13′s indexing falls somewhere in the middle:

      mat <- matrix(rnorm(50000), nrow=10000, ncol=5)
      set.seed(12345)
      indexer <- sample(1:10000)
      > system.time(for (i in 1:1000) mat[indexer,])
         user  system elapsed
        0.460   0.197   0.656

      So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2)

      Another peculiarity that I noticed with take is that performance gets worse when you use the out argument, which tells the function to use an array you pass in to write out the result:

      out = np.empty_like(arr)

      In [50]: timeit np.take(arr, indexer, axis=0, out=out)
      10000 loops, best of 3: 200 us per loop

      EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.

      Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:

      @cython.wraparound(False)
      @cython.boundscheck(False)
      def take_axis0(ndarray[float64_t, ndim=2] values,
                     ndarray[int32_t] indexer,
                     out=None):
          cdef:
              Py_ssize_t i, j, k, n, idx
              ndarray[float64_t, ndim=2] outbuf

          if out is None:
              outbuf = np.empty_like(values)
          else:
              outbuf = out

          n = len(indexer)
          k = values.shape[1]
          for i from 0 <= i < n:
              idx = indexer[i]

              if idx == -1:
                  for j from 0 <= j < k:
                      outbuf[i, j] = NaN
              else:
                  for j from 0 <= j < k:
                      outbuf[i, j] = values[idx, j]

      Don’t worry about the -1 thing– that’s a specialization that I’m using inside pandas. Curiously, this function is a lot faster than take using out but faster than the regular take by a handful of microseconds.

      In [53]: timeit lib.take_axis0(arr, indexer)
      10000 loops, best of 3: 115 us per loop

      In [54]: timeit lib.take_axis0(arr, indexer, out)
      10000 loops, best of 3: 109 us per loop

      Very interesting.

      TL;DR

    1. Use take not []-based indexing to get best performance
    2. Cython is just as fast for my specific application and a lot faster if you’re passing an out array (which I will be for the application that I needed this for)
    3. R’s matrix indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than ndarray.take. This can probably be improved.
    4. by Wes McKinney at September 23, 2011 08:14 PM

      September 20, 2011

      Vlad Niculae

      RANLP 2011 in Hissar, BG

      Last week was marked by the international RANLP (Recent Advances in Natural Language Processing) conference, taking place in a nice spa in Hissar, Bulgaria. The excellent folks from the computational linguistics group at the University of Wolverhampton were behind it, together with the Institute of Information and Communication Technologies from the Bulgarian Academy of Sciences.

      A fountain with warm mineral spring water in Hissarya.
      Picture by Miranda


      I must begin by thanking them: the organization was impeccable! I’m not sure, but I think that at one point Ivelina was even running around buying routers to improve wifi coverage (which is already spectacular in Bulgaria — I’ve received reports from Miranda that you can get wifi in the mountains!)

      The schedule was busy, with three tracks going in parallel, in order to cover a wide range of topics in computational linguistics. The student workshop should also be noted for the excellent quality of the works there.

      Of course it would be infeasible to write about all the great people I met and their high quality work. And if I were to write about all the fun we had, it would probably make this post look unprofessional :) . This doesn’t mean I forgot about any of you, and as soon as I get the chance to work on something related, I will most certainly write about it, and you.

      So, if I would have to summarize the trends and the ideas stated during the conference and especially during the keynotes, I would say:

      • When talking about word sense disambiguation, it’s wrong to speak about the different meanings of a word, but rather about the potential a word has for bringing a certain meaning in a certain context. See Patrick HanksCorpus Pattern Analysis. Without something like this, to have good WSD you need to heavily adjust the overlapping meanings from a Wordnet-style ontology.
      • Certain relations, such as temporal and spacial ones, can naturally be modeled by complex domain-specific logics (see Inderjeet Mani‘s new book, Interpreting Motion: Grounded Representations for Spatial Language, that is due for publishing). But these only appear in a small subset of human communication. The attempt to map human language to a complete logic in which to do general-purpose inference seems futile: Ido Dagan suggests textual entailment: reasoning directly in natural language, and only abstracting away to a formal logic system when need arises.
      • If you have a large enough sample of n-gram frequency data, you can eventually beat the performance you can get with a limited amount of labeled data, and most importantly: it generalizes much better when going out of the domain you trained on. Apparently the best tool for this at the moment is the Google n-gram data, which has some limitations. In time, we can easily extend this data by huge amounts by mining n-grams from Wikipedia (which allegedly has a higher count of distinct n-grams than the Google dataset), and more importantly, by aligning multi-language data, making use of transliterations and cognate identification.

      Please note that I may be (or most probably, am) ignorant of older instances of similar ideas, and I may have misunderstood certain claims. Please feel free to discuss in the comments whether you think I forgot about something important, or whether I am plain wrong about something. In particular, I seem to have been completely ignorant of the existence of the Google n-gram data, which has been around for quite a while, so I must have missed other important things as well.

      Take care, kind readers, and express your opinion!
      V

      by vene at September 20, 2011 12:17 PM

      Matthieu Brucher

      Book review: CUDA By Example

      It has been a while since my last post here, but I’m back! I had access to the French version of this book, thanks to the publisher.

      CUDA is now in the trend, and there are several books, one of them I’ve also reviewed.

      Content and opinions

      The first two chapters are dedicated to introducing CUDA in computing science: why CUDA, why it can help someone and what nVidia offers. Of course, now nVidia offers more than what the book tells us (even nVidia offered more when the book was published if my memory serves me right).

      Actual CUDA code is introduced next. The author adds small pieces in each chapter: first the specific syntax, then parallelism and threads, blocks and their cooperation with shared memory, constant memory and streams, texture memory.

      The last step is the optimization of the code. The author starts with the interaction with OpenGL (and DirectX is left as an exercise), then atomicity, streams and finally multi GPUs. Some of these are only available on new GPUs, but it is not mentioned explicitly (you have to test your card with the provided code).

      The last chapter browses through the different supports that are available: CUDA tools, documentation…

      There are a lot of stuff that are “missing”. The most important thing is an explanation of how memory should be accessed. Coalescing memory is not even mentioned, and although nVidia made huge improvements since the first CUDA GPU, this is still one of the biggest ways to optimize your code. Also, the debuggers are mentioned, but there are sadly no real tutorial on this. You will have to fill some voids in your new knowledge (for instance the texture memory has more functionalities than what is described).

      Conclusion

      So CUDA By Example is a good introductory book with flaws. Contrary to the other book I read, this one (unfortunately?) does not dive into architectural details. You will still have to learn from the nVidia’s tutorials, but you will have an almost complete overview of CUDA.


      by Matt at September 20, 2011 07:55 AM

      September 19, 2011

      Vlad Niculae

      Dictionary learning in scikit-learn 0.9

      Thanks to Olivier, Gaël and Alex, who reviewed the code heavily the last couple of days, and with apologies for my lack of activity during a sequence of conferences, Dictionary learning has officially been merged into scikit-learn master, and just in time for the new scikit-learn 0.9 release. Here are some glimpses of the examples you can run for yourself:

      Dictionary learned from Lena patches

      Noisy image for denoising

      Image denoising with Dictionary learning and Orthogonal matching pursuit

      The stars of this new release are: the manifold learning module by Jake Vanderplas and Fabian Pedregosa, the Dirichlet process gaussian mixture model by Alexandre Passos, and many others, as you can see from the development changelog (as soon as the release is made, I will update this post with permanent links).

      The release is due tomorrow. I will also be in charge with building the Windows installers for this release, let’s hope I do a good job and you can think of me and smile when installing!

      by vene at September 19, 2011 05:15 PM

      September 11, 2011

      Gaël Varoquaux

      Python at scientific conferences

      Top notch scientific conferences are starting to add Python tracks to their program. This is good news. Indeed, it scientific Python conferences (namely Scipy, EuroSciPy and Scipy India) are doing great to get together people who have already heard about Python for science, but we need to reach out to specific Python communities to maximize impact.

      ESCO 2012 - European Seminar on Coupled Problems

      ESCO 2012 is the 3rd event in a series of interdisciplineary meetings dedicated to computational science challenges in multi-physics and PDEs.

      I was invited as ESCO last year. It was an aboslute pleasure, because it is a small conference that is very focused on discussions. I learned a lot and could sit down with people who code top notch PDE libraries such as FEniCS and have technical discussions. Besides, it is hosted in the historical brewery where the Pilsner was invented. Plenty of great beer.

      Application areas Theoretical results as well as applications are welcome. Application areas include, but are not limited to: Computational electromagnetics, Civil engineering, Nuclear engineering, Mechanical engineering, Computational fluid dynamics, Computational geophysics, Geomechanics and rock mechanics, Computational hydrology, Subsurface modeling, Biomechanics, Computational chemistry, Climate and weather modeling, Wave propagation, Acoustics, Stochastic differential equations, and Uncertainty quantification.

      Minisymposia

      • Multiphysics and Multiscale Problems in Civil Engineering
      • Modern Numerical Methods for ODE
      • Porous Media Hydrodynamics
      • Nuclear Fuel Recycling Simulations
      • Adaptive Methods for Eigenproblems
      • Discontinuous Galerkin Methods for Electromagnetics
      • Undergraduate Projects in Technical Computing

      Software afternoon Important part of each ESCO conference is a software afternoon featuring software projects by participants. Presented can be any computational software that has reached certain level of maturity, i.e., it is used outside of the author’s institution, and it has a web page and a user documentation. If you would like to present your software project, let us know soon.

      Proceedings For each ESCO we strive to reserve a special issue of an international journal with impact factor. Proceedings of ESCO 2008 appeared in Math. Comput. Simul., proceedings of ESCO 2010 in CiCP and Appl. Math. Comput. Proceedings of ESCO 2012 will appear in Computing.

      Important Dates

      • December 15, 2011: Abstract submission deadline.
      • December 15, 2011: Minisymposia proposals.
      • January 15, 2012: Notification of acceptance.

      PyHPC: Python for High performance computing

      If you are doing super computing, SC11, the Super Computing conference is the reference conference. This year there will a workshop on high performance computing with Python: PyHPC.

      At the scipy conference, I was having a discussion with some of the attendees on how people often still do process management and I/O with Fortran in the big computing environment. This is counter productive. However, has success stories of supercomputing folks using high-level languages are not advertized, this is bound to stay. Come and tell us how you use Python for high performance computing!

      Topics

      • Python-based scientific applications and libraries
      • High performance computing
      • Parallel Python-based programming languages
      • Scientific visualization
      • Scientific computing education
      • Python performance and language issues
      • Problem solving environments with Python
      • Performance analysis tools for Python application

      Papers We invite you to submit a paper of up to 10 pages via the submission site. Authors are encouraged to use IEEE two column format.

      Important Dates

      • Full paper submission: September 19, 2011
      • Notification of acceptance: October 7, 2011
      • Camera-ready papers: October 31, 2011

      by gael at September 11, 2011 02:52 PM

      Spyder

      New enhanced scientific Python interpreter

      As I mentionned in my previous post, the standard Python interpreter provided by Spyder's console has been considerably enhanced. Moreover, despite these interesting enhancements, this is based on a real Python interpreter, not an emulation.

      The scientific startup script mentionned in my previous post was integrated in Spyder's development version (under the name scientific_startup.py).

      The goal of all these recent changes was to provide a MATLAB-like experience with a minimum of requirements: NumPy, SciPy, Matplotlib and Spyder, that's all.

      And here are two screenshots of the result (Spyder was executed in `light` mode: `spyder --light`):


      by Pierre (noreply@blogger.com) at September 11, 2011 03:16 AM

      No IPython v0.11 support in Spyder's console, but standard Python interpreter enhancements

      It's been a month since IPython v0.11 public release and I've been trying to solve issues related to IPython support in Spyder's console, all in vain. At first, I thought that it was only a problem on Windows platforms, but apparently it is not. The more I think about it (and the more I spend time on it), the more I doubt that there will ever be a decent support of IPython >=v0.11 in Spyder's console. In the meantime, I have fixed a couple of bugs related to the new IPython plugin: this new plugin will probably be the only way to run IPython >=v0.11 within Spyder.

      So, to compensate this lack of IPython support in Spyder's console, I have implemented the following enhancements for the standard Python interpreter (this is not much, but it provides all the IPython features I am using personnally):

      1. I have fixed the PyQt input hook issue on Windows platforms. Before this revision, the console was crashing (non-responsive actually) on Windows platforms after trying to manipulate Qt objects interactively, like interactive plotting with Matplotlib for example. Now, when enabling the "Replace PyQt input hook by Spyder's" option (which is enabled by default on Windows platforms), the standard Python interpreter may be used to do interactive plotting, even on Windows.
      2. Plus, I have implemented some basic special commands in the standard Python interpreter (see here more details), adding support for %pwd, %ls, %clear or !dir (or !ls on UNIX platforms), etc. (I haven't implemented the %edit, %run and other similar commands which have really no use within Spyder).
      3. And I have fixed a bug with PYTHONSTARTUP substitution option (the packages imported in the startup script were not available in the Python interpreter).

      These three changesets were intended to facilitate the use of standard Python interpreter as an interactive computing shell. I've also created two PYTHONSTARTUP scripts (on my machine), one to support Matplotlib's pylab interface and the other to support guiqwt's interactive plotting mode:

      • Matplotlib's startup script:
      print "Running pylab startup script...",
      # Import modules following official guidelines:
      import numpy as np
      import scipy as sp
      import matplotlib as mpl
      import matplotlib.pyplot as plt
      # Pollute the namespace but also provide MATLAB-like experience:
      from pylab import *
      # Enable Matplotlib's pylab mode:
      ion()
      print "done."
      • guiqwt startup script:
      print "Running guiqwt startup script...",
      # Import modules following official guidelines:
      import numpy as np
      import scipy as sp
      import guiqwt.pyplot as plt
      import guiqwt.io as io
      # Pollute the namespace but also provide MATLAB-like experience:
      from numpy import *
      from guiqwt.pyplot import *
      # Enable guiqwt's interactive mode:
      ion()
      print "done."

      With these startup scripts, I have my own IPython shell (not as powerful as IPython, of course) but it suits my needs... why not yours?

      To select you own startup script, go to Tools > Preferences > Console > Advanced settings:
      I think I'll add these startup scripts to Spyder v2.1 but I don't know how exactly. Maybe the simplest way would be to change the settings GUI above and add a third choice: "Build-in startup script: " with a combo box allowing to choose a script between the two scripts above (and others).

      by Pierre (noreply@blogger.com) at September 11, 2011 02:29 AM

      September 07, 2011

      Wes McKinney

      Leaking memory like a pro in Python

      For all my Python zealotry, I never claimed it was perfect. In fact, there are a number of places you can shoot yourself in the foot if you aren’t careful. But I was kind of bummed about this one:

      import numpy as np

      class MainScreen(object):

          def __init__(self, cats):
              self.cats = cats
              self._signal = None

          def turn_on(self):
              if self._signal is None:
                  signal = Signal(self)
                  self._signal = signal
              return self._signal

      class Signal(object):

          def __init__(self, main_screen):
              # well we want to know which screen we came from!
              self.main_screen = main_screen

          def we_get(self):
              return True

      for i in xrange(50):
          cats = np.empty((10000, 500), dtype=float)
          cats.fill(0)
          main_screen = MainScreen(cats)
          signal = main_screen.turn_on()
          # all your base...

      Try it.

      EDIT: I’ve been informed by an astute reader that the GC is actually capable of dealing with the above case. Yang Zhang says “The collector actually works with cycles, including this case. (Actually, most objects are managed via reference-counting; the gc is specifically for cyclic references.) The problem is that it’s triggered by # allocations, not # bytes. In this case, each cats array is large enough that you need several GB of allocations before the threshold is triggered. Try adding gc.collect() or gc.get_count() in the loop (and inspect gc.get_threshold()).”

      All is not lost however. This pattern can be implemented using weak references:

      class MainScreen(object):

          def __init__(self, cats):
              self.cats = cats
              self._signal = None

          def turn_on(self):
              import weakref
              if self._signal is None:
                  signal = Signal(self)
                  self._signal = weakref.ref(signal)
              return self._signal()

      by Wes McKinney at September 07, 2011 04:17 AM

      September 05, 2011

      Gaël Varoquaux

      Conference posters

      At the request of a friend, I am putting up some of the posters that I recently presented at conferences.


      Large-scale functional-connectivity graphical models for individual subjects using population prior.
      This is a poster for our NIPS work

      Multi-subject dictionary learning to segment an atlas of brain spontaneous activity.
      This is a poster for our IPMI work

      Mayavi for 3D visualization of neuroimaging data: powerful scripting and reusable components in Python.

      Machine learning for fMRI in Python: inverse inference with scikit-learn.

      by gael at September 05, 2011 03:15 AM

      September 04, 2011

      Vlad Niculae

      Long overdue update. EuroScipy and SSLST 2011

      Anybody reading my blog should have expected me to blog about the end of my GSoC. Sorry to disappoint, but I simply did not experience anything similar to an ending. On the contrary, I feel like things have barely started. Also, I apologize for one of the few posts here without pretty pictures! :)

      For the last two weeks, I’ve been traveling. I attended the EuroScipy conference thanks to Fabian, who offered me a place to sleep during the week. We sprinted hard, we discussed tricky APIs, we drank a lot of coffee, beer, and ate well in lovely Paris. It was great to meet all of the celebrities, the people who keep the scientific Python globe turning.

      Many thanks to Gael and Emmanuelle, who worked very, very hard on organizing everything, so they weren’t around and I didn’t get to say goodbye when I ran to catch my plane last Sunday.

      I was in a hurry, heading to Tarragona, a beautiful city on the Catalan coast, where the public university organized the 2011 summer school in linguistics and speech technologies (SSLST). This was a great opportunity to meet many fellow young researchers working in computational linguistics. I will not go into details now, because I plan expand on this, but I would like to state a couple of things. Firsty, even though NLP seems to be mostly a Java-dominated affair (note for example Stanford’s NLP toolkit and Sheffield’s GATE), the computational linguistics and psycholinguistics research center (CLiPS) at the University of Antwerp actually briefly manifested its devotion to Python and NLTK via its research director, Walter Daelemans.

      It was good to see a little love for Python in this field. NLTK is very underrepresented in the SciPy community, I couldn’t find anybody at the EuroScipy conference knowing too much about it or about the people behind it.

      Another lab that has done a lot of cool work is Cental at the Catholic University of Louvain, and they also use Python for natural language processing. Maybe in the coming years, we will see a Python for Computational Linguistics sattelite, along with Physics and Neuroscience. Doesn’t it sound more fun? :P

      Secondly, I wish SSLST were organized by someone like Gael! As the discussion at dinner regarding who will organize next year’s EuroScipy went, it is imperative that the organizers be actively involved in the community, and generally passionate about it. Even though I’m comparing apples and oranges, Carlos Martin-Vide behaved in this context like a old, tired, emotionless academic, not taking into account even lunch breaks for the whole group, not to mention any sort of getting together or even a group photo (which, alas, we were not able to take, apart from small groups.) They said it couldn’t be done. Of course it could, they just didn’t want it hard enough.

      Finally, before signing off, I would like to announce that because the Romanian Ministry of Education failed to specify the allocated number of public positions for masters’ programmes, the admission exam at the University of Bucharest will be delayed by a couple of weeks. Luckily, this will allow me to attend RANLP 2011 in Hissar, Bulgaria a week from now, where I will present my poster entitled:
      “Can alternations be learned? A machine learning approach to Romanian verb conjugation” by Liviu P. Dinu, Emil Ionescu, Vlad Niculae and Octavia-Maria Sulea. See you in Hissar!

      by vene at September 04, 2011 10:08 PM

      Fabian Pedregosa

      Reworked example gallery for scikit-learn

      I’ve been working lately in improving the scikit-learn example gallery to show also a small thumbnail of the plotted result. Here is what the gallery looks like now


      And the real thing should be already displayed in the development documentation. The next thing is to add a static image to those that don’t generate any result, examples such as the SVM GUI should have an image to display.

      by fabian at September 04, 2011 06:09 PM

      September 03, 2011

      Gaël Varoquaux

      Hiring a junior developer on the scikit-learn

      Once again, we are looking for a junior developer to work on the scikit-learn. Below is the official job posting. As a personal remark, I would like to stress that this is a unique opportunity to be payed for two years to work on learning and improving the scientific Python toolstack.


      Job Description

      INRIA is looking to hire a young graduate on a 2-year position to help with the community-driven development of the open source machine learning in Python library, scikit-learn. The scikit-learn is one of the majormajor machine-learning libraries in Python. It aims to be state-of-the-art on mid-size to large datasets by harnessing the power of the scientific Python toolstack.

      Speaking French is not a requirement, as it is an international team.

      Requirements

      • Programming skills in Python and C/C++
      • Understanding of quality assurance in software development: test-driven programming, version control, technical documentation.
      • Some knowledge of Linux/Unix
      • Software design skills
      • Knowledge of open-source development and community-driven environments
      • Good technical English level
      • An experience in statistical learning or a mathematical-oriented mindset is a plus
      • We can only hire a young-graduate that has received a masters or equivalent degree at most a year ago.

      About INRIA

      INRIA is the French computer science research institute. It recognized word-wide as one of the leading research institutions and has a strong expertise in machine learning. You will be working in the Parietal team that makes a heavy use of Python for brain imaging analysis.

      Parietal is a small research team (around 10 people) with an excellent
      technical knowledge of scientific and numerical computing in Python asas
      well as a fine understanding of algorithmic issues in machine learning and statistics. Parietal is committed to investing in scikit-learn.

      Working at Parietal is a unique opportunity to improve your skills in machine learning and numerical computing in Python. In addition, working full time on the scikit-learn, a very active open-source project, will give you premium experience of open source community management and collaborative project development.

      Contact Info:

      • Technical Contact: Bertand Thirion
      • E-mail contact: bertrand dotnospam thirion atnospam inria dotnospam fr
      • HR Contact: Marie Domingues
      • E-mail Contact: marie dotnospam domingues atnospam inria dotnospam fr
      • No telecommuting

      by gael at September 03, 2011 06:26 AM

      September 01, 2011

      Travis Vaught

      Modern Portfolio Theory - A Python Implementation


      I was surprised last week to find there was no accessible Python implementation of the calculation of the Efficient Frontier (as defined by Markowitz in his presentation of Modern Portfolio Theory ~1957).* Since the problem seemed simple to solve with the tools at hand, I set out to "right the wrong" and develop an open implementation (available under an open, BSD license via github) that meets my needs.

      In the process of building the basic metrics, I was able to experiment with NumPy's datetime support -- as mentioned in my earlier post. The API's are still a bit in flux, but I thought I would write up a full workflow using the tools as they are. I would greatly appreciate feedback/contributions/suggestions.  So let's dive in ...



      Modern Portfolio Theory
      While there are much better sources for an explanation of MPT, I'll take a stab at a high level introduction to the basic concepts. MPT is based on the idea that a diversified portfolio--a portfolio that holds several assets, or asset classes, that have some inverse correlation--may be constructed which provides less risk and the same return as any of the single assets.  Or similarly, an asset mix may be chosen that has the same risk as any single asset, but a higher return. If one plots the returns of an asset along the value (y) axis versus the risk of an asset along the index (x) axis, it's easy to spot desirable assets based on their location on the chart.  While an efficient market may imply that there is a close correlation between risk and return, some assets may lie closer to the upper left portion of the chart.  These would provide less risk and higher returns than assets falling in the lower right portion of the chart.

      Figure: 1 Plot of Risk (Annual Volatility: x axis) vs. Return (Annualized Returns: y axis)

      For example, it's easy to see in Figure 1 that BA might be a less preferable asset than COP, since COP has a higher return for less risk (historically, at least).

      While there are acknowledged problems with using historical standard deviation as a proxy for risk, we'll continue to implement the standard model for now.  In this standard model for MPT, one may construct a portfolio, by holding assets in optimal quantities--or relative weights--that lies along a line of all optimal portfolios, or along the Efficient Frontier.


      Points along the Efficient Frontier may be found by solving an optimization problem which finds, for example, the best weighting of assets to maximize return for a given risk tolerance. This is the approach I take. The mpt.py module closely follows the gradient method described by William Sharpe to accomplish this.

      The Software Model
      While I've tried to keep things fairly functionally oriented, I have created two objects that accomplish the tasks of managing the data structures and providing methods. The first object is defined by the Stock class. It has methods to load data and to store metrics about an asset.  The metrics are populated when the class calls metrics functions from the metrics module on its data.

      Before I use my Stock object, I'll start from scratch by pre-populating a sqlite database file with some useful price data. You only have to do this once, or as many times as you want to keep your data fresh:



      Notice that I've loaded symbols from a csv file (available in the github repo) that has a short sampling of assets from the S&P 500. (There is also a complete list of S&P symbols in the repo). The populate_db method iterates over these symbols and pulls the price data from Yahoo Finance. Not all the assets loaded from Yahoo (I may have some symbol issues), but I got what I could. An additional asset's price data is also pulled down to be used as a benchmark. "^GSPC" is the symbol for the S&P 500 index itself. Note that passing a list or a filename as the symbols argument both work.

      Now, on to an example session using a Stock object:


      This is pretty straightforward. A Stock object and its metrics are generated, as expected. Things begin to get interesting when the other object in the MPT workflow is introduced: the Portfolio object. As one might guess, it contains a collection of Stock objects. Each Stock object has had its ratearray appropriately truncated and imputed in order for every asset's data length to match.

      Here's an example of its usage:



      Notice that the recommended allocation for a risk tolerance of 2.0 has AAPL weighted to 100% of the portfolio. This indicates that AAPL is actually on the Efficient Frontier for the given asset options--i.e. there is no combination of assets that gives the same return for less risk. When the risk tolerance is reduced to 0.5, the allocations include KO as well -- reducing the risk and return, and revealing another point on the Frontier. So, let's take a look at how the calculations were implemented in Python.

      At a high level, the approach for the optimization is to begin with equal weights of all assets. The marginal utility of all the assets is then calculated and used to determine an optimal two-asset swap. The process is repeated until two-asset swaps result in minimal (below a threshold) improvement. This is done within the confines of the upper and lower bounds specifically imposed on our weightings.

      Marginal Utility
      The marginal utility calculation is given by:



      Where:

      : Vector of expected return of assets. Uses annualized_adjusted_return for each asset.
      : Scalar value of risk tolerance in daily volatility.
      : Covariance matrix of assets in portfolio.
      : Current portfolio weighting.

      For example, the marginal utility in the portfolio we created earlier can be gotten by:



      This is interpreted as: the utility for AAPL increases at a rate of 33.19% for every unit change in the amount invested in AAPL.

      Two Asset Swap
      Taking the min and max of the matrix elements reveals which two assets to swap, and in which direction. What remains is to calculate the optimal amount to swap. This is given by:



      Where:

      : The optimal amount to swap.

      :

      :

      : A vector representing the stocks to be swapped. For our example, this looks like:

      I'll leave the derivation to William Sharpe. The implementation, given SciPy's and NumPy's tool set and expressiveness, was quite simple. Though the bounds checking is a bit of sausage making, the actual methods on the portfolio almost write themselves.

      All this is nice algorithmically, but sometimes it helps to have some graphical interaction, and a visualization of what's happening. This is where the Enthought Tool Suite (ETS) came in handy.

      A GUI App
      Up to this point the only dependencies for the MPT workflow are SQLite, NumPy, and SciPy. To get a visualization, we use a separate module (chaco_mpt_display.py) to achieve coolness. We do, however, add the full stack of Enthought tools as a dependency for this module. The good news is that all dependencies can be easily gotten with the Enthought Python Distribution. Below is a quick video of this afternoon's version of the GUI--it's changing fairly rapidly as I pile in new features.



      Any feedback or corrections are welcome -- feel free to get the code from github. Pull requests are welcome!

      * I recently found a nascent implementation that is fairly tied to a visualization -- I like my approach much better.

      by Travis (noreply@blogger.com) at September 01, 2011 07:52 PM

      August 31, 2011

      William Stein

      Sage: Creating a Viable Free Open Source Alternative to Magma, Maple, Mathematica, and MATLAB


      The goal of the Sage project is to create a viable fre open source alternative to Magma, Maple(TM), Mathematica(R), and MATLAB(R), which are the most popular non-free closed source mathematical software systems. (Maple is a trademark of Waterloo Maple Inc. Mathematica is a registered trademark of Wolfram Research Incorporated. MATLAB is a registered trademark of MathWorks. I will refer to the four systems together as ``Ma'' in the rest of this article.) Magma is (by far) the most advanced non-free system for structured abstract algebraic computation, Mathematica and Maple are popular and highly developed systems that shine at symbolic manipulation, and MATLAB is the most popular system for applied numerical mathematics. Together there are over 3,000 employees working at the companies that produce the four Ma's listed above, which take in over a hundred million dollars of revenue annually.

      By a viable free alternative to the Ma's, we mean a system that will have the important mathematical features of each Ma, with comparable speed. It will have 2d and 3d graphics, an interactive notebook-based graphical user interface, and documentation, including books, papers, school and college curriculum materials, etc. A single alternative to all of the Ma's is not necessarily a drop-in replacement for any of the Ma's; in particular, it need not run programs written in the custom languages of those systems. Thus it need not be like Octave or R, which (nearly) clone the languages of MATLAB and S, respectively. Development would instead focus on implementing functions that users demand, rather than systematically trying to implement every single function of the Ma's. The culture, architecture, and general look and feel of such a system would be very different than that of the Ma's.



      Motivation for Starting Sage


      Each of the Ma's cost substantial money, and is hence expensive for me, my collaborators, and students. The Ma's are not owned by the community like Sage is, or Wikipedia is, for that matter.


      The Ma's are closed, which means that the implementation of some algorithms are secret, in which case you are not allowed to modify or extend them.

      The Mathematica Documentation: "You should realize at the outset that while knowing about the internals of Mathematica may be of intellectual interest, it is usually much less important in practice than you might at first suppose. Indeed, in almost all practical uses of Mathematica, issues about how Mathematica works inside turn out to be largely irrelevant. Particularly in more advanced applications of Mathematica, it may sometimes seem worthwhile to try to analyze internal algorithms in order to predict which way of doing a given computation will be the most efficient. [...] But most often the analyses will not be worthwhile. For the internals of Mathematica are quite complicated.."

      The philosophy espoused in Sage, and indeed by the vast open source software community, is exactly the opposite. We want you to know about the internals, and when they are quite complicated, we want you to help make them more understandable. Indeed, Sage's growth depends on you analyzing how Sage works, improving it, and contributing your improvements back.
      sage: crt(2, 1, 3, 5)  # Chinese Remainder Theorem
      
      11
      sage: crt? # ? = documentation and examples
      Returns a solution to a Chinese Remainder Theorem...
      ...
      sage: crt?? # ?? = source code
      def crt(...):
      ...
      g, alpha, beta = XGCD(m, n)
      q, r = (b - a).quo_rem(g)
      if r != 0:
      raise ValueError("No solution ...")
      return (a + q*alpha*m) % lcm(m, n)
      Moreover, by browsing the Mercurial repository, you can see exactly who wrote or modified any particular line of code in the Sage library, when they did it, and why. Everything included in Sage is free and open source, and it will foreover remain that way.

      Linus Torvalds: "I see open source as Science. If you don't spread your ideas in the open, if you don't allow other people to look at how your ideas work and verify that they work, you are not doing Science, you are doing Witchcraft. Traditional software development models, where you keep things inside a company and hide what you are doing, are basically Witchcraft. Open source is all about the fact that it is open; people can actually look at what you are doing, and they can improve it, and they can build on top of it. [...] One of my favorite quotes from history is Newton: `If I had seen further, it has been by standing on the shoulders of giants.'"

      The design decisions of the Ma's are not made openly by the community. In contrast, important decisions about Sage development are made via open public discussions and voting that is archived on public mailing lists with thousands of subscribers.

      Every one of the Ma's uses a special mathematics-oriented interpreted programming language, which locks you into their product, makes writing some code outside mathematics unnecessarily difficult, and impacts the number of software engineers that are experts at programming in that language. In contrast, the user language of Sage is primarily the mainstream free open source language Python, which is one of the world's most popular interpreted programming languages. The Sage project neither invented nor maintains the underlying Python language, but gains immediate access to the IPython shell, Python scientific libraries (such as NumPy, SciPy, CVXopt and MatPlotLib), and a large Python community with major support from big companies such as Google. In comparison to Python, the Ma's are small players in terms of language development. Thus for Sage most of the problems of language development are handled by someone else.

      The bug tracking done for three of four of the Ma's is currently secret (MATLAB has an open bug tracker, though it requires free registration to view.), which means that there is no published accounting of all known bugs, the status of work on them, and how bugs are resolved. But the Ma's do have many bugs; see the release notes of each new version, which lists bugs that were fixed. Sage also has bugs, which are all publicly tracked at the trac server, and there are numerous ``Bug Days'' workshops devoted entirely to fixing bugs in Sage. Moreover, all discussion about resolving a given bug, including peer review of solutions, is publicly archived. We note that sadly even some prize winning free open source systems, such as GAP, do not have an open bug tracking system, resulting in people reporting the same bugs over and over again.

      Each of the Ma's is a combination of secret unchangeable compiled code and less secret interpreted code. Users with experience programming in compiled languages such as Fortran or C++ may find the loss of a compiler to be frustrating. None of the Ma's has an optimizing compiler that converts programs written in their custom interpreted language to a fast executable binary format that is not interpreted at runtime. (MATLAB has a compiler, but ``the source code is still interpreted at run-time, and performance of code should be the same whether run in standalone mode or in MATLAB.'' Mathematica also has a Compile function, but simply compiles expressions to a different internal format that is interpreted, much like Sage's fast_callable function.) In contrast, Sage is tightly integrated with Cython. The Cython project has received extensive contributions from Sage developers, and is very popular in the world of Python-based scientific computing. Cython is a Python-to-C/C++ compiler that speeds up code execution and has support for statically declaring data types (for potentially enormous speedups) and natively calling existing C/C++/Fortran code. For example, enter the following in a cell of the Sage notebook:
      def python_sum2(n):
      
      s = int(0)
      for i in xrange(1, n+1):
      s += i*i
      return s
      Then enter the following in another cell:
      %cython
      
      def cython_sum2(long n):
      cdef long i, s = 0
      for i in range(1, n+1):
      s += i*i
      return s
      The second implementation, despite looking nearly identical, is nearly a hundred times faster than the first one (your timings may vary).
      sage: timeit('python_sum2(2*10^6)')
      
      5 loops, best of 3: 154 ms per loop
      sage: timeit('cython_sum2(2*10^6)')
      125 loops, best of 3: 1.76 ms per loop
      sage: 154/1.76
      87.5

      Of course, it is better to choose a different algorithm. In case you don't remember a closed form expression for the sum of the first $n$ squares, Sage can deduce it:
      sage: var('k, n')
      
      sage: factor(sum(k^2, k, 1, n))
      1/6*(n + 1)*(2*n + 1)*n
      And now our simpler fast implementation is:
      def sum2(n):
      
      return n*(2*n+1)*(n+1)/6
      Just as above, we can also use the Cython compiler:
      %cython
      
      def c_sum2(long n):
      return n*(2*n+1)*(n+1)/6
      Comparing times, we see that Cython is 10 times faster:
      sage: n = 2*10^6
      
      sage: timeit('sum2(n)')
      625 loops, best of 3: 1.41 microseconds per loop
      sage: timeit('c_sum2(n)')
      625 loops, best of 3: 0.145 microseconds per loop
      sage: 1.41/.145
      9.72413793103448
      In this case, the enhanced speed comes at a cost, in that the answer is wrong when the input is large enough to cause an overflow:
      sage: c_sum2(2*10^6)   # WARNING: overflow
      
      -407788678951258603
      Cython is very powerful, but to fully benefit from it, one must understand machine level arithmetic data types, such as long, int, float, etc. With Sage you have that option.

      What is Sage?


      The goal of Sage is to compete with the Ma's, and the intellectual property at our disposal is the complete range of GPL-compatibly licensed open source software.

      Sage is a self-contained free open source distribution of about 100 open source software packages and libraries that aims to address all computational areas of pure and applied mathematics. See the list of packages in Sage, which includes R, Pari, Singular, GAP, Maxima, GSL, Numpy, Scipy, ATLAS, Matplotlib, and many other popular programs. The download of Sage contains all dependencies required for the normal functioning of Sage, including Python itself. Sage includes a substantial amount of code that provides a unified Python-based interface to these other packages. Sage also includes a library of new code written in Python, Cython and C/C++, which implements a huge range of algorithms.



      History


      I made the first release of Sage in February 2005, and at the time called it "Software for Arithmetic Geometry Experimentation." I was a serious user of, and contributor to, Magma at the time, and was motivated to start Sage for many of the reasons discussed above. In particular, I was personally frustrated with the top-down closed development model of Magma, the fact that several million lines of the source code of Magma are closed source, and the fees that my colleagues had to pay in order to use the substantial amount of code that I contributed to Magma. Despite my early naive hope that Magma would be open sourced, it never was. So I started Sage motivated by the dream that someday the single most important item of software I use on a daily basis would be free and open. David Joyner, David Kohel, Joe Wetherell, and Martin Albrecht were also involved in the development of Sage during the first year.

      In February 2006, the National Science Foundation funded a 2-day workshop called ``Sage Days 2006'' at UC San Diego, which had about 40 participants and speakers from several open and closed source mathematical software projects. After doing a year of fulltime mostly solitary work on Sage, I was surprised by the positive reception of Sage by members of the mathematical research community. What Sage promised was something many mathematicians wanted. Whether or not Sage would someday deliver on that promise was (and for many still is) an open question.

      I had decided when I started Sage that I would make it powerful enough for my research, with or without the help of anybody else, and was pleasantly surprised at this workshop to find that many other people were interested in helping, and understood the shortcomings of existing open source software, such as GAP and PARI, and the longterm need to move beyond Magma. Six months later, I ran another Sage Days workshop, which resulted in numerous talented young graduate students, including David Harvey, David Roe, Robert Bradshaw, and Robert Miller, getting involved in Sage development. I used startup money from University of Washington to hire Alex Clemesha as a fulltime employee to implement 2d graphics and help create a notebook interface to Sage. I also learned that there was much broader interest in such a system, and stopped referring to Sage as being exclusively for ``arithmetic geometry''; instead, Sage became ``Software for Algebra and Geometry Experimentation.'' Today the acronym is deprecated.

      The year 2007 was a major turning point for Sage. Far more people got involved with development, we had four Sage Days workshops, and prompted by Craig Citro, we instituted a requirement that all new code must have tests for 100% of the functions touched by that code, and every modification to Sage must be peer reviewed. Our peer review process is much more open than in mathematical research journals; everything that happens is publicly archived on trac. During 2007, I also secured some funding for Sage development from Microsoft Research, Google, and NSF. Also, a German graduate student studying cryptography, Martin Albrecht presented Sage at the Troph'ees du Libre competition in France, and Sage won first place in ``Scientific Software'', which led to a huge amount of good publicity, including articles in many languages around the world and appearances, for example, this Slashdot article.

      In 2008, I organized 7 Sage Days workshops at places such as IPAM (at UCLA) and the Clay Mathematics Institute, and for the first time, several people besides me made releases of Sage. In 2009, we had 8 more Sage Days workshops, and the underlying foundations of Sage improved, including development of a powerful coercion architecture. This coercion model systematically determines what happens when performing operations such as a + b, when a and b are elements of potentially different rings (or groups, or modules, etc.).
      sage: R. = PolynomialRing(ZZ)
      
      sage: f = x + 1/2; f
      x + 1/2
      sage: parent(f)
      Univariate Polynomial Ring in x over Rational Field
      We compare this with Magma (V2.17-4), which has a more ad hoc coercion system:
      > R := PolynomialRing(IntegerRing());
      
      > x + 1/2
      ^
      Runtime error in '+': Bad argument types
      Argument types given: RngUPolElt[RngInt], FldRatElt

      Robert Bradshaw and I also added support for beautiful browser-based 3D graphics to Sage, which involved writing a 3D graphics library, and adapting the free open source JMOL Java library for rendering molecules to instead plot mathematical objects.

      sage: f(x,y) = sin(x - y) * y * cos(x)
      
      sage: plot3d(f, (x,-3,3), (y,-3,3), color='red')

      In 2009, following a huge amount of porting work by Mike Hansen, development of algebraic combinatorics in Sage picked up substantial momentum, with the switch of the entire MuPAD-combinat group to Sage (forming sage-combinat), only months before the formerly free system MuPAD{\textregistered}\footnote{MuPAD is a registered trademark of SciFace Software GmbH \& Co.} was bought out by Mathworks (makers of MATLAB). In addition to work on Lie theory by Dan Bump, this also led to a massive amount of work on a category theoretic framework for Sage by Nicolas Thiery.

      In 2010, there were 13 Sage Days workshops in many parts of the world, and grant funding for Sage significantly improved, including new NSF funding for undergraduate curriculum development. I also spent much of my programming time during 2010--2011 developing a number theory library called psage, which is currently not included in Sage, but can be easily installed.

      Many aspects of Sage make it an ideal tool for teaching mathematics, so there's a steadily growing group of teachers using it: for example, there have been MAA PREP workshops on Sage for the last two years, and a third is likely to run next summer, there are regular posts on the Sage lists about setting up classroom servers, and there is an NSF-funded project called UTMOST devoted to creating undergraduate curriculum materials for Sage.

      The publications page lists 101 accepted publications that use Sage, 47 preprints, 22 theses, and 16 books, and there are surely many more ``in the wild'' that we are not aware of. According to Google Analytics, the main Sage website gets about 2,500 absolute unique visitors per day, and the website http://sagenb.org, which allows anybody to easily use Sage through their web browser, has around 700 absolute unique visitors per day.

      For many mathematicians and students, Sage is today the mature, open source, and free foundation on which they can build their research program.

      by William Stein (noreply@blogger.com) at August 31, 2011 11:06 AM