Monthly Archives: July 2013

Data analysis with pandas: enjoy the awesome

Over the past months I have frequently worked with pandas, a data analysis framework for Python initiated by Wes McKinney. Unfortunately, I wasn’t aware of this powerful package earlier; that would have saved a lot of time. pandas is tightly integrated with numpy and matplotlib, so if you are familiar with those, you can smoothly jump into the pandas world. In the short time of working with pandas, I have frequently been amazed by its power and simplicity. In fact, pandas was called a “Python killer app” in the PyCon 2013 keynote, putting it at eye level with Zope and Django. In this post, I will present a short example giving a feeling for why pandas is so neat.

Consider N independent repetitions of the same experiment. Let each experiment yield various metrics for a set of system components. The goal usually is to statistically evaluate the data obtained from these repetitions. A simple example: each experiment consists of measuring velocity and weight for three different cars: opel, benz, audi. The experiment is repeated three times, and each repetition yields one data set:

data1.txt:

car,weight,velocity
audi,2.1,100.3
opel,1.5,55.2
benz,3.2,80.9

data2.txt:

car,weight,velocity
audi,2.3,107.1
opel,1.2,50.1
benz,3.3,82.4

data3.txt:

car,weight,velocity
audi,2.6,110.7
opel,1.4,51.2
benz,3.2,89.1

For each system component (car), there are N (three) values for each metric (velocity, weight). Let us compute the mean as well as the standard deviation for each metric and car, sort the data set by mean velocity, and print the result.

Short version first

import pandas as pd
import numpy as np
 
print pd.concat([pd.read_csv("data%s.txt" % i) for i in xrange(1,4)]).groupby(
    "car").agg([np.mean, np.std]).sort([('velocity', 'mean')])

Output:

        weight              velocity          
          mean       std        mean       std
car                                           
opel  1.366667  0.152753   52.166667  2.683903
benz  3.233333  0.057735   84.133333  4.366158
audi  2.333333  0.251661  106.033333  5.281414

It’s as simple as that. This looks like pure magic, but actually is not. The one-liner above is quite straightforward to develop and simple to understand. What follows is a slow, step-by-step explanation.

Long version

First of all, a Python list comprehension is used together with pandas’ read_csv function for reading the three data files. Each data file ends up in one of the most important pandas data types: a DataFrame, which is internally based on a numpy array.

>>> dataframes = [pd.read_csv("data%s.txt" % i) for i in xrange(1,4)]
>>> type(dataframes[0])
<class 'pandas.core.frame.DataFrame'>

One of the nice things about pandas is that each of the payload-carrying data types has a convenient text representation for the console (there are also HTML representations for using pandas in the context of an IPython notebook):

>>> dataframes[0]
    car  weight  velocity
0  audi     2.1     100.3
1  opel     1.5      55.2
2  benz     3.2      80.9

As you can see above in the first column, pandas has assigned a numeric index to each of the rows. The indexing concept is extremely important in general, but it is not the focus of this article.
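
Just to give a flavor of what the index enables, even though we will not need it here: any column can be promoted to the row index. A minimal sketch, using the first data set from above:

# Promote the 'car' column to the row index, so that rows can be
# addressed by label instead of by position (not needed below).
by_car = dataframes[0].set_index('car')
print by_car.loc['audi']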

Now, let’s just vertically concatenate the three data sets:

>>> concatenated = pd.concat(dataframes)
>>> concatenated
    car  weight  velocity
0  audi     2.1     100.3
1  opel     1.5      55.2
2  benz     3.2      80.9
0  audi     2.3     107.1
1  opel     1.2      50.1
2  benz     3.3      82.4
0  audi     2.6     110.7
1  opel     1.4      51.2
2  benz     3.2      89.1

I think the concept is clear. We have just merged three DataFrames into one. Next, we make use of pandas’ groupby method and group the data by car type:

>>> grouping = concatenated.groupby("car")

The resulting grouping is an abstract representation of the groups in the data set. Let’s have a look at its text representation:

>>> grouping.groups
{'benz': [2, 2, 2], 'opel': [1, 1, 1], 'audi': [0, 0, 0]}

The concatenated data set was divided into three groups, one for each car type. The numbers above are the row indices from the concatenated data set; we do not use them here (otherwise we should have made sure that there are no duplicates). grouping still contains the raw data; it is just not shown in the default text representation.
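
As an aside: if the duplicate row indices were a concern, pd.concat could build a fresh index right away. A small sketch (the groupby below works fine either way):

# ignore_index=True discards the per-file row indices and assigns a
# new continuous index (0 through 8) to the concatenated rows.
concatenated_clean = pd.concat(dataframes, ignore_index=True)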

The goal now is to merge the data within each of the groups in a controlled fashion. Remember, for each car we want to calculate the mean and the standard deviation of each metric. This is where the awesome aggregate method of a grouping (agg for short) comes into play. Via this method, the user can define any function to be used for aggregating the group-internal data. Even better, one can provide a list of functions: the aggregation is then performed with each of them, and the resulting data set has one column for each combination of metric and aggregation function. This is what we get when we apply numpy’s mean and std functions:

>>> merged = grouping.agg([np.mean, np.std])
>>> merged
        weight              velocity          
          mean       std        mean       std
car                                           
audi  2.333333  0.251661  106.033333  5.281414
benz  3.233333  0.057735   84.133333  4.366158
opel  1.366667  0.152753   52.166667  2.683903

The result is again a DataFrame, which can be sorted by any column. The column index is now a two-level hierarchy (each original column has sub-columns for mean and std). Let’s sort by the mean value of the velocity:

>>> merged_sorted = merged.sort([('velocity', 'mean')])
>>> merged_sorted
        weight              velocity          
          mean       std        mean       std
car                                           
opel  1.366667  0.152753   52.166667  2.683903
benz  3.233333  0.057735   84.133333  4.366158
audi  2.333333  0.251661  106.033333  5.281414

So, what is the slowest car?

>>> merged_sorted.iloc[0]
weight    mean     1.366667
          std      0.152753
velocity  mean    52.166667
          std      2.683903
Name: opel, dtype: float64

An Opel, and we love pandas for telling us.

This simple example could have been solved with the standard library. However, before you end up writing your own tabular data type (like I once did), you should seriously consider using pandas instead. pandas’ DataFrame supports plenty of convenient selection/indexing schemes as well as data conversion methods that you don’t want to re-implement. The one-liner above does not care how many rows and columns your data has, and all of this works for very large amounts of data: pandas uses efficient numpy operations where it can, makes heavy use of Cython in other places, and is generally designed for efficient large-scale data analysis. Have a look at the documentation to see how powerful pandas is.
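
To give a flavor of those selection schemes, here is a small sketch reusing the merged_sorted DataFrame from this example (nothing below is specific to this data set):

# A few ways to select from the aggregated result above:
print merged_sorted['velocity']             # all aggregate columns of one metric
print merged_sorted[('velocity', 'mean')]   # a single column of the hierarchy
print merged_sorted.loc['audi']             # one row, selected by its index label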

The boundless collection of (your) Internet connection data…

…simply accept it? Signing a petition is the least one can do. For example this one: http://www.change.org/prism

Beyond that:

  • Get informed: what exactly are we actually talking about? Heise has published a good technical overview article. Insider information on “PRISM” is available e.g. here and here, even if we of course mostly arrive at a different assessment. Wikipedia also tries to provide an overview, although its completeness is of course open to doubt. And please do not forget what the British are doing. Also worth recommending is this summarizing article by Heise.
  • And maybe also go out and demonstrate for once, like here (have a look from the 4:30 mark):

    Unfortunately, only a few hundred people were there.

I can recommend this wonderful article by Mathias Priebe. He approaches the topic in a very personal and yet abstract way.

An interesting excerpt from a Spiegel article:

“If the federal government has an interest in further clarification, it should grant him (Snowden, ed.) a residence permit for political reasons,” says ProAsyl managing director Burghardt. For the Green Party’s domestic affairs expert Hans-Christian Ströbele, this is even imperative. “Since by now even the Federal Public Prosecutor is investigating possible espionage against Germany, the federal government must offer Snowden not only asylum but, as in the case of the tax informants from Switzerland, possibly even witness protection,” he says. “If the BND advances millions in tax evasion cases and issues guarantees, but does not do so when it comes to the data security of all Germans, that would be a scandal,” says the member of the Bundestag.

Bitwise identity of NetCDF data — hashing numpy arrays

In a molecular dynamics (MD) simulation, a molecular system is evolved over time, i.e. the movement of single atoms at a given temperature is simulated. The output of such a simulation is a so-called MD trajectory: it is written to disk while the simulation is running and contains snapshots of the molecular system at certain points in time. These snapshots are called frames. Each frame contains the coordinates of all atoms in the molecular system for one single point in time. Considering N atoms, three spatial dimensions and F frames, 3 x N x F floating point numbers are stored in a trajectory file. Amber, a well-established molecular dynamics package, can store this data in binary form in a NetCDF file.

Given two trajectory files in NetCDF format, I wanted to find out whether the coordinates of both trajectories are equal up to a certain frame number. The background: the two trajectory files were created by independent simulations that started from the same input data on the same hardware (an Nvidia GTX 580), and one of the simulations unexpectedly crashed in the middle of the run. If both simulations produced bitwise identical output up to the point where one of them crashed, external factors such as temperature or voltage are the most likely reason for the crash. If the simulations diverged before the crash, an error in the Amber code or a general hardware problem with the GTX 580 is more likely.

So, how can we find out whether both simulations produced bitwise identical output up to the point where one of them crashed? We have two NetCDF files and want to see if one contains an exact subset of the numerical data of the other. If we expected the entire files to be identical, we could just apply a hash function to both files and compare the digests. We already know that the files are not identical; only a certain part of the data contained in both files should be checked for equality. So a good idea is to apply a hash function to the raw data representing exactly these parts.

Luckily, NetCDF is a well-structured file format with strong language support. I love Python and used its netCDF4 module. Its usage is pretty straightforward:

In [1]: from netCDF4 import Dataset

Load the trajectory file that is incomplete and have a look at the file structure, especially at the shape of the coordinates variable:

In [2]: old = Dataset('production_NVT.mdcrd.OLD')
In [3]: print old.variables['coordinates']
<type 'netCDF4.Variable'>
float32 coordinates(frame, atom, spatial)
    units: angstrom
unlimited dimensions: frame
current shape = (1768, 46010, 3)

From the shape we can infer that this trajectory file contains 1768 frames of 46010 atoms each and that we have, surprise, surprise, three spatial dimensions.

Load the trajectory file that is complete:

In [4]: new = Dataset('production_NVT.mdcrd')
In [5]: print new.variables['coordinates']
<type 'netCDF4.Variable'>
float32 coordinates(frame, atom, spatial)
    units: angstrom
unlimited dimensions: frame
current shape = (2500, 46010, 3)

It contains 2500 frames, which was the final goal of the simulation.

Upon access, the coordinates are returned as numpy arrays with numpy data types. In this case, the coordinate values are stored as single precision (32 bit) floating point numbers, as you can see here for one frame of the damaged trajectory:

In [6]: old.variables['coordinates'][1,:,:].dtype
Out[6]: dtype('float32')
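
A short side note on data volume, as a back-of-the-envelope sketch based on the shapes printed above: hashing the first 1768 frames means reading the corresponding coordinate slice into memory first.

# frames * atoms * spatial dimensions * sizeof(float32):
n_bytes = 1768 * 46010 * 3 * 4
print n_bytes / 1024.0 ** 3   # roughly 0.9 GiB held in memory while hashing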

Now, we can just go ahead and build a check sum of the coordinates belonging to the first 1768 frames of both trajectories:

In [7]: import hashlib
In [8]: hashlib.sha1(new.variables['coordinates'][:1768,:,:]).hexdigest()
Out[8]: 'b8bd51001aedbdc52de0d9e39ceb5284e9641a8a'
In [9]: hashlib.sha1(old.variables['coordinates'][:1768,:,:]).hexdigest()
Out[9]: 'b8bd51001aedbdc52de0d9e39ceb5284e9641a8a'

They are identical. But which data is actually used for building the hash? The goal is to make sure that the compared data is bitwise identical, because this is what we expect from working Amber code and from a valid CUDA device. Of course, we can take a step back and just compare the numerical values first:

In [10]: (old.variables['coordinates'][:1768,:,:] == new.variables['coordinates'][:1768,:,:]).all()
Out[10]: True

Okay, numpy considers all coordinate values in the first 1768 frames to be equal. What does this mean? The == operator compares floating point values exactly, element by element; no tolerance such as the machine epsilon is involved. For single precision, that epsilon would be:

In [11]: np.finfo(np.single).eps
Out[11]: 1.1920929e-07

Quite a large value, but as said, it plays no role in the == comparison above. Exact value equality is already a strong statement, yet it is not quite the same as bitwise identity: +0.0 and -0.0 compare equal, for instance, and NaN never compares equal to itself. Just for completeness, a tolerance-based comparison with very tight explicit tolerances agrees as well:

In [12]: np.allclose(old.variables['coordinates'][:1768,:,:], new.variables['coordinates'][:1768,:,:], rtol=1e-10, atol=1e-12)
Out[12]: True

So the two data sets agree in value, element by element. But how can we actually prove bitwise identity? The easiest way is hashing, but we first have to validate that the hash function actually considers all the bits of the data. Let’s make up a small example for this:

>>> import numpy as np
>>> from hashlib import sha1
>>> a = np.random.rand(5)
>>> a
array([ 0.1841292 ,  0.25379848,  0.27524327,  0.15762775,  0.88225297])
>>> a.dtype
dtype('float64')
>>> sha1(a).hexdigest()
'59db294ed35dc9137394198413ca7fca592a7ac2'

So far, we have created an array a containing 5 randomly generated double precision (64 bit) floating point numbers, and we have built a corresponding hash value from this array, as done above for the trajectory data. Now, numpy provides a convenient way to re-interpret the raw array data: the data is treated as a plain sequence of bits onto which a new data type is mapped. Above, a contains 5 * 64 bit = 320 bit of raw data. Interpreted with an 8 bit unsigned integer data type, a looks like this:

>>> a.view(dtype=np.uint8)
array([ 24,  30,  26, 171, 139, 145, 199,  63, 188,  18,  20, 248,  59,
        62, 208,  63,   0, 236,  40, 243, 149, 157, 209,  63, 104,  99,
       220,  99,  37,  45, 196,  63, 204,  14, 233, 147, 106,  59, 236,  63], dtype=uint8)
>>> len(a.view(dtype=np.uint8))
40

Obviously, 320 bit of raw data result in exactly 40 values of an 8 bit integer type. The view creates a different representation of the same raw data. What happens when we hash this representation?

>>> sha1(a.view(dtype=np.uint8)).hexdigest()
'59db294ed35dc9137394198413ca7fca592a7ac2'

The same digest as above. This shows that changing the representation does not change the check sum. But does it prove that the check sum is built from the raw data? A simple check:

>>> a[1]
0.2537984774246913
>>> a[1] = 0.2537984774246912
>>> sha1(a.view(dtype=np.uint8)).hexdigest()
'7ffdda4de9654c11f0c8387b82d83b577cdc64cc'
>>> sha1(a).hexdigest()
'7ffdda4de9654c11f0c8387b82d83b577cdc64cc'

Above, one numerical value in a was changed. As a consequence, the direct check sum of a as well as the check sum of a in uint8 representation also changed. So, yes, the raw numerical data is used for building the check sum.
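
One more cross-check along the same lines (a small sketch, assuming the array a from above): hashing an explicit copy of the raw bytes should yield the same digest as hashing the array directly.

# a.tostring() returns a copy of the raw array bytes; the digest matches
# the one obtained from hashing the array object itself.
assert sha1(a.tostring()).hexdigest() == sha1(a).hexdigest()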

Hence,

sha1(new.variables['coordinates'][:1768,:,:]).hexdigest()

builds a check sum from all bits of the numerical raw data of the first 1768 frames in the given trajectory. The equality of both check sums means that the coordinates in the first 1768 frames of both trajectories are bitwise identical.

This probably means that both the hardware and the software were working just fine until, at some point between writing frame 1768 and frame 1769, something happened that made the simulation crash.
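
For reuse, the whole check can be wrapped into a small helper along these lines (a sketch only; coords_digest is a made-up name, and the interactive session above does exactly the same):

from netCDF4 import Dataset
import hashlib

def coords_digest(path, n_frames):
    """Return the SHA-1 digest of the raw coordinate data of the
    first n_frames frames stored in the NetCDF trajectory at path."""
    nc = Dataset(path)
    try:
        return hashlib.sha1(nc.variables['coordinates'][:n_frames, :, :]).hexdigest()
    finally:
        nc.close()

# Usage, mirroring the comparison above:
# coords_digest('production_NVT.mdcrd', 1768) == coords_digest('production_NVT.mdcrd.OLD', 1768)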

By the way, this was done with Python 2.7.3 and numpy 1.6.2.