Over the past few months I have frequently worked with pandas, a data analysis framework for Python initiated by Wes McKinney. Unfortunately, I wasn’t aware of this powerful package earlier; that would have saved a lot of time. Pandas is tightly integrated with numpy and matplotlib, so if you are familiar with those, you can smoothly jump into the pandas world. In the short time of working with pandas, I have frequently been amazed by its power and simplicity. In fact, pandas was called a “Python killer app” in the PyCon 2013 keynote, putting it on par with Zope and Django. In this post, I will present a short example giving a feeling for why pandas is so neat.
Consider N independent repetitions of the same experiment. Let each experiment yield various metrics for a set of system components. The goal usually is to statistically evaluate the data obtained by repeating the experiment. A simple example: each experiment consists of measuring velocity and weight for three different cars: opel, benz, audi. This experiment is repeated three times. From each repetition, we got one data set:
data1.txt:
car,weight,velocity
audi,2.1,100.3
opel,1.5,55.2
benz,3.2,80.9
data2.txt:
car,weight,velocity
audi,2.3,107.1
opel,1.2,50.1
benz,3.3,82.4
data3.txt:
car,weight,velocity
audi,2.6,110.7
opel,1.4,51.2
benz,3.2,89.1
For each system component (car), there are N (three) values for each metric (velocity, weight). Let us compute the mean as well as the standard deviation for each metric and car, sort the data set by the mean velocity, and print the result.
Short version first
import pandas as pd
import numpy as np

print pd.concat([pd.read_csv("data%s.txt" % i) for i in xrange(1, 4)]).groupby(
    "car").agg([np.mean, np.std]).sort([('velocity', 'mean')])
Output:
        weight           velocity
          mean       std       mean       std
car
opel  1.366667  0.152753  52.166667  2.683903
benz  3.233333  0.057735  84.133333  4.366158
audi  2.333333  0.251661 106.033333  5.281414
It’s as simple as that. It looks like pure magic, but it actually is not: the one-liner above is straightforward to develop and simple to understand. What follows now is a slow, step-by-step explanation.
Long version
First of all, a Python list comprehension is used together with pandas’ read_csv function to read the three data files. Each file ends up in one of the most important pandas data types: a DataFrame, which is built on top of a numpy array.
>>> dataframes = [pd.read_csv("data%s.txt" % i) for i in xrange(1, 4)]
>>> type(dataframes[0])
<class 'pandas.core.frame.DataFrame'>
One of the nice things about pandas is that each of the payload-containing data types has a convenient text representation for the console (there are also HTML representations for using pandas in an IPython Notebook!):
>>> dataframes[0]
    car  weight  velocity
0  audi     2.1     100.3
1  opel     1.5      55.2
2  benz     3.2      80.9
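As a minimal sketch, the HTML representation mentioned above can also be produced directly; to_html is a regular DataFrame method (the exact markup depends on the pandas version):

>>> html_table = dataframes[0].to_html()  # the same table as an HTML markup string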
As you can see in the first column of the plain-text output above, pandas has assigned a numeric index to each of the rows. The indexing concept is extremely important in general, but not of interest in this article.
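Just to give a feeling for what the index is good for, here is a tiny sketch: it allows selecting single rows, e.g. by position via iloc (which we will use again at the end of this post):

>>> dataframes[0].iloc[1]
car         opel
weight       1.5
velocity    55.2
Name: 1, dtype: object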
Now, let’s just vertically concatenate the three data sets:
>>> concatenated = pd.concat(dataframes)
>>> concatenated
    car  weight  velocity
0  audi     2.1     100.3
1  opel     1.5      55.2
2  benz     3.2      80.9
0  audi     2.3     107.1
1  opel     1.2      50.1
2  benz     3.3      82.4
0  audi     2.6     110.7
1  opel     1.4      51.2
2  benz     3.2      89.1
I think the concept is clear. We have just merged three DataFrames into one. Next, we make use of pandas’ groupby method and group the data by car type:
>>> grouping = concatenated.groupby("car")
The resulting grouping is an abstract representation of the groups in the data set. Let’s have a look at their text representation:
>>> grouping.groups
{'benz': [2, 2, 2], 'opel': [1, 1, 1], 'audi': [0, 0, 0]}
The concatenated data set was divided into three groups, one for each car type. The numbers above are the ‘indices’ from the concatenated data set; we do not use them here (otherwise we should have made sure that there are no duplicates). grouping still contains the raw data; it is just not shown in the default text representation.
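To convince ourselves that the raw data is still there, we can pull a single group out; get_group is a regular method of the grouping object:

>>> grouping.get_group("audi")
    car  weight  velocity
0  audi     2.1     100.3
0  audi     2.3     107.1
0  audi     2.6     110.7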
The goal now is to merge the data within each of the groups in a controlled fashion. Remember, for each car we want to calculate the mean and standard deviation of each metric. This is where the awesome aggregate (or agg for short) method of a grouping comes into play. Via this method, the user can define any function to be used for aggregating the group-internal data. Even better, one can provide a list of functions: the aggregation is then performed with each of the functions, and the resulting data set has one column for each combination of metric and aggregation function. This is what we get when we apply numpy’s mean and std functions:
>>> merged = grouping.agg([np.mean, np.std])
>>> merged
        weight           velocity
          mean       std       mean       std
car
audi  2.333333  0.251661 106.033333  5.281414
benz  3.233333  0.057735  84.133333  4.366158
opel  1.366667  0.152753  52.166667  2.683903
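As a side note, agg is not restricted to numpy functions: any callable that reduces a sequence to a single value works. A minimal sketch computing the velocity span (max minus min) per car should yield something like:

>>> grouping["velocity"].agg(lambda v: v.max() - v.min())
car
audi    10.4
benz     8.2
opel     5.1
Name: velocity, dtype: float64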
The merged data set is a DataFrame again, which can be sorted by any column. Its column index is now a two-level hierarchy (each original column has a sub-level containing the mean and std columns). Let’s sort by the mean value of the velocity:
>>> merged_sorted = merged.sort([('velocity', 'mean')])
>>> merged_sorted
        weight           velocity
          mean       std       mean       std
car
opel  1.366667  0.152753  52.166667  2.683903
benz  3.233333  0.057735  84.133333  4.366158
audi  2.333333  0.251661 106.033333  5.281414
So, what is the slowest car?
>>> merged_sorted.iloc[0]
weight    mean     1.366667
          std      0.152753
velocity  mean    52.166667
          std      2.683903
Name: opel, dtype: float64
An Opel, and we love pandas for telling us.
This simple example could have been solved with the standard library alone. However, before you end up writing your own tabular data type (like I once did), you should seriously consider using pandas instead. pandas’ DataFrame supports plenty of convenient selection/indexing schemes as well as data conversion methods that you do not want to re-implement. The one-liner above does not care how many rows and columns your data has, and all of this works with huge amounts of data: pandas uses efficient numpy operations where it can, makes heavy use of Cython in other places, and is generally designed for efficient large-scale data analysis. Have a look at the documentation to see how powerful pandas is.
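For example, one of the selection schemes mentioned above is boolean indexing; here is a minimal sketch selecting all measurements with a velocity above 100:

>>> concatenated[concatenated["velocity"] > 100]
    car  weight  velocity
0  audi     2.1     100.3
0  audi     2.3     107.1
0  audi     2.6     110.7

The same pattern composes with column selection and groupby, so ad-hoc queries stay one-liners.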