Data analysis with pandas: enjoy the awesome

Within the past few months I have frequently gotten my hands on pandas, a data analysis framework for Python initiated by Wes McKinney. Unfortunately, I was not aware of this powerful package earlier; that would have saved a lot of time. Pandas is tightly integrated with numpy and matplotlib, so if you are familiar with those, you can smoothly jump into the pandas world. In the short time of working with pandas, I have frequently been amazed by its power and simplicity. In fact, pandas was called a “Python killer app” during the PyCon 2013 keynote, putting it on eye level with Zope and Django. In this post, I will present a short example giving a feeling for why pandas is so neat.

Consider N independent repetitions of the same experiment. Let each experiment yield various metrics for a set of system components. The goal usually is to statistically evaluate the data obtained from the repeated experiment. A simple example: each experiment consists of measuring velocity and weight of three different cars: opel, benz, audi. This experiment is repeated three times, and each repetition yields one data set:

data1.txt:

car,weight,velocity
audi,2.1,100.3
opel,1.5,55.2
benz,3.2,80.9

data2.txt:

car,weight,velocity
audi,2.3,107.1
opel,1.2,50.1
benz,3.3,82.4

data3.txt:

car,weight,velocity
audi,2.6,110.7
opel,1.4,51.2
benz,3.2,89.1
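
In case you would like to reproduce the steps below, the following few lines create these three files (a minimal sketch that simply writes the literal file contents shown above):

# Write the three example data files to the current working directory.
contents = {
    "data1.txt": "car,weight,velocity\naudi,2.1,100.3\nopel,1.5,55.2\nbenz,3.2,80.9\n",
    "data2.txt": "car,weight,velocity\naudi,2.3,107.1\nopel,1.2,50.1\nbenz,3.3,82.4\n",
    "data3.txt": "car,weight,velocity\naudi,2.6,110.7\nopel,1.4,51.2\nbenz,3.2,89.1\n",
}
for filename, content in contents.items():
    with open(filename, "w") as f:
        f.write(content)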

For each system component (car), there now are N (three) values for each metric (velocity, weight). Let us compute the mean value as well as the standard deviation for each metric and car, sort the data set by mean velocity, and print the result.

Short version first

import pandas as pd
import numpy as np
 
# Read the three files, stack them vertically, group by car, aggregate each
# group with mean and standard deviation, and sort by mean velocity.
print pd.concat([pd.read_csv("data%s.txt" % i) for i in xrange(1,4)]).groupby(
    "car").agg([np.mean, np.std]).sort([('velocity', 'mean')])

Output:

        weight              velocity          
          mean       std        mean       std
car                                           
opel  1.366667  0.152753   52.166667  2.683903
benz  3.233333  0.057735   84.133333  4.366158
audi  2.333333  0.251661  106.033333  5.281414

It’s as simple as that. This looks like pure magic, but actually is not: the one-liner above is quite straightforward to develop and simple to understand. What follows now is a slow, step-by-step explanation.

Long version

First of all, a Python list comprehension is used together with pandas’ read_csv function for reading the three data files. Each data file ends up in an instance of one of the most important pandas data types: a DataFrame, which is actually backed by a numpy array.

>>> dataframes = [pd.read_csv("data%s.txt" % i) for i in xrange(1,4)]
>>> type(dataframes[0])
<class 'pandas.core.frame.DataFrame'>

One of the nice things about pandas is that each of the payload-containing data types has a convenient text representation for the console (there also are HTML representations for using pandas within an IPython Notebook!):

>>> dataframes[0]
    car  weight  velocity
0  audi     2.1     100.3
1  opel     1.5      55.2
2  benz     3.2      80.9

As you can see above in the first column, pandas has assigned a numeric index to each of the rows. The indexing concept is extremely important in general, but not of interest in this article.
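
By the way, since a DataFrame is backed by a numpy array, the raw payload is just one attribute access away. A small sketch (the exact output formatting may vary with your numpy/pandas versions; the dtype is object because the car column contains strings):

>>> dataframes[0].values
array([['audi', 2.1, 100.3],
       ['opel', 1.5, 55.2],
       ['benz', 3.2, 80.9]], dtype=object)
>>> dataframes[0].dtypes
car          object
weight      float64
velocity    float64
dtype: object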

Now, let’s just vertically concatenate the three data sets:

>>> concatenated = pd.concat(dataframes)
>>> concatenated
    car  weight  velocity
0  audi     2.1     100.3
1  opel     1.5      55.2
2  benz     3.2      80.9
0  audi     2.3     107.1
1  opel     1.2      50.1
2  benz     3.3      82.4
0  audi     2.6     110.7
1  opel     1.4      51.2
2  benz     3.2      89.1

I think the concept is clear. We have just merged three DataFrames into one. Next, we make use of pandas’ groupby method and group the data by car type:

>>> grouping = concatenated.groupby("car")

The resulting grouping is an abstract representation of the groups in the data set. Let’s have a look at the text representation of its groups attribute:

>>> grouping.groups
{'benz': [2, 2, 2], 'opel': [1, 1, 1], 'audi': [0, 0, 0]}

The concatenated data set was divided into three groups, one for each car type. The numbers above are the row indices from the concatenated data set; we do not use them here (otherwise we should have made sure that there are no duplicates). grouping still contains the raw data, it is just not shown in the default text representation.
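
By the way, if unique row labels were required, concat could renumber the rows on the fly via its ignore_index parameter. A small sketch:

>>> pd.concat(dataframes, ignore_index=True)
    car  weight  velocity
0  audi     2.1     100.3
1  opel     1.5      55.2
2  benz     3.2      80.9
3  audi     2.3     107.1
4  opel     1.2      50.1
5  benz     3.3      82.4
6  audi     2.6     110.7
7  opel     1.4      51.2
8  benz     3.2      89.1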

The goal now is to merge the data within each of the groups in a controlled fashion. Remember, for each car we want to calculate the mean and the standard deviation of each metric. This is where the awesome aggregate (or, for short, agg) method of a grouping comes into play. Via this method, the user can define any function to be used for aggregating the group-internal data. Even better, one can define a list of functions: the aggregation is then performed with each of the functions, and the resulting data set has one column for each combination of metric and aggregation function. So this is what we get when we apply numpy’s mean as well as std functions:

>>> merged = grouping.agg([np.mean, np.std])
>>> merged
        weight              velocity          
          mean       std        mean       std
car                                           
audi  2.333333  0.251661  106.033333  5.281414
benz  3.233333  0.057735   84.133333  4.366158
opel  1.366667  0.152753   52.166667  2.683903
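
The function list is not restricted to numpy routines: any callable that reduces the values of a group to a single number works, and its __name__ becomes the column label. As a quick sketch (the spread function below is made up for illustration, and the output formatting may differ slightly between pandas versions), here is the peak-to-peak distance within each group added to the mix:

>>> def spread(values):
...     # Peak-to-peak distance of the values within one group.
...     return values.max() - values.min()
...
>>> grouping.agg([np.mean, spread])
        weight          velocity        
          mean  spread        mean  spread
car                                       
audi  2.333333     0.5  106.033333    10.4
benz  3.233333     0.1   84.133333     8.2
opel  1.366667     0.3   52.166667     5.1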

The result of such an aggregation is a DataFrame again, which can be sorted by any column. The column labels are now assembled in a two-level hierarchy: each original column has a sub-level containing the mean and std columns. Let’s sort merged by the mean value of the velocity:

>>> merged_sorted = merged.sort([('velocity', 'mean')])
>>> merged_sorted
        weight              velocity          
          mean       std        mean       std
car                                           
opel  1.366667  0.152753   52.166667  2.683903
benz  3.233333  0.057735   84.133333  4.366158
audi  2.333333  0.251661  106.033333  5.281414

So, what is the slowest car?

>>> merged_sorted.iloc[0]
weight    mean     1.366667
          std      0.152753
velocity  mean    52.166667
          std      2.683903
Name: opel, dtype: float64

An Opel, and we love pandas for telling us.
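
As a side note, the two-level column index is also convenient for selection: indexing with a (metric, function) tuple yields a single column as a Series. A small sketch:

>>> merged_sorted[('velocity', 'mean')]
car
opel     52.166667
benz     84.133333
audi    106.033333
Name: (velocity, mean), dtype: float64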

This simple example could have been solved with the standard library alone. However, before you end up writing your own tabular data type (like I once did), you seriously should consider using pandas instead. pandas’ DataFrame supports plenty of convenient selection/indexing schemes as well as data conversion methods that you do not want to re-implement. The one-liner above does not care how many rows and columns your data has, and all of this can be done with very large amounts of data: pandas uses efficient numpy operations where it can, makes heavy use of Cython in other places, and is generally designed for efficient large-scale data analysis. Have a look at the documentation to see how powerful pandas is.
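
To give one small taste of those selection schemes, here is boolean indexing applied to the concatenated frame from above, selecting all audi rows:

>>> concatenated[concatenated['car'] == 'audi']
    car  weight  velocity
0  audi     2.1     100.3
0  audi     2.3     107.1
0  audi     2.6     110.7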
