Bitwise identity of NetCDF data — hashing numpy arrays

In a molecular dynamics (MD) simulation, a molecular system is evolved over time, i.e. the movement of individual atoms at a given temperature is simulated. The output of such a simulation is a so-called MD trajectory. It is written to disk while the simulation is running and contains snapshots of the molecular system at certain points in time. These snapshots are called frames. Each frame contains the coordinates of all atoms in the molecular system for one single point in time. Considering N atoms, three spatial dimensions and F frames, 3 x N x F floating point numbers are stored within a trajectory file. Amber, a well-established molecular dynamics software, can store this data in binary form in a NetCDF file.

Given two trajectory files in NetCDF format, I wanted to find out if the coordinates of both trajectories are equal up to a certain frame number. The background: these two trajectory files were created by independent simulations starting from the same input data on the same hardware (an Nvidia GTX 580), and one of the simulations unexpectedly crashed in the middle of the run. If both simulations produced bitwise identical output up to the point where one of them crashed, then external factors such as temperature or voltage were most likely the reason for the crash. If the simulations diverged before the crash, an error in the Amber code or a general hardware problem with the GTX 580 is more likely.

So, how can we find out if both simulations produced bitwise identical output up to the point where one crashed? We have two NetCDF files and want to see whether one contains an exact subset of the numerical data of the other. If we expected the entire files to be identical, we could simply apply a hash function to each file as a whole. But we already know that the files are not identical; only a certain part of the data contained in both files should be checked for equality. So a good approach is to apply a hash function to the raw data representing exactly that part.

Luckily, NetCDF is a well-structured file format with strong language support. I love Python and used its netCDF4 module. Its usage is pretty straightforward:

In [1]: from netCDF4 import Dataset

Load the incomplete trajectory file and inspect the file structure, especially the shape of the coordinates variable:

In [2]: old = Dataset('production_NVT.mdcrd.OLD')
In [3]: print old.variables['coordinates']
<type 'netCDF4.Variable'>
float32 coordinates(frame, atom, spatial)
    units: angstrom
unlimited dimensions: frame
current shape = (1768, 46010, 3)

From the shape we can infer that this trajectory file contains 1768 frames made of 46010 atoms each and that we have — surprise, surprise — three spatial dimensions.

Load the trajectory file that is complete:

In [4]: new = Dataset('production_NVT.mdcrd')
In [5]: print new.variables['coordinates']
<type 'netCDF4.Variable'>
float32 coordinates(frame, atom, spatial)
    units: angstrom
unlimited dimensions: frame
current shape = (2500, 46010, 3)

It contains 2500 frames, which was the final goal of the simulation.

Upon access, the coordinates are returned as numpy arrays containing numpy data types. In this case, coordinate values are stored as single precision (32 bit) floating point numbers, as you can see for the set of coordinates belonging to the first frame of the damaged trajectory:

In [6]: old.variables['coordinates'][0,:,:].dtype
Out[6]: dtype('float32')
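Before hashing, it is worth estimating how much raw data is involved. A quick back-of-the-envelope calculation (plain Python 3, with the numbers taken from the shapes above):

```python
# Raw data volume of the coordinates to be hashed: the first 1768 frames,
# 46010 atoms each, three spatial dimensions, 4 bytes per float32 value.
frames, atoms, spatial = 1768, 46010, 3
itemsize = 4  # bytes per single precision float

nbytes = frames * atoms * spatial * itemsize
print(nbytes)  # 976148160 bytes, roughly 931 MiB per trajectory slice
```

So each hash digest below summarizes close to a gigabyte of coordinate data.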

Now, we can just go ahead and build a check sum of the coordinates belonging to the first 1768 frames of both trajectories:

In [7]: import hashlib
In [8]: hashlib.sha1(new.variables['coordinates'][:1768,:,:]).hexdigest()
Out[8]: 'b8bd51001aedbdc52de0d9e39ceb5284e9641a8a'
In [9]: hashlib.sha1(old.variables['coordinates'][:1768,:,:]).hexdigest()
Out[9]: 'b8bd51001aedbdc52de0d9e39ceb5284e9641a8a'

They are identical. But which data is actually used for building the hash? The goal is to make sure that the compared data is bitwise identical, because this is what we expect from working Amber code and a valid CUDA device. Of course, we can take a step back and just compare the numerical values first:

In [10]: (old.variables['coordinates'][:1768,:,:] == new.variables['coordinates'][:1768,:,:]).all()
Out[10]: True

Okay, numpy considers all coordinate values in the first 1768 frames to be equal. What does this mean exactly? For floating point numbers, numpy's == operator performs an exact value comparison; no epsilon tolerance is applied. The machine epsilon of the data type (with numpy imported as np) is:

In [11]: np.finfo(np.single).eps
Out[11]: 1.1920929e-07

This is the spacing between 1.0 and the next representable single precision number, i.e. a measure of precision, not a tolerance used by ==. Exact value equality comes close to bitwise identity, but not all the way: 0.0 == -0.0 evaluates to True although the two values differ in their sign bit, and NaN values never compare equal, not even to themselves. For completeness, we can also run a tolerance-based comparison with bounds far tighter than the machine epsilon:

In [12]: np.allclose(old.variables['coordinates'][:1768,:,:], new.variables['coordinates'][:1768,:,:], rtol=1e-10, atol=1e-12)
Out[12]: True
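The gap between floating point value equality and bitwise identity can be made concrete with a small sketch (plain numpy, independent of the trajectory data):

```python
import numpy as np

# Two single precision zeros whose sign bits differ:
pos = np.float32(0.0)
neg = np.float32(-0.0)

print(pos == neg)                      # True: value equality holds
print(pos.tobytes() == neg.tobytes())  # False: the raw bytes differ

# NaN is the opposite case: bitwise identical, but never value-equal.
nan = np.float32('nan')
print(nan == nan)                      # False
print(nan.tobytes() == nan.tobytes())  # True
```

Trajectory coordinates should contain neither NaNs nor negative zeros, but hashing the raw bytes sidesteps these subtleties entirely.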

Now we have shown that the two data sets are, at the very least, extremely similar. But how can we actually prove bitwise identity? Hashing is the easiest way, but we first have to validate that the hash function actually considers all the bits of the data. Let's construct a small example for this:

>>> import numpy as np
>>> from hashlib import sha1
>>> a = np.random.rand(5)
>>> a
array([ 0.1841292 ,  0.25379848,  0.27524327,  0.15762775,  0.88225297])
>>> a.dtype
dtype('float64')
>>> sha1(a).hexdigest()
'59db294ed35dc9137394198413ca7fca592a7ac2'

So far, we have created an array a containing 5 randomly generated double precision (64 bit) floating point numbers and built a corresponding hash value from it, just as done above for the trajectory data. Now, numpy provides a convenient method to re-interpret the raw array data: the data is treated as one long sequence of bits that becomes masked by a new data type. Above, a contains 5 * 64 bit = 320 bit of raw data. Interpreted with an 8 bit unsigned integer data type, a looks like this:

>>> a.view(dtype=np.uint8)
array([ 24,  30,  26, 171, 139, 145, 199,  63, 188,  18,  20, 248,  59,
        62, 208,  63,   0, 236,  40, 243, 149, 157, 209,  63, 104,  99,
       220,  99,  37,  45, 196,  63, 204,  14, 233, 147, 106,  59, 236,  63], dtype=uint8)
>>> len(a.view(dtype=np.uint8))
40

Obviously, 320 bit of raw data correspond to exactly 40 values of 8 bit each. The view creates a different representation of the same raw data. What happens when we hash this representation?

>>> sha1(a.view(dtype=np.uint8)).hexdigest()
'59db294ed35dc9137394198413ca7fca592a7ac2'

The same digest as above. This shows that changing the representation does not change the check sum. But does it prove that the check sum is built from the raw data? A simple check:

>>> a[1]
0.2537984774246913
>>> a[1] = 0.2537984774246912
>>> sha1(a.view(dtype=np.uint8)).hexdigest()
'7ffdda4de9654c11f0c8387b82d83b577cdc64cc'
>>> sha1(a).hexdigest()
'7ffdda4de9654c11f0c8387b82d83b577cdc64cc'

Above, one numerical value in a was changed in its last decimal digit. As a consequence, the direct check sum of a changed, and so did the check sum of the uint8 representation. So, yes, the raw numerical data is used for building the check sum.

Hence,

sha1(new.variables['coordinates'][:1768,:,:]).hexdigest()

builds a check sum from all bits of the numerical raw data of the first 1768 frames in the given trajectory. The equality of both check sums means that the coordinates in the first 1768 frames of the two trajectories are bitwise identical.
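The whole procedure can be condensed into a small helper that compares the first n frames of two coordinate arrays by their raw bytes. This is only a sketch in modern Python 3 syntax; the function name is my own, and it operates on plain numpy arrays rather than netCDF4 variables:

```python
import hashlib
import numpy as np

def first_frames_identical(coords_a, coords_b, n_frames):
    """Return True if the first n_frames frames of both coordinate
    arrays are bitwise identical, judged by their SHA-1 digests."""
    digest_a = hashlib.sha1(np.ascontiguousarray(coords_a[:n_frames])).hexdigest()
    digest_b = hashlib.sha1(np.ascontiguousarray(coords_b[:n_frames])).hexdigest()
    return digest_a == digest_b

# Tiny stand-in for trajectory data: 4 frames, 2 atoms, 3 dimensions.
a = np.arange(24, dtype=np.float32).reshape(4, 2, 3)
b = a.copy()
b[2:] += 1.0  # let the "trajectories" diverge from frame 2 on

print(first_frames_identical(a, b, 2))  # True
print(first_frames_identical(a, b, 3))  # False
```

With netCDF4, coords_a and coords_b would be the sliced coordinates variables of the two Dataset objects.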

This strongly suggests that both hardware and software were working just fine, and that at some point between writing frames 1768 and 1769 something happened that made the simulation crash.
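One practical caveat: hashlib reads the array's underlying memory buffer directly, which requires the array to be C-contiguous; hashing a transposed or otherwise strided view fails. A more defensive variant (my own sketch, in modern Python 3 / numpy syntax; tobytes() was named tostring() in older numpy releases such as the 1.6.2 used here) serializes the data explicitly first:

```python
import hashlib
import numpy as np

def raw_sha1(a):
    """SHA-1 digest of the raw bytes of an array, independent of layout."""
    # tobytes() copies the data into a C-ordered byte string, so this also
    # works for transposed or sliced views whose buffer hashlib could not
    # read directly.
    return hashlib.sha1(a.tobytes()).hexdigest()

a = np.random.rand(4, 3)

# For a contiguous array this matches hashing the array object directly:
print(raw_sha1(a) == hashlib.sha1(a).hexdigest())  # True

# A transposed view holds the same values in a different memory order,
# hence a different byte stream and a different digest:
print(raw_sha1(a.T) == raw_sha1(a))  # False
```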

By the way, this was done with Python 2.7.3 and numpy 1.6.2.