Pages

Wednesday 15 June 2016

Check for different rows in two dataframes

With time series data, I am sometimes interested in what has changed between one time period and the next. Typically I store the data for a particular time period in its own DataFrame (and I keep these DataFrames in a pandas Panel. With DataFrames where columns represent variables and rows are cases, I am interested in the rows that have changed between two DataFrames. The easy way to check for changed rows is ...
different = (df1 != df2).any(axis=1)

Let's have a look in iPython. We start by creating two play DataFrames, each with 10 rows and 5 columns (lines 1-3). We change the first two rows on the second DataFrame (line 4). Then we do the above Boolean test for difference (at line 5). This yields a Boolean series (line 6), which we can use to look at the changed rows (line 7).
In [1]: df1 = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # make play data

In [2]: df1 # let's look at the play data
Out[2]: 
    0   1   2   3   4
0  80  54  36  41  43
1  41  55  66  52  59
2  49  56  61  35  42
3  72  67  53  58  43
4  51  45  76  64  43
5  61  22  33  44  59
6  49  51  52  47  50
7  33  56  56  39  39
8  52  27  57  19  64
9  45  64  53  43  59

In [3]: df2 = df1.copy() # copy the play data

In [4]: df2[0:2] = df2[0:2] + 1 # add 1 to each element in the first two rows

In [5]: different = (df1 != df2).any(axis=1) # test for differences in the rows

In [6]: different
Out[6]: 
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

In [7]: df2[different] # these are the rows that were changed
Out[7]: 
    0   1   2   3   4
0  81  55  37  42  44
1  42  56  67  53  60

I should note, this test assumes that the two DataFrames being compared have identically labelled row index and columns. This will be the case if the DataFrames are being sliced from a pandas Panel.

No comments:

Post a Comment