Pages

Saturday, 11 June 2016

Comparing the maximum and next highest values in each row

This morning I wanted to sort a DataFrame by the difference between the maximum value in a row and the next (or equal) highest value in that row. For each row, there was no way of predicting which column would have the highest value and which would have the next highest value.

This is how I did it.

The first line in the iPython code snippet creates some play data. It is a DataFrame with 5 columns and 10 rows. We have a quick look at our play data on line 2.

I created a function to return the difference between highest value in a Pandas Series and the next or equal highest value. This function is at 3.

Next, I applied that function to each row in the DataFrame, ranked the result, and returned the rank as an integer. Then, I added that Series of ranks as a new column in the DataFrame, This is on 4.

Almost done, at 5 we sort the DataFrame by its rank.

Finally at 6 we have a quick look at the results.You can see the original line position from the index column at the left. You can see that two rows have the rank of 7th place.
In [1]: df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data

In [2]: df
Out[2]: 
    0   1   2   3   4
0  48  52  56  38  66
1  60  65  52  38  48
2  50  61  50  38  53
3  67  57  43  50  50
4  54  56  54  60  51
5  55  44  49  62  53
6  41  48  53  51  44
7  48  43  60  42  48
8  38  58  75  62  58
9  30  58  44  57  53

In [3]: def differand(x):
   ...:         y = x.sort_values(ascending=False)
   ...:         return y.iat[0] - y.iat[1] # first position minus second ...
   ...: 

In [4]: df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int)

In [5]: df.sort_values('rank', inplace=True)

In [6]: df
Out[6]: 
    0   1   2   3   4  rank
9  30  58  44  57  53     1
6  41  48  53  51  44     2
4  54  56  54  60  51     3
1  60  65  52  38  48     4
5  55  44  49  62  53     5
2  50  61  50  38  53     6
0  48  52  56  38  66     7
3  67  57  43  50  50     7
7  48  43  60  42  48     9
8  38  58  75  62  58    10

More concisely, without the iPython, we have:
df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data

def differand(x): 
    """Receive a Pandas Series and return the difference between
       the highest value in the Series and the next (or equal)
       highest value in that Series."""
    y = x.sort_values(ascending=False)
    return y.iat[0] - y.iat[1]

df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int)

df.sort_values('rank', inplace=True)

No comments:

Post a Comment