Pandas Code Snippets

Monte Carlo simulation

2016-06-26T10:03:00.000+10:00

Yesterday I coded a quick and dirty Monte-Carlo simulation function. Here it is.

def monte_carlo(df, n=1000, verbose=False):
    """Take a DataFrame of probabilities (in range 0..1) for outcomes (in columns)
       for each case (in rows), and conduct a Monte Carlo simulation.
       Return a tuple of three DataFrames. The first is the raw simulation results.
       The second is the outcome tally of those results for each simulation. 
       The third DataFrame returned is a binned summary of the tally for each outcome.
    """

    # contract
    assert((df.sum(axis=1) == 1.0).all)
    
    # monte carlo - we do this by rows to use the pandas.cut() function
    print('Doing the MC simulation ...')
    simulation = {}
    for (name, series) in df.iterrows():
        # for each case MC similate n outcomes ...
        if verbose:
            print(name)
        # set up this simulation
        votes = np.random.rand(n)
        s = series[series > 0.0]
        labels = s.index.tolist()
        cuts = [0.0] + s.cumsum().tolist()
        # and simulate
        simulation[name] = pd.cut(votes, bins=cuts, labels=labels, precision=7)

    # Take the dictionary of case outcomes above, 
    # Put them into a DataFrame as columns.
    # So that we can subsequently tally rows of simulation outcomes below.
    simulation = pd.DataFrame(simulation)

    # tally the results - for each simulation
    print('Tallying the results of the MC simulation ...')
    tally = {}
    for (name, series) in simulation.iterrows():  
        tally[name] = series.value_counts()
    tally = pd.DataFrame.from_dict(tally, orient='index')
    tally.fillna(0, inplace=True)
    tally = tally.astype(int)
    
    # - summarise into value_counts
    print('Summarising the tally of the MC simulation ...')
    summary = {}
    for (name, series) in tally.iteritems():
        summary[name] = series.value_counts()
    summary = pd.DataFrame(summary)
    summary.fillna(0, inplace=True)
    summary = summary.astype(int)

    # - return simulation summary
    return (simulation, tally, summary)

Combining two lists into a human readable string

2016-06-19T20:37:00.000+10:00

I wanted to list the odds I had scraped from a website as a comma delimited string list.

The first task was to join the party names with their odds.

The second challenge was to bring that list into a delimited string.

It took me a few goes to land the ideal approach

Let's start with some play data

names = ['Alpha', 'Beta', 'Gamma', 'Delta']
amounts = [1.20, 4.50, 151, 19.50]

My first go at combining the party names with their odds.

In [19]: [' '.join([a, str(b)]) for a, b in zip(names, amounts)]
Out[19]: ['Alpha 1.2', 'Beta 4.5', 'Gamma 151', 'Delta 19.5']

My second attempt

In [20]: [' '.join([a, '$'+ str(b)]) for a, b in zip(names, amounts)]
Out[20]: ['Alpha $1.2', 'Beta $4.5', 'Gamma $151', 'Delta $19.5']

My third attempt - this looks about right

In [21]: ['{} ${:.2f}'.format(a, b) for a, b in zip(names, amounts)]
Out[21]: ['Alpha $1.20', 'Beta $4.50', 'Gamma $151.00', 'Delta $19.50']

Now, to get that into a delimited string. Voila!

In [22]: ', '.join(['{} ${:.2f}'.format(a, b) for a, b in zip(names, amounts)])
Out[22]: 'Alpha $1.20, Beta $4.50, Gamma $151.00, Delta $19.50'

Check for different rows in two dataframes

2016-06-15T10:39:00.000+10:00

With time series data, I am sometimes interested in what has changed between one time period and the next. Typically I store the data for a particular time period in its own DataFrame (and I keep these DataFrames in a pandas Panel. With DataFrames where columns represent variables and rows are cases, I am interested in the rows that have changed between two DataFrames. The easy way to check for changed rows is ...

different = (df1 != df2).any(axis=1)

Let's have a look in iPython. We start by creating two play DataFrames, each with 10 rows and 5 columns (lines 1-3). We change the first two rows on the second DataFrame (line 4). Then we do the above Boolean test for difference (at line 5). This yields a Boolean series (line 6), which we can use to look at the changed rows (line 7).

In [1]: df1 = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # make play data

In [2]: df1 # let's look at the play data
Out[2]: 
    0   1   2   3   4
0  80  54  36  41  43
1  41  55  66  52  59
2  49  56  61  35  42
3  72  67  53  58  43
4  51  45  76  64  43
5  61  22  33  44  59
6  49  51  52  47  50
7  33  56  56  39  39
8  52  27  57  19  64
9  45  64  53  43  59

In [3]: df2 = df1.copy() # copy the play data

In [4]: df2[0:2] = df2[0:2] + 1 # add 1 to each element in the first two rows

In [5]: different = (df1 != df2).any(axis=1) # test for differences in the rows

In [6]: different
Out[6]: 
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

In [7]: df2[different] # these are the rows that were changed
Out[7]: 
    0   1   2   3   4
0  81  55  37  42  44
1  42  56  67  53  60

I should note, this test assumes that the two DataFrames being compared have identically labelled row index and columns. This will be the case if the DataFrames are being sliced from a pandas Panel.

Plotting from pandas

2016-06-12T14:17:00.001+10:00

Today I want to talk a little about plotting from pandas.

I will begin with my markgraph.mpstyle style-sheet. Using a style sheet allows me to maintain a consistent look and feel. In this stylesheet I have opted for a plain white background, minimal axes and very feint grid lines. I buy the Edward Tufte philosophy of reducing the non-data ink as much as possible.

# Mark Graph Style Sheet

font.family: sans-serif
font.style: normal
font.variant: normal
font.weight: medium
font.stretch: normal
font.sans-serif: Helvetica, Arial, Avant Garde, sans-serif
font.size: 14.0

lines.linewidth: 2.0
lines.solid_capstyle: butt

legend.fancybox: true
legend.fontsize: x-small
legend.framealpha: 0.5

axes.prop_cycle: cycler('color', ['ef5962', '5a9dd2', '7dc16d', 'f9a560', '9d69aa', 'cd6f5a', 'd680b3', '737373'])
axes.facecolor: white
axes.titlesize: large  # fontsize of the axes title
axes.labelsize: medium # x and y label size
axes.labelcolor: 111111
axes.axisbelow: true
axes.grid: true
axes.edgecolor: white
axes.linewidth: 0.01

patch.edgecolor: white
patch.linewidth: 0.01

svg.fonttype: path

grid.linestyle: -
grid.linewidth: 0.5
grid.color: e7e7e7

xtick.major.size: 0
xtick.minor.size: 0
xtick.labelsize: small
xtick.color: 333333
ytick.major.size: 0
ytick.minor.size: 0
ytick.labelsize: small
ytick.color: 333333

figure.figsize: 8, 4 # inches
figure.facecolor: white

text.color: black

savefig.edgecolor: white
savefig.facecolor: white

Once a style-sheet is i place (typically saved in the directory I do my work), creating a new plot is easy. I use a recipe like the following to create my plots.

# --- pandas  and numpy initialisation
import pandas as pd
import numpy as np

# --- matplotlib initialisation
import matplotlib.pyplot as plt
plt.style.use('markgraph.mplstyle')
 
# --- create some fake data
df = pd.DataFrame(np.random.randn(100, 6)).cumsum()
df.columns = ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta']

# --- plot the data
ax = df.plot(kind='line') # required
ax.set_ylabel('Index') # optional
ax.set_xlabel('Days since inception') # optional
ax.xaxis.set_ticks([0, 20, 40, 60, 80, 100]) # optional

ax.axhline(0, color='#AAAAAA', linestyle='-', linewidth=0.5) # optional

fig = ax.get_figure() # required

fig.suptitle('Fake Data', linespacing=1.2) # optional
fig.tight_layout() # advisable

# - put the legend above - optional - can be fiddly to get right
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width, box.height * 0.88])
ax.legend(bbox_to_anchor=(0.5, 1.13), loc='upper center', ncol=6, numpoints=1)

# - add a footnote - optional
fig.text(0.99, 0.01, 'pandascodesnippets.blogspot.com.au', ha='right', va='bottom',
        fontsize='x-small', fontstyle='italic', color='#999999')

# - and save
fig.savefig('example-chart.png', dpi=125) # required

Which yields the following chart (or something like it).

If you don't have the time to play with your own style-sheet, you can use one of the many built-in style sheets. For example, if I change the matplotlib initialisation part of my code, I can use the built-in style plt.style.use('ggplot') to get ...

There are a host of other built-in styles, including: 'bmh', 'classic', 'dark_background', ' fivethirtyeight', and 'grayscale'. For a complete list, check out this site. My experience has been patchy when it comes to getting the built-in style-sheets to deliver everything I like.

Comparing the maximum and next highest values in each row

2016-06-11T18:22:00.001+10:00

This morning I wanted to sort a DataFrame by the difference between the maximum value in a row and the next (or equal) highest value in that row. For each row, there was no way of predicting which column would have the highest value and which would have the next highest value.

This is how I did it.

The first line in the iPython code snippet creates some play data. It is a DataFrame with 5 columns and 10 rows. We have a quick look at our play data on line 2.

I created a function to return the difference between highest value in a Pandas Series and the next or equal highest value. This function is at 3.

Next, I applied that function to each row in the DataFrame, ranked the result, and returned the rank as an integer. Then, I added that Series of ranks as a new column in the DataFrame, This is on 4.

Almost done, at 5 we sort the DataFrame by its rank.

Finally at 6 we have a quick look at the results.You can see the original line position from the index column at the left. You can see that two rows have the rank of 7th place.

In [1]: df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data

In [2]: df
Out[2]: 
    0   1   2   3   4
0  48  52  56  38  66
1  60  65  52  38  48
2  50  61  50  38  53
3  67  57  43  50  50
4  54  56  54  60  51
5  55  44  49  62  53
6  41  48  53  51  44
7  48  43  60  42  48
8  38  58  75  62  58
9  30  58  44  57  53

In [3]: def differand(x):
   ...:         y = x.sort_values(ascending=False)
   ...:         return y.iat[0] - y.iat[1] # first position minus second ...
   ...: 

In [4]: df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int)

In [5]: df.sort_values('rank', inplace=True)

In [6]: df
Out[6]: 
    0   1   2   3   4  rank
9  30  58  44  57  53     1
6  41  48  53  51  44     2
4  54  56  54  60  51     3
1  60  65  52  38  48     4
5  55  44  49  62  53     5
2  50  61  50  38  53     6
0  48  52  56  38  66     7
3  67  57  43  50  50     7
7  48  43  60  42  48     9
8  38  58  75  62  58    10

More concisely, without the iPython, we have:

df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data

def differand(x): 
    """Receive a Pandas Series and return the difference between
       the highest value in the Series and the next (or equal)
       highest value in that Series."""
    y = x.sort_values(ascending=False)
    return y.iat[0] - y.iat[1]

df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int)

df.sort_values('rank', inplace=True)