tag:blogger.com,1999:blog-68524407044140861062024-02-20T01:50:28.541+11:00Pandas Code SnippetsMark Graphhttp://www.blogger.com/profile/10462713733051104779noreply@blogger.comBlogger5125tag:blogger.com,1999:blog-6852440704414086106.post-58451529662446560222016-06-26T10:03:00.000+10:002016-06-28T14:52:48.639+10:00Monte Carlo simulationYesterday I coded a quick and dirty Monte-Carlo simulation function. Here it is.<br />
<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">def monte_carlo(df, n=1000, verbose=False):
"""Take a DataFrame of probabilities (in range 0..1) for outcomes (in columns)
for each case (in rows), and conduct a Monte Carlo simulation.
Return a tuple of three DataFrames. The first is the raw simulation results.
The second is the outcome tally of those results for each simulation.
The third DataFrame returned is a binned summary of the tally for each outcome.
"""
# contract
assert((df.sum(axis=1) == 1.0).all)
# monte carlo - we do this by rows to use the pandas.cut() function
print('Doing the MC simulation ...')
simulation = {}
for (name, series) in df.iterrows():
# for each case MC similate n outcomes ...
if verbose:
print(name)
# set up this simulation
votes = np.random.rand(n)
s = series[series > 0.0]
labels = s.index.tolist()
cuts = [0.0] + s.cumsum().tolist()
# and simulate
simulation[name] = pd.cut(votes, bins=cuts, labels=labels, precision=7)
# Take the dictionary of case outcomes above,
# Put them into a DataFrame as columns.
# So that we can subsequently tally rows of simulation outcomes below.
simulation = pd.DataFrame(simulation)
# tally the results - for each simulation
print('Tallying the results of the MC simulation ...')
tally = {}
for (name, series) in simulation.iterrows():
tally[name] = series.value_counts()
tally = pd.DataFrame.from_dict(tally, orient='index')
tally.fillna(0, inplace=True)
tally = tally.astype(int)
# - summarise into value_counts
print('Summarising the tally of the MC simulation ...')
summary = {}
for (name, series) in tally.iteritems():
summary[name] = series.value_counts()
summary = pd.DataFrame(summary)
summary.fillna(0, inplace=True)
summary = summary.astype(int)
# - return simulation summary
return (simulation, tally, summary)</pre>
</div>
<br />
<br />Mark Graphhttp://www.blogger.com/profile/10462713733051104779noreply@blogger.com0tag:blogger.com,1999:blog-6852440704414086106.post-24362417639033507172016-06-19T20:37:00.000+10:002016-06-19T20:37:14.892+10:00Combining two lists into a human readable stringI wanted to list the odds I had scraped from a website as a comma delimited string list. <br />
<br />
The first task was to join the party names with their odds. <br />
<br />
The second challenge was to bring that list into a delimited string.<br />
<br />
It took me a few goes to land the ideal approach<br />
<br />
Let's start with some play data<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">names = ['Alpha', 'Beta', 'Gamma', 'Delta']
amounts = [1.20, 4.50, 151, 19.50]</pre>
</div>
<br />
My first go at combining the party names with their odds.<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">In [19]: [' '.join([a, str(b)]) for a, b in zip(names, amounts)]
Out[19]: ['Alpha 1.2', 'Beta 4.5', 'Gamma 151', 'Delta 19.5']</pre>
</div>
<br />
My second attempt<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">In [20]: [' '.join([a, '$'+ str(b)]) for a, b in zip(names, amounts)]
Out[20]: ['Alpha $1.2', 'Beta $4.5', 'Gamma $151', 'Delta $19.5']</pre>
</div>
<br />
My third attempt - this looks about right<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">In [21]: ['{} ${:.2f}'.format(a, b) for a, b in zip(names, amounts)]
Out[21]: ['Alpha $1.20', 'Beta $4.50', 'Gamma $151.00', 'Delta $19.50']</pre>
</div>
<br />
Now, to get that into a delimited string. Voila!<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">In [22]: ', '.join(['{} ${:.2f}'.format(a, b) for a, b in zip(names, amounts)])
Out[22]: 'Alpha $1.20, Beta $4.50, Gamma $151.00, Delta $19.50'</pre>
</div>
<br />Mark Graphhttp://www.blogger.com/profile/10462713733051104779noreply@blogger.com0tag:blogger.com,1999:blog-6852440704414086106.post-35595316263282917682016-06-15T10:39:00.000+10:002016-06-17T21:54:09.823+10:00Check for different rows in two dataframesWith time series data, I am sometimes interested in what has changed between one time period and the next. Typically I store the data for a particular time period in its own DataFrame (and I keep these DataFrames in a <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.html">pandas Panel</a>. With DataFrames where columns represent variables and rows are cases, I am interested in the rows that have changed between two DataFrames. The easy way to check for changed rows is ...<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">different = (df1 != df2).any(axis=1)</pre>
</div>
<br />
Let's have a look in iPython. We start by creating two play DataFrames, each with 10 rows and 5 columns (lines 1-3). We change the first two rows on the second DataFrame (line 4). Then we do the above Boolean test for difference (at line 5). This yields a Boolean series (line 6), which we can use to look at the changed rows (line 7).<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">In [1]: df1 = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # make play data
In [2]: df1 # let's look at the play data
Out[2]:
0 1 2 3 4
0 80 54 36 41 43
1 41 55 66 52 59
2 49 56 61 35 42
3 72 67 53 58 43
4 51 45 76 64 43
5 61 22 33 44 59
6 49 51 52 47 50
7 33 56 56 39 39
8 52 27 57 19 64
9 45 64 53 43 59
In [3]: df2 = df1.copy() # copy the play data
In [4]: df2[0:2] = df2[0:2] + 1 # add 1 to each element in the first two rows
In [5]: different = (df1 != df2).any(axis=1) # test for differences in the rows
In [6]: different
Out[6]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
In [7]: df2[different] # these are the rows that were changed
Out[7]:
0 1 2 3 4
0 81 55 37 42 44
1 42 56 67 53 60</pre>
</div>
<br />
I should note, this test assumes that the two DataFrames being compared have identically labelled row index and columns. This will be the case if the DataFrames are being sliced from a pandas Panel. Mark Graphhttp://www.blogger.com/profile/10462713733051104779noreply@blogger.com0tag:blogger.com,1999:blog-6852440704414086106.post-27215013710464054282016-06-12T14:17:00.001+10:002016-06-13T07:27:27.359+10:00Plotting from pandasToday I want to talk a little about plotting from pandas.<br />
<br />
I will begin with my <i>markgraph.mpstyle</i> style-sheet. Using a style sheet allows me to maintain a consistent look and feel. In this stylesheet I have opted for a plain white background, minimal axes and very feint grid lines. I buy the Edward Tufte philosophy of reducing the non-data ink as much as possible. <br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;"># Mark Graph Style Sheet
font.family: sans-serif
font.style: normal
font.variant: normal
font.weight: medium
font.stretch: normal
font.sans-serif: Helvetica, Arial, Avant Garde, sans-serif
font.size: 14.0
lines.linewidth: 2.0
lines.solid_capstyle: butt
legend.fancybox: true
legend.fontsize: x-small
legend.framealpha: 0.5
axes.prop_cycle: cycler('color', ['ef5962', '5a9dd2', '7dc16d', 'f9a560', '9d69aa', 'cd6f5a', 'd680b3', '737373'])
axes.facecolor: white
axes.titlesize: large # fontsize of the axes title
axes.labelsize: medium # x and y label size
axes.labelcolor: 111111
axes.axisbelow: true
axes.grid: true
axes.edgecolor: white
axes.linewidth: 0.01
patch.edgecolor: white
patch.linewidth: 0.01
svg.fonttype: path
grid.linestyle: -
grid.linewidth: 0.5
grid.color: e7e7e7
xtick.major.size: 0
xtick.minor.size: 0
xtick.labelsize: small
xtick.color: 333333
ytick.major.size: 0
ytick.minor.size: 0
ytick.labelsize: small
ytick.color: 333333
figure.figsize: 8, 4 # inches
figure.facecolor: white
text.color: black
savefig.edgecolor: white
savefig.facecolor: white</pre>
</div>
<br />
Once a style-sheet is i place (typically saved in the directory I do my work), creating a new plot is easy. I use a recipe like the following to create my plots.<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;"># --- pandas and numpy initialisation
import pandas as pd
import numpy as np
# --- matplotlib initialisation
import matplotlib.pyplot as plt
plt.style.use('markgraph.mplstyle')
# --- create some fake data
df = pd.DataFrame(np.random.randn(100, 6)).cumsum()
df.columns = ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta']
# --- plot the data
ax = df.plot(kind='line') # required
ax.set_ylabel('Index') # optional
ax.set_xlabel('Days since inception') # optional
ax.xaxis.set_ticks([0, 20, 40, 60, 80, 100]) # optional
ax.axhline(0, color='#AAAAAA', linestyle='-', linewidth=0.5) # optional
fig = ax.get_figure() # required
fig.suptitle('Fake Data', linespacing=1.2) # optional
fig.tight_layout() # advisable
# - put the legend above - optional - can be fiddly to get right
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width, box.height * 0.88])
ax.legend(bbox_to_anchor=(0.5, 1.13), loc='upper center', ncol=6, numpoints=1)
# - add a footnote - optional
fig.text(0.99, 0.01, 'pandascodesnippets.blogspot.com.au', ha='right', va='bottom',
fontsize='x-small', fontstyle='italic', color='#999999')
# - and save
fig.savefig('example-chart.png', dpi=125) # required</pre>
</div>
<br />
Which yields the following chart (or something like it).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwj5lwmwM647skkQinsNku72vzJG9BUcV3jmfhC8nDZaoiwhQyIaly77f4WzpPGrfx8QeVURRaAHlCG7Zh7yQpNLkt-9E5O4iMfXw8QWeVu3qb6-t0rqKj5b4ic4JLwUBLIYaCpwPrHgKp/s1600/example-chart.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwj5lwmwM647skkQinsNku72vzJG9BUcV3jmfhC8nDZaoiwhQyIaly77f4WzpPGrfx8QeVURRaAHlCG7Zh7yQpNLkt-9E5O4iMfXw8QWeVu3qb6-t0rqKj5b4ic4JLwUBLIYaCpwPrHgKp/s640/example-chart.png" width="640" /></a></div>
<br />
If you don't have the time to play with your own style-sheet, you can use one of the many built-in style sheets. For example, if I change the matplotlib initialisation part of my code, I can use the built-in style <span style="font-family: "courier new" , "courier" , monospace;">plt.style.use('ggplot')</span> to get ...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf9fuAwLK6YWQGDGd3m6-_kYSJUzTPL_QalY897gf6WHJuoMv4-BGntTaVWIbaxqaGRxb6ruLIXauiTE9PLwyDZ4kypvwkyPND0HUBl7Ge3WKu8dEvi7CmKcnPxoreDPmRO6UC7UTh2xtp/s1600/example-chart.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf9fuAwLK6YWQGDGd3m6-_kYSJUzTPL_QalY897gf6WHJuoMv4-BGntTaVWIbaxqaGRxb6ruLIXauiTE9PLwyDZ4kypvwkyPND0HUBl7Ge3WKu8dEvi7CmKcnPxoreDPmRO6UC7UTh2xtp/s640/example-chart.png" width="640" /></a></div>
<br />
There are a host of other built-in styles, including: 'bmh', 'classic', 'dark_background', ' fivethirtyeight', and 'grayscale'. <a href="https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html">For a complete list, check out this site</a>. My experience has been patchy when it comes to getting the built-in style-sheets to deliver everything I like. Mark Graphhttp://www.blogger.com/profile/10462713733051104779noreply@blogger.com0tag:blogger.com,1999:blog-6852440704414086106.post-88546599155910163482016-06-11T18:22:00.001+10:002016-06-12T12:49:23.658+10:00Comparing the maximum and next highest values in each rowThis morning I wanted to sort a DataFrame by the difference between the maximum value in a row and the next (or equal) highest value in that row. For each row, there was no way of predicting which column would have the highest value and which would have the next highest value.<br />
<br />
This is how I did it.<br />
<br />
The first line in the iPython code snippet creates some play data. It is a DataFrame with 5 columns and 10 rows. We have a quick look at our play data on line 2.<br />
<br />
I created a function to return the difference between highest value in a Pandas Series and the next or equal highest value. This function is at 3.<br />
<br />
Next, I applied that function to each row in the DataFrame, ranked the result, and returned the rank as an integer. Then, I added that Series of ranks as a new column in the DataFrame, This is on 4.<br />
<br />
Almost done, at 5 we sort the DataFrame by its rank.<br />
<br />
Finally at 6 we have a quick look at the results.You can see the original line position from the index column at the left. You can see that two rows have the rank of 7th place. <br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">In [1]: df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data
In [2]: df
Out[2]:
0 1 2 3 4
0 48 52 56 38 66
1 60 65 52 38 48
2 50 61 50 38 53
3 67 57 43 50 50
4 54 56 54 60 51
5 55 44 49 62 53
6 41 48 53 51 44
7 48 43 60 42 48
8 38 58 75 62 58
9 30 58 44 57 53
In [3]: def differand(x):
...: y = x.sort_values(ascending=False)
...: return y.iat[0] - y.iat[1] # first position minus second ...
...:
In [4]: df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int)
In [5]: df.sort_values('rank', inplace=True)
In [6]: df
Out[6]:
0 1 2 3 4 rank
9 30 58 44 57 53 1
6 41 48 53 51 44 2
4 54 56 54 60 51 3
1 60 65 52 38 48 4
5 55 44 49 62 53 5
2 50 61 50 38 53 6
0 48 52 56 38 66 7
3 67 57 43 50 50 7
7 48 43 60 42 48 9
8 38 58 75 62 58 10</pre>
</div>
<br />
More concisely, without the iPython, we have:<br />
<div class="post-body entry-content" id="post-body-3160232610746660605" itemprop="description articleBody" style="background-color: white; color: #333333; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13px; line-height: 1.4; position: relative; width: 656px;">
<pre style="background-color: lightyellow; border: 1px dashed rgb(153 , 153 , 153); font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 653px;">df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data
def differand(x):
"""Receive a Pandas Series and return the difference between
the highest value in the Series and the next (or equal)
highest value in that Series."""
y = x.sort_values(ascending=False)
return y.iat[0] - y.iat[1]
df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int)
df.sort_values('rank', inplace=True)</pre>
</div>
Mark Graphhttp://www.blogger.com/profile/10462713733051104779noreply@blogger.com0