def monte_carlo(df, n=1000, verbose=False): """Take a DataFrame of probabilities (in range 0..1) for outcomes (in columns) for each case (in rows), and conduct a Monte Carlo simulation. Return a tuple of three DataFrames. The first is the raw simulation results. The second is the outcome tally of those results for each simulation. The third DataFrame returned is a binned summary of the tally for each outcome. """ # contract assert((df.sum(axis=1) == 1.0).all) # monte carlo - we do this by rows to use the pandas.cut() function print('Doing the MC simulation ...') simulation = {} for (name, series) in df.iterrows(): # for each case MC similate n outcomes ... if verbose: print(name) # set up this simulation votes = np.random.rand(n) s = series[series > 0.0] labels = s.index.tolist() cuts = [0.0] + s.cumsum().tolist() # and simulate simulation[name] = pd.cut(votes, bins=cuts, labels=labels, precision=7) # Take the dictionary of case outcomes above, # Put them into a DataFrame as columns. # So that we can subsequently tally rows of simulation outcomes below. simulation = pd.DataFrame(simulation) # tally the results - for each simulation print('Tallying the results of the MC simulation ...') tally = {} for (name, series) in simulation.iterrows(): tally[name] = series.value_counts() tally = pd.DataFrame.from_dict(tally, orient='index') tally.fillna(0, inplace=True) tally = tally.astype(int) # - summarise into value_counts print('Summarising the tally of the MC simulation ...') summary = {} for (name, series) in tally.iteritems(): summary[name] = series.value_counts() summary = pd.DataFrame(summary) summary.fillna(0, inplace=True) summary = summary.astype(int) # - return simulation summary return (simulation, tally, summary)
Pandas Code Snippets
Sunday 26 June 2016
Monte Carlo simulation
Yesterday I coded a quick and dirty Monte-Carlo simulation function. Here it is.
Sunday 19 June 2016
Combining two lists into a human readable string
I wanted to list the odds I had scraped from a website as a comma delimited string list.
The first task was to join the party names with their odds.
The second challenge was to bring that list into a delimited string.
It took me a few goes to land the ideal approach
Let's start with some play data
My first go at combining the party names with their odds.
My second attempt
My third attempt - this looks about right
Now, to get that into a delimited string. Voila!
The first task was to join the party names with their odds.
The second challenge was to bring that list into a delimited string.
It took me a few goes to land the ideal approach
Let's start with some play data
names = ['Alpha', 'Beta', 'Gamma', 'Delta'] amounts = [1.20, 4.50, 151, 19.50]
My first go at combining the party names with their odds.
In [19]: [' '.join([a, str(b)]) for a, b in zip(names, amounts)] Out[19]: ['Alpha 1.2', 'Beta 4.5', 'Gamma 151', 'Delta 19.5']
My second attempt
In [20]: [' '.join([a, '$'+ str(b)]) for a, b in zip(names, amounts)] Out[20]: ['Alpha $1.2', 'Beta $4.5', 'Gamma $151', 'Delta $19.5']
My third attempt - this looks about right
In [21]: ['{} ${:.2f}'.format(a, b) for a, b in zip(names, amounts)] Out[21]: ['Alpha $1.20', 'Beta $4.50', 'Gamma $151.00', 'Delta $19.50']
Now, to get that into a delimited string. Voila!
In [22]: ', '.join(['{} ${:.2f}'.format(a, b) for a, b in zip(names, amounts)]) Out[22]: 'Alpha $1.20, Beta $4.50, Gamma $151.00, Delta $19.50'
Wednesday 15 June 2016
Check for different rows in two dataframes
With time series data, I am sometimes interested in what has changed between one time period and the next. Typically I store the data for a particular time period in its own DataFrame (and I keep these DataFrames in a pandas Panel. With DataFrames where columns represent variables and rows are cases, I am interested in the rows that have changed between two DataFrames. The easy way to check for changed rows is ...
Let's have a look in iPython. We start by creating two play DataFrames, each with 10 rows and 5 columns (lines 1-3). We change the first two rows on the second DataFrame (line 4). Then we do the above Boolean test for difference (at line 5). This yields a Boolean series (line 6), which we can use to look at the changed rows (line 7).
I should note, this test assumes that the two DataFrames being compared have identically labelled row index and columns. This will be the case if the DataFrames are being sliced from a pandas Panel.
different = (df1 != df2).any(axis=1)
Let's have a look in iPython. We start by creating two play DataFrames, each with 10 rows and 5 columns (lines 1-3). We change the first two rows on the second DataFrame (line 4). Then we do the above Boolean test for difference (at line 5). This yields a Boolean series (line 6), which we can use to look at the changed rows (line 7).
In [1]: df1 = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # make play data In [2]: df1 # let's look at the play data Out[2]: 0 1 2 3 4 0 80 54 36 41 43 1 41 55 66 52 59 2 49 56 61 35 42 3 72 67 53 58 43 4 51 45 76 64 43 5 61 22 33 44 59 6 49 51 52 47 50 7 33 56 56 39 39 8 52 27 57 19 64 9 45 64 53 43 59 In [3]: df2 = df1.copy() # copy the play data In [4]: df2[0:2] = df2[0:2] + 1 # add 1 to each element in the first two rows In [5]: different = (df1 != df2).any(axis=1) # test for differences in the rows In [6]: different Out[6]: 0 True 1 True 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False dtype: bool In [7]: df2[different] # these are the rows that were changed Out[7]: 0 1 2 3 4 0 81 55 37 42 44 1 42 56 67 53 60
I should note, this test assumes that the two DataFrames being compared have identically labelled row index and columns. This will be the case if the DataFrames are being sliced from a pandas Panel.
Sunday 12 June 2016
Plotting from pandas
Today I want to talk a little about plotting from pandas.
I will begin with my markgraph.mpstyle style-sheet. Using a style sheet allows me to maintain a consistent look and feel. In this stylesheet I have opted for a plain white background, minimal axes and very feint grid lines. I buy the Edward Tufte philosophy of reducing the non-data ink as much as possible.
Once a style-sheet is i place (typically saved in the directory I do my work), creating a new plot is easy. I use a recipe like the following to create my plots.
Which yields the following chart (or something like it).
If you don't have the time to play with your own style-sheet, you can use one of the many built-in style sheets. For example, if I change the matplotlib initialisation part of my code, I can use the built-in style plt.style.use('ggplot') to get ...
There are a host of other built-in styles, including: 'bmh', 'classic', 'dark_background', ' fivethirtyeight', and 'grayscale'. For a complete list, check out this site. My experience has been patchy when it comes to getting the built-in style-sheets to deliver everything I like.
I will begin with my markgraph.mpstyle style-sheet. Using a style sheet allows me to maintain a consistent look and feel. In this stylesheet I have opted for a plain white background, minimal axes and very feint grid lines. I buy the Edward Tufte philosophy of reducing the non-data ink as much as possible.
# Mark Graph Style Sheet font.family: sans-serif font.style: normal font.variant: normal font.weight: medium font.stretch: normal font.sans-serif: Helvetica, Arial, Avant Garde, sans-serif font.size: 14.0 lines.linewidth: 2.0 lines.solid_capstyle: butt legend.fancybox: true legend.fontsize: x-small legend.framealpha: 0.5 axes.prop_cycle: cycler('color', ['ef5962', '5a9dd2', '7dc16d', 'f9a560', '9d69aa', 'cd6f5a', 'd680b3', '737373']) axes.facecolor: white axes.titlesize: large # fontsize of the axes title axes.labelsize: medium # x and y label size axes.labelcolor: 111111 axes.axisbelow: true axes.grid: true axes.edgecolor: white axes.linewidth: 0.01 patch.edgecolor: white patch.linewidth: 0.01 svg.fonttype: path grid.linestyle: - grid.linewidth: 0.5 grid.color: e7e7e7 xtick.major.size: 0 xtick.minor.size: 0 xtick.labelsize: small xtick.color: 333333 ytick.major.size: 0 ytick.minor.size: 0 ytick.labelsize: small ytick.color: 333333 figure.figsize: 8, 4 # inches figure.facecolor: white text.color: black savefig.edgecolor: white savefig.facecolor: white
Once a style-sheet is i place (typically saved in the directory I do my work), creating a new plot is easy. I use a recipe like the following to create my plots.
# --- pandas and numpy initialisation import pandas as pd import numpy as np # --- matplotlib initialisation import matplotlib.pyplot as plt plt.style.use('markgraph.mplstyle') # --- create some fake data df = pd.DataFrame(np.random.randn(100, 6)).cumsum() df.columns = ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta'] # --- plot the data ax = df.plot(kind='line') # required ax.set_ylabel('Index') # optional ax.set_xlabel('Days since inception') # optional ax.xaxis.set_ticks([0, 20, 40, 60, 80, 100]) # optional ax.axhline(0, color='#AAAAAA', linestyle='-', linewidth=0.5) # optional fig = ax.get_figure() # required fig.suptitle('Fake Data', linespacing=1.2) # optional fig.tight_layout() # advisable # - put the legend above - optional - can be fiddly to get right box = ax.get_position() ax.set_position([box.x0, box.y0, box.width, box.height * 0.88]) ax.legend(bbox_to_anchor=(0.5, 1.13), loc='upper center', ncol=6, numpoints=1) # - add a footnote - optional fig.text(0.99, 0.01, 'pandascodesnippets.blogspot.com.au', ha='right', va='bottom', fontsize='x-small', fontstyle='italic', color='#999999') # - and save fig.savefig('example-chart.png', dpi=125) # required
Which yields the following chart (or something like it).
If you don't have the time to play with your own style-sheet, you can use one of the many built-in style sheets. For example, if I change the matplotlib initialisation part of my code, I can use the built-in style plt.style.use('ggplot') to get ...
There are a host of other built-in styles, including: 'bmh', 'classic', 'dark_background', ' fivethirtyeight', and 'grayscale'. For a complete list, check out this site. My experience has been patchy when it comes to getting the built-in style-sheets to deliver everything I like.
Saturday 11 June 2016
Comparing the maximum and next highest values in each row
This morning I wanted to sort a DataFrame by the difference between the maximum value in a row and the next (or equal) highest value in that row. For each row, there was no way of predicting which column would have the highest value and which would have the next highest value.
This is how I did it.
The first line in the iPython code snippet creates some play data. It is a DataFrame with 5 columns and 10 rows. We have a quick look at our play data on line 2.
I created a function to return the difference between highest value in a Pandas Series and the next or equal highest value. This function is at 3.
Next, I applied that function to each row in the DataFrame, ranked the result, and returned the rank as an integer. Then, I added that Series of ranks as a new column in the DataFrame, This is on 4.
Almost done, at 5 we sort the DataFrame by its rank.
Finally at 6 we have a quick look at the results.You can see the original line position from the index column at the left. You can see that two rows have the rank of 7th place.
More concisely, without the iPython, we have:
This is how I did it.
The first line in the iPython code snippet creates some play data. It is a DataFrame with 5 columns and 10 rows. We have a quick look at our play data on line 2.
I created a function to return the difference between highest value in a Pandas Series and the next or equal highest value. This function is at 3.
Next, I applied that function to each row in the DataFrame, ranked the result, and returned the rank as an integer. Then, I added that Series of ranks as a new column in the DataFrame, This is on 4.
Almost done, at 5 we sort the DataFrame by its rank.
Finally at 6 we have a quick look at the results.You can see the original line position from the index column at the left. You can see that two rows have the rank of 7th place.
In [1]: df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data In [2]: df Out[2]: 0 1 2 3 4 0 48 52 56 38 66 1 60 65 52 38 48 2 50 61 50 38 53 3 67 57 43 50 50 4 54 56 54 60 51 5 55 44 49 62 53 6 41 48 53 51 44 7 48 43 60 42 48 8 38 58 75 62 58 9 30 58 44 57 53 In [3]: def differand(x): ...: y = x.sort_values(ascending=False) ...: return y.iat[0] - y.iat[1] # first position minus second ... ...: In [4]: df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int) In [5]: df.sort_values('rank', inplace=True) In [6]: df Out[6]: 0 1 2 3 4 rank 9 30 58 44 57 53 1 6 41 48 53 51 44 2 4 54 56 54 60 51 3 1 60 65 52 38 48 4 5 55 44 49 62 53 5 2 50 61 50 38 53 6 0 48 52 56 38 66 7 3 67 57 43 50 50 7 7 48 43 60 42 48 9 8 38 58 75 62 58 10
More concisely, without the iPython, we have:
df = pd.DataFrame(np.random.randn(10, 5) * 10 + 50).astype(int) # play data def differand(x): """Receive a Pandas Series and return the difference between the highest value in the Series and the next (or equal) highest value in that Series.""" y = x.sort_values(ascending=False) return y.iat[0] - y.iat[1] df['rank'] = df.apply(differand, axis=1).rank(method='min').astype(int) df.sort_values('rank', inplace=True)
Subscribe to:
Posts (Atom)