Analysis of Becoming a Data Scientist Podcast Listens

Becoming a Data Scientist Podcast

Renee Teate, 10/1/2017. In celebration of reaching 20,000 Twitter followers, woo!!

NOTE: This isn't meant to be a tutorial; it's just me demonstrating a thought process while doing analysis: asking questions about my podcast listens and using the listen log dataset to get answers.

In [1]:
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import datetime as pydt

Connect directly to the database so I don't have to keep downloading CSV files. Live data.

In [2]:
#set up the database connection
engine = create_engine('mysql+pymysql://XXXXXXMYCONNECTIONSTRINGXXXXXX', echo=False)
conn = engine.connect().connection

#query the database to get row representing the 1st time each IP address accessed each episode
#note that a random number is being substituted in for each IP
sql = """
    SELECT DISTINCT s1.id, s1.post_id, r.random_ip_sub, s1.referrer, s1.date
    FROM clz_ssp_stats s1
      LEFT JOIN (SELECT ip_address, post_id, MIN(id) first_id FROM clz_ssp_stats
                 GROUP BY ip_address, post_id) s2
        ON s1.ip_address = s2.ip_address AND s1.id = s2.first_id
      LEFT JOIN (SELECT ip_address, MAX(FLOOR(100000000 + (RAND() * 199999999))) random_ip_sub
                 FROM clz_ssp_stats GROUP BY ip_address) r
        ON s1.ip_address = r.ip_address
    WHERE s2.post_id IS NOT NULL
"""
dfListens = pd.read_sql(sql, conn, index_col=['id'])

#close the database connection
conn.close()

#rename the columns
dfListens.columns = ['postnum','IP','method','timestamp']

#convert the new randomized IP to a string so it doesn't display in scientific notation
dfListens['IP'] = dfListens['IP'].astype(int).astype(str)

#display the row count and first 5 rows
print("Count of rows in dataframe:",len(dfListens),"\n")
print("Example data rows:",dfListens.head())
Count of rows in dataframe: 117349 

Example data rows:         postnum         IP    method   timestamp
id                                              
131839     1316  266900982  overcast  1485171545
38992      1056  120660726            1462242769
69897      1118  110863746            1467846404
69100      1118  249507165            1467809673
184606     1249  261508401    itunes  1500148909
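
Side note: if the self-join logic in that query is hard to follow, the same "first listen per IP per episode" filter could be done in pandas instead. A minimal sketch, assuming a hypothetical dfRaw dataframe holding the full clz_ssp_stats table:

#hypothetical: dfRaw would be the whole clz_ssp_stats table pulled with pd.read_sql
#sorting by id first means keep='first' keeps each group's earliest hit
dfFirst = dfRaw.sort_values('id').drop_duplicates(subset=['ip_address','post_id'], keep='first')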

Assign episode name by episode ID and convert timestamp to date

In [3]:
#add a column with the date from timestamp
dfListens['day'] = pd.to_datetime(dfListens['timestamp'], unit='s').dt.date

#create a column of episode names based on dictionary of post numbers
eps = {776:'Renee',807:'Will',841:'Safia',871:'Shlomo',896:'Sherman',921:'Clare',964:'Erin',
      980:'Enda',1014:'Sebastian',1043:'Justin',1056:'Trey',1081:'Stephanie',1105:'DSLC',1118:'Debbie',
      1159:'Anncmt',1204:'Election',1249:'Ed',1316:'Jasmine',1346:'David',1421:'Randy'}
dfListens['episode'] = dfListens['postnum'].replace(eps)

#set option to print all dataframe columns without wrapping
pd.set_option('expand_frame_repr', False)

#show last 5 rows after new columns added
print("Example transformed data rows:",dfListens.tail(n=5))
print("\nMax date in dataframe: ",dfListens['day'].max())
Example transformed data rows:         postnum         IP    method   timestamp         day    episode
id                                                                     
180937      776  160781950            1498696927  2017-06-29      Renee
53375      1081  265238924            1465046139  2016-06-04  Stephanie
85467      1159  265238924            1471147077  2016-08-14     Anncmt
122747     1249  265238924  overcast  1483033438  2016-12-29         Ed
188821      841  181465186    itunes  1502362658  2017-08-10      Safia

Max date in dataframe:  2017-10-01
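
One thing to watch with replace(): any post number missing from the eps dictionary stays numeric instead of getting a name. A quick sanity check, just a sketch:

#any postnum not in the eps dictionary would still be a number in the episode column
unmapped = dfListens.loc[~dfListens['postnum'].isin(eps.keys()), 'postnum'].unique()
print("Post numbers without an episode name:", unmapped)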

What methods do people use to listen to my podcast (referrers)?

In [4]:
#truncate anything that starts with "download" to be just "download"
dfListens.loc[dfListens['method'].str.startswith('download'),'method'] = "download"
#and replace the blank methods with "unknown"
dfListens.loc[dfListens['method'] == '','method'] = "unknown"

#group by method and show counts, sorted
grpMethod = dfListens.groupby('method').count()
print(grpMethod['postnum'].sort_values())
method
new_window       380
player          1206
download        1627
stitcher        2799
overcast        4842
pocketcasts     9062
itunes         31900
unknown        65533
Name: postnum, dtype: int64
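
Raw counts are fine, but shares might be easier to compare. A minimal sketch of the same breakdown as percentages, using value_counts instead of groupby:

#share of total listens per method, as percentages
pctMethod = dfListens['method'].value_counts(normalize=True).sort_values() * 100
print(pctMethod.round(1))
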
In [5]:
#set plots to display inline in jupyter notebook
%matplotlib inline

#bar chart of listen methods
grpMethod['postnum'].sort_values().plot(kind='bar')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x171a99c9438>
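
The default chart comes out bare; a minimal sketch of adding a title and axis labels with matplotlib, reusing the same grpMethod grouping:

import matplotlib.pyplot as plt

#same bar chart, but labeled
ax = grpMethod['postnum'].sort_values().plot(kind='bar')
ax.set_title('Listens by Method')
ax.set_xlabel('method')
ax.set_ylabel('listen count')
plt.tight_layout()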

Which episodes are most popular?

In [6]:
#group by episode and show counts, sorted
grpEpisode = dfListens.groupby('episode').count()
print(grpEpisode['postnum'].sort_values())
episode
Shlomo       4561
Sherman      4756
Safia        4864
Ed           4927
Erin         5277
Enda         5282
Election     5344
Jasmine      5517
Clare        5562
Justin       5576
Sebastian    5640
Stephanie    5815
Anncmt       6018
DSLC         6113
David        6213
Trey         6299
Debbie       6636
Will         6939
Renee        6972
Randy        9038
Name: postnum, dtype: int64
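
Since the query already keeps only each IP's first hit per episode, these are effectively unique listeners per episode. A related question is how many different episodes a typical listener plays. A minimal sketch:

#number of distinct episodes accessed per IP
epsPerIP = dfListens.groupby('IP')['episode'].nunique()
print("Distinct listener IPs:", len(epsPerIP))
print("Median episodes per listener:", epsPerIP.median())
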
In [7]:
#bar chart of episode listen counts
grpEpisode['postnum'].sort_values().plot(kind='bar', color='r')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x171a99e5240>

What did the first day each episode was released look like?

Note: "first day" is literally the first date there was a listen/download recorded. This isn't a "fair" comparison, because it's not the first full 24 hours after an episode was posted.
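
A sketch of what that "fairer" version could look like, counting listens within 24 hours (86400 seconds) of each episode's earliest recorded timestamp:

#earliest listen timestamp per episode, broadcast back onto each row
firstTs = dfListens.groupby('episode')['timestamp'].transform('min')
#keep only rows within the first 24 hours of the episode's first listen
first24 = dfListens[dfListens['timestamp'] < firstTs + 86400]
print(first24.groupby('episode').size().sort_values())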

In [13]:
#get count of IPs accessing episode per day
dfEpisodeDays = dfListens[['postnum','episode','day']].groupby(['episode','day']).count()
dfEpisodeDays.columns=['listencount']
dfEpisodeDays = dfEpisodeDays.reset_index().sort_values(by=['listencount'], ascending=False)

#get the listen count on the first day each episode was released
#note: would be "fairer" to count the 1st 24 hrs per episode since I don't post at the same time of day each time (see the sketch above)
dfEpisodeDays = dfEpisodeDays.reset_index().sort_values(by=['day'])
dfEpDayMax = pd.DataFrame(dfEpisodeDays.groupby('episode').nth(0)).reset_index()
#add weekday to see if most are Monday/Tuesday when episode published
dfEpDayMax['DayOfWeek'] = pd.to_datetime(dfEpDayMax['day']).apply(lambda x: pydt.datetime.strftime(x, '%A'))
print('Listens on Release Day by Episode (ordered by date):')
print(dfEpDayMax.sort_values(by=['day']))
Listens on Release Day by Episode (ordered by date):
      episode         day  index  listencount  DayOfWeek
12      Renee  2015-12-14   4856           11     Monday
19       Will  2015-12-20   8637            1     Sunday
13      Safia  2016-01-04   5452           54     Monday
16     Shlomo  2016-01-18   7098          108     Monday
15    Sherman  2016-02-02   6556          194    Tuesday
1       Clare  2016-02-15    404          230     Monday
8        Erin  2016-02-29   3290          205     Monday
7        Enda  2016-03-15   2734          498    Tuesday
14  Sebastian  2016-04-01   6010          715     Friday
10     Justin  2016-04-12   4128          852    Tuesday
18       Trey  2016-05-01   8122          833     Sunday
17  Stephanie  2016-06-02   7640          420   Thursday
2        DSLC  2016-06-15    978         1066  Wednesday
4      Debbie  2016-07-06   1693         1098  Wednesday
0      Anncmt  2016-08-13      0          536   Saturday
6    Election  2016-11-13   2414          398     Sunday
5          Ed  2016-12-29   2141         1428   Thursday
9     Jasmine  2017-01-10   3865           51    Tuesday
3       David  2017-01-29   1447          257     Sunday
11      Randy  2017-03-21   4661            2    Tuesday
In [15]:
#get count of IPs accessing episode per day
dfEpisodeDays = dfListens[['postnum','episode','day']].groupby(['episode','day']).count()
dfEpisodeDays.columns=['listencount']
dfEpisodeDays = dfEpisodeDays.reset_index().sort_values(by=['listencount'], ascending=False)

#get the max listen count day per episode
dfEpDayMax = pd.DataFrame(dfEpisodeDays.groupby('episode').nth(0)).reset_index()
dfEpDayMax['DayOfWeek'] = pd.to_datetime(dfEpDayMax['day']).apply(lambda x: pydt.datetime.strftime(x, '%A'))
print('Max Listens In a Day by Episode (ordered by listen count):')
print(dfEpDayMax.sort_values(by=['listencount']))
Max Listens In a Day by Episode (ordered by listen count):
      episode         day  listencount  DayOfWeek
13      Safia  2016-12-29           76   Thursday
19       Will  2016-12-29           85   Thursday
12      Renee  2016-12-29           91   Thursday
16     Shlomo  2016-01-18          108     Monday
15    Sherman  2016-02-02          194    Tuesday
8        Erin  2016-02-29          205     Monday
1       Clare  2016-02-15          230     Monday
7        Enda  2016-03-15          498    Tuesday
0      Anncmt  2016-08-14          674     Sunday
14  Sebastian  2016-04-01          715     Friday
17  Stephanie  2016-06-03          741     Friday
18       Trey  2016-05-01          833     Sunday
10     Justin  2016-04-12          852    Tuesday
6    Election  2016-11-14         1040     Monday
2        DSLC  2016-06-15         1066  Wednesday
4      Debbie  2016-07-06         1098  Wednesday
5          Ed  2016-12-29         1428   Thursday
3       David  2017-01-30         1485     Monday
9     Jasmine  2017-01-11         1618  Wednesday
11      Randy  2017-03-22         2105  Wednesday

It makes sense that the earliest episodes have the lowest "biggest day" counts: they were released when there were few regular subscribers, so they built up listeners over time and didn't get as big of a "pop" on release day as the later episodes did.
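
The dashboards below show this build-up over time; a rough pandas version of that weekly view might look like this, just a sketch:

#weekly listen counts per episode, to eyeball how listens accumulate over time
dfListens['week'] = pd.to_datetime(dfListens['timestamp'], unit='s').dt.to_period('W')
weekly = dfListens.pivot_table(index='week', columns='episode', values='postnum', aggfunc='count')
weekly.plot(figsize=(12,6), legend=False)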

Tableau dashboards

Podcast listen monitoring | Interactive listens by episode by week
