Analysis of Becoming a Data Scientist Podcast Listens

Becoming a Data Scientist Podcast

Renee Teate 10/1/2017 In celebration of reaching 20,000 Twitter followers, woo!!

NOTE: This isn't meant to be a tutorial, it's just me demonstrating a thought process while doing analysis to ask questions about my podcast listens and use the listen log dataset to get answers

In [20]:
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import datetime as pydt

Connecting directly to the database so I don't have to keep downloading files to csv and can get "live" updates

In [21]:
#set up the database connection
#this is a stats table for a wordpress plugin I use to post my podcast episodes
#This account only has SELECT privileges.
#note that a random number is being substituted in for each IP
engine = create_engine('mysql+pymysql://XXMYCONNECTIONSTRINGXX', echo=False)
conn = engine.connect().connection

#create the pandas dataframe and populate with data from ssp stats table
dfListens = pd.read_sql("SELECT s1.id, s1.post_id, r.random_ip_sub, s1.referrer, s1.date FROM clz_ssp_stats s1 " \
                "left join (select ip_Address, max(FLOOR(100000000 + (RAND() * 199999999))) " \
                "random_ip_sub from clz_ssp_stats group by ip_Address) r ON s1.ip_address = r.ip_address "
                        , conn, index_col=['id'])

#rename the columns
dfListens.columns = ['postnum','IP','method','timestamp']

#display new randomized IP as string without scientific notation
dfListens['IP'] = dfListens['IP'].astype(int).astype(str)

#display the row count and first 5 rows
print("rows in dataframe:",len(dfListens),"\n")
print(dfListens.head())
rows in dataframe: 198687 

    postnum         IP    method   timestamp
id                                          
1       776  297908903    player  1450081082
2       776  297908903    player  1450081089
3       776  297908903    player  1450101683
4       776  297908903  download  1450101715
5       776  297908903  download  1450101774

Assigning episode names to episode IDs and converting timestamp to date and checking the output

In [22]:
#add a column with the date from timestamp
dfListens['day'] = pd.to_datetime(dfListens['timestamp'], unit='s').dt.date

#create a column of episode names based on dictionary of post numbers
eps = {776:'Renee',807:'Will',841:'Safia',871:'Shlomo',896:'Sherman',921:'Clare',964:'Erin',
      980:'Enda',1014:'Sebastian',1043:'Justin',1056:'Trey',1081:'Stephanie',1105:'DSLC',1118:'Debbie',
      1159:'Anncmt',1204:'Election',1249:'Ed',1316:'Jasmine',1346:'David',1421:'Randy'}
dfListens['episode'] = dfListens['postnum'].replace(eps)

#add a column with the date from timestamp
dfListens['day'] = pd.to_datetime(dfListens['timestamp'], unit='s').dt.date

#set option to print all dataframs columns without wrapping
pd.set_option('expand_frame_repr', False)

#check that it's connecting to live data
print("Max date in dataframe: ",dfListens['day'].max(),"\n")

#show first 5 and last 20 rows after new columns added
print(dfListens.head(n=5),"\n")
print(dfListens.tail(n=20))
Max date in dataframe:  2017-10-01 

    postnum         IP    method   timestamp         day episode
id                                                              
1       776  297908903    player  1450081082  2015-12-14   Renee
2       776  297908903    player  1450081089  2015-12-14   Renee
3       776  297908903    player  1450101683  2015-12-14   Renee
4       776  297908903  download  1450101715  2015-12-14   Renee
5       776  297908903  download  1450101774  2015-12-14   Renee 

        postnum         IP       method   timestamp         day    episode
id                                                                        
198668      841  286445852  pocketcasts  1506880425  2017-10-01      Safia
198669      871  286445852  pocketcasts  1506880463  2017-10-01     Shlomo
198670     1421  267842264       itunes  1506880993  2017-10-01      Randy
198671     1043  164688143       itunes  1506881102  2017-10-01     Justin
198672     1014  164688143       itunes  1506881121  2017-10-01  Sebastian
198673     1316  260181611       itunes  1506882058  2017-10-01    Jasmine
198674     1014  269063398       itunes  1506882235  2017-10-01  Sebastian
198675     1421  208677117       itunes  1506883479  2017-10-01      Randy
198676      807  208677117       itunes  1506883530  2017-10-01       Will
198677      871  208677117       itunes  1506883561  2017-10-01     Shlomo
198678      896  208677117       itunes  1506883578  2017-10-01    Sherman
198679     1056  208677117       itunes  1506883807  2017-10-01       Trey
198680      871  279537409       itunes  1506884966  2017-10-01     Shlomo
198681      921  279537409       itunes  1506885007  2017-10-01      Clare
198682      841  292933898     overcast  1506886071  2017-10-01      Safia
198683      776  183224465       itunes  1506886339  2017-10-01      Renee
198684      807  183224465       itunes  1506886340  2017-10-01       Will
198685      841  183224465       itunes  1506886344  2017-10-01      Safia
198686      776  109148086               1506887777  2017-10-01      Renee
198687      921  285460060       itunes  1506889355  2017-10-01      Clare

What methods do people use to listen to my podcast?

In [23]:
#group listens by method and display count
grpMethod = dfListens.groupby('method')
print(grpMethod['postnum'].count())
method
                                     127749
download                               4040
download\' and 3>\'1         6
download\' and 3>\'4         6
download\' order by 1 -- ;           6
download\' order by 999 -- ;         6
download\" and 3>\"1         6
download\" and 3>\"4         6
download\" order by 1 -- ;           6
download\" order by 999 -- ;         6
itunes                                41464
new_window                              823
overcast                               7020
player                                 2053
pocketcasts                           11458
stitcher                               4032
Name: postnum, dtype: int64
In [24]:
#some weird methods in there!
#truncate anything that starts with "download" to be just "download"
dfListens.loc[dfListens['method'].str.startswith('download'),'method'] = "download"
#and replace the blank methods with "unknown"
dfListens.loc[dfListens['method'] == '','method'] = "unknown"

#group by methods again and show counts, sorted
grpMethod = dfListens.groupby('method').count()
print(grpMethod['postnum'].sort_values())
method
new_window        823
player           2053
stitcher         4032
download         4088
overcast         7020
pocketcasts     11458
itunes          41464
unknown        127749
Name: postnum, dtype: int64
In [25]:
#set plots to display inline in jupyter notebook
%matplotlib inline

#bar chart of listen methods
grpMethod['postnum'].sort_values().plot(kind='bar')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x20e2ae49470>

Which episodes are most popular?

In [26]:
#Which episodes are most popular?

#group by methods again and show counts, sorted
grpEpisode = dfListens.groupby('episode').count()
print(grpEpisode['postnum'].sort_values())
episode
Ed            6243
Election      6857
Jasmine       7191
Sherman       7778
Shlomo        7870
Erin          8661
Safia         8673
David         9000
Enda          9366
Anncmt        9393
Clare        10151
Renee        10401
Stephanie    10499
Justin       11361
DSLC         11777
Will         11811
Sebastian    12329
Randy        12622
Trey         12797
Debbie       13907
Name: postnum, dtype: int64
In [27]:
#bar chart of episode listen counts
grpEpisode['postnum'].sort_values().plot(kind='bar', color='r')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x20e215dfc50>

How many times does each listener access the podcast?

In [28]:
#now let's group by episode and display counts
grpIP = dfListens.groupby('IP').count()
print(grpIP['postnum'].sort_values(ascending=False).head())
IP
299900616    1924
299971845    1807
299505258    1238
299915564     902
299771955     860
Name: postnum, dtype: int64

Uh oh... problem. These aren't regular listeners.

In [29]:
#let's take a look at the data for that top IP address. how many accesses per day?
print(dfListens.loc[dfListens['IP'] == '299900616',:].groupby(['day']).count().tail(n=10))
            postnum  IP  method  timestamp  episode
day                                                
2017-09-18       15  15      15         15       15
2017-09-21       30  30      30         30       30
2017-09-23       15  15      15         15       15
2017-09-24       30  30      30         30       30
2017-09-25       15  15      15         15       15
2017-09-26       30  30      30         30       30
2017-09-27       30  30      30         30       30
2017-09-28       15  15      15         15       15
2017-09-29       15  15      15         15       15
2017-10-01       30  30      30         30       30

that IP is accessing every episode once or twice every day - probably a search engine or bot

We need to rethink our counts overall.

Let's only look at the first time each IP accessed each episode

In [30]:
#I don't know how to do "get first row from each grouping"quickly in pandas yet, so for now I'll adjust the SQL query
#...ugh apparently it's an old version of mySQL without window functions like ROW_NUMBER(), so we'll self-join
dfListens2 = pd.read_sql("SELECT distinct s1.id, s1.post_id, r.random_ip_sub, s1.referrer, s1.date " \
    "FROM clz_ssp_stats s1 " \
       "left join (select ip_address, post_id, min(id) first_id from clz_ssp_stats group by ip_address, post_id) s2 " \
         "ON s1.ip_address = s2.ip_address and s1.id = s2.first_id " \
       "left join (select ip_Address, max(FLOOR(100000000 + (RAND() * 199999999))) random_ip_sub from clz_ssp_stats " \
                        "group by ip_Address) r ON s1.ip_address = r.ip_address " \
    "WHERE s2.post_id is not null", conn, index_col=['id'])

#close the database connection
conn.close()

#apply all of the same data transformations I made to df1 to this dataframe
dfListens2.columns = ['postnum','IP','method','timestamp']
dfListens2['day'] = pd.to_datetime(dfListens2['timestamp'], unit='s').dt.date
eps = {776:'Renee',807:'Will',841:'Safia',871:'Shlomo',896:'Sherman',921:'Clare',964:'Erin',
      980:'Enda',1014:'Sebastian',1043:'Justin',1056:'Trey',1081:'Stephanie',1105:'DSLC',1118:'Debbie',
      1159:'Anncmt',1204:'Election',1249:'Ed',1316:'Jasmine',1346:'David',1421:'Randy'}
dfListens2['episode'] = dfListens2['postnum'].replace(eps)
dfListens2.loc[dfListens['method'].str.startswith('download'),'method'] = "download"
dfListens2.loc[dfListens['method'] == '','method'] = "unknown"
dfListens2['IP'] = dfListens['IP'].astype(int).astype(str)

#check count of rows in new dataframe
print("rows in dataframe:",len(dfListens2),"\n")

#print header of new dataframe
print(dfListens2.head())

#group by episode and display counts in new dataframe
grpEpisode = dfListens2.groupby('IP').count()
print(grpEpisode['postnum'].sort_values(ascending=False).head())
rows in dataframe: 117380 

        postnum         IP    method   timestamp         day  episode
id                                                                   
131839     1316  292496886  overcast  1485171545  2017-01-23  Jasmine
38992      1056  259246919            1462242769  2016-05-03     Trey
69897      1118  253581423            1467846404  2016-07-06   Debbie
69100      1118  257650483            1467809673  2016-07-06   Debbie
184606     1249  239635240    itunes  1500148909  2017-07-15       Ed
IP
294785199    27
297199589    23
299575710    21
288046633    20
297308236    20
Name: postnum, dtype: int64

Redo: What methods do people use to listen to my podcast?

In [31]:
#OK now each IP is only counted once per episode, let's redo the method count
#group by methods again and show counts, sorted
grpMethod = dfListens2.groupby('method').count()
print(grpMethod['postnum'].sort_values())
method
new_window       380
player          1206
download        1627
stitcher        2799
overcast        4842
pocketcasts     9067
itunes         31922
               65537
Name: postnum, dtype: int64
In [32]:
#bar chart of listen methods
grpMethod['postnum'].sort_values().plot(kind='bar')
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x20e22374940>

Redo: Which episodes are most popular?

In [33]:
#Which episodes are most popular?

#group by episodes again and show counts, sorted
grpEpisode = dfListens2.groupby('episode').count()
print(grpEpisode['postnum'].sort_values())
episode
Shlomo       4564
Sherman      4758
Safia        4866
Ed           4927
Erin         5277
Enda         5282
Election     5344
Jasmine      5517
Clare        5564
Justin       5577
Sebastian    5643
Stephanie    5815
Anncmt       6018
DSLC         6114
David        6213
Trey         6301
Debbie       6638
Will         6943
Renee        6976
Randy        9043
Name: postnum, dtype: int64
In [34]:
#bar chart of episodes
grpEpisode['postnum'].sort_values().plot(kind='bar', color='r')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x20e29231630>
In [35]:
print(dfListens2.head())
#get count of IPs accessing episode per day
dfEpisodeDays = dfListens2[['postnum','episode','day']].groupby(['episode','day']).count()
dfEpisodeDays.columns=['listencount']
dfEpisodeDays = dfEpisodeDays.reset_index().sort_values(by=['listencount'], ascending=False)
print('\nListens Grouped by Episode by Day')
print(dfEpisodeDays.head())
        postnum         IP    method   timestamp         day  episode
id                                                                   
131839     1316  292496886  overcast  1485171545  2017-01-23  Jasmine
38992      1056  259246919            1462242769  2016-05-03     Trey
69897      1118  253581423            1467846404  2016-07-06   Debbie
69100      1118  257650483            1467809673  2016-07-06   Debbie
184606     1249  239635240    itunes  1500148909  2017-07-15       Ed

Listens Grouped by Episode by Day
      episode         day  listencount
4662    Randy  2017-03-22         2105
3866  Jasmine  2017-01-11         1618
1448    David  2017-01-30         1485
2141       Ed  2016-12-29         1428
1693   Debbie  2016-07-06         1098
In [36]:
#get the max listen count day per episode
#note- if published late at night, or published but not made public immediately, 1st day will be low
dfEpDayMax = pd.DataFrame(dfEpisodeDays.groupby('episode').nth(0)).reset_index()
#add weekday to see if most are Monday/Tuesday when episode published
dfEpDayMax['DayOfWeek'] = pd.to_datetime(dfEpDayMax['day']).apply(lambda x: pydt.datetime.strftime(x, '%A'))
print('Max Listens In a Day by Episode (ordered by date):')
print(dfEpDayMax.sort_values(by=['listencount']))
Max Listens In a Day by Episode (ordered by date):
      episode         day  listencount  DayOfWeek
13      Safia  2016-12-29           76   Thursday
19       Will  2016-12-29           85   Thursday
12      Renee  2016-12-29           91   Thursday
16     Shlomo  2016-01-18          108     Monday
15    Sherman  2016-02-02          194    Tuesday
8        Erin  2016-02-29          205     Monday
1       Clare  2016-02-15          230     Monday
7        Enda  2016-03-15          498    Tuesday
0      Anncmt  2016-08-14          674     Sunday
14  Sebastian  2016-04-01          715     Friday
17  Stephanie  2016-06-03          741     Friday
18       Trey  2016-05-01          833     Sunday
10     Justin  2016-04-12          852    Tuesday
6    Election  2016-11-14         1040     Monday
2        DSLC  2016-06-15         1066  Wednesday
4      Debbie  2016-07-06         1098  Wednesday
5          Ed  2016-12-29         1428   Thursday
3       David  2017-01-30         1485     Monday
9     Jasmine  2017-01-11         1618  Wednesday
11      Randy  2017-03-22         2105  Wednesday

It makes sense that the first episodes have the lowest "biggest day" counts, because they were released when there were few regular subscribers, so have built up listeners over time and didn't get as big of a "pop" on release day as later episodes did

What did the first day each episode was released look like?

Note: "first day" is literally the first date there was a listen/download recorded. This isn't a "fair" comparison, because it's not the first 24 hours after an episode was posted

In [37]:
#get the listen count on the first day each episode was released
#note: would be "fairer" to get 1st 24 hrs per episode since i don't post at same time of day each time
dfEpisodeDays = dfEpisodeDays.reset_index().sort_values(by=['day'])
print(dfEpisodeDays.head())
dfEpDayMax = pd.DataFrame(dfEpisodeDays.groupby('episode').nth(0)).reset_index()
#add weekday to see if most are Monday/Tuesday when episode published
dfEpDayMax['DayOfWeek'] = pd.to_datetime(dfEpDayMax['day']).apply(lambda x: pydt.datetime.strftime(x, '%A'))
print('Listens on Release Day by Episode (ordered by date):')
print(dfEpDayMax.sort_values(by=['day']))
      index episode         day  listencount
2791   4856   Renee  2015-12-14           11
2229   4857   Renee  2015-12-15           13
1072   4858   Renee  2015-12-16           19
1252   4859   Renee  2015-12-17           17
6861   4860   Renee  2015-12-18            4
Listens on Release Day by Episode (ordered by date):
      episode         day  index  listencount  DayOfWeek
12      Renee  2015-12-14   4856           11     Monday
19       Will  2015-12-20   8637            1     Sunday
13      Safia  2016-01-04   5452           54     Monday
16     Shlomo  2016-01-18   7098          108     Monday
15    Sherman  2016-02-02   6556          194    Tuesday
1       Clare  2016-02-15    404          230     Monday
8        Erin  2016-02-29   3290          205     Monday
7        Enda  2016-03-15   2734          498    Tuesday
14  Sebastian  2016-04-01   6010          715     Friday
10     Justin  2016-04-12   4128          852    Tuesday
18       Trey  2016-05-01   8122          833     Sunday
17  Stephanie  2016-06-02   7640          420   Thursday
2        DSLC  2016-06-15    978         1066  Wednesday
4      Debbie  2016-07-06   1693         1098  Wednesday
0      Anncmt  2016-08-13      0          536   Saturday
6    Election  2016-11-13   2414          398     Sunday
5          Ed  2016-12-29   2141         1428   Thursday
9     Jasmine  2017-01-10   3865           51    Tuesday
3       David  2017-01-29   1447          257     Sunday
11      Randy  2017-03-21   4661            2    Tuesday

I was going to make pretty visuals using Bokeh or Seaborn here, but decided to switch over to Tableau to build a reusable visual dashboard to monitor listens over time

Podcast listen monitoring | Interactive listens by episode by week

In [38]:
#Need to build a Web Data Connector to database for Tableau Public (since mySQL connection not available in free version),
#so in the meantime, download CSV of podcast data to use

dfListens2.to_csv('BADSPodcast1stListenPerEpPerIP.csv', sep=',')
In [ ]: