Motivation
Season 8 of Game of Thrones is the topic of discussion these days. Over lunch, a friend challenged me to list all the GOT episodes for $1000. I am a GOT fan, but I failed to recite all the episode names. When I got back to my room, I decided to check them all. With my data science toolbox in hand, I thought of doing something fancy. She had only asked for the episode names, but I scraped:
1. Episode names for all 8 seasons.
2. Number of viewers for each episode.
3. Duration of each episode.
The first step extracts the season, episode name, number of US viewers, and the URL of the Wikipedia page for each episode.
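import requests
import pandas as pd
from bs4 import BeautifulSoup

# Download and parse the Wikipedia page that lists every episode.
got_request = requests.get('https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes')
got = BeautifulSoup(got_request.text, 'html.parser')
got_tables = got.find_all("table")

seasons = []
titles = []
viewers = []
url = []
url_start = "https://en.wikipedia.org"

# Season i+1 is read from table i+1 on the page (table 0 is skipped).
for i in range(8):
    table_num = i + 1
    table = got_tables[table_num].find_all('tr', {'class': 'vevent'})
    for row in table:
        cells = row.find_all('td')
        # Skip episodes whose title has not been announced yet.
        if cells[1].text != 'TBA':
            seasons.append(i + 1)
            titles.append(cells[1].text)
            # Drop the footnote marker, e.g. "[20]", from the viewers cell.
            viewers.append(cells[5].text.split('[')[0])
            # The first link in the row points to the episode's own page.
            url.append(url_start + row.find_all('a', {'href': True})[0]['href'])

GameOfThrones = pd.DataFrame({'Seasons': seasons, 'Title': titles, 'Viewers (Millions)': viewers, 'url': url})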
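The viewers column above is stored as text, so it cannot be averaged or plotted directly. As a minimal sketch, assuming the scraped cells look like '2.22' or '2.22[20]', a small helper could convert them to numbers. The name parse_viewers and the sample strings are hypothetical and are not part of the original code.

```python
import re

def parse_viewers(raw):
    """Convert a scraped viewers cell such as '2.22[20]' to a float (millions).

    Returns None when the cell holds no number (for example 'TBD').
    """
    # Drop any footnote marker, then look for the first decimal number.
    cleaned = raw.split('[')[0].strip()
    match = re.search(r'\d+(?:\.\d+)?', cleaned)
    return float(match.group()) if match else None

# Made-up sample strings in the shape the scraper is assumed to produce.
samples = ['2.22[20]', '13.61', 'TBD']
parsed = [parse_viewers(s) for s in samples]
print(parsed)
```

If this fits the data, it could be applied to the whole column with GameOfThrones['Viewers (Millions)'].map(parse_viewers).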
The next step follows the URL of each episode's Wikipedia page and extracts the running time for that episode.
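duration_list = []
for url_link in GameOfThrones.url:
    duration = requests.get(url_link)
    duration_got = BeautifulSoup(duration.text, 'html.parser')
    # Find the "Running time" header in the infobox; the next cell holds the value.
    infobox = duration_got.find_all('table', {'class': 'infobox vevent'})[0]
    running_time = infobox.find('th', string='Running time').next_sibling
    # Drop any footnote marker, e.g. "[1]", from the running time.
    duration_list.append(running_time.text.split('[')[0])
GameOfThrones['Duration'] = duration_list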
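The running time is also collected as text. As an optional cleanup step that is not part of the original code, and assuming the strings look like '62 minutes', the number of minutes could be pulled out with a regular expression. The helper name parse_minutes and the sample values below are hypothetical.

```python
import re

def parse_minutes(raw):
    """Return the first integer found in a running-time string, or None."""
    match = re.search(r'\d+', raw)
    return int(match.group()) if match else None

# Made-up sample strings in the shape the scraper is assumed to produce.
samples = ['62 minutes', '80 minutes[1]', '']
minutes = [parse_minutes(s) for s in samples]
print(minutes)
```

This only reads the first number it finds, so a value written in another form (hours and minutes, for instance) would need extra handling.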
The DataFrame looks like this:
This is just a sample; the actual DataFrame contains the details for all the episodes to date.
The code is available in my GitHub repository.
Reference