Web scraping on Game Of Thrones

20 Apr 2019


Motivation

Season 8 of Game of Thrones is the topic of discussion these days. Over lunch, a friend challenged me to list all the GOT episodes for $1000. I am a GOT fan, but I failed to recite all the episode names. When I got back to my room, I decided to look them all up. With my data science toolbox at hand, I thought of doing something fancier. She had only asked for the episode names; I scraped:
1. Episode names for all 8 seasons.
2. Number of viewers for each episode.
3. Duration of each episode.

Extracting the season, episode name, number of US viewers, and Wikipedia page URL for each episode:

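import requests
import pandas as pd
from bs4 import BeautifulSoup

# Download the episode list and parse it with Python's built-in HTML parser.
got_request = requests.get('https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes')
got = BeautifulSoup(got_request.text, 'html.parser')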

got_tables = got.find_all("table")

seasons=[]
titles=[]
viewers=[]
url=[]
url_start="https://en.wikipedia.org"

# Tables 1 to 8 on the page are assumed to hold seasons 1 through 8.
for season in range(1, 9):
    rows = got_tables[season].find_all('tr', {'class': 'vevent'})

    for row in rows:
        cells = row.find_all('td')
        title = cells[1].text
        # Skip episodes whose title has not been announced yet.
        if title != 'TBA':
            seasons.append(season)
            titles.append(title)
            # Drop footnote markers such as "[12]" from the viewer count.
            viewers.append(cells[5].text.split('[')[0])
            url.append(url_start + row.find_all('a', {'href': True})[0]['href'])

GameOfThrones=pd.DataFrame({'Seasons':seasons,'Title':titles,'Viewers (Millions)':viewers, 'url':url})
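
The scraped values are still plain strings at this point. As a minimal sketch (not part of my original notebook), the two helpers below show one way to turn a viewers cell into a number and to build the episode URL without string concatenation. The names `clean_viewers`, `absolute_url` and `WIKI_BASE` are hypothetical, and the sample cell formats are assumptions about what the Wikipedia table contains.

```python
import re
from urllib.parse import urljoin

# Same base as url_start above; the constant name is illustrative only.
WIKI_BASE = "https://en.wikipedia.org"


def clean_viewers(cell_text):
    """Return the viewer count (in millions) as a float, or None if absent."""
    # Remove footnote markers such as "[25]" before looking for a number.
    text = re.sub(r"\[[^\]]*\]", "", cell_text).strip()
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def absolute_url(href, base=WIKI_BASE):
    """Resolve a relative href such as '/wiki/...' against the site base."""
    return urljoin(base, href)
```

Returning `None` for cells without a number keeps unaired or unrated episodes from raising an error; they can be filtered out later.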

Following the URL of each episode's Wikipedia page and extracting the running time for each episode:

duration_list=[]

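for url_link in GameOfThrones.url:
    # Fetch the episode page and read "Running time" from its infobox.
    page = requests.get(url_link)
    soup = BeautifulSoup(page.text, 'html.parser')
    infobox = soup.find_all('table', {'class': 'infobox vevent'})[0]
    running_time = infobox.find('th', string='Running time').next_sibling.text
    # Drop footnote markers such as "[1]" from the value.
    duration_list.append(running_time.split('[')[0])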

GameOfThrones['Duration']=duration_list
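
The `Duration` column holds text rather than numbers. Assuming the infobox values look like "62 minutes" (I have not checked every episode page), a small helper along these lines could convert them to integers. `parse_minutes` is a hypothetical name used only for this sketch.

```python
import re


def parse_minutes(running_time):
    """Extract the number of minutes from text like '62 minutes'."""
    # Strip footnote markers such as "[1]" that may trail the value.
    text = re.sub(r"\[[^\]]*\]", "", running_time)
    match = re.search(r"(\d+)\s*min", text, flags=re.IGNORECASE)
    # Return None when no "<number> min..." pattern is present.
    return int(match.group(1)) if match else None
```

It could then be applied to the column with something like `GameOfThrones['Duration'].map(parse_minutes)`.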

The DataFrame looks like this:

This is just a sample; the actual DataFrame contains the details for all the episodes to date.

The code is available on my GitHub repository.
Reference