Scrape quotes using BeautifulSoup and then analyse the words

In this article, we will scrape quotes from http://quotes.toscrape.com/ using BeautifulSoup. This website is used in the Scrapy tutorial. Scrapy is a web scraping framework, whereas BeautifulSoup is a library for parsing HTML and extracting data from it. Scrapy is more comprehensive than BeautifulSoup.

In this code, we will first scrape all quotes on all pages of the website, then study those quotes using NLTK (Natural Language Toolkit) and matplotlib. We will plot the frequency and cumulative frequency of the words in the quotes and of the tags, in order to understand which words and topics are popular amongst great men and women.

Experiment Setup:

  1. Language: Python 3
  2. IDE: Spyder
  3. Libraries Used: BeautifulSoup, requests, pandas, nltk, matplotlib

If the above libraries are not installed, you need to install them. Note that the BeautifulSoup package for Python 3 is published as beautifulsoup4.

>>>pip install beautifulsoup4
>>>pip install requests
>>>pip install pandas
>>>pip install matplotlib
>>>pip install nltk

In our analysis, we will remove stop words from the quotes, using the stop word list in the NLTK corpus. You need to download this corpus first.

Execute the following code to download the stopwords corpus. I executed this in IPython.

import nltk
nltk.download()

Once you execute the above line, you will see the NLTK downloader window.

Open the Corpora tab, select stopwords, and click Download. Alternatively, calling nltk.download('stopwords') fetches only that corpus without opening the window.

Important: If you do not want to use the NLTK stop words, you can manually create a list of words to exclude and use it to filter out unwanted words.
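
As a rough illustration of the manual approach, the sketch below filters a sentence against a hand-written stop word set. The set `CUSTOM_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names made up for this example, and the word list is only a small sample, not a recommended list.

```python
import re

# Hypothetical, hand-picked stop words; extend this set to suit your data.
CUSTOM_STOP_WORDS = {"the", "is", "a", "of", "it", "as", "we", "our", "be", "have"}


def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Lower-case the text, keep alphabetic words only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in stop_words]


sample = "The world as we have created it is a process of our thinking."
filtered = remove_stop_words(sample)
print(filtered)
```

A set is used rather than a list so that each membership check stays fast even when the stop word collection grows.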

HTML code for a quote:

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">
    “The world as we have created it is a process of our thinking. 
    It cannot be changed without changing our thinking.”
 </span> 
 <span>
    by <small class="author" itemprop="author">Albert Einstein</small> 
    <a href="/author/Albert-Einstein">(about)</a> 
 </span> 
 <div class="tags"> 
   Tags: <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / > 
   <a class="tag" href="/tag/change/page/1/">change</a> 
   <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a> 
   <a class="tag" href="/tag/thinking/page/1/">thinking</a> 
   <a class="tag" href="/tag/world/page/1/">world</a> 
 </div> 
</div>
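
To see how this markup maps onto BeautifulSoup calls, here is a minimal sketch that parses an abbreviated copy of the snippet above. The shortened HTML string and the `record` variable are assumptions for illustration; the class names come from the markup shown.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the quote markup shown above (illustrative only).
html = """
<div class="quote">
  <span class="text">“It cannot be changed without changing our thinking.”</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
quote_div = soup.find("div", class_="quote")

# Pull out the three pieces of data we care about for each quote.
record = {
    "quote": quote_div.find("span", class_="text").get_text(strip=True),
    "author": quote_div.find("small", class_="author").get_text(strip=True),
    "tags": [a.get_text(strip=True) for a in quote_div.find_all("a", class_="tag")],
}
print(record)
```

The quote text, author, and tags each sit in an element with its own class, so one `find` or `find_all` call per field is enough.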

Algorithm:

  1. Set your base URL as http://quotes.toscrape.com/page/<page_number>
  2. Set page number equal to 1
  3. Send a get request to URL http://quotes.toscrape.com/page/1.
  4. Parse the response using BeautifulSoup’s html parser.
  5. Get the quotes from the parsed html using BeautifulSoup and append them to a list
  6. Increase page number by 1
  7. Repeat steps 3 to 6 until you find “No quotes found!” in the parsed html
  8. Convert the list to a pandas dataframe
  9. Get the quotes column from the dataframe and store it in a variable called quotes.
  10. Convert quotes to a list.
  11. Remove numbers and punctuation but keep spaces. Convert all words to lower case.
  12. Tokenize using the nltk library
  13. Create a list of words from the tokenized list
  14. Remove stopwords
  15. Calculate frequency distribution of words using the nltk library.
  16. From the above frequency distribution, create a dataframe with columns word and count.
  17. Sort this dataframe by descending word count.
  18. Set the word column as the index but do not drop the column.
  19. Follow steps 9 to 18 for the tags column of the dataframe.
  20. Plot frequency distribution and cumulative frequency distribution using matplotlib.
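
Steps 11 to 17 can be tried out without any third-party library. The sketch below is a simplified stand-in, not the code used in this article: it uses `collections.Counter` in place of `nltk.FreqDist`, plain `str.split` in place of the NLTK tokenizer, and made-up sample quotes with a tiny made-up stop word set.

```python
import re
from collections import Counter

# Made-up sample data standing in for the scraped quotes.
sample_quotes = [
    "“Love is all you need.”",
    "“To love and be loved is everything.”",
    "“Need 2 know: love wins!”",
]
# Tiny illustrative stop word set (the article uses NLTK's English list).
stop_words = {"is", "all", "you", "to", "and", "be"}

# Step 11: keep letters and whitespace only, then lower-case.
cleaned = [re.sub(r"[^a-zA-Z\s]", "", q).strip().lower() for q in sample_quotes]
# Steps 12-13: split on whitespace and flatten into one list of words.
words = [w for q in cleaned for w in q.split()]
# Step 14: drop stop words.
words = [w for w in words if w not in stop_words]
# Steps 15-17: count the words and sort by descending frequency.
word_counts = Counter(words).most_common()
print(word_counts)
```

Here `word_counts` is a list of `(word, count)` pairs, which has the same shape as the data the article later loads into a dataframe.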

Rather than explaining the code further here, you can read through the code itself, which is well commented.

Code

# import BeautifulSoup
from bs4 import BeautifulSoup
# import requests to make a get request
import requests
# import regular expressions
import re
# import pandas so that we can use the dataframe for our analysis
import pandas as pd
# import nltk to process text
import nltk
from nltk.corpus import stopwords
# import matplotlib to plot word frequency distribution
import matplotlib.pyplot as plt

# format of url for a certain page number is http://quotes.toscrape.com/page/
url = "http://quotes.toscrape.com/page/"

# first of all, we have to find out how many pages there are. if a page does not have any
# quotes, it shows "No quotes found!" as the message on the webpage. So we keep requesting
# pages until we get the first such message.

# create an empty list of quotes
quotes_list = []

page_count = 1
while page_count:
    # send get request
    request = requests.get(url + str(page_count))
    # parse html using html parser of BeautifulSoup
    parsed_html = BeautifulSoup(request.text, "html.parser")
    # search for "No quotes found!" in the parsed html. We are using a regular expression
    # because there may be leading or trailing white space around the message,
    # and a plain string search would only match the exact text.
    
    # if we find 'No quotes found!', we will break the loop
    # (string= replaces the older text= argument, which is deprecated in recent bs4 versions)
    if len(parsed_html.body.find_all(string=re.compile("No quotes found!"))) > 0:
        break
    else:
        # get the quotes from page enclosed in div with class quote 
        # and increment page number
        quotes = parsed_html.find_all("div", {"class": "quote"})
        # loop through each quote
        for quote in quotes:
            # find children of each quote and get their text. Store the result in a
            # dict and append it to our quotes list
            # since there are multiple tags, we are looping through tags as well
            quotes_list.append({
                    'quote': quote.find("span", {"class": "text"}).text,
                    'author': quote.find("small", {"class": "author"}).text,
                    'tags': [tag.text for tag in quote.find_all("a", {"class": "tag"})]
                })
        page_count += 1
        
# convert the list to dataframe

df = pd.DataFrame(quotes_list)

# tokenize quotes
# first convert the quotes column in dataframe to a list
quotes = df['quote'].tolist()
# remove everything except letters and whitespace (numbers, punctuation), then convert to lower case
quotes = [re.sub(r'[^a-zA-Z\s]', '', quote).strip().lower() for quote in quotes]
# tokenize using nltk library
quote_words = [nltk.regexp_tokenize(quote, r'\S+') for quote in quotes]
quote_words = [w for l in quote_words for w in l]

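# remove stopwords
# you may need to download stopwords using nltk.download()
# build the set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))
quote_words = [word for word in quote_words if word not in stop_words]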

# generate a frequency distribution of words
freq_dist = nltk.FreqDist(quote_words)
# create a dataframe with columns word and its count
words_count_df = pd.DataFrame(list(freq_dist.items()), columns=['word', 'count'])
# sort the dataframe
words_count_df = words_count_df.sort_values('count', ascending = False)
# set word as index
words_count_df.set_index('word', drop=False, inplace=True)

# create a list of tags
tags_list = df['tags'].tolist()
# create a temporary variable to store tags
temp = []
# tokenize tags
for tags in tags_list:
    for tag in tags:
        temp.append(tag)
        
# store temp in tags_list
tags_list = temp

# generate a frequency distribution of tags
freq_dist = nltk.FreqDist(tags_list)
# create a dataframe with columns tag and its count
tags_count_df = pd.DataFrame(list(freq_dist.items()), columns=['tag', 'count'])
# sort the dataframe
tags_count_df = tags_count_df.sort_values('count', ascending = False)
# set tag as index
tags_count_df.set_index('tag', drop=False, inplace=True)

# plot frequency using matplotlib
def plot_freq(df, metric):
    fig = plt.figure(figsize=(12, 9))
    ax = fig.gca()
    df['count'][:60].plot(kind='bar', ax=ax)
    ax.set_title('Frequency of the most common ' + metric + 's')
    ax.set_ylabel('Frequency of ' + metric)
    ax.set_xlabel(metric)
    plt.show()
    
# plot cumulative frequency distribution using matplotlib
def plot_cfd(df, metric):
    word_count = float(df['count'].sum(axis = 0))   
    df['cumulative'] = df['count'].cumsum(axis = 0)
    df['cumulative'] = df['cumulative'].divide(word_count)
     
    fig = plt.figure(figsize=(12, 9))
    ax = fig.gca()    
    df['cumulative'][:60].plot(kind = 'bar', ax = ax)
    ax.set_title('Cumulative fraction of total ' + metric + ' vs. ' + metric + 's')
    ax.set_ylabel('Cumulative fraction')
    ax.set_xlabel(metric)
    plt.show()
    
if __name__ == "__main__":
    plot_freq(words_count_df, 'Word')
    plot_cfd(words_count_df, 'Word')
    plot_freq(tags_count_df, 'Tag')
    plot_cfd(tags_count_df, 'Tag')
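
To make the cumulative fraction computed in plot_cfd concrete, here is a small worked example in plain Python. The counts are made up for illustration, and `itertools.accumulate` stands in for the pandas `cumsum` call used above.

```python
from itertools import accumulate

# Made-up counts, already sorted in descending order.
counts = {"love": 5, "life": 3, "world": 2}

total = sum(counts.values())
# Running totals: 5, then 5 + 3, then 5 + 3 + 2.
running_totals = list(accumulate(counts.values()))
# Divide each running total by the grand total to get a fraction between 0 and 1.
cumulative = {word: running / total for word, running in zip(counts, running_totals)}
print(cumulative)
```

With these numbers, the top word alone covers half of all occurrences and the top two cover 80 percent, which is the kind of reading the cumulative plots are meant to support.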

Output

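[Figure: frequency distribution of the most common words]

[Figure: cumulative fraction of total words]

[Figure: frequency distribution of the most common tags]

[Figure: cumulative fraction of total tags]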

What we found using our little experiment is that ‘love’ is the most used word in the quotes by the great.

Note:

I no longer write about Python or programming on this blog, as I have shifted my focus from programming to WordPress and web development. If you are interested in WordPress, you can continue reading other articles on this blog.
Thanks and Cheers
