An introduction to webscraping, part IV

A tutorial for those with little or no programming background
By Maurits van der Veen
Last modified October 2015

Recursive scraping

As a final exercise, we will construct a scraper that follows links on its own. This is generally called recursive scraping: we pass the spider a starting webpage, and it follows links from that starting page to another page, and then perhaps another still, and so on.

For this exercise, we'll build a scraper for the transcripts of CBS' Face the Nation broadcasts. First, let's go to the website and use Firebug and Firepath to see what it looks like, and which elements we would like to scrape. The transcripts page is at:
http://www.cbsnews.com/news/face-the-nation-transcripts-2015/

On this page, we see a list of links that go to transcripts. This list is followed, near the bottom of the page, by a list of links to the transcript index pages for previous years.

We will follow both of these types of links, but do different things for each link.

Let's begin our new project. From the command line, inside the folder where you want your scraper to be located, type:

scrapy startproject facethenation

First, we'll edit our items.py file. For each transcript, we would like to keep track of the exact date (day, month, year), the title, the guests, and the full text. In addition, add fields for the part number (in case a transcript is so long it gets split across two items), and for the date and time the transcript was generated.

In [ ]:
import scrapy

class FTNItem(scrapy.Item):
    # Date of the show
    year = scrapy.Field()
    month = scrapy.Field()
    monthnr = scrapy.Field()  # numeric month, filled in by the parser
    day = scrapy.Field()

    title = scrapy.Field()
    guests = scrapy.Field()
    text = scrapy.Field()

    part = scrapy.Field()

    # Date and time the transcript was generated
    Tyear = scrapy.Field()
    Tmonth = scrapy.Field()
    Tday = scrapy.Field()
    Ttime = scrapy.Field()


Next, we can begin editing our spider. Inside the spiders folder of the project, create a new file, FTN_spider.py. We begin by specifying the basics:

In [ ]:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector

from facethenation.items import FTNItem

class FTNSpider(CrawlSpider):
    """Collect Face the Nation transcripts from the earliest html version
    available (July 24, 2011) until the present."""
    name = "FTNcrawl"
    allowed_domains = ["www.cbsnews.com"]
    start_urls = ["http://www.cbsnews.com/news/face-the-nation-transcripts-2015/",]


We import two new things: Rule, and LinkExtractor. The latter tells scrapy which links on a page we would like to do something with (follow, scrape, or both); the former specifies what to do.

Also, I added a comment string, known as a docstring (three double quotes on either side), right after the line that names the class. It is common python form (as well as good practice) to have such a comment string for every class and function definition.

Notice that I say in the comment string that the earliest html transcript available is 24 July 2011. What about earlier transcripts? Those are all in pdf format.

Go to the page for the 2011 transcripts to verify that this is still the case. (Just by hovering over a link, we can see what the URL is that it links to. Indeed, on the 2011 page, July 24th is the oldest one that links to a page whose address ends in .html)

Regular expressions

Now let's start by writing the specifications that scrapy will use to extract links, and the rules it will use to follow them.

The first rule will follow links to the transcript pages for previous years. Hover over the links at the bottom of the transcript page you are on at the moment (2011 or 2015), to see how those links are structured:

http://www.cbsnews.com/news/face-the-nation-transcripts-2011/



So we are looking for links that end in 'transcripts-20xx' where xx completes the year.

We will use regular expressions to specify this format. Regular expressions offer powerful ways to describe a pattern of characters in a string. We want to specify a couple of things:

• we don't really care how the URL string begins; only how it ends
• we want it to end in the specific text 'transcripts-20', followed by any 2 digits
• and we need a closing forward slash

To match any character, we use a period ('.'). To match one or more characters, follow the period with a '+'. To match any digit, we use [0-9], and we specify how many digits we want by putting the required number in curly brackets. The resulting specification is below; the letter r in front marks it as a raw string, which tells Python to leave any backslashes alone and is the customary way to write regular expressions:

r'.+transcripts-20[0-9]{2}/'
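
As a quick check, the pattern can be tried out with Python's built-in re module before handing it to scrapy. This is only a sketch: the helper name is_archive_link is made up for illustration, and using re.search is an assumption about how the link extractor applies the pattern. The second URL is the newer-style transcript link quoted further on in this tutorial.

```python
import re

# The archive-page pattern from the text above
archive_pattern = re.compile(r'.+transcripts-20[0-9]{2}/')

def is_archive_link(url):
    """Hypothetical helper: True if the url looks like an annual archive page."""
    return archive_pattern.search(url) is not None

print(is_archive_link("http://www.cbsnews.com/news/face-the-nation-transcripts-2011/"))
print(is_archive_link("http://www.cbsnews.com/news/face-the-nation-transcript-october-23-2011"))
```

The first line prints True and the second prints False: the episode link says 'transcript' rather than 'transcripts', and it has no closing forward slash.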



To tell scrapy that we want this rule to apply to links of this format only, we pass this regular expression to the LinkExtractor. The link extractor also needs to know in what section of the page to look for these links. This is specified by an xpath.

Using Firebug and Firepath, find the xpath that selects just the section of the page containing the links (it will contain the links to show transcripts as well as to the previous years' archive pages).

We pass the LinkExtractor the regular expression defining the links it should follow, along with the xpath indicating where it should look for those links:

In [ ]:
LinkExtractor(allow=(r'.+transcripts-20[0-9]{2}/'),
              restrict_xpaths=('//div[@id="article-entry"]', ))


Next, we need to define a rule for what to do with these links. Since each of these annual archive pages looks the same as the 2015 page we start with, we want to treat them the same as our start_url page. Hence we tell scrapy to use the parser for the start page for these subsequent pages as well:

callback="parse_start_url"



The rule also needs to know whether to follow these links recursively. We set this to 'True'. This will mean that scrapy will find a lot of the same links to follow: each page has links to every annual archive page. Fortunately, scrapy internally handles duplicate links, making sure to follow any link only once.

follow=True



Together, the LinkExtractor, callback, and follow specifications compose the rule:

In [ ]:
# Rule 1: follow links to previous years' transcript lists
Rule(LinkExtractor(allow=(r'.+transcripts-20[0-9]{2}/'),
                   restrict_xpaths=('//div[@id="article-entry"]', )),
     callback="parse_start_url",
     follow=True)


Next, we specify rules for following transcript links. These come in two formats. The older ones (on the 2011 page) have the format

http://www.cbsnews.com/stories/2011/08/07/ftn/main20089222.shtml



The newer ones (starting later in 2011) have the format

http://www.cbsnews.com/news/face-the-nation-transcript-october-23-2011



These two link formats are sufficiently different that we will write two different rules for them.

For the older ones, we are looking for a string that includes a piece of the form '/ftn/main20', with a variable number of characters on either side (remember, to match any one or more characters, just put '.+'):

r'.+/ftn/main20.+'



The section of the page where we are looking for these links is the same as the one we identified for the annual archive pages. The only other difference from the first rule is the function we want to call to parse the page where we land when we follow the link:

In [ ]:
# Rule 2: URLs in 2011 that are filed under 'ftn/main20...'
Rule(LinkExtractor(allow=(r'.+/ftn/main20.+'),
                   restrict_xpaths=('//div[@id="article-entry"]', )),
     callback="parse_episode",
     follow=True)


For the newer episodes, the only thing that changes is the link format we're looking for. This requires one new regular expression character: a '?' indicates that the preceding thing is optional (more specifically, it means 0 or 1 repetitions of the preceding item):

In [ ]:
# Rule 3: URLs of the form '-x-' or '-xx-' where x is a digit from 0 to 9.
Rule(LinkExtractor(allow=(r'.+-[0-9]?[0-9]-.+'),
                   restrict_xpaths=('//div[@id="article-entry"]', )),
     callback="parse_episode",
     follow=True)
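
To see that the three patterns do not step on each other, we can run them against the three example URLs quoted in this tutorial. This is a sketch using the re module only, not scrapy itself; the helper classify and the labels are made up for illustration, and re.search is again an assumption about how the link extractor applies each pattern.

```python
import re

# The three patterns used in the rules of this tutorial
patterns = [
    ("archive page", re.compile(r'.+transcripts-20[0-9]{2}/')),
    ("old episode", re.compile(r'.+/ftn/main20.+')),
    ("new episode", re.compile(r'.+-[0-9]?[0-9]-.+')),
]

def classify(url):
    """Hypothetical helper: names of all patterns found in the url."""
    return [name for name, pattern in patterns if pattern.search(url)]

urls = [
    "http://www.cbsnews.com/news/face-the-nation-transcripts-2011/",
    "http://www.cbsnews.com/stories/2011/08/07/ftn/main20089222.shtml",
    "http://www.cbsnews.com/news/face-the-nation-transcript-october-23-2011",
]
for url in urls:
    print(classify(url))
```

Each URL is picked up by exactly one pattern. Note that the year at the end of an archive URL does not trigger the day-of-month pattern, because the digits are not followed by a hyphen.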


We put these three rules together into a tuple (remember: a fixed list in parentheses) named rules, which goes inside the spider class, right after the start_urls:

In [ ]:
    rules = (
        # Rule 1: follow links to previous years' transcript lists
        Rule(LinkExtractor(
                allow=(r'.+transcripts-20[0-9]{2}/'),
                restrict_xpaths=('//div[@id="article-entry"]', )),
             callback="parse_start_url",
             follow=True),

        # Rules 2-3: follow links to html episode transcripts
        # - URLs that mention day-of-month: hyphen, 1 or 2 digits, hyphen
        Rule(LinkExtractor(
                allow=(r'.+-[0-9]?[0-9]-.+'),
                restrict_xpaths=('//div[@id="article-entry"]', )),
             callback="parse_episode",
             follow=True),
        # - URLs in 2011 containing the substring 'ftn/main20...'
        Rule(LinkExtractor(
                allow=(r'.+/ftn/main20.+'),
                restrict_xpaths=('//div[@id="article-entry"]', )),
             callback="parse_episode",
             follow=True))


Now we turn to defining our two parsing functions.

The first, for the annual archive pages, doesn't really need to do anything. We do not need to separately look for links, because the rules are already taking care of that. We will just use the function to print an update to the Terminal; we can insert additional debugging statements if we get stuck at some point.

In [ ]:
    def parse_start_url(self, response):
        print "**** Visited year page", response.url
        # Nothing to return


Now it is time to work on the core of our scraper: the episode transcript scraper. Let's go to a transcript page and take a closer look at its structure using Firebug and Firepath. Pick the transcript for July 31, 2011.

Two things worth noticing right away:

• The URL that shows up when you hover over this date on the archive list page (which is of the 'ftn/main20' format) is not the same as the URL that shows in the address bar once the page has loaded (which is more like the newer URLs). This is not an issue for us, since we correctly handle the older URL format anyway.

• The date listed for the production of the transcript is different from the date of the show. Let's scrape both dates just to be sure.

First, let's extract the date and time the transcript was produced.
Find the xpath to get to this date & time information:

September 2, 2011, 1:19 PM



We want to extract the text of the date & time element, and then split it up into its component parts. By now we know how to do this, using the split() function.

However, when we split the date-time information by spaces, the time gets split apart from the 'AM' or 'PM' designation. We can put them together again using the opposite of the split function, called 'join'.

The join function has an interesting structure, in that you specify the joining character first. For example, here is how to unsplit a list:

In [2]:
mylist = ['This', 'is', 'a', 'list']
print ' '.join(mylist)
print '-'.join(mylist)

This is a list
This-is-a-list


Now you should be ready to write the code extracting and parsing the production date and time of the transcript.

Try it yourself in the empty code slot below, before looking at the code in the following code slot:

In [ ]:
In [ ]:
# Extract & parse the time the transcript was produced
datetime = hxs.xpath('//article/header//span[@class="time"]/text()').extract()[0]
datetime_parts = datetime.split()
Tmonth = datetime_parts[0]
Tday = datetime_parts[1][:-1]
Tyear = datetime_parts[2][:-1]
Ttime = ' '.join(datetime_parts[3:])
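
To check the splitting logic without running the spider, the same steps can be wrapped in a small function and applied to the sample date & time string shown above. This is a sketch: the function name parse_timestamp is made up for illustration, and it assumes the string always has the shape 'Month D, YYYY, H:MM AM'.

```python
def parse_timestamp(stamp):
    """Hypothetical helper: split 'Month D, YYYY, H:MM AM' into its parts."""
    parts = stamp.split()
    month = parts[0]
    day = parts[1][:-1]          # drop the trailing comma
    year = parts[2][:-1]         # drop the trailing comma
    time = ' '.join(parts[3:])   # glue '1:19' and 'PM' back together
    return month, day, year, time

print(parse_timestamp("September 2, 2011, 1:19 PM"))
```

This returns the four pieces 'September', '2', '2011' and '1:19 PM'.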


Next, we want to extract title, date, and guests of the episode, not all of which are always available. This information is in the same header section as the date & time of production.

title = hxs.xpath('//article/header/h1[@class="title"]/text()').extract()[0]



The title always begins with "Face the Nation" (sometimes without the quotes), followed by the word 'transcript'. Sometimes a colon follows, sometimes the word 'for', and then comes the show's date. After the date the guests follow, if they are specified in the title.

So we'll split the text by spaces and skip over the first 4 or 5 words (depending on whether the 5th word is 'for').

title_parts = title.split()
title_parts = title_parts[5:] if title_parts[4] == 'for' else title_parts[4:]

Sometimes the date does not include the year. We can test for this by checking whether the third remaining word (the one after the month and the day) begins with '20'. If no year is specified, just assume that the transcript was generated in the same year as the show took place. Any remaining parts of the title specify the guests.

Since scrapy will access these transcripts in parallel, we cannot know in what order they will appear in our output file. We might like to sort by date, so let's write a function that converts a month's name to its corresponding number.

Just in case the month's name is abbreviated, let's identify the month by just the first 3 characters of its name. Also, in case the month's name is not capitalized, convert it to lower-case, using the string function lower().

To define this function, we'll take advantage of the function 'index', which returns the index of an item in a list.

In [3]:
mystring = "This string is in Python ABC"
print mystring.lower()

mylist = ['jan', 'feb', 'mar', 'apr', 'may', 'jun']
print mylist.index('apr')

this string is in python abc
3

In [ ]:
def monthnr(month):
    return ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug',
            'sep', 'oct', 'nov', 'dec'].index(month.lower()[:3]) + 1
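
A few calls show how the function behaves. The inputs below are made up for illustration; the last one shows that anything that is not a month name makes index raise a ValueError, which is worth knowing if a title ever turns out to have an unexpected format.

```python
def monthnr(month):
    return ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug',
            'sep', 'oct', 'nov', 'dec'].index(month.lower()[:3]) + 1

print(monthnr('September'))   # full name
print(monthnr('OCT'))         # abbreviated and upper-case

try:
    monthnr('Sunday')         # not a month: index() cannot find 'sun'
except ValueError:
    print("not a month name")
```

The first two calls give 9 and 10.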


Now we're ready to write this section of our parser:

In [ ]:
# Extract title, and date and guests (if available)
title = hxs.xpath('//article/header/h1[@class="title"]/text()').extract()[0]
# Title always begins with "Face the Nation" (sometimes without quotes),
# followed by 'transcript'. sometimes a colon follows, sometimes the word 'for',
# and then the date. Finally, when available, the guests are listed
title_parts = title.split()
title_parts = title_parts[5:] if title_parts[4] == 'for' else title_parts[4:]
month = title_parts[0]
mo = monthnr(month)
day = title_parts[1][:-1]
# Check the length first, so a title that stops after the day does not fail
if len(title_parts) < 3 or title_parts[2][:2] != '20':
    year = Tyear  # Just assume transcript generated in same year as show
    guests = '' if len(title_parts) < 3 else ' '.join(title_parts[2:])
else:
    year = title_parts[2][:4]
    guests = '' if len(title_parts) < 4 else ' '.join(title_parts[3:])
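
The title logic is easier to trust after trying it on a few strings. Below is a sketch that wraps the same steps in a function; the name parse_title, the fallback_year argument (standing in for Tyear), and the sample titles and guest names are all made up for illustration and are not taken from the CBS site.

```python
def parse_title(title, fallback_year):
    """Hypothetical helper: return (month, day, year, guests) from a title."""
    parts = title.split()
    # Skip 'Face the Nation transcript', plus the word 'for' if present
    parts = parts[5:] if parts[4] == 'for' else parts[4:]
    month = parts[0]
    day = parts[1][:-1]                    # drop the trailing comma
    if len(parts) < 3 or parts[2][:2] != '20':
        year = fallback_year               # no year given in the title
        guests = ' '.join(parts[2:])
    else:
        year = parts[2][:4]
        guests = ' '.join(parts[3:])
    return month, day, year, guests

# Made-up titles covering the variations described above
print(parse_title('"Face the Nation" transcript: July 31, 2011: Smith, Jones', '2011'))
print(parse_title('Face the Nation transcript for August 7, 2011', '2011'))
print(parse_title('Face the Nation transcript: July 31, Smith', '2011'))
```

In the first title the guests come out as 'Smith, Jones'; in the second there are none; in the third the year falls back to the one passed in.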


This brings us to the final part of our parser: the actual transcript text. Navigate your way down the html parse tree in Firebug to the transcript, and you'll see that the text is full of <p> tags (not surprisingly). We would like to get rid of these tags somehow, but we do not want to access the <p> elements one by one.

The module BeautifulSoup (which comes pre-installed with Anaconda) is specifically designed to do just that. So let's write a function to take text and strip out all html tags. We might as well get rid of any special characters at the same time, using unidecode.

In [ ]:
def get_cleantext(xp):
    """Get clean text out of an xpath."""
    import re
    from unidecode import unidecode
    from bs4 import BeautifulSoup

    if len(xp.extract()) > 0:
        text = BeautifulSoup(xp.extract()[0]).get_text()
        return re.sub(r'\s+', ' ', unidecode(text))
    else:
        return ''
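
The last step of this function uses one more regular expression: '\s' matches any whitespace character (space, tab, or newline), so '\s+' matches a whole run of them, and re.sub replaces each run by a single space. A small sketch of just that step, on a made-up string and without BeautifulSoup or unidecode:

```python
import re

# Made-up text with line breaks, tabs and repeated spaces
messy = "Hello,\n\n  world \t again"

tidy = re.sub(r'\s+', ' ', messy)
print(tidy)
```

This prints 'Hello, world again'. Whitespace at the very start or end of the text is reduced to a single space as well, not removed.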


There is another possible wrinkle to take into account: some of these shows might be so long that the transcript overflows the maximum string length for some programs (usually the maximum is 32767, or 2^15-1).

To handle this possibility, we will cut the transcript into consecutive chunks and save them as separate items, while adding a 'part' number indicating which part of the transcript for a given show it is.

In [ ]:
# Extract text, and divide into parts that don't overflow the handler
text = get_cleantext(hxs.xpath('//div[@id="article-entry"]'))
nrparts = len(text)/maxsize
if nrparts != len(text)/float(maxsize):
    nrparts += 1
for part in xrange(nrparts):
    # Characters belonging to this part of the transcript
    partial_text = text[part * maxsize:(part+1) * maxsize]
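
The arithmetic is easiest to follow on a tiny example. The sketch below cuts a 10-character string into chunks of at most 4 characters; the function name split_into_parts is made up for illustration, and it uses '//' (whole-number division) and '%' (remainder) so that it gives the same result whichever version of Python runs it.

```python
def split_into_parts(text, maxsize):
    """Hypothetical helper: cut text into consecutive chunks of at most maxsize."""
    nrparts = len(text) // maxsize   # number of full chunks
    if len(text) % maxsize:          # leftover characters need one more chunk
        nrparts += 1
    return [text[part * maxsize:(part + 1) * maxsize]
            for part in range(nrparts)]

chunks = split_into_parts('abcdefghij', 4)
print(chunks)
```

This gives three parts: 'abcd', 'efgh' and 'ij'. Joining the parts back together returns the original text, so nothing is lost at the boundaries.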


Now we should be ready to put everything together. Your parser should look something like this:

In [ ]:
    def parse_episode(self, response):
        """Function to parse Face the Nation transcript page."""
        transcripts = []
        hxs = Selector(response)
        maxsize = 32700

        # Extract & parse the time the transcript was produced
        datetime = hxs.xpath('//article/header//span[@class="time"]/text()').extract()[0]
        datetime_parts = datetime.split()
        Tmonth = datetime_parts[0]
        Tday = datetime_parts[1][:-1]
        Tyear = datetime_parts[2][:-1]
        Ttime = ' '.join(datetime_parts[3:])

        # Extract title, and date and guests (if available)
        title = hxs.xpath('//article/header/h1[@class="title"]/text()').extract()[0]
        # Title always begins with "Face the Nation" (sometimes without quotes)
        # sometimes a colon follows, sometimes the word 'for', and then the date
        title_parts = title.split()
        title_parts = title_parts[5:] if title_parts[4] == 'for' else title_parts[4:]
        month = title_parts[0]
        mo = monthnr(month)
        day = title_parts[1][:-1]
        # Check the length first, so a title that stops after the day does not fail
        if len(title_parts) < 3 or title_parts[2][:2] != '20':
            year = Tyear  # Just assume transcript generated in same year as show
            guests = '' if len(title_parts) < 3 else ' '.join(title_parts[2:])
        else:
            year = title_parts[2][:4]
            guests = '' if len(title_parts) < 4 else ' '.join(title_parts[3:])

        # Extract text, and divide into parts that don't overflow the handler
        text = get_cleantext(hxs.xpath('//div[@id="article-entry"]'))
        nrparts = len(text)/maxsize
        if nrparts != len(text)/float(maxsize):
            nrparts += 1
        for part in xrange(nrparts):
            partial_transcript = FTNItem()
            partial_transcript['month'] = month
            partial_transcript['monthnr'] = mo  # needs a monthnr field in FTNItem
            partial_transcript['day'] = day
            partial_transcript['year'] = year
            partial_transcript['Tmonth'] = Tmonth
            partial_transcript['Tday'] = Tday
            partial_transcript['Tyear'] = Tyear
            partial_transcript['Ttime'] = Ttime
            partial_transcript['part'] = part + 1
            partial_transcript['title'] = title
            partial_transcript['guests'] = guests
            partial_transcript['text'] = text[part * maxsize:(part+1) * maxsize]
            transcripts.append(partial_transcript)
        return transcripts


Test it to see whether it works.

That's it for the tutorial. At this point you should have a good sense of how to tackle your own webscraping project. Good luck!
