An introduction to webscraping, part II

A tutorial for those with little or no programming background
By Maurits van der Veen
Last modified October 2015

1. Starting a Scrapy project

Scrapy is a web scraping package that does most of the actual scraping work for you. You supply information about the URLs you wish to scrape and about the data you'd like to collect from those pages; scrapy basically does the rest.

For our first scraping exercise, we will collect names and email addresses from the departmental faculty directories at William & Mary. We'll call our project faculty_emails.

To start a new scraper, navigate to a folder where you'd like your web scrapers to reside, and type, on the command line (not inside ipython notebook):
scrapy startproject faculty_emails

This will create a new directory named faculty_emails and fill it with some templates for the necessary scraping files.

Note: Depending on how your system is set up, you may need to be within the folder named anaconda when you type the scrapy startproject command.

If you did it right, you will see something very much like the following:

2015-10-19 13:34:30 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-19 13:34:30 [scrapy] INFO: Optional features available: ssl, http11
2015-10-19 13:34:30 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'faculty_emails' created in:
    /Users/maurits/python/scraping/faculty_emails

You can start your first spider with:
    cd faculty_emails
    scrapy genspider example example.com

2. Specifying items to scrape

The startproject command created a folder called faculty_emails. Inside that folder we find another folder with the same name. Inside that one are several files, including one named items.py.
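
If you're curious, the generated project layout should look roughly like this (the exact set of files may vary slightly with your scrapy version):

faculty_emails/
    scrapy.cfg            # project configuration file
    faculty_emails/       # the project's python module
        __init__.py
        items.py          # item definitions (we edit this next)
        pipelines.py
        settings.py
        spiders/          # spider code will go here
            __init__.py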

We edit this file to specify the items we wish to scrape. Editing files is best done in a standard text editor such as TextEdit on the Mac. Make sure to work in 'plain text' mode. (Other good text editors for the Mac are Brackets and Apple's Xcode).

This file defines a 'class' to hold the different things you want to extract for each scraped item. Scrapy tries to give the class a helpful name; you may change it as you wish.

In python, indentation matters. Inside (i.e. below) the class definition, every line should have the same indentation of 4 spaces or 1 tab.

Comments in python are signaled by a #: python ignores everything from the # to the end of the line. You'll see that there are several comment lines in items.py; you can just leave these be.
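
For example, both of the following are valid comments (these two lines are just an illustration, not something you need to add to the file):

# this whole line is a comment; python ignores it
department = scrapy.Field()   # a comment can also follow code on the same line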

Change the file to look as follows (after the first few lines of comments):

import scrapy

class FacultyEmailsItem(scrapy.Item):
    department = scrapy.Field()
    last = scrapy.Field()
    first = scrapy.Field()
    email = scrapy.Field()

With the import command, we are telling python that we want to use all the functions, classes, etc. that scrapy has pre-defined for us. To refer to any of these, we prefix them with the name scrapy. Hence scrapy.Item in the line that names our class.

If you prefer, you can import only the specific classes we need:
from scrapy import Field, Item

If you do this, you can just refer to Field and Item directly, without the scrapy prefix.
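
For instance, with that style of import, the items.py file above could equivalently be written as:

from scrapy import Field, Item

class FacultyEmailsItem(Item):
    department = Field()
    last = Field()
    first = Field()
    email = Field()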

Once your file looks right, save the changes and close it.

3. Defining the crawler

Next, we need to define a spider (to crawl the web :-).

The spider needs to know 3 key things:

  1. Which initial set of URLs to visit
  2. How to follow links from those initial URLs (if applicable)
  3. How to extract and parse the fields we are interested in.

In addition, the spider must have a name you can call it by. Scrapy will take care of the rest.

The file defining the spider should reside inside the folder called spiders (which is at the same level as the items.py file we just edited). Using your text editor of choice, create a new, empty file called fac_emails_spider.py and save it in the spiders folder.

The first thing we need to do in our spider file is import some functions from scrapy:

from scrapy.spiders import CrawlSpider
from scrapy import Selector

Next, we want to import our own item definition we just created. Make sure the name you specify to import is the exact same name you gave to your class in items.py.

from faculty_emails.items import FacultyEmailsItem

With the imports out of the way, we can start defining our spider.

The first two lines tell python the name you want to give to your spider's class, as well as the name you want scrapy to use for your spider:

class FacultyEmailsSpider(CrawlSpider):
    name = "facemail"

Next, we tell scrapy where it is allowed to go and which URLs it should start with. The allowed_domains list tells scrapy never to follow a link to a website outside www.wm.edu.

The start_urls list tells scrapy which pages we would like to scrape. For more complicated spiders, we could allow scrapy to follow links from those pages to other pages, and so on. (This is why having a constraint on allowed_domains is valuable.)

    allowed_domains = ["www.wm.edu"]
    start_urls = ["http://www.wm.edu/as/classicalstudies/faculty/index.php",
                  "http://www.wm.edu/as/biology/people/faculty/index.php",
                  "http://www.wm.edu/as/government/faculty/directory/index.php",
                  "https://www.wm.edu/as/linguistics/faculty/index.php"]

Finally, we tell scrapy how it ought to parse each webpage it visits, with the function parse:

    def parse(self, response):

Inside this function, we give scrapy the xpath we worked out for the faculty names in the directory. To do so, we first need to get scrapy to load the page's html source in a format we can apply xpath to: this is where the Selector we imported earlier comes in:

        pagesource = Selector(response)

The xpath specification will get us our desired results, but as Selector objects, a format meant for further xpath selecting. If we just want the text itself, we need to extract it.

        names = pagesource.xpath('//article/p/a[@class="person_name"]/text()').extract()
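
If you'd like to test an xpath interactively before running the full spider, scrapy's shell is handy. On the command line, type scrapy shell followed by one of the start URLs, then try the xpath at the >>> prompt it opens. A rough sketch of such a session (the output shown here is only illustrative, not actual results):

scrapy shell "http://www.wm.edu/as/classicalstudies/faculty/index.php"

>>> response.xpath('//article/p/a[@class="person_name"]/text()')
[<Selector xpath='...' data=u'Some Name'>, ...]
>>> response.xpath('//article/p/a[@class="person_name"]/text()').extract()
[u'Some Name', u'Another Name', ...]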

This should give us a list of the names in the directory, as we saw when we entered this xpath specification in Firepath. We want to store each of the names in its own item. For now, we just store it in the slot for last name and leave all the other slots empty. (For a starting point, we just want to see whether the scraping works as expected.)

The following code accomplishes just that. It creates an empty list (lists in python are written with square brackets), then loops over each name in the list names. For each name, it creates a new "item", and fills the last name slot with the name. Finally, it appends this item to the end of the list of entries. When it's all done, it returns the list of entries to scrapy.

        entries = []
        for name in names:
            entry = FacultyEmailsItem()
            entry['last'] = name
            entries.append(entry)
        return entries
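
As an aside: scrapy also accepts items one at a time if parse yields them instead of returning a list. In that style, the entries list and the return line go away and the loop would look like the sketch below; either version works for this exercise.

        for name in names:
            entry = FacultyEmailsItem()
            entry['last'] = name
            yield entry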

That's it! If you've followed along correctly, your file fac_emails_spider.py should look as follows:

from scrapy.spiders import CrawlSpider
from scrapy import Selector
from faculty_emails.items import FacultyEmailsItem

class FacultyEmailsSpider(CrawlSpider):
    name = "facemail"
    allowed_domains = ["www.wm.edu"]
    start_urls = ["http://www.wm.edu/as/classicalstudies/faculty/index.php",
                  "http://www.wm.edu/as/biology/people/faculty/index.php",
                  "http://www.wm.edu/as/government/faculty/directory/index.php",
                  "https://www.wm.edu/as/linguistics/faculty/index.php"]

    def parse(self, response):
        pagesource = Selector(response)
        names = pagesource.xpath('//article/p/a[@class="person_name"]/text()').extract()
        entries = []

        for name in names:
            entry = FacultyEmailsItem()
            entry['last'] = name
            entries.append(entry)
        return entries

Remember that indentation matters in python.
The from ... import ... lines and the class line should not be indented at all; all others should be indented as shown here.

Once you have completed the definition of your spider, save the file and close it.

4. Crawling the web

You are now ready to tell scrapy to start scraping for you. Scrapy can collect output in several different formats. We are going to ask it to save in comma-separated-value format (csv), which is a standard spreadsheet format readable by programs such as Excel.

On your command line, inside the top-most faculty_emails folder, type:

scrapy crawl facemail -t csv -o scraped_emails.csv

The -t tells scrapy that the next item specifies the output format (csv); the -o tells scrapy that the next item specifies the name of the output file.
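
As it happens, scrapy can usually infer the format from the file extension, so the -t flag is often optional. For instance, to get the same data as JSON instead, something like this should work:

scrapy crawl facemail -o scraped_emails.json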

Once scrapy starts running, it will spit out a lot of progress information. This is very helpful when you are trying to fix problems, so for now, just let this be. Next time we will see how to suppress this information once we are confident the spider works.

When scrapy stops running, look in the same top-most faculty_emails folder for a file with the name scraped_emails.csv. Inspect it in Excel or some other spreadsheet program; it should have several empty columns and one column (labeled last) with the faculty names in it.
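
If you'd rather peek at the output from python instead, the standard csv module can read the file. A minimal sketch, assuming you run it from that same top-most faculty_emails folder:

import csv

# open the scraped file and print the contents of the 'last' column
with open('scraped_emails.csv') as infile:
    for row in csv.DictReader(infile):
        print(row['last'])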

The faculty names have some funny characters in them; we will deal with that next time.

If you want to play around with this a bit more before the next session, see if you can scrape some other pieces of information for each faculty member. (Warning: this is harder than it looks, so don't get discouraged if something doesn't work as you would expect.)
