An introduction to webscraping, part III

A tutorial for those with little or no programming background
By Maurits van der Veen
Last modified October 2015

Now that we have a basic web scraper, we can work on completing it. This requires three additional steps: splitting the name information we have already scraped into first and last names, extracting the department name from the URL, and scraping the email addresses for each faculty member.

We'll begin with the department name.

(Reminder: press shift-enter with your cursor inside a code snippet to run that code snippet)

1. Basic text manipulation

The department name is included in the URL of the webpage we send our scraper to, so we can easily extract it from there. In the parser, we can get to the URL as part of the parameter named response: response.url gives us the URL as a string variable (a string is any sequence of characters stored together).

Python strings have a great method, split(), that splits a string into pieces based on a separator you specify. The pieces are put together into a list. If you don't specify a separator, the string is split on whitespace. This is a great way to separate a sentence into words:

In [2]:
mystring = 'This is a sample string'
mystring.split()
Out[2]:
['This', 'is', 'a', 'sample', 'string']

For a URL, the logical separator is the forward slash:

In [4]:
myurl = "http://www.wm.edu/as/government/faculty"
myurl.split('/')
Out[4]:
['http:', '', 'www.wm.edu', 'as', 'government', 'faculty']

Note that the second item in the list is an empty string, which results from the '//' after http: in the URL. The department name is the 5th item in the list.

To retrieve an individual item from a list in python, we just specify its position in the list, but we start counting at 0, not 1. Thus the department name is myurl.split('/')[4]:

In [5]:
myurl.split('/')[4]
Out[5]:
'government'

This gives us the department name as it appears in web addresses etc., which is fine for our purposes. So we can add these lines to the beginning of our parser:

urlparts = (response.url).split('/')
department = urlparts[4]

Then, within the loop over the names, simply add the following to each item:

entry['department'] = department

Some might prefer to have the department name separated into words and capitalized, as it appears on the page (for example: 'Classical Studies' rather than 'classicalstudies').

To do so, they would need to look at the "breadcrumbs" that appear at the top of the page, just below the banner picture. The breadcrumbs can be reached with the xpath //ul[@id="breadcrumbs"]. I leave it as an exercise to figure out how to extract the department name from within that list.

1a. Substrings & sublists

In Python, it is easy to get substrings out of strings. Strings are indexed the same way lists are, by position in the string, and starting at 0.

You specify a range of characters by giving the first one to include and the first one to exclude (not the last one to include!), separated by a colon.

In [2]:
mystring = 'This is a sample string'
print mystring[5:7]
is

If you want to start at the beginning of the string you need not specify the initial index; if you want to go all the way to the end, you need not specify the final index.

In [3]:
print mystring[:4]
print mystring[4:]
This
 is a sample string

Finally, you can index a string by counting backwards from the end, using negative numbers.

In [4]:
print mystring[-1]
print mystring[-6:]
g
string

Added bonus: strings are essentially lists of characters. This means that you can index lists the same way you do strings.

In [6]:
mylist = ['This', 'is', 'a', 'list']
print mylist[-1]
list

1b. Special characters

Next, we turn to the first and last names of individual faculty members. As we saw when we scraped their names, the first and last names are separated not just by a comma, but also by a special character that Excel, at least, will not display correctly (on my laptop, it inserts 'Â' between the comma and the space).

This is a fairly common problem with online texts: there are often special characters with a particular typographical meaning that cannot be represented in standard character sets. Among the more common are accents: ç, é, å, etc.

What we will generally want to do is to simplify these as much as possible (so the 3 letters just listed become c, e, and a), and then simply remove any remaining characters. We can do this using the unidecode module in Python, which I asked you to install on your system.

Strings are most commonly read in unicode format, marked by the letter 'u' preceding the string. Unidecode will convert these to non-unicode strings, trying to keep the string looking as close to the original as possible.

In [14]:
from unidecode import unidecode

mystring = u'This is a test for tough characters: å ç é'
unidecode(mystring)
Out[14]:
'This is a test for tough characters: a c e'

In our scraper, we can apply unidecode to the name we extract, and then split the name into first and last name by separating on ', '. To do so, insert the line

from unidecode import unidecode

at the top of your spider file, immediately below the other import statements.
Next, add the following lines

fullname = unidecode(name)
nameparts = fullname.split(', ')
entry['last'] = nameparts[0]
entry['first'] = nameparts[1]

inside the for loop in your parse function.
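
Before you run the full spider, you can try these lines interactively to confirm they do what we expect. The sample name below mimics what the scraper extracts; the u'\xa0' is the non-breaking space that showed up as 'Â' in Excel, and unidecode should turn it into an ordinary space:

from unidecode import unidecode

name = u'Leu,\xa0Matthias'      # sample scraped name: comma + non-breaking space
fullname = unidecode(name)      # 'Leu, Matthias' -- the special character becomes a plain space
nameparts = fullname.split(', ')
print nameparts[0]              # Leu
print nameparts[1]              # Matthias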

Try it, to make sure it works. The output file should now have 3 of the columns filled in: department, last name, and first name.

Note that scrapy tries to process multiple websites in parallel. This means that we cannot be assured that all the faculty names from one department are listed in a block before all the names from another department. That is why it is convenient to associate the department information with each individual name.

2. Dynamic page sources

Finally, we turn to the emails. Remember the xpath we developed in part I of this tutorial to extract the emails:
//article/p/a[contains(@href,"mailto")].

Let's add that to our spider's parsing function:

emails = pagesource.xpath('//article/p/a[contains(@href,"mailto")]/text()').extract()

This should give us a list of emails that parallels our list of names. We would like to loop down the two lists together. Fortunately, python has a function that allows us to do just that: zip.

Zip creates 'tuples' (a particular type of list that does not change once you create it) that combine the first item of list 1 with the first item of list 2, the second item of list 1 with the second item of list 2, and so on:

In [15]:
list1 = ['This', 'is', 'a', 'list']
list2 = ['Questo', 'è', 'un', 'elenco']
zip(list1, list2)
Out[15]:
[('This', 'Questo'), ('is', '\xc3\xa8'), ('a', 'un'), ('list', 'elenco')]

(Note how the 'è' got messed up in the process, because it contains a special character)

We can zip as many lists together as we wish, as long as they have the same length.

(Warning: if they are not the same length, the resulting list will be equal in length to the shortest of the lists included; among others, this means that zipping something together with an empty list will simply produce an empty list)
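
A quick sketch to make that truncation behavior concrete:

list1 = ['a', 'b', 'c']
list2 = [1, 2]
print zip(list1, list2)    # [('a', 1), ('b', 2)] -- the extra 'c' is silently dropped
print zip(list1, [])       # [] -- zipping with an empty list produces an empty list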

We can extract paired values together from a zipped list in a for loop:

for name, email in zip(names, emails):
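
For instance, here is a minimal stand-alone version of that loop, using sample names and emails:

names = ['Leu, Matthias', 'Miller, Katherine']
emails = ['mleu@wm.edu', 'kjmiller@wm.edu']
for name, email in zip(names, emails):
    print name, '->', email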

Putting it all together, your parse function should now look as below. Try running it.

In [ ]:
def parse(self, response):
    pagesource = Selector(response)
    urlparts = (response.url).split('/')
    department = urlparts[4]

    names = pagesource.xpath('//article/p/a[@class="person_name"]/text()').extract()
    emails = pagesource.xpath('//article/p/a[contains(@href,"mailto")]/text()').extract()
            
    entries = []
    for name, email in zip(names, emails):
        entry = FacultyEmailsItem()
        fullname = unidecode(name)
        nameparts = fullname.split(', ')
        entry['last'] = nameparts[0]
        entry['first'] = nameparts[1]
        entry['email'] = email
        entry['department'] = department
        entries.append(entry)
    return entries

Our output file is empty. What happened?

We know it worked until we added the xpath for emails. And we know the xpath specification for the emails works correctly in Firepath. Yet the emails are not getting captured. The debugging information scrapy spits out is no help either; everything seems to work OK on that front.

We do know that the empty output file must mean that the emails list is empty. This makes the zipped-together names & emails empty, which in turn makes our output empty. But why is the emails list empty?

2a. Reading text from a file

The best way to figure out what is going wrong is to see what happens if we try our scraping code on a single webpage, interactively. To do so, first download the source code for one of the department webpages to your computer.

Go to the biology directory (https://www.wm.edu/as/biology/people/faculty/index.php).
Choose "Save As...", and as the format, choose "Webpage, HTML only".

Now how do we read this into an ipython notebook?

Reading data from a file is pretty easy. We open the file by specifying its name and the type of operation we wish to do (rt for read text). Assign a name to the open file, and then just read everything from the file into a variable. This reads the data in as one long string.

Note that we must specify the filename in such a way that python can find it. Assuming you are running this in an ipython notebook, you can look at the URL of this page to see where python will try to look. Everything that comes after localhost:8888/notebooks/ indicates the path to our notebook file.

If the only thing listed is the name of the notebook itself, then your working directory is your main user directory. If you put the downloaded html file there, you can just refer directly to it by name. If you put it in another folder inside the main user directory, you need to specify the folder name. If you put it in another folder somewhere else, you need to tell python how to find it.

For example, if your notebook is in Downloads, but your html file is in Desktop, you can refer to it as '../Desktop/BioDep.html' (assuming you named the html file 'BioDep.html'). The '..' in the filename indicates that the program should look one level up from where we are, and then down into the Desktop folder.
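
If you are not sure whether python can see the file from your working directory, the os module offers an easy check (the path below is just the hypothetical example from the previous paragraph):

import os

print os.getcwd()                               # the folder python is currently working in
print os.path.exists('../Desktop/BioDep.html')  # True if python can find the file there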

Let's see how long the html source code is, and then display a piece of it that allows us to see if it looks as expected. (Note that here we are asking for a lengthy substring inside the even longer string that is the html source.)

In [8]:
with open('BioDep.html', 'rt') as inf:
    origsource = inf.read()
print "File length", len(origsource)
print "\n", origsource[19000:21000]
File length 37568

u"><img alt="Mathhias Leu" src="../../images/thumbnails/leu_m100pix.jpg"/></a><p><a class="person_name" href="leu_m.php">Leu,&#160;Matthias</a><br/><span class="person_position">Assistant Professor</span><br/><span class="person_field_title">Office</span>: Integrated Science Center 2129<br/><span class="person_field_title">Phone</span>: 757 221 7497<br/><span class="person_field_title">Email</span>: [[mleu]]<br/><span class="person_field_title">Web site</span>: {{http://wmpeople.wm.edu/mleu}}<br/></p></article><article class="item_listing directory_listing"><a href="miller_k.php" title="Katherine Miller"><img alt="Katherine Miller" src="../../images/directoryphotos/miller-katherine100pix.jpg"/></a><p><a class="person_name" href="miller_k.php">Miller,&#160;Katherine</a><br/><span class="person_position">Visiting Assistant Professor</span><br/><span class="person_field_title">Office</span>: Integrated Science Center 3051<br/><span class="person_field_title">Phone</span>: 757-221-2491<br/><span class="person_field_title">Email</span>: [[kjmiller]]<br/></p></article><article class="item_listing directory_listing"><a href="murphy_h.php" title="Helen Murphy"><img alt="Helen Murphy" src="../../images/thumbnails/HelenMicPicThumb.jpg"/></a><p><a class="person_name" href="murphy_h.php">Murphy,&#160;Helen</a><br/><span class="person_position">Assistant Professor</span><br/><span class="person_field_title">Office</span>: Integrated Science Center 2133<br/><span class="person_field_title">Phone</span>: 757-221-2216<br/><span class="person_field_title">E-mail</span>: hamurphy@wm.edu<br/><span class="person_field_title">Website</span>: {{http://www.helenmurphy.net}}<br/></p></article><article class="item_listing directory_listing"><a href="puzey_j.php" title="Joshua Puzey"><img alt="Joshua Puzey" src="../../images/thumbnails/puzey-joshua100px.jpg"/></a><p><a class="person_name" href="puzey_j.php">Puzey,&#160;Joshua</a><br/><span class="person_position">Assistant Professor</span><b

Try to isolate a single person's information from this text, by highlighting it. Remember, each person is an article on the webpage, so find an opening <article> tag and highlight all the way to the closing tag. Look at the component pieces:

In [ ]:
<article class="item_listing directory_listing">
 <a href="miller_k.php" title="Katherine Miller">
  <img alt="Katherine Miller" src="../../images/directoryphotos/miller-katherine100pix.jpg"/></a>
 <p>
   <a class="person_name" href="miller_k.php">Miller,&#160;Katherine</a><br/>
   <span class="person_position">Visiting Assistant Professor</span><br/>
   <span class="person_field_title">Office</span>
   : Integrated Science Center 3051<br/>
   <span class="person_field_title">Phone</span>
   : 757-221-2491<br/>
   <span class="person_field_title">Email</span>
   : [[kjmiller]]<br/>
 </p>
</article>

This is not the same source that we saw displayed on the webpage in Firebug! How did that happen?

It turns out that many webpages are filled in dynamically using small programs or scripts (often in Javascript) that run when you load the page. In our case, such a script takes an email reference and converts it to a hyperlinked @wm.edu reference.

(Note: Sometimes this kind of dynamic loading means that it is impossible to scrape the page unless you load it completely and send a normal browser to open it. The tool to do that with is Selenium, but it considerably complicates the scraping process.)
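
For completeness, a minimal sketch of that approach might look as follows, assuming you have the selenium package installed and Firefox available (we will not need it for our directory pages):

from selenium import webdriver

driver = webdriver.Firefox()       # launches a real browser window
driver.get('https://www.wm.edu/as/biology/people/faculty/index.php')
rendered = driver.page_source      # the html source *after* any scripts have run
driver.quit()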

Fortunately, in our case we can identify the email address in each 'article' even without the href inside the anchor: it is whatever is enclosed in double square brackets.

Notice that the email address is not directly enclosed in any <span> or <a> tags; instead, it is just text within the broader <p> tag. So what we need to do is get all of the text out of that <p> tag and then isolate the part in the double square brackets.

We can try it first here, interactively in python, before adding the code to our spider's parser. To do so, we need to pass the text we read in from the file to the xpath Selector. Fortunately, there is a straightforward way to do so: just tell the Selector we are passing it plain text:

pagesource = Selector(text=origsource)

Next, we need to get the personal data for all the people in the directory. Since we are interested in the text in between the <p> and </p> tags, we isolate those <p> elements:

persondata = pagesource.xpath('//article/p')

Now let's see what that looks like:

In [9]:
from scrapy import Selector

pagesource = Selector(text=origsource)
persondata = pagesource.xpath('//article/p')
persondata
Out[9]:
[<Selector xpath='//article/p' data=u'<p><a class="person_name" href="allen_j.'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="allison_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="bradley_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="buchser_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="case_m.p'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="chambers'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="cristol_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="dalgleis'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="deberry_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="fashing_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="forsyth_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="griffin_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="heideman'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="hinton_s'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="huber_s.'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="kerscher'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="lamar_d.'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="leu_m.ph'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="miller_k'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="murphy_h'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="puzey_j.'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="ryan_s.p'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="saha_m.p'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="sanderso'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="saunders'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="shakes_d'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="sher_b.p'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="swaddle_'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="van_mete'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="watts_b.'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="wawersik'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="williams'>,
 <Selector xpath='//article/p' data=u'<p><a class="person_name" href="zwollo_p'>]

So far, so good: we have a list of Selectors, each of which contains the data for one person in the department. We would like to turn this into a list of only their email addresses.

Python provides a very convenient way to generate one list from another, called a list comprehension. It is effectively a for loop inside list brackets. Here is an example:

In [24]:
mylist1 = ['this', 'is', 'a', 'list']
mylist2 = [x + '!' for x in mylist1]
mylist2
Out[24]:
['this!', 'is!', 'a!', 'list!']

What we did there was add an exclamation point to every item in the first list, by concatenating each item with the exclamation point string.

Now let's do a list comprehension to extract just the text from the persondata, and see what that looks like (we'll display just the first two items of the resulting list):

In [10]:
personinfo = [p.xpath('text()').extract() for p in persondata]
personinfo[:2]
Out[10]:
[[u': Integrated Science Center 2123',
  u': 757 221 7498',
  u': [[jdallen]]',
  u': {{http://wmpeople.wm.edu/jdallen}}'],
 [u': Integrated Science Center 2117',
  u': 757 221 2232',
  u': [[laalli]]',
  u': {{http://www.lizabethallison.com}}']]

Each <p> element appears to have 4 separate pieces of text (which are bundled together in a list): the office address, the phone number, the email, and the website. Since not every person has a website, some may have just 3 pieces, but that is of no concern to us at the moment.

Rather than assuming that the email is always the third item in this list, let's just look for the double square brackets. We can do this in another list comprehension.

We want to do something for every person's list in personinfo, so we know the list comprehension will end in:

for p in personinfo]

What we want to do for each of those lists is keep only the item that contains double square brackets. This means we want to do another list comprehension for each of those sublists, but this one with a test for inclusion:

if '[[' in x

If we decide to include the item, we'll just keep it as is, so the list comprehension becomes:

[x for x in p if '[[' in x]

Now let's put this list comprehension inside the other one:

emailinfo = [[x for x in p if '[[' in x] for p in personinfo]
In [11]:
emailinfo = [[x for x in p if '[[' in x] for p in personinfo]
emailinfo
Out[11]:
[[u': [[jdallen]]'],
 [u': [[laalli]]'],
 [u': [[elbrad]]'],
 [u': [[wjbuchser]]'],
 [u': [[macase]]'],
 [u': [[rmcham]]'],
 [u': [[dacris]]'],
 [u': [[hjdalgleish]]'],
 [u': [[dadeberry]]'],
 [u': [[njfash]]'],
 [u': [[mhfors]]'],
 [u': [[jdgri2]]'],
 [u': [[pdheid]]'],
 [u': [[sdhinton]]'],
 [u': [[v|skhuber]]'],
 [u': [[opkers]]'],
 [u': [[mdlama]]'],
 [u': [[mleu]]'],
 [u': [[kjmiller]]'],
 [],
 [u': [[jrpuzey]]'],
 [u': [[spryan01]]'],
 [u': [[mssaha]]'],
 [u': [[slsand]]'],
 [u': [[bdsaun]]'],
 [u': [[dcshak]]'],
 [u': [[btsher]]'],
 [u': [[jpswad]]'],
 [u': [[tevanmeter]]'],
 [u': [[bdwatt]]'],
 [u': [[mjwawe]]'],
 [u': [[kewilliamson]]'],
 [u': [[pxzwol]]']]

This looks pretty good. There are two strange results: one entry has a vertical bar in it: v|skhuber, and one is empty. Looking back at the Biology department webpage, we can see that the v| tells the program that fills in emails to make the email @vims.edu rather than @wm.edu. The empty one belongs to a faculty member whose email address appears to have been entered by hand in a non-standard way.

We'll just leave the v| for now, and we'll solve the other problem by also looking for @ as an alternative to double square brackets:

In [12]:
emailinfo = [[x for x in p if '[[' in x or '@' in x] for p in personinfo]
emailinfo
Out[12]:
[[u': [[jdallen]]'],
 [u': [[laalli]]'],
 [u': [[elbrad]]'],
 [u': [[wjbuchser]]'],
 [u': [[macase]]'],
 [u': [[rmcham]]'],
 [u': [[dacris]]'],
 [u': [[hjdalgleish]]'],
 [u': [[dadeberry]]'],
 [u': [[njfash]]'],
 [u': [[mhfors]]'],
 [u': [[jdgri2]]'],
 [u': [[pdheid]]'],
 [u': [[sdhinton]]'],
 [u': [[v|skhuber]]'],
 [u': [[opkers]]'],
 [u': [[mdlama]]'],
 [u': [[mleu]]'],
 [u': [[kjmiller]]'],
 [u': hamurphy@wm.edu'],
 [u': [[jrpuzey]]'],
 [u': [[spryan01]]'],
 [u': [[mssaha]]'],
 [u': [[slsand]]'],
 [u': [[bdsaun]]'],
 [u': [[dcshak]]'],
 [u': [[btsher]]'],
 [u': [[jpswad]]'],
 [u': [[tevanmeter]]'],
 [u': [[bdwatt]]'],
 [u': [[mjwawe]]'],
 [u': [[kewilliamson]]'],
 [u': [[pxzwol]]']]

Now we just need to extract the actual email address, getting rid of the double brackets, as well as the leading colon. This will require another list comprehension.

Each email address is wrapped in a single-item list, so the first thing the list comprehension has to do is extract that single item: x[0].

However, note that if we still have an empty list somewhere (we fixed that problem for Biology, but what if there is a faculty member for whom there is no email address listed at all?), referring to x[0] will cause a problem. So let's prevent this potential problem first, by replacing any empty list (identified by a length — len(x) — of 0) by a single-item list of our own making:

emailinfo = [x if len(x) > 0 else ['  no email'] for x in emailinfo]

Next, we want to extract the text starting at the fifth character (after ': [[') and stopping two characters before the end. To this email, we can attach the extension @wm.edu:

x[0][4:-2] + '@wm.edu'

This will not work well for the one email address in our list that does not have the square brackets, so we will add a special case for that one (as well as for any empty items fixed in the previous statement):

In [13]:
emailinfo = [x if len(x) > 0 else ['  no email'] for x in emailinfo]
emails = [x[0][4:-2] + '@wm.edu' if '[[' in x[0] else x[0][2:] for x in emailinfo]
emails
Out[13]:
[u'jdallen@wm.edu',
 u'laalli@wm.edu',
 u'elbrad@wm.edu',
 u'wjbuchser@wm.edu',
 u'macase@wm.edu',
 u'rmcham@wm.edu',
 u'dacris@wm.edu',
 u'hjdalgleish@wm.edu',
 u'dadeberry@wm.edu',
 u'njfash@wm.edu',
 u'mhfors@wm.edu',
 u'jdgri2@wm.edu',
 u'pdheid@wm.edu',
 u'sdhinton@wm.edu',
 u'v|skhuber@wm.edu',
 u'opkers@wm.edu',
 u'mdlama@wm.edu',
 u'mleu@wm.edu',
 u'kjmiller@wm.edu',
 u'hamurphy@wm.edu',
 u'jrpuzey@wm.edu',
 u'spryan01@wm.edu',
 u'mssaha@wm.edu',
 u'slsand@wm.edu',
 u'bdsaun@wm.edu',
 u'dcshak@wm.edu',
 u'btsher@wm.edu',
 u'jpswad@wm.edu',
 u'tevanmeter@wm.edu',
 u'bdwatt@wm.edu',
 u'mjwawe@wm.edu',
 u'kewilliamson@wm.edu',
 u'pxzwol@wm.edu']

As a final step, let's fix the VIMS email. By now you should be able to decipher what is going on in the following statement:

In [14]:
emails = [x if 'v|' not in x else x.split('@')[0][2:] + '@vims.edu' for x in emails]
emails
Out[14]:
[u'jdallen@wm.edu',
 u'laalli@wm.edu',
 u'elbrad@wm.edu',
 u'wjbuchser@wm.edu',
 u'macase@wm.edu',
 u'rmcham@wm.edu',
 u'dacris@wm.edu',
 u'hjdalgleish@wm.edu',
 u'dadeberry@wm.edu',
 u'njfash@wm.edu',
 u'mhfors@wm.edu',
 u'jdgri2@wm.edu',
 u'pdheid@wm.edu',
 u'sdhinton@wm.edu',
 u'skhuber@vims.edu',
 u'opkers@wm.edu',
 u'mdlama@wm.edu',
 u'mleu@wm.edu',
 u'kjmiller@wm.edu',
 u'hamurphy@wm.edu',
 u'jrpuzey@wm.edu',
 u'spryan01@wm.edu',
 u'mssaha@wm.edu',
 u'slsand@wm.edu',
 u'bdsaun@wm.edu',
 u'dcshak@wm.edu',
 u'btsher@wm.edu',
 u'jpswad@wm.edu',
 u'tevanmeter@wm.edu',
 u'bdwatt@wm.edu',
 u'mjwawe@wm.edu',
 u'kewilliamson@wm.edu',
 u'pxzwol@wm.edu']
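
If that one-liner is hard to parse, here is the same transformation applied step by step to the one affected address:

x = u'v|skhuber@wm.edu'
print x.split('@')[0]                     # v|skhuber
print x.split('@')[0][2:]                 # skhuber  (drop the 'v|' prefix)
print x.split('@')[0][2:] + '@vims.edu'   # skhuber@vims.edu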

Now that we know it all works, step by step, we can add all the previous statements to our parser, which now looks as follows:

In [ ]:
def parse(self, response):
    pagesource = Selector(response)
    urlparts = (response.url).split('/')
    department = urlparts[4]

    names = pagesource.xpath('//article/p/a[@class="person_name"]/text()').extract()
    persondata = pagesource.xpath('//article/p')
    personinfo = [p.xpath('text()').extract() for p in persondata]
    emailinfo = [[x for x in p if '[[' in x or '@' in x] for p in personinfo]
    emailinfo = [x if len(x) > 0 else ['  no email'] for x in emailinfo]
    emails = [x[0][4:-2] + '@wm.edu' if '[[' in x[0] else x[0][2:] for x in emailinfo]            
    emails = [x if 'v|' not in x else x.split('@')[0][2:] + '@vims.edu' for x in emails]

    entries = []
    for name, email in zip(names, emails):
        entry = FacultyEmailsItem()
        fullname = unidecode(name)
        nameparts = fullname.split(', ')
        entry['last'] = nameparts[0]
        entry['first'] = nameparts[1]
        entry['email'] = email
        entry['department'] = department
        entries.append(entry)
    return entries

Save your spider file, and run scrapy from the command line. Looks good!

Finally, if you're confident that everything will work as expected, you can turn the debugging information off by specifying the option --nolog on the command line:

scrapy crawl facemail -t csv -o fac_emails.csv --nolog

That completes our spider!

As an exercise, try adding the directories for the Mathematics and Physics departments. You will see that Mathematics has a bunch of email addresses similar to the one in Biology that was entered without square brackets; our spider seamlessly handles those too. Physics has a bunch of addresses similar in style to the one in Biology that begins with v|, only they begin with physics|. Add a line to the parser that handles this correctly (just copy from the VIMS model).
