An introduction to webscraping, part I

A tutorial for those with little or no programming background
By Maurits van der Veen
Last modified October 2015

1. Very basic HTML

In order to successfully scrape the web, you need to understand at least a little bit of hypertext markup language (html). In this first part of the tutorial we cover the basics.

Tags

Html is a document format that makes it possible to specify, in a clean, text-based way (no special characters, etc.), how a document (usually: a webpage) ought to look. This is done with tags.

Html tags are distinguished from the text to be displayed on a page by < and >. For example, the outermost tag of every html document is <html>.

Most tags come in pairs, to indicate the scope of the formatting information provided in the tag. Thus, to make text bold, enclose it in the tags <b> and </b>.

Go to http://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic
to try this out.

Two other very common tags are:
<br> which inserts a line break, and
<p> which inserts a paragraph break.

These two tags are often not followed by a closing tag: <br> never takes one in html, and the closing </p> is optional, though purists sometimes insist on putting it in.
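
If you are curious how a program 'sees' tags, here is a minimal sketch using the html.parser module from Python's standard library. The fragment and the class name TagLister are made up for illustration; the sketch simply records each opening and closing tag in the order it appears.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening and closing tag in the order it appears."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


# Hypothetical fragment: <b> comes as a pair, <p> and <br> are left unclosed
fragment = "<p>Some <b>bold</b> text<br>on two lines"

lister = TagLister()
lister.feed(fragment)
lister.close()

for kind, tag in lister.events:
    print(kind, tag)
```

Notice that <b> shows up twice (once opening, once closing), while <p> and <br> only show up as opening tags.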

Some other tags to play around with (figure out what each does):
<h2>Some text</h2>
<hr> (no closing tag)
<ul><li>Some text 1</li><li>Some text 2</li></ul>

What happens when you replace ul by ol?

Html attributes

Whatever is enclosed by a pair of matching tags is generally called an element. The simplest element is just plain text, but elements can also be pictures, further (nested) html specifications, etc.

Elements can be assigned attributes. These are always specified in the starting tag, not the closing tag. They are always of the format
attribute_name="value"

Sometimes these attributes add specific parameters to a tag; sometimes they identify the tag as belonging to a particular class of similar tags.

The most familiar attribute is probably href, which specifies the destination of a link and is generally associated with the anchor tag <a>.

In the try-it editor, enter the following:
<a href="http://www.wm.edu">Click for W&M</a>
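
When scraping, attributes are often exactly what we want to extract. As a rough sketch (again using Python's standard html.parser; the class name LinkCollector is my own invention), this is one way to pull the href value out of the anchor tag above:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute_name, value) pairs
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])


collector = LinkCollector()
collector.feed('<a href="http://www.wm.edu">Click for W&M</a>')
collector.close()

print(collector.links)
```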

Now let's try displaying a picture. Images are displayed with the <img> tag (no closing tag needed). The source of the image is specified in the src attribute; its dimensions, optionally, in the width and height attributes:

<img src="https://www.wm.edu/research/ideation/_images/2010/kelso-incubator/sealthumb.jpg">

<img src="https://www.wm.edu/research/ideation/_images/2010/kelso-incubator/sealthumb.jpg" width="100" height="100">

<img src="https://www.wm.edu/research/ideation/_images/2010/kelso-incubator/sealthumb.jpg" width="400" height="400">

Note: if you specify just one dimension, the other will adjust to keep the width-height ratio constant. You can also specify both dimensions and squash or stretch the image.

<img src="https://www.wm.edu/research/ideation/_images/2010/kelso-incubator/sealthumb.jpg" width="100">

<img src="https://www.wm.edu/research/ideation/_images/2010/kelso-incubator/sealthumb.jpg" width="100" height="300">
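
The arithmetic behind this scaling can be sketched in a few lines of Python. The function name displayed_size and the 'natural' image size of 200 by 150 pixels are assumptions for illustration only (they are not the actual dimensions of the seal image).

```python
def displayed_size(natural_width, natural_height, width=None, height=None):
    """Return the (width, height) at which an image would be displayed."""
    if width is None and height is None:
        # No dimensions given: the image is shown at its natural size
        return natural_width, natural_height
    if height is None:
        # Only width given: height follows, keeping the ratio constant
        return width, round(width * natural_height / natural_width)
    if width is None:
        # Only height given: width follows, keeping the ratio constant
        return round(height * natural_width / natural_height), height
    # Both given: the image is squashed or stretched to fit
    return width, height


# Assumed natural size of a hypothetical image
print(displayed_size(200, 150, width=100))              # (100, 75)
print(displayed_size(200, 150, width=100, height=300))  # (100, 300)
```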

2. Inspecting html source

If you wish to learn more about html, there are many good tutorials on the web. For our purposes, we now know enough about the basics to move on to a crucial step in web scraping: inspecting a web page's html source.

Viewing tags

Normally, html tags are invisible to the user. You can see them by opening or saving a page's source (an option in the File menu of most web browsers).

However, sometimes it is difficult to match a page's source to the displayed text of the page you have in front of you. This is where Firebug comes in. Firebug allows you to inspect a page's source as you're looking at it, highlighting the section of the page corresponding to each tag.

Let's try this by looking at a fairly straightforward page:

W&M Government department faculty directory

In the Firefox browser, go to: http://www.wm.edu/as/government/faculty/directory/index.php

Turn on Firebug, and click all the way down to the entry for the first faculty member listed, Brian Blouet. Notice how many nested levels we need to go through! This is very common, especially on large institutional websites.

<document>
 <html class=" js flexbox canvas ...
  <body>
   <div id="page-wrapper-outer">
    <div id="page-wrapper-inner">
     <div id="main_bkg_container">
      <div id="main_content" class="clearfix"> 
       <section id="main" style="height: auto;">
        <article class="item_listing directory_listing">
         <a title="Brian Blouet" href="blouet_b.php">
         <p>
          <a class="person_name" href="blouet_b.php">Blouet, Brian</a>
 

One thing you will notice when inspecting webpages is a lot of tags and attributes that do not appear to do anything in and of themselves. Most common are <div>, id, and class, all of which we encounter here.

The <div> tag defines a division or a section in an HTML document. It is used to group block-level elements so they can be formatted with CSS (cascading style sheets, which is how websites maintain a consistent 'look-and-feel' across the entire site).

id and class are used as hooks by CSS files, so that the same type of information can look the same throughout a page and on every other page. They are not quite the same (although for our purposes it makes little difference):

  • Each element can have only one ID
  • Each page can have only one element with that ID
  • You can use the same class on multiple elements
  • You can use multiple classes on the same element
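
To make these rules concrete, here is a small sketch that collects the id and class values from a simplified piece of the listing above. It uses Python's standard html.parser; the class name HookCollector is made up. Note how a class attribute holding several classes is just a space-separated list.

```python
from html.parser import HTMLParser


class HookCollector(HTMLParser):
    """Collect every id and every class name used in a fragment."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.classes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.append(attributes["id"])
        # Multiple classes on one element are separated by spaces
        self.classes.extend((attributes.get("class") or "").split())


# Simplified from the directory listing above
fragment = """
<div id="main_content" class="clearfix">
 <article class="item_listing directory_listing">
  <p><a class="person_name" href="blouet_b.php">Blouet, Brian</a></p>
 </article>
</div>
"""

hooks = HookCollector()
hooks.feed(fragment)
hooks.close()

print("ids:", hooks.ids)
print("classes:", hooks.classes)
# Each id should appear only once on a page
print("ids unique:", len(hooks.ids) == len(set(hooks.ids)))
```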

Comparing different department pages

Now let's go to another William & Mary department and look at its directory. Let's pick Classics:
http://www.wm.edu/as/classicalstudies/index.php

Before going to the directory, notice how Molly Swetnam-Burland's picture looks kind of squashed. Looking at the source, click down to see if this is because the html source specifies both height and width of the image.

Now go to the faculty directory:
http://www.wm.edu/as/classicalstudies/faculty/index.php

Click down to the first entry, and compare the path of nested levels to the one we found for Government. Are they the same?

Now look at how the URL is constructed. Is that the same?

Pick one more department to visit (see the list at http://www.wm.edu/as/program-list/index.php), and compare the URL as well as the nested html structure to get to the first faculty member listed.

I tried Biology; the directory is at:
http://www.wm.edu/as/biology/people/faculty/index.php
which is, yet again, a different URL structure.

Try a different one.

OK, so it appears that the URL for the faculty directory is not consistent across departments, but the actual html layout of the directory is.
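
One way to see the URL differences at a glance is to split each URL into its path segments. This sketch uses urllib.parse from Python's standard library on the three directory URLs we visited; the variable names are my own.

```python
from urllib.parse import urlparse

directory_urls = {
    "Government": "http://www.wm.edu/as/government/faculty/directory/index.php",
    "Classical Studies": "http://www.wm.edu/as/classicalstudies/faculty/index.php",
    "Biology": "http://www.wm.edu/as/biology/people/faculty/index.php",
}

middle_segments = {}
for department, url in directory_urls.items():
    segments = urlparse(url).path.strip("/").split("/")
    # Drop the shared "as" prefix, the department name, and the file name,
    # leaving only the part that differs between departments
    middle_segments[department] = segments[2:-1]
    print(department, middle_segments[department])
```

Each department ends up with a different list, which is exactly the inconsistency we noticed by eye.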

3. Selecting html elements: xpath

The list of nested levels we need to traverse to get where we want to go in an html source is called a path. When web scraping, we need to specify the path to the thing we wish to scrape.
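
To illustrate the idea of a path, the sketch below walks through a simplified, tidied-up version of the Government directory source and keeps track of which tags are currently open. When it reaches the tag with class "person_name", it records the list of open tags. It uses Python's standard html.parser; the class name PathFinder and the fragment are assumptions for illustration, and the approach only works on source where every non-void tag is properly closed.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # tags that never get a closing tag


class PathFinder(HTMLParser):
    """Track the stack of open tags; record the path to a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.paths.append("/" + "/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


# Simplified, well-formed version of the listing in section 2
fragment = """
<html><body>
<div id="page-wrapper-outer"><div id="page-wrapper-inner">
<div id="main_bkg_container"><div id="main_content" class="clearfix">
<section id="main"><article class="item_listing directory_listing">
<p><a class="person_name" href="blouet_b.php">Blouet, Brian</a></p>
</article></section>
</div></div></div></div>
</body></html>
"""

finder = PathFinder("person_name")
finder.feed(fragment)
finder.close()
print(finder.paths)
```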

If we had to describe the full path each time, that would be incredibly cumbersome. Fortunately, there is a 'language' to specify paths more directly, called xpath. The Firepath extension you installed within Firebug makes it possible to try out path specifications on a web site to see how they work.

[Strictly speaking, xpath is a specification for paths in xml documents, which is a broader and more general category than html. But it works just fine on html]

Basic xpath

Paths are specified using a forward slash (/) to indicate nested levels, and stripping the < and > from a tag. The <document> entry at the top of the listing above stands for the document as a whole rather than an actual tag, so it does not need to be specified.

Construct the path to Brian Blouet's name from the detailed listing above, and enter it in the Firepath text window:

/html/body/div/div/div/div/section/article/p/a

If you did it right, you will see the line with his name highlighted in the source, and a dotted outline around his name in the directory.
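
The same kind of path can be tried out in Python. The sketch below uses xml.etree.ElementTree from the standard library, which understands only a limited subset of xpath and requires well-formed source, so it runs on a made-up, simplified directory (the second person, the email addresses, and the website are hypothetical). Because ElementTree searches relative to the top-level html element, the leading /html is dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified directory; only the first name is from the real page
DIRECTORY = """
<html><body><div><div><div><div><section>
<article><p>
 <a class="person_name" href="blouet_b.php">Blouet, Brian</a>
 <a href="mailto:blouet@example.edu">Email</a>
</p></article>
<article><p>
 <a class="person_name" href="doe_j.php">Doe, Jane</a>
 <a href="mailto:doe@example.edu">Email</a>
 <a href="http://example.edu/~doe">Website</a>
</p></article>
</section></div></div></div></div></body></html>
"""

root = ET.fromstring(DIRECTORY.strip())  # root is the <html> element

# Full path, one step per nested level, starting below <html>
full_matches = root.findall("body/div/div/div/div/section/article/p/a")

for anchor in full_matches:
    print(anchor.text, anchor.get("href"))
```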

But wait, there's more!

It turns out the specification we provided is not unique. Not only does that path lead us to Brian's name; it also leads us to his email address. Moreover, it does the same for every other person in the directory. For some, it even selects their personal website as well.

Scroll through the source as well as the webpage to verify that this is true. You can easily see how this might be useful for scraping multiple similar pieces of information from a webpage.

Now the full xpath specification is still pretty cumbersome, especially with all those /div specifications right after one another.

In xpath, you specify that you don't care about the specifics of any intervening levels with a double slash: //

Try entering the following in the Firepath window:
//article/p/a
That's certainly a lot simpler!!
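
Here is a rough Python equivalent, assuming a made-up, simplified directory (the second person and all email and website addresses are hypothetical). Python's standard xml.etree.ElementTree supports the double slash, written as .// because it searches relative to the top-level element. The sketch checks that the short path finds the same elements as the long one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified directory
DIRECTORY = """
<html><body><div><div><div><div><section>
<article><p>
 <a class="person_name" href="blouet_b.php">Blouet, Brian</a>
 <a href="mailto:blouet@example.edu">Email</a>
</p></article>
<article><p>
 <a class="person_name" href="doe_j.php">Doe, Jane</a>
 <a href="mailto:doe@example.edu">Email</a>
 <a href="http://example.edu/~doe">Website</a>
</p></article>
</section></div></div></div></div></body></html>
"""

root = ET.fromstring(DIRECTORY.strip())

long_path = root.findall("body/div/div/div/div/section/article/p/a")
short_path = root.findall(".//article/p/a")  # skip the intervening levels

print(len(long_path), len(short_path))
print(long_path == short_path)
```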

Finally, if we want just each faculty member's name, we can add the number 1 in square brackets:
//article/p/a[1]

This works because we know that the name is the first thing our specification catches for each person. Similarly, [2] gets the email address. [last()] gets the last of the items we match for each person, which for some is the email address and for others the website.

Notice that so far we've been selecting the entire html element specified by the xpath. If we want to select just the actual name (not the element containing the name), we add /text() to the xpath specification:
//article/p/a[1]/text()
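
A sketch of the same selections in Python, again on a made-up, simplified directory (the second person and all email and website addresses are hypothetical). The standard xml.etree.ElementTree module understands [1], [2], and [last()], but not /text(); instead, each matched element has a .text property holding its text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified directory
DIRECTORY = """
<html><body><section>
<article><p>
 <a class="person_name" href="blouet_b.php">Blouet, Brian</a>
 <a href="mailto:blouet@example.edu">Email</a>
</p></article>
<article><p>
 <a class="person_name" href="doe_j.php">Doe, Jane</a>
 <a href="mailto:doe@example.edu">Email</a>
 <a href="http://example.edu/~doe">Website</a>
</p></article>
</section></body></html>
"""

root = ET.fromstring(DIRECTORY.strip())

# First <a> for each person: the name (use .text instead of /text())
names = [a.text for a in root.findall(".//article/p/a[1]")]

# Second <a> for each person: the email link
emails = [a.get("href") for a in root.findall(".//article/p/a[2]")]

# Last <a> for each person: email for some, website for others
last_links = [a.get("href") for a in root.findall(".//article/p/a[last()]")]

print(names)
print(emails)
print(last_links)
```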

Xpath with attributes

Sometimes we do not know whether the item we are interested in is the first, second, etc. of the items our xpath specification might match. Here looking at attribute names and values helps out.

Attributes are also specified inside square brackets, preceded by an at sign (@). If we look at all the matches we get with //article/p/a
we see that the names all appear inside an <a> tag whose class is "person_name". So try adding [@class="person_name"] to that specification:
//article/p/a[@class="person_name"]
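
The attribute test works the same way in Python's standard xml.etree.ElementTree. The sketch below runs it on a made-up, simplified directory in which the name is deliberately NOT the first link for the second person, so a position like [1] would fail but the class test still works. All entries other than Brian Blouet's name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified directory; note the different ordering
# of the links for the second person
DIRECTORY = """
<html><body><section>
<article><p>
 <a class="person_name" href="blouet_b.php">Blouet, Brian</a>
 <a href="mailto:blouet@example.edu">Email</a>
</p></article>
<article><p>
 <a href="mailto:doe@example.edu">Email</a>
 <a class="person_name" href="doe_j.php">Doe, Jane</a>
</p></article>
</section></body></html>
"""

root = ET.fromstring(DIRECTORY.strip())

by_position = [a.text for a in root.findall(".//article/p/a[1]")]
by_class = [a.text for a in root.findall(".//article/p/a[@class='person_name']")]

print(by_position)  # picks up an email link for the second person
print(by_class)     # only the names
```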

Now imagine we would like to specify, instead, that we're interested in the email address. How might that work? There is no class inside that anchor tag, so that doesn't help us. There is, however, an href. Let's see if that works:
//article/p/a[@href]

This casts the net too widely; websites and even the person's name have an href attribute. What we can do, instead, is to specify that the href value must contain the string "mailto":
//article/p/a[contains(@href,"mailto")]
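
A caveat if you try this in Python: the standard xml.etree.ElementTree module supports [@href] but not contains(). The sketch below therefore selects every link with an href and then does the "mailto" check in plain Python. The directory is made up and simplified (all addresses are hypothetical). Third-party libraries such as lxml implement full xpath, but this sketch sticks to the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified directory
DIRECTORY = """
<html><body><section>
<article><p>
 <a class="person_name" href="blouet_b.php">Blouet, Brian</a>
 <a href="mailto:blouet@example.edu">Email</a>
</p></article>
<article><p>
 <a class="person_name" href="doe_j.php">Doe, Jane</a>
 <a href="mailto:doe@example.edu">Email</a>
 <a href="http://example.edu/~doe">Website</a>
</p></article>
</section></body></html>
"""

root = ET.fromstring(DIRECTORY.strip())

# [@href] casts the net too widely: names, emails, and websites all match
with_href = root.findall(".//article/p/a[@href]")

# Stand-in for contains(@href,"mailto"): filter the matches in Python
email_links = [a.get("href") for a in with_href if "mailto" in a.get("href")]

print(len(with_href))
print(email_links)
```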

There is a lot more to xpath, but if you've grasped all this so far, you are well ahead of most people who try web scraping.
