An introduction to webscraping - Part 0

A tutorial for those with little or no programming background
By Maurits van der Veen
Last modified October 2015

Preparation

Before you can try your hand at scraping the web, you will need to install a number of programs, packages, and extensions on your computer. When we begin the actual tutorial, I am going to assume that you have all of these successfully installed already.

Text editor

We will be writing several program files (in the programming language Python). Writing computer code is easiest in an editor specifically designed for that purpose, but a general-purpose text editor will do fine for our purposes.

A very good text editor that works across multiple computing platforms, and is specifically designed for writing computer code, is Brackets, a side project of Adobe. Get it at http://brackets.io. It comes in versions for Mac OSX, Windows, Ubuntu, and Debian Linux.

For the Mac, another option is TextEdit, a program that comes standard on your machine and handles plain text well. However, it tends to assume that you want to write your text in rich text format (rtf). You do not! When you are editing an existing file in rtf, press command-shift-T to convert it to plain text.

You may also want to change TextEdit's defaults in the Preferences window (from the TextEdit menu in the menu bar). Here, make sure the 'Plain text' option is selected (rather than 'Rich text') for Format, at the top of the window.

Try to familiarize yourself with the process of opening and closing files, indenting lines of code, and other basic text editing tasks in the text editor of your choice.

Command line interface

Another thing you will need (and need to be familiar with) is a command line interface, which will allow you to type specific commands directly to the computer.

The most common application for this purpose on the Mac is Terminal, which, again, comes standard with every Mac. On Windows, the equivalent program is the Command Prompt (cmd.exe).

We will use the command line primarily to run scrapy, which is written in Python but is an executable program that is run from outside Python. You may also need to move around your computer's directory structure. The most useful command here is cd (for 'change directory').

Use cd by itself to return to the top-level directory associated with your user account.

To go one level up (to the parent folder), use cd ..

To go down into a specific folder, specify its name. Finally, you can combine these basic commands to go up or down several levels in the hierarchy; for example:

cd ../../Desktop/Scraping/
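As a concrete illustration, the following session creates a small practice folder tree (the folder names are made up for this example) and navigates around it. The pwd command ('print working directory') shows where you currently are:

```shell
# Create a small practice folder tree (hypothetical names)
mkdir -p /tmp/Scraping/Tutorial

cd /tmp/Scraping/Tutorial   # go to a specific folder by giving its path
pwd                         # shows the folder you are now in

cd ..                       # up one level, to the parent folder
pwd

cd Tutorial                 # down one level, by naming the folder
pwd
```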

Note: When you run ipython notebook (the program used to write this page), it will open a Terminal window for itself. This window will be occupied with running the notebook. If you need to enter some commands on the command line, open a different window to do so.

Firefox, Firebug, Firepath

The web browser I will use in the tutorial is Firefox (Mozilla). If you do not already have this browser, download the appropriate version for your computer from https://www.mozilla.org/en-US/firefox/new/. Once you have installed it, make sure it works correctly (just try visiting a couple of websites).

While Firefox is a nice browser, the real reason for using it here is two plug-ins that have been developed for it: Firebug and Firepath. Firebug can be downloaded from http://getfirebug.com. It allows you to inspect the html code of web pages, which is essential for any web scraping endeavour. Once you have Firebug installed, the next time you start Firefox you will see a little Firebug icon near the top right of your browser window. Click on it, and your window will be split, with the bottom half showing the source code (html) of the page you are looking at (for most pages, you will initially just see a little triangle, followed by the opening html tag).

Firepath, finally, is a plug-in for Firebug that allows you to specify an xpath expression to select just a piece of the page. You can find it at https://addons.mozilla.org/en-US/firefox/addon/firepath/. Once you have it installed, the Firebug window (the bottom part of your browser window) will have a text entry field, labeled XPath:, immediately below the menu bar.
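To give a flavour of what an xpath expression does, here is a tiny Python sketch using only the standard library (which supports a limited subset of XPath, e.g. './/' rather than '//'; Firepath supports the full syntax). The html snippet is made up for this example:

```python
# Demonstrate XPath-style selection on a tiny, made-up snippet of html.
import xml.etree.ElementTree as ET

snippet = ("<html><body><h1>Headline</h1>"
           "<p>First paragraph</p><p>Second paragraph</p></body></html>")
page = ET.fromstring(snippet)

print(page.find(".//h1").text)    # the first (here, only) h1 element
for p in page.findall(".//p"):    # every p element, in document order
    print(p.text)
```

Running this prints the headline followed by each paragraph's text, which is exactly the kind of selection you will later write as an xpath in Firepath (and in scrapy).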

Python

The web scraping tools we will use are written in Python, one of the more popular programming languages. This means you will need to have Python installed to use the tools. The most straightforward method of installing Python is to use the Anaconda distribution (which bundles Python together with many of the most popular tools written in Python), available at https://www.continuum.io/downloads. Get the version for Python 2.7 (not Python 3.x).

Important: Anaconda needs to be installed in the directory where you want to use it. In other words: do not move the Anaconda folder around once you've installed it.

Scrapy, Unidecode

Anaconda comes with many packages pre-installed, but we need to install two additional ones: unidecode, which converts accented letters to the corresponding unaccented ones (as well as gracefully handling other unusual characters it encounters), and scrapy, which is the web scraping package we will use.
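To see the kind of conversion unidecode performs, here is a rough sketch using only the standard library's unicodedata module (unidecode itself handles far more characters than this, but we have not installed it yet):

```python
# A rough stdlib approximation of what unidecode does: decompose accented
# characters into base letter + accent mark, then drop the accent marks.
import unicodedata

def strip_accents(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents(u"café naïve São Paulo"))   # cafe naive Sao Paulo
```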

To do so, open a program that gives you access to a command-line interface (on the Mac, use Terminal), and navigate to the folder containing Anaconda. (In general, Anaconda installs at the top level of your home directory, so you should just be able to type cd anaconda.)

Once there, simply enter the following two commands (press return after each):

```
conda install unidecode
conda install scrapy
```

You will likely see a lot of text scroll by informing you of the various installation actions taken. Some may (appear to) be error messages, but as long as somewhere near the end there is a line indicating successful installation, you will be fine.

IPython

Python can be run from the command line, but it is often preferable to run it interactively while taking notes about what you are doing. IPython, now wrapped into Jupyter, allows us to do just that. You can read a bit about IPython at http://ipython.org. It comes pre-installed with Anaconda, so you do not need to install it separately.

To run IPython, locate the Launcher in the Anaconda folder (Mac) or under Anaconda in the Start menu (Windows). This may directly open a browser window (with Jupyter or IP[y] along the top), or you may need to click on the Launch button next to ipython-notebook first.

You will see an empty text entry box just below the window's menu bar. This is a 'cell' where you can enter Python code. Test it by typing print "hello" and pressing the Run (Play) button on the menu bar, or pressing Shift-Enter.

```
In [6]: print "hello"
hello
```

Finally, test whether you have successfully installed unidecode and scrapy by trying to load them into memory, as follows:

```
In [7]: import scrapy
   ...: import unidecode
   ...: print "success!"
success!
```

If running these three lines produces only the word "success!", then the two packages are successfully installed. If you get an error message about either package, try installing it again.
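If you prefer a check that does not stop at the first error, a small helper like the following (a hypothetical convenience, not part of the tutorial's own code) reports on each package in turn:

```python
# Check whether packages can be imported, without stopping at the first error.
import importlib

def check_install(name):
    """Return True if the package named 'name' can be imported, else False."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

for name in ("scrapy", "unidecode"):
    if check_install(name):
        print(name + " is installed")
    else:
        print(name + " is NOT installed -- try 'conda install " + name + "'")
```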

Once you have made it this far, you are ready for the tutorial.