
Drafts for the Open Data Cook Book - see http://www.opendatacookbook.net

Recipes: Scraper Wiki


ScraperWiki is a service that helps you to gather data from websites that do not provide it
as raw data. ScraperWiki provides a programming environment where you can write and
share a scraper from your browser. ScraperWiki will run your scraper for you once a day,
and will make the results available to download, and through an Application Programming
Interface (API) for other web programs to use as well.

You will need:


• An account at www.scraperwiki.com (free)
• Some programming experience
• A website with structured information on it that you want to scrape

1) Explore the structure of the website you are planning to scrape
In this example Iʼm looking at the location of Garages to Rent in Oxford City. First, viewing
the page, I check that the elements I want to scrape are presented fairly uniformly (e.g. the
same title is always used for the same thing), as lots of variation in the way similar things
are presented makes for difficult scraping.

Secondly, I take a look at the source code of the web page to explore whether each ʻfieldʼ
I want to scrape (e.g. Postcode; Picture etc.) is contained neatly in its own HTML element.
In this case, whilst each listing is in a <div> HTML element, a lot of the rest of the text is
only separated by line-breaks.

Iʼve used the Firebug plugin for the Firefox web browser to look at the structure of the page,
as it allows me to explore in more detail than the standard ʻView Sourceʼ feature in
most browsers.

2) Create a new Scraper on Scraper Wiki


Iʼm going to be creating a PHP scraper, as this is the programming language Iʼm most
comfortable with, but you can also create scrapers using the Python and Ruby languages.

The PHP Startup Scraper will load with some basic code for fetching a web page and
starting to parse it already set. It makes use of the simple_html_dom library, which allows
you to access elements of web pages using simple selectors.

Change the default URL so ScraperWiki is fetching the page you are interested in. Then
also change the line ʻforeach($dom->find('td') as $data)ʼ, using a selector
identified in your earlier exploration, to see if you can pick out the elements you want
to scrape.

For example, each of the listings of Garages for Rent in Oxford is contained within a
div with the class ʻpagewidgetʼ, so I can use the selector
$dom->find('div.pagewidget') to locate them. (This sort of selector will be
familiar to anyone used to working with CSS - Cascading Style Sheets.)
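On ScraperWiki itself the bundled simple_html_dom library handles these selectors. As a rough local sketch of the same idea, PHPʼs built-in DOM extension can pick out the matching divs with an equivalent XPath query. The HTML snippet below is invented for illustration; it is not the real Oxford City page.

```php
<?php
// Local sketch of selecting div.pagewidget elements, using PHP's
// built-in DOM extension instead of ScraperWiki's simple_html_dom.
// The HTML here is an invented stand-in for the real page.
$html = '<html><body>
  <div class="pagewidget">Site Location: Barns Road OX4 3RF</div>
  <div class="sidebar">Unrelated navigation</div>
  <div class="pagewidget">Site Location: Marston Road OX3 0DU</div>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath equivalent of the CSS selector div.pagewidget
$listings = $xpath->query('//div[@class="pagewidget"]');

foreach ($listings as $listing) {
    echo trim($listing->textContent), "\n";
}
```

On ScraperWiki the one-liner $dom->find('div.pagewidget') replaces the DOMDocument and XPath setup above.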

3) Check what Scraper Wiki returns and start refining your scraper
If you click ʻRunʼ below your scraper you should now see a range of elements returned
in the console. The default PHP template loops through all the elements that match the
selector we just set and prints them out to the console.

My scraper returns quite a few elements I donʼt want (there must be more than just the
Garage listings picked out by the div.pagewidget selector), so I look for something
uniform about the elements I do want. In this case they all start with ʻSite Locationʼ (or at
least the plain-text versions of them, as returned by $data->plaintext, do).

I can now add some conditional code to my scraper to only carry on processing those
elements that contain ʻSite Locationʼ. Iʼve chosen to use the ʻstristrʼ function in PHP, which
just checks whether one string is contained in another and is case-insensitive, rather than
checking the exact position of the phrase, to be tolerant in case there is variation in the
way the data is presented that Iʼve not spotted.
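The filtering step can be sketched as below. The strings stand in for the $data->plaintext values the scraper loop would actually see.

```php
<?php
// Keep only scraped blocks that mention 'Site Location', wherever and
// however capitalised it appears. The sample strings are invented
// stand-ins for $data->plaintext values from the scraper loop.
$blocks = array(
    'Site Location: Barns Road OX4 3RF',
    'Pay your rent online',
    'SITE LOCATION: Marston Road OX3 0DU',
);

$listings = array();
foreach ($blocks as $text) {
    // stristr() returns false when the needle is absent, and matches
    // case-insensitively, so minor variations in the page survive.
    if (stristr($text, 'Site Location') !== false) {
        $listings[] = $text;
    }
}

print_r($listings);
```

Because stristr() ignores case, a listing typed as ʻSITE LOCATIONʼ is still kept.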

4) Loop, slice and dice

The next steps will depend on how your data is formatted. You may have lots more
nested selectors to work through to pick out the elements you want. You can use
$data just like the $dom object earlier. So, for example, we can use
$data->find("img",0)->src; to return the ʻsrcʼ attribute of the first (0) image
element (img) we find in each garage listing.

Sometimes, you get down to text which isnʼt nicely formatted in HTML, and then you will
need to use different string processing to pull apart the bits you want. For example, in
the Garage listings we can separate each line of plain text by splitting the text by <br>
elements, and then splitting each line at the colon ʻ:ʼ used to separate titles and values.

A check of the raw source shows the Oxford Garages page uses both <BR> and <br />
as elements so we can use a replace function to standardise these (or we could use
regular expressions for splitting).
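The splitting just described can be sketched as follows, with an invented sample string standing in for the inner HTML of one garage listing.

```php
<?php
// Sketch of the line-splitting step: standardise the mixed <BR> and
// <br /> spellings, split into lines, then split each line at the
// first colon into a title and a value. The sample string is an
// invented stand-in for one garage listing's inner HTML.
$raw = 'Site Location: Barns Road<BR>Postcode: OX4 3RF<br />Number of garages: 12';

// str_ireplace() matches case-insensitively, so <BR> is caught too.
$normalised = str_ireplace(array('<br />', '<br/>', '<br>'), '<br>', $raw);
$lines = explode('<br>', $normalised);

$values = array();
foreach ($lines as $line) {
    // Split at the first colon only, in case a value contains one.
    $parts = explode(':', $line, 2);
    if (count($parts) === 2) {
        $values[trim($parts[0])] = trim($parts[1]);
    }
}

print_r($values);
```

The result is a tidy title-to-value array, ready to pass to the save step later on.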

In the Oxford Garages case as well, our data is split across multiple pages, so once we
have the scraper for a single page working right, we can nest it inside a scraper that
grabs the list of pages and loops through those too. Scraper Wiki also includes useful
helper code for working with forms, for sites where you have to submit searches or
make selections in forms to view any data.
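Nesting the single-page scraper inside a loop over pages might look like the sketch below. The URL pattern and page count are entirely hypothetical; check the real siteʼs pagination links for the actual parameter name and range.

```php
<?php
// Sketch of looping a working single-page scraper over several pages.
// The base URL and page count are hypothetical placeholders.
$base = 'http://example.org/garages-to-rent?page=';

$urls = array();
foreach (range(1, 3) as $page) {
    $url = $base . $page;
    $urls[] = $url;
    // On ScraperWiki the next lines would fetch and parse each page:
    // $html = scraperwiki::scrape($url);
    // ...then run the single-page scraping code on $html...
}

print_r($urls);
```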

5) Save each section of scraped data for use later


Towards the end of each loop through the elements you are scraping (each row in your
final dataset) you will need to call the scraperwiki::save() function. This takes
four parameters:

First, an array indicating the name of the unique key in your data that should be used
to work out whether a record is new or an update to an existing record.

Second, an array of data values to save.

Third, the date of the record (for indexing). Leave as null to just use the date the scraper
was run.

Fourth, an array of latitude and longitude if you have geocoded your data.

Run your scraper and check the ʻdataʼ tab to see what is being saved.
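A save call with the four parameters labelled might look like the sketch below. The field names echo the garage example; since the scraperwiki class only exists on ScraperWiki itself, a tiny stub that just records its arguments stands in for it here.

```php
<?php
// Sketch of one scraperwiki::save() call with its four parameters
// labelled. Outside ScraperWiki the class does not exist, so a stub
// that records what it was given stands in for illustration.
if (!class_exists('scraperwiki')) {
    class scraperwiki {
        public static $saved = array();
        public static function save($unique, $values, $date = null, $latlng = null) {
            self::$saved[] = array($unique, $values, $date, $latlng);
        }
    }
}

$values = array(
    'Site location'     => 'Barns Road',
    'Postcode'          => 'OX4 3RF',
    'Number of garages' => 12,
);

scraperwiki::save(
    array('Site location'), // 1) unique key field(s), used for updates
    $values,                // 2) the data values for this row
    null,                   // 3) record date; null = date of the run
    null                    // 4) array of latitude/longitude, once geocoded
);
```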

6) (Optional) Sprinkle in some geocoding as required


If you have a UK postcode in your data then you can use the
scraperwiki::gb_postcode_to_latlng() function to turn it into a latitude
and longitude, and then save them into your generated dataset.

For example, we can use
$lat_lng = scraperwiki::gb_postcode_to_latlng($values['Postcode']);
and then, when we save our data, we add the $lat_lng values to the end of the save
function:

scraperwiki::save(array('Site location'), $values, null, $lat_lng);

7) Run your scraper and explore the results

You can now run your scraper. You will be able to access the results as a CSV file,
through the ScraperWiki API, and to load them into a Google Spreadsheet.

You can also create ʻViewsʼ onto your data, using pre-prepared templates to create
maps and other useful visualisations of your data, direct from within Scraperwiki.

Scraperwiki will run your scraper every 24 hours, meaning that as long as it keeps
working, you can rely on it as an up-to-date data source.

Below is the map I produced, showing Garages to Rent around Oxford, with the number of
garages, photos, and links off to the pages with details about them.

One of the best things about Scraper Wiki overall, though, is that it is wiki-like. You can
take a look at my Oxford Garages code at
http://scraperwiki.com/scrapers/oxford-garages-to-rent/ and you can edit and improve it
(and there are lots of potential improvements to be made).

You can also suggest scrapers you would like other people to create, or respond to
requests for scrapers from others.
