Extra Lab Exercise

Directed Graphs and Web Crawlers

Objectives

  • To implement a web crawler
  • To see how directed graphs might be used with real-world data

Admin

This lab is not marked and there is no submission for it.

Background

We can view the World Wide Web as a massive directed graph, where pages (identified by URLs) are the vertices and hyperlinks are the directed edges. Unlike the graphs we have studied in lectures, there is not a single central representation (e.g. adjacency matrix) for all the edges in the graph of the web; such a data structure would clearly be way too large to store. Instead, the "edges" are embedded in the "vertices". Despite the unusual representation, the Web is clearly a graph, so the aim of this lab exercise is to build an in-memory graph structure for a very, very small subset of the Web.

Web crawlers are programs that navigate the Web automatically, moving from page to page, processing each page they visit. Crawlers typically use a standard graph traversal algorithm to:

  • maintain a list of pages to visit (a ToDo list)
  • "visit" the next page by grabbing its HTML content
  • scan the HTML to extract whatever features they are interested in
  • collect hyperlinks from the visited page, and add these to the ToDo list
  • repeat the above steps (until there are no more pages to visit :-)

You need to implement such a crawler, using a collection of supplied ADTs and a partially complete main program. Your crawler processes each page by finding any hyperlinks and inserting the implied edges into a directed Graph ADT based on an adjacency matrix. One difference between this Graph ADT and the ones we have looked at in lectures is that you dynamically add information about vertices, as well as edges. The following diagram shows what an instance of the Graph ADT might look like:

The Graph data structure allows for maxV vertices (URLs), where maxV is supplied when the graph is created. Initially, there are no vertices or edges, but as the crawler examines the Web, it adds the URLs of any pages that it visits and records the hyperlinks between them. This diagram shows what a web crawler might have discovered had it started crawling from the URL http://x.com/index.html, and so far examined four web pages.

Suppose we number the four pages 0..3, with

  • page (vertex) 0 being http://x.com/index.html
  • page (vertex) 1 being http://x.com/category/index.html
  • page (vertex) 2 being http://x.com/products.html
  • page (vertex) 3 being http://x.com/products/abc.html

The vertices array holds the actual URL strings and also, effectively, provides the mapping between URLs and vertex numbers. The edges array is a standard adjacency matrix. The top row tells us that page 0 has hyperlinks to pages 1 and 2. The second row tells us that page 1 has a hyperlink back to page 0. The third row similarly shows a hyperlink from page 2 back to page 0, but also a hyperlink to page 3.
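
To make this concrete, the sketch below shows one way such a representation might be declared in C. It is not the supplied Graph.c: the type and field names here (maxV, nV, vertices, edges) are assumptions chosen to match the description above, so check Graph.h and Graph.c for the real ones.

// Illustrative sketch only, NOT the supplied Graph.c.
// Field names are assumptions that match the description above.
typedef struct GraphRep {
    int    maxV;      // maximum number of vertices (URLs) the graph can hold
    int    nV;        // number of vertices added so far
    char **vertices;  // vertices[i] holds the URL string for vertex i
    int  **edges;     // edges[i][j] == 1 if page i has a hyperlink to page j
} GraphRep;

Under this picture, adding a URL appends it to vertices (giving it the next free vertex number), and recording a hyperlink sets a single cell of the adjacency matrix.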

Setting Up

Set up a directory for this lab, change into that directory, and run the following command:

unzip /web/cs2521/24T1/labs/week15/downloads/files.zip

If you're working at home, download files.zip by clicking on the above link and then run the unzip command on the downloaded file.

If you've done the above correctly, you should now have the following files:

Makefile
a set of dependencies used to control compilation
crawl.c
a main program that implements a web crawler (incomplete)
url_file.h, url_file.c
provides a file-like interface to webpages
html.h, html.c
provides functions for extracting URLs from HTML
Graph.h, Graph.c
a Graph ADT interface and implementation
Queue.h, Queue.c
a Queue ADT interface and implementation
Stack.h, Stack.c
a Stack ADT interface and implementation
Set.h, Set.c
a Set ADT interface and implementation

The only file you need to modify is crawl.c, but you need to understand at least the interfaces to the functions in the various ADTs. These are described in comments in the .h files. You can also see examples of using the ADT functions in the t?.c files. Note that there's no test file for the HTML parsing and URL-extracting code, because the supplied version of crawl.c effectively provides this.
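
If you haven't used these ADTs before, they all follow the usual create / operate / free pattern. As a purely illustrative sketch (the function names below are assumptions; the real names and signatures are whatever Queue.h declares), using the Queue ADT to hold URL strings might look something like this:

#include <stdio.h>
#include "Queue.h"

// Assumed names for illustration only (QueueNew, QueueJoin, QueueLeave,
// QueueIsEmpty, QueueFree); check Queue.h for the real interface.
int main(void) {
    Queue q = QueueNew();
    QueueJoin(q, "http://x.com/index.html");      // add to the back
    QueueJoin(q, "http://x.com/products.html");
    while (!QueueIsEmpty(q)) {
        char *url = QueueLeave(q);                // remove from the front
        printf("next URL to visit: %s\n", url);
    }
    QueueFree(q);
    return 0;
}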

Note that the HTML parsing code was borrowed from Dartmouth College. If you look at it, you'll see it has quite a different style to the rest of the supplied code, which makes for an interesting comparison.

The crawl program is used as follows:

./crawl StartingURL MaxURLsInGraph

The StartingURL tells the crawler which URL to start from, and should be of the form http://x.y.z/. The crawler uses this URL as the starting point, and uses a normalised version of it as the base against which to interpret other URLs.

The MaxURLsInGraph specifies the maximum number of URLs that can be stored in the Graph. Crawling stops once this many URLs have been scanned, or earlier if there are no more URLs left in the ToDo list.

If you compile and then run the supplied crawler on the UNSW handbook, you will see something like:

./crawl http://www.handbook.unsw.edu.au/2017/ 100
Found: 'http://www.unsw.edu.au'
Found: 'https://my.unsw.edu.au/'
Found: 'https://student.unsw.edu.au/'
Found: 'http://www.futurestudents.unsw.edu.au/'
Found: 'http://timetable.unsw.edu.au/'
Found: 'https://student.unsw.edu.au/node/62'
Found: 'http://www.library.unsw.edu.au/'
Found: 'http://www.handbook.unsw.edu.au/2017/'
Found: 'http://www.unsw.edu.au/faculties'
Found: 'https://student.unsw.edu.au/node/4431'
Found: 'https://student.unsw.edu.au/node/128'
Found: 'http://www.unsw.edu.au/contacts'
Found: 'http://www.handbook.unsw.edu.au/general/current/SSAPO/previousEditions.html'
Found: 'http://www.handbook.unsw.edu.au/undergraduate/2017/'
Found: 'http://www.handbook.unsw.edu.au/postgraduate/2017/'
Found: 'http://www.handbook.unsw.edu.au/research/2017/'
Found: 'http://www.handbook.unsw.edu.au/nonaward/2017/'
Found: 'http://www.unsw.edu.au/future-students/domestic-undergraduate'
Found: 'http://www.unsw.edu.au/future-students/postgraduate-coursework'
Found: 'http://research.unsw.edu.au/future-students'
Found: 'http://www.international.unsw.edu.au/#1'
Found: 'https://student.unsw.edu.au/node/1334'
Found: 'https://moodle.telt.unsw.edu.au/login/index.php'
Found: 'https://student.unsw.edu.au/node/943'
Found: 'https://apply.unsw.edu.au/'
Found: 'https://student.unsw.edu.au/node/5450'
Found: 'http://cgi.cse.unsw.edu.au/~nss/feest/'
Found: 'http://www.unsw.edu.au/privacy'
Found: 'http://www.unsw.edu.au/copyright-disclaimer'
Found: 'http://www.unsw.edu.au/accessibility'

The supplied crawler simply scans the URL given on the command line, prints out any URLs that it finds, and then stops. It does not attempt to traverse any further than the supplied page. The second command-line argument, which limits the size of the Graph, is effectively ignored here, since the supplied code does not build a Graph; you need to add the code to do this.

If you run the supplied "crawler" on http://www.cse.unsw.edu.au, you don't get very much, because the CSE website recently moved under the Engineering Faculty system and the above URL is just a redirection page to the new site. Copying/pasting the redirection URL gives you more interesting results. Before you go running the "crawler" on other websites ... DON'T! See the comments below.

HTML is a language which is difficult to parse, given the way it is frequently used, and GetNextURL() makes some approximations which, while they wouldn't be acceptable in a real Web crawler, are OK for this lab.

Exercise

Your task is to modify crawl.c so that it follows any URLs it finds and builds a Graph of the small region of the Web that it traverses before bumping into the MaxURLsInGraph limit.

Important: running crawlers outside the UNSW domain is problematic. Running crawlers that make very frequent URL requests is problematic. So...

  • DO NOT run your crawler on any website outside UNSW
  • YOU MUST include a delay (sleep(1)) between each URL access

If you don't do the above, there's a chance that angry sites that are being hammered by your crawler will block UNSW from future access to those sites. Breaches of the above will result in disciplinary action.

Your crawler can do either a breadth-first or depth-first search, and should follow roughly the graph traversal strategy described in lectures and tutes:

add firstURL to the ToDo list
initialise Graph to hold up to maxURLs
initialise set of Seen URLs
while (ToDo list not empty and Graph not filled) {
	grab Next URL from ToDo list
	if (not allowed) continue
	foreach line in the opened URL {
		foreach URL on that line {
			if (Graph not filled  or both URLs in Graph)
				add an edge from Next to this URL
			if (this URL not Seen already) {
				add it to the Seen set
				add it to the ToDo list
			}
		}
	}
	close the opened URL
	sleep(1)
}

This does not give all the details. You need to work them out yourself, based on the supplied ADT functions and your understanding of generic graph traversal. If you use a Stack for the ToDo list, your crawler will end up doing a depth-first search. If you use a Queue for the ToDo list, your crawler will end up doing a breadth-first search.

A couple more things to note:

  • (not allowed) refers to not using URLs outside UNSW
  • the ToDo list is a Stack or Queue rather than a List
  • if you don't include the sleep(1) you will smash whatever web server you access
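
Putting the pseudocode and the notes above together, the body of your crawler might end up with roughly the following shape. This is only a hedged sketch, not a complete or compilable solution: every function and type name used below (QueueNew, QueueJoin, QueueLeave, QueueIsEmpty, SetNew, SetMember, SetInsert, GraphNew, GraphIsFull, GraphAddEdge, URL_FILE, url_fopen, url_fgets, url_fclose, the isAllowed helper, and the signature assumed for GetNextURL) is an assumption based on the descriptions above; the real names, signatures and edge-insertion rule are for you to work out from the supplied .h files and the pseudocode.

#include <stdio.h>
#include <unistd.h>      // for sleep()
#include "Graph.h"
#include "Queue.h"
#include "Set.h"
#include "url_file.h"
#include "html.h"

#define BUFSIZE 1000

// Hedged sketch of the traversal, using a Queue (so breadth-first).
// Every name used here is an assumption; see the note above.
static void webCrawl(char *firstURL, int maxURLs) {
    Queue todo = QueueNew();
    Set   seen = SetNew();
    Graph g    = GraphNew(maxURLs);

    QueueJoin(todo, firstURL);
    SetInsert(seen, firstURL);

    while (!QueueIsEmpty(todo) && !GraphIsFull(g)) {
        char *next = QueueLeave(todo);
        if (!isAllowed(next)) continue;           // e.g. skip non-UNSW URLs

        URL_FILE *page = url_fopen(next, "r");
        char line[BUFSIZE], url[BUFSIZE];
        while (url_fgets(line, sizeof(line), page) != NULL) {
            while (GetNextURL(line, next, url, sizeof(url))) {
                GraphAddEdge(g, next, url);       // subject to the "graph filled" rule
                if (!SetMember(seen, url)) {
                    SetInsert(seen, url);
                    QueueJoin(todo, url);
                }
            }
        }
        url_fclose(page);
        sleep(1);                                 // required: don't hammer servers
    }

    GraphShow(g, 0);
}

Swapping the Queue for the supplied Stack ADT (and its corresponding push/pop operations) turns this into a depth-first crawl, as noted above.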

If you test the crawler out on www.cse.unsw.edu.au, you don't get particularly interesting results, because you'll build a large adjacency matrix, most of which is empty, before you bump into the MaxURLsInGraph limit. To give you something feasible to crawl, with more interesting output, I have set up a tiny set of self-contained web pages that you can crawl, starting from:

./crawl https://cgi.cse.unsw.edu.au/~cs2521/mini-web/ 30

You should use GraphShow() to check whether you are building a plausible Graph. Note that GraphShow() has two modes:

  • GraphShow(g, 0) shows the URL for each vertex, followed by an indented list of connected vertices
  • GraphShow(g, 1) shows just the adjacency matrix in a very compact form; it does not show the stored URLs

Challenges

There are several aspects of the crawler that you could look to improve:

  • The existing crawler grabs all sorts of URLs that do not represent Web pages. Modify the code so that it filters out non-HTML output.
  • The supplied GetNextURL() function does a reasonable job of finding URLs, but doesn't handle relative URLs. Find one online, write your own, or modify the existing code to make a new GetNextURL() that handles a wider range of URLs.
  • Modify GraphShow() so that it can (also) produce output (JSON) that could be fed into a graph drawing tool such as sigmajs, to produce beautiful graph diagrams.
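
For the last of these challenges, one JSON shape that graph-drawing tools such as sigmajs can load is an object containing a "nodes" array and an "edges" array. The sketch below shows how such a dump might be written if it lived inside Graph.c; it relies on the assumed fields from the earlier struct sketch (nV, vertices, edges), and the x, y and size values are arbitrary placeholders for the drawing tool, so adapt everything here to the real representation and to whatever layout you prefer.

// Hedged sketch only: assumes the GraphRep fields sketched earlier
// (nV, vertices, edges); adapt to the real Graph.c representation.
void GraphShowJSON(Graph g) {
    printf("{\n  \"nodes\": [\n");
    for (int v = 0; v < g->nV; v++) {
        printf("    {\"id\": \"n%d\", \"label\": \"%s\", "
               "\"x\": %d, \"y\": %d, \"size\": 1}%s\n",
               v, g->vertices[v], v % 10, v / 10,
               (v < g->nV - 1) ? "," : "");
    }
    printf("  ],\n  \"edges\": [\n");
    int e = 0;
    for (int v = 0; v < g->nV; v++) {
        for (int w = 0; w < g->nV; w++) {
            if (g->edges[v][w]) {
                if (e > 0) printf(",\n");
                printf("    {\"id\": \"e%d\", \"source\": \"n%d\", \"target\": \"n%d\"}",
                       e++, v, w);
            }
        }
    }
    printf("\n  ]\n}\n");
}

Redirecting this output to a file would then let you load the graph into the drawing tool of your choice.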