
FelSpider: Web Spider - Felix John COLIBRI.

  • abstract : Download a complete web site, with custom or gui filtering and selection of the downloaded files.
  • key words : web spider, downloader, off line reader, HTTP, HTML, TCP IP, virtual path
  • software used : Windows XP, Delphi 6
  • hardware used : Pentium 1.400Mhz, 256 M memory, 140 G hard disc
  • scope : Delphi 1 to 8 for Windows, Kylix
  • level : Delphi developer
  • plan :
    • 1 - Introduction
    • 2 - Web Spidering
    • 3 - The Delphi full Source code
    • 4 - Improvements
    • 5 - Download the Sources
    • 6 - Conclusion
    • 7 - The author


1 - Introduction

Whenever I hit an interesting Web site, I am always afraid of missing some important information which might later be relocated elsewhere, or vanish altogether. With today's disk capacity, it is easy to save those pages on a local drive, possibly adding some indexing scheme or local search engine.

To automate the download of the pages from a site, we use a Web Spider. The basic idea is very old: start with some .HTML page, extract all its links, and recurse by downloading those links.

We already presented a single page downloader, which downloads a page, analyzes the anchor tags, and downloads the corresponding images. So all the basic features are already present; the spider only adds the page selection.




2 - Web Spidering

2.1 - The objective

Since we know how to download a page, collect all the anchor tags from the .HTML and recursively perform the same task, why can't we simply download all the files and read the whole thing off-line ?

The answer is simple: file swamping. If you download, say, the page from a learned teacher at MIT or Waterloo about "UML and Design Patterns", chances are that this page will also reference:

  • other courses taught by the same professor which might not interest you (java programming, computer 101 etc)
  • colleagues' and buddies' pages, who might work in other areas (chemistry or psychiatry)
  • personal pages (resume, photo galleries of last conventions, favorite songs)
  • links to other universities with similar courses. And from there, the flood starts anew.
So some rules must be applied to constrain this flood of pages.



2.2 - Spidering Strategies

Irrespective of the technique used to filter the downloaded pages, here are a couple of rules we apply (a small Delphi sketch of these rules follows the list):
  • we can forbid some pages that we know will seldom interest us. In our case, all multi-media stuff is banned right from the start: no .AVI, .MOV, .WAV and the like. Note that since the advent of BDN (Borland community movies) this might change.
  • the second, and most effective, technique is to limit the pages to the request's local path. If we request Bertrand MEYER's page:

     
    http://www.inf.ethz.ch/personal/meyer/

    we are usually also interested in loading the sub directories referenced in this page:

     
    http://www.inf.ethz.ch/personal/meyer/events/2001/ubs.html
    http://www.inf.ethz.ch/personal/meyer/events/talks.html
    http://www.inf.ethz.ch/personal/meyer/images/sounds/pronunciation.html
    http://www.inf.ethz.ch/personal/meyer/publications/
    http://www.inf.ethz.ch/personal/meyer/publications/computer/dotnet.pdf
    http://www.inf.ethz.ch/personal/meyer/publications/inaugural/inaugural.pdf
    http://www.inf.ethz.ch/personal/meyer/publications/joop/overloading.pdf
    http://www.inf.ethz.ch/personal/meyer/publications/online/bits/index.html

    but would be reluctant to download pages from a completely different domain:

     
    http://www.csse.monash.edu.au
    http://se.inf.ethz.ch/jobs/

    Some references should interest us (page aliases or pages with quite similar content):

     
    http://www.inf.ethz.ch/~meyer/
    http://www.inf.ethz.ch/~meyer/events/
    http://www.inf.ethz.ch/~meyer/ongoing/
    http://www.inf.ethz.ch/~meyer/ongoing/covariance/recast.pdf
    http://www.inf.ethz.ch/~meyer/ongoing/etl/
    http://www.inf.ethz.ch/~meyer/ongoing/etl/agent.pdf
    http://www.inf.ethz.ch/~meyer/ongoing/references/
    http://www.inf.ethz.ch/~meyer/publications/
    http://www.inf.ethz.ch/~meyer/publications/events/events.pdf
    http://www.inf.ethz.ch/~meyer/publications/extraction/extraction.pdf
    http://www.inf.ethz.ch/~meyer/publications/ieee/trusted-icse.pdf
    http://www.inf.ethz.ch/~meyer/publications/proofs/references-proofs.pdf
    http://www.inf.ethz.ch/~meyer/publications/reviews/csharp-review.html

    and some could interest us (his books, I guess):

     
    http://www.inf.ethz.ch/
    http://vig.pearsoned.com/store/product/1,3498,store-562_isbn-0130622966,00.html
    http://www.amazon.com/exec/obidos/ASIN/0130331155/qid=1003330798/sr=8-1/ref=sr_8_7_1/104-4

    Based on this example, we see that:

    • pages in sub directories are very often of interest
    • pages in parent directories are of increasing generality, and the risk of flooding increases
    • pages in other domains cannot easily be automatically selected, even if they have a close relation to the requested domain
  • some sites offer links to files available for download. We frequently encountered this for the "reading lists" of some university courses. Some teachers place those files in a local (sometimes private) directory, but others simply list the URL where a copy of the file is available. So, in this case, the strategy should be "download all the non-HTML files listed in this page". Access to those domains should be allowed, but parsing of the corresponding parent pages and paths refused.
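
Here is a minimal sketch of the first two rules as Delphi functions. The function names, the banned extension list and the use of simple string tests (instead of the c_url class of the actual sources) are only illustrative; both functions need SysUtils in the uses clause:

// -- reject multi-media extensions
function f_has_banned_extension(const p_url: String): Boolean;
  const k_banned_extensions: array[0..3] of String= ('.avi', '.mov', '.wav', '.mp3');
  var l_index: Integer;
      l_extension: String;
  begin
    l_extension:= LowerCase(ExtractFileExt(p_url));
    Result:= False;
    for l_index:= Low(k_banned_extensions) to High(k_banned_extensions) do
      if l_extension= k_banned_extensions[l_index]
        then Result:= True;
  end; // f_has_banned_extension

// -- accept only URLs located in the start path or one of its sub directories
function f_is_under_start_path(const p_start_path, p_url: String): Boolean;
  begin
    Result:= Pos(LowerCase(p_start_path), LowerCase(p_url))= 1;
  end; // f_is_under_start_path

// -- f_is_under_start_path('http://www.inf.ethz.ch/personal/meyer/',
// --     'http://www.inf.ethz.ch/personal/meyer/publications/')   => True
// -- f_is_under_start_path('http://www.inf.ethz.ch/personal/meyer/',
// --     'http://www.csse.monash.edu.au/')                        => False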


Here is an example with three different sites:
  • the three sites

    [figure: spidering strategy]

  • we forbid some pages (.AVI, .MOV, .DOC etc)

    [figure: spidering strategy]

  • here is an example of a page and its sub paths:

    [figure: spidering strategy]

  • here is an example of a page + any pages of some kind (reading list .PDF):

    [figure: spidering strategy]



2.3 - File organization

When saving the downloaded pages on disc, we can adopt an organization:
  • by topic (database, web programming, accounting ...)
  • by request domain and path name (c:\www.cs.berkeley.edu\..., c:\www.ece.uwaterloo.ca\... etc)
When the downloaded volume increases, the second option seems the best, since it avoids many of the aliasing problems.

We cannot save the files under the requested page name only, since a page often contains links to some parent page (for an organization logo, for instance). And even if we adopt a strict "only sub directory" structure, we might later want to download sibling directories, or upper directories.

Here is an example:

  • we want to download the class diagram page of ModelMaker:

     
    http://www.modelmakertools.com/modelmaker/screenshots/page2.html

  • we could save this page as:

     
    c:\www.modelmakertools.com\page2.html

  • however, we could later want to also look at any page1.html, page3.html, the parent http://www.modelmakertools.com/modelmaker/ page, or even the other tools, like http://www.modelmakertools.com/code-explorer/

    This is why we save page2.html as (a sketch of this path building follows the example):

     
    c:\www.modelmakertools.com\modelmaker\screenshots\page2.html
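
Here is a minimal sketch of this URL-to-file-name mapping, assuming the URL has already been split into a domain and a path (the real c_url class does this); the function name and parameters are illustrative, and SysUtils is needed for StringReplace:

function f_save_file_name(const p_root_save_path, p_domain, p_url_path: String): String;
  begin
    // -- p_root_save_path is expected to end with '\', ex 'c:\'
    Result:= p_root_save_path+ p_domain
        + StringReplace(p_url_path, '/', '\', [rfReplaceAll]);
  end; // f_save_file_name

// -- f_save_file_name('c:\', 'www.modelmakertools.com',
// --     '/modelmaker/screenshots/page2.html')
// --   => 'c:\www.modelmakertools.com\modelmaker\screenshots\page2.html'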

Of course, finding the page with the Windows File Explorer will be more difficult, since some content will be nested at deeper levels. But this is why we built a file search engine ("find in file") with some expression power. We can then ask for all files satisfying:

 
ModelMaker OR (Delphi AND UML) OR ("class diagrams" AND NOT java)

We also include an "__doc.txt" file at the root of each downloaded site (in our case, this would be c:\www.modelmakertools.com\__doc.txt), and we usually place some key words in this text. All the __doc.txt files can then be automatically parsed for some expression to tell which sites deal with some topic.
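
Here is a small sketch of such a scan: list the sites whose __doc.txt mentions a key word. The folder layout (the downloaded sites sitting directly under p_root_path) and the identifiers are illustrative, not taken from the FelSpider sources; SysUtils and Classes are needed in the uses clause, and WriteLn assumes a console application:

procedure list_sites_containing(const p_root_path, p_key_word: String);
  var l_search_record: TSearchRec;
      l_c_doc_text: TStringList;
      l_doc_file_name: String;
  begin
    if FindFirst(p_root_path+ '*.*', faDirectory, l_search_record)= 0
      then begin
          repeat
            if ((l_search_record.Attr and faDirectory)<> 0)
                and (l_search_record.Name<> '.') and (l_search_record.Name<> '..')
              then begin
                  l_doc_file_name:= p_root_path+ l_search_record.Name+ '\__doc.txt';
                  if FileExists(l_doc_file_name)
                    then begin
                        l_c_doc_text:= TStringList.Create;
                        l_c_doc_text.LoadFromFile(l_doc_file_name);
                        // -- simple sub-string test; the expression search
                        // -- mentioned above is of course richer
                        if Pos(LowerCase(p_key_word), LowerCase(l_c_doc_text.Text))> 0
                          then WriteLn(l_search_record.Name);
                        l_c_doc_text.Free;
                      end;
                end;
          until FindNext(l_search_record)<> 0;
          FindClose(l_search_record);
        end;
  end; // list_sites_containing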



Note that our naming scheme is not totally alias free. Some sites can be renamed, split or otherwise reorganized under different names. This is even the case for our own site, which can be reached with www.felix-colibri.com OR www.colibrix.com. Removing such aliases would require computing a unique signature for each page, plus an indexing scheme. We considered that this would be too much work simply to remove some redundancy.

2.4 - The Spidering algorithm

Here is the pseudo code of our spider:

 
fill the requested URL list
select the first URL
REPEAT
    download the selected URL
    save the page
    IF this is an HTML containing page
      THEN
          extract the anchor tags
          filter the tags
    select the next URL
UNTIL no more URLs

and:

  • "fill the requested URL list": we usually start with a seed page. However we could type a list of URLs in a tMemo, load the list from a "todo" file, or paste some URLs from previously rejected pages
  • "select the first URL" : this starting page is either an URL ending with .HTML, .PHP, .ASP etc, or the domain name, or some path in the target domain.

    http://download.oracle.com/otn/nt/oracle10g/10g_win32_db.zip
     
    /downloads/index.html
    /ww/
    aboutMe.html#contactInfo
    bliki
    ../../super_search.htm
    ./files/tables.zip
     
    http://cc.borland.com/beta/Item.aspx?id=23191

  • "donwload the selected URL" : the socket sends a GET, and the Server answers either with the requested explicit page, or with a default page (index.html), or the page dynamically build from a CGI request. The socket the receives the HTTP answer
  • "save the page" : in the usual "domain download" case, we save the page in the path derived from the URL. In the case of a "reading list" download, we save the files in the requested page folder. And if the name of a file is a relative name, we must add the absolute path.
  • "IF this is an HTML containing page" : we cannot use the request extension as a reliable way to trigger the .HTML parsing.
    So, when the extension is not an .HTML like exention or an explicit non-HTML exention (.ZIP, .PDF, .PNG ...), we look whether the file contain some .HTML tag. We chose <HTML> and </HTML> as the test criterion, although this is not bullet proof:
    • some binary file might contain those tags, and still conaint only binary data
    • some web pages do not contain <HTML> and </HTML> (only <BODY>, or even only the text with some text tags and some anchors
  • "extract the anchor tags" if the page is a declared or detected .HTML page, we parse the content and build the anchor tag list
  • "filter the tags" : this is the filtering stage. Here we can
    • eliminate unwanted extensions (.WAV ...)
    • eliminate some links based on textual patterns (no URL containing "resume", or "photo gallery" etc
    • eliminate links which do not contain the same domain as the initially requested page
  • "select the next URL" : this step will determine the dowload order. If we download all files up to the end, the download order has no importance whatsoever. But if we might abort the process after some time (bug, overflooding, timeout ...), then it is obviously better to first download the "best" pages. The ranking function obviously depends on the user needs. In our case
    • we usually first donwload the non-HTML files (the .PDD, .ZIP etc)
    • then we donwload the .HTML pages
      • first the sub pages
      • then the parent pages, gradually moving from the current path to the site root

    This can be presented in the following figures:
    • in red the unwanted pages (.MOV), in blue the non-HTML pages (.PDF):

      [figure: download order]

    • we download the target page:

      [figure: download order target page]

    • then the non-HTML pages:

      [figure: download order non html]

    • and the sub pages:

      [figure: download order non html]

    • and we gradually fetch the parent pages:

      [figure: download order non html]

      and

      [figure: download order non html]
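
Here is a minimal sketch of the "does this look like .HTML ?" test described above. The function name is illustrative, and the actual spider works on the downloaded buffer rather than on a String parameter; SysUtils is needed for LowerCase:

function f_looks_like_html(const p_content: String): Boolean;
  var l_lowercase_content: String;
  begin
    l_lowercase_content:= LowerCase(p_content);
    // -- not bullet proof: some pages omit the tags, some binary files contain them
    Result:= (Pos('<html', l_lowercase_content)> 0)
        and (Pos('</html>', l_lowercase_content)> 0);
  end; // f_looks_like_html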



So we fixed the broad lines (first the non-HTML, then the sub pages, then the parent pages), and you still have the full source code to change this. To add some more flexibility, we added two hooks:
  • a customizable filtering callback where one can add some condition to remove some pages
  • a customizable next page selection procedure


We also added very limited GUI selection with checkboxes. The user can select:
  • the page level (parent, the page itself, child pages)
  • the type of pages (none, all, only .HTML-like, image, zip and .PDF like)



3 - The Delphi full Source code



3.1 - The Overall organization

The spider is directly derived from the downloader. The structure and operation of the downloader have been presented at great length in the Web Downloader paper.

The spider only introduces filtering and page ordering functionalities.

So we simply derived the c_site_spider from the c_spider, adding the needed attributes and methods. The main form was also slightly changed.

The filtering and selection additions are:

  • a default management
  • a custom callback possibility (with some required coding on the main Form)
  • a gui handling


3.2 - The GUI management

The main Form presents 3 levels of buttons allowing the user to decide what to download:

[figure: gui strategy settings]

The snapshot shows the selection of

  • all pages at the requested URL level
  • the .HTML, .ZIP and .PDF from the children
  • only the images from the parent pages


Here are the definitions for this user level management:

t_page_level_type= (e_unknown_level, e_outside_level, e_parent_level,
    e_page_level, e_child_level);
t_page_type= (e_unknown_type, e_no_type, e_html_type, e_image_type,
    e_zip_type, e_pdf_type, e_all_type);

t_page_selection= array[e_unknown_level..e_child_level, e_unknown_type..e_all_type] of Boolean;

t_fo_check_url= function(p_c_url: c_url): Boolean of Object;
t_fn_check_url= function(p_c_url: c_url): Boolean;

c_site_spider=
    class(c_spider)
      m_selection_array: t_page_selection;
      m_page_level_fo_array: array[e_parent_level..e_child_level] of t_fo_check_url;
      m_page_type_fn_array: array[e_no_type..e_all_type] of t_fn_check_url;

      Constructor create_site_spider(p_name, p_root_save_path: String; p_c_url: c_url);

      procedure append_selection_procedure(p_po_on_select_url: t_po_accept_url);
      procedure set_selection(p_page_level_type: t_page_level_type;
          p_page_type: t_page_type; p_set: Boolean);
      procedure display_selection;
      function f_has_gui_selection: Boolean;

      procedure filter_other_pages; Override;
      function f_c_select_next_page: c_remote_file; Override;
    end; // c_site_spider

and:

  • we use two enumerated types for our levels and page types
  • the c_site_spider contains the m_selection_array where the booleans chosen by the user are stored
  • the setting of the booleans is performed with the set_selection method (a sketch of set_selection follows the code below)
The user then checks some options (which levels, which pages), and when we create the c_site_spider, we check those buttons:

procedure TForm1.goClick(Sender: TObject);

  procedure add_gui_selection;
    begin
      with g_c_spider do
      begin
        set_selection(e_parent_level, e_no_type, no_parent_.Down);
        set_selection(e_parent_level, e_html_type, html_parent_.Down);
        set_selection(e_parent_level, e_image_type, image_parent_.Down);

        // ...
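
The set_selection method itself is not reproduced in this paper; it presumably only stores the flag in the two dimensional selection array. A minimal sketch (not necessarily the actual FelSpider source):

procedure c_site_spider.set_selection(p_page_level_type: t_page_level_type;
    p_page_type: t_page_type; p_set: Boolean);
  begin
    // -- remember the user's choice for this (level, type) combination
    m_selection_array[p_page_level_type, p_page_type]:= p_set;
  end; // set_selection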



To select which pages must be kept for download, we test each anchor tag:

procedure c_site_spider.filter_other_pages;
    // -- remove unwanted pages

  procedure filter_gui_choices;

    function f_accept(p_c_url: c_url): Boolean;
      var l_page_level_type: t_page_level_type;
          l_page_type: t_page_type;
      begin
        Result:= True;

        for l_page_level_type:= e_parent_level to e_child_level do
          for l_page_type:= e_no_type to e_all_type do
            if m_selection_array[l_page_level_type, l_page_type]
                and m_page_level_fo_array[l_page_level_type](p_c_url)
              then
                case l_page_type of
                  e_no_type : Result:= False;
                  e_all_type : Result:= True;
                  else Result:= m_page_type_fn_array[l_page_type](p_c_url);
                end; // case
      end; // f_accept

    var l_remote_file_index: Integer;

    begin // filter_gui_choices
      with m_c_remote_file_list do
        for l_remote_file_index:= 0 to f_remote_file_count- 1 do
          with f_c_remote_file(l_remote_file_index) do
            if not m_downloaded and m_do_download
              then m_do_download:= f_accept(m_c_normalized_url);
    end; // filter_gui_choices

  begin // filter_other_pages
    // -- first remove GUI choices
    if f_has_gui_selection
      then filter_gui_choices
      else
        // ...
  end; // filter_other_pages

where

  • m_selection_array[l_page_level_type, l_page_type] tests what the user selected
  • m_page_level_fo_array[l_page_level_type](p_c_url) tests the actual page level. The m_page_level_fo_array was initialized in the CONSTRUCTOR to point to page level test functions (a sketch of this initialization follows the list). So testing:

    if m_page_level_fo_array[e_child_level](p_c_url)
      then ...

    is the same as

    if f_is_child_page(p_c_url)
      then ...

    And the same is done with the m_page_type_fn_array test
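
For reference, here is a sketch of what this CONSTRUCTOR initialization presumably looks like. Only f_is_child_page appears elsewhere in this paper; the other helper names are assumptions, not the actual FelSpider identifiers:

// -- inside create_site_spider (sketch):
m_page_level_fo_array[e_parent_level]:= f_is_parent_page;
m_page_level_fo_array[e_page_level]:= f_is_page_level_page;
m_page_level_fo_array[e_child_level]:= f_is_child_page;
// -- m_page_type_fn_array is filled the same way, with plain extension-test
// -- functions (t_fn_check_url is not "of Object")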



The selection of the next page to download is performed in the f_c_select_next_page function.



3.3 - The Custom Filtering and Selection

We also added the possibility to let the Delphi developer write his own custom filtering and selection methods.



For the filtering, we simply created a callback in the c_site_spider CLASS, and if this method is initialized by the developer, the filtering procedure uses this callback to decide whether a page should be kept or not:

  • here is the filtering hook:

    t_po_accept_url= procedure(p_c_url: c_url; var pv_accept: Boolean) of Object;

    c_site_spider= class(c_spider)
                     m_on_filter_url: t_po_accept_url;
                     // ...
                   end; // c_site_spider

  • the developer adds the method in the TForm1 CLASS:

    TForm1= class(TForm)
              // ...
              private
                procedure filter_page(p_c_url: c_url; var pv_accept: Boolean);
            end; // tForm1

  • with, for instance, this check:

    procedure TForm1.filter_page(p_c_url: c_url; var pv_accept: Boolean);
      begin
        with g_c_spider do
          pv_accept:=
              (
                 f_is_child_page(p_c_url)
              and
                 f_is_html_extension(p_c_url)
              )
            or
              (
                 f_is_domain_page(p_c_url)
              and
                 not f_is_child_page(p_c_url)
              and
                (p_c_url.m_extension= '.png')
              );
      end; // filter_page

  • and the callback pointer is initialized when the c_site_spider object is created (see the sketch after this list):

    c_site_spider= class(c_spider)
                     m_oa_on_select_url: array of t_po_accept_url;
                     // -- ...
                   end; // c_site_spider
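
The assignment itself is not reproduced above; it presumably is a single line in TForm1.goClick, once g_c_spider has been created (a sketch, not the actual source):

// -- plug the custom filter into the spider
g_c_spider.m_on_filter_url:= filter_page;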



For the next page selection, we have to use several loops in order to select a page. For instance:
  • first find a page on the current level
  • then look at the child page level
  • then search the parent level
The ordering of the levels and the test conditions can change, but the developer needs to go at least through one complete loop before checking the page list with another condition. So we provide an array of callbacks, and the developer might add as many of those as he wishes, each with its own conditions (a sketch of the resulting selection loop follows the code below). So
  • in the c_site_spider CLASS we have the callback array:

    c_site_spider= class(c_spider)
                     m_oa_on_select_url: array of t_po_accept_url;
                     // -- ...
                   end; // c_site_spider

  • the user adds the method definitions in the TForm1 CLASS:

    TForm1= class(TForm)
              // ...
              private
                procedure select_next_page_child(p_c_url: c_url; var pv_accept: Boolean);
                procedure select_next_page_parent_images(p_c_url: c_url; var pv_accept: Boolean);
            end; // tForm1

  • with, for instance, the following implementation:

    procedure TForm1.select_next_page_child(p_c_url: c_url; var pv_accept: Boolean);
      begin
        with g_c_spider do
          pv_accept:= f_is_child_page(p_c_url) and f_is_html_extension(p_c_url);
      end; // select_next_page_child

    procedure TForm1.select_next_page_parent_images(p_c_url: c_url;
        var pv_accept: Boolean);
      begin
        with g_c_spider do
          pv_accept:= f_is_domain_page(p_c_url) and not f_is_child_page(p_c_url)
              and (p_c_url.m_extension= '.png');
      end; // select_next_page_parent_images

  • and the callbacks are initialized with:

    procedure TForm1.goClick(Sender: TObject);
      begin
        // ...

        with g_c_spider do
        begin
          append_selection_procedure(select_next_page_child);
          append_selection_procedure(select_next_page_parent_images);

          // -- ...
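
The f_c_select_next_page loop itself is not listed in this paper. Here is a sketch of how it can use the callback array, reusing the remote file list identifiers shown in filter_other_pages; the actual FelSpider source may differ in the details:

function c_site_spider.f_c_select_next_page: c_remote_file;
  var l_callback_index, l_remote_file_index: Integer;
      l_accept: Boolean;
  begin
    Result:= Nil;

    // -- one full pass over the page list per callback, in the order
    // -- the callbacks were appended
    for l_callback_index:= 0 to Length(m_oa_on_select_url)- 1 do
      for l_remote_file_index:= 0 to m_c_remote_file_list.f_remote_file_count- 1 do
        with m_c_remote_file_list.f_c_remote_file(l_remote_file_index) do
          if not m_downloaded and m_do_download
            then begin
                l_accept:= False;
                m_oa_on_select_url[l_callback_index](m_c_normalized_url, l_accept);
                if l_accept
                  then begin
                      Result:= m_c_remote_file_list.f_c_remote_file(l_remote_file_index);
                      Exit;
                    end;
              end;
  end; // f_c_select_next_page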




Care must be taken to correctly synchronize the filtering and the selection. If the selection is more stringent than the filtering, the download panel and the "todo pages" list will display pages which will never be downloaded.



3.4 - The default strategy

The default strategy then implements a "usual" reasonable choice. We hard coded the following choices:
  • everything from the current page
  • the .HTML from the children
  • the images from the parent
This could have been hard coded into the c_site_spider, but since we have a general callback system, we simply initialize the callback to this strategy when none was set by the user interface or by the custom callbacks (a sketch of this fallback follows).
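
A sketch of the fallback test, using the names from the c_site_spider CLASS above; default_filter_page stands for a hypothetical method implementing the three default choices, and the actual test in the sources may differ:

// -- before starting the download: if neither the GUI check boxes nor any
// -- custom callback were set, install the default strategy
if (not f_has_gui_selection) and (not Assigned(m_on_filter_url))
    and (Length(m_oa_on_select_url)= 0)
  then m_on_filter_url:= default_filter_page;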



3.5 - The main form

The main form is nearly the same as the one used for the downloader. We simply added:
  • a c_find_memo, which can sort the pages which will not be downloaded. This can be used to perform a first analysis:
    • we first download the main target URL
    • the "other pages" list displays all the anchor tags found in this page
    • when clicking "sort", we easily see what is in the page itself, the child pages, the other domains etc
  • the speed button filter selection, which has been presented above.
Here is the snapshot of the project (a page about "OO and Architecture-based Testing"):

[figure: spider snapshot]




4 - Improvements

Among the possible improvements:
  • start the filtering at the tag-analysis level (not after adding the tags to the file list)
  • improve the GUI selection functionality: building the usual expression-builder technique (the kind found in Quick Report, for instance)
  • start the downloads in parallel instead of processing one page after the other
  • build a more abstract and generic "spidering API"



5 - Download the Sources

Here are the source code files:

Those .ZIP files contain:
  • the main program (.DPR, .DOF, .RES), the main form (.PAS, .DFM), and any other auxiliary form
  • any .TXT for parameters
  • all the other required units (.PAS)
Those .ZIP
  • are self-contained: you will not need any other product (unless expressly mentioned).
  • can be used from any folder (the paths are RELATIVE)
  • will not modify your PC in any way beyond the path where you placed the .ZIP (no registry changes, no path creation etc).
To use the .ZIP:
  • create or select any folder of your choice
  • unzip the downloaded file
  • using Delphi, compile and execute
To remove the .ZIP simply delete the folder.



As usual:

  • please tell us at fcolibri@felix-colibri.com if you found some errors, mistakes, bugs, broken links or had some problem downloading the file. Resulting corrections will be helpful for other readers
  • we welcome any comment, criticism, enhancement, other sources or reference suggestion. Just send an e-mail to fcolibri@felix-colibri.com.

  • and if you liked this article, talk about this site to your fellow developers, add a link to your links page or mention our articles in your blog or newsgroup posts when relevant. That's the way we operate: the more traffic and Google references we get, the more articles we will write.



6 - Conclusion

We presented in this paper a Web Spider which saves all the pages of a web site on the local disc. Filtering and selection allow us to guide the process and avoid flooding.




7 - The author

Felix John COLIBRI works at the Pascal Institute. Starting with Pascal in 1979, he then became involved with Object Oriented Programming, Delphi, Sql, Tcp/Ip, Html, UML. Currently, he is mainly active in the area of custom software development (new projects, maintenance, audits, BDE migration, Delphi Xe_n migrations, refactoring), Delphi Consulting and Delphi training. His web site features tutorials, technical papers about programming with full downloadable source code, and the description and calendar of forthcoming Delphi, FireBird, Tcp/IP, Web Services, OOP / UML, Design Patterns, Unit Testing training sessions.
Created: nov-04. Last updated: jul-15
Copyright © Felix J. Colibri   http://www.felix-colibri.com 2004 - 2015. All rights reserved.