
FelSpider: Web Spider - Felix John COLIBRI.

  • abstract : Download a complete web site, with custom or gui filtering and selection of the downloaded files.
  • key words : web spider, downloader, off line reader, HTTP, HTML, TCP IP, virtual path
  • software used : Windows XP, Delphi 6
  • hardware used : Pentium 1.400Mhz, 256 M memory, 140 G hard disc
  • scope : Delphi 1 to 8 for Windows, Kylix
  • level : Delphi developer
  • plan :
    • 1 - Introduction
    • 2 - Web Spidering
    • 3 - The Delphi full Source code
    • 4 - Improvements
    • 5 - Download the Sources
    • 6 - Conclusion
    • 7 - The author


1 - Introduction

Whenever I hit an interesting Web site, I am always afraid of missing some important information which might later be relocated elsewhere, or vanish altogether. With today's disk capacity, it is easy to save those pages on a local drive, possibly adding some indexing scheme or local search engine.

To automate the download of the pages from a site, we use a Web Spider. The basic idea is very old: start with some .HTML page, extract all its links, and recurse by downloading those links.

We already presented a single page downloader, which downloads a page, analyzes the anchor tags, and downloads the corresponding images. So all the basic features are already present; the spider only adds the page selection.




2 - Web Spidering

2.1 - The objective

Since we know how to download a page, collect all the anchor tags from the .HTML and recursively perform the same task, why can't we simply download all the files and read the whole thing off-line ?

The answer is simple: file swamping. If you download, say, the page from a learned teacher at MIT or Waterloo about "UML and Design Patterns", chances are that this page will also reference:

  • other courses taught by the same professor which might not interest you (java programming, computer 101 etc)
  • colleagues' and buddies' pages, who might work in other areas (chemistry or psychiatry)
  • personal pages (resume, photo galleries of last conventions, favorite songs)
  • links to other universities with similar courses. And from there, the flood starts anew.
So some rules must be applied to constrain this flood of pages.



2.2 - Spidering Strategies

Irrespective of the technique used to filter the downloaded pages, here are a couple of rules we apply (a small Delphi sketch of these rules follows the list):
  • we can forbid some pages that we know will seldom interest us. In our case, all multi-media stuff is banned right from the start: no .AVI, .MOV, .WAV and the like. Note that since the advent of BDN (Borland community movies) this might change.
  • the second, and most effective, technique is to limit the pages to the request's local path. If we request Bertrand MEYER's page:

     
    http://www.inf.ethz.ch/personal/meyer/

    we are usually also interested in loading the sub directories referenced in this page:

     
    http://www.inf.ethz.ch/personal/meyer/events/2001/ubs.html
    http://www.inf.ethz.ch/personal/meyer/events/talks.html
    http://www.inf.ethz.ch/personal/meyer/images/sounds/pronunciation.html
    http://www.inf.ethz.ch/personal/meyer/publications/
    http://www.inf.ethz.ch/personal/meyer/publications/computer/dotnet.pdf
    http://www.inf.ethz.ch/personal/meyer/publications/inaugural/inaugural.pdf
    http://www.inf.ethz.ch/personal/meyer/publications/joop/overloading.pdf
    http://www.inf.ethz.ch/personal/meyer/publications/online/bits/index.html

    but would be reluctant to download pages from a completely different domain:

     
    http://www.csse.monash.edu.au
    http://se.inf.ethz.ch/jobs/

    Some references should interest us (page aliases or pages with quite similar content):

     
    http://www.inf.ethz.ch/~meyer/
    http://www.inf.ethz.ch/~meyer/events/
    http://www.inf.ethz.ch/~meyer/ongoing/
    http://www.inf.ethz.ch/~meyer/ongoing/covariance/recast.pdf
    http://www.inf.ethz.ch/~meyer/ongoing/etl/
    http://www.inf.ethz.ch/~meyer/ongoing/etl/agent.pdf
    http://www.inf.ethz.ch/~meyer/ongoing/references/
    http://www.inf.ethz.ch/~meyer/publications/
    http://www.inf.ethz.ch/~meyer/publications/events/events.pdf
    http://www.inf.ethz.ch/~meyer/publications/extraction/extraction.pdf
    http://www.inf.ethz.ch/~meyer/publications/ieee/trusted-icse.pdf
    http://www.inf.ethz.ch/~meyer/publications/proofs/references-proofs.pdf
    http://www.inf.ethz.ch/~meyer/publications/reviews/csharp-review.html

    and some could interest us (his books, I guess):

     
    http://www.inf.ethz.ch/
    http://vig.pearsoned.com/store/product/1,3498,store-562_isbn-0130622966,00.html
    http://www.amazon.com/exec/obidos/ASIN/0130331155/qid=1003330798/sr=8-1/ref=sr_8_7_1/104-4

    Based on this example, we see that:

    • pages in sub directories are very often of interest
    • pages in parent directories are of increasing generality, and the risk of flooding increases
    • pages in other domains cannot easily be automatically selected, even if they have a close relation to the requested domain
  • some sites offer links to files available for download. We frequently encountered this for the "reading lists" of some university courses. Some teachers place those files in a local (sometimes private) directory, but others simply list the URL where a copy of the file is available. So, in this case, the strategy should be "download all the non-HTML files listed in this page". Access to those domains should be allowed, but parsing of the corresponding parent pages and paths refused.
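
Here is a minimal sketch of the first two rules as Delphi functions. The function names, the banned extension list and the use of simple string tests (instead of the c_url class of the actual sources) are only illustrative; both functions need SysUtils in the uses clause:

// -- reject multi-media extensions
function f_has_banned_extension(const p_url: String): Boolean;
  const k_banned_extensions: array[0..3] of String= ('.avi', '.mov', '.wav', '.mp3');
  var l_index: Integer;
      l_extension: String;
  begin
    l_extension:= LowerCase(ExtractFileExt(p_url));
    Result:= False;
    for l_index:= Low(k_banned_extensions) to High(k_banned_extensions) do
      if l_extension= k_banned_extensions[l_index]
        then Result:= True;
  end; // f_has_banned_extension

// -- accept only URLs located in the start path or one of its sub directories
function f_is_under_start_path(const p_start_path, p_url: String): Boolean;
  begin
    Result:= Pos(LowerCase(p_start_path), LowerCase(p_url))= 1;
  end; // f_is_under_start_path

// -- f_is_under_start_path('http://www.inf.ethz.ch/personal/meyer/',
// --     'http://www.inf.ethz.ch/personal/meyer/publications/')   => True
// -- f_is_under_start_path('http://www.inf.ethz.ch/personal/meyer/',
// --     'http://www.csse.monash.edu.au/')                        => False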


Here is an example with three different sites:
  • the three sites

    [figure: spidering strategy]

  • we forbid some pages (.AVI, .MOV, .DOC etc)

    [figure: spidering strategy]

  • here is an example of a page and its sub paths:

    [figure: spidering strategy]

  • here is an example of a page + any pages of some kind (reading list .PDF):

    [figure: spidering strategy]



2.3 - File organization

When saving the downloaded pages on disc, we can adopt an organization:
  • by topic (database, web programming, accounting ...)
  • by request domain and path name (c:\www.cs.berkeley.edu\..., c:\www.ece.uwaterloo.ca\... etc)
When the downloaded volume increases, the second option seems the best, since it avoids many of the aliasing problems.

We cannot save the files under the requested page name only, since a page often contains links to some parent page (for an organization logo, for instance). And even if we adopt a strict "only sub directory" structure, we might later want to download sibling directories, or upper directories.

Here is an example:

  • we want to download the class diagram page of ModelMaker:

     
    http://www.modelmakertools.com/modelmaker/screenshots/page2.html

  • we could save this page as:

     
    c:\www.modelmakertools.com\page2.html

  • however, we could later want to also look at any page1.html, page3.html, the parent http://www.modelmakertools.com/modelmaker/ page, or even the other tools, like http://www.modelmakertools.com/code-explorer/

    This is why we save page2.html as (a sketch of this path building follows the example):

     
    c:\www.modelmakertools.com\modelmaker\screenshots\page2.html
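
Here is a minimal sketch of this URL-to-file-name mapping, assuming the URL has already been split into a domain and a path (the real c_url class does this); the function name and parameters are illustrative, and SysUtils is needed for StringReplace:

function f_save_file_name(const p_root_save_path, p_domain, p_url_path: String): String;
  begin
    // -- p_root_save_path is expected to end with '\', ex 'c:\'
    Result:= p_root_save_path+ p_domain
        + StringReplace(p_url_path, '/', '\', [rfReplaceAll]);
  end; // f_save_file_name

// -- f_save_file_name('c:\', 'www.modelmakertools.com',
// --     '/modelmaker/screenshots/page2.html')
// --   => 'c:\www.modelmakertools.com\modelmaker\screenshots\page2.html'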

Of course, finding the page with the Windows File Explorer will be more difficult, since some content will be nested at deeper levels. But this is why we built a file search engine ("find in file") with some expression power. We can then ask for all files satisfying:

 
ModelMaker OR (Delphi AND UML) OR ("class diagrams" AND NOT java)

We also include an "__doc.txt" file at the root of each downloaded site (in our case, this would be c:\www.modelmakertools.com\__doc.txt), and we usually place some key words in this text. All the __doc.txt files can then be automatically parsed for some expression to tell which sites deal with some topic.
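
Here is a small sketch of such a scan: list the sites whose __doc.txt mentions a key word. The folder layout (the downloaded sites sitting directly under p_root_path) and the identifiers are illustrative, not taken from the FelSpider sources; SysUtils and Classes are needed in the uses clause, and WriteLn assumes a console application:

procedure list_sites_containing(const p_root_path, p_key_word: String);
  var l_search_record: TSearchRec;
      l_c_doc_text: TStringList;
      l_doc_file_name: String;
  begin
    if FindFirst(p_root_path+ '*.*', faDirectory, l_search_record)= 0
      then begin
          repeat
            if ((l_search_record.Attr and faDirectory)<> 0)
                and (l_search_record.Name<> '.') and (l_search_record.Name<> '..')
              then begin
                  l_doc_file_name:= p_root_path+ l_search_record.Name+ '\__doc.txt';
                  if FileExists(l_doc_file_name)
                    then begin
                        l_c_doc_text:= TStringList.Create;
                        l_c_doc_text.LoadFromFile(l_doc_file_name);
                        // -- simple sub-string test; the expression search
                        // -- mentioned above is of course richer
                        if Pos(LowerCase(p_key_word), LowerCase(l_c_doc_text.Text))> 0
                          then WriteLn(l_search_record.Name);
                        l_c_doc_text.Free;
                      end;
                end;
          until FindNext(l_search_record)<> 0;
          FindClose(l_search_record);
        end;
  end; // list_sites_containing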



Note that our naming scheme is not totally alias free. Some sites can be renamed, split or otherwise reorganized under different names. This is even the case for our own site, which can be reached with www.felix-colibri.com OR www.colibrix.com. Removing such aliases would require computing a unique signature for each page, plus an indexing scheme. We considered that this would be too much work simply to remove some redundancy.

2.4 - The Spidering algorithm

Here is the pseudo code of our spider:

 
fill the requested URL list
select the first URL
REPEAT
    download the selected URL
    save the page
    IF this is an HTML containing page
      THEN
          extract the anchor tags
          filter the tags
    select the next URL
UNTIL no more URLs

and:

  • "fill the requested URL list": we usually start with a seed page. However we could type a list of URLs in a tMemo, load the list from a "todo" file, or paste some URLs from previously rejected pages
  • "select the first URL" : this starting page is either an URL ending with .HTML, .PHP, .ASP etc, or the domain name, or some path in the target domain.

    http://download.oracle.com/otn/nt/oracle10g/10g_win32_db.zip
     
    /downloads/index.html
    /ww/
    aboutMe.html#contactInfo
    bliki
    ../../super_search.htm
    ./files/tables.zip
     
    http://cc.borland.com/beta/Item.aspx?id=23191

  • "donwload the selected URL" : the socket sends a GET, and the Server answers either with the requested explicit page, or with a default page (index.html), or the page dynamically build from a CGI request. The socket the receives the HTTP answer
  • "save the page" : in the usual "domain download" case, we save the page in the path derived from the URL. In the case of a "reading list" download, we save the files in the requested page folder. And if the name of a file is a relative name, we must add the absolute path.
  • "IF this is an HTML containing page" : we cannot use the request extension as a reliable way to trigger the .HTML parsing.
    So, when the extension is not an .HTML like exention or an explicit non-HTML exention (.ZIP, .PDF, .PNG ...), we look whether the file contain some .HTML tag. We chose <HTML> and </HTML> as the test criterion, although this is not bullet proof:
    • some binary file might contain those tags, and still conaint only binary data
    • some web pages do not contain <HTML> and </HTML> (only <BODY>, or even only the text with some text tags and some anchors
  • "extract the anchor tags" if the page is a declared or detected .HTML page, we parse the content and build the anchor tag list
  • "filter the tags" : this is the filtering stage. Here we can
    • eliminate unwanted extensions (.WAV ...)
    • eliminate some links based on textual patterns (no URL containing "resume", or "photo gallery" etc
    • eliminate links which do not contain the same domain as the initially requested page
  • "select the next URL" : this step will determine the dowload order. If we download all files up to the end, the download order has no importance whatsoever. But if we might abort the process after some time (bug, overflooding, timeout ...), then it is obviously better to first download the "best" pages. The ranking function obviously depends on the user needs. In our case
    • we usually first donwload the non-HTML files (the .PDD, .ZIP etc)
    • then we donwload the .HTML pages
      • first the sub pages
      • then the parent pages, gradually moving from the current path to the site root

    This can be presented in the following figures:
    • in red the unwanted pages (.MOV), in blue the non-HTML pages (.PDF):

      [figure: download order]

    • we download the target page:

      [figure: download order target page]

    • then the non-HTML pages:

      [figure: download order non html]

    • and the sub pages:

      [figure: download order non html]

    • and we gradually fetch the parent pages:

      [figure: download order non html]

      and

      [figure: download order non html]
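
Here is a minimal sketch of the "does this look like .HTML ?" test described above. The function name is illustrative, and the actual spider works on the downloaded buffer rather than on a String parameter; SysUtils is needed for LowerCase:

function f_looks_like_html(const p_content: String): Boolean;
  var l_lowercase_content: String;
  begin
    l_lowercase_content:= LowerCase(p_content);
    // -- not bullet proof: some pages omit the tags, some binary files contain them
    Result:= (Pos('<html', l_lowercase_content)> 0)
        and (Pos('</html>', l_lowercase_content)> 0);
  end; // f_looks_like_html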



So we fixed the broad lines (first the non-HTML, then the sub pages, then the parent pages), and you still have the full source code to change this. To add some more flexibility, we added two hooks:
  • a customizable filtering callback where one can add some condition to remove some pages
  • a customizable next page selection procedure


We also added very limited GUI selection with checkboxes. The user can select:
  • the page level (parent, the page itself, child pages)
  • the type of pages (none, all, only .HTML-like, image, zip and .PDF like)



3 - The Delphi full Source code



3.1 - The Overall organization

The spider is directly derived from the downloader. The structure and operation of the downloader have been presented at great length in the Web Downloader paper.

The spider only introduces filtering and page ordering functionalities.

So we simply derived the c_site_spider from the c_spider, adding the needed attributes and methods. The main form was also slightly changed.

The filtering and selection additions are:

  • a default management
  • a custom callback possibility (with some required coding on the main Form)
  • a gui handling


3.2 - The GUI management

The main Form presents 3 levels of buttons allowing the user to decide what to download:

[figure: gui strategy settings]

The snapshot shows the selection of

  • all pages at the requested URL level
  • the .HTML, .ZIP and .PDF from the children
  • only the images from the parent pages


Here are the definitions for this user level management:

t_page_level_type= (e_unknown_level, e_outside_level, e_parent_level,
    e_page_level, e_child_level);
t_page_type= (e_unknown_type, e_no_type, e_html_type, e_image_type,
    e_zip_type, e_pdf_type, e_all_type);

t_page_selection= array[e_unknown_level..e_child_level, e_unknown_type..e_all_type] of Boolean;

t_fo_check_url= function(p_c_url: c_url): Boolean of Object;
t_fn_check_url= function(p_c_url: c_url): Boolean;

c_site_spider=
    class(c_spider)
      m_selection_array: t_page_selection;
      m_page_level_fo_array: array[e_parent_level..e_child_level] of t_fo_check_url;
      m_page_type_fn_array: array[e_no_type..e_all_type] of t_fn_check_url;

      Constructor create_site_spider(p_name, p_root_save_path: String; p_c_url: c_url);

      procedure append_selection_procedure(p_po_on_select_url: t_po_accept_url);
      procedure set_selection(p_page_level_type: t_page_level_type;
          p_page_type: t_page_type; p_set: Boolean);
      procedure display_selection;
      function f_has_gui_selection: Boolean;

      procedure filter_other_pages; Override;
      function f_c_select_next_page: c_remote_file; Override;
    end; // c_site_spider

and:

  • we use two enumerated types for our levels and page types
  • the c_site_spider contains the m_selection_array where the booleans chosen by the user are stored
  • the setting of the booleans is performed with the set_selection method (a sketch of set_selection follows the code below)
The user then checks some options (which levels, which pages), and when we create the c_site_spider, we check those buttons:

procedure TForm1.goClick(Sender: TObject);

  procedure add_gui_selection;
    begin
      with g_c_spider do
      begin
        set_selection(e_parent_level, e_no_type, no_parent_.Down);
        set_selection(e_parent_level, e_html_type, html_parent_.Down);
        set_selection(e_parent_level, e_image_type, image_parent_.Down);

        // ...
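
The set_selection method itself is not reproduced in this paper; it presumably only stores the flag in the two dimensional selection array. A minimal sketch (not necessarily the actual FelSpider source):

procedure c_site_spider.set_selection(p_page_level_type: t_page_level_type;
    p_page_type: t_page_type; p_set: Boolean);
  begin
    // -- remember the user's choice for this (level, type) combination
    m_selection_array[p_page_level_type, p_page_type]:= p_set;
  end; // set_selection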



To select which pages must be kept for download, we test each anchor tag:

procedure c_site_spider.filter_other_pages;
    // -- remove unwanted pages

  procedure filter_gui_choices;

    function f_accept(p_c_url: c_url): Boolean;
      var l_page_level_type: t_page_level_type;
          l_page_type: t_page_type;
      begin
        Result:= True;

        for l_page_level_type:= e_parent_level to e_child_level do
          for l_page_type:= e_no_type to e_all_type do
            if m_selection_array[l_page_level_type, l_page_type]
                and m_page_level_fo_array[l_page_level_type](p_c_url)
              then
                case l_page_type of
                  e_no_type : Result:= False;
                  e_all_type : Result:= True;
                  else Result:= m_page_type_fn_array[l_page_type](p_c_url);
                end; // case
      end; // f_accept

    var l_remote_file_index: Integer;

    begin // filter_gui_choices
      with m_c_remote_file_list do
        for l_remote_file_index:= 0 to f_remote_file_count- 1 do
          with f_c_remote_file(l_remote_file_index) do
            if not m_downloaded and m_do_download
              then m_do_download:= f_accept(m_c_normalized_url);
    end; // filter_gui_choices

  begin // filter_other_pages
    // -- first remove GUI choices
    if f_has_gui_selection
      then filter_gui_choices
      else
        // ...
  end; // filter_other_pages

where

  • m_selection_array[l_page_level_type, l_page_type] tests what the user selected
  • m_page_level_fo_array[l_page_level_type](p_c_url) tests the actual page level. The m_page_level_fo_array was initialized in the CONSTRUCTOR to point to page level test functions (a sketch of this initialization follows the list). So testing:

    if m_page_level_fo_array[e_child_level](p_c_url)
      then ...

    is the same as

    if f_is_child_page(p_c_url)
      then ...

    And the same is done with the m_page_type_fn_array test
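
For reference, here is a sketch of what this CONSTRUCTOR initialization presumably looks like. Only f_is_child_page appears elsewhere in this paper; the other helper names are assumptions, not the actual FelSpider identifiers:

// -- inside create_site_spider (sketch):
m_page_level_fo_array[e_parent_level]:= f_is_parent_page;
m_page_level_fo_array[e_page_level]:= f_is_page_level_page;
m_page_level_fo_array[e_child_level]:= f_is_child_page;
// -- m_page_type_fn_array is filled the same way, with plain extension-test
// -- functions (t_fn_check_url is not "of Object")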



The selection of the next page to download is performed in the f_c_select_next_page function.



3.3 - The Custom Filtering and Selection

We also added the possibility to let the Delphi developer write his own custom filtering and selection methods.



For the filtering, we simply created a callback in the c_site_spider CLASS, and if this method is initialized by the developer, the filtering procedure uses this callback to decide whether a page should be kept or not:

  • here is the filtering hook:

    t_po_accept_url= procedure(p_c_url: c_url; var pv_accept: Boolean) of Object;

    c_site_spider= class(c_spider)
                     m_on_filter_url: t_po_accept_url;
                     // ...
                   end; // c_site_spider

  • the developer adds the method in the TForm1 CLASS:

    TForm1= class(TForm)
              // ...
              private
                procedure filter_page(p_c_url: c_url; var pv_accept: Boolean);
            end; // tForm1

  • with, for instance, this check:

    procedure TForm1.filter_page(p_c_url: c_url; var pv_accept: Boolean);
      begin
        with g_c_spider do
          pv_accept:=
              (
                 f_is_child_page(p_c_url)
              and
                 f_is_html_extension(p_c_url)
              )
            or
              (
                 f_is_domain_page(p_c_url)
              and
                 not f_is_child_page(p_c_url)
              and
                (p_c_url.m_extension= '.png')
              );
      end; // filter_page

  • and the callback pointer is initialized when the c_site_spider object is created (see the sketch after this list):

    c_site_spider= class(c_spider)
                     m_oa_on_select_url: array of t_po_accept_url;
                     // -- ...
                   end; // c_site_spider
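
The assignment itself is not reproduced above; it presumably is a single line in TForm1.goClick, once g_c_spider has been created (a sketch, not the actual source):

// -- plug the custom filter into the spider
g_c_spider.m_on_filter_url:= filter_page;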



For the next page selection, we have to use several loops in order to select a page. For instance:
  • first find a page on the current level
  • then look at the child page level
  • then search the parent level
The ordering of the levels and the test conditions can change, but the developer needs to go at least through one complete loop before checking the page list with another condition. So we provide an array of callbacks, and the developer might add as many of those as he wishes, each with its own conditions (a sketch of the resulting selection loop follows the code below). So
  • in the c_site_spider CLASS we have the callback array:

    c_site_spider= class(c_spider)
                     m_oa_on_select_url: array of t_po_accept_url;
                     // -- ...
                   end; // c_site_spider

  • the user adds the method definitions in the TForm1 CLASS:

    TForm1= class(TForm)
              // ...
              private
                procedure select_next_page_child(p_c_url: c_url; var pv_accept: Boolean);
                procedure select_next_page_parent_images(p_c_url: c_url; var pv_accept: Boolean);
            end; // tForm1

  • with, for instance, the following implementation:

    procedure TForm1.select_next_page_child(p_c_url: c_url; var pv_accept: Boolean);
      begin
        with g_c_spider do
          pv_accept:= f_is_child_page(p_c_url) and f_is_html_extension(p_c_url);
      end; // select_next_page_child

    procedure TForm1.select_next_page_parent_images(p_c_url: c_url;
        var pv_accept: Boolean);
      begin
        with g_c_spider do
          pv_accept:= f_is_domain_page(p_c_url) and not f_is_child_page(p_c_url)
              and (p_c_url.m_extension= '.png');
      end; // select_next_page_parent_images

  • and the callbacks are initialized with:

    procedure TForm1.goClick(Sender: TObject);
      begin
        // ...

        with g_c_spider do
        begin
          append_selection_procedure(select_next_page_child);
          append_selection_procedure(select_next_page_parent_images);

          // -- ...
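
The f_c_select_next_page loop itself is not listed in this paper. Here is a sketch of how it can use the callback array, reusing the remote file list identifiers shown in filter_other_pages; the actual FelSpider source may differ in the details:

function c_site_spider.f_c_select_next_page: c_remote_file;
  var l_callback_index, l_remote_file_index: Integer;
      l_accept: Boolean;
  begin
    Result:= Nil;

    // -- one full pass over the page list per callback, in the order
    // -- the callbacks were appended
    for l_callback_index:= 0 to Length(m_oa_on_select_url)- 1 do
      for l_remote_file_index:= 0 to m_c_remote_file_list.f_remote_file_count- 1 do
        with m_c_remote_file_list.f_c_remote_file(l_remote_file_index) do
          if not m_downloaded and m_do_download
            then begin
                l_accept:= False;
                m_oa_on_select_url[l_callback_index](m_c_normalized_url, l_accept);
                if l_accept
                  then begin
                      Result:= m_c_remote_file_list.f_c_remote_file(l_remote_file_index);
                      Exit;
                    end;
              end;
  end; // f_c_select_next_page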




Care must be taken to correctly synchronize the filtering and the selection. If the selection is more stringent than the filtering, the download panel and the "todo pages" list will display pages which will never be downloaded.



3.4 - The default strategy

The default strategy then implements a "usual" reasonable choice. We hard coded the following choices:
  • everything from the current page
  • the .HTML from the children
  • the images from the parent
This could have been hard coded into the c_site_spider, but since we have a general callback system, we simply initialize the callback to this strategy when none was set by the user interface or by the custom callbacks (a sketch of this fallback follows).
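
A sketch of the fallback test, using the names from the c_site_spider CLASS above; default_filter_page stands for a hypothetical method implementing the three default choices, and the actual test in the sources may differ:

// -- before starting the download: if neither the GUI check boxes nor any
// -- custom callback were set, install the default strategy
if (not f_has_gui_selection) and (not Assigned(m_on_filter_url))
    and (Length(m_oa_on_select_url)= 0)
  then m_on_filter_url:= default_filter_page;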



3.5 - The main form

The main form is nearly the same as the one used for the downloader. We simply added:
  • a c_find_memo, which can sort the pages which will not be downloaded. This can be used to perform a first analysis:
    • we first download the main target URL
    • the "other pages" list displays all the anchor tags found in this page
    • when clicking "sort", we easily see what is in the page itself, the child pages, the other domains etc
  • the speed button filter selection, which has been presented above.
Here is the snapshot of the project (a page about "OO and Architecture-based Testing"):

[figure: spider snapshot]




4 - Improvements

Among the possible improvements:
  • start the filtering at the tag-analysis level (not after adding the tags to the file list)
  • improve the GUI selection functionality: building the usual expression-builder technique (the kind found in Quick Report, for instance)
  • start the downloads in parallel instead of processing one page after the other
  • build a more abstract and generic "spidering API"



5 - Download the Sources

Here are the source code files:

Those .ZIP files contain:
  • the main program (.DPR, .DOF, .RES), the main form (.PAS, .DFM), and any other auxiliary form
  • any .TXT for parameters
  • all the other required units (.PAS)
Those .ZIP
  • are self-contained: you will not need any other product (unless expressly mentioned).
  • can be used from any folder (the paths are RELATIVE)
  • will not modify your PC in any way beyond the path where you placed the .ZIP (no registry changes, no path creation etc).
To use the .ZIP:
  • create or select any folder of your choice
  • unzip the downloaded file
  • using Delphi, compile and execute
To remove the .ZIP simply delete the folder.



As usual:

  • please tell us at fcolibri@felix-colibri.com if you found some errors, mistakes, bugs, broken links or had some problem downloading the file. Resulting corrections will be helpful for other readers
  • we welcome any comment, criticism, enhancement, other sources or reference suggestion. Just send an e-mail to fcolibri@felix-colibri.com.

  • and if you liked this article, talk about this site to your fellow developers, add a link to your links page or mention our articles in your blog or newsgroup posts when relevant. That's the way we operate: the more traffic and Google references we get, the more articles we will write.



6 - Conclusion

We presented in this paper a Web Spider which saves all the pages of a web site on the local disc. Filtering and selection allow us to guide the process and avoid flooding.




7 - The author

Felix John COLIBRI works at the Pascal Institute. Starting with Pascal in 1979, he then became involved with Object Oriented Programming, Delphi, Sql, Tcp/Ip, Html, UML. Currently, he is mainly active in the area of custom software development (new projects, maintenance, audits, BDE migration, Delphi Xe_n migrations, refactoring), Delphi Consulting and Delphi training. His web site features tutorials, technical papers about programming with full downloadable source code, and the description and calendar of forthcoming Delphi, FireBird, Tcp/IP, Web Services, OOP / UML, Design Patterns, Unit Testing training sessions.
Created: nov-04. Last updated: jul-15
Copyright © Felix J. Colibri   http://www.felix-colibri.com 2004 - 2015. All rights reserved.