CRIS

Issue Crawler - Instructions for Use

1 June 2006

ISSUE CRAWLER

INSTRUCTIONS FOR USE

(original text in English from the site http://www.govcom.org/Issuecrawler_instructions.htm - 26 May 2006)

1. Introduction

Welcome to the Issue Crawler, the network mapping software by the Govcom.org Foundation, Amsterdam. This is the online documentation. (Auto-request an account at issuecrawler.net.) Issuecrawler.net also has a FAQ and a list of features currently not working.


1.1 Before you begin

Download the svg viewer plug-in at http://www.adobe.com/svg. For SVG info, see: http://www.w3.org/Graphics/SVG/SVG-Implementations. (Windows users are advised to use Internet Explorer. Windows users who use Firefox exclusively should see this svg plug-in information: http://plugindoc.mozdev.org/windows-all.html#AdobeSVG.)


1.2 Quick start

Enter at least two related URLs in the Issue Crawler, harvest, name your crawl and launch it. Crawls complete in 10 minutes to 8 hours, depending upon the quantity of starting points. View the map in the Network Manager. Clicking node names opens URLs. Save from the map options. Print the map from a saved file, such as a pdf. (For printing from pdf, page setup should be landscape; use 'actual size', not fit to page.)


1.3 Description of the Issue Crawler

The Issue Crawler is web network location software. It consists of a crawler, a co-link analysis engine and two visualisation modules. It is server-side software that crawls specified sites, captures the outlinks from the specified sites, performs co-link analysis on the outlinks, returns densely interlinked networks, and visualises them in circle and cluster maps. For user tips, see also the scenarios of use, available at http://www.govcom.org/scenarios_use.htm. For a list of articles resulting from the use of the Issue Crawler, see http://www.govcom.org/publications.html.

The following is a step-by-step guide to using the software.


2. Log in

Enter Username and Password

Remember me? Checking the box has the software remember your username and password for future use. (A cookie is used.) Your browser may also be able to remember your log-in details.

Forgot password? Type your username or email address into the username field and press login. A new password is sent to your email address if you are a valid user.

Request account? Fill in as many fields as you feel comfortable with. Note that users' privacy concerns have been built into the archive search, whilst still enabling an open archive.

3. The Lobby

The Lobby is so named because it is the area where one waits for crawls to complete. Crawl completion time varies between 10 minutes and 8 hours, depending on the number of servers from which the crawler requests pages.

Whilst waiting, users may read news about the software and the results people have generated. (News is posted by the administrators of the software.) Users may also view maps in the archive as well as launch additional crawls.

To the right is the listing of current crawls. Crawls are either crawling or queued (i.e., ‘waiting to be launched'). Crawls run sequentially on parallel crawlers. Under details you may view the author, email address, settings and progress of the current crawls, as well as a live view of the crawls. The estimated completion time may change significantly should net congestion increase or decrease.

The User Manager is below the listing of current crawls. Users may change their username, password and email address.


4. Issue Crawler

The Issue Crawler is the crawler itself. There are two steps before launching a crawl.

4.1 The Harvester. (Step one)

The Harvester is so named because it strips URLs from text dumped into the space. For example, one may copy and paste a page of search engine returns into the Harvester. The Harvester strips away the text, leaving only the URLs. It is a generally useful tool in itself. (See also the FAQ.)

Type or paste at least two different URLs into the harvester, and press harvest. These harvested URLs will be crawled when you launch the crawl.

Tip:
If you find a list of links on the Web showing only pointer text and not the URLs themselves, view the page source, copy the code containing the URLs, paste it into the Harvester and press Harvest. The Harvester will strip out the code, leaving only the URLs.
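
The Harvester itself runs server-side, but the principle can be illustrated with a short sketch. The following Python fragment (hypothetical, not the Issue Crawler's actual code) extracts the unique URLs from pasted text or page source with a regular expression:

    import re

    # A simple pattern for http(s) URLs; the real Harvester's rules may differ.
    URL_PATTERN = re.compile(r"""https?://[^\s<>"')]+""")

    def harvest(text):
        """Strip away surrounding text or code, returning the unique URLs found."""
        seen, result = set(), []
        for url in URL_PATTERN.findall(text):
            if url not in seen:  # skip double entries
                seen.add(url)
                result.append(url)
        return result

    page_source = '<a href="http://www.freetibet.org/info/links.html">Links</a>'
    print(harvest(page_source))
    # ['http://www.freetibet.org/info/links.html']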


4.2 The Crawler Settings. (Step two)

Your harvested URLs appear in the box. You may edit and remove URLs. You may save your harvested results. This is also the stage where you provide the Crawler with instructions (the crawler settings), and where you name and launch your crawl.

Tips:

Once you have harvested:

Remove double entries by clicking on a URL and pressing remove.

View starting points to ensure they are correct by clicking on a URL and pressing view.

Should a URL be incorrect, edit the starting point by clicking the URL and pressing edit. Once edited, press update.

You may save your harvested results by pressing save results. A text file is created.

Should you wish to add URLs, save your results, return to the Harvester, and paste your saved results into the Harvester. Add the new URLs and press Harvest.


4.3 Explanation of General Crawler Operation.

The Issue Crawler crawls the specified starting points, captures the starting points' outlinks, and performs co-link analysis to determine which outlinks at least two starting points have in common. The Issue Crawler performs these two steps (crawling and co-link analysis) once, twice or three times. Each performance of these two steps is called an iteration. Each iteration has the same crawl depth. The crawler respects robots exclusion files. Note: if you wish to see a site's robots exclusion policy, you may consult http://tools.issuecrawler.net/.

Tip:

1. Avoid crawling big media sites, blogs, search engines, pdf files, image files and, more generally, pages without specific outgoing links.

More specific crawler operation information is available in the FAQ .
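
The co-link step described above can be illustrated with a minimal sketch. The following Python fragment (an illustration of the principle, with made-up site names, not the Issue Crawler's actual implementation) keeps only those outlinks that at least two starting points have in common; on a further iteration, the result would serve as the new set of starting points:

    # Hypothetical outlink data: each starting point mapped to the outlinks
    # captured when crawling it (site-level, for simplicity).
    outlinks = {
        "siteA.org": {"ngo1.org", "ngo2.org", "blogX.net"},
        "siteB.org": {"ngo1.org", "ngo3.org"},
        "siteC.org": {"ngo1.org", "ngo2.org"},
    }

    def colink_analysis(outlinks, threshold=2):
        """Return the outlinks received from at least `threshold` starting points."""
        counts = {}
        for links in outlinks.values():
            for target in links:
                counts[target] = counts.get(target, 0) + 1
        return {target for target, n in counts.items() if n >= threshold}

    print(colink_analysis(outlinks))
    # {'ngo1.org', 'ngo2.org'} -- co-linked by at least two starting points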


4.4 Crawler Settings in Detail

There are four settings. The default settings suffice to ensure a crawl. You must name your crawl before launching the crawler.


Privilege Starting Points: This setting keeps your starting points in the results after the first iteration. Privileging starting points (and using one iteration of method) is suggested for social network mapping. The software understands a social network as the starting points plus those organizations receiving at least two links from the starting points.

Perform co-link analysis by page or by site. Performing co-link analysis by page analyses deep pages, and returns networks consisting of pages. Performing co-link analysis by site returns networks consisting of sites or homepages only. Analysis by page is suggested, for the results are more specific, and the clickable nodes on the map are often 'deep pages' as opposed to homepages. The difference is sketched below.
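
The two modes can be contrasted in a couple of lines of Python (illustrative only; urlparse is the standard library helper): analysis by page keeps the full deep-page URL, while analysis by site reduces every captured URL to its host before the co-link counting:

    from urllib.parse import urlparse

    url = "http://www.freetibet.org/info/links.html"

    by_page = url                   # deep page kept as-is
    by_site = urlparse(url).netloc  # 'www.freetibet.org' -- host only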

Set iterations. One may set the number of iterations of method (crawling and co-link analysis) to one, two or three. One iteration is suggested for social network mapping, two for issue network mapping and three for establishment network mapping. For a longer description of the distinction between networks, see also the scenarios of use, http://www.govcom.org/scenarios_use.htm.

Crawl depth. One may crawl sites one, two or three layers deep.

Here is a strict definition of how depth is calculated.

The pages fetched from the starting point URLs are considered to be depth 0. The pages fetched from URL links on those pages are considered to be depth 1. In general, the pages found from URL links on a page of depth N are considered to be depth N+1. If you set a depth of 2, then no pages of depth 2 will be fetched; only pages of depth 0 and 1 will be fetched (i.e., two levels of depth). {Text by David Heath at Oneworld.}
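
The rule can be restated as a small sketch. In the following Python fragment (hypothetical; fetch_outlinks stands in for the actual page fetching and link extraction), pages at the configured depth are never fetched:

    from collections import deque

    def fetch_outlinks(url):
        """Placeholder: fetch `url` and return the URLs it links to."""
        return []  # a real crawler would fetch and parse the page here

    def crawl(starting_points, depth=2):
        """Fetch pages of depth 0 through depth-1; pages at `depth` are skipped."""
        fetched = set()
        queue = deque((url, 0) for url in starting_points)
        while queue:
            url, d = queue.popleft()
            if url in fetched or d >= depth:
                continue  # a depth of 2 fetches depths 0 and 1 only
            fetched.add(url)
            for link in fetch_outlinks(url):
                queue.append((link, d + 1))
        return fetched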

Tips:
1. Use links pages as starting points. Links pages are the URLs where hyperlinks are listed, e.g., http://www.freetibet.org/info/links.html. Occasionally sites, using frames or other structures, are so designed that visitors may have the impression that they are always on the homepage. If, on the homepage, you notice a hyperlink to ‘links' or ‘resources', right-click the ‘links' link, copy the link location to the clipboard, and paste it into the Harvester. Use as many links pages as possible for your starting points.
2. Give the crawler the least amount of work to do. Using a few links pages as starting points, with one iteration of method and a crawl depth of one layer, will provide the quickest crawl completion.
3. Before launching a crawl, name the crawl clearly. Name the crawl so that others viewing the archive will understand what it is. Viewing the archive will give you a sense of which crawls have been named well and which less so.

Ceilings (advanced). The crawled URL ceiling (per host) is the maximum quantity of URLs crawled on each host. The crawled URL ceiling (overall) is the total quantity of URLs crawled (max 60000). The co-link ceiling by page (pages per host per iteration) is the maximum quantity of co-linked pages returned per iteration (max 1000). The co-link ceiling by site (hosts per iteration) is the maximum quantity of co-linked sites returned per iteration (max 1000).

Exclusion list. There is a list of URLs to be excluded from crawling and thereby excluded from the results, e.g., software download pages, site stats counters, search engines and others. It is suggested that you keep your own list. You may edit the existing list. Please note the list format, and edit the list using the same format, i.e., www.google.com ; news.google.com.
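
As a sketch of how such a list might be applied (hypothetical Python, mirroring the semicolon-separated format shown above):

    exclusion_text = "www.google.com ; news.google.com"

    # Split on ';' and trim whitespace, following the list's own format.
    excluded = {entry.strip() for entry in exclusion_text.split(";") if entry.strip()}

    results = ["www.google.com", "www.freetibet.org"]
    print([host for host in results if host not in excluded])
    # ['www.freetibet.org']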

Name and Launch crawl.
Name crawl before launch. Use a name that clearly identifies the network you seek. Once you have launched a crawl, your crawl details will appear. These include the name of your crawl, and the time and date launched.


5. Network Manager and Archive

5.1 Purpose of the Network Manager and Archive

The principal purpose of the Network Manager and the Archive is to allow you to generate, view, edit, save and print maps.

The Network Manager provides a list of your completed crawls. The Archive provides a list of all users' completed crawls. The archive may be searched.


5.2 Features of the Network Manager and Archive

The Network Manager and the Archive have a number of features.

List of completed crawls. Listed are the network names and the top five organizations in each network. Each network lists the top five URLs beneath the title of the network, with an inlink count in parentheses. The inlink count is the total number of links the organization or site has received from the network. It is a page count. Clicking on an organization (in the form of a shortened URL) places it in the archive search, and allows you to find all maps in the archive containing that organization (according to the homepage URL, without the www, such as greenpeace.org). It seems that worldbank.org currently appears in the most networks in the archive.

Network Selection - The Scheduler. You may schedule the network to repeat the crawl at specified intervals using either your original starting points or the network results. This allows you to watch the evolution of the network over time, either on your terms (scheduling a crawl using your starting points) or on the network's terms (scheduling a crawl using the last available network results).

Network Selection – View Map. You may view a depiction of your network as a circle or cluster map.

Network Selection – Edit Map Name and Add Legend Text. You may change the name of the map and add a legend text by pressing the + sign below, editing and pressing save changes. The legend text will appear on the map.

Network Selection – Other Data Views. Available are: the xml source file; the raw data (comma separated); an actor list with interlinkings (core network) and its equivalent non-matrix version; an actor list with interlinkings (core network and periphery) and its equivalent non-matrix version; and the page list with their interlinkings (core and periphery).


5.3 Map Viewing and Interactivity

Map Viewing

Pressing View Depiction for a cluster map or a circle map generates a map. The map is generated as a scalable vector graphic (svg). The browser may require a plug-in to view an svg file. An svg viewer plug-in is available at http://www.adobe.com/svg.

The map shows its name, author, and crawl start and completion dates, as well as the crawler settings. By default, it also loads the statistics of the largest node on the map. The largest node is the node that has received the most inlinks from the network actors.

Legend text may be added on the network details page.

The legend shows the top- and second-level domains ("node types") represented on the map.

For the cluster map, the placement of the nodes on the map is significant. Placement is relative to the significance of a node to the other nodes, according to the ReseauLu approach.


Map Interactivity
Clickable Node Names. Each node name on the map is clickable. Clicking a node name will open a pop-up window and retrieve the URL associated with the node name. Should you have run your crawl with the co-link analysis mode set to ‘by page', the nodes are often ‘deep pages'.

Clickable Nodes
Selecting a node shows the destination URL, the node's crawl inlink count, as well as its links to and from other network actors, in the statistics.

Clickable Node Types (domains and sub-domains)
You may turn links to and from the domains and sub-domains listed in the legend on and off. You may also turn links on and off using the drop-down menu.

Zooming and Panning. To zoom in, zoom out or return to the original view, use ctrl-mouse. To pan, press alt and drag.


5.4 Saving and Printing Maps

Saving Map.
Use the save and export option on the map.

Save the interactive .svg file for uploading to a site or for file transfer.
In order for the .svg file to load on your site, add a line to your web server's mime-types configuration so that svg files are recognized and served to the browser with the correct content type. This is standard with Apache.
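
With Apache, for example, this is typically a single AddType directive in the server or directory configuration (shown here as a sketch; where it goes depends on your server setup):

    AddType image/svg+xml .svg .svgz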

Save the .jpg or .png file as a flat image for pasting into a document or into html. Save the .tiff flat image for higher print quality. Save the .pdf file as a document.

Printing Map.

Print from the imported or saved file. Landscape orientation is advised. Printing from the browser also works but is not optimal.


5.5 Advanced Options - Map Generation and Editing

Circle Map - Advanced Options

Map Generation
Retaining the default setting will generate a map with a node count of approximately 25 or fewer nodes. You may raise or lower the node count. Reducing the node count is equivalent to raising an authority threshold: with fewer nodes, only nodes with increasingly higher inlink counts are shown.
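
As a sketch of the idea (with hypothetical inlink counts, not the actual map generator), lowering the node count keeps only the nodes whose inlink counts clear the higher threshold:

    # Hypothetical inlink counts for the actors in a network.
    inlinks = {"ngo1.org": 14, "ngo2.org": 9, "blogX.net": 3, "ngo3.org": 2}

    def top_nodes(inlinks, node_count):
        """Keep the `node_count` nodes with the highest inlink counts."""
        ranked = sorted(inlinks, key=inlinks.get, reverse=True)
        return ranked[:node_count]

    print(top_nodes(inlinks, 2))  # ['ngo1.org', 'ngo2.org']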

Map Editing

You may edit the nodes on your map. You may edit the names of the nodes as well as the colors of the nodes, either by typing in the hex numbers for the colors or by using the color picker. The table allows you to sort the nodes on your map by name, domain and page datestamp.

Cluster Map - Advanced Options

Map Generation

The cluster map advanced options provide data about your network.

Choose nodes to be mapped allows you to choose the number of nodes to be mapped according to a significance measure, that is, the ‘top' nodes according to inlink count per node.

Selection of ties by specificity is the qualitative strength of ties. The network clusters actors with the strongest ties to one another.

Selection of ties by frequency is the quantitative force of ties. The network clusters actors with the greatest quantity of ties between them.

Color scheme by type indicates domain type, e.g., .gov, .co.uk, .gv.at. Color scheme by structural position indicates type of linking behavior, e.g., only gives links, only receives links, gives and receives links.

Size of nodes by inlinks indicates that the size of the node is relative to the number of links received by the site or organization during the crawl.

Size of nodes by centrality indicates that the size of the node is relative to the number of links given and received per cluster.


Map Editing
The advanced options for the cluster map allow you to change the colors as well as the names of the nodes.