FCS

fcs.server.link_db

This module contains implementations of the link database API. The link database stores information about links that are waiting to be visited or have already been visited by Crawling Units.

class BaseLinkDB

This is a base class for concrete database API implementations.

is_in_base(link)

Checks whether the given link is already in the database.

Parameters: link (string) – Link to look up.

add_link(link, priority, depth)

Adds a link to the database.

Parameters:
  • link (string) – Link to add.
  • priority (int) – Link’s priority.
  • depth (int) – Depth of the link in the crawling tree (how the depth is calculated depends on the policy; for details see fcs.server.crawling_depth_policy).
  • fetch_time (string) – Time at which the link was last processed.

set_as_fetched(link)

Records the time at which processing of the page ended.

Parameters: link (string) – URL of the processed page.

get_link()

Obtains one link with the highest priority.

change_link_priority(link, priority)

Changes the priority of a link.

Parameters:
  • link (string) – Page address.
  • priority (int) – New priority.

get_details(link)

Returns details about the given link.

Parameters: link (string) – Link whose details are requested.

close()

Closes the database.

clear()

Clears the database content.
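
The interface above can be illustrated with a minimal in-memory sketch. It is not part of FCS: the class name InMemoryLinkDB, the dictionary layout, the timestamp format, and the rule that a larger number means a higher priority are all assumptions made for illustration. The class does not inherit from BaseLinkDB so that the example runs on its own.

```python
import time


class InMemoryLinkDB:
    """Hypothetical stand-in that mirrors the BaseLinkDB method names."""

    def __init__(self):
        # link -> [priority, fetch_time, depth]
        self._links = {}

    def is_in_base(self, link):
        return link in self._links

    def add_link(self, link, priority, depth, fetch_time=""):
        self._links[link] = [priority, fetch_time, depth]

    def set_as_fetched(self, link):
        # Record when processing of the page ended (format is an assumption).
        self._links[link][1] = time.strftime("%Y-%m-%d %H:%M:%S")

    def get_link(self):
        # Assumption: larger number = higher priority; fetched links are skipped.
        pending = [url for url, data in self._links.items() if data[1] == ""]
        if not pending:
            return None
        return max(pending, key=lambda url: self._links[url][0])

    def change_link_priority(self, link, priority):
        self._links[link][0] = priority

    def get_details(self, link):
        return list(self._links[link])

    def close(self):
        pass  # nothing to release for an in-memory store

    def clear(self):
        self._links.clear()


db = InMemoryLinkDB()
db.add_link("http://example.com/a", priority=3, depth=0)
db.add_link("http://example.com/b", priority=7, depth=1)
first = db.get_link()      # link with the highest priority
db.set_as_fetched(first)
second = db.get_link()     # next pending link
```

A concrete implementation would replace the dictionary with persistent storage while keeping the same method names.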

class GraphAndBTreeDB(base_name, policy_module)

Implementation of the link database API. It is based on a Berkeley DB B-tree (accessed through the bsddb3 module) and on Neo4j.

Parameters:
  • base_name (string) – Name of the database.
  • policy_module (AbstractPolicyModule) – Describes the policy in force (how links should be created, how and when priorities should be modified, etc.).

FOUND_LINKS_DB

Name of the database storing the found_links structure.

PRIORITY_QUEUE_DB

Suffix of the name of the database storing the priority_queue structure.

found_links

Structure holding links and the crawled content of the web sites they point to. It is based on the Neo4j graph database.

priority_queue_db_name

Name of the database storing the priority_queue structure.

priority_queue

Priority queue storing the links to be crawled together with their priorities. It is based on a Berkeley DB B-tree.
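
A B-tree keeps its keys in sorted order, which is what allows it to act as a priority queue. The sketch below imitates that idea with a sorted Python list rather than Berkeley DB. The class name SortedKeyQueue and the key layout (negated priority, then URL) are assumptions for illustration and are not taken from the FCS sources.

```python
import bisect


class SortedKeyQueue:
    """Hypothetical queue kept in key order, as a B-tree keeps its keys."""

    def __init__(self):
        self._keys = []  # sorted list of (-priority, url)

    def put(self, url, priority):
        # Negating the priority makes the highest priority sort first.
        bisect.insort(self._keys, (-priority, url))

    def pop_highest(self):
        neg_priority, url = self._keys.pop(0)
        return url, -neg_priority

    def __len__(self):
        return len(self._keys)


queue = SortedKeyQueue()
queue.put("http://example.com/low", 1)
queue.put("http://example.com/high", 9)
queue.put("http://example.com/mid", 5)
order = [queue.pop_highest()[0] for _ in range(len(queue))]
```

With this key layout, reading the first key always yields the link with the highest priority, and ties are broken by URL.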

is_in_base(link)

Checks whether the given link is already in the database.

Parameters: link (string) – Link to look up.
Returns: Whether the link is in the database.
Return type: bool

add_link(link, priority, depth, fetch_time="")

Adds the given link to the database.

Parameters:
  • link (string) – Link to add.
  • priority (int) – Link’s priority.
  • depth (int) – Depth of the link in the crawling tree (how the depth is calculated depends on the policy; for details see fcs.server.crawling_depth_policy).
  • fetch_time (string) – Time at which the link was last processed.

set_as_fetched(link)

Records the time at which processing of the page ended.

Parameters: link (string) – URL of the processed page.

get_link()

Obtains one link with the highest priority.

Returns: URL with the highest priority.
Return type: string

change_link_priority(link, priority)

Changes the priority of a link.

Parameters:
  • link (string) – URL.
  • priority (int) – Link’s new priority.

get_details()

Returns additional information about the given link.

Returns: List of 3 strings: priority, fetch date (may be an empty string) and depth in the crawling tree (how the depth is calculated depends on the policy; for details see fcs.server.crawling_depth_policy).
Return type: list

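
Because the details come back as a list of three strings, callers will usually convert them before use. The following is a small sketch of such a conversion. The helper parse_details and the LinkDetails tuple are hypothetical names, the sample values are made up, and it is assumed that priority and depth are stored as decimal integers. The fetch date is left as a string because its format is not stated here.

```python
from collections import namedtuple

LinkDetails = namedtuple("LinkDetails", ["priority", "fetch_time", "depth"])


def parse_details(raw):
    """Convert [priority, fetch date, depth] strings into typed fields."""
    priority, fetch_time, depth = raw
    # An empty fetch date is mapped to None: the link was not processed yet.
    return LinkDetails(int(priority), fetch_time or None, int(depth))


details = parse_details(["5", "", "2"])
```
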
points(url_a, url_b)

Connects the graph nodes representing two URLs with the relationship “url_b obtained from page identified with url_a”.

Parameters:
  • url_a (string) – Parent URL.
  • url_b (string) – Child URL.

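
The parent-to-child relationship can be pictured as a directed edge from url_a to url_b. The sketch below models it with a plain dictionary instead of Neo4j. The LinkGraph class and its children() helper are hypothetical and serve only to show the direction of the edge.

```python
from collections import defaultdict


class LinkGraph:
    """Hypothetical in-memory graph of "child obtained from parent" edges."""

    def __init__(self):
        self._children = defaultdict(set)

    def points(self, url_a, url_b):
        # Directed edge: url_b was found on the page identified by url_a.
        self._children[url_a].add(url_b)

    def children(self, url):
        return sorted(self._children.get(url, ()))


graph = LinkGraph()
graph.points("http://example.com/", "http://example.com/about")
graph.points("http://example.com/", "http://example.com/news")
kids = graph.children("http://example.com/")
```
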
feedback(link, feedback_rating)

Processes the rating sent by a user in feedback and updates the priorities of the given link and its children.

Parameters:
  • link (string) – URL that the feedback rating refers to.
  • feedback_rating (int) – Rating of the URL sent in feedback.

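
How a rating translates into new priorities is not specified here; in FCS this is presumably decided by the policy module. The sketch below only illustrates the general shape of such an update under an assumed rule: the rated link receives the full rating and each direct child receives half of it. The function apply_feedback and both dictionaries are hypothetical.

```python
def apply_feedback(priorities, children, link, feedback_rating):
    """Return new priorities after rating `link`.

    Assumed rule (not taken from FCS): the rated link gets the full rating
    added to its priority, each direct child gets half of it.
    """
    updated = dict(priorities)
    updated[link] = updated[link] + feedback_rating
    for child in children.get(link, ()):
        if child in updated:
            updated[child] += feedback_rating // 2
    return updated


priorities = {
    "http://example.com/": 10,
    "http://example.com/a": 4,
    "http://example.com/b": 6,
}
children = {"http://example.com/": ["http://example.com/a"]}
result = apply_feedback(priorities, children, "http://example.com/", 4)
```

Returning a new dictionary instead of mutating the input keeps the example easy to check; a database-backed version would write the new values through change_link_priority.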
size()

Returns the current size of the priority_queue structure.

Returns: Number of elements in the queue of links to be crawled (with their priorities).
Return type: int

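
One way size(), get_link() and set_as_fetched() might be combined is a loop that keeps fetching until the queue is empty. The sketch below is written against any object exposing those three methods. StubLinkDB is a hypothetical stand-in, and it assumes that get_link() removes the returned link from the queue, which is not confirmed by this page.

```python
def drain(link_db, fetch):
    """Fetch links until the queue reports size 0; return the visit order."""
    visited = []
    while link_db.size() > 0:
        link = link_db.get_link()
        fetch(link)
        link_db.set_as_fetched(link)
        visited.append(link)
    return visited


class StubLinkDB:
    """Hypothetical stand-in; assumes get_link() removes the link it returns."""

    def __init__(self, links):
        # Highest priority first (assumption: larger number = higher priority).
        self._queue = sorted(links, key=lambda item: item[1], reverse=True)
        self.fetched = []

    def size(self):
        return len(self._queue)

    def get_link(self):
        return self._queue.pop(0)[0]

    def set_as_fetched(self, link):
        self.fetched.append(link)


stub = StubLinkDB([("http://example.com/a", 2), ("http://example.com/b", 8)])
visited = drain(stub, fetch=lambda url: None)
```
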
close()

Closes the database.

clear()

Closes and removes the database.


© Copyright 2014, AGH-GLK.
