fcs.server.link_db¶
This module contains implementations of the link database API. The link database stores information about links that are to be visited or have already been visited by Crawling Units.
- class BaseLinkDB¶
This is a base class for concrete database API implementations.
- is_in_base(link)¶
Checks whether the given link is already in the database.
Parameters: link (string) – Link to search for.
- add_link(link, priority, depth)¶
Adds a link to the database.
Parameters: - link (string) – Link to add.
- priority (int) – Link’s priority.
- depth (int) – Depth of the link in the crawling tree (how the depth is calculated depends on the policy; for details see fcs.server.crawling_depth_policy).
- fetch_time (string) – Time of last link’s processing.
- get_link()¶
Obtains the link with the highest priority.
- change_link_priority(link, priority)¶
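Changes the priority of the given link.
Parameters: - link (string) – Link whose priority is to be changed.
- priority (int) – New priority of the link.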
- get_details(link)¶
Returns details about the given link.
Parameters: link (string) – Link whose details are to be returned.
- close()¶
Closes the database.
- clear()¶
Clears the database content.
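The interface above can be illustrated with a minimal in-memory sketch. It is not part of fcs: the class name InMemoryLinkDB is hypothetical, and it deliberately does not subclass BaseLinkDB so that it runs on its own. It also assumes that a larger priority value means "crawl sooner", that get_link removes the returned link from the queue, and that details come back as a list of three strings; the base class reference above does not state any of these.

```python
class InMemoryLinkDB:
    """Hypothetical stand-in that mirrors the BaseLinkDB method names."""

    def __init__(self):
        self._details = {}  # link -> [priority, fetch_time, depth]
        self._queue = {}    # link -> priority, only links still to visit

    def is_in_base(self, link):
        # A link stays known after it has been handed out by get_link.
        return link in self._details

    def add_link(self, link, priority, depth, fetch_time=""):
        self._details[link] = [priority, fetch_time, depth]
        self._queue[link] = priority

    def get_link(self):
        # Assumption: the largest priority wins. A linear scan keeps the
        # sketch short; a real store would use an ordered structure.
        if not self._queue:
            return None
        link = max(self._queue, key=self._queue.get)
        del self._queue[link]
        return link

    def change_link_priority(self, link, priority):
        self._details[link][0] = priority
        if link in self._queue:
            self._queue[link] = priority

    def get_details(self, link):
        priority, fetch_time, depth = self._details[link]
        return [str(priority), fetch_time, str(depth)]

    def close(self):
        pass  # nothing to release for an in-memory store

    def clear(self):
        self._details.clear()
        self._queue.clear()
```

A caller would typically check is_in_base before add_link to avoid queueing the same URL twice, then loop on get_link until it returns None.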
- class GraphAndBTreeDB(base_name, policy_module)¶
Implementation of the link database API. It is built on a Berkeley DB Btree (accessed through the bsddb3 module) and on Neo4j.
Parameters: - base_name (string) – Name of the database.
- policy_module (AbstractPolicyModule) – Describes established policy (how links should be created, how and when priorities should be modified, etc.).
- FOUND_LINKS_DB¶
Name of the database storing the found_links structure.
- PRIORITY_QUEUE_DB¶
Suffix of the name of the database storing the priority_queue structure.
- found_links¶
Structure holding links and the crawled content of the web sites these links point to. This structure is based on the Neo4j graph database.
- priority_queue_db_name¶
Name of the database storing the priority_queue structure.
- priority_queue¶
Priority queue storing links to be crawled with their priorities. This structure is based on the Berkeley DB Btree.
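One common way to use an ordered key-value store such as a B-tree as a priority queue is to encode the priority into the key, so that byte-wise key order matches priority order. The sketch below shows that general technique only; it is an assumption, not a description of how GraphAndBTreeDB actually lays out its keys. The helper names make_key and split_key and the MAX_PRIORITY bound are hypothetical, and a sorted Python list stands in for the B-tree.

```python
import bisect
import struct

MAX_PRIORITY = 2**32 - 1  # assumed upper bound for this sketch


def make_key(priority, link):
    """Build a key whose byte order puts higher priorities first."""
    # Big-endian unsigned ints compare byte-wise like numbers, so storing
    # MAX_PRIORITY - priority makes the highest priority the smallest key.
    return struct.pack(">I", MAX_PRIORITY - priority) + link.encode("utf-8")


def split_key(key):
    """Recover (priority, link) from a key built by make_key."""
    (inverted,) = struct.unpack(">I", key[:4])
    return MAX_PRIORITY - inverted, key[4:].decode("utf-8")


keys = []  # kept sorted, standing in for the B-tree's key order
for priority, link in [(3, "http://a.example"),
                       (10, "http://b.example"),
                       (7, "http://c.example")]:
    bisect.insort(keys, make_key(priority, link))

# The first key in sorted order belongs to the highest-priority link.
first = split_key(keys[0])
```

With this layout, fetching the next link amounts to reading the first record of the ordered store, and changing a priority means deleting the old key and inserting a new one.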
- is_in_base(link)¶
Checks whether the given link is already in the database.
Parameters: link (string) – Link to search for.
Returns: True if the link is in the database, False otherwise.
Return type: bool
- add_link(link, priority, depth, fetch_time="")¶
Adds the given link to the database.
Parameters: - link (string) – Link to add.
- priority (int) – Link’s priority.
- depth (int) – Depth of the link in the crawling tree (how the depth is calculated depends on the policy; for details see fcs.server.crawling_depth_policy).
- fetch_time (string) – Time of last link’s processing.
- get_link()¶
Obtains the link with the highest priority.
Returns: URL with the highest priority.
Return type: string
- change_link_priority(link, priority)¶
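Changes the priority of the given link.
Parameters: - link (string) – Link whose priority is to be changed.
- priority (int) – New priority of the link.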
- get_details()¶
Returns additional information about the given link.
Returns: List of three strings: priority, fetch date (may be an empty string), and depth in the crawling tree (how the depth is calculated depends on the policy; for details see fcs.server.crawling_depth_policy).
Return type: list
- points(url_a, url_b)¶
Connects two nodes representing URLs in the graph with the relationship: “url_b obtained from page identified with url_a”.
Parameters:
- feedback(link, feedback_rating)¶
Processes the rating sent by a user as feedback and updates the priorities of the given link and its children.
Parameters:
- size()¶
Returns the current size of the priority_queue structure.
Returns: Number of elements in the queue of links to be crawled and their priorities.
Return type: int
- close()¶
Closes the database.
- clear()¶
Closes and removes the database.
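The relationship created by points and the propagation described for feedback can be pictured with a small in-memory graph. This is a sketch under stated assumptions, not the Neo4j-backed implementation: the class TinyLinkGraph and its add method are hypothetical, "children" is taken to mean direct children only, and the update rule (the full rating for the link, half of it for each child) is invented purely for illustration. In GraphAndBTreeDB the real rule comes from the policy module.

```python
from collections import defaultdict


class TinyLinkGraph:
    """Hypothetical in-memory stand-in for the graph of found links."""

    def __init__(self):
        self.children = defaultdict(set)  # url_a -> URLs found on url_a
        self.priority = {}                # url -> current priority

    def add(self, url, priority):
        self.priority[url] = priority

    def points(self, url_a, url_b):
        # Record "url_b obtained from page identified with url_a".
        self.children[url_a].add(url_b)

    def feedback(self, link, feedback_rating):
        # Illustrative rule only: full rating for the link itself,
        # half of it (integer division) for each direct child.
        self.priority[link] += feedback_rating
        for child in self.children.get(link, ()):
            self.priority[child] += feedback_rating // 2


graph = TinyLinkGraph()
graph.add("http://a.example", 10)
graph.add("http://b.example", 4)
graph.add("http://c.example", 4)
graph.points("http://a.example", "http://b.example")
graph.feedback("http://a.example", 6)
```

Here only the rated page and the page discovered from it change priority; the unrelated third page keeps its original value.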