fcs.server.task_server

This module contains the implementation of the Task Server.

URL_PACKAGE_TIMEOUT

A package is considered lost after this many seconds.

DATE_FORMAT

Datetime format used throughout the system.

WAIT_FOR_DOWNLOAD_TIME

Time in seconds that the Task Server waits for crawling results to be downloaded before it quits.

KEEP_STATS_SECONDS

Statistics older than this many seconds are removed.

CHECK_EFFICIENCY_PERIOD

Time in seconds over which efficiency statistics are gathered.

class Status

Stores attributes defining the Task Server state.

INIT

Task Server is being initialized.

STARTING

Task Server is starting.

RUNNING

Task Server is running.

PAUSED

Task Server is paused.

STOPPING

Task Server is being stopped.

KILLED

Task Server is killed.

class TaskServer(web_server, task_id, manager_address, max_url_depth=1)

Main class of Task Server, containing its logic.

Parameters:

Database for extracted links. Current implementation: GraphAndBTreeDB.

content_db

Database for content extracted from processed pages. Current implementation: BerkeleyContentDB.

crawlers

Dict of the following format: key - Crawling Unit’s address, value - links to be processed by this Crawling Unit.

Maximum number of unique links that may be crawled during the current task.

expire_date

Expiration date of the given task.

mime_type

List of MIME types of data to be crawled.

uuid

Task Server’s UUID.

whitelist

Regexp with allowed URL form.

blacklist

Regexp with forbidden URL form.
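
The whitelist and blacklist attributes can be pictured as a simple URL filter. The following is only a minimal sketch: the helper name is_url_allowed is hypothetical, and it assumes that patterns are matched from the start of the URL with re.match and that an empty pattern means "no restriction". The module's actual matching rules are not described here.

```python
import re


def is_url_allowed(url: str, whitelist: str, blacklist: str) -> bool:
    """Return True if url matches the whitelist and does not match the blacklist.

    Hypothetical helper: an empty pattern is treated as "no restriction".
    """
    if whitelist and not re.match(whitelist, url):
        return False
    if blacklist and re.match(blacklist, url):
        return False
    return True


# Allow only pages on example.com, but skip PDF files.
allowed = r"http://example\.com/.*"
forbidden = r".*\.pdf$"
print(is_url_allowed("http://example.com/index.html", allowed, forbidden))
print(is_url_allowed("http://example.com/paper.pdf", allowed, forbidden))
```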

urls_per_min

Expected efficiency in URLs per minute. For more details about this speed, see assign_crawlers().

package_cache

Dict of the following format: key - package_id, value - information about packages with links that have been sent to Crawling Unit (time of sending, list of links, Crawling Unit’s address, timeout flag).
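
To make the shape of package_cache concrete, here is a hedged sketch of one possible entry layout and of how URL_PACKAGE_TIMEOUT could be applied to it. The PackageInfo class, its field names, the mark_timed_out helper, and the timeout value of 30 seconds are all assumptions made for illustration; only the four pieces of information per entry come from the description above.

```python
from dataclasses import dataclass

URL_PACKAGE_TIMEOUT = 30  # seconds; illustrative value only


@dataclass
class PackageInfo:
    """One package_cache value (hypothetical field names)."""
    sent_at: float          # time of sending, as a UNIX timestamp
    links: list             # links sent in this package
    crawler_address: str    # Crawling Unit that received the package
    timed_out: bool = False


def mark_timed_out(package_cache: dict, now: float) -> list:
    """Flag packages older than URL_PACKAGE_TIMEOUT and return their IDs."""
    expired = []
    for package_id, info in package_cache.items():
        if not info.timed_out and now - info.sent_at > URL_PACKAGE_TIMEOUT:
            info.timed_out = True
            expired.append(package_id)
    return expired


package_cache = {
    1: PackageInfo(100.0, ["http://example.com/"], "http://10.0.0.1:8800"),
    2: PackageInfo(125.0, ["http://example.org/"], "http://10.0.0.2:8800"),
}
print(mark_timed_out(package_cache, now=140.0))  # [1]
```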

package_id

ID of package with links.

processing_crawlers

List of working Crawling Units assigned to this Task Server.

status

Task Server state, described by Status.

List for statistics - processed links, crawling beginning and end times.

stats_reset_time

Object used to compute the time period over which the server’s efficiency statistics are collected.

assign_crawlers(assignment)

Sets the current crawler assignment. The Task Server can send crawling requests only to these crawlers, and the assignment dict must specify the package size for each crawler. This makes it possible to control the crawling efficiency of all Task Servers.

Parameters: assignment (dict) – Dict of the following format: key - Crawling Unit’s address, value - links to be processed by the given Crawling Unit.
assign_speed(speed)

Sets the Task Server’s crawling speed. Statistics are reset after each speed change.

Parameters: speed (int) – Crawling speed computed as follows: speed = urls_per_min * task.priority / priority_sum, where urls_per_min is defined on the basis of the user’s quota, task.priority is the priority of the given task, and priority_sum is the sum of the priorities of all of the user’s tasks.
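
The speed formula can be checked with a small worked example. This is a sketch, not the module's code: the function name compute_speed is hypothetical, and rounding down to an integer is an assumption based only on speed being documented as an int.

```python
def compute_speed(urls_per_min: int, task_priority: int, priority_sum: int) -> int:
    """Share of the user's URL quota given to one task.

    Implements speed = urls_per_min * task.priority / priority_sum.
    Floor division is an assumption; the rounding rule is not documented.
    """
    if priority_sum <= 0:
        raise ValueError("priority_sum must be positive")
    return urls_per_min * task_priority // priority_sum


# A user with a quota of 600 URLs per minute and three tasks
# with priorities 1, 2 and 3 (priority_sum = 6).
priorities = [1, 2, 3]
speeds = [compute_speed(600, p, sum(priorities)) for p in priorities]
print(speeds)  # [100, 200, 300]
```

The three speeds add up to the full quota, so a higher-priority task receives a proportionally larger share.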
update(data)

Updates crawling parameters and status. It is usually called when task data is changed through the GUI or API.

Parameters: data (dict) – Task description (parameters of the task).
pause()

Pauses the Task Server if it was running.

resume()

Resumes the Task Server if it was paused.
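
The guarded behaviour of pause() and resume() can be sketched as a tiny state machine. The Status member names below are taken from this page, but their values (generated with auto()) and the ServerState class are assumptions for illustration; the real TaskServer does considerably more.

```python
from enum import Enum, auto


class Status(Enum):
    # Member names as documented above; the values are placeholders.
    INIT = auto()
    STARTING = auto()
    RUNNING = auto()
    PAUSED = auto()
    STOPPING = auto()
    KILLED = auto()


class ServerState:
    """Hypothetical stand-in that models only the pause/resume transitions."""

    def __init__(self):
        self.status = Status.INIT

    def pause(self):
        # Pausing has an effect only while the server is running.
        if self.status is Status.RUNNING:
            self.status = Status.PAUSED

    def resume(self):
        # Resuming has an effect only while the server is paused.
        if self.status is Status.PAUSED:
            self.status = Status.RUNNING


state = ServerState()
state.pause()                  # ignored: the server is still in INIT
state.status = Status.RUNNING
state.pause()
print(state.status)            # Status.PAUSED
```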

stop()

Stops the Task Server. A stopped Task Server no longer sends crawling requests. It waits WAIT_FOR_DOWNLOAD_TIME seconds for the user to download the gathered data.

kill()

Kills the Task Server. A Task Server that is to be killed is stopped as soon as possible.

run()

Main Task Server loop.

get_idle_crawlers()

Returns the list of crawlers that are not processing any requests.

Returns: List of idle Crawler Units.
Return type: list
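
One way to picture the result is as the assigned crawlers minus the busy ones. The sketch below is a standalone function with hypothetical arguments, and it assumes that processing_crawlers holds exactly the addresses that are currently handling a request, which the attribute description suggests but does not state outright.

```python
def get_idle_crawlers(crawlers: dict, processing_crawlers: list) -> list:
    """Return addresses from crawlers that are not in processing_crawlers."""
    busy = set(processing_crawlers)
    return [address for address in crawlers if address not in busy]


# Keys are Crawling Unit addresses; the values are not used by this sketch.
crawlers = {
    "http://10.0.0.1:8800": 5,
    "http://10.0.0.2:8800": 5,
    "http://10.0.0.3:8800": 10,
}
print(get_idle_crawlers(crawlers, ["http://10.0.0.2:8800"]))
```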
feedback(link, rating)

Increases priority of specified link and its children.

Parameters:
  • link (string) – Link.
  • rating (string) – Link’s new rating: a number from 1 to 5 cast to a string.

Adds links to process.

Parameters:
  • links (list) – List of links (links are of string type).
  • priority (int) – Links’ priority, can be a number 0-999 (0 is the lowest priority).
  • depth (int) – Depth of crawling for a page from which links have been retrieved.
  • source_url (string) – URL of page from which links have been retrieved.
Raises Exception: in case of an error in the database.

put_data(package_id, data)

Handles a crawled data package received from a crawler and puts it into the content database. If the received package is not in the package cache, or the crawling request has timed out, no data is stored in the database. It also marks the crawler assigned to this crawling request as ‘idle’, so that the next request can be sent to it.

Parameters:
  • package_id (int) – ID of crawled data package (identical to the package ID from crawling request).
  • data (string) – Crawled data package.
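
The decision logic described for put_data can be sketched as follows. This is a simplified, standalone function rather than the real method: the dict keys, the idle_crawlers set, and the content_store list are hypothetical stand-ins. Two behaviours are assumptions not confirmed by the text: the cache entry is removed once handled, and the crawler is marked idle even when its package arrived after the timeout.

```python
def put_data(package_cache, idle_crawlers, content_store, package_id, data):
    """Store data for a known, non-expired package; return True if stored."""
    entry = package_cache.pop(package_id, None)
    if entry is None:
        # Unknown package: nothing is stored and no crawler can be identified.
        return False
    # Assumption: the crawler is freed even if its package arrived too late.
    idle_crawlers.add(entry["crawler_address"])
    if entry["timed_out"]:
        return False
    content_store.append(data)
    return True


cache = {7: {"crawler_address": "http://10.0.0.1:8800", "timed_out": False}}
idle, stored = set(), []
print(put_data(cache, idle, stored, 7, "<html>...</html>"))  # True
print(put_data(cache, idle, stored, 7, "<html>...</html>"))  # False
```

The second call returns False because the package is no longer in the cache, so a duplicate delivery is not stored twice.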
get_data(size)

Returns the path to the file with crawling results.

Parameters: size (int) – Size of the package of requested crawling results.
Returns: Path to the file with crawling results.
Return type: string