fcs.manager.models

This module contains the model layer, which implements the system’s units as the following object-relational mapping classes:

class UserManager

Provides methods for creating a regular user together with their Quota, or a superuser.

create_user(username, email, password)

Creates a regular FCS user with a default Quota.

Parameters:
  • username (string) – New user’s name.
  • email (string) – New user’s email address.
  • password (string) – New user’s password.
Returns:

New user

Return type:

User

create_superuser(username, email, password)

Creates an FCS superuser who can use the admin panel.

Parameters:
  • username (string) – New user’s name.
  • email (string) – New user’s email address.
  • password (string) – New user’s password.
Returns:

New superuser

Return type:

User

class User

FCS user class. Extends django.contrib.auth.models.AbstractUser.

Note

Username, password and email are required. Other fields are optional.

class Quota

Represents limits on task creation. Each User object is linked to a personal quota.

max_priority

Maximum allowed sum of tasks’ priorities.

max_tasks

Maximum allowed number of running tasks.

Maximum allowed total number of links processed across all tasks.

Maximum allowed number of links processed per task.

urls_per_min

Expected total crawling speed, in URLs per minute. Used by the efficiency estimation module and autoscaling.

user

Quota’s owner.

class QuotaException

Raised when a user exceeds the limits defined by their Quota object.
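The quota rules described above can be illustrated with a minimal, self-contained sketch. It does not use the real Django models: the Quota, Task and QuotaException classes below are simplified stand-ins, and check_quota is a hypothetical helper. Two assumptions are made here that the documentation does not confirm: every unfinished task counts toward max_tasks, and only active (not paused) tasks count toward max_priority.

```python
from dataclasses import dataclass
from typing import Iterable


class QuotaException(Exception):
    """Stand-in for fcs.manager.models.QuotaException."""


@dataclass
class Quota:
    """Simplified stand-in holding only two of the documented limits."""
    max_priority: int
    max_tasks: int


@dataclass
class Task:
    """Simplified stand-in for a user's task."""
    priority: int
    active: bool = True
    finished: bool = False


def check_quota(quota: Quota, tasks: Iterable[Task], new_priority: int) -> None:
    """Raise QuotaException if one more active task would exceed the quota."""
    unfinished = [t for t in tasks if not t.finished]
    # Assumption: every unfinished task, paused or not, counts toward max_tasks.
    if len(unfinished) + 1 > quota.max_tasks:
        raise QuotaException("task limit exceeded")
    # Assumption: only active tasks hold priority; paused tasks release it.
    used = sum(t.priority for t in unfinished if t.active)
    if used + new_priority > quota.max_priority:
        raise QuotaException("priority limit exceeded")
```

With Quota(max_priority=10, max_tasks=2) and one active task of priority 4, a new task of priority 6 passes the check, while a new task of priority 7 raises QuotaException.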

class TaskManager

Manages the creation of Task objects.

static create_task(user, name, priority, expire, start_links, whitelist='*', blacklist='', max_links=1000, mime_type='text/html')

Creates and returns a new task.

Parameters:
  • user (string) – User’s name.
  • name (string) – New task’s name.
  • priority (int) – Task priority.
  • expire (datetime) – Task expiration date.
  • start_links (string) – List of pages where the crawler starts its work.
  • whitelist (string) – Allowed URLs as regexp list.
  • blacklist (string) – Disallowed URLs as regexp list.
  • max_links (int) – Maximum allowed number of processed pages.
  • mime_type (string) – List of allowed MIME types.
Returns:

New task

Return type:

Task

Raises:
  • QuotaException – If the user’s quota is exceeded.

class Crawler

Represents Crawling Unit.

address

Crawling unit’s address.

uuid

Crawling unit’s UUID.

is_alive()

Checks whether the crawler responds to requests.

Returns:

Whether the crawler is alive

Return type:

bool

stop()

Sends stop request to crawler.

Note

If the crawler doesn’t respond, this object will be deleted.

kill()

Sends kill request to crawler.

Note

If the crawler doesn’t respond, this object will be deleted.
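The behaviour in the notes for stop() and kill() can be sketched as follows. This is only an illustration under stated assumptions: CrawlerSketch, its transport callable and the deleted flag are hypothetical stand-ins for the real model, its send() method and a database delete. The request paths are taken from the send() documentation, and mapping is_alive() to '/alive' is an assumption.

```python
class CrawlerSketch:
    """Hypothetical stand-in for the Crawler model (no database involved)."""

    def __init__(self, transport):
        # transport(path) returns a response object, or None when the
        # crawler cannot be reached.
        self._transport = transport
        self.deleted = False

    def delete(self):
        # The real model would remove its database row here.
        self.deleted = True

    def _request_or_delete(self, path):
        response = self._transport(path)
        if response is None:
            # No answer: drop the record of the unreachable crawler.
            self.delete()
        return response

    def is_alive(self):
        # Assumption: liveness is probed through the '/alive' request.
        return self._transport("/alive") is not None

    def stop(self):
        return self._request_or_delete("/stop")

    def kill(self):
        return self._request_or_delete("/kill")
```

For example, CrawlerSketch(lambda path: None).kill() leaves the object with deleted set to True, because the transport reported no response.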

send(path, method='get', data=None)

Sends request to crawler.

Parameters:
  • path (string) – Request name, may be one of the following: ‘/put_links’, ‘/kill’, ‘/stop’, ‘/alive’, ‘/stats’.
  • method (string) – Method of request, acceptable values are ‘get’ or ‘post’.
  • data (dict) – Dictionary of request parameters (sent as JSON). The parameters of each request are described in the fcs.crawler.web_interface documentation.
Returns:

Response or None if connection cannot be established

Return type:

requests.Response or None
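The contract of send() (a response on success, None when no connection can be made) can be sketched as below. This is not the FCS implementation: the documented method returns a requests.Response, whereas this sketch uses only the standard library so that it runs without extra packages. The 'host:port' address format, the timeout and the injectable opener parameter are assumptions added for illustration.

```python
import json
import urllib.request


def send(address, path, method="get", data=None, timeout=5,
         opener=urllib.request.urlopen):
    """Send a request to http://<address><path>; return None if unreachable."""
    method = method.lower()
    if method not in ("get", "post"):
        raise ValueError("method must be 'get' or 'post'")
    body = None
    headers = {}
    if method == "post":
        # Parameters travel as a JSON document in the request body.
        body = json.dumps(data or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    request = urllib.request.Request(
        "http://%s%s" % (address, path),
        data=body, headers=headers, method=method.upper())
    try:
        return opener(request, timeout=timeout)
    except OSError:
        # URLError and socket errors are OSError subclasses. Note that
        # urlopen also raises HTTPError (an OSError) for 4xx/5xx replies,
        # so those are mapped to None in this sketch as well.
        return None
```

Returning None instead of raising lets callers such as is_alive(), stop() and kill() treat an unreachable unit as an ordinary outcome.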

class TaskServer

Represents server which executes crawling tasks.

address

Task Server’s address.

urls_per_min

Task Server’s speed.

uuid

Task Server’s UUID.

is_alive()

Checks whether the Task Server responds to requests.

Returns:

Whether the Task Server is alive

Return type:

bool

kill()

Sends kill request to Task Server.

Note

If the server doesn’t respond, this object will be deleted.

send(path, method='get', data=None)

Sends request to Task Server.

Parameters:
  • path (string) – Request name, may be one of the following: ‘/put_links’, ‘/kill’, ‘/stop’, ‘/alive’, ‘/stats’.
  • method (string) – Method of request, acceptable values are ‘get’ or ‘post’.
  • data (dict) – Dictionary of request parameters (sent as JSON). The parameters of each request are described in the fcs.server.web_interface documentation.
Returns:

Response or None if connection cannot be established

Return type:

requests.Response or None

delete()

Deletes this Task Server.

class Task

Represents crawling task defined by user.

user

User that owns this task.

name

Task’s name.

priority

Task’s priority.

Starting point of crawling.

whitelist

URLs which should be crawled (in regex format).

blacklist

URLs which should not be crawled (in regex format).
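How the whitelist and blacklist could work together as a URL filter is sketched below. The sketch assumes that both lists are already in Python regex format (as returned by get_parsed_whitelist() and get_parsed_blacklist()), that patterns are matched from the start of the URL, and that the blacklist takes precedence. The is_allowed helper is hypothetical, and none of these details are confirmed by this documentation.

```python
import re


def is_allowed(url, whitelist, blacklist):
    """Return True if url matches a whitelist pattern and no blacklist pattern."""
    # Assumption: a blacklist hit always wins over a whitelist hit.
    if any(re.match(pattern, url) for pattern in blacklist):
        return False
    return any(re.match(pattern, url) for pattern in whitelist)


whitelist = [r"https?://example\.com/.*"]
blacklist = [r".*\.pdf$"]

print(is_allowed("http://example.com/index.html", whitelist, blacklist))
print(is_allowed("http://example.com/report.pdf", whitelist, blacklist))
```

The first call prints True and the second prints False, because the PDF link is whitelisted by host but rejected by the blacklist.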

Maximum number of links that may be visited while crawling.

expire_date

Datetime of task expiration.

mime_type

MIME types which are to be crawled.

active

Boolean value. If true, the task is running; otherwise it is paused.

finished

Boolean value. If true, the task is finished; otherwise it is running or paused.
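The way the active and finished flags combine into a single task state can be summarized in a few lines. The task_state helper and the state names are illustrative only and are not part of the module.

```python
def task_state(active, finished):
    """Map the two documented flags to one descriptive state name."""
    if finished:
        # A finished task is neither running nor paused, whatever `active` says.
        return "finished"
    return "running" if active else "paused"
```

Here task_state(True, False) gives "running", task_state(False, False) gives "paused", and any task with finished set gives "finished".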

created

Datetime of task creation.

last_data_download

Time of last crawled data collection.

server

Task Server that handles this task.

last_server_spawn

Time when a server was last spawned to handle this task.

autoscale_change

Boolean value indicating whether any of the task’s parameters has been modified. If the value is true, the Task Server has to be informed of the change.

clean()

Cleans task’s data. Validates new task’s fields before save operation.

Raises:
  • ValidationError – If task’s parameters cannot be validated
  • QuotaException – If user’s quota has been exceeded

save(*args, **kwargs)

Saves the task in the database and sends information about modifications to its Task Server.

get_parsed_whitelist()

Returns whitelist converted from user-friendly regex to python regex.

Returns:

Whitelist in Python regex format

Return type:

list

get_parsed_blacklist()

Returns blacklist converted from user-friendly regex to python regex.

Returns:

Blacklist in Python regex format

Return type:

list

change_priority(priority)

Sets task priority.

Note

A task with higher priority crawls more links at a time than tasks with lower priority.

Parameters:
  • priority (int) – Task’s new priority.
Raises:
  • QuotaException – If the task priority exceeds the quota of the user who owns this task.

pause()

Pauses task.

Note

A paused task does not crawl any links until it is resumed. Pausing temporarily releases the resources used by the task (such as priority).

resume()

Resumes the task: it becomes active again and can crawl links.

Raises:
  • QuotaException – If the user does not have enough free priority to run this task. In that case, the user should decrease the priority of this task or of another active task.

stop()

Marks task as finished.

Note

Finished tasks cannot be resumed and do not count toward the user’s max_tasks quota. After some time, the task’s Task Server will be shut down and the crawling results will be lost.

is_waiting_for_server()

Checks if running task has no Task Server assigned. This case includes waiting until new Task Server starts.

Returns:

Whether this task has no Task Server assigned

Return type:

bool

feedback(link, rating)

Processes feedback from client in order to update crawling process to satisfy client expectations.

Parameters:
  • link (string) – Rated link
  • rating (string) – Rating as a number in the range 1 to 5

send_update_to_task_server()

Sends information about task update to its Task Server.

create_api_keys(sender, **kwargs)

Creates the Application object required for working with the REST API.

Parameters:
  • sender (string) – Signal sender. This parameter is not used here; details of the signal mechanism can be found in the Django documentation.

class MailSent

Represents an email sent to a user as a reminder to collect crawling data that is waiting for them.

tasks

List of tasks related to uncollected data.

date

Date the mail was sent.