fcs.crawler.content_parser

This module contains classes responsible for parsing content acquired by Crawling Units.

class ParserProvider

Provides concrete parser instances.

parsers

Dict in the following format: {content_type: parser_instance}. Stores the concrete parser instance for each supported content type.

static get_parser(content_type)

Returns the parser instance that matches the passed content type.

Parameters: content_type (string) – Type of content to parse (MIME type).
Returns: Instance of a parser that is able to parse content of the given type.
Return type: Parser
Raises: Exception – if an unknown parser type has been requested.
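
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions, not the module's actual source: the `Sketch*` class names are hypothetical stand-ins, and the `text/html` key is assumed from the existence of TextHtmlParser.

```python
class SketchParser:
    """Hypothetical stand-in for the documented Parser superclass."""

    def parse(self, content, url=""):
        raise NotImplementedError


class SketchHtmlParser(SketchParser):
    """Hypothetical stand-in for a concrete parser."""

    def parse(self, content, url=""):
        return [content, []]


class SketchParserProvider:
    # Maps a content type (MIME type) to a parser instance.
    # The "text/html" key is an assumption made for this sketch.
    parsers = {"text/html": SketchHtmlParser()}

    @staticmethod
    def get_parser(content_type):
        try:
            return SketchParserProvider.parsers[content_type]
        except KeyError:
            # The documentation states that a plain Exception is raised.
            raise Exception("Unknown parser type requested: %s" % content_type)
```

Keeping ready-made instances in a class-level dict means callers never construct parsers themselves; supporting a new content type would only require a new dict entry.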

class Parser

Superclass for concrete parser implementations.

parse(content, url="")

This method should contain parsing logic.

Parameters:
  • content (string) – Content to parse.
  • url (string) – URL of the base site whose content is parsed.

Note

The Parser class’s parse method is not implemented and should be overridden by subclasses.
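
A concrete parser would override parse along these lines. This is a minimal sketch: the stand-in base class is assumed to raise NotImplementedError (the documentation only says the method is not implemented), and `PlainTextParser` is a hypothetical subclass that does not exist in the module.

```python
class Parser:
    """Stand-in for the documented superclass (behaviour assumed)."""

    def parse(self, content, url=""):
        raise NotImplementedError("parse() must be overridden by a subclass")


class PlainTextParser(Parser):
    """Hypothetical subclass, shown only to illustrate the override."""

    def parse(self, content, url=""):
        # Return the trimmed text together with the URL it came from.
        return [content.strip(), url]
```

Calling `PlainTextParser().parse("  hello ", "http://example.com")` runs the subclass logic, while calling parse on the bare stand-in `Parser` raises NotImplementedError.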

class TextHtmlParser

Parses HTML content and retrieves links (recognized by the href attribute of <a> and <link> tags).

parse(content, url="")

Parses HTML page content.

Parameters:
  • content (string) – Content to parse.
  • url (string) – URL of the base site whose content is parsed.
Returns: List with 2 elements: the first is the site’s HTML encoded in Base64 format, the second contains the links retrieved from that site.
Return type: list
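
The behaviour described above can be approximated with the Python 3 standard library. This is a sketch rather than the module's implementation: `parse_html_sketch` and `_LinkCollector` are hypothetical names, and both resolving relative links against `url` and encoding the page as UTF-8 before Base64 are assumptions the documentation does not confirm.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects href values from <a> and <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    # Assumption: relative links are resolved against the base URL.
                    self.links.append(urljoin(self.base_url, value))


def parse_html_sketch(content, url=""):
    collector = _LinkCollector(url)
    collector.feed(content)
    collector.close()
    # Assumption: the page is encoded as UTF-8 before Base64 encoding.
    encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
    return [encoded, collector.links]
```

For a page containing `<link href="/style.css">` and `<a href="about.html">`, parsed with `url="http://example.com/"`, the second list element would hold the two links in absolute form, and decoding the first element from Base64 would give back the original HTML.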