fcs.crawler.content_parser

This module contains classes responsible for parsing content acquired by Crawling Units.

class ParserProvider

Provides concrete parser instances.

parsers: Dict in the following format: {content_type: parser_instance}. Maps each content type to the concrete parser instance that handles it.

static get_parser(content_type)

Returns the parser instance that matches the passed content type.

Parameters:	content_type (string) – Type of content to parse (MIME type).
Returns:	Instance of a parser that is able to parse content of the given type.
Return type:	`Parser`
Raises Exception:	if an unknown parser type has been requested.
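
The lookup described above can be sketched as follows. This is a minimal illustration and not the module's actual source: the stand-in `Parser` and `TextHtmlParser` bodies, the `"text/html"` key, and the error message are assumptions made for the example.

```python
class Parser:
    """Stand-in for the documented Parser superclass (assumed body)."""

    def parse(self, content, url=""):
        raise NotImplementedError


class TextHtmlParser(Parser):
    """Stand-in for the documented HTML parser (placeholder logic only)."""

    def parse(self, content, url=""):
        return [content, []]


class ParserProvider:
    # Maps a MIME type to the parser instance that handles it.
    # The "text/html" key is an assumption for this sketch.
    parsers = {"text/html": TextHtmlParser()}

    @staticmethod
    def get_parser(content_type):
        # Unknown content types raise, as the documentation states.
        if content_type not in ParserProvider.parsers:
            raise Exception("Unknown parser type: %s" % content_type)
        return ParserProvider.parsers[content_type]
```

A caller would then request a parser by MIME type, for example `ParserProvider.get_parser("text/html").parse(html, url)`, and handle the exception for unsupported types.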

class Parser

Superclass for concrete parser implementations.

parse(content, url="")

This method should contain parsing logic.

Parameters:	
	content (string) – Content to parse.
	url (string) – URL of the base site whose content is parsed.

Note

The Parser class’s parse method is not implemented and should be overridden in subclasses.
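
As a rough sketch of what overriding looks like, the example below defines a stand-in base class and one subclass. `PlainTextParser` is hypothetical and not part of the module. The use of `NotImplementedError` and the two-element return value (borrowed from the format documented for `TextHtmlParser`) are also assumptions.

```python
class Parser:
    """Stand-in superclass: parse() is deliberately left unimplemented."""

    def parse(self, content, url=""):
        # Assumed behavior; the real class may signal this differently.
        raise NotImplementedError("subclasses must override parse()")


class PlainTextParser(Parser):
    """Hypothetical concrete parser for plain text (illustration only)."""

    def parse(self, content, url=""):
        # Plain text carries no markup, so no links are extracted.
        return [content, []]
```

Calling `Parser().parse("x")` in this sketch raises, while `PlainTextParser().parse("x")` returns a result.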

class TextHtmlParser

Parses HTML content and retrieves links (recognized by <a href> and <link href> tags).

parse(content, url="")

Parses HTML page content.

Parameters:	
	content (string) – Content to parse.
	url (string) – URL of the base site whose content is parsed.
Returns:	List with 2 elements: the first is the page’s HTML encoded in Base64, the second contains the links retrieved from that page.
Return type:	list
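
One way to produce a result of this shape with only the Python standard library is sketched below. It is not the module's implementation: the function name `parse_html`, the helper class `_LinkCollector`, the UTF-8 encoding step, and the resolution of relative links against `url` are all assumptions made for illustration.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> and <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                # Assumption: relative links are resolved against the base URL.
                self.links.append(urljoin(self.base_url, href))


def parse_html(content, url=""):
    """Return [Base64-encoded HTML, list of links found in the HTML]."""
    collector = _LinkCollector(url)
    collector.feed(content)
    # Assumption: the page text is encoded as UTF-8 before Base64 encoding.
    encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
    return [encoded, collector.links]
```

Tags without an `href` attribute are skipped, so the second element contains only usable link targets.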