The basic API includes the application framework and the routing system (provided by werkzeug.routing) of BrownAnt.
The app that manages the whole crawler system.
Add a URL rule to the app instance.
URL rules use the same syntax as in Flask apps and other Werkzeug apps.
Parameters: |
|
---|
Dispatch the URL string to the target endpoint function.
Parameters: | url_string – the original URL string. |
---|---|
Returns: | the return value of the dispatched function. |
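As a rough illustration of what dispatching means here, the following stand-alone sketch matches a URL against registered rules and returns whatever the matched endpoint returns. It does not import BrownAnt: `ToyApp`, its method signatures, and the use of regular expressions in place of werkzeug.routing rules are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse


class ToyApp:
    """Hypothetical stand-in for the app; not BrownAnt's implementation."""

    def __init__(self):
        # Each entry is (host, compiled path pattern, endpoint function).
        self.rules = []

    def add_url_rule(self, host, pattern, endpoint):
        self.rules.append((host, re.compile(pattern), endpoint))

    def dispatch_url(self, url_string):
        url = urlparse(url_string)
        for host, pattern, endpoint in self.rules:
            match = pattern.fullmatch(url.path)
            if url.hostname == host and match:
                # Named groups become keyword arguments of the endpoint.
                return endpoint(**match.groupdict())
        raise LookupError("no rule matches %r" % url_string)


def show_item(item_id):
    return "item %s" % item_id


app = ToyApp()
app.add_url_rule("www.example.com", r"/items/(?P<item_id>\d+)", show_item)
result = app.dispatch_url("http://www.example.com/items/42")
```

The value of `result` is simply what `show_item` returned, which mirrors the "Returns" entry above.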
Mount a supported site to this app instance.
Parameters: | site – the site instance to be mounted. |
---|
Parse the URL string with the URL map of this app instance.
Parameters: | url_string – the original URL string. |
---|---|
Returns: | a tuple (url, url_adapter, query_args): url is the result of parsing with the standard library urlparse, url_adapter is the adapter bound from the werkzeug URL map, and query_args is a werkzeug multidict. |
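The first and last elements of that tuple can be approximated with the standard library alone. The sketch below is only an approximation: it leaves out the werkzeug URL adapter, and `parse_qs` returns a plain dict of lists rather than a werkzeug multidict. The example URL is made up.

```python
from urllib.parse import urlparse, parse_qs

url_string = "http://www.example.com/items/42?page=2&tag=a&tag=b"

# Split the URL into scheme, host, path, query and so on.
url = urlparse(url_string)

# Collect the query arguments; repeated keys keep all of their values.
query_args = parse_qs(url.query)
```

Here `url.hostname` and `url.path` are the parts a URL map would match against, while `query_args` carries the remaining arguments.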
The crawling request object.
Parameters: |
|
---|
The supported-site object, which can be mounted to an app instance.
Parameters: | name – the name of the supported site. |
---|
Play the recorded actions on the target object.
Parameters: | target (brownant.site.Site) – the target which receives all recorded actions; normally a BrownAnt app instance. |
---|
Record a method-calling action.
The recorded actions are expected to be played on a target object.
Parameters: |
|
---|
The decorator that registers the wrapped function with the BrownAnt app.
The parameters of this method are compatible with the BrownAnt.add_url_rule() method.
Parameters: |
|
---|
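The record-then-play idea described for the site object can be sketched without BrownAnt. Everything below is hypothetical: `ToySite` and `ToyApp`, the argument order of `add_url_rule`, and the choice to pass the function itself as the endpoint are assumptions for illustration only.

```python
class ToySite:
    """Hypothetical stand-in showing the record-then-play pattern."""

    def __init__(self, name):
        self.name = name
        self.actions = []  # recorded (method_name, args, kwargs) tuples

    def record_action(self, method_name, *args, **kwargs):
        self.actions.append((method_name, args, kwargs))

    def play_actions(self, target):
        # Replay every recorded call on the target, in recording order.
        for method_name, args, kwargs in self.actions:
            getattr(target, method_name)(*args, **kwargs)

    def route(self, *args, **kwargs):
        def decorator(func):
            # Defer the registration until the site is mounted on an app.
            self.record_action("add_url_rule", *args, endpoint=func, **kwargs)
            return func
        return decorator


class ToyApp:
    def __init__(self):
        self.rules = []

    def add_url_rule(self, host, rule, endpoint):
        self.rules.append((host, rule, endpoint))

    def mount_site(self, site):
        site.play_actions(target=self)


site = ToySite("example")


@site.route("www.example.com", "/items/<int:item_id>")
def show_item(item_id):
    return item_id


app = ToyApp()
app.mount_site(site)
```

Because the site only records calls, it can be defined before any app exists and mounted on an app later.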
The declarative API is built around the “dinergate” and the “pipeline property”.
The simple classify crawler.
To work with unnamed properties, such as instances of brownant.pipeline.base.PipelineProperty, the metaclass brownant.dinergate.DinergateType scans subclasses of this class and names all unnamed members that are instances of werkzeug.utils.cached_property.
Parameters: |
|
---|
The URL template string used to generate the crawl target. The template may reference self, e.g. “http://www.example.com/items/{self.item_id}?page={self.page}”.
The fetching target URL.
By default, this property builds the URL string from URL_TEMPLATE.
Subclasses can override URL_TEMPLATE or provide a different implementation of this property.
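A template that references self can be filled in with `str.format`. The sketch below shows that mechanism on a hypothetical class (`ToyItemPage` is not part of BrownAnt), assuming the default behavior amounts to formatting the template with the instance.

```python
class ToyItemPage:
    """Hypothetical class; only the URL_TEMPLATE idea comes from the docs."""

    URL_TEMPLATE = "http://www.example.com/items/{self.item_id}?page={self.page}"

    def __init__(self, item_id, page):
        self.item_id = item_id
        self.page = page

    @property
    def url(self):
        # str.format resolves attribute lookups such as {self.item_id}.
        return self.URL_TEMPLATE.format(self=self)


page = ToyItemPage(item_id=42, page=2)
target_url = page.url
```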
Bases: type
The metaclass of Dinergate and its subclasses.
This metaclass gives default names to all members that are instances of cached_property. This is needed because many pipeline properties are subclasses of cached_property but are not created by decorating functions. They therefore have no built-in __name__, which may prevent them from caching values as expected.
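The naming step can be illustrated with a small self-contained sketch. `ToyCachedProperty` and `NamingType` are hypothetical stand-ins for werkzeug's cached_property and for DinergateType; the sketch assumes the cache key is the property's `__name__`.

```python
class ToyCachedProperty:
    """Minimal cached property that stores its value under self.__name__."""

    def __init__(self, func, name=None):
        self.func = func
        self.__name__ = name  # stays None when no name is supplied

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if self.__name__ not in obj.__dict__:
            obj.__dict__[self.__name__] = self.func(obj)
        return obj.__dict__[self.__name__]


class NamingType(type):
    """Give every unnamed ToyCachedProperty the name it is bound to."""

    def __new__(mcs, name, bases, namespace):
        for attr_name, member in namespace.items():
            if isinstance(member, ToyCachedProperty) and member.__name__ is None:
                member.__name__ = attr_name
        return super().__new__(mcs, name, bases, namespace)


calls = []


class Page(metaclass=NamingType):
    # Created without a decorator, so it has no name until the metaclass runs.
    answer = ToyCachedProperty(lambda self: calls.append(1) or 42)


page = Page()
first, second = page.answer, page.answer
```

After the first access the value lives in the instance dictionary under the assigned name, so the wrapped function runs only once.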
Bases: werkzeug.utils.cached_property
The base class of pipeline properties.
There are three kinds of initial parameters.
A working subclass of PipelineProperty should implement provide_value(self, obj), which accepts one argument, the instance of Dinergate.
Implementing prepare(self) is optional in subclasses.
Parameters: | kwargs – the parameters of the three kinds. |
---|
the definition of attr_names
Get attribute of the target object with the configured attribute name in the attr_names of this instance.
Parameters: |
|
---|
the definition of options
This method is called after the instance is initialized. Subclasses can override the implementation.
In general, an implementation of this method should give default values to options and to the members of attr_names.
Example:
    def prepare(self):
        self.attr_names.setdefault("text_attr", "text")
        self.options.setdefault("use_proxy", False)
the names of required attributes.
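How prepare(), provide_value(), attr_names and options fit together can be sketched as follows. This is a stand-alone illustration, not BrownAnt's code. In particular, the rule used to sort keyword arguments (names listed in required_attrs become instance members, names ending in `_attr` go to attr_names, everything else goes to options) is an assumption, and caching is left out for brevity.

```python
class ToyPipelineProperty:
    """Hypothetical stand-in; the way kwargs are split is an assumption."""

    required_attrs = ()

    def __init__(self, **kwargs):
        self.attr_names = {}
        self.options = {}
        for name, value in kwargs.items():
            if name in self.required_attrs:
                setattr(self, name, value)      # required attribute
            elif name.endswith("_attr"):
                self.attr_names[name] = value   # attribute name on the owner
            else:
                self.options[name] = value      # anything else is an option
        missing = [n for n in self.required_attrs if not hasattr(self, n)]
        if missing:
            raise TypeError("missing required attributes: %s" % missing)
        self.prepare()

    def prepare(self):
        """Optional hook for filling in defaults."""

    def get_attr(self, obj, name):
        # Look up the owner's attribute whose name is configured in attr_names.
        return getattr(obj, self.attr_names[name])

    def provide_value(self, obj):
        raise NotImplementedError

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return self.provide_value(obj)


class UpperText(ToyPipelineProperty):
    def prepare(self):
        self.attr_names.setdefault("text_attr", "text")
        self.options.setdefault("strip", False)

    def provide_value(self, obj):
        text = self.get_attr(obj, "text_attr").upper()
        return text.strip() if self.options["strip"] else text


class Page:
    text = " hello "
    shout = UpperText(strip=True)


value = Page().shout
```

The subclass only states its defaults and how to compute a value; the base class handles the keyword arguments.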
The query argument property.
Parameters: |
|
---|
The text response returned by fetching a network resource.
Parameters: |
|
---|
The element tree built from a raw HTML property.
Parameters: | text_response_attr – optional. default: “text_response”. |
---|
The text extracted from an element tree property by XPath.
Parameters: |
|
---|
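The chain from text response to element tree to extracted text can be sketched with the standard library. This is only an approximation of the idea: `xml.etree.ElementTree` is used here as a stand-in parser, it needs well-formed markup and supports only a subset of XPath, and the library's own parser may differ. The markup and variable names are made up.

```python
import xml.etree.ElementTree as ET

# Stand-in for a fetched text response (well-formed markup is assumed).
text_response = "<html><body><h1 id='title'>Hello</h1><p>World</p></body></html>"

# Build the element tree from the raw text.
etree = ET.fromstring(text_response)

# Extract the text of the first element matched by the path expression.
title = etree.findtext(".//h1[@id='title']")
```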