...

aext is a tool used to extract full XML documents out of CLAIMS Direct. It is installed as part of the CLAIMS Direct repository. Please see the Client Tools Installation Instructions for more information about how to install this tool.

...

Detailed Description of the Parameters

Connectivity

Parameter: Description
pgdbname: The database entry, as configured in /etc/alexandria.xml, that points to the on-site CLAIMS Direct PostgreSQL instance. The default value is alexandria, which is pre-configured in /etc/alexandria.xml.
solrurl: Available only with the optional on-site SOLR installation. This is the URL of the standalone CLAIMS Direct SOLR instance or, if one is used, the URL of the load balancer. Although a default value exists, this parameter is mandatory whenever --solrq is specified.

Source

The following parameters determine the source criteria for extracting CLAIMS Direct XML. Only one may be specified.

Parameter: Description
loadid: The modified_load_id of the table xml.t_patent_document_values. Please see the documentation on content updates describing the various load-ids.
table: The name of a user-created table with a minimum required column publication_id.
sqlq: Any raw SQL that returns one or more publication_id values.
solrq: Any raw SOLR query.
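
The one-source rule above, together with the --solrurl requirement for --solrq, can be sketched as a small argument builder. This is only an illustration: the function name source_args is hypothetical, the --name=value spelling is assumed from the --loadid and --dbfunc usage elsewhere on this page, and the sketch assumes that exactly one source must be given.

```python
def source_args(loadid=None, table=None, sqlq=None, solrq=None, solrurl=None):
    """Build the source portion of an aext command line (hypothetical helper)."""
    candidates = {"loadid": loadid, "table": table, "sqlq": sqlq, "solrq": solrq}
    chosen = {name: value for name, value in candidates.items() if value is not None}

    # Only one source criterion may be specified (assumed here: exactly one).
    if len(chosen) != 1:
        raise ValueError("specify exactly one of: " + ", ".join(candidates))

    # --solrq needs an explicit SOLR URL.
    if solrq is not None and solrurl is None:
        raise ValueError("--solrurl is required when --solrq is used")

    name, value = next(iter(chosen.items()))
    args = [f"--{name}={value}"]
    if solrurl is not None:
        args.append(f"--solrurl={solrurl}")
    return args
```

Calling source_args(loadid=261358) yields a single --loadid argument, while passing both loadid and sqlq raises a ValueError before aext is ever invoked.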

Extract Naming and Destination

Parameter: Description
root: The output location of either the batches or, if --archive is specified, the root directory for files in the predictable path structure. The default is the current working directory.
prefix: The standard extract is run in batches. This parameter specifies the prefix for each output file. The default is batch.
archive: Archive the XML into a predictable path structure. The structure is as follows:

<root>/<country>/<kind>/nnnnnn/nn/nn/nn/<ucid>.xml

Where:
root: the destination specified with the --root parameter
country: the country of publication
kind: the kind code of the publication
nnnnnnnnnnnn: the 12-digit, zero-padded publication number, split into path segments of 6, 2, 2, and 2 digits
ucid.xml: the full XML file of the publication, named after its ucid

For example:
./DE/A1/102008/03/79/61/DE-102008037961-A1.xml
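
As a rough sketch of how the archive path can be derived, the following assumes that a ucid has the form country-number-kind, separated by hyphens, as in the example above. The function name archive_path is hypothetical and is not part of aext.

```python
import posixpath


def archive_path(root, ucid):
    """Derive the --archive path for a ucid such as DE-102008037961-A1."""
    # Assumption: the ucid splits into exactly country, number, and kind.
    country, number, kind = ucid.split("-")

    # Zero-pad the publication number to 12 digits, then split it 6/2/2/2.
    padded = number.zfill(12)
    parts = [padded[:6], padded[6:8], padded[8:10], padded[10:12]]

    return posixpath.join(root, country, kind, *parts, ucid + ".xml")


print(archive_path(".", "DE-102008037961-A1"))
# prints ./DE/A1/102008/03/79/61/DE-102008037961-A1.xml
```

This makes it possible to locate the file for a known publication without searching the archive tree.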

Process Options

Parameter: Description
nthreads: To increase speed, data is extracted by default using parallel processes. This parameter specifies exactly how many parallel processes are used. As a general rule of thumb, set it to the number of CPU cores on the machine.
batchsize: The number of documents to extract per thread. If you know the content you are extracting, this parameter can be used to increase speed, e.g., bibliographic-only content benefits from a larger value, while full-text content benefits from a smaller value.
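
The rule of thumb for nthreads can be applied programmatically. The sketch below assumes the parameters are passed as --nthreads=N and --batchsize=N, following the --name=value form used elsewhere on this page. The helper name process_args is hypothetical, and no particular batchsize value is recommended here.

```python
import os


def process_args(nthreads=None, batchsize=None):
    """Build the process-option portion of an aext command line (hypothetical helper)."""
    if nthreads is None:
        # Rule of thumb: one parallel process per CPU core.
        # os.cpu_count() may return None, so fall back to a single process.
        nthreads = os.cpu_count() or 1

    args = [f"--nthreads={nthreads}"]
    if batchsize is not None:
        # Left to the caller: larger for bibliographic-only content,
        # smaller for full-text content.
        args.append(f"--batchsize={batchsize}")
    return args
```

The returned list can be appended to the source arguments before invoking aext, for example through subprocess.run.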

Output XML Filtering

Parameter: Description
dbfunc: By default, aext uses the internal PostgreSQL function xml.f_patent_document_s to extract full XML documents. This parameter allows you to specify a custom extract function.

 


Examples

Extracting Using a Specific load-id

...

Warning

Manipulating the content of the XML carries the risk of producing invalid XML. If you validate the XML against the CLAIMS Direct DTD, take care that required elements are retained.

...


First, we create the function that extracts only publication information and classification information.

...

aext --loadid=261358 --dbfunc=mySchema.f_cpc_only

 

...