...

aext is a tool used to extract full XML documents out of CLAIMS Direct. It is installed as part of the package Alexandria::Client::Tools.

Code Block
aext [ Options ... ]
  --pgdbname=s      Database configuration name (default=alexandria)
  --solrurl=s       SOLR index url (default=http://solr.alexandria.com:8080/alexandria-index/alexandria)
  --loadid=i        modified-load-id to extract
  --table=s         extract from table
  --sqlq=s          extract from SQL statement
  --solrq=s         extract from SOLR query
  --root=s          directory to deposit output file(s) or into which files will be archived
  --prefix=s        prefix for output files (default=batch)
  --archive         archive data into predictable path structure
  --nthreads=i      number of parallel processes (default=4)
  --batchsize=i     number of documents per process (default=500)
  --dbfunc=s        specific user-defined database function

...

| Parameter | Description |
| --- | --- |
| pgdbname | The database entry, as configured in /etc/alexandria.xml, that points to the on-site CLAIMS Direct PostgreSQL instance. The default value is alexandria, which is pre-configured in /etc/alexandria.xml. |
| solrurl | Available only with the optional on-site SOLR installation. This is the URL of the standalone CLAIMS Direct SOLR instance or, if one is used, the URL of the load balancer. Although a default value exists, this parameter is mandatory whenever --solrq is specified. |

...

| Parameter | Description |
| --- | --- |
| loadid | The modified_load_id of the table xml.t_patent_document_values. Please see the documentation on content updates describing the various load-ids. |
| table | The name of a user-created table with a minimum required column publication_id. |
| sqlq | Any raw SQL that returns one or more publication_id values. |
| solrq | Any raw SOLR query. |

Extract Naming and Destination

...

| Parameter | Description |
| --- | --- |
| nthreads | For increased speed, data is extracted by default using parallel processes. This parameter specifies exactly how many parallel processes will be used. A general rule of thumb is to set this parameter to the number of CPU cores the machine has. |
| batchsize | This parameter specifies the number of documents to extract per parallel process. If you know the content you are extracting, this parameter can be used to increase speed, e.g., bibliographic content only would benefit from a larger value, while full-text content would benefit from a lower value. |
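
As a rough sketch of the rule of thumb above, the core count can be looked up at run time and passed to --nthreads. The snippet only prints the resulting command instead of running it. It assumes getconf _NPROCESSORS_ONLN is available (as on Linux and macOS), and the batch size of 1000 is an arbitrary illustrative value, not a recommendation.

```shell
#!/bin/sh
# Number of online CPU cores. Assumption: getconf supports _NPROCESSORS_ONLN.
# Fall back to aext's documented default of 4 if it does not.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cores=4
case $cores in
    ''|*[!0-9]*) cores=4 ;;   # empty or non-numeric answer: use the default
esac

# Load-id 261358 is the one used in the examples on this page;
# --batchsize=1000 is a made-up value for illustration only.
cmd="aext --loadid=261358 --nthreads=$cores --batchsize=1000"

# Printed rather than executed, so the sketch is safe on a machine without aext.
echo "$cmd"
```

On a machine where aext is installed, the printed line can be pasted into the shell as is. Whether a larger or smaller batch size helps depends on the content being extracted, as described in the table.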

Output XML Filtering

| Parameter | Description |
| --- | --- |
| dbfunc | By default, aext uses the internal PostgreSQL function xml.f_patent_document_s to extract full XML documents. This parameter allows you to specify a custom extract function. |

...

The following example uses the table parameter. A user-defined table is created with a subset of documents which are then extracted using aext.

First, we create the table in a private schema:

Code Block
CREATE TABLE mySchema.t_load_261358 ( publication_id integer );

...

Finally, extract the documents into a predictable path structure in the current directory.

Code Block
aext --table=mySchema.t_load_261358 --archive
 
##
## abbreviated listing
##
./JP
./JP/B2
./JP/B2/000H07
./JP/B2/000H07/11
./JP/B2/000H07/11/02
./JP/B2/000H07/11/02/83
./JP/B2/000H07/11/02/83/JP-H07110283-B2.xml
./JP/B2/000H07/11/56
./JP/B2/000H07/11/56/83
./JP/B2/000H07/11/56/83/JP-H07115683-B2.xml
etc ...
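
The layout in the listing above can be approximated with a small helper, which may be useful for locating a file after an --archive run. This is only a sketch: the function name archive_path is hypothetical, and the rule it applies (zero-pad the document number to 12 characters, then split it into segments of 6, 2, 2, and 2) is inferred from the two entries shown, so the rule aext actually uses may differ for other number formats.

```shell
#!/bin/sh
# Hypothetical helper: guess the archive path for a publication name such as
# JP-H07110283-B2. The rule is inferred from the listing, not taken from aext.
archive_path() {
    name=$1
    root=${2:-.}                 # assumed to mirror --root; defaults to "."

    # Split COUNTRY-NUMBER-KIND on the dashes.
    country=${name%%-*}
    kind=${name##*-}
    number=${name#*-}
    number=${number%-*}

    # Assumption: the number is left-padded with zeros to 12 characters ...
    padded=$(printf '%12s' "$number" | tr ' ' '0')

    # ... and then cut into segments of 6, 2, 2 and 2 characters.
    p1=$(printf '%s' "$padded" | cut -c1-6)
    p2=$(printf '%s' "$padded" | cut -c7-8)
    p3=$(printf '%s' "$padded" | cut -c9-10)
    p4=$(printf '%s' "$padded" | cut -c11-12)

    printf '%s/%s/%s/%s/%s/%s/%s/%s.xml\n' \
        "$root" "$country" "$kind" "$p1" "$p2" "$p3" "$p4" "$name"
}

archive_path JP-H07110283-B2
archive_path JP-H07115683-B2 /tmp
```

The first call reproduces the path of the first XML file in the listing. The second shows the same rule applied under /tmp, on the assumption that --root simply becomes the top of the archive tree.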

...

This example will take the raw SQL used to populate the private table in the example above, and use it directly as a parameter to aext.

Code Block
aext --sqlq="SELECT t.publication_id FROM xml.t_patent_document_values AS t WHERE t.modified_load_id=261358" \
     --archive \
     --root=/tmp

Extracting using SOLR

If the optional CLAIMS Direct SOLR instance is installed, the power of SOLR can be used to search, filter, and extract documents. This example will simply pull the same set of documents as above using SOLR query syntax.

...

Together with the --loadid parameter, we can now extract XML that only includes publication and CPC information.

Code Block
aext --loadid=261358 --dbfunc=mySchema.f_cpc_only

...