aext is a tool used to extract full XML documents out of CLAIMS Direct. It is installed as part of the CLAIMS Direct repository. Please see the Client Tools Installation Instructions for more information about how to install this tool.

Code Block

language	text

aext [ Options ... ]
  --pgdbname=s      Database configuration name (default=alexandria)
  --solrurl=s       Solr index url (default=http://solr.alexandria.com:8080/alexandria-index/alexandria)
  --loadid=i        modified-load-id to extract
  --table=s         extract from table
  --sqlq=s          extract from SQL statement
  --solrq=s         extract from Solr query
  --root=s          directory to deposit output file(s) or into which files will be archived
  --prefix=s        prefix for output files (default=batch)
  --archive         archive data into predictable path structure
  --nthreads=i      number of parallel processes (default=4)
  --batchsize=i     number of documents per process (default=500)
  --dbfunc=s        specific user-defined database function

Detailed Description of the Parameters

Connectivity

Parameter	Description
`pgdbname`	As configured in `/etc/alexandria.xml`, the database entry pointing to the on-site CLAIMS Direct PostgreSQL instance. The default value is `alexandria` as this value is pre-configured in `/etc/alexandria.xml`.
`solrurl`	Available with the optional Solr on-site installation only. This is the URL of the standalone CLAIMS Direct Solr instance or, if used, the URL of the load balancer. Although there is a default value, this parameter is mandatory if you specify `--solrq`.

Source

The following parameters determine the source criteria for extracting CLAIMS Direct XML. Only one may be specified.

Parameter	Description
`loadid`	The `modified_load_id` of the table `xml.t_patent_document_values`. Please see the documentation on content updates describing the various load-ids.
`table`	The name of a user-created table with a minimum required column `publication_id`.
`sqlq`	Any raw SQL that returns one or more `publication_id` values.
`solrq`	Any raw Solr query.

Extract Naming and Destination

Parameter	Description
`root`	The output location of either the batches or, if `--archive` is specified, the root directory for files in the predictable path structure. The default is the current working directory.
`prefix`	The standard extract is run in batches. This parameter specifies the prefix for each output file. The default is `batch`.
`archive`	Archive the XML into a predictable path structure. The structure is as follows:

<root>/<country>/<kind>/nnnnnn/nn/nn/nn/ucid.xml

Where:
root: the destination specified with the --root parameter
country: the country of publication
kind: the kind code of the publication
nnnnnn/nn/nn/nn: the 12-character, zero-padded publication number, split into segments of 6, 2, 2, and 2 characters
ucid.xml: the full XML file of the publication

For example:
./DE/A1/102008/03/79/61/DE-102008037961-A1.xml
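
The path structure above can be reproduced outside of aext, for example to locate an archived file from its ucid. The following is a minimal sketch, not part of aext: the function name `archive_path` is hypothetical, and it assumes the ucid has the form COUNTRY-NUMBER-KIND with no hyphens inside the number.

```shell
# Hypothetical helper (not part of aext): build the archive path for a ucid.
# Assumes the ucid looks like COUNTRY-NUMBER-KIND, e.g. DE-102008037961-A1.
# The body runs in a subshell so its variables do not leak to the caller.
archive_path() (
  root=$1
  ucid=$2
  country=${ucid%%-*}        # text before the first hyphen
  kind=${ucid##*-}           # text after the last hyphen
  number=${ucid#*-}          # drop the country ...
  number=${number%-*}        # ... and the kind, leaving the number
  # Left-pad the publication number with zeros to 12 characters.
  padded=$(printf '%12s' "$number" | tr ' ' '0')
  # Split the padded number into segments of 6, 2, 2, and 2 characters.
  seg1=$(printf '%s' "$padded" | cut -c1-6)
  seg2=$(printf '%s' "$padded" | cut -c7-8)
  seg3=$(printf '%s' "$padded" | cut -c9-10)
  seg4=$(printf '%s' "$padded" | cut -c11-12)
  printf '%s/%s/%s/%s/%s/%s/%s/%s.xml\n' \
    "$root" "$country" "$kind" "$seg1" "$seg2" "$seg3" "$seg4" "$ucid"
)

archive_path . DE-102008037961-A1
```

Called with `.` and `DE-102008037961-A1`, the sketch prints the example path shown above.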

Process Options

Parameter	Description
`nthreads`	For increased speed, the extraction of data by default is done using parallel processes. This parameter specifies exactly how many parallel processes will be used. A general rule of thumb is to set this parameter to the number of CPU cores the machine has.
`batchsize`	This parameter specifies the number of documents to extract per process. If you know the content you are extracting, this parameter can be used to increase speed, e.g., bibliographic content only would benefit from a larger value while full-text content would benefit from a lower value.
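
As an illustration of the rule of thumb for `nthreads`, the core count can be looked up at run time instead of being hard-coded. This is only a sketch: it assumes a system where `getconf _NPROCESSORS_ONLN` is available (Linux and macOS provide it), and it prints the command line rather than running aext.

```shell
# Look up the number of online CPU cores.
# Assumption: getconf supports _NPROCESSORS_ONLN on this system.
cores=$(getconf _NPROCESSORS_ONLN)

# Print the resulting command line instead of executing it, so the sketch
# also works on machines where aext is not installed.
printf 'aext --loadid=261358 --root=/tmp --nthreads=%s\n' "$cores"
```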

Output XML Filtering

Parameter	Description
`dbfunc`	By default, `aext` uses the internal PostgreSQL function `xml.f_patent_document_s` to extract full XML documents. This parameter allows you to specify a custom extract function.

...

Examples

Extracting Using a Specific load-id

The following example uses modified_load_id 261358. The resulting XML batches will be in /tmp and will be prefixed with TEST. The logging output may be different depending on your logging configuration.

Code Block

language	bash

aext --loadid=261358 --root=/tmp --prefix=TEST
 
##
## the results in /tmp
##
ls -l /tmp/TEST*.xml
-rw-r--r-- 1 root root 56626271 Apr  6 03:52 /tmp/TEST.00000001-00000001.00000500.001491465129.xml
-rw-r--r-- 1 root root 68733642 Apr  6 03:52 /tmp/TEST.00000002-00000501.00001000.001491465129.xml
-rw-r--r-- 1 root root 91214345 Apr  6 03:52 /tmp/TEST.00000003-00001001.00001500.001491465129.xml
-rw-r--r-- 1 root root 91201427 Apr  6 03:52 /tmp/TEST.00000004-00001501.00002000.001491465129.xml
-rw-r--r-- 1 root root 79966094 Apr  6 03:52 /tmp/TEST.00000005-00002001.00002500.001491465129.xml
-rw-r--r-- 1 root root 86552704 Apr  6 03:52 /tmp/TEST.00000006-00002501.00003000.001491465129.xml
-rw-r--r-- 1 root root 35221625 Apr  6 03:52 /tmp/TEST.00000007-00003001.00003500.001491465129.xml
-rw-r--r-- 1 root root 68582397 Apr  6 03:52 /tmp/TEST.00000008-00003501.00004000.001491465129.xml
-rw-r--r-- 1 root root 80311992 Apr  6 03:52 /tmp/TEST.00000009-00004001.00004500.001491465129.xml
-rw-r--r-- 1 root root 17395649 Apr  6 03:52 /tmp/TEST.00000010-00004501.00004613.001491465129.xml
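
The file names in this listing appear to encode the batch number followed by the first and last document positions of the batch; this reading is inferred from the listing and is not documented behavior. Under that assumption, the following sketch (the function name `batch_ranges` is hypothetical) prints the ranges to expect for a given document count and batch size, which can help confirm that an extract produced every batch.

```shell
# Hypothetical helper (not part of aext): list the batch number and the
# first/last document positions for each batch, zero-padded to 8 digits.
# The body runs in a subshell so its variables do not leak to the caller.
batch_ranges() (
  total=$1
  size=$2
  batch=1
  first=1
  while [ "$first" -le "$total" ]; do
    last=$(( first + size - 1 ))
    # The final batch may hold fewer documents than the batch size.
    if [ "$last" -gt "$total" ]; then
      last=$total
    fi
    printf '%08d-%08d.%08d\n' "$batch" "$first" "$last"
    batch=$(( batch + 1 ))
    first=$(( first + size ))
  done
)

batch_ranges 4613 500
```

For 4613 documents and the default batch size of 500, this yields 10 ranges, ending with `00000010-00004501.00004613`, which lines up with the last file in the listing.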

...

Finally, extract the documents into a predictable path structure in the current directory.

Code Block

language	bash

aext --table=mySchema.t_load_261358 --archive
 
##
## abbreviated listing
##
./JP
./JP/B2
./JP/B2/000H07
./JP/B2/000H07/11
./JP/B2/000H07/11/02
./JP/B2/000H07/11/02/83
./JP/B2/000H07/11/02/83/JP-H07110283-B2.xml
./JP/B2/000H07/11/56
./JP/B2/000H07/11/56/83
./JP/B2/000H07/11/56/83/JP-H07115683-B2.xml
etc ...

Extracting Using SQL

This example takes the raw SQL used to populate the private table in the example above and uses it directly as a parameter to aext.

Code Block

language	text

aext --sqlq="SELECT t.publication_id from xml.t_patent_document_values as t where t.modified_load_id=261358" \
     --archive \
     --root=/tmp

Extracting Using Solr

If the optional CLAIMS Direct Solr instance is installed, the power of Solr can be used to search, filter, and extract documents. This example pulls the same set of documents as above using Solr query syntax.

Code Block

language	text

aext --solrurl=http://SOLR-INSTANCE-URL/alexandria-v2.1/alexandria --archive --solrq='loadid:261358'

[aindex01] [2017/04/06 04:17:11] [DEBUG     ] [preparing extract ...]
[aindex01] [2017/04/06 04:17:11] [DEBUG     ] [creating t_tmp_000000000000_001491466631 ... ]
[aindex01] [2017/04/06 04:17:11] [DEBUG     ] [querying SOLR (http://SOLR-INSTANCE-URL/alexandria-v2.1/alexandria { loadid:261358 })]
[aindex01] [2017/04/06 04:17:12] [DEBUG     ] [running extract ...]
[aindex01] [2017/04/06 04:17:27] [DEBUG     ] [finalizing extract ...]
[aindex01] [2017/04/06 04:17:27] [INFO      ] [extract complete: { 4613 documents across 10 batches in 15.643s (294.894/s) }]

Extracting Using a Custom Database Function

The following example describes a use case in which only CPC classifications are of interest. It makes use of a custom extract function created in a private schema.

Warning
By manipulating the content of the XML, there is a risk that invalid XML can be produced. If you are validating the XML using the CLAIMS Direct DTD, beware of required elements.

...

First, we create the function that extracts only publication information and classification information.

...

Code Block

language	text

aext --loadid=261358 --dbfunc=mySchema.f_cpc_only

...

Checking Status

To determine the current status of the data extraction, check the log output for the batch number currently being extracted, then insert it into the following formula:

Code Block

language	text

( ( total-documents / batch-size ) - current-batch-number ) * batch-size = number of documents left to extract

For example, given 17000000 total documents, a batch size of 500, and a current batch number of 31000, the formula would determine that there are 1500000 documents left to extract:

Code Block

language	text

( ( 17000000 / 500 ) - 31000 ) * 500 = 1500000
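
The same calculation can be scripted so that it does not have to be done by hand each time. This is a minimal sketch of the formula above; the function name `remaining_documents` is hypothetical. Shell arithmetic uses integer division, so when the total is not an exact multiple of the batch size the result is a slight underestimate.

```shell
# Hypothetical helper: number of documents left to extract.
# Arguments: total documents, batch size, current batch number.
remaining_documents() {
  # Integer arithmetic; total / batch-size is truncated toward zero.
  echo $(( ($1 / $2 - $3) * $2 ))
}

remaining_documents 17000000 500 31000
```

With the values from the example above, this prints 1500000.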
