...

aext extracts full XML documents from CLAIMS Direct. It is installed as part of the Alexandria::Client::Tools package.

Code Block
aext [ Options ... ]
  --pgdbname=s      Database configuration name (default=alexandria)
  --solrurl=s       SOLR index url (default=http://solr.alexandria.com:8080/alexandria-index/alexandria)
  --loadid=i        modified-load-id to extract
  --table=s         extract from table
  --sqlq=s          extract from SQL statement
  --solrq=s         extract from SOLR query
  --root=s          directory to deposit output file(s) or into which files will be archived
  --prefix=s        prefix for output files (default=batch)
  --archive         archive data into predictable path structure
  --nthreads=i      number of parallel processes (default=4)
  --batchsize=i     number of documents per process (default=500)
  --dbfunc=s        specific user-defined database function

...

The following example extracts modified_load_id 261358. The resulting XML batches are written to /tmp with the prefix TEST. Logging output may differ depending on your logging configuration.

Code Block
aext --loadid=261358 --root=/tmp --prefix=TEST
 
##
## the results in /tmp
##
ls -l /tmp/TEST*.xml
-rw-r--r-- 1 root root 56626271 Apr  6 03:52 /tmp/TEST.00000001-00000001.00000500.001491465129.xml
-rw-r--r-- 1 root root 68733642 Apr  6 03:52 /tmp/TEST.00000002-00000501.00001000.001491465129.xml
-rw-r--r-- 1 root root 91214345 Apr  6 03:52 /tmp/TEST.00000003-00001001.00001500.001491465129.xml
-rw-r--r-- 1 root root 91201427 Apr  6 03:52 /tmp/TEST.00000004-00001501.00002000.001491465129.xml
-rw-r--r-- 1 root root 79966094 Apr  6 03:52 /tmp/TEST.00000005-00002001.00002500.001491465129.xml
-rw-r--r-- 1 root root 86552704 Apr  6 03:52 /tmp/TEST.00000006-00002501.00003000.001491465129.xml
-rw-r--r-- 1 root root 35221625 Apr  6 03:52 /tmp/TEST.00000007-00003001.00003500.001491465129.xml
-rw-r--r-- 1 root root 68582397 Apr  6 03:52 /tmp/TEST.00000008-00003501.00004000.001491465129.xml
-rw-r--r-- 1 root root 80311992 Apr  6 03:52 /tmp/TEST.00000009-00004001.00004500.001491465129.xml
-rw-r--r-- 1 root root 17395649 Apr  6 03:52 /tmp/TEST.00000010-00004501.00004613.001491465129.xml
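
The batch file names in the listing above appear to encode the batch number, the first and last document ordinal in the batch, and a timestamp. That layout is inferred from this one listing rather than from documented behavior, so the following is only a minimal sketch under that assumption. The helper name summarize_batches is hypothetical and is not part of aext.

```shell
# Sketch: read batch file names on stdin and report documents per batch.
# ASSUMPTION (inferred from the listing, not documented):
#   PREFIX.BATCH-FIRST.LAST.TIMESTAMP.xml
summarize_batches() {
    awk -F'[.-]' '
        {
            # Count fields from the end so a path or prefix that
            # contains "." or "-" does not shift the numeric fields.
            last  = $(NF - 2) + 0   # last document ordinal in the batch
            first = $(NF - 3) + 0   # first document ordinal in the batch
            batch = $(NF - 4) + 0   # batch sequence number
            n = last - first + 1
            total += n
            printf "batch %d: %d documents\n", batch, n
        }
        END { printf "total: %d documents in %d batches\n", total, NR }
    '
}

# Usage (hypothetical): ls /tmp/TEST*.xml | summarize_batches
```

Fed the ten file names listed above, the final line reads "total: 4613 documents in 10 batches". awk is used for the arithmetic because shell arithmetic would treat the zero-padded fields as octal.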

...

Finally, extract the documents into a predictable path structure in the current directory.

Code Block
aext --table=mySchema.t_load_261358 --archive
 
##
## abbreviated listing
##
./JP
./JP/B2
./JP/B2/000H07
./JP/B2/000H07/11
./JP/B2/000H07/11/02
./JP/B2/000H07/11/02/83
./JP/B2/000H07/11/02/83/JP-H07110283-B2.xml
./JP/B2/000H07/11/56
./JP/B2/000H07/11/56/83
./JP/B2/000H07/11/56/83/JP-H07115683-B2.xml
etc ...
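
Judging only from the two files in the abbreviated listing, the archive path looks like country, kind code, then the document number left-padded with zeros to 12 characters and split into segments of 6, 2, 2, and 2 characters. This rule is a guess from those two examples, not a documented contract, and other authorities or number formats may behave differently. The helper name archive_path is hypothetical.

```shell
# Sketch: derive the archive path for an identifier such as JP-H07110283-B2.
# ASSUMPTION (inferred from the listing above, not documented):
#   ./COUNTRY/KIND/<number zero-padded to 12 chars, split 6/2/2/2>/ID.xml
archive_path() {
    printf '%s\n' "$1" | awk -F- '
        NF == 3 {
            number = $2
            # Left-pad the document number with zeros to 12 characters.
            while (length(number) < 12) number = "0" number
            printf "./%s/%s/%s/%s/%s/%s/%s.xml\n", $1, $3,
                substr(number, 1, 6), substr(number, 7, 2),
                substr(number, 9, 2), substr(number, 11, 2), $0
        }'
}

# Usage (hypothetical): archive_path JP-H07110283-B2
```

For JP-H07110283-B2 this prints ./JP/B2/000H07/11/02/83/JP-H07110283-B2.xml, matching the listing. Identifiers that do not have exactly three hyphen-separated parts produce no output.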

...

This example takes the raw SQL used to populate the private table in the example above and passes it directly to aext as a parameter.

Code Block
aext --sqlq="SELECT t.publication_id FROM xml.t_patent_document_values AS t WHERE t.modified_load_id=261358" \
     --archive \
     --root=/tmp
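
When the load id comes from a script variable rather than being typed by hand, it may be worth checking that it is numeric before splicing it into the SQL string. The following is a minimal sketch that only prints the command line instead of running it. The helper name build_sqlq_command is hypothetical, and the table and column names are simply copied from the query shown above.

```shell
# Sketch: print (do not run) an aext command for one modified_load_id.
# The helper is hypothetical; only flags from the usage text are used.
build_sqlq_command() {
    load_id=$1
    root=$2
    # Reject anything that is not a plain non-negative integer so the
    # value cannot alter the SQL statement.
    case $load_id in
        ''|*[!0-9]*)
            echo "load id must be numeric: $load_id" >&2
            return 1
            ;;
    esac
    printf 'aext --sqlq="SELECT t.publication_id FROM xml.t_patent_document_values AS t WHERE t.modified_load_id=%s" --archive --root=%s\n' \
        "$load_id" "$root"
}

# Usage (hypothetical): build_sqlq_command 261358 /tmp
```

Printing the command first makes it easy to review the generated SQL before handing it to aext.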

...

If the optional CLAIMS Direct SOLR instance is installed, SOLR queries can be used to search, filter, and extract documents. This example pulls the same set of documents as above using SOLR query syntax.

Code Block
aext --solrurl=http://SOLR-INSTANCE-URL/alexandria-v2.1/alexandria --archive --solrq='loadid:261358'

[aindex01] [2017/04/06 04:17:11] [DEBUG     ] [preparing extract ...]
[aindex01] [2017/04/06 04:17:11] [DEBUG     ] [creating t_tmp_000000000000_001491466631 ... ]
[aindex01] [2017/04/06 04:17:11] [DEBUG     ] [querying SOLR (http://SOLR-INSTANCE-URL/alexandria-v2.1/alexandria { loadid:261358 })]
[aindex01] [2017/04/06 04:17:12] [DEBUG     ] [running extract ...]
[aindex01] [2017/04/06 04:17:27] [DEBUG     ] [finalizing extract ...]
[aindex01] [2017/04/06 04:17:27] [INFO      ] [extract complete: { 4613 documents across 10 batches in 15.643s (294.894/s) }]
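
For scripted runs it can be handy to pull the document count, batch count, and elapsed time out of the final log line. The following is a minimal sketch that assumes the "extract complete" line has exactly the shape shown above; since logging output depends on your logging configuration, the pattern may need adjusting. The helper name extract_summary is hypothetical.

```shell
# Sketch: read aext log output on stdin and print
#   "<documents> <batches> <seconds>"
# for each "extract complete" line.
# ASSUMPTION: the line format matches the sample log shown above.
extract_summary() {
    sed -n 's/.*extract complete: { \([0-9][0-9]*\) documents across \([0-9][0-9]*\) batches in \([0-9.][0-9.]*\)s.*/\1 \2 \3/p'
}

# Usage (hypothetical): some-captured-log-file is a placeholder name
#   extract_summary < some-captured-log-file
```

For the sample log line above this prints "4613 10 15.643". Lines that do not match the pattern are ignored.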

...

Together with the --loadid parameter, we can now extract XML that includes only publication and CPC information.

Code Block
aext --loadid=261358 --dbfunc=mySchema.f_cpc_only

...