...
```
aext [ Options ... ]
  --pgdbname=s    Database configuration name (default=alexandria)
  --solrurl=s     Solr index url (default=http://solr.alexandria.com:8080/alexandria-index/alexandria)
  --loadid=i      modified-load-id to extract
  --table=s       extract from table
  --sqlq=s        extract from SQL statement
  --solrq=s       extract from Solr query
  --root=s        directory to deposit output file(s) or into which files will be archived
  --prefix=s      prefix for output files (default=batch)
  --archive       archive data into predictable path structure
  --nthreads=i    number of parallel processes (default=4)
  --batchsize=i   number of documents per process (default=500)
  --dbfunc=s      specific user-defined database function
```
...
Parameter | Description |
---|---|
pgdbname | The database entry, as configured in /etc/alexandria.xml, pointing to the on-site CLAIMS Direct PostgreSQL instance. The default value is alexandria, as this value is pre-configured in /etc/alexandria.xml. |
solrurl | Available only with the optional on-site Solr installation. This is the URL of the standalone CLAIMS Direct Solr instance or, if used, the URL of the load balancer. Although there is a default value, this parameter is mandatory if you specify --solrq. |
...
Parameter | Description |
---|---|
loadid | A modified_load_id value from the table xml.t_patent_document_values. Please see the documentation on content updates describing the various load-ids. |
table | The name of a user-created table with at least the required column publication_id. |
sqlq | Any raw SQL that returns one or more publication_id values. |
solrq | Any raw Solr query. |
Extract Naming and Destination
...
The following example uses modified_load_id 261358. The resulting XML batches will be in /tmp and will be prefixed with TEST. The logging output may be different depending on your logging configuration.
```
aext --loadid=261358 --root=/tmp --prefix=TEST

##
## the results in /tmp
##
ls -l /tmp/TEST*.xml
-rw-r--r-- 1 root root 56626271 Apr 6 03:52 /tmp/TEST.00000001-00000001.00000500.001491465129.xml
-rw-r--r-- 1 root root 68733642 Apr 6 03:52 /tmp/TEST.00000002-00000501.00001000.001491465129.xml
-rw-r--r-- 1 root root 91214345 Apr 6 03:52 /tmp/TEST.00000003-00001001.00001500.001491465129.xml
-rw-r--r-- 1 root root 91201427 Apr 6 03:52 /tmp/TEST.00000004-00001501.00002000.001491465129.xml
-rw-r--r-- 1 root root 79966094 Apr 6 03:52 /tmp/TEST.00000005-00002001.00002500.001491465129.xml
-rw-r--r-- 1 root root 86552704 Apr 6 03:52 /tmp/TEST.00000006-00002501.00003000.001491465129.xml
-rw-r--r-- 1 root root 35221625 Apr 6 03:52 /tmp/TEST.00000007-00003001.00003500.001491465129.xml
-rw-r--r-- 1 root root 68582397 Apr 6 03:52 /tmp/TEST.00000008-00003501.00004000.001491465129.xml
-rw-r--r-- 1 root root 80311992 Apr 6 03:52 /tmp/TEST.00000009-00004001.00004500.001491465129.xml
-rw-r--r-- 1 root root 17395649 Apr 6 03:52 /tmp/TEST.00000010-00004501.00004613.001491465129.xml
```
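The file names in the listing above appear to follow the pattern PREFIX.BATCH-FIRST.LAST.TIMESTAMP.xml, where FIRST and LAST are the running document numbers covered by the batch. This layout is inferred from the listing, not taken from a documented specification. Under that assumption, a minimal sketch (the helper name batch_doc_count is hypothetical) computes how many documents a batch file should hold:

```shell
# Hypothetical helper: print the number of documents a batch file should
# contain, based on the FIRST and LAST numbers embedded in its name.
# Assumed name layout: PREFIX.BATCH-FIRST.LAST.TIMESTAMP.xml
batch_doc_count() {
    # Fields are counted from the end so that a prefix containing dots
    # does not shift the positions: NF = TIMESTAMP, NF-1 = LAST,
    # NF-2 = BATCH-FIRST. awk reads the zero-padded values as decimals.
    basename "$1" .xml |
        awk -F. '{ split($(NF-2), a, "-"); print $(NF-1) - a[2] + 1 }'
}

# The last batch of the listing above covers documents 4501 to 4613.
batch_doc_count /tmp/TEST.00000010-00004501.00004613.001491465129.xml   # prints 113
```

For the listing above, the nine full batches yield 500 each and the last one 113, for a total of 4613 documents.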
...
Finally, extract the documents into a predictable path structure in the current directory.
```
aext --table=mySchema.t_load_261358 --archive

##
## abbreviated listing
##
./JP
./JP/B2
./JP/B2/000H07
./JP/B2/000H07/11
./JP/B2/000H07/11/02
./JP/B2/000H07/11/02/83
./JP/B2/000H07/11/02/83/JP-H07110283-B2.xml
./JP/B2/000H07/11/56
./JP/B2/000H07/11/56/83
./JP/B2/000H07/11/56/83/JP-H07115683-B2.xml
etc ...
```
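Judging from the two file paths in the abbreviated listing, the archive path appears to be built from the country code, the kind code, and the document number left-padded with zeros and split into groups of 6, 2, 2, and 2 characters. The following sketch reproduces those two paths under that assumption. The padding width of 12 and the helper name archive_path are guesses made for illustration, not documented behavior of aext.

```shell
# Hypothetical helper: derive the archive path for a document identifier
# of the form COUNTRY-NUMBER-KIND, e.g. JP-H07110283-B2.
# The body runs in a subshell ( ... ) so its variables do not leak.
archive_path() (
    ucid=$1
    country=${ucid%%-*}      # text before the first hyphen: JP
    kind=${ucid##*-}         # text after the last hyphen:   B2
    number=${ucid#*-}        # drop the country code
    number=${number%-*}      # drop the kind code:           H07110283

    # Left-pad the number with zeros to 12 characters (assumed width).
    padded=$(printf '%12s' "$number" | tr ' ' '0')

    # Split the padded number into groups of 6, 2, 2, and 2 characters.
    p1=$(printf '%s' "$padded" | cut -c1-6)
    p2=$(printf '%s' "$padded" | cut -c7-8)
    p3=$(printf '%s' "$padded" | cut -c9-10)
    p4=$(printf '%s' "$padded" | cut -c11-12)

    printf './%s/%s/%s/%s/%s/%s/%s.xml\n' \
        "$country" "$kind" "$p1" "$p2" "$p3" "$p4" "$ucid"
)

archive_path JP-H07110283-B2   # prints ./JP/B2/000H07/11/02/83/JP-H07110283-B2.xml
```

A helper like this can be useful for locating a single archived document without searching the directory tree, provided the inferred layout holds for the document numbers in question.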
Extracting Using SQL
This example takes the raw SQL used to populate the private table in the example above and uses it directly as a parameter to aext.
```
aext --sqlq="SELECT t.publication_id from xml.t_patent_document_values as t where t.modified_load_id=261358" \
     --archive \
     --root=/tmp
```
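When the load-id comes from a script variable rather than being typed by hand, it may be worth validating it before it is spliced into the SQL string. The sketch below is one way to do that. The function name loadid_sql is hypothetical, and the statement it prints is the one from the example above.

```shell
# Hypothetical helper: print the SQL statement for a given load-id,
# refusing anything that is not a plain non-negative integer.
loadid_sql() {
    case $1 in
        ''|*[!0-9]*)
            echo "loadid_sql: load-id must be an integer" >&2
            return 1
            ;;
    esac
    printf 'SELECT t.publication_id from xml.t_patent_document_values as t where t.modified_load_id=%s\n' "$1"
}

# Possible use with aext (not executed here):
#   aext --sqlq="$(loadid_sql 261358)" --archive --root=/tmp
```

Rejecting non-numeric input keeps stray characters in a variable from changing the meaning of the query.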
Extracting Using Solr
If the optional CLAIMS Direct Solr instance is installed, the power of Solr can be used to search, filter, and extract documents. This example pulls the same set of documents as above using Solr query syntax.
```
aext --solrurl=http://SOLR-INSTANCE-URL/alexandria-v2.1/alexandria --archive --solrq='loadid:261358'
[aindex01] [2017/04/06 04:17:11] [DEBUG ] [preparing extract ...]
[aindex01] [2017/04/06 04:17:11] [DEBUG ] [creating t_tmp_000000000000_001491466631 ... ]
[aindex01] [2017/04/06 04:17:11] [DEBUG ] [querying SOLR (http://SOLR-INSTANCE-URL/alexandria-v2.1/alexandria { loadid:261358 })]
[aindex01] [2017/04/06 04:17:12] [DEBUG ] [running extract ...]
[aindex01] [2017/04/06 04:17:27] [DEBUG ] [finalizing extract ...]
[aindex01] [2017/04/06 04:17:27] [INFO ] [extract complete: { 4613 documents across 10 batches in 15.643s (294.894/s) }]
```
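The final log line reports the document count, batch count, and elapsed time, which can serve as a quick sanity check in a wrapper script. The following sketch pulls those three values out of captured log text. It assumes the wording of the "extract complete" line shown above, which may differ with your logging configuration, and the function name extract_summary is hypothetical.

```shell
# Hypothetical helper: read aext log text on stdin and print
# "DOCUMENTS BATCHES SECONDS" from the "extract complete" line.
# Prints nothing if no such line is found.
extract_summary() {
    sed -n 's/.*extract complete: { \([0-9][0-9]*\) documents across \([0-9][0-9]*\) batches in \([0-9.][0-9.]*\)s.*/\1 \2 \3/p'
}

# Example with the summary line from the log above.
printf '%s\n' '[aindex01] [2017/04/06 04:17:27] [INFO ] [extract complete: { 4613 documents across 10 batches in 15.643s (294.894/s) }]' |
    extract_summary   # prints 4613 10 15.643
```

How the log text is captured is left open here, since the output stream aext logs to depends on the local logging setup.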
Extracting Using a Custom Database Function
The following example describes a use case in which only CPC classifications are of interest. It makes use of a custom extract function created in a private schema.
...