...
Source | Frequency | Average Volume | Element(s) Affected |
---|---|---|---|
Patent Abstracts of Japan (PAJ) | monthly | 25,000 | |
USPTO Master Classification File | every 2 months | 150,000 | |
USPTO Reassignments | daily | 60,000 | |
EPO IPCR Incremental Update | quarterly | 1,200,000 | |
Architecture
Two important scripts are used to manage the ongoing data update process (a quick check that both daemons are running is sketched after this list).
- apgupd: The main update daemon. Together with the sub-script apgup, it checks for and downloads new or updated data from the CLAIMS Direct primary data warehouse.
- aidxd: The indexing daemon. Together with the sub-script aidx (alexandria_index), it keeps the optional SOLR index updated with data synced from the individual client PostgreSQL database.
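As a quick sanity check, both daemons can be located by process name. This is a minimal sketch using standard tools and assumes the daemons run under the process names apgupd and aidxd on the client instance.

```
# Verify that the update and indexing daemons are running (process names assumed).
pgrep -af apgupd || echo "apgupd is not running"
pgrep -af aidxd  || echo "aidxd is not running"
```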
Update Process In Detail
The main components involved in content updates for CLAIMS Direct client instances are the client PostgreSQL database proper, the remote web client, and the server-side web service endpoints (CDWS), as shown in the diagram below:
...
Content is processed on the IFI CLAIMS primary instance based on the concept of a load source. A load source is a particular data feed from an issuing authority and can include new complete documents, updated complete documents, or partial document updates. As these load sources are processed into the primary data warehouse, they are stamped with a load-id (an identifier used to group sets of documents together) and are immediately made available for client download.
...
Info
Every document within the data warehouse was loaded as part of a group of documents. This set of documents is identified by a load-id (an integer value). There are 3 types of load-ids in the data warehouse: (1) created-load-id, (2) modified-load-id, and (3) deleted-load-id. The created-load-id represents the load-id in which a document was added to the database, the modified-load-id represents the load-id that last modified the document, and the deleted-load-id represents the load-id in which the document was marked as deleted.
All metadata pertaining to updates to both PostgreSQL and SOLR is contained in the PostgreSQL schema reporting. For further details, see the design documentation.
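As a rough illustration, the bookkeeping tables in the reporting schema can be listed with psql. This sketch assumes the default database connection name alexandria used elsewhere in this document; adjust the connection parameters for your installation.

```
# List the update/index bookkeeping tables in the reporting schema
# ("alexandria" is the documented default database name; adjust as needed).
psql alexandria -c '\dt reporting.*'
```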
PostgreSQL Updates
apgup (alexandria_update) is the mechanism through which client instances communicate with the IFI primary instance and is the only method of obtaining new or updated content for the client PostgreSQL database. You will need authentication credentials from IFI CLAIMS in order to communicate with the IFI primary server.
Action | Example Command |
---|---|
Check for available content updates | apgup --check --user=xxx --password=yyy |
Automatically download and process the next available content update | apgup --update --user=xxx --password=yyy |
Continually process available content updates | nohup apgup --update --continue --user=xxx --password=yyy & |
Stop apgup when in --continue mode (run in a separate terminal window) | apgup --stop |
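One possible pattern, sketched below, is to check for pending updates and only run the update step when the check succeeds. This assumes that apgup --check exits with a non-zero status when no updates are pending (verify this on your installation) and that credentials are held in the illustrative environment variables APGUP_USER and APGUP_PASSWORD.

```
# Illustrative check-then-update sequence; APGUP_USER/APGUP_PASSWORD are assumed
# to hold the credentials issued by IFI CLAIMS, and --check is assumed to exit
# non-zero when nothing is pending (verify on your installation).
if apgup --check --user="$APGUP_USER" --password="$APGUP_PASSWORD"; then
    apgup --update --user="$APGUP_USER" --password="$APGUP_PASSWORD"
fi
```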
Detailed Usage (apgup)
```
apgup [ Options ... ]
  --help            print this usage and exit
  --update          update local database
  --continue        continue processing available load-ids
  --interval=i      number of seconds to sleep before continuing (default=10)
  --die_on_error    Break out of continue loop if any error conditions arise
                    (default=false, i.e., continue trying to update)
  --check           check for available updates but don't proceed to actually update
  --limit=n         limit display of available updates to n (default=10)
  --stop            trigger a graceful exit while in continuous mode
  --maxloadid=i     force a specific max-client-load-id value
  --noindex         don't trigger index of new load-id(s)
 Required Authorization Arguments
 --------
  --user=s          basic authorization name
  --password=s      basic authorization password
 Optional Update Arguments
 --------
  --url             Alexandria services url
  --update_method   method name for requesting server updates
  --status_method   method name for getting/setting session status
  --check_method    method name for checking available updates
  --schema          schema name for temporary work tables
  --force           force the update even if it is redundant
  --tmp             temporary directory for update processing
  --facility=s      logging facility (default=apgup)
 Optional Database Argument
 --------
  --pgdbname=conf   default database connection name (alexandria)
```
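As an illustration of the optional arguments, the continuous mode shown in the table above can be combined with a longer polling interval and a dedicated temporary directory. The interval value and the /scratch path are examples only.

```
# Continuous updates with a 60-second poll interval and a dedicated temp directory
# (values are illustrative); stop later with: apgup --stop
nohup apgup --update --continue --interval=60 --tmp=/scratch/apgup-tmp \
      --user=xxx --password=yyy &
```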
SOLR Indexing
Indexing into SOLR is controlled by an indexing daemon: aidxd. This daemon probes PostgreSQL for available load-id(s) to index. This "queue" is represented by the table reporting.t_client_index_process. When a load-id has been successfully processed into PostgreSQL, apgup registers it as index-ready. The indexing daemon recognizes the newly available load-id and begins the indexing process for it.
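To peek at this queue directly, the table can be queried with psql. The sketch below makes no assumptions about the column layout of reporting.t_client_index_process and simply selects everything; alexandria is the documented default database connection name.

```
# Inspect the indexing queue; no column names are assumed, so select everything.
psql alexandria -c 'SELECT * FROM reporting.t_client_index_process LIMIT 10;'
```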
Action | Example Command |
---|---|
Start the indexing daemon | aidxd --tmp=/scratch/solr-load-tmp |
Pause the indexing daemon | kill -s USR1 <pid> |
Resume a paused daemon | kill -s USR2 <pid> |
Stop the indexing daemon | kill -s USR1 <pid> && kill -s INT <pid> |
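The signal-based controls above can be wrapped in a small helper script that reads the daemon's PID from its pidfile. This is a sketch only; it assumes aidxd was started with the default pidfile location shown in the usage below (/var/log/alexandria/aidxd.pid).

```
#!/bin/sh
# Illustrative helper for pausing, resuming, or stopping aidxd via its pidfile.
# The pidfile path is the documented default; override if aidxd was started with --pidfile.
PIDFILE=/var/log/alexandria/aidxd.pid
PID=$(cat "$PIDFILE")

case "$1" in
  pause)  kill -s USR1 "$PID" ;;
  resume) kill -s USR2 "$PID" ;;
  stop)   kill -s USR1 "$PID" && kill -s INT "$PID" ;;
  *)      echo "usage: $0 {pause|resume|stop}" >&2; exit 1 ;;
esac
```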
Detailed Usage (aidxd)
```
aidxd [ Options ... ]
  --nodaemon        don't put process into background
  --once            only process one load-id
  --pidfile=s       specify location of PIDFILE
                    (default=/var/log/alexandria/aidxd.pid)
  --interval=i      n-seconds between probing for new loads
  --tmp=dir         specify temporary file storage (default=/tmp)
  --clean           remove temporary processing directory
  --batchsize=i     maximum number of documents to parallelize
  --nthreads=i      maximum number of processes to parallelize
  --facility=s      logging facility (default=aidxd)
  --help            print this usage and exit
 --------
  --idxcls=s        Specify indexing class
  --idxexe=s        specify indexing script (default aidx)
  --quiet           suppress output from sub-process
                    NOTE: suppressing this output will make it difficult
                    to track down errors originating in --idxexe
  --pgdbname=s      config entry or source database (default=alexandria)
  --solrdbname=s    base url for index (default=alexandria)
  --core=s          index core (default=alexandria)
  --autooptimize    issue an 'optimize' call to SOLR after optinterval
                    continuous load-id(s)
  --optinterval     # of load-id(s) after which an optimize is issued
                    (default=100)
  --optsegments=n   optimize down to n-segments (default=16)
```
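As an example of combining these options, the daemon can be started with explicit parallelism settings and periodic SOLR optimization. The batch size, thread count, and temporary directory below are illustrative values, not recommendations.

```
# Illustrative aidxd invocation; batchsize, nthreads, and tmp values are examples only.
aidxd --tmp=/scratch/solr-load-tmp \
      --batchsize=5000 --nthreads=4 \
      --autooptimize --optinterval=100 --optsegments=16
```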