
Update Schedule

Overview

The CLAIMS Global Patent Database data is delivered as a separate record for each patent publication, with data merged from multiple sources. The sources include the following:

  • DocDB and legal status data from the EPO
  • Bibliographic and full-text files from national patent offices
  • Assignment information for US patents
  • Translated bibliographic data for Japan from Patent Abstracts of Japan

New raw data is loaded as soon as it is published and includes both new records and changes to existing records.

Because DocDB delivers an unpredictable and often large number of updated records (which can delay the processing of national office data), we are more selective in processing DocDB data: we spread the weekly loads over the full week rather than processing them on the day the data is received. Doing so ensures that the US, EP, and PCT files in particular are processed in a timely manner.

In the following sections, we include information about data volumes and frequency of updates for the main raw data sources.

Weekly Updates

| Source | Day of source data availability | Delay from patent publication date | Availability in CLAIMS Global DB | Translation Availability |
| ------ | ------------------------------- | ---------------------------------- | -------------------------------- | ------------------------ |
| CA | Wednesday | same day | day of publication | |
| CN | Wednesday | 2-3 weeks | 2-3 weeks after publication | 2-3 weeks after publication |
| DE | Thursday | same day | day of publication | 2 days after publication |
| EP | Wednesday | same day | day of publication | |
| EPO DOCDB Create/Delete | Thursday | depending on the country | 1 day after source data availability | |
| ES | Daily | 2-4 weeks | 2-4 weeks after publication | |
| JP Grants | Wednesday | 1-2 weeks | 1-2 weeks after publication | 2-3 weeks after publication |
| JP Applications | Thursday | 1-2 weeks | 1-2 weeks after publication | 2-3 weeks after publication |
| KR | Daily | same day | day of publication | 2-3 weeks after publication |
| US Grants | Tuesday | same day | day of publication | |
| US Grants - Attachments | Tuesday | same day | day of publication | |
| US Applications | Thursday | same day | day of publication | |
| US Apps - Attachments | Thursday | same day | day of publication | |
| WO | Thursday | same day | day of publication | |

Average weekly volume and element coverage by source:

| Source | Average volume | classes | priorities | titles | abst. | desc. | claims |
| ------ | -------------- | ------- | ---------- | ------ | ----- | ----- | ------ |
| CA | 3,200 | A | S | A | S | | A |
| CN | 24,000 | A | S | A | A | S | S |
| DE | 1,500 | A | S | A | S | A | A |
| EP | 4,400 | A | | A | S | A | A |
| EPO DOCDB Create/Delete | 125,000 | S | A | S | S | | |
| ES | 120 | A | S | A | S | A | A |
| JP Grants | 4,800 | A | S | A | S | A | A |
| JP Applications | 6,600 | A | S | S | S | S | S |
| KR | 1,200 | A | S | A | A | A | A |
| US Grants | 4,800 | A | S | A | S | A | A |
| US Grants - Attachments | 120,000 | | | | | | |
| US Applications | 6,300 | A | S | A | A | A | A |
| US Apps - Attachments | 150,000 | | | | | | |
| WO | 4,000 | | | A | A | S | S |

A - all first publications contain this element (excludes corrections, search reports, WO equivalents/republications, etc.)

S - some publications contain this element; data is loaded when available

DocDB Amended Records

| Source | Day of Publication | Average Volume |
| ------ | ------------------ | -------------- |
| EPO DocDB Amend | Tuesday | 365,000 |

Percentage of documents having updated elements for the given fields:

| Source | classes | citations | titles | applicants | inventors | abstracts |
| ------ | ------- | --------- | ------ | ---------- | --------- | --------- |
| EPO DocDB Amend | 48% | 26% | 8% | 6% | 5% | |

Non-Weekly Updates

| Source | Frequency | Average Volume | Element(s) Affected |
| ------ | --------- | -------------- | ------------------- |
| Patent Abstracts of Japan (PAJ) | monthly | 25,000 | classifications-ipcr, titles, parties, abstracts |
| USPTO Master Classification File | every 2 months | 150,000 | classification-national |
| USPTO Reassignments | daily | 60,000 | assignee-history |
| EPO IPCR Incremental Update | quarterly | 1,200,000 | classifications-ipcr |


Architecture

 

Two important scripts are used to manage the ongoing data update process:

  • apgupd: the main update daemon. Together with its sub-script apgup, it checks for and downloads new or updated data from the CLAIMS Direct primary data warehouse.
  • aidxd: the indexing daemon. Together with its sub-script aidx (alexandria_index), it keeps the SOLR index updated with data synced from the individual client PostgreSQL database. A quick liveness check for both daemons is sketched below.
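This is a minimal sketch using standard process tools, not part of the CLAIMS Direct tooling itself:

# Confirm that both update daemons are running on a client instance
ps -ef | grep -E 'apgupd|aidxd' | grep -v grep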

Update Process In Detail

The main components involved in content updates for CLAIMS Direct client instances are the client PostgreSQL database itself, the remote web client, and the server-side web service endpoints (CDWS), as shown in the diagram below:

 

[Diagram: alexandria-update-outline]

 

Content is processed on the IFI CLAIMS primary instance based on the concept of a load source. A load source is a particular data feed from an issuing authority; it can include new complete documents, updated complete documents, or partial document updates. As these load sources are processed into the primary data warehouse, they are stamped with a load-id (an identifier used to group sets of documents together) and are immediately made available for client download.


Client instances download and process these load-ids into the PostgreSQL database (see PostgreSQL Updates below). The load-ids are then queued up to be indexed by SOLR (see SOLR Indexing below).
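Concretely, one full cycle on a client instance looks like the following sketch (the credentials are placeholders; both commands are covered in detail in the sections below):

# 1. Download and process the next available load-id into PostgreSQL.
#    On success, apgup registers the load-id as ready for indexing.
apgup --update --user=xxx --password=yyy

# 2. A running aidxd daemon picks the load-id up from the queue
#    (reporting.t_client_index_process) and indexes it into SOLR.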

What is a load-id?

Every document within the data warehouse was loaded as part of a group of documents. This group is identified by a load-id (an integer value). There are three types of load-ids in the data warehouse: (1) created-load-id, (2) modified-load-id, and (3) deleted-load-id. The created-load-id represents the load-id in which a document was added to the database, the modified-load-id represents the load-id that last modified the document, and the deleted-load-id represents the load-id in which the document was marked as deleted.
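For example, assuming the documents live in a table that exposes these three values as columns (the table and column names below are illustrative assumptions, not taken from this page), you could inspect them with psql:

# Hypothetical sketch: table and column names are assumptions
psql alexandria -c "
  SELECT created_load_id, modified_load_id, deleted_load_id
    FROM xml.t_patent_document_values
   LIMIT 5;"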

All metadata pertaining to updates of both PostgreSQL and SOLR is contained in the PostgreSQL schema reporting.
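A quick way to see what is tracked there is to list the tables in the reporting schema; this sketch assumes the default database name, alexandria:

# List the update-tracking tables in the reporting schema
psql alexandria -c '\dt reporting.*'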

PostgreSQL Updates

apgup (alexandria_update) is the mechanism through which client instances communicate with the IFI primary instance and is the only method of obtaining new or updated content for the client PostgreSQL database. You will need authentication credentials from IFI CLAIMS in order to communicate with the IFI primary server.

| Action | Example Command |
| ------ | --------------- |
| Check for available content updates | apgup --check --user=xxx --password=yyy |
| Download and process the next available content update | apgup --update --user=xxx --password=yyy |
| Continually process available content updates | nohup apgup --update --continue --user=xxx --password=yyy & |
| Stop apgup when in --continue mode (from a separate terminal window) | apgup --stop |
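For unattended operation, a common pattern is to run apgup in continuous mode with its output captured for troubleshooting. This is a sketch: the credentials are placeholders and the log path is a choice, not a CLAIMS Direct default.

# Poll for new load-ids every 60 seconds, logging all output
nohup apgup --update --continue --interval=60 \
      --user=xxx --password=yyy >> /var/log/alexandria/apgup.log 2>&1 &

# From another terminal, ask the running instance to exit gracefully
apgup --stop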


Detailed Usage (apgup)

apgup [ Options ... ]

  --help           print this usage and exit
  --update         update local database
    --continue     continue processing available load-ids
    --interval=i   number of seconds to sleep before continuing (default=10)
    --die_on_error Break out of continue loop if any error conditions arise
                     (default=false, i.e., continue trying to update)
  --check          check for available updates but don't proceed to actually update
    --limit=n      limit display of available updates to n (default=10)
  --stop           trigger a graceful exit while in continuous mode
  --maxloadid=i    force a specific max-client-load-id value
  --noindex        don't trigger index of new load-id(s)

  Required Authorization Arguments
  --------
  --user=s         basic authorization name
  --password=s     basic authorization password

  Optional Update Arguments
  --------
  --url            Alexandria services url
  --update_method  method name for requesting server updates
  --status_method  method name for getting/setting session status
  --check_method   method name for checking available updates
  --schema         schema name for temporary work tables
  --force          force the update even if it is redundant
  --tmp            temporary directory for update processing
  --facility=s     logging facility (default=apgup)

  Optional Database Argument
  --------
  --pgdbname=conf  default database connection name (alexandria)


SOLR Indexing

Indexing into SOLR is controlled by an indexing daemon, aidxd. This daemon probes PostgreSQL for available load-id(s) to index. The queue is represented by the table reporting.t_client_index_process. When processing into PostgreSQL completes successfully, apgup registers a new, index-ready load-id. The indexing daemon recognizes this as an available load-id and begins the indexing process for that particular load-id.
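To see what is waiting in (or has passed through) the queue, you can inspect that table directly. Its column layout is not documented on this page, so the sketch below simply selects whole rows:

# Peek at the most recent entries in the index queue polled by aidxd
psql alexandria -c "
  SELECT *
    FROM reporting.t_client_index_process
   ORDER BY 1 DESC
   LIMIT 5;"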

| Action | Example Command |
| ------ | --------------- |
| Start the indexing daemon | aidxd --tmp=/scratch/solr-load-tmp |
| Pause the indexing daemon | kill -s USR1 <pid> |
| Resume a paused daemon | kill -s USR2 <pid> |
| Stop the indexing daemon | kill -s USR1 <pid> && kill -s INT <pid> |
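Because aidxd writes its PID to /var/log/alexandria/aidxd.pid by default (see the usage below), the signal-based controls above can be scripted rather than typed with a literal <pid>:

# Read the PID from the default pidfile and drive the daemon with signals
PID=$(cat /var/log/alexandria/aidxd.pid)
kill -s USR1 "$PID"                         # pause indexing
kill -s USR2 "$PID"                         # resume indexing
kill -s USR1 "$PID" && kill -s INT "$PID"   # stop gracefully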

 

Detailed Usage (aidxd)

aidxd [ Options ... ]

  --nodaemon    don't put process into background
    --once      only process one load-id
  --pidfile=s   specify location of PIDFILE
                  (default=/var/log/alexandria/aidxd.pid)
  --interval=i  n-seconds between probing for new loads
  --tmp=dir     specify temporary file storage (default=/tmp)
  --clean       remove temporary processing directory
  --batchsize=i maximum number of documents to parallelize
  --nthreads=i  maximum number of processes to parallelize
  --facility=s  logging facility (default=aidxd)
  --help        print this usage and exit
  --------
  --idxcls=s    Specify indexing class
  --idxexe=s    specify indexing script (default aidx)
    --quiet     suppress output from sub-process
                NOTE: suppressing this output will make it difficult
                      to track down errors originating in --idxexe
  --pgdbname=s   config entry or source database (default=alexandria)
  --solrdbname=s base url for index (default=alexandria)
    --core=s     index core (default=alexandria)
  --autooptimize issue an 'optimize' call to SOLR after optinterval
                 continuous load-id(s)
    --optinterval  # of load-id(s) after which an optimize is issued
                   (default=100)
    --optsegments=n optimize down to n-segments (default=16)
