Blog
Skip to end of metadata
Go to start of metadata

The CLAIMS Direct data warehouse is updated continuously. On average, 1-3% of the entire database is touched weekly. These 1-3 million updates fall into two categories: new documents (roughly 20%) and updated documents (roughly 80%).  New documents are, of course, new publications from issuing authorities, US, EP etc. Updates are generally, but not limited to, CPC changes, family changes (adding family ID), IFI integrated content (patent status, standard names and claims summaries), reassignments, legal status, et al. I suggest reading about content updates for a better understanding of load sources and issuing authorities. This blog will describe how to determine what has been updated and further, differentiating new content from updated content at the document level as well as container level.

Load Identifiers

As described in the content update documentation,

Every document within the data warehouse was loaded as part of a group of documents. This set of documents is identified by a load-id (integer value). There are 3 types of load-ids in the data warehouse: created-load-id, deleted-load-id and modified-load-id. The created-load-id represents the load-id in which a document was added to the database, the modified-load-id represents the load-id that last modified the document and the deleted-load-id represents the load-id in which the document marked as deleted.

The table xml.t_patent_document_values houses the load-ids that affect the overall status of the entire document.

  • modified_load_id – the load-id that last modified the document, whether it be addition, correction or deletion of content
  • deleted_load_id – the load-id that the entire document was marked deleted
  • created_load_id – the load-id when the document was added to the data warehouse

 

Each satellite table also contains a modified_load_id column that is specific to the content in that table and will always be <= xml.t_patent_document_values(modified_load_id). The following table should clarify the sequence of update events and the effect on load-id values for three database tables.

dateload-sourceLoad-Idtablecreated_load_idmodified_load_iddeleted_load_id
20161004US10000xml.t_patent_document_values1000010000NULL
   xml.t_ifi_integrated_content
assuming ifi integrated content is not delivered on date of publication 
n/an/an/a
   xml.t_partiesn/a10000n/a
20161006IFI-CONT10001xml.t_patent_document_values1000010001NULL
   xml.t_ifi_integrated_contentn/a10001n/a
   xml.t_parties
parties are not updated with IFI content loads 
n/a10000n/a
20161011DOCDB10002xml.t_patent_document_values1000010002NULL
   xml.t_ifi_integrated_contentn/a10001n/a
   xml.t_partiesn/a10002n/a

This granularity offers the ability to trace modifications to individual containers (tables) as well as the overall modification status of the document as a whole.

Quick Test

How many US documents published on 20161004 have updated CPC classifications?

select count(*) as docs_with_modified_cpc
from xml.t_patent_document_values as a
  inner join xml.t_classifications_cpc as b
    on (a.publication_id=b.publication_id)
where a.country='US'
  and a.published='2016-10-04'
  and a.deleted_load_id is null -- disregard deleted documents
  and a.withdraw != true        -- disregard withdrawn documents
  /*
    If b.modified_load_id > a.created_load_id,
    the table has been touched and we can therefore
    assume CPC values have been modified
  */
  and b.modified_load_id > a.created_load_id;
 
 docs_with_modified_cpc 
------------------------
                   6145

Out of 6680 documents published on 20161004, 6145 had CPC modifications after publication.

 

Reporting

Knowing which tables and containers have changed is important, but the when and why are also critically important. The information gathered during data warehouse updates is stored in reporting.t_client_load_process. For a detailed description of the contents of this table, see CLAIMS Direct Design Documentation. Of the many questions that can be posed, for our purposes in this blog, we are only concerned with answering the following questions:

  • On any given date, how many documents were added to the data warehouse?
  • On any given date, how many documents were modified in the data warehouse?

The answers to these questions are important for a variety of reasons, most importantly for down-stream processing. In the case of CLAIMS Direct primary data warehouses,  we trigger indexing based on the answer to these questions. The first thing to look at is what happened on a certain date, for the sake of an example, we will use 20161004.

select load_source, count(*)
from reporting.t_client_load_process
where date_trunc('day', completed_stamp) = '2016-10-04'::date
group by load_source
order by load_source;


 load_source  | count 
--------------+-------
 DOCDB        |     7
 IFI-CLAIMINF |     1
 IFI-Cont     |    15
 IPCR         |    47
 US           |     1
 USPTO-OG     |     1
 USREAS       |     1

How many documents were added to the data warehouse?

select count(*) as new_docs
from xml.t_patent_document_values
where created_load_id in (
  select load_id
  from reporting.t_client_load_process
  where date_trunc('day', completed_stamp) = '2016-10-04'::date
);

 new_docs 
----------
     6680

... and which documents were updated?

select count(*) as updated_docs
from xml.t_patent_document_values
where created_load_id < modified_load_id
    and modified_load_id in (
      select load_id
      from reporting.t_client_load_process
      where date_trunc('day', completed_stamp) = '2016-10-04'::date
    );

 updated_docs 
----------
   362388

In part II of this series, I will detail methods which can be used to monitor this table and react to changes either at the load-id-level or over the course of an interval of time.

  • No labels