The main executable script used for indexing is aidx, delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into SOLR XML, and submitting it via HTTP POST to SOLR for indexing. The conversion from CLAIMS Direct XML to SOLR XML is handled by the indexer class (by default, Alexandria::DWH::Index::Document).
Alexandria::Client::Tools also provides an indexing daemon, aidxd, which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by the update daemon apgupd (discussed later in this document).
Source XML is extracted from the PostgreSQL data warehouse using the core library functionality exposed by Alexandria::DWH::Extract. The Extract module can pull data based on a number of criteria, the most common of which are:
- table: any table name that has a publication_id column
- SQL: raw SQL selecting the desired documents by publication_id
Regardless of the extraction criteria, Alexandria::DWH::Extract utilizes an UNLOGGED temporary table to accumulate the desired publication_id(s). Extraction proper is done from this accumulation table in parallel select batches. The degree of parallelization, as well as the number of documents per select, is controlled by command-line parameters.
aidx also accepts a dbfunc parameter, which designates the stored function within the PostgreSQL database used to extract the XML data needed for indexing. The current default function is xml.f_patent_document_s, which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones, modeled on xml.f_patent_document_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).
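As a sketch of the idea, a hypothetical myschema.f_barebones might look like the following. The signature of xml.f_patent_document_s is assumed here (taking a publication_id and returning xml), and the container kept is purely illustrative:

```sql
-- Hypothetical sketch: the real signature of xml.f_patent_document_s may
-- differ; we assume it takes a publication_id and returns xml.
CREATE OR REPLACE FUNCTION myschema.f_barebones( in_pubid integer )
RETURNS xml AS $$
  SELECT xmlelement( name "patent-document",
    -- keep only the container(s) the application actually needs
    ( SELECT content
        FROM xml.t_publication_reference
       WHERE publication_id = in_pubid ) )
$$ LANGUAGE sql STABLE;
```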
|Command|Accumulation SQL|Extract SQL|
|---|---|---|
|aidx --sqlq=USER_SQL|execute USER_SQL into t1|select ... from xml.t_patent_document_values t1 inner join x as t2 on (t1.publication_id = t2.publication_id)|
|aidx --table=x --dbfunc=f_my_function|accumulate publication_id(s) of table x into t1|select f_my_function(...) from xml.t_patent_document_values t1 inner join x as t2 on (t1.publication_id = t2.publication_id)|
Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into SOLR. The following abbreviated example from aidx serves to illustrate the process.
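The original example was not preserved in this copy; the following hypothetical sketch (constructor arguments and method names are assumed, not taken from the actual aidx source) illustrates the shape of the process:

```perl
# Hypothetical sketch only: argument and method names are assumed.
use Alexandria::DWH::Extract;
use Alexandria::DWH::Index::Document;

my $extract = Alexandria::DWH::Extract->new(
  table    => 'x',                 # accumulation criteria
  callback => sub {
    my ($source_xml) = @_;         # CLAIMS Direct XML for one document
    # the indexer class transforms the source into SOLR XML
    my $doc = Alexandria::DWH::Index::Document->new( source => $source_xml );
    # ... the resulting SOLR XML is then submitted via HTTP POST ...
  },
);
$extract->run;
```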
Creating a Custom Indexing Class
Creating a custom indexing class is simply a matter of sub-classing Alexandria::DWH::Index::Document and manipulating the SOLR document representation by adding, deleting, or modifying certain fields. There is currently only one method that can be overridden in the sub-class, namely _process_source. The following skeleton module will serve as a basis for the use cases detailed below.
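A minimal skeleton along those lines (the base-class behavior is assumed from the description above; only _process_source is overridden):

```perl
package MyCustomIndexingClass;

use strict;
use warnings;
use base qw(Alexandria::DWH::Index::Document);

sub _process_source {
  my $self = shift;
  # let the parent class build the standard SOLR document first
  $self->SUPER::_process_source(@_);
  # ... add, modify, or delete fields here (see the use cases below) ...
}

1;
```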
You can now specify MyCustomIndexingClass as the value of the --idxcls command-line argument to the indexing utility aidx.
The following use cases assume:
- a valid index entry in /etc/alexandria.xml (this will be different from the default if you have a custom SOLR installation)
- custom indexing class modules are either in the directory from which you run aidx or in a directory on your Perl module search path
(1) Adding (Injecting), Modifying, and Deleting Fields
For this use case, you will need to modify your SOLR schema for the installation associated with the appropriate configuration index entry. Add the following field definition:
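The definition itself was not preserved in this copy; a plausible definition, reusing the tint type seen elsewhere in the schema, would be:

```xml
<field name="customInteger" type="tint" indexed="true" stored="true" multiValued="false"/>
```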
Below is example code to inject customInteger into the SOLR document. Additionally, it shows how to modify the contents of anseries, and how to delete anseries entirely if the publication country is US and the publication date is later than 2015.
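A hypothetical implementation is sketched below. Note that the field-manipulation helpers (add_field, get_field_values, set_field, delete_field) are assumed for illustration and are not documented Alexandria::DWH::Index::Document API:

```perl
package MyCustomIndexingClass;

# Hypothetical sketch: all field-helper method names are assumed.
use strict;
use warnings;
use base qw(Alexandria::DWH::Index::Document);

sub _process_source {
  my $self = shift;
  $self->SUPER::_process_source(@_);

  # (1) inject a new field (the value 42 is arbitrary)
  $self->add_field( customInteger => 42 );

  my ($ucid) = $self->get_field_values('ucid');  # e.g., US-20160323442-A1
  my ($pd)   = $self->get_field_values('pd');    # e.g., 20160101

  # (2) modify anseries, e.g., normalize to two characters
  my ($anseries) = $self->get_field_values('anseries');
  $self->set_field( anseries => sprintf( '%02d', $anseries ) )
    if defined $anseries;

  # (3) delete anseries for US documents published after 2015
  $self->delete_field('anseries')
    if $ucid && $ucid =~ /^US/ && $pd && $pd > 20151231;
}

1;
```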
(2) Accessing the CLAIMS Direct Source XML Document
This next use case examines methods of (re)processing data from the source XML document. The goal is to create a new multi-valued field to store related documents. The following change needs to be made to the SOLR schema:
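The schema addition was not preserved here; for a multi-valued field named rel_ucid, it presumably resembled:

```xml
<field name="rel_ucid" type="string" indexed="true" stored="true" multiValued="true"/>
```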
We first need to define what constitutes a related ucid, rel_ucid. For this example, it will be defined as:
- any related document which has a document-id
- any pct-or-regional-publishing-data
The parts of the XML document that are of interest:
As this example is more involved, the following code is broken down by function. A complete listing of code will be provided below.
Points to consider:
- The source document (XML) representation is an XML::LibXML::Node.
- Utilizing the available methods, it is relatively simple to access particular parts of the XML tree.
- The findvalue method as used here lacks error checking, i.e., we assume every value is present; combined in sprintf, it returns a correctly formatted ucid.
Points to consider:
- According to the DTD, there is only ever one related PCT document, hence the single-valued return.
- The ucid attribute is available here, differing from the related documents above.
The complete listing:
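The original listing was not preserved in this copy. The following hypothetical sketch recaps the logic described above; the XPath expressions and the add_field helper are assumptions, not the original code:

```perl
# Hypothetical sketch: element paths and helper names are assumed.
sub _add_related_ucids {
  my ( $self, $root ) = @_;   # $root is an XML::LibXML::Node

  # related documents: the ucid must be assembled from its parts
  # (no error checking: every value is assumed present)
  for my $id ( $root->findnodes('.//related-documents//document-id') ) {
    my $ucid = sprintf( '%s-%s-%s',
      $id->findvalue('./country'),
      $id->findvalue('./doc-number'),
      $id->findvalue('./kind') );
    $self->add_field( rel_ucid => $ucid );
  }

  # PCT publishing data: only ever one per the DTD, and here the
  # ucid attribute is directly available
  my ($pct) = $root->findnodes('.//pct-or-regional-publishing-data/document-id');
  $self->add_field( rel_ucid => $pct->getAttribute('ucid') ) if $pct;
}
```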
There are a number of reasons one might need to re-index data from the data warehouse, ranging from simply repeating a load-id to a complete re-index of the entire contents of the data warehouse. In this blog, I'm going to go over the mechanisms that move data from the data warehouse to the index, and ways in which these mechanisms can be used to trigger partial or full re-indexing of data.
In all installations of CLAIMS Direct, data synchronization between the CLAIMS Direct primary data warehouse and on-site installations is driven by the tool apgup. In version 2.1, a new update daemon (apgupd) was introduced to facilitate background processing of new and updated data. The main roles this daemon plays include:
- Reconciling differentials between on-site data warehouses and CLAIMS Direct primary data warehouses
- Downloading the next available update package for processing within the on-site installation
- Processing the update package
- Loading data to the data warehouse
- Optionally en-queuing that load for indexing
For our topic, we will concentrate on the indexing queue as this drives the indexing process.
The indexing queue is a table inside the reporting schema, reporting.t_client_index_process. When apgupd completes a load into PostgreSQL, a row is inserted into this table with a status of pending. The indexing daemon, aidxd, probes the table looking for the highest-priority load_id to index. Upon finding an applicable load_id, aidxd proceeds to extract all publications associated with that load_id and index the documents into SOLR.
The priority column influences which load_id is to be processed next. Higher-priority load_ids are processed before lower-priority ones. This can cause out-of-order processing, as we will discuss below.
At its simplest, re-indexing is just a matter of resetting a row in this queue table.
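The statement itself was not preserved in this copy; it presumably resembled the following, with the status column and its pending value assumed from the description above:

```sql
-- re-queue a single load for indexing (column names assumed)
UPDATE reporting.t_client_index_process
   SET status = 'pending'
 WHERE load_id = 123456;
```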
The above SQL resets load_id 123456 so that the indexing daemon processes that load_id next. You can re-index any number of load_ids by resetting each of their rows.
Entire Data Warehouse
Unfortunately, resetting all the rows in reporting.t_client_index_process is not sufficient to re-index the entire contents of the data warehouse. This is because the table is empty upon initial load of the on-site instance, as the SOLR and PostgreSQL back-file deliveries are in sync; only each subsequently processed update is queued. To that end, the most efficient way to queue the entire data warehouse for re-indexing is to select the entire xml.t_patent_document_values table, grouping by modified_load_id.
Before proceeding with the next set of SQL, please be sure both apgupd and aidxd are paused (kill -s USR1 <pid>).
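The statements themselves were not preserved in this copy; judging from the comments that follow, they presumably resembled:

```sql
-- Sketch with assumed column names.
-- 1) empty the queue: load_id is unique, so existing rows would collide
TRUNCATE TABLE reporting.t_client_index_process;

-- 2) queue every load in the data warehouse at low priority
INSERT INTO reporting.t_client_index_process
            ( load_id, load_source, priority, status )
     SELECT modified_load_id,
            'full-reindex',   -- arbitrary identifier, max 64 characters
            -1,               -- below the priority of new data
            'pending'
       FROM xml.t_patent_document_values
      GROUP BY modified_load_id;
```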
A few comments about the above statements:
- Truncating the table is required, as the load_id column is unique. If you wish to save the contents of the table, simply use the pg_dump utility.
- The load_source column value is chosen at random. You can use any identifier you want, as long as it doesn't exceed 64 characters.
- The priority is set to -1 because, in this example, re-indexing is less important than indexing new data.
Changing Field Definitions
Modifying the CLAIMS Direct SOLR Schema
The CLAIMS Direct SOLR schema is a compromise between having every piece of the patent document searchable, having enough metadata stored to be retrievable in the search result set, and indexing certain fields to be efficiently faceted. This compromise keeps the size of the index manageable while allowing efficient return of data from a search. Of course, the concept of manageable is subjective. There could very well be a need to have more fields returned during the search (stored=true) and other fields removed from storage but left searchable. In this use case, we will enable patent citations to be returned during the search. Regardless of which CLAIMS Direct SOLR package you use, we'll start by editing the schema under <package>/conf-2.1.2. We are making a very simple change, but one that requires a complete re-index of all documents having citations.
Be sure the indexing daemon (aidxd) is paused.
The current field definition:
To make this change effective, SOLR needs to be restarted or the collection reloaded. For the standalone package, you just need to restart SOLR. For the distributed package, first deploy the updated configuration to the ZooKeeper nodes, then reload all the nodes.
Setting up the Index Queue
Although we could re-index the entire data warehouse as outlined above, we want to be a bit more efficient and choose only the applicable documents, i.e., those that actually have citation data. To this end, we will find only the load_ids that contain documents with citations and modify the existing reporting.t_client_index_process rows as needed.
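The statement was not preserved here; a sketch of the idea, with the citations container table name (xml.t_citations) and the queue columns assumed:

```sql
-- re-queue, at low priority, every load containing citation data
UPDATE reporting.t_client_index_process
   SET status   = 'pending',
       priority = -1
 WHERE load_id IN ( SELECT DISTINCT modified_load_id
                      FROM xml.t_citations );
```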
Again, setting priority to -1 allows new documents to be indexed as usual, and documents needing to be updated will be indexed with a lower priority.
Checking Re-Index Status
You can monitor the progress of the re-index by querying the reporting.t_client_index_process table as follows:
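The query was not preserved here; a minimal sketch, assuming a status column as described above:

```sql
SELECT status, count(*)
  FROM reporting.t_client_index_process
 GROUP BY status
 ORDER BY status;
```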
The CLAIMS Direct data warehouse is updated continuously. On average, 1-3% of the entire database is touched weekly. These 1-3 million updates fall into two categories: new documents (roughly 20%) and updated documents (roughly 80%). New documents are, of course, new publications from issuing authorities (US, EP, etc.). Updates are generally, but not limited to, CPC changes, family changes (adding family ID), IFI integrated content (patent status, standard names, and claims summaries), reassignments, legal status, and so on. I suggest reading about content updates for a better understanding of load sources and issuing authorities. This blog will describe how to determine what has been updated and, further, how to differentiate new content from updated content at both the document level and the container level.
As described in the content update documentation, every document within the data warehouse was loaded as part of a group of documents. This set of documents is identified by a load-id (integer value). There are three types of load-id in the data warehouse: created-load-id, deleted-load-id, and modified-load-id. The created-load-id represents the load-id with which a document was added to the database, the modified-load-id represents the load-id that last modified the document, and the deleted-load-id represents the load-id with which the document was marked as deleted.
xml.t_patent_document_values houses the load-ids that affect the overall status of the entire document:
- modified_load_id: the load-id that last modified the document, whether by addition, correction, or deletion of content
- deleted_load_id: the load-id with which the entire document was marked deleted
- created_load_id: the load-id with which the document was added to the data warehouse
Each satellite table also contains a modified_load_id column that is specific to the content in that table and will always be <= xml.t_patent_document_values(modified_load_id). The following table should clarify the sequence of update events and their effect on load-id values for three database tables.
assuming ifi integrated content is not delivered on date of publication
parties are not updated with IFI content loads
This granularity offers the ability to trace modifications to individual containers (tables) as well as the overall modification status of the document as a whole.
How many US documents published on 20161004 have updated CPC classifications?
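The query was not preserved in this copy. A sketch of the comparison, with illustrative table and column names (each container table carries its own modified_load_id, as described above):

```sql
-- documents whose CPC container changed after the document was created;
-- table and column names here are assumed for illustration
SELECT count(*)
  FROM xml.t_patent_document_values v
  JOIN xml.t_classifications_cpc   c
    ON c.publication_id = v.publication_id
 WHERE v.ucid LIKE 'US-%'
   AND v.published = '2016-10-04'
   AND c.modified_load_id > v.created_load_id;
```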
Out of 6680 documents published on 20161004, 6145 had CPC modifications after publication.
Knowing which tables and containers have changed is important, but the when and why are also critically important. The information gathered during data warehouse updates is stored in reporting.t_client_load_process. For a detailed description of the contents of this table, see the CLAIMS Direct Design Documentation. Of the many questions that can be posed, for our purposes in this blog we are only concerned with answering the following:
- On any given date, how many documents were added to the data warehouse?
- On any given date, how many documents were modified in the data warehouse?
The answers to these questions are important for a variety of reasons, most importantly for down-stream processing. In the case of CLAIMS Direct primary data warehouses, we trigger indexing based on the answers to these questions. The first thing to look at is what happened on a certain date; for the sake of an example, we will use 20161004.
How many documents were added to the data warehouse?
... and which documents were updated?
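The original queries were not preserved; a hedged sketch against reporting.t_client_load_process (the column names are assumptions) might look like:

```sql
-- loads processed on 2016-10-04 with their created vs. modified
-- document counts (column names assumed for illustration)
SELECT load_id, load_source, created_count, modified_count
  FROM reporting.t_client_load_process
 WHERE load_date::date = '2016-10-04'
 ORDER BY load_id;
```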
In part II of this series, I will detail methods that can be used to monitor this table and react to changes, either at the load-id level or over an interval of time.
The CLAIMS Direct Web Services (CDWS) offer a variety of entry points into both the data warehouse and SOLR index. These are mid-to-high-level entry points and can satisfy most requirements pertaining to searching and extracting data for a typical search/view application. There are, however, corner cases which may require more intricate extraction of particular information. On the other hand, there may also be situations where massive amounts of data need to be extracted for further, down-stream processing. The following discussion will touch on solutions for both of these cases.
Basic Data Warehouse Table Structure
The basic structure of the PostgreSQL database housing the XML content is best described as a collection of tables containing the raw XML of every patent document available in CLAIMS Direct. Each table is a section of the XML document, referred to as a container, and each table is named according to the XML container data it contains, e.g.,
|Table|Contents|
|---|---|
|xml.t_publication_reference|Document publication information|
The structure of each table is identical, with the exception of xml.t_patent_document_values, which functions as a meta table:
|Column|Definition|
|---|---|
|id|serial primary key|
|publication_id|integer not null|
|modified_load_id|integer not null|
|status|character(1) (v=valid XML, i=invalid XML)|
The XML itself is always in the content column (PostgreSQL type: xml).
The invaluable utility function xml.f_ucid2id takes a ucid and returns a publication_id.
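For example (the ucid here is illustrative):

```sql
SELECT xml.f_ucid2id('US-20160323442-A1');
```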
Extracting Pieces of the XML Content
As mentioned above, all content columns are of type xml, and therefore the internal PostgreSQL xpath functionality can be used. There are also a variety of utility functions provided by CLAIMS Direct that mimic DOM functionality, e.g.:
- extract the text of a node
- extract a single attribute value
- test the presence of a particular node
- select a node (type xml) satisfying a given xpath expression
- select an array of nodes satisfying a given xpath expression
- select an array of nodes based on element name
- select the name of the content node
- select all child nodes
- concatenate all child text nodes
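The helper-function names themselves were not preserved in this copy, but the first few operations can be expressed directly with PostgreSQL's built-in xpath machinery; the table name below is illustrative:

```sql
-- text of a node
SELECT (xpath('//invention-title/text()', content))[1]::text
  FROM xml.t_invention_title
 WHERE publication_id = 12345;

-- a single attribute value
SELECT (xpath('//invention-title/@lang', content))[1]::text
  FROM xml.t_invention_title
 WHERE publication_id = 12345;

-- test the presence of a particular node
SELECT xpath_exists('//invention-title', content)
  FROM xml.t_invention_title
 WHERE publication_id = 12345;
```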
Use Case: Creating a Relational View of XML Content
Although the underlying architecture of the CLAIMS Direct PostgreSQL database is a data warehouse, creating normalized relational views of the XML content is easily achievable using the functions described above. Our example will be to create a relational view of all patent citations. This view will let us SELECT based on criteria, as well as GROUP BY and ORDER BY. The first step is to define the properties of a citation, e.g.:
- source (examiner or applicant)
Now we can create a function that returns a TABLE of these properties. We'll use publication_id as input to the function and create it in a private schema.
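A sketch of such a function; the citation container layout, the xpath expression, and the private schema name are assumptions for illustration:

```sql
CREATE OR REPLACE FUNCTION private.f_citations( in_pubid integer )
RETURNS TABLE ( cited_ucid text ) AS $$
  -- each cited document is assumed to carry a ucid attribute
  SELECT unnest( xpath('//patcit/@ucid', t.content) )::text
    FROM xml.t_citations t
   WHERE t.publication_id = in_pubid
$$ LANGUAGE sql STABLE;
```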
Now we can test the results. GROUP BY and ORDER BY are available on the returned columns. In addition to SELECTing citations for a single document, we can also JOIN against table expressions; e.g., assume we are interested in the top 5 cited countries among US grants published 2016-05-03:
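Assuming a helper function private.f_citations(publication_id) returning cited ucids (hypothetical, as are the column names on xml.t_patent_document_values), the join might be sketched as:

```sql
SELECT substring(c.cited_ucid from 1 for 2) AS country,
       count(*) AS cited
  FROM xml.t_patent_document_values v
 CROSS JOIN LATERAL private.f_citations(v.publication_id) c
 WHERE v.ucid LIKE 'US-%'            -- column names are assumed
   AND v.published = '2016-05-03'
 GROUP BY 1
 ORDER BY cited DESC
 LIMIT 5;
```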
Paging results is cumbersome and inefficient. In this next segment, I'd like to talk about simple and complex sorting. Sorting, used effectively with the rows parameter, can push the most relevant documents into the first page of results. Generally, you can sort on any indexed field, but you can also utilize query boosting and functions to influence the sort order.
Missing Value Behavior
CLAIMS Direct is configured to return documents with empty sort fields at the top when the direction is asc, and at the bottom when the direction is desc.
Static Field Sorts
The most common field to sort on will undoubtedly be the date of publication: pd desc. This pushes the most recently published documents to the top of the result set. CLAIMS Direct uses pd desc as the default sort order instead of score. If two documents share the same publication date (or any sort-field value), the tie is broken by the internal Lucene document IDs, ascending.
Multiple sort criteria are also possible and are used mostly to stabilize the initial sort results. The syntax for the sort parameter is a comma-separated list of field/direction pairs, with field and direction separated by a space, e.g., pd desc,ad asc.
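For example, a request combining both criteria (q and rows shown only to round out the request):

```
q=*:*&sort=pd desc,ad asc&rows=25
```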
Dynamic Field Sort
As mentioned, CLAIMS Direct uses a default sort of pd desc. You are, of course, free to use the dynamic score field to mimic the default SOLR distribution behavior.
Random Sort (for Random Results)
CLAIMS Direct uses a utility field to allow random sorting. The field is input as rnd_n, where n is any random integer, e.g., rnd_1234. Specifying the same value for n will yield the same sort results, so the randomness is only as random as the input integer. Returning random results (sorting on rnd_n) is an efficient way to sample a wide variety of data.
Sorting on Functions
A previous segment discussed using functions as pseudo-fields to be returned in the result set. You can also use the result of a function as sort criteria. An interesting use case for sorting on a calculated value is measuring the latency between the application filing date and the grant date. You can calculate this value by subtracting the filing date from the publication date (both tdate field types). The following examples use field aliasing with the fl parameter, as well as sorting by function, to return documents with low latency between filing and grant (asc) or high latency (desc), displayed in days (86,400,000 milliseconds per day).
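Assuming pd and ad are the publication and filing date fields (both appear in the sort examples above), such a request might look like the following; ms() and div() are standard SOLR function queries, and the latency alias is arbitrary:

```
q=*:*&fl=ucid,latency:div(ms(pd,ad),86400000)&sort=div(ms(pd,ad),86400000) asc
```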
In this first of a series of blogs about SOLR result sets, I'd like to talk about returned fields, both static and dynamic.
Any field that is stored in SOLR can be returned in the result set. The following table shows all available stored fields:
|Field|Description|Type|Stored|
|---|---|---|---|
|timestamp|last modification stamp|tdate|true|
|ucid|unique character identifier|string|true|
|loadid|meta load identifier|tint|true|
|nclms|number of claims|tint|true|
|nindepclms|number of independent claims|tint|true|
|pn|patent number (tokenized)|alexandria_token|true|
|an|application number (tokenized)|alexandria_token|true|
|anseries|US application series code|alexandria_string|true|
|anorig|original, patent office format application number|alexandria_token|true|
|pri|priority filing number (tokenized)|alexandria_string|true|
|priorig|original, patent office format priority number|alexandria_string|true|
|prid|earliest priority filing date|tint|true|
|icl,cpcl,uscl,ficl|classifications suitable for display|string|false|
|pa|applicants/assignees (all formats)|alexandria_token|true|
|ttl_*|all titles|mixed: alexandria_text, alexandria_asian_text|true|
|ab_*|all abstracts|mixed: alexandria_text, alexandria_asian_text|true|
The shorthand to return all static fields is the asterisk: fl=*.
As with all examples, we discuss results in JSON format.
Some things to notice: multi-valued fields are returned as JSON arrays. Also, as some fields are copyFields containing multiple values, duplicate entries may appear.
Dynamic fields are fields that are generated either internally, like the relevance score, or by computing values with function queries. Dynamic fields must be listed explicitly in the fl parameter, e.g., fl=score. Below, we request the static ucid field together with the relevance score for 5 results.
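The corresponding request parameters would be:

```
q=*:*&fl=ucid,score&rows=5
```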
Another type of dynamic field makes use of function queries. A simple, yet contrived, example would be to return the number of dependent claims. CLAIMS Direct stores the total number of claims in nclms, as well as the number of independent claims in nindepclms, both as tint. Simple subtraction yields the desired result.
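Using SOLR's standard sub() function query with an alias (the alias name ndepclms is arbitrary):

```
q=*:*&fl=ucid,ndepclms:sub(nclms,nindepclms)
```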
Field aliasing allows individual fields in the result set to be renamed. This functionality is particularly useful for giving meaningful names to pseudo-fields created by function queries, as in the example above. Any field can be renamed using an alias; the syntax is alias-field-name:original-field-name. The ucid field is the only exception: when attempting to alias ucid, an HTTP 500 will be returned.