Blog


Processes

The main executable script used for indexing is aidx, delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into SOLR XML, and submitting it to SOLR via HTTP POST for indexing. The conversion from CLAIMS Direct XML to SOLR XML is handled by the indexer class (default: Alexandria::DWH::Index::Document). Alexandria::Client::Tools also provides an indexing daemon, aidxd, which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by apgup.

Process   Package
aidx      Alexandria::Library
aidxd     Alexandria::Client::Tools
apgup     Alexandria::Client::Tools

Source Data

Source XML is extracted from the PostgreSQL data warehouse using Alexandria::DWH::Extract, part of the core library functionality exposed by the Alexandria::Library module. The Extract module can pull data based on a number of criteria, the most common of which are:

  • load-id: modified-load-id of xml.t_patent_document_values
  • table: any table name that has publication_id(int) column
  • SQL: raw SQL selecting desired documents by publication_id

Regardless of the extraction criteria, Alexandria::DWH::Extract utilizes an UNLOGGED temporary table to accumulate the desired publication_id(s). Extraction proper is done from this accumulation table in parallel select batches. The degree of parallelization and the number of documents per select are controlled by the parameters batchsize and nthreads. aidx also accepts a dbfunc parameter, which designates the stored function within the PostgreSQL database used to extract the XML data needed for indexing. The current default function is xml.f_patent_document_values_s, which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones, modeled on xml.f_patent_document_values_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).
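A custom function along the lines of myschema.f_barebones could be sketched as below. This is a hypothetical delegating stub only: a real implementation would build and return a pared-down CLAIMS Direct XML document, and the default function's exact name, signature, and return type should be verified against your installation before modeling anything on it.

create or replace function myschema.f_barebones( in_publication_id integer )
  returns xml as
$BODY$
  -- Hypothetical sketch: delegate to the default extraction function.
  -- Replace this body with logic that returns only the
  -- application-specific XML content you need.
  select xml.f_patent_document_values_s( in_publication_id );
$BODY$
  language sql stable;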

 

Command: aidx --table=x

  Accumulation SQL:

    select publication_id
    from x into t1

  Extract SQL:

    select xml.f_patent_document_values_s(t2.publication_id)
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id=t2.publication_id )

Command: aidx --loadid=y

  Accumulation SQL:

    select publication_id
    from xml.t_patent_document_values
    where modified_load_id=y into t1

  Extract SQL:

    select xml.f_patent_document_values_s(t2.publication_id)
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id=t2.publication_id )

Command: aidx --sqlq=USER_SQL

  Accumulation SQL:

    execute USER_SQL into t1

  Extract SQL:

    select xml.f_patent_document_values_s(t2.publication_id)
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id=t2.publication_id )

Command: aidx --table=x --dbfunc=f_my_function

  Accumulation SQL:

    select publication_id
    from x into t1

  Extract SQL:

    select f_my_function(t2.publication_id)
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id=t2.publication_id )

Indexer Class

Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into SOLR. The following abbreviated example from aidx serves to illustrate the process.

#! /usr/bin/perl
use Alexandria::DWH::Extract;
use Alexandria::DWH::Index;
use Alexandria::DWH::Index::Document;

my $idxcls = shift(@ARGV); # indexer class name from the command line

sub _create_solr_document {
  my ( $batch, $xml ) = @_;
  eval "require $idxcls"; # double quotes so the class name interpolates
  return $idxcls->new( document => $xml )->toNode()->toString(1);
}

my $ex = Alexandria::DWH::Extract->new(
  ...
  callbacks => { on_document_processed => \&_create_solr_document }
);
$ex->prepare();
$ex->run();  # every document extracted is sent through _create_solr_document()
$ex->finalize();

Creating a Custom Indexing Class

Creating a custom indexing class is simply a matter of sub-classing Alexandria::DWH::Index::Document and manipulating the SOLR document representation by adding, deleting, or modifying certain fields. There is currently only one method that can be overridden in the sub-class, namely _process_source. The following skeleton module will serve as a basis for the use cases detailed below.

package MyCustomIndexingClass;

use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';
# override _process_source
sub _process_source {
  my $self = shift;

  # we want to process the standard way ...
  $self->SUPER::_process_source();

  # do nothing else
}
1;

You can now specify MyCustomIndexingClass as the command line argument --idxcls to the indexing utility aidx.

aidx --idxcls=MyCustomIndexingClass [ many other arguments ]

Use Cases

Assumptions

The following use cases assume:

  • a valid index entry in /etc/alexandria.xml – this will be different from the default if you have a custom SOLR installation
  • custom indexing class modules are either in the directory you run aidx or in your PERL5LIB path

 

(1) Adding (Injecting), Modifying, and Deleting Fields

For this use case, you will need to modify your SOLR schema for the installation associated with the appropriate configuration index entry. Add the following field definition:

<field name="customInteger" type="tint" indexed="true" stored="true" />

Below is example code to inject customInteger into the SOLR document. Additionally, it shows how to modify the contents of anseries, and how to delete anseries entirely if the publication country is US and the publication date is 20150101 or later.

package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';

 
# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  # the _fields member of $self contains all the
  # SOLR content as a hash reference of array references
  # e.g.
  # _fields =>
  #      NOTE: multiValued=false fields are still represented as an array
  #            but only have one member
  #    pn => [ 'US-5551212-A' ],
  #    anseries => [ '07' ],
  #    icl1 => [ 'A', 'F', 'H' ]

  my $flds = $self->{_fields} || return; # nothing to do

  # inject a new field
  push( @{ $flds->{customInteger} }, 1 ) ;

  # we want to make certain that anseries is not padded, i.e.,
  # we need to be sure it is an integer
  if( $flds->{anseries} && scalar( @{ $flds->{anseries} } ) ) {
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # lastly, we don't want to index anseries for US documents published
    # on or after 20150101
    my $ctry = $flds->{pnctry}->[0];
    my $date = $flds->{pd}->[0];
    if( $ctry eq 'US' && $date > 20141231 ) {
      delete $flds->{anseries};
    }
  }
}
1;

(2) Accessing the CLAIMS Direct Source XML Document

This next use case will examine methods of (re)processing data from the source XML document. The goal will be to create a new multi-valued field to store related documents. The following changes need to be made to the SOLR schema:

<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" />

We first need to define what constitutes a related ucid (rel_ucids). For this example, it is defined as:

  • any related document whose relation element has @type='related-publication'
  • any pct-or-regional-publishing-data

 

The parts of the XML document that are of interest:

<related-documents>
  <relation type="related-publication">
    <document-id>
      <country>US</country>
      <doc-number>20150126456</doc-number>
      <kind>A1</kind>
      <date>20150507</date>
    </document-id>
  </relation>
</related-documents>
<!-- ... -->
<pct-or-regional-publishing-data ucid="WO-2013182650-A1">
  <document-id>
    <country>WO</country>
    <doc-number>2013182650</doc-number>
    <kind>A1</kind>
    <date>20131212</date>
  </document-id>
</pct-or-regional-publishing-data>

As this example is more involved, the following code is broken down by function. A complete listing of code will be provided below.

### routine to parse related documents
sub _parse_related_documents {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  my @a = (); # stores any related-publications

  # if there are no related documents, return empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf("%s-%s-%s",
                 $relation->findvalue('./document-id/country'),
                 $relation->findvalue('./document-id/doc-number'),
                 $relation->findvalue('./document-id/kind')
        )
      );
    }
  }
  return \@a;
}

Points to consider with _parse_related_documents:

  • The source document (XML) representation is an XML::LibXML::Node, named above as patdoc.
  • Utilizing available methods, it is relatively simple to access particular parts of the XML tree.
  • The findvalue calls lack error checking, i.e., we assume every value is present so that the values, combined in sprintf, yield a correctly formatted ucid.


### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}

Points to consider:

  • according to the DTD, there is only ever one related pct document, hence single-value return
  • the ucid attribute is available, differing from the above related-documents

The complete listing:

package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';

# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  my $flds = $self->{_fields} || return; # nothing to do

  my $reldocs = $self->_parse_related_documents( $self->{_source_root} );
  my $pctdoc  = $self->_parse_pct_publishing_data( $self->{_source_root} );
  if( scalar( @{$reldocs} ) ) {
    foreach my $r ( @{$reldocs} ) {
      push( @{ $flds->{rel_ucids} }, $r );
    }
  }
  if( $pctdoc ) {
    push( @{ $flds->{rel_ucids} }, $pctdoc );
  }
}

### routine to parse related documents
sub _parse_related_documents {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  my @a = (); # stores any related-publications

  # if there are no related documents, return empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf("%s-%s-%s",
                 $relation->findvalue('./document-id/country'),
                 $relation->findvalue('./document-id/doc-number'),
                 $relation->findvalue('./document-id/kind')
        )
      );
    }
  }
  return \@a;
}

### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}

1;

Re-indexing Data from CLAIMS Direct Data Warehouse


Introduction

There are a number of reasons one would need to re-index data from the data warehouse. These range from simply repeating a load-id to a complete re-index of the entire contents of the data warehouse. In this blog, I'm going to go over the mechanisms that move data from the data warehouse to the index and ways in which these mechanisms can be used to trigger partial or full re-indexing of data.

Background

In all installations of CLAIMS Direct, data synchronization between the CLAIMS Direct primary data warehouse and on-site installations is driven by the tool apgup. In version 2.1, a new update daemon (apgupd) was introduced to facilitate background processing of new and updated data. The main roles this daemon plays include:

  1. Reconciling differentials between on-site data warehouses and CLAIMS Direct primary data warehouses
  2. Downloading the next available update package for processing within the on-site installation
  3. Processing the update package
    1. Loading data to the data warehouse
    2. Optionally en-queuing that load for indexing

For our topic, we will concentrate on the indexing queue as this drives the indexing process.

Index Queue

The indexing queue is a table in the reporting schema: t_client_index_process. When apgupd completes a load into PostgreSQL, a row is inserted into reporting.t_client_index_process with running_status = 'pending'. The indexing daemon, aidxd, probes the table looking for the highest-priority load_id to index. Upon finding an applicable load_id, aidxd proceeds to extract all publications associated with that load_id and index the documents into SOLR.

The column priority influences which load_id is processed next: higher-priority load_ids are processed before lower-priority ones. This can cause out-of-order processing, as we will discuss below.
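For example, a queued load can be pushed ahead of the rest of the queue by raising its priority directly (the load_id and the value 10 below are hypothetical; any value higher than the other queued rows works):

update reporting.t_client_index_process
  set priority = 10
where load_id = 123456;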

Re-indexing

Individual load_ids

At its simplest, re-indexing is just a matter of resetting a load_id in reporting.t_client_index_process:

update reporting.t_client_index_process
  set running_status='pending',
      completed_stamp=null,
      completed_status=null,
      completed_comment=null
where load_id=123456;

The above SQL resets load_id 123456 so that the indexing daemon processes that load_id next. You can re-index any number of load_ids by resetting each of the rows.
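The same reset can be applied to several load_ids in one statement; a sketch with hypothetical load_ids:

update reporting.t_client_index_process
  set running_status='pending',
      completed_stamp=null,
      completed_status=null,
      completed_comment=null
where load_id in (123456, 123457, 123460);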

Entire Data Warehouse

Unfortunately, resetting all the rows in reporting.t_client_index_process is not sufficient to re-index the entire contents of the data warehouse. This has to do with the fact that the table is empty upon initial load to the on-site instance as both SOLR and PostgreSQL back file deliveries are in sync. Each new update that is processed is queued. To that end, the most efficient way to queue the entire data warehouse for re-indexing is to select the entire xml.t_patent_document_values table and group by modified_load_id.

Before proceeding with the next set of SQL, please be sure both apgupd and aidxd are paused ( kill -s USR1 <pid> )

truncate table reporting.t_client_index_process;
 
insert into reporting.t_client_index_process (load_id, load_source, ndocs, priority)
  select modified_load_id, 'REINDEX', count(*), -1
  from xml.t_patent_document_values
  group by modified_load_id
  order by modified_load_id desc;

A few comments about the above statements:

  1. Truncating the table is required as the load_id column is unique. If you wish to save the contents of that table, simply use the pg_dump utility.
  2. The load_source value is arbitrary. You can use any identifier you want as long as it doesn't exceed 64 characters.
  3. The priority is set to -1 because, in this example, re-indexing is less important than indexing new data.

Use Case

Changing Field Definitions

Modifying the CLAIMS Direct SOLR Schema

The CLAIMS Direct SOLR schema is a compromise between making every piece of the patent document searchable, storing enough metadata to be retrievable in the search result set, and indexing certain fields so they can be efficiently faceted. This compromise keeps the size of the index manageable while allowing efficient return of data from a search. Of course, the concept of manageable is subjective: there could very well be a need to have more fields returned during the search (stored=true) and other fields removed from storage but left searchable. In this use case, we will enable patent citations to be returned during the search. Regardless of the CLAIMS Direct SOLR package, we'll start by editing the schema under <package>/conf-2.1.2. We are making a very simple change, but one that requires a complete re-index of all documents having citations.

Be sure the indexing daemon (aidxd) is paused.

The current field definition

    <field name="pcit" type="alexandria_string" indexed="true" stored="false" multiValued="true" />

becomes

    <field name="pcit" type="alexandria_string" indexed="true" stored="true" multiValued="true" />

To make this change effective, SOLR needs to be restarted or the collection reloaded. For the standalone package, you would just need to restart SOLR:

<package>/solrctl stop ; <package>/solrctl start

For the distributed package, first deploy the updated configuration to the Zookeeper nodes:

cd <package>
./bootstrap-zookeeper.sh

Then reload all nodes:

cd <package>
./reload-collections

Setting up the Index Queue

Although we can go about re-indexing the entire data warehouse as outlined above, we want to be a bit more efficient and choose only the applicable documents, i.e., those that actually have citation data. To this end, we will find only load_ids that contain documents with citations and modify existing reporting.t_client_index_process rows as needed.

-- DELETE load-ids already processed
delete from reporting.t_client_index_process
where load_id in ( select distinct(modified_load_id) from xml.t_citations );
 
-- INSERT load-ids
insert into reporting.t_client_index_process (load_id, load_source, ndocs, priority)
  select modified_load_id, 'CITATION-FIX', count(*), -1
  from xml.t_citations
  group by modified_load_id;

Again, setting priority to -1 allows new documents to be indexed as usual, and documents needing to be updated will be indexed with a lower priority.

Checking Re-Index Status

You can monitor the progress of the re-index by querying the reporting.t_client_index_process table as follows:

select priority, load_source, count(*)
from reporting.t_client_index_process
where running_status = 'pending'
group by priority, load_source
order by priority, load_source;
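Because each queued row also carries an ndocs count, a companion query (a sketch using the same columns shown above) can estimate the number of documents, rather than load-ids, still waiting:

select sum(ndocs) as docs_remaining
from reporting.t_client_index_process
where running_status = 'pending';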



Sorting Through Data Warehouse Updates

The CLAIMS Direct data warehouse is updated continuously. On average, 1-3% of the entire database is touched weekly. These 1-3 million updates fall into two categories: new documents (roughly 20%) and updated documents (roughly 80%). New documents are, of course, new publications from issuing authorities (US, EP, etc.). Updates are generally, but not limited to, CPC changes, family changes (adding family ID), IFI integrated content (patent status, standard names and claims summaries), reassignments, legal status, et al. I suggest reading about content updates for a better understanding of load sources and issuing authorities. This blog will describe how to determine what has been updated and, further, how to differentiate new content from updated content at both the document and the container level.

Load Identifiers

As described in the content update documentation,

Every document within the data warehouse was loaded as part of a group of documents. This set of documents is identified by a load-id (integer value). There are 3 types of load-ids in the data warehouse: created-load-id, deleted-load-id and modified-load-id. The created-load-id represents the load-id in which a document was added to the database, the modified-load-id represents the load-id that last modified the document, and the deleted-load-id represents the load-id in which the document was marked as deleted.

The table xml.t_patent_document_values houses the load-ids that affect the overall status of the entire document.

  • modified_load_id – the load-id that last modified the document, whether it be addition, correction or deletion of content
  • deleted_load_id – the load-id that the entire document was marked deleted
  • created_load_id – the load-id when the document was added to the data warehouse

 

Each satellite table also contains a modified_load_id column that is specific to the content in that table and will always be <= xml.t_patent_document_values(modified_load_id). The following table should clarify the sequence of update events and the effect on load-id values for three database tables.

date      load-source  load-id  table                          created_load_id  modified_load_id  deleted_load_id
20161004  US           10000    xml.t_patent_document_values   10000            10000             NULL
                                xml.t_ifi_integrated_content   n/a              n/a               n/a
                                  (assuming IFI integrated content is not delivered on the date of publication)
                                xml.t_parties                  n/a              10000             n/a
20161006  IFI-CONT     10001    xml.t_patent_document_values   10000            10001             NULL
                                xml.t_ifi_integrated_content   n/a              10001             n/a
                                xml.t_parties                  n/a              10000             n/a
                                  (parties are not updated with IFI content loads)
20161011  DOCDB        10002    xml.t_patent_document_values   10000            10002             NULL
                                xml.t_ifi_integrated_content   n/a              10001             n/a
                                xml.t_parties                  n/a              10002             n/a

This granularity offers the ability to trace modifications to individual containers (tables) as well as the overall modification status of the document as a whole.

Quick Test

How many US documents published on 20161004 have updated CPC classifications?

select count(*) as docs_with_modified_cpc
from xml.t_patent_document_values as a
  inner join xml.t_classifications_cpc as b
    on (a.publication_id=b.publication_id)
where a.country='US'
  and a.published='2016-10-04'
  and a.deleted_load_id is null -- disregard deleted documents
  and a.withdraw != true        -- disregard withdrawn documents
  /*
    If b.modified_load_id > a.created_load_id,
    the table has been touched and we can therefore
    assume CPC values have been modified
  */
  and b.modified_load_id > a.created_load_id;
 
 docs_with_modified_cpc 
------------------------
                   6145

Out of 6680 documents published on 20161004, 6145 had CPC modifications after publication.
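The same comparison can be made for a single document as a quick spot check (a sketch; the ucid used here is arbitrary):

-- If the CPC table's modified_load_id exceeds the document's
-- created_load_id, the CPC container was touched after publication.
select a.ucid, a.created_load_id, b.modified_load_id as cpc_modified_load_id
from xml.t_patent_document_values as a
  inner join xml.t_classifications_cpc as b
    on ( a.publication_id = b.publication_id )
where a.ucid = 'US-5551212-A';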

 

Reporting

Knowing which tables and containers have changed is important, but the when and why are also critically important. The information gathered during data warehouse updates is stored in reporting.t_client_load_process. For a detailed description of the contents of this table, see CLAIMS Direct Design Documentation. Of the many questions that can be posed, for our purposes in this blog, we are only concerned with answering the following questions:

  • On any given date, how many documents were added to the data warehouse?
  • On any given date, how many documents were modified in the data warehouse?

The answers to these questions are important for a variety of reasons, most importantly for down-stream processing. In the case of CLAIMS Direct primary data warehouses, we trigger indexing based on the answers to these questions. The first thing to look at is what happened on a certain date; for the sake of an example, we will use 20161004.

select load_source, count(*)
from reporting.t_client_load_process
where date_trunc('day', completed_stamp) = '2016-10-04'::date
group by load_source
order by load_source;


 load_source  | count 
--------------+-------
 DOCDB        |     7
 IFI-CLAIMINF |     1
 IFI-Cont     |    15
 IPCR         |    47
 US           |     1
 USPTO-OG     |     1
 USREAS       |     1

How many documents were added to the data warehouse?

select count(*) as new_docs
from xml.t_patent_document_values
where created_load_id in (
  select load_id
  from reporting.t_client_load_process
  where date_trunc('day', completed_stamp) = '2016-10-04'::date
);

 new_docs 
----------
     6680

... and how many documents were updated?

select count(*) as updated_docs
from xml.t_patent_document_values
where created_load_id < modified_load_id
    and modified_load_id in (
      select load_id
      from reporting.t_client_load_process
      where date_trunc('day', completed_stamp) = '2016-10-04'::date
    );

 updated_docs 
--------------
       362388

In part II of this series, I will detail methods which can be used to monitor this table and react to changes either at the load-id-level or over the course of an interval of time.

XML Functionality Inside CLAIMS Direct Data Warehouse

Overview

The CLAIMS Direct Web Services (CDWS) offer a variety of entry points into both the data warehouse and SOLR index. These are mid-to-high-level entry points and can satisfy most requirements pertaining to searching and extracting data for a typical search/view application. There are, however, corner cases which may require more intricate extraction of particular information. On the other hand, there may also be situations where massive amounts of data need to be extracted for further, down-stream processing. The following discussion will touch on solutions for both of these cases.

Basic Data Warehouse Table Structure

The basic structure of the PostgreSQL database housing the XML content is best described as a collection of tables containing the raw XML of every patent document available in CLAIMS Direct. Each table is a section of the XML document, referred to as a container, and each table is named according to the XML container data it contains, e.g.,

Table                        Container              Description
xml.t_publication_reference  publication-reference  Document publication information
xml.t_invention_title        invention-title        Document title(s)
xml.t_claims                 claims                 Document claims

The structure of each table is identical, with the exception of xml.t_patent_document_values, which functions as a meta table.

Column            Type
<container>_id    serial primary key
publication_id    integer not null
modified_load_id  integer not null
status            character(1) (v=valid XML, i=invalid XML)
content           XML


The XML is always in the content (PostgreSQL type: XML) column.

select content
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
--------
<publication-reference fvid="71329085" ucid="US-5551212-A">
   <document-id>
    <country>US</country>
    <doc-number>5551212</doc-number>
    <kind>A</kind>
    <date>19960903</date>
    <lang>EN</lang>
  </document-id>
</publication-reference>

The invaluable utility function xml.f_ucid2id takes a ucid and returns a publication_id.

Extracting Pieces of the XML Content

As mentioned above, all content columns are of type XML and therefore the internal PostgreSQL xpath functionality can be used. There are a variety of utility functions provided by CLAIMS Direct that mimic DOM functionality, e.g.
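Because the content column is native XML, PostgreSQL's built-in xpath() function can also be used directly alongside these helpers (standard PostgreSQL behavior: it evaluates an XPath expression and returns an array of matching nodes):

select (xpath('//document-id/date/text()', content))[1] as pd
from xml.t_publication_reference
where publication_id = xml.f_ucid2id('US-5551212-A');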

Function
Example

string

xml.f_textContent(xpath::text,xml)

Extract text of node:

select xml.f_textContent('./document-id/lang', content)
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
----
EN

string

xml.f_getAttribute(att::text,xml)

Extract single attribute value:

select xml.f_getAttribute('ucid', content)
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
----
US-5551212-A

boolean

xml.f_nodeExists(xpath::text,xml)

Test presence of a particular node:

select xml.f_nodeExists('//date', content)
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
----
t

XML

xml.f_findNode(xpath::text,xml)

Select a node (type XML) satisfying the given xpath expression:

select xml.f_findNode('./main-classification', content)
from xml.t_classification_national
where publication_id=xml.f_ucid2id('US-5551212-A');
----
 <main-classification mxw-id="PCL893513162" load-source="mcf">053397</main-classification>

XML[]

xml.f_findNodes(xpath::text,xml)

Select an array of nodes satisfying the given xpath expression:

select xml.f_findNodes('./classification-ipcr', content)
from xml.t_classifications_ipcr
where publication_id=xml.f_ucid2id('US-5551212-A');
----
{
  "<classification-ipcr ...",
  "<classification-ipcr ...",
  "<classification-ipcr ..."
}

XML[]

xml.f_getElementsByTagName(name::text,xml)

Select an array of nodes based on element name:

select xml.f_getElementsByTagName('classification-ipcr', content)
from xml.t_classifications_ipcr
where publication_id=xml.f_ucid2id('US-5551212-A');
----
{
  "<classification-ipcr ...",
  "<classification-ipcr ...",
  "<classification-ipcr ..."
}

string

xml.f_nodeName(xml)

Select the name of the content node:

select xml.f_nodeName( content )
from xml.t_invention_title
where publication_id=xml.f_ucid2id('US-5551212-A');
-----------------
 invention-title

XML[]

xml.f_childNodes(xml)

Select all child nodes:

select xml.f_childNodes( content )
from xml.t_application_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
-----------------------------------------------------------------------------------
 <document-id mxw-id="PAPP60364944" load-source="docdb" format="epo">
   <country>US</country>
   <doc-number>2582893</doc-number>
   <kind>A</kind>
   <date>19930303</date>
   <lang>EN</lang>
 </document-id>
 <document-id mxw-id="PAPP84571955" load-source="patent-office" format="original">
   <country>US</country>
   <doc-number>08025828</doc-number>
   <date>19930303</date>
   <lang>EN</lang>
 </document-id>

string

xml.f_node2string(xml,indent)

Concatenate all child text nodes:

select xml.f_node2string( content::text, 1 )
from xml.t_abstract
where publication_id=xml.f_ucid2id('US-5551212-A');
-----------------------------------------------------------------------------------
 A package is formed by holding a nonextensible strip taut and wrapping it around the upper and/or lower periphery of the bundle and joining the ends of the strip segments e.g. by adhesive bonding. A projecting margin of the strip is folded to overlie the bundle or underlie the bundle so that an L-configuration is imparted to the frame formed around the bundle after the strip is applied thereto.

 

Use Case: Creating a Relational View of XML Content

Although the underlying architecture of the CLAIMS Direct PostgreSQL database is a data warehouse, creating normalized relational views into the XML content is easily achievable using the functions described above. Our example will be to create a relational view of all patent citations. This view will let us SELECT based on criteria as well as GROUP BY and ORDER BY. The first step is to define all the properties of a citation:

  • ref-ucid (cited document)
  • ref-publication-id
  • publication-date
  • country
  • kind
  • load-source
  • format
  • source (examiner or applicant)

Now we can create a function that returns a TABLE of these properties. We'll use publication_id as input to the function and create it in a private schema (mySchema).

create or replace function mySchema.f_rdb_citations( integer )
  returns table (
    publication_id integer,
    ref_publication_id integer,
    ref_ucid varchar(64),
    published date,
    country varchar(2),
    kind varchar(2),
    load_source varchar(32),
    format varchar(32),
    source varchar(32)
  ) as
$BODY$
declare v_content xml;
        v_node xml;
        v_doc_node xml;
begin
 
-- example usage:
-- select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') );
  select x.publication_id, x.content into publication_id, v_content
    from xml.t_citations as x where x.publication_id=$1;
  if not found then
    return;
  end if;
  
  -- loop through all patent citations
  foreach v_node in array
    ( select xml.f_findNodes('//patcit', v_content) ) loop
    select xml.f_getAttribute( 'ucid', v_node ) into ref_ucid;
 
    select x.publication_id into ref_publication_id
      from xml.t_patent_document_values as x
      where ucid=xml.f_getAttribute( 'ucid', v_node );
 
 
    select xml.f_getAttribute( 'load-source', v_node ) into load_source;
    select xml.f_getAttribute('name', xml.f_findNode('./sources/source', v_node ) ) into source;
 
    -- loop through all forms of the document-id(s)
    foreach v_doc_node in array
      ( select xml.f_findNodes('./document-id', v_node) ) loop
      select xml.f_getAttribute( 'format', v_doc_node ) into format;
      select xml.f_textContent( './country', v_doc_node ) into country;
      select xml.f_textContent( './kind', v_doc_node ) into kind;
      select xml.f_textContent( './date', v_doc_node ) into published;
      return next;
    end loop;
  end loop;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100
  ROWS 1000;

Now, we test the results:

select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') ) limit 5;
 publication_id | ref_publication_id |     ref_ucid      | published  | country | kind | load_source | format | source
----------------+--------------------+-------------------+------------+---------+------+-------------+--------+--------
      160817649 |            5778519 | CA-2580978-A1     | 2006-03-30 | CA      | A1   | docdb       | epo    | APP
      160817649 |          127402642 | CN-201918032-U    | 2011-08-03 | CN      | U    | docdb       | epo    | APP
      160817649 |           75847915 | US-20020150866-A1 | 2002-10-17 | US      | A1   | docdb       | epo    | APP
      160817649 |           75947470 | US-20030060942-A1 | 2003-03-27 | US      | A1   | docdb       | epo    | APP
      160817649 |           76046244 | US-20030154010-A1 | 2003-08-14 | US      | A1   | docdb       | epo    | APP

Of course, GROUP BY and ORDER BY are available on the returned columns. In addition to SELECTing citations for a single document, we can also JOIN against the function as a table expression. For example, suppose we are interested in the top 5 cited countries across US grant publications published on 2016-05-03:

select count(*) as n, t2.country as ref_country
from xml.t_patent_document_values t1
  inner join mySchema.f_rdb_citations (t1.publication_id) as t2
    on (t1.publication_id=t2.publication_id)
where t1.country='US'
  and t1.published='2016-05-03'
group by t2.country
order by n desc
limit 5;
   n    | ref_country
--------+---------
 221800 | US
  14285 | WO
  13313 | JP
   7115 | EP
   2554 | CN

Understanding the SOLR Result Set - Sort Parameter

Paging through results is cumbersome and inefficient. In this next segment I'd like to talk about simple and complex sorting. Sorting, used effectively with the rows parameter, can push relevant documents onto the first page of results. Generally, you can sort on any indexed field, and you can also utilize query boosting and functions to influence sort order.

Missing Value Behavior

CLAIMS Direct is configured to return documents with empty sort fields at the top when the direction is asc and at the bottom when the direction is desc.

Static Field Sorts

The most common field to sort on will undoubtedly be the date of publication, pd desc. This pushes newly published documents to the top of the result set. CLAIMS Direct uses pd desc as the default sort order instead of score. If two documents share the same publication date (or any other sort-field value), the tie is broken by the internal Lucene document ID, ascending.

/search/query?q=ab_en:sonar&rows=5&fl=ucid,[docid],pd
         "docs" : [
            {
               "[docid]" : 5536680,
               "pd" : 20160225,
               "ucid" : "US-20160054444-A1"
            },
            {
               "[docid]" : 1274637,
               "pd" : 20160224,
               "ucid" : "EP-2986998-A1"
            },
            {
               "[docid]" : 4335577,
               "pd" : 20160223,
               "ucid" : "US-9268020-B2"
            },
            {
               "[docid]" : 4986794,
               "pd" : 20160218,
               "ucid" : "US-20160049143-A1"
            },
            {
               "[docid]" : 5088703,
               "pd" : 20160218,
               "ucid" : "US-20160047906-A1"
            }
         ]

Multiple sort criteria are also possible and are used mostly to stabilize the initial sort results. The sort parameter is a comma-separated list of field/direction pairs, e.g., pd desc,ad asc; field and direction are separated by a space.

/search/query?q=ab_en:sonar&rows=5&fl=ucid,pd&sort=pd desc,ucid desc
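Because the sort value contains a space and a comma, it must be URL-encoded when requests are built programmatically. A minimal client-side sketch (the endpoint path is taken from the examples above; the helper name is our own):

```python
from urllib.parse import urlencode

def build_query_url(base: str, **params) -> str:
    """Assemble a query URL; urlencode escapes the space and
    comma inside the sort value (space -> +, comma -> %2C)."""
    return base + "?" + urlencode(params)

url = build_query_url(
    "/search/query",
    q="ab_en:sonar", rows=5, fl="ucid,pd",
    sort="pd desc,ucid desc",  # comma-separated field/direction pairs
)
print(url)
# -> /search/query?q=ab_en%3Asonar&rows=5&fl=ucid%2Cpd&sort=pd+desc%2Cucid+desc
```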

Dynamic Field Sort

As mentioned, CLAIMS Direct uses a default sort of pd desc. You are, of course, free to use the dynamic score field to mimic the default SOLR distribution behavior.

/search/query?q=ab_en:sonar&rows=5&fl=ucid,score,pd&sort=score desc
         "docs" : [
            {
               "pd" : 19960322,
               "ucid" : "FR-2724734-A1",
               "score" : 3.781385
            },
            {
               "pd" : 20120523,
               "ucid" : "EP-2454606-A1",
               "score" : 3.3196893
            },
            {
               "pd" : 20151229,
               "ucid" : "US-9223022-B2",
               "score" : 3.3196893
            },
            {
               "pd" : 20150910,
               "ucid" : "US-20150253425-A1",
               "score" : 3.3196893
            },
            {
               "pd" : 20150521,
               "ucid" : "AU-2010273841-B2",
               "score" : 3.2852356
            }
         ]

Random Sort (for Random Results)

CLAIMS Direct uses a utility field to allow random sorting. The field is given as rnd_n, where n is any integer, e.g., rnd_1234. Specifying the same value for n yields the same sort order, so the results are only as random as the input integer. Returning random results (sorting on rnd_n) is an efficient way to sample a wide variety of data.

/search/query?q=ab_en:sonar&rows=3&fl=ucid,score,pd&sort=rnd_1234 desc
/search/query?q=ab_en:sonar&rows=3&fl=ucid,score,pd&sort=rnd_12345 desc
         "docs" : [
            {
               "pd" : 20150902,
               "ucid" : "CN-204605667-U",
               "score" : 1.6596402
            },
            {
               "pd" : 20110224,
               "ucid" : "WO-2010125029-A4",
               "score" : 1.1842703
            },
            {
               "pd" : 19910719,
               "ucid" : "FR-2657063-A1",
               "score" : 1.760531
            }
         ]
//
         "docs" : [
            {
               "pd" : 19720119,
               "ucid" : "GB-1260387-A",
               "score" : 0.4986458
            },
            {
               "pd" : 20160114,
               "ucid" : "US-20160011310-A1",
               "score" : 2.3957791
            },
            {
               "pd" : 20130425,
               "ucid" : "WO-2013056893-A1",
               "score" : 0.9474162
            }
         ],

Sorting on Functions

A previous segment discussed using functions as pseudo fields in the result set. You can also use the result of a function as sort criteria. An interesting use case for sorting on a calculated value is measuring the latency between the application filing date and the grant date. You can calculate this value by subtracting ad_d from pd_d (the tdate counterparts of the filing and publication dates). The following examples use field aliasing in the fl parameter together with sorting by function to return documents with the lowest latency between filing and grant (asc) and the highest latency (desc), displayed in days (86,400,000 milliseconds per day).

/search/query?rows=5&q=ab_en:sonar +ifi_publication_type:G&fl=ucid,ad,pd,days_to_grant:div(sub(pd_d,ad_d),86400000)&sort=sub(pd_d,ad_d) asc
/search/query?rows=5&q=ab_en:sonar +ifi_publication_type:G&fl=ucid,ad,pd,days_to_grant:div(sub(pd_d,ad_d),86400000)&sort=sub(pd_d,ad_d) desc
         "docs" : [
            {
               "pd" : 20131205,
               "ad" : 20130118,
               "ucid" : "DE-102013000802-B3",
               "days_to_grant" : 321.0005
            },
            {
               "pd" : 20160126,
               "ad" : 20150304,
               "ucid" : "US-9244168-B2",
               "days_to_grant" : 328.0001
            },
            {
               "pd" : 20140904,
               "ad" : 20130904,
               "ucid" : "DE-102013014640-B3",
               "days_to_grant" : 365.00064
            },
            {
               "pd" : 20160216,
               "ad" : 20141208,
               "ucid" : "US-9261391-B2",
               "days_to_grant" : 434.99976
            },
            {
               "pd" : 19931007,
               "ad" : 19920421,
               "ucid" : "DE-4213121-C1",
               "days_to_grant" : 534.00006
            }
         ]
//
         "docs" : [
            {
               "pd" : 19890719,
               "ad" : 19730214,
               "ucid" : "GB-1605319-A",
               "days_to_grant" : 5999
            },
            {
               "pd" : 19781220,
               "ad" : 19640511,
               "ucid" : "GB-1536653-A",
               "days_to_grant" : 5336.0005
            },
            {
               "pd" : 20160209,
               "ad" : 20100426,
               "ucid" : "US-9255982-B2",
               "days_to_grant" : 2115
            },
            {
               "pd" : 20160223,
               "ad" : 20120210,
               "ucid" : "US-9268020-B2",
               "days_to_grant" : 1473.9987
            },
            {
               "pd" : 20150217,
               "ad" : 20110322,
               "ucid" : "US-RE45379-E1",
               "days_to_grant" : 1428.0006
            }
         ]
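The days_to_grant arithmetic can be reproduced client-side from the integer pd/ad fields; the small fractional parts in SOLR's output come from performing the subtraction on millisecond timestamps. A sketch checked against the first document above:

```python
from datetime import date

def days_to_grant(ad: int, pd: int) -> int:
    """Latency between filing (ad) and grant (pd), both YYYYMMDD integers."""
    def to_date(n: int) -> date:
        return date(n // 10000, n // 100 % 100, n % 100)
    return (to_date(pd) - to_date(ad)).days

# DE-102013000802-B3: filed 2013-01-18, granted 2013-12-05
print(days_to_grant(20130118, 20131205))  # -> 321, matching SOLR's 321.0005
```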
Understanding the SOLR Result Set - fl parameter

In this first of a series of blogs about SOLR result sets I'd like to talk about returned fields, both static and dynamic.

Stored Fields

Any field that is stored in SOLR can be returned in the result set. The following table shows all available stored fields:

name               | desc                                               | type                                           | indexed
timestamp          | last modification stamp                            | tdate                                          | true
ucid               | unique character identifier                        | string                                         | true
loadid             | meta load identifier                               | tint                                           | true
nclms              | number of claims                                   | tint                                           | true
nindepclms         | number of independent claims                       | tint                                           | true
fam                | family identifier                                  | string                                         | true
pn                 | patent number (tokenized)                          | alexandria_token                               | true
pd                 | publication date                                   | tint                                           | true
an                 | application number (tokenized)                     | alexandria_token                               | true
anseries           | US application series code                         | alexandria_string                              | true
anorig             | original, patent office format application number  | alexandria_token                               | true
ad                 | filing date                                        | tint                                           | true
pri                | priority filing number (tokenized)                 | alexandria_string                              | true
priorig            | original, patent office format priority number     | alexandria_string                              | true
prid               | earliest priority filing date                      | tint                                           | true
icl,cpcl,uscl,ficl | classifications suitable for display               | string                                         | false
inv                | inventors                                          | alexandria_token                               | true
pa                 | applicants/assignees (all formats)                 | alexandria_token                               | true
ifi_pa             | IFI assignees                                      | alexandria_token                               | true
ttl_*              | all titles                                         | mixed: alexandria_text, alexandria_asian_text  | true
ab_*               | all abstracts                                      | mixed: alexandria_text, alexandria_asian_text  | true

The shorthand to return all stored static fields is the asterisk: fl=*

/search/query?q=ucid:US-20160054444-A1&fl=*&rows=1

As with all examples, we discuss results in JSON format.

         "docs" : [
            {
               "timestamp" : "2016-02-27T08:29:31.936Z",
               "ucid" : "US-20160054444-A1",
               "loadid" : 229946,
               "nclms" : 19,
               "nindepclms" : 2,
               "fam" : "-1",
               "pn" : "US-20160054444-A1 US20160054444A1",
               "pd" : 20160225,
               "an" : "US-201514833357-A",
               "anseries" : "14",
               "anorig" : "US-14833357",
               "ad" : 20150824,
               "pri" : [
                  "FR-1457951"
               ],
               "prid" : 20140825,
               "priorig" : [
                  "FR-1457951"
               ],
               "cpcl" : [
                  "G01S  15/588       20130101 LI20160225BHUS        ",
                  "G01S  15/60        20130101 FI20160225BHUS        "
               ],
               "icl" : [
                  "G01S  15/58        20060101ALI20160225BHUS        ",
                  "G01S  15/60        20060101AFI20160225BHUS        "
               ],
               "ifi_pa" : [
                  "ECA ROBOTICS",
                  "ECA ROBOTICS"
               ],
               "inv" : [
                  "Pinto, Marc"
               ],
               "pa" : [
                  "ECA ROBOTICS",
                  "ECA ROBOTICS"
               ],
               "ttl_en" : [
                  "METHOD AND SONAR DEVICE FOR DETERMINING THE SPEED OF MOVEMENT OF A NAVAL VEHICLE IN RELATION TO THE SEA BED"
               ],
               "ab_en" : [
                  "<abstract mxw-id=\"PA168904151\" lang=\"EN\" load-source=\"patent-office\">\n    <p id=\"p-0001\" num=\"0000\">Sonar intended to be carried by a naval vehicle including at least one element for transmitting an acoustic signal, at least one element for receiving the acoustic signal transmitted and reflected on the sea bed and at least two phase centres (PC<sub>1</sub>, PC<sub>2</sub>) that are disposed along a first and a second axis (v<sub>1</sub>, v<sub>2</sub>), respectively, forming an interferometric antenna. The sonar includes elements for determining the speed of movement of the vehicle as a function of the computed value of the relative trim angle (β) formed between a straight line (d<sub>1</sub>) that is perpendicular to the axes (v<sub>1</sub>, v<sub>2</sub>) of the phase centres and a straight line (d<sub>2</sub>) that is perpendicular to the sea bed (F) and of the value determined for the angle of sight.</p>\n  </abstract>"
               ]
            }
         ],

Some things to notice: multi-valued fields are returned as JSON arrays. Also, as inv, pa, and ifi_pa are copyFields populated from multiple source fields, duplicate entries may appear.
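If the duplicates introduced by copyFields are unwanted, they can be collapsed on the client side while preserving order. A minimal sketch:

```python
def dedupe(values: list[str]) -> list[str]:
    """Collapse duplicate entries in a multi-valued field,
    keeping first-seen order."""
    seen: set[str] = set()
    return [v for v in values if not (v in seen or seen.add(v))]

pa = ["ECA ROBOTICS", "ECA ROBOTICS"]  # as returned in the pa field above
print(dedupe(pa))  # -> ['ECA ROBOTICS']
```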

Dynamic Fields

Dynamic fields are fields that are generated either internally, like the relevance score, or by computing values with function queries. Dynamic fields must be listed explicitly in the fl parameter, e.g., fl=score. Below, we request the static ucid field together with the relevance score for 5 results.

/search/query?q=ab_en:sonar&fl=ucid,score&rows=5
         "docs" : [
            {
               "ucid" : "US-20160054444-A1",
               "score" : 1.1746656
            },
            {
               "ucid" : "US-9268020-B2",
               "score" : 2.8773313
            },
            {
               "ucid" : "US-20160047906-A1",
               "score" : 1.1735018
            },
            {
               "ucid" : "US-20160049143-A1",
               "score" : 0.99574935
            },
            {
               "ucid" : "US-20160047891-A1",
               "score" : 1.1675186
            }
         ],

Another type of dynamic field makes use of function queries. A simple, yet contrived, example would be to return the number of dependent claims. CLAIMS Direct stores the total number of claims in nclms as well as the number of independent claims in nindepclms, both as tint. Simple subtraction yields the desired result.

/search/query?q=ab_en:sonar&fl=ucid,nclms,nindepclms,sub(nclms,nindepclms)&rows=2
         "docs" : [
            {
               "ucid" : "US-20160054444-A1",
               "nclms" : 19,
               "nindepclms" : 2,
               "sub(nclms,nindepclms)" : 17
            },
            {
               "ucid" : "US-9268020-B2",
               "nclms" : 74,
               "nindepclms" : 3,
               "sub(nclms,nindepclms)" : 71
            }
         ],
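Note that without an alias, the pseudo field's JSON key is the literal function expression, so client code has to address it verbatim. A sketch parsing a response shaped like the one above:

```python
import json

# Response fragment shaped like the sub(nclms,nindepclms) example above
response = json.loads("""
{"docs": [{"ucid": "US-20160054444-A1",
           "nclms": 19, "nindepclms": 2,
           "sub(nclms,nindepclms)": 17}]}
""")

for doc in response["docs"]:
    # the key is the verbatim function expression
    dependent = doc["sub(nclms,nindepclms)"]
    assert dependent == doc["nclms"] - doc["nindepclms"]
    print(doc["ucid"], dependent)  # -> US-20160054444-A1 17
```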

Field Aliases

Field aliasing allows individual fields in the result set to be renamed. This is particularly useful for giving meaningful names to pseudo fields created by function queries, as in the example above. Any field can be renamed using an alias. The syntax is alias-field-name:original-field-name

Note

ucid is the only exception. When trying to alias the ucid, an HTTP 500 will be returned.

/search/query?q=ab_en:sonar&fl=ucid,total:nclms,independent:nindepclms,dependent:sub(nclms,nindepclms)&rows=1
         "docs" : [
            {
               "ucid" : "US-20160054444-A1",
               "total" : 19,
               "independent" : 2,
               "dependent" : 17
            }
         ],