Blog from October, 2018

Solr Indexing Process Explained


Processes

The main executable script used for indexing is aidx, delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into Solr XML, and submitting it via HTTP POST to Solr for indexing. The conversion from CLAIMS Direct XML to Solr XML is handled by the indexer class (default: Alexandria::DWH::Index::Document). Alexandria::Client::Tools also provides an indexing daemon, aidxd, which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by apgup.

Process   Package
aidx      Alexandria::Library
aidxd     Alexandria::Client::Tools
apgup     Alexandria::Client::Tools
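The HTTP POST step described above is a standard Solr update request, which aidx performs for you. Purely as an illustration, a hand-submitted document might look like this (the host, port, and collection name are assumptions, not part of the shipped configuration):

```shell
# Post one Solr XML document to a hypothetical local Solr core;
# host, port, and core name ("claims") are assumptions.
curl 'http://localhost:8983/solr/claims/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<add><doc><field name="pn">US-5551212-A</field></doc></add>'
```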

Source Data

Source XML is extracted from the PostgreSQL data warehouse using Alexandria::DWH::Extract, part of the core library functionality exposed by the Alexandria::Library module. The Extract module can pull data based on a number of criteria, the most common of which are:

  • load-id: modified-load-id of xml.t_patent_document_values
  • table: any table name that has publication_id(int) column
  • SQL: raw SQL selecting desired documents by publication_id

Regardless of extraction criteria, Alexandria::DWH::Extract utilizes an UNLOGGED temporary table to accumulate the desired publication_id(s). Extraction proper is done from this accumulation table in parallel select batches. The degree of parallelization as well as the number of documents per select are controlled by the parameters batchsize and nthreads. aidx also accepts a dbfunc parameter which designates the stored function within the PostgreSQL database to use to extract the XML data needed for indexing. The current default function is xml.f_patent_document_s which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones modeled on xml.f_patent_document_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).
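Putting these parameters together, an invocation might look like the following. This is a hedged sketch: the --batchsize and --nthreads flag spellings are assumptions and should be checked against `aidx --help` on your installation.

```shell
# Hypothetical aidx run: extract load y using 4 parallel selects of
# 500 documents each, with a custom extraction function.
# Flag spellings for batchsize/nthreads are assumptions.
aidx --loadid=y --batchsize=500 --nthreads=4 --dbfunc=myschema.f_barebones
```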


Command: aidx --table=x

  Accumulation SQL:

    select publication_id
    from x into t1

  Extract SQL:

    select xml.f_patent_document_values_s(t2.publication_id)
    from xml.t_patent_document_values t1
      inner join t1 as t2 on ( t1.publication_id = t2.publication_id )

Command: aidx --loadid=y

  Accumulation SQL:

    select publication_id
    from xml.t_patent_document_values
    where modified_load_id=y into t1

  Extract SQL:

    select xml.f_patent_document_values_s(t2.publication_id)
    from xml.t_patent_document_values t1
      inner join t1 as t2 on ( t1.publication_id = t2.publication_id )

Command: aidx --sqlq=USER_SQL

  Accumulation SQL:

    execute USER_SQL into t1

  Extract SQL:

    select xml.f_patent_document_values_s(t2.publication_id)
    from xml.t_patent_document_values t1
      inner join t1 as t2 on ( t1.publication_id = t2.publication_id )

Command: aidx --table=x --dbfunc=f_my_function

  Accumulation SQL:

    select publication_id
    from x into t1

  Extract SQL:

    select f_my_function(t2.publication_id)
    from xml.t_patent_document_values t1
      inner join t1 as t2 on ( t1.publication_id = t2.publication_id )
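The custom-function idea mentioned earlier (myschema.f_barebones) can be sketched in PostgreSQL roughly as follows. This is a hypothetical skeleton only: the actual signature and return type of xml.f_patent_document_s, and the column names used here, are assumptions that must be verified against your installation.

```sql
-- Hypothetical sketch: parameter list, return type, and column names
-- are assumptions modeled on xml.f_patent_document_s.
create or replace function myschema.f_barebones(_publication_id integer)
returns xml as $$
  select xmlelement(
           name "patent-document",
           xmlattributes(t.ucid as "ucid")
         )
  from xml.t_patent_document_values t
  where t.publication_id = _publication_id
$$ language sql stable;
```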

Indexer Class

Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into Solr. The following abbreviated example from aidx serves to illustrate the process.

#! /usr/bin/perl
use Alexandria::DWH::Extract;
use Alexandria::DWH::Index;
use Alexandria::DWH::Index::Document;

my $idxcls = shift(@ARGV); # from command line

sub _create_solr_document {
  my ( $batch, $xml ) = @_;
  eval "require $idxcls";  # load the indexer class named on the command line
  return $idxcls->new( document => $xml )->toNode()->toString(1);
}

my $ex = Alexandria::DWH::Extract->new(
  ...
  callbacks => { on_document_processed => \&_create_solr_document }
);
$ex->prepare();
$ex->run();  # every document extracted is sent through _create_solr_document()
$ex->finalize();

Creating a Custom Indexing Class

Creating a custom indexing class is simply a matter of sub-classing Alexandria::DWH::Index::Document and manipulating the Solr document representation by adding, deleting, or modifying certain fields. There is currently only one method that can be overridden in the sub-class, namely _process_source. The following skeleton module will serve as a basis for the use cases detailed below.

package MyCustomIndexingClass;

use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';
# override _process_source
sub _process_source {
  my $self = shift;

  # we want to process the standard way ...
  $self->SUPER::_process_source();

  # do nothing else
}
1;

You can now specify MyCustomIndexingClass as the command line argument --idxcls to the indexing utility aidx.

aidx --idxcls=MyCustomIndexingClass [ many other arguments ]

Use Cases

Assumptions

The following use cases assume:

  • a valid index entry in /etc/alexandria.xml – this will differ from the default if you have a custom Solr installation
  • custom indexing class modules are either in the directory you run aidx or in your PERL5LIB path
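To satisfy the module-path assumption above, the custom class can be put on PERL5LIB before running aidx. The directory below is hypothetical:

```shell
# /opt/custom is a hypothetical directory containing MyCustomIndexingClass.pm
export PERL5LIB=/opt/custom:$PERL5LIB
aidx --idxcls=MyCustomIndexingClass --loadid=y
```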


(1) Adding (Injecting), Modifying, and Deleting Fields

For this use case, you will need to modify your Solr schema for the installation associated with the appropriate configuration index entry. Add the following field definition:

<field name="customInteger" type="tint" indexed="true" stored="true" />

Below is example code to inject customInteger into the Solr document. Additionally, it shows how to modify the contents of anseries, and how to delete anseries when the publication country is US and the publication date is in or after 2015.

package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';

 
# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  # the _fields member of $self contains all the
  # Solr content as a hash reference of array references
  # e.g.
  # _fields =>
  #      NOTE: multiValued=false fields are still represented as an array
  #            but only have one member
  #    pn => [ 'US-5551212-A' ],
  #    anseries => [ '07' ],
  #    icl1 => [ 'A', 'F', 'H' ]

  my $flds = $self->{_fields} || return; # nothing to do

  # inject a new field
  push( @{ $flds->{customInteger} }, 1 ) ;

  # we want to make certain that anseries is not padded, i.e.,
  # we need to be sure it is an integer
  if( $flds->{anseries} && scalar( @{ $flds->{anseries} } ) ) {
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # lastly, we don't want to index anseries for US documents published
    # on or after 20150101
    my $ctry = $flds->{pnctry}->[0];
    my $date = $flds->{pd}->[0];
    if( $ctry eq 'US' && $date > 20141231 ) {
      delete $flds->{anseries};
    }
  }
}
1;

(2) Accessing the CLAIMS Direct Source XML Document

This next use case will examine methods of (re)processing data from the source XML document. The goal will be to create a new multi-valued field to store related documents. The following changes need to be made to the Solr schema:

<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" />

We first need to define what constitutes a related ucid (rel_ucids). For this example, it will be defined as:

  • any related-documents/relation entry with @type="related-publication"
  • the pct-or-regional-publishing-data entry, if present


The parts of the XML document that are of interest:

<related-documents>
  <relation type="related-publication">
    <document-id>
      <country>US</country>
      <doc-number>20150126456</doc-number>
      <kind>A1</kind>
      <date>20150507</date>
    </document-id>
  </relation>
</related-documents>
<!-- ... -->
<pct-or-regional-publishing-data ucid="WO-2013182650-A1">
  <document-id>
    <country>WO</country>
    <doc-number>2013182650</doc-number>
    <kind>A1</kind>
    <date>20131212</date>
  </document-id>
</pct-or-regional-publishing-data>

As this example is more involved, the following code is broken down by function. A complete listing of code will be provided below.

### routine to parse related documents
sub _parse_related_documents {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  my @a = (); # stores any related-publications

  # if there are no related documents, return empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf("%s-%s-%s",
                 $relation->findvalue('./document-id/country'),
                 $relation->findvalue('./document-id/doc-number'),
                 $relation->findvalue('./document-id/kind')
        )
      );
    }
  }
  return \@a;
}

Points to consider with _parse_related_documents:

  • The source document (XML) representation is an XML::LibXML::Node, named above as patdoc.
  • Utilizing available methods, it is relatively simple to access particular parts of the XML tree.
  • The findvalue calls lack error checking, i.e., we assume every value is present so that the sprintf call returns a correctly formatted ucid.
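If stricter behavior is preferred, a defensive variant of the loop body could skip incomplete entries. This is a sketch, not part of the shipped library:

```perl
# Defensive sketch: only emit a ucid when every document-id part is present
my $ctry = $relation->findvalue('./document-id/country');
my $num  = $relation->findvalue('./document-id/doc-number');
my $kind = $relation->findvalue('./document-id/kind');
if( length($ctry) && length($num) && length($kind) ) {
  push( @a, sprintf( "%s-%s-%s", $ctry, $num, $kind ) );
}
```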


### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}

Points to consider:

  • according to the DTD, there is only ever one related pct document, hence single-value return
  • the ucid attribute is available here, unlike in the related-documents case above

The complete listing:

package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';

# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  my $flds = $self->{_fields} || return; # nothing to do

  my $reldocs = $self->_parse_related_documents( $self->{_source_root} );
  my $pctdoc  = $self->_parse_pct_publishing_data( $self->{_source_root} );
  if( scalar( @{$reldocs} ) ) {
    foreach my $r ( @{$reldocs} ) {
      push( @{ $flds->{rel_ucids} }, $r );
    }
  }
  if( $pctdoc ) {
    push( @{ $flds->{rel_ucids} }, $pctdoc );
  }
}

### routine to parse related documents
sub _parse_related_documents {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  my @a = (); # stores any related-publications

  # if there are no related documents, return empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf("%s-%s-%s",
                 $relation->findvalue('./document-id/country'),
                 $relation->findvalue('./document-id/doc-number'),
                 $relation->findvalue('./document-id/kind')
        )
      );
    }
  }
  return \@a;
}

### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}

1;