Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Processes

The main executable script used for indexing is aidx delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into SOLR Solr XML and submitting via HTTP POST to SOLR Solr for indexing. The conversion process from CLAIMS Direct XML to SOLR Solr XML is handled by the indexer class (default is Alexandria::DWH::Index::Document). Alexandria::Client::Tools also provides an indexing daemon, aidxd which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by apgup.

ProcessPackage
aidxAlexandria::Library
aidxdAlexandria::Client::Tools
apgupAlexandria::Client::Tools

Source Data

Source XML is extracted out of the PostgreSQL data warehouse using the core library functionality exposed by the Alexandria::Library module Alexandria::DWH::Extract. The Extract module can pull data based on a number of criteria, the most common of which are:

...

Regardless of extraction criteria, Alexandria::DWH::Extract utilizes an UNLOGGED temporary table to accumulate desired publication_id(s). Extraction proper is done from this accumulation table in parallel select batches. The amount of parallelization as well as the amount of documents per select are controlled by the parameters batchsize and nthreadsaidx  also accepts a dbfunc parameter which designates the stored function within the PostgreSQL database to use to extract the XML data needed for indexing. The current default function is xml.f_patent_document_s which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones modeled on xml.f_patent_document_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).

 


CommandAccumulation SQLExtract SQL
aidx --table=x

select publication_id
from x into t1

select xml.f_patent_document_values_s(t2.publication_id)
from xml.t_patent_document_values t1
  inner join x as t1 on ( t1.publication_id=t2.publication_id)
aidx --loadid=y

select publication_id
from xml.t_patent_document_values
where modified_load_id=y into t1

select xml.f_patent_document_values_s(t2.publication_id)
from xml.t_patent_document_values t1
  inner join x as t1 on ( t1.publication_id=t2.publication_id)

aidx --sqlq=USER_SQLexecute SQL into t1

select xml.f_patent_document_values_s(t2.publication_id)
from xml.t_patent_document_values t1
  inner join x as t1 on ( t1.publication_id=t2.publication_id)

CommandAccumulation SQLExtract SQL
aidx --table=x --dbfunc=f_my_function

select publication_id
from x into t1

select f_my_function(t2.publication_id)
from xml.t_patent_document_values t1
  inner join x as t1 on ( t1.publication_id=t2.publication_id)

Indexer Class

Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into SOLRSolr. The following abbreviated example from aidx serves to illustrate the process.

Code Block
languagetext
#! /usr/bin/perl
use Alexandria::DWH::Extract;
use Alexandria::DWH::Index;
use Alexandria::DWH::Index::Document;

my $idxcls = shift(@ARGV); # from command line

sub _create_solr_document {
  my ( $batch, $xml ) = @_;
  eval 'require $idxcls';
  return $idxcls->new( document => $xml )->toNode()->toString(1);
}

my $ex = Alexandria::DWH::Extract->new(
  ...
  callbacks => { on_document_processed => \&_create_solr_document }
);
$ex->prepare();
$ex->run();  # every document extracted is sent through _create_solr_document()
$ex->finalize();

Creating a Custom Indexing Class

Creating a custom indexing class is simply a matter of sub-classing the Alexandria::DWH::Index::Document and manipulating the SOLR Solr document representation by either adding, deleting, or modifying certain fields. There is currently only one method that can be overridden in the sub-class, namely, _process_source. The following shell-module will serve as a basis for the use cases detailed below.

...

Code Block
languagetext
aidx --idxcls=MyCustomIndexingClass [ many other arguments ]

 Use Cases

Info
titleAssumptions

The following use cases assume:

  • a valid index entry in /etc/alexandria.xml – this will be different than the default if you have a custom SOLR Solr installation
  • custom indexing class modules are either in the directory you run aidx or in your PERL5LIB path

...


(1) Adding (Injecting), Modifying, and Deleting Fields

For this use case, you will need to modify your SOLR Solr schema for the installation associated with the appropriate configuration index entry. Add the following field definition:

...

Below is example code to inject customInteger into the SOLR Solr document. Additionally, it will show how to modify the contents of anseries and delete anseries if the publication country is US and publication date is later than 2015.

Code Block
languagetext
package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';

 
# override _process source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  # the _fields member of $self contains all the
  # SOLRSolr content as a hash reference of array references
  # e.g.
  # _fields =>
  #      NOTE: multiValued=false fields are still represented as an array
  #            but only have one member
  #    pn => [ 'US-5551212-A' ],
  #    anseries => [ '07' ],
  #    icl1 => [ 'A', 'F', 'H' ]

  my $flds = $self->{_fields} || return; # nothing to do

  # inject a new field
  push( @{ $flds->{customInteger} }, 1 ) ;

  # we want to make certain that anseries is not padded, i.e.,
  # we need to be sure it is an integer
  if( scalar( $flds->{anseries} ) ) {
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # lastly, we don't want to index anseries for US documents published
    # after 20150101
    my $ctry = $flds->{pnctry}->[0];
    my $date = $flds->{pd}->[0];
    if( $ctry eq 'US' && $date > 20141231 ) {
      delete $flds->{anseries};
    }
  }
}
1;

(2) Accessing the CLAIMS Direct Source XML Document

This next use case will examine methods of (re)processing data from the source XML document. The goal will be to create a new multi-valued field to store related documents. The following changes need to be made to the SOLR Solr schema:

Code Block
xml
xml
<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" />

...

  • any related documents which have a @relation=related-publication
  • any pct-or-regional-publishing-data

...


The parts of the XML document that are of interest:

...