Processes

The main executable script used for indexing is aidx, delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into Solr XML, and submitting it via HTTP POST to Solr for indexing. The conversion from CLAIMS Direct XML to Solr XML is handled by the indexer class (the default is Alexandria::DWH::Index::Document). Alexandria::Client::Tools also provides an indexing daemon, aidxd, which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by apgup.

Process	Package
aidx	Alexandria::Library
aidxd	Alexandria::Client::Tools
apgup	Alexandria::Client::Tools
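
For orientation, the Solr XML mentioned above follows Solr's standard XML update format: an add element wrapping a doc element, with one field element per value. The sketch below is not the code in Alexandria::DWH::Index::Document; it is a core-Perl illustration, and the helper names (xml_escape, fields_to_solr_xml) and the sample values are assumptions made for this example.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text and attribute values.
sub xml_escape {
  my ($s) = @_;
  $s =~ s/&/&amp;/g;
  $s =~ s/</&lt;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/"/&quot;/g;
  return $s;
}

# Serialize a hash of field name => array reference of values
# into a Solr <add> message containing a single <doc>.
sub fields_to_solr_xml {
  my ($flds) = @_;
  my @out = ( '<add>', '  <doc>' );
  for my $name ( sort keys %$flds ) {
    for my $value ( @{ $flds->{$name} } ) {
      push @out, sprintf( '    <field name="%s">%s</field>',
        xml_escape($name), xml_escape($value) );
    }
  }
  push @out, '  </doc>', '</add>';
  return join( "\n", @out ) . "\n";
}

# hypothetical field values
print fields_to_solr_xml( { pn => ['US-5551212-A'], icl1 => [ 'A', 'F', 'H' ] } );
```

A multi-valued field such as icl1 simply repeats the field element once per value.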

Source Data

Source XML is extracted out of the PostgreSQL data warehouse using the core library functionality exposed by the Alexandria::Library module Alexandria::DWH::Extract. The Extract module can pull data based on a number of criteria, the most common of which are:

...

Regardless of extraction criteria, Alexandria::DWH::Extract uses an UNLOGGED temporary table to accumulate the desired publication_id(s). Extraction proper is done from this accumulation table in parallel select batches. The degree of parallelization and the number of documents per select are controlled by the parameters nthreads and batchsize, respectively. aidx also accepts a dbfunc parameter, which designates the stored function within the PostgreSQL database used to extract the XML data needed for indexing. The current default function is xml.f_patent_document_s, which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones, modeled on xml.f_patent_document_s (i.e., accepting the same parameters) but returning CLAIMS Direct XML with only application-specific XML content.

Command	Accumulation SQL	Extract SQL
aidx --table=x	select publication_id from x into t1	select xml.f_patent_document_values_s(t2.publication_id) from xml.t_patent_document_values t2 inner join t1 on ( t1.publication_id=t2.publication_id )
aidx --loadid=y	select publication_id from xml.t_patent_document_values where modified_load_id=y into t1	select xml.f_patent_document_values_s(t2.publication_id) from xml.t_patent_document_values t2 inner join t1 on ( t1.publication_id=t2.publication_id )
aidx --sqlq=USER_SQL	execute SQL into t1	select xml.f_patent_document_values_s(t2.publication_id) from xml.t_patent_document_values t2 inner join t1 on ( t1.publication_id=t2.publication_id )

Command	Accumulation SQL	Extract SQL
aidx --table=x --dbfunc=f_my_function	select publication_id from x into t1	select f_my_function(t2.publication_id) from xml.t_patent_document_values t2 inner join t1 on ( t1.publication_id=t2.publication_id )
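
The interplay of batchsize and nthreads described above can be pictured with a small core-Perl sketch. It is only an illustration: how Alexandria::DWH::Extract actually schedules its select batches is not documented here, so the round-robin assignment and the helper names make_batches and assign_batches are assumptions.

```perl
use strict;
use warnings;

# Split a list of publication_ids into batches of at most $batchsize ids.
sub make_batches {
  my ( $batchsize, @ids ) = @_;
  die "batchsize must be positive\n" unless $batchsize > 0;
  my @batches;
  while (@ids) {
    push @batches, [ splice( @ids, 0, $batchsize ) ];
  }
  return @batches;
}

# Deal the batches round-robin to $nthreads workers (an assumed strategy).
sub assign_batches {
  my ( $nthreads, @batches ) = @_;
  die "nthreads must be positive\n" unless $nthreads > 0;
  my @workers = map { [] } 1 .. $nthreads;
  my $i = 0;
  for my $batch (@batches) {
    push @{ $workers[ $i++ % $nthreads ] }, $batch;
  }
  return @workers;
}

# ten accumulated ids, batchsize 3, nthreads 2
my @workers = assign_batches( 2, make_batches( 3, 1 .. 10 ) );
for my $w ( 0 .. $#workers ) {
  printf "worker %d: %s\n", $w,
    join( ' | ', map { join( ',', @$_ ) } @{ $workers[$w] } );
}
```

Each inner list corresponds to the ids covered by one select against the accumulation table.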

Indexer Class

Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into Solr. The following abbreviated example from aidx serves to illustrate the process.

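#!/usr/bin/perl
use strict;
use warnings;
use Alexandria::DWH::Extract;
use Alexandria::DWH::Index;
use Alexandria::DWH::Index::Document;

my $idxcls = shift(@ARGV); # indexer class name, from command line

# load the indexer class once, up front; the double-quoted string lets
# require see a bareword module name (require $idxcls would look for a file)
eval "require $idxcls; 1" or die "unable to load $idxcls: $@";

sub _create_solr_document {
  my ( $batch, $xml ) = @_;
  return $idxcls->new( document => $xml )->toNode()->toString(1);
}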

my $ex = Alexandria::DWH::Extract->new(
  ...
  callbacks => { on_document_processed => \&_create_solr_document }
);
$ex->prepare();
$ex->run();  # every document extracted is sent through _create_solr_document()
$ex->finalize();
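
To make the callback flow above concrete without a database, here is a self-contained sketch. ToyExtract is a hypothetical stand-in written for this example, not part of Alexandria::Library; it mimics only the callbacks argument and the ( $batch, $xml ) callback signature seen in the excerpt. Collecting the callback's return values is an assumption about what an extractor might do with them.

```perl
use strict;
use warnings;

# ToyExtract is a hypothetical stand-in for Alexandria::DWH::Extract; it only
# mimics the callbacks => { on_document_processed => ... } argument.
package ToyExtract;

sub new {
  my ( $class, %args ) = @_;
  my $self = {
    documents => $args{documents} || [],
    callbacks => $args{callbacks} || {},
  };
  return bless $self, $class;
}

# Pass every document through the on_document_processed callback
# and collect the return values.
sub run {
  my ($self) = @_;
  my $cb = $self->{callbacks}{on_document_processed};
  return [] unless ref($cb) eq 'CODE';
  my $batch = 1;    # the toy treats all documents as a single batch
  return [ map { $cb->( $batch, $_ ) } @{ $self->{documents} } ];
}

package main;

# Same signature as the callback in the aidx excerpt: ( $batch, $xml ).
sub _create_solr_document {
  my ( $batch, $xml ) = @_;
  return sprintf( 'batch %d: %d characters of source XML', $batch, length($xml) );
}

my $ex = ToyExtract->new(
  documents => [ '<patent-document/>', '<patent-document ucid="X"/>' ],
  callbacks => { on_document_processed => \&_create_solr_document },
);
print "$_\n" for @{ $ex->run() };
```

The point of the pattern is that the extractor owns the loop over documents, while the caller supplies only the per-document transformation.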

Creating a Custom Indexing Class

Creating a custom indexing class is simply a matter of subclassing Alexandria::DWH::Index::Document and manipulating the Solr document representation by adding, deleting, or modifying fields. Currently only one method can be overridden in the subclass, namely _process_source. The following skeleton module serves as a basis for the use cases detailed below.

...

aidx --idxcls=MyCustomIndexingClass [ many other arguments ]

Use Cases

The following use cases assume:

a valid index entry in /etc/alexandria.xml – this will differ from the default if you have a custom Solr installation
custom indexing class modules are either in the directory from which you run aidx or in your PERL5LIB path

...

(1) Adding (Injecting), Modifying, and Deleting Fields

For this use case, you will need to modify the Solr schema for the installation associated with the appropriate configuration index entry. Add the following field definition:

...

Below is example code to inject customInteger into the Solr document. It also shows how to modify the contents of anseries, and how to delete anseries if the publication country is US and the publication date is 2015-01-01 or later.

package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
extends 'Alexandria::DWH::Index::DocumentEx';

# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  # the _fields member of $self contains all the
  # Solr content as a hash reference of array references
  # e.g.
  # _fields => {
  #    pn => [ 'US-5551212-A' ],
  #    anseries => [ '07' ],
  #    icl1 => [ 'A', 'F', 'H' ]
  # }
  # NOTE: multiValued=false fields are still represented as an array
  #       but only have one member

  my $flds = $self->{_fields} || return; # nothing to do

  # inject a new field
  push( @{ $flds->{customInteger} }, 1 );

  # we want to make certain that anseries is not padded, i.e.,
  # we need to be sure it is an integer
  if( $flds->{anseries} && @{ $flds->{anseries} } ) {
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # lastly, we don't want to index anseries for US documents published
    # on or after 20150101
    # (the "|| []" guards keep a missing field from being autovivified)
    my $ctry = ( $flds->{pnctry} || [] )->[0] // '';
    my $date = ( $flds->{pd}     || [] )->[0] // 0;
    if( $ctry eq 'US' && $date > 20141231 ) {
      delete $flds->{anseries};
    }
  }
}
1;
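
The field logic of this use case can be tried out in isolation before wiring it into an indexing run. The sketch below applies the same three edits to a plain hash reference, so it needs no Alexandria modules. The function name adjust_fields and the sample field values are invented for this example, and the // operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Apply the same edits as MyCustomIndexingClass::_process_source to a plain
# hash reference of field name => array reference of values.
sub adjust_fields {
  my ($flds) = @_;

  # inject the new field
  push @{ $flds->{customInteger} }, 1;

  if ( $flds->{anseries} && @{ $flds->{anseries} } ) {
    # strip zero padding by formatting the value as an integer
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # drop anseries for US documents published on or after 20150101
    my $ctry = ( $flds->{pnctry} || [] )->[0] // '';
    my $date = ( $flds->{pd}     || [] )->[0] // 0;
    delete $flds->{anseries} if $ctry eq 'US' && $date > 20141231;
  }
  return $flds;
}

# hypothetical sample documents
my $old = adjust_fields( { pnctry => ['US'], pd => ['19960903'], anseries => ['07'] } );
my $new = adjust_fields( { pnctry => ['US'], pd => ['20150106'], anseries => ['14'] } );

printf "older document: anseries=%s\n", $old->{anseries}->[0];
printf "newer document: anseries %s\n", exists $new->{anseries} ? 'kept' : 'deleted';
```

Keeping the logic in a plain function like this makes the date boundary easy to check on its own.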

(2) Accessing the CLAIMS Direct Source XML Document

This next use case examines methods of (re)processing data from the source XML document. The goal is to create a new multi-valued field to store related documents. The following changes need to be made to the Solr schema:

<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" multiValued="true" />

...

any related documents which have a @relation=related-publication
any pct-or-regional-publishing-data
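
Once values have been pulled from those two parts of the source XML (the extraction itself is not shown here), they need to land in the multi-valued rel_ucids field without duplicates. A minimal sketch of that last step, operating on a plain hash shaped like _fields, follows; the helper name add_rel_ucids and the ucid values are hypothetical.

```perl
use strict;
use warnings;

# Append ucids to the multi-valued rel_ucids field, skipping empty values
# and anything already present, and preserving first-seen order.
sub add_rel_ucids {
  my ( $flds, @ucids ) = @_;
  my %seen = map { $_ => 1 } @{ $flds->{rel_ucids} || [] };
  for my $ucid (@ucids) {
    next unless defined $ucid && length $ucid;
    next if $seen{$ucid}++;
    push @{ $flds->{rel_ucids} }, $ucid;
  }
  return $flds;
}

my $flds = { pn => ['US-5551212-A'] };
add_rel_ucids( $flds, 'EP-0000001-A1', 'WO-1990000001-A1' );    # hypothetical values
add_rel_ucids( $flds, 'WO-1990000001-A1', undef );              # duplicate and undef are ignored
print join( "\n", @{ $flds->{rel_ucids} } ), "\n";
```

Because the field is an array reference, each retained ucid becomes one value of the multi-valued field.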

...

The parts of the XML document that are of interest:

...
