Processes
The main executable script used for indexing is `aidx`, delivered as part of `Alexandria::Library`. This script is responsible for pulling source data, converting it into Solr XML, and submitting it via HTTP POST to Solr for indexing. The conversion from CLAIMS Direct XML to Solr XML is handled by the indexer class (default: `Alexandria::DWH::Index::Document`). `Alexandria::Client::Tools` also provides an indexing daemon, `aidxd`, which monitors an index process queue. Insertion into this queue, the table `reporting.t_client_index_process`, is handled by `apgup`.
Process | Package |
---|---|
aidx | Alexandria::Library |
aidxd | Alexandria::Client::Tools |
apgup | Alexandria::Client::Tools |
Source Data
Source XML is extracted out of the PostgreSQL data warehouse using the core library functionality exposed by the `Alexandria::Library` module `Alexandria::DWH::Extract`. The Extract module can pull data based on a number of criteria, the most common of which are:
...
Regardless of extraction criteria, `Alexandria::DWH::Extract` uses an `UNLOGGED` temporary table to accumulate the desired `publication_id`(s). Extraction proper is done from this accumulation table in parallel `select` batches. The degree of parallelization and the number of documents per `select` are controlled by the parameters `nthreads` and `batchsize`, respectively. `aidx` also accepts a `dbfunc` parameter, which designates the stored function within the PostgreSQL database used to extract the XML data needed for indexing. The current default function is `xml.f_patent_document_s`, which pulls an entire XML document. One could, for example, create a custom function, e.g., `myschema.f_barebones`, modeled on `xml.f_patent_document_s` (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).
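The interplay of `batchsize` and `nthreads` can be pictured with a small, self-contained sketch. This is not the implementation used by `Alexandria::DWH::Extract`; the helper names `make_batches` and `assign_batches` are hypothetical, and the round-robin assignment is an assumption made purely for illustration.

```perl
use strict;
use warnings;

# Split a list of publication IDs into batches of at most $batchsize.
sub make_batches {
  my ( $ids, $batchsize ) = @_;
  die "batchsize must be positive\n" unless $batchsize > 0;
  my @queue = @$ids;    # copy, so the caller's list is left intact
  my @batches;
  push @batches, [ splice( @queue, 0, $batchsize ) ] while @queue;
  return \@batches;
}

# Deal the batches out to $nthreads workers in round-robin order
# (an assumed strategy, chosen only to keep the sketch short).
sub assign_batches {
  my ( $batches, $nthreads ) = @_;
  die "nthreads must be positive\n" unless $nthreads > 0;
  my @workers = map { [] } 1 .. $nthreads;
  for my $i ( 0 .. $#$batches ) {
    push @{ $workers[ $i % $nthreads ] }, $batches->[$i];
  }
  return \@workers;
}

my @publication_ids = ( 1 .. 10 );    # sample values only
my $batches = make_batches( \@publication_ids, 4 );
my $workers = assign_batches( $batches, 2 );

for my $w ( 0 .. $#$workers ) {
  my @sizes = map { scalar @$_ } @{ $workers->[$w] };
  print "worker $w: batch sizes @sizes\n";
}
```

With ten sample IDs, a batch size of 4, and two workers, the first worker receives two batches (4 and 2 documents) and the second receives one batch of 4. A larger batch size means fewer, heavier `select` statements; more workers means more of them run at once.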
Command | Accumulation SQL | Extract SQL |
---|---|---|
aidx --table=x | select publication_id | select xml.f_patent_document_values_s(t2.publication_id) from xml.t_patent_document_values t1 inner join x as t2 on (t1.publication_id=t2.publication_id) |
aidx --loadid=y | select publication_id | select xml.f_patent_document_values_s(t2.publication_id) |
aidx --sqlq=USER_SQL | execute SQL into t1 | select xml.f_patent_document_values_s(t2.publication_id) |
Command | Accumulation SQL | Extract SQL |
---|---|---|
aidx --table=x --dbfunc=f_my_function | select publication_id | select f_my_function(t2.publication_id) from xml.t_patent_document_values t1 inner join x as t2 on (t1.publication_id=t2.publication_id) |
Indexer Class
Using the `callback` mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into Solr. The following abbreviated example from `aidx` illustrates the process.
```perl
#!/usr/bin/perl
use strict;
use warnings;

use Alexandria::DWH::Extract;
use Alexandria::DWH::Index;
use Alexandria::DWH::Index::Document;

my $idxcls = shift(@ARGV);    # indexer class name, from the command line

# callback: transform one extracted XML document into Solr XML
sub _create_solr_document {
  my ( $batch, $xml ) = @_;

  # interpolate the class name so that require sees a bareword module name
  eval "require $idxcls; 1" or die "cannot load $idxcls: $@";
  return $idxcls->new( document => $xml )->toNode()->toString(1);
}

my $ex = Alexandria::DWH::Extract->new(
  ...
  callbacks => { on_document_processed => \&_create_solr_document }
);
$ex->prepare();
$ex->run();         # every document extracted is sent through _create_solr_document()
$ex->finalize();
```
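The callback flow can be tried in isolation with stand-in classes. The sketch below is not the Alexandria API: `MockExtract`, `MockIndexer`, and `to_solr_xml` are hypothetical names invented for illustration, and only the `callbacks => { on_document_processed => ... }` shape is taken from the example above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the extract module: all it does is hand
# each document to the registered callback and collect the results.
package MockExtract;

sub new {
  my ( $class, %args ) = @_;
  return bless { callbacks => $args{callbacks} || {}, results => [] }, $class;
}

sub run {
  my ( $self, @documents ) = @_;
  my $cb = $self->{callbacks}{on_document_processed} or return;
  my $batch = 1;    # a single pretend batch
  push @{ $self->{results} }, $cb->( $batch, $_ ) for @documents;
  return $self->{results};
}

# Hypothetical stand-in for an indexer class.
package MockIndexer;

sub new {
  my ( $class, %args ) = @_;
  return bless {%args}, $class;
}

sub to_solr_xml {
  my $self = shift;
  return qq{<doc><field name="raw">$self->{document}</field></doc>};
}

package main;

my $idxcls = 'MockIndexer';    # would come from the command line
my $ex     = MockExtract->new(
  callbacks => {
    on_document_processed => sub {
      my ( $batch, $xml ) = @_;
      return $idxcls->new( document => $xml )->to_solr_xml();
    },
  }
);

my $out = $ex->run( 'A', 'B' );
print "$_\n" for @$out;
```

Because the class name is held in a variable, swapping in a different indexer class requires no change to the extraction loop, which is the point of the callback design.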
Creating a Custom Indexing Class
Creating a custom indexing class is simply a matter of subclassing `Alexandria::DWH::Index::Document` and manipulating the Solr document representation by adding, deleting, or modifying fields. There is currently only one method that can be overridden in the subclass, namely `_process_source`. The following shell module serves as a basis for the use cases detailed below.
...
```
aidx --idxcls=MyCustomIndexingClass [ many other arguments ]
```
Use Cases
The following use cases assume:
...
(1) Adding (Injecting), Modifying, and Deleting Fields
For this use case, you will need to modify your Solr schema for the installation associated with the appropriate configuration index entry. Add the following field definition:
...
Below is example code that injects `customInteger` into the Solr document. It also shows how to modify the contents of `anseries`, and how to delete `anseries` if the publication country is US and the publication date is 2015-01-01 or later.
```perl
package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;

### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
extends 'Alexandria::DWH::Index::DocumentEx';

# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  # the _fields member of $self contains all the
  # Solr content as a hash reference of array references
  # e.g.
  # _fields =>
  #   NOTE: multiValued=false fields are still represented as an array
  #         but only have one member
  #   pn       => [ 'US-5551212-A' ],
  #   anseries => [ '07' ],
  #   icl1     => [ 'A', 'F', 'H' ]
  my $flds = $self->{_fields} || return;    # nothing to do

  # inject a new field
  push( @{ $flds->{customInteger} }, 1 );

  # we want to make certain that anseries is not padded, i.e.,
  # we need to be sure it is an integer
  if ( $flds->{anseries} && @{ $flds->{anseries} } ) {
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # lastly, we don't want to index anseries for US documents published
    # on or after 20150101
    my $ctry = $flds->{pnctry}->[0];
    my $date = $flds->{pd}->[0];
    if ( $ctry eq 'US' && $date > 20141231 ) {
      delete $flds->{anseries};
    }
  }
}

1;
```
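The field rules in the class above can be exercised without the Alexandria libraries by applying them to a plain hash reference. The following is a minimal sketch under that assumption: `adjust_fields` is a hypothetical helper name, and the sample field values are made up for illustration.

```perl
use strict;
use warnings;

# Same rules as in _process_source() above, applied to a plain
# hash reference of array references.
sub adjust_fields {
  my ($flds) = @_;

  # inject the new field
  push @{ $flds->{customInteger} }, 1;

  if ( $flds->{anseries} && @{ $flds->{anseries} } ) {
    # strip zero padding: '07' becomes '7'
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # drop anseries for US documents published on or after 20150101
    my $ctry = $flds->{pnctry}->[0] // '';
    my $date = $flds->{pd}->[0]     // 0;
    delete $flds->{anseries} if $ctry eq 'US' && $date > 20141231;
  }
  return $flds;
}

# Sample values only.
my $us_2016 = adjust_fields(
  { pnctry => ['US'], pd => ['20160105'], anseries => ['07'] } );
my $us_1996 = adjust_fields(
  { pnctry => ['US'], pd => ['19960827'], anseries => ['07'] } );

my $status = exists $us_2016->{anseries} ? 'kept' : 'dropped';
print "US 2016: anseries $status\n";
print "US 1996: anseries=$us_1996->{anseries}->[0]\n";
```

The 2016 sample loses `anseries` entirely, while the 1996 sample keeps it with the padding removed. Both gain `customInteger`.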
(2) Accessing the CLAIMS Direct Source XML Document
This next use case examines methods of (re)processing data from the source XML document. The goal is to create a new multi-valued field to store related documents. The following changes need to be made to the Solr schema:
```xml
<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" multiValued="true" />
```
...
- any related documents which have a `@relation=related-publication`
- any `pct-or-regional-publishing-data`
...
The parts of the XML document that are of interest:
...