IFI CLAIMS has been managing standardized company name variations for decades, allowing you to map patent publications to the correct original and current assignees. Our proprietary name management process cleans various types of entity names to address issues of misspelling, name changes, and other merger and acquisition activity. The result is the most reliable company name mapping available for IP analytics. Our Names Service takes this process even further, providing the Ultimate Owner for any IFI Name and allowing you to easily capture the full portfolio of patent assets owned by a company and its subsidiaries.
To make the best use of our IFI Names and Ultimate Owners, it is important to understand how we manage the data. This blog explains some of the rules we use to standardize names and assign Ultimate Owners.
What We Standardize
- applicant names
- assignee names in the parties container and assignment history
- names provided in legal events that are related to changes in ownership or name changes
We do not standardize:
We follow three main steps in our cleaning process: normalization, standardization, and harmonization.
The first step is to normalize the data or put it into a homogenized format. This includes:
- removing punctuation, parentheses, hyphens, and similar characters
- dealing with spelling variations (and/&, ä/ae) and misspellings
- putting legal forms into a consistent format such as Corp, SL, Ltd, PLC, AG etc.
Next, different name variations are grouped under one official IFI Name. This is an extremely important step in the process as some of the variations we receive would make the record very difficult to find otherwise. For example, as of January 2022, we have 2,260 name variations for BASF SE, including:
- GASF AKTIENGESELLSCHAFT
- BEE AA ESU EFU AG
- BADIS CHE ANVIN & SODA FABRIK AKTIENGESELLSCHAFT
- ?SF A
Finally, harmonization involves matching the standardized IFI Name to the most current entity name, accounting for name changes. For example, Orig3n Inc changed its name to Seaport Diagnostics Inc in April 2021. As of December 2021, there were no published reassignments of any of their patents. However, the new name is available as the
ifi-standardized-name-current in CLAIMS Direct:
To deal with names in different languages, we follow a common practice used by many patent offices and information providers.
For names in languages using Latin characters, we do not translate the names to English. Rather, we keep the original company names in their original language. For example:
- Bayerische Motoren Werke AG (German)
- Telmat Industrie SA (French)
- Universidad Politecnica de Madrid (Spanish)
For names in languages using non-Latin characters such as Russian and Chinese, we transliterate the names to Latin characters and use the name provided by the applicant if the company has a known English name. For example:
Name in Chinese: 清锋（北京）科技有限公司
Transliteration: Qingfeng Beijing Technology Co Ltd
IFI Name: Luxcreo Beijing Inc
For Japanese names, we try to find the official English name, which is not always a translation of the Japanese name. For example, according to the company website of Daido Tokushuko Kabushiki Kaisha, the company "name shall be expressed in English as Daido Steel Co., Ltd."
In a transliterated Japanese name, Kabushiki Kaisha is abbreviated as KK. If the name is translated, then the legal form is represented as “Co Ltd.”
Romanized names are added as variations of the official English name.
State and Country Disambiguation
For purposes of disambiguation, we add the country to the end of the name if the same entity name exists in different countries:
- Siemens Industry Software GmbH Austria
- Siemens Industry Software GmbH Germany
If multiple companies with the same entity name exist in different states of the same country and the entities are owned by different parent companies, we add the state to the end of the name:
- Blackbird Technologies Inc California
- Blackbird Technologies Inc Virginia
Rules for Different Entity Types
IFI's Ultimate Owners provide comprehensive coverage of the patent holdings of major public companies such as those listed on the following:
- FTSE 100
- S&P 500
- Russell 1000
- Forbes Global 2000
- Fortune 500
In general, our rules for determining Ultimate Owners are as follows:
- The IFI Ultimate Owner is the highest-level legal entity with an ownership interest of greater than 50% in the subsidiary. We define the Ultimate parent as the highest-level entity that prepares consolidated financial statements, understanding that “financial statements of a group in which the assets, liabilities, equity, income, expenses, and cash flows of the parent and its subsidiaries are presented as those of a single economic entity” as defined by the International Financial Reporting Standard organization (IFRS).
- We do not consider groups or conglomerates that do not exist as legal entities.
- IFI Ultimate Owners are operational entities, not investment companies. This means that we do not consider an investment company to be the ultimate parent of a private entity even if it owns more than 50% of the company shares.
As an exception, we do create Ultimate Owners that are investment companies if they own patents or if they are in the S&P lists. For example, Huntington Bancshares Inc. is considered an Ultimate Owner because many of its child companies own patents, including:
- Grayling State Bank
- Midwest Bank and Trust Co
- Security First Corp
- LeaseNet Group Inc
- Huntington National Bank
- NBC Corp
- Royal American Bank
- Sky Bank
- Citizens State Bank
- Empire National Bank of Traverse City
- Commerce Exchange Bank
Joint ventures join two or more different entities into a new entity for a specific objective in a specific country. Joint ventures are considered to be Ultimate Owners because none of the participants own more than 50% of the venture.
A holding company is an umbrella company that owns one or several other companies. They are often set up to distribute liabilities amongst different companies. A holding company may be listed as an Ultimate Owner if it owns enough stock to have a controlling interest and if it produces a consolidated financial statement. For example, Berkshire Hathaway is considered an Ultimate Owner because it produces a consolidated financial statement.
Dual-Listed Companies (DLC) form a corporate structure in which two legal entities function as a single operating business but retain separate stock exchange listings, usually in different countries. In CLAIMS Direct, the DLC is the Ultimate Owner.
For example, Rio Tinto is a leading international mining group headquartered in the UK, combining Rio Tinto PLC, a London and New York Stock Exchange listed company, and Rio Tinto Limited, which is listed on the Australian Securities Exchange. The two companies are joined in a DLC structure as a single economic entity called the Rio Tinto Group.
Rio Tinto PLC, GB - LSE
Rio Tinto Ltd, AU - ASX
Ultimate Owner: Rio Tinto Group, GB
Academic Institutions and University Systems
IFI uses this category for educational institutions, no matter the funding type. This includes schools, colleges, and universities, all of which may be Ultimate Owners.
The following types of academic institutions are added as child names if they are controlled directly by the parent institution:
- colleges of liberal arts
- professional schools
- technology transfer entities
- research centers (located on or off the main campus)
- branch campuses
For example, for the Ultimate Owner, Technion Israel Institute of Technology, child names include:
- Dimotech Ltd
- Technion Research and Development Foundation Ltd
A university system is a group of institutions having a collective legal status and generally recognized by a state government, a board of education, or other relevant organization as a collective organization. IFI creates an Ultimate Owner for university systems.
For example, for the Ultimate Owner, State University System of Florida, child names include, among others:
- Florida Agricultural and Mechanical University FAMU
- Florida Atlantic University
- Florida Atlantic University Research Corp
- University of Florida Research Foundation Inc
This category includes state-owned companies as well as government departments and agencies. Some examples include:
- Agence Spatiale Europeenne (France)
- Beijing Automotive Group Co Ltd (China)
- Ministry of Science and Technology India (India)
- National Institute of Advanced Industrial Science and Technology AIST (Japan)
Backward citations are a primary component of proving inventiveness in new patent applications. These citations reference previous work or prior art that is considered relevant to a current patent application. However, when looking for these citations in patent data, it is important to know that they are not always published in the applications. Here, we will look at how backward citations are handled during the application process and how the process can affect where those citations appear in the data.
As an example, at the time of filing a US patent application, an Information Disclosure Statement (IDS) form is submitted with the application. This includes any citations the inventors/applicants feel are relevant to demonstrate how their invention differs from the prior art. Since not everyone does a thorough search of the literature before filing their application, in the US (as well as some other countries), the inventors/applicants must also submit any relevant prior art cited by other offices which examine applications of the patent family members within three months of publication in the respective office.
Examiners generally do not rely solely on the prior art supplied by inventors/applicants, and citations may be appended by the examiner as well as by the inventor/applicant. If the applicant did not provide any citations in the application, it is the job of the examiner, who is an expert in the field, to locate documents that are most closely related to the new invention. Even if an applicant does provide citations, examiners conduct their own searches based on how they interpret the invention and cite the art they believe is relevant to novelty and inventiveness.
Examiners also consider the prior art provided in the related PCT International Search Report (ISR). In the ISR, the WIPO examiner designates each citation as an X, Y, or A document. “X” indicates that a citation is novelty-killing. “Y” indicates that it is relevant to inventiveness, and “A” indicates that it is general state of the art. And, of course, they do review the art cited by the applicant. There is also an option for third parties to submit prior art within a specific time frame (often between publication and the first office action). All of this information becomes part of the file wrapper/record.
In the US, the publication of applications occurs 18 months after the priority filing. Therefore, it is often the case that publication occurs before the examination begins and thus before the examiner has had the opportunity to add relevant citations. In these cases, there may be no citations (other than ISR references) at the time the application is published.
While examiners try to find all relevant art in their initial search, they will also re-do their search prior to allowance to make sure they found everything. In the US, Form 892 is the “List of references cited by examiner”. The citations on that list are the ones which are included in the grant publication and subsequently provided in CLAIMS Direct’s XML in the technical-data/citations container. Therefore, the best way to find all the relevant backward citations for a patent is to look at the grant publication rather than the application.
As of June 2020, CLAIMS Direct added support for PostgreSQL 10.x. The changes required to support 10.x unfortunately broke backwards compatibility to 9.x. Therefore, to migrate from 9.x to 10.x, one needs to use the new 10.x schema delivered in the package
alexandria-schema-tools. This new package, available through the IFI CLAIMS Direct yum repository, also provides tools for quality control (
cd-count.sh ) as well as tools for bulk extraction and loading (
cd-load.sh ). What follows will be a guide to migrate CLAIMS Direct from 9.x to 10.x.
Please note, updates should be stopped (or paused) before proceeding. See apgupd for instructions.
In order to migrate, the entire contents of the
alexandria database must be extracted to disk.
- Minimum of 1.8TB free space on the extract destination drive. You don't need a full database footprint as the extract files are compressed.
- Network connectivity between PostgreSQL 9 and the new PostgreSQL 10. It is also possible to use a separate server either as an intermediary or to share the extract destination drive, e.g., via NFS between the current and new instance.
The first step is to extract the data from the
The following script must be run on a drive with at least 1.8TB free.
First, the script:
By default, this script will use the alexandria database located on localhost and output the compressed table data to
./data. If storage is tight on the CLAIMS Direct instance, you can also use an intermediary server to extract and load, or use a storage server to share disk to both the old and new CLAIMS Direct instances via NFS, e.g.,
For the sake of simplicity, we'll use the 9.x CLAIMS Direct instance for extraction and loading. If using a separate server, you will need to install PostgreSQL (the default version is sufficient:
yum -y install postgresql ), then replace
localhost with the IP if you are using the 9.x instance.
- A new CLAIMS Direct instance. Please see https://docs.ificlaims.com/display/CDVDP/PostgreSQL+Installation+Instructions for step-by-step instructions.
- Access to the extract directory which contains the
datadirectory of the files created above. This could be the local CLAIMS Direct 9.x instance, an intermediate server, or via a shared drive.
Loading the data is accomplished in much the same way you loaded the original CLAIMS Direct instance:
load.sh for earlier installs.
IFI recommends 6TB of disk space for an on-site PostgreSQL instance. Normally, this will accommodate approximately 3 years of database growth (depending on your subscription level). If you are running out of disk space more quickly than anticipated, the reason may be an increased number of deletes occurring during the update process. This is especially likely to occur when IFI replaces content at a more aggressive rate than usual, as we have been doing since October 2019 in order to replace our old translations with higher-quality translations from Google. The blog Understanding of Bloat and VACUUM in PostgreSQL does a good job of explaining why your database may be growing disproportionately to actual pipeline data flow. This blog will help diagnose whether bloat is the cause of your problem and provide some potential solutions.
Checking for Bloat
The following SQL query will examine each table in the XML schema and identify dead rows (tuples) that are wasting disk space.
If this query returns a high percentage (
pfrag ) of dead tuples, the
VACUUM command may be used to reclaim space.
Using VACUUM to Reclaim Storage
1) VACUUM ANALYZE
apgupd before running this command.
VACUUM ANALYZE command marks the dead tuples in all tables of the database as available for reuse, which should slow database growth considerably.
After you run this, you may want to adjust the aggressiveness of your autovacuum settings to avoid accumulating more bloat in the future. The following settings should be considered:
autovacuum_max_workers= 5 - This is the recommended setting for a 16-core machine. If you have, for example, a 4-core, a setting of 2 would be the best available option.
autovacuum_vacuum_scale_factor= 0.02 - This setting indicates the threshold which determines when autovacuum needs to run per table. It refers to the fraction of the table size. So if the percentage of dead tuples (the
pfragcolumn in the SQL query above) is greater than 2%, this spawns an autovacuum process. The default value for this setting is 20%, which will create massive bloat in your database before the autovacuum process even starts.
autovacuum_analyze_scale_factor= 0.01 - Although not specific to table size and vacuum cleanup, this setting, which is also a percentage of fragmentation of the table, tells the autovacuum process to analyze the table (i.e., update query planner statistics) when the fragmentation percentage reaches 1% (the default is 10%).
2) VACUUM FULL <table>;
apgupd before running this command.
FULL vacuum command physically re-writes the table, removing the dead tuples and reducing the size of the table, whereas without the FULL modifier, the dead tuples are only made available for reuse. This is a processor- and disk-intensive operation but given appropriate planning, can reduce the size of the table by upwards of 25%. If you want to pursue this avenue, pick a highly fragmented table and test the amount of disk space recovered in order to assess whether it is worth the effort. For example, to test the process on
1. Determine the current size of the table.
2. Defragment and reclaim all disk space.
3. Determine the new size of the table after vacuum.
You will need at least twice the current table size available as free disk space in order to run the
VACUUM FULL command because it re-writes the entire table. You can use both strategies above -
VACUUM (to slow growth) and
VACUUM FULL (to reduce data footprint). However, it is recommended to complete
VACUUM first if you want to pursue the more aggressive
If you are really out of disk space, you may want to consider the following options:
1. Create a hot standby on a separate machine with +2TB disk, i.e., 8TB instead of 6TB. See Hot Standby for more information.
This is probably the easiest method and is least intrusive to normal workflow as nothing needs to be stopped or paused. Once the primary and standby instances are in sync, simply make the standby the primary.
2. If you can tolerate downtime, create a parallel machine as in #1 but use
pg_dump to recreate the database. You will need to pause
apgupd for the duration of the transfer:
With the deployment of the cumulative patch
alexandria-sql-patch-alpa-3636-20191101, on-site CLAIMS Direct installations now have the ability to utilize family and citation functionality in-house. What was previously only available using the remote CLAIMS Direct shared API is now possible internally with simple SQL functions. The following post will outline the steps required to prepare the data tables as well as presenting a brief walk-through of the functionality.
In order to leverage the new functionality, the necessary lookup tables need to be populated on-site. The functions available to achieve this are:
cdws.f_update_cited_documents– serving forward and backward citations
cdws.f_update_priority_documents– serving simple and extended family
Each function takes as input parameter a modified-load-id. Therefore, to populate each table, each modified-load-id from the
xml.t_patent_document_values table needs to be processed. The following pseudo-SQL will serve as an example.
Although the above SQL exemplifies the fundamental logic involved, it isn't the most efficient method. To that end, we are offering a supplemental patch that installs bulk loading functions to populate these tables:
alexandria-sql-patch-alpa-3636-x-20191215. In addition to adding bulk loading functions, this patch expands some of the citation and family functionality not included in the original
alexandria-sql-patch-alpa-3636-20191101 patch, namely:
Returns backward citations for an entire family
Returns forward citations for an entire family
Returns bibliographic data for an entire family
For a full enhancement listing, please see the README included in the package.
After the supplemental patch is installed, 2 new functions for bulk loading are exposed.
These new functions bulk-load the auxiliary tables and are called simply as:
Each function requires at least 24 hours to complete and may take up to 7 days, depending on the environment. They can be run in parallel if desired. Because of the duration required to complete the functions, it's important that the SQL runs uninterrupted by server restarts or terminal disconnects.
In order to expedite the completion of these functions, it is required that
apgupd be paused. Optionally upon completion, executing an
ANALYZE on the auxiliary tables is recommended:
In order to automatically populate these auxiliary tables every new load-id, you must use the
--api flag to
apgupd. If this parameter isn't available with your version of
Alexandria-Client-Tools, you will need to update to the latest version. In versions 2.5.3 and above, this parameter is set in
/etc/alexandria/apgupd.conf, as a
CD_OTHER_OPTIONS. Please see Client Tools documentation for configuration details.
To confirm that the citation and family functionality have completed loading, locate the log files (
family.log). The last line of each log should show an entry in the following format:
To check whether the calculations are still running, use one of the two SQL queries below while in the database or using pgAdmin.
Once the auxiliary table loading is complete, a wide variety of citation and family functionality is available directly from the on-site CLAIMS Direct instance.
Bringing the CLAIMS Direct API methods
citations/backward on-site is achieved using the new functions:
Each function takes as input a
publication_id and returns 2 columns:
GROUP BY and
ORDER BY are available, e.g., using
substring, group forward citations by country:
Bringing CLAIMS Direct API methods
family/extended functionality to the on-site install is accomplished with the functions:
Each of these functions, as with the citation functions above, take a
publication_id as input. The return columns are
family-id. The extended function returns additional
A second optional argument in the
v_include_apps Boolean, can be used to include application filing data from the patent office. The default setting of this argument is
false in order to maintain consistency with the EPO. In some cases, however, application filing data is not included in the priority claims data. Setting this argument to
true will include the application filing data, as in the following example.
In addition to the above, there are other functions in the supplemental patch worth exploring, namely:
cdws.f_family_meta( v_family_id integer )
Returns a variety of bibliographic and legal status information
Returns backward citations for an entire family
Returns forward citations for an entire family
cdws.f_family_meta function brings a variety of information together for a simple family. The following columns are returned:
- family-id - DocDB-assigned simple family integer identifier
- ucid – unique character identifier (publication)
- published – publication date
- anucid – unique character identifier (application)
- filed – application filing date
- title – title of document
- ifi_status – IFI Snapshot document status
- ifi_anticipated_expiration - IFI anticipated expiration date
- ifi_adjusted_expiration - IFI adjusted expiration date
- epo_legal_status_xml – XML representation of EPO legal status
An abbreviated example:
Forward and backward citations for an entire family are also available directly inside CLAIMS Direct.
GROUP BY and
ORDER BY are available.
The main executable script used for indexing is
aidx delivered as part of
Alexandria::Library. This script is responsible for pulling source data, converting it into SOLR XML and submitting via HTTP POST to SOLR for indexing. The conversion process from CLAIMS Direct XML to SOLR XML is handled by the indexer class (default is
Alexandria::Client::Tools also provides an indexing daemon,
aidxd which monitors an index process queue. Insertion into this queue, the table
reporting.t_client_index_process, is handled by
Source XML is extracted out of the PostgreSQL data warehouse using the core library functionality exposed by the
Alexandria::DWH::Extract. The Extract module can pull data based on a number of criteria, the most common of which are:
- table: any table name that has
- SQL: raw SQL selecting desired documents by
Regardless of extraction criteria,
Alexandria::DWH::Extract utilizes an
UNLOGGED temporary table to accumulate desired
publication_id(s). Extraction proper is done from this accumulation table in parallel
select batches. The amount of parallelization as well as the amount of documents per
select are controlled by the parameters
aidx also accepts a
dbfunc parameter which designates the stored function within the PostgreSQL database to use to extract the XML data needed for indexing. The current default function is
xml.f_patent_document_s which pulls an entire XML document. One could, for example, create a custom function, e.g.,
myschema.f_barebones modeled on
xml.f_patent_document_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).
|Command||Accumulation SQL||Extract SQL|
from xml.t_patent_document_values t1
inner join x as t1 on ( t1.publication_id=t2.publication_id)
|aidx --sqlq=USER_SQL||execute SQL into t1|
|Command||Accumulation SQL||Extract SQL|
|aidx --table=x --dbfunc=f_my_function|
from xml.t_patent_document_values t1
inner join x as t1 on ( t1.publication_id=t2.publication_id)
callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into SOLR. The following abbreviated example from
aidx serves to illustrate the process.
Creating a Custom Indexing Class
Creating a custom indexing class is simply a matter of sub-classing the
Alexandria::DWH::Index::Document and manipulating the SOLR document representation by either adding, deleting, or modifying certain fields. There is currently only one method that can be overridden in the sub-class, namely,
_process_source. The following shell-module will serve as a basis for the use cases detailed below.
You can now specify
MyCustomIndexingClass as the command line argument
--idxcls to the indexing utility
The following use cases assume:
- a valid index entry in
/etc/alexandria.xml– this will be different than the default if you have a custom SOLR installation
- custom indexing class modules are either in the directory you run
aidxor in your
(1) Adding (Injecting), Modifying, and Deleting Fields
For this use case, you will need to modify your SOLR schema for the installation associated with the appropriate configuration index entry. Add the following field definition:
Below is example code to inject
customInteger into the SOLR document. Additionally, it will show how to modify the contents of
anseries and delete
anseries if the publication country is US and publication date is later than 2015.
(2) Accessing the CLAIMS Direct Source XML Document
This next use case will examine methods of (re)processing data from the source XML document. The goal will be to create a new multi-valued field to store related documents. The following changes need to be made to the SOLR schema:
We first need to define related ucid
rel_ucid. For this example, it will be defined as:
- any related documents which have a
- any pct-or-regional-publishing-data
The parts of the XML document that are of interest:
As this example is more involved, the following code is broken down by function. A complete listing of code will be provided below.
Points to consider with
- The source document (XML) representation is an
XML::LibXML::Node, named above as
- Utilizing available methods, it is relatively simple to access particular parts of the XML tree.
findvaluemethod is lacking error checking, i.e., we assume every value is present, combined in
sprintfwill return a correctly formatted
Points to consider:
- according to the DTD, there is only ever one related pct document, hence single-value return
ucidattribute is available, differing from the above
The complete listing:
There are a number of reasons one would need to re-index data from the data warehouse. These range from simply repeating a
load-id to a complete re-index of the entire contents of the data warehouse. In this blog, I'm going to go over the mechanisms that move data from the data warehouse to the index and ways in which these mechanisms can be used to trigger partial or full re-indexing of data.
In all installations of CLAIMS Direct, data synchronization between the CLAIMS Direct primary data warehouse and on-site installations is driven by the tool
apgup. In version 2.1, a new update daemon (
apgupd) was introduced to facilitate background processing of new and updated data. The main roles this daemon plays include:
- Reconciling differentials between on-site data warehouses and CLAIMS Direct primary data warehouses
- Downloading the next available update package for processing within the on-site installation
- Processing the update package
- Loading data to the data warehouse
- Optionally en-queuing that load for indexing
For our topic, we will concentrate on the indexing queue as this drives the indexing process.
The indexing queue is a table inside the
apgupd completes the load into the PostgreSQL, a row is inserted into
pending. The indexing daemon,
aidxd, probes the table looking for the highest priority
load_id to index. Upon finding an applicable
aidxd proceeds to extract all publications associated with that
load_id and index the documents into SOLR.
priority influences which
load_id is to be processed next. Higher priority
load-ids are processed before lower priority ones. This can cause out-of-order processing as we will discuss below.
At its simplest, re-indexing is just a matter of resetting a
The above SQL resets
load_id 123456 so that the indexing daemon processes that
load_id next. You can re-index any number of
load_ids by resetting each of the rows.
Entire Data Warehouse
Unfortunately, resetting all the rows in
reporting.t_client_index_process is not sufficient to re-index the entire contents of the data warehouse. This has to do with the fact that the table is empty upon initial load to the on-site instance as the population of
reporting.t_client_index_process begins with the first update to the data warehouse. Each new update that is processed is queued. To that end, the most efficient way to queue the entire data warehouse for re-indexing is to select the entire
xml.t_patent_document_values table and group by
Before proceeding with the next set of SQL, please be sure both
aidxd are paused ( kill -s USR1 <pid> )
A few comments about the above statements:
- Truncating the table is required as the
load_idcolumn is unique. If you wish to save the contents of that table, simply use the pg_dump utility.
load_sourcecolumn is chosen at random. You can use any identifier you want as long as it doesn't exceed 64 characters.
- The priority is set to -1 as in this example, re-index is less important than indexing new data.
Changing Field Definitions
Modifying the CLAIMS Direct SOLR Schema
The CLAIMS Direct SOLR schema is a compromise between having every piece of the patent document searchable, having enough metadata stored to be retrievable in the search result set as well as indexing certain fields to be efficiently faceted. This compromise keeps the size of the index manageable while allowing efficient return of data from a search. Of course, the concept of manageable is subjective. There could very well be the need to have more fields returned during the search (
stored=true) and other fields removed from storage but left searchable. In this use case, we will enable patent citations to be returned during the search. Regardless of CLAIMS Direct SOLR package, we'll start editing the schema under
<package>/conf-2.1.2. We are making a very simple change but one that requires a complete re-index of all documents having citations.
Be sure the indexing daemon (
aidxd ) is paused.
The current field definition
To make this change effective, SOLR needs to be restarted or the collection reloaded. For the standalone package, you would just need to restart SOLR:
For the distributed package, first deploy updated configuration to the Zookeeper nodes.
Then reload all nodes:
Setting up the Index Queue
Although we can go about re-indexing the entire data warehouse as outlined above, we want to be a bit more efficient and choose only the applicable documents, i.e., those that actually have citation data. To this end, we will find only
load_ids that contain documents with citations and modify existing
reporting.t_client_index_process rows as needed.
Again, setting priority to -1 allows new documents to be indexed as usual, and documents needing to be updated will be indexed with a lower priority.
Checking Re-Index Status
You can monitor the progress of the re-index by querying the
reporting.t_client_index_process table as follows:
The CLAIMS Direct data warehouse is updated continuously. On average, 1-3% of the entire database is touched weekly. These 1-3 million updates fall into two categories: new documents (roughly 20%) and updated documents (roughly 80%). New documents are, of course, new publications from issuing authorities, US, EP etc. Updates are generally, but not limited to, CPC changes, family changes (adding family ID), IFI Snapshots (patent status, standard names and claims summaries), reassignments, legal status, et al. I suggest reading about content updates for a better understanding of load sources and issuing authorities. This blog will describe how to determine what has been updated and further, differentiating new content from updated content at the document level as well as container level.
As described in the content update documentation,
Every document within the data warehouse was loaded as part of a group of documents. This set of documents is identified by a load-id (integer value). There are 3 types of load-ids in the data warehouse: created-load-id, deleted-load-id and modified-load-id. The created-load-id represents the load-id in which a document was added to the database, the modified-load-id represents the load-id that last modified the document and the deleted-load-id represents the load-id in which the document marked as deleted.
xml.t_patent_document_values houses the
load-ids that affect the overall status of the entire document.
load-idthat last modified the document, whether it be addition, correction or deletion of content
load-idthat the entire document was marked deleted
load-idwhen the document was added to the data warehouse
Each satellite table also contains a
modified_load_id column that is specific to the content in that table and will always be <=
xml.t_patent_document_values(modified_load_id). The following table should clarify the sequence of update events and the effect on
load-id values for three database tables.
assuming ifi integrated content is not delivered on date of publication
parties are not updated with IFI content loads
This granularity offers the ability to trace modifications to individual containers (tables) as well as the overall modification status of the document as a whole.
How many US documents published on 20161004 have updated CPC classifications?
Out of 6680 documents published on 20161004, 6145 had CPC modifications after publication.
Knowing which tables and containers have changed is important, but the when and why are also critically important. The information gathered during data warehouse updates is stored in
reporting.t_client_load_process. For a detailed description of the contents of this table, see CLAIMS Direct Design Documentation. Of the many questions that can be posed, for our purposes in this blog, we are only concerned with answering the following questions:
- On any given date, how many documents were added to the data warehouse?
- On any given date, how many documents were modified in the data warehouse?
The answers to these questions are important for a variety of reasons, most importantly for down-stream processing. In the case of CLAIMS Direct primary data warehouses, we trigger indexing based on the answer to these questions. The first thing to look at is what happened on a certain date, for the sake of an example, we will use 20161004.
How many documents were added to the data warehouse?
... and which documents were updated?
The CLAIMS Direct Web Services (CDWS) offer a variety of entry points into both the data warehouse and SOLR index. These are mid-to-high-level entry points and can satisfy most requirements pertaining to searching and extracting data for a typical search/view application. There are, however, corner cases which may require more intricate extraction of particular information. On the other hand, there may also be situations where massive amounts of data need to be extracted for further, down-stream processing. The following discussion will touch on solutions for both of these cases.
Basic Data Warehouse Table Structure
The basic structure of the PostgreSQL database housing the XML content is best described as a collection of tables containing the raw XML of every patent document available in CLAIMS Direct. Each table is a section of the XML document, referred to as a container, and each table is named according to the XML container data it contains, e.g.,
|xml.t_publication_reference||Document publication information|
The structure of each table is identical with the exception of
xml.t_patent_document_values which functions as a meta table.
|serial primary key|
|integer not null|
|integer not null|
|character(1) (v=valid XML, i=invalid XML)|
The XML is always in the
content (PostgreSQL type: XML) column.
The invaluable utility function
xml.f_ucid2id takes a
ucid and returns a
Extracting Pieces of the XML Content
As mentioned above, all
content columns are of type XML and therefore the internal PostgreSQL xpath functionality can be used. There are a variety of utility functions provided by CLAIMS Direct that mimic DOM functionality, e.g.
Extract text of node:
Extract single attribute value:
Test presence of a particular node:
Select a node (type XML) satisfying the given xpath expression:
Select an array of nodes satisfying the given xpath expression:
Select an array of nodes based on element name:
Select the name of the content node:
Select all child nodes:
Concatenate all child text nodes:
Use Case: Creating relational
view of XML content
Although the underlying architecture of the CLAIMS Direct PostgreSQL database is a data warehouse, creating normalized relational views into the XML content is easily achievable using the functions described above. Our example will be to create a relational view of all patent citations. This view will let us
SELECT based on criteria as well as being able to
GROUP BY and
ORDER BY. The first step is to define all the properties of a citation:
source(examiner or applicant)
Now we can create a function that returns a
TABLE of these properties. We'll use
publication_id as input to the function and create it in a private
Now, we test the results:
GROUP BY and
ORDER BY are available on the returned columns. In addition to SELECTing citations for a single document, we can also
JOIN to table expressions, e.g., assume we are interested in the top 5 cited countries from US grant publication date 2016-05-03:
Paging results is cumbersome and inefficient. In this next segment I'd like to talk about simple and complex sorting. Sorting, used effectively with the
rows parameter can push relevant documents into the first page of results. Generally, you can sort on any indexed field but you can also utilize query boosting and functions to influence sort order.
Missing Value Behavior
CLAIMS Direct is configured to return empty fields at the top when
asc is the direction and the bottom when
desc is the direction.
Static Field Sorts
The most common field to sort on will undoubtedly be the date of publication,
pd desc. This pushes the newly published documents to the top of the result set. CLAIMS Direct uses
pd desc as default sort order instead of
score. If two documents share the same publication date (or any sort field value), the tie is broken by the internal Lucene document IDs ascending.
Multiple sort criteria are also possible and used mostly to stabilize the initial sort results. The syntax for the
sort parameter is a comma-separated list of field/direction pairs, e.g.,
pd desc,ad asc. Field and direction are separated by a space.
Dynamic Field Sort
As mentioned, CLAIMS Direct uses a default sort of
pd desc. You are, of course, free to use the dynamic
score field to mimic the default SOLR distribution behavior.
Random Sort (for Random Results)
CLAIMS Direct uses a utility field to allow random sorting. The field is input as
n is any random integer, e.g.,
rnd_1234. Specifying the same value for
n will yield the same sort results so the randomness is only as random as the input integer. Returning random results (sorting on
rnd_n) is an efficient way to sample a wide variety of data.
Sorting on Functions
A previous segment discussed using functions as pseudo fields to return in the result set. You can also use the results of a function as sort criteria. An interesting use-case for sorting on a calculated value is measuring the latency between application filing date and grant date. You can calculate this value by subtracting
tdate field types of filing and publication dates). The following examples use field aliasing with the
fl parameter as well as sorting by function to return documents with low latency between filing and grant (
asc) and high latency (
desc) displayed in days (86400000 milliseconds in a day).
In this first of a series of blogs about SOLR result sets I'd like to talk about returned fields, both static and dynamic.
Any field that is stored in SOLR can be returned in the result set. The following table shows all available stored fields:
|timestamp||last modification stamp||tdate||true|
|ucid||unique character identifier||string||true|
|loadid||meta load identifier||tint||true|
|nclms||number of claims||tint||true|
|nindepclms||number of independent claims||tint||true|
|pn||patent number (tokenized)||alexandria_token||true|
|an||application number (tokenized)||alexandria_token||true|
|anseries||US application series code||alexandria_string||true|
|anorig||original, patent office format application number||alexandria_token||true|
|pri||priority filing number (tokenized)||alexandria_string||true|
|priorig||original, patent office format priority number||alexandria_string||true|
|prid||earliest priority filing date||tint||true|
|icl,cpcl,uscl,ficl||classifications suitable for display||string||false|
|pa||applicants/assignees (all formats)||alexandria_token||true|
|ttl_*||all titles||mixed: alexandria_text, alexandria_asian_text||true|
|ab_*||all abstracts||mixed: alexandria_text, alexandria_asian_text||true|
The shorthand to return all static fields is the asterisk
As with all examples, we discuss results in JSON format.
Some things to notice: Multi-valued fields are returned as JSON arrays. Also, as
copyFields containing multiple values, duplicate entries may appear.
Dynamic fields are fields that are generated either internally, like the relevance
score or fields that can be generated by computing values using function queries. Dynamic fields must be listed explicitly in the
fl parameter, e.g.,
fl=score. Below, we request the static
ucid field together with the relevance
score for 5 results.
Another type of dynamic field makes use of function queries. A simple, yet contrived, example would be to return the number of dependent claims. CLAIMS Direct stores the total number of claims in
nclms as well as the number of independent claims in
nindepclms, both as
tint. Simple subtraction yields the desired result.
Field aliasing allows individual fields in the result set to be renamed. This functionality is particularly useful to give meaningful names to pseudo fields created by function queries as in the example above. Any field can be renamed using aliases. The syntax is: alias-field-name:original-field-name
ucid is the only exception. When trying to alias the
ucid, an HTTP 500 will be returned.