Blog


What is the Unitary Patent (UP) and Unified Patent Court (UPC)? 

A European patent with unitary effect or Unitary Patent (UP) is based on a European patent granted by the EPO under the rules and procedures of the European Patent Convention (EPC). The Unitary Patent is a single right that covers all participating UP Member States, which are a subset of the member states of the EPO. Only a single annual renewal fee is required to cover the included states. 

The goal of the UP is to decrease the costs and complexities related to validating a European Patent in the member countries. The UP achieves this through a single renewal fee payment, reducing translation costs, and removing the need for representation at the national level. While the single annual renewal is a benefit for filers, there is some risk as a single court decision, e.g., on infringement and/or validity, applies across all participating UP Member States, rather than just in one country.

The Unified Patent Court (UPC) was created by international treaty to handle invalidity and infringement actions related to UP and EP patents. Patentees are therefore able to bring a single infringement action to the UPC for pan-European infringements, while third parties are able to seek centralized revocation of the UP. The existing EP opposition procedures will also affect the validity of Unitary Patents. 

You can find more information about Unitary Patents and the Unified Patent court on the EPO website: https://www.epo.org/applying/european/unitary.html.

Identifying Patents Which Have Requested Unitary Protection

For EP grants where the UP is registered, a C0 marker record is created. The C0 record is not a published document, but rather a file container for UP data which contains information such as the list of countries covered, the date of registration, and legal status events. We typically receive C0 markers from DOCDB on Wednesdays. These can be identified by searching for documents where the country code is EP and the kind code is C0.  They can also be found using the application number for the document of interest. 
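
For example, on an on-site CLAIMS Direct instance, a query along the following lines will list the C0 marker records. This is a minimal sketch that assumes the standard country, kind, and ucid columns of xml.t_patent_document_values:

-- List Unitary Patent marker records (EP documents with kind code C0)
select ucid
from xml.t_patent_document_values
where country = 'EP'
  and kind = 'C0';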

Here is an example of the first lines of XML in a C0 marker record: 

<patent-document file-reference-id="752534" 
                 mxw-id="221216033" 
                 ucid="EP-3135684-C0" 
                 lang="EN" 
                 country="EP" 
                 doc-number="3135684" 
                 kind="C0" 
                 date="20230628" 
                 date-produced="20230812" 
                 family-id="45004197" 
                 status="corrected">

Identifying Where the UP is in Effect

To identify where the Unitary Patent is in effect, look within the international-convention-data/designated-states section of the C0 marker XML. The ep-up-participating-states are the countries that are part of the UP at the time of registration. The UP annual fee will therefore apply to these countries. 

<ep-up-participating-states>
   <country mxw-id="DS1187928874" load-source="docdb">AT</country>
   <country mxw-id="DS1187928876" load-source="docdb">BE</country>
   <country mxw-id="DS1187928878" load-source="docdb">BG</country>
   <country mxw-id="DS1187928883" load-source="docdb">DE</country>
   <country mxw-id="DS1187928885" load-source="docdb">DK</country>
   <country mxw-id="DS1187928887" load-source="docdb">EE</country>
   <country mxw-id="DS1187928890" load-source="docdb">FI</country>
   <country mxw-id="DS1187928892" load-source="docdb">FR</country>
   <country mxw-id="DS1187928900" load-source="docdb">IT</country>
   <country mxw-id="DS1187928903" load-source="docdb">LT</country>
   <country mxw-id="DS1187928905" load-source="docdb">LU</country>
   <country mxw-id="DS1187928907" load-source="docdb">LV</country>
   <country mxw-id="DS1187928911" load-source="docdb">MT</country>
   <country mxw-id="DS1187928913" load-source="docdb">NL</country>
   <country mxw-id="DS1187928917" load-source="docdb">PT</country>
   <country mxw-id="DS1187928921" load-source="docdb">SE</country>
   <country mxw-id="DS1187928923" load-source="docdb">SI</country>
</ep-up-participating-states>

Legal Events Related to Unitary Patents

Legal event codes related to the UP are in the legal status section of the C0 marker record. They can also be found in the legal status section of any related application and/or granted record having the same application number. At this time, the expected time frame for delivery of legal event codes is not consistent. The two event codes currently appearing most frequently related to the UP in the database are U01 and U07. As other new UP-related legal event codes are delivered by the EPO, they will be added to the legal status section.


U01: The event indicating that a request for Unitary Effect has been filed.

<legal-status load-source="inpadoc">
   <legal-event country="EP" impact="+" code="U01" date="20230802" mxw-id="LS507355453" class="W" created-load-id="752825" modified-load-id="752825">
      <legal-event-body>
         <event-title>REQUEST FOR UNITARY EFFECT FILED</event-title>
         <event-attributes>
            <event-attribute>
               <event-attribute-label>Effective Date </event-attribute-label>
               <event-attribute-value>20230628</event-attribute-value>
            </event-attribute>
         </event-attributes>
      </legal-event-body>
   </legal-event>


U07: The event indicating that a Unitary Patent has been registered. This includes a list of the countries which the patent covers and the effective date of the grant. 

<legal-event country="EP" code="U07" date="20230802" mxw-id="LS507355454" class="W" created-load-id="752825" modified-load-id="752825">
   <legal-event-body>
      <event-title>POSITIVE DECISION AFTER REQUEST FOR UNITARY EFFECT</event-title>
         <event-attributes>
            <event-attribute>
               <event-attribute-label>Designated State(s) </event-attribute-label>
               <event-attribute-value>AT BE BG DE DK EE FI FR IT LT LU LV MT NL PT SE SI</event-attribute-value>
               <event-attribute-label>Effective Date </event-attribute-label>
               <event-attribute-value>20230703</event-attribute-value>
            </event-attribute>
         </event-attributes>
      </legal-event-body>
</legal-event>

The Unified Patent Court and Opting Out of the Unified Patent Court

The Unified Patent Court (UPC) has jurisdiction over the settlement of disputes for both European patents and Unitary Patents. This jurisdiction is however subject to exceptions: cases may still be brought before national courts, and patentees of European patents may opt out from the UPC's jurisdiction. If a patentee does not submit an opt-out request, they may be subject to invalidity and infringement proceedings under the jurisdiction of the UPC.

P01 is one of the event codes relating to opting out, but there are additional codes associated with opting out of the UPC. These codes appear on the B document and other documents with the same application number. 

<legal-event country="EP" code="P01" date="20230614" mxw-id="LS500898319" class="W" created-load-id="739212" modified-load-id="739212">
   <legal-event-body>
      <event-title>OPT-OUT OF THE COMPETENCE OF THE UNIFIED PATENT COURT (UPC) REGISTERED</event-title>
         <event-attributes>
            <event-attribute>
               <event-attribute-label>Effective Date </event-attribute-label>
               <event-attribute-value>20230418</event-attribute-value>
            </event-attribute>
         </event-attributes>
      </legal-event-body>
</legal-event>

Additional Legal Status Codes Related to the UP and UPC 

U01, U07, and P01 are only a few of the potential codes related to the UP and UPC. Many more codes have been reserved by the EPO for use related to the Unitary Patent and Unified Patent Court. The additional codes typically have either "UPC" or  "Unitary Effect" in the description text. The most up-to-date list of legal event codes produced by the EPO can be found here: https://www.epo.org/searching-for-patents/data/coverage/weekly.html.

What's Next for Unitary Patents in CLAIMS Direct? 

We have started to review the effect Unitary Patents will have on IFI Snapshots, including the IFI Patent Status and other fields derived by IFI. Snapshot calculations for both the B and C0 records are in the process of being reviewed and will be updated as soon as possible.

Introduction

The entities stream offers detailed content relating to IFI standardized names and corporate hierarchies (ultimate owners and subsidiaries). This stream allows users to quickly build full corporate portfolios, retrieve all the subsidiaries of a specific company, and gain statistical insights into a portfolio. The optional statistical analysis tables provide key metrics such as the number of active patent families, the number of granted patents, and more. Before beginning, it is important to review the design of the schema, tables, and columns as well as the built-in functions.

Terminology

Note on terminology: In the following blog, the terms IFI standardized name and entity name are synonymous. Generally, an IFI standardized name is an entity if it belongs to an entity hierarchy.

Prerequisites

Before executing any of the functions described below, it's important to stop all stream updates:

systemctl stop apgupd.service

Linking Entities to Publications

Although the entities stream is self-contained (it has no direct dependencies on other stream content), the optional table entities.t_publication_ifi_numbers, which links the stream to the xml schema, can only be exposed by building it with the function entities.f_update_xml_info_bf().

To populate the optional content:

select entities.f_update_xml_info_bf();

This function builds the links between the xml.t_ifi_integrated_content data and the entities stream which includes publications, publication status, family and family status information. These links will be maintained by apgupd every entities stream update.
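
Once the function has completed, a quick sanity check (a trivial example) confirms that the optional table was populated:

select count(*) from entities.t_publication_ifi_numbers;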

Generating and Maintaining Statistics

Linking entities to publications is an in-line process and is triggered by loads into the XML schema. The optional statistics in entities.t_entity_hierarchy_counts, on the other hand, are generated by an offline process that can be scheduled. As with populating entities.t_publication_ifi_numbers, an initial population of the table entities.t_entity_hierarchy_counts is done by executing a function:

select entities.f_recalculate_all_entity_hierarchy_counts(true)

With the true parameter, the table entities.t_entity_hierarchy_counts will be completely rebuilt; with false, only load-ids that haven't yet been processed will be refreshed. IFI CLAIMS' primary scheduling uses cron as shown below: 

# Incremental updates: 3:30 Mon-Sat
30    3   * *  1-6  psql -U alexandria -h localhost -p 5432 -d alexandria -qt -c "select entities.f_recalculate_all_entity_hierarchy_counts()" -o /dev/null
# Full refresh: Sat 8:30
30    8   * *   6   psql -U alexandria -h localhost -p 5432 -d alexandria -qt -c 'select entities.f_recalculate_all_entity_hierarchy_counts(true)' -o /dev/null

Getting Started

To ease into the entities content, we'll go over some fundamental functionality:

  • Retrieving an ultimate owner from an IFI standardized name
  • Retrieving a list of subsidiaries from an IFI standardized name
  • Retrieving ticker and exchange information for an IFI standardized name
  • Expanding functionality

Hierarchy Levels

Hierarchies can have multiple levels so it is possible that a subsidiary itself is an owner of other subsidiaries. To this end, if you are querying for an entire hierarchy, the input parameter should be an ultimate owner.
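
For example, to list an entire hierarchy starting from any member name, one possible sketch (reusing the functions demonstrated in the sections below) resolves the ultimate owner first and then expands its subsidiaries:

-- Resolve the ultimate owner of a member name, then list its subsidiaries
select s.name, s.country
from entities.f_cdws_ultimate( 'Google LLC' ) as u,
     entities.f_cdws_subsidiaries( u.name ) as s;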

Extended vs Compact

The following SQL output from psql is displayed in both compact and extended format. Extended format is used when the number of columns would make page display unwieldy.
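
If you are following along in psql, the display mode can be controlled with the \x meta-command; for example, \x auto lets psql switch to extended format automatically whenever a row is too wide for the screen:

\x auto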

Retrieving an Ultimate Owner

In its simplest form, given any IFI standardized name that is a member of an entity hierarchy, its ultimate owner can be found by simply executing the following SQL query:

select * from entities.f_cdws_ultimate( 'Google LLC' );

-[ RECORD 1 ]-------+-----------------------------------------------------------
entity_id           | 3898
country             | US
name                | Alphabet Inc
name_type           | N
extract_status      | V
updated             | 2020-11-16
parent_relationship | O
address             | 1600 Amphitheatre Parkway, Mountain View, CA, 94043, US
attributes          | <attributes />

Retrieving a List of Subsidiaries

select name, country from entities.f_cdws_subsidiaries( 'Alphabet Inc' );
                    name                    | country
--------------------------------------------+---------
 Admob Inc                                  | US
 Adometry Inc                               | US
 Adscape Media Canada Inc                   | CA
 Adscape Media Inc                          | US
 Agnilux Inc                                | US
 Anvato Inc                                 | US
 Apigee Corp                                | US
 Apigee Technologies India Pvt Ltd          | IN
-- List abbreviated

Returned Columns

To see the full list of columns returned, issue the \df psql command.

\df entities.f_cdws_subsidiaries

Retrieving Ticker and Exchange

select * from entities.f_attributes_ticker( 4366 );
 exchange | symbol 
----------+--------
 XNYS     | IBM

Expanding Functionality

Of course, the above functions are rudimentary. One can easily build out the functionality for application-specific use cases. As a simple example, the following query retrieves an ultimate owner from a publication ucid (or publication id).

Entity status

When executing statements against the table entities.t_entity_hierarchy_info it is important to filter out all but validated (V) and reviewed (R) entities using the status column.

with t as (
     select a.entity_name
     from entities.t_entity_hierarchy_info as a
       inner join entities.t_publication_ifi_numbers as b
         on ( a.ifi_number=b.ifi_number)
     where a.status in ('V', 'R') and b.publication_id = xml.f_ucid2id( 'EP-0700000-B1' )
  )
  select * from entities.f_cdws_ultimate( ( select x.entity_name from t as x ) );

-[ RECORD 1 ]-------+-----------------------------------------------------------
entity_id           | 4366
country             | US
name                | International Business Machines Corp
name_type           | N
extract_status      | V
updated             | 2021-01-04
parent_relationship | O
address             | Old Orchard Road, Armonk, NY, 10504, US
attributes          | <attributes />

Understanding Entity Hierarchy Statistical Information

Along with the entity information housed in entities.t_entity_hierarchy_info, there is the companion hierarchy statistics table entities.t_entity_hierarchy_counts. This table contains summary statistics for entire hierarchies (ultimate owner and all subsidiaries), including hierarchy family statistics. The following table lists the available statistical categories:

Category (col: count_category)   Description
CB                               Total number of backward citations
CF                               Total number of forward citations
FSA                              Total families with Active status
FSL                              Total families with Lapsed status
FSP                              Total families with Pending status
FT                               Total number of families
PLA                              Total published applications
PLG                              Total published grants
PLU                              Total publications with unknown publication level
PSA                              Total publications with Active status
PSL                              Total publications with Lapsed status
PSP                              Total publications with Pending status
PSU                              Total publications with unknown status
PT                               Total publications
ST                               Total number of subsidiaries


This information is wrapped into a view that expands the category with a category description:

select * from entities.v_hierarchy_family_statistics where entity_id = entities.f_name2id( 'Alphabet Inc' ) ;
 entity_id | count_category |        count_category_desc        | count_value | modified_load_id 
-----------+----------------+-----------------------------------+-------------+------------------
      3898 | CB             | total_backward_citations          |      420962 |           679395
      3898 | CF             | total_forward_citations           |      897547 |           679395
      3898 | FSA            | total_families_status_active      |       22935 |           680151
      3898 | FSL            | total_families_status_lapsed      |       11877 |           680151
      3898 | FSP            | total_families_status_pending     |        2673 |           680151
      3898 | FT             | total_families                    |       37485 |           680151
      3898 | PLA            | total_publevel_application        |       79625 |           680151
      3898 | PLG            | total_publevel_grant              |       66325 |           680151
      3898 | PLU            | total_publevel_unknown            |       38821 |           680151
      3898 | PSA            | total_publications_status_active  |       94507 |           680151
      3898 | PSL            | total_publications_status_lapsed  |       34353 |           680151
      3898 | PSP            | total_publications_status_pending |       17092 |           680151
      3898 | PSU            | total_publications_status_unknown |       38819 |           680151
      3898 | PT             | total_publications                |      184771 |           680151
      3898 | ST             | total_subsidiaries                |         186 |           680151
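
The view can also be filtered to a single metric. For example, to retrieve only the number of active families for a hierarchy, reusing the entities.f_name2id helper shown above:

select count_value
from entities.v_hierarchy_family_statistics
where entity_id = entities.f_name2id( 'Alphabet Inc' )
  and count_category = 'FSA';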

Upgrading PostgreSQL

CLAIMS Direct now supports recent PostgreSQL versions (12, 13, and 14). The Alexandria schema tools (alexandria-schema-tools) and the IFI CLAIMS Direct yum repository have been updated; they provide tools for quality control (cd-count.sh), bulk extraction (cd-extract.sh), and bulk loading (cd-load.sh).

Please note:

  • Updates should be stopped (or paused) before proceeding. See apgupd for instructions.
  • PostgreSQL must be configured to allow inbound and outbound communication via the pg_hba.conf, security group, and/or firewall.

Extraction

In order to migrate, the entire contents of the alexandria database must be extracted to disk.

Prerequisites

  1. A minimum of 2.0TB free space on the extract destination drive. You don't need a full database footprint as the extract files are compressed.
  2. Solid State Drives are recommended for better performance.
  3. Network connectivity between the two PostgreSQL databases is required. It is also possible to use a separate server either as an intermediary or to share the extract destination drive, e.g., via NFS between the current and new instance.

Process

Stop apgupd before proceeding (See apgupd for instructions.)

Extract the data from all applicable schema to the storage location.

The following script must be run on a drive with at least 2.0TB free.

First, the script:

cd-extract.sh -h

Usage: cd-extract.sh OPTIONS

  Option        Description                     Default
  --------------------------------------------------------
  -H|--host     specify host                    localhost
  -P|--port     specify port                    5432
  -d|--database specify database                alexandria
  -u|--user     specify database username       alexandria
  -h|--help     print this usage and exit

Examples:

cd-extract.sh -H 10.10.10.2 -P 5432 -d alexandria -u alexandria
cd-extract.sh -H localhost -d alexandria -u alexandria


By default, this script uses the alexandria database located on localhost and outputs the compressed table data to ./data. If storage is tight on the CLAIMS Direct instance, you can also use an intermediary server to extract and load, or use a storage server to share disk to both the old and new CLAIMS Direct instances via NFS, e.g.,

[Diagram: extract and load using a shared storage server]

For the sake of simplicity, we'll use the existing CLAIMS Direct instance (with the older version of PostgreSQL) for extraction and loading. If you use a separate server instead, you will need to install PostgreSQL on it (the default version is sufficient: yum -y install postgresql) and replace localhost in the commands below with the IP address of the older CLAIMS Direct PostgreSQL instance.

# Create a location for the extract

mkdir EXTRACT && cd EXTRACT

cd-extract.sh -H localhost -d alexandria -u alexandria

# Please note, this process can take up to 3 days to complete.
# It is therefore recommended to detach the process using nohup

nohup cd-extract.sh -H localhost -d alexandria -u alexandria &

# Progress will be output to nohup.out (standard out for terminal invocation), e.g.,

Checking ./data ... not present, creating ... ok
Extracting data ........ done

Loading

Prerequisites

  1. A new CLAIMS Direct instance. Please see the PostgreSQL Installation Instructions for step-by-step instructions.
  2. Access to the extract directory which contains the data directory of the files created above. This could be via the older local CLAIMS Direct instance, an intermediate server, or a shared drive.
  3. Installation of supplementary streams, e.g., litigation, entities etc., see installation instructions.

Process

Loading the data is accomplished in much the same way you loaded the original CLAIMS Direct instance: cd-load.sh or load.sh for earlier installs.

# Move into the directory created above

cd EXTRACT

# Load the data

cd-load.sh -H IP -d alexandria -u alexandria

# As above, this is a multi-day process and
# it is therefore recommended to detach the process

nohup cd-load.sh -H [IP-of-CD-PostgreSQL-instance] -d alexandria -u alexandria &

IFI CLAIMS has been managing standardized company name variations for decades, allowing you to map patent publications to the correct original and current assignees. Our proprietary name management process cleans various types of entity names to address issues of misspelling, name changes, and other merger and acquisition activity. The result is the most reliable company name mapping available for IP analytics. Our Names Service takes this process even further, providing the Ultimate Owner for any IFI Name and allowing you to easily capture the full portfolio of patent assets owned by a company and its subsidiaries.

To make the best use of our IFI Names and Ultimate Owners, it is important to understand how we manage the data. This blog explains some of the rules we use to standardize names and assign Ultimate Owners.

What We Standardize

We standardize:

  • applicant names
  • assignee names in the parties container and assignment history
  • names provided in legal events that are related to changes in ownership or name changes

We do not standardize:

  • individuals
  • agents
  • correspondents

Our Process

We follow three main steps in our cleaning process: normalization, standardization, and harmonization.

The first step is to normalize the data or put it into a homogenized format. This includes:

  • removing punctuation, parentheses, hyphens, and similar characters
  • dealing with spelling variations (and/&, ä/ae) and misspellings
  • putting legal forms into a consistent format such as Corp, SL, Ltd, PLC, AG etc.

Next, different name variations are grouped under one official IFI Name. This is an extremely important step in the process as some of the variations we receive would make the record very difficult to find otherwise. For example, as of January 2022, we have 2,260 name variations for BASF SE, including:

  • GASF AKTIENGESELLSCHAFT
  • BEE AA ESU EFU AG
  • BADIS CHE ANVIN & SODA FABRIK AKTIENGESELLSCHAFT
  • ?SF A

Finally, harmonization involves matching the standardized IFI Name to the most current entity name, accounting for name changes. In the Names API, when an entity changes its name, the old name is added as a child with the parent_relationship flagged as "F" (former name). For example, in May 2020 United Technologies Corp changed its name to Raytheon Technologies. So for the Ultimate Owner (UO) Raytheon Technologies Corp, the Names API includes the child name, United Technologies Corp (F), among others.

The system then recalculates the ifi_name_current for all associated records. For example, Orig3n Inc changed its name to Seaport Diagnostics Inc in April 2021. As of December 2021, there were no published reassignments of any of their patents. However, the new name was available as the ifi-standardized-name-current in CLAIMS Direct:

<ifi-standardized-name country="US" number="866109">
  <name>Orig3n Inc</name>
</ifi-standardized-name>
<ifi-standardized-name-current country="US" number="866108">
  <name>Seaport Diagnostics Inc</name>
</ifi-standardized-name-current>

Language

To deal with names in different languages, we follow a common practice used by many patent offices and information providers.

For names in languages using Latin characters, we do not translate the names to English. Rather, we keep the original company names in their original language. For example:

  • Bayerische Motoren Werke AG (German)
  • Telmat Industrie SA (French) 
  • Universidad Politecnica de Madrid (Spanish)

For names in languages using non-Latin characters such as Russian and Chinese, we transliterate the names to Latin characters and use the name provided by the applicant if the company has a known English name. For example:

Name in Chinese: 清锋(北京)科技有限公司
Transliteration: Qingfeng Beijing Technology Co Ltd
IFI Name: Luxcreo Beijing Inc

For Japanese names, we try to find the official English name, which is not always a translation of the Japanese name. For example, according to the company website of Daido Tokushuko Kabushiki Kaisha, the company "name shall be expressed in English as Daido Steel Co., Ltd."

In a transliterated Japanese name, Kabushiki Kaisha is abbreviated as KK. If the name is translated, then the legal form is represented as “Co Ltd.”

Romanized names are added as variations of the official English name.

State and Country Disambiguation

For purposes of disambiguation, we add the country to the end of the name if the same entity name exists in different countries:

  • Siemens Industry Software GmbH Austria
  • Siemens Industry Software GmbH Germany

If multiple companies with the same entity name exist in different states of the same country and the entities are owned by different parent companies, we add the state to the end of the name:

  • Blackbird Technologies Inc California
  • Blackbird Technologies Inc Virginia

Rules for Different Entity Types

Ultimate Owners

IFI's Ultimate Owners provide comprehensive coverage of the patent holdings of major public companies such as those listed on the following:

  • FTSE 100
  • S&P 500
  • Russell 1000
  • Forbes Global 2000
  • Fortune 500

In general, our rules for determining Ultimate Owners are as follows:

  • The IFI Ultimate Owner is the highest-level legal entity with an ownership interest of greater than 50% in the subsidiary. We define the ultimate parent as the highest-level entity that prepares consolidated financial statements, understood as “financial statements of a group in which the assets, liabilities, equity, income, expenses, and cash flows of the parent and its subsidiaries are presented as those of a single economic entity,” as defined by the International Financial Reporting Standards (IFRS).
  • We do not consider groups or conglomerates that do not exist as legal entities.
  • IFI Ultimate Owners are operational entities, not investment companies. This means that we do not consider an investment company to be the ultimate parent of a private entity even if it owns more than 50% of the company shares.

As an exception, we do create Ultimate Owners that are investment companies if they own patents or if they are in the S&P lists. For example, Huntington Bancshares Inc. is considered an Ultimate Owner because many of its child companies own patents, including:

  • Grayling State Bank
  • Midwest Bank and Trust Co
  • Security First Corp
  • LeaseNet Group Inc
  • Huntington National Bank
  • NBC Corp
  • Royal American Bank
  • Sky Bank
  • Citizens State Bank
  • Empire National Bank of Traverse City
  • Commerce Exchange Bank

Joint Ventures

Joint ventures join two or more different entities into a new entity for a specific objective in a specific country. Joint ventures are considered to be Ultimate Owners because none of the participants own more than 50% of the venture.

Holding Companies

A holding company is an umbrella company that owns one or several other companies. They are often set up to distribute liabilities amongst different companies. A holding company may be listed as an Ultimate Owner if it owns enough stock to have a controlling interest and if it produces a consolidated financial statement. For example, Berkshire Hathaway is considered an Ultimate Owner because it produces a consolidated financial statement.

Dual-Listed Companies

Dual-Listed Companies (DLC) form a corporate structure in which two legal entities function as a single operating business but retain separate stock exchange listings, usually in different countries. In CLAIMS Direct, the DLC is the Ultimate Owner.

For example, Rio Tinto is a leading international mining group headquartered in the UK, combining Rio Tinto PLC, a London and New York Stock Exchange listed company, and Rio Tinto Limited, which is listed on the Australian Securities Exchange. The two companies are joined in a DLC structure as a single economic entity called the Rio Tinto Group.

Rio Tinto PLC, GB - LSE
Rio Tinto Ltd, AU - ASX

Ultimate Owner: Rio Tinto Group, GB

Educational Institutions and University Systems

IFI uses this category for educational institutions, no matter the funding type. This includes schools, colleges, and universities, all of which may be Ultimate Owners.

As it is now common for educational applicants to file for patents under the name of their parent institution, we consolidate the various departments, campuses, and other divisions under the name of the parent institution, with the exception of institutions that are financially independent entities. 

This may include:

  • colleges of liberal arts
  • professional schools
  • bureaus
  • offices
  • foundations
  • technology transfer entities
  • research centers (located on or off the main campus)
  • branch campuses

A university system is a group of institutions having a collective legal status and generally recognized by a state government, a board of education, or other relevant organization as a collective organization. IFI creates an Ultimate Owner for university systems.

For example, for the Ultimate Owner, State University System of Florida, child names include, among others: 

  • Florida Agricultural and Mechanical University FAMU
  • Florida Atlantic University
  • Florida Atlantic University Research Corp
  • University of Florida Research Foundation Inc

Government-Owned Entities

This category includes state-owned companies as well as government departments and agencies. Some examples include:

  • Agence Spatiale Europeenne (France)
  • Beijing Automotive Group Co Ltd (China)
  • Ministry of Science and Technology India (India)
  • National Institute of Advanced Industrial Science and Technology AIST (Japan)

Ticker Symbols

In CLAIMS Direct, ticker symbols are available for all listed entities. You can use the Names service to retrieve the IFI name associated with a ticker symbol, then use that name to find all patent records associated with that entity in the IFI Snapshots.

We cover the most popular lists such as the S&P 100, S&P 500, Russell 1000, and more, and we add new Ultimate Owners from the Nasdaq, NYSE, and other sources daily. We have also begun to cover markets outside of the US such as the Australian Securities Exchange and the Japan Exchange Group. New sources are added on a continuous basis.

Rules for Ticker Symbols

Tickers are associated with the standard Market Identifier Code (MIC), specifically the operating MIC, which identifies an entity that operates an exchange, trading platform, etc. Some examples include:

Market Name                        MIC    Operating MIC
CBOE BZX U.S. EQUITIES EXCHANGE    BATS   XCBO
SHANGHAI STOCK EXCHANGE            XSSC   XSHG
SHENZHEN STOCK EXCHANGE            XSEC   XSHE

Tickers that contain numbers are not treated as numeric values, and leading zeros are not removed.

Entity Name         MIC     Ticker
Petrochina Co Ltd   XSHG    601857
                    XHKG    00857

If a ticker ends in a letter indicating the equities class, the separator before the letter (period, slash, or space) is removed.

Entity Name               Ticker in External Sources   Ticker in IFI Products
Corus Entertainment Inc   CJR.B, CJR/B, CJR B, CJRB    CJRB

Tickers that contain other symbols are stored as they are.

Entity Name                 Ticker in External Sources   Ticker in IFI Products
Mahindra and Mahindra Ltd   M&M                          M&M

Tickers for stocks listed as American Depositary Receipts (ADRs) are provided if available.

Entity Name       MIC     Ticker
Novo Nordisk AB   XCSE    NOVOB
                  XNYS    NVO

Backward citations are a primary component of proving inventiveness in new patent applications. These citations reference previous work or prior art that is considered relevant to a current patent application. However, when looking for these citations in patent data, it is important to know that they are not always published in the applications. Here, we will look at how backward citations are handled during the application process and how the process can affect where those citations appear in the data.

As an example, at the time of filing a US patent application, an Information Disclosure Statement (IDS) form is submitted with the application. This includes any citations the inventors/applicants feel are relevant to demonstrate how their invention differs from the prior art. Since not everyone does a thorough search of the literature before filing their application, in the US (as well as some other countries) the inventors/applicants must also submit any relevant prior art cited by other offices that examine applications of the patent family members, within three months of its publication in the respective office.

Examiners generally do not rely solely on the prior art supplied by inventors/applicants, and citations may be appended by the examiner as well as by the inventor/applicant. If the applicant did not provide any citations in the application, it is the job of the examiner, who is an expert in the field, to locate documents that are most closely related to the new invention. Even if an applicant does provide citations, examiners conduct their own searches based on how they interpret the invention and cite the art they believe is relevant to novelty and inventiveness.

Examiners also consider the prior art provided in the related PCT International Search Report (ISR). In the ISR, the WIPO examiner designates each citation as an X, Y, or A document. “X” indicates that a citation is novelty-killing. “Y” indicates that it is relevant to inventiveness, and “A” indicates that it is general state of the art. And, of course, they do review the art cited by the applicant. There is also an option for third parties to submit prior art within a specific time frame (often between publication and the first office action). All of this information becomes part of the file wrapper/record.

In the US, the publication of applications occurs 18 months after the priority filing. Therefore, it is often the case that publication occurs before the examination begins and thus before the examiner has had the opportunity to add relevant citations. In these cases, there may be no citations (other than ISR references) at the time the application is published.

While examiners try to find all relevant art in their initial search, they will also re-do their search prior to allowance to make sure they found everything. In the US, Form 892 is the “List of references cited by examiner”. The citations on that list are the ones which are included in the grant publication and subsequently provided in CLAIMS Direct’s XML in the technical-data/citations container. Therefore, the best way to find all the relevant backward citations for a patent is to look at the grant publication rather than the application.

To learn more about the citation data included in CLAIMS Direct's XML and how to search this data, see technical-data/citations and Citations.


Introduction

IFI recommends 6TB of disk space for an on-site PostgreSQL instance. Normally, this will accommodate approximately 3 years of database growth (depending on your subscription level). If you are running out of disk space more quickly than anticipated, the reason may be an increased number of deletes occurring during the update process. This is especially likely to occur when IFI replaces content at a more aggressive rate than usual, as we have been doing since October 2019 in order to replace our old translations with higher-quality translations from Google. The blog Understanding of Bloat and VACUUM in PostgreSQL does a good job of explaining why your database may be growing disproportionately to actual pipeline data flow. This blog will help diagnose whether bloat is the cause of your problem and provide some potential solutions.

Checking for Bloat

The following SQL query will examine each table in the XML schema and identify dead rows (tuples) that are wasting disk space.

SELECT schemaname || '.' || relname as tblnam,
    n_dead_tup,
    (n_dead_tup::float / n_live_tup::float) * 100 as pfrag
FROM pg_stat_user_tables
WHERE schemaname = 'xml' and n_dead_tup > 0 and n_live_tup > 0 order by pfrag desc;

If this query returns a high percentage ( pfrag ) of dead tuples, the VACUUM command may be used to reclaim space.

Using VACUUM to Reclaim Storage

1) VACUUM ANALYZE

Important

Pause apgupd before running this command.

The VACUUM ANALYZE command marks the dead tuples in all tables of the database as available for reuse, which should slow database growth considerably.
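
For example, with apgupd paused and a psql session connected to the alexandria database, the command is simply:

VACUUM ANALYZE;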

After you run this, you may want to adjust the aggressiveness of your autovacuum settings to avoid accumulating more bloat in the future. The following settings should be considered:

  • autovacuum_max_workers = 5 - This is the recommended setting for a 16-core machine. If you have, for example, a 4-core machine, a setting of 2 would be the best available option.
  • autovacuum_vacuum_scale_factor = 0.02 - This setting indicates the threshold which determines when autovacuum needs to run per table. It refers to the fraction of the table size. So if the percentage of dead tuples (the pfrag column in the SQL query above) is greater than 2%, this spawns an autovacuum process. The default value for this setting is 20%, which will create massive bloat in your database before the autovacuum process even starts.
  • autovacuum_analyze_scale_factor = 0.01 - Although not specific to table size and vacuum cleanup, this setting, which is also a percentage of fragmentation of the table, tells the autovacuum process to analyze the table (i.e., update query planner statistics) when the fragmentation percentage reaches 1% (the default is 10%).
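
If you prefer to change the settings above from a SQL session rather than by editing postgresql.conf directly, one possible sketch (assuming superuser access) is:

-- ALTER SYSTEM writes the values to postgresql.auto.conf
ALTER SYSTEM SET autovacuum_max_workers = 5;             -- takes effect only after a server restart
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.02;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.01;
SELECT pg_reload_conf();                                 -- picks up the scale-factor changes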

2) VACUUM FULL <table>;

Important

Pause apgupd before running this command.

The FULL vacuum command physically re-writes the table, removing the dead tuples and reducing the size of the table, whereas without the FULL modifier, the dead tuples are only made available for reuse. This is a processor- and disk-intensive operation but given appropriate planning, can reduce the size of the table by upwards of 25%. If you want to pursue this avenue, pick a highly fragmented table and test the amount of disk space recovered in order to assess whether it is worth the effort. For example, to test the process on xml.t_keywords:

1. Determine the current size of the table.

SELECT pg_size_pretty( pg_total_relation_size('xml.t_keywords') );

2. Defragment and reclaim all disk space.

VACUUM FULL xml.t_keywords;

3. Determine the new size of the table after vacuum.

SELECT pg_size_pretty( pg_total_relation_size('xml.t_keywords') );

Important

You will need at least twice the current table size available as free disk space in order to run the VACUUM FULL command because it re-writes the entire table. You can use both strategies above - VACUUM (to slow growth) and VACUUM FULL (to reduce data footprint). However, it is recommended to complete VACUUM first if you want to pursue the more aggressive VACUUM FULL.

Alternative Options

If you are really out of disk space, you may want to consider the following options:

1. Create a hot standby on a separate machine with +2TB disk, i.e., 8TB instead of 6TB. See Hot Standby for more information.

This is probably the easiest method and is least intrusive to normal workflow as nothing needs to be stopped or paused. Once the primary and standby instances are in sync, simply make the standby the primary.

2. If you can tolerate downtime, create a parallel machine as in #1 but use pg_dump to recreate the database. You will need to pause apgupd for the duration of the transfer:

pg_dump -Ualexandria -hOLD-IP [ other options ] | psql -Ualexandria -hNEW-IP postgres

Introduction

With the deployment of the cumulative patch alexandria-sql-patch-alpa-3636-20191101, on-site CLAIMS Direct installations now have the ability to utilize family and citation functionality in-house. What was previously only available using the remote CLAIMS Direct shared API is now possible internally with simple SQL functions. The following post will outline the steps required to prepare the data tables as well as presenting a brief walk-through of the functionality.

Preparation

Note: If your initial load occurred after May 13, 2020, these tables were included in the initial load and you do not need to follow the preparation instructions here. However, to maintain the supplemental tables required for on-site family and citation functionality, you must add --api, e.g., CD_OTHER_OPTIONS='--api' to the apgupd configuration file.

Sequential Loading

In order to leverage the new functionality, the necessary lookup tables need to be populated on-site. The functions available to achieve this are:

  • cdws.f_update_cited_documents – serving forward and backward citations
  • cdws.f_update_priority_documents  – serving simple and extended family

Each function takes a modified-load-id as its input parameter. Therefore, to populate each table, every modified-load-id from the xml.t_patent_document_values table needs to be processed. The following PL/pgSQL block serves as an example.

DO $$
DECLARE
  v_load_id integer;
BEGIN
  FOR v_load_id IN
      SELECT modified_load_id
      FROM xml.t_patent_document_values
      GROUP BY modified_load_id
      ORDER BY modified_load_id
  LOOP
    -- populate the citation and family lookup tables for this load-id
    PERFORM cdws.f_update_cited_documents( v_load_id );
    PERFORM cdws.f_update_priority_documents( v_load_id );
  END LOOP;
END;
$$;

Although the above SQL exemplifies the fundamental logic involved, it isn't the most efficient method. To that end, we are offering a supplemental patch that installs bulk loading functions to populate these tables: alexandria-sql-patch-alpa-3636-x-20191215. In addition to adding bulk loading functions, this patch expands some of the citation and family functionality not included in the original alexandria-sql-patch-alpa-3636-20191101 patch, namely:

  • FUNCTION cdws.f_family_citations_backward(v_family_id integer) 
    Returns backward citations for an entire family
  • FUNCTION cdws.f_family_citations_forward(v_family_id integer) 
    Returns forward citations for an entire family
  • FUNCTION cdws.f_family_meta(integer) 
    Returns bibliographic data for an entire family

For a full enhancement listing, please see the README included in the package.

Bulk Loading

After the supplemental patch is installed, 2 new functions for bulk loading are exposed.

  • FUNCTION cdws.f_update_cited_documents_bf() 
  • FUNCTION cdws.f_update_priority_documents_bf() 

These new functions bulk-load the auxiliary tables and are called simply as:

SELECT cdws.f_update_cited_documents_bf();
SELECT cdws.f_update_priority_documents_bf();

Each function requires at least 24 hours to complete and may take up to 7 days, depending on the environment. They can be run in parallel if desired. Because of the duration required to complete the functions, it's important that the SQL runs uninterrupted by server restarts or terminal disconnects.

Important

In order to expedite the completion of these functions, it is required that apgupd be paused. Upon completion, executing an ANALYZE on the auxiliary tables is recommended:

ANALYZE VERBOSE cdws.t_cited_documents;
ANALYZE VERBOSE cdws.t_applications;
ANALYZE VERBOSE cdws.t_priority_documents;

Critical

In order to automatically populate these auxiliary tables with every new load-id, you must use the --api flag to apgupd. If this parameter isn't available with your version of Alexandria-Client-Tools, you will need to update to the latest version. In versions 2.5.3 and above, this parameter is set in /etc/alexandria/apgupd.conf via CD_OTHER_OPTIONS. Please see the Client Tools documentation for configuration details.

To confirm that the citation and family functionality have completed loading, locate the log files (citation.log  and family.log). The last line of each log should show an entry in the following format:

INFO:  added N new citations records from the work table (time)
INFO:  added N new priority records from the work table (time)

To check whether the calculations are still running, use one of the two SQL queries below while in the database or using pgAdmin.

SELECT datname, query FROM pg_stat_activity;
SELECT * FROM pg_stat_activity;

Usage

Once the auxiliary table loading is complete, a wide variety of citation and family functionality is available directly from the on-site CLAIMS Direct instance.

Citations

Bringing the CLAIMS Direct API methods citations/forward and citations/backward on-site is achieved using the new functions:

  • FUNCTION cdws.f_citations_backward
  • FUNCTION cdws.f_citations_forward 

Each function takes as input a publication_id and returns 2 columns: ucid and source. 

SELECT * FROM cdws.f_citations_backward( xml.f_ucid2id('US-5551212-A') );
     ucid      | source 
---------------+--------
 US-4905451-A  | SEA
 US-2043419-A  | SEA
 US-4631900-A  | SEA
 DE-8002686-U1 | SEA
 EP-0225665-A1 | SEA
 US-3187480-A  | SEA
 US-4828110-A  | SEA
 DE-3138439-A1 | SEA
 EP-0061805-A1 | SEA
 DE-3606826-A1 | SEA
 US-4546875-A  | SEA
 US-4534151-A  | SEA
 EP-0313721-A2 | SEA

Of course, GROUP BY and ORDER BY are available, e.g., using substring, group forward citations by country:

SELECT substring(ucid, 1, 2) AS country, count(*) AS ncites
FROM cdws.f_citations_forward( xml.f_ucid2id('US-5000000-A') )
GROUP BY country
ORDER BY ncites DESC;
 country | ncites 
---------+--------
 US      |    264
 WO      |     31
 EP      |     27
 AU      |      1
 KR      |      1

Family

Bringing CLAIMS Direct API methods family/simple and family/extended functionality to the on-site install is accomplished with the functions:

  • FUNCTION cdws.f_family_simple 
  • FUNCTION cdws.f_family_extended 

Each of these functions, as with the citation functions above, takes a publication_id as input. The return columns are ucid and family_id. The extended function returns additional publication_id and level columns.

SELECT * FROM cdws.f_family_simple( xml.f_ucid2id('US-5000000-A') );
       ucid       | family_id 
------------------+-----------
 AR-242634-A1     |  26932266
 CA-1335430-C     |  26932266
 EP-0431047-A1    |  26932266
 HU-T60328-A      |  26932266
 JP-H05502366-A   |  26932266
 KR-900702042-A   |  26932266
 WO-1990002193-A1 |  26932266

SELECT * FROM cdws.f_family_extended( xml.f_ucid2id('US-5551212-A') );
 publication_id |      ucid      | level 
----------------+----------------+-------
       71329085 | US-5551212-A   |     0
         302369 | AT-96388-T     |     1
        1220699 | AU-636238-B2   |     1
        2293384 | AU-8174591-A   |     1
        5137694 | CA-2050271-A1  |     1
        5137693 | CA-2050271-C   |     1
       14883151 | DE-4027762-C1  |     1
       17156127 | DE-59100535-D1 |     1
       93003830 | DK-0477487-T3  |     1
       21052692 | EP-0477487-A1  |     1
       21052691 | EP-0477487-B1  |     1
       25818932 | ES-2046827-T3  |     1
      128970015 | JP-H054613-A   |     1
      131336800 | JP-H0825524-B2 |     1
       59117993 | NO-913030-D0   |     1
      177975213 | NO-913030-L    |     1
       62462393 | TR-25401-A     |     1
      159577685 | ZA-9106183-B   |     1

A second optional argument in the cdws.f_family_extended function, v_include_apps Boolean, can be used to include application filing data from the patent office. The default setting of this argument is false in order to maintain consistency with the EPO. In some cases, however, application filing data is not included in the priority claims data. Setting this argument to true will include the application filing data, as in the following example.

SELECT ucid
FROM cdws.f_family_extended(xml.f_ucid2id('AU-2012201497-B2'), true ) as f
WHERE f.ucid = 'WO-2008136115-A1';
       ucid
------------------
WO-2008136115-A1
(1 row)

Additional Functions

In addition to the above, there are other functions in the supplemental patch worth exploring, namely:

  • FUNCTION cdws.f_family_meta( v_family_id integer ) 
    Returns a variety of bibliographic and legal status information
  • FUNCTION cdws.f_family_citations_backward(v_family_id integer) 
    Returns backward citations for an entire family
  • FUNCTION cdws.f_family_citations_forward(v_family_id integer) 
    Returns forward citations for an entire family

Family Meta

The cdws.f_family_meta function brings a variety of information together for a simple family. The following columns are returned:

  • family-id - DOCDB-assigned simple family integer identifier
  • ucid – unique character identifier (publication)
  • published – publication date
  • anucid – unique character identifier (application)
  • filed – application filing date
  • title – title of document
  • ifi_status – IFI Snapshot document status
  • ifi_anticipated_expiration - IFI anticipated expiration date
  • ifi_adjusted_expiration - IFI adjusted expiration date
  • epo_legal_status_xml – XML representation of EPO legal status

An abbreviated example:

SELECT ucid, published, anucid, filed, title FROM cdws.f_family_meta( 26932266 );
       ucid       | published |     anucid     |  filed   |                                          title                                           
------------------+-----------+----------------+----------+------------------------------------------------------------------------------------------
 AR-242634-A1     |  19930430 | AR-31479889-A  | 19890830 | ETHANOL PRODUCTION BY GENETICALLY ENGINEERED ESCHERICHIA COLI STRAINS
 CA-1335430-C     |  19950502 | CA-609829-A    | 19890830 | ETHANOL PRODUCTION BY ENGINEERED MICROBES
 EP-0431047-A1    |  19910612 | EP-89909966-A  | 19890830 | ETHANOL PRODUCTION BY GENETICALLY ENGINEERED ESCHERICHIA COLI STRAINS
 HU-T60328-A      |  19920828 | HU-577189-A    | 19890830 | PROCESS FOR PRODUCING ETHANOL WITH TRANSFORMED MICROORGANISMS
 KR-900702042-A   |  19901205 | KR-900700901-A | 19890830 | 유전공학처리 미생물에 의한 에탄올 생성
 US-5000000-A     |  19910319 | US-35206289-A  | 19890830 | Ethanol production by Escherichia coli strains co-expressing Zymomonas PDC and ADH genes
 WO-1990002193-A1 |  19900308 | US-8903753-W   | 19890830 | ETHANOL PRODUCTION BY GENETICALLY ENGINEERED ESCHERICHIA COLI STRAINS

Family Citations

Forward and backward citations for an entire family are also available directly inside CLAIMS Direct.

SELECT * FROM cdws.f_family_citations_backward( 26932266 );
       ucid       
------------------
 EP-0047641-A2
 WO-1986001833-A1
 WO-1986004357-A1

Again, GROUP BY and ORDER BY are available.

SELECT substring(ucid, 1, 2) AS country, count(*) AS ncites
FROM cdws.f_family_citations_forward( 26932266 )
GROUP BY country
ORDER BY ncites DESC;
 country | ncites 
---------+--------
 US      |    290
 WO      |     36
 EP      |     35
 JP      |      2
 FR      |      1
 KR      |      1
 AU      |      1



Solr Indexing Process Explained


Processes

The main executable script used for indexing is aidx delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into Solr XML and submitting via HTTP POST to Solr for indexing. The conversion process from CLAIMS Direct XML to Solr XML is handled by the indexer class (default is Alexandria::DWH::Index::Document). Alexandria::Client::Tools also provides an indexing daemon, aidxd which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by apgup.

Process   Package
aidx      Alexandria::Library
aidxd     Alexandria::Client::Tools
apgup     Alexandria::Client::Tools

Source Data

Source XML is extracted out of the PostgreSQL data warehouse using the core library functionality exposed by the Alexandria::Library module Alexandria::DWH::Extract. The Extract module can pull data based on a number of criteria, the most common of which are:

  • load-id: modified-load-id of xml.t_patent_document_values
  • table: any table name that has publication_id(int) column
  • SQL: raw SQL selecting desired documents by publication_id

Regardless of extraction criteria, Alexandria::DWH::Extract utilizes an UNLOGGED temporary table to accumulate the desired publication_id(s). Extraction proper is done from this accumulation table in parallel select batches. The amount of parallelization as well as the number of documents per select are controlled by the parameters batchsize and nthreads. aidx also accepts a dbfunc parameter which designates the stored function within the PostgreSQL database to use to extract the XML data needed for indexing. The current default function is xml.f_patent_document_s, which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones, modeled on xml.f_patent_document_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).


Command: aidx --table=x

  Accumulation SQL:
    select publication_id
    from x into t1

  Extract SQL:
    select xml.f_patent_document_values_s( t2.publication_id )
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id = t2.publication_id )

Command: aidx --loadid=y

  Accumulation SQL:
    select publication_id
    from xml.t_patent_document_values
    where modified_load_id=y into t1

  Extract SQL:
    select xml.f_patent_document_values_s( t2.publication_id )
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id = t2.publication_id )

Command: aidx --sqlq=USER_SQL

  Accumulation SQL:
    execute USER_SQL into t1

  Extract SQL:
    select xml.f_patent_document_values_s( t2.publication_id )
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id = t2.publication_id )

Command: aidx --table=x --dbfunc=f_my_function

  Accumulation SQL:
    select publication_id
    from x into t1

  Extract SQL:
    select f_my_function( t2.publication_id )
    from xml.t_patent_document_values as t2
      inner join t1 on ( t1.publication_id = t2.publication_id )

Indexer Class

Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into Solr. The following abbreviated example from aidx serves to illustrate the process.

#! /usr/bin/perl
use Alexandria::DWH::Extract;
use Alexandria::DWH::Index;
use Alexandria::DWH::Index::Document;

my $idxcls = shift(@ARGV); # from command line

sub _create_solr_document {
  my ( $batch, $xml ) = @_;
  eval "require $idxcls";  # load the indexer class named on the command line
  return $idxcls->new( document => $xml )->toNode()->toString(1);
}

my $ex = Alexandria::DWH::Extract->new(
  ...
  callbacks => { on_document_processed => \&_create_solr_document }
);
$ex->prepare();
$ex->run();  # every document extracted is sent through _create_solr_document()
$ex->finalize();

Creating a Custom Indexing Class

Creating a custom indexing class is simply a matter of sub-classing Alexandria::DWH::Index::Document and manipulating the Solr document representation by adding, deleting, or modifying certain fields. There is currently only one method that can be overridden in the sub-class, namely _process_source. The following skeleton module will serve as a basis for the use cases detailed below.

package MyCustomIndexingClass;

use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';
# override _process_source
sub _process_source {
  my $self = shift;

  # we want to process the standard way ...
  $self->SUPER::_process_source();

  # do nothing else
}
1;

You can now specify MyCustomIndexingClass as the command line argument --idxcls to the indexing utility aidx.

aidx --idxcls=MyCustomIndexingClass [ many other arguments ]

Use Cases

Assumptions

The following use cases assume:

  • a valid index entry in /etc/alexandria.xml – this will be different from the default if you have a custom Solr installation
  • custom indexing class modules are either in the directory you run aidx or in your PERL5LIB path


(1) Adding (Injecting), Modifying, and Deleting Fields

For this use case, you will need to modify your Solr schema for the installation associated with the appropriate configuration index entry. Add the following field definition:

<field name="customInteger" type="tint" indexed="true" stored="true" />

Below is example code to inject customInteger into the Solr document. It also shows how to modify the contents of anseries and how to delete anseries when the publication country is US and the publication date is on or after January 1, 2015.

package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';

 
# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  # the _fields member of $self contains all the
  # Solr content as a hash reference of array references
  # e.g.
  # _fields =>
  #      NOTE: multiValued=false fields are still represented as an array
  #            but only have one member
  #    pn => [ 'US-5551212-A' ],
  #    anseries => [ '07' ],
  #    icl1 => [ 'A', 'F', 'H' ]

  my $flds = $self->{_fields} || return; # nothing to do

  # inject a new field
  push( @{ $flds->{customInteger} }, 1 );

  # we want to make certain that anseries is not padded, i.e.,
  # we need to be sure it is an integer
  if( scalar( $flds->{anseries} ) ) {
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # lastly, we don't want to index anseries for US documents published
    # on or after 20150101
    my $ctry = $flds->{pnctry}->[0];
    my $date = $flds->{pd}->[0];
    if( $ctry eq 'US' && $date > 20141231 ) {
      delete $flds->{anseries};
    }
  }
}
1;

(2) Accessing the CLAIMS Direct Source XML Document

This next use case will examine methods of (re)processing data from the source XML document. The goal will be to create a new multi-valued field to store related documents. The following changes need to be made to the Solr schema:

<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" />

We first need to define what qualifies as a related ucid (rel_ucid). For this example, it will be defined as:

  • any related-documents/relation entry with @type=related-publication
  • any pct-or-regional-publishing-data


The parts of the XML document that are of interest:

<related-documents>
  <relation type="related-publication">
    <document-id>
      <country>US</country>
      <doc-number>20150126456</doc-number>
      <kind>A1</kind>
      <date>20150507</date>
    </document-id>
  </relation>
</related-documents>
<!-- ... -->
<pct-or-regional-publishing-data ucid="WO-2013182650-A1">
  <document-id>
    <country>WO</country>
    <doc-number>2013182650</doc-number>
    <kind>A1</kind>
    <date>20131212</date>
  </document-id>
</pct-or-regional-publishing-data>

As this example is more involved, the following code is broken down by function. A complete listing of code will be provided below.

### routine to parse related documents
sub _parse_related_documents {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  my @a = (); # stores any related-publications

  # if there are no related documents, return empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf("%s-%s-%s",
                 $relation->findvalue('./document-id/country'),
                 $relation->findvalue('./document-id/doc-number'),
                 $relation->findvalue('./document-id/kind')
        )
      );
    }
  }
  return \@a;
}

Points to consider with _parse_related_documents:

  • The source document (XML) representation is an XML::LibXML::Node, named above as patdoc.
  • Utilizing available methods, it is relatively simple to access particular parts of the XML tree.
  • The findvalue calls lack error checking, i.e., we assume every value is present and that, combined via sprintf, they will yield a correctly formatted ucid


### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}

Points to consider:

  • according to the DTD, there is only ever one related pct document, hence single-value return
  • the ucid attribute is available here, unlike in the related-documents entries above

The complete listing:

package MyCustomIndexingClass;

# subclass of Alexandria::DWH::Index::Document
use Moose;
### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
  extends 'Alexandria::DWH::Index::DocumentEx';

# override _process source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  my $flds = $self->{_fields} || return; # nothing to do

  my $reldocs = $self->_parse_related_documents( $self->{_source_root} );
  my $pctdoc  = $self->_parse_pct_publishing_data( $self->{_source_root} );
  if( scalar( @{$reldocs} ) ) {
    foreach my $r ( @{$reldocs} ) {
      push( @{ $flds->{rel_ucids} }, $r );
    }
  }
  if( $pctdoc ) { push( @{ $flds->{rel_ucids} }, $pctdoc ); }
}

### routine to parse related documents
sub _parse_related_documents {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  my @a = (); # stores any related-publications

  # if there are no related documents, return empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf("%s-%s-%s",
                 $relation->findvalue('./document-id/country'),
                 $relation->findvalue('./document-id/doc-number'),
                 $relation->findvalue('./document-id/kind')
        )
      );
    }
  }
  return \@a;
}

### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self = shift;
  # the root of the source XML
  # as an XML::LibXML::Node
  my $patdoc = shift;

  my $ret; # only one value available (or none)

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}

1;


Introduction

There are a number of reasons one would need to re-index data from the data warehouse. These range from simply re-indexing a single load-id to a complete re-index of the entire contents of the data warehouse. In this blog, I'm going to go over the mechanisms that move data from the data warehouse to the index and the ways in which these mechanisms can be used to trigger partial or full re-indexing of data.

Background

In all installations of CLAIMS Direct, data synchronization between the CLAIMS Direct primary data warehouse and on-site installations is driven by the tool apgup. In version 2.1, a new update daemon (apgupd) was introduced to facilitate background processing of new and updated data. The main roles this daemon plays include:

  1. Reconciling differentials between on-site data warehouses and CLAIMS Direct primary data warehouses
  2. Downloading the next available update package for processing within the on-site installation
  3. Processing the update package
    1. Loading data to the data warehouse
    2. Optionally en-queuing that load for indexing

For our topic, we will concentrate on the indexing queue as this drives the indexing process.

Index Queue

The indexing queue is a table inside the reporting schema: t_client_index_process. When apgupd completes a load into PostgreSQL, a row is inserted into reporting.t_client_index_process with a running_status of pending. The indexing daemon, aidxd, probes the table looking for the highest-priority load_id to index. Upon finding an applicable load_id, aidxd proceeds to extract all publications associated with that load_id and index the documents into Solr.
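
Conceptually, the probe amounts to something like the following query (an illustrative sketch only, not the daemon's actual SQL):

select load_id
from reporting.t_client_index_process
where running_status = 'pending'
order by priority desc, load_id
limit 1;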

The column priority influences which load_id is to be processed next. Higher priority load-ids are processed before lower priority ones. This can cause out-of-order processing as we will discuss below.
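
For example, to push a particular load ahead of everything else in the queue, you could raise its priority (the load_id below is just a placeholder):

update reporting.t_client_index_process
  set priority = 10
where load_id = 123456;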

Re-indexing

Individual load_ids

At its simplest, re-indexing is just a matter of resetting a load_id in reporting.t_client_index_process:

update reporting.t_client_index_process
  set running_status='pending',
      completed_stamp=null,
      completed_status=null,
      completed_comment=null
where load_id=123456;

The above SQL resets load_id 123456 so that the indexing daemon processes that load_id next. You can re-index any number of load_ids by resetting each of the rows.
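
For instance, the same reset can be applied to a handful of load_ids at once (the values below are placeholders):

update reporting.t_client_index_process
  set running_status='pending',
      completed_stamp=null,
      completed_status=null,
      completed_comment=null
where load_id in (123456, 123457, 123458);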

Entire Data Warehouse

Unfortunately, resetting all the rows in reporting.t_client_index_process is not sufficient to re-index the entire contents of the data warehouse. This has to do with the fact that the table is empty upon initial load to the on-site instance as the population of reporting.t_client_index_process begins with the first update to the data warehouse. Each new update that is processed is queued. To that end, the most efficient way to queue the entire data warehouse for re-indexing is to select the entire xml.t_patent_document_values table and group by modified_load_id.

Before proceeding with the next set of SQL, please be sure both apgupd and aidxd are paused (kill -s USR1 <pid>).

truncate table reporting.t_client_index_process;
 
insert into reporting.t_client_index_process (load_id, load_source, ndocs, priority)
  select modified_load_id, 'REINDEX', count(*), -1
  from xml.t_patent_document_values
  group by modified_load_id
  order by modified_load_id desc;

A few comments about the above statements:

  1. Truncating the table is required as the load_id column is unique. If you wish to save the contents of that table, simply use the pg_dump utility.
  2. The load_source value is arbitrary. You can use any identifier you want as long as it doesn't exceed 64 characters.
  3. The priority is set to -1 because, in this example, re-indexing is less important than indexing new data.

Use Case

Changing Field Definitions

Modifying the CLAIMS Direct Solr Schema

The CLAIMS Direct Solr schema is a compromise between making every piece of the patent document searchable, storing enough metadata to be retrievable in the search result set, and indexing certain fields so they can be efficiently faceted. This compromise keeps the size of the index manageable while allowing efficient return of data from a search. Of course, the concept of manageable is subjective. There could very well be a need to have more fields returned during the search (stored=true) and other fields removed from storage but left searchable. In this use case, we will enable patent citations to be returned during the search. Regardless of the CLAIMS Direct Solr package, we'll start by editing the schema under <package>/conf-2.1.2. We are making a very simple change, but one that requires a complete re-index of all documents having citations.

Be sure the indexing daemon (aidxd) is paused.


The current field definition

    <field name="pcit" type="alexandria_string" indexed="true" stored="false" multiValued="true" />

becomes

    <field name="pcit" type="alexandria_string" indexed="true" stored="true" multiValued="true" />

To make this change effective, Solr needs to be restarted or the collection reloaded. For the standalone package, you would just need to restart Solr:

<package>/solrctl stop ; <package>/solrctl start

For the distributed package, first deploy the updated configuration to the ZooKeeper nodes.

cd <package>
./bootstrap-zookeeper.sh

Then reload all nodes:

cd <package>
./reload-collections

Setting up the Index Queue

Although we can go about re-indexing the entire data warehouse as outlined above, we want to be a bit more efficient and choose only the applicable documents, i.e., those that actually have citation data. To this end, we will find only load_ids that contain documents with citations and modify existing reporting.t_client_index_process rows as needed.

-- DELETE load-ids already processed
delete from reporting.t_client_index_process
where load_id in ( select distinct(modified_load_id) from xml.t_citations );
 
-- INSERT load-ids
insert into reporting.t_client_index_process (load_id, load_source, ndocs, priority)
  select modified_load_id, 'CITATION-FIX', count(*), -1
  from xml.t_citations
  group by modified_load_id;

Again, setting priority to -1 allows new documents to be indexed as usual, and documents needing to be updated will be indexed with a lower priority.

Checking Re-Index Status

You can monitor the progress of the re-index by querying the reporting.t_client_index_process table as follows:

select priority, load_source, count(*)
from reporting.t_client_index_process
where running_status = 'pending'
group by priority, load_source
order by priority, load_source;



The CLAIMS Direct data warehouse is updated continuously. On average, 1-3% of the entire database is touched weekly. These 1-3 million updates fall into two categories: new documents (roughly 20%) and updated documents (roughly 80%). New documents are, of course, new publications from issuing authorities (US, EP, etc.). Updates are generally, but not limited to, CPC changes, family changes (adding family ID), IFI Snapshots (patent status, standard names, and claims summaries), reassignments, legal status, etc. I suggest reading about content updates for a better understanding of load sources and issuing authorities. This blog will describe how to determine what has been updated and, further, how to differentiate new content from updated content at both the document and the container level.

Load Identifiers

As described in the content update documentation,

Every document within the data warehouse was loaded as part of a group of documents. This set of documents is identified by a load-id (integer value). There are 3 types of load-ids in the data warehouse: created-load-id, deleted-load-id and modified-load-id. The created-load-id represents the load-id in which a document was added to the database, the modified-load-id represents the load-id that last modified the document and the deleted-load-id represents the load-id in which the document was marked as deleted.

The table xml.t_patent_document_values houses the load-ids that affect the overall status of the entire document.

  • modified_load_id – the load-id that last modified the document, whether it be addition, correction or deletion of content
  • deleted_load_id – the load-id that the entire document was marked deleted
  • created_load_id – the load-id when the document was added to the data warehouse


Each satellite table also contains a modified_load_id column that is specific to the content in that table and will always be <= xml.t_patent_document_values(modified_load_id). The following table should clarify the sequence of update events and the effect on load-id values for three database tables.

date      load-source  Load-Id  table                           created_load_id  modified_load_id  deleted_load_id
20161004  US           10000    xml.t_patent_document_values    10000            10000             NULL
                                xml.t_ifi_integrated_content *  n/a              n/a               n/a
                                xml.t_parties                   n/a              10000             n/a
20161006  IFI-CONT     10001    xml.t_patent_document_values    10000            10001             NULL
                                xml.t_ifi_integrated_content    n/a              10001             n/a
                                xml.t_parties **                n/a              10000             n/a
20161011  DOCDB        10002    xml.t_patent_document_values    10000            10002             NULL
                                xml.t_ifi_integrated_content    n/a              10001             n/a
                                xml.t_parties                   n/a              10002             n/a

*  assuming IFI integrated content is not delivered on the date of publication
** parties are not updated with IFI content loads

This granularity offers the ability to trace modifications to individual containers (tables) as well as the overall modification status of the document as a whole.
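
For example, a query along the following lines (a sketch; the ucid is just a sample) compares the document-level load-ids with the container-level modified_load_id of the parties table:

select d.ucid,
       d.created_load_id,
       d.modified_load_id as document_modified_load_id,
       p.modified_load_id as parties_modified_load_id
from xml.t_patent_document_values as d
  left join xml.t_parties as p
    on (p.publication_id = d.publication_id)
where d.ucid = 'US-5551212-A';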

Quick Test

How many US documents published on 20161004 have updated CPC classifications?

select count(*) as docs_with_modified_cpc
from xml.t_patent_document_values as a
  inner join xml.t_classifications_cpc as b
    on (a.publication_id=b.publication_id)
where a.country='US'
  and a.published='2016-10-04'
  and a.deleted_load_id is null -- disregard deleted documents
  and a.withdraw != true        -- disregard withdrawn documents
  /*
    If b.modified_load_id > a.created_load_id,
    the table has been touched and we can therefore
    assume CPC values have been modified
  */
  and b.modified_load_id > a.created_load_id;
 
 docs_with_modified_cpc 
------------------------
                   6145

Out of 6680 documents published on 20161004, 6145 had CPC modifications after publication.
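
For reference, the denominator (all US documents published on 20161004, with the same deleted/withdrawn filters applied) can be computed with a query like this:

select count(*) as total_docs
from xml.t_patent_document_values as a
where a.country='US'
  and a.published='2016-10-04'
  and a.deleted_load_id is null
  and a.withdraw != true;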


Reporting

Knowing which tables and containers have changed is important, but the when and why also matter. The information gathered during data warehouse updates is stored in reporting.t_client_load_process. For a detailed description of the contents of this table, see CLAIMS Direct Design Documentation. Of the many questions that can be posed, for the purposes of this blog we are only concerned with answering the following:

  • On any given date, how many documents were added to the data warehouse?
  • On any given date, how many documents were modified in the data warehouse?

The answers to these questions are important for a variety of reasons, most importantly for down-stream processing. In the case of the CLAIMS Direct primary data warehouses, we trigger indexing based on the answers. The first thing to look at is what happened on a certain date; for the sake of an example, we will use 20161004.

select load_source, count(*)
from reporting.t_client_load_process
where date_trunc('day', completed_stamp) = '2016-10-04'::date
group by load_source
order by load_source;


 load_source  | count 
--------------+-------
 DOCDB        |     7
 IFI-CLAIMINF |     1
 IFI-Cont     |    15
 IPCR         |    47
 US           |     1
 USPTO-OG     |     1
 USREAS       |     1

How many documents were added to the data warehouse?

select count(*) as new_docs
from xml.t_patent_document_values
where created_load_id in (
  select load_id
  from reporting.t_client_load_process
  where date_trunc('day', completed_stamp) = '2016-10-04'::date
);

 new_docs 
----------
     6680

... and how many documents were updated?

select count(*) as updated_docs
from xml.t_patent_document_values
where created_load_id < modified_load_id
    and modified_load_id in (
      select load_id
      from reporting.t_client_load_process
      where date_trunc('day', completed_stamp) = '2016-10-04'::date
    );

 updated_docs 
----------
   362388

Overview

The CLAIMS Direct Web Services (CDWS) offer a variety of entry points into both the data warehouse and Solr index. These are mid-to-high-level entry points and can satisfy most requirements pertaining to searching and extracting data for a typical search/view application. There are, however, corner cases which may require more intricate extraction of particular information. On the other hand, there may also be situations where massive amounts of data need to be extracted for further, down-stream processing. The following discussion will touch on solutions for both of these cases.

Basic Data Warehouse Table Structure

The basic structure of the PostgreSQL database housing the XML content is best described as a collection of tables containing the raw XML of every patent document available in CLAIMS Direct. Each table is a section of the XML document, referred to as a container, and each table is named according to the XML container data it contains, e.g.,

Table                        Container              Description
xml.t_publication_reference  publication-reference  Document publication information
xml.t_invention_title        invention-title        Document title(s)
xml.t_claims                 claims                 Document claims

The structure of each table is identical with the exception of xml.t_patent_document_values which functions as a meta table.

Column            Type
<container>_id    serial primary key
publication_id    integer not null
modified_load_id  integer not null
status            character(1) (v=valid XML, i=invalid XML)
content           XML


The XML is always in the content (PostgreSQL type: XML) column.

select content
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
--------
<publication-reference fvid="71329085" ucid="US-5551212-A">
   <document-id>
    <country>US</country>
    <doc-number>5551212</doc-number>
    <kind>A</kind>
    <date>19960903</date>
    <lang>EN</lang>
  </document-id>
</publication-reference>

The invaluable utility function xml.f_ucid2id takes a ucid and returns a publication_id.
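
Used on its own, it simply resolves a ucid to the integer key, e.g.:

select xml.f_ucid2id('US-5551212-A') as publication_id;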

Extracting Pieces of the XML Content

As mentioned above, all content columns are of type XML and therefore the internal PostgreSQL xpath functionality can be used. There are a variety of utility functions provided by CLAIMS Direct that mimic DOM functionality, e.g.

Function
Example

string

xml.f_textContent(xpath::text,xml)

Extract text of node:

select xml.f_textContent('./document-id/lang', content)
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
----
EN

string

xml.f_getAttribute(att::text,xml)

Extract single attribute value:

select xml.f_getAttribute('ucid', content)
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
----
US-5551212-A

boolean

xml.f_nodeExists(xpath::text,xml)

Test presence of a particular node:

select xml.f_nodeExists('//date', content)
from xml.t_publication_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
----
t

XML

xml.f_findNode(xpath::text,xml)

Select a node (type XML) satisfying the given xpath expression:

select xml.f_findNode('./main-classification', content)
from xml.t_classification_national
where publication_id=xml.f_ucid2id('US-5551212-A');
----
 <main-classification mxw-id="PCL893513162" load-source="mcf">053397</main-classification>

XML[]

xml.f_findNodes(xpath::text,xml)

Select an array of nodes satisfying the given xpath expression:

select xml.f_findNodes('./classification-ipcr', content)
from xml.t_classifications_ipcr
where publication_id=xml.f_ucid2id('US-5551212-A');
----
{
  "<classification-ipcr ...",
  "<classification-ipcr ...",
  "<classification-ipcr ..."
}

XML[]

xml.f_getElementsByTagName(name::text,xml)

Select an array of nodes based on element name:

select xml.f_getElementsByTagName('classification-ipcr', content)
from xml.t_classifications_ipcr
where publication_id=xml.f_ucid2id('US-5551212-A');
----
{
  "<classification-ipcr ...",
  "<classification-ipcr ...",
  "<classification-ipcr ..."
}

string

xml.f_nodeName(xml)

Select the name of the content node:

select xml.f_nodeName( content )
from xml.t_invention_title
where publication_id=xml.f_ucid2id('US-5551212-A');
-----------------
 invention-title

XML[]

xml.f_childNodes(xml)

Select all child nodes:

select xml.f_childNodes( content )
from xml.t_application_reference
where publication_id=xml.f_ucid2id('US-5551212-A');
-----------------------------------------------------------------------------------
 <document-id mxw-id="PAPP60364944" load-source="docdb" format="epo">
   <country>US</country>
   <doc-number>2582893</doc-number>
   <kind>A</kind>
   <date>19930303</date>
   <lang>EN</lang>
 </document-id>
 <document-id mxw-id="PAPP84571955" load-source="patent-office" format="original">
   <country>US</country>
   <doc-number>08025828</doc-number>
   <date>19930303</date>
   <lang>EN</lang>
 </document-id>

string

xml.f_node2string(xml,indent)

Concatenate all child text nodes:

select xml.f_node2string( content::text, 1 )
from xml.t_abstract
where publication_id=xml.f_ucid2id('US-5551212-A');
-----------------------------------------------------------------------------------
 A package is formed by holding a nonextensible strip taut and wrapping it around the upper and/or lower periphery of the bundle and joining the ends of the strip segments e.g. by adhesive bonding. A projecting margin of the strip is folded to overlie the bundle or underlie the bundle so that an L-configuration is imparted to the frame formed around the bundle after the strip is applied thereto.


Use Case: Creating relational view of XML content

Although the underlying architecture of the CLAIMS Direct PostgreSQL database is a data warehouse, creating normalized relational views into the XML content is easily achievable using the functions described above. Our example will be to create a relational view of all patent citations. This view will let us SELECT based on criteria as well as GROUP BY and ORDER BY. The first step is to define all the properties of a citation:

  • ref-ucid (cited document)
  • ref-publication-id
  • publication-date
  • country
  • kind
  • load-source
  • format
  • source (examiner or applicant)

Now we can create a function that returns a TABLE of these properties. We'll use publication_id as input to the function and create it in a private schema (mySchema).

create or replace function mySchema.f_rdb_citations( integer )
  returns table (
    publication_id integer,
    ref_publication_id integer,
    ref_ucid varchar(64),
    published date,
    country varchar(2),
    kind varchar(2),
    load_source varchar(32),
    format varchar(32),
    source varchar(32)
  ) as
$BODY$
declare v_content xml;
        v_node xml;
        v_doc_node xml;
begin
 
-- example usage:
-- select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') );
  select x.publication_id, x.content into publication_id, v_content
    from xml.t_citations as x where x.publication_id=$1;
  if not found then
    return;
  end if;
  
  -- loop through all patent citations
  foreach v_node in array
    ( select xml.f_findNodes('//patcit', v_content) ) loop
    select xml.f_getAttribute( 'ucid', v_node ) into ref_ucid;
 
    select x.publication_id into ref_publication_id
      from xml.t_patent_document_values as x
      where ucid=xml.f_getAttribute( 'ucid', v_node );
 
 
    select xml.f_getAttribute( 'load-source', v_node ) into load_source;
    select xml.f_getAttribute('name', xml.f_findNode('./sources/source', v_node ) ) into source;
 
    -- loop through all forms of the document-id(s)
    foreach v_doc_node in array
      ( select xml.f_findNodes('./document-id', v_node) ) loop
      select xml.f_getAttribute( 'format', v_doc_node ) into format;
      select xml.f_textContent( './country', v_doc_node ) into country;
      select xml.f_textContent( './kind', v_doc_node ) into kind;
      select xml.f_textContent( './date', v_doc_node ) into published;
      return next;
    end loop;
  end loop;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100
  ROWS 1000;

Now, we test the results:

select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') ) limit 5;
 publication_id | ref_publication_id |     ref_ucid      | published  | country | kind | load_source | format | source
----------------+--------------------+-------------------+------------+---------+------+-------------+--------+--------
      160817649 |            5778519 | CA-2580978-A1     | 2006-03-30 | CA      | A1   | docdb       | epo    | APP
      160817649 |          127402642 | CN-201918032-U    | 2011-08-03 | CN      | U    | docdb       | epo    | APP
      160817649 |           75847915 | US-20020150866-A1 | 2002-10-17 | US      | A1   | docdb       | epo    | APP
      160817649 |           75947470 | US-20030060942-A1 | 2003-03-27 | US      | A1   | docdb       | epo    | APP
      160817649 |           76046244 | US-20030154010-A1 | 2003-08-14 | US      | A1   | docdb       | epo    | APP

Of course, GROUP BY and ORDER BY are available on the returned columns. In addition to SELECTing citations for a single document, we can also JOIN to table expressions, e.g., assume we are interested in the top 5 cited countries for US grants published on 2016-05-03:

select count(*) as n, t2.country as ref_country
from xml.t_patent_document_values t1
  inner join mySchema.f_rdb_citations (t1.publication_id) as t2
    on (t1.publication_id=t2.publication_id)
where t1.country='US'
  and t1.published='2016-05-03'
group by t2.country
order by n desc
limit 5;
   n    | ref_country
--------+---------
 221800 | US
  14285 | WO
  13313 | JP
   7115 | EP
   2554 | CN

Paging through results is cumbersome and inefficient. In this next segment, I'd like to talk about simple and complex sorting. Sorting, used effectively with the rows parameter, can push relevant documents into the first page of results. Generally, you can sort on any indexed field, but you can also utilize query boosting and functions to influence the sort order.

Missing Value Behavior

CLAIMS Direct is configured to sort documents with missing values for the sort field to the top when the direction is asc and to the bottom when the direction is desc.

Static Field Sorts

The most common field to sort on will undoubtedly be the date of publication, pd desc. This pushes the newly published documents to the top of the result set. CLAIMS Direct uses pd desc as default sort order instead of score. If two documents share the same publication date (or any sort field value), the tie is broken by the internal Lucene document IDs ascending.

/search/query?q=ab_en:sonar&rows=5&fl=ucid,[docid],pd
         "docs" : [
            {
               "[docid]" : 5536680,
               "pd" : 20160225,
               "ucid" : "US-20160054444-A1"
            },
            {
               "[docid]" : 1274637,
               "pd" : 20160224,
               "ucid" : "EP-2986998-A1"
            },
            {
               "[docid]" : 4335577,
               "pd" : 20160223,
               "ucid" : "US-9268020-B2"
            },
            {
               "[docid]" : 4986794,
               "pd" : 20160218,
               "ucid" : "US-20160049143-A1"
            },
            {
               "[docid]" : 5088703,
               "pd" : 20160218,
               "ucid" : "US-20160047906-A1"
            }
         ]

Multiple sort criteria are also possible and used mostly to stabilize the initial sort results. The syntax for the sort parameter is a comma-separated list of field/direction pairs, e.g., pd desc,ad asc. Field and direction are separated by a space.

/search/query?q=ab_en:sonar&rows=5&fl=ucid,pd&sort=pd desc,ucid desc

Dynamic Field Sort

As mentioned, CLAIMS Direct uses a default sort of pd desc. You are, of course, free to use the dynamic score field to mimic the default Solr distribution behavior.

/search/query?q=ab_en:sonar&rows=5&fl=ucid,score,pd&sort=score desc
         "docs" : [
            {
               "pd" : 19960322,
               "ucid" : "FR-2724734-A1",
               "score" : 3.781385
            },
            {
               "pd" : 20120523,
               "ucid" : "EP-2454606-A1",
               "score" : 3.3196893
            },
            {
               "pd" : 20151229,
               "ucid" : "US-9223022-B2",
               "score" : 3.3196893
            },
            {
               "pd" : 20150910,
               "ucid" : "US-20150253425-A1",
               "score" : 3.3196893
            },
            {
               "pd" : 20150521,
               "ucid" : "AU-2010273841-B2",
               "score" : 3.2852356
            }
         ]

Random Sort (for Random Results)

CLAIMS Direct uses a utility field to allow random sorting. The field is input as rnd_n where n is any random integer, e.g., rnd_1234. Specifying the same value for n will yield the same sort results so the randomness is only as random as the input integer. Returning random results (sorting on rnd_n) is an efficient way to sample a wide variety of data.

/search/query?q=ab_en:sonar&rows=3&fl=ucid,score,pd&sort=rnd_1234 desc
/search/query?q=ab_en:sonar&rows=3&fl=ucid,score,pd&sort=rnd_12345 desc
         "docs" : [
            {
               "pd" : 20150902,
               "ucid" : "CN-204605667-U",
               "score" : 1.6596402
            },
            {
               "pd" : 20110224,
               "ucid" : "WO-2010125029-A4",
               "score" : 1.1842703
            },
            {
               "pd" : 19910719,
               "ucid" : "FR-2657063-A1",
               "score" : 1.760531
            }
         ]
//
         "docs" : [
            {
               "pd" : 19720119,
               "ucid" : "GB-1260387-A",
               "score" : 0.4986458
            },
            {
               "pd" : 20160114,
               "ucid" : "US-20160011310-A1",
               "score" : 2.3957791
            },
            {
               "pd" : 20130425,
               "ucid" : "WO-2013056893-A1",
               "score" : 0.9474162
            }
         ],

Sorting on Functions

A previous segment discussed using functions as pseudo fields to return in the result set. You can also use the results of a function as sort criteria. An interesting use-case for sorting on a calculated value is measuring the latency between application filing date and grant date. You can calculate this value by subtracting ad_d from pd_d (tdate field types of filing and publication dates). The following examples use field aliasing with the fl parameter as well as sorting by function to return documents with low latency between filing and grant (asc) and high latency (desc) displayed in days (86400000 milliseconds in a day).

/search/query?rows=5&q=ab_en:sonar +ifi_publication_type:G&fl=ucid,ad,pd,days_to_grant:div(sub(pd_d,ad_d),86400000)&sort=sub(pd_d,ad_d) asc
/search/query?rows=5&q=ab_en:sonar +ifi_publication_type:G&fl=ucid,ad,pd,days_to_grant:div(sub(pd_d,ad_d),86400000)&sort=sub(pd_d,ad_d) desc
         "docs" : [
            {
               "pd" : 20131205,
               "ad" : 20130118,
               "ucid" : "DE-102013000802-B3",
               "days_to_grant" : 321.0005
            },
            {
               "pd" : 20160126,
               "ad" : 20150304,
               "ucid" : "US-9244168-B2",
               "days_to_grant" : 328.0001
            },
            {
               "pd" : 20140904,
               "ad" : 20130904,
               "ucid" : "DE-102013014640-B3",
               "days_to_grant" : 365.00064
            },
            {
               "pd" : 20160216,
               "ad" : 20141208,
               "ucid" : "US-9261391-B2",
               "days_to_grant" : 434.99976
            },
            {
               "pd" : 19931007,
               "ad" : 19920421,
               "ucid" : "DE-4213121-C1",
               "days_to_grant" : 534.00006
            }
         ]
//
         "docs" : [
            {
               "pd" : 19890719,
               "ad" : 19730214,
               "ucid" : "GB-1605319-A",
               "days_to_grant" : 5999
            },
            {
               "pd" : 19781220,
               "ad" : 19640511,
               "ucid" : "GB-1536653-A",
               "days_to_grant" : 5336.0005
            },
            {
               "pd" : 20160209,
               "ad" : 20100426,
               "ucid" : "US-9255982-B2",
               "days_to_grant" : 2115
            },
            {
               "pd" : 20160223,
               "ad" : 20120210,
               "ucid" : "US-9268020-B2",
               "days_to_grant" : 1473.9987
            },
            {
               "pd" : 20150217,
               "ad" : 20110322,
               "ucid" : "US-RE45379-E1",
               "days_to_grant" : 1428.0006
            }
         ]

In this first of a series of blogs about Solr result sets, I'd like to talk about returned fields, both static and dynamic.

Stored Fields

Any field that is stored in Solr can be returned in the result set. The following table shows all available stored fields:

Name                  Description                                        Type                                            Indexed
ab_*                  all abstracts                                      mixed: alexandria_text, alexandria_asian_text   true
ad                    filing date                                        tint                                            true
an                    application number (tokenized)                     alexandria_token                                true
anorig                original, patent office format application number  alexandria_token                                true
anseries              US application series code                         alexandria_string                               true
anucid                standardized filing identifier                     string                                          true
ecla                  ECLA classification                                alexandria_string                               true
fam                   family identifier                                  string                                          true
fterm                 F-Terms                                            alexandria_string                               true
icl,cpcl,uscl,ficl    classifications suitable for display               string                                          false
ifi_name_current_id   current assignee identifier                        string                                          true
ifi_name_original_id  original assignee identifier                       string                                          true
ifi_pa                IFI assignees                                      alexandria_token                                true
inv                   inventors                                          alexandria_token                                true
loadid                meta load identifier                               tint                                            true
nclms                 number of claims                                   tint                                            true
nindepclms            number of independent claims                       tint                                            true
pa                    applicants/assignees (all formats)                 alexandria_token                                true
pd                    publication date                                   tint                                            true
pn                    patent number (tokenized)                          alexandria_token                                true
pri                   priority filing number (tokenized)                 alexandria_string                               true
prid                  earliest priority filing date                      tint                                            true
priorig               original, patent office format priority number     alexandria_string                               true
timestamp             last modification stamp                            tdate                                           true
ttl_*                 all titles                                         mixed: alexandria_text, alexandria_asian_text   true
ucid                  unique character identifier                        string                                          true


The shorthand to return all static fields is the asterisk: fl=*

/search/query?q=ucid:US-20160054444-A1&fl=*&rows=1

As with all examples, we discuss results in JSON format.

         "docs" : [
            {
               "timestamp" : "2016-02-27T08:29:31.936Z",
               "ucid" : "US-20160054444-A1",
               "loadid" : 229946,
               "nclms" : 19,
               "nindepclms" : 2,
               "fam" : "-1",
               "pn" : "US-20160054444-A1 US20160054444A1",
               "pd" : 20160225,
               "an" : "US-201514833357-A"
               "anseries" : "14",
               "anorig" : "US-14833357",
               "ad" : 20150824,
               "pri" : [
                  "FR-1457951"
               ],
               "prid" : 20140825,
               "priorig" : [
                  "FR-1457951"
               ],
               "cpcl" : [
                  "G01S  15/588       20130101 LI20160225BHUS        ",
                  "G01S  15/60        20130101 FI20160225BHUS        "
               ],
               "icl" : [
                  "G01S  15/58        20060101ALI20160225BHUS        ",
                  "G01S  15/60        20060101AFI20160225BHUS        "
               ],
               "ifi_pa" : [
                  "ECA ROBOTICS",
                  "ECA ROBOTICS"
               ],
               "inv" : [
                  "Pinto, Marc"
               ],
               "pa" : [
                  "ECA ROBOTICS",
                  "ECA ROBOTICS"
               ],
               "ttl_en" : [
                  "METHOD AND SONAR DEVICE FOR DETERMINING THE SPEED OF MOVEMENT OF A NAVAL VEHICLE IN RELATION TO THE SEA BED"
               ],
               "ab_en" : [
                  "<abstract mxw-id=\"PA168904151\" lang=\"EN\" load-source=\"patent-office\">\n    <p id=\"p-0001\" num=\"0000\">Sonar intended to be carried by a naval vehicle including at least one element for transmitting an acoustic signal, at least one element for receiving the acoustic signal transmitted and reflected on the sea bed and at least two phase centres (PC<sub>1</sub>, PC<sub>2</sub>) that are disposed along a first and a second axis (v<sub>1</sub>, v<sub>2</sub>), respectively, forming an interferometric antenna. The sonar includes elements for determining the speed of movement of the vehicle as a function of the computed value of the relative trim angle (β) formed between a straight line (d<sub>1</sub>) that is perpendicular to the axes (v<sub>1</sub>, v<sub>2</sub>) of the phase centres and a straight line (d<sub>2</sub>) that is perpendicular to the sea bed (F) and of the value determined for the angle of sight.</p>\n  </abstract>"
               ]
            }
         ],

Some things to notice: multi-valued fields are returned as JSON arrays. Also, as inv, pa, and ifi_pa are copyFields containing multiple values, duplicate entries may appear.

Dynamic Fields

Dynamic fields are fields that are generated either internally, like the relevance score, or by computing values using function queries. Dynamic fields must be listed explicitly in the fl parameter, e.g., fl=score. Below, we request the static ucid field together with the relevance score for 5 results.

/search/query?q=ab_en:sonar&fl=ucid,score&rows=5
         "docs" : [
            {
               "ucid" : "US-20160054444-A1",
               "score" : 1.1746656
            },
            {
               "ucid" : "US-9268020-B2",
               "score" : 2.8773313
            },
            {
               "ucid" : "US-20160047906-A1",
               "score" : 1.1735018
            },
            {
               "ucid" : "US-20160049143-A1",
               "score" : 0.99574935
            },
            {
               "ucid" : "US-20160047891-A1",
               "score" : 1.1675186
            }
         ],

Another type of dynamic field makes use of function queries. A simple, yet contrived, example would be to return the number of dependent claims. CLAIMS Direct stores the total number of claims in nclms as well as the number of independent claims in nindepclms, both as tint. Simple subtraction yields the desired result.

/search/query?q=ab_en:sonar&fl=ucid,nclms,nindepclms,sub(nclms,nindepclms)&rows=2
         "docs" : [
            {
               "ucid" : "US-20160054444-A1",
               "nclms" : 19,
               "nindepclms" : 2,
               "sub(nclms,nindepclms)" : 17
            },
            {
               "ucid" : "US-9268020-B2",
               "nclms" : 74,
               "nindepclms" : 3,
               "sub(nclms,nindepclms)" : 71
            }
         ],

Field Aliases

Field aliasing allows individual fields in the result set to be renamed. This functionality is particularly useful for giving meaningful names to pseudo fields created by function queries, as in the example above. Any field can be renamed using an alias. The syntax is: alias-field-name:original-field-name

Note

ucid is the only exception. When trying to alias the ucid, an HTTP 500 will be returned.

/search/query?q=ab_en:sonar&fl=ucid,total:nclms,independent:nindepclms,dependent:sub(nclms,nindepclms)&rows=1
         "docs" : [
            {
               "ucid" : "US-20160054444-A1",
               "total" : 19,
               "independent" : 2,
               "dependent" : 17
            }
         ],