What is the Unitary Patent (UP) and Unified Patent Court (UPC)?
A European patent with unitary effect or Unitary Patent (UP) is based on a European patent granted by the EPO under the rules and procedures of the European Patent Convention (EPC). The Unitary Patent is a single right that covers all participating UP Member States, which are a subset of the member states of the EPO. Only a single annual renewal fee is required to cover the included states.
The goal of the UP is to reduce the cost and complexity of validating a European patent in the member countries. The UP achieves this through a single renewal fee payment, reduced translation costs, and the removal of the need for representation at the national level. While the single annual renewal is a benefit for filers, there is some risk: a single court decision, e.g., on infringement and/or validity, applies across all participating UP Member States rather than in just one country.
The Unified Patent Court (UPC) was created by international treaty to handle invalidity and infringement actions related to UP and EP patents. Patentees are therefore able to bring a single infringement action to the UPC for pan-European infringements, while third parties are able to seek centralized revocation of the UP. The existing EP opposition procedures will also affect the validity of Unitary Patents.
You can find more information about Unitary Patents and the Unified Patent Court on the EPO website: https://www.epo.org/applying/european/unitary.html.
Identifying Patents Which Have Requested Unitary Protection
For EP grants where the UP is registered, a C0 marker record is created. The C0 record is not a published document, but rather a file container for UP data which contains information such as the list of countries covered, the date of registration, and legal status events. We typically receive C0 markers from DOCDB on Wednesdays. These can be identified by searching for documents where the country code is EP and the kind code is C0. They can also be found using the application number for the document of interest.
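A C0 marker can also be recognized programmatically from the ucid alone. The sketch below is illustrative (the parse_ucid and is_up_marker helpers are hypothetical, not part of CLAIMS Direct):

```python
def parse_ucid(ucid):
    """Split a CLAIMS Direct ucid, e.g. 'EP-3135684-C0', into its parts."""
    country, doc_number, kind = ucid.split("-", 2)
    return {"country": country, "doc-number": doc_number, "kind": kind}

def is_up_marker(ucid):
    """True if the ucid identifies an EP C0 Unitary Patent marker record."""
    parts = parse_ucid(ucid)
    return parts["country"] == "EP" and parts["kind"] == "C0"

print(is_up_marker("EP-3135684-C0"))  # True
print(is_up_marker("EP-3135684-B1"))  # False
```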
Here is an example of the first lines of XML in a C0 marker record:
<patent-document file-reference-id="752534" mxw-id="221216033" ucid="EP-3135684-C0" lang="EN" country="EP" doc-number="3135684" kind="C0" date="20230628" date-produced="20230812" family-id="45004197" status="corrected">
Identifying Where the UP is in Effect
To identify where the Unitary Patent is in effect, look within the international-convention-data/designated-states section of the C0 marker XML. The ep-up-participating-states element lists the countries that are part of the UP at the time of registration. The UP annual renewal fee therefore applies to these countries.
<ep-up-participating-states>
  <country mxw-id="DS1187928874" load-source="docdb">AT</country>
  <country mxw-id="DS1187928876" load-source="docdb">BE</country>
  <country mxw-id="DS1187928878" load-source="docdb">BG</country>
  <country mxw-id="DS1187928883" load-source="docdb">DE</country>
  <country mxw-id="DS1187928885" load-source="docdb">DK</country>
  <country mxw-id="DS1187928887" load-source="docdb">EE</country>
  <country mxw-id="DS1187928890" load-source="docdb">FI</country>
  <country mxw-id="DS1187928892" load-source="docdb">FR</country>
  <country mxw-id="DS1187928900" load-source="docdb">IT</country>
  <country mxw-id="DS1187928903" load-source="docdb">LT</country>
  <country mxw-id="DS1187928905" load-source="docdb">LU</country>
  <country mxw-id="DS1187928907" load-source="docdb">LV</country>
  <country mxw-id="DS1187928911" load-source="docdb">MT</country>
  <country mxw-id="DS1187928913" load-source="docdb">NL</country>
  <country mxw-id="DS1187928917" load-source="docdb">PT</country>
  <country mxw-id="DS1187928921" load-source="docdb">SE</country>
  <country mxw-id="DS1187928923" load-source="docdb">SI</country>
</ep-up-participating-states>
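The participating states can be read directly from this container with Python's standard library; the fragment below is abbreviated from the example above:

```python
import xml.etree.ElementTree as ET

# Abbreviated ep-up-participating-states fragment from a C0 marker record
xml_fragment = """
<ep-up-participating-states>
  <country mxw-id="DS1187928874" load-source="docdb">AT</country>
  <country mxw-id="DS1187928876" load-source="docdb">BE</country>
  <country mxw-id="DS1187928883" load-source="docdb">DE</country>
  <country mxw-id="DS1187928892" load-source="docdb">FR</country>
</ep-up-participating-states>
"""

root = ET.fromstring(xml_fragment)
states = [c.text for c in root.findall("country")]
print(states)  # ['AT', 'BE', 'DE', 'FR']
```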
Legal Events Related to Unitary Patents
Legal event codes related to the UP are in the legal status section of the C0 marker record. They can also be found in the legal status section of any related application and/or granted record having the same application number. At this time, the expected time frame for delivery of legal event codes is not consistent. The two event codes currently appearing most frequently related to the UP in the database are U01 and U07. As other new UP-related legal event codes are delivered by the EPO, they will be added to the legal status section.
U01: The event indicating that a request for Unitary Effect has been filed.
<legal-status load-source="inpadoc">
  <legal-event country="EP" impact="+" code="U01" date="20230802" mxw-id="LS507355453" class="W" created-load-id="752825" modified-load-id="752825">
    <legal-event-body>
      <event-title>REQUEST FOR UNITARY EFFECT FILED</event-title>
      <event-attributes>
        <event-attribute>
          <event-attribute-label>Effective Date </event-attribute-label>
          <event-attribute-value>20230628</event-attribute-value>
        </event-attribute>
      </event-attributes>
    </legal-event-body>
  </legal-event>
U07: The event indicating that a Unitary Patent has been registered. This includes a list of the countries which the patent covers and the effective date of the grant.
<legal-event country="EP" code="U07" date="20230802" mxw-id="LS507355454" class="W" created-load-id="752825" modified-load-id="752825">
  <legal-event-body>
    <event-title>POSITIVE DECISION AFTER REQUEST FOR UNITARY EFFECT</event-title>
    <event-attributes>
      <event-attribute>
        <event-attribute-label>Designated State(s) </event-attribute-label>
        <event-attribute-value>AT BE BG DE DK EE FI FR IT LT LU LV MT NL PT SE SI</event-attribute-value>
        <event-attribute-label>Effective Date </event-attribute-label>
        <event-attribute-value>20230703</event-attribute-value>
      </event-attribute>
    </event-attributes>
  </legal-event-body>
</legal-event>
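Note that the attribute labels and values arrive as sibling pairs inside a single event-attribute element. A minimal sketch for reading them with Python's standard library, using an abbreviated version of the U07 event above:

```python
import xml.etree.ElementTree as ET

# Abbreviated U07 legal-event from the example above
event_xml = """
<legal-event country="EP" code="U07" date="20230802">
  <legal-event-body>
    <event-title>POSITIVE DECISION AFTER REQUEST FOR UNITARY EFFECT</event-title>
    <event-attributes>
      <event-attribute>
        <event-attribute-label>Designated State(s) </event-attribute-label>
        <event-attribute-value>AT BE BG DE DK EE FI FR IT LT LU LV MT NL PT SE SI</event-attribute-value>
        <event-attribute-label>Effective Date </event-attribute-label>
        <event-attribute-value>20230703</event-attribute-value>
      </event-attribute>
    </event-attributes>
  </legal-event-body>
</legal-event>
"""

event = ET.fromstring(event_xml)
code = event.get("code")
# Labels and values appear as sibling pairs, so pair them up in document order
labels = [e.text.strip() for e in event.iter("event-attribute-label")]
values = [e.text for e in event.iter("event-attribute-value")]
attrs = dict(zip(labels, values))
print(code, attrs["Effective Date"])             # U07 20230703
print(attrs["Designated State(s)"].split()[:3])  # ['AT', 'BE', 'BG']
```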
The Unified Patent Court and Opting Out of the Unified Patent Court
The Unified Patent Court (UPC) has jurisdiction over the settlement of disputes for both European patents and Unitary Patents. This jurisdiction is however subject to exceptions: cases may still be brought before national courts, and patentees of European patents may opt out from the UPC's jurisdiction. If a patentee does not submit an opt-out request, they may be subject to invalidity and infringement proceedings under the jurisdiction of the UPC.
P01 is one of several event codes related to opting out of the UPC. These codes appear on the B document and on other documents with the same application number.
<legal-event country="EP" code="P01" date="20230614" mxw-id="LS500898319" class="W" created-load-id="739212" modified-load-id="739212">
  <legal-event-body>
    <event-title>OPT-OUT OF THE COMPETENCE OF THE UNIFIED PATENT COURT (UPC) REGISTERED</event-title>
    <event-attributes>
      <event-attribute>
        <event-attribute-label>Effective Date </event-attribute-label>
        <event-attribute-value>20230418</event-attribute-value>
      </event-attribute>
    </event-attributes>
  </legal-event-body>
</legal-event>
Additional Legal Status Codes Related to the UP and UPC
U01, U07, and P01 are only a few of the potential codes related to the UP and UPC. Many more codes have been reserved by the EPO for use related to the Unitary Patent and Unified Patent Court. The additional codes typically have either "UPC" or "Unitary Effect" in the description text. The most up-to-date list of legal event codes produced by the EPO can be found here: https://www.epo.org/searching-for-patents/data/coverage/weekly.html.
What's Next for Unitary Patents in CLAIMS Direct?
We have started to review the effect Unitary Patents will have on IFI Snapshots, including the IFI Patent Status and other fields derived by IFI. Snapshot calculations for both the B and C0 records are in the process of being reviewed and will be updated as soon as possible.
Introduction
The entities stream offers detailed content relating to IFI standardized names and corporate hierarchies (ultimate owners and subsidiaries). This stream allows users to quickly build full corporate portfolios, retrieve all the subsidiaries of a specific company, and gain statistical insights into a portfolio. The optional statistical analysis tables provide key metrics such as the number of active patent families, the number of granted patents, and more. Before beginning, it is important to review the design of the schema, tables, and columns as well as the built-in functions.
Prerequisites
Before executing any of the functions described below, it's important to stop all stream updates:
systemctl stop apgupd.service
Linking Entities to Publications
Although the entities stream is self-contained (it has no direct dependencies on other stream content), the optional table linking it to the xml schema, entities.t_publication_ifi_numbers, can only be exposed by building the links with the function entities.f_update_xml_info_bf().
To populate the optional content:
select entities.f_update_xml_info_bf();
This function builds the links between the xml.t_ifi_integrated_content data and the entities stream, covering publications, publication status, family, and family status information. These links will be maintained by apgupd with every entities stream update.
Generating and Maintaining Statistics
Linking entities to publications is an in-line process triggered by loads into the XML schema. The optional statistics in entities.t_entity_hierarchy_counts, on the other hand, are generated by an offline process that can be scheduled. As with populating entities.t_publication_ifi_numbers, an initial population of the table entities.t_entity_hierarchy_counts is done by executing a function:
select entities.f_recalculate_all_entity_hierarchy_counts(true)
With the true parameter, the table entities.t_entity_hierarchy_counts will be completely rebuilt; with false, only load-ids that haven't been processed will be refreshed. IFI CLAIMS Primary scheduling uses cron as shown below:
# Incremental updates: 3:30 Mon-Sat
30 3 * * 1-6 psql -U alexandria -h localhost -p 5432 -d alexandria -qt -c "select entities.f_recalculate_all_entity_hierarchy_counts()" -o /dev/null

# Full refresh: Sat 8:30
30 8 * * 6 psql -U alexandria -h localhost -p 5432 -d alexandria -qt -c 'select entities.f_recalculate_all_entity_hierarchy_counts(true)' -o /dev/null
Getting Started
To ease into the entities content, we'll go over some fundamental functionality:
- Retrieving an ultimate owner from an IFI standardized name
- Retrieving a list of subsidiaries from an IFI standardized name
- Retrieving ticker and exchange information for an IFI standardized name
- Expanding functionality
Retrieving an Ultimate Owner
In its simplest form, given any IFI standardized name that is a member of an entity hierarchy, its ultimate owner can be found by simply executing the following SQL query:
select * from entities.f_cdws_ultimate( 'Google LLC' );

-[ RECORD 1 ]-------+-----------------------------------------------------------
entity_id           | 3898
country             | US
name                | Alphabet Inc
name_type           | N
extract_status      | V
updated             | 2020-11-16
parent_relationship | O
address             | 1600 Amphitheatre Parkway, Mountain View, CA, 94043, US
attributes          | <attributes />
Retrieving a List of Subsidiaries
select name, country from entities.f_cdws_subsidiaries( 'Alphabet Inc' );

                    name                    | country
--------------------------------------------+---------
 Admob Inc                                  | US
 Adometry Inc                               | US
 Adscape Media Canada Inc                   | CA
 Adscape Media Inc                          | US
 Agnilux Inc                                | US
 Anvato Inc                                 | US
 Apigee Corp                                | US
 Apigee Technologies India Pvt Ltd          | IN
 -- List abbreviated
Retrieving Ticker and Exchange
select * from entities.f_attributes_ticker( 4366 );

 exchange | symbol
----------+--------
 XNYS     | IBM
Expanding Functionality
Of course, the above functions are rudimentary. One can easily build out the functionality for application-specific use cases. As a simple example, the following query retrieves an ultimate owner from a publication ucid (or publication id).
Entity status
When executing statements against the table entities.t_entity_hierarchy_info, it is important to filter out all but validated (V) and reviewed (R) entities using the status column.
with t as (
    select a.entity_name
      from entities.t_entity_hierarchy_info as a
      inner join entities.t_publication_ifi_numbers as b
        on ( a.ifi_number = b.ifi_number )
     where a.status in ('V', 'R')
       and b.publication_id = xml.f_ucid2id( 'EP-0700000-B1' )
)
select *
  from entities.f_cdws_ultimate( ( select x.entity_name from t as x ) );

-[ RECORD 1 ]-------+-----------------------------------------------------------
entity_id           | 4366
country             | US
name                | International Business Machines Corp
name_type           | N
extract_status      | V
updated             | 2021-01-04
parent_relationship | O
address             | Old Orchard Road, Armonk, NY, 10504, US
attributes          | <attributes />
Understanding Entity Hierarchy Statistical Information
Along with the entity information housed in entities.t_entity_hierarchy_info is the companion hierarchy statistics table entities.t_entity_hierarchy_counts. This table contains summary statistics for entire hierarchies (ultimate owner and all subsidiaries), including hierarchy family statistics. The following table lists the available statistical categories:
Category (col: count_category) | Description |
---|---|
CB | Total number of backward citations |
CF | Total number of forward citations |
FSA | Total families with Active status |
FSL | Total families with Lapsed status |
FSP | Total families with Pending status |
FT | Total number of families |
PLA | Total published applications |
PLG | Total published grants |
PLU | Total publications with unknown publication level |
PSA | Total publications with Active status |
PSL | Total publications with Lapsed status |
PSP | Total publications with Pending status |
PSU | Total publications with unknown status |
PT | Total publications |
ST | Total number of subsidiaries |
This information is wrapped into a view that expands the category with a category description:
select * from entities.v_hierarchy_family_statistics
 where entity_id = entities.f_name2id( 'Alphabet Inc' );

 entity_id | count_category |        count_category_desc        | count_value | modified_load_id
-----------+----------------+-----------------------------------+-------------+------------------
      3898 | CB             | total_backward_citations          |      420962 |           679395
      3898 | CF             | total_forward_citations           |      897547 |           679395
      3898 | FSA            | total_families_status_active      |       22935 |           680151
      3898 | FSL            | total_families_status_lapsed      |       11877 |           680151
      3898 | FSP            | total_families_status_pending     |        2673 |           680151
      3898 | FT             | total_families                    |       37485 |           680151
      3898 | PLA            | total_publevel_application        |       79625 |           680151
      3898 | PLG            | total_publevel_grant              |       66325 |           680151
      3898 | PLU            | total_publevel_unknown            |       38821 |           680151
      3898 | PSA            | total_publications_status_active  |       94507 |           680151
      3898 | PSL            | total_publications_status_lapsed  |       34353 |           680151
      3898 | PSP            | total_publications_status_pending |       17092 |           680151
      3898 | PSU            | total_publications_status_unknown |       38819 |           680151
      3898 | PT             | total_publications                |      184771 |           680151
      3898 | ST             | total_subsidiaries                |         186 |           680151
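In this example, the categories partition the totals: the publication status counts (PSA/PSL/PSP/PSU) and the publication level counts (PLA/PLG/PLU) each sum to PT, and the family status counts sum to FT. A quick consistency check in Python:

```python
# Counts taken from the Alphabet Inc example output above
counts = {
    "CB": 420962, "CF": 897547,
    "FSA": 22935, "FSL": 11877, "FSP": 2673, "FT": 37485,
    "PLA": 79625, "PLG": 66325, "PLU": 38821,
    "PSA": 94507, "PSL": 34353, "PSP": 17092, "PSU": 38819,
    "PT": 184771, "ST": 186,
}

# Publication total breaks down by status and by publication level
assert counts["PT"] == counts["PSA"] + counts["PSL"] + counts["PSP"] + counts["PSU"]
assert counts["PT"] == counts["PLA"] + counts["PLG"] + counts["PLU"]
# Family total breaks down by family status
assert counts["FT"] == counts["FSA"] + counts["FSL"] + counts["FSP"]
print("counts are internally consistent")
```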
CLAIMS Direct now supports recent PostgreSQL versions (e.g., 12, 13, 14). The Alexandria schema (alexandria-schema-tools) and the IFI CLAIMS Direct yum repository have been updated. They provide tools for quality control (cd-count.sh), bulk extraction (cd-extract.sh), and bulk loading (cd-load.sh).
Please note:
- Updates should be stopped (or paused) before proceeding. See apgupd for instructions.
- PostgreSQL must be configured to allow inbound and outbound communication via the pg_hba.conf, security group, and/or firewall.
Extraction
In order to migrate, the entire contents of the alexandria database must be extracted to disk.
Prerequisites
- A minimum of 2.0TB of free space on the extract destination drive. A full database footprint is not needed, as the extract files are compressed.
- Solid State Drives are strongly recommended for performance.
- Network connectivity between the two PostgreSQL databases is required. It is also possible to use a separate server either as an intermediary or to share the extract destination drive, e.g., via NFS between the current and new instance.
Process
Stop apgupd before proceeding (see apgupd for instructions).
Extract the data from all applicable schemas to the storage location.
The following script must be run on a drive with at least 2.0TB free.
First, the script:
cd-extract -h
Usage: cd-extract.sh OPTIONS

  Option          Description                  Default
  --------------------------------------------------------
  -H|--host       specify host                 localhost
  -P|--port       specify port                 5432
  -d|--database   specify database             alexandria
  -u|--user       specify database username    alexandria
  -h|--help       print this usage and exit

Examples:
  cd-extract -h 10.10.10.2 -p 5432 -d alexandria -u alexandria
  cd-extract -h localhost -d alexandria -u alexandria
By default, this script uses the alexandria database located on localhost and outputs the compressed table data to ./data. If storage is tight on the CLAIMS Direct instance, you can also use an intermediary server to extract and load, or use a storage server to share disk with both the old and new CLAIMS Direct instances via NFS.
For the sake of simplicity, we'll use the existing CLAIMS Direct instance (with the older version of PostgreSQL) for extraction and loading. If using a separate server, you will need to install PostgreSQL (the default version is sufficient: yum -y install postgresql), and replace localhost with the IP of the older CLAIMS Direct PostgreSQL instance.
# Create a location for the extract
mkdir EXTRACT && cd EXTRACT
cd-extract.sh -H localhost -d alexandria -u alexandria

# Please note, this process can take up to 3 days to complete.
# It is therefore recommended to detach the process using nohup:
nohup cd-extract.sh -H localhost -d alexandria -u alexandria &

# Progress will be output to nohup.out (standard out for terminal invocation), e.g.,
Checking ./data ... not present, creating ... ok
Extracting data ........ done
Loading
Prerequisites
- A new CLAIMS Direct instance. Please see the PostgreSQL Installation Instructions for step-by-step instructions.
- Access to the extract directory which contains the data directory of files created above. This could be via the older local CLAIMS Direct instance, an intermediate server, or a shared drive.
- Installation of supplementary streams, e.g., litigation, entities, etc. See installation instructions.
Process
Loading the data is accomplished in much the same way you loaded the original CLAIMS Direct instance: cd-load.sh (or load.sh for earlier installs).
# Move into the directory created above
cd EXTRACT

# Load the data
cd-load.sh -H IP -d alexandria -u alexandria

# As above, this is a multi-day process and
# it is therefore recommended to detach the process
nohup cd-load.sh -H [IP-of-CD-PostgreSQL-instance] -d alexandria -u alexandria &
IFI CLAIMS has been managing standardized company name variations for decades, allowing you to map patent publications to the correct original and current assignees. Our proprietary name management process cleans various types of entity names to address issues of misspelling, name changes, and other merger and acquisition activity. The result is the most reliable company name mapping available for IP analytics. Our Names Service takes this process even further, providing the Ultimate Owner for any IFI Name and allowing you to easily capture the full portfolio of patent assets owned by a company and its subsidiaries.
To make the best use of our IFI Names and Ultimate Owners, it is important to understand how we manage the data. This blog explains some of the rules we use to standardize names and assign Ultimate Owners.
What We Standardize
We standardize:
- applicant names
- assignee names in the parties container and assignment history
- names provided in legal events that are related to changes in ownership or name changes
We do not standardize:
- individuals
- agents
- correspondents
Our Process
We follow three main steps in our cleaning process: normalization, standardization, and harmonization.
The first step is to normalize the data or put it into a homogenized format. This includes:
- removing punctuation, parentheses, hyphens, and similar characters
- dealing with spelling variations (and/&, ä/ae) and misspellings
- putting legal forms into a consistent format such as Corp, SL, Ltd, PLC, AG etc.
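The normalization steps above can be sketched as a toy pass (illustrative only; the LEGAL_FORMS map and normalize helper are assumptions, and IFI's production rules are far more extensive):

```python
import re

# Illustrative legal-form mappings (a tiny subset, for demonstration only)
LEGAL_FORMS = {
    "CORPORATION": "Corp",
    "LIMITED": "Ltd",
    "AKTIENGESELLSCHAFT": "AG",
}

def normalize(name):
    n = name.upper()
    n = n.replace("&", " AND ")     # spelling variations: &/and
    n = n.replace("Ä", "AE")        # umlaut folding: ä/ae
    n = re.sub(r"[^\w\s]", " ", n)  # drop punctuation, parentheses, hyphens
    tokens = [LEGAL_FORMS.get(t, t.title()) for t in n.split()]
    return " ".join(tokens)

print(normalize("Badische Anilin- & Soda-Fabrik Aktiengesellschaft"))
# Badische Anilin And Soda Fabrik AG
```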
Next, different name variations are grouped under one official IFI Name. This is an extremely important step in the process as some of the variations we receive would make the record very difficult to find otherwise. For example, as of January 2022, we have 2,260 name variations for BASF SE, including:
- GASF AKTIENGESELLSCHAFT
- BEE AA ESU EFU AG
- BADIS CHE ANVIN & SODA FABRIK AKTIENGESELLSCHAFT
- ?SF A
Finally, harmonization involves matching the standardized IFI Name to the most current entity name, accounting for name changes. In the Names API, when an entity changes its name, the old name is added as a child with the parent_relationship flagged as "F" (former name). For example, in May 2020 United Technologies Corp changed its name to Raytheon Technologies. So for the Ultimate Owner (UO) Raytheon Technologies Corp, the Names API includes the child name United Technologies Corp (F), among others.
The system then recalculates the ifi_name_current for all associated records. For example, Orig3n Inc changed its name to Seaport Diagnostics Inc in April 2021. As of December 2021, there were no published reassignments of any of their patents. However, the new name was available as the ifi-standardized-name-current in CLAIMS Direct:
<ifi-standardized-name country="US" number="866109">
  <name>Orig3n Inc</name>
</ifi-standardized-name>
<ifi-standardized-name-current country="US" number="866108">
  <name>Seaport Diagnostics Inc</name>
</ifi-standardized-name-current>
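Both names can be pulled from this fragment with Python's standard library. The wrapping parties element below is added only so the two sibling elements parse as a single document (in the full CLAIMS Direct XML these names appear inside the parties container):

```python
import xml.etree.ElementTree as ET

# The two sibling elements from the example above, wrapped for parsing
snippet = """
<parties>
  <ifi-standardized-name country="US" number="866109">
    <name>Orig3n Inc</name>
  </ifi-standardized-name>
  <ifi-standardized-name-current country="US" number="866108">
    <name>Seaport Diagnostics Inc</name>
  </ifi-standardized-name-current>
</parties>
"""

root = ET.fromstring(snippet)
original = root.find("ifi-standardized-name/name").text
current = root.find("ifi-standardized-name-current/name").text
print(original, "->", current)  # Orig3n Inc -> Seaport Diagnostics Inc
```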
Language
To deal with names in different languages, we follow a common practice used by many patent offices and information providers.
For names in languages using Latin characters, we do not translate the names to English. Rather, we keep the original company names in their original language. For example:
- Bayerische Motoren Werke AG (German)
- Telmat Industrie SA (French)
- Universidad Politecnica de Madrid (Spanish)
For names in languages using non-Latin characters, such as Russian and Chinese, we transliterate the names to Latin characters; if the company has a known English name, we use the name provided by the applicant. For example:
Name in Chinese: 清锋(北京)科技有限公司
Transliteration: Qingfeng Beijing Technology Co Ltd
IFI Name: Luxcreo Beijing Inc
For Japanese names, we try to find the official English name, which is not always a translation of the Japanese name. For example, according to the company website of Daido Tokushuko Kabushiki Kaisha, the company "name shall be expressed in English as Daido Steel Co., Ltd."
In a transliterated Japanese name, Kabushiki Kaisha is abbreviated as KK. If the name is translated, then the legal form is represented as “Co Ltd.”
Romanized names are added as variations of the official English name.
State and Country Disambiguation
For purposes of disambiguation, we add the country to the end of the name if the same entity name exists in different countries:
- Siemens Industry Software GmbH Austria
- Siemens Industry Software GmbH Germany
If multiple companies with the same entity name exist in different states of the same country and the entities are owned by different parent companies, we add the state to the end of the name:
- Blackbird Technologies Inc California
- Blackbird Technologies Inc Virginia
Rules for Different Entity Types
Ultimate Owners
IFI's Ultimate Owners provide comprehensive coverage of the patent holdings of major public companies such as those listed on the following:
- FTSE 100
- S&P 500
- Russell 1000
- Forbes Global 2000
- Fortune 500
In general, our rules for determining Ultimate Owners are as follows:
- The IFI Ultimate Owner is the highest-level legal entity with an ownership interest of greater than 50% in the subsidiary. We define the ultimate parent as the highest-level entity that prepares consolidated financial statements, i.e., "financial statements of a group in which the assets, liabilities, equity, income, expenses, and cash flows of the parent and its subsidiaries are presented as those of a single economic entity," as defined by the International Financial Reporting Standards (IFRS) organization.
- We do not consider groups or conglomerates that do not exist as legal entities.
- IFI Ultimate Owners are operational entities, not investment companies. This means that we do not consider an investment company to be the ultimate parent of a private entity even if it owns more than 50% of the company shares.
As an exception, we do create Ultimate Owners that are investment companies if they own patents or if they are in the S&P lists. For example, Huntington Bancshares Inc. is considered an Ultimate Owner because many of its child companies own patents, including:
- Grayling State Bank
- Midwest Bank and Trust Co
- Security First Corp
- LeaseNet Group Inc
- Huntington National Bank
- NBC Corp
- Royal American Bank
- Sky Bank
- Citizens State Bank
- Empire National Bank of Traverse City
- Commerce Exchange Bank
Joint Ventures
Joint ventures join two or more different entities into a new entity for a specific objective in a specific country. Joint ventures are considered to be Ultimate Owners because none of the participants own more than 50% of the venture.
Holding Companies
A holding company is an umbrella company that owns one or several other companies. They are often set up to distribute liabilities amongst different companies. A holding company may be listed as an Ultimate Owner if it owns enough stock to have a controlling interest and if it produces a consolidated financial statement. For example, Berkshire Hathaway is considered an Ultimate Owner because it produces a consolidated financial statement.
Dual-Listed Companies
Dual-Listed Companies (DLC) form a corporate structure in which two legal entities function as a single operating business but retain separate stock exchange listings, usually in different countries. In CLAIMS Direct, the DLC is the Ultimate Owner.
For example, Rio Tinto is a leading international mining group headquartered in the UK, combining Rio Tinto PLC, a London and New York Stock Exchange listed company, and Rio Tinto Limited, which is listed on the Australian Securities Exchange. The two companies are joined in a DLC structure as a single economic entity called the Rio Tinto Group.
Rio Tinto PLC, GB - LSE
Rio Tinto Ltd, AU - ASX
Ultimate Owner: Rio Tinto Group, GB
Educational Institutions and University Systems
IFI uses this category for educational institutions, no matter the funding type. This includes schools, colleges, and universities, all of which may be Ultimate Owners.
As it is now common for educational applicants to file for patents under the name of their parent institution, we consolidate the various departments, campuses, and other divisions under the name of the parent institution, with the exception of institutions that are financially independent entities.
This may include:
- colleges of liberal arts
- professional schools
- bureaus
- offices
- foundations
- technology transfer entities
- research centers (located on or off the main campus)
- branch campuses
A university system is a group of institutions having a collective legal status and generally recognized by a state government, a board of education, or other relevant organization as a collective organization. IFI creates an Ultimate Owner for university systems.
For example, for the Ultimate Owner, State University System of Florida, child names include, among others:
- Florida Agricultural and Mechanical University FAMU
- Florida Atlantic University
- Florida Atlantic University Research Corp
- University of Florida Research Foundation Inc
Government-Owned Entities
This category includes state-owned companies as well as government departments and agencies. Some examples include:
- Agence Spatiale Europeenne (France)
- Beijing Automotive Group Co Ltd (China)
- Ministry of Science and Technology India (India)
- National Institute of Advanced Industrial Science and Technology AIST (Japan)
Ticker Symbols
In CLAIMS Direct, ticker symbols are available for all listed entities. You can use the Names service to retrieve the IFI name associated with a ticker symbol, then use that name to find all patent records associated with that entity in the IFI Snapshots.
We cover the most popular lists such as the S&P 100, S&P 500, Russell 1000, and more, and we add new Ultimate Owners from the Nasdaq, NYSE, and other sources daily. We have also begun to cover markets outside of the US such as the Australian Securities Exchange and the Japan Exchange Group. New sources are added on a continuous basis.
Rules for Ticker Symbols
Tickers are associated with the standard Market Identifier Code (MIC), specifically the operating MIC, which identifies an entity that operates an exchange, trading platform, etc. Some examples include:
Market Name | MIC | Operating MIC |
---|---|---|
CBOE BZX U.S. EQUITIES EXCHANGE | BATS | XCBO |
SHANGHAI STOCK EXCHANGE | XSSC | XSHG |
SHENZHEN STOCK EXCHANGE | XSEC | XSHE |
Tickers that consist of numbers are stored as text rather than as numeric values; leading zeros are not removed.
Entity Name | MIC | Ticker |
---|---|---|
Petrochina Co Ltd | XSHG | 601857 |
 | XHKG | 00857 |
If a ticker ends in a letter indicating the equities class, the letter is removed.
Entity Name | Ticker in External Sources | Ticker in IFI Products |
---|---|---|
Corus Entertainment Inc | CJR.B CJR/B CJR B CJRB | CJRB |
Tickers that contain other symbols are stored as they are.
Entity Name | Ticker in External Sources | Ticker in IFI Products |
---|---|---|
Mahindra and Mahindra Ltd | M&M | M&M |
Tickers for stocks listed as American Depositary Receipts (ADRs) are provided if available.
Entity Name | MIC | Ticker |
---|---|---|
Novo Nordisk AB | XCSE | NOVOB |
 | XNYS | NVO |
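Taken together, the examples above are consistent with collapsing separator characters while keeping leading zeros and symbols such as "&". The helper below illustrates that behavior; it is an assumption for demonstration, not IFI's actual normalization code:

```python
def normalize_ticker(raw):
    """Collapse separator characters; keep leading zeros and symbols like '&'."""
    return raw.replace(".", "").replace("/", "").replace(" ", "")

for raw in ["CJR.B", "CJR/B", "CJR B", "00857", "M&M"]:
    print(raw, "->", normalize_ticker(raw))
```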
Backward citations are a primary component of proving inventiveness in new patent applications. These citations reference previous work or prior art that is considered relevant to a current patent application. However, when looking for these citations in patent data, it is important to know that they are not always published in the applications. Here, we will look at how backward citations are handled during the application process and how the process can affect where those citations appear in the data.
As an example, at the time of filing a US patent application, an Information Disclosure Statement (IDS) form is submitted with the application. This includes any citations the inventors/applicants feel are relevant to demonstrating how their invention differs from the prior art. Since not everyone does a thorough search of the literature before filing, in the US (as well as in some other countries) the inventors/applicants must also submit any relevant prior art cited by other offices examining applications of the patent family members, within three months of its publication by the respective office.
Examiners generally do not rely solely on the prior art supplied by inventors/applicants, and citations may be appended by the examiner as well as by the inventor/applicant. If the applicant did not provide any citations in the application, it is the job of the examiner, who is an expert in the field, to locate documents that are most closely related to the new invention. Even if an applicant does provide citations, examiners conduct their own searches based on how they interpret the invention and cite the art they believe is relevant to novelty and inventiveness.
Examiners also consider the prior art provided in the related PCT International Search Report (ISR). In the ISR, the WIPO examiner designates each citation as an X, Y, or A document. “X” indicates that a citation is novelty-killing. “Y” indicates that it is relevant to inventiveness, and “A” indicates that it is general state of the art. And, of course, they do review the art cited by the applicant. There is also an option for third parties to submit prior art within a specific time frame (often between publication and the first office action). All of this information becomes part of the file wrapper/record.
In the US, the publication of applications occurs 18 months after the priority filing. Therefore, it is often the case that publication occurs before the examination begins and thus before the examiner has had the opportunity to add relevant citations. In these cases, there may be no citations (other than ISR references) at the time the application is published.
While examiners try to find all relevant art in their initial search, they will also redo their search prior to allowance to make sure they have found everything. In the US, Form 892 is the "List of references cited by examiner". The citations on that list are the ones included in the grant publication and subsequently provided in CLAIMS Direct's XML in the technical-data/citations container. Therefore, the best way to find all of the relevant backward citations for a patent is to look at the grant publication rather than the application.
To learn more about the citation data included in CLAIMS Direct's XML and how to search this data, see technical-data/citations and Citations.
Introduction
IFI recommends 6TB of disk space for an on-site PostgreSQL instance. Normally, this will accommodate approximately 3 years of database growth (depending on your subscription level). If you are running out of disk space more quickly than anticipated, the reason may be an increased number of deletes occurring during the update process. This is especially likely when IFI replaces content at a more aggressive rate than usual, as we have been doing since October 2019 in order to replace our old translations with higher-quality translations from Google. The blog post Understanding of Bloat and VACUUM in PostgreSQL does a good job of explaining why a database may grow disproportionately to the actual data flowing through the pipeline. The sections below will help you diagnose whether bloat is the cause of your problem and offer some potential solutions.
Checking for Bloat
The following SQL query will examine each table in the XML schema and identify dead rows (tuples) that are wasting disk space.
```sql
SELECT schemaname || '.' || relname AS tblnam,
       n_dead_tup,
       (n_dead_tup::float / n_live_tup::float) * 100 AS pfrag
  FROM pg_stat_user_tables
 WHERE schemaname = 'xml'
   AND n_dead_tup > 0
   AND n_live_tup > 0
 ORDER BY pfrag DESC;
```
If this query returns a high percentage (pfrag) of dead tuples, the VACUUM command may be used to reclaim space.
Using VACUUM to Reclaim Storage
1) VACUUM ANALYZE
Important
Pause apgupd before running this command.
The VACUUM ANALYZE command marks the dead tuples in all tables of the database as available for reuse, which should slow database growth considerably.
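For reference, the command can be run either database-wide or against a single table; a minimal sketch, issued from psql with apgupd paused as noted above:

```sql
-- Mark dead tuples in every table as reusable and refresh
-- planner statistics for the whole database:
VACUUM ANALYZE;

-- Or target one large table at a time, e.g., xml.t_keywords:
VACUUM ANALYZE xml.t_keywords;
```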
After you run this, you may want to adjust the aggressiveness of your autovacuum settings to avoid accumulating more bloat in the future. The following settings should be considered:
- autovacuum_max_workers = 5 – This is the recommended setting for a 16-core machine. If you have, for example, a 4-core machine, a setting of 2 would be the best available option.
- autovacuum_vacuum_scale_factor = 0.02 – This setting indicates the threshold that determines when autovacuum needs to run, per table. It refers to a fraction of the table size: if the percentage of dead tuples (the pfrag column in the SQL query above) exceeds 2%, an autovacuum process is spawned. The default value for this setting is 20%, which allows massive bloat to accumulate in your database before the autovacuum process even starts.
- autovacuum_analyze_scale_factor = 0.01 – Although not specific to table size and vacuum cleanup, this setting, which is also a percentage of fragmentation of the table, tells the autovacuum process to analyze the table (i.e., update query planner statistics) when the fragmentation percentage reaches 1% (the default is 10%).
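As a sketch, the three recommendations above would look like this in postgresql.conf (the values shown are the ones suggested here, not PostgreSQL's defaults); reload the configuration afterwards, e.g., with SELECT pg_reload_conf();

```
autovacuum_max_workers          = 5     # recommended for a 16-core machine
autovacuum_vacuum_scale_factor  = 0.02  # vacuum at 2% dead tuples (default 0.2)
autovacuum_analyze_scale_factor = 0.01  # analyze at 1% churn (default 0.1)
```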
2) VACUUM FULL <table>;
Important
Pause apgupd before running this command.
The FULL vacuum command physically rewrites the table, removing the dead tuples and reducing the size of the table, whereas without the FULL modifier, the dead tuples are only made available for reuse. This is a processor- and disk-intensive operation but, given appropriate planning, can reduce the size of the table by upwards of 25%. If you want to pursue this avenue, pick a highly fragmented table and test the amount of disk space recovered in order to assess whether it is worth the effort. For example, to test the process on xml.t_keywords:
1. Determine the current size of the table.
SELECT pg_size_pretty( pg_total_relation_size('xml.t_keywords') );
2. Defragment and reclaim all disk space.
VACUUM FULL xml.t_keywords;
3. Determine the new size of the table after vacuum.
SELECT pg_size_pretty( pg_total_relation_size('xml.t_keywords') );
Important
You will need at least twice the current table size available as free disk space in order to run the VACUUM FULL command because it rewrites the entire table. You can use both strategies above: VACUUM (to slow growth) and VACUUM FULL (to reduce the data footprint). However, it is recommended to complete VACUUM first if you want to pursue the more aggressive VACUUM FULL.
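Before committing to a VACUUM FULL, it can be worth comparing the candidate table's size against the database as a whole; a quick sketch using standard PostgreSQL size functions:

```sql
-- Total size (table + indexes + TOAST) of the candidate table:
SELECT pg_size_pretty( pg_total_relation_size('xml.t_keywords') );

-- Size of the entire database, for context:
SELECT pg_size_pretty( pg_database_size(current_database()) );
```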
Alternative Options
If you are really out of disk space, you may want to consider the following options:
1. Create a hot standby on a separate machine with +2TB disk, i.e., 8TB instead of 6TB. See Hot Standby for more information.
This is probably the easiest method and is least intrusive to normal workflow as nothing needs to be stopped or paused. Once the primary and standby instances are in sync, simply make the standby the primary.
2. If you can tolerate downtime, create a parallel machine as in #1 but use pg_dump to recreate the database. You will need to pause apgupd for the duration of the transfer:
pg_dump -Ualexandria -hOLD-IP [ other options ] | psql -Ualexandria -hNEW-IP postgres
Introduction
With the deployment of the cumulative patch alexandria-sql-patch-alpa-3636-20191101, on-site CLAIMS Direct installations can now use family and citation functionality in-house. What was previously available only through the remote CLAIMS Direct shared API is now possible internally with simple SQL functions. The following post outlines the steps required to prepare the data tables and presents a brief walk-through of the functionality.
Preparation
Note: If your initial load occurred after May 13, 2020, these tables were included in the initial load and you do not need to follow the preparation instructions here. However, to maintain the supplemental tables required for on-site family and citation functionality, you must add --api, e.g., CD_OTHER_OPTIONS='--api', to the apgupd configuration file.
Sequential Loading
In order to leverage the new functionality, the necessary lookup tables need to be populated on-site. The functions available to achieve this are:
- cdws.f_update_cited_documents – serving forward and backward citations
- cdws.f_update_priority_documents – serving simple and extended family
Each function takes a modified-load-id as its input parameter. Therefore, to populate each table, every modified-load-id from the xml.t_patent_document_values table needs to be processed. The following pseudo-SQL serves as an example.
```sql
DO $$
DECLARE
  v_load_id integer;
BEGIN
  FOR v_load_id IN
      SELECT modified_load_id
        FROM xml.t_patent_document_values
       GROUP BY modified_load_id
       ORDER BY modified_load_id
  LOOP
    PERFORM cdws.f_update_cited_documents( v_load_id );
    PERFORM cdws.f_update_priority_documents( v_load_id );
  END LOOP;
END;
$$;
```
Although the above SQL exemplifies the fundamental logic involved, it isn't the most efficient method. To that end, we are offering a supplemental patch that installs bulk loading functions to populate these tables: alexandria-sql-patch-alpa-3636-x-20191215. In addition to adding bulk loading functions, this patch expands some of the citation and family functionality not included in the original alexandria-sql-patch-alpa-3636-20191101 patch, namely:
- FUNCTION cdws.f_family_citations_backward(v_family_id integer) – returns backward citations for an entire family
- FUNCTION cdws.f_family_citations_forward(v_family_id integer) – returns forward citations for an entire family
- FUNCTION cdws.f_family_meta(integer) – returns bibliographic data for an entire family
For a full enhancement listing, please see the README included in the package.
Bulk Loading
After the supplemental patch is installed, two new functions for bulk loading are exposed:

- FUNCTION cdws.f_update_cited_documents_bf()
- FUNCTION cdws.f_update_priority_documents_bf()

These new functions bulk-load the auxiliary tables and are called simply as:

```sql
SELECT cdws.f_update_cited_documents_bf();
SELECT cdws.f_update_priority_documents_bf();
```
Each function requires at least 24 hours to complete and may take up to 7 days, depending on the environment. They can be run in parallel if desired. Because of the duration required to complete the functions, it's important that the SQL runs uninterrupted by server restarts or terminal disconnects.
Important
In order to expedite the completion of these functions, apgupd must be paused. Upon completion, running an ANALYZE on the auxiliary tables is optional but recommended:

```sql
ANALYZE VERBOSE cdws.t_cited_documents;
ANALYZE VERBOSE cdws.t_applications;
ANALYZE VERBOSE cdws.t_priority_documents;
```
Critical
In order to automatically populate these auxiliary tables with every new load-id, you must pass the --api flag to apgupd. If this parameter isn't available in your version of Alexandria-Client-Tools, you will need to update to the latest version. In versions 2.5.3 and above, this parameter is set in /etc/alexandria/apgupd.conf as a CD_OTHER_OPTIONS value. Please see the Client Tools documentation for configuration details.
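For illustration, the relevant line in /etc/alexandria/apgupd.conf would look something like the following (the surrounding contents of the file will vary by installation):

```
CD_OTHER_OPTIONS='--api'
```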
To confirm that the citation and family functionality has completed loading, locate the log files (citation.log and family.log). The last line of each log should show an entry in the following format:

```
INFO: added N new citations records from the work table (time)
INFO: added N new priority records from the work table (time)
```
To check whether the calculations are still running, use one of the two SQL queries below while in the database or using pgAdmin.
```sql
SELECT datname, query FROM pg_stat_activity;
SELECT * FROM pg_stat_activity;
```
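To narrow the output to just the bulk-load functions, a filtered variant of the first query can be used (a sketch, matching on the function names introduced above):

```sql
SELECT pid, state, now() - query_start AS runtime, query
  FROM pg_stat_activity
 WHERE query ILIKE '%f_update_cited_documents_bf%'
    OR query ILIKE '%f_update_priority_documents_bf%';
```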
Usage
Once the auxiliary table loading is complete, a wide variety of citation and family functionality is available directly from the on-site CLAIMS Direct instance.
Citations
Bringing the CLAIMS Direct API methods citations/forward and citations/backward on-site is achieved using the new functions:

- FUNCTION cdws.f_citations_backward
- FUNCTION cdws.f_citations_forward
Each function takes a publication_id as input and returns 2 columns: ucid and source.
```
SELECT * FROM cdws.f_citations_backward( xml.f_ucid2id('US-5551212-A') );

     ucid      | source
---------------+--------
 US-4905451-A  | SEA
 US-2043419-A  | SEA
 US-4631900-A  | SEA
 DE-8002686-U1 | SEA
 EP-0225665-A1 | SEA
 US-3187480-A  | SEA
 US-4828110-A  | SEA
 DE-3138439-A1 | SEA
 EP-0061805-A1 | SEA
 DE-3606826-A1 | SEA
 US-4546875-A  | SEA
 US-4534151-A  | SEA
 EP-0313721-A2 | SEA
```
Of course, GROUP BY and ORDER BY are available. For example, using substring, you can group forward citations by country:
```
SELECT substring(ucid, 1, 2) AS country, count(*) AS ncites
  FROM cdws.f_citations_forward( xml.f_ucid2id('US-5000000-A') )
 GROUP BY country
 ORDER BY ncites DESC;

 country | ncites
---------+--------
 US      |    264
 WO      |     31
 EP      |     27
 AU      |      1
 KR      |      1
```
Family
Bringing the CLAIMS Direct API methods family/simple and family/extended to the on-site install is accomplished with the functions:

- FUNCTION cdws.f_family_simple
- FUNCTION cdws.f_family_extended
Each of these functions, as with the citation functions above, takes a publication_id as input. The return columns are ucid and family_id. The extended function returns additional publication_id and level columns.
```
SELECT * FROM cdws.f_family_simple( xml.f_ucid2id('US-5000000-A') );

       ucid       | family_id
------------------+-----------
 AR-242634-A1     |  26932266
 CA-1335430-C     |  26932266
 EP-0431047-A1    |  26932266
 HU-T60328-A      |  26932266
 JP-H05502366-A   |  26932266
 KR-900702042-A   |  26932266
 WO-1990002193-A1 |  26932266

SELECT * FROM cdws.f_family_extended( xml.f_ucid2id('US-5551212-A') );

 publication_id |      ucid      | level
----------------+----------------+-------
       71329085 | US-5551212-A   |     0
         302369 | AT-96388-T     |     1
        1220699 | AU-636238-B2   |     1
        2293384 | AU-8174591-A   |     1
        5137694 | CA-2050271-A1  |     1
        5137693 | CA-2050271-C   |     1
       14883151 | DE-4027762-C1  |     1
       17156127 | DE-59100535-D1 |     1
       93003830 | DK-0477487-T3  |     1
       21052692 | EP-0477487-A1  |     1
       21052691 | EP-0477487-B1  |     1
       25818932 | ES-2046827-T3  |     1
      128970015 | JP-H054613-A   |     1
      131336800 | JP-H0825524-B2 |     1
       59117993 | NO-913030-D0   |     1
      177975213 | NO-913030-L    |     1
       62462393 | TR-25401-A     |     1
      159577685 | ZA-9106183-B   |     1
```
A second, optional argument to the cdws.f_family_extended function, v_include_apps boolean, can be used to include application filing data from the patent office. The default setting of this argument is false in order to maintain consistency with the EPO. In some cases, however, application filing data is not included in the priority claims data. Setting this argument to true will include the application filing data, as in the following example.
```
SELECT ucid
  FROM cdws.f_family_extended( xml.f_ucid2id('AU-2012201497-B2'), true ) AS f
 WHERE f.ucid = 'WO-2008136115-A1';

       ucid
------------------
 WO-2008136115-A1
(1 row)
```
Additional Functions
In addition to the above, there are other functions in the supplemental patch worth exploring, namely:

- FUNCTION cdws.f_family_meta( v_family_id integer ) – returns a variety of bibliographic and legal status information
- FUNCTION cdws.f_family_citations_backward(v_family_id integer) – returns backward citations for an entire family
- FUNCTION cdws.f_family_citations_forward(v_family_id integer) – returns forward citations for an entire family
Family Meta
The cdws.f_family_meta function brings a variety of information together for a simple family. The following columns are returned:
- family-id - DOCDB-assigned simple family integer identifier
- ucid – unique character identifier (publication)
- published – publication date
- anucid – unique character identifier (application)
- filed – application filing date
- title – title of document
- ifi_status – IFI Snapshot document status
- ifi_anticipated_expiration - IFI anticipated expiration date
- ifi_adjusted_expiration - IFI adjusted expiration date
- epo_legal_status_xml – XML representation of EPO legal status
An abbreviated example:
```
SELECT ucid, published, anucid, filed, title FROM cdws.f_family_meta( 26932266 );

       ucid       | published |     anucid     |  filed   | title
------------------+-----------+----------------+----------+------------------------------------------------------------------------------------------
 AR-242634-A1     | 19930430  | AR-31479889-A  | 19890830 | ETHANOL PRODUCTION BY GENETICALLY ENGINEERED ESCHERICHIA COLI STRAINS
 CA-1335430-C     | 19950502  | CA-609829-A    | 19890830 | ETHANOL PRODUCTION BY ENGINEERED MICROBES
 EP-0431047-A1    | 19910612  | EP-89909966-A  | 19890830 | ETHANOL PRODUCTION BY GENETICALLY ENGINEERED ESCHERICHIA COLI STRAINS
 HU-T60328-A      | 19920828  | HU-577189-A    | 19890830 | PROCESS FOR PRODUCING ETHANOL WITH TRANSFORMED MICROORGANISMS
 KR-900702042-A   | 19901205  | KR-900700901-A | 19890830 | 유전공학처리 미생물에 의한 에탄올 생성
 US-5000000-A     | 19910319  | US-35206289-A  | 19890830 | Ethanol production by Escherichia coli strains co-expressing Zymomonas PDC and ADH genes
 WO-1990002193-A1 | 19900308  | US-8903753-W   | 19890830 | ETHANOL PRODUCTION BY GENETICALLY ENGINEERED ESCHERICHIA COLI STRAINS
```
Family Citations
Forward and backward citations for an entire family are also available directly inside CLAIMS Direct.
```
SELECT * FROM cdws.f_family_citations_backward( 26932266 );

       ucid
------------------
 EP-0047641-A2
 WO-1986001833-A1
 WO-1986004357-A1
```
Again, GROUP BY and ORDER BY are available.
```
SELECT substring(ucid, 1, 2) AS country, count(*) AS ncites
  FROM cdws.f_family_citations_forward( 26932266 )
 GROUP BY country
 ORDER BY ncites DESC;

 country | ncites
---------+--------
 US      |    290
 WO      |     36
 EP      |     35
 JP      |      2
 FR      |      1
 KR      |      1
```
Processes
The main executable script used for indexing is aidx, delivered as part of Alexandria::Library. This script is responsible for pulling source data, converting it into Solr XML, and submitting it via HTTP POST to Solr for indexing. The conversion from CLAIMS Direct XML to Solr XML is handled by the indexer class (default: Alexandria::DWH::Index::Document). Alexandria::Client::Tools also provides an indexing daemon, aidxd, which monitors an index process queue. Insertion into this queue, the table reporting.t_client_index_process, is handled by apgup.
Process | Package |
---|---|
aidx | Alexandria::Library |
aidxd | Alexandria::Client::Tools |
apgup | Alexandria::Client::Tools |
Source Data
Source XML is extracted from the PostgreSQL data warehouse using the core library functionality exposed by the Alexandria::Library module Alexandria::DWH::Extract. The Extract module can pull data based on a number of criteria, the most common of which are:

- load-id: a modified-load-id of xml.t_patent_document_values
- table: any table name that has a publication_id (int) column
- SQL: raw SQL selecting desired documents by publication_id
Regardless of extraction criteria, Alexandria::DWH::Extract utilizes an UNLOGGED temporary table to accumulate the desired publication_ids. Extraction proper is done from this accumulation table in parallel select batches. The degree of parallelization and the number of documents per select are controlled by the parameters batchsize and nthreads. aidx also accepts a dbfunc parameter, which designates the stored function within the PostgreSQL database used to extract the XML data needed for indexing. The current default function is xml.f_patent_document_s, which pulls an entire XML document. One could, for example, create a custom function, e.g., myschema.f_barebones, modeled on xml.f_patent_document_s (i.e., accepting the same parameters and returning CLAIMS Direct XML with only application-specific XML content).
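As an illustration only, such a custom function might be sketched as follows. The signature is assumed (a publication_id integer in, document text out) and the body is a placeholder; the actual signature should be copied from xml.f_patent_document_s in your installation (e.g., via \df xml.f_patent_document_s in psql):

```sql
-- Hypothetical bare-bones extraction function; the SELECT body
-- below is illustrative, not the real document assembly logic.
CREATE OR REPLACE FUNCTION myschema.f_barebones( v_publication_id integer )
RETURNS text
LANGUAGE sql STABLE
AS $$
  SELECT '<patent-document/>'::text;
$$;
```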
Command | Accumulation SQL | Extract SQL |
---|---|---|
aidx --table=x | select publication_id | select xml.f_patent_document_values_s(t2.publication_id) from xml.t_patent_document_values t1 inner join x as t2 on ( t1.publication_id=t2.publication_id) |
aidx --loadid=y | select publication_id | select xml.f_patent_document_values_s(t2.publication_id) |
aidx --sqlq=USER_SQL | execute SQL into t1 | select xml.f_patent_document_values_s(t2.publication_id) |
Command | Accumulation SQL | Extract SQL |
---|---|---|
aidx --table=x --dbfunc=f_my_function | select publication_id | select f_my_function(t2.publication_id) from xml.t_patent_document_values t1 inner join x as t2 on ( t1.publication_id=t2.publication_id) |
Indexer Class
Using the callback mechanism exposed by the extract module, the indexer class takes an XML document and creates a transformed XML document suitable for loading into Solr. The following abbreviated example from aidx illustrates the process.
```perl
#!/usr/bin/perl
use Alexandria::DWH::Extract;
use Alexandria::DWH::Index;
use Alexandria::DWH::Index::Document;

my $idxcls = shift(@ARGV);  # indexer class name from the command line

sub _create_solr_document {
  my ( $batch, $xml ) = @_;
  eval "require $idxcls";
  return $idxcls->new( document => $xml )->toNode()->toString(1);
}

my $ex = Alexandria::DWH::Extract->new(
  ...
  callbacks => { on_document_processed => \&_create_solr_document }
);
$ex->prepare();
$ex->run();      # every extracted document is sent through _create_solr_document()
$ex->finalize();
```
Creating a Custom Indexing Class
Creating a custom indexing class is simply a matter of sub-classing Alexandria::DWH::Index::Document and manipulating the Solr document representation by adding, deleting, or modifying certain fields. There is currently only one method that can be overridden in the sub-class, namely _process_source. The following skeleton module will serve as a basis for the use cases detailed below.
```perl
package MyCustomIndexingClass;
use Moose;

### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
extends 'Alexandria::DWH::Index::DocumentEx';

# override _process_source
sub _process_source {
  my $self = shift;
  # we want to process the standard way ...
  $self->SUPER::_process_source();
  # do nothing else
}

1;
```
You can now specify MyCustomIndexingClass as the command-line argument --idxcls to the indexing utility aidx.
aidx --idxcls=MyCustomIndexingClass [ many other arguments ]
Use Cases
Assumptions
The following use cases assume:
- a valid index entry in /etc/alexandria.xml – this will be different than the default if you have a custom Solr installation
- custom indexing class modules are either in the directory from which you run aidx or in your PERL5LIB path
(1) Adding (Injecting), Modifying, and Deleting Fields
For this use case, you will need to modify your Solr schema for the installation associated with the appropriate configuration index entry. Add the following field definition:
<field name="customInteger" type="tint" indexed="true" stored="true" />
Below is example code to inject customInteger into the Solr document. Additionally, it shows how to modify the contents of anseries, and how to delete anseries if the publication country is US and the publication date is on or after January 1, 2015.
```perl
package MyCustomIndexingClass;  # subclass of Alexandria::DWH::Index::Document
use Moose;

### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
extends 'Alexandria::DWH::Index::DocumentEx';

# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  # the _fields member of $self contains all the
  # Solr content as a hash reference of array references, e.g.,
  #   _fields =>
  #     pn       => [ 'US-5551212-A' ],
  #     anseries => [ '07' ],
  #     icl1     => [ 'A', 'F', 'H' ]
  # NOTE: multiValued=false fields are still represented as an array
  #       but only have one member
  my $flds = $self->{_fields} || return;  # nothing to do

  # inject a new field
  push( @{ $flds->{customInteger} }, 1 );

  # we want to make certain that anseries is not padded, i.e.,
  # we need to be sure it is an integer
  if( $flds->{anseries} && scalar( @{ $flds->{anseries} } ) ) {
    $flds->{anseries}->[0] = sprintf( "%d", $flds->{anseries}->[0] );

    # lastly, we don't want to index anseries for US documents
    # published on or after 20150101
    my $ctry = $flds->{pnctry}->[0];
    my $date = $flds->{pd}->[0];
    if( $ctry eq 'US' && $date > 20141231 ) {
      delete $flds->{anseries};
    }
  }
}

1;
```
(2) Accessing the CLAIMS Direct Source XML Document
This next use case will examine methods of (re)processing data from the source XML document. The goal will be to create a new multi-valued field to store related documents. The following changes need to be made to the Solr schema:
<field name="rel_ucids" type="string" indexed="true" stored="true" required="false" />
We first need to define the related ucid (rel_ucids). For this example, it will be defined as:

- any related document which has @relation=related-publication
- any pct-or-regional-publishing-data

The parts of the XML document that are of interest:
```xml
<related-documents>
  <relation type="related-publication">
    <document-id>
      <country>US</country>
      <doc-number>20150126456</doc-number>
      <kind>A1</kind>
      <date>20150507</date>
    </document-id>
  </relation>
</related-documents>
<!-- ... -->
<pct-or-regional-publishing-data ucid="WO-2013182650-A1">
  <document-id>
    <country>WO</country>
    <doc-number>2013182650</doc-number>
    <kind>A1</kind>
    <date>20131212</date>
  </document-id>
</pct-or-regional-publishing-data>
```
As this example is more involved, the following code is broken down by function. A complete listing of the code is provided below.
```perl
### routine to parse related documents
sub _parse_related_documents {
  my $self   = shift;
  my $patdoc = shift;  # the root of the source XML as an XML::LibXML::Node
  my @a = ();          # stores any related-publications

  # if there are no related documents, return an empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf( "%s-%s-%s",
          $relation->findvalue('./document-id/country'),
          $relation->findvalue('./document-id/doc-number'),
          $relation->findvalue('./document-id/kind') ) );
    }
  }
  return \@a;
}
```
Points to consider with _parse_related_documents:

- The source document (XML) representation is an XML::LibXML::Node, named patdoc above.
- Utilizing the available methods, it is relatively simple to access particular parts of the XML tree.
- The findvalue method lacks error checking, i.e., we assume every value is present; combined in sprintf, this will return a correctly formatted ucid.
```perl
### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self   = shift;
  my $patdoc = shift;  # the root of the source XML as an XML::LibXML::Node

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}
```
Points to consider:
- According to the DTD, there is only ever one related PCT document, hence the single-value return.
- The ucid attribute is available here, differing from the related-documents case above.
The complete listing:
```perl
package MyCustomIndexingClass;  # subclass of Alexandria::DWH::Index::Document
use Moose;

### note: if using v2.0, you would extend Alexandria::DWH::Index::Document
extends 'Alexandria::DWH::Index::DocumentEx';

# override _process_source
sub _process_source {
  my $self = shift;

  # even though we are overriding _process_source(), we still
  # want the parent class to do all the work for us
  # by calling the parent method (SUPER) ...
  $self->SUPER::_process_source();

  my $flds    = $self->{_fields} || return;  # nothing to do
  my $reldocs = $self->_parse_related_documents( $self->{_source_root} );
  my $pctdoc  = $self->_parse_pct_publishing_data( $self->{_source_root} );

  if( scalar( @{$reldocs} ) ) {
    foreach my $r ( @{$reldocs} ) {
      push( @{ $flds->{rel_ucids} }, $r );
    }
  }
  if( $pctdoc ) {
    push( @{ $flds->{rel_ucids} }, $pctdoc );
  }
}

### routine to parse related documents
sub _parse_related_documents {
  my $self   = shift;
  my $patdoc = shift;  # the root of the source XML as an XML::LibXML::Node
  my @a = ();          # stores any related-publications

  # if there are no related documents, return an empty array
  my $reldoc_node = $patdoc->getElementsByTagName('related-documents')->[0];
  return \@a if !$reldoc_node;

  foreach my $relation ( $reldoc_node->getElementsByTagName('relation') ) {
    if( $relation->getAttribute('type') eq 'related-publication' ) {
      push( @a,
        sprintf( "%s-%s-%s",
          $relation->findvalue('./document-id/country'),
          $relation->findvalue('./document-id/doc-number'),
          $relation->findvalue('./document-id/kind') ) );
    }
  }
  return \@a;
}

### routine to parse pct publication information
sub _parse_pct_publishing_data {
  my $self   = shift;
  my $patdoc = shift;  # the root of the source XML as an XML::LibXML::Node

  # if there is no pct publishing node, return undef
  my $pct_node = $patdoc->getElementsByTagName('pct-or-regional-publishing-data')->[0];
  return undef if !$pct_node;

  # return ucid
  return $pct_node->getAttribute('ucid');
}

1;
```
Introduction
There are a number of reasons one might need to re-index data from the data warehouse, ranging from simply repeating a load-id to a complete re-index of the entire contents of the data warehouse. In this blog, I'm going to go over the mechanisms that move data from the data warehouse to the index and the ways in which these mechanisms can be used to trigger partial or full re-indexing of data.
Background
In all installations of CLAIMS Direct, data synchronization between the CLAIMS Direct primary data warehouse and on-site installations is driven by the tool apgup. In version 2.1, a new update daemon (apgupd) was introduced to facilitate background processing of new and updated data. The main roles this daemon plays include:
- Reconciling differentials between on-site data warehouses and CLAIMS Direct primary data warehouses
- Downloading the next available update package for processing within the on-site installation
- Processing the update package
- Loading data to the data warehouse
- Optionally en-queuing that load for indexing
For our topic, we will concentrate on the indexing queue as this drives the indexing process.
Index Queue
The indexing queue is a table inside the reporting schema: t_client_index_process. When apgupd completes a load into PostgreSQL, a row is inserted into reporting.t_client_index_process with running_status = pending. The indexing daemon, aidxd, probes the table looking for the highest-priority load_id to index. Upon finding an applicable load_id, aidxd proceeds to extract all publications associated with that load_id and index the documents into Solr.
The priority column influences which load_id is to be processed next. Higher-priority load_ids are processed before lower-priority ones. This can cause out-of-order processing, as we will discuss below.
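For example, to push one pending load to the front of the queue, its priority can simply be raised above that of its peers (a sketch; 123456 is a placeholder load_id):

```sql
update reporting.t_client_index_process
   set priority = 10
 where load_id = 123456
   and running_status = 'pending';
```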
Re-indexing
Individual load_ids
At its simplest, re-indexing is just a matter of resetting a load_id in reporting.t_client_index_process:

```sql
update reporting.t_client_index_process
   set running_status    = 'pending',
       completed_stamp   = null,
       completed_status  = null,
       completed_comment = null
 where load_id = 123456;
```
The above SQL resets load_id 123456 so that the indexing daemon processes that load_id next. You can re-index any number of load_ids by resetting each of their rows.
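The single-row reset generalizes to any set of load_ids with an IN list, for example (the load_ids shown are placeholders):

```sql
update reporting.t_client_index_process
   set running_status    = 'pending',
       completed_stamp   = null,
       completed_status  = null,
       completed_comment = null
 where load_id in ( 123456, 123457, 123458 );
```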
Entire Data Warehouse
Unfortunately, resetting all the rows in reporting.t_client_index_process is not sufficient to re-index the entire contents of the data warehouse. This is because the table is empty upon the initial load of the on-site instance: population of reporting.t_client_index_process begins with the first update to the data warehouse, and each new update that is processed is queued. To that end, the most efficient way to queue the entire data warehouse for re-indexing is to select the entire xml.t_patent_document_values table and group by modified_load_id.
Before proceeding with the next set of SQL, please be sure both apgupd and aidxd are paused (kill -s USR1 <pid>).
```sql
truncate table reporting.t_client_index_process;

insert into reporting.t_client_index_process
       ( load_id, load_source, ndocs, priority )
select modified_load_id, 'REINDEX', count(*), -1
  from xml.t_patent_document_values
 group by modified_load_id
 order by modified_load_id desc;
```
A few comments about the above statements:
- Truncating the table is required because the load_id column is unique. If you wish to save the contents of the table, simply use the pg_dump utility.
- The load_source value is arbitrary. You can use any identifier you want as long as it doesn't exceed 64 characters.
- The priority is set to -1 because, in this example, re-indexing is less important than indexing new data.
Use Case
Changing Field Definitions
Modifying the CLAIMS Direct Solr Schema
The CLAIMS Direct Solr schema is a compromise between having every piece of the patent document searchable, having enough metadata stored to be retrievable in the search result set, and indexing certain fields for efficient faceting. This compromise keeps the size of the index manageable while allowing efficient return of data from a search. Of course, the concept of manageable is subjective. There could very well be a need to have more fields returned during the search (stored=true) and other fields removed from storage but left searchable. In this use case, we will enable patent citations to be returned during the search. Regardless of the CLAIMS Direct Solr package, we'll start by editing the schema under <package>/conf-2.1.2. We are making a very simple change, but one that requires a complete re-index of all documents having citations.
Be sure the indexing daemon (aidxd) is paused.
The current field definition
<field name="pcit" type="alexandria_string" indexed="true" stored="false" multiValued="true" />
becomes
<field name="pcit" type="alexandria_string" indexed="true" stored="true" multiValued="true" />
To make this change effective, Solr needs to be restarted or the collection reloaded. For the standalone package, you would just need to restart Solr:
<package>/solrctl stop ; <package>/solrctl start
For the distributed package, first deploy updated configuration to the Zookeeper nodes.
cd <package>
./bootstrap-zookeeper.sh
Then reload all nodes:
cd <package>
./reload-collections
Setting up the Index Queue
Although we can go about re-indexing the entire data warehouse as outlined above, we want to be a bit more efficient and choose only the applicable documents, i.e., those that actually have citation data. To this end, we will find only the load_ids that contain documents with citations and modify existing reporting.t_client_index_process rows as needed.
-- DELETE load-ids already processed
delete from reporting.t_client_index_process
 where load_id in (
   select distinct(modified_load_id) from xml.t_citations
 );

-- INSERT load-ids
insert into reporting.t_client_index_process
  (load_id, load_source, ndocs, priority)
select modified_load_id, 'CITATION-FIX', count(*), -1
  from xml.t_citations
 group by modified_load_id;
Again, setting priority to -1 allows new documents to be indexed as usual, and documents needing to be updated will be indexed with a lower priority.
Checking Re-Index Status
You can monitor the progress of the re-index by querying the reporting.t_client_index_process table as follows:
select priority, load_source, count(*)
  from reporting.t_client_index_process
 where running_status = 'pending'
 group by priority, load_source
 order by priority, load_source;
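If you would rather watch the queue from a script, the same grouping can be done client-side. The sketch below is illustrative only: pure Python with no database connection, where the row dictionaries simply mirror the relevant reporting.t_client_index_process columns.

```python
from collections import Counter

def pending_summary(rows):
    """Group pending queue rows by (priority, load_source), mirroring
    the GROUP BY in the monitoring SQL above."""
    counts = Counter(
        (r["priority"], r["load_source"])
        for r in rows
        if r["running_status"] == "pending"
    )
    # sort by priority, then load_source -- same ORDER BY as the SQL
    return sorted((prio, src, n) for (prio, src), n in counts.items())

# Hypothetical queue contents: two pending re-index rows, one pending US load
rows = [
    {"priority": -1, "load_source": "REINDEX", "running_status": "pending"},
    {"priority": -1, "load_source": "REINDEX", "running_status": "pending"},
    {"priority": 0,  "load_source": "US",      "running_status": "pending"},
    {"priority": 0,  "load_source": "US",      "running_status": "indexed"},
]
print(pending_summary(rows))  # [(-1, 'REINDEX', 2), (0, 'US', 1)]
```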
The CLAIMS Direct data warehouse is updated continuously. On average, 1-3% of the entire database is touched weekly. These 1-3 million updates fall into two categories: new documents (roughly 20%) and updated documents (roughly 80%). New documents are, of course, new publications from issuing authorities (US, EP, etc.). Updates are generally, but not limited to, CPC changes, family changes (adding family ID), IFI Snapshots (patent status, standard names and claims summaries), reassignments, legal status, et al. I suggest reading about content updates for a better understanding of load sources and issuing authorities. This blog will describe how to determine what has been updated and, further, how to differentiate new content from updated content at the document level as well as the container level.
Load Identifiers
As described in the content update documentation,
Every document within the data warehouse was loaded as part of a group of documents. This set of documents is identified by a load-id (integer value). There are 3 types of load-ids in the data warehouse: created-load-id, deleted-load-id and modified-load-id. The created-load-id represents the load-id in which a document was added to the database, the modified-load-id represents the load-id that last modified the document and the deleted-load-id represents the load-id in which the document was marked as deleted.
The table xml.t_patent_document_values houses the load-ids that affect the overall status of the entire document.
- modified_load_id – the load-id that last modified the document, whether it be addition, correction or deletion of content
- deleted_load_id – the load-id in which the entire document was marked deleted
- created_load_id – the load-id in which the document was added to the data warehouse
Each satellite table also contains a modified_load_id column that is specific to the content in that table and will always be <= xml.t_patent_document_values(modified_load_id). The following table should clarify the sequence of update events and the effect on load-id values for three database tables.
date | load-source | Load-Id | table | created_load_id | modified_load_id | deleted_load_id |
---|---|---|---|---|---|---|
20161004 | US | 10000 | xml.t_patent_document_values | 10000 | 10000 | NULL |
 | | | xml.t_ifi_integrated_content (assuming IFI integrated content is not delivered on the date of publication) | n/a | n/a | n/a |
 | | | xml.t_parties | n/a | 10000 | n/a |
20161006 | IFI-CONT | 10001 | xml.t_patent_document_values | 10000 | 10001 | NULL |
 | | | xml.t_ifi_integrated_content | n/a | 10001 | n/a |
 | | | xml.t_parties (parties are not updated with IFI content loads) | n/a | 10000 | n/a |
20161011 | DOCDB | 10002 | xml.t_patent_document_values | 10000 | 10002 | NULL |
 | | | xml.t_ifi_integrated_content | n/a | 10001 | n/a |
 | | | xml.t_parties | n/a | 10002 | n/a |
This granularity offers the ability to trace modifications to individual containers (tables) as well as the overall modification status of the document as a whole.
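The container-level comparison (a satellite table's modified_load_id versus the document's created_load_id) can be expressed as a small helper. A minimal sketch, with the table names and load-id values taken from the example above; the dictionary shape is assumed for illustration:

```python
def changed_containers(created_load_id, table_load_ids):
    """Return the containers whose modified_load_id is newer than the
    document's created_load_id, i.e. content touched after first load.
    Non-integer entries (n/a: container never loaded) are skipped."""
    return sorted(
        table for table, load_id in table_load_ids.items()
        if isinstance(load_id, int) and load_id > created_load_id
    )

# State after the 20161011 DOCDB load in the table above
# (document created in load 10000):
doc = {
    "xml.t_ifi_integrated_content": 10001,
    "xml.t_parties": 10002,
}
print(changed_containers(10000, doc))
# ['xml.t_ifi_integrated_content', 'xml.t_parties']
```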
Quick Test
How many US documents published on 20161004 have updated CPC classifications?
select count(*) as docs_with_modified_cpc
  from xml.t_patent_document_values as a
 inner join xml.t_classifications_cpc as b
    on (a.publication_id = b.publication_id)
 where a.country = 'US'
   and a.published = '2016-10-04'
   and a.deleted_load_id is null  -- disregard deleted documents
   and a.withdraw != true         -- disregard withdrawn documents
   /* If b.modified_load_id > a.created_load_id, the table has been touched
      and we can therefore assume CPC values have been modified */
   and b.modified_load_id > a.created_load_id;

 docs_with_modified_cpc
------------------------
                   6145
Out of 6680 documents published on 20161004, 6145 had CPC modifications after publication.
Reporting
Knowing which tables and containers have changed is important, but the when and why are also critically important. The information gathered during data warehouse updates is stored in reporting.t_client_load_process
. For a detailed description of the contents of this table, see CLAIMS Direct Design Documentation. Of the many questions that can be posed, for our purposes in this blog, we are only concerned with answering the following questions:
- On any given date, how many documents were added to the data warehouse?
- On any given date, how many documents were modified in the data warehouse?
The answers to these questions are important for a variety of reasons, most importantly for downstream processing. In the case of CLAIMS Direct primary data warehouses, we trigger indexing based on the answers to these questions. The first thing to look at is what happened on a given date; for the sake of an example, we will use 20161004.
select load_source, count(*)
  from reporting.t_client_load_process
 where date_trunc('day', completed_stamp) = '2016-10-04'::date
 group by load_source
 order by load_source;

 load_source  | count
--------------+-------
 DOCDB        |     7
 IFI-CLAIMINF |     1
 IFI-Cont     |    15
 IPCR         |    47
 US           |     1
 USPTO-OG     |     1
 USREAS       |     1
How many documents were added to the data warehouse?
select count(*) as new_docs
  from xml.t_patent_document_values
 where created_load_id in (
   select load_id
     from reporting.t_client_load_process
    where date_trunc('day', completed_stamp) = '2016-10-04'::date
 );

 new_docs
----------
     6680
... and which documents were updated?
select count(*) as updated_docs
  from xml.t_patent_document_values
 where created_load_id < modified_load_id
   and modified_load_id in (
     select load_id
       from reporting.t_client_load_process
      where date_trunc('day', completed_stamp) = '2016-10-04'::date
 );

 updated_docs
--------------
       362388
Overview
The CLAIMS Direct Web Services (CDWS) offer a variety of entry points into both the data warehouse and Solr index. These are mid-to-high-level entry points and can satisfy most requirements pertaining to searching and extracting data for a typical search/view application. There are, however, corner cases which may require more intricate extraction of particular information. On the other hand, there may also be situations where massive amounts of data need to be extracted for further, down-stream processing. The following discussion will touch on solutions for both of these cases.
Basic Data Warehouse Table Structure
The basic structure of the PostgreSQL database housing the XML content is best described as a collection of tables containing the raw XML of every patent document available in CLAIMS Direct. Each table is a section of the XML document, referred to as a container, and each table is named according to the XML container data it contains, e.g.,
Table | Container | Description |
---|---|---|
xml.t_publication_reference | publication-reference | Document publication information |
xml.t_invention_title | invention-title | Document title(s) |
xml.t_claims | claims | Document claims |
The structure of each table is identical with the exception of xml.t_patent_document_values, which functions as a meta table.
Column | Type |
---|---|
<container>_id | serial primary key |
publication_id | integer not null |
modified_load_id | integer not null |
status | character(1) (v=valid XML, i=invalid XML) |
content | XML |
The XML is always in the content column (PostgreSQL type: XML).
select content
  from xml.t_publication_reference
 where publication_id = xml.f_ucid2id('US-5551212-A');
--------
<publication-reference fvid="71329085" ucid="US-5551212-A">
  <document-id>
    <country>US</country>
    <doc-number>5551212</doc-number>
    <kind>A</kind>
    <date>19960903</date>
    <lang>EN</lang>
  </document-id>
</publication-reference>
The invaluable utility function xml.f_ucid2id takes a ucid and returns a publication_id.
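Client-side, the inverse direction (taking a ucid apart) is simple string work. A minimal sketch; the three-part country/doc-number/kind layout is inferred from the ucid examples in this document:

```python
def parse_ucid(ucid):
    """Split a ucid like 'US-5551212-A' into its country, doc-number and
    kind components. The three-part layout is inferred from the examples
    in this document, not from a formal specification."""
    country, doc_number, kind = ucid.split("-", 2)
    return {"country": country, "doc_number": doc_number, "kind": kind}

print(parse_ucid("US-5551212-A"))
# {'country': 'US', 'doc_number': '5551212', 'kind': 'A'}
```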
Extracting Pieces of the XML Content
As mentioned above, all content columns are of type XML and therefore the internal PostgreSQL xpath functionality can be used. There are a variety of utility functions provided by CLAIMS Direct that mimic DOM functionality, e.g.,
Function | Example |
---|---|
xml.f_textContent | Extract text of node: select xml.f_textContent('./document-id/lang', content) from xml.t_publication_reference where publication_id=xml.f_ucid2id('US-5551212-A'); ---- EN |
xml.f_getAttribute | Extract single attribute value: select xml.f_getAttribute('ucid', content) from xml.t_publication_reference where publication_id=xml.f_ucid2id('US-5551212-A'); ---- US-5551212-A |
xml.f_nodeExists | Test presence of a particular node: select xml.f_nodeExists('//date', content) from xml.t_publication_reference where publication_id=xml.f_ucid2id('US-5551212-A'); ---- t |
xml.f_findNode | Select a node (type XML) satisfying the given xpath expression: select xml.f_findNode('./main-classification', content) from xml.t_classification_national where publication_id=xml.f_ucid2id('US-5551212-A'); ---- <main-classification mxw-id="PCL893513162" load-source="mcf">053397</main-classification> |
xml.f_findNodes | Select an array of nodes satisfying the given xpath expression: select xml.f_findNodes('./classification-ipcr', content) from xml.t_classifications_ipcr where publication_id=xml.f_ucid2id('US-5551212-A'); ---- { "<classification-ipcr ...", "<classification-ipcr ...", "<classification-ipcr ..." } |
xml.f_getElementsByTagName | Select an array of nodes based on element name: select xml.f_getElementsByTagName('classification-ipcr', content) from xml.t_classifications_ipcr where publication_id=xml.f_ucid2id('US-5551212-A'); ---- { "<classification-ipcr ...", "<classification-ipcr ...", "<classification-ipcr ..." } |
xml.f_nodeName | Select the name of the content node: select xml.f_nodeName( content ) from xml.t_invention_title where publication_id=xml.f_ucid2id('US-5551212-A'); ---- invention-title |
xml.f_childNodes | Select all child nodes: select xml.f_childNodes( content ) from xml.t_application_reference where publication_id=xml.f_ucid2id('US-5551212-A'); ---- <document-id mxw-id="PAPP60364944" load-source="docdb" format="epo"> <country>US</country> <doc-number>2582893</doc-number> <kind>A</kind> <date>19930303</date> <lang>EN</lang> </document-id> <document-id mxw-id="PAPP84571955" load-source="patent-office" format="original"> <country>US</country> <doc-number>08025828</doc-number> <date>19930303</date> <lang>EN</lang> </document-id> |
xml.f_node2string | Concatenate all child text nodes: select xml.f_node2string( content::text, 1 ) from xml.t_abstract where publication_id=xml.f_ucid2id('US-5551212-A'); ---- A package is formed by holding a nonextensible strip taut and wrapping it around the upper and/or lower periphery of the bundle and joining the ends of the strip segments e.g. by adhesive bonding. A projecting margin of the strip is folded to overlie the bundle or underlie the bundle so that an L-configuration is imparted to the frame formed around the bundle after the strip is applied thereto. |
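If you retrieve the raw content XML and prefer to post-process it outside the database, a few of these helpers have obvious client-side equivalents. A sketch using Python's standard ElementTree (illustrative only, not part of CLAIMS Direct):

```python
import xml.etree.ElementTree as ET

# The publication-reference XML shown earlier in this document
content = """<publication-reference ucid="US-5551212-A">
  <document-id>
    <country>US</country>
    <doc-number>5551212</doc-number>
    <kind>A</kind>
    <date>19960903</date>
    <lang>EN</lang>
  </document-id>
</publication-reference>"""

root = ET.fromstring(content)

# xml.f_getAttribute('ucid', content) equivalent
ucid = root.get("ucid")

# xml.f_textContent('./document-id/lang', content) equivalent
lang = root.findtext("./document-id/lang")

# xml.f_nodeExists('//date', content) equivalent
has_date = root.find(".//date") is not None

print(ucid, lang, has_date)  # US-5551212-A EN True
```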
Use Case: Creating a Relational View of XML Content
Although the underlying architecture of the CLAIMS Direct PostgreSQL database is a data warehouse, creating normalized relational views into the XML content is easily achievable using the functions described above. Our example will be to create a relational view of all patent citations. This view will let us SELECT based on criteria as well as GROUP BY and ORDER BY. The first step is to define all the properties of a citation:
- ref-ucid (cited document)
- ref-publication-id
- publication-date
- country
- kind
- load-source
- format
- source (examiner or applicant)
Now we can create a function that returns a TABLE of these properties. We'll use publication_id as input to the function and create it in a private schema (mySchema).
create or replace function mySchema.f_rdb_citations( integer )
returns table (
  publication_id     integer,
  ref_publication_id integer,
  ref_ucid           varchar(64),
  published          date,
  country            varchar(2),
  kind               varchar(2),
  load_source        varchar(32),
  format             varchar(32),
  source             varchar(32)
) as
$BODY$
declare
  v_content  xml;
  v_node     xml;
  v_doc_node xml;
begin
  -- example usage:
  --   select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') );
  select x.publication_id, x.content
    into publication_id, v_content
    from xml.t_citations as x
   where x.publication_id = $1;
  if not found then
    return;
  end if;
  -- loop through all patent citations
  foreach v_node in array (
    select xml.f_findNodes('//patcit', v_content)
  )
  loop
    select xml.f_getAttribute( 'ucid', v_node ) into ref_ucid;
    select x.publication_id
      into ref_publication_id
      from xml.t_patent_document_values as x
     where ucid = xml.f_getAttribute( 'ucid', v_node );
    select xml.f_getAttribute( 'load-source', v_node ) into load_source;
    select xml.f_getAttribute( 'name', xml.f_findNode('./sources/source', v_node) ) into source;
    -- loop through all forms of the document-id(s)
    foreach v_doc_node in array (
      select xml.f_findNodes('./document-id', v_node)
    )
    loop
      select xml.f_getAttribute( 'format', v_doc_node ) into format;
      select xml.f_textContent( './country', v_doc_node ) into country;
      select xml.f_textContent( './kind', v_doc_node ) into kind;
      select xml.f_textContent( './date', v_doc_node ) into published;
      return next;
    end loop;
  end loop;
end;
$BODY$
LANGUAGE plpgsql VOLATILE COST 100 ROWS 1000;
Now, we test the results:
select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') ) limit 5; publication_id | ref_publication_id | ref_ucid | published | country | kind | load_source | format | source ----------------+--------------------+-------------------+------------+---------+------+-------------+--------+-------- 160817649 | 5778519 | CA-2580978-A1 | 2006-03-30 | CA | A1 | docdb | epo | APP 160817649 | 127402642 | CN-201918032-U | 2011-08-03 | CN | U | docdb | epo | APP 160817649 | 75847915 | US-20020150866-A1 | 2002-10-17 | US | A1 | docdb | epo | APP 160817649 | 75947470 | US-20030060942-A1 | 2003-03-27 | US | A1 | docdb | epo | APP 160817649 | 76046244 | US-20030154010-A1 | 2003-08-14 | US | A1 | docdb | epo | APP
Of course, GROUP BY and ORDER BY are available on the returned columns. In addition to SELECTing citations for a single document, we can also JOIN to table expressions, e.g., assume we are interested in the top 5 cited countries among US grants published on 2016-05-03:
select count(*) as n, t2.country as ref_country
  from xml.t_patent_document_values t1
 inner join mySchema.f_rdb_citations(t1.publication_id) as t2
    on (t1.publication_id = t2.publication_id)
 where t1.country = 'US'
   and t1.published = '2016-05-03'
 group by t2.country
 order by n desc
 limit 5;

   n    | ref_country
--------+-------------
 221800 | US
  14285 | WO
  13313 | JP
   7115 | EP
   2554 | CN
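If the citation rows have already been fetched, the same top-N aggregation is easy to reproduce client-side. A minimal sketch; the row shape is assumed for illustration:

```python
from collections import Counter

def top_ref_countries(citations, n=5):
    """Count cited countries and return the n most common, mirroring the
    GROUP BY / ORDER BY ... DESC / LIMIT in the SQL above. Each citation
    row only needs a 'country' key here."""
    return Counter(c["country"] for c in citations).most_common(n)

# Hypothetical citation rows
rows = [{"country": "US"}, {"country": "US"}, {"country": "JP"},
        {"country": "US"}, {"country": "EP"}, {"country": "JP"}]
print(top_ref_countries(rows, 2))  # [('US', 3), ('JP', 2)]
```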
Paging results is cumbersome and inefficient. In this next segment I'd like to talk about simple and complex sorting. Sorting, used effectively with the rows parameter, can push relevant documents onto the first page of results. Generally, you can sort on any indexed field, but you can also utilize query boosting and functions to influence sort order.
Missing Value Behavior
CLAIMS Direct is configured to return empty fields at the top when the direction is asc and at the bottom when the direction is desc.
Static Field Sorts
The most common field to sort on will undoubtedly be the date of publication, pd desc. This pushes the newly published documents to the top of the result set. CLAIMS Direct uses pd desc as the default sort order instead of score. If two documents share the same publication date (or any sort field value), the tie is broken by the internal Lucene document IDs, ascending.
/search/query?q=ab_en:sonar&rows=5&fl=ucid,[docid],pd
"docs" : [ { "[docid]" : 5536680, "pd" : 20160225, "ucid" : "US-20160054444-A1" }, { "[docid]" : 1274637, "pd" : 20160224, "ucid" : "EP-2986998-A1" }, { "[docid]" : 4335577, "pd" : 20160223, "ucid" : "US-9268020-B2" }, { "[docid]" : 4986794, "pd" : 20160218, "ucid" : "US-20160049143-A1" }, { "[docid]" : 5088703, "pd" : 20160218, "ucid" : "US-20160047906-A1" } ]
Multiple sort criteria are also possible and are used mostly to stabilize the initial sort results. The syntax for the sort parameter is a comma-separated list of field/direction pairs, e.g., pd desc,ad asc. Field and direction are separated by a space.
/search/query?q=ab_en:sonar&rows=5&fl=ucid,pd&sort=pd desc,ucid desc
Dynamic Field Sort
As mentioned, CLAIMS Direct uses a default sort of pd desc. You are, of course, free to use the dynamic score field to mimic the default Solr distribution behavior.
/search/query?q=ab_en:sonar&rows=5&fl=ucid,score,pd&sort=score desc
"docs" : [ { "pd" : 19960322, "ucid" : "FR-2724734-A1", "score" : 3.781385 }, { "pd" : 20120523, "ucid" : "EP-2454606-A1", "score" : 3.3196893 }, { "pd" : 20151229, "ucid" : "US-9223022-B2", "score" : 3.3196893 }, { "pd" : 20150910, "ucid" : "US-20150253425-A1", "score" : 3.3196893 }, { "pd" : 20150521, "ucid" : "AU-2010273841-B2", "score" : 3.2852356 } ]
Random Sort (for Random Results)
CLAIMS Direct uses a utility field to allow random sorting. The field is input as rnd_n where n is any random integer, e.g., rnd_1234. Specifying the same value for n will yield the same sort results, so the randomness is only as random as the input integer. Returning random results (sorting on rnd_n) is an efficient way to sample a wide variety of data.
/search/query?q=ab_en:sonar&rows=3&fl=ucid,score,pd&sort=rnd_1234 desc
/search/query?q=ab_en:sonar&rows=3&fl=ucid,score,pd&sort=rnd_12345 desc
"docs" : [ { "pd" : 20150902, "ucid" : "CN-204605667-U", "score" : 1.6596402 }, { "pd" : 20110224, "ucid" : "WO-2010125029-A4", "score" : 1.1842703 }, { "pd" : 19910719, "ucid" : "FR-2657063-A1", "score" : 1.760531 } ] // "docs" : [ { "pd" : 19720119, "ucid" : "GB-1260387-A", "score" : 0.4986458 }, { "pd" : 20160114, "ucid" : "US-20160011310-A1", "score" : 2.3957791 }, { "pd" : 20130425, "ucid" : "WO-2013056893-A1", "score" : 0.9474162 } ],
Sorting on Functions
A previous segment discussed using functions as pseudo fields to return in the result set. You can also use the results of a function as sort criteria. An interesting use case for sorting on a calculated value is measuring the latency between application filing date and grant date. You can calculate this value by subtracting ad_d from pd_d (tdate field types of the filing and publication dates). The following examples use field aliasing with the fl parameter as well as sorting by function to return documents with low latency between filing and grant (asc) and high latency (desc), displayed in days (86400000 milliseconds in a day).
/search/query?rows=5&q=ab_en:sonar +ifi_publication_type:G&fl=ucid,ad,pd,days_to_grant:div(sub(pd_d,ad_d),86400000)&sort=sub(pd_d,ad_d) asc
/search/query?rows=5&q=ab_en:sonar +ifi_publication_type:G&fl=ucid,ad,pd,days_to_grant:div(sub(pd_d,ad_d),86400000)&sort=sub(pd_d,ad_d) desc
"docs" : [ { "pd" : 20131205, "ad" : 20130118, "ucid" : "DE-102013000802-B3", "days_to_grant" : 321.0005 }, { "pd" : 20160126, "ad" : 20150304, "ucid" : "US-9244168-B2", "days_to_grant" : 328.0001 }, { "pd" : 20140904, "ad" : 20130904, "ucid" : "DE-102013014640-B3", "days_to_grant" : 365.00064 }, { "pd" : 20160216, "ad" : 20141208, "ucid" : "US-9261391-B2", "days_to_grant" : 434.99976 }, { "pd" : 19931007, "ad" : 19920421, "ucid" : "DE-4213121-C1", "days_to_grant" : 534.00006 } ] // "docs" : [ { "pd" : 19890719, "ad" : 19730214, "ucid" : "GB-1605319-A", "days_to_grant" : 5999 }, { "pd" : 19781220, "ad" : 19640511, "ucid" : "GB-1536653-A", "days_to_grant" : 5336.0005 }, { "pd" : 20160209, "ad" : 20100426, "ucid" : "US-9255982-B2", "days_to_grant" : 2115 }, { "pd" : 20160223, "ad" : 20120210, "ucid" : "US-9268020-B2", "days_to_grant" : 1473.9987 }, { "pd" : 20150217, "ad" : 20110322, "ucid" : "US-RE45379-E1", "days_to_grant" : 1428.0006 } ]
In this first of a series of blogs about Solr result sets, I'd like to talk about returned fields, both static and dynamic.
Stored Fields
Any field that is stored in Solr can be returned in the result set. The following table shows all available stored fields:
Name | Description | Type | Indexed |
---|---|---|---|
ab_* | all abstracts | mixed: alexandria_text, alexandria_asian_text | true |
ad | filing date | tint | true |
an | application number (tokenized) | alexandria_token | true |
anorig | original, patent office format application number | alexandria_token | true |
anseries | US application series code | alexandria_string | true |
anucid | standardized filing identifier | string | true |
ecla | ECLA classification | alexandria_string | true |
fam | family identifier | string | true |
fterm | F-Terms | alexandria_string | true |
icl,cpcl,uscl,ficl | classifications suitable for display | string | false |
ifi_name_current_id | current assignee identifier | string | true |
ifi_name_original_id | original assignee identifier | string | true |
ifi_pa | IFI assignees | alexandria_token | true |
inv | inventors | alexandria_token | true |
loadid | meta load identifier | tint | true |
nclms | number of claims | tint | true |
nindepclms | number of independent claims | tint | true |
pa | applicants/assignees (all formats) | alexandria_token | true |
pd | publication date | tint | true |
pn | patent number (tokenized) | alexandria_token | true |
pri | priority filing number (tokenized) | alexandria_string | true |
prid | earliest priority filing date | tint | true |
priorig | original, patent office format priority number | alexandria_string | true |
timestamp | last modification stamp | tdate | true |
ttl_* | all titles | mixed: alexandria_text, alexandria_asian_text | true |
ucid | unique character identifier | string | true |
The shorthand to return all static fields is the asterisk: fl=*
/search/query?q=ucid:US-20160054444-A1&fl=*&rows=1
As with all examples, we discuss results in JSON format.
"docs" : [ { "timestamp" : "2016-02-27T08:29:31.936Z", "ucid" : "US-20160054444-A1", "loadid" : 229946, "nclms" : 19, "nindepclms" : 2, "fam" : "-1", "pn" : "US-20160054444-A1 US20160054444A1", "pd" : 20160225, "an" : "US-201514833357-A" "anseries" : "14", "anorig" : "US-14833357", "ad" : 20150824, "pri" : [ "FR-1457951" ], "prid" : 20140825, "priorig" : [ "FR-1457951" ], "cpcl" : [ "G01S 15/588 20130101 LI20160225BHUS ", "G01S 15/60 20130101 FI20160225BHUS " ], "icl" : [ "G01S 15/58 20060101ALI20160225BHUS ", "G01S 15/60 20060101AFI20160225BHUS " ], "ifi_pa" : [ "ECA ROBOTICS", "ECA ROBOTICS" ], "inv" : [ "Pinto, Marc" ], "pa" : [ "ECA ROBOTICS", "ECA ROBOTICS" ], "ttl_en" : [ "METHOD AND SONAR DEVICE FOR DETERMINING THE SPEED OF MOVEMENT OF A NAVAL VEHICLE IN RELATION TO THE SEA BED" ], "ab_en" : [ "<abstract mxw-id=\"PA168904151\" lang=\"EN\" load-source=\"patent-office\">\n <p id=\"p-0001\" num=\"0000\">Sonar intended to be carried by a naval vehicle including at least one element for transmitting an acoustic signal, at least one element for receiving the acoustic signal transmitted and reflected on the sea bed and at least two phase centres (PC<sub>1</sub>, PC<sub>2</sub>) that are disposed along a first and a second axis (v<sub>1</sub>, v<sub>2</sub>), respectively, forming an interferometric antenna. The sonar includes elements for determining the speed of movement of the vehicle as a function of the computed value of the relative trim angle (β) formed between a straight line (d<sub>1</sub>) that is perpendicular to the axes (v<sub>1</sub>, v<sub>2</sub>) of the phase centres and a straight line (d<sub>2</sub>) that is perpendicular to the sea bed (F) and of the value determined for the angle of sight.</p>\n </abstract>" ], } ],
Some things to notice: multi-valued fields are returned as JSON arrays. Also, as inv, pa and ifi_pa are copyFields containing multiple values, duplicate entries may appear.
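If those duplicates are unwanted on the client side, an order-preserving de-duplication pass is enough. A minimal sketch; the helper name is hypothetical:

```python
def dedupe(values):
    """Order-preserving de-duplication for multi-valued fields such as
    pa/ifi_pa, which may repeat entries because they are copyFields."""
    seen = set()
    return [v for v in values if not (v in seen or seen.add(v))]

print(dedupe(["ECA ROBOTICS", "ECA ROBOTICS"]))  # ['ECA ROBOTICS']
```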
Dynamic Fields
Dynamic fields are fields that are generated either internally, like the relevance score, or by computing values using function queries. Dynamic fields must be listed explicitly in the fl parameter, e.g., fl=score. Below, we request the static ucid field together with the relevance score for 5 results.
/search/query?q=ab_en:sonar&fl=ucid,score&rows=5
"docs" : [ { "ucid" : "US-20160054444-A1", "score" : 1.1746656 }, { "ucid" : "US-9268020-B2", "score" : 2.8773313 }, { "ucid" : "US-20160047906-A1", "score" : 1.1735018 }, { "ucid" : "US-20160049143-A1", "score" : 0.99574935 }, { "ucid" : "US-20160047891-A1", "score" : 1.1675186 } ],
Another type of dynamic field makes use of function queries. A simple, yet contrived, example would be to return the number of dependent claims. CLAIMS Direct stores the total number of claims in nclms as well as the number of independent claims in nindepclms, both as tint. Simple subtraction yields the desired result.
/search/query?q=ab_en:sonar&fl=ucid,nclms,nindepclms,sub(nclms,nindepclms)&rows=2
"docs" : [ { "ucid" : "US-20160054444-A1", "nclms" : 19, "nindepclms" : 2, "sub(nclms,nindepclms)" : 17 }, { "ucid" : "US-9268020-B2", "nclms" : 74, "nindepclms" : 3, "sub(nclms,nindepclms)" : 71 } ],
Field Aliases
Field aliasing allows individual fields in the result set to be renamed. This functionality is particularly useful for giving meaningful names to pseudo fields created by function queries, as in the example above. Any field can be renamed using aliases. The syntax is: alias-field-name:original-field-name
Note: ucid is the only exception. When trying to alias the ucid, an HTTP 500 will be returned.
/search/query?q=ab_en:sonar&fl=ucid,total:nclms,independent:nindepclms,dependent:sub(nclms,nindepclms)&rows=1
"docs" : [ { "ucid" : "US-20160054444-A1", "total" : 19, "independent" : 2, "dependent" : 17 } ],