Overview
The CLAMS CLAIMS Direct Web Services (CDWS) offer a variety of entry points into both the data warehouse and SOLR Solr index. These are mid-to-high-level entry points and can satisfy most requirements pertaining to searching and extracting data for a typical search/view application. There are, however, corner cases which may require more intricate extraction of particular information. On the other hand, there may also be situations where massive amounts of data need to be extracted for further, down-stream processing. The following discussion will touch on solutions for both of these cases.
...
The basic structure of the PostgreSQL database housing the XML content is best described as a collection of tables containing the raw XML of every patent document available in CLAIMS Direct. Each table is a section of the XML document, referred to as a container, and each table is named according to the XML container data it it contains, e.g.,
| | |
---|
xml.t_publication_reference | publication-reference | Document publication information |
xml.t_invention_title | invention-title | Document title(s) |
xml.t_claims | claims | Document claims |
The structure of each table is identical with the exception of xml.t_patent_document_values
which functions as a meta table.
| |
---|
<container>_id | serial primary key |
publication_id | integer not null |
modified_load_id | integer not null |
status | character(1) (v=valid XML, i=invalid XML) |
content | XML |
The XML is always in the content
(Postgresql PostgreSQL type: XML) column.
...
...
from xml.t_publication_reference |
...
where publication_id=xml.f_ucid2id('US-5551212-A'); |
...
...
<publication-reference fvid="71329085" |
...
...
...
...
...
number>5551212</doc-number> |
...
...
<date>
19960903
</date>
...
<date>19960903</date>
<lang>EN</lang> |
...
...
...
Info |
---|
The invaluable utility |
...
...
...
...
...
As mentioned above, all content
columns are of type XML and therefore the internal PostgreSQL xpath functionality can be used. There are a variety of utility functions provided by CLAIMS Direct that mimic DOM functionality, e.g.
| |
---|
string
xml.f_textContent(xpath::text,xml)
| |
Code Block |
---|
select xml.f_textContent('./document-id/lang', content) |
|
from xml.t_publication_reference |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
|
string
xml_f_getAttribute(att::text,xml)
| Extract single attribute value: |
| select xml.f_getAttribute('ucid', content) |
|
from xml.t_publication_reference |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
|
boolean
xml.f_nodeExists(xpath::text,xml)
| Test presence of a particular node: |
| select xml.f_nodeExists('//date', content) |
|
from xml.t_publication_reference |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
|
XML
xml.f_findNode(xpath::text,xml)
| Select a node (type XML) satisfying the given xpath expression: |
Code Block |
---|
| select xml.f_findNode('./main-classification', content) |
|
from xml.t_classification_national |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
<main-classification mxw-id="PCL893513162" |
|
>
053397
<>053397</main-classification> |
|
XML[]
xml.f_findNodes(xpath::text,xml)
| |
and an array of nodes satisfying the given xpath expression: |
Code Block |
---|
| select xml.f_findNodes('./classification-ipcr', content) |
|
from xml.t_classifications_ipcr |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
{
{
"<classification-ipcr ...", |
|
"<classification-ipcr ...", |
|
"<classification-ipcr ..." |
|
|
XML[]
xml.f_getElementsByTagName(name::text,xml))
| Select an array of nodes based on element name: |
| select xml.f_getElementsByTagName('classification-ipcr', content) |
|
from xml.t_classifications_ipcr |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
{
{
"<classification-ipcr ...", |
|
"<classification-ipcr ...", |
|
"<classification-ipcr ..." |
|
|
string
xml.f_nodeName(xml)
| |
context | select xml.f_nodeName( content ) |
|
from xml.t_invention_title |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
XML
XML[]
xml.f_childNodes(xml)
| |
| select xml.f_childNodes( content ) |
|
from xml.t_application_reference |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
----------------------------------------------------------------------------------- |
|
<document-id mxw-id="PAPP60364944" |
|
number>2582893
<number>2582893</doc-number> |
|
<date>
19930303
</date>
<date>19930303</date>
<lang>EN</lang> |
|
<document-id mxw-id="PAPP84571955" |
|
load-source="patent-office" |
|
number>08025828
<number>08025828</doc-number> |
|
<date>
19930303
</date>
<date>19930303</date>
<lang>EN</lang> |
|
|
string
xml.f_node2string(xml,indent)
| Concatenate all child text nodes: |
| select xml.f_node2string( content::text, |
|
where publication_id=xml.f_ucid2id('US-5551212-A'); |
|
----------------------------------------------------------------------------------- |
|
A
package
is formed by holding a nonextensible strip taut and wrapping it around the upper and/or lower periphery of the bundle and joining the ends of the strip segments e.g. by adhesive bonding. A projecting margin of the strip is folded to overlie the bundle or underlie the bundle so that an L-configuration is imparted to the frame formed around the bundle after the strip is applied thereto.
A package is formed by holding a nonextensible strip taut and wrapping it around the upper and/or lower periphery of the bundle and joining the ends of the strip segments e.g. by adhesive bonding. A projecting margin of the strip is folded to overlie the bundle or underlie the bundle so that an L-configuration is imparted to the frame formed around the bundle after the strip is applied thereto. |
|
Use Case: Creating relational view
of XML content
Although the underlying architecture of the Claims CLAIMS Direct postgreSQL PostgreSQL database is a data warehouse, creating normalized relational views into the XML content is easily achievable using the functions described above. Our example will be to create a relational view of all patent citations. This view will let us SELECT
based on criteria as well as being able to GROUP BY
and ORDER BY
. The first step is to define all the properties of a citation:
...
Now we can create a function that returns a TABLE
of these properties. We'll use publication_id
as input to the function and create it in a private schema
(mySchema).
...
|
create or replace function mySchema.f_rdb_citations( integer ) |
...
...
(
publication_id integer, |
...
ref_publication_id integer, |
...
...
published date,
...
published date,
country varchar(2), |
...
...
...
...
...
...
...
...
...
...
begin
...
-- select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') ); |
...
select x.publication_id, x.content into publication_id, v_content |
...
from xml.t_citations as x where x.publication_id= |
...
if
not found then
return
;
end
if
;
-- loop through all patent citations
foreach v_node in array
...
$1;
if not found then
return;
end if;
-- loop through all patent citations
for each v_node in array
( select xml.f_findNodes('//patcit', v_content) ) |
...
loop
select xml.f_getAttribute( |
...
'ucid', v_node ) into ref_ucid; |
...
...
select x.publication_id into ref_publication_id |
...
from xml.t_patent_document_values as |
...
x
where ucid=xml.f_getAttribute( |
...
...
...
);
select xml.f_getAttribute( |
...
'load-source', v_node ) into load_source; |
...
select xml.f_getAttribute('name', xml.f_findNode('./sources/source', v_node ) ) into source; |
...
...
-- loop through all forms of the document-id(s) |
...
...
array
( select xml.f_findNodes('./document-id', v_node) ) |
...
loop
select xml.f_getAttribute( |
...
'format', v_doc_node ) into format; |
...
select xml.f_textContent( |
...
'./country', v_doc_node ) into country; |
...
select xml.f_textContent( |
...
'./kind', v_doc_node ) into kind; |
...
select xml.f_textContent( |
...
'./date', v_doc_node ) into published; |
...
return
next;
end loop;
end loop;
end;
$BODY$
LANGUAGE plpgsql VOLATILE
COST
100
ROWS
1000
;
return next;
end loop;
end loop;
end;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100
ROWS 1000; |
Now, we test the results:
...
Code Block |
---|
|
select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') ) |
...
...
publication_id | ref_publication_id | |
...
...
ucid | published | country | kind | load_source | format | source |
...
----------------+--------------------+-------------------+------------+---------+------+-------------+--------+-------- |
...
160817649 | 5778519 | CA-2580978- |
...
...
| CA | A1 | docdb | epo | APP
160817649 | 127402642 | CN-201918032-U | 2011-08-03 |
...
| CN | U | docdb | epo | APP
160817649 | 75847915 | US-20020150866-A1 | |
...
...
| US | A1 | docdb | epo | APP
160817649 | 75947470 | US-20030060942-A1 | |
...
...
| US | A1 | docdb | epo | APP
160817649 | 76046244 | US-20030154010-A1 | |
...
...
| US | A1 | docdb | epo | APP |
Of course, GROUP BY
and ORDER BY
are available on the returned columns. In addition
...
to SELECTing citations for a single document, we can
...
also JOIN
to table expressions, e.g., assume we are interested in the top 5 cited countries from US grant publication date 2016-05-03:
|
select count(*) as n, t2.country as ref_country |
from xml.t_patent_document_values t1 |
inner join mySchema.f_rdb_citations (t1.publication_id) as |
t2
on t2
on (t1.publication_id=t2.publication_id) |
and t1.published='2016-05-03' |
limit
5
;
n |
14285
| WO
13313
| JP
7115
| EP
2554
|
14285 | WO
13313 | JP
7115 | EP
2554 | CN |