Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The CLAIMS Direct Web Services (CDWS) offer a variety of entry points into both the data warehouse and SOLR Solr index. These are mid-to-high-level entry points and can satisfy most requirements pertaining to searching and extracting data for a typical search/view application. There are, however, corner cases which may require more intricate extraction of particular information. On the other hand, there may also be situations where massive amounts of data need to be extracted for further, down-stream processing. The following discussion will touch on solutions for both of these cases.

...

The basic structure of the PostgreSQL database housing the XML content is best described as a collection of tables containing the raw XML of every patent document available in CLAIMS Direct. Each table is a section of the XML document, referred to as a container, and each table is named according to the XML container data it contains, e.g.,

Table
Container
Description
xml.t_publication_referencepublication-referenceDocument publication information
xml.t_invention_titleinvention-titleDocument title(s)
xml.t_claimsclaimsDocument claims

The structure of each table is identical with the exception of xml.t_patent_document_values which functions as a meta table.

Column
Type
<container>_idserial primary key
publication_idinteger not null
modified_load_idinteger not null
statuscharacter(1) (v=valid XML, i=invalid XML)
contentXML

...


The XML is always in the content (PostgreSQL type: XML) column.

Code Block
language

...

sql
select content

...


from xml.t_publication_reference

...


where publication_id=xml.f_ucid2id('US-5551212-A');

...


--------

...


<publication-reference fvid="71329085"

...

 ucid="US-5551212-A">

...


   <document-id>

...


    <country>US</country>

...


    <doc-

...

number>5551212</doc-number>

...


    <kind>A</kind>

...

    <date>19960903</date>

...


    <date>19960903</date>
    <lang>EN</lang>

...


  </document-id>

...


</publication-reference>

...

 

...

 

Info

The invaluable utility

...

function xml.f_ucid2id

...

takes a ucid

...

and returns

...

a publication_id.

...

Extracting Pieces of the XML Content

As mentioned above, all content columns are of type XML and therefore the internal PostgreSQL xpath functionality can be used. There are a variety of utility functions provided by CLAIMS Direct that mimic DOM functionality, e.g.

Function
Example

string

xml.f_textContent(xpath::text,xml)

Extract text of node:

 
Code Block
select xml.f_textContent('./document-id/lang', content)

from xml.t_publication_reference

where publication_id=xml.f_ucid2id('US-5551212-A');

----
select

EN

string

xml_f_getAttribute(att::text,xml)

Extract single attribute value:

 

Code Block
languagesql
select xml.f_getAttribute('ucid', content)

from xml.t_publication_reference

where publication_id=xml.f_ucid2id('US-5551212-A');

----

US-5551212-A

boolean

xml.f_nodeExists(xpath::text,xml)

Test presence of a particular node:

Code Block
 
languagesql
select xml.f_nodeExists('//date', content)

from xml.t_publication_reference

where publication_id=xml.f_ucid2id('US-5551212-A');

----

t

XML

xml.f_findNode(xpath::text,xml)

Select a node (type XML) satisfying the given xpath expression:

Code Block
 
languagesql
select xml.f_findNode('./main-classification', content)

from xml.t_classification_national

where publication_id=xml.f_ucid2id('US-5551212-A');

----
 

 <main-classification mxw-id="PCL893513162"
 
 load-source="mcf"
>053397<
>053397</main-classification>

XML[]

xml.f_findNodes(xpath::text,xml)

Select an array of nodes satisfying the given xpath expression:

Code Block
language
 
sql
select xml.f_findNodes('./classification-ipcr', content)

from xml.t_classifications_ipcr

where publication_id=xml.f_ucid2id('US-5551212-A');

----
{
  

{
  "<classification-ipcr ...",
  

  "<classification-ipcr ...",
  

  "<classification-ipcr ..."

}

XML[]

xml.f_getElementsByTagName(name::text,xml))

Select an array of nodes based on element name:

 
Code Block
languagesql
select xml.f_getElementsByTagName('classification-ipcr', content)

from xml.t_classifications_ipcr

where publication_id=xml.f_ucid2id('US-5551212-A');

----
{
  

{
  "<classification-ipcr ...",
  

  "<classification-ipcr ...",
  

  "<classification-ipcr ..."

}

 

 

string

xml.f_nodeName(xml)

Select the name of the content node:

Code Block
language
 
sql
select xml.f_nodeName( content )

from xml.t_invention_title

where publication_id=xml.f_ucid2id('US-5551212-A');

-----------------
 

 invention-title

 

 

 

XML[]

xml.f_childNodes(xml)

Select all child nodes:

Code Block
 
languagesql
select xml.f_childNodes( content )

from xml.t_application_reference

where publication_id=xml.f_ucid2id('US-5551212-A');

-----------------------------------------------------------------------------------
 

 <document-id mxw-id="PAPP60364944"
 
 load-source="docdb"
 
 format="epo">
   

   <country>US</country>
   

   <doc-
number>2582893<
number>2582893</doc-number>
   

   <kind>A</kind>
   <date>19930303</date>
   

   <date>19930303</date>
   <lang>EN</lang>
 

 </document-id>
 

 <document-id mxw-id="PAPP84571955"
 
 load-source="patent-office"
 
 format="original">
   

   <country>US</country>
   

   <doc-
number>08025828<
number>08025828</doc-number>
   <date>19930303</date>
   

   <date>19930303</date>
   <lang>EN</lang>
 

 </document-id>

string

xml.f_node2string(xml,indent)

Concatenate all child text nodes:

Code Block
language
 
sql
select xml.f_node2string( content::text,
 
 1
 
 )

from xml.t_abstract

where publication_id=xml.f_ucid2id('US-5551212-A');

-----------------------------------------------------------------------------------
 package is formed by holding a nonextensible strip taut and wrapping it around the upper and/or lower periphery of the bundle and joining the ends of the strip segments e.g. by adhesive bonding. A projecting margin of the strip is folded to overlie the bundle or underlie the bundle so that an L-configuration is imparted to the frame formed around the bundle after the strip is applied thereto.

...


 A package is formed by holding a nonextensible strip taut and wrapping it around the upper and/or lower periphery of the bundle and joining the ends of the strip segments e.g. by adhesive bonding. A projecting margin of the strip is folded to overlie the bundle or underlie the bundle so that an L-configuration is imparted to the frame formed around the bundle after the strip is applied thereto.


Use Case: Creating relational view of XML content

...

Now we can create a function that returns a TABLE of these properties. We'll use publication_id as input to the function and create it in a private schema (mySchema).

Code Block
language

...

sql
create or replace function mySchema.f_rdb_citations( integer )

...


  returns table

...

 (
    publication_id integer,

...


    ref_publication_id integer,

...


    ref_ucid varchar(64),

...

    published date,

...


    published date,
    country varchar(2),

...


    kind varchar(2),

...


    load_source varchar(32),

...


    format varchar(32),

...


    source varchar(32)

...


  ) as

...


$BODY$

...


declare v_content xml;

...


        v_node xml;

...


        v_doc_node xml;

...

begin
 

begin
 
-- example usage:

...


-- select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') );

...


  select x.publication_id, x.content into publication_id, v_content

...


    from xml.t_citations as x where x.publication_id=

...

  if not found then
    return;
  end if;
  
  -- loop through all patent citations
  for each v_node in array

...

$1;
  if not found then
    return;
  end if;
  
  -- loop through all patent citations
  for each v_node in array
    ( select xml.f_findNodes('//patcit', v_content) )

...

 loop
    select xml.f_getAttribute(

...

 'ucid', v_node ) into ref_ucid;

...

 

...


 
    select x.publication_id into ref_publication_id

...


      from xml.t_patent_document_values as

...

 x
      where ucid=xml.f_getAttribute(

...

 'ucid', v_node );

...

 
 

...


 
 
    select xml.f_getAttribute(

...

 'load-source', v_node ) into load_source;

...


    select xml.f_getAttribute('name', xml.f_findNode('./sources/source', v_node ) ) into source;

...

 

...


 
    -- loop through all forms of the document-id(s)

...


    for each v_doc_node in

...

 array
      ( select xml.f_findNodes('./document-id', v_node) )

...

 loop
      select xml.f_getAttribute(

...

 'format', v_doc_node ) into format;

...


      select xml.f_textContent(

...

 './country', v_doc_node ) into country;

...


      select xml.f_textContent(

...

 './kind', v_doc_node ) into kind;

...


      select xml.f_textContent(

...

 './date', v_doc_node ) into published;

...

      return next;
    end loop;
  end loop;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100
  ROWS 1000;

 


      return next;
    end loop;
  end loop;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100
  ROWS 1000;


Now, we test the results:

Code Block
language

...

sql
select * from mySchema.f_rdb_citations( xml.f_ucid2id('US-9240001-B2') )

...

 limit 5;

...


 publication_id | ref_publication_id |

...

     ref_

...

ucid      | published  | country | kind | load_source | format | source

...


----------------+--------------------+-------------------+------------+---------+------+-------------+--------+--------

...


      160817649 |            5778519 | CA-2580978-

...

A1     | 2006-03-30

...

 | CA      | A1   | docdb       | epo    | APP
      160817649 |          127402642 | CN-201918032-U    | 2011-08-03

...

 | CN      | U    | docdb       | epo    | APP
      160817649 |           75847915 | US-20020150866-A1 |

...

 2002-10-17

...

 | US      | A1   | docdb       | epo    | APP
      160817649 |           75947470 | US-20030060942-A1 |

...

 2003-03-27

...

 | US      | A1   | docdb       | epo    | APP
      160817649 |           76046244 | US-20030154010-A1 |

...

 2003-08-14

...

 

 | US      | A1   | docdb       | epo    | APP


Of course, GROUP BY and ORDER BY are available on the returned columns. In addition

...

to SELECTing citations for a single document, we can also JOIN to table expressions, e.g., assume we are interested in the top 5 cited countries from US grant publication date 2016-05-03:


Code Block
language
 
sql
select count(*) as n, t2.country as ref_country

from xml.t_patent_document_values t1
  

  inner join mySchema.f_rdb_citations (t1.publication_id) as
t2    on
 t2
    on (t1.publication_id=t2.publication_id)

where t1.country='US'
  

  and t1.published='2016-05-03'

group by t2.country

order by n desc
limit 5;
   n    |

limit 5;
   n    | ref_country

--------+---------
 

 221800
 
 | US
  14285 | WO
  13313 | JP
   7115 | EP
   2554 |

  14285 | WO
  13313 | JP
   7115 | EP
   2554 | CN