Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

The data warehouse design is coupled tightly with the structure of the ST36-based XML document. The relational database is the open source PostgreSQL system and makes use of schemas to organize content and organizational structure inside the database. See http://www.postgresql.org for further information on PostgreSQL-specific functionality. The following table lists the active schemas and their roles in the data warehouse.

Schema

Description

work

Tables used for loading and merging raw data – these will not be described below

reporting

Statistics pertaining to loading, updating, and indexing

xml

The complete XML broken into tables based on the ST36 XML structure


Note

Current Demos may not contain all of the schemas listed below. In addition, tables and schemas present in the current demo with the exception of reporting, work, and xml (described below), may be removed from the database without notice.

 


The following graphic is a tree view of the structure of a complete XML document. Each node represents a container whereby each leaf of the tree represents a table in the data warehouse (xml and work schemas). The naming pattern of tables follows the following rules:

...

The two exceptions to this are the non-XML table t_patent_document_values and the copyright node. The copyright container is stored separately in t_sys_data.

Image RemovedImage Added

Schemas

Reporting

The reporting schema houses tables critical to data loading, update processes, and queues for optional indexing. Although most tables are for internal use, the table reporting.t_client_load_process contains interesting information for clients around which web status interfaces or triggers can easily be built.

Column

Type

Modifiers

Comment

client_load_process_id

serial

primary key

Table primary key

load_id

integer

 


The load identifer

load_source

text

 


Source data for load

url

text 


URL of package to load

ndocs

integer 


Number of documents contained in the load

entered_stamp

timestamp

 


Time entered for processing

completed_stamp

timestamp 


Time loading completed

running_status

text 


One of: downloading, unpacking, loading, merging, complete

completed_statustext 
One of: success, failure

If you have installed the optional SOLR index, there is a companion table reporting.t_client_index_process that manages the indexing queue and has the exact structure of the reporting.t_client_load_process.

...

Column

Type

Modifiers

Comment

<tbl>_id

serial

primary key

Table primary key

publication_id

integer

not null

Integer representation of the publication

modified_load_id

integer

not null

Internal load id used to manage data updates

status

char(1) 


Validity of the XML fragment (v=valid, i=invalid)*

content

XML 


The XML fragment


Info

All XML stored in content columns is without exception well-formed but occasionally during processing, invalid XML is created. This is due to a variety of reasons, most predominantly, corrupt or invalid patent office data. In an effort to maintain the completeness of the publication, even invalid fragments are stored. If you wish to validate a complete document, simply filter out invalid fragments. 

...

Column

Type

Modifiers

Comment

patent_document_value_id

serial

primary key

Table primary key

publication_id

integer

not null

Integer representation of the publication

created_load_idinteger 
Load id of document creation

modified_load_id

integer 


Load id of last modification

deleted_load_idinteger 
Load id when document was deleted

status

varchar(16)

 


Status of complete document (new|update)

country

char(2)

not null

Country of publication

doc_number

varchar(32)

not null

Publication document number

kind

varchar(4)

 


Document kind code

lang

char(2) 


Document publication language

ucid

varchar(32)

 


Concatenation of country-doc_number-kind

published

date

 


Date of publication

produced

date 


Date of Alexandria XML production

withdraw

boolean 


Withdrawn status

family_id

integer 


DocDB-provided simple family identifier

...

Code Block
languagesql
SELECT xml.f_patent_document(1, null);

...


xml.f_patent_document_s(integer, text)

...

Code Block
languagesql
SELECT xml.f_patent_document_s(t.publication_id, null)
FROM xml.t_patent_document_values AS t
WHERE t.published='20100407';

...


xml.f_ucid2id(text)

This function takes a ucid (unique character ID) and converts it to an integer publication_id.

...