The data warehouse design is coupled tightly with the structure of the ST36-based XML document. The relational database is the open source PostgreSQL system and makes use of schemas to organize content and organizational structure inside the database. See http://www.postgresql.org for further information on PostgreSQL-specific functionality. The following table lists the active schemas and their roles in the data warehouse.

Schema	Description
work	Tables used for loading and merging raw data – these will not be described below
reporting	Statistics pertaining to loading, updating, and indexing
xml	The complete XML broken into tables based on the ST36 XML structure

Current Demos may not contain all of the schemas listed below. In addition, tables and schemas present in the current demo with the exception of reporting, work, and xml (described below), may be removed from the database without notice.

The following graphic is a tree view of the structure of a complete XML document. Each node represents a container whereby each leaf of the tree represents a table in the data warehouse (xml and work schemas). The naming pattern of tables follows the following rules:

dashes are replaced with underlines and
each leaf-node name is prefixed with t_, i.e., publication-reference =t_publication_reference.

The two exceptions to this are the non-XML table t_patent_document_values and the copyright node. The copyright container is stored separately in t_sys_data.

Schemas

Reporting

The reporting schema houses tables critical to data loading, update processes, and queues for optional indexing. Although most tables are for internal use, the table reporting.t_client_load_process contains interesting information for clients around which web status interfaces or triggers can easily be built.

Column	Type	Modifiers	Comment
client_load_process_id	serial	primary key	Table primary key
load_id	integer		The load identifer
load_source	text		Source data for load
url	text		URL of package to load
ndocs	integer		Number of documents contained in the load
entered_stamp	timestamp		Time entered for processing
completed_stamp	timestamp		Time loading completed
running_status	text		One of: downloading, unpacking, loading, merging, complete
completed_status	text		One of: success, failure

If you have installed the optional SOLR index, there is a companion table reporting.t_client_index_process that manages the indexing queue and has the exact structure of the reporting.t_client_load_process.

XML

The XML schema is the heart of the data warehouse and contains tables representing individual containers (sections) of the complete document stored in pure XML.

Table Structure–General

All tables, with the exception of xml.t_patent_document_values, follow the same structure.

Column	Type	Modifiers	Comment
<tbl>_id	serial	primary key	Table primary key
publication_id	integer	not null	Integer representation of the publication
modified_load_id	integer	not null	Internal load id used to manage data updates
status	char(1)		Validity of the XML fragment (v=valid, i=invalid)*
content	XML		The XML fragment

All XML stored in content columns is without exception well-formed but occasionally during processing, invalid XML is created. This is due to a variety of reasons, most predominantly, corrupt or invalid patent office data. In an effort to maintain the completeness of the publication, even invalid fragments are stored. If you wish to validate a complete document, simply filter out invalid fragments.

The table xml.t_patent_document_values is the master content table containing meta information of the document in standard relational database columns. This table is the basis for the overall container element patent-document.

Column	Type	Modifiers	Comment
patent_document_value_id	serial	primary key	Table primary key
publication_id	integer	not null	Integer representation of the publication
created_load_id	integer		Load id of document creation
modified_load_id	integer		Load id of last modification
deleted_load_id	integer		Load id when document was deleted
status	varchar(16)		Status of complete document (new\|update)
country	char(2)	not null	Country of publication
doc_number	varchar(32)	not null	Publication document number
kind	varchar(4)		Document kind code
lang	char(2)		Document publication language
ucid	varchar(32)		Concatenation of country-doc_number-kind
published	date		Date of publication
produced	date		Date of Alexandria XML production
withdraw	boolean		Withdrawn status
family_id	integer		DocDB-provided simple family identifier

User-Defined Functions

In addition to content columns from which XML content can easily be selected, there are also the following supplemental functions available for common data extraction tasks:

xml.f_patent_document(integer, text)

This function returns an entire XML document given a publication_id as first argument. The second argument is currently unused.

SELECT xml.f_patent_document(1, null);

xml.f_patent_document_s(integer, text)

Similar to the function above, this function returns a SET OF XML documents, one document per row. Arguments are identical to function above.

SELECT xml.f_patent_document_s(t.publication_id, null)
FROM xml.t_patent_document_values AS t
WHERE t.published='20100407';

xml.f_ucid2id(text)

This function takes a ucid (unique character ID) and converts it to an integer publication_id.

SELECT xml.f_patent_document(xml.f_ucid2id('US-5551212-A'), null);

Page tree

Data Warehouse Design

Reporting

XML

Table Structure–General

User-Defined Functions