Golem API documentation

golem

The Golem ontology parsing library.

This module contains the main class which parses Golem/CML dictionaries, as defined by the CML and Golem schemata, and allows you to use them to extract and convert information found in CML datafiles.

class golem.Dictionary(filename=None, asModel=False)

Main class for representing CML/Golem dictionaries.

Example of usage:

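>>> from StringIO import StringIO
>>> from lxml import etree  # assumption: the etree used below is lxml's
>>> from golem import Dictionary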
>>> dictionarystring = """<?xml version="1.0"?>
... <dictionary 
...   namespace="http://www.materialsgrid.org/castep/dictionary"
...   dictionaryPrefix="castep" 
...   title="CASTEP Dictionary"
...   xmlns="http://www.xml-cml.org/schema"
...   xmlns:h="http://www.w3.org/1999/xhtml/"
...   xmlns:cml="http://www.xml-cml.org/schema"
...   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
...   xmlns:golem="http://www.lexical.org.uk/golem"
...   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...   <entry id="xcFunctional" term="Exchange-Correlation Functional">
...     <annotation />
...     <definition>
...       The exchange-correlation functional used.
...     </definition>
...     <description>
...      <h:div class="dictDescription">
...         Available values for this are:
...         <h:ul>
...           <h:li>
...             <h:strong>LDA</h:strong>
...             , the Local Density Approximation
...           </h:li>
...           <h:li>
...             <h:strong>PW91</h:strong>
...             , Perdew and Wang's 1991 formulation
...           </h:li>
...           <h:li>
...             <h:strong>PBE</h:strong>
...             , Perdew, Burke and Ernzerhof's original GGA
...             functional
...           </h:li>
...           <h:li>
...             <h:strong>RPBE</h:strong>
...             , Hammer et al's revised PBE functional
...           </h:li>
...         </h:ul>
...       </h:div>
...     </description>
...     
...     <metadataList>
...       <metadata name="dc:author" content="golem-kiln" />
...     </metadataList>
...     <golem:xpath>/cml:cml/cml:parameterList[@dictRef="input"]/cml:parameter[@dictRef="castep:xcFunctional"]</golem:xpath>
...     <golem:template call="scalar" role="getvalue" binding="pygolem_serialization" />
...     <golem:template role="arb_to_input" binding="input" input="external">
...       <xsl:stylesheet version='1.0' 
...                       xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
...                       xmlns:cml='http://www.xml-cml.org/schema'>
...         <xsl:strip-space elements="*" />
...         <xsl:output method="text" />
...         <xsl:param name="p1" />
...         <xsl:template match="/">
...           <xsl:text>XC_FUNCTIONAL </xsl:text><xsl:value-of select="$p1" />      
...   </xsl:template>
...       </xsl:stylesheet>
...     </golem:template>
...     <golem:implements>convertibleToInput</golem:implements>
...     <golem:implements>value</golem:implements>
...     <golem:implements>absolute</golem:implements>
...     <golem:childOf>input</golem:childOf>
... 
...     <golem:possibleValues type="string">
...       <golem:enumeration>
...         <golem:value>LDA</golem:value>
...         <golem:value>PW91</golem:value>
...         <golem:value>PBE</golem:value>
...         <golem:value>RPBE</golem:value>
...         <golem:value>HF</golem:value>
...         <golem:value>SHF</golem:value>
...         <golem:value>EXX</golem:value>
...         <golem:value>SX</golem:value>
...         <golem:value>ZERO</golem:value>
...         <golem:value>HF-LDA</golem:value>
...         <golem:value>SHF-LDA</golem:value>
...         <golem:value>EXX-LDA</golem:value>
...         <golem:value>SX-LDA</golem:value>
...       </golem:enumeration>
...     </golem:possibleValues>
...   </entry>
... 
... <entry id="scalar" term="Scalar default call">
...     <annotation />
...     <definition />
...     <description />
...     <metadataList />
...     <golem:template role="getvalue" binding="pygolem_serialization">
...         <xsl:stylesheet version='1.0' 
...                 xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
...                 xmlns:cml='http://www.xml-cml.org/schema'
...                 xmlns:str="http://exslt.org/strings"
...                 xmlns:func="http://exslt.org/functions"
...                 xmlns:exsl="http://exslt.org/common"
...                 xmlns:tohw="http://www.uszla.me.uk/xsl/1.0/functions"
...                 extension-element-prefixes="func exsl tohw str"
...                 exclude-result-prefixes="exsl func tohw xsl str">
...         <xsl:output method="text" />
...   
...   
...   <func:function name="tohw:isAListOfDigits">
...     <!-- look only for [0-9]+ -->
...     <xsl:param name="x_"/>
...     <xsl:variable name="x" select="normalize-space($x_)"/>
...     <xsl:choose>
...       <xsl:when test="string-length($x)=0">
...         <func:result select="false()"/>
...       </xsl:when>
...       <xsl:when test="substring($x, 1, 1)='0' or
...                       substring($x, 1, 1)='1' or
...                       substring($x, 1, 1)='2' or
...                       substring($x, 1, 1)='3' or
...                       substring($x, 1, 1)='4' or
...                       substring($x, 1, 1)='5' or
...                       substring($x, 1, 1)='6' or
...                       substring($x, 1, 1)='7' or
...                       substring($x, 1, 1)='8' or
...                       substring($x, 1, 1)='9'">
...         <xsl:choose>
...           <xsl:when test="string-length($x)=1">
...             <func:result select="true()"/>
...           </xsl:when>
...           <xsl:otherwise>
...             <func:result select="tohw:isAListOfDigits(substring($x, 2))"/>
...           </xsl:otherwise>
...         </xsl:choose>
...       </xsl:when>
...       <xsl:otherwise>
...         <func:result select="false()"/>
...       </xsl:otherwise>
...     </xsl:choose>
...   </func:function>
... 
...   <func:function name="tohw:isAnInteger">
...     <!-- numbers fitting [\+-][0-9]+ -->
...     <xsl:param name="x_"/>
...     <xsl:variable name="x" select="normalize-space($x_)"/>
...     <xsl:variable name="try">
...       <xsl:choose>
...         <xsl:when test="starts-with($x, '+')">
...           <xsl:value-of select="substring($x,2)"/>
...         </xsl:when>
...         <xsl:when test="starts-with($x, '-')">
...           <xsl:value-of select="substring($x,2)"/>
...         </xsl:when>
...         <xsl:otherwise>
...           <xsl:value-of select="$x"/>
...         </xsl:otherwise>
...       </xsl:choose>
...     </xsl:variable>
...     <func:result select="tohw:isAListOfDigits($try)"/>
...   </func:function>
... 
...   <func:function name="tohw:isANumberWithoutExponent">
...     <!-- numbers fitting [\+-][0-9]+(\.[0-9]*) -->
...     <xsl:param name="x"/>
...     <xsl:choose>
...       <xsl:when test="contains($x, '.')">
...         <func:result select="tohw:isAnInteger(substring-before($x, '.')) and
...                              tohw:isAListOfDigits(substring-after($x, '.'))"/>
...       </xsl:when>
...       <xsl:otherwise>
...         <func:result select="tohw:isAnInteger($x)"/>
...       </xsl:otherwise>
...     </xsl:choose>
...   </func:function>
... 
...   <func:function name="tohw:isAnFPNumber">
...     <!-- Try and interpret a string as an exponential number -->
...     <!-- should only recognise strings of the form: [\+-][0-9]*\.[0-9]*([DdEe][+-][0-9]+)? -->
...     <xsl:param name="x"/>
...     <xsl:choose>
...       <xsl:when test="contains($x, 'd')">
...         <func:result select="tohw:isANumberWithoutExponent(substring-before($x, 'd')) and
...                              tohw:isAnInteger(substring-after($x, 'd'))"/>
...       </xsl:when>
...       <xsl:when test="contains($x, 'D')">
...         <func:result select="tohw:isANumberWithoutExponent(substring-before($x, 'D')) and
...                              tohw:isAnInteger(substring-after($x, 'D'))"/>
...       </xsl:when>
...       <xsl:when test="contains($x, 'e')">
...         <func:result select="tohw:isANumberWithoutExponent(substring-before($x, 'e')) and
...                              tohw:isAnInteger(substring-after($x, 'e'))"/>
...       </xsl:when>
...       <xsl:when test="contains($x, 'E')">
...         <func:result select="tohw:isANumberWithoutExponent(substring-before($x, 'E')) and
...                              tohw:isAnInteger(substring-after($x, 'E'))"/>
...       </xsl:when>
...       <xsl:otherwise>
...          <func:result select="tohw:isANumberWithoutExponent($x)"/>
...       </xsl:otherwise>
...     </xsl:choose>
...   </func:function>
...         
...   <xsl:template match="/">
...     <xsl:apply-templates />
...   </xsl:template>
...     
...   <xsl:template match="cml:scalar">
...     <xsl:variable name="value">
...       <xsl:choose>
...         <xsl:when test="tohw:isAnFPNumber(.)">
...           <xsl:value-of select="." />
...         </xsl:when>
...         <xsl:otherwise>
...           <xsl:text>"</xsl:text><xsl:value-of select="." /><xsl:text>"</xsl:text>
...         </xsl:otherwise>
...       </xsl:choose>
...     </xsl:variable>
...     <xsl:variable name="units">
...       <xsl:choose>
...         <xsl:when test="@units">
...           <xsl:text>"</xsl:text><xsl:value-of select="@units" /><xsl:text>"</xsl:text>
...         </xsl:when>
...         <xsl:otherwise>
...           <xsl:text>""</xsl:text>
...         </xsl:otherwise>
...       </xsl:choose>
...     </xsl:variable>
...     <xsl:text>[</xsl:text><xsl:value-of select="$value"/><xsl:text>,</xsl:text><xsl:value-of select="$units" /><xsl:text>]</xsl:text>
...   </xsl:template>
... </xsl:stylesheet>
...     </golem:template>
... 
...     <golem:template role="defaultoutput">
...       <xsl:stylesheet version='1.0' 
...                       xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
...                       xmlns:cml='http://www.xml-cml.org/schema'
...                       xmlns:str="http://exslt.org/strings"
...                       extension-element-prefixes="str"
...                       >
...         <xsl:output method="text" />
...         <xsl:param name="name" />
...         <xsl:param name="value" />
...         <xsl:template match="/">
...           <xsl:value-of select='$name' /><xsl:value-of select='$value' />
...         </xsl:template>
...       </xsl:stylesheet>
...     </golem:template>
...     <golem:seealso>gwtsystem</golem:seealso>
...   </entry>
... </dictionary>
... """
>>> d = Dictionary(StringIO(dictionarystring))
>>> xcf = d["{http://www.materialsgrid.org/castep/dictionary}xcFunctional"]
>>> cmlstr = """<?xml version="1.0" encoding="UTF-8"?>
... <?xml-stylesheet href="display.xsl" type="text/xsl"?>
... <cml convention="FoX_wcml-2.0" fileId="NaCl_00GPa.xml" version="2.4"
...   xmlns="http://www.xml-cml.org/schema"
...   xmlns:castep="http://www.materialsgrid.org/castep/dictionary"
...   xmlns:castepunits="http://www.materialsgrid.org/castep/units"
...   xmlns:cml="http://www.xml-cml.org/dict/cmlDict"
...   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
...   xmlns:dc="http://purl.org/dc/elements/1.1/title"
...   xmlns:units="http://www.uszla.me.uk/FoX/units"
...   xmlns:atomicUnits="http://www.xml-cml.org/units/atomic">
...   <metadataList title="Autocaptured metadata">
...     <metadata name="dc:date" content="2007-02-09"/>
...   </metadataList>
...   <parameterList dictRef="input" convention="Input Parameters">
...     <parameter dictRef="castep:xcFunctional"
...       name="Exchange-Correlation Functional">
...       <scalar dataType="xsd:string">PBE</scalar>
...     </parameter>
...   </parameterList>
... </cml>
... """
>>> tree = etree.parse(StringIO(cmlstr))
>>> xcfd = xcf.findin(tree)
>>> print len(xcfd)
1
>>> xcval = xcf.getvalue(xcfd[0])
>>> print xcf.getvalue(xcfd[0])
PBE
>>> # units are not defined on XCFunctional, so:
>>> print xcval.unit
golem:undefined
>>> # by convention
>>> print xcval.entry.definition
<BLANKLINE>
      The exchange-correlation functional used.
<BLANKLINE>
parsexml(filename, asModel=False)
Load and parse a CML dictionary.
serialize(ordering=None)
Serialize a dictionary back to XML.
class golem.Entry(d, ns, xml=None, asModel=False)

The Entry class represents an entry in a Golem/CML dictionary.

Entries have the following structure:

<entry id="template" term="Template entry">
  <annotation>
    <appinfo><!-- CML-specific machine-processable information --></appinfo>
  </annotation>
  <definition>Human-readable one-liner definition</definition>
  <description>Substantial human-readable documentation</description>
  <metadataList><!-- Dublin Core semantics -->
    <metadata name="dc:creator" content="Test Author" />
  </metadataList>
  <golem:xpath></golem:xpath>
  <golem:template role="role" binding="binding"> <!-- and optionally "@input" -->
  </golem:template>
  <golem:possibleValues type="DATATYPE">
    <golem:range>
      <golem:minimum>1</golem:minimum>
      <golem:maximum>100</golem:maximum>
    </golem:range>
    <!-- or -->
    <golem:enumeration>
      <golem:value>1</golem:value>
      <golem:value>2</golem:value>
      <golem:value>3</golem:value>
    </golem:enumeration>
  </golem:possibleValues> <!-- or matrix ... -->
  <golem:implements>otherEntry</golem:implements> <!-- times n -->
  <golem:synonym>synonymousEntry</golem:synonym> <!-- times n -->
  <golem:seealso>similarEntry</golem:seealso> <!-- times n -->
  <golem:childOf>parentEntry</golem:childOf> <!-- times n -->
</entry>
boundscheck(arb, ctype='')
Check that a given piece of data is of the type, and lies within the bounds, defined in this dictionary entry.
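
As a rough illustration of the kind of check described here, the sketch below tests a value against a <golem:possibleValues> fragment shaped like the ones in the entry template above. It is not Golem's implementation: the helper name within_bounds, the decision to compare range limits as floats, and the fact that the type attribute is ignored are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

GOLEM_NS = "http://www.lexical.org.uk/golem"  # namespace used in the dictionary example

# type="DATATYPE" is the placeholder from the entry template; this sketch ignores it.
RANGE_XML = """
<golem:possibleValues xmlns:golem="http://www.lexical.org.uk/golem" type="DATATYPE">
  <golem:range>
    <golem:minimum>1</golem:minimum>
    <golem:maximum>100</golem:maximum>
  </golem:range>
</golem:possibleValues>
"""

ENUM_XML = """
<golem:possibleValues xmlns:golem="http://www.lexical.org.uk/golem" type="string">
  <golem:enumeration>
    <golem:value>LDA</golem:value>
    <golem:value>PBE</golem:value>
  </golem:enumeration>
</golem:possibleValues>
"""


def within_bounds(possible_values_xml, value):
    """Hypothetical helper: True if value satisfies the declared range or enumeration."""
    root = ET.fromstring(possible_values_xml.strip())
    q = "{%s}" % GOLEM_NS
    rng = root.find(q + "range")
    if rng is not None:
        low = float(rng.findtext(q + "minimum"))
        high = float(rng.findtext(q + "maximum"))
        return low <= float(value) <= high
    enum = root.find(q + "enumeration")
    if enum is not None:
        allowed = [v.text for v in enum.findall(q + "value")]
        return str(value) in allowed
    # No constraint declared: treat every value as acceptable.
    return True
```

With these fragments, within_bounds(RANGE_XML, 50) and within_bounds(ENUM_XML, "PBE") succeed, while 101 and an unlisted string are rejected.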
dcall(template, arb)

Internal method (you’ll never call this directly); bounds-check a piece of data and template it into an associated <golem:template> defined in the dictionary. These are mapped onto Python methods named after the name of the <golem:template>.

In other words, this is where entry.to_value calls come from.

findin(*trees)
Find instances of this dictionary entry in a given ElementTree or set of ElementTrees. This version supplies new, rerooted ElementTrees, not just the old ElementTrees with a pointer to the right context - use findin_context for that.
findin_context(*trees)
Find instances of this dictionary entry in a (set of) ElementTrees or filenames. Returns a set of nodes pointing into the searched ElementTrees.
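
The difference between the two lookups can be pictured with the standard library alone. The sketch below does not call Golem; it uses xml.etree.ElementTree on a cut-down version of the CML example above to contrast nodes that still live in the searched tree ("in context") with copies that have been made the root of their own tree ("rerooted"). The variable names are illustrative, and Golem's actual return types may differ.

```python
import copy
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

doc = ET.ElementTree(ET.fromstring(
    '<cml xmlns="http://www.xml-cml.org/schema">'
    '<parameterList dictRef="input">'
    '<parameter dictRef="castep:xcFunctional"><scalar>PBE</scalar></parameter>'
    '</parameterList>'
    '</cml>'
))

path = ".//{%s}parameter[@dictRef='castep:xcFunctional']" % CML_NS

# "In context": the hits are nodes of the tree that was searched.
in_context = doc.getroot().findall(path)

# "Rerooted": each hit is copied into a fresh tree of which it is the root.
rerooted = [ET.ElementTree(copy.deepcopy(node)) for node in in_context]
```

A rerooted tree can be processed without any reference to the original document; an in-context node is the same object that sits inside the searched tree.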
getAllImplementations()
Recursively identify and return all entries which declare that they <golem:implements> the current class (and which are in currently-loaded dictionaries).
getChildren()
Recursively identify and return all entries which are declared <golem:childOf> the current concept, i.e. which only ever appear as child nodes of the (XML) node, or nodes, with which this dictionary entry is associated.
list_to_arbdict(l)

Map a matrix onto a dictionary for subsequent output using XSLT.

The algorithm used is:

  1. Check that the matrix is of the correct shape and within bounds.
  2. From left to right row-wise from the upper left, number off the matrix elements p1, p2, p3... pn (for an n-element matrix), and store these in a dictionary {"p1": p1, "p2": p2, ...}.
  3. Return the resulting dictionary.
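
Steps 2 and 3 above can be sketched in a few lines. This is an illustration rather than Golem's code: the function name matrix_to_params is hypothetical, and step 1 (the shape and bounds checks) is deliberately left out.

```python
def matrix_to_params(matrix):
    """Number the elements row by row and return {"p1": ..., "p2": ..., ...}."""
    # Flatten left to right within each row, rows taken top to bottom.
    flat = [element for row in matrix for element in row]
    return {"p%d" % (i + 1): element for i, element in enumerate(flat)}


params = matrix_to_params([[1, 2, 3], [4, 5, 6]])
```

For the 2x3 matrix shown, p1 to p3 hold the first row and p4 to p6 the second, matching the parameter names an XSLT template would declare with <xsl:param>.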
matrix_boundscheck(l)
Check that the elements of a given matrix have the type, and lie within the bounds, defined in the current dictionary entry.
matrix_coercelist(l)
Coerce a matrix into a list, left-to-right, top-to-bottom.
matrix_shapecheck(l)
Check that a given matrix is in, or can be coerced into, the shape defined in this dictionary entry.
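
One way to picture the coercion and shape checks is shown below. How an entry actually declares its shape is not covered here, so the sketch assumes a plain (rows, cols) pair; coerce_to_list and fits_shape are hypothetical names, not Golem methods.

```python
def coerce_to_list(matrix):
    """Flatten nested rows left-to-right, top-to-bottom; flat input passes through."""
    flat = []
    for item in matrix:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def fits_shape(matrix, rows, cols):
    """True if matrix already has rows x cols elements, or is a flat list that could."""
    nested = [item for item in matrix if isinstance(item, (list, tuple))]
    if nested:
        # Row-structured input: every item must be a row of the right width.
        return (len(nested) == len(matrix) == rows
                and all(len(row) == cols for row in nested))
    # Flat input can be coerced only if it has exactly rows * cols elements.
    return len(matrix) == rows * cols
```

Under these assumptions both [[1, 2], [3, 4]] and [1, 2, 3, 4] fit a 2x2 shape, while a ragged or short input does not.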
parsexml(x, asModel=False)

Load a dictionary entry from its XML representation.

arguments: (etree for the entry, parent dictionary object).

Set asModel to true if you’re using this dictionary as a model for building a new one: it stashes way more of the native XML in that case, allowing you to serialize it out directly into your new dictionary. At present, this is only used by the dictionary generator (bin/make_dictionary.py in your Golem distribution.)

serialize()
Write out this dictionary entry as XML.
with_predicate(predicate)

Set a predicate (condition) on a particular Entry instance.

This predicate will be honoured on subsequent calls to x.findin for entry x; it takes the form of an XPath function.

class golem.ImpOnlyEntry
Dictionary helper class: this is used to store information on entries which have been pointed to (by, say, <golem:implements>), but which haven’t themselves been parsed yet.
golem.loadDictionary(filename)

Load a dictionary from a default location on the filesystem.

On Windows, this is C:\cmldictionaries\ and must be changed by editing golem.py by hand; on Unix, it defaults to ~/.cmldictionaries/ but can be overridden by setting the environment variable CMLDICTIONARIES.
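
The Unix lookup order described above amounts to "environment variable first, home directory second". The sketch below shows that order only; dictionary_dir and dictionary_path are hypothetical helpers, and joining the filename onto the directory is an assumption about how the lookup is completed.

```python
import os


def dictionary_dir(environ=None):
    """Return CMLDICTIONARIES if set, otherwise ~/.cmldictionaries."""
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".cmldictionaries")
    return environ.get("CMLDICTIONARIES", default)


def dictionary_path(filename, environ=None):
    """Assumed behaviour: look for the named file inside the dictionary directory."""
    return os.path.join(dictionary_dir(environ), filename)
```

Passing a mapping instead of reading os.environ directly keeps the helper easy to test, for example dictionary_dir({"CMLDICTIONARIES": "/tmp/dicts"}).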

golem.setDataWarning(val)

Set whether warnings will be emitted when unit/type-bearing data is modified.

Default is True.

golem.setTypeWarning(val)

Set whether warnings will be emitted when a dictionary Entry without a defined type is used.

Default is True.

golem.helpers

golem.helpers.generics

golem.helpers.generics.hexstring()
Generate a random hex string based on nothing in particular.
golem.helpers.generics.print_rdf_rich(x, resource, property=None)

For a given Golem value x, with units x.unit and concept x.uri, from URI ‘resource’, produce an RDF/XML fragment of the form:

<rdf:Description rdf:about="resource">
  <dictionary:uri rdf:about="resource#fragment">
    <golem:value datatype="http://example.org/json/">JSON literal
    </golem:value>
    <golem:units>unit</golem:units>
  </dictionary:uri>
</rdf:Description>

golem.helpers.dataset

class golem.helpers.dataset.caching_dataset(cachefile, dbfile, dbformat, dictionaryfile, dictionarynamespace)

If you’re trying to build a program to interact with a large corpus of CML, this is a good place to start.

Arguments:

  • cachefile: file in which cached query results against this dataset are saved
  • dbfile: path to where the DB we’re querying resides
  • dbformat: which DB driver you’re using. You can plug in new databases here, but most of the time you’ll want golem.db.fs, the directory-full-of-files driver.
  • dictionaryfile: which CML/Golem dictionary to use for queries.
  • dictionarynamespace: Namespace of the above dictionary.
go(graphfile, output=sys.stdout, format='verbose', cache=None)

Run any attached queries or fits against this dataset and return the results.

graphfile, here, is the path for a Pelote plot, if you wish to make one; output is a file-like object to write output to, and format is the format in which to return results.

loadcache()
Load query cache as a pickle.
savecache()
Save query cache.
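
A pickle-backed cache of this kind can be sketched with the standard library. The layout of the cached data is not specified here, so the dictionary of results keyed by filename below is an assumption, as are the function names.

```python
import os
import pickle
import tempfile


def save_cache(cache, cachefile):
    """Write the cache object to cachefile as a pickle."""
    with open(cachefile, "wb") as handle:
        pickle.dump(cache, handle)


def load_cache(cachefile):
    """Return the pickled cache, or an empty dict when no cache file exists yet."""
    if not os.path.exists(cachefile):
        return {}
    with open(cachefile, "rb") as handle:
        return pickle.load(handle)


# Round trip through a temporary file; the cache contents are made up.
cachefile = os.path.join(tempfile.mkdtemp(), "query.cache")
save_cache({"NaCl_00GPa.xml": ["PBE"]}, cachefile)
restored = load_cache(cachefile)
```

As with any pickle, a cache file should only be loaded if it comes from a trusted source.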
class golem.helpers.dataset.dataset(dbfile, dbformat, dictionaryfile, dictionarynamespace)

Generic representation of a collection of CML data with an attached Golem query (see register) and, optionally, one or more functions to be fitted to the result of that query (see addfunction).

For most practical purposes, you’re going to want to use caching_dataset, which inherits from this, instead, with the golem.db.fs driver (dbformat in the class signature.)

addfunction(name, function, params)
Add a function you wish to fit to the data resulting from a query attached to this dataset.
dispatch(format, cache=None)
Run any fits or queries attached to this dataset, but do not return the results.
go(graphfile, output=sys.stdout, format='verbose', cache=None)
Run all queries and function fitting processes attached to this dataset, returning the results.
makeplot(format, graphfile)
Plot any queries and fitted functions attached to this dataset.
register(xconcepts, yconcepts, xpredicate=None, ypredicate=None, xreducer=None, yreducer=None)

Attach a query (consisting of a pair of lists of Golem concepts, each producing one datapoint per file) to this dataset.

A dataset may only have one query (and therefore set of fits) attached at once; attaching a new query through register loses any previously attached fits or retrieved queries.

golem.helpers.dict

class golem.helpers.dict.cmlconcept(keys, parentConcept)

Concept extracted from a corpus of CML data.

When analysing a corpus of CML data in order to produce a dictionary, every dictRef encountered is mapped to an instance of this class; therefore, this class contains helper methods to output

hasPayload()
Does this concept have a payload (is self.payload not None)?
isRelative()
Is this concept relatively-positioned?
setPayload(payloadtype)
Set the payload of this concept - the type of data, if any, it contains.
setRelative()

State that this concept is relatively, not absolutely, positioned.

A relatively-positioned concept occurs in more than one location within the documents in this corpus (as distinguished by XPath expressions); an absolutely-positioned concept always occurs in the same place.
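
The distinction can be made concrete by recording the path at which each dictRef occurs and counting the distinct paths. The sketch below does this with xml.etree.ElementTree on a made-up, namespace-free document; castep:energy is a hypothetical dictRef, and the helper names are not part of Golem.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def dictref_paths(root):
    """Map each dictRef to the set of element paths at which it occurs."""
    found = defaultdict(set)

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        ref = node.get("dictRef")
        if ref is not None:
            found[ref].add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return found


def is_relative(paths):
    """A concept seen at more than one path is relatively positioned."""
    return len(paths) > 1


root = ET.fromstring(
    "<cml>"
    "<parameterList dictRef='input'>"
    "<parameter dictRef='castep:xcFunctional'/>"
    "</parameterList>"
    "<module><property dictRef='castep:energy'/></module>"
    "<module><module><property dictRef='castep:energy'/></module></module>"
    "</cml>"
)
paths = dictref_paths(root)
```

Here castep:xcFunctional is found at a single path and so counts as absolute, while castep:energy appears at two different depths and counts as relative.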

xpathfragment(id=True, title=True)

Calculate an XPath expression which identifies an XML node corresponding to the current concept.

This XPath expression ignores the document context in which the concept was found - i.e., "//%s" % (xpath) would find all instances of this concept within a given XML document.
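
To show what a context-free search looks like in practice, the sketch below prefixes a fragment with a descendant search and runs it with xml.etree.ElementTree. The fragment is a guess at the kind of expression xpathfragment might return, not its documented output, and ElementTree's XPath subset needs ".//" where full XPath would accept "//".

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

root = ET.fromstring(
    '<cml xmlns="%(ns)s">'
    '<module><parameter dictRef="castep:xcFunctional"/></module>'
    '<parameterList><parameter dictRef="castep:xcFunctional"/></parameterList>'
    '</cml>' % {"ns": CML_NS}
)

# Hypothetical fragment: element name plus a dictRef predicate, no parent path.
fragment = "{%s}parameter[@dictRef='castep:xcFunctional']" % CML_NS

# The descendant search finds the concept wherever it sits in the document.
matches = root.findall(".//%s" % fragment)
```

Both parameter elements are matched even though they have different parents, which is the sense in which the fragment ignores document context.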

class golem.helpers.dict.concept(keys, parentConcept)
Abstract representation of a ‘concept’ within a corpus of XML data; i.e., a common structure used to denote a semantically-meaningful unit within a collection of semi-structured documents.
class golem.helpers.dict.conceptset(keys)

A collection of the concepts found in a corpus of documents we’re analysing.

addconcept(concept)
Add a concept to this conceptset.
golem.helpers.dict.make(filenames, namespace, prefix, title, groupings, model=None, inputfn=None, use_id=True, use_title=True)

Build a dictionary.

Arguments:

  • filenames - list of CML files comprising the corpus to be analysed.
  • namespace - namespace of the dictionary
  • prefix - short prefix to be used within the dictionary and namespace declaration
  • title - Title of the dictionary (eg ‘CASTEP dictionary’)
  • groupings - Extra groupings to be determined for entries. If the XPath for this entry contains the key, then this entry is a <golem:instanceOf> the grouping.
  • model - Optionally, a dictionary to copy definitions/descriptions/terms from.
  • inputfn - Optionally, an input file containing extra dictionary entries, not found in the corpus supplied, which should be added to the dictionary.
  • use_id/use_title - distinguish concepts by unique ID/title as well as dictRef.
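
The groupings rule above ("if the XPath contains the key") is a substring test. The sketch below assumes groupings is a mapping from key substring to grouping name, which is a guess at its structure; groupings_for is a hypothetical helper and the keys are made up. The XPath is the one from the xcFunctional entry.

```python
def groupings_for(xpath, groupings):
    """Return, sorted, the groupings whose key occurs somewhere in the XPath."""
    return sorted(name for key, name in groupings.items() if key in xpath)


xpath = ('/cml:cml/cml:parameterList[@dictRef="input"]'
         '/cml:parameter[@dictRef="castep:xcFunctional"]')

# Assumed structure: substring to look for -> name of the grouping.
groupings = {'@dictRef="input"': "input", '@dictRef="output"': "output"}

matched = groupings_for(xpath, groupings)
```

Only the "input" key occurs in this XPath, so the entry would be assigned to that grouping alone.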
golem.helpers.dict.parsecmlelement(element, parent)
Parse a chunk of CML representing a concept to an instance of cmlconcept.
golem.helpers.dict.parsecmltree(element, conceptset, parent=None)
Parse an entire CML tree, recursively, parsing any concepts found and adding them to the supplied conceptset.
golem.helpers.dict.print_dictfooter()
Return the common footer - including data-parsing XSLT stylesheets - for a CML/Golem dictionary.
golem.helpers.dict.print_dictheader(ns, prefix, title)
Return the common header for a CML/Golem dictionary with given namespace, short name, and title.
golem.helpers.dict.print_dictionary(ns, prefix, title, concepts, groupings, model=None, inputdict=None, use_id=True, use_title=True)

Write out a CML/Golem dictionary, using model as a source of definitions, descriptions and terms, containing definitions for the entries in concepts and the relationships between concepts defined in groupings.

prefix is the short name to be used in the namespace declaration; the namespace is ns, and the dictionary will be entitled title.

golem.helpers.dict.print_entry(ns, prefix, c, concepts, groupings, model=None, inputdict=None, use_id=True, use_title=True)
Dump a given concept in CML/Golem dictionary format.
golem.helpers.dict.print_tree(concepttree)
Dump a tree of concepts found in a corpus of documents as a GraphViz .dot file.
golem.helpers.dict.xpath_concept(c, concepts, id=True, title=True)
Calculate the XPath for a given concept in a given corpus of documents, by backtracking over its parents to determine the longest common - i.e., most specific - path which captures all instances of it within the corpus.
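
The backtracking idea can be illustrated on plain lists of path steps: walk backwards from each instance towards the root and keep the steps on which every instance agrees. This is a simplified sketch, not the library's algorithm; common_trailing_path and the sample paths are hypothetical, and predicates such as dictRef, id and title are ignored.

```python
def common_trailing_path(instance_paths):
    """Longest run of trailing steps shared by every instance path."""
    shared = []
    # Compare the last steps first, then the parents, then the grandparents...
    for steps in zip(*(reversed(path) for path in instance_paths)):
        if len(set(steps)) != 1:
            break
        shared.append(steps[0])
    return list(reversed(shared))


instances = [
    ["cml", "module", "propertyList", "property"],
    ["cml", "module", "module", "propertyList", "property"],
]
suffix = common_trailing_path(instances)
expression = "//" + "/".join(suffix)
```

The two instances agree on their last three steps, so the most specific expression covering both is //module/propertyList/property.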

golem.helpers.function

golem.helpers.output

golem.helpers.output.csv(data)
Outputs x,y data in .csv format.
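
The shape of the data argument is not described here. Assuming it is a sequence of (x, y) pairs, a comparable output can be produced with the standard csv module, as in this sketch (points_to_csv is a hypothetical name, not the Golem function).

```python
import csv
import io


def points_to_csv(points):
    """Render (x, y) pairs as CSV text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for x, y in points:
        writer.writerow([x, y])
    return buffer.getvalue()


text = points_to_csv([(0, 1.5), (10, 2.25)])
```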
class golem.helpers.output.pelote(title=None)

Represents a Pelote graph.

addaxis(orientation, position=None, title=None, numticks=None, ticks=None)
Attach an axis to a Pelote plot.
addpointlist(data=None)
Attach a pointlist to a Pelote plot.
addrange(floorX=None, floorY=None, ceilingX=None, ceilingY=None)
Attach a range to a Pelote plot.
serialize()
Serialize a Pelote plot file as an ElementTree.
write(f)

Serialize and write out a Pelote plot to a file or file-like object.

f can be either a string or a file-like object. If f is a string, it is taken to be the filename to use.
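
Accepting either a filename or a file-like object is a common pattern; a minimal sketch of it is given below. It writes plain text rather than a real Pelote plot, and write_text is a hypothetical helper, not the method documented above.

```python
import io
import os
import tempfile


def write_text(f, text):
    """Write text to f, which may be a filename or an open file-like object."""
    if isinstance(f, str):
        # A string is taken to be the name of the file to create.
        with open(f, "w") as handle:
            handle.write(text)
    else:
        f.write(text)


# File-like object: the caller keeps control of the stream.
buffer = io.StringIO()
write_text(buffer, "example output")

# Filename: the helper opens and closes the file itself.
path = os.path.join(tempfile.mkdtemp(), "plot.xml")
write_text(path, "example output")
```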

golem.db

golem.db.resultlist

Golem resultlist class.

Ties a series of results to the file they came from.

>>> x = resultlist(range(10), filename="test.xml")
>>> print x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> print x.filename
test.xml
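
The behaviour shown in this doctest is what a small list subclass would give. The class below is a hypothetical stand-in written for illustration, not the Golem source.

```python
class ResultList(list):
    """A list that remembers which file its results came from."""

    def __init__(self, items=(), filename=None):
        list.__init__(self, items)
        self.filename = filename


x = ResultList(range(3), filename="test.xml")
```

Because it subclasses list, such an object prints and compares like an ordinary list while still carrying its filename attribute.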