Developing CML/Golem dictionaries

Adding CML markup to your code

The best way to add CML output to your Fortran program is to use the FoX library. Documentation on FoX is beyond the scope of this document, but can be found on the CMLComp.org wiki and the FoX website, along with a tutorial on how to mark up a code using FoX and WCML.

We will assume your code was marked up using FoX, or (in general) that the format you have used for the following tags is broadly similar to that used by FoX’s WCML library:

  • <parameter> and <parameterList>
  • <property> and <propertyList>
  • <atomArray>
  • <scalar>
  • <metadata>
  • <matrix>
  • <lattice>
  • <cellParameter>
  • <array>

If you’re using FoX, this is already true; if you’re not, for more details of how we recommend you use these CML elements, see the CMLComp website at http://cmlcomp.org/.

WCML does contain some output routines, in particular those relating to <kpoint>, <kpointList> and <band>, which Golem cannot yet understand automatically. There are efforts to standardise the usage of these tags, but how best to represent these concepts is still under discussion. It will be possible to add support for these tags later; in the meantime, later in this manual you will find documentation on how to extract data from these concepts using XML/XPath methods once you have located them using Golem.

Producing a CML/Golem dictionary

Your Golem distribution ships with a program called make_dictionary. This scans output files from your code for the above tags, works out where they are in the file in terms of XPath expressions, and puts in a mechanism for reading the data found in the file. (In brief: each of the above tags is associated with an XSLT stylesheet which converts the CML to a JSON object, which is then converted into a Python object by the Golem library.)

This associates each CML dictionary reference with its location in the file. If the reference is on a value-bearing element (any of the elements listed above except parameterList and propertyList), make_dictionary also associates the right stylesheet with it so that it can be read.
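The scanning step can be pictured with a short Python sketch. This is not make_dictionary’s actual implementation: the sample document, the megasim prefix, and the collect_dictrefs helper are all hypothetical, and the recorded paths are simplified element paths rather than the XPath expressions Golem itself writes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample standing in for real program output.
SAMPLE = """<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:megasim="http://www.example.com/megasim/dictionary/">
  <parameterList dictRef="megasim:input">
    <parameter dictRef="megasim:cutoff">
      <scalar dataType="xsd:double" units="units:eV">500.0</scalar>
    </parameter>
  </parameterList>
  <propertyList>
    <property dictRef="megasim:energy">
      <scalar dataType="xsd:double" units="units:eV">-12.5</scalar>
    </property>
  </propertyList>
</cml>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def collect_dictrefs(root):
    """Map each dictRef value to the set of element paths where it occurs."""
    found = defaultdict(set)

    def walk(element, path):
        here = path + "/" + local_name(element.tag)
        ref = element.get("dictRef")
        if ref is not None:
            found[ref].add(here)
        for child in element:
            walk(child, here)

    walk(root, "")
    return found


locations = collect_dictrefs(ET.fromstring(SAMPLE))
for ref in sorted(locations):
    print(ref, sorted(locations[ref]))
```

Running the same collection over many output files and merging the results is why broad coverage of your code’s output matters: a dictRef that never appears in the scanned files cannot be recorded.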

So, as far as reading the output of your code goes, you can generate your dictionary automatically, as long as you can generate CML which completely covers the potential range of output of your code. If you have an automated test suite, run it and collate the CML output - that makes a very, very good start. If not, rerunning a large number of simulations which exercise every part of the codebase is a good idea: this is the approach taken with the CASTEP dictionary.

Using make_dictionary

Like the other command-line tools which ship with Golem, make_dictionary comes with a help message:

$ make_dictionary --help
usage: make_dictionary options file1.xml [file2.xml ...]

options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -m FILE, --model-dictionary=FILE
                        Model dictionary to incorporate
  -i FILE, --input-config=FILE
                        Input configuration file
  -o FILE, --output=FILE
                        Output filename (defaults to stdout)
  -p PREFIX, --prefix=PREFIX
                        Dictionary prefix
  -n NAMESPACE, --namespace=NAMESPACE
                        Dictionary namespace
  -t TITLE, --title=TITLE
                        Dictionary title
  -l, --use-title       Use titles to distinguish between potential concepts?
  -d, --use-id          Use IDs to distinguish between potential concepts?

This, admittedly, does look quite intimidating, but let’s take a concrete example.

Imagine a new simulation code, CMLized using the instructions above.

Firstly, one needs to enter the short name - something easy to remember, say megasim - and the namespace for the dictionary, which was picked when CMLizing it. Let’s say that it has been decided that its dictionary namespace should be http://www.example.com/megasim/dictionary/.

A quick aside on choosing namespaces: this namespace should really be unique among all CML codes, so make sure to pick a URL in a bit of webspace you control; there doesn’t actually have to be anything there if you go there in a Web browser, though. The actual rules of XML namespaces are very, very complex, but if you think of it in this context as a unique name identifying your dictionary, you’ll be OK.
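As a rough illustration of how a namespace acts as a unique name, the Python sketch below resolves the prefix of a dictRef to its namespace URI. The sample document and the prefix_map and expand_dictref helpers are hypothetical and are not part of Golem; only the standard library is used.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical CML fragment; only the namespace declarations matter here.
SAMPLE = b"""<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:megasim="http://www.example.com/megasim/dictionary/">
  <parameter dictRef="megasim:cutoff"/>
</cml>"""


def prefix_map(xml_bytes):
    """Collect the prefix -> namespace URI declarations in a document."""
    mapping = {}
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        mapping[prefix] = uri
    return mapping


def expand_dictref(dictref, mapping):
    """Split 'prefix:term' and replace the prefix with its namespace URI."""
    prefix, term = dictref.split(":", 1)
    return mapping[prefix], term


namespaces = prefix_map(SAMPLE)
print(expand_dictref("megasim:cutoff", namespaces))
```

Namespace URIs are compared as plain strings, so the same URI with and without a trailing slash names two different dictionaries. Whichever form you pick, use it identically everywhere.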

Let us write the dictionary into a file called megasimDict.xml, with the title MegaSim Dictionary. We will assume that dictRefs have been added to every concept you want to read, so we don’t have to resort to using title or id to distinguish between the pieces of CML representing those concepts. The title attribute is designed for people rather than machines to read: as such, using the value of title to distinguish between general concepts is something of an abuse of its intended usage. Only one element with a given id may occur within a document, and as such id should not be used to identify any concept which could occur more than once in a document. Thus, for consistency’s sake, if nothing else, it is better to use dictRef everywhere. The WCML library enforces this restriction.

Now we collect a set of CML output from MegaSim: if you have a test suite, run the CMLized code over it and collect all the outputs, putting them in the current working directory. We will assume their filenames all end in .cml. Once that has been done, the following command line:
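To see why id is a poor way to identify a concept, consider a property that is written out more than once in a run. The Python sketch below uses a hypothetical fragment that deliberately repeats an id (ElementTree does not validate uniqueness, so it still parses); the repeated_ids helper is an illustration, not a Golem function.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical output in which the same concept is reported twice.
SAMPLE = """<cml xmlns="http://www.xml-cml.org/schema">
  <property id="energy" title="Total energy" dictRef="megasim:energy"/>
  <property id="energy" title="Total energy" dictRef="megasim:energy"/>
  <property id="volume" title="Cell volume" dictRef="megasim:volume"/>
</cml>"""


def repeated_ids(root):
    """Return the id values that occur on more than one element."""
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


print(repeated_ids(ET.fromstring(SAMPLE)))
```

The repeated id breaks the one-per-document rule, whereas a dictRef may legitimately appear as many times as the concept does.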

$ make_dictionary -o megasimDict.xml -p megasim \
   -n "http://www.example.com/megasim/dictionary/" \
   -t "MegaSim Dictionary" *.cml

will build you a CML/Golem dictionary. It’ll walk over the supplied CML files, identify what, where and how each dictRef is used, and write the resulting dictionary to megasimDict.xml.

Trying it for yourself

If you’ve downloaded the Golem source distribution, change directory into the directory where you’ve unzipped it; it has a subdirectory docs/examples/rmcprofile. Change into that directory, and run the following:

$ make_dictionary -o rmcprofileDict.xml -p rmcprofile \
    -n "http://www.esc.cam.ac.uk/rmcprofile" \
    -t "rmcprofile dictionary" ag3cocn6_300k.xml

This will write a dictionary, rmcprofileDict.xml, in that directory: take a look at it (and ag3cocn6_300k.xml) to see how dictRefs in the CML get turned into <entry> elements in the dictionary.

Improving the dictionary

From here, there are four major things you can do to improve the dictionary:

  • add definitions and descriptions to document the terms in the dictionary
  • add further information to annotate the types of data a term can return
  • add stylesheets to the dictionary to (for example) produce code input
  • add information on the relationships between concepts

Each of these is beyond the scope of this section (they are covered in the next section). However, if you have a dictionary to which these have been added and you wish to recompile it, you can copy this information across into your new dictionary by using your old dictionary as a model. Pass it to make_dictionary with the -m flag:

$ make_dictionary -m old_dictionary.xml -o new_dictionary.xml [...] *.cml

If a term occurs in both dictionaries, any definitions, descriptions, or <golem:possibleValues> defined in the model dictionary are merged into the new one. This is particularly useful if you change the document structure of the CML you output, but you don’t want to lose the extra information you’ve added to your old dictionary.
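The idea behind the merge can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the code make_dictionary runs: the two dictionaries are hypothetical, XML namespaces are omitted, and only <definition> and <description> children are carried over.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified dictionaries (no namespaces).
OLD = """<dictionary>
  <entry id="cutoff" term="cutoff">
    <definition>Plane-wave cutoff energy</definition>
  </entry>
  <entry id="obsolete" term="obsolete">
    <definition>No longer output</definition>
  </entry>
</dictionary>"""

NEW = """<dictionary>
  <entry id="cutoff" term="cutoff"/>
  <entry id="energy" term="energy"/>
</dictionary>"""

CARRIED_OVER = ("definition", "description")


def merge_model(model_root, new_root):
    """Copy documentation from model entries into matching new entries."""
    model_entries = {e.get("id"): e for e in model_root.iter("entry")}
    for entry in new_root.iter("entry"):
        model = model_entries.get(entry.get("id"))
        if model is None:
            continue  # term is new: nothing to carry over
        for tag in CARRIED_OVER:
            if entry.find(tag) is None:
                for child in model.findall(tag):
                    entry.append(copy.deepcopy(child))
    return new_root


merged = merge_model(ET.fromstring(OLD), ET.fromstring(NEW))
print(ET.tostring(merged, encoding="unicode"))
```

In this sketch, terms that exist only in the model dictionary are not added to the new one, which matches the "occurs in both dictionaries" condition described above.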