An overview of Golem

What is Golem?

Golem is a markup language and library for extracting information from many kinds of CML (Chemical Markup Language) files, aimed particularly at processing the output of atomistic simulation programs.

The Golem markup language annotates CML dictionaries with three kinds of information: the type of data associated with each dictionary entry, the location of that data in CML documents, and how to parse that data once it has been found. The Golem library, written in Python, then uses these annotations to give you a simple, powerful API for extracting that data from collections of CML documents and manipulating it.

In other words, Golem makes it easier to write programs and services that consume the CML output by a range of popular simulation programs, as well as the CML within the CrystalEye database.

Who (and what) uses Golem?

A range of tools and systems use Golem to help generate code input and process code output. For example, the MaterialsGrid system uses Golem both to generate the input files for the systems people want to simulate and to generate the final reports presented to users.

Golem also comes with a range of programs to help with simple data extraction from CML files (primarily summon) and with the development of CML dictionaries (make_dictionary).

What problem does Golem solve?

Simply that what two codes mean by a given word is usually not identical. Every code, or resource, which uses CML has a subtly different set of concepts it is trying to represent. For example, CASTEP and SIESTA have very different conceptions of their basis sets. As a result, every program uses CML syntax slightly differently. CML allows for this using its dictRef mechanism: each concept in a CML document is given a unique name. Golem leverages and extends this mechanism to allow these differences in usage and document structure to be encapsulated in Golem/CML Dictionaries, which specify the concepts and syntax particular to a given domain of CML usage.

A good example is temperature. Consider Monte Carlo and molecular dynamics simulations: in Monte Carlo, the temperature defines the probability of a change of configuration, whereas in molecular dynamics it controls the kinetic energy of the atoms. These are, on one level, the same concept - in the higher language of thermodynamics, they relate to quantities like the free energy in the same way - but their usage in the context of the simulations is very different, even though they legitimately have the same name. Therefore, some of the time the two conceptualizations of temperature will be comparable, and some of the time they won’t; any system for processing CML has to be able to deal with this ambiguity.
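To make the temperature example concrete, here is a minimal sketch of how dictRef keeps the two concepts apart. It uses only Python's standard library, not the Golem API. The two CML fragments, the mc and md prefixes, the example.org dictionary URIs and the find_concepts helper are all assumptions made up for illustration; real codes will use their own dictionaries and document structure.

```python
import io
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical output from a Monte Carlo code (dictionary URI is invented).
MC_DOC = """<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:mc="http://example.org/dict/montecarlo">
  <parameter dictRef="mc:temperature">
    <scalar units="units:K">300.0</scalar>
  </parameter>
</cml>"""

# Hypothetical output from a molecular dynamics code (dictionary URI is invented).
MD_DOC = """<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:md="http://example.org/dict/moldyn">
  <parameter dictRef="md:temperature">
    <scalar units="units:K">300.0</scalar>
  </parameter>
</cml>"""


def find_concepts(cml_text):
    """Map (dictionary URI, entry name) to the scalar values found for it."""
    prefixes = {}  # namespace prefix -> URI, collected while parsing
    found = {}
    source = io.BytesIO(cml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
            continue
        dict_ref = payload.get("dictRef")
        if dict_ref is None or ":" not in dict_ref:
            continue
        prefix, name = dict_ref.split(":", 1)
        scalar = payload.find(f"{{{CML_NS}}}scalar")
        value = scalar.text.strip() if scalar is not None and scalar.text else None
        # Key on the dictionary URI rather than the document-local prefix.
        found.setdefault((prefixes.get(prefix), name), []).append(value)
    return found


mc_concepts = find_concepts(MC_DOC)
md_concepts = find_concepts(MD_DOC)
print(sorted(mc_concepts))
print(sorted(md_concepts))
```

Both documents contain an entry named temperature, but each resolves to a different dictionary, so a program can tell the two usages apart and decide for itself whether they are comparable in a given context.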

Codes and resources with Golem dictionaries

Golem comes with dictionaries for:

If the code you use isn’t listed here, Golem also includes tools to make it straightforward to develop new dictionaries for new CML dialects; we have developed dictionaries for

using the make_dictionary tool described later in this documentation. If your code doesn’t output CML at present, the process of adding markup is covered later in “Adding CML markup to your code”.

How can I get started?

The best way to get a feel for Golem is to install it and try it out on some of your data using summon. In the next chapter, we discuss installing Golem; the following chapter is an introduction to the summon program, which enables you to quickly extract, as CSV, the data identified by specific dictionary entries from one or many CML files.