An overview of Golem
====================
What is Golem?
--------------
Golem is a markup language and library for extracting information from many
kinds of CML (`Chemical Markup Language `_) files,
aimed particularly at processing the output of atomistic simulation
programs.

The Golem markup language annotates CML dictionaries with information about
the types of data associated with dictionary entries, the location of that
data in CML documents, and information on how to parse that data once it has
been found; the Golem library, written in Python, then uses these annotations
to give you a simple, powerful API for extracting and manipulating that data
from collections of CML documents.
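
As a rough illustration of the idea (this is not Golem's dictionary format or
API), the sketch below pairs a concept name with a location and a parser, then
applies it to a small CML fragment using only the Python standard library. The
``ToyEntry`` class, the ``demo`` prefix, its namespace URI and the sample
document are all hypothetical.

.. code-block:: python

    import xml.etree.ElementTree as ET
    from dataclasses import dataclass
    from typing import Callable

    # Namespace used by CML elements; "cml" is just a local prefix for lookups.
    CML_NS = {"cml": "http://www.xml-cml.org/schema"}


    @dataclass
    class ToyEntry:
        """Hypothetical stand-in for an annotated dictionary entry."""

        dict_ref: str                   # the concept's name
        location: str                   # where the data lives (ElementTree XPath subset)
        parse: Callable[[str], object]  # how to turn the raw text into a value

        def extract(self, root):
            """Return the parsed value of every node found at ``location``."""
            return [self.parse(node.text.strip())
                    for node in root.findall(self.location, CML_NS)]


    # Hypothetical sample document, not the output of any real code.
    SAMPLE = """
    <cml xmlns="http://www.xml-cml.org/schema"
         xmlns:demo="http://example.org/demo-dict">
      <parameterList>
        <parameter dictRef="demo:cutoff"><scalar>300.0</scalar></parameter>
        <parameter dictRef="demo:kpoints"><array>4 4 2</array></parameter>
      </parameterList>
    </cml>
    """

    cutoff = ToyEntry(
        dict_ref="demo:cutoff",
        location=".//cml:parameter[@dictRef='demo:cutoff']/cml:scalar",
        parse=float,
    )
    kpoints = ToyEntry(
        dict_ref="demo:kpoints",
        location=".//cml:parameter[@dictRef='demo:kpoints']/cml:array",
        parse=lambda text: [int(item) for item in text.split()],
    )

    root = ET.fromstring(SAMPLE.strip())
    print(cutoff.extract(root))
    print(kpoints.extract(root))

Keeping the location and the parser next to the concept name is what lets the
calling code ask for a concept without knowing how a particular program lays
out its CML.
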
In other words, Golem makes it easier to write programs and services that
consume the CML output by a range of popular simulation programs, as well as
the CML within the `CrystalEye `_
database.

Who (and what) uses Golem?
--------------------------
A range of tools and systems use Golem to help generate code input and
process code output. For example, the `MaterialsGrid
`_ system uses Golem to help generate both the
input files for the systems people want to simulate and the final reports
presented to users.

Golem also comes with a range of programs to help with simple data extraction
from CML files, primarily ``summon``, and with the development of
CML dictionaries (``make_dictionary``).

What problem does Golem solve?
------------------------------
Simply put, what two codes mean by a given word is rarely identical.
Every code, or resource, which uses CML has a subtly different set of concepts
it is trying to represent. For example, CASTEP and SIESTA have very different
conceptions of their basis sets. As a result, every program uses CML syntax
slightly differently. CML allows for this using its ``dictRef`` mechanism:
each concept in a CML document is given a unique name. Golem leverages and
extends this mechanism to allow these differences in usage and document
structure to be encapsulated in Golem/CML Dictionaries, which specify the
concepts and syntax particular to a given domain of CML usage.

A good example is temperature. Consider Monte Carlo and molecular dynamics
simulations; in Monte Carlo, the temperature defines the probability of a
change of configuration, whereas in molecular dynamics it controls the kinetic
energy of the atoms. These are, on one level, the same concept - in the higher
language of thermodynamics, they relate to quantities like the free energy in
the same way - but their usage in the context of the simulations is very
different, even though they legitimately have the same name. Therefore, some
of the time the two conceptualizations of temperature will be comparable, and
some of the time they won't; any system for processing CML has to be able to
deal with this ambiguity.
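
The point can be made concrete with a short sketch that uses only the Python
standard library, not Golem itself. It assumes the usual CML convention that a
``dictRef`` value has the form ``prefix:name``, with the prefix bound to a
dictionary namespace in the document. The ``mc`` and ``md`` prefixes, their
``example.org`` URIs and the two documents are hypothetical.

.. code-block:: python

    import io
    import xml.etree.ElementTree as ET


    def resolve_dict_refs(xml_text):
        """Map (dictionary namespace URI, local name) to the element text.

        Simplification: assumes each prefix is bound only once per document.
        """
        data = xml_text.encode("utf-8")

        # ElementTree does not keep prefix declarations on the parsed tree,
        # so collect them from the "start-ns" events first.
        prefixes = {}
        for _event, (prefix, uri) in ET.iterparse(io.BytesIO(data),
                                                  events=("start-ns",)):
            prefixes[prefix] = uri

        resolved = {}
        for element in ET.fromstring(data).iter():
            dict_ref = element.get("dictRef")
            if dict_ref is None:
                continue
            prefix, _, local_name = dict_ref.partition(":")
            resolved[(prefixes[prefix], local_name)] = (element.text or "").strip()
        return resolved


    MONTE_CARLO = (
        '<cml xmlns="http://www.xml-cml.org/schema" '
        'xmlns:mc="http://example.org/dict/monte-carlo">'
        '<scalar dictRef="mc:temperature">300</scalar></cml>'
    )
    MOLECULAR_DYNAMICS = (
        '<cml xmlns="http://www.xml-cml.org/schema" '
        'xmlns:md="http://example.org/dict/molecular-dynamics">'
        '<scalar dictRef="md:temperature">300</scalar></cml>'
    )

    mc_concepts = resolve_dict_refs(MONTE_CARLO)
    md_concepts = resolve_dict_refs(MOLECULAR_DYNAMICS)

    # Same local name, different dictionaries: no shared concept keys.
    print(set(mc_concepts) & set(md_concepts))

Both documents call the quantity ``temperature``, but the resolved names differ,
so deciding whether the two values are comparable has to be an explicit choice
rather than an accident of naming.
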
Codes and resources with Golem dictionaries
--------------------------------------------
Golem comes with dictionaries for:

* `CASTEP `_
* `OSSIA `_
* the `CrystalEye`_ crystallographic
  structure database.

If the code you use isn't listed here, Golem also includes tools to make it
straightforward to develop new dictionaries for new CML dialects; we have
developed dictionaries for

* `DL_POLY `_
* `SIESTA `_
* `GULP `_
* `MOPAC `_
* `DALTON `_
* `rmcprofile `_

using the ``make_dictionary`` tool described later in this documentation.

If your code doesn't output CML at present, the process of adding markup is
covered later in "Adding CML markup to your code".

How can I get started?
----------------------
The best way to get a feel for Golem is to install it and try it out on some
of your data using ``summon``. In the next chapter, we discuss installing
Golem; the following chapter is an introduction to the ``summon`` program,
which enables you to quickly extract, as CSV, the data identified by specific
dictionary entries from one or many CML files.
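
To give a feel for this kind of extraction before installing anything, here is
a minimal standard-library sketch that collects the values tagged with one
``dictRef`` from several CML documents and writes them out as CSV. It is not
how ``summon`` is implemented and says nothing about its command-line options;
the function name, the ``demo`` prefix and the sample documents are
hypothetical.

.. code-block:: python

    import csv
    import io
    import xml.etree.ElementTree as ET


    def dict_ref_to_csv(documents, dict_ref):
        """Return CSV text with one row per element tagged with ``dict_ref``.

        ``documents`` maps a label (for example a file name) to CML text.
        Simplification: the dictRef is matched as a literal string, so every
        document must use the same prefix for the dictionary.
        """
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["document", "dictRef", "value"])
        for label, text in documents.items():
            for element in ET.fromstring(text).iter():
                if element.get("dictRef") == dict_ref:
                    value = (element.text or "").strip()
                    writer.writerow([label, dict_ref, value])
        return buffer.getvalue()


    # Hypothetical documents standing in for the output of two runs.
    DOCUMENTS = {
        "run1.cml": (
            '<cml xmlns="http://www.xml-cml.org/schema" '
            'xmlns:demo="http://example.org/demo-dict">'
            '<scalar dictRef="demo:energy">-12.5</scalar></cml>'
        ),
        "run2.cml": (
            '<cml xmlns="http://www.xml-cml.org/schema" '
            'xmlns:demo="http://example.org/demo-dict">'
            '<scalar dictRef="demo:energy">-13.1</scalar></cml>'
        ),
    }

    csv_text = dict_ref_to_csv(DOCUMENTS, "demo:energy")
    print(csv_text)

A dictionary-aware tool can go further than this literal match, for instance
by locating and parsing each value according to its dictionary entry.
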