General presentation of the Virtual Atomic and Molecular Data Centre
VAMDC is a worldwide e-infrastructure that federates 41 heterogeneous and interoperable atomic and molecular databases. In VAMDC jargon, every federated database is a data-node. Every partner is in charge of curating its own node and decides independently on its growth rate, its ingestion system, and the corrections to apply to data already stored. The VAMDC infrastructure can therefore grow in two ways: each node can grow independently, and new nodes can join the federation.
Each data-node, regardless of the technology used for storing data (SQL, NoSQL, ASCII files), implements the VAMDC access/query protocols and returns results in a standardized XML format called XSAMS.
The user can access the data directly node-by-node or can use the VAMDC portal, which relays the user request to each node.
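The relay role of the portal can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the relay function, the node names and the stub nodes are hypothetical, the query is a placeholder string rather than a real VAMDC query, and the returned strings stand in for XSAMS documents. It only shows the fan-out pattern, not the actual VAMDC protocols.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical type: a node takes a query string and returns a result document as text.
NodeQuery = Callable[[str], str]


def relay(query: str, nodes: Dict[str, NodeQuery]) -> Dict[str, str]:
    """Send the same query to every node and collect one result per node."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # Submit all requests first so the nodes are queried concurrently.
        futures = {name: pool.submit(fn, query) for name, fn in nodes.items()}
        return {name: future.result() for name, future in futures.items()}


# Stub nodes: placeholders for real data-nodes, whatever their storage technology.
def node_a(query: str) -> str:
    return f"nodeA result for: {query}"


def node_b(query: str) -> str:
    return f"nodeB result for: {query}"


results = relay("example query", {"nodeA": node_a, "nodeB": node_b})
for name, document in results.items():
    print(name, "->", document)
```

Because every node answers through the same interface, the portal does not need to know how any node stores its data.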
Collaboration with RDA – feedback on the adopted strategy for data citation.
When we started the collaboration with RDA, each data-node owner could add, modify or delete data without tracking the changes. Past data extractions were therefore not reproducible, which obstructed any citation mechanism.
The interaction with the Data Citation Working Group helped us find a solution to these issues, as explained in the following paragraphs.
Difficulties linked with the Query Store
Given the distributed architecture of the federated VAMDC infrastructure, applying the “Query Store” (QS hereafter) strategy described in the document https://rd-alliance.org/group/data-citation-wg/wiki/scalable-dynamic-data-citation-rda-wg-dc-position-paper.html seemed very complex.
Many questions arose when we considered how to implement the QS in VAMDC: would we need a QS on each node? Would we need an additional QS on the central portal? Since the portal acts as a relay between the user and the existing nodes, how could we coordinate the generation of PIDs for queries in this distributed context?
We initially decided not to adopt the QS strategy, and instead proposed internally a somewhat similar approach inspired by the collaboration with the RDA. This approach is currently under discussion for validation within the VAMDC consortium.
How the RDA recommendations have been adapted to our case:
Atomic and molecular data have no intrinsic meaning outside a given context. This context defines, for instance, the zero-energy state and the molecular symmetries. Usually the same context is shared by data coming from the same experimental measurement campaign, generated by the same simulation code, or published in the same paper.
We decided to use this context as the base unit for tagging dynamic data in our databases. We call this context a “dataset”. Each data-node will hold a collection of different datasets, each dataset containing atomic and molecular data.
- Each dataset will be tagged with a unique digital ID, which makes it possible to identify the data-node it comes from.
- An existing dataset will never be deleted from a data-node, nor modified.
o If the data held in a data-node need a correction and/or an addition, a new dataset will be created to carry the change.
o We will automatically maintain the genealogy for the families of datasets. Users and data-providers will be able to know the creation date of a dataset, its ancestor and its descendants.
- Even as new datasets enrich the VAMDC infrastructure, a user will always be able to obtain exactly the same results he/she obtained at a given date, by restricting the query to the datasets that already existed at that date.
- The dataset ID will be returned in each result file coming from the VAMDC infrastructure. A user will always know which datasets were used to satisfy his/her query and can easily cite them. The XSAMS format (the VAMDC standard for formatting the results) will be modified to natively include references to the datasets used for its composition.
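The rules listed above can be illustrated with a minimal sketch of a single data-node. The class names, the "<node>:<serial>" ID scheme and the use of plain strings as records are assumptions made for illustration, not the VAMDC implementation. The sketch also leaves open whether a corrected dataset should hide its ancestor in later queries: here a query simply uses every dataset that existed at the requested date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)  # frozen: a stored dataset is never modified
class Dataset:
    dataset_id: str                  # hypothetical scheme: "<node>:<serial>"
    created: date
    records: Tuple[str, ...]         # stand-in for the atomic and molecular data
    parent_id: Optional[str] = None  # ancestor in the genealogy, if any


class DataNode:
    """Append-only collection of datasets; no delete or update method exists."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._datasets: Dict[str, Dataset] = {}

    def add_dataset(self, records: Iterable[str], created: date,
                    parent_id: Optional[str] = None) -> Dataset:
        if parent_id is not None and parent_id not in self._datasets:
            raise KeyError(f"unknown ancestor: {parent_id}")
        # The node name is part of the ID, so the ID reveals the provenance.
        dataset_id = f"{self.name}:{len(self._datasets) + 1}"
        dataset = Dataset(dataset_id, created, tuple(records), parent_id)
        self._datasets[dataset_id] = dataset
        return dataset

    def descendants(self, dataset_id: str) -> List[str]:
        """IDs of the datasets created directly from the given dataset."""
        return [d.dataset_id for d in self._datasets.values()
                if d.parent_id == dataset_id]

    def query(self, as_of: date) -> Tuple[List[str], List[str]]:
        """Return (records, dataset IDs), using only datasets existing at as_of."""
        visible = [d for d in self._datasets.values() if d.created <= as_of]
        records = [r for d in visible for r in d.records]
        return records, [d.dataset_id for d in visible]


node = DataNode("nodeA")
first = node.add_dataset(["line 1"], date(2015, 1, 10))
node.add_dataset(["line 1 (corrected)"], date(2015, 6, 1),
                 parent_id=first.dataset_id)

# Restricting the query to an earlier date reproduces the earlier result,
# and the returned IDs tell the user which datasets to cite.
records, cited_ids = node.query(as_of=date(2015, 3, 1))
print(records, cited_ids)
```

In this sketch the correction does not overwrite anything: it is stored as a second dataset linked to its ancestor, so the query dated before the correction still returns only the first dataset.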
This approach guarantees the reproducibility of all the queries, the perpetuation of all the data-versions and sup