
Big data, big responsibility: recording a project’s data lineage for publishing reproducible results

What was the challenge that you addressed?

Modern data analysis involves a long and complex series of steps: the various sources of input data (from different databases), the software used (with its precise versions, installation configurations and host operating system), and finally the recipes describing how the software is run on the data (and in what order) to produce the final results (added-value datasets, or the figures, plots and tables in a published scientific paper). A project’s data lineage contains all this information, as well as the links between its parts (for example, which analysis step should be done after which). Without the lineage, the results can’t be precisely reproduced or validated by others (or even by the same team!).
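To make the idea of a lineage more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how Maneage itself is implemented: the `lineage` dictionary, the `execution_order` helper, and every step name, version number and URL in it are hypothetical. It records the three ingredients described above (inputs, software versions, recipes) together with the links between steps, and derives a valid execution order from those links.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical lineage record: all names, versions and URLs are invented
# for illustration and do not come from any real project.
lineage = {
    # Where the input data comes from, and how to verify it.
    "inputs": {
        "survey-catalog": {
            "source": "https://example.org/catalog.fits",
            "checksum": "hypothetical-checksum",
        },
    },
    # Exact versions of the software used in the analysis.
    "software": {"python": "3.11.4", "numpy": "1.26.0"},
    # The recipes, each listing the steps it depends on ("needs").
    "steps": {
        "download": {"needs": [], "recipe": "fetch the input catalog"},
        "clean": {"needs": ["download"], "recipe": "remove flagged rows"},
        "measure": {"needs": ["clean"], "recipe": "compute summary statistics"},
        "plot": {"needs": ["measure"], "recipe": "draw the figure"},
        "paper": {"needs": ["measure", "plot"], "recipe": "build the PDF"},
    },
}


def execution_order(steps):
    """Return the step names so that every step follows its prerequisites."""
    # Map each step to the set of steps that must run before it.
    graph = {name: set(step["needs"]) for name, step in steps.items()}
    # static_order() raises graphlib.CycleError if the links are circular.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(lineage["steps"])
print(order)
```

Storing the links explicitly is what lets a tool answer "which analysis step should be done after which" without relying on anyone's memory; a circular dependency is reported as an error instead of silently producing results that cannot be reproduced.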

The adoption process

During the RDA Global Adoption week, Mohammad Akhlaghi presented the Workflows for Research Data Publishing Models and Key Components Recommendation. Slides and recordings of the presentation are available.


Benefits and impact of adoption

Prior to the publication of a project as a scientific paper, adopting Maneage has the following advantages:

1. The full history of the various analysis steps and software versions is recorded.
2. Changing any step (to see its effect on the final result) is trivial because all the numbers in the text, as well as plots and figures, are automatically generated.
3. Geographically separated co-authors can exactly reproduce an ongoing project on their independent computers, contribute their analysis steps to a single project in a single workflow, and later merge it with the work of other team members.
4. It is more straightforward to revert to a previous state of the project and test an alternative analysis method (and to “merge” it if it is good).


What lessons did you learn?

The Adoption grant initiative in RDA Europe 4.0 was instrumental for Maneage because we are astronomers and this meta-project is not officially considered to be “astronomy” research. With the grant, we were able to convince our institute that investing more time and energy in it is academically productive. Most 2019 adoption grant recipients were data center curators, because following best practices in data management is unfortunately considered an unnecessary distraction by many scientists, who simply judge a team’s output by the number of published papers. Academically recognized grants can help scientists in any field focus on improving the methods of their research and thus gradually improve the culture of data management in the larger scientific community.

Download and read the full Adoption Story.

Instituto de Astrofísica de Canarias

The Instituto de Astrofísica de Canarias (IAC) is a public research consortium that is a center of reference within the Spanish astrophysics community, but also in European and worldwide contexts. The IAC maintains the Teide Astronomical Observatory in Tenerife and the Roque de los Muchachos Observatory in La Palma, which is one of the few places on earth with such a high ratio of telescopes to data creators.
The IAC is also a partner in many international facilities, including the large astronomical projects Euclid and LSST (which will create petabyte-scale datasets of the sky during this decade). The ‘Maneage’ framework, which was the focus of the RDA Europe 4.0 grant, was developed to deal with the complex problem of reproducibility in the results of projects that use large and complex datasets. In addition, it has been designed as a modular and generic framework that is applicable to any data-intensive research project.