Scalable Dynamic Data Citation
Approaches, Reference Architectures and Applications
RDA WG Data Citation Position Paper
Draft Version - 2015-03-23
Contributors
- Andreas Rauber, TU Wien, rauber@ifs.tuwien.ac.at
- Stephan Pröll, SBA, sproell@sba-research.org
- ...
Abstract
Uniquely and precisely identifying and citing arbitrary subsets of data is essential in many settings, e.g. to facilitate experiment validation and data re-use in meta-studies. Current approaches relying on pointers to entire data collections or on explicit copies of data do not scale.
Within this WG, we propose novel approaches that allow scientists to create, reference, cite and reuse subsets of any size and complexity by applying requirements for dynamic data citation to different types of data stores. We use timestamped, versioned data from which specific subsets can be assembled via queries. We then attach persistent identifiers to the queries in such a way that they can be re-issued against the data set in the future. By citing only the query instead of the whole data set we enable scalable data citation. This position paper describes the basic model and its application to different types of data.
Table of Contents
- Introduction and Motivation
- The Query Store
- Citing Data in Relational Database Management Systems
- Citing Data in CSV Files
- Citing Linked Data
- Citing Data in NoSQL Stores
- Citing Data Stored in Specialized Data formats
- Use Cases / User Stories
- Conclusions
- References
feel free to comment / contribute / edit!!
1 Introduction and Motivation
In many settings it is essential to be able to precisely and, preferably, automatically identify specific subsets of data that, for example, were used in a given study, served as the basis for training a specific model, or served as input to subsequent processing. When any of these computations is to be repeated for verification purposes in the future, providing access to the exact subset of data is a core requirement. Data citation is essential in various settings in academia, research and the business realm. Not only is giving credit an established standard, it is also economically reasonable to reuse and share data [1], specifically in a fully automated and machine-processable way. Data citation therefore provides additional value by offering mechanisms for verifying and validating experiments. In many areas, understanding how a result was achieved is required, especially when it comes to critical IT infrastructure. Not only do publications and supplementary data need to be citable; attributing the creation of specific subsets is also in demand.
Many data citation approaches exist that allow referencing data sets [2]. So far, data sets are usually referenced as one unit, with a textual description attached that serves as the metadata source. These descriptions can often not be generated automatically and are often not machine-interpretable. As a result they are frequently not precise enough to be useful. As data grows in size and complexity, referencing whole data sets as one unit becomes impractical. At the other extreme, citations could be enabled for very fine-granular subdivisions of data, leading to citations that bundle an enormous number of individually cited data items. Hence this approach does not scale with increasing data set sizes. No matter at which level of granularity static PIDs are assigned, the resulting solution will not be flexible enough to accommodate all required levels of subset citation in an efficient and flexible manner.
Furthermore, records evolve over time. They can be updated or deleted, which is hardly reflected in subsets and their citations. This is unsuitable for many applications, because live data that is still evolving needs to be identifiable as well, while previous versions remain available. We need data citation solutions that can be acted upon automatically, i.e. machine-readable and actionable data citation tools that scale with growing amounts of data.
This position paper proposes mechanisms for making evolving data citable. These mechanisms utilize various query languages for constructing precisely defined subsets. The data sets, including all updates to the data, are maintained and versioned. We can therefore use the query as the source for the construction of subsets, and it is sufficient to identify the queries instead of keeping storage-consuming copies of data sets. We attach persistent identifiers (PIDs) to the queries and use timestamps in order to retrieve the very same data set at a later point in time.
This aspect of data citation, i.e. the actual identification of the subset of data to be cited, obviously constitutes only a very specific issue within the broader context of data citation. It has to be placed in the context of other aspects such as the concrete expression of data citations, human readability, the metadata to be assigned to a citation, access and rights issues, and many more. In order to remain focused on specific, solvable tasks, the Working Group decided to primarily address this narrow aspect, putting it in context at a higher level and collaborating with and obtaining input from the plethora of groups and institutions working on different aspects of data citation. Specifically, many of these initiatives have been brought together under the coordination of FORCE11, which established a synthesis group to harmonize the different views and approaches of data citation[1].
The remainder of this position paper is structured as follows: Section II describes the query store, a central module in our query-centric model. Section III describes how a dynamic relational database can be rendered citable, mostly adopted from [3]. Sections IV, V, VI and VII describe how the model can be adapted to enable dynamic data citation for CSV files, linked data, NoSQL systems and specialized scientific data formats.
2 The Query Store
Data-driven science uses a diverse set of data formats, data stores and databases. Our model relies on a query-centric approach, i.e. we utilize query languages to construct arbitrary subsets. As the data is maintained with all its versions over time, the queries can be used to access the same data set again. Such queries are therefore key for scalable data citation, as they can serve as references for the data sets they produce. Consequently, the queries that have been used to construct a specific subset need to be stored for later reuse. To identify and differentiate such queries, a persistent identification mechanism is needed. A variety of different persistent identifier mechanisms and approaches exist [4]. Our model is PID-agnostic and can be used with any mechanism that allows unique and unambiguous identification of digital objects.
2.1 Setting up the Query Store
The query store is used to preserve, for the long term, all queries that have been issued against a data set. Several metadata elements are needed in order to guarantee secure and reliable long-term storage of the queries (a table sketch is given at the end of this subsection):
• A PID
• A timestamp
• The original query statement q as entered by the user
• The modified query, which maintains significant properties of the result set (e.g. sorting)
• A hash key computed over the result set (to facilitate verification of correctness)
• Additional metadata following recommendations, e.g. from DataCite[2]
These issues are expected to be addressed in more detail in the PID Type Working Group[3].
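As an illustration, a minimal SQL sketch of such a query store table is given below; the table and column names are our own assumptions for this sketch and not prescribed by the model.

```sql
-- Minimal sketch of a query store table (all names are illustrative assumptions).
CREATE TABLE query_store (
    pid             VARCHAR(255) PRIMARY KEY,  -- persistent identifier assigned to the query
    exec_timestamp  TIMESTAMP NOT NULL,        -- execution time, used for later re-execution
    original_query  TEXT NOT NULL,             -- query q exactly as entered by the user
    rewritten_query TEXT NOT NULL,             -- query with timestamp filter and unique sorting
    result_hash     CHAR(32) NOT NULL,         -- hash over the result set, for verification
    citation_meta   TEXT                       -- additional metadata, e.g. DataCite fields
);
```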
2.2 Assigning PIDs to Queries
It is essential to detect queries that have already been captured by the system in order to avoid assigning multiple PIDs to identical queries. Furthermore, the result sets of the queries have to be analyzed to verify their correctness. We rely on hashes to determine the identity and correctness of result sets. There are two options for computing the hash key for a query result. The simplest solution computes the hash key over the entire result set. This allows a straightforward comparison to see whether two result sets are identical, yet as result sets are potentially huge, the hash computation can be quite expensive. A more efficient alternative requires the existence of a unique identifier for each element in the result set. In this case we can compute the hash key over the unique identifiers only in order to check whether two result sets are identical.
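Both options can be sketched in PostgreSQL syntax as follows; the measurements table and its columns are hypothetical, while md5 and string_agg are standard PostgreSQL functions.

```sql
-- Option 1: hash over the entire result set (robust, but expensive for large results).
SELECT md5(string_agg(id::text || '|' || value::text, ';' ORDER BY id)) AS result_hash
FROM   measurements
WHERE  station = 'A12';

-- Option 2: hash only over the unique identifiers of the matching records (cheaper).
SELECT md5(string_agg(id::text, ';' ORDER BY id)) AS result_hash
FROM   measurements
WHERE  station = 'A12';
```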
If a new query or an altered result set is detected, a PID is assigned. Any persistent identifier mechanism is suitable for assigning PIDs for the later identification of queries. A further possibility is to generate a PID by hashing the query string with an appended timestamp. In our scenario, only queries that are either new or return an altered result set are assigned a new PID.
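For illustration, a duplicate check against the query store sketched above might look as follows; the :candidate_* placeholders stand for values supplied by the application and are assumptions of this sketch.

```sql
-- Look up whether an identical (re-written) query with an identical result hash
-- has been stored before; if a row is returned, its PID is reused instead of minting a new one.
SELECT pid
FROM   query_store
WHERE  rewritten_query = :candidate_query
AND    result_hash     = :candidate_hash;
```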
3 Relational Database Management Systems
Relational database management systems (RDBMS) are deployed in many scenarios and drive data storage, access and analysis. In [5] an initial draft of a generic model for citing data sets is provided. The model is based on timestamped SELECT queries which are used to retrieve the relevant data. These SELECT statements can be used for data citation as long as versioned data is available and all data manipulation language (DML) statements are recorded with the timestamp of their execution. Data is never deleted (except, e.g., where legal requirements demand deletion, which needs to be documented) and the previous states of a row within a data set are maintained during the whole life cycle. Hence the state of the data set that was valid at execution time can be reconstructed. The generation of citable result sets is query-centric: it is not the data itself that needs to be stored explicitly, but the query statements that are preserved. Our model supports the citation of both static and dynamic data that gets updated and changed. This provides a very flexible citation solution that enables the provision of references in a timely manner. A similar solution has been adopted by Puneet Kishor [6] for the EarthBase system[4]. Basic requirements for a model of dynamic data citation from a database perspective are:
1. Unique subsets can be expressed (i.e. a primary key is available)
2. Dynamic data can be handled (inserts, updates and deletes are recorded)
3. Nothing is actually deleted from the system (except where legal requirements enforce actual deletion, which requires marking the affected queries as non-reproducible)
As a result, the evolution of data is traceable, citations can be managed in a scalable and automated way, and the model does not require vendor-specific features. A further detail that is crucial for the acceptance of dynamic data citation is transparency, as well as the effort required for implementing the model for a given data source, both from the point of view of the operator and from a user's perspective. In [3] we proposed three different approaches for storing the required metadata. The implementation of this model should be as non-invasive as possible with regard to the existing applications interacting with the database, at the cost of, in the worst case, slightly more than doubling the storage requirements.
The following steps describe the actions necessary for creating and retrieving a citable data set:
1. Record timestamps for all INSERT, UPDATE and DELETE statements
2. Maintain the history of all inserts, updates and deletes with the original values
3. Store all (adapted) SELECT-queries identifying subsets of data that need to be persistent and citable
4. Create a hash of the result set
5. Assign a PID to this query which serves as identification for the citation record
We thus need to first ensure that the data is stored in a timestamped and versioned manner. Then, each query that needs to be persistently stored is expanded with its execution timestamp, modified to ensure unique sorting, and assigned a PID. Hashes over the result set are computed to allow verification of the correctness of the returned data at a later point in time.
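Steps 3 to 5 could then be realized roughly as follows; the PID value, table and column names are purely illustrative, and the result hash is computed as described in Section 2.2.

```sql
-- Store the adapted query together with its execution time, result hash and freshly minted PID.
INSERT INTO query_store (pid, exec_timestamp, original_query, rewritten_query, result_hash)
VALUES (
    'hdl:12345/example-subset-1',                                      -- hypothetical PID
    CURRENT_TIMESTAMP,
    'SELECT id, station, value FROM measurements WHERE station = ''A12''',
    'SELECT id, station, value FROM measurements WHERE station = ''A12'' ORDER BY id',
    :result_hash                                                       -- placeholder supplied by the application
);
```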
The last two steps are the actual retrieval of the historical data set and its verification:
1. Re-execute the (adapted) stored query and retrieve the original data
2. Verify the correctness of the data retrieved using the hash key computed on the original data set
After the database schema has been enhanced to be compliant with the model, inserts, updates and deletes can be executed fully transparently. Queries leading to result sets that need to be persistently stored for later re-use are stored and assigned a PID for future identification. Whenever a data set needs to be reconstructed in the future, the query with its timestamp together with the versioned data is sufficient to retrieve the exact same data set.
3.1 Adapting the Database Tables
In some settings, the database may already utilize timestamps to record when new data was being added. Also, versioning of data, recording updates and deletes, is already quite common in many big data applications, especially as in analytical settings these are frequently dominated by inserts of new data rather than corrections or deletions to existing data. In such cases, usually no specific steps are necessary on the database layer. Otherwise, we need to turn the database into a timestamped and versioned database.
There exist many approaches for handling temporal data [7]. The evolution of data needs to be traced by capturing DML statements. The implementation scenarios rely on additional columns for the current version, user information and the event type, as well as a timestamp of the last event. Storing such additional metadata for each table introduces significant overhead that needs to be handled alongside the original records. We introduced three approaches for handling the metadata required for data citation: (1) integrated, (2) hybrid and (3) separated.
- The integrated approach requires extending all original tables by the temporal metadata columns and expanding the primary key by the version column. All operations that alter data need to be reflected in the original table. This method is intrusive, because other tables and applications need to be adapted to this alteration.
- The hybrid approach moves records into a history table whenever a record gets updated or deleted; inserts are only recorded with a timestamp in the history table. The history table has the same structure as the original table, extended by columns for the data citation metadata (a DDL and trigger sketch follows after this list). The original table always reflects the latest versions, as if no versioning and timestamping were enabled, whereas the history table records the evolution of the data. If a record has been inserted but never updated or deleted, no additional metadata or versioning is required. The advantage of this approach is again a minimal demand for storage, especially when data is rarely updated but many inserts occur. More importantly, the approach is non-intrusive, as the original table remains unchanged. A disadvantage is the more complex query structure required for retrieving historical data.
- The separated (full history) approach also uses a history table as described before, but additionally copies records immediately into the history table upon insertion into the original table and marks them as inserted. Deleted records are again only marked as deleted in the history table and removed from the original table. An advantage of this approach is that the original table again remains unchanged and that it allows a simple query structure, as no joins are required. All requests for data citation are handled by the history table, whereas the original table is not involved in the data citation process. The drawback of this approach is the large increase in storage size, as it keeps virtually a complete historicized copy of the entire data.
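As a rough PostgreSQL-flavored sketch of the hybrid design (all object names are our own assumptions, not part of the model), the history table and a trigger capturing updates and deletes could look as follows:

```sql
-- History table: citation metadata columns plus the same columns as the original table.
CREATE TABLE measurements_history (
    event_type  CHAR(1)   NOT NULL,   -- 'U'pdate or 'D'elete (insert handling depends on the chosen design)
    event_time  TIMESTAMP NOT NULL,   -- when the change happened
    event_user  TEXT      NOT NULL,   -- who triggered the change
    LIKE measurements                 -- copies the column definitions of the original table
);

-- Trigger function copying the previous state of a row into the history table
-- whenever it is updated or deleted; the original table keeps only the latest version.
CREATE OR REPLACE FUNCTION record_history() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO measurements_history SELECT 'D', now(), current_user, OLD.*;
    ELSE
        INSERT INTO measurements_history SELECT 'U', now(), current_user, OLD.*;
    END IF;
    RETURN NULL;   -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurements_versioning
AFTER UPDATE OR DELETE ON measurements
FOR EACH ROW EXECUTE PROCEDURE record_history();
```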
Many RDBMS, such as PostgreSQL or Oracle, support timestamped and versioned data off the shelf, hence such an approach might be implemented with standard tools. Table 1 compares the three table designs.
Table 1: Comparison of Table Designs

|            | Intrusiveness | Storage demand | Query complexity |
|------------|---------------|----------------|------------------|
| Integrated | high          | low            | low              |
| Hybrid     | low           | medium         | medium           |
| Separated  | low           | high           | low              |
The granularity of this approach is on a record level, hence every change on the data is reflected. In our model, the timestamp contains the explicit date at which the data has been changed with the respective updates being executed as atomic transactions.
3.2 Ensuring Unique Sorts
A crucial aspect of result sets is their sorting. The results returned by a database query are commonly fed into subsequent processes in their order of appearance in the result set. In situations where the order of processing has an effect on the outcome (e.g. in machine learning processes), we need to ensure that the result sets are consistently sorted. Hence, the order in which records are returned from the database needs to be maintained upon a query's re-execution. This is challenging, as relational databases are inherently set-based. According to the SQL99 standard[5], if no ORDER BY clause is specified, the sorting is implementation-specific. Even if an ORDER BY clause is provided, the exact ordering may not be defined if several rows have the same values for the specified sorting criteria.
For this reason, queries need to have a sorting order specified in order to be compliant with our model. As a basic mechanism we only consider standard sorting, i.e. ORDER BY clauses. If no such sorting behavior is specified, a default sort based on the primary key columns of the involved tables is appended to the query. This covers only basic SQL SELECT statements, but is sufficient for most use cases requiring data citation. Additionally, means to mitigate non-determinism and randomness are required when preparing and re-executing processes at a later point in time for validation purposes [8]. Randomized functions or relative time specifications remain a problem when they are used in stored procedures or SELECT statements. So far, no standard solution exists that can mitigate the effects of randomness and non-determinism.
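A minimal example of this sort stabilization, assuming a hypothetical measurements table with primary key id:

```sql
-- Query as entered by the user (order among rows with equal value is ambiguous):
SELECT id, station, value FROM measurements WHERE value > 10 ORDER BY value;

-- Query as stored and re-executed: the primary key is appended as a tie-breaker.
SELECT id, station, value FROM measurements WHERE value > 10 ORDER BY value, id;

-- If the user specified no ORDER BY at all, a default sort on the primary key is appended:
SELECT id, station, value FROM measurements WHERE value > 10 ORDER BY id;
```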
3.3 Retrieving a Citable Subset
The query store contains the modified query q, a timestamp and the hash value HR(q) of the result set that q returned at that point in time. Whenever the subset delivered by q is referenced at a later point in time, the SQL query statement obtaining the previously stored data set, denoted q', is executed. For the retrieval of the subset, query q' needs to address the subset of the history table containing only those values that were valid at the time of the original query execution. By using the version information and the temporal metadata of every record, the appropriate subset can easily be constructed. The exact query structure again depends on the implementation design of the temporal data in the tables.
3.4 Query Re-Writing: Replanting Branches
For retrieving the results of the original query q it is sufficient to replace references to the original table with a generic query pattern addressing earlier versions of the data. As the data in the database is still evolving, updates and deletes occur to the original data. For obtaining the original result set at a later point in time, the query q’ has to be rewritten in order to include only data as it was valid during the time of the execution of q.
The sorting order is crucial for maintaining the same sequence of records in the result set and thereby producing identical result set hashes HR(q) for queries. If query q includes an ORDER BY clause, it needs to be ensured that the primary keys of the involved tables are part of the sorting criteria. If no ORDER BY clause is present, the system needs to append one to the statement in order to fix the sorting behavior. Several queries, e.g. those used for report generation, follow a SELECT * pattern and perform their selection on the result set. Hence the columns holding the temporal metadata need to be excluded by the rewritten query.
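The following sketch illustrates the rewriting against the separated (full history) design of Section 3.1; table names, column names and the concrete timestamp are illustrative assumptions of this sketch.

```sql
-- Original query q, executed at time T = '2015-03-23 10:00:00' and stored in the query store:
SELECT id, station, value
FROM   measurements
WHERE  station = 'A12';

-- Re-written query q': for every record, take the latest version recorded at or before T,
-- drop records whose latest event is a deletion, hide the temporal metadata columns,
-- and enforce a unique sort on the primary key.
SELECT h.id, h.station, h.value
FROM   measurements_history h
WHERE  h.event_time = (SELECT max(h2.event_time)
                       FROM   measurements_history h2
                       WHERE  h2.id = h.id
                       AND    h2.event_time <= TIMESTAMP '2015-03-23 10:00:00')
AND    h.event_type <> 'D'
AND    h.station = 'A12'
ORDER BY h.id;
```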
Numerous optimizations are possible. For example, the query is not encapsulated inside a view, as not all database systems allow combining views with other fundamental database concepts such as indexes, which would render the process described here inefficient. The same is true for temporary tables, as there are too many differences between the available RDBMS vendors.
4 CSV Files
While they are usually not utilized in settings with dramatic data volumes or highly dynamic data being added or corrected, comma separated value (CSV) files are a prominent representation of scientific data. Thus, the approach proposed within this WG should be applicable to these settings as well. Again, we want to base the approach on making the data timestamped and versioned, while having some form of access-based citation option for citing either an entire CSV file or arbitrary subsets of it. To this end, the current proposal includes:
- Timestamping and versioning: migrating the CSV files automatically into relational database tables allows us to re-utilize the timestamping and versioning approaches described above. As CSV files are in fact tables, the process is very straightforward. Users may upload their CSV files via a Web interface. After the upload is completed, the system iterates over the file, creates a table structure based on the metadata of the CSV file (headers and field lengths) and migrates the data into a table. The columns required for storing the timestamps and event information are added automatically. Records can be identified unambiguously either via a primary key (a column containing only unique values, such as IDs) or via an automatically created sequence number (see the sketch after this list). This allows selecting the proper version with regard to a specific timestamp, both for selection and for updating. Note that an update or deletion in this scenario again only marks the record as replaced or deleted.
- Query based access: Retrieving the current version of a whole CSV file (i.e. a table in the database) is a trivial task: a SELECT statement filtering the most recent version valid before the execution time is sufficient. But users need a mechanism which allows them to create their own subsets from a potentially dynamic CSV file and cite a specific version of that subset. We therefore developed a Web-based query interface allowing users to select columns, filter records and sort the result set based on their preferences. The system automatically captures the subset creation process (filtering, sorting and selecting) and stores the metadata in the query store. Upon re-execution, the query is re-written based on the metadata from the query store and augmented with the corresponding timestamp. Thus each subset can be re-created on demand, solely based on the PID of the query.
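As an illustration, a table generated from a hypothetical uploaded CSV file with columns station and value might look as follows (PostgreSQL-flavored; all names are assumptions of this sketch). Retrieving a specific version of the file or of a subset then follows the query rewriting pattern of Section 3.4.

```sql
-- Table derived automatically from the CSV header; the sequence number and the
-- citation metadata columns are added by the system.
CREATE TABLE csv_upload_42 (
    seq_no      BIGSERIAL PRIMARY KEY,           -- artificial record identifier if the CSV has no unique column
    station     TEXT,                            -- column taken from the CSV header
    value       NUMERIC,                         -- column taken from the CSV header
    version     INTEGER   NOT NULL DEFAULT 1,    -- version counter per record
    event_type  CHAR(1)   NOT NULL DEFAULT 'I',  -- 'I'nsert, 'U'pdate or 'D'elete marker
    event_time  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```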
A prototype has been implemented and is currently undergoing intensive testing.
5 Linked Data
Linked data represents an ever larger part of scientific data and is characterized by a particularly high degree of dynamics, especially in open research environments. To enable data citation for this type of data, several approaches are currently being proposed:
• File-based: Relying on a serialization of the triples, a versioning system similar to the approach proposed for CSV files could be used to create time-stamped and versioned representations of the data. Queries could then, similarly, be time-stamped and executed against the correct version of the serialization in the versioning system.
• Modeled versioning: The versioning could also be modeled explicitly as triples, adding for each triple to be stored another triple providing the time-stamped operation performed, i.e. a triple being added to or deleted from the store (e.g. <id, [add, delete], timestamp>), with appropriate adjustments to the queries posted to the triple store.
• RDBMS-based storage: The triples could be transformed into an RDBMS, applying the solutions proposed for relational data (a table sketch is given below).
These approaches are subject to further discussion and verification in pilot settings.
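For the RDBMS-based option, a minimal sketch is given below; the table layout and names are open design questions of this sketch, not settled decisions of the WG.

```sql
-- Triples stored as rows, with the add/delete operation and its timestamp modeled explicitly.
CREATE TABLE triples (
    triple_id   BIGSERIAL PRIMARY KEY,
    subject     TEXT      NOT NULL,
    predicate   TEXT      NOT NULL,
    object      TEXT      NOT NULL,
    operation   CHAR(1)   NOT NULL,              -- 'A'dded or 'D'eleted
    event_time  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- The graph state at time T consists of all triples whose latest operation at or before T
-- is an addition; queries would be rewritten analogously to Section 3.4.
```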
6 NoSQL Data Stores
Similar to the above, approaches relying on versioning systems as well as approaches modeling the time-stamping and versioning in the respective storage model should be applicable for XML-style databases and other NoSQL data stores.
7 Citing Data Stored in Specialized Data Formats
- HDF5, NetCDF, ...
8 Use cases / User Stories
9 Conclusions and Outlook
The model for data citation proposed in [3] and [5], similar to the approach proposed in [6], can be used to persistently cite arbitrary subsets of scientific data. It is based on the principles of versioned and time-stamped databases, with PIDs being assigned to time-stamped and adapted queries. A mechanism that checks whether a query was already issued, and that detects whether two subsequently executed queries return different result sets, ensures that only new queries are assigned a new PID, avoiding ambiguity.
In principle, the proposed model should be applicable in all settings where data can be timestamped, and where some form of access and query interface is being used to identify the subsets to be made persistently available as individual units.
10 References
[1] R. Darby, S. Lambert, B. Matthews, M. Wilson, K. Gitmans, S. Dallmeier-Tiessen, S. Mele, and J. Suhonen, “Enabling scientific data sharing and re-use,” in E-Science (e-Science), 2012 IEEE 8th International Conference on, 2012, pp. 1–8.
[2] CODATA-ICSTI Task Group on Data Citation Standards and Practices, “Out of cite, out of mind: The current state of practice, policy, and technology for the citation of data,” Data Science Journal, vol. 12, pp. 1–75, 2013.
[3] S. Pröll and A. Rauber, “Scalable Data Citation in Dynamic, Large Databases: Model and Reference Implementation,” in IEEE International Conference on Big Data 2013 (IEEE BigData 2013), October 2013. [Online]. Available: http://dx.doi.org/10.1109/BigData.2013.6691588
[4] H.-W. Hilse and J. Kothe, Implementing Persistent Identifiers: Overview of Concepts, Guidelines and Recommendations. Consortium of European Research Libraries, London, 2006. [Online]. Available: http://www.cerl.org/publications/report_on_persistent_identifiers
[5] S. Pröll and A. Rauber, “Citable by Design - A Model for Making Data in Dynamic Environments Citable,” in 2nd International Conference on Data Management Technologies and Applications (DATA2013), Reykjavik, Iceland, July 29-31 2013.
[6] P. Kishor, “Serving data, licenses, citations, and tracking use,” Webpage, March 29 2012, http://punkish.org/Serving-Data,-Licenses,-Citations,-and-Tracking-Use.
[7] C. S. Jensen and R. Snodgrass, “Temporal data management,” Knowledge and Data Engineering, IEEE Transactions on, vol. 11, no. 1, pp. 36–44, 1999.
[8] M. Guttenbrunner and A. Rauber, “A measurement framework for evaluating emulators for digital preservation,” ACM Transactions on Information Systems (TOIS), vol. 30, no. 2, March 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2180876