Dynamic Data Citation: which time-stamp to assign?

08 Aug 2014
Groups audience: 

It's probably somewhat sub-ideal to start a discussion just before heading off to vacations, but here it is nevertheless:

While going through some pilot scenarios we came across an interesting discussion concerning the precise time-stamp to assign to a query, and a solution that woul dbe paticulalry appealingbecause it would be generic, and could simlplify detection of identical queries/subsets.

Re-iterating the basics, we assume we have

  • a time-stamped and versioned database of some sort holding the data we want to cite, with updates at arbitrary intervals neatly logged.
  • a researcher who issues a specific query to select an arbitrary subset of the data at a given point in time which should be citeable

 

  • then, a PID will be issued to the time-stamped query and stored in a query store
  • the PID would resolve to a landing page allowing , apart from providing all kind of additional metadata, to re-issue the query against the database with either
  1. the original query timestamp, resulting in an identical dataset as originally used by the researcher;
  2. or using the current timestamp, thus producing the current view of the data including all corrections/changes/additionas made since the orginal query was posed, and
  3. inherently, also a diff becomes possible

The question was: Do we need to assign the timestamp of when the researcher executed the query? Or, more specifically, it arose from the discussion: if two researchers issue the same query, resulting in the same set of data (in case there were no changes in-between) why would these reult in two different PIDs?

This obviously doe snot make sense and is captured in the current approach by computing a hash-key over the result set, allowing us to determine if two result sets are identical. If so, and if also the query parse tree is identical (i.e. also the semantics of the query is identical, not just the result set, which could be by coincidence) then a new PID would be issued, otherwise the previous one would be returned.

A cleaner solution that emerged from discussion was the following proposal:
The timestamp to be asigned to the query should not be the timestamp of the query execution, but either

  1. the timestamp of the last update to the database, or
  2. the timestamp of the "newest" change timestamp in the result set
  3. for mattters of comleteness: this would be the current option of using the query execution timestamp

This would mean that different timestamps, and thus new PIDs, are only assigned to otherwise identical queries if the data has actually changed in-between. technically, the difference is small, as both can be computed the same way, and hash keys are needed anyway for verification, but relying on the approach above would make the model generic and semantically identical to the concept of "version" as it is currently used in many more static settings, and the semantics of the timestamp would be more closely tied to the data, rather than the query (execution), i.e. more akin to "last-changed", rather than "downloaded on"

Does anybody have particular views on this? Would it harm any of the concepts discussed so far? Are there any specific reasons for chosing one of the options above (1-3)? do you see any specific  advantages/disadvantages?

issues identified so far:

- 3 is only client-specific, requires no change to the system
- 1 would require a global variable to be set on each update
- 2 would require the query to always include the timestamp with every select, sort by this timestamp and return the newest one for determining whetther it is newer than the one stored in the query store in case an otherwise identical query has been found.
- 3 requires a hash key to b computed and compared over the result set (but this we likely want to keep anyway for verification purposes, requiring unique-sorting to be applied prior to user-defined sorts upon query execution)

I'm looking forward to hearing any additional insights!
sorry for the long email and thanks for reading it all the way throuugh or at leats jumping to its end :)

best regards,
Andreas

  • Stefan Proell's picture

    Author: Stefan Proell

    Date: 08 Aug, 2014

    Servus Andi,
    ich schau mir das mit den Updates und der Query Execution time daweil an.
    Wie lange bist du auf Urlaub?
    Ich fand jetzt dann mit dem IEEE TETC journal paper an und schick dir
    sobald ich was sinnvolles hab die aktualisierungen. diese überlegungen
    sollten wir auch gleich einbauen, oder?
    LG Stefan

  • Rob Hooft's picture

    Author: Rob Hooft

    Date: 08 Aug, 2014

    Andreas,
    A reply from my holiday address: for your option number 2 there is no
    guarantee that the set has not changed since that time. A correction could
    have taken an entry out of the result set recently.
    Rob Hooft
    On Friday, August 8, 2014, rauber <***@***.***> wrote:

  • Andreas Rauber's picture

    Author: Andreas Rauber

    Date: 08 Aug, 2014

    Rob,
    Thanks for pointing this out: I forgot this in my description: the "last changed" date would refer to the result set prior to the consideration of the history table or item removal. This might lead to significantly more complex query re-writing, and extracting the last-changed date within the query, rather than as part of the original select, but it seems do-able, not requiring any global variable or changes to the system.
    But it is true that Option 1 would be simpler to realize, and would still work perfectly fine, matching the conventional version-thinking.
    Andreas

submit a comment