It's probably somewhat sub-ideal to start a discussion just before heading off to vacations, but here it is nevertheless:
While going through some pilot scenarios we came across an interesting discussion concerning the precise time-stamp to assign to a query, and a solution that woul dbe paticulalry appealingbecause it would be generic, and could simlplify detection of identical queries/subsets.
Re-iterating the basics, we assume we have
- a time-stamped and versioned database of some sort holding the data we want to cite, with updates at arbitrary intervals neatly logged.
- a researcher who issues a specific query to select an arbitrary subset of the data at a given point in time which should be citeable
- then, a PID will be issued to the time-stamped query and stored in a query store
- the PID would resolve to a landing page allowing , apart from providing all kind of additional metadata, to re-issue the query against the database with either
- the original query timestamp, resulting in an identical dataset as originally used by the researcher;
- or using the current timestamp, thus producing the current view of the data including all corrections/changes/additionas made since the orginal query was posed, and
- inherently, also a diff becomes possible
The question was: Do we need to assign the timestamp of when the researcher executed the query? Or, more specifically, it arose from the discussion: if two researchers issue the same query, resulting in the same set of data (in case there were no changes in-between) why would these reult in two different PIDs?
This obviously doe snot make sense and is captured in the current approach by computing a hash-key over the result set, allowing us to determine if two result sets are identical. If so, and if also the query parse tree is identical (i.e. also the semantics of the query is identical, not just the result set, which could be by coincidence) then a new PID would be issued, otherwise the previous one would be returned.
A cleaner solution that emerged from discussion was the following proposal:
The timestamp to be asigned to the query should not be the timestamp of the query execution, but either
- the timestamp of the last update to the database, or
- the timestamp of the "newest" change timestamp in the result set
- for mattters of comleteness: this would be the current option of using the query execution timestamp
This would mean that different timestamps, and thus new PIDs, are only assigned to otherwise identical queries if the data has actually changed in-between. technically, the difference is small, as both can be computed the same way, and hash keys are needed anyway for verification, but relying on the approach above would make the model generic and semantically identical to the concept of "version" as it is currently used in many more static settings, and the semantics of the timestamp would be more closely tied to the data, rather than the query (execution), i.e. more akin to "last-changed", rather than "downloaded on"
Does anybody have particular views on this? Would it harm any of the concepts discussed so far? Are there any specific reasons for chosing one of the options above (1-3)? do you see any specific advantages/disadvantages?
issues identified so far:
- 3 is only client-specific, requires no change to the system
- 1 would require a global variable to be set on each update
- 2 would require the query to always include the timestamp with every select, sort by this timestamp and return the newest one for determining whetther it is newer than the one stored in the query store in case an otherwise identical query has been found.
- 3 requires a hash key to b computed and compared over the result set (but this we likely want to keep anyway for verification purposes, requiring unique-sorting to be applied prior to user-defined sorts upon query execution)
I'm looking forward to hearing any additional insights!
sorry for the long email and thanks for reading it all the way throuugh or at leats jumping to its end :)