Notes from today's PID KernelInfo call

03 Aug 2017

Dear all,
here are my notes from today's call. Please feel free to extend.
Our next call is scheduled for Aug 17, 13:00 UTC.
Notes:
* We discussed several use cases, particularly from RPID, in terms of
how the strawman fits and what fields might be missing
  o Typical properties were related to time and geolocation; there are
    also more 'domain properties' like temperature or wind speed - but
    are these actually necessary for fundamental decision making?
  o Can we form 'packages' out of this, e.g. KI for trust, KI for
    geo/time, KI for environment?
* RPID: 4 exemplary scenarios; they often need more information than the
strawman provides; can this be broken down into parts, where parts of the
scenarios are enabled purely by the trust KI?
  o Example 1: weather scenario: sensor network data, grouped by
    dates and published as 'daily research objects', using all the
    mandatory fields in the strawman; the second part is the analysis
    part, but RPID has not yet proceeded to the filtering case. From
    what is there, it looks as if creation date (already in the
    strawman) and device ID (which can't be included) are important.
  o Example 2: rice genomics: phenotype & genotype data;
    copyright/licensing is a big issue - who created the data, who
    published the data; also uses derivedFrom; future properties may be:
    publication date, also geoinformation
* Discussion at IU: pulling info from domains will just make the
profile bigger; this is not what we want, but there was no clarity on
what else to do, so the discussion stopped. The problem remains
unsolved.
  o This is familiar from the previous PIT group work as well. We also
    got to that point and did not have any answers.
* One way to approach this: What is the value of the limited profile?
If it stands on its own, what does it enable? Does it enable
enough (cost/benefit ratio)?
  o Dublin Core or DataCite must have faced this before. Can we learn
    from them? But: this is not about metadata fields; it is about the
    conceptual process that leads to including or not including them
    (or in what form). We want to clarify that process for our KI
    decisions.
* Currently, we can't see a clear limit to the profile. So we want to
structure the decisions on which fields to include or not include.
  o Ulrich is onto something here: he noted some first ideas for
    structuring: graphs; some sort of ordering (the typical 'date'
    use case, but also geolocation; string ordering); patterns in
    strings; (there were more - I did not catch all of them)
  o It is all geared towards giving easy 'yes'/'no' answers
* Ulrich got there by thinking about RDA discussions and the currents
that run through them. Example: versioning discussions in RDA have
always been based on different understandings of what versioning is,
e.g. version numbers of objects vs. graph lineage (git etc.); the two
are orthogonal, but both provide some form of ordering - and therefore,
ordering in principle seems to be an interesting/relevant part;
comparability of versions seems to be important (see the small sketch
after these notes).
  o Can we find more such examples?
  o The guideline is always: What information is required at a generic
    level to crawl through DOs? And then it all ends up at these
    yes/no decisions.
  o TW: another example of a recurring RDA discussion could be
    granularity/collections/subsetting - but what is the
    generalization of this that leads to yes/no decisions?
  o Another example: patterns in strings: this is about searching -
    search questions are always about string pattern matching; this
    again leads to yes/no decisions
    + Can we also look the other way, i.e.: which processes ultimately
      boil down to a pattern matching question?
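
To make the ordering point concrete, here is a small illustrative sketch
(Python; the helper names and the toy lineage graph are my own, not from
the call) of how both views of versioning reduce to yes/no questions:

    # Illustrative sketch only: two orthogonal views of versioning,
    # both reduced to a yes/no ordering question.

    # View 1: linear version numbers -- a total order via tuple comparison.
    def version_precedes(a: str, b: str) -> bool:
        """Yes/no: does version string a precede version string b?"""
        parse = lambda v: tuple(int(p) for p in v.split("."))
        return parse(a) < parse(b)

    # View 2: graph lineage (git-like) -- a partial order via ancestry.
    parents = {"c3": {"c2"}, "c2": {"c1"}, "c1": set(), "x1": set()}

    def is_ancestor(a: str, b: str) -> bool:
        """Yes/no: is node a an ancestor of node b in the lineage graph?"""
        frontier = set(parents.get(b, set()))
        while frontier:
            if a in frontier:
                return True
            frontier = set().union(*(parents.get(n, set()) for n in frontier))
        return False

    print(version_precedes("1.2", "1.10"))  # True: (1, 2) < (1, 10)
    print(is_ancestor("c1", "c3"))          # True: c1 <- c2 <- c3
    print(is_ancestor("x1", "c3"))          # False: incomparable -> No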
--
Dr. Tobias Weigel
Abteilung Datenmanagement
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45 a • 20146 Hamburg • Germany
Phone: +49 40 460094-104
Email: ***@***.***
URL: http://www.dkrz.de
ORCID: orcid.org/0000-0002-4040-0215
Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784


    Author: Ulrich Schwardmann

    Date: 04 Aug, 2017

    Dear Tobias, all
    thanks Tobias for streamlining my straying thoughts.
    The major problem I see in yesterday's discussion, and also in earlier
    discussions, is this: we define a straw man for a set of types that is
    in some sense elementary for each decision process in crawling through
    data; on the other hand, we see that our use cases in general need some
    additional specific types not covered by the straw man.
    The consequence is that with the straw man alone we cannot crawl
    through the data once and get the answers needed by the use cases; we
    need a second run for the queries particular to the individual use
    cases. Such a double-run approach will in most cases be highly
    inefficient.
    We tried several times to extend the straw man in different directions,
    but this always led to doubts about a specific decision and to
    questions about extending a bit further. In general the discussion
    ended with the statement that we can't decide, so we stop at this point
    and look at what we have. Which means that in the end we propose
    inefficient processes for most of our known use cases and probably for
    all of those we don't know yet.
    Another approach was to define the straw man plus a couple of
    community-specific sets of types. This might help extend the number
    of use cases covered. But in the end this again creates a couple of
    exceptions which need a second run. For me the inefficiency caused by
    the necessity of a complete second run is a big obstacle to general
    acceptance of such a solution.
    And this leads back to the abstract starting point of this group: the
    problem of giving guidelines on how we can get easy Yes/No answers when
    crawling through masses of DOs. The emphasis on easy cannot be
    overstated here: often the most resource-consuming simulations on our
    HPC systems boil down to a Yes or No to a certain question.
    My suggestion, however, is not to select specific types as the
    guideline, but to define how easy Yes/No answers can be provided; in
    other words, to look more closely at the decision processes themselves
    that take place during the single crawling steps, and to try to define
    criteria for selecting info types this way.
    This would not directly name the types allowed as kernel information,
    but allows much broader flexibility. It is comparable to defining a set
    by a membership function instead of naming its elements.
    Naming types always restricts the semantics and therefore immediately
    and dramatically restricts the use cases, which are always driven by
    the specific semantics behind them. Saying that a type has to satisfy a
    specific criterion, like being a number, means that any semantics
    directly related to numbers can be used, which is a huge field. And why
    numbers, for instance? Because there are lots of very easy decision
    processes possible on numbers. But again, not everything can be covered
    by numbers.
    To go deeper, this problem has two aspects:
    * the landscape of the crawling process itself
    * making the selections of DOs during the crawling process for the
    overall output of the process
    The first relies on some graph structure of identifiers; the second
    needs easy Yes/No answers about decisions to follow a particular path
    in that graph. These decisions can be made based on information about
    the graph structure at the current node itself and/or on additional
    information available at the current node, such as the availability of
    certain types and, more often, the content of data in particular types.
    This gives three categories for decisions in that crawling process:
    - local graph structure
    - local type structure
    - particular type content
    The first two lead in general to easy decisions by construction. The
    third is the one which causes more trouble. For this one we need to
    define more closely what the easy decision process itself could be.
    In principle this means that for an info type, together with a given
    condition, a certain service/function exists that reliably answers
    Yes/No or True/False within a certain short timeframe.
    One could say that for kernel information one only allows such types
    for which decision processes fulfilling these criteria exist (and are
    used), simple as that. But it certainly makes sense to discuss the
    suggested structure more deeply in this group and to give more advice
    to data managers.
    As a rough guideline one could say that the condition as well as the
    info type data both need to be simple, and they need to be compatible
    (for instance, one cannot ask for values > 1 (condition) and provide a
    string as info type data, unless one has some conversion/mapping in
    place).
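
    As an illustration of the compatibility point (sketch only,
    hypothetical names):

        # Illustrative: the condition 'value > 1' is incompatible with a
        # raw string info type unless a conversion/mapping is supplied.
        def greater_than_one(value, mapping=float):
            try:
                return mapping(value) > 1
            except (TypeError, ValueError):
                return False  # incompatible input answers No, not an error

        print(greater_than_one("2"))    # True, via the float mapping
        print(greater_than_one("two"))  # False: no mapping applies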
    Examples of such info-type/condition combinations would be:
    * the boolean True/False itself is the trivial kind of type for a
    decision process; such types are certainly valuable candidates, because
    this is the fastest possible decision process.
    * types that have some sort of well-defined order, such that it is easy
    to ask for greater, less or equal. One could also admit the much larger
    class of semi-orderings, because one can determine that incomparability
    gives the answer No. Examples of ordered types would be numbers,
    strings by lexicographical order, and structured strings like dates,
    geolocations, etc. With a semi-ordering one can also compare nodes in a
    graph, lists or arrays (including strings again, now viewed as arrays),
    and also sets and possibly dictionaries viewed as sets. This list
    covers a range from versioning in both of its orthogonal aspects to
    update dates or geolocation (see the sketch after this list).
    * types that have some structure that can be explored by pattern
    matching, because in general pattern matching is a fast decision
    process. Examples would be strings, but also numbers and probably all
    the examples of semi-ordering above. By the way, this makes
    semi-ordering even more interesting in this context, because there
    would be at least two approaches for easy decisions. This list covers a
    range from typical publication queries to (simple) semantic web
    queries.
    * others have yet to be explored... Let's look at the use cases to see
    whether we missed something. What about the devices example we had in
    yesterday's discussion?
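
    A small sketch of the second and third kinds (Python; hypothetical
    names), with incomparability answering No by construction:

        # Sketch: decisions on a semi-ordering (set inclusion) and on
        # string pattern matching, each answering plain Yes/No.
        import re

        def contains_all(required: set):
            """Semi-ordering on sets: Yes iff 'required' is a subset."""
            return lambda value: required <= set(value)

        def matches(pattern: str):
            """Pattern matching: Yes iff the pattern occurs in the value."""
            compiled = re.compile(pattern)
            return lambda value: bool(compiled.search(str(value)))

        has_geo = contains_all({"lat", "lon"})
        print(has_geo({"lat", "lon", "elev"}))  # True
        print(has_geo({"lat"}))                 # False: incomparable -> No

        looks_like_doi = matches(r"^10\.\d{4,9}/\S+$")
        print(looks_like_doi("10.1234/abcd"))   # True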
    Finally one could end up with a three-level structure:
    * one level that defines the straw man as the most generic part of
    kernel information, while clearly saying that this will not be enough
    in most cases;
    * one level that defines a couple of well-known examples of kinds of
    info types (like integer, time, etc., or like ordered or matchable
    types) and related kinds of decisions (like >, belongs-to or matches)
    on these types;
    * and one level provided by the more generic definition above: that for
    an info type together with a given condition, a certain
    service/function needs to exist that reliably answers Yes/No or
    True/False within a certain short timeframe.
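
    As an illustration of the second level (sketch only; neither the kinds
    nor the operations are agreed upon), well-known kinds of info types
    could be mapped to the kinds of decisions they support:

        # Hypothetical illustration of level two: kinds of info types
        # mapped to the easy decision operations they support.
        import operator
        import re

        DECISION_KINDS = {
            "integer":  {">": operator.gt, "==": operator.eq},
            "datetime": {">": operator.gt},  # ISO strings order lexicographically
            "set":      {"contains": lambda s, x: x in s},
            "string":   {"matches": lambda s, p: re.search(p, s) is not None},
        }

        print(DECISION_KINDS["integer"][">"](3, 1))             # True
        print(DECISION_KINDS["string"]["matches"]("abc", "b"))  # True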
    There is certainly a lot to do for such an approach, but I think it
    could be a good starting point to overcome the boundary problems we
    have.


    Author: Tobias Weigel

    Date: 16 Aug, 2017

    Hello Ulrich,
    in view of tomorrow's call, I've gone through your notes again and came
    up with a couple of detail questions (which I can ask tomorrow), but,
    more importantly, some observations regarding future directions of the
    group:
    a) We need a better and somewhat formal description of the crawling &
    selection process, including the decision-making part
    b) We need a framework for condition specifications. You've already put
    some cornerstones in place (categories) and some essential requirements
    (simple machine-readability, compatibility).
    Also, I'm musing about how complete your various points are - e.g. the
    three decision crawling categories. I did not find gaps so far, which
    is good, but it might still be worth thinking about extensions. The two
    items above may, however, be more direct regarding next steps.
    Best, Tobias


    Author: Mark Parsons

    Date: 16 Aug, 2017

    Hi all,
    I just joined this group. Where is the call-in info? I'd like to join, if I may.
    cheers,
    -m.
    Mark A. Parsons
    0000-0002-7723-0950
    Senior Research Scientist
    Tetherless World Constellation
    Rensselaer Polytechnic Institute
    http://tw.rpi.edu
    +1 303 941 9986
    Skype: mark.a.parsons
    mail: 1550 Linden Ave., Boulder CO 80304, USA
