"Data Movement: What infrastructure fabrics are required?"

26 Nov 2019

Submitted by Guido Aben


Meeting objectives: 

Research data movement is an important component in research data management discussions. In the 2019 position paper ‘EarthCube adoption/promotion of principles embodied in the FAIR acronym for current and future activities’ (https://www.earthcube.org/FAIR), the authors state in their rationale for adopting the FAIR principles that “Data and knowledge integration through community reuse of data can lead to broad scientific advances on complex, multidisciplinary research topics. Scientific advances in many disciplines are greatly facilitated by open access to readily available and well-documented data, including observations, interpretations, metadata, and data resources”.

 

Data movement is a core component in making data readily available and hence FAIR. Effective data movement is not trivial but should not be a barrier to data sharing.

The aim of effective research data movement is effortless data sharing that requires no specific technical knowledge. Beyond the technical aspects, there are also social bridges that help increase knowledge of research data movement, such as peer learning and interdisciplinary collaboration. Encouraging researchers, data providers and innovators to openly share knowledge across technologies, disciplines and countries can address data movement issues as data volumes continue to grow in a broadening group of domains.

 

The organisers of this BoF include eInfrastructure operators as well as science data providers. The list of invited participants at the time of writing is:

- DKRZ (climate data provider, .de)

- ICRAR (Astronomy data provider, .au)

- Australian Biocommons (life science data provider, .au)

- SANReN (eInfra operator, .za)

- NeSI (eInfra operator, .nz)

- AARNet (eInfra operator, .au)

 

The organisers of this BoF aim to gain wider traction for concepts of data mobility that move away from one-off file transfers initiated by end users on an ad-hoc basis. Wider traction would mean greater economies of scale, a lower barrier to entry for new science disciplines, re-use of existing tooling, practices and infrastructure, and more stable career paths for eInfra support staff who specialise in data delivery engineering.

 

Experience shows that users who rely on ad-hoc scenarios tend not to have the tools required for the large volumes of data they are now working with (“You mean I can’t drag-and-drop a Terabyte directory structure?”), are insufficiently trained (“What do I do when the transfer stalls, two days in?”), and are under time pressure (“I need that Terabyte today! Nothing in the system says I should have started planning a week ago!”).

 

Taking lessons from cases where big data research works well, the main concepts to consider are:

 

  • Data placement. This is the act of making sure that the block of data to be operated on is made available at the correct location; typically “near” a compute facility. This also implies that a “phone book” is available showing all locations that a user can move their data to, and a rights/quota management system telling a user whether they are allowed to move their data to that specific location.

 

  • Data scheduling. This builds on data placement, but adds workflow automation. Instead of engaging in data placement in a user-operated, ad-hoc fashion, we now rely on research workflow tools. These tools know what data blocks will be needed at some point in data processing, where to find them, and can then request that they be placed in a location of choice. Significantly, what this step requires on top of the previous step is agreement among all users of the entire data system (“global agreement”) on a common “namespace”, which is how data blocks are uniquely identified, located and tracked during movement.

 

  • Data integrity. This adds checks and versioning to data scheduling, so that all users of the global system can be assured that various copies of the same data in different locations are the same, can see who worked on them, and what changes were made. Proper versioning allows one to reverse a change if accidents occur. 
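
The three concepts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing system: the `Site` and `Catalogue` classes, the per-user byte quota, the `scope:name` style identifiers and the choice of SHA-256 are all hypothetical. The sketch covers the checksum side of data integrity only; versioning and provenance tracking are left out.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Site:
    """One entry in the 'phone book' of storage locations (hypothetical model)."""
    name: str
    quota: dict[str, int]                                  # user -> bytes allowed here
    used: dict[str, int] = field(default_factory=dict)     # user -> bytes stored here
    store: dict[str, bytes] = field(default_factory=dict)  # global name -> content


class Catalogue:
    """Global namespace: one name per data block, plus where its replicas live."""

    def __init__(self, sites):
        self.sites = {site.name: site for site in sites}
        self.checksums = {}  # global name -> SHA-256 hex digest recorded at registration
        self.replicas = {}   # global name -> set of site names holding a copy

    def _write(self, name, data, site_name, user):
        # Rights/quota management: refuse the write if the user lacks quota.
        site = self.sites[site_name]
        used = site.used.get(user, 0)
        if used + len(data) > site.quota.get(user, 0):
            raise PermissionError(f"{user} has insufficient quota at {site_name}")
        site.store[name] = data
        site.used[user] = used + len(data)
        self.replicas.setdefault(name, set()).add(site_name)

    def register(self, name, data, site_name, user):
        """Introduce a new block into the namespace under a unique name."""
        if name in self.checksums:
            raise ValueError(f"{name} already exists in the namespace")
        self._write(name, data, site_name, user)
        self.checksums[name] = hashlib.sha256(data).hexdigest()

    def verify(self, name, site_name):
        """Data integrity: does the copy at this site match the recorded checksum?"""
        data = self.sites[site_name].store[name]
        return hashlib.sha256(data).hexdigest() == self.checksums[name]

    def place(self, name, dest, user):
        """Data placement: replicate a named block to the destination site."""
        if dest in self.replicas[name]:
            return  # a replica is already there
        source = sorted(self.replicas[name])[0]  # pick any existing replica
        self._write(name, self.sites[source].store[name], dest, user)
        if not self.verify(name, dest):
            raise IOError(f"replica of {name} at {dest} failed verification")

    def schedule(self, needed, dest, user):
        """Data scheduling: place every block a workflow step declares it needs."""
        for name in needed:
            self.place(name, dest, user)


# Example: a block registered at an archive is scheduled next to a compute site.
archive = Site("archive", quota={"alice": 1000})
compute = Site("compute", quota={"alice": 10})
catalogue = Catalogue([archive, compute])
catalogue.register("climate:run42/tas", b"12345678", "archive", "alice")
catalogue.schedule(["climate:run42/tas"], "compute", "alice")
```

In this sketch the workflow tool never handles file paths at a particular site; it only names blocks and a destination, and the catalogue decides whether the move is permitted and whether the resulting copy is intact. A real system would move data over the network and persist the catalogue, which is where most of the engineering effort lies.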

 

The purpose of this BoF is to determine the levels of support and tooling required in various disciplines to move parts of their datasets to the paradigm above. It will investigate the transferability of tooling and practices existing in developed disciplines to new entrants. It will also investigate the merits of building partnerships between research disciplines and eInfrastructure providers to encourage these practices across domains.

 

For context, a few global-scale science efforts already treat their data this way. As a result, some infrastructures to deal with placement, scheduling and integrity are already in place. These could be investigated for transferability and best practice. In this BoF we will also discuss the potential for new communities wishing to adopt data scheduling paradigms, and how to add them to existing infrastructures. This may have benefits both for new entrants and for incumbents: at the Exabyte scale, operators tend to be receptive to efficiencies such as those gained by reused and pooled resources / funding.

 

If it is found that sufficient traction for the ideas described above exists, this and/or subsequent BoFs can focus on an action statement, linking together interested research disciplines with interested eInfrastructure operators. As a minimum, we suggest seeking liaison with relevant SKA and CERN contacts; preferably, we would convince them to join a resultant WG or IG.

 

 

Meeting agenda: 
  1. Presentation on current state of the art in scientific data movement and data placement (CERN, SKA, ELIXIR, climate science). Socialise these concepts and include a wider group of research disciplines in these modes of large scale data organisation and data movement. (15 minutes)
  2. Overview of the opportunities to harmonise concepts and terminology; software, tools and recipes; and shared infrastructures. Identify pathways for more efficient service delivery between potential service users (e.g., research platforms), and service providers (e.g., eInfrastructure providers, NRENs). (30 minutes)
  3. Discussion of how to engage with current interest groups (discussions, all participants). (30 minutes)
  4. Identification of other potential group members (all participants). (5 minutes)
  5. Summary of the results, actions, and identification of contributions of the group members. (10 minutes)
Type of Meeting: 
Working meeting
Short introduction describing any previous activities: 

State-of-the-art data management in advanced research domains is moving away from treating the data held at each participant site as a separate collection, and towards a paradigm where the entire global set of data is treated as one big namespace, with movement, alteration and versioning of all subsets of data tracked uniformly across the resultant system-of-systems.

 

In such a paradigm, data movement is not a one-off action of taking a locally annotated dataset and sending it to another location, but a matter of taking a globally identified, named block of data and replicating it between mutually coordinated sites.

 

Current examples of software that manages many petabytes worldwide come from the grid computing world (Globus) and the high-energy physics (HEP) world (Rucio). Climate science in the IPCC consortium and the ELIXIR bioinformatics consortium are also known to have done work on bespoke solutions. As it is complex to build such systems, there is a recent trend for emerging big research domains to align with existing data scheduling systems rather than develop their own, as in the joint SKA-CERN “ESCAPE” project.

Prior work on establishing inter-node namespaces, on harmonising and propagating their metadata, and on related concepts has been the topic of the Data Fabric IG. Plans to coordinate policy and execution on actual international implementations of data sharing architectures are the topic of the Global Open Research Commons IG. The organisers of this proposed BoF maintain a liaison with, and are informed by, both groups.

 

The organisers of this proposed BoF hope to make inter-node data management and mobility better known among other RDA groups. If traction is established, we can begin the scoping process among disciplines ready to take this next step. Infrastructure operators can then take up the resultant requirements, user stories and problem statements, and investigate either reuse of existing infrastructure or the building of bespoke solutions.

BoF chair serving as contact person: 
Remote participation availability (only for physical Plenaries): 
Yes