Research data movement is a key component of research data management. In the 2019 position paper ‘EarthCube adoption/promotion of principles embodied in the FAIR acronym for current and future activities’ (https://www.earthcube.org/FAIR), the authors give as their rationale for adopting the FAIR principles that “Data and knowledge integration through community reuse of data can lead to broad scientific advances on complex, multidisciplinary research topics. Scientific advances in many disciplines are greatly facilitated by open access to readily available and well-documented data, including observations, interpretations, metadata, and data resources”.
Data movement is a core component in making data readily available and hence FAIR. Effective data movement is not trivial but should not be a barrier to data sharing.
Effortless data sharing, requiring no specific technical knowledge, is the aim of effective research data movement. Beyond the technical, there are also social bridges that help spread knowledge about research data movement, such as peer learning and interdisciplinary collaboration. Encouraging researchers, data providers and innovators to openly share knowledge across technologies, disciplines, and countries can address data movement issues as data volumes continue to grow in a broadening group of domains.
The organisers of this BoF include eInfrastructure operators as well as science data providers. The list of invited participants at the time of writing is:
- DKRZ (climate data provider, .de)
- EGI (EOSC's default grid / compute eInfrastructure provider, .eu)
- Australian Biocommons (life science data provider, .au)
- NeSI (eInfra operator, .nz)
- AARNet (eInfra operator, .au)
The organisers of this BoF aim to gain wider traction for concepts of data mobility that move away from one-off file transfers initiated by end users on an ad-hoc basis. Wider traction would mean greater economies of scale, a lower barrier to entry for new science disciplines, re-use of existing tooling, practices and infrastructure, and more stable career paths for eInfra support staff who specialise in data delivery engineering.
Experience shows that users who rely on ad-hoc scenarios tend not to have the tools required for the large volumes of data they are now working with (“You mean I can’t drag-and-drop a Terabyte directory structure?”), are insufficiently trained (“What do I do when the transfer stalls, two days in?”), and are under time pressure (“I need that Terabyte today! Nothing in the system says I should have started planning a week ago!”).
Taking lessons from cases where big data research works well, the main concepts to consider are:
- Data placement. This is the act of making sure that the block of data to be operated on is made available at the correct location; typically “near” a compute facility. This also implies that a “phone book” is available showing all locations that a user can move their data to, and a rights/quota management system telling a user whether they are allowed to move their data to that specific location.
- Data scheduling. This builds on data placement, but adds workflow automation. Instead of engaging in data placement in a user-operated, ad-hoc fashion, we now rely on research workflow tools. These tools know which data blocks will be needed at some point in data processing and where to find them, and can then request that they be placed in a location of choice. Significantly, what this step requires on top of the previous one is agreement among all users (“global agreement”) on a naming scheme for the entire data system (a “namespace”). This is how data blocks are uniquely identified, located and tracked during movement.
- Data integrity. This adds checks and versioning to data scheduling, so that all users of the global system can be assured that various copies of the same data in different locations are the same, can see who worked on them, and what changes were made. Proper versioning allows one to reverse a change if accidents occur.
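Two of the concepts above can be sketched in a few lines of code. The sketch below is purely illustrative and assumes nothing about any specific infrastructure's API: the endpoint names, the quota model, and the choice of SHA-256 are all hypothetical. It shows a placement “phone book” with a rights/quota check, and a checksum comparison of the kind an integrity layer relies on to assure users that replicas at different locations are identical.

```python
import hashlib
from pathlib import Path

# "Phone book" of locations a user may place data at, with a per-user
# quota in bytes. Names and the quota model are illustrative assumptions.
ENDPOINTS = {
    "hpc-scratch": {"quota": 10 * 2**40},   # 10 TiB, "near" a compute facility
    "archive":     {"quota": 100 * 2**40},  # 100 TiB long-term store
}

def may_place(endpoint: str, size: int, already_used: int = 0) -> bool:
    """Rights/quota check: may the user move `size` bytes to this location?"""
    if endpoint not in ENDPOINTS:            # unknown location: refuse
        return False
    return already_used + size <= ENDPOINTS[endpoint]["quota"]

def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so very large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicas_match(source: Path, replica: Path) -> bool:
    """Integrity check: copies of the same data at different locations
    must have identical checksums."""
    return checksum(source) == checksum(replica)
```

In production systems these roles are played by dedicated services rather than local functions, but the contract is the same: a transfer is only scheduled to a known, authorised location, and a replica is only trusted once its checksum matches the source.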
The purpose of this BoF is to determine the levels of support and tooling required in various disciplines to move parts of their datasets to the paradigm above. It will investigate the transferability of tooling and practices from established disciplines to new entrants. It will also investigate the merits of building partnerships between research disciplines and eInfrastructure providers to encourage these practices across domains.
For context, a few global-scale science efforts already treat their data this way. As a result, some infrastructures to deal with placement, scheduling and integrity are also already in place. These could be investigated for transferability and best practice. In this BoF we will also discuss the potential for new communities wishing to adopt data scheduling paradigms, and how to add them to existing infrastructures. This may have benefits both for new entrants and for incumbents: at the Exabyte scale, operators tend to be receptive to efficiencies such as those gained from re-used and pooled resources and funding.
If it is found that sufficient traction for the ideas described above exists, this and/or subsequent BoFs can focus on an action statement, linking together interested research disciplines with interested eInfrastructure operators. We suggest as a minimum to seek liaison with relevant SKA and CERN contacts; preferably, we would convince them to join a resultant WG or IG.