Addressing the Challenges of Physical Data Access to Accelerate Science: Repositories, Networks and Petabytes, and How to Make them Play Nicely at Speed


20 Jan 2021


Submitted by Guido Aben


Meeting objectives: 

To explore the challenges that arise when handling big data, in particular facilitating data movement and bringing data to the compute. This session will explore solutions, both in production and under development, that enable repositories to deal with big data and truly make them FAIR.

 

 

Meeting agenda: 

Collaborative Notes:

https://docs.google.com/document/d/182yq9oobZPZY7g1aQtdVyjrNeemk_CBarxo3O6qhtxE/edit?usp=sharing

 

At the initial BoF held at P16, eInfrastructure operators and research repositories discussed points of overlap and potential collaboration in handling big data. A targeted project was proposed to investigate integrating the fast data movement tools and expertise present at NRENs with the data ingest mechanisms of state-of-the-art repository platforms. Both Zenodo and Fedora showed interest. This BoF will explore the scope of the proposed project and identify further participants who wish to collaborate on it. It will also gauge interest in establishing a dedicated WG for the project.

 

·        30 mins: update to the group on progress at GEANT, EGI, ESCAPE etc., with a focus on science workflow tools

·        30 mins: update to the group on the repository ingest offloading PoC (Zenodo/Fedora)

·        30 mins: discussion and gathering of future requirements

 

Type of Meeting: 
Working meeting
Short introduction describing any previous activities: 

 

At the 16th plenary, this group held its first BoF (https://www.rd-alliance.org/data-movement-what-infrastructure-fabrics-are-required-0) to investigate whether there was sufficient overlap between structured big-data movement activities at eInfrastructure operators (such as EGI, GEANT etc.) and the RDA-centred communities that concern themselves with scalable data description, big data handling in the repository context, etc.

 

The agenda there was kept broad, intending to cover the spectrum of issues and potential collaboration opportunities. Turnout was encouraging (30 virtual participants registered on the attendee log), and of the various agenda items, two were identified as warranting further study. These were:
 

1)     Informative -- Continuing a liaison between, on the one hand, the data movement / data orchestration as-a-service initiatives currently being set up at EGI, within the ESCAPE project, and among GEANT constituent NRENs and, on the other hand, the science disciplines represented at RDA who were about to embark on the structured, distributed big data part of their digital journey.
 

2)     Targeted project -- Investigate the integration of fast data movement tools and expertise present at NRENs with data ingest mechanisms in state-of-the-art repository platforms. The intent is to embark on a PoC between repository builders and operators on one side and NRENs on the other, whereby the latter offer scalable, sped-up repository ingest and inter-repository asset replication as a service. This would spare repository builders the need to invest time in maintaining bespoke ingest/distribution tools for the petabyte era. Both Zenodo and Fedora showed interest.


BoF chair serving as contact person: 
Meeting presenters: 
Sarah Jones (GEANT), Andrea Manzi (EGI), Guido Aben (AARNet)