The RDA/EOSC Future team is launching a series of spotlights to showcase the grantees, their work and experience.
This week we spoke to Gavin Chait from the project 'Implementation of a "no code" method for schema-to-schema data transformations'.
Can you tell us about the origins of the idea behind the project? (Who, when, and how did the group of people behind the project come together?)
We live in a transformational era where governments and research groups are releasing more data than can conceivably be processed. This accessibility leads to exciting integrations of diverse sources, but it also raises extreme challenges for audit, review and replication.
In 2016, I started a longitudinal study which required importing and transforming data from 330 municipalities, updated every three months. There is no standard format or schema for the source data, which arrive in various types of tabular spreadsheet. Some publishers provide a single data update split across as many as seven spreadsheets, and some provide their full history each time, not just the latest data. Managing this was a nightmare, and ensuring confidence in our data probity necessitated building a complex support system.
In 2020, with COVID, the study formed a small, but crucial, part of both the pandemic response and economic recovery, as our data have been used by national and local government across the UK to conduct lockdown impact analysis and provide recovery support. That also meant that the underlying data were subjected to intense political scrutiny.
Our software support infrastructure started becoming as large as the research project itself, and I began looking for ways to extract it as a generic schema-to-schema data management and crosswalks system for other researchers to use.
Whyqd is premised on recognising the challenges inherent in replicable research - that tabular data are poorly documented, and that transforming those data into forms useful for research can be destructive and introduce difficult-to-audit transcription and transformation errors. An open-source toolkit not only supports replicability but also - by introducing "opinionated" workflows - changes the research process from exploring data to defining schemas and transforming them using crosswalks.
Whyqd also compiles these resources - source data, schemas, crosswalks - into a searchable database, automatically raising matches for reuse.
This process pairs accessible source data with accessible transformations, providing robust and auditable transparency.
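To make the crosswalk idea concrete, here is a minimal, hypothetical sketch in plain Python - illustrative only, not Whyqd's actual API. The key point is that the crosswalk is itself data: a declarative mapping from source fields to destination fields that can be stored, versioned, replayed, and audited, rather than ad-hoc transformation code. All field names and values below are invented for illustration.

```python
# Illustrative sketch of a schema-to-schema crosswalk (hypothetical;
# not Whyqd's actual API). The crosswalk is declarative data, so it
# can be saved, reviewed, and replayed - the basis of an audit trail.

SOURCE_ROWS = [
    {"LA Code": "E06000001", "Rate £": "1,250.50", "Period": "2023 Q1"},
    {"LA Code": "E06000002", "Rate £": "980.00", "Period": "2023 Q1"},
]

# Each action maps one destination field to a source field plus a
# cleaning step that normalises the raw spreadsheet value.
CROSSWALK = [
    {"destination": "la_code", "source": "LA Code", "clean": str.strip},
    {"destination": "rate", "source": "Rate £",
     "clean": lambda v: float(v.replace(",", ""))},
    {"destination": "period", "source": "Period", "clean": str.strip},
]

def apply_crosswalk(rows, crosswalk):
    """Transform source rows into the destination schema."""
    return [
        {a["destination"]: a["clean"](row[a["source"]]) for a in crosswalk}
        for row in rows
    ]

transformed = apply_crosswalk(SOURCE_ROWS, CROSSWALK)
print(transformed[0])
# → {'la_code': 'E06000001', 'rate': 1250.5, 'period': '2023 Q1'}
```

Because the mapping is separated from its execution, the same crosswalk can be re-applied to each quarterly update, and reviewers can audit the mapping itself instead of re-reading transformation scripts.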
As Whyqd goes live at https://whyqd.com, I will integrate resources developed through the Education and Training on Handling of Research Data IG into the portal to promote its use as a teaching resource. Changing work practices and improving data accessibility is a process, and Whyqd can provide one entry point to that process.
How has the project plan (and perhaps the idea) evolved during execution?
Mostly, it’s remained the same. If anything, the main challenge has been keeping focused on the core deliverable and not expanding to try to be everything. Data crosswalks are an immense and complex topic, with a wide range of stakeholders and interests, and it can be very compelling to want to solve “everything”. That simply eats up development time and leads to overwhelming complexity. So … as a single-developer project, that’s been the main task. Focus.
Have there been any surprises (new opportunities or challenges)?
I’ve begun a collaboration with Sveinung Gundersen on his Omnipy data crosswalks project. He’s with ELIXIR Norway and FAIRtracks. We'll be collaborating with the Centre for Bioinformatics at the University of Oslo and the Computational Systems Biology group at the University of Kaiserslautern-Landau at BioHackathon 2023 on enabling continuous RDM using Annotated Research Contexts with RO-Crate profiles for ISA. And I’ve met with numerous stakeholders in FAIRCORE4EOSC on their mapping project.
Presenting to the other EOSC grantees to show them what I’m doing has also been useful, and I’m hopeful that between these various initiatives I’ll find partners to work with beyond the end of EOSC Future.
The main challenge is complexity. There is no “easy” aspect to the project, just different mountains. Whyqd will not aim to solve all crosswalk problems but it will be an easy-to-use addition to a research data scientist's toolkit. One area I would definitely like to explore is deploying Whyqd on university and research institution infrastructure where it can be a local repository for all things data, supporting interoperability and discovery within the institute. Eventually, maybe we could federate all these different instances.
Can you tell us more about the team and how it has evolved? New skills/learning acquired, working methods shaped by the RDA/EOSC environment?
At the moment, it’s still just me, and mostly I’ve focused on expanding my skills in coding. A lot of the initial work was around memory management for working with datasets of more than a million rows, which imposes new challenges on data transformations. I’ve ensured modularisation, so that each component can be released as I go. That’s resulted in two standalone software components:
- https://github.com/whythawk/full-stack-fastapi-postgresql which is the project generator for the web app, and is generic for any data-driven Python-based progressive web application,
- https://github.com/whythawk/whyqd which is the underlying crosswalks and transformation engine released as a Python library.
These form the stack for the final web application, and - by the time you read this - it will be live at https://whyqd.com.
Are there any lessons learned that you could share with the community?
Designing software workflows is difficult, and the approach I developed I’m calling “tutorial-driven development”: I start by writing out a step-by-step guide as if I were teaching a group of students how to use the application. From there, I can work backwards into the code to see how to implement it. I find it helps break some logjams.
As an example: https://whyqd.readthedocs.io/en/latest/tutorials/tutorial1/ and, as a bonus, it actually is a usable teaching syllabus. Whyqd's core library is also listed on the EOSC Portal.
There is a space between writing crosswalks (coding) and understanding how data curation works. Writing these curation strategies out helps to frame the way the software must support a process – a job within a research team – rather than just being another software library that has no implications beyond the needs of the coder. How you design a crosswalk is a function of the data research process and may involve trade-offs and discussion across a research team.
I have also been invited to convene a focus group for the Education and Training on Handling of Research Data IG. The topic is: Learning Outcomes - Further material documentation/metadata for data curation in research should include learning outcomes which demonstrate data probity, attribution, and process, including differentiation based on discipline, data type, methodology, etc (e.g., quantitative vs qualitative; archaeology vs zoology; required restrictions).
How was your experience at the RDA 20th Plenary Meeting?
I went in with only limited knowledge of RDA and I came away with a great network of new potential collaborators and a deeper understanding of the incredible scale of the work RDA supports. I enjoyed myself a great deal.
You can contact Gavin by completing this form.