STELLA: A Framework for Reproducibility and Evaluation of Online Search Experiments
STELLA next event: CLEF 2020 – LILAS workshop https://clef-lilas.github.io/
Reproducibility is a central aspect of algorithm evaluation: it makes it easier to validate and compare results produced by different methods under different experimental setups. We take a reproducibility approach to evaluating experimental systems against a production system. Reproducing results from such experimental systems is often difficult, because they all need to operate on the same input data. This data should be provided by production systems, whose operators sometimes prefer to keep it private. To facilitate the evaluation and reproducibility of experimental search and recommendation systems, we propose an evaluation infrastructure named STELLA. STELLA is a project funded by the German Research Foundation that provides a Docker-based environment. STELLA follows the principle of a living lab: experimental ranking and recommendation algorithms are evaluated against a production system with real users. This setup differs considerably from classical Text Retrieval Conference approaches, which can only be carried out offline. It also differs from user studies in a predefined environment, where interactions are limited to a set of test cases and the platform is restricted to that setup. STELLA provides researchers with an evaluation method that was previously reserved for industrial research or the operators of large online platforms, i.e., an environment in which experimental setups run alongside a production system and interact with real, regular users.
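Living labs of this kind commonly compare an experimental ranker against the production ranking by interleaving the two result lists shown to a real user and crediting clicks to whichever system contributed the clicked document. The sketch below shows team-draft interleaving, a standard technique for such online comparisons; the function names and the simple click-crediting scheme are illustrative assumptions, not STELLA's actual API.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings into one list shown to the user, recording
    which system ("A" = production, "B" = experimental) contributed
    each position. Duplicates are shown only once."""
    rng = rng or random.Random()
    interleaved, teams = [], []
    seen = set()
    counts = {"A": 0, "B": 0}
    ia = ib = 0

    def skip_seen(ranking, i):
        # Advance past documents already placed in the interleaved list.
        while i < len(ranking) and ranking[i] in seen:
            i += 1
        return i

    while len(interleaved) < length:
        ia, ib = skip_seen(ranking_a, ia), skip_seen(ranking_b, ib)
        a_ok, b_ok = ia < len(ranking_a), ib < len(ranking_b)
        if not (a_ok or b_ok):
            break
        # The system that has contributed fewer documents picks next;
        # ties are broken randomly.
        if a_ok and (not b_ok or counts["A"] < counts["B"]
                     or (counts["A"] == counts["B"] and rng.random() < 0.5)):
            interleaved.append(ranking_a[ia])
            teams.append("A")
            seen.add(ranking_a[ia])
            counts["A"] += 1
        else:
            interleaved.append(ranking_b[ib])
            teams.append("B")
            seen.add(ranking_b[ib])
            counts["B"] += 1
    return interleaved, teams

def score_session(teams, clicked_positions):
    """Credit each click to the system that contributed that position."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins
```

Aggregated over many user sessions, the system with more click credits is preferred, which lets the experimental ranker be judged against the production baseline without ever exposing users to a purely experimental result page.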
Nowadays, in-silico research data is common across all sciences. As this data is produced and consumed by research software, software reproducibility has become central to the reproducibility of science itself. In a similar vein, open research software is fundamental to Open Science. Both of these subjects, reproducibility and Open Science, are relevant to the RDA. Although the RDA does not endorse any particular software or technology, A/B testing systems such as STELLA are generic enough to serve multiple domains. Presenting our poster at the RDA 15th Plenary will be an opportunity for us to gather feedback and improve our research.