Content

This tutorial introduces provenance-aware data curation techniques for hybrid (classic and data-driven) methods applied to scientific datasets. We illustrate the approach by curating earth and biodiversity data collected using different strategies. We show how provenance gives insight into the trustworthiness of the content produced throughout the tasks that address earth and biodiversity problems, and into the conditions under which earth and biodiversity events and phenomena are identified.

1. Introduction: from scientific workflows to data science pipelines for addressing experimental science challenges

  • Data curation as an enabling action for maintaining content produced in scientific practice [PDF]
    • Data wrangling techniques.
    • Scientific lakehouse.
  • Scientific process curation [PDF]
    • Data collection strategies and protocols.
    • Quantitative and qualitative methodologies for dealing with curated content.
  • Data provenance
    • Provenance in databases: lineage, why-provenance, how-provenance, and where-provenance [CCT09]
    • Capturing provenance in data pipelines, from file-based provenance capture to more fine-grained methods [ZAI19]
    • Provenance within data science pipelines: debugging pipelines [LFS+23] and ensuring fairness in developing machine learning models [GGSS21]
  • Use case: curating earth and biodiversity phenomena detection applications
  • Conclusions and assessment
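
As a small illustration of the database provenance notions listed under "Data provenance" above, the following sketch computes witness sets (in the spirit of why-provenance) and lineage for a join-and-project query. It is a minimal sketch under stated assumptions, not the tutorial's own material: the relations, the tuple identifiers, and the function names `species_by_region` and `lineage` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input relations; each tuple has an identifier used as its provenance token.
observations = {
    "o1": ("lynx", "site_a"),
    "o2": ("lynx", "site_b"),
    "o3": ("otter", "site_b"),
}
sites = {
    "s1": ("site_a", "alps"),
    "s2": ("site_b", "alps"),
}


def species_by_region(observations, sites):
    """Join on site, project onto (species, region), and record witness sets.

    Each witness is the set of input tuple ids that together suffice to
    derive the output row (one witness per successful join combination).
    """
    why = defaultdict(set)
    for oid, (species, site) in observations.items():
        for sid, (site_name, region) in sites.items():
            if site == site_name:
                why[(species, region)].add(frozenset({oid, sid}))
    return dict(why)


def lineage(witnesses):
    """Lineage: every input tuple id that appears in at least one witness."""
    return frozenset().union(*witnesses)


why = species_by_region(observations, sites)
for row, witnesses in sorted(why.items()):
    print(row, sorted(sorted(w) for w in witnesses), sorted(lineage(witnesses)))
```

In this toy example the output row ("lynx", "alps") has two independent witnesses, so it survives the removal of either observation, whereas lineage only reports the flat set of contributing tuples. That difference is one reason finer-grained provenance can say more about how much trust a derived fact deserves.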