Data, data everywhere, and not a set to use... The DS Teacher's Lament
The demand for high-quality, engaging, and relatable datasets is soaring as the demand for K-12 Data Science (and close cousins ML and AI) grows. Indeed, early research shows that a student's choice of dataset has a substantial impact on their engagement! There are already many datasets freely available online, or ready to be provided by industry partners. Unfortunately, using these datasets in classrooms or curricula often requires additional work to take a dataset designed for a specific purpose and audience and adapt it to be accessible and relevant to a K-12 student. This work can be a barrier for educators.
Knowing where these datasets are sourced matters, too. We live in a world where algorithms trained on data determine who gets a loan and who is granted parole. Professional data scientists therefore have to be keenly aware of where a training dataset comes from, how it has been transformed, and how it will be used. This mindset is valuable for student data scientists as well and can lead to valuable insights beyond the raw data itself: an Earth Science class analyzing a climate dataset should be able to find out whether that data came from Greenpeace or Chevron and discuss if this impacts their analysis or recommendations!
For most of the past year, members of Bootstrap, Brown University, and Code.org have worked together to create a new specification that offers a pathway for individuals to find, clean, document, and upload datasets that can be used in K-12 data science tools. With additional input from AI4ALL, SAS, and Data Science For Everyone, we are proud to announce version 1.0 of "Datasets for K-12 Data Science" - the specification can be found here. Our goal with this spec is to create a common framework that individuals can use to take meaningful real-world data and make it accessible to K-12 students, teachers, and curricula.
In the spec, we lean heavily on the groundbreaking Datasheets for Datasets or "D4D" paper (Gebru, Morgenstern, et al., 2021), which proposed a series of questions whose answers must accompany all datasets used in professional contexts. However, most educational datasets are derived from an original dataset - usually cleansed or preprocessed for things like consistency, reduced size, or complexity. The original D4D questions were targeted at the creators of the original dataset whereas most educational datasets are instead derived from those datasets by a third party outside. In our experience the original D4D questions fit awkwardly for this use-case. Our spec proposes an Educator-Facing Datasheet that adapts the structure in the D4D paper for these derived datasets, and adds new questions that provide additional context for educational uses.
We imagine a collection of datasets - open to all teachers and curriculum providers - where each dataset is accompanied by a "nutrition label" which includes:
As authors of the specification, we want to be direct about the values we hold:
By winter 2022, Data Science For Everyone commits to creating an open-source platform that will centrally list datasets - accompanied by datasheets that adhere to principles outlined in the spec - for any teacher, curriculum provider, or other data science education program to draw upon and easily incorporate.
By summer 2023, Bootstrap and Code.org's curriculum team commit to prioritizing datasets that adhere to the Datasets for K-12 Data Science specification when designing new lessons and activities so that teachers know exactly what their students are using and where it came from. We hope that other curriculum developers will follow suit, and that industry partners who want to donate datasets to the community will include Datasheets as part of their contributions.