The Bootstrap Blog

Meeting the demand for engaging, quality datasets for K-12 data science

Data, data everywhere, and not a set to use... The DS Teacher's Lament

The demand for high-quality, engaging, and relatable datasets is soaring as the demand for K-12 Data Science (and close cousins ML and AI) grows. Indeed, early research shows that a student's choice of dataset has a substantial impact on their engagement! There are already many datasets freely available online, or ready to be provided by industry partners. Unfortunately, using these datasets in classrooms or curricula often requires additional work to take a dataset designed for a specific purpose and audience and adapt it to be accessible and relevant to a K-12 student. This work can be a barrier for educators.

Knowing where these datasets are sourced matters, too. We live in a world where algorithms trained on data determine who gets a loan and who is granted parole. Professional data scientists therefore have to be keenly aware of where a training dataset comes from, how it has been transformed, and how it will be used. This mindset is just as important for student data scientists, and can lead to valuable insights beyond the raw data itself: an Earth Science class analyzing a climate dataset should be able to find out whether that data came from Greenpeace or Chevron and discuss whether this impacts their analysis or recommendations!

For most of the past year, members of Bootstrap, Brown University, and Code.org have worked together to create a new specification that offers a pathway for individuals to find, clean, document, and upload datasets that can be used in K-12 data science tools. With additional input from AI4ALL, SAS, and Data Science For Everyone, we are proud to announce version 1.0 of "Datasets for K-12 Data Science" - the specification can be found here. Our goal with this spec is to create a common framework that individuals can use to take meaningful real-world data and make it accessible to K-12 students, teachers, and curricula.

In the spec, we lean heavily on the groundbreaking Datasheets for Datasets or "D4D" paper (Gebru, Morgenstern, et al., 2021), which proposed a series of questions whose answers must accompany all datasets used in professional contexts. However, most educational datasets are derived from an original dataset - usually cleaned or preprocessed for consistency, reduced size, or reduced complexity - by an outside third party, whereas the original D4D questions were targeted at the creators of the original dataset. In our experience, the original D4D questions fit this use case awkwardly. Our spec proposes an Educator-Facing Datasheet that adapts the structure in the D4D paper for these derived datasets, and adds new questions that provide additional context for educational uses.

Our Goals

We imagine a collection of datasets - open to all teachers and curriculum providers - where each dataset is accompanied by a "nutrition label" which includes:

what the data is about, where it came from, how and why it was created
what - if any - transformations, modifications, or cleaning have been done to prepare it for classroom use
pedagogical information, such as "unusual outliers for students to explore", "subject-specific concepts that the dataset is recommended for", "suggested subsets or correlations", etc.
a guarantee that the data meets certain quality controls, including student relevance, approval for free and open use, etc.
no blank cells, and reasonable formats for data, so that the teacher gets to make the choice about what kind of "messy data" their students confront
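
To make the "no blank cells" item concrete, here is a minimal sketch of how a dataset contributor might check a CSV file before submitting it. This is an illustration only and is not taken from the spec: the function name `find_blank_cells` and the sample data are hypothetical, and it assumes the dataset is a CSV with a single header row.

```python
import csv
import io


def find_blank_cells(csv_text):
    """Return (row_number, column_name) pairs for every blank cell.

    Hypothetical helper, not part of the spec. Row numbers count data
    rows starting at 1; the header row is not counted.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    blanks = []
    for row_number, row in enumerate(data, start=1):
        for index, column in enumerate(header):
            # A row shorter than the header is treated as having blank cells.
            value = row[index] if index < len(row) else ""
            if value.strip() == "":
                blanks.append((row_number, column))
    return blanks


# Made-up sample data with two blank cells.
sample = "name,species,weight_kg\nMaple,cat,4.2\nBiscuit,,11.0\nFig,dog,\n"
for row_number, column in find_blank_cells(sample):
    print(f"row {row_number}: column '{column}' is blank")
```

A check like this only reports where the gaps are. Deciding how to fill or remove them is a change to the data, and would belong in the datasheet's record of transformations.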

As authors of the specification, we want to be direct about the values we hold:

We believe that all students and teachers deserve access to a diverse range of high quality datasets.
We believe that these datasets must be free and open, and not tied to any one curriculum or community.
We believe that Data Science itself must be inclusive of more than just "math and code", and that social and ethical discussions must also be rooted in the data itself.
We recognize that this specification must evolve, and invite contributions from educators, researchers, and industry partners towards future iterations of the spec.

By winter 2022, Data Science For Everyone commits to creating an open-source platform that will centrally list datasets - accompanied by datasheets that adhere to principles outlined in the spec - for any teacher, curriculum provider, or other data science education program to draw upon and easily incorporate.

By summer 2023, Bootstrap and Code.org's curriculum team commit to prioritizing datasets that adhere to the Datasets for K-12 Data Science specification when designing new lessons and activities so that teachers know exactly what their students are using and where it came from. We hope that other curriculum developers will follow suit, and that industry partners who want to donate datasets to the community will include Datasheets as part of their contributions.

Posted September 27th, 2022