Unit 3: Exploring Datasets

Unit Overview

Students learn to prepare for analyzing a new dataset by considering logical subsets of that data. They begin with the Animals Dataset, and then apply what they’ve learned to a dataset of their own choosing. In the process, they practice using the Design Recipe to create filter functions, and come up with questions they wish to explore. This unit focuses on categorical variables, and by the end students will know how to display them.

Agenda
10 min: Review
20 min: Making Subsets
20 min: Choose Your Dataset
40 min: Exploring Your Dataset
5 min: Closing
Product Outcomes:
Students choose a dataset they are interested in
Standards and Evidence Statements:
Standards with prefix BS are specific to Bootstrap; others are from the Common Core. Our Standards Document shows which units cover each standard.
Data 3.1.3: Explain the insight and knowledge gained from digitally processed data by using appropriate visualizations, notations, and precise language.
Length: 95 Minutes
Materials:
Preparation:
Computer for each student (or pair), with access to the internet
Student workbooks, and something to write with
Types	Functions	Values
Number	`num-sqrt, num-sqr`	`4, -1.2, 2/3`
String	`string-repeat, string-contains`	`"hello", "91"`
Boolean	`==, <, >, <=, >=, string-equal`	`true, false`
Image	`triangle, circle, star, rectangle, ellipse, square, text, overlay, bar-chart, pie-chart, bar-chart-raw, pie-chart-raw`
Table	`count, .row-n, .order-by, .filter`

Review (Time 10 minutes)
Open your saved animals-dataset file. You should have several functions defined:
is-fixed
gender
is-cat
is-young
If you didn’t have a chance to type them in from your workbook, make sure you do!
Take 10 minutes to write a function is-dog, then type it into the Definitions Area.
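For reference, one possible Design Recipe solution for is-dog is sketched below. It assumes each row has a "species" column, as in the Animals Dataset. The small table review-animals and the rows dog-row and cat-row are made up for this sketch so that the examples can run without the starter file; students would use rows from animals-table instead.

```pyret
# A tiny made-up table standing in for animals-table (hypothetical data)
review-animals = table: name, species, age
  row: "Rex", "dog", 3
  row: "Mittens", "cat", 1
end

# Two example rows to test with (.row-n counts from 0)
dog-row = review-animals.row-n(0)
cat-row = review-animals.row-n(1)

# is-dog :: (r :: Row) -> Boolean
# Consumes an animal, and computes whether its species is "dog"
examples:
  is-dog(dog-row) is true
  is-dog(cat-row) is false
end
fun is-dog(r): r["species"] == "dog" end
```

Writing one example that should be true and one that should be false makes it easier to spot a function that always produces the same answer.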

Making Subsets (Time 20 minutes)
A lot of Data Science involves making predictions based on data. Suppose we want to survey Americans and try to predict who our next president will be. Obviously, it would take too long to ask everyone who they’re voting for! Instead, pollsters try to take a sample of Americans, and generalize the opinion of the sample to estimate how Americans as a whole feel.
Would it be problematic to only call voters who are registered Democrats? To only call voters under 25? To only call regular churchgoers? Why or why not?
Suppose we are interested in how women feel about a particular issue. Should we still make sure we’re surveying men, too? Why or why not?
As you can see, sampling is a complicated issue! Depending on the question we want to answer, sometimes it makes sense to work with an entire dataset, and sometimes it makes sense to carve out a subset of the data (e.g., calling only women). In this Unit, we’ll practice what you learned about writing functions, and then use the .filter method to create subsets.
Make subsets first!
Data Scientists don’t always know what the interesting questions are right away. So whenever they explore a dataset, one of the first things they do is define some logical subsets, just to have them handy later. Someone looking at our animals dataset might want to consider "just the lizards" or "just males". This also helps them reason about the data, without being biased by a particular question.
A "kitten" is an animal whose species == "cat" and whose age < 2. How would you make a subset of just kittens? Turn to Page 14, and see what code will compute whether or not an animal is a kitten. Can you fill in the code for the other subsets?
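The kitten rule combines two tests on the same row, so both must be true at once. A minimal sketch of what that might look like follows, assuming the columns are named "species" and "age". The table kitten-examples is made up for illustration and is not part of the Animals Dataset.

```pyret
# A tiny made-up table for trying out the rule (hypothetical data)
kitten-examples = table: name, species, age
  row: "Mittens", "cat", 1
  row: "Tom", "cat", 7
  row: "Rex", "dog", 1
end

# is-kitten :: (r :: Row) -> Boolean
# Consumes an animal, and computes whether it is a cat younger than 2
examples:
  is-kitten(kitten-examples.row-n(0)) is true   # young cat
  is-kitten(kitten-examples.row-n(1)) is false  # cat, but too old
  is-kitten(kitten-examples.row-n(2)) is false  # young, but not a cat
end
fun is-kitten(r): (r["species"] == "cat") and (r["age"] < 2) end
```

Pyret asks for parentheses around each comparison when they are joined with and, which also makes the two conditions easier to read.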
Sometimes we want to create a table that’s just a random sample of an existing table. Type the following code into the Definitions Area (left-hand side of your screen), and click "Run":
tiny-sample = random-rows(animals-table, 3)
small-sample = random-rows(animals-table, 8)
What do you get when you evaluate tiny-sample in the Interactions Area? small-sample?
What is the contract for random-rows? What does the function do?
We already know how to define values, and how to filter a dataset. So let’s define some subsets, in addition to the random samples we just made:
dogs = animals-table.filter(is-dog)
cats = animals-table.filter(is-cat)
fixed = animals-table.filter(is-fixed)
young = animals-table.filter(is-young)
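To see what .filter does on a table small enough to check by hand, here is a minimal sketch. The table pets, the function species-is-dog, and the name just-dogs are all hypothetical and chosen only for this illustration.

```pyret
# A three-row made-up table (hypothetical data)
pets = table: name, species
  row: "Rex", "dog"
  row: "Mittens", "cat"
  row: "Fido", "dog"
end

# species-is-dog :: (r :: Row) -> Boolean
# Consumes a row, and computes whether its species is "dog"
fun species-is-dog(r): r["species"] == "dog" end

# .filter keeps only the rows for which the function produces true
just-dogs = pets.filter(species-is-dog)

check:
  just-dogs.length() is 2   # Rex and Fido
  pets.length() is 3        # the original table is unchanged
end
```

Because .filter produces a new table rather than changing the original, the same starting table can be used to define as many subsets as we like.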
We can make a pie chart showing how many animals of each species are in the shelter, by writing pie-chart(animals-table, "species")
Which of our subsets do you think will give us the most accurate approximation of the original chart?
pie-chart(dogs, "species")
pie-chart(cats, "species")
pie-chart(fixed, "species")
pie-chart(young, "species")
pie-chart(tiny-sample, "species")
pie-chart(small-sample, "species")
Compare the charts you get from each of these. Which one is the most representative of the whole population? Why?

Choose Your Dataset (Time 20 minutes)
Now it’s time to choose a dataset of your own! Throughout this course, you’ll be analyzing this dataset and writing up your findings. As you learn new tools for data science, you’ll continue to refine this analysis, answering questions and raising new ones of your own! Take 10 minutes to look through the following datasets, and choose one that interests you:
Movies (Dataset | Starter file)
Schools (Dataset | Starter file)
US Income (Dataset | Starter file)
US Presidents (US Presidents Dataset | Starter file)
Countries of the World (Dataset | Starter file)
Music (Dataset | Starter file)
New York City Restaurant Health Inspections (Dataset | Starter file)
Pokemon Characters (Dataset | Starter file)
IGN Video Game Reviews (Dataset | Starter file)
2016 Presidential Primary Election (Dataset | Starter file)
US State Demographics (Dataset | Starter File)
Sodas (Dataset | Starter file)
Cereals (Dataset | Starter file)
Summer Olympic Medals (Dataset | Starter file)
Winter Olympic Medals (Dataset | Starter file)
MLB Hitting Stats (Dataset | Starter file)
Spotify Top Songs (Dataset | Starter file)
Or find your own dataset, and use this (Blank Starter file) for your project. See this tutorial video for help importing your own data into Pyret.
Make sure students realize this is a firm commitment! The farther they go in the course, the harder it will be to change datasets.

Exploring Your Dataset (Time 40 minutes)
Look at the spreadsheet for your data. What do you notice? What do you wonder? Complete Page 15, making sure to have at least two Lookup Questions, two Compute Questions, and two Relate Questions.
In the Definitions Area, use random-rows to define at least three tables of different sizes: tiny-sample, small-sample, and medium-sample.
In the Definitions Area, use .row-n to define at least three values, representing different rows in your table.
Take a minute to think about subsets that might be useful for your dataset. Name these subsets and write the Pyret code to test an individual row from your dataset on Page 16.
Have students share back.
Turn to Page 17, and use the Design Recipe to write the filter functions that you planned out on Page 16. When the teacher has checked your work, type them into the Definitions Area and use the .filter method to define your new subset tables.
Choose one categorical column from your dataset, and try making a bar chart or pie chart for the whole table. Now try making the same display for each of your subsets. Which is most representative of the entire column in the table?
Have students share back. Encourage students to read their observations aloud, to make sure they get practice saying and hearing these observations.

Closing (Time 5 minutes)
Congratulations! You’ve explored the Animals dataset, formulated your own questions, and begun to think critically about how questions and data shape one another. For the rest of this course, you’ll be learning new programming and Data Science skills, practicing them with the Animals dataset and then applying them to your own data.
Have students share which dataset they chose, and pick one question they’re looking at.

Bootstrap:Data Science by Emmanuel Schanzer, Nancy Pfenning, Emma Youndtsmith, Jennifer Poole, Shriram Krishnamurthi, Joe Politz and Ben Lerner was developed partly through support of the National Science Foundation, (awards 1535276, 1647486, and 1738598), and is licensed under a Creative Commons 4.0 Unported License. Based on a work at www.BootstrapWorld.org. Permissions beyond the scope of this license may be available by contacting schanzer@BootstrapWorld.org.