Introduction to Computational Data Science
Students are introduced to the Animals Dataset, learn about Tables, Categorical and Quantitative data, and consider the kinds of questions that can be asked about a dataset.
Prerequisites |
None |
Language Table |
No language features in this lesson |
- categorical data: data whose values are qualities that are not subject to the laws of arithmetic
- data science: the science of collecting, organizing, and drawing general conclusions from data, with the help of computers
- programming language: a set of rules for writing code that a computer can evaluate
- quantitative data: number values for which arithmetic makes sense
Introduction 20 minutes
Overview
Students look at opening questions, either at their desks or in a walk around the room. They select a question they are personally interested in, and think about the data required to answer that question. This process draws a direct line between answering questions they care about and the basics of data science.
Launch
-
Give students 2 minutes to choose a question that grabs their attention, and group themselves by question. Ideally, no student will be the only one interested in that question.
-
Have students spend 2 minutes coming up with a hypothesis about what the answer is, and explaining why. Does every student in a single question-grouping have the same answer?
Investigate
-
What information would you collect to answer this question? Give students 5 minutes to think about what information they would need to collect, to find the answer.
Possible Misconceptions
Students may lean towards questions about individuals, instead of questions about what’s true for a group of individuals who vary from one to another. For example, instead of wondering what movie gets the highest rating, they should ask what’s the typical rating for movies in a list, or how much those ratings tend to vary.
Synthesize
Have students share back the different data they would gather to answer their questions. For each question, students would likely have to gather many different kinds of data. If we wanted to find out if small schools are better than big schools, for example, we might want to gather data on SAT scores, college acceptance, etc. Each of these is a variable in our dataset: any two schools we look at could vary by each of them.
What’s the greatest movie of all time? Is Climate Change real? Who is the best quarterback? Is Stop-and-Frisk racially biased? We can’t survey every school in the world, or get data on every movie ever made or every police action, but we can do an analysis for a sample of them, and try to infer something about all of them as a whole. These questions quickly turn into a discussion about data: how you assess it, how you interpret the results, and what you can infer from those results. The process of learning from data is called Data Science. Data science techniques are used by scientists, business people, politicians, sports analysts, and people in hundreds of other fields to ask and answer questions about data.
We’ll use a programming language to investigate these questions. Just like any human language, programming languages have their own vocabulary and grammar that you will need to learn. The language you’ll be learning for data science is called Pyret.
The Animals Dataset 25 minutes
Overview
Students explore the Animals Dataset, sharing observations and familiarizing themselves with the idiosyncrasies and patterns in the data. In the process, they learn about Categorical and Quantitative data.
Notice and Wonder Pedagogy
This pedagogy has a rich grounding in literature, and is used throughout this course. In the "Notice" phase, students are asked to crowd-source their observations. No observation is too small or too silly! Students may notice that the animals table has corners, or that it’s printed in black ink. But by listening to other students' observations, students may find themselves taking a closer look at the dataset. The "Wonder" phase involves students raising questions, but they must also explain the context for those questions. Sharon Hessney (moderator of the New York Times' excellent "What’s Going On in This Graph?" activity) sometimes calls this "what do you wonder…and why?". Both of these phases should be done in groups or as a whole class, with time given to each.
Launch
Have students open the Animals Spreadsheet in a browser tab, or turn to The Animals Dataset (Page 2) in their Student Workbooks.
Investigate
This table contains data from an animal shelter, listing animals that have been adopted. We’ll be analyzing this table as an example throughout the course, but you’ll be applying what you learn to a dataset you choose as well.
-
Turn to Questions and Column Descriptions (Page 4) in your Student Workbook. What do you Notice about this dataset? Write down your observations in the first column.
-
Sometimes, looking at data sparks questions. What do you Wonder about this dataset, and why? Write down your questions in the second column.
-
There’s a third column, called “Question Type” — we’re going to return to that later, so you can ignore it for now.
-
If you look at the bottom of the spreadsheet file, you’ll see that this document contains multiple sheets. One is called "pets" and the other is called "README". Which sheet are we looking at?
-
Each sheet contains a table. For our purposes, we only care about the animals table on the "pets" sheet.
Any two animals in our dataset may have different ages, weights, etc. Each of these is called a variable in the dataset.
Data Scientists work with two broad kinds of data: Categorical Data and Quantitative Data. Categorical Data is used to classify, not measure. Categories aren’t subject to the laws of arithmetic. For example, we couldn’t ask if “cat is more than lizard”, and it doesn’t make sense to “find the average ZIP code” in a list of addresses. “Species” is a categorical variable, because we can ask questions like “which species does Mittens belong to?”
What are some other categorical variables you see in this table?
Quantitative Data is used to measure an amount of something, or to compare two pieces of data to see which is less or more. If we want to ask “how much” or “which is most”, we’re talking about Quantitative Data. "Pounds" is a quantitative variable, because we can talk about whether one animal weighs more than another or ask what the average weight of animals in the shelter is.
We use Categorical Data to answer “what kind?”, and Quantitative Data to answer "how much?".
-
Turn to Categorical or Quantitative? (Page 3), and answer questions 1-7.
-
Sometimes it can be tricky to figure out if data is categorical or quantitative, because it depends on how that data is being used!
-
On Categorical or Quantitative? (Page 3) in your Student Workbook, fill in the blanks for questions 8-13.
Synthesize
Have students share back their noticings (statements) and wonderings (questions), and write them on the board.
Data Science is all about using a smaller sample of data to make predictions about a larger population. It’s important to remember that tables are only a sample of a larger population: this table describes some animals, but obviously it isn’t every animal in the world! Still, if we took the average age of the animals from this particular shelter, it might tell us something about the average age of animals from other shelters.
Question Types 10 minutes
Overview
Students begin to categorize questions, sorting them into "lookup", "compute", and "relate" questions - as well as questions that simply can’t be answered based on the data.
Launch
Once we have a dataset, we can start asking questions! But how do we know what questions to ask? There’s an art to asking the right questions, and good Data Scientists think hard about what kind of questions can and can’t be answered.
Most questions can be broken down into one of four categories:
-
Lookup questions: These can be answered simply by looking up a single value in the table and reading it out. Once you find the value, you’re done! Examples of lookup questions might be “is Sunflower fixed?” or “How many legs does Felix have?”
-
Compute questions: These can be answered by computing an answer across a single column. Examples of compute questions might be “how much does the heaviest animal weigh?” or “What is the average age of animals from the shelter?”
-
Relate questions: These take the most work, because they require looking for relationships between multiple columns. Examples of relate questions might be “Do cats tend to be adopted faster than dogs?” or “Are older animals heavier than young ones?”
-
Can’t answer: These are questions that just can’t be answered based on the available data. We might ask "are cats or dogs better for elderly owners?", but the Animals Dataset doesn’t have information that we can use to answer it.
Investigate
-
Come up with examples for each type of question.
-
Look back at the Wonders you wrote on Questions and Column Descriptions (Page 4). Are any of these Lookup, Compute, or Relate questions? Circle the question type that’s appropriate. Can you come up with additional examples for each type of question?
Synthesize
Have students share their questions with the class. Allow time for discussion!
Have students reflect on what they learned by writing on What’s on your mind? (Page 5). Some prompts that may be helpful:
-
What new vocabulary did you learn?
-
What question was exciting to you, and what data would you need to answer it? Is that data Categorical or Quantitative?
-
What do you hope to learn in the next lesson?
Additional Exercises:
These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by Emmanuel Schanzer, Nancy Pfenning, Emma Youndtsmith, Jennifer Poole, Shriram Krishnamurthi, Joe Politz, Ben Lerner, Flannery Denny, and Dorai Sitaram with help from Eric Allatta and Joy Straub is licensed under a Creative Commons 4.0 Unported License. Based on a work at www.BootstrapWorld.org. Permissions beyond the scope of this license may be available by contacting schanzer@BootstrapWorld.org.
Starting to Program
Students begin to program in Pyret, learning about basic datatypes, operations, and value definitions.
Prerequisites |
None |
Language Table |
Students are not expected to have any familiarity with the Pyret programming language for this lesson. |
- data row: a structured piece of data in a dataset that typically reports all the information gathered about a given individual
- definitions area: the left-most text box in the Editor where definitions for values and functions are written
- editor: software in which you can write and evaluate code
- header: the titles of each column of a table, usually shown at the top
- identifier column: a column of unique values which identify all the individual rows (e.g. student IDs, SSNs, etc.)
- interactions area: the right-most text box in the Editor, where expressions are entered to evaluate
Introducing Pyret 10 minutes
Overview
Students open up the Pyret environment (code.pyret.org, or "CPO") and see how tables look in Pyret.
Launch
Open up the Animals Starter File in a new tab. Click “Connect to Google Drive” to sign into your Google account. This will allow you to save Pyret files into your Google Drive. Next, click the "File" menu and select "Save a Copy". This will save a copy of the file into your own account, so that you can make changes and retrieve them later.
This screen is called the Editor, and it looks something like the diagram you see here. There are a few buttons at the top, but most of the screen is taken up by two large boxes: the Definitions Area on the left and the Interactions Area on the right.
The Definitions Area is where programmers define values and functions that they want to keep, while the Interactions Area allows them to experiment with those values and functions. This is like writing function definitions on a blackboard, and having students use those functions to compute answers on scrap paper.
For now, we will only be writing programs in the Interactions Area.
The first few lines in the Definitions Area tell Pyret to import files from elsewhere, which contain tools we’ll want to use for this course. We’re importing a file called Bootstrap:Data Science, as well as files for working with Google Sheets, tables, and images:
include shared-gdrive("Bootstrap-DataScience-...")
include gdrive-sheets
include tables
include image
After that, we see a line of code that defines shelter-sheet to be a spreadsheet. This spreadsheet is loaded from Google Drive, so now Pyret can see the same spreadsheet you do. (Notice the funny scramble of letters and numbers in that line of code? If you open up the Google Sheet, you’ll find that same scramble in the address bar! That scramble is how the Pyret editor knows which spreadsheet to load.) After that, we see the following code:
# load the 'pets' sheet as a table called animals-table
animals-table = load-table: name, species, sex, age, fixed, legs, pounds, weeks
  source: shelter-sheet.sheet-by-name("pets", true)
end
The first line (starting with #) is called a Comment. Comments are notes for humans, which the computer ignores. The next line defines a new table called animals-table, which is loaded from the shelter-sheet defined above. We also create names for the columns: name, species, sex, age, fixed, legs, pounds and weeks. We could use any names we want for these columns, but it’s always a good idea to pick names that make sense!
Even if your spreadsheet already has column headers, Pyret requires that you name them in the program itself.
Click “Run”, and type animals-table into the Interactions Area to see what the table looks like in Pyret. Is it the same table you saw in Google Sheets? What is the same? What is different?
In Data Science, every table is composed of cells, which are arranged in a grid of rows and columns. Most of the cells contain data, but the first row and first column are special. The first row is called the header row, which gives a unique name to each variable (or “column”) in the table. The first column in the table is the identifier column, which contains a unique ID for each row. Often, this will be the name of each individual in the table, or sometimes just an ID number.
Below is an example of a table with one header row and two data rows:
name | species | sex | age | fixed | legs | pounds | weeks |
---|---|---|---|---|---|---|---|
"Sasha" | "cat" | "female" | 1 | false | 4 | 6.5 | 3 |
"Mittens" | "cat" | "female" | 2 | true | 4 | 7.4 | 1 |
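For a quick experiment that does not depend on Google Sheets, the same two data rows can also be written by hand as a Pyret table literal. This is only a sketch: the name two-animals is made up for illustration and is not defined in the Animals Starter File.

```pyret
# A small table written by hand: the header names come first,
# then one `row:` line per data row, with a value for every column.
# `two-animals` is a hypothetical name, not part of the starter file.
two-animals = table: name, species, sex, age, fixed, legs, pounds, weeks
  row: "Sasha", "cat", "female", 1, false, 4, 6.5, 3
  row: "Mittens", "cat", "female", 2, true, 4, 7.4, 1
end
```

After clicking "Run", typing two-animals into the Interactions Area should display a table with one header row and two data rows.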
Investigate
-
How many variables are listed in the header row for the Animals Dataset? What are they called? What is being used for the identifier column in this dataset?
-
Try changing the name of one of the columns, and click "Run". What happens when you print out the table back in the Interactions Area?
-
What happens if you remove a column from the list? Or add an extra one?
After the header come the data rows. Each data row has a value for every column variable (nothing can be left empty!). A table can have any number of data rows, including zero, as in the table below:
name | species | sex | age | fixed | legs | pounds | weeks |
---|---|---|---|---|---|---|---|
Numbers, Strings and Booleans 25 minutes
Overview
This lesson starts students programming, showing them how to make Pyret do simple math, work with text, and create simple computer graphics. It also draws attention to error messages, which are helpful when diagnosing mistakes.
Launch
Pyret lets us use many different kinds of data. In the animals table, for example, there are Numbers (the number of legs each animal has), Strings (the species of the animal), and Booleans (whether it is true or false that an animal is fixed). Pyret has the usual arithmetic operators: addition (+), subtraction (-), multiplication (*), and division (/).
To identify if an animal is male, we need to know if the value in the sex column is equal to the string "male". To sort the table by age, we need to know if one animal’s age is less than another’s and should come before it. To filter the table to show only young animals, we might want to know if an animal’s age is less than 2. Pyret has Boolean operators, too: equals (==), less-than (<), greater-than (>), as well as greater-than-or-equal (>=) and less-than-or-equal (<=).
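The comparisons described above can be tried out directly. The sketch below uses two made-up values for a single animal (animal-sex and animal-age are hypothetical names, not part of the starter file) and a check block, which is Pyret's way of stating what an expression is expected to produce.

```pyret
# Hypothetical values for one animal; these names are not in the starter file.
animal-sex = "female"
animal-age = 1

check:
  (animal-sex == "male") is false   # is this animal male?
  (animal-age < 2) is true          # is this animal young?
  (7.4 > 6.5) is true               # is 7.4 pounds more than 6.5 pounds?
  (4 >= 4) is true                  # greater-than-or-equal includes equal
  (3 <= 2) is false
end
```

Each comparison produces a Boolean, which is why the expected values on the right are all true or false.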
Investigate
In pairs, students complete Numbers and Strings (Page 7).
Discuss what students have learned about Pyret:
-
Numbers and Strings evaluate to themselves.
-
Anything in quotes is a String, even something like "42".
-
Strings must have quotation marks on both sides.
-
Operators like +, -, *, and / need spaces around them.
-
Any time more than one kind of operator is being used, Pyret requires that you use parentheses.
-
Types matter! We can add two Numbers or two Strings to one another, but we can’t add the Number 4 to the String "hello".
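As a reference for the debrief, here is a small sketch of these rules written as a Pyret check block. The specific values are chosen only for illustration.

```pyret
check:
  42 is 42                                  # a Number evaluates to itself
  "42" is "42"                              # in quotes, so this is a String, not a Number
  (1 + 2 + 3) is 6                          # repeating the same operator is fine
  ((2 + 3) * 4) is 20                       # mixing + and * needs parentheses
  (6 / 4) is 1.5                            # division of Numbers
  ("hello" + " world") is "hello world"     # + joins two Strings
end

# These are left as comments because each one produces an error:
#   2 + 3 * 4      (operators are mixed without parentheses)
#   4 + "hello"    (a Number cannot be added to a String)
```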
Error messages are a way for Pyret to explain what went wrong, and are a really helpful way of finding mistakes. Emphasize how useful they can be, and why students should read those messages out loud before asking for help. Have students see the following errors:
-
6 / 0. In this case, Pyret obeys the same rules as humans, and gives an error.
-
(2 + 2. An unclosed quotation mark is a problem, and so is an unmatched parenthesis.
In pairs, students complete Booleans (Page 8).
Synthesize
Debrief student answers as a class.
Going Deeper By using the |
Defining Values 20 minutes
Overview
Students learn how to define values in Pyret (note that these definitions work the way variable substitution does in math, as opposed to variable assignment you may have seen in other programming languages).
Launch
Pyret allows us to define names for values using the = sign. In math, you’re probably used to seeing definitions like x = 4, which defines the name x to be the value 4. Pyret works the same way, and you’ve already seen two names defined in this file: shelter-sheet and animals-table. We generally write definitions on the left, in the Definitions Area. You can add your own definitions, for example:
my-name = "Maya"
sum = 2 + 2
kittens-are-cute = true
With your partner, take turns adding definitions to this file:
-
Define a value with name food, whose value is a String representing your favorite food
-
Define a value with name year, whose value is a Number representing the current year
-
Define a value with name likes-cats, whose value is a Boolean that is true if you like cats and false if you don’t
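Once a name is defined, it can be used anywhere its value could be. The sketch below assumes three made-up definitions (teacher-name, small-sum, and likes-dogs are hypothetical and chosen so they do not clash with names students have already defined).

```pyret
# Hypothetical definitions, written in the Definitions Area
teacher-name = "Maya"
small-sum = 2 + 2
likes-dogs = true

check:
  # each name stands in for its value, just like substitution in math
  ("Hello, " + teacher-name) is "Hello, Maya"
  (small-sum * 10) is 40
  (likes-dogs == true) is true
end

# Defining teacher-name a second time in the same file is an error:
# a definition gives a name to a value once, rather than updating it later.
```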
Synthesize
Why is it useful to be able to define values, and refer to them by name?
Applying Functions
Students learn how to apply Functions, and how to interpret the information contained in a Contract: Name, Domain and Range. They then use this knowledge to explore more of the Pyret language.
Language Table |
No language features in this lesson |
- arguments: the inputs to a function; expressions for arguments follow the name of a function
- contract: a statement of the name, domain, and range of a function
- domain: the type or set of inputs that a function expects
- function: a mathematical object that consumes inputs and produces an output
- range: the type or set of outputs that a function produces
Applying Functions 15 minutes
Overview
Students learn how to apply functions in Pyret, reinforcing concepts from standard Algebra.
Launch
Students know about Numbers, Strings, Booleans and Operators, all of which behave just like they do in math. But what about functions? They may remember functions from algebra: f(x) = x².
-
What is the name of this function?
-
The expression f(2) applies the function f to the number 2. What will it evaluate to?
-
What will the expression f(3) evaluate to?
-
The values to which we apply a function are called its arguments. How many arguments does f expect?
Arguments (or "inputs") are the values passed into a function. This is different from variables, which are the placeholders that get replaced with input values! Pyret has lots of built-in functions, which we can use to write more interesting programs.
Have students log into CPO and open the "Animals Starter File". If they don’t have the file, they can open a new one. Have students type this line of code into the interactions area and hit Enter: num-sqrt(16).
-
What is the name of this function?
-
What do we think the expression num-sqrt(16) will evaluate to?
-
What did the expression num-sqrt(16) evaluate to?
-
Does the num-sqrt function produce Numbers? Strings? Booleans?
-
How many arguments does num-sqrt expect?
Have students type this line of code into the interactions area and hit Enter: num-min(140, 84).
-
What is the name of this function?
-
What does the expression num-min(140, 84) evaluate to?
-
Does the num-min function produce Numbers? Strings? Booleans?
-
How many arguments does num-min expect?
-
What happens if we forget to include a comma between our numbers?
Just like in math, functions can also be composed with one another. For example:
# take the minimum of 84 and 99, then take the square root of the result
num-sqrt(num-min(84, 99))
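To make the order of evaluation concrete, here is a short sketch using a check block. The numbers 81 and 5 are chosen for illustration so that every square root comes out to a whole number.

```pyret
check:
  num-sqrt(16) is 4
  num-min(140, 84) is 84
  # composition: the inner call is evaluated first,
  # and its result becomes the argument of the outer call
  num-sqrt(num-min(81, 99)) is 9
  num-min(num-sqrt(16), 5) is 4
end
```

In the third line, num-min(81, 99) produces 81, and num-sqrt is then applied to 81.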
Investigate
Have students complete Applying Functions (Page 9).
Synthesize
Debrief the activity with the class. What kind of value was produced by that expression? (An Image! New datatype!) Which error messages were helpful? Which ones weren’t?
Contracts 35 minutes
Overview
Students learn about Contracts, and how they can be used to figure out new functions or diagnose errors in their code. Then they use this knowledge to explore the contracts pages in their workbooks.
Launch
When students typed triangle(50, "solid", "red"), they created an example of a new Datatype, called an Image.
-
What are the types of the arguments triangle was expecting?
-
How does this output relate to the inputs?
-
Try making different triangles. Change the size and color! Try using "outline" for the second argument.
The triangle function consumes a Number and two Strings as input, and produces an Image. As you can imagine, there are many other functions for making images, each with a different set of arguments. For each of these functions, we need to keep track of three things:
-
Name — the name of the function, which we type in whenever we want to use it
-
Domain — the type of data we give to the function (names and Types!), written between parentheses and separated by commas
-
Range — the type of data the function produces
Domain and Range are Types, not specific values. As a convention, we capitalize Types and keep names in lowercase. triangle works on many different Numbers, not just the 50 we used in the example above!
These three parts make up a contract for each function. Let’s take a look at the Name, Domain, and Range of the functions we’ve seen before:
# num-sqrt :: (n :: Number) -> Number
# num-min :: (a :: Number, b :: Number) -> Number
# triangle :: (side :: Number, mode :: String, color :: String) -> Image
The first part of a contract is the function’s name. In this example, our functions are named num-sqrt, num-min, and triangle.
The second part is the Domain, or the names and types of arguments the function expects. triangle has a Number and two Strings as variables, representing the length of each side, the mode, and the color. We write name-type pairs with double-colons, with commas between each one. Finally, after the arrow goes the type of the Range, or the function’s output, which in this case is Image.
Contracts tell us a lot about how to use a function. In fact, we can figure out how to use functions we’ve never seen before, just by looking at the contract! Most of the time, error messages occur when we’ve accidentally broken a contract.
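As an illustration of reading a contract before using a function, the sketch below uses two other built-in functions, string-length and num-max. The argument names in the contract comments (s, a, b) are labels chosen here for readability.

```pyret
# string-length :: (s :: String) -> Number
# num-max :: (a :: Number, b :: Number) -> Number

check:
  string-length("hello") is 5
  num-max(140, 84) is 140
  # the Range of string-length is Number, which matches the Domain of num-max
  num-max(string-length("cat"), string-length("lizard")) is 6
end

# string-length(5) would break the contract:
# the Domain asks for a String, but 5 is a Number.
```

Reading the Domain says how many arguments to supply and of what type; reading the Range says what kind of value to expect back.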
Investigate
Complete pages Contracts (Page 10) and Matching Expressions and Contracts (Page 11), to get some practice working with Contracts.
Once you feel confident, it’s time to play with some new functions! Turn to the back of your workbook, and get some practice reading and using Contracts! Make sure you try out the following functions:
-
text
-
circle
-
ellipse
-
star
-
string-repeat
When you’ve figured out the code for each of these, write it down in the empty line beneath each contract. These pages will become your reference for the remainder of the class!
Here’s an example of another function. Type it into the Interactions Area to see what it does. Can you figure out the contract, based on the example?
string-contains("apples, pears, milk", "pears")
Possible Misconceptions
Students are very likely to randomly experiment, rather than actually using the Contracts page. You should plan to ask lots of direct questions to make sure students are making this connection, such as:
-
How many items are in this function’s Domain?
-
What is the name of the 1st item in this function’s Domain?
-
What is the type of the 1st item in this function’s Domain?
-
What is the type of the Range?
Synthesize
You’ve learned about Numbers, Strings, Booleans, and Images. You’ve learned about operators and functions, and how they can be used to make shapes, strings, and more!
One of the other skills you’ll learn in this class is how to diagnose and fix errors. Some of these errors will be syntax errors: a missing comma, an unclosed string, etc. All the other errors are contract errors. If you see an error and you know the syntax is right, ask yourself these three questions:
-
What is the function that is generating that error?
-
What is the contract for that function?
-
Is the function getting what it needs, according to its Domain?
By learning to use values, operations and functions, you are now familiar with the fundamental concepts needed to write simple programs. You will have many opportunities to use these concepts in this course, by writing programs to answer data science questions.
Make sure to save your work, so you can go back to it later!
Displaying Categorical Data
Students learn to apply functions to entire Tables, generating pie charts and bar charts. They then explore other plotting and display functions that are part of the Data Science library.
- bar chart: a display of categorical data that uses bars positioned over category values; each bar’s height reflects the count or percentage of data values in that category
- contract: a statement of the name, domain, and range of a function
- domain: the type or set of inputs that a function expects
- pie chart: a display that uses areas of a circular pie’s slices to show percentages in each category
Displaying Categorical Variables 10 minutes
Overview
Students extend their understanding of Contracts and function application, learning new functions that consume Tables and produce displays and plots.
Launch
Have students ever seen any pictures created from tables of data? Can they think of a situation when they’d want to consume a Table, and use that to produce an image? The library included at the top of the file includes some helper functions that are useful for Data Science, which we will use throughout this course. Here is the Contract for a function that makes pie charts, and an example of using it:
# pie-chart :: (t :: Table, col :: String) -> Image
pie-chart(animals-table, "legs")
-
What is the Name of this function?
-
How many inputs are in its Domain?
-
In the Interactions Area, type pie-chart(animals-table, "legs") and hit Enter. What happens?
Hovering over a pie slice reveals the label, as well as the count and the percentage of the whole. In this example we see that there is one three-legged animal, representing 3.2% of the population.
We can also resize the window by dragging its borders. This allows us to experiment with the data before closing the window and generating the final, non-interactive image.
The function pie-chart consumes a Table of data, along with the name of a categorical column you want to display. The computer goes through the column, counting the number of times that each value appears. Then it draws a pie slice for each value, with the size of the slice being the percentage of times it appears. In this example, we used our animals-table table as our dataset, and made a pie chart showing the distribution of legs across the shelter.
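For teachers who want to see the counting step spelled out, here is a rough sketch in plain Pyret. It is not how pie-chart is actually implemented; it only mirrors the description above. The list sample-legs is a made-up sample of five values, and lam (an anonymous function) is a language feature students have not yet seen.

```pyret
# Hypothetical sample of values from a "legs" column, written as a list
sample-legs = [list: 4, 4, 2, 4, 3]

# count how many times the value 4 appears in the sample
four-legged = sample-legs.filter(lam(n): n == 4 end).length()

# express that count as a percentage of all the values
percent-four-legged = (four-legged / sample-legs.length()) * 100

check:
  four-legged is 3
  percent-four-legged is 60
end
```

In this made-up sample, the slice for four-legged animals would take up 60% of the pie.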
Investigate
Here is the Contract for another function, which makes bar charts:
# bar-chart :: (t :: Table, col :: String) -> Image
-
Which column of the animals table tells us how many legs an animal has?
-
Use bar-chart to make a display showing how many animals have each number of legs.
Experiment with pie and bar charts, passing in different column names. If you get an error message, read it carefully!
-
What do you think are the rules for what kinds of columns can be used by bar-chart and pie-chart?
-
When would you want to use one chart instead of another?
=== Possible Misconceptions Pie charts and bar charts may show counts or percentages (in Pyret, pie charts show percentages and bar charts show counts). Bar charts look a lot like histograms, which are actually quite different because they display quantitative data, not categorical. Also, a pie chart can only display one categorical variable but a bar chart might be used to display two or more categorical variables.
=== Synthesize Pie and bar charts display what portion of a sample belongs to each category. If they are based on sample data from a larger population, we use them to infer the proportion of a whole population that might belong to each category.
Pie charts and bar charts are mostly used to display categorical columns.
While bars in some bar charts should follow some logical order (alphabetical, small-medium-large, etc), the pie slices and bars can technically be placed in any order, without changing the meaning of the chart.
== Exploring other Displays 30 minutes
=== Overview Students freely explore the Data Science display library. In doing so, they experiment with new charts, practice reading Contracts and error messages, and develop better intuition for the programming constructs they’ve seen before.
=== Launch There are lots of other functions, for all different kinds of charts and plots. Even if you don’t know what these plots are for yet, see if you can use your knowledge of Contracts to figure out how to use them.
=== Investigate
=== Possible Misconceptions There are many possible misconceptions about displays that students may encounter here. But that’s ok! Understanding all those other plots is not a learning goal for this lesson. Rather, the goal is to have them develop some loose familiarity, and to get more practice reading Contracts.
=== Synthesize
Today you’ve added more functions to your toolbox. Functions like pie-chart
and bar-chart
can be used to visually display data, and even transform entire tables!
You will have many opportunities to use these concepts in this course, by writing programs to answer data science questions.
Extension Activity
Sometimes we want to summarize a categorical column in a Table, rather than in a pie chart. For example, it might be handy to have a table that has a row for dogs, cats, lizards, and rabbits, and then the count of how many of each type there are. Pyret has a function that does exactly this! Try typing this code into the Interactions Area:
What did we get back?
- Use the
- Use the
Sometimes the dataset we have is already summarized in a table like this, and we want to make a chart from that. In this situation, we want to base our display on the summary table: the size of the pie slice or bar is taken directly from the count column, and the label is taken directly from the value column. When we want to use summarized data to produce a pie chart, we have another function:
== Additional Exercises: Practice Plotting
These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by Emmanuel Schanzer, Nancy Pfenning, Emma Youndtsmith, Jennifer Poole, Shriram Krishnamurthi, Joe Politz, Ben Lerner, Flannery Denny, and Dorai Sitaram with help from Eric Allatta and Joy Straub is licensed under a Creative Commons 4.0 Unported License. Based on a work at www.BootstrapWorld.org. Permissions beyond the scope of this license may be available by contacting schanzer@BootstrapWorld.org.
= Data Displays and Lookups
Students continue to practice making different kinds of data displays, this time focusing less on programming and more on using displays to answer questions. They also learn how to extract individual rows from a table, and columns from a row.
- categorical data
-
data whose values are qualities that are not subject to the laws of arithmetic.
- contract
-
a statement of the name, domain, and range of a function
- method
-
a function that is only associated with an instance of a datatype, which consumes inputs and produces an output based on that instance
- quantitative data
-
number values for which arithmetic makes sense
== Displaying Data 20 minutes
=== Overview Students get some more practice applying the plotting functions and working with Contracts, and begin to shift the focus from programming to data visualization. This activity pairs a hard programming skill (reading Contracts) with formal reading comprehension (identifying key portions of a sentence).
=== Launch The Contracts page in the back of students' workbooks contains contracts for many plotting functions.
Suppose we wanted to generate a display showing the ratio of fixed to un-fixed animals from the shelter? How do we go from a simple sentence to working code that makes a data display?
To make a data display, we ask "Which Rows?", "Which Column(s)?", and "What Display?"
-
We start by asking which rows we’re talking about. In this case, it’s all the animals from the shelter.
-
We also need to know which column(s) - or "which variable(s)" - we are displaying. In this case, it’s the
fixed
column. -
Finally, we need to know which display we are using. Is it a histogram? Bar chart? Scatter plots are essential for displaying relationships between columns, but the other displays only deal with one column. Some displays work for categorical data, and others are for quantitative data.
Once we can answer these questions, all we need to do is find the Contract for that display and fill in the Domain!
To display the categorical data, we can choose between pie and bar charts. Which one of these two is best, and why?
=== Investigate Do you know what kind of data is used for each display?
Turn to What Display Goes with Which Data? (Page 18), and see if you can identify what kind of data each display needs!
Let’s get some practice going from questions to code, making visualizations.
Turn to Data Displays (Page 19), and see if you can fill in these three parts for a number of data display requests. When you’re finished, try to make the display in Pyret using the appropriate function.
=== Synthesize Debrief the activity with students.
Optional: As an extension, have students break into teams and come up with additional Data Display challenges, then race to see which team can complete the other team’s challenges first!
== Row and Column Lookups 30 minutes
=== Overview Students learn how to access individual rows from a table in Pyret, and how to access a particular column from those rows.
=== Launch Have students open their saved Animals Starter File (or make a new copy), and click “Run”.
Tables have special functions associated with them, called Methods, which allow us to do all sorts of things with those tables. For example, we can get the first data row in a table by using the .row-n
method: animals-table.row-n(0)
Don’t forget: data rows start at index zero!
For practice, in the Interactions Area, use the row-n
method to get the second and third data rows.
What is the Domain of .row-n? What is the Range? Find the contract for this method in your contracts table. A table method is a special kind of function which always operates on a specific table. In our example, we always use .row-n with the animals table, so the number we pass in is always used to grab a particular row from animals-table.
Pyret also has a way for us to get at individual columns of a Row, by using a Row Accessor. Row accessors start with a Row value, followed by square brackets and the name of the column where the value can be found. Here are three examples that use row accessors to get at different columns from the first row in the animals-table:
animals-table.row-n(0)["name"]
animals-table.row-n(0)["age"]
animals-table.row-n(0)["fixed"]
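The same lookups can be tried on a small table without opening the starter file. The sketch below assumes a hypothetical three-row `pets` table; the real animals-table has more rows and columns.

```pyret
# A hypothetical three-row table; the real animals-table has more columns
pets = table: name, species, age, fixed
  row: "Sasha", "cat", 1, false
  row: "Toggle", "dog", 3, true
  row: "Mittens", "cat", 2, true
end

# Data rows start at index zero, so row-n(1) is the second row
second-pet = pets.row-n(1)

check:
  pets.row-n(0)["name"] is "Sasha"
  second-pet["species"] is "dog"
  second-pet["fixed"] is true
  pets.row-n(2)["age"] is 2
end
```

Defining `second-pet` as a name for the Row keeps the later lookups short.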
=== Investigate
-
How would you get the
weeks
column out of the second row? The third? -
Complete the exercises on Lookup Questions (Page 20).
We can use the row-n
method to define entire animal rows as values. Type the following lines of code into the Definitions Area and click “Run”:
animalA = animals-table.row-n(4)
animalB = animals-table.row-n(13)
Flip back to page 2 of your workbook and look at The Animals Dataset. Which row is animalA? Label it in the margin next to the dataset. Which row is animalB? Label it in the margin next to the dataset.
Now turn back to your screen.
What happens when you evaluate animalA
in the Interactions Area?
-
Define at least two additional values to be animals from the animals-table, called animalC and animalD.
=== Synthesize Have students share their answers, and see if there are any common questions that arise.
== Additional Exercises: - More Practice with Lookups
= Defining Functions
Students learn a structured approach to problem solving called the “Design Recipe”. They use it to define functions that create images, and learn how to apply those functions to enhance their scatter plots.
Supplemental Resources: In a more programming-focused course, or where it fits your learning goals, students can go on to write more sophisticated functions by learning about conditionals in the If-Expressions lesson.
- design recipe
-
a sequence of steps that helps people document, test, and write functions
== Defining Functions over Numbers 20 minutes
=== Overview
Students have learned to define values (e.g. name = "Maya", x = 5, etc). Students should have defined animalA and animalB to be the following two rows in the animals table.
animalA = animals-table.row-n(4)
animalB = animals-table.row-n(13)
If they haven’t, make sure they do this now.
=== Launch Suppose we want to make a solid, green triangle of size 10. What would we type? What if we wanted to make one of size 20? 25? 1000?
triangle(10, "solid", "green")
triangle(20, "solid", "green")
triangle(25, "solid", "green")
triangle(1000, "solid", "green")
This is a lot of redundant typing, when the only thing changing is the size of the triangle! It would be convenient to define a shortcut, which only needs the size. Suppose we call it gt
for short:
gt(10)
gt(20)
gt(25)
gt(1000)
We don’t need to tell gt
whether the shape is "solid"
or "outline"
, and we don’t need to tell it what color to use. We will define our shortcut so it already knows these things, and all it needs is the size. This is a lot like defining values, which we already know how to do. But values don’t change, so our triangles would always be the same size. Instead of defining values, we need to define functions.
To build our own functions, we’ll use a series of steps called the Design Recipe. The Design Recipe is a way to think through the behavior of a function, to make sure we don’t make any mistakes with the animals that depend on us! The Design Recipe has three steps, and we’ll go through them together for our first function.
Turn to The Design Recipe (Page 23) in your Student Workbook, and read the word problem at the top of the page.
Step 1: Contract and Purpose
The first thing we do is write a Contract for this function. You already know a lot about contracts: they tell us the Name, Domain and Range of the function. Our function is named gt
, and it consumes a Number. It makes triangles, so the output will be an Image. A Purpose Statement is just a description of what the function does:
# gt :: (size :: Number) -> Image
# Consumes a size, and produces a solid green triangle of that size.
Since the contract and purpose statement are notes for humans, we add the # symbol at the front of the line to turn them into comments.
Be sure to check students’ contracts and purpose statements before having them move on!
Step 2: Write Examples
Examples are a way for us to tell the computer how our function should behave for a specific input. We can write as many examples as we want, but they must all be wrapped in an examples: block and an end statement. Examples start with the name of the function we’re writing, followed by an example input. Suppose we write gt(10)
. What work do we have to do, in order to produce the right shape as a result? What if we write gt(20)
?
# gt :: (size :: Number) -> Image
# Consumes a size, and produces a solid green triangle of that size.
examples:
  gt(100) is triangle(100, "solid", "green")
  gt(30) is triangle(30, "solid", "green")
end
Step 3: Define the Function
We start with the fun
keyword (short for “function”), followed by the name of our function and a set of parentheses. This is exactly how all of our examples started, too. But instead of writing 10
or 20
, we’ll use the label from our Domain. Then we add a colon (:
) in place of is
, and write out the work we did to get the answers for our examples. Finally, we finish with the end
keyword.
# gt :: (size :: Number) -> Image
# Consumes a size, and produces a solid green triangle of that size.
examples:
  gt(100) is triangle(100, "solid", "green")
  gt(30) is triangle(30, "solid", "green")
end
fun gt(size):
  triangle(size, "solid", "green")
end
=== Investigate
Type your function definition into the Definitions Area. Be sure to include the Contract, Purpose Statement, Examples and your Definition! Once you have typed everything in, click "Run" and evaluate gt(10)
in the Interactions Area. What did you get back?
Once we have defined a function, we can use it as our shortcut! This makes it easy to write simpler code, by moving the complexity into a function that can be tested and re-used whenever we like.
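As a further illustration, the same three steps can be applied to a different shape. The function `rs` (for "red square") below is a hypothetical example, not one of the workbook problems, and the sketch assumes Pyret's `image` library is loaded with `include image` rather than through the course starter file.

```pyret
include image

# rs :: (size :: Number) -> Image
# Consumes a size, and produces a solid red square of that size.
examples:
  rs(5) is square(5, "solid", "red")
  rs(40) is square(40, "solid", "red")
end
fun rs(size):
  square(size, "solid", "red")
end
```

Only the size changes between the two examples, which is why `size` is the one label that appears in the definition.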
-
Use the Design Recipe to solve the word problem at the bottom of The Design Recipe (Page 23).
-
Type in the Contract, Purpose Statement, Examples and Definition into the Definitions Area.
-
Click “Run”, and make sure all your examples pass!
-
Type bc(20) into the Interactions Area. What happens?
=== Synthesize Ask students what happens if they change one of the examples to be incorrect: gt(10) is triangle(99, "solid", "green")
== Defining Functions over Other Datatypes 20 minutes
=== Overview Students deepen their understanding of function definition and the Design Recipe, by solving different kinds of problems.
=== Launch
Functions can consume values besides Numbers. For example, we might want to define a function called sticker
that consumes a Color, and draws a star of that color:
fun sticker(color):
  star(50, "solid", color)
end
Or a function called nametag
that consumes a Row from the animals table, and draws that animal’s name in purple letters.
fun nametag(r):
  text(r["name"], 10, "purple")
end
NOTE: for now, students will follow the pattern for row-consuming functions, so that both examples include a lookup operation. Eventually, however, students will write examples that do not contain lookups.
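The pattern described in the note can be sketched as follows. The `pets` table and the `petA` row are hypothetical stand-ins for the animals table and animalA, and `include image` is assumed in place of the course starter file. Notice that the example contains the same lookup as the definition.

```pyret
include image

# A hypothetical two-row table standing in for the animals table
pets = table: name, species
  row: "Sasha", "cat"
  row: "Toggle", "dog"
end
petA = pets.row-n(0)

# nametag :: (r :: Row) -> Image
# Consumes a row, and draws that animal's name in purple letters
examples:
  nametag(petA) is text(petA["name"], 10, "purple")
end
fun nametag(r):
  text(r["name"], 10, "purple")
end
```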
=== Investigate
Turn to The Design Recipe (Page 24), and use the Design Recipe to write both of these functions.
== Custom Scatter Plot Images 15 minutes
=== Overview Students discover functions that consume other functions, and compose a scatter plot function with one of the functions they’ve already defined.
=== Launch
Students have used Pyret functions that use Numbers, Strings, Images, and even Tables and Rows. Now they’ve written functions of their own that work with these datatypes. However, Pyret functions can even use other functions! Have students look at the Contract for image-scatter-plot
:
image-scatter-plot :: (t :: Table, xs :: String, ys :: String, f :: (Row -> Image)) -> Image
This function looks a lot like the regular scatter-plot function. It takes in a table, and the names of columns to use for x- and y-values. Take a closer look at the fourth input…
...f :: (Row -> Image)...
That looks like the contract for a function! Indeed, the fourth input to image-scatter-plot is named f, which itself is a function that consumes Rows and produces Images. In fact, students have just defined a function that does exactly that!
=== Investigate
-
Type
image-scatter-plot(animals-table, "pounds", "weeks", nametag)
into the Interactions Area. -
What did you get?
-
What other scatter plots could we create?
Note: the optional lesson If Expressions goes deeper into basic programming constructs, using image-scatter-plot
to motivate more complex (and exciting!) plots.
=== Synthesize
Functions are powerful tools, for both mathematics and programming. They allow us to create reusable chunks of logic that can be tested to ensure correctness, and can be used over and over to solve different kinds of problems. A little later on, you’ll learn how to combine, or compose functions together, in order to handle more complex problems.
== Additional Exercises:
= Table Methods
Students learn about table methods, which allow them to order, filter, and build columns to extend the animals table.
== Review Function Definitions 15 minutes
=== Overview Students get some practice reading function definitions, and in the process they build knowledge that’s needed later on in the lesson.
=== Launch Let’s see how much you remember about function definitions! Load the Table Methods Starter File, go to the File menu, and click "Save a Copy".
=== Investigate
Students complete Reading Function Definitions (Page 27) in their student workbooks.
=== Synthesize Can students explain what each function does?
== Ordering Tables 10 minutes
=== Overview Students learn a second table method, which allows them to sort rows in ascending or descending order, according to one column.
=== Launch
Have students find the contract for .order-by
in their contracts pages. The .order-by
method consumes a String (the name of the column by which we want to order) and a Boolean (true for ascending, false for descending). But what does it produce?
=== Investigate
-
Type
animals-table.order-by("name", true)
into the Interactions Area. What do you get? -
Type
animals-table.order-by("age", false)
into the Interactions Area. What do you get? -
Sort the animals table from heaviest-to-lightest.
-
Sort the animals table alphabetically by species.
-
Sort the animals table by how long it took for each animal to be adopted, in ascending order.
=== Synthesize
Answer any questions students may have. Class discussion: what do .order-by
and .row-n
have in common? How are they different?
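To support the discussion, here is a minimal sketch of `.order-by` on a hypothetical three-row `pets` table (the names and weights are made up for illustration). It shows that the method produces a new Table, while `.row-n` produces a Row.

```pyret
# A hypothetical table standing in for the animals table
pets = table: name, pounds
  row: "Sasha", 6.5
  row: "Toggle", 48
  row: "Mittens", 9
end

by-name = pets.order-by("name", true)       # ascending: A to Z
by-weight = pets.order-by("pounds", false)  # descending: heaviest first

check:
  by-name.row-n(0)["name"] is "Mittens"
  by-weight.row-n(0)["name"] is "Toggle"
  by-weight.row-n(2)["name"] is "Sasha"
  # The original table keeps its order
  pets.row-n(0)["name"] is "Sasha"
end
```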
== Filtering Tables 20 minutes
=== Overview Students learn how to filter tables, by removing rows.
=== Launch Explain to students that you have "Function Cards", which describe the purpose statement of a function that consumes a Row from a table of students, and produces a Boolean (e.g. - "this student is wearing glasses"). Select a volunteer to be the "filter method", and have them randomly choose a Function Card, and make sure they read it without showing it to anyone else.
Have ~10 students line up in front of the classroom, and have the filter method go to each student and say "stay" or "sit" depending on whether their function would return true or false for that student. If they say "sit", the student sits down. If they say "stay", the student stays standing.
Ask the class: based on who sat and who stayed, what function was on the card?
The .filter
method takes a function, and produces a new table containing only rows for which the function returns true
.
Suppose we want to get a table of only animals that have been fixed? Have students find the contract for .filter
in their contracts pages. The .filter
method is taking in a function. What is the contract for that function? Where have we seen functions-taking-functions before?
=== Investigate
-
In the Interactions Area, type
animals-table.filter(is-fixed)
. What did you get? -
What do you expect
animals-table
to produce, and why? Try it out. What happened? -
In the Interactions Area, type
animals-table.filter(is-old)
. What did you get? -
In the Interactions Area, type
animals-table.filter(is-dog)
. What did you get? -
In the Interactions Area, type
animals-table.filter(lookup-name)
. What did you get?
The .filter
method walks through the table, applying whatever function it was given to each row, and producing a new table containing all the rows for which the function returned true
. Notice that the Domain for .filter
says that test must be a function (that’s the arrow), which consumes a Row
and produces a Boolean
. If it consumes anything besides a single Row
, or if it produces anything else besides a Boolean
, we’ll get an error.
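The walk-through above can be sketched on a small table. The `pets` table is hypothetical, and the definition of `is-fixed` shown here is an assumption about what the starter file's version does: it simply looks up the Boolean in the fixed column.

```pyret
# A hypothetical table standing in for the animals table
pets = table: name, species, fixed
  row: "Sasha", "cat", false
  row: "Toggle", "dog", true
  row: "Mittens", "cat", true
end

# is-fixed :: (r :: Row) -> Boolean
# Consumes a row, and looks up the value in the fixed column
fun is-fixed(r):
  r["fixed"]
end

# Only the rows for which is-fixed returns true are kept
fixed-pets = pets.filter(is-fixed)

check:
  fixed-pets.length() is 2
  fixed-pets.row-n(0)["name"] is "Toggle"
  fixed-pets.row-n(1)["name"] is "Mittens"
end
```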
=== Possible Misconceptions
Students often think that filtering a table changes the table. In Pyret, all table methods produce a brand new table. If we want to save that table, we need to define it. For example: cats = animals-table.filter(is-cat)
.
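A short sketch can make this point concrete for students. It assumes a hypothetical three-row `pets` table and an `is-cat` function that compares the species column to "cat".

```pyret
# A hypothetical table standing in for the animals table
pets = table: name, species
  row: "Sasha", "cat"
  row: "Toggle", "dog"
  row: "Mittens", "cat"
end

# is-cat :: (r :: Row) -> Boolean
# Consumes a row, and computes whether the species is "cat"
fun is-cat(r):
  r["species"] == "cat"
end

# Defining a name saves the new, filtered table
cats = pets.filter(is-cat)

check:
  cats.length() is 2
  # The table we filtered still has all three rows
  pets.length() is 3
end
```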
=== Synthesize Debrief with students. Some guiding questions on filtering:
-
Suppose we wanted to determine whether cats or dogs get adopted faster. How might using the
.filter
method help? -
If the shelter is purchasing food for older cats, what filter would we write to determine how many cats to buy for?
-
Can you think of a situation where filtering fixed animals would be helpful?
== Building Columns 10 minutes
=== Overview
Students learn how to build columns, using the .build-column
table method.
=== Launch
Suppose we want to transform our table, converting pounds
to kilograms
or weeks
to days
. Or perhaps we want to add a "cute" column that just identifies the puppies and kittens? Have students find the contract for .build-column
in their contracts pages. The .build-column
method is taking in a function and a string. What is the contract for that function?
=== Investigate
-
Try typing
animals-table.build-column("old", is-old)
into the Interactions Area. -
Try typing
animals-table.build-column("sticker", label)
into the Interactions Area. -
What do you get? What do you think is going on?
The .build-column method walks through the table, applying whatever function it was given to each row. Whatever the function produces for that row becomes the value of our new column, which is named based on the string it was given. In the first example, we gave it the is-old function, so the new table had an extra Boolean column for every animal, indicating whether or not it was old. Notice that the Domain for .build-column says that the builder must be a function which consumes a Row and produces some other value. If it consumes anything besides a single Row, we’ll get an error.
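The pounds-to-kilograms idea from the Launch can be sketched as follows. The `pets` table and the `kilograms` function are hypothetical, and the conversion uses the rough figure of 2.2 pounds per kilogram.

```pyret
# A hypothetical table standing in for the animals table
pets = table: name, pounds
  row: "Sasha", 6.6
  row: "Toggle", 44
end

# kilograms :: (r :: Row) -> Number
# Consumes a row, and converts its pounds to kilograms (1 kg is about 2.2 lb)
fun kilograms(r):
  r["pounds"] / 2.2
end

# The builder runs once per row; its result fills the new column
with-kilos = pets.build-column("kilograms", kilograms)

check:
  with-kilos.row-n(0)["kilograms"] is 3
  with-kilos.row-n(1)["kilograms"] is 20
  # The new table keeps the old columns too
  with-kilos.row-n(1)["name"] is "Toggle"
end
```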
=== Synthesize Debrief with students. Ask them if they can think of a situation where they would want to use this. Some ideas:
-
A dataset about school might include columns for how many students are in the school and how many pass the state exam. But when comparing schools of different sizes, what we really want is a column showing what percentage passed the exam. We could use
.build-column
to compute that for every row in the table. -
The animal shelter might want to print nametags for every animal. They could build a column using the
text
function to have every animal’s name in big, purple letters. -
A dataset from Europe might list everything in metric (centimeters, kilograms, etc), so we could build a column to convert that to imperial units (inches, pounds, etc).
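The first idea above might look like the sketch below. The `schools` table, its column names, and its numbers are all made up for illustration.

```pyret
# A hypothetical dataset of two schools of very different sizes
schools = table: name, students, passed
  row: "North", 200, 150
  row: "South", 1000, 600
end

# percent-passed :: (r :: Row) -> Number
# Consumes a school, and computes the percentage of students who passed
fun percent-passed(r):
  (r["passed"] / r["students"]) * 100
end

with-percent = schools.build-column("percent-passed", percent-passed)

check:
  with-percent.row-n(0)["percent-passed"] is 75
  with-percent.row-n(1)["percent-passed"] is 60
end
```

South has more passing students in total, but the new column shows that North has the higher pass rate.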
== Additional Exercises:
= Defining Table Functions
Students continue practicing the Design Recipe, writing helper functions to filter rows and build columns in the Animals Dataset, using Methods.
== Defining Lookup Functions 25 minutes
=== Overview Students continue practicing the Design Recipe, by writing functions to answer Lookup Questions.
=== Launch
Take two minutes to find all the fixed animals by hand. Turn to The Animals Dataset, and walk down the table one row at a time, putting a check next to each animal that is fixed.
To do this activity, what kind of question were you asking of each animal? Was it a Lookup, Compute, or Relate question?
You went through the table one row at a time, and for each row you did a lookup on the fixed
column.
Have students type the code that will look up if animalA
is fixed or not, then do the same with animalB
. Suppose we wanted to do this for every animal in the table? This seems really repetitive, doesn’t it? We would keep typing the same thing over and over, but all that’s really changing is the animal. Wouldn’t it be great if Pyret had a function called lookup-fixed
, that would do this for us?
Fortunately, we already know how to define functions using the Design Recipe!
Turn to The Design Recipe (Page 28) in your Student Workbook.
Step 1: Contract and Purpose
The first thing we do is write a Contract for this function. You already know a lot about contracts: they tell us the Name, Domain and Range of the function. Our function is named lookup-fixed
, and it consumes a row from the animals table. It looks up the value in the fixed
column, which will always be a Boolean. A Purpose Statement is a description of what the function does:
# lookup-fixed :: (r :: Row) -> Boolean
# Consumes an animal, and looks up the value in the fixed column
Since the contract and purpose statement are notes for humans, we add the # symbol at the front of each line to turn them into comments. Note that both the purpose statement and the function name mention the lookup! This is a useful way of reminding ourselves what the function is for.
Be sure to check students’ contracts and purpose statements before having them move on.
Step 2: Write Examples
Writing examples for Lookup questions is really simple: all we have to do is look up the correct value in the Row, and then write the answer!
# lookup-fixed :: (r :: Row) -> Boolean
# Consumes an animal, and looks up the value in the fixed column
examples:
  lookup-fixed(animalA) is true
  lookup-fixed(animalB) is false
end
Step 3: Define the Function
When defining the function, we replace the answer with the lookup code.
# lookup-fixed :: (r :: Row) -> Boolean
# Consumes an animal, and looks up the value in the fixed column
examples:
  lookup-fixed(animalA) is true
  lookup-fixed(animalB) is false
end
fun lookup-fixed(r):
  r["fixed"]
end
No lookups in examples! In all previous functions, the examples matched the definitions almost perfectly. The only difference was the definition’s use of variables instead of actual values. So if our definition uses a lookup operation ( Data Scientists never want to stray far from the "truth", meaning they want to use real data whenever possible in order to minimize errors. So when writing examples for a specific animal, we use the actual value from that animal’s row instead of a lookup operation. |
=== Investigate For practice, try using the Design Recipe to define another lookup function.
-
Use the Design Recipe to solve the word problem at the bottom of The Design Recipe (Page 28).
-
Type in the Contract, Purpose Statement, Examples and Definition into the Definitions Area.
-
Click “Run”, and make sure all your examples pass!
-
Type
lookup-sex(animalA)
into the Interactions Area.
== Defining Compute Functions 25 minutes
=== Overview Students define functions that answer Compute Questions, again practicing the Design Recipe.
=== Launch
We’ve only been writing Lookup Functions: they consume a Row, look up one column from that row, and produce the result as-is. And as long as that column contains Boolean values, we can use that function with the .filter method.
But what if we want to filter by a Boolean expression? For example, what if we want to find out specifically whether or not an animal is a cat, or whether it’s young? Let’s walk through an example of a Compute Function using the Design Recipe, by turning to The Design Recipe (Page 29).
Suppose we want to define a function called is-cat
, which consumes a row from the animals-table
and returns true if the animal is a cat.
-
Is this a Lookup, Compute or Relate question?
-
What is the name of this function? What are its Domain and Range?
-
Is Sasha a cat? What did you do to get that answer?
To find out if an animal is a cat, we look up the species column and check to see if that value is equal to "cat"
. Suppose animalA
is a cat and animalB
is a dog. What should our examples look like? Remember: we replace any lookup with the actual value, and check to see if it is equal to "cat"
.
# is-cat :: (r :: Row) -> Boolean
# Consumes an animal, and computes whether the species is "cat"
examples:
  is-cat(animalA) is "cat" == "cat"
  is-cat(animalB) is "dog" == "cat"
end
Write two examples for your defined animals. Make sure one is a cat and one isn’t!
As before, we’ll use the pattern from our examples to come up with our definition.
# is-cat :: (r :: Row) -> Boolean
# Consumes an animal, and computes whether the species is "cat"
examples:
  is-cat(animalA) is "cat" == "cat"
  is-cat(animalB) is "dog" == "cat"
end
fun is-cat(r): r["species"] == "cat" end
Don’t forget to include the lookup code in the function definition! We only write the actual value for our examples!
=== Investigate
-
Type this definition (and its examples!) into the Definitions Area, then click “Run” and try using it to filter the animals-table.
-
For practice, try solving the word problem for is-young at the bottom of The Design Recipe (Page 29).
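Filtering with a Compute Function can be tried on a tiny table first. This is a minimal sketch, assuming it is run at code.pyret.org where table literals and the .filter method are available; the pets table and its rows are made up for illustration and stand in for the real animals-table.

```pyret
# A made-up stand-in for the animals-table
pets = table: name, species
  row: "Whiskers", "cat"
  row: "Rex", "dog"
  row: "Mittens", "cat"
end

# is-cat :: (r :: Row) -> Boolean
# Consumes an animal, and computes whether the species is "cat"
fun is-cat(r): r["species"] == "cat" end

# .filter keeps only the rows for which is-cat produces true
cats = pets.filter(is-cat)

check:
  cats.length() is 2
  cats.row-n(0)["name"] is "Whiskers"
  cats.row-n(1)["name"] is "Mittens"
end
```

The function passed to .filter is written without parentheses or arguments: the method calls it once for each row.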
=== Synthesize Debrief as a class. Ask students to brainstorm some other functions they could write.
These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by Emmanuel Schanzer, Nancy Pfenning, Emma Youndtsmith, Jennifer Poole, Shriram Krishnamurthi, Joe Politz, Ben Lerner, Flannery Denny, and Dorai Sitaram with help from Eric Allatta and Joy Straub is licensed under a Creative Commons 4.0 Unported License. Based on a work at www.BootstrapWorld.org. Permissions beyond the scope of this license may be available by contacting schanzer@BootstrapWorld.org.
= Method Chaining
Students continue practicing their Design Recipe skills, making lots of simple functions dealing with the Animals Dataset. Then they learn how to chain Methods together, and define more sophisticated subsets.
Relevant Standards: CSTA Standards, K-12CS Standards, Next-Gen Science Standards, Oklahoma Standards
== Design Recipe Practice 25 minutes
=== Overview Students practice more of what they learned in the previous lesson, applying the Design Recipe to simple table functions that operate on rows of the Animals Dataset. The functions they create - in addition to the ones they’ve already made - set up the method-chaining activity.
=== Launch The Design Recipe is a powerful tool for solving problems by writing functions. It’s important for this to be like second nature, so let’s get some more practice using it!
=== Investigate
Define the Compute functions on The Design Recipe (Page 32) and The Design Recipe (Page 33).
=== Synthesize Did students find themselves getting faster at using the Design Recipe? Can students share any patterns they noticed, or shortcuts they used?
== Chaining Methods 25 minutes
=== Overview Students learn how to perform multiple table operations (sorting, filtering, building) in the same line of code.
=== Launch Now that we are doing more sophisticated analyses, we might find ourselves writing the following code:
# get a table with the nametags of all the fixed animals, ordered by species
with-labels = animals-table.build-column("labels", nametag)
fixed-with-labels = with-labels.filter(is-fixed)
result = fixed-with-labels.order-by("species", true)
That’s a lot of code, and it also requires us to come up with names for each intermediate step! Pyret allows table methods to be chained together, so that we can build, filter and order a Table in one shot. For example:
# get a table with the nametags of all the fixed animals, ordered by species
result = animals-table.build-column("labels", nametag).filter(is-fixed).order-by("species", true)
This code takes the animals-table, and builds a new column. According to our Contracts Page, .build-column produces a new Table, and that’s the Table whose .filter method we use. That method produces yet another Table, and we call that Table’s .order-by method. The Table that comes back from that is our final result.
Teaching Tip: Use different color markers to draw nested boxes around each part of the expression, showing where each Table came from.
It can be difficult to read code that has lots of method calls chained together, so we can add a line-break before each “.” to make it more readable. Here’s the exact same code, written with each method on its own line:
# get a table with the nametags of all the fixed animals, ordered by species
animals-table
  .build-column("labels", nametag)
  .filter(is-fixed)
  .order-by("species", true)
Order matters: Build, Filter, Order.
Suppose we want to build a column and then use it to filter our table. If we use the methods in the wrong order (trying to filter by a column that doesn’t exist yet), we might wind up crashing the program. Even worse, the program might work, but produce results that are incorrect!
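The build, filter, order sequence can be tried out on a tiny table. This is a minimal sketch, assuming it is run at code.pyret.org where table literals and these table methods are available. The pets table, its rows, and the label function (a plain-text stand-in for nametag) are made up for illustration.

```pyret
# A made-up stand-in for the animals-table
pets = table: name, species, fixed
  row: "Whiskers", "cat", false
  row: "Rex", "dog", true
  row: "Mittens", "cat", true
end

# is-fixed :: (r :: Row) -> Boolean
# Consumes an animal, and looks up whether it is fixed
fun is-fixed(r): r["fixed"] end

# label :: (r :: Row) -> String
# Hypothetical stand-in for nametag: produces text instead of an image
fun label(r): r["name"] + " the " + r["species"] end

# Build first, then filter, then order
result = pets
  .build-column("labels", label)
  .filter(is-fixed)
  .order-by("species", true)

check:
  result.length() is 2
  result.row-n(0)["labels"] is "Mittens the cat"
  result.row-n(1)["labels"] is "Rex the dog"
end
```

In this sketch is-fixed does not read the new column, so filtering before building would still run. A filter that read the labels column before .build-column had created it would have nothing to look up.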
=== Investigate
When chaining methods, it’s important to build first, then filter, and then order.
How well do you know your table methods? Complete Chaining Methods (Page 34) and Chaining Methods 2: Order Matters! (Page 35) in your Student Workbook to find out.
=== Synthesize As our analysis gets more complex, method chaining is a great way to keep the code simple. But complex analysis also has more room for mistakes, so it’s critical to think carefully when we use it!
= If-Expressions
Students build on their knowledge of the image-scatter-plot function, motivating the need for if-expressions in their programming toolkit. This drives deeper insight into subgroups within a population, and motivates the need for more advanced analysis.
Relevant Standards: CSTA Standards, K-12CS Standards, Next-Gen Science Standards
== Warmup
Figure: Age v. Weeks Scatterplot
-
Show students this code, which uses image-url and scale to generate icons of animals.
-
What do they Notice? What do they Wonder? How might this scatterplot change our analysis?
-
Have students make a scatter plot of animals, using age as the x-axis values and weeks as the y-axis.
(For now, the scatter plot is purely to give students practice with contracts and displays. They are not expected to know much about scatter plots at this point.)
== If-Expressions 20 minutes
=== Overview Students explore a program that makes use of an if-expression, develop their own understanding, and modify it.
=== Launch
So far, all of the functions we know how to write have had a single rule. The rule for gt was to take a number and make a solid, green triangle of that size. The rule for bc was to take a number and make a solid, blue circle of that size. The rule for nametag was to take a row and make an image of the animal’s name in purple letters.
What if we want to write functions that apply different rules, depending on the input? For example, what if we want to change the color of the nametag depending on the species of the animal?
=== Investigate
-
Open the Mood Generator starter file.
-
Complete Mood Generator (Page 37) in your student workbooks.
=== Synthesize Have the class share their own explanations for how if-expressions work.
Pyret allows us to write if-expressions, which contain:
-
the keyword if, followed by a condition.
-
a colon (:), followed by a rule for what the function should do if the condition is true
-
an else:, followed by a rule for what to do if the condition is false
We can chain them together to create multiple rules, with the last else: being our fallback in case every other condition is false.
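As a minimal sketch of these pieces working together, here is a hypothetical function that chains two conditions and a final else:. The name size-label and the cutoffs of 10 and 50 pounds are made up for illustration and are not part of the lesson materials. Note that Pyret closes the whole if-expression with end.

```pyret
# size-label :: (weight :: Number) -> String
# Consumes a weight in pounds, and produces a word describing it
examples:
  size-label(4) is "small"
  size-label(10) is "medium"
  size-label(80) is "large"
end
fun size-label(weight):
  if weight < 10: "small"          # first rule: used when this condition is true
  else if weight < 50: "medium"    # second rule: checked only if the first was false
  else: "large"                    # fallback: used when every condition was false
  end
end
```

The conditions are checked from top to bottom, so a weight of exactly 10 skips the first rule and lands on "medium".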
== Better Image Scatter Plots 20 minutes
=== Overview Suppose we want to make a scatter plot for the Animals Dataset, but with each dot being a different color depending on the species. This would make it possible to see if different animals are "clustered" in different parts of the plot.
=== Investigate Have students open Word Problem: species-color (Page 38). Make sure they all write the Contract and Purpose Statement first, and check in with their partner and the teacher before proceeding.
Once they’ve got the Contract and Purpose Statement, have them come up with examples for each species. Once again, have them check with a partner and the teacher before finishing the page.
Once another student and the teacher have checked their work, have them type this function into their animals starter files, and use it to make an image-scatter-plot using age as the x-axis and weeks as the y-axis.
=== Synthesize Figure: Age v. Weeks Scatterplot
-
What do you Notice about this scatter plot?
-
What do you Wonder?
What does this new visualization tell us about the relationship between age and weeks? What other analysis would be helpful here?
== Closing Make sure to direct the conversation back to Data Science! Does this scatter plot make us think we should be analyzing animals separately? What other scatter plots might this be useful for?
This scatterplot makes it clear that we may want to analyze each species separately, rather than grouping them all together! In the next lesson, students will learn how to do just that.
= Randomness and Sample Size
Students learn about random samples and statistical inference, as applied to the Animals Dataset. In the process, students get a light introduction to the role of sample size and the importance of statistical inference.
Relevant Standards: Common Core Math Standards, CSTA Standards, Next-Gen Science Standards, Oklahoma Standards
- statistical inference
-
using information from a sample to draw conclusions about the larger population from which the sample was taken
== Do Now
Students should log into CPO, open the Random Samples Starter File, and save a copy.
== Flip the Script: Inference v. Probability 30 minutes
=== Overview Statistical inference involves looking at a sample and trying to infer something you don’t know about a larger population. This requires a sort of backwards reasoning, kind of like making a guess about a cause, based on the effect that we see. To better understand the process of going from the sample back to the population, it helps to understand the more straightforward process of going from the population to a sample. If the sample is random, we call this process Probability!
In real life we typically don’t know what’s true for an entire population. But this probability thought-experiment will start with a larger population with known properties (such as the fact that nearly half of the entire population are males). Then we’ll see what kind of behavior we tend to see in random samples taken from that population.
=== Launch
Inference Reasons Backwards; Probability Reasons Forwards
One of the most useful tasks in Data Science is using sample data to infer (guess) what’s true about the larger population from which the sample was taken. This process, called statistical inference, is used to gain information in practically every field of study you can imagine: medicine, business, politics, history; even art! Early on, statisticians discovered that random samples almost always work best.
Suppose we want to make an educated guess about who the next US president will be. We can’t ask everyone who they’re voting for, so pollsters instead take a sample of Americans, and generalize the opinion of the sample to estimate how Americans as a whole feel. But choosing a sample can be tricky…
-
Would it be problematic to only call voters who are registered Democrats? To only call voters under 25? To only call regular churchgoers? Why or why not?
-
How could we choose a representative subset, or sample of American voters?
-
Would it be problematic to only sample a handful of voters? What do we gain by taking a larger sample?
Before we infer something unknown about a population from a sample, we need to know what makes a "good" sample!
Sampling is a complicated issue. The main reason for doing inference is to guess about something that’s unknown for the whole population. But a useful step along the way is to practice with situations where we happen to know what’s true for the whole population. As an exercise, we can keep taking random samples from that population and see how close they tend to get us to the truth. Another discovery (besides the value of randomness) that statisticians made early on was something that’s perfectly consistent with common sense: Larger samples are better than smaller ones, because they tend to get us closer to the truth about the whole population.
Let’s see what happens if we switch from smaller to larger sample sizes, if we’re taking a random sample of shelter animals to infer what’s true about the larger population…
=== Investigate The Animals Dataset we’ve been using is just one sample taken from a very large animal shelter. How much can we infer about the whole population of hundreds of animals, by looking at just this one sample?
-
Divide the class into groups of 3-5 students.
-
Have students open the Random Samples Starter File, and click "Run".
-
Have students complete Sampling and Inference (Page 40), sharing their results and discussing with the group.
=== Common Misconceptions Students often assume that larger populations need to be represented by larger samples. In fact, the formulas that Data Scientists use to assess how well a sample represents its population are based only on the sample size, not the population size.
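One common formula of this kind, offered here as an illustration rather than as the specific formula the lesson has in mind, is the standard error of a proportion estimated from a simple random sample. It assumes the population is much larger than the sample. The symbol \hat{p} is the proportion observed in the sample and n is the sample size:

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}
```

The population size does not appear anywhere in the formula. Under the stated assumption, a sample four times as large cuts the standard error in half, whether the population has thousands of members or millions.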
Extension: In a statistics-focused class, or if appropriate for your learning goals, this is a great place to include more rigorous statistics content on sample size, sampling bias, etc.
=== Synthesize Have students share how much better their larger samples are at guessing the truth about the whole population.
Project Options: Food Habits / Time Use. In both of these projects, students can gather data about their own lives, and use what they’ve learned in the class so far to analyze it. This project can be used as a mid-term or formative assessment, or as a capstone for a limited implementation of Bootstrap:Data Science. See the project descriptions at pages/food-habits-project.html and pages/time-use-project.html. (Based on the projects of the same name from IDS at UCLA.)
= Grouped Samples
Students learn about grouped samples, and practice creating them from the Animals Dataset. In the process, they practice using the Design Recipe to create filter functions, and come up with questions they wish to explore.
Relevant Standards: CSTA Standards, K-12CS Standards, Next-Gen Science Standards, Oklahoma Standards
- grouped sample
-
a non-random subset of individuals chosen from a larger set, where the individuals belong to a specific group
== Problems with a Single Population 10 minutes
=== Overview This activity is all about grouped samples: Students make a bunch of subsets from the Animals Dataset, and see how each subset might answer the same question differently.
=== Launch When looking at a scatter plot of our animals, it looks like the amount an animal weighs may have something to do with how long it takes to be adopted.
But if we label the dots by animal (see the image on the right), we notice that every data point above 25 pounds belongs to a dog from the shelter!
=== Investigate Divide the class into groups of 3-4, with one student identified as the "reporter".
-
Looking at this scatterplot, does it make sense to analyze all the animals together? Why or why not?
-
Are there some questions where it would be important to break up the population into species-specific populations? What are they?
-
Are there some questions where it would be important to keep the whole population together? What are they?
=== Synthesize Have the reporters share their findings with the class.
Imagine that you’ve been handed a dataset from a country where half the people are wealthy and have access to amazing medical care, and the other half are poor and have no healthcare. If we took a random sample of the population as a whole, we might think that they are generally middle-income and have average health. But if we ask the same question about the two groups separately, we would discover inequality hiding in plain sight!
== Grouped Samples 20 minutes
=== Launch Ultimately, it might make more sense to ask certain questions about "just the cats" or "just the dogs". Averaging every animal together will give us an answer, but it may not be a useful answer.
Sometimes important facts about samples get lost if we mix them with the rest of the population!
Data Scientists make grouped samples of datasets, breaking them up into sub-groups that may be helpful in their analysis.
=== Investigate
A “kitten” is an animal who is a cat and who is young. How would you make a subset of just kittens?
-
Turn to Grouped Samples from the Animals Dataset (Page 41), and see what code will compute whether or not an animal is a kitten.
-
Can you fill in the code for the other subsets?
-
When you’re done, type these definitions into the Definitions Area.
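A function for a grouped sample can reuse simpler functions by joining them with and. This is a minimal sketch, assuming it is run at code.pyret.org; the pets table is made up, and the cutoff used in is-young (under 2 years) is an assumption for illustration that may differ from the one in the workbook.

```pyret
# A made-up stand-in for the animals-table
pets = table: name, species, age
  row: "Whiskers", "cat", 1
  row: "Rex", "dog", 1
  row: "Mittens", "cat", 9
end

# is-cat :: (r :: Row) -> Boolean
fun is-cat(r): r["species"] == "cat" end

# is-young :: (r :: Row) -> Boolean
# ASSUMPTION: "young" means under 2 years old
fun is-young(r): r["age"] < 2 end

# is-kitten :: (r :: Row) -> Boolean
# Consumes an animal, and computes whether it is both a cat and young
fun is-kitten(r): is-cat(r) and is-young(r) end

kittens = pets.filter(is-kitten)

check:
  kittens.length() is 1
  kittens.row-n(0)["name"] is "Whiskers"
end
```

Rex is young but not a cat, and Mittens is a cat but not young, so only Whiskers passes both tests.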
We already know how to define values, and how to filter a dataset. So let’s put those skills together to define one of our subsets:
dogs = animals-table.filter(is-dog)
-
Define the other subsets, and click "Run".
-
Make a pie chart showing the species in the young subset, by typing pie-chart(young, "species").
-
Make pie charts for every grouped sample. Which one is the most representative of the whole population? Why?
=== Synthesize Debrief with students. Thoughtful question: how could we filter and sort a table? How can we combine methods?
== Displaying Samples 20 minutes
=== Overview Students revisit the data display activity, now using the samples they created.
=== Launch Making grouped and random samples is a powerful skill to have, which allows us to dig deeper than just making charts or asking questions about a whole dataset. Now that we know how to make subsets, we can make much more sophisticated displays!
=== Investigate
Complete Displaying Data (Page 42), using what you’ve learned about samples to make more sophisticated data displays.
=== Synthesize Were any of the students' displays interesting or surprising? Given a novel question, can students identify what helper functions they would need to write?
= Choosing Your Dataset
Students summarize their dataset by exploring the data and identifying categorical and quantitative columns, datatypes, and more. They also define a few sample rows, random subsets, and logical subsets.
Relevant Standards: CSTA Standards, K-12CS Standards, Next-Gen Science Standards, Oklahoma Standards
== The Data Cycle 20 minutes
=== Overview Students learn about the Data Cycle, which helps them get situated in the process of analyzing the datasets they will select in this lesson. They browse through the library of provided datasets, and choose one they want to work with. NOTE: the selection process can also be done as a homework assignment, if all students have internet access at home.
=== Launch Zoom out a little and help students reflect on what they’ve done so far. Students began by exploring the Animals Dataset, formulating questions and exploring them with data displays. This led to further questions, making subsets, and asking more questions.
The Data Cycle[*] is a roadmap, which helps guide us in the process of data analysis.
(Step 1) We start by Asking Questions - statistical questions that can be answered with data.
(Step 2) Then we Consider Data. This could be done by conducting a survey, observing and recording data, or finding a dataset that meets our needs.
(Step 3) Then it’s on to Analyzing the Data, in which we produce data displays and new tables of filtered or transformed data in order to identify patterns and relationships.
(Step 4) Finally, we Interpret the Data, in which we answer our questions and summarize the results. As we’ve already seen from the Animals Dataset, these interpretations often lead to new questions... and the cycle begins again.
Explain to students that they will now select a dataset to work with for the remainder of the course. Make sure they understand that it genuinely has to be something they are interested in: their engagement with the data is critical to engaging with the class.
Students can also find their own dataset, and use this Blank Starter file. See this tutorial video for help importing your own data into Pyret.
Students must have at least 2 questions that are both interesting and answerable using their dataset.
=== Investigate Have students choose a dataset that is interesting to them! They should have at least two questions that the dataset can help them answer, and write them on What’s on your mind? (Page 49).
- U.S. Voter Turnout Rates 1986-2018
- 2016 U.S. Presidential Elections
- R.I. Schools
- Police Traffic Stops, Durham, NC, 2002-2013
- MLB Hitting Stats
- State Demographics
- Movies
- Countries of the World
- U.S. Income
- U.S. Presidents
- Music
- Summer Olympic Medals
- Winter Olympic Medals
- Pokemon Characters
- IGN Video Game Reviews
- U.S. Cancer Rates
- Sodas
- Cereals
Open the Research Paper template, and save a copy.
-
Students fill in their first and last name(s) and the teacher’s name on the first page of the Research Paper.
-
Students should also copy the link to the dataset (spreadsheet), and paste it into the first page of the Research Paper.
-
Students should click "Publish" in their Pyret Starter File, then copy/paste the resulting link into the first page of the Research Paper.
We have also compiled some notes on these datasets, which we recommend all teachers read before having their students choose a dataset.
=== Synthesize Have students share their datasets and their questions.
For the rest of this course, students will be learning new programming and Data Science skills, practicing them with the Animals Dataset and then applying them to their own data.
== Exploring Your Dataset flexible
=== Overview Students apply what they’ve learned about describing and making subsets from the Animals Dataset to their own dataset. Note: this activity can be done briefly as a homework assignment, but we recommend giving students an additional class period to work on this.
=== Launch By now you’ve already learned what to do when you approach a new dataset. With the Animals Dataset, you first read the data itself, and wrote down your Notice and Wonders. You described the columns in the Animals Dataset, identifying which were categorical and which were quantitative, and whether they were Numbers, Strings, Booleans, etc. Finally, you used the Design Recipe and table methods to make random and logical subsets.
Now, you’re going to do the same thing with your own dataset.
=== Investigate
-
Have students look at the spreadsheet for their dataset. What do they Notice? What do they Wonder? Have them complete My Dataset (Page 45), making sure to have at least two Lookup Questions, two Compute Questions, and two Relate Questions.
-
In the Definitions Area, students use random-rows to define at least three tables of different sizes: tiny-sample, small-sample, and medium-sample.
-
In the Definitions Area, students use .row-n to define at least three values, representing different rows in their table.
-
Have students think about subsets that might be useful for their dataset. They should name these subsets and write the Pyret code to test an individual row from their dataset on Samples from My Dataset (Page 46).
-
Students should fill in the My Dataset portion of their Research Paper.
-
Students should fill in the Categorical Visualizations portion of their Research Paper, by generating pie and bar charts for their dataset and explaining what they show.
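Defining row values with .row-n can be sketched on a small table. This is a minimal sketch, assuming it is run at code.pyret.org; the pets table and the names first-pet and last-pet are made up for illustration. It leaves out random-rows, which comes from the starter file rather than from Pyret itself.

```pyret
# A made-up stand-in for your own dataset
pets = table: name, species, age
  row: "Whiskers", "cat", 1
  row: "Rex", "dog", 4
  row: "Mittens", "cat", 9
end

# .row-n counts from 0, so row-n(0) is the first row of the table
first-pet = pets.row-n(0)
last-pet = pets.row-n(2)

check:
  first-pet["name"] is "Whiskers"
  last-pet["age"] is 9
end
```

Row values defined this way are handy inputs when writing examples for filter functions.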
Turn to The Design Recipe (Page 47), and use the Design Recipe to write the filter functions that you planned out on Samples from My Dataset (Page 46). When the teacher has checked your work, type them into the Definitions Area and use the .filter method to define your new sample tables.
Choose one categorical column from your dataset, and try making a bar or pie-chart for the whole table. Now try making the same display for each of your subsets. Which is most representative of the entire column in the table?
=== Synthesize
Have students share which subsets they created for their datasets.
[*] From the Mobilizing IDS project and GAISE
= Histograms
Students explore new visualizations in Pyret, this time focusing on the distribution in a quantitative dataset. Students are introduced to Histograms by comparing them to bar charts, and learn to construct them by hand and in Pyret.
Relevant Standards: Common Core Math Standards, CSTA Standards, K-12CS Standards, Next-Gen Science Standards
- bar chart
-
a display of categorical data that uses bars positioned over category values; each bar’s height reflects the count or percentage of data values in that category
- frequency
-
how often a particular value appears in a data set
- histogram
-
a display of quantitative data that uses vertical bars positioned over bins (sub-intervals); each bar’s height reflects the count or percentage of data values in that bin.
- shape
-
The aspect of a dataset that tells which values are more or less common
== Review 20 minutes
Have students open their Animals Starter File, and click “Run”. (If they do not have this file, or if something has happened to it, they can always make a new copy.)
-
Turn to The Design Recipe (Page 51), and write the functions you see there. When you’re ready, type the contracts, purpose statements, examples and definitions into the Definitions Area.
-
Use the .build-column method to add a new column to the animals table, showing the weight of every animal in kilograms.
-
Use the image-scatter-plot function to plot all of the animals, putting age on the x-axis, number of weeks in the shelter on the y-axis, and smart-dot as our function.
== Introducing Histograms 20 minutes
=== Overview Students look at a bar chart and a histogram, compare/contrast them, and make observations about what they have in common and how they are different. Then they learn a more formal explanation of histograms.
=== Launch
Have students complete Summarizing Columns (Page 52).
The display on the left side of that page is a Bar chart.
-
The x-axis lists the values of a categorical variable (species).
-
The y-axis shows the frequency of categorical values in the dataset.
-
This chart happens to show the categorical values in alphabetical order from left to right, but it would be fine to re-order them any way we wish. The bar for “dogs” could have been drawn before the one for “cats”, without changing the meaning of the display. It never makes sense to talk about the “shape” of a categorical data set, since that shape holds no meaning.
The display on the right side is called a histogram.
-
Histograms show the distribution of quantitative data.
-
Since quantitative data must follow a natural order, these bars cannot be re-ordered.
-
Histograms allow us to see the shape of a data set.
=== Investigate To build a histogram, we start by sorting all of the numbers in our column from smallest to largest, marking our x-axis from the smallest value (or a bit below) to the largest value (or a bit above) and dividing into equally-sized intervals, or “bins”. For example, if our values ranged from 3 to 53 we might mark our x-axis from 0 to 60 and divide it into bins of width 10. If they range from 22 to 41 we might mark our x-axis from 20 to 45 and divide it into bins of width 5. Once we have our bins, we put each value in our dataset into the bin where it belongs, and then count how many values fall in each bin. This count determines the height of the bars on our y-axis.
Turn to Making Histograms (Page 53), and try drawing a histogram from a dataset.
==== Possible Misconceptions Note that intervals on this display include the left endpoint but not the right. If we included the right endpoint and someone had 0 teeth, we’d have to add on a bar from -5 to 0, which would be awfully strange!
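The binning steps described above, including the rule that each interval includes its left endpoint but not its right, can be sketched outside of Pyret in a few lines of plain Python. The function name `bin_counts` and the sample values are made up for illustration; they are not part of the Bootstrap library or the Animals Dataset.

```python
from collections import Counter

def bin_counts(values, bin_width, start=0):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        # Integer division finds the bin: the left endpoint is included,
        # the right endpoint belongs to the next bin.
        index = int((v - start) // bin_width)
        left = start + index * bin_width
        counts[left] += 1
    # Return bins in left-to-right order, keyed by their left endpoint.
    return dict(sorted(counts.items()))

# Hypothetical values ranging from 3 to 53, binned with width 10.
values = [3, 8, 10, 12, 19, 20, 27, 35, 41, 53]
print(bin_counts(values, 10))
# {0: 2, 10: 3, 20: 2, 30: 1, 40: 1, 50: 1}
```

Each count is the height of one bar. Note that the value 10 lands in the bin starting at 10, not in the bin starting at 0.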
=== Synthesize Review: How are histograms and bar charts different?
== Choosing the Right Bin Size 15 minutes
=== Overview Students make histograms from the animals-dataset, and explore different bin sizes.
=== Launch The size of the bins matters a lot! Bins that are too small will hide the shape of the data by breaking it into too many short bars. Bins that are too large will hide the shape by squeezing the data into just a few tall bars. In this workbook exercise, the bins were provided for you. But how do you choose a good bin-size?
=== Investigate
A display of how long it takes animals to get adopted can make it easier to get an idea of what adoption times were most common, and if there were any unusually long or short times that it took for an animal to be adopted.
Suppose we want to know how long it takes for animals from the shelter to be adopted.
-
Find the contract for the histogram function.
-
Make a histogram for the "weeks" column in the animals-table, using a bin size of 10.
-
How many took between 0 and 10 weeks? Between 10 and 20?
-
Try some other bin sizes (be sure to experiment with bigger and smaller bins!) - what shapes emerge? What bin size gives you the best picture of the distribution?
Look at the histogram and count how many animals took between 0 and 5 weeks to be adopted. How many took between 5 and 10 weeks? What else do you Notice? What do you Wonder?
Some observations you can share with the class, to get them started:
-
We see most of the histogram’s area under the two bars between 0 and 10 weeks, so we can say it was most common for an animal to be adopted in 10 weeks or less.
-
We see a small amount of the histogram’s area trailing out to unusually high values, so we can say that a couple of animals took an unusually long time to be adopted: one took even more than 30 weeks.
-
More than half of the animals (17 out of 31) took just 5 weeks or less to be adopted. But those few unusually long adoption times pulled the average up to 5.8 weeks. We’ll talk more about Shape of a histogram in the next lesson, and about its effect on average (the mean) in the lesson after that.
If someone asked what was a typical adoption time, we could say: “Almost all of the animals were adopted in 10 weeks or less, but a couple of animals took an unusually long time to be adopted — even more than 20 or 30 weeks!” Without looking at the histogram’s shape, we could not have drawn this conclusion.
What would the histogram look like if most of the animals took more than 20 weeks to be adopted, but a couple of them were adopted in fewer than 5 weeks?
=== Synthesize Have students talk about the bin sizes they tried. Encourage open discussion as much as possible here, so that students can make their own meaning about bin sizes before moving on to the next point.
Rule of thumb: a histogram should have between 5 and 10 bins.
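One way to check a bin size against this rule of thumb is to count how many bins it would produce for a given smallest and largest value. The sketch below is plain Python rather than Pyret, and the helper names `bin_count` and `widths_meeting_rule` are hypothetical. It assumes the x-axis starts at the multiple of the bin width at or below the smallest value, and that bins include their left endpoint.

```python
import math

def bin_count(smallest, largest, bin_width):
    """Number of equal-width bins needed to cover smallest..largest."""
    # Start the axis at the multiple of bin_width at or below the smallest value.
    first_left = math.floor(smallest / bin_width) * bin_width
    # Bins include their left endpoint, so the largest value needs its own bin.
    return math.floor((largest - first_left) / bin_width) + 1

def widths_meeting_rule(smallest, largest, candidates, low=5, high=10):
    """Keep the candidate widths that give between `low` and `high` bins."""
    return [w for w in candidates
            if low <= bin_count(smallest, largest, w) <= high]

print(bin_count(3, 53, 10))   # 6 bins: 0-10, 10-20, ..., 50-60
print(bin_count(22, 41, 5))   # 5 bins: 20-25, ..., 40-45
print(widths_meeting_rule(3, 53, [1, 2, 5, 10, 20]))  # [10]
```

The rule is only a starting point: students should still look at the resulting display to judge whether the shape is visible.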
Histograms are a powerful way to display a data set and assess its shape. Choosing the right bin size for a column has a lot to do with how data is distributed between the smallest and largest values in that column! With the right bin size, we can see the shape of a quantitative column. But how do we talk about or describe that shape, and what does the shape actually tell us? The next lesson addresses all of these.
These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by Emmanuel Schanzer, Nancy Pfenning, Emma Youndtsmith, Jennifer Poole, Shriram Krishnamurthi, Joe Politz, Ben Lerner, Flannery Denny, and Dorai Sitaram with help from Eric Allatta and Joy Straub is licensed under a Creative Commons 4.0 Unported License. Based on a work at www.BootstrapWorld.org. Permissions beyond the scope of this license may be available by contacting schanzer@BootstrapWorld.org.
= Visualizing the “Shape” of Data
Students explore the concept of "shape", using histograms to determine whether a dataset has skewness, and what the direction of the skewness means. They apply this knowledge to the Animals Dataset, and then to their own.
- shape
-
The aspect of a dataset that tells which values are more or less common
- skewed left
-
A distribution is skewed left if there are a few values that are fairly low compared to the bulk of data values. A display of the data will show a longer tail to the left.
- skewed right
-
A distribution is skewed right if there are a few values that are fairly high compared to the bulk of data values. A display of the data will show a longer tail to the right.
- symmetric
-
A symmetric distribution has a balanced shape, showing that it’s just as likely for the variable to take lower values as higher values.
== Review 15 minutes
Have students turn to Reading Histograms (Page 54), and complete the matching activity there.
== Describing Shape 20 minutes
=== Overview This activity focuses on describing shape based on a histogram. Students learn about "left skewed", "right skewed", and "symmetric" data, and what those descriptions tell us about a dataset.
=== Launch
Shape is one way to summarize information in a dataset, to quickly describe what values are more or less common.
Consider the image on the right: most of the data points are clustered on the left side, and it contains a few unusually high values way off to the right. We might describe this histogram by saying that it is “skewed right, or has high outliers.”
Here are the most common shapes that we see for real-world data sets:
Symmetric: values are balanced on either side of the middle.
In a symmetric distribution, it’s just as likely for the variable to take a value a certain distance below the middle as it is to take a value that same distance above the middle. Examples:
-
Heights of 12-year-olds would have a symmetric shape. It’s just as likely for a 12-year-old to be a certain number of inches below average height as it is to be that number of inches above average height.
-
In a standardized test, most students score fairly close to what’s average. Also, we see just as many students scoring a certain number of points above average as we see scoring that same number of points below average. The shape is symmetric (and bulges in the middle because most students score fairly close to what’s average).
Skewed left, or low outliers.
In a distribution that is skewed left, values are clumped around what’s typical, but they trail off to the left with a few unusually low values. Examples:
-
Number of teeth that adults have in their mouths would be skewed left or have low outliers. Most adults will have close to a full set of 32 teeth, but a few of them with serious dental problems would have a very small number of teeth. We won’t get anyone in our data set who has 10 or 20 extra teeth in their mouths!
-
If most students did pretty well on an exam, but a few students performed very badly, then we’d see a shape that has left skewness and/or low outliers.
Skewed right, or high outliers.
In a distribution that is skewed right, values are clumped around what’s typical, but they trail off to the right with a few unusually high values. We see this shape often in the real world, because there are many variables — like “income” or “time spent on the phone” — for which a few individuals have unusually high values, which aren’t balanced out by unusually low values (things like “income” and “phone time” can’t be less than zero). Examples:
-
Age when a woman in the U.S. gives birth would be skewed right or have high outliers. A few women would be unusually old (40+ years), above the average age of 26 (check the tabloids!), but none of them could be even close to 40 years below average to balance things out!
-
A data set of earnings almost always shows right skewness or high outliers, because there are usually a few values that are so far above average, they can’t be balanced out by any values that are so far below average. (Earnings can’t be negative.)
=== Investigate
-
Make a histogram for the pounds column in the animals table, sorting the animals into 20-pound bins:
-
Would you describe the shape of your histogram as being skewed left, skewed right, or symmetric?
-
Which one of these statements is justified by the histogram’s shape?
-
A few of the animals were unusually light.
-
A few of the animals were unusually heavy.
-
It was just as likely for an animal to be a certain amount below or above average weight.
-
-
Try bins of 1-pound intervals, then 100-pound intervals. Which of these three histograms best satisfies our rule of thumb?
-
On Identifying Shape (Page 55), describe the shape of the histograms you see there.
-
On The Shape of the Animals Dataset (Page 56), describe the pounds histogram and another one you make yourself. When writing down what you notice, try to use the language Data Scientists use, discussing both skew and outliers.
Challenge Questions:
- Compare histograms for the pounds column of both cats and dogs in the dataset. Are their shapes different? How much overlap is there?
- Compare histograms for the age column of both cats and dogs in the dataset. Are their shapes different? How much overlap is there?
- Can you explain why the amount of overlap between these two distributions is different?
=== Synthesize Discuss as a class, making sure students agree on the description of the shape.
== Your Analysis flexible
=== Overview Students repeat the previous activity, this time applying it to their own dataset and interpreting their own results. Note: this activity can be done briefly as a homework assignment, but we recommend giving students an additional class period to work on this.
=== Launch Now it’s time to try looking at the shape of your own dataset! Pick one quantitative column in your dataset, and hypothesize whether you think it will be skewed right, skewed left, or symmetric. What do you think?
=== Investigate
-
How is your dataset distributed? Choose two quantitative variables and display them with histograms. Explain what you learn by looking at these displays. If you’re looking at a particular subset of the data, make sure you write that up in your findings on The Shape of My Dataset (Page 57).
-
Students should fill in the Quantitative Visualizations portion of their Research Paper, using histograms they’ve constructed for their dataset and explaining what they show.
=== Synthesize Have students share their findings.
Histograms are a powerful way to display a data set and see its shape. But shape is just one of three key aspects that tell us what’s going on with a quantitative data set. In the next unit, we’ll explore the other two: center and spread.
= Measures of Center
Students learn different ways to report the center of a quantitative data set: mean, median and mode(s). After applying these concepts to a contrived dataset, they apply them to their own datasets and interpret the results.
- mean
-
average, calculated as the sum of values divided by the number of values
- median
-
the middle element of a quantitative data set
- mode
-
the most commonly appearing categorical or quantitative value or values in a data set
- outlier
-
a data point that is unusually far above or below most of the others
- skew
-
lack of balance in a dataset’s shape, arising from more values that are unusually low or high. Such values tend to trail off, rather than be separated by a gap (as with outliers).
== Mean 15 minutes
=== Overview Students learn about mean (or "average"), and how it is one way (among others!) to summarize a quantitative column.
=== Launch
According to the Animal Shelter Bureau, the average pet weighs almost 41 pounds.
Some medicines are dosed by weight: heavier animals need a larger dose. If someone from the shelter needs to give a dose of medicine to the animals, is the “average” the best estimate we can use?
“The average pet weighs 41 pounds” is a statement about the entire dataset, which summarizes a whole column of values with a single number. Summarizing a big dataset means that some information gets lost, so it’s important to pick an appropriate summary. Picking the wrong summary can have serious implications! Here are just a few examples of summary data being used for important things. Do you think these summaries are appropriate or not?
-
Students are sometimes summarized by two numbers — their GPA and SAT scores — which can impact where they go to college or how much financial aid they get.
-
Schools are sometimes summarized by a few numbers — student pass rates and attendance, for example — which can determine whether or not a school gets shut down.
-
Adults are often summarized by a single number — like their credit score — which determines their ability to get a job or a home loan.
-
When buying uniforms for a sports team, a coach might look for the most common size that the players wear.
Can you think of other examples where someone uses a number or two to summarize something complex?
Every kind of summary has situations in which it does a good job of reporting what’s typical, and others where it doesn’t really do justice to the data. In fact, the shape of the data can play a huge role in whether or not one kind of summary is appropriate!
One of the ways that Data Scientists summarize quantitative data is by talking about its center - literally asking "what is a typical value in this sample?", in the hopes of inferring something about a larger population. But there are many different ways to define "center", and each method has strengths and weaknesses. Let’s check the “41 pounds” claim and see if it’s an appropriate measure of center. Later on, you’ll have a chance to apply what you’ve learned to your own dataset, to find the best way to provide an overall summary of the data.
=== Investigate
Open your “Animals Starter File”. (If you do not have this file, or if something has happened to it, you can always make a new copy.)
If we plotted all the pounds values as points on a number line, what could we say about the average of those values? Is there a midpoint? Is there a point that shows up most often? Each of these is a different way of “measuring center”.
The Animal Shelter Bureau used one method of summary, called the mean, or "average". In general, the mean of a data set is the sum of values divided by the number of values. To take the average of a column, we add all the numbers in that column and divide by the number of rows.
Pyret has a way for us to compute the mean of any quantitative column in a Table. It consumes a Table and the name of the column you want to measure, and produces the mean — or average — of the numbers in that column.
# mean :: (t :: Table, col :: String) -> Number
What is its name? Domain? Range?
Notice that calculating the mean requires being able to add and divide, so the mean only makes sense for quantitative data. For example, the mean of a list of Presidents doesn’t make sense. Same thing for a list of zip codes: even though we can divide a sum of zip codes, the output doesn’t correspond to some “center” zip code.
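The arithmetic behind the mean can also be written out by hand. The sketch below is plain Python rather than Pyret, and the list of weights is invented for illustration (it is not the Animals Dataset).

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

# Hypothetical weights in pounds: four light animals and one heavy one.
weights = [2, 4, 6, 8, 80]
print(mean(weights))  # 20.0
```

Notice that the result, 20 pounds, is larger than four of the five weights. That is worth keeping in mind when deciding whether the mean describes a “typical” animal.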
Type mean(animals-table, "pounds"). What does this give us?
Does this support the Bureau’s claims?
Open your workbooks to Summarizing Columns in the Animals Dataset (Page 61). Under the “measures of center” section, fill in the computed mean.
== Median 15 minutes
=== Overview Students learn a second measure of center: the median. They learn the algorithm and the code to find the median, as well as situations where taking the median is more appropriate than the mean.
=== Launch You computed the mean of that column to be almost exactly 41 pounds. That IS the average, but if we scan the dataset we’ll quickly see that most of the animals weigh less than 41 pounds! In fact, more than half of the animals weigh less than just 15 pounds. What is throwing off the average so much?
Kujo and Mr. Peanutbutter!
In this case, the mean is being thrown off by a few extreme data points. These extreme points are called outliers, because they fall far outside of the rest of the dataset. Calculating the mean is great when all the points are fairly balanced on either side of the middle, but it distorts things for datasets with extreme outliers. The mean may also be thrown off by the presence of skewness: a lopsided shape due to values trailing off left or right of center.
Make a histogram of the pounds column, and try different bin sizes. Can you see the skew towards the right, with a huge number of animals clumped to the left?
A different way to measure center is to line up all of the data points — in order — and find a point in the center where half of the values are smaller and the other half are larger. This is the median, or “middle” value of a list.
As an example, consider this list of ACT scores:
25, 26, 28, 28, 28, 29, 29, 30, 30, 31, 32
Here 29 is the median, because it separates the “bottom half” (5 values below it) from the “top half” (5 values above it).
The algorithm for finding the median of a quantitative column is:
-
Sort the numbers (we did this for you in the above example).
-
Cross out the highest number.
-
Cross out the lowest number.
-
Repeat until there is only one number left. If there are two numbers left at the end, take the mean of those numbers.
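The crossing-out steps above translate almost directly into code. This is a plain-Python sketch, not the Pyret median function; the name `median_by_crossing_out` is made up, and it assumes the list has at least one number.

```python
def median_by_crossing_out(values):
    """Find the median by repeatedly removing the lowest and highest values."""
    remaining = sorted(values)        # step 1: sort the numbers
    while len(remaining) > 2:
        remaining = remaining[1:-1]   # cross out the lowest and the highest
    # One number left: that is the median.
    # Two numbers left: take their mean.
    return sum(remaining) / len(remaining)

act_scores = [25, 26, 28, 28, 28, 29, 29, 30, 30, 31, 32]
print(median_by_crossing_out(act_scores))    # 29.0
print(median_by_crossing_out([1, 2, 3, 4]))  # 2.5
```

The first call uses the ACT scores from the example above and agrees with the median found there.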
=== Investigate
-
Pyret has a function to compute the median of a list as well. Find the contract in your contracts page.
-
Compute the median for the pounds column in the Animals Dataset, and add this to Summarizing Columns in the Animals Dataset (Page 61).
-
Is it different from the mean?
-
What can we conclude when the mean is so much greater than the median?
-
For practice, compute the mean and median for the weeks and age columns.
=== Synthesize By looking at the histogram, we can develop an intuition for whether it’s probably better to use the mean or median. Pronounced left skewness and/or low outliers can pull the mean down below the median, while right skewness and/or high outliers can pull it up above the median. Either way, such shapes distort the mean as a measure of what’s typical for the data set. Data scientists generally prefer to use the mean as their measure of center, because it contains information from every single data value. However, if a data set has substantial skewness or outliers, they use the median to report the center.
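A small experiment can make this concrete: add a single high outlier to a balanced list and compare how far each measure moves. The sketch uses Python's standard `statistics` module rather than Pyret, and the numbers are invented for illustration.

```python
from statistics import mean, median

balanced = [5, 6, 7, 8, 9]          # hypothetical, evenly balanced values
with_outlier = balanced + [160]     # the same values plus one high outlier

print(mean(balanced), median(balanced))          # 7 7
print(mean(with_outlier), median(with_outlier))  # 32.5 7.5
```

One extreme value moves the mean from 7 to 32.5, while the median only shifts from 7 to 7.5.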
== Modes 25 minutes
=== Overview Students learn about the mode(s) of a dataset, how to compute the mode, and when it is appropriate to use this as a measure of center.
=== Launch The third measure of center is called the mode of a dataset. The mode of a data set is the value that appears most often. Median and Mean always produce one number, but if two or more values are equally common, there can be more than one mode. If all values are equally common, then there is no mode at all! Often there will be just one mode in the list of most common values: many data sets are what we call “unimodal”. But sometimes there are exceptions! Consider the following three datasets:
1, 2, 3, 4
1, 2, 2, 3, 4
1, 1, 2, 3, 4, 4
-
The first dataset has no mode at all!
-
The mode of the second data set is 2, since 2 appears more than any other number.
-
The modes (plural!) of the last data set are 1 and 4, because 1 and 4 both appear more often than any other element, and because they appear equally often.
Mode is rarely used to summarize quantitative data. It is very common as a summary of categorical data, telling us which category occurs most often.
In Pyret, the mode(s) are calculated by the modes function, which consumes a Table and the name of the column you want to measure, and produces a List of Numbers.
# modes :: (t :: Table, col :: String) -> List<Number>
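To see how the three example datasets above lead to zero, one, or two modes, here is a plain-Python sketch. It follows the wording of this lesson (no mode when every value is equally common) and is not a description of how the Pyret modes function is implemented. The function name and the handling of ties are assumptions, and the list is assumed to be non-empty.

```python
from collections import Counter

def modes(values):
    """Return the most common value(s), or an empty list if there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several different values all appear equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 3, 4]))        # []
print(modes([1, 2, 2, 3, 4]))     # [2]
print(modes([1, 1, 2, 3, 4, 4]))  # [1, 4]
```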
=== Investigate
Compute the modes of the pounds column, and add it to Summarizing Columns in the Animals Dataset (Page 61). What did you get?
=== Synthesize The most common number of pounds an animal weighs is 6.5! That’s well below our mean and even our median, which is further evidence of outliers or skewness.
At this point, we have a lot of evidence that suggests the Bureau’s use of “mean” to summarize animal weights isn’t ideal. Our mean weight agrees with their findings, but we have three reasons to suspect that mean isn’t the best value to use:
-
The median is only 13.4 pounds.
-
The mode of our dataset is only 6.5 pounds, which suggests a cluster of animals that weigh less than one-sixth the mean.
-
When viewed as a histogram, we can see the right skewness and high outliers in the dataset. Mean is sensitive to datasets with skewness and/or outliers.
“In 2003, the average American family earned $43,000 a year — well above the poverty line! Therefore very few Americans were living in poverty."
Do you trust this statement? Why or why not? Consider how many policies or laws are informed by statistics like this! Knowing about measures of center helps us see through misleading statements.
You now have three different ways to measure center in a dataset. But how do you know which one to use? Depending on the shape of the dataset, a measure could be really useful or totally misleading! Here are some guidelines for when to use one measurement over the other:
-
If the data doesn’t show much skewness or have outliers, mean is the best summary because it incorporates information from every value.
-
If the data has noticeable outliers or skewness, median gives a better summary of center than the mean.
-
If there are very few possible values, such as AP Scores (1–5), the mode could be a useful way to summarize the data set.
== Additional Exercises
Critiquing Written Findings
= Spread of a Data Set
Students learn how to evaluate the spread of a quantitative column using box plots, and explore how this offers a different perspective on shape from what can be achieved with a histogram. After applying these concepts to a contrived dataset, they apply them to their own datasets and interpret the results.
Supplemental Resources: Summative Assessments / Capstone: Stress Project (You will also need the Personality True Colors assessment)
- box plot
-
the box plot (a.k.a. box-and-whisker plot) is a way of displaying a distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum
- interquartile range
-
(IQR) is one possible measure of spread, based on dividing a data set into four parts. The values that divide each part are called the first quartile (Q1), the median, and third quartile (Q3). IQR is calculated as Q3 minus Q1.
- median
-
the middle element of a quantitative data set
- quartiles
-
three values that divide a data set into four equal-sized groups
- range of a data set
-
the distance between minimum and maximum values
- shape
-
The aspect of a dataset that tells which values are more or less common
- spread
-
the extent to which values in a data set vary, either from one another or from the center
== Measures of Spread 30 minutes
=== Overview Students are introduced to the notion of spread in a dataset. They learn about quartiles, box plots, and how to use them to talk about spread.
=== Launch A teacher may report that her students averaged a 75 on a test, but it’s important to know how those scores were spread out: did all of them get exactly 75, or did half score 100 and the other half 50? When Data Scientists use the mean of a sample to estimate the mean of a whole population, it’s important to know the spread in order to report how good or bad a job that estimate does.
Suppose we lined up all of the values in the pounds column of the animals data set from smallest to largest, and then split the line up into two equal groups by taking the median. We can learn something about the spread of the data set by taking things further: The middle of the lighter half of animals is called the first quartile - or "Q1" - and the middle of the heavier half of animals is the third quartile (also called "Q3"). Once we find these numbers, we can say that the middle half of the animals’ weights are spread between Q1 and Q3.
The first quartile (Q1) is the value for which 25% of the animals weighed that amount or less. What does the third quartile represent?
Besides looking at the median as center, and the spread between Q1 and Q3, we also gain valuable information from the spread of the entire data set—that is, the distance between minimum and maximum. This is called the range of a data set. (Note: the term “Range” means something different in statistics than it does in algebra and programming!)
We can use box plots to visualize all of this information. These plots are constructed using just five numbers, which makes them a clear and simple way to display both the center and the spread of a data set. Below is the contract for box-plot, along with an example that will make a box plot for the pounds column in the animals-table.
# box-plot :: (t :: Table, column :: String) -> Image
box-plot(animals-table, "pounds")
Box plots divide our sample into equally-sized groups, and show where those groups are spread thin or clumped together.
Type in this expression in the Interactions Area, and see the resulting plot.
This plot shows us the center and spread in our dataset according to those five numbers.
-
The minimum value in the dataset (at the left end of the “whisker”). In our dataset, that’s just 0.1 pounds.
-
The First Quartile (Q1) (the left edge of the box), is computed by taking the median of the lower half of the values. In the pounds column, that’s 3.9 pounds.
-
The Median value (the line in the middle), which is the middle Quartile of the whole dataset. We already computed this to be 11.3 pounds.
-
The Third Quartile (Q3) (the right edge of the box), which is computed by taking the median of the upper half of the values. That’s 60.4 pounds in our dataset.
-
The maximum value in the dataset (at the right of the “whisker”). In our dataset, that’s 172 pounds.
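The five numbers behind a box plot can be computed by hand with the “median of each half” approach described above. The sketch below is plain Python, not the Pyret box-plot function. It assumes at least two values, and that when there is an odd number of values the median itself is left out of both halves. Other quartile conventions exist, and Pyret may use a different one. The sample lists are made up.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                 # the lower half of the values
    upper = ordered[len(ordered) - half:]  # the upper half of the values
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))
# (1, 3, 7, 11, 13)
print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))
# (1, 2.5, 4.5, 6.5, 8)
```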
Extension Activity In statistics, it is not uncommon to use modified box plots, which remove extreme datapoints from the box-and-whisker and draw them as dots outside of the plot. The box plot then represents only the "non-extreme" points. Modified box plots are also available in Bootstrap:Data Science, using the following contract:
|
=== Investigate
-
Turn to Summarizing Columns in the Animals Dataset (Page 61)
-
Fill in the five-number summary for the pounds column, and sketch the box plot.
-
What conclusions can you draw about the distribution of values in this column?
Data Scientists subtract the 1st quartile from the 3rd quartile to compute the range of the “middle half” of the dataset, also called the interquartile range.
-
Find the interquartile range of this dataset.
-
What percentage of animals fall within the interquartile range?
-
What percentage of animals fall below the First Quartile? Above the Third Quartile? What percentage fall anywhere between the minimum and the maximum?
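For teachers who want to check the arithmetic, the interquartile range and the range follow directly from the five-number summary quoted earlier in this lesson for the pounds column. This is a plain-Python sketch; rounding to one decimal place is an added assumption to keep floating-point noise out of the printed result.

```python
# Values quoted in this lesson for the pounds column.
minimum, q1, q3, maximum = 0.1, 3.9, 60.4, 172

iqr = round(q3 - q1, 1)                   # spread of the middle half
data_range = round(maximum - minimum, 1)  # spread of the whole column

print(iqr)         # 56.5
print(data_range)  # 171.9
```

The middle half of the animals spans 56.5 pounds, while the full data set spans 171.9 pounds.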
Now that you’re comfortable creating box plots and looking at measures of spread on the computer, it’s time to put your skills to the test!
Turn to Interpreting Spread (Page 62) and complete the questions you see there.
Just as pie and bar charts are ways of visualizing categorical data, box plots and histograms are both ways of visualizing the shape of quantitative data. Box plots make it easy to see the 5-number summary, and compare the Range and Interquartile Range. Histograms make it easier to see skewness and more details of the shape, and offer more granularity when using smaller bins.
Left-skewness is seen as a long tail in a histogram. In a box plot, it’s seen as a longer left "whisker" or more spread in the left part of the box. Likewise, right skewness is shown as a longer right "whisker" or more spread in the right part of the box.
Box plots and Histograms can both tell us a lot about the shape of a dataset, but they do so by grouping data quite differently. A box plot is always divided into four parts, which may fall on differently-sized intervals but all contain the same number of points. A histogram, on the other hand, has identically-sized intervals which can contain very different numbers of points.
Turn to Identifying Shape (Page 63) and see if you can describe box plots using what you know about skewness.
Challenge Questions:
- Compare the box plots for the pounds column of both cats and dogs in the dataset. Are their shapes different? How much overlap is there?
- Compare histograms for the age column of both cats and dogs in the dataset. Are their shapes different? How much overlap is there?
- Can you explain why the amount of overlap between these two distributions is different?
=== Possible Misconceptions It is extremely common for students to forget that every quartile always includes 25% of the dataset. This will need to be heavily reinforced.
=== Synthesize Histograms, box plots, and measures of center and spread are all different ways to get at the shape of our data. It’s important to get comfortable using every tool in the toolbox when discussing shape!
Modified Box Plots: More Statistics- or Math-oriented classes will also be familiar with modified box plots. These are similar to traditional box plots, but the whiskers extend only to the minimum and maximum non-outliers. To call our attention to outliers, they are drawn as small dots or asterisks at the extreme ends of the graph (watch a video on modified box plots). Pyret also has a
== Comparing Box Plots 15 minutes
=== Overview
Students assess the degree of visual overlap of two numerical distributions.
=== Launch
"Do dogs take longer to get adopted than cats?"
This is asking us about the interaction between a categorical variable (`species`) and a quantitative one (`weeks`). Instead of creating a whole new display, all we have to do is make separate box plots for the distribution of `weeks` for both cats and dogs. Note: this works fine as long as we’re sure to use a common scale! Both box plots (see below) share the same axis for adoption times, which ranges from about 1 to 10 weeks.
Box plots make it easy to decide if values of a quantitative variable seem to be fairly similar or quite different, depending on which group an individual is in. The trick is to train your eyes to look for whether there’s a lot of overlap in the two box plots, or if one is noticeably higher than the other.
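The same comparison can be made numerically by splitting the quantitative column by group and comparing the boxes. The Python sketch below uses made-up `(species, weeks)` rows, not the real Animals Dataset, and the helpers `weeks_by_group`, `box`, and `boxes_overlap` are hypothetical names introduced for illustration. Checking whether the two boxes (Q1 to Q3) share any values is only one rough way to quantify "overlap".

```python
import statistics

# Made-up (species, weeks) rows for illustration; not the real Animals Dataset.
rows = [
    ("cat", 1), ("cat", 2), ("cat", 2), ("cat", 3), ("cat", 4),
    ("dog", 3), ("dog", 5), ("dog", 6), ("dog", 8), ("dog", 10),
]


def weeks_by_group(rows):
    """Split the weeks values into one list per species."""
    groups = {}
    for species, weeks in rows:
        groups.setdefault(species, []).append(weeks)
    return groups


def box(values):
    """Return (Q1, median, Q3): the edges and middle line of the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q1, median, q3)


def boxes_overlap(a, b):
    """True if the two Q1..Q3 intervals share at least one value."""
    return a[0] <= b[2] and b[0] <= a[2]


groups = weeks_by_group(rows)
cat_box = box(groups["cat"])
dog_box = box(groups["dog"])
print("cats:", cat_box, "dogs:", dog_box)
print("boxes overlap?", boxes_overlap(cat_box, dog_box))
```

For these made-up rows the two boxes do not overlap and the dogs' median is higher, which is the kind of pattern students are asked to spot by eye.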
=== Investigate
Have students break into groups of 3-4, and compare the box plot of weeks-to-adoption for cats with the one for dogs. Note: they can generate the pair of box plots themselves, but we recommend simply giving them this image: cats v. dogs.
- Do the two box plots mostly overlap, or does one have a noticeably different range than the other?
- How do the medians compare?
Next, each group examines the pair of box plots that compare weeks to adoption for fixed versus unfixed animals: fixed v. unfixed. Once again, consider how similar or different the two plots seem.
- Do the two box plots mostly overlap, or does one have a noticeably different range than the other?
- How do the medians compare?
Students should confirm that the box plots for adoption times of unfixed versus fixed animals have more overlap than the box plots for adoption times of cats versus dogs.
Box plots and histograms give us two different views on the concept of shape.
Histograms: fixed intervals (“bins”) with variable numbers of data points in each one. Points “pile up in bins”, so we can see how many are in each. Larger bars show where the clusters are.
Box plots: variable intervals (“quartiles”) with a fixed number of data points in each one. Treats data more like “pizza dough”, dividing it into four equal quarters showing where the data is tightly clumped or spread thin. Smaller intervals show where the clusters are.
To make connections between histograms and box plots, complete Matching Box-Plots to Histograms (Page 65).
=== Synthesize
Referring to our Dogs v. Cats box plots, the dogs’ adoption times were much higher than the cats’; the top half of the dogs’ box plot doesn’t overlap at all with the cats’ box plot. Does this suggest that species does or does not play a role in how long it takes for an animal to be adopted?
Referring to our Fixed v. Unfixed box plots, we saw that adoption times for unfixed and fixed animals overlapped a lot, and the medians were pretty close. Does this suggest that being fixed does or does not play a role in how long it takes for an animal to be adopted?
Which variable seems to have more of an effect on adoption time: species (cat or dog) or whether an animal is fixed or not? Have students share back their findings.
Project Option: Stress or Chill? Students can gather data about their own lives, and use what they’ve learned in the class so far to analyze it. This project can be used as a mid-term or formative assessment, or as a capstone for a limited implementation of Bootstrap:Data Science. A project description is available (you will also need the Personality True Colors assessment). Based on the What Stresses Us? project from IDS at UCLA.
== Your Analysis flexible
=== Overview
Students repeat the previous activity, this time applying it to their own dataset and interpreting their own results. Note: this activity can be done briefly as a homework assignment, but we recommend giving students an additional class period to work on this.
=== Investigate
- Take 15 minutes to fill out Shape of My Dataset (Page 64) in your Student Workbook. Choose a column to investigate, and write up your findings.
- Students should fill in the Measures of Center and Spread portion of their Research Paper, using the means, medians, modes, box plots and five-number summaries they’ve constructed for their dataset and explaining what they show.
=== Synthesize
Have students share their findings with one another.
These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by Emmanuel Schanzer, Nancy Pfenning, Emma Youndtsmith, Jennifer Poole, Shriram Krishnamurthi, Joe Politz, Ben Lerner, Flannery Denny, and Dorai Sitaram with help from Eric Allatta and Joy Straub is licensed under a Creative Commons 4.0 Unported License. Based on a work at www.BootstrapWorld.org. Permissions beyond the scope of this license may be available by contacting schanzer@BootstrapWorld.org.
= Checking Your Work
Students consider the concept of trust and testing — how do we know if a particular analysis is trustworthy?
Prerequisites: None
Lesson Goals: Students will be able to…
- Create a subset of data to verify that a given transformation works as advertised, using attributes of the transformation and the dataset.
== Confirming Analysis 30 minutes
=== Overview
Students learn how to create a Testing Table, which is small enough to reason about and can be used to test whether code does the right thing.
=== Launch
Samples are taken in Data Science and Computer Programming for two different reasons. One of the main purposes of Data Science is to take a representative sample from a larger population, and use information from the sample to infer what’s true about the whole population. In programming, we often extract a smaller Table from a larger one, for the purpose of testing that our code seems to do what it’s supposed to. In this lesson, we focus on the tasks of programmers, and consider best practices for setting up a Testing Table that helps us check our code.
- Uber and Google are making self-driving cars, which use artificial intelligence to interpret sensor data and make predictions about whether a car should speed up, slow down, or slam on the brakes. This AI is trained on a lot of sample data, which it learns from. What might be the problem if the sample data only included roads in California?
- Law enforcement in many towns has started using facial-recognition software to automatically detect whether someone has a warrant out for their arrest. A lot of facial-recognition software, however, has been trained on sample data containing mostly white faces. As a result, it has gotten really good at telling white people apart, but often can’t tell the difference between people who aren’t white. Why might this be a problem?
- Why might it be a bad thing to test medicines only on men (or only on women) before prescribing them to the general public?
Testing Matters!
A good Testing Table should be representative of the population, and relevant to what’s being analyzed. A good Testing Table should have…
- At least the columns that matter: the ones we’ll be ordering or filtering by.
- Enough rows to include different circumstances that are relevant to the task at hand. For instance, if our code is supposed to extract certain cats from the animals table, our Testing Table should include at least one animal that’s not a cat.
- Rows that aren’t already sorted, if our analysis is supposed to sort for us.
Data scientists usually think in terms of samples that best serve the purpose of performing inference: Samples should be representative of the entire population, and large enough to get us fairly close to the truth about that population. Computer programmers need to think in terms of Testing Tables that best serve the purpose of verifying that their code does what it’s supposed to: The Tables should be designed to call attention to any imperfections in the code’s instructions.
=== Investigate
Testing Tables can also be used to verify that a certain analysis is correct. Code that filters a table to show only cats can’t be verified with a Testing Table that already has only cats. (Why not?)
Code that shows only the kittens…sorted in ascending order by weight must be verified by a Table containing cats, non-cats, old and young cats… and rows that aren’t already sorted!
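The kittens example can be sketched as follows. This is Python rather than Pyret, and everything in it is hypothetical: the rows are made up, the function names `kittens_by_weight` and `broken_kittens` are invented for illustration, and the "younger than 2 years" cutoff for a kitten is an assumption of the sketch. It is not the `fixed-cats` function from the Starter File.

```python
# Hypothetical Testing Table: a handful of rows chosen on purpose.
# It contains a non-cat, an old cat, and two kittens that are NOT in weight order.
testing_table = [
    {"name": "Rex",     "species": "dog", "age": 1,   "pounds": 30.0},
    {"name": "Tabby",   "species": "cat", "age": 9,   "pounds": 11.0},
    {"name": "Mittens", "species": "cat", "age": 1,   "pounds": 4.5},
    {"name": "Pip",     "species": "cat", "age": 0.5, "pounds": 2.0},
]


def kittens_by_weight(table):
    """Keep cats younger than 2 years (an assumed cutoff), lightest first."""
    kittens = [row for row in table if row["species"] == "cat" and row["age"] < 2]
    return sorted(kittens, key=lambda row: row["pounds"])


def broken_kittens(table):
    """A deliberately buggy version: it never checks the species."""
    young = [row for row in table if row["age"] < 2]
    return sorted(young, key=lambda row: row["pounds"])


good_names = [row["name"] for row in kittens_by_weight(testing_table)]
bad_names = [row["name"] for row in broken_kittens(testing_table)]
print(good_names)  # only the kittens, lightest first
print(bad_names)   # the young dog sneaks in, exposing the bug
```

A Testing Table containing only kittens, or only rows already sorted by weight, would make both versions look correct. The non-cat, the old cat, and the out-of-order rows are what expose the difference.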
- Turn to “Trust, but verify …” (Page 67) in your student workbook.
- You’ve been given a function called `fixed-cats` and a description of what it claims to do.
- List the names of the animals that you would use in a Testing Table to verify whether the function works as advertised. When you’ve finished, open the Trust-but-Verify Starter File. There are three versions of `fixed-cats` here. Are they all correct? If not, which ones are broken?
- Turn to “Trust, but verify…” (Page 68). Using the same Starter File, construct a Testing Table and figure out which (if any) of the functions are correct!
=== Synthesize
Complex analysis has more room for mistakes, so it’s critical to think about a Testing Table that allows us to trust that our code really does what it’s supposed to!
How would you check whether or not a facial recognition system was equally accurate for everyone?
= Scatter Plots
Students investigate scatter plots as a method of visualizing the relationship between two quantitative variables.
- explanatory variable: the variable in a relationship that is presumed to impact the other variable
- response variable: the variable in a relationship that is presumed to be affected by the other variable
- scatter plot: a display of the relationship between two quantitative variables, graphing each explanatory value on the x axis and the accompanying response on the y axis
== Relationships Between Columns 15 minutes
=== Overview
Students are finally introduced to Relate Questions, which ask about the relationship between one quantitative column and another.
=== Launch
Can animals' weights help explain why some are adopted quickly while others take a long time? What other factors explain why one pet gets adopted right away, and others wait months?
Theory 1: Smaller animals get adopted faster because they’re easier to care for.
How could we test that theory? Bar and pie charts are great for showing us frequencies or percentages in a categorical column. Histograms and box plots are great for showing us the shape, center, and spread of a single quantitative column. But none of these displays will help us see connections between two quantitative columns.
=== Investigate
- Take a few minutes to look through the whole dataset, and see if you agree with Theory 1.
- Could any of our visualizations or summaries provide evidence for or against the theory?
- Write down your hypothesis on (Dis)Proving a Claim (Page 71), as well as a theory about how we could use this dataset to see if you’re right.
=== Synthesize
We’ve got a lot of tools in our toolkit that help us think about an entire column of a dataset:
- We have ways to find measures of center and spread for a given quantitative column.
- We have visualizations that let us see the shape of values in a quantitative column.
- We have visualizations that let us see frequencies or percentages in a categorical column.
What columns is this question asking about?
== Making Scatter Plots 20 minutes
=== Overview
Students are introduced to scatter plots, which are visualizations tailored to Relate Questions about quantitative variables. They learn how to construct scatter plots by hand, and in Pyret.
=== Launch
This question is asking about two columns in our dataset. Specifically, it’s asking if there is a relationship between `pounds` and `weeks`.
Before we can draw a scatter plot, we have to make an important decision: which variable is explanatory and which is response? In this case, are we suspecting that an animal’s weight can explain how long it takes to be adopted, or that how long it takes to be adopted can explain how much an animal weighs?
The first of these makes sense, and reflects our suspicion that weight plays a role in adoption time. The convention is to use the horizontal axis for our explanatory variable and the vertical axis for the response. Thus, `pounds` will be x and `weeks` will be y.
=== Investigate
We will produce our scatter plot by graphing each animal’s `pounds` and `weeks` values as a point on the x and y axes.
Complete Creating a Scatter Plot (Page 72) in your Student Workbook.
Teaching Tip: Divide the full table up into sub-lists, and have a few students plot 3-4 animals on the board. This can be done collaboratively, resulting in a whole-class scatter plot!
- Open your “Animals Starter File”. (If you do not have this file, or if something has happened to it, you can always make a new copy.)
- Make a scatter plot that displays the relationship between weight and adoption time.
- Are there any patterns or trends that you see here?
- Try making a few other scatter plots, looking for relationships between other columns in the `animals-table`.
=== Synthesize
Have students share their observations. What trends do they see? Are there any points that seem unusual? Why?
== Looking for Trends 20 minutes
=== Overview
Students are asked to identify patterns in their scatter plots. This activity builds towards the idea of linear associations, but does not go into depth (as the following lesson does).
=== Launch
Shown below is a scatter plot of the relationship between the animals' `age` and the number of `weeks` it takes to be adopted.
- Can you see a “cloud” around which the points are clustered?
- Does the number of weeks to adoption seem to go up or down as you move from left to right across the plot?
- Are there any points that “stray from the pack”? Which ones?
Teaching Tip: Project the scatter plot at the front of the room, and have students come up to the plot to point out their patterns.
A straight-line pattern in the cloud of points suggests a linear relationship between two columns. If we can pinpoint a line around which the points cluster (as we’ll do in a future lesson), it would be useful for making predictions. For example, our line might predict how many `weeks` a new dog would wait to be adopted, if it weighs 68 `pounds`.
Do any data points seem unusually far away from the main cloud of points? Which animals are those? These points are called unusual observations. Unusual observations in a scatter plot are like outliers in a histogram, but more complicated because it’s the combination of x and y values that makes them stand apart from the rest of the cloud.
Unusual observations are always worth thinking about:
- Sometimes they’re just random. Felix seems to have been adopted quickly, considering how much he weighs. Maybe he just met the right family early. Or maybe we find out that he lives nearby, got lost, and was picked up by his family. In that case, we might need to do some deep thinking about whether or not it’s appropriate to remove him from our dataset.
- Sometimes they can give you a deeper insight into your data. Maybe Felix is a special, popular (and heavy!) breed of cat, and we discover that our dataset is missing an important column for breed!
- Sometimes unusual observations are the points we are looking for! What if we wanted to know which restaurants are a good value, and which are rip-offs? We could make a scatter plot of restaurant reviews vs. prices, and look for an observation that’s high above the rest of the points. That would be a restaurant whose reviews are unusually good for the price. An observation way below the cloud would be a really bad deal.
=== Investigate
For practice, try making scatter plots for each of the following relationships, always expressed as “response variable vs explanatory variable”. If you see any unusual observations, try to explain them!
- The `pounds` of an animal vs its `age`
- The number of `weeks` for an animal to be adopted vs its number of `legs`
- The number of `legs` vs the `age` of an animal.
- Do you see a linear (straight-line) relationship in any of these, evidenced by a cloud of points that’s clearly rising or falling from left to right? Are there any unusual observations?
=== Synthesize
Debrief, showing the plots on the board. Make sure students see plots for which there is no relationship, like the last one!
Theory 2: Younger animals get adopted faster because they are easier to care for.
It might be tempting to go straight into making a scatter plot to explore how weeks to adoption may be affected by age. But different animals have very different lifespans! A 5-year-old tarantula is still really young, while a 5-year-old rabbit is fully grown. With differences like this, it doesn’t make sense to put them all on the same scatter plot. By mixing them together, we may be hiding a real relationship, or creating the illusion of a relationship that isn’t really there! What should we do to explore this theory?
= Correlations
Students continue to interpret scatter plots, and think about direction and strength of linear relationships.
Prerequisites: None
- form: for a relationship between two quantitative variables, whether the two variables together vary linearly or in some other way
- r: a number between −1 and 1 that measures the direction and strength of a linear relationship between two quantitative variables (also known as correlation value)
== Correlations have Form 10 minutes
=== Overview
Students identify and make use of patterns in scatter plots, learning to characterize them as being linear, curved, or showing no clear pattern. This builds intuition for determining if the form is linear, in which case we can proceed to correlation and linear regression.
=== Launch
By now we have learned ways to summarize a single quantitative variable, like the `age` of an animal in our dataset: report the center, spread, and shape of the distribution. Together, those numbers tell us what age is typical, how much the ages vary, and what kind of age values are usual or unusual. We could do the same for `pounds`, `weeks`, or any other quantitative column.
But those individual summaries tell us nothing about the relationship between animals' ages and weights. In order to understand such relationships, we have to expand our view from a single dimension (along one axis) to two dimensions. This goes hand in hand with expanding our display from a one-dimensional histogram to a two-dimensional scatter plot.
Rather than summarizing each distribution in one dimension, we can summarize a linear relationship between two quantitative variables. But this only makes sense if the scatter plot follows a straight-line pattern, as opposed to being curved. So the very first assessment we have to make is to identify the form of the relationship as being linear or not.
Form: whether a relationship is linear or not
=== Investigate
The relationship between two quantitative variables can take many forms: some patterns are linear, and appear as a straight line sloping up or down. Some patterns are non-linear, and may look like a curve or an arc. And sometimes there is no pattern or relationship at all!
Have students turn to Identifying Form, Direction and Strength (Page 73) in their student workbooks. For each scatter plot, identify whether the relationship is linear, non-linear or if there’s no relationship at all.
=== Synthesize
Data Scientists use their eyes all the time! It doesn’t make sense to search for correlations when there’s no pattern at all, and only linear relationships make sense if we want to summarize with a correlation.
Going Deeper: In an AP Statistics class or full-year Data Science class, it’s appropriate to discuss non-linear relationships here. In a dedicated computer science class, it may also be appropriate to talk about transforming the x- or y-axis (using
== Correlations have Direction & Strength 20 minutes
=== Overview
Once students have learned to identify a possible linear relationship, they can turn their attention to other qualities of that relationship: its direction and strength. Each of these is expressed in the r-value, which students learn to read.
=== Launch
Assuming a relationship is linear, data scientists calculate a single number called "correlation" (or r-value) that reports both the direction and strength.
Direction: whether a linear relationship is positive or negative.
A linear relationship between two quantitative variables is positive if, in general, the scatter plot points are sloping up: smaller x values tend to go with smaller y values, and larger x values tend to go with larger y values. The relationship is negative if points slope down: smaller x values tend to go with larger y values, and larger x values tend to go with smaller y values.
- Positive relationships are by far most common because of natural tendencies for variables to increase in tandem. For example, “the older the animal, the more it tends to weigh”. This is usually true for human animals, too!
- Negative relationships can also occur. For example, “the older a child gets, the fewer new words he or she learns each day.”
Strength: how closely the two variables are correlated.
A relationship between two quantitative variables is strong if the points are tightly clustered. In this case, knowing the x-value of a data point gives us a very good idea of what its y-value will be.
- A strong linear relationship means that the points in the scatter plot are all clustered tightly around an invisible line.
- A weak linear relationship means that the cloud of points is scattered very loosely around the line.
=== Investigate
Have students turn to Identifying Form, Direction and Strength (Page 73) in their student workbooks. For each scatter plot, identify whether the relationship is positive or negative, and whether it is strong or weak.
The correlation r is a number (between -1 and 1) that tells us the direction and strength of a linear relationship between two variables. r is positive or negative depending on whether the correlation is positive or negative. The strength of a correlation is the distance from zero: an r-value of zero means there is no correlation at all, and stronger correlations will be closer to −1 or 1.
An r-value whose distance from zero is about 0.65 to 0.70 or more is typically considered a strong correlation, and anything between ±0.35 and ±0.65 is “moderately correlated”. Anything closer to zero than about ±0.25 to ±0.35 may be considered weak. However, these cutoffs are not an exact science! In some contexts an r-value of ±0.50 might be considered impressively strong!
Calculating r from a data set only tells us the direction and strength of the relationship in that particular sample. If the correlation between adoption time and age for a representative sample of about 30 shelter animals turns out to be +0.44, the correlation for the larger population of animals will probably be close to that, but almost certainly not exactly the same.
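For teachers who want to show where the number comes from, here is a minimal Python sketch of the standard Pearson correlation formula. It is not Pyret's implementation; the function name `r_value` simply mirrors the Pyret name used in this lesson, and the sample `pounds` and `weeks` lists are made up for illustration.

```python
import math


def r_value(xs, ys):
    """Pearson correlation of two equal-length lists of numbers.

    Assumes both lists have at least two values and neither is constant
    (a constant column would cause a division by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y vary together, and how much each varies alone.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values, not the Animals Dataset.
pounds = [2, 5, 9, 20, 48, 75]
weeks = [1, 3, 2, 5, 6, 9]

print(r_value(pounds, weeks))         # positive: heavier animals waited longer here
print(r_value([1, 2, 3], [2, 4, 6]))  # points exactly on a rising line
print(r_value([1, 2, 3], [6, 4, 2]))  # points exactly on a falling line
```

Points that fall exactly on a rising line give an r of 1, points exactly on a falling line give −1, and everything else lands in between. The sign reports direction, and the distance from zero reports strength.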
Have students turn to Identifying Form and r-Values (Page 74) in their student workbooks. For each scatter plot, identify whether the relationship is linear, and use r to summarize direction and strength.
- In the Interactions Area, create a scatter plot for the Animals Dataset, using `"pounds"` as the xs and `"weeks"` as the ys.
- Form: Does the point cloud appear linear or non-linear?
- Direction: If it’s linear, does it appear to go up or down as you move from left to right?
- Strength: Is the point cloud tightly packed, or loosely dispersed?
- Would you predict that the r-value is positive or negative? Will it be closer to zero, closer to ±1, or in between?
- Have Pyret compute the r-value, by typing `r-value(animals-table, "pounds", "weeks")`. Does this match your prediction?
- Repeat this process using `"age"` as the xs. Is this correlation stronger or weaker than the correlation for `"pounds"`? What does that mean?
=== Common Misconceptions
- Students often conflate strength and direction, thinking that a strong correlation must be positive and a weak one must be negative.
- Students may also falsely believe that there is ALWAYS a correlation between any two variables in their dataset.
- Students often believe that strength and sample size are interchangeable, leading to mistaken assumptions like "any correlation found in a million data points must be strong!"
=== Synthesize
It is useful to ask students probing questions, to help address the misconceptions listed above. Some examples:
- What is the difference between a weak relationship and a negative relationship?
- What is the difference between a strong relationship and a positive relationship?
- If we find a strong relationship in a sample, can we always infer that relationship holds for the whole population?
- Suppose we have two correlations, one drawn from 10 data points and one drawn from 50. If both correlations are identical in direction and strength, should we trust them equally when making an inference about the larger population?
Correlation does NOT imply causation.
It’s easy to be seduced by large r-values, and believe that we’re really onto something that will help us make predictions! But Data Scientists know better than that…
Here are some real-life correlations that have absolutely no causal relationship; they come about either by chance or because both of them are tied in with another variable that’s (often) lurking in the background.
- “Number of people who drowned after falling out of a fishing boat” v. “Marriage rate in Kentucky” (r = 0.98)
- “Average per-person consumption of chicken” v. “U.S. crude oil imports” (r = 0.95)
- “Marriage rate in Wyoming” v. “Domestic production of cars” (r = 0.99)
- “Number of people who get tangled in their own bedsheets” v. “Amount of cheese consumed that year” (r = 0.95)
All of these correlations come from the Spurious Correlations website. If time allows, have your students explore the site to see more!
== Your Analysis flexible
=== Overview
Students repeat the previous activity, this time applying it to their own dataset and interpreting their own results. Note: this activity can be done as a homework assignment, but we recommend giving students an additional class period to work on this.
=== Launch
What correlations do you think there are in your dataset? Would you like to investigate a subset of your data to find those correlations?
=== Investigate
- Brainstorm a few possible correlations that you might expect to find in your dataset, and make some scatter plots to investigate.
- Turn to Correlations in My Dataset (Page 75), and list three correlations you’d like to search for.
- Investigate these correlations. If you need blank Design Recipes, you can find them at the back of your workbook, just before the Contracts.
=== Synthesize
What correlations did you find? Did you need to filter out certain rows in order to get those correlations?
After looking at the scatter plot for our animal shelter, do you still agree with the claim on (Dis)Proving a Claim (Page 71)? (Perhaps they need more information, or to see the analysis broken down separately by animal!)
== Additional Exercises
- Identifying Form, Direction and Strength (Matching)
= Linear Regression
Students compute the “line of best fit” using linear regression, and summarize linear relationships in a dataset.
Prerequisites: None
Supplemental Resources: Summative Assessment / Capstone
- explanatory variable: the variable in a relationship that is presumed to impact the other variable
- line of best fit: summarizes the relationship (if linear) between two quantitative variables
- linear regression: modeling the relationship between two quantitative variables using a straight line
- predictor function: a function which, given a value from one data set, makes an educated guess at a related value in a different data set
- response variable: the variable in a relationship that is presumed to be affected by the other variable
== Intro to Linear Regression 10 minutes
=== Overview
Students are introduced to the concept of linear regression, and learn how to interpret the slope and intercept. For teachers who have the need and the bandwidth to go deeper, this is a good opportunity to teach the algorithm behind linear regression.
=== Launch
Make two scatterplots from the `animals-table`, using `age` as the explanatory variable in one plot and `pounds` as the explanatory variable in the other. In both plots, use `weeks` as your response variable and `name` for the labels. We will refer to the explanatory column as “xs” and the response column as “ys.”
“Can we predict an animal’s adoption time based on its size? Its age?”
Have students write down what they think on What’s on your mind? (Page 81), then quickly survey the class.
(Image: weeks-v-pounds scatterplot)
We are asking if we can use an animal’s size or age to predict how long it will take to be adopted. A scatter plot of adoption time versus size does suggest that smaller animals get adopted in a shorter period of time and larger animals take longer. Similarly, younger animals tend to be adopted faster than older ones. Can we be more precise about this, and actually predict how long it will take an animal to be adopted, based on these factors? And which one would give us a better prediction?
The mean, median, and mode are three different ways to measure the “center” of a dataset in one dimension. Each represents a different way to collapse a bunch of points on a number line into a single, summary value. If the “center” of points on a one dimensional number line is a single point, what is the “center” of points in a two-dimensional cloud, which cluster around a line?
What we need to do is find a line, called a line of best fit or a regression line, that is at the center of this cloud. Each point in our scatter plot “pulls” on the line, with points above the line yanking it up and points below the line dragging it down. Points that are really far away (especially influential observations that are far out in the x direction) pull on the line with more force. This line can be graphed on top of the scatter plot as a function, called the predictor function.
Given a value on the x-axis, this line allows us to predict what the corresponding value on the y-axis might be. This allows us to make predictions based on our data.
Is there only one “best line”? Based on methods of calculus, data scientists know the answer to this question is yes! That justifies us talking about a single “line of best fit.”
Data scientists use a statistical method called linear regression to pinpoint linear relationships in a dataset. When we draw our regression line on a scatter plot, we can imagine rubber bands stretching vertically between the line itself and each point in the plot: every point pulls the line a little “up” or “down”. Linear regression is the math behind the line of best fit.
Going Deeper: If you want to teach students the algorithm for linear regression, now is the time! However, this algorithm is not a required portion of Bootstrap:Data Science.
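For teachers who do go deeper, the "rubber band" picture can be written down as a formula. The sketch below assumes the standard ordinary least-squares method (this lesson later refers to "least-squares regression"); the symbols m, b, n, and the bars for means are notation introduced here, not taken from the student materials. The line of best fit y = mx + b is the one that makes the sum of squared vertical distances as small as possible:

```latex
\[
S(m, b) = \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2
\]
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
\]
```

Because the distances are squared, points far from the line contribute much more to S than nearby points, which matches the idea that far-away points pull harder. The formula for b also says the line always passes through the point of means.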
=== Investigate
Have students open this Interactive LR Plot.
- Try moving the blue point “P”, and see what effect it has on the red line.
- Find the number called r. In your own words, explain what this number tells us.
- What’s the largest r-value you can get? What do you think that number means?
- Where can you move P so that it is most aligned with the other points?
- Where can you move P so that it is least aligned with the other points?
- Could the regression line ever be above or below all the points? Why or why not?
Let’s explore scatter plots for weeks-v-pounds and weeks-v-age:
(Images: weeks-v-pounds scatterplot, weeks-v-age scatterplot)
After looking at the point clouds, we are left with a few questions:
- Do the relationships appear to be linear for one? Both?
- If a relationship is linear, what line in particular are the scatter plot points clustering around?
- What is the r-value for each relationship?
- Turn to Drawing Predictors (Page 77).
- In the first column, draw a line of best fit through each of the scatter plots.
- In the second column, circle whether the slope of the line (which is the same as the direction of the correlation) is positive or negative.
=== Synthesize
Give students some time to experiment, then share back observations. Can they come up with rules or suggestions for how to minimize error?
- Would it be possible to have a line that is below all the points? (no)
- Would it be possible to have a line that is above all the points? (no)
- Would it be possible to have a line with more points on one side than the other? (yes)
== Linear Regression in Pyret 20 minutes
=== Overview
Students are introduced to the `lr-plot` function in Pyret, which performs a linear regression and plots the result.
=== Launch
Pyret includes a powerful display, which (1) draws a scatterplot, (2) draws the line of best fit, and (3) even displays the equation for that line:

    # use linear regression to extract a predictor function
    # lr-plot :: (t :: Table, ls :: String, xs :: String, ys :: String) -> Image
    lr-plot(animals-table, "name", "age", "weeks")
`lr-plot` is a function that takes a Table and the names of 3 columns:
- `ls`: the name of the column to use for labels (e.g. “names of pets”)
- `xs`: the name of the column to use for x-coordinates (e.g. “age of each pet”)
- `ys`: the name of the column to use for y-coordinates (e.g. “weeks for each pet to be adopted”)
Our goal is to use values of the variable on our x-axis to predict values of the variable on our y-axis.
Pedagogical Note: We prefer the words “explanatory” and “response” in our curriculum, because in other contexts the words “dependent” and “independent” refer to whether or not the variables are related at all, as opposed to what role each plays in the relationship.
Have students create an `lr-plot` for our `animals-table`, using `"name"` for the labels, `"age"` for the x-axis and `"weeks"` for the y-axis.
The resulting scatterplot looks like those we’ve seen before, but it has a few important additions. First, we can see the line of best fit drawn onto the plot. We can also see the equation for that line (in red), in the form y = mx + b. In this plot, we can see that the slope of the line is 0.792, which means that on average, each extra year of age is associated with an extra 0.792 weeks of waiting to be adopted (about 5 or 6 extra days). By plugging in an animal’s age for x, we can make a prediction about how many weeks it will take to be adopted. For example, we predict a 5-year-old animal to be adopted in 0.792 × 5 + 2.285 = 6.245 weeks. That’s the y-value exactly on the line at x=5.
The intercept is 2.285. This is where the best-fitting line crosses the y-axis. We want to be careful not to interpret this too literally, and say that a newborn animal would be adopted in 2.285 weeks, because none of the animals in our data set was that young. Still, the regression line (or line of best fit) suggests that a baby animal, whose age is close to 0, would take only about 2 to 3 weeks to be adopted.
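The slope and intercept read off the plot can be turned into a predictor function that students can run. This is a minimal sketch using only the two numbers quoted above; the name `predict-weeks` is a hypothetical choice for this example and is not part of the course library.

```pyret
# Predictor function read off the lr-plot: y = 0.792x + 2.285
# (predict-weeks is a hypothetical name chosen for this sketch)
fun predict-weeks(age :: Number) -> Number:
  (0.792 * age) + 2.285
end

check:
  # a 5-year-old animal
  predict-weeks(5) is 6.245
  # the intercept: where the line crosses the y-axis
  predict-weeks(0) is 2.285
  # the slope is the change in the prediction for one extra year of age
  (predict-weeks(6) - predict-weeks(5)) is 0.792
end
```

Writing the line as a function makes the roles of the two numbers visible: the intercept is the output at 0, and the slope is the difference between predictions one year apart.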
We also see the r-value is +0.442. The sign is positive, consistent with the fact that the scatter plot point cloud, along with the line of best fit, slopes upward. The fact that the r-value is close to 0.5 tells us that the strength is moderate. This is consistent with the fact that the scatter plot points are somewhere between being really tightly clustered and really loosely scattered.
Going Deeper: Students may notice another value in the lr-plot, called R^2. This value describes the percentage of the variation in the y-variable that is explained by least-squares regression on the x variable. In other words, an R^2 value of 0.20 could mean that “20% of the variation in adoption time is explained by regressing adoption time on the age of the animal”. Discussion of R^2 may be appropriate for older students, or in an AP Statistics class.
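For a regression with a single explanatory variable, R^2 is simply the square of the r-value. The sketch below checks that relationship for the r-value of +0.442 quoted for the age v. weeks plot, assuming the R^2 shown on that plot belongs to the same regression; the names `r-age-weeks` and `r-sq` are hypothetical.

```pyret
# r-value quoted for the age v. weeks lr-plot
r-age-weeks = 0.442

# With one explanatory variable, R^2 is the square of r
r-sq = r-age-weeks * r-age-weeks

check:
  r-sq is 0.195364
  # rounded to two decimal places
  (num-round(r-sq * 100) / 100) is 0.2
end
```

This also shows why a "moderate" r can correspond to a fairly small share of explained variation: squaring a number between 0 and 1 makes it smaller.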
=== Investigate
- Make another lr-plot, but this time use the animals' weight as our explanatory variable instead of their age.
- If an animal is 5 years old, how long would our line of best fit predict they would wait to be adopted? What if they were a newborn, just 0 years old?
- If an animal weighs 21 pounds, how long would our line of best fit predict they would wait to be adopted? What if they weighed 0.1 pounds?
- Make another lr-plot, comparing the `age` v. `weeks` columns for only the cats.
=== Synthesize
A predictor only makes sense within the range of the data that was used to generate it. Statistical models are just proxies for the real world, drawn from a limited sample of data: they might make a useful prediction in the range of that data, but once we try to extrapolate beyond that data we may quickly get into trouble!
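One way to make the "range of the data" idea concrete is a small check students can run before trusting a prediction. This is only a sketch: `is-extrapolating` is a hypothetical helper, and the range of 1 to 16 years used in the examples is a placeholder, not the actual range of ages in the Animals Dataset.

```pyret
# is-extrapolating is a hypothetical helper for this sketch.
# min-x and max-x are the smallest and largest explanatory values
# in the data that was used to fit the line.
fun is-extrapolating(x :: Number, min-x :: Number, max-x :: Number) -> Boolean:
  (x < min-x) or (x > max-x)
end

check:
  # Placeholder range of 1 to 16 years; NOT taken from the Animals Dataset
  is-extrapolating(5, 1, 16) is false
  is-extrapolating(0, 1, 16) is true
  is-extrapolating(40, 1, 16) is true
end
```

A prediction for which this returns true is not necessarily wrong, but it rests on the assumption that the linear pattern continues outside the data we actually observed.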
Does the linear regression for our sample of the Animals Dataset allow us to make inferences about the behavior of the larger dataset? Why or why not?
== Interpreting LR Plots in Pyret 20 minutes
=== Overview
Students learn how to write about the results of a linear regression, using proper statistical terminology and thinking through the many ways this language can be misused.
=== Launch
How well can you interpret the results of a linear regression analysis? Can you write your own?
- What does it mean when a data point is above the line of best fit?
- What does it mean when a data point is below the line of best fit?
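These two questions can be answered numerically by comparing an actual value with the value the line predicts. The sketch below uses the prediction of 6.245 weeks for a 5-year-old animal from the previous section; `residual` is a hypothetical helper name, and the two actual adoption times are made up for illustration.

```pyret
# residual = actual value minus predicted value
# (residual is a hypothetical helper name for this sketch)
fun residual(actual :: Number, predicted :: Number) -> Number:
  actual - predicted
end

check:
  # Two made-up 5-year-old animals; the line predicts 6.245 weeks for both
  # Positive: the point is above the line (took longer than predicted)
  residual(8, 6.245) is 1.755
  # Negative: the point is below the line (adopted sooner than predicted)
  residual(4, 6.245) is -2.245
end
```

A point above the line means the response was larger than the line predicted for that x-value; a point below the line means it was smaller.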
=== Investigate
- Turn to Interpreting Regression Lines & r-Values (Page 78), and match the write-up on the left with the line of best fit and r-value on the right.
- Turn to Regression Analysis in the Animals Dataset (Page 79) to see how Data Scientists would write up the finding involving cats’ age and adoption time. Write up two other findings from the linear regressions you performed on this dataset.
When looking at a regression for adoption time v. age for just the cats, we saw that the slope of the predictor function was +0.23, meaning that for every year older a cat is, we expect a +0.23-week increase in the time taken to adopt the cat. The r-value was +0.566, confirming that the correlation is positive and indicating moderate strength.
=== Common Misconceptions
Students often think it doesn’t matter which variable is assigned to be x and which is y in a regression. It’s true that you’ll get the same correlation either way: for example, r=+0.442 whether your scatter plot shows `weeks` v. `age` or `age` v. `weeks`. However, the regression line is different, because the math minimizes vertical distances from the line, not horizontal ones.
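A tiny example can show that swapping x and y changes the line. The sketch below computes the least-squares slope directly from the standard formula on four made-up points; the data and the helper names `sum-of`, `list-mean`, and `ls-slope` are hypothetical and are not part of the course library or the Animals Dataset.

```pyret
# Made-up toy data, NOT taken from the Animals Dataset
toy-xs = [list: 1, 2, 3, 4]
toy-ys = [list: 1, 3, 2, 6]

fun sum-of(l :: List<Number>) -> Number:
  fold(lam(acc, n): acc + n end, 0, l)
end

# Assumes a non-empty list
fun list-mean(l :: List<Number>) -> Number:
  sum-of(l) / l.length()
end

# Least-squares slope for predicting `response` from `explanatory`
fun ls-slope(explanatory :: List<Number>, response :: List<Number>) -> Number:
  mx = list-mean(explanatory)
  my = list-mean(response)
  sxy = sum-of(map2(lam(x, y): (x - mx) * (y - my) end, explanatory, response))
  sxx = sum-of(map(lam(x): (x - mx) * (x - mx) end, explanatory))
  sxy / sxx
end

check:
  # predicting y from x
  ls-slope(toy-xs, toy-ys) is 1.4
  # predicting x from y
  ls-slope(toy-ys, toy-xs) is 0.5
end
```

If the two regressions described the same line, the second slope would be 1 divided by 1.4 (about 0.71), not 0.5. The lines differ because each one minimizes distances measured along its own response axis.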
=== Synthesize
Have students read their text aloud, to get comfortable with the phrasing.
== Your Analysis flexible
=== Overview
Students repeat the previous activity, this time applying it to their own dataset and interpreting their own results. Note: this activity can be done briefly as a homework assignment, but we recommend giving students an additional class period to work on this.
=== Launch
Now that you’ve gotten some practice performing linear regression on the Animals Dataset, it’s time to apply that knowledge to your own data!
=== Investigate
- Write up your findings by filling out Regression Analysis in Your Dataset (Page 80).
- Students should fill in the Correlations portion of their Research Paper, using the scatter plots and linear regression plots they’ve constructed for their dataset and explaining what they show.
=== Synthesize
Have students share their findings with the class. Get excited about the connections they are making and the conclusions they are drawing! Encourage students to make suggestions to one another about further analysis.
You’ve learned how linear regression can be used to fit a line to a linear cloud, and how to determine the direction and strength of that relationship. The word “linear” is important here. In the image on the right, there’s clearly a pattern, but it doesn’t look like a straight line! There are many other kinds of statistical models out there, but all of them work the same way: use a particular kind of mathematical function (linear or otherwise), to figure out how to get the “best fit” for a cloud of data.
Project Options: Olympic Records. In this project, students gather data about Olympic records over time in running, swimming, or speed skating. They use what they’ve learned in the class so far to analyze the change over time, using scatter plots and linear regression. This project can be used as a mid-term or formative assessment, or as a capstone for a limited implementation of Bootstrap:Data Science. The project description is available here. (Project designed by Joy Straub)
== Additional Exercises:
= Ethics and Privacy
Students consider ethical issues and privacy in the context of data science.
Relevant Standards: CSTA Standards, K-12CS Standards, Next-Gen Science Standards, Oklahoma Standards
== Case Studies 40 minutes
=== Overview
Students break into groups and read one of three case studies, each dealing with a different issue in Data Science. They discuss the implications of each, then share back.
=== Launch
“With great power comes great responsibility”
During World War II, scientists were engaged in a race to develop new weapons, more powerful than anything the world had ever seen. While the immediate goal was "win the war", many of the scientists realized that the weapons they were developing could be used for all sorts of things after the war was over, and not all of them were good.
With tech companies hiring Data Scientists at a staggering rate and collecting massive datasets on users for those scientists to mine, there’s a new arms race happening right now. Search engines tailor their results based on what they know about the customer doing the search, and social media networks want to recommend friends based on what they know about all of us. Both of these goals require building profiles on everyone, figuring out what their preferences are and where they tend to spend their time. They might require figuring out whether each of us is male or female, more likely to go to a movie or a play, or about to buy a dishwasher or a television.
But these datasets and profiles could be used for far more than that. What if the FBI used them to try to figure out who is likely to commit a crime, or a company tried to learn its employees' religion or sexual orientation?
As they build ever-more sophisticated models based on ever-more accurate datasets, Data Scientists need to think about the ethics of what they’re doing as well!
=== Investigate
Divide the class into groups of 3-4, and assign each group a different case study. Have each group choose one person to share back with the class.
- How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did (Forbes)
- Facebook 'likes' can reveal your secrets (CNN)
- Algorithmic Bias in Criminal Sentencing (ProPublica)
(Note: The third article is quite long, but only the first half is needed for students to complete this activity.)
Have students complete Case Study: Ethics, Privacy, and Bias (Page 82).
=== Synthesize
Give students time to discuss and share back. Encourage students to share back differing views on the articles.
What are some commonalities and differences among the issues raised by these articles?
OPTIONAL: Can the class come up with a list of "Rules for Ethical Data Science"?
Extension:
1) For homework, have students write arguments in support of a randomly-chosen side of each case study. Select twelve students (two for each side of all three case studies), and have them debate in front of the class. Each side gets to make "opening" and "closing" arguments, and they take turns so that the closer for each side can respond to what the other side said. Then have the class vote on who was most convincing.
2) For homework, have students find their own articles about ethical issues in data science and write a one-page essay defending one side of it.
= Threats to Validity
Students consider possible threats to the validity of their analysis.
Relevant Standards: Common Core Math Standards, CSTA Standards, K-12CS Standards, Next-Gen Science Standards, Oklahoma Standards
- threats to validity: factors that can undermine the conclusion of a study
== Threats to Validity 20 minutes
=== Overview
Students are introduced to the concept of validity, and a number of possible threats that might make an analysis invalid.
=== Launch
Survey says: “People prefer cats to dogs”
As good Data Scientists, the staff at the animal shelter is constantly gathering data about their animals, their volunteers, and the people who come to visit. But just because they have data doesn’t mean the conclusions they draw from it are correct! For example: suppose they surveyed 1,000 cat-owners and found that 95% of them thought cats were the best pet. Could they really claim that people generally prefer cats to dogs?
Have students share back what they think. The issue here is that cat-owners are not a representative sample of the population, so the claim is invalid.
There’s more to data analysis than simply collecting data and crunching numbers. In the example of the cat-owning survey, the claim that “people prefer cats to dogs” is invalid because the data itself wasn’t representative of the whole population (of course cat-owners are partial to cats!). This is just one example of what are called Threats to Validity.
There are several major threats to validity you should be on guard against:
- Selection bias: Data was gathered from a biased, non-representative sample of the population. This is the problem with surveying cat owners to find out which animal is most loved. Remember that, in general, randomness is the key to obtaining unbiased samples!
- Bias in the study design: Suppose you survey a random sample of pet owners that includes representative numbers of both cat and dog owners. But you ask them a “loaded” question like “Since annual vet care comes to about $300 for dogs and only about half of that for cats, would you say that owning a cat is less of a burden than owning a dog?” This could easily lead to a misrepresentation of people’s true opinions.
- Poor choice of summary: Even if the selection is unbiased, sometimes outliers are so extreme that they shift the results of our analysis (such as the mean) in ways that don’t represent the population as a whole. For example, if the shelter happened to house a 100-year-old tortoise, and summarized its animals’ ages with the mean, this would inflate our perception of what age is typical.
- Confounding variables: The gathered data does not take into account other factors that might influence a relationship. For example, a study might conclude that cat owners are more environmentally conscious: they’re more likely to use public transportation than dog owners. The confounding variable here could be urban versus rural dwelling: people who live in big cities are more likely to use public transportation and also more likely to own cats.
This is just a small list of different threats to validity. There are plenty more!
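The tortoise example under "poor choice of summary" is easy to demonstrate with a few numbers. The sketch below uses made-up ages (not rows from the Animals Dataset) and two hypothetical helpers, `list-mean` and `list-median`, written with core Pyret list operations rather than the course's table functions.

```pyret
# Made-up ages in years, NOT taken from the Animals Dataset
typical-ages = [list: 1, 2, 3, 4]
ages-with-tortoise = [list: 1, 2, 3, 4, 100]

# Both helpers assume a non-empty list
fun list-mean(l :: List<Number>) -> Number:
  fold(lam(acc, n): acc + n end, 0, l) / l.length()
end

fun list-median(l :: List<Number>) -> Number:
  sorted = l.sort()
  len = sorted.length()
  if num-modulo(len, 2) == 0:
    # even length: average the two middle values
    (sorted.get((len / 2) - 1) + sorted.get(len / 2)) / 2
  else:
    # odd length: take the middle value
    sorted.get((len - 1) / 2)
  end
end

check:
  list-mean(typical-ages) is 2.5
  list-median(typical-ages) is 2.5
  # one extreme value drags the mean far from the typical age
  list-mean(ages-with-tortoise) is 22
  # the median barely moves
  list-median(ages-with-tortoise) is 3
end
```

With the tortoise included, the mean suggests a "typical" animal is 22 years old, even though four of the five animals are 4 or younger. The median is the more honest summary here.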
=== Investigate
On Identifying Threats to Validity (Page 84) and Identifying Threats to Validity (Page 85), you’ll find four different claims backed by four different datasets. Each one of those claims suffers from a serious threat to validity. Can you figure out what those threats are?
=== Synthesize
Give students time to discuss and share back.
Life is messy, and there are always threats to validity. Data Science is about doing the best you can to minimize those threats, and to be up front about what they are whenever you publish a finding. When you do your own analysis, make sure you include a discussion of the threats to validity!
== Fake News! 20 minutes
=== Overview
Students are asked to consider the ways in which statistics are misused in popular culture, and become critical consumers of some statistical claims. Finally, they are given the opportunity to misuse their own statistics, to better understand how someone might distort data for their own ends.
=== Launch
You’ve already seen a number of ways that statistics can be misused:
- Intentionally using the wrong chart
- Changing the scale of a chart
- Using the mean instead of the median with heavily-skewed data
- Using the wrong language when describing a Linear Regression
- Using a correlation to imply causation
With all the news being shared through newspapers, television, radio, and social media, it’s important to be critical consumers of information!
=== Investigate
- On Fake News! (Page 86), you’ll find some deliberately misleading claims made by slimy Data Scientists. Can you figure out why these claims should not be trusted?
- Once you’ve finished, consider your own dataset and analysis: what misleading claims could someone make about your work? Turn to Lies, Darned Lies, and Statistics (Page 87), and come up with four misleading claims based on data or displays from your work.
- Trade papers with another group, and see if you can figure out why each other’s claims are not to be trusted!
=== Synthesize
Have students share back their "lies". Was anyone able to stump the other group?
== Your Analysis flexible
=== Overview
Students repeat the previous activity, this time applying it to their own dataset and interpreting their own results. Note: this activity can be done briefly as a homework assignment, but we recommend giving students an additional class period to work on this.
=== Launch
In every analysis, there are always threats to validity. It’s important to always be upfront about what those threats are, so that anyone who reads your analysis can make their own decision.
=== Investigate
- Students should fill in the Findings portion of their Research Paper, discussing threats to validity and drawing conclusions from their linear regression results.
== Additional Exercises: