su24final 6-9 sols
pallavisprabhu authored and pallavisprabhu committed Nov 27, 2024
1 parent 8291ca7 commit c85c35e
Showing 12 changed files with 1,825 additions and 367 deletions.
Binary file added docs/assets/images/su24-final/tour_df.png
1,575 changes: 1,575 additions & 0 deletions docs/su24-final/index.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions problems/su24-final/data-info.md
@@ -6,6 +6,7 @@ across all of the individual `stages` combined. Each row represents one stage of
equivalently, one day of racing). This dataset will be called `stages`.

The columns of `stages` are as follows:

- `"Stage" (int):` The stage number for the respective year.
- `"Date" (str):` The day that the stage took place, formatted as "YYYY-MM-DD".
- `"Distance" (float):` The distance of the stage in kilometers.
59 changes: 14 additions & 45 deletions problems/su24-final/q01.md
@@ -1,19 +1,13 @@
# BEGIN PROB

\[(23 pts)\]

# BEGIN SUBPROB

Fill in the blanks so that the expression below evaluates to the
*proportion* of stages won by the country with the most stage wins.

```py
stages.groupby(__(i)__).__(ii)__.get("Type").__(iii)__ / stages.shape[0]
```

`(i)` :

`(ii)` :

`(iii)` :

# BEGIN SOLUTION

@@ -23,56 +17,36 @@ Fill in the blanks so that the expression below evaluates to the

# BEGIN SUBPROB

The distance of a stage alone does not encapsulate its difficulty, as
riders feel more tired as the tour goes on. Because of this, we want to
consider "real distance,\" a measurement of the length of a stage that
takes into account how far into the tour the riders are. The "real
distance\" is calculated with the following process:
The distance of a stage alone does not encapsulate its difficulty, as riders feel more tired as the tour goes on. Because of this, we want to consider "real distance," a measurement of the length of a stage that takes into account how far into the tour the riders are. The "real distance" is calculated with the following process:

(i) Add one to the stage number.

(ii) Take the square root of the result of (i).

(iii) Multiply the result of (ii) by the raw distance of the stage.

Complete the implementation of the function `real_distance`, which takes
in `stages` (a DataFrame), `stage` (a string, the name of the column
containing stage numbers), and `distance` (a string, the name of the
column containing stage distances). `real_distance` returns a Series
containing all of the "real distances\" of the stages, as calculated
above.

def real_distance(stages, stage, distance):
________
Complete the implementation of the function `real_distance`, which takes in `stages` (a DataFrame), `stage` (a string, the name of the column containing stage numbers), and `distance` (a string, the name of the column containing stage distances). `real_distance` returns a Series containing all of the "real distances" of the stages, as calculated above.

::: responsebox
1in `return stages.get(distance) * np.sqrt(stages.get(stage) + 1)`
:::
```py
def real_distance(stages, stage, distance):
    ________
```

# BEGIN SOLUTION
**Solution:** `return stages.get(distance) * np.sqrt(stages.get(stage) + 1)`
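
As a quick sanity check on the solution above, here is a minimal sketch that runs it on a made-up three-row table. It assumes pandas can stand in for the course's DataFrame library (pandas DataFrames also expose `.get`), and the `demo` values are invented for illustration; they are not taken from `stages`.

```python
import numpy as np
import pandas as pd


def real_distance(stages, stage, distance):
    # (i) add one to the stage number, (ii) take the square root,
    # (iii) multiply by the raw distance of the stage.
    return stages.get(distance) * np.sqrt(stages.get(stage) + 1)


# Hypothetical data, chosen so the square roots come out to whole numbers.
demo = pd.DataFrame({"Stage": [0, 3, 8], "Distance": [100.0, 50.0, 10.0]})
result = real_distance(demo, "Stage", "Distance")
print(result.tolist())  # [100.0, 100.0, 30.0]
```

Because both operations are vectorized, the function returns a Series with one "real distance" per row and needs no explicit loop.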

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Sometimes, stages are repeated in different editions of the Tour de
France, meaning that there are some pairs of `"Origin"` and
`"Destination"` that appear more than once in `stages`. Fill in the
blanks so that the expression below evaluates how often the most common
`"Origin"` and `"Destination"` pair in the `stages` DataFrame appears.
Sometimes, stages are repeated in different editions of the Tour de France, meaning that there are some pairs of `"Origin"` and `"Destination"` that appear more than once in `stages`. Fill in the blanks so that the expression below evaluates to the number of times the most common `"Origin"` and `"Destination"` pair appears in the `stages` DataFrame.

```py
stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(iii)__]
```

`(i)` :

`(ii)` :

`(iii)` :

# BEGIN SOLUTION

# END SOLUTION
@@ -81,18 +55,13 @@ stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(ii

# BEGIN SUBPROB

Fill in the blanks so that the value of `mystery_three` is the
`"Destination"` of the longest stage before Stage 12.
Fill in the blanks so that the value of `mystery_three` is the `"Destination"` of the longest stage before Stage 12.

```py
mystery = stages[stages.get(__(i)__) < 12]
mystery_two = mystery.sort_values(by = "Distance", ascending = __(ii)__)
mystery_three = mystery_two.get(__(iii)__).iloc[-1]
```

`(i)` :

`(ii)` :

`(iii)` :

# BEGIN SOLUTION

23 changes: 5 additions & 18 deletions problems/su24-final/q02.md
@@ -1,8 +1,8 @@
# BEGIN PROB

Suppose we run the following code to simulate the winners of the Tour de
France.\
Suppose we run the following code to simulate the winners of the Tour de France.

```py
evenepoel_wins = 0
vingegaard_wins = 0
pogacar_wins = 0
@@ -14,14 +14,11 @@ France.\
        vingegaard_wins = vingegaard_wins + 1
    elif result[2] == 1:
        pogacar_wins = pogacar_wins + 1
```

# BEGIN SUBPROB

What is the probability that `pogacar_wins` is equal to 4 when the code
finishes running? Do not simplify your answer.

::: center
:::
What is the probability that `pogacar_wins` is equal to 4 when the code finishes running? Do not simplify your answer.

# BEGIN SOLUTION

@@ -31,17 +28,7 @@ finishes running? Do not simplify your answer.

# BEGIN SUBPROB

What is the probability that `evenepoel_wins` is at least 1 when the
code finishes running? Do not simplify your answer.

::: center
:::

# BEGIN SOLUTION

# END SOLUTION

# END SUBPROB
What is the probability that `evenepoel_wins` is at least 1 when the code finishes running? Do not simplify your answer.

# BEGIN SOLUTION

24 changes: 4 additions & 20 deletions problems/su24-final/q03.md
@@ -1,13 +1,8 @@
# BEGIN PROB

\[(12 pts)\] We want to estimate the mean distance of Tour de France
stages by bootstrapping 10,000 times and constructing a 90% confidence
interval for the mean. In this question, suppose `random_stages` is a
random sample of size 500 drawn with replacement from `stages`. Identify
the line numbers with errors in the code below. In the adjacent box,
point out the error by describing the mistake in less than 10 words or
writing a code snippet (correct only the part you think is wrong). You
may or may not need all the spaces provided below to identify errors.
We want to estimate the mean distance of Tour de France stages by bootstrapping 10,000 times and constructing a 90% confidence interval for the mean. In this question, suppose `random_stages` is a random sample of size 500 drawn with replacement from `stages`. Identify the line numbers with errors in the code below. In the adjacent box, point out the error by describing the mistake in fewer than 10 words or by writing a code snippet (correct only the part you think is wrong). You may or may not need all the spaces provided below to identify errors.

```py

line 1: means = np.array([])
line 2:
@@ -18,18 +13,7 @@ may or may not need all the spaces provided below to identify errors.
line 7:
line 8: left_bound = np.percentile(means, 0)
line 9: right_bound = np.percentile(means, 90)
```

`a) `

`b) `

`c) `

`d) `

`e) `

`f) `

# BEGIN SOLUTION

131 changes: 47 additions & 84 deletions problems/su24-final/q04.md
@@ -1,37 +1,29 @@
# BEGIN PROB

\[(16.5 pts)\]
Below is a density histogram representing the distribution of randomly sampled stage distances.

Below is a density histogram representing the distribution of randomly
sampled stage distances.

::: center
![image](final_images/histogram.png)
:::
<div style="text-align: center;">
<img src="../assets/images/su24-final/histogram.png" width="500">
</div>

# BEGIN SUBPROB

Which statement below correctly describes the relationship between the
mean and the median of the sampled stage distances?
Which statement below correctly describes the relationship between the mean and the median of the sampled stage distances?

( ) The mean is significantly larger than the median.

( ) The mean is significantly smaller than the median.

( ) The mean is approximately equal to the median.

( ) It is impossible to know the relationship between the mean and the
median.
( ) It is impossible to know the relationship between the mean and the median.

# BEGIN SOLUTION

# END SOLUTION

# END SUBPROB # BEGIN SUBPROB
# END SUBPROB

Assume there are 100 stages in the random sample that generated this
plot. If there are 5 stages in the bin `[275, 300)`, approximately how
many stages are in the bin `[200, 225)`?
# BEGIN SUBPROB

Assume there are 100 stages in the random sample that generated this plot. If there are 5 stages in the bin `[275, 300)`, approximately how many stages are in the bin `[200, 225)`?

# BEGIN SOLUTION

@@ -41,77 +33,39 @@ many stages are in the bin `[200, 225)`?

# BEGIN SUBPROB

Assume the mean distance is 200 km and the standard deviation is 50 km.
At least what proportion of stage distances are guaranteed to lie
between 0 km and 400 km? Do not simplify your answer.
Assume the mean distance is 200 km and the standard deviation is 50 km. At least what proportion of stage distances are guaranteed to lie between 0 km and 400 km? Do not simplify your answer.

::: responsebox
1in Using Chebyshev's inequality, we know at least $1 - \frac{1}{z^2}$
of the data lies within $z$ SDs. Here, $z = 4$ so we know
$1 - \frac{1}{16} = \frac{15}{16}$ of the data lie in that range.
:::

# BEGIN SOLUTION
**Solution:** Using Chebyshev's inequality, we know at least $1 - \frac{1}{z^2}$ of the data lies within $z$ SDs of the mean. Here, 0 km and 400 km are each $\frac{200}{50} = 4$ SDs away from the mean of 200 km, so $z = 4$ and at least $1 - \frac{1}{16} = \frac{15}{16}$ of the data lies in that range.
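
The arithmetic behind this bound can be checked with a short sketch. The helper name `chebyshev_lower_bound` is hypothetical (it is not part of the exam or any library), and it assumes the interval is symmetric about the mean, as it is here.

```python
from fractions import Fraction


def chebyshev_lower_bound(mean, sd, low, high):
    """Return Chebyshev's guaranteed minimum proportion of data in [low, high].

    Assumes the interval is symmetric about the mean, so that both
    endpoints are the same number of SDs (z) away from it.
    """
    if mean - low != high - mean:
        raise ValueError("interval must be symmetric about the mean")
    z = Fraction(high - mean, sd)  # number of SDs from the mean to an endpoint
    return 1 - 1 / z**2


bound = chebyshev_lower_bound(mean=200, sd=50, low=0, high=400)
print(bound)  # 15/16
```

Using `Fraction` keeps the result exact, which matches the "do not simplify" style of the expected answer.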

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Again, assume the mean stage distance is 200 km and the standard
deviation is 50 km. Now, suppose we take a random sample of size 25 from
the stage distances, calculate the mean stage distance of this sample,
and repeat this process 500 times. What proportion of the means that we
calculate will fall between 190 km and 210 km? Do not simplify your
answer.

::: responsebox
0.82in We know about 68% of values lie within 1 standard deviation of
the mean of any normal distribution. The distribution of means of
samples of size 25 from this dataset is normally distributed with mean
200km and SD $\frac{50}{\sqrt{25}} = 10$, so 190km to 210km contains 68%
of the values.
:::
Again, assume the mean stage distance is 200 km and the standard deviation is 50 km. Now, suppose we take a random sample of size 25 from the stage distances, calculate the mean stage distance of this sample, and repeat this process 500 times. What proportion of the means that we calculate will fall between 190 km and 210 km? Do not simplify your answer.


# BEGIN SOLUTION
**Solution:** We know about 68% of values lie within 1 standard deviation of the mean of any normal distribution. By the Central Limit Theorem, the distribution of means of samples of size 25 from this dataset is roughly normal with mean 200 km and SD $\frac{50}{\sqrt{25}} = 10$ km, so the range from 190 km to 210 km is within 1 SD of the mean and contains about 68% of the sample means.
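
The two steps in this solution (the SD of the sample mean, then the area within 1 SD under a normal curve) can be reproduced with the standard library alone. This is a sketch under the assumption that the normal approximation holds exactly; the helper `normal_proportion_between` is a hypothetical name introduced for illustration.

```python
import math


def normal_proportion_between(low, high, mean, sd):
    """Proportion of a normal(mean, sd) distribution lying in [low, high]."""
    def cdf(x):
        # Normal CDF written in terms of the error function.
        return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))
    return cdf(high) - cdf(low)


pop_mean, pop_sd, sample_size = 200, 50, 25

# SD of the distribution of sample means: population SD / sqrt(sample size).
sd_of_sample_mean = pop_sd / math.sqrt(sample_size)

proportion = normal_proportion_between(190, 210, pop_mean, sd_of_sample_mean)
print(sd_of_sample_mean)     # 10.0
print(round(proportion, 2))  # 0.68
```

With only 500 repetitions, the observed proportion of sample means in the range would vary a little around this value.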

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

(3.5 pts) Assume the mean distance is 200 km and the standard deviation
is 50 km. Suppose we use the Central Limit Theorem to generate a 95%
confidence interval for the true mean distance of all Tour de France
stages, and get the interval $[190\text{ km}, 210\text{ km}]$. Which of
the following interpretations of this confidence interval are correct?

[ ] 95% of Tour de France stage distances fall between 190 km and 210
km.

[ ] There is a 95% chance that the true mean distance of all Tour de
France stages is\
between 190 km and 210 km.

[ ] We are 95% confident that the true mean distance of all Tour de
France stages is\
between 190 km and 210 km.
Assume the mean distance is 200 km and the standard deviation is 50 km. Suppose we use the Central Limit Theorem to generate a 95% confidence interval for the true mean distance of all Tour de France stages, and get the interval $[190\text{ km}, 210\text{ km}]$. Which of the following interpretations of this confidence interval are correct?

[ ] 95% of Tour de France stage distances fall between 190 km and 210 km.
[ ] There is a 95% chance that the true mean distance of all Tour de France stages is between 190 km and 210 km.
[ ] We are 95% confident that the true mean distance of all Tour de France stages is between 190 km and 210 km.
[ ] Our sample is of size 100.

[ ] Our sample is of size 25.

[ ] If we collected many original samples and constructed many 95%
confidence inter-\
vals, then exactly 95% of those intervals would contain the true mean
distance.

[ ] If we collected many original samples and constructed many 95%
confidence inter-\
vals, then roughly 95% of those intervals would contain the true mean
distance.
[ ] If we collected many original samples and constructed many 95% confidence intervals, then exactly 95% of those intervals would contain the true mean distance.
[ ] If we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of those intervals would contain the true mean distance.

# BEGIN SOLUTION

@@ -121,28 +75,37 @@ distance.

# BEGIN SUBPROB

Suppose we take 500 random samples of size 100 from the stage distances,
calculate their means, and draw a histogram of the distribution of these
sample means. We label this Histogram A. Then, we take 500 random
samples of size 1000 from the stage distances, calculate their means,
and draw a histogram of the distribution of these sample means. We label
this Histogram B. Fill in the blanks so that the sentence below
correctly describes how Histogram B looks in comparison to Histogram A.
Suppose we take 500 random samples of size 100 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram A. Then, we take 500 random samples of size 1000 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram B. Fill in the blanks so that the sentence below correctly describes how Histogram B looks in comparison to Histogram A.

"Relative to Histogram A, Histogram B would appear \_\_(i)\_\_ and shifted \_\_(ii)\_\_ due to the \_\_(iii)\_\_ mean and the \_\_(iv)\_\_ standard deviation."

(i):

( ) thinner
( ) wider
( ) the same width
( ) unknown

(ii):

::: center
"Relative to Histogram A, Histogram B would appear [   (i)
  ]{.underline} and shifted [   (ii)   ]{.underline} due to the [
  (iii)   ]{.underline} mean and the [   (iv)   ]{.underline} standard
deviation.\"
:::
( ) left
( ) right
( ) not at all
( ) unknown

(i): ( ) thinner ( ) wider ( ) the same width ( ) unknown
(iii):

(ii): ( ) left ( ) right ( ) not at all ( ) unknown
( ) larger
( ) smaller
( ) unchanged
( ) unknown

(iii): ( ) larger ( ) smaller ( ) unchanged ( ) unknown
(iv):

(iv): ( ) larger ( ) smaller ( ) unchanged ( ) unknown
( ) larger
( ) smaller
( ) unchanged
( ) unknown

# BEGIN SOLUTION
