su24 final base
pallavisprabhu authored and pallavisprabhu committed Nov 26, 2024
1 parent f3516d8 commit 8291ca7
Showing 12 changed files with 975 additions and 0 deletions.
Binary file added assets/images/su24-final/tour_df.png
15 changes: 15 additions & 0 deletions pages/exams/su24-final.yml
@@ -0,0 +1,15 @@
title: 'Summer 2024 Final Exam'
instructors: Nishant Kheterpal
context: This exam was administered in-person. The exam was closed-notes, except students were provided a copy of the <a href='https://drive.google.com/file/d/1ky0Np67HS2O4LO913P-ing97SJG0j27n/view'>DSC 10 Reference Sheet</a>. No calculators were allowed. Students had **3 hours** to take this exam.
show_solution: true
data_info: su24-final/data-info
problems:
- su24-final/q01
- su24-final/q02
- su24-final/q03
- su24-final/q04
- su24-final/q05
- su24-final/q06
- su24-final/q07
- su24-final/q08
- su24-final/q09
25 changes: 25 additions & 0 deletions problems/su24-final/data-info.md
@@ -0,0 +1,25 @@
In this exam, you’ll work with a data set representing the results of the Tour de France, a
multi-stage, weeks-long cycling race. The Tour de France takes place over many days each
year, and on each day, the riders compete in individual races called `stages`. Each `stage` is
a standalone race, and the winner of the entire tour is determined by who performs the best
across all of the individual `stages` combined. Each row represents one stage of the Tour (or
equivalently, one day of racing). This dataset will be called `stages`.

The columns of `stages` are as follows:
- `"Stage" (int):` The stage number for the respective year.
- `"Date" (str):` The day that the stage took place, formatted as "YYYY-MM-DD".
- `"Distance" (float):` The distance of the stage in kilometers.
- `"Origin" (str):` The name of the city in which the stage starts.
- `"Destination" (str):` The name of the city in which the stage ends.
- `"Type" (str):` The type of the stage.
- `"Winner" (str):` The name of the rider who won the stage.
- `"Winner Country" (str):` The country that the winning rider of the stage is from.

The first few rows of `stages` are shown below, though `stages` has many more rows than
pictured.

<center><img src='../assets/images/su24-final/tour_df.png' width=800></center>
<br>

Throughout this exam, we will refer to `stages` repeatedly.
Assume that we have already run `import babypandas as bpd` and `import numpy as np`.
103 changes: 103 additions & 0 deletions problems/su24-final/q01.md
@@ -0,0 +1,103 @@
# BEGIN PROB

\[(23 pts)\]

# BEGIN SUBPROB

Fill in the blanks so that the expression below evaluates to the
*proportion* of stages won by the country with the most stage wins.

stages.groupby(__(i)__).__(ii)__.get("Type").__(iii)__ / stages.shape[0]

`(i)` :

`(ii)` :

`(iii)` :

# BEGIN SOLUTION
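
As a rough, library-free illustration of the computation (a sketch, not the official solution), the code below counts stage wins per country with `collections.Counter` and divides the largest count by the number of stages. The country codes are made-up data. In babypandas, the same idea would be a `groupby` on `"Winner Country"` followed by `count()` and `max()`.

```python
from collections import Counter

# Hypothetical winner countries, one entry per stage (made-up data).
winner_countries = ["FRA", "BEL", "FRA", "ITA", "FRA", "BEL"]

# Count how many stages each country won.
wins_per_country = Counter(winner_countries)

# Largest win count divided by the total number of stages.
top_proportion = max(wins_per_country.values()) / len(winner_countries)
print(top_proportion)
```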

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

The distance of a stage alone does not encapsulate its difficulty, as
riders feel more tired as the tour goes on. Because of this, we want to
consider "real distance," a measurement of the length of a stage that
takes into account how far into the tour the riders are. The "real
distance" is calculated with the following process:

(i) Add one to the stage number.

(ii) Take the square root of the result of (i).

(iii) Multiply the result of (ii) by the raw distance of the stage.

Complete the implementation of the function `real_distance`, which takes
in `stages` (a DataFrame), `stage` (a string, the name of the column
containing stage numbers), and `distance` (a string, the name of the
column containing stage distances). `real_distance` returns a Series
containing all of the "real distances" of the stages, as calculated
above.

    def real_distance(stages, stage, distance):
        ________

::: responsebox
`return stages.get(distance) * np.sqrt(stages.get(stage) + 1)`
:::

# BEGIN SOLUTION
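
To check the three-step recipe by hand, here is a minimal plain-list analogue. The helper name `real_distance_list` and the example values are hypothetical, chosen so the square roots come out to whole numbers.

```python
import math

def real_distance_list(stage_numbers, distances):
    """Plain-list analogue of real_distance: distance * sqrt(stage + 1)."""
    return [d * math.sqrt(s + 1) for s, d in zip(stage_numbers, distances)]

# Made-up example: stage 3 of 100 km and stage 8 of 150 km.
example = real_distance_list([3, 8], [100.0, 150.0])
print(example)  # [200.0, 450.0]
```

Stage 3 gives a multiplier of sqrt(4) = 2, and stage 8 gives sqrt(9) = 3, which matches the printed values.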

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Sometimes, stages are repeated in different editions of the Tour de
France, meaning that there are some pairs of `"Origin"` and
`"Destination"` that appear more than once in `stages`. Fill in the
blanks so that the expression below evaluates to the number of times the
most common `"Origin"` and `"Destination"` pair appears in the `stages`
DataFrame.

``` {xleftmargin="-1.5cm"}
stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(iii)__]
```

`(i)` :

`(ii)` :

`(iii)` :

# BEGIN SOLUTION
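
The sketch below mirrors the shape of the expression with the standard library: group by both columns, count, sort ascending, and read the last entry. The route names are made-up data. It relies on the observation that after a `count()`, every remaining column holds the group size (assuming no missing values), so sorting by one column and reading another both refer to the same counts.

```python
from collections import Counter

# Hypothetical (origin, destination) pairs, one per stage (made-up data).
routes = [
    ("Pau", "Luchon"),
    ("Paris", "Caen"),
    ("Pau", "Luchon"),
    ("Nice", "Marseille"),
    ("Pau", "Luchon"),
]

# Count each pair, then sort the counts in ascending order.
counts_ascending = sorted(Counter(routes).values())

# After an ascending sort, the last position holds the largest count.
most_common_count = counts_ascending[-1]
print(most_common_count)
```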

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Fill in the blanks so that the value of `mystery_three` is the
`"Destination"` of the longest stage before Stage 12.

mystery = stages[stages.get(__(i)__) < 12]
mystery_two = mystery.sort_values(by = "Distance", ascending = __(ii)__)
mystery_three = mystery_two.get(__(iii)__).iloc[-1]

`(i)` :

`(ii)` :

`(iii)` :

# BEGIN SOLUTION
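
A small plain-Python walk-through of the same filter, sort, and select pattern may help. The rows are made-up data, and the tuple layout (stage number, distance, destination) is an assumption made only for this sketch.

```python
# Hypothetical rows: (stage number, distance in km, destination). Made-up data.
rows = [
    (3, 180.5, "Nantes"),
    (11, 210.0, "Pau"),
    (14, 230.0, "Nice"),
    (7, 195.0, "Lyon"),
]

# Keep only stages numbered below 12.
early = [row for row in rows if row[0] < 12]

# Sort by distance in ascending order, so the longest stage is last.
early_sorted = sorted(early, key=lambda row: row[1])

# Read the destination of the last (longest) remaining stage.
longest_early_destination = early_sorted[-1][2]
print(longest_early_destination)
```

Because the final step reads position `-1`, the sort direction has to put the longest stage at the end rather than the start.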

# END SOLUTION

# END SUBPROB

# END PROB
52 changes: 52 additions & 0 deletions problems/su24-final/q02.md
@@ -0,0 +1,52 @@
# BEGIN PROB

Suppose we run the following code to simulate the winners of the Tour de
France.

    evenepoel_wins = 0
    vingegaard_wins = 0
    pogacar_wins = 0
    for i in np.arange(4):
        result = np.random.multinomial(1, [0.3, 0.3, 0.4])
        if result[0] == 1:
            evenepoel_wins = evenepoel_wins + 1
        elif result[1] == 1:
            vingegaard_wins = vingegaard_wins + 1
        elif result[2] == 1:
            pogacar_wins = pogacar_wins + 1

# BEGIN SUBPROB

What is the probability that `pogacar_wins` is equal to 4 when the code
finishes running? Do not simplify your answer.

::: center
:::

# BEGIN SOLUTION
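
One way to sanity-check an answer here is to enumerate every possible sequence of four winners and add up the probabilities of the sequences that qualify. This is a sketch that assumes the four iterations are independent, which follows from each call to `np.random.multinomial` being a fresh draw.

```python
from itertools import product
from math import prod

# Win probabilities in the order used by the simulation:
# index 0 = Evenepoel, 1 = Vingegaard, 2 = Pogacar.
probs = [0.3, 0.3, 0.4]

# Enumerate every possible sequence of winners over 4 independent trials
# and add up the probability of the sequences where Pogacar wins all 4.
p_all_four = sum(
    prod(probs[w] for w in seq)
    for seq in product(range(3), repeat=4)
    if all(w == 2 for w in seq)
)
print(p_all_four)
```

Only one sequence qualifies, so the sum reduces to a single product of four identical factors.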

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

What is the probability that `evenepoel_wins` is at least 1 when the
code finishes running? Do not simplify your answer.

::: center
:::

# BEGIN SOLUTION
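
"At least 1" questions are usually easiest through the complement. The sketch below computes the probability that way and cross-checks it by brute-force enumeration, again assuming the four iterations are independent.

```python
from itertools import product
from math import prod

probs = [0.3, 0.3, 0.4]  # Evenepoel, Vingegaard, Pogacar

# Complement rule: P(at least one win) = 1 - P(no wins in 4 trials).
p_at_least_one = 1 - (1 - probs[0]) ** 4

# Cross-check by summing over every sequence that contains winner 0.
p_enumerated = sum(
    prod(probs[w] for w in seq)
    for seq in product(range(3), repeat=4)
    if 0 in seq
)
print(p_at_least_one, p_enumerated)
```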

# END SOLUTION

# END SUBPROB

# END PROB
38 changes: 38 additions & 0 deletions problems/su24-final/q03.md
@@ -0,0 +1,38 @@
# BEGIN PROB

\[(12 pts)\] We want to estimate the mean distance of Tour de France
stages by bootstrapping 10,000 times and constructing a 90% confidence
interval for the mean. In this question, suppose `random_stages` is a
random sample of size 500 drawn with replacement from `stages`. Identify
the line numbers with errors in the code below. In the adjacent box,
point out the error by describing the mistake in fewer than 10 words or
writing a code snippet (correct only the part you think is wrong). You
may or may not need all the spaces provided below to identify errors.

    line 1: means = np.array([])
    line 2:
    line 3: for i in 10000:
    line 4:     resample = random_stages.sample(10000)
    line 5:     resample_mean = resample.get("Distance").mean()
    line 6:     np.append(means, resample_mean)
    line 7:
    line 8: left_bound = np.percentile(means, 0)
    line 9: right_bound = np.percentile(means, 90)

`a) `

`b) `

`c) `

`d) `

`e) `

`f) `

# BEGIN SOLUTION
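
For comparison, here is a sketch of what a working bootstrap for a 90% interval could look like. It deliberately swaps babypandas and NumPy for the standard library so it runs on its own. The data are made up, the repetition count is reduced for speed, and the percentile lookup is a simple index-based approximation rather than `np.percentile`.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-in for the "Distance" column of random_stages (made-up values).
sample_distances = [random.uniform(100, 300) for _ in range(500)]

means = []
for _ in range(1000):  # the exam uses 10,000 repetitions; fewer here for speed
    # Resample with replacement, using the same size as the original sample.
    resample = random.choices(sample_distances, k=len(sample_distances))
    # Save the resample mean so it is available after the loop.
    means.append(sum(resample) / len(resample))

# A 90% interval keeps the middle 90%: roughly the 5th and 95th percentiles.
means.sort()
left_bound = means[int(0.05 * len(means))]
right_bound = means[int(0.95 * len(means)) - 1]
print(left_bound, right_bound)
```

The points worth comparing against the exam code are the loop range, the resample size, whether each mean is actually stored, and which two percentiles bound the middle 90%.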

# END SOLUTION

# END PROB
153 changes: 153 additions & 0 deletions problems/su24-final/q04.md
@@ -0,0 +1,153 @@
# BEGIN PROB

\[(16.5 pts)\]

Below is a density histogram representing the distribution of randomly
sampled stage distances.

::: center
![image](final_images/histogram.png)
:::

# BEGIN SUBPROB

Which statement below correctly describes the relationship between the
mean and the median of the sampled stage distances?

( ) The mean is significantly larger than the median.

( ) The mean is significantly smaller than the median.

( ) The mean is approximately equal to the median.

( ) It is impossible to know the relationship between the mean and the
median.

# BEGIN SOLUTION

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Assume there are 100 stages in the random sample that generated this
plot. If there are 5 stages in the bin `[275, 300)`, approximately how
many stages are in the bin `[200, 225)`?

# BEGIN SOLUTION

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Assume the mean distance is 200 km and the standard deviation is 50 km.
At least what proportion of stage distances are guaranteed to lie
between 0 km and 400 km? Do not simplify your answer.

::: responsebox
Using Chebyshev's inequality, we know at least $1 - \frac{1}{z^2}$
of the data lies within $z$ SDs of the mean. Here, 0 km and 400 km are
each $z = 4$ SDs from the mean, so at least
$1 - \frac{1}{16} = \frac{15}{16}$ of the data lies in that range.
:::

# BEGIN SOLUTION
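
The arithmetic behind the Chebyshev bound can be verified in a few lines. The variable names are illustrative only.

```python
mean_km = 200
sd_km = 50
low_km, high_km = 0, 400

# Number of SDs from the mean to each endpoint (the interval is symmetric).
z = (high_km - mean_km) / sd_km

# Chebyshev's bound: at least 1 - 1/z^2 of the data is within z SDs.
chebyshev_bound = 1 - 1 / z**2
print(z, chebyshev_bound)
```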

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Again, assume the mean stage distance is 200 km and the standard
deviation is 50 km. Now, suppose we take a random sample of size 25 from
the stage distances, calculate the mean stage distance of this sample,
and repeat this process 500 times. What proportion of the means that we
calculate will fall between 190 km and 210 km? Do not simplify your
answer.

::: responsebox
We know about 68% of values lie within 1 standard deviation of
the mean of any normal distribution. By the Central Limit Theorem, the
distribution of means of samples of size 25 from this dataset is roughly
normal with mean 200 km and SD $\frac{50}{\sqrt{25}} = 10$ km, so the
range from 190 km to 210 km contains about 68% of the sample means.
:::

# BEGIN SOLUTION
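
As a numeric check, the sketch below computes the SD of the sample mean and the area of a normal curve between 190 km and 210 km using `statistics.NormalDist`. It assumes the Central Limit Theorem approximation is adequate for samples of size 25.

```python
from math import sqrt
from statistics import NormalDist

population_sd = 50
sample_size = 25

# SD of the distribution of the sample mean.
sd_of_mean = population_sd / sqrt(sample_size)

# Proportion of a normal distribution between 190 and 210.
sampling_dist = NormalDist(mu=200, sigma=sd_of_mean)
proportion = sampling_dist.cdf(210) - sampling_dist.cdf(190)
print(sd_of_mean, round(proportion, 4))
```

The interval is exactly one SD of the sample mean on each side of 200 km, which is where the familiar 68% figure comes from.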

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

(3.5 pts) Assume the mean distance is 200 km and the standard deviation
is 50 km. Suppose we use the Central Limit Theorem to generate a 95%
confidence interval for the true mean distance of all Tour de France
stages, and get the interval $[190\text{ km}, 210\text{ km}]$. Which of
the following interpretations of this confidence interval are correct?

[ ] 95% of Tour de France stage distances fall between 190 km and 210
km.

[ ] There is a 95% chance that the true mean distance of all Tour de
France stages is between 190 km and 210 km.

[ ] We are 95% confident that the true mean distance of all Tour de
France stages is between 190 km and 210 km.

[ ] Our sample is of size 100.

[ ] Our sample is of size 25.

[ ] If we collected many original samples and constructed many 95%
confidence intervals, then exactly 95% of those intervals would contain
the true mean distance.

[ ] If we collected many original samples and constructed many 95%
confidence intervals, then roughly 95% of those intervals would contain
the true mean distance.

# BEGIN SOLUTION
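
The sample size implied by the interval can be backed out from its width. This sketch assumes the common convention that a 95% CLT-based interval extends 2 SDs of the sample mean on each side of the center. With the more precise multiplier 1.96, the result would differ slightly.

```python
sample_sd = 50
half_width = (210 - 190) / 2  # 10 km on each side of the center

# Assumes the convention that a 95% CLT interval spans 2 SDs of the
# sample mean on each side: half_width = 2 * sd / sqrt(n).
sqrt_n = 2 * sample_sd / half_width
n = sqrt_n**2
print(n)
```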

# END SOLUTION

# END SUBPROB

# BEGIN SUBPROB

Suppose we take 500 random samples of size 100 from the stage distances,
calculate their means, and draw a histogram of the distribution of these
sample means. We label this Histogram A. Then, we take 500 random
samples of size 1000 from the stage distances, calculate their means,
and draw a histogram of the distribution of these sample means. We label
this Histogram B. Fill in the blanks so that the sentence below
correctly describes how Histogram B looks in comparison to Histogram A.

::: center
"Relative to Histogram A, Histogram B would appear [   (i)
  ]{.underline} and shifted [   (ii)   ]{.underline} due to the [
  (iii)   ]{.underline} mean and the [   (iv)   ]{.underline} standard
deviation."
:::

(i): ( ) thinner ( ) wider ( ) the same width ( ) unknown

(ii): ( ) left ( ) right ( ) not at all ( ) unknown

(iii): ( ) larger ( ) smaller ( ) unchanged ( ) unknown

(iv): ( ) larger ( ) smaller ( ) unchanged ( ) unknown

# BEGIN SOLUTION
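
To make the comparison concrete, the sketch below computes the SD of the sample mean for both sample sizes. The population SD of 50 km is an assumption carried over from the earlier subproblems. This subproblem does not restate it, and the qualitative conclusion does not depend on its value.

```python
from math import sqrt

# Assumption: population SD of 50 km, carried over from earlier subproblems.
population_sd = 50

sd_a = population_sd / sqrt(100)   # Histogram A: samples of size 100
sd_b = population_sd / sqrt(1000)  # Histogram B: samples of size 1000

# Both sampling distributions are centered at the population mean,
# so only the spread changes with sample size.
print(sd_a, round(sd_b, 2))
```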

# END SOLUTION

# END SUBPROB

# END PROB