su24 final base

dsc-courses · Nov 26, 2024 · 8291ca7
1 parent f3516d8
commit 8291ca7
Showing 12 changed files with 975 additions and 0 deletions.
diff --git a/assets/images/su24-final/tour_df.png b/assets/images/su24-final/tour_df.png
diff --git a/pages/exams/su24-final.yml b/pages/exams/su24-final.yml
@@ -0,0 +1,15 @@
+title: 'Summer 2024 Final Exam'
+instructors: Nishant Kheterpal
+context: This exam was administered in-person. The exam was closed-notes, except students were provided a copy of the <a href='https://drive.google.com/file/d/1ky0Np67HS2O4LO913P-ing97SJG0j27n/view'>DSC 10 Reference Sheet</a>. No calculators were allowed. Students had **3 hours** to take this exam.
+show_solution: true
+data_info: su24-final/data-info
+problems:
+  - su24-final/q01
+  - su24-final/q02
+  - su24-final/q03
+  - su24-final/q04
+  - su24-final/q05
+  - su24-final/q06
+  - su24-final/q07
+  - su24-final/q08
+  - su24-final/q09
diff --git a/problems/su24-final/data-info.md b/problems/su24-final/data-info.md
@@ -0,0 +1,25 @@
+In this exam, you’ll work with a data set representing the results of the Tour de France, a
+multi-stage, weeks-long cycling race. The Tour de France takes place over many days each
+year, and on each day, the riders compete in individual races called `stages`. Each `stage` is
+a standalone race, and the winner of the entire tour is determined by who performs the best
+across all of the individual `stages` combined. Each row represents one stage of the Tour (or
+equivalently, one day of racing). This dataset will be called `stages`.
+
+The columns of `stages` are as follows:
+- `"Stage" (int):` The stage number for the respective year.
+- `"Date" (str):` The day that the stage took place, formatted as `"YYYY-MM-DD"`.
+- `"Distance" (float):` The distance of the stage in kilometers.
+- `"Origin" (str):` The name of the city in which the stage starts.
+- `"Destination" (str):` The name of the city in which the stage ends.
+- `"Type" (str):` The type of the stage.
+- `"Winner" (str):` The name of the rider who won the stage.
+- `"Winner Country" (str):` The country that the winning rider of the stage is from.
+
+The first few rows of `stages` are shown below, though `stages` has many more rows than
+pictured.
+
+<center><img src='../assets/images/su24-final/tour_df.png' width=800></center>
+<br>
+
+Throughout this exam, we will refer to `stages` repeatedly.
+Assume that we have already run `import babypandas as bpd` and `import numpy as np`.
diff --git a/problems/su24-final/q01.md b/problems/su24-final/q01.md
@@ -0,0 +1,103 @@
+# BEGIN PROB
+
+\[(23 pts)\]
+
+# BEGIN SUBPROB
+
+Fill in the blanks so that the expression below evaluates to the
+*proportion* of stages won by the country with the most stage wins.
+
+    stages.groupby(__(i)__).__(ii)__.get("Type").__(iii)__ / stages.shape[0]
+
+`(i)` :
+
+`(ii)` :
+
+`(iii)` :
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+The distance of a stage alone does not encapsulate its difficulty, as
+riders feel more tired as the tour goes on. Because of this, we want to
+consider "real distance," a measurement of the length of a stage that
+takes into account how far into the tour the riders are. The "real
+distance" is calculated with the following process:
+
+(i) Add one to the stage number.
+
+(ii) Take the square root of the result of (i).
+
+(iii) Multiply the result of (ii) by the raw distance of the stage.
+
+Complete the implementation of the function `real_distance`, which takes
+in `stages` (a DataFrame), `stage` (a string, the name of the column
+containing stage numbers), and `distance` (a string, the name of the
+column containing stage distances). `real_distance` returns a Series
+containing all of the "real distances" of the stages, as calculated
+above.
+
+        def real_distance(stages, stage, distance):
+            ________
+
+::: responsebox
+`return stages.get(distance) * np.sqrt(stages.get(stage) + 1)`
+:::
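The three steps listed above collapse into a single vectorized expression. The sketch below is only an illustration: it uses pandas as a stand-in for babypandas (an assumption; both provide `.get` for column access), and `toy` is a made-up table, not the real `stages` data.

```python
import numpy as np
import pandas as pd  # stand-in for babypandas; an assumption for this sketch


def real_distance(stages, stage, distance):
    # (i) add one to the stage number, (ii) take the square root,
    # (iii) multiply by the raw distance of the stage.
    return stages.get(distance) * np.sqrt(stages.get(stage) + 1)


# Hypothetical toy data, chosen so the square roots come out as whole numbers.
toy = pd.DataFrame({"Stage": [3, 8, 15], "Distance": [100.0, 50.0, 10.0]})
toy_real = real_distance(toy, "Stage", "Distance")
```

Because both `np.sqrt` and `*` operate elementwise on a Series, no loop over rows is needed.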
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+Sometimes, stages are repeated in different editions of the Tour de
+France, meaning that there are some pairs of `"Origin"` and
+`"Destination"` that appear more than once in `stages`. Fill in the
+blanks so that the expression below evaluates to the number of times the
+most common `"Origin"` and `"Destination"` pair appears in `stages`.
+
+```
+stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(iii)__]
+```
+
+`(i)` :
+
+`(ii)` :
+
+`(iii)` :
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+Fill in the blanks so that the value of `mystery_three` is the
+`"Destination"` of the longest stage before Stage 12.
+
+    mystery = stages[stages.get(__(i)__) < 12]
+    mystery_two = mystery.sort_values(by = "Distance", ascending = __(ii)__)
+    mystery_three = mystery_two.get(__(iii)__).iloc[-1]
+
+`(i)` :
+
+`(ii)` :
+
+`(iii)` :
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# END PROB
diff --git a/problems/su24-final/q02.md b/problems/su24-final/q02.md
@@ -0,0 +1,52 @@
+# BEGIN PROB
+
+Suppose we run the following code to simulate the winners of the Tour de
+France.
+
+    evenepoel_wins = 0
+    vingegaard_wins = 0
+    pogacar_wins = 0
+    for i in np.arange(4):
+        result = np.random.multinomial(1, [0.3, 0.3, 0.4])
+        if result[0] == 1:
+            evenepoel_wins = evenepoel_wins + 1
+        elif result[1] == 1:
+            vingegaard_wins = vingegaard_wins + 1
+        elif result[2] == 1:
+            pogacar_wins = pogacar_wins + 1
+
+# BEGIN SUBPROB
+
+What is the probability that `pogacar_wins` is equal to 4 when the code
+finishes running? Do not simplify your answer.
+
+::: center
+:::
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+What is the probability that `evenepoel_wins` is at least 1 when the
+code finishes running? Do not simplify your answer.
+
+::: center
+:::
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# END PROB
diff --git a/problems/su24-final/q03.md b/problems/su24-final/q03.md
@@ -0,0 +1,38 @@
+# BEGIN PROB
+
+\[(12 pts)\] We want to estimate the mean distance of Tour de France
+stages by bootstrapping 10,000 times and constructing a 90% confidence
+interval for the mean. In this question, suppose `random_stages` is a
+random sample of size 500 drawn with replacement from `stages`. Identify
+the line numbers with errors in the code below. In the adjacent box,
+point out the error by describing the mistake in fewer than 10 words or
+writing a code snippet (correct only the part you think is wrong). You
+may or may not need all the spaces provided below to identify errors.
+
+    line 1:      means = np.array([])
+    line 2: 
+    line 3:      for i in 10000:
+    line 4:          resample = random_stages.sample(10000)
+    line 5:          resample_mean = resample.get("Distance").mean()
+    line 6:          np.append(means, resample_mean)
+    line 7:    
+    line 8:      left_bound = np.percentile(means, 0)
+    line 9:      right_bound = np.percentile(means, 90)
+
+`a) `
+
+`b) `
+
+`c) `
+
+`d) `
+
+`e) `
+
+`f) `
+
+# BEGIN SOLUTION
+
+# END SOLUTION
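For comparison, a bootstrap percentile interval of the kind described in this question could be sketched as follows. This is an illustration, not the official solution: pandas stands in for babypandas (an assumption), and `random_stages` here is filled with made-up distances rather than real Tour de France data.

```python
import numpy as np
import pandas as pd  # stand-in for babypandas; an assumption for this sketch

# Hypothetical sample of 500 stage distances (made-up values).
rng = np.random.default_rng(0)
random_stages = pd.DataFrame({"Distance": rng.normal(200, 50, size=500)})

means = np.array([])
for i in np.arange(10000):
    # Resample with replacement, keeping the original sample size.
    resample = random_stages.sample(random_stages.shape[0], replace=True)
    resample_mean = resample.get("Distance").mean()
    # np.append returns a new array, so the result must be reassigned.
    means = np.append(means, resample_mean)

# The middle 90% of the bootstrapped means runs from the
# 5th percentile to the 95th percentile.
left_bound = np.percentile(means, 5)
right_bound = np.percentile(means, 95)
```

The exact bounds vary from run to run because the resampling is random.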
+
+# END PROB
diff --git a/problems/su24-final/q04.md b/problems/su24-final/q04.md
@@ -0,0 +1,153 @@
+# BEGIN PROB
+
+\[(16.5 pts)\]
+
+Below is a density histogram representing the distribution of randomly
+sampled stage distances.
+
+::: center
+![image](final_images/histogram.png)
+:::
+
+# BEGIN SUBPROB
+
+Which statement below correctly describes the relationship between the
+mean and the median of the sampled stage distances?
+
+( ) The mean is significantly larger than the median.
+
+( ) The mean is significantly smaller than the median.
+
+( ) The mean is approximately equal to the median.
+
+( ) It is impossible to know the relationship between the mean and the
+median.
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+Assume there are 100 stages in the random sample that generated this
+plot. If there are 5 stages in the bin `[275, 300)`, approximately how
+many stages are in the bin `[200, 225)`?
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+Assume the mean distance is 200 km and the standard deviation is 50 km.
+At least what proportion of stage distances are guaranteed to lie
+between 0 km and 400 km? Do not simplify your answer.
+
+::: responsebox
+Using Chebyshev's inequality, we know at least $1 - \frac{1}{z^2}$
+of the data lies within $z$ SDs of the mean. Here, 0 km and 400 km are each
+$z = 4$ SDs from the mean, so at least $1 - \frac{1}{16} = \frac{15}{16}$ of the data lie in that range.
+:::
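The arithmetic behind this bound can be checked with exact fractions. The sketch below uses only the values given in the problem; the variable names are chosen for illustration.

```python
from fractions import Fraction

mean, sd = 200, 50   # values given in the problem
low, high = 0, 400   # interval asked about

# The interval is symmetric about the mean, so z is its half-width in SDs.
z = Fraction(high - mean, sd)
assert z == Fraction(mean - low, sd)

# Chebyshev's inequality: at least 1 - 1/z^2 of the data lies within z SDs.
chebyshev_bound = 1 - 1 / z**2
print(chebyshev_bound)  # 15/16
```

Chebyshev's inequality gives a guaranteed lower bound for any distribution, so the actual proportion may be higher.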
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+Again, assume the mean stage distance is 200 km and the standard
+deviation is 50 km. Now, suppose we take a random sample of size 25 from
+the stage distances, calculate the mean stage distance of this sample,
+and repeat this process 500 times. What proportion of the means that we
+calculate will fall between 190 km and 210 km? Do not simplify your
+answer.
+
+::: responsebox
+We know about 68% of values lie within 1 standard deviation of
+the mean of any normal distribution. By the Central Limit Theorem, the distribution of means of
+samples of size 25 from this dataset is approximately normal with mean
+200 km and SD $\frac{50}{\sqrt{25}} = 10$ km, so 190 km to 210 km contains about 68%
+of the sample means.
+:::
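The same reasoning can be traced numerically with the standard library alone. This is a sketch assuming the sample means follow a normal curve, as the Central Limit Theorem suggests; the variable names are illustrative.

```python
import math

pop_mean, pop_sd, n = 200, 50, 25  # values given in the problem

# SD of the distribution of the sample mean.
sd_of_mean = pop_sd / math.sqrt(n)

# Distance from the center to the endpoint 210 km, in SDs of the sample mean.
z = (210 - pop_mean) / sd_of_mean

# For a normal distribution, P(|Z| < z) = erf(z / sqrt(2)).
proportion = math.erf(z / math.sqrt(2))
```

Here `z` works out to 1, so `proportion` is the familiar "about 68%" figure for one standard deviation.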
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+(3.5 pts) Assume the mean distance is 200 km and the standard deviation
+is 50 km. Suppose we use the Central Limit Theorem to generate a 95%
+confidence interval for the true mean distance of all Tour de France
+stages, and get the interval $[190\text{ km}, 210\text{ km}]$. Which of
+the following interpretations of this confidence interval are correct?
+
+[ ] 95% of Tour de France stage distances fall between 190 km and 210
+km.
+
+[ ] There is a 95% chance that the true mean distance of all Tour de
+France stages is between 190 km and 210 km.
+
+[ ] We are 95% confident that the true mean distance of all Tour de
+France stages is between 190 km and 210 km.
+
+[ ] Our sample is of size 100.
+
+[ ] Our sample is of size 25.
+
+[ ] If we collected many original samples and constructed many 95%
+confidence intervals, then exactly 95% of those intervals would contain
+the true mean distance.
+
+[ ] If we collected many original samples and constructed many 95%
+confidence intervals, then roughly 95% of those intervals would contain
+the true mean distance.
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# BEGIN SUBPROB
+
+Suppose we take 500 random samples of size 100 from the stage distances,
+calculate their means, and draw a histogram of the distribution of these
+sample means. We label this Histogram A. Then, we take 500 random
+samples of size 1000 from the stage distances, calculate their means,
+and draw a histogram of the distribution of these sample means. We label
+this Histogram B. Fill in the blanks so that the sentence below
+correctly describes how Histogram B looks in comparison to Histogram A.
+
+::: center
+"Relative to Histogram A, Histogram B would appear
+[   (i)   ]{.underline} and shifted [   (ii)   ]{.underline} due to the
+[   (iii)   ]{.underline} mean and the [   (iv)   ]{.underline} standard
+deviation."
+:::
+
+(i): ( ) thinner ( ) wider ( ) the same width ( ) unknown
+
+(ii): ( ) left ( ) right ( ) not at all ( ) unknown
+
+(iii): ( ) larger ( ) smaller ( ) unchanged ( ) unknown
+
+(iv): ( ) larger ( ) smaller ( ) unchanged ( ) unknown
+
+# BEGIN SOLUTION
+
+# END SOLUTION
+
+# END SUBPROB
+
+# END PROB