su24final 6-9 sols

dsc-courses · Nov 27, 2024 · c85c35e
1 parent 8291ca7
Showing 12 changed files with 1,825 additions and 367 deletions.
diff --git a/docs/assets/images/su24-final/tour_df.png b/docs/assets/images/su24-final/tour_df.png
diff --git a/docs/su24-final/index.html b/docs/su24-final/index.html
diff --git a/problems/su24-final/data-info.md b/problems/su24-final/data-info.md
@@ -6,6 +6,7 @@ across all of the individual `stages` combined. Each row represents one stage of
 equivalently, one day of racing). This dataset will be called `stages`.
 
 The columns of `stages` are as follows:
+
 - `"Stage" (int):` The stage number for the respective year.
 - `"Date" (str):` The day that the stage took place, formatted as "YYYY-MM-DD".
 - `"Distance" (float):` The distance of the stage in kilometers.

diff --git a/problems/su24-final/q01.md b/problems/su24-final/q01.md
@@ -1,19 +1,13 @@
 # BEGIN PROB
 
-\[(23 pts)\]
-
 # BEGIN SUBPROB
 
 Fill in the blanks so that the expression below evaluates to the
 *proportion* of stages won by the country with the most stage wins.
 
+```py
     stages.groupby(__(i)__).__(ii)__.get("Type").__(iii)__ / stages.shape[0]
-
-`(i)` :
-
-`(ii)` :
-
-`(iii)` :
+```
 
 # BEGIN SOLUTION
 
@@ -23,56 +17,36 @@ Fill in the blanks so that the expression below evaluates to the
 
 # BEGIN SUBPROB
 
-The distance of a stage alone does not encapsulate its difficulty, as
-riders feel more tired as the tour goes on. Because of this, we want to
-consider "real distance,\" a measurement of the length of a stage that
-takes into account how far into the tour the riders are. The "real
-distance\" is calculated with the following process:
+The distance of a stage alone does not encapsulate its difficulty, as riders feel more tired as the tour goes on. Because of this, we want to consider "real distance," a measurement of the length of a stage that takes into account how far into the tour the riders are. The "real distance" is calculated with the following process:
 
 (i) Add one to the stage number.
 
 (ii) Take the square root of the result of (i).
 
 (iii) Multiply the result of (ii) by the raw distance of the stage.
 
-Complete the implementation of the function `real_distance`, which takes
-in `stages` (a DataFrame), `stage` (a string, the name of the column
-containing stage numbers), and `distance` (a string, the name of the
-column containing stage distances). `real_distance` returns a Series
-containing all of the "real distances\" of the stages, as calculated
-above.
-
-        def real_distance(stages, stage, distance):
-            ________
+Complete the implementation of the function `real_distance`, which takes in `stages` (a DataFrame), `stage` (a string, the name of the column containing stage numbers), and `distance` (a string, the name of the column containing stage distances). `real_distance` returns a Series containing all of the "real distances" of the stages, as calculated above.
 
-::: responsebox
-1in `return stages.get(distance) * np.sqrt(stages.get(stage) + 1)`
-:::
+```py
+    def real_distance(stages, stage, distance):
+         ________
+```
 
 # BEGIN SOLUTION
+**Solution:** `return stages.get(distance) * np.sqrt(stages.get(stage) + 1)`
 
 # END SOLUTION
 
 # END SUBPROB
 
 # BEGIN SUBPROB
 
-Sometimes, stages are repeated in different editions of the Tour de
-France, meaning that there are some pairs of `"Origin"` and
-`"Destination"` that appear more than once in `stages`. Fill in the
-blanks so that the expression below evaluates how often the most common
-`"Origin"` and `"Destination"` pair in the `stages` DataFrame appears.
+Sometimes, stages are repeated in different editions of the Tour de France, meaning that there are some pairs of `"Origin"` and `"Destination"` that appear more than once in `stages`. Fill in the blanks so that the expression below evaluates to the number of times the most common `"Origin"` and `"Destination"` pair appears in the `stages` DataFrame.
 
-``` {xleftmargin="-1.5cm"}
+```py
 stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(iii)__]
 ```
 
-`(i)` :
-
-`(ii)` :
-
-`(iii)` :
-
 # BEGIN SOLUTION
 
 # END SOLUTION
@@ -81,18 +55,13 @@ stages.groupby(__(i)__).__(ii)__.sort_values(by = "Date").get("Type").iloc[__(ii
 
 # BEGIN SUBPROB
 
-Fill in the blanks so that the value of `mystery_three` is the
-`"Destination"` of the longest stage before Stage 12.
+Fill in the blanks so that the value of `mystery_three` is the `"Destination"` of the longest stage before Stage 12.
 
+```py
     mystery = stages[stages.get(__(i)__) < 12]
     mystery_two = mystery.sort_values(by = "Distance", ascending = __(ii)__)
     mystery_three = mystery_two.get(__(iii)__).iloc[-1]
-
-`(i)` :
-
-`(ii)` :
-
-`(iii)` :
+```
 
 # BEGIN SOLUTION
 

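As a quick sanity check of the `real_distance` solution shown in `q01.md` above, the sketch below runs it on a small made-up DataFrame. It assumes plain `pandas` as a stand-in for the course's DataFrame library, and the `toy` data and its values are hypothetical, not taken from the real `stages` dataset.

```python
import numpy as np
import pandas as pd

def real_distance(stages, stage, distance):
    # Steps (i)-(iii): raw distance times sqrt(stage number + 1).
    return stages.get(distance) * np.sqrt(stages.get(stage) + 1)

# Hypothetical toy data, chosen so the square roots come out to 2, 3, and 4.
toy = pd.DataFrame({"Stage": [3, 8, 15], "Distance": [100.0, 150.0, 200.0]})

real = real_distance(toy, "Stage", "Distance")
print(real.tolist())  # [200.0, 450.0, 800.0]
```

Because the arithmetic is applied to whole columns at once, the function returns a Series with one "real distance" per row, as the problem asks.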
diff --git a/problems/su24-final/q02.md b/problems/su24-final/q02.md
@@ -1,8 +1,8 @@
 # BEGIN PROB
 
-Suppose we run the following code to simulate the winners of the Tour de
-France.\
+Suppose we run the following code to simulate the winners of the Tour de France.
 
+```py
     evenepoel_wins = 0
     vingegaard_wins = 0
     pogacar_wins = 0
@@ -14,14 +14,11 @@ France.\
             vingegaard_wins = vingegaard_wins + 1
         elif result[2] == 1:
             pogacar_wins = pogacar_wins + 1
+```
 
 # BEGIN SUBPROB
 
-What is the probability that `pogacar_wins` is equal to 4 when the code
-finishes running? Do not simplify your answer.
-
-::: center
-:::
+What is the probability that `pogacar_wins` is equal to 4 when the code finishes running? Do not simplify your answer.
 
 # BEGIN SOLUTION
 
@@ -31,17 +28,7 @@ finishes running? Do not simplify your answer.
 
 # BEGIN SUBPROB
 
-What is the probability that `evenepoel_wins` is at least 1 when the
-code finishes running? Do not simplify your answer.
-
-::: center
-:::
-
-# BEGIN SOLUTION
-
-# END SOLUTION
-
-# END SUBPROB
+What is the probability that `evenepoel_wins` is at least 1 when the code finishes running? Do not simplify your answer.
 
 # BEGIN SOLUTION
 

diff --git a/problems/su24-final/q03.md b/problems/su24-final/q03.md
@@ -1,13 +1,8 @@
 # BEGIN PROB
 
-\[(12 pts)\] We want to estimate the mean distance of Tour de France
-stages by bootstrapping 10,000 times and constructing a 90% confidence
-interval for the mean. In this question, suppose `random_stages` is a
-random sample of size 500 drawn with replacement from `stages`. Identify
-the line numbers with errors in the code below. In the adjacent box,
-point out the error by describing the mistake in less than 10 words or
-writing a code snippet (correct only the part you think is wrong). You
-may or may not need all the spaces provided below to identify errors.
+We want to estimate the mean distance of Tour de France stages by bootstrapping 10,000 times and constructing a 90% confidence interval for the mean. In this question, suppose `random_stages` is a random sample of size 500 drawn with replacement from `stages`. Identify the line numbers with errors in the code below. In the adjacent box, point out the error by describing the mistake in fewer than 10 words or writing a code snippet (correct only the part you think is wrong). You may or may not need all the spaces provided below to identify errors.
+
+```py
 
     line 1:      means = np.array([])
     line 2: 
@@ -18,18 +13,7 @@ may or may not need all the spaces provided below to identify errors.
     line 7:    
     line 8:      left_bound = np.percentile(means, 0)
     line 9:      right_bound = np.percentile(means, 90)
-
-`a) `
-
-`b) `
-
-`c) `
-
-`d) `
-
-`e) `
-
-`f) `
+```
 
 # BEGIN SOLUTION
 

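For reference alongside the `q03.md` problem above, here is a minimal sketch of a percentile bootstrap for a 90% confidence interval, written with NumPy only. It is not the exam's official solution: `sample_distances` is a hypothetical, randomly generated stand-in for the `"Distance"` column of `random_stages`, and the seed is arbitrary. The point it illustrates is that the middle 90% of the bootstrapped means runs from the 5th to the 95th percentile, and that each resample has the same size as the original sample and is drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the sample's distances: 500 made-up values in km.
sample_distances = rng.normal(loc=200, scale=50, size=500)

means = np.array([])
for _ in range(10_000):
    # Resample from the sample itself, same size, with replacement.
    resample = rng.choice(sample_distances, size=500, replace=True)
    means = np.append(means, resample.mean())

# The middle 90% leaves 5% in each tail.
left_bound = np.percentile(means, 5)
right_bound = np.percentile(means, 95)
print(left_bound, right_bound)
```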
diff --git a/problems/su24-final/q04.md b/problems/su24-final/q04.md
@@ -1,37 +1,29 @@
 # BEGIN PROB
 
-\[(16.5 pts)\]
+Below is a density histogram representing the distribution of randomly sampled stage distances.
 
-Below is a density histogram representing the distribution of randomly
-sampled stage distances.
-
-::: center
-![image](final_images/histogram.png)
-:::
+<div style="text-align: center;">
+<img src="../assets/images/su24-final/histogram.png" width="500">
+</div>
 
 # BEGIN SUBPROB
 
-Which statement below correctly describes the relationship between the
-mean and the median of the sampled stage distances?
+Which statement below correctly describes the relationship between the mean and the median of the sampled stage distances?
 
 ( ) The mean is significantly larger than the median.
-
 ( ) The mean is significantly smaller than the median.
-
 ( ) The mean is approximately equal to the median.
-
-( ) It is impossible to know the relationship between the mean and the
-median.
+( ) It is impossible to know the relationship between the mean and the median.
 
 # BEGIN SOLUTION
 
 # END SOLUTION
 
-# END SUBPROB # BEGIN SUBPROB
+# END SUBPROB 
 
-Assume there are 100 stages in the random sample that generated this
-plot. If there are 5 stages in the bin `[275, 300)`, approximately how
-many stages are in the bin `[200, 225)`?
+# BEGIN SUBPROB
+
+Assume there are 100 stages in the random sample that generated this plot. If there are 5 stages in the bin `[275, 300)`, approximately how many stages are in the bin `[200, 225)`?
 
 # BEGIN SOLUTION
 
@@ -41,77 +33,39 @@ many stages are in the bin `[200, 225)`?
 
 # BEGIN SUBPROB
 
-Assume the mean distance is 200 km and the standard deviation is 50 km.
-At least what proportion of stage distances are guaranteed to lie
-between 0 km and 400 km? Do not simplify your answer.
+Assume the mean distance is 200 km and the standard deviation is 50 km. At least what proportion of stage distances are guaranteed to lie between 0 km and 400 km? Do not simplify your answer.
 
-::: responsebox
-1in Using Chebyshev's inequality, we know at least $1 - \frac{1}{z^2}$
-of the data lies within $z$ SDs. Here, $z = 4$ so we know
-$1 - \frac{1}{16} = \frac{15}{16}$ of the data lie in that range.
-:::
 
 # BEGIN SOLUTION
+**Solution:** By Chebyshev's inequality, at least $1 - \frac{1}{z^2}$ of the data lies within $z$ SDs of the mean. Since 0 km and 400 km are each $\frac{200}{50} = 4$ SDs away from the mean of 200 km, we have $z = 4$, so at least $1 - \frac{1}{16} = \frac{15}{16}$ of the stage distances lie in that range.
 
 # END SOLUTION
 
 # END SUBPROB
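
The Chebyshev computation in the solution above can be reproduced with exact fractions. This is only an illustrative sketch; the variable names are hypothetical.

```python
from fractions import Fraction

mean, sd = 200, 50
low, high = 0, 400

# Both endpoints are the same number of SDs from the mean.
z = Fraction(high - mean, sd)
assert z == Fraction(mean - low, sd)

# Chebyshev: at least 1 - 1/z^2 of the data lies within z SDs of the mean.
chebyshev_bound = 1 - 1 / z**2
print(chebyshev_bound)  # 15/16
```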
 
 # BEGIN SUBPROB
 
-Again, assume the mean stage distance is 200 km and the standard
-deviation is 50 km. Now, suppose we take a random sample of size 25 from
-the stage distances, calculate the mean stage distance of this sample,
-and repeat this process 500 times. What proportion of the means that we
-calculate will fall between 190 km and 210 km? Do not simplify your
-answer.
-
-::: responsebox
-0.82in We know about 68% of values lie within 1 standard deviation of
-the mean of any normal distribution. The distribution of means of
-samples of size 25 from this dataset is normally distributed with mean
-200km and SD $\frac{50}{\sqrt{25}} = 10$, so 190km to 210km contains 68%
-of the values.
-:::
+Again, assume the mean stage distance is 200 km and the standard deviation is 50 km. Now, suppose we take a random sample of size 25 from the stage distances, calculate the mean stage distance of this sample, and repeat this process 500 times. What proportion of the means that we calculate will fall between 190 km and 210 km? Do not simplify your answer.
+
 
 # BEGIN SOLUTION
+**Solution:** By the Central Limit Theorem, the distribution of means of samples of size 25 from this dataset is approximately normal, with mean 200 km and SD $\frac{50}{\sqrt{25}} = 10$ km. About 68% of the values in any normal distribution lie within 1 SD of the mean, so roughly 68% of the sample means fall between 190 km and 210 km.
 
 # END SOLUTION
 
 # END SUBPROB
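
The 68% figure in the solution above can be checked with a small simulation. The sketch below assumes a hypothetical population (a gamma distribution tuned to have mean 200 km and SD 50 km), since the real stage distances are not available here. It repeats the problem's process of drawing 500 samples of size 25 and recording each sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
pop_mean, pop_sd, n, reps = 200, 50, 25, 500

# Gamma parameters chosen so the population has mean 200 and SD 50 (an assumption).
shape = (pop_mean / pop_sd) ** 2   # 16.0
scale = pop_sd ** 2 / pop_mean     # 12.5

# Each row is one sample of size 25; take the mean of every row.
samples = rng.gamma(shape, scale, size=(reps, n))
sample_means = samples.mean(axis=1)

clt_sd = pop_sd / np.sqrt(n)  # SD of the sample mean predicted by the CLT
prop_within = np.mean((sample_means >= 190) & (sample_means <= 210))

print(clt_sd, sample_means.std(), prop_within)
```

With only 500 repetitions the simulated proportion will vary from run to run, but it should land near 0.68, and the SD of the simulated means should land near 10.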
 
 # BEGIN SUBPROB
 
-(3.5 pts) Assume the mean distance is 200 km and the standard deviation
-is 50 km. Suppose we use the Central Limit Theorem to generate a 95%
-confidence interval for the true mean distance of all Tour de France
-stages, and get the interval $[190\text{ km}, 210\text{ km}]$. Which of
-the following interpretations of this confidence interval are correct?
-
-[ ] 95% of Tour de France stage distances fall between 190 km and 210
-km.
-
-[ ] There is a 95% chance that the true mean distance of all Tour de
-France stages is\
-between 190 km and 210 km.
-
-[ ] We are 95% confident that the true mean distance of all Tour de
-France stages is\
-between 190 km and 210 km.
+Assume the mean distance is 200 km and the standard deviation is 50 km. Suppose we use the Central Limit Theorem to generate a 95% confidence interval for the true mean distance of all Tour de France stages, and get the interval $[190\text{ km}, 210\text{ km}]$. Which of the following interpretations of this confidence interval are correct?
 
+[ ] 95% of Tour de France stage distances fall between 190 km and 210 km.
+[ ] There is a 95% chance that the true mean distance of all Tour de France stages is between 190 km and 210 km.
+[ ] We are 95% confident that the true mean distance of all Tour de France stages is between 190 km and 210 km.
 [ ] Our sample is of size 100.
-
 [ ] Our sample is of size 25.
-
-[ ] If we collected many original samples and constructed many 95%
-confidence inter-\
-vals, then exactly 95% of those intervals would contain the true mean
-distance.
-
-[ ] If we collected many original samples and constructed many 95%
-confidence inter-\
-vals, then roughly 95% of those intervals would contain the true mean
-distance.
+[ ] If we collected many original samples and constructed many 95% confidence intervals, then exactly 95% of those intervals would contain the true mean distance.
+[ ] If we collected many original samples and constructed many 95% confidence intervals, then roughly 95% of those intervals would contain the true mean distance.
 
 # BEGIN SOLUTION
 
@@ -121,28 +75,37 @@ distance.
 
 # BEGIN SUBPROB
 
-Suppose we take 500 random samples of size 100 from the stage distances,
-calculate their means, and draw a histogram of the distribution of these
-sample means. We label this Histogram A. Then, we take 500 random
-samples of size 1000 from the stage distances, calculate their means,
-and draw a histogram of the distribution of these sample means. We label
-this Histogram B. Fill in the blanks so that the sentence below
-correctly describes how Histogram B looks in comparison to Histogram A.
+Suppose we take 500 random samples of size 100 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram A. Then, we take 500 random samples of size 1000 from the stage distances, calculate their means, and draw a histogram of the distribution of these sample means. We label this Histogram B. Fill in the blanks so that the sentence below correctly describes how Histogram B looks in comparison to Histogram A.
+
+"Relative to Histogram A, Histogram B would appear \_\_(i)\_\_ and shifted \_\_(ii)\_\_ due to the \_\_(iii)\_\_ mean and the \_\_(iv)\_\_ standard deviation."
+
+(i): 
+
+( ) thinner 
+( ) wider 
+( ) the same width 
+( ) unknown
+
+(ii): 
 
-::: center
-"Relative to Histogram A, Histogram B would appear [   (i)
-  ]{.underline} and shifted [   (ii)   ]{.underline} due to the [
-  (iii)   ]{.underline} mean and the [   (iv)   ]{.underline} standard
-deviation.\"
-:::
+( ) left 
+( ) right 
+( ) not at all 
+( ) unknown
 
-(i): ( ) thinner ( ) wider ( ) the same width ( ) unknown
+(iii): 
 
-(ii): ( ) left ( ) right ( ) not at all ( ) unknown
+( ) larger 
+( ) smaller 
+( ) unchanged 
+( ) unknown
 
-(iii): ( ) larger ( ) smaller ( ) unchanged ( ) unknown
+(iv): 
 
-(iv): ( ) larger ( ) smaller ( ) unchanged ( ) unknown
+( ) larger 
+( ) smaller 
+( ) unchanged 
+( ) unknown
 
 # BEGIN SOLUTION