added permutation test
LaTeX rendering needs to be fixed
m1nce committed May 18, 2024
1 parent 01e8839 commit ca0b6db
Showing 89 changed files with 48,032 additions and 5 deletions.
118 changes: 118 additions & 0 deletions docs/statistical-inference/permutation.md
@@ -0,0 +1,118 @@
---
sidebar_position: 2
sidebar_label: Permutation Testing
title: Permutation Testing
---

import DataFrameComponent from '../../components/DataFrameComponent.jsx';

We use a permutation test to assess whether two samples were drawn from the same population.

---

## 1. State the question/hypothesis

Our pair of hypotheses is:
* **Null Hypothesis:** The mean weights of dogs and cats are the *same*.
* **Alternative Hypothesis:** The mean weights of dogs and cats are *different*.

Since the alternative hypothesis is of the form "A and B are different," the test statistic should measure distance and use an absolute value.

$\therefore$ Use **absolute difference in group means** as the test statistic.
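To make the statistic concrete, it can be written as a formula. This is a sketch in notation introduced here for illustration (the symbols $T$, $\bar{x}_{\text{dog}}$, and $\bar{x}_{\text{cat}}$ do not appear elsewhere on this page):

$$
T = \left| \bar{x}_{\text{dog}} - \bar{x}_{\text{cat}} \right|
$$

Here each $\bar{x}$ is the mean weight within that group. Because of the absolute value, $T \ge 0$ and swapping the two groups leaves it unchanged, so only large values of $T$ count as evidence against the null hypothesis.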

---

## 2. Query the DataFrame

Since we only want to compare the weights of cats and dogs, we keep only the relevant rows of the DataFrame (those whose `Species` is `'dog'` or `'cat'`).

```python
# Create a boolean Series that references which rows fulfill either condition.
querying_condition = (full_pets.get('Species') == 'dog') | (full_pets.get('Species') == 'cat')
# Query.
cats_dogs = full_pets[querying_condition]
# Display the first 5 rows only.
cats_dogs.take(np.arange(5))
```

<DataFrameComponent data={'{"columns":["ID","Species","Color","Weight","Age","Is_Cat","Owner_Comment"],"index":[0,1,2,3,4],"data":[["dog_001","dog","black",40.0,5.0,false," There are no bad dogs, only bad owners."],["cat_001","cat","golden",1.5,0.2,true,"My best birthday present ever!!!"],["cat_002","cat","black",15.0,9.0,true,"****All you need is love and a cat.****"],["dog_002","dog","white",80.0,2.0,false,"Love is a wet nose and a wagging tail."],["dog_003","dog","black",25.0,0.5,false,"Be the person your dog thinks you are."]]}'} />
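The query above combines two boolean Series element by element with `|`. As a rough illustration of what that mask does, here is a minimal plain-Python sketch that uses only the standard library. The `pets` list and the `'hamster'` row are made up for this example and are not part of `full_pets`.

```python
# (species, weight) pairs. The 'hamster' row is hypothetical and is only
# included to show a row being filtered out.
pets = [('dog', 40.0), ('cat', 1.5), ('hamster', 0.1), ('cat', 15.0), ('dog', 80.0)]

# One boolean per row for each condition.
is_dog = [species == 'dog' for species, _ in pets]
is_cat = [species == 'cat' for species, _ in pets]

# Element-wise "or": keep a row if either condition holds.
keep = [d or c for d, c in zip(is_dog, is_cat)]

# Keep only the rows whose mask entry is True.
cats_dogs = [row for row, k in zip(pets, keep) if k]

print(keep)       # [True, True, False, True, True]
print(cats_dogs)  # [('dog', 40.0), ('cat', 1.5), ('cat', 15.0), ('dog', 80.0)]
```

In the DataFrame version, the parentheses around each comparison matter: Python's `|` operator binds more tightly than `==`, so leaving them out changes how the expression is grouped.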

---

## 3. Create a function to calculate test statistic

The test statistic has to be recomputed for every shuffle, so we write a function that calculates it and call that function in every trial of our permutation test.

```python
def difference_in_means(cats_dogs):
    """
    Calculate the absolute difference in the mean weight of dogs and cats.
    ---
    Input:
    cats_dogs: a DataFrame containing the columns 'Species' and 'Weight'.
    ---
    Output:
    The absolute difference in the mean weight of dogs and cats.
    """
    # Mean of each numeric column, one row per species.
    means = cats_dogs.groupby('Species').mean()
    return np.abs(means.get('Weight').loc['dog'] - means.get('Weight').loc['cat'])
```

*Note: Although we could compute this directly inside the simulation loop, putting it in its own function keeps the code readable and is a good coding habit!*
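To see what the function computes without needing a DataFrame, here is a minimal standard-library sketch of the same idea. The helper name `abs_difference_in_means` and the toy weights are assumptions made for this example, not part of the course code.

```python
from statistics import mean

def abs_difference_in_means(rows):
    """rows: list of (species, weight) tuples containing only 'dog' and 'cat'."""
    dog_weights = [w for s, w in rows if s == 'dog']
    cat_weights = [w for s, w in rows if s == 'cat']
    return abs(mean(dog_weights) - mean(cat_weights))

# Made-up weights, for illustration only.
toy_rows = [('dog', 40.0), ('dog', 80.0), ('cat', 10.0), ('cat', 20.0)]
toy_statistic = abs_difference_in_means(toy_rows)
print(toy_statistic)  # 45.0  (dog mean 60.0, cat mean 15.0)
```

The order of subtraction does not matter here: the absolute value makes "dogs minus cats" and "cats minus dogs" give the same result.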

---

## 4. Simulate the permutation test

```python
n = 500  # Number of simulations.
statistics = np.array([])  # Array to keep track of the difference in means for each iteration.
for i in np.arange(n):  # Run the simulation `n` number of times.
    # 1. Shuffle the species.
    shuffled = cats_dogs.assign(Species=np.random.permutation(cats_dogs.get('Species')))

    # 2. Compute the test statistic.
    statistic = difference_in_means(shuffled)

    # 3. Save the result.
    statistics = np.append(statistics, statistic)
```

This code shuffles the species labels and recomputes the test statistic 500 times; a different reasonably large number of repetitions can be used instead. It is **crucial** to save the difference in means from every iteration of the for-loop, because the p-value is computed from the full collection of simulated statistics.
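The shuffle-compute-save loop can also be sketched with only the standard library. This is an illustrative version under stated assumptions: the six weights are made up, the seed `0` is arbitrary, and `abs_difference_in_means` is a hypothetical helper, not the `difference_in_means` function defined above.

```python
import random
from statistics import mean

def abs_difference_in_means(labels, weights):
    """Absolute difference between the mean 'dog' weight and the mean 'cat' weight."""
    dogs = [w for s, w in zip(labels, weights) if s == 'dog']
    cats = [w for s, w in zip(labels, weights) if s == 'cat']
    return abs(mean(dogs) - mean(cats))

# Made-up data, for illustration only.
labels = ['dog', 'dog', 'dog', 'cat', 'cat', 'cat']
weights = [40.0, 80.0, 25.0, 1.5, 15.0, 9.0]

rng = random.Random(0)  # Seeded so the sketch is reproducible.
n = 500                 # Number of shuffles.
simulated = []          # One test statistic per shuffle.

for _ in range(n):
    # 1. Shuffle a copy of the labels; the weights stay in place.
    shuffled = labels[:]
    rng.shuffle(shuffled)
    # 2. Compute the test statistic, and 3. save it.
    simulated.append(abs_difference_in_means(shuffled, weights))

print(len(simulated))  # 500
```

Shuffling only moves labels around, so every shuffle still has three dogs and three cats; that is what makes each simulated statistic a draw under the null hypothesis. Collecting results in a list is one design choice among several: `np.append` returns a new array on every call, but for 500 iterations either approach is fine.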

---

## 5. Conclusion

```python
observed = difference_in_means(cats_dogs)
p_value = np.count_nonzero(statistics >= observed) / n

print("The observed value of the test statistic is:", observed)
print("The p-value is:", p_value)
```
**The observed value of the test statistic is: 30.361111111111107** <br />
**The p-value is: 0.004**

Using a significance level of 0.05...

**Conclusion:**
* Under the null hypothesis, we rarely see a difference at least as large as the observed value: only 2 of the 500 simulated statistics (0.4%) reach it.
* Since the p-value of 0.004 is below 0.05, we reject the null hypothesis: the evidence suggests that the two groups do not come from the same distribution.
* Still, we cannot conclude that species causes a difference in weight because there may be other factors at play.
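The p-value calculation and the decision rule can be spelled out in a few lines. This is a minimal sketch with made-up numbers; the function name `empirical_p_value` and the variable `alpha` are introduced here for illustration.

```python
def empirical_p_value(simulated, observed):
    """Proportion of simulated statistics at least as large as the observed one."""
    at_least_as_extreme = sum(1 for s in simulated if s >= observed)
    return at_least_as_extreme / len(simulated)

# Made-up simulated statistics, for illustration only.
toy_simulated = [0.5, 1.0, 2.0, 3.0]
p = empirical_p_value(toy_simulated, observed=2.0)
print(p)  # 0.5  (2.0 and 3.0 are >= 2.0, so 2 out of 4)

alpha = 0.05        # Significance level.
reject = p < alpha  # Reject the null hypothesis only if the p-value is below alpha.
print(reject)  # False
```

The comparison uses `>=` to match the `statistics >= observed` count in the code above, so a simulated statistic exactly equal to the observed one counts toward the p-value.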

---

## 6. Extra

Let's see how our observed statistic compares to the overall simulated values!

```python
# Create the histogram.
bpd.DataFrame().assign(statistics=statistics).plot(kind='hist', bins=20, density=True, ec='w')
# Don't worry about this line - you won't need to know it for this course!
plt.axvline(x=observed, c='black', linewidth=4, label='observed difference in means')
```
![Distribution](/img/statistical-inference-plots/permutation-test.png)

From this graph, we can tell that almost none of the simulated statistics lie to the **right** of the black vertical line, which is why the p-value is so low!
16 changes: 15 additions & 1 deletion docusaurus.config.js
@@ -5,6 +5,8 @@
// See: https://docusaurus.io/docs/api/docusaurus-config

import {themes as prismThemes} from 'prism-react-renderer';
const math = require('remark-math')
const katex = require('rehype-katex')

/** @type {import('@docusaurus/types').Config} */
const config = {
@@ -20,7 +22,7 @@ const config = {

// GitHub pages deployment config.
// If you aren't using GitHub pages, you don't need these.
organizationName: 'dsc-courses', // Usually your GitHub org/user name.
organizationName: 'dsc-courses', // Usually your GitHub org/username.
projectName: 'bpd-reference', // Usually your repo name.

onBrokenLinks: 'throw',
@@ -44,6 +46,8 @@ const config = {
({
docs: {
sidebarPath: './sidebars.js',
remarkPlugins: [math],
rehypePlugins: [katex],
// Please change this to your repo.
// Remove this to remove the "edit this page" links.
editUrl:
@@ -56,6 +60,16 @@ const config = {
],
],

stylesheets: [
{
href: '/katex/katex.min.css',
type: 'text/css',
integrity:
'sha384-AfEj0r4/OFrOo5t7NnNe46zW/tFgW6x/bCJG8FqQCEo3+Aro6EYUG4+cU+KJWu/X',
crossorigin: 'anonymous',
},
],

themeConfig:
/** @type {import('@docusaurus/preset-classic').ThemeConfig} */
({