More Sugar == More Happiness ? 🧐

🍦🍨🍰🍜🍕🍔🌮🥘

Introduction

Our original data came from two CSV files, “recipes” and “ratings”, scraped from food.com. They include information about recipes posted on the website since 2008. To get the average rating per recipe, we merged the two datasets and used the groupby method. We also replaced every rating of 0 with a null value, because the lowest rating a user can give is 1. A rating of 0 means the user did not leave a star rating but interacted in some other way, such as leaving a comment or review. We then assigned the resulting average rating column back to the recipes dataset. The dataset we worked with in this project is therefore the original recipes dataset with one additional column, the average rating, generated from the ratings dataset.
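The merge described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the column names "id", "recipe_id" and "rating" are assumptions about how the two CSV files are laid out.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the two CSV files (column names are assumed).
recipes = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
ratings = pd.DataFrame({
    "recipe_id": [1, 1, 2, 2],
    "rating": [5, 0, 4, 2],
})

# A rating of 0 means "no star rating given", so treat it as missing.
ratings["rating"] = ratings["rating"].replace(0, np.nan)

# Mean rating per recipe; missing ratings are skipped by mean().
avg_rating = ratings.groupby("recipe_id")["rating"].mean().rename("avg_rating")

# A left merge keeps recipes that have no ratings at all (avg_rating is NaN).
recipes = recipes.merge(avg_rating, how="left", left_on="id", right_index=True)
```

Replacing the zeros before averaging matters: otherwise a recipe rated 5 and 0 would average 2.5 instead of 5.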

Using this dataset, we hope to answer the following question:

What is the relationship between the amount of sugar and the happiness of people?

By answering this question, we can learn more about how people feel about sweet food and whether it brings them happiness. This can help people decide whether eating sweet food more often is likely to keep them happy.

There are 83,782 rows in our dataset. The columns most relevant to our question are “nutrition”, which holds several nutritional values including sugar as a percentage of daily value (PDV), and “avg_rating”, which is the average rating of the recipe on a scale from 1 to 5.

Cleaning & Exploratory Data Analysis

Data Cleaning

To clean our data, we first expanded the “nutrition” column so that each value is in its own column, which makes it easier to access the value we want: the amount of sugar in the recipe. We also converted each of these nutritional values to float so that we can compute meaningful summaries such as their mean.
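One way to do this expansion is sketched below. It assumes that each "nutrition" entry is stored as a string that looks like a list of seven numbers, in the order of the nutrition columns shown in the cleaned table; the storage format is an assumption, not something stated in this report.

```python
import pandas as pd

# Toy rows; the string format of "nutrition" is an assumption.
recipes = pd.DataFrame({
    "name": ["brownies", "cookies"],
    "nutrition": [
        "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
        "[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",
    ],
})

nutrition_cols = [
    "calories (#)", "total fat (PDV)", "sugar (PDV)", "sodium (PDV)",
    "protein (PDV)", "saturated fat (PDV)", "carbohydrates (PDV)",
]

# Strip the brackets, split on commas into separate columns, convert to float.
expanded = (
    recipes["nutrition"]
    .str.strip("[]")
    .str.split(",", expand=True)
    .astype(float)
)
expanded.columns = nutrition_cols

# Swap the original string column for the seven numeric columns.
recipes = pd.concat([recipes.drop(columns="nutrition"), expanded], axis=1)
```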

Then we added a boolean column named "healthy" that tells us whether each recipe is healthy, based on a standard we developed using the amount of protein and carbohydrates in the recipe. For the sake of this project, we consider a recipe healthy if it is high in protein and low in carbohydrates. To quantify this, we use the average protein and carbohydrate values as the thresholds that separate low from high.
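A minimal sketch of this rule on made-up values is shown below. Whether the comparisons against the averages are strict is an assumption here, since the rule is only described in words.

```python
import pandas as pd

# Made-up values for illustration only.
recipes = pd.DataFrame({
    "protein (PDV)": [5.0, 40.0, 30.0, 10.0],
    "carbohydrates (PDV)": [20.0, 5.0, 30.0, 2.0],
})

# Thresholds: the mean of each column (21.25 and 14.25 for this toy data).
protein_mean = recipes["protein (PDV)"].mean()
carbs_mean = recipes["carbohydrates (PDV)"].mean()

# Healthy = above-average protein AND below-average carbohydrates
# (strict inequalities are an assumption).
recipes["healthy"] = (
    (recipes["protein (PDV)"] > protein_mean)
    & (recipes["carbohydrates (PDV)"] < carbs_mean)
)
```

Only the second toy recipe is flagged as healthy: it is the only one that clears both thresholds at once.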

We also want to determine whether people are happy with each recipe. We assume that a high rating means reviewers were happy with the dish they cooked from the recipe, and that a low rating means they were sad. Thus, we created a new column named "happiness" that records whether people are, on average, happy or sad about a recipe.

Next, we converted "submitted", the column that contains the date a recipe was submitted to the website, to the pandas timestamp type. This allowed us to extract the year from each date and create a column named "published_year".
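In pandas, this conversion can be sketched as follows, assuming the dates are stored as ISO-formatted strings (as they appear in the cleaned table):

```python
import pandas as pd

recipes = pd.DataFrame({"submitted": ["2008-10-27", "2011-04-11", "2012-03-06"]})

# Parse the strings into pandas timestamps (datetime64 dtype).
recipes["submitted"] = pd.to_datetime(recipes["submitted"])

# The .dt accessor exposes date parts such as the year.
recipes["published_year"] = recipes["submitted"].dt.year
```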

We also dealt with outliers in the "minutes" column, which contains the time in minutes it takes to cook the recipe. We consider a cooking time of more than 1 day unreasonable: such values are outliers that add no valuable information to our analysis, so we treat them as missing for the purpose of this project. We also treat a time of 0 minutes as missing, since a dish is unlikely to need 0 minutes of cooking.
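The rule above can be sketched like this. The name MAX_MINUTES is introduced here for illustration, and treating exactly 1440 minutes as still valid is an assumption based on the wording "over 1 day".

```python
import pandas as pd

recipes = pd.DataFrame({"minutes": [40, 0, 120, 1440, 1441]})

MAX_MINUTES = 24 * 60  # one day, in minutes

# Keep times in (0, 1440]; everything else becomes NaN (missing).
valid = (recipes["minutes"] > 0) & (recipes["minutes"] <= MAX_MINUTES)
recipes["minutes"] = recipes["minutes"].where(valid)
```

Because NaN cannot be stored in an integer column, the column becomes float after this step, which matches values such as 40.0 in the cleaned table.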

	name	id	minutes	contributor_id	submitted	n_steps	n_ingredients	avg_rating	calories (#)	total fat (PDV)	sugar (PDV)	sodium (PDV)	protein (PDV)	saturated fat (PDV)	carbohydrates (PDV)	healthy	happiness	published_year
0	1 brownies in the world best ever	333281	40.0	985201	2008-10-27	10	9	4.0	138.4	10.0	50.0	3.0	3.0	19.0	6.0	False	happy	2008
1	1 in canada chocolate chip cookies	453467	45.0	1848091	2011-04-11	12	11	5.0	595.1	46.0	211.0	22.0	13.0	51.0	26.0	False	happy	2011
2	412 broccoli casserole	306168	40.0	50969	2008-05-30	6	9	5.0	194.8	20.0	6.0	32.0	22.0	36.0	3.0	False	happy	2008
3	millionaire pound cake	286009	120.0	461724	2008-02-12	7	7	5.0	878.3	63.0	326.0	13.0	20.0	123.0	39.0	False	happy	2008
4	2000 meatloaf	475785	90.0	2202916	2012-03-06	17	13	5.0	267.0	30.0	12.0	12.0	29.0	48.0	2.0	False	happy	2012

The table above shows the first five rows of our cleaned dataframe. The columns tags, steps, description and ingredients are omitted because they contain too much text.

Univariate Analysis

We used a box plot to show the distribution of sugar (PDV). To keep the trend visible, we did not plot recipes with a sugar (PDV) of more than 200, so the recipes above that limit do not appear in the graph. In our plot, the smallest value is 0, the first quartile is 8, the median is 21, and the third quartile is 50. The gap between the median and the third quartile (29) is much larger than the gap between the first quartile and the median (13), so the distribution is right-skewed. This box plot suggests that most of the recipes have a sugar (PDV) between 0 and 115.

Bivariate Analysis

We used a scatter plot to show the relationship between average rating and sugar (PDV). As mentioned in Univariate Analysis, most of the recipes have a sugar (PDV) between 0 and 115. To keep the trend visible while still covering most of the recipes, we show only recipes with a sugar (PDV) of at most 1000. Based on the scatter plot, most of the recipes are clustered between 0 and 200 in sugar (PDV) and between 3 and 5 in average rating.

Interesting Aggregates

Our pivot table shows the average sugar (PDV) of sad recipes and happy recipes for each published year. It is interesting because in most years through 2015 the sad recipes had more sugar on average than the happy recipes (2008 and 2011 are the exceptions, by a small margin). From 2016 to 2018, the average sugar of the happy recipes spiked and exceeded that of the sad recipes.

published_year	happy	sad
2008	67.5513	67.4426
2009	66.6881	75.3277
2010	63.1394	65.3088
2011	72.996	71.4495
2012	64.9092	94.5102
2013	68.216	99.0137
2014	67.9798	89.3721
2015	58.729	154
2016	94.625	69
2017	187.347	114.643
2018	239.043	146
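A table of this shape can be produced with pandas' pivot_table. The sketch below uses made-up rows, not the project's data, and only illustrates the call:

```python
import pandas as pd

# Made-up rows for illustration only.
recipes = pd.DataFrame({
    "published_year": [2008, 2008, 2008, 2009, 2009],
    "happiness": ["happy", "happy", "sad", "happy", "sad"],
    "sugar (PDV)": [50.0, 70.0, 80.0, 20.0, 40.0],
})

# One row per year, one column per happiness label, mean sugar in each cell.
sugar_by_year = recipes.pivot_table(
    index="published_year",
    columns="happiness",
    values="sugar (PDV)",
    aggfunc="mean",
)
```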

Assessment of Missingness

NMAR Analysis

After examining which columns have missing values, we believe one column in our dataset is not missing at random (NMAR): the “description” column, which contains the description of the recipe provided by the user. NMAR means that the missingness of a value depends on the value itself rather than on other columns. Here, if a recipe is well known and most people already know what it is, the contributor has little reason to write a description. Some might argue that we can tell what the recipe is from the name column, but the name does not tell us how popular a recipe is or how many people know about it. So we still conclude that the description column is NMAR.

If we wanted to obtain additional data that could explain the missingness of the description column and make it missing at random (MAR), we would need information about how popular each recipe is and how often people cook it.

Missingness Dependency

Missingness on average_rating & published year of recipe

We created a pivot table that shows the proportion of recipes in each year for missing and non-missing ratings, then plotted it to compare the distribution of published years when the rating is missing and when it is not.

Then we ran a permutation test on the missingness of the average rating. We used the total variation distance (TVD) as our test statistic, since the published year is treated as categorical data, and set the significance level to 0.05. We repeated the shuffle 1000 times to generate an empirical distribution of the test statistic, which we present in the following graph together with the observed statistic.

The resulting p-value is 0.0, meaning that none of the 1000 simulated statistics was at least as large as the observed one. We therefore reject the null hypothesis that the published years of recipes with and without a missing average rating come from the same distribution. The difference between them is unlikely to be due to random chance alone, so the missingness of the average rating appears to depend on the published year.
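The test described above can be sketched as follows. This is an illustrative version on toy data, not the project's code: the helper names tvd_between_groups and missingness_permutation_test are hypothetical, and the toy data is deliberately extreme so that the outcome is easy to check by hand.

```python
import numpy as np
import pandas as pd


def tvd_between_groups(df, category_col, flag_col):
    """TVD between the category distributions of the two flag groups."""
    # Each column of the crosstab is one group's distribution (sums to 1).
    dist = pd.crosstab(df[category_col], df[flag_col], normalize="columns")
    return (dist.iloc[:, 0] - dist.iloc[:, 1]).abs().sum() / 2


def missingness_permutation_test(df, category_col, target_col,
                                 n_repetitions=1000, seed=0):
    rng = np.random.default_rng(seed)
    work = pd.DataFrame({
        category_col: df[category_col].to_numpy(),
        "is_missing": df[target_col].isna().to_numpy(),
    })
    observed = tvd_between_groups(work, category_col, "is_missing")
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffling the missingness flags breaks any link with the category.
        work["is_missing"] = rng.permutation(work["is_missing"].to_numpy())
        simulated[i] = tvd_between_groups(work, category_col, "is_missing")
    # Share of simulated statistics at least as large as the observed one.
    p_value = (simulated >= observed).mean()
    return observed, p_value


# Toy data: every 2008 recipe has a rating, every 2018 recipe is missing one.
toy = pd.DataFrame({
    "published_year": [2008] * 30 + [2018] * 30,
    "avg_rating": [4.5] * 30 + [np.nan] * 30,
})
observed_tvd, p_value = missingness_permutation_test(
    toy, "published_year", "avg_rating", n_repetitions=200
)
```

In the toy data the two year distributions do not overlap at all, so the observed TVD is at its maximum of 1 and a random shuffle is extremely unlikely to match it.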

Missingness on average_rating & number of ingredients

We created an overlaid histogram with box plots to show the distribution of the number of ingredients when the average rating is missing and not missing.

Hypothesis Testing

Null Hypothesis: In the population, the amounts of sugar in recipes that make people happy and in recipes that make people sad have the same distribution, and the observed differences in our samples are due to random chance.
Alternative Hypothesis: In the population, recipes that make people sad have more sugar than recipes that make people happy, on average. The observed difference in our samples cannot be explained by random chance alone.
Test Statistic: Difference in group means, computed as (mean sugar amount of sad recipes) − (mean sugar amount of happy recipes).
Significance Level: We choose 0.05 as the significance level for our hypothesis test.
Resulting p-value: 0.13
Conclusion: We fail to reject the null hypothesis that the two groups come from the same distribution. The difference in the observed data could therefore be due to random chance.

Recall that our question is “What is the relationship between the amount of sugar and the happiness of people?” To explore this relationship, we need a null hypothesis that we can test: recipes that make people sad and recipes that make people happy have the same distribution of sugar amount. In other words, whether a recipe makes people happy or sad has no relationship with its amount of sugar. Our alternative hypothesis, that sad recipes tend to have more sugar than happy recipes, is based on the mean sugar amounts of the two groups in our observed data. Since we want to test whether the two groups share the same distribution, we use a permutation test: we shuffle the happiness labels so that each recipe is assigned happy or sad at random, then recompute the statistic. Because one variable is categorical (sad or happy) and the other is numerical (sugar amount), we can use the difference between the two group means as our test statistic.

After simulating the permutation test, we get a p-value of 0.13, which is greater than 0.05. We therefore fail to reject our null hypothesis: the difference between the average sugar amounts of sad and happy recipes that we observed could be due to random chance, and it does not show that sad recipes have more sugar than happy recipes.
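A one-sided permutation test of this kind can be sketched as below. The function names are hypothetical and the six toy recipes are made up; the p-value of 0.13 reported above comes from the project's data, not from this sketch.

```python
import numpy as np
import pandas as pd


def diff_in_means(sugar, labels):
    """Mean sugar of sad recipes minus mean sugar of happy recipes."""
    means = sugar.groupby(labels).mean()
    return means["sad"] - means["happy"]


def sugar_permutation_test(df, n_repetitions=1000, seed=0):
    rng = np.random.default_rng(seed)
    sugar = df["sugar (PDV)"].reset_index(drop=True)
    labels = df["happiness"].to_numpy()
    observed = diff_in_means(sugar, labels)
    # Shuffle the happy/sad labels and recompute the statistic each time.
    simulated = np.array([
        diff_in_means(sugar, rng.permutation(labels))
        for _ in range(n_repetitions)
    ])
    # One-sided: how often is a shuffled difference at least as large?
    p_value = (simulated >= observed).mean()
    return observed, p_value


# Made-up toy data: four happy recipes and two sad ones.
toy = pd.DataFrame({
    "sugar (PDV)": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "happiness": ["happy", "happy", "happy", "happy", "sad", "sad"],
})
observed_diff, p_value = sugar_permutation_test(toy, n_repetitions=500)
```

The comparison uses >= because the alternative hypothesis is directional: only differences in favor of sad recipes having more sugar count as evidence against the null.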
