index.qmd

---
title: "EDS 214: Analytical Workflows and Scientific Reproducibility"
---

```{r setup, include=FALSE}
# knitr::opts_chunk$set(echo = FALSE)
# library(tidyverse)
# library(kableExtra)
# library(googlesheets4)
# library(DT)
# Learn more about creating websites with Distill at:
# https://rstudio.github.io/distill/website.html

# Learn more about publishing to GitHub Pages at:
# https://rstudio.github.io/distill/publish_website.html#github-pages

```

```{r collaborative EDS, out.width='80%', fig.align="center", fig.cap="Collaborative and Reproducible Environmental Data Science as summarized by the MEDS Students cohort 21-22, art by Allison Horst",echo=FALSE}

knitr::include_graphics("img/2021_Student_coworking_teams.jpeg")
```


## Course description

The generation and analysis of environmental data is often a complex, multi-step process that may involve the collaboration of many people. Increasingly tools that document and help to organize workflows are being used to ensure reproducibility, shareability, and transparency of the results. This course will introduce students to the conceptual organization of workflows (including code, documents, and data) as a way to conduct reproducible analyses. These concepts will be combined with the practice of various software tools and collaborative coding techniques to develop and manage multi-step analytical workflows as a team.


###  Why going reproducible ?

There are many reasons why it is essential to make your science reproducible and how the necessity of openness is a cornerstone of the integrity and efficacy of the scientific research process. Here we will also be focusing on why making your work reproducible will empower **you** to iterate quickly, integrate new information more easily to iterate quickly, scale your analysis to larger data sets, and better collaborate by receiving feedback and contributions from others, as well as enable your "future self" to reuse and build from your own work.

To make your data-driven research reproducible, it is important to develop scientific workflows that will be relying on **programming** to accomplish the necessary tasks to go from the raw data to the results (figures, new data, publications, ...) of your analysis. Scripting languages, even better open ones, such as `R` and `python`, are well-suited for scientists to develop reproducible scientific workflows. Those scripting languages provide a large ecosystem of libraries (also referred to as packages or modules) that are ready to be leveraged to conduct analysis and modeling. In this course, we will introduce how to use `R`, `git`, and `GitHub` to develop such workflows as a team.


```{r tidy-workflow, out.width='80%', fig.align="center", fig.cap="Workflow example using the `tidyverse`. Note the program box around the workflow and the iterative nature of the analytical process described. _Source: R for Data Science <https://r4ds.had.co.nz/>_",echo=FALSE}
  knitr::include_graphics(here::here("img","tidy-workflow.png"))
```

**Two points to stress about this figure**:

- Workflows are rarely linear... even less so their implementation
- Note the programming box -- yes, you'll need to code this :) 

***Workflows are developed iteratively, and one of the most helpful things you can do as a data scientist is to talk about them with your research team.***


## Teaching team

- [Julien Brun](https://brunj7.github.io){target="_blank"}, Instructor

- [Brian Lee](https://bren.ucsb.edu/people/brian-lee){target="_blank"}, TA

_MEDS Slack is the best way to communicate with us._


## Important links

-   [Course syllabus](https://docs.google.com/document/d/1ylxg20c4HsNc0C_qZV6PNO8sTjwOf5ZhAbzcemN2UtU/edit?usp=sharing){target="_blank"}
-   [Code of Conduct](code_of_conduct.html){target="_blank"}


## Predictable daily schedule

**Course dates:** Monday (2024-08-26) - Friday (2024-08-30)

EDS 214 is an intensive 1-week long graded 2-unit course. Students should plan to attend all scheduled sessions. All course requirements will be completed between 10am and 5pm PST (M - F), i.e. you are not expected to do additional work for EDS 214 outside of those hours, unless you are working with the Teaching Assistant in student hours.

### Daily schedule (subject to change):

|  **Time (PST)**   |        **Activity**                                   |
|-------------------|-------------------------------------------------------|
| 10:00am - 10:50am | *Lecture (50 min)*                                    |
| 10:50am - 11:00am | *Break   (10 min)*                                    |
| 11:00am - 12:30am | *Interactive Session 1  (90 min)*                     |
| 12:30am - 1:30pm  | *Lunch  (60 min)*                                     |
| 1:30pm - 2:00pm   | *Lecture  (30 min)*                                   |
| 2:00pm - 3:00pm   | *Interactive Session 2  (60 min)*                     |
| 3:00pm - 3:10pm   | *Break   (10 min)*                                    |
| 3:10pm - 4:00pm   | *Flex time  (50 min)*                                  |
| 4:00pm - 4:50pm   | *Q&A with TA as needed  (50 min)*                      |

## Learning objectives

The goal of EDS 214 - Analytical Workflows and Scientific Reproducibility  is to expose incoming MEDS students to "good enough" practices of scientific programming develop skills in environmental data science to produce reproducible research. By the end of the course, students should be able to: 

-   **Develop knowledge in scientific analytical workflows** to learn how to make your data-driven research reproducible. These scientific workflows rely on programming to accomplish the necessary tasks to go from the raw data to the results of your analysis (figures, new data, publications, ...). Scripting languages, even better open ones such as `R` and `python`, are well-suited for scientists to develop reproducible scientific workflows, but are not the only tools you will need to develop reproducible and collaborative workflows

-   **Learn how to code in a collaborative manner** by practicing techniques such as code review and pair programming. Become comfortable asking for and conducting code reviews using `git` and `GitHub` to create pull requests, ask for feedback from peers, and merge changes into the main repository. Practice pair programming to cement the collaborative development of reproducible analytical workflows

-   **Practice documenting code and data** in a systematic way that will enable your collaborators, including your future self, to understand and reuse your work


## Grading

The grading for this course is organized as follows:

- 50% Class participation
- 50% Group project


## Sessions (subject to change)

```{r echo=FALSE, message=FALSE, eval=FALSE}
# used for the dynamic development of the content
gs4_deauth()
sched <- read_sheet("https://docs.google.com/spreadsheets/d/1GckWJ4li6G2AyCjS8MrQSN6znym27aO9IPujdWfHc14/edit#gid=0")

sched2 <- sched %>%
  mutate(Day = glue::glue("<a href='{Day_link}' target='_blank'>{Day}</a>")) %>%
  select(-Day_link)

datatable(sched2, rownames=FALSE, escape=FALSE)

```

| Day / Session     	 | Topics                                                                                                                                                                                                       |	Interactive Sessions |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|
| Monday morning       |  [Reproducible workflows](day_1.html#introduction-to-eds-214) | [Planning things: from flow charts to pseudocode](day1-pseudocode.html)|
 Monday afternoon      |  [Coding together](day_1.html#lecture-2-coding-together)                 | [Collaborating with GitHub](day_1.html#collaborating-using-github); [Github collaboration Hands-on](https://github.com/brunj7/eds214-handson-ghcollab); [Bonus: Github conflicts](day1-git-collaboration-conflicts.html)  |
| Tuesday morning      |  [Working on a remote server](day_2.html#working-on-a-remote-server) |	[the command line and friends](day_2.html#the-command-line-interface-cli) | 
| Tuesday afternoon	   |  [Automating things with `bash`](day2-remote_server.html#rstudio-server)                 |	[bash scripting & uploading things](day2-bash-scripting.html) |
| Wednesday morning    |  [How to get data using APIs](day_3.html)                                |  [APIs hands-on](day_3.html#hands-on-apis): USGS `dataretrieval`	& `metajam` |
| Wednesday afternoon  |	[Introduction to group project](group_project.html)                                     |	[Group project](group_project.html) |
| Thursday morning     |  [Reproducible tools](day_4.html#leveraging-existing-frameworks-for-reproducibility)                 |	`targets` & `rrtools` as examples	|
| Thursday afternoon   |	[Documenting things](day_4.html#documenting-things)                                              |	Group project |
| Friday morning	     |  [Sharing things](day_5.html#scientific-products)	                                       |  Group project |
| Friday afternoon	   |  [Project presentations](day_5.html#group-projects)	                                     |  [Project presentations](day_5.html#group-projects) | 


## Course requirements

### Computing

- Minimum [MEDS device requirements](https://ucsb-meds.github.io/computer_reqs.html){target="_blank"}

- Have a ready to be used GitHub Account [https://github.com/](https://github.com/){target="_blank"}

- [MEDS servers](https://bren.zendesk.com/hc/en-us/articles/8727865050900-MEDS-Computational-Server-Guide){target="_blank"}


### Textbook

- R for Data Science, 2nd edition: <https://r4ds.hadley.nz/>
- The Practice of Reproducible Research: <http://www.practicereproducibleresearch.org/>