diff --git a/README.md b/README.md
index 376cb1e7..09854153 100644
--- a/README.md
+++ b/README.md
@@ -7,4 +7,35 @@ This is a guide for anyone who needs to share data with a statistician. The targ
 * Students or postdocs in scientific disciplines looking for consulting advice
 * Junior statistics students whose job it is to collate/clean data sets
 
-The goal of this guide is to ensure the most reproducible and the most
+The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls
+and sources of delay in the transition from data collection to data analysis. The Leek group works with a large
+number of collaborators and the number one source of variation in the speed to results is the status of the data
+when they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.
+
+My strong feeling is that statisticians should be able to handle the data in whatever state they arrive. It is important
+to see the raw data, understand the steps in the processing pipeline, and be able to incorporate hidden sources of
+variability in one's data analysis. On the other hand, for many data types, the processing steps are well documented
+and standardized. So the work of converting the data from raw form to directly analyzable form can be performed
+before calling on a statistician. This can dramatically speed the turnaround time, since the statistician doesn't
+have to work through all the pre-processing steps first.
+
+
+What you should deliver to the statistician
+====================
+
+For maximum speed in the analysis this is the information you should pass to a statistician:
+
+1. The raw data.
+2. A [tidy data set](http://vita.had.co.nz/papers/tidy-data.pdf)
+3. An explicit and exact recipe you used to go from 1 -> 2
+
+Let's look at each part of the data package you will transfer.
+
+
+
+What you should expect from a statistician
+====================
+
+
+
+