<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="generator" content="pandoc">
    <title>Software Carpentry: R for reproducible scientific analysis</title>
    <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
    <link rel="stylesheet" type="text/css" href="css/swc.css" />
    <link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body class="lesson">
    <div class="container card">
      <div class="banner">
        <a href="http://software-carpentry.org" title="Software Carpentry">
          <img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
        </a>
      </div>
      <article>
      <div class="row">
        <div class="col-md-10 col-md-offset-1">
                    <a href="index.html"><h1 class="title">R for reproducible scientific analysis</h1></a>
          <h2 class="subtitle">Tidy Data</h2>
          <section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Understand what it means for data to be tidy
<ul>
<li>Each variable forms a column.</li>
<li>Each observation forms a row.</li>
<li>Each type of observational unit forms a table.</li>
</ul></li>
<li>Be able to use <code>tidyr::gather</code> to make “wide” data “long”</li>
</ul>
</div>
</section>
<h3 id="tidy-data">Tidy data</h3>
<p>Data can be organized in many ways. While some situations call for other organizational schemes, the tidy format is the standard and generally best way to organize data:</p>
<ol style="list-style-type: decimal">
<li>Each variable forms a column</li>
<li>Each observation forms a row</li>
<li>Each type of observational unit forms a table</li>
</ol>
<p>What exactly constitutes a variable can be difficult to define out of context, but as a general rule, if observations are measured in different units, they should be in different columns, and if they are measured in the same units, there is a good chance they should be in the same column.</p>
<p>An example will clarify. Download <a href="https://raw.githubusercontent.com/michaellevy/gapminder-R/gh-pages/data/wide_eg.csv">this fake data</a> on blood readings under several medical treatments. Save it to your data directory and read it into R. Is this data in tidy format? If not, which of the three principles does it violate?</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">blood &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&#39;data/wide_eg.csv&#39;</span>)
blood</code></pre></div>
<pre class="output"><code># A tibble: 14 × 5
   `Blood albumin levels in four individuals (g / dL)`    X2      X3
                                                 &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;
1                                                 &lt;NA&gt;  &lt;NA&gt;    &lt;NA&gt;
2                                              subject   sex control
3                                                    1     M    4.49
4                                                    2     F    5.98
5                                                    3     F    3.77
6                                                    4     M    2.60
7                                                    5     m    5.20
8                                                    6     F    4.22
9                                                    7     F    2.25
10                                                   8     F    3.76
11                                                   9     M    4.93
12                                                  10     M    5.66
13                                                  11     F    3.30
14                                                  12     M    5.90
# ... with 2 more variables: X4 &lt;chr&gt;, X5 &lt;chr&gt;
</code></pre>
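<p>Hmm, that doesn’t look quite right. What do you think happened? How could we fix it? The answer is in the <code>?read_csv</code> help file: the file starts with a title line and an empty row before the real column names, so we use the <code>skip</code> argument to skip the first two lines.</p>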
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">blood &lt;-<span class="st"> </span><span class="kw">read_csv</span>(<span class="st">&#39;data/wide_eg.csv&#39;</span>, <span class="dt">skip =</span> <span class="dv">2</span>)
blood</code></pre></div>
<pre class="output"><code># A tibble: 12 × 5
   subject   sex control treatment1 treatment2
     &lt;int&gt; &lt;chr&gt;   &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;
1        1     M    4.49       1.61       3.03
2        2     F    5.98       3.40       3.79
3        3     F    3.77       1.92       4.66
4        4     M    2.60       4.72       7.48
5        5     m    5.20       1.76       9.77
6        6     F    4.22       4.41       7.23
7        7     F    2.25       6.55       2.51
8        8     F    3.76       1.92       9.60
9        9     M    4.93       2.73       7.26
10      10     M    5.66       6.10       9.88
11      11     F    3.30       3.92       4.47
12      12     M    5.90       1.52       2.66
</code></pre>
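<p>When a file reads in strangely, it can help to look at its first few raw lines before choosing a value for <code>skip</code>. The sketch below uses base R (<code>readLines</code> and <code>read.csv</code>) rather than <code>read_csv</code> so that it runs without extra packages. The file contents are a hypothetical stand-in for <code>wide_eg.csv</code>: the title and the first two data rows are copied from the output above, but the exact layout of the second line is an assumption.</p>

```r
# Hypothetical stand-in for the first lines of wide_eg.csv.
# The second line (all empty fields) is an assumption about the real file.
raw_lines <- c(
  "Blood albumin levels in four individuals (g / dL),,,,",
  ",,,,",
  "subject,sex,control,treatment1,treatment2",
  "1,M,4.49,1.61,3.03",
  "2,F,5.98,3.40,3.79"
)
tmp <- tempfile(fileext = ".csv")
writeLines(raw_lines, tmp)

# Look at the raw text: the real header is on the third line
head(readLines(tmp), 3)

# Skip the title line and the empty row so the third line becomes the header
blood_demo <- read.csv(tmp, skip = 2, stringsAsFactors = FALSE)
blood_demo
```

<p>The same <code>skip = 2</code> argument works in <code>read_csv</code>, as shown above.</p>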
<p>There’s one other little problem in this dataset – do you see it? One subject’s sex was entered as “m” instead of “M”. If we later try to <code>group_by</code> sex or plot by sex, we will get three groups: “F”, “M”, and “m”. To fix this, we could identify the errant entry and replace it specifically with something like <code>blood$sex[blood$sex == &quot;m&quot;] = &quot;M&quot;</code>, or we can simply tell R to make all the entries in that variable upper-case with the <code>toupper</code> function:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">blood$sex</code></pre></div>
<pre class="output"><code> [1] &quot;M&quot; &quot;F&quot; &quot;F&quot; &quot;M&quot; &quot;m&quot; &quot;F&quot; &quot;F&quot; &quot;F&quot; &quot;M&quot; &quot;M&quot; &quot;F&quot; &quot;M&quot;
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">toupper</span>(blood$sex)</code></pre></div>
<pre class="output"><code> [1] &quot;M&quot; &quot;F&quot; &quot;F&quot; &quot;M&quot; &quot;M&quot; &quot;F&quot; &quot;F&quot; &quot;F&quot; &quot;M&quot; &quot;M&quot; &quot;F&quot; &quot;M&quot;
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">blood$sex =<span class="st"> </span><span class="kw">toupper</span>(blood$sex)
<span class="co"># Or, using dplyr:</span>
blood =<span class="st"> </span><span class="kw">mutate</span>(blood, <span class="dt">sex =</span> <span class="kw">toupper</span>(sex))
<span class="co"># Note that those accomplish the exact same thing in very different ways. Studying the difference may be useful. </span>
blood</code></pre></div>
<pre class="output"><code># A tibble: 12 × 5
   subject   sex control treatment1 treatment2
     &lt;int&gt; &lt;chr&gt;   &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;
1        1     M    4.49       1.61       3.03
2        2     F    5.98       3.40       3.79
3        3     F    3.77       1.92       4.66
4        4     M    2.60       4.72       7.48
5        5     M    5.20       1.76       9.77
6        6     F    4.22       4.41       7.23
7        7     F    2.25       6.55       2.51
8        8     F    3.76       1.92       9.60
9        9     M    4.93       2.73       7.26
10      10     M    5.66       6.10       9.88
11      11     F    3.30       3.92       4.47
12      12     M    5.90       1.52       2.66
</code></pre>
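<p>In a larger dataset an inconsistent entry like this is hard to spot by eye. One possible check, sketched here in base R, is to tabulate the distinct values before and after the fix. The vector is copied from the <code>blood$sex</code> output above; the names <code>sex</code> and <code>sex_fixed</code> are only for illustration.</p>

```r
# Values copied from the blood$sex output above
sex <- c("M", "F", "F", "M", "m", "F", "F", "F", "M", "M", "F", "M")

# Before the fix: three distinct values, including the stray "m"
table(sex)

# After the fix: only "F" and "M" remain
sex_fixed <- toupper(sex)
table(sex_fixed)
```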
<p>It looks like we’ve got 12 individuals, each subjected to three conditions – a control and two treatments. Each observation here is a person in a condition (the file’s title line tells us the measured value is blood albumin, in g/dL), so each row should be defined by a person-condition pair; that is, we should have 36 rows (12 subjects × 3 conditions) with four columns: subject, sex, condition, and the measured value.</p>
<h4 id="gather"><code>gather()</code></h4>
<p>A typical analysis of data like these consists of calculating means and standard deviations by condition and sex. It is possible to do this with the data in their current form, but it will be much easier if we tidy the data first.</p>
<p>We can transform the data to tidy form with the <code>gather</code> function from the <code>tidyr</code> package, which of course is part of the <code>tidyverse</code>.</p>
<p>The arguments to <code>gather</code> include the data.frame you’re gathering, which columns to gather, and names for two columns in the new data.frame: the key and the value. The key will consist of the old names of gathered columns, and the value will consist of the entries in those columns. The order of arguments is data.frame, key, value, columns-to-gather:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">gather</span>(blood, <span class="dt">key =</span> condition, <span class="dt">value =</span> albumin,
       control, treatment1, treatment2)</code></pre></div>
<pre class="output"><code># A tibble: 36 × 4
   subject   sex condition albumin
     &lt;int&gt; &lt;chr&gt;     &lt;chr&gt;   &lt;dbl&gt;
1        1     M   control    4.49
2        2     F   control    5.98
3        3     F   control    3.77
4        4     M   control    2.60
5        5     M   control    5.20
6        6     F   control    4.22
7        7     F   control    2.25
8        8     F   control    3.76
9        9     M   control    4.93
10      10     M   control    5.66
# ... with 26 more rows
</code></pre>
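<p>To make the key and value idea concrete, here is a rough base R sketch of the same reshaping done by hand on the first three subjects. This is not how <code>gather</code> is implemented; it only illustrates the result. The names <code>blood_small</code>, <code>gather_cols</code>, <code>pieces</code>, and <code>blood_long</code> are made up for this example.</p>

```r
# First three subjects, values copied from the output above
blood_small <- data.frame(
  subject    = 1:3,
  sex        = c("M", "F", "F"),
  control    = c(4.49, 5.98, 3.77),
  treatment1 = c(1.61, 3.40, 1.92),
  treatment2 = c(3.03, 3.79, 4.66),
  stringsAsFactors = FALSE
)

gather_cols <- c("control", "treatment1", "treatment2")

# For each gathered column, build a block of rows: the ID columns,
# the old column name (the key), and that column's entries (the value)
pieces <- lapply(gather_cols, function(col) {
  data.frame(
    subject   = blood_small$subject,
    sex       = blood_small$sex,
    condition = col,
    albumin   = blood_small[[col]],
    stringsAsFactors = FALSE
  )
})

# Stack the blocks: 3 subjects x 3 conditions = 9 rows, 4 columns
blood_long <- do.call(rbind, pieces)
blood_long
```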
<p>You can also tell <code>gather</code> which columns <em>not</em> to gather – these are the “ID variables”; that is, they identify the unit of analysis on each row.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">blood.tidy &lt;-<span class="st"> </span><span class="kw">gather</span>(blood, <span class="dt">key =</span> condition, <span class="dt">value =</span> albumin,
                     -subject, -sex)</code></pre></div>
<p>Note that the variables with <code>-</code> in front of them aren’t removed from the data.frame. Instead, they are kept as ID columns, and we get one row for each combination of those variables and a gathered column.</p>
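<p>If you are using tidyr 1.0.0 or later, <code>gather</code> still works but has been superseded by <code>pivot_longer</code>. The sketch below assumes such a version is installed and shows a roughly equivalent call on the first three subjects; <code>blood_small</code> and <code>blood_long</code> are illustrative names, not objects from the lesson.</p>

```r
library(tidyr)

# First three subjects, values copied from the output above
blood_small <- data.frame(
  subject    = 1:3,
  sex        = c("M", "F", "F"),
  control    = c(4.49, 5.98, 3.77),
  treatment1 = c(1.61, 3.40, 1.92),
  treatment2 = c(3.03, 3.79, 4.66),
  stringsAsFactors = FALSE
)

# names_to plays the role of gather's key, values_to the role of value
blood_long <- pivot_longer(
  blood_small,
  cols      = c(control, treatment1, treatment2),
  names_to  = "condition",
  values_to = "albumin"
)
blood_long
```

<p>The rows come out in a different order than with <code>gather</code>: <code>pivot_longer</code> keeps each subject’s rows together, while <code>gather</code> stacks the data one gathered column at a time.</p>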
<h3 id="packages-in-the-tidyverse-expect-tidy-data">Packages in the tidyverse expect tidy data</h3>
<p>Now that the <code>blood</code> data is in tidy form we can easily summarize and plot it using <em>dplyr</em> and <em>ggplot2</em>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summarize</span>(<span class="kw">group_by</span>(blood.tidy, sex, condition),
          <span class="kw">mean</span>(albumin),
          <span class="kw">sd</span>(albumin))</code></pre></div>
<pre class="output"><code>Source: local data frame [6 x 4]
Groups: sex [?]

    sex  condition `mean(albumin)` `sd(albumin)`
  &lt;chr&gt;      &lt;chr&gt;           &lt;dbl&gt;         &lt;dbl&gt;
1     F    control        3.880000      1.228446
2     F treatment1        3.686667      1.737857
3     F treatment2        5.376667      2.582337
4     M    control        4.796667      1.188489
5     M treatment1        3.073333      1.911499
6     M treatment2        6.680000      3.170091
</code></pre>
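<p>The column names in that output, such as <code>mean(albumin)</code>, are awkward to work with later. One option is to name the summaries and write the steps with the pipe, as in this sketch. It assumes <em>dplyr</em> is installed and uses a small hypothetical subset (subjects 1 to 4, control and treatment1 only) so that it runs on its own; <code>blood_tidy_small</code> and <code>blood_summary</code> are illustrative names.</p>

```r
library(dplyr)

# Subjects 1-4 under two conditions, values copied from the tables above
blood_tidy_small <- data.frame(
  subject   = rep(1:4, times = 2),
  sex       = rep(c("M", "F", "F", "M"), times = 2),
  condition = rep(c("control", "treatment1"), each = 4),
  albumin   = c(4.49, 5.98, 3.77, 2.60, 1.61, 3.40, 1.92, 4.72),
  stringsAsFactors = FALSE
)

# Named summaries give clean column names in the result
blood_summary <- blood_tidy_small %>%
  group_by(sex, condition) %>%
  summarize(mean_albumin = mean(albumin),
            sd_albumin   = sd(albumin))
blood_summary
```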
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(blood.tidy, <span class="kw">aes</span>(<span class="dt">x =</span> condition, <span class="dt">y =</span> albumin, <span class="dt">color =</span> sex)) +
<span class="st">          </span><span class="kw">geom_violin</span>()</code></pre></div>
<p><img src="fig/unnamed-chunk-6-1.png" title="Violin plot of albumin by condition and sex" alt="Violin plot of albumin by condition, colored by sex" style="display: block; margin: auto;" /></p>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h4 id="challenge-gather-and-plot"><span class="glyphicon glyphicon-pencil"></span>Challenge – Gather and plot</h4>
</div>
<div class="panel-body">
<p>The following code produces a data.frame with the annual relative standard deviation (RSD) of GDP among countries, both for per-capita GDP and for country-total GDP. Run the code. Is the resulting dataset in tidy form?</p>
<pre><code>gapminder %&gt;%
    mutate(gdpCountry = gdpPercap * pop) %&gt;%
    group_by(year) %&gt;%
    summarize(RSD_individual = sd(gdpPercap) / mean(gdpPercap),
              RSD_country = sd(gdpCountry) / mean(gdpCountry))</code></pre>
<p>You could argue that it is or is not in tidy form, because you could see the two outcomes as different variables or two ways of measuring the same variable. For our purposes, consider them two ways of measuring the same variable and <code>gather</code> the data.frame so that there is only one measurement of RSD on each row.</p>
<p>Make a plot with two lines, one for each measure of RSD in GDP, by year. To make the plot black-and-white-printer friendly, distinguish the lines using the <code>linetype</code> <strong>aes</strong>thetic. Could you have made this plot without tidying the data? Why or why not?</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h4 id="challenge-more-practice-with-gather"><span class="glyphicon glyphicon-pencil"></span>Challenge – More practice with gather</h4>
</div>
<div class="panel-body">
<p>Download <a href="https://raw.githubusercontent.com/michaellevy/gapminder-R/gh-pages/data/stocks.tsv">this dataset</a> of closing prices for several restaurant stocks over a year.</p>
<ul>
<li>Is the data tidy?</li>
<li>If not, tidy it</li>
<li>Which stock performed the best over this year? Use any method you want to answer that question.</li>
</ul>
</div>
</section>
        </div>
      </div>
      </article>
      <div class="footer">
        <a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
        <a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
        <a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
        <a class="label swc-blue-bg" href="LICENSE.html">License</a>
      </div>
    </div>
    <!-- Javascript placed at the end of the document so the pages load faster -->
    <script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
    <script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
  </body>
</html>