This repository contains the following files:
- run_analysis.R - R script to generate tidy data file from the input data
- README.md - this README file
- CodeBook.md - codebook file describing data in the tidy data file
The script generates a tidy data set from the data collected during the project "Human Activity Recognition Using Smartphones Data Set". It forms part of the "Getting and Cleaning Data" course project offered by JHU School of Public Health and delivered through Coursera.
More information about the "Human Activity Recognition Using Smartphones Data Set" project (including detailed data description) can be found here.
According to the instructions set out in the course project description, this script does the following:
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set.
- Creates a second, independent tidy data set with the average of each variable for each activity and each subject.
- Appropriately labels the data set with descriptive variable names.
Important: see also Notes of the implementation of steps 4 and 5 later in this document.
To use this script, it is assumed that:
- The system on which the script is run does not require escaping of spaces in the file names passed to R functions
- The dplyr package is installed (use install.packages("dplyr") if necessary)
To run the script:
- Download and extract UCI HAR Dataset.zip into your current working directory, preserving the subdirectory structure of the zip file
- Download the run_analysis.R script into your current working directory
- Run the script using source("run_analysis.R")
- The output file "UCItidyDataMeans.txt" (wide form) is written to the "UCI HAR Dataset" directory
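The prerequisites above can be checked from the R console before sourcing the script. The following is only a sketch: the helper name check_prerequisites is hypothetical and is not part of run_analysis.R.

```r
# Hypothetical helper (not part of run_analysis.R): reports which of the
# prerequisites listed above are met in the current working directory.
check_prerequisites <- function(data_dir = "UCI HAR Dataset") {
  c(
    dplyr_installed = requireNamespace("dplyr", quietly = TRUE),
    data_dir_found  = dir.exists(data_dir),
    script_found    = file.exists("run_analysis.R")
  )
}

# Returns a named logical vector; any FALSE entry points to a missing prerequisite.
check_prerequisites()
```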
The tidy data set is written using the write.table function with the parameter row.names = FALSE and default settings for the remaining parameters.
To read the tidy data file back into R, type:
read.table("./UCI HAR Dataset/UCItidyDataMeans.txt", header = TRUE)
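The write and read calls described above can be tried on a small example. The data frame below is made-up toy data, and a temporary file is used instead of the real output file so that nothing is overwritten.

```r
# Toy stand-in for the tidy data set (values are made up for illustration)
tidy_means <- data.frame(
  SubjectID    = c(1L, 2L),
  ActivityName = c("WALKING", "SITTING"),
  m1           = c(0.25, 0.5),
  stringsAsFactors = FALSE
)

# Write with row.names = FALSE and defaults otherwise, as described above
out_file <- tempfile(fileext = ".txt")
write.table(tidy_means, out_file, row.names = FALSE)

# Read back; header = TRUE restores the column names from the first line
read_back <- read.table(out_file, header = TRUE, stringsAsFactors = FALSE)
```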
The run_analysis.R script reads the 3 files comprising the test data (X_test.txt, subject_test.txt, y_test.txt) and the 3 files comprising the train data (X_train.txt, subject_train.txt, y_train.txt), then merges the corresponding test and train files using the rbind function. Variable names (column names) of the feature data frame (i.e. the merged X_test.txt and X_train.txt files) are taken from the features.txt file.
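A minimal sketch of this merging step on toy data follows. The small data frames stand in for the contents of the real files, and the object names and the order in which the two sets are stacked are assumptions, not taken from run_analysis.R.

```r
# Toy stand-ins for the feature tables read from X_test.txt and X_train.txt
x_test  <- data.frame(V1 = c(0.1, 0.2), V2 = c(1.1, 1.2))
x_train <- data.frame(V1 = c(0.3, 0.4, 0.5), V2 = c(1.3, 1.4, 1.5))

# Toy stand-in for the feature list: column index and feature name
features <- data.frame(
  index = 1:2,
  name  = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X"),
  stringsAsFactors = FALSE
)

# Stack the rows of both sets (stacking order is an assumption)
x_all <- rbind(x_test, x_train)

# Apply the feature names as column names
names(x_all) <- features$name
```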
Only the columns containing the mean and standard deviation of each measurement are then kept in the merged feature data frame. These columns are identified by the strings "mean" and "std" in their names, with the exception of all columns containing measurements of an angle (column names starting with the string "angle"). The rationale behind this decision is that the angle variables (columns) are not themselves mean angles between other measurements (vectors) but angles between the means of other measurements (vectors).
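The selection rule can be illustrated with a few column names of the kind found in the data set. This is a sketch only: the README does not state whether the script matches case-sensitively, so ignore.case = TRUE is an assumption made here to show the angle exclusion taking effect.

```r
# A handful of example column names
col_names <- c(
  "tBodyAcc-mean()-X",
  "tBodyAcc-std()-X",
  "tBodyAcc-max()-X",
  "fBodyAcc-meanFreq()-X",
  "angle(tBodyAccMean,gravity)"
)

# Columns whose names contain "mean" or "std" (case handling is an assumption)
has_mean_or_std <- grepl("mean|std", col_names, ignore.case = TRUE)

# Columns whose names start with "angle" are excluded
is_angle <- grepl("^angle", col_names)

kept <- col_names[has_mean_or_std & !is_angle]
```

With this rule, names such as "fBodyAcc-meanFreq()-X" are kept because they contain the string "mean".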
After the redundant columns are removed, two columns are prepended to the merged feature data frame: SubjectID, identifying the subject the given observation comes from, and ActivityName, identifying in a descriptive way which of the six activities the observation belongs to.
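A sketch of this step on toy data is given below. The README does not say how activity codes are translated to names; the lookup table here mirrors the six labels shipped with the data set, and the object names are assumptions.

```r
# Toy stand-ins for the reduced feature table, subject ids and activity codes
features_kept  <- data.frame(m1 = c(0.1, 0.2, 0.3))
subject_ids    <- c(2L, 2L, 4L)
activity_codes <- c(1L, 5L, 1L)

# Lookup table from activity code to descriptive name
activity_labels <- data.frame(
  code = 1:6,
  name = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
           "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Translate each code to its name
activity_names <- activity_labels$name[match(activity_codes, activity_labels$code)]

# Put SubjectID and ActivityName in front of the measurement columns
labelled <- data.frame(
  SubjectID    = subject_ids,
  ActivityName = activity_names,
  features_kept,
  stringsAsFactors = FALSE
)
```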
Finally, the tidy data set is prepared by calculating means of every measurement for each combination of subject and activity. The tidy data set is then saved to a file.
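The averaging step can be sketched as follows. The script itself may well use dplyr for this; the example below uses base R's aggregate instead so that it runs without extra packages, and the data are made up.

```r
# Toy labelled data: two WALKING rows for subject 1, one row for each other group
labelled <- data.frame(
  SubjectID    = c(1, 1, 1, 2),
  ActivityName = c("WALKING", "WALKING", "SITTING", "WALKING"),
  m1           = c(1, 3, 5, 7),
  stringsAsFactors = FALSE
)

# Mean of every measurement column for each subject/activity combination
tidy_means <- aggregate(. ~ SubjectID + ActivityName, data = labelled, FUN = mean)
```

The result has one row per combination of subject and activity that occurs in the data, which is the wide form mentioned above.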
WARNING!
If the file named UCItidyDataMeans.txt already exists, it is silently overwritten!
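To keep a previous result, the existing file can be copied aside before the script is run. The helper below is a hypothetical sketch and not part of run_analysis.R; the ".bak" suffix is an arbitrary choice.

```r
# Hypothetical helper: copies an existing file to "<path>.bak".
# Returns FALSE if there was nothing to back up, otherwise the result of file.copy.
backup_if_exists <- function(path) {
  if (!file.exists(path)) {
    return(FALSE)
  }
  file.copy(path, paste0(path, ".bak"), overwrite = TRUE)
}

# Usage before sourcing the script
backup_if_exists(file.path("UCI HAR Dataset", "UCItidyDataMeans.txt"))
```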
Notes of the implementation of steps 4 and 5
To reduce memory usage during script execution, intermediate data frames are removed using the rm function once they are no longer needed. Because of this, the descriptive column names are constructed from the original column names as one of the last steps in the script and are applied only to the final, tidy data set. This means that the original steps 4 and 5 are executed in reverse order.
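The ordering described here can be sketched as below. The renaming rule shown (dropping parentheses and replacing hyphens with underscores) is purely hypothetical; the actual descriptive names are those documented in CodeBook.md.

```r
# Toy intermediate object and toy final data set
merged <- data.frame(a = 1:3)
tidy   <- data.frame(x = 1)
names(tidy) <- "tBodyAcc-mean()-X"

# Intermediate data frame is no longer needed, so free it
rm(merged)

# Renaming happens last, on the final data set only (hypothetical rule)
names(tidy) <- gsub("[()]", "", names(tidy))
names(tidy) <- gsub("-", "_", names(tidy))
```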
run_analysis.R script has been developed and tested in the following environment:
- Mac OS X 10.10.5
- RStudio 0.99.467
- R 3.2.1
Rev. 1.0.2
2015.08.23