This repo contains project code for Getting and Cleaning Data
course given
by John Hopkins university on Coursera.
The script contains a function run.analysis()
that performs the
actual job:
- reads train and test data sets and merges them
- processes the merged data set (extract the relevant variables, adds descriptive activity names, etc.)
- writes the merged data set to
rawdata.csv
- generates the tidy data set
- writes the tidy data set to
tidydata.csv
- returns the tidy data set
If you've a Samsung data available in the current directory, just run:
source('./run_analysis.R')
run.analysis() # invoke the actual function
Otherwise, you can use download.data()
function provided in the script.
The function will download the data archive and extract it.
After that you can run run.analysis()
. The full code:
source('./run_analysis.R')
download.data() # download samsung data and unzip it
run.analysis() # invoke the actual function
If you don't want to use download.data()
, you should download the samsung data
manually, unzip it in the current directory and then run:
source('./run_analysis.R')
run.analysis() # invoke the actual function
Note: In all the examples above, I assume that the script file
run_analysis.R
resides in the current working directory. If it does not,
you should provide the correct path to the file to source it.
In order to create a raw data set, the following regular expression was used:
-(mean|std)[(]
. I.e. all variables containing -mean(
or -std(
in their
names were filtered.
Totally, the raw data set contains 68 variables:
subject
- An identifier of the subject who carried out the experiment.label
- An activity label.
Plus 66 filtered features mined as described below.
The features selected for this database come from the accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix 't' to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. Similarly, the acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using another low pass Butterworth filter with a corner frequency of 0.3 Hz.
Subsequently, the body linear acceleration and angular velocity were derived in time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). Also the magnitude of these three-dimensional signals were calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag).
Finally a Fast Fourier Transform (FFT) was applied to some of these signals producing fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag. (Note the 'f' to indicate frequency domain signals).
These signals were used to estimate variables of the feature vector for each
pattern:
-XYZ
is used to denote 3-axial signals in the X, Y and Z directions.
tBodyAcc-XYZ
tGravityAcc-XYZ
tBodyAccJerk-XYZ
tBodyGyro-XYZ
tBodyGyroJerk-XYZ
tBodyAccMag
tGravityAccMag
tBodyAccJerkMag
tBodyGyroMag
tBodyGyroJerkMag
fBodyAcc-XYZ
fBodyAccJerk-XYZ
fBodyGyro-XYZ
fBodyAccMag
fBodyAccJerkMag
fBodyGyroMag
fBodyGyroJerkMag
The set of variables that were estimated from these signals are:
mean()
: Mean valuestd()
: Standard deviation
Tidy data set contains the same variables as the raw does, but the variables were renamed according to following rules:
All lower case when possible
- the variables names were not converted to lower case, since it would make them unreadable. Instead, the variable names were converted to satisfycamlCase
rule.Descriptive (Diagnosis versus Dx)
- the variable names are descriptive, so nothing special should be done.Not duplicated
- the variable names are unique, so again nothing special had to be done.Not have underscores or dots or white spaces
- dashes and parentheses were removed from variable names.
To satisfy the requirements above, the following replacements were performed:
- Replace
-mean
withMean
- Replace
-std
withStd
- Remove characters
-()
- Replace
BodyBody
withBody
Raw data set | Tidy data set |
---|---|
subject |
subject |
label |
label |
tBodyAcc-mean()-X |
tBodyAccMeanX |
tBodyAcc-mean()-Y |
tBodyAccMeanY |
tBodyAcc-mean()-Z |
tBodyAccMeanZ |
tBodyAcc-std()-X |
tBodyAccStdX |
tBodyAcc-std()-Y |
tBodyAccStdY |
tBodyAcc-std()-Z |
tBodyAccStdZ |
tGravityAcc-mean()-X |
tGravityAccMeanX |
tGravityAcc-mean()-Y |
tGravityAccMeanY |
tGravityAcc-mean()-Z |
tGravityAccMeanZ |
tGravityAcc-std()-X |
tGravityAccStdX |
tGravityAcc-std()-Y |
tGravityAccStdY |
tGravityAcc-std()-Z |
tGravityAccStdZ |
tBodyAccJerk-mean()-X |
tBodyAccJerkMeanX |
tBodyAccJerk-mean()-Y |
tBodyAccJerkMeanY |
tBodyAccJerk-mean()-Z |
tBodyAccJerkMeanZ |
tBodyAccJerk-std()-X |
tBodyAccJerkStdX |
tBodyAccJerk-std()-Y |
tBodyAccJerkStdY |
tBodyAccJerk-std()-Z |
tBodyAccJerkStdZ |
tBodyGyro-mean()-X |
tBodyGyroMeanX |
tBodyGyro-mean()-Y |
tBodyGyroMeanY |
tBodyGyro-mean()-Z |
tBodyGyroMeanZ |
tBodyGyro-std()-X |
tBodyGyroStdX |
tBodyGyro-std()-Y |
tBodyGyroStdY |
tBodyGyro-std()-Z |
tBodyGyroStdZ |
tBodyGyroJerk-mean()-X |
tBodyGyroJerkMeanX |
tBodyGyroJerk-mean()-Y |
tBodyGyroJerkMeanY |
tBodyGyroJerk-mean()-Z |
tBodyGyroJerkMeanZ |
tBodyGyroJerk-std()-X |
tBodyGyroJerkStdX |
tBodyGyroJerk-std()-Y |
tBodyGyroJerkStdY |
tBodyGyroJerk-std()-Z |
tBodyGyroJerkStdZ |
tBodyAccMag-mean() |
tBodyAccMagMean |
tBodyAccMag-std() |
tBodyAccMagStd |
tGravityAccMag-mean() |
tGravityAccMagMean |
tGravityAccMag-std() |
tGravityAccMagStd |
tBodyAccJerkMag-mean() |
tBodyAccJerkMagMean |
tBodyAccJerkMag-std() |
tBodyAccJerkMagStd |
tBodyGyroMag-mean() |
tBodyGyroMagMean |
tBodyGyroMag-std() |
tBodyGyroMagStd |
tBodyGyroJerkMag-mean() |
tBodyGyroJerkMagMean |
tBodyGyroJerkMag-std() |
tBodyGyroJerkMagStd |
fBodyAcc-mean()-X |
fBodyAccMeanX |
fBodyAcc-mean()-Y |
fBodyAccMeanY |
fBodyAcc-mean()-Z |
fBodyAccMeanZ |
fBodyAcc-std()-X |
fBodyAccStdX |
fBodyAcc-std()-Y |
fBodyAccStdY |
fBodyAcc-std()-Z |
fBodyAccStdZ |
fBodyAccJerk-mean()-X |
fBodyAccJerkMeanX |
fBodyAccJerk-mean()-Y |
fBodyAccJerkMeanY |
fBodyAccJerk-mean()-Z |
fBodyAccJerkMeanZ |
fBodyAccJerk-std()-X |
fBodyAccJerkStdX |
fBodyAccJerk-std()-Y |
fBodyAccJerkStdY |
fBodyAccJerk-std()-Z |
fBodyAccJerkStdZ |
fBodyGyro-mean()-X |
fBodyGyroMeanX |
fBodyGyro-mean()-Y |
fBodyGyroMeanY |
fBodyGyro-mean()-Z |
fBodyGyroMeanZ |
fBodyGyro-std()-X |
fBodyGyroStdX |
fBodyGyro-std()-Y |
fBodyGyroStdY |
fBodyGyro-std()-Z |
fBodyGyroStdZ |
fBodyAccMag-mean() |
fBodyAccMagMean |
fBodyAccMag-std() |
fBodyAccMagStd |
fBodyBodyAccJerkMag-mean() |
fBodyAccJerkMagMean |
fBodyBodyAccJerkMag-std() |
fBodyAccJerkMagStd |
fBodyBodyGyroMag-mean() |
fBodyGyroMagMean |
fBodyBodyGyroMag-std() |
fBodyGyroMagStd |
fBodyBodyGyroJerkMag-mean() |
fBodyGyroJerkMagMean |
fBodyBodyGyroJerkMag-std() |
fBodyGyroJerkMagStd |
No code book is provided, since this README file contains all the required information.