-
Notifications
You must be signed in to change notification settings - Fork 0
/
CRISPRi_Screening_AceticAcid.Rmd
2612 lines (2082 loc) · 123 KB
/
CRISPRi_Screening_AceticAcid.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "CRISPRi PHENOMICS DATA ANALYSIS"
author: "Vaskar Mukherjee"
date: "2/3/2021"
output:
html_document:
toc: true
toc_depth: 4
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# SCAN-O-MATIC PHENOMICS
We screened a CRISPR interference library consisting of >9000 Saccharomyces cerevisiae strains where >98% of all essential and respiratory growth-essential genes were targeted with multiple gRNAs. The screen was performed using the high-throughput, high-resolution scan-o-matic platform (Zackrisson et al., 2016) [link](https://doi.org/10.1534/g3.116.032342), where each strain is analyzed separately in order to generate and analyze high-resolution growth curves without the influence/competition from other strains.
## ACETIC ACID TITRATION
In an ideal library screening (with strains coming from the same background) we should be able to observe a normally distributed wide phenotypic variability under a particular test condition to pick the best and the worst performers in the library. For this purpose, we need to identify a stressor concentration (in this case acetic acid), which should be severe enough to induce large phenotypic variability but at the same time most strains should manage to grow and give us a quantitative phenotype. Therefore, Plate 7 & 8 was pre-screened at different acetic acid concentrations (0, 50mM, 75mM, 100mM, and 150mM of acetic acid) to identify **appropriate acetic acid concentration** for the whole library screening. Unfortunately, the spatial control strain at this point was BY4741, which did not growth at 150mM (BY4741 was later replaced with CC23 i.e. one of the CRISPRi control strain with a gRNA non-homologous to Saccharomyces cerevisiae genome). Therefore, to compare our results we used the absolute generation time (without any normalization for spatial bias). We assumed that the phenotypic variability due to spatial bias will be very similar within the test plates and since we will only look at the phenotypic variability within the strains at this point, it should not severely influence the final conclusion of this titration round. The raw absolute data of plate 7 and 8 is available in the **SOM_AA_TITRA** folder within the **RAW_DATA** folder. The files are organized in the following specific format
phenotypes.Absolute.Atc7.5_aa[**acetic acid concentration in mM**]_p[**plate number**].csv
For example, the result of plate 7 in 50mM of acetic acid is available in the file *phenotypes.Absolute.Atc7.5_aa50_p7.csv*
For the ease of analysis we compiled the data in a .csv and the compiled data is available in the **COMPILED_DATA** folder.
**Acetic acid titration data** : 20210120_AA_titration_absolute_compiled.csv
* Import the data
```{r}
AA_titration_data <- read.csv("COMPILED_DATA/20210120_AA_titration_absolute_compiled.csv", na.strings = "NoGrowth")
```
* Install packages: Out of these **ggplot2** and **reshape** will be frequently used later for data visualization
+ ggplot2
+ reshape
+ ggridges
* Prepare the data in the format requisite for ggplot2 package using reshape
```{r}
AA_titration_data_reshape <- reshape(data=AA_titration_data, idvar="gRNA_name",
varying = colnames(AA_titration_data)[3:7],
v.name=c("Generation_time"),
new.row.names = 1:30000,
direction="long",
timevar = "Condition",
times = colnames(AA_titration_data)[3:7])
```
* Plot the Ridgeline plots: A nice way to compare the density trace of multiple dataset
```{r figure1, echo=FALSE, fig.cap="Figure 1: Density trace of absolute generation time of strains in plate 7 and 8 at different concentration of acetic acid", fig.width=6, fig.height=10}
library(ggplot2)
library(reshape)
library(ggridges)
plt0 <- ggplot(AA_titration_data_reshape, aes(x = Generation_time, y = Condition, height = stat(density))) +
geom_density_ridges2(stat = "binline", bins = 200, scale = 2, draw_baseline = FALSE)+
theme_ridges()
suppressWarnings(print(plt0))
```
### CONCLUSION OF ACETIC ACID TITRATION
At 150mM we observed the largest phenotypic variability within the strains of plate 7 and 8. Therefore, 150mM was the selected acetic acid concentration to screen the entire library.
## IMPORT SCAN-O-MATIC RAW DATA
The phenotypic data generated in scan-o-matic screening in .csv format.
We extract both the absolute and the normalized phenotypes.
The CRISPRi strains in the library were arrayed in 24 plates in 384 format. Each CRISPRi plate was subjected to two different condition (Basal and 150 mM of Acetic acid). Therefore, for each plate four different files are generated. All files generated in a single independent experimental round are stored in a single folder.
* **SOM_SCR_R001** : Raw data for round1
* **SOM_SCR_R002** : Raw data for round2
**ABSOLUTE DATA**
The **Absolute** dataset gives the extracted phenotypes without any spatial normalization
**NORMALIZED DATA**
The **Normalized** dataset is generated after removal of any spatial bias. This is in log2 scale and referred as Log Strain Coefficient (LSC) values
**FILE NAMING**
Each file is named with the plate identifier in such a way so that it can be easily called programmatically
Eg. Plate 1 absolute data in basal (Ctrl) condition have the following string
*Ctrl1.phenotypes.Absolute*
AND
Plate 1 Normalized data in acetic acid (aa) stress have the string
*aa1.phenotypes.Normalized*
### PURPOSE 1
At the end of this data import session, a single data.frame will be generated with the data of 24 plates. The whole dataset will be labeled with the strains attributes using the metadata key file (provided in the COMPILED_DATA folder). The data import below is shown for only Round2 dataset. Round1 can be generated modifying the folder location
**METADATA KEY FILE** : library_keyfile1536.csv
### IMPORTING THE METADATA FILE
```{r}
Metadata_CRISPRi <- read.csv("COMPILED_DATA/library_keyfile1536.csv", na.strings = "#N/A", stringsAsFactors = FALSE)
str(Metadata_CRISPRi)
```
### GENERATE BASAL **ABSOLUTE** DATASET
```{r}
m <- vector(mode = "character", length = 0)
file.names<-vector(mode = "character", length = 0)
temp_df<-data.frame()
data_Ctrl_Abs <- data.frame()
for(i in 1:24){
m <- paste0("Ctrl", i, ".phenotypes.Absolute")
file.names[i] <- dir("RAW_DATA/SOM_SCR_R002/", pattern = m, full.names = TRUE)
temp_df <- read.csv(file.names[i], na.strings = "NoGrowth")
data_Ctrl_Abs <- rbind(data_Ctrl_Abs, temp_df)
}
str(data_Ctrl_Abs)
```
Several phenotypes are extracted. However, the most useful for this study will be,
* Column No: 14 i.e. **Phenotypes.ExperimentGrowthYield**
* Column No: 15 i.e. **Phenotypes.GenerationTime**
Extract only this two column in the final data.frame
Rename the column names to prevent any ambiguity
```{r}
data_Ctrl_Abs_Trim <- data_Ctrl_Abs[, 14:15]
colnames(data_Ctrl_Abs_Trim) <- c("CTRL_Y_ABS", "CTRL_GT_ABS")
str(data_Ctrl_Abs_Trim)
```
### GENERATE ACETIC ACID **ABSOLUTE** DATASET
Following the same strategy as above
```{r}
m <- vector(mode = "character", length = 0)
file.names<-vector(mode = "character", length = 0)
temp_df<-data.frame()
data_AA_Abs <- data.frame()
for(i in 1:24){
m <- paste0("aa", i, ".phenotypes.Absolute")
file.names[i] <- dir("RAW_DATA/SOM_SCR_R002/", pattern = m, full.names = TRUE)
temp_df <- read.csv(file.names[i], na.strings = "NoGrowth")
data_AA_Abs <- rbind(data_AA_Abs, temp_df)
}
data_AA_Abs_Trim <- data_AA_Abs[, 14:15]
colnames(data_AA_Abs_Trim) <- c("AA_Y_ABS", "AA_GT_ABS")
```
### GENERATE BASAL **NORMALIZED** DATASET
```{r}
m <- vector(mode = "character", length = 0)
file.names<-vector(mode = "character", length = 0)
temp_df<-data.frame()
data_Ctrl_Norm <- data.frame()
for(i in 1:24){
m <- paste0("Ctrl", i, ".phenotypes.Normalized")
file.names[i] <- dir("RAW_DATA/SOM_SCR_R002/", pattern = m, full.names = TRUE)
temp_df <- read.csv(file.names[i], na.strings = "NoGrowth")
data_Ctrl_Norm <- rbind(data_Ctrl_Norm, temp_df)
}
str(data_Ctrl_Norm)
```
The most useful for this study will be,
* Column No: 4 i.e. Phenotypes.ExperimentGrowthYield
* Column No: 5 i.e. Phenotypes.GenerationTime
Extract only this two column
```{r}
data_Ctrl_Norm_Trim <- data_Ctrl_Norm[, 4:5]
colnames(data_Ctrl_Norm_Trim) <- c("CTRL_Y_NORM", "CTRL_GT_NORM")
```
### GENERATE ACETIC ACID **NORMALIZED** DATASET
Same as above
```{r}
m <- vector(mode = "character", length = 0)
file.names<-vector(mode = "character", length = 0)
temp_df<-data.frame()
data_AA_Norm <- data.frame()
for(i in 1:24){
m <- paste0("aa", i, ".phenotypes.Normalized")
file.names[i] <- dir("RAW_DATA/SOM_SCR_R002/", pattern = m, full.names = TRUE)
temp_df <- read.csv(file.names[i], na.strings = "NoGrowth")
data_AA_Norm <- rbind(data_AA_Norm, temp_df)
}
data_AA_Norm_Trim <- data_AA_Norm[, 4:5]
colnames(data_AA_Norm_Trim) <- c("AA_Y_NORM", "AA_GT_NORM")
```
### COMBINE THE DATASETS TO OBTAIN FINAL DATAFRAME
Trimmed datasets are combined to obtain the final data.frame.
The combined data frame is labeled as data from ROUND2
```{r}
R <- rep("2nd_round", 36864)
Round_ID <- data.frame(R, stringsAsFactors = FALSE)
whole_data_R2 <- cbind(Metadata_CRISPRi,
Round_ID,
data_Ctrl_Abs_Trim,
data_AA_Abs_Trim,
data_Ctrl_Norm_Trim,
data_AA_Norm_Trim)
colnames(whole_data_R2)[12] <- "Round_ID"
str(whole_data_R2)
```
### IMPORT RESULTS FROM ROUND1
The results from Round1 is already compiled to a .csv file in COMPILED_DATA folder
**Results 1st Round** : 20190903_CRISPRi_Screen_aa_1st_round.csv
Import the dataset and label as data from ROUND1
```{r}
First_round <- read.csv("COMPILED_DATA/20190903_CRISPRi_Screen_aa_1st_round.csv",
na.strings = c("#N/A", "NoGrowth"),
stringsAsFactors = FALSE)
R <- rep("1st_round", 36864)
Round_ID <- data.frame(R, stringsAsFactors = FALSE)
whole_data_R1 <- cbind(Metadata_CRISPRi, Round_ID, First_round[, 12:19])
colnames(whole_data_R1)[12] <- "Round_ID"
str(whole_data_R1)
```
### COMBINE THE DATASETS of ROUND 1 AND 2
```{r}
whole_data_CRISPRi_aa <- rbind(whole_data_R1, whole_data_R2)
```
## SCAN-O-MATIC PHENOMICS ANALYSIS
In this study most of the downstream analysis was performed using the phenotype Generation_time(GT)
### PURPOSE 2
In this session, downstream data processing and statistical analysis of SCAN-O-MATIC raw output will be performed
### ESTIMATE THE LOG PHENOTYPIC INDEX (LPI) VALUES
LPI of strain is the difference of its normalized Generation_Time(GT) / Yield(Y) (LSC, see [IMPORT SCAN-O-MATIC RAW DATA]) on acetic acid stress plate to the basal condition. It gives a **RELATIVE** estimate of how a strain performed under acetic acid stress relative to the basal condition.
The **RELATIVE GENERATION TIME** i.e. LPI_GT = LSC_GT_Acetic_Acid - LSC_GT_Basal
```{r}
whole_data_CRISPRi_aa[, 21] <- whole_data_CRISPRi_aa[, 19]-whole_data_CRISPRi_aa[, 17]
whole_data_CRISPRi_aa[, 22] <- whole_data_CRISPRi_aa[, 20]-whole_data_CRISPRi_aa[, 18]
colnames(whole_data_CRISPRi_aa)[21] <- "LPI_Y"
colnames(whole_data_CRISPRi_aa)[22] <- "LPI_GT"
```
### PERFORM PLATE-WISE BATCH CORRECTION
Plate-wise batch correction was conducted by subtracting the median of LSC GT values of all the individual colonies on a plate from the individual LSC GT values of the colonies growing on that plate.
i.e. if strainX is growing in Basal condition on plate Z, the corrected LSC_GT value for strainX in the Basal condition is the following;
* LSC_GT_Basal_Corrected~strainX~ = (LSC_GT_Basal~strainX~) - Median(LSC_GT Basal~PlateZ~)
```{r}
plate_ID <- as.character(unique(whole_data_CRISPRi_aa$SOURCEPLATEID))
whole_data_CRISPRi_aa_corrected <- whole_data_CRISPRi_aa
med_LogLSCctrl_RND1_GT <- vector(mode = "integer", length = 0)
med_LogLSCaa_RND1_GT <- vector(mode = "integer", length = 0)
med_LogLSCctrl_RND2_GT <- vector(mode = "integer", length = 0)
med_LogLSCaa_RND2_GT <- vector(mode = "integer", length = 0)
med_LogLSCctrl_RND1_Y <- vector(mode = "integer", length = 0)
med_LogLSCaa_RND1_Y <- vector(mode = "integer", length = 0)
med_LogLSCctrl_RND2_Y <- vector(mode = "integer", length = 0)
med_LogLSCaa_RND2_Y <- vector(mode = "integer", length = 0)
for(i in 1:24){
med_LogLSCctrl_RND1_GT[i] <- median(whole_data_CRISPRi_aa_corrected$CTRL_GT_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$CTRL_GT_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round")])
med_LogLSCaa_RND1_GT[i] <- median(whole_data_CRISPRi_aa_corrected$AA_GT_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$AA_GT_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round")])
med_LogLSCctrl_RND2_GT[i] <- median(whole_data_CRISPRi_aa_corrected$CTRL_GT_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$CTRL_GT_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round")])
med_LogLSCaa_RND2_GT[i] <- median(whole_data_CRISPRi_aa_corrected$AA_GT_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$AA_GT_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round")])
med_LogLSCctrl_RND1_Y[i] <- median(whole_data_CRISPRi_aa_corrected$CTRL_Y_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$CTRL_Y_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round")])
med_LogLSCaa_RND1_Y[i] <- median(whole_data_CRISPRi_aa_corrected$AA_Y_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$AA_Y_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round")])
med_LogLSCctrl_RND2_Y[i] <- median(whole_data_CRISPRi_aa_corrected$CTRL_Y_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$CTRL_Y_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round")])
med_LogLSCaa_RND2_Y[i] <- median(whole_data_CRISPRi_aa_corrected$AA_Y_NORM[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i]
& !is.na(whole_data_CRISPRi_aa_corrected$AA_Y_NORM)
& whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round")])
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round") , 23] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round"), 17] - med_LogLSCctrl_RND1_Y[i]
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round") , 24] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round"), 18] - med_LogLSCctrl_RND1_GT[i]
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round") , 23] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round"), 17] - med_LogLSCctrl_RND2_Y[i]
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round") , 24] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round"), 18] - med_LogLSCctrl_RND2_GT[i]
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round") , 25] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round"), 19] - med_LogLSCaa_RND1_Y[i]
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round") , 26] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="1st_round"), 20] - med_LogLSCaa_RND1_GT[i]
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round") , 25] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round"), 19] - med_LogLSCaa_RND2_Y[i]
whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round") , 26] <- whole_data_CRISPRi_aa_corrected[which(whole_data_CRISPRi_aa_corrected$SOURCEPLATEID==plate_ID[i] &
whole_data_CRISPRi_aa_corrected$Round_ID=="2nd_round"), 20] - med_LogLSCaa_RND2_GT[i]
}
```
### ESTIMATE THE BATCH CORRECTED LOG PHENOTYPIC INDEX (LPI) VALUES
Estimate the corrected LPI values (see [ESTIMATE THE LOG PHENOTYPIC INDEX (LPI) VALUES]) based on the corrected LSC values
i.e. LPI_GT~corrected~ = LSC_GT_Acetic_Acid~corrected~ - LSC_GT_Basal~corrected~
**Estimate the corrected LPI_Y**
```{r}
whole_data_CRISPRi_aa_corrected[, 27] <- whole_data_CRISPRi_aa_corrected[, 25] - whole_data_CRISPRi_aa_corrected[, 23]
```
**Estimate the corrected LPI_GT**
```{r}
whole_data_CRISPRi_aa_corrected[, 28] <- whole_data_CRISPRi_aa_corrected[, 26] - whole_data_CRISPRi_aa_corrected[, 24]
```
### SETTING THE NAMES OF THE NEW COLUMNS
```{r}
colnm <- colnames(whole_data_CRISPRi_aa)[17:22]
colnm <- paste0(colnm, "_CR")
colnames(whole_data_CRISPRi_aa_corrected)[23:28] <- colnm
str(whole_data_CRISPRi_aa_corrected)
```
### EXTRACT ONLY THE BATCH CORRECTED COLUMNS
```{r}
whole_data_CRISPRi_aa_2 <- whole_data_CRISPRi_aa_corrected[, c(1:16, 23:28)]
colnames(whole_data_CRISPRi_aa_2)[17:22] <- colnames(whole_data_CRISPRi_aa)[17:22]
str(whole_data_CRISPRi_aa_2)
```
### CONSTRUCT A NEW DATA STRUCTURE
Construct a new data structure where data from each strain (have a unique guide-RNA) is in a separate row and the replicates from first and second round are side by side. Add also the mean, median and standard deviation statistics for each phenotype
#### REMOVE ROWS WITH SPATIAL CONTROL STRAIN DATA
```{r}
Data_CRISPRi_aa <- subset(whole_data_CRISPRi_aa_2, whole_data_CRISPRi_aa_2$gRNA_name!="SP_Ctrl_CC23")
```
#### CREATE A TABLE OF UNIQUE gRNA
```{r}
df_unique_sgRNA <- data.frame(table(Data_CRISPRi_aa$gRNA_name))
```
#### ARRANGE THE DATA IN THE DESIRED FORMAT
```{r}
R1<-vector(mode = "integer", length = 0)
R2<-vector(mode = "integer", length = 0)
test2<-data.frame()
n<-nrow(df_unique_sgRNA)
for(i in 1:n){
R1 <- which(Data_CRISPRi_aa$gRNA_name==df_unique_sgRNA$Var1[i] & Data_CRISPRi_aa$Round_ID=="1st_round")
R2 <- which(Data_CRISPRi_aa$gRNA_name==df_unique_sgRNA$Var1[i] & Data_CRISPRi_aa$Round_ID=="2nd_round")
test1 <- Data_CRISPRi_aa[c(R1, R2), ]
test2[i, c(1:8)]<-test1[1, c(2:4, 6:7, 9:11)]
test2[i, c(9:14)] <- test1$CTRL_GT_NORM
test2[i, 15] <- mean(test1$CTRL_GT_NORM[1:3])
test2[i, 16] <- mean(test1$CTRL_GT_NORM[4:6])
test2[i, 17] <- sd(test1$CTRL_GT_NORM[1:3])
test2[i, 18] <- sd(test1$CTRL_GT_NORM[4:6])
test2[i, 19] <- mean(test1$CTRL_GT_NORM[1:6])
test2[i, 20] <- median(test1$CTRL_GT_NORM[1:6])
test2[i, 21] <- sd(test1$CTRL_GT_NORM[1:6])
test2[i, c(22:27)] <- test1$AA_GT_NORM
test2[i, 28] <- mean(test1$AA_GT_NORM[1:3])
test2[i, 29] <- mean(test1$AA_GT_NORM[4:6])
test2[i, 30] <- sd(test1$AA_GT_NORM[1:3])
test2[i, 31] <- sd(test1$AA_GT_NORM[4:6])
test2[i, 32] <- mean(test1$AA_GT_NORM[1:6])
test2[i, 33] <- median(test1$AA_GT_NORM[1:6])
test2[i, 34] <- sd(test1$AA_GT_NORM[1:6])
test2[i, c(35:40)] <- test1$LPI_GT
test2[i, 41] <- mean(test1$LPI_GT[1:3])
test2[i, 42] <- mean(test1$LPI_GT[4:6])
test2[i, 43] <- sd(test1$LPI_GT[1:3])
test2[i, 44] <- sd(test1$LPI_GT[4:6])
test2[i, 45] <- mean(test1$LPI_GT[1:6])
test2[i, 46] <- median(test1$LPI_GT[1:6])
test2[i, 47] <- sd(test1$LPI_GT[1:6])
test2[i, c(48:53)] <- test1$CTRL_Y_NORM
test2[i, 54] <- mean(test1$CTRL_Y_NORM[1:3])
test2[i, 55] <- mean(test1$CTRL_Y_NORM[4:6])
test2[i, 56] <- sd(test1$CTRL_Y_NORM[1:3])
test2[i, 57] <- sd(test1$CTRL_Y_NORM[4:6])
test2[i, 58] <- mean(test1$CTRL_Y_NORM[1:6])
test2[i, 59] <- median(test1$CTRL_Y_NORM[1:6])
test2[i, 60] <- sd(test1$CTRL_Y_NORM[1:6])
test2[i, c(61:66)] <- test1$AA_Y_NORM
test2[i, 67] <- mean(test1$AA_Y_NORM[1:3])
test2[i, 68] <- mean(test1$AA_Y_NORM[4:6])
test2[i, 69] <- sd(test1$AA_Y_NORM[1:3])
test2[i, 70] <- sd(test1$AA_Y_NORM[4:6])
test2[i, 71] <- mean(test1$AA_Y_NORM[1:6])
test2[i, 72] <- median(test1$AA_Y_NORM[1:6])
test2[i, 73] <- sd(test1$AA_Y_NORM[1:6])
test2[i, c(74:79)] <- test1$LPI_Y
test2[i, 80] <- mean(test1$LPI_Y[1:3])
test2[i, 81] <- mean(test1$LPI_Y[4:6])
test2[i, 82] <- sd(test1$LPI_Y[1:3])
test2[i, 83] <- sd(test1$LPI_Y[4:6])
test2[i, 84] <- mean(test1$LPI_Y[1:6])
test2[i, 85] <- median(test1$LPI_Y[1:6])
test2[i, 86] <- sd(test1$LPI_Y[1:6])
}
```
#### ASSIGN COLUMN NAMES
Column names are already stored in a text times available in the **COMPILED_DATA** folder. Then store the data.frame under a new name.
```{r}
column_names <- read.table("COMPILED_DATA/Column_names.txt", header = FALSE, sep = "\t", as.is = TRUE)
colnames(test2) <- column_names$V1
Analysis_CRISPRi_aa_Complete <- test2
str(Analysis_CRISPRi_aa_Complete)
```
### PERFORM STATISTICAL ANALYSIS
Multiple statistical method was applied to identify the best fit statistical model for this dataset. We start with the complete dataset and give it a new name to avoid distorting the original dataset.
```{r}
Analysis_Final <- Analysis_CRISPRi_aa_Complete
```
#### METHOD 1
For **METHOD 1**, We hypothesized that the difference between the mean(µ) phenotypic performance of a specific CRISPRi strain (StrainX) in the two independent experimental rounds (n=2) to the mean phenotypic performance of all the CRISPRi strains that falls within the interquartile range (IQR) of the complete dataset would be zero, and any difference within the IQR to be just by chance.
**Null Hypothesis** : µ(µ~LPI_GT_StrainX_Round1~, µ~LPI_GT_StrainX_Round2~)- µ(InterquartileRange_LPI_GT) = 0
##### RECALCULATION OF SOME PHENOTYPIC PARAMETERS
In this method, we estimate the Mean / standard deviation (SD) of the LPI GT of Round 1 and Round 2 separately for each strain. When one/two of the three replicates of a strain in a round returned missing value (i.e. NA), then the mean / SD of LPI GT for that round is calculated by taking average of the non NA replicates. Therefore, excluding the missing values the mean and SD statistics were recalculated. We implemented a if else decision tree for this
* The mean and SD of Normalized generation time (**LSC GT mean**) at **Basal condition** re-calculation
```{r}
for(i in 1:nrow(Analysis_Final)){
x1 <- as.numeric(Analysis_Final[i, 9:11][which(!is.na(Analysis_Final[i, 9:11]))])
x2 <- as.numeric(Analysis_Final[i, 12:14][which(!is.na(Analysis_Final[i, 12:14]))])
if(length(x1)==0){
Analysis_Final$CTRL_GT_RND1_MEAN[i] <- NA
} else{
Analysis_Final$CTRL_GT_RND1_MEAN[i] <- as.numeric(mean(x1))
}
if(length(x2)==0){
Analysis_Final$CTRL_GT_RND2_MEAN[i] <- NA
} else{
Analysis_Final$CTRL_GT_RND2_MEAN[i] <- as.numeric(mean(x2))
}
if(sum(is.na(c(Analysis_Final$CTRL_GT_RND1_MEAN[i], Analysis_Final$CTRL_GT_RND2_MEAN[i])))==0){
Analysis_Final$CTRL_GT_RND1_2_MEAN[i] <- as.numeric(mean(c(Analysis_Final$CTRL_GT_RND1_MEAN[i], Analysis_Final$CTRL_GT_RND2_MEAN[i])))
Analysis_Final[i, 87] <- as.numeric(sd(c(Analysis_Final$CTRL_GT_RND1_MEAN[i], Analysis_Final$CTRL_GT_RND2_MEAN[i])))
} else{
Analysis_Final$CTRL_GT_RND1_2_MEAN[i] <- NA
Analysis_Final[i, 87] <- NA
}
}
colnames(Analysis_Final)[87] <- "CTRL_GT_MEAN_RND1_2_SD"
```
* The mean and SD of Normalized generation time (**LSC GT mean**) at **150mM acetic acid** re-calculation
```{r}
for(i in 1:nrow(Analysis_Final)){
x1 <- as.numeric(Analysis_Final[i, 22:24][which(!is.na(Analysis_Final[i, 22:24]))])
x2 <- as.numeric(Analysis_Final[i, 25:27][which(!is.na(Analysis_Final[i, 25:27]))])
if(length(x1)==0){
Analysis_Final$AA_GT_RND1_MEAN[i] <- NA
} else{
Analysis_Final$AA_GT_RND1_MEAN[i] <- as.numeric(mean(x1))
}
if(length(x2)==0){
Analysis_Final$AA_GT_RND2_MEAN[i] <- NA
} else{
Analysis_Final$AA_GT_RND2_MEAN[i] <- as.numeric(mean(x2))
}
if(sum(is.na(c(Analysis_Final$AA_GT_RND1_MEAN[i], Analysis_Final$AA_GT_RND2_MEAN[i])))==0){
Analysis_Final$AA_GT_RND1_2_MEAN[i] <- as.numeric(mean(c(Analysis_Final$AA_GT_RND1_MEAN[i], Analysis_Final$AA_GT_RND2_MEAN[i])))
Analysis_Final[i, 88] <- as.numeric(sd(c(Analysis_Final$AA_GT_RND1_MEAN[i], Analysis_Final$AA_GT_RND2_MEAN[i])))
} else{
Analysis_Final$AA_GT_RND1_2_MEAN[i] <- NA
Analysis_Final[i, 88] <- NA
}
}
colnames(Analysis_Final)[88] <- "AA_GT_MEAN_RND1_2_SD"
```
* The mean and SD of **RELATIVE** generation time (**LPI GT mean**) at **150mM acetic acid** re-calculation
```{r}
for(i in 1:nrow(Analysis_Final)){
x1 <- as.numeric(Analysis_Final[i, 35:37][which(!is.na(Analysis_Final[i, 35:37]))])
x2 <- as.numeric(Analysis_Final[i, 38:40][which(!is.na(Analysis_Final[i, 38:40]))])
if(length(x1)==0){
Analysis_Final$LPI_GT_RND1_MEAN[i] <- NA
} else{
Analysis_Final$LPI_GT_RND1_MEAN[i] <- as.numeric(mean(x1))
}
if(length(x2)==0){
Analysis_Final$LPI_GT_RND2_MEAN[i] <- NA
} else{
Analysis_Final$LPI_GT_RND2_MEAN[i] <- as.numeric(mean(x2))
}
if(sum(is.na(c(Analysis_Final$LPI_GT_RND1_MEAN[i], Analysis_Final$LPI_GT_RND2_MEAN[i])))==0){
Analysis_Final$LPI_GT_RND1_2_MEAN[i] <- as.numeric(mean(c(Analysis_Final$LPI_GT_RND1_MEAN[i], Analysis_Final$LPI_GT_RND2_MEAN[i])))
Analysis_Final[i, 89] <- as.numeric(sd(c(Analysis_Final$LPI_GT_RND1_MEAN[i], Analysis_Final$LPI_GT_RND2_MEAN[i])))
} else{
Analysis_Final$LPI_GT_RND1_2_MEAN[i] <- NA
Analysis_Final[i, 89] <- NA
}
}
colnames(Analysis_Final)[89] <- "LPI_GT_MEAN_RND1_2_SD"
```
##### EXTRACT ALL LPI GT MEAN DATA POINTS WITHIN INTER-QUARTILE-RANGE (IQR)
BOX PLOT - MEAN RELATIVE GENERATION TIME (LPI GT)
```{r figure2, echo=FALSE, fig.cap="Figure 2: Boxplot of mean relative generation time (LPI GT) for all strains in the library", fig.width=2, fig.height=4}
box_stat_LPI_GT_R1_2_mean <- boxplot(Analysis_Final$LPI_GT_RND1_2_MEAN, cex=0.3)
```
Display Box-plot statistics
```{r}
box_stat_LPI_GT_R1_2_mean$stats
```
* 25th Percentile = -0.02428792
* 75th Percentile = 0.07255828
Therefore, extraction of the data points within IQR
```{r}
Intermediate_50 <- Analysis_Final$LPI_GT_RND1_2_MEAN[which(Analysis_Final$LPI_GT_RND1_2_MEAN>=-0.02428792
&Analysis_Final$LPI_GT_RND1_2_MEAN<=0.07255828)]
summary(Intermediate_50)
```
##### ESTIMATE P-VALUE
P-value is estimated by Welch two sample two-sided t-test (an adaptation of Student's t-test)
```{r}
for(i in 1:nrow(Analysis_Final)){
if(sum(is.na(c(Analysis_Final$LPI_GT_RND1_MEAN[i], Analysis_Final$LPI_GT_RND2_MEAN[i])))==0){
P_value <- t.test(Intermediate_50, c(Analysis_Final$LPI_GT_RND1_MEAN[i], Analysis_Final$LPI_GT_RND2_MEAN[i]))
Analysis_Final[i, 90] <- P_value$p.value
} else{
Analysis_Final[i, 90] <- NA
}
}
colnames(Analysis_Final)[90] <- "P_value_M1"
```
##### FALSE DISCOVERY RATE ADJUSTMENT OF P-VALUE
P-value adjustment by **BENJAMINI-HOCHBERG False Discovery Rate (FDR) method**
```{r}
Analysis_Final[which(!is.na(Analysis_Final$P_value_M1)), 91] <- p.adjust(Analysis_Final$P_value_M1[which(!is.na(Analysis_Final$P_value_M1))],
method = "BH",
n = length(Analysis_Final$P_value_M1[which(!is.na(Analysis_Final$P_value_M1))]))
colnames(Analysis_Final)[91] <- "P.adjusted_M1"
```
##### P-VALUE DISGNOSTICS FOR METHOD1
NUMBER OF SIGNIFICANT STRAINS
```{r}
length(Analysis_Final$P_value_M1[which(Analysis_Final$P_value_M1<=0.05)])
length(Analysis_Final$P.adjusted_M1[which(Analysis_Final$P.adjusted_M1<=0.05)])
length(Analysis_Final$P_value_M1[which(Analysis_Final$P_value_M1<=0.1)])
length(Analysis_Final$P.adjusted_M1[which(Analysis_Final$P.adjusted_M1<=0.1)])
```
P-VALUE DIAGNOSTICS BY **HISTOGRAM ANALYSIS**
```{r figure3, echo=FALSE, fig.cap="Figure 3: P-value diagnostic by histogram Method 1", fig.width=8, fig.height=8}
par(mfrow=c(2,2))
hist(Analysis_Final$P_value_M1,
breaks = 20,
xlab = "P-value",
ylab = "Frequency",
main = "P-values of all strains",
col = "skyblue",
xlim = c(0, 1),
ylim = c(0, 1000))
hist(Analysis_Final$P_value_M1[which(Analysis_Final$Control.gRNA==1)],
breaks = 10,
xlab = "P-value",
ylab = "Frequency",
main = "P-values of control strains",
col = "skyblue",
xlim = c(0, 1),
ylim = c(0, 5))
hist(Analysis_Final$P.adjusted_M1,
breaks = 20,
xlab = "P.adj",
ylab = "Frequency",
main = "P.adjusted values of all strains",
col = "skyblue",
xlim = c(0, 1),
ylim = c(0, 1000))
hist(Analysis_Final$P.adjusted_M1[which(Analysis_Final$Control.gRNA==1)],
breaks = 2,
xlab = "P.adj",
ylab = "Frequency",
main = "P.adjusted values of control strains",
col = "skyblue",
xlim = c(0, 1),
ylim = c(0, 10))
```
##### CONCLUSIONS METHOD 1
Method 1 was too rigid as n=2. Even the smallest standard deviation between round1 and round2 is making an observation insignificant. This method puts the whole weightage on the variability between round1 and round2, not on the deviation from the mean of intermediate 50%. Therefore statistical method 1 was discarded after careful evaluation.
#### METHOD 2
For **METHOD 2**, We hypothesized that the difference between the mean(µ) phenotypic performance (LPI GT) of a specific CRISPRi strain (StrainX) in a independent experimental round (each has three technical replicates, i.e. n=3) to the mean phenotypic performance of all the replicates of the CRISPRi control strains (with gRNA targeting no genetic locus in *S. cerevisiae*) in that respective screening round would be zero, and any difference within the CRISPRi control strain phenotypic performance range (LPI GT range) to be just by chance.
**Null Hypothesis** : µ~StrainX~(LPI_GT~Replica1~, LPI_GT~Replica2~, LPI_GT~Replica3~)- µ~CRISPRi_Control_Strains~(LPI_GT) = 0
In this method P-values for each strain were estimated for each round and only strain that showed significant performance in both round were considered for further analysis
First we clone the dataset in a new name to avoid any distortion down the line
```{r}
Analysis_Final_2 <- Analysis_Final
str(Analysis_Final_2)
```
##### EXTRACT CRISPRi CONTROL STRAINS DATA
Extract the CRISPRi-control strains LPI GT data from ROUND1 and ROUND2, respectively and store the output in two different vectors.
* **ROUND 1**
```{r}
CRISPRi_Ctrl_Round1 <- whole_data_CRISPRi_aa_2$LPI_GT[which(whole_data_CRISPRi_aa_2$Control.gRNA == 1
& whole_data_CRISPRi_aa_2$Round_ID=="1st_round")]
summary(CRISPRi_Ctrl_Round1)
```
* **ROUND 2**
```{r}
CRISPRi_Ctrl_Round2 <- whole_data_CRISPRi_aa_2$LPI_GT[which(whole_data_CRISPRi_aa_2$Control.gRNA == 1
& whole_data_CRISPRi_aa_2$Round_ID=="2nd_round")]
summary(CRISPRi_Ctrl_Round2)
```
##### ESTIMATE P-VALUES FOR ROUND 1 AND 2
P-value is estimated by Welch two sample two-sided t-test (an adaptation of Student's t-test)
```{r}
for(i in 1:nrow(Analysis_Final_2)){
test1 <- t(Analysis_Final_2[i, 35:37])
test2 <- t(Analysis_Final_2[i, 38:40])
if(sum(!is.na(test1[, 1]))>=2){
P_value_RND1 <- t.test(CRISPRi_Ctrl_Round1, test1[which(!is.na(test1[, 1]))])
Analysis_Final_2[i, 92] <- P_value_RND1$p.value
} else {
Analysis_Final_2[i, 92] <- NA
}
if(sum(!is.na(test2[, 1]))>=2){
P_value_RND2 <- t.test(CRISPRi_Ctrl_Round2, test2[which(!is.na(test2[, 1]))])
Analysis_Final_2[i, 93] <- P_value_RND2$p.value
} else {
Analysis_Final_2[i, 93] <- NA
}
}
colnames(Analysis_Final_2)[92:93] <- c("P_value_RND1_M2", "P_value_RND2_M2")
```
##### FALSE DISCOVERY RATE ADJUSTMENT OF P-VALUES FOR ROUND 1 AND 2
P-value adjustment by **BENJAMINI-HOCHBERG False Discovery Rate (FDR) method**
```{r}
Analysis_Final_2[which(!is.na(Analysis_Final_2$P_value_RND1_M2)), 94] <- p.adjust(Analysis_Final_2$P_value_RND1_M2[which(!is.na(Analysis_Final_2$P_value_RND1_M2))],
method = "BH",
n = length(Analysis_Final_2$P_value_RND1_M2[which(!is.na(Analysis_Final_2$P_value_RND1_M2))]))
Analysis_Final_2[which(!is.na(Analysis_Final_2$P_value_RND2_M2)), 95] <- p.adjust(Analysis_Final_2$P_value_RND2_M2[which(!is.na(Analysis_Final_2$P_value_RND2_M2))],
method = "BH",
n = length(Analysis_Final_2$P_value_RND2_M2[which(!is.na(Analysis_Final_2$P_value_RND2_M2))]))
colnames(Analysis_Final_2)[94:95] <- c("P.adjusted_RND1_M2", "P.adjusted_RND2_M2")
```
##### P-VALUE DISGNOSTICS FOR METHOD 2 : ROUND1
NUMBER OF SIGNIFICANT STRAINS
```{r}
length(Analysis_Final_2$P_value_RND1_M2[which(Analysis_Final_2$P_value_RND1_M2<=0.05)])
length(Analysis_Final_2$P.adjusted_RND1_M2[which(Analysis_Final_2$P.adjusted_RND1_M2<=0.05)])
length(Analysis_Final_2$P_value_RND1_M2[which(Analysis_Final_2$P_value_RND1_M2<=0.1)])
length(Analysis_Final_2$P.adjusted_RND1_M2[which(Analysis_Final_2$P.adjusted_RND1_M2<=0.1)])
```
P-VALUE DIAGNOSTICS BY **HISTOGRAM ANALYSIS** ROUND 1
```{r figure4, echo=FALSE, fig.cap="Figure 4: P-value diagnostic by histogram, Method 2, Round 1", fig.width=8, fig.height=8}
par(mfrow=c(2,2))
hist(Analysis_Final_2$P_value_RND1_M2,
breaks = 20,
xlab = "P-value",
ylab = "Frequency",
main = "P-values of all strains ROUND1",
col = "skyblue",
ylim = c(0, 5000))
hist(Analysis_Final_2$P_value_RND1_M2[which(Analysis_Final_2$Control.gRNA==1)],
breaks = 20,
xlab = "P-value",
ylab = "Frequency",
main = "P-values of control strains ROUND1",
col = "skyblue",
ylim = c(0, 10))
hist(Analysis_Final_2$P.adjusted_RND1_M2,
breaks = 20,
xlab = "P.adj",
ylab = "Frequency",
main = "P.adjusted values of all strains ROUND1",
col = "skyblue",
ylim = c(0, 5000))
hist(Analysis_Final_2$P.adjusted_RND1_M2[which(Analysis_Final_2$Control.gRNA==1)],
breaks = 20,
xlab = "P.adj",
ylab = "Frequency",
main = "P.adjusted values of control strains ROUND1",
col = "skyblue",
ylim = c(0, 10))
```
##### P-VALUE DISGNOSTICS FOR METHOD 2 : ROUND2
NUMBER OF SIGNIFICANT STRAINS
```{r}
length(Analysis_Final_2$P_value_RND2_M2[which(Analysis_Final_2$P_value_RND2_M2<=0.05)])
length(Analysis_Final_2$P.adjusted_RND2_M2[which(Analysis_Final_2$P.adjusted_RND2_M2<=0.05)])
length(Analysis_Final_2$P_value_RND2_M2[which(Analysis_Final_2$P_value_RND2_M2<=0.1)])
length(Analysis_Final_2$P.adjusted_RND2_M2[which(Analysis_Final_2$P.adjusted_RND2_M2<=0.1)])
```
P-VALUE DIAGNOSTICS BY **HISTOGRAM ANALYSIS** ROUND 2
```{r figure5, echo=FALSE, fig.cap="Figure 5: P-value diagnostic by histogram, Method 2, Round 2", fig.width=8, fig.height=8}
par(mfrow=c(2,2))
hist(Analysis_Final_2$P_value_RND2_M2,
breaks = 20,
xlab = "P-value",
ylab = "Frequency",
main = "P-values of all strains ROUND2",
col = "skyblue",
ylim = c(0, 5000))
hist(Analysis_Final_2$P_value_RND2_M2[which(Analysis_Final_2$Control.gRNA==1)],
breaks = 20,
xlab = "P-value",
ylab = "Frequency",
main = "P-values of control strains ROUND2",
col = "skyblue",
ylim = c(0, 10))
hist(Analysis_Final_2$P.adjusted_RND2_M2,
breaks = 20,
xlab = "P.adj",
ylab = "Frequency",
main = "P.adjusted values of all strains ROUND2",
col = "skyblue",
ylim = c(0, 5000))
hist(Analysis_Final_2$P.adjusted_RND2_M2[which(Analysis_Final_2$Control.gRNA==1)],
breaks = 20,
xlab = "P.adj",
ylab = "Frequency",
main = "P.adjusted values of control strains ROUND2",
col = "skyblue",
ylim = c(0, 10))
```
##### CONCLUSIONS METHOD 2
It is a robust statistical method. However, one of the major problem with this method is setting different thresholds for p.adjusted values and LPI GT Mean for each round.
#### METHOD 3 AND METHOD 4
For **METHOD 3**, We hypothesized that the difference between the mean(µ) phenotypic performance of a specific CRISPRi strain (StrainX) considering all technical replicates (3 in each) in the two independent experimental rounds (i.e. n=6) to the mean phenotypic performance of all the CRISPRi strains that falls within the interquartile range (IQR) of the complete dataset would be zero, and any difference within the IQR to be just by chance.
**Null Hypothesis** : µ~StrainX~(All_replicates_LPI_GT)- µ(InterquartileRange_LPI_GT) = 0
Additionally, we tested one final statistical model to determine significance of our observations
For **METHOD 4**, We hypothesized that the difference between the mean(µ) phenotypic performance of a specific CRISPRi strain (StrainX) considering all technical replicates (3 in each) in the two independent experimental rounds (i.e. n=6) to the mean phenotypic performance of all the CRISPRi control strains (with gRNA targeting no genetic locus in *S. cerevisiae*) would be zero, and any difference within the CRISPRi control strains phenotypic performance range (LPI GT range) to be just by chance.
**Null Hypothesis** : µ~StrainX~(All_replicates_LPI_GT) - µ~CRISPRi_Control_Strains~(LPI_GT) = 0
To ensure that we don't distort the original dataset we clone the Analysis dataset in a new name
```{r}
Analysis_Final_3 <- Analysis_Final_2
```
##### EXTRACT ALL LPI GT DATA POINTS (INCLUDING ALL REPLICATES) WITHIN INTER-QUARTILE-RANGE (IQR)
Since we will consider all replicates this time, we will compare it with all replicates (NOT MEAN) that falls within IQR for Method 3. For this purpose, we extract the IQR dataset including all the replicate data for each strain. We will use the data.frame **Data_CRISPRi_aa** (see, [REMOVE ROWS WITH SPATIAL CONTROL STRAIN DATA]) to extract this numeric vector.
BOX PLOT - RELATIVE GENERATION TIME (LPI GT)
```{r figure6, echo=FALSE, fig.cap="Figure 6: Boxplot of relative generation time (LPI GT) for all strains including all replicates in the library", fig.width=2, fig.height=4}
boxplot_stat_LPI_GT <- boxplot(Data_CRISPRi_aa$LPI_GT, cex = 0.3)
```
Display Box-plot statistics
```{r}
boxplot_stat_LPI_GT$stats
```
* 25th Percentile = -0.04373846
* 75th Percentile = 0.09804938
Therefore, extraction of the data points within IQR
```{r}
Intermediate_50_M3 <- Data_CRISPRi_aa$LPI_GT[which(Data_CRISPRi_aa$LPI_GT >=-0.04373846
&Data_CRISPRi_aa$LPI_GT<=0.09804938)]
summary(Intermediate_50_M3)
```
##### EXTRACT CRISPRi CONTROL STRAINS DATA (ALL REPLICATES)
This time we extract all the replicate data (non the mean) of each of the CRISPRi control strains for the Method 4
```{r}
Crispri_control_M4 <- Data_CRISPRi_aa$LPI_GT[which(Data_CRISPRi_aa$Control.gRNA==1)]
summary(Crispri_control_M4)
```
##### RECALCULATE THE LSC GT MEAN AT BASAL CONDITION AND LPI GT MEAN OF EACH STRAIN
We recalculate the above parameter taking all six replicates into account and excluding the missing values. For this purpose we use a if else decision tree. This means we get a LSC GT / LPI GT value if at-least 1 replicate managed to grow at a particular condition. Else it will return a missing value or NA
Additionally, we also create two columns that shows number of replicates of a strain managed to grow in basal condition (**n_CTRL**) and number of replicates in acetic acid condition (**n_LPI**)
```{r}
for(i in 1:nrow(Analysis_Final_3)){
test1 <- t(Analysis_Final_3[i, 9:14])
test2 <- t(Analysis_Final_3[i, 35:40])
x1 <- sum(!is.na(test1[, 1]))
x2 <- sum(!is.na(test2[, 1]))
CTRL_GT_Mean_temp <- mean(test1[which(!is.na(test1[, 1]))])
LPI_GT_Mean_temp <- mean(test2[which(!is.na(test2[, 1]))])
Analysis_Final_3[i, 96] <- CTRL_GT_Mean_temp
Analysis_Final_3[i, 97] <- x1
Analysis_Final_3[i, 98] <- LPI_GT_Mean_temp
Analysis_Final_3[i, 99] <- x2
}
colnames(Analysis_Final_3)[96:99] <- c("CTRL_GT_Mean_all", "n_CTRL", "LPI_GT_Mean_all", "n_LPI")
```
##### ESTIMATE P-VALUES FOR METHOD 3 AND 4
P-value is estimated by Welch two sample two-sided t-test (an adaptation of Student's t-test)
```{r}
for(i in 1:nrow(Analysis_Final_3)){
test <- t(Analysis_Final_3[i, 35:40])
x <- sum(!is.na(test[, 1]))
if(x>2){
P.value_temp_M3 <- t.test(Intermediate_50_M3, test[which(!is.na(test[, 1]))])
P.value_temp_M4 <- t.test(Crispri_control_M4, test[which(!is.na(test[, 1]))])
Analysis_Final_3[i, 100] <- P.value_temp_M3$p.value
Analysis_Final_3[i, 101] <- P.value_temp_M4$p.value
} else {
Analysis_Final_3[i, 100] <- NA
Analysis_Final_3[i, 101] <- NA
}
}
colnames(Analysis_Final_3)[100:101] <- c("P.value_M3", "P.value_M4")
```
##### FALSE DISCOVERY RATE ADJUSTMENT OF P-VALUES FOR METHOD 3 AND 4
P-value adjustment by **BENJAMINI-HOCHBERG False Discovery Rate (FDR) method**
```{r}
Analysis_Final_3[which(!is.na(Analysis_Final_3$P.value_M3)), 102] <- p.adjust(Analysis_Final_3$P.value_M3[which(!is.na(Analysis_Final_3$P.value_M3))],
method = "BH",
n = length(Analysis_Final_3$P.value_M3[which(!is.na(Analysis_Final_3$P.value_M3))]))
Analysis_Final_3[which(!is.na(Analysis_Final_3$P.value_M4)), 103] <- p.adjust(Analysis_Final_3$P.value_M4[which(!is.na(Analysis_Final_3$P.value_M4))],
method = "BH",
n = length(Analysis_Final_3$P.value_M4[which(!is.na(Analysis_Final_3$P.value_M4))]))
colnames(Analysis_Final_3)[102:103] <- c("P.adjusted_M3", "P.adjusted_M4")
```
##### P-VALUE DISGNOSTICS FOR METHOD 3
NUMBER OF SIGNIFICANT STRAINS
```{r}
length(Analysis_Final_3$P.value_M3[which(Analysis_Final_3$P.value_M3<=0.05)])
length(Analysis_Final_3$P.adjusted_M3[which(Analysis_Final_3$P.adjusted_M3<=0.05)])
length(Analysis_Final_3$P.value_M3[which(Analysis_Final_3$P.value_M3<=0.1)])
length(Analysis_Final_3$P.adjusted_M3[which(Analysis_Final_3$P.adjusted_M3<=0.1)])
```
P-VALUE DIAGNOSTICS BY **HISTOGRAM ANALYSIS**
```{r figure7, echo=FALSE, fig.cap="Figure 7 (Fig S11 in Manuscript): P-value diagnostic by histogram, Method 3", fig.width=8, fig.height=8}
par(mfrow=c(2,2))
hist(Analysis_Final_3$P.value_M3,
breaks = 20,
xlab = "P-value",
ylab = "Frequency",
main = "P-value of all strains",
col = "skyblue",
xlim = c(0, 1),
ylim = c(0, 3000),
cex.lab= 1.5)
hist(Analysis_Final_3$P.value_M3[which(Analysis_Final_3$Control.gRNA==1)],
breaks = 20,
xlab = "P-value",
ylab = "Frequency",
main = "P-value of control strains",
col = "skyblue",
xlim = c(0, 1),
ylim = c(0, 10),
cex.lab= 1.5)
hist(Analysis_Final_3$P.adjusted_M3,
breaks = 20,
xlab = "P.value adjusted",
ylab = "Frequency",
main = "P.adjusted values of all strains",
col = "skyblue",
xlim = c(0, 1),
ylim = c(0, 3000),
cex.lab= 1.5)
hist(Analysis_Final_3$P.adjusted_M3[which(Analysis_Final_3$Control.gRNA==1)],