This repository has been archived by the owner on Mar 17, 2023. It is now read-only.

Add nice MultiQC summary assigning hashes to genes #52

Open
olgabot opened this issue May 18, 2020 · 1 comment

Comments

@olgabot
Contributor

olgabot commented May 18, 2020

I wanted to add this in PR #41, but that PR is already huge and this is well outside its scope. It would require a lot more summarization code, and I don't yet know what I really want the summary to show.

@olgabot
Contributor Author

olgabot commented May 18, 2020

Example using csvtk (#51) to filter the featureCounts output down to only the genes with nonzero reads:

 Mon 18 May - 09:52  ~/data_lg/kmer-hashing/brawand2011/predictorthologs/molecule-protein_ksize-27_log2sketchsize-14/mini/featureCounts/gene_counts 
 olga@tesla  csvtk filter -t -f "8>0" $ID\_gene.featureCounts.txt
Geneid  Chr     Start   End     Strand  Length  gene_name       hash-3557262534756__SRR306777_GSM752631_mml_br_F_1__reads_in_shared_hashes.bam
ENSMMUG00000014809      7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7;7   82441976;82442067;82442067;82442121;82442131;82442250;82442250;82442250;82442652;82442652;82442652;82443791;82443791;82443791;82461637;82461655;82461655;82462295;82462295;82465578;82465578;82465698;82465703 82442107;82442150;82442150;82442150;82442150;82442353;82442410;82442410;82442765;82442765;82442765;82443948;82443948;82443948;82461702;82461702;82461702;82462409;82462409;82465818;82465818;82465818;82465818       -;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-   1030    HNRNPC  40

In my mini test dataset, the full combination of differential hashes plus all bams is ~3000 files:

(base)
 ✘  Mon 18 May - 10:10  ~/data_lg/kmer-hashing/brawand2011/predictorthologs/molecule-protein_ksize-27_log2sketchsize-14/mini/featureCounts/gene_counts 
 olga@tesla  ls -1 | wc -l
3221

Each of these is ~11-12 MB, which adds up quickly when there are thousands of them.

(base)
 Mon 18 May - 10:11  ~/data_lg/kmer-hashing/brawand2011/predictorthologs/molecule-protein_ksize-27_log2sketchsize-14/mini/featureCounts/gene_counts 
 olga@tesla  ll |head
Permissions Size User Group Date Modified Name
.rw-r--r--@  12M olga czb   15 May  2:26  hash-3557262534756__SRR306777_GSM752631_mml_br_F_1_gene.featureCounts.txt
.rw-r--r--@  12M olga czb   15 May  2:27  hash-3557262534756__SRR306778_GSM752632_mml_br_M_1_gene.featureCounts.txt
.rw-r--r--@  12M olga czb   15 May  2:31  hash-3557262534756__SRR306786_GSM752640_mml_lv_F_1_gene.featureCounts.txt
.rw-r--r--@  12M olga czb   15 May  2:28  hash-3557262534756__SRR306787_GSM752641_mml_lv_M_1_gene.featureCounts.txt
.rw-r--r--@  11M olga czb   15 May  2:37  hash-3557262534756__SRR306827_GSM752680_ppa_br_F_2_gene.featureCounts.txt
.rw-r--r--@  11M olga czb   15 May  2:31  hash-3557262534756__SRR306836_GSM752689_ppa_lv_M_1_gene.featureCounts.txt
.rw-r--r--@  12M olga czb   15 May  2:46  hash-185985883680630__SRR306777_GSM752631_mml_br_F_1_gene.featureCounts.txt
.rw-r--r--@  12M olga czb   15 May  2:44  hash-185985883680630__SRR306778_GSM752632_mml_br_M_1_gene.featureCounts.txt
.rw-r--r--@  11M olga czb   15 May  2:49  hash-185985883680630__SRR306827_GSM752680_ppa_br_F_2_gene.featureCounts.txt

I don't see a use case where someone would want ALL of the featureCounts output, especially since so many of the genes are undetected. Maybe I'm wrong? But I think the best way to summarize is to run csvtk filter within the featureCounts process and remove all zero-count genes.
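For anyone who wants to try the zero-count filtering without csvtk installed, here is a rough dependency-free sketch using awk. It is not the pipeline's implementation: the function name `filter_nonzero_counts` is hypothetical, and it assumes a tab-separated featureCounts table with a single BAM per file, so the count sits in the last column (column 8 in the example above), plus an optional leading `#` comment line.

```shell
# Hypothetical helper: keep the header plus genes with a nonzero count.
# Assumes a tab-separated table whose final column holds the read count.
filter_nonzero_counts() {
    awk -F'\t' '
        /^#/ { next }                                  # drop a leading "#" comment line, if any
        !seen_header { print; seen_header = 1; next }  # first remaining line is the header
        $NF + 0 > 0 { print }                          # keep rows whose last column is > 0
    ' "$1"
}
```

Something like `filter_nonzero_counts "${ID}_gene.featureCounts.txt" > "${ID}_gene.nonzero.txt"` would then write the reduced table; the output filename here is only an illustration.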
