Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors when running PAQR step2 #19

Open
Sylarair opened this issue Sep 5, 2020 · 14 comments
Open

Errors when running PAQR step2 #19

Sylarair opened this issue Sep 5, 2020 · 14 comments

Comments

@Sylarair
Copy link

Sylarair commented Sep 5, 2020

Hi,

Thanks for developing PAQR_KAPAC. I am facing a problem when running step2 of PAQR with the source code (error shown in the attached figure).

Does anyone have idea from which the error comes? Many Thanks!

Screen Shot 2020-09-05 at 7 24 12 PM

@Sylarair
Copy link
Author

Sylarair commented Sep 5, 2020

By the way, The data I used in the code is from https://www.encodeproject.org/experiments/ENCSR000CWC/ (ENCFF391BFB.bam, ENCFF983OXY.bam). Also, the config.yaml setting is shown in the attached figure. In addition, I have put the problem into the github issue. Finally, the attached text file is the reference used in my config.yaml.

Screen Shot 2020-09-05 at 7 34 39 PM

clusters.mm10.vM22.no_tsl.atlas2.canonical_chr.tandem.noOverlap_strand_specific.txt

full_transcripts.mm10.vM22.no_tsl.atlas2.canonical_chr.tandem.noOverlap_strand_specific.txt

@koljaLanger
Copy link
Member

Hi

can you please post the output of cat HNRPC_KD/no_bias_samples.out. One possible reason would be that the samples have biased coverages and are excluded from processing. In this case, we highly recommend to do some sanity checks, e.g. visually inspect the coverage in a genome browser to ensure that the sequencing data has trustworthy quality.

If this is really the reason for the error, you could mitigate it by lowering the "bias.median.cutoff" parameter in the config file.

Rgds
Ralf

@Sylarair
Copy link
Author

Sylarair commented Sep 5, 2020

Thanks for your reply. The file is empty.

Screen Shot 2020-09-05 at 8 26 10 PM

@Sylarair
Copy link
Author

Sylarair commented Sep 5, 2020

Also, when I tried to rerun, it shows such logs. What should I do with this? Thanks!

Screen Shot 2020-09-05 at 8 28 19 PM

@Sylarair
Copy link
Author

Sylarair commented Sep 5, 2020

Does this mean there is no valid sample? Thanks.
Screen Shot 2020-09-05 at 8 36 52 PM

@Sylarair
Copy link
Author

Sylarair commented Sep 5, 2020

Meanwhile, the two samples median TIN is 76 and 79. It seems ok (default bias.median.cutoff = 70).

Screen Shot 2020-09-05 at 8 38 10 PM

@koljaLanger
Copy link
Member

Ok,
I think a know the root cause:
We designed PAQR in a way that it provides the input for KAPAC which is why PAQR always works with control-treatment sample pairs.
In your case, both samples are labeled as "HNRNPC" in the config and so no valid pair of samples is inferred in the first part.

An easy workaround would be to adjust your config file as follows:

ctl_rep1: {bam: ENCFF391BFB, type: CNTRL}
HNRNPC_rep1: {bam: ENCFF983OXY, type: KD, control: ctl_rep1}

In this setting, running PAQR would result in the estimation of poly(A) site usage for both samples independently.

Hope this helps!
Ralf

@Sylarair
Copy link
Author

Sylarair commented Sep 5, 2020

Oh, thanks so much! Actually, The two bams are bio-replicates. Would this setting not affect results?

@koljaLanger
Copy link
Member

It affects the results slightly, because PAQR would normally take advantage of the two samples being replicates. But it should not be an issue.

Please bear in mind, though, that running KAPAC requires the comparison across conditions - in your case you would need a reference condition to compare against.

@Sylarair
Copy link
Author

Sylarair commented Sep 6, 2020

Got it, thanks!

@Sylarair
Copy link
Author

Sylarair commented Sep 6, 2020

Thanks for your suggestions because I get the PAQR results successfully.

However, facing so many results, I am confused about the difference of 3 files (relative_usages.filtered.tsv, relative_usages.tsv, and relative_usages.relPos_per_pA.out) and wonder which one is the proximal polyA site usage. Could you please explain it to me for I have not found their interpretation in the manual. Many thanks!

Qin

Screen Shot 2020-09-06 at 11 10 44 PM

@koljaLanger
Copy link
Member

Hi Qin

there are two files which are worth looking into if you want to obtain the proximal poly(A) site usage:

relative_usages.filtered.tsv
tandem_pas_expressions.rpm.filtered.tsv

Both files contain one poly(A) site per line and poly(A) sites are grouped by exons and sorted proximal to distal -> so each line that has a new entry in the exon column indicates a proximal poly(A) site.
The last n columns (n = number of samples; so in your case, the last 2 columns) show the relative usage of the the poly(A) site (in the case of relative_usages.filtered.tsv) or the library size normalized expression (in the case of tandem_pas_expressions.rpm.filtered.tsv).

Hope that helps

Best
Ralf

@Sylarair
Copy link
Author

Sylarair commented Sep 7, 2020

Thanks for your answer which clears my doubts. There is another 2 question which interest me are: 1. what is used to filter 'relative_usages.tsv' to get 'relative_usages.filtered.tsv'? 2. When applying PAQR to analyze forebrain E13.5 day polyA RNA-Seq data, I find that relative usage of proximal polyA sites in most of genes in relative_usages.filtered.tsv are very high (70~90). This is wired and let me wonder whether I am wrong in some step. Could you please give some comments? Thanks!

PAQR_PPAU.pdf

@koljaLanger
Copy link
Member

The filtering simply selects exons with multiple poly(A) sites which have read support in all samples - so this can be considered more of a data cleaning step.
Regarding your proximal-distal usage distribution: it is hard to identify a particular reason for your results without further details. In general, cell from the brain have extended 3' ends - but I don't know if this still holds for embryo samples.
A good starting point would be to get some global statistics: how many sites were quantified? How does the expression distribution looks like? Do you have an expected distribution?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants