Bias model training
To achieve good bias correction, it is important to have a good and complete bias model. While bias models can be transferred across experiments done under similar settings, you might need to train a new bias model when the experiment conditions change. This section of the tutorial details the command for training your own bias model, again using the downloaded and preprocessed tutorial data.
chrombpnet bias pipeline \
-ibam ~/data/downloads/merged.bam \
-d "ATAC" \
-g ~/data/downloads/hg38.fa \
-c ~/data/downloads/hg38.chrom.sizes \
-p ~/data/peaks_no_blacklist.bed \
-n ~/data/negatives.bed \
-f ~/data/splits/fold_0.json \
-b 0.5 \
-o ~/bias_model/ \
-fp k562
For general usage of this command you can run chrombpnet bias pipeline -h or refer to the documentation below. This command is intended to train the bias model and run quality checks on it. You also have the option of running the pipeline as two individual commands, chrombpnet train and chrombpnet qc.
The command above outputs a quality check report in two formats, html and pdf. For your convenience and reference we provide links to both the outputs here - html and pdf.
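The pipeline command reads several input files, so it can be worth confirming they exist before starting a long training run. The following is a minimal sketch of such a preflight check; `check_inputs` is a hypothetical helper written for illustration and is not part of chrombpnet.

```shell
# Preflight sketch: report any pipeline input that is missing or empty.
# check_inputs is a hypothetical helper, not a chrombpnet command.
check_inputs() {
  local missing=0
  local path
  for path in "$@"; do
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$path" ]; then
      echo "missing or empty: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with the tutorial paths used in the command above:
# check_inputs ~/data/downloads/merged.bam ~/data/downloads/hg38.fa \
#   ~/data/downloads/hg38.chrom.sizes ~/data/peaks_no_blacklist.bed \
#   ~/data/negatives.bed ~/data/splits/fold_0.json
```

The function returns a non-zero status if any path fails the check, so it can be chained with `&&` in front of the pipeline command.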
- -i: input file path with filtered reads. Example files for supported types - bam, fragment, tagalign
- -t: type of input file. Following string inputs are supported - "bam", "fragment", "tagalign".
- -d: assay type. Following types are supported - "ATAC" or "DNASE"
- -g: reference genome fasta file. Example file for human reference - hg38.fa
- -c: chromosome and size tab-separated file. Example file for human reference - hg38.chrom.sizes
- -p: input peaks in narrowPeak file format. The file must have 10 columns, with values minimally for chr, start, end and summit (10th column). Every region is centered at start + summit internally, across all regions. Example file with ENCSR868FGK dataset - peaks.bed
- -n: input nonpeaks (background regions) in narrowPeak file format. The file must have 10 columns, with values minimally for chr, start, end and summit (10th column). Every region is centered at start + summit internally, across all regions. Example file with ENCSR868FGK dataset - nonpeaks.bed
- -b: float value for bias threshold factor. Defaults to 0.5 for "ATAC" assay type and 0.8 for "DNASE" assay type.
- -f: json file showing split of chromosomes for train, test and valid. Example 5 fold jsons for human reference - folds
- -o: output directory path
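The peak and nonpeak files are described above as 10-column narrowPeak files whose regions are centered at start + summit. The sketch below checks the column count and prints the resulting center for each region. `narrowpeak_centers` is a hypothetical helper for illustration only; it assumes tab-separated input and does not reproduce any chrombpnet internals.

```shell
# Sketch: validate a narrowPeak-style file and print each region's center.
# narrowpeak_centers is a hypothetical helper, not part of chrombpnet.
narrowpeak_centers() {
  awk -F'\t' '
    # Flag any line that does not have exactly 10 tab-separated columns
    NF != 10 {
      printf "line %d: expected 10 columns, found %d\n", NR, NF > "/dev/stderr"
      bad = 1
      next
    }
    # center = start (column 2) + summit offset (column 10)
    { printf "%s\t%d\n", $1, $2 + $10 }
    # Exit non-zero if any malformed line was seen
    END { exit (bad + 0) }
  ' "$1"
}

# Example call with the tutorial peak file:
# narrowpeak_centers ~/data/peaks_no_blacklist.bed | head
```

For a line with start 100 and summit 200, the printed center is 300.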
Please find scripts and best practices for preprocessing here.
The output directory will be populated as follows -
models\
...
bias.h5
...
logs\
...
intermediates\
...
evaluation\
...
pwm_from_input.png
bias_metrics.json
bias_only_peaks.counts_pearsonr.png
bias_only_peaks.profile_jsd.png
profile_motifs.pdf
counts_motifs.pdf
...
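After a run, a quick check that the key files from the listing above exist can catch a failed or partial run early. The sketch below searches the output directory recursively, because the listing elides intermediate entries, and matches names with a leading wildcard on the assumption that file names may carry a prefix. `check_bias_outputs` is a hypothetical helper, not a chrombpnet command.

```shell
# Sketch: confirm key outputs exist somewhere under the output directory.
# check_bias_outputs is a hypothetical helper; the file names are taken from
# the listing above and matched by suffix (assumption: a prefix may be added).
check_bias_outputs() {
  local outdir="$1"
  local status=0
  local name
  for name in bias.h5 bias_metrics.json profile_motifs.pdf counts_motifs.pdf; do
    # Take the first match, if any, anywhere below the output directory
    if [ -z "$(find "$outdir" -type f -name "*$name" | head -n 1)" ]; then
      echo "not found: $name" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call with the tutorial output directory:
# check_bias_outputs ~/bias_model/
```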
Following are some things to keep in mind when using custom datasets:
- profile_motifs.pdf should only contain enzyme bias motifs. If you see any TF motifs in profile_motifs.pdf and counts_motifs.pdf, retrain the bias model (train_bias_model.sh) with a bias threshold factor -b of 0.4/0.3 for ATAC and a value between 0.7 and 0.5 for DNASE.
- In bias_metrics.json, make sure that the counts pearsonr in peaks is less than -0.2. Otherwise increase the bias threshold by 0.1/0.2.