Bias model training

Anusri Pampari edited this page Dec 24, 2022 · 17 revisions

Good bias correction requires a good and complete bias model. Bias models can be transferred across experiments performed under similar settings, but you may need to train a new one when the experimental conditions change. This section of the tutorial describes the command for training your own bias model, again using the downloaded and preprocessed tutorial data.

chrombpnet bias pipeline \
  -ibam ~/data/downloads/merged.bam \
  -d "ATAC" \
  -g ~/data/downloads/hg38.fa \
  -c ~/data/downloads/hg38.chrom.sizes \
  -p ~/data/peaks_no_blacklist.bed \
  -n ~/data/negatives.bed \
  -f ~/data/splits/fold_0.json \
  -b 0.5 \
  -o ~/bias_model/ \
  -fp k562

For general usage of this command, run chrombpnet bias pipeline -h or refer to the documentation below. The command trains the bias model and runs quality checks on it. You can also run the pipeline as two separate commands, chrombpnet train and chrombpnet qc.

The command above writes a quality-check report in two formats, HTML and PDF. For convenience and reference, links to both outputs are provided here - html and pdf.
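When the pipeline is launched from a script rather than typed by hand, the same invocation can be assembled as an argument list. The sketch below is only an illustration: the helper name bias_pipeline_argv is hypothetical, the flags are copied from the example command above, and the default -b values (0.5 for ATAC, 0.8 for DNASE) are taken from the Input Format description. It builds the arguments but does not run anything.

```python
import shlex
from typing import List, Optional

# Defaults as documented for -b; the lookup table itself is an illustrative helper.
DEFAULT_BIAS_THRESHOLD = {"ATAC": 0.5, "DNASE": 0.8}


def bias_pipeline_argv(
    bam: str,
    genome: str,
    chrom_sizes: str,
    peaks: str,
    nonpeaks: str,
    fold: str,
    out_dir: str,
    assay: str = "ATAC",
    bias_threshold: Optional[float] = None,
    file_prefix: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `chrombpnet bias pipeline` (hypothetical helper)."""
    if assay not in DEFAULT_BIAS_THRESHOLD:
        raise ValueError(f"unsupported assay type: {assay!r}")
    if bias_threshold is None:
        bias_threshold = DEFAULT_BIAS_THRESHOLD[assay]
    argv = [
        "chrombpnet", "bias", "pipeline",
        "-ibam", bam,
        "-d", assay,
        "-g", genome,
        "-c", chrom_sizes,
        "-p", peaks,
        "-n", nonpeaks,
        "-f", fold,
        "-b", str(bias_threshold),
        "-o", out_dir,
    ]
    if file_prefix:
        argv += ["-fp", file_prefix]
    return argv


argv = bias_pipeline_argv(
    bam="data/downloads/merged.bam",
    genome="data/downloads/hg38.fa",
    chrom_sizes="data/downloads/hg38.chrom.sizes",
    peaks="data/peaks_no_blacklist.bed",
    nonpeaks="data/negatives.bed",
    fold="data/splits/fold_0.json",
    out_dir="bias_model/",
    file_prefix="k562",
)
# Print a copy-pasteable command line for inspection.
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True). Because no shell is involved in that case, paths beginning with ~ would need to be expanded first, for example with os.path.expanduser.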

Usage


Input Format

  • -i: path to the input file with filtered reads. Example files for the supported types - bam, fragment, tagalign
  • -t: type of input file. The following strings are supported - "bam", "fragment", "tagalign".
  • -d: assay type. The following types are supported - "ATAC" or "DNASE".
  • -g: reference genome fasta file. Example file for the human reference - hg38.fa
  • -c: tab-separated file of chromosomes and their sizes. Example file for the human reference - hg38.chrom.sizes
  • -p: input peaks in narrowPeak format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Every region is centered internally at start + summit. Example file for the ENCSR868FGK dataset - peaks.bed
  • -n: input nonpeaks (background regions) in narrowPeak format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Every region is centered internally at start + summit. Example file for the ENCSR868FGK dataset - nonpeaks.bed
  • -b: Float value for bias threshold factor. Defaults to 0.5 for "ATAC" assay type and 0.8 for "DNASE" assay type.
  • -f: json file giving the split of chromosomes into train, test and valid sets. Example 5-fold jsons for the human reference - folds
  • -o: Output directory path
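The 10-column requirement and the start + summit centering described for -p and -n can be checked before training. The following is a minimal sketch, not part of ChromBPNet: the function name region_centers is hypothetical, and the check that the summit falls inside the region is an extra sanity check assumed here rather than something the tool is documented to enforce.

```python
from typing import Iterable, List, Tuple


def region_centers(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (chrom, center) per narrowPeak record, with center = start + summit."""
    centers = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"line {line_no}: expected 10 columns, found {len(fields)}")
        chrom = fields[0]
        start, end = int(fields[1]), int(fields[2])
        summit = int(fields[9])  # 10th column: summit offset relative to start
        center = start + summit
        # Assumed sanity check: the centered position should lie within the region.
        if not start <= center <= end:
            raise ValueError(f"line {line_no}: summit offset {summit} falls outside the region")
        centers.append((chrom, center))
    return centers


example = ["chr1\t100\t600\t.\t.\t.\t.\t.\t.\t250\n"]
print(region_centers(example))  # [('chr1', 350)]
```

To validate a real file, the same function can be given an open file handle, for example region_centers(open("peaks.bed")).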

Please find scripts and best practices for preprocessing here.
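The fold file passed with -f can also be sanity-checked ahead of time. The sketch below assumes the JSON is an object with the keys "train", "valid" and "test", each holding a list of chromosome names. That layout is inferred from the option description above and should be confirmed against the example fold files. The requirement that the three splits do not overlap is likewise an assumption, and check_fold is a hypothetical helper.

```python
import json
from typing import Dict, List

SPLIT_KEYS = ("train", "valid", "test")  # assumed key names


def check_fold(fold: Dict[str, List[str]]) -> None:
    """Raise if a split is missing, empty, or shares a chromosome with another split."""
    seen: Dict[str, str] = {}
    for key in SPLIT_KEYS:
        if key not in fold:
            raise KeyError(f"fold file has no {key!r} split")
        if not fold[key]:
            raise ValueError(f"{key!r} split is empty")
        for chrom in fold[key]:
            if chrom in seen:
                raise ValueError(f"{chrom} appears in both {seen[chrom]!r} and {key!r}")
            seen[chrom] = key


fold = json.loads('{"train": ["chr2", "chr3"], "valid": ["chr8"], "test": ["chr1"]}')
check_fold(fold)  # returns None when the splits look consistent
```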

Output Format

The output directory will be populated as follows -

models/
	...
	bias.h5
	...
logs/
	...

intermediates/
	...

evaluation/
	...
	pwm_from_input.png
	bias_metrics.json
	bias_only_peaks.counts_pearsonr.png
	bias_only_peaks.profile_jsd.png
	profile_motifs.pdf
	counts_motifs.pdf
	...
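After a run, it can be useful to confirm that the files named in the listing above were actually written. The sketch below is a hypothetical helper, not a ChromBPNet command. It assumes the file names appear exactly as listed, so if a file prefix (-fp) changes the names, the list would need adjusting.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the directory listing above (assumed to be unprefixed).
EXPECTED_OUTPUTS = [
    "models/bias.h5",
    "evaluation/pwm_from_input.png",
    "evaluation/bias_metrics.json",
    "evaluation/bias_only_peaks.counts_pearsonr.png",
    "evaluation/bias_only_peaks.profile_jsd.png",
    "evaluation/profile_motifs.pdf",
    "evaluation/counts_motifs.pdf",
]


def missing_outputs(out_dir: str) -> List[str]:
    """Return the expected output files that are not present under out_dir."""
    root = Path(out_dir).expanduser()
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

For the tutorial run, missing_outputs("~/bias_model/") would return an empty list if every listed file exists.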

Keep the following in mind when using custom datasets:

  • profile_motifs.pdf should contain only enzyme bias motifs. If you see any TF motifs in profile_motifs.pdf and counts_motifs.pdf, retrain the bias model (train_bias_model.sh) with a bias threshold factor -b of 0.4 or 0.3 for ATAC, or a value between 0.7 and 0.5 for DNASE.
  • In bias_metrics.json, make sure that the counts pearsonr in peaks is less than -0.2. Otherwise, increase the bias threshold by 0.1 or 0.2.
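The counts pearsonr value mentioned above can be read from bias_metrics.json programmatically. The sketch below assumes the value is stored under the key path counts_metrics → peaks → pearsonr. That path is an assumption and should be checked against an actual bias_metrics.json. The helper peaks_counts_pearsonr is hypothetical and only reports the value and whether it lies below the -0.2 cutoff.

```python
import json
from typing import Tuple


def peaks_counts_pearsonr(metrics: dict, cutoff: float = -0.2) -> Tuple[float, bool]:
    """Return (pearsonr in peaks, whether it is below the cutoff)."""
    try:
        # Assumed key path; adjust if the file is organised differently.
        value = float(metrics["counts_metrics"]["peaks"]["pearsonr"])
    except KeyError as err:
        raise KeyError(f"key {err} not found; check the layout of bias_metrics.json") from err
    return value, value < cutoff


example = json.loads('{"counts_metrics": {"peaks": {"pearsonr": -0.25}}}')
print(peaks_counts_pearsonr(example))  # (-0.25, True)
```

For a real run, the dictionary would come from json.load on the bias_metrics.json file in the evaluation directory.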