Skip to content

Generate contribution score bigwigs

Anusri Pampari edited this page Jul 5, 2023 · 7 revisions

ChromBPNet models have two heads - (1) profile head (to predict the shape of a profile) and (2) counts head (to predict the total counts in a profile). Please use the chrombpnet_nobias.h5 model for this. Here we will provide tools to derive sequence contribution scores for either or both the heads.

Usage

chrombpnet contribs_bw [-h] -m MODEL_H5 -r REGIONS -g GENOME -c CHROM_SIZES -op OUTPUT_PREFIX  [-pc {counts,profile} [{counts,profile} ...]] [-os OUTPUT_PREFIX_STATS] [-t TQDM]
                              [-d DEBUG_CHR [DEBUG_CHR ...]]

Input Format

required arguments:
  -m MODEL_H5, --model-h5 MODEL_H5
                        Path model .h5 file
  -r REGIONS, --regions REGIONS
                        10 column bed file of regions for contribution score predictions
  -g GENOME, --genome GENOME
                        Genome fasta
  -c CHROM_SIZES, --chrom-sizes CHROM_SIZES
                        Chromosome sizes 2 column tab-separated file
  -op OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Output prefix for bigwig files

optional arguments:
  -pc {counts,profile} [{counts,profile} ...], --profile-or-counts {counts,profile} [{counts,profile} ...]
                        use either counts or profile or both for running shap (defaults to running both)
  -os OUTPUT_PREFIX_STATS, --output-prefix-stats OUTPUT_PREFIX_STATS
                        Output stats on bigwig
  -t TQDM, --tqdm TQDM  Use tqdm. If yes then you need to have it installed.
  -d DEBUG_CHR [DEBUG_CHR ...], --debug-chr DEBUG_CHR [DEBUG_CHR ...]
                        Run for specific chromosomes only (e.g. chr1 chr2) for debugging

Output Format

Note: Note that prefix can include a directory path and prefix for the output file. Make sure that the directory in output_prefix exists. Make sure that regions in the input bed file can be expanded to inputlen (default to 2114) regions without overflowing out of the chromosomes. If this condition is not satisfied the program will return with a error.

Note: If you are doing large scale runs, keep in mind that the disk usage of output_prefix.profile_scores.h5 and output_prefix.counts_scores.h5 can be lot around 3Gb each. One way to compress this is to store only shap field in the .h5 files and delete the projected_shap and raw key fields. This structure is however needed for TFModisco runs, so you can reconstruct them temporarily as follows - the raw field is simply the one-hot encoding of the regions in output_prefix.interpreted_regions.bedand theprojected_shapis equivalent toshapmultiplied by theraw` value. For more information refer to this link - https://github.com/kundajelab/chrombpnet/blob/a101087d90f193e1cc6dba1bad1b5fc9761aa129/chrombpnet/evaluation/interpret/interpret.py#L43.

The following two files are created using the output_prefix as prefix for the output when both profile_or_counts includes both profile and counts. If only one of the option is used, only one of the files is generated. Note that prefix can include a directory path and prefix for the output file.

  • output_prefix.profile_scores.h5: A h5 file containing a dictionary with contribution scores for the profile head of the chrombpnet model. The dictionary consists of the following keys raw, shap and projected_shap. The values of each of these keys are explained below. All the values have the shape (L,4,inputlen) where L is the number of inputs given in regions bed file and inputlen is inferred from the input length of the model_h5. This .h5 file can be input to chrombpnet modisco_motifs command for running TF-Modisco.
    • raw key consists of one hot encoding of a sequence from the regions bed file.
    • shap key consists of hypothetical contribution scores obtained from doing deepshap. These values will be used in the MODISCO pipeline (for de-novo motif discovery) next.
    • projected_shap key consists of values obtained by multiplying shap values with raw values. These values will be used in building browser tracks.
  • output_prefix.profile_scores.bw: Contribution scores stored in bigwig format. Can be visualized with dynseq track in Washu Genomics browser.
  • output_prefix.count_scores.h5: A h5 file containing a dictionary with contribution scores for the count head of the chrombpnet model. Has similar structure as the output_prefix.profile_scores.h5. This .h5 file can be input to chrombpnet modisco_motifs command for running TF-Modisco
  • output_prefix.count_scores.bw: Contribution scores stored in bigwig format. Can be visualized with dynseq track in Washu Genomics browser.
  • output_prefix.interpreted_regions.bed: A bed file with regions used for interpretations. The script filters regions whose resulting sequences dont have the same length as input_len inferred from the model. This bed file might have regions fewer than regions provided as input.