Noble-Lab/HiCFoundation

HiCFoundation

HiCFoundation is a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species.

Copyright (C) 2024 Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, and Sheng Wang

License: Apache License 2.0

Contact: Sergei Doulatov ([email protected]) & William Stafford Noble ([email protected]) & Sheng Wang ([email protected])

For technical problems or questions, please reach out to Xiao Wang ([email protected]) and Yuanyuan Zhang ([email protected]).

Citation:

Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, & Sheng Wang. A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species. bioRxiv, 2024. Paper

@article{wang2024hicfoundation,
  title={A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species},
  author={Xiao Wang and Yuanyuan Zhang and Suhita Ray and Anupama Jha and Tangqi Fang and Shengqi Hang and Sergei Doulatov and William Stafford Noble and Sheng Wang},
  journal={bioRxiv},
  year={2024}
}

Introduction

HiCFoundation is a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species. The genetic information within nuclear DNA is organized into a compact three-dimensional (3D) structure that impacts critical cellular processes. High-throughput chromosome conformation capture (Hi-C) stands as the most widely used method for measuring 3D genome architecture, while linear epigenomic assays, such as ATAC-seq, DNase-seq, and ChIP-seq, are extensively employed to characterize genome regulatory activities. However, the integrative analysis of chromatin interactions and associated gene regulatory mechanisms remains challenging due to the mismatched resolution between Hi-C and epigenomic assays, as well as inconsistencies among analysis tools. Here we propose HiCFoundation, a Hi-C-based foundation model for genome architecture and regulatory functions analysis. HiCFoundation is trained on hundreds of Hi-C assays encompassing 118 million contact matrix patches. The model achieves state-of-the-art performance in multiple types of 3D genome analysis, including reproducibility analysis, resolution enhancement, and loop detection, offering high efficiency and broad applicability. We further demonstrate the model's generalizability to genome architecture analysis of 316 species. Notably, by enabling analysis of low-coverage experimental data, HiCFoundation reveals genome-wide loop loss during differentiation of HSPCs to neutrophils. Additionally, HiCFoundation is able to predict multiple gene regulatory activities from Hi-C input by generating epigenomic assays, and further offers interpretable analysis to reveal the relationship between chromatin conformation and genome function. Finally, HiCFoundation can analyze single-cell Hi-C data, shedding light on genome structure at single-cell resolution.
HiCFoundation thus provides a unified, efficient, generalizable, and interpretable foundation for integrative, multi-species, single-cell, and multi-omics analyses, paving the path for systematically studying genome 3D architecture and its regulatory mechanisms.

Overall Protocol


1) Pre-training stage: the model is trained in a self-supervised fashion on massive quantities of unlabeled Hi-C data. The model takes masked Hi-C submatrices as input, optimizing for the reconstruction of the full submatrix.
2) Fine-tuning stage: the model is fine-tuned and tested for diverse downstream tasks, including integrative Hi-C analysis, multi-omics analysis, and single-cell analysis.

HiCFoundation framework

Installation

System Requirements

  • CPU: 4 cores or higher
  • Memory: 12GB RAM or higher
  • GPU: CUDA-compatible with minimum 12GB memory
  • Note: a GPU is mandatory for running HiCFoundation.

Installation

2. Clone the repository to your computer

git clone https://github.com/Noble-Lab/HiCFoundation.git && cd HiCFoundation

3. Configure environment for HiCFoundation.

3.1 Install anaconda

Install anaconda from https://www.anaconda.com/download#downloads.

3.2 Install environment via yml file
conda env create -f environment.yml
3.3 Activate environment for running

Each time you want to run HiCFoundation, activate the environment with:

conda activate HiCFoundation
# To exit
conda deactivate

4. Download the trained HiCFoundation model

You can download our pre-trained and fine-tuned models to hicfoundation_model for inference, embedding generation, and fine-tuning.
HiCFoundation model weights: hicfoundation_model

You can also download them from the command line. Note that the URLs use /resolve/main/ (the raw file) rather than /blob/main/ (the HTML file viewer):

cd hicfoundation_model
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_pretrain.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_reproducibility.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_loop.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_loop_lc.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_resolution.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_epigenomic.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_schic.pth.tar
cd ..
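
A failed or redirected download can leave an HTML page saved under a checkpoint's name, which only surfaces later as a confusing loading error. The following is a minimal sketch of a sanity check, not part of HiCFoundation itself: the helper name find_bad_checkpoints is hypothetical, and the HTML test (looking at the first bytes of the file) is a heuristic assumption.

```python
from pathlib import Path

# File names taken from the download commands in this section.
EXPECTED_CHECKPOINTS = [
    "hicfoundation_pretrain.pth.tar",
    "hicfoundation_reproducibility.pth.tar",
    "hicfoundation_loop.pth.tar",
    "hicfoundation_loop_lc.pth.tar",
    "hicfoundation_resolution.pth.tar",
    "hicfoundation_epigenomic.pth.tar",
    "hicfoundation_schic.pth.tar",
]


def find_bad_checkpoints(model_dir):
    """Return {file_name: reason} for checkpoints that are missing or look like HTML.

    Hypothetical helper: the HTML check is a heuristic, not a format validation.
    """
    problems = {}
    for name in EXPECTED_CHECKPOINTS:
        path = Path(model_dir) / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        with open(path, "rb") as handle:
            head = handle.read(64).lstrip().lower()
        if head.startswith(b"<!doctype") or head.startswith(b"<html"):
            problems[name] = "looks like an HTML page, not model weights"
    return problems


if __name__ == "__main__":
    for name, reason in find_bad_checkpoints("hicfoundation_model").items():
        print(f"{name}: {reason}")
```

An empty result means all seven files are present and none of them starts like a web page; it does not prove the weights are complete.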

5. (Optional) Visualization software

Juicebox: https://aidenlab.org/juicebox/

HiGlass: https://higlass.io/

Usage

Inference of fine-tuned HiCFoundation

Inference of HiCFoundation for chromatin architecture, multi-omics and single-cell analysis

Overview

This includes five fine-tuned models for:

  • Reproducibility analysis: HiCFoundation will generate embeddings of the input Hi-C, and the submatrix embeddings can be used to compare across biological replicates and non-replicates.
  • Chromatin loop detection: HiCFoundation will generate loop calls for the input Hi-C in .bedpe format.
  • Resolution enhancement: HiCFoundation will generate an enhanced Hi-C map given the input Hi-C.
  • Epigenomic assay profiling: HiCFoundation will generate the corresponding epigenomic assays in .bigWig format given the input Hi-C.
  • Single-cell Hi-C enhancement: HiCFoundation will generate the enhanced scHi-C given the input scHi-C.

Input format

HiCFoundation supports the .hic/.cool/.pkl/.txt/.pairs/.npy formats.

  • .hic/.cool: the common Hi-C formats that store the final matrix of a Hi-C experiment.
  • .pkl: a pickle file that stores a dict of all Hi-C matrices, with the chrom name as the key and a scipy.sparse/numpy array as the value: [chrom_name]:[matrix].
  • .txt/.pairs: a pairs-format text file in which each line, "#readID\tchr1\tpos1\tchr2\tpos2", records an interaction between chr1:pos1 and chr2:pos2.
  • .npy: a numpy array that records the contact map of a specific chromosome.
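
As an illustration of the .pkl layout described above, the following sketch builds a {chrom_name: matrix} dict from toy contact counts and pickles it. The helper names and the toy counts are made up for illustration, and storing full symmetric matrices is an assumption, since the text does not say whether symmetric or upper-triangular matrices are expected.

```python
import os
import pickle
import tempfile

import numpy as np


def build_contact_matrix(contacts, n_bins):
    """Fill a symmetric dense matrix from (bin_i, bin_j, count) triples."""
    matrix = np.zeros((n_bins, n_bins), dtype=np.float32)
    for i, j, count in contacts:
        matrix[i, j] += count
        if i != j:
            matrix[j, i] += count  # mirror off-diagonal contacts (assumption)
    return matrix


def save_hic_pkl(matrices, out_path):
    """Pickle a {chrom_name: matrix} dict, the layout described for .pkl input."""
    with open(out_path, "wb") as handle:
        pickle.dump(matrices, handle)


# Toy data: two tiny "chromosomes" with made-up counts.
toy_matrices = {
    "chr1": build_contact_matrix([(0, 0, 5), (0, 1, 2), (1, 2, 3)], n_bins=4),
    "chr2": build_contact_matrix([(0, 2, 1)], n_bins=3),
}
toy_path = os.path.join(tempfile.mkdtemp(), "toy_hic.pkl")
save_hic_pkl(toy_matrices, toy_path)

# Read the file back to confirm the structure survived the round trip.
with open(toy_path, "rb") as handle:
    reloaded = pickle.load(handle)
print({chrom: matrix.shape for chrom, matrix in reloaded.items()})
```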

Example

Please download the following files to the example folder for example testing purposes.

Other format examples

Inference for different tasks

1. Inference embeddings for reproducibility analysis

python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 1 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu]
  • input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
  • infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
  • hic_resolution: resolution of the input matrix, default: 25000 (25 kb for reproducibility task).
  • input_submatrix_length: input submatrix row size, default: 224.
  • input_submatrix_width: input submatrix column size, default: 224.
  • stride: scanning stride for the input Hi-C matrix, default: 20.
  • scan_boundary: off-diagonal bound for the scanning, default: 0.
  • trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_reproducibility.pth.tar. Make sure you follow the installation instructions to download it before you run.
  • output_dir: output directory to save the results, default: hicfoundation_inference.
  • gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.

The output is saved in ``output_dir``, where the embeddings are saved in "HiCFoundation_reproducibility_embedding.pkl" in dict format.
The key of the dict is "chrom:row_index,col_index", and the value is the corresponding embedding.
Each embedding corresponds to the submatrix [row_index:row_index+input_row_size, col_index:col_index+input_col_size] on chromosome ``chrom``.
Example command:
python3 inference.py --input example/ENCFF689CUX.hic --batch_size 4 --resolution 25000 --task 1 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_reproducibility.pth.tar --output hicfoundation_inference/reproducibility_analysis/ --gpu "0"

This uses the low-coverage example ENCFF689CUX.hic to run the inference.
The output embedding is saved in hicfoundation_inference/reproducibility_analysis/HiCFoundation_reproducibility_embedding.pkl.
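
One simple way to compare two runs is to match their embeddings by key and average a per-submatrix similarity. The sketch below does this with cosine similarity. It assumes each dict value can be converted to a numpy array, and the scoring is an illustrative choice rather than the reproducibility score used by the authors. All helper names are hypothetical.

```python
import pickle

import numpy as np


def load_embeddings(pkl_path):
    """Load the {"chrom:row_index,col_index": embedding} dict from a .pkl file."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle)


def parse_key(key):
    """Split "chrom:row_index,col_index" into (chrom, row_index, col_index)."""
    chrom, indices = key.rsplit(":", 1)
    row_index, col_index = indices.split(",")
    return chrom, int(row_index), int(col_index)


def mean_cosine_similarity(emb_a, emb_b):
    """Average cosine similarity over the submatrix keys shared by both runs."""
    similarities = []
    for key in sorted(set(emb_a) & set(emb_b)):
        a = np.asarray(emb_a[key], dtype=np.float64).ravel()
        b = np.asarray(emb_b[key], dtype=np.float64).ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            similarities.append(float(a @ b / denom))
    return float(np.mean(similarities)) if similarities else float("nan")
```

Typical use would be to call load_embeddings on the HiCFoundation_reproducibility_embedding.pkl files of two experiments and pass both dicts to mean_cosine_similarity; keys present in only one run are ignored.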

2. Inference for chromatin loop detection

python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 2 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu]
  • input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
  • infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
  • hic_resolution: resolution of the input matrix, default: 10000 (10 kb for loop detection).
  • input_submatrix_length: input submatrix row size, default: 224.
  • input_submatrix_width: input submatrix column size, default: 224.
  • stride: scanning stride for the input Hi-C matrix, default: 20.
  • scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time). You can also use 200, the detection results should be similar.
  • trained_model_path: load fine-tuned model for inference. Use "hicfoundation_loop.pth.tar" for high-coverage loop detection and "hicfoundation_loop_lc.pth.tar" for low-coverage loop detection. For human datasets, experiments with fewer than 50M total reads are treated as low-coverage, which corresponds to fewer than roughly 200 reads per 10 kb. Make sure you follow the installation instructions to download the models before you run.
  • output_dir: output directory to save the results, default: hicfoundation_inference.
  • gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.

The output is saved in ``output_dir``, where the loops are saved in HiCFoundation_loop_[threshold].bedpe. We keep three confidence levels (0.5, 0.75, and 0.9) for you to choose from. For conservative loop calls, we recommend the 0.9 threshold. For low-coverage Hi-C, we recommend the 0.5 threshold.
Each line of the .bedpe file records one loop call in the format [chr1 x1 x2 chr2 y1 y2], where chr1 is typically the same as chr2; [x1 x2] is the span of the left loop anchor, and [y1 y2] is the span of the right loop anchor.
Example command:

Loop calls from high-coverage Hi-C

python3 inference.py --input example/4DNFITUOMFUQ.hic --batch_size 4 --resolution 10000 --task 2 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_loop.pth.tar --output hicfoundation_inference/loop_detection/ --gpu "0"

This uses the high-coverage example 4DNFITUOMFUQ.hic to run the inference.
The output loop detection is saved in hicfoundation_inference/loop_detection.
You can find HiCFoundation_loop_0.5.bedpe, HiCFoundation_loop_0.75.bedpe and HiCFoundation_loop_0.9.bedpe.
HiCFoundation_loop_0.9.bedpe includes the most confident loop calls. You can also choose HiCFoundation_loop_0.5.bedpe if you want more loop calls.
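
The [chr1 x1 x2 chr2 y1 y2] layout can be read with a few lines of Python. This is a minimal sketch with hypothetical helper names. It assumes whitespace-separated columns, and that comment lines, header rows, and any extra columns can be skipped, none of which is specified in this section.

```python
from collections import namedtuple

Loop = namedtuple("Loop", "chr1 x1 x2 chr2 y1 y2")


def read_bedpe(path):
    """Read loop calls from a .bedpe file with columns chr1 x1 x2 chr2 y1 y2."""
    loops = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if line.startswith("#") or len(fields) < 6:
                continue  # skip comments and incomplete lines
            try:
                x1, x2 = int(fields[1]), int(fields[2])
                y1, y2 = int(fields[4]), int(fields[5])
            except ValueError:
                continue  # skip a header row, if present
            loops.append(Loop(fields[0], x1, x2, fields[3], y1, y2))
    return loops


def anchor_distance(loop):
    """Distance in bp between the midpoints of the two loop anchors."""
    return abs((loop.y1 + loop.y2) // 2 - (loop.x1 + loop.x2) // 2)
```

For example, read_bedpe("hicfoundation_inference/loop_detection/HiCFoundation_loop_0.9.bedpe") would return the most confident calls from the command above, and anchor_distance gives each loop's genomic span.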

Loop calls from low-coverage Hi-C

python3 inference.py --input example/GSE174533_1-C11-CB1.2-C11-CB2.merge.hic --batch_size 4 --resolution 10000 --task 2 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_loop_lc.pth.tar --output hicfoundation_inference/loop_detection_lc/ --gpu "0"

This uses the low-coverage HSPC example to call loops from low-coverage Hi-C.
The output loop detection is saved in hicfoundation_inference/loop_detection_lc/HiCFoundation_loop_0.5.bedpe.
You can also check other more confident loop calls under hicfoundation_inference/loop_detection_lc directory.

3. Inference for resolution enhancement

python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 3 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu] --genome_id [genome_id]
  • input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
  • infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
  • hic_resolution: resolution of the input matrix, default: 10000 (10 kb for resolution enhancement, should also work for 5kb).
  • input_submatrix_length: input submatrix row size, default: 224.
  • input_submatrix_width: input submatrix column size, default: 224.
  • stride: scanning stride for the input Hi-C matrix, default: 20.
  • scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time).
  • trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_resolution.pth.tar. Make sure you follow the installation instructions to download it before you run.
  • output_dir: output directory to save the results, default: hicfoundation_inference.
  • gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
  • genome_id: genome id for generating .hic file. Must be one of hg18, hg19, hg38, dMel, mm9, mm10, anasPlat1, bTaurus3, canFam3, equCab2, galGal4, Pf3D7, sacCer3, sCerS288c, susScr3, or TAIR10; alternatively, this can be the path of the chrom.sizes file that lists on each line the name and size of the chromosomes.

The output is saved in ``output_dir``, where the enhanced Hi-C is saved in HiCFoundation_enhanced.pkl and HiCFoundation_enhanced.[ext], where ext matches the input format.
The pkl file stores a dict of all enhanced Hi-C matrices, with the chrom name as the key and a scipy.sparse/numpy array as the value.
You can also use [array2hic.py](utils/array2hic.py) and [array2cool.py](utils/array2cool.py) to convert the .pkl to .hic and .cool, respectively.
Example command:
python3 inference.py --input example/ENCFF689CUX.hic --batch_size 4 --resolution 10000 --task 3 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_resolution.pth.tar --output hicfoundation_inference/resolution_enhancement/ --gpu "0" --genome_id hg38

This uses the low-coverage example ENCFF689CUX.hic to run the inference.
The output enhanced Hi-C is saved in hicfoundation_inference/resolution_enhancement/HiCFoundation_enhanced.pkl and hicfoundation_inference/resolution_enhancement/HiCFoundation_enhanced.hic.
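
To inspect the enhanced matrices in Python, the .pkl can be loaded and each value converted to a dense array. The sketch below uses hypothetical helper names and handles both value types mentioned above by checking for a toarray method; unpickling scipy.sparse matrices requires scipy to be installed.

```python
import pickle

import numpy as np


def load_matrices(pkl_path):
    """Load the {chrom_name: matrix} dict stored in an enhanced .pkl file."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle)


def to_dense(matrix):
    """Return a dense numpy array for either a scipy.sparse matrix or an array."""
    if hasattr(matrix, "toarray"):  # scipy.sparse matrices expose toarray()
        return np.asarray(matrix.toarray())
    return np.asarray(matrix)


def summarize_matrices(matrices):
    """Report shape, number of nonzero entries, and total signal per chromosome."""
    summary = {}
    for chrom, matrix in matrices.items():
        dense = to_dense(matrix)
        summary[chrom] = {
            "shape": dense.shape,
            "nonzero": int(np.count_nonzero(dense)),
            "total": float(dense.sum()),
        }
    return summary
```

Converting a whole chromosome to a dense array at 10 kb can use several gigabytes of memory, so for large genomes it may be preferable to summarize one chromosome at a time.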

4. Inference for epigenomic assays profiling

python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 4 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu] 
  • input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
  • infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
  • hic_resolution: resolution of the input matrix, default: 1000 (1 kb for epigenomic assays prediction).
  • input_submatrix_length: input submatrix row size, default: 128 (covers 128 kb region to predict 128 kb region).
  • input_submatrix_width: input submatrix column size, default: 4000 (covers full off-diagonal 2 Mb region for more accurate prediction).
  • stride: scanning stride for the input Hi-C matrix, default: 32 (64 should yield similar results but should be much faster).
  • scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time).
  • trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_epigenomic.pth.tar. Make sure you follow the installation instructions to download it before you run.
  • output_dir: output directory to save the results, default: hicfoundation_inference.
  • gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.

The output is saved in the ``output_dir``, where the predicted epigenomic assays are saved in the HiCFoundation_epigenomic_assay_prediction_[assay_name].pkl and HiCFoundation_pred_[assay_name].bigWig.
The output assay includes six different tracks: 'CTCF' (TF ChIP-seq),'H3K4me3' (histone ChIP-seq),'H3K27ac' (histone ChIP-seq),'H3K27me3' (histone ChIP-seq),'ATAC-seq', and 'DNase-seq'.
Each pkl file stores a dict for the corresponding assay, with the chrom name as the key and a numpy array of the predicted signal at 1 kb resolution as the value.
Each bigWig file records the signal of the corresponding assay, which you can visualize [online](https://igv.org/app/).
You can also use [array2bigwig.py](utils/array2bigwig.py) to convert the .pkl to .bigWig file for visualization.
Example command:
python3 inference.py --input example/4DNFITUOMFUQ.hic --batch_size 4 --resolution 1000 --task 4 --input_row_size 128 --input_col_size 4000 --stride 32 --bound 0 --model_path hicfoundation_model/hicfoundation_epigenomic.pth.tar --output hicfoundation_inference/epigenomic_profiling/ --gpu "0" 

This uses the high-coverage example 4DNFITUOMFUQ.hic to run the inference.
The predicted epigenomic assays are saved in hicfoundation_inference/epigenomic_profiling/HiCFoundation_epigenomic_assay_prediction_[assay_name].pkl and hicfoundation_inference/epigenomic_profiling/HiCFoundation_pred_[assay_name].bigWig.
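
Because the predicted tracks are stored as one value per 1 kb bin, a bin index maps to genomic coordinates by multiplying by the resolution. The following sketch, with a hypothetical helper name, lists the strongest bins of a single chromosome's track. It assumes the track is a one-dimensional array indexed from the start of the chromosome.

```python
import numpy as np


def top_signal_bins(track, resolution=1000, k=3):
    """Return the k strongest bins as (start_bp, end_bp, value), strongest first.

    Assumes bin i covers [i * resolution, (i + 1) * resolution).
    """
    track = np.asarray(track, dtype=np.float64).ravel()
    order = np.argsort(track)[::-1][:k]
    return [
        (int(i) * resolution, (int(i) + 1) * resolution, float(track[i]))
        for i in order
    ]


# Toy track with made-up values for four 1 kb bins.
print(top_signal_bins([0.0, 5.0, 1.0, 9.0], resolution=1000, k=2))
```

In practice the track would come from the per-assay .pkl file, for example the array stored under a chromosome name in the dict for one assay.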

5. Inference for single-cell HiC resolution enhancement

python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 5 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu]
  • input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
  • infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
  • hic_resolution: resolution of the input matrix, recommended: 1,000,000 (1 Mb for single-cell HiC resolution enhancement).
  • input_submatrix_length: input submatrix row size, default: 224 (covers 224 Mb region to predict 224 Mb region).
  • input_submatrix_width: input submatrix column size, default: 224 (covers 224 Mb region to predict 224 Mb region).
  • stride: scanning stride for the input Hi-C matrix, default: 20.
  • scan_boundary: off-diagonal bound for the scanning, recommended: 250.
  • trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_schic.pth.tar. Make sure you follow the installation instructions to download it before you run.
  • output_dir: output directory to save the results, default: hicfoundation_inference.
  • gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.

The output is saved in output_dir, where the enhanced single-cell HiC matrices are saved in HiCFoundation_sc_enhanced.pkl and HiCFoundation_sc_enhanced.pairs.

Example command:
python3 inference.py --input example/GSM7006609_ValbB8w1081.pairs --batch_size 4 --resolution 1000000 --task 5 --input_row_size 224 --input_col_size 224 --stride 20 --bound 250 --model_path hicfoundation_model/hicfoundation_schic.pth.tar --output hicfoundation_inference/sc_hic_enhancement --gpu "0"

This uses the given example GSM7006609_ValbB8w1081.pairs to run the inference.

The output enhanced Hi-C is saved in hicfoundation_inference/sc_hic_enhancement/HiCFoundation_sc_enhanced.pkl and hicfoundation_inference/sc_hic_enhancement/HiCFoundation_sc_enhanced.pairs.
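
To make the relationship between a .pairs file and the 1 Mb resolution concrete, the sketch below bins pairs-format lines into per-chromosome contact counts. It is an illustration of what binning means, not HiCFoundation's own loader: the function name is hypothetical, only the five columns readID, chr1, pos1, chr2, pos2 are used, and inter-chromosomal pairs are dropped as a simplifying assumption.

```python
from collections import defaultdict


def bin_pairs(lines, resolution=1_000_000):
    """Count intra-chromosomal contacts per (row_bin, col_bin) for each chromosome."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if len(fields) < 5:
            continue
        _read_id, chr1, pos1, chr2, pos2 = fields[:5]
        if chr1 != chr2:
            continue  # simplifying assumption: keep intra-chromosomal pairs only
        # Order the two bins so each contact lands in the upper triangle.
        i, j = sorted((int(pos1) // resolution, int(pos2) // resolution))
        counts[chr1][(i, j)] += 1
    return {chrom: dict(bins) for chrom, bins in counts.items()}


# Toy input with made-up reads.
toy_lines = [
    "#readID\tchr1\tpos1\tchr2\tpos2",
    "r1\tchr1\t1500000\tchr1\t3200000",
    "r2\tchr1\t3900000\tchr1\t1000001",
    "r3\tchr1\t10\tchr2\t20",
]
print(bin_pairs(toy_lines))
```

With a real file, the lines can be streamed with open(path) and passed directly to bin_pairs.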

Generate multi-scale Hi-C embeddings

Inference of pre-trained HiCFoundation model to generate patch, submatrix, chromosome and genome wide Hi-C embeddings.

Overview

This includes four levels of embeddings from the pre-trained HiCFoundation model:

  • patch-level embedding: an embedding vector that corresponds to a 16*16 patch at the specified resolution.
  • submatrix-level embedding: an embedding vector that corresponds to the specified submatrix at the specified resolution.
  • chromosome-level embedding: embedding vectors that correspond to different chromosomes at the specified resolution.
  • genome-wide embedding: an embedding vector that corresponds to the input Hi-C at the specified resolution.

Input format

HiCFoundation supports the .hic/.cool/.pkl/.txt/.pairs/.npy formats.

  • .hic/.cool: the common Hi-C formats that store the final matrix of a Hi-C experiment.
  • .pkl: a pickle file that stores a dict of all Hi-C matrices, with the chrom name as the key and a scipy.sparse/numpy array as the value: [chrom_name]:[matrix].
  • .txt/.pairs: a pairs-format text file in which each line, "#readID\tchr1\tpos1\tchr2\tpos2", records an interaction between chr1:pos1 and chr2:pos2.
  • .npy: a numpy array that records the contact map of a specific chromosome.

Example

Please download the following files to the example folder for example testing purposes.

Other format examples

Inference

python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 6 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu] --embed_depth [embed_depth]
  • input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
  • infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
  • hic_resolution: resolution of the input matrix, default: 5000/10000 (5kb or 10kb should work the best since pre-trained at 5kb).
  • input_submatrix_length: input submatrix row size.
  • input_submatrix_width: input submatrix column size. Choose input_submatrix_length and input_submatrix_width based on the submatrix size you are interested in; both must be a multiple of 16.
  • stride: scanning stride for the input Hi-C matrix, default: 20. Please adjust it based on your interest.
  • scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time). Please adjust it based on your interest region. The default only covers the input_submatrix_width*resolution off-diagonal region.
  • trained_model_path: load pre-trained model for inference. Here the model should be hicfoundation_pretrain.pth.tar. Make sure you follow the installation instructions to download it before you run.
  • output_dir: output directory to save the results, default: hicfoundation_embedding.
  • gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
  • embed_depth: specifies which embedding to use, default: 0 (encoder output embeddings). You can also specify k from 1 to 8 to use the output of the k-th decoder layer.

The output is saved in ``output_dir``, where the embeddings are saved in HiCFoundation_embedding.pkl.
It is a dict with four keys that correspond to the four levels of embeddings:

  • "patch_embedding": the patch-level embeddings. It keeps a dict with "chrom:pos1,pos2" as the key and the HiCFoundation embedding as the value. "chrom:pos1,pos2" indicates the center of the corresponding patch on ``chrom``, with row at ``pos1`` and col at ``pos2``.
  • "submat_embedding": the submatrix-level embeddings. The submatrix size is defined by the input params ``input_row_size`` and ``input_col_size``. It keeps a dict with "chrom:pos1,pos2" as the key and the HiCFoundation embedding as the value. "chrom:pos1,pos2" indicates the center of the corresponding patch on ``chrom``, with row at ``pos1`` and col at ``pos2``.
  • "chromo_embedding": the chromosome-level embeddings. It keeps a dict with "chrom" as the key and the HiCFoundation embedding of the corresponding "chrom" as the value.
  • "genome_embedding": the genome-level embedding of the input Hi-C. It keeps a single embedding vector as the value of "genome_embedding".
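
A quick way to get oriented in this nested dict is to count the entries and collect the embedding shapes at each level. The sketch below does that with a hypothetical helper. It assumes every embedding can be converted with numpy.asarray; the toy vectors and their length of 8 are made up and say nothing about the real embedding dimension.

```python
import numpy as np


def describe_embeddings(embeddings):
    """Summarize entry counts and embedding shapes for the four embedding levels."""
    report = {}
    for level in ("patch_embedding", "submat_embedding", "chromo_embedding"):
        entries = embeddings.get(level, {})
        shapes = {np.asarray(value).shape for value in entries.values()}
        report[level] = {"count": len(entries), "shapes": sorted(shapes)}
    genome = embeddings.get("genome_embedding")
    report["genome_embedding"] = None if genome is None else np.asarray(genome).shape
    return report


# Toy dict that mimics the documented layout with made-up vectors.
toy_embeddings = {
    "patch_embedding": {"chr1:8,8": np.zeros(8), "chr1:8,24": np.zeros(8)},
    "submat_embedding": {"chr1:200,200": np.zeros(8)},
    "chromo_embedding": {"chr1": np.zeros(8)},
    "genome_embedding": np.zeros(8),
}
print(describe_embeddings(toy_embeddings))
```

The real dict can be obtained by unpickling HiCFoundation_embedding.pkl with the standard pickle module and passed to the same function.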

Example command

python3 inference.py --input example/4DNFITUOMFUQ.hic --batch_size 4 --resolution 10000 --task 6 --input_row_size 400 --input_col_size 400 --stride 80 --bound 200 --model_path hicfoundation_model/hicfoundation_pretrain.pth.tar --output hicfoundation_inference/hicfoundation_embedding/ --gpu "0" --embed_depth 0

This uses the example 4DNFITUOMFUQ.hic to run the inference with a submatrix size of 400*400, covering 6 Mb of off-diagonal regions.
The output Hi-C embedding is saved in hicfoundation_inference/hicfoundation_embedding/HiCFoundation_embedding.pkl in dict format.
It includes four levels of embeddings: patch-level, submatrix-level, chromosome-level, and genome-wide embeddings. See more details above.
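
The numbers in the example command can be checked with a little arithmetic. The patch count follows from the 16*16 patch size and the multiple-of-16 requirement stated above. The off-diagonal span formula, (input_col_size + bound) * resolution, is an assumption inferred from the description of --bound and the 6 Mb figure, not a documented formula. Both function names are hypothetical.

```python
PATCH_SIZE = 16  # patch edge length in bins, as described for patch-level embeddings


def patches_per_submatrix(row_size, col_size, patch_size=PATCH_SIZE):
    """Number of patches that tile a row_size x col_size submatrix."""
    if row_size % patch_size or col_size % patch_size:
        raise ValueError("row_size and col_size must both be multiples of 16")
    return (row_size // patch_size) * (col_size // patch_size)


def offdiagonal_span_bp(col_size, bound, resolution):
    """Assumed off-diagonal coverage in bp: (col_size + bound) * resolution."""
    return (col_size + bound) * resolution


print(patches_per_submatrix(400, 400))        # 25 * 25 patches
print(offdiagonal_span_bp(400, 200, 10_000))  # span in bp for the example command
```

Under these assumptions the example command yields 625 patches per submatrix and a 6,000,000 bp span, which matches the 6 Mb quoted for it.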

Fine-tuning HiCFoundation for new tasks

Fine-tune the pre-trained HiCFoundation model on new downstream tasks of interest using your own data.