HiCFoundation is a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species.
Copyright (C) 2024 Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, and Sheng Wang
License: Apache License 2.0
Contact: Sergei Doulatov ([email protected]) & William Stafford Noble ([email protected]) & Sheng Wang ([email protected])
For technical problems or questions, please reach out to Xiao Wang ([email protected]) and Yuanyuan Zhang ([email protected]).
Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, & Sheng Wang. A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species. bioRxiv, 2024. Paper
@article{wang2024hicfoundation,
title={A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species},
author={Xiao Wang and Yuanyuan Zhang and Suhita Ray and Anupama Jha and Tangqi Fang and Shengqi Hang and Sergei Doulatov and William Stafford Noble and Sheng Wang},
journal={bioRxiv},
year={2024}
}
The genetic information within nuclear DNA is organized into a compact three-dimensional (3D) structure that impacts critical cellular processes. High-throughput chromosome conformation capture (Hi-C) stands as the most widely used method for measuring 3D genome architecture, while linear epigenomic assays, such as ATAC-seq, DNase-seq, and ChIP-seq, are extensively employed to characterize genome regulatory activities. However, the integrative analysis of chromatin interactions and associated gene regulatory mechanisms remains challenging due to the mismatched resolution between Hi-C and epigenomic assays, as well as inconsistencies among analysis tools. Here we propose HiCFoundation, a Hi-C-based foundation model for genome architecture and regulatory functions analysis. HiCFoundation is trained on hundreds of Hi-C assays encompassing 118 million contact matrix patches. The model achieves state-of-the-art performance in multiple types of 3D genome analysis, including reproducibility analysis, resolution enhancement, and loop detection, offering high efficiency and broad applicability. We further demonstrate the model's generalizability to genome architecture analysis of 316 species. Notably, by enabling analysis of low-coverage experimental data, HiCFoundation reveals genome-wide loop loss during differentiation of HSPCs to neutrophils. Additionally, HiCFoundation is able to predict multiple gene regulatory activities from Hi-C input by generating epigenomic assays, and further offers interpretable analysis to reveal the relationship between chromatin conformation and genome function. Finally, HiCFoundation can analyze single-cell Hi-C data, shedding light on genome structure at single-cell resolution.
HiCFoundation thus provides a unified, efficient, generalizable, and interpretable foundation for integrative, multi-species, single-cell, and multi-omics analyses, paving the path for systematically studying genome 3D architecture and its regulatory mechanisms.
1) Pre-training stage: the model is trained in a self-supervised fashion on massive quantities of unlabeled Hi-C data. The model takes masked Hi-C submatrices as input, optimizing for the reconstruction of the full submatrix.
2) Fine-tuning stage: the model is fine-tuned and tested for diverse downstream tasks, including integrative Hi-C analysis, multi-omics analysis, and single-cell analysis.
- CPU: 4 cores or higher
- Memory: 12GB RAM or higher
- GPU: CUDA-compatible with minimum 12GB memory
- Note: a GPU is mandatory for running HiCFoundation.
1. Install git
git clone https://github.com/Noble-Lab/HiCFoundation.git && cd HiCFoundation
Install anaconda from https://www.anaconda.com/download#downloads.
conda env create -f environment.yml
Each time you want to run HiCFoundation, activate the environment with
conda activate HiCFoundation
# To exit
conda deactivate
You can download our pre-trained and fine-tuned models to hicfoundation_model for inference, embedding generation, and fine-tuning purposes.
HiCFoundation model weights: hicfoundation_model
You can also run the following commands to download them:
cd hicfoundation_model
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_pretrain.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_reproducibility.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_loop.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_loop_lc.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_resolution.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_epigenomic.pth.tar
wget https://huggingface.co/wang3702/hicfoundation_models/resolve/main/hicfoundation_schic.pth.tar
cd ..
Juicebox: https://aidenlab.org/juicebox/
HiGlass: https://higlass.io/
Inference of HiCFoundation for chromatin architecture, multi-omics and single-cell analysis
This includes five different fine-tuned models for:
- Reproducibility analysis: HiCFoundation will generate embeddings of the input Hi-C, and the submatrix embeddings can be used to compare across biological replicates and non-replicates.
- Chromatin loop detection: HiCFoundation will generate the loop detection of the input Hi-C in .bedpe format.
- Resolution enhancement: HiCFoundation will generate an enhanced Hi-C map given the input Hi-C.
- Epigenomic assay profiling: HiCFoundation will generate the corresponding epigenomic assays in .bigWig format given the input Hi-C.
- Single-cell Hi-C enhancement: HiCFoundation will generate the enhanced scHi-C given the input scHi-C.
HiCFoundation supports the .hic/.cool/.pkl/.txt/.pairs/.npy format.
- .hic/.cool: the common Hi-C formats that store the final matrix of a Hi-C experiment.
- .pkl: a pickle file that stores a dict of all Hi-C matrices, with the chrom name as the key and a scipy.sparse/numpy array as the value. [chrom_name]:[matrix].
- .txt/.pairs: a text file in pairs format, "#readID\tchr1\tpos1\tchr2\tpos2", where each line records an interaction between chr1:pos1 and chr2:pos2.
- .npy format: a numpy array that records the contact map of a specific chromosome.
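To illustrate how the pairs text format relates to a contact matrix, here is a minimal sketch that bins intra-chromosomal pairs at a fixed resolution. The helper name `pairs_to_matrix` and the toy records are assumptions made for illustration; this is not HiCFoundation's own loader, which reads the file for you.

```python
import io

import numpy as np


def pairs_to_matrix(handle, chrom, resolution):
    """Bin intra-chromosomal pairs of one chromosome into a symmetric matrix.

    Hypothetical helper for illustration only.
    """
    contacts = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        _read_id, chr1, pos1, chr2, pos2 = line.rstrip("\n").split("\t")[:5]
        if chr1 == chrom and chr2 == chrom:
            contacts.append((int(pos1) // resolution, int(pos2) // resolution))
    if not contacts:
        return np.zeros((0, 0), dtype=np.int64)
    size = max(max(i, j) for i, j in contacts) + 1
    matrix = np.zeros((size, size), dtype=np.int64)
    for i, j in contacts:
        matrix[i, j] += 1
        if i != j:
            matrix[j, i] += 1  # mirror the contact to keep the matrix symmetric
    return matrix


# Toy records in the "#readID\tchr1\tpos1\tchr2\tpos2" layout.
toy_pairs = io.StringIO(
    "#readID\tchr1\tpos1\tchr2\tpos2\n"
    "r1\tchr1\t1200\tchr1\t25400\n"
    "r2\tchr1\t1800\tchr1\t26000\n"
    "r3\tchr1\t500\tchr2\t900\n"
)
toy_matrix = pairs_to_matrix(toy_pairs, "chr1", resolution=10000)
print(toy_matrix)
```

The inter-chromosomal record (r3) is dropped here because the sketch builds a single-chromosome matrix.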
Please download the following files to the example folder for example testing purposes.
- Low coverage Hi-C example: https://www.encodeproject.org/files/ENCFF689CUX/@@download/ENCFF689CUX.hic
- Low coverage Hi-C example2: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE174533&format=file&file=GSE174533%5F1%2DC11%2DCB1%2E2%2DC11%2DCB2%2Emerge%2Ehic
- High coverage Hi-C example: https://data.4dnucleome.org/files-processed/4DNFITUOMFUQ/ (4DN requires authentication for downloading, so please download it from the webpage)
- Single-cell Hi-C example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7006609
(The single-cell Hi-C example is already kept in the example directory, so you do not need to download it again.)
- .cool: https://data.4dnucleome.org/files-processed/4DNFI18UHVRO/ (4DN requires authentication for downloading, so please download it from the webpage)
- .txt/.pairs: example/input.pairs
- .pkl: You can run utils/hic2array.py to convert .hic files to .pkl files to see .pkl format.
- .npy: You can use numpy to save any 2D matrix to .npy file to run our inference.
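As a sketch of the .pkl and .npy layouts described above, the snippet below writes a toy dict of contact matrices (one scipy.sparse value and one numpy value) and a single-chromosome .npy file. The random matrices and the file names `toy_input.pkl` and `toy_chr2.npy` are placeholders chosen for this example, not files shipped with HiCFoundation.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(0)


def random_symmetric_counts(size):
    """Toy symmetric count matrix standing in for a real contact map."""
    upper = np.triu(rng.poisson(1.0, size=(size, size)))
    return upper + np.triu(upper, k=1).T


# .pkl layout: chrom name -> scipy.sparse or numpy matrix.
hic_dict = {
    "chr1": coo_matrix(random_symmetric_counts(300)),  # sparse value
    "chr2": random_symmetric_counts(250),              # dense value
}

out_dir = tempfile.mkdtemp()
pkl_path = os.path.join(out_dir, "toy_input.pkl")  # hypothetical file name
npy_path = os.path.join(out_dir, "toy_chr2.npy")   # hypothetical file name

with open(pkl_path, "wb") as f:
    pickle.dump(hic_dict, f)

# .npy layout: the contact map of one chromosome as a 2D array.
np.save(npy_path, hic_dict["chr2"])
print(pkl_path, npy_path)
```

Either file could then be passed to `--input` in place of a .hic file.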
python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 1 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu]
- input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
- infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
- hic_resolution: resolution of the input matrix, default: 25000 (25 kb for reproducibility task).
- input_submatrix_length: input submatrix row size, default: 224.
- input_submatrix_width: input submatrix column size, default: 224.
- stride: scanning stride for the input Hi-C matrix, default: 20.
- scan_boundary: off-diagonal bound for the scanning, default: 0.
- trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_reproducibility.pth.tar. Make sure you follow the installation instructions to download it before you run.
- output_dir: output directory to save the results, default: hicfoundation_inference.
- gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
The output is saved in the ``output_dir``, where the embedding is saved in "HiCFoundation_reproducibility_embedding.pkl" in a dict format.
The key of the dict is "chrom:row_index,col_index", and the value is the corresponding embedding.
This embedding corresponds to the submatrix [row_index:row_index+input_row_size, col_index:col_index+input_col_size] on chromosome ``chrom``.
python3 inference.py --input example/ENCFF689CUX.hic --batch_size 4 --resolution 25000 --task 1 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_reproducibility.pth.tar --output hicfoundation_inference/reproducibility_analysis/ --gpu "0"
This uses the low-coverage example ENCFF689CUX.hic to run the inference.
The output embedding is saved in hicfoundation_inference/reproducibility_analysis/HiCFoundation_reproducibility_embedding.pkl.
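One way to use these embeddings is to compare matching submatrices between two experiments. The sketch below parses the "chrom:row_index,col_index" keys and averages cosine similarity over shared keys. Cosine similarity, the helper names, and the 8-dimensional toy vectors are assumptions for illustration; the reproducibility score used by the authors may be computed differently.

```python
import numpy as np


def parse_key(key):
    """Split 'chrom:row_index,col_index' into (chrom, row_index, col_index)."""
    chrom, coords = key.rsplit(":", 1)
    row_index, col_index = coords.split(",")
    return chrom, int(row_index), int(col_index)


def mean_cosine_similarity(embed_a, embed_b):
    """Average cosine similarity over submatrices present in both dicts."""
    shared = sorted(set(embed_a) & set(embed_b))
    if not shared:
        raise ValueError("the two embedding dicts share no submatrix keys")
    sims = []
    for key in shared:
        a = np.asarray(embed_a[key], dtype=float).ravel()
        b = np.asarray(embed_b[key], dtype=float).ravel()
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))


# Toy stand-ins for two loaded HiCFoundation_reproducibility_embedding.pkl dicts.
rng = np.random.default_rng(0)
rep1 = {f"chr1:{i},{i}": rng.normal(size=8) for i in range(0, 100, 20)}
rep2 = {k: v + 0.01 * rng.normal(size=8) for k, v in rep1.items()}

print(parse_key("chr1:40,40"))
print(mean_cosine_similarity(rep1, rep2))
```

With real outputs, each dict would come from `pickle.load` on the .pkl file produced by one inference run.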
python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 2 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu]
- input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
- infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
- hic_resolution: resolution of the input matrix, default: 10000 (10 kb for loop detection).
- input_submatrix_length: input submatrix row size, default: 224.
- input_submatrix_width: input submatrix column size, default: 224.
- stride: scanning stride for the input Hi-C matrix, default: 20.
- scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time). You can also use 200, the detection results should be similar.
- trained_model_path: load fine-tuned model for inference. Use "hicfoundation_loop.pth.tar" for high-coverage loop detection and "hicfoundation_loop_lc.pth.tar" for low-coverage loop detection. For human datasets, experiments with fewer than 50M total reads are treated as low-coverage, which corresponds to fewer than roughly 200 reads per 10 kb. Make sure you follow the installation instructions to download the models before you run.
- output_dir: output directory to save the results, default: hicfoundation_inference.
- gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
The output is saved in the ``output_dir``, where the loops are saved in HiCFoundation_loop_[threshold].bedpe. We keep three confidence levels (0.5, 0.75, and 0.9) for you to choose from. For conservative loop calls, we recommend the 0.9 threshold. For low-coverage Hi-C, we recommend the 0.5 threshold.
Each line of the .bedpe file records one loop call in the format [chr1 x1 x2 chr2 y1 y2], where chr1 is typically the same as chr2; [x1 x2] is the span of the left loop anchor and [y1 y2] is the span of the right loop anchor.
Loop calls from high-coverage Hi-C
python3 inference.py --input example/4DNFITUOMFUQ.hic --batch_size 4 --resolution 10000 --task 2 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_loop.pth.tar --output hicfoundation_inference/loop_detection/ --gpu "0"
This uses the high-coverage example 4DNFITUOMFUQ.hic to run the inference.
The output loop detection is saved in hicfoundation_inference/loop_detection.
You can find HiCFoundation_loop_0.5.bedpe, HiCFoundation_loop_0.75.bedpe and HiCFoundation_loop_0.9.bedpe.
HiCFoundation_loop_0.9.bedpe includes the most confident loop calls. You can also choose HiCFoundation_loop_0.5.bedpe if you want more loop calls.
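For downstream work with the loop calls, here is a minimal sketch that parses the [chr1 x1 x2 chr2 y1 y2] layout and computes the distance between anchor midpoints. It assumes whitespace-separated columns; the helper names and toy coordinates are made up for illustration.

```python
import io


def read_loops(handle):
    """Parse loop calls laid out as chr1 x1 x2 chr2 y1 y2 (whitespace-separated)."""
    loops = []
    for line in handle:
        fields = line.split()
        if len(fields) < 6 or line.startswith("#"):
            continue  # skip blank, short, or comment lines
        chr1, x1, x2, chr2, y1, y2 = fields[:6]
        try:
            loops.append((chr1, int(x1), int(x2), chr2, int(y1), int(y2)))
        except ValueError:
            continue  # tolerate a possible non-numeric header row
    return loops


def loop_span(loop):
    """Distance in bp between the midpoints of the two anchors of one loop."""
    _, x1, x2, _, y1, y2 = loop
    return abs((y1 + y2) // 2 - (x1 + x2) // 2)


# Toy records; with real output, open HiCFoundation_loop_0.9.bedpe instead.
toy_bedpe = io.StringIO(
    "chr1\t1000000\t1010000\tchr1\t1500000\t1510000\n"
    "chr2\t200000\t210000\tchr2\t260000\t270000\n"
)
loops = read_loops(toy_bedpe)
spans = [loop_span(loop) for loop in loops]
print(spans)
```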
Loop calls from low-coverage Hi-C
python3 inference.py --input example/GSE174533_1-C11-CB1.2-C11-CB2.merge.hic --batch_size 4 --resolution 10000 --task 2 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_loop_lc.pth.tar --output hicfoundation_inference/loop_detection_lc/ --gpu "0"
This uses the low-coverage HSPC example (GSE174533) to call loops from low-coverage Hi-C.
The output loop detection is saved in hicfoundation_inference/loop_detection_lc/HiCFoundation_loop_0.5.bedpe.
You can also check the more confident loop calls under the hicfoundation_inference/loop_detection_lc directory.
python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 3 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu] --genome_id [genome_id]
- input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
- infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
- hic_resolution: resolution of the input matrix, default: 10000 (10 kb for resolution enhancement, should also work for 5kb).
- input_submatrix_length: input submatrix row size, default: 224.
- input_submatrix_width: input submatrix column size, default: 224.
- stride: scanning stride for the input Hi-C matrix, default: 20.
- scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time).
- trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_resolution.pth.tar. Make sure you follow the installation instructions to download it before you run.
- output_dir: output directory to save the results, default: hicfoundation_inference.
- gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
- genome_id: genome id for generating .hic file. Must be one of hg18, hg19, hg38, dMel, mm9, mm10, anasPlat1, bTaurus3, canFam3, equCab2, galGal4, Pf3D7, sacCer3, sCerS288c, susScr3, or TAIR10; alternatively, this can be the path of the chrom.sizes file that lists on each line the name and size of the chromosomes.
The output is saved in the ``output_dir``, where the enhanced Hi-C is saved in HiCFoundation_enhanced.pkl and HiCFoundation_enhanced.[ext], where ext corresponds to the same format as the input.
The pkl file stores a dict of all enhanced Hi-C matrices, with the chrom name as the key and a scipy.sparse/numpy array as the value.
You can also use [array2hic.py](utils/array2hic.py) and [array2cool.py](utils/array2cool.py) to convert the .pkl to .hic and .cool, respectively.
python3 inference.py --input example/ENCFF689CUX.hic --batch_size 4 --resolution 10000 --task 3 --input_row_size 224 --input_col_size 224 --stride 20 --bound 0 --model_path hicfoundation_model/hicfoundation_resolution.pth.tar --output hicfoundation_inference/resolution_enhancement/ --gpu "0" --genome_id hg38
This uses the low-coverage example ENCFF689CUX.hic to run the inference.
The output enhanced Hi-C is saved in hicfoundation_inference/resolution_enhancement/HiCFoundation_enhanced.pkl and hicfoundation_inference/resolution_enhancement/HiCFoundation_enhanced.hic.
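A quick sanity check on the enhanced matrices is to look at how the mean contact signal decays with genomic distance. The sketch below handles both scipy.sparse and numpy values in the dict; the helper names and the toy matrix are assumptions for illustration, and with real output you would first `pickle.load` HiCFoundation_enhanced.pkl.

```python
import numpy as np


def to_dense(matrix):
    """Return a dense numpy array whether the value is scipy.sparse or numpy."""
    if hasattr(matrix, "toarray"):  # scipy.sparse matrices expose toarray()
        return matrix.toarray()
    return np.asarray(matrix)


def distance_decay(matrix, max_offset):
    """Mean signal on each diagonal, from offset 0 up to max_offset bins."""
    dense = to_dense(matrix)
    max_offset = min(max_offset, dense.shape[0] - 1)
    return np.array([np.diagonal(dense, k).mean() for k in range(max_offset + 1)])


# Toy dict standing in for the contents of HiCFoundation_enhanced.pkl.
toy_enhanced = {"chr1": np.eye(4) + 0.5 * np.eye(4, k=1) + 0.5 * np.eye(4, k=-1)}
decay = {chrom: distance_decay(m, max_offset=2) for chrom, m in toy_enhanced.items()}
print(decay["chr1"])
```

Multiplying the diagonal offset by the resolution used at inference gives the genomic distance in bp.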
python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 4 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu]
- input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
- infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
- hic_resolution: resolution of the input matrix, default: 1000 (1 kb for epigenomic assays prediction).
- input_submatrix_length: input submatrix row size, default: 128 (covers 128 kb region to predict 128 kb region).
- input_submatrix_width: input submatrix column size, default: 4000 (covers full off-diagonal 2 Mb region for more accurate prediction).
- stride: scanning stride for the input Hi-C matrix, default: 32 (64 should yield similar results but should be much faster).
- scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time).
- trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_epigenomic.pth.tar. Make sure you follow the installation instructions to download it before you run.
- output_dir: output directory to save the results, default: hicfoundation_inference.
- gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
The output is saved in the ``output_dir``, where the predicted epigenomic assays are saved in the HiCFoundation_epigenomic_assay_prediction_[assay_name].pkl and HiCFoundation_pred_[assay_name].bigWig.
The output assay includes six different tracks: 'CTCF' (TF ChIP-seq),'H3K4me3' (histone ChIP-seq),'H3K27ac' (histone ChIP-seq),'H3K27me3' (histone ChIP-seq),'ATAC-seq', and 'DNase-seq'.
The pkl file stores a dict of the corresponding assay, with the chrom name as the key and a numpy array recording the predicted assay at 1 kb resolution as the value.
The bigWig file records the signal of the corresponding assay; you can visualize it [online](https://igv.org/app/).
You can also use [array2bigwig.py](utils/array2bigwig.py) to convert the .pkl to .bigWig file for visualization.
python3 inference.py --input example/4DNFITUOMFUQ.hic --batch_size 4 --resolution 1000 --task 4 --input_row_size 128 --input_col_size 4000 --stride 32 --bound 0 --model_path hicfoundation_model/hicfoundation_epigenomic.pth.tar --output hicfoundation_inference/epigenomic_profiling/ --gpu "0"
This uses the high-coverage example 4DNFITUOMFUQ.hic to run the inference.
The predicted epigenomic assays are saved in hicfoundation_inference/epigenomic_profiling/HiCFoundation_epigenomic_assay_prediction_[assay_name].pkl and hicfoundation_inference/epigenomic_profiling/HiCFoundation_pred_[assay_name].bigWig.
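To inspect a predicted track without a genome browser, the sketch below turns one chromosome's 1 kb signal array into genomic intervals of high signal. The quantile cutoff and the helper name `high_signal_intervals` are arbitrary choices for illustration and are not a peak-calling method provided by HiCFoundation.

```python
import numpy as np


def high_signal_intervals(track, resolution=1000, quantile=0.99):
    """Merge consecutive bins above a quantile cutoff into (start, end) in bp."""
    track = np.asarray(track, dtype=float)
    cutoff = np.quantile(track, quantile)
    above = track > cutoff
    intervals = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i  # a new run of high-signal bins begins
        elif not flag and start is not None:
            intervals.append((start * resolution, i * resolution))
            start = None
    if start is not None:  # close a run that reaches the end of the track
        intervals.append((start * resolution, len(track) * resolution))
    return intervals


# Toy track standing in for one chromosome of a predicted assay .pkl dict.
toy_track = np.zeros(100)
toy_track[10:13] = 5.0
toy_track[50] = 7.0
toy_intervals = high_signal_intervals(toy_track, resolution=1000, quantile=0.9)
print(toy_intervals)
```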
python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 5 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu]
- input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
- infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
- hic_resolution: resolution of the input matrix, recommended: 1,000,000 (1 Mb for single-cell Hi-C resolution enhancement).
- input_submatrix_length: input submatrix row size, default: 224 (covers 224 Mb region to predict 224 Mb region).
- input_submatrix_width: input submatrix column size, default: 224 (covers 224 Mb region to predict 224 Mb region).
- stride: scanning stride for the input Hi-C matrix, default: 20.
- scan_boundary: off-diagonal bound for the scanning, recommended: 250.
- trained_model_path: load fine-tuned model for inference. Here the model should be hicfoundation_schic.pth.tar. Make sure you follow the installation instructions to download it before you run.
- output_dir: output directory to save the results, default: hicfoundation_inference.
- gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
The output is saved in the ``output_dir``, where the enhanced single-cell Hi-C matrix is saved in HiCFoundation_sc_enhanced.pkl and HiCFoundation_sc_enhanced.pairs.
python3 inference.py --input example/GSM7006609_ValbB8w1081.pairs --batch_size 4 --resolution 1000000 --task 5 --input_row_size 224 --input_col_size 224 --stride 20 --bound 250 --model_path hicfoundation_model/hicfoundation_schic.pth.tar --output hicfoundation_inference/sc_hic_enhancement --gpu "0"
This uses the given example GSM7006609_ValbB8w1081.pairs to run the inference.
The output enhanced Hi-C is saved in hicfoundation_inference/sc_hic_enhancement/HiCFoundation_sc_enhanced.pkl and hicfoundation_inference/sc_hic_enhancement/HiCFoundation_sc_enhanced.pairs.
Inference of pre-trained HiCFoundation model to generate patch, submatrix, chromosome and genome wide Hi-C embeddings.
This includes four levels of embeddings from the pre-trained HiCFoundation model:
- patch level embedding: an embedding vector that corresponds to a 16*16 patch at the specified resolution.
- submatrix level embedding: an embedding vector that corresponds to the specified submatrix at the specified resolution.
- chromosome level embedding: embedding vectors that correspond to different chromosomes at the specified resolution.
- genome wide embedding: an embedding vector that corresponds to the input Hi-C at the specified resolution.
HiCFoundation supports the .hic/.cool/.pkl/.txt/.pairs/.npy format.
- .hic/.cool: the common Hi-C formats that store the final matrix of a Hi-C experiment.
- .pkl: a pickle file that stores a dict of all Hi-C matrices, with the chrom name as the key and a scipy.sparse/numpy array as the value. [chrom_name]:[matrix].
- .txt/.pairs: a text file in pairs format, "#readID\tchr1\tpos1\tchr2\tpos2", where each line records an interaction between chr1:pos1 and chr2:pos2.
- .npy format: a numpy array that records the contact map of a specific chromosome.
Please download the following files to the example folder for example testing purposes.
- A Hi-C example: https://data.4dnucleome.org/files-processed/4DNFITUOMFUQ/ (4DN requires authentication for downloading, so please download it from the webpage)
- A single-cell Hi-C example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7006527
(The single-cell Hi-C example is already kept in the example directory, so you do not need to download it again.)
- .cool: https://data.4dnucleome.org/files-processed/4DNFI18UHVRO/ (4DN requires authentication for downloading, so please download it from the webpage)
- .txt/.pairs: example/input.pairs
- .pkl: You can run utils/hic2array.py to convert .hic files to .pkl files to see .pkl format.
- .npy: You can use numpy to save any 2D matrix to .npy file to run our inference.
python3 inference.py --input [input_file] --batch_size [infer_batch_size] --resolution [hic_resolution] --task 6 --input_row_size [input_submatrix_length] --input_col_size [input_submatrix_width] --stride [stride] --bound [scan_boundary] --model_path [trained_model_path] --output [output_dir] --gpu [gpu] --embed_depth [embed_depth]
- input_file: a .hic/.cool/.pkl/.txt/.pairs/.npy file that records the Hi-C matrix.
- infer_batch_size: batch size of the input during inference, recommended: 4 for small GPU.
- hic_resolution: resolution of the input matrix, default: 5000/10000 (5kb or 10kb should work the best since pre-trained at 5kb).
- input_submatrix_length: input submatrix row size.
- input_submatrix_width: input submatrix column size. Choose input_submatrix_length and input_submatrix_width based on the submatrix size you are interested in; both must be a multiple of 16.
- stride: scanning stride for the input Hi-C matrix, default: 20. Please adjust it based on your interest.
- scan_boundary: off-diagonal bound for the scanning, default: 0 (to save time). Please adjust it based on your interest region. The default only covers the input_submatrix_width*resolution off-diagonal region.
- trained_model_path: load pre-trained model for inference. Here the model should be hicfoundation_pretrain.pth.tar. Make sure you follow the installation instructions to download it before you run.
- output_dir: output directory to save the results, default: hicfoundation_embedding.
- gpu: which gpu to use, default: None (will use all GPU). You can specify --gpu="0" to only use GPU 0, you can also specify --gpu="0,1" to use GPU0 and GPU1.
- embed_depth: specifies which embedding to use, default: 0 (encoder output embeddings). You can also specify k from 1 to 8 to use the output of the k-th decoder layer.
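The multiple-of-16 requirement on the submatrix size presumably follows from the 16*16 patch size used for patch-level embeddings. The sketch below validates a size choice and reports the resulting patch grid; the helper name `patch_grid` is hypothetical, and reading the grid size as the number of patch positions per submatrix is an assumption, not something the inference script reports.

```python
PATCH_SIZE = 16  # patch edge length stated for patch-level embeddings


def patch_grid(input_row_size, input_col_size, patch_size=PATCH_SIZE):
    """Return (patch rows, patch columns) for a submatrix, or raise if invalid."""
    if input_row_size % patch_size or input_col_size % patch_size:
        raise ValueError(
            f"input sizes must be multiples of {patch_size}, "
            f"got {input_row_size} x {input_col_size}"
        )
    return input_row_size // patch_size, input_col_size // patch_size


rows, cols = patch_grid(400, 400)
print(rows, cols, rows * cols)  # patch grid for a 400 x 400 submatrix
```

A size such as 100 would be rejected here, while 224 or 400 pass the check.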
The output is saved in the ``output_dir``, where the embeddings are saved in the HiCFoundation_embedding.pkl.
It is a dict with four keys that correspond to the four levels of embeddings:
- "patch_embedding": corresponds to patch-level embeddings. It keeps a dict with "chrom:pos1,pos2" as the key and the HiCFoundation embedding as the value. "chrom:pos1,pos2" indicates the center of the corresponding patch on ``chrom``, with the row at ``pos1`` and the column at ``pos2``.
- "submat_embedding": corresponds to the submatrix-level embedding. The submatrix size is defined by the input params ``input_row_size`` and ``input_col_size``. It keeps a dict with "chrom:pos1,pos2" as the key and the HiCFoundation embedding as the value. "chrom:pos1,pos2" indicates the center of the corresponding submatrix on ``chrom``, with the row at ``pos1`` and the column at ``pos2``.
- "chromo_embedding": corresponds to the chromosome-level embedding. It keeps a dict with "chrom" as the key and the HiCFoundation embedding of the corresponding "chrom" as the value.
- "genome_embedding": corresponds to the genome-level embedding of the input Hi-C. It keeps an embedding vector as the value of "genome_embedding".
python3 inference.py --input example/4DNFITUOMFUQ.hic --batch_size 4 --resolution 10000 --task 6 --input_row_size 400 --input_col_size 400 --stride 80 --bound 200 --model_path hicfoundation_model/hicfoundation_pretrain.pth.tar --output hicfoundation_inference/hicfoundation_embedding/ --gpu "0" --embed_depth 0
This uses the example 4DNFITUOMFUQ.hic to run the inference with a submatrix size of 400*400 over the 6 Mb off-diagonal region.
The output Hi-C embedding is saved in hicfoundation_inference/hicfoundation_embedding/HiCFoundation_embedding.pkl in a dict format.
It contains four levels of embeddings: patch level embedding, submatrix level embedding, chromosome level embedding, and genome wide embedding. See more details above.
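Here is a minimal sketch of walking the four keys described above once the .pkl has been loaded with `pickle.load`. The helper names, the toy dict, and its 4-dimensional vectors are assumptions for illustration; real embedding sizes depend on the model and on `--embed_depth`.

```python
import numpy as np


def describe_embeddings(embeddings):
    """Count entries per level and report the genome-level vector shape."""
    summary = {}
    for level in ("patch_embedding", "submat_embedding", "chromo_embedding"):
        summary[level] = len(embeddings[level])
    summary["genome_embedding"] = np.asarray(embeddings["genome_embedding"]).shape
    return summary


def entries_for_chrom(level_dict, chrom):
    """Select 'chrom:pos1,pos2' entries of one chromosome, keyed by (pos1, pos2)."""
    selected = {}
    for key, vector in level_dict.items():
        name, coords = key.rsplit(":", 1)
        if name == chrom:
            pos1, pos2 = (int(p) for p in coords.split(","))
            selected[(pos1, pos2)] = vector
    return selected


# Toy dict mimicking the documented layout of HiCFoundation_embedding.pkl.
toy_embeddings = {
    "patch_embedding": {"chr1:8,8": np.zeros(4), "chr1:8,24": np.ones(4),
                        "chr2:8,8": np.ones(4)},
    "submat_embedding": {"chr1:200,200": np.zeros(4)},
    "chromo_embedding": {"chr1": np.zeros(4), "chr2": np.ones(4)},
    "genome_embedding": np.zeros(4),
}
print(describe_embeddings(toy_embeddings))
print(sorted(entries_for_chrom(toy_embeddings["patch_embedding"], "chr1")))
```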