Why pseudobulk then QC #116

TingTingShao · 2024-03-20T19:08:07Z

Dear,

Q1
Could I please ask why the pseudobulk ATAC-seq profiles per cell type are generated first and QC is conducted afterwards? It seems more intuitive to me to run QC at the sample level first and then generate the pseudobulk ATAC-seq profiles.

Q2
"The cell annotation can also be obtained from alternative methods, such as a preliminary clustering analysis using a predefined set of genome-wide regions/peaks (e.g. SCREEN) as input to identify cell populations." (from the introduction of the tutorial)
Does this mean that the cell annotation can be the ATAC-seq clusters defined with clustering methods?
I have snATAC-seq data that was QC-processed with snapATAC2 and then clustered with Leiden. If I am correct, I can convert the processed AnnData to loom -> pseudobulk peak calling -> regions & bigwig -> fragments matrix -> LDA ...

Q3
"To derive a set of consensus peaks, we use the iterative overlap peak merging procedure described in Corces et al. (2018). First, each summit is extended a peak_half_width (by default, 250bp) in each direction and then we iteratively filter out less significant peaks that overlap with a more significant one." (from the tutorial)
If I understand it correctly, is this procedure similar to the merge peaks function in snapATAC2?

Looking forward to your reply!

Thanks!
tingting

SeppeDeWinter · 2024-03-25T10:03:57Z

Maintainer

Hi @TingTingShao

Please find my answers to your questions below.

Q1: We do the pseudobulking before QC because the QC makes use of the consensus peaks generated after the pseudobulking step, for example to calculate the fraction of reads in peaks (FRIP).
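
To make that dependency concrete, here is a minimal sketch of a FRIP calculation in plain Python. It is an illustration only, not pycisTopic's implementation: the function name `fraction_of_reads_in_peaks`, the tuple layouts and the toy coordinates are all assumptions made for this example. The point is simply that the metric cannot be computed until a peak set exists.

```python
from bisect import bisect_right
from collections import defaultdict


def fraction_of_reads_in_peaks(fragments, peaks):
    """Return {barcode: FRIP} for fragments given as (chrom, start, end, barcode).

    `peaks` is a list of (chrom, start, end) half-open intervals that are
    assumed not to overlap each other, as consensus peaks are after merging.
    """
    # Index peaks per chromosome, sorted by start, for binary search.
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    total = defaultdict(int)
    in_peaks = defaultdict(int)
    for chrom, f_start, f_end, barcode in fragments:
        total[barcode] += 1
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Last peak that starts before the fragment ends. Because peaks do
        # not overlap, if any peak overlaps the fragment then this one does.
        i = bisect_right(starts[chrom], f_end - 1) - 1
        if i >= 0 and intervals[i][1] > f_start:
            in_peaks[barcode] += 1
    return {bc: in_peaks[bc] / total[bc] for bc in total}


# Toy data: two cells, one chromosome with peaks, one without.
toy_fragments = [
    ("chr1", 100, 200, "A"),
    ("chr1", 500, 600, "A"),
    ("chr1", 240, 260, "B"),
    ("chr2", 10, 20, "B"),
]
toy_peaks = [("chr1", 150, 250), ("chr1", 1000, 1100)]
frip = fraction_of_reads_in_peaks(toy_fragments, toy_peaks)
```

In this toy input each cell has one of its two fragments inside a peak, so both barcodes get a FRIP of 0.5.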

Q2: That's correct!

Q3: I'm not too familiar with that specific function, but from a quick glance it looks fairly similar.
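
For readers comparing the two tools, the procedure quoted in Q3 can be sketched as below. This is a simplified illustration of iterative overlap peak merging as the tutorial text describes it, not the code of pycisTopic or snapATAC2. The function name `iterative_overlap_merge` and the (chrom, summit, score) input layout are assumptions for this example, and details such as score normalisation across pseudobulks or clipping at chromosome ends are left out.

```python
def iterative_overlap_merge(summits, peak_half_width=250):
    """Keep the most significant peak among any set of overlapping peaks.

    `summits` is an iterable of (chrom, summit_position, score); a higher
    score means a more significant peak. Returns the retained peaks as
    (chrom, start, end, score), sorted by chromosome and start.
    """
    # Extend every summit by peak_half_width in each direction.
    peaks = [
        (chrom, max(0, pos - peak_half_width), pos + peak_half_width, score)
        for chrom, pos, score in summits
    ]
    # Visit peaks from most to least significant.
    peaks.sort(key=lambda p: p[3], reverse=True)

    kept = []
    for chrom, start, end, score in peaks:
        # Drop the peak if it overlaps one that was already kept, since
        # every kept peak is at least as significant as the current one.
        overlaps_better_peak = any(
            k_chrom == chrom and start < k_end and k_start < end
            for k_chrom, k_start, k_end, _ in kept
        )
        if not overlaps_better_peak:
            kept.append((chrom, start, end, score))
    return sorted(kept, key=lambda p: (p[0], p[1]))


toy_summits = [
    ("chr1", 1000, 5.0),
    ("chr1", 1100, 9.0),
    ("chr1", 1700, 3.0),
    ("chr2", 1000, 1.0),
]
merged = iterative_overlap_merge(toy_summits)
```

The pairwise comparison is quadratic in the number of peaks, which is fine for a demonstration but not for genome-scale input. A result of this kind is a set of fixed-width, non-overlapping peaks.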

I hope this helps.

All the best,

Seppe

2 replies

lf96abc Aug 15, 2024

Hi @SeppeDeWinter

Sorry if this has already been answered, but I'm also interested in Q1 here.
Since these consensus peaks were generated from data that still includes low-quality cells, which are later removed, should they be regenerated using only the cells that will be used in downstream analysis?

Thank you!

SeppeDeWinter Aug 22, 2024
Maintainer

Hi @lf96abc

I see your point. It is a valid concern; however, in general we only calculate the consensus peaks once. We assume that each cluster (cell type) is composed mostly of high-quality cells.

That being said, feel free to regenerate consensus peaks after performing the QC.

All the best,

Seppe

TingTingShao · 2024-03-25T11:04:05Z

Author

Hi,

Many thanks for your reply.

I am actually a bit confused about how to integrate results (AnnData) from snapATAC2 into pycisTopic.

I was going to use pycisTopic and then do the downstream analysis with SCENICplus, but:

With R-based cisTopic, the input files are:

Bam files + regions
Counts matrix (rows: peak coordinates, columns: cells). With millions of regions and ~40,000 cells this would take roughly 3 hrs, given the reference running time (5k cells, 97k regions: 1.5 min), and I didn't see an example of a multi-sample pipeline.

With Python-based cisTopic, I first need to have pseudobulk profiles from cell annotations.

However, for now I only have the snATAC-seq fragments.tsv.gz files without cell annotations.

From what I understood, I cannot get started on the snATAC analysis with cisTopic.

So I preprocessed the data with snapATAC2, and intended to perform the downstream analysis with SCENICplus.

Now I have the QC-processed snATAC-seq AnnData with the consensus peaks and the Leiden clustering; the bigwig and coverage files for each cluster can also be exported.

However, I don't know how to integrate this with SCENICplus or pycisTopic.

If I integrate with pycisTopic, the annotation used to generate the pseudobulk profiles will come from the Leiden clustering, and I don't see the reason to perform LDA. From what I understood, cisTopic uses the LDA probabilistic model for dimensionality reduction (topics), whereas snapATAC2 uses Laplacian eigenmaps for dimensionality reduction. I don't see the rationale for first doing dimensionality reduction with snapATAC2 and then performing LDA with cisTopic.

AnnDataSet object with n_obs x n_vars = 39252 x 6062095 backed at '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_allfragments/02/microglia_1.h5ads'
contains 293 AnnData objects with keys:
    obs: 'sample', 'region', 'subject', 'ad', 'leiden_0.5', 'leiden_mnc_0.5', 'leiden_harmony_0.5'
    var: 'count', 'selected'
    uns: 'AnnDataSet', 'num_eigen', 'macs3', 'reference_sequences', 'spectral_eigenvalue'
    obsm: 'X_umap', 'X_spectral', 'X_spectral_harmony_sample', 'X_spectral_mnc_sample', 'X_umap_harmony_sample', 'X_umap_mnc_sample'
    obsp: 'distances'

Now my main questions are:

With only snATAC-seq fragment data, can I run the analysis with pycisTopic or SCENICplus?
How do I integrate AnnData generated on another platform with SCENICplus or pycisTopic?

Thanks
tingting

1 reply

SeppeDeWinter Mar 26, 2024
Maintainer

Hi

> With Python-based cisTopic, I first need to have pseudobulk profiles from cell annotations.

Most of the time, the input to pycisTopic is a set of fragment files (one per sample) and cell type annotations (cluster annotations are also fine). We use the pseudobulks to generate consensus peaks so that we have features on which to build a count matrix.
However, if you already have a count matrix generated by snapATAC2 you can also use that one. To do this, use the function create_cistopic_object and provide the count matrix via the fragment_matrix argument.

pycisTopic/src/pycisTopic/cistopic_class.py, line 507 in e5d5f19:

def create_cistopic_object(

> From what I understood, I cannot get started on the snATAC analysis with cisTopic.

Why is that? I don't really understand this statement.

> So I preprocessed the data with snapATAC2, and intended to perform the downstream analysis with SCENICplus.

What downstream analysis are you trying to perform? To run SCENIC+ you also need scRNA-seq data.

> Now I have the QC-processed snATAC-seq AnnData with the consensus peaks and the Leiden clustering; the bigwig and coverage files for each cluster can also be exported. However, I don't know how to integrate this with SCENICplus or pycisTopic.

See the first paragraph of this reply.

> If I integrate with pycisTopic, the annotation used to generate the pseudobulk profiles will come from the Leiden clustering, and I don't see the reason to perform LDA. From what I understood, cisTopic uses the LDA probabilistic model for dimensionality reduction (topics), whereas snapATAC2 uses Laplacian eigenmaps for dimensionality reduction. I don't see the rationale for first doing dimensionality reduction with snapATAC2 and then performing LDA with cisTopic.

Topic modelling is not only used to perform dimensionality reduction; it is also used to:

find sets of co-accessible regions, which can be used for motif enrichment analysis (we often also use these for deep learning).
impute accessibility: scATAC-seq has a lot of dropouts, and with topic modelling we can impute these dropouts (making the data less sparse).
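
The imputation point can be illustrated with a toy example. In topic modelling each cell gets a distribution over topics and each topic a distribution over regions; multiplying the two gives a dense cell-by-region probability matrix, which is the idea behind imputing accessibility. The sketch below uses made-up numbers and a hypothetical helper named `impute_accessibility_sketch`. It is not the pycisTopic API.

```python
def impute_accessibility_sketch(cell_topic, topic_region):
    """Multiply a cells x topics matrix by a topics x regions matrix.

    Both inputs are lists of rows. Entry [c][r] of the result is the
    model-based probability of region r in cell c.
    """
    n_topics = len(topic_region)
    n_regions = len(topic_region[0])
    imputed = []
    for cell_row in cell_topic:
        if len(cell_row) != n_topics:
            raise ValueError("cell_topic columns must match topic_region rows")
        imputed.append([
            sum(cell_row[t] * topic_region[t][r] for t in range(n_topics))
            for r in range(n_regions)
        ])
    return imputed


# Two cells, two topics, three regions (toy values; each row sums to 1).
cell_topic = [[0.9, 0.1], [0.2, 0.8]]
topic_region = [[0.7, 0.3, 0.0], [0.0, 0.2, 0.8]]
imputed = impute_accessibility_sketch(cell_topic, topic_region)
```

Every cell receives a non-zero value for each region that carries weight in any of its topics, even if no fragment was observed there, which is why the imputed matrix is less sparse than the raw counts.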

> With only snATAC-seq fragment data, can I run the analysis with pycisTopic or SCENICplus?

Yes, that's possible. In this case you first have to generate a count matrix on some set of predefined regions (for example all of the SCREEN regions, https://screen.encodeproject.org/). After this you can run topic modelling so you can cluster your cells. Based on these clusters you can generate pseudobulk profiles followed by consensus peaks (these will be of higher resolution than the general SCREEN regions, which will probably miss regions for cell types that are not included in ENCODE). Finally, based on these consensus peaks you can generate a count matrix once again and perform topic modelling for the last time. However, given that you have already run snapATAC2, you can skip this first round and either use the count matrix generated by snapATAC2 directly for topic modelling or use the clusters from snapATAC2 to generate pseudobulk profiles.
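
The first step of that workflow, building a count matrix on a predefined region set, can be sketched in plain Python as follows. This is only meant to show what "a count matrix on some set of regions" contains. The helper `count_fragments_in_regions` and the `chrom:start-end` naming of regions are assumptions for this example, and the nested loop is far too slow for real data.

```python
from collections import Counter


def count_fragments_in_regions(fragments, regions):
    """Count fragments overlapping each region, per cell barcode.

    `fragments` holds (chrom, start, end, barcode) tuples and `regions`
    holds (chrom, start, end) tuples, all as half-open intervals.
    Returns a Counter keyed by (region_name, barcode); missing keys are 0.
    """
    counts = Counter()
    for f_chrom, f_start, f_end, barcode in fragments:
        for r_chrom, r_start, r_end in regions:
            if f_chrom == r_chrom and f_start < r_end and r_start < f_end:
                counts[(f"{r_chrom}:{r_start}-{r_end}", barcode)] += 1
    return counts


toy_fragments = [
    ("chr1", 100, 200, "A"),
    ("chr1", 150, 300, "A"),
    ("chr1", 900, 950, "B"),
]
toy_regions = [("chr1", 180, 220), ("chr1", 920, 1000)]
counts = count_fragments_in_regions(toy_fragments, toy_regions)
```

Only the non-zero entries are stored, which mirrors the sparse matrices used for this kind of data.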

> How do I integrate AnnData generated on another platform with SCENICplus or pycisTopic?

See the first paragraph of this reply.

The code to do this will look something like this.

If you want to use the count matrix generated by snapATAC2

from pycisTopic.cistopic_class import create_cistopic_object

cistopic_obj = create_cistopic_object(
   fragment_matrix = adata.X,
   cell_names = adata.obs_names,
   region_names = adata.var_names
)

If you wish to recalculate consensus peaks:

import os

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk

# chromsizes, out_dir and fragments_dict must be defined beforehand;
# replace the two placeholder strings with your own column names.
bw_paths, bed_paths = export_pseudobulk(
    input_data = adata.obs,
    variable = "<COLUMN_NAME_WITH_CLUSTERS>",
    sample_id_col = "<SAMPLE_ID_COLUMN_NAME>",
    chromsizes = chromsizes,
    bed_path = os.path.join(out_dir, "consensus_peak_calling/pseudobulk_bed_files"),
    bigwig_path = os.path.join(out_dir, "consensus_peak_calling/pseudobulk_bw_files"),
    path_to_fragments = fragments_dict,
    n_cpu = 10,
    normalize_bigwig = True,
    temp_dir = "/tmp",
    split_pattern = "-"
)

and follow the remainder of the pycisTopic tutorial.
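
The snippet above assumes that `fragments_dict`, `chromsizes` and `out_dir` already exist. As a rough sketch, the fragment dictionary maps each sample ID (the values in the `sample_id_col` column) to the path of its fragment file. The helper below is hypothetical, and the `<sample>.fragments.tsv.gz` naming scheme is an assumption based on the file names that appear elsewhere in this thread, so adapt it to your own layout.

```python
import os


def build_fragments_dict(sample_ids, fragments_dir, suffix=".fragments.tsv.gz"):
    """Map each sample ID to its fragment file and report missing files.

    Returns (fragments_dict, missing), where `missing` lists the sample IDs
    whose expected file was not found on disk.
    """
    fragments_dict = {}
    missing = []
    for sample_id in sorted(set(sample_ids)):
        path = os.path.join(fragments_dir, f"{sample_id}{suffix}")
        if os.path.isfile(path):
            fragments_dict[sample_id] = path
        else:
            missing.append(sample_id)
    return fragments_dict, missing
```

It could be called as `fragments_dict, missing = build_fragments_dict(adata.obs["sample"], "/path/to/fragments")`, where the directory is a placeholder. Checking that `missing` is empty before the pseudobulk step gives an early warning when a sample ID and a file name disagree.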

I hope this answers your questions.

All the best,

Seppe

TingTingShao · 2024-03-26T09:57:17Z

Author

Thanks very much for your reply.

Quick try with your solution:

I tried the function create_cistopic_object:

dat=anndata.read_h5ad("microglia.h5ad")
dat
AnnData object with n_obs × n_vars = 39252 × 6062095
    obs: 'sample', 'tsse'
    var: 'count', 'selected'
    uns: 'AnnDataSet', 'num_eigen', 'reference_sequences', 'spectral_eigenvalue'
    obsm: 'X_spectral', 'X_spectral_harmony', 'X_spectral_mnc', 'X_umap', 'X_umap_harmony', 'X_umap_mnc'

dat.X
<39252x6062095 sparse matrix of type '<class 'numpy.uint32'>'
	with 349387777 stored elements in Compressed Sparse Row format>

error:

cistopic_obj = create_cistopic_object(
   fragment_matrix = dat.X,
   cell_names = dat.obs_names,
   region_names = dat.var_names
)
error: 39252 columns passed, passed data had 6062095 columns

cistopic_obj = create_cistopic_object(
   fragment_matrix = dat.X.T,
   cell_names = dat.obs_names,
   region_names = dat.var_names
)
error: ValueError: setting an array element with a sequence.

cistopic_obj = create_cistopic_object(
   fragment_matrix = dat.X[ :, dat.var['selected']].T,
   cell_names = dat.obs_names.tolist(),
   region_names = dat.var['selected'][dat.var['selected'] == True].index.tolist()
)
ValueError: setting an array element with a sequence.

Any idea on the reason?

Thanks
tingting

3 replies

SeppeDeWinter Mar 26, 2024
Maintainer

Hi

Not sure. Can you provide the full error message so I can see where in the code the error is thrown?

All the best,

Seppe

TingTingShao Mar 26, 2024
Author

Hi,

I tried with a much smaller dataset.

AnnData object with n_obs × n_vars = 244 × 49737
    obs: 'sample', 'region', 'subject', 'ad'
    var: 'count', 'selected'
    uns: 'AnnDataSet', 'reference_sequences', 'spectral_eigenvalue'
    obsm: 'X_spectral', 'fragment_paired'

cistopic_obj = create_cistopic_object(
   fragment_matrix = adat2.X,
   cell_names = adat2.obs_names,
   region_names = adat2.var_names
)

2024-03-26 15:37:26,310 cisTopic     INFO     Creating CistopicObject
Traceback (most recent call last):
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 970, in _finalize_columns_and_data
    columns = _validate_or_indexify_columns(contents, columns)
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 1018, in _validate_or_indexify_columns
    raise AssertionError(
AssertionError: 244 columns passed, passed data had 49737 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./cistopic.py", line 14, in <module>
    cistopic_obj = create_cistopic_object(
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pycisTopic/cistopic_class.py", line 604, in create_cistopic_object
    cell_data = pd.DataFrame(
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pandas/core/frame.py", line 745, in __init__
    arrays, columns, index = nested_data_to_arrays(
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 511, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 876, in to_arrays
    content, columns = _finalize_columns_and_data(arr, columns, dtype)
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 973, in _finalize_columns_and_data
    raise ValueError(err) from err
ValueError: 244 columns passed, passed data had 49737 columns

Thanks,
tingting
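
The traceback shows the failure occurring where `cell_data` is built with `cell_names` as its columns, and the message (244 columns passed, data had 49737 columns) suggests that the function expects cells along the columns of `fragment_matrix`, i.e. a regions × cells matrix, whereas AnnData stores cells × regions. This reading is inferred from the traceback alone, not from the pycisTopic documentation. A small, hypothetical pre-flight check along these lines can catch such a mismatch before the call:

```python
def check_matrix_orientation(matrix_shape, n_cells, n_regions):
    """Compare a matrix shape with the assumed regions x cells layout.

    Returns "ok" if the shape is (n_regions, n_cells), "transpose" if it is
    (n_cells, n_regions), and raises ValueError if neither layout fits.
    A square matrix is reported as "ok" because the two cases coincide.
    """
    n_rows, n_cols = matrix_shape
    if (n_rows, n_cols) == (n_regions, n_cells):
        return "ok"
    if (n_rows, n_cols) == (n_cells, n_regions):
        return "transpose"
    raise ValueError(
        f"shape {matrix_shape} matches neither {n_regions} regions x "
        f"{n_cells} cells nor its transpose"
    )
```

With the object from this post the call would be `check_matrix_orientation(adat2.X.shape, len(adat2.obs_names), len(adat2.var_names))`, which returns "transpose" for a 244 × 49737 matrix. Note that transposing alone did not resolve the problem in the earlier attempt on the full dataset, so this check only addresses the column-count error.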

TingTingShao Mar 26, 2024
Author

I also tried with the other suggestion you gave me:

adat2 = adat.copy()
del(adat)

cell_data=adat2.obs
cell_data['celltype'] = cell_data['leiden_mnc_1'].astype(str) 
del(adat2)

print(fragments_dict)

tmp_dir='/scratch/leuven/351/vsc35107/'
bed_path = '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/031results_pseudobulk_bed_files/'
bigwig_path = '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/032results_pseudobulk_bw_files'
bw_paths, bed_paths = export_pseudobulk(
    input_data = cell_data,
    variable = 'celltype',  # variable by which to generate pseudobulk profiles, here per cell type
    sample_id_col = 'sample',
    chromsizes = chromsizes,
    bed_path = bed_path,  # where the pseudobulk bed files are stored
    bigwig_path = bigwig_path,  # where the pseudobulk bigwig files are stored
    path_to_fragments = fragments_dict,  # location of the fragment files
    n_cpu = 8,  # number of cores to use; ray is used for multiprocessing
    normalize_bigwig = True,
    # remove_duplicates = True,  # not a legal argument
    temp_dir = os.path.join(tmp_dir, 'ray_spill'),
    split_pattern = '-',
)

error

{'D19-13151': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-13151.fragments.tsv.gz', 'D19-12524': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12524.fragments.tsv.gz', 'D19-13162': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-13162.fragments.tsv.gz', 'D19-12535': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12535.fragments.tsv.gz', 'D19-12536': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12536.fragments.tsv.gz', 'D19-12702': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12702.fragments.tsv.gz', 'D19-13003': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-13003.fragments.tsv.gz', 'D19-12531': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12531.fragments.tsv.gz', 'D19-13182': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-13182.fragments.tsv.gz', 'D19-12534': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12534.fragments.tsv.gz', 'D19-12700': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12700.fragments.tsv.gz', 'D19-13156': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-13156.fragments.tsv.gz', 'D19-13178': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-13178.fragments.tsv.gz', 'D19-12998': 
'/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12998.fragments.tsv.gz', 'D19-12697': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12697.fragments.tsv.gz', 'D19-12696': '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/000fragments_new/D19-12696.fragments.tsv.gz'}
2024-03-27 00:05:42,394 cisTopic     INFO     Splitting fragments by cell type.
Traceback (most recent call last):
  File "./cistopic.py", line 73, in <module>
    bed_path = '/lustre1/project/stg_00079/students/tingting/data/sun/snap2_part_synapse_fragments/031results_pseudobulk_bed_files/'
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 156, in export_pseudobulk
    split_fragment_files_by_cell_type(
  File "/data/leuven/351/vsc35107/miniconda3/envs/master/lib/python3.8/site-packages/scatac_fragment_tools/library/split/split_fragments_by_cell_type.py", line 92, in split_fragment_files_by_cell_type
    raise ValueError(f"Fragment file {path_to_fragment_file} does not exist.")
ValueError: Fragment file /tmp/D19-12534/0.fragments.tsv.gz does not exist.

Could you please help me here?
Thanks!
tingting

TingTingShao · 2024-03-26T10:26:57Z

Author

> What downstream analysis are you trying to perform? To run SCENIC+ you also need scRNA-seq data.

I'm going to have separate snRNA-seq data, but not multiome data, for these microglia cells. It is going to be challenging I suppose, but I'll give it a try afterwards.

If that doesn't work, I also have a small set of cells from a multiome experiment that I can analyze with SCENICplus.

The thing is, for now I only have the fragments.tsv.gz files for the snATAC-seq data. I analyzed the data with snapATAC2, but since the cells are all of the same type (microglia) and come from participants with different disease status, I am afraid that subtle changes were not captured in the snapATAC2 analysis. So I want to analyze these with cisTopic, which can also help me with the later SCENICplus analysis.

Thanks
tingting

1 reply

SeppeDeWinter Mar 26, 2024
Maintainer

Hi @TingTingShao

That makes sense.

Let me know if you need help with the SCENIC+ part. Let's first try to solve the pycisTopic part ;).

All the best,

Seppe
