Unsure on how to create variable-width bins using rebinning algorithm #346
newtonharry started this conversation in General
Replies: 1 comment 1 reply
-
I think you may be able to do this if you keep a record of the fragments that cross the boundary between two bins. When you aggregate the counts, subtract the number of boundary-crossing fragments, since each of them was counted once in every bin it overlaps.
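A minimal sketch of that correction, using my own toy data and assuming fragment-based (non-insertion) counting, where a fragment increments every bin it overlaps, so a naive sum of two merged bins double-counts any fragment spanning their shared boundary:

```python
import numpy as np

# Toy fragments as (start, end) intervals; not data from the discussion.
fragments = [(10, 30), (95, 130), (190, 230), (350, 380)]
bin_size = 100
n_bins = 4

# Per-bin counts: each fragment contributes to every bin it overlaps.
counts = np.zeros(n_bins, dtype=int)
# crossing[b] = number of fragments spanning the boundary between bin b and b+1.
crossing = np.zeros(n_bins - 1, dtype=int)
for start, end in fragments:
    first = start // bin_size
    last = (end - 1) // bin_size
    counts[first:last + 1] += 1
    for b in range(first, last):
        crossing[b] += 1

# Merge adjacent pairs of bins: naive sum minus the double-counted
# boundary-crossing fragments at each merged boundary.
merged = counts[0::2] + counts[1::2] - crossing[0::2]
print(counts, crossing, merged)  # [2 2 1 1] [1 1 0] [3 2]
```

Here the merged count for the first pair is 3, not the naive 4, because one fragment straddles the boundary between the first two bins.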
-
Hi,
I recently asked a question noting discrepancies in the total read count between higher- and lower-resolution bins across the genome: #345
I now understand how different counting strategies, such as paired-insertion and fragment-based (non-insertion) methods, lead to variations in count values depending on the chosen bin resolution. I’m currently exploring how to implement an efficient rebinning function that can adjust bin sizes from lower to higher resolutions while maintaining consistency with the original counting strategy. To do this correctly, access to the fragment sparse matrix would likely be required.
However, I’m curious whether it's possible to construct lower-resolution bins directly from higher-resolution bins without referencing the fragment matrix. In essence, this approach would involve aggregating counts from the smaller bins into larger bins, preserving the viewpoint of the higher resolution. Would such a strategy be feasible?
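The aggregation idea above can be sketched with plain summation over adjacent columns of a cell-by-bin count matrix (my own illustration, not SnapATAC2 code). Note that with insertion-based counting strategies such as paired-insertion counting, a fragment whose two insertions land in different high-resolution bins contributes to both, whereas counting directly at the lower resolution may record it only once, which is one reason the aggregated counts can differ:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(5, 8))  # toy matrix: 5 cells x 8 high-res bins

factor = 4  # merge every 4 adjacent high-res bins into one low-res bin
# Reshape to (cells, low_res_bins, factor) and sum over the merged axis.
X_low = X.reshape(X.shape[0], -1, factor).sum(axis=2)
print(X_low.shape)  # (5, 2)
```

Summation preserves the matrix total, so any discrepancy with direct low-resolution counting comes from the counting strategy, not the aggregation itself.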
I've experimented with this concept and observed that my adaptive binning strategy yields higher Average Silhouette Width (ASW) scores. The challenge, however, is that the aggregated counts differ from what SnapATAC2 would generate if the data were processed directly at that lower resolution.
Could it be that the inverse document frequency (IDF) scaling applied during feature weighting is compensating for the increased counts in my approach? Since the aggregated higher-resolution bins inherently have higher counts, IDF might be down-weighting these values to offset the inflation caused by aggregation.
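To make the IDF intuition concrete, here is a hedged toy example of TF-IDF weighting as commonly used in scATAC-seq pipelines (a generic smoothed formula, not necessarily SnapATAC2's exact implementation): bins that are non-zero in more cells receive a smaller IDF weight, so broadly accessible aggregated bins are down-weighted:

```python
import numpy as np

# Toy cell-by-bin count matrix; the last bin is non-zero in every cell.
X = np.array([[1, 0, 3],
              [2, 1, 4],
              [0, 1, 5]], dtype=float)
n_cells = X.shape[0]

# Smoothed IDF: fewer cells containing the feature -> larger weight.
idf = np.log1p(n_cells / (1 + (X > 0).sum(axis=0)))
# Term frequency: normalize each cell's counts to sum to 1.
tf = X / X.sum(axis=1, keepdims=True)
tfidf = tf * idf
```

In this example the third bin, present in all cells, gets the smallest IDF weight, which is the kind of compensation you describe.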
I'm interested to hear your thoughts on this and to see where I'm going wrong or misunderstanding something fundamental.
Referring to the adaptive binning strategy, I initially developed an iterative solution to aggregate the bins, but I've shifted towards using summation matrices built with Kronecker products.
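A summation matrix of that kind can be built as the Kronecker product of an identity matrix with a row of ones, so aggregation becomes a single matrix multiply (my own sketch of the construction, assuming a uniform merge factor):

```python
import numpy as np

n, k = 8, 4  # 8 high-res bins, merged in groups of 4

# S has one row per low-res bin; each row is ones over its k source bins.
S = np.kron(np.eye(n // k), np.ones(k))  # shape (2, 8)

X = np.arange(16).reshape(2, 8).astype(float)  # toy 2 cells x 8 bins
X_low = X @ S.T  # shape (2, 2): each entry is a block sum
print(X_low)  # [[ 6. 22.] [38. 54.]]
```

For genome-scale matrices a sparse version of `S` (e.g. via `scipy.sparse.kron`) would avoid materializing the dense summation matrix.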
Thanks!