
Issue Running in Jupyter Notebook for convert_parquet Function #221

Open
sarakh1999 opened this issue Aug 7, 2024 · 1 comment
Labels: question (Further information is requested)

Comments

@sarakh1999

Take a look at the following function:

import pandas as pd
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
from cytotable import convert
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
import random
import sqlite3

# Constants for columns
COLUMNS = (
    "TableNumber",
    "ImageNumber",
    "ObjectNumber",
    "Metadata_Well",
    "Metadata_Plate",
    "Cytoplasm_Parent_Cells",
    "Cytoplasm_Parent_Nuclei",
)

# Modified convert_parquet function to read data chunk by chunk
def convert_parquet(
    input_file,
    output_file,
    cols=COLUMNS,
    chunk_size=150000,
    thread=2,
    initial_offset=0,
    offset_step=100,
):
    """Convert sqlite profiles to parquet"""

    conn = sqlite3.connect(input_file)

    # Get total number of rows in the image_table for processing
    total_rows = pd.read_sql_query("SELECT COUNT(*) as count FROM image_table", conn)['count'][0]

    # Define the schema
    schema = pa.schema([
        ('Metadata_TableNumber', pa.int64()),
        ('Metadata_ImageNumber', pa.int64()),
        ('Metadata_Well', pa.string()),
        ('Metadata_Plate', pa.string()),
        ('cytoplasm_Metadata_TableNumber', pa.int64()),
        ('cytoplasm_Metadata_ImageNumber', pa.int64()),
        ('cytoplasm_Metadata_ObjectNumber', pa.int64()),
        ('cells_Metadata_ObjectNumber', pa.int64()),
        ('nuclei_Metadata_ObjectNumber', pa.int64()),
    ])

    # Create a Parquet writer
    pq_writer = pq.ParquetWriter(output_file, schema, compression='gzip')

    offset = initial_offset

    while offset < total_rows:
        query_limit = f"LIMIT {offset_step} OFFSET {offset}"
        image_df = pd.read_sql_query(f"SELECT * FROM image_table {query_limit}", conn)
        cytoplasm_df = pd.read_sql_query(f"SELECT * FROM cytoplasm_table {query_limit}", conn)
        cells_df = pd.read_sql_query(f"SELECT * FROM cells_table {query_limit}", conn)
        nuclei_df = pd.read_sql_query(f"SELECT * FROM nuclei_table {query_limit}", conn)

        image_pl = pl.from_pandas(image_df)
        cytoplasm_pl = pl.from_pandas(cytoplasm_df)
        cells_pl = pl.from_pandas(cells_df)
        nuclei_pl = pl.from_pandas(nuclei_df)

        image_filtered = image_pl.select(['Metadata_TableNumber', 'Metadata_ImageNumber', 'Metadata_Well', 'Metadata_Plate'])

        # Perform join operations
        result = (
            image_filtered
            .join(cytoplasm_pl, on=['Metadata_TableNumber', 'Metadata_ImageNumber'], how='left')
            .join(cells_pl, left_on=['Metadata_TableNumber', 'Metadata_ImageNumber', 'Metadata_ObjectNumber'], right_on=['Metadata_TableNumber', 'Metadata_ImageNumber', 'Metadata_Cytoplasm_Parent_Cells'], how='left')
            .join(nuclei_pl, left_on=['Metadata_TableNumber', 'Metadata_ImageNumber', 'Metadata_ObjectNumber'], right_on=['Metadata_TableNumber', 'Metadata_ImageNumber', 'Metadata_Cytoplasm_Parent_Nuclei'], how='left')
        )

        # Convert the result to an Arrow table
        result_arrow = result.to_arrow()

        # Write the table to the Parquet file
        pq_writer.write_table(result_arrow)

        offset += offset_step

    # Close the Parquet writer
    pq_writer.close()

    conn.close()

    hash_str = str(random.getrandbits(128))
    parsl_config = Config(
        executors=[
            ThreadPoolExecutor(
                max_threads=thread
            )
        ],
        run_dir=f'./runinfo/{hash_str}'
    )

    convert(
        source_path=input_file,
        dest_path=output_file,
        identifying_columns=cols,
        dest_datatype='parquet',
        chunk_size=chunk_size,
        preset="cell-health-cellprofiler-to-cytominer-database",
        joins=None,  # No joins needed here since it's already handled
        reload_parsl_config=True,
        parsl_config=parsl_config,
        sort_output=False,
    )
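As a quick sanity check of the output, the written file can be read back with pyarrow (a sketch; the path is a placeholder):

import pyarrow.parquet as pq

# Read back the file written above and confirm its shape and schema
table = pq.read_table("profiles.parquet")  # placeholder path
print(table.num_rows)  # total rows written across all chunks
print(table.schema)    # should match the schema defined in convert_parquet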

When I run it in a Jupyter notebook, the function does not work and raises an error related to threading, but when I run the exact same code as a Python script there is no issue. My guess is that the problem lies in how the Jupyter kernel interacts with the multithreading this library uses.
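One thing that might be worth trying in the notebook, assuming a Parsl configuration left loaded in the kernel from a previous run is part of the problem (an assumption, not a confirmed fix), is clearing Parsl state before calling the function:

import parsl

# Tear down any DataFlowKernel a previous cell may have left loaded;
# parsl.clear() is a standard Parsl call, but whether it resolves this
# particular threading error is an assumption.
parsl.clear()

convert_parquet("profiles.sqlite", "profiles.parquet")  # placeholder paths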

sarakh1999 changed the title from "Issue Running in Jupyter Notebook" to "Issue Running in Jupyter Notebook for convert_parquet Function" on Aug 7, 2024
@d33bs (Member) commented Aug 8, 2024

Thanks so much @sarakh1999 for raising this issue! I've noticed that parallel or multithreaded processing does sometimes have issues within Jupyter environments (whether through Parsl, built-ins, or otherwise). That said, I've regularly used the latest revisions of CytoTable inside of local Jupyter environments and also through Google Colab.

Could I ask for more detail surrounding your Python version, Jupyter environment, OS, and other system information that might be helpful for debugging? If you use environment files (like a conda .yml or pyproject.toml file) or lockfiles (such as poetry.lock or conda-lock.yml), these would also be helpful for navigating any dependency issues that may be at play.
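If helpful, most of this can be gathered from inside the notebook with standard-library calls alone (a quick sketch):

import platform
import sys

print(sys.version)          # Python version in the running kernel
print(platform.platform())  # operating system details
print(sys.executable)       # path of the interpreter backing the kernel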

d33bs added the question (Further information is requested) label on Sep 26, 2024