INTPYTHON-165 Refactor nested data handling #245

Merged
merged 25 commits into from
Nov 1, 2024

Conversation

blink1073
Member

No description provided.

@aclark4life
Contributor

FYI, I'm still getting a segfault on the INTPYTHON-165 branch with this test code:


from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from pymongoarrow.api import Schema
from pymongoarrow.monkey import patch_all

import code
import os
import pymongo
import readline
import rlcompleter  # noqa
from bson.objectid import ObjectId

# DATABASE_URL = "mongodb+srv://<u>:<p>@<srv>.mongodb.net"
# uri = os.environ.get("DATABASE_URL")
uri = "mongodb://localhost:27017"

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi("1"))

# Send a ping to confirm a successful connection
try:
    client.admin.command("ping")
    print("Pinged your deployment. You successfully connected to MongoDB!")

except Exception as e:
    print(e)

sample_mflix = client["test"]
movies = sample_mflix["movies"]

patch_all()  # add PyMongoArrow functionality directly to Collection instances

# Check the current number of movies
current_count = movies.count_documents({})
target_count = 8000000

print(current_count)

if current_count < target_count:
    # Calculate the number of movies to copy
    movies_to_copy = target_count - current_count

    # Fetch all movies
    all_movies = list(movies.find())
    print(f"Total movies: {len(all_movies)}")

    # Insert movies until the target count is reached
    while current_count < target_count:
        # Ensure unique _id for each document
        for movie in all_movies[:movies_to_copy]:
            movie["_id"] = ObjectId()

        try:
            movies.insert_many(all_movies[:movies_to_copy])
            current_count += len(all_movies[:movies_to_copy])
            print(
                f"Inserted {len(all_movies[:movies_to_copy])} movies. Current count: {current_count}"
            )
        except pymongo.errors.BulkWriteError as bwe:
            print(f"Bulk write error: {bwe.details}")
            break

schema = Schema({"_id": int})
# data_frame = movies.find_pandas_all({})
arrow_table = movies.find_arrow_all({})
# parse_and_bind() returns None, so there is nothing useful to pass as readfunc
readline.parse_and_bind("tab: complete")
code.interact(local=globals())

@blink1073
Member Author

Thanks for the repro code. I suspect we need to use larger integer types for all our internal variables.
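
To illustrate the suspicion above, here is a minimal sketch of how a byte counter held in a 32-bit signed integer wraps once the accumulated data passes 2 GiB, while a 64-bit counter does not. It uses only `ctypes` from the standard library. The helper name, the per-document size, and the document count are made-up assumptions for illustration, and nothing here is taken from PyMongoArrow's actual internals.

```python
import ctypes


def stored_total(doc_size, doc_count, counter_type):
    """Return doc_size * doc_count as it would read back from a C integer.

    counter_type is a ctypes integer class such as ctypes.c_int32.
    ctypes does no overflow checking, so an oversized value silently wraps.
    """
    counter = counter_type(0)
    # One multiplication stands in for appending doc_count values in a loop.
    counter.value = doc_size * doc_count
    return counter.value


# Hypothetical workload: 8,000,000 documents of about 1,000 bytes each.
DOC_SIZE = 1_000
DOC_COUNT = 8_000_000

narrow = stored_total(DOC_SIZE, DOC_COUNT, ctypes.c_int32)
wide = stored_total(DOC_SIZE, DOC_COUNT, ctypes.c_int64)

print("int32 counter:", narrow)  # negative: the value wrapped
print("int64 counter:", wide)    # the true total
```

A wrapped, negative counter used as an offset or length in native code is one plausible route to a segfault, which is why widening the internal variables is a reasonable first thing to try.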

@blink1073
Member Author

I confirmed that we now raise a ValueError when the builder runs out of memory, instead of segfaulting: ValueError: ('Could not append raw value to', b'fullplot')
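
Since the failure now surfaces as a Python exception, calling code can catch it instead of crashing. The following is a minimal sketch, assuming the error keeps the `ValueError` type and argument shape quoted above. The helper `load_or_report` and the stub `fake_find_arrow_all` are hypothetical names introduced for illustration; the stub stands in for a real call such as `movies.find_arrow_all({})` so that the example runs without a MongoDB server.

```python
def load_or_report(load_table):
    """Call load_table() and return (table, error_message).

    load_table is any zero-argument callable, for example
    lambda: movies.find_arrow_all({}) from the repro script.
    """
    try:
        return load_table(), None
    except ValueError as exc:
        # Assumed shape: ('Could not append raw value to', b'fullplot')
        return None, f"Arrow builder failed: {exc.args}"


def fake_find_arrow_all():
    """Hypothetical stub that mimics the reported error."""
    raise ValueError("Could not append raw value to", b"fullplot")


table, message = load_or_report(fake_find_arrow_all)
print(table, message)
```

Returning the message rather than re-raising is only one design choice; a caller could equally retry with a smaller query or a narrower schema.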

@blink1073 blink1073 marked this pull request as ready for review October 24, 2024 22:20
@blink1073 blink1073 requested a review from a team as a code owner October 24, 2024 22:20
Contributor

@caseyclements caseyclements left a comment

LGTM. Thanks for the walkthrough.

@blink1073 blink1073 merged commit 5406fc3 into mongodb-labs:main Nov 1, 2024
35 checks passed
@blink1073 blink1073 deleted the INTPYTHON-165 branch November 1, 2024 13:50