Dataframe is all Nat and None after loading #127

Sondos-Omar · 2023-02-03T09:49:25Z

I was trying mongo arrow to load a dataset from mongodb, it is loading the selected columns only that's saving space, but the dataframe is all Nat and Nones only. Is this a common issue and how to fix that?
Thanks in advance

df=collection.find_pandas_all(
{ "prop.Start": {'$gte':start_date,'$lte':end_date}} ,
schema=Schema({
'prop.Start': datetime,
'prop.Name':str,
'_id.objectId':str
}))

ShaneHarvey · 2023-02-03T20:06:05Z

Hi @Sondos-Omar, could you please provide some sample input and output of the current behavior vs the behavior you would expect?

Sondos-Omar · 2023-02-05T13:52:27Z

Hi @ShaneHarvey , thank you for your reply, here is a sample of the output. Loading this data using collections.find followed by appending the dataset to pandas in batches works and loads the expected data (the dates, names and ids). But this takes so much space and time. Double checked the schema on mongodb compass, the prop is object and the prop.Start is datetime and the rest are strings.

juliusgeo · 2023-02-09T00:03:11Z

Hi! Thank you for raising this issue. This is unfortunately because Schemas do not support the "dot" notation used in MongoDB projections. Unfortunately, at this time the best workaround seems to be to flatten the data before ingesting into PyMongoArrow by using an aggregate pipeline. I opened a ticket for the implementation of a real solution to this issue, and a ticket here for us to update our documentation with the correct workaround.

juliusgeo · 2023-02-09T06:58:20Z

An example aggregation pipeline you can use would be:

db.collection.aggregate([
  {
    "$project": {
      "Name": "$value.prop.Name",
      "Start": "$value.prop.Start"
    }
  },
])

Which would yield something that looks like this:

[
  {
    "Name": "foo",
    "Start": <datetime>,
    "_id": ObjectId("5a934e000102030405000000")
  }
]

juliusgeo · 2023-02-10T23:24:59Z

@Sondos-Omar here is a more detailed example:

from pymongo import MongoClient
from pymongoarrow.api import Schema, find_pandas_all
from pymongoarrow.types import (
    ObjectIdType,
)
from datetime import datetime
from bson import datetime_ms
coll = MongoClient(username="user", password="password").db.coll
coll.drop()
start_date, end_date = datetime_ms._millis_to_datetime(0, coll.codec_options),datetime_ms._millis_to_datetime(10, coll.codec_options)
coll.insert_many([{
      "prop": {
        "Name": "foo",
        "Start": start_date,
    }
  }, {
      "prop": {
        "Name": "foo",
        "Start": end_date,
    }
  },])

# The code below is likely all that you will need, the code above is just setting up the Database so it contains the right kind of data.
df=find_pandas_all(coll,
{"prop.Start": {'$gte':start_date,'$lte':end_date,}},
projection={
      "Name": "$prop.Name",
      "Start": "$prop.Start"
},
schema=Schema({"_id": ObjectIdType(), "Start": datetime, "Name": str}))
print(df)
>>>
                                             _id                   Start Name
0  b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95c' 1970-01-01 00:00:00.000  foo
1  b'c\xe6\xd1\xf3\xd3\xea\xac\xf5\x04\xd6\x95d' 1970-01-01 00:00:00.010  foo

juliusgeo · 2023-02-21T22:31:45Z

Hi! @Sondos-Omar We have updated our documentation in this PR to show more examples for using nested data: #130

ShaneHarvey assigned juliusgeo Feb 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataframe is all Nat and None after loading #127

Dataframe is all Nat and None after loading #127

Sondos-Omar commented Feb 3, 2023 •

edited

Loading

ShaneHarvey commented Feb 3, 2023

Sondos-Omar commented Feb 5, 2023

juliusgeo commented Feb 9, 2023 •

edited

Loading

juliusgeo commented Feb 9, 2023 •

edited

Loading

juliusgeo commented Feb 10, 2023 •

edited

Loading

juliusgeo commented Feb 21, 2023

Dataframe is all Nat and None after loading #127

Dataframe is all Nat and None after loading #127

Comments

Sondos-Omar commented Feb 3, 2023 • edited Loading

ShaneHarvey commented Feb 3, 2023

Sondos-Omar commented Feb 5, 2023

juliusgeo commented Feb 9, 2023 • edited Loading

juliusgeo commented Feb 9, 2023 • edited Loading

juliusgeo commented Feb 10, 2023 • edited Loading

juliusgeo commented Feb 21, 2023

Sondos-Omar commented Feb 3, 2023 •

edited

Loading

juliusgeo commented Feb 9, 2023 •

edited

Loading

juliusgeo commented Feb 9, 2023 •

edited

Loading

juliusgeo commented Feb 10, 2023 •

edited

Loading