From 284f55083523caefc86e344e0163f21d0fca276e Mon Sep 17 00:00:00 2001
From: Janosh Riebesell
Date: Mon, 8 Apr 2024 16:38:08 +0200
Subject: [PATCH 1/2] code fence syntax highlight in readme

---
 README.md | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 598acc71..dbdbd81e 100644
--- a/README.md
+++ b/README.md
@@ -1,24 +1,29 @@
 # mongo-arrow
+
 Tools for using Apache Arrow with MongoDB
 
 ## Apache Arrow
-We utilize Apache Arrow to offer fast and easy conversion of MongoDB query result sets to multiple numerical data formats popular among developers including NumPy ndarrays, Pandas DataFrames, parquet files, csv, and more.
+
+We utilize Apache Arrow to offer fast and easy conversion of MongoDB query result sets to multiple numerical data formats popular among developers including NumPy arrays, Pandas DataFrames, parquet files, CSV, and more.
 
 We chose Arrow for this because of its unique set of characteristics:
+
 - language-independent
 - columnar memory format for flat and hierarchical data,
 - organized for efficient analytic operations on modern hardware like CPUs and GPUs
 - zero-copy reads for lightning-fast data access without serialization overhead
- - it was simple and fast, and from our perspective Apache Arrow is ideal for processing and transport of large datasets in high-performance applications.
+ - it was simple and fast, and from our perspective, Apache Arrow is ideal for processing and transporting large datasets in high-performance applications.
 
 As reference points for our implementation, we also took a look at BigQuery’s Pandas integration, pandas methods to handle JSON/semi-structured data, the Snowflake Python connector, and Dask.DataFrame.
 
-
 ## How it Works
+
 Our implementation relies upon a user-specified data schema to marshall query result sets into tabular form.
 
 Example
-```
+
+```py
 from pymongoarrow.api import Schema
+
 schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
 ```
@@ -26,17 +31,19 @@ You can install PyMongoArrow on your local machine using Pip:
 
 `$ python -m pip install pymongoarrow`
 
 You can export data from MongoDB to a pandas dataframe easily using something like:
-```
+
+```py
 df = production.invoices.find_pandas_all({'amount': {'$gt': 100.00}}, schema=invoices)
 ```
 
 Since PyMongoArrow can automatically infer the schema from the first batch of data, this can be further simplified to:
-```
+```py
 df = production.invoices.find_pandas_all({'amount': {'$gt': 100.00}})
 ```
 
 ## Final Thoughts
+
 This library is in the early stages of development, and so it's possible the API may change in the future - we definitely want to continue expanding it.
 
 We welcome your feedback as we continue to explore and build this tool.

From afe634e7d5b511109ee9ce3166fe108957bfec48 Mon Sep 17 00:00:00 2001
From: Janosh Riebesell
Date: Tue, 9 Apr 2024 07:38:24 +0200
Subject: [PATCH 2/2] single to double quotes

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index dbdbd81e..9f6c9e2b 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ Example
 
 ```py
 from pymongoarrow.api import Schema
 
-schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
+schema = Schema({"_id": int, "amount": float, "last_updated": datetime})
 ```
 You can install PyMongoArrow on your local machine using Pip:
@@ -33,14 +33,14 @@
 
 You can export data from MongoDB to a pandas dataframe easily using something like:
 
 ```py
-df = production.invoices.find_pandas_all({'amount': {'$gt': 100.00}}, schema=invoices)
+df = production.invoices.find_pandas_all({"amount": {"$gt": 100.00}}, schema=invoices)
 ```
 
 Since PyMongoArrow can automatically infer the schema from the first batch of data, this can be further simplified to:
 ```py
-df = production.invoices.find_pandas_all({'amount': {'$gt': 100.00}})
+df = production.invoices.find_pandas_all({"amount": {"$gt": 100.00}})
 ```
 
 ## Final Thoughts
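
---

Note for reviewers: the README text touched by these patches says a user-specified schema marshals query result sets into tabular (columnar) form. As a minimal sketch of that idea, here is a plain-Python, hypothetical illustration of what "schema-driven marshalling" means conceptually. It is not PyMongoArrow's actual implementation (which builds Arrow arrays natively); the `marshal_all` helper and the sample `invoices` documents are invented for illustration only.

```py
from datetime import datetime, timezone

# Hypothetical sketch: like pymongoarrow.api.Schema, map field names to types.
schema = {"_id": int, "amount": float, "last_updated": datetime}

def marshal_all(docs, schema):
    """Turn a list of MongoDB-style documents (dicts) into {field: [values]}
    columns, coercing each value to the type the schema declares."""
    columns = {field: [] for field in schema}
    for doc in docs:
        for field, typ in schema.items():
            value = doc[field]
            # Coerce mismatched values (e.g. int -> float); pass matches through.
            columns[field].append(value if isinstance(value, typ) else typ(value))
    return columns

now = datetime(2024, 4, 8, tzinfo=timezone.utc)
invoices = [
    {"_id": 1, "amount": 150, "last_updated": now},
    {"_id": 2, "amount": 99.5, "last_updated": now},
]
columns = marshal_all(invoices, schema)
# columns["amount"] is [150.0, 99.5]: the int 150 was coerced to float
```

Columnar output like this is what Arrow's memory format builds on, which is presumably why a declared (or inferred) schema lets `find_pandas_all` hand query results to pandas without a per-document conversion step.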