Use RecursiveJsonSplitter when learning JSON files #1036

Draft
wants to merge 2 commits into
base: main

Conversation

dlqqq
Member

@dlqqq dlqqq commented Oct 16, 2024

As stated in the title. Follow-up to #1024.

@dlqqq dlqqq added the enhancement New feature or request label Oct 16, 2024
Collaborator

@srdas srdas left a comment

The recursive splitter throws the following error:
[image: screenshot of the error]
This is presumably because it is a recursive splitter that can parse the JSON file without a chunk size requirement. If so, chunk overlap would not be needed either.

The same error occurs with different LLMs.

@dlqqq
Member Author

dlqqq commented Oct 16, 2024

Even after dropping the arguments, the JSON splitter still raises an exception:

2024-10-16 15:38:09,520 - distributed.worker - ERROR - Compute Failed
Key:       split_document-081f610f-c434-4159-b621-79c44a8909bb
State:     executing
Function:  split_document
args:      (Document(metadata={'path': '/Volumes/workplace/jupyter-ai/package.json', 'sha256': b']\xdb\xa9Y(\x15`\xd5\x89t\xd6\xae"+&\xe1\xfe\xe0\x11\xa3G\x934\n\\y\xc3\x85U\x01\xb65', 'extension': '.json'}, page_content='{\n  "name": "@jupyter-ai/monorepo",\n  "version": "2.25.0",\n  "description": "A generative AI extension for JupyterLab",\n  "private": true,\n  "keywords": [\n    "jupyter",\n    "jupyterlab",\n    "jupyterlab-extension"\n  ],\n  "homepage": "https://github.com/jupyterlab/jupyter-ai",\n  "bugs": {\n    "url": "https://github.com/jupyterlab/jupyter-ai/issues",\n    "email": "[email protected]"\n  },\n  "license": "BSD-3-Clause",\n  "author": {\n    "name": "Project Jupyter",\n    "email": "[email protected]"\n  },\n  "workspaces": [\n    ".",\n    "packages/*"\n  ],\n  "scripts": {\n    "build": "lerna run build --stream",\n    "build:core": "lerna run build --stream --scope \\"@jupyter-ai/core\\"",\n    "build:prod": "lerna run build:prod --stream",\n    "clean":
kwargs:    {}
Exception: "IndexError('list index out of range')"
Traceback: '  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/directory.py", line 107, in split_document\n    return splitter.split_documents([document])\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/base.py", line 96, in split_documents\n    return self.create_documents(texts, metadatas=metadatas)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 31, in create_documents\n    for chunk in self.split_text(text, metadata):\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 22, in split_text\n    return splitter.split_text(text)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 106, in split_text\n    chunks = self.split_json(json_data=json_data, convert_lists=convert_lists)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 91, in split_json\n    chunks = self._json_split(json_data)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 78, in _json_split\n    self._set_nested_dict(chunks[-1], current_path, data)\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 32, in _set_nested_dict\n    d[path[-1]] = value\n      ~~~~^^^^\n'

RecursiveJsonSplitter doesn't seem to be well-supported: it appears to have a different interface from all the other splitters we use from LangChain. I'm putting this PR in draft status since there's no clear path forward. I may close it next week, or mark it as ready if I figure something out.
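
One possible reading of the traceback above, not confirmed in this thread: our `split_text` wrapper passes the raw file text (a `str`), while the traceback shows the JSON splitter forwarding that argument as `json_data` and then failing on `d[path[-1]] = value`. The sketch below is a simplified, standard-library stand-in for that logic and is not LangChain's actual code. The function names `set_nested_dict` and `json_split` are hypothetical, and size-based chunking is left out. It only illustrates how a top-level string could reach the nested-dict setter with an empty path, and how parsing the text first avoids that.

```python
import json


def set_nested_dict(d, path, value):
    """Store `value` in nested dict `d` at the key path `path`."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value  # raises IndexError when `path` is empty


def json_split(data, current_path=None, chunks=None):
    """Hypothetical, simplified walker: copy every leaf into chunks[-1]."""
    current_path = current_path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict):
        for key, value in data.items():
            json_split(value, current_path + [key], chunks)
    else:
        # A non-dict passed at the top level arrives here with an empty path.
        set_nested_dict(chunks[-1], current_path, data)
    return chunks


raw_text = (
    '{"name": "@jupyter-ai/monorepo", '
    '"bugs": {"url": "https://github.com/jupyterlab/jupyter-ai/issues"}}'
)

# Passing the unparsed string triggers the same kind of failure as in the log.
try:
    json_split(raw_text)
    raw_error = None
except IndexError as exc:
    raw_error = str(exc)

# Parsing first gives the walker a dict to recurse into.
parsed_chunks = json_split(json.loads(raw_text))

print(raw_error)
print(parsed_chunks)
```

If this reading is right, calling `json.loads` on the document text before handing it to the JSON splitter might be enough to get past the `IndexError`, though it would not remove the interface mismatch with the other splitters.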

@dlqqq dlqqq marked this pull request as draft October 16, 2024 22:49