Support added for a single file and directory #663

Closed
38 changes: 23 additions & 15 deletions packages/jupyter-ai/jupyter_ai/document_loaders/directory.py
@@ -51,21 +51,29 @@ def flatten(*chunk_lists):
 def split(path, all_files: bool, splitter):
     chunks = []

-    for dir, subdirs, filenames in os.walk(path):
-        # Filter out hidden filenames, hidden directories, and excluded directories,
-        # unless "all files" are requested
-        if not all_files:
-            subdirs[:] = [d for d in subdirs if not (d[0] == "." or d in EXCLUDE_DIRS)]
-            filenames = [f for f in filenames if not f[0] == "."]
-
-        for filename in filenames:
-            filepath = Path(os.path.join(dir, filename))
-            if filepath.suffix not in SUPPORTED_EXTS:
-                continue
-
-            document = dask.delayed(path_to_doc)(filepath)
-            chunk = dask.delayed(split_document)(document, splitter)
-            chunks.append(chunk)
+    if os.path.isfile(path):
+        filenames = []
+        filenames.append(os.path.basename(path))
+        dir = os.path.dirname(path)
+
+    if os.path.isdir(path):
+        for dir, subdirs, filenames in os.walk(path):
+            # Filter out hidden filenames, hidden directories, and excluded directories,
+            # unless "all files" are requested
+            if not all_files:
+                subdirs[:] = [
+                    d for d in subdirs if not (d[0] == "." or d in EXCLUDE_DIRS)
+                ]
+                filenames = [f for f in filenames if not f[0] == "."]
Member

There are a few issues with the implementation proposed by this branch:

  • We need a list of file paths relative to the current directory. However, this branch only collects bare file names.

  • This branch updates `filenames` with the assignment operator `=` instead of adding to the list (e.g. with `.extend()`), so the file names found in each iteration of the `for` loop are dropped by the next one.

  • `filenames` is also the loop target of the `for` statement itself. This means that even if the previous issue is fixed, every iteration of this `for` loop still rebinds `filenames` and discards the value set by the previous iteration. Take this as a simplified example:

>>> for i in range(5):
...   print(i)
...   i = 1
...
0
1
2
3
4
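The same effect can be reproduced with `os.walk()` itself. The following is a minimal sketch, not the branch's actual code: it builds a throwaway directory tree and compares what is left in `filenames` once the walk has finished against accumulating full paths inside the walk. The helper names `last_walk_filenames` and `all_walk_paths`, and the file names, are hypothetical.

```python
import os
import tempfile


def last_walk_filenames(root):
    """Mimic the branch: read `dir` and `filenames` only after os.walk() ends."""
    for dir, subdirs, filenames in os.walk(root):
        pass  # each iteration rebinds dir, subdirs and filenames
    return dir, filenames


def all_walk_paths(root):
    """Accumulate a full path for every file while the walk is in progress."""
    paths = []
    for dir, subdirs, filenames in os.walk(root):
        subdirs.sort()  # keep traversal order deterministic for this demo
        for filename in sorted(filenames):
            paths.append(os.path.join(dir, filename))
    return paths


with tempfile.TemporaryDirectory() as root:
    # Layout: root/a.md and root/sub/b.md
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.md", os.path.join("sub", "b.md")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("text")

    survived_dir, survived = last_walk_filenames(root)
    collected_rel = [os.path.relpath(p, root) for p in all_walk_paths(root)]

print("left after the walk:", survived)
print("accumulated paths:  ", collected_rel)
```

Only the file names of the last directory visited remain in the first case, which is why `a.md` would never reach the `for filename in filenames:` block.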

This implementation can be corrected and simplified greatly. Here are my suggestions.

  1. The logic within the `for filename in filenames: ...` block on line 69 should be extracted into a separate `split_file(path, splitter)` function.

  2. Revert the other changes, and simply add this block at the very top of the `split()` function definition:

if os.path.isfile(path):
    return split_file(path, splitter)
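Put together, the two suggestions could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code that was merged: the `dask.delayed` wrapping, `path_to_doc` and `split_document` are replaced by plain stand-ins so the example runs on its own, the values of `SUPPORTED_EXTS` and `EXCLUDE_DIRS` are made up, and `splitter` is assumed to be any callable that maps text to a list of chunks.

```python
import os
import tempfile
from pathlib import Path

# Assumed values; the real constants live in directory.py.
SUPPORTED_EXTS = {".md", ".py", ".txt"}
EXCLUDE_DIRS = {"node_modules", "__pycache__"}


def split_file(path, splitter):
    """Return the chunks for one file, or [] if its extension is unsupported."""
    filepath = Path(path)
    if filepath.suffix not in SUPPORTED_EXTS:
        return []
    # Stand-in for path_to_doc() followed by split_document().
    return splitter(filepath.read_text())


def split(path, all_files, splitter):
    # Suggested early return: a single file needs no directory walk.
    if os.path.isfile(path):
        return split_file(path, splitter)

    chunks = []
    for dir, subdirs, filenames in os.walk(path):
        # Filter out hidden and excluded entries unless all files are requested.
        if not all_files:
            subdirs[:] = [d for d in subdirs if not (d[0] == "." or d in EXCLUDE_DIRS)]
            filenames = [f for f in filenames if not f[0] == "."]
        for filename in filenames:
            chunks.extend(split_file(os.path.join(dir, filename), splitter))
    return chunks


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    files = {
        "a.md": "one two",
        ".hidden.md": "hidden",
        "c.bin": "ignored",
        os.path.join("sub", "b.md"): "three",
    }
    for rel, text in files.items():
        Path(root, rel).write_text(text)

    single = split(os.path.join(root, "a.md"), False, str.split)
    visible = split(root, False, str.split)
    everything = split(root, True, str.split)

print(single, sorted(visible), sorted(everything))
```

Because the per-file logic sits in one function, the single-file and directory paths share the extension check, and the directory walk stays identical to the original implementation.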

Member

@dlqqq, Mar 5, 2024

I verified all of this by adding `print()` statements to the definition of `split()` and inspecting the value of `filenames`. To test, I ran `jupyter lab` from the root of this Git repo and called `/learn docs` to learn all of the Jupyter AI documentation.

Can you do the same before I review this again? Thanks in advance!

Collaborator

@dlqqq The code for the function `split()` can be simplified to the following form of the original function:
[image]
I have tested this separately and it works for a single file or a directory.


+    for filename in filenames:
+        filepath = Path(os.path.join(dir, filename))
+        if filepath.suffix not in SUPPORTED_EXTS:
+            continue
+
+        document = dask.delayed(path_to_doc)(filepath)
+        chunk = dask.delayed(split_document)(document, splitter)
+        chunks.append(chunk)

     flattened_chunks = dask.delayed(flatten)(*chunks)
     return flattened_chunks