I looked further into the implementation of `EpicSplitter`. It seems that the chunking process does not actually ensure that all chunks are smaller than `max_chunk_size`. Is this intended, or is it an error?
For instance, when building the index for sympy__sympy-11870, in the `_chunk_block` function of `EpicSplitter`, when the input is `file_path == 'sympy/combinatorics/permutations.py'` and `codeblock.content == 'class Permutation(Basic):'`, the first chunk appended to `chunks` has 3333 tokens, even though `self.max_chunk_size` is 1500, `self.hard_token_limit` is 2000, and `self.chunk_size` is 750.
Specifically, this 3333-token chunk is appended by this line: `moatless-tools/moatless/index/epic_split.py`, line 266 at commit `bda099d`.

It seems to me that this part of the code recursively chunks the child. If that's correct, do we still need the parent when the child will be indexed separately?
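For illustration, here is a minimal, hypothetical sketch (not the actual moatless code; `count_tokens` and `chunk_block` are stand-ins I made up) of how a splitter that only checks the soft limit *before* appending a child can still emit a chunk far larger than `max_chunk_size`:

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer for the example: one token per whitespace-separated word.
    return len(text.split())

def chunk_block(content: str, children: list[str], max_chunk_size: int) -> list[str]:
    chunks: list[str] = []
    current = content  # e.g. the parent signature "class Permutation(Basic):"
    for child in children:
        if count_tokens(current) < max_chunk_size:
            # The soft limit is checked *before* the child is added, so a
            # large child can push the chunk well past max_chunk_size.
            current += "\n" + child
        else:
            chunks.append(current)
            current = child
    chunks.append(current)
    return chunks

oversized = chunk_block("class C:", [("x " * 50).strip()], max_chunk_size=10)
print(count_tokens(oversized[0]))  # 52 tokens, well over the soft limit of 10
```

The failure mode is the same shape as the one reported: the limit gates whether more content may be added, but nothing re-checks the chunk's final size before it is appended.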
The solution is a bit over- and under-engineered at the same time 😅 The idea was to group small chunks (classes and methods) into the same vector. The different token limits are a bit vague, as you already noticed. The idea was that `max_chunk_size` and `chunk_size` would be soft limits, and to avoid errors when embedding I added `hard_token_limit`. But as you noticed in another issue, this is not always handled properly.
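A hedged sketch of that grouping idea (again not the real implementation; `count_tokens` and `group_blocks` are illustrative names): merge small blocks up to a soft `chunk_size`, and fall back to the `hard_token_limit` only to keep an individual block from breaking the embedding call.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer for the example: one token per whitespace-separated word.
    return len(text.split())

def group_blocks(blocks: list[str], chunk_size: int = 750,
                 hard_token_limit: int = 2000) -> list[str]:
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = current + "\n" + block if current else block
        if count_tokens(candidate) <= chunk_size:
            current = candidate  # still under the soft limit: keep merging
            continue
        if current:
            chunks.append(current)
        # A single block over the hard limit must be cut down here;
        # skipping this check is exactly how oversized chunks slip through.
        if count_tokens(block) <= hard_token_limit:
            current = block
        else:
            current = " ".join(block.split()[:hard_token_limit])
    if current:
        chunks.append(current)
    return chunks
```

The point of the sketch is the division of labor: `chunk_size` decides when to stop merging, while `hard_token_limit` is a last-resort cap that must be enforced on every emitted chunk, not just on the merge decision.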
I plan to simplify this by chunking at the method level and truncating methods larger than `hard_token_limit`. If the parent is a class, the class signature, instance variables and constructor could go in a separate chunk.
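The plan above could be sketched roughly as follows (an assumption-laden illustration, not committed code: `truncate`, `chunk_class`, and the word-based `count_tokens` are all hypothetical, and a real version would truncate on tokenizer tokens, not words):

```python
HARD_TOKEN_LIMIT = 2000  # per the limits quoted in the issue

def count_tokens(text: str) -> int:
    # Stand-in tokenizer for the example: one token per whitespace-separated word.
    return len(text.split())

def truncate(text: str, limit: int) -> str:
    # Keep at most `limit` tokens so every emitted chunk is safe to embed.
    return " ".join(text.split()[:limit])

def chunk_class(signature: str, constructor: str, methods: list[str]) -> list[str]:
    # One chunk for the class "header" (signature + constructor),
    # then one chunk per method, each capped at the hard limit.
    header = signature + "\n" + constructor
    chunks = [truncate(header, HARD_TOKEN_LIMIT)]
    for method in methods:
        chunks.append(truncate(method, HARD_TOKEN_LIMIT))
    return chunks
```

Under this scheme every chunk is bounded by construction, which removes the class of bug reported above where a recursive merge step could bypass the limit check.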