I looked further into the implementation of `EpicSplitter`. It seems that the chunking process does not actually ensure that all chunks are smaller than `max_chunk_size`. Is this intended, or is it an error?
For instance, when building the index for sympy__sympy-11870, in the `_chunk_block` function of `EpicSplitter`, when the input is `file_path == 'sympy/combinatorics/permutations.py'` and `codeblock.content == 'class Permutation(Basic):'`, the first chunk appended to `chunks` has 3333 tokens, even though `self.max_chunk_size` is 1500, `self.hard_token_limit` is 2000, and `self.chunk_size` is 750.
Specifically, this 3333-token chunk is appended by this line: `moatless-tools/moatless/index/epic_split.py`, line 266 at commit `bda099d`.

It seems to me that this part of the code recursively chunks the child. If that's correct, do we still need the parent when the child will be indexed separately?
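For illustration, here is a minimal, hypothetical sketch (not the actual moatless code; `count_tokens` and `chunk_block` are stand-ins I made up) of how a splitter that only checks the soft limit *before* appending a child can still emit a chunk far larger than `max_chunk_size`:

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer for the example: one token per whitespace-separated word.
    return len(text.split())

def chunk_block(content: str, children: list[str], max_chunk_size: int) -> list[str]:
    chunks: list[str] = []
    current = content  # e.g. the parent signature "class Permutation(Basic):"
    for child in children:
        if count_tokens(current) < max_chunk_size:
            # The soft limit is checked *before* the child is added, so a
            # large child can push the chunk well past max_chunk_size.
            current += "\n" + child
        else:
            chunks.append(current)
            current = child
    chunks.append(current)
    return chunks

oversized = chunk_block("class C:", [("x " * 50).strip()], max_chunk_size=10)
print(count_tokens(oversized[0]))  # 52 tokens, well over the soft limit of 10
```

The failure mode is the same shape as the one reported: the limit gates whether more content may be added, but nothing re-checks the chunk's final size before it is appended.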
The solution is a bit over- and under-engineered at the same time 😅 The idea was to group small chunks (classes and methods) into the same vector. The different token limits are a bit vague, as you already noticed. The idea was that `max_chunk_size` and `chunk_size` would be soft limits, and to avoid errors when embedding I added `hard_token_limit`. But as you noticed in another issue, this is not always handled properly.
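A hedged sketch of that grouping idea (again not the real implementation; `count_tokens` and `group_blocks` are illustrative names): merge small blocks up to a soft `chunk_size`, and fall back to the `hard_token_limit` only to keep an individual block from breaking the embedding call.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer for the example: one token per whitespace-separated word.
    return len(text.split())

def group_blocks(blocks: list[str], chunk_size: int = 750,
                 hard_token_limit: int = 2000) -> list[str]:
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = current + "\n" + block if current else block
        if count_tokens(candidate) <= chunk_size:
            current = candidate  # still under the soft limit: keep merging
            continue
        if current:
            chunks.append(current)
        # A single block over the hard limit must be cut down here;
        # skipping this check is exactly how oversized chunks slip through.
        if count_tokens(block) <= hard_token_limit:
            current = block
        else:
            current = " ".join(block.split()[:hard_token_limit])
    if current:
        chunks.append(current)
    return chunks
```

The point of the sketch is the division of labor: `chunk_size` decides when to stop merging, while `hard_token_limit` is a last-resort cap that must be enforced on every emitted chunk, not just on the merge decision.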
I plan to simplify this by chunking at the method level and truncating methods larger than `hard_token_limit`. If the parent is a class, the class signature, instance variables and constructor could go in a separate chunk.
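The plan above could be sketched roughly as follows (an assumption-laden illustration, not committed code: `truncate`, `chunk_class`, and the word-based `count_tokens` are all hypothetical, and a real version would truncate on tokenizer tokens, not words):

```python
HARD_TOKEN_LIMIT = 2000  # per the limits quoted in the issue

def count_tokens(text: str) -> int:
    # Stand-in tokenizer for the example: one token per whitespace-separated word.
    return len(text.split())

def truncate(text: str, limit: int) -> str:
    # Keep at most `limit` tokens so every emitted chunk is safe to embed.
    return " ".join(text.split()[:limit])

def chunk_class(signature: str, constructor: str, methods: list[str]) -> list[str]:
    # One chunk for the class "header" (signature + constructor),
    # then one chunk per method, each capped at the hard limit.
    header = signature + "\n" + constructor
    chunks = [truncate(header, HARD_TOKEN_LIMIT)]
    for method in methods:
        chunks.append(truncate(method, HARD_TOKEN_LIMIT))
    return chunks
```

Under this scheme every chunk is bounded by construction, which removes the class of bug reported above where a recursive merge step could bypass the limit check.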