[FEAT]: Cleaning up Content in collectors #2702

Open
morbificagent opened this issue Nov 22, 2024 · 0 comments
Labels: enhancement, feature request

morbificagent commented Nov 22, 2024

What would you like to see?

Hi all,
I have been testing AnythingLLM for a few days and had problems finding context in my files.
I played around with the settings but couldn't get it working the way I wanted. It returned some information but was missing many parts.

Looking at the citations showed that the chunks of my Office files looked like this:

Information....
10 empty lines
some information...
8 empty lines
Footer

All in all, there are many empty lines, probably because of style elements in the document, and redundant information because of the footer on every page.

So I tried to "compress" the information a bit by changing the document collectors/converters, adding:

function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false;
      if (seen.has(line)) return false;
      seen.add(line);
      return true;
    })
    .join("\n");
}

And, a little further down:

const content = deduplicateContent(pageContent.join("\n"));
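
To illustrate the effect, here is a small self-contained sketch that runs the function over a chunk resembling the one described above. The sample text in `pageContent` is invented for this demo and is not taken from the attached file; the real `pageContent` in the collectors may be shaped differently.

```javascript
// Same helper as proposed above, repeated so the snippet runs on its own.
function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false; // drop empty / whitespace-only lines
      if (seen.has(line)) return false; // drop exact repeats (e.g. page footers)
      seen.add(line);
      return true;
    })
    .join("\n");
}

// Hypothetical sample: two "pages" with runs of blank lines and a repeated footer.
const pageContent = [
  "Information....\n\n\n\nsome information...\n\n\nFooter",
  "More information\n\n\nFooter",
];

const content = deduplicateContent(pageContent.join("\n"));
console.log(content);
```

Only the first "Footer" line survives, and every blank line is removed.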

Here is an example file:
asDocx.txt

The result is that all redundant lines are removed, along with the empty lines (which are certainly redundant too ;-) ).

I don't know if this is the best way to do it, but it works and helps me a lot, so AnythingLLM can send better context to the LLM.

Perhaps something like this could be implemented by someone who is able to do it better ;-)
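
One possible refinement, sketched here only as a starting point: removing every repeated line also drops legitimate repeats (identical table rows, for example), and removing every blank line erases paragraph breaks. The hypothetical `cleanContent` helper below, which is not part of AnythingLLM, collapses runs of blank lines into a single one and de-duplicates only lines that occur at least `minRepeats` times, on the assumption that such lines are headers or footers. Both the name and the default threshold are assumptions.

```javascript
// Hypothetical helper, not part of AnythingLLM.
function cleanContent(content, minRepeats = 3) {
  const lines = content.split("\n").map((line) => line.trimEnd());

  // Count how often each non-empty line occurs across the whole document.
  const counts = new Map();
  for (const line of lines) {
    const key = line.trim();
    if (key !== "") counts.set(key, (counts.get(key) || 0) + 1);
  }

  const emitted = new Set();
  const out = [];
  for (const line of lines) {
    const key = line.trim();
    if (key === "") {
      // Collapse a run of blank lines into one; skip leading blanks.
      if (out.length > 0 && out[out.length - 1] !== "") out.push("");
      continue;
    }
    // Lines seen at least minRepeats times are assumed to be headers/footers:
    // keep the first copy, drop the rest.
    if (counts.get(key) >= minRepeats) {
      if (emitted.has(key)) continue;
      emitted.add(key);
    }
    out.push(line);
  }
  // Remove a trailing blank line left by the collapsing step.
  if (out.length > 0 && out[out.length - 1] === "") out.pop();
  return out.join("\n");
}
```

Calling `cleanContent(pageContent.join("\n"))` in place of `deduplicateContent(...)` would be the equivalent hook. Whether a threshold of 3 suits real documents would need testing against files like the attached example.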

morbificagent added the enhancement and feature request labels Nov 22, 2024