[FEAT]: Cleaning up Content in collectors #2702

morbificagent · 2024-11-22T12:27:55Z

What would you like to see?

Hi all,
I have been testing AnythingLLM for a few days now and had problems finding context in my files...
I played around with the settings but couldn't get it working the way I wanted. It delivered some information but was missing many parts.

Looking at the citations showed that the chunks of my office files looked like this:

Information....
10 empty lines
some information...
8 empty lines
Footer

All in all, there are many empty lines, probably because of style elements in the document, and redundant information because of the footer on every page.

So I tried to "compress" the information a little by changing the document collectors/converters, adding:

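// Removes empty lines and every repeated line (e.g. per-page footers),
// keeping only the first occurrence of each line.
function deduplicateContent(content) {
  const seen = new Set(); // lines that have already been emitted
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false; // drop empty / whitespace-only lines
      if (seen.has(line)) return false; // drop exact duplicates
      seen.add(line);
      return true;
    })
    .join("\n");
}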

And, a little bit deeper in the same code:
const content = deduplicateContent(pageContent.join("\n"));

Here is an example file:
asDocx.txt

The result is that all redundant lines are removed, and the empty lines too (which are certainly redundant as well ;-) )
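
To illustrate the effect on a chunk like the one described above, here is a small self-contained sketch. The sample text is made up for illustration and is not taken from asDocx.txt; the helper is the same function as in the post, repeated only so the snippet runs on its own.

```javascript
// Same helper as in the post, repeated so the sketch is self-contained.
function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false; // skip empty lines
      if (seen.has(line)) return false; // skip lines already seen
      seen.add(line);
      return true;
    })
    .join("\n");
}

// Hypothetical two-page extract: runs of blank lines plus a footer on each page.
const sample = [
  "Information....",
  "", "", "",
  "some information...",
  "", "",
  "Footer",
  "More information on page 2",
  "",
  "Footer",
].join("\n");

const cleaned = deduplicateContent(sample);
console.log(cleaned);
```

The second "Footer" and all blank lines are gone, so the chunk carries only the four distinct lines of text.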

I don't know if this is the best method, but it works and helps me a lot, so AnythingLLM can send better context to the LLM...
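
One possible refinement, only a sketch and not tested against the real collectors: compare lines after trimming, so a footer with different indentation still counts as a duplicate, and keep very short lines even when they repeat, since something like "Yes" in a table may legitimately occur many times. The name `cleanContent` and the `minLength` option are hypothetical and do not exist in AnythingLLM.

```javascript
// Hypothetical variant; the function name and the minLength default are assumptions.
function cleanContent(content, { minLength = 4 } = {}) {
  const seen = new Set();
  const out = [];
  for (const rawLine of content.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue; // drop empty and whitespace-only lines
    // Lines shorter than minLength (e.g. "Yes", "1") are kept even when repeated.
    if (line.length >= minLength) {
      if (seen.has(line)) continue; // drop repeated longer lines such as footers
      seen.add(line);
    }
    out.push(line);
  }
  return out.join("\n");
}

const result = cleanContent("  Footer text\nYes\n\nYes\nFooter text  \n");
console.log(result);
```

Note that this variant also strips leading and trailing whitespace from every line it keeps, which may not be wanted for indented content.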

Maybe something like this could be implemented by someone who is able to make it better ;-)

morbificagent added the "enhancement" (New feature or request) and "feature request" labels on Nov 22, 2024
