This repository has been archived by the owner on Jun 15, 2021. It is now read-only.

Suggestions parsing very large corpus? #21

Open
mikeycohen opened this issue Nov 13, 2017 · 6 comments

Comments

@mikeycohen

I currently have a very large corpus, about 20 GB. I've been able to slice it into pieces small enough to be consumed, so that I can process them with a very large heap (about 32 GB). Some tools, like Correlations, seem to time out during processing but could probably finish if the timeouts were raised. I can't find where to tune that. Do you have other suggestions, or pointers to things I could tweak, so that I can process large datasets like this?

So far, Voyant has been quite amazing and valuable with the data I have been able to ingest. Thanks so much for the wonderful OSS project!!

=-mikey-=

@sgsinclair
Owner

Sounds like you've already found some good solutions. When we ingest huge corpora we usually use the command line (on a server instance, but not while it's running). It can take a long time, but with lots of memory it usually works. If you're doing it through the web API, let me know and I'll go dig up some commands that may be useful. But in those cases the large corpora are intended for specialized skins or tools that handle hundreds or thousands of documents; as you've discovered, most tools don't (Trends, for instance, expects to be able to show all document labels). Correlations is one of the heavier tools: every term needs to be compared to every other to determine distribution similarity. But it's a good candidate for better parallelization, and I've been meaning to play with Java 8's parallel stream mechanism.
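
To make the cost of "every term compared to every other" concrete, here is a minimal Python sketch. It is not Voyant's implementation: the `pearson`, `all_pair_correlations`, and `pair_count` helpers are hypothetical names, and Pearson correlation is only assumed here as one plausible measure of distribution similarity.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of counts."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat distribution; report 0
    return cov / sqrt(var_x * var_y)


def all_pair_correlations(distributions):
    """Compare every term with every other term, once per unordered pair."""
    return {
        (a, b): pearson(distributions[a], distributions[b])
        for a, b in combinations(sorted(distributions), 2)
    }


def pair_count(n_terms):
    """Number of unordered term pairs: n * (n - 1) / 2."""
    return n_terms * (n_terms - 1) // 2
```

Under these assumptions, a vocabulary of 10,000 terms already means `pair_count(10000)`, or 49,995,000, comparisons, which is why the work grows so quickly with corpus size. Each pair is independent of the others, which is what makes this kind of loop a natural candidate for parallel execution.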

We'd like to improve this aspect of Voyant, can you say more about what you'd like to be able to do?

@cmbz

cmbz commented Mar 23, 2018

Voyant is fantastic and incredibly useful, thanks!

Could you say a little more about your command-line approach?

We downloaded the latest VoyantServer and are using the web UI (Chrome) on a local machine (Mac, High Sierra, 10.13.3) to load a corpus defined as either A) about 20,000 text files or B) one 84 MB Excel file. We tried both approaches, breaking the 20,000-line Excel file into 20,000 separate documents for the first one. In both cases, Chrome hangs.

Would love to try your approach and see what happens.

@sgsinclair
Owner

Do you see errors in the VoyantServer (not the browser but the application) or on the command line? It's not impossible that the browser is giving up but that the corpus creation actually completes. You could look in the data/corpora directory to see if something gets created there.

The other issue is that Voyant can deal ok with large corpora, but most of the tools aren't well-suited to thousands of documents (100 docs of 100MB will work much better than 10,000 docs of 1MB even if the total size is the same).
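
One way to act on the "fewer, larger documents" advice is to concatenate small files before loading them. The following Python sketch assumes plain-text `.txt` files encoded as UTF-8; `bundle_files` and the `bundle_NNNN.txt` naming are hypothetical and not part of Voyant.

```python
from pathlib import Path


def bundle_files(src_dir, out_dir, bundle_count):
    """Concatenate the .txt files in src_dir into at most bundle_count larger files."""
    if bundle_count < 1:
        raise ValueError("bundle_count must be at least 1")
    files = sorted(Path(src_dir).glob("*.txt"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Ceiling division: contiguous chunks keep the original file order.
    size = max(1, -(-len(files) // bundle_count))
    written = []
    for index, start in enumerate(range(0, len(files), size)):
        target = out / f"bundle_{index:04d}.txt"
        with target.open("w", encoding="utf-8") as handle:
            for path in files[start:start + size]:
                handle.write(path.read_text(encoding="utf-8"))
                handle.write("\n\n")  # blank line between source documents
        written.append(target)
    return written
```

Note that bundling changes what a "document" is: any per-document result would then describe a bundle rather than an original file, so the grouping should follow something meaningful (date, author, source) where possible.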

@cmbz

cmbz commented Mar 30, 2018

We didn't see any console errors but we did see the documents in the data directory. So, perhaps the browser did just give up? However, we restarted the UI and didn't see our corpora available in the list, so perhaps not? Thanks for the information about file size vs. number of documents.

@sgsinclair
Owner

If there's a server timeout, have a look in data/trombone5_2/corpora to see if there's a corpus there. The folder name is the same as the ID, so you could try:

http://localhost:8888/?corpus=FOLDER_NAME
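
If several attempts have left corpora behind, a short script can list the candidate URLs. This is a sketch only: `corpus_urls` is a hypothetical helper, and it simply assumes that each subfolder of the corpora directory is a corpus whose folder name is its ID, as described above.

```python
from pathlib import Path
from urllib.parse import urlencode


def corpus_urls(corpora_dir, base_url="http://localhost:8888/"):
    """Build one ?corpus=ID URL for each subfolder of corpora_dir."""
    root = Path(corpora_dir)
    if not root.is_dir():
        return []  # nothing was created, or the path is wrong
    folders = sorted(p.name for p in root.iterdir() if p.is_dir())
    return [f"{base_url}?{urlencode({'corpus': name})}" for name in folders]


if __name__ == "__main__":
    # Path taken from the comment above; adjust it to your VoyantServer install.
    for url in corpus_urls("data/trombone5_2/corpora"):
        print(url)
```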

The server timeout without interface notification is an aspect that will be improved soon.

@LiberalArtist

@cmbz Have you considered converting your 84 MB Excel file into a plain CSV file before uploading it to Voyant? I suspect that would reduce your file size.

I can confirm from our experience that many of the tools do not work well for corpora with large numbers of documents. We have over 100 documents (and counting), totaling about 30 MB in TEI XML format, and we have already found the Voyant tools much more useful if we divide the documents into multiple corpora.
