Suggestions parsing very large corpus? #21

mikeycohen · 2017-11-13T06:55:07Z

I currently have a very large corpus, about 20 GB. I've been able to slice it up into pieces small enough to be ingested, so that I can process them with a very large heap (about 32 GB). Some tools, such as Correlations, seem to time out during processing, but could probably finish if the timeouts were raised. I can't find where I could tune that. Do you have other suggestions, or pointers to things I could tweak, so that I can process large datasets like this?

So far, Voyant has been quite amazing and valuable with the data I have been able to ingest. Thanks so much for the wonderful OSS project!!

=-mikey-=

sgsinclair · 2017-11-13T13:41:43Z

Sounds like you've already found some good solutions. When we ingest huge corpora we usually use the command line (on a server instance, but not while it's running). It can take a long time, but with lots of memory it usually works (if you're doing it through the web API, let me know and I'll go dig up some commands that may be useful). But in those cases the large corpora are intended for specialized skins or tools that handle hundreds or thousands of documents; as you've discovered, most tools don't (Trends, for instance, expects to be able to show all document labels). Correlations is one of the heavier tools: every term needs to be compared to every other to determine distribution similarity. But it's a good candidate for better parallelization, and I've been meaning to play with Java 8's parallel stream mechanism.
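To illustrate why comparing every term to every other gets expensive, here is a minimal Python sketch. It is not Voyant's implementation: the use of Pearson correlation, the function names, and the sample data are all assumptions made for illustration. The point is only that n terms produce n * (n - 1) / 2 comparisons.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length frequency lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat distribution; report 0
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(distributions):
    """Compare every term with every other term: n * (n - 1) / 2 pairs."""
    return {
        (a, b): pearson(distributions[a], distributions[b])
        for a, b in combinations(sorted(distributions), 2)
    }


# Hypothetical data: frequency of each term across four corpus segments.
distributions = {
    "whale": [1, 2, 3, 4],
    "sea": [2, 4, 6, 8],
    "land": [4, 3, 2, 1],
}
pairs = pairwise_correlations(distributions)
print(len(pairs))  # 3 pairs for 3 terms; 10,000 terms would need 49,995,000
```

Each pair is independent of the others, which is what makes this kind of computation a natural fit for parallel execution.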

We'd like to improve this aspect of Voyant, can you say more about what you'd like to be able to do?

cmbz · 2018-03-23T19:14:09Z

Voyant is fantastic and incredibly useful, thanks!

Could you say a little more about your command-line approach?

We downloaded the latest VoyantServer and are using the web UI (Chrome) on a local machine (Mac, High Sierra, 10.13.3) to load a corpus defined as either A) about 20,000 text files or B) one 84 MB Excel file (we tried both approaches, breaking the 20,000-line Excel file into 20,000 separate documents). In both cases, Chrome hangs.

Would love to try your approach and see what happens.

sgsinclair · 2018-03-23T19:50:46Z

Do you see errors in the VoyantServer (not the browser but the application) or on the command line? It's not impossible that the browser is giving up but that the corpus creation actually completes. You could look in the data/corpora directory to see if something gets created.

The other issue is that Voyant can deal ok with large corpora, but most of the tools aren't well-suited to thousands of documents (100 docs of 100MB will work much better than 10,000 docs of 1MB even if the total size is the same).
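If the corpus is made up of many small plain-text files, one way to get fewer, larger documents is to concatenate them before loading. The following is only a sketch under assumptions: the function name, the `.txt` extension, the UTF-8 encoding, and the `batch_NNN.txt` output naming are all hypothetical, and merging is only appropriate when the boundaries between the original documents don't matter for the analysis.

```python
from pathlib import Path


def merge_into_batches(source_dir, target_dir, batch_count):
    """Concatenate the .txt files in source_dir into at most batch_count files."""
    files = sorted(Path(source_dir).glob("*.txt"))
    if not files:
        return []
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    size = -(-len(files) // batch_count)  # ceiling division: files per batch
    written = []
    for index, start in enumerate(range(0, len(files), size)):
        out = target / f"batch_{index:03d}.txt"
        with out.open("w", encoding="utf-8") as dest:
            # Contiguous slices keep the original (sorted) document order.
            for path in files[start:start + size]:
                dest.write(path.read_text(encoding="utf-8"))
                dest.write("\n\n")  # blank line between source documents
        written.append(out)
    return written
```

For example, `merge_into_batches("small_docs", "merged_docs", 100)` would turn 10,000 small files into 100 merged ones (both directory names are placeholders).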

cmbz · 2018-03-30T19:09:48Z

We didn't see any console errors but we did see the documents in the data directory. So, perhaps the browser did just give up? However, we restarted the UI and didn't see our corpora available in the list, so perhaps not? Thanks for the information about file size vs. number of documents.

sgsinclair · 2018-04-04T13:12:09Z

If there's a server timeout, have a look in data/trombone5_2/corpora to see if there's a corpus there. The folder name is the same as the ID, so you could try

http://localhost:8888/?corpus=FOLDER_NAME

The server timeout without interface notification is an aspect that will be improved soon.
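As a convenience, the lookup described above can be scripted. This is a minimal sketch, assuming the script is run from the directory that contains `data/` and that every subfolder of the corpora directory is a corpus; the function name `corpus_urls` is made up for illustration.

```python
from pathlib import Path
from urllib.parse import urlencode


def corpus_urls(corpora_dir, base_url="http://localhost:8888/"):
    """Return one URL per subfolder of corpora_dir, using the folder name as the corpus ID."""
    root = Path(corpora_dir)
    if not root.is_dir():
        return []  # nothing to list if the directory is missing
    return [
        f"{base_url}?{urlencode({'corpus': entry.name})}"
        for entry in sorted(root.iterdir())
        if entry.is_dir()
    ]


# Path taken from the comment above; adjust it if your data directory differs.
for url in corpus_urls("data/trombone5_2/corpora"):
    print(url)
```

Opening any of the printed URLs in the browser should load that corpus if its creation completed.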

LiberalArtist · 2018-04-10T01:09:03Z

@cmbz Have you considered converting your 84 MB Excel file into a plain CSV file before uploading it to Voyant? I suspect that would reduce your file size.

I can confirm from our experience that many of the tools do not work well for corpora with large numbers of documents. We have over 100 documents (and counting), totaling about 30 MB in TEI XML format, and we have already found the Voyant tools much more useful if we divide the documents into multiple corpora.
