I-analyzer offers several types of downloads to users. This document gives a high-level overview of the types of downloads that exist and where they are implemented.
We distinguish between two types of downloads: direct downloads and scheduled downloads.
For the user, a direct download means their browser will start downloading the file then and there. With a scheduled download, the user will receive an email when their download is complete. Scheduled downloads are only available if the user is signed in.
I-analyzer will automatically choose which type of download to use, based on the number of documents. The cutoff point is configured in the frontend environment.
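As a rough illustration of this choice, the logic could look like the sketch below. The function name, the parameter names, and the inclusive comparison are assumptions made for the example; the real decision is made in the frontend. The text above also does not say what happens for anonymous users above the cutoff, so the sketch only reports that no scheduled download is available.

```python
def choose_download_type(n_documents: int, direct_download_limit: int, signed_in: bool) -> str:
    """Pick a download type based on the number of matching documents.

    All names here are hypothetical; only the general rule (small result
    sets are downloaded directly, large ones are scheduled) comes from
    the documentation.
    """
    if n_documents <= direct_download_limit:
        # Small enough to return in the response to a single request.
        return 'direct'
    if signed_in:
        # Too large for a direct download: run as a background task
        # and email the user when the file is ready.
        return 'scheduled'
    # Assumption: scheduled downloads need an account to email the result to.
    return 'unavailable'


print(choose_download_type(250, 1000, signed_in=False))
print(choose_download_type(50000, 1000, signed_in=True))
```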
Direct downloads are executed synchronously. There is an API endpoint to request the download, which will return the requested file.
Scheduled downloads are run with Celery.
The server will query Elasticsearch to fetch matching documents. This is done in batches of 10,000 documents using the scroll API.
Each batch is written to a CSV file in the server file system (configured with CSV_FILES_PATH) as soon as it arrives. This means the server does not need to store the complete results in memory.
When the CSV file is complete, the user receives an email.
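The batch-wise writing described above can be sketched as follows. This is a minimal illustration, not the actual task code: `fake_batches` is a hypothetical stand-in for paging through Elasticsearch results, and `write_batches_to_csv` and the field names are invented for the example.

```python
import csv
from typing import Dict, Iterable, Iterator, List


def fake_batches(total: int, batch_size: int) -> Iterator[List[Dict]]:
    """Hypothetical stand-in for scrolling through search results."""
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        yield [{'id': i, 'content': f'document {i}'} for i in range(start, end)]


def write_batches_to_csv(batches: Iterable[List[Dict]], fieldnames: List[str], path: str) -> int:
    """Write each batch to the CSV file as it arrives.

    Only one batch is held in memory at a time. Returns the number of
    rows written.
    """
    count = 0
    with open(path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for batch in batches:
            writer.writerows(batch)
            count += len(batch)
    return count
```

Because `batches` is consumed lazily, peak memory use depends on the batch size rather than on the total number of results.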
When the user downloads the complete file, they can choose additional options; at this point, this is just a choice for the file encoding. (We offer utf-16 encoding for compatibility with Microsoft Excel.)
Converting the file encoding takes far less time than fetching the data, so it is handled at this point rather than in the initial processing. It also means the user can request a different encoding without redoing the download.
When the user requests the download, the backend will either stream the file as-is, or, if the encoding needs to be changed, save a converted CSV file and stream that.
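A conversion step of this kind could look like the sketch below. It assumes the stored file is utf-8, which is not stated above, and the function name and chunk size are made up for the example. The file is copied in chunks so that a large CSV is never loaded into memory at once.

```python
def convert_encoding(source_path: str, target_path: str,
                     source_encoding: str = 'utf-8',
                     target_encoding: str = 'utf-16',
                     chunk_size: int = 64 * 1024) -> None:
    """Re-encode a text file by copying it in chunks of characters.

    newline='' disables newline translation, so the line endings
    written by the CSV writer are preserved exactly.
    """
    with open(source_path, 'r', encoding=source_encoding, newline='') as source, \
            open(target_path, 'w', encoding=target_encoding, newline='') as target:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            target.write(chunk)
```

Python's `utf-16` codec writes a byte order mark at the start of the file, which helps spreadsheet software detect the encoding.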
When a user views a visualisation, they can always choose between a graphical view and a table.
With the graphical view, the user can download the graph as a PNG file. We use the html-to-image library to render the image from the page. The VisualizationComponent contains a method to select the HTML element that should be rendered, based on the type of visualisation.
The table view can be downloaded as a CSV. This file is generated by the frontend, using the data it already has available.
Some visualisations base their result on a sample of documents to limit computation time, but offer the user an option to download statistics for the full data.
This happens for the term frequency visualisation and the ngram visualisation.
For these downloads, a request is sent to the backend and handled asynchronously, similar to the scheduled download. When the user downloads the file, they can choose the encoding, and also pick between long and wide format.
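To illustrate the difference between the two formats: in long format each row holds a single value, while in wide format each row holds all values for one key, with one column per series. The sketch below converts one into the other. The field names (`date`, `term`, `frequency`) are hypothetical and chosen only for the example; they are not necessarily the columns I-analyzer uses.

```python
from typing import Dict, List


def long_to_wide(rows: List[Dict], index_field: str, column_field: str, value_field: str) -> List[Dict]:
    """Pivot long-format rows into wide format.

    Each distinct value of `column_field` becomes its own column.
    Missing combinations are filled with None.
    """
    columns: List[str] = []
    wide: Dict = {}
    for row in rows:
        key = row[index_field]
        column = row[column_field]
        if column not in columns:
            columns.append(column)
        wide.setdefault(key, {})[column] = row[value_field]
    return [
        {index_field: key, **{column: values.get(column) for column in columns}}
        for key, values in wide.items()
    ]


long_rows = [
    {'date': '1900', 'term': 'cat', 'frequency': 3},
    {'date': '1900', 'term': 'dog', 'frequency': 1},
    {'date': '1901', 'term': 'cat', 'frequency': 2},
]
wide_rows = long_to_wide(long_rows, 'date', 'term', 'frequency')
# wide_rows has one row per date, with a 'cat' and a 'dog' column.
```

Long format tends to be easier to process in code, while wide format is often easier to read in a spreadsheet.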
The Download model is used to keep track of a user's downloads.
The table includes all search results downloads, and full data downloads for visualisations. It does not include other visualisation downloads, as those are generated in the frontend.
Each user account has a download limit. By default, this is 10,000 documents. You can change this in the admin site to allow individual users to download more documents.
Use this with caution on production servers. Note that the server may also have request timeouts that will effectively prevent users from being able to download large files, even if they are allowed to generate them.