diff --git a/README.md b/README.md index 67e1630..f341146 100644 --- a/README.md +++ b/README.md @@ -16,21 +16,17 @@ With `pip`: pip install impresso-commons ``` -## Notes - -The library supports configuration of s3 credentials via project-specific local .env files. - -## License +or -The second project 'impresso - Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio' is funded by the Swiss National Science Foundation (SNSF) under grant number [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891. - -Aiming to develop and consolidate tools to process and explore large-scale collections of historical newspapers and radio archives, and to study the impact of this tooling on historical research practices, _Impresso II_ builds upon the first project – 'impresso - Media Monitoring of the Past' (grant number [CRSII5_173719](http://p3.snf.ch/project-173719), Sinergia program). More information at . +```bash +pip install --upgrade impresso-commons +``` -Copyright (C) 2024 The _impresso_ team (contributors to this program: Matteo Romanello, Maud Ehrmann, Alex Flückinger, Edoardo Tarek Hölzl, Pauline Conti). +## Notes -This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. +The library supports configuration of s3 credentials via project-specific local .env files. -This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of merchantability or fitness for a particular purpose. See the [GNU Affero General Public License](https://github.com/impresso/impresso-pycommons/blob/master/LICENSE) for more details. 
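As a quick sanity check after installation — a sketch assuming a standard packaging setup, not an official snippet from this repository — the installed version can be queried via `importlib.metadata`:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The distribution name used in the pip install command above.
print(installed_version("impresso-commons"))
```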
+For EPFL members of the Impresso project, further information on how to run the `rebuilder` and the `compute_manifest` scripts on the Runai platform can be found [here](https://github.com/impresso/impresso-infrastructure). ## Data Versioning @@ -65,12 +61,18 @@ The versioning aiming to document the data at each step through versions and sta After each processing step, a manifest should be created documenting the changes made to the data resulting from that processing. It can also be created on the fly during a processing, and in-between processings to count and sanity-check the contents of a given S3 bucket. Once created, the manifest file will automatically be uploaded to the S3 bucket corresponding to the data it was computed on, and optionally pushed to the [impresso-data-release](https://github.com/impresso/impresso-data-release) repository to keep track of all changes made throughout the versions. -#### Computing a manifest - `compute_manifest.py` script +There are multiple ways in which the manifest can be created/computed. + +#### Computing a manifest automatically based on the S3 data - `compute_manifest.py` script -The script `compute_manifest.py`, allows one to compute a manifest on the data present within a specific S3 bucket. -The CLI for this script is the following: +The script `impresso_commons/versioning/compute_manifest.py` allows one to compute a manifest on the data present within a specific S3 bucket. +This approach is meant to compute the manifest **after** the processing is over: it will automatically fetch the data (according to the configuration) and compute the needed statistics on it. +It can be run in three ways: via the CLI from the cloned `impresso_pycommons` repository, by running the script as a module, or by calling the function performing the main logic from within one's own code.
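All three entry points consume the same JSON configuration. As a purely illustrative sketch — the keys below are hypothetical, the real schema is documented in the example config file of this repository — such a file can be prepared like this:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical configuration values: the actual keys expected by
# compute_manifest are documented in manifest.config.example.md.
example_config = {
    "data_stage": "passim",
    "output_bucket": "32-passim-rebuilt-final",
    "push_to_git": False,
}

# Write the config to a JSON file, as passed via --config-file=...
config_path = Path(tempfile.gettempdir()) / "manifest_config.json"
config_path.write_text(json.dumps(example_config, indent=2), encoding="utf-8")

# The script then reads it back into a plain dict.
config_dict = json.loads(config_path.read_text(encoding="utf-8"))
print(config_dict)
```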
+ +The **CLI** for this script is the following: ```bash +# when the working directory is impresso_pycommons/impresso_commons/versioning python compute_manifest.py --config-file= --log-file= [--scheduler= --nworkers= --verbose] ``` @@ -79,6 +81,30 @@ Where the `config_file` should be a simple json file, with specific arguments, a - The script uses [dask](https://www.dask.org/) to parallelize its task. By default, it will start a local cluster, with 8 as the default number of workers (the parameter `nworkers` can be used to specify any desired value). - Optionally, a [dask scheduler and workers](https://docs.dask.org/en/stable/deploying-cli.html) can be started in separate terminal windows, and their IP provided to the script via the `scheduler` parameter. +It can also be **run as a module** with the same CLI, from any other project or directory, as long as `impresso_commons` is installed in the user's environment. The same arguments apply: + +```bash +# the env where impresso_commons is installed should be active +python -m impresso_commons.versioning.compute_manifest --config-file= --log-file= [--scheduler= --nworkers= --verbose] +``` + +Finally, one may prefer to **directly incorporate this computation within their code**. That can be done by calling the `create_manifest` function, which performs the main logic, in the following way: +```python +import json + +from dask.distributed import Client + +from impresso_commons.versioning.compute_manifest import create_manifest + +# optional: the config_dict can also be defined directly +with open(config_file_path, "r", encoding="utf-8") as f_in: + config_dict = json.load(f_in) + +# also optional, can be None +dask_client = Client(n_workers=nworkers, threads_per_worker=1) + +create_manifest(config_dict, dask_client) +``` +- The `config_dict` is a dict with the same contents as the `config_file`, described [here](https://github.com/impresso/impresso-pycommons/blob/data-workflow-versioning/impresso_commons/data/manifest_config/manifest.config.example.md).
+- Providing `dask_client` is optional, and the user can choose whether to include it or not. +- However, when generating the manifest in this way, the user should add an `if __name__ == "__main__":` guard in the script calling `create_manifest`. + #### Computing a manifest on the fly during a process It's also possible to compute a manifest on the fly during a process. This method is better suited in particular when the output from the process is not stored on S3, e.g. for data indexation. @@ -88,6 +114,8 @@ To do so, some simple modifications should be made to the process' code: - Example instantiation: ```python + from impresso_commons.versioning.data_manifest import DataManifest + manifest = DataManifest( data_stage="passim", # DataStage.PASSIM also accepted s3_output_bucket="32-passim-rebuilt-final/passim", # includes partition within bucket @@ -103,6 +131,7 @@ To do so, some simple modifications should be made to the process' code: push_to_git=True, ) ``` + Note however that, as opposed to the previous approach, simply instantiating the manifest **will not do anything**, as it is not filled in with S3 data automatically. Instead, the user should provide it with the statistics they computed on their data and wish to track, as described in the next steps. 2.
**Addition of data and counts:** Once the manifest is instantiated, the main interaction with the manifest object will be through the `add_by_title_year` or `add_by_ci_id` methods (two others with "replace" instead also exist, as well as `add_count_list_by_title_year`, all described in the [documentation](https://impresso-pycommons.readthedocs.io/)), which take as input: - The _media title_ and _year_ to which the provided counts correspond @@ -178,3 +207,23 @@ Based on the information that was updated, the version increment varies: - This parameter is made exactly for the scenarios where one wants to recompute the manifest on an _entire bucket of existing data_ which has not necessarily been recomputed or changed (for instance if data was copied, or simply to recount, etc.). - The computation of the manifest in this context is meant more as a sanity check of the bucket's contents. - The counts and statistics will be computed like in other cases, but the update information (modification date, updated years, git commit URL, etc.) will not be updated unless a change in the statistics is identified (in which case the resulting manifest version is incremented accordingly). + +## About Impresso + +### Impresso project + +[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891. + +### Copyright + +Copyright (C) 2024 The Impresso team.
+ +### License + +This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pycommons/blob/master/LICENSE) v3 or later. + +--- + +

+ Impresso Project Logo +

\ No newline at end of file diff --git a/docs/_build/doctrees/environment.pickle b/docs/_build/doctrees/environment.pickle index 641e183..fbd2236 100644 Binary files a/docs/_build/doctrees/environment.pickle and b/docs/_build/doctrees/environment.pickle differ diff --git a/docs/_build/doctrees/images.doctree b/docs/_build/doctrees/images.doctree index a2c17a4..ade1033 100644 Binary files a/docs/_build/doctrees/images.doctree and b/docs/_build/doctrees/images.doctree differ diff --git a/docs/_build/doctrees/index.doctree b/docs/_build/doctrees/index.doctree index 1590498..3ac58e9 100644 Binary files a/docs/_build/doctrees/index.doctree and b/docs/_build/doctrees/index.doctree differ diff --git a/docs/_build/doctrees/io.doctree b/docs/_build/doctrees/io.doctree index d2c1131..7873d90 100644 Binary files a/docs/_build/doctrees/io.doctree and b/docs/_build/doctrees/io.doctree differ diff --git a/docs/_build/doctrees/rebuild.doctree b/docs/_build/doctrees/rebuild.doctree index dc4e806..d9bb08f 100644 Binary files a/docs/_build/doctrees/rebuild.doctree and b/docs/_build/doctrees/rebuild.doctree differ diff --git a/docs/_build/doctrees/utils.doctree b/docs/_build/doctrees/utils.doctree index dc5ac2d..4245d73 100644 Binary files a/docs/_build/doctrees/utils.doctree and b/docs/_build/doctrees/utils.doctree differ diff --git a/docs/_build/doctrees/versioning.doctree b/docs/_build/doctrees/versioning.doctree index 6215c77..3f1affe 100644 Binary files a/docs/_build/doctrees/versioning.doctree and b/docs/_build/doctrees/versioning.doctree differ diff --git a/docs/_build/html/.buildinfo b/docs/_build/html/.buildinfo index d9e8c4b..9501185 100644 --- a/docs/_build/html/.buildinfo +++ b/docs/_build/html/.buildinfo @@ -1,4 +1,4 @@ # Sphinx build info version 1 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. 
-config: 8c31fa3be0c6ebef3824e4e08997d35b +config: 22fc4b3edf789b51746d48473d45c93f tags: 645f666f9bcd5a90fca523b33c5a78b7 diff --git a/docs/_build/html/_static/basic.css b/docs/_build/html/_static/basic.css index 30fee9d..f316efc 100644 --- a/docs/_build/html/_static/basic.css +++ b/docs/_build/html/_static/basic.css @@ -4,7 +4,7 @@ * * Sphinx stylesheet -- basic theme. * - * :copyright: Copyright 2007-2023 by the Sphinx team, see AUTHORS. + * :copyright: Copyright 2007-2024 by the Sphinx team, see AUTHORS. * :license: BSD, see LICENSE for details. * */ diff --git a/docs/_build/html/_static/doctools.js b/docs/_build/html/_static/doctools.js index d06a71d..4d67807 100644 --- a/docs/_build/html/_static/doctools.js +++ b/docs/_build/html/_static/doctools.js @@ -4,7 +4,7 @@ * * Base JavaScript utilities for all Sphinx HTML documentation. * - * :copyright: Copyright 2007-2023 by the Sphinx team, see AUTHORS. + * :copyright: Copyright 2007-2024 by the Sphinx team, see AUTHORS. * :license: BSD, see LICENSE for details. * */ diff --git a/docs/_build/html/_static/language_data.js b/docs/_build/html/_static/language_data.js index 250f566..367b8ed 100644 --- a/docs/_build/html/_static/language_data.js +++ b/docs/_build/html/_static/language_data.js @@ -5,7 +5,7 @@ * This script contains the language-specific data used by searchtools.js, * namely the list of stopwords, stemmer, scorer and splitter. * - * :copyright: Copyright 2007-2023 by the Sphinx team, see AUTHORS. + * :copyright: Copyright 2007-2024 by the Sphinx team, see AUTHORS. * :license: BSD, see LICENSE for details. 
* */ @@ -13,7 +13,7 @@ var stopwords = ["a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "near", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]; -/* Non-minified version is copied as a separate JS file, is available */ +/* Non-minified version is copied as a separate JS file, if available */ /** * Porter Stemmer diff --git a/docs/_build/html/_static/searchtools.js b/docs/_build/html/_static/searchtools.js index 7918c3f..b08d58c 100644 --- a/docs/_build/html/_static/searchtools.js +++ b/docs/_build/html/_static/searchtools.js @@ -4,7 +4,7 @@ * * Sphinx JavaScript utilities for the full-text search. * - * :copyright: Copyright 2007-2023 by the Sphinx team, see AUTHORS. + * :copyright: Copyright 2007-2024 by the Sphinx team, see AUTHORS. * :license: BSD, see LICENSE for details. * */ @@ -99,7 +99,7 @@ const _displayItem = (item, searchTerms, highlightTerms) => { .then((data) => { if (data) listItem.appendChild( - Search.makeSearchSummary(data, searchTerms) + Search.makeSearchSummary(data, searchTerms, anchor) ); // highlight search terms in the summary if (SPHINX_HIGHLIGHT_ENABLED) // set in sphinx_highlight.js @@ -116,8 +116,8 @@ const _finishSearch = (resultCount) => { ); else Search.status.innerText = _( - `Search finished, found ${resultCount} page(s) matching the search query.` - ); + "Search finished, found ${resultCount} page(s) matching the search query." + ).replace('${resultCount}', resultCount); }; const _displayNextItem = ( results, @@ -137,6 +137,22 @@ const _displayNextItem = ( // search finished, update title and status message else _finishSearch(resultCount); }; +// Helper function used by query() to order search results. +// Each input is an array of [docname, title, anchor, descr, score, filename]. 
+// Order the results by score (in opposite order of appearance, since the +// `_displayNextItem` function uses pop() to retrieve items) and then alphabetically. +const _orderResultsByScoreThenName = (a, b) => { + const leftScore = a[4]; + const rightScore = b[4]; + if (leftScore === rightScore) { + // same score: sort alphabetically + const leftTitle = a[1].toLowerCase(); + const rightTitle = b[1].toLowerCase(); + if (leftTitle === rightTitle) return 0; + return leftTitle > rightTitle ? -1 : 1; // inverted is intentional + } + return leftScore > rightScore ? 1 : -1; +}; /** * Default splitQuery function. Can be overridden in ``sphinx.search`` with a @@ -160,13 +176,26 @@ const Search = { _queued_query: null, _pulse_status: -1, - htmlToText: (htmlString) => { + htmlToText: (htmlString, anchor) => { const htmlElement = new DOMParser().parseFromString(htmlString, 'text/html'); - htmlElement.querySelectorAll(".headerlink").forEach((el) => { el.remove() }); + for (const removalQuery of [".headerlink", "script", "style"]) { + htmlElement.querySelectorAll(removalQuery).forEach((el) => { el.remove() }); + } + if (anchor) { + const anchorContent = htmlElement.querySelector(`[role="main"] ${anchor}`); + if (anchorContent) return anchorContent.textContent; + + console.warn( + `Anchored content block not found. Sphinx search tries to obtain it via DOM query '[role=main] ${anchor}'. Check your theme or template.` + ); + } + + // if anchor not specified or not found, fall back to main content const docContent = htmlElement.querySelector('[role="main"]'); - if (docContent !== undefined) return docContent.textContent; + if (docContent) return docContent.textContent; + console.warn( - "Content block not found. Sphinx search tries to obtain it via '[role=main]'. Could you check your theme or template." + "Content block not found. Sphinx search tries to obtain it via DOM query '[role=main]'. Check your theme or template." 
); return ""; }, @@ -239,16 +268,7 @@ const Search = { else Search.deferQuery(query); }, - /** - * execute search (requires search index to be loaded) - */ - query: (query) => { - const filenames = Search._index.filenames; - const docNames = Search._index.docnames; - const titles = Search._index.titles; - const allTitles = Search._index.alltitles; - const indexEntries = Search._index.indexentries; - + _parseQuery: (query) => { // stem the search terms and add them to the correct list const stemmer = new Stemmer(); const searchTerms = new Set(); @@ -284,21 +304,38 @@ const Search = { // console.info("required: ", [...searchTerms]); // console.info("excluded: ", [...excludedTerms]); - // array of [docname, title, anchor, descr, score, filename] - let results = []; + return [query, searchTerms, excludedTerms, highlightTerms, objectTerms]; + }, + + /** + * execute search (requires search index to be loaded) + */ + _performSearch: (query, searchTerms, excludedTerms, highlightTerms, objectTerms) => { + const filenames = Search._index.filenames; + const docNames = Search._index.docnames; + const titles = Search._index.titles; + const allTitles = Search._index.alltitles; + const indexEntries = Search._index.indexentries; + + // Collect multiple result groups to be sorted separately and then ordered. + // Each is an array of [docname, title, anchor, descr, score, filename]. 
+ const normalResults = []; + const nonMainIndexResults = []; + _removeChildren(document.getElementById("search-progress")); - const queryLower = query.toLowerCase(); + const queryLower = query.toLowerCase().trim(); for (const [title, foundTitles] of Object.entries(allTitles)) { - if (title.toLowerCase().includes(queryLower) && (queryLower.length >= title.length/2)) { + if (title.toLowerCase().trim().includes(queryLower) && (queryLower.length >= title.length/2)) { for (const [file, id] of foundTitles) { - let score = Math.round(100 * queryLower.length / title.length) - results.push([ + const score = Math.round(Scorer.title * queryLower.length / title.length); + const boost = titles[file] === title ? 1 : 0; // add a boost for document titles + normalResults.push([ docNames[file], titles[file] !== title ? `${titles[file]} > ${title}` : title, id !== null ? "#" + id : "", null, - score, + score + boost, filenames[file], ]); } @@ -308,46 +345,47 @@ const Search = { // search for explicit entries in index directives for (const [entry, foundEntries] of Object.entries(indexEntries)) { if (entry.includes(queryLower) && (queryLower.length >= entry.length/2)) { - for (const [file, id] of foundEntries) { - let score = Math.round(100 * queryLower.length / entry.length) - results.push([ + for (const [file, id, isMain] of foundEntries) { + const score = Math.round(100 * queryLower.length / entry.length); + const result = [ docNames[file], titles[file], id ? 
"#" + id : "", null, score, filenames[file], - ]); + ]; + if (isMain) { + normalResults.push(result); + } else { + nonMainIndexResults.push(result); + } } } } // lookup as object objectTerms.forEach((term) => - results.push(...Search.performObjectSearch(term, objectTerms)) + normalResults.push(...Search.performObjectSearch(term, objectTerms)) ); // lookup as search terms in fulltext - results.push(...Search.performTermsSearch(searchTerms, excludedTerms)); + normalResults.push(...Search.performTermsSearch(searchTerms, excludedTerms)); // let the scorer override scores with a custom scoring function - if (Scorer.score) results.forEach((item) => (item[4] = Scorer.score(item))); - - // now sort the results by score (in opposite order of appearance, since the - // display function below uses pop() to retrieve items) and then - // alphabetically - results.sort((a, b) => { - const leftScore = a[4]; - const rightScore = b[4]; - if (leftScore === rightScore) { - // same score: sort alphabetically - const leftTitle = a[1].toLowerCase(); - const rightTitle = b[1].toLowerCase(); - if (leftTitle === rightTitle) return 0; - return leftTitle > rightTitle ? -1 : 1; // inverted is intentional - } - return leftScore > rightScore ? 1 : -1; - }); + if (Scorer.score) { + normalResults.forEach((item) => (item[4] = Scorer.score(item))); + nonMainIndexResults.forEach((item) => (item[4] = Scorer.score(item))); + } + + // Sort each group of results by score and then alphabetically by name. + normalResults.sort(_orderResultsByScoreThenName); + nonMainIndexResults.sort(_orderResultsByScoreThenName); + + // Combine the result groups in (reverse) order. + // Non-main index entries are typically arbitrary cross-references, + // so display them after other results. 
+ let results = [...nonMainIndexResults, ...normalResults]; // remove duplicate search results // note the reversing of results, so that in the case of duplicates, the highest-scoring entry is kept @@ -361,7 +399,12 @@ const Search = { return acc; }, []); - results = results.reverse(); + return results.reverse(); + }, + + query: (query) => { + const [searchQuery, searchTerms, excludedTerms, highlightTerms, objectTerms] = Search._parseQuery(query); + const results = Search._performSearch(searchQuery, searchTerms, excludedTerms, highlightTerms, objectTerms); // for debugging //Search.lastresults = results.slice(); // a copy @@ -466,14 +509,18 @@ const Search = { // add support for partial matches if (word.length > 2) { const escapedWord = _escapeRegExp(word); - Object.keys(terms).forEach((term) => { - if (term.match(escapedWord) && !terms[word]) - arr.push({ files: terms[term], score: Scorer.partialTerm }); - }); - Object.keys(titleTerms).forEach((term) => { - if (term.match(escapedWord) && !titleTerms[word]) - arr.push({ files: titleTerms[word], score: Scorer.partialTitle }); - }); + if (!terms.hasOwnProperty(word)) { + Object.keys(terms).forEach((term) => { + if (term.match(escapedWord)) + arr.push({ files: terms[term], score: Scorer.partialTerm }); + }); + } + if (!titleTerms.hasOwnProperty(word)) { + Object.keys(titleTerms).forEach((term) => { + if (term.match(escapedWord)) + arr.push({ files: titleTerms[term], score: Scorer.partialTitle }); + }); + } } // no match but word was a required one @@ -496,9 +543,8 @@ const Search = { // create the mapping files.forEach((file) => { - if (fileMap.has(file) && fileMap.get(file).indexOf(word) === -1) - fileMap.get(file).push(word); - else fileMap.set(file, [word]); + if (!fileMap.has(file)) fileMap.set(file, [word]); + else if (fileMap.get(file).indexOf(word) === -1) fileMap.get(file).push(word); }); }); @@ -549,8 +595,8 @@ const Search = { * search summary for a given text. keywords is a list * of stemmed words. 
*/ - makeSearchSummary: (htmlText, keywords) => { - const text = Search.htmlToText(htmlText); + makeSearchSummary: (htmlText, keywords, anchor) => { + const text = Search.htmlToText(htmlText, anchor); if (text === "") return null; const textLower = text.toLowerCase(); diff --git a/docs/_build/html/genindex.html b/docs/_build/html/genindex.html index 8523987..58a0354 100644 --- a/docs/_build/html/genindex.html +++ b/docs/_build/html/genindex.html @@ -1,11 +1,13 @@ - + Index — Impresso PyCommons documentation - - + + + + @@ -13,7 +15,7 @@ - + diff --git a/docs/_build/html/images.html b/docs/_build/html/images.html index 0196e80..f252ddb 100644 --- a/docs/_build/html/images.html +++ b/docs/_build/html/images.html @@ -1,12 +1,14 @@ - + - + Image handling — Impresso PyCommons documentation - - + + + + @@ -14,7 +16,7 @@ - + diff --git a/docs/_build/html/index.html b/docs/_build/html/index.html index 4e07105..1592d5a 100644 --- a/docs/_build/html/index.html +++ b/docs/_build/html/index.html @@ -1,12 +1,14 @@ - + - + Welcome to Impresso PyCommons’s documentation! 
— Impresso PyCommons documentation - - + + + + @@ -14,7 +16,7 @@ - + diff --git a/docs/_build/html/io.html b/docs/_build/html/io.html index 259d623..89339a8 100644 --- a/docs/_build/html/io.html +++ b/docs/_build/html/io.html @@ -1,12 +1,14 @@ - + - + Input/Output — Impresso PyCommons documentation - - + + + + @@ -14,7 +16,7 @@ - + diff --git a/docs/_build/html/py-modindex.html b/docs/_build/html/py-modindex.html index be06ca2..5474c92 100644 --- a/docs/_build/html/py-modindex.html +++ b/docs/_build/html/py-modindex.html @@ -1,11 +1,13 @@ - + Python Module Index — Impresso PyCommons documentation - - + + + + @@ -13,7 +15,7 @@ - + diff --git a/docs/_build/html/rebuild.html b/docs/_build/html/rebuild.html index 7491e7e..f01c479 100644 --- a/docs/_build/html/rebuild.html +++ b/docs/_build/html/rebuild.html @@ -1,12 +1,14 @@ - + - + Text Rebuild — Impresso PyCommons documentation - - + + + + @@ -14,7 +16,7 @@ - + @@ -202,7 +204,7 @@

Rebuild functionsParameters:
  • level (int) – desired level of logging (default: logging.INFO)

  • -
  • file (str) –

  • +
  • file (str)

Returns:
diff --git a/docs/_build/html/search.html b/docs/_build/html/search.html index 7258c61..d737643 100644 --- a/docs/_build/html/search.html +++ b/docs/_build/html/search.html @@ -1,11 +1,13 @@ - + Search — Impresso PyCommons documentation - - + + + + @@ -14,7 +16,7 @@ - + @@ -582,7 +584,7 @@

Utilities + - + Data Versioning — Impresso PyCommons documentation - - + + + + @@ -14,7 +16,7 @@ - + @@ -1284,15 +1286,15 @@

Data Versioning
-impresso_commons.versioning.helpers.filter_new_or_modified_media(rebuilt_mft_path: str, previous_mft_path_str: str) dict[str, Any]
+impresso_commons.versioning.helpers.filter_new_or_modified_media(rebuilt_mft_json: dict[str, Any], previous_mft_json: dict[str, Any]) dict[str, Any]

Compares two manifests to determine new or modified media items.

Typical use-case is during an atomic update, when only media items added or modified compared to the previous process need to be ingested or processed.

Parameters:
    -
  • rebuilt_mft_path (str) – Path of the rebuilt manifest (new).

  • -
  • previous_mft_path_str (str) – Path of the previous process manifest.

  • +
  • rebuilt_mft_json (dict[str, Any]) – json of the rebuilt manifest (new).

  • +
  • previous_mft_json (dict[str, Any]) – json of the previous process manifest.

Returns:
@@ -1527,7 +1529,7 @@

Data Versioning
  • mnf_json (dict) – A dictionary containing manifest data.

  • extended_summary (bool, optional) – Whether to include extended summary

  • -
  • False. (with year statistics. Defaults to) –

  • +
  • False. (with year statistics. Defaults to)

Returns:
diff --git a/impresso_commons/__init__.py b/impresso_commons/__init__.py index 6849410..a82b376 100644 --- a/impresso_commons/__init__.py +++ b/impresso_commons/__init__.py @@ -1 +1 @@ -__version__ = "1.1.0" +__version__ = "1.1.1" diff --git a/impresso_commons/versioning/helpers.py b/impresso_commons/versioning/helpers.py index 70f0ece..c0281ec 100644 --- a/impresso_commons/versioning/helpers.py +++ b/impresso_commons/versioning/helpers.py @@ -927,11 +927,18 @@ def compute_stats_in_entities_bag( "content_items_out": 1, "ne_mentions": len(ci["nes"]), "ne_entities": sorted( - list(set([m["wkd_id"] for m in ci["nes"] if m["wkd_id"] != "NIL"])) + list( + set( + [ + m["wkd_id"] + for m in ci["nes"] + if "wkd_id" in m and m["wkd_id"] not in ["NIL", None] + ] + ) + ) ), # sorted list to ensure all are the same } - ) - .to_dataframe( + ).to_dataframe( meta={ "np_id": str, "year": str, @@ -941,9 +948,14 @@ def compute_stats_in_entities_bag( "ne_entities": object, } ) - .explode("ne_entities") - .persist() + # .explode("ne_entities") + # .persist() + ) + + count_df["ne_entities"] = count_df["ne_entities"].apply( + lambda x: x if isinstance(x, list) else [x] ) + count_df = count_df.explode("ne_entities").persist() # cum the counts for all values collected aggregated_df = ( @@ -1171,7 +1183,7 @@ def manifest_summary(mnf_json: dict[str, Any], extended_summary: bool = False) - def filter_new_or_modified_media( - rebuilt_mft_path: str, previous_mft_path_str: str + rebuilt_mft_json: dict[str, Any], previous_mft_json: dict[str, Any] ) -> dict[str, Any]: """ Compares two manifests to determine new or modified media items. @@ -1180,8 +1192,8 @@ def filter_new_or_modified_media( compared to the previous process need to be ingested or processed. Args: - rebuilt_mft_path (str): Path of the rebuilt manifest (new). - previous_mft_path_str (str): Path of the previous process manifest. + rebuilt_mft_json (dict[str, Any]): json of the rebuilt manifest (new). 
+ previous_mft_json (dict[str, Any]): json of the previous process manifest. Returns: list[dict[str, Any]]: A manifest identical to 'rebuilt_mft_json' but only with @@ -1193,8 +1205,6 @@ {'media_title': 'modified_media_item_2', 'last_modif_date': '2024-04-03T12:00:00Z', etc.}] """ - rebuilt_mft_json = read_manifest_from_s3_path(rebuilt_mft_path) - previous_mft_json = read_manifest_from_s3_path(previous_mft_path_str) filtered_manifest = copy.deepcopy(rebuilt_mft_json) # Extract last modification date of each media item of the previous process diff --git a/requirements.txt b/requirements.txt index 16e127c..c9ed436 100644 --- a/requirements.txt +++ b/requirements.txt @@ -115,7 +115,7 @@ notebook>=7.0.3 notebook_shim>=0.2.3 numpy>=1.25.2 oauthlib>=3.2.2 -opencv-python>=4.8.0.76 +opencv-python>=4.9.0 overrides>=7.4.0 packaging>=23.1 pandas>=2.1.0
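As a side note on the `filter_new_or_modified_media` change above (it now accepts parsed manifest dicts rather than S3 paths): the core comparison it performs — keep media items that are new, or whose `last_modif_date` differs from the previous manifest — can be sketched with the standard library alone. This is a simplified illustration, not the library implementation; the `media_list` key and the overall data shapes are assumptions, and only `media_title` and `last_modif_date` appear in the docstring.

```python
import copy
from typing import Any


def filter_new_or_modified(
    rebuilt_mft: dict[str, Any], previous_mft: dict[str, Any]
) -> dict[str, Any]:
    """Keep only media items that are new or modified since the previous manifest.

    Simplified sketch: the "media_list" key is assumed for illustration.
    """
    # Index the previous manifest's media items by title.
    prev_dates = {
        m["media_title"]: m["last_modif_date"] for m in previous_mft["media_list"]
    }
    # Deep-copy so the input manifest is left untouched.
    filtered = copy.deepcopy(rebuilt_mft)
    filtered["media_list"] = [
        m
        for m in rebuilt_mft["media_list"]
        if m["media_title"] not in prev_dates
        or m["last_modif_date"] != prev_dates[m["media_title"]]
    ]
    return filtered


previous = {
    "media_list": [
        {"media_title": "media_A", "last_modif_date": "2024-01-01T00:00:00Z"},
        {"media_title": "media_B", "last_modif_date": "2024-01-01T00:00:00Z"},
    ]
}
rebuilt = {
    "media_list": [
        {"media_title": "media_A", "last_modif_date": "2024-04-03T12:00:00Z"},  # modified
        {"media_title": "media_B", "last_modif_date": "2024-01-01T00:00:00Z"},  # unchanged
        {"media_title": "media_C", "last_modif_date": "2024-04-03T12:00:00Z"},  # new
    ]
}
result = filter_new_or_modified(rebuilt, previous)
```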