Merge branch 'master' into bugfix/fix-morph-memory-zone

explosion · Dec 10, 2024 · 2676746 · 2676746
2 parents 1a4d21c + 3e30b5b
commit 2676746
Showing 11 changed files with 561 additions and 82 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -35,7 +35,7 @@ so that more people can benefit from it.
 
 When opening an issue, use a **descriptive title** and include your
 **environment** (operating system, Python version, spaCy version). Our
-[issue template](https://github.com/explosion/spaCy/issues/new) helps you
+[issue templates](https://github.com/explosion/spaCy/issues/new/choose) help you
 remember the most important details to include. If you've discovered a bug, you
 can also submit a [regression test](#fixing-bugs) straight away. When you're
 opening an issue to report the bug, simply refer to your pull request in the

diff --git a/spacy/lang/hr/lemma_lookup_license.txt b/spacy/lang/hr/lemma_lookup_license.txt
@@ -1,5 +1,5 @@
 The list of Croatian lemmas was extracted from the reldi-tagger repository (https://github.com/clarinsi/reldi-tagger).
-Reldi-tagger is licesned under the Apache 2.0 licence.
+Reldi-tagger is licensed under the Apache 2.0 licence.
 
 @InProceedings{ljubesic16-new,
   author = {Nikola Ljubešić and Filip Klubička and Željko Agić and Ivo-Pavao Jazbec},
@@ -12,4 +12,4 @@ Reldi-tagger is licesned under the Apache 2.0 licence.
   publisher = {European Language Resources Association (ELRA)},
   address = {Paris, France},
   isbn = {978-2-9517408-9-1}
- }
+ }
diff --git a/spacy/tests/training/test_pretraining.py → ...sts/training/test_pretraining.py.disabled b/spacy/tests/training/test_pretraining.py → ...sts/training/test_pretraining.py.disabled
diff --git a/website/docs/api/language.mdx b/website/docs/api/language.mdx
@@ -890,6 +890,28 @@ when loading a config with
 | `pipe_name`    | Name of pipeline component to replace listeners for. ~~str~~                                                                                                                                                                                                                                                                                                                                                                           |
 | `listeners`    | The paths to the listeners, relative to the component config, e.g. `["model.tok2vec"]`. Typically, implementations will only connect to one tok2vec component, `model.tok2vec`, but in theory, custom models can use multiple listeners. The value here can either be an empty list to not replace any listeners, or a _complete_ list of the paths to all listener layers used by the model that should be replaced.~~Iterable[str]~~ |
 
+## Language.memory_zone {id="memory_zone",tag="contextmanager",version="3.8"}
+
+Begin a block where all resources allocated during the block will be freed at
+the end of it. If a resource was created within the memory zone block,
+accessing it outside the block is invalid. Behavior of this invalid access is
+undefined. Memory zones should not be nested. The memory zone is helpful for
+services that need to process large volumes of text with a defined memory budget.
+
+> ```python
+> ### Example
+> counts = Counter()
+> with nlp.memory_zone():
+>     for doc in nlp.pipe(texts):
+>         for token in doc:
+>             counts[token.text] += 1
+> ```
+
+| Name | Description |
+| --- | --- |
+| `mem` | Optional `cymem.Pool` object to own allocations (created if not provided). This argument is not required for ordinary usage. Defaults to `None`. ~~Optional[cymem.Pool]~~ |
+| **RETURNS** | The memory pool that owns the allocations. This object is not required for ordinary usage. ~~Iterator[cymem.Pool]~~ |
+
 ## Language.meta {id="meta",tag="property"}
 
 Meta data for the `Language` class, including name, version, data sources,

diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx
@@ -1597,7 +1597,7 @@ The name of the model to be used has to be passed in via the `name` attribute.
 
 | Argument | Description                                                                                                                                                           |
 | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `name`   | The name of a mdodel supported by LangChain for this API. ~~str~~                                                                                                     |
+| `name`   | The name of a model supported by LangChain for this API. ~~str~~                                                                                                     |
 | `config` | Configuration passed on to the LangChain model. Defaults to `{}`. ~~Dict[Any, Any]~~                                                                                  |
 | `query`  | Function that executes the prompts. If `None`, defaults to `spacy.CallLangChain.v1`. ~~Optional[Callable[["langchain.llms.BaseLLM", Iterable[Any]], Iterable[Any]]]~~ |
 

diff --git a/website/docs/usage/memory-management.mdx b/website/docs/usage/memory-management.mdx
@@ -0,0 +1,131 @@
+---
+title: Memory Management
+teaser: Managing Memory for persistent services
+version: 3.8
+menu:
+  - ['Memory Zones', 'memoryzones']
+  - ['Clearing Doc attributes', 'doc-attrs']
+---
+
+spaCy maintains a few internal caches that improve speed,
+but cause memory to increase slightly over time. If you're
+running a batch process that you don't need to be long-lived,
+the increase in memory usage generally isn't a problem.
+However, if you're running spaCy inside a web service, you'll
+often want spaCy's memory usage to stay consistent. Transformer
+models can also run into memory problems sometimes, especially when
+used on a GPU.
+
+## Memory zones {id="memoryzones"}
+
+You can tell spaCy to free data from its internal caches (especially the
+[`Vocab`](/api/vocab)) using the [`Language.memory_zone`](/api/language#memory_zone) context manager. Enter
+the context manager and process your text within it, and spaCy will
+**reset its internal caches** (freeing up the associated memory) at the
+end of the block. spaCy objects created inside the memory zone must
+not be accessed once the memory zone is finished.
+
+```python
+### Using memory zones
+from collections import Counter
+
+def count_words(nlp, texts):
+    counts = Counter()
+    with nlp.memory_zone():
+        for doc in nlp.pipe(texts):
+            for token in doc:
+                counts[token.text] += 1
+    return counts
+```
+
+<Infobox title="Important note" variant="warning">
+
+Exiting the memory zone invalidates all `Doc`, `Token`, `Span` and `Lexeme`
+objects that were created within it. If you access these objects
+after the memory zone exits, you may encounter a segmentation fault
+due to invalid memory access.
+
+</Infobox>
+
+spaCy needs the memory zone context manager because the processing pipeline
+can't keep track of which [`Doc`](/api/doc) objects are referring to data in the shared
+[`Vocab`](/api/vocab) cache. For instance, when spaCy encounters a new word, a new [`Lexeme`](/api/lexeme)
+entry is stored in the `Vocab`, and the `Doc` object points to this shared
+data. When the `Doc` goes out of scope, the `Vocab` has no way of knowing that
+this `Lexeme` is no longer in use.
+
+The memory zone solves this problem by
+allowing you to tell the processing pipeline that all data created
+between two points is no longer in use. It is up to you to honor
+this agreement. If you access objects that are supposed to no longer be in
+use, you may encounter a segmentation fault due to invalid memory access.
+
+A common use case for memory zones will be **within a web service**. The processing
+pipeline can be loaded once, either as a context variable or a global, and each
+request can be handled within a memory zone:
+
+```python
+### Memory zones with FastAPI {highlight="10,23"}
+from fastapi import FastAPI, APIRouter, Depends, Request
+import spacy
+from spacy.language import Language
+
+router = APIRouter()
+
+
+def make_app():
+    app = FastAPI()
+    app.state.NLP = spacy.load("en_core_web_sm")
+    app.include_router(router)
+    return app
+
+
+def get_nlp(request: Request) -> Language:
+    return request.app.state.NLP
+
+
+@router.post("/parse")
+def parse_texts(
+    *, text_batch: list[str], nlp: Language = Depends(get_nlp)
+) -> list[dict]:
+    with nlp.memory_zone():
+        # Put the spaCy call within a separate function, so we can't
+        # leak the Doc objects outside the scope of the memory zone.
+        output = _process_text(nlp, text_batch)
+    return output
+
+
+def _process_text(nlp: Language, texts: list[str]) -> list[dict]:
+    # Call spaCy, and transform the output into our own data
+    # structures. This function is called from inside a memory
+    # zone, so must not return the spaCy objects.
+    docs = list(nlp.pipe(texts))
+    return [
+        {
+            "tokens": [{"text": t.text} for t in doc],
+            "entities": [
+                {"start": e.start, "end": e.end, "label": e.label_} for e in doc.ents
+            ],
+        }
+        for doc in docs
+    ]
+
+
+app = make_app()
+```
+
+## Clearing transformer tensors and other Doc attributes {id="doc-attrs"}
+
+The [`Transformer`](/api/transformer) and [`Tok2Vec`](/api/tok2vec) components set intermediate values onto the `Doc`
+object during parsing. This can cause GPU memory to be exhausted if many `Doc`
+objects are kept in memory together.
+
+To resolve this, you can add the [`doc_cleaner`](/api/pipeline-functions#doc_cleaner) component to your pipeline. By default
+this will clean up the [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute and the [`Doc.tensor`](/api/doc#attributes) attribute.
+You can have it clean up other intermediate extension attributes you use in custom
+pipeline components as well.
+
+```python
+### Adding the doc_cleaner
+nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})
+```
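
Note on the new `memory-management.mdx` page: the cache-reset behavior it describes can be illustrated with a small, standard-library-only toy model. This is only a sketch of the idea and not spaCy's implementation. `ToyVocab`, `lookup` and the entry layout are hypothetical names invented for the illustration. The toy records which cache entries were created while a zone is active and discards them when the block exits, leaving entries created outside the zone untouched.

```python
from contextlib import contextmanager


class ToyVocab:
    """Toy stand-in for a shared cache. Hypothetical; not spaCy's Vocab."""

    def __init__(self):
        self._entries = {}
        self._transient = None  # keys created inside the active zone, if any

    def lookup(self, word):
        # Create the entry on first sight, as a cache would.
        if word not in self._entries:
            self._entries[word] = {"text": word, "length": len(word)}
            if self._transient is not None:
                self._transient.add(word)
        return self._entries[word]

    @contextmanager
    def memory_zone(self):
        # Mirrors the documented rule that zones should not be nested.
        if self._transient is not None:
            raise RuntimeError("memory zones should not be nested")
        self._transient = set()
        try:
            yield
        finally:
            # Drop only the entries created during the block.
            for word in self._transient:
                del self._entries[word]
            self._transient = None

    def __len__(self):
        return len(self._entries)


vocab = ToyVocab()
vocab.lookup("the")  # created outside any zone, so it persists

with vocab.memory_zone():
    # Copy plain values out of the zone instead of keeping the entries.
    lengths = [vocab.lookup(w)["length"] for w in ["the", "quick", "fox"]]
    size_inside = len(vocab)

size_after = len(vocab)
```

In this toy, a stale reference to a discarded entry is merely an orphaned dictionary. In spaCy, by contrast, the documentation above warns that accessing objects after the zone exits is invalid and may cause a segmentation fault, which is why the sketch copies plain integers out of the block rather than keeping the entries themselves.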
diff --git a/website/docs/usage/rule-based-matching.mdx b/website/docs/usage/rule-based-matching.mdx
@@ -720,7 +720,7 @@ matches = matcher(doc)
 
 # Serve visualization of sentences containing match with displaCy
 # set manual=True to make displaCy render straight from a dictionary
-# (if you're not running the code within a Jupyer environment, you can
+# (if you're not running the code within a Jupyter environment, you can
 # use displacy.serve instead)
 displacy.render(matched_sents, style="ent", manual=True)
 ```