Textblob not finding the downloaded corpora #474

Open
cagan-elden opened this issue Sep 26, 2024 · 7 comments
Comments

@cagan-elden

python -m textblob.download_corpora

Although I downloaded the corpora as the error message says, it still does not work.
I'm not sure whether this is caused by the NLTK library, because I've installed that too.

@doctorsketch

I found upgrading from NLTK 3.8.1 to 3.9.1 broke my project. I now get errors asking me to:

python -m textblob.download_corpora

Previously you could download textblob corpora on one account and it could be found by another account. This is no longer the case.

Moving back to NLTK 3.8.1 fixed it. I can reproduce the issue by upgrading to 3.9.1 again.
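
To check whether the corpora were downloaded somewhere the current account does not search, it can help to print NLTK's search directories and see which of them actually contain a given resource. The following is a minimal diagnostic sketch, not part of TextBlob or NLTK: `find_resource_dirs` and `report` are hypothetical helper names, and `corpora/brown` is only an example resource. It relies on `nltk.data.path`, the list of directories NLTK searches.

```python
import os


def find_resource_dirs(search_paths, relative_path):
    """Return the search directories that contain the given resource.

    A resource counts as present if it exists unpacked or as a .zip
    archive, e.g. corpora/brown or corpora/brown.zip.
    """
    hits = []
    for base in search_paths:
        candidate = os.path.join(base, *relative_path.split("/"))
        if os.path.exists(candidate) or os.path.exists(candidate + ".zip"):
            hits.append(base)
    return hits


def report(relative_path="corpora/brown"):
    """Print where NLTK searches and which directories hold the resource."""
    import nltk  # imported lazily so the helper above works without NLTK

    print("NLTK version:", nltk.__version__)
    for base in nltk.data.path:
        print("search path:", base)
    print("found in:", find_resource_dirs(nltk.data.path, relative_path))
```

Running `report()` under each account and comparing the output should show whether the directory used for the download appears in the other account's search list.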

@Ajaychaki2004

The problem is due to the NLTK version. Moving back to NLTK 3.8.1 can help rectify the error.

@doctorsketch

To follow up on this, I fixed it by specifying the NLTK data path and telling NLTK where to look like this:

def download_nltk_resources(self):
    """
    Downloads required NLTK resources if not already present.
    """
    import nltk
    import os
    
    # Use the environment variable or fall back to default
    nltk_data_path = os.getenv('NLTK_DATA', '/usr/local/share/nltk_data')
    
    # Ensure the directory exists
    os.makedirs(nltk_data_path, exist_ok=True)
    
    # Add our path to NLTK's data path
    nltk.data.path.insert(0, nltk_data_path)
    
    print(f"Using NLTK data path: {nltk_data_path}")
    
    required_resources = {
        'averaged_perceptron_tagger': ('taggers', 'averaged_perceptron_tagger'),
        'averaged_perceptron_tagger_eng': ('taggers', 'averaged_perceptron_tagger_eng'),
        'punkt': ('tokenizers', 'punkt'),
        'punkt_tab': ('tokenizers/punkt_tab', 'english'),
        'movie_reviews': ('corpora', 'movie_reviews'),
        'brown': ('corpora', 'brown'),
        'conll2000': ('corpora', 'conll2000'),
        'wordnet': ('corpora', 'wordnet')
    }
    
    # Download and verify all resources
    for resource, (folder, name) in required_resources.items():
        try:
            nltk.data.find(f'{folder}/{name}')
        except LookupError:
            print(f"Downloading {resource}...")
            nltk.download(resource, download_dir=nltk_data_path, quiet=True)

with NLTK_DATA specified as an environment variable.
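
If the variable cannot be set in the shell or service configuration, one option is to set it from Python before NLTK is first imported, since NLTK builds its search path from `NLTK_DATA` when it is loaded. This is only a sketch: `configure_nltk_data` is a hypothetical helper, the default directory is the one used in the function above, and it assumes `NLTK_DATA` holds a single directory rather than a list of paths.

```python
import os


def configure_nltk_data(default_dir="/usr/local/share/nltk_data"):
    """Set NLTK_DATA if it is unset, create the directory, return the path."""
    # setdefault keeps a value that is already present in the environment
    data_dir = os.environ.setdefault("NLTK_DATA", default_dir)
    os.makedirs(data_dir, exist_ok=True)
    return data_dir
```

Call `configure_nltk_data()` at the very top of the program, before `import nltk` or `import textblob`. Writing to the default directory may need elevated permissions, so pass a directory the account can write to if that fails.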

Then do something like this:

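try:
    # Download resources only once at the start
    if not hasattr(TextParser, '_resources_checked'):
        self.download_nltk_resources()
        TextParser._resources_checked = True
except Exception as exc:
    # A try block needs a handler; report the failure and re-raise it
    print(f"Failed to prepare NLTK resources: {exc}")
    raise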
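
The one-time check above can be seen in isolation in the following sketch. It is an assumption about how the surrounding class might look, not the original code: this `TextParser` is a stand-in, and `downloader` is a hypothetical callable that takes the place of `download_nltk_resources`. It uses a class attribute initialised to `False` instead of `hasattr`, which behaves the same way for this purpose.

```python
class TextParser:
    """Stand-in parser class; only the one-time guard is of interest."""

    _resources_checked = False  # shared by every instance of the class

    def __init__(self, downloader):
        # `downloader` stands in for download_nltk_resources
        if not TextParser._resources_checked:
            downloader()
            # Set the flag only after a successful run, so a failed
            # download is retried by the next instance
            TextParser._resources_checked = True
```

Because the flag lives on the class, creating many parser instances triggers the resource check once per process rather than once per instance.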

@jimedevelopers

How to solve this issue?

@Ajaychaki2004

> Moving back to NLTK 3.8.1 fixed it. I can reproduce the issue by upgrading to 3.9.1 again.

Downgrading NLTK solves the issue.

@doctorsketch

> > Moving back to NLTK 3.8.1 fixed it. I can reproduce the issue by upgrading to 3.9.1 again.
>
> By downgrading the NLTK you can solve the issue.

Just be aware NLTK <3.9 contains a critical security vulnerability so you're better off specifying the data path like I suggested rather than using an older insecure version.
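
A project that follows this advice may want to fail early if an older NLTK is installed. The sketch below checks the installed version against the 3.9 threshold mentioned in the comment above. The helper names are hypothetical, and the parsing is deliberately simple: it reads only the leading digits of each version component, so a pre-release such as `3.9b1` is treated like `3.9`.

```python
import re
from importlib import metadata


def parse_release(version):
    """Turn a version string such as '3.9.1' into a tuple like (3, 9, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum=(3, 9)):
    """Return True if the version is at least the given minimum."""
    return parse_release(version) >= minimum


def installed_nltk_version():
    """Return the installed NLTK version string, or None if not installed."""
    try:
        return metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return None
```

For example, `meets_minimum(installed_nltk_version() or "0")` could be asserted at startup, next to the data path setup.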

@Ajaychaki2004

Can you explain the solution in detail?
