Figure out why we can't run the netflix_to_wikidata script on all the movies #12

audiodude · 2024-09-13T04:57:39Z

It seems to be crashing part of the way through. Let's post the stack trace in this bug and see if we can figure it out.

audiodude · 2024-09-17T07:42:47Z

Looks like the network requests are just a bit flaky and eventually one gets stuck and times out. Maybe we're getting rate limited (we should look into that).
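For reference, here is a minimal sketch of retrying a flaky call with exponential backoff. It is not taken from the script: `call_with_backoff` and all of its parameters are hypothetical names, and the delays are guesses rather than values derived from Wikidata's published limits.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay,
            # plus up to 10% jitter so retries do not line up.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

With `requests`, this could wrap the query call, passing `retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError)` so that only network-level failures are retried.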

#13 (comment)

audiodude · 2024-09-17T08:10:11Z

https://www.wikidata.org/wiki/Wikidata:REST_API#Rate_limits

audiodude · 2024-09-30T23:18:14Z

Okay, I ran the code with the version in #17, with the exponential backoff and with tqdm showing a progress bar. It failed 27% of the way through, but with a different error message than the one we got before:

27%|█████████████████▉                                                | 4824/17770 [28:09<1:15:33,  2.86it/s]
Traceback (most recent call last):
  File "/home/tmoney/code/MediaBridge/src/data_processing/wiki_to_netflix.py", line 152, in <module>
    process_data(False)
  File "/home/tmoney/code/MediaBridge/src/data_processing/wiki_to_netflix.py", line 137, in process_data
    wiki_movie_ids_list, wiki_genres_list, wiki_directors_list = wiki_query(netflix_file, user_agent)
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tmoney/code/MediaBridge/src/data_processing/wiki_to_netflix.py", line 118, in wiki_query
    response.raise_for_status()
  File "/home/tmoney/.local/share/virtualenvs/MediaBridge-QSS_14Zx/lib/python3.12/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://query.wikidata.org/sparql

A 400 error means there was something wrong with the request we sent to Wikidata. My guess is that some movie title contains a character that isn't getting properly escaped, like the slash in "Face/Off" or a quote, which would make the SPARQL invalid. Picture a movie like Joe's "Magical" Adventure. The title gets substituted into this line of the SPARQL:

                                mwapi:search "%(Title)s" ;

which would turn into

                                mwapi:search "Joe's "Magical" Adventure" ;

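And the quotes would be messed up: the string literal would end at the second double quote, leaving the rest of the title as invalid syntax. We know the error happens at or around item 4824 in the data, so we should be able to look at the movie titles around that line and figure out what's going on.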
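One way to eyeball the titles near the failure, as a rough sketch: `rows_near` is a hypothetical helper, and it assumes the data file parses cleanly with Python's `csv` module, which may not hold if titles contain unquoted commas.

```python
import csv
from itertools import islice


def rows_near(csv_file, index, radius=3):
    """Return (position, row) pairs for rows within `radius` of `index`.

    Positions are 0-based. Hypothetical helper; assumes plain CSV input.
    """
    start = max(0, index - radius)
    reader = csv.reader(csv_file)
    # islice skips ahead lazily, so the whole file is not loaded into memory.
    window = islice(reader, start, index + radius + 1)
    return list(enumerate(window, start=start))
```

Something like `rows_near(open(netflix_file, newline=""), 4824)` would then list the handful of rows on either side of the point where the progress bar stopped.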

Please check out #17, run the code, and try to figure out what's going on.

audiodude · 2024-09-30T23:41:43Z

Okay I was curious. I changed the iteration to:

    for row in tqdm(data_csv[4820:]):

and added error handling around raise_for_status():

        try:
            response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            print(repr(row))
            raise e

And got this:

['4825', '1985', 'Brazil: The "Love Conquers All" Version']

Putting aside that this title is unlikely to match any movie anyway, we should either:

- Skip trying to match any titles that contain quotes
- Properly escape the quotes (will need to look up how SPARQL does that)
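On the escaping option: in SPARQL 1.1, a double-quoted string literal accepts backslash escapes such as `\"` and `\\`. A minimal sketch of an escaping helper follows; `escape_sparql_string` is a hypothetical name, not something already in the script, and it only covers the most common escapes.

```python
# Backslash escapes for double-quoted SPARQL string literals.
# Covers the common cases only; this is an illustrative subset.
_SPARQL_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def escape_sparql_string(value: str) -> str:
    """Escape `value` for use between double quotes in a SPARQL query."""
    # Work character by character so an inserted backslash is never re-escaped.
    return "".join(_SPARQL_ESCAPES.get(ch, ch) for ch in value)


title = 'Brazil: The "Love Conquers All" Version'
clause = 'mwapi:search "%(Title)s" ;' % {"Title": escape_sparql_string(title)}
print(clause)
# prints: mwapi:search "Brazil: The \"Love Conquers All\" Version" ;
```

This only makes the query syntactically valid. The search backend may still give quotes inside the search string a meaning of their own, so whether the escaped title matches anything would need to be checked separately.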

audiodude · 2024-10-01T01:29:13Z

Here are a few more that didn't work, from my handling of the 400 error:

['4825', '1985', 'Brazil: The "Love Conquers All" Version']
 33%|██████████████████████▌                                             | 5912/17770 [34:24<58:59,  3.35it/s]
['5913', '1994', 'Snowy River: The McGregor Saga "The Race"']
 35%|███████████████████████▏                                          | 6240/17770 [36:22<1:25:52,  2.24it/s]
['6241', '1965', 'Operation "Y" and Other Shurik\'s Adventures']
 35%|███████████████████████▎                                          | 6279/17770 [36:36<1:01:23,  3.12it/s]
['6280', '2003', 'Sting: Inside the Songs of "Sacred Love"']
 49%|█████████████████████████████████▍                                  | 8747/17770 [51:12<48:19,  3.11it/s
['8748', '2004', 'Morrissey: Who Put the "M" in Manchester']
 49%|█████████████████████████████████▍                                  | 8748/17770 [51:12<52:48,  2.85it/s]
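Every failing row so far has a double quote in the title, so a quick pre-scan could list the likely offenders without waiting on the full run. A rough sketch, assuming each row is `[id, year, title]` as in the output above; `rows_with_double_quotes` is a hypothetical helper, and the last sample row is a made-up control rather than real data from the run.

```python
def rows_with_double_quotes(rows):
    """Return the rows whose title (third column) contains a double quote."""
    return [row for row in rows if '"' in row[2]]


sample_rows = [
    ['4825', '1985', 'Brazil: The "Love Conquers All" Version'],
    ['5913', '1994', 'Snowy River: The McGregor Saga "The Race"'],
    ['6241', '1965', 'Operation "Y" and Other Shurik\'s Adventures'],
    ['6280', '2003', 'Sting: Inside the Songs of "Sacred Love"'],
    ['8748', '2004', 'Morrissey: Who Put the "M" in Manchester'],
    ['0', '0000', 'Face/Off'],  # hypothetical control row, id and year invented
]

for row in rows_with_double_quotes(sample_rows):
    print(row[0], row[2])
```

If the rows this flags in the real data line up with the rows that return 400, that would support the quoting theory. Any 400 on a title without quotes would point to a second problem.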

audiodude assigned cocomittens and smaysenhalder Sep 13, 2024

audiodude mentioned this issue Sep 27, 2024

Try exponential backoff to avoid read errors #17

Merged

audiodude unassigned cocomittens Oct 18, 2024
