T011: Wikipedia scraping fails #406

Open
mbackenkoehler opened this issue Nov 9, 2023 · 1 comment
Labels
bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@mbackenkoehler
Collaborator

Selecting the table from the HTML fails in section 1.3.4. Replacing table = header.find_all_next()[4] with table = header.find_all_next()[5] should be enough to fix this, but the whole thing looks a bit flaky.
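
To illustrate why the positional lookup is fragile, here is a minimal sketch on two hypothetical, heavily simplified page layouts (not the real Wikipedia markup). The helper names `tag_at_offset` and `tag_by_name` are made up for this example. One extra element after the heading shifts every index returned by `find_all_next()`, while a lookup by tag name is unaffected.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the page before and after a markup change.
# The second layout has one extra element right after the heading.
LAYOUT_OLD = """
<h3 id="General_chemical_properties">General chemical properties</h3>
<p>Intro.</p>
<table><tr><td>42</td></tr></table>
"""
LAYOUT_NEW = """
<h3 id="General_chemical_properties">General chemical properties</h3>
<span class="edit">[edit]</span>
<p>Intro.</p>
<table><tr><td>42</td></tr></table>
"""


def tag_at_offset(markup, offset):
    """Name of the tag found `offset` tags after the heading (positional lookup)."""
    header = BeautifulSoup(markup, "html.parser").find(id="General_chemical_properties")
    return header.find_all_next()[offset].name


def tag_by_name(markup, name="table"):
    """Name of the first tag called `name` after the heading (lookup by tag name)."""
    header = BeautifulSoup(markup, "html.parser").find(id="General_chemical_properties")
    return header.find_next(name).name


# The same offset lands on different elements in the two layouts.
print(tag_at_offset(LAYOUT_OLD, 1), tag_at_offset(LAYOUT_NEW, 1))
# The name-based lookup finds the table in both.
print(tag_by_name(LAYOUT_OLD), tag_by_name(LAYOUT_NEW))
```

Bumping the index from 4 to 5 restores the lookup for the current page, but any later markup change can shift it again.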

@frenio

frenio commented Oct 7, 2024

Hi Michael,

I'd be happy to take on this issue! I find two problems with the original code:

  1. The line header = html.find("span", id="General_chemical_properties") returns None, because the "span" has been changed to an "h3" on Wikipedia's end.
  2. The current assignment to table picks the wrong element, as your suggested index change already implies.

I think a potentially more stable fix that completely avoids table = header.find_all_next()[5] and avoids specifying the type of element for header would be the following:

header = html.find(id="General_chemical_properties")
table_body = header.find_next("tbody")

I've tested it and it works. If you like this fix, feel free to assign this issue to me and I'll send a PR.
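
For illustration, here is a self-contained sketch of the same lookup on a small, hypothetical HTML snippet (a stand-in, not the real Wikipedia page). The function name `extract_rows` and the `None` checks are additions made for this example. Because `html.parser` does not insert a tbody on its own, the sample markup includes one explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the section of the page being scraped.
SAMPLE_HTML = """
<div>
  <h3 id="General_chemical_properties">General chemical properties</h3>
  <p>Intro paragraph between heading and table.</p>
  <table class="wikitable">
    <tbody>
      <tr><th>Name</th><th>Formula</th></tr>
      <tr><td>Water</td><td>H2O</td></tr>
      <tr><td>Methane</td><td>CH4</td></tr>
    </tbody>
  </table>
</div>
"""


def extract_rows(markup, anchor_id):
    """Return the cell texts of the first table body after the element with `anchor_id`."""
    soup = BeautifulSoup(markup, "html.parser")

    # Look the heading up by id only, so it does not matter whether it is a span or an h3.
    header = soup.find(id=anchor_id)
    if header is None:
        raise ValueError(f"no element with id={anchor_id!r}")

    # Take the first tbody after the heading instead of a fixed position.
    table_body = header.find_next("tbody")
    if table_body is None:
        raise ValueError(f"no table body found after id={anchor_id!r}")

    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table_body.find_all("tr")
    ]


rows = extract_rows(SAMPLE_HTML, "General_chemical_properties")
print(rows)
```

The explicit checks turn a future markup change into a clear error message rather than an AttributeError on None.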

Best,
Frenio

PS:
I would also like to add the following code to the top of the notebook as pd.DataFrame.from_records(data).replace("?", np.nan) triggers a deprecation warning that might distract the reader. Does that sound ok?

import warnings
warnings.filterwarnings("ignore")
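
A narrower alternative, sketched under the assumption that the warning in question is a FutureWarning (this thread does not confirm the category), would be to silence only that category around the one call and leave all other warnings visible. The names `noisy_replace` and `run_quietly` are hypothetical; `noisy_replace` merely stands in for the pandas call.

```python
import warnings


def noisy_replace():
    # Hypothetical stand-in for the pandas call from the notebook.
    # Assumption: the real call emits a FutureWarning.
    warnings.warn("this behaviour is deprecated", FutureWarning)
    return "result"


def run_quietly(func, category=FutureWarning):
    """Call `func` with warnings of `category` suppressed, only for this call."""
    with warnings.catch_warnings():
        # The filter is undone when the with-block exits.
        warnings.simplefilter("ignore", category=category)
        return func()


result = run_quietly(noisy_replace)
```

This keeps unrelated warnings visible to the reader, at the cost of wrapping the call site rather than adding two lines at the top of the notebook.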
