Parse native CSS selectors #111

Open
malteneuss opened this issue Oct 1, 2024 · 3 comments

@malteneuss
Contributor

I recently tried out a similar library in Rust, https://github.com/rust-scraper/scraper, and became productive with it a bit more quickly because I could reuse my knowledge of plain CSS selectors, which are documented for free on sites like https://www.w3schools.com/cssref/css_selectors.php. For example: let selector = Selector::parse("h1.foo").unwrap();

Would parsing CSS selectors make sense for scalpel as well?
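
For comparison, here is a rough sketch of how the CSS selectors mentioned in this thread can be written with scalpel's existing selector combinators. It assumes the scalpel package (Text.HTML.Scalpel) with OverloadedStrings enabled; the scraper names and the sample HTML are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- CSS "h1.foo": every <h1> element that carries the class "foo".
fooHeadings :: Scraper String [String]
fooHeadings = texts ("h1" @: [hasClass "foo"])

-- CSS "div.foo p" (descendant combinator): any <p> nested inside div.foo.
nestedParagraphs :: Scraper String [String]
nestedParagraphs = texts ("div" @: [hasClass "foo"] // "p")

-- CSS "div.foo > p" (child combinator): atDepth 1 restricts the match
-- to direct children of div.foo.
childParagraphs :: Scraper String [String]
childParagraphs = texts ("div" @: [hasClass "foo"] // "p" `atDepth` 1)

-- Illustrative input only: one matching and one non-matching heading.
sampleHeadings :: Maybe [String]
sampleHeadings =
  scrapeStringLike "<h1 class=\"foo\">Hello</h1><h1>Other</h1>" fooHeadings
```

The sibling combinators (~ and +) and selector lists (,) have no direct counterpart here, which is the gap this issue is about.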

@aloussase

I am also interested in this. I am trying to port a scraper from Kotlin (JSoup) to Haskell. Having CSS selectors would make this a lot easier, since that's what I was using originally. There are some sites that are too hard to scrape manually.

@fimad
Owner

fimad commented Oct 17, 2024

I think supporting the basic set of CSS selectors could make sense. I think trying to support all the different pseudo-classes would be a huge effort.

Also, I think ideally we'd have a way to parse the CSS selectors at compile time so you don't have to deal with runtime parse errors. That's not an area I'm super familiar with; maybe Template Haskell would be the way to go?

I could see this working a couple of ways:

  1. Map CSS selectors to scalpel selectors. This would require some work to extend scalpel selectors since they are less expressive than CSS selectors. Specifically ~, +, |, and , have no analogs and users currently would have to write scraper logic to get at that functionality. I think this is probably the cleanest, but I'm not sure how realistic that is without trying it; it may require more fundamental changes to the internals than makes sense. This could look something like:
     text $(css "div.foo > p")
  2. Provide a scraper that behaves like chroot and takes a new CSS selector type instead of a scalpel selector. This is probably the most expedient but also feels the least clean from an API perspective since it doesn't play as nicely with the rest of the APIs. This could look something like:
     cssChroot $(css "div.foo > p") $ text anySelector
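
To make the "basic set of CSS selectors" idea a bit more concrete, below is a minimal sketch of a runtime parser for a small subset: tag names, .class, #id, and the descendant and child combinators. It uses only base. All of the types and function names (Compound, Combinator, CssSelector, parseCss) are hypothetical and not part of scalpel; everything outside the subset is rejected with an error.

```haskell
import Data.Char (isAlphaNum)

-- | One compound selector such as "div.foo#bar" (hypothetical AST).
data Compound = Compound
  { cTag     :: Maybe String  -- ^ Nothing means "any tag"
  , cClasses :: [String]
  , cId      :: Maybe String
  } deriving (Eq, Show)

-- | Only the two combinators that scalpel can already express.
data Combinator = Descendant | Child
  deriving (Eq, Show)

-- | A first compound followed by (combinator, compound) steps.
data CssSelector = CssSelector Compound [(Combinator, Compound)]
  deriving (Eq, Show)

-- | Parse the supported subset; anything else yields a Left with a message.
parseCss :: String -> Either String CssSelector
parseCss input =
  case words (concatMap pad input) of
    []       -> Left "empty selector"
    (t : ts) -> CssSelector <$> parseCompound t <*> steps ts
  where
    -- Surround '>' with spaces so it always becomes its own token.
    pad c = if c == '>' then " > " else [c]

    steps []             = Right []
    steps (">" : t : ts) = step Child t ts
    steps [">"]          = Left "dangling '>' combinator"
    steps (t : ts)       = step Descendant t ts

    step comb t ts = do
      c  <- parseCompound t
      cs <- steps ts
      Right ((comb, c) : cs)

-- | Parse a single whitespace-free token like "div.foo" or "#main".
parseCompound :: String -> Either String Compound
parseCompound tok = go start rest
  where
    (tag, rest) = span isIdentChar tok
    start = Compound (if null tag then Nothing else Just tag) [] Nothing
    isIdentChar c = isAlphaNum c || c == '-' || c == '_'

    go acc [] = Right acc
    go acc (c : cs)
      | c == '.' || c == '#' =
          case span isIdentChar cs of
            ([], _)        -> Left ("missing name after " ++ [c] ++ " in " ++ tok)
            (name, rest')  -> go (add c name acc) rest'
      | otherwise = Left ("unsupported syntax in " ++ tok)

    add '.' name acc = acc { cClasses = cClasses acc ++ [name] }
    add _   name acc = acc { cId = Just name }
```

Under option 1, a Compound would presumably translate to a tag selector with hasClass predicates (and an "id" attribute match), Descendant to //, and Child to // combined with atDepth 1. Tokens such as + and ~ fail to parse here, mirroring the combinators that currently have no scalpel analog.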

If someone is interested in investigating and working on this, I'd be able to provide guidance. Otherwise, I think this is interesting and something I'd like to look into but I don't think I'd be able to get to it soon. Most major scalpel development happens in bursts when I happen to have large blocks of free time and I'm not sure when that will happen next.

@aloussase

I am not very familiar with the internals of Scalpel, but I can see it uses tagsoup for selection. Maybe it would be possible to leverage Selenium to implement CSS selectors? There is already a Haskell library that could be used: https://github.com/haskell-webdriver/haskell-webdriver/blob/main/examples/readme-example-beginner.md.

A [css|div.foo > p|] quasiquoter would work well for compile-time verification of valid CSS selectors I think. We could implement the compile-time parsing based on the spec and relay the work at runtime to the webdriver.
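
As a rough illustration of the compile-time idea, here is a sketch of both the splice form and the quasiquoter form using the template-haskell package. The validator (validateCss) is a deliberately trivial stand-in, not a real CSS parser, and the quasiquoter is named cssQ only so that it can live in the same module as the css splice function; all of these names are assumptions. The generated expression is just the normalized selector string, where a real implementation would presumably produce a selector value instead.

```haskell
import Language.Haskell.TH (Exp, Q, litE, stringL)
import Language.Haskell.TH.Quote (QuasiQuoter (..))

-- | Hypothetical stand-in for a real CSS selector parser: trims and
-- collapses whitespace, and rejects empty input or characters that can
-- never appear in a selector.
validateCss :: String -> Either String String
validateCss raw
  | null normalized            = Left "empty CSS selector"
  | any (`elem` "{};") normalized = Left "unexpected character in selector"
  | otherwise                  = Right normalized
  where
    normalized = unwords (words raw)

-- | Splice form, used as: $(css "div.foo > p")
-- A validation failure becomes a compile error via 'fail' in the Q monad.
css :: String -> Q Exp
css raw =
  case validateCss raw of
    Left err -> fail ("invalid CSS selector: " ++ err)
    Right ok -> litE (stringL ok)

-- | Quasiquoter form, used as: [cssQ|div.foo > p|]
-- Only expression contexts are supported in this sketch.
cssQ :: QuasiQuoter
cssQ = QuasiQuoter
  { quoteExp  = css
  , quotePat  = unsupported "pattern"
  , quoteType = unsupported "type"
  , quoteDec  = unsupported "declaration"
  }
  where
    unsupported :: String -> String -> Q a
    unsupported ctx _ = fail ("cssQ cannot be used in a " ++ ctx ++ " context")
```

Because of GHC's stage restriction, css and cssQ have to be defined in a different module from the one that uses them, and the using module needs the TemplateHaskell or QuasiQuotes extension respectively.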

The webdriver would be a heavy dependency and it might not make sense for this project, but I think implementing the whole CSS parsing + selection from scratch would be a lot of work.
