Parse native CSS selectors #111

malteneuss · 2024-10-01T07:21:33Z

I recently tried out a similar library in Rust, https://github.com/rust-scraper/scraper, and became productive with it a bit more quickly because I could reuse my knowledge of plain CSS selectors, which are documented for free on sites like https://www.w3schools.com/cssref/css_selectors.php. For example: let selector = Selector::parse("h1.foo").unwrap();

Would parsing CSS selectors make sense for scalpel as well?


aloussase · 2024-10-06T00:44:36Z

I am also interested in this. I am trying to port a scraper from Kotlin (JSoup) to Haskell. Having CSS selectors would make this a lot easier, since that's what I was using originally. Some sites are too hard to scrape manually.

fimad · 2024-10-17T05:00:05Z

I think supporting the basic set of CSS selectors could make sense. I think trying to support all the different pseudo-classes would be a huge effort.

Also, I think ideally we'd have a way to parse the CSS selectors at compile time so you don't have to deal with runtime parse errors. That's not an area I'm super familiar with; maybe Template Haskell would be the way to go?

I could see this working a couple of ways:

1. Map CSS selectors to scalpel selectors. This would require some work to extend scalpel selectors, since they are less expressive than CSS selectors. Specifically ~, +, |, and , have no analogs, and users currently have to write scraper logic to get at that functionality. I think this is probably the cleanest option, but I'm not sure how realistic it is without trying it; it may require more fundamental changes to the internals than makes sense. This could look something like:

    text $(css "div.foo > p")

2. Provide a scraper that behaves like chroot and takes a new CSS selector type instead of a scalpel selector. This is probably the most expedient option, but it also feels the least clean from an API perspective, since it doesn't play as nicely with the rest of the APIs. This could look something like:

    cssChroot $(css "div.foo > p") $ text anySelector
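
Both options start from the same step: turning a CSS selector string into a structured value. The following is a minimal sketch of that step, assuming only ReadP from base. The names CssSelector, Simple, Combinator, and parseCss are hypothetical and not part of scalpel, and only tag, class, id, descendant, and child (>) selectors are covered, so ~, +, and , are rejected.

```haskell
import Data.Char (isAlphaNum, isSpace)
import Data.Maybe (listToMaybe)
import Text.ParserCombinators.ReadP

-- Hypothetical AST for a small CSS subset; none of these types exist in scalpel.
-- One compound selector such as "div.foo#bar".
data Simple = Simple
  { sTag     :: Maybe String   -- Nothing means "any tag" (written "*" or omitted)
  , sId      :: Maybe String
  , sClasses :: [String]
  } deriving (Eq, Show)

data Combinator = Descendant | Child deriving (Eq, Show)

-- A first compound selector followed by (combinator, compound) pairs.
data CssSelector = CssSelector Simple [(Combinator, Simple)]
  deriving (Eq, Show)

data Part = ClassPart String | IdPart String

-- Simplified identifier: letters, digits, '-' and '_'.
ident :: ReadP String
ident = munch1 (\c -> isAlphaNum c || c == '-' || c == '_')

part :: ReadP Part
part = (ClassPart <$> (char '.' *> ident)) +++ (IdPart <$> (char '#' *> ident))

simple :: ReadP Simple
simple = do
  (explicit, tag) <- ((\t -> (True, Just t)) <$> ident)
                       <++ ((True, Nothing) <$ char '*')
                       <++ pure (False, Nothing)
  parts <- many part
  -- Reject an empty compound selector (no tag, no "*", no class, no id).
  if not explicit && null parts
    then pfail
    else pure (Simple
      { sTag     = tag
      , sId      = listToMaybe [i | IdPart i <- parts]
      , sClasses = [c | ClassPart c <- parts]
      })

-- Try the child combinator first; plain whitespace means "descendant".
combinator :: ReadP Combinator
combinator = (Child <$ (skipSpaces *> char '>' *> skipSpaces))
         <++ (Descendant <$ munch1 isSpace)

selector :: ReadP CssSelector
selector = CssSelector <$> simple <*> many ((,) <$> combinator <*> simple)

-- Succeeds only if the whole input is consumed.
parseCss :: String -> Either String CssSelector
parseCss input =
  case readP_to_S (skipSpaces *> selector <* skipSpaces <* eof) input of
    ((sel, _) : _) -> Right sel
    []             -> Left ("unsupported or malformed selector: " ++ show input)
```

With this sketch, parseCss "div.foo > p" yields a div compound with class foo followed by a Child step to p, while parseCss "div ~ p" returns a Left. Translating such a value into scalpel selectors, or matching it against tags directly, is the part that would still need the design work described above.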

If someone is interested in investigating and working on this, I'd be able to provide guidance. Otherwise, I think this is interesting and something I'd like to look into but I don't think I'd be able to get to it soon. Most major scalpel development happens in bursts when I happen to have large blocks of free time and I'm not sure when that will happen next.

aloussase · 2024-10-17T14:55:19Z

I am not very familiar with the internals of Scalpel, but I can see it uses tagsoup for selection. Maybe it would be possible to leverage Selenium to implement CSS selectors? There is already a Haskell library that could be used: https://github.com/haskell-webdriver/haskell-webdriver/blob/main/examples/readme-example-beginner.md.

A [css|div.foo > p|] quasiquoter would work well for compile-time verification of valid CSS selectors I think. We could implement the compile-time parsing based on the spec and relay the work at runtime to the webdriver.
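
To make the quasiquoter idea concrete, here is a minimal sketch assuming the template-haskell package that ships with GHC. The names css and validateCss are hypothetical. The validation is deliberately shallow (a character whitelist plus a check on where > may appear) and stands in for a real spec-based parser, and the quoter expands to a plain list of tokens rather than to anything scalpel or a webdriver understands.

```haskell
import Data.Char (isAlphaNum)
import Language.Haskell.TH (Q, listE, stringE)
import Language.Haskell.TH.Quote (QuasiQuoter (..))

-- Hypothetical, deliberately shallow validator: splits the selector into
-- compound selectors and ">" tokens and rejects anything else (for example
-- "~", "+" or ","). A real implementation would follow the CSS grammar.
validateCss :: String -> Either String [String]
validateCss input = go (words (concatMap pad input))
  where
    -- Put spaces around '>' so that "div>p" tokenizes like "div > p".
    pad c = if c == '>' then " > " else [c]

    go []                 = Left "expected a selector"
    go [t]                = (: []) <$> compound t
    go (t : ">" : rest)   = do
      c  <- compound t
      cs <- go rest
      pure (c : ">" : cs)
    go (t : rest)         = (:) <$> compound t <*> go rest

    compound t
      | t /= ">" && all ok t = Right t
      | otherwise            = Left ("invalid compound selector: " ++ show t)

    ok c = isAlphaNum c || c `elem` "-_.#*"

unsupported :: String -> String -> Q a
unsupported ctx _ = fail ("css: not usable in a " ++ ctx ++ " context")

-- Expression-only quasiquoter: an invalid selector becomes a compile error,
-- a valid one becomes a list of String tokens.
css :: QuasiQuoter
css = QuasiQuoter
  { quoteExp  = \s -> case validateCss s of
      Left err   -> fail ("css: " ++ err)
      Right toks -> listE (map stringE toks)
  , quotePat  = unsupported "pattern"
  , quoteType = unsupported "type"
  , quoteDec  = unsupported "declaration"
  }
```

Because of GHC's stage restriction, css has to be defined in a different module from the one that uses it. In a module with the QuasiQuotes extension enabled, [css|div.foo > p|] would then expand to a list of the tokens "div.foo", ">", and "p", while [css|div ~ p|] would fail at compile time with the validator's message.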

The webdriver would be a heavy dependency and it might not make sense for this project, but I think implementing the whole CSS parsing + selection from scratch would be a lot of work.

fimad mentioned this issue Oct 17, 2024

Faster HTML tokenization #109

Open
