The extracted content output does not contain picture elements #3

wizos · 2018-07-05T03:02:51Z

No description provided.

chimbori · 2018-10-19T06:31:36Z

Please include more details, including:

A specific URL that demonstrates the problem.
The exact markup downloaded from that URL by a JavaScript-disabled User Agent, e.g. wget or curl.
What you expected to see.
What you actually saw.

platelminto · 2018-11-08T17:05:40Z

I think they just mean that the extracted text doesn't include the original images that appear within the article. Is there a way to get them?

chimbori · 2018-11-08T17:16:30Z

<img> elements should be included in the output DOM, so if they're not, then it needs to be debugged. Different sites have different markup, so it's hard to debug without a test case.

OP hasn't replied in a long time, but if you have an example URL + markup, please attach it here.

platelminto · 2018-11-08T17:22:29Z

https://www.wired.com/story/bitcoin-will-burn-planet-down-how-fast/
Gist for Crux output

platelminto · 2018-11-20T01:26:29Z

@chimbori, is this being worked on? I'm still not getting any <img> tags in the output.

chimbori · 2018-11-20T19:30:45Z

Not being actively worked on, no. I’ll look into it if/when I have a chance, but I asked for more documentation so that others who see this issue have enough information to get started.

platelminto · 2018-11-21T10:46:55Z

If anyone does look at this: it doesn't work because some sites load some of their images lazily with JavaScript, and the HTML being passed to Crux is likely the version from before the images are inserted. To fix this, the JavaScript must be run first, and the resulting HTML then passed to Crux. This can be done with something like HtmlUnit, but that library doesn't work on Android.

I'm still trying to find a solution to that, though it might be out of scope for Crux. With the post-JavaScript HTML, it works fine.
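
For anyone who wants to check whether a page falls into this case, here is a minimal sketch that lists the <img> elements present in a given HTML string. It is not part of Crux and is written in Python with only the standard library, purely for illustration. The names `ImageCollector` and `find_images` and the two sample snippets are hypothetical; real sites use many different lazy-loading markups. If the markup fetched without JavaScript yields no images but the rendered markup does, the images are being inserted by script rather than dropped by the extractor.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Records the attributes of every <img> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing <img/> tags also
        # arrive here via the default handle_startendtag implementation.
        if tag == "img":
            self.images.append(dict(attrs))


def find_images(html):
    """Return a list of attribute dicts, one per <img> element in html."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.images


# Hypothetical markup as served to a JavaScript-disabled client:
# only a placeholder element, no <img>.
PRE_JS_HTML = (
    '<article><p>Text</p>'
    '<div class="lazy" data-image="photo.jpg"></div></article>'
)

# Hypothetical markup after the page's scripts have run.
POST_JS_HTML = (
    '<article><p>Text</p>'
    '<img src="photo.jpg" alt="Chart"></article>'
)

if __name__ == "__main__":
    print("images before JavaScript:", len(find_images(PRE_JS_HTML)))
    print("images after JavaScript:", len(find_images(POST_JS_HTML)))
```

Running the same check on the HTML handed to the extractor and on the extractor's output would show at which stage the images go missing.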

platelminto · 2018-11-30T14:38:31Z

I'm now working on this. By the way, where is the code that is supposed to include the <img> elements in the output DOM? I couldn't find any, so I had to add my own to get any images working. Once all the tests pass, I'll submit a merge request.

chimbori added a commit that referenced this issue Nov 23, 2018

Test cases for #6 and #3, disabled and marked as TODO.

b317285

platelminto mentioned this issue Dec 10, 2018

Images now listed in Crux output #10

Closed
