Extending the web scraper. #18

ba11b0y · 2017-10-04T04:19:34Z

There hasn't been much work on the web scraping part.
I am interested in working on this.
Since this is going to be a generic one, here is what I have in mind so far:

A generic web scraper that scrapes all images, links, and text.
Maybe use Scrapy for this.

I'm still a beginner, so any tips or corrections?

shubhodeep9 · 2017-10-04T06:53:18Z

@invinciblycool I like the idea. I would suggest putting together a detailed list of the components you find missing in the current scraper code; then we will assign you the work.

ashwini0529 · 2017-10-04T13:19:55Z

@invinciblycool XML format could be added.

ba11b0y · 2017-10-05T07:35:15Z

@ashwini0529 I have added the XML response to web.py. Let me know if any corrections are needed.
@shubhodeep9 I will update the detailed list as soon as my exams are over 😄

ba11b0y · 2017-10-05T11:47:07Z

@ashwini0529 @shubhodeep9 Couldn't resist the excitement 😄
These are some features I have in mind that could be added:

If no JSON response is returned by the URL, only the source of the page is returned. We could have a better scraper that either:

Returns a dictionary or a JSON response:

{
  "assets":
  {
    "images":
    [
      "link of image1 on the page",
      "link of image2 on the page"
    ],
    "videos":
    [
      "link to embedded video1",
      "link to embedded video2"
    ]
  },
  "content":
  {
    "text": "all raw text from the page",
    "html": "all html from the page"
  }
}

Or creates dedicated directories for the keys of the above dictionary and actually saves the content to the respective directory. (Inspired by HTTrack.)
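
A minimal sketch of the dictionary-returning variant, using only the Python standard library. The names `scrape_page` and `_AssetParser` are hypothetical and are not part of the existing web.py; the sketch also assumes the HTML has already been fetched, and it treats any `src` on `<video>`, `<source>`, or `<iframe>` as an embedded video, which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AssetParser(HTMLParser):
    """Collects image links, embedded video links, and visible text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.videos = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img" and src:
            self.images.append(urljoin(self.base_url, src))
        elif tag in ("video", "source", "iframe") and src:
            # Assumption: these tags point at embedded video content.
            self.videos.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <script>/<style>.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def scrape_page(html, base_url=""):
    """Return the proposed {"assets": ..., "content": ...} dictionary."""
    parser = _AssetParser(base_url)
    parser.feed(html)
    parser.close()
    return {
        "assets": {"images": parser.images, "videos": parser.videos},
        "content": {"text": " ".join(parser.text_parts), "html": html},
    }
```

Passing the result to `json.dumps` would give the JSON form of the same structure.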

Another feature could be a targeted scrape option.
For example:
web.scrape(url, scrape_content="images") returns all the image links on the page, or saves the images locally.
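
A rough sketch of how such a `scrape_content` option might behave. To keep it runnable offline, this hypothetical `scrape` takes the page HTML instead of a URL, and it only knows about `"images"` and `"links"`; the real signature and the supported keys are assumptions, not the existing API.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects <img src> and <a href> values, grouped by kind."""

    def __init__(self):
        super().__init__()
        self.found = {"images": [], "links": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.found["images"].append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.found["links"].append(attrs["href"])


def scrape(html, scrape_content=None):
    """Return everything found, or only the kind named by scrape_content."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    if scrape_content is None:
        return collector.found
    if scrape_content not in collector.found:
        raise ValueError("unsupported scrape_content: %r" % (scrape_content,))
    return collector.found[scrape_content]
```

Raising `ValueError` for an unknown key is one possible choice; returning an empty list would also work but would hide typos in the option name.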

ashwini0529 · 2017-10-05T13:42:24Z

Hey @invinciblycool Sounds good.
Sounds like a great idea to start with. Go ahead. We can add more features. 🎉

shubhodeep9 · 2017-10-05T13:48:03Z

@invinciblycool Add a TO-DO with your PR, and we will keep this issue open until we are satisfied, so that whenever someone gets a new idea on web scraping, they can add it to that TO-DO.

ashwini0529 · 2017-10-05T13:49:37Z

Also, please add a [WIP] tag in your PR message. 😄

ba11b0y · 2017-10-05T15:38:38Z

@ashwini0529 Before I start, could you clarify whether the function should return a response or create folders and save the content locally? Thanks.
@shubhodeep9 Just confirming: should the TO-DO go with the PR or with the issue?

ashwini0529 · 2017-10-05T16:42:52Z

Hey @invinciblycool you can take a look at the QR Code function. I think you can make something like that.
Probable usage, similar to the QR code one:
img = hackr.image.qrcode("https://github.com/pytorn/hackr", dest_path="/tmp/hackr_qrcode.png")

ba11b0y · 2017-10-06T04:52:44Z

I guess then we agree on saving all the content locally.
Will start working on it ASAP.
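
A possible shape for the save-locally behaviour, following the `dest_path` idea from the QR code usage. `save_scraped` and its `fetch` parameter are hypothetical names; the sketch assumes the scraped result is a dictionary with `assets` and `content` keys as proposed earlier in this issue, and that every asset link can be downloaded as a file (embedded video links often point to pages instead, so that part would need more work).

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def _download(url):
    """Default fetcher: return the raw bytes behind a URL."""
    with urlopen(url) as response:
        return response.read()


def save_scraped(result, dest_path, fetch=_download):
    """Write the scraped result under dest_path and return the saved paths."""
    saved = []

    # content/page.txt and content/page.html hold the text and raw HTML.
    content_dir = os.path.join(dest_path, "content")
    os.makedirs(content_dir, exist_ok=True)
    for key, filename in (("text", "page.txt"), ("html", "page.html")):
        path = os.path.join(content_dir, filename)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(result["content"][key])
        saved.append(path)

    # One directory per asset kind, e.g. assets/images and assets/videos.
    for kind, urls in result["assets"].items():
        asset_dir = os.path.join(dest_path, "assets", kind)
        os.makedirs(asset_dir, exist_ok=True)
        for index, url in enumerate(urls):
            basename = os.path.basename(urlparse(url).path) or "asset"
            # Prefix with the index so equal file names do not collide.
            path = os.path.join(asset_dir, "%d_%s" % (index, basename))
            with open(path, "wb") as handle:
                handle.write(fetch(url))
            saved.append(path)
    return saved
```

Making `fetch` a parameter keeps the function testable without network access, since a fake downloader can be passed in.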

ashwini0529 · 2017-10-19T10:29:08Z

Hey @invinciblycool Updates?

ba11b0y · 2017-10-20T09:53:07Z

Sorry for the delay, I will try to open a PR this week.
Happy Diwali BTW. ✨

ashwini0529 · 2017-10-20T09:57:29Z

Perfect @invinciblycool
Happy hacking and Happy Diwali! 😄 🎇

ashwini0529 added the hacktoberfest label Oct 4, 2017

ba11b0y mentioned this issue Oct 5, 2017

Added XML response to web.py #26

Closed

ba11b0y mentioned this issue Oct 20, 2017

[WIP]Adding image scraping #42

Merged
