Web Crawlers Using Beautiful Soup

Question 1:

Given a root URL, e.g., "Vit.ac.in", design a simple crawler in Python that returns all pages on this site containing the string "admissions".

Steps

We use Python's BeautifulSoup library to parse the response, the Requests library to make the web page requests, and the re library for pattern matching regardless of the case or capitalisation of the content on the website.
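The following is a minimal sketch of such a crawler, assuming an https scheme for the given root URL and a single-level crawl (only the pages linked directly from the root page); the variable names and the 10-second timeout are illustrative choices, not part of the original lab code.

```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

ROOT = "https://vit.ac.in"  # assumed scheme for the given root URL
PATTERN = re.compile("admissions", re.IGNORECASE)  # case-insensitive match

# Fetch the root page and collect the absolute URLs of its anchor tags.
root_page = requests.get(ROOT, timeout=10)
soup = BeautifulSoup(root_page.text, "html.parser")
links = {urljoin(ROOT, a["href"]) for a in soup.find_all("a", href=True)}

# Visit each linked page and keep those whose content matches the pattern.
for link in links:
    try:
        page = requests.get(link, timeout=10)
    except requests.RequestException:  # skip pages that fail to load
        continue
    if PATTERN.search(page.text):
        print(link)
```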

Question 2:

Find documents that contain both the word "Data" and the word "analytics" within the URL "Vit.ac.in" using Python.

Steps

As before, we use BeautifulSoup to parse responses, Requests to fetch pages, and re for pattern matching regardless of case or capitalisation. After fetching the root URL's web page, we collect all the anchor tags from the parsed structure and read each anchor tag's href attribute, which holds the link to another page. We then send a GET request to each of these links and keep only the pages whose documents contain both "data" and "analytics", using regular expressions to test each page. While making these requests there is a chance of failure, for example due to SSL certificate authentication and verification, or simply a bad connection to the server, so such pages are skipped.
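A sketch under the same assumptions as in Question 1 (https scheme, single-level crawl, illustrative names and timeout); the AND condition is expressed as two independent case-insensitive searches, and the try/except covers the SSL and connection failures mentioned above.

```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

ROOT = "https://vit.ac.in"  # assumed scheme for the given root URL

# Fetch the root page and collect the absolute URLs of its anchor tags.
root_page = requests.get(ROOT, timeout=10)
soup = BeautifulSoup(root_page.text, "html.parser")
links = {urljoin(ROOT, a["href"]) for a in soup.find_all("a", href=True)}

for link in links:
    try:
        page = requests.get(link, timeout=10)
    except requests.RequestException:  # SSL verification errors, timeouts, ...
        continue
    text = page.text
    # Keep only pages containing BOTH words, in any capitalisation.
    if re.search("data", text, re.IGNORECASE) and re.search("analytics", text, re.IGNORECASE):
        print(link)
```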

Question 3:

Find documents that contain the word “Programme” but not the word “programming” within the URL “Vit.ac.in” using Python.

Steps

As before, we use BeautifulSoup to parse responses, Requests to fetch pages, and re for pattern matching regardless of case or capitalisation. After fetching the root URL's web page, we collect all the anchor tags from the parsed structure and read each anchor tag's href attribute, which holds the link to another page. We then send a GET request to each of these links and keep only the pages whose documents contain the word "programme", using a regular expression to test each page; a second regular expression ensures that the chosen pages do not contain the word "programming" in any form. While making these requests there is a chance of failure, for example due to SSL certificate authentication and verification, or simply a bad connection to the server, so such pages are skipped. We also list the matching URLs in the output to demonstrate the correctness of our code.
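A sketch under the same assumptions as the previous two; the only change from Question 2 is the match condition, which requires "programme" and rejects any page that also contains "programming".

```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

ROOT = "https://vit.ac.in"  # assumed scheme for the given root URL

# Fetch the root page and collect the absolute URLs of its anchor tags.
root_page = requests.get(ROOT, timeout=10)
soup = BeautifulSoup(root_page.text, "html.parser")
links = {urljoin(ROOT, a["href"]) for a in soup.find_all("a", href=True)}

for link in links:
    try:
        page = requests.get(link, timeout=10)
    except requests.RequestException:  # SSL verification errors, timeouts, ...
        continue
    text = page.text
    # Keep pages mentioning "programme" but reject any that also mention "programming".
    if re.search("programme", text, re.IGNORECASE) and not re.search("programming", text, re.IGNORECASE):
        print(link)
```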