Based on repository: https://github.com/anishkny/puppeteer-on-cloud-functions
A Google Cloud Function that scrapes specified data from web pages: a Node.js service for creating dynamic crawlers from selector rules alone.
## Defining crawlers

Crawlers are defined in the request body. Send a POST request with a JSON body (header `Content-Type: application/json`):
```jsonc
{
  "name": "example",
  "task": {
    "url": "https://ionicabizau.net", // string or array of strings
    "schema": {
      "title": ".header h1", // string selector; returns the element's textContent (see Puppeteer docs)
      "avatar": {
        "selector": ".header img[src]", // advanced form
        "attr": "src",  // OPTIONAL | element attribute to read (see Puppeteer's $eval docs), e.g. href|src|innerHTML
        "eq": 0,        // OPTIONAL | index to pick when the selector matches multiple elements
        "trim": true    // OPTIONAL | trim whitespace from the extracted value
      },
      "desc": {
        "selector": ".header h2",
        "attr": "innerHTML"
      },
      "menu": {
        "listItem": ".pages > li",
        "data": {
          "name": "a"
        }
      }
    }
  }
}
```
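To illustrate how such rules could map onto Puppeteer calls, here is a minimal sketch; the helpers `evalField` and `runSchema` are hypothetical names for illustration, not the repository's actual implementation:

```js
// Minimal sketch of evaluating schema rules against a Puppeteer `page`.
// `evalField` / `runSchema` are hypothetical helpers, not the repo's code.
async function evalField(page, rule) {
  if (typeof rule === 'string') {
    // Simple form: a selector string returns the first match's textContent.
    return page.$eval(rule, (el) => el.textContent);
  }
  const { selector, attr = 'textContent', eq = 0, trim = false } = rule;
  const values = await page.$$eval(
    selector,
    (els, name) =>
      els.map((el) => (el[name] !== undefined ? el[name] : el.getAttribute(name))),
    attr
  );
  const value = values[eq]; // "eq" picks one element when several match
  return trim && typeof value === 'string' ? value.trim() : value;
}

async function runSchema(page, schema) {
  const result = {};
  for (const [key, rule] of Object.entries(schema)) {
    // "listItem" rules (mapping `data` sub-rules over each matched list
    // element) are omitted here for brevity.
    result[key] = await evalField(page, rule);
  }
  return result;
}
```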
The response returns the scraped data, keyed by the crawler's `name`:

```json
{
  "example": {
    "title": "Page selector text",
    "avatar": "http://<link of image>",
    "desc": "<div>content of post</div>",
    "menu": [
      { "name": "Home" },
      { "name": "Blog" },
      { "name": "Contact" },
      { "name": "..." }
    ]
  }
}
```
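As a usage sketch, the endpoint could be called with axios (already listed among the packages below); the function URL here is a placeholder assumption, not the repository's actual endpoint:

```js
const axios = require('axios');

// Placeholder URL; substitute your deployed function's HTTP trigger URL.
const FUNCTION_URL = 'https://REGION-PROJECT.cloudfunctions.net/scrape';

async function main() {
  const { data } = await axios.post(FUNCTION_URL, {
    name: 'example',
    task: {
      url: 'https://ionicabizau.net',
      schema: { title: '.header h1' },
    },
  });
  console.log(data); // -> { example: { title: '...' } }
}

main().catch(console.error);
```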
## Installation

```sh
npm install
```

Test your task locally (configure the example in `server.js`):

```sh
node server
```
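For reference, a local harness along these lines would exercise the function without deploying. This is a hedged sketch assuming the handler is exported as `scrape` from `index.js`; the repository's actual `server.js` may differ:

```js
// Hypothetical local harness; the repo's real server.js may differ.
const http = require('http');
const { scrape } = require('./index'); // assumed export name

http
  .createServer((req, res) => {
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      req.body = body ? JSON.parse(body) : {};
      // Minimal Express-style shims for the Cloud Functions signature.
      res.status = (code) => { res.statusCode = code; return res; };
      res.send = (data) =>
        res.end(typeof data === 'string' ? data : JSON.stringify(data));
      res.json = (data) => res.end(JSON.stringify(data));
      scrape(req, res);
    });
  })
  .listen(8080, () => console.log('Listening on http://localhost:8080'));
```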
## Logs

GCP logs are generated while tasks are processed.
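In Cloud Functions, anything written via `console.log` or `console.error` is captured by Cloud Logging, so the handler can emit progress like this (a sketch; the export name `scrape` is an assumption):

```js
exports.scrape = async (req, res) => {
  // console output is collected automatically by GCP's Cloud Logging.
  console.log('Received task:', req.body && req.body.name);
  try {
    // ... run the crawler here ...
    res.send({ ok: true });
  } catch (err) {
    console.error('Task failed:', err); // appears as an error-level log entry
    res.status(500).send({ error: String(err) });
  }
};
```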
## Additional packages

- lodash
- axios
- puppeteer-core
- chrome-aws-lambda
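puppeteer-core ships without a bundled browser; chrome-aws-lambda supplies a Chromium binary and launch flags suited to serverless environments. A typical launch sequence (a sketch, not necessarily this repo's exact code) looks like:

```js
const chromium = require('chrome-aws-lambda');

async function getBrowser() {
  // chrome-aws-lambda provides the flags, headless mode, and the path to
  // its bundled Chromium; chromium.puppeteer re-exports puppeteer-core.
  return chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
  });
}
```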
Based on this blog post: Introducing headless Chrome support in Cloud Functions and App Engine.

## Deploy

```sh
npm run deploy
```