Intercept AJAX / XHR calls to extract information / scraping.

MassProspecting/intercept

intercept.js

JavaScript library for intercepting AJAX / XHR calls performed by a website in order to:

  1. perform reverse engineering of communication between front-end and back-end; and

  2. perform data extraction (a.k.a. scraping) of such a website.

Outline:

  1. Getting Started
  2. Processing AJAX Responses
  3. Gathering Data
  4. Pause Interception
  5. Debug Mode
  6. Working with Selenium
  7. Disclaimer

1. Getting Started

  1. Open a browser in which you are logged in to your Facebook profile.

  2. Go to the page listing the latest posts in the Facebook groups you have joined.

  3. Press CTRL+SHIFT+I to open the Developer Tools.

  4. Inject intercept.js into the webpage: in the Console tab, paste the source code of intercept.js and press ENTER.

  5. Initialize intercept.js:
$$.init();

  6. Scroll down to load new posts, and watch the URLs of the AJAX calls appear in the console.
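
To make the idea behind the console output more concrete, here is a minimal sketch of how XHR interception can work in general. It is not the source of intercept.js: the helper name `installInterceptor` and its parameters are hypothetical, and only the `_url` property name is borrowed from the `xhr._url` usage shown in section 2.

```javascript
// Hypothetical helper, not part of intercept.js: wraps open/send on an
// XMLHttpRequest-like class so that every completed call is reported.
function installInterceptor(XhrClass, onResponse) {
    const originalOpen = XhrClass.prototype.open;
    const originalSend = XhrClass.prototype.send;

    XhrClass.prototype.open = function (method, url, ...rest) {
        // remember the request details on the instance
        this._method = method;
        this._url = url;
        return originalOpen.call(this, method, url, ...rest);
    };

    XhrClass.prototype.send = function (...args) {
        // report the request once its response has loaded
        this.addEventListener('load', () => onResponse(this));
        return originalSend.apply(this, args);
    };

    // return a function that restores the original methods
    return function uninstall() {
        XhrClass.prototype.open = originalOpen;
        XhrClass.prototype.send = originalSend;
    };
}
```

In a browser console this sketch could be tried with `installInterceptor(XMLHttpRequest, xhr => console.log(xhr._url));`. Passing the class as a parameter keeps the helper free of browser globals.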

2. Processing AJAX Responses

Initialize intercept.js with a custom parsing function.

E.g., the code below extracts the content of each post from the AJAX response.

$$.init({
    parse: function(xhr) {
        // only process responses from the GraphQL endpoint
        if (xhr._url != '/api/graphql/') return;

        // prefix of every line that carries a news feed post
        var prefix = '{"label":"CometNewsFeed_viewerConnection$stream$CometNewsFeed_viewer_news_feed"';

        // the response text holds one JSON document per line
        var lines = xhr.responseText.split("\n");
        for (let z = 0; z < lines.length; z++) {
            let x = lines[z];
            if (!x.startsWith(prefix)) continue;

            // the full response is not valid JSON, so each line is
            // wrapped in an array and parsed on its own
            let j = JSON.parse('[' + x + ']')[0];

            let a = j.data.node.comet_sections.content.story.message;
            if (a != null) {
                console.log('POST: ' + a.text);
            }
        }
    }
});

In addition to logging the contents, you can store them in the $$.data array:

console.log('POST: ' + a.text);
$$.push(a.text);
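
The parsing step can also be tried outside the browser. The sketch below is a more defensive variant under stated assumptions: `extractPostTexts` is a hypothetical helper, the sample response is made up, and the property path is copied from the example in this section rather than from any documented Facebook format. It parses each line directly and skips lines that are not JSON or that lack the expected keys.

```javascript
// Hypothetical helper: returns the text of every post found in a
// newline-delimited JSON response. The property path mirrors the
// example in this section and is an assumption about the payload.
function extractPostTexts(responseText, label) {
    const texts = [];
    for (const line of responseText.split('\n')) {
        if (line.trim() === '') continue;   // skip blank lines
        let doc;
        try {
            doc = JSON.parse(line);         // each line is parsed on its own
        } catch (e) {
            continue;                       // ignore lines that are not JSON
        }
        if (doc === null || typeof doc !== 'object' || doc.label !== label) continue;
        // optional chaining yields undefined instead of throwing on missing keys
        const message = doc.data?.node?.comet_sections?.content?.story?.message;
        if (message != null && typeof message.text === 'string') {
            texts.push(message.text);
        }
    }
    return texts;
}

// Made-up sample response, for illustration only.
const LABEL = 'CometNewsFeed_viewerConnection$stream$CometNewsFeed_viewer_news_feed';
const sample = [
    JSON.stringify({ label: LABEL, data: { node: { comet_sections: { content: { story: { message: { text: 'Hello group' } } } } } } }),
    JSON.stringify({ label: 'other', data: {} }),
    'not json',
    JSON.stringify({ label: LABEL, data: { node: {} } })
].join('\n');

console.log(extractPostTexts(sample, LABEL));
```

Inside a `parse` function, the returned texts could then be passed to `$$.push` one by one.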

3. Gathering Data

Every time you call the $$.push method, you add an element to the $$.data array:

console.log($$.data.length);
// => 1

You can clear both arrays, $$.data and $$.calls, by calling the $$.reset method:

$$.reset();
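
As a mental model of the behavior described in this section, the following sketch uses a hypothetical stand-in object. `createCollector` is not part of intercept.js; it only mimics the `data`, `calls`, `push` and `reset` members named above, and the real implementation may differ.

```javascript
// Hypothetical stand-in for the $$ object, covering only the members
// described in this section.
function createCollector() {
    return {
        data: [],    // items stored by push
        calls: [],   // intercepted calls (see Debug Mode)
        push(item) { this.data.push(item); },
        reset() { this.data = []; this.calls = []; }
    };
}

const collector = createCollector();
collector.push('first post');
collector.push('second post');
console.log(collector.data.length);   // 2
collector.reset();
console.log(collector.data.length);   // 0
```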

4. Pause Interception

You can pause interception:

$$.pause();

You can resume interception:

$$.play();

You can check whether interception is currently paused:

$$._paused
// => true
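
One way to picture pausing is a flag that gates the handler. The sketch below assumes that intercepted calls are simply skipped while paused; `createPausable` and `handle` are hypothetical names, and only `pause`, `play` and `_paused` mirror the names used in this section.

```javascript
// Hypothetical sketch: wraps a handler so that it is skipped while paused.
function createPausable(handler) {
    return {
        _paused: false,
        pause() { this._paused = true; },
        play() { this._paused = false; },
        handle(event) {
            if (this._paused) return false;   // dropped while paused
            handler(event);
            return true;
        }
    };
}

const seen = [];
const gate = createPausable(url => seen.push(url));
gate.handle('/api/graphql/');       // processed
gate.pause();
gate.handle('/ajax/navigation/');   // ignored
gate.play();
gate.handle('/api/graphql/');       // processed again
console.log(seen.length);           // 2
```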

5. Debug Mode

You can request intercept.js to store all the requests and their responses into an array.

$$.debug(true);

You can also set the debug mode when initializing:

$$.init({
    debug: true,
    parse: function(xhr) {
        // ...
    }
});

This feature is useful for developers who are reverse engineering a website.

$$.calls.length
// => 64

$$.calls[0].url
// => '/ajax/navigation/'

You can check if intercept.js is running in debug mode or not:

$$._debug
// => false

Debug mode also consumes resources, so it should be kept disabled in production environments.
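
When reverse engineering, a quick summary of which endpoints are hit most often can help decide where to look first. The sketch below assumes only that each entry of the calls array has a `url` property, as in the `$$.calls[0].url` example; `countCallsByUrl` is a hypothetical helper and the sample entries are made up.

```javascript
// Hypothetical helper: counts intercepted calls per URL, most frequent first.
// Assumes each entry has a `url` property.
function countCallsByUrl(calls) {
    const counts = new Map();
    for (const call of calls) {
        counts.set(call.url, (counts.get(call.url) || 0) + 1);
    }
    return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up entries standing in for $$.calls.
const sampleCalls = [
    { url: '/ajax/navigation/' },
    { url: '/api/graphql/' },
    { url: '/api/graphql/' }
];
console.log(countCallsByUrl(sampleCalls));
```

In the browser console, with debug mode enabled, this could be called as `countCallsByUrl($$.calls)`.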

6. Working with Selenium

You can automate your web-scraping using Selenium, injecting the intercept.js library using the Chrome DevTools Protocol (a.k.a. CDP).

You can find a full example here.

  • The example is written in Ruby, but you can use any other language, such as Python.

  • The example uses AdsPower Client to operate stealth browsers, but you can use plain Selenium/WebDriver instead.

In this section, we explain the example line by line.

  1. In your Ruby script, include the required libraries:
require 'net/http'
require 'json'
require 'adspower-client'
  2. Create the AdsPower client:
key = '*************8c95acbf*************'
client = AdsPowerClient.new(key: key)
  3. Start the browser:
id = 'jdu****'
driver = client.driver(id)
  4. Get the source code of the intercept.js library:
uri = URI.parse('https://raw.githubusercontent.com/leandrosardi/intercept/main/lib/intercept.js')
js1 = Net::HTTP.get(uri)
  5. Get the source code of the scraper:
uri = URI.parse('https://raw.githubusercontent.com/leandrosardi/intercept/main/examples/facebook_group_posts.js')
js2 = Net::HTTP.get(uri)
  6. Inject the library and the scraper into the browser using CDP:
driver.execute_cdp("Page.addScriptToEvaluateOnNewDocument", source: js1+js2)
  7. Open the URL to scrape:
url = 'https://www.facebook.com/?filter=groups&sk=h_chr'
driver.get(url)
  8. Wait for the page to load:
sleep(5)
  9. Reset the interceptor:
driver.execute_script('$$.reset();')
  10. Click the link that loads posts via AJAX:
a = driver.find_element(:css, 'a[href="/?filter=groups&sk=h_chr"]')
a.click
  11. Wait for the AJAX calls to complete:
sleep(5)
  12. Get the list of scraped posts:
s = driver.execute_script('return JSON.stringify($$.data)')
arr = JSON.parse(s)
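
The fixed five-second waits in the steps above can be too short on a slow connection and wasteful on a fast one. A possible alternative is to poll a condition on the browser side. The sketch below is a generic polling helper under stated assumptions: `waitUntil` and its parameters are hypothetical, it is not part of intercept.js, and how the result is handed back to the Ruby script is left out.

```javascript
// Hypothetical helper: resolves to true as soon as predicate() is truthy,
// or to false once timeoutMs has elapsed.
function waitUntil(predicate, timeoutMs = 5000, intervalMs = 100) {
    return new Promise(resolve => {
        const deadline = Date.now() + timeoutMs;
        (function poll() {
            if (predicate()) return resolve(true);
            if (Date.now() >= deadline) return resolve(false);
            setTimeout(poll, intervalMs);   // check again after a short delay
        })();
    });
}
```

In the page, a call such as `waitUntil(() => $$.data.length > 0)` would resolve once the first scraped item has been pushed.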

7. Disclaimer

Use this library at your own risk.
