JavaScript library for intercepting AJAX / XHR calls performed by a website in order to:
- perform reverse engineering of communication between front-end and back-end; and
- perform data extraction (a.k.a. scraping) of such a website.
Outline:
- Getting Started
- Processing AJAX Responses
- Gathering Data
- Pause Interception
- Debug Mode
- Working with Selenium
- Disclaimer
- Open a browser where you can access your Facebook profile.
- Go to this URL to see the latest posts in the Facebook groups you have joined:
- Press CTRL+SHIFT+I to open the Developer Tools:
- Inject intercept.js into the webpage.
In the console tab, paste the source code of intercept.js and press ENTER.
- Initialize intercept.js:
$$.init();
- Scroll down to load new posts, and see the URLs of the AJAX calls in the console.
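For readers curious how this kind of URL logging can work, here is a minimal sketch of the general technique: wrapping `open` on an XHR-like class. It is not the source code of intercept.js. `installUrlLogger` and its parameters are hypothetical names; the only detail taken from this document is that the request URL ends up on `xhr._url`.

```javascript
// Hypothetical helper: wraps XhrClass.prototype.open so that every call
// records its URL on the instance (as `_url`) and reports it to `onCall`.
function installUrlLogger(XhrClass, onCall) {
    const originalOpen = XhrClass.prototype.open;
    XhrClass.prototype.open = function (method, url, ...rest) {
        this._url = url;      // remember the URL for later inspection
        onCall(method, url);  // e.g. print it to the console
        return originalOpen.call(this, method, url, ...rest);
    };
    // Return a function that restores the original behavior.
    return function uninstall() {
        XhrClass.prototype.open = originalOpen;
    };
}
```

In a browser console, this sketch could be tried with `installUrlLogger(XMLHttpRequest, (m, u) => console.log(m, u));`.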
Initialize intercept.js with a custom parsing function.
For example, the code below extracts the content of each post from the AJAX response.
$$.init({
    parse: function(xhr) {
        var s = null; // complete response text
        var ar = null; // array of lines in the response text
        var x = null; // line in the response text
        var t = null; // line wrapped in an array
        var j = null; // parsed JSON of the line
        // get the content of all the posts
        if (xhr._url == '/api/graphql/') {
            s = xhr.responseText;
            ar = s.split("\n");
            for (let z = 0; z < ar.length; z++) {
                x = ar[z];
                // parse only the news-feed lines, so unrelated lines cannot break the loop
                if (x.startsWith('{"label":"CometNewsFeed_viewerConnection$stream$CometNewsFeed_viewer_news_feed"')) {
                    // the line is wrapped in an array before parsing
                    t = '['+x+']';
                    j = JSON.parse(t)[0];
                    let a = j.data.node.comet_sections.content.story.message;
                    if (a != null) {
                        console.log('POST: ' + a.text);
                    }
                }
            }
        }
    }
});
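The line-by-line parsing can also be written as a standalone function, which makes it easy to try outside the browser. The sketch below is an illustration only: `extractPosts` is a hypothetical helper, not part of intercept.js, and it assumes each line of the response is an independent JSON document shaped like the one used in this example.

```javascript
// Hypothetical helper: given a response body made of one JSON document per
// line, return the message text of every line whose label matches.
function extractPosts(responseText, label) {
    const prefix = '{"label":"' + label + '"';
    const posts = [];
    for (const line of responseText.split("\n")) {
        if (!line.startsWith(prefix)) continue; // skip unrelated or empty lines
        let doc;
        try {
            doc = JSON.parse(line);
        } catch (e) {
            continue; // skip lines that are not valid JSON
        }
        // Optional chaining avoids a TypeError when part of the path is missing.
        const message = doc?.data?.node?.comet_sections?.content?.story?.message;
        if (message != null && message.text != null) posts.push(message.text);
    }
    return posts;
}
```

Inside a `parse` function, the returned array could then be logged or stored instead of walking the lines by hand.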
In addition to logging the contents, you can store them in the $$.data array.
console.log('POST: ' + a.text);
$$.push(a.text);
Every time you call the $$.push method, you add an element to the $$.data array:
console.log($$.data.length);
// => 1
You can clean up both arrays, $$.data and $$.calls, by calling the $$.reset method:
$$.reset();
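As a mental model of how `push` and `reset` relate to the two arrays, here is a small sketch. It is an assumption about the behavior described in this section, not the library's implementation; `createStore` is a hypothetical name.

```javascript
// Hypothetical model of the two arrays described above; not the library's code.
function createStore() {
    return {
        data: [],   // values collected by the parse function
        calls: [],  // recorded requests (filled only in debug mode)
        push(value) { this.data.push(value); },
        reset() { this.data = []; this.calls = []; }
    };
}
```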
You can pause interception:
$$.pause();
You can resume interception:
$$.play();
You can check whether interception is paused:
$$._paused
// => true
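One simple way to implement pausing is to gate the response handler behind a flag. The sketch below illustrates that idea under stated assumptions: `createPausable`, `handle`, and `isPaused` are hypothetical names and do not describe how intercept.js is written.

```javascript
// Hypothetical sketch: a handler wrapper that can be paused and resumed.
function createPausable(handler) {
    let paused = false;
    return {
        pause() { paused = true; },
        play() { paused = false; },
        isPaused() { return paused; },
        // Called for every intercepted response; does nothing while paused.
        handle(xhr) {
            if (paused) return false;
            handler(xhr);
            return true;
        }
    };
}
```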
You can request intercept.js to store all the requests and their responses into an array.
$$.debug(true);
You can also enable debug mode when initializing:
$$.init({
    debug: true,
    parse: function(xhr) {
        // ...
    }
});
This feature is useful for developers who are reverse engineering a website.
$$.calls.length
// => 64
$$.calls[0].url
// => '/ajax/navigation/'
You can check if intercept.js is running in debug mode or not:
$$._debug
// => false
This feature is also resource-intensive, so it should be kept disabled in production environments.
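To illustrate why recording every call costs memory, and one way to bound that cost, here is a sketch of a call log that only records while debug is on and keeps a limited number of entries. This is an assumption for illustration: `createCallLog`, `record`, `limit`, and the `response` field are hypothetical, and only the `url` field is taken from the example above.

```javascript
// Hypothetical sketch: records calls only when debug is on, and keeps at
// most `limit` entries so memory use stays bounded.
function createCallLog(limit) {
    const log = {
        debug: false,
        calls: [],
        record(url, responseText) {
            if (!log.debug) return;
            log.calls.push({ url: url, response: responseText });
            if (log.calls.length > limit) log.calls.shift(); // drop the oldest entry
        }
    };
    return log;
}
```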
You can automate your web-scraping using Selenium, injecting the intercept.js library using the Chrome DevTools Protocol (a.k.a. CDP).
You can find a full example here.
- The example is written in Ruby, but you can use any other language, such as Python.
- The example uses AdsPower Client to operate stealth browsers, but you can use plain Selenium/WebDriver instead.
In this section, we explain the example line by line.
- In your Ruby script, include the required libraries:
require 'net/http'
require 'json'
require 'adspower-client'
- Create the AdsPower client:
key = '*************8c95acbf*************'
client = AdsPowerClient.new(key: key)
- Start the browser:
id = 'jdu****'
driver = client.driver(id)
- Get the source code of the intercept.js library:
uri = URI.parse('https://raw.githubusercontent.com/leandrosardi/intercept/main/lib/intercept.js')
js1 = Net::HTTP.get(uri)
- Get the source code of the scraper:
uri = URI.parse('https://raw.githubusercontent.com/leandrosardi/intercept/main/examples/facebook_group_posts.js')
js2 = Net::HTTP.get(uri)
- Inject the library and the scraper into the browser using CDP:
driver.execute_cdp("Page.addScriptToEvaluateOnNewDocument", source: js1+js2)
- Get the URL to scrape:
url = 'https://www.facebook.com/?filter=groups&sk=h_chr'
driver.get(url)
- Wait for the page to load:
sleep(5)
- Reset the interceptor:
driver.execute_script('$$.reset();')
- Click to load posts via AJAX:
a = driver.find_element(:css, 'a[href="/?filter=groups&sk=h_chr"]')
a.click
- Wait for the AJAX calls to complete:
sleep(5)
- Get the list of scraped posts:
s = driver.execute_script('return JSON.stringify($$.data)')
arr = JSON.parse(s)
Use this library at your own risk.