Improve architecture documentation (#243)
* cleanup assets
* update crawler docs
* update search index docs
* update webgraph docs
1 parent 01de7a1, commit 9e8dc92
Showing 29 changed files with 46 additions and 264 deletions.
14 files renamed without changes.
````diff
@@ -1,41 +1,3 @@
-# Website
+# API Docs
 
 This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.
-
-### Installation
-
-```
-$ yarn
-```
-
-### Local Development
-
-```
-$ yarn start
-```
-
-This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
-
-### Build
-
-```
-$ yarn build
-```
-
-This command generates static content into the `build` directory and can be served using any static contents hosting service.
-
-### Deployment
-
-Using SSH:
-
-```
-$ USE_SSH=true yarn deploy
-```
-
-Not using SSH:
-
-```
-$ GIT_USER=<Your GitHub username> yarn deploy
-```
-
-If you are using GitHub pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.
````
# Crawler
[Information for webmasters here](https://stract.com/webmasters)

The crawler is a distributed system that scours the web. It has a coordinator process that determines which URLs to crawl and a set of worker processes that fetch the content of those URLs. Each worker receives a batch of crawl jobs to process, stores the fetched contents in an S3 bucket, and retrieves a new batch of jobs to process. This continues until the coordinator has determined that the crawl is complete.

Each crawl job contains a site, a crawl budget and a list of known high-authority URLs for that site. The crawl budget determines how many pages to fetch from the site. Each site is only crawled by a single worker at a time to ensure that we don't overload the website.
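
To make the shape of a job concrete, a minimal sketch could look like the following (the type and field names are illustrative, not the actual definitions from the codebase):

```rust
/// Illustrative sketch of a crawl job as described above;
/// the real definition in the codebase may differ.
struct CrawlJob {
    /// The site this job covers. A site is never part of more than one job,
    /// so at most one worker crawls it at a time.
    site: String,
    /// Maximum number of pages a worker may fetch from this site.
    budget: u64,
    /// Known high-authority URLs for the site, used as crawl entry points.
    known_urls: Vec<String>,
}

fn main() {
    let job = CrawlJob {
        site: "example.com".to_string(),
        budget: 500,
        known_urls: vec!["https://example.com/".to_string()],
    };
    println!(
        "crawl {} ({} seed URLs) with a budget of {} pages",
        job.site,
        job.known_urls.len(),
        job.budget
    );
}
```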

## Coordinator
The coordinator is responsible for planning and orchestrating the crawl process. It analyzes data from previous crawls to determine an appropriate crawl budget for each website. This budget helps ensure fair resource allocation and prevents overloading any single site.

Based on this analysis, the coordinator creates a crawl plan that takes the form of a queue of jobs to be processed. This approach allows for efficient distribution to worker nodes while ensuring the coordinator does not become a bottleneck.
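
A rough, self-contained sketch of this planning step is shown below; the budget heuristic and names are made up for illustration and are not the actual logic. The key point is that the output is simply a queue with one job per site, which workers can drain without the coordinator tracking every fetch.

```rust
use std::collections::{HashMap, VecDeque};

/// Illustrative only: derive a per-site budget from the previous crawl and
/// queue one job per site. Keeping each site in exactly one job means no two
/// workers ever crawl the same site concurrently.
fn plan_crawl(pages_seen_last_crawl: &HashMap<String, u64>) -> VecDeque<(String, u64)> {
    let mut queue = VecDeque::new();
    for (site, &pages) in pages_seen_last_crawl {
        // Hypothetical heuristic: let a site grow a little, but cap the
        // budget so one large site cannot dominate the crawl.
        let budget = (pages + pages / 10 + 10).min(100_000);
        queue.push_back((site.clone(), budget));
    }
    queue
}

fn main() {
    let mut previous = HashMap::new();
    previous.insert("example.com".to_string(), 1_200);
    let plan = plan_crawl(&previous);
    println!("planned {} job(s), first: {:?}", plan.len(), plan.front());
}
```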

### Respectfulness
It is of utmost importance that we are respectful of the websites we crawl. We do not want to overload a website with requests, and we do not want to crawl pages that the website owner does not want us to crawl.

To ensure this, the jobs are oriented by site, so each site is only included in a single job. When a site gets scheduled to a worker, it becomes the worker's responsibility to respect the `robots.txt` file of the domain and to not overload the domain with requests. For more details, see the [webmasters](https://stract.com/webmasters) documentation.

## Worker
The worker is responsible for crawling the sites scheduled by the coordinator. It is completely stateless and stores the fetched data directly in an S3 bucket. It recursively discovers new URLs on the assigned site and crawls them until the crawl budget is exhausted.

When a worker is tasked with crawling a new site, it first checks the site's `robots.txt` file to see which URLs (if any) it is allowed to crawl.
If the worker receives a `429 Too Many Requests` response from the site, it backs off for a while before trying again. The specific backoff time depends on how fast the server responds.
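
The exact backoff policy is an implementation detail, but a simplified, dependency-free sketch of the idea could look like this (the constants and function name are invented for illustration, not taken from the codebase):

```rust
use std::time::Duration;

/// Illustrative politeness rule, not the actual implementation: wait longer
/// between requests when the server answers slowly, and back off much more
/// aggressively after a `429 Too Many Requests` response.
fn next_delay(last_response_time: Duration, got_429: bool) -> Duration {
    // Base delay proportional to how fast the server responded,
    // with a floor of one second between requests.
    let base = (last_response_time * 10).max(Duration::from_secs(1));
    if got_429 {
        base * 8
    } else {
        base
    }
}

fn main() {
    let fast = next_delay(Duration::from_millis(50), false);
    let limited = next_delay(Duration::from_millis(500), true);
    println!("fast server: wait {fast:?}; rate limited: wait {limited:?}");
}
```
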
File renamed without changes.
```diff
@@ -1,11 +1,11 @@
 # Overview
-Stract (and most other web search engines) is composed of three main components: the crawler, the webgraph and the search index.
+Stract (and most other web search engines) is composed of three main components: the crawler, the web graph and the search index.
 
 ## Crawler
-The crawler, often also referred to as a spider or bot, is the component responsible for collecting and scanning websites across the internet. It begins with a seed list of URLs, which it visits to fetch web pages. The crawler then parses these pages to extract additional URLs, which are then added to the list of URLs to be crawled in the future. This process repeats in a cycle, allowing the crawler to discover new web pages or updates to existing pages continuously. The content fetched by the crawler is passed on to the next components of the search engine: the webgraph and the search index.
+The crawler, often also referred to as a spider or bot, is the component responsible for collecting and scanning websites across the internet. It begins with a seed list of URLs, which it visits to fetch web pages. The crawler then parses these pages to extract additional URLs, which are then added to the list of URLs to be crawled in the future. This process repeats in a cycle, allowing the crawler to discover new web pages or updates to existing pages continuously. The content fetched by the crawler is passed on to the next components of the search engine: the web graph and the search index.
 
-## Webgraph
-The webgraph is a data structure that represents the relationships between different web pages. Each node in the webgraph represents a unique web page, and each edge represents a hyperlink from one page to another. The webgraph helps the search engine understand the structure of the web and the authority of different web pages. Authority is determined by factors such as the number of other pages linking to a given page (also known as "backlinks"), which is an important factor in ranking search results. This concept is often referred to as "link analysis."
+## Web graph
+The web graph is a data structure that represents the relationships between different web pages. Each node in the web graph represents a unique web page, and each edge represents a hyperlink from one page to another. The web graph helps the search engine understand the structure of the web and the authority of different web pages. Stract uses the [harmonic centrality](webgraph.md#harmonic-centrality) to determine the authority of a webpage.
 
 ## Search Index
-The search index is the component that facilitates fast and accurate search results. It is akin to the index at the back of a book, providing a direct mapping from words or phrases to the web pages in which they appear. This data structure is often referred to as an "inverted index". The search index is designed to handle complex search queries and return relevant results in a fraction of a second. The index uses the information gathered by the crawler and the structure of the webgraph to rank search results according to their relevance.
+The search index is the component that facilitates fast and accurate search results. It is akin to the index at the back of a book, providing a direct mapping from words or phrases to the web pages in which they appear. This data structure is often referred to as an "inverted index". The search index is designed to handle complex search queries and return relevant results in a fraction of a second. The index uses the information gathered by the crawler and the structure of the web graph to rank search results according to their relevance.
```
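
For reference, the harmonic centrality linked in the updated web graph section is the standard graph measure in which the centrality of a page $u$ sums the reciprocal shortest-path distances from every other page $v$ to $u$:

$$
C_H(u) = \sum_{v \neq u} \frac{1}{d(v, u)},
$$

where $d(v, u)$ is the length of the shortest path from $v$ to $u$, and pages that cannot reach $u$ contribute $0$ to the sum.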
This file was deleted.
This file was deleted.