Commit

Improve architecture documentation (#243)
* cleanup assets

* update crawler docs

* update search index docs

* update webgraph docs
mikkeldenker authored Dec 3, 2024
1 parent 01de7a1 commit 9e8dc92
Showing 29 changed files with 46 additions and 264 deletions.
14 files renamed without changes
29 changes: 15 additions & 14 deletions docs/src/add_to_browser.md → docs/add_to_browser.md
@@ -7,41 +7,41 @@ to your browser by following the instructions listed for your respective browser

1. Navigate to [stract.com](https://stract.com)
2. Navigate to settings
![chrome settings button](assets/images/add_browser/chrome_search_1.png)
![chrome settings button](../assets/add_browser/chrome_search_1.png)
3. Select `Search engine`
![chrome search engine button](assets/images/add_browser/chrome_search_2.png)
![chrome search engine button](../assets/add_browser/chrome_search_2.png)
4. Click `Manage search engines and site search`
![chrome manage search engines](assets/images/add_browser/chrome_search_3.png)
![chrome manage search engines](../assets/add_browser/chrome_search_3.png)
5. Scroll down to `Inactive shortcuts`
![chrome inactive shortcuts](assets/images/add_browser/chrome_search_4.png)
![chrome inactive shortcuts](../assets/add_browser/chrome_search_4.png)
6. Select the three-dot (⋮) menu next to Stract
7. Select `Make Default`
![chrome make default](assets/images/add_browser/chrome_search_5.png)
![chrome make default](../assets/add_browser/chrome_search_5.png)


## Firefox

1. Navigate to [stract.com](https://stract.com)
2. Right click the address bar.
3. Select `Add "Stract Search"`
![Add Stract to Search button](assets/images/add_browser/firefox_search_1.png)
![Add Stract to Search button](../assets/add_browser/firefox_search_1.png)
4. Navigate to Settings
5. Select Search
6. Use the `Default Search Engine` dropdown to select Stract
![Make Stract default search](assets/images/add_browser/firefox_search_3.png)
![Make Stract default search](../assets/add_browser/firefox_search_3.png)

## Microsoft Edge
1. Navigate to [stract.com](https://stract.com)
2. Navigate to `Settings`
![edge settings button](assets/images/add_browser/edge_search_1.png)
![edge settings button](../assets/add_browser/edge_search_1.png)
3. Select `Privacy, search, and services`
![edge privacy, search, and services button](assets/images/add_browser/edge_search_2.png)
![edge privacy, search, and services button](../assets/add_browser/edge_search_2.png)
4. Scroll down to `Services`
![edge Services section](assets/images/add_browser/edge_search_3.png)
![edge Services section](../assets/add_browser/edge_search_3.png)
5. Select `Address Search Bar`
6. Select `Manage search engines`
7. Click the menu next to Stract and select `Make default`
![edge make default button](assets/images/add_browser/edge_search_4.png)
![edge make default button](../assets/add_browser/edge_search_4.png)


## Safari
@@ -52,12 +52,13 @@ as a site search option. What follows describes that process.
1. Navigate to [stract.com](https://stract.com)
2. Open Preferences
3. Navigate to the `Search` panel
![safari search settings](assets/images/add_browser/safari_search_1.png)
![safari search settings](../assets/add_browser/safari_search_1.png)
4. Select `Manage Websites...`
5. Select stract.com from the options
![safari website options](assets/images/add_browser/safari_search_2.png)
![safari website options](../assets/add_browser/safari_search_2.png)


From here, stract.com should appear in the search bar, and you can arrow down to
it and begin typing.
![safari stract searching](assets/images/add_browser/safari_search_3.png)
![safari stract searching](../assets/add_browser/safari_search_3.png)

42 changes: 2 additions & 40 deletions docs/api/README.md
@@ -1,41 +1,3 @@
# Website
# API Docs

This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.

### Installation

```
$ yarn
```

### Local Development

```
$ yarn start
```

This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.

### Build

```
$ yarn build
```

This command generates static content into the `build` directory and can be served using any static contents hosting service.

### Deployment

Using SSH:

```
$ USE_SSH=true yarn deploy
```

Not using SSH:

```
$ GIT_USER=<Your GitHub username> yarn deploy
```

If you are using GitHub pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.
This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.
22 changes: 22 additions & 0 deletions docs/architecture/crawler.md
@@ -0,0 +1,22 @@
# Crawler
[Information for webmasters here](https://stract.com/webmasters)

The crawler is a distributed system that scours the web. It has a coordinator process that determines which URLs to crawl and a set of worker processes that fetch the content of those URLs. Each worker receives a batch of crawl jobs, stores the fetched contents in an S3 bucket, and then retrieves a new batch of jobs. This continues until the coordinator has determined that the crawl is complete.

Each crawl job contains a site, a crawl budget and a list of known high-authority URLs for that site. The crawl budget determines how many pages to fetch from the site. Each site is only crawled by a single worker at a time to ensure that we don't overload a website.
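
As a rough illustration, a crawl job might look something like the sketch below. The type and field names (`CrawlJob`, `site`, `budget`, `seed_urls`) are hypothetical and chosen for readability; they are not taken from the Stract codebase.

```rust
// Hypothetical sketch of a crawl job; names are illustrative only.
struct CrawlJob {
    /// The site this job covers, e.g. "example.com". A site appears in at most one job.
    site: String,
    /// Maximum number of pages a worker may fetch from this site.
    budget: u64,
    /// Known high-authority URLs used as entry points for the crawl.
    seed_urls: Vec<String>,
}

fn main() {
    let job = CrawlJob {
        site: "example.com".to_string(),
        budget: 1_000,
        seed_urls: vec!["https://example.com/".to_string()],
    };
    println!("crawl {} with a budget of {} pages", job.site, job.budget);
}
```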

## Coordinator
The coordinator is responsible for planning and orchestrating the crawl process. It analyzes data from previous crawls to determine an appropriate crawl budget for each website. This budget helps ensure fair resource allocation and prevents overloading any single site.

Based on this analysis, the coordinator creates a crawl plan that takes the form of a queue of jobs to be processed. This approach allows for efficient distribution to worker nodes while ensuring the coordinator does not become a bottleneck.

### Respectfulness
It is of utmost importance that we are respectful of the websites we crawl. We do not want to overload a website with requests, and we do not want to crawl pages that the website owner does not want us to crawl.

To ensure this, the jobs are organized by site, so each site is only included in a single job. When a site gets scheduled to a worker, it is then the responsibility of the worker to respect the `robots.txt` file of the domain and to not overload the domain with requests. For more details see the [webmasters](https://stract.com/webmasters) documentation.

## Worker
The worker is responsible for crawling the sites scheduled by the coordinator. It is completely stateless and stores the fetched data directly in an S3 bucket. It recursively discovers new URLs on the assigned site and crawls them until the crawl budget is exhausted.

When a worker is tasked to crawl a new site, it first checks the site's `robots.txt` file to see which URLs (if any) it is allowed to crawl.
If the worker receives a `429 Too Many Requests` response from the site, it backs off for a while before trying again. The specific backoff time depends on how fast the server responds.
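
A minimal sketch of that backoff behaviour is shown below; the base delay and multiplier are assumptions for the example, not Stract's actual policy.

```rust
use std::time::Duration;

// Minimal sketch: wait longer before retrying when the server is slow to respond.
// The base delay and multiplier are made-up values, not Stract's actual policy.
fn backoff_after_429(server_response_time: Duration) -> Duration {
    let base = Duration::from_secs(5);
    base + server_response_time * 10
}

fn main() {
    let wait = backoff_after_429(Duration::from_millis(800));
    println!("backing off for {:?} before retrying", wait);
    // A real worker would sleep for `wait` here before retrying the URL.
}
```
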
File renamed without changes.
10 changes: 5 additions & 5 deletions docs/src/overview.md → docs/architecture/overview.md
@@ -1,11 +1,11 @@
# Overview
Stract (and most other web search engines) is composed of three main components: the crawler, the webgraph and the search index.
Stract (and most other web search engines) is composed of three main components: the crawler, the web graph and the search index.

## Crawler
The crawler, often also referred to as a spider or bot, is the component responsible for collecting and scanning websites across the internet. It begins with a seed list of URLs, which it visits to fetch web pages. The crawler then parses these pages to extract additional URLs, which are then added to the list of URLs to be crawled in the future. This process repeats in a cycle, allowing the crawler to discover new web pages or updates to existing pages continuously. The content fetched by the crawler is passed on to the next components of the search engine: the webgraph and the search index.
The crawler, often also referred to as a spider or bot, is the component responsible for collecting and scanning websites across the internet. It begins with a seed list of URLs, which it visits to fetch web pages. The crawler then parses these pages to extract additional URLs, which are then added to the list of URLs to be crawled in the future. This process repeats in a cycle, allowing the crawler to discover new web pages or updates to existing pages continuously. The content fetched by the crawler is passed on to the next components of the search engine: the web graph and the search index.

## Webgraph
The webgraph is a data structure that represents the relationships between different web pages. Each node in the webgraph represents a unique web page, and each edge represents a hyperlink from one page to another. The webgraph helps the search engine understand the structure of the web and the authority of different web pages. Authority is determined by factors such as the number of other pages linking to a given page (also known as "backlinks"), which is an important factor in ranking search results. This concept is often referred to as "link analysis."
## Web graph
The web graph is a data structure that represents the relationships between different web pages. Each node in the web graph represents a unique web page, and each edge represents a hyperlink from one page to another. The web graph helps the search engine understand the structure of the web and the authority of different web pages. Stract uses [harmonic centrality](webgraph.md#harmonic-centrality) to determine the authority of a web page.
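
As a toy illustration of harmonic centrality (an in-memory sketch, not Stract's on-disk implementation): the centrality of a page is the sum of `1/d` over all other pages that can reach it, where `d` is the length of the shortest chain of links leading to the page.

```rust
use std::collections::{HashMap, VecDeque};

/// Harmonic centrality of `target`: sum over all other nodes u of 1 / d(u, target),
/// where d is the shortest-path distance along directed edges. Unreachable nodes
/// contribute 0. This is a toy in-memory sketch, not Stract's implementation.
fn harmonic_centrality(edges: &[(usize, usize)], num_nodes: usize, target: usize) -> f64 {
    // Build a reversed adjacency list so a BFS from `target` yields
    // distances *towards* `target` along the original edge direction.
    let mut reversed: Vec<Vec<usize>> = vec![Vec::new(); num_nodes];
    for &(from, to) in edges {
        reversed[to].push(from);
    }

    let mut dist: HashMap<usize, usize> = HashMap::new();
    let mut queue = VecDeque::new();
    dist.insert(target, 0);
    queue.push_back(target);
    while let Some(node) = queue.pop_front() {
        let d = dist[&node];
        for &prev in &reversed[node] {
            if !dist.contains_key(&prev) {
                dist.insert(prev, d + 1);
                queue.push_back(prev);
            }
        }
    }

    let mut centrality = 0.0;
    for (&node, &d) in &dist {
        if node != target {
            centrality += 1.0 / d as f64;
        }
    }
    centrality
}

fn main() {
    // Tiny graph: 0 -> 2, 1 -> 2, 2 -> 3. Node 2 has two direct in-links.
    let edges = [(0, 2), (1, 2), (2, 3)];
    println!("H(2) = {}", harmonic_centrality(&edges, 4, 2)); // 1 + 1 = 2
    println!("H(3) = {}", harmonic_centrality(&edges, 4, 3)); // 1 + 0.5 + 0.5 = 2
}
```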

## Search Index
The search index is the component that facilitates fast and accurate search results. It is akin to the index at the back of a book, providing a direct mapping from words or phrases to the web pages in which they appear. This data structure is often referred to as an "inverted index". The search index is designed to handle complex search queries and return relevant results in a fraction of a second. The index uses the information gathered by the crawler and the structure of the webgraph to rank search results according to their relevance.
The search index is the component that facilitates fast and accurate search results. It is akin to the index at the back of a book, providing a direct mapping from words or phrases to the web pages in which they appear. This data structure is often referred to as an "inverted index". The search index is designed to handle complex search queries and return relevant results in a fraction of a second. The index uses the information gathered by the crawler and the structure of the web graph to rank search results according to their relevance.
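
The sketch below is a toy version of the inverted-index idea, assuming whitespace tokenization and nothing else; a real search index also stores positions, field information and compressed posting lists.

```rust
use std::collections::HashMap;

// Toy sketch of an inverted index: a mapping from each term to the list of
// document ids that contain it.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<String, Vec<usize>>,
}

impl InvertedIndex {
    fn add_document(&mut self, doc_id: usize, text: &str) {
        for term in text.split_whitespace() {
            let posting = self.postings.entry(term.to_lowercase()).or_default();
            // Avoid duplicate entries when a term occurs several times in a document.
            if posting.last() != Some(&doc_id) {
                posting.push(doc_id);
            }
        }
    }

    fn search(&self, term: &str) -> &[usize] {
        self.postings
            .get(&term.to_lowercase())
            .map(|posting| posting.as_slice())
            .unwrap_or(&[])
    }
}

fn main() {
    let mut index = InvertedIndex::default();
    index.add_document(0, "rust web search engine");
    index.add_document(1, "web crawler written in rust");
    println!("{:?}", index.search("rust")); // [0, 1]
    println!("{:?}", index.search("crawler")); // [1]
}
```
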
Expand Up @@ -40,4 +40,4 @@ The ranking happens in multiple stages. Some of these stages occur at the shard
3. If a lambdamart model has been defined, the best results from the linear regression stage get passed into the lambdamart model.
- Combining results from all shards
1. Results from each shard are re-ranked using both the linear regression and lambdamart models. This ensures the scores can be properly compared and ordered.
2. The best 20 results, corresponding to the first page, gets scored with a cross encoder and again ranked using the linear regression followed by the lambdamart model.
2. Multiple ranking stages get applied in the ranking [pipeline](https://github.com/StractOrg/stract/tree/main/crates/core/src/ranking/pipeline) until the top 20 results are found.
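
The sketch below illustrates the general idea of staged re-ranking; it is not the code from the linked `ranking/pipeline` module, and the stage and type names are made up for the example. Each stage re-scores the surviving candidates and keeps only the best before handing them to the next, typically more expensive, stage.

```rust
struct Candidate {
    doc_id: u64,
    score: f64,
}

trait RankingStage {
    /// Re-score the candidates; the pipeline handles sorting and truncation.
    fn rescore(&self, candidates: &mut [Candidate]);
    /// How many candidates to keep after this stage.
    fn top_n(&self) -> usize;
}

fn run_pipeline(mut candidates: Vec<Candidate>, stages: &[Box<dyn RankingStage>]) -> Vec<Candidate> {
    for stage in stages {
        stage.rescore(&mut candidates);
        candidates.sort_by(|a, b| b.score.total_cmp(&a.score));
        candidates.truncate(stage.top_n());
    }
    candidates
}

/// A toy first stage that favours low doc ids (a stand-in for a real model).
struct ToyStage;

impl RankingStage for ToyStage {
    fn rescore(&self, candidates: &mut [Candidate]) {
        for c in candidates.iter_mut() {
            c.score += 1.0 / (1.0 + c.doc_id as f64);
        }
    }

    fn top_n(&self) -> usize {
        20
    }
}

fn main() {
    let candidates: Vec<Candidate> = (0..100)
        .map(|doc_id| Candidate { doc_id, score: 0.0 })
        .collect();
    let stages: Vec<Box<dyn RankingStage>> = vec![Box::new(ToyStage)];
    let top = run_pipeline(candidates, &stages);
    println!("kept {} candidates", top.len()); // 20
}
```
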
9 changes: 1 addition & 8 deletions docs/src/webgraph.md → docs/architecture/webgraph.md
@@ -1,14 +1,7 @@
# Webgraph
The webgraph, often conceptualized as the "internet's map," provides a structured view of the interconnectedness of pages across the World Wide Web. With billions of pages linked together, the webgraph is a crucial tool for understanding the structure, pattern, and dynamics of the internet.

There are two primary ways of constructing the webgraph:

- **Page-Level Webgraph**: This method involves constructing the graph by analyzing individual pages and their outbound links. The nodes in this graph represent individual web pages, while the edges represent hyperlinks between them. This detailed view is especially helpful for understanding specific page connections.

- **Host-Level Webgraph**: Instead of examining individual pages, this approach consolidates all the links associated with a particular host, effectively simplifying the webgraph. In this representation, nodes represent entire websites or hosts, and edges represent connections between them. This broader perspective is suitable for understanding the authority and influence of entire websites.

## Segments
Given the extreme size of the internet, managing the webgraph as a single monolithic structure in memory is neither efficient nor practical. Thus, it's segmented into smaller parts called segments. Each segment is essentially a portion of the overall webgraph stored in a [RocksDB](https://rocksdb.org/) database on disk. This allows us to create webgraphs that are much larger than what we would otherwise be able to fit in memory.
The webgraph is stored in a tantivy index on disk, where each document represents an edge (hyperlink) between two web pages. Each document contains metadata about the link, such as the source URL, destination URL, anchor text, etc.
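
Roughly, each edge document carries fields along these lines; the names below are assumptions for illustration, not the actual tantivy schema.

```rust
// Illustrative shape of an edge document in the webgraph; the field names are
// assumptions for the sake of the example, not the actual schema.
struct EdgeDoc {
    /// URL of the page containing the link.
    from_url: String,
    /// URL the link points to.
    to_url: String,
    /// Visible text of the link, useful as a relevance signal for the target page.
    anchor_text: String,
}

fn main() {
    let edge = EdgeDoc {
        from_url: "https://example.com/blog".to_string(),
        to_url: "https://stract.com".to_string(),
        anchor_text: "an open source search engine".to_string(),
    };
    println!("{} -> {} ({:?})", edge.from_url, edge.to_url, edge.anchor_text);
}
```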

## Webgraph Uses
The structure of the web can provide highly valuable information when determining the relevance of a page to a user's search query. PageRank, a centrality measure developed by Larry Page and Sergey Brin, was one of the primary reasons why Google provided much better search results than their competitors in the early days.
37 changes: 0 additions & 37 deletions docs/mkdocs.yml

This file was deleted.

10 changes: 0 additions & 10 deletions docs/src/assets/images/biglogo.svg

This file was deleted.
