Improve zimba+web-spell docs and release the modules under MIT (#242)
* improve web-spell docs

* improve zimba docs

* release zimba + web-spell under MIT
mikkeldenker authored Dec 2, 2024
1 parent 040d044 commit 01de7a1
Showing 16 changed files with 309 additions and 165 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -44,7 +44,7 @@ We recommend everyone to use the hosted version at [stract.com](https://stract.c

# ‍💼 License

Stract is offered under the terms defined under the [LICENSE.md](LICENSE.md) file.
Stract is offered under the terms defined under the [LICENSE.md](LICENSE.md) file unless otherwise specified in the relevant subdirectory.

# 📬 Contact

35 changes: 31 additions & 4 deletions assets/licenses.html
@@ -45,8 +45,8 @@ <h1>Third Party Licenses</h1>
<h2>Overview of licenses:</h2>
<ul class="licenses-overview">
<li><a href="#Apache-2.0">Apache License 2.0</a> (411)</li>
<li><a href="#MIT">MIT License</a> (191)</li>
<li><a href="#AGPL-3.0">GNU Affero General Public License v3.0</a> (10)</li>
<li><a href="#MIT">MIT License</a> (192)</li>
<li><a href="#AGPL-3.0">GNU Affero General Public License v3.0</a> (9)</li>
<li><a href="#BSD-3-Clause">BSD 3-Clause &quot;New&quot; or &quot;Revised&quot; License</a> (9)</li>
<li><a href="#MPL-2.0">Mozilla Public License 2.0</a> (8)</li>
<li><a href="#Unicode-3.0">Unicode License v3</a> (4)</li>
@@ -76,7 +76,6 @@ <h4>Used by:</h4>
<li><a href=" https://crates.io/crates/robotstxt ">robotstxt 0.1.0</a></li>
<li><a href=" https://crates.io/crates/simple_wal ">simple_wal 0.1.0</a></li>
<li><a href=" https://crates.io/crates/speedy_kv ">speedy_kv 0.1.0</a></li>
<li><a href=" https://crates.io/crates/web-spell ">web-spell 0.1.0</a></li>
</ul>
<pre class="license-text">GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
@@ -13176,8 +13175,36 @@ <h4>Used by:</h4>
<h3 id="MIT">MIT License</h3>
<h4>Used by:</h4>
<ul class="license-used-by">
<li><a href=" https://crates.io/crates/optics ">optics 0.1.0</a></li>
<li><a href=" https://crates.io/crates/web-spell ">web-spell 0.1.0</a></li>
<li><a href=" https://crates.io/crates/zimba ">zimba 0.1.0</a></li>
</ul>
<pre class="license-text">MIT License

Copyright (c) 2024 Stract ApS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the &quot;Software&quot;), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED &quot;AS IS&quot;, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.</pre>
</li>
<li class="license">
<h3 id="MIT">MIT License</h3>
<h4>Used by:</h4>
<ul class="license-used-by">
<li><a href=" https://crates.io/crates/optics ">optics 0.1.0</a></li>
<li><a href=" https://github.com/tokio-rs/async-stream ">async-stream-impl 0.3.6</a></li>
<li><a href=" https://github.com/tokio-rs/async-stream ">async-stream 0.3.6</a></li>
<li><a href=" https://github.com/durch/rust-s3 ">aws-creds 0.36.0</a></li>
2 changes: 1 addition & 1 deletion crates/web-spell/Cargo.toml
@@ -1,6 +1,6 @@
[package]
edition = "2021"
license = "AGPL-3.0"
license = "MIT"
name = "web-spell"
version = "0.1.0"

21 changes: 21 additions & 0 deletions crates/web-spell/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Stract ApS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
10 changes: 3 additions & 7 deletions crates/web-spell/README.md
@@ -1,13 +1,9 @@
# Web Spell

Automatic spelling correction from web data. It is based on the paper
Automatic spelling correction from web data. It is roughly based on the paper
[Using the Web for Language Independent Spellchecking and
Autocorrection](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36180.pdf)
from Google.

## Usage
```rust
let checker = SpellChecker::open("<path-to-model>", CorrectionConfig::default()).unwrap();
let correction = checker.correct("hwllo", Lang::Eng);
assert_eq!(correction.unwrap().terms, vec![CorrectionTerm::Corrected { orig: "hwllo".to_string(), correction: "hello".to_string() }]);
```
## License
Web Spell is licensed under the MIT license. See the [LICENSE](LICENSE) file for details.
16 changes: 0 additions & 16 deletions crates/web-spell/src/config.rs
@@ -1,19 +1,3 @@
// Stract is an open source web search engine.
// Copyright (C) 2024 Stract ApS
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.

fn misspelled_prob() -> f64 {
0.1
}
23 changes: 7 additions & 16 deletions crates/web-spell/src/error_model.rs
@@ -1,19 +1,3 @@
// Stract is an open source web search engine.
// Copyright (C) 2024 Stract ApS
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.

use super::Result;
use std::{
collections::HashMap,
@@ -55,6 +39,7 @@ pub enum ErrorType {
)]
pub struct ErrorSequence(Vec<ErrorType>);

/// Return all the possible ways to transform one string into another with a single edit.
pub fn possible_errors(a: &str, b: &str) -> Option<ErrorSequence> {
if a == b {
return None;
@@ -165,6 +150,7 @@ impl From<StoredErrorModel> for ErrorModel {
}
}

/// A model for the probability of an error sequence.
#[derive(Debug)]
pub struct ErrorModel {
errors: HashMap<ErrorSequence, u64>,
@@ -185,6 +171,7 @@ impl ErrorModel {
}
}

/// Save the error model to disk.
pub fn save<P: AsRef<Path>>(self, path: P) -> Result<()> {
let file = OpenOptions::new()
.write(true)
@@ -199,6 +186,7 @@ impl ErrorModel {
Ok(())
}

/// Open the error model from disk.
pub fn open<P: AsRef<Path>>(path: P) -> Result<Self> {
let file = OpenOptions::new().read(true).open(path)?;

@@ -209,18 +197,21 @@ impl ErrorModel {
Ok(stored.into())
}

/// Add an error sequence to the error model.
pub fn add(&mut self, a: &str, b: &str) {
if let Some(errors) = possible_errors(a, b) {
*self.errors.entry(errors).or_insert(0) += 1;
self.total += 1;
}
}

/// Get the probability of an error sequence.
pub fn prob(&self, error: &ErrorSequence) -> f64 {
let count = self.errors.get(error).unwrap_or(&0);
*count as f64 / self.total as f64
}

/// Get the log probability of an error sequence.
pub fn log_prob(&self, error: &ErrorSequence) -> f64 {
match self.errors.get(error) {
Some(count) => (*count as f64).log2() - ((self.total + 1) as f64).log2(),
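
The counting scheme behind `add`, `prob` and `log_prob` can be sketched in isolation. This is a minimal illustration, not the crate's implementation: `CountModel` and its string keys are hypothetical stand-ins for `ErrorModel` and `ErrorSequence`. Because the branch for unseen errors is not visible in this hunk, the sketch returns `None` for them rather than guessing a fallback value, and it guards against an empty model, which the original `prob` does not appear to do.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's error model: one counter per error label.
#[derive(Default)]
pub struct CountModel {
    counts: HashMap<String, u64>,
    total: u64,
}

impl CountModel {
    /// Record one observation of `error`.
    pub fn add(&mut self, error: &str) {
        *self.counts.entry(error.to_string()).or_insert(0) += 1;
        self.total += 1;
    }

    /// Relative frequency of `error`; 0.0 for unseen errors or an empty model.
    pub fn prob(&self, error: &str) -> f64 {
        if self.total == 0 {
            return 0.0;
        }
        let count = self.counts.get(error).copied().unwrap_or(0);
        count as f64 / self.total as f64
    }

    /// log2(count) - log2(total + 1) for seen errors, mirroring the visible branch.
    /// Unseen errors yield `None` (an assumption made for this sketch).
    pub fn log_prob(&self, error: &str) -> Option<f64> {
        self.counts
            .get(error)
            .map(|count| (*count as f64).log2() - ((self.total + 1) as f64).log2())
    }
}
```

Working in log space keeps products of many small probabilities from underflowing once several error terms are combined.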
68 changes: 46 additions & 22 deletions crates/web-spell/src/lib.rs
@@ -1,22 +1,24 @@
// Stract is an open source web search engine.
// Copyright (C) 2024 Stract ApS
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as
// published by the Free Software Foundation, either version 3 of the
// License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.

//! This module contains the spell checker. It is based on the paper
//! This module contains the spell checker. It is roughly based on the paper
//! http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36180.pdf
//! from Google.
//!
//! # Usage
//!
//! ```rust
//! # use std::path::Path;
//! # use web_spell::{CorrectionConfig, SpellChecker, Lang};
//!
//! # let path = Path::new("../data/web_spell/checker");
//!
//! # if !path.exists() {
//! # return;
//! # }
//!
//! let checker = SpellChecker::open("<path-to-model>", CorrectionConfig::default());
//! # let checker = SpellChecker::open(path, CorrectionConfig::default());
//! let correction = checker.unwrap().correct("hwllo", &Lang::Eng);
//! ```
mod config;
mod error_model;
pub mod spell_checker;
@@ -26,6 +28,7 @@ mod trainer;

pub use config::CorrectionConfig;
pub use error_model::ErrorModel;
pub use spell_checker::Lang;
pub use spell_checker::SpellChecker;
pub use stupid_backoff::StupidBackoff;
pub use term_freqs::TermDict;
@@ -108,24 +111,34 @@ impl From<Correction> for String {
}

impl Correction {
/// Create an empty correction.
pub fn empty(original: String) -> Self {
Self {
original,
terms: Vec::new(),
}
}

/// Push a term to the correction.
pub fn push(&mut self, term: CorrectionTerm) {
self.terms.push(term);
}

/// Check whether none of the terms were corrected.
pub fn is_all_orig(&self) -> bool {
self.terms
.iter()
.all(|term| matches!(term, CorrectionTerm::NotCorrected(_)))
}
}
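
As a rough illustration of how these three methods fit together, the sketch below mirrors the types locally so it compiles on its own. The `String` payload of `NotCorrected` and the `corrected_text` helper are assumptions made for this example; the real crate converts a `Correction` through `From<Correction> for String`, whose body is not shown in this diff.

```rust
/// Local mirror of the crate's term type. The payload of `NotCorrected`
/// is assumed to be a `String`.
#[derive(Debug, PartialEq)]
pub enum CorrectionTerm {
    Corrected { orig: String, correction: String },
    NotCorrected(String),
}

/// Local mirror of the crate's `Correction`.
pub struct Correction {
    pub original: String,
    pub terms: Vec<CorrectionTerm>,
}

impl Correction {
    /// Create a correction with no terms yet.
    pub fn empty(original: String) -> Self {
        Self { original, terms: Vec::new() }
    }

    /// Append one term.
    pub fn push(&mut self, term: CorrectionTerm) {
        self.terms.push(term);
    }

    /// True when no term was corrected (also true when there are no terms).
    pub fn is_all_orig(&self) -> bool {
        self.terms
            .iter()
            .all(|term| matches!(term, CorrectionTerm::NotCorrected(_)))
    }
}

/// Hypothetical helper: join the terms with spaces, preferring corrections.
pub fn corrected_text(correction: &Correction) -> String {
    correction
        .terms
        .iter()
        .map(|term| match term {
            CorrectionTerm::Corrected { correction, .. } => correction.as_str(),
            CorrectionTerm::NotCorrected(orig) => orig.as_str(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

A caller would typically check `is_all_orig` first and only surface a suggestion when it returns `false`.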

/// Split text into sentence ranges by detecting common sentence boundaries like periods, exclamation marks,
/// question marks and newlines. Returns a Vec of byte ranges for each detected sentence.
///
/// The splitting is optimized for performance and simplicity rather than perfect accuracy. It handles
/// common cases like abbreviations, URLs, ellipses and whitespace trimming.
///
/// Note that this is a heuristic approach and may not handle all edge cases correctly.
pub fn sentence_ranges(text: &str) -> Vec<Range<usize>> {
let skip = ["mr.", "ms.", "dr."];

@@ -178,6 +191,7 @@ pub fn sentence_ranges(text: &str) -> Vec<Range<usize>> {
res
}
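
Most of the body of `sentence_ranges` is collapsed in this diff, so the following is only a simplified sketch of the heuristic the doc comment describes: split on `.`, `!`, `?` and newlines, skip the abbreviations listed in `skip`, and trim whitespace from each range. The names `simple_sentence_ranges` and `push_trimmed` are hypothetical, and unlike the real function this version does not handle URLs or ellipses.

```rust
use std::ops::Range;

/// Simplified sentence splitter returning byte ranges into `text`.
pub fn simple_sentence_ranges(text: &str) -> Vec<Range<usize>> {
    let skip = ["mr.", "ms.", "dr."];
    let mut res = Vec::new();
    let mut start = 0;

    for (i, c) in text.char_indices() {
        if !matches!(c, '.' | '!' | '?' | '\n') {
            continue;
        }
        let end = i + c.len_utf8();

        // A period that completes a known abbreviation does not end the sentence.
        let lower = text[start..end].to_lowercase();
        if c == '.' && skip.iter().any(|abbr| lower.ends_with(abbr)) {
            continue;
        }

        push_trimmed(text, start..end, &mut res);
        start = end;
    }

    // Whatever follows the last boundary is the final sentence.
    push_trimmed(text, start..text.len(), &mut res);
    res
}

/// Push `range` with surrounding whitespace removed; skip empty ranges.
fn push_trimmed(text: &str, range: Range<usize>, out: &mut Vec<Range<usize>>) {
    let slice = &text[range.clone()];
    let trimmed = slice.trim();
    if trimmed.is_empty() {
        return;
    }
    let lead = slice.len() - slice.trim_start().len();
    let begin = range.start + lead;
    out.push(begin..begin + trimmed.len());
}
```

Returning byte ranges instead of owned strings lets the caller slice the original text without copying it.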

/// Tokenize text into words.
pub fn tokenize(text: &str) -> Vec<String> {
text.to_lowercase()
.split_whitespace()
@@ -188,11 +202,20 @@ pub fn tokenize(text: &str) -> Vec<String> {
.map(|s| s.to_string())
.collect()
}
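
The middle of the `tokenize` chain is collapsed in this diff. The sketch below fills that gap with an assumption, stripping non-alphanumeric characters from the edges of each token and dropping tokens that end up empty, so it may differ from what the crate actually does. `simple_tokenize` is a hypothetical name.

```rust
/// Lowercase the text, split on whitespace and clean each token.
pub fn simple_tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split_whitespace()
        // Assumption: punctuation is trimmed from token edges only.
        .map(|s| s.trim_matches(|c: char| !c.is_alphanumeric()))
        // Tokens made up entirely of punctuation become empty and are dropped.
        .filter(|s| !s.is_empty())
        .map(|s| s.to_string())
        .collect()
}
```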
pub struct MergePointer<'a> {
pub term: String,
pub value: u64,
pub stream: fst::map::Stream<'a>,
pub is_finished: bool,

/// A pointer for merging two term streams.
struct MergePointer<'a> {
/// The current head of the stream.
pub(crate) term: String,

/// The current head value.
pub(crate) value: u64,

/// The stream to merge.
pub(crate) stream: fst::map::Stream<'a>,

/// Whether the stream is finished.
pub(crate) is_finished: bool,
}

impl MergePointer<'_> {
@@ -234,6 +257,7 @@ impl PartialEq for MergePointer<'_> {

impl Eq for MergePointer<'_> {}
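
`MergePointer` holds the current head of a sorted term stream, and its ordering implementations suggest that the pointers are compared while merging. The general technique can be sketched with plain vectors in place of `fst::map::Stream`. Using a `BinaryHeap` and summing the values of equal terms are assumptions made for this illustration, and `merge_sorted` is a hypothetical name.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several term-sorted `(term, value)` lists into one sorted list,
/// summing the values of terms that occur in more than one list.
pub fn merge_sorted(streams: Vec<Vec<(String, u64)>>) -> Vec<(String, u64)> {
    let mut iters: Vec<_> = streams.into_iter().map(|s| s.into_iter()).collect();
    let mut heap = BinaryHeap::new();

    // Seed the heap with the head of every stream. `Reverse` turns the
    // max-heap into a min-heap so the smallest term is popped first.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some((term, value)) = it.next() {
            heap.push(Reverse((term, idx, value)));
        }
    }

    let mut out: Vec<(String, u64)> = Vec::new();
    while let Some(Reverse((term, idx, value))) = heap.pop() {
        match out.last_mut() {
            // Same term as the previous output entry: accumulate.
            Some((last, total)) if *last == term => *total += value,
            _ => out.push((term, value)),
        }

        // Advance the stream the popped head came from.
        if let Some((next_term, next_value)) = iters[idx].next() {
            heap.push(Reverse((next_term, idx, next_value)));
        }
    }
    out
}
```

Only one head per stream is held in memory at a time, which is what makes this approach attractive for merging large on-disk dictionaries.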

/// Get the next character boundary after or at the given index.
fn ceil_char_boundary(str: &str, index: usize) -> usize {
let mut res = index;
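
The rest of this function is collapsed in the diff. One way to compute such a boundary with the standard library is sketched below, relying on `str::is_char_boundary`. `ceil_boundary` is a hypothetical name, and clamping out-of-range indices to the string length is an assumption of this sketch.

```rust
/// Smallest byte index >= `index` that lies on a UTF-8 character boundary.
pub fn ceil_boundary(s: &str, index: usize) -> usize {
    // Assumption: indices past the end are clamped to `s.len()`.
    let mut res = index.min(s.len());

    // `s.len()` is always a boundary, so this loop terminates.
    while !s.is_char_boundary(res) {
        res += 1;
    }
    res
}
```

Slicing a `&str` at a byte index inside a multi-byte character panics, so rounding up to a boundary first keeps range-based code such as `sentence_ranges` safe on non-ASCII input.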
