
input-method

First-two-char input method using transformer-based language model and n-gram model. Given the first two characters of each preceding word and of the current word ("fi", "tw", "ch", "in", "me"), the model predicts the corresponding word ("method").
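
A minimal sketch of that conversion (illustrative only, not the repository's actual preprocessing code; single-letter words are padded with a space to match the "a " vocabulary entry described under Features):

def to_first_two_chars(sentence: str) -> list[str]:
    # Convert a sentence to first-two-char prefixes, e.g.
    # "first two char input method" -> ['fi', 'tw', 'ch', 'in', 'me']
    prefixes = []
    for word in sentence.lower().split():
        word = "".join(c for c in word if c.isalpha())  # strip punctuation
        if not word:
            continue
        prefixes.append(word[:2] if len(word) >= 2 else word + " ")
    return prefixes

print(to_first_two_chars("first two char input method"))  # ['fi', 'tw', 'ch', 'in', 'me']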

How to use

Data preparation

  • Download the Shakespeare dataset
$ wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt -P data/shakespeare

N-gram model

  • Train and evaluate the n-gram model on the Shakespeare dataset (a conceptual sketch follows the command below)
$ python3 src/input_method/train-ngram.py --ngram 2 --input "My name is Taro. I am a student."
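
As a conceptual sketch (not the actual train-ngram.py implementation), a 2-gram model can be viewed as picking the most frequent word that followed the previous word and starts with the current two characters:

from collections import Counter, defaultdict

def train_bigram(corpus_words):
    # counts[previous word][two-char prefix] -> Counter over full words
    counts = defaultdict(lambda: defaultdict(Counter))
    for prev, word in zip(corpus_words, corpus_words[1:]):
        counts[prev][word[:2]][word] += 1
    return counts

def predict(counts, prev_word, two_chars):
    candidates = counts[prev_word][two_chars]
    if not candidates:
        return two_chars  # fall back to echoing the prefix
    return candidates.most_common(1)[0][0]

corpus = "to be or not to be that is the question".split()
model = train_bigram(corpus)
print(predict(model, "to", "be"))  # -> 'be'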

Transformer-based language model

  • Train the NanoLM model on the Shakespeare dataset
$ python3 src/input_method/train.py --data_name "shakespeare" --batch_size 128 --n_iterations 5000 --n_freq_eval 100 --dropout_rate 0.1 --learning_rate 0.001 --num_layers 8 --embed_size 256  --head_size 32 --num_heads 8 --block_size 4
  • Evaluate the NanoLM model on the Shakespeare dataset
$ python3 src/input_method/evaluate.py --data_name "shakespeare" --block_size 4
  • Sequence to sequence prediction
$ python3 src/input_method/seq_to_seq.py --data_name "shakespeare" --block_size 16 --input "My name is Taro. I am a student."

Prompt: My name is Taro. I am a student.
Output: my name is taken i am a strange

This program internally converts the prompt to the first-two-char input format and then sequentially predicts the corresponding words with the trained NanoLM model.
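
A hypothetical sketch of that decoding loop (predict_word stands in for a call to the trained NanoLM and is not a function defined in this repository):

def decode(prompt: str, predict_word, block_size: int = 16) -> str:
    # Convert the prompt to first-two-char prefixes, then predict word by word.
    prefixes = []
    for w in prompt.lower().split():
        w = "".join(c for c in w if c.isalpha())
        if w:
            prefixes.append(w[:2] if len(w) >= 2 else w + " ")
    words = []
    for i in range(len(prefixes)):
        # The model sees at most block_size prefixes of left context.
        context = prefixes[max(0, i + 1 - block_size): i + 1]
        words.append(predict_word(context))
    return " ".join(words)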

  • Prepare and train on the WikiText-2 dataset
$ python3 src/input_method/prepare_wikitext.py
$ python3 src/input_method/train.py --data_name "wikitext" --batch_size 1024 --n_iterations 1000 --n_freq_eval 100 --dropout_rate 0.0 --learning_rate 0.001 --num_layers 8 --embed_size 256  --head_size 32 --num_heads 8 --block_size 4

Features

  • Two tokenizers are used
    • TwoCharTokenizer: vocab = {"a ", ..., "z ", "aa", ..., "zz"}
      • The vocab size is 26 + 26 * 26 = 702 (see the vocabulary sketch after this list)
    • WordTokenizer: vocab = {"a", ..., "word", ...}
      • The vocab size depends on the dataset
  • Predict the corresponding word given the previous and current two characters (e.g., P("method" | ("a ", "tw", "ch", "in", "me"))) using the transformer-based language model
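
For reference, the TwoCharTokenizer vocabulary size quoted above can be checked with a short snippet (illustrative, not the repository's tokenizer class):

import string

letters = string.ascii_lowercase
vocab = [c + " " for c in letters] + [a + b for a in letters for b in letters]
print(len(vocab))  # 26 + 26 * 26 = 702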

Results

  • N-gram model results (figure: n-gram)

  • Transformer-based language model results (figure: transformer)

The 2-gram model performed best among all models.

Draft Paper

You can access the draft paper about this project here.

Citation

@article{sugiura2024input,
  title   = "First-two-char Input Method with N-gram Model and
Transformer-based Language Model",
  author  = "Issa, Sugiura",
  journal = "github.com",
  year    = "2024",
  month   = "Aug",
  url     = "https://github.com/speed1313/input-method"
}

