Two algorithms are currently used to generate text predictions and suggestions.
* Dictionary-based algorithm (same one as in AOSP Keyboard and its derivatives)
* Transformer LM (experimental, can be toggled off)
## Dictionary algorithm
The dictionary-based algorithm looks up words in a dictionary, and builds a bigram model over time. It's quick, simple, and battle-tested. There's not much to say except that it works and you've probably used it at some point.
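For intuition, a bigram model just counts which word tends to follow which, and suggests the most frequent followers. A toy sketch (not the actual AOSP implementation, which is considerably more involved):

```python
from collections import Counter, defaultdict

# Toy bigram predictor: count word pairs from typed history and suggest the
# most frequent followers of the previous word.
class BigramModel:
    def __init__(self):
        self.follows = defaultdict(Counter)

    def learn(self, text: str):
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.follows[prev][nxt] += 1

    def suggest(self, prev_word: str, n: int = 3):
        return [w for w, _ in self.follows[prev_word.lower()].most_common(n)]

model = BigramModel()
model.learn("see you tomorrow")
model.learn("see you soon")
print(model.suggest("you"))  # ['tomorrow', 'soon']
```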
## Transformer algorithm
The transformer LM is the new algorithm being developed. Instead of a bigram model, it uses a transformer language model based on the Llama architecture. It currently has some strengths and weaknesses compared to the dictionary-based algorithm:
* It's much better at predicting next words, and taking into account surrounding context
* It can be much better at correcting very rough spelling
* It's currently unable to learn new words instantly the way the dictionary-based algorithm can (technically it could through prompting, but there are prompt size limits)
* It's not easily adaptable to non-QWERTY layouts, and is staying QWERTY-only for now
* It's still very experimental and we're only supporting English for now, but we're keeping things open to allow others to make their own models
To mitigate some of these weaknesses, the dictionary algorithm is run in parallel with the transformer and their results are merged; the merging weight can be adjusted in the Advanced Parameters menu.
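As an illustration of the idea, a weighted merge of per-word scores might look like the sketch below. The function and the exact formula are hypothetical; the real merging happens inside the keyboard.

```python
# Hypothetical sketch of blending suggestion scores from both algorithms.
def merge_suggestions(dict_scores, transformer_scores, transformer_weight=0.5):
    """Combine per-word scores from the two algorithms into one ranking."""
    merged = {}
    for word in set(dict_scores) | set(transformer_scores):
        d = dict_scores.get(word, 0.0)
        t = transformer_scores.get(word, 0.0)
        merged[word] = (1.0 - transformer_weight) * d + transformer_weight * t
    # Highest combined score first
    return sorted(merged, key=merged.get, reverse=True)

print(merge_suggestions({"their": 0.6}, {"there": 0.7, "their": 0.4}, 0.5))
# -> ['their', 'there']
```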
While we use the Llama architecture, we don't use any of the public models like Llama 2 mainly because they're all too big (billions of parameters). Our current model has 36 million parameters. We also use a custom non-conventional tokenizer with space suffixes instead of prefixes to make word-based inference more efficient. For autocorrect, we use a specific prompt format. You can read more about all of this in the Model section.
(Note: finetuning is currently disabled by default because its effectiveness has not been properly evaluated; the following applies if you enable it.) As you type with the keyboard, your typed data is saved locally to temporary storage for later finetuning of the transformer LM. Finetuning is scheduled to run at least once every ~20 hours when your device is idle, plugged in, and there is enough data. Under the hood, finetuning trains a LoRA adapter locally on your device, merges it with the original model, and saves the result. While the original data is deleted after finetuning, the finetuned model's weights may still contain that data in some form, so we recommend avoiding sharing the finetuned model.
You can import and export model files for backup, transferring finetuned models between devices, or importing custom/third-party models. If you want to make your own models, check out the Model Creation section. The files are in .gguf format but with extra metadata, defined in the GGUF Metadata section.
Under the hood, the model is run using the llama.cpp library.
# FAQ
## Can I just import any gguf model like Llama 2 7B, Mistral 7B, etc?
No, those are way too big to reasonably run on a keyboard on a phone. They also wouldn't understand the autocorrect system we use. If you try to just import one of these models, you'll get an error message.
## What models can I import?
TODO: Maintain a GitHub repository with a list of community-made models after launch
## How can I tell which algorithm is being used?
In FUTO Keyboard, if the transformer LM was used to generate a suggestion, you will see a thin line under it. If there is no thin line, it was generated by the dictionary algorithm.
## I'm having issues with the predictions
This is still quite experimental and things may break. If it's affecting your use and you'd prefer the old algorithm, you can disable the transformer LM in the settings. Please let us know your experiences in the FUTO chat #Keyboard stream!
## What's this privacy warning about?
If you try to export a finetuned model, you'll see a privacy warning telling you to be cautious about sharing the file with others. By its nature, finetuning can effectively compress and store the training data in the model's weights, especially if there's overfitting due to small amounts of data. Because of this, it's conceivable that somebody in possession of the file could reconstruct sentences you've typed, so it's best to keep the model private.
## What languages are supported?
The default transformer LM model only supports English. The dictionary-based algorithm supports English and a few other languages. For all other languages, there are no predictions out of the box, but the user history dictionary will learn what you type.
TODO: Add more default dictionaries
## Can you support (XYZ) language for transformer LM?
We're still working on nailing down English!
If you want, you can train your own model for your own language. This is currently somewhat complicated, but we're working on scripts to make it easier.
## How does voice input work?
It runs [OpenAI Whisper](https://github.com/openai/whisper) on your phone.
## How much of my typing data, voice input recordings, etc are sent to the cloud?
0
## I know this stuff well and I have a super cool insight that will make everything work 10x better
Let us know your idea!
## My question isn't answered
Let us know in the FUTO chat!
# Model
## Model & Architecture
The model architecture is based on Facebook’s Llama. Currently, Facebook offers pretrained Llama checkpoints ranging in size from 7B to 65B parameters. We do not make use of these because they are too big for realtime use on mobile devices.
We instead use a ~36M parameter model initialized using the following config:
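(A 15008-token vocabulary and a hidden size of 512 are implied by the parameter arithmetic in the Model Creation section; the layer count, head count, and intermediate size in the sketch below are assumptions chosen to land near ~36M total parameters.)

```python
from transformers import LlamaConfig

# Sketch of a ~36M parameter Llama-style config. vocab_size and hidden_size
# follow from the embedding arithmetic in Model Creation; the other values
# are assumed for illustration.
config = LlamaConfig(
    vocab_size=15008,
    hidden_size=512,
    intermediate_size=1536,
    num_hidden_layers=6,
    num_attention_heads=8,
    tie_word_embeddings=False,  # embeddings are counted twice (input + output)
)
```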
The model was pretrained on a few billion tokens of SlimPajama.
## Tokenizer
We use a sentencepiece tokenizer with `treat_whitespace_as_suffix=true`. This is somewhat non-standard and causes issues with some tokenizer implementations, so we use sentencepiece from pretraining all the way to inference in the app (by embedding the sentencepiece tokenizer in the gguf).
The reasoning behind using whitespace as suffix in this case is as follows - when doing autoregressive decoding for next-word prediction, we need to know when the word has finished being predicted and when to stop decoding.
The issue with prefixed whitespace is that it requires looking forward an extra token to know when a word is finished. This is demonstrated with the following examples.
With prefixed whitespace, decoding a single word step-by-step would go something like this:
1. `▁y` - at this point we only know the word starts with `y`, or maybe the word is just "`y`"?
2. `es` - at this point we don't know if the word is just [`▁y`, `es`], or [`▁y`, `es`, `terday`], or [`▁y`, `es`, `terday`, `'s`]
3. `▁and` - NOW we know it's [`▁y`, `es`]
On the other hand, suffixed whitespace lets us save 1 step in the best case:
1. `y` - at this point we only know the word starts with y, or maybe the word is just "y" if the next token is a bare `▁`?
2. `es▁` - now we know it's [`y`, `es▁`]
And in the worst case, suffixed whitespace makes no difference:
1. `y`
2. `eah`
3. `▁`
(Note: the `▁` character represents a space.)
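In practice this gives a simple stop rule during decoding: a word is complete as soon as a sampled piece ends in `▁`, with no need to peek at the next token. A toy sketch, where the generation callback is a stand-in for one autoregressive decoding step:

```python
# Sketch of the word-boundary stop rule enabled by suffixed whitespace.
def predict_next_word(generate_next_piece, max_pieces=8):
    pieces = []
    for _ in range(max_pieces):
        piece = generate_next_piece(pieces)
        pieces.append(piece)
        if piece.endswith("\u2581"):  # "▁" marks the end of the word
            break
    return "".join(pieces).replace("\u2581", "")

# Toy stand-in for the model: emits "y", then "es▁".
fake_steps = iter(["y", "es\u2581", "and\u2581"])
print(predict_next_word(lambda pieces: next(fake_steps)))  # -> "yes"
```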
## Autocorrect
For correction, we define the following misspelling-correction format:
`This is some <XBU>txet<XBC>text <XEC>`
Here is a breakdown of the token meanings:
- `<XBU>` - signifying the beginning of user input
  - Following `<XBU>` is the user input of a misspelled word. This is actually not tokenized as normal text, but as `<CHAR_{}>` tokens detailed later on.
- `<XBC>` - signifying the beginning of the correction
  - Following `<XBC>` is the corrected word, tokenized normally, with a whitespace at the end.
- `<XEC>` - end of correction
We also define tokens `<CHAR_A>` to `<CHAR_Z>` which signify individual alphabetic characters on a keyboard, and the misspelled user input is tokenized by using these tokens. Thus, the above example actually looks more like:
`This is some <XBU><CHAR_T><CHAR_X><CHAR_E><CHAR_T><XBC>text <XEC>`
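At inference time, the keyboard prompts the model with everything up to `<XBC>` and lets it complete the corrected word and the closing `<XEC>`. A sketch of building that prompt as a plain string (the function name is hypothetical; the real implementation works at the token level inside the app):

```python
# Hypothetical sketch of building the autocorrect prompt.
def build_correction_prompt(preceding_text: str, typed_word: str) -> str:
    # Each alphabetic character of the user input becomes a <CHAR_X> token.
    char_tokens = "".join(f"<CHAR_{c.upper()}>" for c in typed_word if c.isalpha())
    # The model is asked to continue after <XBC> with the corrected word,
    # which ends with a trailing space followed by <XEC>.
    return f"{preceding_text} <XBU>{char_tokens}<XBC>"

print(build_correction_prompt("This is some", "txet"))
# -> "This is some <XBU><CHAR_T><CHAR_X><CHAR_E><CHAR_T><XBC>"
```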
Tokenizing in this way appears to be a vital step for achieving usable results with smaller language models.
The misspelled words are synthetically generated by an algorithm that applies random offsets to keys, changes certain letters, transposes characters, etc. There's likely some room for improvement in this algorithm, as typing is more complicated than random offsets.
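A toy sketch of this kind of synthetic corruption (the neighbour map and probabilities are illustrative, not the actual generation code):

```python
import random

# Illustrative QWERTY neighbour map (only a few keys shown for brevity).
NEIGHBOURS = {
    "t": "ry", "e": "wr", "x": "zc", "a": "qs", "o": "ip", "n": "bm",
}

def misspell(word: str, p_sub: float = 0.2, p_swap: float = 0.1) -> str:
    """Corrupt a word with neighbour-key substitutions and transpositions."""
    chars = list(word)
    for i, c in enumerate(chars):
        if c in NEIGHBOURS and random.random() < p_sub:
            chars[i] = random.choice(NEIGHBOURS[c])
    if len(chars) > 1 and random.random() < p_swap:
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(misspell("text"))  # e.g. "txet" or "tezt"
```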
Following pretraining, the model is finetuned on individual correction examples without any surrounding context (training examples all follow just the format `<XBU>[misspelled word]<XBC>[word]<XEC>`), and with less bias towards common words (if a word appears `N` times, it appears `log(N)` times during this training). This appears to be a vital step for convincing smaller models to learn how to correct words instead of just guessing common words based on context.
Following the individual correction example training, the model is further finetuned on text where roughly 1/3 of words have been picked randomly and replaced with a misspelling-correction pair; this develops its ability to correct based on context.
Following the SlimPajama correction training, the model is finetuned on a much smaller corpus representing Internet first-person speech, with the same 1/3 misspelling augmentation. This appears to be an important step for the model to comprehend that sentences can start with “I’m”, contain certain slang/lingo, etc.
### More experimental techniques
Limiting model input to individual characters limits our precision to entire keys, and does not allow us to represent nuance such as tapping between K and L. An experimental workaround for this is to pass interpolated input embeddings in such cases (e.g. `0.5 * <CHAR_K> + 0.5 * <CHAR_L>`). Preliminary evaluation shows a minor increase in accuracy, even without training the model on these mixed embeddings.
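A sketch of the interpolation itself, assuming a PyTorch embedding table; the token ids and blend weight here are placeholders:

```python
import torch

# Sketch of interpolated character embeddings for a tap between two keys.
def mixed_char_embedding(embedding: torch.nn.Embedding,
                         id_a: int, id_b: int, weight_a: float) -> torch.Tensor:
    """Blend two key embeddings, e.g. 0.5 * <CHAR_K> + 0.5 * <CHAR_L>."""
    emb_a = embedding(torch.tensor(id_a))
    emb_b = embedding(torch.tensor(id_b))
    return weight_a * emb_a + (1.0 - weight_a) * emb_b

# Toy usage: a random 300-token embedding table with hidden size 512.
table = torch.nn.Embedding(300, 512)
vec = mixed_char_embedding(table, id_a=41, id_b=42, weight_a=0.5)
print(vec.shape)  # torch.Size([512])
```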
Swipe typing could be implemented by sampling a swipe curve, converting the positions to these mixed embeddings and including a special control token to tell the model this is swipe typing. This was tested but the performance is not yet very good, possibly due to synthetic data not being realistic enough.
# Model Creation
We recommend reading through the Model section to understand how things work. You can initialize a model like so:
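A minimal sketch using HuggingFace Transformers (the vocabulary size and hidden size follow from the parameter arithmetic below; the layer count, head count, and intermediate size are assumed values):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Minimal sketch: build a fresh ~36M parameter model from a small
# Llama-style config and verify the parameter count.
config = LlamaConfig(
    vocab_size=15008,
    hidden_size=512,
    intermediate_size=1536,
    num_hidden_layers=6,
    num_attention_heads=8,
    tie_word_embeddings=False,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```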
You can use our tokenizer, but it will likely not work well for languages other than English. If you're making your own tokenizer, use sentencepiece and make sure you enable `treat_whitespace_as_suffix`.
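A minimal sentencepiece training sketch (the corpus path and model prefix are placeholders; the control tokens follow the autocorrect format described in the Model section):

```python
import sentencepiece as spm

# Sketch: train a SentencePiece tokenizer with suffixed whitespace.
control_tokens = ["<XBU>", "<XBC>", "<XEC>"] + [
    f"<CHAR_{c}>" for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
]

spm.SentencePieceTrainer.train(
    input="corpus.txt",               # placeholder corpus path
    model_prefix="keyboard_tokenizer",
    vocab_size=15008,
    treat_whitespace_as_suffix=True,
    user_defined_symbols=control_tokens,
)

sp = spm.SentencePieceProcessor(model_file="keyboard_tokenizer.model")
# Pieces carry a trailing "▁"; the exact segmentation depends on the trained vocab.
print(sp.encode("yes and", out_type=str))
```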
The tokenizer size should be roughly 15k or smaller. This is because token embeddings take up `len(tokenizer) * hidden_size * 2` parameters. With the above config and 15008 tokens, this is already 15.3 million parameters, which is 43% of the model. We haven't evaluated this, but it can be assumed that dedicating more of the parameters to the layers would bring better performance.
The model should be pretrained on text from your language. A few billion tokens or so may suffice. With a model of this size, this can be quite doable on a single consumer graphics card. You will then need to finetune it on the described misspelling format.
TODO: Training notebooks
# GGUF Metadata
The following fields are defined:
* string `keyboardlm.languages` shall contain a space-separated list of one or more ISO language codes
* uint32 `keyboardlm.finetuning_count` shall contain the number of times the model has been finetuned, starting from 0 for a base model
* string `keyboardlm.history` shall contain lines of arbitrary text starting with a date, explaining the actions taken on the model in human-readable language
* string `keyboardlm.features` shall contain a space-separated list of one or more features - defined in the Feature Definitions section
* string `keyboardlm.ext_tokenizer_type` shall contain a tokenizer type, currently this must be SentencePiece
* array[uint8] `keyboardlm.ext_tokenizer_data` shall contain tokenizer data
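For illustration, these fields could be written with the `gguf` Python package that ships with llama.cpp. The writer API differs between versions, so treat this as an outline of the key/value calls rather than a complete conversion script:

```python
import gguf  # gguf-py, distributed with llama.cpp

# Outline of setting the keyboardlm.* metadata on a model file.
# All values and file paths here are placeholders.
writer = gguf.GGUFWriter("model.gguf", arch="llama")

writer.add_string("keyboardlm.languages", "en")
writer.add_uint32("keyboardlm.finetuning_count", 0)
writer.add_string("keyboardlm.history", "2024-01-01 pretrained base model")
writer.add_string("keyboardlm.features",
                  "base_v1 inverted_space xbu_char_autocorrect_v1")
writer.add_string("keyboardlm.ext_tokenizer_type", "SentencePiece")
with open("keyboard_tokenizer.model", "rb") as f:
    # Stored as an array of bytes; depending on the gguf-py version the
    # element type may need to be specified explicitly to get uint8.
    writer.add_array("keyboardlm.ext_tokenizer_data", list(f.read()))

# ... standard llama.cpp metadata and the model tensors are added here ...
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()
```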
# Feature Definitions
In FUTO Keyboard, features must be supported by the app for the model to be importable. If the version of the app doesn't support a feature, an error will be shown. The only exception is optional features, which should start with `_` or `opt_`.
## `base_v1`
To be considered a valid model, this must be set.
## `inverted_space`
The tokenizer shall invert the position of the space in pieces by placing it at the end instead of the start, and the model shall understand this format. This is mostly required by the keyboard; the rationale is explained in the Tokenizer section above.
Normal space: "_a", "_the", "_of"
Inverted space: "a_", "the_", "of_"
## `xbu_char_autocorrect_v1`
Requires feature `inverted_space`
The tokenizer shall contain the following tokens, and the model shall understand them as follows:
* `<XBU>` - beginning of user input, followed by `<CHAR_*>` tokens for the typed characters
* `<CHAR_A>` through `<CHAR_Z>` - individual alphabetic characters of the user input
* `<XBC>` - beginning of correction, followed by the corrected word with trailing whitespace
* `<XEC>` - end of correction
An example of a correction:
```
This is an <XBU><CHAR_E><CHAR_X><CHAR_A><CHAR_M><CHAR_P><CHAR_L><CHAR_R><XBC>example <XEC>
```
The application shall use this feature to prompt the model with the preceding text and the current word, and show the top 3 predictions after `<XBC>` to the user.
## `lora_finetunable_v1`
Requires feature `xbu_char_autocorrect_v1`
If set, the model will be finetuned on user data occasionally.
## `xc0_swipe_typing_v1`
Requires feature `xbu_char_autocorrect_v1`
The tokenizer shall in addition contain the token `<XC0>`, which is inserted after `<XBU>`. When this token is present, the model shall understand that the following user input consists of points sampled from a swipe typing gesture, rather than key presses.
TODO: The exact sampling method
## `char_embed_mixing_v1`
Requires feature `xbu_char_autocorrect_v1`
When set, the application shall interpolate character embeddings based on user tap position.
Example: User taps exactly between Q and W. The model shall receive an input embedding of `0.5 * <CHAR_Q> + 0.5 * <CHAR_W>` instead of just the token.
TODO: Exact implementation details
## `experiment_linear_208_209_210`
Requires feature `xbu_char_autocorrect_v1`, conflicts with feature `char_embed_mixing_v1`
When set, the embeddings for tokens 208 and 209 describe the linear encoder layer, and token 210 the bias.
# Credits
Thanks to the following projects that have made this possible:
* [llama.cpp](https://github.com/ggerganov/llama.cpp) is used for efficient on-device inference and finetuning
* [OpenAI Whisper](https://github.com/openai/whisper) is used for voice input in the app
* [HuggingFace Transformers, Tokenizers and other libraries](https://huggingface.co/docs/transformers/index) were used for model creation and training