Commit

clarify frame size
adefossez committed Sep 19, 2024
1 parent 7e5251f commit 3df7e80
Showing 2 changed files with 9 additions and 2 deletions.
5 changes: 3 additions & 2 deletions FAQ.md
@@ -32,7 +32,8 @@ it is however possible to use the Rust backend, which should run in int8 with CU

### Moshi stopped talking after 5 min.

-This is expected on the MLX and Rust implementation. We only use a fixed buffer, and we do not discard
-past entries. The PyTorch version should work for unlimited times, although this is mostly untested and we
+This is expected with the MLX and Rust implementations.
+We only use a fixed buffer, and we do not discard past entries.
+The PyTorch version should work for unlimited times, although this is mostly untested and we
 expect the quality to degrade after a bit (we have no attention sink or other mechanism to improve the streaming
 beyond the finite context used at training).
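
The fixed-buffer behavior described above can be illustrated with a small sketch. This is hypothetical code, not the actual MLX/Rust cache: `FixedCache` is an illustrative name, and the 3750-step budget assumes Mimi's 12.5 Hz frame rate (80 ms frames), so 5 minutes = 5 * 60 * 12.5 steps.

```python
# Hypothetical sketch of a fixed-size cache that simply stops accepting
# entries once full, as opposed to a rolling buffer that discards the
# oldest entries to keep streaming indefinitely.
class FixedCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = []

    def append(self, item) -> bool:
        """Store one entry; return False once the cache is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(item)
        return True

# At 12.5 Hz, a 5 minute budget is 5 * 60 * 12.5 = 3750 time steps.
cache = FixedCache(capacity=3750)
steps = 0
while cache.append(object()):
    steps += 1
# Generation halts after exactly `capacity` steps (i.e. ~5 minutes).
```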
6 changes: 6 additions & 0 deletions moshi/README.md
@@ -85,6 +85,12 @@ with torch.no_grad():
     codes = mimi.encode(frame)
     assert codes.shape[-1] == 1, codes.shape
     all_codes.append(codes)
+
+# WARNING: when streaming, make sure to always feed a total amount of audio that is a multiple
+# of the frame size (1920), otherwise the last frame will not be complete, and thus
+# will not be encoded. For simplicity, we recommend always feeding audio in multiples
+# of the frame size, so that you always know how many time steps you get back in `codes`.
+
 # Now if you have a GPU around.
 mimi.cuda()
 moshi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME)
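The warning added in this commit can be handled with a small re-chunking helper on the caller's side: accumulate incoming audio and only pass whole frames to the encoder. A minimal sketch, where plain Python lists stand in for audio tensors; `iter_full_frames`, `FRAME_SIZE`, and `carry` are illustrative names, not part of the moshi API.

```python
FRAME_SIZE = 1920  # samples per Mimi frame at 24 kHz (80 ms)

def iter_full_frames(buffer, chunk):
    """Accumulate incoming samples and yield only complete frames.

    `buffer` is a list used as carry-over state: samples that do not
    fill a whole frame stay buffered until more audio arrives.
    """
    buffer.extend(chunk)
    while len(buffer) >= FRAME_SIZE:
        yield buffer[:FRAME_SIZE]
        del buffer[:FRAME_SIZE]

# Feed arbitrary chunk sizes: only whole frames come out, the rest waits.
carry = []
frames = []
for chunk_len in (1000, 1000, 2000, 1920):
    frames.extend(iter_full_frames(carry, [0.0] * chunk_len))
# 5920 samples in total -> 3 full frames, 160 samples left in `carry`.
```

Each yielded frame would then be wrapped as a tensor and passed to `mimi.encode`, so the number of time steps returned in `codes` is always predictable.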
