diff --git a/FAQ.md b/FAQ.md
index b5510cc..0b00d72 100644
--- a/FAQ.md
+++ b/FAQ.md
@@ -32,7 +32,8 @@ it is however possible to use the Rust backend, which should run in int8 with CU
 
 ### Moshi stopped talking after 5 min.
 
-This is expected on the MLX and Rust implementation. We only use a fixed buffer, and we do not discard
-past entries. The PyTorch version should work for unlimited times, although this is mostly untested and we
+This is expected on the MLX and Rust implementation.
+We only use a fixed buffer, and we do not discard past entries.
+The PyTorch version should work for unlimited times, although this is mostly untested and we
 expect the quality to degrade after a bit (we have no attention sink or other mechanism to improve
 the streaming beyond the finite context used at training).
diff --git a/moshi/README.md b/moshi/README.md
index 6d09529..60e20ef 100644
--- a/moshi/README.md
+++ b/moshi/README.md
@@ -85,6 +85,12 @@ with torch.no_grad():
         codes = mimi.encode(frame)
         assert codes.shape[-1] == 1, codes.shape
         all_codes.append(codes)
+
+## WARNING: When streaming, make sure to always feed a total amount of audio that is a multiple
+# of the frame size (1920), otherwise the last frame will not be complete, and thus
+# will not be encoded. For simplicity, we recommend feeding in audio always in multiple
+# of the frame size, so that you always know how many time steps you get back in `codes`.
+
 # Now if you have a GPU around.
 mimi.cuda()
 moshi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME)
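
The frame-size bookkeeping described in the README warning above can be sketched as follows. This is a minimal illustration, not code from the repository: `split_into_frames` is a hypothetical helper, and only the frame size of 1920 samples comes from the patch.

```python
FRAME_SIZE = 1920  # Mimi's frame size, as stated in the README warning above

def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Return (complete_frames, leftover).

    Only whole frames can be encoded now; the leftover samples should be
    buffered and prepended to the next chunk of incoming audio.
    """
    n = len(samples) // frame_size
    frames = [samples[i * frame_size:(i + 1) * frame_size] for i in range(n)]
    leftover = samples[n * frame_size:]
    return frames, leftover

# 2.5 frames worth of audio: 2 frames are encodable, 960 samples must wait.
audio = [0.0] * (FRAME_SIZE * 5 // 2)
frames, leftover = split_into_frames(audio)
print(len(frames), len(leftover))  # 2 960
```

Keeping a leftover buffer like this guarantees that every call passes the encoder a whole number of frames, so the number of time steps returned in `codes` is always predictable.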