
Add tokenization and prompting API to GPT models #651

Merged
merged 39 commits into from
Apr 3, 2024

Conversation

JulienVig
Collaborator

@JulienVig JulienVig commented Mar 18, 2024

Fixes #646

  • Adds tokenization support, allowing any pre-trained tokenizer from Transformers.js to be used
  • Implements text generation with a trained model
  • Extends text preprocessing with shuffling and left padding
  • Creates an example of language model training in docs/examples/wikitext.ts
  • I encountered many issues and limitations, which I listed in Improve and rework GPT-tfjs #654
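The left-padding step mentioned above could look roughly like this (a minimal sketch; `padLeft` and `PAD_TOKEN` are hypothetical names for illustration, not Disco's actual API):

```typescript
// Pad token id sequences on the left so every sequence reaches a fixed
// context length; truncate from the left when a sequence is too long.
const PAD_TOKEN = 0; // hypothetical padding token id

function padLeft(tokens: number[], maxLength: number): number[] {
  if (tokens.length >= maxLength) {
    // Keep the most recent tokens when the sequence exceeds the context length
    return tokens.slice(tokens.length - maxLength);
  }
  // Prepend padding so the real tokens sit at the end of the window
  return Array(maxLength - tokens.length).fill(PAD_TOKEN).concat(tokens);
}
```

Left padding (rather than right padding) keeps the most recent tokens adjacent to the position where the next token is predicted, which is the usual convention for decoder-only models like GPT.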

@JulienVig JulienVig added the discojs Related to Disco.js label Mar 18, 2024
@JulienVig JulienVig self-assigned this Mar 18, 2024
@JulienVig
Collaborator Author

JulienVig commented Mar 18, 2024

@tharvik @peacefulotter I'd be happy to hear your take on how to integrate tokenization and prompting into the Disco architecture.
The initial commit is a POC of a pipeline from raw text training data (i.e., not already preprocessed) to textual output (cf. docs/examples/wikitext.ts).
I wanted to make the tokenizer a trainingInformation field in the infamous TaskProvider, but it seems that complex objects (like functions) are removed from the task object when it is communicated between the server and the client...
My next step is to migrate from gpt-tokenizer to Transformers.js, so that we can specify which pre-trained tokenizer to use as a string in the trainingInformation.

@peacefulotter
Collaborator

Congrats on the LLM POC milestone!

Firstly, I am wondering whether a custom tokenizer is really relevant in the first place. Is there a use case for it in Disco?

If so, instead of passing the tokenizer object, you could pass a string identifying the tokenizer to use. An object could then map tokenizer ids/names to the corresponding instances.
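That name-to-instance mapping could be sketched as follows (illustrative only; the `Tokenizer` interface and the toy tokenizers are stand-ins, not Disco's actual code):

```typescript
// Map serializable tokenizer names to factories, so a task only needs to
// carry a string rather than a function.
interface Tokenizer {
  encode(text: string): number[];
}

// Toy tokenizers standing in for real ones (e.g. a GPT-2 BPE tokenizer)
const registry = new Map<string, () => Tokenizer>([
  ["char", () => ({ encode: (t) => Array.from(t).map((c) => c.charCodeAt(0)) })],
  ["word-length", () => ({ encode: (t) => t.split(/\s+/).map((w) => w.length) })],
]);

function getTokenizer(name: string): Tokenizer {
  const factory = registry.get(name);
  if (factory === undefined) throw new Error(`unknown tokenizer: ${name}`);
  return factory();
}
```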

@peacefulotter
Collaborator

peacefulotter commented Mar 18, 2024

Secondly, since it appears tokenization needs to be done on the fly (before or during training) rather than as a completely separate step, it could be done as a preprocessing step, as Hugo planned to do on his branch.

@JulienVig
Collaborator Author

@peacefulotter yes, I want to do both points you mentioned! The second one is already implemented: the tokenization is part of the preprocessing.
As for the first, I want to migrate to Transformers.js for exactly that reason: it allows loading different pre-trained tokenizers from a string (rather than via an import, as with gpt-tokenizer).

@tharvik
Collaborator

tharvik commented Mar 19, 2024

I'd be happy to hear your take on how to integrate tokenization and prompting into the Disco architecture.

I don't like Task; it's the go-to place for every context variable in Disco. Here are some questions to consider before adding it to the task:

  • is it applicable to every model? i.e., can images be tokenized?
  • would someone want to change the tokenizer of wikitext?

After a bit more thinking, I don't think it really matters for now, as long as #639 is not done.

I wanted to make the tokenizer as a trainingInformation field in the infamous TaskProvider but it seems that complex objects (like functions) are removed from the task object when communicated between the server and the client...

Indeed, functions can't be represented in msgpack/JSON, so they get removed. It would be nice to have a serialization process ensuring that we get the function back from the network (a string such as gpt-tokenizer/davinci in the serialized object that gets mapped to a known function).
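Such a serialization round trip could be sketched like this (hypothetical names throughout; JSON stands in for msgpack, and the toy "char" tokenizer is a stand-in for a real identifier like gpt-tokenizer/davinci):

```typescript
// Serialize only a string identifier; map it back to a known function on
// the receiving side, since functions survive neither JSON nor msgpack.
type EncodeFn = (text: string) => number[];

// Registry of functions known on both client and server (toy example)
const knownTokenizers: Record<string, EncodeFn> = {
  char: (t) => Array.from(t).map((c) => c.charCodeAt(0)),
};

interface SerializedTask { tokenizerName: string }
interface TrainingInformation { tokenizer: EncodeFn }

function serializeTask(tokenizerName: string): string {
  const payload: SerializedTask = { tokenizerName };
  return JSON.stringify(payload); // only the string crosses the network
}

function deserializeTask(payload: string): TrainingInformation {
  const { tokenizerName } = JSON.parse(payload) as SerializedTask;
  const tokenizer = knownTokenizers[tokenizerName];
  if (tokenizer === undefined) throw new Error(`unknown tokenizer: ${tokenizerName}`);
  return { tokenizer };
}
```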

@JulienVig
Collaborator Author

Transformers.js requires Node >= 18, so I'm waiting on #653 to be merged.

@JulienVig JulienVig marked this pull request as ready for review March 27, 2024 16:45
Collaborator

@tharvik tharvik left a comment


Great work! A bit of nitpicking here and there, but nothing really vital.

Review threads were opened on the following files:

  • discojs/discojs-core/package.json
  • discojs/discojs-node/src/data/text_loader.ts
  • discojs/discojs-core/src/task/training_information.ts
  • discojs/discojs-core/src/models/gpt/evaluate.ts (three threads)
  • docs/examples/wikitext.ts (two threads)
  • web-client/.gitignore
Collaborator

@tharvik tharvik left a comment


I've created a new models/tokenizer.ts file

LGTM!

A few more comments (I'm a never-ending stream of criticism, please stop me).

@JulienVig JulienVig removed the request for review from peacefulotter April 2, 2024 09:34
@JulienVig JulienVig merged commit 7c282e7 into develop Apr 3, 2024
23 checks passed
@JulienVig JulienVig deleted the 646-tokenizer-julien branch April 3, 2024 11:17
Labels
discojs Related to Disco.js

Successfully merging this pull request may close these issues.

Add tokenization support to Disco LLMs
3 participants