
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType' #1227

Closed
TheGermanEngie opened this issue Apr 1, 2024 · 7 comments
Labels: bug

@TheGermanEngie

Continuing from issue #861, where a path problem turned into a data-loading problem. I opened a new Lightning AI Studio instance, freshly installed the repo, and downloaded the checkpoint, and I still get the same error.

[Screenshot: traceback ending in TypeError: unsupported operand type(s) for -: 'float' and 'NoneType']

Here is a sample of my dataset.

sample.json
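
For context on the error class itself: in Python, this exact TypeError is raised whenever `None` is subtracted from a float, which usually points to an optional config value that was never filled in. A minimal sketch of the failure mode (hypothetical variable name, not litgpt's actual code):

```python
# Minimal reproduction of the failure mode: an optional config value
# that stayed None gets subtracted from a float.
val_split_fraction = None  # hypothetical: an optional setting left unset

train_fraction = 1.0 - val_split_fraction
# TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
```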

@TheGermanEngie (Author)

I even updated the dataset to add one more level of indentation to the {} objects so that it matches litgpt's example .json object formatting:

From this:
[Screenshot: JSON records at the original indentation]

To this:
[Screenshot: JSON records with the extra level of indentation]

And I still get the same error.
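
If it helps anyone reproducing this, a quick way to sanity-check a file before finetuning is a sketch like the one below. It assumes the file is a top-level JSON array of objects with `instruction`/`output` keys and an optional `input`, which matches the Alpaca format the litgpt examples follow but should be double-checked against the docs:

```python
import json

# Sanity-check sample.json against an Alpaca-style layout: a top-level
# list of objects with "instruction"/"output" keys and an optional
# "input". The required keys are an assumption, not a confirmed
# reading of litgpt's loader.
with open("sample.json") as f:
    data = json.load(f)

assert isinstance(data, list), "top level must be a JSON array"
for i, record in enumerate(data):
    missing = {"instruction", "output"} - set(record)
    assert not missing, f"record {i} is missing keys: {missing}"
print(f"{len(data)} records look structurally OK")
```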

@TheGermanEngie (Author)

A test run with the default Alpaca dataset, using the command `litgpt finetune lora --data Alpaca --checkpoint_dir checkpoints/mobiuslabsgmbh/aanaphi2-v0.1`, works fine, so I think the problem is my dataset. I can't find any formatting differences between the default dataset and mine, so I'm kind of at a loss here.

@TheGermanEngie (Author)

I should add that my dataset grows rapidly in size because it is built from speaker-diarization-style data. To keep the full context of the conversation, each time the Speaker 0 and Speaker 1 nametags change in the dataset, the previous output is folded back into the "input" field, and the text accumulates until the speakers change again, for example:

[Screenshot: example of a cumulative conversation record]

As you can imagine, it grows very quickly. This makes me wonder whether the earlier error has to do with context-length limits when initially loading the dataset, since the finetuning scripts work by determining the size of the longest tokenized sample in the dataset to set the block size, and my files do get quite long. I also tried truncating earlier with `--train.max_seq_length 256`, to no effect.
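
To illustrate the growth pattern described above (made-up turns, not the actual dataset): because each new record's "input" carries the entire conversation so far, the total text volume grows roughly quadratically with the number of speaker turns:

```python
# Sketch of the cumulative record construction described above, with
# made-up turns. Each record's "input" repeats the whole conversation
# so far, so total size grows ~quadratically with the number of turns.
turns = [
    ("Speaker 0", "Hi, how are you?"),
    ("Speaker 1", "Good, thanks. And you?"),
    ("Speaker 0", "Doing well."),
]

records, context = [], ""
for speaker, text in turns:
    records.append({
        "instruction": "Continue the conversation.",
        "input": context,
        "output": f"{speaker}: {text}",
    })
    context += f"{speaker}: {text}\n"

print(sum(len(r["input"]) + len(r["output"]) for r in records))
```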

@gitgroman

Same for me. I don't think size matters, as it failed with a really small dataset with only a few records.

@carmocca (Contributor)

carmocca commented Apr 3, 2024

@rasbt can you check this out? There might be a bug in the json datamodule

@carmocca added the bug label Apr 3, 2024
@gitgroman

gitgroman commented Apr 3, 2024

Adding `--data.val_split_fraction 0.1` fixed the issue.

@carmocca
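
That fits the traceback: if `val_split_fraction` defaults to `None` for a custom JSON file (an assumption based on this thread, not a confirmed reading of the datamodule), a split computation along these lines would raise exactly this TypeError, and passing an explicit fraction sidesteps it:

```python
# Hypothetical sketch of the suspected split logic; not litgpt's
# actual code. With val_split_fraction=None the subtraction raises
# the TypeError from this issue; an explicit fraction avoids it.
def split_sizes(n_samples, val_split_fraction=None):
    train_fraction = 1.0 - val_split_fraction  # TypeError when None
    n_train = int(n_samples * train_fraction)
    return n_train, n_samples - n_train

print(split_sizes(100, val_split_fraction=0.1))  # (90, 10)
```

So the full run would look something like `litgpt finetune lora --data JSON --data.json_path sample.json --data.val_split_fraction 0.1 --checkpoint_dir checkpoints/mobiuslabsgmbh/aanaphi2-v0.1` (the `--data.json_path` flag name is my assumption; double-check it against the litgpt docs).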

@carmocca (Contributor)

carmocca commented Apr 4, 2024

#1241 improves the messaging
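
Presumably something along the lines of failing fast with a readable message instead of letting the bare TypeError surface; a sketch of the idea (not the actual diff in #1241):

```python
# Sketch of the kind of guard a clearer message implies; not the
# actual code from #1241.
val_split_fraction = None  # hypothetical unset value

if val_split_fraction is None:
    raise ValueError(
        "Cannot split the dataset for validation. "
        "Please set --data.val_split_fraction, e.g. 0.1"
    )
```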

@carmocca closed this as completed Apr 4, 2024