[MRG] update the preview_csv_data function #249

Merged: 2 commits, Oct 16, 2024

leeeizhang
Collaborator

@leeeizhang leeeizhang commented Oct 16, 2024

Enhances the preview_csv_data function to provide a more comprehensive summary of CSV files by:

  • adding column types to the preview summary
  • including the count of NaN values, to help handle these fields during code generation
  • adding example values for each column
  • including value ranges for numeric columns

Before:

Data file: /mnt/data00/mle-bench/spaceship-titanic/prepared/public/train.csv
Number of all rows: 7823
All columns: PassengerId, HomePlanet, CryoSleep, Cabin, Destination, Age, VIP, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, Name, Transported
Data example:
{'PassengerId': '4408_01', 'HomePlanet': 'Mars', 'CryoSleep': True, 'Cabin': 'F/906/P', 'Destination': 'TRAPPIST-1e', 'Age': 75.0, 'VIP': False, 'RoomService': 0.0, 'FoodCourt': 0.0, 'ShoppingMall': 0.0, 'Spa': 0.0, 'VRDeck': 0.0, 'Name': 'Pich Knike', 'Transported': True}

After:

CSV file in `/mnt/data00/mle-bench/spaceship-titanic/prepared/public/train.csv` has 7823 rows and 14 columns.
Here is some information about the columns:
Age (float64) has range: 0.00 - 79.00, 162 NaN values
Cabin (object) has 5992 unique values. Some example values: ['B/11/S', 'E/13/S', 'G/109/P', 'C/137/S', 'F/1411/P']
CryoSleep (object) has 2 unique values: [True, False, nan]
Destination (object) has 3 unique values: ['TRAPPIST-1e', 'PSO J318.5-22', '55 Cancri e', nan]
FoodCourt (float64) has range: 0.00 - 27723.00, 169 NaN values
HomePlanet (object) has 3 unique values: ['Mars', 'Europa', 'Earth', nan]
Name (object) has 7625 unique values. Some example values: ['Asch Stradick', 'Juane Popelazquez', 'Keitha Josey', 'Dia Cartez', 'Ankalik Nateansive']
PassengerId (object) has 7823 unique values. Some example values: ['4408_01', '8869_01', '0111_01', '0540_01', '5257_01']
RoomService (float64) has range: 0.00 - 14327.00, 165 NaN values
ShoppingMall (float64) has range: 0.00 - 23492.00, 188 NaN values
Spa (float64) has range: 0.00 - 22408.00, 168 NaN values
Transported (bool) is 50.44% True, 49.56% False
VIP (object) has 2 unique values: [False, True, nan]
VRDeck (float64) has range: 0.00 - 24133.00, 169 NaN values
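
For readers who want to see how a summary in this format could be produced, here is a minimal sketch using pandas. It is not the code merged in this PR: `summarize_columns`, `max_listed`, and `max_examples` are hypothetical names, and the thresholds are assumptions chosen to mimic the output above.

```python
import pandas as pd
from pandas.api import types as ptypes


def summarize_columns(df: pd.DataFrame, max_listed: int = 10, max_examples: int = 5) -> str:
    """Build a per-column text summary (hypothetical helper, not the merged code)."""
    lines = [
        f"CSV data has {len(df)} rows and {len(df.columns)} columns.",
        "Here is some information about the columns:",
    ]
    # Columns are listed alphabetically, as in the output above.
    for col in sorted(df.columns):
        series = df[col]
        dtype = series.dtype
        if ptypes.is_bool_dtype(series):
            # Check bool first: pandas also treats bool as a numeric dtype.
            true_pct = float(series.mean()) * 100
            lines.append(f"{col} ({dtype}) is {true_pct:.2f}% True, {100 - true_pct:.2f}% False")
        elif ptypes.is_numeric_dtype(series):
            # min() and max() skip NaN by default.
            low, high = float(series.min()), float(series.max())
            n_nan = int(series.isna().sum())
            lines.append(f"{col} ({dtype}) has range: {low:.2f} - {high:.2f}, {n_nan} NaN values")
        else:
            n_unique = series.nunique()  # nunique() does not count NaN
            if n_unique <= max_listed:
                # unique() keeps NaN, so it can appear in the listed values.
                values = series.unique().tolist()
                lines.append(f"{col} ({dtype}) has {n_unique} unique values: {values}")
            else:
                examples = series.dropna().unique()[:max_examples].tolist()
                lines.append(
                    f"{col} ({dtype}) has {n_unique} unique values. "
                    f"Some example values: {examples}"
                )
    return "\n".join(lines)


demo = pd.DataFrame(
    {
        "Age": [1.0, None, 79.0],
        "HomePlanet": ["Mars", "Earth", "Mars"],
        "Transported": [True, False, True],
    }
)
print(summarize_columns(demo))
```

Because `nunique()` excludes NaN while `unique()` keeps it, a sketch like this reports a count that leaves out `nan` even when `nan` appears in the listed values, which matches the CryoSleep and VIP lines above.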

Before submitting this PR, please make sure you have:

  • confirmed that all checks still pass or that the CI build passes.
  • verified that any code or assets from external sources are properly credited in comments and/or in
    the credit file.

@dosubot (bot) added the size:M (this PR changes 30-99 lines, ignoring generated files) and enhancement (new feature or request) labels on Oct 16, 2024
@leeeizhang changed the title from "[WIP] update the preview_csv_data function" to "[MRG] update the preview_csv_data function" on Oct 16, 2024
@huangyz0918
Member

Some people on Discord mentioned that when a file has too many columns, the preview's token count exceeds the maximum context window size. Should we add a limit on the number of columns previewed?

@huangyz0918 huangyz0918 self-requested a review October 16, 2024 17:41
@dosubot (bot) added the lgtm label (this PR has been approved by a maintainer) on Oct 16, 2024
@huangyz0918 huangyz0918 merged commit 1fcbd8b into MLSysOps:main Oct 16, 2024
3 checks passed
@leeeizhang leeeizhang deleted the lei/enhance-csv-preview branch October 18, 2024 05:02
@leeeizhang
Collaborator Author

Some people on Discord mentioned that when a file has too many columns, the preview's token count exceeds the maximum context window size. Should we add a limit on the number of columns previewed?

Agreed, I will add a column-wise limit argument to the preview_csv_data function.
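
One way such a limit could look is sketched below. This only illustrates the idea discussed in this thread and is not the eventual implementation: `limit_preview_columns` and `max_columns` are hypothetical names, and the default of 50 is an arbitrary assumption.

```python
from typing import List, Optional, Sequence, Tuple


def limit_preview_columns(
    columns: Sequence[str], max_columns: Optional[int] = 50
) -> Tuple[List[str], str]:
    """Pick the columns to preview and a note about the rest (hypothetical helper).

    Returns the first `max_columns` names plus a short note that tells the
    reader (or the model) how many columns were left out of the preview.
    """
    names = list(columns)
    # None means "no limit", so every column is previewed.
    if max_columns is None or len(names) <= max_columns:
        return names, ""
    omitted = len(names) - max_columns
    return names[:max_columns], f"... and {omitted} more columns not shown"


shown, note = limit_preview_columns(["Age", "Cabin", "CryoSleep"], max_columns=2)
print(shown, note)
```

Returning a note alongside the truncated list keeps the preview honest about what was dropped, so generated code does not assume the listed columns are the only ones in the file.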
