-
Notifications
You must be signed in to change notification settings - Fork 15.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kaggle
document loader
#13788
kaggle
document loader
#13788
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,227 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "04c9fdc5", | ||
"metadata": {}, | ||
"source": [ | ||
"# Kaggle dataset\n", | ||
"\n", | ||
">[Kaggle](https://www.kaggle.com/) is a data science competition platform and online community of data scientists and machine learning practitioners under `Google LLC`. `Kaggle` enables users to find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.\n", | ||
"\n", | ||
"\n", | ||
"This notebook shows how to load [`Kaggle` datasets](https://www.kaggle.com/datasets) to LangChain." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "54b7fc48-ee05-4f71-bead-ed59fd18ca40", | ||
"metadata": {}, | ||
"source": [ | ||
"## Setting up\n", | ||
"\n", | ||
"Follow these steps to use this loader:\n", | ||
"- [Register a Kaggle account and create an API token](https://www.kaggle.com/settings) to use this loader.\n", | ||
"- Install `kaggle` and `pandas` python packages with `pip install kaggle pandas`\n", | ||
"- Use `kaggle datasets list` to list all available datasets\n", | ||
"- Use `kaggle datasets <dataset_name>` to download the dataset\n", | ||
"- Use `unzip <dataset_zipfile_name>` to extract all files in the dataset\n", | ||
"- Open the dataset CSV file and choose the column name for page content\n", | ||
"- Use the dataset CSV file name and the column name to\n", | ||
" initialize the `KaggleDatasetLoader`\n", | ||
"\n", | ||
"Note: Other columns in the dataset CSV file will be treated as metadata.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "5cefa09a-53b2-4f4e-b54e-8d88bb25f1a3", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"#!pip install kaggle pandas" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "3611e092", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/leo/.kaggle/kaggle.json'\n", | ||
"ref title size lastUpdated downloadCount voteCount usabilityRating \n", | ||
"-------------------------------------------------------------- ----------------------------------------------- ----- ------------------- ------------- --------- --------------- \n", | ||
"carlmcbrideellis/llm-7-prompt-training-dataset LLM: 7 prompt training dataset 41MB 2023-11-15 07:32:56 926 82 1.0 \n", | ||
"thedrcat/daigt-v2-train-dataset DAIGT V2 Train Dataset 29MB 2023-11-16 01:38:36 438 70 1.0 \n", | ||
"thedrcat/daigt-proper-train-dataset DAIGT Proper Train Dataset 119MB 2023-11-05 14:03:25 1000 112 1.0 \n", | ||
"joebeachcapital/30000-spotify-songs 30000 Spotify Songs 3MB 2023-11-01 06:06:43 5906 148 1.0 \n", | ||
"ddosad/auto-sales-data Automobile Sales data 79KB 2023-11-18 12:36:41 1243 35 1.0 \n", | ||
"everydaycodings/job-opportunity-dataset Job Opportunities Dataset 95KB 2023-11-20 08:33:14 783 26 1.0 \n", | ||
"iamsouravbanerjee/customer-shopping-trends-dataset Customer Shopping Trends Dataset 146KB 2023-10-05 06:45:37 31292 583 1.0 \n", | ||
"dillonmyrick/high-school-student-performance-and-demographics High School Student Performance & Demographics 24KB 2023-11-10 01:33:35 2421 40 1.0 \n", | ||
"nelgiriyewithana/world-educational-data World Educational Data 9KB 2023-11-04 06:10:17 4623 97 1.0 \n", | ||
"prasad22/healthcare-dataset 🩺Healthcare Dataset 🧪 483KB 2023-10-31 11:30:58 4769 82 1.0 \n", | ||
"mauryansshivam/list-of-internet-products-of-top-tech-companies List of Internet Products of Top Tech Companies 9KB 2023-11-15 19:56:56 980 24 1.0 \n", | ||
"alejopaullier/daigt-external-dataset DAIGT | External Dataset 3MB 2023-10-31 19:11:35 793 113 0.7647059 \n", | ||
"lakshayjain611/imdb-100-lowest-ranked-movies-dataset IMDb 100 Lowest Ranked Movies Dataset 8KB 2023-11-11 09:33:18 1033 33 1.0 \n", | ||
"jacksondivakarr/online-shopping-dataset 🛒 Online Shopping Dataset 📊📉📈 5MB 2023-11-12 12:35:58 2187 47 1.0 \n", | ||
"bwandowando/1-5-million-netflix-google-store-reviews 🎬🎥1.5 Million Netflix Google Store Reviews 114MB 2023-11-17 01:30:48 487 24 1.0 \n", | ||
"samyakb/student-stress-factors Student stress factors 887B 2023-11-02 12:42:11 5294 87 0.9411765 \n", | ||
"jdaustralia/icc-cwc23-all-innings-cleaned ICC Cricket World Cup CWC23 All innings 28KB 2023-11-21 04:41:01 805 21 0.9411765 \n", | ||
"anshtanwar/top-200-trending-books-with-reviews Top 100 Bestselling Book Reviews on Amazon 422KB 2023-11-09 06:31:02 5677 53 1.0 \n", | ||
"miquelneck/worlds-spotify-top-50-playlist-musicality-data World's Spotify TOP-50 playlist musicality data 171KB 2023-11-16 11:14:23 1281 36 1.0 \n", | ||
"joebeachcapital/coronavirus-covid-19-cases-daily-updates Coronavirus (COVID-19) Cases (Daily Updates) 14MB 2023-11-22 23:28:03 834 26 1.0 \n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!kaggle datasets list" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "020462f8-9dea-43d0-9365-7cbf3c445440", | ||
"metadata": {}, | ||
"source": [ | ||
"## Example\n", | ||
"\n", | ||
"Here we download one of the dataset used for promt training." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "5e903ebc", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/leo/.kaggle/kaggle.json'\n", | ||
"Downloading llm-7-prompt-training-dataset.zip to /home/leo/PycharmProjects/GLD/langchain/docs/docs/integrations/document_loaders\n", | ||
"100%|██████████████████████████████████████| 41.4M/41.4M [00:18<00:00, 2.43MB/s]\n", | ||
"100%|██████████████████████████████████████| 41.4M/41.4M [00:18<00:00, 2.40MB/s]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!kaggle datasets download carlmcbrideellis/llm-7-prompt-training-dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "e8559946", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Archive: llm-7-prompt-training-dataset.zip\n", | ||
" inflating: train_essays_7_prompts.csv \n", | ||
" inflating: train_essays_7_prompts_v2.csv \n", | ||
" inflating: train_essays_RDizzl3_seven_v1.csv \n", | ||
" inflating: train_essays_RDizzl3_seven_v2.csv \n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!unzip llm-7-prompt-training-dataset.zip" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "109dcac9-25e3-46b8-b1cf-e6a4e2fd562a", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.document_loaders import KaggleDatasetLoader" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "021bc377", | ||
"metadata": {}, | ||
"source": [ | ||
"`KaggleDatasetLoader` has these arguments:\n", | ||
"- **dataset_path**: Path to the dataset CSV file.\n", | ||
"- **page_content_column**: Column name of the page content." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "d924885c", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"14877" | ||
] | ||
}, | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"loader = KaggleDatasetLoader(\n", | ||
" dataset_path=\"train_essays_7_prompts.csv\", page_content_column=\"text\"\n", | ||
")\n", | ||
"docs = loader.load()\n", | ||
"len(docs)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"id": "f94ce6a3", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"Document(page_content='Cars. Cars have been around since they became famous in the 1900s, when Henry Ford created and built the first ModelT. Cars have played a major role in our every day lives since then. But now, people are starting to question if limiting car usage would be a good thing. To me, limiting the use of cars might be a good thing to do.\\n\\nIn like matter of this, article, \"In German Suburb, Life Goes On Without Cars,\" by Elizabeth Rosenthal states, how automobiles are the linchpin of suburbs, where middle class families from either Shanghai or Chicago tend to make their homes. Experts say how this is a huge impediment to current efforts to reduce greenhouse gas emissions from tailpipe. Passenger cars are responsible for 12 percent of greenhouse gas emissions in Europe...and up to 50 percent in some carintensive areas in the United States. Cars are the main reason for the greenhouse gas emissions because of a lot of people driving them around all the time getting where they need to go. Article, \"Paris bans driving due to smog,\" by Robert Duffer says, how Paris, after days of nearrecord pollution, enforced a partial driving ban to clear the air of the global city. It also says, how on Monday, motorist with evennumbered license plates were ordered to leave their cars at home or be fined a 22euro fine 31. The same order would be applied to oddnumbered plates the following day. Cars are the reason for polluting entire cities like Paris. This shows how bad cars can be because, of all the pollution that they can cause to an entire city.\\n\\nLikewise, in the article, \"Carfree day is spinning into a big hit in Bogota,\" by Andrew Selsky says, how programs that\\'s set to spread to other countries, millions of Columbians hiked, biked, skated, or took the bus to work during a carfree day, leaving streets of this capital city eerily devoid of traffic jams. It was the third straight year cars have been banned with only buses and taxis permitted for the Day Without Cars in the capital city of 7 million. People like the idea of having carfree days because, it allows them to lesson the pollution that cars put out of their exhaust from people driving all the time. The article also tells how parks and sports centers have bustled throughout the city uneven, pitted sidewalks have been replaced by broad, smooth sidewalks rushhour restrictions have dramatically cut traffic and new restaurants and upscale shopping districts have cropped up. Having no cars has been good for the country of Columbia because, it has aloud them to repair things that have needed repairs for a long time, traffic jams have gone down, and restaurants and shopping districts have popped up, all due to the fact of having less cars around.\\n\\nIn conclusion, the use of less cars and having carfree days, have had a big impact on the environment of cities because, it is cutting down the air pollution that the cars have majorly polluted, it has aloud countries like Columbia to repair sidewalks, and cut down traffic jams. Limiting the use of cars would be a good thing for America. So we should limit the use of cars by maybe riding a bike, or maybe walking somewhere that isn\\'t that far from you and doesn\\'t need the use of a car to get you there. To me, limiting the use of cars might be a good thing to do.', metadata={'label': 0})" | ||
] | ||
}, | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"docs[0]" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
# Kaggle | ||
|
||
>[Kaggle](https://www.kaggle.com/) is a data science competition platform and online community of data scientists and machine learning practitioners under `Google LLC`. `Kaggle` enables users to find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. | ||
|
||
## Installation and Setup | ||
|
||
You need to install `kaggle` python package. | ||
|
||
```bash | ||
pip install kaggle | ||
``` | ||
|
||
## Document Loader | ||
|
||
The Kaggle has lots of [datasets](https://www.kaggle.com/datasets) that can be used to train and evaluate your LLM chains. | ||
|
||
See a [usage example and detailed installation instructions](/docs/integrations/document_loaders/kaggle_dataset). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
from typing import Any, Iterator, List | ||
|
||
from langchain.docstore.document import Document | ||
from langchain.document_loaders.base import BaseLoader | ||
|
||
|
||
class KaggleDatasetLoader(BaseLoader): | ||
"""Load from `Kaggle` datasets. | ||
|
||
Follow these steps to use this loader: | ||
- Register a Kaggle account and create an API token to use this loader. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I haven't used the kaggle package before, and would it be possible to automate these steps in the loader instead of having it be a csv loader with some logic around the kaggle dataset format? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is possible. But I don't have time for it right now. I'm switching this PR to draft. Maybe I'll find time for automate it. |
||
See https://www.kaggle.com/settings | ||
- Install `kaggle` python package with `pip install kaggle` | ||
- Use `kaggle datasets list` to list all available datasets | ||
- Use `kaggle datasets <dataset_name>` to download the dataset | ||
- Use `unzip <dataset_zipfile_name>` to extract all files in the dataset | ||
- Open the dataset CSV file and choose the column name for page content | ||
- Use the dataset CSV file name and the column name to | ||
initialize the KaggleDatasetLoader | ||
|
||
Note: Other columns in the dataset CSV file will be treated as metadata. | ||
""" | ||
|
||
def __init__(self, dataset_path: str, page_content_column: str): | ||
"""Initialize the KaggleDatasetLoader. | ||
|
||
Args: | ||
dataset_path: Path to the dataset CSV file. | ||
page_content_column: Page content column name. | ||
""" | ||
|
||
self.dataset_path = dataset_path | ||
self.page_content_column = page_content_column | ||
|
||
def lazy_load(self) -> Iterator[Document]: | ||
"""Load documents lazily.""" | ||
for doc in self.load(): | ||
yield doc | ||
|
||
def load(self) -> List[Document]: | ||
"""Load documents.""" | ||
try: | ||
import pandas as pd | ||
except ImportError: | ||
raise ImportError( | ||
"Could not import pandas python package. " | ||
"Please install it with `pip install pandas`." | ||
) | ||
|
||
df = pd.read_csv(self.dataset_path) | ||
docs = [] | ||
df.apply(lambda row: docs.append(self._sample2document(row)), axis=1) | ||
return docs | ||
|
||
def _sample2document(self, sample: Any) -> Document: | ||
"""Convert a pandas dataframe sample into a Document | ||
|
||
The `content_field` goes into the document content, | ||
all other fields go into the metadata. | ||
""" | ||
assert self.page_content_column in sample.index, ( | ||
f"content field {self.page_content_column} " | ||
f"should be in the sample columns: {sample}" | ||
) | ||
return Document( | ||
page_content=str(sample[self.page_content_column]) | ||
if sample[self.page_content_column] is not None | ||
else "", | ||
metadata={ | ||
k: v | ||
for k, v in sample.to_dict().items() | ||
if k != self.page_content_column | ||
}, | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
import os | ||
|
||
import pandas as pd | ||
import pytest | ||
from pytest import fixture | ||
|
||
from langchain.document_loaders.kaggle_dataset import KaggleDatasetLoader | ||
|
||
TEST_DATASET_PATH = "./kaggle_dataset.csv" | ||
|
||
|
||
@fixture(autouse=True) | ||
def setup_and_tear_down() -> None: # type: ignore | ||
if os.path.isfile(TEST_DATASET_PATH): | ||
return True | ||
data = {"col1": [1, 2], "col2": [3, 4], "col3": ["5", "6"]} | ||
pd.DataFrame(data=data).to_csv(TEST_DATASET_PATH, index=False) | ||
yield | ||
if os.path.isfile(TEST_DATASET_PATH): | ||
os.remove(TEST_DATASET_PATH) | ||
|
||
|
||
def test_raise_error_if_path_not_exist() -> None: | ||
assert os.path.isfile(TEST_DATASET_PATH) | ||
os.remove(TEST_DATASET_PATH) | ||
with pytest.raises(FileNotFoundError): | ||
loader = KaggleDatasetLoader( | ||
dataset_path=TEST_DATASET_PATH, page_content_column="col3" | ||
) | ||
loader.load() | ||
|
||
|
||
def test_raise_error_if_wrong_() -> None: | ||
with pytest.raises(AssertionError): | ||
loader = KaggleDatasetLoader( | ||
dataset_path=TEST_DATASET_PATH, page_content_column="wrong_column" | ||
) | ||
loader.load() | ||
|
||
|
||
def test_success() -> None: | ||
loader = KaggleDatasetLoader( | ||
dataset_path=TEST_DATASET_PATH, page_content_column="col3" | ||
) | ||
docs = loader.load() | ||
assert len(docs) == 2 | ||
assert docs[0].page_content == "5" | ||
assert docs[1].page_content == "6" | ||
assert docs[0].metadata == {"col1": 1, "col2": 3} | ||
assert docs[1].metadata == {"col1": 2, "col2": 4} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, I don't think this is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package is used to manually download the kaggle dataset locally. I'd prefer do keep it here for clarity.