Commit

docs: update your-first-tstore tutorial with time filters
martibosch committed Aug 27, 2024
1 parent 3a0efbe commit 5f0c7c0
Showing 1 changed file with 141 additions and 69 deletions.
210 changes: 141 additions & 69 deletions tutorials/01-your-first-tstore.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -557,7 +557,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"100%|█████████████████████████████████████████████████| 59/59 [01:06<00:00, 1.13s/it]\n"
"100%|█████████████████████████████████████████████████| 59/59 [01:25<00:00, 1.45s/it]\n"
]
},
{
Expand Down Expand Up @@ -1200,7 +1200,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Dumped tstore in: 9.12 s\n",
"Dumped tstore in: 5.17 s\n",
"agrometeo-tstore/tstore_metadata.yaml\n",
"agrometeo-tstore/_attributes.parquet (0.008002 MB)\n",
"agrometeo-tstore/96/temperature/_common_metadata\n",
Expand All @@ -1213,7 +1213,7 @@
"agrometeo-tstore/27/precipitation/year=2023/part-0.parquet (0.482844 MB)\n",
"agrometeo-tstore/27/precipitation/year=2024/part-0.parquet (0.27916 MB)\n",
"Total size: 249.884152 MB (in 794 files)\n",
"Read tstore in: 28.96 s\n"
"Read tstore in: 31.33 s\n"
]
}
],
Expand Down Expand Up @@ -1275,6 +1275,8 @@
"source": [
"We can see that this creates a hierarchical structure in which each station id gets its own folder, containing one subfolder per variable. Each variable folder contains metadata as well as the actual data, partitioned by months (as specified with the `partitioning` argument). Note that the file sizes are quite small, which is likely inefficient [<sup>2</sup>](#parquet-file-size).\n",
"\n",
"## Filtering at read time\n",
"\n",
"Note that we may be interested in reading a single variable, in which case reading times are dramatically reduced:"
]
},
Expand All @@ -1288,7 +1290,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Read tstore for temperature in: 4.84 s\n"
"Read tstore for temperature in: 2.09 s\n"
]
},
{
Expand Down Expand Up @@ -1419,14 +1421,85 @@
"id": "13",
"metadata": {},
"source": [
"We can also lazily read the whole TStore into a `TSDF` object:"
"We may only be interested in loading a specific subset of the data, e.g., a given time period:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Read tstore for one year in: 2.30 s\n"
]
}
],
"source": [
"start = time.time()\n",
"ts_2021_df = tstore.open_tslong(\n",
" tstore_dir,\n",
" start_time=\"2021-01-01\",\n",
" end_time=\"2021-01-31\",\n",
" inclusive=\"both\",\n",
" backend=\"pandas\",\n",
")\n",
"print(f\"Read tstore for one year in: {time.time() - start:.2f} s\")"
]
},
{
"cell_type": "markdown",
"id": "15",
"metadata": {},
"source": [
"which can result in substantial performance gains, as filters are applied at read time. We can also filter by time range and target variable simultaneously:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Read tstore for one variable, one year in: 0.61 s\n"
]
}
],
"source": [
"start = time.time()\n",
"variable = \"temperature\"\n",
"T_ts_2021_df = tstore.open_tslong(\n",
" tstore_dir,\n",
" ts_variables=variable,\n",
" start_time=\"2021-01-01\",\n",
" end_time=\"2021-01-31\",\n",
" backend=\"pandas\",\n",
")\n",
"print(f\"Read tstore for one variable, one year in: {time.time() - start:.2f} s\")"
]
},
{
"cell_type": "markdown",
"id": "17",
"metadata": {},
"source": [
"## TSDF: the (geo) time series data frame\n",
"\n",
"We can also lazily read the whole TStore into a `TSDF` object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"metadata": {},
"outputs": [
{
"data": {
Expand Down Expand Up @@ -1460,41 +1533,41 @@
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>TS[shape=(Delayed('int-38e16e00-90d0-4e84-bb1f...</td>\n",
" <td>TS[shape=(Delayed('int-90297433-1020-43b2-88bf...</td>\n",
" <td>TS[shape=(Delayed('int-e35c9bf6-c45c-4106-b57e...</td>\n",
" <td>TS[shape=(Delayed('int-dab79b77-5b8f-4252-9de1...</td>\n",
" <td>TS[shape=(Delayed('int-193eb739-3820-4a6e-8f8b...</td>\n",
" <td>TS[shape=(Delayed('int-62d167cf-7ec2-44ee-9318...</td>\n",
" <td>POINT (521720.000 148080.000)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>TS[shape=(Delayed('int-51e340b2-4322-4bd0-b1bb...</td>\n",
" <td>TS[shape=(Delayed('int-b454de2f-6529-47f0-bfc1...</td>\n",
" <td>TS[shape=(Delayed('int-779868b7-4f3f-48b8-a206...</td>\n",
" <td>TS[shape=(Delayed('int-2e6b6c99-074b-4185-b16c...</td>\n",
" <td>TS[shape=(Delayed('int-457032eb-012f-4db5-8f50...</td>\n",
" <td>TS[shape=(Delayed('int-f42688e5-09bb-4ef4-8767...</td>\n",
" <td>POINT (507130.000 139310.000)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4</td>\n",
" <td>TS[shape=(Delayed('int-cfa57f7d-10bf-429a-90e4...</td>\n",
" <td>TS[shape=(Delayed('int-492c7579-802d-4e00-a9d5...</td>\n",
" <td>TS[shape=(Delayed('int-70417454-d02b-47f2-bf35...</td>\n",
" <td>TS[shape=(Delayed('int-a0ce6180-2d47-4de6-aa1e...</td>\n",
" <td>TS[shape=(Delayed('int-d8be737b-d259-4635-8ad1...</td>\n",
" <td>TS[shape=(Delayed('int-e36d623f-3b3a-4c10-b94e...</td>\n",
" <td>POINT (520355.000 148210.000)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>10</td>\n",
" <td>TS[shape=(Delayed('int-01663798-313b-423f-b799...</td>\n",
" <td>TS[shape=(Delayed('int-ee8d2b24-c817-4da3-a159...</td>\n",
" <td>TS[shape=(Delayed('int-bdd991de-22d3-4216-a978...</td>\n",
" <td>TS[shape=(Delayed('int-44eca260-487f-469d-98bf...</td>\n",
" <td>TS[shape=(Delayed('int-5c6abbe6-c373-4a3b-b3bb...</td>\n",
" <td>TS[shape=(Delayed('int-d7303769-6427-4e7b-b9eb...</td>\n",
" <td>POINT (557241.000 144716.000)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13</td>\n",
" <td>TS[shape=(Delayed('int-308da7f1-d731-4754-bbf9...</td>\n",
" <td>TS[shape=(Delayed('int-14e4d1e1-1625-4e2e-a9f4...</td>\n",
" <td>TS[shape=(Delayed('int-be560115-ea51-4fe7-94d8...</td>\n",
" <td>TS[shape=(Delayed('int-c5163a73-9363-4b52-ac4c...</td>\n",
" <td>TS[shape=(Delayed('int-de3d0880-944a-4386-8589...</td>\n",
" <td>TS[shape=(Delayed('int-404278c9-8f87-4a87-b8b9...</td>\n",
" <td>POINT (540810.000 151565.000)</td>\n",
" </tr>\n",
" </tbody>\n",
Expand All @@ -1504,25 +1577,25 @@
"text/plain": [
"TSDFGeoPandas wrapping a geopandas.geodataframe.GeoDataFrame:\n",
" station temperature \\\n",
"0 1 TS[shape=(Delayed('int-a0ca544b-d570-43bc-8ae3... \n",
"1 3 TS[shape=(Delayed('int-4b68ff61-a5bd-4a1d-9292... \n",
"2 4 TS[shape=(Delayed('int-332e8444-a91e-46a8-844a... \n",
"3 10 TS[shape=(Delayed('int-cb2e1ffa-22bd-43f0-8403... \n",
"4 13 TS[shape=(Delayed('int-392ab5d4-99bc-4fd3-947f... \n",
"0 1 TS[shape=(Delayed('int-04572265-0284-48fc-8f19... \n",
"1 3 TS[shape=(Delayed('int-239350fb-6c18-404e-aed2... \n",
"2 4 TS[shape=(Delayed('int-3b7df6dc-ac01-4fdb-8c76... \n",
"3 10 TS[shape=(Delayed('int-75467091-6bf8-4c0f-8e39... \n",
"4 13 TS[shape=(Delayed('int-1bb0fdbc-2945-4dd9-856f... \n",
"\n",
" water_vapour \\\n",
"0 TS[shape=(Delayed('int-f541fdfb-eb8e-41e2-bf0b... \n",
"1 TS[shape=(Delayed('int-a32bb86d-a731-4131-8f0a... \n",
"2 TS[shape=(Delayed('int-bbbca391-28de-4c11-b040... \n",
"3 TS[shape=(Delayed('int-459daf8f-3e4f-422e-896c... \n",
"4 TS[shape=(Delayed('int-a7126758-074e-4713-96af... \n",
"0 TS[shape=(Delayed('int-eb6e350d-0e1c-4a01-88c6... \n",
"1 TS[shape=(Delayed('int-58010606-45bc-4962-a46e... \n",
"2 TS[shape=(Delayed('int-7c0aa4e1-db00-49c0-bd52... \n",
"3 TS[shape=(Delayed('int-a926d941-a11f-4fba-ad7d... \n",
"4 TS[shape=(Delayed('int-1e41c626-db33-4a9e-acb2... \n",
"\n",
" precipitation \\\n",
"0 TS[shape=(Delayed('int-4c45fb78-4ea8-453e-884d... \n",
"1 TS[shape=(Delayed('int-c77a9ea6-62dc-4acd-b5cd... \n",
"2 TS[shape=(Delayed('int-06917f0b-8bfd-4f85-851e... \n",
"3 TS[shape=(Delayed('int-bc6029c0-de33-45ed-9b7c... \n",
"4 TS[shape=(Delayed('int-1835999b-615e-4d68-9dd7... \n",
"0 TS[shape=(Delayed('int-7356deb8-ba74-4d38-b688... \n",
"1 TS[shape=(Delayed('int-66bca97a-740f-4fda-8dfd... \n",
"2 TS[shape=(Delayed('int-4f78e7cb-1fb3-48f2-8ddc... \n",
"3 TS[shape=(Delayed('int-57cbc405-ad85-41dd-b12c... \n",
"4 TS[shape=(Delayed('int-7300a444-cf6d-4bf8-bdce... \n",
"\n",
" geometry \n",
"0 POINT (521720.000 148080.000) \n",
Expand All @@ -1544,7 +1617,7 @@
},
{
"cell_type": "markdown",
"id": "15",
"id": "19",
"metadata": {},
"source": [
"Since we provided the geographic location of the stations when writing the TStore, the `TSDF` object will be based on geopandas, with a geometry column that allows performing geographic operations:"
Expand All @@ -1553,7 +1626,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "16",
"id": "20",
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -1588,33 +1661,33 @@
" <tr>\n",
" <th>4</th>\n",
" <td>13</td>\n",
" <td>TS[shape=(Delayed('int-1a890787-4692-4602-8a9c...</td>\n",
" <td>TS[shape=(Delayed('int-dcac3f13-0b01-492f-80fe...</td>\n",
" <td>TS[shape=(Delayed('int-d7f95a7a-96b6-4bee-a4e3...</td>\n",
" <td>TS[shape=(Delayed('int-ee5a8846-86ac-4beb-9a68...</td>\n",
" <td>TS[shape=(Delayed('int-1ff8da0d-db40-40fe-a1bd...</td>\n",
" <td>TS[shape=(Delayed('int-a0a5fc87-8def-4764-a3d6...</td>\n",
" <td>POINT (540810.000 151565.000)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>41</td>\n",
" <td>TS[shape=(Delayed('int-b39a9348-03c5-498c-befa...</td>\n",
" <td>TS[shape=(Delayed('int-9c90b25d-64ec-410d-9790...</td>\n",
" <td>TS[shape=(Delayed('int-8216afbc-f34b-43eb-bd58...</td>\n",
" <td>TS[shape=(Delayed('int-33868f04-918d-4a2e-bf30...</td>\n",
" <td>TS[shape=(Delayed('int-daa80dd5-d70c-4ce6-ad58...</td>\n",
" <td>TS[shape=(Delayed('int-520d4404-97ff-4cc2-b3a8...</td>\n",
" <td>POINT (548470.000 147690.000)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>98</td>\n",
" <td>TS[shape=(Delayed('int-ebbc8db2-792a-4575-8054...</td>\n",
" <td>TS[shape=(Delayed('int-3d515521-8009-4d11-997b...</td>\n",
" <td>TS[shape=(Delayed('int-162f28f5-8897-47f9-98cd...</td>\n",
" <td>TS[shape=(Delayed('int-1768ff7e-35c4-4193-b208...</td>\n",
" <td>TS[shape=(Delayed('int-91b2a488-f508-4a30-8544...</td>\n",
" <td>TS[shape=(Delayed('int-65f9e49b-be8e-483d-86f1...</td>\n",
" <td>POINT (544528.000 149600.000)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>305</td>\n",
" <td>TS[shape=(Delayed('int-5e6277c3-692c-427d-91f2...</td>\n",
" <td>TS[shape=(Delayed('int-83b8a045-dafc-4f12-9828...</td>\n",
" <td>TS[shape=(Delayed('int-f4f9256b-a6f8-43d5-b8a5...</td>\n",
" <td>TS[shape=(Delayed('int-d90fcf63-de62-4ce6-af28...</td>\n",
" <td>TS[shape=(Delayed('int-960882a5-96dc-4f63-851b...</td>\n",
" <td>TS[shape=(Delayed('int-725b9857-656e-4dc6-96cc...</td>\n",
" <td>POINT (550365.000 147190.000)</td>\n",
" </tr>\n",
" </tbody>\n",
Expand All @@ -1624,22 +1697,22 @@
"text/plain": [
"TSDFGeoPandas wrapping a geopandas.geodataframe.GeoDataFrame:\n",
" station temperature \\\n",
"4 13 TS[shape=(Delayed('int-d2c5a8b2-c230-445a-b66e... \n",
"11 41 TS[shape=(Delayed('int-57c700af-f94a-4dcb-884c... \n",
"29 98 TS[shape=(Delayed('int-85f6ca17-2239-48dd-963a... \n",
"32 305 TS[shape=(Delayed('int-18f6cf54-5e17-4a45-903b... \n",
"4 13 TS[shape=(Delayed('int-6dcd6913-b611-4e1b-b838... \n",
"11 41 TS[shape=(Delayed('int-81caae79-e052-4496-b1a6... \n",
"29 98 TS[shape=(Delayed('int-626e9dea-b627-4cde-8014... \n",
"32 305 TS[shape=(Delayed('int-69d7ce46-8d30-493e-9228... \n",
"\n",
" water_vapour \\\n",
"4 TS[shape=(Delayed('int-a3df96ea-77a1-4497-9b86... \n",
"11 TS[shape=(Delayed('int-3b4df6c8-7d08-419d-a3bd... \n",
"29 TS[shape=(Delayed('int-86d7ca00-125a-4218-b876... \n",
"32 TS[shape=(Delayed('int-61665697-57a8-4e2c-82cf... \n",
"4 TS[shape=(Delayed('int-05d3d031-9513-4b74-9a01... \n",
"11 TS[shape=(Delayed('int-da308200-50c9-4db8-be26... \n",
"29 TS[shape=(Delayed('int-57f84b34-4bdb-4c9c-a59b... \n",
"32 TS[shape=(Delayed('int-1a3e44ab-ece4-4d9b-8133... \n",
"\n",
" precipitation \\\n",
"4 TS[shape=(Delayed('int-c8cc6152-d229-4d41-92f3... \n",
"11 TS[shape=(Delayed('int-92230ce5-0b73-4855-ae85... \n",
"29 TS[shape=(Delayed('int-1b23a47a-87a5-4b25-8370... \n",
"32 TS[shape=(Delayed('int-2f88921d-b262-46a0-8af4... \n",
"4 TS[shape=(Delayed('int-3db5fdfd-69d3-4710-bb66... \n",
"11 TS[shape=(Delayed('int-f206fdcd-6817-45e3-addc... \n",
"29 TS[shape=(Delayed('int-85f07b34-915b-4bd8-8ebb... \n",
"32 TS[shape=(Delayed('int-863e2458-601e-4b24-8334... \n",
"\n",
" geometry \n",
"4 POINT (540810.000 151565.000) \n",
Expand All @@ -1662,7 +1735,7 @@
},
{
"cell_type": "markdown",
"id": "17",
"id": "21",
"metadata": {},
"source": [
"and use the `TS._obj.compute()` method after selecting the target data, so that only the required files are read:"
Expand All @@ -1671,7 +1744,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"id": "22",
"metadata": {
"lines_to_next_cell": 0
},
Expand Down Expand Up @@ -1783,28 +1856,27 @@
},
{
"cell_type": "markdown",
"id": "19",
"id": "23",
"metadata": {},
"source": [
"###### TODO: better interface to lazily reading tsdf data\n",
"###### TODO: example of reading with time filters\n",
"\n",
"Let us now compare the I/O of tstore with a pandas CSV:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20",
"id": "24",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dumped csv in: 136.48 s\n",
"Dumped csv in: 123.86 s\n",
"Total size: 310.633327 MB\n",
"Read csv in: 8.44 s\n"
"Read csv in: 9.64 s\n"
]
}
],
Expand All @@ -1822,7 +1894,7 @@
},
{
"cell_type": "markdown",
"id": "21",
"id": "25",
"metadata": {},
"source": [
"As we can see, dumping the data to a tstore is about 24 times faster and takes about 80% of the disk space. Although reading the CSV is about 3 times faster than reading the whole tstore, reading subsets of data (e.g., a single variable or a specific time period) can be significantly faster with tstore (about 4.6 times faster to read temperature only), and only the targeted data is loaded into memory (unlike the CSV, which requires loading all the data).\n",
Expand Down
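
The ratios quoted in the closing markdown cell can be rechecked against the timings printed in this revision's cell outputs. The following is a minimal sketch: the numbers are copied by hand from the outputs shown in this diff, and the variable names are illustrative and not part of tstore. Timings depend on the machine and vary between runs, so the ratios are only indicative.

```python
# Timings (seconds) and sizes (MB) copied from the post-commit cell outputs
# in this diff. Variable names are illustrative, not part of tstore.
tstore_dump_s = 5.17
tstore_read_all_s = 31.33
tstore_read_temperature_s = 2.09
csv_dump_s = 123.86
csv_read_s = 9.64
tstore_size_mb = 249.884152
csv_size_mb = 310.633327

# How many times faster it is to dump a tstore than a CSV.
dump_speedup = csv_dump_s / tstore_dump_s

# Fraction of the CSV disk footprint used by the tstore.
size_fraction = tstore_size_mb / csv_size_mb

# How many times faster reading the CSV is than reading the whole tstore.
csv_vs_full_read = tstore_read_all_s / csv_read_s

# How many times faster reading only temperature from the tstore is
# than reading the whole CSV.
subset_speedup = csv_read_s / tstore_read_temperature_s

print(f"Dump speedup (tstore vs CSV): {dump_speedup:.1f}x")
print(f"Disk usage (tstore / CSV): {size_fraction:.0%}")
print(f"Full read (CSV vs tstore): {csv_vs_full_read:.2f}x")
print(f"Temperature-only read (tstore vs CSV): {subset_speedup:.1f}x")
```

Keeping the arithmetic next to the recorded outputs makes it easier to update the prose whenever the notebook is re-executed and the timings change.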
