another incremental update, maybe we can finish this next week
betolink committed Oct 3, 2024
1 parent 4210ed2 commit 57b255a
Showing 24 changed files with 1,641 additions and 417 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -7,4 +7,5 @@ docs
/notebooks/*files/
*.pyc
__pycache__/

/site_libs/manuscript-notebook/
.ipynb_checkpoints/
2 changes: 2 additions & 0 deletions README.md
@@ -7,6 +7,8 @@

This repository contains use case gathering, benchmarking, and prototyping work related to cloud-optimization of ICESat-2 data, with the overall goal of better enabling cloud access patterns for the ICESat-2 community. The audience of this repository includes ICESat-2 data providers and tool and service developers with experience and interest in developing solutions to improve the performance of ICESat-2 data in the cloud.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nsidc/cloud-optimized-icesat2/main?labpath=notebooks%2F)


## Level of Support

4 changes: 1 addition & 3 deletions _quarto.yml
@@ -2,9 +2,7 @@ project:
type: manuscript
render:
- paper.qmd
- notebooks/portable-full-comparison.ipynb
- notebooks/h5py.ipynb
- optimize.py
- notebooks/h5py-atl03.ipynb
output-dir: docs
manuscript:
article: paper.qmd
4 changes: 3 additions & 1 deletion notebooks/environment.yml → environment.yml
@@ -3,14 +3,16 @@ channels:
- conda-forge
dependencies:
- python=3.11
- jupyterlab>3
- jupyterlab>4
- fsspec>=2024.05
- s3fs
- numpy<2.0
- matplotlib-base
- pandas
- xarray
- dask
- distributed
- dask-labextension
- geopandas
- h5py>=3.10
- zarr
Binary file added figures/figure-4.png
Binary file modified figures/figure-5.png


44 changes: 44 additions & 0 deletions notebooks/benchmarks.csv
@@ -0,0 +1,44 @@
,tool,dataset,cloud-aware,format,file,time,shape,bytes_requested,mean
0,h5py,7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.h5,25.779539585113525,"(46484912,)",289026695.0,1035.1631
1,h5py,7GB,no,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-4mb.h5,68.79631090164185,"(46484912,)",1036723526.0,1035.1631
2,h5py,7GB,no,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-8mb.h5,62.35629677772522,"(46484912,)",947145210.0,1035.1631
3,h5py,7GB,no,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-4mb.h5,27.586012840270996,"(46484912,)",286737116.0,1035.1631
4,h5py,7GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.h5,27.63655662536621,"(46484912,)",269539164.0,1035.1631
5,kerchunk,1GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01.json,3.2203612327575684,"(9720204,)",,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(386.06738, dtype=float32)"
6,kerchunk,1GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01_rechunked-100k-page-8mb.json,4.984490156173706,"(9720204,)",,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(386.06738, dtype=float32)"
7,kerchunk,7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.json,14.271384954452515,"(46484912,)",,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(1035.1631, dtype=float32)"
8,kerchunk,7GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.json,8.327512979507446,"(46484912,)",,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(1035.1631, dtype=float32)"
9,xarray,1GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01.h5,216.67426800727844,"(9720204,)",3125613354.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(386.06738, dtype=float32)"
10,xarray,1GB,no,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01-page-only-4mb.h5,224.73643493652344,"(9720204,)",3671050532.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(386.06738, dtype=float32)"
11,xarray,1GB,no,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01-page-only-4mb.h5,227.53268146514893,"(9720204,)",3671050532.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(386.06738, dtype=float32)"
12,xarray,1GB,no,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01_rechunked-100k-page-4mb.h5,15.311556816101074,"(9720204,)",162702540.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(386.06738, dtype=float32)"
13,xarray,1GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01_rechunked-100k-page-8mb.h5,13.107250928878784,"(9720204,)",136218854.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(386.06738, dtype=float32)"
14,xarray,7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.h5,281.35928440093994,"(46484912,)",3350528419.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(1035.1631, dtype=float32)"
15,xarray,7GB,no,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-4mb.h5,435.6504566669464,"(46484912,)",5526810548.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(1035.1631, dtype=float32)"
16,xarray,7GB,no,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-8mb.h5,359.13231015205383,"(46484912,)",5337618697.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(1035.1631, dtype=float32)"
17,xarray,7GB,no,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-4mb.h5,156.73291158676147,"(46484912,)",2064608594.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(1035.1631, dtype=float32)"
18,xarray,7GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.h5,41.19576835632324,"(46484912,)",432504529.0,"<xarray.DataArray 'h_ph' ()> Size: 4B
array(1035.1631, dtype=float32)"
19,h5coro,1GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01.h5,10.852986097335815,"(9720204,)",,386.06738
20,h5coro,1GB,no,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01-page-only-4mb.h5,8.163445472717285,"(9720204,)",,386.06738
21,h5coro,1GB,no,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01-page-only-4mb.h5,6.128530740737915,"(9720204,)",,386.06738
22,h5coro,1GB,no,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01_rechunked-100k-page-4mb.h5,19.339251279830933,"(9720204,)",,386.06738
23,h5coro,1GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20230618223036_13681901_006_01_rechunked-100k-page-8mb.h5,10.59383749961853,"(9720204,)",,386.06738
24,h5coro,7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.h5,24.180256843566895,"(46484912,)",,1035.1631
25,h5coro,7GB,no,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-4mb.h5,51.16006398200989,"(46484912,)",,1035.1631
26,h5coro,7GB,no,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-8mb.h5,23.30572533607483,"(46484912,)",,1035.1631
27,h5coro,7GB,no,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-4mb.h5,34.204936504364014,"(46484912,)",,1035.1631
28,h5coro,7GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.h5,33.829309940338135,"(46484912,)",,1035.1631
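
The `time` and `bytes_requested` columns above are easier to compare once they are normalized against the `original` file. The snippet below is a minimal sketch of that comparison, not part of the commit: it embeds a reduced copy of rows 0-4 (h5py, 7GB granule) with only the columns it needs, and the names `relative_time` and `mib_requested` are introduced here for illustration.

```python
import csv
import io

# Reduced copy of rows 0-4 of notebooks/benchmarks.csv (h5py, 7GB granule).
# Only the columns used below are kept; values are copied from the file.
CSV = """tool,dataset,format,time,bytes_requested
h5py,7GB,original,25.779539585113525,289026695.0
h5py,7GB,page-only-4mb,68.79631090164185,1036723526.0
h5py,7GB,page-only-8mb,62.35629677772522,947145210.0
h5py,7GB,rechunked-4mb,27.586012840270996,286737116.0
h5py,7GB,rechunked-8mb,27.63655662536621,269539164.0
"""

rows = list(csv.DictReader(io.StringIO(CSV)))

# Use the unmodified file as the baseline for the time comparison.
baseline = next(float(r["time"]) for r in rows if r["format"] == "original")

summary = {
    r["format"]: {
        # Time relative to the original file (1.0 means no change).
        "relative_time": round(float(r["time"]) / baseline, 2),
        # Bytes requested from S3, converted to MiB.
        "mib_requested": round(float(r["bytes_requested"]) / 2**20, 2),
    }
    for r in rows
}

for fmt, stats in summary.items():
    print(f"{fmt:>15}  {stats['relative_time']:>5.2f}x  {stats['mib_requested']:>8.2f} MiB")
```

Each row comes from a single timed read, so the ratios are indicative rather than statistically robust.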
Binary file added notebooks/byte_ranges.pkl.gz
Binary file not shown.
140 changes: 140 additions & 0 deletions notebooks/h5logger.py
@@ -0,0 +1,140 @@
import re
import numpy as np
import pandas as pd
import logging
import s3fs
import fsspec
import time
import h5py
from datetime import datetime
from uuid import uuid4


def parse_fsspec_log(log_path):
    """
    This method only parses fsspec logs that have a FileSize: attached to them.
    """
    # Raw strings keep the regex escapes (\s, \d) from being read as string escapes.
    head_line = re.compile(r'<File-like object S3FileSystem, .*?>\s*(read: 0 - \d+)')
    fsize_line = re.compile(r'FileSize: (\d+)')
    # range_line = re.compile(r'<File-like object S3FileSystem, .*?>\s* read: (?P<start>[0-9]+) - (?P<end>[0-9]+)')
    range_line = re.compile(r'<File-like object S3FileSystem, .*?>\s* read: (?P<start>[0-9]+) - (?P<end>[0-9]+)(?:\s*,\s*.*?:\s*(?P<hits>[0-9]+)\s*hits,\s*(?P<misses>[0-9]+)\s*misses)?')

    ranges = list()
    with open(log_path) as logtxt:
        # The log must contain an initial read that starts at byte 0.
        for line in logtxt:
            if head_line.match(line):
                break
        else:
            raise RuntimeError('HEAD line not found in the log file')

        # The log must also contain the FileSize line written by read_file().
        for line in logtxt:
            match = fsize_line.match(line)
            if match:
                fsize = int(match.group(1))
                break
        else:
            raise RuntimeError('FILESIZE line not found in the log file')

        # Rewind and collect every byte range, with cache stats when present.
        logtxt.seek(0)
        for line in logtxt:
            match = range_line.match(line)
            if match:
                start = int(match.group(1))
                end = int(match.group(2))
                hits = match.group(3)
                missed = match.group(4)
                rsize = end - start + 1

                ranges.append({"start": start, "end": end, "size": rsize, "hits": hits, "missed": missed})

    df = pd.DataFrame(ranges, columns=['start', 'end', 'size', 'hits', 'missed'])
    return df

def read_file(info):
    h5py_fsspec_benchmarks = {}
    ranges = None
    file_size = None
    block_size = None
    iteration, dataset, variables, flavor, url, optimized_read, driver, default_io_params, optimized_io_params = info
    # kerchunk reference files are not read by this benchmark
    if url.endswith(".json"):
        return {}
    io_params = default_io_params
    if optimized_read:
        if "rechunked" in url or "page" in url:
            optimized = "yes"
            print(f"Reading: {url} with optimized I/O parameters")
            io_params = optimized_io_params
            block_size = io_params["fsspec_params"]["block_size"]
        else:
            # we cannot read the original file with optimized parameters
            optimized = "no"
            print(f"Reading: {url} with default parameters")
    else:
        optimized = "no"
        print(f"Reading: {url} with default parameters")
    cache_type = io_params["fsspec_params"]["cache_type"]

    # this is mostly IO so no perf_counter is needed
    start = time.time()
    if driver == "fsspec":
        fs = s3fs.S3FileSystem(anon=True)
        logger = logging.getLogger('fsspec')
        logger.setLevel(logging.DEBUG)
        file_info = fs.info(url)
        file_size = file_info['size']
        file_name = url.split("/")[-1]
        current_time = datetime.now()
        formatted_time = current_time.strftime(f"%Y-%m-%d_%H-%M-%S-{uuid4()}")
        log_filename = f"logs/fsspec-{file_name}-{driver}-{optimized}-{formatted_time}.log"
        # Create a new FileHandler for each iteration
        file_handler = logging.FileHandler(log_filename)
        file_handler.setLevel(logging.DEBUG)
        # Add the handler to the root logger
        logging.getLogger().addHandler(file_handler)
        with fs.open(url, mode="rb", **io_params["fsspec_params"]) as fo:
            with h5py.File(fo, **io_params["h5py_params"]) as f:
                for variable in variables:
                    data = f[variable][:]
                    data_mean = data.mean()
            req_bytes = fo.cache.total_requested_bytes
        # parse_fsspec_log() expects this line, so write it before detaching the handler
        logger.debug(f"FileSize: {file_size}")
        logging.getLogger().removeHandler(file_handler)
        file_handler.close()
        ranges = parse_fsspec_log(log_filename)
    else:
        cloud_params = {
            "mode": "r",
            "driver": "ros3",
            "aws_region": "us-west-2".encode("utf-8")
        }
        with h5py.File(url, **io_params["h5py_params"], **cloud_params) as f:
            for variable in variables:
                data = f[variable][:]
                data_mean = data.mean()
        req_bytes = None  # not available
    elapsed = time.time() - start
    return {
        "benchmark": {
            "iteration": iteration,
            "library": "h5py",
            "driver": driver,
            "dataset": dataset,
            "optimized-read": optimized,
            "format": flavor,
            "file": url,
            "time": elapsed,
            "shape": data.shape,
            "bytes_requested": req_bytes,
            "mean": data_mean},
        "ranges": {
            "file": url,
            "optimized-read": optimized,
            "cache_type": cache_type,
            "block_size": block_size,
            "time": elapsed,
            "bytes_requested": req_bytes,
            "file_size": file_size,
            "ranges": ranges}
    }
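
To see what `parse_fsspec_log` extracts, it helps to run its `range_line` pattern against a few lines in isolation. The sketch below copies that regex from the module; the log lines, bucket name, and file name are hypothetical samples written to match the shape the pattern expects, not output captured from a real run.

```python
import re

# Same pattern as range_line in h5logger.py, split across two raw strings.
RANGE_LINE = re.compile(
    r"<File-like object S3FileSystem, .*?>\s* read: (?P<start>[0-9]+) - (?P<end>[0-9]+)"
    r"(?:\s*,\s*.*?:\s*(?P<hits>[0-9]+)\s*hits,\s*(?P<misses>[0-9]+)\s*misses)?"
)

# Hypothetical log lines (assumed format; the path is made up for illustration).
SAMPLE_LOG = [
    "<File-like object S3FileSystem, example-bucket/granule.h5> read: 0 - 8388608 , readahead: 0 hits, 1 misses",
    "<File-like object S3FileSystem, example-bucket/granule.h5> read: 8388608 - 8388616",
    "FileSize: 7000000000",
]

parsed = []
for line in SAMPLE_LOG:
    m = RANGE_LINE.match(line)
    if m is None:
        # Lines such as "FileSize: ..." are not byte ranges and are skipped.
        continue
    parsed.append({
        "start": int(m["start"]),
        "end": int(m["end"]),
        # The cache statistics group is optional, so these can be None.
        "hits": m["hits"],
        "misses": m["misses"],
    })

print(parsed)
```

The cache statistics are optional in the pattern, so a line without them still yields a range, with `hits` and `misses` left as `None`.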
21 changes: 21 additions & 0 deletions notebooks/h5py-atl03-benchmarks.csv
@@ -0,0 +1,21 @@
,iteration,library,driver,dataset,optimized-read,format,file,time,shape,bytes_requested,mean
0,0,h5py,ros3,ATL03-7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.h5,2007.8612365722656,"(46484912,)",,-66.14486441580091
1,0,h5py,ros3,ATL03-7GB,no,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-4mb.h5,2142.443084716797,"(46484912,)",,-66.14486441580091
2,0,h5py,ros3,ATL03-7GB,no,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-8mb.h5,2124.307473897934,"(46484912,)",,-66.14486441580091
3,0,h5py,ros3,ATL03-7GB,no,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-4mb.h5,288.2669167518616,"(46484912,)",,-66.14486441580091
4,0,h5py,ros3,ATL03-7GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.h5,274.670058965683,"(46484912,)",,-66.14486441580091
5,0,h5py,fsspec,ATL03-7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.h5,2062.315348148346,"(46484912,)",0.0,-66.14486441580091
6,0,h5py,fsspec,ATL03-7GB,no,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-4mb.h5,2226.6228652000427,"(46484912,)",0.0,-66.14486441580091
7,0,h5py,fsspec,ATL03-7GB,no,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-8mb.h5,2231.902267932892,"(46484912,)",0.0,-66.14486441580091
8,0,h5py,fsspec,ATL03-7GB,no,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-4mb.h5,325.2194719314575,"(46484912,)",0.0,-66.14486441580091
9,0,h5py,fsspec,ATL03-7GB,no,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.h5,362.87884545326233,"(46484912,)",0.0,-66.14486441580091
10,0,h5py,ros3,ATL03-7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.h5,1997.6913211345673,"(46484912,)",,-66.14486441580091
11,0,h5py,ros3,ATL03-7GB,yes,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-4mb.h5,87.55571722984314,"(46484912,)",,-66.14486441580091
12,0,h5py,ros3,ATL03-7GB,yes,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-8mb.h5,95.51402950286865,"(46484912,)",,-66.14486441580091
13,0,h5py,ros3,ATL03-7GB,yes,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-4mb.h5,81.75408124923706,"(46484912,)",,-66.14486441580091
14,0,h5py,ros3,ATL03-7GB,yes,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.h5,75.82156276702881,"(46484912,)",,-66.14486441580091
15,0,h5py,fsspec,ATL03-7GB,no,original,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02.h5,2091.811399459839,"(46484912,)",0.0,-66.14486441580091
16,0,h5py,fsspec,ATL03-7GB,yes,page-only-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-4mb.h5,159.82449340820312,"(46484912,)",771751936.0,-66.14486441580091
17,0,h5py,fsspec,ATL03-7GB,yes,page-only-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02-page-only-8mb.h5,113.62834787368774,"(46484912,)",754974720.0,-66.14486441580091
18,0,h5py,fsspec,ATL03-7GB,yes,rechunked-4mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-4mb.h5,171.39278054237366,"(46484912,)",654311424.0,-66.14486441580091
19,0,h5py,fsspec,ATL03-7GB,yes,rechunked-8mb,s3://its-live-data/test-space/cloud-experiments/h5cloud/atl03/ATL03_20181120182818_08110112_006_02_rechunked-100k-page-8mb.h5,169.21530079841614,"(46484912,)",645922816.0,-66.14486441580091
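
One way to read the table above is as a speedup of optimized over default I/O parameters for each file layout. The sketch below does this for the `ros3` driver only, using the times from rows 1-4 (default) and rows 11-14 (optimized). The dictionary names are introduced here for illustration, and since every row is a single run (iteration 0) the ratios are indicative only.

```python
# Elapsed times for the ros3 driver, copied from h5py-atl03-benchmarks.csv.
# Rows 1-4: optimized-read = "no"; rows 11-14: optimized-read = "yes".
default_times = {
    "page-only-4mb": 2142.443084716797,
    "page-only-8mb": 2124.307473897934,
    "rechunked-4mb": 288.2669167518616,
    "rechunked-8mb": 274.670058965683,
}
optimized_times = {
    "page-only-4mb": 87.55571722984314,
    "page-only-8mb": 95.51402950286865,
    "rechunked-4mb": 81.75408124923706,
    "rechunked-8mb": 75.82156276702881,
}

# Speedup > 1 means the optimized parameters were faster for that layout.
speedup = {
    fmt: default_times[fmt] / optimized_times[fmt]
    for fmt in default_times
}

for fmt, ratio in sorted(speedup.items(), key=lambda item: item[1], reverse=True):
    print(f"{fmt:>15}: {ratio:5.1f}x faster with optimized I/O parameters")
```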