Author: David Salac http://www.github.com/david-salac
Extends the default NetCDF4 driver (package netCDF4) by providing helpful functionality like reading and writing to variable in some chunks (previously pre-stored in memory - Dask like style of chunking) or dealing with variables regardless of its dimension order. Principally directly extends netCDF4 Dataset class with new functionality. Covers most of the functionality that is often the only reason why developers choose to use stogy libraries like xarray or Dask.
The typical task that occurs during data processing is to swap computed data to the disk (due to shortage of memory). This can be done by using Dask and similar tools that are notoriously known for the inefficiency (the developer has to struggle with the finding of the proper chunk size etc.). Another way is to stream directly to the NetCDF4 file. Unfortunately, default NetCDF4 driver available in Python does not support any simple way to do so (and neither does GDAL or xarray).
The same rationale is valid also for the reading of the data from the large NetCDF4 file where the size of the variable can exceed the size of the available memory.
The presented utility offers the simplest way how to treat this issue. It
allows the developer directly stream to the desired variable inside the NetCDF4
file (by selected dimension). It also allows to read (yield) from the
NetCDF4 variable the chunks of data. The main class simply extends (inherits)
from the netCDF4.Dataset
class and add the possibility to create a variable
for streaming using method createStreamerVariable
.
Another quite typical situation occurs when a user does not exactly know the order of dimensions in the variable before you write into it (or read from it). Or similarly, when the application is in development and order can be changed at some point. This issue is efficiently solved in xarray library. But installing xarray involves many drawbacks like having a massive library in a system (that is not useful otherwise) or risking memory leaks (often related even to the correct usage of xarray). This library directly extends the native netCDF4 drive. Accordingly, it is minimalistic and solves exactly this typical problem.
Suppose that user wants to create a variable called 'var'
with dimensions
called (d1, d2, d3)
with sizes (500, 20, 600)
and swapped (stream) to the
variable using the dimension 'd2' as a pivotal one.
Following example shows the way how to deal with this issue. There are two options what to stream: stream line by line (entity by entity) or stream the whole blob of data.
Once the job is finished, variable must be closed using the flush
method.
import netCDF4_enhancement as ne
# Create a file handler:
fh = ne.NetCDF4Streamer("PATH_TO_NC.nc", "w")
# Define dimensions with names: ("d1", "d2", "d3"), sizes (500, 20, 600)
fh.createDimension("d1", 500)
fh.createDimension("d2", 20)
fh.createDimension("d3", 600)
# Suppose that user wants to have dimensions (d1, d2, d3) and stream
# to the variable using the dimension "d2" and the in-memory "swap" (chunk)
# of size 3 MB.
# Create a streamed variable called 'var' (following expected API):
variable = fh.createStreamerVariable("var", "f8", ("d1", "d2", "d3")),
chunk_dimension="d2",
chunk_size_mb=3)
# Writing the data to the variable:
# A) Write the single value (single entity):
data_set = np.random.random((500, 600))
variable.streamNumpyData(data_set, single_entity=True)
variable.flush() # MUST BE CALLED EXPLICITELY WHEN FINISHED
# B) Writing the chunk of values:
data_set = np.random.random((500, 19, 600))
variable.streamNumpyData(data_set, single_entity=False)
variable.flush() # MUST BE CALLED EXPLICITELY WHEN FINISHED
# ...
# Best practice is to close file at the end (not required)
fh.close()
The purpose of this part is to read from the NetCDF4 variable in some chunks of the defined size. Logic is the opposite of the streaming logic described above.
Suppose that we have a variable described above and we want to read chunks by slicing the dimension d2:
import netCDF4_enhancement as ne
fh = ne.NetCDF4Streamer("PATH_TO_NC.nc", "r")
variable = fh.openStreamerVariable("var",
chunk_dimension="d2",
chunk_size_mb=3)
# ...
# For reading the file "line by line"
for line in variable.yieldNumpyData(True):
# Work with the line
# ...
# For reading the file by blobs (chunks)
for blob in variable.yieldNumpyData(False):
# Work with the blob
# ...
It can be helpful to access the netCDF4.Variable
object inside the
streamed variable (e. g. for defining of some attributes). To do so
follow the logic (example define the attribute description
):
# ...
wrapped_variable = fh.createStreamerVariable("var", "f8", ("d1", "d2", "d3")),
chunk_dimension="d2",
chunk_size_mb=3)
# ...
# Access the netCDF4.Variable object:
netCDF4_variable_object: netCDF4.Variable = wraped_variable.netcdf4_variable
# Define the description attribute (in writing/append mode only):
netCDF4_variable_object.description = "Streamed variable!"
# ...
If the dimension order of data set to be stream to the file or
the expected dimension order for file to be read from the file do not
match to the actual order of dimension in the file, you can use the
axes_order
property to treat this issue.
For example:
variable.axes_order = ("d2", "d3", "d1")
After calling this command, you can stream to the variable using proposed dimension order or read in this order.
If you want to reset to the default dimension order, just write:
variable.axes_order = None
For reading/writing data afterwards, you can use standard approach:
# Using slices
variable[:, :, :, :] = new_value_to_be_written # <- for writing
value_inside = variable[:, :, :, :] # <- for reading
# Slices and concrete indices:
variable[:, :, 34, :] = new_value_to_be_written # <- for writing
value_inside = variable[:, :, 34, :] # <- for reading
# Alternative method for reading value is using .sel method:
# - First option expect arguments ordered by the value of 'axes_order'
value_inside = variable.sel(1, 2, 3, 4)
# - Another approach is bit more universal, keys are names of dimensions
value_inside = variable.sel(d1=1, d2=2, d3=3, d4=4)
This overloaded operator for accessing the values does always return either
numpy.ndarray
object or the float
value (never masked array).
Documentation of the basic functionality.
Class NetCDF4Streamer
is the child of netCDF4.Dataset
class. Accordingly
inherits all the methods.
There are following new methods:
-
createStreamerVariable
: has the same interface asnetCDF4.Dataset.createVariable
. Creates new variable inside NetCDF4 file. ReturnsNetCDF4StreamerVariable
object. Method adds two new optional (keyword type) parameters:chunk_dimension (Optional[str])
: define dimension by that data are streamed to the variable, ifNone
the first dimension is chosen. Default value isNone
. It is pertinent only if you want to use streaming to the variable.chunk_size_mb (float)
: Size of the memory chunk to that data are streamed first (before they are streamed to the variable on the disk). Unit is megabyte. Default size is 512 MB. It is pertinent only if you want to use streaming to the variable.
-
openStreamerVariable
: has the same logic asnetCDF4.Dataset['variable_name']'
. Opens an existing variable inside NetCDF4 file. ReturnsNetCDF4StreamerVariable
object. Method adds two new optional (keyword type) parameters:chunk_dimension (Optional[str])
: define dimension by that data are streamed to the variable, ifNone
the first dimension is chosen. Default value isNone
. It is pertinent only if you want to use streaming to the variable.chunk_size_mb (float)
: Size of the memory chunk to that data are streamed first (before they are streamed to the variable on the disk). Unit is megabyte. Default size is 512 MB. It is pertinent only if you want to use streaming to the variable.
Represent the crucial class in the logic.
There are following methods and properties:
streamNumpyData
: Method that stream the data to the NetCDF4 variable. With parameters:data (np.ndarray)
: The line or the blob of the data to be streamed to the variable.single_entity (bool)
: If True data are streamed as a single "line" (one entity), if False the whole blob is streamed. Number of dimensions is of -1 smaller for True value.
yieldNumpyData
: Method that yields the data from the NetCDF4 variable. With parameters:single_entity (bool)
: If True data are streamed as a single "line" (one entity), if False the whole blob is streamed. Number of dimensions is of -1 smaller for True value.
flush
: Important method that store temporal results of streaming into the physical variable.axes_order
: Property that gets or sets dimension order that is required. Default order can be re-seted usingobj.axes_order = None
. Returns/accepts the tuple of dimension names (string).dimensions
: Return the physical dimension order (tuple of dimension names as string values).__getitem__
: Operator for getting value in the position reflecting order of dimensions defined by theaxes_order
property (similar logic to the normalnetCDF4
variable). Always return eitherfloat
value ornp.ndarray
value (never the mask array).__setitem__
: Operator for setting value in the position reflecting order of dimensions defined by theaxes_order
property (similar logic to the normalnetCDF4
variable).netcdf4_variable
: Return the variable as anetCDF4.Variable
object.sel
: Return the value at given position. Location can be defined either by the positional argument (following the order defined by the 'axes_order' property, or by the keyword logic with key equals to the dimension name and value equals to the selected position.
To install the most actual package, use the command:
git clone https://github.com/david-salac/NetCDF4-variable-streamer
cd NetCDF4-variable-streamer/
python setup.py install
or simply install using PIP:
pip install netCDF4-enhancement
For running package unit-tests, use command:
python setup.py test
In order to run package unit-tests you need to clone package first.