DINCAE.jl
DINCAE (Data-Interpolating Convolutional Auto-Encoder) is a neural network to reconstruct missing data in satellite observations. It can work with gridded data (DINCAE.reconstruct) or with clouds of points (DINCAE.reconstruct_points). In the latter case, the data can be organized in e.g. tracks (or not).
The code is available at: https://github.com/gher-uliege/DINCAE.jl
The method is described in the following articles:
- Barth, A., Alvera-Azcárate, A., Ličer, M., & Beckers, J.-M. (2020). DINCAE 1.0: a convolutional neural network with error estimates to reconstruct sea surface temperature satellite observations. Geoscientific Model Development, 13(3), 1609–1622. https://doi.org/10.5194/gmd-13-1609-2020
- Barth, A., Alvera-Azcárate, A., Troupin, C., & Beckers, J.-M. (2022). DINCAE 2.0: multivariate convolutional neural network with error estimates to reconstruct sea surface temperature satellite and altimetry observations. Geoscientific Model Development, 15(5), 2183–2196. https://doi.org/10.5194/gmd-15-2183-2022
The neural network will be trained on the GPU. Note that convolutional neural networks can require a lot of GPU memory, depending on the domain size. Flux.jl supports NVIDIA GPUs as well as GPUs from other vendors (see https://fluxml.ai/Flux.jl/stable/gpu/ for details). Training on the CPU is possible, but it is prohibitively slow.
User API
In most cases, a user only needs to interact with the functions DINCAE.reconstruct or DINCAE.reconstruct_points.
DINCAE.reconstruct — Function
reconstruct(Atype, data_all, fnames_rec; ...)
Train a neural network to reconstruct missing data using the training data set and periodically run the neural network on the test data set. The data is assumed to be available on a regular longitude/latitude grid (which is the case for L3 satellite data).
Mandatory parameters:
- Atype: array type to use
- data_all: list of named tuples. Every tuple should have the fields filename and varname. data_all[1] will be used for training (and perturbed to prevent overfitting). All other entries data_all[2:end] will be reconstructed using the trained network at the epochs defined by save_epochs.
- fnames_rec: vector of file names corresponding to the entries data_all[2:end]
Optional parameters:
- epochs: the number of epochs (default 1000)
- batch_size: the size of a mini-batch (default 50)
- enc_nfilter_internal: number of filters of the internal encoding layers (default [16,24,36,54])
- skipconnections: list of layers with skip connections (default 2:(length(enc_nfilter_internal)+1))
- clip_grad: maximum allowed gradient. Elements of the gradients larger than this value will be clipped (default 5.0).
- regularization_L2_beta: parameter for L2 regularization (default 0, i.e. no regularization)
- save_epochs: list of epochs where the results should be saved (default 200:10:epochs)
- is3D: switch to apply 2D (is3D == false) or 3D (is3D == true) convolutions (default false)
- upsampling_method: interpolation method during upsampling, which can be either :nearest or :bilinear (default :nearest)
- ntime_win: number of time instances within the time window. This number should be odd (default 3).
- learning_rate: initial learning rate of the ADAM optimizer (default 0.001)
- learning_rate_decay_epoch: the exponential decay scale of the learning rate. After learning_rate_decay_epoch epochs the learning rate is halved. The learning rate is computed as learning_rate * 0.5^(epoch / learning_rate_decay_epoch). learning_rate_decay_epoch can be Inf for a constant learning rate (default).
- min_std_err: minimum error standard deviation preventing a division close to zero (default exp(-5) = 0.006737946999085467)
- loss_weights_refine: the weights of the individual refinement layers used in the cost function. If loss_weights_refine has a single element, then there is no refinement (default (1.,)).
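As a small illustration of the learning-rate schedule described above (the values below are examples, not recommended settings), the learning rate is halved every learning_rate_decay_epoch epochs:

```julia
# Illustrative values only
learning_rate = 0.001
learning_rate_decay_epoch = 50

for epoch in (0, 50, 100)
    lr = learning_rate * 0.5^(epoch / learning_rate_decay_epoch)
    println("epoch $epoch: learning rate = $lr")  # 0.001, then 0.0005, then 0.00025
end
```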
Note that the optional parameters may also need to be tuned for a particular application.
Internally, the time mean is removed (by default) from the data before it is reconstructed. The time mean is added back when the file is saved. However, the mean is undefined for pixels that are defined as valid (sea) by the mask but do not have any valid data in the training dataset.
See DINCAE.load_gridded_nc for more information about the netCDF file.
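A minimal, hypothetical call might look as follows; the file names "sst-train.nc", "sst-cv.nc" and "sst-cv-rec.nc" are illustrative assumptions and not part of this documentation:

```julia
using CUDA, cuDNN
using DINCAE

Atype = CuArray{Float32}  # array type; Array{Float32} for (slow) CPU training
data_all = [
    (filename = "sst-train.nc", varname = "SST"),  # used for training
    (filename = "sst-cv.nc",    varname = "SST"),  # reconstructed at save_epochs
]
fnames_rec = ["sst-cv-rec.nc"]  # output file for data_all[2]

DINCAE.reconstruct(Atype, data_all, fnames_rec;
                   epochs = 1000,
                   save_epochs = 200:10:1000)
```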
DINCAE.reconstruct_points — Function
DINCAE.reconstruct_points(T, Atype, filename, varname, grid, fnames_rec)
Mandatory parameters:
- T: Float32 or Float64: the floating-point type used by the neural network
- Atype: Array{T}, CuArray{T}, ...: the array type used by the neural network
- filename: NetCDF file in the format described below
- varname: name of the primary variable in the NetCDF file
- grid: tuple of ranges with the grid in the longitude and latitude directions, e.g. (-180:1:180,-90:1:90)
- fnames_rec: NetCDF file names of the reconstruction
Optional parameters:
- jitter_std_pos: standard deviation of the noise to be added to the position of the observations (default (5,5))
- auxdata_files: gridded auxiliary data files for a multivariate reconstruction. auxdata_files is an array of named tuples with the fields filename (the file name of the NetCDF file), varname (the NetCDF name of the primary variable) and errvarname (the NetCDF name of the expected standard deviation error).
- probability_skip_for_training: for a given time step n, every track from the same time step n will be skipped with this probability during training (default 0.2). This does not affect the tracks from the previous (n-1, n-2, ...) and following (n+1, n+2, ...) time steps. The goal of this parameter is to force the neural network to learn to interpolate the data in time.
- paramfile: the path of the (netCDF) file where the parameter values are stored (default: nothing)
For example, a single entry of auxdata_files could be:
auxdata_files = [
  (filename = "big-sst-file.nc",
   varname = "SST",
   errvarname = "SST_error")]
The data in the file should already be interpolated onto the target grid. The file structure of the NetCDF file is described in DINCAE.load_gridded_nc. The fields defined in this file should not have any missing values (see DIVAnd.ufill).
See DINCAE.reconstruct for other optional parameters.
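A hypothetical invocation could look as follows; the file name "all-sla.train.nc", the variable name "sla" and the output name are assumptions for illustration:

```julia
using CUDA, cuDNN
using DINCAE

T = Float32
Atype = CuArray{T}
grid = (-180:1:180, -90:1:90)  # longitude and latitude ranges
fnames_rec = ["sla-rec.nc"]    # reconstruction output

DINCAE.reconstruct_points(T, Atype, "all-sla.train.nc", "sla", grid, fnames_rec;
                          jitter_std_pos = (5, 5))
```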
A (minimal) example of the NetCDF file is:
netcdf all-sla.train {
...
        double dtime(obs) ;
                dtime:long_name = "time of measurement" ;
                dtime:units = "days since 1900-01-01 00:00:00" ;
...
}
The file should contain the variables lon (longitude), lat (latitude), dtime (time of measurement), id (numeric identifier, only used by post-processing scripts) and dates (time instances of the gridded field). The file should be in the contiguous ragged array representation as specified by the CF convention, which allows grouping data points into "features" (e.g. tracks for altimetry). A feature can also contain a single data point.
Internal functions
DINCAE.load_gridded_nc — Function
lon,lat,time,data,missingmask,mask = load_gridded_nc(fname,varname; minfrac = 0.05)
Load the variable varname from the NetCDF file fname. The variable lon is the longitude in degrees east, lat is the latitude in degrees north, time is a DateTime vector, data is a 3-d array with the data, missingmask is a boolean mask where true means the data is missing, and mask is a boolean mask where true means the data location is valid (e.g. sea points for sea surface temperature).
At the bare minimum, a NetCDF file should have the following variables and attributes:
netcdf file.nc {
dimensions:
        time = UNLIMITED ; // (5266 currently)
        lat = 112 ;
...
        int mask(lat, lon) ;
        float SST(time, lat, lon) ;
                SST:_FillValue = -9999.f ;
}
The netCDF mask is 0 for invalid pixels (e.g. land for an ocean application) and 1 for valid pixels (e.g. ocean).
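A hypothetical usage sketch, assuming a file "file.nc" with a variable "SST" as in the CDL example above:

```julia
using DINCAE

# minfrac (minimum fraction of valid data) is shown with its default value
lon, lat, time, data, missingmask, mask =
    DINCAE.load_gridded_nc("file.nc", "SST"; minfrac = 0.05)
```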
DINCAE.NCData — Type
dd = NCData(lon,lat,time,data_full,missingmask,ndims;
            train = false,
            obs_err_std = fill(1.,size(data_full,3)),
            jitter_std = fill(0.05,size(data_full,3)),
            mask = trues(size(data_full)[1:2]),
)
Return a structure holding the data for training (train = true) or testing (train = false) the neural network. obs_err_std is the error standard deviation of the observations. The variable lon is the longitude in degrees east, lat is the latitude in degrees north, time is a DateTime vector, data_full is a 3-d array with the data and missingmask is a boolean mask where true means the data is missing. jitter_std is the standard deviation of the noise to be added to the data during training.
Reducing GPU memory usage
Convolutional neural networks can require a lot of GPU memory. The following changes can reduce GPU memory utilisation:
- reduce the mini-batch size
- use fewer layers (e.g. enc_nfilter_internal = [16,24,36] or [16,24])
- use fewer filters (reduce the values of the optional parameter enc_nfilter_internal)
- use a smaller domain or a lower resolution
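For example, a reduced-memory configuration could combine a smaller mini-batch size with fewer encoding layers and filters (illustrative values; Atype, data_all and fnames_rec as in DINCAE.reconstruct):

```julia
DINCAE.reconstruct(Atype, data_all, fnames_rec;
                   batch_size = 10,                      # smaller mini-batches
                   enc_nfilter_internal = [16, 24, 36])  # fewer layers/filters
```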
Troubleshooting
Installation of cuDNN
If you get the warning Package cuDNN not found in current path or the error Scalar indexing is disallowed:
julia> using DINCAE
┌ Warning: Package cuDNN not found in current path.
│ - Run `import Pkg; Pkg.add("cuDNN")` to install the cuDNN package, then restart julia.
│ - If cuDNN is not installed, some Flux functionalities will not be available when running on the GPU.
You need to install and load cuDNN before calling a function in DINCAE.jl:
using cuDNN
using DINCAE
# ...
Dependencies of DINCAE.jl
DINCAE.jl depends on Flux.jl and CUDA.jl, which will be installed automatically. If you have problems installing these packages, you may consult the documentation of Flux.jl or CUDA.jl.
This document was generated with Documenter.jl version 1.8.0 on Thursday 28 November 2024. Using Julia version 1.11.1.