Example Single Machine Dataset Upload
The listing below is derived from a script that uploaded a multi-hundred-gigabyte uncompressed image dataset from an external hard drive to Google Cloud Storage, but it would also work with Amazon S3. It has the following helpful features:
- Resumable Upload
- Progress Printing
- Multi-core Upload
The resumable upload feature works by recording the index of each uploaded slice on disk: after a slice uploads successfully, the script touches a file named for that z index in a newly created `./progress/` directory. You can reset the upload with `rm -r ./progress`, or skip re-uploading a slice by touching its marker, e.g. `touch progress/5` to skip z=5.
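If you prefer to manage these markers from Python rather than the shell, here is a minimal sketch that does the same thing. The `progress` directory name matches the script further down; the z range is purely illustrative.

```python
import os
import shutil

# Reset the upload entirely (equivalent to `rm -r ./progress`).
shutil.rmtree('progress', ignore_errors=True)

# Mark slices z=0 through z=99 as already uploaded so they will be skipped
# (equivalent to running `touch progress/<z>` for each one).
os.makedirs('progress', exist_ok=True)
for z in range(100):
    open(os.path.join('progress', str(z)), 'a').close()
```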
A curious feature of this script is that it uses `ProcessPoolExecutor` as an independent multi-process runner rather than relying on CloudVolume's `parallel=True` option. This parallelizes the file reading and decoding step as well as the upload. `ProcessPoolExecutor` is used instead of `multiprocessing.Pool` because the original multiprocessing module can hang when a child process dies. A sketch showing how to surface per-slice errors from the executor appears after the listing below.
Please use the Python 3 code below as a guide.
```python
import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import tifffile
from cloudvolume import CloudVolume
from cloudvolume.lib import mkdir, touch

info = CloudVolume.create_new_info(
    num_channels = 1,
    layer_type = 'image', # 'image' or 'segmentation'
    data_type = 'uint16', # can pick any popular uint
    encoding = 'raw', # see: https://github.com/seung-lab/cloud-volume/wiki/Compression-Choices
    resolution = [ 4, 4, 4 ], # X,Y,Z values in nanometers
    voxel_offset = [ 0, 0, 1 ], # X,Y,Z values in voxels
    chunk_size = [ 1024, 1024, 1 ], # X,Y,Z size in voxels of the underlying chunks
    volume_size = [ 8368, 2258, 12208 ], # X,Y,Z size in voxels
)

# If you're using Amazon or the local file system, you can replace 'gs' with 's3' or 'file'
vol = CloudVolume('gs://bucket/dataset/layer', info=info)
vol.provenance.description = "Description of Data"
vol.provenance.owners = ['email_address_for_uploader/imager'] # list of contact email addresses

vol.commit_info() # generates gs://bucket/dataset/layer/info json file
vol.commit_provenance() # generates gs://bucket/dataset/layer/provenance json file

direct = 'local/path/to/images'

progress_dir = mkdir('progress/') # unlike os.mkdir, doesn't crash if the directory already exists
done_files = set([ int(z) for z in os.listdir(progress_dir) ])
all_files = set(range(vol.bounds.minpt.z, vol.bounds.maxpt.z)) # maxpt is exclusive

to_upload = [ int(z) for z in list(all_files.difference(done_files)) ]
to_upload.sort()

def process(z):
    img_name = 'brain_%06d.tif' % z
    print('Processing ', img_name)
    image = tifffile.imread(os.path.join(direct, img_name))
    image = np.swapaxes(image, 0, 1) # tifffile loads (Y, X); CloudVolume expects (X, Y)
    image = image[..., np.newaxis] # add a trailing Z axis: (X, Y, 1)
    vol[:, :, z] = image
    touch(os.path.join(progress_dir, str(z))) # mark this slice as done

with ProcessPoolExecutor(max_workers=8) as executor:
    executor.map(process, to_upload)
```
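Because `executor.map` only re-raises a worker's exception when its result is consumed, the script above quietly leaves failed slices for the next run. If you would rather see failures as they happen, a sketch like the following (not part of the original script) reuses `process()` and `to_upload` with `submit()` and `as_completed()`:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

with ProcessPoolExecutor(max_workers=8) as executor:
    futures = { executor.submit(process, z): z for z in to_upload }
    for future in as_completed(futures):
        z = futures[future]
        try:
            future.result()
        except Exception as err:
            # The progress file for this slice was never touched,
            # so rerunning the script will retry it.
            print('z=%d failed: %s' % (z, err))
```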
To work with RGB data, set `num_channels=3`, set `data_type='uint8'`, and make sure that RGB is the last axis of the image array.
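Here is a minimal sketch of the RGB variant. The bucket path, file name, and volume dimensions are placeholders rather than values from the original script; the points that matter are the channel count, the data type, and keeping the channel axis last.

```python
import numpy as np
import tifffile
from cloudvolume import CloudVolume

info = CloudVolume.create_new_info(
    num_channels = 3, # RGB
    layer_type = 'image',
    data_type = 'uint8', # 8 bits per channel
    encoding = 'raw',
    resolution = [ 4, 4, 4 ],
    voxel_offset = [ 0, 0, 0 ],
    chunk_size = [ 1024, 1024, 1 ],
    volume_size = [ 2048, 2048, 512 ], # placeholder; must match your slice dimensions
)
vol = CloudVolume('gs://bucket/dataset/rgb_layer', info=info)
vol.commit_info()

# An RGB TIFF slice typically loads as (Y, X, 3). Transpose to (X, Y, 3),
# then insert a Z axis so that RGB remains the last axis: (X, Y, 1, 3).
image = tifffile.imread('rgb_slice_000000.tif') # placeholder file name
image = np.transpose(image, (1, 0, 2))
image = image[:, :, np.newaxis, :]
vol[:, :, 0] = image
```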
To view an RGB image in neuroglancer, paste this code into the rendering box.
```glsl
void main() {
  vec3 data = vec3(toNormalized(getDataValue(0)), toNormalized(getDataValue(1)), toNormalized(getDataValue(2)));
  emitRGB(data);
}
```
Sharded images compact many files into a single randomly accessible file to reduce strain on the filesystem. You probably only need to worry about them once your data is in the multi-teravoxel range. To upload a sharded image, you have two options.
- Upload the original image and use `igneous xfer --sharded` to create a new sharded image, then delete the original upload.
- Follow the instructions in Creating a Sharded Image from Scratch.