Example Single Machine Dataset Upload
The listing below is derived from a script that uploaded a multi-hundred-gigabyte uncompressed image dataset from an external hard drive to Google Storage, but it would also work with Amazon S3. It has the following helpful features:
- Resumable Upload (the checkpointing pattern is sketched just below)
- Progress Printing
- Multi-core Upload
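Resumability comes from a simple checkpoint scheme: each completed z-slice leaves behind an empty marker file, and on restart the script only processes unmarked slices. Here is that pattern in isolation, as a minimal sketch; the directory name and the range of work units are placeholders, and the full listing further down uses the `mkdir` and `touch` helpers from `cloudvolume.lib` rather than the standard library calls shown here.

```python
import os

progress_dir = 'progress/'                # hypothetical checkpoint directory
os.makedirs(progress_dir, exist_ok=True)  # safe if it already exists

# Marker filenames are the z indices of completed slices.
done = set(int(z) for z in os.listdir(progress_dir))
todo = sorted(set(range(1, 101)) - done)  # e.g. 100 work units

for z in todo:
    # ... upload slice z here ...
    open(os.path.join(progress_dir, str(z)), 'w').close()  # record completion
```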
A curious feature of this script is that it uses `ProcessPoolExecutor` as an independent multi-process runner rather than CloudVolume's `parallel=True` option. This helps parallelize the file reading and decoding steps, not just the upload itself. `ProcessPoolExecutor` is used instead of `multiprocessing.Pool` because the original multiprocessing module hangs when a child process dies.
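To see why this matters, here is a minimal, self-contained demonstration (not part of the upload script): when a worker process dies abruptly, `ProcessPoolExecutor` raises `BrokenProcessPool` so the failure is visible, whereas `multiprocessing.Pool` in the same situation can simply hang.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def die(_):
    os._exit(1)  # simulate a worker killed mid-task (e.g. by the OOM killer)

if __name__ == '__main__':
    try:
        with ProcessPoolExecutor(max_workers=2) as executor:
            list(executor.map(die, range(4)))
    except BrokenProcessPool:
        print('A worker died; the pool failed loudly instead of hanging.')
```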
Please use the Python 3 code below as a guide.
```python
import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from PIL import Image

from cloudvolume import CloudVolume
from cloudvolume.lib import mkdir, touch

info = CloudVolume.create_new_info(
    num_channels = 1,
    layer_type = 'image', # 'image' or 'segmentation'
    data_type = 'uint16', # can pick any popular uint
    encoding = 'raw', # other option: 'jpeg' but it's lossy
    resolution = [ 16000, 16000, 25000 ], # X,Y,Z values in nanometers
    voxel_offset = [ 0, 0, 1 ], # X,Y,Z values in voxels
    chunk_size = [ 1024, 1024, 1 ], # rechunk of image X,Y,Z in voxels
    volume_size = [ 8368, 2258, 12208 ], # X,Y,Z size in voxels
)

# If you're using Amazon or the local file system, you can replace 'gs' with 's3' or 'file'
vol = CloudVolume('gs://bucket/dataset/layer', info=info)
vol.provenance.description = "Description of Data"
vol.provenance.owners = ['email_address_for_uploader/imager'] # list of contact email addresses

vol.commit_info() # generates gs://bucket/dataset/layer/info json file
vol.commit_provenance() # generates gs://bucket/dataset/layer/provenance json file

direct = 'local/path/to/images'

progress_dir = mkdir('progress/') # unlike os.mkdir, doesn't crash on a pre-existing directory
done_files = set([ 0 ] + [ int(z) for z in os.listdir(progress_dir) ]) # 0 is a placeholder; real sections start at z=1
all_files = set(range(1, 12208 + 1))

to_upload = [ int(z) for z in list(all_files.difference(done_files)) ]
to_upload.sort()

def process(z):
    img_name = 'brain_%06d.tif' % z
    print('Processing ', img_name)
    image = Image.open(os.path.join(direct, img_name))
    width, height = image.size
    array = np.array(list( image.getdata() ), dtype=np.uint16, order='F')
    array = array.reshape((1, height, width)).T # convert to the (X, Y, 1) shape CloudVolume expects
    vol[:, :, z] = array
    image.close()
    touch(os.path.join(progress_dir, str(z))) # mark this slice as complete

with ProcessPoolExecutor(max_workers=8) as executor:
    executor.map(process, to_upload)
```
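Once the upload finishes, you can spot check it by reading a slice back. The sketch below assumes the same bucket path as above; CloudVolume returns 4D arrays indexed as (X, Y, Z, channels).

```python
from cloudvolume import CloudVolume

vol = CloudVolume('gs://bucket/dataset/layer', progress=True)

# Fetch the first uploaded section (voxel_offset starts z at 1).
section = vol[:, :, 1]
print(section.shape)  # expect (8368, 2258, 1, 1) given the info above
```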