
Example Single Machine Dataset Upload


The listing below is derived from a script that uploaded a multi-hundred-gigabyte uncompressed image dataset from an external hard drive to Google Cloud Storage, but it would also work with Amazon S3. It has the following helpful features:

  1. Resumable Upload
  2. Progress Printing
  3. Multi-core Upload

A curious feature of this script is that it uses ProcessPoolExecutor as an independent multi-process runner rather than CloudVolume's parallel=True option. This is helpful because it parallelizes the file reading and decoding steps as well as the upload. ProcessPoolExecutor is used instead of multiprocessing.Pool because the original multiprocessing module hangs when a child process dies.
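For comparison, here is a minimal sketch of the built-in alternative. It assumes the volume defined later on this page and a hypothetical uint16 NumPy array named image_stack that has already been decoded into memory, which is exactly the step the ProcessPoolExecutor approach avoids serializing; the parallel argument controls how many processes CloudVolume uses for chunking and uploading.

from cloudvolume import CloudVolume

# Built-in parallelism: CloudVolume splits one large in-memory array into
# chunks and uploads them across 8 processes, but the source images still
# have to be read and decoded beforehand in the parent process.
vol = CloudVolume('gs://bucket/dataset/layer', parallel=8)
vol[:, :, 1:65] = image_stack  # hypothetical uint16 array of shape (8368, 2258, 64)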

Please use the Python 3 code below as a guide.

import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from PIL import Image

from cloudvolume import CloudVolume
from cloudvolume.lib import mkdir, touch

info = CloudVolume.create_new_info(
	num_channels = 1,
	layer_type = 'image', # 'image' or 'segmentation'
	data_type = 'uint16', # can pick any popular uint
	encoding = 'raw', # other option: 'jpeg' but it's lossy
	resolution = [ 16000, 16000, 25000 ], # X,Y,Z values in nanometers
	voxel_offset = [ 0, 0, 1 ], # X,Y,Z values in voxels
	chunk_size = [ 1024, 1024, 1 ], # chunks to break the image into; X,Y,Z values in voxels
	volume_size = [ 8368, 2258, 12208 ], # X,Y,Z size in voxels
)
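# Note: the chunk_size above uses a z-dimension of 1 so that each upload
# task below can write a single z-slice as a chunk-aligned write without
# touching neighboring slices.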

# If you're using Amazon S3 or the local filesystem, replace 'gs' with 's3' or 'file'
vol = CloudVolume('gs://bucket/dataset/layer', info=info)
vol.provenance.description = "Description of Data"
vol.provenance.owners = ['email_address_for_uploader/imager'] # list of contact email addresses

vol.commit_info() # generates gs://bucket/dataset/layer/info json file
vol.commit_provenance() # generates gs://bucket/dataset/layer/provenance json file
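# Once the info file is committed, the volume can later be reopened without
# passing info, e.g. vol = CloudVolume('gs://bucket/dataset/layer').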

direct = 'local/path/to/images' # directory of TIFF slices named brain_%06d.tif

# Resumable progress tracking: when a z-slice finishes uploading, an empty
# file named for that slice is written to progress/. On restart, any slice
# with a marker file is skipped.
progress_dir = mkdir('progress/') # mkdir from cloudvolume.lib; no error if the directory exists
done_files = set([ 0 ] + [ int(z) for z in os.listdir(progress_dir) ])
all_files = set(range(1, 12208 + 1)) # z-slices run from 1 to 12208, matching voxel_offset and volume_size

to_upload = [ int(z) for z in list(all_files.difference(done_files)) ]
to_upload.sort()

def process(z):
	img_name = 'brain_%06d.tif' % z
	print('Processing ', img_name)
	image = Image.open(os.path.join(direct, img_name))
	width, height = image.size
	array = np.array(list( image.getdata() ), dtype=np.uint16, order='F')
	# PIL yields pixels in row-major (y, x) order; transpose into
	# CloudVolume's Fortran-order (x, y, z) layout.
	array = array.reshape((1, height, width)).T
	vol[:, :, z] = array # upload a single z-slice
	image.close()
	touch(os.path.join(progress_dir, str(z))) # mark this slice as done

# Read, decode, and upload slices across 8 worker processes. Unlike
# multiprocessing.Pool, ProcessPoolExecutor won't hang if a worker dies.
with ProcessPoolExecutor(max_workers=8) as executor:
	executor.map(process, to_upload)
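
After the upload completes, it's worth spot checking a few slices by reading them back. A minimal sketch, assuming the same volume as above; downloads return Fortran-order arrays of shape (x, y, z, channels):

from cloudvolume import CloudVolume

vol = CloudVolume('gs://bucket/dataset/layer')
cutout = vol[:, :, 1] # download z=1; shape (8368, 2258, 1, 1), dtype uint16
print(cutout.shape, cutout.dtype)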