Merge pull request #2 from ENCODE-DCC/dev
v0.1.1
leepc12 authored Mar 30, 2020
2 parents ee38e53 + fec41ac commit 35482e9
Showing 17 changed files with 538 additions and 185 deletions.
101 changes: 101 additions & 0 deletions .circleci/config.yml
@@ -0,0 +1,101 @@
version: 2


defaults: &defaults
  machine:
    image: circleci/classic:latest
  working_directory: ~/autouri


machine_defaults: &machine_defaults
  machine:
    image: ubuntu-1604:201903-01
  working_directory: ~/autouri


install_python3: &install_python3
  name: Install python3, pip3
  command: |
    sudo apt-get update && sudo apt-get install software-properties-common git wget curl -y
    sudo add-apt-repository ppa:deadsnakes/ppa -y
    sudo apt-get update && sudo apt-get install python3.6 -y
    sudo wget https://bootstrap.pypa.io/get-pip.py
    sudo python3.6 get-pip.py
    sudo ln -s /usr/bin/python3.6 /usr/local/bin/python3

install_py3_packages: &install_py3_packages
  name: Install Python packages
  command: |
    sudo pip3 install pytest requests dateparser filelock
    sudo pip3 install --upgrade pyasn1-modules

install_gcs_lib: &install_gcs_lib
  name: Install Google Cloud SDK (gcloud and gsutil) and Python API (google-cloud-storage)
  command: |
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
    sudo apt-get update && sudo apt-get install google-cloud-sdk -y
    sudo pip3 install google-cloud-storage

install_aws_lib: &install_aws_lib
  name: Install AWS Python API (boto3) and CLI (awscli)
  command: |
    sudo pip3 install boto3 awscli

make_root_only_dir: &make_root_only_dir
  name: Create a directory accessible by root only (to test permission-denied cases)
  command: |
    sudo mkdir /test-permission-denied
    sudo chmod -w /test-permission-denied

jobs:
  pytest:
    <<: *machine_defaults
    steps:
      - checkout
      - run: *install_python3
      - run: *install_py3_packages
      - run: *install_gcs_lib
      - run: *install_aws_lib
      - run: *make_root_only_dir
      - run:
          no_output_timeout: 60m
          command: |
            cd tests/
            # sign in
            echo ${GCLOUD_SERVICE_ACCOUNT_SECRET_JSON} > tmp_key.json
            gcloud auth activate-service-account --project=${GOOGLE_PROJECT_ID} --key-file=tmp_key.json
            gcloud config set project ${GOOGLE_PROJECT_ID}
            export GOOGLE_APPLICATION_CREDENTIALS="${PWD}/tmp_key.json"
            aws configure set aws_access_key_id "${AWS_ACCESS_KEY_ID}"
            aws configure set aws_secret_access_key "${AWS_SECRET_ACCESS_KEY}"
            # run pytest
            pytest --ci-prefix ${CIRCLE_WORKFLOW_ID} \
              --gcp-private-key-file tmp_key.json \
              --s3-root ${S3_ROOT} \
              --gcs-root ${GCS_ROOT} \
              --gcs-root-url ${GCS_ROOT_URL}
            # to use gsutil
            export BOTO_CONFIG=/dev/null
            # clean up
            rm -f tmp_key.json
            gsutil -m rm -rf ${S3_ROOT}/${CIRCLE_WORKFLOW_ID}
            gsutil -m rm -rf ${GCS_ROOT}/${CIRCLE_WORKFLOW_ID}
            gsutil -m rm -rf ${GCS_ROOT_URL}/${CIRCLE_WORKFLOW_ID}

# Define workflow here
workflows:
  version: 2
  build_workflow:
    jobs:
      - pytest

62 changes: 37 additions & 25 deletions README.md
@@ -1,3 +1,10 @@
[![CircleCI](https://circleci.com/gh/ENCODE-DCC/autouri.svg?style=svg)](https://circleci.com/gh/ENCODE-DCC/autouri)

> **IMPORTANT**: If you use `--use-gsutil-for-s3` or `GCSURI.USE_GSUTIL_FOR_S3` then you need to update your `gsutil`. This flag allows a direct transfer between `gs://` and `s3://`. This requires `gsutil` >= 4.47. See this [issue](https://github.com/GoogleCloudPlatform/gsutil/issues/935) for details.
```bash
$ pip install gsutil --upgrade
```

# Autouri

## Introduction
@@ -110,9 +117,9 @@ optional arguments:
## Requirements

- Python >= 3.6
- Packages: `requests` and `filelock`
- Packages: `requests`, `dateparser` and `filelock`
```bash
$ pip3 install requests filelock
$ pip3 install requests dateparser filelock
```

- Install [Google Cloud SDK](https://cloud.google.com/sdk/docs/quickstarts) to get CLIs (`gcloud` and `gsutil`).
@@ -130,34 +137,42 @@ optional arguments:

## Authentication

GCS: Use `gcloud` CLI. You will be asked to enter credential information of your Google account or redirected to authenticate on a web browser.
```
$ gcloud init
```
- GCS: Use `gcloud` CLI.
    - Using end-user credentials: You will be asked to enter credentials of your Google account.
        ```
        $ gcloud init
        ```
    - Using service account credentials: Use this method if you have a service account and a JSON key file associated with it.
        ```
        $ gcloud auth activate-service-account --key-file=[YOUR_JSON_KEY.json]
        $ export GOOGLE_APPLICATION_CREDENTIALS="PATH/FOR/YOUR_JSON_KEY.json"
        ```
        Then set your default project.
        ```
        $ gcloud config set project [YOUR_GCP_PROJECT_ID]
        ```

S3: Use `aws` CLI. You will be asked to enter credential information of your AWS account.
```
$ aws configure
```
- S3: Use `aws` CLI. You will be asked to enter credentials of your AWS account.
    ```
    $ aws configure
    ```

- URL: Use a `~/.netrc` file to get access to private URLs. You can define credentials per site. Example `.netrc` file:
    ```
    machine www.encodeproject.org
    login XXXXXXXX
    password abcdefghijklmnop
    ```

URL: Use `~/.netrc` file to get access to private URLs. Example `.netrc` file. You can define credential per site.
```
machine www.encodeproject.org
login XXXXXXXX
password abcdefghijklmnop
```

## Using `gsutil` for direct transfer between GCS and S3

Autouri can use the `gsutil` CLI for direct file transfer between S3 and GCS. Specify `--use-gsutil-for-s3` in command line arguments or call `GCSURI.init_gcsuri(use_gsutil_for_s3=True)` in Python. Otherwise, file transfer between GCS and S3 will be streamed through your local machine.

`gsutil` must be configured correctly to obtain AWS credentials.
```
$ aws configure # make sure that you already authenticated for AWS
$ gsutil config # write auth info on ~/.boto
```
`gsutil` will take AWS credentials from the `~/.aws/credentials` file, which was already generated in [Authentication](#authentication).
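A minimal sketch of enabling this from Python (the bucket paths here are hypothetical placeholders):

```python
from autouri import AutoURI, GCSURI

# Enable direct gs:// <-> s3:// transfer via gsutil (requires gsutil >= 4.47).
GCSURI.init_gcsuri(use_gsutil_for_s3=True)

# Copy without streaming through the local machine.
# 'your-bucket' paths are for illustration only.
AutoURI('s3://your-bucket/test.txt').cp('gs://your-bucket/test.txt')
```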


## GCS/S3 bucket policies
## GCS/S3 bucket configuration

Autouri works best with the default bucket configuration for both cloud storage services.

@@ -169,6 +184,3 @@ S3 (`s3://bucket-name`)
- Object versioning must be turned off.


## Known issues
Race conditions were tested with multiple threads trying to write to the same file. The file locking mechanism is based on [filelock](https://github.com/benediktschmitt/py-filelock). Such file locking is stable for local/GCS files but rather unstable on S3 (tested with 5 threads).
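For illustration only (this is not Autouri's internal locking code, and the paths are hypothetical), the basic `filelock` pattern looks like:

```python
from filelock import SoftFileLock

# One lock file guards one target file; concurrent writers block on it.
lock = SoftFileLock('/tmp/shared.txt.lock', timeout=60)
with lock:
    with open('/tmp/shared.txt', 'a') as fp:
        fp.write('one writer at a time\n')
```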
34 changes: 14 additions & 20 deletions autouri/__init__.py
@@ -10,7 +10,7 @@
from .gcsuri import GCSURI


__version__ = '0.1.0'
__version__ = '0.1.1'


def parse_args():
@@ -71,10 +71,11 @@ def parse_args():

    p_loc = subparser.add_parser(
        'loc',
        help='type(target_dir).localize(src): Localize source on target directory (class)',
        help='AutoURI(src).localize_on(target): Localize source on target directory. '
             'Target directory must end with a directory separator.',
        parents=[parent_src, parent_target, parent_cp])
    p_loc.add_argument('--recursive', action='store_true',
        help='Recursively localize source into target class.')
        help='Recursively localize source into target directory.')

    p_presign = subparser.add_parser(
        'presign',
@@ -119,27 +120,21 @@ def main():

    elif args.action == 'cp':
        u_src = AutoURI(src)
        sep = AutoURI(target).__class__.get_path_sep()
        if target.endswith(sep):
            type_ = 'dir'
            target = sep.join([target.rstrip(sep), u_src.basename])
            print(target)
        else:
            type_ = 'file'
        _, flag = u_src.cp(target, make_md5_file=args.make_md5_file)
        _, flag = u_src.cp(
            target, make_md5_file=args.make_md5_file, return_flag=True)

        if flag == 0:
            logger.info('Copying from file {s} to {type} {t} done'.format(
                s=src, type=type_, t=target))
            logger.info('Copying from file {s} to {t} done'.format(
                s=src, t=target))
        elif flag:
            if flag == 1:
                reason = 'skipped due to md5 hash match'
            elif flag == 2:
                reason = 'skipped due to filename/size/mtime match'
                reason = 'skipped due to filename/size match and mtime test'
            else:
                raise NotImplementedError
            logger.info('Copying from file {s} to {type} {t} {reason}'.format(
                s=src, type=type_, t=target, reason=reason))
            logger.info('Copying from file {s} to {t} {reason}'.format(
                s=src, t=target, reason=reason))

    elif args.action == 'read':
        s = AutoURI(src).read()
@@ -157,11 +152,10 @@
        logger.info('Deleted {s}'.format(s=src))

    elif args.action == 'loc':
        _, localized = AutoURI(target).__class__.localize(
            src,
        _, localized = AutoURI(src).localize_on(
            target,
            recursive=args.recursive,
            make_md5_file=args.make_md5_file,
            loc_prefix=target)
            make_md5_file=args.make_md5_file)
        if localized:
            logger.info('Localized {s} on {t}'.format(s=src, t=target))
        else:
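A minimal usage sketch of the new `localize_on()` API, mirroring the CLI code above (the URIs and target directory are hypothetical):

```python
from autouri import AutoURI

# Localize a remote source on a local target directory.
# The target directory must end with a directory separator.
_, localized = AutoURI('gs://your-bucket/input.txt').localize_on(
    '/data/cache/',
    recursive=False,
    make_md5_file=True)

if localized:
    print('made a new local copy')
else:
    print('already localized, no copy needed')
```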
63 changes: 59 additions & 4 deletions autouri/abspath.py
@@ -1,8 +1,10 @@
import hashlib
import errno
import os
import shutil
from filelock import SoftFileLock
from typing import Dict, Optional, Union
from shutil import copyfile, SameFileError

from .autouri import URIBase, AutoURI, logger
from .metadata import URIMetadata, get_seconds_from_epoch
@@ -26,7 +28,8 @@ class AbsPath(URIBase):
    _PATH_SEP = os.sep

    def __init__(self, uri, thread_id=-1):
        uri = os.path.expanduser(uri)
        if isinstance(uri, str):
            uri = os.path.expanduser(uri)
        super().__init__(uri, thread_id=thread_id)

    @property
@@ -99,15 +102,34 @@ def _cp(self, dest_uri):

        if isinstance(dest_uri, AbsPath):
            dest_uri.mkdir_dirname()
            shutil.copyfile(self._uri, dest_uri._uri, follow_symlinks=True)
            try:
                copyfile(self._uri, dest_uri._uri, follow_symlinks=True)
            except SameFileError as e:
                logger.debug(
                    'cp: ignored SameFileError. src={src}, dest={dest}'.format(
                        src=self._uri,
                        dest=dest_uri._uri))
                if os.path.islink(dest_uri._uri):
                    dest_uri._rm()
                    copyfile(self._uri, dest_uri._uri, follow_symlinks=True)

            return True
        return False

    def _cp_from(self, src_uri):
        return False

    def get_mapped_url(self) -> Optional[str]:
        for k, v in AbsPath.MAP_PATH_TO_URL.items():
    def get_mapped_url(self, map_path_to_url=None) -> Optional[str]:
        """
        Args:
            map_path_to_url:
                dict with k, v where k is a path prefix and v is a URL prefix.
                k will be replaced with v.
                If not given, defaults to use class constant AbsPath.MAP_PATH_TO_URL.
        """
        if map_path_to_url is None:
            map_path_to_url = AbsPath.MAP_PATH_TO_URL
        for k, v in map_path_to_url.items():
            if k and self._uri.startswith(k):
                return self._uri.replace(k, v, 1)
        return None
Expand All @@ -122,6 +144,30 @@ def mkdir_dirname(self):
                d=self.dirname))
        return

    def soft_link(self, target, force=False):
        """Make a soft link of self on target absolute path.
        If the target already exists and force is set, delete it and create the link.

        Args:
            target:
                Target file's absolute path or URI object.
            force:
                Delete target file (or link) if it exists.
        """
        target = AbsPath(target)
        if not target.is_valid:
            raise ValueError('Target path is not a valid abs path: {t}.'.format(
                t=target.uri))
        try:
            target.mkdir_dirname()
            os.symlink(self._uri, target._uri)
        except OSError as e:
            if e.errno == errno.EEXIST and force:
                target.rm()
                os.symlink(self._uri, target._uri)
            else:
                raise e

    def __calc_md5sum(self):
        """Expensive md5 calculation
        """
@@ -131,6 +177,15 @@ def __calc_md5sum(self):
            hash_md5.update(chunk)
        return hash_md5.hexdigest()

    @staticmethod
    def get_abspath_if_exists(path):
        if isinstance(path, URIBase):
            path = path._uri
        if isinstance(path, str):
            if os.path.exists(os.path.expanduser(path)):
                return os.path.abspath(os.path.expanduser(path))
        return path

    @staticmethod
    def init_abspath(
            loc_prefix: Optional[str]=None,
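A short usage sketch of the `AbsPath` helpers added above (the path-to-URL mapping and all paths are hypothetical):

```python
from autouri.abspath import AbsPath

# get_mapped_url(): replace a path prefix with a URL prefix.
url = AbsPath('/srv/www/data/file.txt').get_mapped_url(
    map_path_to_url={'/srv/www': 'https://example.com'})
# -> 'https://example.com/data/file.txt'

# soft_link(): symlink self onto a target path; with force=True an
# existing target file (or link) is deleted first.
AbsPath('/data/src.txt').soft_link('/data/link_to_src.txt', force=True)
```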