
HTTP Utils and PathBuilder mods #85

Open: wants to merge 44 commits into base: main
Changes from 16 commits
10ea6d3
Update pathbuilder to handle filepaths missing expected format elements
paulhamer-noaa Nov 1, 2024
0294b56
pylint cleanup
paulhamer-noaa Nov 1, 2024
dc40921
Install required additional packages
paulhamer-noaa Nov 1, 2024
cbb6673
Add idsses-testing
paulhamer-noaa Nov 1, 2024
8f7980a
Linter complaint
paulhamer-noaa Nov 1, 2024
46ee928
Fix copy path
paulhamer-noaa Nov 1, 2024
d65e86b
Remove get_valid update inserted for http utils
paulhamer-noaa Nov 1, 2024
726f89a
Ignore requests for valids with MRMS data formats...
paulhamer-noaa Nov 1, 2024
c68abcb
Resolve pull request comment
paulhamer-noaa Nov 4, 2024
614f5e8
Update python/idsse_common/idsse/common/http_utils.py
paulhamer-noaa Nov 14, 2024
925b875
Update python/idsse_common/idsse/common/path_builder.py
paulhamer-noaa Nov 14, 2024
bde25d8
Update python/idsse_common/idsse/common/http_utils.py
paulhamer-noaa Nov 14, 2024
d794fb4
Updated to use a base class and cleaned up tests
paulhamer-noaa Dec 2, 2024
750e616
linter
paulhamer-noaa Dec 2, 2024
d95a8e8
Remove unnecessary super() calls
paulhamer-noaa Dec 2, 2024
3be5383
Linter
paulhamer-noaa Dec 2, 2024
e198139
Update python/idsse_common/idsse/common/aws_utils.py
paulhamer-noaa Dec 3, 2024
2f16100
Update python/idsse_common/idsse/common/protocol_utils.py
paulhamer-noaa Dec 3, 2024
6d2751f
Update python/idsse_common/test/test_http_utils.py
paulhamer-noaa Dec 3, 2024
17b469d
Linter...
paulhamer-noaa Dec 3, 2024
21e0af1
Update python/idsse_common/test/test_http_utils.py
paulhamer-noaa Dec 3, 2024
abbfc16
Update python/idsse_common/test/test_http_utils.py
paulhamer-noaa Dec 5, 2024
73965fc
Update python/idsse_common/test/test_http_utils.py
paulhamer-noaa Dec 5, 2024
d0ba6e6
Some suggested cleanup and cp/ls abstract methods - will require upd…
paulhamer-noaa Dec 5, 2024
2f2ad5a
Merge branch 'IDSSE-1001' of https://github.com/NOAA-GSL/idss-engine-…
paulhamer-noaa Dec 5, 2024
9244a61
remove http_ls/cp with ls/cp
paulhamer-noaa Dec 5, 2024
6df08bd
Linter
paulhamer-noaa Dec 5, 2024
153ff2e
Linter
paulhamer-noaa Dec 5, 2024
d8057ec
Linter
paulhamer-noaa Dec 5, 2024
4677e0b
Using the new pathbuilder in this branch and updates to handle ValueE…
paulhamer-noaa Dec 5, 2024
42bbd28
Update path_builder tests
paulhamer-noaa Dec 5, 2024
c7bba8c
Linter
paulhamer-noaa Dec 5, 2024
6cbd38d
Utils updates for TimeDelta
paulhamer-noaa Dec 5, 2024
3690410
Updated conflict
paulhamer-noaa Dec 6, 2024
2fed8b8
Including the latest rabbitmq handling
paulhamer-noaa Dec 9, 2024
36588ed
Small updates for NSSL MRMS
paulhamer-noaa Dec 11, 2024
d7636bf
Simplify loops
paulhamer-noaa Dec 11, 2024
ea32f26
Typo
paulhamer-noaa Dec 11, 2024
7514ad3
Fix tests to correspond with get latest issue...
paulhamer-noaa Dec 11, 2024
dedbc45
Linter
paulhamer-noaa Dec 11, 2024
de06de3
Small update to add unused arguments
paulhamer-noaa Dec 12, 2024
d054eeb
Linter
paulhamer-noaa Dec 12, 2024
14fcb10
Remove unnecessary pylint
paulhamer-noaa Dec 12, 2024
f23a801
Fix for get_valids to check issue time against filepaths returned.
paulhamer-noaa Dec 13, 2024
13 changes: 12 additions & 1 deletion .github/workflows/linter.yml
@@ -24,7 +24,7 @@ jobs:
- name: Install python dependencies
run: |
python -m pip install --upgrade pip
pip install pytest pylint==2.17.5 python-dateutil==2.8.2 pint==0.21 importlib-metadata==6.7.0 jsonschema==4.19.0 pika==1.3.1 pyproj numpy==1.26.2 shapely==2.0.2 netcdf4==1.6.3 h5netcdf==1.1.0 pillow==10.2.0 python-logging-rabbitmq==2.3.0
pip install pytest pytest_httpserver pylint==2.17.5 requests==2.31.0 python-dateutil==2.8.2 pint==0.21 importlib-metadata==6.7.0 jsonschema==4.19.0 pika==1.3.1 pyproj numpy==1.26.2 shapely==2.0.2 netcdf4==1.6.3 h5netcdf==1.1.0 pillow==10.2.0 python-logging-rabbitmq==2.3.0

- name: Checkout idss-engine-commons
uses: actions/checkout@v2
@@ -35,6 +35,17 @@ jobs:
- name: Install IDSSE python commons
working-directory: commons/python/idsse_common
run: pip install .

- name: Checkout idsse-testing
uses: actions/checkout@v2
with:
repository: NOAA-GSL/idsse-testing
ref: main
path: testing/

- name: Install IDSSE python testing
working-directory: testing/python
run: pip install .

- name: Set PYTHONPATH for pylint
run: |
13 changes: 12 additions & 1 deletion .github/workflows/run-tests.yml
@@ -23,7 +23,7 @@ jobs:
- name: Install python dependencies
run: |
python -m pip install --upgrade pip
pip install pytest pylint==2.17.5 python-dateutil==2.8.2 pint==0.21 importlib-metadata==6.7.0 jsonschema==4.19.0 pika==1.3.1 pyproj==3.6.1 numpy==1.26.2 shapely==2.0.2 netcdf4==1.6.3 h5netcdf==1.1.0 pytest-cov==4.1.0 pillow==10.2.0 python-logging-rabbitmq==2.3.0
pip install pytest pytest_httpserver requests==2.31.0 pylint==2.17.5 python-dateutil==2.8.2 pint==0.21 importlib-metadata==6.7.0 jsonschema==4.19.0 pika==1.3.1 pyproj==3.6.1 numpy==1.26.2 shapely==2.0.2 netcdf4==1.6.3 h5netcdf==1.1.0 pytest-cov==4.1.0 pillow==10.2.0 python-logging-rabbitmq==2.3.0

- name: Set PYTHONPATH for pytest
run: |
@@ -39,6 +39,17 @@
working-directory: commons/python/idsse_common
run: pip install .

- name: Checkout idsse-testing
uses: actions/checkout@v2
with:
repository: NOAA-GSL/idsse-testing
ref: main
path: testing/

- name: Install IDSSE python testing
working-directory: testing/python
run: pip install .

- name: Test with pytest
working-directory: python/idsse_common
# run Pytest, exiting nonzero if pytest throws errors (otherwise "| tee" obfuscates)
161 changes: 22 additions & 139 deletions python/idsse_common/idsse/common/aws_utils.py
@@ -2,56 +2,50 @@
# -------------------------------------------------------------------------------
# Created on Tue Feb 14 2023
#
# Copyright (c) 2023 Regents of the University of Colorado. All rights reserved.
# Copyright (c) 2023 Colorado State University. All rights reserved. (1)
# Copyright (c) 2023 Regents of the University of Colorado. All rights reserved. (2)
#
# Contributors:
# Geary J Layne
# Geary Layne (2)
# Paul Hamer (1)
#
# -------------------------------------------------------------------------------

import logging
import fnmatch
import os
from collections.abc import Sequence
from datetime import datetime, timedelta, UTC

from .path_builder import PathBuilder
from .utils import TimeDelta, datetime_gen, exec_cmd
from .utils import exec_cmd
from .protocol_utils import ProtocolUtils

logger = logging.getLogger(__name__)


class AwsUtils():
class AwsUtils(ProtocolUtils):
"""AWS Utility Class"""

def __init__(self,
basedir: str,
subdir: str,
file_base: str,
file_ext: str) -> None:
self.path_builder = PathBuilder(basedir, subdir, file_base, file_ext)

def get_path(self, issue: datetime, valid: datetime) -> str:
"""Delegates to instant PathBuilder to get full path given issue and valid
def ls(self, path: str, prepend_path: bool = True) -> Sequence[str]:
"""Execute a 'ls' on the AWS s3 bucket specified by path

Args:
issue (datetime): Issue date time
valid (datetime): Valid date time
path (str): s3 bucket
prepend_path (bool): Add to the filename

Returns:
str: Absolute path to file or object
Sequence[str]: The results sent to stdout from executing a 'ls' on passed path
"""
lead = TimeDelta(valid-issue)
return self.path_builder.build_path(issue=issue, valid=valid, lead=lead)
return self.aws_ls(path, prepend_path)

def aws_ls(self, path: str, prepend_path: bool = True) -> Sequence[str]:
"""Execute an 'ls' on the AWS s3 bucket specified by path
@staticmethod
Contributor:

I think it makes more sense to rename 'aws_ls' to just 'ls' vs having 'ls' call 'aws_ls'. Of course it will no longer be a static method.

def aws_ls(path: str, prepend_path: bool = True) -> Sequence[str]:
"""Execute a 'ls' on the AWS s3 bucket specified by path

Args:
path (str): s3 bucket
prepend_path (bool): Add to the filename

Returns:
Sequence[str]: The results sent to stdout from executing an 'ls' on passed path
Sequence[str]: The results sent to stdout from executing a 'ls' on passed path
"""
try:
commands = ['s5cmd', '--no-sign-request', 'ls', path]
@@ -65,20 +59,20 @@ def aws_ls(self, path: str, prepend_path: bool = True) -> Sequence[str]:
return [os.path.join(path, filename.split(' ')[-1]) for filename in commands_result]
return [filename.split(' ')[-1] for filename in commands_result]

def aws_cp(self,
path: str,
@staticmethod
def aws_cp(path: str,
dest: str,
concurrency: int | None = None,
chunk_size: int | None = None) -> bool:
"""Execute an 'cp' on the AWS s3 bucket specified by path, dest. Attempts to use
"""Execute a 'cp' on the AWS s3 bucket specified by path, dest. Attempts to use
[s5cmd](https://github.com/peak/s5cmd) to copy the file from S3 with parallelization,
but falls back to (slower) aws-cli if s5cmd is not installed or throws an error.

Args:
path (str): Relative or Absolute path to the object to be copied
dest (str): The destination location
concurrency (optional, int): Number of parallel threads for s5cmd to use to copy
the file down from AWS (may be helpful to tweak for large files).
the file down from AWS (maybe helpful to tweak for large files).
Default is None (s5cmd default).
chunk_size (optional, int): Size of chunks (in MB) for s5cmd to split up the source AWS
S3 file so it can download quicker with more threads.
@@ -111,114 +105,3 @@ def aws_cp(self,
return False
finally:
pass

def check_for(self, issue: datetime, valid: datetime) -> tuple[datetime, str] | None:
"""Checks if an object passed issue/valid exists

Args:
issue (datetime): The issue date/time used to format the path to the object's location
valid (datetime): The valid date/time used to format the path to the object's location

Returns:
[tuple[datetime, str] | None]: A tuple of the valid date/time (indicated by object's
location) and location (path) of a object, or None
if object does not exist
"""
lead = TimeDelta(valid-issue)
file_path = self.get_path(issue, valid)
dir_path = os.path.dirname(file_path)
filenames = self.aws_ls(file_path, prepend_path=False)
filename = self.path_builder.build_filename(issue=issue, valid=valid, lead=lead)
for fname in filenames:
# Support wildcard matches - used for '?' as a single wildcard character in
# issue/valid time specs.
if fnmatch.fnmatch(os.path.basename(fname), filename):
return (valid, os.path.join(dir_path, fname))
return None

def get_issues(self,
num_issues: int = 1,
issue_start: datetime | None = None,
issue_end: datetime = datetime.now(UTC),
time_delta: timedelta = timedelta(hours=1)
) -> Sequence[datetime]:
"""Determine the available issue date/times

Args:
num_issues (int): Maximum number of issue to return. Defaults to 1.
issue_start (datetime, optional): The oldest date/time to look for. Defaults to None.
issue_end (datetime): The newest date/time to look for. Defaults to now (UTC).
time_delta (timedelta): The time step size. Defaults to 1 hour.

Returns:
Sequence[datetime]: A sequence of issue date/times
"""
zero_time_delta = timedelta(seconds=0)
if time_delta == zero_time_delta:
raise ValueError('Time delta must be non zero')

issues_set: set[datetime] = set()
if issue_start:
datetimes = datetime_gen(issue_end, time_delta, issue_start, num_issues)
else:
# check if time delta is positive, if so make negative
if time_delta > zero_time_delta:
time_delta = timedelta(seconds=-1.0 * time_delta.total_seconds())
datetimes = datetime_gen(issue_end, time_delta)
for issue_dt in datetimes:
if issue_start and issue_dt < issue_start:
break
try:
dir_path = self.path_builder.build_dir(issue=issue_dt)
issues = {self.path_builder.get_issue(file_path)
for file_path in self.aws_ls(dir_path)
if file_path.endswith(self.path_builder.file_ext)}
issues_set.update(issues)
if num_issues and len(issues_set) >= num_issues:
break
except PermissionError:
pass
return sorted(issues_set, reverse=True)[:num_issues]

def get_valids(self,
issue: datetime,
valid_start: datetime | None = None,
valid_end: datetime | None = None) -> Sequence[tuple[datetime, str]]:
"""Get all objects consistent with the passed issue date/time and filter by valid range

Args:
issue (datetime): The issue date/time used to format the path to the object's location
valid_start (datetime | None, optional): All returned objects will be for
valids >= valid_start. Defaults to None.
valid_end (datetime | None, optional): All returned objects will be for
valids <= valid_end. Defaults to None.

Returns:
Sequence[tuple[datetime, str]]: A sequence of tuples with valid date/time (indicated by
object's location) and the object's location (path).
Empty Sequence if no valids found for given time range.
"""
if valid_start and valid_start == valid_end:
valids_and_filenames = self.check_for(issue, valid_start)
return [valids_and_filenames] if valids_and_filenames is not None else []

dir_path = self.path_builder.build_dir(issue=issue)
valid_and_file = [(self.path_builder.get_valid(file_path), file_path)
for file_path in self.aws_ls(dir_path)
if file_path.endswith(self.path_builder.file_ext)]

if valid_start:
if valid_end:
valid_and_file = [(valid, filename)
for valid, filename in valid_and_file
if valid_start <= valid <= valid_end]
else:
valid_and_file = [(valid, filename)
for valid, filename in valid_and_file
if valid >= valid_start]
elif valid_end:
valid_and_file = [(valid, filename)
for valid, filename in valid_and_file
if valid <= valid_end]

return valid_and_file
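The diff above moves the shared issue/valid file-handling out of `AwsUtils` into a `ProtocolUtils` base class that `AwsUtils` now extends, with `ls` delegating to the protocol-specific implementation. A minimal sketch of that shape, assuming abstract `ls`/`cp` hooks; the real `protocol_utils.py` builds a `PathBuilder` and its signatures may differ:

```python
from abc import ABC, abstractmethod
from collections.abc import Sequence


class ProtocolUtils(ABC):
    """Shared issue/valid-time file logic; subclasses supply the transport."""

    def __init__(self, basedir: str, subdir: str, file_base: str, file_ext: str):
        # the real class builds a PathBuilder from these; kept as plain fields here
        self.basedir = basedir
        self.subdir = subdir
        self.file_base = file_base
        self.file_ext = file_ext

    @abstractmethod
    def ls(self, path: str, prepend_path: bool = True) -> Sequence[str]:
        """List objects at path using this subclass's protocol (s3, http, ...)"""

    @abstractmethod
    def cp(self, path: str, dest: str) -> bool:
        """Copy the object at path to dest, returning True on success"""
```

This keeps `check_for`, `get_issues`, and `get_valids` in one place while each transport (AWS s3 via s5cmd, plain http) only implements listing and copying.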
90 changes: 90 additions & 0 deletions python/idsse_common/idsse/common/http_utils.py
@@ -0,0 +1,90 @@
"""Helper function for listing directories and retrieving s3 objects"""
# -------------------------------------------------------------------------------
# Created on Tue Dec 3 2024
#
# Copyright (c) 2023 Colorado State University. All rights reserved. (1)
# Copyright (c) 2023 Regents of the University of Colorado. All rights reserved. (2)
#
# Contributors:
# Paul Hamer (1)
#
# -------------------------------------------------------------------------------
import logging
import os
import shutil
from collections.abc import Sequence

import requests

from .protocol_utils import ProtocolUtils

logger = logging.getLogger(__name__)

# pylint: disable=duplicate-code, broad-exception-caught
Contributor:

Suggested change (remove the file-level disable):

# pylint: disable=duplicate-code, broad-exception-caught

If we have to disable the linter we want to have as small a scope as possible, so 'broad-exception-caught' should be applied to the specific line of code. I'm guessing that the 'duplicate-code' is because this module is very similar to aws_utils.py; if that is the case maybe we should make a resource_utils.py for the common code and have both aws_utils and http_utils extend that.

Contributor Author:

I raised the possibility of an interface initially; many of the calls made by DAS would be the same for http and aws. At this point it can probably wait for a while.

Contributor:

With the refactor we don't need to disable duplicate-code, broad-exception-caught should be applied directly to the except and not globally, and ideally we want to specify the specific exception(s) that might happen.


class HttpUtils(ProtocolUtils):
"""http Utility Class - Used by DAS for file downloads"""

def ls(self, path: str, prepend_path: bool = True) -> Sequence[str]:
"""Execute a 'ls' on the AWS s3 bucket specified by path

Args:
path (str): s3 bucket
prepend_path (bool): Add to the filename

Returns:
Sequence[str]: The results sent to stdout from executing a 'ls' on passed path
"""
return self.http_ls(path, prepend_path)


@staticmethod
def http_ls(url: str, prepend_path: bool = True) -> Sequence[str]:
"""Execute a 'ls' on the http(s) server
Args:
url (str): URL
prepend_path (bool): Add URL+ to the filename
Returns:
Sequence[str]: The results from executing a request get on passed url
"""
try:
files = []
response = requests.get(url, timeout=5)
response.raise_for_status() # Raise an exception for bad status codes

for line in response.text.splitlines():
if 'href="' in line:
filename = line.split('href="')[1].split('"')[0]
if not filename.endswith('/'): # Exclude directories
files.append(filename)

except requests.exceptions.RequestException as exp:
logger.warning('Unable to query supplied URL: %s', str(exp))
return []
if prepend_path:
return [os.path.join(url, filename) for filename in files]
return files

@staticmethod
def http_cp(url: str, dest: str) -> bool:
"""Execute http request download from URL to dest.

Args:
url (str): URL to the object to be copied
dest (str): The destination location
Returns:
bool: Returns True if copy is successful
"""
try:
with requests.get(os.path.join(url), timeout=5, stream=True) as response:
# Check if the request was successful
if response.status_code == 200:
# Open a file in binary write mode
with open(dest, "wb") as file:
shutil.copyfileobj(response.raw, file)
return True

logger.debug('copy fail: request status code: %s', response.status_code)
return False
except Exception: # pylint: disable=broad-exception-caught
return False
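The `http_ls` implementation above scrapes `href="..."` attributes out of an HTML directory index and drops entries ending in `/`. A self-contained sketch of that parsing step against a canned listing, with no live server required (the listing contents and filenames are made up for illustration):

```python
SAMPLE_LISTING = """\
<html><body><h1>Index of /mrms</h1>
<a href="subdir/">subdir/</a>
<a href="MRMS_sample_20241213-0000.grib2">MRMS_sample_20241213-0000.grib2</a>
<a href="readme.txt">readme.txt</a>
</body></html>
"""


def extract_files(html: str) -> list[str]:
    """Pull href targets out of an HTML index page, skipping directories."""
    files = []
    for line in html.splitlines():
        if 'href="' in line:
            # take the text between href=" and the closing quote
            filename = line.split('href="')[1].split('"')[0]
            if not filename.endswith('/'):  # trailing slash marks a directory
                files.append(filename)
    return files
```

Note this relies on each anchor sitting on its own line, which holds for typical autoindex pages but not for arbitrary HTML; an HTML parser would be the more robust choice if listings ever arrive minified.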
5 changes: 4 additions & 1 deletion python/idsse_common/idsse/common/path_builder.py
@@ -361,6 +361,9 @@ def parse_args(key: str, value: str, result: dict):
for i, _dir in enumerate(dirs):
res = re.search(r'{.*}', _dir)
if res:
parse_args(res.group(), vals[i][res.span()[0]:], time_args)
try :
Contributor:

I'll need to incorporate this exception handling in the new implementation of path builder, thanks for finding/fixing. Is there a test that exercises this code?

Contributor Author:

My test was running it against the MRMS http feed so no.

parse_args(res.group(), vals[i][res.span()[0]:], time_args)
except Exception: # pylint: disable=broad-exception-caught

pass
return time_args