This project aims to facilitate access to Orcasound data. Orcasound data is part of the Registry of Open Data on AWS. Because the data is stored in a streaming structure (many small .ts files), it can be hard for a newcomer to query. The goal of this project is to improve the quality of the Orcasound data by following the FAIR (Findability, Accessibility, Interoperability, and Reuse) principles for scientific digital assets. The aim is to build a data catalogue and a user-friendly package that facilitates access and abstracts away the dependence on the data structure, which may change in the future. Useful features include the ability to quickly identify when data are available and to retrieve audio by node, time range, time frequency, etc. in a desired output format. The orca-hls-utils package has some of this functionality and would benefit from more abstraction, testing, and documentation. Many other projects will benefit from this package.
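To make the "retrieve audio by node and time range" feature concrete, here is a minimal sketch of how a catalogue entry could be mapped to segment keys. Everything in it is an assumption for illustration: the `Stream` record, the `segment_keys` helper, the `node_a` node name, the constant 10-second segment duration, and the `<node>/hls/<start_epoch>/liveNNN.ts` key layout are hypothetical and should be checked against the actual bucket before being relied on.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Stream:
    """One contiguous recording session (hypothetical catalogue entry)."""

    node: str                      # hydrophone node identifier
    start_epoch: int               # assumed: Unix time at which the stream started
    n_segments: int                # number of .ts segments in the stream folder
    segment_seconds: float = 10.0  # assumed: constant segment duration


def segment_keys(stream: Stream, t0: datetime, t1: datetime) -> list[str]:
    """Return keys of the segments that overlap the half-open range [t0, t1)."""
    offset0 = (t0.timestamp() - stream.start_epoch) / stream.segment_seconds
    offset1 = (t1.timestamp() - stream.start_epoch) / stream.segment_seconds
    first = max(0, math.floor(offset0))                # clamp to start of stream
    last = min(stream.n_segments, math.ceil(offset1))  # clamp to end of stream
    # Assumed key layout; adjust to whatever the catalogue actually records.
    return [
        f"{stream.node}/hls/{stream.start_epoch}/live{i:03d}.ts"
        for i in range(first, last)
    ]


stream = Stream(node="node_a", start_epoch=1_600_000_000, n_segments=100)
t0 = datetime.fromtimestamp(1_600_000_015, tz=timezone.utc)
t1 = datetime.fromtimestamp(1_600_000_035, tz=timezone.utc)
keys = segment_keys(stream, t0, t1)
# keys covers segments 1 to 3, i.e. seconds 10 to 40 of the stream
```

Computing keys from catalogue metadata avoids listing the folder on every query, which is one way a catalogue could hide the storage layout from users.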
Expected outcomes: A Python package that eases access to free, open Orcasound audio data.
Required Skills: Object Oriented Python, Project Packaging
Bonus Skills: ffmpeg, Cloud Computing, experience working with large datasets
Mentors:
Valentina, Scott
Difficulty level: Hard
Project Size: 175 or 350 h
Resources:
OOIPY: a package for accessing data from the Ocean Observatories Initiative
Amazon S3 Inventory: a service that creates an inventory catalogue for data on Amazon S3, which can be updated automatically and stored in CSV or Parquet format
fsspec: a Python package for interfacing with different filesystems in the same way
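As a rough illustration of how an inventory file could become a catalogue, the sketch below groups segment names by node and stream folder using only the standard library. The column order (bucket, key, size), the sample rows, the bucket and node names, and the `<node>/hls/<folder>/<file>.ts` layout are assumptions made for this example; a real inventory declares its columns in its manifest, and `index_inventory` is a hypothetical helper, not part of any existing package.

```python
from __future__ import annotations

import csv
import io
from collections import defaultdict
from urllib.parse import unquote

# Hypothetical inventory rows: bucket, key, size (no header line).
SAMPLE = '''"example-bucket","node_a/hls/1600000000/live001.ts","1024"
"example-bucket","node_a/hls/1600000000/live000.ts","1024"
"example-bucket","node_a/hls/1600000000/live.m3u8","300"
"example-bucket","node_b/hls/1600003600/live000.ts","2048"
'''


def index_inventory(
    csv_text: str, key_column: int = 1
) -> dict[tuple[str, str], list[str]]:
    """Group .ts file names by (node, stream folder)."""
    index: dict[tuple[str, str], list[str]] = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        key = unquote(row[key_column])  # keys in CSV inventories are URL-encoded
        parts = key.split("/")
        # Keep only keys that match the assumed <node>/hls/<folder>/<file>.ts layout.
        if len(parts) == 4 and parts[1] == "hls" and parts[3].endswith(".ts"):
            index[(parts[0], parts[2])].append(parts[3])
    for names in index.values():
        names.sort()
    return dict(index)


catalogue = index_inventory(SAMPLE)
```

A table like `catalogue` answers "which streams exist for this node" without any listing calls against the bucket; the same grouping could be done on the Parquet form of the inventory.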
Points to consider in your proposal:
How would you optimize for accessing many small files?
Can you parallelize some operations?
Can you isolate the dependence on the cloud provider?
Can access to a catalogue abstract and speed up the data access?
Can some data be cached?
What would be the API?
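One possible way to approach several of these points together is sketched below: a small storage interface isolates the cloud provider, a wrapper adds caching, and a thread pool fetches many small objects concurrently. All names here (`SegmentStore`, `InMemoryStore`, `CachedStore`, `fetch_many`) are hypothetical and only illustrate a shape the API could take; a real backend might wrap an S3 client or fsspec behind the same `read` method.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Protocol


class SegmentStore(Protocol):
    """Minimal storage interface; the only part that knows about a provider."""

    def read(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend for examples and tests."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects
        self._lock = threading.Lock()
        self.reads = 0  # counts backend hits so caching can be observed

    def read(self, key: str) -> bytes:
        with self._lock:
            self.reads += 1
        return self._objects[key]


class CachedStore:
    """Wraps any SegmentStore and keeps fetched objects in memory."""

    def __init__(self, inner: SegmentStore) -> None:
        self._inner = inner
        self._cache: dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        # Two threads asking for the same uncached key may both hit the backend;
        # that costs an extra read but does not corrupt the cache.
        if key not in self._cache:
            self._cache[key] = self._inner.read(key)
        return self._cache[key]


def fetch_many(
    store: SegmentStore, keys: Iterable[str], max_workers: int = 8
) -> list[bytes]:
    """Read many small objects concurrently, preserving the order of keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(store.read, keys))


backend = InMemoryStore({"a.ts": b"1", "b.ts": b"2", "c.ts": b"3"})
store = CachedStore(backend)
first = fetch_many(store, ["a.ts", "b.ts", "c.ts"])
second = fetch_many(store, ["c.ts", "a.ts"])  # served from the cache
```

Threads suit this workload because fetching small files is dominated by network latency rather than CPU; whether an in-memory cache or an on-disk one is appropriate depends on how much audio a typical query touches.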
Getting Started:
Get acquainted with the Orcasound data on AWS: access.md
Look through these notebooks experimenting with accessing data. Compare the performance of reading data directly with orca-hls-utils versus reading through the Parquet catalogues. Can you make some speed improvements?
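For the performance comparison, a small timing harness along the following lines may help keep measurements consistent. It is only a sketch: `time_call` and `compare` are hypothetical helpers, and the two candidates are placeholders that should be replaced with the actual direct read and the catalogue-based read.

```python
from __future__ import annotations

import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time of fn() in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(
    candidates: dict[str, Callable[[], object]], repeats: int = 5
) -> dict[str, float]:
    """Time each named candidate and return {name: median seconds}."""
    return {name: time_call(fn, repeats) for name, fn in candidates.items()}


# Placeholders standing in for the two access paths being compared.
results = compare(
    {
        "direct": lambda: time.sleep(0.02),
        "catalogue": lambda: None,
    }
)
```

The median is used rather than the mean so that a single slow network request does not dominate the result; for remote reads it is also worth separating the first (cold) run from later (warm) runs.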
@paulcretu As we consider this issue further and also revise orcanode code this year, it may be worth re-visiting the file naming convention and size/duration for the FLAC data in the archive-orcasound-net S3 bucket.
Are there ways we can align with the BCHN file naming conventions at the same time we re-organize Orcasound data access to optimize ambient-sound-analysis efficiency (e.g. parallelization, cost)?
Related: the 2018 issue in orcanode seeking human-readable file names, which guided the initial decisions about the FLAC filenames that we've been generating for the last 12 months at Port Townsend as an experiment in lossless streaming and its associated costs.