“RSE’s background in both computing and academic research puts them in the unique position of being able to align the needs of the research community with the services and resources provided by The Cloud.”
Streams considered: Data Management, Data Lifecycle
Within the research community, there is a group of established academic researchers who have worked essentially without computing resources. These researchers are often focused on the basic sciences. They hold significant amounts of data in personal storage and have struggled to analyse it on their local computers using fairly basic software. They have worked this way for years and expressed no need for on-site HPC because they could not imagine scaling up to that level. They did not use research computing because they were trained in, and comfortable with, older research techniques. As a result, they focused on what they could do with local resources, often collecting far more data than they could analyse. They have been reluctant to mine this data because of the extensive and laborious effort involved in capturing, cleaning and reworking it. The prospect of working with HPC made them think that they needed to buy servers and/or IT technical support. We think that if they were aware of the research possibilities that The Cloud offers, they would be eager to use the resources directly.
Leapfrog these researchers over the HPC hurdle straight into The Cloud.
Tailor a learning pathway syllabus that takes them through the process to achieve the following goals:
- The absolute basics:
- What is The Cloud?
- How do internet-based computing resources support research?
- How is working with data different on The Cloud?
- How do I access The Cloud from my laptop?
- What software can I use in The Cloud that I cannot use on my desktop, or that I can use differently with Cloud resources?
- What are the advantages of this software?
- A survey of Cloud-based research methodologies
- Intermediate:
- What are all The Cloud products offered in the Calculators (e.g. VM, Storage, App server, etc.)?
- How do I start designing a Cloud service that meets my needs by looking at what can be done with my research data?
- What support and current methodologies are there that are related to my field?
- How do I know how much of each product I need?
- Understanding the costs.
- How do I install my data analysis software on The Cloud?
- Are there software options if my software does not work on The Cloud?
- Advanced:
- How and where do I clean data?
- Formatting data from obsolete formats
- Formatting messy data into accessible formats
- Uploading data into The Cloud
- Automation of basic data management tasks
- Using high-speed, high-power computing in data analysis
- What does data management on The Cloud (e.g. Open Access, Open Science, etc.) look like in my discipline (e.g. basic sciences, advanced sciences)?
- What data management tools help me achieve the above?
- How do I use these tools?
- How do I link my data with the world’s data sets?
- Creating data in compatible standards
- Merging smaller or more specific data into generalized or larger datasets
- Creating analysis at greater scale.
This would take us through the Data lifecycle of:
- Creation/acquisition
- Data management and research
- Archiving, sharing and preservation
Data linking is a powerful discipline for sharing research data. This fits in with the new product introduced at the Summit: publicdataset.azurewebsite.net
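To make the idea of data linking concrete for a researcher new to it, the following is a minimal sketch in Python, not tied to any particular Cloud product. It assumes that a researcher's own records and a reference dataset share a common key; the field names (`site`, `temp_c`, `lat`, `lon`), the values, and the helper `link_records` are all hypothetical and for illustration only.

```python
def link_records(local, reference, key):
    """Attach reference fields to each local record that shares the same key.

    Every local record is kept (a left join). A 'linked' flag records
    whether a matching reference entry was found.
    """
    # Build a lookup table from the reference dataset, keyed on the shared field.
    index = {ref[key]: ref for ref in reference}
    linked = []
    for rec in local:
        match = index.get(rec[key], {})
        merged = {**match, **rec}  # on conflicting fields, local values win
        merged["linked"] = bool(match)
        linked.append(merged)
    return linked


# Hypothetical local measurements collected by a researcher.
local_data = [
    {"site": "north", "temp_c": 21.5},
    {"site": "east", "temp_c": 18.0},
]

# Hypothetical reference dataset, standing in for a larger shared data set.
reference_data = [
    {"site": "north", "lat": 51.0, "lon": -1.0},
    {"site": "south", "lat": 50.0, "lon": -1.5},
]

linked_data = link_records(local_data, reference_data, "site")
for row in linked_data:
    print(row)
```

Records without a match are kept rather than dropped, so the researcher can see which of their data still needs to be brought into a compatible standard before it can be linked.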
These courses do not link to the environments the researchers currently use, nor do they show how researchers could carry out research operations faster on The Cloud. They sound very technical and can be intimidating to researchers, who cannot immediately see that the investment would enable them to work faster or better. The task would be to deliver the same content in a way researchers are comfortable with, while also focusing on the software and the possibilities it opens up.
Based on the “HOW” section above, there may be gaps in the literature that need to be filled.
Proposal:
- Establish if the syllabus presented in the “HOW” is complete.
- Review the current courses that would fit the syllabus.
- Identify gaps:
- “missing” courses
- Technical courses that need to be repackaged for the research community
- Compile the syllabus into a learning pathway.
- Produce a case study that focuses on data cleaning and data mining.
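As an indication of the kind of material the data cleaning case study might contain, here is a minimal sketch in Python using only the standard library. The sample data, the column names, the list of missing-value markers, and the helper `clean_rows` are hypothetical; a real case study would start from a researcher's own files.

```python
import csv
import io

# Hypothetical messy export: inconsistent capitalisation, stray spaces,
# a non-numeric missing-value marker, and a blank row.
RAW = """Sample ID , Temp C ,Site
 a01 , 21.5 ,north
A02, N/A ,North

a03,19,SOUTH
"""

# Markers treated as "no value recorded" (an assumption for this sketch).
MISSING = {"", "na", "n/a", "-", "null"}


def clean_rows(text):
    """Return a list of dicts with tidy column names and typed values."""
    reader = csv.reader(io.StringIO(text))
    # Normalise headers: trim, lower-case, replace spaces with underscores.
    header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
    cleaned = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank rows
        record = dict(zip(header, (cell.strip() for cell in row)))
        record["sample_id"] = record["sample_id"].upper()
        record["site"] = record["site"].lower()
        temp = record["temp_c"]
        # Convert the temperature to a number, or None when it is missing.
        record["temp_c"] = None if temp.lower() in MISSING else float(temp)
        cleaned.append(record)
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

The same steps (normalising headers, trimming whitespace, unifying missing-value markers, converting types) are the ones a researcher would later automate once the data is uploaded.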