Authors: Ha Nguyen, Van-Dung Pham, Hung Nguyen, Bang Tran, Nicole Schrad, Juli Petereit, and Tin Nguyen
This cloud-based learning module teaches pathway analysis, a term that describes the set of tools and techniques used in life sciences research to discover the biological mechanisms behind a condition from high-throughput biological data. Pathway analysis tools are primarily used to analyze omics datasets to detect relevant groups of genes that are altered in case samples compared with a control group. These approaches use existing pathway databases together with gene expression data to identify the pathways that are significantly impacted in a given condition.
This module will cost you about $1.00 to run, assuming you shut down and delete all resources at the end of your analysis.
Watch this Introduction Video to learn more about the module.
The course is organized into five submodules, which allow us to:
- Download and process data from public repositories,
- Perform differential analysis,
- Perform pathway analysis using different methods that seek to answer different research hypotheses,
- Perform meta-analysis and combine methods and datasets to find consensus results, and
- Interactively explore significantly impacted pathways across multiple analyses, and browse relationships between pathways and genes.
Each submodule is organized as an R Jupyter notebook with step-by-step, hands-on practice using the R command line to install the necessary tools, obtain data, perform analyses, and visualize and interpret the results. The notebooks are executed in the Google Cloud environment, so the first step is to set up a virtual machine in Vertex AI.
You can begin by first navigating to and logging in with your credentials. Next, follow the directions in the STRIDES tutorial on setting up a Vertex AI notebook. This will walk you through the basics of cloud platforms and provide links for setting up the environment. Be especially careful to enable idle shutdown as highlighted in step 7. For this module you should select Debian 10 and R 4.2 in the Environment tab in step 5. We recommend the n1-standard-4 machine type in step 6 with 4 vCPUs and 15GB of RAM.
Once you have successfully created your virtual machine, you will be directed to the JupyterLab screen.
The next step is to import the notebooks and start the course.
This can be done by selecting Git from the top menu in JupyterLab and choosing Clone a Repository.
Next, copy and paste in the link of the repository:
and click Clone.
This should download our repository to your JupyterLab folder. All tutorial files for the five submodules are in Jupyter format with the .ipynb extension. Double-click a file to view the lab content and run the code. This will open the file as a Jupyter notebook. From here you can run each section, or 'cell', of the code, one by one, by pushing the 'Play' button in the menu above.
Some 'cells' of code take longer for the computer to process than others. You will know a cell is running when a cell has an asterisk next to it [*]. When the cell finishes running, that asterisk will be replaced with a number which represents the order that cell was run in. You can now explore the tutorials by running the code in each, from top to bottom. Look at the 'workflows' section below for a short description of each tutorial.
Jupyter is a powerful tool, with many useful features. For more information on how to use Jupyter, we recommend searching for Jupyter tutorials and literature online.
When you are finished running code, you should turn off your virtual machine to prevent unneeded billing or resource use by checking your notebook and pushing the STOP button.
The content of the course is organized in R Jupyter Notebooks. Another way to view this module is through Jupyter Book, a package that combines individual Jupyter Notebooks into a web interface for easier navigation. The Jupyter Book is for viewing the notebooks only, not for running them. Details on installing the tools and formatting the content can be found at: The content of the course is hosted in the GitHub repository of Dr. Tin Nguyen's lab, and can be found at The overall structure of the modules is explained below:
In this section, we describe the steps to create a Google Cloud Storage bucket to store the data generated during analysis. The bucket can be created through the GUI or from the command line. To use the GUI, the user first has to visit, sign in, and click on Buckets in the left menu.
Next, click on the CREATE button below the search bar to start creating a new bucket.
This opens a page where the user provides a unique name for the bucket, its location, access control settings, and other information. Here, we named our bucket cpa-output (please choose your own name, since bucket names must be globally unique). Finally, the user clicks the CREATE button to complete the process.
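For the command-line route, a minimal R sketch might look like the following. It assumes gsutil is available on the Vertex AI instance. The helper names `is_valid_bucket_name` and `make_bucket_command`, the bucket name `cpa-output-yourname`, and the `us-central1` location are illustrative assumptions, not part of the module. The command is only built and printed here; uncomment the `system()` call to actually create the bucket.

```r
# Hypothetical helper: check a candidate name against the basic Cloud Storage
# naming rules (3-63 characters; lowercase letters, digits, dashes,
# underscores, and dots; must start and end with a letter or digit).
is_valid_bucket_name <- function(name) {
  grepl("^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$", name)
}

# Hypothetical helper: build the gsutil "make bucket" command as a string.
make_bucket_command <- function(bucket, location = "us-central1") {
  if (!is_valid_bucket_name(bucket)) {
    stop("Invalid bucket name: ", bucket)
  }
  sprintf("gsutil mb -l %s gs://%s", location, bucket)
}

# Replace the name below with your own globally unique bucket name.
cmd <- make_bucket_command("cpa-output-yourname")
print(cmd)

# Uncomment to run the command from the R kernel:
# system(cmd, intern = TRUE)
```

Validating the name before calling gsutil catches the most common mistakes (uppercase letters, names that are too short) without a round trip to the cloud. It cannot detect a name that is already taken by another project.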
The figure above shows the architecture of the learning module on Google Cloud infrastructure. First, we create a Vertex AI workbench with an R kernel. The code and instructions for each submodule are presented in a separate Jupyter Notebook. Users can either upload the Notebooks to the Vertex AI workbench or clone them from the project repository, and then execute the code directly in the Notebook. In our learning course, submodule 01 downloads data from a public repository (e.g., the GEO database), preprocesses it, and saves the processed data both to a local file in the Vertex AI workbench and to the user's Google Cloud Storage bucket. The output of submodule 01 is used as input for all other submodules. The outputs of submodules 02, 03, and 04 are saved locally in the Vertex AI workbench, and the code to copy them to the user's cloud bucket is also included.
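The save-then-copy pattern described here can be sketched as follows. This is an illustration under assumptions: the object `processed`, the file name `processed_data.rds`, the helper `make_copy_command`, and the bucket `cpa-output-yourname` are placeholders, and the notebooks themselves may store results in a different format.

```r
# Toy stand-in for the processed expression data produced by submodule 01.
processed <- data.frame(gene = c("geneA", "geneB"), logFC = c(1.2, -0.8))

# Save the object to a local file on the workbench.
local_file <- file.path(tempdir(), "processed_data.rds")
saveRDS(processed, local_file)

# Hypothetical helper: build the gsutil command that copies a local file
# into a bucket, keeping the same file name.
make_copy_command <- function(path, bucket) {
  sprintf("gsutil cp %s gs://%s/%s", shQuote(path), bucket, basename(path))
}

cmd <- make_copy_command(local_file, "cpa-output-yourname")
print(cmd)

# Uncomment to copy the file to your bucket:
# system(cmd, intern = TRUE)

# Later submodules can read the local copy back in.
restored <- readRDS(local_file)
```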
This learning module requires some computational resources and environment settings, but these are provided during the notebook setup explained above, through the browser-based development environment provided by Google. However, users need a Google email account, sufficient internet access, and a standard web browser (e.g., Chrome, Edge, or Firefox; Chrome is recommended) to create a cloud virtual machine for analysis. We recommend executing the Jupyter Notebooks with an R kernel later than version 4.1 on a standard machine with a minimum configuration of 4 vCPUs, 15 GB of RAM, and 10 GB of disk space.
The following are the R and tool versions used to run this module at the time of development:
All data used in the modules were originally downloaded from the Gene Expression Omnibus (GEO) repository under accession number GSE48350. The data were originally generated by Berchtold and Cotman, 2013. We preprocessed and normalized the data before using them in the subsequent analyses.
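As one possible way to obtain this series from R, the sketch below uses the Bioconductor package GEOquery. This is an assumption for illustration; the submodule 01 notebook may download the data differently. The wrapper name `download_geo_series` is hypothetical, and the download call is left commented out because it needs network access and can take several minutes.

```r
# Hypothetical wrapper around GEOquery::getGEO().
download_geo_series <- function(accession = "GSE48350", dest_dir = tempdir()) {
  if (!requireNamespace("GEOquery", quietly = TRUE)) {
    stop("Install GEOquery first, e.g. BiocManager::install('GEOquery')")
  }
  # With GSEMatrix = TRUE, getGEO() returns a list of ExpressionSet objects,
  # one per platform in the series.
  GEOquery::getGEO(accession, GSEMatrix = TRUE, destdir = dest_dir)
}

# Uncomment to download and inspect the data:
# gse <- download_geo_series()
# expression <- Biobase::exprs(gse[[1]])   # probes x samples matrix
# samples <- Biobase::pData(gse[[1]])      # sample annotations
# dim(expression)
```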
Some common errors include:
- Having the Jupyter Notebook kernel defaulting to Python and libraries not loading properly.
- To fix the kernel, check that the upper right-hand corner of the edit ribbon says "R". If it does not, you can click the words next to the circle (O) to change the kernel.
- When there are problems loading a library, check that the package has been properly installed.
- If some gsutil commands do not work, try running them through the system function, which runs bash commands from R:
system("gsutil or bash command", intern = TRUE)
- If you run into an error when creating your bucket, it may be because a bucket with the same name already exists. To resolve this, choose another unique name for your bucket.
- Packages can usually be installed by following the instructions in their documentation.
- Other errors are usually due to typing mistakes, such as incorrect capitalization or spelling.
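Several of the checks in this list can be scripted. The sketch below is illustrative: `missing_packages` and `run_shell` are hypothetical helper names, and `notARealPackage123` is a made-up package name used to show what a missing package looks like.

```r
# Confirm that the notebook is running an R kernel and show its version.
print(R.version.string)

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

print(missing_packages(c("stats", "notARealPackage123")))

# Hypothetical helper: run a bash or gsutil command and capture its output
# as a character vector, one element per line.
run_shell <- function(command) {
  system(command, intern = TRUE)
}

print(run_shell("echo hello"))

# Example with gsutil (uncomment on a machine where gsutil is configured):
# run_shell("gsutil ls")
```

If the first line fails to run at all, the notebook is most likely still attached to a Python kernel.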
This work was fully supported by NIH NIGMS under grant number GM103440. Any opinions, findings, and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.
Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available here.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License