stats-methods.Rmd

# Statistical Methods

Statistical methods are a far larger subject than can be covered in this
handbook. Our team uses a wide variety of methods that differ by project and
change via advances in the field. Nonetheless, here are some methods that we
favor when they are appropriate.

## Multivariate Regression

-   We like Generalized Linear Models. Many of us approach these from a Bayesian
    perspective, and build and fit these models using
    [Stan](http://mc-stan.org/). A great introduction to this type of modeling
    is found in [Statistical
    Rethinking](http://xcelab.net/rm/statistical-rethinking/), by Richard
    McElreath which we have a couple of copies of. Richard also has accompanying
    lectures for this book online as [YouTube
    videos](https://www.youtube.com/playlist?list=PLDcUM9US4XdM9_N6XUUFrhghGJ4K25bFc).
-   For nonlinear models, we like Generalized Additive Models, for which we use
    the **mgcv** R package. Noam has [a bunch of
    materials](https://github.com/noamross/2017-11-14-noamross-gams-nyhackr) on
    this, including slides and exercises from a workshop he co-teaches, and the
    seminal text is [Generalized Additive Models in
    R](https://www.amazon.com/Generalized-Additive-Models-Introduction-Statistical/dp/1498728332/ref=redir_mobile_desktop?_encoding=UTF8&%2AVersion%2A=1&%2Aentries%2A=0),
    by Simon Wood. (Example - Host-Pathogen Phylogeny Project:
    [Paper](https://dx.doi.org/10.1038/nature22975),
    [GitHub](https://github.com/ecohealthalliance/HP3))
-   When a multivariate analysis is primarily about prediction, and less about
    variable inference, machine-learning methods such as **boosted regression
    trees** are useful. We make use of these through the **dismo** or
    **xgboost** packages. (Example - Hotspots 2:
    [Paper](https://www.nature.com/articles/s41467-017-00923-8),
    [GitHub](https://github.com/ecohealthalliance/hotspots2))

## Networks
-   We like networks for both descriptive visualizations and quantitative analyses. R is full of packages to achieve both of these goals, check out **ggnetwork** and **bipartite**.   
-   We assess individual components within networks looking for high-impact individuals due to their high *degree*, *centrality*, or ability to bridge different communities.  For a tutorial on basic metrics check out Randi Griffin's [tutorial on primate social networks](http://rgriff23.github.io/2017/04/26/primate-social-networks-in-igraph.html) which utilizes the **igraph** package.  
-  Often, our networks are bipartite, meaning the networks shows interactions between two type of nodes. For us, these are usually viruses and hosts. You can look at example networks in two PREDICT publications: [Figure 4 in Anthony et al. 2017](https://academic.oup.com/view-large/figure/88156911/vex012f4.tif/Global%20patterns%20in%20coronavirus%20diversity) and [Figure 3 in Willoughby et al. 2017](http://www.mdpi.com/1424-2818/9/3/35/htm). 
- To assess overall network structure and identify communities, we measure the *modularity*. The modularity of a network model is a type of network partitioning into subgroups or *modules*. This is an alternative to clustering alogrithms. For bipartite networks, we like the Barber's modularity (*Q*), calculated through the **lpbrim** package. Alternatives modularity algorithms include NetCarto which uses a simulated annealing (SA) algorithm or using a map equation (ME) algorithm. 

## Bioinformatics and Sequence Analysis

-   [**Bioconductor**](https://www.bioconductor.org/) is a comprehensive ecosystem of R packages focused on sequence analyses. Typing your general topic of interest into their [searchable software table](https://www.bioconductor.org/packages/release/BiocViews.html#___Software) should at least provide an introduction to relevant software.
-   In general, best practices in bioinformatics are changing rapidly, so it is difficult to recommend particular procedures. However, the journal [Nature Protocols](https://www.nature.com/nprot/) covers cutting-edge methods, ranging from the lab to the laptop, in great depth. Many articles include step-by-step instructions to complete a given analysis. For a scattershot introduction to high-quality bioinformatics software and relevant applications, it might be worth checking out the [Bedford](http://bedford.io/projects/), [Pachter](http://pachterlab.github.io/software.html), and [Patro](http://combine-lab.github.io/software/) lab websites. Note that many contemporary bioinformatics tools are accessed through the command line rather than R.
-   For a general overview of RNA sequencing analysis (other bioinformatics pipelines have strong similarities), the [Simple Fool's Guide from the Palumbi lab](http://sfg.stanford.edu/index.html) is a great learning resource.

## Species Distribution Modeling

## Epidemic Simulation and Fitting

## Phylogenetics