10-reference-free-graphs-pggb.Rmd

# Reference-Free Graphs with *pggb*

## pggb

https://github.com/pangenome/pggb

+ All-pairs whole genome alignment 
+ Induces a graph from the alignments
 
## PanGenome Graph Builder

pggb is built on the idea that a pangenome graph represents an alignment of the genomes in the graph (this is literally what Cactus does), but infers the graph from all pairwise alignments instead of a multiple alignment.

pggb computes all pairwise alignments efficiently by focusing on long, colinear homologies, instead of using the more traditional k-mer matching alignment approach.

Critically, pggb performs graph *normalization* to ensure that paths through the graph (e.g. chromosomes) have a linear structure while allowing for cyclic graph structures that capture structural variation.

![Input Genomes](./Figures/pggbFlowDiagram.png){width=100%}

## Reference-Free Graphs

https://academic.oup.com/bioinformatics/article/30/24/3476/2422268

![Input Genomes](./Figures/InputGenomes.png){width=100%}

##  pggb Algorithm

1. All-pairs align genomes with [wfmash](https://github.com/waveygang/wfmash)
2. Convert alignments into a graph using [seqwish](https://github.com/ekg/seqwish)
3. Progressively normalize graph with [smoothxg](https://github.com/pangenome/smoothxg) and [gfaffix](https://github.com/marschall-lab/GFAffix)

## Pipeline

1. Prepare the input
3. pggb
4. View with Bandage

### Set up Directories

1. Make sure you're working in a **screen**

2. Load the pggb conda environment:
```
conda activate pggb
```

3. Make Directory

```
mkdir ~/pggb
```
4. Navigate to the Directory

```
cd ~/pggb
```
5. Link to data

```
ln -s /home/data/pangenomics-2402/yprp/ .
```

6. When you want to leave the conda environment, use:
```
conda deactivate
```

## Yeast Data

Reference:

+ S288C

Using all 12 YPRP assemblies

## Preparing the Input (exercise)

1. Create a FASTA file containing all the yprp assemblies. Call it `yprp.all.fa`.
2. Create a FASTA file containing chromosome VIII from every assembly. Call it `yprp.chrVIII.fa`.
3. How can you confirm that the contents of these files is correct?

### Preparing the Input (solution)

1. `cat yprp/assemblies/*.fa > yprp.all.fa`
2. `awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fa"; print ">" $0 > fnme; close(fnme);}' yprp/assemblies/*.genome.fa && cat *.chrVIII.fa > yprp.chrVIII.fa` && rm `ls | grep -v yprp`
3. `grep -c ‘>’ yprp.all.fa`

## Preparing the Input

[bgzip](https://www.htslib.org/doc/bgzip.html) the FASTA files (~1min):
```
bgzip -c yprp.all.fa > yprp.all.fa.gz
bgzip -c yprp.chrVIII.fa > yprp.chrVIII.fa.gz
```

+ **-c**
    + output the bgzipped file to standard output
+ **>**
    + redirect the standard output into a file
    
Index the bgzip files with [samtools](http://www.htslib.org/doc/samtools.html) [faidx](http://www.htslib.org/doc/samtools-faidx.html):
```
samtools faidx yprp.all.fa.gz
samtools faidx yprp.chrVIII.fa.gz
```


## Running pggb on Chromosome VIII

Build a graph containing all the yprp assemblies (2min):
```
pggb build -i yprp.chrVIII.fa.gz -o output_chrVIII -n 12 -t 20 -p 95
```

+ **-i yprp.chrVIII.fa**
    + an input FASTA containing all sequences
+ **-o output_chrVIIII**
    + the directory where all output files should be placed
+ **-n 12>**
    + the number of haplotypes (assemblies) in the input file
+ **-t 20**
    + the number of threads to use
+ **-p 95**
    + minimum sequence identity of alignment segments
+ **-s 5000**
    + nucleotide segment length when scaffolding the graph
    
NOTE: These arguments were taken from the [pggb paper](https://github.com/pangenome/pggb-paper/blob/main/workflows/AllSpecies.md). Refer to the paper for other species.

Create a copy of the output graph with a simpler name
```
cp output_chrVIII/yprp.chrVIII.fa.gz.<gibberish>.smooth.final.gfa yprp.chrVIII.pggb.gfa
```

## Viewing with Bandage

~~View your Chromsome VIII graph with Bandage (exercise):~~

1. ~~Find CUP1 and YHR054C~~
2. ~~Take a screenshot~~

**pggb graphs are too detailed to view all of Chromosome VIII in Bandage.**


Convert the GFA to VG (<1min):
```
vg convert yprp.chrVIII.pggb.gfa -v > yprp.chrVIII.pggb.vg
```
+ **-v**
    + convert input to vg format

Create a subgraph only containing the CUP1 region (<1min):
```
vg chunk -p S288C.chrVIII:212000-230000 -x yprp.chrVIII.pggb.vg -c 10 > yprp.chrVIII.pggb.cup1_chunk.vg
```
+ **-p S288C.chrVIII:212000-230000**
    + the path (region) to build the chunk from
+ **-x yprp.chrVIII.pggb.vg**
    + the graph to chunk
+ **-c 10**
    + expand the context of the chunk by 10 node steps
    
Convert the subgraph into a GFA for viewing (<1min):
```
vg view yprp.chrVIII.pggb.cup1_chunk.vg -g > yprp.chrVIII.pggb.cup1_chunk.gfa
```

View your Chromsome VIII chunk graph with Bandage (exercise):

1. ~~Find CUP1 and YHR054C~~
2. ~~Take a screenshot~~

## BLAST the graph manually

Create a FASTA file containing the graph sequence
```
gfatools gfa2fa yprp.chrVIII.pggb.gfa > yprp.chrVIII.pggb.fa
```

Copy the FASTA to you computer
```
scp pangenomics-2402:~/pggb/yprp.chrVIII.pggb.fa .
```

Build a BLAST database for the FASTA
```
makeblastdb -in yprp.chrVIII.pggb.fa -input_type fasta -dbtype nucl
```
+ **-in yprp.chrVIII.pggb.fa**
    + the file to build a database for
+ **-input_type fasta**
    + the input file is a FASTA
+ **-dbtype nucl**
    + the type of sequence in the input file is DNA
    
Query the database for [CUP1](https://www.yeastgenome.org/locus/S000001095) and [YHR054C](https://www.yeastgenome.org/locus/S000001096)
```
blastn -db yprp.chrVIII.pggb.fa -query S288C_YHR053C_CUP1-1_genomic.fsa
```

View your Chromsome VIII chunk graph with Bandage (exercise):

1. Find the CUP1 and YHR054C BLAST hits by node ID
2. Take a screenshot

## Exercises

1. Use vg to index the chromosome VIII graph
2. Use vg to map SK1 reads to the chromosome VIII graph
3. Use vg to call variants on chromosome VIII read mapping GAMS

## Running pggb on all Chromosomes

Partition the sequences before building the graph (<1min):
```
partition-before-pggb -i yprp.all.fa.gz -o output_all -n 12 -t 20 -p 95 -s 5000
 ```
This partitions the input FASTA into smaller FASTAs containing sequences that should be in the same subgraph:

+ Will likely correspond to chromosomes if you have complete assemblies
+ May improve run-time of normalization step and make downstream analysis easier
+ Consider skipping if your assemblies/organism has complex structure you want represented in the graph, e.g. polyploidy, translocations, etc.

This will print a `pggb` command for every partition to the command line.

+ Commands are also written to a log file: `output_all/yprp.all.fa.gz.<gibberish>.log`

### Exercise

1. Make a bash script to run all of the generated commands
  + Run the command **NOT DURING THE WORKSHOP** to build the graphs (~30min)

## How do you handle larger graphs? Nextflow

[Nextflow](https://www.nextflow.io/docs/latest/index.html) - Software for running workflows in parallel / distributed computing environments

+ Supports Singularity and Docker containers
+ Nextflow pipelines are open-source
  + You can use other people’s pipelines
  + You can build and share your own pipelines
+ [nf-core/pangenome](https://nf-co.re/pangenome) contains pggb along with other pangenome software

## Pros and Cons

Pros:

+ Graphs are truly reference-free
+ Includes normalization step
  + Especially good for species with complex structural variation, such as plants
  
Cons:

+ Normalization step is currently a bottleneck
+ Partitioning removes inter-chromosome structural variation
+ Graphs contain lots of variation
  + Makes the files large
  + Makes them hard to view with Bandage
+ Can't BLAST GFA within Bandage