---
title: "Tidy Genomics"
author: "Constantin Ahlmann-Eltze"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Tidy Genomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
In recent years, the most dramatic change to programming in R has been the development of the [tidyverse](http://tidyverse.org/) by Hadley Wickham et al.,
which, combined with the ingenious `%>%` from magrittr, provides a uniform philosophy for handling data.

The genomics community has an alternative set of approaches, built on [Bioconductor](http://bioconductor.org/) and the
[GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package. `GenomicRanges` and
the underlying `IRanges` package provide a great set of methods for dealing with intervals as they are typically encountered in genomics.

Unfortunately, it is not always easy to combine those two worlds. Many common operations in `GenomicRanges` focus solely on the
ranges and lose the additional metadata columns. On the other hand, the tidyverse does not provide a unified set of methods
for common set operations on intervals.

This changed recently, when the [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin) package was extended with the `genome_join`
method for combining genomic data stored in a `data.frame`. It demonstrated that genomic data can be handled appropriately
with the _tidy_ philosophy.

The `tidygenomics` package extends the limited set of methods provided by the `fuzzyjoin` package for dealing with genomic
data. Its API is inspired by the very popular [bedtools](http://bedtools.readthedocs.io/en/latest/index.html):

- `genome_intersect`
- `genome_subtract`
- `genome_join_closest`
- `genome_cluster`
- `genome_complement`
- `genome_join` _Provided by the fuzzyjoin package_
```{r, message=FALSE, warning=FALSE, echo=FALSE}
library(dplyr)
library(tidygenomics)
```
## genome_intersect
Joins two data frames based on their genomic overlap. Unlike `genome_join`, it updates the boundaries to reflect
the overlap of the regions.
```{r}
x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr2"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr2", "chr1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))
genome_intersect(x1, x2, by = c("chromosome", "start", "end"), mode = "both")
```
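To make the boundary update concrete, the following base-R sketch computes the overlap of a single pair of intervals by hand. It only illustrates the idea and is not how `genome_intersect` is implemented: `interval_overlap` is a hypothetical helper, and it assumes that both interval ends are inclusive. The first call uses the first row of `x1` and the first row of `x2` from the example above, which lie on the same chromosome.

```{r}
# Hypothetical helper (not part of tidygenomics): overlap of the closed
# intervals [start1, end1] and [start2, end2] on the same chromosome.
# Returns NULL when the two intervals do not overlap.
interval_overlap <- function(start1, end1, start2, end2) {
  start <- max(start1, start2)  # the overlap begins at the later start
  end <- min(end1, end2)        # and finishes at the earlier end
  if (start > end) {
    return(NULL)
  }
  c(start = start, end = end)
}

# chr1:100-150 and chr1:140-160 share a stretch
interval_overlap(100, 150, 140, 160)

# chr1:200-250 and chr1:300-320 share nothing
interval_overlap(200, 250, 300, 320)
```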
## genome_subtract
Subtracts the regions of one data frame from those of another. This can be used to split the regions of the `x` data frame into smaller pieces.
```{r}
x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr1", "chr1"),
                 start = c(120, 210, 300, 400),
                 end = c(125, 240, 320, 415))
genome_subtract(x1, x2, by = c("chromosome", "start", "end"))
```
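The splitting behaviour can be sketched for a single pair of intervals in base R. This is an illustration under stated assumptions, not the package's implementation: `interval_subtract` is a hypothetical helper, coordinates are assumed to be integers, and both interval ends are assumed to be inclusive, so the remaining pieces stop one position before and resume one position after the subtracted interval. Depending on how the two intervals overlap, zero, one or two pieces remain.

```{r}
# Hypothetical helper (not part of tidygenomics): remove the closed interval
# [sub_start, sub_end] from the closed interval [start, end].
interval_subtract <- function(start, end, sub_start, sub_end) {
  pieces <- data.frame(
    # piece to the left of the subtracted interval, piece to the right of it
    start = c(start, max(start, sub_end + 1)),
    end = c(min(end, sub_start - 1), end)
  )
  # Drop pieces that turned out to be empty
  pieces <- pieces[pieces$start <= pieces$end, , drop = FALSE]
  rownames(pieces) <- NULL
  pieces
}

# Subtracting from the middle splits the interval in two
interval_subtract(100, 150, 120, 125)

# Subtracting from the left edge leaves a single shorter piece
interval_subtract(400, 450, 400, 415)
```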
## genome_join_closest
Joins two data frames based on their genomic location. If no overlap is found, the closest interval is used instead.
```{r}
x1 <- tibble(id = 1:4,
             chr = c("chr1", "chr1", "chr2", "chr3"),
             start = c(100, 200, 300, 400),
             end = c(150, 250, 350, 450))
x2 <- tibble(id = 1:4,
             chr = c("chr1", "chr1", "chr1", "chr2"),
             start = c(220, 210, 300, 400),
             end = c(225, 240, 320, 415))
genome_join_closest(x1, x2, by = c("chr", "start", "end"),
                    distance_column_name = "distance", mode = "left")
```
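What "closest" means can be sketched in base R for one query interval and a handful of candidates on the same chromosome. `interval_gap` is a hypothetical helper, and the distance convention it uses (the number of positions strictly between two closed intervals, with 0 for overlapping intervals) is an assumption that may differ from the values `genome_join_closest` reports in the distance column.

```{r}
# Hypothetical helper (not part of tidygenomics): number of positions that
# separate two closed intervals; 0 if they overlap or touch.
interval_gap <- function(start1, end1, start2, end2) {
  max(0, start2 - end1 - 1, start1 - end2 - 1)
}

# The three chr1 intervals of x2 from the example above
candidates <- data.frame(start = c(220, 210, 300),
                         end = c(225, 240, 320))

# Gap between the query chr1:100-150 and every candidate
gaps <- mapply(interval_gap, 100, 150, candidates$start, candidates$end)
gaps

# The candidate with the smallest gap is the closest one
candidates[which.min(gaps), ]
```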
## genome_cluster
Adds a new column that identifies the cluster each interval belongs to. Two intervals end up in the same cluster if they overlap or lie within `max_distance` of each other.
```{r}
x1 <- data.frame(id = 1:4, bla = letters[1:4],
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 120, 300, 260),
                 end = c(150, 250, 350, 450))
genome_cluster(x1, by = c("chromosome", "start", "end"))
genome_cluster(x1, by = c("chromosome", "start", "end"), max_distance = 10)
```
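The clustering rule can be sketched in base R for the intervals of a single chromosome. This is an illustration only, not the package's implementation: `cluster_intervals` is a hypothetical helper, and treating a distance exactly equal to `max_distance` as "within" is an assumption. The idea is to sort the intervals by start and open a new cluster whenever an interval starts beyond the furthest end seen so far plus `max_distance`.

```{r}
# Hypothetical helper (not part of tidygenomics): cluster IDs for intervals
# that all lie on the same chromosome.
cluster_intervals <- function(start, end, max_distance = 0) {
  if (length(start) == 0) {
    return(integer(0))
  }
  ord <- order(start)
  sorted_start <- start[ord]
  # Furthest end reached by each interval and all intervals before it
  reach <- cummax(end[ord])
  # A new cluster opens when an interval starts beyond that reach
  new_cluster <- c(TRUE, sorted_start[-1] > reach[-length(reach)] + max_distance)
  ids <- cumsum(new_cluster)
  # Return the IDs in the original order of the input
  out <- integer(length(ids))
  out[ord] <- ids
  out
}

# The three chr1 intervals of x1 from the example above
cluster_intervals(c(100, 120, 260), c(150, 250, 450))
cluster_intervals(c(100, 120, 260), c(150, 250, 450), max_distance = 10)
```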
## genome_complement
Calculates the complement of a genomic region.
```{r}
x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))
genome_complement(x1, by = c("chromosome", "start", "end"))
```
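For a single chromosome, the complement consists of the gaps that the intervals leave uncovered. The base-R sketch below illustrates this under several assumptions that are not taken from the package: `interval_complement` is a hypothetical helper, coordinates are 1-based integers with inclusive ends, and the chromosome length of 1000 is a made-up value chosen only for the example. `genome_complement` may treat the chromosome boundaries differently.

```{r}
# Hypothetical helper (not part of tidygenomics): uncovered stretches of a
# chromosome that spans positions 1 to chromosome_length.
interval_complement <- function(start, end, chromosome_length) {
  ord <- order(start)
  sorted_start <- start[ord]
  # Running maximum of the ends, so nested or overlapping intervals
  # do not produce spurious gaps
  reach <- cummax(end[ord])
  gap_start <- c(1, reach + 1)
  gap_end <- c(sorted_start - 1, chromosome_length)
  keep <- gap_start <= gap_end
  data.frame(start = gap_start[keep], end = gap_end[keep])
}

# The three chr1 intervals of x1 from the example above,
# on a chromosome with an assumed length of 1000
interval_complement(c(100, 200, 400), c(150, 250, 450), chromosome_length = 1000)
```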
## genome_join
Classical join function based on the overlap of the intervals. It is implemented and maintained in the
[fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin) package and documented here only for completeness.
```{r}
x1 <- tibble(id = 1:4,
             chr = c("chr1", "chr1", "chr2", "chr3"),
             start = c(100, 200, 300, 400),
             end = c(150, 250, 350, 450))
x2 <- tibble(id = 1:4,
             chr = c("chr1", "chr1", "chr1", "chr2"),
             start = c(220, 210, 300, 400),
             end = c(225, 240, 320, 415))
fuzzyjoin::genome_join(x1, x2, by = c("chr", "start", "end"), mode = "inner")
fuzzyjoin::genome_join(x1, x2, by = c("chr", "start", "end"), mode = "left")
fuzzyjoin::genome_join(x1, x2, by = c("chr", "start", "end"), mode = "anti")
```
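How the join modes differ can be sketched in base R with an explicit overlap test. This is only an illustration and not how `fuzzyjoin` performs the join: the variable names `left`, `right`, `overlaps` and `has_match` are made up for this example, the data repeat the values used above, and both interval ends are assumed to be inclusive. Two intervals overlap when they lie on the same chromosome and each one starts no later than the other ends.

```{r}
left <- data.frame(id = 1:4,
                   chr = c("chr1", "chr1", "chr2", "chr3"),
                   start = c(100, 200, 300, 400),
                   end = c(150, 250, 350, 450),
                   stringsAsFactors = FALSE)
right <- data.frame(id = 1:4,
                    chr = c("chr1", "chr1", "chr1", "chr2"),
                    start = c(220, 210, 300, 400),
                    end = c(225, 240, 320, 415),
                    stringsAsFactors = FALSE)

# Logical matrix: entry [i, j] is TRUE when row i of `left` overlaps row j of `right`
overlaps <- outer(seq_len(nrow(left)), seq_len(nrow(right)), function(i, j) {
  left$chr[i] == right$chr[j] &
    left$start[i] <= right$end[j] &
    right$start[j] <= left$end[i]
})

has_match <- rowSums(overlaps) > 0

# An inner join yields one row per overlapping pair
sum(overlaps)

# A left join additionally keeps the unmatched rows of `left`;
# an anti join returns only those unmatched rows
left$id[!has_match]
```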