tidy data and tidy tools

class: center, middle, inverse, title-slide

# tidy data and tidy tools
## Overview of tools for effective data analysis
### Eric Leung
### 2020-05-17

---

# Overview

- Tidy data

- Core and useful tidyverse functions

- Useful packages

---

# Tidy data overview

![Tidy data](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png)

*Source*: https://r4ds.had.co.nz/tidy-data.html

---

# Tidy data are easier to work with inside the tidyverse

> "Tidy datasets are all alike, but every messy dataset is messy in its own
way."
> 
> –– Hadley Wickham

---

# Objective

Take a built-in phyloseq object and convert it to a tidy data set. Then wrangle
the data using tidy principles and tidyverse packages<sup>*</sup>.

.footnote[[*] The "tidyverse" packages are a suite of packages that adhere to
tidy data principles]

---

# Load packages

```r
library(phyloseq)

# install.packages("remotes")
# remotes::install_github("mikemc/speedyseq")
library(speedyseq)  # Optional
```

---

# phyloseq data example

```r
# Load data example
data("GlobalPatterns")
GlobalPatterns
```

```
## phyloseq-class experiment-level object
## otu_table()   OTU Table:          [ 19216 taxa and 26 samples ]:
## sample_data() Sample Data:        [ 26 samples by 7 sample variables ]:
## tax_table()   Taxonomy Table:     [ 19216 taxa by 7 taxonomic ranks ]:
## phy_tree()    Phylogenetic Tree:  [ 19216 tips and 19215 internal nodes ]:
## taxa are rows
```

Let's convert this into a tidy data set.

---

# `psmelt()` converts to tidy data

```r
psm_gp <- psmelt(GlobalPatterns)  # Create tidy data
```

```r
library(tibble)
psm_gp <- as_tibble(psm_gp)  # Convert to a tibble for nicer printing
```

```r
head(psm_gp, 5)  # View first five rows
```

```
## # A tibble: 5 x 17
##   OTU    Sample   Abundance X.SampleID Primer Final_Barcode Barcode_truncated_p…
##   <chr>  <chr>        <dbl> <fct>      <fct>  <fct>         <fct>               
## 1 549656 AQC4cm     1177685 AQC4cm     ILBC_… ACAGCT        AGCTGT              
## 2 279599 LMEpi24M    914209 LMEpi24M   ILBC_… ACACTG        CAGTGT              
## 3 549656 AQC7cm      711043 AQC7cm     ILBC_… ACAGTG        CACTGT              
## 4 549656 AQC1cm      554198 AQC1cm     ILBC_… ACAGCA        TGCTGT              
## 5 360229 M31Tong     540850 M31Tong    ILBC_… ACACGA        TCGTGT              
## # … with 10 more variables: Barcode_full_length <fct>, SampleType <fct>,
## #   Description <fct>, Kingdom <chr>, Phylum <chr>, Class <chr>, Order <chr>,
## #   Family <chr>, Genus <chr>, Species <chr>
```
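
---

# A toy sketch of what melting does

Conceptually, melting reshapes the abundance table to one row per OTU and sample, then joins on the sample and taxonomy tables. The sketch below mimics that with tidyverse verbs on made-up data. The three small tables and their values are hypothetical stand-ins for `otu_table()`, `sample_data()`, and `tax_table()`, and this is not how `psmelt()` is implemented internally.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy inputs (not taken from GlobalPatterns)
otu_counts <- tribble(
  ~OTU,   ~S1, ~S2,
  "otu1",  10,   0,
  "otu2",   3,   7
)
sample_info <- tribble(
  ~Sample, ~SampleType,
  "S1",    "Feces",
  "S2",    "Soil"
)
taxonomy <- tribble(
  ~OTU,   ~Phylum,
  "otu1", "Firmicutes",
  "otu2", "Bacteroidetes"
)

toy_melted <- otu_counts %>%
  # One row per OTU/sample pair
  pivot_longer(cols = !OTU, names_to = "Sample", values_to = "Abundance") %>%
  # Attach sample-level and taxon-level information
  left_join(sample_info, by = "Sample") %>%
  left_join(taxonomy, by = "OTU") %>%
  # Largest abundances first, like the output on the previous slide
  arrange(desc(Abundance))

toy_melted
```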

---

# Quick note on the tidyverse

- The {tidyverse} package is a suite of packages

- Home website https://www.tidyverse.org/

- Running `library(tidyverse)` will load the main tidyverse packages

- The code on the following slides loads individual tidyverse packages as they
  are needed

---

# Let's clean the `Description` column

```r
library(dplyr)

psm_gp %>%
  select(OTU, Sample, SampleType, Description) %>%
  head(5)
```

```
## # A tibble: 5 x 4
##   OTU    Sample   SampleType         Description                                
##   <chr>  <chr>    <fct>              <fct>                                      
## 1 549656 AQC4cm   Freshwater (creek) "Allequash Creek, 3-4 cm depth"            
## 2 279599 LMEpi24M Freshwater         "Lake Mendota Minnesota, 24 meter epilimni…
## 3 549656 AQC7cm   Freshwater (creek) "Allequash Creek, 6-7 cm depth"            
## 4 549656 AQC1cm   Freshwater (creek) "Allequash Creek, 0-1cm depth"             
## 5 360229 M31Tong  Tongue             "M3, Day 1, tongue, whole body study "
```

---

# Just keep stool samples

```r
psm_gp %>%
  select(OTU, Sample, SampleType, Description) %>%
  filter(SampleType == "Feces") %>%
  head(5)
```

```
## # A tibble: 5 x 4
##   OTU    Sample  SampleType Description                               
##   <chr>  <chr>   <fct>      <fct>                                     
## 1 331820 M11Fcsw Feces      "M1, Day 1, fecal swab, whole body study "
## 2 331820 M31Fcsw Feces      "M3, Day 1, fecal swab, whole body study" 
## 3 189047 TS29    Feces      "Twin #2"                                 
## 4 158660 M11Fcsw Feces      "M1, Day 1, fecal swab, whole body study "
## 5 244304 M11Fcsw Feces      "M1, Day 1, fecal swab, whole body study "
```

The descriptions have lots of information in them that we might want to
visualize later on with ggplot2.

That information is difficult to plot while it is locked inside free text.

**Note**: some of the following code may not reflect the "best" design decisions.

---

# Load core data wrangling libraries

```r
library(tidyr)
```

```r
# Column must be character, `Description` is currently a factor
psm_gp %>%
  select(OTU, Sample, SampleType, Description) %>%
  filter(SampleType == "Feces") %>%
* mutate(Description = as.character(Description)) %>%
* separate_rows(Description, sep = ", ") %>%
  head(5)
```

```
## # A tibble: 5 x 4
##   OTU    Sample  SampleType Description        
##   <chr>  <chr>   <fct>      <chr>              
## 1 331820 M11Fcsw Feces      "M1"               
## 2 331820 M11Fcsw Feces      "Day 1"            
## 3 331820 M11Fcsw Feces      "fecal swab"       
## 4 331820 M11Fcsw Feces      "whole body study "
## 5 331820 M31Fcsw Feces      "M3"
```

---

# Let's move values into columns

```r
# Add a placeholder `value` column to fill the new wide columns
psm_gp %>%
  select(OTU, Sample, SampleType, Description) %>%
  filter(SampleType == "Feces") %>%
  mutate(Description = as.character(Description)) %>%
  separate_rows(Description, sep = ", ") %>%
* mutate(value = 1) %>%
* pivot_wider(names_from = Description,
*             values_from = value) %>%
  head(5)
```

```
## # A tibble: 5 x 11
##   OTU    Sample  SampleType    M1 `Day 1` `fecal swab` `whole body study `    M3
##   <chr>  <chr>   <fct>      <dbl>   <dbl>        <dbl>               <dbl> <dbl>
## 1 331820 M11Fcsw Feces          1       1            1                   1    NA
## 2 331820 M31Fcsw Feces         NA       1            1                  NA     1
## 3 189047 TS29    Feces         NA      NA           NA                  NA    NA
## 4 158660 M11Fcsw Feces          1       1            1                   1    NA
## 5 244304 M11Fcsw Feces          1       1            1                   1    NA
## # … with 3 more variables: whole body study <dbl>, Twin #2 <dbl>, Twin #1 <dbl>
```
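
---

# Trailing spaces create duplicate columns

The output above contains both `whole body study ` and `whole body study` because some descriptions end with a trailing space. One possible fix, sketched here on a toy tibble built from two of the descriptions shown earlier, is to trim each piece with base R's `trimws()` before pivoting. The names `toy_desc` and `toy_wide` are made up for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two descriptions copied from the earlier output; note the trailing space
toy_desc <- tibble(
  Sample = c("M11Fcsw", "M31Fcsw"),
  Description = c("M1, Day 1, fecal swab, whole body study ",
                  "M3, Day 1, fecal swab, whole body study")
)

toy_wide <- toy_desc %>%
  separate_rows(Description, sep = ", ") %>%
  mutate(Description = trimws(Description)) %>%  # Drop leading/trailing spaces
  mutate(value = 1) %>%
  pivot_wider(names_from = Description,
              values_from = value)

names(toy_wide)  # Only one "whole body study" column remains
```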

---

# We can reverse our pivot

```r
psm_gp %>%
  select(OTU, Sample, SampleType, Description) %>%
  filter(SampleType == "Feces") %>%
  mutate(Description = as.character(Description)) %>%
  separate_rows(Description, sep = ", ") %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Description,
              values_from = value) %>%
* pivot_longer(cols = !c(OTU, Sample, SampleType),
*              names_to = "Description",
*              values_to = "Value") %>%
  head(5)
```

```
## # A tibble: 5 x 5
##   OTU    Sample  SampleType Description         Value
##   <chr>  <chr>   <fct>      <chr>               <dbl>
## 1 331820 M11Fcsw Feces      "M1"                    1
## 2 331820 M11Fcsw Feces      "Day 1"                 1
## 3 331820 M11Fcsw Feces      "fecal swab"            1
## 4 331820 M11Fcsw Feces      "whole body study "     1
## 5 331820 M11Fcsw Feces      "M3"                   NA
```
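
---

# Dropping the `NA` rows when pivoting back

Reversing the pivot keeps every sample and description combination, including those that were never observed (the `NA` in row 5 above). If only the observed combinations are wanted, `pivot_longer()` accepts `values_drop_na = TRUE`. This is a small sketch on a hypothetical wide tibble, not the full data.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table shaped like the pivot_wider() result
toy_wide <- tibble(
  Sample  = c("M11Fcsw", "M31Fcsw"),
  M1      = c(1, NA),
  M3      = c(NA, 1),
  `Day 1` = c(1, 1)
)

toy_long <- toy_wide %>%
  pivot_longer(cols = !Sample,
               names_to = "Description",
               values_to = "Value",
               values_drop_na = TRUE)  # Skip combinations that are NA

toy_long
```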

---

# Future steps

- With the code here, you could further process the data to filter on specific
  features.

- For example, you could focus on stool samples from a specific twin or from the
  whole body study.

- The {stringr} package within the tidyverse will prove useful for checking the
  contents of strings
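
---

# Sketch: filtering on string contents

As a rough illustration of the steps above, `stringr::str_detect()` can be combined with `dplyr::filter()` to keep rows whose description matches a pattern. The toy tibble reuses three descriptions from the earlier output; the object names are made up for this example.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy data modeled on the stool samples shown earlier
toy_feces <- tibble(
  Sample = c("M11Fcsw", "M31Fcsw", "TS29"),
  Description = c("M1, Day 1, fecal swab, whole body study ",
                  "M3, Day 1, fecal swab, whole body study",
                  "Twin #2")
)

# Samples whose description starts with "Twin"
twin_samples <- toy_feces %>%
  filter(str_detect(Description, "^Twin"))

# Samples from the whole body study (literal match, trailing space or not)
whole_body <- toy_feces %>%
  filter(str_detect(Description, fixed("whole body study")))
```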

---

# Functions reviewed

- **Note**: the following functions are in the form `<package>::<function>`

- **`dplyr::select()`** = select, subset, and remove columns from a data frame

- **`dplyr::filter()`** = keep or drop rows based on the values they contain

- **`dplyr::mutate()`** = create new columns or transform existing column values

- **`tidyr::separate_rows()`** = take column values and separate them into
  multiple rows

- **`tidyr::pivot_wider()`** and **`tidyr::pivot_longer()`** = transform a data
  frame from long to wide and from wide to long, respectively
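
---

# The `<package>::<function>` form in code

The `<package>::<function>` notation also works in code: it calls a function without attaching the package with `library()`, which makes it obvious where each function comes from. A minimal sketch on a made-up tibble:

```r
# Hypothetical toy data; no library() calls are needed
toy <- tibble::tibble(
  OTU        = c("otu1", "otu2", "otu3"),
  SampleType = c("Feces", "Soil", "Feces"),
  Abundance  = c(10, 3, 7)
)

feces_only <- dplyr::filter(toy, SampleType == "Feces")  # Keep rows
feces_only <- dplyr::select(feces_only, OTU, Abundance)  # Keep columns

feces_only
```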

---

# Useful packages

- janitor = miscellaneous cleaning

- unheadr = weird nested headers

- tidycells = multiple tables spaced out in Excel

---

# janitor for cleaning column names

```r
library(janitor)

psm_gp %>%
  select(OTU, Sample, SampleType, Description) %>%
  filter(SampleType == "Feces") %>%
  mutate(Description = as.character(Description)) %>%
  separate_rows(Description, sep = ", ") %>%
* clean_names() %>%
  head(5)
```

```
## # A tibble: 5 x 4
##   otu    sample  sample_type description        
##   <chr>  <chr>   <fct>       <chr>              
## 1 331820 M11Fcsw Feces       "M1"               
## 2 331820 M11Fcsw Feces       "Day 1"            
## 3 331820 M11Fcsw Feces       "fecal swab"       
## 4 331820 M11Fcsw Feces       "whole body study "
## 5 331820 M31Fcsw Feces       "M3"
```

---

# janitor for creating frequency tables

```r
psm_gp %>%
  select(OTU, Sample, SampleType, Description) %>%
* tabyl(SampleType) %>%
* adorn_totals()
```

```
##          SampleType      n    percent
##               Feces  76864 0.15384615
##          Freshwater  38432 0.07692308
##  Freshwater (creek)  57648 0.11538462
##                Mock  57648 0.11538462
##               Ocean  57648 0.11538462
##  Sediment (estuary)  57648 0.11538462
##                Skin  57648 0.11538462
##                Soil  57648 0.11538462
##              Tongue  38432 0.07692308
##               Total 499616 1.00000000
```
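
---

# Counting samples rather than rows

The counts above are rows of the melted data, so every sample is counted once per OTU (for example, 76864 = 4 samples x 19216 taxa). To tabulate samples instead, one option is to reduce to distinct sample rows first. This sketch uses a hypothetical three-sample tibble; `adorn_pct_formatting()` is added only to show percentages in a readable form.

```r
library(dplyr)
library(janitor)
library(tibble)

# Hypothetical melted data: 3 samples x 2 OTUs
toy_melted <- tibble(
  OTU        = rep(c("otu1", "otu2"), times = 3),
  Sample     = rep(c("S1", "S2", "S3"), each = 2),
  SampleType = rep(c("Feces", "Feces", "Soil"), each = 2)
)

sample_counts <- toy_melted %>%
  distinct(Sample, SampleType) %>%  # One row per sample
  tabyl(SampleType) %>%
  adorn_totals() %>%
  adorn_pct_formatting()            # Show proportions as percentages

sample_counts
```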

---

# unheadr for nested headers

**Before**

| scientific\_name        | common\_name                 | red\_list\_status | mass\_kg |
| :---------------------- | :--------------------------- | :---------------- | -------: |
| Asia                    | NA                           | NA                |       NA |
| CERCOPITHECIDAE         | NA                           | NA                |       NA |
| Trachypithecus obscurus | Dusky Langur                 | NT                |     7.13 |
| Presbytis sumatra       | Black Sumatran Langur        | EN                |     6.00 |

---

# unheadr for nested headers

**After**

| scientific\_name        | common\_name                 | red\_list\_status | mass\_kg | family          |
| :---------------------- | :--------------------------- | :---------------- | -------: | :-------------- |
| Trachypithecus obscurus | Dusky Langur                 | NT                |     7.13 | CERCOPITHECIDAE |
| Presbytis sumatra       | Black Sumatran Langur        | EN                |     6.00 | CERCOPITHECIDAE |
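
---

# Sketch: the same reshaping with dplyr and tidyr

To make the before/after transformation concrete, here is a rough equivalent using only dplyr and tidyr. It is not unheadr's own implementation, and it assumes that family names end in "DAE" and that embedded header rows have no `common_name`. The data are the four rows from the "Before" table.

```r
library(dplyr)
library(tidyr)
library(tibble)

# The four rows from the "Before" table
primates <- tribble(
  ~scientific_name,          ~common_name,            ~red_list_status, ~mass_kg,
  "Asia",                    NA,                      NA,               NA,
  "CERCOPITHECIDAE",         NA,                      NA,               NA,
  "Trachypithecus obscurus", "Dusky Langur",          "NT",             7.13,
  "Presbytis sumatra",       "Black Sumatran Langur", "EN",             6.00
)

untangled <- primates %>%
  # Copy family header rows (assumed to end in "DAE") into a new column
  mutate(family = if_else(grepl("DAE$", scientific_name),
                          scientific_name, NA_character_)) %>%
  # Carry each family name down to the species rows below it
  fill(family) %>%
  # Drop the embedded header rows, which have no common name
  filter(!is.na(common_name))

untangled
```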

---

# tidycells for odd cell arrangements

![](https://r-rudra.github.io/tidycells/articles/ext/marks.png)

---

# Resources

- R for Data Science https://r4ds.had.co.nz
- Ted's list of underrated tidyverse functions 
  https://hugo-portfolio-example.netlify.app/projects/tidyverse_functions/
- {janitor} R package https://github.com/sfirke/janitor
- {unheadr} R package https://github.com/luisDVA/unheadr
- {tidycells} https://r-rudra.github.io/tidycells/
- Blog post on cleaning messy data 
  https://rfortherestofus.com/2019/12/how-to-clean-messy-data-in-r/