Extending GAMs with interactions

# Extending GAMs with interactions
### Eric Pedersen
### November 4, 2021

---

# Reminder from Tuesday:

A GLM is: 
  - a distribution for our data + 
  - a link function + 
  - a linear model of the transformed mean as a function of covariates

A GAM is a GLM except:
  - covariate functions can be nonlinear, built from basis functions of covariates
  - Estimated with penalized likelihood, penalizing overly wiggly functions
  
  
---

# So far:

- One-dimensional smooth models
- Models with multiple separate predictors
- Normally distributed response

These are good models, but there's lots more to see!

---

# New things

- distribution of the data
  - `family=` argument
- visually checking model assumptions when building models
- adding dimensions to your smooths
  - space and time

---

# Distributions

- `family=` argument in `gam()`
- see `?family.mgcv` for a list
- what you'd expect, as for `lm()` and `glm()`
- most useful cases:
  - `binomial` ( yes/no, `$y \in \{0, 1\}$` )
  - `poisson` ( counts, `$y \in \{0, 1, 2, \ldots\}$` )
  - `Gamma` ( positive, `$y>0$` )

---

# Special count distributions

- Poisson is often not adequate for "real" counts
- Assuming `$\text{Var}(y) = \mathbb{E}(y)$` is usually incorrect

![](02-extending-gams_files/figure-html/species-gala-plots-1.svg)

---

# Special count distributions

- Poisson is often not adequate for "real" counts
- Assuming `$\text{Var}(y) = \mathbb{E}(y)$` is usually incorrect
 
![](02-extending-gams_files/figure-html/species-gala-plots2-1.svg)
---

# Special count distributions

Other options?
 
- `quasipoisson` (count-ish, `$y \geq 0$`)
  - `$\text{Var}(y) = \psi\times\mathbb{E}(y)$` 
  - awkward to check, no likelihood
  
--

- `nb`/`negbin` ( `$y \geq 0$`)
  - `$\text{Var}(y) = \mathbb{E}(y) + \kappa\times\mathbb{E}(y)^2$` 
  - models overdispersed counts
  
--

- `tw`/`Tweedie` ( `$y \geq 0$`)
  - great for CPUE-type data
  - works on non-count positive data as well
  - `$\text{Var}\left(\text{y}\right) = \phi\mathbb{E}(\text{y})^q$`

---

# Tweedie distribution

.pull-left[
![](02-extending-gams_files/figure-html/tweedie-1.svg)
(NB there is a point mass at zero not plotted)
]
.pull-right[
-  `$\text{Var}\left(\text{y}\right) = \phi\mathbb{E}(\text{y})^q$`
- Poisson is `$q=1$` 
- We estimate `$q$` and `$\phi$` for `tw`
- We set `$q$` and estimate `$\phi$` for `Tweedie`
]

---

# Negative binomial distribution

.pull-left[
![](02-extending-gams_files/figure-html/negbin-1.svg)
]
.pull-right[
- `$\text{Var}\left(\text{count}\right) =$` `$\mathbb{E}(\text{count}) + \kappa \mathbb{E}(\text{count})^2$`
- Poisson is `$\kappa=0$`
- Estimate `$\kappa$` for `nb`
- Set `$\kappa$` for `negbin` 
]

---

# 🐟🐠🐡 Example 🐡🐠🐟

.pull-left[
- Let's look at species richness as a function of depth
- Number of species in each trawl (counts!)

```r
trawls <- read.csv(here("data/trawl_nl.csv"))
```
]

]

---

# 🐟🐠🐡 Example 🐡🐠🐟

```r
rich_depthl10 <- gam(richness~s(log10(depth),k = 30), 
                     family=poisson,
              method="REML", 
              data=trawls)
```
]

```r
plot(rich_depthl10, shade=TRUE)
```

![](02-extending-gams_files/figure-html/plot-year-rich-1.svg)
]

---

# `summary` output

```
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## richness ~ s(log10(depth), k = 30)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.155047   0.003212   982.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df Chi.sq p-value    
## s(log10(depth)) 12.65   15.6  972.3  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.157   Deviance explained = 15.7%
## -REML =  12983  Scale est. = 1         n = 4152
```

---

# Other `family` stuff

- can specify a `link=` argument
  - (usually don't have to)
  - `?family` has options for each distribution
- for a fitted model `model$family`
  - details of what was used
  - `model$family$linkfun()` gives link function
  - `model$family$linkinv()` gives inverse link

---

# How can we check whether our assumed distribution is adequate?

```r
par(mfrow=c(2,2))
gam.check(rich_depthl10)
```

![](02-extending-gams_files/figure-html/pois-rich-check1-1.svg)

```
## 
## Method: REML   Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-1.49377e-07,-1.49377e-07]
## (score 12983.34 & scale 1).
## Hessian positive definite, eigenvalue range [2.515006,2.515006].
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'  edf k-index p-value   
## s(log10(depth)) 29.0 12.7    0.95   0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# How can we check whether our assumed distribution is adequate?

```r
par(mfrow=c(2,2))
gam.check(rich_depthl10)
```

```
## 
## Method: REML   Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-1.49377e-07,-1.49377e-07]
## (score 12983.34 & scale 1).
## Hessian positive definite, eigenvalue range [2.515006,2.515006].
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'  edf k-index p-value   
## s(log10(depth)) 29.0 12.7    0.95    0.01 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# How can we check whether our assumed distribution is adequate?

```r
appraise(rich_depthl10)
```

![](02-extending-gams_files/figure-html/pois-rich-check3-1.svg)

---

# Let's pause the talk here to try this ourselves

---

## Beyond one dimensional smooths:<br/> space, time and more

---

# 🦐🦐 Example 🦐🦐

```r
trawls <- read.csv(here("data/trawl_nl.csv"))
trawls_2010 <- filter(trawls, year==2010)
range(trawls_2010$shrimp)
```

```
## [1]    0.000 2293.699
```

```r
head(trawls_2010$shrimp, 3)
```

```
## [1]  0.1202384  0.9619069 13.6270140
```

Not normal, not integer!

]

![](02-extending-gams_files/figure-html/unnamed-chunk-1-1.svg)

]

---

# 🦐🗺

.pull-left[
- spatial variation!
- how do we model this??
- `s(long) + s(lat)` misses the interaction
]

---

# Adding dimensions

- Thin plate regression splines (default basis)
- `s(x, y)`
- Assumes `x` and `y` are measured in the same units
  - `x`, `y` projected coordinates 👍
  - `x` temperature, `y` depth 👎

---

# 🦐🗺

Starting with a 2D smooth, assuming data is Gaussian

```r
spatial_shrimp <- gam(shrimp ~ s(x, y, k =100),
                data=trawls_2010,
                family = gaussian,
                method="REML")
```
]

```r
plot(spatial_shrimp, asp=1)
```

![](02-extending-gams_files/figure-html/plot-effect-1.svg)
]

---

# That plot is hard to understand!

```r
plot(spatial_shrimp, asp=1)
```

![](02-extending-gams_files/figure-html/plot-effect-bad-1.svg)
]

```r
plot(spatial_shrimp, scheme=2, asp=1)
```

![](02-extending-gams_files/figure-html/plot-effect-good-1.svg)
]

---

# Using gratia:
.left-column[

```r
draw(spatial_shrimp)
```

]

---

# `summary` output

```
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## shrimp ~ s(x, y, k = 100)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  125.390      8.658   14.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(x,y) 39.32  53.66 3.759  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.309   Deviance explained = 36.6%
## -REML = 3241.7  Scale est. = 36132     n = 482
```

---

## Now give it a try!

---

## Other multi-dimensional smooths

---

# `s(x,y)` doesn't always work

- Only works for `bs="tp"` or `bs="ts"`
- Covariates are isotropic
- What if we wanted to use lat/long?
- Or, more generally: interactions between covariates?

---

# Enter `te()`

.pull-left[
- We can built interactions using `te()`
- Construct 2D basis from 2 1D bases
- Biomass as a function of temperature and depth?
  - `te(temp_bottom, log10(depth))`
- 💭 "marginal 1Ds, join them up"
]

---

# Using `te()`

Just like `s()`:

```r
shrimp_te <- gam(shrimp ~ te(log10(depth), temp_bottom),
                data=trawls_2010,
                family=tw,
                method="REML")
```

---

# `summary`

```
## 
## Family: Tweedie(p=1.595) 
## Link function: log 
## 
## Formula:
## shrimp ~ te(log10(depth), temp_bottom)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.8484     0.2531   7.304 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                edf Ref.df     F p-value    
## te(log10(depth),temp_bottom) 15.19  16.99 33.09  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.337   Deviance explained = 66.1%
## -REML = 1946.5  Scale est. = 8.1292    n = 482
```

---

# Things to fiddle with

- Setting `k` in two ways:
  - `k=5`: 5 for all covariates (total `$5*5=25$`)
  - `k=c(3,5)`: per basis, in order (total `$3*5=15$`)
- Setting `bs` in two  ways:
  - `bs="tp"`: tprs for all bases
  - `bs=c("tp", "tp")`: tprs per basis

---

# Pulling `te()` apart: `ti()`

- Can we look at the components of the `te()`
- `te(x, y) = ti(x, y) + ti(x) + ti(y)`

```r
shrimp_ti <- gam(shrimp ~ ti(temp_bottom, depth) +
                          s(temp_bottom) + s(depth),
                data=trawls_2010,
                family=tw,
                method="REML")
```

---

# `summary`

```
## 
## Family: Tweedie(p=1.591) 
## Link function: log 
## 
## Formula:
## shrimp ~ ti(temp_bottom, depth) + s(temp_bottom) + s(depth)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.672      1.067   1.568    0.118
## 
## Approximate significance of smooth terms:
##                         edf Ref.df      F  p-value    
## ti(temp_bottom,depth) 7.417  8.860  4.177 0.000459 ***
## s(temp_bottom)        5.661  6.850  1.703 0.081067 .  
## s(depth)              6.592  7.238 10.177  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.349   Deviance explained = 66.7%
## -REML = 1951.7  Scale est. = 8.1639    n = 482
```

---

## Now give it a try!

---
class: inverse middle center big-subsection

## Building spatio-temporal models

---

# 🦐🗺🗓

![](02-extending-gams_files/figure-html/biom-space-time-plot-1.svg)

---

# Space x time

- We had a 2d spatial model, add time?
  - `te(x,y,year)` ?
- Want that 2d smooth rather than `te(lat, long)`?
- `d=` groups covariates
  - `te(x, y, year, d=c(2,1))` gives `x,y` smooth and `year` smooth tensor
- Assuming default `k=` and `bs=` for bases

---

# Fiddling

- Often fewer temporal replicates
  - Fewer years than unique locations
  - `k=` smaller for temporal covariate?
- Use cubic spline basis for time?
  - simpler basis, even knot placement
- When using `ti()` everything needs to match up!

---

# 🦐🗺🗓

Putting that together:

```r
shrimp_xyt <- gam(shrimp ~ ti(y,x, year, d=c(2,1), 
                              bs=c("tp", "cr"), k=c(20, 5)) +
                           s(x, y, bs="tp", k=20) +
                           s(year,  bs="cr", k=5),
                data=trawls,
                family=tw,
                method="REML")
```

---

# `summary`

```
## 
## Family: Tweedie(p=1.686) 
## Link function: log 
## 
## Formula:
## shrimp ~ ti(y, x, year, d = c(2, 1), bs = c("tp", "cr"), k = c(20, 
##     5)) + s(x, y, bs = "tp", k = 20) + s(year, bs = "cr", k = 5)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.89917    0.03127   124.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df       F p-value    
## ti(y,x,year) 47.65  58.59   8.144  <2e-16 ***
## s(x,y)       18.88  19.00 180.295  <2e-16 ***
## s(year)       3.87   3.98 108.490  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.136   Deviance explained = 45.2%
## -REML =  19084  Scale est. = 9.3801    n = 4152
```

---

# `ti(x, y)` and `ti(year)`

![](02-extending-gams_files/figure-html/ti-xyt-plot-xy-t-1.svg)
---

# `ti(x, y, year)`

![](02-extending-gams_files/figure-html/ti-xyt-plot-xyt-1.svg)

---
class: inverse middle center big-subsection

## Recap

---

# What did we learn?

- set response distribution
  - `family=` argument
  - see `?family`
- spatial smoothing
  - `s(x,y)` for projected coordinates
  - `te(lat, long)` for latitude and longitude (better to project?)
- interactions
  - `te(covar1, covar2, ...)`
  - `ti()` to decompose the effects
- space and time
  - `te(x, y, time, d=c(2,1))`
  - again `ti()` to decompose