State-Space Methods for Testing Hypotheses About Environmental Drivers and Biotic Interactions in Community and Paleoecological Time Series

Authors

Quinn Asena, Jack Williams, and Tony Ives

Published

August 5, 2024

Some prep

Welcome to our workshop on how to use the multinomialTS package. First things first, let’s make sure you have all the necessary packages installed. Follow the instructions on the README and then try rendering this document. This may take a few minutes, so while that is running, I’m going to go through a few slides. If this document fails to render, something has gone wrong and one of our helpers/minions will help you sort it out.

Introduction

This workshop will take you through how to fit the multinomialTS model and interpret the results. One of the primary goals of this model is to test multiple hypotheses about the data and lend statistical support to the different hypotheses. For example, which environmental drivers are having the strongest effects on which taxa? Or, are interactions among taxa or abiotic variables driving change in the system? This state-space approach allows us to estimate coefficients for taxa interactions and driver-taxa relationships, which we don’t get from methods such as ordination or cluster analysis. We recommend this method as complementary to other methods, as each has its advantages.

Multinomial state variables (\(Y\))

This model has been designed specifically to work with a multinomially distributed response variable, that is, data from a finite number of individuals (e.g., 300 counts) per sample, drawn from multiple possible classes (e.g., different taxa). We will be using a palaeoecological dataset as an example, where a number of fossil pollen grains are counted (usually 300-400) and identified per slice of sediment. From these data, we know the relative abundance of taxa in the sample, but not the absolute abundance of each species on the landscape. Knowing only the relative abundances creates interdependencies among the taxa, as we can only generate estimates for \(n-1\) taxa. Put simply, if we have 3 species and 60% of the counts belong to species 1, and 30% of the counts belong to species 2, then the remaining 10% must belong to species 3.
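As a minimal illustration of that interdependency (hypothetical counts, not the Story Lake data), note how the proportion of the last taxon is fully determined once the others are known:

# Hypothetical sample of 300 pollen grains from 3 taxa
counts <- c(taxon_1 = 180, taxon_2 = 90, taxon_3 = 30)
props <- counts / sum(counts) # relative abundances: 0.6, 0.3, 0.1
# Knowing the total and any n - 1 proportions fixes the remaining one
1 - sum(props[1:2]) # 0.1, the proportion of taxon_3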

Data handling

The multinomialTS model has been designed to work with multinomially distributed ecological state variables (\(Y\)), but the covariate predictor data (\(X\)) can be of mixed types, such as binary events or charcoal accumulation rates. The data are divided into two parts: the state variables (\(Y\), e.g., pollen counts) and the environmental covariates (\(X\)). Analysing these data with multinomialTS can be split into two decision streams (Figure 1):

  1. Handling the state variables \(Y\)
  2. Handling the covariates \(X\)
Figure 1: The process of handling the state variables, the covariates, and fitting the model

We will go through an example of how to do this in detail now.

Data from Story Lake

Let’s get some data. We are going to use pollen counts from Story Lake, IN (Schlenker et al. 2024); these will be our state variables \(Y\). The data also contain charcoal counts, which will be our environmental covariate \(X\). The data are stored in the Neotoma database.

The following (folded) code block will download the data directly from the Neotoma API, do basic harmonisation of taxa, and some data wrangling to give us a wide dataset of state variables (\(Y\)) and environmental covariates (\(X\)). For this session, the data are already wrangled and saved to the data directory, so the code block is set not to run (#| eval: false). We are not going to delve into the details of taxonomic resolution in this workshop; for details on using the neotoma2 package check out the neotoma GitHub.

Code
# Find site in neotoma
story_site <- get_sites(sitename = "Story%")

# Download the data from the site
# Downloading the data only needs to be done once
story_samples <- story_site %>%  
    get_downloads() %>% 
    samples()

# Save the raw data. Always keep an untouched copy of the raw data.
saveRDS(story_samples, "./data/story_samples.rds")

# Read-in the raw samples, this avoids repeatedly downloading from the API
story_samples <- readRDS("./data/story_samples.rds")

# Some harmonisation of species, see Schlenker et al., 2024 in refs for details
# First filter the data for pollen samples, then harmonise a few species to the genus-level
story_pollen_long <- story_samples %>%
  filter(elementtype == "pollen",
         units == "NISP") %>% 
  mutate(variablename = replace(variablename, stringr::str_detect(variablename, "Pinus*"), "Pinus"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Picea*"), "Picea"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Acer*"), "Acer"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Juglans*"), "Juglans"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Fraxinus*"), "Fraxinus")
  ) %>% 
  group_by(siteid, sitename,
           sampleid, variablename, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups='keep')

# Check names for any variables like spike, trace, Lycopodium, Microspheres...
unique(story_pollen_long$variablename)

saveRDS(story_pollen_long, "./data/story_pollen_long.rds")


# Now let's turn to the charcoal
# first filter data for charcoal counts
story_char <- story_samples %>%
  filter(elementtype == ">125 µm")

# The following code pivots the data to a wide dataframe
# then does a very crude calculation to convert charcoal counts to charcoal accumulation rate
# for the purposes of this workshop the crude calculation will do
story_char_wide <- story_char %>% 
  pivot_wider(id_cols = c(age, depth), names_from = variablename, values_from = value) %>% 
  mutate(lag_depth = abs(lag(depth) - depth),
  lag_age = abs(lag(age) - age),
  acc_rate = lag_depth / lag_age,
  char_acc = acc_rate * Charcoal) %>% 
  mutate(across(everything(), ~ ifelse(row_number() == 1 & is.na(.), 0, .))) %>% 
  select(age, depth, char_acc)

# Let's save the wrangled data
saveRDS(story_char_wide, "./data/story_char_wide.rds")

Since the folded code block above has already been run, and the \(Y\) and \(X\) data saved to the data directory, we can read in the wrangled data directly.

# Read-in the wrangled data 
# Read in our X variables
story_char_wide <- readRDS("./data/story_char_wide.rds")
# Read in the long data for plotting
story_pollen_long <- readRDS("./data/story_pollen_long.rds")

Currently both the \(Y\) and \(X\) data contain age and depth columns. These columns will be excluded later, but I like to keep some form of common ID column at this stage.

Handling the state variables \(Y\)

Let’s start handling our state variables \(Y\). It is not recommended to attempt to estimate coefficients for every species in the data simultaneously, so we are going to look at key taxa and groups. Let’s do some quick and dirty visualisation of the data by plotting the top five most abundant species by their counts. It can be useful to explore more than the top five species of course; for example, there may be key species of ecological significance that occur in low abundances. We recommend methods like cluster analysis and topic modelling as preliminary, or complementary, analyses, but we will not cover those methods in this workshop.

Code
# There are multiple ways of coding the data-wrangling
# I'm sticking with tidyverse for consistency but
# sometimes base R is more efficient!
story_pollen_top5_count <- story_pollen_long %>%
  group_by(variablename) %>%
  mutate(total_count = sum(value)) %>%
  slice(1) %>% 
  ungroup() %>%
  select(variablename, total_count) %>% 
  arrange(desc(total_count)) %>% 
  slice(1:5)

story_top5 <- story_pollen_long %>% 
  filter(variablename %in% c(story_pollen_top5_count$variablename))

story_other <- story_pollen_long %>%
  filter(!variablename %in% c(story_pollen_top5_count$variablename)) %>%
  mutate(variablename = "other") %>%
  group_by(variablename, age) %>%
  summarise(value = sum(value))

story_top5_res <- bind_rows(story_top5, story_other) %>%
  mutate(variablename = factor(variablename, levels =  c(
    "other", rev(story_pollen_top5_count$variablename)
  )))

ggplot(story_top5_res, aes(x = age, y = value)) +
  geom_area(colour = "grey90") +
  geom_col() +
  scale_x_reverse(breaks = scales::breaks_pretty(n = 6)) +
  coord_flip() +
  # ylim(0, 0.5) +
  labs(y = "Pollen counts", x = "Time (ybp)") +
  facet_wrap(~variablename,
             nrow = 1) +
  theme_minimal() +
  theme(
    text = element_text(size = 10),
  )
Figure 2: Pollen counts of the five species with the highest counts in the record. All other species are grouped into the ‘other’ species group

From Figure 2, we can see that oak (Quercus), American beech (Fagus grandifolia), and elm (Ulmus) occur in high counts. Ash (Fraxinus) and hickory (Carya) also occur throughout the record, but at lower counts. The large peak at the most recent part of the record, in the grouped ‘other’ species, is from an increase in ragweed (Ambrosia) after Euro-American land clearance (Schlenker et al. 2024). This gives us a starting point for deciding which are our target taxa. The Story Lake data are dominated by three taxa (oak, beech, and elm), so we are going to use these as our target taxa. Additionally, from Schlenker et al. (2024), there are notable hardwood taxa: hickory (Carya), hornbeam (Ostrya/Carpinus), ash (Fraxinus), walnut (Juglans), sycamore (Platanus) and maple (Acer). With the ecological knowledge of the hardwood taxa, I am also creating a ‘hardwood’ group.

Code
hardwood_names <- c("Carya", "Ostrya/Carpinus", "Acer", "Juglans", "Fraxinus")
target_taxa <- c("Ulmus", "Fagus grandifolia", "Quercus")

# Create a group of hardwood taxa by summing counts
story_pollen_long_hardwood <- story_pollen_long %>%
  filter(variablename %in% hardwood_names) %>% 
  mutate(variablename = "hardwood") %>% 
  group_by(siteid, sitename,
           sampleid, variablename, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups='keep')

# Filter for target taxa
story_pollen_long_targets <- story_pollen_long %>%
    filter(variablename %in% target_taxa)

# Sum everything else into our reference group
story_other_taxa <- story_pollen_long %>%
  filter(!variablename %in% c(hardwood_names, target_taxa)) %>%
  mutate(variablename = "other") %>% 
  group_by(siteid, sitename,
           sampleid, variablename, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups='keep')

state_variables <- bind_rows(story_other_taxa, story_pollen_long_hardwood, story_pollen_long_targets) %>% 
  mutate(variablename = factor(variablename, levels =  c("other", "hardwood", target_taxa)),
         value = replace_na(value, 0))

Let’s check-out our new grouping:

Code
ggplot(state_variables, aes(x = age, y = value)) +
  geom_area(colour = "grey90") +
  geom_col() +
  scale_x_reverse(breaks = scales::breaks_pretty(n = 6)) +
  coord_flip() +
  # ylim(0, 0.5) +
  labs(y = "Pollen counts", x = "Time (ybp)") +
  facet_wrap(~variablename,
             nrow = 1) +
  theme_minimal() +
  theme(
    text = element_text(size = 10),
  )
Figure 3: Pollen counts of the most common taxa with an additional group of ‘hardwood’ taxa.

The data span 7931 years, and attempting to predict ecological processes at an annual resolution from observations spaced by hundreds of years would be unreliable. So we need to choose a resolution for our predictions, i.e., we will estimate underlying ecological processes at, say, a 50-year or 100-year temporal resolution. To do so, the data will be binned at the chosen resolution. The width of the bins is chosen according to the effect that binning will have on the state variables. Binning will aggregate some data (e.g., if two observations fall within the same bin they will be summed), and we want to keep as many observations as possible in independent bins. We want to interfere with the state variables as little as possible while maximising the number of observations over the time-series.

A few summary checks can be useful here. We have done some data wrangling in long format, but the model requires wide data. We are going to pivot our data to wide and do some summaries. We are keeping the age and depth columns as identifiers to match with the charcoal data later.

# Pivot the data to wide format
state_variables_wide <- state_variables %>%
    pivot_wider(id_cols = c(age, depth), names_from = variablename, values_from = value) %>% 
    arrange(age)

saveRDS(state_variables_wide, "./data/state_variables_wide.rds")
# Time-span of the data divided by the number of observations
max(state_variables_wide$age) / nrow(state_variables_wide)
[1] 83.48578
# Age differences between successive observations
diff(state_variables_wide$age)
 [1]  69.017  42.946  48.331  96.968  11.459  14.340  54.308  55.284 138.681
[10]  66.118  48.695 143.230 103.015 186.753  58.537  54.550  52.888  50.954
[19] 104.623 100.248 100.144  51.406  53.372  52.350  53.950  51.957  53.782
[28]  59.512  72.266 168.719 111.125  47.302  47.531  47.949  55.391  90.190
[37]  66.359  68.199  97.921  60.144  54.207 109.400 165.792 141.147  93.284
[46]  90.969  85.959  87.634  86.346  79.080  85.733  84.658  85.647  85.731
[55]  86.000  88.257  84.158  86.849  90.252 144.587  79.708  89.321  43.635
[64]  43.511  42.991  42.250  44.523  47.083  54.088 115.201  99.857  73.484
[73] 128.326 116.359 119.324 123.707 132.304 219.725 184.122  71.101  68.843
[82]  66.205  65.555  71.999  71.922  86.020 153.328 103.854  76.139  71.061
[91]  71.132 129.632 127.729 137.070
# The range of the age differences
range(diff(state_variables_wide$age))
[1]  11.459 219.725
# the average of the age differences
mean(diff(state_variables_wide$age))
[1] 85.0778
# the standard deviation of the age differences
sd(diff(state_variables_wide$age))
[1] 38.36777

This all looks great! If we bin the data at around 50-75 years, we will still have most of our data in independent bins.
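One quick way to check that claim (a sketch using the state_variables_wide object created above) is to count, for a few candidate bin widths, how many observations would share a bin with another observation:

# For each candidate bin width, how many observations end up sharing a bin?
sapply(c(25, 50, 75, 100), function(bin_width) {
  bins <- cut(state_variables_wide$age,
              breaks = seq(from = min(state_variables_wide$age),
                           to = max(state_variables_wide$age) + bin_width,
                           by = bin_width),
              include.lowest = TRUE, labels = FALSE)
  sum(duplicated(bins)) # observations collapsed into an already-occupied bin
})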

Note: 😱

The model can handle uneven gaps in time, which is great for palaeo-data. But if the time intervals in the data are too inconsistent, for example, consisting of very short intervals followed by very large intervals, then the model will struggle to fit. Inconsistent time intervals can occur in cores with very variable accumulation rates, where binning the data will result in summing many observations into one time bin, followed by a large gap in time.

Now that we know roughly what bin-width to use to maximise the number of observations in the state variables, we will apply the same bin-width to both the state variables and the covariate data.

Handling the covariates \(X\)

Covariates need to be handled according to decisions that were made around the state variables, that is, if the state variables were binned at a century-level resolution then the covariates must match that resolution. We can have a finer temporal resolution of the covariates than we do of the state variables. In this dataset, our state variables have 95 observations, and our covariate has 646 observations over the same time-span. Having more observations of environmental covariates is common in palaeo-data and works well with fitting the model.

Note

We can have a finer temporal resolution of the covariates, but we cannot have a finer resolution of the state variables. If your data have a finer resolution of the state variables than of the covariates, then it is possible to interpolate the covariate data to match.
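As a sketch of what that interpolation could look like (purely illustrative here, since at Story Lake the charcoal is already the finer series), linear interpolation with stats::approx() places a covariate onto the ages of the state variables:

# Illustrative only: interpolate charcoal accumulation onto the pollen sample ages
char_on_pollen_ages <- approx(x = story_char_wide$age,
                              y = story_char_wide$char_acc,
                              xout = state_variables_wide$age)$y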

Since we need to apply the same binning procedure to both the state variables and the covariate, I like to join both datasets. Charcoal was sampled to a greater depth than the pollen data, so we are going to clip the covariate to the extent of the state variables and join the two datasets by their common variable, age.

# Clip covariate data to the extent of the state variables
# Join the two by age so that we get a square matrix
story_join <- story_char_wide %>%
  filter(age <= max(state_variables_wide$age)) %>%
  left_join(state_variables_wide)

# Always double check dimensions before/after joining!
# dim(story_join)
# tail(story_join)

Now we have all our data joined:

head(story_join, n = 10)
# A tibble: 10 × 8
     age depth char_acc other hardwood `Fagus grandifolia` Quercus Ulmus
   <dbl> <dbl>    <dbl> <int>    <int>               <int>   <int> <int>
 1 -66.2   0.5     0      106       60                   8      96     9
 2 -63.4   1.5    15.8     NA       NA                  NA      NA    NA
 3 -61.4   2.5    40.3     NA       NA                  NA      NA    NA
 4 -59.8   3.5    21.8     NA       NA                  NA      NA    NA
 5 -58.5   4.5    37.3     NA       NA                  NA      NA    NA
 6 -57.2   5.5    29.8     NA       NA                  NA      NA    NA
 7 -56.0   6.5    34.4     NA       NA                  NA      NA    NA
 8 -54.8   7.5    21.2     NA       NA                  NA      NA    NA
 9 -53.7   8.5    20.5     NA       NA                  NA      NA    NA
10 -52.6   9.5     8.26    NA       NA                  NA      NA    NA

I always do lots of quick checks like head(), tail(), dim() (and many more!) to make sure I catch any coding mistakes!

tail(story_join, n = 10)
# A tibble: 10 × 8
     age depth char_acc other hardwood `Fagus grandifolia` Quercus Ulmus
   <dbl> <dbl>    <dbl> <int>    <int>               <int>   <int> <int>
 1 7777.  648.     2.50    NA       NA                  NA      NA    NA
 2 7794.  648.     8.19    19       49                   2     142    45
 3 7812.  650.    19.1     NA       NA                  NA      NA    NA
 4 7829.  650.     6.19    NA       NA                  NA      NA    NA
 5 7845.  652.     2.78    NA       NA                  NA      NA    NA
 6 7863.  652.     4.20    NA       NA                  NA      NA    NA
 7 7880.  654.     5.02    NA       NA                  NA      NA    NA
 8 7898.  654.     3.17    NA       NA                  NA      NA    NA
 9 7914.  656.     4.51    NA       NA                  NA      NA    NA
10 7931.  656.     2.60    43       39                   3     172    37

This all looks good, and we can now bin the data to an appropriate temporal resolution for fitting the model. Choosing a bin-width may take some trial and error; I’m going with a bin-width of 50 years.

# This code chunks the data into bins and gives us a grouping variable "bins"
bin_width <- 50
bins <- cut(story_join$age,
            breaks = seq(from = min(story_join$age),
            to = max(story_join$age + bin_width),
            by = bin_width), include.lowest = TRUE, labels = FALSE)

We now apply the bins to our data. For most types of covariates we want the average value of the covariate in each bin. But for the pollen count data, we want the sum. This is because of the underlying multinomial distribution in the model.

# The following code groups the data by bin and summarises each variable per bin
story_binned <- bind_cols(bins = bins, story_join) %>%
  group_by(bins) %>% # Group the data by the bins so that we calculate per time bin
  summarise(
    age = mean(age, na.rm = T), # the center of the bin
    char_acc = mean(char_acc, na.rm = T), # mean charcoal accumulation rate per bin
    other = sum(other, na.rm = T), # the sums of the count data
    hardwood = sum(hardwood, na.rm = T),
    `Fagus grandifolia` = sum(`Fagus grandifolia`, na.rm = T),
    Ulmus = sum(Ulmus, na.rm = T),
    Quercus = sum(Quercus, na.rm = T)
  ) %>%
  arrange(desc(age))
# Be aware that the gaps in the pollen data are now filled with 0's not NA

This is important 💯

We want the model to run from the past to the present, and the model runs from the top of the matrix to the bottom. So we want to make sure that the top row of the data is the oldest date. Palaeo-data are usually ordered from the top of the core to the bottom (present to past), so we arrange(desc(age)).

Let’s see what the data look like binned at a 50-year resolution:

Code
head(story_binned, n = 10)
# A tibble: 10 × 8
    bins   age char_acc other hardwood `Fagus grandifolia` Ulmus Quercus
   <int> <dbl>    <dbl> <int>    <int>               <int> <int>   <int>
 1   160 7914.     3.43    43       39                   3    37     172
 2   159 7863.     4.00     0        0                   0     0       0
 3   158 7812.    11.1     19       49                   2    45     142
 4   157 7761.     4.06     0        0                   0     0       0
 5   156 7713.     5.29     0        0                   0     0       0
 6   155 7666.     8.24    34       65                   1    68     129
 7   154 7609.     8.50     0        0                   0     0       0
 8   153 7553.     9.27    39       43                   2    60     145
 9   152 7502.     5.05     0        0                   0     0       0
10   151 7457.     5.46    30       69                   7    57     125

This is looking good: we have reduced the time-intervals over which to run the model. The data are now 160 rows with complete covariate data, containing 93 state-variable observations. The original number of pollen observations was 95, so we have not summed many observations within the same bins. Take care when data-wrangling: see how the NAs in the taxa have been replaced with \(0\)? We will handle this later, but in the model \(0 =\) a zero count, which is not the same as no state-variable observation (NA) at a given age/depth.

multinomialTS requires two matrices, a site-by-species matrix \(Y\), and a covariate matrix \(X\), so we will leave tibbles behind and split the data. \(X\) variables may be of different types (e.g., continuous, categorical…) but should be scaled relative to each other.

Code
story_pollen_matrix <- story_binned %>%  # Select taxa
  select(other, hardwood, all_of(target_taxa)) %>% 
  rename("Fagus" = "Fagus grandifolia") %>% # renaming for to shorten name
  as.matrix()
# Replacing 0 with NA is not strictly necessary the way we use the data today
# But it is a good safety to avoid 0-count data where it should be no observation
story_pollen_matrix[which(rowSums(story_pollen_matrix) == 0), ] <- NA

story_char_matrix_scaled <- story_binned %>% # select covariates
  select(char_acc) %>% 
  as.matrix() %>% 
  scale() # Scale covariates

Fitting multinomialTS

Ok, now we have:

  • Chosen a temporal resolution for the model of 50 year bins
  • Organised the data into a site-by-species matrix \(Y\) at 50 year bins
  • Binned and scaled covariate matrix \(X\)

With that, we can fit the model.

Finding initial conditions using mnGLMM()

The mnTS() function will provide estimates for biotic interactions (the \(C\) matrix), taxa-driver relationships (the \(B\) matrix), and cross-correlated error (the \(V\) matrix). But the model needs initial starting values for these parameters. We get initial starting values from the data by running the mnGLMM() function. mnGLMM() returns estimates for the \(B\) and \(V\) matrices (not the \(C\) matrix) and assumes that there are no time gaps in the data.

Tip

The arguments of the starting values in both the mnGLMM() and mnTS() are all suffixed with .start (e.g., B.start will be the starting values for the \(B\) matrix).

The arguments of the parameters to be estimated are all suffixed with .fixed (e.g., B.fixed will be parameters that are estimated from \(B\) matrix).

Setting-up parameters for mnGLMM()

Now, let’s set up the parameters for mnGLMM(). We need:

  • A vector that indicates the row indices where there are state variable observations: sample_idx
  • An integer number of covariates (+ 1 for mnGLMM()): p
  • An integer of the number of state variables: n
  • A matrix of starting values for \(B\): B.start.glmm
  • A matrix of \(B\) parameters to estimate: B.fixed.glmm
  • A matrix of \(V\) parameters to estimate: V.fixed.glmm
# set-up sample index
sample_idx <- which(rowSums(story_pollen_matrix) != 0)
# make sure it works
head(story_pollen_matrix[sample_idx, ])
     other hardwood Ulmus Fagus Quercus
[1,]    43       39    37     3     172
[2,]    19       49    45     2     142
[3,]    34       65    68     1     129
[4,]    39       43    60     2     145
[5,]    30       69    57     7     125
[6,]    30       80    49    13     116

Set-up the remaining parameters:

# Set-up parameters
p <- ncol(story_char_matrix_scaled) + 1 # Number of independent variables plus intercept
n <- ncol(story_pollen_matrix) # number of taxa
V.fixed.glmm <- diag(n)
diag(V.fixed.glmm) <- NA
V.fixed.glmm[1] <- 1 # reference taxa/group [1, 1] is set to 1
# V.fixed.glmm <- diag(n)
# V.fixed.glmm <- matrix(NA, n, n)
# V.fixed.glmm[1] <- 1 # reference taxa/group [1, 1] is set to 1

B.fixed.glmm <- matrix(c(rep(0,p),rep(NA, (n - 1) * p)), p, n) # reference taxa [,1] are set to 0
B.start.glmm <- matrix(c(rep(0,p),rep(.01, (n - 1) * p)), p, n) # reference taxa [,1] are set to 0

These parameters are used as arguments to the mnGLMM() function. Check them out by printing them in your console. Each matrix needs to have the correct dimensions given the number of taxa and number of covariates. The positions of the elements in each matrix correspond to the species and/or the covariates, as we will see later in the output of the model.
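For example, a few quick checks on the objects defined above:

# Quick dimension checks before fitting
p # number of covariates + 1 (intercept)
n # number of state variables
dim(B.start.glmm) # should be p x n
dim(B.fixed.glmm) # should be p x n
dim(V.fixed.glmm) # should be n x n
B.fixed.glmm # 0 = fixed (reference group), NA = to be estimated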

Common error

Defining the dimensions of each matrix differs slightly between mnGLMM() and mnTS(). The \(B\) matrix in mnGLMM() has one row per covariate plus one row for the intercept \(B0\) (p rows in total, as defined above). mnTS() has two arguments: one for the intercept, \(B0\), and a separate one for the covariates, \(B\).

Fitting mnGLMM()

Remember that mnGLMM() does not handle gaps in the data and only fits complete \(X\) and \(Y\) matrices. We have created a variable for our observation indices (sample_idx), so for mnGLMM() we will index the matrices by this variable: e.g., story_pollen_matrix[sample_idx, ].

# fit glmm
start_time <- Sys.time()
glmm_mod <- mnGLMM(Y = story_pollen_matrix[sample_idx, ],
                   X = story_char_matrix_scaled[sample_idx, ,drop = F],
                   B.start = B.start.glmm,
                   B.fixed = B.fixed.glmm,
                   V.fixed = V.fixed.glmm)
end_time <- Sys.time()
end_time - start_time
Time difference of 8.870917 secs
saveRDS(glmm_mod, "./outputs/glmm_mod.rds")
Code
glmm_mod <- readRDS("./outputs/glmm_mod.rds")

The outputs of mnGLMM() can be examined with the summary(glmm_mod) and coef(glmm_mod) functions. We are going to skip this for now because the output of mnGLMM() is similar to that of mnTS(), only without the \(C\) matrix. For this demonstration we will look at the output of mnTS(), as the same concepts apply.

Setting-up parameters for mnTS()

Now that we have reasonable starting values from mnGLMM(), let’s set up the main state-space model parameters for mnTS(). mnTS() needs:

  • row indices as before: sample_idx
  • the number of covariates: p
  • the number of state variables: n
  • starting values for each matrix:
    • \(B0\): B0.start.mnTS (intercept)
    • \(B\): B.start.mnTS (driver-taxa)
    • \(C\): C.start.mnTS (taxa interactions)
    • \(V\): V.start.mnTS (cross-correlated error)
  • parameters to estimate for each matrix:
    • \(B0\): B0.fixed.mnTS
    • \(B\): B.fixed.mnTS
    • \(C\): C.fixed.mnTS
    • \(V\): V.fixed.mnTS

We are using the output from mnGLMM() as starting values for the matrices \(B0\), \(B\), and \(V\). mnGLMM() does not provide estimates for \(C\), so we handle \(C\) a little differently and we input values close to zero for each parameter estimated and let mnTS() do the rest.

# B.start
B0.start.mnTS <- glmm_mod$B[1, , drop = F] # The first row is the Intercept
B.start.mnTS <- glmm_mod$B[2, , drop = F] # Take care here!
# The second row onwards holds the covariates; more covariates, more rows. Here there is only one covariate
# B.start.mnTS <- glmm_mod$B[2:3, , drop = F] # For 2 covariates
sigma.start.mnTS <- glmm_mod$sigma

V.fixed.mnTS = matrix(NA, n, n) # Covariance matrix of environmental variation in process eq
V.fixed.mnTS[1] = 1

V.start.mnTS = V.fixed.mnTS
V.start.mnTS <- glmm_mod$V

B.fixed.mnTS <- matrix(NA, p-1, n)
B.fixed.mnTS[,1] <- 0
B0.fixed.mnTS = matrix(c(0, rep(NA, n - 1)), nrow = 1, ncol = n)

# Set-up C without interactions
C.start.mnTS = .5 * diag(n)
C.fixed.mnTS <- C.start.mnTS
C.fixed.mnTS[C.fixed.mnTS != 0] <- NA

# Set-up C with interactions between Fagus and Quercus
C.start.int.mnTS = .5 * diag(n)
C.start.int.mnTS[5, 4] <- .001
C.start.int.mnTS[4, 5] <- .001
C.fixed.int.mnTS <- C.start.int.mnTS
C.fixed.int.mnTS[C.fixed.int.mnTS != 0] <- NA

Fitting mnTS()

Remember, mnTS() does handle gaps in the state variables where there are data in the covariate matrix. In the following code, we use the complete (no gaps) \(Y\) matrix story_pollen_matrix[sample_idx, ] with dimensions 93 × 5, and the full \(X\) matrix story_char_matrix_scaled with dimensions 160 × 1. We will fit the model twice:

  • without taxa interactions
  • and with taxa interactions.
start_time <- Sys.time()
mnTS_mod <- mnTS(Y = story_pollen_matrix[sample_idx, ],
                 X = story_char_matrix_scaled, Tsample = sample_idx,
                 B0.start = B0.start.mnTS, B0.fixed = B0.fixed.mnTS,
                 B.start = B.start.mnTS, B.fixed = B.fixed.mnTS,
                 C.start = C.start.mnTS, C.fixed = C.fixed.mnTS,
                 V.start = V.start.mnTS, V.fixed = V.fixed.mnTS,
                 dispersion.fixed = 1, maxit.optim = 1e+6)
# maxit.optim is the max number of iterations the optimiser will complete before stopping.
# increase maxit.optim if the model needs a lot of time to fit.
end_time <- Sys.time()
end_time - start_time
saveRDS(mnTS_mod, "./outputs/mnTS_mod.rds")
Code
mnTS_mod <- readRDS("./outputs/mnTS_mod.rds")
start_time <- Sys.time()
mnTS_mod_int <- mnTS(Y = story_pollen_matrix[sample_idx, ],
                     X = story_char_matrix_scaled, Tsample = sample_idx,
                     B0.start = mnTS_mod$B0, B0.fixed = B0.fixed.mnTS,
                     B.start = mnTS_mod$B, B.fixed = B.fixed.mnTS,
                     C.start = mnTS_mod$C, C.fixed = C.fixed.int.mnTS,
                     V.start = mnTS_mod$V, V.fixed = V.fixed.mnTS,
                     dispersion.fixed = 1, maxit.optim = 1e+6)
end_time <- Sys.time() 
end_time - start_time
saveRDS(mnTS_mod_int, "./outputs/mnTS_mod_int.rds")
Code
mnTS_mod_int <- readRDS("./outputs/mnTS_mod_int.rds")
Getting a good fit

Getting a good fit can take quite a lot of experimenting with the individual parameters. For the interaction model, I’m using starting conditions from the no-interaction model, e.g., V.start = mnTS_mod$V.

Interpreting outputs

The coef() and summary() functions will show the model outputs. Let’s check out the coefficients of the interaction model:

coef(mnTS_mod_int)
                           Coef.         se         t            P
hardwood              0.63887009 0.32797397  1.947929 5.142343e-02
Ulmus                -0.37197596 0.22052878 -1.686746 9.165227e-02
Fagus                -0.75298508 0.22384055 -3.363935 7.683961e-04
Quercus               1.34744142 0.12097745 11.137955 8.197974e-29
sp.other.other        0.88499656 0.01245596 71.050060 0.000000e+00
sp.hardwood.hardwood  0.96194926 0.01863543 51.619364 0.000000e+00
sp.Ulmus.Ulmus        0.94033943 0.02321432 40.506872 0.000000e+00
sp.Fagus.Fagus        0.80315620 0.04658110 17.242103 1.282831e-66
sp.Quercus.Fagus      0.06704494 0.01614222  4.153391 3.275845e-05
sp.Fagus.Quercus     -0.09762005 0.05031349 -1.940236 5.235102e-02
sp.Quercus.Quercus    0.99209581 0.01908702 51.977507 0.000000e+00
char_acc.hardwood    -0.15091936 0.02598970 -5.806892 6.364308e-09
char_acc.Ulmus       -0.14161025 0.04857906 -2.915047 3.556349e-03
char_acc.Fagus       -0.21567537 0.04962459 -4.346139 1.385548e-05
char_acc.Quercus     -0.17417682 0.02257133 -7.716727 1.193548e-14

We can also lift out an individual matrix:

mnTS_mod_int$C
             other  hardwood     Ulmus      Fagus     Quercus
other    0.8849966 0.0000000 0.0000000 0.00000000  0.00000000
hardwood 0.0000000 0.9619493 0.0000000 0.00000000  0.00000000
Ulmus    0.0000000 0.0000000 0.9403394 0.00000000  0.00000000
Fagus    0.0000000 0.0000000 0.0000000 0.80315620 -0.09762005
Quercus  0.0000000 0.0000000 0.0000000 0.06704494  0.99209581
  • rows = change in abundance
  • columns = abundance
  • Autocorrelation on the diagonal

Following that convention, the Quercus-Fagus coefficient of 0.067 (row Quercus, column Fagus) means that the abundance of Fagus affects the change in abundance of Quercus.
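To pull just that interaction term and its uncertainty from the coefficient table (assuming coef() returns the named rows shown above):

# Extract the Quercus-Fagus interaction row (estimate, se, t, P)
coef(mnTS_mod_int)["sp.Quercus.Fagus", ]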

We won’t delve into comparing individual coefficient values between the two models here. Doing so is important for interpreting the ecological meaning of the data, and requires strong ecological knowledge of the study system. Although that is one of the most interesting parts, it is a little beyond what we can fit into a workshop on how to fit the model!

Re-fitting if necessary

We are aiming to find the best optima possible from the model. Because this may not happen in the first attempt with initial conditions provided by mnGLMM(), we can refit the model with starting conditions from the output of mnTS(). This may initially sound like cheating, running a circular model, but remember the input values are only starting conditions. Starting with values that are closer to the ‘true’ optima is likely to provide a better and faster fit.

As an example, let’s refit the first model:

start_time <- Sys.time()
mnTS_mod_refit <- mnTS(Y = story_pollen_matrix[sample_idx, ],
                           X = story_char_matrix_scaled, Tsample = sample_idx,
                           B0.start = mnTS_mod$B0, B0.fixed = B0.fixed.mnTS,
                           B.start = mnTS_mod$B, B.fixed = B.fixed.mnTS,
                           C.start = mnTS_mod$C, C.fixed = C.fixed.mnTS,
                           V.start = mnTS_mod$V, V.fixed = V.fixed.mnTS,
                           dispersion.fixed = 1, maxit.optim = 1e+6)
end_time <- Sys.time() 
end_time - start_time
Time difference of 13.53042 secs
coef(mnTS_mod_refit)
                           Coef.          se          t             P
hardwood              0.41532455 0.132205149   3.141516  1.680758e-03
Ulmus                -1.06476060 0.277036554  -3.843394  1.213445e-04
Fagus                -0.49932003 0.328767580  -1.518763  1.288222e-01
Quercus               1.11997698 0.144174032   7.768230  7.959071e-15
sp.other.other        0.81275208 0.057988901  14.015649  1.250498e-44
sp.hardwood.hardwood  0.42388155 0.305167839   1.389011  1.648293e-01
sp.Ulmus.Ulmus        0.99737129 0.003913659 254.843712  0.000000e+00
sp.Fagus.Fagus        0.92139120 0.031542814  29.210812 1.413619e-187
sp.Quercus.Quercus    0.86726850 0.055474352  15.633684  4.292508e-55
char_acc.hardwood    -0.08189097 0.046429828  -1.763758  7.777276e-02
char_acc.Ulmus       -0.12995038 0.056352411  -2.306031  2.110893e-02
char_acc.Fagus       -0.11399315 0.075111485  -1.517653  1.291020e-01
char_acc.Quercus     -0.09091494 0.043111347  -2.108840  3.495839e-02
mnTS_mod_refit$AIC
[1] -1391.369

Looks like not much changed, so our original fit was good 😎.

Comparing models

The summary of the model provides both the log likelihood and the AIC (Akaike information criterion). These can be used for comparing models; today we will use the AIC. AIC penalises models for the number of parameters estimated, and so can be better for comparing models where one estimates more parameters than the other (i.e., the interaction model estimates more parameters than the model without interactions).

The AIC (and log likelihood) values can be accessed directly with mnTS_mod_int$AIC.

Code
data.frame(without_interactions = mnTS_mod$AIC,
           with_interactions = mnTS_mod_int$AIC)
  without_interactions with_interactions
1            -1391.369         -1435.724
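Lower AIC indicates more support, and here the interaction model has the lower AIC. The difference can be computed directly from the accessors mentioned above:

# Negative values favour the interaction model (lower AIC = more support)
mnTS_mod_int$AIC - mnTS_mod$AIC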

Bootstrapping

We do not advocate a reliance on P-values for interpreting results. Issues with P-values are well documented (Anderson, Burnham, and Thompson 2000); however, the multinomialTS package offers two methods of calculating P-values. The coef() function shows P-values calculated from standard errors from a Wald approximation. In simulation tests, we found the Wald approximation could lead to inflated type-1 errors, so we wrote a bootstrapping function as a more robust alternative. Bootstrapping can take a very long time (depending on the data and the number of bootstraps), so we won’t run the bootstrapping function during the workshop. The code for bootstrapping the model is:

mnTS_mod_boot <- boot.mnTS(mnTS_mod, reps = 10000)

Hypothesis testing

There may be circumstances where coefficients change substantially when fitting the model in alternative ways, e.g., with or without interactions, or with different covariates. We can use post-hoc statistical tests like AIC and likelihood ratio tests to give statistical support to one model or another. However, in the context of palaeoecological data at least, there is no known correct answer. We can only use the results to lend statistical support to multiple potential hypotheses. Palaeo-data lack experimental manipulation and direct observation, so we suggest setting up multiple working hypotheses (i.e., Chamberlin 1897), testing components of those hypotheses by fitting multinomialTS under different conditions, and finally assessing the statistical support lent to each hypothesis within the context of ecological knowledge.

References

Anderson, David R., Kenneth P. Burnham, and William L. Thompson. 2000. “Null Hypothesis Testing: Problems, Prevalence, and an Alternative.” The Journal of Wildlife Management 64 (4): 912–23. https://doi.org/10.2307/3803199.
Chamberlin, T. C. 1897. “The Method of Multiple Working Hypotheses.” The Journal of Geology 5: 837–48.
Schlenker, Nora, Jonathan Johnson, Tessa Ray-Cozzens, Vania Stefanova, David M. Nelson, Bryan N. Shuman, and John W. Williams. 2024. “Interacting Effects of Fire and Hydroclimate on Oak and Beech Community Prevalence in the Southern Great Lakes Region.” Journal of Ecology 112 (5): 1101–22. https://doi.org/10.1111/1365-2745.14289.