source("./install_workshop.R")
Loading required package: multinomialTS
Loading required package: pacman
Welcome to our workshop on how to use the multinomialTS package. First things first, let’s make sure you have all the necessary packages installed. Follow the instructions on the README and then try rendering this document. This may take a few minutes, so while that is running, I’m going to go through a few slides. If this document fails to render, something has gone wrong and one of our helpers/minions will help you sort it out.
This workshop will take you through how to fit the multinomialTS model and interpret the results. One of the primary goals of this model is to test multiple hypotheses about the data and lend statistical support to different hypotheses. For example, which environmental drivers are having the strongest effects on which taxa? Or, are interactions among taxa or abiotic variables driving change in the system? This state-space approach allows us to estimate coefficients for taxa interactions and driver-taxa relationships, which we don’t get from methods such as ordination or cluster analysis. We recommend this method as complementary to such methods, as each has its advantages.
This model has been designed specifically to work with a multinomially distributed response variable, that is, counts of a finite number of individuals (e.g., 300 counts) per sample, drawn from multiple possible classes (e.g., different taxa). We will be using a palaeoecological dataset as an example, where a number of fossil pollen grains (usually 300-400) are counted and identified per slice of sediment. From these data, we know the relative abundance of taxa in the sample, but not the absolute abundance of each species on the landscape. Knowing only the relative abundances creates interdependencies among the taxa, as we can only generate estimates for \(n-1\) taxa. Put simply, if we have 3 species and 60% of the counts belong to species 1, and 30% of the counts belong to species 2, then the remaining 10% must belong to species 3.
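To make this interdependency concrete, we can simulate a single multinomial sample in base R (the sample size and relative abundances below are hypothetical, chosen to match the example above):

# One sample of 300 grains from three taxa with relative abundances 0.6, 0.3, 0.1
set.seed(1)
counts <- rmultinom(n = 1, size = 300, prob = c(0.6, 0.3, 0.1))
counts
# Because the total is fixed, the first two counts fully determine the third:
300 - sum(counts[1:2, 1])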
The multinomialTS model has been designed to work with multinomially distributed ecological state variables (\(Y\)), but the covariate predictor data (\(X\)) can be of mixed types, such as binary events or charcoal accumulation rates. The data are divided into two parts: the state variables (\(Y\), e.g., pollen counts) and the environmental covariates (\(X\)). Analysing these data with multinomialTS can be split into two decision streams (Figure 1). We will go through an example of how to do this in detail now.
Let’s get some data. We are going to use the pollen counts from Story Lake, IN (Schlenker et al. 2024); these data will be our state variables \(Y\). The dataset also contains charcoal counts, which will be our environmental covariate \(X\). The data are stored in the Neotoma database.
The following (folded) code block will download the data directly from the Neotoma API, do basic harmonisation of taxa, and some data wrangling to give us a wide dataset of state variables (\(Y\)) and environmental covariates (\(X\)). For this session, the data are already wrangled and saved to the data directory, so the code block is set not to run (#| eval: false). We are not going to delve into the details of taxonomic resolution in this workshop; for details on using the neotoma2 package check out the neotoma GitHub.
# Find site in neotoma
story_site <- get_sites(sitename = "Story%")

# Download the data from the site
# Downloading the data only needs to be done once
story_samples <- story_site %>%
  get_downloads() %>%
  samples()

# Save the raw data. Always keep an untouched copy of the raw data.
saveRDS(story_samples, "./data/story_samples.rds")

# Read-in the raw samples; this avoids repeatedly downloading from the API
story_samples <- readRDS("./data/story_samples.rds")

# Some harmonisation of species, see Schlenker et al., 2024 in refs for details
# First filter the data for pollen samples, then harmonise a few species to the genus level
story_pollen_long <- story_samples %>%
  filter(elementtype == "pollen",
         units == "NISP") %>%
  mutate(variablename = replace(variablename, stringr::str_detect(variablename, "Pinus*"), "Pinus"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Picea*"), "Picea"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Acer*"), "Acer"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Juglans*"), "Juglans"),
         variablename = replace(variablename, stringr::str_detect(variablename, "Fraxinus*"), "Fraxinus")) %>%
  group_by(siteid, sitename,
           sampleid, variablename, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups = 'keep')

# Check names for any variables like spike, trace, Lycopodium, Microspheres...
unique(story_pollen_long$variablename)

saveRDS(story_pollen_long, "./data/story_pollen_long.rds")

# Now let's turn to the charcoal
# First filter data for charcoal counts
story_char <- story_samples %>%
  filter(elementtype == ">125 µm")

# The following code pivots the data to a wide dataframe,
# then does a very crude calculation to convert charcoal counts
# to a charcoal accumulation rate.
# For the purposes of this workshop the crude calculation will do
story_char_wide <- story_char %>%
  pivot_wider(id_cols = c(age, depth), names_from = variablename, values_from = value) %>%
  mutate(lag_depth = abs(lag(depth) - depth),
         lag_age = abs(lag(age) - age),
         acc_rate = lag_depth / lag_age,
         char_acc = acc_rate * Charcoal) %>%
  mutate(across(everything(), ~ ifelse(row_number() == 1 & is.na(.), 0, .))) %>%
  select(age, depth, char_acc)

# Let's save the wrangled data
saveRDS(story_char_wide, "./data/story_char_wide.rds")
Since the folded code block above has already been run, and the \(Y\) and \(X\) data saved to the data directory, we can read in the wrangled data directly.
# Read-in the wrangled data
# Read in our X variables
story_char_wide <- readRDS("./data/story_char_wide.rds")
# Read in the long data for plotting
story_pollen_long <- readRDS("./data/story_pollen_long.rds")
Currently, both the \(Y\) and \(X\) data contain age and depth columns. These columns will be excluded later, but I like to keep some form of common ID column at this stage.
Let’s start handling our state variables \(Y\). It is not recommended to attempt to estimate coefficients for every species in the data simultaneously, so we are going to look at key taxa and groups. Let’s do some quick and dirty visualisation of the data by plotting the top five most abundant species by their counts. It can be useful to explore more than the top five species, of course; for example, there may be key species of ecological significance that occur in low abundances. We recommend methods like cluster analysis and topic modelling as preliminary, or complementary, analyses, but we will not cover those methods in this workshop.
# There are multiple ways of coding the data-wrangling.
# I'm sticking with tidyverse for consistency, but
# sometimes base R is more efficient!
story_pollen_top5_count <- story_pollen_long %>%
  group_by(variablename) %>%
  mutate(total_count = sum(value)) %>%
  slice(1) %>%
  ungroup() %>%
  select(variablename, total_count) %>%
  arrange(desc(total_count)) %>%
  slice(1:5)

story_top5 <- story_pollen_long %>%
  filter(variablename %in% c(story_pollen_top5_count$variablename))

story_other <- story_pollen_long %>%
  filter(!variablename %in% c(story_pollen_top5_count$variablename)) %>%
  mutate(variablename = "other") %>%
  group_by(variablename, age) %>%
  summarise(value = sum(value))

story_top5_res <- bind_rows(story_top5, story_other) %>%
  mutate(variablename = factor(variablename, levels = c(
    "other", rev(story_pollen_top5_count$variablename)
  )))
ggplot(story_top5_res, aes(x = age, y = value)) +
  geom_area(colour = "grey90") +
  geom_col() +
  scale_x_reverse(breaks = scales::breaks_pretty(n = 6)) +
  coord_flip() +
  # ylim(0, 0.5) +
  labs(y = "Pollen counts", x = "Time (ybp)") +
  facet_wrap(~variablename, nrow = 1) +
  theme_minimal() +
  theme(text = element_text(size = 10))
From Figure 2, we can see that oak (Quercus), American beech (Fagus grandifolia), and elm (Ulmus) occur in high counts. Ash (Fraxinus) and hickory (Carya) also occur throughout the record, but at lower counts. The large peak in the grouped ‘other’ species at the most recent part of the record is from an increase in ragweed (Ambrosia) after Euro-American land clearance (Schlenker et al. 2024). This gives us a starting point for deciding on our target taxa. The Story Lake data are dominated by three taxa (oak, beech, and elm), so we are going to use these as our target taxa. Additionally, from Schlenker et al. (2024), there are notable hardwood taxa: hickory (Carya), hornbeam (Ostrya/Carpinus), ash (Fraxinus), walnut (Juglans), sycamore (Platanus), and maple (Acer). With this ecological knowledge of the hardwood taxa, I am also creating a ‘hardwood’ group.
<- c("Carya", "Ostrya/Carpinus", "Acer", "Juglans", "Fraxinus")
hardwood_names <- c("Ulmus", "Fagus grandifolia", "Quercus")
target_taxa
# Create a group of hardwood taxa by summing counts
<- story_pollen_long %>%
story_pollen_long_hardwood filter(variablename %in% hardwood_names) %>%
mutate(variablename = "hardwood") %>%
group_by(siteid, sitename,
sampleid, variablename, units, age,
agetype, depth, datasetid,%>%
long, lat) summarise(value = sum(value), .groups='keep')
# Filter for target taxa
<- story_pollen_long %>%
story_pollen_long_targets filter(variablename %in% target_taxa)
# Sum everything else into our reference group
<- story_pollen_long %>%
story_other_taxa filter(!variablename %in% c(hardwood_names, target_taxa)) %>%
mutate(variablename = "other") %>%
group_by(siteid, sitename,
sampleid, variablename, units, age,
agetype, depth, datasetid,%>%
long, lat) summarise(value = sum(value), .groups='keep')
<- bind_rows(story_other_taxa, story_pollen_long_hardwood, story_pollen_long_targets) %>%
state_variables mutate(variablename = factor(variablename, levels = c("other", "hardwood", target_taxa)),
value = replace_na(value, 0))
Let’s check out our new grouping:
ggplot(state_variables, aes(x = age, y = value)) +
  geom_area(colour = "grey90") +
  geom_col() +
  scale_x_reverse(breaks = scales::breaks_pretty(n = 6)) +
  coord_flip() +
  # ylim(0, 0.5) +
  labs(y = "Pollen counts", x = "Time (ybp)") +
  facet_wrap(~variablename, nrow = 1) +
  theme_minimal() +
  theme(text = element_text(size = 10))
The data are 7931 years long, and attempting to predict ecological processes at an annual resolution from observations spaced by hundreds of years will be unreliable. So we need to choose a resolution for our predictions, i.e., we will estimate the underlying ecological processes at, say, a 50-year or 100-year temporal resolution. To do so, the data will be binned at the chosen resolution. The width of the bins is chosen according to the effect that binning will have on the state variables. Binning will aggregate some data (e.g., if two observations fall within the same bin they will be summed), and we want to keep as many observations as possible in independent bins. We want to interfere with the state variables as little as possible while maximising the number of observations over the time-series.
A few summary checks can be useful here. We have done some data wrangling in long format, but the model requires wide data. We are going to pivot our data to wide format and do some summaries. We are keeping the age and depth columns as identifiers to match with the charcoal data later.
# Pivot the data to wide format
state_variables_wide <- state_variables %>%
  pivot_wider(id_cols = c(age, depth), names_from = variablename, values_from = value) %>%
  arrange(age)

saveRDS(state_variables_wide, "./data/state_variables_wide.rds")

# Time-span of the data divided by the number of observations
max(state_variables_wide$age) / nrow(state_variables_wide)
[1] 83.48578
# Age differences between successive observations
diff(state_variables_wide$age)
[1] 69.017 42.946 48.331 96.968 11.459 14.340 54.308 55.284 138.681
[10] 66.118 48.695 143.230 103.015 186.753 58.537 54.550 52.888 50.954
[19] 104.623 100.248 100.144 51.406 53.372 52.350 53.950 51.957 53.782
[28] 59.512 72.266 168.719 111.125 47.302 47.531 47.949 55.391 90.190
[37] 66.359 68.199 97.921 60.144 54.207 109.400 165.792 141.147 93.284
[46] 90.969 85.959 87.634 86.346 79.080 85.733 84.658 85.647 85.731
[55] 86.000 88.257 84.158 86.849 90.252 144.587 79.708 89.321 43.635
[64] 43.511 42.991 42.250 44.523 47.083 54.088 115.201 99.857 73.484
[73] 128.326 116.359 119.324 123.707 132.304 219.725 184.122 71.101 68.843
[82] 66.205 65.555 71.999 71.922 86.020 153.328 103.854 76.139 71.061
[91] 71.132 129.632 127.729 137.070
# The range of the age differences
range(diff(state_variables_wide$age))
[1] 11.459 219.725
# the average of the age differences
mean(diff(state_variables_wide$age))
[1] 85.0778
# the standard deviation of the age differences
sd(diff(state_variables_wide$age))
[1] 38.36777
This all looks great! If we bin the data at around 50-75 years, we will still have most of our data in independent bins.
The model can handle uneven gaps in time, which is great for palaeo-data. But if the time intervals in the data are too inconsistent, for example, very short intervals followed by very large intervals, then the model will struggle to fit. Inconsistent time intervals can occur in cores with highly variable accumulation rates, where binning will sum many observations into one time bin, followed by a large gap in time.
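If you want to compare candidate bin widths before committing to one, a small helper like the one below can tabulate how many observations would end up merged at each width. This is an illustrative sketch (check_bin_width is not part of multinomialTS); it reuses the cut() approach from the binning code further down:

# For each candidate width: how many bins are occupied, and in how many bins
# would two or more observations be summed together?
check_bin_width <- function(ages, bin_width) {
  bins <- cut(ages,
              breaks = seq(from = min(ages), to = max(ages) + bin_width, by = bin_width),
              include.lowest = TRUE, labels = FALSE)
  c(bin_width = bin_width,
    n_obs = length(ages),
    n_bins_occupied = length(unique(bins)),
    n_bins_merged = sum(table(bins) > 1))
}
sapply(c(50, 75, 100), function(w) check_bin_width(state_variables_wide$age, w))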
Now that we know roughly what bin-width to use to maximise the number of observations in the state variables, we will apply the same bin-width to both the state variables and the covariate data.
Covariates need to be handled according to the decisions made around the state variables, that is, if the state variables were binned at a century-level resolution, then the covariates must match that resolution. We can have a finer temporal resolution of the covariates than of the state variables, but not the other way around. In this dataset, our state variables have 95 observations and our covariate has 646 observations over the same time-span. Having more observations of environmental covariates is common in palaeo-data and works well with fitting the model. If your data have a finer resolution of the state variables, then it is possible to interpolate the covariate data to match, as sketched below.
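If you do need to interpolate, base R’s approx() is one simple option. Here is a minimal sketch, assuming a hypothetical coarse_covar data frame with age and value columns (not part of the Story Lake data):

# Linearly interpolate a coarse covariate onto the pollen sample ages
covar_interp <- approx(x = coarse_covar$age,
                       y = coarse_covar$value,
                       xout = state_variables_wide$age)$y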
Since we need to apply the same binning procedure to both the state variables and the covariate, I like to join both datasets. Charcoal was sampled to a greater depth than the pollen data, so we are going to clip the covariate to the extent of the state variables and join the two datasets by their common variable, age.
# Clip covariate data to the extent of the state variables
# Join the two by age so that we get a square matrix
story_join <- story_char_wide %>%
  filter(age <= max(state_variables_wide$age)) %>%
  left_join(state_variables_wide)

# Always double check dimensions before/after joining!
# dim(story_join)
# tail(story_join)
Now we have all our data joined:
head(story_join, n = 10)
# A tibble: 10 × 8
age depth char_acc other hardwood `Fagus grandifolia` Quercus Ulmus
<dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
1 -66.2 0.5 0 106 60 8 96 9
2 -63.4 1.5 15.8 NA NA NA NA NA
3 -61.4 2.5 40.3 NA NA NA NA NA
4 -59.8 3.5 21.8 NA NA NA NA NA
5 -58.5 4.5 37.3 NA NA NA NA NA
6 -57.2 5.5 29.8 NA NA NA NA NA
7 -56.0 6.5 34.4 NA NA NA NA NA
8 -54.8 7.5 21.2 NA NA NA NA NA
9 -53.7 8.5 20.5 NA NA NA NA NA
10 -52.6 9.5 8.26 NA NA NA NA NA
I always do lots of quick checks like head(), tail(), and dim() (and many more!) to make sure I catch any coding mistakes!
tail(story_join, n = 10)
# A tibble: 10 × 8
age depth char_acc other hardwood `Fagus grandifolia` Quercus Ulmus
<dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
1 7777. 648. 2.50 NA NA NA NA NA
2 7794. 648. 8.19 19 49 2 142 45
3 7812. 650. 19.1 NA NA NA NA NA
4 7829. 650. 6.19 NA NA NA NA NA
5 7845. 652. 2.78 NA NA NA NA NA
6 7863. 652. 4.20 NA NA NA NA NA
7 7880. 654. 5.02 NA NA NA NA NA
8 7898. 654. 3.17 NA NA NA NA NA
9 7914. 656. 4.51 NA NA NA NA NA
10 7931. 656. 2.60 43 39 3 172 37
This all looks good, and we can now bin the data to an appropriate temporal resolution for fitting the model. Choosing a bin-width may take some trial and error; I’m going with a bin-width of 50 years.
# This code chunks the data into bins and gives us a grouping variable "bins"
bin_width <- 50
bins <- cut(story_join$age,
            breaks = seq(from = min(story_join$age),
                         to = max(story_join$age + bin_width),
                         by = bin_width),
            include.lowest = TRUE, labels = FALSE)
We now apply the bins to our data. For most types of covariates we want the average value of the covariate in each bin. But for the pollen count data, we want the sum. This is because of the underlying multinomial distribution in the model.
# The following code applies the bins and summarises each variable per bin
story_binned <- bind_cols(bins = bins, story_join) %>%
  group_by(bins) %>% # Group the data by the bins so that we calculate per time bin
  summarise(
    age = mean(age, na.rm = T), # the center of the bin
    char_acc = mean(char_acc, na.rm = T), # mean charcoal accumulation rate per bin
    other = sum(other, na.rm = T), # the sums of the count data
    hardwood = sum(hardwood, na.rm = T),
    `Fagus grandifolia` = sum(`Fagus grandifolia`, na.rm = T),
    Ulmus = sum(Ulmus, na.rm = T),
    Quercus = sum(Quercus, na.rm = T)
  ) %>%
  arrange(desc(age))
# Be aware that the gaps in the pollen data are now filled with 0's, not NA
We want the model to run from the past to the present, and the model runs from the top of the matrix to the bottom, so we want the top row of the data to be the oldest date. Palaeo-data are usually ordered from the top of the core to the bottom (present to past), hence the arrange(desc(age)).
Let’s see what the data look like binned at a 50-year resolution:
head(story_binned, n = 10)
# A tibble: 10 × 8
bins age char_acc other hardwood `Fagus grandifolia` Ulmus Quercus
<int> <dbl> <dbl> <int> <int> <int> <int> <int>
1 160 7914. 3.43 43 39 3 37 172
2 159 7863. 4.00 0 0 0 0 0
3 158 7812. 11.1 19 49 2 45 142
4 157 7761. 4.06 0 0 0 0 0
5 156 7713. 5.29 0 0 0 0 0
6 155 7666. 8.24 34 65 1 68 129
7 154 7609. 8.50 0 0 0 0 0
8 153 7553. 9.27 39 43 2 60 145
9 152 7502. 5.05 0 0 0 0 0
10 151 7457. 5.46 30 69 7 57 125
This is looking good: we have reduced the number of time intervals over which to run the model. The data are now 160 rows with complete covariate data, containing 93 state-variable observations. The original number of pollen observations was 95, so we have not summed many observations within the same bins. Take care when data-wrangling: see how NA in the taxa columns has been replaced with \(0\)? We will handle this later, but in the model \(0 =\) a zero count, which is not the same as no state-variable observation (NA) at a given age/depth.
multinomialTS requires two matrices: a site-by-species matrix \(Y\) and a covariate matrix \(X\). So we will leave tibbles behind and split the data. \(X\) variables may be of different types (e.g., continuous, categorical…) but should be scaled relative to each other.
story_pollen_matrix <- story_binned %>% # Select taxa
  select(other, hardwood, all_of(target_taxa)) %>%
  rename("Fagus" = "Fagus grandifolia") %>% # Renaming to shorten the name
  as.matrix()

# Replacing 0 with NA is not strictly necessary the way we use the data today,
# but it is a good safety to avoid 0-count data where it should be no observation
story_pollen_matrix[which(rowSums(story_pollen_matrix) == 0), ] <- NA

story_char_matrix_scaled <- story_binned %>% # Select covariates
  select(char_acc) %>%
  as.matrix() %>%
  scale() # Scale covariates
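Before moving on, it is worth confirming that the two matrices line up. The dimensions below follow from the binned data described above:

dim(story_pollen_matrix) # 160 x 5: time bins by taxa (Y)
dim(story_char_matrix_scaled) # 160 x 1: time bins by covariates (X)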
multinomialTS
Ok, now we have a matrix of state variables \(Y\) (story_pollen_matrix) and a matrix of scaled covariates \(X\) (story_char_matrix_scaled). With that, we can fit the model.
mnGLMM()
The mnTS() function will provide estimates for biotic interactions (the \(C\) matrix), taxa-driver relationships (the \(B\) matrix), and cross-correlated error (the \(V\) matrix). But the model needs initial starting values for these parameters. We get initial starting values from the data by running the mnGLMM() function. mnGLMM() returns estimates for the \(B\) and \(V\) matrices (not the \(C\) matrix) and assumes that there are no time gaps in the data.
In both mnGLMM() and mnTS(), the arguments for starting values are all suffixed with .start (e.g., B.start holds the starting values for the \(B\) matrix), and the arguments specifying which parameters are fixed and which are estimated are all suffixed with .fixed (e.g., B.fixed specifies which parameters of the \(B\) matrix are estimated); the sketch below illustrates the convention.
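Here is a small illustration of that convention, using a hypothetical \(B\) matrix for one covariate plus an intercept (two rows) and three taxa (three columns). In a .fixed matrix, numeric entries are held at that value and NA entries are estimated; the matching .start matrix supplies starting values for the estimated entries:

# Hypothetical 2 x 3 example of the .fixed/.start pairing
B.fixed.demo <- matrix(c(0, 0,    # column 1: reference taxon, fixed at 0
                         NA, NA,  # column 2: estimated
                         NA, NA), # column 3: estimated
                       nrow = 2, ncol = 3)
B.start.demo <- matrix(c(0, 0, 0.01, 0.01, 0.01, 0.01), nrow = 2, ncol = 3)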
mnGLMM()
Now, let’s set up the parameters for mnGLMM(). We need:
- sample_idx: an index of the rows with observations (mnGLMM() cannot handle time gaps)
- p: the number of covariates plus the intercept
- n: the number of taxa
- B.start.glmm: starting values for the \(B\) matrix
- B.fixed.glmm: which parameters of the \(B\) matrix are fixed or estimated
- V.fixed.glmm: which parameters of the \(V\) matrix are fixed or estimated
# Set-up sample index
sample_idx <- which(rowSums(story_pollen_matrix) != 0)
# Make sure it works
head(story_pollen_matrix[sample_idx, ])
other hardwood Ulmus Fagus Quercus
[1,] 43 39 37 3 172
[2,] 19 49 45 2 142
[3,] 34 65 68 1 129
[4,] 39 43 60 2 145
[5,] 30 69 57 7 125
[6,] 30 80 49 13 116
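A quick sanity check on the index: 93 of the 160 binned time steps contain pollen observations, matching the binned data described earlier.

length(sample_idx) # 93 bins with pollen observations
nrow(story_pollen_matrix) # 160 time bins in total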
Set-up the remaining parameters:
# Set-up parameters
p <- ncol(story_char_matrix_scaled) + 1 # Number of independent variables plus intercept
n <- ncol(story_pollen_matrix) # Number of taxa
V.fixed.glmm <- diag(n)
diag(V.fixed.glmm) <- NA
V.fixed.glmm[1] <- 1 # reference taxa/group [1, 1] is set to 1
# V.fixed.glmm <- diag(n)
# V.fixed.glmm <- matrix(NA, n, n)
# V.fixed.glmm[1] <- 1 # reference taxa/group [1, 1] is set to 1
B.fixed.glmm <- matrix(c(rep(0, p), rep(NA, (n - 1) * p)), p, n) # reference taxa [, 1] are set to 0
B.start.glmm <- matrix(c(rep(0, p), rep(.01, (n - 1) * p)), p, n) # reference taxa [, 1] are set to 0
These parameters are used as arguments to the mnGLMM() function. Check them out by printing them in your console: each matrix needs to have the correct dimensions given the number of taxa and the number of covariates. The positions of the elements in each matrix correspond to the species and/or covariates, as we will see later in the model output.
Defining the dimensions of each matrix is slightly different between mnGLMM() and mnTS(). In mnGLMM(), the \(B\) matrix has one row per covariate plus one for the intercept \(B0\) (p rows in total, as defined above). mnTS() has two separate arguments: one for the intercept \(B0\) and one for the covariates \(B\).
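Printing the objects in your console is the easiest check. For these data the expected values and dimensions are:

p # 2: one covariate plus the intercept
n # 5: other, hardwood, Ulmus, Fagus, Quercus
dim(B.start.glmm) # p x n, intercept row included for mnGLMM()
dim(B.fixed.glmm) # p x n
dim(V.fixed.glmm) # n x n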
mnGLMM()
Remember that mnGLMM() does not handle gaps in the data and only fits complete \(X\) and \(Y\) matrices. We have created a variable for our observation indices (sample_idx), so for mnGLMM() we will index the matrices by this variable, e.g., story_pollen_matrix[sample_idx, ].
# Fit the GLMM
start_time <- Sys.time()
glmm_mod <- mnGLMM(Y = story_pollen_matrix[sample_idx, ],
                   X = story_char_matrix_scaled[sample_idx, , drop = F],
                   B.start = B.start.glmm,
                   B.fixed = B.fixed.glmm,
                   V.fixed = V.fixed.glmm)
end_time <- Sys.time()
end_time - start_time

Time difference of 8.870917 secs

saveRDS(glmm_mod, "./outputs/glmm_mod.rds")
glmm_mod <- readRDS("./outputs/glmm_mod.rds")
The outputs of mnGLMM() can be examined with the summary(glmm_mod) and coef(glmm_mod) functions. We are going to skip this for now, because the output of mnGLMM() is similar to that of mnTS(), only without the \(C\) matrix. For this demonstration we will look at the output of mnTS(), as the same concepts apply.
mnTS()
Now that we have reasonable starting values from mnGLMM(), let’s set up the main state-space model parameters for mnTS(). mnTS() needs:
- sample_idx
- p
- n
- B0.start.mnTS (intercept)
- B.start.mnTS (driver-taxa relationships)
- C.start.mnTS (taxa interactions)
- V.start.mnTS (cross-correlated error)
- B0.fixed.mnTS
- B.fixed.mnTS
- C.fixed.mnTS
- V.fixed.mnTS
We use the output from mnGLMM() as starting values for the matrices \(B0\), \(B\), and \(V\). mnGLMM() does not provide estimates for \(C\), so we handle \(C\) a little differently: we input values close to zero for each parameter to be estimated and let mnTS() do the rest.
# Starting values taken from the mnGLMM() fit
B0.start.mnTS <- glmm_mod$B[1, , drop = F] # The first row is the intercept
B.start.mnTS <- glmm_mod$B[2, , drop = F] # Take care here!
# The second row onwards holds the covariates; more covariates, more rows.
# Here there is only one covariate.
# B.start.mnTS <- glmm_mod$B[2:3, , drop = F] # For 2 covariates

sigma.start.mnTS <- glmm_mod$sigma

V.fixed.mnTS <- matrix(NA, n, n) # Covariance matrix of environmental variation in process eq
V.fixed.mnTS[1] <- 1

V.start.mnTS <- V.fixed.mnTS
V.start.mnTS <- glmm_mod$V # Overwritten with the mnGLMM() estimate

B.fixed.mnTS <- matrix(NA, p - 1, n)
B.fixed.mnTS[, 1] <- 0
B0.fixed.mnTS <- matrix(c(0, rep(NA, n - 1)), nrow = 1, ncol = n)

# Set-up C without interactions
C.start.mnTS <- .5 * diag(n)
C.fixed.mnTS <- C.start.mnTS
C.fixed.mnTS[C.fixed.mnTS != 0] <- NA

# Set-up C with interactions between Fagus and Quercus
C.start.int.mnTS <- .5 * diag(n)
C.start.int.mnTS[5, 4] <- .001
C.start.int.mnTS[4, 5] <- .001
C.fixed.int.mnTS <- C.start.int.mnTS
C.fixed.int.mnTS[C.fixed.int.mnTS != 0] <- NA
mnTS()
Remember, mnTS() does handle gaps in the state variables wherever there are data in the covariate matrix. In the following code, we use the complete (no gaps) \(Y\) matrix story_pollen_matrix[sample_idx, ], with dimensions 93 × 5, and the full \(X\) matrix story_char_matrix_scaled, with dimensions 160 × 1. We will fit the model twice: once without taxa interactions, and once with an interaction between Fagus and Quercus.
start_time <- Sys.time()
mnTS_mod <- mnTS(Y = story_pollen_matrix[sample_idx, ],
                 X = story_char_matrix_scaled, Tsample = sample_idx,
                 B0.start = B0.start.mnTS, B0.fixed = B0.fixed.mnTS,
                 B.start = B.start.mnTS, B.fixed = B.fixed.mnTS,
                 C.start = C.start.mnTS, C.fixed = C.fixed.mnTS,
                 V.start = V.start.mnTS, V.fixed = V.fixed.mnTS,
                 dispersion.fixed = 1, maxit.optim = 1e+6)
# maxit.optim is the max number of iterations the optimiser will complete before stopping.
# Increase maxit.optim if the model needs a lot of time to fit.
end_time <- Sys.time()
end_time - start_time

saveRDS(mnTS_mod, "./outputs/mnTS_mod.rds")
mnTS_mod <- readRDS("./outputs/mnTS_mod.rds")
start_time <- Sys.time()
mnTS_mod_int <- mnTS(Y = story_pollen_matrix[sample_idx, ],
                     X = story_char_matrix_scaled, Tsample = sample_idx,
                     B0.start = mnTS_mod$B0, B0.fixed = B0.fixed.mnTS,
                     B.start = mnTS_mod$B, B.fixed = B.fixed.mnTS,
                     C.start = mnTS_mod$C, C.fixed = C.fixed.int.mnTS,
                     V.start = mnTS_mod$V, V.fixed = V.fixed.mnTS,
                     dispersion.fixed = 1, maxit.optim = 1e+6)
end_time <- Sys.time()
end_time - start_time

saveRDS(mnTS_mod_int, "./outputs/mnTS_mod_int.rds")
mnTS_mod_int <- readRDS("./outputs/mnTS_mod_int.rds")
Getting a good fit can take quite a lot of experimenting with the individual parameters. For the interaction model, I’m using starting values from the no-interaction model, e.g., V.start = mnTS_mod$V.
The coef() and summary() functions will show the model outputs. Let’s check out the coefficients of the interaction model:

coef(mnTS_mod_int)
Coef. se t P
hardwood 0.63887009 0.32797397 1.947929 5.142343e-02
Ulmus -0.37197596 0.22052878 -1.686746 9.165227e-02
Fagus -0.75298508 0.22384055 -3.363935 7.683961e-04
Quercus 1.34744142 0.12097745 11.137955 8.197974e-29
sp.other.other 0.88499656 0.01245596 71.050060 0.000000e+00
sp.hardwood.hardwood 0.96194926 0.01863543 51.619364 0.000000e+00
sp.Ulmus.Ulmus 0.94033943 0.02321432 40.506872 0.000000e+00
sp.Fagus.Fagus 0.80315620 0.04658110 17.242103 1.282831e-66
sp.Quercus.Fagus 0.06704494 0.01614222 4.153391 3.275845e-05
sp.Fagus.Quercus -0.09762005 0.05031349 -1.940236 5.235102e-02
sp.Quercus.Quercus 0.99209581 0.01908702 51.977507 0.000000e+00
char_acc.hardwood -0.15091936 0.02598970 -5.806892 6.364308e-09
char_acc.Ulmus -0.14161025 0.04857906 -2.915047 3.556349e-03
char_acc.Fagus -0.21567537 0.04962459 -4.346139 1.385548e-05
char_acc.Quercus -0.17417682 0.02257133 -7.716727 1.193548e-14
We can also lift out an individual matrix:

mnTS_mod_int$C
other hardwood Ulmus Fagus Quercus
other 0.8849966 0.0000000 0.0000000 0.00000000 0.00000000
hardwood 0.0000000 0.9619493 0.0000000 0.00000000 0.00000000
Ulmus 0.0000000 0.0000000 0.9403394 0.00000000 0.00000000
Fagus 0.0000000 0.0000000 0.0000000 0.80315620 -0.09762005
Quercus 0.0000000 0.0000000 0.0000000 0.06704494 0.99209581
The Quercus-Fagus coefficient (sp.Quercus.Fagus = 0.067) means that change in the abundance of Quercus affects the abundance of Fagus.
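For example, approximate 95% Wald intervals for the two interaction coefficients can be computed from the estimates and standard errors printed above:

# Estimates and standard errors copied from the coef() output above
est <- c(sp.Quercus.Fagus = 0.06704494, sp.Fagus.Quercus = -0.09762005)
se <- c(0.01614222, 0.05031349)
cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)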
We won’t delve into comparing individual coefficient values between the two models here. Doing so is important for interpreting the ecological meaning of the data and requires strong ecological knowledge of the study system. Although that is one of the most interesting parts, it is a little beyond what we can fit into a workshop on how to fit the model!
We are aiming to find the best optima possible from the model. Because this may not happen on the first attempt with the initial conditions provided by mnGLMM(), we can refit the model with starting values taken from the output of mnTS(). This may initially sound like cheating, running a circular model, but remember that the input values are only starting conditions. Starting with values that are closer to the ‘true’ optima is likely to provide a better and faster fit.
As an example, let’s refit the first model:
start_time <- Sys.time()
mnTS_mod_refit <- mnTS(Y = story_pollen_matrix[sample_idx, ],
                       X = story_char_matrix_scaled, Tsample = sample_idx,
                       B0.start = mnTS_mod$B0, B0.fixed = B0.fixed.mnTS,
                       B.start = mnTS_mod$B, B.fixed = B.fixed.mnTS,
                       C.start = mnTS_mod$C, C.fixed = C.fixed.mnTS,
                       V.start = mnTS_mod$V, V.fixed = V.fixed.mnTS,
                       dispersion.fixed = 1, maxit.optim = 1e+6)
end_time <- Sys.time()
end_time - start_time
Time difference of 13.53042 secs
coef(mnTS_mod_refit)
Coef. se t P
hardwood 0.41532455 0.132205149 3.141516 1.680758e-03
Ulmus -1.06476060 0.277036554 -3.843394 1.213445e-04
Fagus -0.49932003 0.328767580 -1.518763 1.288222e-01
Quercus 1.11997698 0.144174032 7.768230 7.959071e-15
sp.other.other 0.81275208 0.057988901 14.015649 1.250498e-44
sp.hardwood.hardwood 0.42388155 0.305167839 1.389011 1.648293e-01
sp.Ulmus.Ulmus 0.99737129 0.003913659 254.843712 0.000000e+00
sp.Fagus.Fagus 0.92139120 0.031542814 29.210812 1.413619e-187
sp.Quercus.Quercus 0.86726850 0.055474352 15.633684 4.292508e-55
char_acc.hardwood -0.08189097 0.046429828 -1.763758 7.777276e-02
char_acc.Ulmus -0.12995038 0.056352411 -2.306031 2.110893e-02
char_acc.Fagus -0.11399315 0.075111485 -1.517653 1.291020e-01
char_acc.Quercus -0.09091494 0.043111347 -2.108840 3.495839e-02
mnTS_mod_refit$AIC

[1] -1391.369
Looks like not much changed, so our original fit was good 😎.
The summary of the model provides both the log likelihood and the AIC (Akaike information criterion). These can be used for comparing models; today we will use the AIC. The AIC penalises the number of parameters estimated, and so is better suited for comparing models where one estimates more parameters than the other (here, the interaction model estimates more parameters than the model without interactions).
The AIC (and log likelihood) values can be accessed directly, e.g., mnTS_mod_int$AIC.
data.frame(without_interactions = mnTS_mod$AIC,
with_interactions = mnTS_mod_int$AIC)
without_interactions with_interactions
1 -1391.369 -1435.724
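Because the two models are nested (the interaction model estimates two extra \(C\) parameters), the AIC values can also be converted into a likelihood-ratio test. This is a sketch assuming the standard definition AIC = 2k - 2logLik:

# 2 * (logLik_int - logLik_noint) = (AIC_noint - AIC_int) + 2 * df_extra
df_extra <- 2 # the two interaction terms, C[5, 4] and C[4, 5]
LR <- (mnTS_mod$AIC - mnTS_mod_int$AIC) + 2 * df_extra
pchisq(LR, df = df_extra, lower.tail = FALSE)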
We do not advocate a reliance on P-values for interpreting results; issues with P-values are well documented (Anderson, Burnham, and Thompson 2000). However, multinomialTS offers two methods of calculating P-values. The coef() function shows P-values calculated from standard errors via the Wald approximation. In simulation tests, we found that the Wald approximation could lead to inflated type-1 errors, so we wrote a bootstrapping function as a more robust alternative. Bootstrapping can take a very long time (depending on the data and the number of bootstraps), so we won’t run the bootstrapping function during the workshop. The code for bootstrapping the model is:
mnTS_mod_boot <- boot.mnTS(mnTS_mod, reps = 10000)
There may be circumstances where coefficients change substantially when the model is fitted in alternative ways, e.g., with or without interactions, or with different covariates. We can use post-hoc statistical tests like AIC and likelihood ratio tests to lend statistical support to one model or another. However, in the context of palaeoecological data at least, there is no known correct answer. We can only use the results to lend statistical support to multiple potential hypotheses. Palaeo-data lack experimental manipulation and direct observation, so we suggest setting up multiple working hypotheses (i.e., Chamberlin 1897), testing components of those hypotheses by fitting multinomialTS under different conditions, and finally assessing the statistical support lent to each hypothesis within the context of ecological knowledge.