7  Novelty and No Analogue Futures

Authors

Jack Williams and Quinn Asena

Key references: Simpson (2007) and Radeloff et al. (2015).

7.1 Part 1: Background

The modern analog technique (MAT) is a standard approach for making quantitative inferences about past vegetation and climate from fossil pollen data – and from many other forms of paleoecological data (Chevalier et al. 2020; J. W. Williams and Shuman 2008; Gavin et al. 2003; Overpeck, Webb, and Prentice 1985). The MAT relies on reasoning by analogy (Jackson and Williams 2004), in which it is assumed that two compositionally similar pollen assemblages were produced by similar plant communities, growing under similar environments. The similarity of pollen assemblages is determined by the calculation of dissimilarity metrics.

The MAT can also be thought of as a non-parametric statistical technique that computer scientists call a \(k\)-nearest neighbors algorithm; it is a simple form of machine learning. Each fossil pollen assemblage is matched to one or more (\(k\)) modern pollen assemblages and is then assigned the ecological and environmental characteristics associated with those modern analogues.

The MAT is a popular approach for reconstructing past environments and climates, due to its relative simplicity and intuitive appeal. However, like any algorithm, if used unwisely, it can produce spurious results. More generally, when reasoning by analogy, be careful! Analogies are always incomplete and imperfect, and you must use your critical judgment to determine whether these imperfections are minor or serious.

To reconstruct past environments using the MAT, we need three data sets:

  1. Modern species assemblages.

  2. Modern environmental variables.

  3. Fossil species assemblages.

The MAT follows four basic steps; a minimal code sketch follows the list:

  1. Calculate dissimilarities between a fossil species assemblage \(i\) and all modern species assemblages in set \(S\).

  2. Discard modern samples with dissimilarities greater than a specified threshold.

  3. Identify and retain the \(k\) closest modern analogs.

  4. Average the environmental conditions associated with the modern analogs and assign them to fossil sample \(i\).
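
To make these steps concrete, here is a minimal base-R sketch of the MAT as a \(k\)-nearest-neighbours procedure, using the squared chord distance \(d_{ij} = \sum_{t} \left(\sqrt{p_{it}} - \sqrt{p_{jt}}\right)^{2}\) between pollen-proportion vectors \(p_i\) and \(p_j\). The object names (fossil_prop, modern_prop, modern_env) are placeholders for illustration only; in practice we will use the full implementation in the analogue package's mat() function.

# Minimal MAT sketch for a single fossil sample (illustration only; use analogue::mat() in practice)
# fossil_prop: named vector of pollen proportions for one fossil sample
# modern_prop: site-by-species matrix of modern pollen proportions (same columns as fossil_prop)
# modern_env:  vector of a modern environmental variable, one value per modern sample
mat_sketch <- function(fossil_prop, modern_prop, modern_env, k = 5, cutoff = Inf) {
  # Step 1: squared chord distance between the fossil sample and every modern sample
  scd <- apply(modern_prop, 1, function(m) sum((sqrt(fossil_prop) - sqrt(m))^2))
  # Step 2: discard modern samples beyond the dissimilarity threshold
  keep <- which(scd <= cutoff)
  # Step 3: retain the k closest modern analogues
  nearest <- keep[order(scd[keep])][seq_len(min(k, length(keep)))]
  # Step 4: assign the mean environment of the analogues to the fossil sample
  mean(modern_env[nearest])
}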

Note that we are taking a detour into paleoclimatology for two reasons. First, paleoclimatic reconstructions are still a primary use for fossil pollen and other micropaleontological data. Second, the MAT (and the analogue package), as a dissimilarity-based approach, also lets us explore the novelty of past communities, which is perhaps of more interest to paleoecologists than inferred paleoclimates!

The following section, Section 7.2.1, takes you through some pretty advanced data wrangling required to match species names between the datasets we will use. However, if you prefer to skip the wrangling and jump directly to the analyses in Section 7.2.2, the post-wrangling data have been saved and can be loaded into R directly. Run the ‘Jump point’ code below and portal to Section 7.2.2.

Jump point

Run this code (unfold the code chunk below) to load in the pre-wrangled data and portal to Section 7.2.2.

Code
file_names <- c("./data/mod_pol_east_match_prop.rds",
                "./data/mod_clim_filter.rds",
                "./data/devils_prop.rds",
                "./data/devils_ages.rds")
obj_names <- c("mod_pol_east_match_prop", "mod_clim_filter", "devils_prop", "devils_ages")
rds_list <- lapply(file_names, readRDS)
names(rds_list) <- obj_names
list2env(rds_list, globalenv())

7.2 Part 2: Analysis

7.2.1 Data wrangling

In this case, data come from two separate sources: NeotomaDB and the North American Modern Pollen Database (NAMPD). We will use the NAMPD (Whitmore et al. 2005) for the cross-validation analysis. Note that the NAMPD is pre-loaded into the analogue package, but here we’ll use the original (more complete) data so that we can filter locations easily. Some of the data wrangling required to get two matrices containing the same species with identical names for the correct region (as required by the MAT function) is fairly advanced. This type of wrangling is quite common in palaeoecological data science, as you may be dealing with, for example, raw data collected by individuals, or data from different databases stored in inconsistent formats.

Ready to wrangle? Let’s get the data ready:

Packages required for this section
# Load up the packages

if (!require("pacman")) install.packages("pacman", repos="http://cran.r-project.org")
pacman::p_load(neotoma2, tidyverse, analogue, vegan, readxl, fuzzyjoin)
Respawn point

The following code (unfold by clicking the dropdown) downloads the data from the Neotoma API and formats it, with basic harmonisation applied and pollen proportions calculated. This part of the data wrangling is covered in detail in Chapter 4, so you can instead read in the pre-wrangled .rds file from the data directory. Note that this is the long-format data, not the species-only matrix from the jump point above.

Code
# The following code can be done in one pipeline but we have split it into chunks to help with readability
devils_samples <- get_sites(siteid = 666) %>% # Download the site samples data
  get_downloads() %>%
  samples()

devils_samples <- devils_samples %>%
  dplyr::filter(ecologicalgroup %in% c("UPHE", "TRSH"), # Filter to upland herbs (UPHE) and trees and shrubs (TRSH)
                elementtype == "pollen", # Filter by pollen samples
                units == "NISP") %>% # Filter by sampling unit
         mutate(variablename = replace(variablename,
                                stringr::str_detect(variablename, "Pinus.*"), # Harmonize Pinus into one group
                                "Pinus"),
                variablename = replace(variablename,
                                stringr::str_detect(variablename, "Ambrosia.*"),
                                "Ambrosia")) %>%
  group_by(siteid, sitename,
           sampleid, variablename, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups='keep') # The group_by function will drop columns not used as grouping variables

devils_samples <- devils_samples %>% # Calculate the proportion of each taxon within each sample (age)
  group_by(age) %>%
  mutate(pollencount = sum(value, na.rm = TRUE)) %>%
  group_by(variablename) %>%
  mutate(prop = value / pollencount) %>%
  arrange(desc(age)) %>%
  ungroup()

# saveRDS(devils_samples, "./data/devils_samples.rds") # Save the data for later

To read in the pre-wrangled long data for Devil’s Lake use:

# Load up the data
devils_samples <- readRDS("./data/devils_samples.rds")

We will use the North American Modern Pollen Database (NAMPD) to create modern analogues. The NAMPD includes both modern surface pollen samples and climate measurements. The NAMPD is not (yet) accessible through Neotoma but is stored on servers at the University of Wisconsin–Madison. The following code does not need to be run, but shows how to download the NAMPD from the server to your data directory and unzip the downloaded data from R.

download.file("https://williamspaleolab.github.io/data/whitmoreetal2005.zip", "./data/whitmoreetal2005.zip")
unzip("./data/whitmoreetal2005.zip", exdir = "./data/whitmoreetal2005")

Since the data are already saved in the data directory, let’s read them in from there. The step of downloading data is often done manually; we include the R code to maintain a reproducible workflow. It is always good to keep your own copy of the data just in case: servers may not be maintained forever, and data may be removed.

mod_pol <- read_excel("./data/whitmoreetal2005/whitmoreetal2005_v1-72.xls", sheet = "POLLEN DATA")
mod_clim <- read_excel("./data/whitmoreetal2005/whitmoreetal2005_v1-72.xls", sheet = "CLIMATE+BIOCLIMATE")

Still with me? Great! Now we have our three datasets:

  1. Our site of interest (Devil’s Lake palaeo data).
  2. Our modern pollen data.
  3. Our modern climate measurements.

The modern datasets cover the whole of North America and need to be filtered to the eastern US so that the taxa are relevant to those at Devil’s Lake. We will first filter by longitude, then select taxa so that we have a matching matrix of species between the Devil’s Lake palaeo-data and the modern surface samples from the NAMPD. We are going to keep all data east of 105°W; see Krause et al. (2019) and John W. Williams, Shuman, and Webb III (2001).

# several columns contain data not used for the MAT
mod_pol_east <- mod_pol %>% 
  filter(LONDD >= -105) %>% # Filter for sites east of -105 degrees
  select(ID2, 14:147) # Select columns with ID and species only

# Many functions require site-by-species matrices in wide format
# Note we pivot with age as the ID column, so devils_wide includes an age column that is not used in the MAT
devils_wide <- devils_samples %>% 
  pivot_wider(id_cols = age, names_from = variablename, values_from = prop) %>% 
  replace(is.na(.), 0)

# Store ages separately for plotting later:
devils_ages <- devils_wide$age

We will also filter the modern climate records by the same ID to match the modern pollen records.

mod_clim_filter <- mod_clim %>% 
  filter(ID2 %in% mod_pol_east$ID2)

Right, now for the tricky part: matching species names. Unfortunately, the species names in data downloaded from Neotoma and in the NAMPD do not match exactly. Sometimes matching is most easily done manually, but because of the number of species in both datasets we will do as much as possible programmatically. The fuzzyjoin package allows us to match strings (names) by a distance measure and outputs a number of possible matches.

As we will see, fuzzy matching does a good job for the most part, but there are a few mismatches to weed out.

# Save the names of each dataset and convert all to lowercase
mod_pol_names <- as_tibble(tolower(colnames(mod_pol_east[-1])))
devil_names <- as_tibble(tolower(colnames(devils_wide[-1])))

# match the names using fuzzy matching
names_matches <- stringdist_left_join(devil_names, mod_pol_names, by = "value",
                                      max_dist = 4, distance_col = "distance", method = "jw") %>% 
                   group_by(value.x) %>%
                   slice_min(order_by = distance, n = 1) # change the value of n to see more potential matches

head(names_matches, 15) 
# A tibble: 15 × 3
# Groups:   value.x [15]
   value.x                      value.y    distance
   <chr>                        <chr>         <dbl>
 1 abies                        abies        0     
 2 acer                         acerx        0.0667
 3 alnus                        alnusx       0.0556
 4 amaranthaceae undiff.        rhamnaceae   0.308 
 5 ambrosia                     ambrosia     0     
 6 amorpha-type                 amorpha      0.139 
 7 angiospermae undiff. (herbs) poaceae      0.365 
 8 apiaceae                     apiaceae     0     
 9 artemisia                    artemisia    0     
10 asteraceae undiff.           asterx       0.296 
11 betula                       betula       0     
12 cannabaceae sensu stricto    abies        0.347 
13 caprifoliaceae sensu lato    caprifolia   0.2   
14 carya                        carya        0     
15 castanea                     castanea     0     

In your console take a good look through the whole list. You don’t want anything matched twice!

names_matches %>% filter(distance > 0.1) # looks like everything below 0.1 is a good match
# A tibble: 20 × 3
# Groups:   value.x [19]
   value.x                      value.y    distance
   <chr>                        <chr>         <dbl>
 1 amaranthaceae undiff.        rhamnaceae    0.308
 2 amorpha-type                 amorpha       0.139
 3 angiospermae undiff. (herbs) poaceae       0.365
 4 asteraceae undiff.           asterx        0.296
 5 cannabaceae sensu stricto    abies         0.347
 6 caprifoliaceae sensu lato    caprifolia    0.2  
 7 chenopodium-type             chenoamx      0.292
 8 cupressaceae                 cupressa      0.111
 9 ericaceae                    arecaceae     0.116
10 ericaceae                    ericacex      0.116
11 fabaceae undiff.             fabaceae      0.167
12 ilex                         piceax        0.25 
13 larix                        larixpseu     0.148
14 morus                        cornus        0.178
15 ostrya/carpinus              ostrycar      0.197
16 plantago                     plantaginx    0.142
17 rhamnaceae/vitaceae          rhamnaceae    0.158
18 rhus                         rubus         0.217
19 sarcobatus vermiculatus      sarcobatus    0.188
20 urticaceae                   urticacx      0.142

Arranging by distance helps.

names_matches %>%
  filter(distance < 0.3)  %>% # everything above 0.3 looks like a bad match
  arrange(desc(distance)) %>% 
  print(n = 15)
# A tibble: 49 × 3
# Groups:   value.x [48]
   value.x                   value.y    distance
   <chr>                     <chr>         <dbl>
 1 asteraceae undiff.        asterx        0.296
 2 chenopodium-type          chenoamx      0.292
 3 ilex                      piceax        0.25 
 4 rhus                      rubus         0.217
 5 caprifoliaceae sensu lato caprifolia    0.2  
 6 ostrya/carpinus           ostrycar      0.197
 7 sarcobatus vermiculatus   sarcobatus    0.188
 8 morus                     cornus        0.178
 9 fabaceae undiff.          fabaceae      0.167
10 rhamnaceae/vitaceae       rhamnaceae    0.158
11 larix                     larixpseu     0.148
12 plantago                  plantaginx    0.142
13 urticaceae                urticacx      0.142
14 amorpha-type              amorpha       0.139
15 ericaceae                 arecaceae     0.116
# ℹ 34 more rows

OK, looks like we only have a few to weed out manually:

names_matches_filter <- names_matches %>% 
  filter(distance < 0.3, # drop the specific bad pairs identified above
         !(value.x == "ilex" & value.y == "piceax"),
         !(value.x == "rhus" & value.y == "rubus"),
         !(value.x == "morus" & value.y == "cornus"),
         !(value.x == "ericaceae" & value.y == "arecaceae")) %>% 
  arrange(desc(distance))

Now we have a dataframe (names_matches_filter) with names from Devil’s Lake and their match in the NAMPD. There are two routes for this analysis:

  1. Join the two datasets so that both contain the same species column names, creating two larger datasets. Species missing from one dataset are filled with zeros.
  2. Conduct the analysis on the intersect of the two datasets, i.e., smaller datasets containing only the species common to both.

Now that we have matched the names between the two datasets, it is not difficult to accomplish either method. The datasets can be expanded so that both contain all the same species using the join function in the analogue package. Conversely, the datasets can be shrunk by selecting the species common to both using the names filter. The code for running the MAT is the same for both methods, so we will create datasets for both. We will use the first, more conventional, method for the rest of the chapter; if you want to explore the second method, you can replace the data in the code with the ‘intersect’ data.

This code joins the modern pollen data and the Devil’s Lake core data so that the number of columns is expanded to contain the same species in both. Note that while the variables (columns and species names) have been matched, each dataset remains independent. The two datasets have not been merged into one dataset, as they would be by the *_join functions from dplyr.

mod_pol_east_join <- mod_pol_east %>%
  rename_with(tolower) %>% 
  rename(all_of(deframe(names_matches_filter[ ,1:2]))) %>% 
  select(-id2) %>% # We do not want the ID column in the calculation
  mutate(across(everything(), ~ replace_na(.x, 0))) # Data contain some NA values that we can safely assume to be a zero-count
mod_pol_east_join <- mod_pol_east_join / rowSums(mod_pol_east_join) # I prefer the BASE R way of doing this calculation
# rowSums(mod_pol_east_join) # I regularly use rowSums on wide data as a data check.
# Prevents accidentally using counts or percentages instead of proportions, or including age/depth/ID columns

devils_prop <- devils_wide %>% 
  rename_with(tolower) %>% 
  select(-age)
# rowSums(devils_prop) # Data check

join_dat <- join(mod_pol_east_join, devils_prop, verbose = TRUE)

Summary:

            Rows Cols
Data set 1: 2594  134
Data set 2:  123   51
Merged:     2717  142
mod_pol_east_join <- join_dat$mod_pol_east_join
devils_prop <- join_dat$devils_prop

This code selects the common species of both the modern pollen data and the Devil’s Lake core data and reduces the number of columns of both.

# Select species based on name
mod_pol_east_intersect <- mod_pol_east %>%
  rename_with(tolower) %>% # convert to lowercase for matching
  select(all_of(names_matches_filter$value.y)) # select the matched species columns only

# The analogue functions need species names to match between datasets,
# so let's rename everything in our filtered modern pollen data to match Devil's Lake,
# then convert to a species-only matrix (rename() needs a data frame, so as.matrix() comes last)
mod_pol_east_intersect <- mod_pol_east_intersect %>% 
  rename(all_of(deframe(names_matches_filter[ , 1:2]))) %>% 
  as.matrix()

# Let's create a matrix of only species for the MAT with relative proportions
# This is an approach for calculating relative abundances on wide data:
mod_pol_east_prop_intersect <- mod_pol_east_intersect / rowSums(mod_pol_east_intersect)
# data check: rowSums(mod_pol_east_prop_intersect)

# While we are here, let's make a matrix of species only from Devil's Lake
devils_prop_intersect <- devils_wide %>% 
  rename_with(tolower) %>% 
  select(all_of(names_matches_filter$value.x)) %>% # remember we removed a few species without matches?
  as.matrix()

# Recall that we already converted the Devil's Lake data to proportions,
# but we need to recalculate proportions after dropping columns.
# data check: rowSums(devils_prop_intersect)
devils_prop_intersect <- devils_prop_intersect / rowSums(devils_prop_intersect) 
Good practice tip

Whenever filtering or joining data it is good to check the dimensions before and after using dim() to see if the results are what you expect. Some functions like left_join() or full_join() may run without error but produce unexpected results. It is also good to regularly use rowSums() to make sure you are using the correct data type (counts, proportions, or percentages), and have not accidentally included age, depth, or ID columns that should be excluded.
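
For example, quick checks on the objects created above might look like this:

dim(mod_pol_east)                    # dimensions before the join
dim(mod_pol_east_join)               # dimensions after the join: more columns, same number of rows?
summary(rowSums(mod_pol_east_join))  # proportions should sum to ~1 for every row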

That took some pretty advanced wrangling to get the data in order. Such is often the case when using multiple sources of data. As always, there are many ways to achieve the same result. You can use these data wrangling tricks on different datasets and save a lot of time doing manual work.

7.2.2 Cross-validation

The Modern Analogue Technique (MAT) needs the same set of species in both the modern dataset and the core data, and can be calculated in two ways (restated here if you skipped the data wrangling in Section 7.2.1):

  1. By joining the two datasets so that both contain the same species column names, creating two larger datasets. Species missing from one dataset are filled with zeros.
  2. By conducting the analysis on the intersect of the two datasets, i.e., smaller datasets containing only the species common to both.

The remainder of the chapter will use the first, more conventional, method of joining datasets to contain the same species. In Section 7.2.1, we created a dataset for both methods. The code below works for both methods, and you can switch out the data if you want to (create the intersect data by running the code block “intersect” at the end of the previous section).

Before reconstructing environmental variables from fossil pollen assemblages, we usually assess how MAT performs in the modern species assemblage. This step is usually referred to as calibration or cross-validation.

In cross-validation, a calibration function is trained on one portion of the modern species assemblages (the calibration set) and applied to another portion (the validation set). These two datasets are thus mutually exclusive and, possibly, independent. To cross-validate the MAT, we usually use a technique called \(k\)-fold cross-validation, in which the modern dataset is split into \(k\) mutually exclusive subsets. For each subset, the calibration dataset comprises all the samples not in that subset, while the samples in the subset form the validation dataset. The simplest form of \(k\)-fold cross-validation is the leave-one-out (LOO) technique, in which a single sample is removed and all other samples are used to build a prediction for that sample; this procedure is then repeated for every sample. The analogue package uses leave-one-out cross-validation.

Standard metrics include root mean square error (RMSE) and \(R^2\). Here, we’ll use the cross-validation tools built into the analogue package.
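
As a minimal illustration of leave-one-out cross-validation and of how RMSEP and \(R^2\) are computed, here is a sketch for a one-analogue (\(k = 1\)) MAT. The object names modern_prop and modern_env are placeholders, not objects created elsewhere in this chapter; analogue::mat() performs this cross-validation internally, as we will see below.

# Leave-one-out cross-validation sketch for a one-analogue MAT (illustration only)
# modern_prop: site-by-species matrix of proportions; modern_env: target environmental variable
loo_mat <- function(modern_prop, modern_env) {
  modern_prop <- as.matrix(modern_prop) # ensure a numeric site-by-species matrix
  n <- nrow(modern_prop)
  pred <- numeric(n)
  for (i in seq_len(n)) {
    # Squared chord distance from sample i to every other modern sample
    scd <- apply(modern_prop[-i, , drop = FALSE], 1,
                 function(m) sum((sqrt(modern_prop[i, ]) - sqrt(m))^2))
    # Predict sample i's environment from its single closest analogue (k = 1)
    pred[i] <- modern_env[-i][which.min(scd)]
  }
  c(RMSEP = sqrt(mean((pred - modern_env)^2)), # root mean square error of prediction
    R2    = cor(pred, modern_env)^2)           # squared correlation of predicted vs observed
}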

Note that the palaeoSig package, developed by Richard Telford, has additional functions that can test for significance relative to randomly generated variables with the same spatial structure as the environmental variable of interest. We won’t use this package in this lab, but it’s useful for testing whether apparently strong cross-validation results are merely an artifact of spatial autocorrelation (Telford and Birks 2005).

Now let’s do the analysis! First is the cross validation of the modern datasets. Using the modern data, we’ll build a calibration dataset and run cross-validation analyses of the calibration dataset. If you portaled here and skipped the Data Wrangling in Section 7.2.1, make sure you load in the pre-formatted data from Section 7.1.

mod_jant <- data.matrix(mod_clim_filter$tjan)

modpoll_jant <- analogue::mat(mod_pol_east_join, mod_jant, method="SQchord")
print(modpoll_jant)

    Modern Analogue Technique

Call:
mat(x = mod_pol_east_join, y = mod_jant, method = "SQchord") 

Percentiles of the dissimilarities for the training set:

   1%    2%    5%   10%   20% 
0.207 0.272 0.385 0.502 0.670 

Inferences based on the mean of k-closest analogues:

  k  RMSEP     R2 Avg Bias Max Bias
  1  2.766  0.934   -0.214   -1.625
  2  2.654  0.939   -0.302   -2.265
  3  2.537  0.944   -0.360   -2.648
  4  2.608  0.941   -0.350   -2.569
  5  2.661  0.938   -0.357   -2.622
  6  2.714  0.936   -0.361   -2.714
  7  2.716  0.936   -0.373   -2.804
  8  2.734  0.935   -0.368   -2.982
  9  2.779  0.933   -0.357   -3.104
 10  2.826  0.930   -0.370   -3.316

Inferences based on the weighted mean of k-closest analogues:

  k RMSEP    R2 Avg Bias Max Bias
  1   NaN    NA      NaN    -1.62
  2   NaN    NA      NaN    -2.19
  3   NaN    NA      NaN    -2.52
  4   NaN    NA      NaN    -2.46
  5   NaN    NA      NaN    -2.52
  6   NaN    NA      NaN    -2.59
  7   NaN    NA      NaN    -2.68
  8   NaN    NA      NaN    -2.81
  9   NaN    NA      NaN    -2.93
 10   NaN    NA      NaN    -3.11
par(mfrow = c(2,2))
plot(modpoll_jant)

Much of the information in this chapter comes from Simpson (2007), which we recommend for the interpretation of the outputs. The MAT shares the considerations that apply to calculating dissimilarity (Chapter 6), with some additional considerations regarding transfer functions:

  • Time intervals and chronological uncertainties are not accounted for. The difference in time between successive samples often varies in palaeo-data.
  • Time averaging is not taken into account, and each sample may contain a different time-span of ecological productivity.
  • Squared chord distance is one of many distance measures available in analogue, and may not always be the most appropriate (a brief example follows below).

For deeper considerations of reconstructing temperatures using transfer functions see Telford and Birks (2005).
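
On the last point, other coefficients can be computed with analogue’s distance() function and passed to mat() through its method argument. A brief sketch ("bray", the Bray-Curtis coefficient, is just one alternative; see ?analogue::distance for the supported methods):

# Dissimilarities between the modern training set and the Devil's Lake samples
# under two different coefficients
d_scd  <- analogue::distance(mod_pol_east_join, devils_prop, method = "SQchord")
d_bray <- analogue::distance(mod_pol_east_join, devils_prop, method = "bray")

# Refit the MAT with an alternative coefficient and compare the cross-validation output
modpoll_jant_bray <- analogue::mat(mod_pol_east_join, mod_jant, method = "bray")
print(modpoll_jant_bray)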

7.2.3 Novelty and Paleoclimatic Reconstruction

For this part of the lab, we’ll first assess the novelty of the fossil pollen assemblages at Devil’s Lake. We’ll then reconstruct past climates at Devil’s Lake using the MAT. Here is some demonstration code showing the kinds of analyses that can be done, with plotted time series.

Construct a time series of the minimum SCDs (novelty) for Devil’s Lake.

# Predict temperatures from the fossil data and their modern analogue
devil_jant <- predict(modpoll_jant, devils_prop, k=10)

par(mfrow = c(1,1))
devil_mindis <- minDC(devil_jant)
plot(devil_mindis, depths = devils_ages, quantiles = FALSE, xlab = "Age (yr BP)", ylab = "minSCD")

Construct a time series of the number of modern analogs found per fossil sample, using 0.25 as our no-analog/analog threshold.

devil_cma <- analogue::cma(devil_jant, cutoff=0.25)

plot(x = devils_ages, y = devil_cma$n.analogs, type = "l", xlim= rev(range(devils_ages)), xlab = "Age (yr BP)", ylab = "Number of modern analogs")

Plot the reconstructed January temperatures at Devil’s Lake:

reconPlot(devil_jant, depth = devils_ages, display.error="bars", xlab = "Age (yr BP)", ylab = "Mean Jan temperature")

7.3 Part 3: Exercises

  1. In the summary output for JanT, which choice of k (number of analogues) produces the highest R2 and lowest RMSEP?

     a. Which samples at Devil’s Lake lack modern analogs?

  2. Repeat the above analysis for TJul, PJul, GDD5, and MIPT. (GDD5 is growing degree days, a measure of growing season length and strength; MIPT is an index of soil moisture availability.)

  3. Which variable has the best cross-validation statistics?

     a. Which has the worst?

  4. Given what you know about the climatic controls on plant distributions (and/or our ability to precisely quantify these variables), is the relative predictive skill for these different climatic variables surprising or unsurprising? Why?

  5. Check the Devil’s Lake novelty curve against the pollen diagram for Devil’s Lake. What combinations of pollen taxa produce these no-analog assemblages?

  6. According to Webb (1986), what are two general hypotheses for why we might have these past mixtures of species with no modern analogue?

  7. Why should one be dubious about the temperature reconstructions at Devil’s Lake for the high-novelty assemblages?

  8. Even for the low-novelty assemblages, name two potentially important sources of uncertainty in MAT-based reconstructions of past climate from fossil pollen data.

     a. Conversely, based on our knowledge of the geographic climatic distributions of these taxa, why might one argue that these January temperature reconstructions are plausible?

  9. To get practice, make reconstructions of GDD5 and mean July precipitation. Show plots of your work.