Change log

  • 2024-09-06: version for archiving on Ag Data Commons
  • 2024-05-07: add moisture to responses
  • 2024-05-02: updated version with transformed variables in model fit, different slope tests, and improved figures
  • 2024-04-26: original version

Summary

This is the analysis of the corn mycotoxin data. We model the effect of mycotoxin concentration on the concentration of several different nutrients in corn. We include main effects of the different mycotoxins as well as two-way interactions between each pair of mycotoxins. We also include analysis of mycotoxin effects on the L variable from the color analysis, because it seems to be the one most important for determining the overall color of the corn. We use AIC to compare the models with and without interaction terms. We find that the models without interaction terms are better so we omit the interactions. We present adjusted R-squared values for each model as well as the p-values associated with the average slopes (effect of each mycotoxin on each nutrient).

We find that there are significant associations of mycotoxin concentration on protein, fat, and starch, but not fiber. Most of the associations are negative except that DON and FUM have opposing effects on protein (higher FUM associated with higher protein, lower DON associated with higher protein, and the opposite pattern in starch). There is a positive association between FUM concentration and moisture content. Increased AFB1 is also associated with a lower value of L (darker color).

Setup

Load packages and import the data. The file path will need to be modified to run on a different system.

library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(purrr)
library(readxl)
library(patchwork)
library(schemr)
library(performance)
library(ggeffects)
library(gt)
library(marginaleffects)

file_path <- 'project'

myc <- read_excel(file.path(file_path, 'Combined LCMS NIR Color data.xlsx'), sheet = 'Sheet1') %>%
  rename(ID = `neutral id`) %>%
  rename_with(~ gsub('[[:punct:]]', '', .)) %>%
  rename_with(~ gsub(' ', '_', .)) %>%
  rename_with(~ gsub('__', '_', .)) %>%
  mutate(Sample_Location = fct_explicit_na(Sample_Location, na_level = 'unknown'),
         AFB1_Category = factor(AFB1_Category, levels = c('None', 'Present')),
         FUM_Category = factor(FUM_Category, levels = c('Low', 'Medium', 'High')),
         ZEA_Category = factor(ZEA_Category, levels = c('None', 'Present')),
         DON_Category = factor(DON_Category, levels = c('None', 'Low', 'Medium', 'High')))

Exploratory plots

Explore the nutrient composition data.

nutrient_cols <- c('Protein', 'Fat', 'Moisture', 'Fiber', 'Starch', 'AshAI')
myc_cols <- c('AFB1_ppm', 'FUM_ppm_LCMS', 'ZEA_ppm', 'DON_ppm_LCMS')
cat_cols <- grep('Category$', names(myc), value = TRUE)

myc %>% select(all_of(nutrient_cols)) %>% rowSums() %>% head()
## [1] 87.840 88.290 88.975 89.770 89.840 89.280

Even when ash is included, the nutrient compositions do not add up to 100. So it is probably best to treat them as separate independent response variables.

Look at histograms of each nutrient. The distributions are roughly symmetrical so it is acceptable to model them as having normally distributed error.

nutr_hists <- map(nutrient_cols, ~ ggplot(myc, aes(x = !!sym(.))) + geom_histogram(bins = 20))

walk(nutr_hists, print)

Make pairwise scatterplots of each combination of mycotoxin and nutrient concentration, to see what the relationships look like. Are they linear or nonlinear?

pair_plots <- expand_grid(nutrient = nutrient_cols[1:5], mycotoxin = myc_cols)

pair_plots$p <- pmap(pair_plots, function(nutrient, mycotoxin) {
  ylabel <- paste(nutrient, '(%)')
  xlabel <- gsub('_ppm', ' (ppm)', gsub('_LCMS', '', mycotoxin))
  ggplot(myc, aes(x = !!sym(mycotoxin), y = !!sym(nutrient))) + 
    geom_point() +
    labs(y = ylabel, x = xlabel)
})

L_plots <- map(myc_cols, function(mycotoxin) {
  xlabel <- gsub('_ppm', ' (ppm)', gsub('_LCMS', '', mycotoxin))
  ggplot(myc, aes(x = !!sym(mycotoxin), y = LD65_SCI)) + 
    geom_point() +
    labs(y = 'Perceptual lightness (L*)', x = xlabel)
})

(scatterplot_matrix <- Reduce(`+`, c(pair_plots$p, L_plots)) + plot_layout(ncol = 4, nrow = 6, byrow = TRUE))

#ggsave('project/fig2_scatterplot_matrix.png', scatterplot_matrix, height = 10, width = 7, dpi = 400)

Do the same thing using the categorical values of mycotoxin.

pair_boxplots <- expand_grid(nutrient = nutrient_cols, mycotoxin = cat_cols)

pair_boxplots$p <- pmap(pair_boxplots, function(nutrient, mycotoxin) ggplot(myc, aes(x = !!sym(mycotoxin), y = !!sym(nutrient))) + geom_boxplot())

Reduce(`+`, pair_boxplots$p) + plot_layout(ncol = 6, nrow = 4, byrow = FALSE)

Look at mycotoxin concentrations by year. We don’t see much difference so I feel it is acceptable to ignore year in the model.

myc_hists_byyear <- map(myc_cols, ~ ggplot(myc, aes(x = !!sym(.))) + geom_histogram(bins = 20) + facet_wrap(~ harvest_year))

walk(myc_hists_byyear, print)

Examine the color data. Plot each pairwise combination of L, a, and b variables, the points are colored by the actual color generated from the L a b values. To me it looks like the L value is the most important so I will focus on that for the analysis.

myc_color <- with(myc, cbind(LD65_SCI, aD65_SCI, bD65_SCI))
color_values <- rep(NA, nrow(myc))
color_values[complete.cases(myc_color)] <- lab_to_hex(myc_color[complete.cases(myc_color), ])

myc <- myc %>% mutate(color = color_values)
  
ggplot(myc, aes(x = LD65_SCI, y = aD65_SCI)) + geom_point(color = myc$color)

ggplot(myc, aes(x = LD65_SCI, y = bD65_SCI)) + geom_point(color = myc$color)

ggplot(myc, aes(x = aD65_SCI, y = bD65_SCI)) + geom_point(color = myc$color)

Statistical models

Normalize the mycotoxin concentrations by dividing each one by its maximum so that they all range from 0 to 1.

myc <- myc %>%
  mutate(AFB1_norm = AFB1_ppm / max(AFB1_ppm),
         FUM_norm = FUM_ppm_LCMS / max(FUM_ppm_LCMS),
         ZEA_norm = ZEA_ppm / max(ZEA_ppm),
         DON_norm = DON_ppm_LCMS / max(DON_ppm_LCMS))

Check variance inflation factors of mycotoxin. All are low so we do not have to worry about collinearity.

mycdf <- myc %>% select(ends_with('norm')) %>% as.data.frame()
usdm::vif(mycdf)
##   Variables      VIF
## 1 AFB1_norm 1.036240
## 2  FUM_norm 1.065465
## 3  ZEA_norm 1.544111
## 4  DON_norm 1.531532

Fit a multiple regression model to each of the nutrient concentrations, and the L value, including all the two-way interactions. Fit the models on a log scale so that we can ensure that the predictions are always positive.

protein_lm <- lm(log(Protein) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
fat_lm <- lm(log(Fat) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
moisture_lm <- lm(log(Moisture) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
fiber_lm <- lm(log(Fiber) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
starch_lm <- lm(log(Starch) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
L_lm <- lm(log(LD65_SCI) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)

Also fit multiple regression models without the two-way interactions.

protein_lm_oneway <- lm(log(Protein) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
fat_lm_oneway <- lm(log(Fat) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
moisture_lm_oneway <- lm(log(Moisture) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
fiber_lm_oneway <- lm(log(Fiber) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
starch_lm_oneway <- lm(log(Starch) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
L_lm_oneway <- lm(log(LD65_SCI) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)

Use AIC to compare the models with and without the two-way interactions. All show that the model without interactions is an equally good or better fit than the model with interactions. As a result, we will not consider the models containing interactions moving forward.

AIC(protein_lm, protein_lm_oneway)
##                   df       AIC
## protein_lm        12 -901.1684
## protein_lm_oneway  6 -901.4457
AIC(fat_lm, fat_lm_oneway)
##               df       AIC
## fat_lm        12 -827.7074
## fat_lm_oneway  6 -811.9743
AIC(moisture_lm, moisture_lm_oneway)
##                    df       AIC
## moisture_lm        12 -840.0159
## moisture_lm_oneway  6 -846.4696
AIC(fiber_lm, fiber_lm_oneway)
##                 df       AIC
## fiber_lm        12 -1083.237
## fiber_lm_oneway  6 -1082.178
AIC(starch_lm, starch_lm_oneway)
##                  df       AIC
## starch_lm        12 -1783.472
## starch_lm_oneway  6 -1782.329
AIC(L_lm, L_lm_oneway)
##             df       AIC
## L_lm        12 -81.28237
## L_lm_oneway  6 -85.98532

Model diagnostics

Diagnostic plots are not shown here in the interest of space but they look acceptable. The distribution of the residuals of fat is a little bit skewed but it is probably still an acceptable model.

check_model(protein_lm)
check_model(fat_lm)
check_model(moisture_lm)
check_model(fiber_lm)
check_model(starch_lm)
check_model(L_lm)

Results

Adjusted R-squared for each nutrient

Here is the R-squared value adjusted for multiple predictors for each nutrient. It shows that the proportion of variation explained is modest for protein and starch (around 0.1) but very low for the other variables.

fit_data <- tibble(
  rv = c('protein', 'fat', 'fiber', 'starch', 'moisture', 'L color'),
  fit = list(protein_lm_oneway, fat_lm_oneway, fiber_lm_oneway, starch_lm_oneway, moisture_lm_oneway, L_lm_oneway)
) |>
  mutate(ANOVA = map(fit, anova),
         slopes = map(fit, avg_slopes),
         adjrsq = map_dbl(fit, ~ summary(.)$adj.r.squared))

fit_data |>
  select(rv, adjrsq) |>
  gt() |>
  cols_label(rv = 'nutrient', adjrsq = 'adjusted R-squared') |>
  fmt_number(decimals = 3) |>
  gtopts()
nutrient adjusted R-squared
protein 0.084
fat 0.016
fiber 0.007
starch 0.124
moisture 0.015
L color 0.017

Average slopes for each predictor

Now instead of the ANOVA table we show p-values associated with the average slope for each mycotoxin, holding all other mycotoxins constant at their average values. In the model the y variable is on the log scale and the x axis is such that 0 is the lowest concentration and 1 the highest, so the slopes must be interpreted that way. For example the slope of protein with respect to DON is -0.112. This means that the highest value of DON is associated with a protein level that is \(e^{-0.112}\), or 89%, as much as the lowest value of DON. In the table, the 95% confidence intervals of the slopes are given in parentheses. The p-values are based on a z-test.

slope_table <- with(fit_data, map2(rv, slopes, ~ tibble(variable = .x, as_tibble(.y)))) |> 
  list_rbind() |>
  select(variable, term, estimate, conf.low, conf.high, statistic, p.value) |>
  mutate(term = gsub('_norm', '', term))

gt(slope_table, groupname_col = 'variable') |>
  cols_label(estimate = 'slope', term = 'mycotoxin', statistic = md('*z*-statistic'), p.value = md('*p*-value')) |>
  cols_merge(c(estimate, conf.low, conf.high), pattern = '{1} ({2}, {3})') |>
  fmt_number(decimals = 3) |>
  fmt_number(columns = p.value, n_sigfig = 2) |>
  gtopts() |>
  tab_style(style = cell_text(weight = 'bold'), locations = cells_body(columns = p.value, rows = p.value < 0.05))
mycotoxin slope z-statistic p-value
protein
AFB1 0.033 (−0.059, 0.124) 0.702 0.48
DON −0.112 (−0.168, −0.056) −3.934 0.000084
FUM 0.072 (0.019, 0.125) 2.655 0.0079
ZEA 0.008 (−0.097, 0.112) 0.143 0.89
fat
AFB1 −0.145 (−0.249, −0.040) −2.705 0.0068
DON 0.028 (−0.036, 0.092) 0.855 0.39
FUM 0.000 (−0.061, 0.061) 0.000 1.0
ZEA 0.002 (−0.119, 0.122) 0.026 0.98
fiber
AFB1 0.029 (−0.040, 0.099) 0.832 0.41
DON 0.035 (−0.008, 0.077) 1.601 0.11
FUM −0.017 (−0.057, 0.024) −0.814 0.42
ZEA 0.015 (−0.064, 0.094) 0.378 0.71
starch
AFB1 −0.018 (−0.042, 0.005) −1.509 0.13
DON 0.037 (0.022, 0.051) 4.948 0.00000075
FUM −0.026 (−0.039, −0.012) −3.600 0.00032
ZEA −0.029 (−0.057, −0.002) −2.119 0.034
moisture
AFB1 0.020 (−0.079, 0.119) 0.392 0.70
DON −0.039 (−0.100, 0.022) −1.259 0.21
FUM 0.069 (0.012, 0.127) 2.353 0.019
ZEA 0.015 (−0.099, 0.128) 0.256 0.80
L color
AFB1 −0.842 (−1.453, −0.230) −2.699 0.0069
DON −0.041 (−0.255, 0.174) −0.373 0.71
FUM −0.056 (−0.238, 0.126) −0.607 0.54
ZEA 0.016 (−0.359, 0.392) 0.086 0.93

Main effect plots

Show the main effect of each mycotoxin on each nutrient concentration and color, with an estimated marginal trend. Note: for each mycotoxin’s trend, the value of each other mycotoxin is set to its average. P-values associated with the slopes are shown.

max_concs <- apply(myc |> select(AFB1_ppm, DON_ppm_LCMS, FUM_ppm_LCMS, ZEA_ppm), 2, max)
main_effs <- c('AFB1_norm', 'DON_norm', 'FUM_norm', 'ZEA_norm')
short_names <- c('AFB', 'DON', 'FUM', 'ZEA')
plot_colors <- unname(palette.colors()[c(2,3,4,7)])
  
get_preds <- function(eff, maxval, fit) {
  predict_response(fit, terms = eff) |>
    as.data.frame() |>
    mutate(x = x*maxval)
}

main_eff_plots <- function(rv, fit, slopes, ...) {
  plot_titles <- paste('Trend of', rv, 'versus', short_names)
  plot_subtitles <- paste('p =', signif(slopes$p.value, digits = 2))
  
  pred_effs <- map2(main_effs, max_concs, get_preds, fit = fit)
  
  eff_plots <- map(1:4, function(i) {
    ggplot(pred_effs[[i]], aes(x = x, y = predicted, ymin = conf.low, ymax = conf.high)) +
      geom_ribbon(fill = plot_colors[i], alpha = 0.5) +
      geom_line(color = plot_colors[i], linewidth = 1.2) +
      scale_x_continuous(name = paste(short_names[i], '(ppm)'), expand = c(0, 0)) +
      scale_y_continuous(name = rv) +
      ggtitle(plot_titles[i], plot_subtitles[i])
  })

  Reduce(`+`, eff_plots) + plot_layout(ncol = 2, nrow = 2)
}

all_maineff_plots <- pmap(fit_data, main_eff_plots)

walk(all_maineff_plots, print)

Session info

This is included so we can keep track of what software versions were used in the analysis.

sessionInfo()
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] marginaleffects_0.21.0 gt_0.11.0              ggeffects_1.7.0       
##  [4] performance_0.12.2     schemr_0.3.0           patchwork_1.2.0       
##  [7] readxl_1.4.3           purrr_1.0.2            forcats_1.0.0         
## [10] ggplot2_3.5.1          tidyr_1.3.1            dplyr_1.1.4           
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5      xfun_0.46         bslib_0.7.0       raster_3.6-26    
##  [5] insight_0.20.2    lattice_0.22-5    vctrs_0.6.5       tools_4.3.3      
##  [9] generics_0.1.3    datawizard_0.12.2 tibble_3.2.1      fansi_1.0.6      
## [13] highr_0.11        pkgconfig_2.0.3   Matrix_1.6-5      checkmate_2.3.1  
## [17] data.table_1.15.4 lifecycle_1.0.4   compiler_4.3.3    farver_2.1.2     
## [21] stringr_1.5.1     munsell_0.5.1     terra_1.7-78      tiff_0.1-12      
## [25] codetools_0.2-20  httpuv_1.6.15     htmltools_0.5.8.1 sass_0.4.9       
## [29] yaml_2.3.9        later_1.3.2       pillar_1.9.0      jquerylib_0.1.4  
## [33] cachem_1.1.0      mime_0.12         commonmark_1.9.1  sjlabelled_1.2.0 
## [37] tidyselect_1.2.1  digest_0.6.35     usdm_2.1-7        stringi_1.8.4    
## [41] labeling_0.4.3    fastmap_1.2.0     grid_4.3.3        colorspace_2.1-0 
## [45] cli_3.6.2         magrittr_2.0.3    utf8_1.2.4        OpenImageR_1.3.0 
## [49] withr_3.0.0       backports_1.4.1   scales_1.3.0      promises_1.3.0   
## [53] sp_2.1-4          rmarkdown_2.27    jpeg_0.1-10       cellranger_1.1.0 
## [57] hms_1.1.3         png_0.1-8         shiny_1.8.1.1     evaluate_0.24.0  
## [61] haven_2.5.4       knitr_1.48        markdown_1.13     rlang_1.1.3      
## [65] Rcpp_1.0.13       xtable_1.8-4      glue_1.7.0        xml2_1.3.6       
## [69] rstudioapi_0.16.0 jsonlite_1.8.8    R6_2.5.1          apcluster_1.4.13