Change log
- 2024-09-06: version for archiving on Ag Data Commons
- 2024-05-07: add moisture to responses
- 2024-05-02: updated version with transformed variables in model fit,
different slope tests, and improved figures
- 2024-04-26: original version
Summary
This is the analysis of the corn mycotoxin data. We model the effect
of mycotoxin concentration on the concentration of several different
nutrients in corn. We include main effects of the different mycotoxins
as well as two-way interactions between each pair of mycotoxins. We also
include analysis of mycotoxin effects on the L variable from
the color analysis, because it seems to be the one most important for
determining the overall color of the corn. We use AIC to compare the
models with and without interaction terms. We find that the models
without interaction terms are better so we omit the interactions. We
present adjusted R-squared values for each model as well as the p-values
associated with the average slopes (effect of each mycotoxin on each
nutrient).
We find that there are significant associations of mycotoxin
concentration on protein, fat, and starch, but not fiber. Most of the
associations are negative except that DON and FUM have opposing effects
on protein (higher FUM associated with higher protein, lower DON
associated with higher protein, and the opposite pattern in starch).
There is a positive association between FUM concentration and moisture
content. Increased AFB1 is also associated with a lower value of L
(darker color).
Setup
Load packages and import the data. The file path will need to be
modified to run on a different system.
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
library(purrr)
library(readxl)
library(patchwork)
library(schemr)
library(performance)
library(ggeffects)
library(gt)
library(marginaleffects)
file_path <- 'project'
myc <- read_excel(file.path(file_path, 'Combined LCMS NIR Color data.xlsx'), sheet = 'Sheet1') %>%
rename(ID = `neutral id`) %>%
rename_with(~ gsub('[[:punct:]]', '', .)) %>%
rename_with(~ gsub(' ', '_', .)) %>%
rename_with(~ gsub('__', '_', .)) %>%
mutate(Sample_Location = fct_explicit_na(Sample_Location, na_level = 'unknown'),
AFB1_Category = factor(AFB1_Category, levels = c('None', 'Present')),
FUM_Category = factor(FUM_Category, levels = c('Low', 'Medium', 'High')),
ZEA_Category = factor(ZEA_Category, levels = c('None', 'Present')),
DON_Category = factor(DON_Category, levels = c('None', 'Low', 'Medium', 'High')))
Exploratory plots
Explore the nutrient composition data.
nutrient_cols <- c('Protein', 'Fat', 'Moisture', 'Fiber', 'Starch', 'AshAI')
myc_cols <- c('AFB1_ppm', 'FUM_ppm_LCMS', 'ZEA_ppm', 'DON_ppm_LCMS')
cat_cols <- grep('Category$', names(myc), value = TRUE)
myc %>% select(all_of(nutrient_cols)) %>% rowSums() %>% head()
## [1] 87.840 88.290 88.975 89.770 89.840 89.280
Even when ash is included, the nutrient compositions do not add up to
100. So it is probably best to treat them as separate independent
response variables.
Look at histograms of each nutrient. The distributions are roughly
symmetrical so it is acceptable to model them as having normally
distributed error.
nutr_hists <- map(nutrient_cols, ~ ggplot(myc, aes(x = !!sym(.))) + geom_histogram(bins = 20))
walk(nutr_hists, print)






Make pairwise scatterplots of each combination of mycotoxin and
nutrient concentration, to see what the relationships look like. Are
they linear or nonlinear?
pair_plots <- expand_grid(nutrient = nutrient_cols[1:5], mycotoxin = myc_cols)
pair_plots$p <- pmap(pair_plots, function(nutrient, mycotoxin) {
ylabel <- paste(nutrient, '(%)')
xlabel <- gsub('_ppm', ' (ppm)', gsub('_LCMS', '', mycotoxin))
ggplot(myc, aes(x = !!sym(mycotoxin), y = !!sym(nutrient))) +
geom_point() +
labs(y = ylabel, x = xlabel)
})
L_plots <- map(myc_cols, function(mycotoxin) {
xlabel <- gsub('_ppm', ' (ppm)', gsub('_LCMS', '', mycotoxin))
ggplot(myc, aes(x = !!sym(mycotoxin), y = LD65_SCI)) +
geom_point() +
labs(y = 'Perceptual lightness (L*)', x = xlabel)
})
(scatterplot_matrix <- Reduce(`+`, c(pair_plots$p, L_plots)) + plot_layout(ncol = 4, nrow = 6, byrow = TRUE))

#ggsave('project/fig2_scatterplot_matrix.png', scatterplot_matrix, height = 10, width = 7, dpi = 400)
Do the same thing using the categorical values of mycotoxin.
pair_boxplots <- expand_grid(nutrient = nutrient_cols, mycotoxin = cat_cols)
pair_boxplots$p <- pmap(pair_boxplots, function(nutrient, mycotoxin) ggplot(myc, aes(x = !!sym(mycotoxin), y = !!sym(nutrient))) + geom_boxplot())
Reduce(`+`, pair_boxplots$p) + plot_layout(ncol = 6, nrow = 4, byrow = FALSE)

Look at mycotoxin concentrations by year. We don’t see much
difference so I feel it is acceptable to ignore year in the model.
myc_hists_byyear <- map(myc_cols, ~ ggplot(myc, aes(x = !!sym(.))) + geom_histogram(bins = 20) + facet_wrap(~ harvest_year))
walk(myc_hists_byyear, print)




Examine the color data. Plot each pairwise combination of L, a, and b
variables, the points are colored by the actual color generated from the
L a b values. To me it looks like the L value is the most important so I
will focus on that for the analysis.
myc_color <- with(myc, cbind(LD65_SCI, aD65_SCI, bD65_SCI))
color_values <- rep(NA, nrow(myc))
color_values[complete.cases(myc_color)] <- lab_to_hex(myc_color[complete.cases(myc_color), ])
myc <- myc %>% mutate(color = color_values)
ggplot(myc, aes(x = LD65_SCI, y = aD65_SCI)) + geom_point(color = myc$color)

ggplot(myc, aes(x = LD65_SCI, y = bD65_SCI)) + geom_point(color = myc$color)

ggplot(myc, aes(x = aD65_SCI, y = bD65_SCI)) + geom_point(color = myc$color)

Statistical models
Normalize the mycotoxin concentrations by dividing each one by its
maximum so that they all range from 0 to 1.
myc <- myc %>%
mutate(AFB1_norm = AFB1_ppm / max(AFB1_ppm),
FUM_norm = FUM_ppm_LCMS / max(FUM_ppm_LCMS),
ZEA_norm = ZEA_ppm / max(ZEA_ppm),
DON_norm = DON_ppm_LCMS / max(DON_ppm_LCMS))
Check variance inflation factors of mycotoxin. All are low so we do
not have to worry about collinearity.
mycdf <- myc %>% select(ends_with('norm')) %>% as.data.frame()
usdm::vif(mycdf)
## Variables VIF
## 1 AFB1_norm 1.036240
## 2 FUM_norm 1.065465
## 3 ZEA_norm 1.544111
## 4 DON_norm 1.531532
Fit a multiple regression model to each of the nutrient
concentrations, and the L value, including all the two-way interactions.
Fit the models on a log scale so that we can ensure that the predictions
are always positive.
protein_lm <- lm(log(Protein) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
fat_lm <- lm(log(Fat) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
moisture_lm <- lm(log(Moisture) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
fiber_lm <- lm(log(Fiber) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
starch_lm <- lm(log(Starch) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
L_lm <- lm(log(LD65_SCI) ~ (AFB1_norm + FUM_norm + ZEA_norm + DON_norm)^2, data = myc)
Also fit multiple regression models without the two-way
interactions.
protein_lm_oneway <- lm(log(Protein) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
fat_lm_oneway <- lm(log(Fat) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
moisture_lm_oneway <- lm(log(Moisture) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
fiber_lm_oneway <- lm(log(Fiber) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
starch_lm_oneway <- lm(log(Starch) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
L_lm_oneway <- lm(log(LD65_SCI) ~ AFB1_norm + FUM_norm + ZEA_norm + DON_norm, data = myc)
Use AIC to compare the models with and without the two-way
interactions. All show that the model without interactions is an equally
good or better fit than the model with interactions. As a result, we
will not consider the models containing interactions moving forward.
AIC(protein_lm, protein_lm_oneway)
## df AIC
## protein_lm 12 -901.1684
## protein_lm_oneway 6 -901.4457
AIC(fat_lm, fat_lm_oneway)
## df AIC
## fat_lm 12 -827.7074
## fat_lm_oneway 6 -811.9743
AIC(moisture_lm, moisture_lm_oneway)
## df AIC
## moisture_lm 12 -840.0159
## moisture_lm_oneway 6 -846.4696
AIC(fiber_lm, fiber_lm_oneway)
## df AIC
## fiber_lm 12 -1083.237
## fiber_lm_oneway 6 -1082.178
AIC(starch_lm, starch_lm_oneway)
## df AIC
## starch_lm 12 -1783.472
## starch_lm_oneway 6 -1782.329
AIC(L_lm, L_lm_oneway)
## df AIC
## L_lm 12 -81.28237
## L_lm_oneway 6 -85.98532
Model diagnostics
Diagnostic plots are not shown here in the interest of space but they
look acceptable. The distribution of the residuals of fat is a little
bit skewed but it is probably still an acceptable model.
check_model(protein_lm)
check_model(fat_lm)
check_model(moisture_lm)
check_model(fiber_lm)
check_model(starch_lm)
check_model(L_lm)
Results
Adjusted R-squared for each nutrient
Here is the R-squared value adjusted for multiple predictors for each
nutrient. It shows that the proportion of variation explained is modest
for protein and starch (around 0.1) but very low for the other
variables.
fit_data <- tibble(
rv = c('protein', 'fat', 'fiber', 'starch', 'moisture', 'L color'),
fit = list(protein_lm_oneway, fat_lm_oneway, fiber_lm_oneway, starch_lm_oneway, moisture_lm_oneway, L_lm_oneway)
) |>
mutate(ANOVA = map(fit, anova),
slopes = map(fit, avg_slopes),
adjrsq = map_dbl(fit, ~ summary(.)$adj.r.squared))
fit_data |>
select(rv, adjrsq) |>
gt() |>
cols_label(rv = 'nutrient', adjrsq = 'adjusted R-squared') |>
fmt_number(decimals = 3) |>
gtopts()
nutrient |
adjusted R-squared |
protein |
0.084 |
fat |
0.016 |
fiber |
0.007 |
starch |
0.124 |
moisture |
0.015 |
L color |
0.017 |
Average slopes for each predictor
Now instead of the ANOVA table we show p-values associated with the
average slope for each mycotoxin, holding all other mycotoxins constant
at their average values. In the model the y variable is on the log scale
and the x axis is such that 0 is the lowest concentration and 1 the
highest, so the slopes must be interpreted that way. For example the
slope of protein with respect to DON is -0.112. This means that the
highest value of DON is associated with a protein level that is \(e^{-0.112}\), or 89%, as much as the lowest
value of DON. In the table, the 95% confidence intervals of the slopes
are given in parentheses. The p-values are based on a z-test.
slope_table <- with(fit_data, map2(rv, slopes, ~ tibble(variable = .x, as_tibble(.y)))) |>
list_rbind() |>
select(variable, term, estimate, conf.low, conf.high, statistic, p.value) |>
mutate(term = gsub('_norm', '', term))
gt(slope_table, groupname_col = 'variable') |>
cols_label(estimate = 'slope', term = 'mycotoxin', statistic = md('*z*-statistic'), p.value = md('*p*-value')) |>
cols_merge(c(estimate, conf.low, conf.high), pattern = '{1} ({2}, {3})') |>
fmt_number(decimals = 3) |>
fmt_number(columns = p.value, n_sigfig = 2) |>
gtopts() |>
tab_style(style = cell_text(weight = 'bold'), locations = cells_body(columns = p.value, rows = p.value < 0.05))
mycotoxin |
slope |
z-statistic |
p-value |
protein |
AFB1 |
0.033 (−0.059, 0.124) |
0.702 |
0.48 |
DON |
−0.112 (−0.168, −0.056) |
−3.934 |
0.000084 |
FUM |
0.072 (0.019, 0.125) |
2.655 |
0.0079 |
ZEA |
0.008 (−0.097, 0.112) |
0.143 |
0.89 |
fat |
AFB1 |
−0.145 (−0.249, −0.040) |
−2.705 |
0.0068 |
DON |
0.028 (−0.036, 0.092) |
0.855 |
0.39 |
FUM |
0.000 (−0.061, 0.061) |
0.000 |
1.0 |
ZEA |
0.002 (−0.119, 0.122) |
0.026 |
0.98 |
fiber |
AFB1 |
0.029 (−0.040, 0.099) |
0.832 |
0.41 |
DON |
0.035 (−0.008, 0.077) |
1.601 |
0.11 |
FUM |
−0.017 (−0.057, 0.024) |
−0.814 |
0.42 |
ZEA |
0.015 (−0.064, 0.094) |
0.378 |
0.71 |
starch |
AFB1 |
−0.018 (−0.042, 0.005) |
−1.509 |
0.13 |
DON |
0.037 (0.022, 0.051) |
4.948 |
0.00000075 |
FUM |
−0.026 (−0.039, −0.012) |
−3.600 |
0.00032 |
ZEA |
−0.029 (−0.057, −0.002) |
−2.119 |
0.034 |
moisture |
AFB1 |
0.020 (−0.079, 0.119) |
0.392 |
0.70 |
DON |
−0.039 (−0.100, 0.022) |
−1.259 |
0.21 |
FUM |
0.069 (0.012, 0.127) |
2.353 |
0.019 |
ZEA |
0.015 (−0.099, 0.128) |
0.256 |
0.80 |
L color |
AFB1 |
−0.842 (−1.453, −0.230) |
−2.699 |
0.0069 |
DON |
−0.041 (−0.255, 0.174) |
−0.373 |
0.71 |
FUM |
−0.056 (−0.238, 0.126) |
−0.607 |
0.54 |
ZEA |
0.016 (−0.359, 0.392) |
0.086 |
0.93 |
Main effect plots
Show the main effect of each mycotoxin on each nutrient concentration
and color, with an estimated marginal trend. Note: for each mycotoxin’s
trend, the value of each other mycotoxin is set to its average. P-values
associated with the slopes are shown.
max_concs <- apply(myc |> select(AFB1_ppm, DON_ppm_LCMS, FUM_ppm_LCMS, ZEA_ppm), 2, max)
main_effs <- c('AFB1_norm', 'DON_norm', 'FUM_norm', 'ZEA_norm')
short_names <- c('AFB', 'DON', 'FUM', 'ZEA')
plot_colors <- unname(palette.colors()[c(2,3,4,7)])
get_preds <- function(eff, maxval, fit) {
predict_response(fit, terms = eff) |>
as.data.frame() |>
mutate(x = x*maxval)
}
main_eff_plots <- function(rv, fit, slopes, ...) {
plot_titles <- paste('Trend of', rv, 'versus', short_names)
plot_subtitles <- paste('p =', signif(slopes$p.value, digits = 2))
pred_effs <- map2(main_effs, max_concs, get_preds, fit = fit)
eff_plots <- map(1:4, function(i) {
ggplot(pred_effs[[i]], aes(x = x, y = predicted, ymin = conf.low, ymax = conf.high)) +
geom_ribbon(fill = plot_colors[i], alpha = 0.5) +
geom_line(color = plot_colors[i], linewidth = 1.2) +
scale_x_continuous(name = paste(short_names[i], '(ppm)'), expand = c(0, 0)) +
scale_y_continuous(name = rv) +
ggtitle(plot_titles[i], plot_subtitles[i])
})
Reduce(`+`, eff_plots) + plot_layout(ncol = 2, nrow = 2)
}
all_maineff_plots <- pmap(fit_data, main_eff_plots)
walk(all_maineff_plots, print)






Session info
This is included so we can keep track of what software versions were
used in the analysis.
sessionInfo()
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] marginaleffects_0.21.0 gt_0.11.0 ggeffects_1.7.0
## [4] performance_0.12.2 schemr_0.3.0 patchwork_1.2.0
## [7] readxl_1.4.3 purrr_1.0.2 forcats_1.0.0
## [10] ggplot2_3.5.1 tidyr_1.3.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.46 bslib_0.7.0 raster_3.6-26
## [5] insight_0.20.2 lattice_0.22-5 vctrs_0.6.5 tools_4.3.3
## [9] generics_0.1.3 datawizard_0.12.2 tibble_3.2.1 fansi_1.0.6
## [13] highr_0.11 pkgconfig_2.0.3 Matrix_1.6-5 checkmate_2.3.1
## [17] data.table_1.15.4 lifecycle_1.0.4 compiler_4.3.3 farver_2.1.2
## [21] stringr_1.5.1 munsell_0.5.1 terra_1.7-78 tiff_0.1-12
## [25] codetools_0.2-20 httpuv_1.6.15 htmltools_0.5.8.1 sass_0.4.9
## [29] yaml_2.3.9 later_1.3.2 pillar_1.9.0 jquerylib_0.1.4
## [33] cachem_1.1.0 mime_0.12 commonmark_1.9.1 sjlabelled_1.2.0
## [37] tidyselect_1.2.1 digest_0.6.35 usdm_2.1-7 stringi_1.8.4
## [41] labeling_0.4.3 fastmap_1.2.0 grid_4.3.3 colorspace_2.1-0
## [45] cli_3.6.2 magrittr_2.0.3 utf8_1.2.4 OpenImageR_1.3.0
## [49] withr_3.0.0 backports_1.4.1 scales_1.3.0 promises_1.3.0
## [53] sp_2.1-4 rmarkdown_2.27 jpeg_0.1-10 cellranger_1.1.0
## [57] hms_1.1.3 png_0.1-8 shiny_1.8.1.1 evaluate_0.24.0
## [61] haven_2.5.4 knitr_1.48 markdown_1.13 rlang_1.1.3
## [65] Rcpp_1.0.13 xtable_1.8-4 glue_1.7.0 xml2_1.3.6
## [69] rstudioapi_0.16.0 jsonlite_1.8.8 R6_2.5.1 apcluster_1.4.13