Species interactions, forming ecological networks, are a backbone for key
ecological and evolutionary processes; yet enumerating all of the interactions
between
The prediction of ecological interactions shares conceptual and methodological
issues with two fields in biology: species distribution models (SDMs), and
genomics. SDMs suffer from issues also affecting interaction prediction, namely low
prevalence (due to the sparsity of observations/interactions) and data aggregation
(due to biases in sampling some locations/species). An important challenge lies in
the fact that the best measure to quantify the performance of a model is not
necessarily a point of consensus (these methods, their interpretation, and the
way they are measured, are covered in depth in the next section). In previous
work, @Allouche2006AssAcc suggested that Cohen's
An immense body of research on machine learning applications to the life sciences is
focused on genomics [which has very specific challenges, see a recent discussion
by @Whalen2021NavPit]; this sub-field has generated recommendations that do not
necessarily match the current best-practices for SDMs, and therefore hint at the
importance of domain-specific guidelines. @Chicco2020AdvMat suggest using
Matthews correlation coefficient (MCC) over the F1 score and accuracy for the evaluation of binary classifiers.
Species interaction networks are often under-sampled [@Jordano2016SamNet; @Jordano2016ChaEco], and this under-sampling is structured taxonomically [@Beauchesne2016ThiOut], structurally [@deAguiar2019RevBia], and spatially [@Poisot2021GloKno; @Wood2015EffSpa]. As a consequence, networks suffer from data deficiencies both within and between datasets. This implies that the comparison of classifiers across space, when under-sampling varies locally [see e.g. @McLeod2021SamAsy], is non-trivial. Furthermore, the baseline value of classifier performance measures under various conditions of skill, bias, and prevalence has to be identified to allow researchers to evaluate whether their interaction prediction model is indeed learning. Taken together, these considerations highlight three specific issues for ecological networks. First, what values of performance measures are indicative of a classifier with no skill? This is particularly important, as it can reveal whether low prevalence lulls us into a false sense of predictive accuracy. Second, independently of the question of model evaluation, is low prevalence an issue for training or testing, and can we remedy it? Finally, because the small amount of data on interactions makes many imbalance correction methods [see e.g. @Branco2015SurPre] hard to apply, which measures of model performance can be optimized while sacrificing the least amount of positive interaction data?
A preliminary question is to examine the baseline performance of these measures, i.e. the values they would take on hypothetical networks for a classifier that has no skill. It may sound counter-intuitive to care so deeply about how good a no-skill classifier is, since by definition it has no skill. The necessity of this exercise has its roots in the paradox of accuracy: when the desired class ("two species interact") is rare, a model that becomes less ecologically performant by only predicting the opposite class ("these two species do not interact") sees its accuracy increase. Because most of the guesses have "these two species do not interact" as a correct answer, a model that never predicts interactions would be right an overwhelming majority of the time; it would also be utterly useless. Herein lies the core challenge of predicting species interactions: the extreme imbalance between classes makes the training of predictive models difficult, and their validation even more so, as we do not reliably know which negatives are true. The connectance (the proportion of realized interactions, usually the number of interactions divided by the number of species pairs) of empirical networks is usually well under 20%, with larger networks having a lower connectance [@MacDonald2020RevLin], and therefore being increasingly difficult to predict.
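As a worked example (the numbers are purely illustrative), consider a network with a connectance of 0.1, and a model that predicts "no interaction" for every pair of species. Writing $S^2$ for the number of species pairs, its accuracy is

$$\text{accuracy} = \frac{\text{correct predictions}}{\text{all predictions}} = \frac{(1-0.1)\times S^2}{S^2} = 0.9,$$

even though the model never predicts a single interaction correctly.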
Binary classifiers, which is to say, machine learning algorithms whose answer is a binary value, are usually assessed by measuring properties of their confusion matrix, i.e. the contingency table reporting true/false positive/negative hits. A confusion matrix is laid out as

$$\mathbf{C} = \begin{pmatrix} tp & fp \\ fn & tn \end{pmatrix}\,.$$
In this matrix, tp is the number of times the model predicts an interaction that exists in the network (true positive), fp is the number of times the model predicts an interaction that does not exist in the network (false positive), fn is the number of times the model fails to predict an interaction that actually exists in the network (false negatives), and tn is the number of times the model correctly predicts that an interaction does not exist (true negatives). From these values, we can derive a number of measures of model performance [see @Strydom2021RoaPre for a review of their interpretation in the context of networks]. At a coarse scale, a classifier is accurate when the trace of the matrix divided by the sum of the matrix is close to 1, with other measures informing us on how the predictions fail.
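As a minimal sketch of these quantities (illustration only; the helper names are ours, and this is not the code used in the analyses), the confusion matrix can be tallied from a vector of observed interactions and a vector of binary predictions:

```julia
# Tally the confusion matrix from binary predictions and observations, and measure
# accuracy as the trace of the matrix divided by its sum.
using LinearAlgebra: tr

function confusion(pred::AbstractVector{Bool}, obs::AbstractVector{Bool})
    tp = sum(pred .& obs)      # predicted and observed
    fp = sum(pred .& .!obs)    # predicted but not observed
    fn = sum(.!pred .& obs)    # observed but not predicted
    tn = sum(.!pred .& .!obs)  # neither predicted nor observed
    return [tp fp; fn tn]
end

accuracy(C::AbstractMatrix) = tr(C) / sum(C)

# A no-skill "model" that guesses interactions at the same rate as their prevalence
obs  = rand(10_000) .< 0.1
pred = rand(10_000) .< 0.1
C = confusion(pred, obs)
accuracy(C)   # ≈ 0.1² + 0.9² = 0.82, despite the model knowing nothing
```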
Many binary classifiers are built by using a regressor (whose task is to
guess the value of the interaction, and which can therefore return a value akin
to a pseudo-probability); in this case, the optimal value below which
predictions are assumed to be negative (i.e. the interaction does not exist)
can be determined by picking a threshold maximizing some value on the ROC or the
PR curve. The areas under these curves (ROC-AUC and PR-AUC henceforth) give an
idea of the overall quality of the classifier, and the ideal threshold is the
point on these curves that optimizes the tradeoff they represent.
@Saito2015PrePlo established that the ROC-AUC is biased towards over-estimating
performance for imbalanced data; on the contrary, the PR-AUC is able to identify
classifiers that are less able to detect positive interactions correctly, with
the additional advantage of having a baseline value equal to prevalence.
Therefore, it is important to assess whether these two measures return different
results when applied to ecological network prediction. The ROC curve is defined
by the false positive rate on the x-axis and the true positive rate on the
y-axis, while the PR curve is defined by the true positive rate (recall) on the
x-axis and the positive predictive value (precision) on the y-axis.
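As an illustration of the difference between the two curves (a sketch with synthetic scores, not the code used in the analyses), both AUCs can be computed by sweeping thresholds over the returned scores and integrating with the trapezoidal rule:

```julia
using Random

# Sweep thresholds over the scores, compute the points of the ROC and PR curves,
# and integrate both with the trapezoidal rule.
function roc_pr_auc(scores::AbstractVector{<:Real}, labels::AbstractVector{Bool}; steps = 500)
    thresholds = range(extrema(scores)...; length = steps)
    fpr, tpr, prec = Float64[], Float64[], Float64[]
    for t in thresholds
        pred = scores .>= t
        tp = sum(pred .& labels); fp = sum(pred .& .!labels)
        fn = sum(.!pred .& labels); tn = sum(.!pred .& .!labels)
        push!(fpr, fp / (fp + tn))
        push!(tpr, tp / (tp + fn))
        push!(prec, tp + fp > 0 ? tp / (tp + fp) : 1.0)
    end
    trapz(x, y) = sum(abs.(diff(x)) .* (y[1:end-1] .+ y[2:end]) ./ 2)
    return (roc_auc = trapz(fpr, tpr), pr_auc = trapz(tpr, prec))
end

# Synthetic, strongly imbalanced data: positives tend to score higher than negatives
Random.seed!(42)
labels = rand(5_000) .< 0.1
scores = ifelse.(labels, 0.6, 0.4) .+ 0.2 .* randn(5_000)
roc_pr_auc(scores, labels)   # ROC-AUC will look flattering; PR-AUC is more conservative
```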
There is an immense diversity of measures to evaluate the performance of
classification tasks [@Ferri2009ExpCom]. Here we will focus on five of them with
high relevance for imbalanced learning [@He2013ImbLea]. The choice of metrics
with relevance to class-imbalanced problems is fundamental, because as
@Japkowicz2013AssMet unambiguously concluded, "relatively robust procedures used
for unskewed data can break down miserably when the data is skewed". Following
@Japkowicz2013AssMet, we focus on two ranking metrics (the areas under the
Receiver Operating Characteristic and Precision Recall curves), and three
threshold metrics (
The
Informedness [@Youden1950IndRat] (also known as bookmaker informedness or the
True Skill Statistic) is the sum of the true positive rate and the true negative
rate, minus one, i.e. $tp/(tp+fn) + tn/(tn+fp) - 1$.
The MCC is defined as

$$\text{MCC} = \frac{tp\times tn - fp\times fn}{\sqrt{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}}\,.$$
Finally,
One noteworthy fact is that
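To keep the definitions above concrete, here is a minimal sketch of these threshold metrics written from the entries of the confusion matrix (illustration only; the function names are ours):

```julia
# Threshold metrics computed from the entries of the confusion matrix
accuracy(tp, tn, fp, fn)     = (tp + tn) / (tp + tn + fp + fn)
informedness(tp, tn, fp, fn) = tp / (tp + fn) + tn / (tn + fp) - 1

function mcc(tp, tn, fp, fn)
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return den == 0 ? 0.0 : (tp * tn - fp * fn) / den
end

# A classifier that never predicts an interaction on a sparse network: high accuracy,
# but no skill according to informedness and MCC
tp, tn, fp, fn = 0, 90, 0, 10
(accuracy(tp, tn, fp, fn), informedness(tp, tn, fp, fn), mcc(tp, tn, fp, fn))
# (0.9, 0.0, 0.0)
```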
In this section, we will assume a network with connectance equal to a scalar
In order to write the values of the confusion matrix for a hypothetical
classifier, we need to define two characteristics: its skill, and its bias.
Skill, here, refers to the propensity of the classifier to get the correct
answer (i.e. to assign interactions where they are, and to not assign them
where they are not). A no-skill classifier guesses at random, i.e. it will
guess interactions with a probability
In order to regulate the skill of this classifier, we can define a skill matrix
When
The second element we can adjust in this hypothetical classifier is its bias,
specifically its tendency to over-predict interactions. Like above, we can do so
by defining a bias matrix
The final expression for the confusion matrix in which we can regulate the skill and the bias is
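Purely as an illustration of the idea, such a matrix can be built by interpolating between a no-skill and a perfect classifier, and letting a bias term inflate the rate of predicted interactions; the functional forms below are simplifying assumptions for the example, not necessarily the exact parameterization used in the analysis:

```julia
# Illustrative confusion matrix (as proportions of species pairs) for a network with
# connectance ρ, a skill parameter s (0 = random guessing, 1 = perfect), and a bias
# parameter b (> 1 over-predicts interactions); rows are predictions, columns are truth.
function confusion_probabilities(ρ, s, b)
    # No-skill classifier: predicts interactions at rate ρ, independently of whether they exist
    random = [ρ^2 ρ*(1-ρ); (1-ρ)*ρ (1-ρ)^2]
    # Perfect classifier: every pair is classified correctly
    perfect = [ρ 0.0; 0.0 1-ρ]
    C = s .* perfect .+ (1 - s) .* random   # skill interpolates between the two extremes
    C[1, :] .*= b                           # bias inflates the "interaction predicted" row
    return C ./ sum(C)                      # renormalize so the entries sum to one
end

confusion_probabilities(0.1, 0.5, 1.5)
```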
In all further simulations, the confusion matrix
In this section, we will change the values of
In order to examine how MCC,
In @fig:bias, we show that none of the four measures satisfy all the
considerations at once:
These two analyses point to the following recommendations: MCC is indeed more
appropriate than
In the following section, we will generate random bipartite networks, and train
four binary classifiers (as well as an ensemble model using the sum of ranged
outputs from the component models) on 50% of the interaction data. In practice,
training usually uses 70% of the total data; for ecological networks, where
interactions are sparse and the number of species is low, this may not be
the best solution, as the testing set becomes constrained not by the
proportion of interactions, but by their number. Preliminary experiments
using different splits revealed no qualitative change in the results. Networks
are generated by picking a random infectiousness trait
The training sample is composed of a random pick of up to 50% of the
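As a sketch of this process (the interaction rule, the "resistance" trait name, and the sampling scheme below are placeholders, not the generative model used in the study):

```julia
using Random
Random.seed!(1)

# Placeholder bipartite network: each consumer gets an infectiousness trait, each
# resource a (hypothetical) resistance trait, and an interaction requires the former
# to exceed the latter by some margin (an arbitrary rule, giving a sparse network).
S = 50
infectiousness = rand(S)
resistance = rand(S)
A = [infectiousness[i] > resistance[j] + 0.5 for i in 1:S, j in 1:S]

pairs = [(i, j) for i in 1:S for j in 1:S]
positives = filter(p -> A[p...], pairs)
negatives = filter(p -> !A[p...], pairs)

# Draw a training set of n species pairs with a chosen share of interactions
function training_set(positives, negatives, n, balance)
    npos = min(round(Int, balance * n), length(positives))
    nneg = min(n - npos, length(negatives))
    return vcat(shuffle(positives)[1:npos], shuffle(negatives)[1:nneg])
end

train = training_set(positives, negatives, 500, 0.5)
length(train)
```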
The dataset used for numerical experiments is composed of a grid of 35 values of
connectance (from 0.011 to 0.5) and 35 values of training set balance. All
models were trained using the MLJ.jl
package
[@Blaom2020MljJul; @Blaom2020FleMod] in Julia 1.7 [@Bezanson2017JulFre]. All
machines use the default parameterization; this is an obvious deviation from
best practices, as the hyperparameters of any machine require tuning before
its application to a real dataset. As we use 612500 such datasets, this would
require over 2 million unique instances of hyperparameter tuning, which
is prohibitive from a computing time point of view. An important thing to keep
in mind is that the problem we simulate has been designed to be simple to solve:
we expect all machines with sensible default parameters to fare well --- the
results presented in the later sections show that this assumption is warranted,
and we further checked that the models do not overfit by ensuring that there is
never more than 5% of difference between the accuracy on the training and
testing sets. All machines return a quantitative prediction, usually (but not
necessarily) in $[0,1]$ (akin to a pseudo-probability of the interaction existing).
In order to pick the best confusion matrix for a given trained machine, we
performed a thresholding approach using 500 steps on predictions from the
testing set, and picking the threshold that maximized Youden's informedness.
During the thresholding step, we measured the area under the receiver operating
characteristic (ROC-AUC) and precision-recall (PR-AUC) curves, as measures of
overall performance over the range of returned values. We report the ROC-AUC and
PR-AUC, as well as a suite of other measures as introduced in the next section,
for the best threshold. The ensemble model was generated by summing the
predictions of all component models on the testing set (ranged in $[0,1]$). The
code and results of these simulations are available at 10.17605/OSF.IO/JKEWD.
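A minimal sketch of the thresholding step described above (synthetic scores; not the code used in the analyses):

```julia
using Random
Random.seed!(2)

informedness(tp, tn, fp, fn) = tp / (tp + fn) + tn / (tn + fp) - 1

# Scan a fixed number of candidate thresholds and keep the one that maximizes
# Youden's informedness on the testing set
function best_threshold(scores, labels; steps = 500)
    best_t, best_J = minimum(scores), -Inf
    for t in range(extrema(scores)...; length = steps)
        pred = scores .>= t
        tp = sum(pred .& labels); fp = sum(pred .& .!labels)
        fn = sum(.!pred .& labels); tn = sum(.!pred .& .!labels)
        J = informedness(tp, tn, fp, fn)
        J > best_J && ((best_t, best_J) = (t, J))
    end
    return best_t, best_J
end

labels = rand(2_000) .< 0.15
scores = ifelse.(labels, 0.65, 0.35) .+ 0.25 .* randn(2_000)
best_threshold(scores, labels)
```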
After the simulations were completed, we removed all runs (i.e. triples of
model,
In @fig:biasco, we present the response of two ranking measures (PR-AUC and ROC-AUC) and two threshold measures (informedness and MCC) to a grid of 35 values of training set balance and 35 values of connectance, for the four component models as well as the ensemble. ROC-AUC is always high, and does not vary with training set balance. On the other hand, PR-AUC shows a very strong response, increasing with training set balance. It is notable here that two classifiers that seemed to be performing well (Decision Tree and Random Forest) based on their MCC are not able to reach a high PR-AUC, even at higher connectances. All models reached a higher performance on more connected networks, and when using more balanced training sets. In all cases, informedness was extremely high, which is an expected consequence of the fact that this is the value we optimized to determine the cutoff. MCC increased with training set balance, although this increase became less steep with increasing connectance. Three of the models (k-NN, Decision Tree, and Random Forest) only increased their PR-AUC sharply when the training set was heavily imbalanced towards more interactions. Interestingly, the ensemble almost always outclassed its component models. For larger connectances (networks that are less difficult to predict, as they are more balanced), MCC and informedness started decreasing when the training set bias got too close to one, suggesting that a training set balance of 0.5 may often be appropriate if these measures are the ones to optimize.
Based on the results presented in @fig:biasco, it seems that informedness and ROC-AUC are not necessarily able to discriminate between good and bad classifiers (although this result may be an artifact for informedness, as it is the value that was optimized when thresholding). On the other hand, MCC and PR-AUC show a strong response to training set balance, and may therefore be more useful for model comparison.
The previous results revealed that the measures of classification performance respond both to the bias in the training set and to the connectance of the network; from a practical point of view, assembling a training set requires one to withhold positive information, which, in ecological networks, is very scarce (and typically more valuable than negatives, on which there is a doubt). For this reason, across all values of connectance, we measured the training set balance that maximized a series of performance measures. When this value is high, the training set needs to skew more positive in order to get a performant model; when this value is about 0.5, the training set needs to be artificially balanced to optimize the model performance. These results are presented in @fig:optimbias.
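A sketch of this search, where `train_and_score` is a hypothetical stand-in for "assemble a training set at this balance, train the model, and measure the chosen performance metric on a testing set kept at the empirical class balance":

```julia
# Treat the training-set balance as a hyper-parameter: keep the value that
# maximizes the chosen performance measure
function optimal_balance(train_and_score; balances = range(0.05, 0.95; length = 19))
    scores = [train_and_score(b) for b in balances]
    return balances[argmax(scores)]
end

# Toy usage with a made-up response that peaks when the training set leans positive
optimal_balance(b -> -(b - 0.75)^2)   # returns 0.75
```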
The more "optimistic" measures (ROC-AUC and informedness) required a biasing of the dataset from about 0.4 to 0.75 to be maximized, with the amount of bias required decreasing only slightly with the connectance of the original network. MCC and PR-AUC required values of training set balance from 0.75 to almost 1 to be optimized, which is in line with the results of the previous section, i.e. they are more stringent tests of model performance. These results suggest that learning from a dataset with very low connectance can be a different task than for more connected networks: it becomes increasingly important to capture the mechanisms that make an interaction exist, and therefore having a slightly more biased training dataset might be beneficial. As connectance increases, the need for biased training sets is less prominent, as learning the rules for which interactions do not exist starts gaining importance.
When trained at their optimal training set balance, connectance still had a significant impact on the performance of some machines (@fig:optimvalue). Notably, the Decision Tree and k-NN, as well as the Random Forest to a lesser extent, had low values of PR-AUC. In all cases, the Boosted Regression Tree reached very good predictions (especially for connectances larger than 0.1), and the ensemble almost always scored perfectly. This suggests that all the models are biased in different ways, and that the averaging in the ensemble is able to correct these biases. We do not expect this last result to have any generality, and provide a discussion of a recent example in which the ensemble was performing worse than its component models.
In this last section, we generate a network using the same model as before, with
The trained models were then thresholded (again by optimizing informedness), and
their predictions transformed back into networks for analysis; specifically, we
measured the connectance, nestedness [$\eta$; @Bastolla2009ArcMut], modularity
[$Q$; @Barber2007ModCom], asymmetry [$A$; @Delmas2018AnaEco], and Jaccard
network dissimilarity [@Canard2014EmpEva]. This process was repeated 250 times,
and the results are presented in @tbl:comparison. The k-NN model is an
interesting instance here: it produces the network that looks the most like the
original dataset, despite having the lowest PR-AUC, suggesting it hits high
recall at the cost of low precision. The ensemble was able to reach a very high
PR-AUC (and a very high ROC-AUC), which translated into more accurate
reconstructions of the structure of the network (with the exception of
modularity, which is underestimated by
Model | MCC | Inf. | ROC-AUC | PR-AUC | Conn. | $\eta$ | $Q$ | $A$ | Jaccard |
---|---|---|---|---|---|---|---|---|---|
Decision tree | 0.59 | 0.94 | 0.97 | 0.04 | 0.17 | 0.64 | 0.37 | 0.42 | 0.1 |
BRT | 0.46 | 0.91 | 0.97 | 0.36 | 0.2 | 0.78 | 0.29 | 0.41 | 0.19 |
Random Forest | 0.72 | 0.98 | 0.99 | 0.1 | 0.16 | 0.61 | 0.38 | 0.42 | 0.06 |
k-NN | 0.71 | 0.98 | 0.99 | 0.02 | 0.16 | 0.61 | 0.39 | 0.42 | 0.06 |
Ensemble | 0.74 | 0.98 | 1.0 | 0.79 | 0.16 | 0.61 | 0.38 | 0.42 | 0.06 |
Data | | | | | 0.16 | 0.56 | 0.41 | 0.42 | 0.0 |
: Values of four performance metrics, and five network structure metrics, for
500 independent predictions similar to the ones presented in @fig:ecovalid. The
values in bold indicate the best value for each column (including ties).
Because the values have been rounded, values of 1.0 for the ROC-AUC column
indicate an average
We establish that due to the low prevalence of interactions, even poor classifiers applied to food web data will reach a high accuracy; this is because the measure is dominated by the accidentally correct predictions of negatives. On simulated confusion matrices with ranges of imbalance that are credible for ecological networks, MCC had the most desirable behavior, and informedness was a linear measure of classifier skill. By performing simulations with four models and an ensemble, we show that informedness and ROC-AUC are consistently high on network data, whereas MCC and PR-AUC are more accurate measures of the effective performance of the classifier. Finally, by measuring the structure of predicted networks, we highlight an interesting paradox: the models with the best performance measures are not necessarily the models with the closest reconstructed network structure. We discuss these results in the context of establishing guidelines for the prediction of ecological interactions.
It is noteworthy that the ensemble model was systematically better than the component models. We do not expect that ensembles will always be better than single models. Networks with different structures than the one we simulated here may respond in different ways, especially if the rules are fuzzier than the simple rule we used here. In a recent multi-model comparison involving supervised and unsupervised learning, @Becker2022OptPre found that the ensemble was not the best model, and was specifically under-performing compared to models using biological traits. This may be because the dataset of @Becker2022OptPre was known to be under-sampled, and so the network alone contained less information than the combination of the network and species traits. There is no general conclusion to draw from either these results or ours, besides reinforcing the need to be pragmatic about which models should be included in the ensemble, and whether to use an ensemble at all. In a sense, the surprising performance of the ensemble model should form the basis of the first broad recommendation: the optimal training set balance, and its interaction with connectance and the specific binary classifier used, is a hyperparameter that should be assessed following the approach outlined in this manuscript. The distribution of results in @fig:optimbias and @fig:optimvalue shows that there are variations around the trend, and multiple models should probably each be trained on their own "optimal" training/testing sets, as opposed to the same ones.
The results presented here highlight an interesting paradox: although the k-NN model was ultimately able to get a correct estimate of network structure (see @tbl:comparison and @fig:ecovalid), it remains a poor classifier, as evidenced by its low PR-AUC. This suggests that the goals of predicting interactions and predicting networks may not always be solvable in the same way -- of course a perfect classifier of interactions would make a perfect network prediction; indeed, the best-scoring predictor of interactions (the ensemble model) had the best prediction of network structure. The tasks of predicting network structure and of predicting interactions within networks are essentially two different ones. For some applications (e.g. comparison of network structure across gradients), one may care more about a robust estimate of the structure, at the cost of putting some interactions in the wrong place. For other applications (e.g. identifying pairs of interacting species), one may conversely care more about getting as many pairs as possible right, even though the mistakes accumulate in the form of a slightly worse estimate of network structure. How these two approaches can be reconciled is something to evaluate on a case-by-case basis, especially since there is no guarantee that an ensemble model will always be the most precise one. Despite this apparent tension at the heart of the predictive exercise, we can use the results presented here to suggest a number of guidelines.
First, because we have more trust in reported interactions than in reported
absences of interactions (which are overwhelmingly pseudo-absences), we can
draw on previous literature to recommend informedness as a measure to decide on
a threshold for binary classification [@Chicco2021MatCor]; this being said,
because informedness is insensitive to bias (although it is a linear measure of
skill), the overall model performance is better evaluated through the use of MCC
([@fig:optimbias; @fig:optimvalue]). Because
Second, accuracy alone should not be the main measure of model performance; it
should instead be compared to an expectation of how well the model would behave
given the class balance in the set on which predictions are made. This is
because, as derived
earlier, the expected accuracy for a no-skill, no-bias classifier is driven by
prevalence alone (it is the sum of the squared class proportions), and therefore
approaches 1 as prevalence decreases.
Third, because the PR-AUC responds more to network connectance (@fig:optimvalue) and training set imbalance (@fig:optimbias) than the ROC-AUC does, it should be used as a measure of model performance over the ROC-AUC. This is not to say that the ROC-AUC should be discarded (in fact, a low ROC-AUC is undoubtedly a sign of an issue with the model), but its interpretation should be guided by the PR-AUC value. Specifically, a high ROC-AUC is not informative on its own, as it can be associated with a low PR-AUC (see e.g. Random Forest in @tbl:comparison). This again echoes recommendations from other fields [@Saito2015PrePlo; @Jeni2013FacImb]. We therefore expect to see high ROC-AUC values, and then to pick the model that maximizes the PR-AUC value. Taken together with the previous two guidelines, we strongly encourage researchers to (i) ensure that accuracy and ROC-AUC are high (in the case of accuracy, higher than expected under the no-skill, no-bias situation), and (ii) discuss the performance of the model in terms of the most discriminant measures, i.e. PR-AUC and MCC.
Finally, network connectance (i.e. the empirical class imbalance) should inform the composition of the training and testing sets, because it is an ecologically relevant value. In the approach outlined here, we treat the class imbalance of the training set as a hyper-parameter, but test the model on a set that has the same class imbalance as the actual dataset. This is an important distinction, as it ensures that the prediction environment matches the testing environment (we cannot manipulate the connectance of the empirical dataset on which the predictions will be made), and so the values measured on the testing set (or the validation set, if the data volume allows one to exist) can be directly compared to the values for the actual prediction. A striking result from @fig:optimbias is that informedness was almost always maximal at a 50/50 balance (regardless of connectance), whereas MCC required more positives to be maximized as connectance increased, matching the idea that it is a more stringent measure of performance. This has an important consequence for ecological networks, for which the pool of positive cases (interactions) to draw from is typically small: the most parsimonious measure (i.e. the one requiring the fewest interactions to be discarded in order to train the model) will give the best validation potential, and in this light is very likely informedness [maximizing informedness is, in fact, the generally accepted default for imbalanced classification regardless of the problem domain; @Schisterman2005OptCut]. This last result further strengthens the assumption that the amount of bias is a hyper-parameter that must be fine-tuned, as using the wrong bias can lead to models with lower performance; for this reason, it makes sense not to train all models on the same training/testing set, but rather to optimize the set composition for each of them.
One key element for real-life data that can make the prediction exercise more tractable is that some interactions can safely be assumed to be impossible; indeed, many networks can be reasonably well described using a stochastic block model [e.g. @Xie2017ComCom]. In ecological networks, this can be due to spatial constraints [@Valdovinos2019MutNet], or to the long-standing knowledge that some links are "forbidden" due to traits [@Olesen2011MisFor] or abundances [@Canard2014EmpEva]. Such matching rules [@Strona2017ForPer; @Olito2015SpeTra] can be incorporated in the model either by adding compatibility traits, or by only training the model on pairs of species that are not likely to be forbidden links. Knowledge of true negative interactions could be propagated in training/testing sets that have true negatives, and in this situation, it may be possible to use the more usual 70/30 split for training/testing folds, as the need to protect against potential imbalance is lowered. Besides forbidden links, a real-life case that may arise is multi-interaction or multi-layer networks [@Pilosof2017MulNat]. These can be studied using the same general approach outlined here, either by assuming that pairs of species can interact in more than one way (wherein one would train a model for each type of interaction, based on the relevant predictors), or by assuming that pairs of species can only have one type of interaction (wherein this becomes a multi-label classification problem).
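As an illustration of the second option (the trait names and matching rule below are hypothetical), one can simply restrict the candidate pairs before training:

```julia
# Keep only the species pairs that are not forbidden by a simple trait-matching rule;
# pairs failing the rule are treated as true negatives and never shown to the model.
S = 40
proboscis_length = rand(S) .* 30.0   # hypothetical pollinator trait (mm)
corolla_depth = rand(S) .* 30.0      # hypothetical plant trait (mm)

feasible = [(i, j) for i in 1:S for j in 1:S if proboscis_length[i] >= corolla_depth[j]]
length(feasible)   # only these pairs enter the training/testing sets
```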
Acknowledgements: We acknowledge that this study was conducted on land within the traditional unceded territory of the Saint Lawrence Iroquoian, Anishinabewaki, Mohawk, Huron-Wendat, and Omàmiwininiwak nations. We thank Colin J. Carlson, Michael D. Catchen, Giulio Valentino Dalla Riva, and Tanya Strydom for inputs on earlier versions of this manuscript. This research was enabled in part by support provided by Calcul Québec (www.calculquebec.ca) through the Narval general purpose cluster. TP is supported by the Fondation Courtois, a NSERC Discovery Grant and Discovery Acceleration Supplement, by funding to the Viral Emergence Research Initiative (VERENA) consortium including NSF BII 2021909, and by a grant from the Institut de Valorisation des Données (IVADO).