Introduction
Finding a species in a certain location is like finding an encrypted message that traveled through time. It carries the species’ evolutionary history, migration patterns, as well as any direct and indirect effects generated by other species (some of which we may not even know exist). Ecologists have been trying to decode this message with progressively more powerful tools, from their field notes to highly complex computational algorithms. However, to succeed in this challenge it is important to have the right clues in hand. There are many ways we can be misled by data - or the lack of it: taxonomic errors (e.g., due to updates in the taxonomy of a species), geographic inaccuracy (e.g., approximate coordinates or lack of documentation about their accuracy), or sampling biases (e.g., data clustered near roads or research centers) (Ladle and Hortal 2013; Hortal et al. 2015; Poisot et al. 2021). One way to identify - and potentially fix - these errors is to combine many different pieces of information about the occurrence of a species, so agreements and mismatches can emerge. Although previous studies have combined different types of occurrence data to measure the accuracy of datasets (Hurlbert and Jetz 2007; Hurlbert and White 2005; Ficetola et al. 2014), none have used different types of information so far (i.e., ecological characteristics other than geographical distribution). Here we suggest jointly analysing species occurrence (range maps and point occurrences) and ecological interactions to identify mismatches between datasets and areas of data deficit.
Interactions form complex networks that shape ecological structures and maintain the essential functions of ecosystems, such as seed dispersal, pollination, and biological control (Albrecht 2018; Fricke et al. 2022), which ultimately affect the composition, richness, and successional patterns of communities across biomes. Yet the connection between occurrence and interaction data is frequently debated in ecology (Blanchet, Cazelles, and Gravel 2020; Wisz et al. 2013). For instance, macroecological models are often used with point or range occurrence data to investigate the dynamics of a species with its environment. However, these models do not account for ecological interactions, although it has been demonstrated that interactions can strongly affect species distributions (Abrego et al. 2021; Afkhami, McIntyre, and Strauss 2014; Araújo, Marcondes-Machado, and Costa 2014; Godsoe et al. 2017; Godsoe and Harmon 2012; Gotelli, Graves, and Rahbek 2010; Wisz et al. 2013). Some researchers argue that occurrence data can also capture real-time interactions (see Roy et al. 2016; Ryan et al. 2018) and that it would therefore not be necessary to include ecological interaction dynamics in macroecological models. On the other hand, many mechanistic simulation models in ecology have considered the effect of competition and facilitation in range shifts. For example, Gotelli et al. (2010) demonstrate how conspecific attraction might be the main factor driving the distribution of migratory birds; Afkhami et al. (2014) explore how mutualistic fungal endophytes are responsible for expanding the range of a native grass; many other examples are discussed in Wisz et al. (2013). Although interactions across trophic levels have been shown to determine species ranges (Wisz et al. 2013), the use of these interactions in mechanistic simulation models in macroecology remains insufficient (as discussed in Cabral, Valente, and Hartig 2017).
A significant challenge in this debate is the quality and quantity of species distribution and ecological data (Boakes et al. 2010; Ronquillo et al. 2020; Meyer, Weigelt, and Kreft 2016), a gap that can lead to erroneous conclusions in macroecological research (Hortal et al. 2008). Amongst the geographical data available are the range maps provided by the International Union for the Conservation of Nature (IUCN). Such maps consist of simplified polygons, often created as alpha or convex hulls around known species locations, refined by expert knowledge about the species (IUCN Red List Technical Working Group 2019). These maps can be used in macroecological inferences in the absence of more precise information (Fourcade 2016; Alhajeri and Fourcade 2019), but it has been recommended that they be used with caution since they tend to underestimate the distribution of species that are not well known (Herkt, Skidmore, and Fahr 2017) (especially at fine-scale resolutions; Hurlbert and Jetz 2007; Hurlbert and White 2005), do not represent spatial variation in species occurrence and abundance (Dallas, Pironon, and Santini 2020), and can include inadequate areas within the estimated range. Another source of species distribution information is the Global Biodiversity Information Facility (GBIF), an online repository of georeferenced observational records that come from various sources, including community science programs, museum collections, and long-term monitoring schemes. A major source of bias in these datasets is irregular sampling effort, with more occurrences originating from attractive and accessible areas and from observations of charismatic species (Alhajeri and Fourcade 2019). As for ecological data, a complete assessment is difficult and is aggravated by biased sampling methods, data aggregation (Poisot et al. 2020; Hortal et al. 2015), and the fact that interactions are very often events that occur in a narrow window of time.
Nevertheless, we have witnessed an increase in the availability of biodiversity data in the last decades, including those collected through community science projects (Callaghan et al. 2019; Pocock et al. 2015) and dedicated databases, such as Mangal (Poisot et al. 2016). This provides an opportunity to merge species distribution and ecological interaction data to improve our predictions of where a species may be found across large spatial scales.
It has been demonstrated that the agreement between range maps and point data varies geographically (Hurlbert and Jetz 2007; Hurlbert and White 2005; Ficetola et al. 2014). Adding ecological interaction data to this comparison might help to elucidate where these (dis)agreements are more likely to be true and which dataset better represents the actual distribution of a species. In this context, we elaborate a method that allows us to detect areas of potential misestimation of species’ distribution data (more precisely, range maps) based on interaction data. This method is based on the assumption that organisms cannot persist in an area unless they are directly or indirectly connected to a primary producer within their associated food web (Power 1992). Thus, given that herbivores are the main connection between plant resources (directly limited by environmental conditions) and predators (Dobson 2009; Scott et al. 2018), the range of a predator (omnivore or carnivore) depends on the overlapping ranges of its herbivore prey. If a section of a predator’s range does not overlap with the range of at least one of its prey, the predator becomes disconnected from primary producers there, and we would therefore not expect it to occur in this area.
This mismatch can be the result of different mechanisms, like the misestimation of both the predator’s and the prey’s ranges (Ladle and Hortal 2013; Rondinini et al. 2006), taxonomic errors (Isaac, Mallet, and Mace 2004; Ladle and Hortal 2013), or the lack of information about trophic links (i.e., the lack of connection between the ranges of a predator and a primary producer may be due to a third species that we do not know is connected to both). In this proof of concept, we investigate the disagreements between available data for species that compose a well-known food web in the African continent, discuss the mechanisms that can lead to these disagreements, and reinforce the importance of open, geographically explicit interaction data.
Methods
We identified areas of data deficits within the ranges of predators based on a simple rule: any part of a predator’s range that did not intersect with the range of at least one prey herbivore species, which in turn is directly connected to a primary producer (plants), was considered data deficient. To do that, we used a Serengeti food web dataset (Baskerville et al. 2011) (which comprises carnivores, herbivores, and plants from Tanzania) and its species ranges from IUCN. Then, we calculated the difference in range sizes between the original IUCN ranges of predators and those without the areas where they would be disconnected from their food webs, based on species interaction data. Finally, we added the GBIF occurrence points for the Serengeti species to investigate whether the results would be different if we used another source of distribution data.
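The rule above can be sketched in a few lines. This is a minimal illustration rather than the scripts used for the analysis (which were written in Julia): it assumes that ranges are represented as sets of pixel indices, that every herbivore is connected to a plant, and the names `trim_predator_range`, `predator`, `prey_a`, and `prey_b` are hypothetical toy values.

```python
# Toy sketch: ranges are sets of (row, col) pixel indices (an assumption).
def trim_predator_range(predator_range, prey_ranges):
    """Keep only the predator pixels shared with at least one prey range."""
    if not prey_ranges:
        return set()
    prey_union = set().union(*prey_ranges)
    return predator_range & prey_union

# Hypothetical ranges, not real species data.
predator = {(0, 0), (0, 1), (1, 0), (1, 1)}
prey_a = {(0, 0), (2, 2)}
prey_b = {(1, 1)}

adjusted = trim_predator_range(predator, [prey_a, prey_b])
# Relative difference in range size between original and adjusted ranges.
range_loss = 1 - len(adjusted) / len(predator)
```

Pixels removed by this step are the ones treated as data deficient; `range_loss` corresponds to the difference in range size described above.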
Data
We investigated the mismatch between savannah species ranges and interactions in Africa (fig. 1). These ecosystems host a range of different species, including the well-characterized predator-prey dynamics between iconic predators (e.g., lions, hyenas, and leopards) and large herbivores (e.g., antelopes, wildebeests, and zebras), as well as a range of herbivorous and carnivorous small mammals. The Serengeti ecosystem has been extensively studied and its food web is one of the most complete we have to date, including primary producers identified to the species level. Here we focus on six groups of herbivores and carnivores from the Serengeti Food Web Data Set (Baskerville et al. 2011). These species exhibit direct antagonistic (predator-prey) interactions with one another and are commonly found across savannah ecosystems on the African continent (McNaughton 1992). Plants in the network were included indirectly in our analyses as we do not expect the primary producers to significantly influence the range of herbivores for several reasons. Firstly, many savannah plants are functionally similar (i.e., grasses, trees and shrubs) and cooccur across the same habitats (Baskerville et al. 2011). Secondly, herbivores in the network are broadly generalists feeding on a wide range of different plants across habitats. Indeed, out of 129 plants in our dataset, herbivores (n = 23) had a mean out degree (mean number of preys) of around 22 (std = 17.5). There is also an absence of global range maps for many plant species (Daru 2020), which prevents their direct inclusion in our analysis. Therefore, we assume that plants consumed by herbivores are present across their ranges, and as such the ranges of herbivores are not expected to be significantly constrained by the availability of food plants.
From the wider ecological network presented in Baskerville et al. (2011), we sampled interaction data for herbivores and carnivores. This subnetwork contained 32 taxa (23 herbivores and 9 carnivores) and 84 interactions and had a connectance of 0.08. Although self-loops are informative, we removed these interactions to allow for the original IUCN ranges of predators with cannibalistic interactions to be adjusted. We treated this overall network as a metaweb since it should contain all potential species interactions between mammalian taxa occurring across savannah ecosystems such as the Serengeti.
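As a quick check of the reported connectance, the sketch below assumes the common definition of directed connectance as the number of links divided by the squared number of species (the text does not state which definition was used). The helper names and the toy edge list are hypothetical.

```python
def connectance(n_links, n_species):
    """Directed connectance, assuming the L / S^2 definition."""
    return n_links / n_species ** 2

def drop_self_loops(edges):
    """Remove cannibalistic links, i.e. (species, species) pairs."""
    return [(pred, prey) for pred, prey in edges if pred != prey]

# Values reported in the text: 84 interactions among 32 taxa.
value = connectance(84, 32)

# Hypothetical edge list, for illustration only.
edges = [("predator_A", "herbivore_X"), ("predator_A", "predator_A")]
cleaned = drop_self_loops(edges)
```

With the reported counts, `value` rounds to 0.08, in line with the figure given above.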
We compiled IUCN range maps for the 32 species included in the metaweb from the Spatial Data Download portal (www.iucnredlist.org/resources/spatial-data-download), which we rasterized at a 0.5-degree resolution (~50 km at the equator). We restricted the rasters to the spatial extent between latitudes 35°S and 40°N and longitudes 20°W and 55°E. We then combined interaction data from the metaweb and cooccurrence data generated from species ranges to create networks for each raster pixel. This generated a total of 11,308 pixel-level networks. These networks describe potential predation, not actual interactions: the former is derived from the metaweb, and the latter is contingent on the presence of herbivores.
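The construction of pixel-level networks can be illustrated as follows. This is a simplified sketch, assuming the metaweb is a set of (predator, prey) pairs and each range is a set of pixel indices; the species labels and the helpers `species_at` and `local_network` are hypothetical.

```python
def species_at(pixel, ranges):
    """Species whose rasterized range contains the pixel."""
    return {sp for sp, pixels in ranges.items() if pixel in pixels}

def local_network(metaweb, present):
    """Metaweb links whose predator and prey both occur in the pixel."""
    return {(pred, prey) for pred, prey in metaweb
            if pred in present and prey in present}

# Hypothetical metaweb and ranges, not the Serengeti data.
metaweb = {
    ("predator_A", "herbivore_X"),
    ("predator_A", "herbivore_Y"),
    ("predator_B", "herbivore_Y"),
}
ranges = {
    "predator_A": {(0, 0), (0, 1)},
    "predator_B": {(0, 1)},
    "herbivore_X": {(0, 0)},
    "herbivore_Y": {(1, 1)},
}

# One network per pixel occupied by at least one species.
pixels = set().union(*ranges.values())
networks = {px: local_network(metaweb, species_at(px, ranges)) for px in pixels}
```

In this toy case, pixel `(0, 1)` holds two predators but no prey, so its local network has no links; such pixels are the ones where predators are disconnected from primary producers.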
Range overlap measurement
We calculated the geographical overlap, i.e., the extent to which interacting predator and prey species cooccurred across their ranges, as a/(a + c), where a is the number of pixels where predator and prey cooccur and c is the number of pixels where only the focal species occurs. This index of geographical overlap can be calculated with either the prey or the predator as the focal species. Values vary between 0 and 1, with values closer to 1 indicating a large overlap in the ranges of the two species and values closer to 0 indicating low cooccurrence across their ranges. For each predator species, we calculated its generality to understand whether the level of trophic specialization (i.e., number of prey items per predator) affects the extent to which the range of the species comprises areas of data deficit. One would assume that predators with a greater number of prey taxa (i.e., a higher generality) are less likely to have large areas of data mismatch within their range, as it is more likely that at least one prey species is present across most of their range.
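The overlap index a/(a + c) is asymmetric, which a small worked example makes explicit. The sketch assumes ranges stored as sets of pixel indices; the function name `overlap` and the toy ranges are hypothetical.

```python
def overlap(focal_range, other_range):
    """Geographical overlap a / (a + c) from the focal species' viewpoint."""
    a = len(focal_range & other_range)   # pixels where both species occur
    c = len(focal_range - other_range)   # pixels with the focal species only
    if a + c == 0:
        raise ValueError("focal species has an empty range")
    return a / (a + c)

# Hypothetical ranges: the predator occupies 4 pixels, the prey 2.
predator = {(0, 0), (0, 1), (1, 0), (1, 1)}
prey = {(1, 1), (2, 2)}

predator_focal = overlap(predator, prey)  # share of predator range with prey
prey_focal = overlap(prey, predator)      # share of prey range with predator
```

Here one shared pixel gives 1/4 with the predator as focal species and 1/2 with the prey as focal species, since a + c is simply the range size of the focal species.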
Validation
For each species in the dataset, we collated point observation data from GBIF (www.gbif.org). We used the GBIF download API to retrieve all species occurrences on November 22nd 2022 (GBIF.org 2022). We restricted our query to records with spatial coordinates that fell inside the spatial extent of our rasters. A few observations were located in the ocean near latitude 0° and longitude 0°. We assumed these were errors and removed all observations falling in the extent between latitudes 2°S and 2°N and longitudes 2°W and 2°E to keep only mainland sites. We did not use any additional geographical filters, in order to retrieve as much data as possible. Being mindful of the recent and remarkable anthropogenic impact on African megafauna, we decided to restrict the occurrences used in the validation step to those recorded after the year 2000 (and, therefore, to records with date information). This decision was made after evaluating the overall temporal distribution of the GBIF records.
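The filtering steps described above can be sketched as a single predicate. This is an illustration under stated assumptions: records are plain dictionaries with hypothetical `lat`, `lon`, and `year` keys (not the GBIF API schema), and "after the year 2000" is read as strictly later than 2000.

```python
# Raster extent from the text: 35°S to 40°N and 20°W to 55°E.
LAT_MIN, LAT_MAX = -35.0, 40.0
LON_MIN, LON_MAX = -20.0, 55.0

def keep_record(record):
    """Apply the coordinate, extent, 0°/0° box, and date filters."""
    lat, lon, year = record.get("lat"), record.get("lon"), record.get("year")
    if lat is None or lon is None or year is None:
        return False  # missing coordinates or date
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        return False  # outside the raster extent
    if -2.0 <= lat <= 2.0 and -2.0 <= lon <= 2.0:
        return False  # box around latitude 0°, longitude 0°
    return year > 2000  # assumption: strictly after 2000

# Hypothetical records, for illustration only.
records = [
    {"lat": -2.3, "lon": 34.8, "year": 2015},  # kept
    {"lat": 0.0, "lon": 0.0, "year": 2018},    # removed: near 0°, 0°
    {"lat": -2.3, "lon": 34.8, "year": 1987},  # removed: too old
    {"lat": -2.3, "lon": 34.8, "year": None},  # removed: undated
    {"lat": 48.9, "lon": 2.3, "year": 2019},   # removed: outside extent
]
filtered = [r for r in records if keep_record(r)]
```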
We then converted the occurrence data into raster format by determining which pixels had at least one GBIF occurrence. This allowed us to remove the effect of repeated sampling in some locations. These data were used to validate the areas identified as being ecologically unrealistic based on species interactions and occurrence data (see beginning of Methods section). To do so, we calculated the proportion of GBIF presence pixels occurring within both the original IUCN species range and the adjusted one (i.e., the one without unrealistic food webs). We then compared these proportions for all predators to verify if the areas of data mismatch contained locations with GBIF observations, hence likely true habitats.
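The validation step can be sketched as follows, assuming a regular 0.5-degree grid indexed by the floor of each coordinate divided by the resolution (the actual raster indexing may differ). The functions `to_pixel` and `proportion_within`, the points, and the ranges are hypothetical.

```python
import math

def to_pixel(lat, lon, resolution=0.5):
    """Map a coordinate to a grid-cell index on a regular grid."""
    return (math.floor(lat / resolution), math.floor(lon / resolution))

def proportion_within(gbif_pixels, range_pixels):
    """Share of GBIF presence pixels that fall inside a range."""
    return len(gbif_pixels & range_pixels) / len(gbif_pixels)

# Hypothetical occurrence points; the first two share a pixel, so the
# repeated sampling at that location counts only once.
points = [(-1.2, 34.6), (-1.3, 34.7), (-2.6, 35.1), (0.3, 36.2)]
gbif_pixels = {to_pixel(lat, lon) for lat, lon in points}

# Hypothetical original and adjusted ranges, as sets of pixel indices.
original_range = {(-3, 69), (-6, 70), (5, 5)}
adjusted_range = {(-3, 69)}

p_original = proportion_within(gbif_pixels, original_range)
p_adjusted = proportion_within(gbif_pixels, adjusted_range)
```

Because the adjusted range is a subset of the original one, `p_adjusted` can never exceed `p_original`; a gap between the two means that pixels with GBIF observations were removed.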
Software
We performed all analyses using Julia v1.7.2 (Bezanson et al. 2017). We used the packages SimpleSDMLayers.jl (Dansereau and Poisot 2021) to manipulate the raster layers, EcologicalNetworks.jl (Poisot et al. 2019) to construct and manipulate the interaction networks, and GBIF.jl (Dansereau and Poisot 2021) to reconcile species names with the GBIF backbone taxonomy (GBIF Secretariat 2021). We also used GDAL (GDAL/OGR contributors 2021) to rasterize the IUCN range maps (initially available as shapefiles from the Spatial Data Download portal). All the scripts required to reproduce the analyses are available at https://doi.org/10.5281/zenodo.6842861.
Results
Mammal species found in the Serengeti food web are widespread in Africa, especially in grasslands and savannahs (panel (a) of fig. 1). In our analysis, most local networks (69.07%) built using the original IUCN range maps had at least one mammal species with a path to a primary producer (panel (b) of fig. 1), which reinforces the idea that the interactions we observe in the Serengeti food web are representative of the interactions for these mammals across the African continent. On average, local food webs had almost half of their mammal species disconnected from basal species (mean = 46.2%, median = 33.3%). In addition, 16.6% of the networks only had disconnected mammals, and the number of mammal species varied from 1 to 28, with a mean of 6.7. As expected, the proportion of carnivores with a path to a primary producer was conditional on the total number of mammal species in each local network (third panel of fig. 1).
Specialized predators have higher rates of range mismatch
If we consider that areas where predator and prey ranges do not overlap cannot be used in ecological analyses, we lose more range area for predators with fewer prey (fig. 2). For instance, both Leptailurus serval and Canis mesomelas have only one prey in the Serengeti food web (tbl. 1), each of them with a very small range compared to that of its predator. This discrepancy between range sizes leads to substantial range loss. On the other hand, predators of the genus Panthera are some of the most connected species, and they also lose the smallest proportion of their ranges. This mismatch between predators and prey can also be a result of taxonomic disagreement between the geographical and ecological data. Although Canis aureus has the same number of prey as Caracal caracal, none of the prey taxa of the former occurs inside its original range (tbl. 1), which results in complete range loss.
There was high variation in the overlap of predator and prey ranges (fig. 3). The high density of points on the left-hand side of fig. 3 indicates that most prey have small ranges in comparison to those of the set of carnivores in the networks, resulting in either low overlap between both ranges (bottom) or high overlap because much of the prey’s range is within the predator’s range (top). The top-right side of the plot encompasses situations where the ranges of both predator and prey are similar and overlapping, while the bottom-right part of the plot represents a situation where the range of the predator is smaller than that of its prey and much of it occurs within the prey’s range. For example, Panthera pardus had many prey occurring inside its range, with highly variable levels of overlap (tbl. 1). In general, species exhibited more consistent values of prey-predator overlap than of predator-prey overlap, as indicated by the spread of points along the x-axis and the more restricted variation on the y-axis (fig. 3). There was also no relationship between the two metrics, either overall or for any individual predator species.
Species | Number of preys | Number of predators | Total range size | Proportion of range occupied by preys | Proportion of range occupied by predators |
---|---|---|---|---|---|
Large carnivores | |||||
Acinonyx jubatus | 8 | 1 | 9250 | 0.437 | 0.618 |
Crocuta crocuta | 12 | 1 | 4822 | 0.844 | 0.253 |
Lycaon pictus | 14 | 0 | 427 | 0.918 | - |
Panthera leo | 18 | 0 | 1274 | 0.935 | - |
Panthera pardus | 22 | 0 | 7563 | 0.766 | - |
Small carnivores | |||||
Canis aureus | 4 | 1 | 816 | 0.000 | 0.782 |
Canis mesomelas | 1 | 1 | 2201 | 0.190 | 0.994 |
Caracal caracal | 4 | 0 | 5239 | 0.833 | - |
Leptailurus serval | 1 | 1 | 4319 | 0.011 | 0.978 |
Small herbivores | |||||
Damaliscus lunatus | 0 | 4 | 626 | - | 1 |
Hippopotamus amphibius | 0 | 0 | 419 | - | - |
Kobus ellipsiprymnus | 0 | 4 | 2961 | - | 1 |
Ourebia ourebi | 0 | 5 | 2484 | - | 1 |
Pedetes capensis | 0 | 2 | 1318 | - | 1 |
Phacochoerus africanus | 0 | 5 | 3331 | - | 1 |
Redunca redunca | 0 | 5 | 1935 | - | 1 |
Rhabdomys pumilio | 0 | 5 | 53 | - | 1 |
Tragelaphus oryx | 0 | 2 | 2316 | - | 0.990 |
Tragelaphus scriptus | 0 | 3 | 3999 | - | 0.985 |
Large grazers | |||||
Aepyceros melampus | 0 | 5 | 1167 | - | 1 |
Alcelaphus buselaphus | 0 | 4 | 2307 | - | 1 |
Connochaetes taurinus | 0 | 6 | 1074 | - | 1 |
Equus quagga | 0 | 5 | 786 | - | 1 |
Eudorcas thomsonii | 0 | 6 | 51 | - | 1 |
Nanger granti | 0 | 6 | 261 | - | 1 |
Hyraxes | |||||
Heterohyrax brucei | 0 | 1 | 1961 | - | 0.973 |
Procavia capensis | 0 | 1 | 5312 | - | 0.647 |
Others | |||||
Giraffa camelopardalis | 0 | 1 | 607 | - | 0.473 |
Loxodonta africana | 0 | 0 | 1078 | - | - |
Madoqua kirkii | 0 | 7 | 443 | - | 1 |
Papio anubis | 0 | 1 | 2571 | - | 0.937 |
Syncerus caffer | 0 | 1 | 2808 | - | 0.251 |
Validation with GBIF occurrences
The proportion of GBIF pixels (pixels with at least one GBIF occurrence) matching the IUCN ranges varied widely for species with small ranges and much less for species with large ranges (fig. 4, left). This means that species with large ranges had more area where their datasets for ecological and geographical information agreed. The lowest proportions of GBIF pixels occurred for species with small ranges. Amongst herbivores, Rhabdomys pumilio has only 25.6% of its presence pixels within its IUCN range, while predators have proportions above 47% (such as Lycaon pictus, with 47.6%, and Panthera leo, with 49.3%). Nevertheless, some species with smaller ranges showed high data overlap (such as Canis mesomelas, with 94.1%, and many herbivores). Overall, predators and prey displayed similar variation in overlap, and species with medium and large ranges had higher proportions of agreement between the GBIF, IUCN, and interaction datasets.
The proportion of GBIF pixels in revised ranges can only be equal to or lower than that of the original ranges, as our analysis removes pixels from the original range and does not add new ones. Consequently, the absence of a difference between the two types of ranges indicates that no pixels with GBIF observations, hence likely true habitats, were removed by our analysis. Here, this proportion was similar to that of the original IUCN ranges for most predator species (fig. 4, right). Two species showed no difference in proportion (Lycaon pictus and Panthera leo), while four species showed only small differences (Crocuta crocuta lost 0.4% of the original data overlap; Caracal caracal lost 3.4%; Acinonyx jubatus and Panthera pardus lost 6.2%).
On the other hand, three species, Canis aureus, Canis mesomelas, and Leptailurus serval, showed very large differences, with overlaps lowered by 100%, 58.4%, and 100%, respectively. These last two species are also the only predators with a single prey in our metaweb. Canis aureus has four prey, but it has one of the smallest ranges in IUCN, which is not covered by any of its prey. This result reinforces the concern raised in the literature about the use of IUCN range maps for species that are not well known (Herkt, Skidmore, and Fahr 2017), demonstrating how small-range species are likely to have their distribution underestimated in the IUCN database. Additionally, the fact that Canis aureus had such a conspicuous discrepancy between its original IUCN range and those of its prey, and between GBIF and IUCN data, may indicate a taxonomic incongruency between the three databases used here, which we explore in the Discussion section. Our results show that the mismatch between the GBIF and IUCN databases differs greatly with small changes in herbivore species ranges and is somewhat positively related to range size for predator species. Moreover, we show that accounting for interactions does not necessarily aggravate this dissimilarity, but that it is relevant for species about which we have little ecological information or for specialist groups.
Discussion
Here we identify areas of data mismatch between species range maps by using ecological interaction data (predator-prey interactions within food webs). Our results show a substantial mismatch between the IUCN range areas of specialized and generalist predators and those of their prey, which highlights the importance of accounting for species interactions when estimating the range of a species. Although this type of data mismatch can be the result of actual ecological processes, outdated occurrence data, taxonomic errors, and more, we argue that, here, it rather indicates a lack of interaction sampling data.
The case of the golden jackal (Canis aureus) is a good illustration of how the taxonomic, geographical and ecological data can be used to validate one another. The jackal is a widespread taxon in northern Africa, Europe, and Australasia, generally well adapted to local conditions due to its largely varied diet (Tsunoda and Saito 2020; Krofel et al. 2021). Because of that, we expected that the Canis species in our dataset would be the ones losing the least amount of range, with a higher value of the proportion of GBIF pixels within their IUCN range maps. However, the taxonomy of this group is a matter of intense discussion, as molecular and morphological data seem to disagree in the clustering of species and subspecies (Krofel et al. 2021; Stoyanov 2020). This debate probably influenced our results: with originally only 64.9% of the GBIF pixels of the golden jackal overlapping with its IUCN data, we suspect that many of the GBIF occurrences refer to other Canis species, and that its taxonomic identification in the network database is probably outdated. This led to a complete exclusion of Canis aureus from its original range in our analysis, despite the fact that this species has four documented preys in our metaweb.
Geographical mismatch and data availability
The lack of overlap between IUCN range maps and GBIF occurrences in our results suggests that we are missing geographical information about the distribution of either the prey or the predator. On the other hand, if GBIF and IUCN occurrences tend to overlap and the species is still locally removed, this indicates that we do not have information about all its interactions (e.g., predators may be feeding on different species than the ones in our dataset outside the Serengeti ecosystem). This rationale can be illustrated with three types of mismatches identified in our results.
First, Panthera leo was one of the species with no difference between ranges before and after our analysis, but 50.7% of its GBIF pixels did not superpose with the IUCN range (fig. 4). In this particular case, the IUCN maps seem to agree with species interaction data. However, the disagreement between the IUCN and the GBIF databases is concerning and suggests that the IUCN maps might underestimate the lion’s distribution.
On the other hand, Leptailurus serval and Canis mesomelas are two of the three species that have the highest proportion of mismatched range due to the lack of paths to a herbivore, but are also among the species with the highest proportion of GBIF occurrences inside their original IUCN range maps (fig. 4). This indicates that the information we are missing for these two species is related to either an additional interaction or the presence of external interacting species. To illustrate that, we mapped the GBIF data for the prey of Leptailurus serval, with a mobility buffer around each point (fig. 5). When considering GBIF data, approximately 36% of the prey’s occurrences are within the portion of the predator’s range that was divergent from its original IUCN data. With the buffer area, this corresponds to 5.57% of the mismatched area. By adding GBIF information for the prey, we could therefore reduce the discrepancy of the range (or information) for the predator by 5.57%, since its distribution is conditional on the occurrence of its prey. In other words, the range mismatch was exaggerated because we were missing information on the presence of an interacting species (this also indicates that there is a mismatch, or complementarity, between the IUCN and GBIF data for its prey).
Finally, the extreme case of Canis aureus illustrates a lack of both geographical and ecological information: only half of its GBIF presence pixels and none of its preys occur inside its IUCN range. We believe, therefore, that the validation of species distribution based on ecological interaction is a relevant method that can further fill in information gaps. Nevertheless, it is imperative that more geographically explicit data about ecological networks and interactions become available. This would help clarify when cooccurrences can be translated into interactions (Windsor et al. 2022) and help the development of more advanced validation methods for occurrence data.
Next steps
Here we demonstrated how we can detect areas of data deficit in species distribution data using ecological interactions. Knowing where questionable occurrence data are located can be crucial in ecological modelling (Hortal et al. 2008; Ladle and Hortal 2013), and accounting for these errors can improve model outputs by reducing error propagation (Draper 1995). For instance, we believe our method is a way to account for ecological interactions in habitat suitability models without making the models more complex, by making sure (rather than assuming) that the input data, i.e., the species occurrences, actually account for ecological interactions. Another application of this method is mapping areas where data are deficient, thus helping to identify priority sampling locations for interaction data, which can, in turn, reduce uncertainty in network prediction. For example, if the presence of a species in a certain pixel is confirmed by both IUCN and GBIF data but the species lacks a connection to other species there, this pixel has a high potential to hide an unobserved interaction and should therefore be a priority sampling location.
It is important to note, however, that the quality and usefulness of this method depend heavily on the amount and quality of data available about species’ occurrences and interactions. With this paper, we hope to add to the collective effort to decode the encrypted message that is the occurrence of a species in space and time. A promising avenue that adds to our method is the prediction of networks and interactions at large scales (Strydom et al. 2021; Windsor et al. 2022), for they can add valuable information about ecological interactions where they are missing. Additionally, in order to achieve a robust modelling framework for species distribution models, we should invest in efforts to collect and combine open data on species occurrence and interactions (Windsor et al. 2022), especially because we may be losing ecological interactions at least as fast as we are losing species (Valiente-Banuet et al. 2015).
Acknowledgements
We acknowledge that this study was conducted on land within the traditional unceded territory of the Saint Lawrence Iroquoian, Anishinabewaki, Mohawk, Huron-Wendat, and Omàmiwininiwak nations. We thank the editor and reviewers for their thoughtful comments, which considerably improved this manuscript.