Data Sources

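To evaluate the potential tourist-vessel collision risk for five cetacean species regularly observed in Colombia, we conducted habitat modeling analyses for the following species: humpback whale (*Megaptera novaeangliae*), pantropical spotted dolphin (*Stenella attenuata*), Atlantic spotted dolphin (*Stenella frontalis*), spinner dolphin (*Stenella longirostris*), and bottlenose dolphin (*Tursiops truncatus*).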

To locate occurrence records for each species and subspecies, searches were conducted using the complete scientific name. The following databases were used: Atlas of Living Australia (ALA, http://www.ala.org.au/), Berkeley Ecoengine (Ecoengine, https://ecoengine.berkeley.edu/), Biodiversity Information Serving Our Nation (BISON, https://bison.usgs.gov/), Global Biodiversity Information Facility (GBIF, https://www.gbif.org), Integrated Digitized Biocollections (iDigBio, https://www.idigbio.org/), iNaturalist (iNat, http://www.inaturalist.org/), the Ocean Biogeographic Information System (OBIS, https://obis.org), Distributed Databases with Backbone (VertNet, http://vertnet.org/), Sistema de Información Ambiental Marina (SIAM, http://siam.invemar.org.co), operated by Colombia’s Marine and Coastal Research Institute (INVEMAR), and the Red Nacional de Datos Abiertos sobre Biodiversidad de Colombia (SiB Colombia, https://sibcolombia.net/).

**Figure S2.** Occurrences downloaded from the biodiversity information databases.

In addition to the publicly available databases, we used unpublished data from oceanographic cruises conducted by the National Maritime Directorate of Colombia (DIMAR) as part of the El Niño Southern Oscillation Regional Study (ERFEN). These data were collected by the Centro de Investigaciones Oceanográficas e Hidrográficas del Pacífico (CCCP) and are provided in Supplementary Table 1.

Duplicate records were removed and the datasets were filtered with a script written in R, using the packages “tidyverse” (v. 1.2.0; Wickham and Wickham 2017), “dplyr” (v. 1.0; Wickham et al. 2019), “anytime” (v. 0.3.7; Eddelbuettel 2020), “rgdal” (v. 1.5-12; Bivand et al. 2020), “gdata” (v. 2.18.0; Warnes et al. 2012), and “devtools” (v. 2.3.1; Wickham et al. 2020). The script is available on GitHub: https://github.com/ChrisBermudezR/Cetacean_Tourist_Vessel_Collision_Risk_Assessment/blob/main/02_Species_Occurrences/01_Occurrence_Data_Download.R
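The filtering and de-duplication step can be illustrated with a minimal sketch. It is written in Python for compactness and is not the authors' R script; the field names, the example records, and the choice of duplicate key (species, rounded coordinates, and date) are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical occurrence records; field names and values are illustrative only.
records = [
    {"species": "Tursiops truncatus", "lat": 3.95, "lon": -77.35,
     "date": date(2015, 8, 1), "source": "GBIF"},
    {"species": "Tursiops truncatus", "lat": 3.95, "lon": -77.35,
     "date": date(2015, 8, 1), "source": "OBIS"},   # same sighting, second database
    {"species": "Tursiops truncatus", "lat": None, "lon": -77.10,
     "date": date(2016, 2, 3), "source": "iNat"},   # missing latitude
    {"species": "Stenella frontalis", "lat": 11.30, "lon": -74.20,
     "date": date(2012, 5, 9), "source": "SIAM"},
]


def clean_occurrences(rows, decimals=4):
    """Drop rows with missing or out-of-range coordinates, then keep the first
    copy of each (species, rounded lat, rounded lon, date) combination."""
    seen = set()
    kept = []
    for row in rows:
        lat, lon = row["lat"], row["lon"]
        if lat is None or lon is None:
            continue  # cannot be mapped
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # invalid coordinates
        key = (row["species"], round(lat, decimals), round(lon, decimals), row["date"])
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        kept.append(row)
    return kept


cleaned = clean_occurrences(records)
print(len(records), "->", len(cleaned))
```

Rounding the coordinates before comparison is one way to treat the same sighting reported by two databases with slightly different precision as a single record; the number of decimals is a tunable assumption.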

**Figure 2.** Number and locations of cetacean species/subspecies records reported in the Colombian Caribbean and Pacific basins. Occurrences of humpback whale (*Megaptera novaeangliae*); pantropical spotted dolphin’s subspecies: offshore pantropical spotted (*Stenella attenuata attenuata*) and coastal pantropical spotted (*Stenella attenuata graffmani*); Atlantic spotted dolphins (*Stenella frontalis*); spinner dolphin’s subspecies: Central American spinner (*Stenella longirostris centroamericana*), Gray’s spinner (*Stenella longirostris longirostris*), and Eastern spinner (*Stenella longirostris orientalis*); and bottlenose dolphins (*Tursiops truncatus*).

Data Thinning and Bias Removal

Spatial data thinning and bias removal are crucial steps in developing accurate species distribution models. Public and online datasets that provide occurrence data often display strong spatial biases, which can affect the reliability of these models (Fourcade et al. 2014).

Spatial thinning is a method used to reduce spatial autocorrelation and clustering in occurrence records. This technique involves removing some of the occurrence records from the dataset to create a more even distribution of records across the study area. A random subset of the records can be selected or a clustering algorithm can be applied to group records that are too close together. These methods help to mitigate the impact of spatial biases and improve the accuracy of species distribution models.

To detect clustering in the occurrence data and identify the areas of highest record density, we used kernel density estimation and visualized the results with the “ggplot2” package (v. 3.4.1; Wickham 2016) in R (see Figure S3).

**Figure S3.** Two-dimensional kernel density estimate of cetacean species occurrences, computed with an axis-aligned bivariate normal kernel evaluated on a square grid using the “ggplot2” R package (v. 3.4.1; Wickham 2016).
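To make the estimator concrete, the following is a minimal Python sketch of a two-dimensional kernel density estimate with an axis-aligned bivariate normal kernel (a product of two one-dimensional Gaussians) evaluated on a square grid. It is an illustration only and not the authors' R code; the function name `kde2d`, the bandwidths `hx` and `hy` (in degrees), the grid size, and the coordinates are hypothetical.

```python
import math


def kde2d(xs, ys, hx, hy, n=25, lims=None):
    """Axis-aligned bivariate normal KDE on an n x n grid.

    hx and hy are the kernel standard deviations along each axis (assumed
    values); lims is (xmin, xmax, ymin, ymax). Returns the grid coordinates
    and z, where z[i][j] is the density at (gx[i], gy[j]).
    """
    if lims is None:
        lims = (min(xs), max(xs), min(ys), max(ys))
    x0, x1, y0, y1 = lims
    gx = [x0 + (x1 - x0) * i / (n - 1) for i in range(n)]
    gy = [y0 + (y1 - y0) * j / (n - 1) for j in range(n)]

    def phi(u):
        # Standard normal density.
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

    m = len(xs)
    z = [
        [
            sum(phi((x - xi) / hx) * phi((y - yi) / hy) for xi, yi in zip(xs, ys))
            / (m * hx * hy)
            for y in gy
        ]
        for x in gx
    ]
    return gx, gy, z


# Hypothetical coordinates: four clustered records and one isolated record.
lons = [-77.30, -77.32, -77.28, -77.31, -76.90]
lats = [3.90, 3.92, 3.88, 3.91, 4.40]

gx, gy, z = kde2d(lons, lats, hx=0.1, hy=0.1, n=41, lims=(-78.0, -76.0, 3.0, 5.0))

# Locate the grid cell with the highest estimated density.
density, peak_lon, peak_lat = max(
    (z[i][j], gx[i], gy[j]) for i in range(len(gx)) for j in range(len(gy))
)
print(round(peak_lon, 2), round(peak_lat, 2))
```

In this toy example the density peak falls on the cluster of four nearby records, which is the kind of sampling hot spot the density maps are meant to reveal.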

For each dataset of species occurrence records (Figure S3), a spatial thinning analysis was performed using the R package “spThin” (Aiello-Lammens et al. 2015), with a thinning parameter of 10 km as the minimum separation between occurrence records and three repetitions of the thinning procedure per dataset.
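The idea behind distance-based thinning can be sketched as follows. This Python example is a simplified illustration under stated assumptions and is not the spThin implementation or the authors' script: it repeatedly drops one of the records that has the most neighbours closer than the minimum distance, breaking ties at random, until no two retained records are closer than 10 km, and it repeats the procedure three times. The function names, coordinates, and seed are hypothetical.

```python
import math
import random


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def thin(points, min_km=10.0, rng=random):
    """Drop records one at a time until no pair is closer than min_km.

    At each step the record removed is chosen at random among those with the
    largest number of neighbours within min_km.
    """
    kept = list(points)
    while True:
        counts = [
            sum(1 for j, q in enumerate(kept) if j != i and haversine_km(p, q) < min_km)
            for i, p in enumerate(kept)
        ]
        worst = max(counts, default=0)
        if worst == 0:
            return kept
        candidates = [i for i, c in enumerate(counts) if c == worst]
        kept.pop(rng.choice(candidates))


def thin_repeated(points, min_km=10.0, reps=3, seed=0):
    """Run the randomized thinning several times with a reproducible seed."""
    rng = random.Random(seed)
    return [thin(points, min_km, rng) for _ in range(reps)]


# Hypothetical (lat, lon) records: a tight cluster plus two isolated sightings.
occurrences = [
    (3.90, -77.30), (3.91, -77.31), (3.92, -77.29), (3.89, -77.30),
    (4.40, -76.90), (2.00, -78.50),
]

runs = thin_repeated(occurrences, min_km=10.0, reps=3)
print([len(run) for run in runs])
```

Because ties are broken at random, different repetitions can retain different records (and, for some configurations, different numbers of records), which is why the procedure is repeated. The pairwise distance computation here is quadratic per step and is only suitable for small datasets.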

Table S1 shows the results of the spatial thinning analysis for each species/subspecies and basin. The “Occurrences” column gives the number of occurrence records available for each species/subspecies in each basin. The “Thinned 01”, “Thinned 02”, and “Thinned 03” columns give the number of records remaining after each repetition of the thinning procedure. The “Data Conservation” column gives the percentage of the original records retained after thinning. For example, only 8% of the original records were retained for *M. novaeangliae* in the Pacific basin, whereas 67% were retained for *S. a. attenuata* in the Pacific basin.

Table S1. Number of occurrences of cetacean species/subspecies obtained from the automated query and number of occurrences retained after spatial thinning with the R package “spThin” (Aiello-Lammens et al. 2015).

| Species/Subspecies | Basin | Occurrences | Thinned 01 | Thinned 02 | Thinned 03 | Data Conservation |
|---|---|---|---|---|---|---|
| *M. novaeangliae* | Pacific | 7129 | 586 | - | - | 8% |
| *S. a. attenuata* | Caribbean | 175 | 107 | 107 | - | 61% |
| *S. a. attenuata* | Pacific | 298 | 201 | - | - | 67% |
| *S. a. graffmani* | Pacific | 983 | 215 | 215 | - | 22% |
| *S. frontalis* | Caribbean | 189 | 82 | 82 | - | 43% |
| *S. l. centroamericana* | Pacific | 125 | 28 | 28 | 28 | 22% |
| *S. l. longirostris* | Caribbean | 53 | 35 | 35 | 35 | 66% |
| *S. l. orientalis* | Pacific | 94 | 46 | 46 | 46 | 46% |
| *T. truncatus* | Caribbean | 165 | 82 | 82 | 82 | 50% |
| *T. truncatus* | Pacific | 690 | 263 | - | - | 38% |
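As a worked check of how the “Data Conservation” column relates to the counts, the percentage retained can be computed as the thinned count divided by the original count. The short Python sketch below assumes rounding to the nearest whole percent; the function name is hypothetical, and the counts are the two examples from Table S1 discussed in the text.

```python
def data_conservation(n_original, n_thinned):
    """Percentage of occurrence records retained after thinning (whole percent)."""
    return round(100 * n_thinned / n_original)


# Counts taken from Table S1.
print(data_conservation(7129, 586))  # M. novaeangliae, Pacific: 8
print(data_conservation(298, 201))   # S. a. attenuata, Pacific: 67
```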

To assess the effect of spatial thinning, kernel density estimation was applied again and the results were visualized using the “ggplot2” package (v. 3.4.1; Wickham 2016) in R (Figure S4).

**Figure S4.** Two-dimensional kernel density estimate of cetacean species occurrences after spatial thinning, computed with an axis-aligned bivariate normal kernel evaluated on a square grid using the “ggplot2” R package (v. 3.4.1; Wickham 2016).

All R analyses are available at the following link: https://github.com/ChrisBermudezR/Cetacean_Tourist_Vessel_Collision_Risk_Assessment/blob/main/02_Species_Occurrences/02_Occ_Bias_Elimination.R

REFERENCES

Acevedo J, Aguayo-Lobo A, Allen J, et al (2017) Migratory preferences of humpback whales between feeding and breeding grounds in the eastern South Pacific. Marine Mammal Science 33:1035–1052. https://doi.org/10.1111/mms.12423
Aiello-Lammens ME, Boria RA, Radosavljevic A, et al (2015) spThin: An R package for spatial thinning of species occurrence records for use in ecological niche models. Ecography 38:541–545. https://doi.org/10.1111/ecog.01132
Ávila IC, Dormann CF, García C, et al (2020) Humpback whales extend their stay in a breeding ground in the Tropical Eastern Pacific. ICES Journal of Marine Science 77:109–118
Bivand R, Keitt T, Rowlingson B (2020) rgdal: Bindings for the ‘Geospatial’ Data Abstraction Library. R package version 1.5-12
Eddelbuettel D (2020) anytime: Anything to ‘POSIXct’ or ‘Date’ Converter. R package version 0.3.7
Forney KA, Ferguson MC, Becker EA, et al (2012) Habitat-based spatial models of cetacean density in the eastern Pacific Ocean. Endangered Species Research 16:113–133. https://doi.org/10.3354/esr00393
Fourcade Y, Engler JO, Rödder D, Secondi J (2014) Mapping species distributions with MAXENT using a geographically biased sample of presence data: A performance assessment of methods for correcting sampling bias. PLoS ONE 9:e97122
Herzing DL, Perrin WF (2018) Atlantic Spotted Dolphin: Stenella frontalis. In: Encyclopedia of Marine Mammals. Academic Press, London, UK, pp 40–42
Jefferson TA, Webber MA, Pitman RL (2015) Marine mammals of the world: a comprehensive guide to their identification, 2nd edn. Academic Press, Elsevier Inc.
Perrin WF (2018) Atlantic Spotted Dolphin: Stenella frontalis. In: Würsig B, Thewissen JGM, Kovacs KM (eds) Encyclopedia of Marine Mammals. pp 40–42
Warnes GR, Bolker B, Gorjanc G, et al (2012) gdata: Various R programming tools for data manipulation. R package version 2.12.0. R package version 2.13.3
Wickham H (2016) ggplot2: Elegant graphics for data analysis. Springer-Verlag New York
Wickham H, Hester J, Chang W (2020) devtools: Tools to Make Developing R Packages Easier
Wickham H, François R, Henry L, Müller K (2019) dplyr: A grammar of data manipulation. R package version 0.8.0.1. Retrieved January 13, 2020
Wickham H, Wickham MH (2017) Package tidyverse. Easily Install and Load the ‘Tidyverse’