Species Distribution Modeling with Wallace

I am currently developing additional functionality for my Machine Learning Tools for Open Science (MLTOS) projects, and one of the main things I want to do is integrate species distribution modeling (SDM; for more information on the topic, I suggest the Introduction section of this Nature paper). Part of my research involved looking at which solutions already exist, and I stumbled on a gem called Wallace.

Wallace is a Shiny application, so it runs in the browser with R as its backend. This is great in many ways: R is popular in the ecological sciences and is home to a multitude of packages on the topic, covering everything from obtaining ecological datasets to downstream analysis.

In this article I will give you a short walkthrough of the functionality available in Wallace. At the time of writing I couldn’t find a tutorial (probably in the works?), so I think this might be useful.

First, go ahead and install the app (detailed instructions are on the project’s GitHub page):

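# Install Wallace from CRAN (only needed once)
install.packages("wallace")
# Load the package and launch the Shiny app in your default browser
library(wallace)
run_wallace()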

You can also install the development version if you want the latest features. But be prepared for bugs!

You should see the following screen in your browser:

The user interface is intuitive and well structured. The main focus is on the map, while the left panel is where you work on the data. The navigation doubles as an overview of the workflow, and following its sequence of steps is very helpful.

As a first step, we choose the species we want to obtain data on. For this tutorial I chose the Eurasian lynx (Lynx lynx). Wallace queries the GBIF database, and you can even download the results as a CSV file, which I find very useful. The occurrence records are plotted nicely on the map. You can also manually inspect the data:

The next step of the workflow is data subsetting. Often you are only interested in part of the area (let’s say budget constraints mean you cannot cover a large area in subsequent on-site sampling). The map interface makes this easy. For example, you can draw a polygon around the points of interest to select them.
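If you are curious what such a selection boils down to outside the app, it is a point-in-polygon test. Here is a minimal base R sketch using ray casting. The coordinates, the polygon, and the function name point_in_polygon are all made up for illustration, and this is not necessarily how Wallace implements it.

```r
# Ray-casting test: is the point (x, y) inside the polygon whose vertices
# are given by px and py? The polygon is closed implicitly.
point_in_polygon <- function(x, y, px, py) {
  n <- length(px)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge between vertex j and vertex i straddle the height y?
    crosses <- (py[i] > y) != (py[j] > y)
    if (crosses) {
      # x coordinate where that edge crosses the horizontal line through y
      x_cross <- px[i] + (y - py[i]) * (px[j] - px[i]) / (py[j] - py[i])
      if (x < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Made-up occurrence records (illustration only, not real GBIF data)
occs <- data.frame(
  longitude = c(10.5, 14.2, 25.0, 30.1),
  latitude  = c(47.0, 49.5, 60.2, 55.0)
)

# A hypothetical selection polygon: a rectangle over central Europe
poly_lon <- c(5, 20, 20, 5)
poly_lat <- c(45, 45, 52, 52)

# Test every record and keep only those inside the polygon
keep <- mapply(point_in_polygon, occs$longitude, occs$latitude,
               MoreArgs = list(px = poly_lon, py = poly_lat))
occs_subset <- occs[keep, ]
print(occs_subset)
```

For real work you would rather rely on a spatial package than on a hand-rolled test, but the sketch shows that the subsetting step is just a filter on the occurrence table.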

The second dataset that you need for SDM is environmental data. Wallace has built-in connections to standard sources for this as well, such as the Bioclim variables (the bioclimatic layers from WorldClim), which are often the first ones to try.

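After selecting the environmental data, there are a few more steps before you can train a model. One of them is sampling background points, which play the role of your negative labels. Strictly speaking they are pseudo-absences: locations sampled from the study area, not confirmed absences.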
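To make the idea of background points concrete, here is a rough base R sketch that draws them uniformly from a padded bounding box around the occurrences. The coordinates, the 1-degree buffer, and the number of points are arbitrary assumptions of mine. In practice background points are usually drawn from the cells of the environmental raster, which this sketch does not attempt.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Made-up occurrence coordinates (illustration only)
occs <- data.frame(
  longitude = c(10.5, 14.2, 18.9, 12.3),
  latitude  = c(47.0, 49.5, 46.1, 51.2)
)

# Study extent: bounding box of the occurrences, padded by an arbitrary 1 degree
buffer <- 1
lon_range <- range(occs$longitude) + c(-buffer, buffer)
lat_range <- range(occs$latitude) + c(-buffer, buffer)

# Draw background points uniformly within the extent and label them 0
n_bg <- 1000
background <- data.frame(
  longitude = runif(n_bg, lon_range[1], lon_range[2]),
  latitude  = runif(n_bg, lat_range[1], lat_range[2]),
  presence  = 0
)

# Stack presences (1) and background points (0) into one training table
training <- rbind(cbind(occs, presence = 1), background)
table(training$presence)
```

Note that sampling uniformly in longitude and latitude is not equal-area, so treat this as a conceptual illustration only.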

Then you can split your dataset into folds for cross-validation (a standard machine learning procedure). What is really cool here is that the “jackknife” method is also available (a good mathematical description is available in this pdf). It is essentially leave-one-out cross-validation, where each occurrence record is held out once as the test set, which helps a lot when you have very few data points.
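The difference between a random k-fold split and the jackknife is easiest to see in the fold assignments. Below is a small base R sketch. The number of records and the value of k are hypothetical, and the model fitting itself is left out.

```r
set.seed(1)   # fixed seed so the shuffle is repeatable
n_occs <- 12  # hypothetical number of occurrence records

# Random k-fold: assign each record to one of k folds of (near) equal size
k <- 4
kfold_id <- sample(rep(seq_len(k), length.out = n_occs))

# Jackknife (leave-one-out): every record is its own fold
jackknife_id <- seq_len(n_occs)

# Cross-validation loop over the jackknife folds
for (fold in unique(jackknife_id)) {
  test_idx  <- which(jackknife_id == fold)  # the single held-out record
  train_idx <- which(jackknife_id != fold)  # all remaining records
  # A model would be fit on train_idx and evaluated on test_idx here
}

table(kfold_id)               # how many records landed in each k-fold group
length(unique(jackknife_id))  # number of jackknife folds
```

With only a dozen records, a 4-fold split trains each model on nine of them, while the jackknife trains on eleven every time. That is why it is attractive for small datasets.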

Finally, you can train the model and get numerical results. These are standard classification metrics, such as AUC (area under the ROC curve).
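As a reminder of what AUC measures, it is the probability that a randomly chosen presence gets a higher predicted score than a randomly chosen background point. Here is a minimal base R sketch using the rank-based (Mann-Whitney) formula. The scores and labels are invented for illustration, and the function name auc is my own.

```r
# AUC from the rank-sum (Mann-Whitney) formula.
# scores: predicted suitability; labels: 1 = presence, 0 = background
auc <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # tied scores receive average ranks
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented predictions: three presences followed by three background points
scores <- c(0.9, 0.8, 0.35, 0.6, 0.3, 0.1)
labels <- c(1, 1, 1, 0, 0, 0)

auc(scores, labels)  # 8 of the 9 presence/background pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1 to perfect separation. Keep in mind that with background points instead of true absences, the value is best used to compare models on the same data.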

You can also show several diagnostic plots to see if there is a good separation between the clusters of points:

The most interesting output of the SDM workflow is the map of predicted suitability across the study region. If the accuracy of your model is good, this is the final product of your modeling, and you can share it for further use.

I hope this tutorial has been useful. I encourage you to give Wallace a try. I think that such software can help tremendously by lowering the bar for coding skills needed, and additionally making sure that scientific work is reproducible!
