Showing posts with label statistics. Show all posts

Sunday, 2 February 2020

Need a statistician!

I've been looking at spider records from Leicestershire and Rutland (data copyright Leicestershire and Rutland Environmental Records Centre). From 1934-2009 there are a total of 40,340 records; from 2010-2019 there are 3,138 records. The phenology of records is as follows:



Plot of the new records against the old records:



From linear regression (R version 3.6.2 2019-12-12):

Multiple R-squared: 0.7861, Adjusted R-squared: 0.7647
F-statistic: 36.75 on 1 and 10 DF, p-value: 0.0001216
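
For anyone wanting to see where these figures come from, here is a minimal sketch of a simple linear regression of new monthly counts on old monthly counts. It is written in Python rather than R purely for illustration, and the monthly counts in it are made-up placeholders, not the LRERC data. The two helper functions use the standard one-predictor relationships between R-squared, adjusted R-squared and the F-statistic, so they can be checked against the output above (12 monthly points give 1 and 10 DF).

```python
def f_statistic(r_squared, n):
    """F-statistic for a one-predictor regression with n points (1 and n-2 DF)."""
    return r_squared / (1 - r_squared) * (n - 2)


def adjusted_r_squared(r_squared, n):
    """Adjusted R-squared for a one-predictor regression with n points."""
    return 1 - (1 - r_squared) * (n - 1) / (n - 2)


def simple_regression_summary(x, y):
    """Least-squares fit of y on x, returning the headline summary statistics."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r_squared": r_squared,
        "adj_r_squared": adjusted_r_squared(r_squared, n),
        "f_statistic": f_statistic(r_squared, n),
        "df": (1, n - 2),
    }


# Hypothetical monthly record counts (Jan-Dec); NOT the real data.
old_counts = [900, 1100, 1800, 3000, 5200, 6100, 5000, 4300, 4800, 4000, 2400, 1200]
new_counts = [60, 90, 180, 310, 420, 450, 330, 300, 340, 280, 150, 70]
summary = simple_regression_summary(old_counts, new_counts)
print(summary)

# Consistency check against the reported output (R-squared 0.7861, 12 months):
print(round(f_statistic(0.7861, 12), 2))         # 36.75
print(round(adjusted_r_squared(0.7861, 12), 4))  # 0.7647
```

The last two lines recover the reported F-statistic and adjusted R-squared from the reported R-squared alone, which is a quick check that the three figures describe the same fit.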

The p-value tests the null hypothesis that there is no correlation between the variables. Rejecting this hypothesis (p < 0.001) means that the monthly counts from the last decade track the historical monthly counts closely, so the conclusion based on this data is that the observed phenology has not changed significantly in the last decade. Note that even with my level of statistical inexpertise I have studiously avoided imputing any cause for change, only examining whether there has been a statistically significant change in the last decade. However, Helen Smith helpfully pointed out to me that if you plot the data as percentage values, there appears to be a "spring shift" in the last decade, even if this is not statistically significant based on available data:



NB: I have revised this post a number of times to avoid propagation of my earlier errors - many thanks to all those who have made helpful comments.



Wednesday, 4 December 2019

Benchmarking Spider Recording

Recently, I've been thinking a lot about occupancy models for invertebrates (see: Filling the White Holes). Other taxa, notably birds and butterflies (through the BTO Wetland Bird Survey and Butterfly Conservation's UK Butterfly Monitoring Scheme (UKBMS), respectively) have good negative data, i.e. an indication of where species are absent as well as where they are present. For most invertebrate taxa, partly because of lack of resource (recording effort) but mostly because of the inefficiency of recording, all we have are "White Holes" - gaps in the data which are difficult to interpret. The BAS Spider Recording Scheme does not include "negative data" because it's virtually impossible (without DNA approaches) to be certain a spider is truly absent from a particular area. This makes occupancy models difficult if not impossible to derive. The alternative is to fall back to benchmark species as indicators of recording coverage.

Following Filling the White Holes I had an interesting online chat with Geoffrey Hall who introduced me to the idea of axiophytes (it's a botany thing). BSBI gives the following criteria for axiophytes:
  • 90% restricted to these conservation habitats
  • Recorded in fewer than 25% of tetrads in the county
It seems that Pliny made it up (as so many other things) in his "Natural History". As far as I can tell, neither axioentomos nor axioarachnos exist, so I've just made up two new words (take that, Pliny). The point is that axiowhatevers focus attention on presence (even if it is rare presence), rather than absence, so I still tend towards the idea of using a "universal" benchmark species rather than an axiomatic one.

Clearly, the choice of a benchmark species is crucial. For springtails, I am happy to use Orchesella cincta, widely acknowledged to be the commonest species of springtail in lowland Britain. I would be amazed if this species was not present in every quadrat in VC55, and it's identifiable in the field with no need to put it under a microscope. For spiders it's a little more complex. The obvious choice is Araneus diadematus, but it's only the 4th most commonly recorded species in VC55 (n=724), with only 39% of the count of the most frequently recorded species (Tenuiphantes tenuis, n=1,849). I'm pretty sure that this does not reflect the actual situation, but rather recording bias/snobbery (of which I am probably guilty). While the ubiquity of this species is a good reason to think that this is a valid choice, I needed to test the hypothesis.

As a starting point I used quadrat mapping - arbitrarily dividing VC55 into a grid and looking at the number of records within each section. A 25x25 grid worked but the intervals were a bit small and a 10x10 grid is more informative (all VC55 spider records to end 2018):


The grid for Araneus diadematus looks like this:


To make sense of this, I converted the distributions into histograms:


The distributions look similar, but to be sure, I ran some further analysis:


While there is a correlation between the Araneus diadematus distribution and the overall VC55 spider records dataset and this is statistically significant (p = 4.49e-11), it's not a great fit (R² = 0.48). In contrast, when I run the same exercise on the benchmark species I use for normalizing springtail recording effort (Orchesella cincta), I get an R² value of 0.87 (p = 2.2e-16), so that's a much better fit from a dataset which is nearly ten times smaller than the VC55 spider data. So I conclude that Araneus diadematus is a valid benchmark for VC55 spider recording - but it's not a great one. If you can think of a better candidate benchmark species, please let me know.
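
One way to compare candidate benchmark species along these lines is sketched below, in Python for illustration. It assumes the comparison is made tile by tile: each species' per-tile record counts are lined up against the per-tile counts for all records, and the candidates are ranked by the squared Pearson correlation. That may not be exactly how the figures above were produced, and the species names and counts in the example are hypothetical.

```python
def aligned_counts(all_tiles, species_tiles, n_cells):
    """Line up two {(row, col): count} dicts over every tile of the grid.

    Tiles missing from a dict are treated as zero records.
    """
    tiles = [(r, c) for r in range(n_cells) for c in range(n_cells)]
    x = [all_tiles.get(t, 0) for t in tiles]
    y = [species_tiles.get(t, 0) for t in tiles]
    return x, y


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def rank_benchmarks(all_tiles, candidates, n_cells):
    """Return (species, R-squared) pairs, best-fitting candidate first."""
    scores = {}
    for name, species_tiles in candidates.items():
        x, y = aligned_counts(all_tiles, species_tiles, n_cells)
        scores[name] = r_squared(x, y)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical 2x2 grid: counts of all records and of two made-up candidates.
all_records = {(0, 0): 120, (0, 1): 40, (1, 0): 310, (1, 1): 15}
candidates = {
    "species_a": {(0, 0): 12, (0, 1): 4, (1, 0): 30, (1, 1): 1},  # tracks effort
    "species_b": {(0, 0): 3, (0, 1): 9, (1, 0): 5, (1, 1): 7},    # does not
}
ranking = rank_benchmarks(all_records, candidates, n_cells=2)
for name, score in ranking:
    print(name, round(score, 3))
```

A candidate whose records rise and fall with overall recording effort scores close to 1, while one that is recorded patchily scores much lower.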


Acknowledgements:
  • All data Copyright Leicestershire and Rutland Environmental Records Centre.
  • Data visualization performed using the R platform, v. 3.6.1 (R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org).
  • J. Cann for assistance with data visualization.

Monday, 25 November 2019

Filling the White Holes

I'm growing increasingly obsessed with data that isn't there... Earlier this month Dom Greaves used the phrase White Holes on Twitter:



It stuck in my mind and I haven't been able to shift it. Previously, I've used R to plot heatmaps of VC55 (Leicestershire and Rutland) spider records (all >43k VC55 spider records to end 2018; data copyright Leicestershire and Rutland Environmental Records Centre):



However, the problem with heatmaps is that they inevitably focus attention on where the data is, rather than where it is missing. As an attempt to switch the emphasis I tried quadrat mapping - arbitrarily dividing VC55 into a grid and looking at the number of records within each section. Initially I tried a 50x50 grid but with 43,000 records, that choked my computer. A 25x25 grid worked but the intervals were a bit small and a 10x10 grid is more informative:



Conveniently, R gives the record counts for each tile:



and plotting a histogram of the counts draws attention to how skewed the distribution of records is. Result!



The method isn't perfect. The tiles, and hence the counts, cover unequal areas where they overlap the VC55 boundary. I don't know how you get around that, although plotting the data by parish rather than by quadrat might be better?
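
The quadrat counting itself is simple enough to sketch. The version below is in Python for illustration and is not the R code behind the plots above: it assumes each record is an (x, y) coordinate pair and that the grid is laid over a rectangular bounding box, and the coordinates in the example are made up. Because the bounding box is a rectangle, some empty tiles will lie partly or wholly outside the county boundary, which is the unequal-area problem mentioned above.

```python
from collections import Counter


def quadrat_counts(points, bounds, n_cells=10):
    """Count records in each tile of an n_cells x n_cells grid.

    points: iterable of (x, y) record coordinates.
    bounds: (min_x, min_y, max_x, max_y) bounding box of the area.
    Returns a Counter keyed by (row, col); records outside the box are ignored.
    """
    min_x, min_y, max_x, max_y = bounds
    width = (max_x - min_x) / n_cells
    height = (max_y - min_y) / n_cells
    counts = Counter()
    for x, y in points:
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue
        # Records exactly on the far edge go into the last row or column.
        col = min(int((x - min_x) / width), n_cells - 1)
        row = min(int((y - min_y) / height), n_cells - 1)
        counts[(row, col)] += 1
    return counts


def empty_tiles(counts, n_cells=10):
    """List the tiles with no records at all: the 'white holes'."""
    return [
        (row, col)
        for row in range(n_cells)
        for col in range(n_cells)
        if counts[(row, col)] == 0
    ]


# Hypothetical coordinates on a 100 x 100 area; the last point falls outside.
records = [(5, 5), (7, 3), (55, 55), (100, 100), (150, 20)]
counts = quadrat_counts(records, bounds=(0, 0, 100, 100), n_cells=10)
print(counts[(0, 0)], len(empty_tiles(counts)))  # 2 97
```

Listing the empty tiles directly, rather than colouring the busy ones, is one way of keeping the emphasis on where the data is missing.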

The other place I'm currently stuck is trying to turn unstructured data into occupancy models. This is another White Hole Problem, but one that continues to defeat me.


Acknowledgements:
  • All data Copyright Leicestershire and Rutland Environmental Records Centre.
  • Data visualization performed using the R platform, v. 3.6.1 (R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org).
  • J. Cann for assistance with data visualization.

Thursday, 24 October 2019

How to do biological recording

An interesting new preprint addresses an issue I've been concerned about for a while: how to control for recording effort when assessing how species are doing (Rapid assessment of the suitability of multi-species citizen science datasets for occupancy trend analysis: https://www.biorxiv.org/content/10.1101/813626v1).

The biggest volume of biological data is recorded by unstructured citizen science schemes. Because the data is collected in an essentially random way, many taxon experts are sceptical about the value of these schemes in accurately reflecting populations in the field. Although the statistics are complicated, the method of the new paper seeks to turn unstructured data into occurrence data, i.e. data where we can be sure (to any specified degree) of the presence or absence of a species in a given time period, or the absence of sufficient data to make a determination. The method is to call each 1km grid square a recording site and to count the number of visits each year, with a visit defined as at least one record by any person in a 24-hour period. From this it is possible to calculate the degree of confidence in the occurrence or absence of a species at the site. Ideally (for high confidence) there would be four or more visits from experienced recorders per site per year, but even in the absence of this, the method provides a way of turning the massive amount of unstructured biological recording data available into findings which are easier to interpret and to place confidence limits on.
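
The visit-counting step described above can be sketched as follows, in Python for illustration. This is not the preprint's code. It assumes each record is a (1km square, date) pair, treats the "24 hour period" as a calendar date, and uses four visits per site per year as the threshold mentioned above; the grid references and dates in the example are made up.

```python
from collections import defaultdict
from datetime import date


def visits_per_site_year(records):
    """Count visits for each (site, year).

    records: iterable of (site, day) pairs, where site is a 1km square
    identifier and day is a datetime.date. Any number of records at the
    same site on the same date counts as a single visit.
    """
    visit_days = defaultdict(set)
    for site, day in records:
        visit_days[(site, day.year)].add(day)
    return {key: len(days) for key, days in visit_days.items()}


def well_visited(visit_counts, min_visits=4):
    """Return the (site, year) pairs with at least min_visits visits."""
    return sorted(key for key, n in visit_counts.items() if n >= min_visits)


# Hypothetical records: two on the same day count as one visit.
records = [
    ("SK5004", date(2019, 4, 2)),
    ("SK5004", date(2019, 4, 2)),
    ("SK5004", date(2019, 5, 11)),
    ("SK5004", date(2019, 6, 30)),
    ("SK5004", date(2019, 9, 14)),
    ("SK6010", date(2019, 7, 1)),
]
visit_counts = visits_per_site_year(records)
print(visit_counts)
print(well_visited(visit_counts))  # [('SK5004', 2019)]
```

Site-years that fall below the threshold are not evidence of absence; they are the cases where there is too little data to make a determination.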



Unsurprisingly, it turns out that butterflies and moths are the runaway winners, the East Midlands performs creditably, and species which get a lot of publicity do better than those for which there are only a handful of experts who can identify them. Nevertheless, if tools could be developed to enable easy utilisation of the method, this would present a valuable way forwards.




Note: R package "unmarked" is of relevance: https://cran.r-project.org/web/packages/unmarked/index.html