An introduction to computing spatial Randomness and autocorrelation in R with examples
Spatial randomness and spatial autocorrelation are two concepts related to the distribution of data in space. They describe different aspects of spatial data and have different implications for the analysis of spatial patterns.
Spatial randomness
: Spatial randomness refers to a situation where the location of events or features in space does not follow any discernible pattern. In other words, the events or features are randomly distributed across the study area, and their locations are not influenced by the locations of other events or features. Complete spatial randomness (CSR) is often used as a null hypothesis in spatial analyses, which assumes that the observed pattern is random and not influenced by any underlying spatial process.
Spatial autocorrelation
: Spatial autocorrelation refers to the correlation between a variable at one location and the same variable at neighboring locations. It measures the degree to which nearby locations exhibit similar (or dissimilar) values. Positive spatial autocorrelation means that similar values are clustered together, while negative spatial autocorrelation means that dissimilar values are closer together. If there is no spatial autocorrelation, the values are randomly distributed across the study area.
Spatial randomness pertains to the absence of any discernible pattern in the distribution of events or features in space, while spatial autocorrelation measures the correlation between a variable at one location and the same variable at neighboring locations.
Spatial randomness is often used as a null hypothesis in spatial analysis, while spatial autocorrelation is an inherent property of the data that can reveal underlying spatial processes or structures.
In summary, spatial randomness and spatial autocorrelation describe different aspects of spatial data. Spatial randomness relates to the distribution of events or features, while spatial autocorrelation quantifies the degree of similarity between values at nearby locations. These concepts are essential for understanding and analyzing spatial patterns in various types of spatial data.
Ripley's K function
: The K function is used to assess CSR in point patterns. It calculates the expected number of points within a given distance ‘r’ of an arbitrary point, divided by the overall point density. If the observed K function is close to the theoretical K function under CSR, the point pattern is consistent with spatial randomness.
L function
: The L function is a derived function from Ripley’s K function, which is used to assess Complete Spatial Randomness (CSR) in point pattern data. The L function is designed to provide a linearized visualization of the K function, making it easier to interpret the results.
Nearest Neighbor Distance Distribution function (G function)
: The G function is another measure used to assess CSR in point patterns. It estimates the cumulative distribution function of the nearest neighbor distances in a point pattern. If the observed G function is close to the theoretical G function under CSR, the point pattern is consistent with spatial randomness.
Moran's I
Moran’s I is a global measure of spatial autocorrelation for continuous or areal data. It measures the correlation between a variable at one location and the same variable at neighboring locations. Moran’s I values range from -1 (negative spatial autocorrelation) to 1 (positive spatial autocorrelation), with values close to 0 indicating no spatial autocorrelation.
Geary's C
: Geary’s C is another global measure of spatial autocorrelation for continuous or areal data. It is based on the difference between values at neighboring locations. Geary’s C values range from 0 (positive spatial autocorrelation) to 2 (negative spatial autocorrelation), with values close to 1 indicating no spatial autocorrelation.
Local Indicators of Spatial Association (LISA)
: LISA is a set of local measures of spatial autocorrelation for continuous or areal data. It identifies local clusters of high or low values and spatial outliers. Common LISA statistics include Local Moran’s I, Local Geary’s C, and Local Getis-Ord G*.
Semivariogram
: The semivariogram is a measure of spatial autocorrelation for continuous data that quantifies the similarity between pairs of data points as a function of the distance between them. It is used to model spatial dependence and is the basis for geostatistical techniques like kriging.
The K function is defined by:
\[K(r) = λ^{-1} * E[N(r)] \quad\quad (1)\]where λ is the point density (total number of points divided by the study area), E[N(r)] is the expected number of points within distance ‘r’ from an arbitrary point, and r is the distance.
When analyzing point patterns to test for CSR, we often assume an underlying Poisson process.The theoretical K function under CSR for a Poisson process is:
\[K(r) = \pi * r^2 \quad\quad (2)\]If the observed K function is close to the theoretical K function (Poisson process), the point pattern is consistent with CSR.
If the observed K function lies above the theoretical K function, the point pattern exhibits clustering.
If the observed K function lies below the theoretical K function, the point pattern exhibits regularity or dispersion.
By comparing the observed K function with the theoretical K function under the CSR assumption, you can assess whether the spatial distribution of points is consistent with a Poisson process or if there is evidence of clustering or regularity in the data.
When working with a sample of data points, the K function will not usually be known, and we will use the Edge-corrected estimation of K function (Ripley’s K function):
\[K(r) = \left(1 / λ\right) * \sum_{i=1}^{n}\left[\sum_{j\neq i}(I(||u_i - u_j|| <= r) / w(||u_i - u_j||))\right] \quad\quad (3)\]where:
The edge correction function is an essential component of Ripley’s K function estimation. It is needed to account for the boundary effects that arise when analyzing point patterns within a finite study area. Points near the boundary of the study area have fewer neighbors than points located in the interior, which can lead to biased estimations of the K function. Note that there are different edge-correction methods available, such as the isotropic correction, which is the default method in Ripley’s K function estimation. The isotropic correction is given by:
\[w(||u_i - u_j||) = 1/(1 - (π * r^2 / A)) \quad\quad (4)\]The L function is derived from the K function and is used to provide a linearized interpretation of the results from the K function. The L function is defined as:
\[L(r) = \sqrt{\frac{K(r)}{\pi}} \quad\quad (5)\]Given a Poisson process, we can substitute Equation 2 into the L function to get the L function under CSR:
\[L_{CSR}(r) = \sqrt{\frac{\pi r^2}{\pi}} = r \quad\quad (6)\]To simplify the interpretation of the L function, we often plot the difference between the L function and the distance r:
\[L(r) - r \quad\quad (7)\]The relationship between the K function and the L function can be summarized as follows:
The L function is derived from the K function to provide a linearized interpretation of the spatial pattern analysis.
The L function under CSR is equal to the distance r, which simplifies the interpretation of the results.
By plotting the difference between the L function and the distance r, we can easily identify deviations from CSR, clustering, or dispersion.
To interpret the L(r) - r plot:
If the L(r) - r plot is close to the horizontal axis (L(r) - r ≈ 0), it suggests Complete Spatial Randomness (CSR).
If the L(r) - r plot lies above the horizontal axis (L(r) - r > 0), it indicates clustering.
If the L(r) - r plot lies below the horizontal axis (L(r) - r < 0), it indicates regularity or dispersion.
In summary, the L function is derived from the K function to provide a more straightforward interpretation of spatial patterns in point data. By comparing the L function plot with the expected L function under CSR (L(r) = r), we can assess the degree of clustering or regularity in the data.
In the following example, we will use Ripley’s K function to analyze the bramblecanes
dataset from the spatstat
library in R. First, Let us visulize the locations of the bramble canes being analysed.
Figure 1: Bramble Cane Locations
Next, we will extimate and plot the observed K function along with the theoretical K function under CSR
Besides of the isotropic correction mentioned in Equation 4, Kest
by defualt outputs results using the other three edge correction below:
Border correction: The border edge correction method adjusts the K function estimate by taking into account the distance between the points and the study region’s boundary. It assumes that the point pattern’s intensity is constant up to the boundary, and beyond the boundary, it drops to zero. The border method is suitable when the study region’s boundary is irregular or complex. The border method works by applying a correction factor to the K function estimates, which is based on the distance of each point to the nearest point on the boundary.
Translation correction: The translate edge correction method adjusts the K function estimate by translating the point pattern to several random locations and averaging the K function estimates obtained from each translation. This method assumes that the point pattern is homogeneous and isotropic, meaning that the intensity and orientation of the point pattern are the same throughout the study region. The translate method is suitable for rectangular or square-shaped study regions with uniform boundaries.
Periodic correction: The periodic edge correction method adjusts the K function estimate by creating periodic copies of the study region and the point pattern. This method assumes that the point pattern is periodic and that the study region’s shape and size repeat indefinitely in all directions. The periodic method works by creating multiple copies of the point pattern and the study region, shifting each copy by a multiple of the study region’s size, and then computing the average of the K function estimates obtained from each copy.
Figure 2: Obeserved K Function vs Theoretical K function (CSR)
Finally, we can compute and plot the L function:
Figure 3: Obeserved L Function vs Theoretical L function (CSR)
The G function, also known as the nearest-neighbor distance distribution function, is defined as the cumulative distribution function of the nearest-neighbor distances between points in the point pattern.
The equation for the G function is:
\[G(r) = \frac{1}{n} \sum_{i=1}^{n}\sum_{j\neq i}^{n}I(d(x_i,x_j) \leq r)\]where G(r) is the value of the G function at distance r, n is the number of points in the point pattern, I() is the indicator function that takes the value 1 if the distance between the ith and jth points in the point pattern is less than or equal to r, and 0 otherwise, and $d(x_i,x_j)$ is the Euclidean distance between the ith and jth points in the point pattern.
In other words, the G function calculates the proportion of points in the point pattern that have at least one nearest neighbor within distance r. The G function can be used to test for spatial clustering or regularity in a point pattern. If the G function values are above or below the expected values under CSR, this suggests clustering or regularity, respectively.
We can use the envelope function to construct the upper and lower bounds of the expected values of the Gest
function under CSR. The envelope function is calculated by repeatedly simulating CSR point patterns within the study region and calculating the G function for each simulated pattern. The envelope function is then constructed by calculating the upper and lower percentiles of the simulated G function values for each distance. The resulting upper and lower envelopes are used to compare the observed Gest function values with the expected values under CSR.
Figure 4: Obeserved G Function vs Theoretical G function with Envelope (CSR)
The spatialepi
package in R contains several datasets related to the spatial analysis of epidemiological data. One of these datasets is the pennLC
dataset, which contains information on lung cancer incidence rates in Pennsylvania by county in 2002.
Figure 5: Pennsylvania Smoking Rates
Figure 6: Randomized Pennsylvania Smoking Rates
Moran’s I is calculated using the formula:
\[I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}} \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]where $n$ is the number of observations, $w_{ij}$ is the spatial weight between observation $i$ and observation $j$, $x_i$ is the value of the variable of interest for observation $i$, and $\bar{x}$ is the mean value of the variable of interest.
The first term in the formula represents the spatial scale of analysis, or the effective number of neighboring observations. It is equal to $n$ divided by the sum of the spatial weights matrix $W$, which represents the total number of neighboring observations across all observations.
The second term in the formula represents the spatial autocovariance of the variable of interest. It is calculated as the sum of the product of the spatial weight $w_{ij}$ between each pair of observations $i$ and $j$, and the difference between the variable value $x_i$ for observation $i$ and the mean value $\bar{x}$ of the variable across all observations, and the same difference for observation $j$. This term quantifies the degree to which similar values of the variable tend to be clustered together in space.
The third term in the formula represents the overall variance of the variable of interest. It is calculated as the sum of the squared difference between the variable value $x_i$ for each observation $i$ and the mean value $\bar{x}$ of the variable across all observations. This term represents the total amount of variation in the variable across all observations.
To calculate the spatial weights $w_{ij}$ for Moran’s I, there are several methods available, but the most common ones include the following:
where $d_{ij}$ is the Euclidean distance between observations $i$ and $j$, and $p$ is a positive constant that controls the rate of decrease of the weight with distance.
where $K$ is the kernel function, $d_{ij}$ is the Euclidean distance between observations $i$ and $j$, and $h$ is a positive constant that controls the bandwidth of the kernel. Commonly used kernel functions include the Gaussian, uniform, and triangular kernels.
where $x_j$ is the value of the variable for observation $j$, $\bar{x}_{N_i}$ is the mean value of the variable for all observations in the neighborhood of observation $i$ (excluding observation $i$ itself), and $N_i$ is the set of neighbors of observation $i$. This method accounts for spatial autocorrelation in the variable of interest by estimating the local variance of the variable based on neighboring observations.
Note that the lagged mean method assumes that the relationship between neighboring observations is linear and stationary. It may not be appropriate for non-linear or non-stationary