Human Sciences, Statistics, and R

January 6, 2013

The use of statistics has long been important in the human sciences. An early example is an analysis by William Sealy Gosset (alias “Student”) of biometric data obtained by Scotland Yard around 1900. The heights of 3,000 male criminals fit a bell curve almost perfectly:


Histogram © A. H. Dekker, produced using R software
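
A comparison like this can be sketched in a few lines of base R. The heights below are simulated rather than the original Scotland Yard measurements, and the mean and standard deviation are made-up values chosen only for illustration:

```r
# Simulated stand-in for the heights data (NOT the real measurements);
# the mean and sd below are hypothetical values in centimetres.
set.seed(42)
heights <- rnorm(3000, mean = 166, sd = 6.5)

# Histogram on the density scale, so a normal curve can be overlaid
hist(heights, breaks = 30, freq = FALSE,
     main = "Simulated heights", xlab = "Height (cm)")

# Overlay the normal density using the sample mean and standard deviation
curve(dnorm(x, mean = mean(heights), sd = sd(heights)), add = TRUE, lwd = 2)
```

With real data, a normal quantile plot (qqnorm followed by qqline) would be a stricter check than judging the fit by eye.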

Standard statistical methods allow the identification of correlations, which mark possible causal links:


XKCD teaches us that “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.’”
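
As a minimal sketch of what identifying a correlation looks like in R, the following uses the base function cor.test on simulated data. The variables x and y are invented for the example, and y is deliberately constructed to depend on x:

```r
# Simulated data: y is built from x plus noise, so a correlation is expected
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.5)

# Pearson correlation test (the default method for cor.test)
result <- cor.test(x, y)

print(result$estimate)   # sample correlation coefficient
print(result$p.value)    # p-value for the null hypothesis of zero correlation
print(result$conf.int)   # 95% confidence interval for the correlation
```

A small p-value here says only that the association is unlikely to be chance. It says nothing about which variable, if either, is driving the other.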

Newer, more sophisticated statistical methods allow the exploration of time series and spatial data. For example, this project looks at the spatial distribution of West Nile virus (WNV) – which disease clusters are significant, and which are merely tragic coincidence:


Distribution of significant clusters of human WNV in the Chicago region, from Ruiz et al.

SPSS has been the mainstay of statistical analysis in the human sciences, but many newer techniques are better supported in the free R toolkit. For example, this paper discusses detecting significant clusters of diseases using R. The New York Times has commented on R’s growing popularity. James Holland Jones points out that R is used by the majority of academic statisticians (and hence includes the newest developments in statistics), that it has good help resources, and that it makes really cool graphics.


A really cool graph in R, using the ggplot2 R package (from Jeromy Anglim’s Psychology and Statistics Blog)

An increasing quantity of human-science-related instructional material is also based on R.

Through the igraph, sna, and other packages (and the statnet suite), R also provides easy-to-use facilities for social network analysis, a topic dear to my heart. For example, the following code defines the valued centrality measure proposed in this paper:

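library("igraph")
# Valued centrality: each vertex's average reciprocal distance to the other vertices
valued.centrality <- function (g) {
  # Reciprocal distance; a vertex's zero distance to itself contributes nothing
  recip <- function (x) if (x == 0) 0 else 1/x
  # Average over the other length(r) - 1 vertices; unreachable ones give 1/Inf = 0
  f <- function (r) sum(sapply(r, recip)) / (length(r) - 1)
  # One row of the distance matrix per vertex (distances() was shortest.paths() before igraph 1.0)
  apply (distances(g), MARGIN=1, f)
}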

This definition has the advantage of allowing disconnected network components, since an unreachable vertex lies at infinite distance and so contributes a reciprocal of zero. We can therefore use these centrality scores to add colour to a standard plot (using the igraph package within R):


Social network diagram, produced using R software, coloured using centrality scores
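
A colouring of this kind might be produced roughly as follows. This is only a sketch: the five-vertex network, the palette, and the number of colour bins are arbitrary choices for illustration, and the centrality function is restated in a compact vectorised form so that the example runs on its own (assuming igraph 1.0 or later):

```r
library(igraph)

# Compact restatement of the measure: mean reciprocal distance to the other vertices
valued.centrality <- function(g) {
  d <- distances(g)                  # Inf for pairs in different components
  inv <- ifelse(d == 0, 0, 1 / d)    # self-distance counts as 0; 1/Inf is 0
  rowSums(inv) / (ncol(d) - 1)
}

# Toy network (hypothetical): a path A-B-C plus a separate pair D-E
g <- graph_from_literal(A-B-C, D-E)
scores <- valued.centrality(g)

# Map scores onto a 10-step palette from pale yellow (low) to red (high)
shades <- colorRampPalette(c("lightyellow", "red"))(10)
bins <- cut(scores, breaks = 10, labels = FALSE)

plot(g, vertex.color = shades[bins], vertex.label.color = "black")
```

In this toy network the middle vertex B gets the highest score, and the isolated pair D and E get the lowest but still finite scores, so every vertex receives a colour even though the graph is disconnected.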

– Tony