Human Sciences, Statistics, and R

The use of statistics has long been important in the human sciences. An early example is an analysis by William Sealy Gosset (alias “Student”) of biometric data obtained by Scotland Yard around 1900. The heights of 3,000 male criminals fit a bell curve almost perfectly:


Histogram © A. H. Dekker, produced using R software

Standard statistical methods allow the identification of correlations, which mark possible causal links:


XKCD teaches us that “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.’”
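As a rough illustration of what "identifying a correlation" looks like in R, here is a minimal sketch using base R's cor.test. The data are simulated, and the variables x and y are hypothetical stand-ins rather than anything from the studies mentioned in this post:

```r
# Simulated data only: two hypothetical variables with a built-in linear relationship.
set.seed(42)                        # make the simulation reproducible
n <- 100
x <- rnorm(n)                       # hypothetical first variable
y <- 0.8 * x + rnorm(n, sd = 0.5)   # hypothetical second variable, related to x by construction

# Pearson correlation test: estimates r and tests the null hypothesis that r = 0.
result <- cor.test(x, y)
print(result$estimate)   # sample correlation coefficient
print(result$p.value)    # chance of a correlation at least this strong if the true r were 0
```

A small p-value here says only that the association is unlikely to be a fluke of sampling. It says nothing about which variable, if either, is the cause.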

Newer, more sophisticated statistical methods allow the exploration of time series and spatial data. For example, this project looks at the spatial distribution of West Nile virus (WNV) – which disease clusters are significant, and which are merely tragic coincidence:


Distribution of significant clusters of human WNV in the Chicago region, from Ruiz et al.

SPSS has long been the mainstay of statistical analysis in the human sciences, but many newer techniques are better supported in the free R toolkit. For example, this paper discusses detecting significant clusters of diseases using R. The New York Times has commented on R’s growing popularity, and James Holland Jones points out three advantages: R is used by the majority of academic statisticians (and hence incorporates the newest developments in statistics), it has good help resources, and it makes really cool graphics.


A really cool graph in R, using the ggplot2 R package (from Jeromy Anglim’s Psychology and Statistics Blog)

An increasing quantity of human-science-related instructional material is available in R, including:

Through the igraph, sna, and other packages (and the statnet suite), R also provides easy-to-use facilities for social network analysis, a topic dear to my heart. For example, the following code defines the valued centrality measure proposed in this paper:

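library("igraph")

# Valued centrality: the mean reciprocal distance from each node to all others.
valued.centrality <- function (g) {
  # A node's zero distance to itself contributes nothing; unreachable
  # nodes are at distance Inf, and 1/Inf is 0 in R.
  recip <- function (x) if (x == 0) 0 else 1/x
  f <- function (r) sum(sapply(r, recip)) / (length(r) - 1)
  # distances() replaces shortest.paths(), which is deprecated in igraph.
  apply (distances(g), MARGIN=1, f)
}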

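This definition has the advantage of handling disconnected network components: a node that cannot be reached is at infinite distance, so its reciprocal contributes zero to the sum. We can therefore use these centrality scores to add colour to a standard plot (using the igraph package within R):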


Social network diagram, produced using R software, coloured using centrality scores
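For readers who want to try this, here is a minimal sketch of one way to produce such a plot. It is not necessarily how the diagram above was made. It restates valued.centrality so that the snippet runs on its own, substituting distances() for shortest.paths(), which newer igraph releases deprecate. The five-node network, the ten-step palette, and the binning with cut() are all assumptions chosen for illustration:

```r
library("igraph")

# Restated from the post so this snippet is self-contained.
valued.centrality <- function(g) {
  recip <- function(x) if (x == 0) 0 else 1 / x
  f <- function(r) sum(sapply(r, recip)) / (length(r) - 1)
  apply(distances(g), MARGIN = 1, f)
}

# Hypothetical example network: a path A-B-C plus a separate pair D-E,
# so the graph has two disconnected components.
g <- graph_from_literal(A - B - C, D - E)

# One score per node; unreachable nodes simply add zero to the sum.
scores <- valued.centrality(g)
print(round(scores, 3))

# Assumed colouring scheme: bin the scores into ten steps of a
# light-to-dark palette, so more central nodes are drawn darker.
palette <- colorRampPalette(c("lightyellow", "red"))(10)
bins <- cut(scores, breaks = 10, labels = FALSE)

plot(g, vertex.color = palette[bins], vertex.label.color = "black")
```

In this toy network, B scores 0.5 (two neighbours at distance 1, divided by the four other nodes), A and C score 0.375, and D and E score 0.25, even though no path connects the two components.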

– Tony

6 Responses to Human Sciences, Statistics, and R

  1. Daniel Digby says:

    …and I’ll bet that any male falling in that height range is more likely to be a criminal. I hope our Homeland Security uses that to flag suspicious plane passengers.

  2. Tony says:

    Well, I think that height range (an average of about 3 inches below current English heights) probably reflects malnutrition due to poverty.

