Salmond explains his Chinese name data analysis

Rob Salmond has posted further explanation for his data analysis of the Auckland property sales on his own Polity blog. His site has been mostly unloadable so I’ll post it entirely here as an alternate source.

How Labour estimated ethnicity from surnames

In response to requests via Twitter, this post walks readers through the general method Labour used to predict the ethnicity of Auckland house buyers from their surnames. This analysis was featured in the New Zealand Herald’s lead story yesterday.

Note that there are two points in this explanation where I will refuse to go into further detail, in order to protect Labour IP. The rest of this explanation has been made publicly in various venues already, so this post does not give away any new secrets.

Part 1: 2014 demographic study

Pre-election, Labour estimated the ethnicity of every person on the electoral roll, via standard Bayesian updating. There are 3.2 million people on the roll. This was one of many demographic estimates we did for everyone in the country. Most serious political parties now engage in this kind of demographic profiling.

To estimate ethnicity, we used public NZ census data on the ethnic distribution of neighbourhoods, and also used data we developed privately about the ethnic distribution of last, middle, and first names in New Zealand. We followed some advice – especially about estimating Asian ethnicities – from prominent US academic studies. I won’t be describing that process further, as that is sensitive IP for Labour.

Using these data, our base method was to estimate people’s ethnicity in a three-step Bayesian analysis:

  • Step 1: Prior: Neighbourhood ethnic distribution. New information: Lastname distribution. Posterior: Neighbourhood / lastname ethnic distribution.
  • Step 2: Prior: Neighbourhood / lastname ethnic distribution. New information: Firstname distribution. Posterior: Neighbourhood / lastname / firstname ethnic distribution.
  • Step 3: Prior: Neighbourhood / lastname / firstname ethnic distribution. New information: Middlename distribution. Posterior: Neighbourhood / lastname / firstname / middlename ethnic distribution.

This process provides a distribution of the likely ethnicities of each person in New Zealand, given their address and their full name.
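As an illustration only (this is not Labour's code, and the post does not publish the underlying name data), the three-step updating can be sketched as below. The sketch assumes each name part contributes a likelihood P(name | ethnicity) and that the name parts are conditionally independent given ethnicity. The function `bayes_update` and every number in the example are hypothetical.

```python
def bayes_update(prior, likelihood):
    """One Bayesian step: posterior is proportional to prior * P(evidence | ethnicity)."""
    unnormalised = {eth: p * likelihood.get(eth, 0.0) for eth, p in prior.items()}
    total = sum(unnormalised.values())
    if total == 0.0:
        # Evidence rules out every ethnicity (e.g. an unseen name): keep the prior.
        return dict(prior)
    return {eth: v / total for eth, v in unnormalised.items()}


# All numbers below are hypothetical, chosen only to show the mechanics.
neighbourhood_prior = {"European": 0.60, "Maori": 0.10, "Chinese": 0.20, "other": 0.10}
lastname_likelihood = {"European": 0.001, "Maori": 0.0005, "Chinese": 0.020, "other": 0.001}
firstname_likelihood = {"European": 0.002, "Maori": 0.002, "Chinese": 0.010, "other": 0.002}
middlename_likelihood = {"European": 0.001, "Maori": 0.001, "Chinese": 0.005, "other": 0.001}

# Step 1 uses the last name, step 2 the first name, step 3 the middle name.
posterior = neighbourhood_prior
for likelihood in (lastname_likelihood, firstname_likelihood, middlename_likelihood):
    posterior = bayes_update(posterior, likelihood)

for ethnicity, probability in posterior.items():
    print(f"{ethnicity}: {probability:.3f}")
```

Each pass uses the previous posterior as the next prior, which mirrors the step 1 to step 3 chain described above.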

The distribution covered the probability that a person was each of the following ethnicities, drawn from the level 1 and level 2 ethnic classifications from the New Zealand census: European, Maori, Pacific (not further defined), Pacific (Samoan), Pacific (Tongan), Asian (not further defined), Asian (Chinese), Asian (Japanese), Asian (Korean), Asian (South Asian), Asian (Middle East), other.

For the person-level point estimates, we used the largest single probability. That probability was typically above 0.9.
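A minimal sketch of that point-estimate rule, assuming a person's distribution is held as a mapping from ethnicity to probability. The helper name `point_estimate` and the example values are hypothetical.

```python
def point_estimate(distribution):
    """Return the most probable ethnicity and its probability."""
    ethnicity = max(distribution, key=distribution.get)
    return ethnicity, distribution[ethnicity]


# Hypothetical person-level distribution.
person = {"European": 0.03, "Maori": 0.00, "Chinese": 0.95, "other": 0.02}
label, probability = point_estimate(person)
print(label, probability)
```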

We refined these estimates further with three tweaks to account for moderate issues we encountered estimating certain ethnicities. I won’t be describing those tweaks further, because IP.

We then tested our predictions against a more-or-less-random sample of around 3,500 known New Zealanders for whom we had ethnicity data. Our best predictions, which we have used since, were 94.8% accurate.

This is an important point. Having developed our method for estimating ethnicity, we then tested it for accuracy against real world data. Only once we were satisfied it was accurate were we willing to rely on it in our work.

Part 2: Applying the predictions to housing data

To apply our general predictions, derived in part 1 above, to the Auckland housing data, we followed a two-step process.

First, we collapsed the 1.4 million Auckland-based ethnic estimates we had by surname only, as surnames are the only identifying information in the real estate data. Because those person-level estimates were built from first names, middle names, and locations as well, the surname-based estimates still partly leverage that earlier electoral roll-based information.
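The post does not say exactly how the collapse was done. One plausible reading, shown here purely as an assumption, is to average the person-level distributions of everyone who shares a surname. The function `collapse_by_surname` and the input records are hypothetical.

```python
from collections import defaultdict


def collapse_by_surname(people):
    """Average person-level ethnicity distributions within each surname.

    `people` is an iterable of (surname, distribution) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for surname, distribution in people:
        counts[surname] += 1
        for ethnicity, probability in distribution.items():
            sums[surname][ethnicity] += probability
    return {
        surname: {eth: total / counts[surname] for eth, total in by_eth.items()}
        for surname, by_eth in sums.items()
    }


# Hypothetical person-level estimates for two people who share a surname.
people = [
    ("LEE", {"European": 0.9, "Chinese": 0.1}),
    ("LEE", {"European": 0.1, "Chinese": 0.9}),
]
surname_table = collapse_by_surname(people)
print(surname_table["LEE"])
```

Under this reading, a surname shared by people with confidently different estimates ends up with a mixed distribution.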

Most of the surnames pointed strongly (pr>0.9) to one and only one ethnicity, although there were some examples with more mixed predictions. The collapse produced estimates such as the following (these are the real values):

Name     pr(European)  pr(Maori)  pr(Chinese)  pr(other)
JONES    0.938         0.054      0.001        0.007
HOTERE   0.048         0.887      0.000        0.065
LEE      0.481         0.027      0.400        0.092
LI       0.028         0.001      0.957        0.014

Having done that for each individual purchaser, we then summed the probabilities across all 3,922 sales in the dataset. This provided an aggregate estimate, based on the distributions of likely ethnicities in each individual sale, for the overall ethnic distribution of house buyers in Auckland.
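To make the summing step concrete, here is a small sketch that uses the four surname rows quoted above. The four-sale dataset is a hypothetical toy example, not the real 3,922 sales, and the function name `sum_probabilities` is an assumption.

```python
# Surname-level probabilities as quoted in the table above.
surname_table = {
    "JONES": {"European": 0.938, "Maori": 0.054, "Chinese": 0.001, "other": 0.007},
    "HOTERE": {"European": 0.048, "Maori": 0.887, "Chinese": 0.000, "other": 0.065},
    "LEE": {"European": 0.481, "Maori": 0.027, "Chinese": 0.400, "other": 0.092},
    "LI": {"European": 0.028, "Maori": 0.001, "Chinese": 0.957, "other": 0.014},
}


def sum_probabilities(sales, table):
    """Sum each ethnicity's probability over all sales (one surname per sale)."""
    totals = {}
    for surname in sales:
        for ethnicity, probability in table[surname].items():
            totals[ethnicity] = totals.get(ethnicity, 0.0) + probability
    return totals


# Hypothetical toy dataset: four sales, one buyer surname each.
sales = ["JONES", "HOTERE", "LEE", "LI"]
totals = sum_probabilities(sales, surname_table)
shares = {eth: total / len(sales) for eth, total in totals.items()}

for ethnicity in totals:
    print(f"{ethnicity}: expected buyers {totals[ethnicity]:.3f}, share {shares[ethnicity]:.1%}")
```

The point of summing probabilities rather than hard labels is that an ambiguous surname such as LEE contributes fractionally to several groups instead of being assigned wholly to one.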

In doing this aggregation, we tested various ways of accounting for the fact that some sales had one surname attached, while others had two or even three, where multiple people with different surnames purchased a property together. No matter how we cut those observations, the overall pattern remained within 2% of the numbers that appeared in the New Zealand Herald.
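The post does not list which weighting schemes were tried. As a hedged illustration, two obvious candidates are giving every sale a total weight of one (split equally among its surnames) and counting every surname equally. The function `aggregate_shares`, its `per_sale` flag, and the toy sales are all hypothetical; the surname probabilities are the ones quoted earlier.

```python
surname_table = {
    "JONES": {"European": 0.938, "Maori": 0.054, "Chinese": 0.001, "other": 0.007},
    "LEE": {"European": 0.481, "Maori": 0.027, "Chinese": 0.400, "other": 0.092},
    "LI": {"European": 0.028, "Maori": 0.001, "Chinese": 0.957, "other": 0.014},
}


def aggregate_shares(sales, table, per_sale=True):
    """Aggregate ethnicity shares over sales that may list several surnames.

    per_sale=True: each sale has weight 1, split equally among its surnames.
    per_sale=False: each surname has weight 1, so multi-buyer sales count more.
    """
    totals = {}
    for surnames in sales:
        weight = 1.0 / len(surnames) if per_sale else 1.0
        for surname in surnames:
            for ethnicity, probability in table[surname].items():
                totals[ethnicity] = totals.get(ethnicity, 0.0) + weight * probability
    grand_total = sum(totals.values())
    return {eth: total / grand_total for eth, total in totals.items()}


# Hypothetical toy dataset: one single-surname sale and one two-surname sale.
sales = [["JONES"], ["LI", "LEE"]]
by_sale = aggregate_shares(sales, surname_table, per_sale=True)
by_surname = aggregate_shares(sales, surname_table, per_sale=False)
print(by_sale)
print(by_surname)
```

Comparing the two outputs is a simple robustness check of the kind described: if the shares barely move between schemes, the weighting choice is not driving the result. On a toy dataset this small the schemes differ noticeably, which is expected.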

It is that overall distribution, not data cherry-picked from any particular sale, that we then compared with various other aggregate datasets about the ethnic distribution of Auckland residents, or various subsets of Auckland residents. Many of those comparisons are detailed in the Herald article and in my Public Address blog post yesterday.