In order to make Facebook as open and connected as possible for everyone, one of our goals is to understand how different populations of users join and use the service. With that objective in mind, the Facebook Data team recently sought to answer the question, "How diverse are the ethnic backgrounds of the people using Facebook?"
While I applaud their efforts to determine how well they represent the population at large, I question certain of the Facebook Data team's techniques. The primary method of identifying users as a given ethnicity or race for the study is by a user's reported last name. This methodology is based on the correlation of last names to self-reported ethnicity or race in the US Census statistics. Short of actually asking users to self-report their data, this approach seems reasonable. (I'll say a bit more about why I favor self-reporting later.)
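To make the surname approach concrete, here is a minimal sketch of how aggregate composition could be estimated from last names alone. It is not Facebook's actual code. The `SURNAME_PRIORS` table, the group labels, and every probability in it are invented for illustration and are not taken from the Census surname data; `estimate_composition` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical P(group | surname) table. All values are invented for
# illustration and are NOT taken from the Census surname statistics.
SURNAME_PRIORS = {
    "garcia": {"hispanic": 0.90, "white": 0.06, "other": 0.04},
    "smith": {"white": 0.70, "black": 0.25, "other": 0.05},
    "nguyen": {"asian": 0.95, "other": 0.05},
}


def estimate_composition(surnames):
    """Estimate each group's share by summing per-user probabilities."""
    totals = Counter()
    matched = 0
    for name in surnames:
        probs = SURNAME_PRIORS.get(name.strip().lower())
        if probs is None:
            continue  # surnames missing from the table are skipped
        matched += 1
        for group, p in probs.items():
            totals[group] += p
    if matched == 0:
        return {}
    # Divide by the number of matched users so the shares sum to 1.
    return {group: total / matched for group, total in totals.items()}


users = ["Smith", "Garcia", "Nguyen", "Smith", "Unknownname"]
shares = estimate_composition(users)
```

Note that no individual is ever labeled here; only the totals are estimated, which is why the quality of the surname table matters so much.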
However, what Facebook refers to as a mixture-modeling technique seems a bit sketchy. By their definition, they "back solve" for name based on ethnicity. This is circular: one has to know a variable (in this case, race or ethnicity) in order to use it as a given. Certainly, using this back-solving method to cross-check data is valid. If one assumes that the makeup of Facebook does, indeed, parallel the (self-reported) ethnic and racial makeup reflected in the Census statistics, then checking whether the study data correlates with the Census data is a valid way to verify the categorization assumptions of the study. However, by both reporting correlation with the Census statistics as a result and using the same statistics to "refine" the results, the Facebook Data team has skewed the results to be highly self-referential.
The study results are also adjusted based on Internet adoption rates as reported by the National Telecommunications and Information Administration (NTIA). There are two problems with this adjustment. The first is that, like the back-solving technique, using Internet adoption rates is self-referential: the researchers are adjusting the results by a number that they are also reporting as a result. If one adjusts by a given variable, one would most certainly expect the results to reflect a high correlation with that variable. The second problem with this method of normalizing the results is the NTIA-reported adoption rates themselves, which reflect "households with Internet access." The NTIA figures, which are from a 2007 report, do not reflect recent significant changes in Internet access and Internet access methods. In certain communities, there has been a strong movement toward mobile-only personal access to the Internet. A large number of people in the black community, for example, have personal Internet access only via a mobile device, plus incidental access through public, work, or other borrowed computers. This type of access is not represented in the NTIA report.
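The second problem can be illustrated with a small sketch. It assumes the adjustment amounts to weighting each group's population share by its adoption rate and renormalizing; that is my reading, not a documented Facebook formula. The group names and all numbers are invented, and `expected_online_shares` is a hypothetical helper.

```python
def expected_online_shares(population_shares, adoption_rates):
    """Weight each group's population share by its adoption rate, then renormalize."""
    weighted = {
        group: population_shares[group] * adoption_rates[group]
        for group in population_shares
    }
    total = sum(weighted.values())
    return {group: value / total for group, value in weighted.items()}


# Invented figures, for illustration only.
population = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}
household_adoption = {"group_a": 0.70, "group_b": 0.50, "group_c": 0.40}

# Baseline built from household-only adoption rates.
baseline = expected_online_shares(population, household_adoption)

# Suppose mobile-only access means group_b's real rate is higher than the
# household figure suggests (assumed value).
actual_adoption = dict(household_adoption, group_b=0.65)
revised = expected_online_shares(population, actual_adoption)
```

With these made-up inputs, `revised["group_b"]` comes out higher than `baseline["group_b"]`, so any comparison against the household-based baseline would misjudge how well that group is represented.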
Although back-solving and then adjusting for Internet adoption seems to obscure rather than refine the results, I applaud Facebook's effort to collect this data. The 2010 Census data, along with adoption statistics that more accurately reflect current Internet access capabilities and trends, will provide better data against which to verify future studies. Further, Facebook has also stated that they are looking to capture first names and friend connections as data points. This, too, may yield more finely tuned results.
Ultimately, though, I wonder why Facebook does not simply add an optional (and optionally public) profile statistic for Facebook users to self-report ethnicity and race. If the options are identical to the 2010 Census options, and identically described, one would expect to obtain results that are directly comparable to the Census statistics and therefore a better indicator of whether or not Facebook is representative of the population at large. Furthermore, Facebook could provide users with an option to allow this statistic to be used only in cumulative reporting or also in reporting in conjunction with other demographics, which would provide a significant depth of data for analysis not only by Facebook but also by other social networking researchers. I believe Facebook has work to do here in defining exactly what the purpose of their study is and how best to collect their data.
A last point about race and ethnicity: they are very difficult to identify. In a large, open, multi-racial, multi-ethnic, multi-cultural country such as the US, what does race or ethnicity even mean? Much to our shame, for much of our history as a country, we had significant national and institutional racial barriers to "the good life." Race and ethnicity statistics were collected and people were labeled as a means of exclusion. To a large extent today, race and ethnicity statistics are collected as a means of inclusion. That said, in our multi-everything society, how does one determine race or ethnicity outside of self-identification? As one commenter posts on the study summary site:
"although i may appear black, i think mostly like a white man, feel like a woman, and sleep with a man....facebook team --can you see me and hear me now????"
In my case, I am a prototypical American: a mixture of many ethnicities. The ones we know for sure are American Indian (Cherokee), Irish, French, Scottish, and English, in roughly the order of percentages. Some recent research points to perhaps also some Egyptian. I self-identify as Cherokee, although most people identify me as Irish (perhaps due to my blue eyes, pale skin, and reddish hair) or French (likely due to my first name) or Scots (based on my last name). Culturally, I am American. So am I white? Native American? African-American? Who knows? Although I identify as an American Indian, recognize my Irish looks and roots, and exhibit certain stereotypical characteristics of my Scots and French ancestry, I am not part of those cultures. So how would Facebook categorize me? Isn't my self-identification about as close as we can get?