On Racial and Ethnic Data Collection
By Rukan Saif

In 1932, the United States Public Health Service partnered with the Tuskegee Institute to begin a study recording syphilis infections in African-American males, with the aim of justifying different treatment for black people. The 600 men enrolled, 399 of whom had syphilis, were unaware of what the study was documenting. Originally projected to last six months, the study ran for 40 years, ending in 1972. Worse, in 1947, when penicillin was deemed the drug of choice, the researchers chose not to offer it to the subjects who had syphilis.
The Tuskegee Experiment is one of many grotesque examples of scientific racism, a concept that germinated during the Enlightenment Era and posited that there were biological disparities from race to race. Beginning with Carl Linnaeus in 1759, this idea evolved to spuriously assert that race-specific diseases dwelled in certain populations, and, as Fredrickson explained in Racism: A Short History, that “these groups could be sorted into a racial taxonomy that reflected gradations of human worth.” Thankfully, the majority of the scientific and public health communities have divorced themselves from the notion that biological discrepancies exist between races and have instead agreed that socioeconomic determinants (such as income level, educational opportunities, workplace safety, and environmental conditions) impact health and healthcare quality. These factors disproportionately affect minority communities. According to the Kaiser Family Foundation, 21.8% of American Indians and Alaska Natives, 19.0% of Latinos, 11.5% of African Americans, and 9.3% of Native Hawaiians or Other Pacific Islanders are uninsured, increasing the risk of undetected health problems.
To comprehensively identify and address socioeconomic disparities among minority peoples, proper racial and ethnic data collection and analysis are necessary. The benefits are manifold; with this data, public health officials can “[identify] population needs...develop hypotheses about the potential causes of health disparities...evaluate the effectiveness of existing initiatives…[and report] on the health status of the population.” However, the process of data collection is inherently nuanced and convoluted, and several practices are essential to collating dependable, authentic data, including, but not limited to, recording language preference, snowball sampling, and OCAP.
In a survey of Los Angeles healthcare providers, more than 50% believed that their patients could not comply with medical treatments because of linguistic or cultural barriers. Even in Mandarin, the most-spoken language in the world, certain words can get lost in translation. For example, Google translates “palliative care” as “do-nothing care” and “hospice” as “last-minute care.” It is therefore unsurprising that the health community has a dearth of racial and ethnic data, especially when subjects cannot understand the questions. For this reason, the public health sector must adopt methods to dismantle language barriers. At a Kaiser Permanente facility in northern California, patients cannot use “Kaiser data systems...unless the language preference field has been filled in….[and] information sheets in the patient's native language describing...condition or treatment can...be provided.” Making dependable translation services more accessible will help non-English speakers thoroughly understand whatever study is being conducted and lower their apprehension about giving out information, which in turn expedites the data collection process.
The next implementation, snowball sampling, is particularly helpful for ethnic groups with low populations. Snowball sampling is a non-probability method that uses an individual’s networks to recruit other members who fit the study criteria. This strategy works well when attempting to gather data on hard-to-reach groups because it relies on existing social ties within the community, although the resulting samples are not easily generalizable. Snowball sampling could be used to collect data on the 1,299-person Konso community in South-Central Ethiopia.
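The referral process described above can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not a prescribed survey protocol: the referral network, the participant labels, the eligibility rule, and the function name snowball_sample are all hypothetical, and a real study would gather referrals through interviews rather than a lookup table.

```python
from collections import deque


def snowball_sample(referrals, seeds, is_eligible, max_size):
    """Recruit participants wave by wave through referrals.

    referrals   -- hypothetical mapping: person -> people they can refer
    seeds       -- initial participants contacted directly by researchers
    is_eligible -- function returning True if a person fits the study criteria
    max_size    -- stop once this many participants have been recruited
    """
    recruited = []
    seen = set(seeds)          # everyone already contacted or queued
    queue = deque(seeds)       # people waiting to be screened, in referral order

    while queue and len(recruited) < max_size:
        person = queue.popleft()
        if not is_eligible(person):
            continue           # ineligible people are not enrolled or asked to refer
        recruited.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return recruited


# Hypothetical referral network; the letters stand in for community members.
referrals = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["F"],
}
eligible = {"A", "B", "C", "D", "F"}

sample = snowball_sample(referrals, ["A"], lambda p: p in eligible, max_size=4)
print(sample)
```

In this toy network, F fits the criteria but is only reachable through E, who does not. F is therefore never recruited, which is one concrete way to see why snowball samples are not easily generalizable: who ends up in the sample depends on who the seeds happen to know.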
Finally, when partnering with ethnic or racial groups, public health surveillance ought to adopt OCAP, which stands for ownership, control, access, and possession, as set out by the First Nations Information Governance Centre. This set of principles is the de facto standard for conducting research with First Nations, but its foundation can be applied to any ethnic or racial minority community. The first principle, ownership, “states that a community or group owns information collectively in the same way that an individual owns his or her personal information.” The second, control, affirms that the community being researched has control over how the research, review processes, planning processes, and management are conducted. Access indicates “the right of...communities to manage and make decisions regarding access to their collective information.” Lastly, possession refers to the physical ownership of the data. At its core, OCAP holds that the community comes first; its members will be “leading” the study without physically conducting the research.
This process must be approached with deliberation. To maintain integrity, the data should be framed with context, both in collection and in analysis. During collection, many people avoid answering surveys because they are unaware of what information is being collated and what will be done with it. In an article about racism and data collection in The Atlantic, Robyn Autry admitted she has reservations about completing the U.S. Census, saying that whether she decides to complete any questionnaire “depends on [her] understanding of why the information is being collected.” During analysis, context is essential to avoid regressing towards the scientific racism of the 18th century. Also during analysis, statisticians need to avoid geocoding. Though a long-admired shortcut, geocoding can lead to erasure and inaccuracies, as it essentially guesses a subject’s race based on the reported address and surname. Lastly, databases should be examined annually for faulty data or data misconstruction; a report published on the NCBI Bookshelf states that “methods of quality control include reabstraction and/or recontact studies for specific databases, routine or special feedback of summaries of collected data to collectors, and linkage studies across databases.”
Because of the world’s deplorable history with race and culture, racial and ethnic data collection can seem like uncharted territory. There is much ground to cover, and many stigmatized groups to reach out to. However, by adopting strategies like translation access, snowball sampling, and OCAP, the public health community can identify socioeconomic distinctions and, hopefully, begin bridging the healthcare gap.
References:
https://catalyst.nejm.org/doi/full/10.1056/CAT.17.0312
https://www.pewresearch.org/methods/2015/11/12/the-unique-challenges-of-surveying-u-s-latinos/
https://www.geripal.org/2019/05/lost-in-translation-googles-translation-of-palliative-care.html
https://www.theatlantic.com/technology/archive/2017/11/how-racial-data-gets-cleaned/541575/
https://www.policylink.org/sites/default/files/Counting_a_Diverse_Nation_08_15_18.pdf
https://www.ncbi.nlm.nih.gov/books/NBK215740/
https://www.ncbi.nlm.nih.gov/books/NBK215749/