Shoutout to collaborators Melanie, Allison, Paul, Aaron, Gabrielle, Ian, and more!

My data bestie Emily and I have talked extensively about how being queer has given us the gift of engaging with technology from a perspective of whimsy and creativity. We can imagine structures that don’t exist yet, and we have a lot of practice negotiating with systems that were not made for *us.* Being queer in tech is awesome. HAPPY PRIDE MONTH!

Which brings me to my next topic. The crew over at Data Generation Toolkit (Snowfakery), which I’ve written about here and here and here, has been plugging away at the existential question of how to make datasets where intrinsically diverse names and nonbinary names are the norm. While there are many vectors of diversity that are important to consider, in this blog post I will be addressing race and gender specifically. (Late addition: I move between addressing race and gender rather fluidly in this blog post. I decided not to edit this into chunks, as I prefer for us to be able to talk about diversity in a way that feels natural, and because all types of diversity are interconnected.)

There already exist methods for incorporating new datasets into Snowfakery, so that part is relatively easy, at least for a developer. The bigger problem is, what should these datasets consist of… and how should they be interpreted? This, of course, is a matter of context, examining bias, and never being “done.”

generating “fake” names

Let’s start by examining what exists today:

Snowfakery’s default is to draw data from a Python library known as Faker, which maintains lists of fake data of dozens of types: names, addresses, paragraphs, and so on. The Faker maintainers have strict rules for the data the library relies on. [I was going to write more about the data source requirements here, but I can’t find a reliable reference, oh well!]

When it comes to names, they truly aren’t fake. They are “real” names, detached from the identities of actual people. From this list of real names, one or more names will be randomly selected. In the underlying data, each name is assigned a gender, which allows data consumers to ask for “Female Names” or “Male Names.” Eventually, Faker also introduced a method for returning “Nonbinary Names.” (Yes, I am aware that Female/Male are sex assignments and are not relevant to names, but that is how the data source refers to gender at the time of writing.)
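For the curious, here’s roughly what that looks like from the data consumer’s side, using the Python faker package. These method names reflect the library as I understand it at the time of writing; check the current Faker docs if you’re following along:

    from faker import Faker

    fake = Faker()  # defaults to the en_US locale

    # The long-standing gendered providers:
    print(fake.first_name_female())
    print(fake.first_name_male())

    # The newer nonbinary providers:
    print(fake.first_name_nonbinary())
    print(fake.name_nonbinary())  # full name, drawing on the nonbinary pool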

So how does Faker generate fake names? Well, it starts with a list of real names, generally from the US Census or the Social Security Administration. At least it’s a reliable source, right?

In the file that currently generates nonbinary names, there are two different descriptive notes:

    # Top 200 [first] names of the decade from the 60's-90's from:
    # https://www.ssa.gov/OACT/babynames/decades/names1960s.html
    # Weightings derived from total number on each name

    # Top 1000 US surnames from US Census data
    # Weighted by number of occurrences
    # By way of http://names.mongabay.com/data/1000.html on 2/10/2016
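Those weightings mean each name is drawn in proportion to its historical count. Here’s a minimal sketch of that kind of weighted sampling; the names and counts below are made up for illustration, not the actual SSA figures:

    import random

    # Hypothetical (name, count) pairs standing in for the SSA decade lists.
    first_names = [("Michael", 833_216), ("Lisa", 496_975), ("Xiomara", 1_102)]
    names, counts = zip(*first_names)

    # random.choices() samples proportionally to the weights, so "Michael"
    # surfaces hundreds of times more often than "Xiomara."
    print(random.choices(names, weights=counts, k=10))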

Readers with a critical eye for issues of equity might notice a few problems here. This is what the Data Generation Toolkit team came up with:

  • ranking name lists by popularity introduces/reproduces bias toward the racial/ethnic majority/plurality at the time the names list was generated
  • if the USA is getting more racially diverse over time, then selecting names from 1960s-1990s will be a less diverse pool than 1990s-2020s (assuming a 30 year span)
  • the criterion for generating nonbinary names is to select any name from two lists of gender-assigned names (whether this is even a problem is contested!)
  • sample generated names suggest names commonly associated with Northern European and North American white demographics

MAJOR CAVEAT – the 4th point above says as much about US, the data consumers, as about the names themselves. Why do we associate names like “John” or “Sally” with white people? One source that I consulted for this post reinforced my inkling that white data consumers may expect People of Color to have what we register as “uncommon” or “ethnic” names when that is NOT NECESSARILY TRUE (emphasis my own). And we know this is true when white hiring managers skip over resumes with names that are inferred to belong to Black applicants. We have a lot of baggage, judgement, and bias when it comes to names – and that’s not the data’s fault.

    Because these border changes and immigrations happened decades and centuries ago, surnames in the United States are no longer limited to just one race. Therefore you cannot guess someone’s race based on their [surname].

    For example, the name Smith, is the most popular surname in the US, and 70% of the people who have it are white, 23% are black, and the remaining 7% is made up of a variety of different races such as Asian, Pacific Islander, and Hispanic.

    The point is, surnames, while coming from many different origins, no longer have anything to do with race.

meandering through unanswerable questions

So, now we have two issues that need to be addressed:

  1. Implicit bias in the data, which likely oversamples names from white people or white-assumed/assimilated people
  2. Implicit bias in the data consumer who imagines a white person when seeing a name appear on a screen

Each of these issues could be infinitely unpacked, which I don’t have time to do in this blog post. But I do want to introduce two even more existential questions about generating name data from a perspective of centering diversity.

  1. If we are generating fake data, should we expect to see “recognizable” names at all? Alternatively, we could use random strings of letters and numbers to represent a person record.
  2. Should we create fake data that represent the world as it is today (ie based on current ratios of names attributed to specific races and genders) or should we create fake data with a social justice agenda that corrects or overcorrects for bias?

Which leads me to another set of potentially unanswerable questions:

  1. What is the purpose of generating fake data? If it’s for testing scalability, code bugs, etc., then unrecognizable names spanning all lengths (# of letters), punctuation, and anything else that could possibly be part of a name are more useful than the top 200 known names (see the sketch after this list). If it’s for testing usability, recognizable names or phonetically pronounceable names (with cultural sensitivity that this range can be quite broad and differs across groups of people) would be ideal.
  2. To what extent should representation supersede either of the priorities enumerated in Point 1 above?
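To make the scalability case concrete, here’s a sketch of what “adversarial” name generation might look like. The character pool and length bounds are arbitrary choices of mine, not anyone’s spec:

    import random
    import string

    # Stress-test names: wildly varying lengths, plus punctuation, spaces,
    # and non-ASCII characters that legitimate names can contain.
    NAME_CHARS = string.ascii_letters + " '-." + "éñøāĐ张"

    def stress_test_name(min_len: int = 1, max_len: int = 120) -> str:
        length = random.randint(min_len, max_len)
        return "".join(random.choice(NAME_CHARS) for _ in range(length))

    for _ in range(3):
        print(repr(stress_test_name()))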

getting down to business

When I really ponder the existential questions above, the “answer” that rises to the surface is actually another question: Who decides? And the answer to that is, US! We decide!

Which is why I’m so glad that we have queer leadership bolstering the Data Generation Toolkit team and coming up with new methods for making datasets closer to reality AND closer to the-world-as-it-should-be where every type of person has dignity, representation, and self-determination. If we can do it with fake data first, maybe we can do it in real life next?

There seem to be three schools of thought with respect to the quandary of generating specifically nonbinary names:

  • All names are inherently nonbinary until assigned a gender. Pick any name. Congrats, it’s nonbinary! You can even slap on a nonbinary prefix or suffix like Mx. or Ph.D. (This is the algorithm currently in place.)
  • All names are not inherently nonbinary. We should select names that are relatively likely to be assigned to either AFAB or AMAB people, like “Sam” or “Taylor.” We can use code logic to compare lists of AFAB and AMAB names and return names that appear on both lists. (Project contributor Allison smartly proposed logic for this here; a minimal sketch follows this list.)
  • Commonly assigned names can’t represent the complexity of the lived experience of nonbinary people. Many nonbinary people pick new names that are not otherwise thought of as Person Names, like “Bex” “Maple” or a single letter. (I’m not aware of a dataset for self-selected, non-traditional names).
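For the second school of thought, the core logic is simple set intersection. This is just a toy illustration with a handful of hand-picked names, not Allison’s actual proposal (follow the link above for that):

    # Tiny, hypothetical stand-ins for full AFAB/AMAB name lists.
    afab_names = {"Sam", "Taylor", "Jordan", "Sally", "Maria"}
    amab_names = {"Sam", "Taylor", "Jordan", "John", "Miguel"}

    # Names appearing on both lists are relatively likely to be used
    # across gender assignments.
    print(sorted(afab_names & amab_names))  # ['Jordan', 'Sam', 'Taylor']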

I think all of these approaches have merits. Should we build all of them and let data consumers choose?

The question of nonbinary names seems to come down to both data sourcing and sampling.

However, the question of racially and ethnically diverse people names likely comes down to just sourcing. If we can get a reliable dataset that is more recent than the 1960s, not ranked by popularity, and drawn from a reasonably racially diverse geography, then I think we could solve our biggest hangups with the existing dataset.

These datasets are not very easy to come by, but I was delighted to stumble upon one from the City of Austin. This dataset represents:

  • (All) people born in the city of Austin, Texas
  • In 2017

The next question is, do we determine Austin, Texas to be reasonably racially diverse? Compared to what? And if we were to determine an ideal balance of ethnic groups… would that hypothetical place even have open data?

[Chart: Austin demographics according to US Census]
[Chart: USA demographics according to US Census]

From the data above, I can see that in comparison to the population of the USA, white Americans and Black Americans are underrepresented in Austin (then again, I have no good reason for benchmarking against USA). Also, due to immigration patterns, I assume that there are more Latine people in Austin than recorded in the US Census. People data is really tricky. I don’t have all of the answers.

The nice thing (one of many nice things) about this random dataset is that it is a reasonable size for the scale and scope of the project. There are 6,087 unique names in the dataset, representing 19,896 birth certificates.

I think that in order to implement this dataset, someone would have to take this argument a bit further and come up with a clear reason why sampling data from Austin, TX is a good idea. We would also need a plan for refreshing the data every 5 years or so, if we take seriously the idea that we want to select names that correspond to currently alive people (although first and last names are sampled separately and are not meant to identify real people) and according to current naming trends (which hopefully lean toward diversity!).
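To show what’s at stake in the sampling question, here’s a rough sketch of two ways to draw from the Austin data. The file name and column headers are assumptions on my part; check the actual open data export before using anything like this:

    import csv
    import random

    # Assumed columns: "name", "count" (births with that name in 2017).
    with open("austin_names_2017.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    unique_names = [row["name"] for row in rows]

    # Option A: uniform draw over the ~6,087 unique names, which sidesteps
    # popularity weighting entirely.
    print(random.choice(unique_names))

    # Option B: weight by birth-certificate counts (~19,896 total) to
    # mirror Austin's actual 2017 naming frequencies.
    weights = [int(row["count"]) for row in rows]
    print(random.choices(unique_names, weights=weights, k=1)[0])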

The data from Austin may not be “good” by every imaginable attribute but it may well be “good enough” to improve what we have today. As with all matters of diversity, equity, and inclusion, the work is never done! So, you tell me, what do you think we should do next? I would really love to hear from you. The more voices, the merrier!
