The General Index: a New Tool to Fight Inequity
Access to data and the skills to produce them are vital to addressing the inequity inherent in the scholarly record. Access in the short term is a powerful first step, but to truly move the needle will take training a more diverse group of inquisitors to be comfortable using and generating data.
I recently had the chance to listen to Catherine D’Ignazio and Lauren F. Klein discuss their book Data Feminism, published in 2020. In the book, D’Ignazio and Klein invite us to consider data science through a feminist lens that focuses on imbalances of power and the structural forces that cause them.
Inequality in data science mirrors that of our larger society, in which those with money and/or privilege have greater power than those who lack them. In data science, the consequence of power imbalance is a selection bias that skews which data are collected, what decisions govern the analysis, and how the results are valued.
This bias runs the gamut from benevolent oversight to active suppression. There is obvious interference, such as laws and regulations that stifle the conduct and release of gun violence research. More insidious is the problem of data sets never conceived of by those in power, who lack the lived experience to value them. This stems from a lack of diversity in the voices that shape and practice data science.
In the Library of Missing Datasets, available via GitHub and as a physical and virtual exhibit, Mimi Ọnụọha speculates deeply on the spectrum of what those missing data sets describe – particularly when they dwell in otherwise data-rich contexts. For example: crime statistics are abundant, but there’s scant information on police brutality or hate crimes against transgender individuals.
The same phenomenon of missing concepts occurs across scholarly literature. There is a substantial network of barriers and incentives that determines what research takes place, who can publish and what gets published. One must have the resources to conduct research in the first place, and the topic must be deemed worthy by publishers and reviewers. The livelihoods and reputations of all who participate are inextricably tied to the production machine.
Identifying gaps in this trove presents a unique challenge because of barriers to access. Even the best-funded libraries and institutions are unlikely to have access to the full scholarly corpus. The majority of publications are not openly accessible; researchers without institutional backing for serials are left out in the cold.
Yet that lack of data – to study how we study – is a significant hindrance to doing research better. According to a blog post from technology activist Cory Doctorow, we could learn from large-scale study of these papers – text mining “that can reveal holes, biases and defects in our research programs” that could “help fight corruption”.
Furthermore, our investigations should include representation from marginalized groups, in accordance with the message of Data Feminism of embracing plurality. According to D’Ignazio, “commitment to feminist knowledge creation means what we get is more robust when we bring more people to the table.” In this case, it means a broader perspective on how we prioritize what to study and the real-world context we use to interpret results.
The General Index, published in October 2021, is a step in the direction of unlocking academic research papers for public scrutiny. The resource is the work of the nonprofit group Public Resource and its founder, open information advocate Carl Malamud. It contains data on article keywords, n-grams (sequences of words that appear consecutively in the text) of up to five words, and corresponding metadata extracted from over 107 million scientific research papers.
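For readers who have not met n-grams before, here is a minimal Python sketch of the idea. It is only an illustration: the function name and the sample sentence are my own, and splitting on whitespace is a simplifying assumption rather than a description of how the General Index itself processes text.

```python
from collections import Counter


def ngrams(text, max_n=5):
    """Yield every run of 1 to max_n consecutive words in the text."""
    # Assumption: lowercase the text and split on whitespace. Real text
    # mining would also need to handle punctuation, hyphens, and so on.
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


# A made-up sample sentence, used purely for illustration.
sample = "missing data sets describe missing data"
counts = Counter(ngrams(sample))

# Show the n-grams that occur more than once in the sample.
for phrase, count in counts.most_common():
    if count > 1:
        print(count, phrase)
```

Counting phrases this way across millions of papers is what makes it possible to ask which concepts appear often and which barely appear at all.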
As yet, using the General Index requires considerable computational know-how and resources: its terabytes’ worth of information is contained in three massive data tables split across multiple files. Using them requires the space to download the desired tables and familiarity with tools capable of searching and wrangling across them. That considerably narrows the pool – and likely the diversity – of people able to use it. Nonetheless, the Index presents a valuable opportunity for self-reflective research that studies how frequently (or infrequently) concepts are represented in the scientific literature, and how their use changes over time.
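To give a flavor of that kind of question, here is a small Python sketch that counts how many papers mention a phrase in each year. Everything in it is hypothetical: the (paper, year, n-gram) row layout, the function name, and the toy rows are assumptions made for illustration, not the actual structure or contents of the General Index tables.

```python
from collections import defaultdict


def papers_per_year(rows, term):
    """Count the distinct papers that contain `term` in each year.

    Assumption: `rows` is an iterable of (paper_id, year, ngram) tuples.
    This layout is hypothetical, not the General Index's real schema.
    """
    wanted = term.lower()
    papers_by_year = defaultdict(set)
    for paper_id, year, ngram in rows:
        if ngram.lower() == wanted:
            # A set means a paper is counted once per year, however
            # many times the phrase appears in it.
            papers_by_year[year].add(paper_id)
    return {year: len(ids) for year, ids in sorted(papers_by_year.items())}


# Toy rows invented for this example; they are not real data.
toy_rows = [
    ("paper-a", 2018, "police brutality"),
    ("paper-a", 2018, "crime statistics"),
    ("paper-b", 2019, "police brutality"),
    ("paper-c", 2019, "police brutality"),
    ("paper-c", 2019, "police brutality"),
]

trend = papers_per_year(toy_rows, "Police Brutality")
print(trend)
```

Against the real Index, the same question would mean working through very large files, which is exactly where the need for storage space and tooling comes in.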
In my opinion, using the General Index requires some skill (and the confidence to wield it), but not expert-level domain-specific knowledge. This is one reason I feel data literacy education is more than just a “nice to have”: it can help people make sense of data, get comfortable using them, and empower them to collect their own. Getting people more comfortable using data means more “atypical” data scientists can grow to use tools like the General Index and turn them towards a more diverse set of questions.
In other words, encourage and empower all your students to play with data. It will help the fight against inequity in the long run.

Emily Cukier is a Science Librarian at Washington State University. Her interests include biology/life sciences, chemistry, human health and pharmacotherapy, data librarianship, and research ethics. Before coming to WSU, she worked as a Senior Writer for BioCentury, a pharmaceutical trade publication, and as a nonproprietary naming consultant to the pharmaceutical industry.