Libraries and library vendors contain multitudes of data¹: item circulation, patron information, computer sessions, program attendance, website logs, and searches, just to name a few. Data plays a critical role in the analysis and assessment of collections, operations, and services; however, patron privacy must be taken into consideration whenever patron data is used. Anonymization and de-identification are two methods that can aid libraries and vendors in using patron data for assessment while respecting patron privacy.
Library patron data contains Personally Identifiable Information (PII), which consists of two facets:
- Information about a person, such as a name, home address, phone number, and ID number (PII-1)
- Information that can be linked to a person, including financial, education, and medical information (PII-2) (United States 2008). For libraries, PII-2 pertains to someone’s intellectual pursuits, including borrowing history, catalog and web searches, reference inquiries, etc.
PII is scattered throughout the organization; some data lives in the integrated library system, some in electronic resource vendor reports, and some in sheets of paper on a staff person’s desk. This raw data, particularly if brought together into one central database, can be used to comprehensively track a patron’s use of the library. While having this level of tracking allows for valuable longitudinal analysis of patron trends, it also presents a high risk for patron privacy violations.
Folks might have run across anonymization and de-identification being used interchangeably in discussions of data privacy. The two terms, however, refer to different strategies for reducing the risk of re-identifying individuals in data sets. The figure below from the National Institute of Standards and Technology (NIST) shows how privacy and risk shift with the tools and strategies used to strip out various levels of PII (Garfinkel 2015).
The more one strips PII from a data set, the less likely an individual can be linked to a particular data point or activity. While NIST recognizes some variation in definitions of anonymization and de-identification in standards documentation, and prefers to use the term de-identification to refer to both processes, the Future of Privacy Forum in 2016 defined the two terms in their review of de-identification methods (Polonetsky, Tene, and Finch 2016).
Anonymization commonly refers to the tools and methods that break the link between specific data points and any individual. This approach significantly decreases, or in some cases eliminates, the risk of re-identification. Because many anonymization methods aggregate the data, however, individual trend analysis and more detailed demographic analysis are very limited.
De-identification is one step removed from anonymization: the PII facets are eliminated or transformed to break the link between the real-world person and the data, while still preserving the ability to study individual trends and more detailed demographics.
For libraries and vendors interested in this approach, what does de-identification of library patron data look like in practice?
PII-1 data should be deleted or obfuscated. Names, phone numbers, patron record numbers, and library card barcodes would be deleted, for example. Other demographic information can be obfuscated to allow some demographic study in the dataset. In the case of the date of birth, storing the age of the individual, instead of the full date of birth, allows staff to track patron trends by age while creating enough “noise” to reduce the risk of re-identifying an individual. In the same vein, limiting geo-specific data to home branch, zip code, or Census Tract can allow for geo-location analysis of the dataset.
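The PII-1 treatment above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine; the record and its field names are hypothetical, not drawn from any real integrated library system.

```python
from datetime import date

# Hypothetical patron record; field names are illustrative, not from any real ILS.
record = {
    "name": "Jane Doe",
    "barcode": "21234001234567",
    "date_of_birth": date(1985, 4, 12),
    "zip_code": "98115-2277",
    "home_branch": "Central",
}

def deidentify_pii1(rec, today=None):
    """Drop direct identifiers; coarsen date of birth to age and ZIP to five digits."""
    today = today or date.today()
    dob = rec["date_of_birth"]
    # Subtract one if the birthday has not yet occurred this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return {
        "age": age,                       # age instead of full date of birth
        "zip_code": rec["zip_code"][:5],  # five-digit ZIP instead of ZIP+4
        "home_branch": rec["home_branch"],
    }
```

Note that name, barcode, and the full date of birth simply never make it into the output record: deletion, not transformation, is the right move for direct identifiers.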
PII-2 data requires a multilevel analysis to ensure that patrons cannot be connected to specific resources, programs, and sessions, while not losing all granularity for reporting and assessment purposes. Truncation, aggregation, and obfuscation methods work well in de-identifying PII-2 data. Call numbers can be truncated so that the full call number no longer links to an individual item; narrow subject headings and genres can be aggregated into broader subjects and genres; and system timestamps attached to public computer sessions can be obfuscated to include only the date and the length of the session.
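Two of these PII-2 techniques, truncation and timestamp obfuscation, can be sketched as below. This assumes Dewey-style call numbers; a library using another classification scheme would truncate differently.

```python
from datetime import datetime

def truncate_call_number(call_number):
    # Keep only the top-level class (e.g. the Dewey hundreds/tens/ones digits),
    # breaking the link to the specific item. Assumes Dewey-style call numbers.
    return call_number.split(".")[0][:3]

def obfuscate_session(start, end):
    # Replace precise login/logout timestamps with the date and session length.
    return {
        "date": start.date().isoformat(),
        "length_minutes": int((end - start).total_seconds() // 60),
    }
```

A call number like "813.54 SMI" becomes just "813" (American fiction), and a computer session keeps only its date and duration, so it can no longer be matched against, say, a gate log with exact times.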
De-identification, and to an extent anonymization, cannot alone protect a data set from re-identification risk. There are several factors for libraries and vendors to consider when determining the appropriate use of de-identification methods and tools. Service population size is one such factor. Even after de-identification, smaller datasets, or datasets containing a number of "outliers" in certain demographic fields (such as geo-location and age), are at higher risk of re-identification. Another factor is the thoroughness of the de-identification process itself. As the 2006 AOL search query release demonstrated, incomplete de-identification of a dataset, especially its PII-2 data, can eventually lead to re-identification (TechCrunch 2006).
A factor that can fly under the radar in most consideration processes is the possibility of re-identification through fields shared across multiple datasets. Any time a dataset shares data fields with another dataset, there is a chance that those fields can be used to crosswalk between the two. For example, suppose two tables share a transaction date field. If there are enough transactions on a given day, it would be difficult to determine which patron checked out a particular title that day. If the two tables also share a location field, however, that additional data point narrows down considerably which patron checked out which item. A recent example involves the New York City taxicab dataset, where people were able to identify passengers by combining pickup and dropoff locations and medallion numbers from the released data with other data points, including pictures found through an image search (Tockar 2014).
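The crosswalk scenario above can be made concrete with a toy example. All names, records, and field names below are invented for illustration; the point is only to show how each additional shared field shrinks the set of candidate patrons.

```python
def candidates(checkout, visits, shared_fields):
    """Patrons whose visit matches the checkout on every shared field."""
    return [v["patron"] for v in visits
            if all(v[f] == checkout[f] for f in shared_fields)]

# Hypothetical de-identified checkout record and a separate visit log.
checkout = {"date": "2017-04-01", "branch": "Central", "title": "Leaves of Grass"}
visits = [
    {"date": "2017-04-01", "branch": "Central",  "patron": "A"},
    {"date": "2017-04-01", "branch": "Eastside", "patron": "B"},
    {"date": "2017-04-01", "branch": "Central",  "patron": "C"},
]

# Joining on date alone leaves three candidates; adding branch cuts it to two.
by_date = candidates(checkout, visits, ["date"])
by_date_and_branch = candidates(checkout, visits, ["date", "branch"])
```

With only three visits the effect is stark, but the same narrowing happens at scale: each shared field multiplies the attacker's precision, which is why the fields two released datasets have in common deserve as much scrutiny as the fields within each one.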
De-identification and anonymization are two powerful methods for libraries and vendors striving to balance patron privacy with evidence-based practice. Nonetheless, these methods are not foolproof, and they are out of reach for some organizations due to the nature of the dataset or a lack of resources to implement them. This makes it all the more important for libraries and vendors to assess the risk of re-identification of de-identified data. As data de-identification advances, libraries and vendors will need to adjust these methods to keep the risk of re-identification at an acceptable minimum.
[Author’s note – for more information about how de-identification is currently being used in a large, urban library system, please read “Balancing Privacy and Strategic Planning Needs: A Case Study in De-Identification of Patron Data” in the Spring 2017 issue of Journal of Intellectual Freedom and Privacy.]
¹ With apologies to W. Whitman.
Garfinkel, Simson L. 2015. “De-Identification of Personally Identifiable Information.” NIST. Accessed April 13, 2017.
Polonetsky, Jules, Omer Tene, and Kelsey Finch. 2016. “Shades of Gray: Seeing the Full Spectrum of Practical Data De-Identification.” Santa Clara Law Review, Forthcoming. Accessed April 13, 2017.
TechCrunch. 2006. “AOL: ‘This was a screw up’.” Accessed April 13, 2017.
Tockar, Anthony. 2014. Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset. Neustar. Accessed April 13, 2017.
United States. 2008. Privacy: Alternatives Exist for Enhancing Protection of Personally Identifiable Information: Report to Congressional Requesters. [Washington, D.C.]: U.S. Govt. Accountability Office.