Policy by the Numbers

Data for sound policymaking from Google and friends

Google Ngram and “Information Privacy”

Monday, January 9, 2012

Google NGram is a database that permits statistical analysis of the frequency of use of specific words and phrases in books. The database draws on nearly 5.2 million books from a period between 1500 and 2000 A.D. that have been digitized by the Google Library Project. With use of the web-based NGram Viewer, it is then possible to create a graphical year-by-year representation of how often a phrase has been used in books.

In our recent work, The PII Problem, we drew on the NGram viewer to gain a sense of peaks and valleys in policymakers’ attention to “information privacy” from 1950 to 2000.

In this article, we find that this graphic analysis of references to “information privacy” largely correlates with our sense of the development of this area of law. Early use of the term was driven by concern about mainframe computers and their ability to change how data could be organized, accessed and searched.

How did this story then develop during the latter part of the 1970s? After a decline in interest in privacy following enactment of the Privacy Act of 1974, a renewed societal focus on information privacy began in the United States in the early 1980s. Part of this attention was driven, in turn, by the arrival of George Orwell’s titular year, 1984. A flurry of media reports and articles marked this occasion with an analysis of new threats to privacy.

Perhaps most importantly, however, cable operators’ collection of personal information at this time created the same kinds of issues that the Internet would later raise. Even as early as the 1980s, observers noted that coaxial cable technology would permit a user not only to receive information, as broadcast television had allowed, but also to respond to information on the screen and make programming choices. New privacy threats were anticipated as a consequence of the resulting detailed profiles about individual cable consumers, and the use of the term “information privacy” began to rise again.

From the 1990s on, the continuing attention to “information privacy” reflected society’s growing concern with privacy in the PC and then Internet era.

Other topics and techniques can be identified for drawing on Google NGram’s potential as a legal research tool. Legal scholars might draw up a list of core terms in information privacy law, and other legal fields, such as copyright law and constitutional law. The data can be used to explore a variety of questions: How did the use of these core terms develop over time? Did certain legal terms come to supplant others? Can comparisons of the relative frequency of the use of various terms reveal something about the development of legal concepts in a given substantive or doctrinal area? Using Google NGram data, scholars can seek answers to these questions in order to inform current research and fuel new areas of academic inquiry.
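As a rough illustration of how such a comparison might be set up, the sketch below normalizes yearly counts for two terms by the total number of phrases counted in that year, then looks for the first year in which one term overtakes the other. It assumes the yearly counts have already been obtained in some form. The function names are our own, and the numbers are placeholders invented for the example, not actual NGram data.

```python
from typing import Dict, Optional


def relative_frequency(term_counts: Dict[int, int],
                       total_counts: Dict[int, int]) -> Dict[int, float]:
    """Share of all counted phrases in each year that match the term."""
    return {
        year: term_counts.get(year, 0) / total_counts[year]
        for year in sorted(total_counts)
        if total_counts[year] > 0  # skip years with no data to avoid dividing by zero
    }


def first_year_overtaken(freq_a: Dict[int, float],
                         freq_b: Dict[int, float]) -> Optional[int]:
    """First year in which term B is relatively more frequent than term A, else None."""
    for year in sorted(set(freq_a) & set(freq_b)):
        if freq_b[year] > freq_a[year]:
            return year
    return None


# Placeholder counts, invented for illustration only (NOT real NGram data).
totals = {1970: 1000, 1980: 2000, 1990: 4000}
term_a = {1970: 10, 1980: 10, 1990: 8}   # hypothetical older term
term_b = {1970: 1, 1980: 12, 1990: 40}   # hypothetical newer term

freq_a = relative_frequency(term_a, totals)
freq_b = relative_frequency(term_b, totals)
crossover = first_year_overtaken(freq_a, freq_b)
print(crossover)
```

Normalizing by each year's total matters because the number of books published grows over time; raw counts alone could suggest rising interest in a term when its share of all published text is flat or falling.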

Posted by Paul M. Schwartz & Daniel J. Solove

Paul M. Schwartz is a Professor of Law at the University of California, Berkeley School of Law and a Director of the Berkeley Center for Law & Technology.  

Daniel Solove is the John Marshall Harlan Research Professor of Law at the George Washington University Law School. He is also a Senior Policy Advisor to the law firm of Hogan Lovells.