Thursday, May 21, 2015

Is the Internet Healthy?

Meredith Whittaker is Open Source Research Lead at Google.

We are big fans of open data. So we're happy to see that the folks over at Battle for the Net launched The Internet Health Test earlier this week, a nifty tool that allows Internet users to test their connection speed across multiple locations.

The test makes use of M-Lab open source code and infrastructure, which means that all of the data gathered from all of the tests will be put into the public domain.

One of the project's goals is to make more public data about Internet performance available to advocates and researchers. Battle for the Net and others will use this data to identify problems with ISP interconnections, and, they claim, to hold ISPs accountable to the FCC's Open Internet Order.
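
For researchers who want to work with that public data directly, M-Lab's measurements can be queried in bulk. The sketch below is only illustrative: the BigQuery table and column names are placeholders assumed for this example, not M-Lab's documented schema, so check their current documentation before running anything like it.

    # Illustrative sketch: pulling aggregate download-speed measurements from
    # M-Lab's public BigQuery data. The dataset, table, and column names below
    # (mlab-public.ndt.downloads, download_mbps, client_country, test_date)
    # are placeholders assumed for this example; consult M-Lab's documentation
    # for the real schema before running a query like this.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()

    query = """
        SELECT
          client_country,
          APPROX_QUANTILES(download_mbps, 100)[OFFSET(50)] AS median_mbps,
          COUNT(*) AS tests
        FROM `mlab-public.ndt.downloads`
        WHERE test_date >= '2015-05-01'
        GROUP BY client_country
        ORDER BY tests DESC
        LIMIT 20
    """

    for row in client.query(query).result():
        print(f"{row.client_country}: median {row.median_mbps:.1f} Mbps ({row.tests} tests)")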

This is certainly a complex issue, but we are always thrilled to see more data that can be used to inform policy.

You can learn more and run the test over at their site: https://www.battleforthenet.com/internethealthtest

Thursday, May 14, 2015

New data, more facts: an update to the Transparency Report

Cross-posted from the Official Google Blog.

We first launched the Transparency Report in 2010 to help the public learn about the scope of government requests for user data. With recent revelations about government surveillance, calls for companies to make encryption keys available to police, and a wide range of proposals, both in and out of the U.S., to expand surveillance powers throughout the world, the issues today are more complicated than ever. Some issues, like ECPA reform, are less complex, and we’re encouraged by the broad support in Congress for legislation that would codify a standard requiring warrants for communications content.

Google's position remains consistent: We respect the important role of the government in investigating and combating security threats, and we comply with valid legal process. At the same time, we'll fight on behalf of our users against unlawful requests for data or mass surveillance. We also work to make sure surveillance laws are transparent, principled, and reasonable.

Today's Transparency Report update
With this in mind, we're adding some new details to our Transparency Report that we're releasing today.

  • Emergency disclosure requests. We’ve expanded our reporting on requests for information we receive in emergency situations. These emergency disclosure requests come from government agencies seeking information to save the life of a person who is in peril (like a kidnapping victim), or to prevent serious physical injury (like a threatened school shooting). We have a process for evaluating and fast-tracking these requests, and in true emergencies we can provide the necessary data without delay. The Transparency Report previously included this number for the United States, but we’re now reporting for every country that submits this sort of request.

  • Preservation requests. We're also now reporting on government requests asking us to set aside information relating to a particular user's account. These requests can be made so that information needed in an investigation is not lost while the government goes through the steps to get the formal legal process asking us to disclose the information. We call these "preservation requests" and because they don't always lead to formal data requests, we keep them separate from the country totals we report. Beginning with this reporting period, we're reporting this number for every country.

In addition to this new data, the report shows that we've received 30,138 requests from around the world seeking information about more than 50,585 users/accounts; we provided information in response to 63 percent of those requests. We saw slight increases in the number of requests from governments in Europe (2 percent) and Asia/Pacific (7 percent), and a 22 percent increase in requests from governments in Latin America.

The fight for increased transparency
Sometimes, laws and gag-orders prohibit us from notifying someone that a request for their data has been made. There are some situations where these restrictions make sense, and others not so much. We will fight—sometimes through lengthy court action—for our users' right to know when data requests have been made. We've recently succeeded in a couple of important cases.

First, after years of persistent litigation in which we fought for the right to inform Wikileaks of government requests for their data, we were successful in unsealing court documents relating to these requests. We’re now making those documents available to the public here and here.

Second, we've fought to be more transparent regarding the U.S. government's use of National Security Letters, or NSLs. An NSL is a special type of subpoena for user information that the FBI issues without prior judicial oversight. NSLs can include provisions prohibiting the recipient from disclosing any information about them. Reporters speculated in 2013 that we challenged the constitutionality of NSLs; after years of litigation with the government in several courts across multiple jurisdictions, we can now confirm that we challenged 19 NSLs and fought for our right to disclose this to the public. We also recently won the right to release additional information about those challenges, and the documents should be available on the public court dockets soon.

Finally, just yesterday, the U.S. House of Representatives voted 338-88 to pass the USA Freedom Act of 2015. This represents a significant step toward broader surveillance reform, while preserving important national security authorities. Read more on our U.S. Public Policy blog.

Posted by Richard Salgado, Legal Director, Law Enforcement and Information Security

Thursday, May 7, 2015

Exploring the world of data-driven innovation

Mike Masnick is founder of the Copia Institute.

In the last few years, there’s obviously been a tremendous explosion in the amount of data floating around. But we’ve also seen an explosion in efforts to understand and make use of that data in valuable and important ways. Advances in both the type and amount of data available, combined with advances in the computing power to analyze it, are opening up entirely new fields of innovation that simply weren’t possible before.

We recently launched a new think tank, the Copia Institute, focused on looking at the big challenges and opportunities facing the innovation world today. An area we’re deeply interested in is data-driven innovation. To explore this space more thoroughly, the Copia Institute is putting together an ongoing series of case studies on data-driven innovation, with the first few now available in the Copia library.

Our first set of case studies includes a look at how the Polymerase Chain Reaction (PCR) helped jumpstart today's biotechnology field. PCR is, in short, a technique for copying DNA, something that was extremely difficult to do outside of living things copying their own DNA. The discovery was something of an accident: a scientist found that certain microbes survived in the high temperatures of the hot springs of Yellowstone National Park, which was previously thought impossible. Further study of those heat-tolerant microbes eventually led to the creation of PCR.

PCR was patented but licensed widely and generously. It basically became the key to biotech and genetic research in a variety of different areas. The Human Genome Project, for example, was possible only thanks to the widespread availability of PCR. Those involved in the early efforts around PCR were actively looking to share the information and concept rather than lock it up entirely, although there were debates about doing just that. By making sure that the process was widely available, it helped to accelerate innovation in the biotech and genetics fields. And with the recent expiration of the original PCR patents, the technology is even more widespread today, expanding its contribution to the field.

Another case study explores the value of the HeLa cells in medical research—cancer research in particular. While the initial discovery of HeLa cells may have come under dubious circumstances, their contribution to medical advancement cannot be overstated. The name of the HeLa cells comes from the patient they were originally taken from, a woman named Henrietta Lacks. Unlike previous human cell samples, HeLa cells continued to grow and thrive after being removed from Henrietta. The cells were made widely available and have contributed to a huge number of medical advancements, including work that has resulted in five Nobel prizes to date.

With both PCR and HeLa cells, we saw an important pattern: an early discovery that was shared widely, enabling much greater innovation to flow from proliferation of data. It was the widespread sharing of information and ideas that contributed to many of these key breakthroughs involving biotechnology and health.

At the same time, both cases raise certain questions about how to best handle similar developments in the future. There are questions about intellectual property, privacy, information sharing, trade secrecy and much more. At the Copia Institute, we plan to dive more deeply into many of these issues with our continuing series of case studies, as well as through research and events.

Friday, May 1, 2015

Five ways for states to make the most of open data

Mariko Davidson serves as an Innovation Fellow for the Commonwealth of Massachusetts where she works on all things open data. These opinions are her own. You can follow her @rikohi.

States struggle to define their role in the open data movement. With the exception of some state transportation agencies, states watch their municipalities publish local data, create some neat visualizations and applications, and get credit for being cool and innovative.

States see these successes and want to join the movement. Greater transparency! More efficient government! Innovation! The promise of open data is rich, sexy, and non-partisan. But when a state publishes something like obscure wildlife count data and the community does not engage with it, it can be disappointing.

States should leverage their unique role in government rather than mimic a municipal approach to open data. They must take a different approach to encourage civic engagement, more efficient government, and innovation. Here are a few recommendations based on my time as a fellow:

  1. States are a treasure trove of open data. This is still true. When prioritizing what data to publish, focus on the tangible data that impacts the lives of constituents—think aggregating 311 request data from across the state. Mark Headd, former Chief Data Officer for the City of Philadelphia, calls potholes the “gateway drug to civic engagement.”

  2. States can open up data sharing with their municipalities—which leads to a conversation on data standards. States can use their unique position to federate and facilitate data sharing with municipalities. This has a few immediate benefits: a) it allows citizens a centralized source to find all levels of data within the state; b) it increases communication between the municipalities and the state; and c) it begins to push a collective dialogue on data standards for better data sharing and usability.

  3. States in the US create an open data technology precedent for their towns and municipalities. Intentional or not, the state sets an open data technology standard, so it should leverage this power strategically. When a state selects a technology platform to catalog its data, it incentivizes municipalities and towns within the state to follow its lead. If a state chooses a SaaS solution, it creates a financial barrier to entry for municipalities that want to collaborate. The Federal Government understood this when it moved Data.gov to the open source solution CKAN (a brief example of querying a CKAN catalog follows this list). Bonus: open source software is free and embodies the free and transparent ethos of the greater open data movement.

  4. States can support municipalities and towns by offering open data as a service. This can be an opportunity to provide support to municipalities and towns that might not have the resources to stand up their own open data site.

  5. Finally, states can help facilitate an “innovation pipeline” by providing the data infrastructure and regularly connecting key civic technology actors with government leadership. Over the past few years, the civic technology movement has seen a lot of success in cities, with groups like Code for America leading the charge through their local Brigade chapters. After publishing data and providing the open data infrastructure, states must also engage with the super users and data consumers. States should not shy away from these opportunities. More active state engagement is the crucial element still missing from the civic innovation space, and it is needed to collectively create sustainable technology solutions for the communities states serve.
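
To make the CKAN point in item 3 concrete, here is a minimal sketch of how anyone could query a CKAN-backed state catalog programmatically. The portal URL is a hypothetical placeholder, not a real state site; the package_search action shown is part of CKAN's standard Action API, but treat the parameters and response handling as illustrative.

    # Minimal sketch: searching a CKAN-backed open data catalog for datasets.
    # The base URL is a hypothetical state portal; package_search is part of
    # CKAN's standard Action API.
    import requests

    CATALOG = "https://data.example.state.us"  # hypothetical placeholder URL

    def search_datasets(query, rows=10):
        """Return titles of catalog datasets matching the query string."""
        resp = requests.get(
            f"{CATALOG}/api/3/action/package_search",
            params={"q": query, "rows": rows},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        if not payload.get("success"):
            raise RuntimeError("CKAN query failed")
        return [pkg["title"] for pkg in payload["result"]["results"]]

    if __name__ == "__main__":
        for title in search_datasets("311 requests"):
            print(title)

Because the same API is exposed by any standard CKAN instance, a script like this works unchanged against a municipal or federal catalog, which is exactly the kind of interoperability the data-standards discussion in item 2 is after.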

Tuesday, April 28, 2015

Visualization: The future of the World Bank

This visualization of World Bank borrowers today and in 2019 isn't the most technologically sophisticated visualization we've ever posted, but it is a stark illustration of what the future of the World Bank looks like.

As Tom Murphy writes over on Humanosphere:

The World Bank’s influence is waning. Some point to the emerging Asian Infrastructure Investment Bank as evidence of the body’s declining power, but it is the World Bank’s own projections that illustrate the change. Thirty-six countries will graduate from World Bank loans over the next four years (see the above gif).

The images in Murphy's gif come from a policy paper titled "The World Bank at 75" by Scott Morris and Madeleine Gleave at the Center for Global Development. The paper provides a thorough data-driven analysis of current World Bank lending models and systematic trends that will shape its future. From the paper:

The World Bank continues to operate according to the core model some 71 years after the founding of IBRD and 55 years after the founding of IDA: loans to sovereign governments with terms differentiated largely according to one particular measure (GNI per capita) of a country’s ability to pay. Together, concessional and non-concessional loans to countries still account for 67 percent of the institution’s portfolio.

So when the World Bank looks at the world today, it sees a large number of countries organized by IDA and IBRD status.

And what will the World Bank see in 2019, on the occasion of its 75th anniversary? On its current course and with rote application of existing rules, the picture could look very different, with far fewer of those so-called “IDA” and “IBRD” countries.

But does this picture accurately reflect the development needs that will be pressing in the years ahead? Or instead, does it simply reflect an institutional model that is declining in relevance?

It is remarkable how enduring the World Bank’s basic model has been. The two core features (lender to sovereign governments; terms differentiated by countries’ income category) have tremendous power within the institution, which has grown up around them. The differentiation in terms has generated two of the core silos within the institution: the IBRD and IDA. And lending to national governments (what we will call the “loans to countries” model) is so dominant that it has crowded out other types of engagement, even when there has been political will to do other things (notably, climate-related financing).

So while the model has been laudably durable in some respects, it also increasingly seems to be stuck at a time when external dynamics call for change.

This paper examines ways in which seemingly immovable forces underlying the World Bank’s work might finally be ripe for change in the face of shifting development needs. Specifically, we offer examples of 1) how country eligibility standards might evolve; and 2) how the bank might move further away from the “loans to countries” model that has long defined it.

Friday, April 24, 2015

How do political campaigns use data analysis?

Looking through SSRN this morning, I came across a paper by David Nickerson (Notre Dame) and Todd Rogers (Harvard), "Political Campaigns and Big Data" (February 2014). It's a nice follow-up to yesterday's post about the software supporting new approaches to data analysis in Washington, DC.

In the paper, Nickerson and Rogers get into the math behind the statistical methods and supervised machine learning employed by political campaign analysts. They discuss the various types of predictive scores assigned to voters—responsiveness, behavior, and support—and the variety of data that analysts pull together to model and then target supporters and potential voters.
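
To make the idea of a support score concrete, here is a minimal sketch of the kind of supervised model the authors describe, using synthetic data. The features, library choice, and model are our own illustrative assumptions; the paper does not prescribe any particular implementation.

    # Illustrative sketch of a "support score": a supervised model trained on
    # voters whose candidate preference is known (e.g., from ID calls), then
    # used to score everyone else in the file. The features and data here are
    # synthetic, invented purely to show the mechanics.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Pretend feature matrix: columns might stand in for age, past turnout,
    # and a party-registration flag in a real voter file.
    X_labeled = rng.normal(size=(1000, 3))
    y_labeled = (X_labeled @ np.array([0.2, 0.5, 1.5]) + rng.normal(size=1000)) > 0

    model = LogisticRegression().fit(X_labeled, y_labeled)

    # The predicted probability of support becomes each unlabeled voter's
    # support score, conventionally rescaled to 0-100.
    X_unlabeled = rng.normal(size=(5, 3))
    support_scores = 100 * model.predict_proba(X_unlabeled)[:, 1]
    print(np.round(support_scores, 1))

In practice the inputs would be the voter-file and consumer variables the paper describes rather than random noise; the point here is only the mechanics of scoring.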

In the following excerpt, the authors explain how predictive scores are applied to maximize the value and efficiency of phone bank fundraising calls:

Campaigns use predictive scores to increase the efficiency of efforts to communicate with citizens. For example, professional fundraising phone banks typically charge $4 per completed call (often defined as reaching someone and getting through the entire script), regardless of how much is donated in the end. Suppose a campaign does not use predictive scores and finds that, upon completion of the call, 60 percent give nothing, 20 percent give $10, 10 percent give $20, and 10 percent give $60. This works out to an average of $10 per completed call. Now assume the campaign sampled a diverse pool of citizens for a wave of initial calls. It can then look through the voter database that includes all citizens it solicited for donations and all the donations it actually generated, along with other variables in the database such as past donation behavior, past volunteer activity, candidate support score, predicted household wealth, and Census-based neighborhood characteristics (Tam Cho and Gimpel 2007). It can then develop a fundraising behavior score that predicts the expected return for a call to a particular citizen. These scores are probabilistic, and of course it would be impossible to only call citizens who would donate $60, but large gains can quickly be realized. For instance, if a fundraising score eliminated half of the calls to citizens who would donate nothing, the resulting distribution would be 30 percent donating $0, 35 percent donating $10, 17.5 percent donating $20, and 17.5 percent donating $60. The expected revenue from each call would increase from $10 to $17.50. Fundraising scores that increase the proportion of big donor prospects relative to small donor prospects would further improve on these efficiency gains.
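
The arithmetic in that excerpt is easy to verify. A few lines of Python reproduce the $10 and $17.50 expected values per completed call (the numbers come straight from the excerpt above; nothing here is from the paper's own code):

    # Expected revenue per completed fundraising call, before and after
    # targeting with a fundraising score (shares taken from the excerpt above).
    def expected_revenue(distribution):
        """distribution maps donation amount -> share of completed calls."""
        assert abs(sum(distribution.values()) - 1.0) < 1e-9
        return sum(amount * share for amount, share in distribution.items())

    untargeted = {0: 0.60, 10: 0.20, 20: 0.10, 60: 0.10}
    targeted = {0: 0.30, 10: 0.35, 20: 0.175, 60: 0.175}

    print(round(expected_revenue(untargeted), 2))  # 10.0
    print(round(expected_revenue(targeted), 2))    # 17.5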

If you've ever wanted to know more about how campaigns use data analysis tools and techniques, this paper is a great primer.

Thursday, April 23, 2015

Quorum: Is software the new Congressional intern?

Last month, a number of news outlets wrote about a startup called Quorum. Winner of the 2014 Harvard Innovation Challenge's McKinley Family Grant for Innovation and Entrepreneurial Leadership in Social Enterprise, Quorum has amazing potential to create new ways for legislators to easily use data to understand their constituencies and track legislation—literally data for policymaking. Quorum even pulls data from the American Community Survey, which James Treat of the Census Bureau wrote about for this blog a few years back.

TechCrunch touts Quorum as a replacement for the hordes of summer Hill interns, while the Washington Post likens it to Moneyball for K Street.

Danny Crichton at TechCrunch writes:

The challenges are numerous in this space. "Figuring out who you should talk to is a really tough process," Jonathan Marks, one co-founder of Quorum, explained. "This is a problem that a lot of our clients have, [since] there are tens of thousands of relationships in DC." The challenge is magnified since those relationships change so often.

Another challenge is simply following legislation. Marks gave the example of a non-profit firm that wanted to develop a scorecard with grades for each congressman on several key votes (a common strategy these days in Washington advocacy). One firm had "three people spending 1.5 weeks to tabulate all the data." An opposition research firm went through "6000 votes on abortion" to tabulate every single congressman's legislative history. This was all done manually (i.e., with an army of interns).

But Quorum is not the first product of its kind. Bloomberg and CQ have long dominated with products targeted at this audience. Still, this is becoming a competitive space for entrepreneurs. Catherine Ho at the Washington Post explains:

Since 2010, at least four companies, ranging from start-ups to billion-dollar public corporations, have introduced new ways to sell data-based political and competitive intelligence that offers insight into the policymaking process.

[...]

Other companies are emerging in the space with some success. For others, it’s too soon to tell.

Popvox, founded in 2010, is an online platform that collects correspondence between constituents and their representatives on bills, organizes the data by state, and packages the information in charts and maps so lawmakers can easily spot where voters stand on a proposed bill. An early win was when nearly 12,000 people nationwide used the platform to oppose a proposal to allow robo-calls to cellphones — the bill was withdrawn by its sponsors.

Popvox does not disclose its revenue, but co-founder Marci Harris said the platform has more than 400,000 users across every congressional district and has delivered more than 4 million constituent positions to Congress.

FiscalNote, which uses data-mining software and artificial intelligence to predict the outcome of legislation and regulations, has pulled in $19.4 million in capital since its 2013 start from big-name investors including Dallas Mavericks owner Mark Cuban, Yahoo co-founder Jerry Yang and the Winklevoss twins. The company says it achieves 94 percent accuracy. And Ipsos, the publicly traded market research and polling company, is amping up efforts to sell polling data to lobby firms.

For an academic's take on the trend toward data in politics and campaigning, see "Yes We Can (Profile You)," a great piece that UNC assistant professor Daniel Kreiss published in the Stanford Law Review in 2012; it lays out the ways in which political campaigns employ sophisticated data analysis techniques to measure and target voters.