<h1>Policy by the Numbers</h1>
<p><em>Data for policymaking from Google and friends.</em></p>
<h2>Global broadband pricing study: Updated dataset</h2>
<p><em>December 14, 2015</em></p>
<p><em>Vincent Chiu is a Technical Program Manager at Google.</em></p>
<p>Since 2012, Google has supported the study and publication of broadband pricing for researchers, policymakers and the private sector in order to better understand the affordability landscape and help consumers make smarter choices about broadband access. We released the first dataset in <a href="http://policybythenumbers.blogspot.com/2012/08/international-broadband-pricing-study.html">August 2012</a>, and periodically refresh the data (<a href="http://policybythenumbers.blogspot.com/2013/05/international-broadband-pricing-study.html">May 2013</a>, <a href="http://policybythenumbers.blogspot.com/2014/03/international-broadband-pricing-study.html">March 2014</a>, and <a href="http://policybythenumbers.blogspot.com/2015/02/global-broadband-pricing-study-updated.html">February 2015</a>). This data has become an integral part of our understanding of global broadband affordability. Harvard's <a href="https://thenetmonitor.org/">Berkman Center</a>, <a href="https://fbnewsroomus.files.wordpress.com/2015/02/state-of-connectivity1.pdf">Facebook</a>, and others actively use the data to understand the broadband landscape. Today, we’re releasing the latest dataset.</p>
<p>For the mobile dataset, we increased the number of countries represented from 112 to 157, and the number of carriers from 331 to 402. For the fixed-line dataset, we increased the number of countries from 105 to 159, and the number of carriers from 331 to 424. At the country level, this data covers 99.3% of current Internet users globally.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY9GFCBJNHHIdfGOImdz5pQDpbfDb4hLLdh3eZq1NF1OtxVquYKOjVwUgH9dlvF_6bLXvkuuZLt_8V1VmZ8eoG13Hp4fMNApNYHCuOrJPsfXlDhaKq9vgivpqSsadtwp6uGcTjpspfv4wJ/s1600/countries_covered.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY9GFCBJNHHIdfGOImdz5pQDpbfDb4hLLdh3eZq1NF1OtxVquYKOjVwUgH9dlvF_6bLXvkuuZLt_8V1VmZ8eoG13Hp4fMNApNYHCuOrJPsfXlDhaKq9vgivpqSsadtwp6uGcTjpspfv4wJ/s320/countries_covered.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib6cxiUVqye4r6mqIavCTgGoRPhgid41qJ_k19K43SXf9XA21advIJNP6WsrFzPz2turZma-YwOxnmka0XtA5Xbv43lGODDzyF06QNJEzwru2G9lSwypENN2tyl6Z-L4xreG4QkaWCwZmh/s1600/isps_covered.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib6cxiUVqye4r6mqIavCTgGoRPhgid41qJ_k19K43SXf9XA21advIJNP6WsrFzPz2turZma-YwOxnmka0XtA5Xbv43lGODDzyF06QNJEzwru2G9lSwypENN2tyl6Z-L4xreG4QkaWCwZmh/s320/isps_covered.png" /></a></div>
<p>The collection methodology is designed to capture the cost of data plans. We collected samples from a broad range of light to heavy data usage plans, and recorded numerous individual plan parameters such as downstream bandwidth, monthly cost, and more. Finally, where possible, we collected plans from multiple carriers in each country to get an accurate picture.</p>
<p>Broadband Data:</p>
<ul><li>Price observations for fixed broadband plans can be found <a href="https://docs.google.com/spreadsheets/d/1kXTLRdR5typTgUvr6b_8YHZX8CAAZ99UvVUljZm4Srg/edit?ts=5668e1d8#gid=2101209717">here</a>.</li>
<li>Mobile broadband prices can be found <a href="https://docs.google.com/spreadsheets/d/1Z1rC3sg64jO9zzg7AuiUFHxaxdeRC1fLfAxwMRiImgg/edit?ts=5668e0e3#gid=0">here</a>.</li></ul>
<p>We provide this information to help people understand the state of internet access and make data-driven decisions. Along with this data collection effort, we have analyzed these pricing data and conducted research on affordability and Internet penetration. Our early results indicate several topics deserving further discussion within the ICT data community, including metric normalization, income distribution, and broadband value. We believe:</p>
<ul><li>Normalization is essential for any meaningful statistical analysis. The diversity of plan types and the complexity of tariff structures surrounding mobile broadband pricing requires a careful analytical methodology for normalization.</li>
<li>Income distribution needs to be considered when assessing the broadband affordability situation. The commonly used <a href="http://data.worldbank.org/indicator/NY.GNP.PCAP.KN">GNI per capita metric</a> is based on average national income level, which does not consider inequality of income distribution.</li>
<li>A high broadband value-to-cost ratio is important to drive Internet adoption: beyond affordability metrics alone, we need to take broadband experience into account to define meaningful value-for-the-money metrics.</li></ul>
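<p>As an illustration of why normalization matters, one common approach reduces each plan to shared units such as USD per Mbps and price as a share of monthly income. The plan figures, exchange rates, and income numbers below are invented, and this sketch ignores real-world complications like data caps and promotional pricing:</p>

```python
# Toy normalization of heterogeneous broadband plans. All figures invented.
plans = [
    {"country": "A", "price_local": 30.0, "fx_to_usd": 1.00, "mbps": 10,
     "gni_pc_monthly_usd": 4000},
    {"country": "B", "price_local": 900.0, "fx_to_usd": 0.05, "mbps": 5,
     "gni_pc_monthly_usd": 300},
]

def normalize(plan):
    usd = plan["price_local"] * plan["fx_to_usd"]
    return {
        "country": plan["country"],
        # value-for-money metric: price per unit of downstream bandwidth
        "usd_per_mbps": usd / plan["mbps"],
        # affordability metric: price as a share of monthly income
        "pct_of_income": 100 * usd / plan["gni_pc_monthly_usd"],
    }

for p in plans:
    print(normalize(p))
```

On these made-up numbers the cheaper-looking plan in country B is both worse value (9 vs. 3 USD per Mbps) and far less affordable (15% vs. 0.75% of monthly income), which is exactly the distinction a raw price comparison hides.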
<p>We look forward to sharing more findings on these topics in 2016. Please stay tuned for updates on our progress. If you have any feedback on the methodology, contact us at broadband-study@google.com.</p>
<h2>It’s Humans, Not Algorithms, That Have a Bias Problem</h2>
<p><em>November 18, 2015</em></p>
<p><em>Joshua New is a policy analyst at the <a href="https://www.datainnovation.org/">Center for Data Innovation</a>.</em> Reposted from <a href="https://www.datainnovation.org/2015/11/its-humans-not-algorithms-that-have-a-bias-problem/">CDI's blog</a>.</p>
<p>Bias in big data. Automated discrimination. Algorithms that erode civil liberties.</p>
<p>These are some of the fears that the <a href="https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf">White House</a>, the <a href="https://www.ftc.gov/system/files/documents/reports/data-brokers-call-transparency-accountability-report-federal-trade-commission-may-2014/140527databrokerreport.pdf">Federal Trade Commission</a>, and <a href="http://www.slate.com/articles/technology/bitwise/2015/01/black_box_society_by_frank_pasquale_a_chilling_vision_of_how_big_data_has.html">other critics</a> have expressed about an increasingly data-driven world. But these critics tend to forget that the world is already full of bias, and discrimination permeates human decision-making.</p>
<p>The truth is that the shift to a more data-driven world represents an unparalleled opportunity to crack down on unfair consumer discrimination by using data analysis to expose biases and reduce human prejudice. This opportunity is aptly demonstrated by the Consumer Financial Protection Bureau’s (CFPB) December 2013 auto loan discrimination suit against Ally Financial, <a href="http://www.consumerfinance.gov/newsroom/cfpb-and-doj-order-ally-to-pay-80-million-to-consumers-harmed-by-discriminatory-auto-loan-pricing/">the largest such suit in history</a>, in which data and algorithms played a critical role in identifying and combating racial bias.</p>
<p>CFPB found that, from April 2011 to December 2013, Ally Financial had unfairly set higher interest rates on auto loans for 235,000 minority borrowers and ordered the company to pay out $80 million in damages. But the investigation also posed an interesting challenge: Since creditors are generally prohibited from <a href="http://files.consumerfinance.gov/f/201409_cfpb_report_proxy-methodology.pdf">collecting data on an applicant’s race</a>, there was no hard evidence showing Ally had engaged in discriminatory practices. To piece together what really happened, CFPB used <a href="http://files.consumerfinance.gov/f/201312_cfpb_consent-order_ally.pdf">an algorithm</a> to infer a borrower’s race based on other information in his or her loan application. Its analysis identified widespread overcharging of minority borrowers as a result of discriminatory interest rate markups at car dealerships.</p>
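<p>The proxy methodology CFPB describes combines surname-based and geography-based race probabilities using Bayes' rule (the "BISG" approach in the linked report). The sketch below shows only the mechanics, with invented probability tables, and simplifies the actual CFPB procedure:</p>

```python
# Toy BISG-style race proxy: combine a surname-based prior with a
# geography-based estimate via Bayes' rule. All probabilities invented;
# the real methodology uses Census surname lists and tract demographics.
def bisg_posterior(p_race_given_surname, p_race_given_geo):
    # Treating surname and geography as independent signals given race,
    # the posterior is proportional to the product of the two estimates.
    unnorm = {r: p_race_given_surname[r] * p_race_given_geo[r]
              for r in p_race_given_surname}
    total = sum(unnorm.values())
    return {r: v / total for r, v in unnorm.items()}

surname = {"white": 0.20, "black": 0.70, "other": 0.10}  # from surname list
geo     = {"white": 0.50, "black": 0.40, "other": 0.10}  # from tract data
posterior = bisg_posterior(surname, geo)
```

With these made-up inputs the combined estimate still points strongly to one group, but less strongly than the surname signal alone, which is the point of blending the two sources.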
<p>Ally Financial buys retail installment contracts from more than <a href="http://www.justice.gov/sites/default/files/crt/legacy/2014/01/13/allycomp.pdf" >12,000 automobile dealers</a> in the United States, essentially allowing dealers to act as middlemen for auto loans. If a consumer decides to finance his or her new car through a dealership rather than a bank, the dealership submits the consumer’s application to a company like Ally. If approved, <a href="http://www.consumerfinance.gov/askcfpb/817/what-retail-installment-sales-contract-or-agreement-loan.html">the consumer pays back the dealership</a> with interest. The interest rate, of course, matters a great deal. To determine what it will be, Ally calculates a “<a href="http://files.consumerfinance.gov/f/201312_cfpb_consent-order_ally.pdf">buy rate</a>”—a minimum interest rate for which it is willing to purchase a retail installment contract, as determined by actuarial models. Ally notifies dealerships of this buy rate, but then also gives them <a href="http://files.consumerfinance.gov/f/201312_cfpb_consent-order_ally.pdf">substantial leeway</a> to increase the interest rate to make the contract more profitable. Though consumers are free to negotiate these rates and shop around for the best deal, CFPB’s analysis determined that discretionary dealership pricing had a disparate impact on borrowers who were African American, Hispanic, Asian, or Pacific Islanders. On average, they paid between <a href="http://files.consumerfinance.gov/f/201312_cfpb_consent-order_ally.pdf">$200 and $300 more</a> than similarly situated white borrowers.</p>
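<p>In its simplest form, the disparate-impact analysis described above boils down to comparing average dealer markups (contract rate minus buy rate) between inferred borrower groups. A toy version with invented loan records follows; a real analysis would also control for creditworthiness, loan term, region, and other factors:</p>

```python
# Minimal markup-gap comparison on invented loan records.
# markup_bps = contract interest rate minus the lender's buy rate,
# expressed in basis points.
loans = [
    {"group": "protected", "markup_bps": 60},
    {"group": "protected", "markup_bps": 80},
    {"group": "control",   "markup_bps": 40},
    {"group": "control",   "markup_bps": 50},
]

def mean_markup(loans, group):
    vals = [l["markup_bps"] for l in loans if l["group"] == group]
    return sum(vals) / len(vals)

gap = mean_markup(loans, "protected") - mean_markup(loans, "control")
print(f"average markup gap: {gap:.0f} bps")
```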
<p>Since creditors cannot inquire about race or ethnicity, Ally’s algorithmically generated buy rates are objective assessments. But when dealerships increase these rates, their judgments are entirely subjective, relying on humans to make decisions that could very well be influenced by racial bias. If dealerships instead took a similar approach to creditors and automated this decision-making process, there would be no opportunity for human bias to enter the equation. While dealerships could still increase interest rates to capture more profits, they could do so based on algorithmic analysis of predefined criteria about a consumer’s willingness to pay, thereby preventing themselves from offering similar consumers different rates based on their race.</p>
<p>Policymakers should guard against the possibility that automated decision-making could perpetuate bias, but with ever-increasing opportunities to collect and analyze data, the public and private sectors also should follow CFPB’s lead and identify new opportunities where data analytics can help expose and reduce human bias. For example, employers could rely on algorithms to select job applicants for interviews based on their objective qualifications rather than relying on human oversight that can be biased against factors such as whether or not the job applicant has an <a href="http://www.nber.org/digest/sep03/w9873.html">African American–sounding name</a>. And taxi services could rely on algorithms to match drivers with riders rather than leaving it up to drivers who might be inclined to <a href="http://www.brilliant-corners.com/post/hailing-while-black">discriminate against passengers based on their race</a>. If policymakers let fear of computerized decision-making impede wider deployment of fair algorithms, then society will lose a valuable opportunity to build a more just world.</p>
<h2>How Do You Measure Wonder?</h2>
<p><em>October 19, 2015</em></p>
<p><em>Co-authored by Ashley Varady (Program Manager), Christina Perry (Program Director), Lauren Hall (Director of Evaluation), and Bita Nazarian (Executive Director) of <a href="http://www.826valencia.org">826 Valencia</a>.</em></p>
<p>In the world of business, terms like performance indicators and productivity are commonplace. It makes sense to want to evaluate progress and the efficacy of a given program, and change course if adequate growth isn't achieved. Increasingly, these same buzzwords have entered the conversation in the world of education, where teacher quality and school funding are connected to standardized test scores and other forms of evaluation. There has been much debate about what constitutes fair and meaningful measures of performance evaluation, for both teachers and students. With the shift to Common Core Standards, these high-stakes assessments are moving from multiple choice tests to "authentic tasks" where students communicate their understanding in writing, making writing an essential skill across all disciplines.</p>
<p>The notion that writing is critical to demonstrating understanding has been core to our work at <a href="http://www.826valencia.org">826 Valencia</a> since our founding in 2002. We work to cultivate these crucial academic skills through rigorous, "real world" writing tasks that are then published in professional books for a much wider audience. In addition to skill building, though, we also invite students to see writing as a resource and a tool for self-empowerment. Our goal is to transform a young person's relationship with writing—moving from intimidation, dread, and defeat to a source of power, wonder, creativity, and expression. We believe that these attitudinal shifts are the essential foundation for skill development. When students have positive feelings about their own potential, hope becomes the motivator—it links their effort in the classroom to their dreams for their future. When our students achieve significant, authentic successes in their daily life, it inspires them to dream bigger about their future.</p>
<p>And so what is the impact of our work? How do you measure the shifts in confidence a student experiences when they subject their writing to multiple revisions and watch their story come to life on paper? What assessment can tease out the sense of pride a student feels when her writing is published and shared beyond her family and her classroom? Or when a student who hates to write begins to see himself as a writer when he learns he has an ear for the rhythm of language? Or that his story means something to someone outside his community?</p>
<p>We ask students what matters to them, who they are, who they want to become, what they have to say, and why. Listening is at the core of this process. We let them decide how they want to tell their story and we show them that their story matters by giving them a forum—publishing and printing their words. We cultivate a sense of wonder, hold a value of creativity, and we have fun. And we see big shifts in our students' lives. We see grades improve and we also see beaming smiles—indicators of pride, confidence, and hope. And all the while, students are improving their skills.</p>
<p>As a result, our evaluation portfolio seeks to measure changes along the spectrum from skills to affinity. We pair writing assessment data with less tangible data in order to paint a complete picture of the impact of 826 Valencia. To this end, we use a variety of tools, including district-wide writing assessments and Fountas and Pinnell Reading Assessments, attitudinal surveys, conversation, reflection, and feedback—all in an effort to better understand our students' learning and demonstrate the efficacy of our model. And we see that rigorous and fun writing positively impacts students for the long haul.</p>
<p>Here's an example of this combined approach at evaluation last year: Students who participated in the 826 program at Buena Vista Horace Mann K-8 demonstrated accelerated reading level growth, with 74% of third and fourth grade students growing more than a year, in the course of just five months. This is especially important for English Language Learners, who are often entering school below grade level in reading and writing. In the first half of the 2014-2015 school year, over 35% of 826 students have already made over a year of growth, which includes 42% of fifth graders, who are participating in the 826 program for the second consecutive year. </p>
<p>We also move beyond these hard stats to hear our students' reflections. From a student in the program, Jesús Islas Garcia, who began the year hating to write: "When I'm published I feel like I'm a movie star or a millionaire. When I write about my life or funny stories or serious and sad stories, other people can laugh, cry or be sad…and when I'm an adult I will be a famous writer in the city of San Francisco."</p>
<p>A third grade student in the program wrote when asked to explain why the sky is blue: <blockquote>The sky is made with secret ingredients. It had to be made so everyone could breathe. After it was made, people threw blue glitter at it. A crocodile was selling blue glitter every night. He was selling glitter so everyone could have it. He wanted everyone to have some glitter to throw into the sky. The sky was turquoise or blue or sometimes even purple when people threw glitter at it.</blockquote>
This brief excerpt is infused with creativity—the student is deftly imaginative and lyrical.</p>
<p>Shortly after this program wrapped, we held a two-week writing camp for high schoolers called the Young Authors' Workshop, and we got this written feedback: "I learned that the stories we write are not just words. They are the windows to different worlds, to the writer's soul, to the writer's mind. Writing can take you anywhere."</p>
<p>So how do you measure that "movie star" feeling? Or the new realization in the summer camper that writing opens every door? We consider our students' words, stories, and reflections to be key performance indicators used to drive strategy and decision-making, with the same weight we give quantitative assessments. After all, everyone needs a healthy dose of glitter in their lives.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3WmrnyctW-L4dYL8voL7WMtbU2qkbqiFPqwZVggiW-BCJHy8Akv8yVzfzO8S26ewiy7u-XksOgL0KFCqDPJgDehcKWOy4V7NRLDmPANRH_Jb9tv_PRp6xio2DZumRPr2d92zCf0Jb_FE/s1600/SUN+release.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3WmrnyctW-L4dYL8voL7WMtbU2qkbqiFPqwZVggiW-BCJHy8Akv8yVzfzO8S26ewiy7u-XksOgL0KFCqDPJgDehcKWOy4V7NRLDmPANRH_Jb9tv_PRp6xio2DZumRPr2d92zCf0Jb_FE/s320/SUN+release.png" /></a></div>
<h2>Use Data and Innovation to Match Resources with Need? Sure, We Can Do That</h2>
<p><em>October 1, 2015</em></p>
<p><em>Hannah Walker is Director of Government Relations at the <a href="http://www.fmi.org">Food Marketing Institute</a> (FMI). <a href="http://www.fmi.org/blog/view/fmi-blog/2015/10/01/use-data-and-innovation-to-match-resources-with-need-sure-we-can-do-that">Reposted</a> with permission from <a href="http://www.fmi.org/blog/">FMI's blog</a>.</em></p>
<p>We have all heard the concerning statistics regarding the amount of food going to waste in the United States while many go without. As a founding member of the <a href="http://www.foodwastealliance.org/">Food Waste Reduction Alliance</a> (FWRA), FMI has been addressing the challenge of reducing food waste in both the United States and globally. FMI recently participated in an <a href="http://www.fmi.org/blog/view/fmi-blog/2015/09/18/food-waste-goals">announcement</a> with the U.S. Department of Agriculture and the Environmental Protection Agency to emphasize our commitment to the issue and highlight the importance of collaboration between government and the private sector.</p>
<p>I wanted to highlight an interesting and innovative case of using data to address this pressing public concern and stewardship issue. <a href="http://www.feedingamerica.org/">Feeding America</a> has partnered with dozens of grocers across the country seeking creative ways to solve both the food waste and hunger problems. For years, grocers had limited options when perishables were reaching their sell-by date, the primary one being to send them to the landfill. Significant changes came in 2006 when many of FMI’s members began teaming up with Feeding America to better identify perishable food and donate it rather than discard it. This improved collaboration has proven incredibly successful: grocers donated over 1.4 billion pounds of food between July 1, 2014 and June 30, 2015, a truly amazing increase from the 140 million pounds first donated when the program started in 2006.</p>
<p>In a recent conversation with Feeding America, I learned that they have found a great willingness among our retail members, from large national chains down to smaller operators, to donate perishables that will stock the shelves of the local food bank as opposed to adding to their local landfill. In one short decade, the partnership between the grocery industry and Feeding America has made perishables, such as meat, dairy, and produce, much more common items on food bank shelves.</p>
<p>This smart and seemingly simple solution is backed by the use of data, innovation and analytics to measure what and how much food is received and where to send it so that it reaches those in the greatest need. By matching meal gap data with available resources, our local food banks are able to serve those who are in the greatest need while reducing our national food waste at the same time.</p>
<p>While 1.4 billion pounds is an incredible improvement from the 140 million pounds reported just nine short years ago, there is always more that can be done. Feeding America and grocery partners are currently targeting an additional 300 million pounds of food they believe they can get by further optimizing the data and collection process.</p>
<p>There will never be one solution to solve the challenges of food waste and hunger in the United States and abroad; however, creative ideas like this partnership backed with strong data and creative innovation are making great strides toward both goals.</p>
<h2>Information sharing for more efficient network utilization and management</h2>
<p><em>September 18, 2015</em></p>
<p><em>Andreas Terzis is a Software Engineer at Google</em>. This post originally appeared on the <a href="http://googleresearch.blogspot.com/2015/09/information-sharing-for-more-efficient.html">Google Research Blog</a>.</p>
<p>As Internet traffic has grown and changed, Google and other content and application providers have worked cooperatively with Internet service providers (ISPs) so that services can be delivered quickly, efficiently and cost-effectively. For example, rather than content having to traverse a long distance and many different networks to reach an Internet access provider's network, a content provider might store (cache) the data close by and interconnect ("peer") directly with the access provider. Google has invested billions of dollars in the network and infrastructure necessary to bring our services as close to your Internet access provider's front door as possible, for free—which both reduces ISPs' costs and improves the user experience.</p>
<p>Content and application providers can also tune their services for congested and/or lower bandwidth environments. For instance, YouTube detects how smoothly a video is playing and adjusts the quality to account for temporary fluctuations in bandwidth or congestion. In the <a href="https://www.google.com/get/videoqualityreport/">Google Video Quality Report</a>, we transparently reveal the speeds YouTube is experiencing on different networks.</p>
<p>As more Internet traffic becomes encrypted, some network operators have expressed concern about the effect encryption might have on their ability to manage their networks. We don't think there has to be a trade-off here—there are ways to do effective network management of encrypted traffic today, and, through further cooperation between content and application providers and ISPs, we believe this could be made easier while still respecting encryption.</p>
<p>To spur discussion and collaboration on this front, we recently submitted a <a href="https://www.iab.org/wp-content/IAB-uploads/2015/08/MaRNEW_1_paper_19.pdf">paper</a> to a workshop organized by the <a href="https://www.iab.org/activities/workshops/marnew/">Internet Architecture Board</a> outlining some ideas. We advocate a model in which ISPs selectively share network state with content and application providers, enabling them to adapt to available network resources.</p>
<p>For example, we recently proposed to the <a href="http://www.ietf.org/">Internet Engineering Task Force</a> the concept of <a href="https://tools.ietf.org/html/draft-flinck-mobile-throughput-guidance-03">Throughput Guidance</a> (TG), whereby mobile network operators could share information about the throughput of a radio downlink. Preliminary field tests in a production LTE network showed that TG reduces YouTube join latency, defined as the amount of time until the video starts playing, by 8% on average, rebuffering time by 20% on average, and rebuffer count by 2% on average. In addition to improving quality of experience for users, this mechanism improves the utilization of providers’ networks. Encryption of traffic would have no impact on the efficacy of this approach; it works equally well with encrypted and unencrypted traffic.</p>
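<p>As a rough sketch of how a client might act on such a hint (this illustrates the general idea, not the actual TG protocol or YouTube's player logic), an adaptive video player could cap its encoding ladder at a safety margin below the advertised downlink throughput instead of probing for bandwidth from scratch:</p>

```python
# Hypothetical use of a throughput hint for adaptive bitrate selection.
# The bitrate ladder and the 0.8 safety factor are illustrative choices.
BITRATES_KBPS = [250, 500, 1000, 2500, 5000]  # example encoding ladder

def select_bitrate(guidance_kbps, safety=0.8):
    """Pick the highest ladder rung that fits under a margin of the hint."""
    budget = guidance_kbps * safety
    eligible = [b for b in BITRATES_KBPS if b <= budget]
    # Fall back to the lowest rung if even that exceeds the budget.
    return eligible[-1] if eligible else BITRATES_KBPS[0]

select_bitrate(3000)  # budget 2400 kbps -> selects the 1000 kbps rung
```

Starting at a rung the network can actually sustain is what would reduce join latency and rebuffering: the player avoids both an over-optimistic first request and a slow ramp-up from the bottom of the ladder.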
<p>Throughput Guidance is one possible solution, and many questions remain unanswered. It's still relatively early days in our exploration of this and the other measures in our short paper, and we're looking forward to getting feedback and collaborating with network operators and others.</p>
<h2>Data for social good: suicide prevention</h2>
<p><em>September 11, 2015</em></p>
<p>Earlier this week, <a href="http://magazine.good.is/">GOOD Magazine</a> published an interesting piece by <a href="http://markehay.com/">Mark Hay</a> on suicide prevention titled "<a href="http://magazine.good.is/articles/suicide-prevention-week-data-driven-efforts">Can Big Data Help Us Fight Rising Suicide Rates?</a>" The part of the article that talks about data-driven prevention starts about halfway through. What follows is an excerpt from that section.</p>
<blockquote><p>Yet there is one frontier in suicide prevention that seems especially promising, though in a way, it may be a bit removed from the problem’s human element: big data predictions and intervention targeting.</p>
<p>We know that some populations are more likely than others to commit suicide. Men in the United States account for 79 percent of all suicides. People in their 20s are at higher risk than others. And whites and Native Americans tend to have higher suicide rates than other ethnicities. Yet we don’t have the greatest ability to grasp trends and other niche factors to build up actionable, targetable profiles of communities where we should focus our efforts. We’re stuck trying to expand a suicide prevention dragnet, as opposed to getting individuals at risk the precise information they need (even if they don’t tip off major signs to their friends and family).</p>
<p>That’s a big part of why last year, groups like the National Action Alliance for Suicide Prevention’s <a href="http://actionallianceforsuicideprevention.org/task-force/research-prioritization">Research Prioritization Task Force</a> listed better surveillance, data collection, and research on existing data as priorities for work in the field over the next decade. It’s also why <a href="http://www.dailydot.com/technology/algorithm-suicide-prevention/">multiple organizations</a> are now <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3804920/">developing algorithms</a> to sort through diverse datasets, trying to identify behaviors, social media posting trends, language, lifestyle changes, or any other proxy that can help us predict suicidal tendencies. By doing this, the theory goes, we can target and deliver exactly the right information.</p>
<p>One of the greatest proponents of this data-heavy approach to suicide prevention is the <a href="http://fivethirtyeight.com/features/the-army-is-building-an-algorithm-to-prevent-suicide/">United States Army</a>, which suffers from a suicide rate many times higher than the general population. In 2012, they had more suicide deaths than casualties in Afghanistan. Yet with millions of soldiers stationed around the globe and limited suicide prevention resources, it’s been difficult to simply rely on expanding the dragnet. Instead, last December the Army announced that they’d developed an algorithm that distills the details of a soldier’s personal information into a set of 400 characteristics that mix and match to show whether an individual is likely in need of intervention. Their analysis isn’t perfect yet, but they’ve been able to identify a cluster of characteristics within 5 percent of military personnel who accounted for 52 percent of suicides, showing that they’re on the right track to better targeting and allocating prevention resources.</p>
<p>Yet perhaps the greatest distillation of this data-driven approach (combined with the expansive, barrier-reducing impulse of mainstream efforts) is the <a href="http://www.crisistextline.org/">Crisis Text Line</a>. Created in 2013 by organizers from DoSomething.org, the text line allows those too scared, embarrassed, or uncomfortable to vocalize their problems to friends, or over a hotline, to simply trace a pattern on a cell phone keypad (741741) and then type their problems in a text message. As of 2015, algorithmic learning allows the Crisis Text Line to search for keywords, based on over 8 million previous texts and data gathered from hundreds of suicide prevention workers, to identify who’s at serious risk and assign counselors to respond. But more than that, the data in texts can trip off time and vocabulary sensors, matching counselors with expertise in certain areas to respond to specific texters, or bringing up precisely tailored resources. For example, the system knows that self-harm peaks at 4 a.m. and that people typing “Mormon” are usually dealing with issues related to LGBTQ identity, discrimination, and isolation. Low-impact and low-cost with high potential for delivering the best information possible to those in need, it’s one of the cleverer young programs out there pushing the suicide prevention gains made over the last century.</p>
<p>It’ll be a few years before we can understand the impact of data analysis and targeting on suicide prevention efforts, especially relative to general attempts to expand existing programs. And given the limited success of a half-century of serious gains in understanding and resource provision, we’d be wise not to get our hopes up too much. But it’s not unreasonable to suspect that a combination of diversifying means of access, lowering barriers of communication, and better identifying those at risk could help us bring programs to populations that have not yet received them (or that we could not support quickly enough before). At the very least, crunching existing data may help us to discover why suicide rates have increased in recent years and to understand the mechanisms of this widespread social issue. We have solid, logical reason to support the development of programs like the Army’s algorithms and the Crisis Text Line, and to push for further similar initiatives. But really we have reason to support any kind of suicide prevention innovation, even if it feels less robust or promising than the recent data-driven efforts. If you've ever witnessed the pain that those moving towards suicide feel, or the wide-reaching fallout after someone takes his or her life, you'll understand the visceral, human need to let a thousand flowers bloom, desperately hoping that one of them sticks. Hopefully, if data mining and targeting works well, that'll only inspire further innovation, slowly putting a greater and greater dent in the phenomenon of suicide.</p></blockquote>
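<p>The Army figure quoted above, a 5 percent slice of personnel accounting for 52 percent of suicides, is a concentration-of-risk measure: rank everyone by predicted risk and ask what share of adverse outcomes falls in the top-scoring slice. A toy version of that computation, with invented scores and outcomes:</p>

```python
# Toy concentration-of-risk check. Scores and outcomes are invented;
# outcomes use 1 for an adverse event and 0 otherwise.
def share_of_outcomes_in_top(scores, outcomes, top_frac):
    """Share of all events captured by the highest-scoring top_frac slice."""
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    top_hits = sum(o for _, o in ranked[:k])
    return top_hits / sum(outcomes)

scores   = [0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.01]
outcomes = [1,   1,   0,   1,   0,   0,   0,   0,   0,    0]
share_of_outcomes_in_top(scores, outcomes, 0.2)  # top 2 of 10 capture 2 of 3 events
```

The higher this share is for a small slice, the more a prevention program can concentrate its limited resources instead of expanding the dragnet uniformly.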
<h2>The reusable holdout: Preserving validity in adaptive data analysis</h2>
<p><em>Posted August 10, 2015. Moritz Hardt is a Research Scientist at Google. This post was originally published on the <a href="http://googleresearch.blogspot.com/2015/08/the-reusable-holdout-preserving.html">Google Research Blog</a>.</em></p>
<p>Machine learning and statistical analysis play an important role at the forefront of scientific and technological progress. But with all data analysis, there is a danger that findings observed in a particular sample do not generalize to the underlying population from which the data were drawn. A popular <a href="https://xkcd.com/882/">XKCD cartoon</a> illustrates that if you test sufficiently many different colors of jelly beans for correlation with acne, you will eventually find one color that correlates with acne at a <a href="https://en.wikipedia.org/wiki/P-value"><i>p-value</i></a> below the infamous 0.05 significance level.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4GAR0PjavMwSBFz82YcVdeT-YuTJ22przK2lI7hMKHY6YcBcqDK_Zxo6AUnceJjAB3q3zWdA97u6UxO3BE6Wi2aj303F7XJMWzdv1bId6ULFRrzJlB5zM3Orc-3ojpfssy-jbq6i_vUU/s1600/image01.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4GAR0PjavMwSBFz82YcVdeT-YuTJ22przK2lI7hMKHY6YcBcqDK_Zxo6AUnceJjAB3q3zWdA97u6UxO3BE6Wi2aj303F7XJMWzdv1bId6ULFRrzJlB5zM3Orc-3ojpfssy-jbq6i_vUU/s400/image01.png" width="347" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image credit: <a href="http://imgs.xkcd.com/comics/significant.png">XKCD</a></td></tr>
</tbody></table>
<p>Unfortunately, the problem of false discovery is even more delicate than the cartoon suggests. Correcting reported p-values for a fixed number of multiple tests is a fairly well understood topic in statistics. A simple approach is to multiply each p-value by the number of tests, but there are more sophisticated tools. However, almost all existing approaches to ensuring the validity of statistical inferences assume that the analyst performs a <i>fixed</i> procedure chosen before the data are examined. For example, "test all 20 flavors of jelly beans." In practice, however, the analyst is informed by data exploration, as well as the results of previous analyses. How did the scientist choose to study acne and jelly beans in the first place? Often such choices are influenced by previous interactions with the same data. This <i>adaptive</i> behavior of the analyst leads to an increased risk of spurious discoveries that are neither prevented nor detected by standard approaches. Each adaptive choice the analyst makes multiplies the number of analyses that could follow; it is often difficult or impossible to describe and analyze the exact experimental setup ahead of time.</p>
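The arithmetic behind the cartoon is easy to check with a small simulation (an illustrative sketch, not code from the paper); it also shows the simple multiply-by-the-number-of-tests correction at work:

```python
import random

random.seed(0)

TRIALS, NUM_TESTS, ALPHA = 10_000, 20, 0.05

def min_pvalue(num_tests):
    # Under the null hypothesis, each test's p-value is uniform on [0, 1].
    return min(random.random() for _ in range(num_tests))

# Uncorrected: chance of at least one "significant" result among 20 null tests.
raw = sum(min_pvalue(NUM_TESTS) < ALPHA for _ in range(TRIALS)) / TRIALS

# Bonferroni: multiply each p-value by the number of tests before comparing.
bonf = sum(min_pvalue(NUM_TESTS) * NUM_TESTS < ALPHA for _ in range(TRIALS)) / TRIALS

print(f"false-alarm rate, uncorrected: {raw:.2f}")   # analytically 1 - 0.95**20, about 0.64
print(f"false-alarm rate, Bonferroni: {bonf:.2f}")   # back near the nominal 0.05
```

With 20 truly null tests, a "significant" jelly-bean color turns up almost two times out of three; multiplying each p-value by 20 restores the nominal error rate.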
<p>In <strong><a href="http://www.sciencemag.org/content/349/6248/636">The Reusable Holdout: Preserving Validity in Adaptive Data Analysis</a></strong>, a joint work with <a href="http://research.microsoft.com/en-us/people/dwork/">Cynthia Dwork</a> (Microsoft Research), <a href="http://researcher.watson.ibm.com/researcher/view.php?person=us-vitaly">Vitaly Feldman</a> (IBM Almaden Research Center), <a href="http://www.cs.toronto.edu/~toni/">Toniann Pitassi</a> (University of Toronto), <a href="https://omereingold.wordpress.com/">Omer Reingold</a> (Samsung Research America) and <a href="http://www.cis.upenn.edu/~aaroth/">Aaron Roth</a> (University of Pennsylvania), to appear in <i><a href="http://www.sciencemag.org/">Science</a></i> tomorrow, we present a new methodology for navigating the challenges of adaptivity. A central application of our general approach is the <i>reusable holdout</i> mechanism that allows the analyst to safely validate the results of many adaptively chosen analyses without the need to collect costly fresh data each time.</p>
<h3>The curse of adaptivity</h3>
<p>A beautiful example of how false discovery arises as a result of adaptivity is <a href="https://en.wikipedia.org/wiki/Freedman%27s_paradox">Freedman's paradox</a>. Suppose that we want to build a model that explains "systolic blood pressure" in terms of hundreds of variables quantifying the intake of various kinds of food. In order to reduce the number of variables and simplify our task, we first select some promising looking variables, for example, those that have a positive correlation with the response variable (systolic blood pressure). We then fit a linear regression model on the selected variables. To measure the goodness of our model fit, we crank out a standard <i><a href="https://en.wikipedia.org/wiki/F-test">F-test</a></i> from our favorite statistics textbook and report the resulting p-value.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy8_rPT7qnGpKSzI9G_6riUlmnz0VTO9rqRHisV9xUPaN9KOnqt5b3_FGawPrpesXIFBH6PXvv5Vv0YRWkJjlDR5dKBdI6mvTjF6m-zRex4Lsfl-CbhE0tCRxHJtieJ4BNNBWnAUflDHM8/s1600/pic4.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy8_rPT7qnGpKSzI9G_6riUlmnz0VTO9rqRHisV9xUPaN9KOnqt5b3_FGawPrpesXIFBH6PXvv5Vv0YRWkJjlDR5dKBdI6mvTjF6m-zRex4Lsfl-CbhE0tCRxHJtieJ4BNNBWnAUflDHM8/s320/pic4.png" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0XZJb0yARaWJK5Ypy0j60EXTQ_eQYDo4wbJVZfk1BJ4A-zAz64JVjT-tWwzo5FKEViXp2NTJDJNyQ1IrLU_AG8xEJ8odDPpnF8kwkJP3elps9Sxv4rVwGHYcI3LhxJmkv43xAUbqY4qCs/s1600/pic1.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0XZJb0yARaWJK5Ypy0j60EXTQ_eQYDo4wbJVZfk1BJ4A-zAz64JVjT-tWwzo5FKEViXp2NTJDJNyQ1IrLU_AG8xEJ8odDPpnF8kwkJP3elps9Sxv4rVwGHYcI3LhxJmkv43xAUbqY4qCs/s320/pic1.png" /></a>
</td></tr>
<tr><td class="tr-caption" style="text-align: center;">Inference after selection: We first select a subset of the variables based on a data-dependent criterion and then fit a linear model on the selected variables.</td></tr>
</tbody></table>
<p>Freedman showed that the reported p-value is highly misleading—even if the data were completely random with no correlation whatsoever between the response variable and the explanatory variables, we'd likely observe a significant p-value! The bias stems from the fact that we selected a subset of the variables adaptively based on the data, but we never account for this fact. There is a huge number of possible subsets of variables that we selected from. The mere fact that we chose one test over the other by peeking at the data creates a selection bias that invalidates the assumptions underlying the F-test.</p>
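The selection effect is easy to reproduce. The toy simulation below (an illustration in the spirit of Freedman's setup, not his original experiment; all variable names are ours) selects "promising" variables from pure noise by correlation with the response:

```python
import math
import random

random.seed(1)
n, p = 100, 200  # 100 samples, 200 pure-noise "food intake" variables

y = [random.gauss(0, 1) for _ in range(n)]           # "blood pressure": pure noise
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]

def corr(a, b):
    # Pearson correlation of two equal-length samples.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((z - mb) ** 2 for z in b))
    return cov / (sa * sb)

# "Promising" variables: those passing a naive per-variable cutoff ~ 2/sqrt(n).
threshold = 2 / math.sqrt(n)
selected = [j for j in range(p) if abs(corr(X[j], y)) > threshold]
print(f"{len(selected)} of {p} pure-noise variables look 'promising'")
```

Roughly 5% of the 200 noise variables pass by chance alone, and a regression fit only on those survivors will report a wildly optimistic F-test p-value, exactly as the paradox predicts.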
<p>Freedman's paradox bears an important lesson. Significance levels of standard procedures do not capture the vast number of analyses one can choose to carry out or to omit. For this reason, adaptivity is one of the primary explanations of why research findings are frequently false, as argued by Gelman and Loken, who aptly refer to adaptivity as the "<a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">garden of forking paths</a>."</p>
<h3>Machine learning competitions and holdout sets</h3>
<p>Adaptivity is not just an issue with p-values in the empirical sciences; it affects other domains of data science as well. Machine learning competitions are a perfect example. Competitions have become an extremely popular format for solving prediction and classification problems of all sorts.</p>
<p>Each team in the competition has full access to a publicly available training set which they use to build a predictive model for a certain task such as image classification. Competitors can repeatedly submit a model and see how the model performs on a fixed holdout data set not available to them. The central component of any competition is the public leaderboard which ranks all teams according to the prediction accuracy of their best model so far on the holdout. Every time a team makes a submission they observe the score of their model on the same holdout data. This methodology is inspired by the classic holdout method for validating the performance of a predictive model.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaVZBTDSjEYvDMY95VxLYATuyWT9s-jqQYeHabbqBCgF30nOiSyVcKPIERiKXQThgi6rqE6cQT_0SlZzOsxiSIRJGbxyFvK9OWr4LvrLxPkKzNPw-MNxDnzlnVtG6EVcXy1AL7q3zCgiU9/s1600/pic2.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaVZBTDSjEYvDMY95VxLYATuyWT9s-jqQYeHabbqBCgF30nOiSyVcKPIERiKXQThgi6rqE6cQT_0SlZzOsxiSIRJGbxyFvK9OWr4LvrLxPkKzNPw-MNxDnzlnVtG6EVcXy1AL7q3zCgiU9/s320/pic2.png" /></a></div>
<p>Ideally, the holdout score gives an accurate estimate of the true performance of the model on the underlying distribution from which the data were drawn. However, this is only the case when the model is independent of the holdout data! In contrast, in a competition the model generally incorporates previously observed feedback from the holdout set. Competitors work adaptively and iteratively with the feedback they receive. An improved score for one submission might convince the team to tweak their current approach, while a lower score might cause them to try out a different strategy. But the moment a team modifies their model based on a previously observed holdout score, they create a dependency between the model and the holdout data that invalidates the assumption of the classic holdout method. As a result, competitors may begin to overfit to the holdout data that supports the leaderboard. This means that their score on the public leaderboard continues to improve, while the true performance of the model does not. In fact, unreliable leaderboards are a widely observed phenomenon in machine learning competitions.</p>
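This overfitting-by-feedback effect can be demonstrated with entirely random "models" (a toy illustration, not the paper's exact experiment): keep only the random classifiers that happen to beat chance on the holdout, then vote them together.

```python
import random

random.seed(2)
n = 2000
labels = [random.choice((-1, 1)) for _ in range(n)]  # holdout labels: pure coin flips

def accuracy(preds):
    return sum(p == l for p, l in zip(preds, labels)) / n

# Adaptive strategy: submit 500 random classifiers, keep the ones the
# leaderboard says beat chance, and ensemble them by majority vote.
kept = []
for _ in range(500):
    clf = [random.choice((-1, 1)) for _ in range(n)]
    if accuracy(clf) > 0.5:
        kept.append(clf)

vote = [1 if sum(c[i] for c in kept) > 0 else -1 for i in range(n)]
print(f"leaderboard (holdout) accuracy of the ensemble: {accuracy(vote):.3f}")  # well above chance

# On fresh coin-flip labels the same ensemble is back to ~50%.
fresh = [random.choice((-1, 1)) for _ in range(n)]
true_acc = sum(p == l for p, l in zip(vote, fresh)) / n
print(f"accuracy on fresh data: {true_acc:.3f}")
```

The leaderboard score climbs well above 50% even though every classifier is a coin flip; the gap is entirely overfitting to the reused holdout.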
<h3>Reusable holdout sets</h3>
<p>A standard proposal for coping with adaptivity is simply to discourage it. In the empirical sciences, this proposal is known as <i>pre-registration</i> and requires the researcher to specify the exact experimental setup ahead of time. While possible in some simple cases, it is in general too restrictive as it runs counter to today's complex data analysis workflows.</p>
<p>Rather than limiting the analyst, our approach provides means of reliably verifying the results of an arbitrary adaptive data analysis. The key tool for doing so is what we call the <i>reusable holdout method</i>. As with the classic holdout method discussed above, the analyst is given unfettered access to the training data. What changes is that there is a new algorithm in charge of evaluating statistics on the holdout set. This algorithm ensures that the holdout set maintains the essential guarantees of fresh data over the course of many estimation steps.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitobXn0ARqDjtXtkkdDNCvTGgFOYWLPcZoHsm-g5twK5WBtFcn7YGE8aNhwAGmFyOcs6h7XyiL-wefqTrRywW769AnHw0zupGRPGr-AneDTEPdN4phR8yxSfKc0eNZNlBKE3V1R7LFOyt0/s1600/pic3.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitobXn0ARqDjtXtkkdDNCvTGgFOYWLPcZoHsm-g5twK5WBtFcn7YGE8aNhwAGmFyOcs6h7XyiL-wefqTrRywW769AnHw0zupGRPGr-AneDTEPdN4phR8yxSfKc0eNZNlBKE3V1R7LFOyt0/s320/pic3.png" /></a></div>
<p>The limit of the method is determined by the size of the holdout set—the number of times that the holdout set may be used grows roughly as the square of the number of collected data points in the holdout, as our theory shows.</p>
<p>Armed with the reusable holdout, the analyst is free to explore the training data and verify tentative conclusions on the holdout set. It is now entirely safe to use any information provided by the holdout algorithm in the choice of new analyses to carry out, or the tweaking of existing models and parameters.</p>
<h3>A general methodology</h3>
<p>The reusable holdout is only one instance of a broader methodology that is, perhaps surprisingly, based on <a href="https://en.wikipedia.org/wiki/Differential_privacy">differential privacy</a>—a notion of privacy preservation in data analysis. At its core, differential privacy is a notion of <i><a href="https://en.wikipedia.org/wiki/Stability_(learning_theory)">stability</a></i> requiring that any single sample should not influence the outcome of the analysis significantly.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVxrru8ATtgjdbH_DpEvOjk4Hl2bCClrStqUL03n9ZFP4PAskhh378jU9QKSuAmpzOrWlHnGyzS_dQHkDz8ZURBEOMm-zKKZAd8EX9G2oaOncUqWEV3zEzDldc64Ex4yPHTVELAKCc9M3V/s1600/pic4.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVxrru8ATtgjdbH_DpEvOjk4Hl2bCClrStqUL03n9ZFP4PAskhh378jU9QKSuAmpzOrWlHnGyzS_dQHkDz8ZURBEOMm-zKKZAd8EX9G2oaOncUqWEV3zEzDldc64Ex4yPHTVELAKCc9M3V/s320/pic4.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of a stable learning algorithm: Deletion of any single data point does not affect the accuracy of the classifier much.</td></tr>
</tbody></table>
<p>A beautiful line of work in machine learning shows that various notions of stability imply <i>generalization</i>. That is, any sample estimate computed by a stable algorithm (such as the prediction accuracy of a model on a sample) must be close to what we would observe on fresh data.</p>
<p>What sets differential privacy apart from other stability notions is that it is preserved by <i>adaptive</i> composition. Combining multiple algorithms that each preserve differential privacy yields a new algorithm that also satisfies differential privacy albeit at some quantitative loss in the stability guarantee. This is true even if the output of one algorithm influences the choice of the next. This strong adaptive composition property is what makes differential privacy an excellent stability notion for adaptive data analysis.</p>
<p>In a nutshell, the reusable holdout mechanism is simply this: access the holdout set only through a suitable differentially private algorithm. It is important to note, however, that the user does not need to understand differential privacy to use our method. The user interface of the reusable holdout is the same as that of the widely used classical method.</p>
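In code, a simplified sketch of such a mechanism might look like the following. The class name, threshold, and noise scales below are illustrative choices of ours, not the exact parameters analyzed in the paper:

```python
import math
import random

random.seed(3)

def laplace(scale):
    # Sample Laplace noise via the inverse CDF of a uniform on (-1/2, 1/2).
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

class ReusableHoldout:
    """Sketch of a Thresholdout-style mechanism: the analyst sees holdout
    statistics only through a noisy threshold test."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma = threshold, sigma

    def query(self, phi):
        """phi maps a data point to [0, 1]; returns a validated mean of phi."""
        mean_train = sum(map(phi, self.train)) / len(self.train)
        mean_hold = sum(map(phi, self.holdout)) / len(self.holdout)
        # If training and holdout estimates agree up to a noisy threshold,
        # answer from the training set: nothing about the holdout leaks.
        if abs(mean_train - mean_hold) <= self.threshold + laplace(self.sigma):
            return mean_train
        # Otherwise answer from the holdout, with Laplace noise for privacy.
        return mean_hold + laplace(self.sigma)

# Usage: validate a statistic on synthetic data drawn uniformly from [0, 1].
train = [random.random() for _ in range(5000)]
holdout = [random.random() for _ in range(5000)]
rh = ReusableHoldout(train, holdout)
estimate = rh.query(lambda x: x)
print(f"validated estimate of the mean: {estimate:.3f}")  # close to 0.5
```

Because most queries are answered from the training set and the rare holdout answers are noised, the holdout retains its freshness across many adaptive queries.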
<h3>Reliable benchmarks</h3>
<p>A closely <a href="http://jmlr.org/proceedings/papers/v37/blum15.html">related work with Avrim Blum</a> dives deeper into the problem of maintaining a reliable leaderboard in machine learning competitions (see <a href="http://blog.mrtz.org/2015/03/09/competition.html">this blog post</a> for more background). While the reusable holdout could directly be used for this purpose, it turns out that a variant of the reusable holdout, which we call the <i>Ladder algorithm</i>, provides even better accuracy.</p>
<p>This method is not just useful for machine learning competitions, since there are many problems that are roughly equivalent to that of maintaining an accurate leaderboard in a competition. Consider, for example, a performance benchmark that a company uses to test improvements to a system internally before deploying them in a production system. As the benchmark data set is used repeatedly and adaptively for tasks such as model selection, hyper-parameter search and testing, there is a danger that eventually the benchmark becomes unreliable.</p>
<h3>Conclusion</h3>
<p>Modern data analysis is inherently an adaptive process. Attempts to limit what data scientists will do in practice are ill-fated. Instead, we should create tools that respect the usual workflow of data science while at the same time increasing the reliability of data-driven insights. It is our goal to continue exploring techniques that lead to more reliable validation methods and benchmarks that track true performance more accurately than existing methods.</p>
<h2>Bridging the Digital Divide in Gigabit Cities</h2>
<p><em>Posted July 30, 2015. Denise Linn conducted this research as an MPP Candidate at the Harvard Kennedy School. She is currently a Program Analyst at the <a href="http://www.smartchicagocollaborative.org/">Smart Chicago Collaborative</a>.</em></p>
<p>With the rise of coalitions like <a href="http://nextcenturycities.org/">Next Century Cities</a> and <a href="http://www.gig-u.org/">Gig.U</a> and the development of groundbreaking networks in cities like Chattanooga and Kansas City, the buzz surrounding gigabit Internet speeds has swelled in the US. Cities are working closely with companies like Google Fiber or even building out fiber-optic infrastructure themselves. The suggested rewards of these investments include stronger local economies, vibrant tech startup scenes, progress in distance learning, telemedicine, research—and the list goes on.</p>
<p>But when superfast gigabit speeds are available in a city, what does that mean for people beyond tech entrepreneurs and other heavy Internet users? How can cities make sure that technological innovation lifts up the lives of every resident? This all leads to the ultimate question I examined in my <a href="http://www.slideshare.net/DeniseLinn/a-datadriven-digital-inclusion-strategy-for-gigabit-cities">recent research</a>: What does the availability of high speed Internet mean for the <a href="https://www.whitehouse.gov/share/heres-what-digital-divide-looks-united-states">digital divide</a>?</p>
<p>Unpacking public data can shed some light on this important issue. The 2013 American Community Survey’s tract and city-level demographic data merged with the Federal Communications Commission’s <a href="https://www.fcc.gov/encyclopedia/form-477-census-tract-data-internet-access-services">broadband subscribership data</a> tell us a complex story about what faster speeds do to digital inclusion in metro areas. Though on the surface gigabit and non-gigabit cities do not appear to differ greatly in overall broadband adoption, the data show a significant interaction between poverty and gigabit infrastructure. In other words, the presence of gigabit infrastructure has a significant correlation with higher connectivity in lower-income neighborhoods. Poorer cities and poorer census tracts are predicted to fare better when there is gigabit availability.</p>
<div style="text-align:center;"><iframe src="//www.slideshare.net/slideshow/embed_code/key/3JfZ6mh1nZ6UNJ" width="510" height="420" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/DeniseLinn/summary-onepager-pae-poster-42615" title="Summary one-pager: Data-Driven Digital Inclusion Strategy for Gigabit Cities" target="_blank">Summary one-pager: Data-Driven Digital Inclusion Strategy for Gigabit Cities</a> </strong> from <strong><a href="//www.slideshare.net/DeniseLinn" target="_blank">Denise Linn</a></strong> </div></div>
<p>Why is this? There are a few possible explanations:
<ol><li><strong>Increased competition:</strong> It’s possible that faster speeds spur competition, lower prices, and make at-home broadband subscriptions possible for more people.</li>
<li><strong>Greater awareness of why the Internet is important:</strong> According to Pew, the number one barrier for broadband adoption in the home is lack of awareness or understanding of how the Internet is relevant to everyday activities. It’s possible that the community organizing process required to build gigabit networks engages low-income neighborhoods and heightens awareness of why the Internet is important throughout a city.</li>
<li><strong>Empowered anchor institutions in low-income areas:</strong> Within gigabit cities, anchor institutions—community-based organizations and libraries—deliver critical services to help get people online. In my research I saw interesting outliers—namely, very poor census tracts that were walkable and had easy access to public amenities or programs, yet saw higher rates of Internet connectivity. For example, Hamilton County’s census tract 20 in Chattanooga, TN is both dense and home to four churches and Howard High School. In 2013, 46% of households in this tract were living in poverty, but over 80% subscribed to broadband service.</li></ol></p>
<p>The data analysis also points to weaknesses in high-speed Internet cities: broadband adoption lags in concentrated populations of non-English speakers and in communities with low educational attainment. Interestingly, these residents are predicted to be worse off in gigabit cities. This observation points to what many might already suspect—that the relevancy and skill barriers to broadband adoption cannot be solved by faster speeds alone.</p>
<p>Fortunately, cities can understand and take ownership over their own digital divides, whether they are gigabit cities or aspiring gigabit cities. The public sector has a major role to play in digital inclusion. For example, cities can hire a digital inclusion specialist to work full time on the issue or create a grants program for local nonprofits. It’s clear that city governments can set the tone for broadband adoption. You can see my recommended digital inclusion actions for city governments <a href="https://docs.google.com/document/d/19f6lExyyw5HOHDzGZvJ6M2O7Te1V7aqizladZUr8Lnk/edit">here</a>.</p>
<p>The National League of Cities, in partnership with Next Century Cities and Google Fiber, is conducting a <a href="https://attendee.gotowebinar.com/register/9161177430187590402">webinar</a> on August 6th to <a href="https://docs.google.com/document/d/19f6lExyyw5HOHDzGZvJ6M2O7Te1V7aqizladZUr8Lnk/edit?usp=sharing">provide practical steps</a> and specific case examples for city governments seeking to heighten their work in this area. Also, cities with great programs or programming ideas will have the opportunity to win a first-ever Digital Inclusion Leadership Award and share their success stories at the NLC conference in November.</p>
<p>To learn more about digital inclusion and dive deeper into the subjects covered in this post, see <a href="http://nextcenturycities.org/wp/wp-content/uploads/2015/07/Denise-Linn-PAE-3.31.15.pdf">A Data-Driven Digital Inclusion Strategy for Gigabit Cities</a>, or the summary <a href="http://nextcenturycities.org/wp/wp-content/uploads/2015/07/Digital-Inclusion-Summary.pdf">here</a>.</p>
<h2>Mapping youth well-being worldwide with open data</h2>
<p><em>Posted July 29, 2015. Ryan Swanstrom is a blogger at <a href="http://101.datascience.community/author/rswanstrom/">Data Science 101</a>. This post originally appeared on <a href="http://www.datakind.org/blog/mapping-youth-well-being-worldwide-with-open-data/">DataKind's blog</a>.</em></p>
<h3>How does mapping child poverty in Washington DC help inform efforts to support child and young adult well being in the UK and Kentucky?</h3>
<p>Back in March 2012, a team of DataKind volunteers in Washington DC worked furiously to finish their final presentation at a weekend DataDive. Little did they know, the impact of their work would extend far beyond DC and far beyond the weekend. Their prototyped visualization ultimately became a polished tool that would impact communities worldwide.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="http://datatools.dcactionforchildren.org/" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOtPITxlnJ1LR0vY6TnC9djTQOx4n0KeI_r0d-L_kPpLpVplgwgTSbUTXCJDd1CB03VJOQoyhX_FKsYeJvge3GFWEvogmXi-yQXorPvQdV7I5IgC_stKvV-tg1aFAcrKU5msEJMAsPXf8d/s200/Screen+Shot+2015-07-29+at+8.43.52+AM.png" /></a></div>
<p>DC Action for Children's <a href="http://datatools.dcactionforchildren.org/">Data Tools 2.0</a> is an interactive visualization tool to explore the effects of income, healthcare, neighborhoods, and population on child well-being in the Washington DC area. The source code for Data Tools 2.0 and open data sources have since been used by DataKind UK and Code for America volunteers to benefit their local partners. There is now potential for it to reach even more communities through DataLook's <a href="http://blog.datalook.io/openimpact/">#openimpact Marathon</a>.</p>
<p>See how far a solution can spread when you bring together open data, open code and open hearted volunteers around the world.</p>
<h3>What a difference a DataDive makes</h3>
<p><a href="https://www.dcactionforchildren.org/">DC Action for Children</a>, a Washington DC nonprofit focusing on child well-being, needed help understanding how Washington DC could be one of the most affluent and wealthy cities in the United States, yet have one of the highest child poverty rates. Could mapping child poverty help uncover patterns and insights to drive action to address it?</p>
<p>A team of DataDive volunteers, led by Data Ambassador Sisi Wei, took on the challenge and, in less than 24 hours, created a <a href="http://sisiwei.github.io/DC-Count-for-Kids/dckids/">prototype</a> that wrangled data in a multitude of forms from government agencies, the Census, and DC Action for Children's own databases. That 24-hour effort then evolved into a multi-month DataCorps project involving many DataKind volunteers. The team unveiled a more polished version to a large and influential audience in Washington DC, including the Mayor of DC himself! They then completed the final enhancements to create <a href="http://datatools.dcactionforchildren.org/">Data Tools 2.0</a>, which is now live on DC Action for Children’s website.</p>
<p>The project has since released the <a href="https://github.com/DCActionforChildren/dcaction">source code on GitHub</a>, and the team has continued to collaborate and <a href="http://www.datakind.org/blog/dc-action-for-children-long-term-collaboration-for-long-term-impact/">advance the project</a> to where it is today. In fact, if you’re local, check out the August 5th <a href="http://www.meetup.com/DataKind-DC/events/224080972/">DataKind DC Meetup</a> to join in and continue improving the tool.</p>
<p>This story alone is incredible and speaks to the remarkable commitment of these volunteers and the importance of having a strong partner like DC Action for Children to implement and utilize the work as an integrated part of its mission.</p>
<p>And that's usually where the story ends. Thanks to DataKind’s global network though, the impact of this work was just starting to spread.</p>
<iframe allowfullscreen="" frameborder="0" height="315" src="https://www.youtube-nocookie.com/embed/aSbPzPi_gPM" width="560"></iframe>
<h3>A Visualization Goes Viral</h3>
<p>Because the visualization used open data (freely available data for public use) and open source software or code (freely available code that can be viewed, modified, and reused), other volunteers could quickly repurpose the work and apply it to their local community.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://datakind-uk.github.io/child-poverty-commission-dashboard/" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy5kyd4RKyxi9f0-d9ruU3NqTIxC8dHdEVzcLoGVzrZuSdK3dFVwMZAQJy8yD33li_pO5lv9TT1xR9iutp8Q6xAST-rY9Q6984PRx26Iees8sL1Ap6A67VWDkLhAHUCfzSWNoMK651tx33/s200/Screen+Shot+2015-07-29+at+9.23.01+AM.png" /></a></div>
<h4>DataKind UK London DataDive</h4>
<p>The first time the visualization was replicated was in October 2014 for <a href="http://www.nechildpoverty.org.uk/">The North East Child Poverty Commission</a>. The Commission had a similar challenge of wanting to better understand child poverty in the North East of England. A team at the London DataDive reused the code from Data Tools 2.0 and created a similar <a href="https://datakind-uk.github.io/child-poverty-commission-dashboard/">visualization for the North East of England</a>. This enabled the team to quickly produce valuable results that <a href="http://www.datakind.org/projects/communicating-about-child-poverty/">“thrilled” NECPC</a>. One of the team’s Data Ambassadors continued to work with the organization and has since migrated the visualization to a <a href="http://www.nechildpoverty.org.uk/map-child-poverty-north-east">different platform in Tableau</a>.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://datakind-uk.github.io/dkuk-geo-vis/" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTYDP4PgO1ANsT_EQY6yKKcmtiAU1ijLyvdahrkVj4x8nzOeVUN0rncSLqAkm7mshN3GqSZv_JaRzlcjfdYpksIOXJoVepDI2XvhQKvfxODzXrPHmPECm_JFALmOWqT3Mb9wG4kgtIfUf1/s200/Screen+Shot+2015-07-29+at+9.23.10+AM.png" /></a></div>
<h4>DataKind UK Leeds DataDive</h4>
<p>In April 2015, DataKind UK hosted another DataDive in Leeds with three charity partners, Volition, Voluntary Action Leeds and the Young Foundation, to tackle the structural causes of inequality in the city. All three charity teams came together to create a <a href="https://datakind-uk.github.io/dkuk-geo-vis/">visualization tool</a> that allows people to explore financial inequality, rates of young people not in education, employment, or training (NEETs), and mental health inequality. But they did not recreate the wheel—they leveraged past work and repurposed code from DC Action for Children. Read more about the event in this <a href="http://www.datakind.org/blog/guest-post-reflecting-on-the-leeds-datadive/">recap</a> from DataDive attendee, Andy Dickinson.</p>
<h3>Beyond the DataKind Network</h3>
<p>Now, it’s great to see a solution scale within an organization’s network, but it’s even more impressive to see it scale beyond, in this case, into Kentucky and maybe one day India or Finland.</p>
<h4>#HackForChange with Code For America</h4>
<p>In June 2015, the city of Louisville, Kentucky teamed with the <a href="http://www.civicdataalliance.org/">Civic Data Alliance</a> to host a <a href="http://www.meetup.com/Louisville-Civic-Data-Alliance/events/221681484/">hackathon</a> in honor of the <a href="http://hackforchange.org/">National Day of Civic Hacking</a>. <a href="http://kyyouth.org/">Kentucky Youth Advocates</a>, a nonprofit organization focused on "making Kentucky the best place in America to be a kid," wanted to visually explore the factors affecting child outcomes across Council Districts. There is a large variance in child resources throughout the city, which affects child well-being. The volunteers repurposed the original code and used local publicly available data to create the <a href="http://kidscount-louisville.herokuapp.com/">Kentucky Youth Advocates Data Visualization</a>, which is now helping the city of Louisville better distribute resources for children.</p>
<h4>#openimpact Marathon</h4>
<p>DC Action for Children is also one of the projects selected for the <a href="http://blog.datalook.io/openimpact/">#openimpact Marathon</a> hosted by DataLook. The goal of the marathon is to get people and groups to replicate existing data-driven projects for social good. So far, there is interest in replicating the Data Tools 2.0 visualization for child crimes in India and another potential replication for senior citizens in Finland. There is no telling where this visualization will end up helping next. <a href="http://www.datakind.org/blog/you-there-join-datalooks-summer-marathon-of-scaling-impact/">Get involved!</a></p>
<h3>Ok ok, but what is the impact of all this really?</h3>
<p>Aren’t these just visualizations? Yes, and as any good data scientist knows, data visualizations are not an end in and of themselves. They are typically just one part of the larger process of gaining insight from data toward some end goal. Similarly, open data in and of itself does not automatically mean impact. The data has to be easy to access, in the right formats, and people have to apply it to real-world challenges. Just because you build it (or open it) does not necessarily mean impact will come.</p>
<p>Yet visualizations and open data sources are often a critical first step to bigger outcomes. So what makes the difference between a flashy marketing tool and something that will help improve real people’s lives? The strength of the partner organization that will ultimately use it to create change in the world.</p>
<p>Data visualizations, open data and open source code alone are not going to end child poverty. People are going to end child poverty. The strength of the tool itself is less important than the strength of an organization’s strategy of how to use it to inform decision-making and conversation around a given issue.</p>
<p>Thankfully, DC Action for Children has been a tremendous partner and is using Data Tools 2.0 as a key part of its efforts to improve the lives of children in DC. It’s exciting to see the tool now spreading to equally impressive partners around the world.</p>Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-52819223355246433982015-06-29T13:06:00.002-07:002020-10-05T15:25:18.883-07:00Data for Good in Bangalore<p><em>Miriam Young is a Communications Specialist at <a href="http://www.datakind.org/">DataKind</a></em>.</p>
<p>At <a href="http://www.datakind.org/">DataKind</a>, we believe the same algorithms and computational techniques that help companies generate profit can help social change organizations increase their impact. As a global nonprofit, we harness the power of data science in the service of humanity by engaging data scientists and social change organizations on projects designed to address critical social issues.</p>
<p>Our global Chapter Network recently wrapped up a <a href="http://www.datakind.org/blog/around-the-world-in-six-datadives/">marathon of DataDives</a>, helping local organizations with their data challenges over the course of a weekend. This post highlights two of the projects from <a href="http://www.datakind.org/howitworks/datachapters/datakind-blr/">DataKind Bangalore</a>’s first DataDive earlier this year, where volunteers used data science to help support rural agriculture and combat urban corruption.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggzBXyP4av2fNepXI6qqDtVPZ6-exjeeOL4ii-Pk6IXd6Lmmnl9FGTJUVHLKJo_4vRa3Y-41GxmWa5TbLshEcJrsDXyzjiLgPva3-OoFUn0eFFwFM2oy-vuwtbGjN-k87cmU0B-TwIlfbq/s1600/datakind.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggzBXyP4av2fNepXI6qqDtVPZ6-exjeeOL4ii-Pk6IXd6Lmmnl9FGTJUVHLKJo_4vRa3Y-41GxmWa5TbLshEcJrsDXyzjiLgPva3-OoFUn0eFFwFM2oy-vuwtbGjN-k87cmU0B-TwIlfbq/s400/datakind.jpg" /></a></div>
<div style="border: 1px solid black; padding: 3px;"><h3>Digital Green</h3>
<p>Founded in 2008, <a href="http://www.digitalgreen.org/">Digital Green</a> is an international, nonprofit development organization that builds and deploys information and communication technology to amplify the effectiveness of development efforts and effect sustained social change. They have a series of educational videos of agricultural best practices to help farmers in villages succeed.</p>
<h4>The Challenge</h4>
<p>Help farmers more easily find videos relevant to them by developing a recommendation engine that suggests videos based on open data on local agricultural conditions. The team was working with a collection of videos, each focused on a specific crop and accompanied by a description, but each description was in a different regional language. The challenge, then, was parsing and interpreting this information to use it as a descriptive feature for each video. To add another challenge, they needed geodata with the geographical boundaries of different regions to map the videos to regions with specific soil types and environmental conditions, but the data didn’t exist.</p>
<h4>The Solution</h4>
<p>The volunteers got to work preparing this dataset, publishing boundaries for 103,344 Indian villages and geocoding 1,062 Digital Green villages in Madhya Pradesh (MP) to 22 soil polygons. They then clustered MP districts into 5 agro-climatic clusters based on vectors of 179 features, mapping the villages Digital Green works with into these agro-climatic clusters. Finally, the team developed a <i>Hinglish</i> parser that parses the Hindi titles of available videos and translates them to English, helping the recommender system understand which crop each video relates to.</p></div>
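The <i>Hinglish</i> parsing step can be sketched as a simple lexicon lookup that maps transliterated Hindi crop names to English. This is a minimal illustration only; the crop lexicon and video titles below are invented for the example, not Digital Green's actual vocabulary or parser.

```python
# Hypothetical sketch of crop tagging for mixed Hindi/English ("Hinglish")
# video titles. The lexicon and titles are illustrative assumptions.

CROP_LEXICON = {
    "gehu": "wheat", "gehun": "wheat",
    "dhan": "rice", "chawal": "rice",
    "soya": "soybean", "soyabean": "soybean",
    "chana": "chickpea",
}

def tag_crop(title):
    """Return the English crop name for the first lexicon hit in a title."""
    for token in title.lower().split():
        if token in CROP_LEXICON:
            return CROP_LEXICON[token]
    return None  # no crop recognized

print(tag_crop("Gehu ki unnat kheti"))   # → wheat
print(tag_crop("Soya weed management"))  # → soybean
```

A real system would also need transliteration variants and fuzzy matching, but the core idea is turning free-text titles into a clean categorical feature the recommender can use.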
<br/><br/>
<div style="border: 1px solid black; padding: 3px;"><h3>I Change My City / Janaagraha</h3>
<p><a href="http://www.janaagraha.org/">Janaagraha</a> was established in 2001 as a nonprofit that aims to combine the efforts of the government and citizens to ensure better quality of life in cities by improving urban infrastructure, services and civic engagement. Their civic portal, <a href="http://www.ichangemycity.com/">IChangeMyCity</a> promotes civic action at a neighborhood level by enabling citizens to report a complaint that then gets upvoted by the community and flagged for government officials to take action.</p>
<h4>The Challenge</h4>
<p>Deal with duplicate complaints that can clog the system, and identify the factors that keep open issues from being closed out.</p>
<h4>The Solution</h4>
<p>To deal with the problem of duplicate complaints, the team used Jaccard similarity and Cosine similarity on vectorized complaints to cluster similar complaints together. Disambiguation was performed by ward and geography. The model they built delivered a precision of more than 90%.</p>
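The Jaccard-similarity step can be sketched as set overlap between tokenized complaint texts, with a threshold for grouping near-duplicates. This is a minimal illustration, not Janaagraha's production model; the sample complaints and the 0.5 threshold are assumptions for the example.

```python
# Hedged sketch of duplicate detection via Jaccard similarity on word sets.
# Sample complaints and threshold are illustrative, not real system data.

def jaccard(a, b):
    """Jaccard similarity of two texts, treated as sets of lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def group_duplicates(complaints, threshold=0.5):
    """Greedily assign each complaint to the first group it resembles."""
    groups = []
    for text in complaints:
        for group in groups:
            if jaccard(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups

complaints = [
    "garbage not collected on 5th main road",
    "garbage not collected 5th main road again",
    "streetlight broken near park entrance",
]
print(len(group_duplicates(complaints)))  # → 2 groups
```

In practice one would also disambiguate by ward and geography, as the team did, so that similar wording in different neighborhoods is not merged.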
<p>To identify the factors affecting closure by users and authorities, the team used two approaches. The first involved analysis using decision trees over attributes like comments, vote-ups, agency ID, subcategory and so on. The second involved logistic regression to predict closure probability, modeled as a function of complaint subcategory, ward, comment velocity, vote-ups and similar factors.</p></div>
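The logistic-regression idea can be illustrated with a hand-weighted sigmoid over a couple of complaint features. The feature names and weights below are invented for illustration; in a real model they would be fit from labeled complaint data, alongside categorical features like subcategory and ward.

```python
import math

# Hypothetical sketch: closure probability as a logistic function of
# complaint features. Weights are invented, not the team's fitted model.

WEIGHTS = {"bias": -1.0, "vote_ups": 0.3, "comment_velocity": 0.8}

def closure_probability(features):
    """Sigmoid of a weighted sum of the given features plus a bias term."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))

# z = -1.0 + 0.3*5 + 0.8*1.0 = 1.3, so p ≈ 0.79
p = closure_probability({"vote_ups": 5, "comment_velocity": 1.0})
print(round(p, 2))
```

The appeal of this formulation is interpretability: each weight says how much a feature pushes a complaint toward (or away from) being closed.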
<p>With these new features, iChangeMyCity will be able to better handle the large volume of incoming requests, and Digital Green will be better able to serve farmers.</p>
<p>These initial findings are certainly valuable, but DataDives are actually much bigger than just weekend events. The weeks of preparation that go into them and months of impact that ripple out from them make them a step in an organization’s larger data science journey. This is certainly the case here, as both of these organizations are now exploring long-term projects with DataKind Bangalore to expand on this work.</p>
<p>Stay tuned for updates on these exciting projects to see what happens next!</p>
<p>Interested in getting involved? Find your <a href="http://www.datakind.org/howitworks/dataevents/">local chapter</a> and sign up to learn more about our upcoming events.</p>
Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-72145761706808243512015-06-24T16:32:00.000-07:002020-10-05T15:25:18.891-07:00The Price of Data Localization<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFNN-5PvupwFReAwK1ldgodqJIqDp2091S-wD88EC__gMS9zIytEIvLw-par6U1tpwat4dCA8HGfiZ_ZpT65esQitBntED7G2BmNGWdd2aJcwOH1CoFVfcc-1WTXG95qbw348Eq0zc2gAL/s1600/Screen+Shot+2015-06-24+at+4.34.17+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFNN-5PvupwFReAwK1ldgodqJIqDp2091S-wD88EC__gMS9zIytEIvLw-par6U1tpwat4dCA8HGfiZ_ZpT65esQitBntED7G2BmNGWdd2aJcwOH1CoFVfcc-1WTXG95qbw348Eq0zc2gAL/s320/Screen+Shot+2015-06-24+at+4.34.17+PM.png" /></a></div><p>Forced data localization laws require data be stored in a specific country, rather than in a distributed “cloud” spread across global networks. As we see the development of more cloud-based products and services, these laws run counter to the direction of technological innovation.</p>
<p>In fact, many studies have shown that forced data localization could negatively impact <a href="https://www.accessnow.org/blog/2014/06/04/the-impact-of-forced-data-localisation-on-fundamental-rights">privacy</a> as well as <a href="http://googlepublicpolicy.blogspot.com/2015/02/the-impacts-of-data-localization-on.html">security and integrity of data</a>. Other studies, like one by the <a href="http://www.ecipe.org/">European Centre for International Political Economy</a>, have shown that data localization has <a href="http://ecipe.org/publications/dataloc">negative impacts</a> on the economies that require it.</p>
<p>Adding to the mounting evidence against data localization, new research by <a href="http://www.leviathansecurity.com/">Leviathan Security Group</a> shows the harms at a smaller scale: <a href="http://www.valueofcloudsecurity.com/">direct cost of forced data localization to local businesses</a>, rather than whole economies. The costs can be pretty dramatic:</p>
<blockquote>...[W]e find that for many countries that are considering or have considered forced data localization laws, local companies would be required to pay 30-60% more for their computing needs than if they could go outside the country's borders.</blockquote>
<p>Leviathan looked at the major public cloud providers who allow on-demand self-service provisioning through their infrastructure. The group includes <a href="http://aws.amazon.com/">Amazon Web Services</a>, <a href="https://www.digitalocean.com/">DigitalOcean</a>, <a href="https://cloud.google.com/compute/">Google Compute Engine</a>, <a href="http://www.hpcloud.com/">HP Public Cloud</a>, <a href="https://www.linode.com/">Linode</a>, <a href="http://azure.microsoft.com/">Microsoft Azure</a>, and <a href="http://www.rackspace.com/cloud/servers">Rackspace Cloud Servers</a>. Consumers in affected countries might be able to find other cloud providers, but many of these providers don't allow self-service provisioning, instead requiring a confidentiality agreement, a full business-to-business agreement, or other paperwork. In many countries, cloud providers won't be available at all, so businesses must make major capital investments in computer hardware and infrastructure, rather than being able to take advantage of flexible and cost-saving per-use models.</p>
<p>Leviathan created an <a href="http://cloudsecurity.leviathansecurity.com/">interactive visualization</a> that allows anyone to compare all the cloud vendors by location and price around the world. You can check out this study and the visualization, along with their previous work on cloud security, at <a href="http://www.valueofcloudsecurity.com/">valueofcloudsecurity.com</a>.</p>
Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-15152925923500314102015-06-08T12:54:00.000-07:002020-10-05T15:25:18.864-07:00Smart Maps for Smart Cities: India’s $8 Billion+ Opportunity<em><a href="http://www.dalberg.com/team/management/gaurav-gupta/">Gaurav Gupta</a> is <a href="http://www.dalberg.com/">Dalberg</a>'s Regional Director for Asia.</em>
<p>Did you know that India is expected to see the greatest migration to cities of any country in the world in the next three decades, with over 400 million new inhabitants moving into urban areas? To accommodate this influx of city dwellers, India’s urban infrastructure will have to grow, too.</p>
<p>That growth has already begun. In the last six years alone, India’s road network has expanded by one-quarter, while the total number of businesses has increased by one-third.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpAvgHnjqQiA69gRb3mpZZeoRgPxUL6hSRuPok7VGZ62s7MCAi_pfKmln8UsRAdB4OJQMai7NOziBppYn1gjDdtU2UfWEfdSMTSat_UmokyA8Y-uI3Nf8gAD2ZX3Fv35z2gmJxNqC-lS9D/s1600/Screen+Shot+2015-06-08+at+12.53.32+PM.png" imageanchor="1" style="clear: left; float: right; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpAvgHnjqQiA69gRb3mpZZeoRgPxUL6hSRuPok7VGZ62s7MCAi_pfKmln8UsRAdB4OJQMai7NOziBppYn1gjDdtU2UfWEfdSMTSat_UmokyA8Y-uI3Nf8gAD2ZX3Fv35z2gmJxNqC-lS9D/s320/Screen+Shot+2015-06-08+at+12.53.32+PM.png" /></a></div>
<p>To better understand how smart maps—citizen-centric maps that crowdsource, capture, and share a broad range of detailed data—can help India develop smarter and more efficient cities, our team at <a href="http://www.dalberg.com/">Dalberg Global Development Advisors</a> worked with the Confederation of Indian Industry on a new study, <em>Smart Maps for Smart Cities: India’s $8 Billion+ Opportunity</em>. What we found was that even for a select set of use cases, smart maps can help India gain over USD $8 billion in savings and value, save 13,000 lives, and reduce one million metric tons of carbon emissions a year in cities alone. Their aggregate impact is likely to be several multiples higher.</p>
<p>Our research shows that simple improvements in basic maps can lead to significant social impact: smart maps can also help businesses attract more consumers, increase foreign tourist spending and even help women feel safer.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMQOMUhkODAR0oqssyhoZq256FDkGntIFHMG06uOgEPMru1iwX99u6D2VEJ7JzEmtMDd2vKl-evuS33Tn0SKn-NnvAIsQlcMplLvBDCVx6ndmx8Lg5wxGYDfsLTVBSYFl0od9Y_rK-8yZv/s1600/Screen+Shot+2015-06-08+at+12.52.08+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMQOMUhkODAR0oqssyhoZq256FDkGntIFHMG06uOgEPMru1iwX99u6D2VEJ7JzEmtMDd2vKl-evuS33Tn0SKn-NnvAIsQlcMplLvBDCVx6ndmx8Lg5wxGYDfsLTVBSYFl0od9Y_rK-8yZv/s320/Screen+Shot+2015-06-08+at+12.52.08+PM.png" /></a></div>
<p>In these quickly changing cityscapes, online tools like maps need to be especially dynamic, able to update faster and quickly expand coverage of local businesses in order to serve as highly useful tools for citizens. Yet today, most cities lack sophisticated online tools that make changing information, like road conditions and new businesses, easy to find online. Only 10-20% of India’s businesses, for instance, are listed on online maps.</p>
<p>So what will it take to continue developing smart maps to help power these cities? Our study shows that India will need to embrace a new policy framework that truly encourages scalable solutions and innovation by promoting crowdsourcing and creating a single accessible point of contact between government and the local mapping industry.</p>Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com1tag:blogger.com,1999:blog-212702666983540465.post-49707331720607031712015-06-05T16:43:00.001-07:002020-10-05T15:25:18.882-07:00Moving beyond the binary of connectivity<p>Back in April, we <a href="http://policybythenumbers.blogspot.com/2015/04/mapping-sneakernet.html">shared a post</a> from designer and Internet researcher An Xiao Mina about the "sneakernet." She has a <a href="http://thesocietypages.org/cyborgology/2015/05/18/moving-beyond-the-binary/">new post on The Society Pages</a> in which she sets out to define a concept she calls the <em>binary of connectivity</em>.
<blockquote><p>But what exactly is this binary of connectivity? Attendees at my talk asked me to define it, and I’d like to propose a working definition:</p>
<p>The connectivity binary is the view that there is a single mode of connecting to the internet — one person, one device, one always-on subscription.</p>
<p>The connectivity binary is grounded in a Western, urban, middle class mode of connectivity; this mode of connecting is seen as the penultimate realization of our relationship to the internet and communications technologies. Thinking in a binary way renders other modes of access invisible, both to makers and influencers on the internet and to advertising engines and big data, and it limits our understanding of the internet and its global impact.</p>
<p>I can imagine at least two axes of a connectivity spectrum: single vs. shared usage, and continuous vs. intermittent access. For many readers of Cyborgology, single usage, continuous access to the web is likely the norm. The most extreme example of this might be iconized in the now infamous image of Robert Scoble wearing Google Glass in the shower–we are always connected, always getting feeds of data our way.</p>
<p>Here’s how other sections of those axes might map to practices I’ve observed in different parts of the world. Imagine these at differing degrees away from the center of a matrix:</p>
<ul><li><strong>Shared Usage, Continuous Access:</strong> I saved up to buy a laptop with a USB stick that my family of four can use. We take turns using it, and our connection is pretty stable.</li><br/>
<li><strong>Single, Intermittent:</strong> I have a low-cost Chinese feature phone (maybe a Xiaomi), and I pay a few dollars each month for 10 MB of access. I keep my data plan off most of time.</li><br/>
<li><strong>Shared, Intermittent:</strong> I walk all day to visit an internet cafe once every few months to check my Facebook account, listen to music on YouTube and practice my typing skills. I don’t own a computer myself.</li></ul>
<p>For the purposes of simplicity, I’m assuming that we’re talking about devices that have one connection. But, of course, some devices have multiple connections (think of a phone with multiple SIMs) and some connections have multiple devices (think of roommates sharing a wifi router).</p></blockquote>
Read the full post <a href="http://thesocietypages.org/cyborgology/2015/05/18/moving-beyond-the-binary/">here</a>.Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-39519788383640048322015-05-27T11:59:00.000-07:002020-10-05T15:25:18.875-07:00Housing Data Hub - from Open Data to Information<p><em>Joy Bonaguro is Chief Data Officer for the City and County of San Francisco. This is a <a href="http://datasf.org/blog/housing-data-hub-launched/">repost</a> from April at <a href="http://datasf.org/">DataSF.org</a> announcing the launch of their Housing Data Hub.</em></p>
<p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkTo8W_dnhYQOSqBAkYg7W4BQZV3i2fesxO3WkyM6fhyphenhyphenbBqLc15N3hHZ5k5LVj6swgP2wmBw7drZySrH77otLpu0Nx31f04WdtPRxLKjzAi2srCGgd9SnqaVrNrKKe8DOphSYH5fpNWwQ/s1600/housing-hub-screen.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkTo8W_dnhYQOSqBAkYg7W4BQZV3i2fesxO3WkyM6fhyphenhyphenbBqLc15N3hHZ5k5LVj6swgP2wmBw7drZySrH77otLpu0Nx31f04WdtPRxLKjzAi2srCGgd9SnqaVrNrKKe8DOphSYH5fpNWwQ/s320/housing-hub-screen.png" /></a></div></p>
<p>Housing is a complex issue and it affects everyone in the City. However, there is not a lot of broadly shared knowledge about the existing portfolio of programs. The Hub puts all housing data in one place, visualizes it, and provides the program context.</p>
<p>This is also the first of what we hope to be a series of strategic open data releases over time. Read more about that below or check out <a href="http://housing.datasf.org">the Hub</a>, which <a href="http://housing.datasf.org/about/#acknowledgements">took a village</a> to create!</p>
<h3 id="evolution-of-open-data:-strategic-releases">Evolution of Open Data: Strategic Releases</h3>
<p>The Housing Data Hub is also born out of a belief that simply publishing data is no longer sufficient. Open data programs need to take on the role of adding value to open data versus simply posting it and hoping for its use. Moreover, we are learning how important context is to understanding government datasets. While <a href="http://datasf.org/blog/u-heart-metadata/">metadata</a> is an essential part of context, it’s a starting point, not an endpoint.</p>
<p>For us a strategic release is one or more key datasets + a data product. A data product can be a report, a website, an analysis, a package of visualizations, an article...you get the idea. The key point: you have done something beyond simply publishing the data. You provide context and information that transforms the data into insights or helps inform a conversation. (P.S. That’s also why we are excited about Socrata’s <a href="http://www.socrata.com/rethink">new dataset user experience</a> for our open data platform).</p>
<h3 id="will-we-only-do-strategic-releases?">Will we only do strategic releases?</h3>
<p>No! First off, it’s a ton of work and requires <a href="http://housing.datasf.org/about/#acknowledgements">amazing partnerships</a>. Strategic (or thematic) releases should be a key part of an open data program, but not the only part. We will continue to publish datasets per department plans (coming out formally this summer). And we’ll also continue to take data nominations to inform department plans.</p>
<p>We’ll reserve strategic releases to:</p>
<ul><li>Address a pressing information gap or need</li>
<li>Inform issues of high public interest or concern</li>
<li>Tie together disparate data that may otherwise be used in isolation</li>
<li>Unpack complex policy areas through the thoughtful dissemination of open data</li>
<li>Pair data with the content and domain expertise that we are uniquely positioned to offer (e.g., answer the questions we receive over and over again in a scalable way)</li>
<li>Build data products that are unlikely to be built by the private sector</li>
<li>Solve cross-department reporting challenges</li></ul>
<p>In each case, we’ll leverage the open data program to expose the key datasets and provide context and visualizations via data products.</p>
<p>We also think this is a key part of broadening the value of open data. Open data portals have focused more on a technical audience (what we call our citizen programmers). Strategic releases can help democratize how governments disseminate their data for a local audience that may be focused on issues in addition to the apps and services built on government data. It can also be a means to increase internal buy-in and support for open data.</p>
<h3 id="next-steps">Next steps</h3>
<p>As part of our rolling release, we will continue to work to automate the datasets feeding the hub. You can read more about our <a href="http://housing.datasf.org/about/#our-process">rollout process</a>, inspired by the UK Government Digital Service. We’ll also follow up with technical post on the platform, which is <a href="https://github.com/datasf/housing-data-hub">available on GitHub</a>, including how we are consuming the data via our open data APIs.</p>Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-65469336080503593572015-05-21T14:30:00.000-07:002020-10-05T15:25:18.889-07:00Is the Internet Healthy? <p><em>Meredith Whittaker is Open Source Research Lead at Google.</em></p>
<p>We are big fans of open data. So we're happy to see that the folks over at <a href="https://www.battleforthenet.com/internethealthtest/">Battle for the Net</a> launched The Internet Health Test earlier this week, a nifty tool that lets Internet users test their connection speed across multiple locations.</p>
<p>The test makes use of <a href="http://www.measurementlab.net/">M-Lab</a> open source code and infrastructure, which means that all of the data gathered from all of the tests will be put into the public domain.</p>
<p>One of the project's goals is to make more public data about Internet performance available to advocates and researchers. Battle for the Net and others will use this data to identify problems with ISP interconnections, and, they claim, to hold ISPs accountable to the <a href="https://www.fcc.gov/openinternet">FCC's Open Internet Order</a>.</p>
<p>This is certainly a complex issue but we are always thrilled by more data that can be used to inform policy.</p>
You can learn more and run the test over at their site: <a href="https://www.battleforthenet.com/internethealthtest">https://www.battleforthenet.com/internethealthtest</a>
Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-91756784670140942802015-05-14T18:14:00.000-07:002020-10-05T15:25:18.898-07:00New data, more facts: an update to the Transparency Report<p><em>Cross-posted from the <a href="http://googleblog.blogspot.com/2015/05/new-data-more-facts-update-to.html">Official Google Blog</a>.</em></p>
<p>We first launched the <a href="http://www.google.com/transparencyreport/">Transparency Report</a> in 2010 to help the public learn about the scope of government requests for user data. With recent <a href="http://googleblog.blogspot.com/2013/06/what.html">revelations</a> about government surveillance, <a href="http://www.cnet.com/news/feds-put-heat-on-web-firms-for-master-encryption-keys/">calls for companies to make encryption keys</a> available to police, and a wide range of <a href="http://www.regulations.gov/#!documentDetail;D=USC-RULES-CR-2014-0004-0029">proposals</a>, both in and out of the U.S., to expand surveillance powers throughout the world, the issues today are more complicated than ever. Some issues, like <a href="http://www.nationaljournal.com/tech/congress-renews-push-to-require-search-warrants-for-email-snooping-20150203">ECPA reform</a>, are less complex, and we’re encouraged by the broad support in Congress <a href="https://www.congress.gov/bill/114th-congress/house-bill/699?q=%7B%22search%22%3A%5B%22h.r.+699%22%5D%7D">for</a> <a href="https://www.congress.gov/bill/114th-congress/senate-bill/356?q=%7B%22search%22%3A%5B%22electronic+communications+privacy+act+amendments+act%22%5D%7D">legislation</a> that would codify a standard requiring warrants for communications content.</p>
<p>Google's position remains consistent: We respect the important role of the government in investigating and combating security threats, and we comply with valid legal process. At the same time, we'll fight on behalf of our users against unlawful requests for data or mass surveillance. We also work to make sure surveillance laws are transparent, principled, and reasonable.</p>
<p><b>Today's Transparency Report update</b><br />
With this in mind, we're adding some new details to our Transparency Report that we're releasing today.<br/>
<ul><li><b>Emergency disclosure requests. </b>We’ve expanded our reporting on requests for information we receive in emergency situations. These emergency disclosure requests come from government agencies seeking information to save the life of a person who is in peril (like a kidnapping victim), or to prevent serious physical injury (like a threatened school shooting). We have a process for evaluating and fast-tracking these requests, and in true emergencies we can provide the necessary data without delay. The Transparency Report previously included this number for the United States, but we’re now reporting for every country that submits this sort of request.</li><br/>
<li><b>Preservation requests. </b>We're also now reporting on government requests asking us to set aside information relating to a particular user's account. These requests can be made so that information needed in an investigation is not lost while the government goes through the steps to get the formal legal process asking us to disclose the information. We call these "preservation requests" and because they don't always lead to formal data requests, we keep them separate from the country totals we report. Beginning with this reporting period, we're reporting this number for every country.</li></ul></p>
<p>In addition to this new data, the report shows that we've received 30,138 requests from around the world seeking information about more than 50,585 users/accounts; we provided information in response to 63 percent of those requests. We saw slight increases in the number of requests from governments in Europe (2 percent) and Asia/Pacific (7 percent), and a 22 percent increase in requests from governments in Latin America.</p>
<p><b>The fight for increased transparency</b><br/>
Sometimes, laws and gag-orders prohibit us from notifying someone that a request for their data has been made. There are some situations where these restrictions make sense, and others not so much. We will fight—sometimes through lengthy court action—for our users' right to know when data requests have been made. We've recently succeeded in a couple of important cases.</p>
<p>First, after years of persistent litigation in which we fought for the right to inform Wikileaks of government requests for their data, we were successful in <a href="http://www.washingtonpost.com/world/national-security/google-says-it-fought-gag-orders-in-wikileaks-investigation/2015/01/28/e62bfd04-a5c9-11e4-a06b-9df2002b86a0_story.html">unsealing court documents relating to these requests</a>. We’re now making those documents available to the public <a href="https://docs.google.com/file/d/0B5dcd1BnwkWGNFZmbjF3MnlqNGc/edit">here</a> and <a href="https://docs.google.com/file/d/0B5dcd1BnwkWGRzlfQ3ZjS3FNYWc/edit">here</a>.</p>
<p>Second, we've fought to be more transparent regarding the U.S. government's use of <a href="http://en.wikipedia.org/wiki/National_security_letter">National Security Letters</a>, or NSLs. An NSL is a special type of subpoena for user information that the FBI issues without prior judicial oversight. NSLs can include provisions prohibiting the recipient from disclosing any information about it. <a href="http://blogs.wsj.com/digits/2013/08/26/open-secret-about-googles-surveillance-case-no-longer-secret/">Reporters speculated in 2013</a> that we challenged the constitutionality of NSLs; after years of litigation with the government in several courts across multiple jurisdictions, we can now confirm that we challenged 19 NSLs and fought for our right to disclose this to the public. We also recently won the right to release additional information about those challenges and the documents should be available on the public court dockets soon.</p>
<p>Finally, just yesterday, the U.S. House of Representatives voted 338-88 to pass the <a href="http://www.judiciary.house.gov/index.cfm/usa-freedom-act">USA Freedom Act of 2015</a>. This represents a significant step toward broader surveillance reform, while preserving important national security authorities. Read more on our <a href="http://googlepublicpolicy.blogspot.com/2015/05/a-strong-vote-to-reform-our.html">U.S. Public Policy blog</a>.</p>
<p><em>Posted by Richard Salgado, Legal Director, Law Enforcement and Information Security</em></p>Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-3989110014865692172015-05-07T15:55:00.002-07:002020-10-05T15:25:18.859-07:00Exploring the world of data-driven innovation<p><em>Mike Masnick is founder of the <a href="https://copia.is/">Copia Institute</a>.</em></p>
<p>In the last few years, there’s obviously been a tremendous explosion in the amount of data floating around. But we’ve also seen an explosion in the efforts to understand and make use of that data in valuable and important ways. The advances, both in terms of the type and amount of data available, combined with advances in computing power to analyze the data, are opening up entirely new fields of innovation that simply weren’t possible before.</p>
<p>We recently launched a new think tank, the <a href="https://copia.is/">Copia Institute</a>, focused on looking at the big challenges and opportunities facing the innovation world today. An area we’re deeply interested in is data-driven innovation. To explore this space more thoroughly, the Copia Institute is putting together an ongoing series of case studies on data-driven innovation, with the first few now available in the <a href="https://copia.is/library.html">Copia library</a>.</p>
<p>Our first set of case studies includes a look at how the <a href="https://copia.is/reports/pdf/CopiaCaseStudy-PCRandBiotech.pdf">Polymerase Chain Reaction (PCR)</a> helped jumpstart the modern biotechnology field. PCR is, in short, a technique for copying DNA, something that was extremely difficult to do (outside of living things copying their own DNA). The discovery was something of an accident: a scientist found that certain microbes survived in the high temperatures of the hot springs of Yellowstone National Park, previously thought impossible. Further study of those microbes eventually led to the creation of PCR.</p>
<p>PCR was patented but licensed widely and generously. It basically became the key to biotech and genetic research in a variety of different areas. The Human Genome Project, for example, was possible only thanks to the widespread availability of PCR. Those involved in the early efforts around PCR were actively looking to share the information and concept rather than lock it up entirely, although there were debates about doing just that. By making sure that the process was widely available, it helped to accelerate innovation in the biotech and genetics fields. And with the recent expiration of the original PCR patents, the technology is even more widespread today, expanding its contribution to the field.</p>
<p>Another case study explores the value of the <a href="https://copia.is/reports/pdf/CopiaCaseStudy-HeLaCells.pdf">HeLa cells</a> in medical research—cancer research in particular. While the initial discovery of HeLa cells may have come under dubious circumstances, their contribution to medical advancement cannot be overstated. The name of the HeLa cells comes from the patient they were originally taken from, a woman named Henrietta Lacks. Unlike previous human cell samples, HeLa cells continued to grow and thrive after being removed from Henrietta. The cells were made widely available and have contributed to a huge number of medical advancements, including work that has resulted in five Nobel prizes to date.</p>
<p>With both PCR and HeLa cells, we saw an important pattern: an early discovery that was shared widely, enabling much greater innovation to flow from proliferation of data. It was the widespread sharing of information and ideas that contributed to many of these key breakthroughs involving biotechnology and health.</p>
<p>At the same time, both cases raise certain questions about how to best handle similar developments in the future. There are questions about intellectual property, privacy, information sharing, trade secrecy and much more. At the Copia Institute, we plan to dive deeper into many of these issues with our continuing series of case studies, as well as through research and events.</p>
Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-63468270054911690312015-05-01T09:59:00.000-07:002020-10-05T15:25:18.888-07:00Five ways for states to make the most of open data<p><em><a href="http://www.marikodavidson.com/">Mariko Davidson</a> serves as an Innovation Fellow for the Commonwealth of Massachusetts where she works on all things open data. These opinions are her own. You can follow her <a href="https://twitter.com/rikohi">@rikohi</a>.</em></p>
<p>States struggle to define their role in the open data movement. With the exception of some <a href="http://realtime.mbta.com/portal">state transportation agencies</a>, states watch their municipalities <a href="https://data.cityofboston.gov/">publish local data</a>, create some <a href="http://hubhacks2.challengepost.com/submissions">neat visualizations</a> and <a href="http://www.cityofboston.gov/DoIT/apps/">applications</a>, and <a href="http://www.betaboston.com/news/2015/03/10/boston-activates-new-address-search-tool-built-at-hubhacks-civic-hackathon/">get credit for being cool and innovative</a>.</p>
<p>States see these successes and want to join the movement. Greater transparency! More efficient government! Innovation! The promise of open data is rich, sexy, and non-partisan. But when a state publishes something like obscure wildlife count data and the community does not engage with it, it can be disappointing.</p>
<p><strong><em>States should leverage their unique role in government rather than mimic a municipal approach to open data.</em></strong> They must take a different approach to encourage civic engagement, more efficient government, and innovation. Here are a few recommendations based on my time as a fellow:</p>
<ol><li><em>States are a treasure trove of open data.</em> This is still true. When prioritizing what data to publish, focus on the tangible data that impacts the lives of constituents—think aggregating 311 request data from across the state. Mark Headd, former Chief Data Officer for the City of Philadelphia, calls potholes the “<a href="https://twitter.com/mheadd/status/581848536532316160">gateway drug to civic engagement</a>.”</li><br/>
<li><em>States can open up data sharing with their municipalities—which leads to a conversation on data standards.</em> States can use their unique position to federate and facilitate data sharing with municipalities. This has a few immediate benefits: a) it allows citizens a centralized source to find all levels of data within the state; b) it increases communication between the municipalities and the state; and c) it begins to push a collective dialogue on data standards for better data sharing and usability.</li><br/>
<li><em>States in the US create an open data technology precedent for their towns and municipalities.</em> Intentional or not, the state sets an open data technology standard—so it should leverage this power strategically. When a state selects a technology platform to catalog its data, it incentivizes municipalities and towns within the state to follow its lead. If a state chooses a <a href="http://en.wikipedia.org/wiki/Software_as_a_service">SaaS</a> solution, it creates a financial barrier to entry for municipalities that want to collaborate. The Federal Government understood this when it moved <a href="http://www.data.gov/">Data.gov</a> to the open source solution <a href="http://ckan.org/">CKAN</a>. Bonus: open source software is free and embodies the free and transparent ethos of the greater open data movement.</li><br/>
<li><em>States can support municipalities and towns by offering open data as a service.</em> This can be an opportunity to provide support to municipalities and towns that might not have the resources to stand up their own open data site.</li><br/>
<li><em>Finally, states can help facilitate an “innovation pipeline” by providing the data infrastructure and regularly connecting key civic technology actors with government leadership.</em> Over the past few years, the civic technology movement experienced a lot of success in cities with groups like <a href="http://www.codeforamerica.org">Code for America</a> leading the charge with their local Brigade Chapters. After publishing data and providing the open data infrastructure, states must also engage with the super users and data consumers. States should not shy away from these opportunities. More active state engagement is a crucial element still missing from the civic innovation space, and it is needed to collectively create sustainable technology solutions for the communities states serve.</li></ol>Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-49543424869189091202015-04-28T16:12:00.004-07:002020-10-05T15:25:18.882-07:00Visualization: The future of the World Bank<p>This visualization of World Bank borrowers today and in 2019 isn't the most technologically sophisticated graphic we've ever posted, but it is a stark illustration of what the future of the World Bank looks like.</p>
<div align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2DvQ9t6DiWzvYpJIk3pF6aji3ybNIUM-4ar3iFlfF7PVveK-2dXIDIicfWNQGwLsxy8qT0uKFH7tCbydFAUmEgaI_Sa15nnBTUuSyTyhgMUQhKoHlY-sJQiQJp8_Wtr9UnkWhyIg1LTJK/s1600/worldbank.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2DvQ9t6DiWzvYpJIk3pF6aji3ybNIUM-4ar3iFlfF7PVveK-2dXIDIicfWNQGwLsxy8qT0uKFH7tCbydFAUmEgaI_Sa15nnBTUuSyTyhgMUQhKoHlY-sJQiQJp8_Wtr9UnkWhyIg1LTJK/s320/worldbank.gif" /></a></div>
<p>As Tom Murphy writes over on <a href="http://www.humanosphere.org/basics/2015/04/map-day-new-future-world-bank/">Humanosphere</a>:
<blockquote>The World Bank’s influence is waning. Some point to the emerging Asian Infrastructure Investment Bank as evidence of the body’s declining power, but it is the World Bank’s own projections that illustrate the change. Thirty-six countries will graduate from World Bank loans over the next four years (see the above gif).</blockquote></p>
The images in Murphy's gif come from a policy paper titled "<a href="http://www.cgdev.org/sites/default/files/world-bank-75-revised-3-26-15_0.pdf">The World Bank at 75</a>" by Scott Morris and Madeleine Gleave at the Center for Global Development. The paper provides a thorough data-driven analysis of current World Bank lending models and systematic trends that will shape its future. From the paper:
<blockquote>The World Bank continues to operate according to the core model some 71 years after the founding of IBRD and 55 years after the founding of IDA: loans to sovereign governments with terms differentiated largely according to one particular measure (GNI per capita) of a country’s ability to pay. Together, concessional and non-concessional loans to countries still account for 67 percent of the institution’s portfolio.<br/><br/>
So when the World Bank looks at the world today, it sees a large number of countries organized by IDA and IBRD status.<br/><br/>
And what will the World Bank see in 2019, on the occasion of its 75th anniversary? On its current course and with rote application of existing rules, the picture could look very different, with far fewer of those so-called “IDA” and “IBRD” countries.<br/><br/>
But does this picture accurately reflect the development needs that will be pressing in the years ahead? Or instead, does it simply reflect an institutional model that is declining in relevance?<br/><br/>
It is remarkable how enduring the World Bank’s basic model has been. The two core features (lender to sovereign governments; terms differentiated by countries’ income category) have tremendous power within the institution, which has grown up around them. The differentiation in terms has generated two of the core silos within the institution: the IBRD and IDA. And lending to national governments (what we will call the “loans to countries” model) is so dominant that it has crowded out other types of engagement, even when there has been political will to do other things (notably, climate-related financing).<br/><br/>
So while the model has been laudably durable in some respects, it also increasingly seems to be stuck at a time when external dynamics call for change.<br/><br/>
This paper examines ways in which seemingly immovable forces underlying the World Bank’s
work might finally be ripe for change in the face of shifting development needs. Specifically, we offer examples of 1) how country eligibility standards might evolve; and 2) how the bank might move further away from the “loans to countries” model that has long defined it.</blockquote> Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-46826141760356633612015-04-24T09:51:00.002-07:002020-10-05T15:25:18.880-07:00How do political campaigns use data analysis?<p>Looking through <a href="http://www.ssrn.com/en/">SSRN</a> this morning, I came across a paper by David Nickerson (Notre Dame) and Todd Rogers (Harvard), "<a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2354474">Political Campaigns and Big Data</a>" (February 2014). It's a nice follow-up to <a href="http://policybythenumbers.blogspot.com/2015/04/quorum-is-software-new-congressional.html">yesterday's post</a> about the software supporting new approaches to data analysis in Washington, DC.</p>
<p>In the paper, Nickerson and Rogers get into the math behind the statistical methods and supervised machine learning employed by political campaign analysts. They discuss the various types of predictive scores assigned to voters—responsiveness, behavior, and support—and the variety of data that analysts pull together to model and then target supporters and potential voters.</p>
<p>In the following excerpt, the authors explain how predictive scores are applied to maximize the value and efficiency of phone bank fundraising calls:
<blockquote>Campaigns use predictive scores to increase the efficiency of efforts to communicate with citizens. For example, professional fundraising phone banks typically charge $4 per completed call (often defined as reaching someone and getting through the entire script), regardless of how much is donated in the end. Suppose a campaign does not use predictive scores and finds that upon completion of the call 60 percent give nothing, 20 percent give $10, 10 percent give $20, and 10 percent give $60. This works out to an average of $10 per completed call. Now assume the campaign sampled a diverse pool of citizens for a wave of initial calls. It can then look through the voter database that includes all citizens it solicited for donations and all the donations it actually generated, along with other variables in the database such as past donation behavior, past volunteer activity, candidate support score, predicted household wealth, and Census-based neighborhood characteristics (Tam Cho and Gimpel 2007). It can then develop a fundraising behavior score that predicts the expected return for a call to a particular citizen. These scores are probabilistic, and of course it would be impossible to only call citizens who would donate $60, but large gains can quickly be realized. For instance, if a fundraising score eliminated half of the calls to citizens who would donate nothing, the resulting distribution would be 30 percent donate $0, 35 percent donate $10, 17.5 percent donate $20, and 17.5 percent donate $60. The expected revenue from each call would increase from $10 to $17.50. Fundraising scores that increase the proportion of big donor prospects relative to small donor prospects would further improve on these efficiency gains.</blockquote></p>
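<p>The arithmetic in that excerpt is easy to verify with a few lines of code. The sketch below uses the paper's illustrative donation distributions; the function name and structure are ours, not the authors':</p>

```python
# Expected revenue per completed fundraising call, using the excerpt's
# illustrative donation distributions as (probability, dollar amount) pairs.
def expected_revenue(distribution):
    """distribution: (probability, donation) pairs whose probabilities sum to 1."""
    assert abs(sum(p for p, _ in distribution) - 1.0) < 1e-9
    return sum(p * amount for p, amount in distribution)

# Without predictive scores: 60% give $0, 20% give $10, 10% give $20, 10% give $60.
unscored = [(0.60, 0), (0.20, 10), (0.10, 20), (0.10, 60)]

# With a fundraising score that screens out half of the would-be non-donors.
scored = [(0.30, 0), (0.35, 10), (0.175, 20), (0.175, 60)]

print(f"${expected_revenue(unscored):.2f} per call")  # $10.00 per call
print(f"${expected_revenue(scored):.2f} per call")    # $17.50 per call
```

<p>The score doesn't need to predict any individual donation correctly; merely shifting the mix of people called raises the expected value of every call.</p>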
<p>If you've ever wanted to know more about <em>how</em> campaigns use data analysis tools and techniques, this paper is a great primer.</p>
Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-86311284509758213242015-04-23T13:35:00.001-07:002020-10-05T15:25:18.897-07:00Quorum: Is software the new Congressional intern?<p>Last month, a number of news outlets wrote about a startup called <a href="https://www.quorum.us/">Quorum</a>. Winner of the 2014 Harvard Innovation Challenge's McKinley Family Grant for Innovation and Entrepreneurial Leadership in Social Enterprise, Quorum has amazing potential to create new ways for legislators to easily use data to understand their constituencies and track legislation—literally data for policymaking. Quorum even pulls data from the <a href="http://www.census.gov/acs/www/">American Community Survey</a>, which James Treat of the Census Bureau wrote about for <a href="http://policybythenumbers.blogspot.com/2012/06/american-community-survey-innovation.html">this blog</a> a few years back.</p>
<p><a href="http://techcrunch.com/2015/02/10/goodbye-intern-labor-hi-quorum/">TechCrunch</a> touts Quorum as a replacement for the hordes of summer Hill interns, while the <a href="http://www.washingtonpost.com/business/capitalbusiness/the-moneyball-effect-on-k-street-the-influence-game-gets-scientific/2015/03/12/4ab365f2-b14e-11e4-854b-a38d13486ba1_story.html"><em>Washington Post</em></a> likens it to Moneyball for K Street.</p>
<p>Danny Crichton at TechCrunch <a href="http://techcrunch.com/2015/02/10/goodbye-intern-labor-hi-quorum/">writes</a>:
<blockquote>The challenges are numerous in this space. "Figuring out who you should talk to is a really tough process," Jonathan Marks, one co-founder of Quorum, explained. "This is a problem that a lot of our clients have, [since] there are tens of thousands of relationships in DC." The challenge is magnified since those relationships change so often.<br/><br/>
Another challenge is simply following legislation. Marks gave the example of a non-profit firm that wanted to develop a scorecard with grades for each congressman on several key votes (a common strategy these days in Washington advocacy). One firm had "three people spending 1.5 weeks to tabulate all the data." An opposition research firm went through “6000 votes on abortion” to tabulate every single congressman's legislative history. This was all done manually (i.e. with an army of interns).</blockquote></p>
<p>But Quorum is not the first product of its kind. Bloomberg and CQ have long dominated with products targeted at this audience. But this is becoming a competitive space for entrepreneurs. Catherine Ho at the <em>Washington Post</em> <a href="http://www.washingtonpost.com/business/capitalbusiness/the-moneyball-effect-on-k-street-the-influence-game-gets-scientific/2015/03/12/4ab365f2-b14e-11e4-854b-a38d13486ba1_story.html">explains</a>:
<blockquote>Since 2010, at least four companies, ranging from start-ups to billion-dollar public corporations, have introduced new ways to sell data-based political and competitive intelligence that offers insight into the policymaking process.<br/><br/>
[...]<br/><br/>
Other companies are emerging in the space with some success. For others, it’s too soon to tell.<br/><br/>
Popvox, founded in 2010, is an <a href="http://www.washingtonpost.com/business/capitalbusiness/start-ups-aim-for-a-more-democratic-lobbying-system/2012/11/02/6b5f1b38-22e0-11e2-ac85-e669876c6a24_story.html">online platform that collects correspondence between constituents and their representatives</a> on bills, organizes the data by state, and packages the information in charts and maps so lawmakers can easily spot where voters stand on a proposed bill. An early win was when nearly 12,000 people nationwide used the platform to oppose a proposal to allow robo-calls to cellphones — the bill was withdrawn by its sponsors.<br/><br/>
Popvox does not disclose its revenue, but co-founder Marci Harris said the platform has more than 400,000 users across every congressional district and has delivered more than 4 million constituent positions to Congress.<br/><br/>
FiscalNote, which uses data-mining software and artificial intelligence to predict the outcome of legislation and regulations, has pulled in $19.4 million in capital since its 2013 start from big-name investors including Dallas Mavericks owner Mark Cuban, Yahoo co-founder Jerry Yang and the Winklevoss twins. The company says it achieves 94 percent accuracy. And Ipsos, the publicly traded market research and polling company, is amping up efforts to sell polling data to lobby firms.</blockquote></p>
<p>For an academic's take on the trend toward data in politics and campaigning, UNC assistant professor <a href="http://danielkreiss.com/">Daniel Kreiss</a> published a great piece for the Stanford Law Review in 2012 titled "<a href="http://www.stanfordlawreview.org/online/privacy-paradox/political-data">Yes We Can (Profile You)</a>," which lays out the ways in which political campaigns employ sophisticated data analysis techniques to measure and target voters.</p>Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-78477792392933789892015-04-17T15:33:00.001-07:002020-10-05T15:25:18.893-07:00What exactly is Section 215? <p>On June 1, Section 215 of the USA PATRIOT Act is set to expire. This is a critical moment in an effort to reform and modernize government surveillance frameworks in the United States. But it's difficult to explain how we got here and why this is important in a few sentences.</p>
<p>The <a href="https://www.eff.org/">Electronic Frontier Foundation</a> has put together a great background video on Section 215 that explains what it is, what it does, and what's at stake. And they include some data as well.</p>
<p>Watch the video below and learn more about EFF's efforts <a href="https://fight215.org/">here</a>.</p>
<div align="center"><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/5UsoZmMD5_A" frameborder="0" allowfullscreen></iframe></div>
Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com1tag:blogger.com,1999:blog-212702666983540465.post-8422980278943981702015-04-15T13:36:00.001-07:002020-10-05T15:25:18.862-07:00Disability Confident: How can we measure if government policy is working?<p><em>Andy White is Employment and Working Age Manager in the Evidence & Service Impact Section at <a href="http://www.rnib.org.uk/">RNIB</a>.</em></p>
<p>Current UK government policy to improve employment opportunities for disabled people is based on the government’s <a href="https://www.gov.uk/government/collections/disability-confident-campaign">Disability Confident campaign</a>. Charities such as RNIB are keeping a close watch on this by measuring its impact on the employment rates of disabled people.</p>
<p>Blind and partially sighted people are significantly less likely to be in paid employment than the general population or other disabled people. For every three registered blind and partially sighted people of working age, only one is in paid employment. Worse, blind and partially sighted people are nearly five times more likely than the general population to have had no paid work for five years.</p>
<p>Measuring the employment rates of people registered as blind (serious sight impaired) or partially sighted (sight impaired) gives us the clearest indication of the employment status of people living with sight loss. But even among those not registered, the <a href="http://discover.ukdataservice.ac.uk/series/?sn=2000026">Labour Force Survey</a> indicates that just over 44% of people who are described as "long term disabled with a seeing difficulty" are employed, compared with 74% of the general population.</p>
<p>One way to increase the numbers of blind and partially sighted people in employment is to focus on increasing the <strong>supply</strong> of blind and partially sighted people to the labour market by building their attributes and capabilities, and increasing the <strong>demand</strong> for meaningful work by supporting creative employment opportunities.</p>
<p>Another approach is to support people with sight loss to keep working—27% of non-working registered blind and partially sighted people said that the main reason for leaving their last job was the onset of sight loss or deterioration of their sight. However, 30% who were not working but who had worked in the past said that they maybe or definitely could have continued in their job given the right support.</p>
<p>We can address this by providing blind and partially sighted people with appropriate vocational rehabilitation support, and helping employers understand the business case for job retention. This is a challenge, given that the majority of employers have a negative attitude toward employing a blind or partially sighted person.</p>
<p>Blind and partially sighted people looking for work need specialist support on their journey towards employment. In addition to barriers common to anyone out of work for a long period, blind and partially sighted jobseekers have specific needs related to their sight loss.</p>
<p>Research indicates that those furthest from the labour market require a more resource-intensive model of support than those who are actively seeking work. Many blind and partially sighted jobseekers fall into this category.</p>
<p>The increased pressure on out-of-work blind and partially sighted people to join employment programmes means greater engagement in welfare to work programmes, and an increasing responsibility for the welfare to work industry to meet the specific needs of blind and partially sighted jobseekers.</p>
<p>Government policies such as the <a href="https://www.gov.uk/government/collections/disability-confident-campaign">Disability Confident campaign</a> will only be effective when there is a sea change in the proportion of blind and partially sighted people of working age achieving greater independence through paid employment.</p>
<p>Research about the employment status of blind and partially sighted people can be found on the <a href="http://www.rnib.org.uk/research">Knowledge Hub</a> section of RNIB's website. We also publish a series of <a href="http://www.rnib.org.uk/knowledge-and-research-hub-research-reports/evidence-based-reviews">evidence-based reviews</a>, including one for people of working age, upon which this blog is based.</p>
Anonymoushttp://www.blogger.com/profile/12072649794114150976noreply@blogger.com0tag:blogger.com,1999:blog-212702666983540465.post-8910134321201813692015-04-10T09:07:00.000-07:002020-10-05T15:25:18.862-07:00Data shows what millions knew: the Internet was really slow!<p><em>Meredith Whittaker is Open Source Research Lead at Google.</em></p>
<p>For much of 2013 and 2014, accessing major content and services was nearly impossible for millions of US Internet users. That sounds like a big deal, right? It is. But it's also hard to document. Users complained, the press <a href="http://time.com/2871498/fcc-investigates-netflix-verizon-comcast/">reported</a> disputes between Netflix and Comcast, but the scope and extent of the problem weren't understood until late 2014.</p>
<p>That understanding came thanks in large part to <a href="http://measurementlab.net/">M-Lab</a>, a broad collaboration of academic and industry researchers committed to openly and empirically measuring global Internet performance. Using a massive archive of open data, M-Lab researchers uncovered interconnection problems between Internet service providers (ISPs) that resulted in nationwide performance slowdowns. Their published report, <a href="http://www.measurementlab.net/static/observatory/M-Lab_Interconnection_Study_US.pdf">ISP Interconnection and its Impact on Consumer Internet Performance</a>, lays out the data.</p>
<p>To back up a moment—interconnection sounds complicated. It's not. Interconnection is the means by which different networks connect to each other. This connection allows you to access online content and services hosted anywhere, not just content and services hosted by a single access provider (think AOL in the 1990s vs. today’s Internet). By definition, the Inter-net wouldn't exist without interconnection.</p>
<p>Interconnection points are the places where Internet traffic crosses from one network to another. Uncongested interconnection points are critical to a healthy, open Internet. Put another way, it doesn't matter how wide the road is on either side—if the bridge is too narrow, traffic will be slow.</p>
<p>M-Lab data and research exposed just such slowdowns. Let’s take a look…</p>
<div style="border:1px solid gray; padding:6px;"><div align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJPDrvZGnEyed0xagYGpJNMTmKi7pWrGlgUFWMqeqNKczTH7TgPtxfhWYIZP8oykLmJLIiUv6qogUKsELkAEKBJrjq9A3OrtJlqrPwWU5VnK7YxsKay5gaYHJYm2HI6lbo_GsyRf4eNIm-/s1600/cogent-from-lga.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJPDrvZGnEyed0xagYGpJNMTmKi7pWrGlgUFWMqeqNKczTH7TgPtxfhWYIZP8oykLmJLIiUv6qogUKsELkAEKBJrjq9A3OrtJlqrPwWU5VnK7YxsKay5gaYHJYm2HI6lbo_GsyRf4eNIm-/s320/cogent-from-lga.png" /></a></div>
<p>The chart above shows download throughput data, collected by M-Lab in NYC between Feb 2013 and Sept 2014. It reflects traffic between customers of Time Warner Cable, Verizon, and Comcast—major ISPs—and an M-Lab server hosted on Cogent's network. Cogent is a major transit ISP, and much of the content and many of the services people use are hosted on Cogent’s network and on similar transit networks. Traffic between people and the content they want to access has to move through an interconnection point between their ISP (TWC, Comcast, and Verizon, in this case) and Cogent. What we see here, then, is severe degradation of download throughput between these ISPs and Cogent that lasted for about a year. During this time, customers of these three ISPs attempting to access anything hosted on Cogent in NYC were subjected to severely slowed Internet performance.</p>
<p>But maybe things are just slow, no?</p>
<div align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrgIKpjuMHI869iCOOQMBW6VeHtQU2nDTaXV0AE7P1EaY6XUYEUfc9GV60grDaPPwDpow4TM4OneQrQKzmuuNvfTScwcxEQjzUB8V_IbfNOVfqHkpzfXhVxby5AqFnXYfOPcbonhutusTV/s1600/internap-from-lga.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrgIKpjuMHI869iCOOQMBW6VeHtQU2nDTaXV0AE7P1EaY6XUYEUfc9GV60grDaPPwDpow4TM4OneQrQKzmuuNvfTScwcxEQjzUB8V_IbfNOVfqHkpzfXhVxby5AqFnXYfOPcbonhutusTV/s320/internap-from-lga.png" /></a></div>
<p>Here you see download throughput in NYC during the same time period, for the same three ISPs (plus Cablevision). The difference: here they are accessing an M-Lab server hosted on Internap's network (another transit ISP). In this case, in the same region, for the same general population of users, during the same time, download throughput was stable. Content and services accessed on Internap's network performed just fine.</p>
<p>Couldn't this just be Cogent's problem? Another good question…</p>
<div align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixnM70SPB3qo2cCPZOeA64hMhCuK_UR3P-r2gw0W-KU4r5-wPYbFTzo09uz3t0cVwbmCYQCqrLqLUDwofpFnwWrBnrnFFRTfi3SsftoMpiMZHUORGOFgGAp9vphisZUpsXU-CI0aQrEYMT/s1600/cogent-cablevision-lga.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixnM70SPB3qo2cCPZOeA64hMhCuK_UR3P-r2gw0W-KU4r5-wPYbFTzo09uz3t0cVwbmCYQCqrLqLUDwofpFnwWrBnrnFFRTfi3SsftoMpiMZHUORGOFgGAp9vphisZUpsXU-CI0aQrEYMT/s320/cogent-cablevision-lga.png" /></a></div>
<p>Here we return to Cogent. This graph spans the same time period, in NYC, looking again at download throughput across a Cogent interconnection point. The difference? We’re looking at traffic to customers of the ISP Cablevision.</p>
<p>Comparing these three graphs, we see M-Lab data exposing a problem that is not specific to any single ISP, but lies in the <em>relationship between</em> pairs of ISPs: in this example, Cogent when paired with Time Warner Cable, Comcast, or Verizon. Technically, that relationship manifests as interconnection.</p></div>
<p>These graphs focus on NYC, but M-Lab saw similar patterns across the US as researchers examined performance trends between pairs of ISPs nationwide (for example, wherever Comcast interconnected with Cogent). The research shows that interconnection-related performance problems were nationwide in scope and persisted for over a year. It also shows that these issues were not strictly technical in nature: in many cases, the same pattern of degradation appeared across the US wherever a given pair of ISPs interconnected. This rules out a regional technical problem and instead points to business disputes as the cause of the congestion.</p>
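<p>The pair-wise analysis described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not M-Lab's actual pipeline: the input schema (month, access ISP, transit ISP, measured Mbps) and the degradation thresholds are assumptions made for the sketch.</p>

```python
from statistics import median

def flag_degraded_pairs(samples, drop_ratio=0.5, min_months=3):
    """Flag (access ISP, transit ISP) pairs whose monthly median download
    throughput falls below drop_ratio of the pair's overall median for at
    least min_months consecutive months. Schema and thresholds are
    illustrative, not M-Lab's actual methodology.

    samples: iterable of (month, access_isp, transit_isp, download_mbps).
    """
    # Bucket raw measurements by ISP pair, then by month.
    by_pair = {}
    for month, access, transit, mbps in samples:
        by_pair.setdefault((access, transit), {}).setdefault(month, []).append(mbps)

    degraded = []
    for pair, per_month in by_pair.items():
        monthly = {m: median(v) for m, v in per_month.items()}
        baseline = median(monthly.values())
        # Look for a sustained run of months well below the pair's baseline.
        run = 0
        for m in sorted(monthly):
            run = run + 1 if monthly[m] < drop_ratio * baseline else 0
            if run >= min_months:
                degraded.append(pair)
                break
    return degraded
```

<p>Run over synthetic data shaped like the NYC graphs, a pair that degrades for several months (the Cogent case) is flagged, while a stable pair (the Internap case) is not.</p>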
<p>M-Lab research shows that when interconnection goes bad, the effects are not theoretical: it interferes with real people trying to do critical things. Good data and careful research helped quantify the real, human impact of something that had been relegated to technical discussion lists and sidebars in long policy documents. More focus on open data projects like M-Lab could help quantify the human impact across myriad issues, moving us from a hypothetical to a real and actionable understanding of how to draft better policies.</p>
The Rural Broadband Digital Divide (April 6, 2015)<p><em>Michael Curri is president and founder of <a href="http://www.sngroup.com">Strategic Networks Group</a>.</em></p>
<p>There is broad awareness of how differences in Internet connectivity contribute to the "digital divide" experienced by many, if not most, rural areas. Far less is understood about an equally real divide that stems from a lack of utilization. That's right: just as important as "speed" is how much businesses and non-commercial organizations actually use the Internet.</p>
<p>Using data SNG collected in numerous states between 2012 and February 2015, we can actually quantify this digital divide. Just as significantly, we can identify the types of organizations (by industry, size, rural/urban location, etc.) experiencing the greatest utilization gap. To quantify utilization, SNG developed the Digital Economy index (DEi), which reflects how many Internet processes and applications an organization uses. We measure the use of 17 applications on a ten-point scale (ten being best): for example, an organization using 8 of the 17 applications would have a DEi score of 4.7.</p>
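<p>The index arithmetic described above is simple enough to state as code. A minimal sketch follows; treat it as an illustration of the scale only, since SNG's published index may weight applications rather than count them equally:</p>

```python
def dei_score(apps_used, total_apps=17):
    """Digital Economy index (DEi) as a 0-10 rescaling of the share of
    tracked Internet applications an organization uses. Illustrative only:
    SNG's actual index may weight individual applications differently."""
    if not 0 <= apps_used <= total_apps:
        raise ValueError("apps_used must be between 0 and total_apps")
    return round(10 * apps_used / total_apps, 1)
```

<p>This reproduces the example in the text: using 8 of 17 applications yields a DEi score of 4.7.</p>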
<div align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGmWTvscz32wySug72WOn43Ri9qrFDrb9rAsGaJ_EVfg0x1PMJg0b96WAOWH9ohvy7lw73kjeKyf4mJGgTodvjUBUr1szUoSkyNUXuzBSEW1LGk7z90GR6V2EwoqreRw0HZZuCL3Y0W8Vh/s1600/Screen+Shot+2015-04-02+at+2.29.13+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGmWTvscz32wySug72WOn43Ri9qrFDrb9rAsGaJ_EVfg0x1PMJg0b96WAOWH9ohvy7lw73kjeKyf4mJGgTodvjUBUr1szUoSkyNUXuzBSEW1LGk7z90GR6V2EwoqreRw0HZZuCL3Y0W8Vh/s320/Screen+Shot+2015-04-02+at+2.29.13+PM.png" /></a></div>
<p>Collecting data in numerous states, each with rural and urban components, SNG has uncovered a digital divide based largely on the <strong>size of the community in which businesses are located</strong>. The table on the right shows that the more urban a community, the higher the DEi score. Regardless of the speeds available, rural communities utilize the Internet and its applications at a lower rate, largely because rural areas see less knowledge transfer among peers and have a smaller market for specialized technical services.</p>
<p>Beyond the notable gap in Internet utilization between rural and urban areas, SNG's research also reveals which sectors and types of organizations suffer most from this digital divide. This is consistent with our finding that rural communities have far fewer local resources to support businesses looking to make better use of broadband applications.</p>
<p>For small towns and isolated small towns (in essence, the census terms for "rural"), local governments show the largest utilization gap compared to their metropolitan peers, with a DEi of 5.24 versus 7.17. Libraries also show a notable utilization gap (metro = 7.23; rural = 6.12). In contrast, K-12 schools of comparable size have very similar DEi scores regardless of how urban or rural they are.</p>
<div align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgopbmqbO8mxqEDEqB9sl_uec1l_PKAAEyLIRnfjQ9VCJegGaol45dmUyjvU5a5XivrTG393jopr7n8GmzO8ITCc_BvIJmhnZfsDbpTc0bDICOGT2ZAltBqV6AvHaU4dEsc4xMZ9AEQ_1dM/s1600/Screen+Shot+2015-04-02+at+2.29.50+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgopbmqbO8mxqEDEqB9sl_uec1l_PKAAEyLIRnfjQ9VCJegGaol45dmUyjvU5a5XivrTG393jopr7n8GmzO8ITCc_BvIJmhnZfsDbpTc0bDICOGT2ZAltBqV6AvHaU4dEsc4xMZ9AEQ_1dM/s320/Screen+Shot+2015-04-02+at+2.29.50+PM.png" /></a></div>
<p>When examining <strong>industry type</strong>, it is illuminating to see just how much variance there is across industries. Ironically, one of the biggest utilization gaps appears in what might be considered the most advanced sector, Professional and Technical Services, which is large, growing, and well paying, yet slow to adopt key Internet applications.</p>
<div align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbkhXlZT5ElpLUK_qqR5TJpfjg88htf1h6m4BjOjgnfJqzKTMG0haUrF6yavN9TfXkqCUluuiMn5c-AJz_kwYqXeEXSnTf52birAjppm_k5VI0LdBaBzsccOl6BOR0oFKxW8_GWiiPoHV3/s1600/Screen+Shot+2015-04-02+at+2.30.07+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbkhXlZT5ElpLUK_qqR5TJpfjg88htf1h6m4BjOjgnfJqzKTMG0haUrF6yavN9TfXkqCUluuiMn5c-AJz_kwYqXeEXSnTf52birAjppm_k5VI0LdBaBzsccOl6BOR0oFKxW8_GWiiPoHV3/s320/Screen+Shot+2015-04-02+at+2.30.07+PM.png" /></a></div>
<p><strong>Larger businesses in rural areas</strong> (100 or more employees) still experience a utilization gap relative to their urban counterparts. Rural businesses with fewer than 100 employees experience a much larger one.</p>
<p>So while fiber, net neutrality, and FCC decisions dominate the news, the success of broadband in driving economic impact is <strong>dependent on utilization</strong>.</p>
<p>This means that providing rural businesses with the knowledge and support to leverage the Internet is key to maintaining their competitiveness. Furthermore, in today's landscape it is easier than ever to live rurally and work globally, as long as rural businesses have access to networks and support systems that help them thrive in the digital economy. Developing those local networks and support systems is a direct and significant opportunity (as well as a challenge) for local business retention and growth. There are ways to achieve this, including SNG's <a href="http://sngroup.com/small-business-growth-program/">Small Business Growth Program</a>. We'd love to share how this program can drive economic growth in your region.</p>
<p>See more <a href="http://sngroup.com/the-rural-broadband-digital-divide/#sthash.tGFBQxC4.dpuf">here</a>.</p>