Monday, June 29, 2015

Data for Good in Bangalore

Miriam Young is a Communications Specialist at DataKind.

At DataKind, we believe the same algorithms and computational techniques that help companies generate profit can help social change organizations increase their impact. As a global nonprofit, we harness the power of data science in the service of humanity by engaging data scientists and social change organizations on projects designed to address critical social issues.

Our global Chapter Network recently wrapped up a marathon of DataDives, helping local organizations with their data challenges over the course of a weekend. This post highlights two of the projects from DataKind Bangalore’s first DataDive earlier this year, where volunteers used data science to help support rural agriculture and combat urban corruption.

Digital Green

Founded in 2008, Digital Green is an international, nonprofit development organization that builds and deploys information and communication technology to amplify the effectiveness of development efforts to effect sustained social change. They have a series of educational videos of agricultural best practices to help farmers in villages succeed.

The Challenge

Help farmers more easily find videos relevant to them by developing a recommendation engine that suggests videos based on open data on local agricultural conditions. The team was working with a collection of videos, each focused on a specific crop, along with descriptions, but each description was in a different regional language. The challenge, then, was parsing and interpreting this information to use it as a descriptive feature for the video. To add another challenge, they needed geodata with the geographical boundaries of different regions to map the videos to a region with specific soil types and environmental conditions, but the data didn’t exist.

The Solution

The volunteers got to work preparing this dataset and published boundaries of 103,344 Indian villages and geocoded 1,062 Digital Green villages in Madhya Pradesh (MP) to 22 soil polygons. They then clustered MP districts into 5 agro-climatic clusters based on 179 feature vectors, mapping villages that Digital Green works with into these agro-climatic clusters. Finally, the team developed a Hinglish parser that parses the Hindi titles of available videos and translates them to English to help the recommender system understand which crop the videos relate to.
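To make the geocoding step concrete: assigning a village to a soil polygon comes down to a point-in-polygon test. The post does not say which tools the volunteers used, so the following is only a minimal sketch using the standard ray-casting method in plain Python. The function names, polygon names and coordinates are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' each time a ray going right from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_soil_polygon(village, soil_polygons):
    """Return the name of the first polygon containing the village, or None."""
    lon, lat = village
    for name, polygon in soil_polygons.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical soil polygons (simple squares) and village coordinates.
SOIL_POLYGONS = {
    "soil_a": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "soil_b": [(4.0, 0.0), (8.0, 0.0), (8.0, 4.0), (4.0, 4.0)],
}

if __name__ == "__main__":
    print(assign_soil_polygon((1.0, 1.0), SOIL_POLYGONS))
    print(assign_soil_polygon((6.0, 2.0), SOIL_POLYGONS))
```

Real boundary data would have far more vertices per polygon, and a GIS library would likely be a better fit at that scale, but the underlying test is the same.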

I Change My City / Janaagraha

Janaagraha was established in 2001 as a nonprofit that aims to combine the efforts of the government and citizens to ensure better quality of life in cities by improving urban infrastructure, services and civic engagement. Their civic portal, IChangeMyCity, promotes civic action at a neighborhood level by enabling citizens to report a complaint that then gets upvoted by the community and flagged for government officials to take action.

The Challenge

Deal with duplicate complaints that can clog the system and identify factors that delay open issues from being closed out.

The Solution

To deal with the problem of duplicate complaints, the team used Jaccard similarity and Cosine similarity on vectorized complaints to cluster similar complaints together. Disambiguation was performed by ward and geography. The model they built delivered a precision of more than 90%.
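For readers unfamiliar with the two measures, here is a minimal sketch of how Jaccard and cosine similarity can flag likely duplicates. It is not the team's model: the whitespace tokenizer, the 0.5 threshold, the `ward` and `text` field names and the sample complaints are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Shared distinct words divided by all distinct words in either text."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_probable_duplicate(c1, c2, threshold=0.5):
    """Compare only complaints from the same ward; the threshold is a hypothetical value."""
    if c1["ward"] != c2["ward"]:
        return False
    return (jaccard_similarity(c1["text"], c2["text"]) >= threshold
            or cosine_similarity(c1["text"], c2["text"]) >= threshold)


if __name__ == "__main__":
    first = {"ward": 12, "text": "garbage not collected"}
    second = {"ward": 12, "text": "garbage not collected today"}
    print(jaccard_similarity(first["text"], second["text"]))
    print(is_probable_duplicate(first, second))
```

Restricting comparisons to the same ward mirrors the disambiguation by ward described above, and it also keeps the number of pairwise comparisons manageable.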

To deal with the problem of identifying factors affecting closure by user and authorities, the team used two approaches. The first approach involved analysis using Decision Trees by capturing attributes like Comments, Vote-ups, Agency ID, Subcategory and so on. The second approach involved logistic regression to predict closure probability. Closure probability was modeled as a function of complaint subcategory, ward, comment velocity, vote-ups and similar other factors.
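As a rough illustration of the second approach, logistic regression turns a weighted sum of complaint features into a probability between 0 and 1. The sketch below shows only that scoring step. The feature names, weights and bias are invented for the example and are not the fitted values from the DataDive; categorical inputs such as subcategory or ward would first need to be encoded as numeric indicators.

```python
import math


def closure_probability(features, weights, bias):
    """Logistic model: sigmoid of bias + sum(weight * feature value)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; real ones would be estimated from historical complaints.
WEIGHTS = {"vote_ups": 0.08, "comment_velocity": 0.5, "is_road_subcategory": -0.3}
BIAS = -1.0

if __name__ == "__main__":
    quiet = {"vote_ups": 1, "comment_velocity": 0.1, "is_road_subcategory": 1}
    active = {"vote_ups": 25, "comment_velocity": 2.0, "is_road_subcategory": 1}
    print(closure_probability(quiet, WEIGHTS, BIAS))
    print(closure_probability(active, WEIGHTS, BIAS))
```

With these made-up weights, the more active complaint scores a higher closure probability; the sign and size of each fitted coefficient is what would indicate which factors speed up or delay closure.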

With these new features, IChangeMyCity will be able to better handle the large volume of incoming requests and Digital Green will be better able to serve farmers.

These initial findings are certainly valuable, but DataDives are actually much bigger than just weekend events. The weeks of preparation that go into them and months of impact that ripple out from them make them a step in an organization’s larger data science journey. This is certainly the case here, as both of these organizations are now exploring long-term projects with DataKind Bangalore to expand on this work.

Stay tuned for updates on these exciting projects to see what happens next!

Interested in getting involved? Find your local chapter and sign up to learn more about our upcoming events.

Wednesday, June 24, 2015

The Price of Data Localization

Forced data localization laws require data be stored in a specific country, rather than in a distributed “cloud” spread across global networks. As we see the development of more cloud-based products and services, these laws run counter to the direction of technological innovation.

In fact, many studies have shown that forced data localization could negatively impact privacy as well as security and integrity of data. Other studies, like one by the European Centre for International Political Economy, have shown that data localization has negative impacts on the economies that require it.

Adding to the mounting evidence against data localization, new research by Leviathan Security Group shows the harms at a smaller scale: the direct cost of forced data localization to local businesses, rather than to whole economies. The costs can be pretty dramatic:

...[W]e find that for many countries that are considering or have considered forced data localization laws, local companies would be required to pay 30-60% more for their computing needs than if they could go outside the country's borders.

Leviathan looked at the major public cloud providers who allow on-demand self-service provisioning through their infrastructure. The group includes Amazon Web Services, DigitalOcean, Google Compute Engine, HP Public Cloud, Linode, Microsoft Azure, and Rackspace Cloud Servers. Consumers in affected countries might be able to find other cloud providers, but many of these providers don't allow self-service provisioning, instead requiring a confidentiality agreement, a full business-to-business agreement, or other paperwork. In many countries, cloud providers won't be available at all, so businesses must make major capital investments in computer hardware and infrastructure, rather than being able to take advantage of flexible and cost-saving per-use models.

Leviathan created an interactive visualization that allows anyone to compare all the cloud vendors by location and price around the world. You can check out this study and the visualization, along with their previous work on cloud security, at

Monday, June 8, 2015

Smart Maps for Smart Cities: India’s $8 Billion+ Opportunity

Gaurav Gupta is Dalberg's Regional Director for Asia.

Did you know that India is expected to see the greatest migration to cities of any country in the world in the next three decades, with over 400 million new inhabitants moving into urban areas? To accommodate this influx of city dwellers, India’s urban infrastructure will have to grow, too.

That growth has already begun. In the last six years alone, India’s road network has expanded by one-quarter, while the total number of businesses has increased by one-third.

To better understand how smart maps—citizen-centric maps that crowdsource, capture, and share a broad range of detailed data—can help India develop smarter and more efficient cities, our team at Dalberg Global Development Advisors worked with the Confederation of Indian Industry on a new study, Smart Maps for Smart Cities: India’s $8 Billion+ Opportunity. What we found was that even for a select set of use cases, smart maps can help India gain over USD $8 billion in savings and value, save 13,000 lives, and reduce one million metric tons of carbon emissions a year in cities alone. Their aggregate impact is likely to be several multiples higher.

Our research shows that simple improvements in basic maps can lead to significant social impact: smart maps can also help businesses attract more consumers, increase foreign tourist spending and even help women feel safer.

In these quickly changing cityscapes, online tools like maps need to be especially dynamic, able to update faster and quickly expand coverage of local businesses in order to serve as highly useful tools for citizens. Yet today, most cities lack sophisticated online tools that make changing information, like road conditions and new businesses, easy to find online. Only 10-20% of India’s businesses, for instance, are listed on online maps.

So what will it take to continue developing smart maps to help power these cities? Our study shows that India will need to embrace a new policy framework that truly encourages scalable solutions and innovation by promoting crowdsourcing and creating a single accessible point of contact between government and the local mapping industry.

Friday, June 5, 2015

Moving beyond the binary of connectivity

Back in April, we shared a post from designer and Internet researcher An Xiao Mina about the "sneakernet." She has a new post on The Society Pages in which she sets out to define a concept she calls the binary of connectivity.

But what exactly is this binary of connectivity? Attendees at my talk asked me to define it, and I’d like to propose a working definition:

The connectivity binary is the view that there is a single mode of connecting to the internet — one person, one device, one always-on subscription.

The connectivity binary is grounded in a Western, urban, middle class mode of connectivity; this mode of connecting is seen as the penultimate realization of our relationship to the internet and communications technologies. Thinking in a binary way renders other modes of access invisible, both to makers and influencers on the internet and to advertising engines and big data, and it limits our understanding of the internet and its global impact.

I can imagine at least two axes of a connectivity spectrum: single vs. shared usage, and continuous vs. intermittent access. For many readers of Cyborgology, single usage, continuous access to the web is likely the norm. The most extreme example of this might be iconized in the now infamous image of Robert Scoble wearing Google Glass in the shower–we are always connected, always getting feeds of data our way.

Here’s how other sections of those axes might map to practices I’ve observed in different parts of the world. Imagine these at differing degrees away from the center of a matrix:

  • Shared Usage, Continuous Access: I saved up to buy a laptop with a USB stick that my family of four can use. We take turns using it, and our connection is pretty stable.

  • Single, Intermittent: I have a low-cost Chinese feature phone (maybe a Xiaomi), and I pay a few dollars each month for 10 MB of access. I keep my data plan off most of the time.

  • Shared, Intermittent: I walk all day to visit an internet cafe once every few months to check my Facebook account, listen to music on YouTube and practice my typing skills. I don’t own a computer myself.

For the purposes of simplicity, I’m assuming that we’re talking about devices that have one connection. But, of course, some devices have multiple connections (think of a phone with multiple SIMs) and some connections have multiple devices (think of roommates sharing a wifi router).

Read the full post here.

Wednesday, May 27, 2015

Housing Data Hub - from Open Data to Information

Joy Bonaguro is Chief Data Officer of the City and County of San Francisco. This is a repost from April announcing the launch of their Housing Data Hub.

Housing is a complex issue and it affects everyone in the City. However, there is not a lot of broadly shared knowledge about the existing portfolio of programs. The Hub puts all housing data in one place, visualizes it, and provides the program context. This is also the first of what we hope to be a series of strategic open data releases over time. Read more about that below or check out the Hub, which took a village to create!

Evolution of Open Data: Strategic Releases

The Housing Data Hub is also born out of a belief that simply publishing data is no longer sufficient. Open data programs need to take on the role of adding value to open data versus simply posting it and hoping for its use. Moreover, we are learning how important context is to understanding government datasets. While metadata is an essential part of context, it’s a starting point, not an endpoint.

For us, a strategic release is one or more key datasets + a data product. A data product can be a report, a website, an analysis, a package of visualizations; you get the idea. The key point: you have done something beyond simply publishing the data. You provide context and information that transforms the data into insights or helps inform a conversation. (P.S. That’s also why we are excited about Socrata’s new dataset user experience for our open data platform.)

Will we only do strategic releases?

No! First off - it’s a ton of work and requires amazing partnerships. Strategic (or thematic) releases should be a key part of an open data program but not the only part. We will continue to publish datasets per department plans (coming out formally this summer). And we’ll also continue to take data nominations to inform department plans.

We’ll reserve strategic releases to:

  • Address a pressing information gap or need
  • Inform issues of high public interest or concern
  • Tie together disparate data that may otherwise be used in isolation
  • Unpack complex policy areas through the thoughtful dissemination of open data
  • Pair data with the content and domain expertise that we are uniquely positioned to offer (e.g., answer the questions we receive over and over again in a scalable way)
  • Build data products that are unlikely to be built by the private sector
  • Solve cross-department reporting challenges

And leverage the open data program to expose the key datasets and provide context and visualizations via data products.

We also think this is a key part of broadening the value of open data. Open data portals have focused more on a technical audience (what we call our citizen programmers). Strategic releases can help democratize how governments disseminate their data for a local audience that may be focused on issues in addition to the apps and services built on government data. It can also be a means to increase internal buy-in and support for open data.

Next steps

As part of our rolling release, we will continue to work to automate the datasets feeding the hub. You can read more about our rollout process, inspired by the UK Government Digital Service. We’ll also follow up with a technical post on the platform, which is available on GitHub, including how we are consuming the data via our open data APIs.

Thursday, May 21, 2015

Is the Internet Healthy?

Meredith Whittaker is Open Source Research Lead at Google.

We are big fans of open data. So we're happy to see that the folks over at Battle for the Net launched The Internet Health Test earlier this week, a nifty tool that allows Internet users to test their connection speed across multiple locations.

The test makes use of M-Lab open source code and infrastructure, which means that all of the data gathered from all of the tests will be put into the public domain.

One of the project's goals is to make more public data about Internet performance available to advocates and researchers. Battle for the Net and others will use this data to identify problems with ISP interconnections, and, they claim, to hold ISPs accountable to the FCC's Open Internet Order.

This is certainly a complex issue but we are always thrilled by more data that can be used to inform policy.

You can learn more and run the test over at their site:

Thursday, May 14, 2015

New data, more facts: an update to the Transparency Report

Cross-posted from the Official Google Blog.

We first launched the Transparency Report in 2010 to help the public learn about the scope of government requests for user data. With recent revelations about government surveillance, calls for companies to make encryption keys available to police, and a wide range of proposals, both in and out of the U.S., to expand surveillance powers throughout the world, the issues today are more complicated than ever. Some issues, like ECPA reform, are less complex, and we’re encouraged by the broad support in Congress for legislation that would codify a standard requiring warrants for communications content.

Google's position remains consistent: We respect the important role of the government in investigating and combating security threats, and we comply with valid legal process. At the same time, we'll fight on behalf of our users against unlawful requests for data or mass surveillance. We also work to make sure surveillance laws are transparent, principled, and reasonable.

Today's Transparency Report update

With this in mind, we're adding some new details to our Transparency Report that we're releasing today.

  • Emergency disclosure requests. We’ve expanded our reporting on requests for information we receive in emergency situations. These emergency disclosure requests come from government agencies seeking information to save the life of a person who is in peril (like a kidnapping victim), or to prevent serious physical injury (like a threatened school shooting). We have a process for evaluating and fast-tracking these requests, and in true emergencies we can provide the necessary data without delay. The Transparency Report previously included this number for the United States, but we’re now reporting for every country that submits this sort of request.

  • Preservation requests. We're also now reporting on government requests asking us to set aside information relating to a particular user's account. These requests can be made so that information needed in an investigation is not lost while the government goes through the steps to get the formal legal process asking us to disclose the information. We call these "preservation requests" and because they don't always lead to formal data requests, we keep them separate from the country totals we report. Beginning with this reporting period, we're reporting this number for every country.

In addition to this new data, the report shows that we've received 30,138 requests from around the world seeking information about more than 50,585 users/accounts; we provided information in response to 63 percent of those requests. We saw slight increases in the number of requests from governments in Europe (2 percent) and Asia/Pacific (7 percent), and a 22 percent increase in requests from governments in Latin America.

The fight for increased transparency

Sometimes, laws and gag orders prohibit us from notifying someone that a request for their data has been made. There are some situations where these restrictions make sense, and others not so much. We will fight, sometimes through lengthy court action, for our users' right to know when data requests have been made. We've recently succeeded in a couple of important cases.

First, after years of persistent litigation in which we fought for the right to inform Wikileaks of government requests for their data, we were successful in unsealing court documents relating to these requests. We’re now making those documents available to the public here and here.

Second, we've fought to be more transparent regarding the U.S. government's use of National Security Letters, or NSLs. An NSL is a special type of subpoena for user information that the FBI issues without prior judicial oversight. NSLs can include provisions prohibiting the recipient from disclosing any information about it. Reporters speculated in 2013 that we challenged the constitutionality of NSLs; after years of litigation with the government in several courts across multiple jurisdictions, we can now confirm that we challenged 19 NSLs and fought for our right to disclose this to the public. We also recently won the right to release additional information about those challenges and the documents should be available on the public court dockets soon.

Finally, just yesterday, the U.S. House of Representatives voted 338-88 to pass the USA Freedom Act of 2015. This represents a significant step toward broader surveillance reform, while preserving important national security authorities. Read more on our U.S. Public Policy blog.

Posted by Richard Salgado, Legal Director, Law Enforcement and Information Security