Flint fiasco calls for heavy-hitting data analysis
The water crisis in Flint, Michigan, highlights a number of serious problems: a public health outbreak; inadequate urban infrastructure; environmental injustice; and political failures. But when it comes to recovery, the central challenge — and one that has received relatively little attention — is our lack of useful information and understanding.
Who is most at risk? Where are the harmful sources of lead? Where should resources be allocated? Using modern Big-Data tools, we can answer these questions and help inform the response to this crisis.
With the support of our student team at the University of Michigan, we have aggregated a trove of available data around Flint’s water issues, including water test results, records of the service lines that deliver water to homes, information on parcels of land and water usage.
Leveraging new algorithmic and statistical tools, we are able to produce a significantly more complete picture of the risks and challenges in Flint.
These methods strongly resemble those used by Facebook, Amazon and other large tech companies that collect vast amounts of data from users. But whereas Facebook's algorithms crunch through uploaded photographs to detect faces and Amazon’s models predict which products you’ll like, we are using these analytics tools to detect homes with high risk of lead contamination and to predict the locations of lead pipes buried underground or hidden in the homes of residents.
What have we learned? Here are a few takeaways from our research.
Lead contamination varies widely across homes and is highly scattered around Flint, but it is surprisingly predictable
The headlines on Flint easily could lead one to believe all homes in the city have dangerously high levels of lead. But in fact, using data from the state’s sentinel program, we found that during a period in February, only between 8 and 15 percent of homes had lead above the federal action level of 15 parts per billion (ppb).
Indeed, things have been improving from January through August, according to the test data from the sentinel program. Based on about 750 homes monitored repeatedly, fewer homes have tested above the action level over time. Almost half of all samples have virtually no detectable level (below 1 part per billion).
These low numbers provide little comfort when we don’t know which homes are at risk. Only around 30 percent of homes in Flint have had their water tested, according to government data, and these water tests do not guarantee safety; they only identify danger. Also, it is clear from the data that homes that are slower to sample their water tend to be those at much greater risk.
So can we find these homes? The answer is yes, to a modest degree of accuracy. We have built statistical models that profile a home based on several attributes (year of construction, location, value, size), and provide an estimate of the risk level.
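To make the idea of profiling a home concrete, here is a minimal sketch of one common kind of risk model, a logistic regression fitted by gradient descent. It is an illustration only and not necessarily the model our team used. The homes are simulated, the two attributes (age in years and value in thousands of dollars) stand in for the fuller set listed above, and every function name and coefficient is a hypothetical chosen for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function: maps any number to (0, 1).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_homes(n, seed=0):
    # ASSUMPTION: simulated homes, not Flint data. Older and cheaper
    # homes are made more likely to test high, purely for illustration.
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        age = rng.uniform(0, 100)     # years since construction
        value = rng.uniform(10, 150)  # thousands of dollars
        logit = -2.0 + 0.04 * age - 0.02 * value
        labels.append(1 if rng.random() < sigmoid(logit) else 0)
        rows.append([age, value])
    return rows, labels


def standardize(rows):
    # Rescale each attribute to mean 0 and standard deviation 1.
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def fit_logistic(rows, labels, lr=0.5, epochs=300):
    # Plain batch gradient descent on the average log loss.
    n, k = len(rows), len(rows[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j in range(k):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


rows, labels = make_synthetic_homes(1000)
scaled, means, stds = standardize(rows)
weights, bias = fit_logistic(scaled, labels)


def risk_for(age, value):
    # Estimated probability that a home with these attributes tests high.
    x = [(age - means[0]) / stds[0], (value - means[1]) / stds[1]]
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


print("older, lower-value home:", round(risk_for(90, 20), 2))
print("newer, higher-value home:", round(risk_for(10, 140), 2))
```

The fitted weights show which attributes push the estimate up or down, which is the sense in which a model of this kind can explain as well as predict.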
The quality of these models is driven by the huge swaths of data from water samples submitted by residents and tested by government officials in response to the crisis. This provides us with a database of measurements that includes over 20,000 water samples covering roughly 10,000 homes in Flint from November to present.
We have made our risk assessments available to government officials, and they are being incorporated into a mobile application, funded by Google and built by students at UM Flint, that allows Flint residents to learn their home’s risk level.
These statistical models not only provide predictions; they also give a better understanding of the problems. This has much broader implications, as these factors predicting lead may generalize beyond Flint.
The data suggest that lead contamination is associated with a number of factors: older homes tend to be at greater risk, for instance, as are those of lower home value. Lower-value homes also tend to be those with the lowest rates of water sampling. Additionally, while the highest readings are geographically scattered, the homes predicted to be at high risk tend to cluster in specific neighborhoods.
Flint’s lead pipe records are spotty and noisy, but statistical methods can significantly fill the gap
Media reports and political efforts have continued to focus on the so-called "water service lines" that connect each house to the distribution system in the street. The assumption is that homes with lead service lines are most at risk for lead exposure and poisoning. As a result, much of the attention has been on locating and replacing these lines.
The Michigan legislature has allocated over $25 million toward replacing the harmful lines, beginning with a pilot phase of roughly 250 homes. This effort is being headed up by a team under National Guard Brig. Gen. Michael McDaniel.
The problem, however, is not only with lines made out of lead material: Lead particulate can accumulate on the walls of corroded galvanized steel pipes. Pipes made of copper or plastic, on the other hand, are generally considered to be safe.
But there are immediate challenges with the line replacement program. And the most obvious is: Where are these dangerous pipes?
The city, unfortunately, did not maintain consistent records on service line installations and materials. But city officials eventually found, after some searching, a set of maps with handwritten annotations (last updated in 1984), and these records were digitized by a UM Flint research team led by professor Marty Kaufman. These appeared to identify the material of the service lines for most home parcels in Flint.
How complete and accurate are these records? Unfortunately, not very. For over 30 percent of homes, either there are missing labels or the records disagree with a home inspection of a portion of the service line.
We can again fill in gaps with the help of algorithms and data. Looking for patterns in the existing records, statistical tools can provide a reasonable "educated guess" as to the type of material in a home’s service line. We have been working directly with McDaniel’s line replacement team, providing statistical estimates of where lead pipes are most likely to be found, and this has guided their targeting of replacement resources.
Our recommendations adapt as new inspection data come in, using techniques also applied in online advertising experiments and clinical trials, to identify the risky homes quickly and efficiently.
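One well-known technique from online advertising experiments and clinical trials is Thompson sampling, and the sketch below shows how it could steer inspections toward the areas where lead lines keep turning up. It is offered as an example of this family of adaptive methods, not as a description of the exact procedure used in Flint. The three areas and their lead rates are invented for the illustration, and the function name is hypothetical.

```python
import random


def thompson_inspections(true_rates, n_inspections, seed=0):
    # true_rates: ASSUMED share of homes with lead lines in each area.
    # In practice these rates are unknown; here they drive the simulation.
    rng = random.Random(seed)
    hits = {area: 0 for area in true_rates}    # inspections that found lead
    misses = {area: 0 for area in true_rates}  # inspections that did not
    for _ in range(n_inspections):
        # Draw a plausible lead rate for each area from its Beta posterior
        # (uniform prior), then inspect the area with the highest draw.
        draws = {area: rng.betavariate(hits[area] + 1, misses[area] + 1)
                 for area in true_rates}
        chosen = max(draws, key=draws.get)
        # Simulate the inspection result and update that area's record.
        if rng.random() < true_rates[chosen]:
            hits[chosen] += 1
        else:
            misses[chosen] += 1
    return hits, misses


# Made-up rates for three hypothetical areas.
rates = {"area_a": 0.7, "area_b": 0.3, "area_c": 0.1}
hits, misses = thompson_inspections(rates, 500)
inspected = {area: hits[area] + misses[area] for area in rates}
print(inspected)
```

Early on, the inspections are spread across all areas because little is known. As results accumulate, most of them shift to the area where lead is found most often, while the others still get an occasional check in case the early results were misleading.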
Our machine learning techniques, which use all of the available city data, parcel records and a database of over 3,000 inspection reports, are able to estimate line materials with better than 80 percent accuracy. We find, for instance, that houses built in the 1920s to 1940s are many times more likely than those built after 1960 to have lead in their service line. Our guesses aren’t perfect by any means, but estimates of this level can save millions of dollars on recovery efforts.
Home service lines may not be the largest contributor of lead
Despite the huge media attention focused on the service lines, one major takeaway from our analyses is that these service lines may not be the major driver of the lead in Flint’s drinking water. Yes, it is the case that those homes with copper service lines have lower lead levels, on average, than those with lead in their service line. But when you look closely at the water testing data, the differences are much smaller than you might think.
While it is difficult to determine with certainty due to the spotty records, what we have found is that large spikes of lead occur in homes with and without lead service lines. This suggests a large fraction of the dangerously high lead readings probably are not being driven by the service line material but instead by other factors. Environmental engineers who study these problems report that lead can leach from several sources, including the home’s interior plumbing, faucet fixtures and aging pipe solder.
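The comparison behind this observation is simple to express: group water samples by the recorded service line material and compute the share in each group that exceeds the 15 ppb action level. The sketch below uses a handful of made-up readings, not Flint measurements, and the function name is hypothetical.

```python
from collections import defaultdict

ACTION_LEVEL_PPB = 15.0  # federal action level cited in the article


def share_above_action_level(samples):
    # samples: iterable of (service_line_material, lead_reading_in_ppb).
    totals = defaultdict(int)
    above = defaultdict(int)
    for material, ppb in samples:
        totals[material] += 1
        if ppb > ACTION_LEVEL_PPB:
            above[material] += 1
    return {material: above[material] / totals[material]
            for material in totals}


# ASSUMPTION: invented readings, chosen only to show the calculation.
samples = [
    ("lead", 2.0), ("lead", 40.0), ("lead", 0.5), ("lead", 22.0), ("lead", 8.0),
    ("copper", 1.0), ("copper", 0.0), ("copper", 55.0), ("copper", 3.0), ("copper", 6.0),
]
shares = share_above_action_level(samples)
print(shares)
```

If the service line were the only source of lead, the share for copper lines would sit near zero. A gap between the groups that is smaller than expected points toward other sources inside the home.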
What we can conclude is that both citizens and policymakers may need to widen their focus beyond the service line materials and consider alternative efforts to address other sources of lead. Service line replacement is certainly a necessary part of the solution, but it will not be sufficient.
Toward solving the broader problem, data and statistical tools can help greatly reduce risks at much lower cost, and a data-oriented understanding of the problems in Flint can guide efforts to address lead concerns in other regions as well.