Machine learning has received mixed reactions from the water utility community. How it can help with developing a service line inventory for Lead and Copper Rule Revisions (LCRR) compliance is not widely understood, but its promise is alluring to many. The Trinnex and CDM Smith teams gathered on December 7th, 2022 to discuss this very topic, and we've summarized the most important points below.
The purpose of service line inventory development
Before we jump into machine learning, let's take a closer look at the purpose of inventory development. By October 2024, all water utilities in the U.S. will have to submit a service line inventory to the Environmental Protection Agency (EPA) indicating the material types of their service lines, all in an effort to remove lead from drinking water. For municipalities with wet infrastructure built before 1986, when lead was banned, the biggest task will be confirming whether the service lines are lead or made from another material.
How does machine learning work?
Confirming service line materials will require field verification programs, which take time and resources to complete. Machine learning can reduce that effort: it develops models that learn from data and predict patterns, so verification can be targeted where it matters most.
The basics of machine learning for predictions
There are four types of machine learning estimation methods to predict results:
- Classification: Sometimes referred to as supervised learning, classification predicts the class a record belongs to. For example, guessing whether an image of an animal is a cat or a dog based on certain characteristics, such as whether it barks.
- Clustering: Sometimes referred to as unsupervised learning, clustering groups data according to similarities and without labeled responses. For example, distinguishing cars from people.
- Regression: This method predicts a value using mathematical methods. For example, predicting stock price.
- Dimensionality reduction: Reduces the number of variables, which helps prevent overfitting the model. It is often used to select the most important features (measurable properties, such as the age of a home) for machine learning and can improve the accuracy of regression or classification models.
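The four methods above can be sketched on a tiny, invented dataset. This is not leadCAST Predict; the features (home age, a home-value score) and labels are illustrative only, using scikit-learn's standard estimators:

```python
# Toy illustration of the four estimation methods (hypothetical
# two-feature records: [home_age_years, home_value_score]).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.array([[80, 1.0], [75, 1.2], [10, 3.0], [5, 3.2]], dtype=float)
y_class = np.array([1, 1, 0, 0])          # 1 = lead, 0 = non-lead (invented)
y_value = np.array([8.0, 7.5, 1.0, 0.5])  # e.g., an invented risk score

# Classification: predict a label (lead vs. non-lead)
clf = LogisticRegression().fit(X, y_class)
label = clf.predict([[78, 1.1]])[0]

# Clustering: group records without any labels
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Regression: predict a continuous value
reg = LinearRegression().fit(X, y_value)
score = reg.predict([[78, 1.1]])[0]

# Dimensionality reduction: compress two features into one component
X1 = PCA(n_components=1).fit_transform(X)
```

Each estimator follows the same fit/predict pattern, which is why they can later be combined in a single pipeline.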
What type of machine learning is used for service line material predictions?
Classification is the primary technique used for service line material predictions. Trinnex uses all four techniques within our machine learning model, leadCAST Predict. Clustering, regression, and dimensionality reduction are used at the feature engineering stage. And we use what's called an ensemble classifier, a classification method that combines multiple models, to select the best-fit model for each utility's unique situation based on cross-validation accuracy, comparing predicted vs. actual materials for properties in the test dataset.
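A generic sketch of that selection step follows. leadCAST Predict's internals are not public, so this shows only the general pattern: score several candidate classifiers by cross-validation, pick the best, and optionally combine them in an ensemble. The features and labels are synthetic stand-ins:

```python
# Generic sketch of best-fit model selection by cross-validation
# (not leadCAST Predict itself; data and candidates are illustrative).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # stand-in property features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in lead / non-lead labels

candidates = {
    "logistic": LogisticRegression(),
    "tree": DecisionTreeClassifier(max_depth=4),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
# Mean accuracy over 5 cross-validation folds for each candidate
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}
best_name = max(scores, key=scores.get)

# An ensemble classifier combines the candidates' (soft) votes
ensemble = VotingClassifier([(n, m) for n, m in candidates.items()], voting="soft")
ensemble.fit(X, y)
```

Cross-validation scores each candidate on data it was not trained on, which is what makes the "best-fit" choice trustworthy.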
Building inventory with vs. without machine learning
Building an inventory without machine learning is possible: we can apply a manual process and use Excel charts to drill down and classify properties as lead vs. non-lead. However, this is time-consuming and handles grey areas poorly.
Machine learning not only assigns a category, but also assigns a probability of having lead lines to those properties. It can automate the development of the decision tree to predict lead as shown in the following figure.
Machine learning will also look for collinearity and determine which features are most important. For example, if the data includes both the sale price and the tax-assessed value of a home, the model will detect their linear relationship as it relates to confirming lead and remove the variable with the weaker impact.
Another advantage of machine learning is feature engineering. A machine learning-enabled model can convert a home's sale price into its sale price relative to the neighborhood. With enough training data, the model can learn patterns; for example, it can tease out homes likely to have lead based on the assumption that newer homes, which will not have lead, tend to be more expensive.
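That neighborhood-relative conversion can be sketched with the standard library alone. The field names and dollar values here are hypothetical, not taken from any real utility dataset:

```python
# Sketch of one feature-engineering step: convert raw sale price into
# price relative to the neighborhood median (field names hypothetical).
from statistics import median
from collections import defaultdict

homes = [
    {"neighborhood": "A", "sale_price": 150_000},
    {"neighborhood": "A", "sale_price": 200_000},
    {"neighborhood": "A", "sale_price": 250_000},
    {"neighborhood": "B", "sale_price": 400_000},
    {"neighborhood": "B", "sale_price": 500_000},
]

# Group sale prices by neighborhood, then take each neighborhood's median
by_hood = defaultdict(list)
for h in homes:
    by_hood[h["neighborhood"]].append(h["sale_price"])
medians = {hood: median(prices) for hood, prices in by_hood.items()}

for h in homes:
    # >1.0 means pricier than the neighborhood median (often newer,
    # post-ban housing); <1.0 may flag older stock worth inspecting.
    h["relative_price"] = h["sale_price"] / medians[h["neighborhood"]]
```

The engineered feature is comparable across neighborhoods, unlike raw sale price, which is dominated by location.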
Training the machine learning model
To train the model, we feed in known data from field verifications and hold back about 20% of it to test the accuracy of the model, as depicted in the figure. This process is referred to as cross-validation. To truly assess the accuracy, you want to validate the model against a dataset that you have confidence in. The states that do accept machine learning require field verifications for model training and testing.
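The hold-out idea can be shown with a deliberately simple stand-in "model": a year-built threshold learned from the training portion of verified records. The records and the pre-1950 lead rule here are invented purely to illustrate the 80/20 split:

```python
# Minimal sketch of the hold-out idea: train on ~80% of verified records,
# test accuracy on the withheld ~20% (records and rule are illustrative).
import random

random.seed(42)
records = [(year, "lead" if year < 1950 else "non-lead") for year in range(1900, 2000)]
random.shuffle(records)

cut = int(0.8 * len(records))            # hold back ~20% for testing
train_set, test_set = records[:cut], records[cut:]

# Stand-in "model": a year threshold learned only from the training data
threshold = max(year for year, mat in train_set if mat == "lead") + 1

def predict(year):
    return "lead" if year < threshold else "non-lead"

# Accuracy is measured only on records the model never saw
accuracy = sum(predict(y) == m for y, m in test_set) / len(test_set)
```

Because accuracy is computed on withheld records, it estimates how the model will perform on properties that have not yet been field-verified.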
Iterative machine learning to build an ideal model
Machine learning follows a loop:
- Leverage available data to predict materials
- Use feedback loop of those predictions to target locations for field verification
- Update the model training and test datasets with those field verifications
- And repeat
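The loop above can be sketched end to end with the same toy threshold model. Everything here (the parcel years, the pre-1950 rule, the batch size) is invented; the point is the structure: predict, target the least certain parcels for field verification, fold the results back in, and refit:

```python
# Toy sketch of the iterative loop: each round, verify the parcels the
# model is least certain about, add them to training data, and refit.
import random

random.seed(1)
# Parcels: (year_built, true_material); truth is unknown to the model at first
parcels = [(y, "lead" if y < 1950 else "non-lead") for y in range(1900, 2000)]
random.shuffle(parcels)
verified = parcels[:10]          # initial field verifications
unknown = parcels[10:]

for _round in range(3):
    lead_years = [y for y, m in verified if m == "lead"]
    threshold = max(lead_years, default=1900) + 1   # crude refit each round

    # "Least certain" here = closest to the current decision threshold
    unknown.sort(key=lambda p: abs(p[0] - threshold))
    batch, unknown = unknown[:10], unknown[10:]
    verified += batch            # crews confirm these parcels in the field

final_threshold = max(y for y, m in verified if m == "lead") + 1
```

Each round adds the most informative verifications, so the learned threshold converges toward the true material boundary faster than random sampling would.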
With more verifications, the model will improve and become more accurate.
Machine learning limitations
Machine learning has its own set of limitations.
- Universal machine learning models do not work for everyone. For example, a model trained on Utility A will not work on Utility B. The model is identifying patterns unique to Utility A and without representative training data covering Utility B, the model cannot be trusted.
- The model requires field verification for training and testing.
- It cannot predict what it does not know. For example, if the training data contains zero lead, the model will never predict lead. To account for this, one must train the model on the next-worst and oldest material, likely galvanized, and then field-verify the highest predictions to confirm that the oldest material truly is galvanized rather than lead.
- The model is limited by data resolution. If you don’t have property-specific data on housing values and are relying on census data, the model will not be precise.
How can machine learning be used reliably?
Initially, the model should help prioritize field inspections, estimate replacement costs, and sanity-check our confidence in historical records. Only once the model demonstrates high accuracy and reliability can it potentially be used to assign service line material subcategories in the inventory.
The initial model is developed from field inspection data and from historical records whose accuracy has been validated. It provides guidance on where to perform further field investigation.
After the initial model is built, it needs to continue to learn with new data, as more field verifications are completed. This is when you’ll see the hit rate improve and might even see scenarios where the model finds inaccuracies in historical records.
Toward the end of the program, presuming the model has been enriched with new field verifications over time and is highly accurate and reliable, we can consider using it to determine service line material in the inventory.
Hear more as Mark Zito, Trinnex LCRR expert, explains how to use machine learning reliably.
Training the model to over-identify lead
The outer edges of the prediction score are what we most need to be accurate, so we can confidently rule out properties that do not have lead. In the initial stages of model development, we optimize recall, the share of actual lead lines the model correctly identifies. This more conservative approach is recommended at the start, when you have less training data. As more training data is collected, we balance precision and recall.
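The precision/recall trade-off is easy to see on a small, invented batch of predictions. Over-identifying lead drives recall up (no lead line missed) at the cost of precision (more false alarms sent to field crews):

```python
# Precision vs. recall on a hypothetical batch of predictions
# (1 = lead, 0 = non-lead; numbers are illustrative).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # over-identifies lead on purpose

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false alarms
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # missed lead

recall = tp / (tp + fn)      # share of actual lead lines the model caught
precision = tp / (tp + fp)   # share of "lead" predictions that were right
# Here recall = 1.0 (no lead line missed), while precision is only ~0.67
```

A missed lead line is far costlier than an unnecessary inspection, which is why recall is prioritized early on.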
Components of model evaluation
When evaluating the model, you want to look at three main questions:
- What is the accuracy of the predictions? Accuracy itself has many components.
- Are the parameters representative, and are they themselves accurate?
- Are the field verifications (the training data) representative of the system, and are they accurate?
Domain expertise input on data collection and analysis
The accuracy of the input data is crucial, and this is where domain expertise comes into play. A common problem in machine learning is overfitting: a model that looks too perfect on its training data but fails to generalize. A human in the loop can identify this.
Targeting the proper data for the model also matters: service line install dates, for example, are better predictors than year built. Synthetic (engineered) data, such as property value relative to neighbors, is useful as well.
Missing data needs to be managed properly; a parameter with major gaps should not be used.
Managing missing data
This step really occurs before building a model. The data going into the model must be validated, especially the key features. If more than 20% of the data is missing, then one must consider removing that parameter.
There are also techniques to fill in the gaps, such as estimating the year built using a spatially weighted imputation.
Imputed values should be tracked and filled in with known data over time if possible.
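One such gap-filling technique is an inverse-distance-weighted estimate, a simple form of spatially weighted imputation. The parcel coordinates and years below are invented, and real implementations would use proper geographic distances:

```python
# Sketch of a spatially weighted imputation: estimate a missing year built
# from nearby parcels with known years, weighting by inverse distance
# (toy planar coordinates; values are illustrative).
known = [((0.0, 0.0), 1940), ((1.0, 0.0), 1950), ((5.0, 5.0), 1990)]
target = (0.5, 0.0)  # parcel with a missing year built

def idw_impute(target, known, power=2):
    weight_sum, weighted_total = 0.0, 0.0
    for (x, y), year in known:
        d = ((target[0] - x) ** 2 + (target[1] - y) ** 2) ** 0.5
        w = 1.0 / (d ** power + 1e-9)   # nearer parcels count more
        weight_sum += w
        weighted_total += w * year
    return weighted_total / weight_sum

year_est = idw_impute(target, known)
# Flag the value as imputed so it can be replaced with known data later
record = {"year_built": year_est, "imputed": True}
```

Tracking the `imputed` flag is what allows these estimates to be swapped out as real records are confirmed.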
Evaluate parameter weighting
Examine parameter weighting through a feature importance plot, which ranks features by their relative influence in the model. Sometimes ZIP code turns out to be the most influential feature, suggesting that location matters more than the year a property was built. But we also need to consider whether this is caused by a bias in the training dataset.
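Reading feature importances can be sketched with scikit-learn's random forest, whose `feature_importances_` attribute backs a typical importance plot. The data below is synthetic, with lead deliberately driven by year built rather than location:

```python
# Sketch of inspecting feature importances (data and encoding invented).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
year_built = rng.integers(1900, 2000, size=300)
zip_code = rng.integers(0, 5, size=300)          # ZIP encoded as 5 areas
X = np.column_stack([year_built, zip_code])
y = (year_built < 1950).astype(int)              # lead driven only by year here

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = dict(zip(["year_built", "zip_code"], model.feature_importances_))
# In this synthetic data, year_built dominates. If zip_code dominated on
# real data, check whether the training set is spatially biased before
# trusting that location truly outranks construction year.
```

The importances sum to 1, so each value reads as a share of the model's total reliance on that feature.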
Randomness and representativeness of data
Randomness and representativeness are both essential for model training and to reduce biases. The following figure explains the kind of data we need to build an inventory.
The results of the machine learning model then can be used to prioritize areas that have high likelihood of lead.
Learn more about machine learning for LCRR compliance
Machine learning has been solving puzzles and easing challenges in various fields, and with the power of data, it can bring enormous benefits to the water utility industry. Make sure to check out the full recording of our December 7th webinar here. Want a more in-depth look at leadCAST Predict, the machine learning model built specifically for service line material prediction? Schedule a demo with our data scientists.