How Do You Measure Success?

Elder Research, Inc.

by John F. Elder

Editor’s Note: John Elder has occasionally joked that the risk of pulling in a PhD to help with a problem is that it will inspire them to work on one even more interesting. Today, he became that person. He was originally asked to help with a simple question on a Request For Information (RFI) about defining metrics of success for a project. Our colleague requesting aid originally thought the request was a bit general, as in “how would we develop metrics for a new problem?”. By the time they realized the RFI’s goal was very specific (e.g., “maximize the count and ratio of correctly-labelled documents per hour”), it was too late: Dr. Elder had magnified “How do you measure success?” into the One Question that Rules them All. Enjoy.

The biggest risk in buying data science work is not technical; it’s social. Most projects work just fine in the lab but are never implemented. So, if you want to succeed, hire a consultant (hint: Elder Research) with a proven track record of getting their work installed and running. Obviously, they’ll possess the needed technical skills, but they’ll also have learned how to earn the trust and cooperation of their clients, so that the scary chasm of change (from doing business the old way to the new analytic way) is successfully crossed, and the huge ROI from data science is achieved.

Standing Offer

In our 25 years, Elder Research has seen this happen so often that we make a standing offer to prospective clients: we’ll work on your project for free. Just implement the solution and give us 20% of any revenue increase it produces. Your only cost is a fairly modest amount of time spent cooperating to explain the challenge: its goals, constraints, data, etc.

Sadly, most clients’ contracting structures make this essentially impossible to execute. But they are impressed by the confidence our offer exemplifies. What they don’t realize is how great a bet it would be for both of us! 95% of our projects work, and the return is often orders of magnitude greater than the cost. But since our prospective clients don’t see the gains as being as inevitable as we do, they sometimes fail to greenlight a great project. If the no-risk, shared-profit contracting option were possible, it would allow each party to act (wager) comfortably within its own risk perception, and both parties would come out far ahead the vast majority of the time.
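
To see why the wager favors both sides, here is a quick expected-value sketch. Every dollar figure is invented for illustration; only the 95% success rate comes from above.

```python
# A back-of-the-envelope sketch of the shared-profit wager.
# All dollar figures are hypothetical; only the 95% success
# rate is from the article.
p_success = 0.95            # stated project success rate
revenue_gain = 1_000_000    # hypothetical revenue increase if the project works
consultant_cut = 0.20       # the standing offer's 20% share
consulting_cost = 100_000   # hypothetical cost of doing the work for free

client_ev = p_success * revenue_gain * (1 - consultant_cut)
consultant_ev = p_success * revenue_gain * consultant_cut - consulting_cost
print(f"client expected gain:     ${client_ev:,.0f}")      # $760,000
print(f"consultant expected gain: ${consultant_ev:,.0f}")  # $90,000
```

Under these (hypothetical) numbers, the client risks nothing but time and the consultant still expects to come out ahead, which is the point of the offer.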

Real-World Problem

Assuming we’re working together, let’s get back to measuring performance. First question: what’s the problem? (Sometimes, the best thing to focus on is not the most obvious.) Being a good consultant means learning how to listen. Where is the pain? What are the knobs (possible controls), the limits, the main and secondary goals? What are the essentials and the like-to-haves? What relevant historical data can be found? There, embedded like fossils of ancient struggles, wait details of actions that worked and didn’t, of useful relationships and startling surprises, which, when revealed and harnessed, will lead to new efficiencies and profitability!

Embedded Technical Problem

While mostly listening, we also share ideas that have worked from our storehouse of experience, in order to focus the conversation on defining a solvable technical problem that is as close as possible to the real-world problem. Its solution has three key characteristics:

  1. Criteria of merit to score the quality of a solution
  2. Data Science / Machine Learning stage to gain information from historical data
  3. Optimization to find the best solution by those criteria

The criteria of merit are how we score success during optimization. It is tempting to use the usual statistical metrics built into every tool, like least squared error (or R²) for estimation, area under the curve (AUC) for recommendation, or percent correct for classification. But I (and others) have shown that they are all sub-optimal even for those standard problems, and more so for real-world challenges! When possible, it’s best to use a metric that is custom-built for the client’s particular problem. But once you’re off the path of the standard metrics, you need to employ custom optimization methods, which highlights that third key area: the team needs to be able to optimize any definable metric within a bounded space in reasonable time. (Optimization was my PhD dissertation topic, so I have a favorite approach there, and my algorithm, GROPE, is still a world champion, in terms of fewest probes, for low-dimensional (< 20 or so) real-valued problems. But a toolbox of optimization methods is needed to cover a wide variety of problem types.)
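
To make that concrete, here is a minimal sketch of scoring a classifier by a custom criterion of merit (hypothetical dollar payoffs and costs, rather than percent correct) and then optimizing a decision threshold against it over a bounded space. The payoff figures, the scikit-learn setup, and the simple grid search are illustrative assumptions, not Elder Research’s tooling or the GROPE algorithm.

```python
# A minimal sketch of a custom criterion of merit. The payoff/cost
# figures are hypothetical, and the bounded "optimizer" is a simple
# 1-D grid search (real problems often need a global optimizer).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

VALUE_TP = 50.0   # hypothetical payoff per correctly flagged case
COST_FP = 5.0     # hypothetical cost per false alarm

def profit(y_true, y_prob, threshold):
    """Custom criterion of merit: net dollars, not percent correct."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return VALUE_TP * tp - COST_FP * fp

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

# Optimize the custom metric over the bounded space of thresholds.
grid = np.linspace(0.01, 0.99, 99)
best = max(grid, key=lambda t: profit(y_val, probs, t))
print(f"best threshold {best:.2f} -> profit ${profit(y_val, probs, best):,.0f}")
```

Note that the profit-maximizing threshold will generally differ from the accuracy-maximizing one, which is exactly the point about standard metrics being sub-optimal for real problems.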

Learning with Data Science

Our core expertise is in Data Science, and that technology has the great ability to extract knowledge from historical data and provide something of an “oracle” for the future. Its main danger stems from its strength over-applied: pushed too hard, it can “over-fit” or “over-search” (two different issues) and identify patterns in the data that are really spurious correlations. This is sometimes memorably called “torturing the data until it confesses”. Such patterns appear in training data but almost certainly won’t appear in new “out-of-sample” data, the only place where it actually matters whether they exist. Elder Research is expert in the modern resampling methods necessary to establish the right level of complexity and depth of search to use during the data science modeling stage to avoid over-fit and over-search. We have even invented, and teach, cutting-edge methods, such as Target Shuffling (TS) and Deep TS, that protect models from going too far, and we are widely engaged for our model validation and verification skills and advice. When a model is properly tuned, so that its out-of-sample performance is well-calibrated, it can be used with great confidence, and its ROI will be solid.
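
To illustrate, here is a minimal sketch of the core Target Shuffling idea, assuming its simplest form: break the link between inputs and target by shuffling the target, refit, and see how good a score chance alone can achieve. (The production TS and Deep TS procedures are more elaborate than this.)

```python
# A minimal Target Shuffling sketch: if models fit to *shuffled* targets
# score nearly as well as the real model, the "discovery" is likely noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_informative=2, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

actual = cross_val_score(model, X, y, cv=5).mean()

# Refit and re-score the same pipeline on shuffled (meaningless) targets.
null_scores = [cross_val_score(model, X, rng.permutation(y), cv=5).mean()
               for _ in range(100)]

# Empirical p-value: how often did pure chance match the real score?
p_value = np.mean([s >= actual for s in null_scores])
print(f"actual accuracy {actual:.3f}; chance matched it in {p_value:.0%} of shuffles")
```

If chance matches the real score often, the model’s apparent skill should not be trusted, no matter how good it looks in training.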

Implement and Monitor

Lastly, the model must be implemented and monitored. Implementation is a complex stage (we’ve likened it to swapping out a motorcycle’s engine while riding it), but it is worth the trouble: the ROI does not happen otherwise, and it is hard to beat anywhere else. In our experience, most data science models are very robust and last for years. Still, they depend on the stability of the system on which they were trained. A big change (the 9/11 terrorist attacks, the internet, new tax laws, major new competition) will likely jolt the underlying system enough to force retraining. So, regularly monitor model performance to be sure it’s behaving well. The good news is that updating a model with new data is much less work than building it the first time, so maintenance can be very affordable.
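
As a minimal sketch of what such monitoring might look like, here is an assumed scheduled check that compares recent out-of-sample scores to the score at deployment and flags the model for retraining when it slips. The 5% tolerance and the monthly scores are hypothetical choices, not a prescribed policy.

```python
# A minimal monitoring sketch: flag the model for retraining when its
# recent score degrades materially from the deployment baseline.
# The tolerance and scores below are hypothetical.
from dataclasses import dataclass

@dataclass
class ModelMonitor:
    baseline_score: float     # out-of-sample score at deployment
    tolerance: float = 0.05   # allowed relative degradation (assumed)

    def needs_retraining(self, recent_score: float) -> bool:
        """True when performance falls materially below the baseline."""
        return recent_score < self.baseline_score * (1 - self.tolerance)

monitor = ModelMonitor(baseline_score=0.91)
for month, score in [("Jan", 0.90), ("Feb", 0.89), ("Mar", 0.82)]:
    action = "retrain on fresh data" if monitor.needs_retraining(score) else "OK"
    print(f"{month}: score {score:.2f} -> {action}")
```

A check like this runs in moments, which is why keeping a deployed model healthy costs so much less than building it did.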

Bottom line? Success depends on using a good model live. The hardest part is using it, so make sure your culture is ready to make decisions in a new way. The second hardest part is building a model that works out-of-sample. So work with, or become, experts in model validation. The third most important part is defining your problem well. Get those pieces in place and success — at scale — is a great wager.

Originally published at https://www.elderresearch.com on December 4, 2020.

