Is Artificial Intelligence Truly Testable?

Liming Zhu, Research Director, Csiro’s Data61 and Qinghua Lu, Senior Research Scientist, Csiro’s Data61

Liming Zhu, Research Director, Csiro’s Data61

William Gibson once said: “The future is already here—it's just not very evenly distributed.”

The benefit of being part of CSIRO’s Data61, the data and digital specialist arm of Australia’s national science agency, is that we’re able to experience, experiment, and test the future right now.

There has been tremendous progress in the last few years in the fields of artificial intelligence (AI), machine learning (ML), and data-driven systems. AI and ML have been successfully applied to testing quite as the field has been traditionally data-rich (with source code, test cases, configurations, logs, and even binary files...) and relatively well-labeled for model training through vulnerability databases, bug fixing history, and continuous testing/ delivery learnings.

Formal methods such as interactive theorem proving as a form of traditional logic-based AI, have also made strides, substituting and complementing testing in critical systems, as evidenced by the success of systems like seL4 and related high-assurance focused programs from the US Defense Advanced Research Projects Agency (DARPA).

Beyond “AI and ML for testing” is a much bigger and looming challenge around “testing for AI, ML, and data-driven systems” which are substantially different from, if not more difficult in order of magnitude, than traditional desktop, web, mobile, and enterprise systems.

First, as predicted by Google a few years ago, there is a software crisis surrounding “software engineering for ML”(SE4ML)-driven systems, just as there was a software crisis in the 1960s when the field of software engineering was first created.

Even with the best talent on the planet and the years of experiences of building such systems, Google’s ML and data-driven systems have been accumulating significant “technical debts” to a significant level of degraded testability and maintainability. Training and improving models from a multitude of complex and evolving data sources, connecting these models, and integrating the associated interference or predictive components into a long-running production system is much more complex than the continuous testing and delivering of traditional web-based systems.

There are new types of bugs to hunt down during the training and validation phase, during conversion of training models into production components and during acceptance testing and ongoing monitoring. A key concern is that these AI/ML/data-driven systems are constantly learning from the wild, once deployed. If something is wrong, is it because of the data (and what data?), the code, a carefully crafted adversarial attack, or a new insight that is just not yet fully appreciated by human developers?

Qinghua Lu, Senior Research Scientist, Csiro’s Data61

It’s not a full-blown crisis yet, simply because many enterprises have not fully embraced such systems in their production process. We see emerging research and technologies around reproducible ML, continuous delivery for machine learning (CD4ML), and debugging of ML-driven systems (e.g. crowdsourced testing, incentive mechanisms). If your organization has solid software engineering and testing practices, expanding them beyond code to model governance/versioning, data versioning, configuration/environment versioning (especially though lightweight containers) can help you be more confident in exploiting the value of data and AI/ML as your competitive advantages.

Second, AI/ML-driven systems are learning from a vast amount of often personal and sensitive data, to make critical decisions for us, or sometimes about us. We need to be concerned about ethical and legal aspects on the use of this data and the decisions derived from it.

Testing an AI/ML/data-driven system is no longer just against some functional specification and traditional non-functional attributes such as performance, security, reliability, and interoperability. Increasingly, it’s about testing the ethical conformance and legal compliance surrounding an enterprise’s acquisition and access of data (e.g. leveraging blockchain technology for ML accountability), the specific use purpose of data, potential bias in data, models and decisions, as well as explainability of these models and decisions. We are starting to see interesting ethical-by-design technologies, test cases for ethical issues such as fairness being integrated with the development life cycle, and automated compliance testing. We see the needs of continuous delivery moving into “continuous validation” of your AI/ML/data-driven systems through interaction with customers, users, stakeholders, and regulators.

Finally, many effective AI and ML systems rely on highly complex models, especially the latest deep learning models. For example, Google’s latest Bidirectional Encoder Representations from Transformers (BERT) model for natural language processing has approximately 300 million parameters in the model (not its training data set). These types of models usually work and have extraordinary predictive powers. But when a test case does not work, the explainability of these models—or the ability to explain why something did not work—can be a challenge in itself.

But is it truly a testability challenge or something else? When AlphaGo played its “37th move” by taking a position on the fifth line during that milestone match, best human players thought AI made a learner’s blunder and failed the “test” that we thought we know the answer. Only later we knew as humans we had failed the test of understanding it. So can an AI that learns from a vast amount of data and interacts with its unique environment be truly testable in the traditional sense?

I am optimistic that we can test AI, but the meaning and approach of testing will need to be fundamentally re-considered. Rather than testing mechanical computing systems formed by 0s and 1s against our well-defined specifications, we will be testing something that can learn insights that surprise us or we may not even fully understand. Testing could become a learning experience for human and AI in a world of AI-Human co-evolution. That’s the future we are in right now.

Matthew Faries, CIO, AFG Australian Finance Group

Jana Branham, Chief Information Officer, ACH Food Companies

Denny Charlie, EVP & CIO, Soho Global Health

Martin Pickrodt, Chief Information Officer, Mesitis

David Bejar, VP Head of IT Software Engineering, Allianz Indonesia

Sachin Mulik, VP, Quality Engineering, Amdocs

Jason Williams, Head of Technology, Worldline

Troy Lau, Division Leader for Ai, Human and Data Technologies, Draper

Is Artificial Intelligence Truly Testable?

Machine Learning

Weekly Brief