Developing and operating artificial intelligence (AI) systems involves terms and techniques that may be new to some internal auditors, or whose meanings or applications differ from their normal audit usage. Each of the terms below has a long history in the development and execution of AI processes. As such, they can promote a common understanding of AI concepts that can be applied to auditing these systems.
Datasets are difficult to create because independent judges should review their features and uses, and then validate them for correctness. These judgments drive the system during the training phase of development, and if the data is not validated, the system may learn from errors.
In machine learning systems, datasets are normally "locked," meaning data is not changed to fit the algorithm. Instead, the algorithm is changed based on the system predictions derived from the data. As a safety precaution, data scientists usually are barred from examining the datasets to determine the reasons for such changes. This prevents them from biasing the algorithm given their understanding of the data relationships.
Consider a system that reviews the ZIP codes of business accounts. The system may fail to recognize ZIP codes beginning with "0," such as 01001 for Agawam, Mass., or that contain alphanumeric characters such as V6C1H2 for Vancouver, B.C. Locking the dataset prevents the data scientists from inspecting the errors directly. Instead, they would have to investigate why the system is interpreting some accounts differently than others and whether the algorithm contains a defect. Barring data scientists in this way is another form of locking the dataset.
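The ZIP code scenario can be illustrated with a minimal sketch. It assumes, purely for illustration, that the defect comes from treating postal codes as integers rather than text. The function names and the simplified patterns below are hypothetical and are not drawn from any particular system.

```python
import re

# Simplified, hypothetical patterns for illustration only.
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")
CA_POSTAL = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")


def naive_parse(code):
    """Mimics a defective step that treats postal codes as integers."""
    try:
        return str(int(code))  # the leading zero is silently dropped
    except ValueError:
        return None  # alphanumeric codes are rejected outright


def is_valid_postal_code(code):
    """Keeps the code as text so leading zeros and letters survive."""
    code = code.strip().upper()
    return bool(US_ZIP.match(code) or CA_POSTAL.match(code))


for code in ["01001", "V6C1H2", "90210"]:
    print(code, naive_parse(code), is_valid_postal_code(code))
```

With the dataset locked, the data scientist would not see the raw codes. A defect of this kind would surface only through the differing outcomes for the affected accounts.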
Because historical datasets are not always verified before AI system use, the internal auditor needs to ensure an appropriate validation process is in place to confirm data integrity. Use of automated systems to judge data integrity may mask AI issues that adversely affect the quality of the output.
Therefore, a customary practice in the industry has been to use independent, third-party judges for validation purposes. The judges, however, must have sufficient expertise in the data domain of the system to render valid test results. If they use algorithms as part of their validation process, then those, too, must be validated independently. Usually any inconsistency in the test results during judging is reviewed and reconciled as part of the process. A well-designed validation process will help avoid user acceptance of system outcomes that are inherently flawed.
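The reconciliation step described above might be sketched as follows. This is a minimal illustration that assumes each judge labels the same items in the same order. The judge names, labels, and the `review_labels` helper are hypothetical.

```python
def review_labels(labels_by_judge):
    """Split items into those all judges agree on and those needing review.

    labels_by_judge maps a judge name to that judge's list of labels,
    one label per item, in the same item order for every judge.
    """
    agreed, disputed = {}, {}
    for index, votes in enumerate(zip(*labels_by_judge.values())):
        if len(set(votes)) == 1:
            agreed[index] = votes[0]
        else:
            disputed[index] = votes  # route to human reconciliation
    return agreed, disputed


# Invented labels from three hypothetical judges for three items.
labels = {
    "judge_a": ["valid", "invalid", "valid"],
    "judge_b": ["valid", "valid", "valid"],
    "judge_c": ["valid", "invalid", "valid"],
}
agreed, disputed = review_labels(labels)
print("agreed:", agreed)
print("needs reconciliation:", disputed)
```

The sketch deliberately does not settle disagreements by majority vote. Disputed items are only flagged so that the reconciliation itself remains a documented, reviewable step.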
Overfitting and Trimming
To train the AI system, the data scientist selects datasets that are intended to reflect the actual data domain. Sometimes those datasets reflect ambiguous conditions that should be trimmed or deleted to enhance the probability of error-free results.
For example, the first name "Pat" can apply to either gender. To avoid system confusion, the data scientist would likely trim it from the training dataset. However, the first name "Tracy," although historically given to both males and females, is more commonly a female name. Trimming "Tracy" from the training datasets might bias system outcomes toward males without eliminating much ambiguity when the production data is processed.
The problem with trimming is that it can cause data overfit in an algorithm and biased system results during the production phase. Data overfit occurs when the training dataset is trimmed to derive a particular algorithm, rather than the algorithm adjusting itself to a training dataset that represents the actual data domain. The resulting algorithm is not based on a representative data domain. Internal audits should examine process controls over the training dataset to safeguard against data overfit caused by excessive data trimming designed to achieve a desired algorithmic outcome.
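One control an auditor might look for is a check on how far trimming shifts the makeup of the training dataset. The sketch below uses invented records and a hypothetical tolerance of 0.10. It illustrates the idea and is not a prescribed test.

```python
from collections import Counter

# Invented training records: (first name, gender label).
records = [
    ("Pat", "F"), ("Pat", "M"),
    ("Tracy", "F"), ("Tracy", "F"), ("Tracy", "F"), ("Tracy", "M"),
    ("Maria", "F"), ("James", "M"), ("John", "M"),
]


def label_share(rows, label):
    """Fraction of rows that carry the given label."""
    counts = Counter(gender for _, gender in rows)
    return counts[label] / len(rows)


def trim(rows, names):
    """Drop every row whose first name is in the set of names."""
    return [row for row in rows if row[0] not in names]


TOLERANCE = 0.10  # hypothetical threshold for an acceptable shift

before = label_share(records, "F")
for names in ({"Pat"}, {"Pat", "Tracy"}):
    after = label_share(trim(records, names), "F")
    flagged = abs(after - before) > TOLERANCE
    print(sorted(names), round(before, 3), round(after, 3), flagged)
```

In this invented data, trimming only "Pat" barely changes the female share, while also trimming "Tracy" lowers it from 5/9 to 1/3 and trips the flag.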
It is important for the data scientist to examine data outliers. For example, a machine learning system may be 90% accurate in correcting misspelled words, but it also may flag numbers as errors and correct them. Those corrections can cause havoc with critical documents, such as financial reports, if the data scientist failed to review system predictions for such outliers.
Performance metrics should be used to assess AI system accuracy (How close are the predictions to the true values?) and precision (How consistent are the outcomes between system iterations?). Such metrics are a best practice, because they indicate performance issues in AI system operations, including:
- False positives: identifying an unacceptable item as correct.
- False negatives: identifying an acceptable item as incorrect.
- Missed items: not addressing all items in the population.
A formal review process to cover these issues improves system performance and helps decrease audit risk.
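The three issue types listed above can be tallied with a short sketch. It follows the definitions given in the list: a false positive is an unacceptable item the system identified as correct, and a false negative is an acceptable item the system identified as incorrect. The item identifiers and the `tally` helper are hypothetical.

```python
def tally(truth, predicted):
    """Count outcome types for a reviewed population.

    truth: item id -> True if the item is actually acceptable.
    predicted: item id -> True if the system identified it as correct.
    Items in truth that are absent from predicted count as missed.
    """
    counts = {"correct": 0, "false_positives": 0,
              "false_negatives": 0, "missed": 0}
    for item, acceptable in truth.items():
        if item not in predicted:
            counts["missed"] += 1
        elif predicted[item] and not acceptable:
            counts["false_positives"] += 1
        elif not predicted[item] and acceptable:
            counts["false_negatives"] += 1
        else:
            counts["correct"] += 1
    return counts


# Invented population of five items; the system never addressed inv-5.
truth = {"inv-1": True, "inv-2": False, "inv-3": True,
         "inv-4": False, "inv-5": True}
predicted = {"inv-1": True, "inv-2": True, "inv-3": False, "inv-4": False}

counts = tally(truth, predicted)
accuracy = counts["correct"] / len(truth)
print(counts, accuracy)
```

Dividing by the full population, rather than by the number of items the system addressed, keeps missed items from inflating the accuracy figure.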
Putting accuracy and precision metrics in place is a best practice when evaluating AI systems. Although these metrics show how well a system finds issues, they do not tell the entire story. Additional measurements are needed to capture the full performance of a system: issues missed (false negatives), issues incorrectly identified (false positives), and issues that exist in the data but that the system failed to detect (missing issues).
Internal auditors must be careful to safeguard the integrity of the AI audit from user misinterpretations of system outcomes. That is because the system may generate supportable conclusions that are simply misunderstood or ignored.
For instance, if a system were to predict that jungle fires are related to climate change, that prediction would not confirm that climate change has caused the jungle fires. Earlier this year, news organizations reported that climate change caused fires in the Amazon jungle. However, NASA had asserted that the fires were the same as in previous years, with no change over time and no relation to global warming. While there might be a correlation between the two, causation should not be inferred from the system prediction.
Internal auditors need to take the human factor into account when assessing system quality. System users may simply refuse to believe or act upon system predictions because of bias, personal preference, or preconceived notions.