AI - Lack of data characterization significantly reduces accuracy of AI results

M. Gilger
Modus Operandi,
United States

Keywords: AI, object-based production, knowledge base, data characterization, provenance

Summary:

It is understood that AI algorithms depend on accurate, well-tagged data to perform well. It is also understood that much of the time spent on an AI project goes to extract, transform, and load (ETL), and more specifically to cleaning, normalizing, and appropriately tagging the data. For the intelligence community the problem is much worse: security and data characterization become significant factors in failed AI efforts. A document that contains 80% unclassified information and 20% secret information will not be processed at all by an AI system operating at the unclassified level. This is a consequence of our document-centric world of intelligence "reports": the document carries an overall classification, and that is what pre-processors use to decide whether the document is processed (setting aside cross-domain networks for the moment). Significant information that could feed further knowledge processing by the AI should therefore be available, but is not; the first sketch below makes this failure mode concrete.

Further, an analyst must characterize the sources behind their analysis (trust/accuracy) and also characterize, or assess, their own interpretation of that information (quality of the analysis). This information does not readily translate into a form the AI can use. The AI treats the document, and every paragraph within it, as having the same trustworthiness and the same quality of analysis. If one paragraph rests on a very trustworthy source and excellent analysis, while the next rests on a trusted but not very accurate source and questionable analysis (missing information needed for a more accurate assessment, for example), the AI still processes the two paragraphs at the same characterization levels, even though a human decision maker would adjust their decision based on how trustworthy the source is and how the analyst has rated their own analysis; the second sketch below illustrates the weighting this throws away. This disparity is rarely discussed in AI proceedings, and it is not just a problem for AI: decision makers, too, receive documents in which the characterization of the information is absent from the reports.

With an Object-based Production approach to intelligence, information is tracked in a more granular, human- and machine-readable form, and "reports" are produced on the fly from models over an always-up-to-date object (knowledge) store. All provenance information, including data characterization, is preserved, so AI tools have the tags they need to weight the information appropriately, much as a human decision maker would. This is performed at intelligence creation, at the source, preventing the need for downstream manual tagging of the data later, when significant context and understanding of the information have been lost. The third sketch below shows what such an object record might look like.
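The first sketch illustrates the classification-filtering failure mode described above. It is a minimal Python illustration, not a real pipeline: the names (Paragraph, Document, the filter functions) and the single-letter markings are assumptions made for the example.

```python
# Hypothetical sketch: why document-level classification markings starve an
# unclassified AI pipeline, and what portion-level markings would recover.
from dataclasses import dataclass

@dataclass
class Paragraph:
    text: str
    classification: str  # portion marking, e.g. "U" or "S" (illustrative)

@dataclass
class Document:
    overall_classification: str  # rolled up to the highest marking present
    paragraphs: list

def document_level_filter(docs, clearance="U"):
    """Today's behavior: the pre-processor keys on the overall marking,
    so one secret paragraph removes the entire document."""
    return [d for d in docs if d.overall_classification == clearance]

def portion_level_filter(docs, clearance="U"):
    """With portion markings preserved, the releasable 80% survives."""
    return [p for d in docs for p in d.paragraphs
            if p.classification == clearance]

doc = Document(
    overall_classification="S",  # driven by the 20% secret content
    paragraphs=[Paragraph("unclassified fact", "U")] * 8
               + [Paragraph("secret fact", "S")] * 2,
)

print(len(document_level_filter([doc])))  # 0 documents reach the AI
print(len(portion_level_filter([doc])))   # 8 paragraphs reach the AI
```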
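The second sketch shows one way characterization tags could be turned into per-paragraph weights instead of treating every paragraph identically. The A-F source-reliability and 1-6 credibility scales stand in for whatever tags an organization actually uses, and the numeric weight mapping is an assumption for illustration, not a recommended calibration.

```python
# Illustrative weighting, assuming A-F reliability and 1-6 credibility tags.
# "F" and 6 mean "cannot be judged," so they map to a neutral weight here.
RELIABILITY = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.4, "E": 0.2, "F": 0.5}
CREDIBILITY = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.4, 5: 0.2, 6: 0.5}

def paragraph_weight(source_reliability: str, analysis_credibility: int) -> float:
    """Combine source trust and the analyst's self-assessment into one weight."""
    return RELIABILITY[source_reliability] * CREDIBILITY[analysis_credibility]

# Two adjacent paragraphs from the same report:
trusted = paragraph_weight("A", 1)  # very trustworthy source, excellent analysis
shaky   = paragraph_weight("B", 4)  # trusted source, questionable analysis

print(trusted, shaky)  # 1.0 vs ~0.32; an untagged AI would score both 1.0
```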
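The third sketch suggests what an object-based production record might look like, assuming a simple list as the object store; every field name here is hypothetical. The point is that characterization and provenance attach to each assertion when it is created, so no downstream, context-free tagging pass is needed, and a "report" becomes just a filtered view over the store.

```python
# Minimal sketch of a knowledge object carrying provenance from creation.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class KnowledgeObject:
    assertion: str               # one machine-readable statement, not a report
    classification: str          # portion-level marking, e.g. "U"
    source_id: str               # link back to the originating source
    source_reliability: str      # analyst's trust in the source
    analysis_confidence: int     # analyst's self-assessment of the analysis
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

store = [
    KnowledgeObject("Bridge at grid 123456 is passable", "U", "HUMINT-042", "B", 2),
    KnowledgeObject("Unit X relocated overnight", "S", "SIGINT-007", "A", 1),
]

# A "report" generated on the fly for an unclassified consumer:
unclassified_view = [o.assertion for o in store if o.classification == "U"]
print(unclassified_view)
```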