
The Dark Side of Artificial Intelligence and Dark Data 2020

If humans are teaching AI and algorithms how to think, what is to prevent them from seeding the systems with disinformation and dark systemic views?

Organizations store data for many reasons, most often for record keeping and regulatory compliance. But there is also a tendency to hoard data that could potentially become valuable, or harmful. In the end, most companies never use even a fraction of the data they store, often because the data has become inaccessible.

This could be because the storage repository doesn't document the metadata labels appropriately, because some of the data is in a format the integrated tools can't read, or because the data can't be retrieved through a query since it is mislabeled or incorrect.
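The three causes above can be sketched as a simple triage check. This is only an illustration under stated assumptions: the `StoredItem` record, its fields, and the `READABLE_FORMATS` set are hypothetical names invented for the example, not part of any real catalog tool.

```python
from dataclasses import dataclass, field

# Hypothetical: the formats the organization's integrated tools can parse.
READABLE_FORMATS = {"csv", "json", "parquet"}


@dataclass
class StoredItem:
    """A hypothetical catalog entry for one stored dataset or file."""
    name: str
    fmt: str
    metadata: dict = field(default_factory=dict)
    label_verified: bool = True  # False when the label is known to be wrong


def dark_reasons(item):
    """Return the reasons an item is effectively dark (empty list if usable)."""
    reasons = []
    if not item.metadata:
        reasons.append("missing metadata")
    if item.fmt.lower() not in READABLE_FORMATS:
        reasons.append("unreadable format")
    if not item.label_verified:
        reasons.append("mislabeled or incorrect")
    return reasons


items = [
    StoredItem("sales_2019", "csv", {"owner": "finance"}),
    StoredItem("call_recordings", "wav", {"owner": "support"}),
    StoredItem("legacy_export", "csv"),
    StoredItem("customer_notes", "json", {"owner": "crm"}, label_verified=False),
]

# Map each item name to the reasons it cannot be used in analysis.
report = {item.name: dark_reasons(item) for item in items}
for name, reasons in report.items():
    print(name, "->", reasons or "usable")
```

An item can be dark for more than one reason at once, which is why the function returns a list rather than a single flag.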

Data that organizations routinely store during normal operations but never put to use is called dark data. Dark data is a major limiting factor in producing good data analysis, because the quality of any analysis depends on the body of information the analytics tools can access, both promptly and in full detail. Sometimes this data is also used to assert or "prove" claims with false information, depending on the programmer's goals.

The main impact of dark data is on the quality of the data available for analysis. Dark data makes it difficult to find and access vital information, confirm its origins, and obtain it promptly enough to make good, data-driven decisions. The impact on quality stems from the following factors:

  • Data accessibility: When data is unstructured or stored in a different media format, such as images, audio, or video, the analytics tools cannot read it, and essential information that would improve the analysis is lost.

  • Data accuracy: The accuracy of a data analysis rests on the accuracy of the input data, and accurate analysis yields more valuable information. Because dark data may be mislabeled or incorrect, it has a significant impact on the accuracy and quality of the information the analysis produces.

  • Data auditability: The inability to trace the provenance of data can lead to its omission from analysis, thereby affecting the data quality. This, in turn, can lead to faulty data-driven decision making.
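The auditability point can be made concrete with a little arithmetic. The sketch below uses made-up order records (the field names and amounts are purely illustrative) and shows how dropping records whose provenance cannot be traced changes the result of even a simple average.

```python
from statistics import mean

# Hypothetical order records; "source" is None when provenance cannot be traced.
records = [
    {"amount": 100.0, "source": "web"},
    {"amount": 120.0, "source": "web"},
    {"amount": 400.0, "source": None},
    {"amount": 380.0, "source": None},
]


def auditable(rows):
    """Keep only the rows whose origin is documented."""
    return [r for r in rows if r["source"] is not None]


# Average over everything that was stored.
full_mean = mean(r["amount"] for r in records)

# Average over the records that survive an audit of their provenance.
audited_mean = mean(r["amount"] for r in auditable(records))

# Share of the stored records that actually reach the analysis.
coverage = len(auditable(records)) / len(records)

print(full_mean, audited_mean, coverage)  # 250.0 110.0 0.5
```

In this toy case half of the records are omitted and the average falls from 250.0 to 110.0, so a decision based on the audited subset alone would rest on a very different picture.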

Many organizations rely on people to manually extract and annotate the data and enter it into a relational database. The problem is that algorithmic systems only understand what they are taught.

The advent of deep learning has made it possible to create a new breed of intelligent data extraction and mining tools that can extract structured data from dark data much faster and with greater accuracy than human beings can. Still, the information being fed to these systems may itself be biased.
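A toy sketch shows why biased input matters. It is not a deep learning model, just a majority-vote lookup standing in for any system that learns from human annotations, and the annotation data is invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: (document type, label assigned by a human annotator).
annotations = [
    ("complaint", "low priority"),
    ("complaint", "low priority"),
    ("complaint", "high priority"),
    ("invoice", "high priority"),
]


def train(pairs):
    """Learn the most common label seen for each document type."""
    counts = defaultdict(Counter)
    for feature, label in pairs:
        counts[feature][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


model = train(annotations)
print(model["complaint"])  # low priority
```

The system has no view of its own: if the annotators habitually mark complaints as low priority, that is exactly what it will repeat.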