The IT industry is making a transition from unstructured data to structured information. IT organizations are working on bringing more forms of data into the information fold while integrating information better through portals and content management systems. Zach Wahl from the Project Performance Corporation delineated the nature of the problem at a recent DCI Conference on Portals and Content Management:
* 80 percent of business is conducted on unstructured information (Gartner Group).
* 85 percent of all data stored is held in an unstructured format (Butler Group).
* Unstructured data doubles every three months (Gartner Group).
* 7 million web pages are added every day (Gartner Group).
The major problems with unstructured data are threefold: 1) the data in text memos, speech, audio, video, and images are much more difficult to analyze than structured data held primarily as numerics and short strings in databases: the algorithms are more complex, the results are softer and more ambiguous or variable than numerical results, and the compute times are generally longer; 2) the data present the same cleansing problems as numeric and key string fields, if not more difficult ones, and there is simply far more data to clean, filter, and normalize; 3) the methods for analyzing speech, audio, video, and even text blocks are still emerging, and often there is no consensus on the best methods to use or on how to interpret the results. So BI firms have tended to stick to structured, numeric data. But that is changing.
As organizations and the industry mature, their ability to utilize their data resources is also maturing and improving. Isolated islands of information and silo applications are becoming the exception and not the rule. This transition mirrors what is happening with data in organizations. Unstructured data are being used and transformed into structured information. The softer unstructured data are being made actionable because they contain “voice” and real insights into the nature of, say, customer service or warranty problems, or real survey results on product acceptance. The problem is to unlock these resources.
However, that is possible because, like objects, unstructured data have methods such as validation/cleansing, searching/sorting, formatting/presenting, and analysis/predicting that can lead to consistent and provable insights. But equally important, and just as hard to do, unstructured data have to be linked to one another and to structured data in models and relationships. The world of databases, data marts, and semi-structured data stores is only as effective as the models and measures that use those stores. Data becomes information when it is used to make statements such as “this is the state of the art”, “these are current directions”, “this is how we measure up against competition”, “this is what happened and these are our options”, “the likely trends in the market are these”, and so forth. So there are two major trends occurring in organizational data. First, data is being integrated across the organization in ways that defy political boundaries. Second, the core of data being used is going well beyond the tried and true financials and other structured, numerical sources. So welcome to the two-pronged era of data integration into information.
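The object analogy above can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the `CustomerComment` class, its method names, the `customers` table, and the sample text are all hypothetical, and the keyword count is a crude stand-in for real text analytics. It shows one piece of unstructured text carrying its own cleansing, searching, presenting, and analysis methods, and then being linked to a structured record through a shared key.

```python
import re
from dataclasses import dataclass


@dataclass
class CustomerComment:
    """Hypothetical wrapper around one piece of unstructured text."""
    customer_id: int   # assumed key linking the text to a structured record
    raw_text: str

    def cleanse(self) -> str:
        """Validation/cleansing: collapse whitespace and lowercase the text."""
        return re.sub(r"\s+", " ", self.raw_text).strip().lower()

    def search(self, term: str) -> bool:
        """Searching: report whether the cleansed text mentions the term."""
        return term.lower() in self.cleanse()

    def present(self, width: int = 40) -> str:
        """Formatting/presenting: return a preview of at most `width` characters."""
        text = self.cleanse()
        return text if len(text) <= width else text[: width - 3] + "..."

    def analyze(self, keywords) -> dict:
        """Analysis: count how often each keyword occurs as a whole word."""
        words = re.findall(r"[a-z']+", self.cleanse())
        return {k: words.count(k) for k in keywords}


# Structured data: a hypothetical customer table keyed by customer_id.
customers = {101: {"name": "Acme Corp", "region": "East"}}

comment = CustomerComment(
    101, "  The warranty claim took   WEEKS. Warranty service was slow. "
)

# Linking: merge the structured record with measures derived from the text.
record = {**customers[comment.customer_id], **comment.analyze(["warranty", "slow"])}
print(record)
```

The merged `record` is the point of the exercise: once the text has been reduced to measures and joined to a structured row, it can feed the same models and reports as any other column.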