The database world is rushing headlong into the business of utilizing semi-structured and unstructured data. This is happening in the Business Intelligence (BI) domain, product warranty support, and various regulatory compliance efforts. These are often the source of the highest growth in databanks and data sources.
What follows is the spectrum of data characteristics as one moves along the continuum from unstructured through semi-structured to structured data. Note that unstructured and semi-structured data can be found mixed into structured contexts. For example, CLOBs (Character Large Objects) and other large object columns can contain raw text or other data awaiting inspection, interpretation or other processing.
Unstructured data characteristics
– has not been converted to data types: dates, characters, numeric, etc;
– is often in stream form: one datatype, minimal metadata on origin, author, timing, etc.;
– may not all be stored electronically. Many archives point to data stored on paper or micrographics;
– if stored electronically, it may be on a bulk transfer medium such as tape or other sequential stores not conducive to quick searching, sorting or categorizing;
– may be raw input streams not yet filtered for noise, breakdowns, spurious or incomplete readings or data. Also, the data stream may be awaiting sampling, aggregation, averaging, summarizing or other editing and cleaning (translation, spell check, semantic gloss, etc.);
– often contains minimal formatting for presentation or maximal formatting for use in one presentation medium (video, audio, etc.);
– likely neither encrypted nor compressed for security or storage efficiency;
– has few to no metadata tags: authors, owners, date of creation, update history, info categories, etc;
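As an illustration of the first and fifth points above, here is a minimal Python sketch of turning a raw, untyped stream into typed records while filtering out noise and incomplete readings. The `date,sensor,value` line layout, the `Reading` type and the sample data are all assumptions made up for this example, not any particular system's format.

```python
from datetime import date, datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class Reading(NamedTuple):
    """A typed record (hypothetical layout chosen for this sketch)."""
    taken_on: date
    sensor: str
    value: float


def parse_line(line: str) -> Optional[Reading]:
    """Convert one raw 'date,sensor,value' line, or return None for noise."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None  # incomplete reading or line noise
    raw_date, sensor, raw_value = parts
    try:
        taken_on = datetime.strptime(raw_date, "%Y-%m-%d").date()
        value = float(raw_value)
    except ValueError:
        return None  # a field that cannot be converted to its data type
    if not sensor:
        return None  # reading with no origin recorded
    return Reading(taken_on, sensor, value)


def clean(stream: Iterable[str]) -> Iterator[Reading]:
    """Yield only the lines that survive type conversion."""
    for line in stream:
        reading = parse_line(line)
        if reading is not None:
            yield reading


# Made-up sample stream: two usable readings and three kinds of junk.
raw_stream = [
    "2007-03-01, boiler-1, 71.5",
    "### line noise ###",
    "2007-03-01, boiler-2",
    "2007-03-02, boiler-1, 73.0",
    "not-a-date, boiler-1, 70.0",
]

readings = list(clean(raw_stream))
average = sum(r.value for r in readings) / len(readings)
```

Only after a pass like this can the data be sorted by date, averaged or aggregated; before it, every field is just text.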
In contrast, semi-structured data might be thought of as unstructured data on its way to being transformed into structured data. But in fact, given cost realities, large quantities of data remain in the semi-structured state, served by simple CRUD systems for updates and maintenance along with backup, recovery and replication facilities. Typically, statistical, search, and basic query systems are attached to semi-structured data, or it is fed into data warehousing systems on an ad hoc basis, where it is used in more structured queries and analysis.
Semi-structured data characteristics
– has some metadata links to underlying data;
– has some local structuring of data into a model, that is, establishing some relationships with other data in a system;
– often lacks complete analysis of data for intra-record integrity and cross-table reconciliation;
– may have had some data integrity scans for legitimacy/validation of parts of the data, but not comprehensive ones;
– still has few or no ongoing data validations, constraints, or triggers maintaining data integrity;
– may be preformatted for presentation, which can introduce undesirable artifacts;
– may be indexed for search and sorting capabilities;
– still likely neither encrypted nor compressed for security or storage efficiency;
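To make the "partial integrity scan" idea concrete, here is a rough Python sketch of a one-off audit over semi-structured records. The field names (`author`, `created`, `updated`), the two rules and the sample records are assumptions invented for illustration; a real audit would have its own rules.

```python
from typing import Any, Dict, List

# Hypothetical metadata tags this audit expects on every record.
REQUIRED_TAGS = ("author", "created")


def audit(record: Dict[str, Any]) -> List[str]:
    """Return the integrity problems found in one record (empty list if none)."""
    problems = []
    for tag in REQUIRED_TAGS:
        if not record.get(tag):
            problems.append(f"missing {tag}")
    created, updated = record.get("created"), record.get("updated")
    # ISO-8601 date strings sort chronologically, so plain comparison works.
    if created and updated and updated < created:
        problems.append("updated before created")
    return problems


# Made-up records: some metadata present, nothing enforcing it.
records = [
    {"id": 1, "author": "jdoe", "created": "2007-01-10",
     "updated": "2007-02-01", "body": "claim text"},
    {"id": 2, "created": "2007-01-12", "body": "no author recorded"},
    {"id": 3, "author": "asmith", "created": "2007-03-05",
     "updated": "2007-01-01"},
]

report = {r["id"]: audit(r) for r in records}
```

Note that this is a scan run after the fact, not a constraint or trigger: nothing stops the next update from reintroducing the same problems, which is exactly the gap between semi-structured and structured data.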
Finally, structured data is taking broader forms. For a long time the theory has been that relational databases are the optimal final state for structured data. However, the impedance mismatch between object and relational data has given new life to structured data in object-oriented products like Hibernate and Objectivity. Likewise, IBM, the originator and keeper of the relational database flame with DB2, has acknowledged equal footing for XML data stored essentially in a hierarchical model (with querying done in XQuery). In fact, IBM has done one of the best jobs of melding relational SQL and XML/XQuery. DB2 9 can run XQuery subqueries inside a SQL-controlled query, or vice versa (SQL subqueries in an XQuery environment).
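The flavor of mixing a relational query with a hierarchical one can be approximated in a few lines of Python. To be clear about the assumptions: this is not DB2 and not XQuery. It uses the standard library's `sqlite3` for the relational part and the limited XPath subset in `xml.etree.ElementTree` for the hierarchical part, and the `claims` table and its XML documents are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational side: a table whose 'doc' column holds an XML document as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (id INTEGER PRIMARY KEY, region TEXT, doc TEXT)"
)
conn.executemany(
    "INSERT INTO claims (id, region, doc) VALUES (?, ?, ?)",
    [
        (1, "east", "<claim><product>pump</product>"
                    "<part cost='40'/><part cost='15'/></claim>"),
        (2, "west", "<claim><product>valve</product>"
                    "<part cost='9'/></claim>"),
        (3, "east", "<claim><product>valve</product>"
                    "<part cost='22'/></claim>"),
    ],
)


def part_costs(doc: str) -> list:
    """Hierarchical side: walk the XML tree and collect each part's cost."""
    root = ET.fromstring(doc)
    return [int(part.get("cost")) for part in root.findall("./part")]


# SQL picks the rows; the path expression digs into each row's document.
totals = {}
rows = conn.execute(
    "SELECT id, doc FROM claims WHERE region = ? ORDER BY id", ("east",)
)
for claim_id, doc in rows:
    totals[claim_id] = sum(part_costs(doc))
```

Here the two query styles are stitched together in application code. The point of the DB2 9 approach described above is that the database engine does this stitching itself, inside a single query.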
In short, watch for other major database vendors to take an object or XML plunge in their products as the base of structured data expands. In fact, this is the very nature of Data Workflow Integration (similar in magnitude to GUI Integration), where the world of long transactions (minutes to days) gets integrated with the data spectrum that runs from unstructured through semi-structured to structured data. Who from Harvard Business School was telling us that IT was becoming totally routine, commoditized, and no longer the source of innovation, while fundamental changes are still occurring across the whole IT framework?