Data Modeling for Unstructured Data
Data modeling is the analysis of data objects used in a business or other context and the identification of the relationships among these data objects. Scott Ambler, Chief Methodologist for Agile and Lean within IBM Rational, offers another definition: “Data modeling is the act of exploring data-oriented structures. Like other modeling artifacts data models can be used for a variety of purposes, from high-level conceptual models to physical data models.”
The sheer quantity and complexity of unstructured data open up many new opportunities for the analyst and modeler. Imagine requirements such as: Show me consumer feedback on my product from all website discussion groups for the last six months; Show me all photographs taken of the fountains in Rome from the summers of 2002 through 2007; Show me all contracts that contain a particular liability clause.
“So – what is a data model?” asks David Dichmann, Product Line Director for Design Tools at Sybase. “It is first and foremost, a way to capture business language and information relationships that provide context to make it useful in decision making activities. It is then specialized into representations of storage paradigms, and ultimately, when appropriate, into detailed designs of physical systems where structures will be implemented to manage, store, move, transform and analyze data points. Today’s data models are way beyond traditional logical/physical representations of database systems implementation. Today’s data models are architectural drawings of the meaning and intent of the information – simple, beautiful creations that drive the logic of applications, systems and technology and physical implementations of business information infrastructure.”
Dichmann posits that the data model, if viewed as “an abstraction of the physical representation of the database structures,” clearly declines in value in the face of the schema-less [data] or the constantly changing schemas. “But, if it is the abstraction of the conceptual representation of the information, we see a rise in importance. The language of the business, and the context of data points, provide meaning to the analysis that we want to gain from these non-traditional systems. [We are on a journey] from points of data (records collected by recording all our ‘transactions’) to meaningful information (the collation, aggregation and analysis of points of data by applying context to data). With Big Data, we do not even consider the data points themselves, but rather jump right to some trend analysis (aggregation of sorts). Interpretation comes from comparisons to a series of basis points to be used in decision making, taking data all the way to wisdom. The basis points themselves are context and can be modeled.”
Some posit that, with regard to Big Data, data modeling is a major obstacle to agile business intelligence (BI). Leading software analyst Barney Finucane offers this answer:
“The need for data modeling depends upon the application. [Software] products that promise user friendly analysis without any data analysis are usually intended for a specific type of analysis that does not require any previously specified structure.” A good example of data that does not require modeling is what retailers gather about their customers. “This data comes in big flat tables with many columns, and the whole point of the analysis is to find unexpected patterns in this unstructured data. In this case adding a model is adding assumptions that may actually hinder the analysis process.”
However, “some types of analyses only make sense with at least some modeling. Time intelligence is an example of a type of analysis that is supported by a data model. Also analyzing predefined internal structures such as cost accounts or complex sales channels is usually more convenient based on predefined structures. The alternative method of discovering the structures in the raw data may not be possible.”
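Finucane's time intelligence example can be made concrete with a minimal sketch. The sales records and the calendar_quarter rule below are hypothetical, invented purely for illustration; the point is only that the raw records carry no notion of a "quarter" until a predefined structure supplies one.

```python
from collections import defaultdict
from datetime import date

# Hypothetical flat records (sale date, amount), invented for illustration.
SALES = [
    (date(2011, 1, 15), 100.0),
    (date(2011, 3, 31), 50.0),
    (date(2011, 4, 1), 75.0),
    (date(2011, 12, 24), 200.0),
]


def calendar_quarter(d):
    """The 'model': a predefined rule assigning each date to (year, quarter)."""
    return (d.year, (d.month - 1) // 3 + 1)


def totals_by_quarter(sales):
    """Aggregate flat records using the predefined calendar structure."""
    totals = defaultdict(float)
    for sold_on, amount in sales:
        totals[calendar_quarter(sold_on)] += amount
    return dict(totals)


QUARTERLY = totals_by_quarter(SALES)
for period in sorted(QUARTERLY):
    print(period, QUARTERLY[period])
```

A business with a fiscal year that does not start in January would need a different rule, which is exactly the kind of knowledge that cannot be discovered from the raw dates alone.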
Finucane adds: “Planning is a common area of agile BI, and planning is rarely possible without predefined structures. It is no coincidence that the tools that promise analysis without data modeling do not offer planning features. Planning requires adding new data to an existing data set. In some cases, this includes adding new master data, for example when new products are being planned. Furthermore, there is often a good deal of custom business logic in a planning application that cannot be defined automatically. Most financial planning processes, and the analysis and simulation that goes along with them cannot be carried out on a simple table. In my view the new generation columnar databases are a welcome addition to agile BI. But I also think that their marketing is sometimes a little over the top when it comes to dismissing existing BI solution in this area.”
Forrester Research analyst James Kobielus goes a step further: “Big data rely on solid data modeling. Statistical predictive models and test analytic models will be the core applications you will need to do big data.”
But Brett Sheppard, executive director at Zettaforce and a former senior analyst at Gartner, disagrees. “Letting data speak for itself through analysis of entire data sets is eclipsing modeling from subsets. In the past, all too often what were once disregarded as ‘outliers’ on the far edges of a data model turned out to be the telltale signs of a micro-trend that became a major event. To enable this advanced analytics and integrate in real-time with operational processes, companies and public sector organizations are evolving their enterprise architectures to incorporate new tools and approaches.”
Ultimately, the important thing to remember about data modeling for Big Data is that any given model is just a simplified representation of reality and can take many forms.
One of the best tools for modeling unstructured data is Apache Cassandra, which is discussed at length in a subsequent chapter. The most important aspect of Cassandra and similar tools is their flexibility: they allow data models for unstructured Big Data to scale cost-effectively, especially when applying multidimensional data models, vertical industry data models and customizable analytics algorithms.
In the final analysis, a data model for Big Data is useless without the human element: the skilled eye of the data scientist, discerning subtleties (‘outliers’) in the data. “Data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise,” writes Ben Gimpert of Altos Research. “All Big Data models are wrong, but some are useful, to paraphrase the statistician George Box. A data scientist working in isolation could train a predictive model with the perfect in-sample accuracy, but only an understanding of how the business will use the model lets [him or her] balance the crucial bias/variance trade-off. Put more simply, applied business knowledge is how we can assume a model trained on historical data will do decently with situations we have never seen.”
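The gap between in-sample accuracy and performance on unseen situations can be sketched with a toy comparison. Everything here is an assumption made for illustration: the data points are invented (a true relationship of y = 2x with noise of plus or minus 1 on the training points), and the "memorizer" is simply a stand-in for an overfit model, not a method described by Gimpert.

```python
# Hypothetical data: true relationship y = 2x, with +/-1 noise on training points.
TRAIN = [(0, 1), (1, 1), (2, 5), (3, 5), (4, 9), (5, 9)]
# Unseen, noise-free points that fall between the training inputs.
TEST = [(0.5, 1), (1.5, 3), (2.5, 5), (3.5, 7), (4.5, 9)]


def fit_memorizer(train):
    """High-variance model: answer with the target of the nearest training input."""
    def predict(x):
        nearest = min(train, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict


def fit_line(train):
    """Higher-bias model: ordinary least-squares straight line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(predict, data):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


memorizer = fit_memorizer(TRAIN)
line = fit_line(TRAIN)

ERRORS = {
    "memorizer_in_sample": mse(memorizer, TRAIN),
    "memorizer_unseen": mse(memorizer, TEST),
    "line_in_sample": mse(line, TRAIN),
    "line_unseen": mse(line, TEST),
}
for name, value in ERRORS.items():
    print(f"{name}: {value:.3f}")
```

On this toy data the memorizer reproduces its training set perfectly yet does worse than the straight line on the unseen points. Knowing which kind of error matters for the decision at hand is the domain judgment the quotation describes.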