Big Data Science Overview
In May 2011, the consulting firm McKinsey & Company issued a report predicting that organizations would be deluged with data in the years to come. It also predicted that a range of major industries – including health care, the public sector, retail, and manufacturing – would benefit greatly from analyzing this rapidly growing pile of information, generally referred to as “Big Data.”
“Collecting and analyzing transactional data will give organizations more insight into their customers’ preferences,” writes Joab Jackson of IDG News. “It can be used to better inform the creation of products and services, and allows organizations to remedy emerging problems more quickly.”
“The use of big data will become a key basis of competition and growth for individual firms,” the McKinsey report concludes. “The use of big data will underpin new waves of productivity growth and consumer surplus.”
The field of “Data Science” lies at the core of this revolution.
Wikibon and IDC both project a Big Data market in 2015 of between $16.9 billion (IDC) and $32.1 billion (Wikibon). Recently, none other than Tim O’Reilly declared that “data science is the new black.”
Enter the “data scientists” – an elite and specialized class of highly-compensated data cleaning, analysis, and visualization experts. “Data scientists will be a special breed,” says one industry pundit, “the only people with the experience and expertise to wrestle with the messy explosion of both digital (and dirty) data and big data tools. Data Science will become a specialized, in-house function, similar to today’s Accounting, Legal, and IT departments. Leading universities will establish stand-alone Data Science departments, conferring data science degrees, Bachelor’s to Ph.D…. Data scientists will be either academics, independent consultants, or members of the corporate data science function, where they will rise to the title of CDO (understood in leading organizations as a Chief Decision Officer and by laggards as Chief Data Officer).”
As suggested above, Data Science is a multidisciplinary endeavor combining a range of skills from a variety of fields with a strong talent for creative, intuitive thinking: the envisioning of new and useful data sets.
Mason divides data science into two equally important functions. One half is analytics, or “counting things.” The other half is the invention of new techniques that can draw insights from data that were not possible before. “Data Science is the combination of analytics and the development of new algorithms. You may have to invent something, but it’s okay if you can answer a question just by counting. The key is making the effort to ask the questions.”
“[Data Science] absolutely gives us a competitive advantage if we can better understand what people care about and better use the data we have to create more relevant experiences,” says Aaron Batalion, chief technology officer for online shopping service LivingSocial, which uses technologies such as the Apache Hadoop data processing platform to mine insights about customer preferences.
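At its core, the Hadoop-style approach Batalion alludes to rests on the MapReduce programming model: map each record to key/value pairs, then reduce by key. The following is a minimal illustrative sketch of that model in plain Python, using hypothetical purchase records – not LivingSocial’s actual pipeline, which is of course far more elaborate and distributed:

```python
from collections import defaultdict

# Hypothetical purchase records: (customer, deal_category)
purchases = [
    ("c1", "dining"), ("c2", "travel"), ("c1", "dining"),
    ("c3", "spa"), ("c2", "dining"),
]

# Map step: emit a (category, 1) pair for every purchase.
mapped = [(category, 1) for _, category in purchases]

# Shuffle/reduce step: group by key and sum the values.
counts = defaultdict(int)
for category, n in mapped:
    counts[category] += n

print(dict(counts))  # {'dining': 3, 'travel': 1, 'spa': 1}
```

The point of the model is that the map and reduce steps are independent per record and per key, which is what lets Hadoop spread the same logic across thousands of machines.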
Not only LivingSocial, but also such organizations as Google, Amazon, Yahoo, Facebook and Twitter have been on the cutting edge of leveraging the new discipline of Data Science to make the most of their growing piles of user information.
But Data Science remains a science for which tools are still being invented – a science in a state of flux. “There are vexing problems slowing the growth and the practical implementation of big data technologies,” writes Mike Driscoll. “For the technologies to succeed at scale, there are several fundamental capabilities they should contain, including stream processing, parallelization, indexing, data evaluation environments and visualization.” And all this evolution is ongoing.
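To give one of Driscoll’s capabilities some concrete shape: “stream processing” means computing results incrementally as records arrive, rather than re-scanning a stored dataset. A toy illustration of the idea – a running count, sum, and mean maintained over an event stream, not any particular product’s API:

```python
from collections import Counter

def process_stream(events):
    """Maintain running aggregates over a stream of (user, amount)
    events without ever holding the full history in memory."""
    totals = Counter()      # running spend per user
    count = 0
    running_sum = 0.0
    for user, amount in events:
        totals[user] += amount
        count += 1
        running_sum += amount
    mean = running_sum / count if count else 0.0
    return totals, mean

# Simulated stream of purchase events
stream = [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)]
totals, mean = process_stream(stream)
print(totals["alice"], round(mean, 2))  # 17.5 7.5
```

A real stream processor adds windowing, fault tolerance, and distribution, but the essential contrast with batch processing – one pass, constant state – is already visible here.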
In the following posts, we shall explore the range of special skills/disciplines involved in the practice of Data Science. We shall also explore such key software tools as Hadoop and Cassandra, and take a glimpse at some of the most innovative and successful applications of Data Science to date. But first, we will define the essentials of that vital raw material upon which Data Science feeds: Big Data.
“You can’t have a conversation in today’s business technology world without touching on the topic of Big Data,” says NetworkWorld’s Michael Friedenberg. “Simply put, it’s about data sets so large – in volume, velocity and variety – that they’re impossible to manage with conventional database tools. In 2011, our global output of data was estimated at 1.8 zettabytes (each Zettabyte equals 1 billion terabytes). Even more staggering is the widely quoted estimate that 90 percent of the data in the world were created within the past two years.”
Friedenberg continues: “Behind this explosive growth in data, of course, is the world of unstructured data. At [the 2011] HP Discover Conference, Mike Lynch, executive vice president of information management and CEO of Autonomy, talked about the huge spike in the generation of unstructured data. He said the IT world is moving away from structured, machine-friendly information (managed in rows and columns) and toward more human-friendly, unstructured data that originate from sources as varied as e-mail and social media and that includes not just words and numbers, but also video, audio and images.”
“Big Data means extremely scalable analytics,” Forrester Research analyst James Kobielus told Information Age in October of 2011. “It means analyzing petabytes of structured and unstructured data at high velocity. That’s what everybody’s talking about.”
As a catch-all term, “Big Data” is pretty nebulous. As ZDNet’s Dan Kusnetzky notes: “If one sits through the presentations from ten suppliers of technology, fifteen or so different definitions are likely to come forward. Each definition, of course, tends to support the need for that supplier’s products and services. Imagine that.”
Industry politics aside, here’s an unbiased definition of this complex field:
Every day, we create 2.5 quintillion bytes of data. This data comes from everywhere: from sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and cell phone GPS signals – to name a few. In the 11 years between 2009 and 2020, the size of the “Digital Universe” will increase 44-fold. That’s a 41% increase in capacity every year. In addition, only about 5% of the data being created is structured; the remaining 95% is largely unstructured, or at best semi-structured. This is Big Data.
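The 44-fold and 41% figures are consistent with each other: a quantity growing 41% per year for 11 years multiplies by roughly 1.41 to the 11th power, which is about 44. A quick arithmetic check:

```python
# Verify that 41% annual growth over 11 years (2009-2020)
# yields roughly a 44-fold increase in total data volume.
annual_growth = 1.41   # 41% growth per year
years = 11

fold_increase = annual_growth ** years
print(round(fold_increase, 1))  # ~44
```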
Per a recent analysis from IBM, Big Data comprises three dimensions: Variety, Velocity and Volume.
re: Variety – Big Data extends well beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.
re: Velocity – Frequently time-sensitive, Big Data must be analyzed as it streams into the enterprise in order to maximize its value.
re: Volume – Big Data comes in one size: enormous. By definition, enterprises are awash with it, easily amassing terabytes and even petabytes of information. This volume presents the most immediate hurdle for conventional IT structures. It calls for scalable storage and a distributed approach to querying. Many companies currently hold large amounts of archived data but lack the tools to process it.
(Note: To these three, IBM’s Michael Schroeck adds a fourth, Veracity: a firm’s imperative to screen out spam and other data that is not useful for making business decisions.)
Per Edd Dumbill (and no, that is not a typo), program chair for the O’Reilly Strata Conference and the O’Reilly Open Source Convention, Big Data “is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
Dumbill continues: “The hot IT buzzword … Big Data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. For leading corporations, such as Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today’s commodity hardware, cloud architectures and open source software bring Big Data processing into the reach of the less well-resourced. Big Data processing is eminently feasible for even the small garage startups, who can cheaply rent server time in the cloud.”
Dumbill further explains that the value of Big Data falls into two categories: analytical use, and enabling new products. “Big Data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports.”
Overall, Big Data is – in its raw form – utter chaos. Approximately 80% of the effort involved in dealing with this largely unstructured data is simply cleaning it up. Per Pete Warden in his Big Data Glossary: “I probably spend more time turning messy source data into something usable than I do on the rest of the data analysis process combined.”
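Warden’s point is easy to demonstrate: even a trivial record set needs normalization before it can be counted. A minimal cleaning pass over some hypothetical raw rows (the field names, date formats, and rules here are invented for illustration):

```python
import re
from datetime import datetime

# Messy source rows: inconsistent casing, whitespace,
# date formats, currency symbols, and one malformed record.
raw_rows = [
    "  Alice , 2011-05-12 , 19.99 ",
    "BOB,12/05/2011,5",
    ",,",                               # malformed: skip it
    "carol , 2011-05-13 , $7.50",
]

def clean_row(row):
    """Return (name, date, amount) or None if the row is unusable."""
    parts = [p.strip() for p in row.split(",")]
    if len(parts) != 3 or not all(parts):
        return None
    name = parts[0].title()
    # Accept either ISO or day/month/year dates.
    date = None
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            date = datetime.strptime(parts[1], fmt).date()
            break
        except ValueError:
            continue
    if date is None:
        return None
    # Strip currency symbols and stray characters from the amount.
    amount = float(re.sub(r"[^\d.]", "", parts[2]))
    return name, date, amount

cleaned = [r for r in (clean_row(row) for row in raw_rows) if r]
print(len(cleaned))  # 3
```

Every real cleaning job has its own quirks; the constant is that this unglamorous step dominates the work, exactly as Warden says.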
The Data Scientist takes the “chaos” of “messy source data” and finds within this morass the pure gold of actionable market information.