What is SAP HANA’s Big Data Strategy?
What is big data?
Big data is more than just data volume or size. It is about generating valuable real-time insights from raw data, regardless of its size, its type, or the rate at which it is generated.
In other words, big data is the ocean of information we swim in every day: vast zettabytes of data flowing from our computers, mobile devices, and machine sensors. With the right solutions, organizations can dive into all that data and gain valuable insights that were previously unimaginable.
Leveraging big data, organizations have successfully uncovered new insights from all of their data, creating opportunities to transform businesses, industries, and even the quality of our lives.
Big data is not an intrinsic good in and of itself. It is useful as a stepping stone to business value. Many organizations see the opportunities for big data and understand some of the use cases. The problem is that there are challenges in getting to that business value. In attempting to integrate big data with their existing data sources, organizations face questions and concerns such as:
- Lack of skills: Where can I find the resources to make this project a reality?
- Slow deployment: How do I speed up the implementation time, reducing the effort to implement a solution or application?
- Complex IT environments: How do I rationalize new big data technologies in an already complex IT environment?
- Integrating many data sources: What is the relationship between all of my data sources and how do I normalize that relationship?
To work with big data you need to be able to acquire it, analyze it, and act on insights derived from it.
The SAP HANA platform provides these capabilities. It delivers in-memory processing of data with tiered, petabyte-scale storage and integration with SAP IQ and Hadoop.
This section describes how three groups can benefit from the big data capabilities inherent in SAP HANA:
- BI analysts: These analysts are used to working with traditional data sources, such as data warehouses and systems of record, and to helping organizations maintain a single version of the truth using SAP BusinessObjects and other BI tools.
- Analysts and data scientists using advanced analytics: Analysts and data scientists are trained to work with the variety, volume, and velocity of big data.
- Everyone through operationalized insights: Everyone in an organization benefits when insights are automatically delivered in context. Through embedded analytics, insights from big data and traditional data sources are integrated into the context of business processes and applications. In this way, the entire organization becomes more data-driven as a matter of course, capable of repeatable victories based on the latest information.
Big Data Characteristics
The journey to big data has taken us into a world where data sets are growing in size and complexity and where there are ever-increasing demands for query execution speed. These three driving factors of volume, variety, and velocity were first enumerated by Gartner’s Doug Laney in 2001, and the ensuing decade has only brought them into sharper focus.
The volume of data generated by modern computing systems, sensor networks, and social media streams is ever growing. What was once considered digital exhaust, only to be collected for audit or regulatory reasons, has now become a treasure trove of information. Data storage costs have been reduced to the point where it is more cost-effective to save anything that you’d ever expect to need and sort through it later rather than spend scarce resources up front assessing its ultimate worth.
Data is being collected in a wide variety of formats, ranging from simple relationships collected by tiny machines up to complex multimedia. The modern data center houses an ever-growing range of data formats as more and more systems come online, producing transactional and analytical data in a plethora of data structures, including bulky and complex voice and video formats.
The most important aspect of big data has been the need for velocity in all aspects of data management. It has long been possible to store immense volumes of data in relational systems, but query times were so slow as to be unusable for real-time operations. Thus, in practice, many relational databases never stored more than several terabytes of data because response times would degrade too much. Inserting even more varieties of data into that database would further degrade response times, causing IT staff to keep the database clean of all but the most important transactional information. In other words, big data is a velocity problem that is exacerbated by greater volumes and varieties of information. Immense data stores with widely varying data need to have fast performance so that complex analytical tools can turn around insights and help inform decisions quickly.
The Ultimate Goal: Faster, Data-Driven Decision Making
Big data is important because ultimately it can improve decision making.
However, time is of the essence. Figure 1 shows how decision making works and the delays inherent in it.
Figure 1. Minimizing Action Distance
Hackathorn shows that there are three types of latency in the decision-making process:
- Immediately after the triggered event, there is data latency, where data is integrated and made ready for analysis. Sometimes this also involves several steps of preprocessing.
- After the data is prepared for analysis, there is analysis latency, the time involved in initiating the analysis, packaging its results, and delivering it to the appropriate person.
- After the information is delivered, there’s a certain amount of time that organizations and people take to actually take action and execute a decision. That introduces decision latency.
These latencies mean that we are losing time, getting farther and farther away from the point at which the event occurred. The farther away an action is from the point of the triggered event, the more the value of that action diminishes. We must respond and respond fast to the events happening around us.
In today’s high-speed, highly connected world, the window of opportunity to respond to events is shrinking, so we need to make sure that we are able to react quickly and reduce the amount of value erosion.
We can reduce action distance by shortening each phase of latency: making the data available and ready for analysis as close to the event as possible, delivering information quickly, and enabling faster decisions through collaborative systems.
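The three latencies described above add up to the total delay between an event and the resulting action. The following is a minimal sketch of that sum, assuming purely hypothetical timings; the class name, field names, and numbers are illustrative and do not come from Hackathorn's work.

```python
from dataclasses import dataclass


@dataclass
class DecisionLatency:
    """Elapsed time, in seconds, for each phase between an event and an action."""
    data: float      # integrating the data and preparing it for analysis
    analysis: float  # running the analysis and delivering the result
    decision: float  # deciding and acting on the delivered information

    def action_distance(self) -> float:
        """Total delay between the triggering event and the action."""
        return self.data + self.analysis + self.decision

    def largest_phase(self) -> str:
        """Name of the phase that contributes the most delay."""
        phases = {"data": self.data, "analysis": self.analysis, "decision": self.decision}
        return max(phases, key=phases.get)


# Hypothetical timings: a nightly batch pipeline versus a near-real-time one.
batch = DecisionLatency(data=6 * 3600, analysis=2 * 3600, decision=1 * 3600)
realtime = DecisionLatency(data=5, analysis=2, decision=1 * 3600)

print(f"batch action distance: {batch.action_distance() / 3600:.1f} hours")
print(f"largest phase once data and analysis are fast: {realtime.largest_phase()}")
```

In this made-up scenario, once data and analysis latency shrink to seconds, the human decision step dominates, which is why delivering insight in context matters as much as raw processing speed.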
Meeting the Challenges of Big Data
“The enterprise data warehouse is dead!” That was the title of a 2011 article that ran in Business Computing World in the UK. But that vision is shortsighted. Successful companies will wed old and new together, creating a synergistic, greater whole.
The enterprise data warehouse is not dead, but traditional approaches to managing data are dead. We live in a different world from the one that existed when the traditional relational database was first architected. Back then, the database only recorded and stored transactional data, and reporting served decision support. It was high-value, highly structured data, not the mounds of data of uncertain value that we face today. It was an era when time was measured by the calendar, not the stopwatch.
Today we face a completely different world. We generate vast amounts of non-transactional data, whether documents, Facebook posts, tweets, or log information coming off our phones, web servers, and other connected devices.
We no longer want to report just against operational activities, but we also want to analyze, explore, predict, visualize and inspect in ways never imagined by those early database engineers.
Back when databases were designed, memory was extremely expensive. Just one terabyte of RAM cost over $100 million. Today, we can get it for less than $5,000. Because memory was expensive, database engineers built a database architecture centered on the disk.
The problem is, disk is just too slow. Reading 1 petabyte of data off a disk sequentially would take 58 days using the fastest hard disk available today. SSD speeds that up to 2 days using the fastest SSD RAID, but the price is hefty: 1 petabyte of SSD RAID disk costs $12.5 million.
That’s why innovators have been finding new ways to store and process data, all in an effort to get around the disk bottleneck and improve response time.
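The read times quoted above follow from dividing the data size by the sustained throughput. The sketch below reproduces the arithmetic. The throughput values are assumptions back-derived from the 58-day and 2-day figures, not numbers the article states.

```python
PETABYTE = 10**15          # bytes, using the decimal definition
SECONDS_PER_DAY = 86_400


def sequential_read_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to read total_bytes at a constant sustained throughput."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# Assumed sustained throughputs (hypothetical, chosen to match the quoted times).
HDD_BYTES_PER_SECOND = 200 * 10**6        # about 200 MB/s
SSD_RAID_BYTES_PER_SECOND = 5.8 * 10**9   # about 5.8 GB/s

hdd_days = sequential_read_days(PETABYTE, HDD_BYTES_PER_SECOND)
ssd_days = sequential_read_days(PETABYTE, SSD_RAID_BYTES_PER_SECOND)

print(f"hard disk: {hdd_days:.1f} days")
print(f"SSD RAID:  {ssd_days:.1f} days")
```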
Distributed computing worked around the disk bottleneck by spreading the data across many disks that can be read simultaneously. In a perfectly balanced environment, a distributed database would have an equal amount of data on each machine. As a result, the maximum time to read the data would be a fraction of the time of a database stored on a single disk. For instance, if we split 1 petabyte of data evenly across 10 disks that are read simultaneously, then the response time would theoretically be one tenth of the time of a single disk. Of course, in practice, there is a cost to moving the data between the machines and coordinating a single result back to the user, but overall a database distributed across multiple disks can reduce response time.
Hadoop builds on the concept of distributed computing, but opens up the platform to handle arbitrary data sets that do not necessarily follow a predefined schema and to analyze that data with any arbitrarily designed algorithm. This flexibility comes at a cost, of course, such as the need for specialized programming skills. However, the Hadoop project has evolved over the years to include subprojects that move beyond Hadoop Distributed File System (HDFS) and MapReduce.
Hadoop was originally developed at big Internet companies as a flexible tool to process Web logs. Based on its heritage, the original Hadoop HDFS and MapReduce projects made different assumptions than relational databases about how data is processed. In particular, the early Hadoop projects assume you want to read all (or at least most of) the data stored on your disks, which is why the MapReduce framework is designed to look for a predefined pattern within all of the data stored in HDFS. Furthermore, MapReduce algorithms are coded in Java or C/C++ in order to give the programmer the flexibility to define the search pattern as well as the schema of the result set. This combined capability ensured that the original Web companies could store any or all of the Web logs without having to do a lot of the costly preprocessing typically done with ‘enterprise data.’ Furthermore, when business analysts at these firms had a new idea for the fast-evolving business, they could easily run a program to search for a new pattern. The price of this flexibility was that MapReduce queries usually took a long time to execute, forcing many companies to run them as a batch process.
Moving database architectures from row-oriented storage models to columnar storage models helped to reduce the amount of data accessed on a single disk. This is fundamentally different from the original Hadoop project, which assumed the user wanted to read all of the data on a disk. The columnar database architecture assumes that any given query will need to read only a subset of the data on a disk.
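The one-tenth estimate above can be written as a small model in which the slowest disk determines the response time. This is a sketch under simplifying assumptions: the `skew` and `coordination_seconds` parameters are hypothetical stand-ins for imbalance and for the cost of merging results, and the 200 MB/s throughput is an illustrative value.

```python
PETABYTE = 10**15  # bytes


def parallel_read_seconds(total_bytes: float, disks: int, bytes_per_second: float,
                          coordination_seconds: float = 0.0, skew: float = 1.0) -> float:
    """Estimate read time when data is spread over several disks read in parallel.

    skew is the busiest disk's share divided by an even share, so 1.0 means
    perfectly balanced. coordination_seconds models the fixed cost of moving
    data between machines and assembling a single result.
    """
    if disks < 1:
        raise ValueError("at least one disk is required")
    if skew < 1.0:
        raise ValueError("skew cannot be below 1.0")
    busiest_disk_bytes = total_bytes / disks * skew
    return busiest_disk_bytes / bytes_per_second + coordination_seconds


THROUGHPUT = 200 * 10**6  # assumed 200 MB/s per disk

single = parallel_read_seconds(PETABYTE, 1, THROUGHPUT)
balanced = parallel_read_seconds(PETABYTE, 10, THROUGHPUT)
skewed = parallel_read_seconds(PETABYTE, 10, THROUGHPUT, skew=1.5)

print(f"10 balanced disks: {balanced / single:.2f} of the single-disk time")
print(f"10 skewed disks:   {skewed / single:.2f} of the single-disk time")
```

With a skew of 1.5, the busiest disk holds 50 percent more than its even share, and the speedup drops accordingly, which is why balanced data placement matters in distributed databases.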
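To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to Web logs. It does not use the Hadoop API and runs in plain Python. The log format, the sample lines, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical log lines in the form "<client-ip> <method> <path> <status>".
LOG_LINES = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /cart 500",
    "10.0.0.1 GET /cart 200",
    "10.0.0.3 POST /checkout 500",
    "malformed line",
]


def map_phase(line):
    """Emit (path, 1) for every request that ended in a server error (5xx)."""
    fields = line.split()
    if len(fields) == 4 and fields[3].startswith("5"):
        yield fields[2], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Scan every line, then reduce the grouped output to one result per key."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


errors_by_path = run_job(LOG_LINES)
print(errors_by_path)
```

Note that the job scans every line, including the malformed one, and applies its schema only at read time. Changing the question means changing `map_phase`, not reloading the data, which mirrors the flexibility described above.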
The columnar database architecture assumes that the user typically will only want to access a small number of the attributes or columns within a database table. Imagine you have a table storing historical sales transactions with 8 columns: Year, Quarter, Country, State, Sales Representative, Customer, Product, Revenue. At the end of the year, each department may ask different questions. For example:
- Finance: What was the total revenue by year and quarter for the last 3 years?
- Marketing: What was the total revenue by product and by country?
- Sales: What was the total revenue by sales representative?
In each case, the user only accesses a subset of the columns. While this is a simplistic example, in practice many of the questions that people ask use only a small subset of the sometimes hundreds of columns in a table.
A columnar database stores all of the data associated with a particular attribute or column in the same physical space on the disk. In this way, when only 3 of the 8 columns of data are needed to answer a question, the database only needs to read 3 segments of the database from the disk instead of the entire thing.
Furthermore, by storing all of the data for a given column together, columnar databases can exploit the repeating patterns within a column’s data in order to highly compress it, further reducing the number of bits read off disk. Consider the Country column from our example above. Storing the name “United States” as text would take 13 to 26 bytes of data depending on the encoding used. There are fewer than 256 countries in the world, which means that each country can be uniquely identified by using only 1 byte (8 bits) of information. So ‘United States’ could be replaced with, say, the number ‘1’, compressing the column entry from 13 to 26 bytes into 1 byte. This form of compression is called tokenization.
It is very common for the rows to have a lot of repeated information. Building on our example, imagine that the ‘country’ column contains ‘United States’ for the first 15 rows of the table, each of which has been replaced with the number ‘1’ stored in a single byte. This essentially means we have 15 entries in a row, each containing the number ‘1’. On disk, it then looks like this: ‘111111111111111’. This duplication can be replaced with the value, ‘1’, and a count of the number of duplicate entries, something conceptually like ‘1D15’, which says number ‘1’ duplicated 15 times. This form of compression is called run-length encoding.
Taken together, our first 15 rows of the ‘country’ column get compressed from 195 to 390 bytes down to potentially 3 or 4 bytes. Compression is important because it reduces the amount of data that gets read from disk. In our example above, reading 4 bytes from disk stands in for roughly 200 bytes that an uncompressed database would have to read, which dramatically accelerates response time.
In summary, organizing data storage around columns reduces the amount of data that must be scanned and makes the data easy to compress, which makes columnar databases ideally suited for BI and analytic workloads.
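The sketch below illustrates the idea with a toy column store that keeps one list per column and records which columns a query touches. The sample rows and the `ColumnStore` class are invented for illustration and say nothing about how SAP HANA or any other product lays out data internally.

```python
COLUMNS = ["Year", "Quarter", "Country", "State",
           "Sales Representative", "Customer", "Product", "Revenue"]

# Hypothetical sample rows, invented for illustration.
ROWS = [
    (2012, "Q1", "United States", "CA", "Ann", "Acme", "Widget", 100.0),
    (2012, "Q1", "United States", "NY", "Bob", "Initech", "Gadget", 250.0),
    (2012, "Q2", "Germany", "BY", "Eva", "Globex", "Widget", 75.0),
    (2013, "Q1", "United States", "CA", "Ann", "Acme", "Gadget", 300.0),
]


class ColumnStore:
    """Toy column store: one Python list per column, with read tracking."""

    def __init__(self, columns, rows):
        self._data = {name: [row[i] for row in rows] for i, name in enumerate(columns)}
        self.columns_read = set()

    def read(self, name):
        """Return one whole column and record that it was touched."""
        self.columns_read.add(name)
        return self._data[name]


def revenue_by_year_and_quarter(store):
    """The Finance question: total revenue grouped by (year, quarter)."""
    totals = {}
    for year, quarter, revenue in zip(store.read("Year"),
                                      store.read("Quarter"),
                                      store.read("Revenue")):
        key = (year, quarter)
        totals[key] = totals.get(key, 0.0) + revenue
    return totals


store = ColumnStore(COLUMNS, ROWS)
totals = revenue_by_year_and_quarter(store)
print(totals)
print(f"columns read: {len(store.columns_read)} of {len(COLUMNS)}")
```

A row-oriented layout would have to fetch all eight fields of every row to answer the same question, even though five of them are never used.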
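Tokenization of this kind is often called dictionary encoding. A minimal sketch follows, assuming one-byte tokens that start at 1 as in the example above. The function names and sample column are illustrative, and a real engine would also have to store the dictionary itself.

```python
def dictionary_encode(values):
    """Replace each distinct value with a one-byte integer token.

    Returns the value-to-token dictionary and the encoded column as bytes.
    """
    dictionary = {}
    tokens = bytearray()
    for value in values:
        # New values get the next free token; known values reuse theirs.
        token = dictionary.setdefault(value, len(dictionary) + 1)
        if token > 255:
            raise ValueError("more than 255 distinct values do not fit in one byte")
        tokens.append(token)
    return dictionary, bytes(tokens)


def dictionary_decode(dictionary, tokens):
    """Recover the original values from the tokens."""
    by_token = {token: value for value, token in dictionary.items()}
    return [by_token[token] for token in tokens]


# Hypothetical Country column.
country = ["United States", "United States", "United States", "Germany", "United States"]

dictionary, encoded = dictionary_encode(country)
raw_size = sum(len(value.encode("utf-8")) for value in country)

print(dictionary)
print(f"raw: {raw_size} bytes, encoded: {len(encoded)} bytes plus the dictionary")
```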
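Run-length encoding can be sketched in a few lines. This version stores each run as a (value, count) pair rather than the conceptual ‘1D15’ notation used above; the function names are illustrative.

```python
from itertools import groupby


def run_length_encode(tokens):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(tokens)]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# The first 15 rows of the tokenized country column from the example.
country_tokens = [1] * 15

runs = run_length_encode(country_tokens)
print(runs)
```

Run-length encoding pays off only when equal values sit next to each other, which is one reason columnar engines often sort or cluster data before compressing it.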
Here are some key facts to consider about in-memory databases.
In-memory databases take response times to a whole new level. They remove the disk from the equation for data access altogether, and only use it for logging and backup. In-memory databases leverage the power of today’s processors to read and analyze data 1,000 times faster than reading the data off disk.
Combine columnar data stores with in-memory processing and high compression, and you can see performance gains of 1,000 or 10,000 times; in some cases, customers have experienced results 100,000 times faster.
In-memory is the future of data management, and so the real-time SAP platform for big data has SAP HANA at its core. Nonetheless, technology like Hadoop has a critical, complementary role to play. A complete big data solution is end-to-end in nature. It handles everything from low-level data ingestion, storage, processing, visualization, and engagement to analytic solutions and applications.
A complete big data solution has another characteristic as well: it handles all kinds of data. Location-aware applications and applications that support mapping have made spatial data more important than ever, both operationally and in targeting customers. SAP HANA Spatial Processing helps process this important type of big data.
So much of big data is text that text analytics, whether sentiment analysis of social media data or analysis of doctors’ notes to help drive better healthcare, is another key to big data. Text analysis is a core capability of SAP HANA.
Predictive analytics, largely the province of the data scientist, is another feature supported in SAP HANA through the Predictive Analysis Library (PAL).
Streaming analytics is another key area, supported by SAP HANA and Sybase Event Stream Processor (ESP). It matters most where the velocity of the data is especially critical, such as in financial services, and in contexts like manufacturing, where machine or sensor data is collected and analyzed, an area referred to as the Internet of Things.