Big data is about more than volume. It is about generating valuable, real-time insights from raw data, regardless of its size, its type, or the rate at which it is generated.
In other words, Big Data is the ocean of information we swim in every day – vast zettabytes of data flowing from our computers, mobile devices, and machine sensors. With the right solutions, organizations can dive into all that data and gain valuable insights that were previously unimaginable.
Leveraging big data, organizations have successfully uncovered new insights from all of their data, creating opportunities to transform businesses, industries, and even the quality of our lives.
Big data is not an intrinsic good in and of itself. It is useful as a stepping stone to business value. Many organizations see the opportunities for big data and understand some of the use cases. The problem is that there are challenges in getting to that business value. In attempting to integrate big data with their existing data sources, organizations face questions and concerns such as:
To work with big data you need to be able to acquire it, analyze it, and act on insights derived from it.
The SAP HANA platform provides these capabilities. It delivers in-memory processing of data with tiered, petabyte-scale storage and integration with SAP IQ and Hadoop.
This segment describes how three groups can benefit from the big data capabilities inherent in SAP HANA:
The journey to big data has taken us into a world where data sets are growing in size and complexity and where there are ever-increasing demands for query execution speed. These three driving factors of volume, variety, and velocity were first enumerated in 2001 by Doug Laney, then at META Group (later acquired by Gartner), and the ensuing decade has only brought them into sharper focus.
The volume of data generated by modern computing systems, sensor networks, and social media streams is ever-growing. What was once considered digital exhaust, only to be collected for audit or regulatory reasons, has now become a treasure trove of information. Data storage costs have been reduced to the point where it is more cost-effective to save anything that you’d ever expect to need and sort through it later rather than spend scarce resources upfront assessing its ultimate worth.
Data is being collected in a wide variety of formats, ranging from simple relationships collected by tiny machines up to complex multimedia. The modern data center houses an ever-growing range of data formats as more and more systems come online, producing transactional and analytical data in a plethora of data structures, including bulky and complex voice and video formats.
The most important aspect of big data has been the need for velocity in all aspects of data management. It has long been possible to store immense volumes of data in relational systems, but query times were so slow as to be unusable for real-time operations. Thus, in practice, many relational databases never stored more than several terabytes of data because response times would degrade too much. Inserting even more varieties of data into that database would further degrade response times, causing IT staff to keep the database clean of all but the most important transactional information. In other words, big data is a velocity problem that is exacerbated by greater volumes and varieties of information. Immense data stores with widely varying data need to have fast performance so that complex analytical tools can turn around insights and help inform decisions quickly.
Big data is important because ultimately it can improve decision-making.
However, time is of the essence. Figure 1 shows how decision-making works and the delays inherent in it.
Figure 1. Minimizing Action Distance
Hackathorn shows that there are three types of latency in the decision-making process:
These latencies mean that we are losing time, getting farther and farther away from the point at which the event occurred. The farther away action is from the point of the triggered event, the more the value of that action diminishes. We must respond and respond fast to the events happening around us.
In today’s high-speed, highly connected world, the window of opportunity to respond to events is shrinking, so we need to make sure that we are able to react quickly and reduce the amount of value erosion.
We can reduce action time by shortening each phase of latency: making the data available and ready for analysis as close to the event as possible, delivering information quickly, and enabling faster decisions through human collaborative systems.
“The enterprise data warehouse is dead!” That was the title of a 2011 article that ran in Business Computing World in the UK. But that vision is short-sighted. Successful companies will wed old and new together, creating a synergistic, greater whole.
The enterprise data warehouse is not dead, but traditional approaches to managing data are. We live in a different world from the one that existed when the traditional relational database was first architected. Back then, the database recorded and stored transactional data only, and reporting was done for decision support. The data was high-value and highly structured, not the mounds of data of uncertain value that we face today. It was an era when time was measured by the calendar, not the stopwatch.
Today we face a completely different world. We generate vast amounts of non-transactional data—whether documents, Facebook posts, tweets, or log information coming from our phones, web servers, and other connected devices.
We no longer want to report only against operational activities. We also want to analyze, explore, predict, visualize, and inspect in ways never imagined by those early database engineers.
Back when databases were designed, memory was extremely expensive. Just one terabyte of RAM cost over $100 million. Today, we can get it for less than $5,000. Because memory was expensive, database engineers built a database architecture centered on the disk.
The problem is, the disk is just too slow. Reading 1 petabyte of data off a disk sequentially would take 58 days using the fastest hard disk available today. The fastest SSD RAID cuts that to 2 days, but the price is hefty: 1 petabyte of SSD RAID storage costs $12.5 million.
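The read times quoted above imply a sustained throughput for each device. The following is a minimal sketch of that arithmetic, assuming a decimal petabyte (10^15 bytes) and a constant transfer rate; the function names are illustrative and not taken from any product.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # assumption: decimal petabyte, in bytes


def days_to_read(num_bytes: float, bytes_per_second: float) -> float:
    """Days needed to read num_bytes sequentially at a constant rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_DAY


def implied_throughput(num_bytes: float, days: float) -> float:
    """Sustained rate (bytes per second) implied by a stated read time."""
    return num_bytes / (days * SECONDS_PER_DAY)


# Rates implied by the figures in the text: 58 days (disk), 2 days (SSD RAID).
hdd_rate = implied_throughput(PETABYTE, 58)
ssd_rate = implied_throughput(PETABYTE, 2)

print(f"hard disk: {hdd_rate / 1e6:.0f} MB/s")
print(f"SSD RAID:  {ssd_rate / 1e9:.1f} GB/s")
```

Under these assumptions, the 58-day figure corresponds to roughly 200 MB/s and the 2-day figure to roughly 5.8 GB/s.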
That’s why innovators have been finding new ways to store and process data, all in an effort to get around the disk bottleneck and improve response time.
Distributed computing works around the disk bottleneck by spreading the data across many disks that can be read simultaneously. In a perfectly balanced environment, a distributed database would hold an equal amount of data on each machine. As a result, the maximum time to read the data would be a fraction of the time needed for a database stored on a single disk. For instance, if we split 1 petabyte of data evenly across 10 disks that are read simultaneously, the response time would theoretically be one-tenth that of a single disk. Of course, in practice, there is a cost to moving the data between the machines and coordinating a single result back to the user, but overall a database distributed across multiple disks can reduce response time.
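The one-tenth estimate can be sketched as follows. This is an idealized model, not a benchmark: it assumes every disk reads at the same constant rate (the 200 MB/s value is an assumption), ignores the coordination cost mentioned above, and treats the largest partition as the one that determines elapsed time.

```python
PETABYTE = 10**15          # assumption: decimal petabyte, in bytes
DISK_RATE = 200e6          # assumption: 200 MB/s per disk


def scan_seconds(partition_sizes, bytes_per_second):
    """Elapsed time when all partitions are read at once.

    The disks work in parallel, so the largest partition sets the pace.
    """
    return max(partition_sizes) / bytes_per_second


single = scan_seconds([PETABYTE], DISK_RATE)

# Perfectly balanced: ten disks with 100 TB each.
balanced = scan_seconds([PETABYTE // 10] * 10, DISK_RATE)

# Skewed: one disk holds 300 TB, so the others finish early and wait.
skewed_sizes = [3 * PETABYTE // 10] + [7 * PETABYTE // 90] * 9
skewed = scan_seconds(skewed_sizes, DISK_RATE)

print(f"balanced / single = {balanced / single:.2f}")
print(f"skewed   / single = {skewed / single:.2f}")
```

The skewed case shows why balance matters: with one disk holding 30% of the data, the scan takes about three-tenths of the single-disk time rather than one-tenth.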
Hadoop builds on the concept of distributed computing but opens up the platform to handle arbitrary data sets that do not necessarily follow a predefined schema and to analyze that data with any arbitrarily designed algorithm. This flexibility comes at a cost, of course, such as the need for specialized programming skills. However, the Hadoop project has evolved over the years to include subprojects that move beyond the Hadoop Distributed File System (HDFS) and MapReduce.
Hadoop was originally developed by big Internet companies as a flexible tool to process web logs. Given that heritage, the original Hadoop HDFS and MapReduce projects made different assumptions from relational databases about how data is processed. In particular, the early Hadoop projects assume you want to read all (or at least most) of the data stored on your disks, which is why the MapReduce framework is designed to look for a predefined pattern within all of the data stored in HDFS. Furthermore, MapReduce algorithms are coded in Java or C/C++ to give the programmer the flexibility to define the search pattern as well as the schema of the result set. This combined capability meant that the original Web companies could store any or all of their web logs without the costly preprocessing typically applied to ‘enterprise data.’ And whenever business analysts at those firms had a new idea for the fast-evolving business, they could easily run a program to search for a new pattern. The price of this flexibility was that MapReduce queries usually took a long time to execute, forcing many companies to run them as batch processes.
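The MapReduce idea described above, scanning every record for a pattern and then aggregating the matches, can be illustrated in a few lines. This is a single-process sketch in plain Python, not the Hadoop API; the log lines, the status-code pattern, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical web log lines; a real job would read these from HDFS.
LOG_LINES = [
    "10.0.0.1 GET /products/42 200",
    "10.0.0.2 GET /cart 200",
    "10.0.0.1 GET /products/7 404",
    "10.0.0.3 POST /checkout 500",
]

# The search pattern: the HTTP status code at the end of each line.
PATTERN = re.compile(r" (\d{3})$")


def map_phase(line):
    """Emit a (status, 1) pair for each line that matches the pattern."""
    match = PATTERN.search(line)
    if match:
        yield match.group(1), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result row."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


status_counts = run_job(LOG_LINES)
print(status_counts)
```

Note that every line is read regardless of what the pattern is; changing the question only means changing `PATTERN` and the reduce step, which reflects the flexibility and the full-scan cost described above.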
Moving database architectures from row-oriented storage models to columnar storage models helped to reduce the amount of data accessed on a single disk. This is fundamentally different than the original Hadoop project, which assumed the user wanted to read all of the data on a disk. The columnar database architecture assumes that any given query will need to read only a subset of the data on a disk.
The columnar database architecture assumes that the user typically will only want to access a small number of the attributes or columns within a database table. Imagine you have a table storing historical sales transactions with 8 columns: Year, Quarter, Country, State, Sales Representative, Customer, Product, Revenue. At the end of the year, each department may ask different questions. For example:
In each case, the user only accesses a subset of the columns. While this is a simplistic example, in practice many of the questions that people ask use only a small subset of the sometimes hundreds of columns in a table.
A columnar database stores all of the data associated with a particular attribute or column in the same physical space on the disk. In this way, when only 3 of the 8 columns of data are needed to answer a question, the database only needs to read 3 segments of the database from the disk instead of the entire thing.
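A small sketch can make the difference concrete. The rows and the sample question ("revenue by country for a given year") below are invented for illustration, and in-memory Python lists stand in for segments on disk.

```python
COLUMNS = ["Year", "Quarter", "Country", "State",
           "Sales Representative", "Customer", "Product", "Revenue"]

# Row-oriented layout: each record is stored together (hypothetical data).
rows = [
    (2023, "Q1", "United States", "CA", "Ann", "Acme", "Widget", 100.0),
    (2023, "Q2", "United States", "NY", "Bob", "Bolt", "Gadget", 250.0),
    (2023, "Q2", "Germany", "BE", "Cara", "Dyn", "Widget", 75.0),
    (2022, "Q4", "Germany", "BY", "Dan", "Eon", "Gadget", 60.0),
]

# Column-oriented layout: all values of one attribute are stored together.
column_store = {name: [row[i] for row in rows]
                for i, name in enumerate(COLUMNS)}


def revenue_by_country(store, year):
    """Answer the question by touching only 3 of the 8 columns."""
    totals = {}
    for y, country, revenue in zip(store["Year"], store["Country"],
                                   store["Revenue"]):
        if y == year:
            totals[country] = totals.get(country, 0.0) + revenue
    return totals


result = revenue_by_country(column_store, 2023)

# Values that must be read under each layout for this question.
values_read_columnar = 3 * len(rows)
values_read_row_store = len(COLUMNS) * len(rows)

print(result)
print(values_read_columnar, "vs", values_read_row_store)
```

With 4 rows, the columnar layout reads 12 values where a row-oriented scan reads 32; the ratio stays at 3 to 8 however many rows the table holds.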
Furthermore, by storing all of the data for a given column together, columnar databases can exploit the repeating patterns within a column’s data to compress it heavily, further reducing the number of bits read off disk. Consider the Country column from our example above. Storing the name ‘United States’ as text would take at least 13–26 bytes of data, depending on the encoding used. There are fewer than 256 countries in the world, which means that each country can be uniquely identified using only 1 byte (8 bits) of information. So ‘United States’ could be replaced with, say, the number ‘1’, compressing the column entry from 13–26 bytes into 1 byte. This form of compression is called tokenization.
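Tokenization as described here can be sketched in a few lines of Python. This is an illustrative encoder, not how any particular database implements it; tokens start at 1 to match the example in the text, which limits this sketch to 255 distinct values per column.

```python
def tokenize(values):
    """Replace each distinct value with a one-byte integer token."""
    dictionary = {}
    tokens = bytearray()
    for value in values:
        if value not in dictionary:
            if len(dictionary) >= 255:
                raise ValueError("more distinct values than fit in one byte")
            dictionary[value] = len(dictionary) + 1  # tokens start at 1
        tokens.append(dictionary[value])
    return dictionary, bytes(tokens)


def detokenize(dictionary, tokens):
    """Recover the original values from the tokens."""
    reverse = {token: value for value, token in dictionary.items()}
    return [reverse[token] for token in tokens]


country = ["United States", "United States", "Germany", "United States"]
dictionary, tokens = tokenize(country)

# Size of one raw entry under two common encodings.
utf8_size = len("United States".encode("utf-8"))
utf16_size = len("United States".encode("utf-16-le"))

print(dictionary, list(tokens))
print(utf8_size, utf16_size, "bytes raw vs 1 byte tokenized")
```

The dictionary itself must also be stored, but it holds each distinct value only once, so its cost is small next to a column with millions of rows.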
It is very common for the rows to have a lot of repeated information. Building on our example, imagine that the ‘country’ column contains ‘United States’ for the first 15 rows of the table, each of which has been replaced with the number ‘1’ stored in a single byte. This means we have 15 consecutive entries, each containing the number ‘1’. On disk, it then looks like this: ‘111111111111111’. This duplication can be replaced with the value ‘1’ and a count of the duplicate entries, something conceptually like ‘1D15’, which says that the number ‘1’ is duplicated 15 times. This form of compression is called run-length encoding.
Taken together, the two techniques compress our first 15 rows of the ‘country’ column from 195–390 bytes down to potentially 3 or 4 bytes. Compression is important because it reduces the amount of data that gets read from disk. In our example above, reading 4 bytes from disk stands in for roughly 200 bytes stored in other databases, which dramatically accelerates response time.
In summary, a storage architecture organized around columns reduces the amount of data that needs to be scanned and makes the data easy to compress, which makes columnar databases well suited to BI and analytic workloads.
In-memory databases attack the disk bottleneck even more directly.
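Run-length encoding of a tokenized column can be sketched as follows. The `(value, count)` pair representation is an assumption made for readability; the ‘1D15’ notation in the text is conceptual, and real systems choose their own on-disk format.

```python
from itertools import groupby


def rle_encode(tokens):
    """Collapse each run of equal tokens into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(tokens)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original token list."""
    return [value for value, count in runs for _ in range(count)]


# The first 15 rows of the country column, already tokenized to 1.
country_tokens = [1] * 15
runs = rle_encode(country_tokens)

# Raw size of those 15 entries as text, at 13 or 26 bytes per entry.
raw_low, raw_high = 15 * 13, 15 * 26

print(runs)
print(f"raw: {raw_low}-{raw_high} bytes, tokenized: {len(country_tokens)} bytes")
```

Here a single pair replaces 15 one-byte tokens. A column whose values alternate on every row gains nothing from this step, which is why run-length encoding pays off mainly on sorted or naturally clustered columns.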
In-memory databases take response times to a whole new level. They remove the disk from the equation for data access altogether and only use it for logging and backup. In-memory databases leverage the power of today’s processors to read and analyze data 1,000 times faster than reading the data off the disk.
Combine columnar data stores with in-memory processing and heavy compression, and you can soon see performance gains of 1,000 or 10,000 times. In some cases, customers have experienced results 100,000 times faster.
In-memory is the future of data management, and so the real-time SAP HANA platform for big data has SAP HANA at its core. Nonetheless, technology like Hadoop has a critical, complementary role to play. A complete big data solution is end-to-end in nature. It handles everything from low-level data ingestion, storage, processing, visualization, and engagement to analytic solutions and applications.
A complete big data solution has another characteristic as well: it handles all kinds of data. Location-aware applications and applications that support mapping have made spatial data more important than ever, both operationally and in targeting customers. SAP HANA Spatial Processing helps process this important type of big data.
So much of big data is text that text analytics, whether sentiment analysis of social media data or analysis of doctors’ notes to help drive better healthcare, is another key capability of SAP HANA.
Predictive analytics, largely the province of the data scientist, is another feature supported in SAP HANA through the Predictive Analysis Library (PAL).
Streaming analytics is another key area, supported by SAP HANA and Sybase Event Stream Processor (ESP). It covers analytics where the velocity of the data is especially critical, such as in financial services, as well as contexts like manufacturing where machine or sensor data is used and analyzed, an area referred to as the Internet of Things.