A Data Lake is a system and method for storing vast amounts of data in its original or near-original format so that analytics and big data processing can be run against it later. The data can be files, blobs, or objects.
A Data Lake can store structured, semi-structured, and unstructured data in its original format for later processing.
Example: Apache Hadoop
Data Lakes centralize data once it has been gathered from its sources, combining and storing it for processing. The contents of a Data Lake can be normalized and enriched. A Data Lake supports metadata extraction, indexing, formatting and conversion, segregation, augmentation, aggregation, and cross-linking.
In April 2015, Microsoft Azure announced the Data Lake service for enterprise customers. With the Data Lake service, Microsoft shifted its data storage and analytics offering from a basic storage platform to a fully realized platform for distributed analytics and clustering through HDInsight.
Azure Data Lake is a large, central storage repository based on Apache Hadoop. It is an alternative to enterprise data silos and holds massive amounts of data in its original format. Data can come from any number of sources; the Data Lake repository is not concerned with the source or purpose of the data. It simply provides a common repository on which to perform deep analytics.
Azure Data Lake has three services:
1. Data Lake Store
2. Data Lake Analytics
3. Azure HDInsight
Azure Data Lake is built on top of Apache Hadoop and based on the Apache YARN cluster resource management framework. It is Microsoft's implementation of the HDFS file system in the cloud. Azure Data Lake is a completely cloud-based solution and does not require any hardware or servers to be installed on the user's end. It can be scaled according to need.
Data Lake is compatible with both the Azure Storage API and the Hadoop Distributed File System (HDFS).
Data Lake integrates with Azure Active Directory and uses it for security and authentication.
Data Lake is designed for very low latency and near-real-time analytics, for use cases such as web analytics, IoT analytics, and sensor data processing.
Data can be gathered from sources such as social media, website and app logs, and devices and sensors, and stored in near-original format.
Data Lake Store is a hyper-scale repository for big data analytics workloads. It lets users store data of any size and format, ranging from social media content to relational databases and logs. It provides unlimited storage without restrictions: an individual file can be a petabyte in size, with no retention policy. It uses the Hadoop file system model and provides compatibility with HDFS.
Service Integration for Data Lake Store:
1. Data Lake Analytics
Microsoft is also planning to introduce integration with Microsoft's Revolution R Enterprise; the Hortonworks, Cloudera, and MapR distributions; and Hadoop projects such as Spark, Storm, and HBase.
Data Lake Store supports POSIX-style permissions exposed through WebHDFS-compatible REST APIs. The WebHDFS protocol makes it possible to support all HDFS operations, such as reading, writing, accessing block locations, and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.
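As an illustration, WebHDFS-compatible REST operations are addressed by composing a URL per operation. The sketch below, in Python, shows the general shape; the account name myadls and the file paths are hypothetical examples, and the URL layout assumes the standard webhdfs/v1 convention.

```python
# Sketch: composing WebHDFS-compatible REST URLs for a Data Lake Store account.
# The account name "myadls" and the paths below are hypothetical examples.
from urllib.parse import urlencode

def webhdfs_url(account: str, path: str, op: str, **params) -> str:
    """Compose a WebHDFS REST URL for the given store account and operation."""
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1"
    query = urlencode({"op": op, **params})
    return f"{base}{path}?{query}"

# Typical HDFS operations exposed over REST:
read_url = webhdfs_url("myadls", "/logs/app.log", "OPEN")    # read a file
list_url = webhdfs_url("myadls", "/logs", "LISTSTATUS")      # list a directory
mkdir_url = webhdfs_url("myadls", "/staging", "MKDIRS")      # create a directory
```

A GET request against read_url would then stream the file contents, exactly as it would against any WebHDFS-compatible cluster.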
A new file system, AzureDataLakeFilesystem (adl://), is introduced for directly accessing the repository. Applications and systems capable of using the new file system gain additional flexibility and performance over WebHDFS. Systems not compatible with it can continue to use the WebHDFS-compatible APIs.
Data Lake Analytics is an in-depth data analytics tool in Microsoft's cloud offering. Users write their business logic for data processing in it. The most important feature of Data Lake Analytics is that it can process unstructured data by applying schema-on-read logic, which imposes a structure on the data as it is retrieved from its source.
It supports the U-SQL language, which lets users run custom logic and user-defined functions. U-SQL provides fine-grained control and scalability over jobs. Data Lake Analytics executes a U-SQL job as a batch script, with data retrieved in a rowset format. If the source data are files, U-SQL schematizes the data on extract.
HDInsight is a fully managed Hadoop cluster service that supports a wide range of analytics engines, including Spark, Storm, and HBase. It is designed to take advantage of Data Lake Store to maximize security, scalability, and throughput, and it supports managed clusters on both Linux and Windows.
Hadoop: HDFS data storage with support for MapReduce and parallel processing.
HBase: NoSQL database built on Hadoop for large sets of structured and semi-structured data.
Storm: Distributed, real-time computational service for data streams.
U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.
U-SQL can process unstructured data by applying schema on read and inserting custom logic. Each query produces a rowset, and the rowset can be assigned to a variable.
The EXTRACT keyword reads data from a file and defines the schema on read. The OUTPUT keyword writes data from a rowset to a file. Both statements use Azure Data Lake file paths.
@searchlog = EXTRACT UserId int, Query string FROM "/SearchLog.tsv" USING Extractors.Tsv(); // column list shortened for illustration
OUTPUT @searchlog TO "/SearchLog-first-u-sql.csv" USING Outputters.Csv();
This script reads the source file SearchLog.tsv, schematizes it, and writes the rowset back out to a file called SearchLog-first-u-sql.csv.
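The same schema-on-read idea can be sketched in plain Python for readers without an Azure subscription: the data stays untyped until it is read, at which point a schema is applied row by row. The column set used here (UserId, Query, Duration) is an assumed subset of the search-log example above.

```python
# Sketch: schema-on-read in plain Python, mirroring the U-SQL EXTRACT/OUTPUT
# pattern. The columns (UserId, Query, Duration) are assumed for illustration.
import csv
import io

def extract_tsv(stream, schema):
    """Apply a schema while reading: each raw TSV row becomes typed values."""
    for row in csv.reader(stream, delimiter="\t"):
        yield tuple(cast(value) for cast, value in zip(schema, row))

# The source data is stored untyped; structure is imposed only on read.
raw = io.StringIO("101\tazure data lake\t34\n102\tu-sql tutorial\t12\n")
schema = (int, str, int)  # UserId, Query, Duration

rowset = list(extract_tsv(raw, schema))

# OUTPUT step: write the rowset back out as CSV.
out = io.StringIO()
csv.writer(out).writerows(rowset)
```

Note that the "schema" is just a tuple of converters applied at read time, which is what lets the same stored file be read with different schemas by different jobs.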
| Data Warehouse | Attribute | Data Lake |
| --- | --- | --- |
| Structured and processed | Data | Structured, semi-structured, and unstructured |
| Schema on write | Processing | Schema on read |
| Less agile, fixed configuration | Agility | Highly agile, fully configurable |
| Business professionals | Users | Data scientists |
Data Lake security includes:
1. Authentication
2. Authorization and access control
3. Network isolation
4. Data protection
Data Lake uses Azure Active Directory for authentication of users and for policy enforcement.
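To make the Azure Active Directory step concrete, the sketch below composes the OAuth 2.0 client-credentials request an application would send to AAD before calling Data Lake. The tenant, client id, and secret are placeholder values, and the resource URI shown is an assumption for illustration.

```python
# Sketch: composing an AAD service-to-service (client credentials) token
# request. Tenant, client id, and secret are placeholders; the resource URI
# is an assumed value for illustration.
from urllib.parse import urlencode

def aad_token_request(tenant: str, client_id: str, client_secret: str):
    """Return the (url, body) pair for an AAD client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://datalake.azure.net/",  # assumed Data Lake resource URI
    })
    return url, body

url, body = aad_token_request("contoso.onmicrosoft.com", "app-id", "app-secret")
```

POSTing that body to the URL would return a bearer token, which the client then presents on every WebHDFS call.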
Authorization and access control are managed separately in Data Lake, which provides two mechanisms:
1. Role-based access control (RBAC), provided by Azure, for account management
2. POSIX ACLs for accessing data in the store
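The POSIX side of this model reduces to the familiar read/write/execute permission bits. A minimal Python sketch of the check, with hypothetical ACL entries:

```python
# Sketch: how POSIX-style rwx bits gate access to an entry in the store.
# The bit values (read=4, write=2, execute=1) are standard POSIX; the ACL
# entries below are hypothetical.
R, W, X = 4, 2, 1

def allowed(perm_bits: int, wanted: int) -> bool:
    """True only if every requested permission bit is present."""
    return perm_bits & wanted == wanted

# Example ACL for one file: alice may read and traverse, bob may read and write.
acl = {"alice": R | X, "bob": R | W}
```

A store-side authorization check is then a single bitmask test per ACL entry, e.g. allowed(acl["alice"], W) to ask whether alice may write.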
Network isolation provides firewalls: an administrator defines an IP address range for trusted clients, and only those clients can access the Data Lake.
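The trusted-client check that such a firewall rule performs can be sketched with Python's standard ipaddress module; the address range used here is a hypothetical example.

```python
# Sketch: the trusted-client check behind a firewall rule. The allowed range
# below is a hypothetical example (a documentation/test address block).
import ipaddress

TRUSTED_RANGE = ipaddress.ip_network("203.0.113.0/24")  # example allowed range

def is_trusted(client_ip: str) -> bool:
    """Only clients inside the configured range may reach the endpoint."""
    return ipaddress.ip_address(client_ip) in TRUSTED_RANGE
```

In practice the range is configured on the Data Lake account itself, and requests from addresses outside it are rejected before authentication is even attempted.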
Data protection uses the Transport Layer Security (TLS) protocol to secure data over the network.
Auditing and diagnostic logs are shown in the Azure portal.
Data Lake Store:

| Usage | Price |
| --- | --- |
| First 100 TB | ₹2.58 per GB |
| Next 100 TB to 1,000 TB | ₹2.52 per GB |
| Next 1,000 TB to 5,000 TB | ₹2.45 per GB |
| Over 5,000 TB | Custom, by contacting Microsoft |
Monthly commitment packages:

| Committed Capacity | Price/Month | Savings over pay-as-you-go |
| --- | --- | --- |
| Over 1,000 TB | Custom, by contacting Microsoft | |
Price per transaction:

| Transaction | Price |
| --- | --- |
| Write operations (per 10,000) | ₹3.31 |
| Read operations (per 10,000) | ₹0.27 |
| Transaction size limit | No limit |
Data Lake Analytics:

Monthly committed price:

| Included Analytics Unit Hours | Price/Month | Savings over pay-as-you-go |
| --- | --- | --- |
| Over 100,000 | Custom, by contacting Microsoft | |
Benefits of Data Lake:
1. Highly scalable and redundant data storage for all needs
2. Stores everything: logs, XML, multimedia, sensor data, binary, social data, chat, and people data
3. No limit on data storage or file size
4. In-depth analytics
5. Schema-less storage, which a data warehouse does not support