A Data Lake is a system and method for storing vast amounts of data in its original or near-original format so that analytics and big data analysis can be applied to it. The data can be files, blobs, or objects.
A Data Lake can store structured, semi-structured, and unstructured data in its original format for later processing.
Example: Apache Hadoop
Data Lakes centralize data once it has been gathered from its sources. A Data Lake combines the data and stores it for processing. The contents of a Data Lake can be normalized and enriched. A Data Lake supports metadata extraction, indexing, formatting and conversion, segregation, augmentation, aggregation, and cross-linking.
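To make the ideas of storing data unmodified while extracting metadata and indexing it more concrete, here is a minimal Python sketch. The `TinyLake` class and its fields are hypothetical and exist only for illustration; a real Data Lake does this at a very different scale and through its own services.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class TinyLake:
    """Hypothetical in-memory stand-in for a data lake (illustration only)."""
    objects: dict = field(default_factory=dict)   # path -> raw bytes, stored unmodified
    catalog: dict = field(default_factory=dict)   # path -> extracted metadata

    def ingest(self, path: str, raw: bytes, source: str) -> dict:
        # Store the payload exactly as received (original format).
        self.objects[path] = raw
        # Metadata extraction: derive descriptive fields without altering the data.
        meta = {
            "source": source,
            "size_bytes": len(raw),
            "extension": path.rsplit(".", 1)[-1].lower() if "." in path else "",
            "sha256": hashlib.sha256(raw).hexdigest(),
        }
        self.catalog[path] = meta
        return meta

    def find_by_extension(self, extension: str) -> list:
        # Indexing: look objects up by metadata instead of scanning payloads.
        return sorted(p for p, m in self.catalog.items() if m["extension"] == extension)
```

The payload is never parsed at ingestion time; only the catalog entry is derived from it, which is what lets the same store hold logs, images, and tables side by side.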
Azure Data Lake:
In April 2015, Microsoft Azure announced the Data Lake service for enterprise customers. With the Data Lake service, Microsoft shifted its data storage and analytics offering from a basic storage platform to a fully realized platform for distributed analytics and clustering for HDInsight.
Azure Data Lake is a large central storage repository based on Apache Hadoop. It is an alternative to enterprise data silos and holds a massive amount of data in its original format. Data can come from any number of sources. The Data Lake repository is not concerned with the source or purpose of the data. It simply provides a common repository on which to perform deep analytics.
Azure Data Lake has three services:
1. Data Lake Store
2. Data Lake Analytics
3. Azure HDInsight
Azure Data Lake Architecture:
Azure Data Lake is built on top of Apache Hadoop and is based on Apache YARN, Hadoop's cluster resource manager. It is Microsoft's implementation of an HDFS-compatible file system in the cloud. Azure Data Lake is a completely cloud-based solution and does not require any hardware or servers to be installed on the user's end. It can be scaled according to need.
Data Lake is compatible with the Azure Storage API and the Hadoop Distributed File System (HDFS).
Data Lake is compatible with Azure Active Directory and uses it for security and authentication.
Data Lake is designed to provide very low latency and near real-time analytics for web analytics, IoT analytics, and sensor information processing.
Data can be gathered from any source, such as social media, website and app logs, and devices and sensors, and can be stored in near-original format.
Data Lake Store:
Data Lake Store is a hyper-scale repository for big data analytics workloads. It lets users store data of any size and any format, including social media content, relational database data, and logs. It provides unlimited storage: an individual file can be a petabyte in size, and there is no retention policy. It exposes a Hadoop-compatible file system and provides compatibility with HDFS.
Service Integration for Data Lake Store:
1. Data Lake Analytics
Microsoft is planning to introduce integration services for Microsoft's Revolution-R Enterprise, Hortonworks, Cloudera, and MapR, as well as Hadoop projects such as Spark, Storm, and HBase.
Data Lake Store supports POSIX-style permissions, exposed through WebHDFS-compatible REST APIs. The WebHDFS protocol makes it possible to support all HDFS operations, such as reading, writing, accessing block locations, and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.
A new file system, AzureDataLakeFilesystem (adl://), is introduced for directly accessing the repository. Applications and systems that can use the new file system gain additional flexibility and performance over WebHDFS. Systems that are not compatible with the new file system can continue to use the WebHDFS-compatible APIs.
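The two access paths can be compared by looking at how a client would address the same file. The following Python sketch only builds the address strings and sends no requests. It assumes accounts are reachable under the `azuredatalakestore.net` host suffix and that the REST endpoint follows the standard WebHDFS `/webhdfs/v1/<path>?op=<OPERATION>` layout; the account name `mylake` and the helper function names are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumption: accounts are reachable at <account>.azuredatalakestore.net.
HOST_SUFFIX = "azuredatalakestore.net"


def adl_uri(account: str, path: str) -> str:
    """Build an adl:// URI for clients that support the new file system."""
    return f"adl://{account}.{HOST_SUFFIX}/{quote(path.lstrip('/'))}"


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style REST URL: /webhdfs/v1/<path>?op=<OPERATION>."""
    query = urlencode({"op": op, **params})
    return f"https://{account}.{HOST_SUFFIX}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


print(adl_uri("mylake", "/Samples/Data/SearchLog.tsv"))
print(webhdfs_url("mylake", "/Samples/Data", "LISTSTATUS"))
```

`LISTSTATUS` and `OPEN` are standard WebHDFS operation names; an actual request would also need to carry an Azure Active Directory token, which this sketch leaves out.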
Azure Data Lake Analytics:
Data Lake Analytics is the in-depth data analytics service in Microsoft's cloud offering. Users write the business logic for data processing in this tool. Its most important feature is that it can process unstructured data by applying schema-on-read logic, which imposes a structure on the data as it is retrieved from its source.
It supports the U-SQL language, which lets users run custom logic and user-defined functions. U-SQL provides more control over jobs and how they scale. Data Lake Analytics executes a U-SQL job as a batch script, with data retrieved in rowset format. If the source data are files, U-SQL schematizes the data on extract.
Azure HDInsight:
HDInsight is a fully managed Hadoop cluster service that supports a wide range of analytic engines, including Spark, Storm, and HBase. It is designed to take advantage of Data Lake Store to maximize security, scalability, and throughput. It supports managed clusters on Linux and Windows.
Hadoop: HDFS data storage with support for MapReduce and parallel processing.
HBase: NoSQL database built on Hadoop for large sets of structured and semi-structured data.
Storm: Distributed, real-time computational service for data streams.
U-SQL:
U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.
U-SQL can process unstructured data by applying a schema on read and inserting custom logic. Each query produces a rowset, and the rowset can be assigned to a variable.
The EXTRACT keyword reads data from a file and defines the schema on read. The OUTPUT statement writes data from a rowset to a file. Both statements use Azure Data Lake file paths.
EXTRACT UserId int,
This script reads from the source file SearchLog.tsv, schematizes it, and writes the rowset to a file called SearchLog-first-u-sql.csv.
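The same extract, schematize, and output flow can be approximated outside U-SQL. The Python sketch below is an analogy, not a translation of the script: only the `UserId int` column comes from the fragment above, while the `Region` and `Duration` columns, the sample rows, and the function names are invented for illustration.

```python
import csv
import io

# Hypothetical schema: only UserId appears in the original script fragment;
# the other columns are invented for illustration.
SCHEMA = [("UserId", int), ("Region", str), ("Duration", int)]


def extract_tsv(text: str, schema=SCHEMA) -> list:
    """Apply the schema while reading: raw text becomes typed rows."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows


def output_csv(rows: list, schema=SCHEMA) -> str:
    """Write a rowset out as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name, _ in schema])
    return buffer.getvalue()


# Invented sample data standing in for a tab-separated source file.
raw = "399266\ten-us\t73\n382045\ten-gb\t614\n"
rowset = extract_tsv(raw)
print(output_csv(rowset))
```

The stored text carries no types at all; the schema lives in the reading code, so a different job could read the same file with a different schema.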
Difference between Data Warehouse and Data Lake:
|Data Warehouse||Attribute||Data Lake|
|Structured and processed||Data||Structured, semi-structured, and unstructured|
|Schema on write||Processing||Schema on read|
|Less agile, fixed configuration||Agility||Highly agile and fully configurable|
|Business professionals||Users||Data scientists|
Data Lake Security:
Data Lake security includes:
1. Authentication
2. Authorization and access control
3. Network isolation
4. Data protection
Data Lake uses Azure Active Directory to authenticate users and enforce policy.
Authorization and access control are managed separately in Data Lake through the following settings:
1. Role-based access control (RBAC) provided by Azure for account management
2. POSIX ACL for accessing data in the store
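As a rough illustration of how POSIX-style permissions decide access, the Python sketch below evaluates a nine-character permission string for the owning user, the owning group, and everyone else. It is deliberately simplified and is not the service's actual evaluation logic: full POSIX ACLs also support named users, named groups, and a mask. The function name and all user and group names are hypothetical.

```python
def can_access(mode: str, owner: str, group: str,
               user: str, user_groups: set, wanted: str) -> bool:
    """Check a POSIX-style permission string such as 'rwxr-x---'.

    Simplified: real ACLs also support named users, named groups, and a mask.
    """
    if len(mode) != 9:
        raise ValueError("mode must look like 'rwxr-x---'")
    if user == owner:
        triad = mode[0:3]      # owning user
    elif group in user_groups:
        triad = mode[3:6]      # owning group
    else:
        triad = mode[6:9]      # everyone else
    # Every requested permission letter must be present in the chosen triad.
    return all(bit in triad for bit in wanted)


print(can_access("rwxr-x---", "alice", "analysts", "bob", {"analysts"}, "r"))
```

Here `wanted` is a string of the letters `r`, `w`, and `x`. Exactly one triad is consulted per request, so an owner whose own triad lacks a permission is denied even if the group triad grants it.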
Network isolation provides a firewall that defines IP address ranges for trusted clients; only these clients can access Data Lake.
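The range check behind such a firewall rule can be sketched with Python's standard `ipaddress` module. This assumes each rule is an inclusive (start, end) pair of IPv4 addresses; the addresses shown are documentation-only placeholders, and the function and variable names are hypothetical.

```python
import ipaddress


def is_trusted(client_ip: str, allowed_ranges: list) -> bool:
    """Return True if client_ip falls inside any (start, end) range, inclusive."""
    ip = ipaddress.ip_address(client_ip)
    for start, end in allowed_ranges:
        if ipaddress.ip_address(start) <= ip <= ipaddress.ip_address(end):
            return True
    return False


# Documentation-only addresses (RFC 5737), used here as placeholders.
RULES = [("203.0.113.10", "203.0.113.50"), ("198.51.100.7", "198.51.100.7")]

print(is_trusted("203.0.113.25", RULES))
print(is_trusted("203.0.113.51", RULES))
```

A single trusted address is expressed as a range whose start and end are the same. The sketch does not handle mixing IPv4 and IPv6 addresses in one comparison.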
Data protection uses the Transport Layer Security (TLS) protocol to secure data in transit over the network.
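On the client side, insisting on TLS usually means verifying the server certificate and refusing old protocol versions. The Python sketch below configures such a context with the standard `ssl` module. Treating TLS 1.2 as the minimum version is an assumption made for illustration, not a documented requirement of the service, and the function name is hypothetical.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Client-side context that verifies certificates and refuses pre-1.2 TLS."""
    context = ssl.create_default_context()            # certificate + hostname checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumption: 1.2 as the floor
    return context


ctx = strict_tls_context()
print(ctx.minimum_version)
```

A context like this can be passed to an HTTPS client so that every connection to the store is encrypted and authenticated.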
Auditing and diagnostic logs are shown in the Azure portal.
Data Lake Pricing:
Data Lake Store:
|First 100 TB||Rs. 2.58 per GB|
|Next 100 TB to 1,000 TB||Rs. 2.52 per GB|
|Next 1,000 TB to 5,000 TB||Rs. 2.45 per GB|
|Over 5,000 TB||Custom by contacting Microsoft|
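The tiers above can be turned into a small cost estimate. This Python sketch assumes the pricing is graduated, so each rate applies only to the data that falls inside its tier, and that 1 TB equals 1,024 GB. Both are assumptions, and the function name is hypothetical. The rates are copied from the table.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

# (upper bound of tier in TB, price in Rs. per GB), copied from the table above.
TIERS = [(100, 2.58), (1000, 2.52), (5000, 2.45)]


def monthly_storage_cost(stored_tb: float) -> float:
    """Graduated cost in Rs.; each tier's rate applies only to the data inside it."""
    if stored_tb > TIERS[-1][0]:
        raise ValueError("over 5,000 TB is custom-priced")
    cost, lower = 0.0, 0.0
    for upper, rate in TIERS:
        if stored_tb <= lower:
            break
        billable_tb = min(stored_tb, upper) - lower
        cost += billable_tb * GB_PER_TB * rate
        lower = upper
    return round(cost, 2)


print(monthly_storage_cost(150))
```

For 150 TB, the first 100 TB are billed at Rs. 2.58 per GB and the remaining 50 TB at Rs. 2.52 per GB.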
Monthly commitment packages
|Committed Capacity||Price/Month||Savings over pay-as-you-go|
|1 TB||Rs. 2,313.37||12%|
|10 TB||Rs. 21,150.80||19%|
|100 TB||Rs. 1,91,679.13||27%|
|500 TB||Rs. 8,79,080.13||31%|
|1,000 TB||Rs. 17,18,502.50||33%|
|Over 1,000 TB||Custom by contacting Microsoft|
Transaction pricing:
|Write operations (per 10,000)||Rs. 3.31|
|Read operations (per 10,000)||Rs. 0.27|
|Transaction size limit||No limit|
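A transaction bill can be estimated from the two per-10,000 rates above. The Python sketch assumes operations are billed pro rata rather than rounded up to whole blocks of 10,000, which is an assumption; the function name is hypothetical.

```python
WRITE_RATE = 3.31   # Rs. per 10,000 write operations (from the table)
READ_RATE = 0.27    # Rs. per 10,000 read operations (from the table)


def transaction_cost(writes: int, reads: int) -> float:
    """Cost in Rs., assuming linear (pro-rata) billing per operation."""
    return round(writes / 10_000 * WRITE_RATE + reads / 10_000 * READ_RATE, 2)


print(transaction_cost(1_000_000, 5_000_000))
```

Writes cost roughly twelve times as much as reads at these rates, so write-heavy ingestion dominates the transaction bill.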
Data Lake Analytics:
|Analytics Unit||Rs. 132.20/hour|
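The pay-as-you-go rate above can be applied to a single job as follows. The Python sketch assumes the charge is proportional to Analytics Unit hours (units multiplied by runtime), with no minimum charge or rounding; the function name is hypothetical.

```python
AU_HOUR_RATE = 132.20  # Rs. per Analytics Unit per hour (from the table)


def job_cost(analytics_units: int, runtime_minutes: float) -> float:
    """Cost in Rs. of one job, assuming billing is proportional to AU-hours."""
    au_hours = analytics_units * runtime_minutes / 60
    return round(au_hours * AU_HOUR_RATE, 2)


print(job_cost(10, 30))
```

A job that uses 10 Analytics Units for 30 minutes consumes 5 AU-hours under this model.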
Monthly Committed Price
|Included Analytics Unit Hours||Price/Month||Savings over Pay-As-You-Go|
|> 1,00,000||Custom by contacting Microsoft|
1. Highly scalable and redundant data storage for all needs.
2. Data Lake stores all kinds of data, such as logs, XML, multimedia, sensor data, binary, social data, chat, and people data.
3. No limit on data storage or file size.
4. In-depth analytics.
5. It supports schema-less storage, whereas a data warehouse does not.