Azure Data Lake

Introduction:

A data lake is a system for storing vast amounts of data in its original or near-original format so that analytics and big data analysis can be run on it. The data can be files, blobs, or objects.

A data lake can store structured, semi-structured, and unstructured data in its original format for later processing.

Example: Apache Hadoop

Data lakes centralize data once it has been gathered from its sources. They combine the data and store it for processing. The contents of a data lake can be normalized and enriched. A data lake supports metadata extraction, indexing, formatting and conversion, segregation, augmentation, aggregation, and cross-linking.

Azure Data Lake:

In April 2015, Microsoft announced the Azure Data Lake service for enterprise customers. With this service, Microsoft shifted its data storage and analytics offering from a basic storage platform to a fully realized platform for distributed analytics and clustering for HDInsight.

Azure Data Lake is a large central storage repository based on Apache Hadoop. It is an alternative to enterprise data silos and holds massive amounts of data in its original format. Data can come from any number of sources. The Data Lake repository is not concerned with the source or purpose of the data. It simply provides a common repository on which to perform deep analytics.

Azure Data Lake has three services.

1. Data Lake Store
2. Data Lake Analytics
3. Azure HDInsight

Azure Data Lake Architecture:

Azure Data Lake is built on top of Apache Hadoop and is based on Apache YARN, Hadoop’s cluster resource manager. It is Microsoft’s implementation of the HDFS file system in the cloud. Azure Data Lake is a completely cloud-based solution and does not require any hardware or servers to be installed on the user’s side. It can be scaled as needed.

Data Lake is compatible with the Azure Storage API and the Hadoop Distributed File System (HDFS).

Data Lake integrates with Azure Active Directory and uses it for security and authentication.
Data Lake is designed for very low latency and near real-time analytics, for example web analytics, IoT analytics, and sensor data processing.

Data can be gathered from any source, such as social media, website and app logs, devices, and sensors, and stored in near-original format.

Data Lake Store:

Data Lake Store is a hyper-scale repository for big data analytics workloads. It lets users store data of any size and format, from social media content to relational databases and logs. It places no limit on the total amount of data stored. An individual file can be a petabyte in size, and no retention policy is imposed. It uses the Hadoop file system and is compatible with HDFS.

Service Integration for Data Lake Store:

1. Data Lake Analytics
2. HDInsight

Microsoft is planning to introduce integration services for Microsoft’s Revolution-R Enterprise, Hortonworks, Cloudera, and MapR, and Hadoop projects such as Spark, Storm, and HBase.

Data Lake Store supports POSIX-style permissions exposed through the WebHDFS-compatible REST APIs. The WebHDFS protocol makes it possible to support all HDFS operations such as read, write, accessing block locations and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.
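
As a rough sketch of what a WebHDFS-compatible request looks like, the Python snippet below builds REST URLs without sending them. The account name `mystore`, the helper name `webhdfs_url`, and the sample paths are illustrative assumptions; the `/webhdfs/v1/` prefix and the `op` query parameter follow the Apache WebHDFS convention. A real call would also need an Azure Active Directory bearer token, which is left out here.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style URL for a Data Lake Store account (hypothetical helper)."""
    # Assumed endpoint pattern: https://<account>.azuredatalakestore.net/webhdfs/v1/<path>
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1"
    # The operation (OPEN, LISTSTATUS, CREATE, ...) travels in the "op" query parameter.
    query = urlencode({"op": op, **params})
    return f"{base}{quote(path)}?{query}"

if __name__ == "__main__":
    print(webhdfs_url("mystore", "/Samples/Data/SearchLog.tsv", "OPEN"))
    print(webhdfs_url("mystore", "/Samples/Data", "LISTSTATUS"))
```

Keeping URL construction separate from the HTTP call makes it easy to inspect the request before any credentials are involved.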

A new file system, AzureDataLakeFilesystem (adl://), has been introduced for accessing the repository directly. Applications and systems that can use the new file system gain additional flexibility and performance over WebHDFS. Systems that are not compatible with it can continue to use the WebHDFS-compatible APIs.

Azure Data Lake Analytics:

Data Lake Analytics is the in-depth data analytics service in Microsoft's cloud offering. Users write the business logic for data processing in it. Its most important feature is that it can process unstructured data by applying schema on read, which imposes a structure on the data as you retrieve it from its source.

It supports the U-SQL language, which lets users run custom logic and user-defined functions. U-SQL gives more control over jobs and their scalability. Data Lake Analytics executes a U-SQL job as a batch script, with data retrieved in a rowset format. If the source data are files, U-SQL schematizes the data on extract.

Azure HDInsight: 

HDInsight is a fully managed Hadoop cluster service that supports a wide range of analytic engines, including Spark, Storm, and HBase. It is designed to take advantage of Data Lake Store in order to maximize security, scalability, and throughput. It supports managed clusters on Linux and Windows.

Hadoop: HDFS data storage with support for MapReduce and parallel processing.
HBase: NoSQL database built on Hadoop for large sets of structured and semi-structured data.
Storm: Distributed, real-time computation service for data streams.

U-SQL:

U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.
U-SQL can process unstructured data by applying schema on read and inserting custom logic. Each query produces a rowset and the rowset can be assigned to a variable.

The EXTRACT statement reads data from a file and defines the schema on read. The OUTPUT statement writes data from a rowset to a file. Both statements take an Azure Data Lake file path.

Example: adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv

Example Script:
// Read the TSV file and apply the schema on read.
@searchlog =
    EXTRACT UserId      int,
            Start       DateTime,
            Region      string,
            Query       string,
            Duration    int,
            Urls        string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Write the rowset out as a CSV file.
OUTPUT @searchlog
    TO "/output/SearchLog-first-u-sql.csv"
    USING Outputters.Csv();

This script reads the source file SearchLog.tsv, schematizes it, and writes the rowset to a file called SearchLog-first-u-sql.csv.
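
To make schema on read concrete outside U-SQL, here is a minimal Python sketch of the same idea: raw tab-separated text is given column names and types only at read time, then written out as CSV. This is not how Data Lake Analytics works internally. The function names and the two sample rows are made up for illustration, and only three of the script's columns are used to keep it short.

```python
import csv
import io

# Schema applied at read time: (column name, converter). Mirrors part of the EXTRACT clause.
SCHEMA = [("UserId", int), ("Region", str), ("Duration", int)]

def extract_tsv(text, schema):
    """Parse tab-separated text into a list of typed dicts (the 'rowset')."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({name: conv(value) for (name, conv), value in zip(schema, fields)})
    return rows

def output_csv(rowset, schema):
    """Serialize the rowset as comma-separated text, one line per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rowset:
        writer.writerow([row[name] for name, _ in schema])
    return buf.getvalue()

if __name__ == "__main__":
    raw = "399266\ten-us\t73\n382045\ten-gb\t614\n"  # made-up sample rows
    rowset = extract_tsv(raw, SCHEMA)
    print(rowset[0])
    print(output_csv(rowset, SCHEMA), end="")
```

The stored text never changes; a different schema could be applied to the same bytes on the next read, which is the practical difference from schema on write.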

Difference between Data Warehouse and Data Lake:

Attribute | Data Warehouse | Data Lake
Data | Structured and processed | Semi-structured, unstructured, and structured
Processing | Schema on write | Schema on read
Storage | Expensive | Low cost
Agility | Less agile, fixed configuration | Highly agile, fully configurable
Security | Mature | Mature
Users | Business professionals | Data scientists

Data Lake Security:

Data Lake security includes 

1. Authentication
2. Authorization
3. Network isolation
4. Data protection
5. Auditing

Data Lake uses Azure Active Directory to authenticate users and enforce policy.
Authorization and access control are handled separately in Data Lake through the following mechanisms:

1. Role-based access control (RBAC) provided by Azure for account management
2. POSIX ACL for accessing data in the store
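
For readers unfamiliar with POSIX-style permissions, the short Python sketch below shows how read, write, and execute bits are checked for the owner, group, and other classes of a mode such as 0o750. It covers only the basic three-class model; real ACLs can also carry named user and group entries and a mask, which are not modeled here. The function name `allowed` is hypothetical.

```python
PERM_BITS = {"r": 4, "w": 2, "x": 1}          # value of each permission within one class
SHIFT = {"owner": 6, "group": 3, "other": 0}  # bit offset of each class in the mode

def allowed(mode, who, perm):
    """Check one permission letter for one class in a POSIX mode such as 0o750."""
    return bool((mode >> SHIFT[who]) & PERM_BITS[perm])

if __name__ == "__main__":
    mode = 0o750  # owner: rwx, group: r-x, other: ---
    for who in ("owner", "group", "other"):
        print(who, {p: allowed(mode, who, p) for p in "rwx"})
```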

Network isolation provides a firewall in which you define an IP address range for trusted clients; only those clients can access Data Lake.
Data protection uses the Transport Layer Security (TLS) protocol to secure data over the network.
Auditing and diagnostic logs are available in the Azure portal.
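
The IP range rule used for network isolation can be pictured with a small Python sketch. It only illustrates the start-to-end range check; the rule values and the function name `is_trusted` are hypothetical, and it assumes every address is IPv4.

```python
import ipaddress

def is_trusted(client_ip, allowed_ranges):
    """Return True if client_ip falls inside any (start, end) inclusive range."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ipaddress.ip_address(start) <= ip <= ipaddress.ip_address(end)
        for start, end in allowed_ranges
    )

if __name__ == "__main__":
    rules = [("203.0.113.10", "203.0.113.50")]  # hypothetical firewall rule
    print(is_trusted("203.0.113.25", rules))
    print(is_trusted("198.51.100.7", rules))
```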

Data Lake Pricing:

Data Lake Store:

Pay-as-you-go

Usage | Price/Month
First 100 TB | Rs. 2.58 per GB
Next 100 TB to 1,000 TB | Rs. 2.52 per GB
Next 1,000 TB to 5,000 TB | Rs. 2.45 per GB
Over 5,000 TB | Custom by contacting Microsoft
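
A quick way to estimate a pay-as-you-go bill from these tiers is sketched below in Python. It rests on two assumptions that the table does not state: each tier's rate applies only to the data that falls within that tier (graduated pricing), and 1 TB is counted as 1,024 GB. Rates are held in paise to keep the arithmetic in integers.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

# (upper bound of tier in TB, price in paise per GB per month), from the table above
TIERS = [(100, 258), (1000, 252), (5000, 245)]

def monthly_storage_cost(tb):
    """Return the pay-as-you-go cost in rupees for tb terabytes, billed tier by tier."""
    if tb > TIERS[-1][0]:
        raise ValueError("over 5,000 TB is custom pricing")
    paise, lower = 0, 0
    for upper, rate in TIERS:
        # Only the portion of the data that lies inside this tier is billed at its rate.
        in_tier = max(0, min(tb, upper) - lower)
        paise += in_tier * GB_PER_TB * rate
        lower = upper
    return paise / 100

if __name__ == "__main__":
    for size in (1, 150, 2000):
        print(size, "TB:", monthly_storage_cost(size))
```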

Monthly commitment packages

Committed Capacity | Price/Month | Savings over pay-as-you-go
1 TB | Rs. 2,313.37 | 12%
10 TB | Rs. 21,150.80 | 19%
100 TB | Rs. 1,91,679.13 | 27%
500 TB | Rs. 8,79,080.13 | 31%
1,000 TB | Rs. 17,18,502.50 | 33%
Over 1,000 TB | Custom by contacting Microsoft |

Price for transaction:

Usage | Price
Write operations (per 10,000) | Rs. 3.31
Read operations (per 10,000) | Rs. 0.27
Delete operations | Free
Transaction size limit | No limit

Data Lake Analytics:

Pay-as-you-go

Usage | Price
Analytics Unit | Rs. 132.20/hour

Monthly Committed Price

Included Analytics Unit Hours | Price/Month | Savings over Pay-As-You-Go
100 | Rs. 6,610 | 50%
500 | Rs. 29,744 | 55%
1,000 | Rs. 52,877 | 60%
5,000 | Rs. 2,37,947 | 64%
10,000 | Rs. 4,29,626 | 67%
50,000 | Rs. 19,16,792 | 71%
1,00,000 | Rs. 34,37,005 | 74%
Over 1,00,000 | Custom by contacting Microsoft |
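
The savings column can be checked against the pay-as-you-go rate of Rs. 132.20 per Analytics Unit hour with a few lines of Python. This is only a sanity check under the assumption that every included hour is actually used; the function name is hypothetical, and since the table's percentages appear to be rounded, small differences are to be expected.

```python
PAYG_PAISE_PER_AU_HOUR = 13220  # Rs. 132.20 per Analytics Unit hour, in paise

def savings_percent(included_hours, package_price_rs):
    """Percentage saved by a monthly package versus paying the hourly rate."""
    payg_rs = included_hours * PAYG_PAISE_PER_AU_HOUR / 100
    return round((1 - package_price_rs / payg_rs) * 100, 1)

if __name__ == "__main__":
    # (included hours, package price in rupees) taken from the table
    packages = [(100, 6610), (500, 29744), (1000, 52877), (5000, 237947)]
    for hours, price in packages:
        print(hours, "AU hours:", savings_percent(hours, price), "% saved")
```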

Benefits:

1. Highly scalable and redundant data storage for all needs.
2. Data Lake stores everything: logs, XML, multimedia, sensor data, binary, social data, chat, and people data.
3. No limit on data storage or file size.
4. In-depth analytics.
5. It supports schema-less storage, whereas a data warehouse does not.
