Azure Data Lake | Learn Azure Data Lake Architecture

Table of Contents

What is a Data Lake?

What is Azure Data Lake?

Azure Data Lake services

Azure Data Lake Architecture

Difference between Data Warehouse and Data Lake

Data Lake Security

Benefits of using Azure Data Lake

Data Lake Pricing

Azure Data Lake is a Microsoft service built to simplify big data storage and analytics. It stores vast amounts of data in its original format for later processing and analytics, which simplifies data management for developers, data scientists, and analysts. It integrates with other Azure services and is designed to address the productivity and scalability challenges that organizations face with big data.

Example: Apache Hadoop

What is a Data Lake?

A data lake is a large, centralized repository that stores vast amounts of raw data in its original format for future use by data engineers. Structured, semi-structured, and unstructured data can all be stored in native form for processing and in-depth analysis. Data lakes are designed to scale without fixed limits on total storage or file size, and they allow several kinds of data access, including programmatic access, SQL-like queries, and REST calls. They support metadata extraction, indexing, formatting and conversion, segregation, augmentation, aggregation, and cross-linking.

What is Azure Data Lake?

In April 2015, Microsoft Azure announced the Data Lake service for enterprise customers. With Data Lake services, Microsoft shifted its data storage and analytics service from a basic storage platform to a fully realized platform for distributed analytics and clustering for HDInsight.

Built on YARN and HDFS, Azure Data Lake is a large central storage repository based on Apache Hadoop. It is an alternative to enterprise data silos and holds a massive amount of data in its original format. Data Lake in Azure can store and analyze large volumes of varied data arriving at varying speeds. It places no constraints on the source or purpose of the data; it simply provides a common repository for deep analytics.

Azure Data Lake services

1. Azure Data Lake Store:

Data Lake Store is a hyper-scale repository for big data analytics workloads. It lets users store data of any size and format, such as social media content, relational data, and logs. It provides unlimited storage for both structured and unstructured data. An individual file can be a petabyte in size, and no retention policy is imposed. It is a Hadoop Distributed File System (HDFS)-compatible store for the cloud.

Service Integration for Data Lake Store

Data Lake Analytics
HDInsight

Microsoft is planning to introduce integration services for Microsoft’s Revolution-R Enterprise, Hortonworks, Cloudera, and MapR, and Hadoop projects such as Spark, Storm, and HBase.

Data Lake Store supports POSIX-style permissions exposed through WebHDFS-compatible REST APIs. The WebHDFS protocol makes it possible to support HDFS operations such as reading, writing, accessing block locations, and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.

A new file system, the Azure Data Lake Filesystem (adl://), is introduced for accessing the repository directly. Applications and systems that can use the new file system gain additional flexibility and performance over WebHDFS. Systems that are not compatible with the new file system can continue to use the WebHDFS-compatible APIs.
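
As a rough illustration of what a WebHDFS-compatible REST call looks like, the sketch below only builds request URLs; it does not send anything. It assumes the endpoint pattern https://<account>.azuredatalakestore.net/webhdfs/v1/<path>?op=<OPERATION>, reuses the account name "mystore" from the path example in this article, and the helper name webhdfs_url is hypothetical. A real request would also need an Azure Active Directory access token.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style REST URL for one file system operation.

    Assumption: the store exposes WebHDFS-compatible endpoints under
    https://<account>.azuredatalakestore.net/webhdfs/v1/.
    """
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact and escapes characters such as spaces.
    return (
        f"https://{account}.azuredatalakestore.net"
        f"/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"
    )


# OPEN reads a file; LISTSTATUS lists a directory (standard WebHDFS operations).
open_url = webhdfs_url("mystore", "/Samples/Data/SearchLog.tsv", "OPEN")
list_url = webhdfs_url("mystore", "/Samples/Data", "LISTSTATUS")
print(open_url)
print(list_url)
```

Keeping the operation name in the query string is what lets one URL scheme cover reads, writes, and directory listings.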

2. Azure Data Lake Analytics:

Azure Data Lake Analytics is Microsoft's analytics offering for the data lake. It is an analytics service in which users write business logic for data processing. Its most important feature is the ability to process unstructured data by applying schema on read, which imposes a structure on the data as it is retrieved from its source. The source can be Data Lake Store or Azure Storage.

It supports the U-SQL language, which lets users run custom logic and user-defined functions and gives them more control over how jobs scale. Data Lake Analytics executes a U-SQL job as a batch script and retrieves data as rowsets. If the source data is in files, U-SQL schematizes the data upon extraction.

3. Azure HDInsight:

HDInsight is a fully managed Hadoop cluster service that supports a wide range of analytic engines, including Spark, Storm, and HBase. It is designed to take advantage of Data Lake Store to maximize security, scalability, and throughput. It supports managed clusters on Linux and Windows.

Hadoop: HDFS data storage with support for MapReduce and parallel processing.
HBase: NoSQL database built on Hadoop for large sets of structured and semi-structured data.
Storm: Distributed, real-time computational service for data streams.

U-SQL

U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.

U-SQL can process unstructured data by applying schema on read and inserting custom logic. Each query produces a rowset, and the rowset can be assigned to a variable.

The EXTRACT statement reads data from a file and defines the schema on read. The OUTPUT statement writes data from a rowset to a file. Both statements take an Azure Data Lake file path.

Example: adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv
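
To make the parts of such a path explicit, here is a small sketch that splits an adl:// URI into the store account and the path inside the store. The function name split_adl_path is hypothetical, and the sketch assumes the account name is the first label of the host name, as in the example above.

```python
from urllib.parse import urlparse


def split_adl_path(uri: str) -> tuple[str, str]:
    """Return (account name, path inside the store) for an adl:// URI."""
    parts = urlparse(uri)
    if parts.scheme != "adl" or not parts.hostname:
        raise ValueError(f"not an adl:// URI: {uri!r}")
    # Assumption: the account is the first label of the host name.
    account = parts.hostname.split(".")[0]
    return account, parts.path


account, path = split_adl_path(
    "adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv"
)
print(account)
print(path)
```

The second value is the same store-relative form that the example script passes to FROM and TO.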

Example Script:

// Read the TSV file and apply the schema on read.
@searchlog =
    EXTRACT UserId      int,
            Start       DateTime,
            Region      string,
            Query       string,
            Duration    int,
            URLs        string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Write the rowset to a CSV file.
OUTPUT @searchlog
    TO "/output/SearchLog-first-u-sql.csv"
    USING Outputters.Csv();

This script reads the source file SearchLog.tsv, schematizes it, and writes the rowset to a file called SearchLog-first-u-sql.csv.
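
For readers who do not have a U-SQL environment, the same schema-on-read idea can be sketched in plain Python. This is an analogy, not how Data Lake Analytics runs the job: the column list mirrors the EXTRACT clause, while the function names and the sample row are hypothetical, since the contents of SearchLog.tsv are not shown in this article.

```python
import csv
import io
from datetime import datetime

# Column names and converters mirror the columns of the EXTRACT clause.
SCHEMA = [
    ("UserId", int),
    ("Start", datetime.fromisoformat),
    ("Region", str),
    ("Query", str),
    ("Duration", int),
    ("URLs", str),
    ("ClickedUrls", str),
]


def extract_tsv(text: str) -> list[dict]:
    """Apply SCHEMA to raw tab-separated text while reading it."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append(
            {name: convert(value) for (name, convert), value in zip(SCHEMA, fields)}
        )
    return rows


def output_csv(rows: list[dict]) -> str:
    """Serialize the typed rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name, _ in SCHEMA])
    return buffer.getvalue()


# Hypothetical sample row, for illustration only.
raw = (
    "399266\t2012-02-15T11:53:16\ten-us\thow to make nachos\t73\t"
    "www.a.example;www.b.example\twww.a.example\n"
)
rows = extract_tsv(raw)
print(type(rows[0]["Start"]).__name__, rows[0]["Duration"] + 1)
print(output_csv(rows), end="")
```

The raw text stays untyped until extract_tsv runs, which is the point of schema on read: the structure lives in the query, not in the stored file.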

Azure Data Lake Architecture

Azure Data Lake is built on top of Apache Hadoop and uses Apache YARN for cluster resource management. It is Microsoft's implementation of an HDFS-compatible file system in the cloud. Azure Data Lake is entirely cloud-based, so users do not need to install any hardware or servers, and it can be scaled according to need.

Data Lake is compatible with the Azure Storage API and the Hadoop Distributed File System.

Data Lake is compatible with Azure Active Directory and uses it for security and authentication.

Data Lake is designed for very low latency and near-real-time analytics in scenarios such as web analytics, IoT analytics, and sensor data processing.

Data can be gathered from any source, such as social media, website and app logs, and devices and sensors, and can be stored in near-original format.

Difference between Data Warehouse and Data Lake

Attribute	Data Warehouse	Data Lake
Data	Structured and processed	Structured, semi-structured, and unstructured
Processing	Schema on write	Schema on read
Storage	Expensive	Low cost
Agility	Less agile, fixed configuration	Highly agile, fully configurable
Security	Mature	Mature
Users	Business professionals	Data scientists

Data Lake Security

Data Lake Security includes:

Authentication
Authorization
Network isolation
Data protection
Auditing

Data Lake uses Azure Active Directory to authenticate users and enforce policies.

Authorization and access control are handled separately in Data Lake, through the following two mechanisms:

Role-based access control (RBAC) provided by Azure for account management.
POSIX ACL for accessing data in the store.
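
The POSIX-style part of this model can be sketched with classic owner, group, and other permission bits. This is a simplified illustration and not the service's implementation: it ignores named-user and named-group ACL entries and the mask, and the Entry class, is_allowed function, and user names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A file or folder with POSIX-style owner, group, and mode bits."""

    owner: str
    group: str
    mode: int  # for example 0o640 means rw- r-- ---


BITS = {"r": 4, "w": 2, "x": 1}


def is_allowed(entry: Entry, user: str, groups: set[str], want: str) -> bool:
    """Check one permission ('r', 'w' or 'x') for a user.

    The first matching class decides: owner, then owning group, then other.
    """
    if user == entry.owner:
        shift = 6
    elif entry.group in groups:
        shift = 3
    else:
        shift = 0
    return bool((entry.mode >> shift) & BITS[want])


log = Entry(owner="alice", group="analysts", mode=0o640)
print(is_allowed(log, "alice", set(), "w"))        # owner may write
print(is_allowed(log, "bob", {"analysts"}, "r"))   # group member may read
print(is_allowed(log, "bob", {"analysts"}, "w"))   # group member may not write
print(is_allowed(log, "carol", set(), "r"))        # others have no access
```

RBAC answers a different question (who may manage the account), which is why the two mechanisms are configured separately.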

Network isolation uses firewall rules that define IP address ranges for trusted clients; only these clients can access Data Lake. Data protection uses the Transport Layer Security (TLS) protocol to secure data in transit over the network. Audit and diagnostic logs are available in the Azure portal.
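
The IP-range check behind network isolation can be sketched as follows. The rule list and the is_trusted function are hypothetical, and the addresses come from ranges reserved for documentation; real rules are configured on the Data Lake account rather than in application code.

```python
from ipaddress import ip_address

# Hypothetical firewall rules: (start, end) pairs of trusted client addresses.
ALLOWED_RANGES = [
    (ip_address("203.0.113.10"), ip_address("203.0.113.50")),
    (ip_address("198.51.100.7"), ip_address("198.51.100.7")),  # single address
]


def is_trusted(client: str) -> bool:
    """Return True if the client address falls inside any allowed range."""
    address = ip_address(client)
    return any(start <= address <= end for start, end in ALLOWED_RANGES)


print(is_trusted("203.0.113.25"))
print(is_trusted("203.0.113.51"))
```

Requests from any address outside the listed ranges are rejected before authentication or ACL checks come into play.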

Uses of Azure Data Lake

General-purpose object storage managed by Azure.
Streaming and batch workload processing.
Curation of data by analysts and data engineers for specific needs without making copies.

Benefits of using Azure Data Lake

Highly scalable and flexible because it is hosted in the cloud.
Streamlines data storage for all enterprise needs.
Large-scale data can be processed in parallel, providing quick access to insights.
Stores many kinds of data, such as logs, XML, multimedia, sensor data, binary, social data, chat, and people data.
No limit on data storage or file size.
Supports heavy workloads for in-depth analytics.
Supports schema-less storage, which a data warehouse does not.

Azure Data Lake Storage Gen2

Built on Azure Blob storage, Azure Data Lake Storage Gen2 offers file system semantics, directory- and file-level security, low-cost tiered storage, high availability and disaster recovery, and scalability. It combines the best features of Azure Blob storage and Azure Data Lake Storage Gen1.

Data Lake Pricing

Data Lake Store:

Pay-as-you-go

Usage	Price/Month
First 100 TB	Rs. 2.58 per GB
Next 100 TB to 1,000 TB	Rs. 2.52 per GB
Next 1,000 TB to 5,000 TB	Rs. 2.45 per GB
Over 5,000 TB	Custom by contacting Microsoft
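
As a worked example of the pay-as-you-go table, the sketch below estimates a monthly storage bill. It rests on two assumptions that the table does not state: that pricing is graduated (each rate applies only to the data inside its tier) and that 1 TB is counted as 1,024 GB. The function name is hypothetical.

```python
from decimal import Decimal

GB_PER_TB = 1024  # assumption: binary units, not stated in the table

# (tier upper bound in TB, price in Rs. per GB per month), from the table above.
TIERS = [
    (100, Decimal("2.58")),
    (1000, Decimal("2.52")),
    (5000, Decimal("2.45")),
]


def monthly_storage_cost(stored_tb: int) -> Decimal:
    """Graduated pay-as-you-go cost in rupees for up to 5,000 TB."""
    if not 0 <= stored_tb <= 5000:
        raise ValueError("pricing above 5,000 TB is custom")
    cost = Decimal(0)
    lower = 0
    for upper, price_per_gb in TIERS:
        # Only the portion of the data that falls inside this tier is billed here.
        tb_in_tier = max(0, min(stored_tb, upper) - lower)
        cost += tb_in_tier * GB_PER_TB * price_per_gb
        lower = upper
    return cost


# 150 TB: the first 100 TB at Rs. 2.58 per GB, the next 50 TB at Rs. 2.52 per GB.
print(monthly_storage_cost(150))
```

Under these assumptions, 1 TB costs Rs. 2,641.92 per month pay-as-you-go, which is the baseline that a commitment package would be compared against.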

Monthly commitment packages

Committed Capacity	Price/Month	Savings over pay-as-you-go
1 TB	Rs. 2,313.37	12%
10 TB	Rs. 21,150.80	19%
100 TB	Rs. 1,91,679.13	27%
500 TB	Rs. 8,79,080.13	31%
1,000 TB	Rs. 17,18,502.50	33%
Over 1,000 TB	Custom by contacting Microsoft

Transaction pricing:

Usage	Price
Write operations (per 10,000)	Rs. 3.31
Read operations (per 10,000)	Rs. 0.27
Delete operations	Free
Transaction size limit	No limit

Data Lake Analytics:

Pay-as-you-go

Usage	Price
Analytics Unit	Rs. 132.20/hour

Monthly Committed Price

Included Analytics Unit Hours	Price/Month	Savings over Pay-As-You-Go
100	Rs. 6,610	50%
500	Rs. 29,744	55%
1,000	Rs. 52,877	60%
5,000	Rs. 2,37,947	64%
10,000	Rs. 4,29,626	67%
50,000	Rs. 19,16,792	71%
1,00,000	Rs. 34,37,005	74%
> 1,00,000	Custom by contacting Microsoft
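
The savings column can be checked against the pay-as-you-go rate with a few lines of arithmetic. The sketch below uses the Rs. 132.20 hourly rate and package prices from the tables above; the function name is hypothetical, and results are rounded to one decimal place, so they may differ from the whole-number percentages in the table by rounding.

```python
from decimal import Decimal

PAYG_RATE = Decimal("132.20")  # Rs. per Analytics Unit hour, from the table above


def committed_savings(hours: int, package_price: str) -> Decimal:
    """Percent saved versus buying the same hours pay-as-you-go."""
    payg_cost = PAYG_RATE * hours
    saved = payg_cost - Decimal(package_price)
    return (saved / payg_cost * 100).quantize(Decimal("0.1"))


print(committed_savings(100, "6610"))
print(committed_savings(10000, "429626"))
```

The discount only pays off if the included hours are actually used, since the package price is owed whether or not the hours are consumed.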

The Microsoft Azure Data Lake architecture helps data scientists, engineers, and analysts by addressing many of their big data challenges. This scalable cloud data lake offers a single storage structure for multiple analytics projects of different sizes.

Last updated: 13 Oct 2025

About Author

Anji Velagana

Anji Velagana is working as a Digital Marketing Analyst and Content Contributor for Mindmajix. He writes about various platforms like Servicenow, Business analysis, Performance testing, Mulesoft, Oracle Exadata, Azure, and few other courses. Contact him via anjivelagana@gmail.com and LinkedIn.
