Everything You Need To Know About Azure Data Lake


Azure Data Lake is a Microsoft service built to simplify big data storage and analytics. It stores vast amounts of data in its original format so that the data can be processed and analyzed later. Developers, data scientists, and analysts use it to simplify data management and processing. It integrates with other Azure services and targets the productivity and scalability challenges that organizations face with big data.



What is a Data Lake?

A data lake is a large, centralized repository that stores vast amounts of raw data in its original format for later use. Structured, semi-structured, and unstructured data can all be kept in native form for processing and in-depth analysis. A data lake places no fixed limit on storage space or file size, and the data can be accessed in several ways, including programmatically, through SQL-like queries, and through REST calls. It supports metadata extraction, indexing, formatting and conversion, segregation, augmentation, aggregation, and cross-linking.

What is Azure Data Lake?

In April 2015, Microsoft announced the Azure Data Lake service for enterprise customers. With it, Microsoft extended its data storage and analytics offering from a basic storage platform into a full platform for distributed analytics and HDInsight clusters.

Built on YARN and HDFS, Azure Data Lake is a large central storage repository based on Apache Hadoop. It is an alternative to enterprise data silos and holds massive amounts of data in its original format. It can store and analyze large volumes of varied data arriving at varying speeds. It is agnostic to the source and purpose of the data; it simply provides a common repository on which to perform deep analytics.


Azure Data Lake services

1. Azure Data Lake Store:

Data Lake Store is a hyper-scale repository for big data analytics workloads. It stores data of any size and format, such as social media content, relational data, and logs. Storage for structured and unstructured data is unlimited: an individual file can be a petabyte in size, and no retention policy is imposed. It is an implementation of the Hadoop Distributed File System (HDFS) for the cloud.

Service Integration for Data Lake Store

  • Data Lake Analytics
  • HDInsight

Microsoft is planning to add integration with Revolution R Enterprise, with the Hadoop distributions from Hortonworks, Cloudera, and MapR, and with Hadoop projects such as Spark, Storm, and HBase.

Data Lake Store supports POSIX-style permissions, exposed through WebHDFS-compatible REST APIs. The WebHDFS protocol supports HDFS operations such as reading, writing, accessing block locations, and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.
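As a rough illustration of what a WebHDFS-compatible REST call looks like, the Python sketch below only builds request URLs and sends nothing over the network. The `/webhdfs/v1` prefix, the `op` query parameter, and the `LISTSTATUS` and `OPEN` operations follow the WebHDFS convention; the account name `mystore` and the helper `webhdfs_url` are hypothetical, and a real request would also need an Azure Active Directory access token.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style URL for a Data Lake Store account (no request is sent)."""
    # Host pattern assumed from the adl:// example path used in this article.
    host = f"{account}.azuredatalakestore.net"
    query = urlencode({"op": op, **params})
    return f"https://{host}/webhdfs/v1{quote(path)}?{query}"


# List a directory, then read the first 1024 bytes of a file.
list_url = webhdfs_url("mystore", "/Samples/Data", "LISTSTATUS")
open_url = webhdfs_url("mystore", "/Samples/Data/SearchLog.tsv", "OPEN", offset=0, length=1024)
print(list_url)
print(open_url)
```

Keeping URL construction separate from the HTTP call makes it easy to check the request shape before adding authentication.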

A new file system, the Azure Data Lake Filesystem (adl://), is introduced for accessing the repository directly. Applications and systems that can use the new file system gain additional flexibility and performance over WebHDFS. Systems that are not compatible with it can continue to use the WebHDFS-compatible APIs.


2. Azure Data Lake Analytics:

Azure Data Lake Analytics is the latest Microsoft data lake offering. It is an analytics service in which users write the business logic for data processing. Its most important feature is the ability to process unstructured data by applying schema on read, which imposes a structure on the data as it is retrieved from its source. The data source can be Data Lake Store or Azure Storage.

It supports the U-SQL language, which lets users run custom logic and user-defined functions. U-SQL provides more control and scalability over jobs. Data Lake Analytics executes a U-SQL job as a batch script, with data retrieved in a rowset format. If the source data is in files, U-SQL schematizes the data upon extraction.

3. Azure HDInsight: 

HDInsight is a fully managed Hadoop cluster service that supports a wide range of analytic engines, including Spark, Storm, and HBase. It is designed to take advantage of Data Lake Store to maximize security, scalability, and throughput. It supports managed clusters on Linux and Windows.

  • Hadoop: HDFS data storage with support for MapReduce and parallel processing.
  • HBase: NoSQL database built on Hadoop for large sets of structured and semi-structured data.
  • Storm: Distributed, real-time computational service for data streams.

U-SQL

U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.

U-SQL can process unstructured data by applying schema on read and inserting custom logic. Each query produces a rowset, and the rowset can be assigned to a variable.

The EXTRACT statement reads data from a file and defines the schema on read. The OUTPUT statement writes data from a rowset to a file. Both statements take an Azure Data Lake file path.

Example: adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv
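To show how such a path breaks down, here is a small Python sketch that splits an adl:// URI into the store account and the path inside the store. The helper name `split_adl_uri` is hypothetical, and treating the first host label as the account name is an assumption based on the example above.

```python
from urllib.parse import urlparse


def split_adl_uri(uri):
    """Split an adl:// URI into (account, path inside the store)."""
    parts = urlparse(uri)
    if parts.scheme != "adl":
        raise ValueError(f"not an adl:// URI: {uri}")
    # Assumption: the first label of the host name is the store account.
    account = parts.netloc.split(".")[0]
    return account, parts.path


account, path = split_adl_uri(
    "adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv"
)
print(account, path)
```

The path part is the same store-relative form that the FROM and TO clauses of a U-SQL script use.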

Example Script:

// Read the tab-separated file and apply the schema as the data is read.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            URLs string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Write the rowset out as a comma-separated file.
OUTPUT @searchlog
    TO "/output/SearchLog-first-u-sql.csv"
    USING Outputters.Csv();

This script reads the source file SearchLog.tsv, schematizes it, and writes the rowset to a file called SearchLog-first-u-sql.csv.
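For readers more familiar with general-purpose languages, the same extract-then-output flow can be approximated in Python. This is only an analogy, not how Data Lake Analytics runs the job: the two sample rows are invented for illustration, the data lives in memory instead of in a store, and the ISO date format is an assumption.

```python
import csv
import io
from datetime import datetime

# Column names and types mirror the EXTRACT clause of the script.
SCHEMA = [
    ("UserId", int),
    ("Start", datetime.fromisoformat),
    ("Region", str),
    ("Query", str),
    ("Duration", int),
    ("URLs", str),
    ("ClickedUrls", str),
]

# Invented sample rows standing in for SearchLog.tsv.
SAMPLE_TSV = (
    "399266\t2012-02-15T11:53:16\ten-us\thow to make nachos\t73\t"
    "www.nachos.com;www.wikipedia.com\tNULL\n"
    "382045\t2012-02-15T11:53:18\ten-gb\tbest ski resorts\t614\t"
    "skiresorts.com\tskiresorts.com\n"
)


def extract_tsv(text, schema):
    """Read tab-separated text, converting each field as it is read (schema on read)."""
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        yield {name: convert(value) for (name, convert), value in zip(schema, fields)}


def output_csv(rows, schema):
    """Write the rows as comma-separated text and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name, _ in schema])
    return buffer.getvalue()


searchlog = list(extract_tsv(SAMPLE_TSV, SCHEMA))
csv_text = output_csv(searchlog, SCHEMA)
print(csv_text)
```

The raw text carries no types of its own; the schema is supplied by the reader at the moment of extraction, which is the point the U-SQL script makes with EXTRACT.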


Azure Data Lake Architecture

Azure Data Lake is built on top of Apache Hadoop and uses Apache YARN for cluster resource management. It is Microsoft's implementation of the HDFS file system in the cloud. Azure Data Lake is entirely cloud-based, requires no hardware or servers on the user's side, and can be scaled according to need.

Data Lake is compatible with the Azure Storage API and the Hadoop Distributed File System.

Data Lake is compatible with Azure Active Directory and uses it for security and authentication.

Data Lake is designed for very low latency and near real-time analytics in scenarios such as web analytics, IoT analytics, and sensor data processing.


Data can be gathered from any source, such as social media, website and app logs, and devices and sensors, and stored in near-original format.


Difference between Data Warehouse and Data Lake

 

| | Data Warehouse | Data Lake |
|---|---|---|
| Data | Structured and processed | Structured, semi-structured, and unstructured |
| Processing | Schema on write | Schema on read |
| Storage | Expensive | Low cost |
| Agility | Less agile, fixed configuration | Highly agile, fully configurable |
| Security | Mature | Mature |
| Users | Business professionals | Data scientists |
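The "Processing" row is the difference that most affects day-to-day work, so a toy contrast may help. The Python sketch below is an assumption-laden illustration, not either product's behavior: the sample events and both function names are invented. It shows a warehouse-style loader that validates on ingest and a lake-style query that keeps every raw record and interprets it only when read.

```python
import json

# Invented raw events; only the first fits a strict (user, duration) table.
RAW_EVENTS = [
    '{"user": 1, "duration": "73"}',
    '{"user": 2, "duration": "n/a"}',
    '{"user": 3, "clicks": ["a", "b"]}',
]


def load_schema_on_write(lines):
    """Warehouse style: convert on ingest and reject rows that do not fit."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            table.append(
                {"user": int(record["user"]), "duration": int(record["duration"])}
            )
        except (KeyError, ValueError):
            rejected.append(line)
    return table, rejected


def query_schema_on_read(lines):
    """Lake style: keep every raw line and parse only the field this query needs."""
    return [json.loads(line)["user"] for line in lines]


table, rejected = load_schema_on_write(RAW_EVENTS)
users = query_schema_on_read(RAW_EVENTS)
print(len(table), len(rejected), users)
```

The schema-on-write loader ends up with one row and two rejects, while the schema-on-read query can still use all three events.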

 


Data Lake Security

Data Lake Security includes:

  • Authentication
  • Authorization
  • Network isolation
  • Data protection
  • Auditing

Data Lake uses Azure Active Directory to authenticate users and enforce policies.

Authorization and access control are configured separately in Data Lake, using the following settings:

  • Role-based access control (RBAC) provided by Azure for account management.
  • POSIX ACL for accessing data in the store.
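A simplified Python sketch of how a POSIX-style ACL decides access is shown below. It is a teaching model only: the ACL string, the user names, and the functions `parse_acl` and `can` are hypothetical, and the sketch deliberately ignores group membership and the ACL mask, both of which a real POSIX ACL check takes into account.

```python
# Hypothetical ACL in the common "type:name:permissions" notation.
ACL = "user::rwx,user:alice:r-x,group::r--,other::---"


def parse_acl(spec):
    """Turn 'type:name:perms' entries into a {(type, name): perms} mapping."""
    entries = {}
    for entry in spec.split(","):
        kind, name, perms = entry.split(":")
        entries[(kind, name)] = perms
    return entries


def can(entries, user, action, owner="bob"):
    """Check whether `user` may perform `action` ('r', 'w' or 'x').

    Simplification: only the owner, named users, and 'other' are evaluated.
    """
    if user == owner:
        perms = entries[("user", "")]
    elif ("user", user) in entries:
        perms = entries[("user", user)]
    else:
        perms = entries[("other", "")]
    return action in perms


entries = parse_acl(ACL)
print(can(entries, "alice", "r"), can(entries, "alice", "w"), can(entries, "carol", "r"))
```

In this model the owner has full access, alice can read and traverse but not write, and any other user is denied.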

Network isolation provides a firewall in which you define IP address ranges for trusted clients; only those clients can access Data Lake. Data protection uses the Transport Layer Security (TLS) protocol to secure data in transit over the network. Auditing and diagnostic logs are available in the Azure portal.
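The firewall rule amounts to a range check on the client address. As a minimal sketch using Python's standard `ipaddress` module, with a made-up trusted range taken from the IP block reserved for documentation:

```python
import ipaddress

# Hypothetical trusted range (start, end); addresses are from a documentation block.
TRUSTED_RANGES = [
    (ipaddress.ip_address("203.0.113.10"), ipaddress.ip_address("203.0.113.50")),
]


def is_trusted(client_ip):
    """Return True if the client address falls inside any trusted range."""
    address = ipaddress.ip_address(client_ip)
    return any(start <= address <= end for start, end in TRUSTED_RANGES)


print(is_trusted("203.0.113.42"), is_trusted("198.51.100.7"))
```

Requests from addresses outside every configured range would be refused before authentication or ACL checks come into play.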

Uses of Azure Data Lake

  • General-purpose object storage managed by Azure
  • Processing of streaming and batch workloads.
  • Curation of data by analysts and data engineers for specific needs without making copies.

Benefits of using Azure Data Lake

  • Highly scalable and flexible because it is hosted in the cloud.
  • Streamlines data storage for all enterprise needs.
  • Large-scale data can be processed in parallel, providing quick access to insights.
  • Stores everything: logs, XML, multimedia, sensor data, binary, social data, chat, and people data.
  • No limit on data storage or file size.
  • Supports heavy workloads for in-depth analytics.
  • Supports schema-less storage, whereas a data warehouse does not.

Azure Data Lake Storage Gen2

Built on Azure Blob Storage, Azure Data Lake Storage Gen2 offers file system semantics, directory- and file-level security, low-cost tiered storage, high availability and disaster recovery, and scalability. It combines the best features of Azure Blob Storage and Azure Data Lake Storage Gen1.

Data Lake Pricing

  1. Data Lake Store:

Pay-as-you-go

| Usage | Price/Month |
|---|---|
| First 100 TB | Rs. 2.58 per GB |
| Next 100 TB to 1,000 TB | Rs. 2.52 per GB |
| Next 1,000 TB to 5,000 TB | Rs. 2.45 per GB |
| Over 5,000 TB | Custom by contacting Microsoft |
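To see how the tiers combine, the following Python sketch estimates a monthly pay-as-you-go storage bill. Two things here are assumptions rather than facts from the price list: that each rate applies only to the portion of data inside its tier, and that 1 TB is counted as 1,024 GB. The function name is hypothetical.

```python
GB_PER_TB = 1024  # assumption: the price list does not state 1,000 vs 1,024

# (upper bound of the tier in TB, price in Rs. per GB per month)
TIERS = [(100, 2.58), (1000, 2.52), (5000, 2.45)]


def monthly_storage_cost(stored_tb):
    """Estimate the monthly pay-as-you-go cost in Rs. for `stored_tb` terabytes."""
    if stored_tb > TIERS[-1][0]:
        raise ValueError("over 5,000 TB: custom pricing, contact Microsoft")
    cost, lower = 0.0, 0
    for upper, price_per_gb in TIERS:
        if stored_tb > lower:
            # Bill only the slice of data that falls inside this tier.
            billable_tb = min(stored_tb, upper) - lower
            cost += billable_tb * GB_PER_TB * price_per_gb
        lower = upper
    return cost


# 150 TB: the first 100 TB at Rs. 2.58 per GB, the remaining 50 TB at Rs. 2.52 per GB.
print(round(monthly_storage_cost(150), 2))
```

Under these assumptions, 150 TB works out to roughly Rs. 393,216 per month (264,192 for the first tier plus 129,024 for the second).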

 

Monthly commitment packages

| Committed Capacity | Price/Month | Savings over pay-as-you-go |
|---|---|---|
| 1 TB | Rs. 2,313.37 | 12% |
| 10 TB | Rs. 21,150.80 | 19% |
| 100 TB | Rs. 1,91,679.13 | 27% |
| 500 TB | Rs. 8,79,080.13 | 31% |
| 1,000 TB | Rs. 17,18,502.50 | 33% |
| Over 1,000 TB | Custom by contacting Microsoft | |

 

 

Transaction prices:

| Usage | Price |
|---|---|
| Write operations (per 10,000) | Rs. 3.31 |
| Read operations (per 10,000) | Rs. 0.27 |
| Delete operations | Free |
| Transaction size limit | No limit |
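A quick worked example of the transaction charges, as a Python sketch. It assumes operations are billed pro rata per 10,000 (the price list does not say whether partial blocks are rounded up), and the workload figures are invented.

```python
WRITE_PRICE_PER_10K = 3.31  # Rs., from the transaction price list
READ_PRICE_PER_10K = 0.27   # Rs., from the transaction price list


def transaction_cost(writes, reads):
    """Estimate transaction charges in Rs.; delete operations are free."""
    return (writes / 10_000) * WRITE_PRICE_PER_10K + (reads / 10_000) * READ_PRICE_PER_10K


# Invented workload: 1 million writes and 5 million reads in a month.
print(round(transaction_cost(1_000_000, 5_000_000), 2))
```

For this workload the estimate is about Rs. 466: 100 blocks of writes at Rs. 3.31 plus 500 blocks of reads at Rs. 0.27.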

 

  2. Data Lake Analytics:

Pay-as-you-go

| Usage | Price |
|---|---|
| Analytics Unit | Rs. 132.20/hour |

 

Monthly commitment packages

| Included Analytics Unit Hours | Price/Month | Savings over Pay-As-You-Go |
|---|---|---|
| 100 | Rs. 6,610 | 50% |
| 500 | Rs. 29,744 | 55% |
| 1,000 | Rs. 52,877 | 60% |
| 5,000 | Rs. 2,37,947 | 64% |
| 10,000 | Rs. 4,29,626 | 67% |
| 50,000 | Rs. 19,16,792 | 71% |
| 1,00,000 | Rs. 34,37,005 | 74% |
| > 1,00,000 | Custom by contacting Microsoft | |
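The savings column can be sanity-checked against the pay-as-you-go rate of Rs. 132.20 per Analytics Unit hour. The Python sketch below does this for the three smallest packages; it assumes every included hour is actually used, and the function name is hypothetical.

```python
PAYG_RATE = 132.20  # Rs. per Analytics Unit hour, pay-as-you-go

# Included Analytics Unit hours -> monthly package price in Rs.
PACKAGES = {100: 6610, 500: 29744, 1000: 52877}


def savings_percent(hours, package_price):
    """Percentage saved versus paying the hourly rate for the same number of hours."""
    pay_as_you_go = hours * PAYG_RATE
    return (1 - package_price / pay_as_you_go) * 100


for hours, price in PACKAGES.items():
    print(hours, round(savings_percent(hours, price)))
```

The computed figures come out at about 50%, 55%, and 60%, in line with the listed savings. The benefit shrinks if fewer hours are used than the package includes.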

 

 

Microsoft Azure Data Lake helps data scientists, engineers, and analysts by addressing many of their big data challenges. This scalable cloud data lake offers a single storage structure for multiple analytics projects of different sizes.

Last updated: 08 Oct 2024
About Author

Anji Velagana works as a Digital Marketing Analyst and Content Contributor for Mindmajix. He writes about platforms such as ServiceNow, Business Analysis, Performance Testing, MuleSoft, Oracle Exadata, Azure, and a few other courses. Contact him via anjivelagana@gmail.com and LinkedIn.
