Amazon Athena is an interactive query solution for analyzing data in Amazon S3 with normal SQL. Because Athena is a serverless platform, there is no infrastructure to maintain and you just pay for the queries you run. This Amazon Athena blog will walk you through the fundamentals as well as advanced features of Amazon Athena
Analyzing Data is a highly complex process with several attempts to ease, and there are several tools available today to analyze the data. The tech giant Amazon is providing a service with the name Amazon Athena to analyze the data. This tutorial walks you through Amazon Athena and helps you create a table based on sample data stored in Amazon S3, query the table, and check the query results.
Want to become a Certified AWS Professional? Visit here to Learn AWS Certification Training
In Amazon Athena blog, you will learn below topics
AWS Athena is a code-free, fully automated, zero-admin, data pipeline that performs database automation, Parquet file conversion, table creation, Snappy compression, partitioning, and more. It is an interactive query service to analyze Amazon S3 data using standard SQL.
Amazon launched Athena on November 20, 2016, and this serverless query service provides data analysis with standard SQL. With the AWS management console, users can point Athena at data stored in Amazon S3 and execute queries to get results in seconds using standard SQL.
Amazon Athena has no infrastructure to set up or manage, and the customers need to pay only for the queries they run on it. Amazon Athena scales executing queries in parallel, scales automatically, providing fast results even with a large dataset and complex questions.
Amazon Athena is a serverless and interactive tool to analyze data and processes complex queries in relatively less time. Being a serverless service, you pay only for the queries you execute. Mark your data in S3 and define the required schema using standard SQL and go.
Let’s compare the two data analysis tools, Microsoft SQL Server and Amazon Athena.
Features | Microsoft SQL Server | Amazon Athena |
Definition | Microsoft SQL Server is a database management and analysis system. | Amazon Athena is an interactive query service that makes data analysis easy. |
Usage | Used for DCL, DML, DDL and TCL operations on Database. | Used for DatabaseDML operations. |
Benefits |
|
|
Integration |
|
|
Limitations |
|
|
If you want to know more about AWS: Click here to know about What is AWS
Athena works directly with S3 data. It uses a distributed SQL engine, Presto for running queries. It uses Apache Hive to create and alter tables and partitions.
Let’s have a look at the prerequisites to start working with Athena:
Parameter | Value |
Database host | athena.us-west-2.amazonaws.com |
Database username | IAM username |
Database password | Secret Access Key |
Database name | Access Key ID |
Database port | 443 |
S3 staging directory | s3://aws-athena-query-results-technology/ |
create database if not exists costdb;
create external table if not exists cost (
InvoiceID string,
PayerAccountId string,
LinkedAccountId string,
RecordType string,
RecordId string,
ProductName string,
RateId string,
SubscriptionId string,
PricingPlanId string,
UsageType string,
Operation string,
AvailabilityZone string,
ReservedInstance string,
ItemDescription string,
UsageStartDate string,
UsageEndDate string,
UsageQuantity string,
Rate string,
Cost string,
ResourceId string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = ''
)
stored as textfile
location 's3://technology-aws-billing-data/athena'
Create a table that matches the CSV formats and files in S3 billing bucket.
After a bit of trial and error, some unseen errors that we can view are as follows:
To ensure that you can query your Athena database, you can run the below query for various AWS services you use:
select distinct costdb.cost.productname
from costdb.
In the earlier section, each column is mentioned as a string data type. You have to cast columns as the data types to move further:
select productname, operation,
sum (cast(cost as double))
from costdb.cost
group by 1, 2
order by 3 desc
Amazon's Cost Explorer
Amazon provides a tool called Cost Explorer to drag and drop, which comes with a set of prebuilt reports like “Monthly service costs”, ” reserved instance usage”, etc.
If you are curious, try to recreate the query above service costs and operation. It doesn’t seem to be possible.
You can slice your raw data to your satisfaction, and can also compute growth rates every month, build histograms, compute using z-scores, etc.
Athena's pricing model: The Pricing of Athena is $5 to scan Terabyte data from S3, surrounded to the closest megabyte having a minimum of 10 MB per query.
Reducing Athena's cost: The cost trick is reducing the data that is scanned. This is possible in three ways:
Compress your data through gzip or other supported formats: If you get a compression rate of 2:1 ratio, the cost is reduced by 50%.
Use columnar data formats like Apache Parquet: If the query references only to two columns, you need not scan the entire row that results in significant savings.
Partition the data: The partition keys can be defined either one or more. For instance, if your data consists of a customer_id column and a time-based column, the amount of data scanned is reduced significantly when the query has clauses for the data and customer columns. It allows you to analyze S3 data using standard SQL without managing any infrastructure. You can even access Athena via a BI tool with JDBC driver.
Frequently Asked AWS Interview Questions
There are distinct options available for accessing Athena quickly. It can be accessed through any of the following tools:
As you have gained knowledge about Amazon Athena, let us walk through various features of Athena.
Athena is one of the best services offered by Amazon. It has several features making it suitable to analyze data. Let’s have a look at the various features of Athena.
Easy Implementation: Athena requires no installation and can directly access using the AWS Console.
Serverless: The end-user does not face any problems in configuring, scaling or failure as Athena is a serverless service. It can take care of everything on its own.
Pay per query: It charges only for queries you run, i.e. the amount of data that is managed per query.
Fast: Athena is a high-speed analytics tool and can perform even the complex queries in relatively less time by splitting into simpler ones and running them parallelly, and merge them to provide the desired output.
Secure: Using AWS Identity and IAM policies, Athena provides you with complete control over the data set.
High availability: With AWS, Athena is accessible and the user can run queries round the clock.
Integration: The best feature of Athena is its integration with AWS Glue.
In this tutorial, we are using live resources, so you are only charged for the queries you run but not for the datasets you use, and if you want to upload your data files into Amazon S3, charges do apply.
To query S3 file data, you need to have an external table associated with the file structure. We can CREATE EXTERNAL TABLES in two ways:
To manually create an EXTERNAL table, write the statement CREATE EXTERNAL TABLE following the correct structure and specify the correct format and accurate location. An example is shown below:
Creating an External table manually
The created ExTERNAL tables are stored in AWS Glue Catalog. The Glue Clawer parses the structure of the input file and generates metadata tables, defined in Glue Data Catalog.
The crawler uses an AWS IAM (Identity and Access Management) role to permit access to the data stored and the Data Catalog. You should have permission to pass the roles to the crawler for accessing Amazon S3 paths that are crawled.
Go to AWS Glue, choose “Add tables” and then select “Add tables using a crawler” option.
Add tables using Glue crawler
Give the crawler a name. Let's say for example: cars-crawler
Enter crawler name
Choose the path in Amazon S3 where the file is saved.
If you plan to query only one file, you can choose either an S3 file path or the S3 folder path to query all the files in the folder having the same structure.
Enter crawler name
Choose the path in Amazon S3 where the file is saved.
If you plan to query only one file, you can choose either an S3 file path or the S3 folder path to query all the files in the folder having the same structure.
cars.json file is in the S3 location s3://rosyll-niranjana-xavier/data_input/json-files/cars.json. You can also choose s3://rosyll-niranjana-xavier/data_input/json-files/ as the path.
Create an IAM role that is having permission to the S3 object that you aim to query or choose an existing IAM role (which has enough privileges to access the S3 object).
Choose a database that contains the external tables and optionally choose a prefix to be added to the external table name.
Choose database and prefix for external tables
Click Finish to create the Glue Crawler
Run the crawler
The External table created it under the specified database. Now you can query the S3 object using it.
SELECT data from the external table
Since we placed a file, the “SELECT *FROM json_files;” query returns a record which was in the file. Let’s now try to place another file having the same structure in the same S3 folder and try to query the EXTERNAL TABLE again.
petercars.json file uploaded to S3
If you Query the same EXTERNAL table, you will see two rows returned instead of one.
When the same EXTERNAL TABLE is queried, you will get two records. This is because there are two files in the S3 folder with the desired structure. You can perform several operations on the data. For instance, the following query will UNNEST the array in the result set.
UNNEST arrays in Athena
As we discussed earlier, Amazon Athena is an interactive query service to query data in Amazon S3 with the standard SQL statements. Athena reads the data without performing operations such as addition or modification.
Now let’s look at Amazon Athena pricing and some tips to reduce Athena costs.
According to the Amazon Athena’s pricing page, Athena is priced at $5/TB scanned to run a query. If you cancel any query, the charge is for the data scanned up to the cancellation point of the query.
Doing that math for smaller queries:
https://mindmajix.com/aws-athena
Therefore, you will be charged a minimum of $0.000004768 (to scan 10 MB minimum). So be careful to those 200KB queries. You will still be charged for a full 10 MB.
Things That Are Free
Database, table, DDL-related executions, and schema are all free. For example, there is no charge for any of the following statements:
Additional Costs
Athena reads the data that is stored in S3. There are standard charges in S3 to store the data based on how it’s stored. It stores query history and results in another bucket known as a secondary S3 bucket. Therefore, there will also be standard S3 data charges for that new data stored in the same bucket.
Cost Reduction Techniques
Amazon Athena is an exciting service. It helps you to structure your data and queries to reduce your costs up to an extent and you’ll be added with a potential new candidate to your arsenal for serverless computing.
AWS Glue is a perfectly managed ETL service which makes it flexible for customers who want to prepare and load data for analytics. You can build and execute an ETL in the Amazon Management Console with a few clicks. You can point AWS Glue to your AWS data and discovers your data and store associated metadata like Schema and table definition in the AWS Glue Data Catalog. Your data once cataloged is immediately searchable, queryable, and available for ETL.
Amazon QuickSight is a cloud-powered, fast BI service, which makes it easy to deliver insights to everyone in the organization. Being a wholly managed service, QuickSight lets you create interactive dashboards easily and publish with ML insights. Dashboards can be accessed from any device embedded into your applications, websites, and portals. Using Pay-per-Session pricing, it allows you to provide everyone to obtain data required when only paying for what you use.
Some of the major benefits provided by Amazon QuickSight are listed as follows:
Are you interested to learn AWS and build a career in Cloud Computing? Then check out our AWS Certification Training Course at your near Cities
AWS Certification Course in Ahmedabad, AWS Certification Course in Bangalore, AWS Certification Course in Chennai, AWS Certification Course in Delhi, AWS Certification Course in Dallas, AWS Certification Course in Hyderabad, AWS Certification Course in Kolkata, AWS Certification Course in London, AWS Certification Course in Mumbai, AWS Certification Course in NewYork, AWS Certification Course in Noida, AWS Certification Course in Pune, AWS Certification Course in Toronto
These courses are incorporated with Live instructor-led training, Industry Use cases, and hands-on live projects. This training program will make you an expert in AWS and help you to achieve your dream job.
Conclusion
As data has become an essential asset that a company owns, gaining insights and extracting more out of the data is more critical now than ever. With public cloud services, providing service-based analytics services such as Amazon Athena, businesses can get more insights without any expensive complications that arise with home-built analytics tools.
Being a serverless architecture and employing ANSI SQL, Athena makes data queries quick to set up, easy to use, and fast to run. The pay-per-use model of Athena will make it affordable to run analytics. Since Athena works with Amazon S3 and comes with unmatched scalability, durability, reliability and the power of object storage, this is the perfect-suite to run analytics workloads.
Name | Dates | |
---|---|---|
AWS Training | Sep 14 to Sep 29 | View Details |
AWS Training | Sep 17 to Oct 02 | View Details |
AWS Training | Sep 21 to Oct 06 | View Details |
AWS Training | Sep 24 to Oct 09 | View Details |
Usha Sri Mendi is a Senior Content writer with more than three years of experience in writing for Mindmajix on various IT platforms such as Tableau, Linux, and Cloud Computing. She spends her precious time on researching various technologies, and startups. Reach out to her via LinkedIn and Twitter.