Analyzing Data is a highly complex process with several attempts to ease, and there are several tools available today to analyze the data. The tech giant Amazon is providing a service with the name Amazon Athena to analyze the data. This tutorial walks you through Amazon Athena and helps you create a table based on sample data stored in Amazon S3, query the table, and check the query results.

What is AWS Athena?

AWS Athena is a code-free, fully automated, zero-admin, data pipeline that performs database automation, Parquet file conversion, table creation, Snappy compression, partitioning, and more. It is an interactive query service to analyze Amazon S3 data using standard SQL. 

Amazon launched Athena on November 20, 2016, and this serverless query service provides data analysis with standard SQL. With the AWS management console, users can point Athena at data stored in Amazon S3 and execute queries to get results in seconds using standard SQL.

Amazon Athena has no infrastructure to set up or manage, and the customers need to pay only for the queries they run on it. Amazon Athena scales executing queries in parallel, scales automatically, providing fast results even with a large dataset and complex questions. 

Difference between Microsoft SQL Server and Amazon Athena

Amazon Athena is a serverless and interactive tool to analyze data and processes complex queries in relatively less time. Being a serverless service, you pay only for the queries you execute. Mark your data in S3 and define the required schema using standard SQL and go.

Let’s compare the two data analysis tools, Microsoft SQL Server and Amazon Athena.

Features Microsoft SQL Server Amazon Athena
Definition Microsoft SQL Server is a database management and analysis system. Amazon Athena is an interactive query service that makes data analysis easy.
Usage Used for DCL, DML, DDL and TCL operations on Database. Used for DatabaseDML operations.
Benefits
  • Reliable and easy to use
  • High performance
  • Easy to maintain
  • Easy server installation
  • Multiple tools integration possible
  • Easy to use
  • High performance
  • No maintenance required 
  • No server configuration required
  • Multiple tools integration possible

    

 

Integration
  • Sequelize
  • SQLDep
  • Presto
  • Amazon S3
  • AWS Glue
  • Presto

 

Limitations
  • Limited RDS storage
  • Limited instances
  • Can not handle recursion
  • No DDL supported
  • Works with external table only
  • User-Defined Functions not supported

Check Out AWS Tutorials

How does AWS Athena work?

Athena works directly with S3 data. It uses a distributed SQL engine, Presto for running queries. It uses Apache Hive to create and alter tables and partitions. 

Let’s have a look at the prerequisites to start working with Athena:

  1. Must have an AWS account
  2. Enable your account to export your cost and usage data into an S3 bucket. 
  3. Prepare buckets for Athena to connect. 
  4. AWS creates manifest files using metadata every time it writes to the bucket. Create a folder inside the technology-aws-billing-data bucket known as Athena, which contains only the data.
  5. To simplify the setup, we can use one region: the us-west-2 region.
  6. The final step is downloading the credentials for the new IAM user. The credentials will directly map to the database credentials to connect.
Parameter Value
Database host athena.us-west-2.amazonaws.com
Database username IAM username
Database password Secret Access Key
Database name Access Key ID
Database port 443
S3 staging directory s3://aws-athena-query-results-technology/

Creating an Athena database and table

create database if not exists costdb;
    create external table if not exists cost (
        InvoiceID string,
        PayerAccountId string,
        LinkedAccountId string,
        RecordType string,
        RecordId string,
        ProductName string,
        RateId string,
        SubscriptionId string,
        PricingPlanId string,
        UsageType string,
        Operation string,
        AvailabilityZone string,
        ReservedInstance string,
        ItemDescription string,
        UsageStartDate string,
        UsageEndDate string,
        UsageQuantity string,
        Rate string,
        Cost string,
        ResourceId string
    )    
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    with serdeproperties (
        'separatorChar' = ',',
        'quoteChar' = '"',
        'escapeChar' = ''
    )
    stored as textfile 
    location 's3://technology-aws-billing-data/athena'

Create a table that matches the CSV formats and files in S3 billing bucket.

After a bit of trial and error, some unseen errors that we can view are as follows:

  1. Using OpenCSVSerde plugin, you can parse CSV files.
  2. The plugin supports gzip files but not zip files. You’ll convert the compression format to gzip or one of its supporting formats.
  3. This plugin claims in supporting skip.header.line.count to skip header rows, but seems to be broken. You have to rewrite CSV files manually without the header.
  4. You can run the DDL statements using AWS Web console or through the product for creating databases.

Creating the first query using Athena

To ensure that you can query your Athena database, you can run the below query for various AWS services you use:

select distinct costdb.cost.productname
from costdb.

Cost by AWS service and operation

In the earlier section, each column is mentioned as a string data type. You have to cast columns as the data types to move further:

select productname, operation, 
 sum (cast(cost as double)) 
from costdb.cost
group by 1, 2
order by 3 desc 

Amazon's Cost Explorer

Amazon provides a tool called Cost Explorer to drag and drop, which comes with a set of prebuilt reports like “Monthly service costs”, ” reserved instance usage”, etc.

If you are curious, try to recreate the query above service costs and operation. It doesn’t seem to be possible.

You can slice your raw data to your satisfaction, and can also compute growth rates every month, build histograms, compute using z-scores, etc.

Additional considerations:

Athena's pricing model: The Pricing of Athena is $5  to scan Terabyte data from S3, surrounded to the closest megabyte having a minimum of 10 MB per query.

Reducing Athena's cost: The cost trick is reducing the data that is scanned. This is possible in three ways:

Compress your data through gzip or other supported formats: If you get a compression rate of 2:1 ratio, the cost is reduced by 50%.

Use columnar data formats like Apache Parquet: If the query references only to two columns, you need not scan the entire row that results in significant savings.

Subscribe to our youtube channel to get new updates..!

Partition the data: The partition keys can be defined either one or more. For instance, if your data consists of a customer_id column and a time-based column, the amount of data scanned is reduced significantly when the query has clauses for the data and customer columns. It allows you to analyze S3 data using standard SQL without managing any infrastructure. You can even access Athena via a BI tool with JDBC driver. 

Frequently Asked AWS Interview Questions

Accessing Amazon Athena

There are distinct options available for accessing Athena quickly. It can be accessed through any of the following tools:

  • AWS Console
  • Athena with your JDBC
  • AWS CLI

As you have gained knowledge about Amazon Athena, let us walk through various features of Athena.

Features of Athena

Athena is one of the best services offered by Amazon. It has several features making it suitable to analyze data. Let’s have a look at the various features of Athena.

Easy Implementation: Athena requires no installation and can directly access using the AWS Console.

Serverless: The end-user does not face any problems in configuring, scaling or failure as Athena is a serverless service. It can take care of everything on its own.

Pay per query: It charges only for queries you run, i.e. the amount of data that is managed per query. 

Fast: Athena is a high-speed analytics tool and can perform even the complex queries in relatively less time by splitting into simpler ones and running them parallelly, and merge them to provide the desired output.

Secure: Using AWS Identity and IAM policies, Athena provides you with complete control over the data set. 

High availability: With AWS, Athena is accessible and the user can run queries round the clock. 

Integration: The best feature of Athena is its integration with AWS Glue. 

Creating Table In Athena

In this tutorial, we are using live resources, so you are only charged for the queries you run but not for the datasets you use, and if you want to upload your data files into Amazon S3, charges do apply.

To query S3 file data, you need to have an external table associated with the file structure. We can CREATE EXTERNAL TABLES in two ways:

  • Manually.
  • Using the AWS Glue crawler.

To manually create an EXTERNAL table, write the statement CREATE EXTERNAL TABLE following the correct structure and specify the correct format and accurate location. An example is shown below:

Creating an External table manually

The created ExTERNAL tables are stored in AWS Glue Catalog. The Glue Clawer parses the structure of the input file and generates metadata tables, defined in Glue Data Catalog.

The crawler uses an AWS IAM (Identity and Access Management) role to permit access to the data stored and the Data Catalog. You should have permission to pass the roles to the crawler for accessing Amazon S3 paths that are crawled.

Go to AWS Glue, choose “Add tables” and then select “Add tables using a crawler” option.

Add tables using Glue crawler

Give the crawler a name. Let's say for example: cars-crawler

Enter crawler name

Choose the path in Amazon S3 where the file is saved.

If you plan to query only one file, you can choose either an S3 file path or the S3 folder path to query all the files in the folder having the same structure. 

Enter crawler name

Choose the path in Amazon S3 where the file is saved.

If you plan to query only one file, you can choose either an S3 file path or the S3 folder path to query all the files in the folder having the same structure. 

cars.json file is in the S3 location s3://rosyll-niranjana-xavier/data_input/json-files/cars.json. You can also choose s3://rosyll-niranjana-xavier/data_input/json-files/ as the path.

Create an IAM role that is having permission to the S3 object that you aim to query or choose an existing IAM role (which has enough privileges to access the S3 object).

Choose a database that contains the external tables and optionally choose a prefix to be added to the external table name.

Choose database and prefix for external tables

Click Finish to create the Glue Crawler

Run the crawler

The External table created it under the specified database. Now you can query the S3 object using it.

SELECT data from the external table

Since we placed a file, the “SELECT *FROM json_files;” query returns a record which was in the file. Let’s now try to place another file having the same structure in the same S3 folder and try to query the EXTERNAL TABLE again.

petercars.json file uploaded to S3

If you Query the same EXTERNAL table, you will see two rows returned instead of one.

When the same EXTERNAL TABLE is queried, you will get two records. This is because there are two files in the S3 folder with the desired structure. You can perform several operations on the data. For instance, the following query will UNNEST the array in the result set.

UNNEST arrays in Athena

AWS Athena Pricing details

As we discussed earlier, Amazon Athena is an interactive query service to query data in Amazon S3 with the standard SQL statements. Athena reads the data without performing operations such as addition or modification.

Now let’s look at Amazon Athena pricing and some tips to reduce Athena costs.

According to the Amazon Athena’s pricing page, Athena is priced at $5/TB scanned to run a query. If you cancel any query,  the charge is for the data scanned up to the cancellation point of the query.

Doing that math for smaller queries:

https://mindmajix.com/aws-athena

Therefore, you will be charged a minimum of $0.000004768 (to scan 10 MB minimum). So be careful to those 200KB queries. You will still be charged for a full 10 MB.

Things That Are Free

Database, table, DDL-related executions, and schema are all free. For example, there is no charge for any of the following statements:

  • CREATE EXTERNAL TABLE
  • MSCK REPAIR TABLE
  • ALTER TABLE

Additional Costs

Athena reads the data that is stored in S3. There are standard charges in S3 to store the data based on how it’s stored. It stores query history and results in another bucket known as a secondary S3 bucket. Therefore, there will also be standard S3 data charges for that new data stored in the same bucket.

Cost Reduction Techniques

  1. Till now, we came across the pricing details of Amazon Athena. Now let’s look into some of the cost reduction techniques listed below:
  2. Remove historical results using S3 lifecycle rules
  3. Compress your input data in S3
  4. Use Partitions Effectively
  5. Store Your Data in a Columnar Format

Amazon Athena is an exciting service. It helps you to structure your data and queries to reduce your costs up to an extent and you’ll be added with a potential new candidate to your arsenal for serverless computing.

AWS Glue

AWS Glue is a perfectly managed ETL service which makes it flexible for customers who want to prepare and load data for analytics. You can build and execute an ETL in the Amazon Management Console with a few clicks. You can point AWS Glue to your AWS data and discovers your data and store associated metadata like Schema and table definition in the AWS Glue Data Catalog. Your data once cataloged is immediately searchable, queryable, and available for ETL.

Benefits of AWS Glue

  1. AWS Glue is integrated with a wide range of AWS services, which means less hassle for you while onboarding.
  2. AWS Glue is serverless. No infrastructure required to provision or manage. 
  3. Need to pay only for the resources used to run the jobs.

Amazon QuickSight

Amazon QuickSight is a cloud-powered, fast BI service, which makes it easy to deliver insights to everyone in the organization. Being a wholly managed service, QuickSight lets you create interactive dashboards easily and publish with ML insights. Dashboards can be accessed from any device embedded into your applications, websites, and portals. Using Pay-per-Session pricing, it allows you to provide everyone to obtain data required when only paying for what you use.

Some of the major benefits provided by Amazon QuickSight are listed as follows:

  1. Pay only for what you use
  2. Scale from 10 users to 10,000
  3. Embed self-service data analytics
  4. Build end-to-end BI solutions

Conclusion

As data has become an essential asset that a company owns, gaining insights and extracting more out of the data is more critical now than ever. With public cloud services, providing service-based analytics services such as Amazon Athena, businesses can get more insights without any expensive complications that arise with home-built analytics tools. 

Being a serverless architecture and employing ANSI SQL, Athena makes data queries quick to set up, easy to use, and fast to run. The pay-per-use model of Athena will make it affordable to run analytics. Since Athena works with Amazon S3 and comes with unmatched scalability, durability, reliability and the power of object storage, this is the perfect-suite to run analytics workloads.

Explore AWS Sample Resumes! Download & Edit, Get Noticed by Top Employers! Download Now!