Channel: Learning the code way

AWS Athena

I have been going through use cases where some basic analytics needed to be run on structured logs generated by our system. Until now, the way I did it was to spin up an EMR cluster, load my logs onto it, and execute Hive queries.
Then I found Athena.
Amazon Athena is an interactive query service that makes it easy to analyze
data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
We already have most of our relevant data in S3, which means we could use Athena directly against it. Other benefits:
Athena is serverless, so there is no infrastructure to set up or manage,
and you pay only for the queries you run.
Athena scales automatically, executing queries in parallel, so results
are fast, even with large datasets and complex queries.
I wrote some dummy code that creates files of user data.
I created a few files and then set up a directory structure in S3:
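The original listing isn't reproduced here; judging by the partition path I add later (s3://athene-test-dump/2020/05/28), the layout was a date-based folder hierarchy roughly like this (file names are placeholders):

```
s3://athene-test-dump/2020/05/28/users_001.txt
s3://athene-test-dump/2020/05/28/users_002.txt
s3://athene-test-dump/2020/05/29/users_003.txt
```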
The next step was to query this data in Athena. For this we need to make a database and table in Athena.
For each dataset, a table needs to exist in Athena. The metadata in the
table tells Athena where the data is located in Amazon S3, and
specifies the structure of the data, for example, column names, data
types, and the name of the table. Databases are a logical grouping of
tables, and also hold only metadata and schema information for a dataset.
The table creation process registers the dataset with Athena. This
registration occurs in the AWS Glue Data Catalog and enables Athena to
run queries on the data.

Step 1: Create the database:
CREATE DATABASE users_db;
Step 2: Create the users table:
CREATE EXTERNAL TABLE IF NOT EXISTS userRecs (
  user_id int,
  name string,
  phone_no string,
  age int,
  hobbies array<string>,
  state string,
  country string)
COMMENT 'User details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://athene-test-dump/';
The table here is created over the base S3 bucket. I did not apply any partitioning; I simply provided the base folder and let Athena detect the files under the folder hierarchy.
When you query an existing table, under the hood, Amazon Athena uses Presto,
a distributed SQL engine.
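Because the engine is Presto, Presto SQL constructs are available too. For example, the hobbies array column can be expanded into one row per hobby with UNNEST (an illustrative query, not one from the original run):

```sql
SELECT u.name, hobby
FROM userRecs u
CROSS JOIN UNNEST(u.hobbies) AS t (hobby);
```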
I executed a simple query:
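The query itself was lost with the screenshot; a simple query against this table would look something like the following (the filter values are placeholders):

```sql
SELECT name, phone_no, state
FROM userRecs
WHERE age > 25
LIMIT 10;
```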
The query execution history is as below:
I also decided to set up a partitioned table for the same data:

CREATE EXTERNAL TABLE IF NOT EXISTS users_partitioned (
  user_id int,
  name string,
  phone_no string,
  age int,
  hobbies array<string>,
  state string,
  country string)
COMMENT 'User details'
PARTITIONED BY (dataset_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://athene-test-dump/';
The query executed successfully and Athena UI gave the below comment:
Query successful. If your table has partitions, you need to load these partitions
to be able to query data. You can either load all partitions or load them
individually. If you use the load all partitions (MSCK REPAIR TABLE) command,
partitions must be in a format understood by Hive. Learn more.
My data is not partitioned in the Hive format. So I will have to manually load the partitions.
ALTER TABLE users_partitioned ADD PARTITION (dataset_date='2020-05-28')
LOCATION 's3://athene-test-dump/2020/05/28';
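Once a partition is loaded, queries can filter on the partition column, and Athena prunes the scan down to that day's folder instead of reading the whole bucket. A sketch:

```sql
SELECT count(*)
FROM users_partitioned
WHERE dataset_date = '2020-05-28';
```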


If, instead of the date being spread across folder levels (i.e. s3://athene-test-dump/2020/05/28), I had it as 's3://athene-test-dump/dataset_date=2020-05-28' (the key must match the partition column name), then I could have loaded the partitions using the MSCK REPAIR TABLE command.
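With such a Hive-style key=value layout, loading all partitions becomes a single statement that scans the table location and registers every partition folder it finds:

```sql
MSCK REPAIR TABLE users_partitioned;
```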
