Scenario summary

In this lab, you are a data analyst who works for an international development agency that focuses on drought relief. You have been asked to look for weather patterns since 1950.

Fortunately for you, global weather data is already stored in Amazon S3. The National Centers for Environmental Information (NCEI) in the United States maintains a dataset of climate data that includes observations from weather stations around the globe. The Global Historical Climatology Network Daily (GHCN-D) contains daily weather summaries from ground-based stations. The dataset goes back to 1763, and it is updated daily.

The most commonly recorded parameters are daily temperature, rainfall, and snowfall. These parameters are useful for assessing risks of drought, flooding, and extreme weather. The data definitions are publicly available on the AWS Open Data website.

The following diagram illustrates the architecture of the solution you will develop in this lab:

Task 1: Create a crawler for the GHCN-D dataset

As a data analyst, you might not always know the schema of the data that you need to analyze. AWS Glue is designed for this situation. You can direct AWS Glue to data that is stored on AWS, and the service will discover your data and store the associated metadata (for example, the table definition and schema) in the AWS Glue Data Catalog. You accomplish this by creating a crawler, which inspects the data source and infers a schema based on the data. The account that you use to log in when you run AWS Glue must have permissions to access the data source. In this lab, you will work with publicly available data, so there is no need to create a specific AWS Identity and Access Management (IAM) account. However, this will not always be the case. To read more about how AWS Glue and IAM work together, see Authentication and Access Control for AWS Glue.

The first task is to create a crawler that will discover the schema for the GHCN-D dataset.

  1. On the AWS Management Console, on the Services menu, choose AWS Glue.

  2. Choose Get started.

    Note: If you do not see the Get started window, then proceed to the next step.

  3. Choose Add tables using a crawler.

  4. For the crawler name, enter Weather.

  5. Choose Next.

  6. On the Specify crawler source type page, choose Data stores.

  7. Choose Next.

  8. Choose Specified path in another account.

  9. For the Include path, enter the following S3 bucket location:

    s3://noaa-ghcn-pds/csv/

  10. Choose Next.

  11. When prompted to add another data store, choose No.

  12. Choose Next.

  13. Choose Create a new role.

  14. Choose Next.

  15. Accept the default frequency of Run on demand.

  16. Choose Next.

  17. Choose Add database.

  18. In the Database name box, enter weatherdata.

  19. Choose Create.

  20. Choose Next.

  21. Review the summary of the crawler and then choose Finish.

Task 1.1: Run the crawler

You can create AWS Glue crawlers to run either on demand or on a set schedule. To read more about scheduling crawlers, see Scheduling an AWS Glue Crawler. Because you created your crawler to run on demand, you must run it to generate the metadata.

  1. From the window with the message that the weather crawler was created, choose Run it now?

    You will see the status for the crawler change to Starting and then Running crawler. After approximately 1 minute, the status will change to Ready, and the Tables added column will indicate that one table was added.

  2. AWS Glue creates a table to store the metadata about the GHCN-D dataset. In the next task, you will inspect the information that AWS Glue captured about the data source.

Task 1.2: Review the metadata created by AWS Glue

  1. In the navigation pane, choose Databases.
  2. Choose the weatherdata database.
  3. Choose Tables in weatherdata.
  4. Choose the csv table.
  5. Review the metadata that the weather crawler captured. You should see a list of the columns the crawler discovered. The following screenshot illustrates some of the columns:

Notice that the columns are named col0 through col6. In the next step, you will give the columns more descriptive names.

Task 1.3: Edit the schema

  1. In the upper-right corner of the window, choose Edit schema.
  2. Change the column names by selecting each one and entering the new name. The following table lists the new column names to use. (At a minimum, the queries later in this lab reference columns named date, type, and observation.)

  3. Choose Save.

Task 2: Query the table using the AWS Glue Data Catalog

Now that the AWS Glue Data Catalog contains metadata for the dataset, you can use that metadata to query the data in Amazon Athena.

  1. From the navigation pane, choose Tables.

  2. Select the csv table check box.

  3. From the Action menu, choose View data.

    You will see a warning that Athena is going to open and that you will be charged for Athena usage. The Athena console will open.

  4. Choose Preview data.

  5. Choose Get started.

  6. If the tutorial window opens, close it.

You see the following message at the top of the console:

You must specify an Amazon Simple Storage Service (Amazon S3) bucket to hold the results from any queries that you run.

  1. On the AWS Management Console, on the Services menu, choose S3.
  2. Create or select a bucket in the same Region as the AWS Glue Data Catalog.
  3. In the bucket properties window, choose Copy bucket ARN.
  4. Paste the bucket Amazon Resource Name (ARN) into a text editor.
  5. To return to Athena, go back to the AWS Management Console, and on the Services menu, choose Athena.
  6. Choose set up a query result location in Amazon S3.
  7. In the Query result location box, enter the name of the bucket. The bucket name is the long string of characters at the end of the bucket ARN. Make sure to prefix the bucket name with s3:// and to terminate it with a forward slash (/), for example: s3://<bucket-name>/

  8. Choose Save.
  9. From the list of databases, choose the weatherdata database.
  10. Choose the csv table.
  11. Choose the vertical ellipsis (three dots), and then choose Preview table.

You will see the first 10 records of the weather table.

Athena ran a Structured Query Language (SQL) query to get the first 10 rows from the table. The data is not loaded into Athena at this stage; by using AWS Glue, you inferred and edited the schema to suit your needs. Also, note the resources that the Athena query consumed (the run time and the amount of data that was scanned). As you develop more complex applications, minimizing resource consumption will play an important role in optimizing costs.
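For reference, the preview that Athena runs is essentially a SELECT statement with a LIMIT clause. The following is a minimal sketch of that kind of query, assuming the crawler's default table name csv in the weatherdata database:

    -- Roughly what "Preview table" issues: return the first 10 rows of the table.
    SELECT * FROM weatherdata.csv LIMIT 10;

The amount of data scanned depends on the underlying format; the columnar Parquet format that you use in the next task lets Athena read only the columns that a query references, which reduces the data scanned.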

Task 2.1: Create a table for data after 1950

In this step, you will create an external table that only includes data since 1950. To optimize your use of Athena, you will store data in the Apache Parquet format. Apache Parquet is an open source, columnar data format that is optimized for performance and storage. To read more about Apache Parquet, see Apache Parquet.

  1. Create an S3 bucket to store the external table. The bucket should be in the same Region that your lab is running in. Follow the bucket naming conventions that are described in the Amazon S3 documentation.

  2. Copy the following query, and paste it into the Athena command window. Remember to replace <bucket-name> with the name of the bucket that you created:

    CREATE TABLE weatherdata.late20th
    WITH (
     format='PARQUET', external_location='s3://<bucket-name>/lab3/'
    ) AS SELECT date, type, observation FROM csv
    WHERE date/10000 BETWEEN 1950 AND 2015;
  3. Choose Run query.

  4. Preview the query by going to the late20th table, choosing the vertical ellipsis (three dots) next to the table, and then choosing Preview table.
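To confirm that the new table contains only the intended years, you can also run a quick check against it. The following is a minimal sketch, assuming the table was created as weatherdata.late20th and that the date column holds integer values in YYYYMMDD form, as in the query above:

    -- Verify the year range and row count of the Parquet-backed table
    -- (integer division by 10000 extracts the year from YYYYMMDD values).
    SELECT min(date)/10000 AS first_year,
           max(date)/10000 AS last_year,
           count(*) AS row_count
    FROM weatherdata.late20th;

The first_year and last_year values should fall within the 1950–2015 range specified in the WHERE clause of the CREATE TABLE query.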

Task 2.2: Run a query from the selected data

Now that you have isolated the data that you are interested in, you can write queries for further analysis. Start by creating a view that includes only the maximum temperature (TMAX) readings. To create this view:

  1. Copy the following query, and paste it into the Athena command window:

    CREATE VIEW TMAX AS
    SELECT date, observation, type
    FROM late20th
    WHERE type = 'TMAX'
  2. Choose Run query.

  3. Preview the data in the view by going to tmax, choosing the vertical ellipsis (three dots) next to it, and then choosing Preview.

  4. Copy the following query, and paste it into the Athena command window:

    SELECT date/10000 as Year, avg(observation)/10 as Max
    FROM tmax
    GROUP BY date/10000
    ORDER BY date/10000;

  5. Choose Run query.

You should see a table of data results from 1950–2015, with the average maximum temperature for each year.
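Because the scenario focuses on drought relief, the same pattern can be applied to rainfall. The following sketch is not part of the lab steps; it assumes the standard GHCN-D element code 'PRCP' (precipitation, recorded in tenths of millimeters) is present in the late20th table, and each statement must be run as a separate Athena query:

    -- Hypothetical follow-on: a view of precipitation readings, mirroring the TMAX view.
    CREATE VIEW prcp AS
    SELECT date, observation, type
    FROM late20th
    WHERE type = 'PRCP';

    -- Average daily precipitation per year, converted to millimeters
    -- (assuming observation values are in tenths of millimeters).
    SELECT date/10000 as Year, avg(observation)/10 as AvgDailyPrecip
    FROM prcp
    GROUP BY date/10000
    ORDER BY date/10000;

Years with unusually low averages can help flag candidate drought periods for further analysis.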

Last modified: March 22, 2022