Analyze Data with Amazon Redshift

Data analysts often work with very large datasets. For example, some data warehouses can contain multiple petabytes of data, and data lakes can be even larger. Amazon Redshift is designed to handle extremely large datasets. Unlike traditional relational database management systems, Amazon Redshift uses columnar data storage. Columnar storage improves query performance and saves on storage costs. To read more about columnar storage, see Columnar storage in the Amazon Redshift documentation.

Objectives

After completing this lab, you will be able to:

  • Access Amazon Redshift in the AWS Management Console
  • Create an Amazon Redshift cluster.
  • Load data from Amazon Simple Storage Service (Amazon S3) into an Amazon Redshift table
  • Query data in Amazon Redshift

Scenario summary

In this lab, you are a data analyst who works for an enterprise that manages ticket sales for music events. The company has a web application that aggregates ticket sales between buyers and sellers. When customers buy tickets to these events, the web application stores data about both the buyer and the seller.

Your manager asked you to generate several reports from the company’s data files. These data files are already stored in Amazon S3, so you decide to use Amazon Redshift to load the data and build the reports.

The following diagram illustrates the environment you will build in this lab:

Task 1: Create an IAM role for your Amazon Redshift cluster

When Amazon Redshift accesses other AWS services on your behalf, such as Amazon S3, it must be authorized to do so. This access is controlled by an AWS Identity and Access Management (IAM) role. IAM roles limit which actions a user or service can take. In this task, you will create the role that your cluster will use.

Note: In a real-world scenario, an administrator would create this role for any user who must access Amazon Redshift. To learn more about how IAM and Amazon Redshift work together, see Authorizing Amazon Redshift to Access Other AWS Services on Your Behalf.

To create the IAM role:

  1. From the list of services, choose IAM.

  2. From the navigation pane, choose Roles.

  3. Choose Create role.

  4. For Trusted entity type, select AWS service. In the Use case section, select Redshift from the dropdown menu.

  5. Select Redshift - Customizable and choose Next.

  6. On the Add permissions page, search for AmazonS3ReadOnlyAccess and select it.

  7. On the same page, choose Create policy, select the JSON tab, and enter the following policy document:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "ec2:DescribeVpcs",
                    "ec2:DescribeSubnets",
                    "ec2:DescribeNetworkInterfaces",
                    "ec2:DescribeAddresses",
                    "ec2:AssociateAddress",
                    "ec2:DisassociateAddress",
                    "ec2:CreateNetworkInterface",
                    "ec2:DeleteNetworkInterface",
                    "ec2:ModifyNetworkInterfaceAttribute"
                ],
                "Resource": "*",
                "Effect": "Allow"
            }
        ]
    }

    Name the policy RedshiftIAMPolicy and save it.

  8. Return to the Add permissions tab, refresh the policy list, search for RedshiftIAMPolicy, select it, and choose Next.

  9. Name the role myRedshiftRole and choose Create role.

This role gives Amazon Redshift permission to list and retrieve objects from Amazon S3. After the role is created, open it and copy its Amazon Resource Name (ARN); you will need the ARN when you load data in Task 3.
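The policy document pasted in step 7 must be valid JSON, or the console will reject it. As a quick local sanity check (plain Python, no AWS access needed), you can parse the document and inspect its shape. This is only an illustration of the policy's structure, not part of the lab steps:

```python
import json

# The permissions policy entered in step 7, copied from the lab text.
permissions_policy = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:DescribeVpcs",
                "ec2:DescribeSubnets",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeAddresses",
                "ec2:AssociateAddress",
                "ec2:DisassociateAddress",
                "ec2:CreateNetworkInterface",
                "ec2:DeleteNetworkInterface",
                "ec2:ModifyNetworkInterfaceAttribute"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
"""

# json.loads raises an exception if the document is malformed.
policy = json.loads(permissions_policy)
statement = policy["Statement"][0]
print(statement["Effect"], len(statement["Action"]))
```

If the paste is clean, the single statement allows nine EC2 networking actions, which Amazon Redshift needs to manage network interfaces in your VPC.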

Task 2: Create and configure an Amazon Redshift cluster

Clusters are the main infrastructure component of an Amazon Redshift data warehouse. A cluster is made up of one or more compute nodes. If there is more than one node, then one of the nodes will be the leader node and the other nodes will be compute nodes. Client applications interact with the leader node. The following diagram illustrates the overall Amazon Redshift infrastructure:

To read more about Amazon Redshift infrastructure, see Amazon Redshift System Overview.

To create a cluster for your data warehouse:

  1. On the AWS Management Console, on the Services menu, choose Amazon Redshift.

  2. Choose Create cluster.

  3. Under Cluster configuration, enter the following values. Some of these configurations might already be prefilled with the correct values:

    • Cluster identifier: redshift-cluster-1
    • Node type: dc2.large
    • Nodes: 2
    • Database port: 5439
    • Master user name: awsuser

  4. For Master user password, enter Passw0rd1.

  5. Expand Cluster permissions and select myRedshiftRole from the dropdown list.

  6. Choose Associate IAM role.

  7. Choose Create cluster.

    It will take a few minutes for Amazon Redshift to create the cluster.

  8. From the navigation pane, choose Clusters.

  9. Choose View all clusters.

Note that redshift-cluster-1 is in the list of clusters, and the status shows creating.

Task 2.1: Create a security group for your cluster

You must configure the virtual private cloud (VPC) that hosts your Amazon Redshift cluster so it allows traffic through port 5439. To do this, you must create and configure a security group for your VPC.

  1. On the AWS Management Console, on the Services menu, choose EC2.

  2. In the NETWORK & SECURITY section, choose Security Groups.

  3. Choose Create Security Group.

  4. Enter the name Redshift Security Group and the description Security group for my Amazon Redshift cluster.

  5. In the Security group rules section, choose Add Rule.

  6. In the Security group rules section, enter the following values:

    • Type: Redshift
    • Protocol: TCP
    • Port Range: 5439
    • Source: Anywhere (0.0.0.0/0, ::/0)
    • Description: Redshift inbound rule
  7. Choose Create security group.

Task 2.2: Configure your Amazon Redshift cluster

  1. On the AWS Management Console, on the Services menu, choose Amazon Redshift.

  2. From the navigation pane, choose Clusters.

    On the Clusters page, you will see the cluster that you created. The Status column will display Creating until your cluster is ready for the next step. You may need to use the refresh button in the top-right corner to refresh the window until the status changes.

  3. When the Cluster Status changes to available and the DB health displays healthy, go to the list of clusters and choose your cluster.

  4. Navigate to the Properties tab.

  5. In the Network and security settings section, choose Edit.

  6. Select the Redshift Security Group from the dropdown list and remove the default group.

  7. Choose Save changes.

Task 3: Load data into your Amazon Redshift cluster

Now that you have created your cluster, you can load data into it from Amazon S3. Amazon Redshift can load data in a variety of formats, including:

  • Text
  • Apache Avro
  • JavaScript Object Notation (JSON)
  • Apache Parquet
  • Optimized Row Columnar (ORC)

The data you use in this lab is stored as text files with the pipe (|) delimiter. To load the data, you will first create the tables with the create table command, and then copy the data with the copy command.
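To make the file layout concrete, here is a minimal Python sketch of how one pipe-delimited row maps onto the columns of the users table. The sample line is invented for illustration; real rows in allusers_pipe.txt follow the same layout with different values:

```python
# Hypothetical pipe-delimited row, one field per column of the users table
# (userid, username, firstname, lastname, city, state, email, phone, ...).
line = "1|JSG99FHE|Rafael|Taylor|Kent|WA|Etiam.laoreet@example.com|(664) 602-4412"

fields = line.split("|")
userid, username, firstname, lastname, city, state, email, phone = fields
print(userid, city, state)
```

The copy command performs exactly this kind of splitting server-side when you pass delimiter '|'.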

Task 3.1: Create the tables in the dev database

  1. From the navigation pane, choose Clusters.

  2. From the list of clusters, choose redshift-cluster-1.

  3. From the navigation pane, choose Query editor.

  4. Choose Connect to database and enter the following values:

    • Database: dev
    • Database user: awsuser
  5. Choose Connect.

    Note: If you receive the error message "Query Editor is not available for the cluster you selected. Upgrade the cluster to the latest release to enable it.", refresh the Amazon Redshift page and try connecting again. The connection will eventually be established.

  6. From the Schema dropdown list, choose the public schema.

  7. Copy the following code block, and paste it into the query window. This code creates the users table.

    create table users(
        userid integer not null distkey sortkey,
        username char(8),
        firstname varchar(30),
        lastname varchar(30),
        city varchar(30),
        state char(2),
        email varchar(100),
        phone char(14),
        likesports boolean,
        liketheatre boolean,
        likeconcerts boolean,
        likejazz boolean,
        likeclassical boolean,
        likeopera boolean,
        likerock boolean,
        likevegas boolean,
        likebroadway boolean,
        likemusicals boolean);
  8. Choose Run query.

    After the query completes, you should see Statement completed successfully.

  9. Copy the following code block, and paste it into the query window. This code creates the date table.

    create table date(
        dateid smallint not null distkey sortkey,
        caldate date not null,
        day character(3) not null,
        week smallint not null,
        month character(5) not null,
        qtr character(5) not null,
        year smallint not null,
        holiday boolean default('N'));
  10. Choose Run query.

  11. Copy the following code block, and paste it into the query window. This code creates the sales table.

    create table sales(
        salesid integer not null,
        listid integer not null distkey,
        sellerid integer not null,
        buyerid integer not null,
        eventid integer not null,
        dateid smallint not null sortkey,
        qtysold smallint not null,
        pricepaid decimal(8,2),
        commission decimal(8,2),
        saletime timestamp);
  12. Choose Run query.

Task 3.2: Load data from Amazon S3

To load the data from Amazon S3:

  1. Copy and paste the following code into the query window. Replace <iam-role-arn> with the ARN that you stored when you created myRedshiftRole in Task 1.

    copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
    credentials 'aws_iam_role=<iam-role-arn>'
    delimiter '|' region 'us-west-2';

    After the query runs, you should see Statement completed successfully.

  2. Copy and paste the following code into the query window. Replace <iam-role-arn> with the ARN that you stored when you created myRedshiftRole in Task 1.

    copy date from 's3://awssampledbuswest2/tickit/date2008_pipe.txt'
    credentials 'aws_iam_role=<iam-role-arn>'
    delimiter '|' region 'us-west-2';

    The sales data is stored with a tab (\t) delimiter. Because the data includes timestamps, you must also specify the time format.

  3. Copy the following code and run it in the query window. Replace <iam-role-arn> with the ARN that you stored when you created myRedshiftRole in Task 1.

    copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt'
    credentials 'aws_iam_role=<iam-role-arn>'
    delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';
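If you later script these loads, the role ARN is simply interpolated into the credentials string. The Python sketch below builds the sales copy statement with a placeholder ARN (not a real account) and shows the Python strptime equivalent of the Redshift timeformat used above:

```python
from datetime import datetime

# Placeholder ARN for illustration only; substitute the ARN of your own
# myRedshiftRole here.
role_arn = "arn:aws:iam::123456789012:role/myRedshiftRole"

copy_sales = (
    "copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt' "
    f"credentials 'aws_iam_role={role_arn}' "
    "delimiter '\\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';"
)
print(copy_sales)

# Redshift's 'MM/DD/YYYY HH:MI:SS' corresponds to this strptime pattern:
ts = datetime.strptime("01/05/2008 14:30:00", "%m/%d/%Y %H:%M:%S")
print(ts)
```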

Task 4: Query the data

Now that you have loaded the data into your Amazon Redshift cluster, you can write queries to generate the reports that your manager asked for. Here are some queries to get you started:

To see the overall structure of the sales table:

  1. View the first 10 records by going to the sales table and choosing the Preview icon next to it.

    You should see query results like the following example:

  2. To retrieve the number of items sold on a particular date, copy the following query, paste it into the Query window, and choose Run query.

    SELECT sum(qtysold)
    FROM sales, date
    WHERE sales.dateid = date.dateid
    AND caldate = '2008-01-05';

    The query should return the value of 210.

  3. To find the top 10 buyers by quantity, copy the following query, paste it into the Query window, and choose Run query.

    SELECT firstname, lastname, total_quantity
    FROM
    (SELECT buyerid, sum(qtysold) total_quantity
    FROM  sales
    GROUP BY buyerid
    ORDER BY total_quantity desc limit 10) Q, users
    WHERE Q.buyerid = userid
    ORDER BY Q.total_quantity desc;

The query should return a list of customers and the quantity that was sold.
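The subquery-then-join shape of the top-buyers query can be tried locally before running it against the cluster. The following sketch runs an equivalent query against an in-memory SQLite database with made-up rows, so the numbers are illustrative only and will not match the tickit results:

```python
import sqlite3

# In-memory SQLite sketch of the top-buyers query logic. The rows are
# invented for illustration and do not match the tickit sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table users(userid integer, firstname text, lastname text);
    create table sales(buyerid integer, qtysold integer);
    insert into users values (1, 'Ana', 'Lee'), (2, 'Raj', 'Patel');
    insert into sales values (1, 4), (1, 3), (2, 2);
""")

# Aggregate per buyer first, then join to users for the display names.
rows = con.execute("""
    SELECT firstname, lastname, total_quantity
    FROM (SELECT buyerid, sum(qtysold) AS total_quantity
          FROM sales GROUP BY buyerid
          ORDER BY total_quantity DESC LIMIT 10) q
    JOIN users ON q.buyerid = users.userid
    ORDER BY total_quantity DESC;
""").fetchall()
print(rows)  # [('Ana', 'Lee', 7), ('Raj', 'Patel', 2)]
```

Limiting inside the subquery keeps the join small, which is the same reason the lab query applies limit 10 before joining to users.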

Last modified: March 24, 2022