Loading CSV and JSON Data from S3 to Redshift
In this article, we will discuss how to load data from Amazon Simple Storage Service (S3) into Amazon Redshift using the COPY command. We will cover both CSV and JSON data formats and provide examples of how to escape special characters in these formats.
Understanding the Requirements
Before we begin, let’s review the requirements:
- We have data stored in an S3 bucket.
- The data is in a CSV or JSON format.
- We want to load this data into a Redshift database.
- The data will be loaded using the COPY command (an example target table is sketched just below).
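To make the later examples concrete, here is a minimal sketch of the target table. The table name and the columns (id, name, desc) are assumptions inferred from the commands below, not a schema from any real system; note that both table and desc are reserved words in Redshift, so they must be double-quoted.
CREATE TABLE "table" (        -- quoted because TABLE is a reserved word
    id     INTEGER,           -- assumed numeric key
    name   VARCHAR(64),       -- assumed short label
    "desc" VARCHAR(1024)      -- holds a JSON object as text; DESC is also reserved
);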
Loading CSV Data from S3 to Redshift
To load CSV data from S3, we can use the following command:
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
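Because the FROM clause points at a manifest rather than directly at a data file, the trailing MANIFEST keyword tells COPY to treat that S3 object as a list of files to load. A manifest is a small JSON document; the bucket and file names below are hypothetical placeholders:
{
  "entries": [
    {"url": "s3://bucketname/data/part-0000.csv", "mandatory": true},
    {"url": "s3://bucketname/data/part-0001.csv", "mandatory": true}
  ]
}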
This command is close, but our data needs one adjustment. The third field (the desc column) holds a JSON object, and a JSON object is full of double quotes, which collide with ordinary CSV quoting. Those inner double quotes have to be protected so COPY does not mangle the field.
Escaping Double Quotes in CSV Data
The approach used here is to surround the JSON value with single quotes instead of double quotes: REMOVEQUOTES strips the enclosing single quotes on load and leaves the inner double quotes untouched. (Alternatively, with COPY's CSV format option, embedded double quotes are escaped by doubling them, writing "" for ".)
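As a hypothetical data row (the values are illustrative, not taken from any real file), the desc field would be written like this:
1,first,'{"id":1,"name":"test"}'
(With COPY's CSV option instead of REMOVEQUOTES, the same field would be written "{""id"":1,""name"":""test""}".)
With the data escaped this way, the COPY command itself does not change: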
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
Loading JSON Data from S3 to Redshift
To load JSON data from S3, we can use the COPY command with the following options:
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
FORMAT AS JSON 'auto'
MANIFEST;
With 'auto', COPY matches the top-level keys of each JSON object to the table's column names and loads the data directly into Redshift.
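For 'auto' to work, each object's keys must match the column names. A hypothetical source file for the table sketched earlier would look like this, one JSON object per line:
{"id": 1, "name": "first", "desc": "loaded from JSON"}
{"id": 2, "name": "second", "desc": "another row"}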
Parsing JSON Data in Redshift
To query the JSON data, Redshift provides functions such as JSON_PARSE, which converts a string into the SUPER type, and JSON_EXTRACT_PATH_TEXT, which pulls individual values out of a JSON string. Note that the PostgreSQL containment operator @> is not available in Redshift, and desc is a reserved word, so the column must be double-quoted. Here's an example:
SELECT *
FROM "table"
WHERE JSON_EXTRACT_PATH_TEXT("desc", 'id') = '1'
  AND JSON_EXTRACT_PATH_TEXT("desc", 'name') = 'test';
This returns all rows where the desc column contains a JSON object with the specified key-value pairs.
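If the column were declared as SUPER rather than varchar (an alternative design, not what the commands above use), the value could be stored with JSON_PARSE and navigated with PartiQL dot notation. A minimal sketch, assuming a hypothetical SUPER column named payload:
SELECT *
FROM "table"
WHERE payload.id = 1          -- dot notation navigates into the SUPER value
  AND payload.name = 'test';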
Using REMOVEQUOTES in CSV Data
When loading CSV data, we use the REMOVEQUOTES option to strip the surrounding quotation marks from each field. This is what makes the single-quote escaping above work: the enclosing single quotes are removed on load, while the double quotes (and commas) inside the JSON value are retained.
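For instance, a field written in the file as '{"id":1,"name":"test"}' arrives in the "desc" column as {"id":1,"name":"test"}. The full command, repeated for reference: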
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
Using IGNOREBLANKLINES in CSV Data
We also use the IGNOREBLANKLINES option to skip blank lines in the data. This is necessary because the data files listed in the manifest may contain blank lines, which would otherwise cause load errors.
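As an illustration (with hypothetical values again), a file like the following loads as exactly two rows: the header is skipped by IGNOREHEADER 1 and the empty line by IGNOREBLANKLINES.
id,name,desc
1,first,'{"id":1,"name":"test"}'

2,second,'{"id":2,"name":"demo"}'
Once more, the command: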
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
Conclusion
In this article, we discussed how to load CSV and JSON data from S3 into Redshift using the COPY command, covering the basic syntax along with options such as escaping special characters in CSV data and querying JSON data in Redshift.
By following these steps and examples, you should be able to load your data from S3 into Redshift successfully. Happy loading!