AWS Glue: Introduction, Examples and Troubleshooting

Last updated:
Table of Contents

WIP Alert This is a work in progress. Current information is correct but more content may be added in the future.

Introduction

AWS Glue is used, among other things, to parse and set schemas for data.

The most important concept is that of the Data Catalog, which is the schema definition for some data (for example, in an S3 bucket). The Data Catalog can be used across all products in your AWS account.

Simplest possible example

  • Have your data (JSON, CSV, XML) in a S3 bucket

  • Create a Classifier that defines how to read your data into rows

  • Create a Crawler using the classifier you created above

Glue is commonly used together with Athena

A common workflow is:

  • Crawl an S3 using AWS Glue to find out what the schema looks like and build a table

  • Query this table using AWS Athena

Troubleshooting: Crawling and Querying JSON Data

It may be possible that Athena cannot read crawled Glue data, even though it has been correctly crawled.

For example if you have a file with the following contents in an S3 bucket:

[
  {"Id":123,"Name":"john smith"},
  {"Id":234,"Name":"mary smith"},    
]

This line gets correctly parsed by Glue but Athena can't query it! 1

Athena needs your file to look like this:

{"Id":123,"Name":"john smith"}
{"Id":234,"Name":"mary smith"}

Troubleshooting: Crawling and Querying XML Data

Once again, if you have file with XML content like this (dame data as above)

<row><Id>123</Id><Name>john smith</Name></row>
<row><Id>234</Id><Name>mary smith</Name></row>

AWS Glue can read this and it will correctly parse the fields and build a table. However, upon trying to read this table with Athena, you'll get the following error: HIVE_UNKNOWN_ERROR: Unable to create input format.

This is because AWS Athena cannot query XML files, even though you can parse them with AWS Glue.

Troubleshooting: Zero Records Returned

One annoying feature of Glue/Athena is that each data file must be in its own S3 folder, otherwise Athena won't be able to query it (it'll always say "Zero Records Returned")

  • Don't do this:

    # BAD
    my-bucket/
    |-- my-csv-file-1.csv
    |-- my-csv-file-2.csv
    
  • Do this instead:

    # GOOD
    my-bucket/
    |-- file1/
    |   |-- my-csv-file-1.csv
    |-- file2/
        |-- my-csv-file-2.csv
    

Troubleshooting: HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException

Tables created by AWS Glue lack one configuration option, which can be used to ignore malformed json.

You need to drop the table created by Glue and re-create it using this configuration option.

Use SHOW CREATE TABLE to show what your current table definition looks like:

SHOW CREATE TABLE posts;
CREATE EXTERNAL TABLE `posts`(
  `id` int COMMENT 'from deserializer', 
  `body` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='Body,Id') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://path/to/your/bucket/posts/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='crawl-all', 
  'averageRecordSize'='148', 
  'classification'='json', 
  'compressionType'='none', 
  'objectCount'='1', 
  'recordCount'='505974', 
  'sizeKey'='5722795', 
  'typeOfData'='file')

Now drop the table and recreate it, adding 'ignore.malformed.json' = 'true' to SERDEPROPERTIES. Make sure you use the output of your own call to SHOW CREATE TABLE!

Dropping your table won't make you lose your data

DROP TABLE posts;

CREATE EXTERNAL TABLE `posts`(
  `id` int COMMENT 'from deserializer', 
  `body` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='Body,Id',
  'ignore.malformed.json'='true') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://path/to/your/bucket/posts/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='crawl-all', 
  'averageRecordSize'='148', 
  'classification'='json', 
  'compressionType'='none', 
  'objectCount'='1', 
  'recordCount'='505974', 
  'sizeKey'='5722795', 
  'typeOfData'='file')

1: Trying to read this file using Athena would cause the following error: HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Missing value at 1 [character 2 line 1]

Dialogue & Discussion