Reading JSON Data from HDFS

Use the PXF HDFS Connector to read JSON-format data. This section describes how to use PXF to access JSON data in HDFS, including how to create and query an external table that references a JSON file in the HDFS data store.

Prerequisites

Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from HDFS.

Working with JSON Data

JSON is a text-based data-interchange format. JSON data is typically stored in a file with a .json suffix.

A .json file will contain a collection of objects. A JSON object is a collection of unordered name/value pairs. A value can be a string, a number, true, false, null, or an object or an array. You can define nested JSON objects and arrays.

Sample JSON data file content:

  {
    "created_at":"MonSep3004:04:53+00002013",
    "id_str":"384529256681725952",
    "user": {
      "id":31424214,
      "location":"COLUMBUS"
    },
    "coordinates":{
      "type":"Point",
      "values":[
         13,
         99
      ]
    }
  }

In the sample above, user is an object composed of fields named id and location. To specify the nested fields in the user object as Greenplum Database external table columns, use . projection:

user.id
user.location

coordinates is an object composed of a text field named type and an array of integers named values. Use [] to identify specific elements of the values array as Greenplum Database external table columns:

coordinates.values[0]
coordinates.values[1]

Refer to Introducing JSON for detailed information on JSON syntax.

JSON to Greenplum Database Data Type Mapping

To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON supports complex data types including projections and arrays. Use N-level projection to map members of nested objects and arrays to primitive data types.

The following table summarizes external mapping rules for JSON data.

Table 1. JSON Mapping

JSON Data Type PXF/Greenplum Data Type
Primitive type (integer, float, string, boolean, null) Use the corresponding Greenplum Database built-in data type; see Greenplum Database Data Types.
Array Use [] brackets to identify a specific array index to a member of primitive type.
Object Use dot . notation to specify each level of projection (nesting) to a member of a primitive type.

JSON Data Read Modes

PXF supports two data read modes. The default mode expects one full JSON record per line. PXF also supports a read mode operating on JSON records that span multiple lines.

In upcoming examples, you will use both read modes to operate on a sample data set. The schema of the sample data set defines objects with the following member names and value data types:

  • “created_at” - text
  • “id_str” - text
  • “user” - object
    • “id” - integer
    • “location” - text
  • “coordinates” - object (optional)
    • “type” - text
    • “values” - array
      • [0] - integer
      • [1] - integer

The single-JSON-record-per-line data set follows:

{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{
"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values"
: [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{
"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{
"id":287819058,"location":""}, "coordinates": null}

This is the data set for for the multi-line JSON record data set:

{
  "root":[
    {
      "record_obj":{
        "created_at":"MonSep3004:04:53+00002013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      },
      "record_obj":{
        "created_at":"MonSep3004:04:54+00002013",
        "id_str":"384529260872228864",
        "user":{
          "id":67600981,
          "location":"KryberWorld"
        },
        "coordinates":{
          "type":"Point",
          "values":[
             8,
             52
          ]
        }
      }
    }
  ]
}

You will create JSON files for the sample data sets and add them to HDFS in the next section.

Loading the Sample JSON Data to HDFS

The PXF HDFS connector reads native JSON stored in HDFS. Before you can use Greenplum Database to query JSON format data, the data must reside in your HDFS data store.

Copy and paste the single line JSON record sample data set above to a file named singleline.json. Similarly, copy and paste the multi-line JSON record data set to a file named multiline.json.

Note: Ensure that there are no blank lines in your JSON files.

Copy the JSON data files that you just created to your HDFS data store. Create the /data/pxf_examples directory if you did not do so in a previous exercise. For example:

$ hdfs dfs -mkdir /data/pxf_examples
$ hdfs dfs -put singleline.json /data/pxf_examples
$ hdfs dfs -put multiline.json /data/pxf_examples

Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and analyze the JSON data.

Creating the External Table

Use the hdfs:json profile to read JSON-format files from HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.

Keyword Value
<path‑to‑hdfs‑file> The absolute path to the directory or file in the HDFS data store.
PROFILE The PROFILE keyword must specify hdfs:json.
<custom‑option> <custom-option>s are discussed below.
FORMAT ‘CUSTOM’ Use FORMAT 'CUSTOM' with the hdfs:json profile. The CUSTOM FORMAT requires that you specify (FORMATTER='pxfwritable_import').

PXF supports single- and multi- line JSON records. When you want to read multi-line JSON records, you must provide an IDENTIFIER <custom-option> and value. Use this <custom-option> to identify the member name of the first field in the JSON record object:

Option Keyword   Syntax,  Example(s)   Description
IDENTIFIER &IDENTIFIER=<value>
&IDENTIFIER=created_at
You must include the IDENTIFIER keyword and <value> in the LOCATION string only when you are accessing JSON data comprised of multi-line records. Use the <value> to identify the member name of the first field in the JSON record object.

Example: Reading a JSON File with Single Line Records

Use the following CREATE EXTERNAL TABLE SQL command to create a readable external table that references the single-line-per-record JSON data file.

CREATE EXTERNAL TABLE singleline_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values[0]" INTEGER,
  "coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Notice the use of . projection to access the nested fields in the user and coordinates objects. Also notice the use of [] to access specific elements of the coordinates.values[] array.

To query the JSON data in the external table:

SELECT * FROM singleline_json_tbl;

Example: Reading a JSON file with Multi-Line Records

The SQL command to create a readable external table from the multi-line-per-record JSON file is very similar to that of the single line data set above. You must additionally specify the LOCATION clause IDENTIFIER keyword and an associated value when you want to read multi-line JSON records. For example:

CREATE EXTERNAL TABLE multiline_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values[0]" INTEGER,
  "coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

created_at identifies the member name of the first field in the JSON record record_obj in the sample data schema.

To query the JSON data in this external table:

SELECT * FROM multiline_json_tbl;