Reading Avro Data from HDFS

Use the PXF HDFS Connector to read Avro-format data. This section describes how to use PXF to access Avro data in HDFS, including how to create and query an external table that references an Avro file in the HDFS data store.

Prerequisites

Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from HDFS.

Working with Avro Data

Apache Avro is a data serialization framework where the data is serialized in a compact binary format. Avro specifies that data types be defined in JSON. Avro format data has an independent schema, also defined in JSON. An Avro schema, together with its data, is fully self-describing.

Data Type Mapping

Avro supports both primitive and complex data types.

To represent Avro primitive data types in Greenplum Database, map data values to Greenplum Database columns of the same type.

Avro supports complex data types including arrays, maps, records, enumerations, and fixed types. Map top-level fields of these complex data types to the Greenplum Database TEXT type. While Greenplum Database does not natively support these types, you can create Greenplum Database functions or application code to extract or further process subcomponents of these complex data types.
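
For example, the following query (a minimal sketch; the record value shown is illustrative and inlined rather than read from an external table) uses the built-in btrim and split_part string functions to extract the street value from a record that PXF has rendered as text with the default delimiters:

-- strip the enclosing braces, isolate the second comma-delimited entry,
-- then take the value that follows the colon; returns: main st
SELECT split_part(split_part(btrim(addr, '{}'), ',', 2), ':', 2) AS street
  FROM ( SELECT '{number:1,street:main st,city:san jose}'::text AS addr ) s;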

The following table summarizes external mapping rules for Avro data.

Avro Data Type                               PXF/Greenplum Data Type
boolean                                      boolean
bytes                                        bytea
double                                       float8
float                                        real
int                                          int or smallint
long                                         bigint
string                                       text
Complex type: Array, Map, Record, or Enum    text, with delimiters inserted between collection items, mapped key-value pairs, and record data.
Complex type: Fixed                          bytea
union                                        Follows the above conventions for the primitive or complex data type contained in the union; supports null values.

Avro Schemas and Data

Avro schemas are defined using JSON and are composed of the same primitive and complex types identified in the data type mapping section above. Avro schema files typically have a .avsc suffix.

Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type.
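
For example, a minimal schema defining a record with two fields might look like the following (an illustrative sketch; a complete schema appears later in this topic):

{
  "type": "record",
  "name": "minimal_example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "username", "type": "string"}
  ]
}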

Creating the External Table

Use the hdfs:avro profile to read Avro-format data in HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:avro[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.

Keyword                Value
<path-to-hdfs-file>    The absolute path to the directory or file in the HDFS data store.
PROFILE                The PROFILE keyword must specify hdfs:avro.
<custom-option>        <custom-option>s are discussed below.
FORMAT 'CUSTOM'        Use FORMAT 'CUSTOM' with the hdfs:avro profile. The CUSTOM FORMAT requires that you specify (FORMATTER='pxfwritable_import').

For complex types, the PXF hdfs:avro profile inserts default delimiters between collection items and between keys and values. You can specify different delimiter characters via the hdfs:avro custom options in the CREATE EXTERNAL TABLE command.

The hdfs:avro profile supports the following <custom-option>s:

Option Keyword      Description
COLLECTION_DELIM    The delimiter character(s) placed between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is a comma (,).
MAPKEY_DELIM        The delimiter character(s) placed between the key and the value of a map entry when PXF maps an Avro complex data type to a text column. The default is a colon (:).
RECORDKEY_DELIM     The delimiter character(s) placed between the field name and the value of a record entry when PXF maps an Avro complex data type to a text column. The default is a colon (:).
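
For example, the following LOCATION clause (an illustrative sketch using the example file referenced later in this topic) sets a pipe (|) as the delimiter between collection entries:

LOCATION ('pxf://data/pxf_examples/pxf_avro.avro?PROFILE=hdfs:avro&COLLECTION_DELIM=|')

With this setting, an array such as the followers field in the example below would render as [kate|santosh] rather than the default [kate,santosh].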

Example: Reading Avro Data

The examples in this section operate on Avro data with the following record schema (field name - data type):

  • id - long
  • username - string
  • followers - array of string
  • fmap - map of long
  • relationship - enumerated type
  • address - record comprised of street number (int), street name (string), and city (string)

Create Schema

Perform the following steps to create an Avro schema file that represents the example schema described above.

  1. Create a file named avro_schema.avsc:

    $ vi /tmp/avro_schema.avsc
    
  2. Copy and paste the following text into avro_schema.avsc:

    {
      "type" : "record",
      "name" : "example_schema",
      "namespace" : "com.example",
      "fields" : [ {
        "name" : "id",
        "type" : "long",
        "doc" : "Id of the user account"
      }, {
        "name" : "username",
        "type" : "string",
        "doc" : "Name of the user account"
      }, {
        "name" : "followers",
        "type" : {"type": "array", "items": "string"},
        "doc" : "Users followers"
      }, {
        "name": "fmap",
        "type": {"type": "map", "values": "long"}
      }, {
        "name": "relationship",
        "type": {
            "type": "enum",
            "name": "relationshipEnum",
            "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
        }
      }, {
        "name": "address",
        "type": {
            "type": "record",
            "name": "addressRecord",
            "fields": [
                {"name":"number", "type":"int"},
                {"name":"street", "type":"string"},
                {"name":"city", "type":"string"}]
        }
      } ],
      "doc" : "A basic schema for storing messages"
    }
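
    Because an Avro schema is plain JSON, you can optionally check the file for JSON syntax errors before using it (a quick sanity check, assuming Python is available on the host):

    $ python -m json.tool /tmp/avro_schema.avsc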
    

Create Avro Data File (JSON)

Perform the following steps to create a sample Avro data file conforming to the above schema.

  1. Create a text file named pxf_avro.txt:

    $ vi /tmp/pxf_avro.txt
    
  2. Enter the following data into pxf_avro.txt:

    {"id":1, "username":"john","followers":["kate", "santosh"], "relationship": "FRIEND", "fmap": {"kate":10,"santosh":4}, "address":{"number":1, "street":"renaissance drive", "city":"san jose"}}
    
    {"id":2, "username":"jim","followers":["john", "pam"], "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer creek", "city":"palo alto"}}
    

    The sample data uses a comma (,) to separate collection entries and a colon (:) to separate map keys from values and record field names from values; these correspond to the default PXF delimiters.

  3. Convert the text file to Avro format. There are various ways to perform the conversion, both programmatically and via the command line. This example uses the Java Avro tools and assumes that the avro-tools-1.8.1.jar file resides in the current directory:

    $ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_avro.txt > /tmp/pxf_avro.avro
    

    The generated Avro binary data file is written to /tmp/pxf_avro.avro.
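
    Optionally, you can spot-check the conversion by dumping the binary file back to JSON with the same tool:

    $ java -jar ./avro-tools-1.8.1.jar tojson /tmp/pxf_avro.avro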

  4. Copy the generated Avro file to HDFS:

    $ hdfs dfs -put /tmp/pxf_avro.avro /data/pxf_examples/
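
    If the /data/pxf_examples directory does not yet exist, create it first with hdfs dfs -mkdir -p /data/pxf_examples. To confirm the copy, you can list the file:

    $ hdfs dfs -ls /data/pxf_examples/pxf_avro.avro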
    

Query With hdfs:avro Profile

Perform the following operations to create and query an external table that references the pxf_avro.avro file that you added to HDFS in the previous section. When creating the table:

  • Map the top-level primitive fields, id (type long) and username (type string), to their equivalent Greenplum Database types (bigint and text).
  • Map the remaining complex fields to type text.
  • Explicitly set the record, map, and collection delimiters using the hdfs:avro profile custom options.
  1. Use the hdfs:avro profile to create a queryable external table from the pxf_avro.avro file:

    postgres=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
               LOCATION ('pxf://data/pxf_examples/pxf_avro.avro?PROFILE=hdfs:avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
               FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    
  2. Perform a simple query of the pxf_hdfs_avro table:

    postgres=# SELECT * FROM pxf_hdfs_avro;
    
     id | username |   followers    |        fmap         | relationship |                      address                      
    ----+----------+----------------+---------------------+--------------+---------------------------------------------------
      1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | {number:1,street:renaissance drive,city:san jose}
      2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | {number:9,street:deer creek,city:palo alto}
    (2 rows)
    

    The simple query of the external table shows the components of the complex type data separated with the delimiters specified in the CREATE EXTERNAL TABLE call.

  3. Process the delimited components in the text columns as necessary for your application. For example, the following command uses the built-in Greenplum Database string_to_array function to convert entries in the followers field into a text array column in a new view.

    postgres=# CREATE VIEW followers_view AS 
    SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[] 
        AS followers 
      FROM pxf_hdfs_avro;
    
  4. Query the view, using the array containment operator (@>) to filter for rows in which a particular follower appears:

    postgres=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
    
     username |                   address                   
    ----------+---------------------------------------------
     jim      | {number:9,street:deer creek,city:palo alto}
    (1 row)