Accessing Azure, Google Cloud Storage, Minio, and S3 Object Stores with PXF

PXF is installed with connectors to Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Minio, and S3 object stores.

Prerequisites

Before working with object store data using PXF, ensure that:

Connectors, Data Formats, and Profiles

The PXF object store connectors provide built-in profiles to support the following data formats:

  • Text
  • Avro
  • JSON
  • Parquet
  • AvroSequenceFile
  • SequenceFile

The PXF connectors to Azure, Google Cloud Storage, Minio, and S3 expose the following profiles to read, and in many cases write, these supported data formats:

Data Format                           Azure Blob Storage      Azure Data Lake       Google Cloud Storage  S3 or Minio
delimited single line plain text      wasbs:text              adl:text              gs:text               s3:text
delimited text with quoted linefeeds  wasbs:text:multi        adl:text:multi        gs:text:multi         s3:text:multi
Avro                                  wasbs:avro              adl:avro              gs:avro               s3:avro
JSON                                  wasbs:json              adl:json              gs:json               s3:json
Parquet                               wasbs:parquet           adl:parquet           gs:parquet            s3:parquet
AvroSequenceFile                      wasbs:AvroSequenceFile  adl:AvroSequenceFile  gs:AvroSequenceFile   s3:AvroSequenceFile
SequenceFile                          wasbs:SequenceFile      adl:SequenceFile      gs:SequenceFile       s3:SequenceFile

You specify the profile name in the PROFILE option of the pxf protocol URI when you use the CREATE EXTERNAL TABLE command to create a Greenplum Database external table that references a file or directory in a specific object store. For example, the following command creates an external table that uses the s3:text profile:

CREATE EXTERNAL TABLE pxf_s3_text(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://S3_BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=E',');
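
The same pattern applies to the other profiles in the table above. As a minimal sketch, assuming a Parquet file in a Google Cloud Storage bucket, the following statement uses the gs:parquet profile. The table name, bucket, path, column list, and the server name gssrvcfg are hypothetical placeholders, and the CUSTOM format with the pxfwritable_import formatter is assumed here as the usual choice for reading binary formats such as Parquet:

```sql
-- Hypothetical example: bucket, path, columns, and server name are placeholders.
CREATE EXTERNAL TABLE pxf_gs_parquet(location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://GCS_BUCKET/pxf_examples/pxf_gs_sales.parquet?PROFILE=gs:parquet&SERVER=gssrvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Once defined, the external table is queried like any other table.
SELECT location, sum(total_sales) AS sales
FROM pxf_gs_parquet
GROUP BY location;
```

Only the PROFILE value, the SERVER value, and the FORMAT clause change between object stores and data formats; the column definitions must still match the data in the referenced file.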

Overriding the S3 Server Configuration

If you are accessing an S3 object store, you can override the credentials in the S3 server configuration by specifying the AWS access key ID and secret key directly, using these custom options in the LOCATION clause of the CREATE EXTERNAL TABLE command:

Custom Option  Value Description
accesskey      The AWS account access key ID.
secretkey      The secret key associated with the AWS access key ID.

For example:

CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int)
  LOCATION ('pxf://S3_BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg&accesskey=YOURKEY&secretkey=YOURSECRET')
FORMAT 'TEXT' (delimiter=E',');

Credentials that you provide in this manner are visible as part of the external table definition. Do not use this method of passing credentials in a production environment.
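
For contrast, a definition that relies only on the named server configuration keeps the keys out of the table definition. The following sketch reuses the placeholder names from the example above and assumes that the s3srvcfg server configuration already holds valid S3 credentials:

```sql
-- Sketch: no accesskey/secretkey options; credentials are assumed to come
-- from the s3srvcfg server configuration.
CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int)
  LOCATION ('pxf://S3_BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=E',');
```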

PXF does not support overriding Azure, Google Cloud Storage, and Minio server credentials in this manner at this time.