Configuring the Hadoop User, User Impersonation, and Proxying

PXF accesses Hadoop services on behalf of Greenplum Database end users.

When user impersonation is enabled (the default), PXF accesses Hadoop services using the identity of the Greenplum Database user account that logs in to Greenplum and performs an operation that uses a PXF connector. Keep in mind that PXF uses only the login identity of the user when accessing Hadoop services. For example, if a user logs in to Greenplum Database as the user jane and then execute SET ROLE or SET SESSION AUTHORIZATION to assume a different user identity, all PXF requests still use the identity jane to access Hadoop services. When user impersonation is enabled, you must explicitly configure each Hadoop data source (HDFS, Hive, HBase) to allow PXF to act as a proxy for impersonating specific Hadoop users or groups.

When user impersonation is disabled, PXF executes all Hadoop service requests as the PXF process owner (usually gpadmin) or the Hadoop user identity that you specify. This behavior provides no means to control access to Hadoop services for different Greenplum Database users. It requires that this user have access to all files and directories in HDFS, and all tables in Hive and HBase that are referenced in PXF external table definitions.

You configure the Hadoop user and PXF user impersonation setting for a server via the pxf-site.xml server configuration file. Refer to About Kerberos and User Impersonation Configuration (pxf-site.xml) for more information about the configuration properties in this file.

The following table describes some of the PXF configuration scenarios for Hadoop access:

Scenario pxf-site.xml Required Impersonation Setting Required Configuration
PXF accesses Hadoop using the identity of the Greenplum Database user. yes true Enable user impersonation, identify the Hadoop proxy user in the pxf.service.user.name, and configure Hadoop proxying for this Hadoop user identity.
PXF accesses Hadoop using the identity of the operating system user that started the PXF process. yes false Disable user impersonation.
PXF accesses Hadoop using a user identity that you specify. yes false Disable user impersonation and identify the Hadoop user identity in the pxf.service.user.name property setting.

Configure the Hadoop User

By default, PXF accesses Hadoop using the identity of the Greenplum Database user, and you are required to set up a proxy Hadoop user. You can configure PXF to access Hadoop as a different user on a per-server basis.

Perform the following procedure to configure the Hadoop user:

  1. Log in to your Greenplum Database master node as the administrative user:

    $ ssh gpadmin@<gpmaster>
    
  2. Identify the name of the PXF Hadoop server configuration that you want to update.

  3. Navigate to the server configuration directory. For example, if the server is named hdp3:

    gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
    
  4. If the server configuration does not yet include a pxf-site.xml file, copy the template file to the directory. For example:

    gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
    
  5. Open the pxf-site.xml file in the editor of your choice, and configure the Hadoop user name. When impersonation is disabled, this name identifies the Hadoop user identity that PXF will use to access the Hadoop system. When user impersonation is enabled, this name identifies the PXF proxy Hadoop user. For example, if you want to access Hadoop as the user hdfsuser1:

    <property>
        <name>pxf.service.user.name</name>
        <value>hdfsuser1</value>
    </property>
    
  6. Save the pxf-site.xml file and exit the editor.

  7. Use the pxf cluster sync command to synchronize the PXF Hadoop server configuration to your Greenplum Database cluster:

    gpadmin@gpmaster$ pxf cluster sync
    

Configure PXF User Impersonation

PXF user impersonation is enabled by default for Hadoop servers. You can configure PXF user impersonation on a per-server basis. Perform the following procedure to turn PXF user impersonation on or off for the Hadoop server configuration:

In previous versions of Greenplum Database, you configured user impersonation globally for Hadoop clusters via the now deprecated PXF_USER_IMPERSONATION setting in the pxf-env.sh configuration file.
  1. Navigate to the server configuration directory. For example, if the server is named hdp3:

    gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
    
  2. If the server configuration does not yet include a pxf-site.xml file, copy the template file to the directory. For example:

    gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
    
  3. Open the pxf-site.xml file in the editor of your choice, and update the user impersonation property setting. For example, if you do not require user impersonation for this server configuration, set the pxf.service.user.impersonation property to false:

    <property>
        <name>pxf.service.user.impersonation</name>
        <value>false</value>
    </property>
    

    If you require user impersonation, turn it on:

    <property>
        <name>pxf.service.user.impersonation</name>
        <value>true</value>
    </property>
    
  4. If you enabled user impersonation, you must configure Hadoop proxying as described in Configure Hadoop Proxying. You must also configure Hive User Impersonation and HBase User Impersonation if you plan to use those services.

  5. Save the pxf-site.xml file and exit the editor.

  6. Use the pxf cluster sync command to synchronize the PXF Hadoop server configuration to your Greenplum Database cluster:

    gpadmin@gpmaster$ pxf cluster sync
    

Configure Hadoop Proxying

When PXF user impersonation is enabled for a Hadoop server configuration, you must configure Hadoop to permit PXF to proxy Greenplum users. This configuration involves setting certain hadoop.proxyuser.* properties. Follow these steps to set up PXF Hadoop proxy users:

  1. Log in to your Hadoop cluster and open the core-site.xml configuration file using a text editor, or use Ambari or another Hadoop cluster manager to add or edit the Hadoop property values described in this procedure.

  2. Set the property hadoop.proxyuser.<name>.hosts to specify the list of PXF host names from which proxy requests are permitted. Substitute the PXF proxy Hadoop user for <name>. The PXF proxy Hadoop user is the pxf.service.user.name that you configured in the procedure above, or, if you are using Kerberos authentication to Hadoop, the proxy user identity is the primary component of the Kerberos principal. If you have not configured pxf.service.user.name, the proxy user is the operating system user that started PXF. Provide multiple PXF host names in a comma-separated list. For example, if the PXF proxy user is named hdfsuser2:

    <property>
        <name>hadoop.proxyuser.hdfsuser2.hosts</name>
        <value>pxfhost1,pxfhost2,pxfhost3</value>
    </property>
    
  3. Set the property hadoop.proxyuser.<name>.groups to specify the list of HDFS groups that PXF as Hadoop user <name> can impersonate. You should limit this list to only those groups that require access to HDFS data from PXF. For example:

    <property>
        <name>hadoop.proxyuser.hdfsuser2.groups</name>
        <value>group1,group2</value>
    </property>
    
  4. You must restart Hadoop for your core-site.xml changes to take effect.

  5. Copy the updated core-site.xml file to the PXF Hadoop server configuration directory $PXF_CONF/servers/<server_name> on the Greenplum Database master and synchronize the configuration to the standby master and each Greenplum Database segment host.

Hive User Impersonation

The PXF Hive connector uses the Hive MetaStore to determine the HDFS locations of Hive tables, and then accesses the underlying HDFS files directly. No specific impersonation configuration is required for Hive, because the Hadoop proxy configuration in core-site.xml also applies to Hive tables accessed in this manner.

HBase User Impersonation

In order for user impersonation to work with HBase, you must enable the AccessController coprocessor in the HBase configuration and restart the cluster. See 61.3 Server-side Configuration for Simple User Access Operation in the Apache HBase Reference Guide for the required hbase-site.xml configuration settings.