PL/R Language
PL/R Language
This chapter contains the following information:
About Greenplum Database PL/R
PL/R is a procedural language. With the Greenplum Database PL/R extension you can write database functions in the R programming language and use R packages that contain R functions and data sets.
For information about supported PL/R versions, see the Greenplum Database Release Notes.
Installing R
For RHEL and CentOS, installing the PL/R package installs R in $GPHOME/ext/R-<version> and updates $GPHOME/greenplum_path.sh for Greenplum Database to use R.
To use PL/R on Ubuntu host systems, you must install and configure R on all Greenplum Database host systems before installing PL/R.
- To install R, run these apt commands on all host
systems.
$ sudo apt update && sudo apt install r-base
Installing r-base also installs dependent packages including r-base-core.
- To configure Greenplum Database to use R, add the R_HOME environment
variable to $GPHOME/greenplum_path.sh on all hosts. This example
command returns the R home directory.
$ R RHOME /usr/lib/R
Using the previous R home directory as an example, add this line to the file on all hosts.
export R_HOME=/usr/lib/R
- Source $GPHOME/greenplum_path.sh and restart Greenplum Database. For
example, run these commands on the Greenplum Database master
host.
$ source $GPHOME/greenplum_path.sh $ gpstop -r
Enabling PL/R Language Support
For each database that requires its use, register the PL/R language with the SQL command CREATE EXTENSION. Because PL/R is an untrusted language, only superusers can register PL/R with a database. For example, run this command as the gpadmin user to register the language with the database named testdb:
$ psql -d testdb -c 'CREATE EXTENSION plr;'
PL/R is registered as an untrusted language.
Examples
The following are simple PL/R examples.
Example 1: Using PL/R for single row operators
This function generates an array of numbers with a normal distribution using the R function rnorm().
CREATE OR REPLACE FUNCTION r_norm(n integer, mean float8, std_dev float8) RETURNS float8[ ] AS $$ x<-rnorm(n,mean,std_dev) return(x) $$ LANGUAGE 'plr';
The following CREATE TABLE command uses the r_norm() function to populate the table. The r_norm() function creates an array of 10 numbers.
CREATE TABLE test_norm_var AS SELECT id, r_norm(10,0,1) as x FROM (SELECT generate_series(1,30:: bigint) AS ID) foo DISTRIBUTED BY (id);
Example 2: Returning PL/R data.frames in Tabular Form
Assuming your PL/R function returns an R data.frame as its output, unless you want to use arrays of arrays, some work is required to see your data.frame from PL/R as a simple SQL table:
- Create a TYPE in a Greenplum database with the same
dimensions as your R
data.frame:
CREATE TYPE t1 AS ...
- Use this TYPE when defining your PL/R function
... RETURNS SET OF t1 AS ...
Sample SQL for this is given in the next example.
Example 3: Hierarchical Regression using PL/R
The SQL below defines a TYPE and runs hierarchical regression using PL/R:
--Create TYPE to store model results DROP TYPE IF EXISTS wj_model_results CASCADE; CREATE TYPE wj_model_results AS ( cs text, coefext float, ci_95_lower float, ci_95_upper float, ci_90_lower float, ci_90_upper float, ci_80_lower float, ci_80_upper float); --Create PL/R function to run model in R DROP FUNCTION IF EXISTS wj_plr_RE(float [ ], text [ ]); CREATE FUNCTION wj_plr_RE(response float [ ], cs text [ ]) RETURNS SETOF wj_model_results AS $$ library(arm) y<- log(response) cs<- cs d_temp<- data.frame(y,cs) m0 <- lmer (y ~ 1 + (1 | cs), data=d_temp) cs_unique<- sort(unique(cs)) n_cs_unique<- length(cs_unique) temp_m0<- data.frame(matrix0,n_cs_unique, 7)) for (i in 1:n_cs_unique){temp_m0[i,]<- c(exp(coef(m0)$cs[i,1] + c(0,-1.96,1.96,-1.65,1.65, -1.28,1.28)*se.ranef(m0)$cs[i]))} names(temp_m0)<- c("Coefest", "CI_95_Lower", "CI_95_Upper", "CI_90_Lower", "CI_90_Upper", "CI_80_Lower", "CI_80_Upper") temp_m0_v2<- data.frames(cs_unique, temp_m0) return(temp_m0_v2) $$ LANGUAGE 'plr'; --Run modeling plr function and store model results in a --table DROP TABLE IF EXISTS wj_model_results_roi; CREATE TABLE wj_model_results_roi AS SELECT * FROM wj_plr_RE((SELECT wj_droi2_array), (SELECT cs FROM wj_droi2_array));
Downloading and Installing R Packages
R packages are modules that contain R functions and data sets. You can install R packages to extend R and PL/R functionality in Greenplum Database.
- For an R package, identify all dependent R packages and each package web
URL. The information can be found by selecting the given package from the following
navigation page:
https://cran.r-project.org/web/packages/available_packages_by_name.html
As an example, the page for the R package arm indicates that the package requires the following R libraries: Matrix, lattice, lme4, R2WinBUGS, coda, abind, foreign, and MASS.
You can also try installing the package with R CMD INSTALL command to determine the dependent packages.
For the R installation included with the Greenplum Database PL/R extension, the required R packages are installed with the PL/R extension. However, the Matrix package requires a newer version.
- From the command line, use the wget utility to download
the tar.gz files for the arm package to the Greenplum Database master
host:
wget https://cran.r-project.org/src/contrib/Archive/arm/arm_1.5-03.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_0.9996875-1.tar.gz
- Use the gpscp utility and the hosts_all file to copy the
tar.gz files to the same directory on all nodes of the Greenplum
Database cluster. The hosts_all file contains a list of all the
Greenplum Database segment hosts. You might require root access to do
this.
gpscp -f hosts_all Matrix_0.9996875-1.tar.gz =:/home/gpadmin
gpscp -f /hosts_all arm_1.5-03.tar.gz =:/home/gpadmin
- Use the gpssh utility in interactive mode to log into
each Greenplum Database segment host (gpssh -f all_hosts). Install the
packages from the command prompt using the R CMD INSTALL command. Note
that this may require root access. For example, this R install command installs the
packages for the arm package.
$R_HOME/bin/R CMD INSTALL Matrix_0.9996875-1.tar.gz arm_1.5-03.tar.gz
- Ensure that the package is installed in the
$R_HOME/library directory on all the segments (the
gpssh can be use to install the package). For example, this
gpssh command list the contents of the R library
directory.
gpssh -s -f all_hosts "ls $R_HOME/library"
The gpssh option -s sources the greenplum_path.sh file before running commands on the remote hosts.
- Test if the R package can be loaded.
This function performs a simple test to if an R package can be loaded:
CREATE OR REPLACE FUNCTION R_test_require(fname text) RETURNS boolean AS $BODY$ return(require(fname,character.only=T)) $BODY$ LANGUAGE 'plr';
This SQL command checks if the R package arm can be loaded:
SELECT R_test_require('arm');
Displaying R Library Information
You can use the R command line to display information about the installed libraries and functions on the Greenplum Database host. You can also add and remove libraries from the R installation. To start the R command line on the host, log into the host as the gadmin user and run the script R from the directory $GPHOME/ext/R-3.3.3/bin.
This R function lists the available R packages from the R command line:
> library()
Display the documentation for a particular R package
> library(help="package_name") > help(package="package_name")
Display the help file for an R function:
> help("function_name") > ?function_name
To see what packages are installed, use the R command installed.packages(). This will return a matrix with a row for each package that has been installed. Below, we look at the first 5 rows of this matrix.
> installed.packages()
Any package that does not appear in the installed packages matrix must be installed and loaded before its functions can be used.
An R package can be installed with install.packages():
> install.packages("package_name") > install.packages("mypkg", dependencies = TRUE, type="source")
> library(" package_name ")
An R package can be removed with remove.packages
> remove.packages("package_name")
You can use the R command -e option to run functions from the command line. For example, this command displays help on the R package MASS.
$ R -e 'help("MASS")'
Loading R Modules at Startup
PL/R can automatically load saved R code during interpreter initialization. To use this feature, you create the plr_modules database table and then insert the R modules you want to auto-load into the table. If the table exists, PL/R will load the code it contains into the interpreter.
In a Greenplum Database system, table rows are usually distributed so that each row exists at only one segment instance. The R interpreter at each segment instance, however, needs to load all of the modules, so a normally distributed table will not work. You must create the plr_modules table as a replicated table in the default schema so that all rows in the table are present at every segment instance. For example:
CREATE TABLE public.plr_modules { modseq int4, modsrc text ) DISTRIBUTED REPLICATED;
See https://www.joeconway.com/plr/doc/plr-module-funcs.html for more information about using the PL/R auto-load feature.
References
https://www.r-project.org/ - The R Project home page
https://cran.r-project.org/web/packages/PivotalR/ - The home page for PivotalR, a package that provides an R interface to operate on Greenplum Database tables and views that is similar to the R data.frame. PivotalR also supports using the machine learning package MADlib directly from R.
- R Functions and Arguments - https://www.joeconway.com/doc/plr-funcs.html
- Passing Data Values in R - https://www.joeconway.com/doc/plr-data.html
- Aggregate Functions in R - https://www.joeconway.com/doc/plr-aggregate-funcs.html