gp_sparse_vector

gp_sparse_vector

The gp_sparse_vector module implements a Greenplum Database data type and associated functions that use compressed storage of zeros to make vector computations on floating point numbers faster.

The gp_sparse_vector module is a Greenplum Database extension.

Installing and Registering the Module

The gp_sparse_vector module is installed when you install Greenplum Database. Before you can use any of the functions defined in the module, you must register the gp_sparse_vector extension in each database where you want to use the functions.

Using the gp_sparse_vector Module

When you use arrays of floating point numbers for various calculations, you will often have long runs of zeros. This is common in scientific, retail optimization, and text processing applications. Each floating point number takes 8 bytes of storage in memory and/or disk. Saving those zeros is often impractical. There are also many computations that benefit from skipping over the zeros.

For example, suppose the following array of doubles is stored as a float8[] in Greenplum Database:

'{0, 33, <40,000 zeros>, 12, 22 }'::float8[]

This type of array arises often in text processing, where a dictionary may have 40-100K terms and the number of words in a particular document is stored in a vector. This array would occupy slightly more than 320KB of memory/disk, most of it zeros. Any operation that you perform on this data works on 40,001 fields that are not important.

The Greenplum Database built-in array datatype utilizes a bitmap for null values, but it is a poor choice for this use case because it is not optimized for float8[] or for long runs of zeros instead of nulls, and the bitmap is not run-length-encoding- (RLE) compressed. Even if each zero were stored as a NULL in the array, the bitmap for nulls would use 5KB to mark the nulls, which is not nearly as efficient as it could be.

The Greenplum Database gp_sparse_vector module defines a data type and a simple RLE-based scheme that is biased toward being efficient for zero value bitmaps. This scheme uses only 6 bytes for bitmap storage.

Note: The sparse vector data type defined by the gp_sparse_vector module is named svec. svec supports only float8 vector values.

You can construct an svec directly from a float array as follows:

SELECT ('{0, 13, 37, 53, 0, 71 }'::float8[])::svec;

The gp_sparse_vector module supports the vector operators <, >, *, **, /, =, +, sum(), vec_count_nonzero(), and so on. These operators take advantage of the efficient sparse storage format, making computations on svecs faster.

The plus (+) operator adds each of the terms of two vectors of the same dimension together. For example, if vector a = {0,1,5} and vector b = {4,3,2}, you would compute the vector addition as follows:

SELECT ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
 float8  
---------
 {4,4,7}

A vector dot product (%*%) between vectors a and b returns a scalar result of type float8. Compute the dot product ((0*4+1*3+5*2)=13) as follows:

SELECT '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
 ?column? 
----------
       13

Special vector aggregate functions are also useful. sum() is self explanatory. vec_count_nonzero() evaluates the count of non-zero terms found in a set of svec and returns an svec with the counts. For instance, for the set of vectors {0,1,5},{10,0,3},{0,0,3},{0,1,0}, the count of non-zero terms would be {1,2,3}. Use vec_count_nonzero() to compute the count of these vectors:

CREATE TABLE listvecs( a svec );

INSERT INTO listvecs VALUES ('{0,1,5}'::float8[]),
    ('{10,0,3}'::float8[]),
    ('{0,0,3}'::float8[]),
    ('{0,1,0}'::float8[]);

SELECT vec_count_nonzero( a )::float8[] FROM listvecs;
 count_vec 
-----------
 {1,2,3}
(1 row)

Additional Module Documentation

Refer to the gp_sparse_vector READMEs in the Greenplum Database github repository for additional information about this module.

Apache MADlib includes an extended implementation of sparse vectors. See the MADlib Documentation for a description of this MADlib module.

Example

A text classification example that describes a dictionary and some documents follows. You will create Greenplum Database tables representing a dictionary and some documents. You then perform document classification using vector arithmetic on word counts and proportions of dictionary words in each document.

Suppose that you have a dictionary composed of words in a text array. Create a table to store the dictionary data and insert some data (words) into the table. For example:

CREATE TABLE features (dictionary text[][]) DISTRIBUTED RANDOMLY;
INSERT INTO features 
    VALUES ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,'
            'one,really,second,the,third,this,until}');

You have a set of documents, also defined as an array of words. Create a table to represent the documents and insert some data into the table:

CREATE TABLE documents(docnum int, document text[]) DISTRIBUTED RANDOMLY;
INSERT INTO documents VALUES 
    (1,'{this,is,one,document,in,the,corpus}'),
    (2,'{i,am,the,second,document,in,the,corpus}'),
    (3,'{being,third,never,really,bothered,me,until,now}'),
    (4,'{the,document,before,me,is,the,third,document}');

Using the dictionary and document tables, find the dictionary words that are present in each document. To do this, you first prepare a Sparse Feature Vector, or SFV, for each document. An SFV is a vector of dimension N, where N is the number of dictionary words, and each SFV contains a count of each dictionary word in the document.

You can use the gp_extract_feature_histogram() function to create an SFV from a document. gp_extract_feature_histogram() outputs an svec for each document that contains the count of each of the dictionary words in the ordinal positions of the dictionary.

SELECT gp_extract_feature_histogram(
    (SELECT dictionary FROM features LIMIT 1), document)::float8[], document
        FROM documents ORDER BY docnum;

     gp_extract_feature_histogram        |                     document                         
-----------------------------------------+--------------------------------------------------
 {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
 {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
 {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
 {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}

SELECT * FROM features;
                                               dictionary
--------------------------------------------------------------------------------------------------------
 {am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}

The SFV of the second document, "i am the second document in the corpus", is {1,3*0,1,1,1,1,6*0,1,2}. The word "am" is the first ordinate in the dictionary, and there is 1 instance of it in the SFV. The word "before" has no instances in the document, so its value is 0; and so on.

gp_extract_feature_histogram() is very speed optimized - it is a single routine version of a hash join that processes large numbers of documents into their SFVs in parallel at the highest possible speeds.

For the next part of the processing, generate a sparse vector of the dictionary dimension (19). The vectors that you generate for each document are referred to as the corpus.

CREATE table corpus (docnum int, feature_vector svec) DISTRIBUTED RANDOMLY;

INSERT INTO corpus
    (SELECT docnum, 
        gp_extract_feature_histogram(
            (select dictionary FROM features LIMIT 1), document) from documents);

Count the number of times each feature occurs at least once in all documents:

SELECT (vec_count_nonzero(feature_vector))::float8[] AS count_in_document FROM corpus;

            count_in_document
-----------------------------------------
 {1,1,1,1,2,3,1,2,2,2,1,1,1,1,1,3,2,1,1}

Count all occurrences of each term in all documents:

SELECT (sum(feature_vector))::float8[] AS sum_in_document FROM corpus;

             sum_in_document
-----------------------------------------
 {1,1,1,1,2,4,1,2,2,2,1,1,1,1,1,5,2,1,1}

The remainder of the classification process is vector math. The count is turned into a weight that reflects Term Frequency / Inverse Document Frequency (tf/idf). The calculation for a given term in a given document is:

#_times_term_appears_in_this_doc * log( #_docs / #_docs_the_term_appears_in )

#_docs is the total number of documents (4 in this case). Note that there is one divisor for each dictionary word and its value is the number of times that word appears in the document.

For example, the term "document" in document 1 would have a weight of 1 * log( 4/3 ). In document 4, the term would have a weight of 2 * log( 4/3 ). Terms that appear in every document would have weight 0.

This single vector for the whole corpus is then scalar product multiplied by each document SFV to produce the tf/idf.

Calculate the tf/idf:

SELECT docnum, (feature_vector*logidf)::float8[] AS tf_idf 
    FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector)) AS logidf FROM corpus)
    AS foo, corpus ORDER BY docnum;
 docnum |                                                                          tf_idf                                                                          
--------+----------------------------------------------------------------------------------------------------------------------------------------------------------
      1 | {0,0,0,0,0.693147180559945,0.287682072451781,0,0.693147180559945,0.693147180559945,0,0,0,1.38629436111989,0,0,0.287682072451781,0,1.38629436111989,0}
      2 | {1.38629436111989,0,0,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,0,0,0,0,0,1.38629436111989,0.575364144903562,0,0,0}
      3 | {0,0,1.38629436111989,1.38629436111989,0,0,0,0,0,0.693147180559945,1.38629436111989,1.38629436111989,0,1.38629436111989,0,0,0.693147180559945,0,1.38629436111989
}
      4 | {0,1.38629436111989,0,0,0,0.575364144903562,0,0,0.693147180559945,0.693147180559945,0,0,0,0,0,0.575364144903562,0.693147180559945,0,0}

You can determine the angular distance between one document and the rest of the documents using the ACOS of the dot product of the document vectors:

CREATE TABLE weights AS 
    (SELECT docnum, (feature_vector*logidf) tf_idf 
        FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector))
       AS logidf FROM corpus) foo, corpus ORDER BY docnum)
    DISTRIBUTED RANDOMLY;

Calculate the angular distance between the first document and every other document:

SELECT docnum, trunc((180.*(ACOS(dmin(1.,(tf_idf%*%testdoc)/(l2norm(tf_idf)*l2norm(testdoc))))/(4.*ATAN(1.))))::numeric,2)
     AS angular_distance FROM weights,
     (SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
ORDER BY 1;

 docnum | angular_distance 
--------+------------------
      1 |             0.00
      2 |            78.82
      3 |            90.00
      4 |            80.02

You can see that the angular distance between document 1 and itself is 0 degrees, and between document 1 and 3 is 90 degrees because they share no features at all.