Predictive Analysis Library Manual

Table of Contents
Overview
PAL Common Interface
    Input Data Table
    Parameter Table
        Specifying the ID Column
    Output Data Table
List of PAL Algorithms
PAL Algorithm Descriptions
    K-means
        Prerequisites
        Interface (kmeans)
        Interface (validateKmeans)
        Example
    C4.5 Decision Tree
        Prerequisites
        Interface (createDT)
        Interface (predictWithDT)
        Example
    KNN
        Prerequisites
        Interface (knn)
        Example
    Multiple Linear Regression
        Prerequisites
        Interface (linearRegression)
        Interface (forecastWithLR)
        Example
    Apriori
        Prerequisite
        Interface (aprioriRule)
        Example
    ABC Classification
        Prerequisite
        Interface (abcAnalysis)
        Example
    Weighted Score Table
        Prerequisites
        Interface (weightedTable)
        Example
Log and Trace
Copyrights
SAP AG 2011
Predictive Analysis Library Reference Manual
Overview
Since SAP HANA 1.0 GA, SQL Script v2 can be used to express application logic within the database that exceeds the capabilities of pure SQL. With its enhanced control flow capabilities, SQL Script v2 is better suited to pushing complex application logic down to the SAP HANA database.

When designing HANA applications, procedures are the main programmable containers. However, it is difficult and often impossible to describe predictive analysis logic with procedures. For example, an application may need to perform a cluster analysis on a huge customer table holding a terabyte of data. It is impossible to implement the analysis in a procedure, even with the simple classic K-means algorithm. Copying such a table to the application server to perform the K-means calculation is not a sensible alternative either, because the copy is slow and unnecessary.

SAP provides the Predictive Analysis Library (PAL) to offer you the flexibility and efficiency to develop HANA applications requiring predictive functionality. In the above case, using PAL is the best choice. PAL algorithms can be called directly from an L wrapper within SQL Script v2. The inputs and outputs are all tables.

Currently, PAL includes seven well-known predictive analysis algorithms in three data mining algorithm categories:
- Cluster analysis
- Classification analysis
- Association analysis

The functions in PAL are predefined. More functions will be supported in future releases. The seven algorithms included in PAL were carefully selected based on the following criteria:
1. These algorithms are required for SAP HANA applications.
2. Market surveys (e.g. Rexer Analytics and KDnuggets polls) show that these algorithms are most commonly used.
3. These are the most common algorithms available in database products from other vendors in the marketplace, such as Microsoft SQL Server, Oracle, and IBM DB2.
PAL Common Interface

The Predictive Analysis Library uses tables as the data interface. Three types of tables are used for each PAL function:
- Input data table: Supplies input data to PAL functions. Each function can have one or more input data tables.
- Parameter table: Supplies parameters to PAL functions. Parameters are special inputs that control what the functions do. Each function has only one parameter table.
- Output data table: Holds the output results calculated by PAL functions. Each function can have one or more output data tables.

The structures of the above tables are predefined and cannot be changed. Table names and column names can be specified by the user.
Input Data Table

Input data tables contain the input data required to perform predictive analysis, sometimes described as the historical and training data used to build predictive models. For some cluster and classification functions, primary keys/IDs are required in the input data table, so that the output data table can contain the same primary key/ID column. Usually the ID column is the first column of the input data table. For some PAL functions, you can specify which column should be the ID column. See Specifying the ID Column for details.

A typical input data table looks like the following:

ID    X1        X2         X3
0     0.1       0.1        0.1
1     0.1       0          0.1
2     10.1      10.1       0.1
3     10.1      9.9        0.1
4     10.1      10.1       0.1
5     1000.1    1000.1     0.1
6     999.1     1000.1     0.1
7     0.1       10000.1    0.1
8     0.1       9999.1     0.1
9     11.1      10.1       0.1
10    0.001     0.001      0.1
Parameter Table
PAL functions use parameter tables to transfer parameter values. Each PAL function has its own parameter table. To avoid table name conflicts when several users call PAL functions at the same time, the parameter table must be created as a local temporary column table, so that each parameter table has its own scope per session. The table structure is defined below. Each row contains only one parameter value.

Column Name   Data Type      Description
Name          VARCHAR/CHAR   Parameter name
intArgs       Integer        Integer parameter value
doubleArgs    Double         Double parameter value
stringArgs    VARCHAR/CHAR   String parameter value

The following shows an example of a parameter table with three parameters. The first parameter, THREAD_NUMBER, is an integer parameter. Thus, in the THREAD_NUMBER row, you should fill the parameter value in the intArgs column, and leave the doubleArgs and stringArgs columns blank.

Name            intArgs   doubleArgs   stringArgs
THREAD_NUMBER   1
SUPPORT                   0.2
VAR_NAME                               hello
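The one-value-per-row rule can be sketched outside the database as follows. This is a minimal Python illustration, not part of PAL: the helper names `param_row` and `to_insert` are hypothetical, and the rendered statements only mirror the INSERT style used in this manual's examples.

```python
from typing import Optional, Tuple, Union

# One row of a parameter table: (Name, intArgs, doubleArgs, stringArgs).
ParamRow = Tuple[str, Optional[int], Optional[float], Optional[str]]


def param_row(name: str, value: Union[int, float, str]) -> ParamRow:
    """Put the value in the column matching its type and leave the others empty."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("bool is not a supported parameter type")
    if isinstance(value, int):
        return (name, value, None, None)
    if isinstance(value, float):
        return (name, None, value, None)
    if isinstance(value, str):
        return (name, None, None, value)
    raise TypeError(f"unsupported parameter type: {type(value).__name__}")


def to_insert(table: str, row: ParamRow) -> str:
    """Render a row as an INSERT statement (hypothetical helper for illustration)."""
    def lit(v):
        if v is None:
            return "null"
        if isinstance(v, str):
            return "'" + v.replace("'", "''") + "'"
        return repr(v)
    return f"INSERT INTO {table} VALUES ({','.join(lit(v) for v in row)});"


rows = [
    param_row("THREAD_NUMBER", 1),
    param_row("SUPPORT", 0.2),
    param_row("VAR_NAME", "hello"),
]
for r in rows:
    print(to_insert("#CONTROL_TAB", r))
```

Each generated row fills exactly one of the three value columns, which matches the example table above.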
Specifying the ID Column

Some cluster and classification functions include an ID column in the input data table. Usually the ID column is the first column of the input data table. For some PAL functions, you can specify a column number for the ID_COLUMN parameter in the parameter table to tell the function which column should be the ID column. Column numbers start from 0, so the first column is column 0, the second is column 1, and so on. For example, if you want to specify the first column as the ID column, the parameter table should contain the following row:

Name        intArgs   doubleArgs   stringArgs
ID_COLUMN   0
Output Data Table

Output tables contain predicted results, such as classification types and fitted values. A typical output table looks like the following:

ID    CENTER_ASSIGN    ENERGY
0     0                0.3
1     1                0.2
2     4                20.3
3     4                20.1
4     4                20.3
5     4                2000.3
6     4                1999.3
7     2                10000.3
8     2                9999.3
9     3                21.3
10    1                0.12
List of PAL Algorithms

The following table lists all available algorithms and functions in the Predictive Analysis Library. The algorithms are grouped by categories. To learn more about an algorithm, click the algorithm name. To learn more about function interface, click the function name.
Category                 Algorithm                   Function Name     Description
Cluster Analysis         K-means                     kmeans            Clusters data using the k-means algorithm.
                                                     validateKmeans    Validates or measures the quality of clustered result.
Classification Analysis  C4.5 Decision Tree          createDT          Creates tree models using the C4.5 algorithm. The model is represented in JSON.
                                                     predictWithDT     Uses the tree model to perform prediction.
                         KNN                         knn               K-nearest neighbor algorithm function.
                         Multiple Linear Regression  linearRegression  Multiple linear regression algorithm function.
                                                     forecastWithLR    Makes forecasts using the regression equation.
Association Analysis     Apriori                     aprioriRule       Creates association rules using the Apriori algorithm.
Other                    ABC Classification          abcAnalysis       Performs ABC classification analysis. For example, when 20% of the items contribute 80% of the total revenue, these items can be put into class A.
                         Weighted Score Table        weightedTable     Performs weighted table calculation.
PAL Algorithm Descriptions

This section contains detailed descriptions of all available algorithms in the Predictive Analysis Library. The following information is provided for each algorithm:
- Algorithm description
- Prerequisites
- Interface description (function name; L function signature; input, parameter, and output tables)
- Example
K-means
In predictive analysis, k-means clustering is a method of cluster analysis that aims to partition n observations or records into k clusters, in which each observation belongs to the cluster with the nearest mean. In marketing and customer-relationship management, k-means is applied to customer data to track customer behavior and create strategic business initiatives. Organizations can use this data to divide customers into segments based on variables such as demography, customer behavior, customer profitability, measure of risk, and lifetime value of a customer or retention probability.

Clustering groups records together according to an algorithm or mathematical formula that attempts to find centroids, or centers, around which similar records gravitate. The most common algorithm uses an iterative refinement technique, also referred to as Lloyd's algorithm. Given an initial set of k means m1,...,mk, the algorithm proceeds by alternating between two steps:
- Assignment step: Assign each observation to the cluster with the closest mean.
- Update step: Calculate the new means to be the centroid of the observations in the cluster.

The algorithm repeats until the assignments no longer change. For more information, refer to http://en.wikipedia.org/wiki/K-means_clustering.

The k-means implementation in PAL supports multithreading, data normalization, different distance level measurements, and a cluster quality measure (Silhouette). The implementation does not support categorical data; however, this can be managed through data transformation. The first K and random K starting methods are supported.
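The two alternating steps can be illustrated with a short Python sketch. It is not the PAL implementation: it assumes first-K initialization and Euclidean distance, the function name `kmeans_first_k` is hypothetical, and keeping the previous center for an empty cluster is an assumption made here for simplicity.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kmeans_first_k(points: List[Point], k: int,
                   max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm with the first k points as initial means."""
    centers = [list(p) for p in points[:k]]
    assign = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest mean.
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments no longer change
        assign = new_assign
        # Update step: each mean becomes the centroid of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers


data = [(0, 0), (10, 10), (0, 1), (10, 11)]
labels, means = kmeans_first_k(data, k=2)
print(labels)  # [0, 1, 0, 1]
print(means)   # [[0.0, 0.5], [10.0, 10.5]]
```

With this input the loop stops on the second pass, because the assignments computed from the updated means are identical to the previous ones.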
Prerequisites
- The input data should contain an ID column, and the other columns should be of integer or double data type.
- The input data must not contain null values. The function issues errors when it encounters null values.
Interface (kmeans)
Function: pal::kmeans
This is a clustering function using the k-means algorithm.
L Function Signature
pal::kmeans(Table<...> dataset, Table<...> args, Table<...> result)
Input Table
Table   Column          Column Data Type    Description      Constraint
Data    1st column      Integer             ID
        Other columns   Integer or double   Attribute data
Parameter Table
Name             Data Type   Description
GROUP_NUMBER     Integer     Number of groups (k).
DISTANCE_LEVEL   Integer     Computes the distance between item and cluster center.
                             2 = Euclidean distance
MAX_ITERATION    Integer     Maximum number of iterations.
START_COLUMN     Integer     The index number of the first data column (column index starts from zero).
COLUMN_NUM       Integer     Number of data columns.
START_ROW        Integer     The index number of the first data row (row index starts from zero).
ROW_NUM          Integer     Number of data rows.
INIT_TYPE        Integer     Center initialization type:
                             1 = first K
                             2 = weighted random with replacement
                             3 = random without replacement
NORMALIZATION    Integer     Normalization type:
                             0 = no
                             1 = yes; for each point X(x1, x2, ..., xn), the normalized value is
                                 X'(|x1|/S, |x2|/S, ..., |xn|/S), where S = |x1| + |x2| + ... + |xn|
THREAD_NUMBER    Integer     Number of threads.
EXIT_THRESHOLD   Double      Threshold (actual value) for exiting the iterations.
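The normalization formula given for NORMALIZATION = 1 can be checked with a few lines of Python. This is only a sketch of the formula as stated in the parameter table; the function name `normalize` is hypothetical, and returning zeros for an all-zero point is an assumption, since the manual does not say how that case is handled.

```python
from typing import List, Sequence


def normalize(point: Sequence[float]) -> List[float]:
    """Return X' = (|x1|/S, ..., |xn|/S) with S = |x1| + ... + |xn|."""
    s = sum(abs(x) for x in point)
    if s == 0:
        # Assumption: the all-zero point is left as zeros to avoid dividing by zero.
        return [0.0] * len(point)
    return [abs(x) / s for x in point]


print(normalize([1, -3]))  # [0.25, 0.75]
```

Note that the absolute values discard the sign of each attribute, so the points (1, -3) and (1, 3) normalize to the same value under this formula.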
Output Table
Table    Column       Column Data Type    Description                                                Constraint
Result   1st column   Integer             ID
         2nd column   Integer or double   Clustered item assigned to class number
         3rd column   Integer or double   Sum of item's attribute, which is used in normalization.
Interface (validateKmeans)
Function: pal::validateKmeans
This is a quality measurement function for k-means clustering.
pal::validateKmeans(Table<...> dataset, Table<...> typeset, Table<...> args, Table<...> result)
Input Table
Table                    Column          Column Data Type    Description      Constraint
Data                     1st column      Integer             ID
                         Other columns   Integer or double   Attribute data
Type Data / Class Data   1st column      Integer             ID
                         2nd column      Integer             Class type
Parameter Table
Name            Data Type   Description
VARIABLE_NUM    Integer     Number of variables
THREAD_NUMBER   Integer     Number of threads
Output Table
Table    Column       Column Data Type   Description      Constraint
Result   1st column   VARCHAR/CHAR       Name
         2nd column   Double             Measure result
Example
-- KMEANS.SQL
-- Create the table type for the k-means result
ALTER SESSION SET CURRENT_SCHEMA = "DM_PAL";
DROP TYPE T_KMEANS_RESULT_ASSIGN_TAB;
CREATE TYPE T_KMEANS_RESULT_ASSIGN_TAB AS TABLE(
    "ID" INT, "CENTER_ASSIGN" INT, "ENERGY" DOUBLE);
-- Create the table type for double arguments
DROP TYPE T_SINGLE_COLUMN_DOUBLE_TAB;
CREATE TYPE T_SINGLE_COLUMN_DOUBLE_TAB AS TABLE("VALUE" DOUBLE);
-- Create the table type for integer arguments
DROP TYPE T_SINGLE_COLUMN_INT_TAB;
CREATE TYPE T_SINGLE_COLUMN_INT_TAB AS TABLE("VALUE" INT);
-- Create the table type for the input data
DROP TYPE KMEANS_DOUBLE_INPUT;
CREATE TYPE KMEANS_DOUBLE_INPUT AS TABLE(
    "ID" INT, "V000" DOUBLE, "V001" DOUBLE, primary key("ID"));
-- Create the session-local parameter table and its table type
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB (
    "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE (
    "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
-- Create the L wrapper procedure that calls pal::kmeans
DROP PROCEDURE KMEANS_WITH_NEWDB;
CREATE PROCEDURE KMEANS_WITH_NEWDB (
    IN dataset KMEANS_DOUBLE_INPUT,
    IN control CONTROL_T,
    OUT cluster_assignment T_KMEANS_RESULT_ASSIGN_TAB)
LANGUAGE LLANG AS
BEGIN
export Void main(
    Table<Int32 "ID", Double "V000", Double "V001"> "dataset" datasetTab,
    Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
    Table<Int32 "ID", Int32 "CENTER_ASSIGN", Double "ENERGY"> "cluster_assignment" & resultTab)
{
    pal::kmeans(datasetTab, argsTab, resultTab);
}
END;
-- Data Preparation and Function Call
DROP TABLE KMEANS_DATA_TAB_PAL;
CREATE COLUMN TABLE KMEANS_DATA_TAB_PAL (
    "ID" INT, "V000" DOUBLE, "V001" DOUBLE, primary key("ID"));
-- Clean the k-means result table
DROP TABLE KMEANS_RESULT_ASSIGN_TAB_PAL;
CREATE COLUMN TABLE KMEANS_RESULT_ASSIGN_TAB_PAL (
    "ID" INT, "CENTER_ASSIGN" INT, "ENERGY" DOUBLE, primary key("ID"));
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (0 , 0.1, 0.2);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (1 , 0.2, 0.3);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (2 , 0.3, 0.4);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (3 , 0.4, 0.5);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (4 , 0.5, 0.6);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (5 , 1.1, 15.1);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (6 , 1.2, 15.2);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (7 , 1.3, 15.3);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (8 , 1.4, 15.4);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (9 , 1.5, 15.5);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (10,16.1, 1.1);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (11,16.2, 1.2);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (12,16.3, 1.3);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (13,16.4, 1.4);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (14,16.5, 1.5);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (15,50.0, 50.1);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (16,50.1, 50.2);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (17,50.2, 50.3);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (18,50.3, 50.4);
INSERT INTO KMEANS_DATA_TAB_PAL VALUES (19,50.4, 50.5);
-- Fill the parameter table, one parameter per row
TRUNCATE TABLE #CONTROL_TAB;
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',1,null,null);
INSERT INTO #CONTROL_TAB VALUES ('GROUP_NUMBER',4,null,null);
INSERT INTO #CONTROL_TAB VALUES ('DISTANCE_LEVEL',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('MAX_ITERATION',100,null,null);
INSERT INTO #CONTROL_TAB VALUES ('START_COLUMN',1,null,null);
INSERT INTO #CONTROL_TAB VALUES ('COLUMN_NUM',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('START_ROW',0,null,null);
INSERT INTO #CONTROL_TAB VALUES ('ROW_NUM',20,null,null);
INSERT INTO #CONTROL_TAB VALUES ('INIT_TYPE',1,null,null);
INSERT INTO #CONTROL_TAB VALUES ('NORMALIZATION',0,null,null);
INSERT INTO #CONTROL_TAB VALUES ('EXIT_THRESHOLD',null,0.0001,null);
-- Run the clustering and join the result with the input data
TRUNCATE TABLE KMEANS_RESULT_ASSIGN_TAB_PAL;
CALL KMEANS_WITH_NEWDB(KMEANS_DATA_TAB_PAL, "#CONTROL_TAB", KMEANS_RESULT_ASSIGN_TAB_PAL) with overview;
SELECT KMEANS_RESULT_ASSIGN_TAB_PAL.ID,
       KMEANS_DATA_TAB_PAL.V000,
       KMEANS_DATA_TAB_PAL.V001,
       CENTER_ASSIGN+1 AS CENTER_ASSIGN
FROM KMEANS_RESULT_ASSIGN_TAB_PAL, KMEANS_DATA_TAB_PAL
WHERE KMEANS_DATA_TAB_PAL.ID = KMEANS_RESULT_ASSIGN_TAB_PAL.ID;
-- EXPECTED OUTPUT: [figure omitted]
-- validateKmeans.sql
-- Expose the cluster assignment as the type (class) input
DROP VIEW KMEANS_TYPE_ASSIGN_TAB_PAL;
CREATE VIEW KMEANS_TYPE_ASSIGN_TAB_PAL AS
    SELECT "ID", "CENTER_ASSIGN" AS "TYPE_ASSIGN" FROM KMEANS_RESULT_ASSIGN_TAB_PAL;
DROP TYPE T_KMEANS_TYPE_ASSIGN_TAB_S;
CREATE TYPE T_KMEANS_TYPE_ASSIGN_TAB_S AS TABLE(
    "ID" INTEGER, "TYPE_ASSIGN" INTEGER);
DROP TYPE T_KMEANS_RESULT_SVALUE_TAB;
CREATE TYPE T_KMEANS_RESULT_SVALUE_TAB AS TABLE(
    "NAME" VARCHAR (50), "S" DOUBLE);
-- Create the L wrapper procedure that calls pal::validateKmeans
DROP PROCEDURE KMEANSVALIDATE_WITH_NEWDB;
CREATE PROCEDURE KMEANSVALIDATE_WITH_NEWDB (
    IN dataset KMEANS_DOUBLE_INPUT,
    IN typeset T_KMEANS_TYPE_ASSIGN_TAB_S,
    IN control CONTROL_T,
    OUT sValue T_KMEANS_RESULT_SVALUE_TAB)
LANGUAGE LLANG AS
BEGIN
export Void main(
    Table<Int32 "ID", Double "V000", Double "V001"> "dataset" datasetTab,
    Table<Int32 "ID", Int32 "TYPE_ASSIGN"> "typeset" typesetTab,
    Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
    Table<String "NAME", Double "S"> "sValue" & resultTab)
{
    pal::validateKmeans(datasetTab, typesetTab, argsTab, resultTab);
}
END;
-- Recreate and fill the parameter table
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB (
    "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
TRUNCATE TABLE #CONTROL_TAB;
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',1,null,null);
-- Create the result table and run the validation
DROP TABLE KMEANS_SVALUE_TAB_PAL;
CREATE COLUMN TABLE KMEANS_SVALUE_TAB_PAL (
    "NAME" VARCHAR (50), "S" DOUBLE);
CALL KMEANSVALIDATE_WITH_NEWDB(KMEANS_DATA_TAB_PAL, KMEANS_TYPE_ASSIGN_TAB_PAL, "#CONTROL_TAB", KMEANS_SVALUE_TAB_PAL) with overview;
SELECT * FROM KMEANS_SVALUE_TAB_PAL;
-- EXPECTED OUTPUT: [figure omitted]
C4.5 Decision Tree

A decision tree is used as a classifier for determining an appropriate action or decision among a predetermined set of actions for a given case. A decision tree helps to effectively identify the factors to consider and how each factor has historically been associated with different outcomes of the decision. A decision tree is a classifier that uses a tree-like structure of conditions and their possible consequences. Each node of a decision tree can be a leaf node or a decision node.
- Leaf node: identifies the value of the dependent (target) variable.
- Decision node: contains one condition that specifies some test on an attribute value. The outcome of the condition is further divided into branches with subtrees or leaf nodes.

C4.5 is an algorithm used to generate a decision tree. C4.5 builds decision trees from a set of training data, using the concept of information entropy. The training data is a set S = s1,s2,... of already classified samples. Each sample si = x1,x2,... is a vector where x1,x2,... represent attributes or features of the sample. The training data is augmented with a vector C = c1,c2,... where c1,c2,... represent the class to which each sample belongs.

At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recursively works on the smaller sublists. For more information, refer to http://en.wikipedia.org/wiki/C4.5_algorithm.

The C4.5 decision tree functions implemented in PAL support both discrete and continuous values. A continuous attribute is discretized by defining fixed intervals provided by the user. For example, if the salary ranges from $100 to $20,000, then we can form intervals such as $0 to $8,000, $8,000 to $18,000, and $18,000 to $20,000. An attribute value falls into exactly one of these intervals. In the PAL implementation, the Reduced Error Pruning (REP) algorithm is used as the pruning method.
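The splitting criterion and the interval-based discretization can be sketched in Python as follows. This is an illustration under stated assumptions rather than the PAL code: normalized information gain is computed here as information gain divided by the entropy of the attribute itself (the usual C4.5 gain ratio), the names `entropy`, `gain_ratio`, and `interval_index` are hypothetical, and assigning a boundary value to the upper interval is an assumption.

```python
import bisect
import math
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of splitting on `values`, normalized by the split entropy."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


def interval_index(value: float, cut_points: List[float]) -> int:
    """Map a continuous value to the index of its user-defined interval.

    Assumption: a value equal to a cut point belongs to the upper interval.
    """
    return bisect.bisect_right(cut_points, value)


classes = ["x", "x", "y", "y"]
print(gain_ratio(["a", "a", "b", "b"], classes))  # attribute separates the classes
print(gain_ratio(["a", "b", "a", "b"], classes))  # attribute carries no information
print(interval_index(5000, [8000, 18000]))        # falls into the first interval
```

An attribute that separates the classes perfectly yields the highest ratio, while an attribute unrelated to the class yields zero, so the former would be chosen for the split.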
Prerequisites
- The column order and column number of the predicted data should be the same as the order and number used in tree model building.
- The last column of the training data is used as the predicted field, and its type should be discrete.
- The predicted data set should have an ID column.
- The input data must not contain null values; otherwise exceptions are thrown.
- The table used to store the tree model should be a column table.
Interface (createDT)
Function: pal::createDT
The createDT function creates a decision tree from input training data.
pal::createDT(Table<...> training, Table<...> args, Table<...> result)
Input Table
Table                        Column    Column Data Type                   Description                                     Constraint
Training / Historical Data   Columns   VARCHAR/CHAR, integer, or double   Table used to build the predictive tree model   Discrete value: integer and string
                                                                                                                          Continuous value: integer and double
Parameter Table
Name             Data Type                      Description
START_COLUMN     Integer                        The first attribute/column used to make prediction in the training data set
                                                (column index starts from zero).
END_COLUMN       Integer                        The last attribute/column used to make prediction in the training data set.
                                                This attribute/column must contain class information (column index starts from zero).
PERCENTAGE       Double                         The percentage to be applied to determine the input training data set.
THREAD_NUMBER    Integer                        Number of threads.
CONTINUOUS_COL   (Integer, Double) (optional)   Defines which column needs discretization and the interval provided by the user
                                                (column index starts from zero). The integer value specifies the column position.
                                                The double value specifies the interval.
Output Table
Table                 Column       Column Data Type   Description                                            Constraint
Result (tree model)   1st column   CLOB               Tree model saved as a JSON string in the 1st column.   The table should be a column table; otherwise the CLOB type is not supported.
Interface (predictWithDT)
Function: pal::predictWithDT
The predictWithDT function is used to perform prediction by using decision trees.
pal::predictWithDT(Table<...> predictive, Table<...> args, Table<...> model, Table<...> result)
Input Table
Table              Column       Column Data Type                 Description                         Constraint
Predicted Data     Columns      VARCHAR/CHAR or integer/double   Data to be classified (predicted)   An ID column is mandatory. Its data type should be integer.
Predictive Model   1st column   CLOB                             Serialized tree model
Parameter Table
Name            Data Type   Description
START_COLUMN    Integer     The first attribute/column used to make prediction in the input data set
                            (column index starts from zero).
END_COLUMN      Integer     The last attribute/column used to make prediction in the input data set
                            (column index starts from zero).
ID_COLUMN       Integer     The column of predicted data used as primary key/ID
                            (column index starts from zero).
THREAD_NUMBER   Integer     Number of threads
Output Table
Table    Column       Column Data Type   Description         Constraint
Result   1st column   Integer            ID
         2nd column   VARCHAR/CHAR       Predictive result
Example
-- DECISION TREE
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE(
    "ID" INT, "Region" VARCHAR(50), "SalesPeriod" VARCHAR(50), "Revenue" INT, "CLASS" VARCHAR(50));
DROP TYPE MODEL_T;
CREATE TYPE MODEL_T AS TABLE("Model" CLOB);
DROP TYPE PREDICTIVE_T;
CREATE TYPE PREDICTIVE_T AS TABLE(
    "ID" INT, "Region" VARCHAR(50), "SalesPeriod" VARCHAR(50), "Revenue" INT);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE(
    "Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT, "CLASS" VARCHAR(50));
DROP PROCEDURE palDT1;
CREATE PROCEDURE palDT1(
    IN data DATA_T,
    IN control CONTROL_T,
    OUT model MODEL_T)
LANGUAGE LLANG AS
BEGIN
export Void main(
    Table<Int32 "ID", String "Region", String "SalesPeriod", Int32 "Revenue", String "CLASS"> "data" dataTab,
    Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
    Table<String "Model"> "model" & modelTab)
{
    pal::createDT(dataTab, argsTab, modelTab);
}
END;
DROP PROCEDURE palDT2;
CREATE PROCEDURE palDT2(
    IN predictive PREDICTIVE_T,
    IN control CONTROL_T,
    IN model MODEL_T,
    OUT results RESULT_T)
LANGUAGE LLANG AS
BEGIN
export Void main(
    Table<Int32 "ID", String "Region", String "SalesPeriod", Int32 "Revenue"> "predictive" predictTab,
    Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
    Table<String "Model"> "model" modelTab,
    Table<Int32 "ID", String "CLASS"> "results" & resultsTab)
{
    pal::predictWithDT(predictTab, argsTab, modelTab, resultsTab);
}
END;
DROP TABLE TESTDT_TAB;
CREATE COLUMN TABLE TESTDT_TAB(
    "ID" INT, "Region" VARCHAR(50), "SalesPeriod" VARCHAR(50), "Revenue" INT, "CLASS" VARCHAR(50));
INSERT INTO TESTDT_TAB VALUES (0, 'South', 'Winter', 100000, 'Good');
INSERT INTO TESTDT_TAB VALUES (1, 'North', 'Spring', 45000, 'Average');
INSERT INTO TESTDT_TAB VALUES (2, 'West', 'Summer', 30000, 'Poor');
INSERT INTO TESTDT_TAB VALUES (3, 'East', 'Autumn', 5000, 'Poor');
INSERT INTO TESTDT_TAB VALUES (4, 'West', 'Spring', 5000, 'Poor');
INSERT INTO TESTDT_TAB VALUES (5, 'East', 'Spring', 200000, 'Good');
INSERT INTO TESTDT_TAB VALUES (6, 'South', 'Summer', 25000, 'Poor');
INSERT INTO TESTDT_TAB VALUES (7, 'South', 'Spring', 10000, 'Average');
INSERT INTO TESTDT_TAB VALUES (8, 'North', 'Winter', 50000, 'Average');
-- Data to be classified
DROP TABLE PREDICTIVE_TAB;
CREATE COLUMN TABLE PREDICTIVE_TAB (
    "ID" INT, "Region" VARCHAR(50), "SalesPeriod" VARCHAR(50), "Revenue" INT);
INSERT INTO PREDICTIVE_TAB VALUES (0,'South', 'Autumn', 60000);
INSERT INTO PREDICTIVE_TAB VALUES (1,'North', 'Spring', 30000);
INSERT INTO PREDICTIVE_TAB VALUES (2,'South', 'Summer', 25000);
INSERT INTO PREDICTIVE_TAB VALUES (3,'West', 'Winter', 5000);
-- Parameter table for createDT
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB (
    "Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #CONTROL_TAB VALUES ('START_COLUMN',1,null,null);
INSERT INTO #CONTROL_TAB VALUES ('END_COLUMN',4,null,null);
INSERT INTO #CONTROL_TAB VALUES ('PERCENTAGE',null,0.71,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',1,null,null);
INSERT INTO #CONTROL_TAB VALUES ('CONTINUOUS_COL',3,25000,null);
INSERT INTO #CONTROL_TAB VALUES ('CONTINUOUS_COL',3,60000,null);
-- Parameter table for predictWithDT
DROP TABLE #PRE_CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #PRE_CONTROL_TAB (
    "Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #PRE_CONTROL_TAB VALUES ('START_COLUMN',1,null,null);
INSERT INTO #PRE_CONTROL_TAB VALUES ('END_COLUMN',3,null,null);
INSERT INTO #PRE_CONTROL_TAB VALUES ('THREAD_NUMBER',2,null,null);
INSERT INTO #PRE_CONTROL_TAB VALUES ('ID_COLUMN',0,null,null);
-- Result and model tables (the model table must be a column table)
DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT, "CLASS" VARCHAR(50));
DROP TABLE MODEL_TAB;
CREATE COLUMN TABLE MODEL_TAB ("Model" CLOB);
-- Build the tree model, then use it for prediction
CALL palDT1(TESTDT_TAB, "#CONTROL_TAB", MODEL_TAB) with overview;
CALL palDT2(PREDICTIVE_TAB, "#PRE_CONTROL_TAB", MODEL_TAB, RESULTS_TAB) with overview;
SELECT * FROM RESULTS_TAB;
-- EXPECTED OUTPUT: [figure omitted]
KNN
The k-nearest neighbor algorithm (KNN) is a method for classifying objects based on the closest training examples in the feature space. KNN is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor. The neighbors are taken from a set of objects for which the correct classification is known.

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label that is most frequent among the k training samples nearest to that query point. Usually Euclidean distance is used as the distance metric. For more information, refer to http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.

The PAL implementation of KNN supports multithreading and different voting types.
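The classification phase with majority voting can be sketched in a few lines of Python. This is a minimal illustration, not the PAL implementation: it assumes Euclidean distance and plain majority voting, the function name `knn_classify` is hypothetical, and ties are resolved in favor of the label seen first among the nearest neighbors, which is an assumption.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# A training sample is (class label, attribute vector).
Sample = Tuple[int, Sequence[float]]


def knn_classify(train: List[Sample], query: Sequence[float], k: int) -> int:
    """Return the label most frequent among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda s: math.dist(s[1], query))[:k]
    votes = Counter(label for label, _ in nearest)
    # most_common keeps first-seen order for equal counts (assumed tie-break).
    return votes.most_common(1)[0][0]


training = [
    (2, (1, 1)),
    (3, (10, 10)),
    (3, (10, 11)),
    (1, (1000, 1000)),
    (1, (1000, 1001)),
]
print(knn_classify(training, (9, 10), k=3))       # 3
print(knn_classify(training, (1000, 1000), k=1))  # 1
```

For the first query, the three nearest samples carry the labels 3, 3, and 2, so the majority vote assigns class 3.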
Prerequisites
The first column of the training data and of the input data must be the ID column. The second column of the training data must be the class type, which is of integer type. The other data columns are of integer or double type. The input data must not contain null values.
Interface (knn)
Function: pal::knn
This is a classification function using the KNN algorithm.
pal::knn(Table<...> value, Table<...> classvalue, Table<...> args, Table<...> result)
Input Table
Table          Column         Column Data Type   Description     Constraint
Training Data  1st column     Integer            ID
Training Data  2nd column     Integer            Class type
Training Data  Other columns  Integer or double  Attribute data
Class Data     1st column     Integer            ID
Class Data     Other columns  Integer or double  Attribute data
Parameter Table
Name                  Data Type  Description
K_NEAREST_NEIGHBOURS  Integer    Number of nearest neighbors (k)
ATTRIBUTE_NUM         Integer    Number of attributes
VOTING_TYPE           Integer    Voting type: 0 = majority voting, 1 = distance-weighted voting
THREAD_NUMBER         Integer    Number of threads
Output Table
Table   Column      Column Data Type   Description  Constraint
Result  1st column  Integer            ID
Result  2nd column  Integer or double  Class type
Example
;# knn.sql
ALTER SESSION SET CURRENT_SCHEMA = "DM_PAL";
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE( "ID" INT,"TYPE" INT,"X1" DOUBLE, "X2" DOUBLE);
DROP TYPE CLASSDATA_T;
CREATE TYPE CLASSDATA_T AS TABLE( "ID" INT,"X1" DOUBLE, "X2" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT,"Type" INT);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ( "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE( "Name" VARCHAR (50),"intArgs" INTEGER, "doubleArgs" DOUBLE,"stringArgs" VARCHAR (100));
DROP PROCEDURE palKNN;
CREATE PROCEDURE palKNN( IN data DATA_T, IN classdata CLASSDATA_T, IN control CONTROL_T, OUT results RESULT_T ) LANGUAGE LLANG AS
BEGIN
export Void main(
  Table<Int32 "ID",Int32 "TYPE",Double "X1", Double "X2"> "data" dataTab,
  Table<Int32 "ID",Double "X1", Double "X2"> "classdata" classdataTab,
  Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
  Table<Int32 "ID",Int32 "Type"> "results" & resultsTab)
{
  pal::knn(dataTab, classdataTab, argsTab, resultsTab);
}
END;
DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ( "ID" INT,"TYPE" INT,"X1" DOUBLE, "X2" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,2,1,1);
INSERT INTO DATA_TAB VALUES (1,3,10,10);
INSERT INTO DATA_TAB VALUES (2,3,10,11);
INSERT INTO DATA_TAB VALUES (3,3,10,10);
INSERT INTO DATA_TAB VALUES (4,1,1000,1000);
INSERT INTO DATA_TAB VALUES (5,1,1000,1001);
INSERT INTO DATA_TAB VALUES (6,1,1000,999);
INSERT INTO DATA_TAB VALUES (7,1,999,999);
INSERT INTO DATA_TAB VALUES (8,1,999,1000);
INSERT INTO DATA_TAB VALUES (9,1,1000,1000);
DROP TABLE CLASSDATA_TAB;
CREATE COLUMN TABLE CLASSDATA_TAB ( "ID" INT,"X1" DOUBLE, "X2" DOUBLE);
INSERT INTO CLASSDATA_TAB VALUES (0,2,1);
INSERT INTO CLASSDATA_TAB VALUES (1,9,10);
INSERT INTO CLASSDATA_TAB VALUES (2,9,11);
INSERT INTO CLASSDATA_TAB VALUES (3,15000,15000);
INSERT INTO CLASSDATA_TAB VALUES (4,1000,1000);
INSERT INTO CLASSDATA_TAB VALUES (5,500,1001);
INSERT INTO CLASSDATA_TAB VALUES (6,500,999);
INSERT INTO CLASSDATA_TAB VALUES (7,199,999);
TRUNCATE TABLE #CONTROL_TAB;
INSERT INTO #CONTROL_TAB VALUES ('K_NEAREST_NEIGHBOURS',3,null,null);
INSERT INTO #CONTROL_TAB VALUES ('ATTRIBUTE_NUM',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('VOTING_TYPE',0,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT,"Type" INT);
CALL palKNN(DATA_TAB, CLASSDATA_TAB, "#CONTROL_TAB", RESULTS_TAB) with overview;
SELECT * FROM RESULTS_TAB;
;#EXPECTED OUTPUT:
Multiple Linear Regression

Linear regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In linear regression, data are modeled using linear functions, and unknown model parameters are estimated from the data. Such models are called linear models. For more information, refer to http://en.wikipedia.org/wiki/Linear_regression.
In PAL, the implementation of linear regression solves the equation:
A x = Y
where A is an M x N matrix, x is an N x 1 matrix, and Y is an M x 1 matrix. Formally,
x = inverse(A) * Y
Because A is generally not square, this is computed as the least-squares solution:
x = inverse(transpose(A) * A) * transpose(A) * Y
The implementation also supports calculating the F value and R^2 for determining statistical significance.
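The closed-form solution described above can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not the PAL implementation: the helper names solve and fit_linear are hypothetical, a column of ones is prepended to A so that the first coefficient is the intercept, and the sample data are made up so that y = 1 + 2*x1 + 3*x2 holds exactly.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Move the row with the largest pivot into place for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_linear(rows, y):
    """Return [A0, A1, ..., An] from the normal equations (A^T A) x = A^T Y."""
    a = [[1.0] + list(r) for r in rows]  # leading 1.0 models the intercept A0
    n = len(a[0])
    ata = [[sum(r[i] * r[j] for r in a) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * yi for r, yi in zip(a, y)) for i in range(n)]
    return solve(ata, aty)


# Made-up data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coefficients = fit_linear(rows, y)  # approximately [1.0, 2.0, 3.0]
```

Solving the normal equations directly, rather than forming the inverse explicitly, gives the same result as the formula and avoids a separate matrix inversion step.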
Prerequisites
No missing or null data in inputs. Data is numeric, not categorical. Given the structure as Y and X1...Xn, there must be more than n+1 records available for analysis.
Interface (linearRegression)
Function: pal::linearRegression
This is a multiple linear regression function.
pal::linearRegression( Table<...> data, Table<...> args, Table<...> result, Table<...> fitted, Table<...> significance)
Input Table
Table  Column         Column Data Type   Description  Constraint
Data   1st column     Integer            ID
Data   2nd column     Integer or double  Variable y
Data   Other columns  Integer or double  Variable Xn
Parameter Table
Name           Data Type  Description
VARIABLE_NUM   Integer    Number of variable X
THREAD_NUMBER  Integer    Number of threads
Output Table
Table         Column      Column Data Type   Description  Constraint
Result        1st column  Integer            ID
Result        2nd column  Integer or double  Value Ai (A0 is the intercept; A1 is the beta coefficient for X1, A2 is the beta coefficient for X2, etc.)
Fitted Data   1st column  Integer            ID
Fitted Data   2nd column  Integer or double  Value Yi
Significance  1st column  VARCHAR/CHAR       Name
Significance  2nd column  Double             Value (R^2 / F)
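The Significance table reports R^2 and the F value. The manual does not state the formulas PAL uses, so the following Python sketch assumes the textbook definitions: R^2 = 1 - SS_res / SS_tot, and F = (R^2 / p) / ((1 - R^2) / (n - p - 1)) for n records and p variables. The function names and the sample numbers are hypothetical.

```python
def r_squared(y, fitted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - ss_res / ss_tot


def f_value(r2, n, p):
    """F statistic for a model with p variables fitted on n records (assumed definition)."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# Made-up observed and fitted values for a model with one variable.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, fitted)             # about 0.98
f = f_value(r2, n=len(observed), p=1)        # about 98
```

The denominator n - p - 1 is one way to see why the prerequisites require more than n+1 records: with fewer, the F value is undefined.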
Interface (forecastWithLR)
Function: pal::forecastWithLR
This function is used to perform prediction with a linear regression result.
pal::forecastWithLR( Table<...> predictdata, Table<...> coefficient, Table<...> args, Table<...> result)
Input Table
Table            Column         Column Data Type   Description  Constraint
Predictive Data  1st column     Integer            ID
Predictive Data  Other columns  Integer or double  Variable Xn
Coefficient      1st column     Integer            ID
Coefficient      2nd column     Integer or double  Value Ai
Parameter Table
Name           Data Type  Description
VARIABLE_NUM   Integer    Number of variable X
THREAD_NUMBER  Integer    Number of threads
Output Table
Table          Column      Column Data Type   Description  Constraint
Fitted Result  1st column  Integer            ID
Fitted Result  2nd column  Integer or double  Value Yi
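Given the coefficient layout described for the regression result (A0 is the intercept, A1 the coefficient for X1, and so on), a forecast is the intercept plus the weighted sum of the variables. The Python sketch below illustrates that arithmetic only; the function name forecast is hypothetical, and the numbers are the coefficient values and the first PREDICTDATA_TAB row from the example in this section.

```python
def forecast(coefficients, x):
    """Return A0 + A1*x1 + ... + An*xn for one record."""
    intercept, betas = coefficients[0], coefficients[1:]
    return intercept + sum(a * xi for a, xi in zip(betas, x))


# Coefficient values A0, A1, A2 as inserted into the coefficient table of the example.
coefficients = [1.7120914258645001, 0.2652771198483208, -3.471103742302148]

# First record of the prediction data: X1 = 0.5, X2 = 0.3.
y0 = forecast(coefficients, (0.5, 0.3))  # about 0.8034
```

The value is computed by this sketch from the formula above; it is not copied from the documented output of pal::forecastWithLR.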
Example
;# linearRegression
ALTER SESSION SET CURRENT_SCHEMA = "DM_PAL";
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE( "ID" INT,"Y" DOUBLE,"X1" DOUBLE, "X2" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT,"Ai" DOUBLE);
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT,"Fitted" DOUBLE);
DROP TYPE SIGNIFICANCE_T;
CREATE TYPE SIGNIFICANCE_T AS TABLE("NAME" varchar(50),"VALUE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE( "Name" VARCHAR (50),"intArgs" INTEGER, "doubleArgs" DOUBLE,"stringArgs" VARCHAR (100));
DROP PROCEDURE palLR;
CREATE PROCEDURE palLR( IN data DATA_T, IN control CONTROL_T, OUT results RESULT_T, OUT fittedValue FITTED_T,OUT significance SIGNIFICANCE_T ) LANGUAGE LLANG AS
BEGIN
export Void main(
  Table<Int32 "ID",Double "Y",Double "X1", Double "X2"> "data" dataTab,
  Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
  Table<Int32 "ID",Double "Ai"> "results" & resultsTab,
  Table<Int32 "ID",Double "Fitted"> "fittedValue" & fittedTab,
  Table<String "NAME",Double "VALUE"> "significance" & significanceTab)
{
  pal::linearRegression(dataTab, argsTab, resultsTab, fittedTab,significanceTab);
}
END;
DROP TABLE DATA_TAB;
CREATE COLUMN TABLE DATA_TAB ( "ID" INT,"Y" DOUBLE,"X1" DOUBLE, "X2" DOUBLE);
INSERT INTO DATA_TAB VALUES (0,0.5,0.13,0.33);
INSERT INTO DATA_TAB VALUES (1,0.15,0.14,0.34);
INSERT INTO DATA_TAB VALUES (2,0.25,0.15,0.36);
INSERT INTO DATA_TAB VALUES (3,0.35,0.16,0.35);
INSERT INTO DATA_TAB VALUES (4,0.45,0.17,0.37);
INSERT INTO DATA_TAB VALUES (5,0.55,0.18,0.38);
INSERT INTO DATA_TAB VALUES (6,0.65,0.19,0.39);
INSERT INTO DATA_TAB VALUES (7,0.75,0.19,0.31);
INSERT INTO DATA_TAB VALUES (8,0.85,0.11,0.32);
INSERT INTO DATA_TAB VALUES (9,0.95,0.12,0.33);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ( "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("ID" INT,"Ai" DOUBLE);
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT,"Fitted" DOUBLE);
DROP TABLE SIGNIFICANCE_TAB;
CREATE COLUMN TABLE SIGNIFICANCE_TAB ("NAME" varchar(50),"VALUE" DOUBLE);
CALL palLR(DATA_TAB, "#CONTROL_TAB", RESULTS_TAB, FITTED_TAB, SIGNIFICANCE_TAB) with overview;
SELECT * FROM RESULTS_TAB;
SELECT * FROM FITTED_TAB;
SELECT * FROM SIGNIFICANCE_TAB;
;#EXPECTED OUTPUT:
RESULTS_TAB:
FITTED_TAB:
SIGNIFICANCE_TAB:
;# forecastWithLR
ALTER SESSION SET CURRENT_SCHEMA = "DM_PAL";
DROP TYPE PREDICT_T;
CREATE TYPE PREDICT_T AS TABLE( "ID" INT,"X1" DOUBLE, "X2" DOUBLE);
DROP TYPE COEFFICIENT_T;
CREATE TYPE COEFFICIENT_T AS TABLE("ID" INT,"Ai" DOUBLE);
DROP TYPE FITTED_T;
CREATE TYPE FITTED_T AS TABLE("ID" INT,"Fitted" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE( "Name" VARCHAR (50),"intArgs" INTEGER, "doubleArgs" DOUBLE,"stringArgs" VARCHAR (100));
DROP PROCEDURE palForecastWithLR;
CREATE PROCEDURE palForecastWithLR( IN predictData PREDICT_T, IN coefficient COEFFICIENT_T, IN control CONTROL_T, OUT fittedValue FITTED_T ) LANGUAGE LLANG AS
BEGIN
export Void main(
  Table<Int32 "ID",Double "X1", Double "X2"> "predictData" predictDataTab,
  Table<Int32 "ID",Double "Ai"> "coefficient" coefficientTab,
  Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
  Table<Int32 "ID",Double "Fitted"> "fittedValue" & fittedTab)
{
  pal::forecastWithLR(predictDataTab, coefficientTab, argsTab, fittedTab);
}
END;
DROP TABLE PREDICTDATA_TAB;
CREATE COLUMN TABLE PREDICTDATA_TAB ( "ID" INT,"X1" DOUBLE, "X2" DOUBLE);
INSERT INTO PREDICTDATA_TAB VALUES (0,0.5,0.3);
INSERT INTO PREDICTDATA_TAB VALUES (1,4,0.4);
INSERT INTO PREDICTDATA_TAB VALUES (2,0,1.6);
INSERT INTO PREDICTDATA_TAB VALUES (3,0.3,0.45);
INSERT INTO PREDICTDATA_TAB VALUES (4,0.4,1.7);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB ( "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
INSERT INTO #CONTROL_TAB VALUES ('VARIABLE_NUM',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',8,null,null);
DROP TABLE COEEFICIENT_TAB;
CREATE COLUMN TABLE COEEFICIENT_TAB ("ID" INT,"Ai" DOUBLE);
INSERT INTO COEEFICIENT_TAB VALUES (0,1.7120914258645001);
INSERT INTO COEEFICIENT_TAB VALUES (1,0.2652771198483208);
INSERT INTO COEEFICIENT_TAB VALUES (2,-3.471103742302148);
DROP TABLE FITTED_TAB;
CREATE COLUMN TABLE FITTED_TAB ("ID" INT,"Fitted" DOUBLE);
CALL palForecastWithLR(PREDICTDATA_TAB, COEEFICIENT_TAB, "#CONTROL_TAB", FITTED_TAB) with overview;
SELECT * FROM FITTED_TAB;
;#EXPECTED OUTPUT:
Apriori
Apriori is a classic predictive analysis algorithm for finding association rules in association analysis. Association analysis uncovers hidden patterns, correlations, or causal structures among a set of items or objects. For example, association analysis is used to understand which products and services customers tend to purchase at the same time. By analyzing customers' purchasing trends with association analysis, predictions about their future behavior can be made.
Apriori is designed to operate on databases containing transactions. As is common in association rule mining, given a set of item sets, the algorithm attempts to find subsets that are common to at least a minimum number of the item sets. Apriori uses a bottom-up approach, where frequent subsets are extended one item at a time, a step known as candidate generation, and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates that have an infrequent sub-pattern. The candidate set then contains all frequent k-length item sets. After that, it scans the transaction database to determine the frequent item sets among the candidates.
For more information, refer to http://en.wikipedia.org/wiki/Apriori_algorithm.
The Apriori function in PAL uses a vertical data format to store the transaction data in memory. The function can take string or integer transaction IDs and item IDs as input. It supports the output of confidence, support, and lift values, but does not limit the number of output rules. However, you can use SQLScript to limit the number of output rules, for example:
SELECT TOP 2000 * FROM RULE_RESULTS WHERE lift > 0.5
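The level-wise search (candidate generation, pruning, support counting) can be sketched in Python as follows. This is an illustration under stated assumptions, not the PAL implementation: it uses a plain scan instead of the vertical data format, treats minimum support as a fraction of all transactions, and the function names apriori and rule_metrics are hypothetical. The nine transactions are the ones grouped by CUSTOMER in the example of this section.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)

    def support(itemset):
        return sum(1 for t in tx if itemset <= t) / n

    # Level 1: frequent single items.
    level = {}
    for item in {i for t in tx for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            level[frozenset([item])] = s

    frequent, k = {}, 1
    while level:
        frequent.update(level)
        k += 1
        prev = list(level)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: drop candidates that contain an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Scan the transactions to keep only the frequent candidates.
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                level[c] = s
    return frequent


def rule_metrics(frequent, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs -> rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    s = frequent[lhs | rhs]
    return s, s / frequent[lhs], s / (frequent[lhs] * frequent[rhs])


transactions = [
    {"item1", "item2", "item5"}, {"item2", "item4"}, {"item2", "item3"},
    {"item1", "item2", "item4"}, {"item1", "item3"}, {"item2", "item3"},
    {"item1", "item3"}, {"item1", "item2", "item3", "item5"},
    {"item1", "item2", "item3"},
]
frequent = apriori(transactions, min_support=0.2)
support, confidence, lift = rule_metrics(frequent, {"item5"}, {"item1", "item2"})
```

In this sketch the rule item5 -> item1, item2 has support 2/9, confidence 1.0, and lift 2.25, because both transactions containing item5 also contain item1 and item2. Whether pal::aprioriRule reports exactly these rules and values depends on details not documented here.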
Prerequisite
Input data must not contain null values.
Interface (aprioriRule)
Function: pal::aprioriRule
This function reads input transaction data and generates association rules using the Apriori algorithm.
pal::aprioriRule( Table<...> dataset, Table<...> args, Table<...> result)
Input Table
Table                    Column                 Column Data Type         Description     Constraint
Dataset/Historical Data  Transaction ID column  Integer or VARCHAR/CHAR  Transaction ID
Dataset/Historical Data  Item column            Integer or VARCHAR/CHAR  Item ID
Parameter Table
Name            Data Type  Description
MIN_SUPPORT     Double     User-specified minimum support (actual value).
MIN_CONFIDENCE  Double     User-specified minimum confidence (actual value).
TID_COLUMN      Integer    Indicates which column stores the transaction ID (column index starts from zero).
ITEM_COLUMN     Integer    Indicates which column stores the item ID (column index starts from zero).
THREAD_NUMBER   Integer    Number of threads.
Output Table
Table   Column      Column Data Type  Description      Constraint
Result  1st column  VARCHAR/CHAR      Leading items
Result  2nd column  VARCHAR/CHAR      Dependent items
Result  3rd column  Double            Support value
Result  4th column  Double            Confidence value
Result  5th column  Double            Lift value
Example
;#aprioriRule
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE( "ID" INT,"CUSTOMER" INT,"ITEM" VARCHAR(20));
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("PRERULE" VARCHAR(500),"POSTRULE" VARCHAR(500),"SUPPORT" DOUBLE, "CONFIDENCE" DOUBLE,"LIFT" DOUBLE);
DROP TABLE #CONTROL_TAB;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB( "Name" VARCHAR (50), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR (100));
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE( "Name" VARCHAR (50),"intArgs" INTEGER, "doubleArgs" DOUBLE,"stringArgs" VARCHAR (100));
DROP PROCEDURE palapprioriRule;
CREATE PROCEDURE palapprioriRule( IN data DATA_T, IN control CONTROL_T, OUT results RESULT_T ) LANGUAGE LLANG AS
BEGIN
export Void main(
  Table<Int32 "ID",Int32 "CUSTOMER",String "ITEM"> "data" dataTab,
  Table<String "Name", Int32 "intArgs", Double "doubleArgs", String "stringArgs"> "control" argsTab,
  Table<String "PRERULE",String "POSTRULE",Double "SUPPORT",Double "CONFIDENCE",Double "LIFT"> "results" & resultsTab)
{
  pal::aprioriRule(dataTab, argsTab, resultsTab);
}
END;
DROP TABLE TESTASSOCIATION_TAB;
CREATE COLUMN TABLE TESTASSOCIATION_TAB("ID" INT,"CUSTOMER" INT,"ITEM" VARCHAR(20));
INSERT INTO TESTASSOCIATION_TAB VALUES (0, 0, 'item1');
INSERT INTO TESTASSOCIATION_TAB VALUES (1, 0, 'item2');
INSERT INTO TESTASSOCIATION_TAB VALUES (2, 0, 'item5');
#transaction T0: I1,I2,I5
INSERT INTO TESTASSOCIATION_TAB VALUES (3, 1, 'item2');
INSERT INTO TESTASSOCIATION_TAB VALUES (4, 1, 'item4');
#transaction T1: I2,I4
INSERT INTO TESTASSOCIATION_TAB VALUES (5, 2, 'item2');
INSERT INTO TESTASSOCIATION_TAB VALUES (6, 2, 'item3');
INSERT INTO TESTASSOCIATION_TAB VALUES (7, 3, 'item1');
INSERT INTO TESTASSOCIATION_TAB VALUES (8, 3, 'item2');
INSERT INTO TESTASSOCIATION_TAB VALUES (9, 3, 'item4');
INSERT INTO TESTASSOCIATION_TAB VALUES (10, 4,'item1');
INSERT INTO TESTASSOCIATION_TAB VALUES (11, 4,'item3');
INSERT INTO TESTASSOCIATION_TAB VALUES (12, 5, 'item2');
INSERT INTO TESTASSOCIATION_TAB VALUES (13, 5, 'item3');
INSERT INTO TESTASSOCIATION_TAB VALUES (14, 6, 'item1');
INSERT INTO TESTASSOCIATION_TAB VALUES (15, 6, 'item3');
INSERT INTO TESTASSOCIATION_TAB VALUES (16, 7, 'item1');
INSERT INTO TESTASSOCIATION_TAB VALUES (18, 7, 'item2');
INSERT INTO TESTASSOCIATION_TAB VALUES (19, 7, 'item3');
INSERT INTO TESTASSOCIATION_TAB VALUES (20, 7, 'item5');
INSERT INTO TESTASSOCIATION_TAB VALUES (21, 8, 'item1');
INSERT INTO TESTASSOCIATION_TAB VALUES (22, 8, 'item2');
INSERT INTO TESTASSOCIATION_TAB VALUES (23, 8, 'item3');
DROP TABLE RESULTS_TAB;
CREATE COLUMN TABLE RESULTS_TAB ("PRERULE" VARCHAR(500),"POSTRULE" VARCHAR(500), "SUPPORT" Double, "CONFIDENCE" Double,"LIFT" DOUBLE);
TRUNCATE TABLE #CONTROL_TAB;
INSERT INTO #CONTROL_TAB VALUES ('TID_COLUMN',1,null,null);
INSERT INTO #CONTROL_TAB VALUES ('ITEM_COLUMN',2,null,null);
INSERT INTO #CONTROL_TAB VALUES ('THREAD_NUMBER',4,null,null);
INSERT INTO #CONTROL_TAB VALUES ('MIN_SUPPORT',null,0.2,null);
INSERT INTO #CONTROL_TAB VALUES ('MIN_CONFIDENCE',null,0.2,null);
TRUNCATE TABLE RESULTS_TAB;
CALL palapprioriRule(TESTASSOCIATION_TAB, "#CONTROL_TAB", RESULTS_TAB) with overview;
SELECT * FROM RESULTS_TAB;
;#EXPECTED OUTPUT:
ABC Classification
ABC Classification is used to classify objects, such as customers, employees, or products, based on a particular measure, such as revenue or profit. ABC analysis suggests that the inventories of an organization are not of equal value. Thus, the inventories are grouped into three categories (A, B, and C) by their estimated importance. A items are very important for an organization. B items are important, but less important than A items and more important than C items. Therefore, B items are of medium importance, and C items are marginally important.
An example of ABC classification is as follows:
A items: 20% of the items account for 70% of the annual consumption value of all items.
B items: 30% of the items account for 25% of the annual consumption value of all items.
C items: 50% of the items account for 5% of the annual consumption value of all items.
For more information, refer to http://en.wikipedia.org/wiki/ABC_analysis.
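The grouping idea can be sketched in Python as follows. This is a minimal illustration, not the PAL implementation. The boundary rule is an assumption: items are ranked by value in descending order, and an item is assigned to class A while the cumulative share of value before it is below the A interval, to class B while that share is below the A and B intervals combined, and to class C otherwise. The function name abc_classify is hypothetical; the item values and the intervals 0.7 and 0.2 are taken from the example in this section.

```python
def abc_classify(items, percent_a, percent_b):
    """Return a list of (class, item_name), ranked by descending value.

    Assumed rule: the class is decided by the cumulative value share
    accumulated before the item is added. Everything past A and B is C.
    """
    total = sum(value for _, value in items)
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    result, running = [], 0.0
    for name, value in ranked:
        share_before = running / total
        if share_before < percent_a:
            cls = "A"
        elif share_before < percent_a + percent_b:
            cls = "B"
        else:
            cls = "C"
        result.append((cls, name))
        running += value
    return result


items = [
    ("item1", 15.4), ("item2", 200.4), ("item3", 280.4), ("item4", 100.9),
    ("item5", 40.4), ("item6", 25.6), ("item7", 18.4), ("item8", 10.5),
    ("item9", 96.15), ("item10", 9.4),
]
classes = abc_classify(items, percent_a=0.7, percent_b=0.2)
```

Under the assumed rule, item3, item2, and item4 fall into class A, item9 and item5 into class B, and the remaining five items into class C. A different boundary convention (for example, testing the cumulative share after adding the item) would move item4 into class B, so the grouping produced by pal::abcAnalysis may differ.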
Prerequisite
Input data cannot contain null values. The item names in the Input table must be of string data type and must be unique.
Interface (abcAnalysis)
Function: pal::abcAnalysis
This function performs the ABC analysis algorithm.
pal::abcAnalysis ( Table<...> target, Table<...> args, Table<...> result)
Input Table
Table        Column      Column Data Type  Description  Constraint
Target Data  1st column  VARCHAR/CHAR      Item name
Target Data  2nd column  Double            Value
Parameter Table
Name           Data Type  Description
START_COLUMN   Integer    The first column used to do the classification (column index starts from zero).
END_COLUMN     Integer    The last column used to do the classification (column index starts from zero).
THREAD_NUMBER  Integer    Number of threads
PERCENT_A      Double     Interval for A class
PERCENT_B      Double     Interval for B class
PERCENT_C      Double     Interval for C class
Output Table
Table   Column      Column Data Type  Description  Constraint
Result  1st column  VARCHAR/CHAR      ABC class
Result  2nd column  VARCHAR/CHAR      Items
Example
;# ABC
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE("ITEM" VARCHAR(100),"VALUE" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ABC" VARCHAR(10),"ITEM" VARCHAR(100));
DROP PROCEDURE palAbcAnalysis;
CREATE PROCEDURE palAbcAnalysis( IN target DATA_T, IN control CONTROL_T, OUT results RESULT_T ) LANGUAGE LLANG AS
BEGIN
export Void main(
  Table<String "ITEM", Double "VALUE"> "target" targetTab,
  Table<String "Name", Int32 "intArgs", Double "doubleArgs",String "strArgs"> "control" controlTab,
  Table<String "ABC", String "ITEM"> "results" & resultsTab)
{
  pal::abcAnalysis(targetTab, controlTab, resultsTab);
}
END;
DROP TABLE #CONTROL_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TBL ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE,"strArgs" VARCHAR(100));
INSERT INTO #CONTROL_TBL VALUES ('START_COLUMN',0,null,NULL);
INSERT INTO #CONTROL_TBL VALUES ('END_COLUMN',1,null,null);
INSERT INTO #CONTROL_TBL VALUES ('THREAD_NUMBER',2,null,null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_A',null,0.7,null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_B',null,0.2,null);
INSERT INTO #CONTROL_TBL VALUES ('PERCENT_C',null,0.1,null);
DROP TABLE TESTABCTAB;
CREATE COLUMN TABLE TESTABCTAB("ITEM" VARCHAR(100),"VALUE" DOUBLE);
INSERT INTO TESTABCTAB VALUES ('item1', 15.4);
INSERT INTO TESTABCTAB VALUES ('item2', 200.4);
INSERT INTO TESTABCTAB VALUES ('item3', 280.4);
INSERT INTO TESTABCTAB VALUES ('item4', 100.9);#100.9
INSERT INTO TESTABCTAB VALUES ('item5', 40.4);
INSERT INTO TESTABCTAB VALUES ('item6', 25.6);
INSERT INTO TESTABCTAB VALUES ('item7', 18.4);
INSERT INTO TESTABCTAB VALUES ('item8', 10.5);
INSERT INTO TESTABCTAB VALUES ('item9', 96.15);
INSERT INTO TESTABCTAB VALUES ('item10', 9.4);
DROP TABLE RESULT_TBL;
CREATE COLUMN TABLE RESULT_TBL("ABC" VARCHAR(10),"ITEM" VARCHAR(100));
CALL palAbcAnalysis(TESTABCTAB, "#CONTROL_TBL", RESULT_TBL) with overview;
SELECT * FROM RESULT_TBL;
;#EXPECTED OUTPUT:
Weighted Score Table

A weighted score table is a method of evaluating alternatives when the importance of each criterion differs. In a weighted score table, each alternative is given a score for each criterion. These scores are then weighted by the importance of each criterion. All of an alternative's weighted scores are then added together to calculate that alternative's total weighted score. The alternative with the highest total score should be the best alternative.
Weighted score tables can be used to make predictions about future customer behavior. A model based on historical data in a data mining application may be applied to new data to make a prediction. The prediction, that is, the output of the model, is also called a score. A single score for each customer can be calculated by taking different dimensions into account.
A function defined by a weighted score table is a linear combination of functions of single variables:
f(x1, ..., xn) = w1 * f1(x1) + ... + wn * fn(xn)
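The linear combination f(x1, ..., xn) = w1 f1(x1) + ... + wn fn(xn) can be sketched in Python as follows. This is an illustration only, not the PAL implementation. In particular, how PAL maps continuous attributes is not described in this section, so the step-function lookup used here (take the value of the largest key that does not exceed the input) is an assumption, and the names lookup_step and weighted_score are hypothetical. The keys, values, and weights mirror the map function and control tables of the example in this section.

```python
from bisect import bisect_right


def lookup_step(keys, values, x):
    """Assumed step function: value of the largest key <= x (first value if x is below all keys)."""
    index = bisect_right(keys, x) - 1
    return values[max(index, 0)]


def weighted_score(row, criteria):
    """Sum of weight * f(attribute value) over all criteria."""
    return sum(weight * f(row[name]) for name, weight, f in criteria)


# (attribute name, weight w_i, map function f_i)
criteria = [
    ("GENDER", 0.5, lambda g: {"male": 2.0, "female": 1.5}[g]),
    ("INCOME", 2.0, lambda v: lookup_step([0, 5500, 9000, 12000], [0.0, 1.0, 2.0, 3.0], v)),
    ("HEIGHT", 1.0, lambda v: lookup_step([1.5, 1.6, 1.71, 1.8], [0.0, 1.0, 2.0, 3.0], v)),
]

# First record of the target data in the example.
score = weighted_score({"GENDER": "male", "INCOME": 5000, "HEIGHT": 1.73}, criteria)
```

Under these assumptions the record scores 0.5 * 2.0 + 2.0 * 0.0 + 1.0 * 2.0 = 3.0. The value returned by pal::weightedTable may differ if PAL treats continuous attributes differently, for example by interpolating between keys.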
Prerequisites
Input data cannot contain null values. The columns of the Map Function table must follow the attribute order of the Input Data table.
Interface (weightedTable)
Function: pal::weightedTable
This function performs weighted table calculation. It is similar to the Volume Driver function in the Business Function Library (BFL). Volume Driver calculates only one column, but weightedTable calculates multiple columns at the same time.
pal::weightedTable ( Table<...> target, Table<...> mapfun, Table<...> control, Table<...> args, Table<...> result)
Input Table
Table: Target/Input Data
  Column: Columns
  Column Data Type: VARCHAR/CHAR, integer, or double
  Description: Specifies which will be used to calculate the scores
  Constraint: Discrete value: integer, string, double; Continuous value: integer, double; An ID column is mandatory. Its data type should be integer.
Table: Map Function
  Column: Columns
  Column Data Type: VARCHAR/CHAR, integer, or double
  Description: Creates the map function
  Constraint: Every attribute (except ID) in the Input Data table maps to two columns in the Map Function table: Key column and Value column. The Value column must be double type.
Table: Control
  Column: Columns
  Column Data Type: Integer or double
  Description:
  Constraint: This table has three columns. When the Input Data table has n attributes (except ID), the Weight Table will have n rows.
Output Table
Table   Column      Column Data Type  Description   Constraint
Result  1st column  Integer           ID
Result  2nd column  Double            Result value
Parameter Table
Name           Data Type  Description
START_COLUMN   Integer    The first column used to do the calculation (column index starts from zero).
END_COLUMN     Integer    The last column used to do the calculation (column index starts from zero).
THREAD_NUMBER  Integer    Number of threads
ID_COLUMN      Integer    Specifies the ID column (column index starts from zero).
Example
;#weightedTable
DROP TYPE DATA_T;
CREATE TYPE DATA_T AS TABLE( "ID" INT,"GENDER" VARCHAR(10),"INCOME" INT,"HEIGHT" DOUBLE);
DROP TYPE MAP_FUN_T;
CREATE TYPE MAP_FUN_T AS TABLE("GENDER" VARCHAR(10), "VAL1" DOUBLE, "INCOME" INT, "VAL2" DOUBLE, "HEIGHT" DOUBLE, "VAL3" DOUBLE);
DROP TYPE CONTROL_T;
CREATE TYPE CONTROL_T AS TABLE( "WEIGHT" DOUBLE, "ISDIS" INT, "ROWNUM" INT);
DROP TYPE PARAMETERS_T;
CREATE TYPE PARAMETERS_T AS TABLE("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE);
DROP TYPE RESULT_T;
CREATE TYPE RESULT_T AS TABLE("ID" INT,"result" DOUBLE);
DROP PROCEDURE palWeightTable;
CREATE PROCEDURE palWeightTable( IN target DATA_T, IN mapfun MAP_FUN_T, IN control CONTROL_T, IN parameters PARAMETERS_T,OUT results RESULT_T ) LANGUAGE LLANG AS
BEGIN
export Void main(
  Table<Int32 "ID",String "GENDER",Int32 "INCOME",Double "HEIGHT"> "target" targetTab,
  Table<String "GENDER",Double "VAL1",Int32 "INCOME",Double "VAL2",Double "HEIGHT",Double "VAL3"> "mapfun" mapfunTab,
  Table<Double "WEIGHT",Int32 "ISDIS",Int32 "ROWNUM"> "control" controlTab,
  Table<String "Name", Int32 "intArgs", Double "doubleArgs"> "parameters" parametersTab,
  Table<Int32 "ID",Double "result"> "results" & resultsTab)
{
  pal::weightedTable(targetTab, mapfunTab, controlTab, parametersTab, resultsTab);
}
END;
DROP TABLE TESTTARGET_TBL;
CREATE COLUMN TABLE TESTTARGET_TBL ("ID" INT,"GENDER" VARCHAR(10),"INCOME" INT,"HEIGHT" DOUBLE);
INSERT INTO TESTTARGET_TBL VALUES (0,'male',5000,1.73);
INSERT INTO TESTTARGET_TBL VALUES (1,'male',9000,1.80);
INSERT INTO TESTTARGET_TBL VALUES (2,'female',6000,1.55);
INSERT INTO TESTTARGET_TBL VALUES (3,'male',15000,1.65);
INSERT INTO TESTTARGET_TBL VALUES (4,'female',2000,1.70);
INSERT INTO TESTTARGET_TBL VALUES (5,'female',12000,1.65);
INSERT INTO TESTTARGET_TBL VALUES (6,'male',1000,1.65);
INSERT INTO TESTTARGET_TBL VALUES (7,'male',8000,1.60);
INSERT INTO TESTTARGET_TBL VALUES (8,'female',5500,1.85);#5500
INSERT INTO TESTTARGET_TBL VALUES (9,'female',9500,1.85);
DROP TABLE MAP_FUN_TBL;
CREATE COLUMN TABLE MAP_FUN_TBL ( "GENDER" VARCHAR(10), "VAL1" DOUBLE, "INCOME" INT, "VAL2" DOUBLE, "HEIGHT" DOUBLE, "VAL3" DOUBLE);
INSERT INTO MAP_FUN_TBL VALUES ('male',2.0, 0,0.0, 1.5,0.0);
INSERT INTO MAP_FUN_TBL VALUES ('female',1.5, 5500,1.0, 1.6,1.0);
INSERT INTO MAP_FUN_TBL VALUES ('null',0.0, 9000,2.0, 1.71,2.0);
INSERT INTO MAP_FUN_TBL VALUES ('null',0.0, 12000,3.0, 1.80,3.0);
DROP TABLE CONTROL_TBL;
CREATE COLUMN TABLE CONTROL_TBL ("WEIGHT" DOUBLE, "ISDIS" INT, "ROWNUM" INT);
INSERT INTO CONTROL_TBL VALUES (0.5,1,2);
INSERT INTO CONTROL_TBL VALUES (2.0,-1,4);
INSERT INTO CONTROL_TBL VALUES (1.0,-1,4);
DROP TABLE #PARAMETERS_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PARAMETERS_TBL ("Name" VARCHAR(100), "intArgs" INT, "doubleArgs" DOUBLE);
INSERT INTO #PARAMETERS_TBL VALUES ('ID_COLUMN',0,null);
INSERT INTO #PARAMETERS_TBL VALUES ('START_COLUMN',1,null);
INSERT INTO #PARAMETERS_TBL VALUES ('END_COLUMN',3,null);
INSERT INTO #PARAMETERS_TBL VALUES ('THREAD_NUMBER',2,null);
DROP TABLE RESULT_TBL;
CREATE COLUMN TABLE RESULT_TBL("ID" INT,"result" DOUBLE);
CALL palWeightTable(TESTTARGET_TBL, MAP_FUN_TBL, CONTROL_TBL, "#PARAMETERS_TBL", RESULT_TBL) with overview;
SELECT * FROM RESULT_TBL;
;#EXPECTED OUTPUT:
Log and Trace

To learn about the details of a PAL function's implementation or to trace a problem that occurred at function runtime, you can check the log information in SAP HANA Database Administration.
Procedure
To open a log file:
1. In SAP HANA Database Administration, choose Trace Levels. Then click Add Component in the Trace Levels dialog box, enter the PAL function name, and click OK. The name should be entered as PAL_<FunctionName>, for example, PAL_APRIORIRULE.
2. Select a log type for the newly added PAL component and click OK, as shown below.
3. Right-click indexserver_<host_name> and select Show File.
Result
The log file for the specified PAL function is displayed. You can use the Find button to search for the log information that you need.
Copyrights
Copyright 2011 SAP AG. All rights reserved.
