
PUBLIC

SAP Predictive Analytics


2018-10-08

Predictive Analytics Data Access Guide


© 2018 SAP SE or an SAP affiliate company. All rights reserved.



Content

1 About This Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 What's New in Predictive Analytics Data Access Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Connecting to your Database Management System on Windows. . . . . . . . . . . . . . . . . . . . . . . . 9


3.1 About Connecting to Your Database Management System on Windows. . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Setting Up the ODBC Driver and the ODBC Connection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 SAP HANA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10
Installing SAP HANA ODBC Driver on Windows 64 Bits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Configuring SAP HANA ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
SAP HANA as a Data Source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Using SAP HANA Service as a Data Source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 SAP Vora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
About SAP Vora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Installing the Simba Spark SQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Configuring SAP Vora DSN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Setting the ODBC Behaviour Flag. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Oracle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20
About Oracle Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Setting Up Data Access for Oracle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Installing Oracle ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Setting Up Oracle ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.6 Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
About Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Setting Up Data Access for Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Creating Data Manipulations using Date Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Spark SQL Restrictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
About Hive Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Setting Up Data Access for Hive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Creating Data Manipulations using Date Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Hive Restrictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8 Teradata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
About Teradata Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Installing the Prerequisite Teradata Client Packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Setting Up Data Access for Teradata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Installing Teradata ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Setting Up Teradata ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.9 Microsoft SQL Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Setting Up Microsoft ODBC Driver for SQL Server 2005. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
About Microsoft ODBC Driver for SQL Server 2008. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44
Installing Microsoft ODBC Driver for SQL Server 2008. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .45
Setting Up Microsoft ODBC Driver for SQL Server 2008. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
About Microsoft ODBC Driver for SQL Server 2012. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Installing Microsoft ODBC Driver for SQL Server 2012. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Setting Up Microsoft ODBC Driver for SQL Server 2012. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
About Microsoft ODBC Driver for SQL Server 2014. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.10 Netezza . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Configuring Netezza ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.11 IBM DB2 V9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52
Setting Up DB2 ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.12 Sybase IQ 15.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Installing Sybase IQ ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Setting Up Sybase IQ ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 PostgreSQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Downloading PostgreSQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Installing PostgreSQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Setting Up PostgreSQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.14 Vertica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
About Vertica ODBC Driver Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Setting Up Vertica ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59
3.15 In-database Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Native Spark Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Modeling in SAP HANA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4 Connecting to your Database Management System on Linux. . . . . . . . . . . . . . . . . . . . . . . . . . 81


4.1 About Connecting to Your Database Management System on Linux. . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 ODBC Driver Manager Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
About ODBC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Default Setup for Oracle and Teradata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Setup the unixODBC for Other DBMS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Set Up the ODBC Driver and the ODBC Connection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .84
4.3 SAP HANA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Installing Prerequisite Software. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Setting Up ODBC Connectivity with Automated Analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Troubleshooting SAP HANA ODBC Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
SAP HANA as a Data Source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4 SAP Vora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
About SAP Vora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Installing the Simba Spark SQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Configuring SAP Vora DSN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Setting the ODBC Behaviour Flag. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Oracle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
About Oracle Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Setting Up Data Access for Oracle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.6 Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
About Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Setting Up Data Access for Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Creating Data Manipulations using Date Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Spark SQL Restrictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7 Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
About Hive Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Setting Up Data Access for Hive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Creating Data Manipulations using Date Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .101
Hive Restrictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.8 Teradata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
About Teradata Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Using Data Access for Teradata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Using Teradata ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.9 IBM DB2 V9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Installing DB2 Client. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Setting Up the DB2 Client for Connection to the Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Setting Up DB2 ODBC Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .113
4.10 Sybase IQ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Installing Sybase IQ ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Setting Up Sybase IQ ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.11 Netezza. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Installing and Setting Up the Netezza ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .115
4.12 PostgreSQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
About PostgreSQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Installing the PostgreSQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Setting Up PostgreSQL ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .117
4.13 Vertica. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Installing and Setting Up Vertica ODBC Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.14 SQL Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
About SQLServer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Installing and Setting Up Microsoft SQLServer Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.15 In-database Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Native Spark Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Modeling in SAP HANA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

4.16 Special Case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Installing and Setting Up unixODBC on Suse 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

5 Importing Flat Data Files Into Your DBMS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143


5.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2 Importing Flat Files into an Oracle Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .143
Creating the Target Table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .143
Creating the Control File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Importing the Flat File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3 Importing a Flat File into IBM DB2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Writing an SQL Query for IBM DB2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Connecting to IBM DB2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .149
Creating Target Table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Defining Import Specifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Sample SQL Queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Importing the Flat File into IBM DB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.4 Importing Flat Files into Microsoft Access. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Importing a Flat File into a Microsoft Access Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6 ODBC Fine Tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156


6.1 About ODBC Fine Tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2 Issues with ODBC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 How Automated Analytics Manages ODBC Specificities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.4 How to Override Automated Analytics Automatic Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .157
Overriding the Full Behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .158
Other Options to Override. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Schema Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Data Manipulation SQL Generation Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

7 Bulk Load With Oracle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194


7.1 About Bulk Load for Oracle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .194
7.2 Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.3 Performances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.4 Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.5 Perimeter of Usage and Limitations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .195
7.6 Tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

8 Fast Write for Teradata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197


8.1 About this document. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.2 Purpose. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.3 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Main components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

8.4 Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.5 Perimeter of Usage and Limitations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .201
8.6 Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Installing Fastload. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Checking the Installation of Fastload. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Teradata Patches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Automated Analytics Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.7 Fast Write Usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .206
Logs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Tips and Tricks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207
8.8 Advanced Setup and Tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .209
8.9 Main Options for Integration Tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Log and Temporary Objects Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Logon Mechanism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Changing Paths & Protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.10 Main Options for Performance Tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.11 Main options for Debug and Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.12 Annex A - Default fastload Script. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .212
Built-in Template. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
8.13 Annex B - List of Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

1 About This Guide

This guide provides you with the information and procedures to use databases with SAP Predictive Analytics. It is a collection of previously independent guides that have been grouped into a single guide.

2 What's New in Predictive Analytics Data Access Guide

Links to information about the new features and documentation changes in the data access guide for SAP
Predictive Analytics 3.3.

Section: Using SAP HANA Service as a Data Source
What's New: You can define a Windows ODBC Data Source for the SAP HANA Service using the tenant database address.
More information: Using SAP HANA Service as a Data Source [page 13]

Section: ODBC Fine Tuning
What's New: The option DropEmptySpace allows you to delete remaining empty tables after model deletion.
More information: DropEmptySpace [page 164]

Section: Connecting to Your Database Management System on Linux
What's New: Update on the Teradata database.
More information:
● Installing Prerequisite Software [page 107]
● Installing these Packages (if Needed) [page 107]
● Installing Teradata ODBC Driver [page 109]
● Setting Up Teradata ODBC Driver [page 110]

3 Connecting to your Database Management System on Windows

3.1 About Connecting to Your Database Management System on Windows

This section shows how to connect Automated Analytics to a database management system (DBMS) by
creating an ODBC connection.

Configuration steps are presented for the following databases:

● SAP HANA
● SAP Vora
● Oracle
● Spark SQL
● Hive
● Teradata
● Microsoft SQL Server
● Netezza
● IBM DB2
● Sybase IQ
● PostgreSQL
● Vertica

All the ODBC drivers used in the examples can be obtained from their respective vendors.

This document explains how to install the drivers (when not delivered with the OS) used to create the ODBC
connections. It also shows how to configure these drivers so that they suit Automated Analytics requirements.

 Caution

Automated Analytics does not use quotes around table and column names by default, so if they contain mixed case or special characters, including spaces, they can be misinterpreted by the DBMS. To change this behavior, you need to set the CaseSensitive option to "True" as explained in the ODBC Fine Tuning guide.
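
 Example

A minimal sketch of this setting, assuming the same ODBCStoreSQLMapper.<DSN>.<Option> syntax used for the other ODBC fine-tuning options in this guide; the exact key and configuration file to edit are documented in the ODBC Fine Tuning chapter:

# Assumed syntax: quote table and column names for all ODBC connections
ODBCStoreSQLMapper.*.CaseSensitive=True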

3.2 Setting Up the ODBC Driver and the ODBC Connection

The ODBC Driver Manager is installed by default on Windows OS, but it is necessary to install the
corresponding ODBC driver for each ODBC connection to a DBMS.

After installing the driver, users can declare a new ODBC connection via dialog boxes (all the options specific to
a DBMS are graphically edited and no longer stored in a text configuration file).

1. Select Start > Settings > Control Panel .

2. Click Administrative Tools > Data Sources (ODBC) .


3. Click Add to create a new data source.
4. Select the driver.
5. Click Finish.

 Note

Some dialogs specific to the chosen DBMS are displayed. The only point all the DBMS have in common
is that these dialogs always include a field allowing you to type the name of the ODBC connection.

3.3 SAP HANA

3.3.1 Installing SAP HANA ODBC Driver on Windows 64 Bits

1. Launch hdbsetup.exe in HDB_CLIENT_WINDOWS_X86_64. The Install new SAP HANA Database Client
option is automatically selected.

2. Click Install and wait for the installation to finish.

3.3.2 Configuring SAP HANA ODBC Driver

1. Click Start > Control Panel > Administrative Tools > Data Sources (ODBC) to open the ODBC Data
Source Administrator.
2. Select the SAP HANA ODBC driver (HDBODBC or HDBODBC32).

3. Click Finish.
4. Enter a name for the data source.
5. Keep the default options.
6. Fill the Login tab with the information provided by your database administrator.

7. Enter the host name of the SAP HANA server and the port to use, separated by a colon (:). The port is typically 3<Instance number>15. For example, for a standard instance number 00, the port to use is 30015 (see the example after this procedure).
8. Test your parameters by clicking Connect and providing proper test credentials.

9. Click OK.
The success panel appears.
10. Click OK to keep all other settings and validate the dialog box.
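
 Example

A typical set of values for an SAP HANA data source. The data source name and host name below are placeholders, and the exact field labels can vary slightly with the driver version; use the values provided by your database administrator:

Data Source Name:  MY_HANA
Server:Port:       myhanaserver:30015     (instance number 00, so 3<Instance number>15 gives 30015)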

3.3.3 SAP HANA as a Data Source

You can use SAP HANA databases as data sources in Data Manager and for all types of modeling analyses in
Modeler: Classification/Regression, Clustering, Time Series, Association Rules, Social, and Recommendation.

The following SAP HANA objects can be used:

● SAP HANA tables or SQL views, found in the Catalog node of the SAP HANA database.

● All types of SAP HANA views, found in the Content node of the SAP HANA database. An SAP HANA view is a predefined virtual grouping of table columns that enables data access for a particular business requirement. Views are specific to the type of tables that are included, and to the type of calculations that are applied to columns. For example, an analytic view is built on a fact table and associated attribute views. A calculation view executes a function on columns when the view is accessed.

   Restriction
  ● Analytic and calculation views that use the variable mapping feature (available starting with SAP HANA SPS 09) are not supported.
  ● You cannot edit data in SAP HANA views using Automated Analytics.

● Smart Data Access virtual tables. Thanks to Smart Data Access, you can expose data from remote source tables as virtual tables and combine them with regular SAP HANA tables. This allows you to access data sources that are not natively supported by the application, or to combine data from multiple heterogeneous sources.

   Caution
  To use virtual tables as input datasets for training or applying a model, or as output datasets for applying a model, you need to check that the following conditions are met:
  ● The in-database application mode is not used.
  ● The destination table for storing the predicted values exists in the remote source before applying the model.
  ● The structure of the remote table, that is the column names and types, must match exactly what is expected with respect to the generation options; if this is not the case, an error occurs.

   Caution
  In Data Manager, use virtual tables with caution as the generated queries can be complex. Depending on the source capabilities, Smart Data Access may not be able to delegate much of the processing to the underlying source, which can impact performance.

Prerequisites

You must know the ODBC source name and the connection information for your SAP HANA database. For more
information, contact your SAP HANA administrator.

In addition to having the authorizations required for querying the SAP HANA view, you need to be granted the
SELECT privilege on the _SYS_BI schema, which contains metadata on views. Please refer to SAP HANA
guides for detailed information on security aspects.
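
 Example

A minimal sketch of granting this privilege in SAP HANA. The user name MODELER_USER is a placeholder for the database user that Automated Analytics connects with:

GRANT SELECT ON SCHEMA _SYS_BI TO MODELER_USER;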

3.3.4 Using SAP HANA Service as a Data Source

The SAP Cloud Platform SAP HANA Service gives you access to your SAP HANA database, with all of its on-
premise features, but with the advantage of zero administration, as SAP takes care of the infrastructure and
database platform operations.

To use the SAP HANA Service, ensure your configuration meets the following prerequisites:

● Predictive Analytics Desktop version 3.3 patch 4 or later


● SAP HANA ODBC Driver version 2.2.106 or later
● Access to the SAP Cloud Platform Cockpit
● The SAP HANA Service instance is available

You can connect using the tenant database endpoint address for the server port, or by configuring the
connection by WebSocket using the web server URL as an advanced setting of the server port address. Both
connection methods are described in Related Information.

Related Information

Connecting to the SAP HANA Service Using the Tenant Database Address [page 14]

3.3.4.1 Connecting to the SAP HANA Service Using the Tenant Database Address

You'll need to install the required SAP Predictive Analytics Desktop and ODBC driver versions, then define the
ODBC connection using the tenant database endpoint address for the server port. The steps you need to follow
are listed here:

1. Download and install the required Predictive Analytics version 3.3 Patch 4. For more information, see Download and Install the Predictive Analytics Desktop Application [page 14].
2. Update your SAP HANA ODBC driver version to 2.2.106. For more information, see Update Your SAP HANA ODBC Driver [page 15].
3. Locate and copy the SAP HANA tenant database address. For more information, see Locate and Copy the SAP HANA Tenant Database Address [page 16].
4. Define the Windows ODBC data source using the SAP HANA tenant database address. For more information, see Define a Windows ODBC Data Source for the SAP HANA Service [page 16].

3.3.4.1.1 Download and Install the Predictive Analytics Desktop Application

You need a valid S-user identifier on the SAP Support site before you can download the installation file. S-user
information is available here: https://support.sap.com/en/my-support/users.html .

Download the desktop installation file as follows:

1. Go to the page https://launchpad.support.sap.com/#/softwarecenter .


2. Select the option BY ALPHABETICAL INDEX (A-Z).
3. Select the letter P (as in Predictive).
4. Click the entry SAP PREDICTIVE ANALYTICS.
5. Click SAP PREDICTIVE ANALYTICS 3.
6. Click COMPRISED SOFTWARE COMPONENT VERSIONS.
7. Click PREDICTIVE ANALYTICS DESKTOP 3.
8. Select the Windows entry from the Operating System dropdown list.
9. Select the Patch 4 for SAP Predictive Analytics DESKTOP 3.3 checkbox, and click the name of the .EXE file,
PADESKTOP3003P_4-70001855.EXE, to start the download.

Once you have downloaded the installation file, double click it to install.

Related Information

Update Your SAP HANA ODBC Driver [page 15]

3.3.4.1.2 Update Your SAP HANA ODBC Driver

You download the latest SAP HANA ODBC client from the SAP Development Tools website, and then run the
installation/update wizard.

1. Go to the SAP Development Tools website: https://tools.hana.ondemand.com/.


2. Click the HANA tab.
3. Under SAP HANA Client 2.0, go to the Windows row in Available SAP HANA Client Downloads, and click the
file: hanaclient-<version number>-windows-x64.tar.gz. This is always the latest version available
for developers. The minimum version required is hanaclient-2.3.106-windows-x64.tar.gz.

The hanaclient-<version number>-windows-x64.tar.gz file downloads to your local machine.


4. Uncompress the .tar.gz file and double click hdbsetup.exe to start the SAP Lifecycle Management
wizard.
5. Select the Update SAP HANA Database Client option, click Next, and follow the wizard to complete the
update.

Related Information

Locate and Copy the SAP HANA Tenant Database Address [page 16]

3.3.4.1.3 Locate and Copy the SAP HANA Tenant Database Address

You use the SAP HANA Service tenant database address as the port address for the ODBC data source connection. If you are connecting using WebSocket, you also need to copy the tenant database GUID. This is used as the WEBSOCKETURL parameter, an additional connection property.

1. Log on to the SAP Cloud Platform Cockpit for the target SAP HANA tenant database.
2. Click the Service Instances tab.
3. Look for the hana-enterprise service in the list of services, and click its properties button under Actions.

The SAP HANA Service Dashboard for the service appears showing the service properties and some
management features.
4. Under the Endpoints section, select and copy the Tenant DB address. This is the SAP HANA database
address that you will specify as the Server Port for the ODBC connection. It would be useful to paste the
address into a text editor to ensure that it will be available when you define the data source connection. If
you are connecting by WebSocket, then you also need to copy the server ID. This is explained in the next
step.
5. For a WebSocket connection: Under the Detail section, copy the ID. Store this in a text editor along with the tenant DB address.

Now that you have the tenant database endpoint address, and its ID if you are using WebSocket, you have the information necessary to complete the ODBC Data Source connection.

Related Information

Define a Windows ODBC Data Source for the SAP HANA Service [page 16]

3.3.4.1.4 Define a Windows ODBC Data Source for the SAP HANA Service

You'll create a new data source for the SAP HANA Service by specifying the SAP Cloud Platform (SCP) tenant
database address in the ODBC data source connection. If you are connecting using WebSocket, you will need to
edit the address, and add some additional connection properties including the URL of your web server. These
are explained in this section.

1. Open your computer's ODBC Data Source Administrator (64-bit) and click the System DSN tab.
2. Click the Add button.
3. Select the driver HDBODBC.
4. Click Finish.

The ODBC Configuration for SAP HANA box appears.

5. Type a name for the data source, for example HANA_SERVICE_US, or a name that you can identify with the
service.
6. Enter the Server Port address. You need the SCP tenant database address that you copied or stored in a
text file from the SAP HANA Service Dashboard. You copied this address in the previous topic Locate and
Copy the SAP HANA Tenant Database Address [page 16]. Do one of the following:

○ For a standard ODBC connection: Paste the SCP tenant database address in the Server Port field.
○ For a WebSocket connection: Paste the SCP tenant database address in the Server Port field, then
replace the zeus part with wsproxy, and replace the TCP/IP port value at the end of the address with
80. For example:

ODBC connection:      zeus.hana.production.jimkiwi.dba:2869
WebSocket connection: wsproxy.hana.production.jimkiwi.dba:80

7. Click the Settings button.

The Advanced ODBC Connection Property Setup box appears.


8. Under SSL Connection, select both of the following checkboxes:

○ Connect using SSL


○ Validate the SSL certificate

If you are defining a standard ODBC connection, click OK to close the Advanced ODBC Connection Property
Setup box. You've completed the connection definition.

If you are defining a WebSocket connection, then continue with the following steps.
9. Click Add, and enter the following information in the Additional Connection Properties box, then click OK. A consolidated example of the WebSocket settings follows this procedure.

   Property: WEBSOCKETURL
   Value: Type /service/ at the start, then paste the HANA Service instance ID that you copied from the section Locate and Copy the SAP HANA Tenant Database Address [page 16]. You now have the following syntax: /service/<tenant database ID>

The WebSocket connection can also require further connection properties. Depending on your
environment, you may also need to specify a proxy server; for example, environments that limit outgoing
ports usually require proxy server information.

10. If you need to specify a proxy server, then click Add and add each of the following properties successively
to Additional Connection Properties. Here we give some standard values as an example, but you may need
to enter values that correspond to your environment.

   PROXY_HOST: PROXY
   PROXY_PORT: 8080

11. Click OK to close the Advanced ODBC Connection Property Setup box.

You can use the Windows ODBC Data Source for SAP HANA Service for on premise clients, for example,
Predictive Analytics Desktop, Predictive Analytics Server, Predictive Factory, Jupyter Notebook with
Python API, or KxShell script.
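
 Example

A consolidated view of the WebSocket settings entered in this procedure. The tenant database host, instance ID, and proxy values are the illustrative placeholders used above; replace them with the values from your SAP HANA Service Dashboard and your own environment:

Server Port:                  wsproxy.hana.production.jimkiwi.dba:80
Connect using SSL:            selected
Validate the SSL certificate: selected
Additional connection properties:
    WEBSOCKETURL = /service/<tenant database ID>
    PROXY_HOST   = PROXY
    PROXY_PORT   = 8080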

3.4 SAP Vora

3.4.1 About SAP Vora

Native Spark Modeling optimizes the modeling process on SAP Vora data sources.

To set up an ODBC connection to SAP Vora, follow the steps listed below:

1. Install the Simba Spark SQL ODBC driver.


2. Configure a SAP Vora DSN with the Simba Spark SQL driver.
3. Set the ODBC behaviour flag for the DSN to SAP Vora.

For more information, refer to the section Native Spark Modeling in the related information below.

Related Information

Native Spark Modeling [page 60]


Native Spark Modeling [page 120]

3.4.2 Installing the Simba Spark SQL ODBC Driver

1. Download the Simba driver relevant for the recommended Vora version from the Simba website. Refer to the SAP Product Availability Matrix http://service.sap.com/sap/support/pam for the supported product versions.
2. Follow the instructions to install the driver.

 Note

HortonWorks customers can use the HortonWorks Spark SQL ODBC driver provided by Simba.

3.4.3 Configuring SAP Vora DSN

1. Fill the connection parameters with the information provided by your cluster administrator.
2. Set the advanced options for the connection.
a. Click Advanced Options at the bottom of the window.
b. In the Advanced Options window, select the following options:
○ Use Native Query
○ Fast SQLPrepare

c. Click OK.

The settings are saved and the Advanced Options window closes.
3. Click Test at the bottom of the window.

The Test Results window opens. If the connection has been successfully set, the message displays
SUCCESS! If another message is displayed, check your settings with your cluster administrator.
4. Edit the registry:
a. Start regedit.
b. Open the correct node depending on your DSN:
○ For a system-defined DSN: HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBC.INI\<Name of the DSN>
○ For a user-defined DSN: HKEY_CURRENT_USER\SOFTWARE\ODBC\ODBC.INI\<Name of the DSN>
c. Right-click in the panel on the right.
d. Select New String Value .
e. Enter getSchemasWithQuery.
f. Double-click the key getSchemasWithQuery and set its value to 0.
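
 Example

The registry file (.reg) equivalent of these steps, for a hypothetical user-defined DSN named MY_VORA_ODBC_DSN:

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\SOFTWARE\ODBC\ODBC.INI\MY_VORA_ODBC_DSN]
"getSchemasWithQuery"="0"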

3.4.4 Setting the ODBC Behaviour Flag

The ODBC Behaviour flag is required to enable SAP Vora ODBC connectivity with the Simba Spark SQL ODBC driver: it switches the ODBC behavior from Spark SQL to SAP Vora.

1. Edit the SparkConnector\Spark.cfg file.


2. Set the Behaviour property for either a specific Vora DSN name or for all DSNs as shown in the examples
below.

 Note

For OEM installations there is no SparkConnector folder. You need to set the Behaviour flag in the
KxShell.cfg configuration file.

 Example

Spark.cfg set up for a single Vora DSN called MY_VORA_ODBC_DSN

 Sample Code

# SAP Vora ODBC Connectivity


#
# Set the "Behaviour" option for all Vora ODBC DSNs here
# example 1: for a specific DSN called MY_VORA_ODBC_DSN
ODBCStoreSQLMapper.MY_VORA_ODBC_DSN.Behaviour=Vora

Spark.cfg set up to use Vora behavior for all DSNs, which means that every DSN will be treated as a Vora
DSN.

 Sample Code

# SAP Vora ODBC Connectivity


#
# Set the "Behaviour" option for all Vora ODBC DSNs here
# example 2: to use Vora for all DSNs
ODBCStoreSQLMapper.*.Behaviour=Vora

3.5 Oracle

3.5.1 About Oracle Database

Automated Analytics standard installation includes a component named Data Access for Oracle, which is the recommended way to connect to an Oracle database. This component allows connecting to an Oracle DBMS without the need to install any other Oracle software.

Another advantage of this component is that a special bulk mode can be activated with this driver. Using this mode, writes to the Oracle DBMS are accelerated by a factor of 20. Depending on the algorithm that has been used to build the model, such acceleration is mandatory when scoring. Social models are a typical example of algorithms whose scoring cannot be done with the In-database Apply feature. Note that Data Access for Oracle is a regular Automated Analytics feature and is therefore subject to licensing.

However, when Data Access for Oracle cannot be used (due to company IT policy, for example), you can install the Oracle ODBC driver directly. The driver must be set up after the installation so that it suits the application requirements. Once the driver is set up, users can start using the application.

3.5.2 Setting Up Data Access for Oracle

1. Start the ODBC administrator.


2. Click Add.
3. Select SAP PA Automated Analytics <version number>: DataDirect 7.1 SP5 Oracle Wire Protocol.

The first tab of the dialog box describes the actual connectivity to your Oracle DBMS. There are numerous ways to connect to an Oracle DBMS; request assistance from your Oracle database administrator to set up the proper parameters.

However, a very common way to define Oracle connectivity is to fill the dialog as shown in the screenshot
above:
○ Data Source Name: the name of the data source
○ Host: the host name of the Oracle server or its IP address
○ Service Name: the service name of the Oracle DBMS on the Oracle server
4. In the Bulk tab, you can activate bulk mode. Using bulk mode accelerates writes in Oracle.

Use the following settings:

 Caution

Any other tabs in this dialog box must only be changed at the request of support.

3.5.3 Installing Oracle ODBC Driver

Oracle delivers a Windows ODBC driver with the standard Oracle CD. However, this driver needs an installation
of the Oracle client to run. The most appropriate Oracle CD to use is the Oracle client CD, which will install all
the mandatory components (including the ODBC driver).

1. Insert the CD in the computer. The Oracle Universal installer is automatically launched.
2. Click Next.
3. Select the Runtime option (according to the actual Oracle Universal version).
4. Click Next on all the following screens and keep all the default options.

To be able to use Oracle ODBC drivers, it is necessary to configure a Service Name on the database (remote or
local) with the Oracle client software.

The Oracle Net Manager assistant is automatically launched at the end of the installation. If needed, it can be
launched separately.

To configure the Service Name, set the Service Name and the Host Name. This information is provided by your database administrator.

 Note

On the Oracle Net Manager panel, the Service Name to use locally is ora10 and the database runs on
kxsrvmulti3.
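
 Example

A sketch of the tnsnames.ora entry that corresponds to the Service Name configured above. The host and service name come from the note; the listener port 1521 is the Oracle default and is an assumption, so confirm it with your database administrator:

ORA10 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = kxsrvmulti3)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = ora10)
    )
  )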

3.5.4 Setting Up Oracle ODBC Driver

1. Start the ODBC Data Source Administrator.


2. Select Oracle ODBC driver.

3. Enter the required information using the values found in the Oracle Net Manager. (See the procedure
Installing Oracle ODBC Driver).
4. Keep the default values for all the other options.

 Caution

Do not mistake the Oracle ODBC driver for the Microsoft ODBC for Oracle driver, which is not supported by Automated Analytics.

Related Information

Installing Oracle ODBC Driver [page 23]

3.6 Spark SQL

3.6.1 About Spark SQL

SAP Predictive Analytics supports the Apache Spark framework in order to perform large-scale data
processing with the Automated Predictive Server.

 Note

To know the supported version of Spark SQL, refer to the SAP Product Availability Matrix http://
service.sap.com/sap/support/pam .

3.6.2 Setting Up Data Access for Spark SQL

1. Start the ODBC administrator.


2. Click Add.
3. Select SAP PA Automated Analytics <version number>: DataDirect 8.01 Apache Spark.

The first tab of the dialog box describes the actual connectivity to your Spark Server.
○ Data Source Name: the name of the data source
○ Host Name: the host name of the Spark Server or its IP address
○ PortNumber: the port number of the Spark Server
○ Database Name: the database to use in the Spark Server
4. The second tab of the dialog box describes advanced parameters. Make sure to use these settings:

 Caution

Any other tabs in this dialog box must only be changed at the request of support.

3.6.3 Creating Data Manipulations using Date Functions

SAP Predictive Analytics provides UDF extensions to Spark SQL allowing the management of dates. Apache
Spark SQL provides very few date functions natively.

 Note

Installation of UDFs for Spark SQL requires access to the Apache Spark SQL server.

3.6.3.1 Installation in Apache Spark SQL Server

The Apache Spark SQL UDFs for the application are located in:

{SAP Predictive Analytics folder}/resources/KxenHiveUDF.jar

You need to copy this file into the local file system of the Apache Spark SQL server. The jar to deploy is the same one used for Hive.
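
 Example

One possible way to copy the file, assuming SSH access to the Apache Spark SQL server; the user name, host, and target path are placeholders:

scp "{SAP Predictive Analytics folder}/resources/KxenHiveUDF.jar" sparkadmin@sparkserver:/opt/kxen/KxenHiveUDF.jar

The server-side path (/opt/kxen/KxenHiveUDF.jar in this sketch) is the value to use as <server_local_path_to_jar> in the activation step described in the next section.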

3.6.3.2 Activation in SAP Predictive Analytics

In this section, <server_local_path_to_jar> designates the local path to the copied KxenHiveUDF.jar file inside
the Apache Spark SQL server.

On the computer running the application, locate the configuration file (.cfg) corresponding to the SAP Predictive Analytics product you want to use:

● KxCORBA.cfg when using an SAP Predictive Analytics server


● KJWizard.cfg when using an SAP Predictive Analytics workstation

Add these lines to the proper configuration file using the following syntax:

ODBCStoreSQLMapper.*.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.*.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.*.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.*.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.*.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.*.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.*.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"

These lines activate the UDF extensions for all DBMS connections. If you are using DBMS connections other than Spark SQL, you can activate the UDF extensions only for a specific connection by replacing the asterisk (*) with the actual name of the ODBC connection you have defined.

 Example

ODBCStoreSQLMapper.My_DSN.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"

3.6.4 Spark SQL Restrictions

Restrictions for using Spark SQL with SAP Predictive Analytics.

3.6.4.1 Managing Primary Keys

Spark SQL does not publish the primary keys of a table.

The usual workaround is to use a description file in which the primary keys are properly described. This
description can be loaded either by the user or automatically by the application. In this case, for a table XX, the
application automatically reads the description file named KxDesc_XX.

Unfortunately, it may not be easy to push new files into Spark SQL, and due to other limitations of Spark SQL, the application is not able to save these files in Spark SQL. It is still possible to use description files stored in a standard text repository, but all descriptions must be explicitly read.

As a convenience, with Spark SQL, the application uses a new heuristic to guess the primary keys of a table.
Field names are compared to patterns, allowing the detection of names commonly used for primary keys.

The default setup for the application on Spark SQL is to manage the fields listed below as primary keys:

● Starting with ‘KEY’ or ‘ID’. For example: KEY_DPTMT or IDCOMPANY
● Ending with ‘KEY’ or ‘ID’. For example: DPTMKEY or COMPANY_ID

 Note

The list of patterns for primary keys can be tuned by the user.

PrimaryKeyRegExp

This option allows the specification of a list of patterns that will be recognized as primary keys. The syntax follows the convention described in the section About ODBC Fine Tuning [page 156]. The patterns use the regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.2=^KEY
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.3=ID$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.4=^ID

 Note

This set of patterns is the default one for SAP Predictive Analytics and Spark SQL.

NotPrimaryKeyRegExp

Patterns described by PrimaryKeyRegExp may match too many field names. This option is applied after the PrimaryKeyRegExp patterns and allows the explicit description of field names that are not primary keys. These patterns use the regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.NotPrimaryKeyRegExp.1=CRICKEY$

All field names ending with KEY will be managed as primary keys, except field names ending with CRICKEY.

3.7 Hive

3.7.1 About Hive Database

Automated Analytics provides a DataDirect driver for Hive. This component lets you connect to a Hive server
without the need to install any other software. Hive technology allows you to use SQL statements on top of a
Hadoop system. This connectivity component supports:

● Hive versions. To know which versions are supported, refer to SAP Product Availability Matrix http://
service.sap.com/sap/support/pam
● Hive server 1 and Hive server 2: the server can be set up in two modes. Both modes are transparently managed; however, Hive server 2 is preferred as it provides more authentication and multi-connection features.

The driver must be set up after the installation so that it suits the application requirements. Once the driver is set up, users can start using the application. Native Spark Modeling optimizes the modeling process on Hive data sources. For more information, refer to the section Native Spark Modeling [page 60].

3.7.2 Setting Up Data Access for Hive

1. Start the ODBC administrator.


2. Click Add.
3. Select SAP PA Automated Analytics <version number>: DataDirect 7.1 SP5 Apache Hive Wire Protocol.

The first tab of the dialog box describes the actual connectivity to your Hive Server.
○ Data Source Name: the name of the data source
○ Host Name: the host name of the Hive Server or its IP address
○ PortNumber: the port number of the Hive Server
○ Database Name: the database to use in the Hive Server
4. The second tab of the dialog box describes advanced parameters. Make sure to use these settings:

 Caution

Any other tabs in this dialog box must only be changed at the request of support.

3.7.3 Creating Data Manipulations using Date Functions

SAP Predictive Analytics provides UDF extensions to Hive allowing the management of dates. Apache Hive
provides very few date functions natively.

 Note

Installation of UDFs for Hive requires access to the Apache Hive server.

3.7.3.1 Installation in Apache Hive Server

The Apache Hive UDFs for the application are located in:

{SAP Predictive Analytics folder}/resources/KxenHiveUDF.jar

You need to copy this file into the local file system of the Apache Hive server.

3.7.3.2 Activation in SAP Predictive Analytics

In this section, server_local_path_to_jar designates the local path to the copied KxenHiveUDF.jar file
inside the Apache Hive server.

On the computer running the application, locate the configuration file corresponding to the SAP Predictive
Analytics product you want to use:

● KxCORBA.cfg when using an SAP Predictive Analytics server


● KJWizard.cfg when using an SAP Predictive Analytics workstation

Add these lines to the proper configuration file:

ODBCStoreSQLMapper.*.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.*.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.*.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.*.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.*.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.*.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.*.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"

These lines activate the UDF extensions for all DBMS connections. If you use DBMS connections other than
Hive, you can activate the UDF extensions for a specific connection only by replacing the star (*) with the
actual name of the ODBC connection you have defined.

 Example

ODBCStoreSQLMapper.MyBigHive.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"
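
Once these options are in place, the UDFs can be used in SQL like any other Hive function, for example in data manipulations. The following statement is a minimal sketch only: the table SALES and the column ORDER_DATE are hypothetical names, and it assumes that each KxenUDF_add_* function takes a date value and an integer offset.

 Example

-- hypothetical table and column names; KxenUDF_add_day is assumed to take (date, number of days)
SELECT ORDER_DATE, KxenUDF_add_day(ORDER_DATE, 30) AS DUE_DATE
FROM SALES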

3.7.4 Hive Restrictions

Hive technology is built on top of Hadoop technology and allows access to the Hadoop world with classic SQL
statements. Hive’s goal is to provide a standard SQL DBMS on top of a Hadoop system. However, Hive is not yet
a full SQL DBMS and has some restrictions.

3.7.4.1 Using the Code Generator

The code generator is compatible with Apache Hive, however, due to some restrictions of the current ODBC
driver provided by DataDirect, the SQL code generated by the code generator cannot be executed when the
model contains a question mark (?) as a significant category.

3.7.4.2 Aggregates

Since COUNT DISTINCT is not supported in analytical functions by Hadoop Hive, the application does not
support the COUNT DISTINCT aggregate.

Note that aggregates using the subquery syntax are not supported by Hive either.
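
For example, an analytical (windowed) COUNT DISTINCT such as the following cannot be used on Hive; the table and column names are hypothetical:

 Example

-- not supported on Hive: DISTINCT inside an analytical function
SELECT CUSTOMER_ID,
COUNT(DISTINCT PRODUCT_ID) OVER (PARTITION BY CUSTOMER_ID) AS NB_PRODUCTS
FROM SALES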

3.7.4.3 Time-stamped Populations

Hive does not support functions (Union, Except, Intersection, Cross Product) that are used by the
application for building Compound and Cross Product time-stamped populations. As a result, these two types
of time-stamped populations cannot be used with Hive. Filtered time-stamped populations, however, can be
used with Hive.

3.7.4.4 Inserting Data in Hive

With Hive, it is possible to create new tables and insert new records using complex statements but there is no
way to push a single record using usual INSERT INTO <table> VALUES(…).

For example, the following statements work:

CREATE TABLE TODEL(ID INT)
INSERT INTO TABLE TODEL SELECT ID FROM ADULTID

Whereas the following statement does not work:

INSERT INTO TABLE TODEL VALUES(1)

The in-database application feature and SAP Predictive Analytics code generator are not impacted by this
limitation but several other features of the application are blocked:

● Scoring with models not compatible with the in-database application feature
● Saving models, data manipulations, variable pool, and descriptions in Hive DBMS
● Transferring data with the Data Toolkit
● Generating distinct values in a dataset

3.7.4.5 Managing Primary Keys

Hive does not publish the primary keys of a table. The usual workaround is to use a description file in
which the primary keys are properly described. This description can be loaded either by the user or
automatically by the application. In the latter case, for a table XX, the application automatically reads the
description file named KxDesc_XX.

Unfortunately, it may not be easy to push new files into Hive and, due to other limitations of Hive, the application
is not able to save these files in Hive. It is still possible to use description files stored in a standard text
repository, but all descriptions must then be explicitly read.

As a convenience, with Hive, the application uses a new heuristic to guess the primary keys of a table. Field
names are compared to patterns, allowing the detection of names commonly used for primary keys.

The default setup for the application on Hive is to manage the fields listed below as primary keys:

● Fields starting with ‘KEY’ or ‘ID’. For example: KEY_DPTMT or IDCOMPANY
● Fields ending with ‘KEY’ or ‘ID’. For example: DPTMKEY or COMPANY_ID

 Note

The list of patterns for primary keys can be tuned by the user.

PrimaryKeyRegExp

This option allows the specification of a list of patterns that will be recognized as primary keys. The syntax
follows the convention described in the section About ODBC Fine Tuning [page 156]. The patterns use the
regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.2=^KEY
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.3=ID$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.4=^ID

 Note

This set of patterns is the default one for SAP Predictive Analytics and Hive.

NotPrimaryKeyRegExp

Patterns described by PrimaryKeyRegExp may match too many field names. This option is applied after the
PrimaryKeyRegExp pattern matching and allows the explicit description of field names that are not primary keys.
These patterns use the regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.NotPrimaryKeyRegExp.1=CRICKEY$

All field names ending with KEY are managed as primary keys, except field names ending with CRICKEY.

3.7.4.6 Tuning

You can tune the Hive connection by disabling the KxDesc mechanism or by modifying the heap size of the
JVM or of the Apache Hadoop client application.


SupportKxDesc

Even though pushing description files into Hive may not be easy, the automatic KxDesc_XXX lookup is a
convenient feature, which is why the application still provides it on Hive. However, accessing Hive’s metadata
can be heavy and may needlessly slow the processes down.

The SupportKxDesc option allows you to deactivate the KxDesc mechanism and thus possibly speed up
usage with Hive.

 Example

ODBCStoreSQLMapper.*.SupportKxDesc=false

JVM Heap Size

When using Hive server 2 with a wide dataset, make sure you increase the heap size of the Java
Virtual Machine (JVM) for the Hive Metastore service.
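
How the Metastore heap size is set depends on your distribution: a plain Apache Hive installation typically controls it through hive-env.sh, while Ambari and Cloudera Manager expose an equivalent setting in their web interfaces. The following is a minimal sketch only, assuming hive-env.sh is used and that 2048 MB is an acceptable value for your cluster:

 Example

# in hive-env.sh on the Hive Metastore host (value in MB, hypothetical sizing)
export HADOOP_HEAPSIZE=2048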

Apache Hadoop Client Application Heap Size

The out-of-the-box Apache Hadoop installation sets the heap size for the client applications to 512 MB. Leaving
the memory size at its default value can cause out-of-memory errors when accessing large tables. To avoid this,
increase the heap size to 2 GB by changing the ‘HADOOP_CLIENT_OPTS’ variable setting within
the /usr/local/hadoop/hadoop-env.sh script.
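
A minimal sketch of the change in hadoop-env.sh, using the 2 GB value recommended above (adjust the path and value to your environment):

 Example

# in /usr/local/hadoop/hadoop-env.sh
export HADOOP_CLIENT_OPTS="-Xmx2g $HADOOP_CLIENT_OPTS"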

3.8 Teradata

3.8.1 About Teradata Database

Automated Analytics standard installation includes a component named Data Access for Teradata. This
component allows connecting to Teradata using only a minimal and very common set of Teradata packages. It
is the recommended way to connect to a Teradata database. Note that Data Access for Teradata is a regular
Automated Analytics feature and is therefore subject to licensing.

However, when you need to use FastWrite or when Data Access for Teradata cannot be implemented (due to the
company IT policy, for example), you can skip directly to the installation of the Teradata ODBC driver.

3.8.2 Installing the Prerequisite Teradata Client Packages

Before using Data Access for Teradata, you need to install the following standard client Teradata packages:

● Tdicu
● TeraGSS
● Cliv2

1. Insert the CD Teradata Tools and Utilities - Teradata utility Pack - Windows.
The installation starts automatically.

2. Click Install Product.


3. Choose Custom as the setup type.

4. Check CLIv2.

 Note

The components Shared ICU Libraries for Teradata and Teradata GSS Client are automatically selected.
These are the only Teradata packages that need to be installed in order to establish an ODBC
connection to Teradata.

5. Click Next.
6. Keep the default values for all the other options.
7. Click Finish.

3.8.3 Setting Up Data Access for Teradata

1. Start the ODBC administrator.


2. Select SAP PA Automated Analytics <version number>: DataDirect 7.1 SP5 Teradata.

The Teradata Driver Setup dialog box appears.
3. Fill in the following fields: Name, DBCName or Alias, Default Database and Session Character Set.
For example, to set up an ODBC connection named KxTera13_dd that connects to the kxen database of a
Teradata DBMS running on a server with the IP address 10.1.1.102, you need to enter the following
information:
○ Name: KxTera13_dd
○ DBCName or Alias: 10.1.1.102
○ Default Database: kxen
○ Session Character Set: ASCII
4. Click the Option tab.
5. Select Show Selectable Tables and Enable LOBs.
6. Click the Advanced tab.
7. Fill in the fields as detailed in the following table.

Option Value

Maximum Response Buffer Size 64000

Port Number 1025

Login Timeout 20

ProcedureWithPrintStmt N

ProcedureWithSPLSource Y

3.8.4 Installing Teradata ODBC Driver

1. Insert the CD Teradata Tools and Utilities - Teradata utility Pack - Windows.
The installation starts automatically.

2. Click Install Product.


3. Choose Custom as the setup type.

4. Click Next.
5. Check ODBC Driver for Teradata.

 Note

The components Shared ICU Libraries for Teradata and Teradata GSS Client are automatically selected.
These are the only Teradata packages that need to be installed in order to establish an ODBC
connection to Teradata.

6. Click Next.
7. Keep the default values for all the other options.
8. Click Finish.

3.8.5 Setting Up Teradata ODBC Driver

1. Start the ODBC Administrator.


2. Select Teradata Driver.
3. Fill in the following fields: Name, Name(s) or IP address(es), Default Database.
Ask your database administrator to provide you with the correct values for each option.

For example, an ODBC connection named KxTera13_dd connects to the kxen database of a Teradata DBMS
running on a server with the IP address 10.1.1.102.

4. Click Options.
5. In the Teradata ODBC Driver Options window, select the following options.

○ Use Column Names


○ Use X Views
○ Display Kanji Conversion Errors
6. Set the following options to the values shown here:

Option Value

Session Mode System Default


DateTime Format AAA
Return Generated Keys No

7. Click Advanced.
8. In Maximum Response Buffer Size, enter 64000.
9. Keep the default values for the other options.
10. Validate all screens.

3.9 Microsoft SQL Server

3.9.1 Setting Up Microsoft ODBC Driver for SQL Server 2005

On Windows, the SQL Server ODBC driver is installed by default.

1. Open the ODBC Data Source Administrator.


2. Click the User DSN or System DSN tab.
3. Click Add.
4. In the Create New Data Source panel, select the SQL Server driver.
5. Click Finish.
6. In the Create a New Data Source to SQL Server panel, select the authentication method recommended by
your database administrator.
7. Check Connect to SQL Server to obtain default settings for the additional configuration options.
8. In Login ID and Password, enter the connection information to the data source.

9. Click Next.
10. Select the default database.

11. Click Next.


12. Click Finish.

A screen listing all the activated options is displayed and allows testing the connection.

3.9.2 About Microsoft ODBC Driver for SQL Server 2008

Available Microsoft SQL Server ODBC drivers are:

● SQL Server
● SQL Server Native Client

The support of the new SQL Server 2008 drivers requires SQL Server Native Client 10.x (or later) for operations
with Automated Analytics. Therefore, Automated Analytics application components check the availability of
this driver before connecting to SQL Server 2008 databases. Without the required SQL Server Native Client
driver, the connection fails.

You can check the current version of the driver by opening the ODBC Data Source Administrator.

Windows Version SQL Server Native Client Version

Microsoft Windows 2008 64-bit 2009.100.1600.01

If you have selected the correct driver type to connect to SQL Server 2008 databases, namely SQL Server
Native Client driver and not SQL Server driver, the following logo should be displayed on the Create a New Data
Source to SQL Server panel:

3.9.3 Installing Microsoft ODBC Driver for SQL Server 2008

1. Download the installation program for SQL Server Native Client for Microsoft 2008 Server 64-bit edition:

http://go.microsoft.com/fwlink/?LinkID=188401&clcid=0x409

 Caution

The version of the driver must be 10.0 or later.

You can also perform a search for Microsoft® SQL Server® 2008 R2 Native Client on the Microsoft
Download Center:

http://microsoft.com/downloads/en .
2. Install the driver.

3.9.4 Setting Up Microsoft ODBC Driver for SQL Server 2008

1. Open the ODBC Data Source Administrator.


2. Click Add....
3. Select SQL Server Native Client driver.

4. Click Finish.
5. You are prompted to configure the parameters for your new ODBC connection: enter the information
provided by your database administrator.

6. Click Next >.

7. Check Connect to SQL Server....
8. In Login ID and Password, enter the connection information provided by your database administrator.

 Note

As an alternative, depending on your administrative management, you can use the Integrated Windows
authentication to connect to your database.

9. Click Next.

10. Check Change the default database to: and select the correct database in the list.
11. Click Next.

12. Keep all default parameters.


13. Click Finish.
A summary screen appears:

14. Click Test Data Source....
If the connection has been set up correctly, the following screen appears:

3.9.5 About Microsoft ODBC Driver for SQL Server 2012

Available Microsoft SQL Server ODBC drivers are:

● SQL Server
● SQL Server Native Client

The support of the new SQL Server 2012 drivers requires SQL Server Native Client 11.x (or later) for operations
with Automated Analytics. Therefore, Automated Analytics application components check the availability of this
driver before connecting to SQL Server 2012 databases. Without the required SQL Server Native Client driver,
the connection fails.

You can check the current version of the driver in the ODBC Data Source Administrator.

Windows Version SQL Server Native Client Version

Microsoft Windows 2012 64-bit 2011.110.2100.60

If you have selected the correct driver type to connect to SQL Server 2012 databases, namely SQL Server
Native Client driver and not SQL Server driver, the following logo should be displayed on the Create a New Data
Source to SQL Server panel:

3.9.6 Installing Microsoft ODBC Driver for SQL Server 2012

1. Download the installation program for SQL Server Native Client:


http://www.microsoft.com/en-us/download/details.aspx?id=36434

 Caution

The version of the driver must be 11.0 or later.

You can also perform a search for Microsoft® SQL Server® 2012 Native Client on the Microsoft Download
Center:http://microsoft.com/downloads/en .
2. Install the driver.

3.9.7 Setting Up Microsoft ODBC Driver for SQL Server 2012

1. Open the ODBC Data Source Administrator.


2. Click Add....
3. Select SQL Server Native Client 11.0 driver.
4. Click Finish.
5. You are prompted to configure the parameters for your new ODBC connection: enter the information
provided by your database administrator.
6. Click Next >
7. Check Connect to SQL Server....
8. In Login ID and Password, enter the connection information provided by your database administrator.

 Note

As an alternative, depending on your administrative management, you can use the Integrated Windows
authentication to connect to your database.

9. Click Next >.


10. Check Change the default database to: and select the correct database in the list.
11. Click Next >.
12. Keep all default parameters.

13. Click Finish.
14. Click Test Data Source.

3.9.8 About Microsoft ODBC Driver for SQL Server 2014

There is no specific ODBC driver associated with SQL Server 2014. The most recent driver provided by Microsoft
is the SQL Server 2012 Native Client, which is fully compatible with SQL Server 2014.

3.10 Netezza

3.10.1 Configuring Netezza ODBC Driver

1. Click Start > Settings > Control Panel > Administrative Tools > Data Sources (ODBC) to access the
configuration dialog box.
2. In the DSN Options tab, fill in the fields with the correct Server name, Port number and Database name:

3. In the Advanced DSN Options tab, fill in the fields as follows:

 Caution

Check the following items.


○ The Date Format option is set to YMD.
○ The Enable Fast Select option is NOT selected.

4. In the SSL DSN Options tab, check Preferred Unsecured.

5. In the Driver Options tab, fill in the field as follows:

3.11 IBM DB2 V9

3.11.1 Setting Up DB2 ODBC Driver

1. In order to use the DB2 ODBC Driver, specify a data source name and a Database alias:

2. In the Advanced Settings panel, apply the workaround 1024:

 Note

While installing the DB2 V9 driver on Windows 8.1 and Windows Server 2012 R2, you might get an error in the
DB2 Setup wizard (GUI) stating that the Windows installer has stopped working.

To avoid this issue, perform a DB2 silent installation using response files (.rsp). You can find an
example of a DB2 .rsp file in your DB2 driver archive at the following location:
<path_to_your_db2_driver_archive>\image\db2\Windows\samples\db2client.rsp.

For more details on performing a silent installation, visit the IBM Knowledge Center website at
https://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.qb.server.doc/doc/
c0007502.html .

3.12 Sybase IQ 15.4

3.12.1 Installing Sybase IQ ODBC Driver

1. Launch setup.exe located in Sybase package Sybase IQ 15.4 Network Client Windows x64
64-bit.zip.
2. Choose the folder where you want to install the driver.

3. In the Choose Install Set panel, select Custom.

4. Select only Sybase IQ ODBC Driver.

5. Review and proceed to install.

3.12.2 Setting Up Sybase IQ ODBC Driver

1. Click Start > Control Panel > Administrative Tools > Data Sources (ODBC) to open the ODBC Data
Source Administrator.
2. Select the Sybase IQ driver (here it is Adaptive Server IQ).
3. Click Configure.
4. In the ODBC tab, enter the name of the data source in the Data source name field.

5. Keep all the default options.
6. In the Login tab, fill in the fields with the information provided by your database administrator.

7. Click OK.

3.13 PostgreSQL

3.13.1 Downloading PostgreSQL ODBC Driver

Download the PostgreSQL ODBC drivers from the main web site (http://www.postgresql.org/ ) of the
PostgreSQL Global Development Group, the main PostgreSQL community.

ODBC driver for Windows 64-bit: http://www.master.postgresql.org/download/mirrors-ftp/odbc/versions/msi/psqlodbc_09_00_0300-x64.zip

3.13.2 Installing PostgreSQL ODBC Driver

Double-click the installer file (.msi) located in the archive.

3.13.3 Setting Up PostgreSQL ODBC Driver

1. Click Start > Control Panel > Administrative Tools > Data Sources (ODBC) to open the ODBC Data
Source Administrator.
2. Select PostgreSQL ANSI.

3. Fill in the Database, Server and Port fields as indicated by your database administrator.

4. Click Datasource to access additional parameters.

 Caution

For performance reasons, it is important to enable "Use Declare/Fetch".

5. Click Page 2 and check that only the following options are selected:

○ LF <->CR/LF Conversion
○ Server side prepare
○ Int8 As default
○ Protocol 7.4+

6. Click OK.
7. Click Test to check that the connection is functional.

 Note

To define a PostgreSQL ODBC connection suitable for Automated Analytics in Unicode mode, you have
to choose the PostgreSQL Unicode driver when creating the connection.

3.14 Vertica

3.14.1 About Vertica ODBC Driver Installation

Vertica ODBC Driver is part of the standard tools delivered by your Vertica reseller.

3.14.2 Setting Up Vertica ODBC Driver

1. Click Start > Settings > Control Panel > Administrative Tools > Data Sources (ODBC) to access the
configuration dialog box.
2. Select Vertica ODBC Driver 4.1.

3. Fill in the Database, Server and Port fields as indicated by your database administrator.

4. Click Test Connection to check if the connection is functional.


5. Click Save.

3.15 In-database Modeling

3.15.1 Native Spark Modeling

You use the Native Spark Modeling feature of SAP Predictive Analytics Automated Analytics to delegate the
data-intensive modeling steps to Apache Spark.

 Note

Prerequisite: You have installed the recommended version of Hive or SAP Vora. Refer to the sections about
Hive or SAP Vora in the related information below.

Native Spark Modeling uses a native Spark application developed in the Scala programming language. This
native Spark Scala approach ensures that the application can leverage the full benefits offered by Spark
including parallel execution, in-memory processing, data caching, resilience, and integration into the Hadoop
landscape.

Native Spark Modeling improves:

● Data transfer: as the data intensive processing is done close to the data, this reduces the data transfer
back to the SAP Predictive Analytics server or desktop.
● Modeling performance: the distributed computing power of Apache Spark improves the performance.
● Scalability: the training process scales with the size of the Hadoop cluster, enabling more model
training to be completed in the same time and with bigger or wider datasets. It is optimized for more
columns in the training dataset.
● Transparency: the delegation of the modeling process to Apache Spark is transparent to the end user and
uses the same automated modeling process and familiar interfaces as before.

 Note

A more general term, "In-database Modeling", refers to a similar approach that delegates the data
processing steps to a database. The term "In-database Modeling" is sometimes used in the Automated
Analytics configuration and messages to refer to this broader approach.

Related Information

About Hive Database [page 29]


Configure the YARN Port [page 76]
Setting Up Data Access for Hive [page 29]

3.15.1.1 Installation of Native Spark Modeling

Native Spark Modeling is built on the Hadoop ecosystem, an open-source project providing a collection of
components that support distributed processing of large datasets across a cluster of machines. Hadoop allows
both structured as well as complex, unstructured data to be stored, accessed and analyzed across the cluster.

Native Spark Modeling requires the following main components of Hadoop:

● Spark: the in-memory data processing framework
● HDFS: the resilient Hadoop distributed file system
● YARN: the YARN resource manager helps control and monitor the execution of Spark applications on the cluster

To specify the training dataset and to store temporary results from Spark jobs, you also need a Hadoop SQL
engine, either Hive or SAP HANA Vora, and the appropriate ODBC driver.

SQL Engine

● Hive: a data warehouse based on data distributed in the Hadoop Distributed File System (HDFS). On HortonWorks distributions, Hive can run on MapReduce or Tez. For more information, refer to the link Setting Up Data Access for Hive in the related information below.
● SAP HANA Vora: a high performance SQL on Hadoop engine that boosts the execution performance of Spark. For more information, refer to the link Setting Up Data Access for SAP HANA Vora in the related information below.

 Caution

Native Spark Modeling does not support Spark SQL for a number of reasons including platform and
performance limitations.

As a prerequisite, you need to download the artefacts required to run Spark as they are not packaged with the
SAP Predictive Analytics installer.

Related Information

● Ambari: an open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters. Used in the HortonWorks HDP distribution. More information: Ambari, HortonWorks HDP
● Apache Hive: a data warehouse infrastructure supporting data summarization, query, and analysis. More information: Apache Hive
● Apache Spark: a cluster computing data processing framework. More information: Apache Spark
● Cloudera Manager: Cloudera's automated cluster management tool used in CDH. More information: Cloudera
● HDFS: the Hadoop Distributed File System. More information: HDFS Users Guide
● SAP HANA Vora: a high performance in-memory SQL on Hadoop framework. More information: SAP HANA Vora on the SAP Help Portal, Release note
● Spark SQL: a module for structured and semi-structured data processing. More information: Spark SQL and DataFrame Guide
● YARN: Hadoop resource manager and job scheduler. More information: Apache Hadoop YARN, Running Spark on YARN

Connection Pairing

Native Spark Modeling requires two connections:

● An ODBC connection: For specifying the training dataset and to retrieve temporary results of modeling
steps written by the native Spark application. Uses an ODBC DSN to connect to a Hive or SAP HANA Vora
thriftserver. Note: the full training dataset always remains on the cluster.
● A Spark on YARN connection: For executing Spark jobs. A SparkContext is created (so this becomes the
Driver Application in Spark terminology) which then communicates with the YARN Resource Manager
component on the cluster. The YARN Resource Manager assigns cluster resources. The Driver Application
communicates with Spark executors on each node to distribute the work.

 Note

To enable Native Spark Modeling, the ODBC DSN connection must be paired with the Spark on YARN
connection by adding Spark configuration properties that include the DSN name.
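
A minimal sketch of such a pairing in the SparkConnections.ini file, assuming a Hive ODBC DSN named MY_HIVE_DSN (a hypothetical name; the full set of properties is described in the section Configuring the Spark Connection):

 Example

SparkConnection.MY_HIVE_DSN.hadoopConfigDir="../../../SparkConnector/hadoopConfig/MY_HIVE_DSN"
SparkConnection.MY_HIVE_DSN.hadoopUserName="hive"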

3.15.1.1.1 Hadoop Platform

The installation process differs depending on the operating system and choice of Hive or SAP Vora as a data
source. Refer to the SAP Product Availability Matrix http://service.sap.com/sap/support/pam .

 Caution

If you are an existing customer, open a ticket directly via your profile with Big Data Services support to
request the steps to install Native Spark Modeling on SAP Cloud Platform. If you are a new customer,
please contact Big Data Services at https://www.altiscale.com/contact-us/

 Note

You cannot run Native Spark Modeling against multiple versions of Spark at the same time on the same
installation of Automated Analytics. As a workaround you can change the configuration to switch between
versions.

3.15.1.1.2 Recommendation for Automated Analytics Server Location

Native Spark Modeling shares the processing between the Automated Analytics Server or Workstation and the
Hadoop cluster in an interactive Spark session - this is called YARN client mode. In this mode, the driver
application containing the SparkContext coordinates the remote executors that run the tasks assigned by it.
The recommended approach for performance and scalability is to co-locate the driver application within the
cluster.

 Note

For performance and scalability install the Automated Analytics Server on a jumpbox, edge node or
gateway machine co-located with the worker nodes on the cluster.

A Hadoop cluster involves a very large number of similar computers that can be considered as four types of
machines:

● Cluster provisioning system with Ambari (for HortonWorks) or Cloudera Manager installed.
● Master cluster nodes that contain systems such as HDFS NameNodes and central cluster management
tools (such as the YARN resource manager and ZooKeeper servers).
● Worker nodes that do the actual computing and contain HDFS data.
● Jump boxes, edge nodes, or gateway machines that contain only client components. These machines allow
users to start their jobs from the cluster.

We recommend that you install Automated Analytics server on a jump box to get the following benefits:

● Reduction in latency:
The recommendation for Spark applications using YARN client mode is to co-locate the client (in this case
the Automated Analytics server) with the worker machines.
● Consistent configuration:
A jump box contains a Spark client and Hive client installations managed by the cluster manager web
interface (Ambari or Cloudera Manager). As Native Spark Modeling uses YARN and Hive, it requires three

XML configuration files (yarn-site.xml, core-site.xml and hive-site.xml). A symbolic link can be
used to the client XML files so they remain synchronized with any configuration changes made in cluster
manager web interface.

Recommended Setup

This setup uses an Automated Analytics server co-located with the worker nodes in the cluster.

Limited Setup with Workstation Installation

 Note

This setup should only be used in a non-production environment.

Related Information

● How Spark runs on clusters


● Spark submit client mode recommendations

3.15.1.1.3 Install the Apache Spark Jar Files


Download the Apache Spark "pre-built for Hadoop 2.6 and later" version that is relevant for your enterprise
Hadoop platform.

The Spark lib directory is located in the compressed Spark binary file as shown in the last column of the table
below:

Download the relevant cluster file:

● Cluster Type: Apache Hive (Spark version 1.6.1)
● Direct Download URL: http://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
● Folder to Extract: spark-1.6.1-bin-hadoop2.6\lib

Use a file compression/uncompression utility to extract the lib folder contents from the Spark binary
download to the SparkConnector/jars folder.

 Note

Only the lib folder needs to be extracted. The spark-examples jar can be removed.

Directory structure before:

SparkConnector/jars/DROP_HERE_SPARK_ASSEMBLY.txt
SparkConnector/jars/idbm-spark-1_6.jar

Example directory structure after extracting the Spark 1.6.1 libs folder content:

SparkConnector/jars/DROP_HERE_SPARK_ASSEMBLY.txt
SparkConnector/jars/idbm-spark-1_6.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/spark-1.6.1-yarn-shuffle.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-
hadoop2.6.0.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-
hadoop2.6.0.jar -> this can be removed

3.15.1.1.4 Upload Spark Assembly Jar

The relatively large Spark assembly jar is uploaded to HDFS each time a Spark job is submitted. To avoid this,
you can manually upload the Spark assembly jar to HDFS. Refer to the sections Edit the SparkConnections.ini
File [page 74] and Configure the YARN Port [page 76] to know how to specify the location of the assembly jar
in the SparkConnections.ini configuration file.

1. Logon to a cluster node.


2. Put the jar on HDFS.

 Example

To put the Spark 1.6.1 assembly jar file into the /jars directory on HDFS when the Spark assembly jar is
in the local /tmp directory, use the following:

hdfs dfs -mkdir /jars
hdfs dfs -copyFromLocal /tmp/spark-assembly-1.6.1-hadoop2.6.0.jar /jars

3.15.1.1.5 Winutils.exe

Apache Spark requires the executable file winutils.exe to function correctly on the Windows Operating System
when running against a non-Windows cluster.

1. Download winutils.exe http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe .


2. Move/copy winutils.exe to the SparkConnector\bin directory under the installation directory (same
location as the "DROP_HERE_WINUTILS.txt" file).
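
For example, assuming the application is installed under C:\Program Files\SAP Predictive Analytics (a hypothetical path) and winutils.exe was downloaded to C:\Downloads, the copy can be done from a command prompt:

 Example

copy C:\Downloads\winutils.exe "C:\Program Files\SAP Predictive Analytics\SparkConnector\bin"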

3.15.1.2 Connection Setup

Native Spark Modeling is enabled by default for Hive and SAP HANA Vora connections.

It requires specific Spark configuration entries and the Hadoop client configuration files to enable the YARN
connection to the cluster.

Hadoop Client Configuration Files

● hive-site.xml: required for the minimum Hadoop client configuration. Can be left empty for SAP HANA Vora.
● core-site.xml: required for the minimum Hadoop client configuration.
● yarn-site.xml: required for the minimum Hadoop client configuration.
● hdfs-site.xml: required for High Availability clusters.

For simplicity, if you run the installation on the cluster you can create symbolic links to the Hadoop client
configuration files in the /etc/hadoop/conf directory.

If you run the installation outside the cluster, download the client configuration files from your cluster manager
(Ambari or Cloudera Manager) or copy the files from the cluster.
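
A minimal sketch of the symbolic link approach on a Linux jump box, assuming the ODBC connection is named MY_HIVE_DSN, the installation root is /opt/SAPPredictiveAnalytics (both hypothetical names), and the connection-specific directory described in the section Create a Directory for the Client Configuration Files already exists:

 Example

cd /opt/SAPPredictiveAnalytics/SparkConnector/HadoopConfig/MY_HIVE_DSN
ln -s /etc/hadoop/conf/core-site.xml core-site.xml
ln -s /etc/hadoop/conf/yarn-site.xml yarn-site.xml
# hive-site.xml is often located under /etc/hive/conf rather than /etc/hadoop/conf
ln -s /etc/hive/conf/hive-site.xml hive-site.xml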

3.15.1.2.1 Download the HortonWorks Client Configuration Files from Ambari

You want to download the Hadoop client configuration files to enable the YARN connection to a HortonWorks
cluster.

You have access to the Ambari web User Interface.

Download the client configuration files (hive-site.xml, core-site.xml, and yarn-site.xml at a


minimum) from the HortonWorks Ambari web User Interface:

1. Log on to the Ambari web User Interface.


2. Download the YARN client configuration files (including the core-site.xml and yarn-site.xml files)
a. Go to the menu Services and click on Yarn service.
b. On the right side, click on Service actions.
c. From the dropdown list, select Download Client Configs.

The downloaded file contains the core-site.xml and yarn-site.xml files.


d. In the core-site.xml file, remove or comment out the net.topology.script.file.name
property.

 Sample Code

Example of the commented property

<!--<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>-->

3. Download the Hive client configuration files including the hive-site.xml file:
a. Go to the Services menu and click on Hive service.
b. On right side, click on Service actions.
c. From the dropdown list, select Hive Client Download Client Configs.
4. In the Hive-site.xml file, remove all properties except for the hive.metastore.uris property.

The Hive-site.xml file must contain only the following code.

 Sample Code

<?xml version="1.0" encoding="UTF-8"?>


<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<server name></value>
</property>
</configuration>

3.15.1.2.2 Download the Cloudera Client Configuration Files from Cloudera Manager

You want to download the Hadoop client configuration files to enable the YARN connection to a Cloudera
cluster.

Download the following client configuration files from the Cloudera Manager web User Interface:

Configuration File Client

hive-site.xml Hive

core-site.xml YARN

yarn-site.xml YARN

1. Log on to the Cloudera Manager web User Interface.


2. Download the YARN client configuration files.
a. From the Home menu, select the Yarn service.
b. On the right side, click Actions.
c. From the dropdown list, select Download Client Configuration.

The downloaded file contains the core-site.xml and the yarn-site.xml.

d. In the core-site.xml file, remove or comment out the net.topology.script.file.name
property.

 Sample Code

Example of the commented property

<!--<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>-->

3. Download the Hive client configuration files.


a. From the Home menu, select the Hive service.
b. On the right side, click Actions.
c. From the dropdown list, select Download Client Configuration.
d. In the Hive-site.xml file, remove all properties except for the hive.metastore.uris property.

The Hive-site.xml file must contain only the following code.

 Sample Code

<?xml version="1.0" encoding="UTF-8"?>


<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<server name></value>
</property>
</configuration>

3.15.1.2.3 Create a Directory for the Client Configuration Files

You have downloaded the Hadoop client configuration XML files.

Create a directory to store the files:

1. Change directory to SparkConnector/HadoopConfig directory from the installation root directory.


2. Create a directory with the name of the ODBC connection.
3. Copy the Hadoop client configuration files to the new directory.
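
A minimal sketch on a Linux host, assuming the installation root is /opt/SAPPredictiveAnalytics, the ODBC connection is named MY_HIVE_DSN, and the client configuration files were downloaded to /tmp/clientconfig (all hypothetical names):

 Example

cd /opt/SAPPredictiveAnalytics/SparkConnector/HadoopConfig
mkdir MY_HIVE_DSN
cp /tmp/clientconfig/*.xml MY_HIVE_DSN/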

3.15.1.3 Configuring the Spark Connection

You configure Spark on YARN connection using two property files located under the SparkConnector directory:

1. Spark.cfg - this file contains the common properties that apply to all Spark connections.
2. SparkConnections.ini - this file contains configuration properties for Spark connections linked to specific
Hive ODBC datasource names.

Spark.cfg properties

● SparkAssemblyLibFolder: Mandatory. Location of the downloaded Spark lib folder. The file is configured by default for Spark 1.6.1. If you need to change the location, see the section Edit the Spark.cfg File in the related information below. Default: "../SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib"
● SparkConnectionsFile: Mandatory. Location of the connections file. For more information, see the section Edit the SparkConnections.ini File in the related information below. Default: "../SparkConnector/SparkConnections.ini"
● JVMPath: Mandatory. Location of the Java Virtual Machine used by the Spark connection. This is OS dependent. Expected values: "../jre/lib/{OS}/server" (default) or "../../j2re/bin/server"
● IDBMJarFile: Mandatory. Location of the Native Spark Modeling application jar. If you need to change the location, see the section Edit the Spark.cfg File in the related information below. Default: "../SparkConnector/jars/idbm-spark-1_6.jar"
● HadoopHome: Mandatory. Internal Hadoop home containing a bin subdirectory. Mainly used for Windows (for winutils.exe). Default: "../SparkConnector/"
● HadoopConfigDir: Mandatory. Location where the Hadoop client configuration XML files are copied during modeling. Default: %TEMP%
● log4jOutputFolder: Location of the Native Spark Modeling log output. Default: "%TEMP%"
● log4jConfigurationFile: Location of the log4j logging configuration property file. Change this to control the level of logging or the log recycling behavior. Default: "SparkConnector/log4j.properties"
● JVMPermGenSize: The permanent generation size of the Java Virtual Machine used by the Spark connection. Recommended value: 256 (for 256 MB)
● DriverMemory: Increase the driver memory if you encounter out-of-memory errors. Recommended value: 8192 (for 8192 MB)
● CreateHadoopConfigSubFolder: If the value is "true", a subfolder containing the Hadoop client configuration files is created in the HadoopConfigDir folder for each modeling process ID. If the value is "false", no subfolder is created and the HadoopConfigDir folder is used directly. Values: true (default) or false
● AutoPort: Enables or disables automatic configuration of YARN ports. When the value is "true", the port range can be configured with the MinPort and MaxPort properties. When it is "false", YARN automatically assigns ports. This helps when configuring a YARN connection across a network firewall. Values: true (default) or false
● MinPort: See the AutoPort property. Default: 55000
● MaxPort: See the AutoPort property. Default: 56000
● AutoIDBMActivation: Allows Native Spark Modeling to be activated when matching entries exist in the SparkConnections.ini file for the ODBC connection. Values: true (default) or false

The SparkConnections.ini properties

● HadoopConfigDir: Mandatory. Location of the Hadoop client configuration XML files (hive-site.xml, core-site.xml, yarn-site.xml). hdfs-site.xml needs to be included if running on a High Availability cluster. Default: ../SparkConnector/hadoopConfig/$DATASOURCENAME
● HadoopUserName: Mandatory. User with the privilege to run Spark jobs. Default: hive
● ConnectionTimeOut: Period (in seconds) after which a Spark job times out. It is useful to lower the value to a few minutes when troubleshooting. Default: 1000000 (seconds)
● .native."property_name": A native Spark property can be specified by adding native and quotes to the property name and value. Refer to http://spark.apache.org/docs/1.6.1/configuration.html

 Sample Code

SparkConnection.$MyDSN.native."spark.executor.memory"="2g"

Related Information

Edit the Spark.cfg File [page 73]

Edit the SparkConnections.ini File [page 74]

3.15.1.3.1 Edit the Spark.cfg File

You want to change a default setting relating to all Spark connections.

1. Change directory to the SparkConnector directory from the install root directory.
2. Edit the Spark.cfg file and add the necessary properties. All property names begin with Spark..
3. Restart the client after making any changes to the Spark.cfg file.

 Example

 Sample Code

# Native Spark Modeling settings. See SparkConnection.ini for DSN specific


settings.

# ##### HIVE on HortonWorks HDP 2.4.2 or Cloudera CDH 5.8.x (Spark 1.6.1)
#####
Spark.SparkAssemblyLibFolder="../../../SparkConnector/jars/spark-1.6.1-bin-
hadoop2.6/lib"
Spark.IDBMJarFile="../../../SparkConnector/jars/idbm-spark-1_6.jar"

# ##### MEMORY TUNING #####


# increase the Spark Driver memory (in MB) if you encounter out of memory
errors
# Note: When using Predictive Analytics Workstation for Windows then the
Spark Driver memory is shared with the Desktop client.
# If increasing the Spark Driver memory you also need to modify
KJWizard.ini vmarg.1 setting for JVM maximum heap size (e.g. vmarg.1=-
Xmx4096m).
# It is recommended to set the Spark Driver memory to approximately 75% of
the vmarg.1 value.
Spark.DriverMemory=1024

#
###########################################################################
##
# ##### RECOMMENDED TO KEEP TO DEFAULTS #####
# Spark JVM and classpath settings.
Spark.JVMPath="../../j2re/bin/server"
Spark.HadoopConfigDir=$KXTEMPDIR
Spark.HadoopHome="../../../SparkConnector/"

# per connection configuration


Spark.SparkConnectionsFile="../../../SparkConnector/SparkConnections.ini"

Spark.AutoIDBMActivation=true
Spark.DefaultOutputSchema="/tmp"

# set permanent generation size (in MB) as spark assembly jar files
contain a lot of classes
Spark.JVMPermGenSize=256

# location of the logging properties file


Spark.log4jConfigurationFile="../../../SparkConnector/log4j.properties"

3.15.1.3.2 Edit the SparkConnections.ini File

You want to configure a new Spark connection linked to a particular ODBC connection to enable Native Spark
Modeling. You know the name of the ODBC connection (DSN).

1. Change directory to the SparkConnector directory under the install root directory.
2. Edit the SparkConnections.ini file. All property names begin with SparkConnection.

 Note

The default SparkConnections.ini file does not contain any entries. Samples are provided for each
supported cluster type (Cloudera and HortonWorks) with a Hive DSN in the
SparkConnections_samples.txt file. Use these as a template for creating your own configuration.

Default SparkConnections.ini content

# This file contains Spark configuration entries that are specific for each
particular data source name (DSN).
#
# To enable Native Spark Modeling against a data source you need to define at
least the minimum properties.
# Create a separate set of entries for each DSN. Start each entry with the
text "SparkConnection" followed by the DSN name.

# Note: only Hive is supported with Native Spark Modeling on Windows.


# For SAP HANA Vora on Windows you could use a client/server installation
with a Windows client to a Linux server/Vora jumpbox.

# There are 2 mandatory parameters that have to be set for each DSN -
# 1. hadoopUserName, a user name with privileges to run Spark on YARN and
# 2. hadoopConfigDir, the directory of the Hadoop client configuration
files for the DSN (the directory with the core-site.xml, yarn-site.xml, hive-
site.xml files at a minimum)
#
# It is highly recommended to upload the spark assembly jar to HDFS,
especially on windows.
# e.g. for a DSN called MY_DSN and a Spark 1.6.1 assembly jar in the HDFS
jars folder -
# SparkConnection.MY_DSN.native."spark.yarn.jar"="hdfs://hostname:8020/
jars/spark-assembly-1.6.1-hadoop2.6.0.jar"
#
# It is possible to pass in native Spark configuration parameters using
"native" in the property.
# e.g. to add the "spark.executor.instances=4" native Spark configuration
to a DSN called MY_HIVE_DSN -
# SparkConnection.MY_DSN.native."spark.executor.instances"="4"
#
# #########################################
# Specific settings for HortonWorks HDP clusters
#
# These 2 properties are also mandatory for HortonWorks clusters and need
to match the HDP version exactly.
# (hint: get the correct value from the spark-defaults.conf Spark client
configuration file or Ambari)
# Example for HortonWorks HDP 2.4.2
# SparkConnection.MY_HDP_DSN.native."spark.yarn.am.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"
# SparkConnection.MY_HDP_DSN.native."spark.driver.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"
#
# #########################################
# Refer to the SparkConnections_samples.txt file for sample content.
########################################################################

3.15.1.3.2.1 Sample Content for a HortonWorks Cluster with a Hive DSN

Sample SparkConnections.ini additional content for a HortonWorks HDP 2.4.2 (Spark 1.6.1) Hive DSN with
name MY_HDP242_HIVE_DSN.

 Sample Code

# upload the spark 1.6.1 assembly jar to HDFS and reference the HDFS location
here
SparkConnection.MY_HDP242_HIVE_DSN.native."spark.yarn.jar"="hdfs://hostname:
8020/jars/spark-assembly-1.6.1-hadoop2.6.0.jar"

# hadoopConfigDir and hadoopUserName are mandatory


# hadoopConfigDir – use relative paths to the Hadoop client XML config files
(yarn-site.xml, core-site.xml and hive-site.xml at a minimum)
SparkConnection.MY_HDP242_HIVE_DSN.hadoopConfigDir="../../../SparkConnector/
hadoopConfig/MY_HDP242_HIVE_DSN"
SparkConnection.MY_HDP242_HIVE_DSN.hadoopUserName="hive"

# HORTONWORKS SPECIFIC: these 2 properties are also mandatory for HortonWorks


clusters and need to match the HDP version exactly
# sample values for HDP 2.4.2
SparkConnection.MY_HDP242_HIVE_DSN.native."spark.yarn.am.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"
SparkConnection.MY_HDP242_HIVE_DSN.native."spark.driver.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"

# (optional) time out in seconds


SparkConnection.MY_HDP242_HIVE_DSN.connectionTimeOut=1000

# (optional) performance tuning


#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.executor.instances"="4"
#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.executor.cores"="2"
#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.executor.memory"="4g"
#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.driver.maxResultSize"="4g"

3.15.1.3.2.2 Sample Content for a Cloudera Cluster with a Hive DSN

Sample SparkConnections.ini additional content for a Cloudera CDH 5.8.x (Spark 1.6.1) Hive DSN with
name MY_CDH58_HIVE_DSN.

 Sample Code

# Sample for a Cloudera CDH 5.8.x (Spark 1.6.1) Hive DSN with name
MY_CDH58_HIVE_DSN

# upload the spark 1.6.1 assembly jar to HDFS and reference the HDFS location
here
SparkConnection.MY_CDH58_HIVE_DSN.native."spark.yarn.jar"="hdfs://hostname:
8020/jars/spark-assembly-1.6.1-hadoop2.6.0.jar"

# hadoopConfigDir and hadoopUserName are mandatory


# hadoopConfigDir – use relative paths to the Hadoop client XML config files
(yarn-site.xml, core-site.xml and hive-site.xml at a minimum)

SparkConnection.MY_CDH58_HIVE_DSN.hadoopConfigDir="../../../SparkConnector/
hadoopConfig/MY_CDH58_HIVE_DSN"
SparkConnection.MY_CDH58_HIVE_DSN.hadoopUserName="hive"

# (optional) time out in seconds


SparkConnection.MY_CDH58_HIVE_DSN.connectionTimeOut=1000

# (optional) performance tuning


#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.executor.instances"="4"
#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.executor.cores"="2"
#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.executor.memory"="4g"
#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.driver.maxResultSize"="4g"

3.15.1.3.3 Configure the YARN Port

You want to use specific ports for the Spark connection.

Native Spark Modeling uses the YARN cluster manager to communicate, deploy and execute the predictive
modeling steps in Spark. The cluster manager distributes tasks throughout the compute nodes of the cluster,
allocates the resources across the applications and monitors using consolidated logs and web pages. YARN is
one of the main cluster managers used for Spark applications. Once connected to the cluster via YARN, Spark
acquires executors on nodes in the cluster, which are processes that run the modeling steps. Next, it deploys
and executes the modeling steps through a series of tasks, executed in parallel. Native Spark Modeling uses the
YARN client mode to enable direct control of the driver program from Automated Analytics. It needs to
communicate across the network to Hadoop cluster to:

● Upload the SAP Predictive Analytics application (a jar file).


● Initiate and monitor the Spark application remotely.

The communication across the network may be blocked by firewall rules. To enable the communication to flow
through the firewall, configure the ports used by YARN:

1. Change directory to the SparkConnections directory from the install root directory.
2. Edit the SparkConnections.ini file to set specific ports for the YARN connection:

# Spark ODBC DSN = MY_HIVE


# security (see http://spark.apache.org/docs/latest/security.html)
# From Executor To Driver. Connect to application / Notify executor state
changes. Akka-based. Set to "0" to choose a port randomly.
SparkConnection.MY_HIVE.native."spark.driver.port"="55300"
# From Driver To Executor. Schedule tasks. Akka-based. Set to "0" to choose a
port randomly.
SparkConnection.MY_HIVE.native."spark.executor.port"="55301"
# From Executor To Driver. File server for files and jars. Jetty-based
SparkConnection.MY_HIVE.native."spark.fileserver.port"="55302"
# From Executor To Driver. HTTP Broadcast. Jetty-based. Not used by
TorrentBroadcast, which sends data through the block manager instead.
SparkConnection.MY_HIVE.native."spark.broadcast.port"="55303"
# From Executor/Driver To Executor/Driver. Block Manager port. Raw socket via
ServerSocketChannel
SparkConnection.MY_HIVE.native."spark.blockManager.port"="55304"

3.15.1.4 Restrictions

Thanks to Spark, the building of Automated Analytics models can run on Hadoop with better performance,
higher scalability and no data transfer back to the SAP Predictive Analytics server or desktop. For the models
using Hadoop as a data source, the model training computations by default are delegated to the Spark engine
on Hadoop whenever it is possible.

To run the training on Spark, you must fulfill the conditions described below:

● You have installed SAP Predictive Analytics 2.5 or higher.


● You are connected to the Hadoop data source using the Hive ODBC driver from Automated Analytics.
● You have installed the required versions of the Hadoop distribution (see Hadoop Platform [page 64]), Hive (see About Hive Database [page 29]), and Spark.
● You have installed the Apache Spark Binary and Spark Assembly Jar (see Install the Apache Spark Jar Files [page 66]).

If the delegation to Spark is not possible or if the option is not selected, the Automated Analytics modeling
engine runs the training computations on the machine where SAP Predictive Analytics is installed (that is, it
executes the Automated Analytics algorithms locally).

 Restriction

As of SAP Predictive Analytics 2.5, Native Spark Modeling is supported for the training of
classification and regression models with a single target. All other types of models are mainly handled by the
Automated Analytics engine.

Refer to the SAP Note 2278743 for more details on restrictions.

To change the default behaviour of the system:

1. Select File > Preferences or press F2.


2. On the Model Training Delegation panel, deselect the option.
3. Click OK to save your changes.

 Note

When editing the preferences, you can restore the default settings by clicking the Reset button.

3.15.1.5 Performance Tuning

Split HDFS file

If the Hive table is created as an external table on an HDFS directory, check whether this directory contains multiple
files. Each file represents a partition when this Hive table is processed in Native Spark Modeling.
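
For instance, you can list the files backing the external table with the HDFS command-line client; the warehouse path below is only a placeholder and must be replaced by your table's actual HDFS location.

# Count the files in the HDFS directory backing the external Hive table
# (placeholder path: adapt it to your table's location)
hdfs dfs -ls /user/hive/warehouse/my_external_table | wc -l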

Setting Spark Executor Core number - spark.executor.cores

When running Native Spark Modeling on a cluster, do not assign all available CPU cores to Spark executors, as this
could prevent Hive from fetching the results due to a lack of resources. For instance, assign only 75% of the cores to
Spark: if you have 12 cores on a node machine, set spark.executor.cores=9.
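
For instance, applying this 75% guideline to a 12-core node with the SparkConnections.ini syntax shown above (MY_HIVE being the example DSN name) gives:

# Use 9 of the 12 cores for Spark executors, leaving headroom for Hive to fetch the results
SparkConnection.MY_HIVE.native."spark.executor.cores"="9"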

Recommended Spark settings

# Spark Executor settings


SparkConnection.MY_HIVE.native."spark.executor.instances"="4"
SparkConnection.MY_HIVE.native."spark.executor.cores"="16"
# if you use many threads => increase memory
SparkConnection.MY_HIVE.native."spark.executor.memory"="16g"

Install Automated Analytics server on a jump box on the cluster itself.

It is recommended to install the Automated Analytics server on a jump box on the cluster itself. The jump box
must have enough memory to hold the result sets generated by the Native Spark Modeling application.

 Note

For more details, refer to the section Recommendation for Automated Analytics Server Location [page 64]
in the related information below.

Related Information

Recommendation for Automated Analytics Server Location [page 64]

3.15.1.6 Troubleshooting

This section provides remediation steps for particular message codes that can appear in the logs.

 Note

The message codes starting with "KXEN_W" are warning messages and the processing will still continue.

The message codes starting with "KXEN_I" are information messages and the processing will still continue.

The following table matches the message codes with the remediation steps.

Message Codes:
KXEN_W_IDBM_UNSUPPORTED_FEATURES_DETECTED
KXEN_W_IDBM_REGRESSION_MODELS_NOT_SUPPORTED
KXEN_W_IDBM_INCOMPATIBLE_VARIABLE_USAGE
KXEN_W_IDBM_SPACE_ADV_PARAMS_NOT_SUPPORTED
KXEN_W_IDBM_INCOMPATIBLE_TRANSFORM_FEATURE_USAGE
Remediation Steps:
An unsupported feature was detected. The warning message will provide a more detailed description. Native Spark Modeling will be "off" and the normal training process will continue.

Message Codes:
KXEN_W_IDBM_USER_STRUCTURES_NOT_SUPPORTED
KXEN_W_IDBM_COMPOSITE_VARIABLES_NOT_SUPPORTED
KXEN_W_IDBM_CUSTOM_PARTITION_STRATEGY_NOT_SUPPORTED
KXEN_W_IDBM_PARTITION_STRATEGY_NOT_SUPPORTED
KXEN_W_IDBM_UNSUPPORTED_TRANSFORM
KXEN_W_IDBM_K2R_GAIN_CHARTS_NOT_SUPPORTED
KXEN_W_IDBM_RISK_MODE_K2R_NOT_SUPPORTED
KXEN_W_IDBM_DECISION_MODE_K2R_NOT_SUPPORTED
KXEN_W_IDBM_K2R_ORDER_NOT_SUPPORTED
KXEN_W_IDBM_MULTI_TARGETS_MODE_NOT_SUPPORTED
Remediation Steps:
To use Native Spark Modeling for this model, either remove the affected variable(s) or avoid the feature if possible.

Message Code:
KXEN_E_EXPECT_IDBM_FILESTORE_NATIVESPARK_PAIRING
Remediation Steps:
Add the mandatory entries in the SparkConnector/SparkConnections.ini file for the connection. Refer to the section Configuring the Spark Connection [page 70].

Message Code:
KXEN_I_IDBM_AUTO_ACTIVATION
Remediation Steps:
Native Spark Modeling has been activated.

Message Code:
KXEN_E_IDBM_MISSING_HADOOP_XML
Remediation Steps:
A Hadoop client configuration (XML) file is missing. Check the Spark.cfg to ensure it is pointing to the right Spark assembly jar for your cluster. Refer to the section Connection Setup [page 68].

Message Code:
KXEN_E_IDBM_MISSING_WINUTILS
Remediation Steps:
The winutils.exe file is required by Spark on Windows. Refer to the section Winutils.exe [page 67].

Message Code:
KXEN_E_IDBM_MISSING_JARS
Remediation Steps:
A required jar file was not found. Refer to the section Installation of Native Spark Modeling [page 61].

3.15.2 Modeling in SAP HANA

You can delegate the model training computations to SAP HANA if the following prerequisites are fulfilled:

● APL (version 2.4 or higher) must be installed on the SAP HANA database server.
● The minimum required version of SAP HANA is SPS 10 Database Revision 102.02 (SAP HANA 1.00.102.02).
● The APL version must be the same as the version of SAP Predictive Analytics desktop or server.
● The SAP HANA user connecting to the ODBC data source must have permission to run APL.

For more information, see the SAP HANA Automated Predictive Library Reference Guide on the SAP Help
Portal at http://help.sap.com/pa.
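
As a quick sanity check (this query is only a sketch and assumes a typical APL installation; the official procedure is in the SAP HANA Automated Predictive Library Reference Guide), you can verify that the APL functions are registered in the AFL framework:

-- Lists the APL functions registered in SAP HANA.
-- An empty result usually means APL is not installed on this database server.
SELECT FUNCTION_NAME FROM "SYS"."AFL_FUNCTIONS" WHERE AREA_NAME = 'APL_AREA';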

 Restriction

Cases when model training is not delegated to APL:

● In the Recommendation and Social Analysis modules.


● When the model uses a custom partition strategy.
● When the model uses the option to compute a decision tree.

To unselect the default behavior:

1. Select File > Preferences or press F2.


2. On the Model Training Delegation panel, deselect the option.
3. Click OK to save your changes.

4 Connecting to your Database
Management System on Linux

4.1 About Connecting to Your Database Management


System on Linux

This section indicates how to connect Automated Analytics to a database management system (DBMS) by
creating an ODBC connection.

Configuring steps are detailed for the following databases on Linux OS:

● SAP HANA
● Oracle
● Spark SQL
● Hive
● Teradata
● MySQL 5
● IBM DB2
● Sybase IQ
● Netezza
● PostgreSQL
● Vertica
● SQLServer
● Greenplum
● Native Spark Modeling

All the ODBC drivers used in the examples can be purchased from their respective vendors.

This document explains how to install the drivers used to create the ODBC connections when they are not
delivered with the OS. It also shows how to configure them so that they suit Automated Analytics requirements.

 Caution

Automated Analytics does not use quotes around table and column names by default. So if they contain
mixed case or special characters, including spaces, they can be misinterpreted by the DBMS. To change
this behavior, you need to set the CaseSensitive option to "True" as explained in the ODBC Fine Tuning
guide.
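
Based on the ODBCStoreSQLMapper option syntax used elsewhere in this guide, the setting would typically look like the following line; treat the exact key as an assumption and confirm it against the ODBC Fine Tuning guide.

# Assumption: follows the ODBCStoreSQLMapper convention used for other ODBC options
ODBCStoreSQLMapper.*.CaseSensitive=true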

4.2 ODBC Driver Manager Setup

4.2.1 About ODBC

The ODBC software is based on a two-part architecture:

● An ODBC Driver Manager: this manager is in charge of dispatching ODBC calls to the right DBMS driver.
● ODBC drivers: these drivers actually implement the ODBC calls. Each DBMS needs its own ODBC driver.

An ODBC driver cannot function without an ODBC Driver Manager, and one ODBC Driver Manager is common to
several ODBC drivers. While an ODBC driver is supposed to be compatible with any ODBC Driver Manager,
some drivers are strongly bound to their preferred ODBC Driver Manager.

4.2.2 Default Setup for Oracle and Teradata

Data Access for Oracle and Data Access for Teradata are delivered with their own ODBC Driver Manager
properly set up.

Automated Analytics can use the Teradata ODBC driver provided with the Teradata software client package. In
this case, the ODBC Driver Manager delivered by Teradata is mandatory. See the related topic.

Related Information

Using Data Access for Teradata [page 106]

4.2.3 Setup the unixODBC for Other DBMS

The ODBC Driver Manager unixODBC has been tested and validated for the following DBMS:

● SAP HANA
● SAP HANA Vora
● postgreSQL
● Sybase IQ
● DB2
● Netezza
● Vertica
● SQLServer

Checking that unixODBC is Installed

The command isql allows you to check if the unixODBC component is installed and available for any user:

isql --version
unixODBC 2.2.14

If this command fails, you have to install the component unixODBC.

4.2.3.1 Downloading unixODBC

UNIX OS is often delivered without any ODBC software layer. To download the ODBC Driver Manager, go to the
unixODBC website: http://www.unixodbc.org.

On Linux, a prebuilt unixODBC package is usually available in the distribution's package repositories and can
easily be installed with the package manager.

 Note

The installation procedure depends on the version of the UNIX OS. See section Installing and Setting Up
unixODBC on Suse 11 .

 Example

On Ubuntu 13.04, unixODBC can be downloaded and installed with this single command:

sudo apt-get install unixodbc

4.2.3.2 Setting Up unixODBC for Use with Automated


Analytics

For technical reasons, the library libodbc.so.1 must be available in the library path so that the application can be
executed.

Depending on the actual version of unixODBC, a library with this exact name may not be available. In that case,
you must create a symbolic link to libodbc.so.

 Note

Depending on the OS, unixODBC binaries may not be available and therefore unixODBC sources must be
compiled. Do not hesitate to contact the support for assistance on this subject.

 Example

On Ubuntu 13.04, use the command:

sudo ln -s /usr/lib/x86_64-linux-gnu/libodbc.so /usr/lib/x86_64-linux-gnu/libodbc.so.1

4.2.3.3 Activating unixODBC in Automated Analytics

The default setup of standard Automated Analytics deliveries is to use the ODBC Driver Manager embedded in
DataAccess.

An additional step is needed to deactivate it and allow unixODBC Driver Manager to be used.

4.2.3.3.1 Deactivating DataAccess ODBC Driver Manager

1. Edit the file <Installation Dir>/KxProfile.sh.


2. Comment the line KXEN_ODBC_DRIVER_MANAGER=Kxen.
3. Uncomment the line KXEN_ODBC_DRIVER_MANAGER=. Note that when you keep the value empty, the driver
manager provided by the OS is used by default.

4.2.4 Set Up the ODBC Driver and the ODBC Connection

Each ODBC connection to a DBMS requires its corresponding ODBC driver. The ODBC connection must be
declared with DBMS specific connection options in a reference file: odbc.ini .

Automated Analytics is delivered with a directory (odbcconf) containing template files for each DBMS and
providing an easy way to specify the odbc.ini file to use.

4.2.4.1 Installing and Setting Up a Suitable ODBC


Connection for Automated Analytics

1. Install and set up a DBMS client software (depending on the DBMS, this step may be optional).
2. Install the corresponding ODBC driver.
3. Copy the file <Installation Dir>/odbcconf/odbc.ini.<dbms> as odbc.ini (in the same
directory).

 Code Syntax

cp odbc.ini.<dbms> odbc.ini

where <dbms> corresponds to your database management system.

4. Replace the <xxx> values in odbc.ini with the real values, as illustrated in the sketch after these steps. All the
tricky ODBC options have already been filled in the provided templates, so you only have to fill in a few options
that are easy to find. As a general rule, it is good practice to do this operation with the help of your database
administrator.
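
As a purely illustrative sketch (the placeholder key and host name below are hypothetical; the exact keys depend on the DBMS template you copied), replacing a <xxx> value could look like this:

# As delivered in the template:
# HOSTNAME=<xxx>
# After editing with your database administrator:
HOSTNAME=dbms-host.example.com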

4.3 SAP HANA

4.3.1 Installing Prerequisite Software

Before setting up the connection to SAP HANA, you need to install additional software.

4.3.1.1 Unarchiving SAP HANA Client Software

SAP HANA Client SP6 rev67 or higher is mandatory for Automated Analytics.

These revisions can be found on the SAP Service Marketplace (SMP) and are typically delivered as an SAP
proprietary archive (.sar).

These archives can be extracted with the SAP tool sapcar by using the following command:

sapcar -xvf <archive name>.sar

This command will extract all files in a folder with the same name as the archive.

4.3.1.2 Installation of SAP HANA Client Software on 64-bit


Linux with Recent Kernel

Recent 64-bit Linux distributions require the installation of the HDB_CLIENT_LINUX_X86_64 software
package.

cd HDB_CLIENT_LINUX_X86_64
chmod +x hdbinst hdbsetup hdbuninst instruntime/sbd
sudo ./hdbinst -a client

If the installation is successful, the following message appears:

SAP HANA Database Client installation kit detected.


SAP HANA Database Installation Manager - Client Installation 1.00.68.384084
***************************************************************************
Enter Installation Path [/usr/sap/hdbclient]:<RETURN>
Checking installation...
Preparing package "Python Runtime"...
Preparing package "Product Manifest"...
Preparing package "SQLDBC"...

Preparing package "REPOTOOLS"...
Preparing package "Python DB API"...
Preparing package "ODBC"...
Preparing package "JDBC"...
Preparing package "Client Installer"...
Installing SAP HANA Database Client to /usr/sap/hdbclient...
Installing package 'Python Runtime' ...
Installing package 'Product Manifest' ...
Installing package 'SQLDBC' ...
Installing package 'REPOTOOLS' ...
Installing package 'Python DB API' ...
Installing package 'ODBC' ...
Installing package 'JDBC' ...
Installing package 'Client Installer' ...
Installation done
Log file written to '/var/tmp/hdb_client_2013-11-12_16.06.06/hdbinst_client.log'

If an error occurs, the SAP HANA installer does its best to display a comprehensive report about the error.

SAP HANA Database Client installation kit detected.


Installation failed
Checking installation kit failed
Checking package ODBC failed
Software isn't runnable on your system:
file magic: elf 64-bit lsb X86-64 shared object
MAKE SYSTEM:
architecture = X86-64
c_runtime = GLIBC 2.11.1
processor_features = sse sse2 cmpxchg16b
system = Linux
version = 2.6.32.59
subversion = 0.7-default
YOUR SYSTEM:
architecture = X86-64
c_runtime = GLIBC 2.5
processor_features = sse sse2 cmpxchg16b
system = Linux
version = 2.6.18
subversion = 8.el5
Checking package SQLDBC failed
Software isn't runnable on your system:
file magic: elf 64-bit lsb X86-64 shared object
MAKE SYSTEM:
architecture = X86-64
c_runtime = GLIBC 2.11.1
processor_features = sse sse2 cmpxchg16b
system = Linux
version = 2.6.32.59
subversion = 0.7-default
YOUR SYSTEM:
architecture = X86-64
c_runtime = GLIBC 2.5
processor_features = sse sse2 cmpxchg16b
system = Linux
version = 2.6.18
subversion = 8.el5

In this example, the report shows that this client package needs at least version 2.6.32 of the kernel and version 2.11.1 of
GLIBC, whereas the current system provides only kernel 2.6.18 and GLIBC 2.5. The installed Linux version is too old
for the HDB_CLIENT_LINUX_X86_64 package, which should be replaced by the
HDB_CLIENT_LINUX_X86_64_SLES9 package.

4.3.1.3 Installation of SAP HANA Client software on Linux
64 bits with Kernel Version < 2.6.32

Old versions of Linux require the installation of the HDB_CLIENT_LINUX_X86_64_SLES9 software package.


If the installation is successful, the following message appears:

SAP HANA Database Client installation kit detected.


SAP HANA Database Installation Manager - Client Installation 1.00.68.384084
***************************************************************************
Enter Installation Path [/usr/sap/hdbclient]:
Checking installation...
Preparing package "Product Manifest"...
Preparing package "SQLDBC"...
Preparing package "ODBC"...
Preparing package "JDBC"...
Preparing package "Client Installer"...
Installing SAP HANA Database Client to /usr/sap/hdbclient...
Installing package 'Product Manifest' ...
Installing package 'SQLDBC' ...
Installing package 'ODBC' ...
Installing package 'JDBC' ...
Installing package 'Client Installer' ...
Installation done
Log file written to '/var/tmp/hdb_client_2013-11-12_16.20.37/hdbinst_client.log'

4.3.1.4 Checking SAP HANA Native Client Connectivity

The SAP HANA Client package provides tools to test native SAP HANA connectivity without involving any other
software layer, especially any ODBC layer. We strongly advise you to use these tools before going further.

 Code Syntax

cd /usr/sap/hdbclient
./hdbsql
hdbsql=> \c -i <instance number> -n <host name> -u <user> -p <password>

 Example

To test SAP HANA running at Hana-host as instance number 00 allowing user myName/MyPass, the
commands to use are:

cd /usr/sap/hdbclient
./hdbsql
hdbsql=> \c -i 00 -n Hana-host -u MyName -p MyPass
Connected to BAS@Hana-host:30015

4.3.1.4.1 Missing libaio-dev Package

On recent Linux kernels, some software packages might be missing, making it impossible to use SAP HANA
Native Client Connectivity.

In such a case, when running hdbsql, the following error message is triggered:

hdbsql: error while loading shared libraries: libaio.so

Install the libaio-dev package to solve this issue. The command to install the package depends on the Linux
version but is usually:

sudo apt-get install libaio-dev

4.3.1.5 Checking SAP HANA ODBC Connectivity

If the previous check is successful, another SAP tool allows checking the ODBC connectivity layer without any
other software layer than the SAP ones.

We strongly advise you to use these tools before going further.

 Code Syntax

export LD_LIBRARY_PATH=/usr/sap/hdbclient:$LD_LIBRARY_PATH
/usr/sap/hdbclient/odbcreg <Servernode> <ServerDB> <UID> <PWD>

 Example

For SAP HANA running at Hana-host as instance number 00 allowing user MyName/MyPass on database
BAS, the commands to use are:

/usr/sap/hdbclient/odbcreg Hana-host:30015 BAS MyName MyPass


ODBC Driver test.
Connect string: 'SERVERNODE=Hana-host:
30015;SERVERDB=BAS;UID=MyName;PWD=MyPass;'.
retcode: 0
outString(68): SERVERNODE={Hana-host:
30015};SERVERDB=BAS;UID=MyName;PWD=MyPass;
Driver version SAP HDB 1.00 (2013-10-15).
Select now(): 2013-11-12 15:44:55.272000000 (29)

The last line displays the current time on the HANA server using SQL with ODBC connectivity and allows
the full validation of SAP HANA ODBC connectivity to the given HANA server.

4.3.2 Setting Up ODBC Connectivity with Automated
Analytics

1. Check that unixODBC is properly installed.

 Note

The unixODBC component is mandatory. The exact process to install this package, if missing, depends
on the exact version of UNIX. See the related topic on setting up ODBC on Suse 11.

2. Set up the application so it uses unixODBC provided by the OS.


3. Edit the KxProfile.sh file located in the Automated Analytics installation folder.
4. Uncomment the line KXEN_ODBC_DRIVER_MANAGER= so that the file contains:

#KXEN_ODBC_DRIVER_MANAGER=Kxen
#KXEN_ODBC_DRIVER_MANAGER=tdodbc
KXEN_ODBC_DRIVER_MANAGER=
fi
export KXEN_ODBC_DRIVER_MANAGER
# -- Define here additionnal variables required for all InfiniteInsight
instances...
#Force our unixODBC Driver Manager

5. Copy the file <installation folder>/odbcconfig/odbc.ini.HANA to <installation folder>/odbcconfig/odbc.ini and edit the actual values.

 Example

The following odbc.ini file describes a connection name MyHANASP6 that allows the connection to the
HANA host named Hana-host running instance 00 with the typical path for the HANA ODBC driver.

[ODBC Data Sources]


MyHANA12=HANA sp12
[MyHANA12]
Driver=HDBODBC
ServerNode=Hana-host:30015

Related Information

Setup the unixODBC for Other DBMS [page 82]


Installing and Setting Up unixODBC on Suse 11 [page 141]

4.3.3 Troubleshooting SAP HANA ODBC Connectivity

SAP provides some tools to set up trace and debug information for the SAP HANA ODBC driver.

 Caution

Such debug and trace information is costly to generate and must be activated only for debug and test
purposes. It is important to switch back to the regular configuration when debugging is done.

4.3.3.1 Before Setting Up SAP HANA ODBC Driver for


Automated Analytics

If previous checks are successful, SAP HANA ODBC connectivity can be set up for the application.

The application GUI displays the list of available ODBC connections, so you normally do not need to type any
connection name. Should you choose to type the connection name yourself or use the command-line tool
KxShell, be sure to use exactly the same upper/lower case as declared in odbc.ini. Otherwise, the SAP HANA
ODBC driver will not be able to connect. This behavior is specific to the SAP HANA ODBC client.

 Example

If MyHana6 is described in odbc.ini, using the connection name MYHANA6 will display an error message
with the following ODBC diagnostic:

Connection failed: [08S01][SAP AG][LIBODBCHDB SO][HDBODBC] Communication link


failure;-10709 Connect failed (no reachable host left).

4.3.3.2 Activating the Full Debug Mode

● Enter the following commands.

cd /usr/sap/hdbclient
./hdbodbc_cons CONFIG TRACE PACKET ON
./hdbodbc_cons CONFIG TRACE DEBUG ON
./hdbodbc_cons CONFIG TRACE API ON
./hdbodbc_cons CONFIG TRACE SQL ON
./hdbodbc_cons CONFIG TRACE FILENAME <filename>

After this sequence of commands, any usage of the SAP HANA ODBC driver will write comprehensive debug and
trace information to the file called <filename>.

4.3.3.3 Deactivating the Full Debug Mode

● Enter the following commands.

cd /usr/sap/hdbclient
./hdbodbc_cons CONFIG TRACE PACKET OFF
./hdbodbc_cons CONFIG TRACE DEBUG OFF
./hdbodbc_cons CONFIG TRACE API OFF
./hdbodbc_cons CONFIG TRACE SQL OFF

4.3.3.4 Checking the Status of the Trace Mode

● Enter the following commands.

cd /usr/sap/hdbclient
./hdbodbc_cons SHOW ALL

4.3.4 SAP HANA as a Data Source

You can use SAP HANA databases as data sources in Data Manager and for all types of modeling analyses in
Modeler: Classification/Regression, Clustering, Time Series, Association Rules, Social, and Recommendation.

The following SAP HANA objects can be used as data sources:

SAP HANA tables or SQL views found in the Catalog node of the SAP HANA database

All types of SAP HANA views found in the Content node of the SAP HANA database

An SAP HANA view is a predefined virtual grouping of table columns that enables data access for a particular business requirement. Views are specific to the type of tables that are included, and to the type of calculations that are applied to columns. For example, an analytic view is built on a fact table and associated attribute views. A calculation view executes a function on columns when the view is accessed.

 Restriction
● Analytic and calculation views that use the variable mapping feature (available starting with SAP HANA SPS 09) are not supported.
● You cannot edit data in SAP HANA views using Automated Analytics.

Smart Data Access virtual tables

Thanks to Smart Data Access, you can expose data from remote source tables as virtual tables and combine them with regular SAP HANA tables. This allows you to access data sources that are not natively supported by the application, or to combine data from multiple heterogeneous sources.

 Caution
To use virtual tables as input datasets for training or applying a model, or as output datasets for applying a model, you need to check that the following conditions are met:

● The in-database application mode is not used.
● The destination table for storing the predicted values exists in the remote source before applying the model.
● The structure of the remote table, that is the column names and types, must match exactly what is expected with respect to the generation options; if this is not the case, an error will occur.

 Caution
In Data Manager, use virtual tables with caution as the generated queries can be complex. Smart Data Access may not be able to delegate much of the processing to the underlying source depending on the source capabilities. This can impact performance.

Prerequisites

You must know the ODBC source name and the connection information for your SAP HANA database. For more
information, contact your SAP HANA administrator.

In addition to having the authorizations required for querying the SAP HANA view, you need to be granted the
SELECT privilege on the _SYS_BI schema, which contains metadata on views. Please refer to SAP HANA
guides for detailed information on security aspects.
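
For illustration only (the exact statement depends on your security setup and should be validated by your SAP HANA administrator; MODELER_USER is a placeholder user name), granting this privilege can look like:

-- Grant read access to the _SYS_BI metadata schema (placeholder user name)
GRANT SELECT ON SCHEMA "_SYS_BI" TO MODELER_USER;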

4.4 SAP Vora

4.4.1 About SAP Vora

Native Spark Modeling optimizes the modeling process on SAP Vora data sources.

To setup an ODBC connection to SAP Vora you need to follow the steps listed below:

1. Install the Simba Spark SQL ODBC driver.
2. Configure a SAP Vora DSN with the Simba Spark SQL driver.
3. Set the ODBC behaviour flag for the DSN to SAP Vora.

For more information, refer to the section Native Spark Modeling in the related information below.

Related Information

Native Spark Modeling [page 60]


Native Spark Modeling [page 120]

4.4.2 Installing the Simba Spark SQL ODBC Driver

1. Download the Simba driver relevant for the recommended Vora version from the Simba website. Refer to
the SAP Product Availability Matrix at http://service.sap.com/sap/support/pam to know the supported
product versions.
2. Follow the instructions to install the driver.

 Note

HortonWorks customers can use the HortonWorks Spark SQL ODBC driver provided by Simba.

4.4.3 Configuring SAP Vora DSN

1. Install the unixODBC package corresponding to your operating system.


2. Install the Simba Spark ODBC Driver and its required dependencies for your operating system based on
the shipped instructions.
3. Go to the Setup/ folder located in the installation folder of the Simba drivers, for example, /opt/simba/
spark/Setup.
4. Copy the files listed in the following table to your home directory and rename them as specified in the
table.

File to Copy New Name

odbcinst.ini .odbcinst.ini

odbc.ini .odbc.ini

5. Update the .odbcinst.ini file in your home directory to point to the correct driver installation path.
Change the driver path to the correct installation location on your system.

 Sample Code

.odbcinst.ini

[ODBC Drivers]
Simba Spark ODBC Driver 64-bit=Installed

[Simba Spark ODBC Driver 64-bit]


Description=Simba Spark ODBC Driver (64-bit)
Driver=/opt/simba/sparkodbc/lib/64/libsparkodbc_sb64.so

6. Update the .odbc.ini file in your home directory.


a. Add the connection information you have obtained from your cluster administrator.
b. Change the driver path to the correct installation location on your system.
You can use the odbc.ini.VORA file located in the installation folder as a template when using Vora 1.4. It
contains the following code.

 Sample Code

.odbc.ini

[ODBC Data Sources]


VORA1.4=Simba Spark ODBC Driver 64-bit

[VORA1.4]
Description=Simba Spark ODBC Driver(64-bit) DSN
Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so
HOST=host.example.com
PORT=19123
SparkServerType=3
AuthMech=3
ThriftTransport=1
BinaryColumnLength=32767
DecimalColumnScale=10
DefaultStringColumnLength=255
FastSQLPrepare=1
FixUnquotedDefaultSchemaNameInQuery=1
GetTablesWithQuery=0
GetSchemasWithQuery=0
RowFetchedPerBlock=1000
#UseNativeQuery=1
UnicodeSqlCharacterTypes=0
SSL=0
TwoWaySSL=0
CatalogSchemaSwitch=0
UID=vora

7. In the file KxProfile.sh located in the installation directory, comment the line
KXEN_ODBC_DRIVER_MANAGER=Kxen and uncomment the line KXEN_ODBC_DRIVER_MANAGER=
The resulting code must look as follows.

 Sample Code

#KXEN_ODBC_DRIVER_MANAGER=Kxen
#KXEN_ODBC_DRIVER_MANAGER=tdodbc
KXEN_ODBC_DRIVER_MANAGER=

8. Copy the .odbc.ini file to the odbcconfig/ folder located in your installation directory.

9. Go to the folder lib/64 in the driver installation directory, for example, /opt/simba/
sparkodbc/lib/64.
10. Edit the simba.sparkodbc.ini file and set ErrorMessagesPath as shown in the following example
changing the path /opt/simba/sparkodbc to the correct installation location on your system.

 Sample Code

simba.sparkodbc.ini

[Driver]
ErrorMessagesPath=/opt/simba/sparkodbc/ErrorMessages/
LogLevel=0
LogPath=/tmp/simbaspark.log
SwapFilePath=/tmp

4.4.4 Setting the ODBC Behaviour Flag


The ODBC Behaviour flag is required to enable SAP Vora ODBC connectivity with the Simba Spark SQL ODBC
driver: it switches the ODBC behavior from Spark SQL to SAP Vora.

1. Edit the SparkConnector\Spark.cfg file.


2. Set the Behaviour property for either a specific Vora DSN name or for all DSNs as shown in the examples
below.

 Note

For OEM installations there is no SparkConnector folder. You need to set the Behaviour flag in the
KxShell.cfg configuration file.

 Example

Spark.cfg set up for a single Vora DSN called MY_VORA_ODBC_DSN

 Sample Code

# SAP Vora ODBC Connectivity


#
# Set the "Behaviour" option for all Vora ODBC DSNs here
# example 1: for a specific DSN called MY_VORA_ODBC_DSN
ODBCStoreSQLMapper.MY_VORA_ODBC_DSN.Behaviour=Vora

Spark.cfg set up to use Vora behavior for all DSNs, which means that every DSN will be treated as a Vora
DSN.

 Sample Code

# SAP Vora ODBC Connectivity


#
# Set the "Behaviour" option for all Vora ODBC DSNs here
# example 2: to use Vora for all DSNs
ODBCStoreSQLMapper.*.Behaviour=Vora

4.5 Oracle

4.5.1 About Oracle Database

Automated Analytics standard installation includes a component named Data Access for Oracle. This
component allows connecting to Oracle DBMS without the need to install any additional Oracle software.

On Linux systems, this is the only component supported by Automated Analytics to connect on Oracle.

Another advantage of this component is that a special bulk mode can be activated with this driver. Using this
mode, writes to the Oracle DBMS are boosted by a factor of 20. Depending on the algorithms that have been used to
build the model, such acceleration is mandatory when scoring. Social models are a typical example of
algorithms where scoring cannot be done with the In-database Apply feature. Note that this component is a
regular Automated Analytics feature and is therefore subject to licensing.

The driver must be set up after the installation so that it suits the application requirements. Once the driver is
set up, users can start using the application.

4.5.2 Setting Up Data Access for Oracle

In the odbcconfig directory (located in the root directory):

1. Copy the file odbc.ini.KxenConnectors.oracle as odbc.ini (in the same directory). This can be
done typing the following commands:

cd <Installation dir>/odbcconfig
cp odbc.ini.KxenConnectors.oracle odbc.ini

2. Edit the odbc.ini file and replace <MyDSN> with the name of the connection you want to use.
3. Replace all the <xxx> with your actual parameters. The first parameters set up the actual connectivity to
your Oracle DBMS. There are numerous ways to connect to an Oracle DBMS, so you should request
assistance from your Oracle database administrator to set up the proper parameters. A filled-in example is
sketched after these steps.
However, a very common way to define Oracle connectivity is to set up the first two parameters:
○ HostName: the host name of the Oracle server or its IP address.
○ ServiceName: the service name of the Oracle DBMS on the Oracle server.

It is not mandatory, but we advise you to also update the InstallDir parameter with the full path of the
Automated Analytics installation.
4. Open the script KxProfile.sh located in the root directory.
5. Check that the line KXEN_ODBC_DRIVER_MANAGER=Kxen is not commented out.
6. The server processes can now be launched using the script kxen.server, located in the root directory.
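
For illustration, a filled-in odbc.ini entry for a hypothetical DSN named MY_ORACLE could look like the following; the host, service name, and installation path are placeholders, and any other <xxx> entries delivered in the template must still be completed with your database administrator.

[ODBC Data Sources]
MY_ORACLE=Data Access for Oracle

[MY_ORACLE]
# Host name (or IP address) of the Oracle server - placeholder value
HostName=oracle-host.example.com
# Service name of the Oracle DBMS on that server - placeholder value
ServiceName=ORCL
# Full path of the Automated Analytics installation - placeholder value
InstallDir=/opt/SAPPredictiveAnalytics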

4.6 Spark SQL

4.6.1 About Spark SQL

SAP Predictive Analytics supports the Apache Spark framework in order to perform large-scale data
processing with the Automated Predictive Server.

 Note

To know the supported version of Spark SQL, refer to the SAP Product Availability Matrix http://
service.sap.com/sap/support/pam .

4.6.2 Setting Up Data Access for Spark SQL

In the odbcconfig directory (located in the root directory):

1. Copy the file odbc.ini.KxenConnectors.spark as odbc.ini (in the same directory). This can be done
typing the following commands:

 Code Syntax

cd <Installation Dir>/odbcconfig
cp odbc.ini.KxenConnectors.spark odbc.ini

2. Edit the odbc.ini file and replace <MyDSN> with the name of the connection you want to use.
3. Replace all the <xxx> with your actual parameters (a filled-in example is sketched after these steps):
○ HOSTNAME: the host name of the Spark SQL Thrift Server or its IP address.
○ PORTNUMBER: the port number of the Spark SQL Thrift Server.
○ DATABASE: the database to use in the Spark SQL Thrift Server.
4. Open the script KxProfile.sh located in the root directory.
5. Check that the line KXEN_ODBC_DRIVER_MANAGER=Kxen is not commented out.
6. The server processes can now be launched using the script kxen.server, located in the root directory.
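
For illustration, a filled-in odbc.ini entry for a hypothetical DSN named MY_SPARKSQL could look like this; the host, port, and database values are placeholders, and the other options delivered in the template are kept as provided.

[ODBC Data Sources]
MY_SPARKSQL=Data Access for Spark SQL

[MY_SPARKSQL]
# Host name (or IP address) of the Spark SQL Thrift Server - placeholder value
HOSTNAME=sparksql-host.example.com
# Port number of the Spark SQL Thrift Server - placeholder value
PORTNUMBER=10000
# Database to use in the Spark SQL Thrift Server - placeholder value
DATABASE=default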

4.6.3 Creating Data Manipulations using Date Functions

SAP Predictive Analytics provides UDF extensions to Spark SQL allowing the management of dates. Apache
Spark SQL provides very few date functions natively.

 Note

Installation of UDFs for Spark SQL requires access to the Apache Spark SQL server.

4.6.3.1 Installation in Apache Spark SQL Server

The Apache Spark SQL UDFs for the application are located in:

{SAP Predictive Analytics folder}/resources/KxenHiveUDF.jar

You need to copy this file into the local file system of the Apache Spark SQL server. The jar to deploy is the
same as the one used for Hive.

4.6.3.2 Activation in SAP Predictive Analytics

In this section, <server_local_path_to_jar> designates the local path to the copied KxenHiveUDF.jar file inside
the Apache Spark SQL server.

On the computer running the application, locate the configuration file (.cfg) corresponding to SAP Predictive
Analytics product you want to use:

● KxCORBA.cfg when using an SAP Predictive Analytics server


● KJWizard.cfg when using an SAP Predictive Analytics workstation

Add these lines to the proper configuration file using following syntax:

ODBCStoreSQLMapper.*.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.*.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.*.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.*.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.*.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.*.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.*.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"

These lines activate the UDF extensions for all DBMS connections. If you are using DBMS connections other
than Spark SQL, you can activate the UDF extensions only for a specific connection by replacing the star (*)
with the actual name of the ODBC connection you have defined.

 Example

ODBCStoreSQLMapper.My_DSN.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.My_DSN.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"

4.6.4 Spark SQL Restrictions

Restrictions for using Spark SQL with SAP Predictive Analytics.

4.6.4.1 Managing Primary Keys

Spark SQL does not publish the primary keys of a table.

The usual workaround is to use a description file in which the primary keys are properly described. This
description can be loaded either by the user or automatically by the application. In this case, for a table XX, the
application automatically reads the description file named KxDesc_XX.

Unfortunately, it may not be easy to push new files in Spark SQL and due to other limitations of Spark SQL, the
application is not able to save these files in Spark SQL. It is still possible to use description files stored in a
standard text repository but all descriptions must be explicitly read.

As a convenience, with Spark SQL, the application uses a new heuristic to guess the primary keys of a table.
Field names are compared to patterns, allowing the detection of names commonly used for primary keys.

The default setup for the application on Spark SQL is to manage the fields listed below as primary keys:

● Starting with ‘KEY’ or ‘ID’. For example: KEY_DPTMT or IDCOMPANY
● Ending with ‘KEY’ or ‘ID’. For example: DPTMKEY or COMPANY_ID

 Note

The list of patterns for primary keys can be tuned by the user.

PrimaryKeyRegExp

This option allows the specification of a list of patterns that will be recognized as primary keys. The syntax
follows the convention described in the section About ODBC Fine Tuning [page 156]. The
patterns use the regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.2=^KEY
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.3=ID$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.4=^ID

 Note

This set of patterns is the default one for SAP Predictive Analytics and Spark SQL.

NotPrimaryKeyRegExp

Patterns described by PrimaryKeyRegExp may match too many field names. This option is applied after the
PrimaryKeyRegExp patterning and allows the explicit description of field names that are not primary keys.
These patterns are using the regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.NotPrimaryKeyRegExp.1=CRICKEY$

All field names ending with KEY will be managed as primary keys, except field names ending with CRICKEY.

4.7 Hive

4.7.1 About Hive Database

Automated Analytics provides a DataDirect driver for Hive. This component lets you connect to a Hive server
without the need to install any other software. Hive technology allows you to use SQL statements on top of a
Hadoop system. This connectivity component supports:

● Hive versions: to know which versions are supported, refer to the SAP Product Availability Matrix at http://
service.sap.com/sap/support/pam
● Hive server 1 and Hive server 2: the server can be set up in two modes. Both modes are transparently
managed; however, Hive server 2 is preferred as it provides more authentication and multi-connection
features.

The driver must be set up after the installation so that it suits the application requirements. Once the driver is
set up, users can start using the application. Native Spark Modeling optimizes the modeling process on Hive
data sources. For more information, refer to Native Spark Modeling [page 120]

4.7.2 Setting Up Data Access for Hive

In the odbcconfig directory (located in the root directory):

1. Copy the file odbc.ini.KxenConnectors.hive as odbc.ini (in the same directory). This can be done
typing the following commands:

 Code Syntax

cd <Installation Dir>/odbcconfig
cp odbc.ini.KxenConnectors.hive odbc.ini

2. Edit the odbc.ini file and replace <MyDSN> with the name of the connection you want to use.

3. Replace all the <xxx> with your actual parameters (a filled-in example is sketched after these steps):
○ HOSTNAME: the host name of the Hive Server or its IP address.
○ PORTNUMBER: the port number of the Hive Server.
○ DATABASE: the database to use in the Hive Server.
4. Open the script KxProfile.sh located in the root directory.
5. Check that the line KXEN_ODBC_DRIVER_MANAGER=Kxen is not commented out.
6. The server processes can now be launched using the script kxen.server, located in the root directory.
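
For illustration, a filled-in odbc.ini entry for a hypothetical DSN named MY_HIVE could look like this; the host, port, and database values are placeholders, and the other options delivered in the template are kept as provided.

[ODBC Data Sources]
MY_HIVE=Data Access for Hive

[MY_HIVE]
# Host name (or IP address) of the Hive Server - placeholder value
HOSTNAME=hive-host.example.com
# Port number of the Hive Server - placeholder value
PORTNUMBER=10000
# Database to use in the Hive Server - placeholder value
DATABASE=default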

4.7.3 Creating Data Manipulations using Date Functions

SAP Predictive Analytics provides UDF extensions to Hive allowing the management of dates. Apache Hive
provides very few date functions natively.

 Note

Installation of UDFs for Hive requires access to the Apache Hive server.

4.7.3.1 Installation in Apache Hive Server

The Apache Hive UDFs for the application are located in:

{SAP Predictive Analytics folder}/resources/KxenHiveUDF.jar

You need to copy this file into the local file system of the Apache Hive server.

4.7.3.2 Activation in SAP Predictive Analytics

In this section, server_local_path_to_jar designates the local path to the copied KxenHiveUDF.jar file
inside the Apache Hive server.

On the computer running the application, locate the configuration file corresponding to SAP Predictive
Analytics product you want to use:

● KxCORBA.cfg when using an SAP Predictive Analytics server


● KJWizard.cfg when using an SAP Predictive Analytics workstation

Add these lines to the proper configuration file:

ODBCStoreSQLMapper.*.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.*.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.*.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.*.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.*.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.*.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.*.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"

These lines activate the UDF extensions for all DBMS connections. If you are using DBMS connections other
than Hive, you can activate the UDF extensions for a specific connection by replacing the star (*) with
the actual name of the ODBC connection you have defined.

 Example

ODBCStoreSQLMapper.MyBigHive.SQLOnConnect1="ADD JAR <server_local_path_to_jar>"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect2="CREATE TEMPORARY FUNCTION KxenUDF_add_year as 'com.kxen.udf.cUDFAdd_year'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect3="CREATE TEMPORARY FUNCTION KxenUDF_add_month as 'com.kxen.udf.cUDFAdd_month'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect4="CREATE TEMPORARY FUNCTION KxenUDF_add_day as 'com.kxen.udf.cUDFAdd_day'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect5="CREATE TEMPORARY FUNCTION KxenUDF_add_hour as 'com.kxen.udf.cUDFAdd_hour'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect6="CREATE TEMPORARY FUNCTION KxenUDF_add_min as 'com.kxen.udf.cUDFAdd_min'"
ODBCStoreSQLMapper.MyBigHive.SQLOnConnect7="CREATE TEMPORARY FUNCTION KxenUDF_add_sec as 'com.kxen.udf.cUDFAdd_sec'"

4.7.4 Hive Restrictions

Hive technology is built on top of Hadoop technology and allows access to the Hadoop world with classic SQL
statements. Hive’s goal is to provide a standard SQL DBMS on top of a Hadoop system. However, Hive is not yet
a full SQL DBMS and has some restrictions.

4.7.4.1 Using the Code Generator

The code generator is compatible with Apache Hive, however, due to some restrictions of the current ODBC
driver provided by DataDirect, the SQL code generated by the code generator cannot be executed when the
model contains a question mark (?) as a significant category.

4.7.4.2 Aggregates

Since COUNT DISTINCT is not supported in analytical functions by Hadoop Hive, the application does not
support the COUNT DISTINCT aggregate.

Note that aggregates using the subquery syntax are not supported by Hive either.

4.7.4.3 Time-stamped Populations

Hive does not support functions (Union, Except, Intersection, Cross Product) that are used by the
application for building Compound and Cross Product time-stamped populations. As a result, these two types
of time-stamped populations cannot be used with Hive. Filtered time-stamped populations, however, can be
used with Hive.

4.7.4.4 Inserting Data in Hive

With Hive, it is possible to create new tables and insert new records using complex statements, but there is no
way to push a single record using the usual INSERT INTO <table> VALUES(…) statement.

For example, the statement below works:

CREATE TABLE TODEL(ID INT)


INSERT INTO TABLE TODEL SELECT ID FROM ADULTID

Whereas the following statement does not work:

INSERT INTO TABLE TODEL VALUES(1)

The in-database application feature and SAP Predictive Analytics code generator are not impacted by this
limitation but several other features of the application are blocked:

● Scoring with models not compatible with the in-database application feature
● Saving models, data manipulations, variable pool, and descriptions in Hive DBMS
● Transferring data with the Data Toolkit
● Generating distinct values in a dataset

4.7.4.5 Managing Primary Keys

Hive does not publish the primary keys of a table. The usual workaround is to use a description file in
which the primary keys are properly described. This description can be loaded either by the user or
automatically by the application. In the latter case, for a table XX, the application automatically reads the description
file named KxDesc_XX.

Unfortunately, it may not be easy to push new files in Hive and due to other limitations of Hive, the application
is not able to save these files in Hive. It is still possible to use description files stored in a standard text
repository but all descriptions must be explicitly read.

As a convenience, with Hive, the application uses a new heuristic to guess the primary keys of a table. Field
names are compared to patterns, allowing the detection of names commonly used for primary keys.

The default setup for the application on Hive is to manage the fields listed below as primary keys:

● Starting with ‘KEY’ or ‘ID’. For example: KEY_DPTMT or IDCOMPANY
● Ending with ‘KEY’ or ‘ID’. For example: DPTMKEY or COMPANY_ID

 Note

The list of patterns for primary keys can be tuned by the user.

PrimaryKeyRegExp

This option allows the specification of a list of patterns that will be recognized as primary keys. The syntax
follows the convention described in the section About ODBC Fine Tuning [page 156]. The
patterns use the regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.2=^KEY
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.3=ID$
ODBCStoreSQLMapper.*.PrimaryKeyRegExp.4=^ID

 Note

This set of patterns is the default one for SAP Predictive Analytics and Hive.

NotPrimaryKeyRegExp

Patterns described by PrimaryKeyRegExp may match too many field names. This option is applied after the
PrimaryKeyRegExp patterning and allows the explicit description of field names that are not primary keys.
These patterns are using the regexp formalism.

 Example

ODBCStoreSQLMapper.*.PrimaryKeyRegExp.1=KEY$
ODBCStoreSQLMapper.*.NotPrimaryKeyRegExp.1=CRICKEY$

All field names ending with KEY will be managed as primary keys, except field names ending with CRICKEY.

4.7.4.6 Tuning

You can tune the Hive connection by disabling the KxDesc mechanism or by modifying the heap size of the
JVM or of the Apache Hadoop client application.


SupportKxDesc

Even if pushing such description files to Hive may not be easy, the automatic KxDesc_XXX mechanism is a convenient
feature. This is why the application still provides it on Hive. However, accessing Hive's metadata may be heavy
and may uselessly slow down processing.

The SupportKxDesc option allows the deactivation of the KxDesc mechanism and can thus speed up
usage with Hive.

 Example

ODBCStoreSQLMapper.*.SupportKxDesc=false

JVM Heap Size

When using Hive server 2 with a wide dataset, you need to make sure to increase the heap size of the Java
Virtual Machine (JVM) for the Hive Metastore service.

Apache Hadoop Client Application Heap Size

The out-of-the-box Apache Hadoop installation sets the heap size for the client applications to 512 MB. Leaving
the memory size at its default value can cause out-of-memory errors when accessing large tables. To avoid this,
you need to increase the heap size to 2 GB by changing the HADOOP_CLIENT_OPTS variable setting within
the /usr/local/hadoop/hadoop-env.sh script.
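
For example, using the installation path mentioned above, the setting in /usr/local/hadoop/hadoop-env.sh could look like this:

# Raise the Hadoop client heap from the 512 MB default to 2 GB
export HADOOP_CLIENT_OPTS="-Xmx2g $HADOOP_CLIENT_OPTS"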

4.8 Teradata

4.8.1 About Teradata Database

There are two ways to connect to a Teradata database with Automated Analytics.

If there is no plan to use FastWrite, the recommended solution is to set up Data Access (refer to the section Using
Data Access for Teradata); however, when this feature cannot be implemented (due to the company IT policy, for
example), you can skip to the Teradata ODBC driver installation (refer to the section Using Teradata ODBC Driver).

4.8.2 Using Data Access for Teradata

The Automated Analytics standard installation includes a component named Data Access for Teradata.

This component allows connecting easily to a Teradata DBMS. You do not need to install a Teradata ODBC driver.
Note that this component is a regular Automated Analytics feature and is therefore subject to licensing.

4.8.2.1 Installing Prerequisite Software

To use Data Access for Teradata, you need to install the following Teradata client packages first:

● tdicu
● TeraGSS
● cliv2

These components are part of the Teradata Tools and Utilities Files (TTUF) and are commonly installed on any computer
that needs to access a Teradata server. Data Access for Teradata is compatible with TTUF 15.10 to TTUF 16.0.

4.8.2.1.1 Installing these Packages (if Needed)

 Note

We recommend installing the latest patches from Teradata. These patches must be installed after the CD
installation. They are available in the client patches section of Teradata website, after connecting to your
user account.

1. Insert the TTUF (Teradata Tools and Utilities Files) CD that contains the correct Teradata Client Software to
install. When inserted, the TTUF CD will allow selecting the Teradata package to install. The TTUF CD to use
depends on the Teradata DBMS version:

Teradata DBMS version TTUF version

V16.0 TTUF 16.0

V15.10 TTUF 15.10

2. Select cliv2, which is the only top-level package needed for Automated Analytics. The dependent
packages, teraGSS and tdicu, will be automatically selected.


4.8.2.1.2 Setting Up Data Access for Teradata

In the odbcconfig directory (located in the root directory):

1. Copy the file odbc.ini.KxenConnectors.teradata as odbc.ini (in the same directory). This can be
done typing the following commands:

 Code Syntax

cd <Installation Dir>/odbcconfig
cp odbc.ini.KxenConnectors.teradata odbc.ini

2. Edit the odbc.ini file and replace <MyDSN> with the name of the connection you want to use.
3. Replace all the <xxx> with your actual parameters (a filled-in example is sketched after these steps):
○ DBCNAME: the host name of the Teradata Server or its IP address.
○ DATABASE: the database to use in the Teradata Server.
4. Open the script KxProfile.sh located in the root directory.
5. Check that the line KXEN_ODBC_DRIVER_MANAGER=Kxen is not commented out.
6. The server processes can now be launched using the script kxen.server, located in the root directory.
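
For illustration, a filled-in odbc.ini entry for a hypothetical DSN named MY_TERADATA could look like this; the host and database values are placeholders, and the other options delivered in the template are kept as provided.

[ODBC Data Sources]
MY_TERADATA=Data Access for Teradata

[MY_TERADATA]
# Host name (or IP address) of the Teradata server - placeholder value
DBCNAME=teradata-host.example.com
# Database to use in the Teradata server - placeholder value
DATABASE=my_database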

Advanced Unicode Configuration:

Default setup is to manage data using ASCII charset in Teradata. If you need to manage any combination of
foreign charsets, a more advanced setup must be used.
1. Add these two lines in files KxenServer/KxCORBA.cfg and KxShell/KxShell.cfg:

MultiLanguageIsDefault=true
ODBCStoreSQLMapper.*.supportUnicodeOnData=utf8

2. Edit odbcconfig/odbc.ini. Instead of

# IANAAppCodePage=106
CharacterSet=ASCII

it contains:

IANAAppCodePage=106
CharacterSet=UTF8

 Note

The template file odbcconfig/odbc.ini.KxenConnectors.teradata is prefilled for such a configuration.

4.8.3 Using Teradata ODBC Driver
When Data Access for Teradata is not available, or if the Teradata ODBC driver is preferred for any reason, the
application supports the standard Teradata ODBC driver.

4.8.3.1 Conventions
To facilitate reading, the following naming conventions are used in this section.

They are presented in the following table:

The acronym... Refers to...

<Installation dir> the installation directory of Automated Analytics.

<Tdodbc dir> the installation directory of Teradata ODBC package. Usually /usr/odbc .

4.8.3.2 Installing Teradata ODBC Driver

 Recommendation

We recommend installing the latest patches from Teradata. These patches must be installed after the CD
installation. They are available in the client patches section of the Teradata website, after connecting to
your user account.

1. Insert the TTUF (Teradata Tools and Utilities Files) CD, which contains the Teradata Client Software to
install. When inserted, the TTUF CD will allow selecting the Teradata package to install.

Teradata DBMS version TTUF version

V16.0 TTUF 16.0

V15.10 TTUF 15.10

2. Select tdodbc, which is the only top-level package needed for Automated Analytics. The dependent
packages, teraGSS and tdicu, will be automatically selected.
3. Deactivate UnixODBC and activate the Teradata ODBC Driver manager.

 Code Syntax

cd <Installation Dir>/libs/tdodbc
ln -s /opt/teradata/client/ODBC_64/lib/libodbc.so libodbc.so.1

 Note

This step is required because on Linux, the ODBC driver manager (UnixODBC) installed during the standard
installation of Automated Analytics does not allow the application to connect to the Teradata database.

Advanced Unicode Configuration

By default, data is managed using the ASCII character set in Teradata. If you need to manage any combination of
foreign character sets, a more advanced setup is required:
1. Add this line in files KxenServer/KxCORBA.cfg and KxShell/KxShell.cfg:

MultiLanguageIsDefault=true

2. Edit odbcconfig/odbc.ini. Instead of:

# IANAAppCodePage=106
CharacterSet=ASCII

it contains:

IANAAppCodePage=106
CharacterSet=UTF8

 Note

The template file odbcconfig/odbc.ini.teradata is prefilled for such configuration.

4.8.3.3 Setting Up Teradata ODBC Driver

1. To specify the Teradata ODBC libraries directory (typically, /opt/teradata/client/ODBC_64/lib),
configure the environment variable corresponding to your operating system as listed in the following table.

For the Operating System... Set the Environment Variable...

Linux LD_LIBRARY_PATH

To configure these variables, edit the script KxProfile.sh, located in the root directory, as shown in the
following example.

 Sample Code

LD_LIBRARY_PATH=/opt/teradata/client/ODBC_64/lib
export LD_LIBRARY_PATH

2. In the directory odbcconfig (located in the root directory), copy the file odbc.ini.teradata as
odbc.ini (in the same directory) by typing the following commands:

 Code Syntax

cd <Installation dir>/odbcconfig
cp odbc.ini.teradata odbc.ini

3. Edit the odbc.ini file.
4. Replace all the <xxx> with the actual parameters (see the example after this list). The main parameters to modify are:

○ Driver: the exact path to the file tdata.so installed by Teradata.
○ DBCName: the host name of the Teradata server or its IP address.
○ Database & DefaultDatabase: the current Teradata database if the default one associated with the
current user is not correct.
5. Edit the script KxProfile.sh located in the root directory.
6. Uncomment the line #KXEN_ODBC_DRIVER_MANAGER=tdodbc.
7. The server processes can now be launched using the script kxen.server, located in the root directory.
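
As an illustration only, a DSN entry with the parameters above filled in might look like the following. The driver path, host name, and database name are example values to adapt to your installation; keep the other entries delivered in the template as they are.

 Sample Code

[MyTeradataDSN]
# path to the Teradata ODBC driver library (example; adapt to your installation)
Driver=/opt/teradata/client/ODBC_64/lib/tdata.so
# host name or IP address of the Teradata server (example value)
DBCName=teradata-host.example.com
Database=ANALYTICS_DB
DefaultDatabase=ANALYTICS_DB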

4.9 IBM DB2 V9

4.9.1 Installing DB2 Client

1. Type the following command lines:

$ gunzip v9fp7_linuxx64_client.tar.gz
$ tar xvf v9fp7_linuxx64_client.tar
$ cd client
$ su root
# ./db2setup

The setup window opens.


2. Click Install a Product.
3. Click Install New.
4. In the Installation type menu, check Custom and click Next.
The menus for customizing the installation are activated.
5. In the Installation action menu, check Install IBM Data Server Client and click Next.
6. In the Features menu, select the features DB2 Client, Client Support, and Administration Tools and click Next.
7. In the Instance setup menu, check Create a DB2 instance and click Next.
8. Check the New user option.
9. Enter db2inst1 as User name, db2grp1 as Group name, and the password corresponding to the
db2inst1 user as Password.
10. Click Next.
The last panel is displayed summarizing the installation parameters.
11. Click Finish to begin copying files and complete the installation procedure.

4.9.2 Setting Up the DB2 Client for Connection to the Server

1. Catalog the DB2 server by typing the following commands:

$ su - port

$ cd /opt/ibm/db2/V9.1/bin
$ db2
db2 => catalog tcpip node <logical name of DBMS> remote <DBMS server
hostname> server 50000

 Output Code

DB20000I The CATALOG TCPIP NODE command completed successfully.


DB21056W Directory changes may not be effective until the directory cache
is refreshed.

2. To catalog a specific database for the DBMS, type the following command:

db2 => catalog database <database name> at node <logical name of DBMS>

 Output Code

DB20000I The CATALOG DATABASE command completed successfully.


DB21056W Directory changes may not be effective until the directory cache
is refreshed.

3. Commit and publish the parameters previously set by typing the following commands:

db2 => db2stop

 Output Code

DB20000I The DB2STOP command completed successfully.

db2 => db2start

 Output Code

DB20000I The DB2START command completed successfully.

4. Test the connectivity by typing the following command:

db2 => quit

 Output Code

DB20000I The QUIT command completed successfully.

$ db2
db2 => connect to <database name> user <user name> using <password>

 Output Code

Database Connection Information


Database server = DB2/NT 9.5.0
SQL authorization ID = <user name>
Local database alias = <database name>
select * from

db2 => list tables
db2 => LIST NODE DIRECTORY
Node Directory
Number of entries in the directory = 1
Node 1 entry:
Node name = <logical name of DBMS>
Comment =
Directory entry type = LOCAL
Protocol = TCPIP
Hostname = <DBMS server hostname>
Service name = 50000
db2 => list database directory
System Database Directory
Number of entries in the directory = 1
Database 1 entry:
Database alias = <database name>
Database name = <database name>
Node name = <logical name of DBMS>
Database release level = b.00
Comment =
Directory entry type = Remote
Catalog database partition number = -1
Alternate server hostname =
Alternate server port number =

db2 => quit

4.9.3 Setting Up DB2 ODBC Driver

The driver must be configured to suit Automated Analytics requirements before you can use the application.

1. Open the script KxLauncher.sh located in the root directory, which is used to start Automated Analytics
processes.
2. Configure the environment variable <DB2INSTANCE> with the name of your database instance.

 Sample Code

DB2INSTANCE=port
export DB2INSTANCE

Where port is the name of the database instance, for example db2inst1.
3. Edit the .odbc.ini file.
4. Add the following lines:

$ cat $ODBCINI
[ODBC Data Sources]
<database name>=IBM DB2 ODBC Driver
[<database name>]
Description = DB2 on your server
Driver =/opt/ibm/db2/V9.1/lib64/libdb2.so
Host = <DBMS server hostname>
ServerType = Windows
FetchBufferSize = 99
UserName =
Password =
Database = <database name>

ServerOptions =
ConnectOptions =
Options =
ReadOnly = no

 Caution

The DSN name must be identical to the name of your IBM DB2 database.

5. Export the .odbc.ini file in the script KxLauncher.sh (see the sketch after this list). Note that for IBM DB2 UDB v7.2, it is possible to
use the driver /opt/IBMdb2/V7.1/lib/libdb2_36.so.
6. You can now start the server processes.
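
A minimal sketch of the export referred to in step 5, assuming the .odbc.ini file edited above resides in the home directory of the user running the server; adapt the path to the actual location of your .odbc.ini file.

 Sample Code

# assumed location of the edited .odbc.ini file; adapt as needed
ODBCINI=$HOME/.odbc.ini
export ODBCINI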

4.10 Sybase IQ

4.10.1 Installing Sybase IQ ODBC Driver

1. Download the Sybase IQ client 15.4 software from the Sybase website.

 Note

The actual name of the installer depends on the selected version. The next steps show the example of a
Sybase IQ 15.4 client network ESD #2 64 bits installer. Sybase IQ 15.2 client network is the minimal
supported version.

2. Enter the following commands to decompress the installer:

gunzip EBF20595.tgz
tar xvf ebf20595.tar
cd ebf20595

3. Enter the command ./setup.bin to configure XWindows and execute the installer.

 Caution

Depending on the location where the client installation is stored, this may require root privileges.

The installation wizard opens.


4. Click Next.
5. Choose the installation path and click Next.
6. Select the Custom option and click Next.
7. In the panel Choose Install Set, select the options Sybase IQ Client and Sybase IQ ODBC Driver.
8. In the panel Software License Type Selection, check the Evaluate Sybase IQ Client Suite [...] option. Note
that the network client is not subject to licensing, so there is no problem with the evaluation edition.
9. Select a localized End-user license agreement and check I agree to the terms […].
10. Click Next. A summary panel for the current installation process is displayed.
11. Check the installation information and click Install.

At the end of the installation process, the Install Complete panel is displayed. The installation procedure is
complete.
12. Click Done to close the installation wizard.

4.10.2 Setting Up Sybase IQ ODBC Driver

1. To specify the Sybase libraries directory (for example, /usr/SybaseIQ/IQ-15_4/lib64), configure the
environment variable corresponding to your operating system as listed in the following table.

For the Operating System... Set the Environment Variable...

SunOS LD_LIBRARY_PATH

Linux LD_LIBRARY_PATH

AIX LIBPATH

To configure these variables, edit the script KxProfile.sh, located in the root directory, as shown in the
following example.

 Sample Code

LD_LIBRARY_PATH=/usr/SybaseIQ/IQ-15_4/lib64
export LD_LIBRARY_PATH

2. In the directory odbcconfig (located in the root directory), copy the file odbc.ini.sybaseiq as
odbc.ini (in the same directory) by typing the following command:

 Code Syntax

cp odbc.ini.sybaseiq odbc.ini

3. Replace all the <xxx> with your actual parameters (see the example after this list). The main parameters to change are:
○ Driver: the exact path to the libdbodbc12.so file installed by Sybase
○ CommLinks: the network parameters of the Sybase IQ engine (host name and port)
○ EngineName: the name of the Sybase IQ engine running on the server described by CommLinks
○ Database: the actual Sybase IQ database if several databases are managed by the Sybase IQ engine
4. Use the script kxen.server, located in the root directory, to launch the server processes.
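
For illustration only, a DSN entry with the parameters above filled in might look like the following. The driver path, host name, port, engine name, and database name are example values to adapt to your environment.

 Sample Code

[MySybaseIQDSN]
# example driver path, based on the library directory shown above
Driver=/usr/SybaseIQ/IQ-15_4/lib64/libdbodbc12.so
# example host and port of the Sybase IQ engine
CommLinks=tcpip(host=iqhost.example.com;port=2638)
EngineName=myiqengine
Database=mydatabase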

4.11 Netezza

4.11.1 Installing and Setting Up the Netezza ODBC Driver

1. Install the Netezza ODBC archive provided by Netezza.
2. To specify the Netezza ODBC libraries directory, configure the environment variable corresponding to your
operating system as listed in the following table. For a standard installation of Netezza, the Netezza ODBC
libraries directory is /usr/local/nz/lib64.

For the Operating System... Set the Environment Variable...

SunOS LD_LIBRARY_PATH

Linux LD_LIBRARY_PATH

AIX LIBPATH

To configure these variables, edit the script KxProfile.sh, located in the root directory, as shown in the
following example.

LD_LIBRARY_PATH=/usr/local/nz/lib64
export LD_LIBRARY_PATH

3. In the directory odbcconfig/ located in the root directory, copy the odbc.ini.netezza file as
odbc.ini file by typing the following command:

cp odbc.ini.netezza odbc.ini

4. Replace all the <xxx> with your actual parameters. The main parameters to change are:
○ Driver: the exact path to libnzodbc.so file installed by Netezza. If you have executed a standard
installation of Netezza without modifying any parameters, keep the default value set up in the
odbc.ini file.
○ ServerName: the host name of the Netezza server or its IP address.
○ Database: the actual Netezza database.

 Note

To use another odbc.ini file, set the <NZ_ODBC_INI_PATH> variable located in the KxProfile.sh
file. The line containing the variable is preceded by a comment describing how to change its value.

5. The server processes can now be launched using the script kxen.server, located in the root directory.

 Note

If you plan to manage datasets using foreign character sets with Netezza, comment the line LC_ALL=C
in the KxProfile.sh file.

4.12 PostgreSQL

4.12.1 About PostgreSQL ODBC Driver

Automated Analytics has been tested and validated with PostgreSQL ODBC Driver 9.00.0200-1 for Red Hat
Enterprise 5 64 bits.

This driver and its associated support library can be found on the PostgreSQL website (http://
yum.pgrpms.org/9.0/redhat/rhel-5Server-x86_64/ ).

The exact files to download are:

● postgresql90-libs-9.0.4-2PGDG.rhel5.x86_64.rpm
● postgresql90-odbc-09.00.0200-1PGDG.rhel5.x86_64.rpm

4.12.2 Installing the PostgreSQL ODBC Driver

The driver is delivered as an rpm package.

Enter the following commands:

rpm -i postgresql90-libs-9.0.4-2PGDG.rhel5.x86_64.rpm
rpm -i postgresql90-odbc-09.00.0200-1PGDG.rhel5.x86_64.rpm

 Note

These commands must be executed with root’s administration rights.

4.12.3 Setting Up PostgreSQL ODBC Driver

1. Copy the file odbc.ini.postgresql as odbc.ini (in the same directory) using the following
commands:

cd <Installation dir>/odbcconfig
cp odbc.ini.postgresql odbc.ini

2. Edit the odbc.ini file and replace <MyDSN> with the connection name you want to use.
3. Replace all the <xxx> with the actual parameters (see the example after this list). The first two parameters set up the connectivity to your
PostgreSQL DBMS:
○ Servername is the name of the computer running the PostgreSQL DBMS
○ Database is the name of the database you need to access
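
For illustration only, a minimal DSN entry with these two parameters filled in could look as follows. The DSN name, host name, and database name are example values; keep the remaining entries from the template as delivered.

 Sample Code

[MyPostgreSQLDSN]
# computer running the PostgreSQL DBMS (example value)
Servername=pghost.example.com
# database to access (example value)
Database=mydatabase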

4.13 Vertica

4.13.1 Installing and Setting Up Vertica ODBC Driver

1. Install the Vertica ODBC archive provided by Vertica.


2. To specify the Vertica libraries directory (for example, /opt/vertica/lib64), configure the environment
variable corresponding to your operating system as listed in the following table.

For the Operating System... Set the Environment Variable...

SunOS LD_LIBRARY_PATH

Linux LD_LIBRARY_PATH

AIX LIBPATH

To configure these variables, edit the script KxProfile.sh, located in the root directory, as shown in the
following example.

 Sample Code

LD_LIBRARY_PATH=/opt/vertica/lib64
export LD_LIBRARY_PATH

3. In the directory odbcconfig (located in the root directory), copy the odbc.ini.vertica file as
odbc.ini (in the same directory). This can be done by typing the following command:

cp odbc.ini.vertica odbc.ini

4. Replace all the <xxx> with your actual parameters. The main parameters to change are:
○ Driver: the exact path to the file libverticaodbc_unixodbc.so installed by Vertica.
○ ServerName: the host name of the Vertica server or its IP address.
○ Database: the actual Vertica database.
5. The server processes can now be launched using the script kxen.server, located in the root directory.

4.14 SQL Server

4.14.1 About SQLServer

Microsoft provides an SQLServer ODBC driver for Linux OS.

 Caution

The connection of Automated Analytics to SQLServer 2005 using this driver has not been validated. Only
recent versions of SQLServer have been validated.

4.14.2 Installing and Setting Up Microsoft SQLServer Driver

Steps in this process assume that the Linux server can connect to the Internet. If an Internet connection is not
possible on the Linux server, download the unixODBC 2.3 archive (ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.0.tar.gz), and copy the file to the Linux server.

1. Download the driver for the appropriate version of Linux OS from the Microsoft web site .
2. Unpack it using the following command: tar xvfz msodbcsql-11.0.22x0.0.tar.gz.
3. Install unixODBC 2.3.

 Note

Microsoft strongly recommends using version 2.3 of unixODBC package. Depending on the version of
Linux OS, this package may be already installed. If it is missing, it can be installed from standard
package repositories. However, the Microsoft installer may not be able to find it even if it exists on the
machine. The Microsoft package provides a script allowing you to download, compile and install
unixODBC 2.3.

 Caution

This will potentially install the new driver manager over any existing unixODBC driver manager.

a. As a super user, execute the script:

build_dm.sh

Or specify the archive in the arguments: ./build_dm.sh --download-url=file:///<path to archive>/unixODBC-2.3.0.tar.gz
b. To install the driver manager, run the command:
cd /tmp/unixODBC.5744.19587.1964/unixODBC-2.3.0; make install
4. Install the ODBC driver using the following command:

./install.sh install

5. Type in "Yes" to accept the license.

Enter YES to accept the license or anything else to terminate the installation: YES
Checking for 64 bit Linux compatible OS .....................................
OK
Checking required libs are installed ........................................
OK
unixODBC utilities (odbc_config and odbcinst) installed .....................
OK
unixODBC Driver Manager version 2.3.0 installed .............................
OK
unixODBC Driver Manager configuration correct ..............................
OK*
Microsoft ODBC Driver 11 for SQL Server already installed ............ NOT
FOUND
Microsoft ODBC Driver 11 for SQL Server files copied ........................
OK
Symbolic links for bcp and sqlcmd created ...................................
OK
Microsoft ODBC Driver 11 for SQL Server registered ...................
INSTALLED
Install log created at /tmp/msodbcsql.5744.19587.1965/install.log.

One or more steps may have an *. See README for more information regarding
these steps.

6. In the directory odbcconfig (located in the root directory), copy the odbc.ini.SQLServer file as
odbc.ini file in the same directory by typing the following command:

cp odbc.ini.sqlserver odbc.ini

7. Replace all the <xxx> with your actual parameters (see the example after this list). The main parameters to change are:
○ Driver: the exact path to the driver library (usually /opt/Microsoft/msodbcsql/libmsodbcsql-11.0.so.2270.0).
○ Server: the host name of the SQLServer server or its IP address.
○ Database: the actual SQLServer database.
8. The server processes can now be launched using the script kxen.server, located in the root directory.
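
As an illustration only, a DSN entry with the parameters above filled in might look like the following. The host name and database name are example values; the driver path shown is the usual location mentioned above and may differ on your system.

 Sample Code

[MySQLServerDSN]
# usual driver location mentioned above; adapt if your installation differs
Driver=/opt/Microsoft/msodbcsql/libmsodbcsql-11.0.so.2270.0
# host name or IP address of the SQLServer server (example value)
Server=sqlhost.example.com
Database=mydatabase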

4.15 In-database Modeling

4.15.1 Native Spark Modeling


You use the Native Spark Modeling feature of SAP Predictive Analytics to delegate the data-intensive modeling
steps to Apache Spark.

 Note

Prerequisite: You have installed the recommended version of Hive as Native Spark Modeling is enabled by
default for Hive connections. Refer to the sections about Hive or SAP Vora in the related information below.

Native Spark Modeling uses a native Spark application developed in the Scala programming language. This
native Spark Scala approach ensures that the application can leverage the full benefits offered by Spark
including parallel execution, in-memory processing, data caching, resilience, and integration into the Hadoop
landscape.

Native Spark Modeling improves:

● Data transfer: as the data intensive processing is done close to the data, it reduces the data transfer back
to the SAP Predictive Analytics server or desktop.
● Modeling performance: the distributed computing power of Apache Spark improves the performance.
● Scalability: the training process scales with the size of the Hadoop cluster, enabling more model
trainings to be completed in the same time and with bigger or wider datasets. It is optimized for more
columns in the training dataset.
● Transparency: the delegation of the modeling process to Apache Spark is transparent to the end user. It
uses the same automated modeling process and familiar interfaces as before.

 Note

A more general term, "In-database Modeling", refers to a similar approach that delegates the data
processing steps to a database. The term "In-database Modeling" is sometimes used in the Automated
Analytics configuration and messages to refer to this broader approach.

Related Information

About SAP Vora [page 18]


About Hive Database [page 100]

4.15.1.1 Installation of Native Spark Modeling


The feature Native Spark Modeling in SAP Predictive Analytics requires the following four main components of
Hadoop:

● Core Hadoop
● YARN: The YARN resource manager helps control and monitor the execution of the Native Spark Modeling
Spark application on the cluster. For more information, refer to the section Configure the YARN Port [page
76].
● Hive: A data warehouse based on data distributed in Hadoop Distributed File System (HDFS). For more
information, refer to the section Setting Up Data Access for Hive [page 100] .
● Spark: A data processing framework.

Native Spark Modeling Architecture

As a prerequisite, you need to download the artefacts required to run Spark as they are not packaged with the
SAP Predictive Analytics installer.

 Caution

Any elements manually added through the installation process need to be manually removed on
uninstallation.

4.15.1.1.1 Hadoop Platform

The installation process differs depending on the operating system and choice of Hive or SAP Vora as a data
source. Refer to the SAP Product Availability Matrix http://service.sap.com/sap/support/pam .

 Caution

If you are an existing customer, open a ticket directly via your profile with Big Data Services support to
request the steps to install Native Spark Modeling on SAP Cloud Platform. If you are a new customer,
contact Big Data Services at https://www.altiscale.com/contact-us/

 Note

You cannot run Native Spark Modeling against multiple versions of Spark at the same time on the same
installation of Automated Analytics. As a workaround you can change the configuration to switch between
versions.

4.15.1.1.2 Recommendation for Automated Analytics Server Location

Native Spark Modeling shares the processing between the Automated Analytics Server or Workstation and the
Hadoop cluster in an interactive Spark session - this is called YARN client mode. In this mode, the driver
application containing the SparkContext coordinates the remote executors that run the tasks assigned by it.
The recommended approach for performance and scalability is to co-locate the driver application within the
cluster.

 Note

For performance and scalability install the Automated Analytics Server on a jumpbox, edge node or
gateway machine co-located with the worker nodes on the cluster.

A Hadoop cluster involves a large number of similar computers that can be grouped into four types of
machines:

● Cluster provisioning system with Ambari (for HortonWorks) or Cloudera Manager installed.
● Master cluster nodes that contain systems such as HDFS NameNodes and central cluster management
tools (such as the YARN resource manager and ZooKeeper servers).

● Worker nodes that do the actual computing and contain HDFS data.
● Jump boxes, edge nodes, or gateway machines that contain only client components. These machines allow
users to start their jobs from the cluster.

We recommend that you install Automated Analytics server on a jump box to get the following benefits:

● Reduction in latency:
The recommendation for Spark applications using YARN client mode is to co-locate the client (in this case
the Automated Analytics server) with the worker machines.
● Consistent configuration:
A jump box contains Spark client and Hive client installations managed by the cluster manager web
interface (Ambari or Cloudera Manager). As Native Spark Modeling uses YARN and Hive, it requires three
XML configuration files (yarn-site.xml, core-site.xml, and hive-site.xml). A symbolic link can be
used to the client XML files so that they remain synchronized with any configuration changes made in the
cluster manager web interface (see the sketch after this list).

Recommended Setup

This setup uses an Automated Analytics server co-located with the worker nodes in the cluster.

Limited Setup with Workstation Installation

 Note

This setup should only be used in a non-production environment.

Related Information

● How Spark runs on clusters


● Spark submit client mode recommendations

4.15.1.1.3 Install the Apache Spark Jar Files

Download the Apache Spark "pre-built for Hadoop 2.6 and later" version that is relevant for your enterprise
Hadoop platform.

The Spark lib directory is located in the compressed Spark binary file as shown in the last column of the table
below:

Download the relevant cluster file:


Cluster Type Direct Download URL Folder to Extract

Apache Hive (Spark version 1.6.1) http://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz spark-1.6.1-bin-hadoop2.6\lib

Use a file compression/uncompression utility to extract the lib folder contents from the Spark binary
download to the SparkConnector/jars folder.

 Note

Only the lib folder needs to be extracted. The spark-examples jar can be removed.

Directory structure before:

SparkConnector/jars/DROP_HERE_SPARK_ASSEMBLY.txt
SparkConnector/jars/idbm-spark-1_6.jar

 Example

Directory structure after extracting the Spark 1.6.1 lib folder content:

SparkConnector/jars/DROP_HERE_SPARK_ASSEMBLY.txt
SparkConnector/jars/idbm-spark-1_6.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-api-
jdo-3.2.6.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/spark-1.6.1-yarn-shuffle.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-
hadoop2.6.0.jar
SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-
hadoop2.6.0.jar -> this can be removed
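
One possible way to perform the extraction described above from the command line, assuming GNU tar, the downloaded archive in the current directory, and the commands run from the installation root directory:

 Sample Code

# extract only the lib folder of the Spark binary into SparkConnector/jars
tar -xzf spark-1.6.1-bin-hadoop2.6.tgz -C SparkConnector/jars spark-1.6.1-bin-hadoop2.6/lib
# the examples jar is not needed and can be removed
rm SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar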

4.15.1.1.4 Upload Spark Assembly Jar

The relatively large Spark assembly jar is uploaded to HDFS each time a Spark job is submitted. To avoid this,
you can manually upload the Spark assembly jar to HDFS. Refer to the sections Edit the SparkConnections.ini
File and Configure the YARN Port in the related information below to know how to specify the location of the
assembly jar in the SparkConnections.ini configuration file.

1. Log on to a cluster node.


2. Put the jar on HDFS.

 Example

 Sample Code

To put the Spark 1.6.1 assembly jar file into the /jars directory on HDFS on an Apache Hive cluster when the Spark assembly jar is in the /tmp directory, use the following commands:

hdfs dfs -mkdir /jars


hdfs dfs -copyFromLocal /tmp/spark-assembly-1.6.1-hadoop2.6.0.jar /jars

Related Information

Configure the YARN Port [page 76]


Edit the SparkConnections.ini File [page 133]

4.15.1.1.5 Copy SAP Vora jar from Cluster

You are connecting to SAP Vora 1.4 as a data source. An additional jar file, spark-sap-
datasources-1.4.2.12-vora-1.4-assembly.jar, is required for this connection.

 Note

The jar file location is set in the Spark.VoraJarFile property of the Spark.cfg file. Refer to the Edit the
Spark.cfg section in the related information below and the SAP Vora examples provided.

1. Get the jar file using one of the following ways:

○ Extract the jar file from the SAP Vora download package (tgz). It is located in the following directory:
vora-base\package\lib\vora-spark\lib\spark-sap-datasources-1.4.2.12-vora-1.4-
assembly.jar
○ Log on to a cluster node, preferably the jumpbox.
By default the jar file can be found at the following location:

Hadoop Vendor Default Location

HortonWorks /var/lib/ambari-agent/cache/
stacks/HDP/2.4/services/vora-base/
package/lib/vora-spark/lib/spark-sap-
datasources-1.4.2.12-vora-1.4-
assembly.jar

Cloudera

2. Copy the jar file to the SparkConnector/jars folder.

4.15.1.2 Connection Setup

Native Spark Modeling is enabled by default for Hive connections. (For more information, refer to the section
Setting Up Data Access for Hive [page 100].)

It requires specific connection configuration entries and three Hadoop client configuration files (hive-site.xml,
core-site.xml and yarn-site.xml).

As a prerequisite, you have installed a server with basic configuration. Refer to the Server Installation Guide for
Linux for more information.

If you run the installation on a non-Windows jumpbox on the cluster, create symbolic links to the hive-site.xml,
core-site.xml, and yarn-site.xml files.

If you run the installation outside the cluster, download the hive-site.xml, core-site.xml, and yarn-site.xml files.

4.15.1.2.1 Download the HortonWorks Client Configuration Files from Ambari

You want to download the Hadoop client configuration files to enable the YARN connection to a HortonWorks
cluster.

You have access to the Ambari web User Interface.

Download the client configuration files (hive-site.xml, core-site.xml, and yarn-site.xml at a minimum) from the HortonWorks Ambari web User Interface:

1. Log on to the Ambari web User Interface.


2. Download the YARN client configuration files (including the core-site.xml and yarn-site.xml files)
a. Go to the menu Services and click on Yarn service.
b. On the right side, click on Service actions.
c. From the dropdown list, select Download Client Configs.

The downloaded file contains the core-site.xml and yarn-site.xml files.


d. In the core-site.xml file, remove or comment out the net.topology.script.file.name
property.

 Sample Code

Example of the commented property

<!--<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>-->

3. Download the Hive client configuration files including the hive-site.xml file:
a. Go to the Services menu and click on Hive service.
b. On right side, click on Service actions.
c. From the dropdown list, select Hive Client Download Client Configs.

4. In the Hive-site.xml file, remove all properties except for the hive.metastore.uris property.

The Hive-site.xml file must contain only the following code.

 Sample Code

<?xml version="1.0" encoding="UTF-8"?>


<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<server name></value>
</property>
</configuration>

4.15.1.2.2 Download the Cloudera Client Configuration Files from Cloudera Manager

You want to download the Hadoop client configuration files to enable the YARN connection to a Cloudera
cluster.

Download the following client configuration files from the Cloudera Manager web User Interface:

Configuration File Client

hive-site.xml Hive

core-site.xml YARN

yarn-site.xml YARN

1. Log on to the Cloudera Manager web User Interface.


2. Download the YARN client configuration files.
a. From the Home menu, select the Yarn service.
b. On the right side, click Actions.
c. From the dropdown list, select Download Client Configuration.

The downloaded file contains the core-site.xml and the yarn-site.xml.


d. In the core-site.xml file, remove or comment out the net.topology.script.file.name
property.

 Sample Code

Example of the commented property

<!--<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>-->

3. Download the Hive client configuration files.


a. From the Home menu, select the Hive service.

b. On the right side, click Actions.
c. From the dropdown list, select Download Client Configuration.
d. In the Hive-site.xml file, remove all properties except for the hive.metastore.uris property.

The Hive-site.xml file must contain only the following code.

 Sample Code

<?xml version="1.0" encoding="UTF-8"?>


<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<server name></value>
</property>
</configuration>

4.15.1.2.3 Create a Directory for the Client Configuration Files

You have downloaded the three Hadoop client configuration XML files.

Create a directory to store the files:

1. Change directory to the SparkConnector/hadoopConfig directory from the installation root directory.
2. Create a directory with the name of the Hive ODBC connection.
3. Copy the three Hadoop client configuration files to the new directory.
4. Change the ownership of the directory and files to kxenadmin:kxenusers.
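
A shell sketch of these steps, assuming a Hive ODBC connection named MY_HIVE_DSN and client configuration files downloaded to /tmp; the DSN name and paths are example values to adapt to your environment.

 Sample Code

cd <Installation dir>/SparkConnector/hadoopConfig
# directory named after the Hive ODBC connection (example name)
mkdir MY_HIVE_DSN
cp /tmp/hive-site.xml /tmp/core-site.xml /tmp/yarn-site.xml MY_HIVE_DSN/
chown -R kxenadmin:kxenusers MY_HIVE_DSN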

4.15.1.3 Configure the Spark Connection

You configure Spark on YARN connection using the two property files located under the SparkConnector
directory listed in the following table.

File Name Content

Spark.cfg Common properties that apply to all Spark connections

SparkConnections.ini Configuration properties for Spark connections linked to specific Hive ODBC data source names

The two tables below describe the properties available in each property file.

Spark.cfg properties

SparkAssemblyLibFolder
Description: Mandatory. Location of the downloaded Spark lib folder. The file is configured by default for Spark 1.6.1. If you are using SAP Vora, change the value. For more information, see the section Edit the Spark.cfg File in the related information at the bottom of this topic.
Default/Expected values: Apache Hive or SAP Vora 1.4 (default): "../SparkConnector/jars/spark-1.6.1-bin-hadoop2.6/lib"

SparkConnectionsFile
Description: Mandatory. Location of the connections file. For more information, see Edit the SparkConnections.ini File [page 74].
Default/Expected values: "../SparkConnector/SparkConnections.ini"

JVMPath
Description: Mandatory. Location of the Java Virtual Machine used by the Spark connection. This location is OS dependent.
Default/Expected values: "../jre/lib/{OS}/server" (default) or "../../j2re/bin/server"

IDBMJarFile
Description: Mandatory. Location of the Native Spark Modeling application jar. Note: the file is configured by default for Cloudera. If you are using a HortonWorks cluster, change the value.
Default/Expected values: Apache Hive or SAP Vora 1.4 (default): "../SparkConnector/jars/idbm-spark-1_6.jar"; SAP Vora 1.2: "../SparkConnector/jars/idbm-spark-1_5.jar"

VoraJarFile
Description: Location of the SAP Vora data source jar file.
Default/Expected values: "../SparkConnector/jars/spark-sap-datasources-1.4.2.12-vora-1.4-assembly.jar"

.native."property_name"
Description: A native Spark property can be specified by adding native and quotes to the property name and value, for example: SparkConnection.$MyDSN.native."spark.executor.memory"="2g"
Default/Expected values: Apache Hive or SAP Vora 1.4 (Spark 1.6.1): refer to http://spark.apache.org/docs/1.6.1/configuration.html

HadoopHome
Description: Mandatory. Internal Hadoop home containing a bin subdirectory. Mainly used for Windows (for winutils.exe).
Default/Expected values: "../SparkConnector/"

HadoopConfigDir
Description: Mandatory. Location where the Hadoop client configuration XML files are copied during modeling.
Default/Expected values: /tmp

log4jOutputFolder
Description: Location of the Native Spark Modeling log output.
Default/Expected values: "/tmp"

log4jConfigurationFile
Description: Location of the log4j logging configuration property file. Change this location to control the level of logging or the behavior of log recycling.
Default/Expected values: "SparkConnector/log4j.properties" (default)

JVMPermGenSize
Description: The permanent generation size of the Java Virtual Machine used by the Spark connection.
Default/Expected values: the recommended value is 256 (for 256 MB)

DriverMemory
Description: Increase the driver memory if you encounter out-of-memory errors.
Default/Expected values: the recommended value is 8192 (for 8192 MB)

CreateHadoopConfigSubFolder
Description: If the value is "true", a subfolder is created in the HadoopConfigDir folder for each modeling process ID. Each subfolder contains the Hadoop client configuration files. If the value is "false", no subfolder is created and the HadoopConfigDir folder is used.
Default/Expected values: true (default) or false

AutoPort
Description: Enables or disables automatic configuration of YARN ports. When the value is "true", the port range can be configured with the MinPort and MaxPort properties. When it is "false", YARN automatically assigns ports. This option helps when configuring a YARN connection across a network firewall.
Default/Expected values: true (default) or false

MinPort
Description: See the AutoPort property.
Default/Expected values: 55000

MaxPort
Description: See the AutoPort property.
Default/Expected values: 56000

AutoIDBMActivation
Description: Allows activating Native Spark Modeling when matching entries exist in the SparkConnections.ini file for the ODBC connection.
Default/Expected values: true (default) or false

The SparkConnections.ini properties

HadoopConfigDir
Description: Mandatory. Location of the Hadoop client configuration XML files (hive-site.xml, core-site.xml, yarn-site.xml).
Default value: ../SparkConnector/hadoopConfig/$DATASOURCENAME

HadoopUserName
Description: Mandatory. User with the privilege to run Spark jobs.
Default value: hive

ConnectionTimeOut
Description: Period (in seconds) after which a Spark job will time out. It is useful to change the value to minutes in case of troubleshooting.
Default value: 1000000 (seconds)

.native."property_name"
Description: A native Spark property can be specified by adding native and quotes to the property name and value, for example: SparkConnection.$MyDSN.native."spark.executor.memory"="2g"
Default value: Cloudera (Spark 1.5.0): refer to http://spark.apache.org/docs/1.5.0/configuration.html; HortonWorks (Spark 1.4.1): refer to http://spark.apache.org/docs/1.4.1/configuration.html

Related Information

Edit the Spark.cfg File [page 132]

4.15.1.3.1 Edit the Spark.cfg File

You want to change a default setting relating to all Spark connections.

1. Change directory to the SparkConnector directory from the installation root directory.
2. Edit the Spark.cfg file and add the necessary properties. All property names begin with Spark..

Set the version of the jar files depending on the data source:
○ Hive and SAP Vora 1.4 are supported with Spark version 1.6.1

 Note

Most of the settings can remain at the default values.

3. For Workstation installation only: restart the client after making any changes to the Spark.cfg file.

 Example

 Sample Code

# Native Spark Modeling settings. See SparkConnection.ini for DSN specific


settings.

# ##### HIVE on HortonWorks HDP 2.4.2 or Cloudera CDH 5.8.x (Spark 1.6.1)
#####
Spark.SparkAssemblyLibFolder="../../../SparkConnector/jars/spark-1.6.1-bin-
hadoop2.6/lib"
Spark.IDBMJarFile="../../../SparkConnector/jars/idbm-spark-1_6.jar"

# ##### MEMORY TUNING #####


# increase the Spark Driver memory (in MB) if you encounter out of memory
errors
# Note: When using Predictive Analytics Workstation for Windows then the
Spark Driver memory is shared with the Desktop client.
# If increasing the Spark Driver memory you also need to modify
KJWizard.ini vmarg.1 setting for JVM maximum heap size (e.g. vmarg.1=-
Xmx4096m).
# It is recommended to set the Spark Driver memory to approximately 75% of
the vmarg.1 value.
Spark.DriverMemory=1024

#
###########################################################################
##
# ##### RECOMMENDED TO KEEP TO DEFAULTS #####
# Spark JVM and classpath settings.
Spark.JVMPath="../../j2re/bin/server"
Spark.HadoopConfigDir=$KXTEMPDIR
Spark.HadoopHome="../../../SparkConnector/"

# per connection configuration


Spark.SparkConnectionsFile="../../../SparkConnector/SparkConnections.ini"

Spark.AutoIDBMActivation=true
Spark.DefaultOutputSchema="/tmp"

# set permanent generation size (in MB) as spark assembly jar files
contain a lot of classes
Spark.JVMPermGenSize=256

# location of the logging properties file


Spark.log4jConfigurationFile="../../../SparkConnector/log4j.properties"

4.15.1.3.2 Edit the SparkConnections.ini File

You want to configure a new Spark connection linked to a particular ODBC connection to enable Native Spark
Modeling. You know the name of the ODBC connection (DSN).

1. Change directory to the SparkConnector directory under the install root directory.
2. Edit the SparkConnections.ini file. All property names begin with SparkConnection.

 Note

The default SparkConnections.ini file does not contain any entries. Samples are provided for each
supported cluster type (Cloudera and HortonWorks) with a Hive DSN in
SparkConnections_samples.txt. Use these samples as a template for creating your own
configuration.

Default SparkConnections.ini content

# This file contains Spark configuration entries that are specific for each
particular data source name (DSN).
#
# To enable Native Spark Modeling against a data source, define at least the
minimum properties.
# Create a separate set of entries for each DSN. Start each entry with the
text "SparkConnection" followed by the DSN name.

# Note: only Hive is supported with Native Spark Modeling on Windows.


# For SAP HANA Vora on Windows you could use a client/server installation
with a Windows client to a Linux server/Vora jumpbox.

# 2 mandatory parameters must be set for each DSN -


# 1. hadoopUserName, a user name with privileges to run Spark on YARN and
# 2. hadoopConfigDir, the directory of the Hadoop client configuration
files for the DSN (the directory with the core-site.xml, yarn-site.xml, hive-
site.xml files at a minimum)
#
# It is highly recommended to upload the spark assembly jar to HDFS,
especially on Windows.
# for example, for a DSN called MY_DSN and a Spark 1.6.1 assembly jar in the HDFS jars folder -
# SparkConnection.MY_DSN.native."spark.yarn.jar"="hdfs://hostname:8020/
jars/spark-assembly-1.6.1-hadoop2.6.0.jar"
#
# It is possible to pass in native Spark configuration parameters using
"native" in the property.
# for example, to add the "spark.executor.instances=4" native Spark
configuration to a DSN called MY_DSN -
# SparkConnection.MY_DSN.native."spark.executor.instances"="4"
#
# #########################################
# Specific settings for HortonWorks HDP clusters
#
# These 2 properties are also mandatory for HortonWorks clusters and must
match the HDP version exactly.
# (hint: get the correct value from the spark-defaults.conf Spark client
configuration file or Ambari)
# Example for HortonWorks HDP 2.4.2
# SparkConnection.MY_HDP_DSN.native."spark.yarn.am.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"
# SparkConnection.MY_HDP_DSN.native."spark.driver.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"
#
# #########################################
# Refer to the SparkConnections_samples.txt file for sample content.
########################################################################

 Note

There are two properties specifically for SAP Vora:


○ spark.sap.autoregister
○ spark.vora.discovery
These properties are required to support SAP Vora views containing Vora tables as a data source.

4.15.1.3.2.1 Sample Content for a HortonWorks Cluster with a
Hive DSN

Sample SparkConnections.ini additional content for a HortonWorks HDP 2.4.2 (Spark 1.6.1) Hive DSN
named MY_HDP242_HIVE_DSN.

 Sample Code

# Sample entries for a HortonWorks HDP 2.4.2 (Spark 1.6.1) Hive DSN with name
MY_HDP242_HIVE_DSN
# upload the spark 1.6.1 assembly jar to HDFS and reference the HDFS location
here
SparkConnection.MY_HDP242_HIVE_DSN.native."spark.yarn.jar"="hdfs://hostname:
8020/jars/spark-assembly-1.6.1-hadoop2.6.0.jar"
# hadoopConfigDir and hadoopUserName are mandatory
# hadoopConfigDir – use relative paths to the Hadoop client XML config files
(yarn-site.xml, core-site.xml and hive-site.xml at a minimum)
SparkConnection.MY_HDP242_HIVE_DSN.hadoopConfigDir="../../../SparkConnector/
hadoopConfig/MY_HDP242_HIVE_DSN"
SparkConnection.MY_HDP242_HIVE_DSN.hadoopUserName="hive"
# HORTONWORKS SPECIFIC: these 2 properties are also mandatory for HortonWorks
clusters and need to match the HDP version exactly
# sample values for HDP 2.4.2
SparkConnection.MY_HDP242_HIVE_DSN.native."spark.yarn.am.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"
SparkConnection.MY_HDP242_HIVE_DSN.native."spark.driver.extraJavaOptions"="-
Dhdp.version=2.4.2.0-058"
# (optional) time out in seconds
SparkConnection.MY_HDP242_HIVE_DSN.connectionTimeOut=1000
# (optional) performance tuning
#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.executor.instances"="4"
#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.executor.cores"="2"
#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.executor.memory"="4g"
#SparkConnection.MY_HDP242_HIVE_DSN.native."spark.driver.maxResultSize"="4g"

4.15.1.3.2.2 Sample Content for a HortonWorks Cluster with SAP Vora DSN

Sample SparkConnections.ini additional content for a HortonWorks HDP 2.4.2 (Spark 1.6.1) SAP Vora
connection with a (Simba) Spark SQL DSN with name MY_HDP242_VORA_DSN.

 Sample Code

# Sample entries for a HortonWorks HDP 2.4.2 (Spark 1.6.1) SAP Vora DSN with name MY_HDP242_VORA_DSN
# upload the spark 1.6.1 assembly jar to HDFS and reference the HDFS location here
SparkConnection.MY_HDP242_VORA_DSN.native."spark.yarn.jar"="hdfs://hostname:8020/jars/spark-assembly-1.6.1-hadoop2.6.0.jar"

# hadoopConfigDir and hadoopUserName are mandatory


# hadoopConfigDir – use relative paths to the Hadoop client XML config files
(yarn-site.xml, core-site.xml and hive-site.xml as a minimum)
# NOTE: for Vora the hive-site.xml can be an empty file
SparkConnection.MY_HDP242_VORA_DSN.hadoopConfigDir="../../../SparkConnector/hadoopConfig/MY_HDP242_VORA_DSN"

SparkConnection.MY_HDP242_VORA_DSN.hadoopUserName="spark"

# HORTONWORKS SPECIFIC: these 2 properties are also mandatory for HortonWorks


clusters and need to match the HDP version exactly
# sample values for HDP 2.4.2. Actual version should be taken from Ambari
SparkConnection.MY_HDP242_VORA_DSN.native."spark.yarn.am.extraJavaOptions"="-Dhdp.version=2.4.2XXXX"
SparkConnection.MY_HDP242_VORA_DSN.native."spark.driver.extraJavaOptions"="-Dhdp.version=2.4.2XXXX"

# VORA SPECIFIC: add auto-register settings to support Vora views

SparkConnection.MY_HDP242_VORA_DSN.native."spark.sap.autoregister"="com.sap.spark.vora"

# (optional) time out in seconds


SparkConnection.MY_HDP242_VORA_DSN.connectionTimeOut=1000

# (optional) performance tuning


#SparkConnection.MY_HDP242_VORA_DSN.native."spark.executor.instances"="4"
#SparkConnection.MY_HDP242_VORA_DSN.native."spark.executor.cores"="2"
#SparkConnection.MY_HDP242_VORA_DSN.native."spark.executor.memory"="4g"
#SparkConnection.MY_HDP242_VORA_DSN.native."spark.driver.maxResultSize"="4g"

4.15.1.3.2.3 Sample Content for a Cloudera Cluster with Hive DSN

Sample SparkConnections.ini additional content for a Cloudera CDH 5.8.x (Spark 1.6.1) Hive DSN named
MY_CDH58_HIVE_DSN.

 Sample Code

# Sample for a Cloudera CDH 5.8.x (Spark 1.6.1) Hive DSN with name
MY_CDH58_HIVE_DSN
# upload the spark 1.6.1 assembly jar to HDFS and reference the HDFS location
here
SparkConnection.MY_CDH58_HIVE_DSN.native."spark.yarn.jar"="hdfs://hostname:
8020/jars/spark-assembly-1.6.1-hadoop2.6.0.jar"
# hadoopConfigDir and hadoopUserName are mandatory
# hadoopConfigDir – use relative paths to the Hadoop client XML config files
(yarn-site.xml, core-site.xml and hive-site.xml at a minimum)
SparkConnection.MY_CDH58_HIVE_DSN.hadoopConfigDir="../../../SparkConnector/
hadoopConfig/MY_CDH58_HIVE_DSN"
SparkConnection.MY_CDH58_HIVE_DSN.hadoopUserName="hive"
# (optional) time out in seconds
SparkConnection.MY_CDH58_HIVE_DSN.connectionTimeOut=1000
# (optional) performance tuning
#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.executor.instances"="4"
#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.executor.cores"="2"
#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.executor.memory"="4g"
#SparkConnection.MY_CDH58_HIVE_DSN.native."spark.driver.maxResultSize"="4g"

4.15.1.3.2.4 Sample Content for a Cloudera Cluster with SAP
Vora DSN

Sample SparkConnections.ini additional content for a Cloudera CDH 5.8.x (Spark 1.6.1) Vora DSN with
name MY_CDH58_VORA_DSN.

 Sample Code

# Sample for a Cloudera CDH 5.8.x (Spark 1.6.1) Vora DSN with name
MY_CDH58_VORA_DSN

# upload the spark 1.6.1 assembly jar to HDFS and reference the HDFS location
here
SparkConnection.MY_CDH58_VORA_DSN.native."spark.yarn.jar"="hdfs://hostname:
8020/jars/spark-assembly-1.6.1-hadoop2.6.0.jar"

# hadoopConfigDir and hadoopUserName are mandatory


# hadoopConfigDir – use relative paths to the Hadoop client XML config files
(yarn-site.xml, core-site.xml and hive-site.xml at a minimum)
SparkConnection.MY_CDH58_VORA_DSN.hadoopConfigDir="../../../SparkConnector/
hadoopConfig/MY_CDH58_VORA_DSN"
SparkConnection.MY_CDH58_VORA_DSN.hadoopUserName="spark"

# VORA SPECIFIC: add auto-register settings to support Vora views


SparkConnection.MY_CDH58_VORA_DSN.native."spark.sap.autoregister"="com.sap.spark.vora"
# (optional) time out in seconds
SparkConnection.MY_CDH58_VORA_DSN.connectionTimeOut=1000

# (optional) performance tuning


#SparkConnection.MY_CDH58_VORA_DSN.native."spark.executor.instances"="4"
#SparkConnection.MY_CDH58_VORA_DSN.native."spark.executor.cores"="2"
#SparkConnection.MY_CDH58_VORA_DSN.native."spark.executor.memory"="4g"
#SparkConnection.MY_CDH58_VORA_DSN.native."spark.driver.maxResultSize"="4g"

4.15.1.3.3 Configure the YARN Port


You want to use specific ports for the Spark connection.

Native Spark Modeling uses the YARN cluster manager to communicate, deploy and execute the predictive
modeling steps in Spark. The cluster manager distributes tasks throughout the compute nodes of the cluster,
allocates the resources across the applications and monitors using consolidated logs and web pages. YARN is
one of the main cluster managers used for Spark applications. Once connected to the cluster via YARN, Spark
acquires executors on nodes in the cluster, which are processes that run the modeling steps. Next, it deploys
and executes the modeling steps through a series of tasks, executed in parallel. Native Spark Modeling uses the
YARN client mode to enable direct control of the driver program from Automated Analytics. It needs to
communicate across the network to the Hadoop cluster to:

● Upload the SAP Predictive Analytics application (a jar file).


● Initiate and monitor the Spark application remotely.

The communication across the network may be blocked by firewall rules. To enable the communication to flow
through the firewall, configure the ports used by YARN:

1. Change directory to the SparkConnector directory from the install root directory.
2. Edit the SparkConnections.ini file to set specific ports for the YARN connection:

# Spark ODBC DSN = MY_HIVE


# security (see http://spark.apache.org/docs/latest/security.html)
# From Executor To Driver. Connect to application / Notify executor state
changes. Akka-based. Set to "0" to choose a port randomly.
SparkConnection.MY_HIVE.native."spark.driver.port"="55300"
# From Driver To Executor. Schedule tasks. Akka-based. Set to "0" to choose a
port randomly.
SparkConnection.MY_HIVE.native."spark.executor.port"="55301"
# From Executor To Driver. File server for files and jars. Jetty-based
SparkConnection.MY_HIVE.native."spark.fileserver.port"="55302"
# From Executor To Driver. HTTP Broadcast. Jetty-based. Not used by
TorrentBroadcast, which sends data through the block manager instead.
SparkConnection.MY_HIVE.native."spark.broadcast.port"="55303"
# From Executor/Driver To Executor/Driver. Block Manager port. Raw socket via
ServerSocketChannel
SparkConnection.MY_HIVE.native."spark.blockManager.port"="55304"

4.15.1.4 Restrictions

Thanks to Spark, the building of Automated Analytics models can run on Hadoop with better performance,
higher scalability, and no data transfer back to the SAP Predictive Analytics server or desktop. For models
using Hadoop as a data source, the model training computations are by default delegated to the Spark engine
on Hadoop whenever possible.

To run the training on Spark, you must fulfill the conditions described below:

● You have installed SAP Predictive Analytics 2.5 or higher.


● You are connected to Hadoop data source using the Hive ODBC driver from Automated Analytics.
● You have installed the required versions of the Hadoop distribution (see Hadoop Platform [page 64]), Hive (see
About Hive Database [page 100]), and Spark.
● You have installed the Apache Spark binary and Spark assembly jar (see Install the Apache Spark Jar Files
[page 124]).

If the delegation to Spark is not possible or if the option is not selected, the Automated Analytics modeling
engine does the training computations on a machine where SAP Predictive Analytics is installed (for example,
execution of the Automated Analytics algorithms).

 Restriction

As of SAP Predictive Analytics version 2.5, Native Spark Modeling is supported for the training of
classification and regression models with a single target. All other types of models are mainly handled by the
Automated Analytics engine.

Refer to the SAP Note 2278743 for more details on restrictions.

To change the default behaviour of the system:

1. Select File > Preferences or press F2.


2. On the Model Training Delegation panel, deselect the option.
3. Click OK to save your changes.

 Note

When editing the preferences, you can restore the default settings by clicking the Reset button.

4.15.1.5 Performance Tuning

Split HDFS file

If the Hive table is created as an external table on an HDFS directory, check whether this directory contains multiple
files. Each file represents a partition when the Hive table is processed in Native Spark Modeling.
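
For example, you can list the HDFS directory that backs the external table to see how many files, and therefore partitions, it contains; the path below is a placeholder to replace with your own table location.

 Sample Code

# count the files backing the external Hive table (placeholder path)
hdfs dfs -ls /data/my_external_table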

Setting Spark Executor Core number - spark.executor.cores

When running Native Spark Modeling on a cluster, do not assign all available CPU cores to Spark executors, as this
could prevent Hive from fetching the results due to a lack of resources. For instance, assign only 75% of the CPU cores to
Spark. If you have 12 cores on a node machine, set spark.executor.cores=9.

Recommended Spark settings

# Spark Executor settings


SparkConnection.MY_HIVE.native."spark.executor.instances"="4"
SparkConnection.MY_HIVE.native."spark.executor.cores"="16"
# if you use many threads => increase memory
SparkConnection.MY_HIVE.native."spark.executor.memory"="16g"

Install Automated Analytics server on a jump box on the cluster itself.

It is recommended to install the Automated Analytics server on a jump box on the cluster itself. The memory of the jump box must be large enough to hold the result sets generated by the Native Spark Modeling application.

 Note

For more details, refer to the section Recommendation for Automated Analytics Server Location [page 64]
in the related information below.

Related Information

Recommendation for Automated Analytics Server Location [page 64]

4.15.1.6 Troubleshooting

This section describes how to troubleshoot particular message codes that can appear in the logs.

 Note

The message codes starting with "KXEN_W" are warning messages and the processing will still continue.

The message codes starting with "KXEN_I" are information messages and the processing will still continue.

The following table matches each message code with its remediation steps:

Message Code / Remediation Steps

KXEN_W_IDBM_UNSUPPORTED_FEATURES_DETECTED
KXEN_W_IDBM_REGRESSION_MODELS_NOT_SUPPORTED
KXEN_W_IDBM_INCOMPATIBLE_VARIABLE_USAGE
KXEN_W_IDBM_SPACE_ADV_PARAMS_NOT_SUPPORTED
KXEN_W_IDBM_INCOMPATIBLE_TRANSFORM_FEATURE_USAGE
    An unsupported feature was detected. The warning message provides a more detailed description. Native Spark Modeling will be "off" and the normal training process will continue.

KXEN_W_IDBM_USER_STRUCTURES_NOT_SUPPORTED
KXEN_W_IDBM_COMPOSITE_VARIABLES_NOT_SUPPORTED
KXEN_W_IDBM_CUSTOM_PARTITION_STRATEGY_NOT_SUPPORTED
KXEN_W_IDBM_PARTITION_STRATEGY_NOT_SUPPORTED
KXEN_W_IDBM_UNSUPPORTED_TRANSFORM
KXEN_W_IDBM_K2R_GAIN_CHARTS_NOT_SUPPORTED
KXEN_W_IDBM_RISK_MODE_K2R_NOT_SUPPORTED
KXEN_W_IDBM_DECISION_MODE_K2R_NOT_SUPPORTED
KXEN_W_IDBM_K2R_ORDER_NOT_SUPPORTED
KXEN_W_IDBM_MULTI_TARGETS_MODE_NOT_SUPPORTED
    To use Native Spark Modeling for this model, either remove the affected variable(s) or avoid the feature if possible.

KXEN_E_EXPECT_IDBM_FILESTORE_NATIVESPARK_PAIRING
    Add the mandatory entries in the SparkConnector/SparkConnections.ini file for the connection. Refer to the section Configure the Spark Connection [page 129].

KXEN_I_IDBM_AUTO_ACTIVATION
    Native Spark Modeling has been activated.

KXEN_E_IDBM_MISSING_HADOOP_XML
    A Hadoop client configuration (XML) file is missing. Check the Spark.cfg to ensure it is pointing to the right Spark assembly jar for your cluster. Refer to the section Connection Setup [page 127].

KXEN_E_IDBM_MISSING_JARS
    A required jar file was not found. Refer to the section Installation of Native Spark Modeling [page 121].

4.15.2 Modeling in SAP HANA

You can delegate the model training computations to SAP HANA if the following prerequisites are fulfilled:

● APL (version 2.4 or higher) must be installed on the SAP HANA database server.
● The minimum required version of SAP HANA is SPS 10 Database Revision 102.02 (SAP HANA 1.00.102.02).
● The APL version must be the same as the version of SAP Predictive Analytics desktop or server.
● The SAP HANA user connecting to the ODBC data source must have permission to run APL.

For more information, see the SAP HANA Automated Predictive Library Reference Guide on the SAP Help
Portal at http://help.sap.com/pa.
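Before relying on the delegation, you can check that APL is visible to your SAP HANA user. The following query is only a sketch; it assumes the standard APL_AREA registration and requires read access to the SYS.AFL_FUNCTIONS system view:

-- List the APL functions registered in this SAP HANA instance.
-- An empty result usually means APL is not installed or not visible to your user.
SELECT AREA_NAME, FUNCTION_NAME
FROM "SYS"."AFL_FUNCTIONS"
WHERE AREA_NAME = 'APL_AREA'
ORDER BY FUNCTION_NAME;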

 Restriction

Cases when model training is not delegated to APL:

● In the Recommendation and Social Analysis modules.


● When the model uses a custom partition strategy.
● When the model uses the option to compute a decision tree.

To deselect the default behavior:

1. Select File > Preferences or press F2.


2. On the Model Training Delegation panel, deselect the option.
3. Click OK to save your changes.

4.16 Special Case

4.16.1 Installing and Setting Up unixODBC on Suse 11

The unixODBC software package is an ODBC driver manager and is required to connect to a DBMS using the standard ODBC technology. Depending on the version of Linux currently used, the process to install this package can vary. The following section describes the process to install unixODBC on Linux SUSE 11.

4.16.1.1 Checking the Setup of the Software Package


Repository

If the operating system has been installed with the default setup, the Software Package repository is set up to
use the installation DVD.

 Code Syntax

sudo zypper repos

 Output Code

root's password:
# | Alias                                             | Name                                              | Enabled | Refresh
--+---------------------------------------------------+---------------------------------------------------+---------+--------
1 | SUSE-Linux-Enterprise-Server-11-SP2 11.2.2-1.234   | SUSE-Linux-Enterprise-Server-11-SP2 11.2.2-1.234  | Yes     | No

5 Importing Flat Data Files Into Your DBMS

5.1 Introduction

This document explains how to import flat data files into the following database management systems
(DBMS):

● Oracle 8i and lower versions.


● IBM DB2 v7.2 and lower versions.
● Microsoft Access 2000.

It presents scripts that enable you to:

● Create target tables – which correspond to the flat data files to import – in your DBMS,
● Import flat data files into these tables.

These scripts have been tested and will work under Linux and Windows operating systems (OS). All the
examples used for the scripts are based on the flat data files provided as samples for the Data Manager - Event
Logging use case. For more information, see the Event Log Aggregation Scenario - Data Manager User Guide on
SAP Help Portal at http://help.sap.com/pa.

5.2 Importing Flat Files into an Oracle Database

To import a flat file into an Oracle database, you need to:

1. Create a target table, into which the flat file will be imported.
2. Create a control file that specifies the import settings.
3. Run the control file using the Oracle tool SQL*Loader.

5.2.1 Creating the Target Table

You need to create a table in an Oracle database before importing flat files.

1. Write an SQL query containing creation statements for the fields contained in the flat file to be imported.
2. Execute this query using the Oracle tool SQL*Plus.

 Note

SQL*Plus is a command-line SQL and PL/SQL language interface and reporting tool that ships with the Oracle Database Server. It can be used interactively or driven from scripts.

 Note

All the scripts were tested on Oracle 8i.

5.2.1.1 Writing the SQL Query

You need to write a SQL query to create a table in an Oracle database.

● Specify the following information in the query:


a. Owner (table schema or user)
b. Table name
c. Columns specification (name, type, constraints)
d. Primary key (if exists)
e. Tablespace (logical space disk)

Here is a sample script:

 Sample Code

-- Specify the owner


CREATE TABLE TABLE_USER.TABLE_NAME(
-- Specify the name, type and constraints for each column
COLUMN_1_NAME datatype,
COLUMN_2_NAME datatype,
…,
COLUMN_N_NAME datatype,
-- Specify the primary key
PRIMARY KEY(COLUMN_1_NAME)
)
-- Specify the table space
TABLESPACE TABLE_TABLESPACE;

 Note

In the above script, the lines starting with "--" are comments.

5.2.1.1.1 Basic Oracle Datatypes: Uses and Syntax

Use and syntax of basic Oracle datatypes:

● VARCHAR2 – Character data type. Can contain letters, numbers and punctuation. The syntax for this data
type is: VARCHAR2(size) where size is the maximum number of alphanumeric characters the column can
hold. For example VARCHAR2(25) can hold up to 25 alphanumeric characters. In Oracle8, the maximum
size of a VARCHAR2 column is 4,000 bytes.
● NUMBER – Numeric data type. Can contain integer or floating point numbers only. The syntax for this data
type is: NUMBER(precision, scale) where precision is the total size of the number including decimal point
and scale is the number of places to the right of the decimal. For example, NUMBER(6,2) can hold a number
between -999.99 and 999.99.

● DATE – Date and Time data type. Can contain a date and time portion in the format: DD-MON-YY
HH:MI:SS. No additional information is needed when specifying the DATE data type. If no time component
is supplied when the date is inserted, the time of 00:00:00 is used as a default. The output format of the
date and time can be modified to conform to local standards.

5.2.1.2 Executing the SQL Query

To execute an Oracle SQL query, you need to use the Oracle tool SQL*Plus.

The query can be executed on a Microsoft Windows or Linux operating system.

1. Start the tool SQL*Plus, using the command sqlplus.
2. Connect to your database by entering your user name and password.
3. To run the script from your database, use the command @SCRIPTNAME.
If you want to run the script contained in the file script.sql, use the command @script.sql.
When you run the command @script.sql, the script.sql file name can be preceded by its path (relative to the directory from which sqlplus was started).
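For illustration only, a typical SQL*Plus session might look like the following; the user name, password, database alias, and script name are placeholders, not values defined in this guide:

sqlplus YOURUSER/YOURPASSWORD@YOURDATABASE
SQL> @create_tables.sql
SQL> EXIT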

5.2.1.3 Example: Creating a Target Table

The following SQL script creates Oracle tables for the flat data files provided as samples for the Data Manager -
Event Logging use case. For more information about these files, see the Event Log Aggregation Scenario - Data
Manager User Guide on SAP Help Portal at http://help.sap.com/pa. In this script, tables are created using the
values YOURUSER as the user and YOURTABLESPACE as the tablespace

 Sample Code

CREATE TABLE YOURUSER.CUSTOMERS(


ID NUMBER NOT NULL,
SEX VARCHAR2(10),
MARITAL_STATUS VARCHAR2(30),
GEOID NUMBER,
EDUCATIONNUM NUMBER,
OCCUPATION VARCHAR2(20),
AGE NUMBER,
REFCONT1 NUMBER,
REFCONT2 NUMBER,
REFCONT3 NUMBER,
NOM1 VARCHAR2(1),
NOM2 VARCHAR2(1),
NOM3 VARCHAR2(2),
PRIMARY KEY(ID)
)
TABLESPACE YOURTABLESPACE;
CREATE TABLE YOURUSER.DEMOG(
GEO_ID NUMBER NOT NULL,
INHABITANTS_K NUMBER,
INCOME_K$ NUMBER,
CONT1 NUMBER,
CONT2 NUMBER,
CONT3 NUMBER,
CONT4 NUMBER,

CONT5 NUMBER,
CONT6 NUMBER,
CONT7 NUMBER,
CONT8 NUMBER,
CONT9 NUMBER,
CONT10 NUMBER,
CONT11 NUMBER,
CONT12 NUMBER,
CONT13 NUMBER,
CONT14 NUMBER,
CONT15 NUMBER,
CONT16 NUMBER,
CONT17 NUMBER,
CONT18 NUMBER,
CONT19 NUMBER,
CONT20 NUMBER,
CONT21 NUMBER,
CONT22 NUMBER,
CONT23 NUMBER,
CONT24 NUMBER,
CONT25 NUMBER,
CONT26 NUMBER,
CONT27 NUMBER,
CONT28 NUMBER,
CONT29 NUMBER,
CONT30 NUMBER,
CONT31 NUMBER,
CONT32 NUMBER,
CONT33 NUMBER,
CONT34 NUMBER,
CONT35 NUMBER,
CONT36 NUMBER,
CONT37 NUMBER,
CONT38 NUMBER,
CONT39 NUMBER,
CONT40 NUMBER,
PRIMARY KEY(GEO_ID)
)
TABLESPACE YOURTABLESPACE;
CREATE TABLE YOURUSER.MAILINGS1_2(
REFID NUMBER NOT NULL,
REF_DATE DATE,
RESPONSE VARCHAR2(10),
PRIMARY KEY(REFID)
)
TABLESPACE YOURTABLESPACE;
CREATE TABLE YOURUSER.SALES(
EVENTID NUMBER NOT NULL,
REFID NUMBER,
EVENT_DATE DATE,
AMOUNT NUMBER,
PRIMARY KEY(EVENTID)
)
TABLESPACE YOURTABLESPACE;

5.2.2 Creating the Control File

A control file is used to describe the mapping between the flat file to be imported and its corresponding table –
that you have created beforehand – in the database. It is also used to describe the specification (delimiter, file
name) of the flat file.

● Specify the following:


a. Flat data file name
b. Field delimiter
c. Columns order

 Example

A control file looks as follows:

LOAD DATA
INFILE 'file.txt'
-- The command TRUNCATE removes all the rows in the target table.
TRUNCATE
-- The following specifies TABLE_NAME as the target table.
INTO TABLE TABLE_NAME
-- The following command specifies that data values are separated by commas.
FIELDS TERMINATED BY ','
(COLUMN1_NAME, COLUMN2_NAME, COLUMN3_NAME,…, COLUMNN_NAME)

5.2.2.1 Sample Control Files

The control files described here are used to import the flat data files provided as samples, for the Data Manager
- Event Logging use case, into an Oracle database. For more information about these files, see the Event Log
Aggregation Scenario - Data Manager User Guide on SAP Help Portal at http://help.sap.com/pa.

Related Information

Sample Control File for Customers Data [page 147]


Sample Control File for Demog Data [page 148]
Sample Control File for Mailings1_2 Data [page 148]
Sample Control File for Sales Data [page 148]

5.2.2.1.1 Sample Control File for Customers Data

 Sample Code

customers.control
LOAD DATA

INFILE 'customers.csv'
INTO TABLE CUSTOMERS
FIELDS TERMINATED BY ';'
(ID,SEX,MARITAL_STATUS,GEOID,EDUCATIONNUM,OCCUPATION,AGE,REFCONT1,REFCONT2,REF
CONT3,NOM1,NOM2,NOM

5.2.2.1.2 Sample Control File for Demog Data

 Sample Code

demog.control
LOAD DATA
INFILE 'demog.csv'
INTO TABLE DEMOG
FIELDS TERMINATED BY ';'
(GEO_ID,INHABITANTS_K,INCOME_K
$,CONT1,CONT2,CONT3,CONT4,CONT5,CONT6,CONT7,CONT8,CONT9,CONT10,CONT11,CONT12,C
ONT13,CONT14,CONT15,CONT16,CONT17,CONT18,CONT19,CONT20,CONT21,CONT22,CONT23,CO
NT24,CONT25,CONT26,CONT27,CONT28,CONT29,CONT30,CONT31,CONT32,CONT33,CONT34,CON
T35,CONT36,CONT37,CONT38,CONT39,CONT40)

5.2.2.1.3 Sample Control File for Mailings1_2 Data

 Sample Code

mailings1_2.control
LOAD DATA
INFILE 'mailings1_2.csv'
INTO TABLE MAILINGS1_2
FIELDS TERMINATED BY ';'
(REFID,REF_DATE DATE 'yyyy-mm-dd',RESPONSE)

5.2.2.1.4 Sample Control File for Sales Data

 Sample Code

sales.control
LOAD DATA
INFILE 'sales.csv'
INTO TABLE SALES
FIELDS TERMINATED BY ';'
(EVENTID,REFID,EVENT_DATE DATE 'yyyy-mm-dd',AMOUNT)
For the KEL sample datasets, execute these four command lines:
sqlldr control=customers.control
sqlldr control=mailings1_2.control
sqlldr control=demog.control
sqlldr control=sales.control

5.2.3 Importing the Flat File

To import flat files into an Oracle database, you need to run one control file for each flat file to be imported. To
run the control files, you need to use the Oracle tool SQL*Loader.

1. Run the Oracle tool SQL*Loader.


2. Run the command sqlldr control=controlfile.
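If you are not already authenticated, a typical invocation might look like the following; the credentials and the log file name are placeholders, not values defined in this guide:

sqlldr userid=YOURUSER/YOURPASSWORD control=customers.control log=customers.log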

5.3 Importing a Flat File into IBM DB2

To import a flat file into IBM DB2, you need to:

1. Write a DB2 SQL query whose function is to connect to IBM DB2, to create the empty table into which the
flat data files will be imported, and then to define the import specifications.
2. Execute this DB2 SQL query, using the IBM DB2 command interpreter.

 Note

All the scripts were tested on IBM UDB DB2 v7.1 and v7.2.

5.3.1 Writing an SQL Query for IBM DB2

● Create an SQL query whose function is to:


a. Connect to your IBM DB2 database.
b. Create the table into which the flat data files will be imported.
c. Define the import specifications.

5.3.2 Connecting to IBM DB2

● To connect to your IBM DB2 database, use the following command line, with your database and user names: Connect to DATABASE_NAME user USER_NAME

5.3.3 Creating Target Table

● Use the following SQL statement to create a target table for IBM DB2:

CREATE TABLE SCHEMA_NAME.TABLE_NAME(


COLUMN_1_NAME datatype,
COLUMN_2_NAME datatype,
…,
COLUMN_N_NAME datatype,
PRIMARY KEY (COLUMN1_NAME)
);

5.3.3.1 Basic IBM DB2 Datatypes

Here is the list of the main IBM DB2 datatypes:

● CHARACTER
● INTEGER
● DOUBLE
● REAL
● TIME
● DATE

5.3.4 Defining Import Specifications

An SQL query whose function is to define the import specification looks like the following:

 Sample Code

import from file.txt of del modified by coldel, decpt.


method P(1,2,…,N)
commitcount NbRows insert into SCHEMA_NAME.TABLE_NAME;

The table below describes each of its commands.

Import Commands
Command Description

coldel, Data values are separated by commas.

decpt. The decimal separator of the flat data file is a dot.

commitcount NbRows IBM DB2 will make a commit after each NbRows rows imported.

insert into SCHEMA_NAME.TABLE_NAME The target table is SCHEMA_NAME.TABLE_NAME.

method P(1,2,…N) The N first fields of the flat data file will be imported in the
target table.

5.3.5 Sample SQL Queries

This section provides the SQL queries to use to import the flat data files, provided as samples for the Data Manager - Event Logging use case, into IBM DB2. The function of these queries is to:

● Connect to the IBM DB2 database,


● Create the target table, into which the flat data files will be imported,
● Define the import specifications.

For more information about these files, see the Event Log Aggregation Scenario - Data Manager User Guide on
SAP Help Portal at http://help.sap.com/pa.

Related Information

Sample Customers Data File [page 151]


Sample Demog Data File [page 152]
Sample mailings_1_2 Data File [page 153]
Sample Sales Data File [page 154]

5.3.5.1 Sample Customers Data File

 Sample Code

db2_customers.sql
connect to YOURDATABASE user YOURUSER;
create table YOURSCHEMA.CUSTOMERS(
ID INTEGER NOT NULL,
SEX CHAR(10),
MARITAL_STATUS CHAR(30),
GEOID INTEGER,
EDUCATIONNUM INTEGER,
OCCUPATION CHAR(20),
AGE INTEGER,
REFCONT1 DOUBLE,
REFCONT2 DOUBLE,
REFCONT3 DOUBLE,
NOM1 CHAR(1),
NOM2 CHAR(1),
NOM3 CHAR(2),
PRIMARY KEY (ID)
);
import from customers.csv of del modified by coldel, decpt.
method P(1,2,3,4,5,6,7,8,9,10,11,12,13)
commitcount 1000 insert into YOURSCHEMA.CUSTOMERS;
disconnect YOURDATABASE;

5.3.5.2 Sample Demog Data File

 Sample Code

db2_demog.sql
connect to YOURDATABASE user YOURUSER;
create table YOURSCHEMA.DEMOG(
GEO_ID INTEGER NOT NULL,
INHABITANTS_K DOUBLE,
INCOME_K$ DOUBLE,
CONT1 DOUBLE,
CONT2 DOUBLE,
CONT3 DOUBLE,
CONT4 DOUBLE,
CONT5 DOUBLE,
CONT6 DOUBLE,
CONT7 DOUBLE,
CONT8 DOUBLE,
CONT9 DOUBLE,
CONT10 DOUBLE,
CONT11 DOUBLE,
CONT12 DOUBLE,
CONT13 DOUBLE,
CONT14 DOUBLE,
CONT15 DOUBLE,
CONT16 DOUBLE,
CONT17 DOUBLE,
CONT18 DOUBLE,
CONT19 DOUBLE,
CONT20 DOUBLE,
CONT21 DOUBLE,
CONT22 DOUBLE,
CONT23 DOUBLE,
CONT24 DOUBLE,
CONT25 DOUBLE,
CONT26 DOUBLE,
CONT27 DOUBLE,
CONT28 DOUBLE,
CONT29 DOUBLE,
CONT30 DOUBLE,
CONT31 DOUBLE,
CONT32 DOUBLE,
CONT33 DOUBLE,
CONT34 DOUBLE,
CONT35 DOUBLE,
CONT36 DOUBLE,
CONT37 DOUBLE,
CONT38 DOUBLE,
CONT39 DOUBLE,
CONT40 DOUBLE,
PRIMARY KEY (GEO_ID)
);
import from demog.csv of del modified by coldel, decpt.
method
P(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,2
9,30,31,32,33,34,35,36,37,38,39,40,41,42,43)
commitcount 1000 insert into YOURSCHEMA.DEMOG;
disconnect YOURDATABASE;

5.3.5.3 Sample mailings_1_2 Data File

 Sample Code

db2_mailings1_2.sql
connect to YOURDATABASE user YOURUSER;
create table YOURSCHEMA.MAILINGS1_2(
REFID INTEGER NOT NULL,
REF_DATE DATE,
RESPONSE CHAR(10),
PRIMARY KEY (REFID)
);
import from mailings1_2.csv of del modified by coldel, decpt.
method P(1,2,3)
commitcount 1000 insert into YOURSCHEMA.MAILINGS1_2;
disconnect YOURDATABASE;

5.3.5.4 Sample Sales Data File

 Sample Code

db2_sales.sql
connect to YOURDATABASE user YOURUSER;
create table YOURSCHEMA.SALES(
EVENTID INTEGER NOT NULL,
REFID INTEGER,
EVENT_DATE DATE,
AMOUNT DOUBLE,
PRIMARY KEY (EVENTID)
);
import from sales.csv of del modified by coldel, decpt.
method P(1,2,3,4)
commitcount 1000 insert into YOURSCHEMA.SALES;
disconnect YOURDATABASE;

5.3.6 Importing the Flat File into IBM DB2

● Under a Microsoft Windows OS, you have to open the DB2 Command Window by using the DOS command db2cmd.
● Under Linux OS, the login user must be able to run db2.

● To start the flat file import, use the command db2 -stf SCRIPTNAME.
If you want to run the script contained in the file script.sql, use the command db2 -stf script.sql.

 Example

The IBM DB2 command lines below are used to start the import of the flat data files provided as samples
for the SAP Predictive Analytics Explorer - Event Logging use case, into IBM DB2. For more information
about these files, see the Event Log Aggregation Scenario - Data Manager User Guide on SAP Help Portal at
http://help.sap.com/pa.

● db2 -stf customers.sql

● db2 -stf mailings1_2.sql

● db2 -stf demog.sql

● db2 -stf sales.sql

5.4 Importing Flat Files into Microsoft Access

Microsoft Access comes with a wizard that helps you import flat data files.

 Note

This wizard lets you import only one flat file at a time.

5.4.1 Importing a Flat File into a Microsoft Access Database

1. In Microsoft Access, select File > Get External Data > Import.
The Import dialog box appears.
2. In the drop-down menu associated with the field Files of Type, select Text Files.
3. Select the flat data files to import, and click Import.
The Import Text Wizard appears. It will guide you through the process of importing your flat files into MS
Access.
4. Pay particular attention to specifying data types, keys and indexes.

The table into which the flat data file is imported is automatically created during the import process.

6 ODBC Fine Tuning

6.1 About ODBC Fine Tuning

This section describes the minimal Open Database Connectivity (ODBC) configuration required for Automated Analytics. It also describes issues encountered when using the ODBC standard with the application, as well as solutions to help bypass these issues.

6.2 Issues with ODBC

The ODBC framework consists of two elements:

● The ODBC driver manager. The manager does the following:


○ Manages the list of drivers and known databases.
○ Transmits all requests to the DBMS ODBC driver.
○ Manages Unicode when the DBMS does not provide this service.
● The ODBC drivers. The drivers implement the ODBC API so that the ODBC driver manager can use it.

The ODBC standard being somewhat vague, tests have revealed problems with the ODBC driver manager and the ODBC drivers, such as the following:

For the ODBC driver manager (depending on the platform):

● Some advanced requests cannot be implemented.


● Unicode compatibility depends highly on the DBMS/client setup and in some cases is unavailable.

For ODBC drivers:

● Some information about the DBMS features can be missing, inaccurate, partial, or difficult to use.
● On some ODBC drivers, even simple operations like asking for standard information can cause a crash.

6.3 How Automated Analytics Manages ODBC Specificities

Automated Analytics provides access to ODBC middleware as follows:

If the application recognizes a supported DBMS, Automated Analytics uses internal drivers. These mini drivers hard-code a correct behavior for the current DBMS and bypass known ODBC issues for this DBMS. For example:

● Since Teradata does not publish the key words to declare Unicode data, the application uses a hard-coded Teradata syntax.

Automated Analytics sets options depending on the recognized DBMS.

If the application does not recognize a supported DBMS, Automated Analytics uses a generic driver that asks the minimum from ODBC. You can override most of these automatic settings. Refer to the related information below.

 Note

For the list of supported platforms, refer to the Product Availability Matrix page and search for “SAP
Predictive Analytics”.

 Caution

Names of tables and columns are not quoted by default. So if they contain mixed cases or special
characters including spaces, they can be misinterpreted by the DBMS. To change this behavior, you need to
set the option CaseSensitive to "True". For more details, refer to the related information below.

Related Information

How to Override Automated Analytics Automatic Settings [page 157]


CaseSensitive [page 163]

6.4 How to Override Automated Analytics Automatic


Settings

The standard configuration files provide options to override ODBC traits.

● KxCORBA.cfg when using an SAP Predictive Analytics server.


● KJWizard.cfg when using an SAP Predictive Analytics workstation.

Each declared ODBC data source can have its own set of overrides.

The syntax for an override is as follows:

ODBCStoreSQLMapper.<ODBC DSN>.<Option>="<Value>"

The following table describes the parameters:

Parameter Case sensitive Space and tab characters allowed

<ODBC DSN> yes no

<Option> no no

<Value> no yes

For example, if you have an ODBC data source named MyDBMS, the following syntax indicates that the
application cannot request the list of specific SQL keywords when using MyDBMS:

ODBCStoreSQLMapper.MyDBMS.DynamicSqlKeywords="no"

To override all your ODBC data sources, replace the parameter <ODBC DSN> by * in the syntax.

For example, instead of using the override option DynamicSqlKeywords for each DBMS:

ODBCStoreSQLMapper.MyDBMS1.DynamicSqlKeywords="no"

ODBCStoreSQLMapper.MyDBMS2.DynamicSqlKeywords="no"

ODBCStoreSQLMapper.MyDBMS3.DynamicSqlKeywords="no"

Use the following row:

ODBCStoreSQLMapper.*.DynamicSqlKeywords="no"

6.4.1 Overriding the Full Behavior

It is possible to declare that a DSN should be used like a specific well-known DBMS (see the list of well-known DBMS below). All options of the chosen DBMS will be duplicated and used by Automated Analytics. This saves you from setting up all the options one by one; a sample override follows the list of values below.

Even when overriding the full behavior, you can modify any ODBC option.

ODBCStoreSQLMapper.<ODBC DSN>.behaviour="<Value>"

 Note

<Value> can be:

● ORACLE
● DB2
● SQL SERVER
● NETEZZA
● TERADATA
● ADAPTIVE SERVER IQ
● VERTICA

● GREENPLUM
● HIVE
● HANA
● SPARK
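For example, to declare that a data source should be treated like SAP HANA (MyDSN being a placeholder for your own DSN), the override could look like this:

ODBCStoreSQLMapper.MyDSN.behaviour="HANA"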

6.4.2 Other Options to Override

If the column is checked... the option can be used to...

Customization customize user & DBMS specificities

Backward Compatibility replicate the behavior from previous versions of Automated


Analytics

Work-around overcome errors occurring with some ODBC driver

Optimization optimize Automated Analytics' use of the DBMS

Option Name    Customization    Backward Compatibility    Work-around    Optimization

AcceptTable x

ApplyWorkArounds x x

AutomaticallyExportedFileds x

canAskTableDescription x

CanGenerateConnectorSQLWithSubSelectLimit x

checkConflictInAppendMode x x

CaseSensitive x

Commit Frequency x

ConstantFieldsManagedByValue x

CorrelatedTablesGroupSize x

defaultSchemaForKxenTemporary x

defaultSchemaForMetadata x

DisableDuplicateIdManagementForAggregates x

DropEmptySpace x x

DynamicAuthorizedChars x

DynamicSQLKeyWords x

GenerateAggregatesFilterInTheOperand x

GuessMode x


HANADataStorageAsColumns x x x

HANATempStorageAsColumns x x x

HANAViewsWithOrder x x

HasFastOrderInSelect x x

IgnorePrefixOnNumeric x x

InitDataManipulationPackage x

KxenBool x

KxenDate x

KxenDateTime x

KxenInteger x

KxenNumber x

KxenString x

KxenUString x

KxLocksName x

LocksPolicy x x

ManageDefaultCredentialsAsEmpty x x

ManageForeignKeys x x

MaxConcurrentStatements x

MaxExtraCondUsage x

MaximumLengthOfDBMSString x x

MaximumLengthOfDBMSUString x x

MaxSizeForBlobs x

MetaDataUseFastStarSelect x x

NetezzaUseLimitAll x x x

NotPrimaryKeyRegExp x x

OptimizeMultiInstancesExpressions x


OptimizeMultiInstancesThreshold x

PrefilterAggregatesGroupTable x

PrimaryKeyRegExp x

RejectTable x


SmartCreateTableAndKeys x

SmartSelectForCountOnConnector x

SmartSelectForCountOnselect x

SmartSelectForCountOnTable x

SmartSelectForFullRead x

SmartSelectForGuess x

SmartSelectForLayout x

SQLOnCatalog x x

SQLOnConnect x

SQLOnDisconnect x

SQLOnInitDataManip x

SupportHugeOracleIntegers x x

SupportKxDesc x x

SupportNativeInfos x x x

SupportNativeRead x x

SupportPrimaryKeys x x

SupportStatistics x x

SupportTempTables x x x

SupportUnicodeOnConnect x x

SupportUnicodeOnData x x

SupportUnicodeOnMeta x x

TableTypesFilter x x

TSPPersistence x

UserAuthorizedChars x

UserKeyWords x

WarnAboutIssues x

6.4.2.1 AcceptTable

This option allows you to define a list of tables that must be displayed when the lsSpace function is called. The
value of this parameter must be a regular expression.

 Caution

This option is deactivated when the option SqlOnCatalog is activated.

 Example

All tables whose name contains Customer or Product will be displayed in the list of tables offered to the
user.

ConnectorContainerSQLBuilder.<MyDSN>.AcceptTable.1=".*Customer.*"

ConnectorContainerSQLBuilder.<MyDSN>.AcceptTable.2=".*Product.*"

6.4.2.2 ApplyWorkArounds

Some DBMS/ODBC driver combinations have some well-known issues. When possible, at each connection to
the DBMS, Automated Analytics tries to detect such combinations in order to fix the issues.

Detection and fixes can be incomplete. The ApplyWorkarounds option (default value is true) allows
activating/deactivating this automatic setup. If deactivated, users can still emit their own configuration SQL
statements with the SQLOnConnect option.

Currently, the application automatically applies workarounds in the following contexts:

Oracle

The statement ALTER SESSION SET NLS_TERRITORY='AMERICA' is emitted. This mainly allows data
manipulations to have the same results on all Oracle DBMS instances whatever their default locale.

SQLServer

The statement SET DATEFORMAT YMD is emitted at each connection. This mainly allows data manipulations to
have the same results on all SQLServer DBMS instances whatever their default locale.
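If you prefer to manage these session settings yourself (for example, through SQLOnConnect), you can disable the automatic workarounds. A minimal override, with MyDSN as a placeholder for your data source name, could look like this:

ODBCStoreSQLMapper.MyDSN.ApplyWorkArounds="false"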

6.4.2.3 CanAskTableDescription

This parameter value is True if the layout of a table can be requested directly. If not, the layout of the table is analyzed from the result of the SQL request SELECT * FROM <Table>.

Depending on the DBMS and workload, directly requesting the layout of a table can be heavy.

In Automated Analytics, the default behavior is to analyze the result of a SELECT * FROM <table> WHERE
1=0 and so, the default value is false. The drawback is that some drivers may not return accurate information
on fields. For example, whatever the driver, any comment on a field is lost and SQLServer returns SMALLINT
as INT and SMALLDATE as DATE.

 Example

ODBCStoreSQLMapper.MyDSN.CanAskTableDescription="True"

6.4.2.4 CanGenerateConnectorSQLWithSubSelectLimit
The Data Manipulation module is able to generate very complex SQL statements. With complex SQL statements, previewing a result set may be heavy and may take a lot of time. For several kinds of data manipulations (TSP, for example), it is possible to limit the size of the data to work on and so dramatically shorten the time needed to view the first lines of the result set. This kind of advanced SQL statement is not always possible (it may depend on the DBMS) and is itself very complex to set up.

Setting CanGenerateConnectorSQLWithSubSelectLimit to false (the default depends on the DBMS but is generally true) stops this advanced mechanism and ensures that viewing data will always work, even if more slowly.

 Example

ODBCStoreSQLMapper.<MyDSN>.CanGenerateConnectorSQLWithSubSelectLimit="false"

6.4.2.5 CaseSensitive
By default, Automated Analytics does not use quotes when transmitting table and column names. However, when these names contain mixed cases or special characters, this can lead to incorrect queries.

To make the application add quotes around all table and column names, set the option CaseSensitive to True. This option is set to False by default.

 Example

 Sample Code

The following code activates the CaseSensitive option for all databases.

ODBCStoreSQLMapper.*.CaseSensitive="true"

 Sample Code

The following code activates the CaseSensitive option for the ODBC source HANA_DB only.

ODBCStoreSQLMapper.HANA_DB.CaseSensitive="true"

6.4.2.6 checkConflictInAppendMode

For Hive databases, this option lets you activate or deactivate the test that checks for conflicts between keys appearing in both the input and output application datasets.

The default value is true, which means that the process can only be performed if no key from the input dataset
is present in the output dataset.

 Example

ODBCStoreSQLMapper.<ODBC DSN>.checkConflictInAppendMode=false

6.4.2.7 Commit Frequency

This parameter value is 1024 by default. It allows you to define the commit frequency for ODBC.

 Example

ODBCStoreSQLMapper.MyDSN.Commit_Frequency=10

6.4.2.8 CreateNonLoggedTable

On Oracle databases, creating tables with the Non Logged option is a good way to speed up the inserts and
index creation. It bypasses the writing of the redo log and significantly improves performance.

To increase efficiency, all tables created by the application can therefore be created without logging when the following option is enabled:

ODBCStoreSQLMapper.<ODBC DSN>.CreateNonLoggedTable=true

6.4.2.9 DropEmptySpace

This option allows you to delete the remaining empty tables after a model is deleted.

By default, this option is set to false. This means that when you delete a model, the associated empty tables remain stored in the database. This can be an issue if you have chosen to create one table per segment, in the case of segmented time series models, for example.

Setting this option to true deletes those tables and cleans up your database.
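For example, to have these empty tables dropped automatically (MyDSN being a placeholder for your data source name), you could set:

ODBCStoreSQLMapper.MyDSN.DropEmptySpace="true"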

6.4.2.10 DynamicAuthorizedChars

This parameter value is True if the application can ask the DBMS the list of its authorized characters in SQL
requests. If an identifier includes a character that is not in this list, the application places the identifier between
quotes.

 Example

ODBCStoreSQLMapper.MyDSN.DynamicAuthorizedChars="True"

6.4.2.11 DynamicSQLKeyWords

This parameter value is True if the application can ask the DBMS the list of its SQL reserved keywords.

 Example

ODBCStoreSQLMapper.MyDSN.DynamicSQLKeywords="True"

6.4.2.12 GuessMode

By default, when analyzing a table, the application reads the first 100 lines of the table and applies some
heuristics to infer the value of each field (for example, an integer field with a few values is guessed as being a
nominal variable).

Depending on the DBMS, the layout and the size of the table, reading the first lines of a table may be heavy.
When the GuessMode parameter is set to Fast, it allows activating a fast analyze mode that infers the value of
a variable only from its type. As these heuristics do not depend on the actual data, this mode is faster than the
full standard process.

 Code Syntax

ODBCStoreSQLMapper.MyDSN.GuessMode="Fast"

The following table lists the rules used by the Fast Guess Mode to determine a variable value from its type.

DBMS Field Type Variable Value

Integer continuous

Float continuous

String nominal

Date continuous


Datetime continuous

Boolean nominal

6.4.2.13 HANADataStorageAsColumns,
HANATempStorageAsColumns

Prior to rev73 of SAP HANA, columnar storage could not be used as some valid SQL generated by Automated
Analytics failed. Row storage was used for all tables generated up to rev73 of SAP HANA. Columnar storage
was supported and used after rev73 of SAP HANA.

The HANADataStorageAsColumns and HANATempStorageAsColumns options allow you to change this behavior for standard data tables and temporary tables (used by data manipulations), respectively.

 Example

ODBCStoreSQLMapper.MyDSN.HANADataStorageAsColumns=false

6.4.2.14 HasFastOrderInSelect

On some DBMS, ordering rows is very heavy on large tables even if the order is the same as the order of
primary keys. This can be an issue when analyzing a table and thus requesting the first lines of the table. This
option allows not using an ORDER BY when analyzing the first lines of a table. Default value is true.
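For example, an override might look like this; it assumes that setting the option to false tells the application that ordering is slow on your DBMS, and MyDSN is a placeholder for your data source name:

ODBCStoreSQLMapper.MyDSN.HasFastOrderInSelect="false"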

6.4.2.15 IgnorePrefixOnNumeric

Depending on the DBMS, literals may need a prefix and a suffix. For example, a string constant is typically
surrounded by single quotes (') and some dates must be prefixed by a hash character (#).

This is usually not needed when using a numeric or integer constant. The prefix and suffix are published by the ODBC driver for each SQL type. However, some ODBC drivers publish an unnecessary prefix for several types of constants. SQLServer 2005, for instance, publishes $ as the prefix to use when handling a MONEY constant.

This option has been added in order to be compatible with unexpected use cases where this prefix is really
mandatory.

The default value is true.
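For example, in the unusual case where the published prefix really is required (such as the $ prefix for SQL Server MONEY constants), you could keep it by setting the option to false (MyDSN is a placeholder):

ODBCStoreSQLMapper.MyDSN.IgnorePrefixOnNumeric="false"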

6.4.2.16 InitDataManipulationPackage, SQLOnInitDataManip

In order to work, the Data Manipulation module makes some suppositions of current options and objects in the
DBMS. When and only when needed, each time the data manipulation package is initialized, some SQL
statements are transparently emitted to setup this option.

For example, on Oracle, an ALTER SESSION SET NLS_TERRITORY='AMERICA' is emitted to be sure that all
functions and operators used are in the proper locale.

When InitDataManipulationPackage is true (default is false), the SQLOnInitDataManip option allows


you to specify some other SQL statements to emit and precisely match the current DBMS setup. The syntax
used is the same as SQLOnCatalog option.

 Example

SQLServer when installed on a Windows server setup with foreign language will use the foreign format for
date which may be incompatible with some Data Manipulation operators. Then, one must force the proper
date format with a SET DATEFORMAT YMD each time the data manipulation is initialized.

ODBCStoreSQLMapper.<SQLServerDSN>.InitDataManipulationPackage="true"

ODBCStoreSQLMapper.<SQLServerDSN>.SQLOnInitDataManip.1=“SET DATEFORMAT YMD"

6.4.2.17 KxenBool

This SQL type creates a Boolean field.

 Example

ODBCStoreSQLMapper.MyDSN.KxenBool="INTEGER"

6.4.2.18 KxenDate

This SQL type creates a date field.

 Example

ODBCStoreSQLMapper.MyDSN.KxenDate="DATETIME"

6.4.2.19 KxenDateTime

This SQL type creates a datetime field.

 Example

ODBCStoreSQLMapper.MyDSN.KxenDateTime="TIMESTAMP"

6.4.2.20 KxenInteger64

This SQL type creates an integer field.

 Example

ODBCStoreSQLMapper.MyDSN.KxenInteger64="DECIMAL(12,0)"

6.4.2.21 KxenNumber

This SQL type creates a number field.

 Example

ODBCStoreSQLMapper.MyDSN.KxenNumber="NUMBER(15,3)"

6.4.2.22 KxenString

This SQL type creates a string field. In the following example, %d is a special pattern and will be dynamically
replaced by the field length.

 Example

ODBCStoreSQLMapper.MyDSN.KxenString="VARCHAR2(%d)"

6.4.2.23 KxenUString

This SQL type creates a unicode string field. In the following example, %d is a special pattern and will be
dynamically replaced by the field length.

 Example

ODBCStoreSQLMapper.MyDSN.KxenUString="VARCHAR(%d) UNICODE CHARSET"

6.4.2.24 KxLocksName

This option allows you to force the usage of a specific name for the KxLocks table. This can be useful, for example, if the metadata repository is stored in a specific schema of the DBMS or if there are several metadata repositories stored in dedicated schemas.

 Example

ODBCStoreSQLMapper.MyDSN.KxLocksName="KxenMetaData.KxLocks"

6.4.2.25 LocksPolicy

When the application needs to access its metadata repository (models, connectors, variable pool, and so on), a lock is set in order to prevent incoherent states in multi-user contexts.

There are different strategies for the locking process:

Value Behavior

UseKxLocksTable Uses a special table named KxLocks to track and lock all
access to metadata repository. This table is stored in the
current DBMS and manages all concurrent accesses coming
from the current DBMS.

UseUniqueFileInGlobalFolder Tells the application to track and lock its metadata directory with special files (default names are KxAdmin.lck and KxAdmin.inf). These files are stored in the same directory as the metadata repository.

UsePerModelKxLocksTable Uses an optimized KxLocks table to minimize locking times when many users are concurrently saving models. This is used by default for SAP HANA, Oracle, SQL Server, and Teradata.
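For example, to force the file-based locking strategy for a given data source (MyDSN being a placeholder), the override could look like this:

ODBCStoreSQLMapper.MyDSN.LocksPolicy="UseUniqueFileInGlobalFolder"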

6.4.2.26 ManageForeignKeys

The application requests the metadata repository of the DBMS about the foreign keys of a table (that is, fields that reference a primary key in other tables). The main usage of this information is to automatically offer join keys when designing a data manipulation. Depending on the DBMS and its current load, this request can be heavy and can therefore be avoided when it is not needed.

There are two possible values:

● true (default): the information is requested from the metadata repository


● false: the application bypasses the request

 Example

ODBCStoreSQLMapper.MyDSN.ManageForeignKeys="false"

6.4.2.27 MaxConcurrentStatements

The number of concurrent statements is published by the ODBC driver and represents the maximum number
of active statements for each ODBC connection allowed by the DBMS. For example, SQLServer can manage
only one concurrent statement for each connection, meaning that a SELECT statement returning data and
inserting a record requires two separate connections.

Unfortunately, some ODBC drivers publish this information incorrectly. Even though the application has a correct value for each known DBMS, this option, which allows you to force the proper value, has been added to handle unexpected setups.

The default value is none, meaning the value is automatically discovered or forced by the application.
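For example, if a driver wrongly reports support for several active statements per connection, you could force a single concurrent statement (MyDSN is a placeholder):

ODBCStoreSQLMapper.MyDSN.MaxConcurrentStatements=1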

6.4.2.28 MaximumLengthOfDBMSString

Some DBMS describe string fields with a huge size (more than 2GB) even if all strings actually stored have a
standard size. To avoid memory problems, these huge strings are truncated to a maximum size.

The MaximumLengthOfDBMSString option allows you to change the maximum length.

 Example

ODBCStoreSQLMapper.MyDSN.MaximumLengthOfDBMSString=5000

6.4.2.29 MaximumLengthOfDBMSUString

Some DBMS (such as Hive Server) describe Unicode string fields with a huge size (more than 2 GB) even if all strings actually stored have a standard size. To avoid memory problems, these huge Unicode strings are truncated to a maximum size.

The MaximumLengthOfDBMSUString option allows you to change the maximum length.

 Example

ODBCStoreSQLMapper.MyDSN.MaximumLengthOfDBMSUString=5000

6.4.2.30 MaxSizeForBlob

Some DBMS manage special string types. These types (BLOB,CLOB,LONG VARCHAR,…) can store huge
strings (limit is typically 2GB) and are typically used to store documents. Requesting the full value of such
fields is very complex, heavy and uses a lot of resources.

In order to speed up the management of this kind of field, the application always requests the DBMS to convert it into a regular string type with a maximum length of 8000 characters.

The MaxSizeForBlob option allows you to change this maximum length.

 Example

ODBCStoreSQLMapper.MyDSN.MaxSizeForBlob="255"

6.4.2.31 MetaDataUseFastStarSelect

Each time a metadata table is read, the SQL statement that is generated uses the pattern SELECT
<field1>,<Field2>… FROM <Table>. This allows an OEM to extend metadata with its own information. On
some DBMS (Hive Server), such SQL is very slow.

The MetaDataUseFastStarSelect option allows you to generate a SQL using the pattern SELECT * FROM
<Table> which may be faster.

 Example

ODBCStoreSQLMapper.MyDSN.MetaDataUseFastStarSelect=true

6.4.2.32 NetezzaUseLimitAll

For Netezza databases, this option lets you activate or deactivate the usage of a SQL tuning option ("LIMIT ALL"). This hint allows Netezza's SQL optimizer to execute the SQL requests generated by SAP Predictive Analytics efficiently.

The default value is true, which means that this hint is automatically added.

 Example

ODBCStoreSQLMapper.<ODBC DSN>.NetezzaUseLimitAll=false

6.4.2.33 OracleParallelCreationMode

Oracle Parallel Query (formerly Oracle Parallel Query Option or PQO) allows breaking a given SQL statement up
so that its parts can run simultaneously on different processors in a multi-processor machine. This improves
performance.

This parameter allows you to set the value of parallelism for all CREATE TABLE statements. Indeed, when a
degree value has been specified to the table, this value is used for all SQL statements involving the table.

The possible values are:

● PARALLEL: Oracle selects a degree of parallelism equal to the number of CPUs available. This is the default mode.
● NO: deactivates the parallelism; only one thread will be used to populate the table
● ExtraOptions: extra character string that represents the expected options of parallel threads used in the
parallel operation. This depends on the database used. For details, contact your database administrator.

 Example

ODBCStoreSQLMapper.<ODBC DSN>.OracleParallelClause=PARALLEL

ODBCStoreSQLMapper.<ODBC DSN>.OracleParallelClause=NO

ODBCStoreSQLMapper.<ODBC DSN>.OracleParallelClause=PARALLEL(DEGREE 2)

ODBCStoreSQLMapper.<ODBC DSN>.OracleParallelClause=PARALLEL(DEGREE 4
INSTANCES 2)

6.4.2.34 PrimaryKeyRegExp, NotPrimaryKeyRegExp

Some DBMS may not manage the primary keys of a table. In such cases, the usual workaround is to read description tables, either automatically with an implicit name (KxDesc_XXX) or manually with an explicit name. Moreover, it may be difficult to push these descriptions into those DBMS or to always manage descriptions. Hive Server suffers from both issues.

As a facility, the option PrimaryKeyRegExp allows you to describe the field names that will be automatically
managed as primary keys. NotPrimaryKeyRegExp allows you to filter the field names that are not primary
keys and is applied after the filter PrimaryKeyRegExp.

 Example

ODBCStoreSQLMapper.MyDSN.PrimaryKeyRegExp.1=KEY$

ODBCStoreSQLMapper.MyDSN.PrimaryKeyRegExp.2=^KEY

ODBCStoreSQLMapper.MyDSN.PrimaryKeyRegExp.3=ID$

ODBCStoreSQLMapper.MyDSN.PrimaryKeyRegExp.4=^ID

These lines set up the application so that all field names ending with KEY or ID and all field names starting with KEY or ID are managed as primary keys. This is the default setup of the Hive connection.

6.4.2.35 RejectTable

This option allows you to define a list of tables that must not be displayed when the lsSpace function is called.
The value of this parameter must be a regular expression.

 Caution

This option is deactivated when the option SqlOnCatalog is activated.

In the following example, all tables whose name contains Customer or Product will be excluded from the list of
tables offered to the user.

ConnectorContainerSQLBuilder.<MyDSN>.RejectTable.1=".*Customer.*"

ConnectorContainerSQLBuilder.<MyDSN>.RejectTable.2=".*Product.*"

Database Name Regexp

Oracle SYS\.3*

SYSTEM\.3*

PUBLIC\.3*

OWB\.3*

OUTLN\.3*

.*\.SMP_3*_3*

.*\.VDK_.*_.*

.*\.VMQ_.*_.*

WMSYS\..*

SYSMAN\..*

IX\..*

.*\.BIN\$*

OLAPSYS\..*

CTXSYS\..*

DBSNMP\..*

DMSYS\..*

EXFSYS\..*

MDSYS\..*


XDB\..*

Sybase ASE dbo\.sys.*

DB2 DBADMIN\..*

SYSTOOLS\..*

SQL server 2005 sys\..*

INFORMATION_SCHEMA\..*

Sybase IQ DBA\..*

sys\..*

HANA SYS.*

_SYS*.*

SAP_XS_LM*.*

SAP_HANA_ADMIN.*

UIS.*

SYSTEM*.*

SAP_REST_API.*

SAP_XS_LM_PE.*

SAP_XS_LM_PE_TMP.*

HANA_XS_BASE.*

Teradata SysAdmin\..*

SQLJ\..*

DBC.*


Wx2 SYS\..*0

6.4.2.36 SmartCreateTableAndKeys

This option allows the application to generate DBMS primary keys when a variable is declared as a key.
Declaring true primary keys makes future usage of the table much more efficient and is also a good practice
when designing a DBMS schema.

However, note that primary keys involve many more constraints on the actual values of the data: the uniqueness of the values of the primary key must be respected and, depending on the DBMS, the null value is forbidden for a component of the primary key.

The purpose of this option is to keep compatibility with older versions of the application, which were generating
simple multiple indexes.

 Example

ODBCStoreSQLMapper.MyDSN.SmartCreateTableAndKeys="false"

6.4.2.37 SmartSelectForCountOnTable,
SmartSelectForCountOnSelect,
SmartSelectForCountOnConnector,
SmartSelectForFullRead, SmartSelectForGuess,
SmartSelectForLayout

Automated Analytics generates the smartest possible SQL statement when requesting data. Some specific options of the DBMS SQL dialect are then used. When an option is false, the SQL generated is the most basic one.

The main objectives of these enhancements are the following:

● decrease DBMS usage and increase concurrency in the DBMS


● speed up the SQL statement by asking for the shortest result set

Options: SmartSelectForCountOnTable, SmartSelectForCountOnSelect, SmartSelectForCountOnConnector
Description: These options control the generation of advanced statements when a count of records is needed in various contexts.
Example on SQL Server: When requesting the number of records of a table, and SmartSelectForCountOnTable is true (default), the SQL statement generated is SELECT * FROM <Table> WITH (NOLOCK); otherwise it is a regular SELECT * FROM <Table>. The enhanced version uses fewer resources in SQL Server.

Options: SmartSelectForFullRead, SmartSelectForGuess, SmartSelectForLayout
Description: These options manage the generation of advanced SQL statements when a full read of data or a guess must be done. When the number of records to read is already known (that is, in a guess or when a sequential partition strategy is used), the SQL statement generated is optimized.
Example on SQL Server: When guessing the layout of a table and SmartSelectForGuess is true (default), the SQL statement used is SELECT TOP 100 * FROM <table> WITH (NOLOCK), which is much more efficient than the regular SELECT * FROM <table>.

6.4.2.38 SQLOnCatalog

A database may contain thousands of tables, so listing the tables can be very laborious. Moreover, the ODBC standard does not define a portable way to list only the tables that are actually usable for the connected user. The parameter SQLOnCatalog contains the SQL request emitted to build the actual list of tables. A long SQL

request can be easily specified by adding a sequence number after the keyword SQLOnCatalog; the final SQL request will be the concatenation of all SQLOnCatalogXXX lines. If an execution error occurs, it is reported and Automated Analytics switches to the default behavior and lists all the available tables.

This option allows you to build a list of tables that is short and of interest to the final user.

There are two main kinds of SQL requests that may be used:

● Tuning of the system catalogs:


○ The system catalogs of the database are requested and some filtering is added in order to build a
shorter and more user-oriented list of tables. The SQL request should be built with the help of the
Database Administrator.
● Tuning at application level:
○ The list of usable tables is explicitly stored in a dedicated table that is maintained by another
operational process or another application. This can be done without knowing the structure of system
catalogs that may be complex.

6.4.2.38.1 Format of the SQL Request to Use for


SQLOnCatalog

The following table lists the format of all keywords in the SQL request when using the option SQLOnCatalog.

In the SQL request, the special %USER%, %SQLUSER%, %KXENUSER%, %OSUSER%, %PASSWORD%, %KXENVERSION%, %KXEN32%, and %DSN% keywords are searched for and, if present, are replaced by:

Keyword Description

%USER% Current SQL user login (deprecated, replaced by %SQLUSER%)

%SQLUSER%,%SQLUSER_MAJUS%, Current SQL user login as provided by user, in uppercase, in lowercase


%SQLUSER_MINUS%

%KXENUSER%,%KXENUSER_MAJUS%, Current Physical Automated Analytics user login as provided by user, in


%KXENUSER_MINUS% uppercase, in lowercase

%OSUSER%,%OSUSER_MAJUS%, Current Physical Automated Analytics user login as provided by user, in


%OSUSER_MINUS% uppercase, in lowercase

%PASSWORD% Current SQL password

%KXENVERSION% Current Automated Analytics version

%DSN% Name of the current ODBC connection

 Note

These macros are only used by SQLOnCatalog, SQLOnConnect, defaultSchemaForMetadata and defaultSchemaForKxenTemporary.

The only constraint of the resulting dataset is that it must have at least two fields of string type:

● The first one will give the Schema name


● The second one will give the Table name

 Example

Since SQL Server 2005, the list of user tables can be obtained with:

ODBCStoreSQLMapper.MyDSN.SQLOnCatalog1="select table_schema as SchemaName,


table_name as TableName from information_schema.tables order by
SchemaName,TableName"

The list of user table names that starts with the keyword 'ForModeling' can be obtained with:

ODBCStoreSQLMapper.MyDSN.SQLOnCatalog1="select table_schema as SchemaName,


table_name as TableName from information_schema.tables"

ODBCStoreSQLMapper.MyDSN.SQLOnCatalog2=" WHERE (table_name LIKE 'ForModeling%' ) ORDER BY SchemaName,TableName"

If the list of available tables is stored in the table TablesForModeling, it can be processed with the lines:

ODBCStoreSQLMapper.MyDSN.SQLOnCatalog="SELECT <Field1>,<Field2> FROM TablesForModeling ORDER BY <Field1>,<Field2>"

For Teradata, the list of tables created by the current user can be obtained with:

ODBCStoreSQLMapper.MyDSN.SQLOnCatalog1="SELECT
TRIM(DataBaseName),TRIM(TableName) FROM dbc.Tables WHERE CreatorName='%USER%'
AND Tablekind in ('T','V') ORDER BY DataBaseName, TableName"

For Teradata, the list of readable and editable tables for current user can be obtained with:

ODBCStoreSQLMapper.MyDSN.SQLOnCatalog1="select distinct TRIM(DatabaseName), TRIM(TableName) from dbc.ALLRIGHTS where (AccessRight = 'R' or AccessRight = 'U' or AccessRight = 'I' or AccessRight = 'IX' or AccessRight = 'RF' ) and UserName = '%SQLUSER%'"

For Oracle, the list of tables and views created by the current user can be obtained with:

ODBCStoreSQLMapper.MyDSN.SQLOnCatalog1="select '%SQLUSER%',t_Name from


( select TABLE_NAME as t_Name from USER_TABLES union SELECT VIEW_NAME as
t_Name FROM USER_VIEWS) order by t_Name"

6.4.2.39 SQLOnConnect

This parameter contains the SQL request emitted just after a successful connection to the data source. Several requests can be emitted by adding a sequence number after SQLOnConnect. An execution error in a SQL request is not reported and does not stop the connection process. A typical use of this option is to configure DBMS session options that are not available through ODBC (for example, the session character set for Oracle).

 Example

ODBCStoreSQLMapper.MyDSN.SQLOnConnect1="ALTER SESSION SET NLS_CHARSET xxxxxx"

ODBCStoreSQLMapper.MyDSN.SQLOnConnect2="ALTER SESSION SET DATE TIME FORMAT
\"YYYY-MM-DD
HH:MM:SS\""

An advanced use of this type is to configure DBMS session options according to the %KXENUSER% login.

 Example

It is possible to automatically redirect all DBMS connections to a default DBMS schema where the user has only read data access.

Oracle:

ODBCStoreSQLMapper.MyDSN.SQLOnConnect="ALTER SESSION SET CURRENT_SCHEMA=


%KXENUSER%"

6.4.2.40 SQLOnDisconnect

This parameter contains the SQL query emitted just before a connection to the data source is closed. Several queries can be emitted by adding a sequence number after SQLOnDisconnect. An execution error in an SQL query is not reported and does not stop the disconnection process.

 Example

ODBCStoreSQLMapper.MyDSN.SQLOnDisconnect1="drop table XXX"

6.4.2.41 SupportHugeOracleIntegers

Oracle is able to manage huge integers up to 38 decimal digits that cannot fit in 64-bit integers. Even storing
such huge values in floating numbers is not enough (refer to the option
OracleMaxNumberDigitsForInteger): precision can be lost and values managed by Automated Analytics
may be corrupted. Such a precision loss is a problem when these integers are used as primary keys and the
application has to write them back.

In order to avoid precision loss with such huge primary key values, this option forces the application to manage
them as strings. Strings are not subject to any truncation or precision loss when read from Oracle or written
back to Oracle. Limiting this option to primary keys also ensures that models are not impacted, since primary
keys are not managed as input for predictive models.

The default value of this option is false. This option should be activated only when primary keys are defined as Oracle default integers (NUMBER(38)) and actually store large values.

 Example

ODBCStoreSQLMapper.<ODBC DSN>.SupportHugeOracleIntegers=true

6.4.2.42 SupportKxDesc

Each time a DBMS table is analyzed, the application tries to read a description stored in a table named KxDesc_<tableName>. Depending on the DBMS, this feature can slow down the analysis process even if no KxDesc description is available.

The SupportKxDesc option allows you to suppress this feature.

 Example

ODBCStoreSQLMapper.MyDSN.SupportKxDesc=false

 Caution

At this time, Hive server is the only DBMS for which this option would be useful. For backward compatibility reasons, the default value of this option is true, even for Hive server.

6.4.2.43 SupportNativeInfos

When creating an applyOut table, the application adds primary key fields of the applyIn table. By doing so, the
application tries its best to replicate exactly the SQL type (with all subtle variations specific to current DBMS)
of the input field with information provided by the current ODBC driver.

Depending on the field/DBMS/driver/OS combination, this reconstruction of the proper SQL type can be incorrect. SupportNativeInfos allows deactivating this reconstruction mechanism.

If deactivated, any field cloned from an input field uses a basic and safe type.
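
For example, assuming the same configuration key pattern as the other options in this section, the reconstruction mechanism could be deactivated for a given DSN with:

ODBCStoreSQLMapper.MyDSN.SupportNativeInfos=false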

6.4.2.44 SupportNativeRead

Historically, Automated Analytics used to request ODBC drivers to return data in string format and converted
strings in the proper native type. For example, the value 123 was received as string "123" and converted to the
true integer constant 123.

This, while safer, used more network and CPU resources, so the application now requests the ODBC driver to
return the data in their native format (which is typically more concise) and no longer needs to convert them.

Unfortunately, some combinations of ODBC drivers/ DBMS do not manage native values properly. In that case,
SupportNativeRead allows you to force reading data values in string format.
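
For example, assuming the same key pattern as the other options in this section and that the value false disables native reads, the string format could be forced with:

ODBCStoreSQLMapper.MyDSN.SupportNativeRead=false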

6.4.2.45 SupportPrimaryKeys and SupportStatistics

Some DBMS are not able to publish the primary keys of a table. In such a case, the application tries to request
the statistics of a table, finds the first unique index and promotes the associated fields to primary keys.

SupportPrimaryKeys indicates that the DBMS is able to report primary keys. SupportStatistics indicates that the DBMS manages statistics on a table.
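
For example, assuming the same key pattern and boolean values as the other options in this section, a DBMS that reports neither primary keys nor statistics could be described with:

ODBCStoreSQLMapper.MyDSN.SupportPrimaryKeys=false
ODBCStoreSQLMapper.MyDSN.SupportStatistics=false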

6.4.2.46 SupportTempTables

Complex data manipulations may use temporary tables. Some DBMS provide a temporary table feature (tables stored in a special space and automatically deleted) and the application has validated its usage. As the temporary space usage can be constrained by IT policy, the option SupportTempTables allows you to force the application to use its own mechanism for temporary tables. In such a case, the temporary tables are regular tables and the application handles their destruction.

The option SupportTempTables is set to true for SQLServer and Teradata.
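
For example, assuming the same key pattern as the other options in this section, setting the option to false should make the application fall back to its own regular-table mechanism:

ODBCStoreSQLMapper.MyDSN.SupportTempTables=false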

 Note

Even if the temporary table mechanism is available on Oracle, it has not been validated with Automated
Analytics.

6.4.2.47 SupportUnicodeOnConnect

The ODBC API allows establishing connections to DBMS using Unicode parameters as well as native character
set for the user and password.

Tests have shown (especially on Linux) that the proper use of the Unicode option may depend on a complex setup of the DBMS/client software.

By default, the application establishes the connection with the native character set parameters and then
requests the Unicode capabilities of the current configuration. The option SupportUnicodeOnConnect
(default value false) allows you to force the use of Unicode for the user and password.
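
For example, assuming the same key pattern as the other options in this section, the Unicode connection parameters could be forced with:

ODBCStoreSQLMapper.MyDSN.SupportUnicodeOnConnect=true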

6.4.2.48 TableTypesFilter

By default, SAP Predictive Analytics displays the list of tables and views. This option allows you to limit this list.
The default value is TABLE,VIEW.

 Sample Code

The following code will display the list of tables only.

ODBCStoreSQLMapper.*.TableTypesFilter=TABLE

 Example

Oracle manages the specific concept of SYNONYMS. Any object in Oracle can have a synonym, which
typically has a simpler name than the original object. For performance reasons, the default behavior on
Oracle is not to manage this list of synonyms, but you can activate it by adding SYNONYM to this option.

ODBCStoreSQLMapper.*.TableTypesFilter="TABLE,VIEW,SYNONYM"

6.4.2.49 Unicode Management

You can find Unicode strings in two places in the DBMS:

● the data records,


● the table names or field names.

You can encounter the following ODBC drivers/ODBC driver manager/DBMS configurations:

● the records manage Unicode, but not the table/field names,


● the records and the table/field names manage Unicode,
● an erroneous answer is returned when reporting about Unicode features.

SupportUnicodeOnMeta / SupportUnicodeOnData

Two Unicode options can be associated to a data source:

● SupportUnicodeOnData describes the kind of character conversion supported by the ODBC driver/
ODBC Driver Manager/DBMS in the record data.
● SupportUnicodeOnMeta describes the kind of character conversion supported by the ODBC driver/
ODBC Driver Manager/DBMS in the object names (tables, fields, indexes).

Default Values for SupportUnicodeOnData / SupportUnicodeOnMeta

The default values for these options follow the standard application behavior related to the MultiLanguageIsDefault option:

● MultiLanguageIsDefault = no (default)
○ This behavior is compatible with previous versions: input/output is done in native character sets (client code page). The Automated Analytics ODBC layer works in conjunction with mini drivers in order to read and emit characters in the client code page.
● MultiLanguageIsDefault = yes
○ All input/output is done in Unicode (UTF-16) characters. The Automated Analytics ODBC layer works in conjunction with mini drivers in order to read and emit characters in UTF-16 format.

The MultiLanguageIsDefault option is a global switch for the full Automated Analytics engine and can be overridden by the SupportUnicodeOnMeta / SupportUnicodeOnData options on every DSN.

Available Character Conversions

The values for SupportUnicodeOnData / SupportUnicodeOnMeta can be:

● UTF16, yes, 1, on, true, y, t: the ODBC Driver/ODBC Driver Manager/DBMS will emit characters in UTF-16 format, meaning Automated Analytics must do a UTF-16 to UTF-8 conversion.
● UTF8: the ODBC Driver/ODBC Driver Manager/DBMS will emit characters in UTF-8 format. Automated Analytics will not do any conversion.
● Any other value means the ODBC Driver/ODBC Driver Manager/DBMS will emit characters in the client code page. Automated Analytics must do a code page to UTF-8 conversion.

UTF-8 is a special value as it is not ODBC compliant. At this time, only Teradata ODBC driver for Teradata
DBMS 2R5.1 Beta needs this special value. The associated mini driver manages it.
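
For example, assuming the same key pattern as the other ODBC options, a configuration whose driver emits UTF-16 for record data but only the client code page for object names could be described with:

ODBCStoreSQLMapper.MyDSN.SupportUnicodeOnData=UTF16
ODBCStoreSQLMapper.MyDSN.SupportUnicodeOnMeta=no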

6.4.2.50 UserAuthorizedChars

This parameter contains an explicit list of authorized characters. This list is concatenated to the standard ANSI
list, which is always used.

 Example

ODBCStoreSQLMapper.MyDSN.UserAuthorizedChars="~[]"

6.4.2.51 UserKeywords

This parameter contains an explicit list of SQL keywords separated by a comma (,). This list is concatenated to the standard ANSI list of SQL keywords, which is always used. The application adds quotes around any table or field name appearing in this list.

 Example

ODBCStoreSQLMapper.MyDSN.UserKeywords="SCORE,TRIGGER,ID"

6.4.2.52 WarnAboutIssues

Automated Analytics tries to detect partially supported configurations of DBMS/ODBC drivers. In such a case, a warning is emitted about potential issues. As with ApplyWorkArounds, such detection can be inappropriate. WarnAboutIssues (default value true) allows the warning to be hidden if set to false.

At this time, such warnings are only emitted for Teradata in two cases:

● Old Windows demonstration version of Teradata (05.00.0011) is used.

● An inconsistency between the Unicode configuration of Automated Analytics and the Unicode configuration of the ODBC connection is detected: Automated Analytics in Unicode configuration must use a Teradata ODBC driver in Unicode configuration.
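
For example, assuming the same key pattern as the other options in this section, these warnings could be hidden for a given DSN with:

ODBCStoreSQLMapper.MyDSN.WarnAboutIssues=false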

6.4.3 Schema Definition

By overriding the application defaults, you can set the schema in which the database structures used by the application are defined.

In addition, on Sybase ASE, a user must have the role SA to create and drop tables.

6.4.3.1 defaultSchemaForKxenTemporary

All temporary metadata will be created in this schema and the life span of metadata is completely managed by
Automated Analytics.

The same set of macros available with SQLOnCatalog options can be used too. For more information on these
options, you can refer to Format of the SQL Request to Use for SQLOnCatalog [page 176].

 Example

ODBCStoreSQLMapper.*.defaultSchemaForKxenTemporary=%SQLUSER_MAJUS% will use the user's private schema to store all temporary tables.

 Note

For all HANA connections, the default value for defaultSchemaForKxenTemporary is %SQLUSER_MAJUS%, meaning that all temporary tables that may be needed by Automated Analytics are stored in the private schema of the current user.

6.4.3.2 defaultSchemaForMetadata

You can set a default schema name for all Automated Analytics metadata spaces.

The list of these metadata are:

● KxAdmin
● KxInfos
● KxLocks
● KxCommunities0
● KxLinks0
● KxNodes0
● KxOlapCube0
● ConnectorsTable

● KxMapping
● KxVariables
● KxCatTranslation
● KXENMeta.KxContStruct
● KXENMeta.KxNomStruct
● KXENMeta.KxOrdStruct

The same set of macros available with SQLOnCatalog options can be used too. For more information on these
options, you can refer to Format of the SQL Request to Use for SQLOnCatalog [page 176]

 Example

ODBCStoreSQLMapper.*.defaultSchemaForMetadata=%SQLUSER_MAJUS% will use the user's private schema to store all metadata.

 Note

When saving a model or a data manipulation, if the key defaultSchemaForMetadata is specified and if
the model or the data manipulation do not have a schema definition then the specified schema will be used.

 Note

For all HANA connections, the default value for defaultSchemaForMetadata is %SQLUSER_MAJUS%, meaning that all metadata tables needed by Automated Analytics are stored in the private schema of the current user.

6.4.4 Data Manipulation SQL Generation Options

These options are used to drive the data manipulation SQL code generator. Each declared ODBC data source
can have its own overrides set as in the ODBC driver options case.

6.4.4.1 AggregatesAsCorrelated

This option allows you to force or disable the generation of standard aggregates (min, max, count, avg) using correlated tables instead of SELECT sub queries. Some databases ignore this option when they only support one of the two forms. For instance, Teradata handles only correlated tables.

 Note

Currently, no warning is given when such a situation occurs.

Possible values:

● System (default): the system automatically selects which kind of aggregates generation is used.
● Forced: correlated tables are used when the RDBMS allows it.
● Disabled: SELECT sub queries are used when the RDBMS allows it.

 Note

See section Supported Aggregates SQL Forms for the list of databases supporting each form.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.AggregatesAsCorrelated="Disabled"

Supported Aggregates SQL Forms

Standard First/Last Exists

SQLServer Correlated tables as default Correlated tables as default Sub query only

Oracle Correlated tables as default Correlated tables only Sub query only

Teradata Correlated tables only Correlated tables only Correlated tables only

DB2 Sub query only Correlated tables only Sub query only

PostgreSQL Correlated tables as default Correlated tables only Correlated tables only

SybaseIQ Correlated tables only Correlated tables only Correlated tables only

Vertica Correlated tables only Correlated tables only Correlated tables only

Netezza Correlated tables only Correlated tables as default Correlated tables as default

Caption:

● Sub query only: on this database, the application generates aggregates using only the sub-query form.
● Correlated tables only: on this database, the application generates aggregates using only correlated tables.
● Correlated tables as default: on this database, the application generates correlated tables by default, but it can be made to generate SELECT sub queries when possible.

6.4.4.2 AggregatesAsCorrelated_Exists

This option has the same effect as AggregatesAsCorrelated, but it affects the Exists aggregate.

Possible values:

● System (default): the system automatically selects which kind of aggregates generation is used.
● Forced: correlated tables are used when the RDBMS allows it.
● Disabled: SELECT sub queries are used when the RDBMS allows it.

 Note

See section Supported Aggregates SQL Forms for the list of databases supporting each form.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.AggregatesAsCorrelated_Exists="Forced"

6.4.4.3 AggregatesAsCorrelated_FirstLast

This option has the same effect as AggregatesAsCorrelated, but it affects the First/Last aggregates.

Possible values:

● System (default): the system automatically selects which kind of aggregates generation is used.
● Forced: correlated tables are used when the RDBMS allows it.
● Disabled: SELECT sub queries are used when the RDBMS allows it.

 Note

See section Supported Aggregates SQL Forms for the list of databases supporting each form.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.AggregatesAsCorrelated_FirstLast="Forced"

6.4.4.4 AllowFiltersInIntermediatesSteps

In a multi-step data manipulation, the global filter may be simple enough to be applied to some intermediate steps. By default, Automated Analytics does not forward any filter, even in simple situations. This configuration key instructs the application to apply the global filter to the intermediate steps that can evaluate it, when the filter does not depend on an aggregate.

Possible values:

● true (default)
● false

 Example

ConnectorContainerSQLBuilder.<MyDSN>.AllowFiltersInIntermediatesSteps="false"

6.4.4.5 AutomaticallyExportedFields

Explorer objects, such as entities or time-stamped populations, are configured to expose only a defined set of fields. This option allows specifying if other fields should be visible to data manipulation users.

Possible values:

● KxTarget;KxWeight (default)
● name of the fields to be made visible separated by semi-colons

 Example

ConnectorContainerSQLBuilder.<MyDSN>.AutomaticallyExportedFields="MyColumn1;MyColumn2;MyColumn3"

6.4.4.6 ConstantFieldsManagedByValue

This parameter allows optimizing the use of constant-valued fields. When this option is activated, a data manipulation that has to expose a constant-valued field does so by propagating its actual value upward and informing upper data manipulations that the field is constant, so that its value is used wherever the field name is referenced. This allows some data manipulations to be executed 10 times faster.

Possible values:

● true (default)
● false

 Example

ConnectorContainerSQLBuilder.<MyDSN>.ConstantFieldsManagedByValue="false"

6.4.4.7 CorrelatedTablesGroupSize

When the GenerateAggregatesFilterInTheOperand option is set to System, the aggregates sharing the
same time window but yielding different extra filter conditions are grouped together. This parameter allows you
to set the maximum number of aggregates that a given group may contain.

The default value is 7.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.CorrelatedTablesGroupSize=5

6.4.4.8 DisableDuplicateIdManagementForAggregates

To deal with reference tables that may contain duplicate identifiers, the application now generates more complex SQL statements. On some databases this may lead to poorly performing SQL requests. When the reference table is known not to have duplicate identifiers, setting this option to true avoids the overhead implied by this extra processing.

Possible Values:

● true
● false (default)

 Example

ConnectorContainerSQLBuilder.<MyDSN>.DisableDuplicateIdManagementForAggregates="true"

6.4.4.9 EnforceCommonTableExpressionUsage
In data mining, you use a Common Table Expression (CTE) to optimize a complex sub-data manipulation.

The option EnforceCommonTableExpressionUsage activates the SQL "with clause" if the following
conditions are met:

● The backend database supports the "with clause" feature.


● The data manipulation is used as reference table for an aggregate.

The use of CTE is controlled by the EnforceCommonTableExpressionUsage option. The following table lists
its possible values.

Value       Description

Disabled    CTEs are not generated.

Enabled     Any sub-data manipulation that is not captured by a persistence (as view or table) rule is optimized using a dedicated CTE.

Auto        The sub-data manipulations used as reference table for some aggregates are optimized through a "with clause".

 Tip

To activate this option by default for SAP HANA at a DSN level in the configuration file, use the following
statement:

ConnectorContainerSQLBuilder.<MyDSN>.EnforceCommonTableExpressionUsage="Forced".

To completely deactivate the use of CTE at a DSN level for SAP HANA, use the following statement:

ConnectorContainerSQLBuilder.<MyDSN>.CTEAllowed=false.

6.4.4.10 ForwardVisibleKeysToIntermediateSteps

When a data manipulation is too complex to be expressed as a single SELECT statement, Automated Analytics automatically splits it into smaller steps that create intermediate tables. This leads to situations where key
columns in one step/table cannot be considered as keys in another one (typically, due to a merge). To cope
with most cases, the application has implemented a strategy that filters out unused keys columns.

The ForwardVisibleKeysToIntermediateSteps configuration key allows preserving visible key columns even if they are not used.

Possible values:

● true (default)
● false

 Example

ConnectorContainerSQLBuilder.<MyDSN>.ForwardVisibleKeysToIntermediateSteps="false"

6.4.4.11 GenerateAggregatesFilterInTheOperand

On databases such as Teradata, Oracle, and DB2, Automated Analytics formulates aggregates as correlated or derived table expressions. Each of these tables gathers aggregates done over the same row set. A row set is defined, on one hand, by the key columns and, on the other hand, by the filtering condition. The filter embeds the time window specification (when provided) along with an additional predicate.

This option allows generating the predicate in such a way that it is taken into account without negatively impacting the filtering condition, that is, without leading to related pivoted aggregates being generated in different tables.

Values

● System (default): The system decides whether or not to add the extra predicate to the filtering condition based on the number of related aggregates. This threshold can be set through the MaxExtraCondUsage parameter.
● Forced: the extra predicate is never included in the filter expression.
● Disabled: the extra predicate is always included in the filter expression.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.GenerateAggregatesFilterInTheOperand="Forced"

6.4.4.12 GenerateSerialMerge

For the sake of clarity, the notation (T1,T2) will stand for the operation of (left outer) joining T1 and T2 with join conditions.

In a long chain/sequence of table joins, for instance ((T1,(T2,T4)), (T3,T4)), some of the join conditions may be generated right after the join or at the end of the query without any impact on the resulting row set. In
some cases, end-positioned join conditions will negatively impact the performance of the generated SQL. This
option allows specifying the position of the join conditions.

Possible values:

● System (default): the system automatically selects the preferred form according to the experiments on the
concerned data base.
● Forced: the join conditions are generated right after the join operation.
● Disabled: the join conditions are generated at the end of the join sequence.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.GenerateSerialMerge="Forced"

6.4.4.13 MaxExtraCondUsage

When the GenerateAggregatesFilterInTheOperand option is set to System, the aggregates sharing the
same time window but yielding different extra filter conditions are grouped together if their number is over a
given threshold. This parameter sets that threshold.

The default value is 5.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.MaxExtraCondUsage=5

6.4.4.14 OptimizeExpressionsBasedOnDM

In a data manipulation, expressions that are referenced several times are factorized for efficiency purposes.
When these expressions use fields coming from a joined/sub data manipulation, the kernel may fail to correctly
factorize them. In that case, this option allows excluding them from the factorization process.

 Note

This option can only be activated when the option OptimizeMultiInstancesExpressions is on.

Possible values:

● true (default)
● false

 Example

ConnectorContainerSQLBuilder.<MyDSN>.OptimizeExpressionsBasedOnDM="false"

6.4.4.15 OptimizeMultiInstancesExpressions

In a data manipulation, two strategies can be used when a computed field/expression is referenced more than
once:

● the standard strategy, where each reference is replaced by the full SQL definition of the concerned computed field so that the computation takes place several times.
● the optimized strategy, where a temporary column is computed (once) using the field SQL definition. Then each subsequent reference to the computed field will be replaced by a reference to the new column.

Possible Values

● System (default): the strategy is automatically selected by the system. By default, if a field is referenced
more than once, the optimized strategy is used.
● Forced: the optimized strategy is used.
● Disabled: the standard strategy is used.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.OptimizeMultiInstancesExpressions="Forced"

6.4.4.16 OptimizeMultiInstancesThreshold

When multi-referenced expressions should be optimized (see OptimizeMultiInstancesExpressions), this


parameter specifies the maximum number of reuses before this optimization takes place.

Default Value:

● 2

 Example

ConnectorContainerSQLBuilder.<MyDSN>.OptimizeMultiInstancesThreshold=3

6.4.4.17 PrefilterAggregatesGroupTable

The table expressions grouping pivoted aggregates together have the ability to filter the rows seen by these aggregates. Since each aggregate processes only the matching rows, this pre-filtering can be left off: that is the current default behavior. It can be changed using the PrefilterAggregatesGroupTable configuration key.

Pre-filtering often involves non-indexed columns, which consequently leads to full scans. When all the columns
used in the aggregates filters or in the join conditions are known to be indexed, switching to pre-filtering may
have positive performance impacts.

Possible values:

● true
● false (default)

 Example

ConnectorContainerSQLBuilder.<MyDSN>.PrefilterAggregatesGroupTable="true"

6.4.4.18 SQLAsSQLScript

This option allows you to force or deactivate the decomposition of a data manipulation into smaller
intermediate steps (automatically determined by the system).

Possible values:

● System (default): depending on the relative complexity of the data manipulation, the system chooses
whether to activate the decomposition.
● Forced: the data manipulation is decomposed into intermediate steps regardless of its complexity.
● Disabled: the data manipulation is never decomposed into intermediate steps no matter how complex it
could be. Be very careful when selecting this value.

 Example

ConnectorContainerSQLBuilder.<MyDSN>.SQLAsSQLScript="Disabled"

6.4.4.19 TSPPersistence

In operational environments, analytical datasets generate complicated SQL that some RDBMS optimizers may fail to analyze correctly. This has been observed especially when filtered time-stamped populations are involved. To circumvent this, materializing the time-stamped population generally helps the optimizer make better choices for the execution plan. This option helps control that behavior.

Possible Values

● System (default): the system uses specific rules to automatically decide whether to apply the
materialization. For the time being, all filtered time-stamped populations are materialized.
● Forced: all time-stamped populations are materialized
● Disabled: time-stamped populations are never materialized

 Example

ConnectorContainerSQLBuilder.<MyDSN>.TSPPersistence="Forced"

6.4.4.20 UnusedKeysInIntermediatesForOptim

For optimization purposes, a given multi-referenced expression factorization takes place only when the table on which it is computed has key columns.

In multi-step data manipulations, to avoid violating physical key non-nullity and uniqueness constraints, the system applies specific rules filtering out keys that are not required in intermediate tables. This may lead to situations where multi-referenced expressions are left non-factorized.

This option instructs the system not to exclude from intermediate tables key columns that may be used in the optimization process.

Possible values:

● true (default)
● false

 Example

ConnectorContainerSQLBuilder.<MyDSN>.UnusedKeysInIntermediatesForOptim="false"

7 Bulk Load With Oracle

7.1 About Bulk Load for Oracle

Typically, scoring processes write large amounts of data into the DBMS, and the In-database feature of the application manages this process entirely in the DBMS, avoiding slow writes to the DBMS.

However, with the integration of new algorithms and the increase in complexity and size of the models, there are now some contexts where In-database Apply cannot be used and writes into the DBMS must be entirely driven by the application, slowing the scoring process. Social is a typical example of this particular context.

The Bulk Load for Oracle feature addresses this issue for Oracle and is available for Automated Analytics.

7.2 Technology

DataDirect

Bulk Load for Oracle uses the bulk mode of the DataDirect technology. This technology is embedded in SAP
Predictive Analytics Connector for Oracle.

Oracle

DataDirect’s bulk mode is built on top of Oracle’s DirectPath technology.

7.3 Performances

The software stack becomes larger, but the performance gains are significant.

An example of Automated Analytics’ data transfer from text to Oracle using bulk load:

● On our Oracle10 test server running on small hardware


● File to import: 7 fields and 44 M lines

Time to import in regular mode: 5 hours and 10 minutes

Time to import with bulk load activated: 14 minutes

7.4 Setup

As soon as SAP Predictive Analytics Connector for Oracle is used to connect with Oracle, the Bulk Load feature
is available.

On Windows, to activate Bulk Load for Oracle, you need to check an option in the setup dialog of the ODBC
connection.

On Linux, the option is automatically activated by the standard application installation.

In this guide, the sections Connecting to your Database Management System (available for Linux and Windows) describe the complete setup of an Oracle connection to the DBMS.

7.5 Perimeter of Usage and Limitations

At this time, the Bulk Load feature is used when only insertions must be done in the DBMS. Update/insert sequences cannot use Bulk Load.

In practice, Bulk Load is activated for:

● Scoring to new tables


● Scoring to empty tables
● Saving a metadata table (KxAdmin, models, connectors, KxInfos, …)

 Caution

Scoring in an existing already filled table cannot use the Bulk Load mode.

7.6 Tuning

Various options are available to deactivate and tune the Bulk Load feature.

The section ODBC Fine Tuning of this guide describes these parameters.

Here is a short summary of these parameters:

SupportBulkImport                          Allows totally deactivating Bulk Load and switching back to regular insertions

SupportDeferredKeyCreationForBulkImport    Allows creating primary keys after the full insertion has completed, instead of the default mode where keys are created during insertions

MaxNbRowsInBulkTransaction                 Maximum number of rows that can be inserted before a flush is requested

MaxNbRowsInBulkBatch                       Number of records inserted per block

ApplyBulkWorkarounds                       Allows deactivating the detection of DataDirect issues
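
For example, assuming the same ODBCStoreSQLMapper key pattern that is used for these options in the ODBC Fine Tuning section (MyOracleDSN being a placeholder DSN name), Bulk Load could be deactivated with:

ODBCStoreSQLMapper.MyOracleDSN.SupportBulkImport=false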

8 Fast Write for Teradata

8.1 About this document

This document is addressed to SAP internal teams and partners who want to understand the Automated Analytics feature Fast Write for Teradata.

In this document, fastload references the Teradata import tool and Fast Write references the Automated
Analytics feature built on top of fastload.

Complete documentation can be found on the SAP Help Portal at http://help.sap.com/pa

8.2 Purpose

The application is historically not designed to write large amounts of data into a DBMS.

Writing large amounts of data into a DBMS typically concerns scoring processes, and the In-database feature of the application manages this process entirely in the DBMS, thus avoiding slow responses.

However, with the integration of new algorithms and the increase in complexity and size of the models, there are now some contexts where the In-database Apply feature cannot be used and the application must write directly into the DBMS, slowing down the scoring process. Social is a typical example of algorithms suffering from this issue.

The feature Fast Write for Teradata addresses this issue for Teradata and has been available since version 6.1 of the application.

8.3 Architecture

8.3.1 Main components

ODBC

The fastload technology is a fast import technology. It does not provide any way to request meta data
information about tables. This is why the Fast Write mode still needs an ODBC connection.

Fast Write

The Fast Write module manages metadata information about the target table, prepares a special fast communication channel, generates a fastload script instrumented to make fastload fetch data from this channel, and feeds the channel as fast as possible. This module also monitors the fastload task in order to provide the user with some feedback.

Fastload

This is Teradata's best-in-class import technology. It executes the fastload script generated by the application and is able to send data to Teradata using a proprietary protocol.

This tool has a plugin facility (INMOD setup) allowing data to be fetched from an external library.

KxFastLoadBridge

This is a plugin for fastload following Teradata’s INMOD protocol. This plugin connects back to Automated
Analytics and fetches data from a special communication channel.

Fastload Script

Except for the INMOD plugin usage, the fastload script generated by the application uses the standard and
basic fastload features.

Fast Write Data Stream

The fast communication channel used by Automated Analytics to send data to KxFastLoadBridge is a named
pipe.

8.3.2 Technology

The Fast Write feature uses the best Teradata import tool: fastload. The fastload technology allows the import
of huge datasets in a Teradata database with a specialized and efficient proprietary protocol.

Advantages

Among several advantages, fastload is:

● A very common and natural tool in Teradata's ecosystem. Its installation is straightforward and is often already done.
● A cornerstone tool for Teradata. Therefore, it always benefits from best-in-class Teradata technology (parallel import for example) and is always up-to-date regarding the latest Teradata DBMS features.

Drawbacks

For the following reasons, the detection of the availability of fastload and the activation of Fast Write are not automatic:

● fastload is a proprietary Teradata tool that is complex to manage and to integrate. Initializing a fastload task is quite heavy, with a significant bootstrap time, and it may be counterproductive to use it for small datasets.
● Due to specificities of the fastload protocol, a bad tuning may use too many resources in the Teradata DBMS and therefore slow down other activities.
● fastload tasks are typically a scarce resource in a Teradata DBMS. The number of concurrent and available fastload tasks may be small (the default is 15).

For more information, contact SAP Support.

8.4 Performance

Compared to ODBC, the default tuning of Fast Write allows you to improve performance by a factor of at least 100. A fastload task comprises two phases: the acquisition phase and the application phase.

Phase I: Acquisition Phase

During the acquisition phase, rows are sent as fast as possible to the DBMS and are stored in a temporary
spool space. This phase is heavily dependent on the actual network bandwidth usage and the capacity of the
client to send data as fast as possible.

Phase II: Application Phase

During the application phase, the data stored in the temporary space is dispatched into Teradata nodes. This
phase does not use any network resource and is entirely done inside Teradata’s DBMS.

Test platforms

On a day-to-day basis, the application uses small test platforms for Teradata. These platforms have hardware limitations and, on them, phase II takes some time. On true Teradata hardware, phase II is almost free.

8.5 Perimeter of Usage and Limitations

Supported Platforms

Client OS Teradata Client Teradata DBMS

Windows 64-bit TTUF 13.1 Teradata 13.1

Linux 64-bit TTUF 13.1 Teradata 13.1

Windows 64-bit TTUF 14.0 Teradata 14.0

Linux 64-bit TTUF 14.0 Teradata 14.0

Windows 64-bit TTUF 14.1 Teradata 14.1

Linux 64-bit TTUF 14.1 Teradata 14.1

Windows 64-bit TTUF 15.0 Teradata 15.0

Linux 64-bit TTUF 15.0 Teradata 15.0

Linux 64-bit TTUF 15.1 Teradata 15.1

 Caution

For TTUF 14.10, the patch for fastload 14.10.00.03 is mandatory.

Compatibility with TD 14.0 & TD 14.1 & TD 15.0 & TD 15.10

Fast Write is automatically compatible with the enhancements on huge resultsets provided in TD 14.0, on
extended object names provided in TD 14.1, and with the enhancements provided in TD 15.0 and TD 15.10.

Limitations

The limitations of Fast Write are the limitations of fastload:

● Fast Write is activated only on non-existing or empty tables.


● Blobs cannot be used.

8.6 Setup

ODBC

The fastload tool does not provide any metadata information. Therefore, the ODBC technology is used to discover metadata information about the data (list and type of fields) and to set up fastload tasks.

This is why the ODBC driver must be properly set up even if Fast Write is activated.

 Note

ODBC installation and setup are described in the support document Connecting to your Database
Management System.

32/64-bit fastload Packages

Fastload is a 32-bit technology. Even on a 64-bit OS and with Automated Analytics in 64-bit, fastload and all its
dependencies are 32-bit Teradata packages.

Globally, to be able to use the Fast Write feature, the following Teradata packages need to be installed:

● fastload 32-bit
● fastload dependencies
● tdicu 32-bit
● teragss 32-bit
● cliv2 32-bit
● DataConnector (piom) 32-bit
● tdodbc 64-bit
● tdodbc dependencies
● tdicu 64-bit
● teragss 64-bit

8.6.1 Installing Fastload

8.6.1.1 To install fastload on Windows

The fastload packages can be found on the Teradata CD labeled:

Teradata Tools and Utilities


Load/unload for Windows
Release 13.10.01
Volume: 01/03

1. Insert the CD and select Custom Setup.
2. When prompted to select packages, select fastload.

All dependencies are automatically selected.


3. Confirm the summary dialog.

The installation is then straightforward.


4. Once fastload has been set up, check that the installation has been correctly executed (see Checking the Installation of Fastload).

8.6.1.2 To install fastload on Linux

The fastload packages can be found on the Teradata CD labeled:

Teradata Tools and Utilities


Load/Unload for HP-UX & Linux or Load/Unload for AIX & Solaris
Release 13.10.01
Volume: 02/03 or 03/03

1. Insert the CD that contains the Teradata Client Software to install. When inserted, the CD will allow
selecting the Teradata packages to install.
2. Select fastload, which is the only top-level package needed for the application. The dependent packages
tdicu, TeraGSS, DataConnector and cliv2 will automatically be selected.
3. Once fastload has been set up, check that the installation has been correctly executed.

8.6.2 Checking the Installation of Fastload

It is very important that fastload is properly set up for the Fast Write feature to work. If the following procedure
fails, stop the Fast Write setup and contact support.

1. Open a command window (on Windows) or an interactive shell (on Linux).


2. At the prompt, type fastload. If fastload is properly set up, the fastload banner and prompt are displayed.

3. Type .logon <Teradata host name>/<user>,<password>

 Note

<Teradata host name> follows the Teradata naming policy or can be an IP address.

4. Check if the logon is successful (FDL4808 LOGON successful).

5. Type .logoff; to disconnect


6. Check that the whole fastload task has terminated properly: the message Highest return code encountered = '0' should be displayed.

 Caution

Any error in this procedure is fatal. The fastload tool must be properly setup before Fast Write can be
activated.

8.6.3 Teradata Patches

It is always a good idea to update your fastload installation with up-to-date patches. The process to download
and apply Teradata patches depends on each Teradata site and must be done by an authorized user.

Here are the patched versions of Teradata packages the application is currently using to validate Fast Write:

OS Architecture Package Version

Linux 32 tdicu 13.10.00.02

Linux 32 TeraGSS 13.10.04.01

Linux 32 cliv2 13.10.00.11-1

Linux 32 DataConnector 13.10.00.09-1

Linux 32 fastload 13.10.00.12-1

Windows 32 tdicu 13.10.00.02

Windows 32 TeraGSS 13.10.05.1

Windows 32 cliv2 13.10.00.08

Windows 32 piom 13.10.00.09

Windows 32 fastload 13.10.00.13

8.6.4 Automated Analytics Setup

Fast Write must be manually activated for each ODBC connection. This is done by editing the configuration file Fast Write.cfg and adding the following line for each ODBC connection that should use Fast Write:

ODBCStoreSQLMapper.<ODBC DSN>.SupportBulkImport=true

 Example

If the ODBC connection to accelerate is BigData, the line to add is:

ODBCStoreSQLMapper.BigData.SupportBulkImport=true

 Note

Instead of <ODBC DSN>, the wildcard * can be used and means 'all ODBC connections'.

8.6.4.1 Checking the Activation of Fast Write

A simple way is to use the application to make a data transfer from sample files to Teradata.

1. Launch SAP Predictive Analytics.

2. Select Toolkit > Perform a Data Transfer.
3. Choose the sample file Census01.csv.
4. Transfer it using a Teradata ODBC connection with Fast Write enabled. The transfer should be very fast (~10s).
5. Click the detailed log button to see the full logs for Fast Write. Actual values of throughput and times depend on the current configuration.

8.7 Fast Write Usage

Standard Usage

After the initial setup, the Fast Write feature is designed to be used automatically in a transparent way. Specifically, there is no special action associated with Fast Write in the graphical interface.

8.7.1 Logs

Fast Write uses a complex architecture. Each component of this architecture provides logs in order to ease
support and tuning.

Globally, the application drives these components to name all these logs using the same default policy. Each
log is named KxFastLoad_<Unique_Session_Number>_<topic>.

Fastload Temporary Log Tables

For each fastload task, the fastload tool creates two log tables named:

● KxFastLoad_<Unique_Session_Number>_Log1
● KxFastLoad_<Unique_Session_Number>_Log2

These tables are populated only in case of a data issue (record rejected due to constraint violation, bad data
format, …) and actual content can be analyzed only by a DBA. The database used to store these tables is the
same as the target table.

These tables are automatically deleted if no issue is detected.

Temporary Objects and Logs

For each Fast Write session, three files are created:

● KxFastLoad_<Unique_Session_Number>_script.fld: the fastload script generated by the


application
● KxFastLoad_<Unique_Session_Number>_log: the log file generated by fastload
● KxFastLoad_<Unique_Session_Number>_err: the log file generated by KxFastLoadBridge

These files are created in the standard temp directory and are automatically deleted if no issue is detected.

Several configuration options allow tuning the name and folder of these files; they are described in Annex B - List of Options.

Logs & Error Management

In case of any error or warning (logon issue, network or DBMS issue, bad format, ...) reported by fastload, the application does not delete any log and displays a message asking you to collect these files and send them to SAP support.


8.7.2 Tips and Tricks


The fastload technology uses a special import feature of the Teradata DBMS; while running, it sets up the target table (as well as both log tables) in a specific state. While in this state, a table cannot be used for the standard SELECT, INSERT, UPDATE or DELETE statements. Any attempt at doing so will generate the following diagnostic: Operation not allowed: <table name> is being Loaded.

After an error in a fastload task, the target and log tables are kept in this special 'loading state' and therefore cannot be used. The only way to go further is then to delete the target and log tables.

Cleaning Process After an Error:

1. Check if a fastload process is still running (rare condition). If yes, this process must be killed before going
further.
2. Drop the target table.
3. Drop the two log tables KxFastLoad_<unique_number>_Log1 and KxFastLoad_<unique_number>_Log2.

Always Provide Logon Information

With the Teradata ODBC driver, it is common to store the logon information (user name and password) in the ODBC setup dialog (Windows) or the odbc.ini file (Linux). When used, this saves you from typing the logon credentials and lets you connect quickly to Teradata.

This facility is not managed by the fastload tool (ODBC setup and fastload setup are separated); when Fast
Write is used, the logon credentials MUST be provided.

Otherwise, on Windows, an error message is displayed.

 Note

If providing the credentials is an issue, some tuning options allow you to control this check and to define a
default user and a default password.

And on Linux, an error is triggered.

Automated Analytics Metadata/Small Tables and Fast Write Performances

The Fast Write feature is really fast but needs some bootstrap time and significant resources. As a result, using it for small tables is counterproductive.

Some Automated Analytics metadata tables are always small and are automatically not eligible for the Fast Write mode; however, the application does not evaluate the size of regular data to see if it is worth using Fast Write.

Models also need special attention when saved in metadata: the first save is done automatically with the Fast Write mode, but saving the model a second time in the same table will not use it and will be slow.

 Note

To always benefit from the Fast Write mode, always save your models in new tables.

8.8 Advanced Setup and Tuning

The Fast Write feature has many options allowing you to tune various aspects of this module.

Suitable default values are built into the application, but for specific purposes or contexts, each option can be overridden.

An option has a name and a value.

To Override an Option Value:

1. Open the configuration file Fast Write.cfg.


2. Add a line:
Fast Write.<ODBC Connection Name>.<Option>=<Value>
For example:
Fast Write.BigData.DropTempFiles=false

 Note

The wildcard * can be used instead of <ODBC Connection Name> meaning ‘all connections’.

8.9 Main Options for Integration Tuning

8.9.1 Log and Temporary Objects Management

DropLogTablesBefore

The default behavior is to drop log tables before any action. Change the value to No if this cleaning must not be
done.

LogSchemaName

The log tables are stored in the same schema as the target table. Change the value of this option to force a
specific schema where all log tables will be stored.

TempPrefix

KxFastLoad_<Unique_Number>_ is the default prefix added before every temporary object name. Change the value of this option to force a user-defined prefix.

You should use this option only for debugging purposes; it forces a single name for all fastload tasks and prevents several fastload tasks from running at the same time.
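
For example, using the override syntax described in the section Advanced Setup and Tuning (MYLOGDB being a hypothetical schema name), the log tables of the BigData connection could be redirected and kept with:

Fast Write.BigData.LogSchemaName=MYLOGDB
Fast Write.BigData.DropLogTablesBefore=No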

8.9.2 Logon Mechanism

Host

The Teradata host to contact is automatically inferred from the current ODBC connection. If the host is not
properly guessed, use this option to force the correct host name.

UserName, Password

The user/password sent to fastload are the same as the ones used by the current ODBC connection. You can
change these values to force a specific user name/password.

LogMech, LogData, Logon

Fast Write uses the default fastload mechanism for logon. Use these options to force a specific logon
mechanism and its parameters (Active Directory,..). Please consult the fastload documentation for proper
usage of these parameters.

CheckEmptyUserName

Fast Write checks that the user name is not empty. Set this option to false to deactivate this test.

DefaultUserName

This option allows you to define a default user name. This default user name will be used by Fast Write when no
user name is provided. Note that this option is used only if the option CheckEmptyUserName has been
deactivated.

DefaultPassword

This option allows you to define a default password. This password will be used by Fast Write when no password
is provided.
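
For example, using the override syntax described in the section Advanced Setup and Tuning (the user name below is a placeholder), a default user could be defined once the empty-user check has been deactivated:

Fast Write.BigData.CheckEmptyUserName=false
Fast Write.BigData.DefaultUserName=td_batch_user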

8.9.3 Changing Paths & Protocols

FastLoadPath

The standard path for the fastload tool is "" on Windows and /usr/bin on Linux. Change this option if fastload
has not been set up in the default folders.

FastLoadName

Fastload is the name of the tool launched by Fast Write. Change this option if a non-standard name is used.

KxFastLoadBridgeFullPath

Change this option if your installation is not standard and if the location of KxFastLoadBridge cannot be
automatically found.

EXITProtocol

Fast Write automatically manages the extended fastload protocol coming with TD 14.0 & TD 14.1. If, for any reason, this automatic choice is not properly done, this option allows you to force a given protocol. Possible values are indicated in the following table:

EXIT       Compatible with all versions of Teradata

EXIT64     Available only in TD 14.1. 64-bit events are managed.

EXITEON    Available only in TD 14.1. EON events are managed. Note that EXITEON implies EXIT64.

8.10 Main Options for Performance Tuning

TimeBetweenPerformancesInfos

Automated Analytics displays some performance information (Nb rows inserted, Current Nb Rows /s) every
ten seconds. This option allows you to change this delay.

BlockSize

Rows are encoded in a binary format suitable for fastload and sent by blocks to fastload. The default size of a
block is 1048576 (1MB). Change this option to tune the size of the blocks.

MaxQueueSize

Binary blocks are asynchronously sent to fastload, meaning that some blocks may be queued waiting until
fastload can treat them. The maximum size of this queue is 20. Basic statistics about usage of this queue are
available in the application log:

● maximum actual size,


● time spent in full queue state (fastload overloaded),
● time spent with no block available for fastload (starvation).

Both the MaxQueueSize and BlockSize options allow you to tune the block queue and maximize the throughput of rows.

SessionMin and SessionMax

These parameters map to the fastload parameters of the same name.

The concept of low-level fastload sessions is key to maximizing performance. Changing these options must be validated by an SAP representative or a Teradata administrator.

The default value for SessionMin is 1, SessionMax is 4. This setup always allows you to initiate a Fast Write
session:

● with an immediate performance improvement, even for 'small' datasets,
● without consuming too many Teradata resources.

You can change these parameters with the help of a Teradata DBA or your application support if the number of low-level fastload sessions is an identified issue or if maximum performance is required, especially for huge datasets.
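
For example, using the override syntax described in the section Advanced Setup and Tuning (the values below are placeholders to be validated with a Teradata administrator), the session range and block size could be tuned with:

Fast Write.BigData.SessionMin=2
Fast Write.BigData.SessionMax=8
Fast Write.BigData.BlockSize=2097152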

8.11 Main options for Debug and Support

TemplateScriptFullPath

The fastload script used is generated from a built-in template script.

It is possible to force the usage of a specific template script by using this option.

DropTempFiles

All temporary files are automatically deleted when no issue is detected. It can be useful to force this option to
No in order to check the logs for a successful session.

ReportFastloadWarning

Fastload is a complex tool reporting various conditions depending on the actual production setup. The default behavior is to report any warning emitted by fastload. It may be useful to deactivate this reporting if the warning condition has no practical impact. This deactivation must be done with the advice of SAP.
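
For example, using the override syntax described in the section Advanced Setup and Tuning (the path below is a placeholder), a custom template script could be forced with:

Fast Write.BigData.TemplateScriptFullPath=/tmp/my_fastload_template.fld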

8.12 Annex A - Default fastload Script

Usage

The fastload script actually used by the application is generated from a built-in template.

This template has many %OptionName% patterns. Each pattern is replaced by its actual value at generation
time. The actual value is either generated by the application or found in the Fast Write.cfg configuration file.

Each %<OptionName>% pattern is a possible <OptionName> option.

8.12.1 Built-in Template

** Generated by %KXEN_VERSION%
** %DATE%
SET SESSION CHARSET "%SESSIONCHARSET%";
SHOW VERSION;
SESSIONS %SESSIONMAX% %SESSIONMIN%;
**
** Connection step
**
** These 3 lines will be used if the options LogMech, LogData, Logon are set
*** LOGMECH %LOGMECH%;
*** LOGDATA %LOGDATA%;
*** LOGON %LOGON%;
LOGON %HOST%/%USERNAME%,%PASSWORD%;
** The initial cleaning lines are not added if the DropLogTablesBefore option is unset
** Cleaning logs
**
DROP TABLE %FULLLOGTABLENAME1%;
DROP TABLE %FULLLOGTABLENAME2%;
** Define all fields to work with
**
DEFINE %FIELDSDEFINITIONS%
** Usage of INMOD feature
INMOD=%KXFASTLOADBRIDGEFULLPATH%;
** NOTIFY allows monitoring session steps
** %EXIT% specifies an external handler to publish notifications
** TEXT allows giving parameters to establish communications with KXEN
NOTIFY HIGH %EXIT% %KXFASTLOADBRIDGEFULLPATH% TEXT "PP=%PRIVATEPREFIX% TL=%TRACELEVEL% TF=%TRACEFILE% TE=%TRACESTDERR%";
** The actual start of the fastload session
**
BEGIN LOADING %FULLTABLENAME% ERRORFILES %FULLLOGTABLENAME1%, %FULLLOGTABLENAME2% INDICATORS;
** The INSERT statement using fastload fields defined by the 'DEFINE' statement
**
INSERT INTO %FULLTABLENAME%
(%FIELDNAMES%)
VALUES
(%FIELDSDEFINES%);
** End of job
**
END LOADING;

8.12.2 Example

** Generated by KXEN InfiniteInsight - KXEN Internal for development purpose only - Valid to 2013-01-15 6.1.0-b1
** 2012-07-18 10:03:32
**
SET SESSION CHARSET "ASCII";
SHOW VERSION;
** Connection step
**
LOGON 10.1.1.218/kxenodbc,kxenodbc;
** Cleaning logs
**
DROP TABLE KxFastLoad_1_7308_Log1;
DROP TABLE KxFastLoad_1_7308_Log2;
** Define all fields to work with

**
** With usage of INMOD feature
**
DEFINE
f0 (BIGINT),
f1 (BIGINT),
f2 (BIGINT),
f3 (BIGINT),
f4 (BIGINT),
f5 (BIGINT),
f6 (BIGINT),
f7 (BIGINT),
f8 (BIGINT),
f9 (BIGINT)
INMOD=KxFastLoadBridge.dll;
** NOTIFY allows to monitor session steps
** EXIT specifies an external handler to publish notifications
** TEXT allows to give parameters to establish communications with KXEN
NOTIFY HIGH EXIT KxFastLoadBridge.dll TEXT "PP=KxFastLoad_1_7308 TL=Detailed
TF=Pub TE=On";
** The actual start of fastload session
**
BEGIN LOADING acgafwdo ERRORFILES KxFastLoad_1_7308_Log1, KxFastLoad_1_7308_Log2 INDICATORS;
** The INSERT statement using fastload fields defined by 'DEFINE' statement
**
INSERT INTO acgafwdo (id, i1, i2, i3, i4, i5, i6, i7, i8, i9) VALUES
(:f0, :f1, :f2, :f3, :f4, :f5, :f6, :f7, :f8, :f9);
** End of job
**
END LOADING;
QUIT
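
When the DropTempFiles option is set to No, the generated script is kept on disk under the path given by the FullScriptPath option. As a sketch only, assuming the Teradata Tools and Utilities are installed and using the file name pattern from the example above, the script can then be inspected or re-run manually for troubleshooting:

fastload < KxFastLoad_1_7308_script.fld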

8.13 Annex B - List of Options

Option | Category | Default value | Comment
fastloadName | Integration | fastload | Name of fastload executable
fastloadPath | Integration | "" or /usr/bin | Path of fastload executable
EXITProtocol | Integration | EXIT or EXITEON |
FullKxInfosScriptPath | Integration | <temp dir>/%TempPrefix%_trc.txt | Trace from KxFastLoadBridge
TemplateScriptFullPath | Integration | | Allows you to use a specific template
TempPrefix | Integration | KxFastLoad_<Unique number> | Added to every temp or log object
Date | Script | Current date |
DropLogTablesBefore | Script | Yes | Drop log tables before any action
EXITProtocol | Script | EXITEON for TTUF 14.10, EXIT otherwise | Extended Object Names (EON) are a new feature of Teradata 14.10
FieldNames | Script | generated from target table definition | Field names in fastload syntax
FieldsDefinition | Script | generated from target table definition | Field definitions in fastload format
FieldValues | Script | generated from target table definition | Field values in fastload syntax
FullLogTableName1 | Script | %LogSchemaName%.%TempPrefix%_log1 | Log table generated by fastload
FullLogTableName2 | Script | %LogSchemaName%.%TempPrefix%_log2 | Log table generated by fastload
FullResultScriptPath | Script | <temp dir>/%TempPrefix%_log.txt | fastload execution log
FullScriptPath | Script | <temp dir>/%TempPrefix%_script.fld | Full path of generated script
FullTableName | Script | %SchemaName%.%TableName% | Target table specification
Host | Script | Teradata host used by ODBC connection | Allows you to force a Teradata gateway
KxFastLoadBridgeFullPath | Script | generated from Automated Analytics installation | Full path of KxFastLoadBridge
LogData | Script | | Use this option and the 2 following options to set up an advanced authentication scheme (cf. fastload documentation)
LogMech | Script | |
Logon | Script | |
CheckEmptyUserName | Integration | true |
DefaultUserName | Integration | empty |
DefaultPassword | Integration | empty |
LogSchemaName | Script | generated from Target table | Allows you to force a specific schema for log tables
LogTableName1 | Script | %TempPrefix%_<Unique number>_log1 |
LogTableName2 | Script | %TempPrefix%_<Unique number>_log2 |
Password | Script | password from ODBC connection |
SchemaName | Script | generated from Target table | Schema part of target table spec
SessionCharset | Script | ASCII or UTF16 | Can also be UTF8
TableName | Script | The target table | Table part of target table spec
UserName | Script | user name from ODBC connection |
DropTempFiles | Support | Yes | Temp files are kept only in case of error
DumpOptionsBeforefastload | Support | No | Dump all options in the SQL log
Mode | Support | Advanced | 'Synchronous' allows you to use a slower but simpler fastload dialog
TraceFile | Support | Pub | Managed by Automated Analytics only
TraceLevel | Support | Detailed | Managed by Automated Analytics only
BlockSize | Tuning | 1048576 (1 MB) | Size of a fastload binary block
MaxQueueSize | Tuning | 20 | Maximum number of blocks waiting to be sent
SessionMax | Tuning | | Use these parameters with caution (cf. fastload documentation)
SessionMin | Tuning | | Use these parameters with caution (cf. fastload documentation)
TimeBetweenPerformancesInfos | Tuning | 10s |

Important Disclaimers and Legal Information

Hyperlinks
Some links are classified by an icon and/or a mouseover text. These links provide additional information.
About the icons:

● Links with the icon : You are entering a Web site that is not hosted by SAP. By using such links, you agree (unless expressly stated otherwise in your
agreements with SAP) to this:

● The content of the linked-to site is not SAP documentation. You may not infer any product claims against SAP based on this information.
● SAP does not agree or disagree with the content on the linked-to site, nor does SAP warrant the availability and correctness. SAP shall not be liable for any
damages caused by the use of such content unless damages have been caused by SAP's gross negligence or willful misconduct.

● Links with the icon : You are leaving the documentation for that particular SAP product or service and are entering a SAP-hosted Web site. By using such
links, you agree that (unless expressly stated otherwise in your agreements with SAP) you may not infer any product claims against SAP based on this
information.

Beta and Other Experimental Features


Experimental features are not part of the officially delivered scope that SAP guarantees for future releases. This means that experimental features may be changed by
SAP at any time for any reason without notice. Experimental features are not for productive use. You may not demonstrate, test, examine, evaluate or otherwise use
the experimental features in a live operating environment or with data that has not been sufficiently backed up.
The purpose of experimental features is to get feedback early on, allowing customers and partners to influence the future product accordingly. By providing your
feedback (e.g. in the SAP Community), you accept that intellectual property rights of the contributions or derivative works shall remain the exclusive property of SAP.

Example Code
Any software coding and/or code snippets are examples. They are not for productive use. The example code is only intended to better explain and visualize the syntax
and phrasing rules. SAP does not warrant the correctness and completeness of the example code. SAP shall not be liable for errors or damages caused by the use of
example code unless damages have been caused by SAP's gross negligence or willful misconduct.

Gender-Related Language
We try not to use gender-specific word forms and formulations. As appropriate for context and readability, SAP may use masculine word forms to refer to all genders.

www.sap.com/contactsap

© 2018 SAP SE or an SAP affiliate company. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. The information contained herein may be changed without prior notice.

Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary.

These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies.

Please see https://www.sap.com/about/legal/trademark.html for additional trademark information and notices.

THE BEST RUN