
DW Massively Parallel
Processing Design
Patterns

Matthew Lawler lawlermj1@gmail.com

D:\D\Documents\DW Me\0 Publish\DW Massively Parallel Processing Design Patterns.docx February 13,
2018 1 of 28

INTRODUCTION

MASSIVELY PARALLEL PROCESSING DESIGN PATTERNS

KEY PATTERNS

DISTRIBUTION KEY DISCOVERY ALGORITHM

FACT DIMENSION PATTERNS

TEST PATTERNS

FROM ORACLE TO NETEZZA


Introduction

Licence
As these are generic software documentation standards, they will be covered by the 'Creative
Commons Zero v1.0 Universal' CC0 licence.

Warranty
The author does not make any warranty, express or implied, that any statements in this document
are free of error, are consistent with any particular standard of merchantability, or that they will
meet the requirements for any particular application or environment. They should not be relied on
for solving a problem whose incorrect solution could result in injury or loss of property. If you do
use this material in such a manner, it is at your own risk. The author disclaims all liability for direct
or consequential damage resulting from its use.

Purpose
The primary goal of this document is to improve the effectiveness and efficiency of a Massively
Parallel Processing Appliance using Netezza as an example.

Put simply, if the patterns are not followed, then query performance will be at least 100 times
worse, and may never finish. If followed, millions of rows can be processed in minutes. In one
example, 7 million rows are processed through 50 tables in 7 minutes.

Some specific and secondary goals are:

to improve query performance.

to improve integration between Netezza and end user tools.

to improve data quality on Netezza.

According to Tony Hoare (in a remark popularised by Donald Knuth), "Premature optimization is the root of all evil". So, effectiveness always comes
before efficiency. In many cases, effectiveness is sufficient. But, if not, then optimization or
efficiency needs to be pursued.

These design patterns, which explain Netezza strengths, and identify and address Netezza
weaknesses, will assist the designer to produce effective and efficient solutions that will enable
effective end user reporting.

Audience
For Oracle developers who need to design databases and build SQL queries on a Massively
Processing Appliance like Netezza, this document covers the design patterns needed to produce
acceptable performance.

The primary audience of this document are designers, developers, data modellers, Business
Intelligence staff and Business Intelligence support personnel working on the Netezza platform.

Assumptions
There is a Netezza learning curve that this document aims to reduce. It is assumed that the reader
has some level of knowledge of Oracle, SQL, and dimensional (Kimball) data modelling. Knowledge

of software design, relational and Inmon data modelling may also be helpful. A lack of knowledge of
Netezza is also assumed.

Approach
This will be documented using Rules and Design Patterns for Netezza. Topics not covered
include: Hardware, Scheduling, Data modelling, Naming conventions, KPIs, ETL tools, BI front end
tools, Glossary, Security, Data Lineage, Data Traceability, etc.

90% of what follows is taken from Netezza, or Computer Science sources. There has been deliberate
and unapologetic simplification, to make the original sourced rules more workable, and to help the
reader avoid the many blind alleys that were explored. Most of the rules are conditional, so there
are many cases where either choice is valid, depending on the situation.

But there is always a fierce debate among developers about the best languages, tools, methods and
patterns, which can sometimes descend into religious wars. So these rules will also be subject to that
kind of scrutiny, as the rules are by no means perfect. After all, according to Linus's Law (coined by
Eric S. Raymond), "given enough eyeballs, all bugs are shallow". So, if any reviewer wants to change or enhance these
patterns, then they are welcome. But they will also need to document the justification, tests and
any worked examples and/or code snippets.

Additional patterns need to be discovered and documented for Referential Integrity, Automated
roll-out, Trees, Data Vault, RDF, Profiling and Reconciliation.

Also, additional Netezza topics to be covered include Zone maps, Query Plan Optimisation, Organise
On and Clustered Base Tables.

Related Documents
Most of these documents are published by Netezza; the Workbench guides are published by Aginity.

O Name Subject

1 Aginity_Netezza_Workbench_Documentation Overview

2 Aginity_Workbench_for_Netezza_Functionality_Overview Detailed Introduction

3 IBM_Netezza_In-Database_Analytics_Reference_Guide Analytics functions

4 IBM Netezza User-Defined Functions Developer's Guide

5 Netezza RedGuide Overview

6 Netezza_advanced_security_admin_guide Security

7 Netezza_data_loading_guide Data Loading

8 Netezza_database_users_guide SQL user guide

9 Netezza_getting_started_tips background information and tips

10 Netezza_odbc_jdbc_guide ODBC, JDBC clients, or the OLE DB connector


11 Netezza_Spatial_Package_Developers_Guide Spatial Analytics functions

12 Netezza_Spatial_Package_Reference_Guide Spatial Analytics functions

13 Netezza_Spatial_Package_Users_Guide-3.0.0 Spatial Geometric Functions

14 Netezza_stored_procedures_guide Developers Guide for Stored procs

15 Netezza_system_admin_guide Official guide to distribution keys

16 Netezza_udf_dev_guide Developers Guide for Functions

17 Netezza-Basics Course Outline

18 netezza-fpga Overview

Definitions
What is a DW Appliance?
A DW appliance is a tightly integrated hardware and software tool designed for MPP. A DW
appliance stores data in a series of parallel nodes, to enable fast load and fast retrieval. Examples of
DW appliances are Netezza and Teradata. Each node needs to store related records so that the local
query only uses local data to complete a query. The related records are grouped together based on a
single shared key, which is the distribution key. In effect, a DW appliance follows the key-value store
pattern seen in many NoSQL solutions. See the appendix for a comparison of Oracle and
Netezza. The document list in the appendix provides more detail.

What is Massively Parallel Processing (MPP)?


In Asymmetric Massively Parallel Processing (AMPP), fast data processing is achieved by dividing the
work into parallel streams that maximize the utilization of each MPP node (Snippet Blade). The
query is distributed into thousands of streams that are each processed close to the data source by
only pulling the necessary data through the Field Programmable Gate Arrays (FPGAs) into the CPUs.
Maximum performance is achieved when co-location of data on a MPP node enables the FPGAs to
pull the smallest amount of relevant data into the CPUs and not have to request and wait on data to
come across the Network Fabric from other nodes.

Definitions
Term Source Definition

API DB Application Programming Interface

AWB Netezza Aginity Work Bench

Bag Maths In mathematics, a multiset (or bag) is a generalization of the concept of a set that,
unlike a set, allows multiple instances of the multiset's elements. In other words, it has duplicate
instances.

BI DW Business Intelligence


BLOB DB Binary Large Object

Cardinality DB The number of unique values for an attribute in a table. Low cardinality refers to a
limited number of values, relative to the overall number of rows in the table.

CLOB DB Character Large Object

Data Model Date The data model must represent demonstrably true statements about
the business area. That is, the entities must represent things that
mean something to the business, and the relationships between
entities must represent meaningful links.

Data Profiling DW Data profiling is the process of examining the data available in an
existing data source (e.g. a database or a file) and collecting statistics
and information about that data.

Data Transform DW A data transformation converts a set of data values from the data
format of a source data system into the data format of a destination
data system.

Data Vault Linstedt Data Vault is a database method that is designed to provide long-term
historical storage of data coming in from multiple operational systems.
It provides a DW pattern that supports the Inmon goals of Subject-
orientation, non-volatility, integration and Time-variance.

Data Warehouse DW An implementation of an informational database used to store sharable data
sourced from an operational database-of-record.

Design Pattern Martin A design pattern is a general reusable solution to a commonly occurring
problem within a given context in software design. A design pattern is not a finished design that
can be transformed directly into source or machine code. It is a description or template for how to
solve a problem that can be used in many different situations. 'Patterns are, of course,
under-articulated abstractions.' Conor McBride. Note that Design patterns differ by language or
platform, depending on the language capability to abstract patterns. See the 'Revenge of the
Nerds' article. Whenever the same code is repeated at different points, then this is an indication
that there is a design pattern.

Design Pattern Norvig These are templates that describe design alternatives. These can
include descriptions of what experienced designers know and which is
not written down in the Language Manual. They can also include hints
and reminders for design choices, higher-order abstractions for
program organization, design trade-offs and how to avoid limitations of
implementation language.

Dimension Kimball An independent entity in a dimensional model that serves as an entry point or
as a mechanism for slicing and dicing the additive measures located in the fact table of the
dimensional model. For example, all months, quarters, years, etc., make up a time dimension.
Based on Measure Theory.

Distribution Key NZ The distribution key is used to determine the placement of table rows
across multiple Netezza nodes. The distribution key is part of the table
definition.

DK NZ Distribution Key

DW DW Data Warehouse

DW Appliance DW This is an integrated hardware, software and DBMS platform, designed for high
performance reporting.

EAV DB An Entity-Attribute-Value table. This is a type of RDF table. Data is recorded in only three
columns: Entity: the item being described; Attribute or parameter: a foreign key into a table of
attribute definitions; Value of the attribute. This pattern is very popular with business users, as
they have the opportunity to redefine data in an application. Often, much of the useful reporting
data are in these columns. However, it can be quite difficult to pivot this data into usable reporting
tables. See the Resource Description Framework (RDF) as an example.

ELT DW Extract Load and Transform

Epoch Time Unix Aka Unix time is a system for describing instants in time, defined as the
number of seconds that have elapsed since 00:00:00 Coordinated
Universal Time (UTC), Thursday, 1 January 1970, not counting leap
seconds.

ETL DW Extract, Transform and Load

Fact Kimball A business performance measurement, typically numeric and additive, that is stored
in a fact table. Based on Measure Theory.

FK DB Foreign Key

Foreign Key DB A foreign key is the primary key of one data structure that is placed
into a related data structure to represent a relationship among those
structures. Foreign keys resolve relationships, and support navigation
among data structures.

Hierarchy DB This is the same structure as a Mathematical Tree.

Imperative DB With imperative programming, like PL/SQL, the coder defines how the
code will proceed, from step by step. In declarative programming, like
SQL, the coder defines what the code will do, and lets the compiler
determine the best way to implement this.

Join Codd A join is a binary operator on two relations or database tables.


Key Source This represents cleansed Source tables that have had a surrogate
and/or a distribution key added. A surrogate key is needed for
Fact/Dimension joins. A distribution key is critical for adequate
Netezza performance. Note that these tables would still retain their
source names.

KPI DW Key Performance Indicator

Massively Parallel Processing DW This refers to the use of a large number of processors (or
separate computers) to perform a set of coordinated computations in parallel (simultaneously).

Metadata DB Metadata is "data about data". While not often used in reporting,
these tables are important in DW standards, and for generating and
describing DW components.

MPP DW Massively Parallel Processing

Netezza Netezza Netezza designs and markets high-performance data warehouse appliances and
advanced analytics applications for uses including enterprise data warehousing, business
intelligence, predictive analytics and business continuity planning.

NoSQL DB A NoSQL database provides a mechanism for storage and retrieval of data that is
modelled in a non-relational manner.

NPS Netezza Netezza Performance Server

Nub Source This represents Source tables that have been cleansed, but without column name
changes. For example, cleansing can discard boilerplate columns, de-duplicate rows, add default
values (e.g. N/A for nulls), convert types (e.g. Text -> Dates), fix column lengths, etc. AKA
Bag-to-Set.

NULL DB NULL indicates that a data value does not exist in a particular column
row pair in the database. In other words, if the value of a column in a
row is NULL, then the value is undefined.

NZ NZ Used as short form for Netezza. Not to be confused with the shaky
isles.

ODS DW Operational Data Store

Pivot DB A pivot table is the transformation of an EAV table into columnar form.

PK DB Primary Key

Primary Key DB A column or combination of columns whose values uniquely identify a row or
record in the table. The primary key(s) will have a unique value for each record or row in the table.
That is, each value occurs exactly once.


RDF W3C The Resource Description Framework (RDF) is a family of World Wide Web Consortium
(W3C) specifications originally designed as a metadata data model. It has come to be used as a
general method for conceptual description of information and/or knowledge management, which
is implemented in web resources, and other formats. An RDF triple is a statement about resources
in the form of a subject-predicate-object expression. RDF is a more general form of the original
EAV Entity-Attribute-Value design pattern, where the subject is an entity, predicate is an attribute
and object is a value. A set of such RDF triples forms an RDF graph.

Reconciliation The process of ensuring that there is set equality between a source
data set and a replicated target data set.

Referential Integrity DB Referential integrity is a property of data which, when satisfied, requires
every value of one attribute (column) of a relation (table) to exist as a value of another attribute in
a different (or the same) relation (table).

RI DB Referential Integrity

Set Maths In mathematics, a set is a collection of distinct objects, considered as an object in its
own right. The set has no duplicates, and each object has an identifier (or key).

Snippet Processing Unit NZ The independent node that performs functions on data subsets in
parallel with other SPUs.

SPU NZ Snippet Processing Unit

SQL DB Structured Query Language

Tree Maths A tree is an undirected graph in which any two vertices are connected
by exactly one path.

UDA NZ User defined aggregates

UDF NZ User Defined Functions

UDTF NZ User defined table functions

UDX NZ This is generally used by Netezza developers to refer to user developed code. This covers
user defined functions (UDF), user defined aggregates (UDA), and user defined table functions
(UDTF).

Union Maths Union of the sets A and B, denoted A U B, is the set of all objects that
are a member of A, or B, or both.

View DB A projection. Apart from the display of data returned, a view can be
considered to be a pure function.

XML DB Extensible Mark-up Language


Tags
Business Intelligence ; Data Design ; Data Load ; Data Mapping ; Data Model ; Data Transformation ;
Data Vault ; Data Warehouse ; Database ; Database Design ; Design Pattern ; DW Appliance ;
Extract Load Transform - ELT ; Extract Transform Load - ETL ; Fact / Dimension ; Hierarchy ; Inmon ;
Kimball ; Massive Parallel Processing - MPP ; Master Data Management ; Metadata ; Netezza ;
Oracle ; SQL ; Standards ; Teradata ; Data Architect ; Data Architecture ;


Massively Parallel Processing Design Patterns


These design rules, or strategies, are a guide to certain patterns and certain implementations. They
are more like proverbs than like templates. Some of these rules are in the Netezza manuals, and
others have been discovered elsewhere. In effect, these rules could be used as a check list for design
reviews. However, they are not meant to be applied blindly, as there will always be exceptions.

DO these things
O DO Explanation

1 Allocate distribution keys It has been found that a join not using distribution keys is at least
100 times slower! In some cases, the join simply does not return a
value even after one hour. If DKs are incorrectly set to RANDOM,
then the performance hit will not occur overnight when the Fact
and Dimension tables are built, but will affect query performance.
This means that fact and dimension tables must be created with
DKs. Try and also apply them to the source loaded tables as well.

2 Use TABLEs Use physical tables when transforming from intermediate to final
state.

3 Use TRUNCATE, INSERT and SELECT ONLY Keep the intermediate states persistent. This will
enable the developer to reason properly about the progressive state changes, which will help with
debugging. This means that ELT becomes the norm, as ETL is traditionally a single threaded design,
whereas ELT exploits MPP properly.

4 Bulk Load Use NZLoad to load data. Always do bulk processing on Netezza –
never do one record at a time processing. Prepare bulk load data
by using a series of SQL statements to generate temporary result
sets before loading.

5 Build PROCEDUREs as a series of declarative SQL statements. Apply set theory. Break a proc
down into a series of INSERTs. Do not write the PROCEDURE in an imperative style, with loops,
etc. In effect, change the data as a composition of data transform functions.
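A minimal sketch of this style in NZPLSQL, with hypothetical procedure and table names (syntax per the Netezza stored procedures guide; verify against your NPS release): the body is a straight-line composition of set-based statements, with no loops or cursors.

```sql
CREATE OR REPLACE PROCEDURE LOAD_SALES_FACT()
RETURNS INTEGER
LANGUAGE NZPLSQL
AS
BEGIN_PROC
BEGIN
  -- Each step is one declarative, set-based transform.
  TRUNCATE TABLE WRK_SALES;

  INSERT INTO WRK_SALES
  SELECT SALE_ID, CUSTOMER_DK_ID, AMOUNT
  FROM   STG_SALES;

  INSERT INTO F_SALES
  SELECT W.SALE_ID, W.CUSTOMER_DK_ID, W.AMOUNT
  FROM   WRK_SALES W
  INNER JOIN D_CUSTOMER D ON W.CUSTOMER_DK_ID = D.CUSTOMER_DK_ID;

  RETURN 0;
END;
END_PROC;
```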

6 Make PROCEDUREs as atomic as possible. COMMIT works differently in Netezza. There is far
less control than in Oracle. Simplified, rely on the BEGIN and END block in a PROCEDURE to make a
COMMIT. If the PROCEDURE is large, and there is a failure, then the ROLLBACK will take a long
time. So a large PROCEDURE should be split up into sensible, consistent units. For example, split
them by schema and by data transform step type.

7 Avoid using Database name Do not hard code the DB name in SQL, so that it is easy to migrate
code from one DB to another. The DB is always known from the context.

8 When creating objects, ensure that the logon Schema is the same as the object Schema. Ensure
that the user and schema names are the same when creating DB Objects. In Aginity, one can log
on as one Schema and create objects in another Schema. These DB Objects will have the logon
Schema as the OWNER, and the creating schema as the SCHEMA.

9 Follow Oracle Name Standards Stay within the 30 character limit. Never begin names with an
underscore, $ or #. This will make names more operable with most tools, as they will almost
always be able to handle Oracle names.

10 Use Vanilla, ANSI SQL Types ONLY Only use standard vanilla Netezza data types. This includes
types such as "VARCHAR", "DATETIME", "TIMESTAMP", "DOUBLE", "REAL", "NUMERIC",
"NVARCHAR".

11 Use NUMERIC(19,0) for distribution keys In particular, do not use BIGINT, as this is a non-ANSI
standard data type. This type is unrecognisable to external BI tools, as Netezza is only ranked 27
among Databases. These BI tools need to join on the Distribution Key, and this will not be possible
if they throw a data type not recognised error. BIGINT is physically identical to NUMERIC(19,0).
The hash8 function generates 64-bit values in the range -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807. These numbers need to be a NUMERIC(19,0), rather than
NUMERIC(18,0). The only disadvantage of not using BIGINT is that zone maps are not available to
NUMERIC(19,0), but these are not as critical to performance as a correctly defined distribution
key.
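The rule can be sketched as follows, with hypothetical table and column names: the distribution key is declared as NUMERIC(19,0), wide enough to hold any 64-bit hash8 value and recognisable to external BI tools.

```sql
-- Hypothetical dimension. CUSTOMER_DK_ID is the distribution key:
-- NUMERIC(19,0) covers the full signed 64-bit range of hash8
-- output, which NUMERIC(18,0) does not.
CREATE TABLE D_CUSTOMER
( CUSTOMER_KEY    VARCHAR(30)    NOT NULL
, CUSTOMER_DK_ID  NUMERIC(19,0)  NOT NULL
)
DISTRIBUTE ON ( CUSTOMER_DK_ID );
```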

12 Always make precision and scale explicit on numeric data types. That is, NUMERIC(p,s) not
NUMERIC. NUMERIC defaults to NUMERIC(18,0) inside Netezza, but how do external tools know
that? They will just default to some random length, unless it is made explicit.

13 Use CLOB types These often contain XML, which can be parsed and used for
reporting. CLOB types are often stored on separate tables, as they
quickly exceed the 64K row size.

14 Use NOT NULL If the source has null values, then use the default. Many bugs are
eliminated when NULL values are eliminated. This applies to
intermediate and final tables in the DW. In the case of landing
zone tables, the column nullability should be identical to the source
system.
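A sketch of the NOT NULL rule, with hypothetical names: defaults are applied as data moves from the landing zone (where nullability mirrors the source) into the warehouse tables. NVL is shown here; COALESCE would serve equally.

```sql
-- Landing zone columns may be null; the warehouse columns are NOT
-- NULL, so substitute documented defaults during the transform.
INSERT INTO DW_CUSTOMER
SELECT CUSTOMER_ID
     , NVL(CUSTOMER_NAME, 'N/A')                        -- text default
     , NVL(LAST_ORDER_TS, '1900-01-01'::TIMESTAMP)      -- sentinel date
FROM   LZ_CUSTOMER;
```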

15 Stay within common Limits Oracle allows up to 1,000 columns in a table. Oracle has a
maximum column size of 4K. Netezza has a maximum row size of 64K.

DON’T do these things


O DON’T Explanation

1 Use RANDOM distribution keys Avoid this as much as possible. Only use this if no distribution
keys can be found. Sometimes, for small tables (<50,000 rows), the performance hit is acceptable.


2 Use VIEWs It has been found that a simple VIEW SELECT is up to 3 times as
slow as a SELECT directly from a physical table, without
considering the negative impact of joining multiple VIEWS
without distribution keys. If the row count > 50,000, then
performance degrades to unacceptable levels. Only use Views as
a last resort, and only for small tables. The reason is that VIEWs
are disconnected from the MPP nodes. This means that any joins
using a view will not perform well, as the data is in the host
memory and NOT distributed evenly over the MPP SPU nodes. If
a front end tool is creating joins, then Netezza must present the
best possible performance.

3 Use UPDATEs Updates are row by row operations, rather than as a bulk
operation. So they do not perform well in an MPP environment.
Instead use INSERTs into new tables, with the changes. It is also
easier to reason about code when data is persistent, which will
help in debugging. Netezza finds each row, marks it as a soft
delete, which is cleaned up by the GROOM command, and inserts
rows with the new values.
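The rewrite above can be sketched with hypothetical names: rather than updating rows in place, build a new table containing the changed values and swap it in. (CTAS with DISTRIBUTE ON and ALTER TABLE ... RENAME TO follow standard Netezza syntax; verify against your release.)

```sql
-- Instead of: UPDATE CUSTOMER SET STATUS = 'CLOSED' WHERE ...
-- build the changed data as a bulk, set-based INSERT via CTAS.
CREATE TABLE CUSTOMER_NEW AS
SELECT CUSTOMER_ID
     , CASE WHEN LAST_ORDER_TS < '2010-01-01'::TIMESTAMP
            THEN 'CLOSED' ELSE STATUS END AS STATUS
FROM   CUSTOMER
DISTRIBUTE ON ( CUSTOMER_ID );

-- Swap the new table in; no soft deletes, no GROOM needed.
DROP TABLE CUSTOMER;
ALTER TABLE CUSTOMER_NEW RENAME TO CUSTOMER;
```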

4 Use single thread processes Single thread tools negate the whole raison d'être of an MPP
environment. They simply do not scale. Again, this may not be noticeable for small tables
(<50,000), but it becomes an issue for larger tables.

5 Build PROCEDUREs in a PL/SQL imperative manner. Apply set theory. Break a proc down into a
series of INSERTs. Do not write the PROCEDURE in an imperative style, with loops, etc.

6 Make PROCEDUREs monolithic. Monolithic PROCEDUREs require a longer ROLLBACK when
failure occurs. Split the PROCEDURE into more atomic units, while still ensuring that the units form
a consistent whole.

7 Use Database name Do not hard code DB name in SQL so that it is easy to migrate
code from one DB to another. DB is always known from the
context.

8 When creating objects, do not create an object with a different Schema to the logon Schema.
Unlike Oracle, Owner and Schema are not the same. Don't mix owner and schema. This mixes up
permissions, and confuses tools like Aginity, so that tables and columns that do exist are not
shown. If this occurs, carefully drop and create within the correct schema. Also, as a developer,
one does not have the right to set these even in Dev and Test.

9 Follow Netezza Name Standards Do not create 128 character names.

10 Use Netezza specific syntactic sugar Types The Netezza types to avoid are: "BIGINT", "BOOL",
"BOOLEAN", "CHAR", "CHAR VARYING", "CHARACTER", "CHARACTER VARYING", "DATE", "DOUBLE
PRECISION", "FLOAT", "FLOAT1", "FLOAT2", "FLOAT3", "FLOAT4", "FLOAT5", "FLOAT6", "FLOAT7",
"FLOAT8", "BYTEINT", "DEC", "DECIMAL", "INT", "INT1", "INT2", "INT4", "INT8", "INTEGER",
"SMALLINT". These are all translatable into the standard types of "VARCHAR", "DATETIME",
"TIMESTAMP", "DOUBLE", "REAL", "NUMERIC", "NVARCHAR".

11 Use BIGINT for distribution keys In particular, do not use BIGINT, as this is a non-ANSI standard
data type. This type is unrecognisable to external BI tools, as Netezza is only ranked 27 among
Databases. These BI tools need to join on the Distribution Key, and this will not be possible if they
throw a data type not recognised error.

12 Rely on implicit precision and scale on numeric data types. That is, NUMERIC not
NUMERIC(p,s). NUMERIC defaults to NUMERIC(18,0) inside Netezza, but how do external tools
know that? They will just default to some random length, unless it is made explicit.

13 Use BLOB types These are almost always not needed for reporting.

14 Use NULL It is difficult to reason correctly in the presence of Null values. Joins with Null values
behave unexpectedly. This applies to intermediate and final tables in the DW. In the case of
landing zone tables, the column nullability should be identical to the source system.

15 Go outside common Limits Netezza allows up to 1,600 columns in a table. Netezza allows up
to 64K row size. It is unclear what the Oracle row size limit is.


Key Patterns
Always Allocate distribution keys.
It has been found that a join not using distribution keys is at least 100 times slower! In some cases,
the join simply does not return a value even after one hour. If DKs are incorrectly set to RANDOM,
then the performance hit will not occur overnight when the Fact and Dimension tables are built, but
will affect query performance. This means that fact and dimension tables must be created with DKs.
Avoid RANDOM distribution keys as much as possible. Only use this if no distribution keys can be
found.

This section explains why distribution keys are critical, and some simple rules to follow.

How to discover distribution keys?


Fact Dimension model
In the case of a Fact Dimension data model, it is fairly simple.

The distribution key will be the key of the most important dimension, which is normally the
dimension with the largest number of rows. This key will then be added to the fact table, and both
tables should be distributed using this key. This will ensure that the join between the most
important dimension and the fact table will perform well. If this dimension has sub-types then the
dimension can be split into sub-types, and a common distribution key can be shared across all these
dimensions.
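The shared-key layout can be sketched with hypothetical tables: the key of the largest dimension is carried into the fact table, and both tables are distributed on it, so the join is co-located on each SPU.

```sql
-- Most important (largest) dimension, distributed on its key.
CREATE TABLE D_CUSTOMER
( CUSTOMER_DK_ID NUMERIC(19,0) NOT NULL
, CUSTOMER_NAME  VARCHAR(100)
)
DISTRIBUTE ON ( CUSTOMER_DK_ID );

-- Fact table carries the same key and the same distribution.
CREATE TABLE F_SALES
( SALE_ID        NUMERIC(19,0) NOT NULL
, CUSTOMER_DK_ID NUMERIC(19,0) NOT NULL
, AMOUNT         NUMERIC(18,2)
)
DISTRIBUTE ON ( CUSTOMER_DK_ID );

-- A join on the shared distribution key stays local to each SPU.
SELECT D.CUSTOMER_NAME, SUM(F.AMOUNT)
FROM   F_SALES F
INNER JOIN D_CUSTOMER D ON F.CUSTOMER_DK_ID = D.CUSTOMER_DK_ID
GROUP BY D.CUSTOMER_NAME;
```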

Source data model


In the case of a Source data model, it is only a little more complex.

A simple approach is to take the mode foreign key (the most frequently occurring foreign key) found
across all active tables. This will be the most important column in the database. After removing
these tables from the set, take the next highest mode foreign key. This will be the second most
important column in the database. And so on.
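The mode-foreign-key search can be sketched in SQL, assuming a hypothetical helper table FK_COLUMNS(TABLE_NAME, COLUMN_NAME) holding one row per foreign key column, harvested from the source data dictionary:

```sql
-- The most frequently occurring foreign key column across all
-- active tables: the best first candidate for a distribution key.
SELECT COLUMN_NAME, COUNT(*) AS TABLE_COUNT
FROM   FK_COLUMNS
GROUP BY COLUMN_NAME
ORDER BY TABLE_COUNT DESC;

-- Assign the top column as the DK for its tables, delete those
-- tables' rows from FK_COLUMNS, and repeat for the next candidate.
```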

The actual number of distribution keys will generally be quite small, and less than the table count. It
seems that the number of distribution keys will be less than the square root of the table count, but
more than the log of the table count. It is also apparent that a distribution key is a natural way of
organising the tables into groups. These could be called distribution key subject areas.

Other DK discovery approaches


See detailed DK discovery algorithm below. There is also an alternate manual DK discovery method.
Any or all of these approaches will help.

Summary Rules
1. Define only one table column as the distribution key. This can be a primary or a foreign key.

2. Never construct a distribution key from multiple columns.


3. Never use a date as a source column for a distribution key.

4. Never use a low-cardinality column as a source column for a distribution key. The number of
distinct values must be more than the number of Netezza nodes.

5. Only use RANDOM, when all attempts to identify a distribution key fail.

6. The distribution key data type should be NUMERIC (19, 0). [1]

7. If the source key used for a distribution key is a VARCHAR, then populate an integer
sequence based on the source key, or a hash, and cast it to a NUMERIC (19,0) type.

8. If the source column used for a distribution key is a NUMBER, then cast it to a NUMERIC (19,
0) type.

9. The distribution key must be used in a join in order to achieve high performance.

10. For many to many tables, table replication with each table using one of the foreign keys as a
distribution key will provide the best performance, provided the distribution key is already
defined for another table.
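Several of these rules are mechanical enough to check automatically. A minimal sketch, assuming the column metadata (distinct count, date flag) is available as plain values rather than read from a catalog:

```python
# Mechanical checks for a candidate distribution key against rules 1-4.
# The metadata arguments are hypothetical inputs, not a real catalog API.
def check_candidate(columns, distinct_count, n_nodes, is_date):
    errors = []
    if len(columns) != 1:
        errors.append("rules 1/2: exactly one column must form the DK")
    if is_date:
        errors.append("rule 3: never use a date column")
    if distinct_count <= n_nodes:
        errors.append("rule 4: distinct count must exceed the node count")
    return errors

# A gender-style column fails on cardinality alone:
print(check_candidate(["GENDER"], distinct_count=2, n_nodes=24, is_date=False))
```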

Case 1: Not using a distribution key


The child table does not have a PARENTKEY_DK_ID column; instead, RANDOM distribution is selected.
The query still joins on the logical key, PARENTKEY. This is logically correct, and does return
the same result set as Case 2, but the example will show why it does not perform well.

Note: An alternate approach is to transform the VARCHAR using the SQLEXT.HASH8 function, or to rely on Netezza
to internally generate a hash value over the VARCHAR column. Provided the table will always have fewer than a
million rows, the probability of a hash collision may be acceptable. See the appendix for a fuller
discussion of this.
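The collision risk mentioned in the note above can be estimated with the standard birthday bound. Assuming the hash behaves like a uniform 64-bit function (an assumption about its distribution, not a documented property of SQLEXT.HASH8), a rough sketch:

```python
# Birthday-bound estimate of hash-collision probability, assuming a
# uniform 64-bit hash space (an assumption, not a documented guarantee).
def collision_probability(n_rows, hash_bits=64):
    space = 2 ** hash_bits
    return n_rows * (n_rows - 1) / (2 * space)   # ~ n^2 / 2d for small p

print(collision_probability(1_000_000))   # roughly 2.7e-8 for a million rows
```

At a million rows the chance of any collision is on the order of one in 37 million, which is why the note treats the hash approach as acceptable at that scale.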

CREATE TABLE PARENT_TABLE (
    PARENTKEY VARCHAR( 30 ) NOT NULL ,
    COLOUR VARCHAR( 30 ),
    PARENTKEY_DK_ID NUMERIC( 18, 0 ) NOT NULL
) DISTRIBUTE ON ( PARENTKEY_DK_ID );

CREATE TABLE CHILD_TABLE (
    CHILDKEY VARCHAR( 30 ) NOT NULL ,
    PARENTKEY VARCHAR( 30 ) NOT NULL ,
    ISCLEAR VARCHAR( 30 )
) DISTRIBUTE ON RANDOM ;

SELECT CHILDKEY, ISCLEAR, COLOUR
FROM CHILD_TABLE CT
INNER JOIN PARENT_TABLE PT ON CT.PARENTKEY = PT.PARENTKEY ;

ParentKey Colour ParentKey_DK_ID


p1 red 1
p2 blue 2

ChildKey ParentKey IsClear ChildKey IsClear Colour


c1 p1 N c1 N red
c2 p1 Y c2 Y red
c3 p1 N c3 N red
c4 p2 Y c4 Y blue
c5 p2 N c5 N blue
c6 p2 Y c6 Y blue

Case 1: Split without a common distribution key


Splitting the full table randomly means that some child rows on each SPU are not related to that
SPU's parent row. Here, a p2 child row (c5) appears on SPU 1, whereas its parent is on SPU 2.

INSERT INTO CHILD_TABLE
SELECT CHILDKEY, PARENTKEY, ISCLEAR
FROM SOURCE_CHILD_TABLE CT ;

(With RANDOM distribution, the rows are split arbitrarily across the SPUs.)

SPU #1 SPU #2

P are n tK e y C o lo u r P a re n tK e y _D K _ ID P a re n tK e y C o lo u r P are n tK e y _ D K _ID


p1 re d 1 p2 b lu e 2

C h ild K e y P are n tK e y Is C le a r C h ild K e y P are n tK e y Is C le a r


c1 p1 N c2 p1 Y
c3 p1 N c4 p2 Y
c5 p2 N c6 p2 Y

Case 1: Map join across all nodes


Map the same join independently on each SPU. Now there are gaps in the output rows on each
node. The unjoined rows have to be collected into the Redistribute Set, which involves considerably
more processing.


SELECT CHILDKEY, ISCLEAR, COLOUR
FROM CHILD_TABLE CT
INNER JOIN PARENT_TABLE PT ON CT.PARENTKEY = PT.PARENTKEY ;

SPU #1 SPU #2
ParentKey Colour ParentKey_DK_ID ParentKey Colour ParentKey_DK_ID
p1 red 1 p2 blue 2

ChildKey ParentKey IsClear ChildKey ParentKey IsClear


c1 p1 N c2 p1 Y
c3 p1 N c4 p2 Y
c5 p2 N c6 p2 Y

SPU #1 result             Redistribute Set            SPU #2 result
ChildKey IsClear Colour   ChildKey IsClear Colour     ChildKey IsClear Colour
c1 N red                  c2 Y red                    c4 Y blue
c3 N red                  c5 N blue                   c6 Y blue

Case 1: Reduce result sets with UNION


Reduce the results by UNIONing each result set into one large result table. Now imagine what happens
when there are 1,000,000 rows in the Redistribute Set: a lot of extra processing is required.

SPU #1 SPU #2

ParentKey Colour ParentKey_DK_ID ParentKey Colour ParentKey_DK_ID


p1 red 1 p2 blue 2

ChildKey ParentKey IsClear ChildKey ParentKey IsClear


c1 p1 N c2 p1 Y
c3 p1 N c4 p2 Y
c5 p2 N c6 p2 Y
SPU #1 result             Redistribute Set            SPU #2 result
ChildKey IsClear Colour   ChildKey IsClear Colour     ChildKey IsClear Colour
c1 N red                  c2 Y red                    c4 Y blue
c3 N red                  c5 N blue                   c6 Y blue

SELECT FROM SPU#1 UNION SELECT FROM SPU#2 UNION SELECT FROM REDISTRIBUTE_SET

Final Result Set

ChildKey IsClear Colour


c1 N red
c2 Y red
c3 N red
c4 Y blue
c5 N blue
c6 Y blue

Case 2: Using a distribution key


The following is a conceptual discussion, as the actual implementation details of the DW appliance
are immaterial to the designer. This is deliberately a simplification of the Netezza approach,
because Netezza essentially implements the standard key-value map-reduce parallelisation
design pattern. What follows is an illustration of this.

This example shows a simple parent and child table example, with a common PARENTKEY. The join
shows data from both tables.
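As a conceptual sketch only (not a description of Netezza internals), the key-value map-reduce join can be simulated in a few lines of Python: rows are partitioned by distribution key, each notional "SPU" joins its local slices independently, and the reduce step is a simple union:

```python
# Conceptual simulation of a co-located MPP join (illustration only).
# Each notional "SPU" holds just the parent and child rows whose
# distribution key value maps to it.
N_SPUS = 2

parents = [("p1", "red", 1), ("p2", "blue", 2)]        # (ParentKey, Colour, DK_ID)
children = [("c1", "p1", "N", 1), ("c2", "p1", "Y", 1), ("c3", "p1", "N", 1),
            ("c4", "p2", "Y", 2), ("c5", "p2", "N", 2), ("c6", "p2", "Y", 2)]

def spu_of(dk_id):
    # Distribute: equal key values always land on the same SPU.
    return dk_id % N_SPUS

# Map phase: each SPU joins its own slices independently.
per_spu = []
for spu in range(N_SPUS):
    local_parents = {p[0]: p for p in parents if spu_of(p[2]) == spu}
    local_children = [c for c in children if spu_of(c[3]) == spu]
    per_spu.append([(ck, clear, local_parents[pk][1])
                    for ck, pk, clear, _ in local_children
                    if pk in local_parents])

# Reduce phase: union the per-SPU results. Because parent and child share
# the same distribution key, every local join is complete and nothing
# needs redistribution.
result = sorted(row for rows in per_spu for row in rows)
print(result)
```

All six joined rows come straight off the two SPUs; contrast this with Case 1, where rows whose parent lives on another SPU must be redistributed first.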

CREATE TABLE PARENT_TABLE (
    PARENTKEY VARCHAR( 30 ) NOT NULL ,
    COLOUR VARCHAR( 30 ),
    PARENTKEY_DK_ID NUMERIC( 18, 0 ) NOT NULL
) DISTRIBUTE ON ( PARENTKEY_DK_ID );

CREATE TABLE CHILD_TABLE (
    CHILDKEY VARCHAR( 30 ) NOT NULL ,
    PARENTKEY VARCHAR( 30 ) NOT NULL ,
    ISCLEAR VARCHAR( 30 ),
    PARENTKEY_DK_ID NUMERIC( 18, 0 ) NOT NULL
) DISTRIBUTE ON ( PARENTKEY_DK_ID );

SELECT CHILDKEY, ISCLEAR, COLOUR
FROM CHILD_TABLE CT
INNER JOIN PARENT_TABLE PT ON CT.PARENTKEY_DK_ID = PT.PARENTKEY_DK_ID ;

ParentKey Colour ParentKey_DK_ID


p1 red 1
p2 blue 2

ChildKey ParentKey IsClear ParentKey_DK_ID ChildKey IsClear Colour


c1 p1 N 1 c1 N red
c2 p1 Y 1 c2 Y red
c3 p1 N 1 c3 N red
c4 p2 Y 2 c4 Y blue
c5 p2 N 2 c5 N blue
c6 p2 Y 2 c6 Y blue

Case 2: Split with a distribution key


Split the full table across all nodes, so that there are now tables with related rows on each SPU.

SPU #1:
INSERT INTO CHILD_TABLE
SELECT CHILDKEY, PARENTKEY, ISCLEAR, PARENTKEY_DK_ID
FROM SOURCE_CHILD_TABLE CT
WHERE PARENTKEY_DK_ID = 1;

SPU #2:
INSERT INTO CHILD_TABLE
SELECT CHILDKEY, PARENTKEY, ISCLEAR, PARENTKEY_DK_ID
FROM SOURCE_CHILD_TABLE CT
WHERE PARENTKEY_DK_ID = 2;

SPU #1 SPU #2
ParentKey Colour ParentKey_DK_ID ParentKey Colour ParentKey_DK_ID
p1 red 1 p2 blue 2

ChildKey ParentKey IsClear ParentKey_DK_ID ChildKey ParentKey IsClear ParentKey_DK_ID


c1 p1 N 1 c4 p2 Y 2
c2 p1 Y 1 c5 p2 N 2
c3 p1 N 1 c6 p2 Y 2

Case 2: Map join across all nodes


Map the same join independently on each SPU. There are no gaps in the output rows on each node.
Note that there are 2 columns in the join - the original key and the distribution key. Nothing is
written to the Redistribute Set, as all joins are complete within each SPU.

SELECT CHILDKEY, ISCLEAR, COLOUR
FROM CHILD_TABLE CT
INNER JOIN PARENT_TABLE PT
    ON CT.PARENTKEY = PT.PARENTKEY
   AND CT.PARENTKEY_DK_ID = PT.PARENTKEY_DK_ID ;

SPU #1 SPU #2
ParentKey Colour ParentKey_DK_ID ParentKey Colour ParentKey_DK_ID
p1 red 1 p2 blue 2

ChildKey ParentKey IsClear ParentKey_DK_ID ChildKey ParentKey IsClear ParentKey_DK_ID


c1 p1 N 1 c4 p2 Y 2
c2 p1 Y 1 c5 p2 N 2
c3 p1 N 1 c6 p2 Y 2
SPU #1 result             Redistribute Set            SPU #2 result
ChildKey IsClear Colour   (empty)                     ChildKey IsClear Colour
c1 N red                                              c4 Y blue
c2 Y red                                              c5 N blue
c3 N red                                              c6 Y blue

Case 2: Reduce result sets with UNION


Reduce the results by UNIONing each result set into one large result table. This only needs to
collect the result set from each SPU; nothing is collected from the Redistribute Set.

SPU #1 SPU #2
ParentKey Colour ParentKey_DK_ID ParentKey Colour ParentKey_DK_ID
p1 red 1 p2 blue 2

ChildKey ParentKey IsClear ParentKey_DK_ID ChildKey ParentKey IsClear ParentKey_DK_ID


c1 p1 N 1 c4 p2 Y 2
c2 p1 Y 1 c5 p2 N 2
c3 p1 N 1 c6 p2 Y 2

SPU #1 result             Redistribute Set            SPU #2 result
ChildKey IsClear Colour   (empty)                     ChildKey IsClear Colour
c1 N red                                              c4 Y blue
c2 Y red                                              c5 N blue
c3 N red                                              c6 Y blue

SELECT FROM SPU#1 UNION SELECT FROM SPU#2 UNION SELECT FROM REDISTRIBUTE_SET (empty)

Final Result Set


ChildKey IsClear Colour
c1 N red
c2 Y red
c3 N red
c4 Y blue
c5 N blue
c6 Y blue


Distribution Key Discovery Algorithm


Why?
The primary reason for using the Netezza environment is its scalable performance. Tables must be
designed to enable scalable performance, following the Netezza design rules; it is the responsibility
of Netezza database designers to ensure this. A lack of design will guarantee poor performance.

This is required whenever the extract and load of a new Netezza table is being designed.

This represents a mandatory requirement. That is, if a Netezza Distribution Key is not defined, then
the table design is also incomplete. This overrides any other methodology consideration. Specifically,
the use of DISTRIBUTE ON RANDOM should be minimised. That is, overuse of RANDOM will ensure
poor Netezza performance.

If current tables have a poorly defined Distribution Key, then this can be corrected. However, the
change will require exporting the table, creating a new table with the correct Distribution Key
clause, importing the data, dropping the old table, and renaming the new table.

Identifying Distribution Keys


Goal: Identify the smallest set of Distribution Keys that cover as many tables as possible.

Discussion
Do not think of the distribution key as an index, even though its purpose is to improve performance.
As each table can have at most one distribution key in practice, this is a reduction or simplification
process of the database to a small number of critical FKs. A distribution key is not meant to
represent the full set of relationships, or to produce a full set of indices across all tables.

Essentially, this document describes the discovery process of the mapping between foreign keys and
tables. Once the most important foreign key is found, then it can be used as a distribution key on
both the child and parent table. The distribution key does not need to be unique. Therefore, a
distribution key can be a foreign key, as well as a primary key.

After the distribution keys are discovered, there will be mutually exclusive table sets, each
distributed by their shared foreign keys and single primary key. The child tables can be thought of as
satellite or dependent tables, and the parent table can be thought of as a hub table. Note that this is
not the Data Vault methodology.

The foreign key sets have to make business sense. The groupings should be explainable to the
business, as these will be primary joins within the table sets. This explanation can be final validation
of the groupings.

Only ever consider the main primary key or a foreign key based on a primary key as a candidate
distribution key. Always ignore alternate primary keys and alternate foreign keys. Again, as the table
can only be distributed once on a single key, considering alternate keys as a candidate distribution
key is pointless.


Never construct a distribution key from a concatenation of keys. Firstly, the child and parent table
must share the same concatenation keys for a join to work between the child and parent tables.
Secondly, there is very little performance improvement from shifting the data within a node. Thirdly,
if the first concatenated key is sparse (i.e. many empty or null values), it should be considered as a
relatively inactive key, and should not be a candidate distribution key at all.

Finally, in the case of many to many link tables, it may appear that key concatenation would be
viable. For these tables, table replication with each table using one of the foreign keys as a
distribution key will provide the best performance, provided the distribution key is already defined
for another table. This solution has other drawbacks, as the query writers will need to know which
table will provide the best performance for a given query.

Skewness occurs when the distribution key values resolve to only a subset of Netezza nodes. This
can occur when using date as a source column, so always avoid date keys. It can also occur when
using very low cardinality columns, such as Gender, etc. Again, always avoid using these as
distribution keys.
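The skew effect can be illustrated with a toy modulo distribution. The 24-SPU count below is an arbitrary illustrative figure, not a real machine specification:

```python
# Distribution skew: a low-cardinality key can occupy at most as many
# SPUs as it has distinct values, however many rows there are.
N_SPUS = 24  # illustrative node count

def spus_used(values):
    return {hash(v) % N_SPUS for v in values}

gender_rows = ["M", "F"] * 50_000        # 100,000 rows, 2 distinct values
customer_rows = range(100_000)           # 100,000 rows, all distinct

print(len(spus_used(gender_rows)))       # at most 2 SPUs carry all the work
print(len(spus_used(customer_rows)))     # the work spreads over all 24 SPUs
```

A gender column can never engage more than two nodes, so 22 of the 24 SPUs sit idle; a high-cardinality customer key spreads the rows across every node.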

RANDOM can be used, but only if there are no valid foreign keys available. Do not use a singleton
primary key, as the distribution will still be, in effect, completely random. The need for RANDOM can
arise for a number of reasons. For example, the table may be a parent of other tables, but its primary
key is not the most important key on those child tables; this is typical of type or code tables.
Alternately, the table may have relationships to other tables, but these are meaningless boilerplate
relationships, such as 'user created'. Finally, the table may be a true orphan, with no relationships
to any other table.

Netezza mandates that the distribution key should have a numeric type, for performance reasons. (See the note below.)

Finally, the distribution key must be used in a join in order to achieve high performance. The Netezza
Optimiser does not determine this. The query builder must know this.

Algorithm
The following provides an almost deterministic function to identify the best distribution key. This is a
distillation of the relevant Netezza documents, as well as my own experience. This algorithm is only
documented here, and is not defined in any of the Netezza references. This function should produce
a result, but it is not the last word. Further refinement may occur.

Preparation

A. For all tables, discover the primary key, and all active, foreign keys.

B. Get frequency distribution for all foreign keys to identify the candidate distribution keys.

C. Create a matrix of table name by foreign keys and primary key.

D. Use the frequency distribution and matrix to rank the candidate distribution keys in importance.

E. Follow this imperative algorithm to produce a list of (tableName, DK/RANDOM) mappings

Note: Netezza actually recommends BIGINT. However, other tools do not recognise this non-ANSI data type,
and as it is a key, it is critical that it uses a recognisable, ANSI data type. BIGINT is just syntactic
sugar for NUMERIC (19, 0), so NUMERIC (19, 0) is recommended.

Execution

1. Are there any unallocated tables with FKs?

2. If no, allocate unallocated tables to RANDOM and return list of (Tablename, DK)

3. If yes, are there any unallocated tables without DKs?

4. If no, return list of (Tablename, DK/RANDOM)

5. If yes, Get next FK.

6. Are there unallocated tables with this FK?

7. If not, repeat 5.

8. If yes, allocate each table with this FK to this DK.

9. Also re-allocate the parent table, where this FK is the PK, to this DK.

10. Repeat from step 2.
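The execution steps above can be sketched as an imperative function. This is one reading of the algorithm, with a hypothetical schema as input:

```python
# One reading of the allocation steps above, as an imperative function.
# ranked_fks: candidate FKs in descending importance (from steps B-D).
# table_fks:  table -> set of its active FK column names.
# table_pk:   table -> its primary key column name (used by step 9).
def allocate_distribution_keys(ranked_fks, table_fks, table_pk):
    allocation = {}
    for fk in ranked_fks:
        # Steps 6-8: allocate every unallocated table carrying this FK.
        for table, fks in table_fks.items():
            if table not in allocation and fk in fks:
                allocation[table] = fk
        # Step 9: re-allocate the parent table, where this FK is the PK.
        for table, pk in table_pk.items():
            if pk == fk:
                allocation[table] = fk
    # Step 2: remaining tables fall back to RANDOM.
    for table in table_fks:
        allocation.setdefault(table, "RANDOM")
    return allocation

# Hypothetical schema for illustration.
table_fks = {"ORDERS": {"CUST_ID"}, "INVOICES": {"CUST_ID"},
             "CUSTOMERS": set(), "CODES": set()}
table_pk = {"ORDERS": "ORDER_ID", "INVOICES": "INV_ID",
            "CUSTOMERS": "CUST_ID", "CODES": "CODE_ID"}
print(allocate_distribution_keys(["CUST_ID"], table_fks, table_pk))
```

Note how CUSTOMERS picks up CUST_ID via step 9 (its PK is the chosen FK), while CODES, a keyless code table, falls back to RANDOM.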


Fact Dimension Patterns


It is natural to provide a fact dimensional model as the final design output to be accessed using the
end user tools. The following are just a few rules to follow in designing these. This is not meant to be
a complete list. Please refer to Kimball documentation for further design advice.

Try to minimise the number of fact tables with the same grain. This will require fewer joins by the
end user, which will make the model easier to use.

See table below.

Issue Resolution

Data Types Only standard Netezza data types can be used. In particular, do not use non-
standard types like BIGINT, as BI tools may not recognise these data types.

Key Joins All joins need to use both the surrogate key and the distribution key, as the Netezza
optimizer does not use the distribution key by default. Any BI tool end user making
joins needs to be aware of this.

Key Structure BI tools normally require that the dimensions have single key. When the true
dimension key is made up of compound keys, these need to be concatenated into a
single, surrogate key, using a unique separator to prevent inadvertent key
duplication. Alternately the key can be created using a lookup table.

Key Types All surrogate keys must be integers. The standard Netezza data type for this should
be NUMERIC (19, 0).

Kimball All BI tools work best with a properly formed Kimball star schema.

No Nulls It is difficult for the end reporting user to reason about columns with NULL values, so
all source null values must be filled with default values.

User Control All BI tools support filtering, with WHERE clause, and aggregations with GROUP BY.
This should be controlled by the BI tool user. It should be avoided within Netezza.
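The reason the Key Structure rule above insists on a unique separator can be shown in two lines. With plain concatenation, the hypothetical compound keys ('AB', 'C') and ('A', 'BC') collapse into the same surrogate key; with a separator they stay distinct:

```python
# Why compound keys need a unique separator: plain concatenation can
# collapse two different compound keys into one surrogate key.
def surrogate(parts, sep=""):
    return sep.join(parts)

k1, k2 = ("AB", "C"), ("A", "BC")
print(surrogate(k1) == surrogate(k2))            # True: both become "ABC"
print(surrogate(k1, "~") == surrogate(k2, "~"))  # False: "AB~C" vs "A~BC"
```

The chosen separator must be a character that can never occur in the underlying key values; otherwise the same ambiguity reappears.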


Test Patterns
Unit Tests
The table below shows a series of unit and integration (or join) tests. Naturally, data transform
development cannot finish successfully without passing through unit testing.

The tests below use <table1>, <table2>, <pk1>, <pk2>, <fact_Table> and <Dim1> as placeholders for actual schema objects.

1. Data Type (expected: no error)
Checks that data conforms to type. Ensure that all columns are set to NOT NULL.

2. Any Rows (expected: 1)
Checks that each table has some rows.
SELECT DISTINCT
    (SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM SCHEMA2.<table1>)
  * (SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM SCHEMA2.<table2>) ROWPROD
FROM SCHEMA2.D_DATE;

3. All PK Unique (expected: 0)
Checks that the PK of each dim and fact is unique.
SELECT DISTINCT
    (SELECT ( COUNT(*) - COUNT( DISTINCT <pk1> ) ) DIFF FROM SCHEMA2.<table1>)
  + (SELECT ( COUNT(*) - COUNT( DISTINCT <pk2> ) ) DIFF FROM SCHEMA2.<table2>) SUMDIFF
FROM SCHEMA2.D_DATE;

4. All Left Joins (expected: 0)
Checks that each dimension left join to the fact returns the same row count as the fact.
SELECT ((SELECT COUNT(*) FROM SCHEMA2.<fact_Table> FV
         LEFT JOIN SCHEMA2.<Dim1> ET ON FV.<fact_Table_key> = ET.<Dim1_key>)
        - COUNT(*)) AS DIFF
FROM SCHEMA2.<fact_Table>;

5. All Inner Joins (expected: 0)
Checks that each dimension inner join to the fact returns the same row count as the fact.
SELECT ((SELECT COUNT(*) FROM SCHEMA2.<fact_Table> FV
         INNER JOIN SCHEMA2.<Dim1> ET ON FV.<fact_Table_key> = ET.<Dim1_key>)
        - COUNT(*)) AS DIFF
FROM SCHEMA2.<fact_Table>;

Performance Test
Data transform timing using Aginity is only an approximation, as there is some additional latency
associated with Aginity data display. When final performance testing is required, add the following
statements to a proc, in between each INSERT statement. This will accumulate the combined time to
run a proc, and help identify bottlenecks, etc.

eventTime := TO_CHAR( NOW(),'YYYY-MM-DD HH24:MI:SS');

returnValue := 'TABLE1 loaded at: ' || eventTime || ' ' ;


RAISE NOTICE ' % ', returnValue ;

accumulator := accumulator || returnValue ;
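For readers less familiar with NZPLSQL, the same accumulate-and-report idea looks like this in Python. This is an analogy only, not a replacement for the proc:

```python
import time

# Python analogy of the proc's timing pattern: stamp each load step and
# accumulate the messages so the whole run can be reported at the end.
accumulator = []

def record(step_name):
    event_time = time.strftime("%Y-%m-%d %H:%M:%S")
    message = f"{step_name} loaded at: {event_time}"
    print(message)            # analogous to RAISE NOTICE
    accumulator.append(message)

record("TABLE1")
record("TABLE2")
```

Comparing consecutive timestamps shows which INSERT dominates the proc's total run time.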

Volume Testing
Step 1: Get the data

It is a given that as much production data as possible should be loaded into the development and
test databases. This will ensure that the testing takes place using normal volumes. This is the
simplest way to do volume testing. If this is not done, then the first release will really be a volume
test, with all the attendant stress.

This requires that production-volume tables are available in Dev. The simplest method is to use
Transient External Tables; a sample of the commands needed is shown below, and these can be submitted
in Aginity. Aginity also has a set of tools that allow you to export and import data in bulk from
existing tables. See the Netezza_data_Loading_guide for more details.

To Extract from DB2 (Prod)

CREATE EXTERNAL TABLE 'C:\Netezza\DB2\SCHEMA1\TABLE3.dat'
USING ( delim 167 datestyle 'MDY' datedelim '/' maxerrors 2
        encoding 'internal' REMOTESOURCE 'ODBC'
        logDir 'c:\Netezza\' escapeChar '\' )
AS SELECT * FROM DB2.SCHEMA1.TABLE3;

To Load into DB1 (Dev)

INSERT INTO DB1.SCHEMA1.TABLE3
SELECT * FROM external 'C:\Netezza\DB2\SCHEMA1\TABLE3.dat'
USING ( delim 167 datestyle 'MDY' datedelim '/' maxerrors 2
        encoding 'internal' REMOTESOURCE 'ODBC'
        logDir 'c:\Netezza\' escapeChar '\' );

Insert directly from DB2 to DB1.

It is much easier to transfer between two databases on the same machine, provided there is
permission. All that is required is to read from the source into the target using SQL.

TRUNCATE TABLE DB1.SCHEMA1.TABLE4;

INSERT INTO DB1.SCHEMA1.TABLE4 SELECT * FROM DB2.SCHEMA1.TABLE4 ;

Run each SELECT separately

After all required data is loaded into DEV, decompose the PROC into a series of SELECTS, and run
them manually in Aginity to see how each SELECT performs individually. Then once they are all
performing adequately, rerun the PROC as a whole, to ensure overall adequate performance.


From Oracle to Netezza


As most developers will be more familiar with Oracle than Netezza, this table gives a feature-by-
feature availability comparison, which may be a useful starting point.

Name Type Oracle Netezza

Based on Diff Oracle Postgres

Bulk (set) transactions Diff Some Best

Db Ranking Diff 1 27

Description Diff Widely used RDBMS DW appliance

Developer Diff Oracle IBM

Initial release Diff 1980 2000

Partitioning methods Diff Horizontal partitioning Sharding

Replication methods Diff Master-master and master-slave replication Master-slave replication

Scales to Diff Terabytes Petabytes

Appliance Netezza Not Available Available

Implicit Casting Netezza Not Available Available

Map Reduce Netezza Not Available Available

Multi Db per Environment Netezza Not Available Available

Query Tuning not needed Netezza Not Available Available

Any Hardware platform Oracle Available Not Available

APIs - Oracle Oracle ODP.NET, Oracle Call Interface (OCI) Not Available

BLOB Oracle Available Not Available

CONNECT BY Oracle Available Not Available

Correlated sub queries Oracle Available Not Available

Cursors Oracle Available Not Available

Foreign keys Oracle Available Available, not enforced


Name Type Oracle Netezza

Index Oracle Available Not Available

PIVOT Oracle Available Not Available

PL/SQL Oracle Available Not Available

Row level transactions Oracle Available Not Available

Triggers Oracle Available Not Available

XML support Oracle Available Not Available

APIs - Common Same JDBC,ODBC, OLE DB JDBC, ODBC, OLE DB

Concurrency Same Available Available

Data scheme Same Available Available

Database model Same Relational DBMS Relational DBMS

Function Same Available Available

SQL Same Available Available

Supported languages Same C, Java, Python, Perl, R C, Java, Python, Perl, R

Transaction concepts Same ACID ACID

Typing Same Available Available

