Background
The banking DDS is an industry data model that provides an integrated data backplane for SAS Banking Solutions. The banking
DDS has been in development since early 2001, and the first production release was in 2004. Originally, the banking DDS
supported only SAS Customer Intelligence Solutions. It has grown consistently over time to support more solutions such as SAS
Customer Analytics for Banking, SAS Credit Scoring for Banking, SAS Credit Risk for Banking, and SAS Risk Management for
Banking.
Initially, the banking DDS was supported only in SAS data sets. Based on customer demand, the banking DDS is now supported
on SAS data sets, SAS Scalable Performance Data Server (SPD Server), Oracle, Teradata, DB2, and Microsoft SQL Server.
In the overall data flow, the banking DDS is the hub or data repository for downstream SAS solutions. ETL processes flow from the
banking DDS to the solution data marts, and some solutions have ETL processes that write back the results to the banking DDS.
Figure 1 shows the overall data flow.
Figure 1: Overall Data Flow. Operational sources and the enterprise data warehouse (EDW) are extracted and transformed into the banking DDS staging area (reference, master, and transaction data); data delivery ETL loads the solution data marts, and the SAS solutions write results back to the banking DDS.
The banking DDS uses two types of data flows: an acquisition ETL flow and a data delivery ETL flow.
Overview
The intent of this project was to load data into the banking DDS tables in order to simulate a customer environment and learn how
to improve performance. ETL jobs deployed from SAS Risk Management for Banking were used in this project.
The banking DDS data model was deployed in SAS and three other database systems. Generated data was loaded to test the
different techniques for fine-tuning the data structure for the banking DDS and for improving the performance of selected ETL jobs.
Specifically, techniques that enabled the ETL jobs to run inside the databases were tested. The scope of the project focused on
what could be done by fine-tuning the ETL without rewriting the logic for the job.
Project Setup
The project was set up with three tiers for the testing environment. Each of the databases, other than SAS, resided on its own server. The SAS Metadata Server was set up on a separate server, which also functioned as the SAS server. Separate client workstations were established for managing the banking DDS and SAS Risk Management for Banking ETL metadata and for executing the ETL jobs.
The following three ETL jobs from SAS Risk Management for Banking were used in the testing:
RMB_INT_100_I_EQUITY_QUOTE
RMB_STG_230_QUOTE_EQUITY
RMB_INT_I_RISK_FACTOR
The first job takes stock quote data from the banking DDS and loads it into the intermediate tables.
The second job takes the stock quote data from the banking DDS and loads it into the final staging table format for SAS Risk
Management for Banking.
The third job creates an intermediate table for risk factor data. This table is used by other staging ETL jobs.
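As an illustration only (the librefs, intermediate table name, and columns below are assumptions, not the solution-supplied job code), the core of the first job can be pictured as a single SQL step that reads the DDS EQUITY_QUOTE table and writes an intermediate table:

/* Hedged sketch: DDSLIB points to the banking DDS tables and INTLIB to
   the intermediate area. Paths, table names, and columns are illustrative. */
libname ddslib "/data/dds";
libname intlib "/data/intermediate";

proc sql;
   create table intlib.i_equity_quote as
   select equity_id,
          quote_dttm,
          quote_amt
   from ddslib.equity_quote;
quit;

When both librefs point to SAS data sets, every row of the source table passes through SAS, which is the behavior that the later sections set out to avoid.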
Attention was focused on the RMB_INT_100_I_EQUITY_QUOTE job. To provide adequate data and to obtain meaningful results,
the banking DDS EQUITY_QUOTE table was populated with 200 GB of data (2,029,962,547 rows of data).
Metadata Environments
To test ETL job performance with the banking DDS deployed in SAS and the three other database systems, the same banking
DDS tables and ETL jobs had to be deployed in separate metadata environments. Custom repositories were set up under the SAS
Metadata Server. Each environment had a custom repository.
Explicit Pass-Through
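With explicit pass-through, the SQL is submitted directly to the database through a PROC SQL connection, so the work runs inside the database and no rows are pulled into SAS. The following is a minimal sketch, assuming an Oracle connection; the connection options, table names, and columns are placeholders rather than the actual solution job code.

/* Explicit pass-through sketch: the CREATE TABLE ... AS SELECT statement
   inside EXECUTE runs entirely in the database. */
proc sql;
   connect to oracle (user=ddsuser password=XXXXXXXX path=orapath);
   execute (
      create table i_equity_quote as
      select equity_id, quote_dttm, quote_amt
      from equity_quote
   ) by oracle;
   disconnect from oracle;
quit;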
Implicit Pass-Through
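With implicit pass-through, the job is written against SAS/ACCESS librefs and SAS generates the database SQL on your behalf. A minimal sketch follows, again assuming Oracle and illustrative names; the DBIDIRECTEXEC system option lets PROC SQL pass a CREATE TABLE ... AS SELECT statement down to the database when the source and target tables share the same connection.

/* Implicit pass-through sketch: both the source and target use the same
   database libref, so PROC SQL can hand the whole statement to the DBMS. */
options dbidirectexec;

libname ddsdb oracle user=ddsuser password=XXXXXXXX path=orapath schema=dds;

proc sql;
   create table ddsdb.i_equity_quote as
   select equity_id, quote_dttm, quote_amt
   from ddsdb.equity_quote;
quit;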
Figure 8: Logical Overview of ETL with Source DDS Data Table Hosted on Database
Figure 9: Logical Overview of ETL with All Data Tables Hosted on Database
In the default job, the source EQUITY_QUOTE table first had to be extracted into SAS, which would be ~200 GB of data. This extraction required creating approximately 2 billion rows in a SAS data set. The I/O cost of this step was severe, and the job's performance was heavily impeded. After the data was moved into SAS, the various tasks in the SAS Extract job took considerably less time. The following figure shows the workflow of, and the different tasks in, the SAS Extract job.
Conclusions
In our testing, minimal changes were made to the source DDS data table (EQUITY_QUOTE). The ETL jobs that were tested used
the highest volume of data that was available. By using the EQUITY_QUOTE table (the largest source DDS data table), the ETL
job had minimal filter and join conditions. This information governed how data was distributed in an MPP database. The sample
data had the highest cardinality data in the QUOTE_DTTM column. This made the QUOTE_DTTM column a good candidate for
the distribution key. Because data is different for each customer, there is no one-size-fits-all method for optimizing performance.
The strategy should be to minimize I/O by localizing the work, intermediate, and stage tables. If tables can be localized, there will
be a performance improvement.
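How the distribution key is declared depends on the database. As one hedged example, a Teradata-style CREATE TABLE issued through explicit pass-through could distribute a work table on QUOTE_DTTM; the PRIMARY INDEX clause and table names below are illustrative, and other MPP databases use their own DISTRIBUTE BY or DISTKEY equivalents.

/* Hedged example: build the work table inside Teradata and distribute it
   on the high-cardinality QUOTE_DTTM column. Connection details and names
   are placeholders. */
proc sql;
   connect to teradata (user=ddsuser password=XXXXXXXX server=tdserver);
   execute (
      create table equity_quote_work as
      (select equity_id, quote_dttm, quote_amt from equity_quote)
      with data
      primary index (quote_dttm)
   ) by teradata;
   disconnect from teradata;
quit;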
The default ETL job created many data sets. Then, data was moved from one location to another. Moving this data caused the
largest I/O bottleneck. The largest move within this I/O bottleneck was the initial move of data from a 200 GB source DDS data
table to a 136.5 GB SAS data set. By localizing the data and moving it from one table to another table within the database, or by
removing the data move all together by creating a view to the database, job time was cut drastically.
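As a hedged sketch of the view-based approach, the extract that previously wrote a 136.5 GB SAS data set can be replaced by a SAS view over the database libref, so downstream steps read the database table in place; the libref, view name, and columns are illustrative.

/* Replace the physical extract with a view: no intermediate SAS data set
   is written, and reads (with any WHERE clauses) run against the database. */
libname ddsdb oracle user=ddsuser password=XXXXXXXX path=orapath schema=dds;

proc sql;
   create view work.equity_quote_v as
   select equity_id, quote_dttm, quote_amt
   from ddsdb.equity_quote;
quit;

/* Downstream job steps read WORK.EQUITY_QUOTE_V instead of a copied data set. */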
The results are displayed in the following table. Each database environment ran on different hardware, so the results should be read as individual performance gains rather than as a comparison across databases. For this reason, the databases are not identified by name. SAS values are provided as benchmarks.
[Table: ETL job run times in hours for the SAS benchmark and Databases A, B, and C. Runs that exceeded 8.00 hours were reduced to 1.65, 6.58, and 1.62 hours after tuning.]
SAS solutions such as SAS Risk Management for Banking often include out-of-the-box ETL jobs. When either the source table or
the target table of these ETL jobs is changed from a SAS data set to a database table, ETL performance is affected. In our testing,
we dramatically improved ETL performance in SAS Risk Management for Banking by making minor changes to the ETL jobs
provided with the solution. Although these changes were made to a specific set of ETL jobs, the underlying techniques can be used
to change any ETL job. The bottom line is that performance is optimized when the ETL code executes inside the database.
Figure A.3: SQL Join with Single Input (left) and Output Generated by the SASTRACE Option (right)
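The trace output referenced in Figure A.3 comes from the standard SAS/ACCESS tracing options rather than anything specific to the solution jobs. The following settings write the SQL that SAS sends to the database to the SAS log, so you can confirm whether a join was passed down or processed in SAS.

/* Log the SQL generated for the database, directing the trace to the SAS log
   and suppressing the timestamp suffix for readability. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;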