
SQL*LOADER STRATEGIES FOR DATA WAREHOUSE APPLICATIONS


Jeffery L. Maresh, Maresh Consulting, Inc.

Introduction

SQL*Loader is Oracle’s bulk data loading utility. It is most commonly used to load flat files
extracted from legacy systems that are being converted to an Oracle database application. In data warehouse
environments, in addition to performing the initial load of historical data, it is also useful for performing
periodic data loads. When source data already resides in a database, the data can often be extracted
directly through database links or transparent gateways and loaded into the data warehouse.
However, when source data does not reside in a database, or must undergo complex transformations, it is
likely to end up in an ASCII flat file that can be conveniently loaded using SQL*Loader.
A common example of this transformation and load process from the telecommunications industry
would be the process of loading call detail records (CDR) from the switches that handle telephone calls.
Switches often produce cryptic ASCII records that are downloaded periodically to a data processing
center. The data is massaged and scrubbed, and other identifying information meaningful to the
business is typically added to produce a record of each telephone call. Any sizable
telecommunications company will produce CDRs by the millions. Once the file containing a
certain number of CDRs has been built, it can be loaded into a database table using SQL*Loader.
SQL*Loader offers many features that make it suitable as an industrial-strength data loading facility. As
an Oracle product that is released with each version of Oracle Server Enterprise Edition, it is always
compatible with the most recent release of the database. A user-configurable
control file is used to define the characteristics of the data being loaded and the locations of
output files that will be produced when it runs. Using the control file, it is possible to specify some data
transformations. The table or tables that will be loaded are also specified here. The operating
characteristics and modes of SQL*Loader are controlled using command line options. Alternatively,
these command line keywords can be placed into a parameter file. When SQL*Loader is executed, it
generates a detailed log of its activities and places the results into a log file. This file is useful for
determining high-level operational information about the data that was loaded, including execution times
and row counts processed. Records from the input file that cannot be loaded because of errors are
placed into the bad file. This file is useful for debugging data errors. Records that were not loaded
because they did not meet the selection criteria specified in the control file are written to a discard
file. To learn more about the operational details of SQL*Loader, refer to the SQL*Loader sections in
the Oracle Utilities section of the documentation that accompanies Oracle Server Enterprise Edition.

www.rmoug.org RMOUG Training Days 2002


SQL*LOADER STRATEGIES Maresh

The goal of this paper is to illustrate four commonly used data loading scenarios for data warehouse
environments including operational data stores (ODS) in Oracle Enterprise Server version 8.1.6 or
higher. The term data warehouse environments refers to data warehouses in the traditional sense, where
operational data is stored in time intervals and rarely, if ever, modified or deleted. The term
operational data store refers to systems that supplement operational reporting. Here, data may or may
not be stored by time keys, and is likely to be modified and deleted. There are many other scenarios that
can be considered, but these four should handle most data loading requirements.
While the basic operation of SQL*Loader is relatively simple, certain modes of operation can
produce very high data load rates. As with most software applications, higher performance comes at
a cost. In this case, the costs are a variety of limitations on constraints, triggers, and indexes. Typical
uses will be offered for each scenario based upon the various application and business requirements.
The limitations and possible workarounds will also be discussed.

Scenario 1: Data loads into fact tables

Data are loaded into the data warehouse at periodic intervals usually depending upon the frequency at
which the source systems generate business events. For example, the generation of CDRs in the above
telecommunications application occurs 24 hours a day, 7 days a week. Because of the high data
volume, it is usually necessary to load CDR files as the input files are generated. It is usually not
possible to load all of the CDR files for an entire day during a short daily maintenance window. On the
other hand, monthly accounting data are generated once per month from a financial accounting system,
hence it is logical to load them once each month into the data warehouse fact tables. Fact tables are
those tables that are queried directly by user applications.
For small volumes of data, such as monthly accounting data, or perhaps daily order data for a
manufacturing company, the conventional path mode of SQL*Loader is likely to produce sufficient load
throughput.

[Figure: an input data file is loaded by SQL*Loader into a partitioned fact table with daily partitions from 02/01/2002 through 02/08/2002. Sample input records:]
1906,'02/03/2002',1455.22
1907,'02/04/2002',-31.22
1908,'02/04/2002',870.00
1909,'02/05/2002',45228.18
1910,'02/06/2002',-1132.60
1911,'02/06/2002',286.24
…
Figure 1. Data Load into a Fact Table

Conventional Path SQL*Loader

Conventional path SQL*Loader is the default operating mode. The term conventional path refers to the
fact that data finds its way into tables by the same internal mechanisms as a conventional INSERT
statement. SQL*Loader generates the appropriate SQL statement or statements to insert the data from
the input file. As rows are read from the input data file, values are bound to the SQL statement, which is
then executed.

[Figure labels: SQL*Loader; INSERT INTO …; SQL Command Processing, Space Management, Buffer Cache Management, and Database Writers (Database Instance); Database]
Figure 2. Conventional Path SQL*Loader

Since it utilizes conventional INSERT statements, all of the usual data integrity mechanisms, such as
primary and unique key constraints, check constraints, and referential integrity constraints, are enforced. Triggers
present on the tables being loaded will also fire, and data will be committed periodically according to the
row interval specified in the SQL*Loader parameter file. Additionally, it is irrelevant whether the
table being loaded is partitioned since Oracle’s partitioning strategy is transparent to
SQL*Loader.
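In SQL terms, a conventional path load of the first records shown in Figure 1 behaves roughly like the statements below. This is only a sketch: the column names txn_id, txn_date, and amount are assumptions, since the paper does not give the table definition, and SQL*Loader binds arrays of rows to a prepared statement rather than issuing one literal INSERT per record.

```sql
-- Hypothetical column names; the definition of transaction_data is not shown in the paper.
INSERT INTO transaction_data (txn_id, txn_date, amount)
VALUES (1906, TO_DATE('02/03/2002', 'MM/DD/YYYY'), 1455.22);

INSERT INTO transaction_data (txn_id, txn_date, amount)
VALUES (1907, TO_DATE('02/04/2002', 'MM/DD/YYYY'), -31.22);

-- Constraints are checked and triggers fire for each row, as with any INSERT.
-- SQL*Loader issues commits periodically while the load runs.
COMMIT;
```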
Conventional path loads are also well behaved if the process should abort or be terminated by an
operator. If a natural primary key is present, recovering from the failure is probably just a matter of
rerunning SQL*Loader on the same input data file. SQL*Loader will reject the rows that have already
been loaded when the primary or unique keys are violated. If there is no natural primary or unique key,
then one must simply determine what data was loaded and edit the input data file to exclude that data
before restarting SQL*Loader.
If higher throughput is desired, multiple SQL*Loader jobs may be executed concurrently. There are
several table design issues that should be considered to optimize performance under this scenario.
Increasing the number of free lists on the table will reduce the risk of free list contention. In addition, it is
preferable for the tablespace that houses the table being loaded to have a larger number of smaller
datafiles rather than a smaller number of large datafiles. This will reduce the likelihood of datafile
contention.
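The two design suggestions above might look like the following. This is a sketch under assumed names: the table transaction_data_demo, the tablespace dw_data, the datafile path, the sizes, and the free list count are all hypothetical and would need to be matched to the number of concurrent loaders and the storage layout.

```sql
-- Hypothetical table, tablespace, and sizing.
-- FREELISTS applies to segments that use free list (manual) space management.
CREATE TABLE transaction_data_demo (
    txn_id    NUMBER        NOT NULL,
    txn_date  DATE          NOT NULL,
    amount    NUMBER(12,2)
)
TABLESPACE dw_data
STORAGE (FREELISTS 4);

-- Add another modest-sized datafile rather than growing one large file.
ALTER TABLESPACE dw_data
    ADD DATAFILE '/u02/oradata/dw/dw_data05.dbf' SIZE 500M;
```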
While this scenario has many desirable attributes, there are a number of factors that may or may not
cause a problem.
1. One must consider when end-users need to see the data. Whether or not the data is currently stored
in the database, it is considered to be published when it becomes visible to end-users. Since data are
being loaded directly into fact tables in this scenario, data are published as soon as the rows are
committed. If the application depends upon near real-time data access, then this is a desirable
scenario. An example of this type of application is one that detects credit card fraud. Since most
fraudulent activity occurs within several hours after a credit card number has been stolen, adding data
to an operational data store in weekly intervals would not be an effective solution. Here, it is
imperative to publish the data as soon as it can be loaded into the database. If, however, an
application requires that data for a specific time unit only be available in its entirety, then publishing
data as a result of the load process is probably not a viable solution. An example would be an
application that produces daily reports on retail sales. There are two parts to the problem. First, a
query that is unbounded by time will produce partial results for the day that is currently being
loaded. Second, each time the query is run, the results for the current day will change as new data
is loaded. One solution to this problem is to accumulate the input data files until the period has
completed, and then load all of it at once. This is a good solution if daily data volumes are low.
However, if the data volumes are high enough that the data cannot be loaded within a
reasonable maintenance window, then another alternative must be employed.
2. There will be another problem with this scenario if the table being loaded has bitmap indexes. In
data warehouse and possibly ODS environments, single-column bitmap indexes are usually created
on each column that is likely to appear in query predicates as either join or limiting conditions.
Because of the inherent physical implementation of bitmap indexes, loading even small amounts of
data into tables having bitmap indexes can cause significant performance degradation when the
bitmap indexes are used by subsequent queries. In addition, if the table contains many bitmap
indexes, performance of conventional path data loads can be drastically reduced by the creation of
the index entries, and the size of the bitmap indexes may increase dramatically. Therefore, if
conventional path loads must be performed, the table should not use bitmap indexes unless they can
be dropped before the data load begins and rebuilt after the data load has completed.
3. One should also consider the backout scenarios when loading directly into fact tables. If small
amounts of data must be removed from a large fact table after it has been discovered that bad data
was accidentally loaded, the only alternative is to issue one or more DELETE statements. This
process may be both time-consuming and computationally expensive.
4. Last of all, if significant amounts of data must be loaded, even multiple concurrent SQL*Loader
sessions may not be able to load data fast enough.
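For the bitmap index concern in item 2 above, the drop-and-rebuild approach could be sketched as follows. The index name transaction_data_type_bix and the column txn_type are hypothetical; the paper does not name any indexes.

```sql
-- Hypothetical index and column names.
-- Before the conventional path load: remove the bitmap index.
DROP INDEX transaction_data_type_bix;

-- ... run the SQL*Loader job(s) here ...

-- After the load: build the index again in a single pass.
-- A bitmap index on a partitioned table must be a LOCAL index.
CREATE BITMAP INDEX transaction_data_type_bix
    ON transaction_data (txn_type)
    LOCAL
    NOLOGGING;
```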
It is also worth mentioning that for any scenario that utilizes a partitioned fact table, a performance
penalty will be paid anytime a query does not specify limiting conditions on the partition key. The
partition key is usually the time interval of the business transaction. When limiting conditions are
placed on the partition key column, Oracle is able to eliminate all but those partitions containing the data
being queried. For example, if the partition key is the telephone call date and the user queries for certain
types of phone calls between February 1, 2002 and February 7, 2002, then all partitions not containing
those dates are culled by the optimizer. If, however, the user queries the table for a certain type of call
without a date restriction, then all table partitions must be probed. For long-running queries, probing
partitions not containing data that will be returned by the query usually does not reduce performance
significantly. But for short-running queries on tables with many partitions, the query may run several
times longer than the same query run on a corresponding nonpartitioned table. The second scenario
works well in data warehouse and ODS environments because these environments most often store and
query data based upon the time at which the business event occurred.
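The difference between the two kinds of query can be illustrated with a pair of statements. The table call_detail, its columns call_date and call_type, and the value 'INTL' are hypothetical; the only assumption that matters is that the table is range-partitioned on call_date.

```sql
-- Hypothetical table call_detail, range-partitioned on call_date.

-- Limiting condition on the partition key: only the partitions that
-- cover February 1 through February 7, 2002 need to be read.
SELECT COUNT(*)
  FROM call_detail
 WHERE call_type = 'INTL'
   AND call_date >= TO_DATE('02/01/2002', 'MM/DD/YYYY')
   AND call_date <  TO_DATE('02/08/2002', 'MM/DD/YYYY');

-- No condition on the partition key: every partition must be probed.
SELECT COUNT(*)
  FROM call_detail
 WHERE call_type = 'INTL';
```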

Scenario 2: Data loads into staging tables

This scenario adds more complexity to the first scenario to mitigate one or more of the undesirable
characteristics. A staging table is used as the target table for loading data whenever data becomes
available. A staging table is a table that is used to hold data before it is transferred to the fact table. It is
typically not visible to end-users. If higher throughput is desired, multiple SQL*Loader jobs may be
executed concurrently. The staging table should have column definitions identical to those of the
corresponding fact table. Only indexes necessary to facilitate enforcement of surrogate keys should be
created. Other constraints present on the corresponding fact table may also be absent. This is an
acceptable practice only if one has transformed the data to assure that no constraints on the target fact
table will be violated. If the data can be reloaded in the event of media failure, the staging table should
be created with the NOLOGGING option to reduce the volume of redo log activity. Note that NOLOGGING
reduces redo only for direct path operations, such as the direct path loads described in the third
scenario; conventional path inserts are always fully logged. Where it applies, this will help to improve
load throughput.
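One way to create such a staging table is to copy the fact table's column definitions with an empty CREATE TABLE ... AS SELECT. This is a sketch rather than the author's method; the table names follow the partition exchange statement used later in this paper.

```sql
-- WHERE 1 = 0 selects no rows, so only the structure is copied.
-- CREATE TABLE ... AS SELECT copies column names, datatypes, and explicitly
-- declared NOT NULL constraints, but not indexes or other constraints.
-- The result is a nonpartitioned table even though the source is partitioned.
CREATE TABLE transaction_data_stage
    NOLOGGING
    AS SELECT * FROM transaction_data WHERE 1 = 0;
```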
The fact table should be partitioned at a time interval chosen to meet the business case. The partition
time span is usually established as the interval in which data is published, but may be shorter if large
volumes of data must be processed. For example, if the table holds transaction data and users are
interested in viewing data by complete days, then a daily partition interval is probably the best choice.

[Figure: an input data file is loaded by SQL*Loader into a staging table, which is then attached by partition exchange to a partitioned fact table with daily partitions from 02/01/2002 through 02/08/2002. Sample input records:]
2241,'02/05/2002',1455.22
2242,'02/05/2002',-31.22
2243,'02/05/2002',870.00
2244,'02/05/2002',45228.18
2245,'02/05/2002',-1132.60
2246,'02/05/2002',286.24
…
Figure 3. Data Load into Staging Table

Once data for the time period corresponding to the partition interval has been loaded into the staging
table, all constraints and indexes that are present on the fact table must be created on the staging table.
Once this process has been completed, the staging table can be exchanged with the empty partition of
the fact table using the following SQL statement.
ALTER TABLE transaction_data
EXCHANGE PARTITION transaction_data00785 WITH TABLE transaction_data_stage
INCLUDING INDEXES;

Data in the staging table must match the partition interval of the partition being exchanged on the fact
table. In addition, only local indexes may be present on the fact table. The partition exchange process
usually completes in a few seconds because the data is not moved. Instead, the table and index segments
are attached to the partitioned table through changes in the data dictionary. The indexes that were
created on the staging table become local index partitions on the fact table. As the process implies, the
original empty table partition that was exchanged becomes the staging table. This process can be
repeated for each load period.
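The preparation steps that precede the exchange might be sketched as follows. The index, constraint, and column names, the check condition, and the new partition's name and boundary date are all hypothetical; what must hold in practice is that the staging table's indexes and constraints match those of the fact table.

```sql
-- Hypothetical column, index, and constraint names.

-- 1. Build on the staging table the indexes that exist as local indexes
--    on the fact table.
CREATE BITMAP INDEX transaction_data_stage_bix1
    ON transaction_data_stage (txn_type)
    NOLOGGING;

-- 2. Add the constraints that exist on the fact table.
ALTER TABLE transaction_data_stage
    ADD CONSTRAINT transaction_data_stage_ck1
    CHECK (txn_type IN ('D', 'C'));

-- 3. Exchange the loaded staging table with the empty partition.
ALTER TABLE transaction_data
    EXCHANGE PARTITION transaction_data00785 WITH TABLE transaction_data_stage
    INCLUDING INDEXES;

-- 4. Add the next empty partition for the following load period
--    (hypothetical name and upper bound).
ALTER TABLE transaction_data
    ADD PARTITION transaction_data00786
    VALUES LESS THAN (TO_DATE('02/07/2002', 'MM/DD/YYYY'));
```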

This scenario solves some of the problems with the first scenario as follows.
1. If the type of application is such that users want to see data in complete time units, as in the daily
retail reports example, then this scenario is suitable. Data will not be published until the staging
table has been exchanged with the fact table partition.
2. Since most indexes are built after all of the data have been loaded into the staging table,
performance problems that would result from poorly structured bitmap indexes are avoided.
3. If it is determined that errant data were loaded into the staging table, they may be removed using
conventional DELETE statements. Alternatively, the staging table could simply be truncated
and reloaded with the correct data. This provides a faster way of correcting data problems not
detected during the transformation process.
4. Some improvements in throughput can be expected if the fact table has many constraints and
indexes since few indexes must be maintained during the load process. In addition, if high
volumes of data are loaded, one can begin loading data into a second staging table as soon as
new data becomes available. Loading into a secondary staging table can commence while the
first staging table is being prepared to be exchanged with the fact table partition.
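The two ways of backing out errant data described in item 3 above can be sketched as follows. The column txn_id and the key range are hypothetical, chosen to match the sample records in Figure 3.

```sql
-- Hypothetical predicate: remove only the rows that came from one bad input file.
DELETE FROM transaction_data_stage
 WHERE txn_id BETWEEN 2241 AND 2246;
COMMIT;

-- Alternatively, discard everything in the staging table and reload
-- the corrected input files.
TRUNCATE TABLE transaction_data_stage;
```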
This scenario has several features that may or may not be a problem.
1. Data is now not published until all data for the partition interval have been loaded. If an
application requires near real-time data, as described in the credit card fraud business case, then
this scenario will likely be unacceptable.
2. This scenario entails higher maintenance requirements than the first scenario. However, it is
straightforward to automate all of the tasks associated with building indexes and constraints on
the staging table, and performing the partition exchange. This can be done with a robust
structured programming language such as Perl that can easily perform OS calls, handle logic, and
perform database tasks.

Scenario 3: Direct path data loads into staging tables

This scenario is a higher throughput version of the previous scenario. Performance is improved by using
the direct path mode of SQL*Loader. The same configuration of the staging and partitioned fact table
used for the second scenario is used once again except that it is preferable that no indexes be present on
the staging table while it is being loaded.

Direct Path SQL*Loader

The direct path mode of SQL*Loader offers significant performance improvements over conventional
path mode. If the staging table has no indexes or constraints, improvements of 4x to 8x are possible.
Recall that conventional path SQL*Loader uses conventional INSERT statements. In conventional path
mode, data must go through the SGA and is written to data files by Oracle kernel processes. The higher
throughput with direct path load occurs because table data blocks are formatted directly by SQL*Loader
and written directly to tablespace datafiles.
During the load, data blocks being populated are located above the high-water mark in the table; hence
the process of searching table free lists for data blocks that are eligible to receive new rows is
eliminated. Once the load has completed, the high-water mark on the table is raised to the position
above the highest block that was loaded. In this scenario, only one direct path SQL*Loader process
may be running at a time. To enable direct path loads, the parameter DIRECT=TRUE must be set.

In addition to higher loader throughput, the overall performance impact on the database attributable to
data loads will be reduced. If data are being loaded continuously using multiple conventional path
SQL*Loader processes, the performance improvement achieved by switching to direct path
SQL*Loader may be very significant. But performance comes with more restrictions. Consult the
Oracle documentation for SQL*Loader for a complete list of limitations. A few of the more egregious
ones include:
1. An exclusive lock is placed upon the table while data is being loaded. This prevents any other
DML activity from occurring.
2. Triggers will not fire for rows being inserted.
3. Rows that violate unique key constraints are not rejected. If this occurs, the corresponding index
will be left in an unusable state. To return the index to a usable state, the offending data must
be removed and the index must be rebuilt.
4. Referential integrity constraints are disabled during the load and must be reenabled after the load
completes. If the SQL*Loader option has been specified to reenable constraints after the
load, the entire table will be checked, not just the rows that were loaded. Any failures are
reported in the SQL*Loader error log.
5. SQL functions (e.g., TO_DATE(), TO_CHAR(), etc.), which could otherwise be used to perform
some data transformations, cannot be used within the control file.
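The cleanup implied by items 3 and 4 above might be sketched as follows. The column txn_id, the index transaction_data_stage_uk, and the constraint transaction_data_stage_fk1 are hypothetical names, and which duplicate row to keep is a business decision; this sketch simply keeps the row with the lowest ROWID.

```sql
-- Hypothetical table, column, index, and constraint names.

-- 1. Find indexes left unusable by the direct path load.
SELECT index_name, status
  FROM user_indexes
 WHERE table_name = 'TRANSACTION_DATA_STAGE'
   AND status = 'UNUSABLE';

-- 2. Remove rows with duplicate key values, keeping one row per key.
DELETE FROM transaction_data_stage a
 WHERE a.ROWID > (SELECT MIN(b.ROWID)
                    FROM transaction_data_stage b
                   WHERE b.txn_id = a.txn_id);
COMMIT;

-- 3. Rebuild the unique index now that the offending rows are gone.
ALTER INDEX transaction_data_stage_uk REBUILD;

-- 4. Reenable a referential integrity constraint that was disabled for the load.
ALTER TABLE transaction_data_stage
    ENABLE CONSTRAINT transaction_data_stage_fk1;
```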

[Figure labels: SQL*Loader; SQL Command Processing, Space Management, Buffer Cache Management, and Database Writers (Database Instance); Data Path; Database]
Figure 4. Direct Path SQL*Loader

It is usually for one or more of these reasons that database architects and developers are often scared away
from using direct path SQL*Loader. However, it becomes quite practical when using a data staging table
as described in this scenario. The data transformation process that occurs prior to loading the data can
overcome most of the above limitations. Once these limitations have been overcome, the process of
preparing the table to be exchanged with the table partition and the exchange process itself are identical to
those described in the second scenario. The features enumerated in the second scenario as potential
problems are also valid here. While direct path SQL*Loader can be used to perform incremental data
loads directly into the fact table, the limitations make it a rather impractical proposition.

Scenario 4: Historical data loads using direct path

While the third scenario will probably provide suitable throughput for incremental data loads,
throughput may not be sufficient when loading the volumes of historical data that may occur during the
initial population of a data warehouse. To achieve higher performance, the parallel direct path mode of
SQL*Loader may be used. Here, data is loaded directly into the fact tables with constraints and triggers
disabled, and no indexes present. Since the table is in the process of being initially populated, it is
assumed that users will not be accessing the table while it is being loaded.
Like the non-parallel mode version of direct path SQL*Loader, the performance gains are realized
because the SQL*Loader processes populate data blocks and write them directly to tablespace datafiles,
thus bypassing the SGA and associated kernel processes. As long as there is sufficient system I/O
bandwidth, and the loads are spread across multiple datafiles, load throughput should scale nearly
linearly with the number of SQL*Loader processes employed.
When parallel direct path loads are enabled using the PARALLEL=TRUE option, SQL*Loader adds new
blocks differently than it does in non-parallel mode. With the non-parallel
direct path mode of SQL*Loader, new data is added in the space above the table high-water mark.
When the parallel direct path mode is used, each of the SQL*Loader processes creates temporary
segments to hold the new data in the tablespace containing the table being loaded. As each loader
process completes, its temporary segments are merged with the table. After all of the historical data
have been loaded, the appropriate indexes and constraints should be created, and any triggers should be
enabled.
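The SQL that brackets a historical load of this kind might look like the sketch below. The constraint, index, and column names are hypothetical, and the SQL*Loader sessions themselves run between the two groups of statements.

```sql
-- Hypothetical constraint, index, and column names.

-- Before the parallel direct path load: disable triggers and constraints.
ALTER TABLE transaction_data DISABLE ALL TRIGGERS;
ALTER TABLE transaction_data DISABLE CONSTRAINT transaction_data_fk1;

-- ... run the concurrent SQL*Loader sessions here ...

-- After the last loader has finished: build indexes, then reenable
-- constraints and triggers.
CREATE BITMAP INDEX transaction_data_type_bix
    ON transaction_data (txn_type)
    LOCAL
    NOLOGGING;

ALTER TABLE transaction_data ENABLE CONSTRAINT transaction_data_fk1;
ALTER TABLE transaction_data ENABLE ALL TRIGGERS;
```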
Multiple direct path SQL*Loader sessions may load a table or single partition of a partitioned table.
This is referred to as intrasegment concurrency. Multiple direct path SQL*Loader sessions may each
load different partitions of the same table. These are considered to be nonparallel direct path loads
where only one process is loading data into any segment at a given time. This is referred to as
intersegment concurrency. It is also possible for each direct path SQL*Loader process to load multiple
partitions of a partitioned table. To improve performance when loading partitioned tables, limit each
SQL*Loader session to a single partition by partitioning the data into separate data files before they are
loaded.

[Figure: several input data files are each loaded by a separate SQL*Loader process into a partitioned fact table with daily partitions from 01/30/2002 through 02/08/2002. Loaders working on different partitions are labeled Intersegment Concurrency; loaders working on the same partition are labeled Intrasegment Concurrency.]
Figure 5. Parallel Direct Path Load into a Fact Table

This scenario illustrates a practical application of parallel direct path SQL*Loader. Consult the Oracle
documentation for SQL*Loader for a complete list of limitations. The following are some of the most
restrictive ones for this mode.
1. Indexes are not maintained. If indexes are present on a table while parallel direct path loads are
running, the parameter SKIP_INDEX_MAINTENANCE=YES must be specified.
2. Triggers, referential integrity constraints, and check constraints must be disabled before the load and
enabled manually after the last loader has terminated.
3. SQL functions (e.g., TO_DATE(), TO_CHAR(), etc.), which could otherwise be used to perform
some data transformations, cannot be used within the control file.

Summary

SQL*Loader is a powerful utility for loading data from flat files into nonpartitioned and partitioned
Oracle database tables. It has two basic modes of operation. The conventional path mode of
SQL*Loader uses conventional SQL INSERT statements to load data. The greatest benefit of
conventional path mode is that all constraint mechanisms and indexes are maintained on the data being
loaded. While conventional path mode offers the greatest amount of flexibility, it may not be able to
handle high load throughput associated with large incremental data loads. Using a data staging table,
direct path SQL*Loader can be used to load higher volumes of data with a reduced load on the overall
database. This is accomplished by bypassing the conventional path through the SGA and associated
kernel processes. With the increase in performance come restrictions involving constraints and triggers.

For loading historical data when a data warehouse is being initially populated, still higher performance
can be achieved using the parallel direct path mode of SQL*Loader. This mode supports data loads into
nonpartitioned and partitioned tables with throughput that scales nearly linearly with the number of loader
processes if there is sufficient system I/O bandwidth. The four scenarios illustrate how
the various SQL*Loader modes can be used effectively and efficiently for the tasks associated with
loading data from flat files into data warehouse tables.
