
Lesson 8: Sorting SAS Data Sets
Summary

Main Points

Understanding the SORT Procedure

Sorting data is useful for reordering data for reporting, reducing data retrieval time, and enabling BY-group processing. However, PROC SORT is resource-intensive, using considerable disk space, memory, I/O, and CPU time. You can use options or techniques with PROC SORT to minimize resource usage.

SAS supports PROC SORT in all operating environments, so PROC SORT can't take advantage of any platform-specific sort enhancements. PROC SORT executes in memory up to the limit imposed by the SORTSIZE= option. In fact, PROC SORT minimizes the use of external storage and tries to sort entirely in memory, if possible.

By default, PROC SORT executes in parallel using multiple threads. Taking advantage of threaded processing in SAS can help you reduce I/O when you sort data. These are some useful terms related to threaded processing:

Term: Definition
o thread: a single, independent flow of execution through a program or within a process
o parallel processing: multiple units of work that the operating system schedules for concurrent execution
o symmetric multiprocessing machines (SMPs): computers with multiple CPUs that share the same memory and a thread-enabled operating system; can spawn and process multiple threads simultaneously

You can determine how many CPUs are available in your SAS session by using a PROC OPTIONS step that specifies OPTION=CPUCOUNT. When you specify OPTION=CPUCOUNT, the SAS log displays the number of available processors.

SAS Programming 3: Advanced Techniques and Efficiencies


Copyright 2010 SAS Institute Inc., Cary, NC, USA. All rights reserved.



    PROC OPTIONS OPTION=CPUCOUNT;
    RUN;

The following steps describe how PROC SORT processes data in parallel. This example uses four threads.

Steps in Parallel Processing Using PROC SORT
1. PROC SORT breaks the SAS data set into chunks by dividing the total number of observations by the total number of threads available to do the parallel processing.
2. PROC SORT creates the processing threads.
3. The threads read and process data:
   o Thread 1 starts reading and processing data chunk 1.
   o Thread 2 reads and processes chunk 2.
   o Thread 3 reads and processes chunk 3.
   o Thread 4 reads and processes chunk 4.
4. PROC SORT collates the partial results.

Using threaded processing completes the sort in less real time than handling each task sequentially, although the CPU time is generally increased.

Other SAS tasks besides sorting can also exploit threading. These tasks include subsetting using WHERE expressions, filtering variables using DROP or KEEP statements or data set options, indexing, and summarizing data. In addition to PROC SORT, these Base SAS procedures are multithreaded: PROC MEANS, PROC SUMMARY, PROC REPORT, PROC SQL (using the GROUP BY and ORDER BY clauses), and PROC TABULATE. Many SAS/STAT procedures are also multithreaded.

When you benchmark using the threaded procedures, compare real time rather than CPU time. The back-end collating process to re-create the single data set might increase total CPU time while reducing real or elapsed time.
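As a rough illustration of the benchmarking advice above, the following sketch runs the same sort once without threading and once with it. It assumes that the orion library used in the sample code is assigned; the output data set names are hypothetical. The FULLSTIMER system option writes real time and CPU time to the log as separate figures so that you can compare them.

```sas
/* Report real time and CPU time separately in the log */
options fullstimer;

/* Non-threaded sort (output data set name is hypothetical) */
proc sort data=orion.order_fact out=work.sorted_nothreads nothreads;
   by Order_Date;
run;

/* Threaded sort of the same data, for comparison */
proc sort data=orion.order_fact out=work.sorted_threads threads;
   by Order_Date;
run;
```

On a small data set the two steps may show little difference, so treat the log figures as indicative only and repeat the comparison on data of realistic size.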

Controlling Threaded Processing in PROC SORT

You can enable or disable threaded sorting in two ways. You can use the THREADS | NOTHREADS system option, or you can specify the THREADS | NOTHREADS option in the PROC SORT statement. In both cases, the default is THREADS. Specifying the THREADS | NOTHREADS option in a PROC statement overrides the THREADS | NOTHREADS system option.

The THREADS | NOTHREADS option interacts with the TAGSORT option. If you specify the TAGSORT option with PROC SORT, SAS disables threading. The TAGSORT option stores only the BY variables and the observation numbers, called tags, in temporary files. When the sorting process completes, PROC SORT uses the tags to retrieve observations from the input data set in sorted order.

To control the number of processors that are available for SAS to use, you can specify the CPUCOUNT= system option.

    OPTIONS CPUCOUNT=ACTUAL | 1-1024;

The default CPU count is the actual number of CPUs available. A numeric value cannot give SAS more processors than the machine has. If you specify a value that is larger than the number of CPUs that are actually present, SAS still runs on the actual CPUs, and overall performance might be reduced, because SAS may allocate more threads than available processors.

Your system administrator might limit the number of CPUs that are available for SAS processing. So ACTUAL might be lower than the total number of CPUs in the machine that SAS is using.
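A minimal sketch of these controls follows. It assumes that the orion library from the sample code is assigned, and the output data set name is hypothetical. The two PROC OPTIONS steps only display the current settings in the log; the final step shows the TAGSORT option described above.

```sas
/* Let SAS use the number of CPUs it detects (the default setting) */
options cpucount=actual;

/* Write the current CPUCOUNT= and THREADS settings to the log */
proc options option=cpucount;
run;

proc options option=threads;
run;

/* TAGSORT keeps only BY values and observation numbers in temporary files. */
/* As described in the text, SAS disables threading for this step.          */
proc sort data=orion.order_fact out=work.orders_by_date tagsort;
   by Order_Date;
run;
```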

Improving Sort Performance

When you use the SAS sort, a quick rule of thumb for sort space is four times the size of the SAS data set. Even when you sort in place (sort a data set back to the same name), you need enough space in the library for two copies of the data.

Sorting takes place in the PROC SORT utility work space. This work space is shared by memory and disk. But if you can sort the data all in memory, the sort runs faster, because you avoid writing and reading temporary utility swap files.

Determining how much sort space you need is not an exact science. The amount of space that the SAS sort needs depends on four conditions:

o The first is whether PROC SORT can use threading. Threaded sorts take less space than non-threaded sorts. Threaded sorts generally require three times the size of the SAS data set being sorted.
o The second condition concerns the data itself and has two parts: the length of the observations, and the number of variables in the BY statement and their storage lengths. These factors are important because the utility work space requires enough room to hold an entire observation and two copies of the BY variable values for every observation. The SAS sort routine uses the duplicate BY variable values to retrieve BY values quickly without having to reread the entire observation.
o The third condition is the operating environment where PROC SORT executes, which plays a big part in allocating space.
o The final condition is the library where PROC SORT writes the sorted data. You need enough space in the source library for one data set and enough space in the target library for one copy of the data set.

For more information about calculating sort space, see Calculating Sort Space in the appendix Details.

There are no exact guidelines for determining sort space requirements, because the required sort space depends on your data. However, if you don't have enough memory or virtual memory allocated to PROC SORT, the procedure won't have enough memory to divide the space for each thread. To avoid this problem, you can use the SORTSIZE= option in the PROC SORT statement to specify the amount of memory that's available to PROC SORT. The SORTSIZE= option can also improve the sort performance by restricting the operating system's swapping of memory to disk. The possible SORTSIZE= values depend on your operating environment.

    SORTSIZE=n | nK | nM | nG | MAX | SIZE

The default SORTSIZE= value in the Windows operating environment is 64 megabytes.

A SORTSIZE= value as large as the required sort space ensures that the sort occurs in memory. This reduces processing time. If PROC SORT needs more memory than you specify, it creates a temporary utility file to complete the sort. This increases processing time. For the multi-threaded SAS 9 sort, if the SORTSIZE= value is too small, the sort fails to complete at all.
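The two rules of thumb above (four times the data set size in general, about three times for a threaded sort) can be turned into a quick estimate. This is only a sketch of the arithmetic: the 500 MB data set size and the variable names are hypothetical, and the result is a starting point for benchmarking rather than a guarantee.

```sas
data _null_;
   dataset_mb  = 500;              /* hypothetical data set size in megabytes */
   general_mb  = 4 * dataset_mb;   /* general rule of thumb: four times       */
   threaded_mb = 3 * dataset_mb;   /* threaded sort: about three times        */
   /* Write the estimates to the SAS log */
   put 'Estimated sort space in MB: ' general_mb= threaded_mb=;
run;
```

For a 500 MB data set this gives 2000 MB under the general rule and 1500 MB for a threaded sort.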

For optimal performance, you should set the SORTSIZE= option to a value smaller than the available physical memory. This enables the programs and the operating environment to stay resident in memory. You should investigate how changing the value of the SORTSIZE= option affects resources.

In some cases, using a host sort utility might be the most effective way to sort data. A host sort utility is the operating system's native sort utility, such as IBM's DFSORT, or a third-party sort utility such as SYNCSORT. Host sort utilities are available in the Windows, UNIX, and z/OS operating environments. Ask your system administrator whether a host sort utility is available at your site. Generally, the SAS sort is more efficient for smaller data sets, because it is an in-memory sort, whereas a host sort is more efficient for larger data sets.

You can use several SAS system options to specify the sort utility that PROC SORT uses.

o The SORTPGM= option specifies whether PROC SORT uses the SAS sort utility or the host sort utility. If you specify SORTPGM=HOST in an OPTIONS statement, PROC SORT always sorts using the host sort utility. If you specify BEST, SAS chooses a sort utility based on two factors: the number of bytes being sorted and the value of the SORTCUTP= option.
o The SORTCUTP= option specifies the cutoff point between the SAS sort and the host sort. If the data set contains more bytes than the SORTCUTP= value, the host sort utility sorts the entire data set. The default values are 0 in the Windows and UNIX operating environments, which means the SAS sort is used, and four megabytes in the z/OS operating environment. To determine the optimal value of the SORTCUTP= option, you should specify the SORTPGM= option and benchmark a PROC SORT step with larger and larger data sets.
o The SORTPGM= option also interacts with the SORTNAME= option. If the value of the SORTPGM= option is BEST or HOST and you happen to have multiple host sort utilities available, you can use the SORTNAME= option to specify which host sort utility PROC SORT uses.
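One way to approach the SORTCUTP= benchmarking described above is sketched here. It assumes that the orion library from the sample code is assigned; the OBS= values and output data set names are arbitrary illustrations. The sketch pins the sort utility to the SAS sort; if a host sort utility is available at your site, you would repeat the same steps with SORTPGM=HOST and compare the real times in the log.

```sas
/* Report detailed timings and force the SAS sort utility */
options fullstimer sortpgm=sas;

/* Smaller subset (OBS= value is an arbitrary illustration) */
proc sort data=orion.order_fact(obs=100000) out=work.bench_small;
   by Order_Date;
run;

/* Larger subset; OBS= reads all observations if fewer exist */
proc sort data=orion.order_fact(obs=1000000) out=work.bench_large;
   by Order_Date;
run;
```

The data set size at which the host sort starts to win is a candidate SORTCUTP= value for SORTPGM=BEST.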

Setting the Sort Indicator and the Validation Indicator

When SAS sorts a data set, it sets a sort indicator. When the sort indicator is YES and you try to re-sort the data by the same BY variables, SAS doesn't perform another sort. Even when PROC SORT creates a separate output data set, if the data is already sorted, the procedure only copies the data set.

At the bottom of PROC CONTENTS output, SAS prints sort information for the data set, including the BY variables used for the sort, a validation indicator for whether or not SAS validated the sort, and the collating sequence used to order the data. You can set the sort indicator and the validation indicator in several ways.

If the input data is already in sorted order, you can specify the order by using the SORTEDBY= data set option. This option applies to the output data set. The BY clause indicates the data order, and _NULL_ removes any existing sort information from the descriptor portion of the data set.

    (SORTEDBY=BY-clause | _NULL_)

The SORTEDBY= option sets the sort indicator on the data set to YES and asserts that the data is ordered by the variables in the BY clause (Order_Date in the sample code). However, because SAS hasn't yet validated the data order, it has to check the order while processing the data set.

You can use two methods for asking SAS to validate that a data set really is sorted, sort the data set only if necessary, and set the validation indicator to YES:

o The first is the SORTVALIDATE system option. This option causes the SORT procedure to validate that a data set is sorted correctly when a user-specified sort indicator is set. If the data set is sorted correctly, SAS sets the validation indicator to YES. If the data set is not sorted correctly, SAS sorts the data set and then sets the validation indicator to YES.

    OPTIONS SORTVALIDATE;

o The second way is using the PRESORTED option in the PROC SORT statement. The PRESORTED option is available beginning in SAS 9.2. The syntax is very simple:

    PROC SORT DATA=SAS-data-set PRESORTED;

This option tells PROC SORT to check the input data set to determine whether the observations are in order before sorting the data. By specifying the PRESORTED option, you can avoid the cost of sorting a data set that is already in order. The PRESORTED option is powerful. It validates the sequencing of the data, sorts the data if it is not sequenced properly, and sets both the sort indicator and the validation indicator to YES.
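The SORTVALIDATE method can be sketched as follows. The example assumes that the data set work.january from the sample code already exists with a user-specified SORTEDBY=Order_Date indicator. With SORTVALIDATE in effect, the PROC SORT step validates the claimed order and sorts only if needed; the PROC CONTENTS step lets you inspect the sort information afterward.

```sas
/* Ask PROC SORT to validate user-specified sort indicators */
options sortvalidate;

/* Assumes work.january was created with (sortedby=Order_Date) */
proc sort data=work.january;
   by Order_Date;
run;

/* The sort information appears at the bottom of the output */
proc contents data=work.january;
run;
```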

Controlling the Sort Order

When you sort data, you can control the sort order in two ways: by specifying a collating sequence, and by specifying whether or not the observations in a BY group remain in the same order in the output data set. Controlling the order of observations is also a potential way to improve sort performance.

The character set determines the sort order of characters. By default, PROC SORT uses the ASCII collating sequence in the Windows and UNIX operating environments, and the EBCDIC collating sequence in the z/OS operating environment. To change the collating sequence, you can specify one collating sequence option in the PROC SORT statement.

In addition, by default PROC SORT maintains the order of the observations within a BY group in the output data set. You can also use the EQUALS | NOEQUALS option in the PROC SORT statement to specify the order of the observations within a BY group in the output data set. EQUALS preserves, in the output data set, the order that the observations within each BY group had in the input data set. EQUALS is the default, but it's more expensive in terms of CPU time, memory, and I/O. NOEQUALS does not guarantee the original order of observations within BY groups. However, NOEQUALS can save CPU time, memory, and I/O. Both EQUALS and NOEQUALS guarantee the order of the data that you specify in the BY statement.

To detect and remove observations with duplicate BY values, you can use the NODUPKEY option in the PROC SORT statement. To specify the output data set where SAS writes the duplicate observations, you can use the DUPOUT= option. The DUPOUT= option is new in SAS 9.

    PROC SORT DATA=SAS-data-set NODUPKEY DUPOUT=SAS-data-set;

To replace the default ASCII or EBCDIC collating sequence, you can specify one collating sequence option in the PROC SORT statement. You can specify REVERSE to reverse the default sequence. Or you can specify DANISH, FINNISH, NORWEGIAN, POLISH, or SWEDISH. You can also specify NATIONAL for a customized sequence. Finally, you can use the SORTSEQ= option to specify a collating sequence, a translation table such as POLISH or SPANISH, an encoding, or the keyword LINGUISTIC.

    PROC SORT DATA=SAS-data-set <collating-sequence-option>;

In SAS 9.2, you can use SORTSEQ=LINGUISTIC to specify linguistic collation, which sorts characters according to the rules of a specified language. In turn, the setting of the SAS system option LOCALE determines the language. Within SORTSEQ=LINGUISTIC, the NUMERIC_COLLATION=ON collating rule orders integer values within the text by their numeric values instead of by the characters used to represent the numbers. You can also specify other collating rules for the LINGUISTIC option, including CASE_FIRST= and STRENGTH=. For more information about these collating rules, see Collating Rules in the appendix Details.
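The sample code later in this summary shows EQUALS and SORTSEQ=LINGUISTIC. As a complement, here is a minimal sketch of the NOEQUALS and REVERSE options discussed above. It assumes that the orion library is assigned; the output data set names are hypothetical.

```sas
/* NOEQUALS: the data is still ordered by Country, but the original  */
/* order of observations within each Country group is not guaranteed */
proc sort data=orion.customer out=work.customer_noequals noequals;
   by Country;
run;

/* REVERSE: reverse the default collating sequence for character values */
proc sort data=orion.customer out=work.customer_reverse reverse;
   by Country;
run;
```

Comparing the first few observations of work.customer_noequals with the EQUALS output from the sample code is one way to see whether the within-group order changed on your system.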


Sample Code


Using the THREADS | NOTHREADS Option

    options nothreads;
    proc sort data=orion.order_fact threads;
       by Order_Date;
    run;

Using the CPUCOUNT= Option

    options cpucount=5;

Using the SORTSIZE= Option

    proc sort data=orion.order_fact sortsize=300M;
       by Order_Date;
    run;

Using the SORTPGM=, SORTCUTP=, and SORTNAME= Options

    options sortpgm=best sortcutp=40M sortname="syncsort";

Using the SORTEDBY= Option

    filename M1 'mon1.dat'; * change the filepath as needed;
    data january(sortedby=Order_Date);
       infile M1 dlm=',';
       input Customer_ID Order_ID Order_Type
             Order_Date : date9. Delivery_Date : date9.;
    run;
    proc contents data=january;
    run;

Using the PRESORTED Option

    proc sort data=orion.salesstaff presorted;
       by Emp_Hire_Date;
    run;

Using the EQUALS | NOEQUALS Option

    proc sort data=orion.customer out=customer_equals equals;
       by Country;
    run;
    proc print data=customer_equals(obs=10);
       var Customer_ID Country;
       title 'With EQUALS Option';
    run;

Using the NODUPKEY and DUPOUT= Options

    proc sort data=orion.salesstaff nodupkey out=oneemp dupout=extra;
       by Employee_ID;
    run;

Using the SORTSEQ= Option with the NUMERIC_COLLATION=ON Collating Rule

    proc sort data=orion.customer out=customer
              sortseq=linguistic(numeric_collation=on);
       by Customer_Address;
    run;
