Sie sind auf Seite 1von 5

UnderstandingtheLENGTHstatementintheSASdatastep

Background:SASdatastructures(upthruversion9.*)havealwaysemployedtwopossibledatatypes,numeric
andcharacter.TherearehintsfromSAStechsupportthatrichdatatypeswillbepresentwheneverV10comes
along,whichmaymakeseveralofthefollowingpointsmoot.Butuntiltheyactuallyofferrichdatatypessimilarto
whatispossibleinsideaSQLtablestructure,itsimportanttounderstandhowtheproperuseoftheLENGTH
statementcanimpactthestoragecharacteristicsofyourunderlyingSASdatafiles.Thesecommentsapplytoboth
temporary(WORK.******)filesANDpermanentSASdatabases.

Defaults:Inthoseinstanceswhereafieldiscreatedforthefirsttime,theSAScomponentwhichbuildstheData
SetVector(orDSV)willlookoverthestatementsdefinedwithintheoveralldatastepanddeterminewhenthe
fieldistobenumericorcharacter.Ifnocluesarepresentforagivenfield,itwilldefaultthetypetonumeric.You
ofcoursehavecompletecontroloverthesesettingssoitsalwaysadvisedthatyoupredefinethelayoutwith
eitheraLENGTHorATTRIBstatement(wewillfocusonLENGTHwithinthistalk).Letsstartwithasimpleexample
toseehowthesystemdeterminesthesetupaswellasthedefaultvalues.

DATAEXAMPLE1;A=1;B=X;RUN;PROCCONTENTSDATA=EXAMPLE1;RUN;

NoLENGTHorATTRIBstatementyetexistssothesystemlooksatthefirstinstanceofeachfieldandsetsthetypes
andlengthsbasedonthereviewofthefirstinstanceofeachdefinedfield.VariableAisbeingassignedanumber
soSASmakesthisanumericfieldandifnototherwiseinstructed,allnumericvariableswillreceivealengthof8
bytes.VariableBisgoingtostoreacharacterfieldsincethefirstoccurrenceofthefieldisassignedavalueinside
matchingquotes,whichthesyntaxreferencedesignatingacharacterstringwithinthedatastep.Andbecausethat
firstrecordonlyhasalengthofonecolumn,thelengthwillonlybe1.PartialresultsfromthePROCCONTENTS
confirmsthis.

Alphabetic List of Variables and Attributes

# Variable Type Len

1 a Num 8
2 b Char 1

AttributeInheritance:Assubsequentdatafilesarereferencedinascript,thepreviouslydefined
characteristicsofthefieldaresimplypulledintothenextdatasetbeingcreated.TheSET,MERGEorUPDATE
statementswillcausetheDSVforthatnewstructuretodefinetheattributesofthepreviouslyexistingfields.Most
folksareprobablyusedtomodifyingcertainattributesasascriptprogresses,suchastheassignedformatorlabel
definition.Fortunatelyinourcase,itsalsopossibletoredefinethelengthofagivenfieldifsodesiredandwewill
seeanexampleofhowtoaccomplishthislaterinthisdocument.Beforedoingthat,itsbesttounderstandthe
optimalsettingsforthecharacterandnumericvariableswithinanyofyourSASdatafiles.

CharacterFieldRecommendations:Itsbesttosetthelengthofanycharacterfieldtothelongestpossible
stringyouanticipatetostorewithinthatgivenvariable.Thisiseasilyaccomplishedbyaddingthefieldnametoa
LENGTHstatement,followingthefieldwitha$symbolandoveralllengthyouwishtorequest.LENGTHB$1;
wouldbethesyntaxfromthepreviousexample.

Thisiswhereitsimportanttoknowyourdatasincethenumberyouspecifyafterthe$shouldalwaysbethe
largestnumberofcolumnsyourcharacterfieldwillconsume.Listinganythinglargerthanthatlongeststringis
simplywastingresourcessincetheDSVwillcarveoutextrabitsthatwillonlycontainpaddedblanks.Realizethat
theseblanksstillconsumespace.Likewise,settingthecharacterlengthtoanythinglessthanyourmaximum
stringlengthwillresultinpossibletruncationofdatavalues.Therangeofpossiblecharacterfieldlengthsonthe
Windowsplatformisbetween1and32768.

NumericFieldRecommendations:EarlierwithinthePROCCONTENTS,wesawthedefaultlengthfor
numericfieldsas8bytespervariable.whichhappenstobethelargestpossiblenumericlength.SAStakesthis
conservativeapproachintheeventthevariablecontainsanydecimalinformation.Itsimportantyouunderstand
yourdatabecauseifafieldisanoninteger,youdefinitelywanttoleavethenumericlengthat8tomaintainthe
highestprecisionpossible.Technically,youdontneedtomentionthesenumericfieldsonaLENGTHstatement
sincethesoftwarewilldefaultthefieldtolength8ifnototherwiseinstructed(asperourearlierexample).

Butforthosenumericfieldsyouknowwillalwayscontainintegervalues,youllbemakingyourprogramsmore
efficientandhelpingtheStorageAreaNetwork(SAN)bycuttingdownthesizeofyourtemporaryandpermanent
datafiles.OntheWindowsplatform,SASallowsthenumericlengthtovarybetween3and8bytes.To
understandthesedifferences,thesecondandfourthcolumnsbelowshouldbeconsultedtodeterminewhich
lengthtouseforyourfieldwhilemaintainingtheprecisionwithinyourgivencolumn.

Significant Digits and Largest Integer by Length for SAS Variables under Windows
Largest Integer Significant Digits
Length in Bytes Represented Exactly Exponential Notation Retained

3 8,192 213 3

4 2,097,152 221 6

5 536,870,912 229 8

6 137,438,953,472 237 11

7 35,184,372,088,832 245 13

8 9,007,199,254,740,992 253 15

Letsconsiderfourscenariosinthemajorityofourwork.

(1) Thecreationofbinaryorsimpleindicatorvalues(0,1,2,.).Therangeofthesevaluesseldomgetsbeyond
3digitsandmostcertainlyneveruptothe8192upperrangevalueforafieldwithadefinedlengthof3.
SoanysimpleindicatorcolumnshouldalwayscontainaLENGTHvariable1variable2..variable_n3;
statementwithintherealmofthedatastepcreatingthisfield.Notehoweachvariablelistedwillpickup
thelengthcharacteristiclistedaftertheendofthelastfield.
(2) The***SSNfield(s).Becauseanumericlengthof5onlymaintainsprecisionto8positions,itwillnot
sufficientlycontainthetruevaluesforanySSNthatisbeyondthe536870912value.Thesoftwarewould
begintoroundthesehighervaluesandbecausetheprecisionislost,yourdatalosesitsintegrity.Thus,a
lengthof6ontheWindowsplatformisoptimalforstoringtheSSNssinceitisactuallygoodthru11
positionsandweareonlyworriedaboutprecisionthruthefirst9positions.
(3) SASdatefields.BecausetheSASenginestorestheinternaldateasthenumberofindividualdays
betweenJanuary1,1960andthedatefieldwithinyourdatafile,prettymucheverydatevariablecanbe
safelystoredwithaLENGTHof4.ConsiderthatMarch2010isinthe18,300rangeofdays.When
comparedagainsttheabovetable,itsobviousaday2+milliondaysintothefutureisbeyondanything
youneedtoworryabout.
(4) SASdate/timefields.TheSASenginealsousesmidnightJanuary1,1960astheoriginfordetermining
date/timevaluesbutinsteadcountsthenumberofsecondstillthedate/timestoredwithinyourdatafile.
AquicklookatthevalueforaMarch2010date/timeshowsthefigureisalreadyupto10positions.Thus,
youreprobablysafeusingaLENGTHof6butcouldpossiblyneedittobe7or8dependinguponhowfar
inthefuture(orpast)yourprojectionhappenedtobe.Asbefore,itsimportanttoknowyourdatawhen
makingthisdefinition.

ImpactonStorage: Nowthatweveseenwhattherecommendationsare,letstakealookataworking
exampleofhowtoimplementthesechangesagainstapreexistingdatafile.Thenewmasterpatientdemographic
filewasoriginallycreatedwithall6fieldsbeingdefinedasnumeric,eachwiththedefaultlengthof8.Hereisa
portionoftheCONTENTSsnapshot.

Alphabetic List of Variables and Attributes

# Variable Type Len Format Informat Label

6 AGE2010MAR11 Num 8 COMPUTED AGE ON 2010MAR11


3 dob Num 8
2 dob_assign_ Num 8 1: source priority, 2: concensus,
level 3: priority w inconsistency
1 scrssn Num 8 SSN11. BEST22. SCRAMBLED SOCIAL SECURITY NUMBER
5 sex Num 8
4 sex_assign_ Num 8 1: source priority, 2: concensus,
level 3: priority w inconsistency

Evenwithoutknowingmuchaboutthisdata,onecaneasilyassumethattheAGErelatedfieldwillbea3
digitvalueattheverymost.Sincethelengthof3willhousevaluesupto8192,thisistheobviouschoice
toassigntothisfield.
TheDOBfieldisshortforDateOfBirthandthusaSASdatecolumn.OurearlierdiscussiononSASdates
meansweshouldusealengthof4onthisfield.
ThelabelsontheDOB_ASSIGN_LEVELandSEX_ASSIGN_LEVELprovideacluethateachfieldconsistsofat
most3possiblesmallintegervalues.Again,aSASlengthof3istheobviouschoice.
TheSCRSSNfieldisthescrambledSSNofthepatient.OurearlierdiscussiononSSNsmeansweshoulduse
alengthof6here.
Finally,theSEXfieldisnotimmediatelyobvious.Butsinceweknowthereareveryfewpossiblechoices
for the patients gender, chances are high this is going to a simple (small) integer list. A quick adhoc
frequencydistributionconfirmsasmuch,soalengthof3isagaininorderhere.

Togetthesenewlydefinedlengthsintoadatastructure,wesimplyhavetodefinethenewlengthsonaLENGTH
statementpriortothementionoftheSETstatementwithinourexamplebuild,asperthefollowing.IftheSET
statementweretoprecedetheLENGTHstatement,theoriginalfieldcharacteristicswouldbedefinedintotheDSV
structureandtheLENGTHstatementwouldbasicallyhavenoimpact.Becausewelistitfirst,theDSVwillmove
sequentiallyfromtoptobottomofthedatastepasafirststepanddefineeachfieldscharacteristicperitsfirst
mention.ThenastheDSVstartstopopulatewithdatarecords(pertheexecutionoftheSETstatement),thedata
willsimplypopulateintothenewstructureandbeoutputtothefilewehavedesignatedontheDATAstatement.

libname test 'f:\sas_stuff\data\consistent_variables';


data test.example;
length scrssn 6 dob 4 dob_assign_level sex_assign_level sex age2010mar11 3;
set test.fy10dob_sex2010mar11;
run;

NotethatIveresettheorderofthefieldsslightlybutpositionisntofmajorimportancewiththeSASenvironment
sincethefieldreferencesaredoneexplicitlybyvariablename.Intheeventyoutrulywanttomaintaintheorder,
youshouldexplicitlylisteachfieldasitwaspreviouslydefinedintheoriginalfilesetup.

AfterrunningtheabovestepstocreatemysmallerTEST.EXAMPLEfile,Ialsoranacomparisonoftheoriginalfile
againstthenewstructurewiththefollowingcode:

proc compare base=test.fy10dob_sex2010mar11 compare=test.example ; run;

Thefollowingresultsofthatprocedureshowthatnoneofthe5.3millionrecordshaveunequalvalues,meaning
thattheprecisionofthedatawasmaintainedbythesmallernumericlengths.

The COMPARE Procedure


Comparison of TEST.FY10DOB_SEX2010MAR11 with TEST.EXAMPLE
(Method=EXACT)

Data Set Summary

Dataset Created Modified NVar NObs

TEST.FY10DOB_SEX2010MAR11 11MAR10:09:21:53 11MAR10:09:21:53 6 5385921


TEST.EXAMPLE 12MAR10:10:55:53 12MAR10:10:55:53 6 5385921

Variables Summary

Number of Variables in Common: 6.


Number of Variables with Differing Attributes: 6.

Listing of Common Variables with Differing Attributes

Variable Dataset Type Length Format Informat

scrssn TEST.FY10DOB_SEX2010MAR11 Num 8 SSN11. BEST22.


TEST.EXAMPLE Num 6 SSN11. BEST22.
dob_assign_level TEST.FY10DOB_SEX2010MAR11 Num 8
TEST.EXAMPLE Num 3
dob TEST.FY10DOB_SEX2010MAR11 Num 8
TEST.EXAMPLE Num 4
sex_assign_level TEST.FY10DOB_SEX2010MAR11 Num 8
TEST.EXAMPLE Num 3
sex TEST.FY10DOB_SEX2010MAR11 Num 8
TEST.EXAMPLE Num 3
AGE2010MAR11 TEST.FY10DOB_SEX2010MAR11 Num 8
TEST.EXAMPLE Num 3
Observation Summary

Observation Base Compare

First Obs 1 1
Last Obs 5385921 5385921

Number of Observations in Common: 5385921.


Total Number of Observations Read from TEST.FY10DOB_SEX2010MAR11: 5385921.
Total Number of Observations Read from TEST.EXAMPLE: 5385921.

Number of Observations with Some Compared Variables Unequal: 0.


Number of Observations with All Compared Variables Equal: 5385921.

NOTE: No unequal values were found. All values compared are exactly equal.

WhenlookingatthesourcefolderwithWindowsExplorer,itseasytoseethatwewentfroma256Mbfiledownto
a128Mbfile..asavingsof50%!!

Conclusions: This example focused on two permanent data files stored in the same folder so the impact
couldeasilybedisplayedtotheaudience.Thesameconceptapplieswhetheryourdatafileexistsinthetemporary
WORKenvironmentoroneofyourpermanentdatafoldersthough.WORKfilesthatemploytheseapproachesand
arebuiltefficientlywillallowtheSASserverstorunquicker.Sincemostofyourfinalpermanentoutputfilesare
builtoffcombinationsofyourtemporarytables,makingtheseoptimaldefinitionsatthestartwillhaveapositive
effectastheyfilterintoallyoursubsequentsets.

Extra:Ifyouwanttoretrofitpreviouslycreateddatainaprogrammaticfashion,gototheSASsupportsiteand
searchforthe%SQUEEZEmacro.Downloadthefileandplaceitintoyourmacroautocallfolderonyoursystem.
(e.g.C:\ProgramFiles\SAS\SASV9.*\Core\SASmacro\isthelikelylocation)

The internal documentation on this macro is good. The main thing to be aware of is you cant overwrite a file
directly.Thatis,thelocationspecifiedinargument1ofthemacrocallspecificationMUSTbedifferentthanthe
locationinargument2.Alltheothersettingsareoptionaldependingonyourparticularneeds.

Das könnte Ihnen auch gefallen