Beruflich Dokumente
Kultur Dokumente
Background:SASdatastructures(upthruversion9.*)havealwaysemployedtwopossibledatatypes,numeric
andcharacter.TherearehintsfromSAStechsupportthatrichdatatypeswillbepresentwheneverV10comes
along,whichmaymakeseveralofthefollowingpointsmoot.Butuntiltheyactuallyofferrichdatatypessimilarto
whatispossibleinsideaSQLtablestructure,itsimportanttounderstandhowtheproperuseoftheLENGTH
statementcanimpactthestoragecharacteristicsofyourunderlyingSASdatafiles.Thesecommentsapplytoboth
temporary(WORK.******)filesANDpermanentSASdatabases.
Defaults:Inthoseinstanceswhereafieldiscreatedforthefirsttime,theSAScomponentwhichbuildstheData
SetVector(orDSV)willlookoverthestatementsdefinedwithintheoveralldatastepanddeterminewhenthe
fieldistobenumericorcharacter.Ifnocluesarepresentforagivenfield,itwilldefaultthetypetonumeric.You
ofcoursehavecompletecontroloverthesesettingssoitsalwaysadvisedthatyoupredefinethelayoutwith
eitheraLENGTHorATTRIBstatement(wewillfocusonLENGTHwithinthistalk).Letsstartwithasimpleexample
toseehowthesystemdeterminesthesetupaswellasthedefaultvalues.
DATAEXAMPLE1;A=1;B=X;RUN;PROCCONTENTSDATA=EXAMPLE1;RUN;
NoLENGTHorATTRIBstatementyetexistssothesystemlooksatthefirstinstanceofeachfieldandsetsthetypes
andlengthsbasedonthereviewofthefirstinstanceofeachdefinedfield.VariableAisbeingassignedanumber
soSASmakesthisanumericfieldandifnototherwiseinstructed,allnumericvariableswillreceivealengthof8
bytes.VariableBisgoingtostoreacharacterfieldsincethefirstoccurrenceofthefieldisassignedavalueinside
matchingquotes,whichthesyntaxreferencedesignatingacharacterstringwithinthedatastep.Andbecausethat
firstrecordonlyhasalengthofonecolumn,thelengthwillonlybe1.PartialresultsfromthePROCCONTENTS
confirmsthis.
1 a Num 8
2 b Char 1
AttributeInheritance:Assubsequentdatafilesarereferencedinascript,thepreviouslydefined
characteristicsofthefieldaresimplypulledintothenextdatasetbeingcreated.TheSET,MERGEorUPDATE
statementswillcausetheDSVforthatnewstructuretodefinetheattributesofthepreviouslyexistingfields.Most
folksareprobablyusedtomodifyingcertainattributesasascriptprogresses,suchastheassignedformatorlabel
definition.Fortunatelyinourcase,itsalsopossibletoredefinethelengthofagivenfieldifsodesiredandwewill
seeanexampleofhowtoaccomplishthislaterinthisdocument.Beforedoingthat,itsbesttounderstandthe
optimalsettingsforthecharacterandnumericvariableswithinanyofyourSASdatafiles.
CharacterFieldRecommendations:Itsbesttosetthelengthofanycharacterfieldtothelongestpossible
stringyouanticipatetostorewithinthatgivenvariable.Thisiseasilyaccomplishedbyaddingthefieldnametoa
LENGTHstatement,followingthefieldwitha$symbolandoveralllengthyouwishtorequest.LENGTHB$1;
wouldbethesyntaxfromthepreviousexample.
Thisiswhereitsimportanttoknowyourdatasincethenumberyouspecifyafterthe$shouldalwaysbethe
largestnumberofcolumnsyourcharacterfieldwillconsume.Listinganythinglargerthanthatlongeststringis
simplywastingresourcessincetheDSVwillcarveoutextrabitsthatwillonlycontainpaddedblanks.Realizethat
theseblanksstillconsumespace.Likewise,settingthecharacterlengthtoanythinglessthanyourmaximum
stringlengthwillresultinpossibletruncationofdatavalues.Therangeofpossiblecharacterfieldlengthsonthe
Windowsplatformisbetween1and32768.
NumericFieldRecommendations:EarlierwithinthePROCCONTENTS,wesawthedefaultlengthfor
numericfieldsas8bytespervariable.whichhappenstobethelargestpossiblenumericlength.SAStakesthis
conservativeapproachintheeventthevariablecontainsanydecimalinformation.Itsimportantyouunderstand
yourdatabecauseifafieldisanoninteger,youdefinitelywanttoleavethenumericlengthat8tomaintainthe
highestprecisionpossible.Technically,youdontneedtomentionthesenumericfieldsonaLENGTHstatement
sincethesoftwarewilldefaultthefieldtolength8ifnototherwiseinstructed(asperourearlierexample).
Butforthosenumericfieldsyouknowwillalwayscontainintegervalues,youllbemakingyourprogramsmore
efficientandhelpingtheStorageAreaNetwork(SAN)bycuttingdownthesizeofyourtemporaryandpermanent
datafiles.OntheWindowsplatform,SASallowsthenumericlengthtovarybetween3and8bytes.To
understandthesedifferences,thesecondandfourthcolumnsbelowshouldbeconsultedtodeterminewhich
lengthtouseforyourfieldwhilemaintainingtheprecisionwithinyourgivencolumn.
Significant Digits and Largest Integer by Length for SAS Variables under Windows
Largest Integer Significant Digits
Length in Bytes Represented Exactly Exponential Notation Retained
3 8,192 213 3
4 2,097,152 221 6
5 536,870,912 229 8
6 137,438,953,472 237 11
7 35,184,372,088,832 245 13
8 9,007,199,254,740,992 253 15
Letsconsiderfourscenariosinthemajorityofourwork.
(1) Thecreationofbinaryorsimpleindicatorvalues(0,1,2,.).Therangeofthesevaluesseldomgetsbeyond
3digitsandmostcertainlyneveruptothe8192upperrangevalueforafieldwithadefinedlengthof3.
SoanysimpleindicatorcolumnshouldalwayscontainaLENGTHvariable1variable2..variable_n3;
statementwithintherealmofthedatastepcreatingthisfield.Notehoweachvariablelistedwillpickup
thelengthcharacteristiclistedaftertheendofthelastfield.
(2) The***SSNfield(s).Becauseanumericlengthof5onlymaintainsprecisionto8positions,itwillnot
sufficientlycontainthetruevaluesforanySSNthatisbeyondthe536870912value.Thesoftwarewould
begintoroundthesehighervaluesandbecausetheprecisionislost,yourdatalosesitsintegrity.Thus,a
lengthof6ontheWindowsplatformisoptimalforstoringtheSSNssinceitisactuallygoodthru11
positionsandweareonlyworriedaboutprecisionthruthefirst9positions.
(3) SASdatefields.BecausetheSASenginestorestheinternaldateasthenumberofindividualdays
betweenJanuary1,1960andthedatefieldwithinyourdatafile,prettymucheverydatevariablecanbe
safelystoredwithaLENGTHof4.ConsiderthatMarch2010isinthe18,300rangeofdays.When
comparedagainsttheabovetable,itsobviousaday2+milliondaysintothefutureisbeyondanything
youneedtoworryabout.
(4) SASdate/timefields.TheSASenginealsousesmidnightJanuary1,1960astheoriginfordetermining
date/timevaluesbutinsteadcountsthenumberofsecondstillthedate/timestoredwithinyourdatafile.
AquicklookatthevalueforaMarch2010date/timeshowsthefigureisalreadyupto10positions.Thus,
youreprobablysafeusingaLENGTHof6butcouldpossiblyneedittobe7or8dependinguponhowfar
inthefuture(orpast)yourprojectionhappenedtobe.Asbefore,itsimportanttoknowyourdatawhen
makingthisdefinition.
ImpactonStorage: Nowthatweveseenwhattherecommendationsare,letstakealookataworking
exampleofhowtoimplementthesechangesagainstapreexistingdatafile.Thenewmasterpatientdemographic
filewasoriginallycreatedwithall6fieldsbeingdefinedasnumeric,eachwiththedefaultlengthof8.Hereisa
portionoftheCONTENTSsnapshot.
Evenwithoutknowingmuchaboutthisdata,onecaneasilyassumethattheAGErelatedfieldwillbea3
digitvalueattheverymost.Sincethelengthof3willhousevaluesupto8192,thisistheobviouschoice
toassigntothisfield.
TheDOBfieldisshortforDateOfBirthandthusaSASdatecolumn.OurearlierdiscussiononSASdates
meansweshouldusealengthof4onthisfield.
ThelabelsontheDOB_ASSIGN_LEVELandSEX_ASSIGN_LEVELprovideacluethateachfieldconsistsofat
most3possiblesmallintegervalues.Again,aSASlengthof3istheobviouschoice.
TheSCRSSNfieldisthescrambledSSNofthepatient.OurearlierdiscussiononSSNsmeansweshoulduse
alengthof6here.
Finally,theSEXfieldisnotimmediatelyobvious.Butsinceweknowthereareveryfewpossiblechoices
for the patients gender, chances are high this is going to a simple (small) integer list. A quick adhoc
frequencydistributionconfirmsasmuch,soalengthof3isagaininorderhere.
Togetthesenewlydefinedlengthsintoadatastructure,wesimplyhavetodefinethenewlengthsonaLENGTH
statementpriortothementionoftheSETstatementwithinourexamplebuild,asperthefollowing.IftheSET
statementweretoprecedetheLENGTHstatement,theoriginalfieldcharacteristicswouldbedefinedintotheDSV
structureandtheLENGTHstatementwouldbasicallyhavenoimpact.Becausewelistitfirst,theDSVwillmove
sequentiallyfromtoptobottomofthedatastepasafirststepanddefineeachfieldscharacteristicperitsfirst
mention.ThenastheDSVstartstopopulatewithdatarecords(pertheexecutionoftheSETstatement),thedata
willsimplypopulateintothenewstructureandbeoutputtothefilewehavedesignatedontheDATAstatement.
NotethatIveresettheorderofthefieldsslightlybutpositionisntofmajorimportancewiththeSASenvironment
sincethefieldreferencesaredoneexplicitlybyvariablename.Intheeventyoutrulywanttomaintaintheorder,
youshouldexplicitlylisteachfieldasitwaspreviouslydefinedintheoriginalfilesetup.
AfterrunningtheabovestepstocreatemysmallerTEST.EXAMPLEfile,Ialsoranacomparisonoftheoriginalfile
againstthenewstructurewiththefollowingcode:
Thefollowingresultsofthatprocedureshowthatnoneofthe5.3millionrecordshaveunequalvalues,meaning
thattheprecisionofthedatawasmaintainedbythesmallernumericlengths.
Variables Summary
First Obs 1 1
Last Obs 5385921 5385921
NOTE: No unequal values were found. All values compared are exactly equal.
WhenlookingatthesourcefolderwithWindowsExplorer,itseasytoseethatwewentfroma256Mbfiledownto
a128Mbfile..asavingsof50%!!
Conclusions: This example focused on two permanent data files stored in the same folder so the impact
couldeasilybedisplayedtotheaudience.Thesameconceptapplieswhetheryourdatafileexistsinthetemporary
WORKenvironmentoroneofyourpermanentdatafoldersthough.WORKfilesthatemploytheseapproachesand
arebuiltefficientlywillallowtheSASserverstorunquicker.Sincemostofyourfinalpermanentoutputfilesare
builtoffcombinationsofyourtemporarytables,makingtheseoptimaldefinitionsatthestartwillhaveapositive
effectastheyfilterintoallyoursubsequentsets.
Extra:Ifyouwanttoretrofitpreviouslycreateddatainaprogrammaticfashion,gototheSASsupportsiteand
searchforthe%SQUEEZEmacro.Downloadthefileandplaceitintoyourmacroautocallfolderonyoursystem.
(e.g.C:\ProgramFiles\SAS\SASV9.*\Core\SASmacro\isthelikelylocation)
The internal documentation on this macro is good. The main thing to be aware of is you cant overwrite a file
directly.Thatis,thelocationspecifiedinargument1ofthemacrocallspecificationMUSTbedifferentthanthe
locationinargument2.Alltheothersettingsareoptionaldependingonyourparticularneeds.