Beruflich Dokumente
Kultur Dokumente
BobMuenchen
IthankthemanyRdevelopersforprovidingsuchwonderfultoolsforfreeandalltherhelp
participantswhohavekindlyansweredsomanyquestions.I'mespeciallygratefultothepeople
whoprovidedadvice,caughttyposandsuggestedimprovementsincluding:PatrickBurns,Peter
Flom,MartinGregory,CharilaosSkiadasandMichaelWexler.
SASisaregisteredtrademarkofSASInstitute.
SPSSisatrademarkofSPSSInc.
MATLABisatrademarkofTheMathworks,Inc.
Copyright2006,2007,RobertA.Muenchen.Alicenseisgrantedforpersonalstudyand
classroomuse.Redistributioninanyotherformisprohibited.
1
Introduction.....................................................................................................................................4
TheFiveMainPartsofSASandSPSS...............................................................................................4
Typographic&ProgrammingConventions.....................................................................................5
HelpandDocumentation................................................................................................................6
GraphicalUserInterfaces................................................................................................................7
EasingIntoR....................................................................................................................................7
AFewRBasics.................................................................................................................................7
InstallingAddonPackages..............................................................................................................9
DataAcquisition............................................................................................................................10
ExampleTextFiles.....................................................................................................................10
TheRDataEditor......................................................................................................................10
ReadingDelimitedTextFiles.....................................................................................................11
ReadingTextDatawithinaProgram(Datalines,Cards,BeginData)....................................13
ReadingFixedWidthTextFiles,1RecordperCase..................................................................14
ReadingFixedWidthTextFiles,2RecordsperCase.................................................................15
ImportingDatafromSAS..........................................................................................................17
ImportingDatafromSPSS.........................................................................................................18
ExportingDatatoSAS&SPSSDataSets...................................................................................18
SelectingVariablesandObservations...........................................................................................19
SelectingVariablesVar,Variables=........................................................................................19
SelectingObservationsWhere,If,SelectIf............................................................................26
SelectingBothVariablesandObservations..............................................................................32
ConvertingDataStructures.......................................................................................................32
DataConversionFunctions.......................................................................................................33
2
DataManagement.........................................................................................................................33
TransformingVariables.............................................................................................................33
ConditionalTransformations....................................................................................................36
LogicalOperators..................................................................................................................36
ConditionalTransformationstoAssignMissingValues............................................................38
MultipleConditionalTransformations......................................................................................41
RenamingVariables(andObservations)................................................................................42
RecodingVariables....................................................................................................................45
KeepingandDroppingVariables...............................................................................................48
ByorSplitFileProcessing..........................................................................................................48
Stacking/Concatenating/AddingDataSets...........................................................................50
Joining/MergingDataFrames.................................................................................................50
AggregatingorSummarizingData............................................................................................52
ReshapingVariablestoObservationsandBack........................................................................55
SortingDataFrames..................................................................................................................57
ValueLabelsorFormats(&MeasurementLevel).........................................................................58
VariableLabels..............................................................................................................................63
WorkspaceManagement..............................................................................................................65
WorkspaceManagementFunctions.........................................................................................66
Graphics.........................................................................................................................................67
Analysis..........................................................................................................................................71
Summary........................................................................................................................................78
IsRHardertoUse?........................................................................................................................79
Conclusion.....................................................................................................................................80
INTRODUCTION
ThegoalofthisdocumentistoprovideanintroductiontoRthatthatistailoredtopeoplewho
alreadyknoweitherSASorSPSS.Foreachof27fundamentaltopics,wewillcompareprograms
writteninSAS,SPSSandtheRlanguage.
Sinceitsreleasein1996,Rhasdramaticallychangedthelandscapeofresearchsoftware.There
areveryfewthingsthatSASorSPSSwilldothatRcannot,whileRcandoawiderangeofthings
thattheotherscannot.GiventhatRisfreeandtheothersquiteexpensive,Risdefinitelyworth
investigating.
Ittakesmoststatisticspackagesatleastfiveyearstoaddamajornewanalyticmethod.
StatisticianswhodevelopnewmethodsoftenworkinR,soRusersoftengettousethem
immediately.Therearenowover800addonpackagesavailableforR.
RalsohasfullmatrixcapabilitiesthatarequitesimilartoMATLAB,anditevenoffersaMATLAB
emulationpackage.ForacomparisonofRandMATLAB,see
http://wiki.rproject.org/rwiki/doku.php?id=gettingstarted:translations:octave2r.
SASandSPSSaresosimilartoeachotherthatmovingfromonetotheotherisfairly
straightforward.Rhoweveristotallydifferent,makingthetransitionconfusingatfirst.Ihopeto
easethatconfusionbyfocusingonthesimilaritiesanddifferencesinthisdocument.Itmaythen
beeasiertofollowamorecomprehensiveintroductiontoR.
Iintroducetopicsinacarefullychosenordersoitisbesttoreadthisfrombeginningtoendthe
firsttimethrough,evenifyouthinkyoudon'tneedtoknowaparticulartopic.Lateryoucanskip
directlytothesectionyouneed.
THEFIVEMAINPARTSOFSASANDSPSS
WhileSASandSPSSoffermanyhundredsoffunctionsandprocedures,thesefallintofivemain
categories:
1. Datainputandmanagementstatementsthathelpyouread,transformand
organizeyourdata.
2. Statisticalandgraphicalprocedurestohelpyouanalyzedata.
3. An output management system to help you extract output from statistical
procedures for processing in other procedures, or to let you customize
printedoutput.SAScallsthistheOutputDeliverySystem(ODS),SPSScallsit
theOutputManagementSystem(OMS).
4. Amacrolanguagetohelpyouusesetsoftheabovecommandsrepeatedly.
5. Amatrixlanguagetoaddnewalgorithms(SAS/IMLandSPSSMatrix).
4
SASandSPSShandleeachwithdifferentsystemsthatfollowdifferentrules.Forsimplicitys
sake,introductorytraininginSASorSPSStypicallyfocusontopics1and2.Perhapsthemajority
ofusersneverlearnthemoreadvancedtopics.However,Rperformsthesefivefunctionsina
waythatcompletelyintegratesthemall.Sowhilewellfocusontopics1and2withwhen
discussingSASandSPSS,welldiscusssomeofallfiveregardingR.OtherintroductoryguidesinR
coverthesetopicsinamuchmorebalancedmanner.Whenyoufinishwiththisdocument,you
willwanttoreadoneofthese;seethesectionHelpandDocumentationfor
recommendations.
TheintegrationofthesefiveareasgivesRasignificantadvantageinpower.Thisadvantageis
demonstratedbythefactthatmostRproceduresarewrittenusingtheRlanguage.SASand
SPSSproceduresarenotwrittenusingtheirlanguages.Rsproceduresarealsoavailableforyou
toseeandmodifyinanywayyoulike.
WhileonlyasmallpercentofSASandSPSSuserstakeadvantageoftheiroutputmanagement
systems,virtuallyallRusersdo.ThatisbecauseR'sisdramaticallyeasiertouse.Forexample,
youcancreateandstorearegressionmodelwithmyModel<-lm(y~x).Youcangetseveral
diagnosticplotswithplot(myModel) orcomparetwomodelswith
anova(myModel1,mymodel2).Thatisaveryflexibleapproach!
ThepriceRpaysforthisoutputmanagementadvantageisthattheoutputtomostproceduresis
sparseanddoesnotappearinRaspublicationquality.Variablelabelsarenotapartofthecore
system.TheyhavebeenaddedonintheHmiscpackageatestamenttoRsamazingflexibility
buttheyarenotusedbymostprocedures.Youcanusefunctionsfromaddonpackagestowrite
outHTMLorTEXfilesthatyoucanthenimportintowordprocessingtools.SPSS,andmore
recentlySAS,makeoutputthatispublicationqualitybydefault,buthardertouseasinputto
furtheranalysis.
Onthetopicofmatrixlanguages,SASandSPSSoffertheminaformthatdifferssharplyfrom
theirmainlanguages.Forexample,thewayyouselectvariablesinthemainSASproductbears
norelationtohowyouselecttheminSAS/IML.InR,thematrixcapabilitiesarecompletely
integrated.
TYPOGRAPHIC&PROGRAMMINGCONVENTIONS
Intheexamplesbelow,IassumeyouknoweitherSASorSPSSandsoincludelittleexplanationin
thoseprograms.TheRprogramsareembellishedwithcommentsandvariations.
AlthoughRhasmanywaystogeneratepracticedataandhasavarietyofexampledatasets,all
theexamplesbelowarebaseduponatextfilethatthefirstexamplesreadandthenstore.Once
youhaveruntheimportstepforanyorallpackages,thentheexampleswillworkbecausethey
readthedatasavedatthatstep.TheexamplesusefilepathsappropriateforMicrosoft
Windows,butshouldbereadilyadaptabletoanyothersystem.
AllprogrammingcodeandRfunctionnamesarewrittenin: this courier font.
Namesofotherdocumentsandmenusarewrittenin: thisitalicfont.
Whenlearninganewlanguageitcanbehardtotellthecommandsfromthenames.Tohelp
differentiate,ICAPITALIZEcommandsinSASandSPSSanduselowercasefornames.HoweverR
iscasesensitivesoIhavetousetheexactcasethattheprogramrequires.Sotohelp
differentiate,Iusethecommonprefix"my"innameslikemydataormysubset.WhileIpreferto
useRnameslikemy.subset,theperiodhasspecialmeaninginSASandsoIavoiditinthe
examples.
HELPANDDOCUMENTATION
Thecommandhelp.start()orchoosingHTMLHelpfromtheHelpmenuwillyieldatable
ofcontentsthatpointstohelpfiles,manuals,frequentlyaskedquestionsandthelike.Toget
helpforacertainfunctionsuchassummary,usehelp(summary)orprefixthetopicwitha
questionmark:?summary.Togethelponanoperator,encloseitinquotesasinhelp("<-")
fortheassignmentoperator.Ifyoudontknowthenameofacommandoroperator,use
help.search("your search string")tosearchthebuiltinhelpfiles.
Thehelpfilescanbesomewhatintimidatingatfirst,sincemanyofthemassumeyoualready
knowalotaboutR.AneasierwaytolearnmoreaboutRisbyreadingRforBeginnersby
Paradisorsomeoftheotherwonderfulbooksavailableforfreeathttp://cran.rproject.org/
underdocumentation.AnothergoodchoiceisAnIntroductiontoSandtheHmiscandDesign
LibrariesbyAlzolaandHarrellat
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.Themostwidelyrecommended
statisticsbookonRisModernAppliedStatisticsinS,byVenablesandRipley.NotethatRis
almostidenticaltotheSlanguageandthesebooksonSusuallypointoutwhatthefew
differencesare.
IhighlyrecommendsigningupfortheRhelplistservathttp://www.rproject.org/undermailing
lists.Thereyoucanlearnalotbyreadinganswerstothemyriadofquestionspeoplepost.Also,
ifyoupostyourownquestionsonthelist,you'relikelytogetananswerinanhourortwo.But
pleasereadthepostingguide,http://www.Rproject.org/postingguide.html, beforesending
yourfirstquestion.
SearchingthewebforinformationonRcanbefrustrating,sincethestringRrarelyrefersto
theRLanguage.However,youcansearchjusttheRsiteusingRSiteSearch("your
search string"), or go to http://www.rproject.org/andclicksearch.Ifyouusethe
6
Firefoxwebbrowser,thereisaplugincalledRsitesearchavailableat
http://addictedtor.free.fr/rsitesearch/.
GRAPHICALUSERINTERFACES
ThemainRinstallationdoesnotincludeapointandclickgraphicaluserinterface(GUI)for
runninganalyses,butyoucanlearnaboutseveralatthemainRwebsite,http://www.r
project.org/underRelatedProjectsandthenRGUIs.MyfavoriteoneisRcommander,which
lookssimilartotheSPSSGUI.Itprovidesmenusformanyanalyticandgraphicalmethodsand
showsyoutheRcommandsthatitenters,makingiteasytolearnthecommandsasyouuseit.
YoucanlearnmoreaboutRCommanderfromhttp://socserv.mcmaster.ca/jfox/Misc/Rcmdr/.
Ifyoudodatamining,youmaybeinterestedintheRATTLEuserinterfacefrom
http://rattle.togaware.com/.ItisapointandclickinterfacethatwritesandexecutesRprograms
foryou.
EASINGINTOR
Asanystudentofhumanbehaviorcantellyou,fewthingsguaranteesuccesslikeimmediate
reinforcement.SoagreatwaytoeaseyourwayintoRistocontinuetouseSAS,SPSSoryour
favoritespreadsheetprogramtoenterandmanageyourdata,thenusethecommandsbelowto
importitandgodirectlytographsandanalyses.Asyoufinderrorsinyourdata(andyouknow
youwill)youcangobacktoyourothersoftware,correctthemandthenimportitagain.Itsnot
anidealwaytoworkbutitdoesgetyouintoRquickly.
AFEWRBASICS
Beforereadinganyoftheexampleprogramsbelow,youllneedtoknowafewthingsaboutR.
WhatwillbecomeimmediatelyapparentishowcompletelydifferentRis.Whatwillnotbe
obviousiswhythesedifferencesgiveitsuchanadvantage.
SASandSPSSbothuseonemaindatastructure,thedataset.Instead,Rhasmanydifferentdata
structures.Theonethatismostlikeadatasetiscalledadataframe.SASandSPSSdatasets
arealwaysviewedasarectanglewithvariablesinthecolumnsandrecordsintherows.SAS
callstheserecordsobservationsandSPSScallsthemcases.Rdocumentationusesvariables
andcolumnsinterchangeably.Itusuallyreferstoobservationsorcasesasrows.
RdataframeshaveaformalplaceforanIDvariableitcallsrowlabels.SASandSPSSusers
typicallyhaveanIDvariablecontaininganobservation/casenumberorperhapsasubjects
name.Butthisvariableislikeanyotherunlessyourunaprocedurethatidentifiesobservations.
YoucanuseRthiswaytoo,butproceduresthatidentifyobservationsmaydosoautomaticallyif
yousetyourIDvariabletobeofficialrowlabels.Alsowhenyoudothat,thevariablesoriginal
name(id,subject,ssn)vanishes.Theinformationisusedautomaticallywhenitisneeded.
7
AnotherdatastructureRusesfrequentlyisthevector.Avectorisasingledimensionalcollection
ofnumbers(numericvector)orcharactervalues(charactervector)likevariablenames.
VariablenamesinRcanbeanylengthconsistingofletters,numbersortheperiod"."andshould
beginwithaletter.Notethatunderscoresarenotallowedsomy_dataisnotavalidnamebut
my.datais.However,ifyoualwaysputquotesaroundavariable(object)name,itcanbeany
nonemptystring.UnlikeSAS,theperiodhasnomeaninginthenameofadataset.However
giventhatmyreaderswilloftenbeSASusers,Iavoidtheuseoftheperiod.Casematterssoyou
canhavetwovariables,onenamedmyvarandanothernamedMyVarinthesamedataframe,
althoughthatisnotagoodidea!Someaddonpackages,tweaknameslikethecapitalized
Savetorepresentacompatible,butenhanced,versionofabuiltinfunctionlikethelower
casedsave.
RhasseveraloperatorsthataredifferentfromSASorSPSS.Theassignmentoperatorisnotthe
equalsignyoureusedto,butisthetwosymbols,"<-"typedrightnexttoeachother.Youcan
useittoputdataintoobjectsasin:
mynames<-c("workshop","gender","q1","q2","q3","q4").Thecombine
function,c,putsthenamestogetherintoasingleobjectcalledacharactervector.Ralsouses
thedoubleequalsigns"=="asalogicalcomparisonoperatorasin,gender=="f".The#
operatorisusedtobeginacomment,whichthencontinuesuntiltheendoftheline.Semicolons
canbeusedtoentermultiplecommandsonasinglelineasinSAS,buttheyarenotusuallyused
toendacommand.
Commandscanbeginandendanywhereonalineandadditionalspacesareignored.Youcan
continueacommandonanewlinesolongasthefragmentyouleavebehindisnotalreadya
completecommanditself.Goingtoanewlineafteracommaisusuallyasafebet,andasyou
willsee,Rcommandsarefilledwithcommasmakingthatconvenient.
LetslookatoneofthesimplestRfunctions,mean.Sayinghelp(mean)willtellsusthatthe
formofthefunctionismean(x,trim=0,na.rm=FALSE)wherexisdataframe,numeric
vectororadate.Welldiscussthismoreinthesectiononselectingvariables.Trimisthe
argumentthattellsRwhatpercentoftheextremevaluestoexcludebeforecalculatingthe
mean.Thezeroindicatesthedefaultvalue,i.e.noneofyourdatawillbetrimmedunlessyou
changethat.
Thena.rmargumentappearsinmanyRfunctions.RusesNAtorepresentNotAvailable,or
missingvalues.ThedefaultvalueoffalsetellsusthatRwillnotremovethemunlessyoutellit
otherwise.SotheresultwillbeNAifanymissingvaluesarepresent!ManyfunctionsinRareset
thiswaybydefault.ThatisquitetheoppositefromSASandSPSS,whichalmostalwaysassume
youwanttouseallthedatayouhaveunlessyoutellitotherwise.
Wecanrunthisbynamingeachargument:
mean(x=mydata, trim=.25,na.rm=TRUE).Itwillwarnusthatthesecondvariable,
gender,isnotnumericbutgoaheadandcomputetheresult.Ifwelisteveryargumentinorder,
weneednotnamethemall.However,mostpeopleskipnamingthefirstargumentandthen
nametheothersandincludethemonlyiftheywishtochangetheirdefaultvalues.Forexample,
mean(mydata,na.rm=TRUE).
UnlikeSASorSPSStheoutputinRdoesnotappearnicelyformattedandreadytopublish.
HoweveryoucanusethefunctionsintheprettyRandHmiscpackagestomaketheresultsof
tabularoutputmorereadyforpublication.
Toruntheexamplesbelow,downloadRfromoneofthe"mirrors"athttp://cran.rproject.org/
andinstallit.Startitandenter(orcut&paste)theexamplesintotheconsolewindowatthe>
prompt.OryoucanuseFile>NewScripttoentertheexamplesintoandselectsometextand
rightclickittosubmitorrunthestatements.IfyouarereadingthePDFversionofthis
document,youmaynotbeablecutandpaste(dependsuponyourPDFtools).AnHTMLversion
thatmakescut/pasteeasyisalsoavailableathttp://oit.utk.edu/scc/RforSASandSPSSusers.html.
INSTALLINGADDONPACKAGES
ThisisaveryimportanttopicinR.InSASandSPSSinstallations,youusuallyhaveeverythingyou
havepaidforinstalledatonce.Rismuchmoremodular.ThemaininstallationwillinstallRanda
popularsetofaddonscalledlibraries.Hundredsofotherlibrariesareavailabletoinstall
separatelyfromtheComprehensiveRArchiveNetwork,(CRAN).Foralistofthemwith
descriptions,seehttp://cran.rproject.org/underPackages,butdontdownloadthemthere.
Rautomatesthedownloadandinstallationprocess.Onceyouhavechosenapackage,choose
InstallPackagesfromthePackagesmenu.ItwillaskyouwhichCRANmirrorsiteyouwantto
use.Choosethenearestone.Itwillthenshowyouthemanypackagesavailable.Choosetheone
youwantanditwilldownloadandinstallitforyou.
Onceitisinstalled,itisonthecomputersharddrive.Touseit,youmustloaditbychoosing
LoadPackagefromthePackagesmenu.Itwillshowyouthenamesofallpackagesthatare
installedbutnotyetloaded.Youcanalsoloadapackagewiththecommand
library(packagename).
Ifthepackagecontainsexampledatasets,youcanloadthemwiththedatacommand.Enter
data()toseewhatisavailableandthendata(mydata)toloadonenamed,forexample,
mydata.
DATAACQUISITION
Thissectiongivesabriefoverviewofdataimportandexport,especiallytoandfromSASand
SPSS.Foracomprehensivediscussionofdataacquisition,seetheRDataImport/Exportmanual.
Intheexampleprogramswewilluse,afterimportingdataintoRwewillsaveitwiththe
commandsave.image(file=c:\\mydata.Rdata).Rusesthebackslashto
representthingslikenewlines"\n"soweusetwoinarowinfilenames.Oncesaved,the
followingprogramsloaditbackintomemorywiththecommand
load(file=c:\\mydata.Rdata).
Formoredetails,seethesectiononWorkspaceManagement.
EXAMPLETEXTFILES
Wellusethefilesbelowandreadthemseveraldifferentways.Notethattheforwardslash"/"
hasaspecialmeaninginR,soyouneedtorefertothefilesaseither"c:\\mydata"or
"c:/mydata".Allourexampleswillusethe"\\"formasitismorenoticeable.
Ifyoucreatethesetwofilesonyourharddrive,thenalloftheexamplesofreadingdatawill
work.TheywillalsosaveSAS,SPSSandRdatasetsthatalltheotherexampleswilluse.Thatway
youcanrunthemallbycuttingandpastingtheprogramsintoanyofthesethreepackages.You
cancreatethesetwofilesbyusinganytexteditorsuchasNotepad.Simplycutandpastethe
dataintoyoureditorandsavethefilesonyourCdrivewiththefilenamesbelow.
c:\mydata.csv
id,workshop,gender,q1,q2,q3,q4
1,1,f,1,1,5,1
2,2,f,2,1,4,1
3,1,f,2,2,4,3
4,2,f,3,1, ,3
5,1,m,4,5,2,4
6,2,m,5,4,5,5
7,1,m,5,3,4,4
8,2,m,4,5,5,5
c:\mydata.txt
(same, less names & commas)
11f1151
22f2141
31f2243
42f31 3
51m4524
62m5455
71m5344
82m4555
THERDATAEDITOR
Rhasasimplespreadsheetstyledataeditor.Youaccessitbycreatinganemptydataframeand
theneditingit:
10
READINGDELIMITEDTEXTFILES
Delimitedtextfilesusesomedelimiter,spaces,tabsorcommastoseparateeachvalue.The
defaultdelimiterinRistabsorplainspaces.Notethattouseablankspaceasamissingvalue,
youmustuseeithertabsorcommasasdelimiters.
Variablenamesareusuallyinthefirstrowofadelimitedfile.Youtypicallythinkofhavinga
variablenameforeverycolumn.However,ifyouarelackingjustonevariablename,Rassumes
youhaveanIDvariableinthefirstcolumn.Itwillthenautomaticallyassignitastherownames.
SincemostpackagesexportdatabywritingoutthenamesallvariablesincludinganID(ifone
exists),theexamplesbelowdonotusethatfeature.
Wewillreadacommaseparatedvalue(CSV)file,c:\mydata.csvshownabove.Itiseasiestifyou
usetheapproachthatwritesvariablenamesinthetoprow,oneforeachvariableincludingan
IDvariableifyouhaveone.Youcanreadthisfileusingthefollowingcommands.Notethatthe
printfunctioninRprintsthedataframeasinprint(mydata).Sincethisisdoneso
frequently,itisthedefaultfunctionandsoenteringsimplymydatainaninteractiveRsession
willalsoprintit.
Youcanskipcolumnsbyspecifyingthe"class"ofeachcolumntoskipasNULL.However,the
colClassesargumenttodothisrequiresyouspecifytheclassesofallcolumns,includingany
initialIDorrownamesvariable.Theclassesincludelogical,integer,numeric,complex,
character,raw),or"factor","Date"or"POSIXct.
SAS
11
SPSS
UsingtheexampleCSVfileabove,youcanmakeiteveneasiertoread.Ifyoudeletethename
IDinthefirstrow,youwillhaveonefewernamesthanyouhavecolumnsofdata.WhenR
seesthissituation,itusesthefirstcolumnasrownames.Ididnotusethatformatbecausemost
programswillwritethenameoftheIDvariablewhenitexportsdata.Youcanthenreadsucha
filewiththesimplecommand:
mydata <- read.table("c:\\mydata.dat")
IfyourfilehasvariablenamesinthefirstrowbutdoesnothaveanIDvariablethatcanactas
rowlabels,thenusetheform:
mydata <- read.table("c:\\mydata.dat", header=TRUE)
12
READINGTEXTDATAWITHINAPROGRAM
(DATALINES,CARDS,BEGINDATA)
Nowthatwehaveseenhowtoreadatextfileinthesectionabove,wecanmoreeasily
understandhowtoreaddatathatisembeddedwithinaprogram.Rworksbyputtingdatainto
objectsandthenprocessingthoseobjectswithfunctions.Inthiscase,we'llputthedataintoa
charactervector,named"mystring".Mystringwillhaveonlyonereallylongvalue.Thenwewill
readitjustaswedidinthepreviousexample,butwithtextConnection(mystring)
replacingc:\mydata.csv inthe read.table function.
SAS
SPSS
13
1,1,f,1,1,5,1
2,2,f,2,1,4,1
3,1,f,2,2,4,3
4,2,f,3,1, ,3
5,1,m,4,5,2,4
6,2,m,5,4,5,5
7,1,m,5,3,4,4
8,2,m,4,5,5,5")
# This reads it just as a text file but processing it
# first through the textConnection function.
mydata<-read.table(textConnection(mystring),
header=TRUE,sep=",",row.names="id")
print(mydata)
save.image(file="c:\\mydata.Rdata") #Writes the file to disk.
READINGFIXEDWIDTHTEXTFILES,1RECORDPERCASE
Ifyouhaveanondelimitedtextfilewithonerecordpercase,youcanreaditusingthefollowing
approach.RhasnowhereneartheflexibilityinreadingfixedwidthtextfilesthatSASandSPSS
have.OtherOpenSourcelanguagessuchasPerlorPythonareextremelygoodatreadingtext
filesandconvertingthemtoaformthatRcaneasilyread.
Thisexampleaddsquiteabitofcomplexity.Sincefilepathsareseldomassimpleas
"c:\mydata.txt"Iamstoringitinacharactervectornamedmyfile.Thisapproachalsomakesit
easytodealwithlongpathnamesandalsoletsyouputallthefilereferencesyouuseatthetop
ofthefilesoyoucanchangethemeasily.
Sincefixedwidthtextfilesseldomcontainaheaderlineofvariablenames,Istorethenamesin
anothercharactervector,myVariableNames,andfeedthattothecol.namesoptioninthe
read.fwffunction.
Intheexampleafterthisone,wewillnotneedtheworkshopvariable,sowewillskipitherejust
toseehowcolumnsareskipped.Bygivingitanegativewidthof1,we'retellingRtoskipone
column.Theread.fwffunctionisforFixedWidthFormatfilesanditreadsthedata.
SAS
SPSS
14
q4 8.
READINGFIXEDWIDTHTEXTFILES,2RECORDSPERCASE
Thissectionbuildsupontheoneabove,sodon'tstartonthisuntilyouunderstanditfirst.This
topicrequiresanewconcept,adatastructurecalledalist.Alistisacollectionofobjects.Adata
frameisitselfatypeoflist.Whenreadingonerecordpercase,thewidthsofeachvariableare
storedinavectorlike:
myVariableWidths<-c(2,-1,1,1,1,1,1).
NowRneedsoneofthesevectorsforeachrecordperobservation.YoutellRaboutthesetwo
vectorsbycombiningthemintoalistusingtheform:
list( myRecord1Widths, myRecord2Widths ).
Wewillreadthemydata.txtfileagain;thistimeasifithadtworecordsforeachcaseand
linetwocontainsthevaluesforvariablesq5throughq8.Wedon'tneedthevariableworkshop
15
onthefirstline,nordoweneedtoreadid,workshoporgenderonthesecondline,sowe'llskip
thosebyusingnegativecolumnwidths.
Notethattheseprogramsdonotsavetheirfilestodiskaswewillnotusetheminfurther
examples.
SAS
SPSS
q4 8
q8 8;
7
7
q4 8
q8 8.
16
file=myfile,
width=myVariableWidths,
col.names=myVariableNames,
row.names="id",
na.strings="999",
fill=TRUE,
strip.white=TRUE)
print(mydata)
IMPORTINGDATAFROMSAS
RcanreadaSASdatasetinxportformatand,ifyouhaveSASinstalled,directlyfromaregular
SASdatasetwiththeextensionsas7bdat.Althoughtheforeignpackageisthemostwidely
documentedapproach,itlacksimportantcapabilities.FunctionsintheHmiscpackageaddthe
abilitytoreadformattedvalues,variablelabelsandlengths.
SASusersrarelyusethelengthstatement,acceptingthedefaultstoragemethodofdouble
precision.Thiswastesabitofdiskspacebutsavesprogrammertime.HoweversinceRsavesall
itsdatainmemory,spacelimitationsarefarmoreimportant.Ifyouusethelengthstatementin
SAStosavespace,thesasxport.getfunctionwilltakeadvantageofit.
Youwillneedtheforeignpackageforthisexample.ItcomeswithRbutmustbeloadedusing
thelibrary(foreign)function.YoualsoneedtheHmiscpackage,whichdoesnotcome
withRbutisveryeasytoinstall.Forinstructions,seethesection,InstallingAddOn
Packages.
TheexamplebelowassumesyouhaveaSASxportformatfile.Formuchmoreinformationon
readingSASfiles,seeAnIntroductiontoSandtheHmiscandDesignLibrariesat
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
* SAS Program to Create Export Format File.
SAS
Export * Something like this was done to create your
* export format file. It would benefit from
* labels, formats & length statements;
17
IMPORTINGDATAFROMSPSS
ImportingadatafilefromSPSSisdoneusingtheforeignpackage.ItcomeswithRsoyoudon't
havetoinstallit,butyoudohavetoloaditwiththelibrarycommand.Theread.spssfunctionis
supposedtoreadbothSPSSsavefilesandportablefilesusingexactlythesamecommands.
HoweverIhaveseenitworkonlyintermittentlyon.savfiles.Portableformatfilesseemtowork
everytime.
SPSS
Export
GET FILE='C:\mydata.sav'.
EXPORT OUTFILE='c:\mydata.por'.
RImport # R Program to Import an SPSS Data File.
# This loads the needed package.
library(Hmisc)
# This Reads the SPSS file.
mydata<-spss.get("c:\\mydata.por",use.value.labels=TRUE)
print(mydata)
# Save the data to a file.
save.image(file="c:\\mydata.Rdata")
EXPORTINGDATATOSAS&SPSSDATASETS
Nowletsuseourexampletextfileabovetoexportdata.Thefirstexampleusesthedefault
write.tablefunctiontocreateatextfilewithrowandcolumnnamesandvaluesseparatedby
tabs.Tabshaveanadvantageovercommasascommasareusedbysomecountriesasdecimal
points,andexportedtextstringsmayalsocontaincommas.Bydefaultitwritesoutthe"m"and
"f"forgender,includingthequotes.Enterhelp(write.table)foroutputoptions.
Thesecondtwoexamplesusethewrite.foreignfunctiontowriteoutacommadelimited
textfilealongwithaSASorSPSSprogramfiletoreadit.Theyrequirealibrarycalledforeign,
whichisloadedfirstwiththelibraryfunction.Notethatthesetwoexampleswriteoutthe
gendervaluesas1and2forfandmrespectively.SeethesectiononValueLabelsorFormats
(&MeasurementLevel)ifyouwanttochangevalues.NotethattheexporttoSPSScreatesa
syntaxfilethatyoumustruntocreatetheSPSSdatafile.
Rexportto
text
Rexportto
18
SAS
Rexportto
SPSS
write.foreign(mydata,"c:/mydata2.txt","c:/mydata.sas",
package="SAS")
# R Program to Write an SPSS Export File
# and a program to read it into SPSS.
library(foreign)
write.foreign(mydata,"c:/mydata2.txt","c:/mydata.sps",
package="SPSS")
SELECTINGVARIABLESANDOBSERVATIONS
InSASandSPSS,selectingvariablesforananalysisissimplewhileselectingobservationsis
muchmorecomplicated.InR,thesetwoprocessesarealmostidentical.Asaresult,variable
selectioninRisbothmoreflexibleandquiteabitmorecomplex.Howeversinceyouneedto
learnthatcomplexitytoselectobservations,itisnotmuchaddedeffort.
SelectingvariablesinSASorSPSSisquitesimple.Ourexampledatasetcontainsthevariables:
workshop, gender, q1, q2, q3, q4.SASletsyourefertothembyindividualname
orincontiguousorderseparatedbydoubledashesasinworkshop--q4. SASalsousesa
singledashtorequestvariablesthatshareanumericsuffix,q1-q4,regardlessoftheirorderin
thedataset.Selectinganyvariablebeginningwithaqisdonewithq:.SPSSallowsyoutolist
variablesnamesindividuallyorwithcontiguousvariablesseparatedbyto,asingender to
q4.
SelectingobservationsinSASorSPSSrequirestheuseoflogicalconditionswithcommandslike
IF,WHEREorSELECTIF.Youneverusethatlogictoselectvariables.IfyouhaveusedSASor
SPSSforlong,youprobablyknowdozensofwaystoselectobservations,butyoudidntsee
themallinthefirstintroductoryguideyouread.WithR,itisbesttodiveinandseethemall
becauseunderstandingthemisthekeytounderstandingotherdocumentation,especiallythe
helpfiles.
SELECTINGVARIABLESVAR,VARIABLES=
Eventhoughselectingvariablesandobservationsaredonethesameway,I'lldiscussthemin
twodifferentsections,withdifferentexampleprograms.Thissectionfocusesonlyonselecting
variables.
Ourexampledataframehasseveralimportantattributes:
Ithas6variablesorcolumns,whichareautomaticallygivenindexnumbersof
1,2,3,4,5,6.InRyoucanabbreviatethisas1:6.Thecolonoperatorisntjustshorthand
asinworkshop to q4.Entering1:6attheRconsolewillcauseittoactually
generatethesequence,1,2,3,4,5,6.
19
Ourdataframehastwodimensions,rowsandcolumns.Thesearereferredtousing
squarebracketsasmydata[rows,columns].Thissectionfocusesonthesecond
parameter,thecolumns(variables).
Ourdataframeisalsoalist,withonedimension.Youcanaddresstheelementsofthe
listusingtwosquarebracketsasinmydata[[3]]toselectourthirdvariable,q1.
Roffersmanywaystoselectvariables(columns)fromadataframetouseinananalysis.Ifyou
performananalysiswithoutselectinganyvariables,theRfunctionwilluseallthevariablesifit
can.ThatismuchlikeSASwhereyouspecifyadatasetbutnoVARstatement.Forexample,to
getsummarystatisticsonallvariables(andallobservationsorrows),usesummary(mydata).
Youcansubstituteanyoftheexamplesbelowtochooseasubsetofvariables.Forexample,
summary( mydata[ "q1"] )wouldgetasummaryforjustvariableq1usingthedata
frame,mydata.
Youcanselectvariablesbyindexnumberoravector(column)ofindexes.For
example,mydata[ ,3]selectsallrowsforthethirdvariableorcolumn,q1.Ifyou
leaveoutanindex,itwillassumeyouwantthemall.Ifyouleavethecommaout
completely,Rassumesyouwantacolumn,so mydata[3]isalmostthesameas
mydata[ ,3] bothrefertoourthirdvariable,q1.Somefunctionsrequireone
approachortheother.SeethesectiononConvertingDataStructuresfordetails.
Toselectmorethanonevariableusingindexes,youmustcombinethemintoanumeric
vectorusingthecfunction.Somydata[ c(3,4,5,6) ]selectsvariable3through
6.YouwillseethisapproachusedmanywaysinR.Youcombinemultipleobjectsintoa
singleoneinseveralwaystofeedintofunctionsthatrequireasingleobject.
Thecolonoperator:cangenerateanumericvectordirectly,somydata[3:6]
selectsthesamevariables.
Ifyouuseanegativesignonanindex,youwillexcludethosecolumns.Forexample,
mydata[-c(3,4,5,6), ]willexcludethosevariables.Thecolonoperatorcan
generatelongerstringsofnumbers,butit'stricky.Theform-3:6generatesthevalues
from3to+6or
3,2,1,0,1,2,3,4,5,6.TheisolatefunctionI()inRexiststoclarifysuchoccasional
confusion.Youuseitintheform,mydata[ ,-I(3:6) ]showingRthatyouwant
theminussigntoapplytothejustthesetofnumbersfrom+3through+6.
20
SelectionbyindexesisthemostfundamentalapproachinRbecauseallR'sdata
structuresalwayshavethem.Theydonothavetohavenames.
Youcanselectacolumnbynameinquotes,asinmydata["q1"]. Risstillexpecting
theform mydata[row,column],butwhenyousupplyonlyoneparameter,it
assumesitisthecolumn.So mydata[ ,"q1"]worksaswell.Ifyouhavemorethan
onename,youmustcombinethemintoasinglecharactervectorusingthecombineor
cfunction.Forexample,
mydata[ c("q1","q2","q3","q4") ].
Unfortunately,thecolonoperatordoesnotworkdirectlywithcharacterprefixes,but
youcanpastetheletter"q"ontothenumbersyougenerateusingthatoperator.This
codegeneratesthesamelistastheparagraphaboveandstoresitinacharactervector
calledmyqs.Youcanusethisapproachtogeneratevariablenamestouseinavarietyof
circumstances.Notethatmerelychangingthe4belowto400wouldgeneratethe
sequenceq1toq400.Thesep="" parametertellsRtoseparatetheletterqandthe
generatednumberswithnothing.
myqs <-paste( "q", 1:4, sep="")
summary( mydata[myqs] )
YoucanselectacolumnbyalogicalvectorofTRUE/FALSEvalues.Forexample,
mydata[c(FALSE,FALSE,TRUE,FALSE,FALSE,FALSE)]willselectthethird
column,q1becausethe3rdvalueisTRUEandthe3rdcolumnisq1.AsinSASorSPSS,
thedigits1and0canrepresentTRUEandFALSE.InR,theas.logicalfunctiontellsRto
dothat.sowecouldalsoselectthethirdvariablewith:
mydata[ as.logical( c(0,0,1,0,0,0) ) ]
Itisunlikelythatyouwouldevertypeinalogicalvectorlikethis.Instead,alogical
statementsuchasnames(mydata)=="q1"generatesthelogicalvectoryouneed.So
mydata[ colname(mydata)=="q1" ]isanotherwayofselectingq1.The!
signrepresentsNOTsoyoucanalsousethatvectortogetallthevariablesexceptforq1
usingeitherform:
mydata[ names(mydata)!="q1" ]
mydata[ !names(mydata)=="q1" ]
Youcouldalsousethecolonoperatorandthepastefunctiondemonstratedinthebullet
abovetogeneratelongpatternsofTRUE/FALSEvaluesbasedonlongsetsofnumerically
suffixednames.
Youcanselectacolumnusing$notation,whichhastheformmydata$myvar,e.g.
mydata$q1.ThisisreferredtoseveralwaysinR,including$prefixing,prefixingby
dataframe$or$notation.Whenyouusethismethodtoselectmultiplevariables,
21
youneedtocombinethemintoasingleobjectlikeadataframe,asin
summary( data.frame(mydata$q1,mydata$q2) ).Havingseenthe
combinefunction,yournaturalinclinationmightbetouseitformultiplevariablesas
in:summary( c(mydata$q1,mydata$q2) ).Thiswouldindeedmakeasingle
object,butcertainlynottheoneaSASorSPSSuserexpects.Itwouldstackthemboth
intoasinglevariablewithtwiceasmanyobservations!
Youcanselectavariablefromadataframebyitssimplecolumnname,e.g.justq1,but
onlyifyouattachthedataframefirst.UnlikeSASandSPSS,youcanhavemanyactive
datasetsopenandequallyaccessibleatonce.YoucanactuallycorrelateXfromonedata
framewithYstoredinanother!
Afteryousubmitthefunction,attach(mydata),youcanrefertojustq1andRwill
knowwhichoneyoumean.Thisworkswhenselectingexistingvariablesbutisbest
avoidedwhencreatingthem.Thisisbecauseanyvariablecanalsoexistallbyitselfin
Rsworkspace.Sowhenaddingnewvariablestoadataframe,youneedtouseanyof
theabovemethodsthatmakeitabsolutelyclearwhereyouwantthevariablestored.
Withthisapproachgettingsummarystatisticsonmultiplevariablesmightlooklike,
summary( data.frame(q1,q2) ).
Youcanselectvariableswiththesubsetfunction.Themainadvantagetothisisthatit
istheonlybuiltinapproachtoselectingcontiguoussetsofvariablessuchasq1q4(in
SAS)orq1toq4(inSPSS).Itfollowstheform,subset(mydata, select=q1:q4)
Forexample,whenusedwiththesummaryfunction,itwouldappearas
summary( subset(mydata,select=q1:q4) )
Notethattheadditionalspacesaddedaroundthesubsetfunctionhelpincrease
readability.Rignoresthem.
Youcanselectvariablesbyusingalistindexasinmydata[[3]]tochoosethethird
variable.Thisapproachisusuallyusedforunderothercircumstances.Withthis
approachyoucannotusethecolonoperator,so mydata[[3:6]]isinvalid.
Theexamplesbelowdemonstratemanywaystoselectvariables.Tomakeiteasiertoseethe
resultoftheselection,wewillusetheprintfunction.Whenworkinginteractively,thisisthe
defaultfunction,so mydata["q1"] and print( mydata["q1"] ) areequivalent.
Howevertogiveyouthefeelhowtheselectionworksinallfunctions,Iusethelongerform.
SAS
22
SPSS
23
24
25
SELECTINGOBSERVATIONSWHERE,IF,SELECTIF
ItbearsrepeatingthattheapproachesthatRusestoselectobservationsare,forthemostpart,
thesameasthosediscussedaboveforselectingvariables.Thissectionfocusesonlyonselecting
observations.
Ourexampledataframehasseveralimportantattributes:
Ourdataframehas8observationsorrows,whichareautomaticallygivenindex
numbers,orindexesof1,2,3,4,5,6,7,8.InRyoucanabbreviatethisas1:8.
Ithasrownames,"1","2","3","4","5","6","7","8".Theyarestoredwithin
ourdataframeinanobjectincalledtherownamesvector.Notethatthequotes
aroundthemshowthattheyarestoredascharacters,notasnumbers.Sotheycannot
beabbreviatedsimply1:8.Youcanabbreviatethewithas.character(1:8).Row
namescanalsobecharactervaluesthatareeasiertoidentify,suchas
"Ann","Bob","Carla".
Ourdataframehastwodimensions,rowsandcolumns.Thesearereferredtoas
mydata[rows,columns].Thissectionfocusesonthefirstparameter,therows.
Therearemanywaystoselectobservations(rows)fromadataframe.AsinSASorSPSS,ifyou
selectnorows,youlluseallthedata.Forexample,togetsummarystatisticsonallrowsforall
columns,usesummary( mydata ).Youcansubstituteanyoftheexamplesbelowtochoose
asubsetofobservations.Forexample,
summary( mydata[mydata$gender=="m", ] )
wouldgetasummaryforjustthemales,onallvariables.Thefactthatthecolumnpositionafter
thecommaisblanktellsRtouseallthevariables.
Youcanselectobservationsbyindex.Forexample,mydata[8 , ]selectsallthe
variablesforrow8.
Toselectmorethanonevariableusingindexes,youmustcombinethemintoanumeric
vectorusingthecombineorcfunction.Somydata[ c(5,6,7,8), ]selects
rows5through8,whichhappentobethemales.Thecolonoperator":"cangeneratea
numericvectordirectly,somydata[5:8, ]selectsthesameobservations.
Ifyouuseanegativesignonanindex,youwillexcludethoseobservations.Forexample
26
mydata[-c(1,2,3,4), ]willexcludethefirstfourrecords,thefemales.The
colonoperatorcanabbreviatethisaswell,butit'stricky.Theform-1:4generatesthe
valuesfrom1to+4or
1,0,1,2,3,4.TheisolatefunctioninRexiststoclarifysuchoccasionalconfusion.Youuse
itintheform,mydata[I(1:4),]showingRthatyouwanttheminussigntoapplytothe
justthesetofnumbers1,2,3,4.
YoucanselectobservationsbyalogicalvectorofTRUE/FALSEvalues.Forexample,
mydata[c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE),]will
selectthefirstfourrows,thefemales.The!signrepresentsNOTsoyoucanalsouse
thatvectortogetthemaleswith
mydata[!c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE),]
AsinSASorSPSS,thedigits1and0canrepresentTRUEandFALSE.InR,theas.logical
functiontellsRtodothat.sowecouldalsoselectthefirstfourrowswith:
mydata[ as.logical( c(1,1,1,1,0,0,0,0) ), ]
Notethatthelocationofbrackets,parenthesesandcommasstartstogetrathertedious
anderrorproneatthispoint.AcontextsensitiveeditorsuchasTINNRorESSwouldbe
abighelpinavoidingerrors.
Alogicalstatementsuchasrownames(mydata)=="8"generatesalogicalvector
liketheoneintheparagraphabovebutwithasingleTRUEentry.
Somydata[ rowname(mydata)=="8", ]isanotherwayofselectingthe8th
observation.
The!signrepresentsNOTsoyoucanalsoexcludeonlythe8thobservationusing
eitherform:
mydata[ rownames(mydata)!="8", ]
mydata[ !rownames(mydata)=="8", ]
27
ThisisoneplaceIfindtheformmydata$varnameparticularlyappealing.Ifwewantto
selectthefemalesinourdataframe,mydata["gender"]=="f"willcreatethe
logicalvectorweneed.Wecanapplyitintheform
mydata[ mydata["gender"]=="f", ]butIfindthestyleof
mydata[ mydata$gender=="f", ]muchlessbusy.Ofcoursetheeasiesttoread
is mydata[ gender=="f", ]butthatdoesrequireattachingthefile.
Youcanselectobservationsusingthesubsetfunction.Yousimplylistyourlogical
conditionunderthesubsetargumentasin:
subset(mydata,subset=gender=="f")
Notethatwhenselectingvariables,thereisthe$prefixform,mydata$genderandtheattached
formofjustgender.Whenselectingobservations,thesetwohavenoequivalents.
SAS
SPSS
28
TEMPORARY.
SELECT IF(gender = "f").
SAVE OUTFILE='C:\females.sav'.
EXECUTE .
# R Program to Select Observations.
load(file="c:\\mydata.Rdata")
attach(mydata)
print(mydata)
#---SELECTING OBSERVATIONS BY INDEX
# Print all rows.
print( mydata[1:8, ] )
# Just the males:
print( mydata[5:8, ] )
# Negative numbers exclude rows.
# So this excludes the females in rows 1 through 4.
# The isolate function is used to apply the minus
# to 1,2,3,4 and prevent -1,0,1,2,3,4.
print( mydata[-I(1:4), ] )
# The which function can find the index numbers
# of the true condition.
which(gender=="m")
# You can use those index numbers like this.
print( mydata[which(gender=="m"), ] )
# You can make the logic as complex as you like:
print( mydata[ which(gender=="m" & q4==5), ] )
# You can save the indices to a numeric vector stored OUTSIDE the
# original data frame. Otherwise how would you store the 5,6,7,8
# values in a data frame that has 8 rows?
happyGuys <- which(gender=="m" & q4==5)
print(happyGuys)
# Now you can use the happyGuys vector to select observations.
# Note that even though happyGuys is a variable name, it is
# not used in quotes.
print( mydata[happyGuys, ] )
#---SELECTING OBSERVATIONS BY LOGICAL VECTOR
29
30
"Bob","Scott","Mike","Rich")
print(mynames)
# Store those new names in mydata.
row.names(mydata) <- mynames
print(mydata)
# This again selects by row name,
# but it's much more clear.
print( mydata[ c("Ann","Cary","Sue","Carla"), ] )
# If we have a set of row names we use often we can save it
# to a character vector. Note that this vector's length is
# not equal to the number of rows in the data frame, so it
# cannot be stored there.
myGals <- c("Ann","Cary","Sue","Carla")
print(myGals)
# And then use it to select rows. Note that it is
# not enclosed in quotes when used this way.
print( mydata[ myGals, ] )
# Let's pretend group has lots of values and we want to select
# only some groups by listing their values.
myGals <- row.names(mydata)=="Ann" |
row.names(mydata)=="Cary" |
row.names(mydata)=="Sue" |
row.names(mydata)=="Carla"
print(myGals)
#The %in% function can shorten this up.
print( row.names(mydata) %in% c("Ann","Cary","Sue","Carla") )
# Just like our example for working with variable names
# we can use "which" to get index values
firstMale <- which( row.names(mydata)=="Bob" )
print(firstMale)
lastMale <- which( row.names(mydata)=="Rich" )
print(lastMale)
# This last one selects both rows and columns at once.
# All the rules that work for rows also work for columns.
mydata[firstMale:lastMale, c("q1","q2","q3","q4")]
#---CREATING A NEW DATA FRAME OF SELECTED OBSERVATIONS
# Any of these methods can put the selected
# observations in a new data frame.
31
SELECTINGBOTHVARIABLESANDOBSERVATIONS
Whilethesectionsabovehavefocusedonselectingvariablesandobservationsseparately,you
cancombinemethodsinmostcases.Forexample,printingvariablesgenderthroughq4forjust
thefemalescouldbedonewithanyoftheseapproaches:
print( mydata[1:4,2:6] )
attach(mydata)
print( mydata[gender=="f", c("gender","q1","q2","q3","q4")] )
detach(mydata)
print( subset(mydata,subset=gender=="f",select=gender:q4) )
CONVERTINGDATASTRUCTURES
InSASandSPSS,thereisonlyonedatastructure,thedataset.Withinthat,thereisonlyone
structure,thevariable.Itseemsabsurdlyobvious,butweneedtosayit:allSASandSPSS
proceduresacceptvariablesasinput.
Ontheotherhand,Rhasseveraldatastructures:dataframes,vectors,listsandmatrices.R
takesadvantageofthesebyhavingthesamefunction(procedure)dodifferentthingsdepending
onwhatyougiveit.Sotocontrolafunctionyouhavetobeawareofwhattypeofobjectyoure
givingit,thetypesofobjectsthefunctionisabletoaccept,whatitdoeswitheachandhowto
convertfromonedatastructuretoanother.
Wehaveseenseveralinstancesofthisinourexamples.Wesawthatthesummaryfunction
cannotacceptmultiplevariablesoftheformsummary(mydata$q1,mydata$q2)butitwill
acceptasingledataframethatcontainsonlythosevariablesasin
summary( data.frame(mydata$q1,mydataq2) ).However,somefunctionswill
indeedacceptthefirstform.
InthesectiononSelectingVariables,Isaidthatmydata[ ,3]andmydata[3]were
almostthesame.Similarly,Isaidmydata[q1]andmydata[ ,q1]werealmostthe
same.Formanyfunctionsthoseapproachesareinterchangeable.Butsomeproceduresare
pickierthanothersandrequireveryspecificdatastructures.Thesecommandspassthedatain
theformofadataframe mydata[3],mydata["q1"].
Thesepassthedataintheformofavector:mydata[ ,3],mydata[ ,"q1"].
32
Ialsosaidabovethatthesamemethodsareusedtoselectvariablesandobservations.An
exceptiontothatisthatselectingacolumnusingtheform mydata[ ,3]willpassthedataas
adataframewhileselectingarowusingthesametypeofnotation, mydata[3, ]passesthe
dataasavector!
Ifyouhavehavingaproblemfiguringoutwhichformofdatayouhave,therearefunctionsthat
willtellyou.Forexample,class(mydata)willtellyouitsclassisdataframeand
mode(mydata)willtellyoulist.Sofunctionsthatrequireeitherformwillworkwithit.
Therearealsoaseriesoffunctionsthattestthestatusofanobject,andtheyallbeginwithis.
Forexample, is.data.frame(mydata[3])willdisplayTRUEbut
is.vector((mydata[3])willdisplayFALSE.
Someofthefunctionsyoucanusetoconvertfromonestructuretoanotherarebelow.
DATACONVERSIONFUNCTIONS
Vectorstodataframe
data.frame(x,y,z)
Vectorstocolumnsofamatrix
cbind(x,y,z)
Vectorstorowsofamatrix
rbind(x,y,z)
Vectorscombinedintoonelongone
c(x,y,z)
Dataframetomatrix
as.matrix(mydataframe)
Matrixtodataframe
as.data.frame(mymatrix)
Avectortoamatrix
as.matrix(myvector)
Matrixtooneverylongvector
as.vector(mymatrix)
Listcontainingonevectortojustavector
unlist(mylist)
DATAMANAGEMENT
TRANSFORMINGVARIABLES
UnlikeSAS,Rhasnoseparationofphaseslikethedatastepandprocsteps.ItismorelikeSPSS
whereaslongasyouhavedatareadin,youcanmodifyit.Infact,youcanevenmodifyvariables
inthemiddleofproceduresasinthisexamplewherewetakethesquarerootofq4before
gettingsummarystatisticsonit:summary( sqrt(mydata$q4) ).
33
Rperformstransformationssuchasaddingorsubtractingvariablesonthewholevariableat
once,asdoSASandSPSS.Inotherwords,althoughRhasloops,theyarenotneededforthis
typeofmanipulation.Thebasictransformationsincludesqrtforsquareroot,logfornatural
logarithm,log10forthebase10logarithmandsoon.TheequivalenttotheMEANSfunctionsin
SASandSPSSiscalledrowmeans.
InthesectionSelectingVariables,wesawvariouswaystoselectvariables:byindex,by
columnname,bylogicalvector,usingthestyle mydata$myvar,byusingsimplythevariable
nameafteryouhaveattachedadataframeandusingthesubsetfunction.Usuallythebest
waytonameanewvariableisusingthe mydata$varname format.Asfortherightsideof
theequation,youcanusethatmethodtoo,butitislonger:
Anotherwaytousetheshorternameswithoutattachinganddetachingiswiththetransform
function.Itistheequivalenttoattachingadataframe,performingasmanytransformationsas
youlikeusingshortvariablenamesandthendetachingthedata.Youcanevenusetheshorter
nameforminthecreatedvariable(s).Strangelyenough,ituses=insteadof<.Itlookslike:
mydata <- transform(mydata, sum=q1+q2+q3+q4)
Ifyouhavemanytransformations,itseasiertoreadthemonseparatelines:
Mydata <- transform( mydata,
sum=q1+q2+q3+q4,
mean=sum/4)
Youcancreateanewvariableusingtheindexmethodtoo.Thiswouldcreateavariable7where
previouslywehadonly6.RwillalsogiveitacolumnnameofV7.Ifyou'reusingtheindex
approach,itisprobablyeasiertoinitializeanewvariablebybindinganewvariabletomydata.
Thecolumnbindfunction,cbind,letsyounamethenewvariable,initializeittozeroandthen
bindittomydata:
34
SAS
SPSS
35
CONDITIONALTRANSFORMATIONS
Conditionaltransformationsapplydifferentformulastovarioussubgroupsofthedata.For
example,theformulasforrecommendeddailyallowancesofvitaminsdifferformalesand
females.
BelowarethelogicaloperatorsforSAS,SPSSandRandhowafewcomparisonsdiffer.
LOGICALOPERATORS
Seealso help(Logic) and help(Syntax).
SAS
SPSS
Equals
=orEQ
=orEQ
==
Lessthan
<orLT
<orLT
<
Greaterthan
>orGT
>orGT
>
Lessorequal
<=orLE
<=orLE
<=
Greaterorequal
>=orGE
>=orGE
>=
Notequal
^=,<>orNE
~=orNE
!=
And
&orAND
&orAND
&
Or
|orOR
|orOR
0<=x<=1
0<=x<=1
(X>=0)AND
(x>=0)&(x<=1)
(X<=1)
Missingvaluesize
Missingislessthanall
numbers
Comparisonswith
missingaresettoNA
Comparisonswith
missingaresettoNA
Symboltorepresent
missingin
comparisons
"."
MISSINGorSYSMIS
is.na(x)
(Notethat==NAcan
neverbetrue.)
36
Theexamplesbelowdemonstrateavarietyofconditionaltransformations.
SAS
SPSS
37
This example also assigns the different formulas for males & females.
However, it checks for the male gender so it will not assume missing
genders are male. The nesting involved is quite a bit more complex
than the example above.
CONDITIONALTRANSFORMATIONSTOASSIGNMISSINGVALUES
RrepresentsmissingvalueswithNA,forNotAvailable.ThelettersNAarealsoanobjectinR
thatyoucanusetoassignmissingvalues.UnlikeSAS,thevalueusedtostoretheNAisnotthe
smallestnumberyourcomputercanstore,sologicalcomparisonssuchasx<0willresultinNA
whenxismissing.
38
Whenimportingdata,blanksarereadasmissing(whenblanksarenotusedasdelimiters)asis
thestringNA.Butifyouhaveothervalues,youwillofcoursehavetotellRwhichvaluesare
missing.Theread.tablefunctionprovidesanargument,na.strings,thatallowsyouto
setmissingvalues.However,itappliesthevaluestoallvariables,whichisunlikelytobeofuse.
Forexamplea2columnvariablesuchasyearsofeducationmayhave99representmissing,but
thevariableagemayhave99asavalidvalue.
PeriodsthatrepresentmissingvaluesinSAScauseRtoreadthewholevariableasacharacter
vector.Soyouhavetofirstfixthemissingvaluesandthenconvertittonumericusingthe
as.numeric()function.
NotethatsinceanylogicalcomparisononNAsresultsinanNAoutcome,evenq1==NAwillnot
beTRUEwhenq1isindeedNA.Soifyouwantedtosubstituteanothervaluesuchasthemean,
youwouldneedtousetheis.nafunction.ItwillbeTRUEwhenavalueisNA:
mydata[is.na(mydata$q1),"q1"] <- mean(mydata$q1,rm.na=TRUE)
SAS
SPSS
*This does the same thing but is quicker for lots of vars;
array q q1-q4;
do over q;
if q=9 then q=.;
end;
* SPSS Program to Assign Missing Values.
GET FILE=("c:\mydata.sav")
MISSING q1 TO q4 (9).
SAVE OUTFILE=("c:\mydata.sav")
# R Program to Assign Missing Values.
# NA is the missing value code in R.
load(file="c:\\mydata.Rdata")mydata
attach(mydata)
# This finds the rows where q1==9 and chooses the variable
# q1 by its column number, 3. There are no nines in the data,
# but you can put a few there in q1 manually in the data editor
# using fix(mydata).
fix(mydata)
39
mydata[q1==9,3] <- NA
print(mydata)
#This uses q1==9 to select the rows
# and chooses the variable using its name, q1.
mydata[q1==9,"q1"] <- NA
print(mydata)
#This uses the ifelse function on the whole of q1 at once.
mydata$q1 <- ifelse( q1 == 9, NA, mydata$q1)
print(mydata)
# This replaces every value 9 of EVERY variable
# in mydata, be careful how you use it!
mydata[mydata==9] <- NA
print(mydata)
#--------------------------------------------# Below are more complex examples that let
# you convert many variables at once.
#--------------------------------------------#Create a function that replaces 9s with NAs.
my9isNA <- function(x) {x[x==9] <- NA; x}
#Apply the function to a single variable:
mydata$q1 <- my9isNA(q1)
print(mydata)
# This assigns the missing values for a
# manually entered list of variables.
# The lapply function applies a function
# to every member of a list (a data frame is a type of list).
mydata[c("q1","q2","q3","q4")] <- lapply(
mydata[c("q1","q2","q3","q4")],
my9isNA)
print(mydata)
# This does contiguous variables without naming them all.
# If you know their column numbers, use this approach.
# q1 is the 3rd column and q4 is the 6th, so use 3:6.
mydata[ 3:6 ] <- lapply( mydata[ 3:6 ], my9isNA)
mydata
# Here is a way to handle many contiguous variables
# by name rather than by column number.
# We use the which function to find out the column
40
# number of
A <- which(
Z <- which(
mydata[ A:Z
mydata
MULTIPLECONDITIONALTRANSFORMATIONS
Conditionaltransformationsapplydifferentformulastodifferentsubsetsofthedata.For
example,theformulasforrecommendeddailyallowancesofvitaminsdifferformalesand
females.Ifyouhaveonlyoneformulatoapplytoeachsubgroup,readthesectionaboveon
conditionaltransformations.Theexamplebelowshowshowtohandleasetofformulasforeach
subgroup.
NotethatIdonotuseattach(mydata)whichwouldallowmetorefertovariablesbytheir
shortname,e.g.gender.Thatisbecauseitispossibletocreateavariableoutsideofadata
frame.Whencreatingvariables,itisbesttostickwiththeformofthenamethatincludesthe
dataframe,e.g. mydata$gender.
TheRprogrambelowcreatestwonew(rathersilly)scores,score1andscore2.Itdoesitusinga
logicalvector(variable)thatisTRUEfortheobservationswewanttheformulastoapplytoand
FALSEotherwise.ThesectionSelectingObservationshasmuchmoreinformationonlogical
vectors.
SAS
SPSS
41
EXECUTE.
# R Program for Multiple Conditional Transformations.
# Read the file into a data frame and print it.
load(file="c:\\mydata.Rdata")
print(mydata)
# Use column bind to add two new columns to mydata.
# Not necessary for this example, but handy to know.
mydata <- cbind(mydata, score1 = 0, score2 = 0)
attach(mydata)
mydata$guys <- gender=="m" #Creates a logical vector for males.
mydata$gals <- gender=="f" #Creates another for females.
print(mydata)
# Applies basic math to the group selected by the logic.
mydata$score1[gals]<- 2*q1[gals] + q2[gals]
mydata$score2[gals]<- 3*q1[gals] + q2[gals]
mydata$score1[guys]<-20*q1[guys] + q2[guys]
mydata$score2[guys]<-30*q1[guys] + q2[guys]
print(mydata)
# This step uses NULL to delete the guys & gals vars.
mydata$gals <- mydata$guys <- NULL
print(mydata)
RENAMINGVARIABLES(ANDOBSERVATIONS)
InSASandSPSS,youareneverawareofwherethevariablenamesarestoredorhow.Youjust
knowtheyareinthedatasetsomewhere.Renamingissimplyamatterofmatchingthenew
nametotheoldname.InRhowever,bothrowandcolumnnamesarestoredincharacter
vectorswithinthedataframe.Inessence,theyarejustanotherformofvariablethatyoucan
manipulate.
Youcanseethenamesinthedataeditor,andchangingthemtherebyhandisaveryeasyway
torenamethem.The rename functionthatcomeswiththereshapepackageallowsyouto
changenameswithoutknowingaboutthenameandrow.namevectorsandsoisalsoquite
easytouse.
Thoseapproachesaredemonstratedfirstintheexamplesbelow.Ithenuseseveralother
approachestorenamevariablesthatmakeitmuchmoreclearhowRstoresthenamesandhow
tochangethem.
SAS
42
SPSS
*or;
*RENAME q1=x1 q2=x2 q3=x3 q4=x4;
RUN;
* SPSS Program for Renaming Variables.
GET FILE='C:\mydata.sav'.
RENAME VARIABLES (Q1=X1)(Q2=X2)(Q3=X3)(Q4=X4).
EXECUTE.
# R Program for Renaming Variables.
load(file="c:\\mydata.Rdata")
print(mydata)
#---This uses the data editor.
#Make the changes by clicking on the names in the spreadsheet,
then closing it.
fix(mydata)
print(mydata)
# Restore original names for next example.
names(mydata)<-c("group", "gender", "q1", "q2", "q3", "q4")
#---This method is most like other languages such as SPSS or SAS.
# It's easy to understand but doesn't help you understand what
# R is actually doing. It requires the reshape library.
library(reshape)
mydata <- rename(mydata, c(q1="x1")) #Note the new name is one in
quotes.
mydata <- rename(mydata, c(q2="x2"))
mydata <- rename(mydata, c(q3="x3"))
mydata <- rename(mydata, c(q4="x4"))
print(mydata)
# Restore original names for next example.
names(mydata)<-c("group", "gender", "q1", "q2", "q3", "q4")
#---Simplest renaming with no packages required.
# With this approach, you simply list every name in order,
# even those you don't need to change.
names(mydata) <- c("group", "gender", "x1", "x2", "x3", "x4")
print(mydata)
# Restore original names for next example.
names(mydata)<-c("group", "gender", "q1", "q2", "q3", "q4")
#---The edit function actually generates a list of names for you.
# If you edit them and close the window, they will change.
names(mydata) <- edit( names(mydata) )
print(mydata)
43
]
]
]
]
<-"x1"
<-"x2"
<-"x3"
<-"x4"
44
RECODINGVARIABLES
RecodingisjustasimplerwayofdoingasetofsimilarIF/THENconditionaltransformations.Itis
oftenusedinsurveyresearchtocollapse5pointLikertscaleitemstosimpler3point
Disagree/Neutral/Agreescalestosummarizeresults.Itisalsooftenusedtoreversethescaleof
negativelywordeditemssothatalargenumericvaluehasthesamemeaningacrossallitems.
Scalereversalsaremoreeasilydonebysubtractingeachscorefrom6asinq1r=6q1.That
resultsin65=1,61=5andsoon.Theexamplebelowjustcollapsesthescale.
Therecodefunctionusedisinthecarslibrary,whichyou'llhavetoinstallbeforerunningthis.
SeeInstallingAddOnPackagesfordetails.
SASdoesn'thaveaseparaterecodefacilitybutitdoesoffersasimilarcapabilityusingitsvalue
labelformats.Thathastheusefulfeatureofapplyingtheformatsincategoricalanalysesand
ignoringitotherwise.Forexample,PROCFREQwillusetheformatbutPROCMEANSwillignore
45
it.YoucanalsorecodethedatawithaseriesofIF/THENstatements.Bothmethodsareshown
below.Forsimplicity,IleavethevaluelabelsoutoftheSPSSandRprograms.Thoseare
demonstratedinthesectionValueLabelsorFormats(&MeasurementLevel).
Forrecodingcontinuousvariablesintocategorical,seethecut2functionintheHmisclibrary.For
choosingoptimalcutpointswithregardtoatargetvariable,seetherpartfunctionorthetree
functioninHmisc.
SAS
SPSS
46
EXECUTE .
# R Program for Recoding Variables.
load(file="c:\\mydata.Rdata")
print(mydata)
attach(mydata)
library(cars)
mydata$q1<-recode(q1,"1=2;5=4")
mydata$q2<-recode(q2,"1=2;5=4")
mydata$q3<-recode(q3,"1=2;5=4")
mydata$q4<-recode(q4,"1=2;5=4")
mydata
# This does the same thing but for large sets
# of variables.
# Generate two sets of var names to use.
myQnames <paste( "q", 1:4, sep="")
47
KEEPINGANDDROPPINGVARIABLES
InSASyoucanusetheKEEPandDROPstatementstodeterminewhichvariablestosaveinyour
dataset.TheSPSSequivalentistheDELETEVARIABLESstatement.InR,themethodsdiscussed
intheSelectingVariablessectionperformthisfunctionaswell.OneadditionalfeatureinRisthe
NULLobject,whichyoucanusetodeletevariablesindataframeswithoutmakingnewversions
ofthedata.Touseitsimplyapplyitinanyvalidassignmentsuchas:
mydata$varname<-NULL.
Note that the remove function deletes only whole objects, so you will get an error
message if you try:
rm(mydata$varname)
The example below keeps the variables only the left side of our data frame. In the later
example on joining files, well do the same for the other variables then join the two back
together.
SAS
SPSS
BYORSPLITFILEPROCESSING
Whenyouwanttorepeatananalysisforeverylevelofavariable,youcanusetheBYstatement
inSASorSPLITFILEcommandinSPSS.Rdoesthiswiththe"by"function.Ithasthreerequired
48
parameters,thedataframename,thefactorvariable(s)tosplitonandtheanalyticalfunction.
Afterthoseparametersaresupplied,anyadditionalparametersettingsarepassedtothe
analyticalfunction.Theexamplesbelowusethesummaryfunctiontogetbasicstatsbygender
andthenbybothgenderandworkshop.
SASandSPSSbothrequireyoutosortthedatabythefactorvariable(s),butRdoesnot.
SAS
SPSS
49
STACKING/CONCATENATING/ADDINGDATASETS
Theexamplesbelowfirstsplitmydataintoseparatedatasetsformalesandfemales.Thenit
showshowtoputthembacktogether.SAScallsthisconcatenation,SPSScallsitaddingfilesand
R,withitsrow/columnorientationcallsitbindingrows.
SAS
SPSS
GET FILE='C:\females.sav'.
ADD FILES /FILE=*
/FILE='C:\males.sav'.
EXECUTE.
# R Program for Stacking/Concatenating/Adding Data Sets.
load(file="c:\\mydata.Rdata")print(mydata)
attach(mydata)
#Put only males in a data frame.
males <- mydata[gender=="m", ]
print(males)
#Put only females in another data frame.
females <- mydata[gender=="f", ]
print(females)
#Bind their rows together with the rbind function.
both<-rbind(females,males)
print(both)
JOINING/MERGINGDATAFRAMES
Oneofthemostfrequentlyuseddatamanipulationmethodsisjoiningormergingtwodatasets.
Ifyouhaveaonetomanyjoin,itwillcreatearowforeverypossiblematch.Acommonexample
50
isashortdataframecontaininghouseholdlevelinformationsuchasfamilyincomejoinedtoa
longerdatasetofindividualfamilymembervariables.Acompleterecordofeachfamilymember
alongwiththeirhouseholdincomewillresult.Duplicatesinmorethanonedataframeare
possible,butshouldbestudiedcarefullyforerrors.
Intheexamplebelow,buildsonthekeeping/droppingvariablesexampleabove.We'llstartwith
mydata,maketwocopies(leftandright)containingdifferentvariablesandthenjointhemback
togethertorecreatetheoriginalfile.
SAS
SPSS
GET FILE='C:\myleft.sav'.
MATCH FILES /FILE=*
/FILE='C:\myright.sav'
/BY id.
EXECUTE.
# R Program for Joining/Merging Data Sets.
#Note that row.names="id" is not used when reading
# the table below. That is because we need to match
# on ID so we keep it as a variable.
mydata<-read.table("c:\\mydata.csv",header=TRUE,sep=",")
print(mydata)
#Create a data frame keeping the left two q variables.
myleft<-mydata[ c("id","workshop","gender","q1","q2") ]
print(myleft)
#Create a data frame keeping the right two q variables.
myright<-mydata[ c("id","workshop","q3","q4") ]
51
print(myright)
#Merge the two dataframes by ID.
#Since "workshop" is in both, and is not used
# to merge the dataframes, R will save both
# and name them workshop.x and workshop.y
# Don't save it in both to avoid this.
both<-merge(myleft,myright,by="id")
print(both)
#Merge dataframes by both ID and workshop.
#Since there are no unsed variables, it recreates
#the original dataframe perfectly.
both<-merge(myleft,myright,by=c("id","workshop"))
print(both)
#Merge dataframes by both ID and workshop,
#while allowing them to have different names in each dataframe.
#They're the same here, but they could be different.
both<-merge(myleft,myright,by.x=c("id","workshop"),
by.y=c("id","workshop"))
print(both)
AGGREGATINGORSUMMARIZINGDATA
Weoftenhavetoworkondatathatisasummarizationofotherdata.Forexample,wemight
workonfamilydatathatwascollapsedfromfamilymemberdata.Belowwewillcreateafile
containingthemeanoftheq1variablebygenderandthenbyworkshopandgender.Weprint
eachdataframeforclarity.Thelastexamplemergesthesummarizeddatabacktoouroriginal
fileandsobuildsuponthepreviousexamplesinthesection,Joining/MergingDataFrames.
AmuchmoreextensiveaggregationapproachisavailableinHadleyWickhamsdelightful
reshapepackagedocumentedat
http://cran.rproject.org/doc/packages/reshape.pdf.
SAS
52
SPSS
53
EXECUTE.
54
print(mydata2)
RESHAPINGVARIABLESTOOBSERVATIONSANDBACK
Acommondatamanagementproblemisreshapingdatafromwideformattolongandback.
Ifweassumeourvariablesq1,q2,q3,q4arethesameitemmeasuredatfourtimes,thisisthe
standardwideformatforrepeatedmeasuresdata.Convertingthistothelongformatconsistsof
writingoutfourrecords,eachofwhichhasjustonemeasure,we'llcallitY,andacounter
variable,oftencalledtime,thatgoes1,2,3,4.Sointhesimplestcase,twovariableswillreplace
asmanyastherearerepeatsthroughtime.
Goingfromwidetolongisjustthereverse.SPSSmakesthisprocessveryeasytodowiththeir
RestructureDataWizard.ItactuallygeneratedtheSPSSprogrambelow.TheSASapproachis
quitecomplexandtakesabitofstudy.HadleyWickham'sexcellentreshapepackageinRis
quitepowerfulandeasytouse.Itusestheanalogyofmeltingyourdatasothatyoucancastit
intoadifferentmold.Inadditiontoreshaping,thepackagemakesquickworkofawiderangeof
aggregationproblems.
SAS
55
SPSS
56
SORTINGDATAFRAMES
SortingisoneoftheareasthatRdiffersmostfromSASandSPSS.Itdoesnotdirectlysortadata
frame.Instead,itdeterminestheorderofthesortedrowsandthenappliesthemtodothesort.
ConsiderthenamesAnn,Eve,Cary,Dave,Bob.Theyarealmostsortedinascendingorder.Since
thenumberofnamesissmall,itiseasytodeterminetheorderthatthenameswouldrequireto
besorted.Weneedthe1stname,Ann,followedbythe5thname,Bob,followedbythe3rd
name,Cary,the4thname,Daveandfinallythe2ndname,Eve.Theorderfunctionwouldget
thoseindexvaluesforus:15342.
Onewaytoselectrowsfromadataframeistousetheform mydata[rows,columns].If
youleavethemallout,asin mydata[ , ] thenyoullgetallrowsandallcolumns.Youcan
selectsomerowsaswehavedoneelsewheretoselectthefemalesinthefirst4recordswith
mydata[ c(1,2,3,4), ].Wecanselecttheminreverseorderwith
mydata[ c(4,3,2,1), ].
Ifweappliedthatideatotheindexesinournameexample,wecouldget
mydata[ c(1,5,3,4,2) , ]toprint(orsave)theminorder.Sincethe order function
determinestheindexesofthesortedorderautomatically,wecoulddothesamethingwith
mydata [ order(name), ].
SAS
SPSS
57
VALUELABELSORFORMATS(&MEASUREMENTLEVEL)
ThissectionblendstwotopicsbecauseinRtheyareinseparable.InbothSASandSPSSassigning
labelstovaluesisindependentofthevariablesmeasurementscale.InR,valuelabelscanonly
beassignedtovariableswhoselevelisfactor.
InSASavariablesmeasurementlevelofnominal,ordinalorintervalisnotsetinadvance.
Instead,listingthevariableonaspecificstatementsuchasBYorCLASSwilltellSASthatyou
wishtoviewitascategorical.
SASuseofvaluelabelsishandledinatwostepprocess.First,ProcFormatcreatesaformat
foreveryuniquesetoflabelsandthentheFormatstatementassignsaformattoeachvariable
orsetofvariables(seeexamplebelow).Theformatsarestoredoutsidethedatasetinaformat
library.
InSPSS,theVARIABLELEVELstatementsetsmeasurementlevelbutthisismerelyaconvenience
tohelpthegraphicaluserinterfaceworkwell.SoSPSSwillnot,forexample,showyounominal
variablesintheDESCRIPTIVESproceduresdialogbox,butifyouenterthemintothe
programminglanguage,itwillacceptthem.AswithSAS,specialcommandstellSPSShowyou
wanttoviewthescaleofthedata.TheseincludeBYandGROUPS.
Independently,theVALUELABELcommandsetslabelsforeachlevelandthelabelsarestored
withinthedatasetitself.
58
Rhasthemeasurementlevelsoffactorfornominaldata,orderedfactorforordinaldataand
numericforintervalorscaledata.Yousettheseinadvanceandthenthestatisticaland
graphicalproceduresusethemintheappropriatewayautomatically.
Inourexampletextfile,datagenderwasenteredasmandfsoRassignsthevalues
assigned2and1sincefprecedesminthealphabet.Forcharacterdata,thosedefaultsareoften
sufficient.Howeveryoucanusefactor()tochangeeither.Thevaluesassignedfollowtheorder
onthelevelsargumentsobelowwithmcomingfirst,itwouldbeassociatedwith1andf
with2.Thelabelsargumentfollowstheorderofthelevels.Thisexamplesetsmas1,fas2
andusesthefullywrittenoutlabels.
mydata$genderF <- factor( mydata$gender,
levels=c("m","f"),labels=c("Male","Female") )
ThereisadangerinsettingvariablelabelswithcharacterdatathatdoesnotappearatallinSAS
orSPSS.Inthestatementabove,ifwesetlevels=c("m","F")instead,thefemaleswould
allbesettomissing(NA)becausetheactualvaluesarelowercase.TherearenocapitalFsin
thedata!Thisdangerappliesofcoursetoothermoreobviousmisspellings.
OfcourseRhasnowaytoautomaticallyidentifynumericfactors,suchasourworkshopvariable.
Ittooisacategoricalmeasure,butinitiallyRassumesitisnumericbecauseitisenteredas1and
2.Thefactorfunctionconvertsittoafactorandletsyouoptionallyassignlabels.Factorlabelsin
Rarestoredinthedataframeitself.Wecanconvertittoafactorwiththecommand,
mydata$workshop <- factor( mydata$workshop,
levels=c(1,2,3,4),
labels=c("R","SAS","SPSS","Stata") )
Notethatourdataonlycontainthevalues1and2butthatisfine.Youcanalsoconverta
numericvariabletoafactorwithoutassigninglabelsusingjust
mydata$workshop<-factor(mydata$workshop).
Rsapproachsavesyouworkwithcharacterdataandisnomoreworkwithnumericdatathan
SASorSPSS.Rneverneedsacharactervariableconvertedtonumericformtoworkwithcertain
functionsasSPSSdoes.
ButthereisadownsidetoR'sapproach.Whatifyouwanttoviewvariablesbothways?For
example,surveyresearchersusuallywanttogetfrequenciesontheirLikertscalesthatgofrom1
to5andtheywantthemeanscoreaswell.Theyoftenalsowanttoaveragethemtogetherto
makenewscores.TheHmiscpackages describe functionprovidesbothfrequenciesand
means.
Manyotherfunctions,suchassummary,requirethatyouconvertthevariablefromfactorto
numeric(orviceversa)togetittodowhatyouwant.Severalconversionfunctionshelpwiththis
59
SPSS
GET FILE="c:\mydata.sav".
VARIABLE LEVEL workshop (NOMINAL)
/q1 TO q4 (SCALE).
VALUE LABELS workshop 1 'Control' 2 'Treatment'
/q1 TO q4
1 'Strongly Disagree'
2 'Disagree'
3 'Neutral'
4 'Agree'
5 'Strongly Agree'.
SAVE OUTFILE="C:\mydata.sav".
# R Program to Assign Value Labels & Factor Status.
# By default, group was read in as numeric and gender as factor.
# That is because gender is character data.
load(file="c:\\mydata.Rdata")
attach(mydata)
print(mydata)
# Note that summary will treat group as numeric by default,
60
61
"Strongly Agree")
# Now create a new set of variables
mydata$q1f <- factor(q1, myQlevels,
mydata$q2f <- factor(q2, myQlevels,
mydata$q3f <- factor(q3, myQlevels,
mydata$q4f <- factor(q4, myQlevels,
as factors.
myQlabels)
myQlabels)
myQlabels)
myQlabels)
62
VARIABLELABELS
PerhapsthemostfundamentalfeaturemissingfromthemainRdistributionissupportfor
variablelabels.Itisatestamenttoitsopennessandflexibilitythatsuchafundamentalfeature
couldbeaddedonbyuserwrittenpackage.FrankHarrells Hmisc packagedoesjustthat.It
addsanattributetothedatasetoflabelandstoresthelabelsthere,evenconvertingthem
fromSASautomatically(butnotSPSS).Asamazingasthisadditionis,thefactthatvariablelabels
werenotincludedinthemaindistributionmeansthatmostproceduresdonottakeadvantage
ofwhat Hmisc adds.Themanywonderfulfunctionsinthe Hmisc packagedoofcourse.
Asecondapproachtovariablelabelsistostorethemasacharactervariable.Thatisthe
approachusedbythe prettyR package.
Finally,youcanuseillegalvariablenamesofanylengthbyenclosingtheminquotes.Thishas
theadvantageofworkingwithallRfunctions.Unfortunatelyitalsomakesselectingvariablesby
namemuchlesspractical.Youcantypeoutthelongnameoruseasearchfunctiongreptofind
keywordsinyourlabels.Youcanofcoursestillselectvariablesusingindexnumbers.
TheRprogrambelowdemonstratesallthreeapproachestovariablelabels.
SAS
SPSS
63
64
summary ( mydata[myvars] )
WORKSPACEMANAGEMENT
ThewayRstoresitsdataisverydifferentfromSASandSPSS.Whileinuse,thedataisstoredin
thecomputersrandomaccessmemoryratherthanonaharddrive.ThismeansthatRcannot
handledatasetsaslargeasSASorSPSS.Giventhelowcostofmemorytodaythisismuchlessof
aproblemthanyoumightthink.Rcanhandlehundredsofthousandsofrecordsonacomputer
with2gigabytesofRAM.ThatisthecurrentRAMlimitintodays32bitversionofWindows.You
canusevirtualmemorytogoupto3gigabytesbutthatslowsthingsdownconsiderably.
Operatingsystemscapableof64bitmemoryspacesarebecomingmorepopular.Thehuge
amountsofmemorytheycanhandlemitigatethisproblem.Onewayaroundthelimitationisto
storeyourdatainarelationaldatabaseanduseitsfacilitiestogeneratearandomsampleto
workwith.AnotheralternativeistouseSPLUS,acommercialpackagethathasanalmost
identicallanguagewithextensionstohandlebigdata.
AftereachRsession,itofferstosavetheworkspace.Ifyouclickyes,itstoresitinafilenamed
c:\ProgramFiles\R\R2.4.1\.Rdata.ThenexttimeyoustartRitautomaticallyloadsthefile
backintomemoryandyoucancontinueworking.
Thename.RdataisntveryeasytoworkwithinWindowssinceithidesfileextensionsby
default,andanextensionisanythingthatfollowsaperiod.Sotheentirefilename.Rdataisan
extensionandhiddenfromview!Togetwindowstoshowyouthisfile,inExploreruncheckthe
optionbelowandclickApplytoallfolders:
Tools>FolderOptions>View>[]Hideextensionstoknownfiletypes.
Ifyouwanttoavoidusingthename.Rdata,youcanatanytimechooseSaveWorkspacefrom
theFilemenuinRandnameitanythingyoulikewiththeextension.Rdata.Thenwhenyoustart
R,youwillhavetouseLoadWorkspacetoloaditfromtheharddrivebackintothecomputers
memory.YoucanalsouseRfunctionstosaveandloadyourworkspace.Tosaveyour
workspace,usethecommandsave.image(file=c:\\mydata.Rdata),
wheremydataisanyfilenameyoulike.Ifyouwanttosaveonlyasubsetofyourworkspace,you
canusethecommandsave(mydata,file=c:\\mydata.Rdata).
Saveisoneofthefewfunctionsthatcanacceptmanyobjectsseparatedbycommas,somight
savethreewith save(mydata,myVars,myObs,file=c:\\myExamples.Rdata).Regardlessofhow
yousavedyourworkspace,thecommandtoloadthatworkspacebackintomemoryis
load(file=mydata.Rdata).
Thereisanotherwaytokeepyourvariousprojectsorganizedifyoulikeusingshortcuts.You
createanRshortcutforeachofyouranalysisprojects.Thenyourightclicktheshortcut,choose
PropertiesandsettheStartinfoldertoauniquefolder.WhenyouusethatshortcuttostartR,
uponexititwillstorethe.Rdatafileforthatproject.Althoughneatlyorganizedintoseparate
65
folders,eachprojectworkspacewillstillbeinafilenamed.Rdata.Findingtheparticularfileyou
needviasearchenginesorbackupsystemsishamperedbythisapproach.Ifyouaccidentally
movedan.Rdatafiletoanotherfolder,youwouldnotknowwhatwasinitwithoutloadingit
intoR.
Youcanseewhatobjectsareinyourworkspacewiththelsfunction.Tolistallobjectsuse
ls().Ifyouwanttoseeifjustoneobjectexists,youcanenterls(mydata).Todeletean
object,usetheremovefunctionasinrm(mydata).Toremovealltheobjectsinyour
workspace,combinethetwofunctionsasin rm( list=ls() ).
Tohelpconservevaluableworkspacesize,youcanusethecleanup.importfunctioninthe
Hmisc package.Itautomaticallystoresthevariablesintheirmostcompactform.Youuseitas
mydata<-cleanup.import(myata). SeeInstallingAddonPackagesformore
details.
Toconserveworkspacebysavingonlythevariablesyouneedinadataframe,seeKeepingand
DroppingVariables.The rm functioncannotdropvariableswithinadataframe.
WORKSPACEMANAGEMENTFUNCTIONS
Storedata
efficiently
(requiresHmisc)
mydata<-cleanup.import(mydata)
Listobject
contents
(requiresHmisc)
contents(mydata)
Listallobjects
ls()
Listjustmydata
ls(mydata)
Removeanobject
rm(mydata)
Removeseveral
rm(mydata,myvars,myobs)
Removeall
objects
rm( list=ls() )
save.image(file=c:\\mydata.Rdata)
66
Savesome
save(mydata,x,y,z,file=c:\\myExamples.Rdata)
Loadworkspace
load(file=c:\\mydata.Rdata)
GRAPHICS
Rcanmakeaverywiderangeofgraphs,anditsdefaultsettingsareusuallyexcellent.Iwillonly
demonstrateafewsimplegraphsandreferyoutoothersourcesformoreinformation.There
arealsowonderfullycomprehensivecollectionsofgraphsandtheRprogramstomakethemat
theRGraphGallery http://addictedtor.free.fr/graphiques/ andtheImageBrowserat
http://bg9.imslab.co.jp/Rhelp/.
YoucanalsousethedemocapabilityinRtohaveitshowyouasampleofwhatitcando.Enter
thecommandsbelowintheRconsole.
demo(graphics)
demo(persp)
library(lattice)
demo(lattice)
RdoesnothavedynamicvisualizationcapabilitieslikethoseinSAS/INSIGHT,butitdoeshavea
linktotheexcellentGgobipackageavailableforfreeathttp://www.ggobi.org/.Therggobi
packagelinksthetwoanditisavailableathttp://www.ggobi.org/rggobi/.
IfyouareinterestedinSPSS'extremelyflexibleGraphicsProductionLanguage(GPL),youwill
wanttoreadaboutHadleyWickham's ggplot package.Unfortunatelytheexamplesbelow
donotyetincludeggplot.
Theexamplesbelowdemonstrateahistogram,barplotandscatterplot.WhileSASandSPSS
assumeyourdataneedsummarizingandyouuseoptionstotellthemwhenitisnot,Rassumes
justtheopposite.Thefunctionbarplot(c(40,60))makesachartwithjustthosetwobars.
Butthecommandbarplot(gender)alsoassumeyouwanttoseeeveryunsummarized
value.Youllget8bars,oneforeverysubject,withtheheightofthebarseither1or2since
thoselevelcodeswereassignedbyR(inalphabeticalorder)whenitreadthevariablegender.So
togetaplotthatsummarizeshowmanymalesandfemalesthereareincludesthesummary
function barplot( summary(gender) ).
Ifwetrytocreateasimilarbarplotforworkshop,andworkshophasnotbeenconvertedtoa
factor,Rsays, "Error in barplot.default(summary(workshop)) : 'height'
must be a vector or a matrix."
67
Thatisnotaparticularlyusefulmessage.Ifyouhavenotalreadymadeworkshopintoafactor
(seesectiononValueLabelsorFormats(&MeasurementLevel)youcandosointhe
middleofthecommand: barplot( summary( as.factor(workshop) ) )
Thecommand barplot(gender)willyieldabarofheight1or2foreveryobservation!
Sincethesummaryfunctioncountsfactorlevels(andgenderisafactor)abarplotshowingthe
totalnumberofmalesandfemalesisdonewithbarplot( summary(gender) ).
SAS
SPSS
68
69
SOURCE: s=userSource(id("graphdataset"))
DATA: workshop=col(source(s), name("workshop"))
DATA: q1=col(source(s), name("q1"))
DATA: q2=col(source(s), name("q2"))
DATA: q3=col(source(s), name("q3"))
DATA: q4=col(source(s), name("q4"))
TRANS: workshop_label = eval("workshop")
TRANS: q1_label = eval("q1")
TRANS: q2_label = eval("q2")
TRANS: q3_label = eval("q3")
TRANS: q4_label = eval("q4")
GUIDE: axis(dim(1.1), ticks(null()))
GUIDE: axis(dim(2.1), ticks(null()))
GUIDE: axis(dim(1), gap(0px))
GUIDE: axis(dim(2), gap(0px))
ELEMENT: point(position((
workshop/workshop_label+q1/q1_label+q2/q2_label+q3/q3_label+q4/q4
_label)*(
workshop/workshop_label+q1/q1_label+q2/q2_label+q3/q3_label+q4/q4
_label)))
END GPL.
70
#Basic Graphics in R.
load(file="c:\\mydata.Rdata")
print(mydata)
attach(mydata) #Makes this the default dataset.
hist(q1)
barplot(c(40,60))
barplot(summary(gender))
# These statements generate an error since workshop
# is not a factor yet.
barplot(summary(workshop))
# Now use as.factor function to make workshop a factor.
barplot(summary(as.factor(workshop)))
# Scatterplot of q1 by q2.
plot(q1,q2)
# Scatterplot matrix of whole data frame.
plot(mydata)
ANALYSIS
Thissectiondemonstratessomeverybasichypothesistests.Sincetheexamplesuseourpractice
dataset,someofthetestsareratherabsurd.SeeModernAppliedStatisticsinSandAn
IntroductiontoSandtheHmiscandDesignLibrariesformuchmorethoroughandinteresting
coverage.
71
WhenperformingasingletypeofanalysisinSASorSPSS,youprepareyourcommandsandthen
submitthemtogetallyourresultsatonce.YoucouldsavesomeoftheoutputusingSAS
OutputDeliverySystemorSPSSOutputManagementSystem,butfewpeopledo.Ronthe
otherhandshowsyouverylittleoutputwitheachcommandandeveryoneusesitsintegrated
outputmanagementcapabilities.Forexamplewhenrunninglinearregressionmymodel<lm(q4~q1,q2,q3)willrunandsaveyourmodel,buttellyounothingaboutit.Followingthat
withadditionalfunctionssuchassummary(mymodel),anova(mymodel)and
plot(mymodel)actuallydisplaytheresults.
Inanalysisofvariance,SASandSPSSdefaulttopartial(typeIII)sumsofsquares(SS)andFtests.
Instead,Rprovidessequentialones.RsttestsforindividualparametersarethesameasSAS
andSPSS,soyouhavebothsetsofpvalues.IfyouwantpartialFs,youcansquarethetvalues
provided.SinceitseasytosaveamodelinRyoucancompareanytwowith
anova(fullModel,reducedModel).Thisapproachcantestfortheadditionaleffectofa
singlevariable,orasetofvariables.Stepwisemodelsaredoneviatheadd1,drop1and
stepAICfunctions.
TheRexamplesrequireinstallingtheHmiscandprettyRpackagesbeforehand.Seedetailsin
InstallingAddonPackages.
SAS
72
model q4=q1-q3;
run;
*---Group comparisons;
* Chi-square;
proc freq;
tables workshop*gender/chisq; run;
* Independent samples t-test;
proc ttest;
class gender;
var q4; run;
* Nonparametric version of above
using Wilcoxon/Mann-Whitney test;
proc npar1way;
class gender;
var q4; run;
* Paired samples t-test (silly one);
proc ttest;
paired q1*q2; run;
* Nonparametric version of above using
Signed Rank test;
proc univariate;
var myDiff;
run;
*Oneway Analysis of Variance (ANOVA);
proc glm;
class workshop;
model q4=workshop;
means workshop gender / tukey; run;
SPSS
73
BY gender
74
75
76
SUMMARY
Aswehaveseen,RdiffersfromSASandSPSSinmanyways.Belowisasummarychartofsome
ofthemostimportantdifferences.Youwillfindfarmoredetailoneachsubjectinthemainbody
ofthispaper.
SAS&SPSS
Output
Management
System
Outputmanagementsystemsare
rarelyusedforroutineanalyses.
Outputisroutinelyandeasilypassed
throughadditionalfunctionstoget
moreresults.
Macrolanguage
Aseparatelanguageusedmainly
forrepetitivetasksoraddingnew
functionality.Addedcapabilities
arenotrunthesamewayasbuilt
inprocedures.
Anintegrallanguagethatisused
routinely.Alsousedtoautomate
repetitivetasksandaddnew
functionality.Addedcapabilitiesare
runthesamewayasbuiltinones.
MatrixLanguage
Aseparatelanguageusedonlyto
addnewfeatures.Addedfeatures
arenotrunthesamewayasbuilt
inprocedures.
AnintegralpartofRthatyouuse
evenwhenselectingvariablesor
observations.Addedcapabilitiesare
runthesamewayasbuiltinones.
VariableLabels
Builtin.Usedbyallprocedures.
Addedon.Usedbyfewprocedures.
PublishingResults Seeitformattedimmediatelyin
anystyleyouchoose.Quickcut&
pastetowordprocessormaintains
tablestatusandstyle.Canalso
exporttoafile.
Processoutputwithadditional
proceduresthatrouteformatted
outputtoafile.Youdontseeit
formatteduntilyouimportittoa
wordprocessorortextformatter.
DataSize
Limitedonlybydisksize.
Datamustfitintothecomputers
RAMmemory.
DataStructure
Rectangulardataset.
Rectangulardataframe,vector,
matrix,list.
78
Choosingdata
Allthedataforananalysismust
beinasingledataset.
Analysescanfreelycombinevariables
fromdifferentdataframesorother
structures.
Choosing
Variables
Usesthesimplelistsofvariable
namesintheformofx,y,z;atoz;
az
Useswidevarietyofselectionby
indexvalue,variablename,logical
condition(sameaswhenselecting
observations).
Choosing
Observations
UseslogicalconditionsinIF,
SELECTIF,WHERE
Useswidevarietyofselectionby
indexvalue,variablename,logical
condition(sameaswhenselecting
variables).
ConvertingData
Structuresto
MatchProcedure
orFunction
Allproceduresacceptallvariables. Originaldatastructureplusvariable
selectionmethoddetermines
structureofvariablespassedto
analyses.Useofconversionfunctions
mayberequired.
Controlling
Procedureor
Function
StatementssuchasCLASSand
MODELandoptionscontrolthe
procedure.
Thedatasstructure(itsmode)
determineswhatisdone,alongwith
options(arguments)
ISRHARDERTOUSE?
AsmentionedinTheFiveMainPartsofSASandSPSS,manySASandSPSSusersuseonly
thedatainputandmanagementcommandsalongwithproceduresforgraphicsandanalysis.
Theydontgettheadvantagesofferedbyoutputmanagementsystems,macrolanguagesor
matrixlanguages.ManyintroductorySASandSPSSdocumentsfocusonlyonthoseparts.
WehaveseenthatRhasmanycomplexitiesthatSASandSPSSlacksuchas:multipledata
structures,averywiderangeofvariableselectionmethods,functionsthatacceptvariables
storedonlyincertaindatastructuresorselectedincertainways,datastructureconversiontools
andworkspacemanagementcommands.
SoifyouareaSASorSPSSuserwhohashappilyavoidedthecomplexitiesofoutput
management,macrosandmatrixlanguages,Risindeedhardertouse.Ontheotherhand,it
makesthoseaddedfeaturessomucheasiertousethanSASorSPSSthatyoumayfindyourself
moreeagertoexpandyourhorizons.TheaddedpowerofRanditsfreepricemakeitwellworth
theaddedeffortittakestolearn.
79
CONCLUSION
WehaveonlytouchedonthefullcapabilitiesofRbutnowyoushouldbereadytodiveintoa
morecomprehensiveintroductorymanualordirectlyintoachapteronregressionorANOVA
withoutbeingtotallylost.
Ihopetoimprovethisdocumentastimegoeson,sopleasedropmealineatBobM@utk.eduor
(865)9745230totellmewhatyouthink.HowcanIimprovethisforthenextperson?Negative
commentsareoftenthemostuseful,sodontbeshy.
HavefunworkingwithR!
80