
Stata Learning Module

Using IF with Stata commands


This module shows the use of if with common Stata commands.
Let's use the auto data file.
sysuse auto
For this module, we will focus on make, rep78, foreign, mpg, and price. We can use the keep command to keep just
these variables.
keep make rep78 foreign mpg price
Let's make a table of rep78 by foreign to look at the repair histories of the foreign and domestic cars.
tabulate rep78 foreign
| foreign
rep78 | 0 1 | Total
-----------+----------------------+----------
1 | 2 0 | 2
2 | 8 0 | 8
3 | 27 3 | 30
4 | 9 9 | 18
5 | 2 9 | 11
-----------+----------------------+----------
Total | 48 21 | 69
Suppose we wanted to focus on just the cars with repair histories of four or better. We can use the if suffix to do this.
tabulate rep78 foreign if (rep78 >=4)
| foreign
rep78 | 0 1 | Total
-----------+----------------------+----------
4 | 9 9 | 18
5 | 2 9 | 11
-----------+----------------------+----------
Total | 11 18 | 29
Let's make the above table using the column and nofreq options. Note that column and nofreq come after the
comma. These are options on the tabulate command, and options need to be placed after a comma.
tabulate rep78 foreign if (rep78 >=4), column nofreq
| foreign
rep78 | 0 1 | Total
-----------+----------------------+----------
4 | 81.82 50.00 | 62.07
5 | 18.18 50.00 | 37.93
-----------+----------------------+----------
Total | 100.00 100.00 | 100.00
The use of if is not limited to the tabulate command. Here, we use it with the list command.
list if (rep78 >= 4)
     make              price   mpg  rep78  foreign
  3. AMC Spirit         3799    22      .        0
  5. Buick Electra      7827    15      4        0
  7. Buick Opel         4453    26      .        0
 15. Chev. Impala       5705    16      4        0
 20. Dodge Colt         3984    30      5        0
 24. Ford Fiesta        4389    28      4        0
 29. Merc. Bobcat       3829    22      4        0
 30. Merc. Cougar       5379    14      4        0
 33. Merc. XR-7         6303    14      4        0
 35. Olds 98            8814    21      4        0
 38. Olds Delta 88      4890    18      4        0
 43. Plym. Champ        4425    34      5        0
 45. Plym. Sapporo      6486    26      .        0
 47. Pont. Catalina     5798    18      4        0
 51. Pont. Phoenix      4424    19      .        0
 53. Audi 5000          9690    17      5        1
 55. BMW 320i           9735    25      4        1
 56. Datsun 200         6229    23      4        1
 57. Datsun 210         4589    35      5        1
 58. Datsun 510         5079    24      4        1
 59. Datsun 810         8129    21      4        1
 61. Honda Accord       5799    25      5        1
 62. Honda Civic        4499    28      4        1
 63. Mazda GLC          3995    30      4        1
 64. Peugeot 604       12990    14      .        1
 66. Subaru             3798    35      5        1
 67. Toyota Celica      5899    18      5        1
 68. Toyota Corolla     3748    31      5        1
 69. Toyota Corona      5719    18      5        1
 70. VW Dasher          7140    23      4        1
 71. VW Diesel          5397    41      5        1
 72. VW Rabbit          4697    25      4        1
 73. VW Scirocco        6850    25      4        1
 74. Volvo 260         11995    17      5        1
Did you see the values of rep78 that had a value of . ? Those are missing values. For example, the value of rep78 for
the AMC Spirit was missing. Stata treats a missing value as positive infinity, the highest number possible. So, when
we said list if (rep78 >= 4), Stata included the observations where rep78 was . as well.
If we wanted to include just the valid observations that are greater than or equal to 4, we can do the following to tell
Stata we want rep78 >= 4 and rep78 not missing.
list if (rep78 >= 4) & !missing(rep78)
     make              price   mpg  rep78  foreign
  5. Buick Electra      7827    15      4        0
 15. Chev. Impala       5705    16      4        0
 20. Dodge Colt         3984    30      5        0
 24. Ford Fiesta        4389    28      4        0
 29. Merc. Bobcat       3829    22      4        0
 30. Merc. Cougar       5379    14      4        0
 33. Merc. XR-7         6303    14      4        0
 35. Olds 98            8814    21      4        0
 38. Olds Delta 88      4890    18      4        0
 43. Plym. Champ        4425    34      5        0
 47. Pont. Catalina     5798    18      4        0
 53. Audi 5000          9690    17      5        1
 55. BMW 320i           9735    25      4        1
 56. Datsun 200         6229    23      4        1
 57. Datsun 210         4589    35      5        1
 58. Datsun 510         5079    24      4        1
 59. Datsun 810         8129    21      4        1
 61. Honda Accord       5799    25      5        1
 62. Honda Civic        4499    28      4        1
 63. Mazda GLC          3995    30      4        1
 66. Subaru             3798    35      5        1
 67. Toyota Celica      5899    18      5        1
 68. Toyota Corolla     3748    31      5        1
 69. Toyota Corona      5719    18      5        1
 70. VW Dasher          7140    23      4        1
 71. VW Diesel          5397    41      5        1
 72. VW Rabbit          4697    25      4        1
 73. VW Scirocco        6850    25      4        1
 74. Volvo 260         11995    17      5        1
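As a side note that is not part of the original module: because Stata sorts the missing value . above every number, a comparison such as rep78 < . is another common way to screen out missing values. The lines below are a minimal sketch, assuming the auto data that ship with Stata are available through sysuse.

```stata
* Reload the example data, discarding whatever is in memory
sysuse auto, clear
* Count the observations where rep78 is missing
count if missing(rep78)
* Two equivalent ways to list repair ratings of 4 or better, missing excluded
list make rep78 if (rep78 >= 4) & !missing(rep78)
list make rep78 if (rep78 >= 4) & (rep78 < .)
```

Both list commands should show the same cars; the count line is only a quick check of how many missing values are involved.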
We can use if with most Stata commands. Here, we get summary statistics for price for cars with repair histories of 1 or
2. Note the == represents IS EQUAL TO and | represents OR.
summarize price if (rep78 == 1) | (rep78 == 2)
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
price | 10 5687 3216.375 3667 14500
A simpler way to say this would be...
summarize price if (rep78 <= 2)
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
price | 10 5687 3216.375 3667 14500
Likewise, we can do this for cars with repair history of 3, 4 or 5.
summarize price if (rep78 == 3) | (rep78 == 4) | (rep78 == 5)
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
price | 59 6223.847 2880.454 3291 15906
Let's simplify this by saying rep78 >= 3.
summarize price if (rep78 >= 3)
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
price | 64 6239.984 2925.843 3291 15906
Did you see the mistake we made? We accidentally included the missing values because we forgot to exclude them
(64 observations instead of 59). We really needed to say:
summarize price if (rep78 >= 3) & !missing(rep78)
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
price | 59 6223.847 2880.454 3291 15906
Summary
Most Stata commands can be followed by if, for example
Summarize if rep78 equals 2
summarize if (rep78 == 2)
Summarize if rep78 is greater than or equal to 2
summarize if (rep78 >= 2)
Summarize if rep78 greater than 2
summarize if (rep78 > 2)
Summarize if rep78 less than or equal to 2
summarize if (rep78 <= 2)
Summarize if rep78 less than 2
summarize if (rep78 < 2)
Summarize if rep78 not equal to 2
summarize if (rep78 != 2)
If expressions can be connected with
| for OR
& for AND
Missing Values
Missing values are represented as . and are the highest value possible. Therefore, when values are missing, be careful
with commands like
summarize if (rep78 > 3)
summarize if (rep78 >= 3)
summarize if (rep78 != 3)
To omit missing values, use
summarize if (rep78 > 3) & !missing(rep78)
summarize if (rep78 >= 3) & !missing(rep78)
summarize if (rep78 != 3) & !missing(rep78)
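As an aside that goes beyond the original module, Stata's inrange() and inlist() functions can express the same selections more compactly. inrange() evaluates to false when its first argument is missing (given non-missing bounds), so no separate missing() test is needed. A minimal sketch, assuming the auto data are available through sysuse:

```stata
* Reload the example data, discarding whatever is in memory
sysuse auto, clear
* Repair ratings 3 through 5; observations with missing rep78 are not selected
summarize price if inrange(rep78, 3, 5)
* Repair ratings 1 or 2, written as an explicit list of values
summarize price if inlist(rep78, 1, 2)
```

These two commands are meant to select the same observations as the (rep78 >= 3) & !missing(rep78) and (rep78 == 1) | (rep78 == 2) conditions used earlier in this module.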
Stata Learning Module
A statistical sampler in Stata
This module will give a brief overview of some common statistical tests in Stata. Let's use the auto data file that we
will use for our examples.
use auto
t-tests
Let's do a t-test comparing the miles per gallon (mpg) of foreign and domestic cars.
ttest mpg , by(foreign)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      52    19.82692     .657777    4.743297    18.50638    21.14747
       1 |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
Degrees of freedom: 72
                  Ho: mean(0) - mean(1) = diff = 0
     Ha: diff < 0           Ha: diff ~= 0           Ha: diff > 0
       t =  -3.6308           t =  -3.6308            t =  -3.6308
   P < t =   0.0003     P > |t| =   0.0005        P > t =   0.9997
As you see in the output above, the domestic cars had significantly lower mpg (19.8) than the foreign cars (24.8).
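The test above assumes that the two groups have equal variances, yet the standard deviations in the output (about 4.74 and 6.61) differ somewhat. As a hedged aside, ttest accepts an unequal option that drops this assumption. The sketch below assumes the auto data are available through sysuse.

```stata
* Reload the example data, discarding whatever is in memory
sysuse auto, clear
* Same comparison, without assuming equal variances in the two groups
ttest mpg, by(foreign) unequal
```

Whether the unequal-variance version is preferable here is a judgment call; it is shown only to illustrate where the option goes.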
Chi-square
Let's compare the repair rating (rep78) of the foreign and domestic cars. We can make a crosstab of rep78 by foreign.
We may want to ask whether these variables are independent. We can use the chi2 option to request a chi-square test of
independence as well as the crosstab.
tabulate rep78 foreign, chi2
| foreign
rep78 | 0 1 | Total
-----------+----------------------+----------
1 | 2 0 | 2
2 | 8 0 | 8
3 | 27 3 | 30
4 | 9 9 | 18
5 | 2 9 | 11
-----------+----------------------+----------
Total | 48 21 | 69
Pearson chi2(4) = 27.2640 Pr = 0.000
The chi-square is not really valid when you have empty cells. In such cases when you have empty cells, or cells with
small frequencies, you can request Fisher's exact test with the exact option.
tabulate rep78 foreign, chi2 exact
| foreign
rep78 | 0 1 | Total
-----------+----------------------+----------
1 | 2 0 | 2
2 | 8 0 | 8
3 | 27 3 | 30
4 | 9 9 | 18
5 | 2 9 | 11
-----------+----------------------+----------
Total | 48 21 | 69
Pearson chi2(4) = 27.2640 Pr = 0.000
Fisher's exact = 0.000
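One way to judge whether small cells are a concern is to look at the expected frequencies. As a sketch that is not part of the original module (and assuming the auto data are available through sysuse), the expected option of tabulate prints the expected count under independence beneath each observed count.

```stata
* Reload the example data, discarding whatever is in memory
sysuse auto, clear
* Observed and expected frequencies, plus both tests of independence
tabulate rep78 foreign, chi2 exact expected
```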
Correlation
We can use the correlate command to get the correlations among variables. Let's look at the correlations among price,
mpg, weight and rep78. (We use rep78 in the correlation even though it is not continuous to illustrate what happens
when you use correlate with variables with missing data.)
correlate price mpg weight rep78
(obs=69)
         |    price      mpg   weight    rep78
---------+------------------------------------
   price |   1.0000
     mpg |  -0.4559   1.0000
  weight |   0.5478  -0.8055   1.0000
   rep78 |   0.0066   0.4023  -0.4003   1.0000
Note that the output above said (obs=69). The correlate command drops data on a listwise basis, meaning that if any of
the variables are missing, then the entire observation is omitted from the correlation analysis.
We can use pwcorr (pairwise correlations) if we want to obtain correlations that delete missing data on a pairwise
basis instead of a listwise basis. We will use the obs option to show the number of observations used for calculating
each correlation.
pwcorr price mpg weight rep78, obs
          |    price      mpg   weight    rep78
----------+------------------------------------
    price |   1.0000
          |       74
          |
      mpg |  -0.4686   1.0000
          |       74       74
          |
   weight |   0.5386  -0.8072   1.0000
          |       74       74       74
          |
    rep78 |   0.0066   0.4023  -0.4003   1.0000
          |       69       69       69       69
          |
Note how the correlations that involve rep78 have an N of 69 compared to the other correlations that have an N of 74.
This is because rep78 has five missing values, so it only had 69 valid observations, but the other variables had no
missing data so they had 74 valid observations.
Regression
Let's look at doing regression analysis in Stata. For this example, let's drop the cases where rep78 is 1 or 2 or missing.
drop if (rep78 <= 2) | (rep78==.)
(15 observations deleted)
Now, let's predict mpg from price and weight. As you see below, weight is a significant predictor of mpg, but price is
not.
regress mpg price weight

  Source |       SS       df       MS              Number of obs =      59
---------+------------------------------           F(  2,    56) =   47.87
   Model |  1375.62097     2  687.810483           Prob > F      =  0.0000
Residual |  804.616322    56  14.3681486           R-squared     =  0.6310
---------+------------------------------           Adj R-squared =  0.6178
   Total |  2180.23729    58  37.5902981           Root MSE      =  3.7905
------------------------------------------------------------------------------
     mpg |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   price |  -.0000139   .0002108     -0.066   0.948      -.0004362    .0004084
  weight |   -.005828   .0007301     -7.982   0.000      -.0072906   -.0043654
   _cons |   39.08279   1.855011     21.069   0.000       35.36676    42.79882
------------------------------------------------------------------------------
What if we wanted to predict mpg from rep78 as well? rep78 is really more of a categorical variable than it is a
continuous variable. To include it in the regression, we should convert rep78 into dummy variables. Fortunately, Stata
makes dummy variables easily using tabulate. The gen(rep) option tells Stata that we want to generate dummy
variables from rep78 and we want the stem of the dummy variables to be rep.
tabulate rep78, gen(rep)
rep78 | Freq. Percent Cum.
------------+-----------------------------------
3 | 30 50.85 50.85
4 | 18 30.51 81.36
5 | 11 18.64 100.00
------------+-----------------------------------
Total | 59 100.00
Stata has created rep1 (1 if rep78 is 3), rep2 (1 if rep78 is 4) and rep3 (1 if rep78 is 5). We can use the tabulate
command to verify that the dummy variables were created properly.
tabulate rep78 rep1
| rep78== 3.0000
rep78 | 0 1 | Total
-----------+----------------------+----------
3 | 0 30 | 30
4 | 18 0 | 18
5 | 11 0 | 11
-----------+----------------------+----------
Total | 29 30 | 59
tabulate rep78 rep2
| rep78== 4.0000
rep78 | 0 1 | Total
-----------+----------------------+----------
3 | 30 0 | 30
4 | 0 18 | 18
5 | 11 0 | 11
-----------+----------------------+----------
Total | 41 18 | 59
tabulate rep78 rep3
| rep78== 5.0000
rep78 | 0 1 | Total
-----------+----------------------+----------
3 | 30 0 | 30
4 | 18 0 | 18
5 | 0 11 | 11
-----------+----------------------+----------
Total | 48 11 | 59
Now we can include rep1 and rep2 as dummy variables in the regression model.
regress mpg price weight rep1 rep2
      Source |       SS       df       MS              Number of obs =      59
-------------+------------------------------           F(  4,    54) =   26.04
       Model |  1435.91975     4  358.979938           Prob > F      =  0.0000
    Residual |  744.317536    54  13.7836581           R-squared     =  0.6586
-------------+------------------------------           Adj R-squared =  0.6333
       Total |  2180.23729    58  37.5902981           Root MSE      =  3.7126
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       price |  -.0001126   .0002133    -0.53   0.600    -.0005403    .0003151
      weight |   -.005107   .0008236    -6.20   0.000    -.0067584   -.0034557
        rep1 |  -2.886288   1.504639    -1.92   0.060    -5.902908    .1303314
        rep2 |   -2.88417   1.484817    -1.94   0.057    -5.861048    .0927086
       _cons |   39.89189   1.892188    21.08   0.000     36.09828     43.6855
------------------------------------------------------------------------------
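In Stata 11 and later, factor-variable notation offers an alternative to creating dummy variables by hand. The sketch below is an addition to the original module, assuming Stata 11 or later and the auto data available through sysuse. The ib5. prefix treats rep78 as categorical with 5 as the omitted category, which corresponds to leaving out the dummy for rep78 equal to 5.

```stata
* Rebuild the 59-observation sample used in this section
sysuse auto, clear
drop if (rep78 <= 2) | missing(rep78)
* rep78 entered as a categorical predictor with rep78 == 5 as the base level
regress mpg price weight ib5.rep78
```

With this notation no extra variables are added to the dataset, which can be convenient when the categorical variable has many levels.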
Analysis of Variance
If you wanted to do an analysis of variance looking at the differences in mpg among the three repair groups, you can
use the oneway command to do this.
oneway mpg rep78
                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981
Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007

If you include the tabulate option, you get mean mpg for the three groups, which shows that the group with the best
repair rating (rep78 of 5) also has the highest mpg (27.4).
oneway mpg rep78, tabulate

| Summary of mpg
rep78 | Mean Std. Dev. Freq.
------------+------------------------------------
3 | 19.433333 4.1413252 30
4 | 21.666667 4.9348699 18
5 | 27.363636 8.7323849 11
------------+------------------------------------
Total | 21.59322 6.1310927 59
                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981
Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007

If you want to include covariates, you need to use the anova command. The continuous(price weight) option tells
Stata that those variables are covariates.
anova mpg rep78 price weight, continuous(price weight)

                  Number of obs =      59     R-squared     =  0.6586
                  Root MSE      = 3.71263     Adj R-squared =  0.6333
   Source |  Partial SS    df       MS           F     Prob > F
----------+----------------------------------------------------
    Model |  1435.91975     4  358.979938      26.04     0.0000
          |
    rep78 |  60.2987853     2  30.1493926       2.19     0.1221
    price |   3.8421233     1   3.8421233       0.28     0.5997
   weight |  529.932889     1  529.932889      38.45     0.0000
          |
 Residual |  744.317536    54  13.7836581
----------+----------------------------------------------------
    Total |  2180.23729    58  37.5902981
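A note for readers on newer versions: as far as we are aware, from Stata 11 onward anova marks covariates with the c. prefix rather than with the continuous() option shown above. The following sketch, which is an addition to the original module, assumes Stata 11 or later and the auto data available through sysuse.

```stata
* Rebuild the 59-observation sample used in this section
sysuse auto, clear
drop if (rep78 <= 2) | missing(rep78)
* rep78 is treated as categorical; c. marks price and weight as continuous covariates
anova mpg rep78 c.price c.weight
```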

Stata Learning Module
An overview of Stata syntax
This module shows the general structure of Stata commands. We will do this using summarize as an example, although
this general structure applies to most Stata commands.
Let's first use the auto data file.
use auto
As you have seen, we can type summarize and it will give us summary statistics for all of the variables.
summarize
Variable |    Obs        Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------------
    make |      0
   price |     74    6165.257    2949.496      3291     15906
     mpg |     74     21.2973    5.785503        12        41
   rep78 |     69    3.405797    .9899323         1         5
  hdroom |     74    2.993243    .8459948       1.5         5
   trunk |     74    13.75676    4.277404         5        23
  weight |     74    3019.459    777.1936      1760      4840
  length |     74    187.9324    22.26634       142       233
    turn |     74    39.64865    4.399354        31        51
   displ |     74    197.2973    91.83722        79       425
  gratio |     74    3.014865    .4562871      2.19      3.89
 foreign |     74    .2972973    .4601885         0         1
It is also possible to name the variables you are interested in. Below, we get summary statistics just for mpg and
price.
summarize mpg price
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
mpg | 74 21.2973 5.785503 12 41
price | 74 6165.257 2949.496 3291 15906
We could further tell Stata to limit the summary statistics to just foreign cars by adding an if clause.
summarize mpg price if (foreign == 1)
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
mpg | 22 24.77273 6.611187 14 41
price | 22 6384.682 2621.915 3748 12990
The if clause can contain more than one condition. Here, we ask for summary statistics for the foreign cars which get
less than 30 miles per gallon.
summarize mpg price if (foreign == 1) & (mpg < 30)
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
mpg | 17 21.94118 3.896643 14 28
price | 17 6996.235 2674.552 3895 12990
We can use the detail option to ask Stata to give us more detail in the summary statistics. Notice that the detail option
goes after the comma. If the comma were omitted, Stata would give an error.
summarize mpg price if (foreign == 1) & (mpg < 30) , detail
                             mpg
-------------------------------------------------------------
      Percentiles      Smallest
 1%           14             14
 5%           14             17
10%           17             17       Obs                  17
25%           18             18       Sum of Wgt.          17
50%           23                      Mean           21.94118
                        Largest       Std. Dev.      3.896643
75%           25             25
90%           26             25       Variance       15.18382
95%           28             26       Skewness      -.4901235
99%           28             28       Kurtosis       2.201759
                            price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3895           3895
 5%         3895           4296
10%         4296           4499       Obs                  17
25%         5079           4697       Sum of Wgt.          17
50%         6229                      Mean           6996.235
                        Largest       Std. Dev.      2674.552
75%         8129           9690
90%        11995           9735       Variance        7153229
95%        12990          11995       Skewness       .9818272
99%        12990          12990       Kurtosis       2.930843
Note that even though we built these parts up one at a time, they don't have to go together. Let's look at some other
forms of the summarize command.
You can tell Stata which observation numbers you want using the in clause. Here we ask for summaries of observations
1 to 10. This is useful if you have a big data file and want to try out a command on a subset of all your observations.
summarize in 1/10
Variable |    Obs        Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------------
    make |      0
   price |     10      5517.4    2063.518      3799     10372
     mpg |     10        19.5     3.27448        15        26
   rep78 |      8       3.125    .3535534         3         4
  hdroom |     10         3.3    .7527727         2       4.5
   trunk |     10        14.7     3.88873        10        21
  weight |     10        3271    558.3796      2230      4080
  length |     10         194    19.32759       168       222
    turn |     10        40.2    3.259175        34        43
   displ |     10       223.9    71.77503       121       350
  gratio |     10       2.907    .3225264      2.41      3.58
 foreign |     10           0           0         0         0
Also, recall that you can ask Stata to perform summaries for foreign and domestic cars separately using by, as shown
below.
sort foreign
by foreign: summarize
-> foreign= 0
Variable |    Obs        Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------------
    make |      0
   price |     52    6072.423    3097.104      3291     15906
     mpg |     52    19.82692    4.743297        12        34
   rep78 |     48    3.020833     .837666         1         5
  hdroom |     52    3.153846    .9157578       1.5         5
   trunk |     52       14.75    4.306288         7        23
  weight |     52    3317.115    695.3637      1800      4840
  length |     52    196.1346    20.04605       147       233
    turn |     52    41.44231    3.967582        31        51
   displ |     52    233.7115    85.26299        86       425
  gratio |     52    2.806538    .3359556      2.19      3.58
 foreign |     52           0           0         0         0
-> foreign= 1
Variable |    Obs        Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------------
    make |      0
   price |     22    6384.682    2621.915      3748     12990
     mpg |     22    24.77273    6.611187        14        41
   rep78 |     21    4.285714    .7171372         3         5
  hdroom |     22    2.613636    .4862837       1.5       3.5
   trunk |     22    11.40909    3.216906         5        16
  weight |     22    2315.909    433.0035      1760      3420
  length |     22    168.5455    13.68255       142       193
    turn |     22    35.40909    1.501082        32        38
   displ |     22    111.2273    24.88054        79       163
  gratio |     22    3.507273    .2969076      2.98      3.89
 foreign |     22           1           0         1         1
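The sort and by steps above can also be combined. As a small sketch that is not part of the original module (assuming the auto data are available through sysuse), the bysort prefix sorts the data and then runs the command once per group:

```stata
* Reload the example data, discarding whatever is in memory
sysuse auto, clear
* Sort by foreign and summarize mpg and price within each group in one step
bysort foreign: summarize mpg price
```

Here the variable list is limited to mpg and price only to keep the output short; leaving it off would summarize every variable, as in the by example above.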
Let's review all those pieces.
A command can be preceded with a by clause, as shown below.
summarize preceded with by
by foreign: summarize
There are many parts that can come after a command; they are each presented separately below.
summarize with names of variables
summarize mpg price
summarize with in specifying records to summarize.
summarize in 1/10
summarize with simple if specifying records to summarize.
summarize if (foreign == 1)
summarize with complex if specifying records to summarize.
summarize if (foreign == 1) & (mpg > 30)
summarize followed by option(s).
summarize , detail
So, putting it all together, the general syntax of the summarize command can be described as:
[by varlist:] summarize [varlist] [in range] [if exp] , [options]
Understanding the overall syntax of Stata commands helps you remember them and use them more effectively, and it
also helps you understand the help in Stata. All the extra stuff about by, if and in could be confusing. Let's have a look
at the help for summarize; it makes more sense knowing what the by, if and in parts mean.
help summarize
-------------------------------------------------------------------------------
help for summarize                                      (manual: [R] summarize)
-------------------------------------------------------------------------------
Summary statistics
------------------
[by varlist:] summarize [varlist] [weight] [if exp] [in range]
        [, { detail | meanonly } format ]
Stata Learning Module
Using and saving files in Stata
Using and saving Stata data files
The use command gets a Stata data file from disk and places it in memory so you can analyze and/or modify it. A data
file must be read into memory before you can analyze it. It is kind of like when you open a Word document; you need
to read a Word document into Word before you can work with it. The command below gets the Stata data file
called auto.dta from disk and places it in memory so we can analyze and/or modify it. Since Stata data files end with
.dta you need only say use auto and Stata knows to read in the file called auto.dta. (The command shown here is sysuse,
which loads the copy of auto.dta that ships with Stata; use auto works the same way for a file in the current directory.)
sysuse auto
The describe command tells you information about the data that is currently sitting in memory.
describe
Contains data from auto.dta
  obs:            74
 vars:            12                          17 Feb 1999 10:49
 size:         3,108 (99.6% of memory free)
-------------------------------------------------------------------------------
  1. make      str17  %17s
  2. price     int    %9.0g
  3. mpg       byte   %9.0g
  4. rep78     byte   %9.0g
  5. hdroom    float  %9.0g
  6. trunk     byte   %9.0g
  7. weight    int    %9.0g
  8. length    int    %9.0g
  9. turn      byte   %9.0g
 10. displ     int    %9.0g
 11. gratio    float  %9.0g
 12. foreign   byte   %9.0g
-------------------------------------------------------------------------------
Sorted by:
Now that the data is in memory, we can analyze it. For example, the summarize command gives summary statistics for
the data currently in memory.
summarize
Variable |    Obs        Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------------
    make |      0
   price |     74    6165.257    2949.496      3291     15906
     mpg |     74     21.2973    5.785503        12        41
   rep78 |     69    3.405797    .9899323         1         5
  hdroom |     74    2.993243    .8459948       1.5         5
   trunk |     74    13.75676    4.277404         5        23
  weight |     74    3019.459    777.1936      1760      4840
  length |     74    187.9324    22.26634       142       233
    turn |     74    39.64865    4.399354        31        51
   displ |     74    197.2973    91.83722        79       425
  gratio |     74    3.014865    .4562871      2.19      3.89
 foreign |     74    .2972973    .4601885         0         1
Let's make a change to the data in memory. We will compute a variable called price2, which will be double the value of
price.
generate price2 = 2*price
If we use the describe command again, we see the variable we just created is part of the data in memory. We also see a
note from Stata saying dataset has changed since last saved. Stata knows that the data in memory has changed, and
would need to be saved to avoid losing the changes. It is like when you are editing a Word document; if you don't save
the data, any changes you make will be lost. If we shut the computer off before saving the changes, the changes we
made would be lost.
describe
Contains data from auto.dta
  obs:            74
 vars:            13                          17 Feb 1999 10:49
 size:         3,404 (99.6% of memory free)
-------------------------------------------------------------------------------
  1. make      str17  %17s
  2. price     int    %9.0g
  3. mpg       byte   %9.0g
  4. rep78     byte   %9.0g
  5. hdroom    float  %9.0g
  6. trunk     byte   %9.0g
  7. weight    int    %9.0g
  8. length    int    %9.0g
  9. turn      byte   %9.0g
 10. displ     int    %9.0g
 11. gratio    float  %9.0g
 12. foreign   byte   %9.0g
 13. price2    float  %9.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
The save command is used to save the data in memory permanently on disk. Let's save this data and call it auto2 (Stata
will save it as auto2.dta).
save auto2
file auto2.dta saved
Let's make another change to the dataset. We will compute a variable called price3 which will be three times the value
of price.
generate price3 = 3*price
Let's try to save this data again to auto2.
save auto2
file auto2.dta already exists
r(602);
Did you see how Stata said file auto2.dta already exists? Stata is worried that you will accidentally overwrite your
data file. You need to use the replace option to tell Stata that you know that the file exists and you want to replace it.
save auto2, replace
file auto2.dta saved
Let's make another change to the data in memory by creating a variable called price4 that is four times the price.
generate price4 = price*4
Suppose we want to use the original auto file and we don't care if we lose the changes we just made in memory (i.e.,
losing the variable price4). We can try to use the auto file.
sysuse auto
no; data in memory would be lost
r(4);
See how Stata refused to use the file, saying no; data in memory would be lost? Stata did not want you to lose the
changes that you made to the data sitting in memory. If you really want to discard the changes in memory, then you
need to use the clear option on the use command, as shown below.
sysuse auto, clear
Stata tries to protect you from losing your data by doing the following:
1. If you want to save a file over an existing file, you need to use the replace option, e.g., save auto, replace.
2. If you try to use a file and the file in memory has unsaved changes, you need to use the clear option to tell Stata that
you want to discard the changes, e.g., use auto, clear.
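These two safeguards matter most in do-files that are run more than once. The lines below are a minimal sketch of such a do-file, added here for illustration; they assume the auto data are available through sysuse and that overwriting a file named auto2.dta in the current directory is acceptable.

```stata
* Start from a clean copy of the example data, discarding anything in memory
sysuse auto, clear
* Add a derived variable, as in the walkthrough
generate price2 = 2*price
* replace lets the do-file be rerun without the "already exists" error
save auto2, replace
* The data were just saved, so clear is not strictly needed here, but it is harmless
sysuse auto, clear
```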
Before we move on to the next topic, let's clear out the data in memory.
clear
Using files larger than 1 megabyte
When you use a data file, Stata reads the entire file into memory. By default, Stata limits the size of data in memory to
1 megabyte (PC version 6.0 Intercooled). You can view the amount of memory that Stata has reserved for data with the
memory command.
memory
Total memory                       1,048,576 bytes    100.00%
overhead (pointers)                        0            0.00%
data                                       0            0.00%
                                  ------------
data + overhead                            0            0.00%
programs, saved results, etc.          1,152            0.11%
                                  ------------
Total                                  1,152            0.11%
Free                               1,047,424           99.89%
If you try to use a file which exceeds the amount of memory Stata has allocated for data, it will give you an error
message like this.
no room to add more observations
r(901);
You can increase the amount of memory that Stata has allocated to data using the set memory command. For example,
if you had a data file which was 1.5 megabytes, you can set the memory to, say, 2 megabytes as shown below.
set memory 2m
(2048k)
Once you have increased the memory, you should be able to use the data file if you have allocated enough memory for
it.
Summary
To use the auto file from disk and read it into memory
sysuse auto
To save the file auto from memory to disk
save auto
To save a file if the file auto already exists
save auto, replace
To use a file auto and clear out the current data in memory
sysuse auto, clear
To clear out the data in memory, discarding any unsaved changes
clear
To allocate 2 megabytes of memory for a data file.
set memory 2m
To view the allocation of memory to data and how much is used.
memory
Stata Learning Module
Inputting your data into Stata
This module will show how to input your data into Stata. This covers inputting data with comma delimited, tab
delimited, space delimited, and fixed column data.
1. Typing data into the Stata editor
One of the easiest methods for getting data into Stata is using the Stata data editor, which resembles an Excel
spreadsheet. It is useful when your data is on paper and needs to be typed in, or if your data is already typed into an
Excel spreadsheet. To learn more about the Stata data editor, see the edit module.
2. Comma/tab separated file with variable names on line 1
Two common file formats for raw data are comma separated files and tab separated files. Such files are commonly
made from spreadsheet programs like Excel. Consider the comma delimited file shown below.
type auto2.raw
make, mpg, weight, price
AMC Concord, 22, 2930, 4099
AMC Pacer, 17, 3350, 4749
AMC Spirit, 22, 2640, 3799
Buick Century, 20, 3250, 4816
Buick Electra, 15,4080, 7827
This file has two characteristics:
- The first line has the names of the variables separated by commas,
- The following lines have the values for the variables, also separated by commas.
This kind of file can be read using the insheet command, as shown below.
insheet using auto2.raw
(4 vars, 5 obs)
We can check to see if the data came in right using the list command.
list
     make             mpg   weight   price
  1. AMC Concord       22     2930    4099
  2. AMC Pacer         17     3350    4749
  3. AMC Spirit        22     2640    3799
  4. Buick Century     20     3250    4816
  5. Buick Electra     15     4080    7827
Since you will likely have more observations, you can use in to list just a subset of observations. Below, we list
observations 1 through 3.
list in 1/3
     make             mpg   weight   price
  1. AMC Concord       22     2930    4099
  2. AMC Pacer         17     3350    4749
  3. AMC Spirit        22     2640    3799
Now that the file has been read into Stata, you can save it with the save command (we will skip doing that step).
The exact same insheet command could be used to read a tab delimited file. The insheet command is clever because it
can figure out whether you have a comma delimited or tab delimited file, and then read it. (However, insheet could
not handle a file that uses a mixture of commas and tabs as delimiters.)
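A note for readers on newer versions: to our knowledge, from Stata 13 onward insheet has been superseded by import delimited (insheet still runs). The sketch below is an addition to the original module; it assumes Stata 13 or later and that auto2.raw sits in the current working directory.

```stata
* Read the comma separated file, taking variable names from line 1
* clear discards any data currently in memory
import delimited using auto2.raw, varnames(1) clear
* Check that the data came in as expected
list
```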
Before starting the next section, let's clear out the existing data in memory.
clear
3. Comma/tab separated file (no variable names in file)
Consider a file that is identical to the one we examined in the previous section, but it does not have the variable names
on line 1.
type auto3.raw
AMC Concord, 22, 2930, 4099
AMC Pacer, 17, 3350, 4749
AMC Spirit, 22, 2640, 3799
Buick Century, 20, 3250, 4816
Buick Electra, 15,4080, 7827
This file can be read using the insheet command as shown below.
insheet using auto3.raw
(4 vars, 5 obs)
But where did Stata get the variable names? If Stata does not have names for the variables, it names them v1, v2, v3
etc., as you can see below.
list
     v1                v2       v3      v4
  1. AMC Concord       22     2930    4099
  2. AMC Pacer         17     3350    4749
  3. AMC Spirit        22     2640    3799
  4. Buick Century     20     3250    4816
  5. Buick Electra     15     4080    7827
Let's clear out the data in memory, and then try reading the data again.
clear
Now, let's try reading the data and tell Stata the names of the variables on the insheet command.
insheet make mpg weight price using auto3.raw
(4 vars, 5 obs)
As the list command shows, Stata used the variable names supplied on the insheet command.
list
     make             mpg   weight   price
  1. AMC Concord       22     2930    4099
  2. AMC Pacer         17     3350    4749
  3. AMC Spirit        22     2640    3799
  4. Buick Century     20     3250    4816
  5. Buick Electra     15     4080    7827
The insheet command works equally well on files which use tabs as separators. Stata examines the file and determines
whether commas or tabs are being used as separators and reads the file appropriately.
Now that the file has been read into Stata, you can save it with the save command (we will skip doing that step).
Let's clear out the data in memory before going to the next section.
clear
4. Space separated file
Consider a file where the variables are separated by spaces like the one shown below.
type auto4.raw
"AMC Concord" 22 2930 4099
"AMC Pacer" 17 3350 4749
"AMC Spirit" 22 2640 3799
"Buick Century" 20 3250 4816
"Buick Electra" 15 4080 7827
Note that the make of car is contained within quotation marks. This is necessary because the names contain spaces
within them. Without the quotes, Stata would think AMC is the make and Concord is the mpg. If the makes did not
have embedded spaces, the quotation marks would not be needed.
This file can be read with the infile command as shown below.
infile str13 make mpg weight price using auto4.raw
(5 observations read)
You may be asking yourself, where did the str13 come from? Since make is a character variable, we need to tell Stata
that it is a character variable, and how long it can be. The str13 tells Stata it is a string variable and that it could be up
to 13 characters wide.
The list command confirms that the data was read correctly.
list
     make            mpg   weight   price
  1. AMC Concord      22     2930    4099
  2. AMC Pacer        17     3350    4749
  3. AMC Spirit       22     2640    3799
  4. Buick Century    20     3250    4816
  5. Buick Electra    15     4080    7827
Now that the file has been read into Stata, you can save it with the save command (we will skip doing that step).
Let's clear out the data in memory before moving on to the next section.
clear
5. Fixed format file
Consider a file using fixed column data like the one shown below.
type auto5.raw
AMC Concord   22 2930 4099
AMC Pacer     17 3350 4749
AMC Spirit    22 2640 3799
Buick Century 20 3250 4816
Buick Electra 15 4080 7827
Note that the variables are clearly defined by the column(s) in which they are located. Also, note that the make of car is
not contained within quotation marks. The quotations are not needed because the columns define where the make begins
and ends, and the embedded spaces no longer create confusion.
This file can be read with the infix command as shown below.
infix str make 1-13 mpg 15-16 weight 18-21 price 23-26 using auto5.raw
(5 observations read)
Here again we need to tell Stata that make is a string variable by preceding make with str. We did not need to indicate
the length since Stata can infer that make can be up to 13 characters wide based on the column locations.
The list command confirms that the data was read correctly.
list
     make            mpg   weight   price
  1. AMC Concord      22     2930    4099
  2. AMC Pacer        17     3350    4749
  3. AMC Spirit       22     2640    3799
  4. Buick Century    20     3250    4816
  5. Buick Electra    15     4080    7827
Now that the file has been read into Stata, you can save it with the save command (we will skip doing that step).
Let's clear out the data in memory before moving on to the next section.
clear
6. Other methods of getting data into Stata
This does not cover all possible methods of getting raw data into Stata, but does cover many common situations. See
the Stata Users Guide for more comprehensive information on reading raw data into Stata.
Another method that should be mentioned is the use of data conversion programs. These programs can convert data
from one file format into another file format. For example, they could directly create a Stata file from an Excel
Spreadsheet, a Lotus Spreadsheet, an Access database, a Dbase database, a SAS data file, an SPSS system file, etc. Two
such examples are Stat Transfer and DBMS Copy. Both of these products are available on SSC PCs and DBMS Copy is
available on Nicco and Aristotle.
Finally, if you are using Nicco, Aristotle or the RS/6000 Cluster, there is a command specifically for converting SAS
data into Stata called sas2stata. If you have SAS data you want to convert to Stata, this may be a useful way to get your
SAS data into Stata.
7. Summary
Bring up the Stata data editor for typing data in.
. edit
Read in the comma or tab delimited file called auto2.raw taking the variable names from the first line of data.
. insheet using auto2.raw, clear
Read in the comma or tab delimited file called auto3.raw naming the variables make, mpg, weight and price.
. insheet make mpg weight price using auto3.raw, clear
Read in the space separated file named auto4.raw. The variable make is surrounded by quotes because it has embedded
blanks.
. infile str13 make mpg weight price using auto4.raw, clear
Read in the fixed format file named auto5.raw.
. infix str make 1-13 mpg 15-16 weight 18-21 using auto5.raw, clear
Other methods
DBMS/Copy, Stat Transfer, sas2stata, and Stata Users Guide.
Stata Learning Module
Using dates in Stata
This module will show how to use date variables, date functions, and date display formats in Stata.
Converting dates from raw data using the "date()" function
The trick to inputting dates in Stata is to forget they are dates, and treat them as character strings, and then later convert
them into a Stata date variable. You might have the following date data in your raw data file.
type dates1.raw
John  1 Jan 1960
Mary 11 Jul 1955
Kate 12 Nov 1962
Mark  8 Jun 1959
You can read these data by typing:
infix str name 1-4 str bday 6-17 using dates1.raw
(4 observations read)
Using the list command, you can see that the date information has been read correctly into bday.
list
     name   bday
  1. John    1 Jan 1960
  2. Mary   11 Jul 1955
  3. Kate   12 Nov 1962
  4. Mark    8 Jun 1959
Since bday is a string variable, you cannot do any kind of date computations with it until you make a date variable
from it. You can generate a date version of bday using the date() function. The example below creates a date variable
called birthday from the character variable bday. The syntax is slightly different depending on which version of Stata
you are using. The difference is in how the pattern is specified. In Stata 9 it should be lower case (e.g., "dmy") and in
Stata 10, it should be upper case for day, month, and year (e.g., "DMY") but lower case if you want to specify hours,
minutes or seconds (e.g., "DMYhms"). Our data are in the order day, month, year, so we use "DMY" (or "dmy" if you
are using Stata 9) within the date() function. (Unless otherwise noted, all other Stata commands on this page are the
same for versions 9 and 10.)
In Stata version 9:
generate birthday=date(bday,"dmy")
In Stata version 10:
generate birthday=date(bday,"DMY")
Let's have a look at both bday and birthday.
list
     name   bday          birthday
  1. John    1 Jan 1960          0
  2. Mary   11 Jul 1955      -1635
  3. Kate   12 Nov 1962       1046
  4. Mark    8 Jun 1959       -207
The values for birthday may seem confusing. The value of birthday for John is 0 and the value of birthday for Mark is
-207. Dates are actually stored as the number of days from Jan 1, 1960, which is convenient for the computer storing
and performing date computations, but is difficult for you and me to read.
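As a quick check of this numbering, the sketch below evaluates two dates directly with display. It assumes Stata 10 or later (upper-case "DMY" mask) and uses %td, the current name of the daily date format; neither detail comes from the data files above.

```stata
* Elapsed dates count days from 01jan1960, which is day 0
display mdy(1, 1, 1960)
* Mary's birthday from the listing above, as an elapsed date
display date("11 Jul 1955", "DMY")
* The same value shown with a daily date format
display %td date("11 Jul 1955", "DMY")
```

Dates before 1960 simply come out negative, which is why Mary and Mark have negative values in the listing.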
We can tell Stata that birthday should be displayed using the %d format to make it easier for humans to read.
format birthday %d
list
     name   bday           birthday
  1. John    1 Jan 1960   01jan1960
  2. Mary   11 Jul 1955   11jul1955
  3. Kate   12 Nov 1962   12nov1962
  4. Mark    8 Jun 1959   08jun1959
The date() function is very flexible and can handle dates written in almost any manner. For example, consider the file
dates2.raw.
type dates2.raw
John Jan 1 1960
Mary 07/11/1955
Kate 11.12.1962
Mark Jun/8 1959
These dates are messy, but they are consistent. Even though the formats look different, it is always a month day year
separated by a delimiter (e.g., space slash dot or dash). We can try using the syntax from above to read in our new
dates. Note that, as discussed above, for Stata version 10 the order of the date is declared in upper case letters (i.e.,
"MDY") while for version 9 it is declared in all lower case (i.e., "mdy").
clear
infix str name 1-4 str bday 6-17 using dates2.raw
(4 observations read)
generate birthday=date(bday,"MDY")
format birthday %d
list
     name   bday          birthday
  1. John   Jan 1 1960   01jan1960
  2. Mary   07/11/1955   11jul1955
  3. Kate   11.12.1962   12nov1962
  4. Mark   Jun/8 1959   08jun1959
Stata was able to read those dates without a problem. Let's try an even tougher set of dates. For example, consider the
dates in dates3.raw.
type dates3.raw
4-12-1990
4.12.1990
Apr 12, 1990
Apr12,1990
April 12, 1990
4/12.1990
Apr121990
Let's try reading these dates and see how Stata handles them. Again, remember that for Stata version 10 dates are
declared "MDY" while for version 9 they are declared "mdy".
clear
infix str bday 1-20 using dates3.raw
(7 observations read)
generate birthday=date(bday,"MDY")
(1 missing value generated)
format birthday %d
list
     bday              birthday
  1. 4-12-1990        12apr1990
  2. 4.12.1990        12apr1990
  3. Apr 12, 1990     12apr1990
  4. Apr12,1990       12apr1990
  5. April 12, 1990   12apr1990
  6. 4/12.1990        12apr1990
  7. Apr121990                .
As you can see, Stata was able to handle almost all of those crazy date formats. It was able to handle Apr12,1990 even
though there was not a delimiter between the month and day (Stata was able to figure it out since the month was
character and the day was a number). The only date that did not work was Apr121990 and that is because there was no
delimiter between the day and year. As you can see, the date() function can handle just about any date as long as there
are delimiters separating the month, day and year. In certain cases Stata can read all-numeric dates entered without
delimiters; see help dates for more information.
Converting dates from raw data using the mdy() function
In some cases, you may have the month, day, and year stored as numeric variables in a dataset. For example, you may
have the following data for birth dates from dates4.raw.
type dates4.raw
 7 11 1948
 1  1 1960
10 15 1970
12 10 1971
You can read in this data using the following syntax to create a separate variable for month, day and year.
clear
infix month 1-2 day 4-5 year 7-10 using dates4.raw
(4 observations read)
list
     month   day   year
  1.     7    11   1948
  2.     1     1   1960
  3.    10    15   1970
  4.    12    10   1971
A Stata date variable can be created using the mdy() function as shown below.
generate birthday=mdy(month,day,year)
Let's format birthday using the %d format so it displays better.
format birthday %d
list
     month   day   year    birthday
  1.     7    11   1948   11jul1948
  2.     1     1   1960   01jan1960
  3.    10    15   1970   15oct1970
  4.    12    10   1971   10dec1971
Consider the data in dates5.raw, which is the same as dates4.raw except that only two digits are used to signify the
year.
type dates5.raw
 7 11 48
 1  1 60
10 15 70
12 10 71
Let's try reading these dates just like we read dates4.raw.
clear
infix month 1-2 day 4-5 year 7-10 using dates5.raw
(4 observations read)
generate birthday=mdy(month,day,year)
(4 missing values generated)
format birthday %d
list
     month   day   year   birthday
  1.     7    11     48          .
  2.     1     1     60          .
  3.    10    15     70          .
  4.    12    10     71          .
As you can see, the values for birthday are all missing. This is because Stata assumes that the years were literally 48,
60, 70 and 71 (it does not assume they are 1948, 1960, 1970 and 1971). You can force Stata to assume the century
portion is 1900 by adding 1900 to the year as shown below (note that we use replace instead of generate since the
variable birthday already exists).
replace birthday=mdy(month,day,year+1900)
(4 real changes made)
format birthday %d
list
     month   day   year    birthday
  1.     7    11     48   11jul1948
  2.     1     1     60   01jan1960
  3.    10    15     70   15oct1970
  4.    12    10     71   10dec1971
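If two-digit years can fall in more than one century, adding a fixed 1900 is not enough. The sketch below is one possible approach, not part of the original module: it types the dates5.raw values in with input and uses cond() with a hypothetical pivot of 30 (years below 30 are read as 20xx, all others as 19xx). The pivot value and the variable name year4 are assumptions made for illustration.

```stata
clear
input month day year
 7 11 48
 1  1 60
10 15 70
12 10 71
end
* Hypothetical pivot: two-digit years below 30 become 20xx, others 19xx
generate year4 = cond(year < 30, 2000 + year, 1900 + year)
generate birthday = mdy(month, day, year4)
format birthday %td
list
```

With these four rows every year is 30 or above, so the result matches the year+1900 approach shown above.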
Computations with elapsed dates
Date variables make computations involving dates very convenient. For example, to calculate everyone's age on
January 1, 2000 simply use the following conversion.
generate age2000=( mdy(1,1,2000) - birthday ) / 365.25
list
     month   day   year    birthday    age2000
  1.     7    11     48   11jul1948   51.47433
  2.     1     1     60   01jan1960         40
  3.    10    15     70   15oct1970   29.21287
  4.    12    10     71   10dec1971   28.06023
Please note that this formula for age does not work well over very short time spans. For example, the age for a child on
their first birthday will be less than one due to using 365.25. There are formulas that are more exact but also much more
complex. Here is an example courtesy of Dan Blanchette.
generate altage = floor(([ym(2000, 1) - ym(year(birthday), month(birthday))] - [1 < day(birthday)]) / 12)
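To make the short-span problem concrete, here is a small sketch that compares the two approaches for a child born on 1 Jan 1999 and evaluated on 1 Jan 2000. The scalar names s_dob and s_end are hypothetical, and the month-based formula is written with parentheses for grouping and with day(s_end) in place of the literal 1, so that it works for any end date.

```stata
* A child born 01jan1999, evaluated on their first birthday
scalar s_dob = mdy(1, 1, 1999)
scalar s_end = mdy(1, 1, 2000)
* 365 elapsed days divided by 365.25 is slightly less than 1
display (s_end - s_dob) / 365.25
* Month-based count of completed years
display floor(((ym(year(s_end), month(s_end)) - ym(year(s_dob), month(s_dob))) - (day(s_end) < day(s_dob))) / 12)
```

The first expression falls just short of one year, while the month-based count reports one completed year.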
Other date functions
Given a date variable, one can have the month, day and year returned separately if desired, using the month(), day()
and year() functions, respectively.
generate m=month(birthday)
generate d=day(birthday)
generate y=year(birthday)
list m d y birthday
      m    d      y    birthday
  1.  7   11   1948   11jul1948
  2.  1    1   1960   01jan1960
  3. 10   15   1970   15oct1970
  4. 12   10   1971   10dec1971
If you'd like to return the day of the week for a date variable, use the dow() function (where 0=Sunday, 1=Monday
etc.).
gen week_d=dow(birthday)
list birthday week_d
      birthday   week_d
  1. 11jul1948        0
  2. 01jan1960        5
  3. 15oct1970        4
  4. 10dec1971        5
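The numeric codes from dow() are easier to read with a value label attached. The following is a minimal sketch that re-enters the four birth dates with input so it runs on its own; the label name dowlbl is hypothetical.

```stata
clear
input month day year
 7 11 1948
 1  1 1960
10 15 1970
12 10 1971
end
generate birthday = mdy(month, day, year)
format birthday %td
generate week_d = dow(birthday)
* Hypothetical label name; dow() codes run from 0 (Sunday) to 6 (Saturday)
label define dowlbl 0 "Sunday" 1 "Monday" 2 "Tuesday" 3 "Wednesday" 4 "Thursday" 5 "Friday" 6 "Saturday"
label values week_d dowlbl
list birthday week_d
```

The stored values of week_d are still 0 through 6; only the display changes.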
Summary
The date() function converts strings containing dates to date variables. The syntax varies slightly by version.
In Stata version 9:
gen date2 = date(date, "dmy")
In Stata version 10:
gen date2 = date(date, "DMY")
The mdy() function takes three numeric arguments (month, day, year) and converts them to a date variable.
generate birthday=mdy(month,day,year)
You can display elapsed times as actual dates with display formats such as the %d format.
format birthday %d
Other date functions include the month(), day(), year(), and dow() functions. For online help with dates, type help
dates at the command line. For more detailed explanations about how Stata handles dates and date functions, please
refer to the Stata Users Guide.
Stata Learning Module
Labeling data
This module will show how to create labels for your data. Stata allows you to label your data file (data label), to label
the variables within your data file (variable labels), and to label the values for your variables (value labels). Let's use
a file called autolab that does not have any labels.
use http://www.ats.ucla.edu/stat/stata/modules/autolab.dta, clear
Let's use the describe command to verify that indeed this file does not have any labels.
describe
Contains data from autolab.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          23 Oct 2008 13:36
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s
price           int    %8.0gc
mpg             int    %8.0g
rep78           int    %8.0g
headroom        float  %6.1f
trunk           int    %8.0g
weight          int    %8.0gc
length          int    %8.0g
turn            int    %8.0g
displacement    int    %8.0g
gear_ratio      float  %6.2f
foreign         byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:
Let's use the label data command to add a label describing the data file. This label can be up to 80 characters long.
label data "This file contains auto data for the year 1978"
The describe command shows that this label has been applied to the version that is currently in memory.
describe
Contains data from autolab.dta
  obs:            74                          This file contains auto data for the year 1978
 vars:            12                          23 Oct 2008 13:36
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s
price           int    %8.0gc
mpg             int    %8.0g
rep78           int    %8.0g
headroom        float  %6.1f
trunk           int    %8.0g
weight          int    %8.0gc
length          int    %8.0g
turn            int    %8.0g
displacement    int    %8.0g
gear_ratio      float  %6.2f
foreign         byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:
Let's use the label variable command to assign labels to the variables rep78, price, mpg and foreign.
label variable rep78 "the repair record from 1978"
label variable price "the price of the car in 1978"
label variable mpg "the miles per gallon for the car"
label variable foreign "the origin of the car, foreign or domestic"
The describe command shows these labels have been applied to the variables.
describe
Contains data from autolab.dta
  obs:            74                          This file contains auto data for the year 1978
 vars:            12                          23 Oct 2008 13:36
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s
price           int    %8.0gc                 the price of the car in 1978
mpg             int    %8.0g                  the miles per gallon for the car
rep78           int    %8.0g                  the repair record from 1978
headroom        float  %6.1f
trunk           int    %8.0g
weight          int    %8.0gc
length          int    %8.0g
turn            int    %8.0g
displacement    int    %8.0g
gear_ratio      float  %6.2f
foreign         byte   %8.0g                  the origin of the car, foreign or domestic
-------------------------------------------------------------------------------
Sorted by:
Let's make a value label called foreignl to label the values of the variable foreign. This is a two step process where you
first define the label, and then you assign the label to the variable. The label define command below creates the value
label called foreignl that associates 0 with domestic car and 1 with foreign car.
label define foreignl 0 "domestic car" 1 "foreign car"
The label values command below associates the variable foreign with the label foreignl.
label values foreign foreignl
If we use the describe command, we can see that the variable foreign has a value label called foreignl assigned to it.
describe
Contains data from autolab.dta
  obs:            74                          This file contains auto data for the year 1978
 vars:            12                          23 Oct 2008 13:36
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s
price           int    %8.0gc                 the price of the car in 1978
mpg             int    %8.0g                  the miles per gallon for the car
rep78           int    %8.0g                  the repair record from 1978
headroom        float  %6.1f
trunk           int    %8.0g
weight          int    %8.0gc
length          int    %8.0g
turn            int    %8.0g
displacement    int    %8.0g
gear_ratio      float  %6.2f
foreign         byte   %12.0g      foreignl   the origin of the car, foreign or domestic
-------------------------------------------------------------------------------
Sorted by:
Now when we use the table foreign command, it shows the labels domestic car and foreign car instead of just 0
and 1.
table foreign
-------------+-----------
the origin   |
of the car,  |
foreign or   |
domestic     |      Freq.
-------------+-----------
domestic car |         52
 foreign car |         22
-------------+-----------
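Two commands are handy for inspecting a value label once it is attached: label list shows the mapping, and the nolabel option of tabulate shows the underlying codes. The sketch below assumes the auto data shipped with Stata (sysuse auto) rather than autolab.dta; in that file foreign already carries a label named origin, which label values simply replaces.

```stata
sysuse auto, clear
label define foreignl 0 "domestic car" 1 "foreign car"
label values foreign foreignl
* Show the value-to-text mapping stored in the label
label list foreignl
* Tabulate with the labels, then with the underlying 0/1 codes
tabulate foreign
tabulate foreign, nolabel
```

The nolabel view is useful when writing if conditions, since conditions refer to the numeric codes and not to the label text.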
Value labels are used in other commands as well. For example, below we issue the ttest , by(foreign) command, and
the output labels the groups as domestic and foreign (instead of 0 and 1).
ttest mpg , by(foreign)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
domestic |      52    19.82692     .657777    4.743297    18.50638    21.14747
 foreign |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
Degrees of freedom: 72
              Ho: mean(domestic) - mean(foreign) = diff = 0
   Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
     t =  -3.6308               t =  -3.6308             t =  -3.6308
 P < t =   0.0003         P > |t| =   0.0005         P > t =   0.9997
One very important note: These labels are assigned to the data that is currently in memory. To make these changes
permanent, you need to save the data. When you save the data, all of the labels (data labels, variable labels, value
labels) will be saved with the data file.
Summary
Assign a label to the data file currently in memory.
label data "1978 auto data"
Assign a label to the variable foreign.
label variable foreign "the origin of the car, foreign or domestic"
Create the value label foreignl and assign it to the variable foreign.
label define foreignl 0 "domestic car" 1 "foreign car"
label values foreign foreignl
Stata Learning Module
Creating and recoding variables
This module shows how to create and recode variables. In Stata you can create new variables with generate and you
can modify the values of an existing variable with replace and with recode.
Computing new variables using generate and replace
Let's use the auto data for our examples. In this section we will see how to compute variables with generate and
replace.
use auto
The variable length contains the length of the car in inches. Below we see summary statistics for length.
summarize length
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  length |      74    187.9324   22.26634        142        233
Let's use the generate command to make a new variable that has the length in feet instead of inches, called len_ft.
generate len_ft = length / 12
We should emphasize that generate is for creating a new variable. For an existing variable, you need to use the replace
command (not generate). As shown below, we use replace to repeat the assignment to len_ft.
replace len_ft = length / 12
(49 real changes made)
summarize length len_ft
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  length |      74    187.9324   22.26634        142        233
  len_ft |      74    15.66104   1.855528   11.83333   19.41667
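The difference between the two commands shows up as an error when the wrong one is used. The sketch below, which assumes the auto data shipped with Stata (sysuse auto), uses capture to trap each error and display its return code instead of stopping. The variable name no_such_var is hypothetical and deliberately does not exist.

```stata
sysuse auto, clear
generate len_ft = length / 12
* generate on a variable that already exists fails; capture traps the error
capture generate len_ft = length / 12
display _rc
* replace on a variable that does not exist fails as well
capture replace no_such_var = length / 12
display _rc
* replace on the existing variable succeeds
replace len_ft = length / 12
```

A nonzero return code after capture means the command was rejected and the data were left unchanged.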
The syntax of generate and replace is identical, except:
- generate works when the variable does not yet exist and will give an error if the variable already exists.
- replace works when the variable already exists, and will give an error if the variable does not yet exist.
Suppose we wanted to make a variable called length2 which has length squared.
generate length2 = length^2
summarize length2
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 length2 |      74    35807.69   8364.045      20164      54289
Or we might want to make loglen which is the natural log of length.
generate loglen = log(length)
summarize loglen
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  loglen |      74    5.229035   .1201383   4.955827   5.451038
Let's get the mean and standard deviation of length so that we can make Z-scores of length.
summarize length
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  length |      74    187.9324   22.26634        142        233
The mean is 187.93 and the standard deviation is 22.27, so zlength can be computed as shown below.
generate zlength = (length - 187.93) / 22.27
summarize zlength
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 zlength |      74    .0001092   .9998357  -2.062416   2.023799
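The mean of zlength above is .0001092 rather than 0 because the mean and standard deviation were typed in rounded to two decimals. A minimal sketch of an alternative, assuming the auto data shipped with Stata (sysuse auto): summarize leaves its results in r(mean) and r(sd), and egen's std() function standardizes in one step. The variable name zlength2 is hypothetical.

```stata
sysuse auto, clear
summarize length
* r(mean) and r(sd) hold the full-precision results of the last summarize
generate zlength = (length - r(mean)) / r(sd)
* egen's std() function does the same standardization in one step
egen zlength2 = std(length)
summarize zlength zlength2
```

Using the stored results also means the code does not need editing if the data change.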
With generate and replace
you can use + - for addition and subtraction
you can use * / for multiplication and division
you can use ^ for exponents (e.g., length^2)
you can use ( ) for controlling order of operations.
Recoding new variables using generate and replace
Suppose that we wanted to break mpg down into three categories. Let's look at a table of mpg to see where we might
draw the lines for such categories.
tabulate mpg
        mpg |      Freq.     Percent        Cum.
------------+-----------------------------------
         12 |          2        2.70        2.70
         14 |          6        8.11       10.81
         15 |          2        2.70       13.51
         16 |          4        5.41       18.92
         17 |          4        5.41       24.32
         18 |          9       12.16       36.49
         19 |          8       10.81       47.30
         20 |          3        4.05       51.35
         21 |          5        6.76       58.11
         22 |          5        6.76       64.86
         23 |          3        4.05       68.92
         24 |          4        5.41       74.32
         25 |          5        6.76       81.08
         26 |          3        4.05       85.14
         28 |          3        4.05       89.19
         29 |          1        1.35       90.54
         30 |          2        2.70       93.24
         31 |          1        1.35       94.59
         34 |          1        1.35       95.95
         35 |          2        2.70       98.65
         41 |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00
Let's convert mpg into three categories to help make this more readable. Here we convert mpg into three categories
using generate and replace.
generate mpg3 = .
(74 missing values generated)
replace mpg3 = 1 if (mpg <= 18)
(27 real changes made)
replace mpg3 = 2 if (mpg >= 19) & (mpg <=23)
(24 real changes made)
replace mpg3 = 3 if (mpg >= 24) & (mpg <.)
(23 real changes made)
Let's use tabulate to check that this worked correctly. Indeed, you can see that a value of 1 for mpg3 goes from 12-18,
a value of 2 goes from 19-23, and a value of 3 goes from 24-41.
tabulate mpg mpg3
           |               mpg3
       mpg |         1          2          3 |     Total
-----------+---------------------------------+----------
        12 |         2          0          0 |         2
        14 |         6          0          0 |         6
        15 |         2          0          0 |         2
        16 |         4          0          0 |         4
        17 |         4          0          0 |         4
        18 |         9          0          0 |         9
        19 |         0          8          0 |         8
        20 |         0          3          0 |         3
        21 |         0          5          0 |         5
        22 |         0          5          0 |         5
        23 |         0          3          0 |         3
        24 |         0          0          4 |         4
        25 |         0          0          5 |         5
        26 |         0          0          3 |         3
        28 |         0          0          3 |         3
        29 |         0          0          1 |         1
        30 |         0          0          2 |         2
        31 |         0          0          1 |         1
        34 |         0          0          1 |         1
        35 |         0          0          2 |         2
        41 |         0          0          1 |         1
-----------+---------------------------------+----------
     Total |        27         24         23 |        74
Now, we could use mpg3 to show a crosstab of mpg3 by foreign to contrast the mileage of the foreign and domestic
cars.
tabulate mpg3 foreign, column
           |        foreign
      mpg3 |         0          1 |     Total
-----------+----------------------+----------
         1 |        22          5 |        27
           |     42.31      22.73 |     36.49
-----------+----------------------+----------
         2 |        19          5 |        24
           |     36.54      22.73 |     32.43
-----------+----------------------+----------
         3 |        11         12 |        23
           |     21.15      54.55 |     31.08
-----------+----------------------+----------
     Total |        52         22 |        74
           |    100.00     100.00 |    100.00
The crosstab above shows that 21% of the domestic cars fall into the high mileage category, while 55% of the foreign
cars fit into this category.
Recoding variables using recode
There is an easier way to recode mpg to three categories using generate and recode. First, we make a copy of mpg,
calling it mpg3a. Then, we use recode to convert mpg3a into three categories: min-18 into 1, 19-23 into 2, and 24-max
into 3.
generate mpg3a = mpg
recode mpg3a (min/18=1) (19/23=2) (24/max=3)
(74 changes made)
Let's double check to see that this worked correctly. We see that it worked perfectly.
tabulate mpg mpg3a
           |               mpg3a
       mpg |         1          2          3 |     Total
-----------+---------------------------------+----------
        12 |         2          0          0 |         2
        14 |         6          0          0 |         6
        15 |         2          0          0 |         2
        16 |         4          0          0 |         4
        17 |         4          0          0 |         4
        18 |         9          0          0 |         9
        19 |         0          8          0 |         8
        20 |         0          3          0 |         3
        21 |         0          5          0 |         5
        22 |         0          5          0 |         5
        23 |         0          3          0 |         3
        24 |         0          0          4 |         4
        25 |         0          0          5 |         5
        26 |         0          0          3 |         3
        28 |         0          0          3 |         3
        29 |         0          0          1 |         1
        30 |         0          0          2 |         2
        31 |         0          0          1 |         1
        34 |         0          0          1 |         1
        35 |         0          0          2 |         2
        41 |         0          0          1 |         1
-----------+---------------------------------+----------
     Total |        27         24         23 |        74
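The copy-then-recode pattern can also be collapsed into one command with recode's generate() option, which leaves the original variable untouched. This is a sketch assuming the auto data shipped with Stata (sysuse auto); the variable name mpg3b is hypothetical.

```stata
sysuse auto, clear
* generate() stores the categories in a new variable and leaves mpg as it was
recode mpg (min/18 = 1) (19/23 = 2) (24/max = 3), generate(mpg3b)
tabulate mpg mpg3b
```

Because mpg itself is never overwritten, the recode rules can be rerun or adjusted without reloading the data.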

Recodes with if
Let's create a variable called mpgfd that assesses the mileage of the cars with respect to their origin. Let this be a 0/1
variable called mpgfd which is:
0 if below the median mpg for its group (foreign/domestic)
1 if at/above the median mpg for its group (foreign/domestic).
sort foreign
by foreign: summarize mpg, detail
-> foreign= 0
                             mpg
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  52
25%         16.5             14       Sum of Wgt.          52
50%           19                      Mean           19.82692
                        Largest       Std. Dev.      4.743297
75%           22             28
90%           26             29       Variance       22.49887
95%           29             30       Skewness       .7712432
99%           34             34       Kurtosis       3.441459
-> foreign= 1
                             mpg
-------------------------------------------------------------
      Percentiles      Smallest
 1%           14             14
 5%           17             17
10%           17             17       Obs                  22
25%           21             18       Sum of Wgt.          22
50%         24.5                      Mean           24.77273
                        Largest       Std. Dev.      6.611187
75%           28             31
90%           35             35       Variance       43.70779
95%           35             35       Skewness        .657329
99%           41             41       Kurtosis        3.10734
We see that the median is 19 for the domestic (foreign==0) cars and 24.5 for the foreign (foreign==1) cars. The
generate and recode commands below recode mpg into mpgfd based on the domestic car median for the domestic
cars, and based on the foreign car median for the foreign cars.
generate mpgfd = mpg
recode mpgfd (min/18=0) (19/max=1) if foreign==0
(52 changes made)
recode mpgfd (min/24=0) (25/max=1) if foreign==1
(22 changes made)
We can check this using the tables below, and the recoded variable mpgfd looks correct.
by foreign: tabulate mpg mpgfd
-> foreign= 0
           |         mpgfd
       mpg |         0          1 |     Total
-----------+----------------------+----------
        12 |         2          0 |         2
        14 |         5          0 |         5
        15 |         2          0 |         2
        16 |         4          0 |         4
        17 |         2          0 |         2
        18 |         7          0 |         7
        19 |         0          8 |         8
        20 |         0          3 |         3
        21 |         0          3 |         3
        22 |         0          5 |         5
        24 |         0          3 |         3
        25 |         0          1 |         1
        26 |         0          2 |         2
        28 |         0          2 |         2
        29 |         0          1 |         1
        30 |         0          1 |         1
        34 |         0          1 |         1
-----------+----------------------+----------
     Total |        22         30 |        52
-> foreign= 1
           |         mpgfd
       mpg |         0          1 |     Total
-----------+----------------------+----------
        14 |         1          0 |         1
        17 |         2          0 |         2
        18 |         2          0 |         2
        21 |         2          0 |         2
        23 |         3          0 |         3
        24 |         1          0 |         1
        25 |         0          4 |         4
        26 |         0          1 |         1
        28 |         0          1 |         1
        30 |         0          1 |         1
        31 |         0          1 |         1
        35 |         0          2 |         2
        41 |         0          1 |         1
-----------+----------------------+----------
     Total |        11         11 |        22
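Reading the medians off the summarize output and typing them into recode works, but the cutoffs must be retyped if the data change. One alternative, sketched here under the assumption that the auto data shipped with Stata (sysuse auto) is used, is to let egen compute each group's median and compare against it directly. The names med_mpg and mpgfd2 are hypothetical.

```stata
sysuse auto, clear
* Median mpg within each origin group
bysort foreign: egen med_mpg = median(mpg)
* 1 if at or above the group median, 0 if below; a missing mpg stays missing
generate mpgfd2 = (mpg >= med_mpg) if !missing(mpg)
bysort foreign: tabulate mpg mpgfd2
```

The if !missing(mpg) qualifier matters because Stata treats missing values as larger than any number, so an unqualified comparison would code them as 1.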
Summary
Create a new variable len_ft which is length divided by 12.
generate len_ft = length / 12
Change values of an existing variable named len_ft.
replace len_ft = length / 12
Recode mpg into mpg3, having three categories, using generate and replace if.
generate mpg3 = .
replace mpg3 = 1 if (mpg <=18)
replace mpg3 = 2 if (mpg >=19) & (mpg <=23)
replace mpg3 = 3 if (mpg >=24) & (mpg <.)
Recode mpg into mpg3a, having three categories, 1, 2, 3, using generate and recode.
generate mpg3a = mpg
recode mpg3a (min/18=1) (19/23=2) (24/max=3)
Recode mpg into mpgfd, having two categories, but using different cutoffs for foreign and domestic cars.
generate mpgfd = mpg
recode mpgfd (min/18=0) (19/max=1) if foreign==0
recode mpgfd (min/24=0) (25/max=1) if foreign==1
Stata Learning Module
Subsetting data
This module shows how you can subset data in Stata. You can subset data by keeping or dropping variables, and you
can subset data by keeping or dropping observations. You can also subset data as you use a data file if you are trying to
read a file that is too big to fit into the memory on your computer.
Keeping and dropping variables
Sometimes you do not want all of the variables in a data file. You can use the keep and drop commands to subset
variables. If we think of your data like a spreadsheet, this section will show how you can remove columns (variables)
from your data. Let's illustrate this with the auto data file.
sysuse auto
We can use the describe command to see its variables.
describe
Contains data from C:\Program Files\Stata10\ado\base/a/auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2007 17:45
 size:         3,478 (99.7% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign
Suppose we want to just have make, mpg and price. We can keep just those variables, as shown below.
keep make mpg price
If we issue the describe command again, we see that indeed those are the only variables left.
describe
Contains data from C:\Program Files\Stata10\ado\base/a/auto.dta
  obs:            74                          1978 Automobile Data
 vars:             3                          13 Apr 2007 17:45
 size:         1,924 (99.8% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
Remember, this has not changed the file on disk, but only the copy we have in memory. If we saved this file calling it
auto, it would mean that we would replace the existing file (with all the variables) with this file which just has make,
mpg and price. In effect, we would permanently lose all of the other variables in the data file. It is important to be
careful when using the save command after you have eliminated variables, and it is recommended that you save such
files to a file with a new name, e.g., save auto2. Let's show how to use the drop command to drop variables. First, let's
clear out the data in memory and use the auto data file.
sysuse auto, clear
Perhaps we are not interested in the variables displ and gear_ratio. We can get rid of them using the drop command
shown below.
drop displ gear_ratio
Again, using describe shows that the variables have been eliminated.
describe
Contains data from C:\Program Files\Stata10\ado\base/a/auto.dta
  obs:            74                          1978 Automobile Data
 vars:            10                          13 Apr 2007 17:45
 size:         3,034 (99.7% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign
     Note:  dataset has changed since last saved
If we wanted to make this change permanent, we could save the file as auto2.dta as shown below.
save auto2
file auto2.dta saved
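The advice above (subset, verify, then save under a new name) can be sketched as follows. This is a minimal illustration, not part of the original module: the file name auto_small is an assumed name, and the assert line is just one way to confirm that only three variables remain.

```stata
* Sketch: subset variables, verify, and save to a NEW file name
sysuse auto, clear
keep make mpg price

* c(k) holds the number of variables currently in memory
assert c(k) == 3

* auto_small is a hypothetical name; the replace option can only
* overwrite auto_small.dta, never the original auto.dta
save auto_small, replace
```

Because the subset is written to a different file, the full auto data file stays available for later use.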
Keeping and dropping observations
The above showed how to use keep and drop to eliminate variables from your data file. The keep if and drop
if commands can be used to keep and drop observations. Thinking of your data like a spreadsheet, the keep if and drop
if commands can be used to eliminate rows of your data. Let's illustrate this with the auto data. Let's use the auto file
and clear out the data currently in memory.
sysuse auto , clear
The variable rep78 has values 1 to 5, and also has some missing values, as shown below.
tabulate rep78 , missing
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00
We may want to eliminate the observations which have missing values using drop if as shown below. The portion after
the drop if specifies which observations should be eliminated.
drop if missing(rep78)
(5 observations deleted)
Using the tabulate command again shows that these observations have been eliminated.
tabulate rep78 , missing
      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
We could make this change permanent by using the save command to save the file. Let's illustrate using keep if to
eliminate observations. First let's clear out the current file and use the auto data file.
sysuse auto , clear
The keep if command also eliminates observations, except that the part after the keep if specifies which
observations should be kept. Suppose we want to keep just the cars which had a repair rating of 3 or less. The easiest
way to do this would be using the keep if command, as shown below.
keep if (rep78 <= 3)
(34 observations deleted)
The tabulate command shows that this was successful.
tabulate rep78, missing

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        5.00        5.00
          2 |          8       20.00       25.00
          3 |         30       75.00      100.00
------------+-----------------------------------
      Total |         40      100.00
Before we go on to the next section, let's clear out the data that is currently in memory.
clear
Selecting variables and observations with "use"
The above sections showed how to use keep, drop, keep if, and drop if for eliminating variables and observations.
Sometimes, you may want to use a data file which is bigger than you can fit into memory and you would wish to
eliminate variables and/or observations as you use the file. This is illustrated below with the auto data file.
Selecting variables. You can specify just the variables you wish to bring in on the use command. For example, let's use
the auto data file with just make price and mpg.
use make price mpg using http://www.stata-press.com/data/r10/auto
The describe command shows us that this worked.
describe
Contains data from http://www.stata-press.com/data/r10/auto.dta
  obs:            74                          1978 Automobile Data
 vars:             3                          13 Apr 2007 17:45
 size:         1,924 (99.8% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
-------------------------------------------------------------------------------
Sorted by:
Let's clear out the data before the next example.
clear
Suppose we want to just bring in the observations where rep78 is 3 or less. We can do this as shown below.
use http://www.stata-press.com/data/r10/auto if (rep78 <= 3)
We can use tabulate to double check that this worked.
tabulate rep78, missing
      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        5.00        5.00
          2 |          8       20.00       25.00
          3 |         30       75.00      100.00
------------+-----------------------------------
      Total |         40      100.00
Let's clear out the data before the next example.
clear
Let's show another example. Let's read in just the cars that had a rating of 4 or higher.
use http://www.stata-press.com/data/r10/auto if (rep78 >= 4) & (rep78 <.)
Let's check this using the tabulate command.
tabulate rep78, missing
      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          4 |         18       62.07       62.07
          5 |         11       37.93      100.00
------------+-----------------------------------
      Total |         29      100.00
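The extra condition (rep78 <.) matters because Stata treats a missing value as larger than any number, so rep78 >= 4 on its own would also select the cars with missing repair records. A short sketch of this, using the built-in copy of the auto data; the counts in the comments follow from the frequency tables shown in this module (18 + 11 + 5 missing = 34).

```stata
* Sketch: missing values count as "greater than" any number
sysuse auto, clear

count if rep78 >= 4                     // 34: includes the 5 missing values
count if rep78 >= 4 & rep78 < .         // 29: missing values excluded
count if rep78 >= 4 & !missing(rep78)   // 29: same selection, written with missing()
```

The same caution applies to keep if and drop if whenever the condition uses > or >=.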
Let's clear out the data before the next example.
clear
You can eliminate both variables and observations with the use command. Let's read in just make mpg price and
rep78 for the cars with a repair record of 3 or lower.
use make mpg price rep78 if (rep78 <= 3) using http://www.stata-press.com/data/r10/auto
Let's check this using describe and tabulate.
describe
Contains data from http://www.stata-press.com/data/r10/auto.dta
  obs:            40                          1978 Automobile Data
 vars:             4                          13 Apr 2007 17:45
 size:         1,120 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------------
Sorted by:

tabulate rep78
      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        5.00        5.00
          2 |          8       20.00       25.00
          3 |         30       75.00      100.00
------------+-----------------------------------
      Total |         40      100.00
Let's clear out the data before the next example.
clear
Note that the ordering of if and using is arbitrary.
use make mpg price rep78 using http://www.stata-press.com/data/r10/auto if (rep78 <= 3)
Let's check this using describe and tabulate.
describe
Contains data from http://www.stata-press.com/data/r10/auto.dta
  obs:            40                          1978 Automobile Data
 vars:             4                          13 Apr 2007 17:45
 size:         1,120 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------------
Sorted by:
tabulate rep78
      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        5.00        5.00
          2 |          8       20.00       25.00
          3 |         30       75.00      100.00
------------+-----------------------------------
      Total |         40      100.00
Have a look at this command. Do you think it will work?
use make mpg if (rep78 <= 3) using http://www.stata-press.com/data/r10/auto
rep78 not found
r(111);
You see, rep78 was not one of the variables read in, so it could not be used in the if portion. To use a variable in the if
portion, it has to be one of the variables that is read in.
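If the goal really is a file with only make and mpg for the cars with rep78 of 3 or less, one workaround is to read rep78 along with the other variables and drop it afterwards. This is a sketch under the assumption that the URL used throughout this module is still reachable; it is not a command from the original text.

```stata
* Sketch: read rep78 so the if condition can use it, then drop it
use make mpg rep78 if (rep78 <= 3) using http://www.stata-press.com/data/r10/auto, clear
drop rep78

* Only make and mpg should remain
describe
```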
Summary
Using keep/drop to eliminate variables
keep make price mpg
drop displ gear_ratio
Using keep if/drop if to eliminate observations
drop if missing(rep78)
keep if (rep78 <= 3)
Eliminating variables and/or observations with use
use make mpg price rep78 using auto
use auto if (rep78 <= 3)
use make mpg price rep78 using auto if (rep78 <= 3)
Stata Learning Modules
Collapsing data across observations
Sometimes you have data files that need to be collapsed to be useful to you. For example, you might have student data
but you really want classroom data, or you might have weekly data but you want monthly data, etc. We will illustrate
this using an example showing how you can collapse data across kids to make family level data.
Here is a file containing information about the kids in three families. There is one record per kid. Birth is the order of
birth (i.e., 1 is first), age wt and sex are the child's age, weight and sex. We will use this file for showing how to
collapse data across observations.
use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
list
       famid   kidname   birth   age   wt   sex
  1.       1      Beth       1     9   60     f
  2.       1       Bob       2     6   40     m
  3.       1      Barb       3     3   20     f
  4.       2      Andy       1     8   80     m
  5.       2        Al       2     6   50     m
  6.       2       Ann       3     2   20     f
  7.       3      Pete       1     6   60     m
  8.       3       Pam       2     4   40     f
  9.       3      Phil       3     2   20     m
Consider the collapse command below. It collapses across all of the observations to make a single record with the
average age of the kids.
collapse age
list
age
1. 5.111111
The above collapse command was not very useful, but you can combine it with the by(famid) option, and then it
creates one record for each family that contains the average age of the kids in the family.
use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
collapse age, by(famid)
list
       famid        age
1. 1 6
2. 2 5.333333
3. 3 4
The following collapse command does the exact same thing as above, except that the average of age is named avgage
and we have explicitly told the collapse command that we want it to compute the mean.
use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
collapse (mean) avgage=age, by(famid)
list
       famid     avgage
1. 1 6
2. 2 5.333333
3. 3 4
We can request averages for more than one variable. Here we get the average for age and for wt all in the same
command.
use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
collapse (mean) avgage=age avgwt=wt, by(famid)
list
       famid     avgage   avgwt
1. 1 6 40
2. 2 5.333333 50
3. 3 4 40
This command gets the average of age and wt like the command above, and also computes numkids which is the count
of the number of kids in each family (obtained by counting the number of observations with valid values of birth).
use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
list
       famid     avgage   avgwt   numkids
1. 1 6 40 3
2. 2 5.333333 50 3
3. 3 4 40 3
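In case the kids file cannot be downloaded, the same steps can be tried on a copy of the data typed in with input. The sketch below re-enters the nine records from the listing above (the str4 and str1 storage types are assumptions chosen to fit the names and the sex codes) and then runs the same collapse command.

```stata
* Sketch: rebuild the kids data by hand, then collapse to one record per family
clear
input famid str4 kidname birth age wt str1 sex
1 Beth 1 9 60 f
1 Bob  2 6 40 m
1 Barb 3 3 20 f
2 Andy 1 8 80 m
2 Al   2 6 50 m
2 Ann  3 2 20 f
3 Pete 1 6 60 m
3 Pam  2 4 40 f
3 Phil 3 2 20 m
end

* Mean age and weight, plus a count of non-missing birth values, per family
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
list
```

Note that collapse replaces the data in memory, so the kid-level records are gone after it runs unless they were saved or are re-entered.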
Suppose you wanted a count of the number of boys and girls in the family. We can do that with one extra step. We will
create a dummy variable that is 1 if the kid is a boy (0 if not), and a dummy variable that is 1 if the kid is a girl (and 0 if
not). The sum of the boy dummy variable is the number of boys and the sum of the girl dummy variable is the number
of girls.
First, let's use the kids file (and clear out the existing data).
use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
We use tabulate with the generate option to make the dummy variables.
tabulate sex, generate(sexdum)
        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
          f |          4       44.44       44.44
          m |          5       55.56      100.00
------------+-----------------------------------
      Total |          9      100.00
We can look at the dummy variables. Sexdum1 is the dummy variable for girls. Sexdum2 is the dummy variable for
boys. The sum of sexdum1 is the number of girls in the family. The sum of sexdum2 is the number of boys in the
family.
list famid sex sexdum1 sexdum2
       famid   sex   sexdum1   sexdum2
1. 1 f 1 0
2. 1 m 0 1
3. 1 f 1 0
4. 2 m 0 1
5. 2 m 0 1
6. 2 f 1 0
7. 3 m 0 1
8. 3 f 1 0
9. 3 m 0 1
The command below creates girls which is the number of girls in the family, and boys which is the number of boys in
the family.
collapse (count) numkids=birth (sum) girls=sexdum1 boys=sexdum2, by(famid)
We can list out the data to confirm that it worked correctly.
list famid boys girls numkids
       famid   boys   girls   numkids
1. 1 1 2 3
2. 2 2 1 3
3. 3 2 1 3
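The dummy variables can also be built directly from logical expressions instead of tabulate with generate. The following is a sketch of that alternative, not the method used above: girl and boy are illustrative variable names, and the URL is the one given in this module, which may have moved.

```stata
* Sketch: build the 0/1 dummies by hand, then sum them within family
use http://www.ats.ucla.edu/stat/stata/modules/kids, clear

* A logical expression evaluates to 1 when true and 0 when false
generate byte girl = (sex == "f")
generate byte boy  = (sex == "m")

collapse (count) numkids=birth (sum) girls=girl boys=boy, by(famid)
list famid boys girls numkids
```

Naming the dummies explicitly avoids having to remember which of sexdum1 and sexdum2 stands for which category.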
Summary
To create one record per family (famid) with the average of age within each family.
collapse age, by(famid)
To create one record per family (famid) with the average of age (called avgage) and average weight (called avgwt)
within each family.
collapse (mean) avgage=age avgwt=wt, by(famid)
Same as above example, but also counts the number of kids within each family calling that numkids.
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
Counts the number of boys and girls in each family by using tabulate to create dummy variables based on sex and then
summing the dummy variables within each family.
tabulate sex, generate(sexdum)
collapse (sum) girls=sexdum1 boys=sexdum2, by(famid)
Stata Learning Module
Working across variables using foreach
1 Introduction
This module illustrates (1) how to create and recode variables manually and (2) how to use foreach to ease the process
of creating and recoding variables.
Consider the sample program below, which reads in income data for twelve months.
input famid inc1-inc12
1 3281 3413 3114 2500 2700 3500 3114 3319 3514 1282 2434 2818
2 4042 3084 3108 3150 3800 3100 1531 2914 3819 4124 4274 4471
3 6015 6123 6113 6100 6100 6200 6186 6132 3123 4231 6039 6215
end

list
The output is shown below
list famid inc1-inc12, clean
    famid   inc1   inc2   inc3   inc4   inc5   inc6   inc7   inc8   inc9   inc10   inc11   inc12
        1   3281   3413   3114   2500   2700   3500   3114   3319   3514    1282    2434    2818
        2   4042   3084   3108   3150   3800   3100   1531   2914   3819    4124    4274    4471
        3   6015   6123   6113   6100   6100   6200   6186   6132   3123    4231    6039    6215
2 Computing variables (manually)
Say that we wanted to compute the amount of tax (10%) paid for each month. The simplest way to do this is to compute
12 variables (taxinc1-taxinc12) by multiplying each of the (inc1-inc12) by .10 as illustrated below. As you see, this
requires entering a command computing the tax for each month of data (for months 1 to 12) via the generate
command.
generate taxinc1 = inc1 * .10
generate taxinc2 = inc2 * .10
generate taxinc3 = inc3 * .10
generate taxinc4 = inc4 * .10
generate taxinc5 = inc5 * .10
generate taxinc6 = inc6 * .10
generate taxinc7 = inc7 * .10
generate taxinc8 = inc8 * .10
generate taxinc9 = inc9 * .10
generate taxinc10 = inc10 * .10
generate taxinc11 = inc11 * .10
generate taxinc12 = inc12 * .10
The output is shown below.

     +----------------------------------------------------------------------------------------------+
  1. | famid | inc1 | inc2 | inc3 | inc4 | inc5 | inc6 | inc7 | inc8 | inc9 | inc10 | inc11 | inc12 |
     |     1 | 3281 | 3413 | 3114 | 2500 | 2700 | 3500 | 3114 | 3319 | 3514 |  1282 |  2434 |  2818 |
     |----------------------------------------------------------------------------------------------|
     | taxinc1 | taxinc2 | taxinc3 | taxinc4 | taxinc5 | taxinc6 | taxinc7 | taxinc8 | taxinc9     |
     |   328.1 |   341.3 |   311.4 |     250 |     270 |     350 |   311.4 |   331.9 |   351.4     |
     |----------------------------------------------------------------------------------------------|
     | taxinc10 | taxinc11 | taxinc12                                                              |
     |    128.2 |    243.4 |    281.8                                                              |
     +----------------------------------------------------------------------------------------------+

     +----------------------------------------------------------------------------------------------+
  2. | famid | inc1 | inc2 | inc3 | inc4 | inc5 | inc6 | inc7 | inc8 | inc9 | inc10 | inc11 | inc12 |
     |     2 | 4042 | 3084 | 3108 | 3150 | 3800 | 3100 | 1531 | 2914 | 3819 |  4124 |  4274 |  4471 |
     |----------------------------------------------------------------------------------------------|
     | taxinc1 | taxinc2 | taxinc3 | taxinc4 | taxinc5 | taxinc6 | taxinc7 | taxinc8 | taxinc9     |
     |   404.2 |   308.4 |   310.8 |     315 |     380 |     310 |   153.1 |   291.4 |   381.9     |
     |----------------------------------------------------------------------------------------------|
     | taxinc10 | taxinc11 | taxinc12                                                              |
     |    412.4 |    427.4 |    447.1                                                              |
     +----------------------------------------------------------------------------------------------+

     +----------------------------------------------------------------------------------------------+
  3. | famid | inc1 | inc2 | inc3 | inc4 | inc5 | inc6 | inc7 | inc8 | inc9 | inc10 | inc11 | inc12 |
     |     3 | 6015 | 6123 | 6113 | 6100 | 6100 | 6200 | 6186 | 6132 | 3123 |  4231 |  6039 |  6215 |
     |----------------------------------------------------------------------------------------------|
     | taxinc1 | taxinc2 | taxinc3 | taxinc4 | taxinc5 | taxinc6 | taxinc7 | taxinc8 | taxinc9     |
     |   601.5 |   612.3 |   611.3 |     610 |     610 |     620 |   618.6 |   613.2 |   312.3     |
     |----------------------------------------------------------------------------------------------|
     | taxinc10 | taxinc11 | taxinc12                                                              |
     |    423.1 |    603.9 |    621.5                                                              |
     +----------------------------------------------------------------------------------------------+
3 Computing variables (using the foreach command)
Another way to compute 12 variables representing the amount of tax paid (10%) for each month is to use the foreach
command. In the example below we use the foreach command to cycle through the variables inc1 to inc12 and
compute the taxable income as taxinc1 - taxinc12.
foreach var of varlist inc1-inc12 {
  generate tax`var' = `var' * .10
}
The initial foreach statement tells Stata that we want to cycle through the variables inc1 to inc12 using the statements
that are surrounded by the curly braces. The first time we cycle through the statements, the value of var will be inc1
and the second time the value of var will be inc2 and so on until the final iteration where the value of var will be
inc12. Each statement within the loop (in this case, just the one generate statement) is evaluated and executed. When
we are inside the foreach loop, we can access the value of var by surrounding it with the funny quotation marks like
this `var' . The ` is the quote right below the ~ on your keyboard and the ' is the quote below the " on your keyboard.
The first time through the loop, `var' is replaced with inc1, so the statement
generate tax`var' = `var' * .10
becomes
generate taxinc1 = inc1 * .10
This is repeated for inc2 and then inc3 and so on until inc12. So, this foreach loop is the equivalent of executing the 12
generate statements manually, but much easier and less error prone.
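Because the income variables are numbered consecutively, the same twelve variables could also be created by looping over the numbers 1 to 12 with forvalues. The sketch below is an alternative to the varlist loop, not the module's own method; it re-enters the data from the listing earlier in this module so that it can be run by itself, and the loop index name i is an arbitrary choice.

```stata
* Sketch: same tax computation, looping over month numbers instead of variable names
clear
input famid inc1-inc12
1 3281 3413 3114 2500 2700 3500 3114 3319 3514 1282 2434 2818
2 4042 3084 3108 3150 3800 3100 1531 2914 3819 4124 4274 4471
3 6015 6123 6113 6100 6100 6200 6186 6132 3123 4231 6039 6215
end

forvalues i = 1/12 {
    * `i' is replaced by 1, 2, ..., 12 on successive passes
    generate taxinc`i' = inc`i' * .10
}

list famid taxinc1-taxinc3
```

Looping over numbers gives direct control over the new variable names, which becomes useful when the index is needed for arithmetic.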
4 Collapsing across variables (manually)
Often one needs to sum across variables (also known as collapsing across variables). For example, let's say the
quarterly income for each observation is desired. In order to get this information, four quarterly variables incqtr1-
incqtr4 need to be computed. Again, this can be achieved manually or by using the foreach command. Below is an
example of how to compute 4 quarterly income variables incqtr1-incqtr4 by simply adding together the months that
comprise a quarter.
generate incqtr1 = inc1 + inc2 + inc3
generate incqtr2 = inc4 + inc5 + inc6
generate incqtr3 = inc7 + inc8 + inc9
generate incqtr4 = inc10 + inc11 + inc12
list incqtr1 - incqtr4
The output is shown below.
+---------------------------------------+
     | incqtr1   incqtr2   incqtr3   incqtr4 |
|---------------------------------------|
1. | 9808 8700 9947 6534 |
2. | 10234 10050 8264 12869 |
3. | 18251 18400 15441 16485 |
+---------------------------------------+
5 Collapsing across variables (using the foreach command)
This same result as above can be achieved using the foreach command. The example below illustrates how to compute
the quarterly income variables incqtr1-incqtr4 using the foreach command.
foreach qtr of numlist 1/4 {
  local m3 = `qtr'*3
  local m2 = (`qtr'*3)-1
  local m1 = (`qtr'*3)-2
  generate incqtr`qtr' = inc`m1' + inc`m2' + inc`m3'
}
list incqtr1 - incqtr4
The output is shown below.
+---------------------------------------+
     | incqtr1   incqtr2   incqtr3   incqtr4 |
|---------------------------------------|
1. | 9808 8700 9947 6534 |
2. | 10234 10050 8264 12869 |
3. | 18251 18400 15441 16485 |
+---------------------------------------+
In this example, instead of cycling across variables, the foreach command is cycling across the numbers 1, 2, 3 then 4,
which we refer to as qtr and which represent the 4 quarterly variables that we wish to create. The trick is to find the
relationship between the quarter and the month numbers that compose the quarter, and to create a kind of formula that
relates the quarters to the months. For example, quarter 1 of data corresponds to months 3, 2 and 1, so we can say that
when the quarter (qtr) is 1 we want the months represented by qtr*3, (qtr*3)-1 and (qtr*3)-2, yielding 3, 2, and 1. This
is what the statements below from the foreach loop are doing. They are relating the quarter to the months.
local m3 = `qtr'*3
local m2 = (`qtr'*3)-1
local m1 = (`qtr'*3)-2
So, when qtr is 1, the value for m3 is 1*3, the value for m2 is (1*3)-1 and the value for m1 is (1*3)-2. Then, imagine
all of those values being substituted into the following statement from the foreach loop.
generate incqtr`qtr' = inc`m1' + inc`m2' + inc`m3'
This then becomes
generate incqtr1 = inc1 + inc2 + inc3
and for the next quarter (when qtr becomes 2) the statement would become
generate incqtr2 = inc4 + inc5 + inc6
In this example, with only 4 quarters of data, it would probably be easier to simply write out the 4 generate statements
manually. However, if you had 40 quarters of data, then the foreach loop can save you considerable time, effort and
mistakes.
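One way to check the quarter-to-month formula before creating any variables is to have the loop display each command instead of running it. The sketch below needs no data in memory; it uses forvalues and computes m2 and m1 from m3, which are small variations on the loop above rather than the module's exact code.

```stata
* Sketch: print the generate statements the loop would run, without executing them
forvalues qtr = 1/4 {
    local m3 = `qtr'*3      // last month of the quarter
    local m2 = `m3' - 1     // middle month
    local m1 = `m3' - 2     // first month
    display "generate incqtr`qtr' = inc`m1' + inc`m2' + inc`m3'"
}
```

The first line displayed is generate incqtr1 = inc1 + inc2 + inc3 and the last is generate incqtr4 = inc10 + inc11 + inc12. Once the displayed commands look right, the display line can be replaced with the generate statement itself.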
6 Identifying patterns across variables (using the foreach command)
The foreach command can also be used to identify patterns across variables of a dataset. Let's say, for example, that
one needs to know which months had income that was less than the income of the previous month. To obtain this
information, dummy indicators can be created to indicate in which months this occurred. Note that only 11 dummy
indicators are needed for a 12 month period because the interest is in the change from one month to the next. When a
month has income that is less than the income of the previous month, the dummy indicators lowinc2-lowinc12 get
assigned a "1". When this is not the case, they are assigned a "0". This program is illustrated below (note for
simplicity we assume no missing data on income).
foreach curmon of numlist 2/12 {
  local lastmon = `curmon' - 1
  generate lowinc`curmon' = 1 if ( inc`curmon' < inc`lastmon' )
  replace lowinc`curmon' = 0 if ( inc`curmon' >= inc`lastmon' )
}
We can list out the original values of inc and lowinc and verify that this worked properly
list famid inc1-inc12, clean noobs
    famid   inc1   inc2   inc3   inc4   inc5   inc6   inc7   inc8   inc9   inc10   inc11   inc12
        1   3281   3413   3114   2500   2700   3500   3114   3319   3514    1282    2434    2818
        2   4042   3084   3108   3150   3800   3100   1531   2914   3819    4124    4274    4471
        3   6015   6123   6113   6100   6100   6200   6186   6132   3123    4231    6039    6215
list famid lowinc2-lowinc12, clean noobs
    famid   lowinc2   lowinc3   lowinc4   lowinc5   lowinc6   lowinc7   lowinc8   lowinc9   lowinc10   lowinc11   lowinc12
        1         0         1         1         0         0         1         0         0          1          0          0
        2         1         0         0         0         1         1         0         0          0          0          0
        3         0         1         1         0         0         1         1         1          0          0          0
This time we used the foreach loop to compare the current month, represented by curmon, and the prior month,
computed as `curmon'-1 creating lastmon. So, for the first pass through the foreach loop the value for curmon is 2
and the value for lastmon is 1, so the generate and replace statements become
generate lowinc2 = 1 if ( inc2 < inc1 )
replace lowinc2 = 0 if ( inc2 >= inc1 )
The process is repeated until curmon is 12, and then the generate and replace statements become
generate lowinc12 = 1 if ( inc12 < inc11 )
replace lowinc12 = 0 if ( inc12 >= inc11 )
If you were using foreach to span a large range of values (say 1/1000) then it is more efficient to use forvalues since it
is designed to quickly increment through a sequential list, for example
forvalues curmon = 2/12 {
  local lastmon = `curmon' - 1
  generate lowinc`curmon' = 1 if ( inc`curmon' < inc`lastmon' )
  replace lowinc`curmon' = 0 if ( inc`curmon' >= inc`lastmon' )
}
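Under the same assumption of no missing income values, the generate and replace pair can be collapsed into a single generate, because a logical expression in Stata evaluates to 1 when true and 0 when false. The sketch below is a shortened illustration, not the module's code: it uses only the first three months from the listing above so that it stays small.

```stata
* Sketch: 0/1 indicator from a logical expression (assumes no missing incomes)
clear
input famid inc1 inc2 inc3
1 3281 3413 3114
2 4042 3084 3108
3 6015 6123 6113
end

forvalues curmon = 2/3 {
    local lastmon = `curmon' - 1
    * 1 if this month's income is below last month's, otherwise 0
    generate byte lowinc`curmon' = (inc`curmon' < inc`lastmon')
}

list, clean noobs
```

With missing incomes the two versions would no longer agree, since missing values compare as larger than any number, so the shortcut is only safe when the data are complete.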
Stata Learning Module
Introduction to graphs in Stata
This module will introduce some basic graphs in Stata 8, including histograms, boxplots, scatterplots, and scatterplot
matrices.
Let's use the auto data file for making some graphs.
sysuse auto.dta
The histogram command can be used to make a simple histogram of mpg
histogram mpg
The graph box command can be used to produce a boxplot which can help you examine the distribution of mpg. If
mpg were normal, the line (the median) would be in the middle of the box (the 25th and 75th percentiles) and the ends
of the whiskers (5th and 95th percentile) would be equidistant from the box. The boxplot for mpg shows positive skew.
The median is pulled to the low end of the box, and the 95th percentile is stretched out away from the box.
graph box mpg
The boxplot can be done separately for foreign and domestic cars using the by( ) option.
graph box mpg, by(foreign)
A two way scatter plot can be used to show the relationship between mpg and weight. As we would expect, there is a
negative relationship between mpg and weight.
graph twoway scatter mpg weight
Note that you can save typing like this
twoway scatter mpg weight
We can show the regression line predicting mpg from weight like this.
twoway lfit mpg weight
We can combine these graphs as shown below.
twoway (scatter mpg weight) (lfit mpg weight)
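Stata also accepts a second way of overlaying twoway plots, separating them with || instead of wrapping each one in parentheses. The sketch below is offered as an equivalent form of the overlay command, not as something the original module uses.

```stata
* Sketch: the || separator overlays plots just like the parenthesis form
sysuse auto, clear
twoway scatter mpg weight || lfit mpg weight
```

Either form can be used; the parenthesis form tends to be easier to read once each plot carries its own options.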
We can add labels to the points, labeling them by make as shown below. Note that mlabel is an option on the scatter
command.
twoway (scatter mpg weight, mlabel(make) ) (lfit mpg weight)
We can get separate graphs for foreign and domestic cars as shown below, and we have requested confidence
bands around the predicted values by using lfitci in place of lfit . Note that the by option is at the end of the command.
twoway (scatter mpg weight) (lfitci mpg weight), by(foreign)
You can request a scatter plot matrix with the graph matrix command. Here we examine the relationships among mpg,
weight and price.
graph matrix mpg weight price
Stata Learning Module
Graphics: Overview of Twoway Plots
This module shows examples of the different kinds of graphs that can be created with the graph twoway command.
This is illustrated by showing the command and the resulting graph. For more information, see the Stata Graphics
Manual available over the web and from within Stata by typing help graph, and in particular the section on Two Way
Scatterplots.
Basic twoway scatterplot
sysuse sp500
graph twoway scatter close date

Line Plot
graph twoway line close date

Connected Line Plot
graph twoway connected close date

Immediate scatterplot
graph twoway scatteri ///
965.8 15239 (3) "Low 965.8" ///
1373.73 15005 (3) "High 1373.73" , msymbol(i)

Scatterplot and Immediate Scatterplot
graph twoway ///
(scatter close date) ///
(scatteri 965.8 15239 (3) "Low, 9/21, 965.8" ///
1373.7 15005 (3) "High, 1/30, 1373.7", msymbol(i) )

Area Graph
drop if _n > 57
graph twoway area close date, sort

Bar plot
graph twoway bar close date

Spike plot
graph twoway spike close date

Dropline plot
graph twoway dropline close date

Dot plot
graph twoway dot change date

Range plot with area shading
graph twoway rarea high low date

Range plot with bars
graph twoway rbar high low date

Range plot with spikes
graph twoway rspike high low date

Range plot with capped spikes
graph twoway rcap high low date

Range plot with spikes capped with symbols
graph twoway rcapsym high low date

Range plot with markers
graph twoway rscatter high low date

Range plot with lines
graph twoway rline high low date

Range plot with lines and markers
graph twoway rconnected high low date

Median band line plot
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
graph twoway mband read write

Spline line plot
graph twoway mspline read write

LOWESS line plot
graph twoway lowess read write

Linear prediction plot
graph twoway lfit read write

Quadratic prediction plot
graph twoway qfit read write

Fractional polynomial plot
graph twoway fpfit read write

Linear prediction plot with confidence intervals
graph twoway lfitci read write

Quadratic plot with confidence intervals
graph twoway qfitci read write

Fractional polynomial plot with CIs
graph twoway fpfitci read write

Histogram
graph twoway histogram read

Kernel density plot
graph twoway kdensity read

Function plot
graph twoway function y=normden(x), range(-4 4)
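In more recent versions of Stata the normal density function is documented under the name normalden(), with normden() being the older name. A sketch of the same function plot using the newer name follows; which name your release prefers is an assumption worth checking with help normalden.

```stata
* Sketch: standard normal density from -4 to 4, using the newer function name
twoway function y = normalden(x), range(-4 4)
```

No data need to be in memory for a function plot, since the curve is computed from the expression over the requested range.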
Stata Learning Module
Graphics: Twoway Scatterplots
This module shows some of the options when using the twoway command to produce scatterplots. This is illustrated
by showing the command and the resulting graph. This includes hotlinks to the Stata Graphics Manual available over
the web and from within Stata by typing help graph.
Two Way Scatterplots
Basic twoway scatterplot
twoway (scatter read write)
Schemes

Using Economist Scheme
twoway (scatter read write) , scheme(economist)

Using s1mono Scheme
twoway (scatter read write) , scheme(s1mono)
Marker Placement Options (i.e. Jitter)

Scatterplot with jitter
twoway (scatter write read, jitter(3))
Without jitter
twoway (scatter write read)
Marker Label Options

Using small black square symbols.
twoway (scatter write read, msymbol(square) msize(small) mcolor(black))

With markers red on the inside, black medium thick outline
twoway (scatter write read, mfcolor(red) mlcolor(black) mlwidth(medthick) )
Identifying Observations with Marker Labels
twoway (scatter read write, mlabel(id))

Using large red marker labels at 12 O'clock
twoway (scatter read write if id <=10, mlabel(id) mlabposition(12) mlabsize(large) mlabcolor(red))

Marker labels at a 90 degree angle at 12 O'clock with a gap of 5
twoway (scatter read write if id <=10, ///
mlabel(ses) mlabangle(90) mlabposition(12) mlabgap(5))
If mlabgap option is omitted
twoway (scatter read write if id <=10, ///
mlabel(ses) mlabangle(90) mlabposition(12))

Modifying marker label position separately for each observation (1)
generate pos = 3
replace pos = 1 if (id == 5)
replace pos = 5 if (id == 6)
replace pos = 9 if (id == 3)
twoway (scatter read write if id <= 10, mlabel(ses) mlabv(pos))

If option mlabv is not used
twoway (scatter read write if id <= 10, mlabel(ses))
)onnect 2ptions

)onnectin with straiht line
egen mrea) = mean(rea)), by((rite)
t(o(ay (scatter mrea) (rite, connect(l) sort)
*f the sort option is omitted
t(o(ay (scatter mrea) (rite, connect(l))
77

(edium thick black dotted connectin line
t(o(ay (scatter mrea) (rite, connect(l) cl(i)t&(me)t&ick) clcolor(black) clpattern()ot) sort)

Show gaps in line when there are missing values
egen sdread = sd(read), by(write)
twoway (scatter sdread write, connect(l) sort cmissing(n))
Omitting cmissing option
twoway (scatter sdread write, connect(l) sort)
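A likely reason sdread contains missing values is that the standard deviation is undefined when a value of write occurs only once. A quick check, assuming sdread was created as above; n_write is a hypothetical helper variable introduced only for this sketch:

```stata
* Assumes the hsb2 data are loaded and sdread already exists
* n_write (hypothetical name) counts the observations sharing each value of write
bysort write: generate n_write = _N

* List the values of write for which the standard deviation is missing
list write n_write sdread if missing(sdread)
```

These are the points where cmissing(n) leaves a gap in the connecting line.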
Footnotes
1. Notice that the variable pos is used to control the position of the marker label. As shown in the code (repeated
below), pos is assigned a value of 3, representing 3 O'Clock. Then when id is 5 the position of the marker label is 1
O'Clock, when id is 6 the position is 5 O'Clock, and when id is 3 the position is 9 O'Clock, allowing us to
avoid labels that run off the edge of the graph or overwrite each other.
generate pos = 3
replace pos = 1 if (id == 5)
replace pos = 5 if (id == 6)
replace pos = 9 if (id == 3)
Stata Learning Module
Graphics: Combining Twoway Scatterplots
This module shows examples of combining twoway scatterplots. This is illustrated by showing the command and the
resulting graph. This includes hotlinks to the Stata Graphics Manual available over the web and from within Stata by
typing help graph.
The data set used in these examples can be obtained using the following command:
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
This illustrates combining graphs in the following situations.
Plots for separate groups (using by)
Combining separate plots together into a single plot
Combining separate graphs together into a single graph
Plots for separate groups

Separate graphs by gender (male and female)
twoway (scatter read write), by(female)

Separate graphs by ses and gender
twoway (scatter read write), by(female ses)


Swapping position of ses and gender
twoway (scatter read write), by(ses female, cols(2))
Combining scatterplots and linear fit in one graph

Scatterplot with linear fit
twoway (scatter read write) ///
(lfit read write), ///
ytitle(Reading Score)

Graphs separated by SES and female with linear fit lines and points identified by id
twoway (scatter read write, mlabel(id)) ///
(lfit read write, range(30 70)), ///
ytitle(Reading Score) by(ses female)
Graph for high ses females with linear fit with and without obs 51
twoway (scatter read write, mlabel(id)) ///
(lfit read write, range(30 70)) ///
(lfit read write if id != 51, range(30 70)) if female==1 & ses==3, ///
ytitle(Reading Score) legend(lab(3 "Fitted values without Obs 51"))
Combining scatterplots with multiple variables and linear fits

Reading and math score by writing score
twoway (scatter read write) ///
(scatter math write)

Reading and math score by writing score with fit lines
twoway (scatter read write) ///
(scatter math write) ///
(lfit read write) ///
(lfit math write)


Adding legend to above graph
twoway (scatter read write) ///
(scatter math write) ///
(lfit read write) ///
(lfit math write), ///
legend(label(3 "Linear Fit") label(4 "Linear Fit")) ///
legend(order(1 3 2 4))

Final version of graph
making line style same as dot style, and ranges the same
twoway (scatter read write) ///
(scatter math write) ///
(lfit read write, pstyle(p1) range(25 80)) ///
(lfit math write, pstyle(p2) range(25 80)), ///
legend(label(3 "Linear Fit") label(4 "Linear Fit")) ///
legend(order(1 3 2 4))
Combining scatterplots and linear fit for separate groups

Overlay graph of males and females in one graph
separate write, by(female)
twoway (scatter write0 read) (scatter write1 read), ///
ytitle(Writing Score) legend(order(1 "Males" 2 "Females"))

Overlay graph of males and females in one graph with linear fit lines
twoway (scatter write0 read) (scatter write1 read) ///
(lfit write0 read) (lfit write1 read), ///
ytitle(Writing Score) ///
legend(order(1 "Males" 2 "Females" 3 "Lfit Males" 4 "Lfit Females"))
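The overlays above rely on the variables that separate creates: one new variable per level of female, named by appending the level to the original name (write0 and write1), with missing values for observations outside that group. A quick way to confirm this, assuming separate has already been run as above:

```stata
* Assumes the hsb2 data are loaded and separate write, by(female) has been run
* Show the names and labels of the two variables that separate created
describe write0 write1

* Each variable is non-missing only for its own group, so the counts differ
summarize write0 write1
```

Because each variable holds only one group's scores, each can be plotted as its own scatter or lfit and given its own legend entry.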
Combining separate graphs into one graph

Making the Graphs
First, we make 3 graphs (not shown)
twoway (scatter read write) (lfit read write), name(scatter)
regress read write
rvfplot, name(rvf)
lvr2plot, name(lvr)
Now we can use graph combine to combine these into one graph, shown below.
graph combine scatter rvf lvr


Combining the graphs differently
We can move the place where the empty graph is located, as shown below.
graph combine scatter rvf lvr, hole(2)
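Other layouts are possible as well. The following is a sketch rather than part of the original module, assuming the three named graphs (scatter, rvf, and lvr) are still in memory; it uses the rows() and cols() options of graph combine to avoid the empty cell altogether:

```stata
* Assumes the graphs named scatter, rvf, and lvr were created as above
* List the graphs currently held in memory
graph dir, memory

* Stack the three graphs in a single column
graph combine scatter rvf lvr, cols(1)

* Or place them side by side in a single row
graph combine scatter rvf lvr, rows(1)
```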
