Lines starting with ">" contain R code; type them without the ">" sign. Code and
R output are typeset in Courier font to separate them from normal text.
This exercise is written so that you test every command and see for yourself what it
does. If you need help, just ask!
The dataset consists of 17 bioinformatics students, who have given their height and shoe size
measurements for teaching purposes.
There is a folder named data on your desktop. In its subfolder students, there is a file students.txt. Open it in
Excel, and check what columns it contains. Note the column headers. You can leave the file open in
Excel in case you need to see it later; otherwise just close it.
Open R by double-clicking on its icon on the desktop. Go to the menu File, and select the option
Change Dir. Change the directory to the one where the file students.txt is located.
Read the data into an R object named students (the data is in a tab-delimited text file that has a title
for every column):
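The read command itself is not shown above. A minimal sketch of how such a file is typically read, assuming the standard read.table function: the temporary file and the object name demo are made up for illustration, and for the exercise the call would presumably be students <- read.table("students.txt", header = TRUE, sep = "\t").

```r
# Write a tiny tab-delimited file with a header row (hypothetical demo data).
demo_file <- tempfile(fileext = ".txt")
writeLines(c("height\tshoesize\tgender\tpopulation",
             "181\t44\tmale\tkuopio",
             "160\t38\tfemale\tkuopio"), demo_file)

# header = TRUE: the first line holds the column titles.
# sep = "\t": the columns are separated by tabs.
demo <- read.table(demo_file, header = TRUE, sep = "\t")
demo
```

If the column titles were missing from the file, header = FALSE would be needed instead, and R would name the columns V1, V2, and so on.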
Check that R read the file correctly (objects can be printed just by typing their name):
> students
height shoesize gender population
1 181 44 male kuopio
2 160 38 female kuopio
3 174 42 female kuopio
4 170 43 male kuopio
5 172 43 male kuopio
6 165 39 female kuopio
7 161 38 female kuopio
8 167 38 female tampere
9 164 39 female tampere
10 166 38 female tampere
11 162 37 female tampere
12 158 36 female tampere
13 175 42 male tampere
14 181 44 male tampere
15 180 43 male tampere
16 177 43 male tampere
17 173 41 male tampere
You can also print the column headers only (sometimes the whole table does not fit on the screen,
and this might be more helpful):
> names(students)
[1] "height" "shoesize" "gender" "population"
Individual columns can be called using the following syntax: first comes the name of the object,
followed by a dollar sign, after which comes the name of the column:
> students$height
3. Simple statistics
> mean(students$height)
[1] 169.7647
> mean(students$shoesize)
[1] 40.47059
> sd(students$height)
[1] 7.578996
> sd(students$shoesize)
[1] 2.695312
What are the gender and sampling site distributions (how many observations are there in each group)?
Type:
> table(students$gender)
gender
female male
9 8
> table(students$population)
population
kuopio tampere
7 10
> table(students$gender,students$population)
population
gender kuopio tampere
female 4 5
male 3 5
4. Useful plots
Graphical inspection usually makes the data easier to interpret. How are heights distributed? To draw a
histogram, type:
> hist(students$height)
That's the distribution for the whole population. But is there a difference in heights between the
genders? That can be studied using a box plot. In this case the variable height is divided into two
groups using the variable gender, and a separate boxplot is produced for each of these groups:
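The boxplot command is missing from the text above. One possible way to write it is sketched here, assuming the formula interface of boxplot; the data are re-entered from the printed table only so that the snippet runs on its own, and in the exercise session just the boxplot line is needed.

```r
# Heights and genders copied from the students table printed earlier.
height <- c(181, 160, 174, 170, 172, 165, 161, 167, 164, 166, 162, 158,
            175, 181, 180, 177, 173)
gender <- c("male", "female", "female", "male", "male", "female", "female",
            rep("female", 5), rep("male", 5))
students <- data.frame(height, gender)

# The formula height ~ gender reads "height split by gender".
bp <- boxplot(height ~ gender, data = students, ylab = "height")
bp$n   # number of observations in each group
```

The same formula pattern should carry over to any other grouping column in the data.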
So, there is a large difference in height between the genders. Does the same apply to the sampling
sites? Write the code for this yourself.
How are height and shoe size related? You can get a graphical view of this by making a scatter plot:
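The scatter plot command is not shown. A minimal sketch, assuming the base plot function with the x variable given first and the y variable second; the two vectors are copied from the printed table, and with the students object loaded the call would be plot(students$height, students$shoesize).

```r
# Values copied from the students table printed earlier.
height   <- c(181, 160, 174, 170, 172, 165, 161, 167, 164, 166, 162, 158,
              175, 181, 180, 177, 173)
shoesize <- c(44, 38, 42, 43, 43, 39, 38, 38, 39, 38, 37, 36,
              42, 44, 43, 43, 41)

# One point per student: height on the x-axis, shoe size on the y-axis.
plot(height, shoesize, xlab = "height", ylab = "shoe size")
```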
5. Recoding variables
What if we want to differentiate between males and females in the plot? Let’s use different plotting
symbols for males and females.
First, we need a vector of plotting symbols. Let’s plot females with F and males with M. The new
vector can be produced by the command ifelse:
Check the help file (type ?ifelse) to see what the arguments of the ifelse command are.
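The ifelse command itself is missing from the text. A sketch of how the recoding might look, assuming the vector is called sym as in the later dataset; the gender values are copied from the printed table, and in the session students$gender would be used directly.

```r
# Gender column copied from the students table printed earlier.
gender <- c("male", "female", "female", "male", "male", "female", "female",
            rep("female", 5), rep("male", 5))

# ifelse(test, yes, no): where the test is TRUE take "M", otherwise take "F".
sym <- ifelse(gender == "male", "M", "F")
sym
```

ifelse works element by element, so the result has one symbol per student.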
We can even represent the different populations with colors. Let's recode the population variable with
color names (Kuopio=blue, Tampere=red):
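The recoding and plotting commands are not shown. The following sketch assumes the color vector is called cols, as in the later dataset, and that symbols and colors are passed to plot through its pch and col arguments; all values are copied from the printed table.

```r
height     <- c(181, 160, 174, 170, 172, 165, 161, 167, 164, 166, 162, 158,
                175, 181, 180, 177, 173)
shoesize   <- c(44, 38, 42, 43, 43, 39, 38, 38, 39, 38, 37, 36,
                42, 44, 43, 43, 41)
gender     <- c("male", "female", "female", "male", "male", "female", "female",
                rep("female", 5), rep("male", 5))
population <- rep(c("kuopio", "tampere"), times = c(7, 10))

# Plotting symbol by gender, color by sampling site.
sym  <- ifelse(gender == "male", "M", "F")
cols <- ifelse(population == "kuopio", "blue", "red")

# pch takes the plotting symbols, col the colors, one per point.
plot(height, shoesize, pch = sym, col = cols)
```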
There are only 16 symbols on the plot. Can you figure out where one has vanished?
6. Making a new dataset
Make a new dataset from the variables height, shoesize, sym and cols:
> students.new
students.height students.shoesize sym cols
1 181 44 M Blue
2 160 38 F Blue
3 174 42 F Blue
4 170 43 M Blue
5 172 43 M Blue
6 165 39 F Blue
7 161 38 F Blue
8 167 38 F Red
9 164 39 F Red
10 166 38 F Red
11 162 37 F Red
12 158 36 F Red
13 175 42 M Red
14 181 44 M Red
15 180 43 M Red
16 177 43 M Red
17 173 41 M Red
> class(students.new)
[1] "data.frame"
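The command that builds students.new is not shown above. Judging from the column names in the printout, it was probably a call to data.frame with the two columns given as students$height and students$shoesize; the sketch below rests on that assumption, with the data re-entered so that it runs on its own.

```r
students <- data.frame(
  height     = c(181, 160, 174, 170, 172, 165, 161, 167, 164, 166, 162, 158,
                 175, 181, 180, 177, 173),
  shoesize   = c(44, 38, 42, 43, 43, 39, 38, 38, 39, 38, 37, 36,
                 42, 44, 43, 43, 41),
  gender     = c("male", "female", "female", "male", "male", "female", "female",
                 rep("female", 5), rep("male", 5)),
  population = rep(c("kuopio", "tampere"), times = c(7, 10))
)
sym  <- ifelse(students$gender == "male", "M", "F")
cols <- ifelse(students$population == "kuopio", "Blue", "Red")

# data.frame() names unnamed arguments after the expressions used,
# which is where the column name students.height comes from.
students.new <- data.frame(students$height, students$shoesize, sym, cols)
names(students.new)
```

Writing height = students$height inside the call would give the shorter column name height instead.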
Make two subsets of the dataset students by splitting it in two according to gender. First find the rows that contain males:
> which(students$gender=="male")
[1] 1 4 5 13 14 15 16 17
Based on that, use subscripts to select the correct subset (take only the rows for which gender is male):
> students.male<-students[which(students$gender=="male"),]
> students.male
height shoesize gender population
1 181 44 male kuopio
4 170 43 male kuopio
5 172 43 male kuopio
13 175 42 male tampere
14 181 44 male tampere
15 180 43 male tampere
16 177 43 male tampere
17 173 41 male tampere
Similarly, make a new dataset from females.
Sometimes we want to split the dataset using some continuous variable, such as height. Typically the
median of the variable is used. Make two new datasets that contain the individuals at or below the
median height and those above it:
> median(students$height)
[1] 170
> students.short<- students[which(students$height<=median(students$height)),]
> students.short
> students.tall<- students[which(students$height>median(students$height)),]
> students.tall
8. Quit R
To quit R, type:
> q()
R then asks whether you would like to save the workspace. Saving is generally a good idea,
so answer "yes". You can then get back to the same analysis just by double-clicking
on the .Rdata icon in your students folder.
If double-clicking does not work, you can start R and use the menu choices File->Load Workspace and
File->Load History to achieve the same result.
AFFYMETRIX PREPROCESSING EXERCISE
PRELIMINARY OPERATIONS
A. Start RGui;
B. Change the working directory by choosing the folder where the CEL files and the PHENODATA are
located;
C. Load the needed libraries:
library(affy)
library(affyQCReport)
library(hs133ahsentrezgcdf)
library(hgu133aEG1000)
First, we create a new AffyBatch object (dat) where we import the CEL files.
dat <- ReadAffy()
The PHENODATA contains the information about the samples (i.e. the microarrays) of our dataset. The
PHENODATA are usually stored into a table (TAB-delimited .txt file) where each row represents a sample
and each column represents a variable.
We create a data.frame object to import the PHENODATA text file
pd <- read.table("phenod.txt", header=T, row.names=1, sep="\t")
2. PERFORMING BASIC QC
This creates a PDF file with some plots. Read the affyQCReport vignette to learn how to interpret the different
plots. Note that the plots on page 2 can also be produced by the commands
boxplot(dat)
and
hist(dat)
Additionally, we can check the RNA quality by using the AffyRNAdeg function.
deg <- AffyRNAdeg(dat)
Individual probes in a probeset are ordered by location relative to the 5' end of the targeted RNA molecule.
Since RNA degradation typically starts from the 5' end of the molecule, we would expect probe intensities
to be systematically lowered at that end of a probeset when compared to the 3' end. On each chip, probe
intensities are averaged by location in probeset, with the average taken over probesets.
We can plot the results into a new PDF file:
a) We create a new PDF graphics device. This directs everything we plot into the new PDF file:
pdf("rnadeg.pdf")
We want to use the re-annotation of the Affymetrix probes of the hgu133a chipset according to the Entrez
Gene database.
To do so, we instruct R to use the new CDF package with the re-annotated information:
dat@cdfName <- "hs133ahsentrezgcdf"
We preprocess the data using the RMA algorithm. The new datrma object is of class ExpressionSet
datrma <- rma(dat)
Finally, we save the normalized expression values into a TAB-delimited text file (.txt)
write.exprs(datrma, "datexprs.txt", sep="\t")
FINAL OPERATIONS
A. Save the workspace
save.image("Affy_Preprocessing.Rdata")
Dario Greco
Institute of Biotechnology - University of Helsinki
Building Cultivator II, room 223b
P.O.Box 56 Viikinkaari 4
FIN-00014 Finland
Office: +358 9 191 58951
Fax: +358 9 191 58952
Mobile: +358 44 023 5780
Email: dario.greco@helsinki.fi
PRELIMINARY OPERATIONS
A. Start RGui;
B. Change the working directory choosing the folder where the Affymetrix data is located;
C. Load the workspace you have saved from the AFFY exercise:
load("Affy_Preprocessing.Rdata")
D. Load the needed libraries:
library(affy)
library(hgu133aEG1000)
library(limma)
First, we visualize the number of significant genes by Venn diagrams. Here we use p-value < 0.01 and
Benjamini-Hochberg p-value correction.
vennDiagram(decideTests(fit2, p.value = 0.01, adjust.method = "BH"))
We extract the gene symbol and the Entrez Gene information from the annotation package:
gs <- as.data.frame(unlist(as.list(hgu133aEG1000SYMBOL)))
eg <- as.data.frame(unlist(as.list(hgu133aEG1000ENTREZID)))
annot <- cbind(rownames(gs), gs, eg)
colnames(annot) <- c("ID", "Gene Symbol", "Entrez Gene ID")
We store the significant genes with their annotation in a data.frame object:
results <- topTable(fit2, coef=1, n = 1838, genelist=annot)
Finally, we save the table of significant genes into a TAB-delimited text file (.txt)
write.table(results, "results.txt", sep="\t")
FINAL OPERATIONS
A. Save the workspace
save.image("Affy_Limma.Rdata")
FINDING OVER-REPRESENTED GO FAMILIES
PRELIMINARY OPERATIONS
A. Start RGui;
B. Change the working directory choosing the folder where the Affymetrix data is located;
C. Load the workspace you have saved from the AFFY-LIMMA exercise:
load("Affy_Limma.Rdata")
D. Load the needed libraries:
library(GOstats)
library(limma)
library(affy)
library(hgu133aEG1000)
Now, we can create the parameters for running the Fisher's Exact Test:
params <- new("GOHyperGParams", geneIds = as.vector(results[,3]),
    annotation = "hgu133aEG1000", ontology = "BP", pvalueCutoff = 0.05,
    conditional = FALSE, testDirection = "over")
In this command, we specify the Entrez Gene Ids, the annotation package, the ontology that we want to
assay (BP, MF, or CC), the p-value cut off (here we chose 0.05), whether we want to run a conditional
test, and the test direction, for finding the over- or the under-represented families (here we want to find
the over-represented families).
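To make the idea of the over-representation test concrete, here is a small base-R sketch of the calculation for a single GO family. All four counts are hypothetical and chosen only for illustration; the sketch does not use GOstats, it only shows that the hypergeometric tail probability and a one-sided Fisher's exact test answer the same question.

```r
# Hypothetical counts, chosen only for illustration.
N <- 10000  # genes in the universe (all genes on the chip)
m <- 200    # universe genes annotated to the GO family
k <- 500    # significant genes submitted to the test
x <- 25     # significant genes that are annotated to the GO family

# Over-representation: probability of seeing x or more annotated genes
# among k genes drawn at random from the universe.
p_hyper <- phyper(x - 1, m, N - m, k, lower.tail = FALSE)

# The same question asked as a one-sided Fisher's exact test on the 2x2 table.
tab <- matrix(c(x, m - x, k - x, N - m - k + x), nrow = 2,
              dimnames = list(selected  = c("yes", "no"),
                              annotated = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

With testDirection = "under" the question would be reversed, which corresponds to the lower tail, phyper(x, m, N - m, k).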
Now we can export the results into a TAB delimited text file:
write.table(BPresults, "BP_over.txt", sep="\t")
FINAL OPERATIONS
A. Save the workspace
save.image("Affy_GOstats.Rdata")
B. Save the history
savehistory("Affy_GOstats.Rhistory")
Clustering – Exercises
This exercise introduces some clustering methods available in R and Bioconductor. It
uses the prenormalized yeast dataset.
We want only the cdc15 data, so take only those columns from the data:
> names(d)
> da<-data.frame(d[26:49])
> dat<-na.omit(da)
Select only the genes whose standard deviation is among the highest 0.3%.
> library(genefilter)
> # Row-wise SDs
> sds<-rowSds(dat)
> # SD value at the 99.7% quantile of the data
> sdt<-quantile(sds, 0.997)
> sel<-(sds>sdt)
> set<-dat[sel, ]
> heatmap(as.matrix(set))
To get other colors in the heatmap, you first need to generate a sequence of colors, and then plot the
heatmap using these colors:
> library(RColorBrewer)
> heatcol<-colorRampPalette(c("Red", "Green"))(32)
> heatmap(as.matrix(set), col=heatcol)
4. Saving the heatmap into a file
For further modifications, the heatmap might need to be saved in a file. This is accomplished with:
> cwd=getwd()
> bmp(file.path(cwd, "heatmap.bmp"), width=1800, height=1800)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()
This results in a print-quality bitmap image of about 6*6 inches (1800 pixels at 300 dpi) in your data folder.
Some journals might want a PostScript image instead. Note that postscript takes its width and height
in inches, not in pixels:
> cwd=getwd()
> postscript(file.path(cwd, "heatmap.ps"), width=6, height=6)
> heatmap(as.matrix(set), col=heatcol)
> dev.off()
In K-means clustering you need to choose the number of clusters (K) yourself, before running the analysis.
> k<-c(5)
> km<-kmeans(set, k, iter.max=1000)
Calculate the average within-cluster sum of squares (withinness) of the result. This is a measure of
how close together the genes lie inside the clusters.
> mean(km$withinss)
[1] 21.1838
K-means starts from random cluster centers, so the result can change from run to run. Run the same
K-means analysis several times, and keep the clustering with the smallest withinness score as the best result.
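The loop below does this by hand. As a possible shortcut, kmeans also has an nstart argument that repeats the random starts internally and keeps the best run. The sketch uses a simulated matrix (the name sim and its dimensions are made up) as a stand-in for the filtered expression data.

```r
set.seed(1)
# Simulated stand-in for the filtered expression matrix (hypothetical data):
# 60 "genes" measured at 8 "time points".
sim <- matrix(rnorm(60 * 8), nrow = 60)

# nstart = 10: try 10 random sets of starting centers and keep the run
# with the smallest total within-cluster sum of squares.
km.multi <- kmeans(sim, centers = 5, iter.max = 1000, nstart = 10)

km.multi$tot.withinss    # total within-cluster sum of squares of the best run
mean(km.multi$withinss)  # the average withinness used in this exercise
```

Because K is fixed, ranking the runs by the total or by the mean of withinss picks the same run.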
> ss<-c(1000000)
> for(i in 1:10) {
> km<-kmeans(set, 5)
> if(mean(km$withinss)<=ss) {
> ss<-mean(km$withinss)
> km.best<-km
> }
> }
6. Visualizing the K-means clustering
Next, set up a 2*2 plotting area, and draw the expression profiles of the clusters. We need a for-loop here:
> par(mfrow=c(2,2))
> for(i in 1:4) {
> matplot(t(set[km$cluster==i,]), type="l", main=paste("cluster:", i), ylab="log expression", xlab="time")
> }