
Advances in Experimental Medicine and Biology 919

Hamid Mirzaei
Martin Carrasco Editors

Modern Proteomics –
Sample Preparation,
Analysis and Practical
Applications
Advances in Experimental Medicine
and Biology

Volume 919

Editorial Board
IRUN R. COHEN, The Weizmann Institute of Science, Rehovot, Israel
N.S. ABEL LAJTHA, Kline Institute for Psychiatric Research, Orangeburg,
NY, USA
JOHN D. LAMBRIS, University of Pennsylvania, Philadelphia, PA, USA
RODOLFO PAOLETTI, University of Milan, Milan, Italy
More information about this series at http://www.springer.com/series/5584
Editors

Hamid Mirzaei
UT Southwestern Medical Center
Dallas, TX, USA

Martin Carrasco
Biotech Division
Neurophagy Therapeutics, INC
Odessa, TX, USA

ISSN 0065-2598 ISSN 2214-8019 (electronic)


Advances in Experimental Medicine and Biology
ISBN 978-3-319-41446-1 ISBN 978-3-319-41448-5 (eBook)
DOI 10.1007/978-3-319-41448-5

Library of Congress Control Number: 2016960751

© Springer International Publishing Switzerland 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in
this book are believed to be true and accurate at the date of publication. Neither the publisher nor
the authors or the editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

With significant advances made in recent years, there has been an increasing demand for
proteomics technology to help pave the way for hypothesis-driven sciences or produce data for
data-driven hypothesis generation. Clinical proteomics is another area of research that has
received a significant amount of attention in recent years with the great promise of developing
biomarkers for the early detection of fatal diseases such as cancer. Other than scientists who
are involved either in method development or complex proteomics applications such as biomarker
discovery, there is an increasing number of scientists without training experience in the field
who wish to use proteomics in their research. Due to the significant cost associated with mass
spectrometer acquisition, maintenance, and operation, many educational institutes and companies
have formed core facilities to provide access to proteomics technology. Faculty members
specializing in proteomics have also been experiencing an increased demand for collaboration in
recent years as the potential of proteomics technology has become more apparent due to an
increasing number of high-impact publications. A major bottleneck in collaborations between the
proteomics community and biologists is the lack of efficient communication. Biologists often
don’t have a clear understanding of what it takes to carry out a proteomics experiment
successfully and reproducibly; this lack of understanding often leads to unrealistic
expectations or poor experimental design and execution. Proteomics sample preparation is highly
variable and experiment dependent and as such not comparable to DNA/RNA sample preparation
methods. There are many different ways to perform mass spectrometry and process data. Often it
is hard to identify the best method when biologists do not have a clear understanding of how
proteomics works.

We decided to write this book to help all scientists interested in using proteomics in their
research and those who want to become experts in the field. This book is intended as a resource
for experimental design starting from sample source, sample preparation, and mass spectrometry
to data analysis and interpretation. With this purpose in mind, we contacted scientists who are
considered leaders in their field and asked for chapter contributions. Since authors of various
chapters did not communicate with each other, there is some redundancy between chapters. We
believe this redundancy is necessary as it reflects different experiences and viewpoints. It is
also helpful for scientists who are not familiar with proteomics to learn the different methods
and tools used in various steps in the proteomics pipeline to bolster their experimental design
and execution capabilities. We are hoping that this book will serve as a tool for understanding
how to design a practical, successful experiment with desirable results. Furthermore, we believe
this work can also be used as a manual for the execution of the various steps in a proteomics
experiment. It is not practical to include every known proteomics protocol in one book, as there
is no way of verifying every protocol for reproducibility. Protocols presented in this book were
provided by leaders in the field and represent a good starting point for method development.
For more complex proteomics experiments, we recommend that those who are not experts in the
field work with an experienced proteomics team.

Every field benefits from a centralized source of information; proteomics is no different. By
taking the first step to create what could become a primary reference for proteomics, we are
providing a resource for scientists in their own research and encouraging other leaders in the
field to unite and support our cause. In turn, this resource could be updated periodically as
new technology and techniques arise, thus assisting future scientists in their endeavors.
Contents

Part I Sample Preparation Strategies for Proteomics

1 Proteomes, Their Compositions and Their Sources
Anna Kwasnik, Claire Tonry, Angela Mc Ardle, Aisha Qasim Butt, Rosanna Inzitari,
and Stephen R. Pennington
2 Protein Fractionation and Enrichment Prior to Proteomics Sample Preparation
Andrew J. Alpert
3 Sample Preparation for Mass Spectrometry-Based Proteomics; from Proteomes to Peptides
John C. Rogers and Ryan D. Bomgarden
4 Plant Structure and Specificity – Challenges and Sample Preparation Considerations
for Proteomics
Sophie Alvarez and Michael J. Naldrett
5 Improving Proteome Coverage by Reducing Sample Complexity via Chromatography
Uma Kota and Mark L. Stolowitz

Part II Mass Spectrometry for Proteomics Analysis

6 Database Search Engines: Paradigms, Challenges and Solutions
Kenneth Verheggen, Lennart Martens, Frode S. Berven, Harald Barsnes, and Marc Vaudel
7 Mass Analyzers and Mass Spectrometers
Anthony M. Haag
8 Top-Down Mass Spectrometry: Proteomics to Proteoforms
Steven M. Patrie

Part III Bioinformatic Tools for Proteomics Data Analysis and Interpretation

9 Platforms and Pipelines for Proteomics Data Analysis and Management
Marius Cosmin Codrea and Sven Nahnsen
10 Tandem Mass Spectrum Sequencing: An Alternative to Database Search Engines
in Shotgun Proteomics
Thilo Muth, Erdmann Rapp, Frode S. Berven, Harald Barsnes, and Marc Vaudel
11 Visualization, Inspection and Interpretation of Shotgun Proteomics Identification Results
Ragnhild R. Lereim, Eystein Oveland, Frode S. Berven, Marc Vaudel, and Harald Barsnes
12 Protein Inference
Zengyou He, Ting Huang, Can Zhao, and Ben Teng
13 Modification Site Localization in Peptides
Robert J. Chalkley
14 Useful Web Resources
Andre Bui and Maria D. Person
15 Mass Spectrometry-Based Protein Quantification
Yun Chen, Fuqiang Wang, Feifei Xu, and Ting Yang
16 Bioinformatics Tools for Proteomics Data Interpretation
Karla Grisel Calderón-González, Jesús Hernández-Monge, María Esther Herrera-Aguirre,
and Juan Pedro Luna-Arias

Part IV Applications of Proteomics Technologies in Biological and Medical Sciences

17 Identification, Quantification, and Site Localization of Protein Posttranslational
Modifications via Mass Spectrometry-Based Proteomics
Mi Ke, Hainan Shen, Linjue Wang, Shusheng Luo, Lin Lin, Jie Yang, and Ruijun Tian
18 Protein-Protein Interaction Detection Via Mass Spectrometry-Based Proteomics
Benedetta Turriziani, Alexander Kriegsheim, and Stephen R. Pennington
19 Protein Structural Analysis via Mass Spectrometry-Based Proteomics
Antonio Artigues, Owen W. Nadeau, Mary Ashley Rimmer, Maria T. Villar, Xiuxia Du,
Aron W. Fenton, and Gerald M. Carlson

Part V Clinical Proteomics

20 Introduction to Clinical Proteomics
John E. Wiktorowicz and Allan R. Brasier
21 Discovery of Candidate Biomarkers
John E. Wiktorowicz and Kizhake V. Soman
22 Statistical Approaches to Candidate Biomarker Panel Selection
Heidi M. Spratt and Hyunsu Ju
23 Qualification and Verification of Protein Biomarker Candidates
Yingxin Zhao and Allan R. Brasier
24 Protocol for Standardizing High-to-Moderate Abundance Protein Biomarker Assessments
Through an MRM-with-Standard-Peptides Quantitative Approach
Andrew J. Percy, Juncong Yang, Andrew G. Chambers, Yassene Mohammed, Tasso Miliotis,
and Christoph H. Borchers

Index
Part I
Sample Preparation Strategies for Proteomics
1 Proteomes, Their Compositions and Their Sources
Anna Kwasnik, Claire Tonry, Angela Mc Ardle,
Aisha Qasim Butt, Rosanna Inzitari,
and Stephen R. Pennington

Abstract
Biological samples of human and animal origin are utilized in research for many purposes and in
a variety of scientific fields, including mass spectrometry-based proteomics. Various types of
samples, including organs, tissues, cells and body fluids such as blood, plasma, cerebrospinal
fluid, saliva and semen, can be collected from humans or animals and processed for proteomics
analysis. Depending on the physiological state and sample origin, collected samples are used in
research and diagnostics for different purposes. In mass spectrometry-based proteomics, body
fluids and tissues are commonly used in discovery experiments to search for specific protein
markers that can distinguish physiological from pathophysiological states, which in turn offer
new diagnostic strategies and help to develop new drugs to prevent disease more efficiently.
Cell lines in combination with technologies such as Stable Isotope Labeling by Amino Acids in
Cell Culture (SILAC) have broader application and are used frequently to investigate the
mechanism of a disease or the mechanism of a drug's function. All of these are important
components for defining the mechanisms of disease, discovering new pharmaceutical treatments
and finally testing side effects of newly discovered drugs.

Keywords
Sample origin • Cell culture • Biological fluids
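
SILAC, mentioned in the abstract, depends on a predictable mass difference between a peptide built from normal ("light") amino acids and the same peptide built from heavy-isotope amino acids. The following is a minimal sketch of that arithmetic, not a method taken from this chapter. It assumes two commonly used labels (13C6 15N2 lysine and 13C6 15N4 arginine) that the chapter does not itself specify, and the names `ISOTOPE_DELTA`, `label_mass_shift` and `peptide_mz_shift` are hypothetical.

```python
# Mass difference in daltons between each heavy isotope and its light counterpart,
# computed from standard monoisotopic masses.
ISOTOPE_DELTA = {
    "2H": 2.0141018 - 1.0078250,      # deuterium vs. hydrogen-1
    "13C": 13.0033548 - 12.0000000,   # carbon-13 vs. carbon-12
    "15N": 15.0001089 - 14.0030740,   # nitrogen-15 vs. nitrogen-14
}


def label_mass_shift(substitutions):
    """Mass added to one amino acid by heavy-isotope substitution.

    `substitutions` maps an isotope name to the number of atoms replaced.
    """
    return sum(ISOTOPE_DELTA[isotope] * count
               for isotope, count in substitutions.items())


def peptide_mz_shift(substitutions, labelled_residues, charge):
    """Spacing on the m/z axis between the light and heavy forms of a peptide."""
    return label_mass_shift(substitutions) * labelled_residues / charge


# Assumed example labels (not specified in the chapter).
LYS8 = {"13C": 6, "15N": 2}    # lysine with six 13C and two 15N atoms
ARG10 = {"13C": 6, "15N": 4}   # arginine with six 13C and four 15N atoms

if __name__ == "__main__":
    print(round(label_mass_shift(LYS8), 4))
    print(round(label_mass_shift(ARG10), 4))
    # A doubly charged peptide carrying a single labelled lysine.
    print(round(peptide_mz_shift(LYS8, labelled_residues=1, charge=2), 4))
```

Under these assumptions the heavy lysine adds about 8.014 Da and the heavy arginine about 10.008 Da, so a doubly charged peptide with one labelled lysine appears roughly 4.007 m/z units above its light counterpart.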

A. Kwasnik • C. Tonry • A.M. Ardle • A.Q. Butt • R. Inzitari • S.R. Pennington (*)
School of Medicine and Medical Sciences, UCD Conway Institute of Biomolecular and Biomedical
Research, University College Dublin, Dublin 4, Ireland
e-mail: stephen.pennington@ucd.ie

H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and
Practical Applications, Advances in Experimental Medicine and Biology 919,
DOI 10.1007/978-3-319-41448-5_1

1.1 Cell

1.1.1 Cell Culture

The human body is composed of an average of 37.2 trillion cells of many types that differ in
morphology, size and function. When the same or
different types of cells are interconnected with each other to carry out specific functions
within an organism they are referred to as tissues. Organs are formed from the combination of
at least two tissues in the human body, which in turn are interconnected to form specialised
body systems. The interplay between such systems contributes to the maintenance of homeostasis
of the entire body [1].

Cell culture can be divided into three basic types: the culture of primary cells, of secondary
cells and of cell lines. Cells directly isolated from mammalian tissues that have not been
sub-cultured are called primary cells. Once a primary cell culture has been sub-cultured
(propagated in vitro), it is termed a secondary cell culture. Such cells have a limited life
span in culture. Primary or secondary cells that have been immortalised to expand their life
span are called cell lines.

Many steps have to be carefully considered before initiating a primary cell culture. Thoughtful
planning of how samples will be obtained, how tissue will be isolated, what method(s) will be
used to isolate cells from tissue and finally what method of cell culture will be used after
cell isolation has to be undertaken before setting up cell cultures. Additionally, any work
performed on animal or human samples has to be carried out according to proper legislation on
experimentation with animals [2] or medical ethical rules in the case of human samples [3].
While whole organs can be isolated from animals, the most common cell sources from humans are
biopsies from specific organs or tissues, usually obtained during diagnostic examination or
surgery. Work with and/or sample collection from animals for research requires ethical approval
from the appropriate research ethics committee, while obtaining human samples requires consent
from the local hospital ethics committee, from the doctor or surgeon responsible for the patient
and from the patient or the patient’s relatives. Biopsies taken during surgery are collected
using sterile containers in appropriate physiological solution or culture medium. It is
important to note that human biopsies carry a risk of infection such as hepatitis, HIV or
tuberculosis, so human samples have to be handled with care, according to biohazard rules [4].
In the laboratory, biopsies are dissected and/or disaggregated in sterile conditions by either
mechanical techniques or enzymatic methods to establish primary cell lines.

Depending on their origin (tissues or body fluids), animal cell cultures can grow either as an
adherent monolayer or in suspension. Adherent cells may grow as a monolayer attached to a cell
culture vessel (plate or flask surface) and this attachment is often essential for cell growth
and proliferation. Most tissue-derived cell lines are anchorage-dependent. In contrast,
hematopoietic cells (cells derived from blood, spleen, or bone marrow) are anchorage-independent
and can grow and proliferate without being attached to a substratum. Interestingly, some
transformed and malignant tumour cell lines can grow in anchorage-independent conditions. Over
the years various techniques have been established to initiate specific types of primary cell
culture. The original method used to initiate cell culture was the ‘spillage technique’. The
name of the method reflects how tissue was processed to isolate cells. Slices of tissue were
placed in medium and shaken to allow cells to migrate into the medium. The cell suspension was
then used to establish a new primary cell culture [5]. Nowadays the most common methods used to
establish a primary cell culture from tissues utilise enzymatic digestion to separate cells
from tissues [6]. Cells that are isolated from mammalian tissues and are grown until sub-culture
are known as primary cell cultures. Cells isolated directly from tissues are usually
heterogeneous but closely represent the tissue-specific properties and the protein expression
profile of parental cells. Primary cell cultures after several sub-cultures into fresh media
will either die out or transform to become a continuous cell line. This ability is commonly
observed among rodent cells but not in human cells, as cells derived from humans have a limited
lifespan and can only be cultured for a limited time period before becoming ‘senescent’. To
extend the lifespan of cell cultures, the cells are commonly immortalised by viral transfection
so that the cells continue to divide for more ‘passages’. Bioresources of
immortalised human and animal cell lines are commercially available. One of the biggest
resources of cell lines is provided by the ATCC Cell Biology Collection
(http://www.lgcstandards-atcc.org/en.aspx). Transformed cell cultures show similar phenotypic
and molecular properties to neoplastic cells, including changes in morphology or chromosomal
variations. Moreover, transformed cells have an ability to form tumours when injected into
animals/hosts with weak immune systems.

Conducting research on cell lines has many advantages, with the major one being the consistency
and reproducibility of results that can be achieved from experiments performed on cell lines
when compared to tissues, organs or body fluids. For this reason, cell culture is commonly used
in the field of mass spectrometry, especially during method development, where consistency and
reproducibility between experiments is necessary. Easy accessibility, rapid growth rate and ease
of manipulation both genetically and biochemically (through chemical and pharmacological
treatment) make cell lines an attractive model in research. Thus, cell lines are commonly used
in mass spectrometry discovery experiments to investigate differences between normal and
aberrant (for example cancerous) phenotypes [7–9] or to investigate different stages of the
disease [10, 11]. Cell lines are also commonly applied in mass spectrometry research to
investigate the signal transduction of molecular pathways of specific types of cells [12, 13]
and post-translational modifications such as phosphorylation [14, 15] or ubiquitinylation
[16, 17]. Cell lines are also used to test the effect of various chemical compounds (for example
inhibitors or activators) or pharmacological drugs on different cellular systems. This approach
is commonly used in the mass spectrometry field to determine pathways that are affected by
treatment or to investigate side effect(s) of treatment [18]. Another advantage of using cell
culture for mass spectrometry-based research is the analysis of proteomes of specific cellular
organelles. This is referred to as subcellular proteomics and is based on the identification of
proteins specifically expressed in cellular organelles such as the nucleus, ribosome or
mitochondria [19, 20].

The simplicity of maintaining cell lines under various media conditions has led to the invention
of a mass spectrometry technique called stable isotope labelling with amino acids in cell
culture (SILAC) [21]. In this method, culture medium is supplemented with light (normal) and
heavy labelled amino acids that are incorporated into newly synthesised proteins. The heavy
amino acids contain 2H instead of H, 13C instead of 12C or 15N instead of 14N. The incorporation
of heavy labelled amino acids into the proteins increases the molecular weight of the protein
compared to the light (normal) proteins. This rule also applies to the peptides generated after
enzymatic digestion of proteins, and leads to a known mass shift compared to the respective
unlabelled peptide. The SILAC technique allows direct comparison of different proteomes in a
single-tube experiment with minimal introduction of sample preparation errors. Indeed, SILAC is
broadly used in the mass spectrometry field [22, 23].

As one may expect, cell culture is not without limitations. The most common problems with cell
lines include infection with microorganisms, the cross-contamination of cell lines with other
cell types, and genomic and phenotypic instability [24–26]. Contamination with microorganisms
is a serious problem worldwide among laboratories, as the presence of microbes in culture media
can inhibit cell proliferation and growth and in the most extreme infections lead to cellular
death. The most common animal culture contaminations are caused by bacteria, yeast, fungi, mold
and mycoplasma [27]. Microbial contamination is most often caused by poor cell culture technique
and by the use of contaminated media, reagents or equipment. Microbes can also be present in
incubators, refrigerators, laminar flow hoods or on the skin of researchers working under
laminar flow hoods. The infection can also be introduced when cell cultures are received from
external sources such as other laboratories, or from cells that have been isolated from infected
animals or humans.

Several features of microbial contamination can be visually observed, such as a change in pH
that usually leads to a change in colour of the medium or makes the medium appear cloudy. Also,
careful inspection of the cultured cells under the microscope can indicate some changes in
infected cultures such as cell death. The presence of rods, cocci or thin filamentous mycelia
that may form clumps or spores may indicate bacterial or yeast/fungi contamination,
respectively. While bacterial or yeast/fungus contamination can be detected easily, cells
contaminated with mycoplasma may grow undetected for several passages as there is no obvious
evidence of mycoplasma infection such as a pH change or the presence of cellular death [28]. To
overcome problems with microbial contamination, cell culture media are commonly supplemented
with antibiotics such as penicillin, streptomycin and amphotericin B. Although these reduce the
risk of microbial contamination, the routine use of antibiotics may affect the phenotype of the
cells. Moreover, microbes may become resistant to the antibiotics and grow despite their
presence. Most cell culture laboratories routinely apply sensitive tests to detect mycoplasma
infections, which are based on the detection of rRNA or DNA of mycoplasma or on visualisation
using mycoplasma-specific polyclonal antibodies [29]. A common problem that is not easy to
detect, and was consequently ignored for many years in scientific research, is
cross-contamination of the cells with different cell lines. Cross-contamination is more
difficult to recognise than microbial infections such as bacterial or fungal ones, as there is
no physical indication of cross-contamination such as a change of medium colour or cell death.
The most common sources of cross-contamination are poor cell culture techniques and human
mistakes such as simple errors made during sample labelling. The first reports of
cross-contamination came from the research conducted by Nelson-Rees, who reported that many of
the cell lines used in research had been switched or cross-contaminated with HeLa cells [30].
Since then the problem of cross-contamination has become widely recognised by scientists and
much research has been undertaken to address this issue [31–33]. Several methods, such as
karyotyping [34], isozyme analysis [35], HLA (human leukocyte antigen) typing [36] and DNA
fingerprinting [37] have been applied for the identification of cross-contamination. One problem
that has been noticed during long-term culturing of cell lines is that rapidly growing cell
lines (such as tumour cell lines) are likely to undergo genomic fluctuations and this often
leads to phenotypic and/or genotypic instability as well as gene or protein expression changes
between different passages of the same cell line or sub-lines that are derived from the same
parental population of cells [38]. These genotypic and phenotypic changes can have an effect on
gene expression and are caused by many environmental factors, such as culturing conditions,
including different types of media, serum, trypsin, CO2 levels or the temperature used to
culture cells between different or even within the same laboratory. Genetic changes can occur
spontaneously when a small population of cells divides at a higher rate than the other cells.
This will result in the natural selection of the smaller population of the cells within a short
period of time. While the culturing conditions can be kept constant, there is not much that can
be done about natural drifting except that stocks of the low cell passages must be prepared and
stored in liquid nitrogen and these stocks should be used at regular time intervals, for example
every ten passages. Research has shown that maintaining cell lines under identical culture
conditions results in a more stable genotype and phenotype over a long period of time [39, 40].

Overall, cell-cell cross-contamination and contamination with microorganisms as well as genomic
and phenotypic instability are common problems in cell culture. All of these factors affect
experimental results, which in turn prevents the reliable comparison of results within and
between laboratories due to the lack of good reproducibility. Regular quality controls for
cross-contamination and microbial contamination in the cell culture laboratory can help to
overcome these problems. Maintaining stable culturing conditions and renewing cell cultures at
low passages over short periods of time is a good way to retain good genomic and phenotypic
stability of cultured cells. All of these disadvantages of cell culture may cause serious
problems when
conducting research but can be easily avoided when good aseptic culturing technique is used and
cells are carefully monitored.

1.1.2 Tissue Culture

Tissue culture is the growth of animal tissue outside of the organism, in a culture medium. For
tissue culture, cells are grown in a medium supplemented with nutrients and energy sources that
are essential for cell survival. To prevent contamination or infection, tissue culture medium is
also supplemented with antibiotics and/or fungicides [41]. Tissue culture provides an in vitro
model of animal tissue that can be easily manipulated for research investigations pertaining to
disease progression. Furthermore, tissue culturing allows the analysis of single cell
populations (e.g. fibroblasts or macrophages) as well as mixed cell populations, similar to what
would be found in the in vivo environment [42, 43]. Cultured tissue cells can be frozen down and
stored over long periods of time for future study. Freezing cultures prevents genetically
induced changes and the loss of cultures due to senescence or accidental contamination [44].
Tissue culture is classified as a primary cell culture when cells are extracted directly from
human tissues and grown in culture medium; however, once the cells are sub-cultured and
immortalised, they are then classified as cell lines [45]. As primary cultured cells retain the
unique characteristics of the original tissue from which they were extracted, they are of
significant use in investigations designed towards the understanding of disease origin and
malignant progression at a cellular level [46].

The proteome refers to all proteins that are expressed by a cell, tissue or organism under
defined conditions, at a certain time. The proteome is highly influenced by both environmental
stimuli and disease processes, which is why proteomic-based investigations are key to
understanding the biological mechanisms which underlie disease [47]. To this end, tissue culture
models are useful for the investigation of the role of certain genes, the effects of drug
treatment and the effects of viral infection at the protein level [48]. Once tissue cells have
been treated to induce a desired phenotype, cells are lysed to extract protein for comprehensive
analysis of protein expression and protein-protein interactions [48]. The major downside of
performing proteomic investigations on cultured cells is that they cannot provide accurate
insight into disease progression in vivo. Successful in vitro investigations of the pathobiology
of disease are therefore enhanced if cells are grown in an environment that mimics the 3D
architecture of human tissue [49]. Growing tissue cells in vitro in 3D heterogeneous co-culture
systems, which allow for interactions between disease (e.g. tumour) cells, stromal cells,
endothelial cells, fibroblasts, immune cells and the extracellular matrix (ECM), is thought to
overcome the limitations of the standard 2D monolayer culture systems [49–51]. Various 3D model
systems have been extensively utilized in the field of cancer research [52, 53]. The
‘multicellular tumour spheroid model’ refers to the culturing of cells under non-adherent
conditions, ‘tumour spheres’ mimic cancer cell expansion in serum-free conditions with
supplemented growth factors, while ‘tissue-derived tumour spheres and organotypic multicellular
spheroids’ are derived from mechanical dissociation and cutting of tumour tissue [51].

Aside from culturing of tissue cells, proteomic experiments can also be performed on fresh
tissue specimens. This method is slightly more challenging, as harvesting and processing tissue
specimens must be performed as quickly as possible to avoid any protein degradation [54]. Tissue
samples can, however, be snap frozen to preserve their proteome integrity if, for example,
samples have to be retrospectively analysed [55]. Another way to preserve tissue samples is to
fix them using formalin and embed them in paraffin wax. Such samples are referred to as
formalin-fixed paraffin-embedded (FFPE) tissue samples. This is a universal method of
preparation to preserve and stabilize tissue samples for histological evaluation. Protein
extraction from FFPE material has proven difficult in the past, due to the molecular
cross-linking that occurs during formalin fixing. However, numerous
8 A. Kwasnik et al.

protocols have been optimized for efficient protein extraction from FFPE material for subsequent proteomic analysis via both antibody and mass spectrometry based techniques [56, 57]. Techniques for harvesting cells directly from tissue samples have evolved in recent years. Laser capture microdissection (LCM) is a popular technique in which cells from specified regions of interest within a tissue section can be obtained, using a microscope to guide a laser beam that attaches cells from the tissue to an adhesive film [58]. This technique is particularly useful when, for example, comparing tumour tissue to surrounding benign or stromal tissue from the same patient.

Tissue samples for either culturing or direct harvesting are generally obtained during routine surgery (human) or following euthanasia of animal models. In this way they are useful for proteomic profiling of the disease state and/or the surrounding area, for molecular investigations of the disease process.

1.1.3 Organ

Organs encompass a variety of different tissue and cell types. Therefore they provide a much more heterogeneous sample source for proteomic investigation. As with tissue samples, organ cells can be extracted and grown under cell culturing conditions, or extracted and digested directly from organ tissue (generally obtained post-mortem), as described for tissue culture. However, it is difficult to routinely obtain organ samples from humans. Because proteins can be routinely measured in easily accessible biological fluids, they are more attractive candidates as biomarkers; many biomarker discovery and validation experiments for organ-associated pathologies are therefore conducted on bio-specimens that are secreted from the organ of interest. For example, blood, bile, stool and urine are attractive sources for the identification of protein biomarkers related to the heart, liver, intestine and kidney/pancreas, respectively [59–63]. However, these efforts have failed to result in clinically applicable disease biomarkers. As such, there is still a reliance on informative animal models to accelerate the progress in clinical proteomics [64].

The use of animal models overcomes limitations regarding organ and tissue sampling from humans, which is particularly restricting in the study of neurological disorders [64]. For disease-focused investigations, animal models provide a much more controlled system which allows for proteomic profiling at set times/disease points with less influence from the external environment [65, 66]. Animals such as mice, rats, pigs, dogs, zebrafish and fruit flies are considered useful models for human disease, based on the overall conservation of their proteome with the human proteome of interest [67, 68]. Similar to cell culture experiments, a disease or disease-like state can be induced in animals, which are housed under conditions identical to those of healthy control animals. The disease phenotype can be induced genetically, therapeutically or with environmental stimuli [69–71]. When animals are eventually euthanized, the differences in protein expression observed in their organ/tissue material compared to that of the healthy control animals can be more confidently associated with disease, which is less true for human samples due to the inherent heterogeneity of humans [66, 72]. Indeed, the comprehensive quantitative proteomic analysis of whole animals is now achievable by implementing an in vivo SILAC technique (see Chap. 13). This is achieved by feeding animals with a ¹³C₆-lysine diet for in vivo labelling of proteins, which, when extracted from animal organ/tissue material, can be analysed using mass spectrometry based techniques to make comparisons between healthy and diseased tissues [73].

1.1.4 Exosomes

In addition to profiling the proteins expressed within the cell, it is widely accepted that proteins
1 Proteomes, Their Compositions and Their Sources 9

that are secreted by various cells are also a valuable source of pathobiological information [74]. The global study of proteins that are secreted by cells is defined as secretomics [75]. Secreted proteins can be found in both biological fluids and conditioned media from cell cultures [76]. The secretome is largely represented by membranous vesicles, of which there are many types including exosomes, exosome-like vesicles, microparticles, microvesicles, membrane-bound particles, apoptotic bodies and apoptotic microparticles [77]. Of these, exosomes are considered particularly attractive for proteomic research. Exosomes are small membrane vesicles derived from the luminal membranes of multivesicular bodies. They are actively released by fusion of multivesicular bodies with the cell membrane [78]. They are differentiated from other membranous vesicles by their characteristic size (approximately 30–100 nm) and expression of CD81 protein [79, 80]. Exosomes are likely to be enriched in low-abundance and membrane proteins that are difficult to detect in standard cell or biological fluid material. They contain a conserved set of common proteins which are essential for the biogenesis, structure and trafficking of the vesicles. Moreover, they contain proteins that are specific to the cell/biological fluid from which they are isolated and are therefore considered a valuable sample source for disease-specific biomarker discovery [81, 82].

Due to the growing popularity of exosomes in proteomic research, there are numerous optimized protocols available for exosome isolation from biofluids (conditioned media, serum/plasma, urine etc.). Generally, exosome isolation can be achieved following a series of ultracentrifugation steps. However, there are also a number of commercial kits available for exosome isolation and purification which are applicable for proteomic profiling of exosome material [83].

1.2 Biological Fluid

1.2.1 Serum and Plasma

Blood is a bodily fluid that circulates through arteries and veins, supplying the tissue with oxygen and taking away carbon dioxide to be excreted. It is also responsible for providing nutrients to tissues and hormones to cells, and is an important part of the immune system. Blood constitutes up to 8 % of total body weight in humans and it contains components such as serum, plasma, red blood cells (RBCs), white blood cells (WBCs) and clotting factors. Serum is the liquid fraction of whole blood that is collected by centrifugation after the blood is allowed to clot and it does not contain RBCs, WBCs or clotting factors. Plasma is the pale yellow liquid component of blood that holds the blood cells in suspension, thus acting as an extracellular matrix for blood cells. Plasma is collected by centrifugation of whole blood collected in tubes that are treated with anticoagulant. Both serum and plasma contain similar components such as glucose, electrolytes, antibodies, antigens, hormones, proteins, enzymes, nutrients and certain other molecules, whereas clotting factors are only present in plasma [84].

Both serum and plasma represent ideal biological samples, as they are readily accessible body fluids and contain many proteins that are synthesized, secreted, shed or lost from the cells and tissues throughout the body. Fluctuations in the expression levels of these proteins in serum and plasma can reflect a pathophysiological condition [85] and thus they are routinely used for blood testing in hospitals and clinics [86–88]. Serum is preferentially used for the determination of an individual's blood group and for various diagnostic blood tests such as determining the levels of hCG, cholesterol, proteins, sugar, etc. in blood. However, plasma is primarily used for transfusion in patients suffering from
haemophilia and other blood-clotting disorders, immunodeficiency, shock or burns [84]. Serum is favoured for diagnostic testing in medicine due to the presence of more antigens as compared to plasma. Moreover, anticoagulants in plasma may interfere with the chemical reactions that are employed in diagnostic tests to measure levels of the blood constituents. Furthermore, anticoagulants in plasma may draw water out of cells, thus diluting the sample and changing the test results. Whilst plasma may not be the preferential body fluid for diagnostic tests, it presents various benefits for patients suffering from blood-clotting disorders requiring transfusion, as plasma can be frozen and stored for up to a year and is easy to transport. Moreover, plasma is replaced in the human body every 2–3 days, so it can be donated more frequently than whole blood. Therefore, while the anticoagulants present in plasma make it undesirable for certain diagnostic tests, serum cannot be used for transfusions, due to the absence of blood clotting factors. Thus, serum and plasma have different advantages and disadvantages and are fit to serve different applications in medicine.

Serum and plasma have been used for multiple proteomics based biomarker discovery studies [89–93] as they represent readily accessible and clinically relevant samples. However, there appears to be a lack of understanding of the issues critical for the processing of plasma and serum samples for analysis. Often, the most basic yet crucial aspects of serum and plasma sample collection are neglected, such as uniformity in collection of samples using a standard operating procedure (SOP), sample processing, and storage conditions. It is only by standardising these aspects that one is able to assure reproducibility of samples and to allow some rational comparison of data from various laboratories [94]. Until that is accomplished, any kind of data analysis is questionable. The next problem with the use of serum and plasma samples for proteomic analysis is the analytical challenge that these samples present due to the wide qualitative and quantitative dynamic range of proteins, which spans more than 12 orders of magnitude [95]. In fact, 96 % of the total serum or plasma protein content is accounted for by a small number of highly abundant proteins such as albumin, immunoglobulins, alpha-1-antitrypsin, haptoglobin, etc. that can mask potential biomarkers. Thus, prior to proteomic analysis, it is essential to deplete these highly abundant proteins with the use of columns or matrices such as the Multiple Affinity Removal System (MARS, Agilent Technologies) [96–98], ProteomeLab IgY system (Beckman Coulter) [99], hexapeptide combinatorial library beads [100], ImmunoAffinity Subtraction Chromatography resin (IASC) [101] and others. It is clear that during affinity depletion of highly abundant proteins, these columns also remove other components of serum and plasma by 'non-specific' binding. Since most proteomic biomarker discovery studies do not have a specific target protein, it is not possible to know whether a biomarker of interest is lost upon the removal of highly abundant proteins.

1.2.2 Cerebrospinal Fluid

Cerebrospinal fluid (CSF) is a transparent body fluid (mean volume 150 ml) contained within the brain ventricles (25 ml) and the cranial and spinal subarachnoid spaces (125 ml). It is produced predominantly in the choroid plexus and plays a protective role in the central nervous system (CNS) [102–104]. Historically it was believed that the main role of CSF was to provide mechanical protection to the CNS, acting as a shock absorber. However, it is now well understood that in addition to this function CSF has an essential role in maintaining homeostasis within the interstitial fluid of the brain parenchyma as well as regulating neuronal functioning [103, 104].

Due to its proximity to the brain and spinal cord, CSF is a common matrix for monitoring and assessing neurodegenerative disorders. Molecular changes that occur in the CNS, such as changes in protein expression levels, serve as objective markers of CNS-associated disease. Indeed, over the past decade, considerable effort has been expended on the discovery of putative
protein biomarkers of neurodegenerative disease in human CSF. Proteomic analysis of CSF is typically performed using high resolution liquid chromatography mass spectrometry (LC-MS/MS) [104]. Many mass spectrometry based studies have identified CSF biomarkers with potential diagnostic utility in neurodegenerative diseases including Alzheimer's, multiple sclerosis and Parkinson's [105, 106].

Despite the advantages of CSF there are some difficulties associated with using this body fluid. Firstly, the collection of CSF requires an invasive procedure referred to as a lumbar puncture or a spinal tap. A lumbar puncture must be performed by a physician; it is an uncomfortable procedure and can be associated with postdural puncture headaches [107]. Moreover, traumatic punctures can introduce red blood cells into the CSF and artificially increase the white blood cell count and protein expression levels and thereby skew a diagnosis [108]. Secondly, CSF is a complex matrix: 80 % of the CSF proteome originates from plasma, yielding a wide dynamic range of protein concentrations (spanning 10 orders of magnitude) [109, 110]. As with serum and plasma, the presence of highly abundant proteins precludes the identification of potentially interesting analytes present in lower concentrations. To facilitate a greater depth of analysis it is necessary to remove the highly abundant proteins from the sample before analysis, and many methods for protein depletion have been established [109]. Alternatively, fractionation methods can be employed for improved coverage and deeper proteome analysis [104, 111]. While depletion and fractionation methods enhance our ability to identify lower abundance proteins, they are neither time- nor cost-effective techniques, and these stepwise procedures can add variation during sample preparation, leading to dubious findings [102].

1.2.3 Urine

Human urine has been used for decades by physicians to diagnose various disorders. Urine is produced by the kidney during the elimination of waste produced by the human body, which accumulates in the blood. The kidney also fulfils other roles such as maintaining whole body homeostasis and producing hormones including renin and erythropoietin [112, 113]. The human kidney is composed of about one million units called nephrons, which can be divided into two functional parts: the glomerulus and the renal tubule. The glomerulus is responsible for the first filtration of the plasma to generate the 'primitive' urine. The renal tubule reabsorbs most of the primitive urine to generate the 'final' urine that exits the kidney through the ureter into the bladder. In 24 h, about 900 L of plasma flows through the kidneys. Of this, 150–180 L is filtered as 'primitive' urine, but more than 99 % of this urine is reabsorbed. The remaining fraction that is not reabsorbed forms the 'final' urine. The urinary proteome may therefore contain information not only from the kidney and the urinary tract but also from other organs, via plasma proteins obtained by glomerular filtration, making it a good source of biomarkers for urogenital and systemic diseases [114, 115].

Under normal conditions, urinary proteins are stored in different compartments that can be isolated by sequential centrifugation. The separate populations of proteins are identified as soluble proteins, urinary sediment proteins and urinary exosomes. Soluble proteins are derived by glomerular filtration of plasma proteins, while some are also excreted by epithelial cells. The urinary sediment proteins are mainly sloughed epithelial cells and casts. The urinary exosomes are derived from the epithelial lining and the urinary tract but can also be derived from many other cell types, which can be identified in plasma and may be filtered into urine. Urine has several advantages over other body fluids: it can be obtained in large quantities using a non-invasive procedure, and urinary peptides and lower molecular weight proteins are generally soluble and can be analysed in a mass spectrometer without any digestion. Moreover, the urinary protein composition is relatively stable, probably due to the presence of endogenous proteases in the bladder, while urine is being stored there. Stability
studies have shown that the urinary proteome does not change significantly when urine is stored at 4 °C for several days or at room temperature for up to 6 h [116, 117]. In addition, urine can be stored for several years at −20 °C without significant alterations to its proteome. Studies of urinary exosomes, however, indicate that this proteome may be less stable [118].

On the other hand, urine varies widely in protein and peptide concentrations, depending on differences in the daily intake of fluid; however, this can be normalized by taking creatinine excretion into account [119]. In addition, the definition of disease-specific biomarkers in urine is complicated by the significant changes in the proteome throughout the day, which can be connected with the time of collection, diet, exercise, circadian rhythms and circulatory levels of various hormones [120, 121]. These variations seem to affect only a limited fraction of the urinary proteome, while a large portion shows high reproducibility [122]. The Human Kidney and Urine Proteome Project (www.hkupp.org/), under the directive of the World Human Proteome Organisation (www.hupo.org/), is currently establishing standardized procedures to avoid this variability.

Currently, the common preparation method for biomarker identification in urine involves centrifugation of the urine sample and collection of the soluble fraction or the urinary exosomes, followed by 1 or 2 separation steps before mass spectrometry analysis [123]. However, the pellet fraction is also of biological interest as it contains information from proximal tissue or organs and also from organisms that colonize or infect the urogenital tract. Filter-aided sample preparation (FASP) has been used in shotgun proteomics for the lysis of cells present in urinary pellets, after the solubilisation of proteins derived from cell pellets [124].

1.2.4 Saliva

Saliva is a clear liquid that originates mainly from three major glands (parotid, submandibular, and sublingual) with a small fluid contribution from several minor glands and from the gingival crevicular fluid (GCF) [125]. Most salivary proteins are synthesized in the acinar cells of the salivary glands and follow a well-established secretory pathway. For the majority of salivary proteins this common secretory pathway includes transit in the Golgi apparatus and storage in secretory granules, release from the cell into the duct system and secretion into the mouth [126].

During the different steps of the secretory pathway, proteins are subjected to a number of changes such as removal of the signal peptide as well as various post-translational modifications (PTMs) including proteolytic cleavage, glycosylation, phosphorylation, and sulfation. Further modifications of the proteins and peptides occur during transition into the ducts before secretion, and additional modifications occur in the oral cavity after secretion from the cells as a result of the action of a number of proteolytic enzymes of different origin [127].

Saliva is composed mostly of water containing electrolytes, immunoglobulins, proteins and enzymes and plays an important part in the health of the oral cavity [128]. The basic role of saliva is protection of the mucous membrane of the oral cavity and digestive tract through the following functions: maintaining lubrication, buffering action and clearance, maintenance of tooth and mucosal integrity, and facilitating the repair of the mucosal layer. Saliva also contains components that show antibacterial and antiviral activity, and it plays an important role in taste and the first phase of food digestion [129].

In healthy subjects the production of saliva is 1 to 2 L a day. Saliva secretion follows circadian rhythms and production is usually highest in the late afternoon while it is lowest during the night [130, 131].

The production of low amounts of saliva is related to a number of different pathologies and is indicated by the general term dry mouth (xerostomia). Certain medications can also affect saliva production (low or over production) [132].
Human saliva contains proteins of clinical relevance and about 30 % of blood proteins are also present in saliva, making this biological sample an important tool for clinical application. Moreover, the simple nature of sampling, the non-invasiveness, ease of collection and the possibility of multiple collections by untrained personnel are some of the advantages of saliva sampling. On the other hand, due to the dynamics of the salivary proteome, sample preparation needs to be coupled to a well-controlled study design in order to allow saliva to enter clinical practice as an alternative to blood-based methods.

Human saliva reflects the health and well-being of the body, and most of the biomolecules that are usually detected in urine and blood can also be found in salivary secretions; however, protein concentrations in saliva are 10–1000 times lower than in blood [133, 134]. The low concentration of highly interesting proteins and the high concentration of some classes of proteins (e.g. mucins), along with the technology used for their characterization and analysis, significantly influence the preparation method. For the purpose of precision and accuracy of a measurement, specimen collection, handling and processing are of vital importance. For example, the use of a cocktail of protease inhibitors after collection and during the processing of saliva samples must be standardised.

Proteolytic activity plays a fundamental role in the secretion pathway, which allows fully mature proteins to be secreted and be functional in the oral cavity. However, different proteases, both endogenous (derived either from the salivary glands or from the exfoliating cells) and exogenous (oral flora), contribute to the overall proteolytic activity in saliva samples post collection. The action of these proteases may result in misleading information about the saliva proteome. The use of protease inhibitors can help to avoid incorrect identification of a pre-secretory event due to post-secretory proteolytic activity [135]; however, their use, especially when the inhibitors are peptides, can increase the complexity of the sample and interfere with proteomic analysis. It is also generally recommended to use low-protein binding tubes made of plastic to avoid the adsorption of analytes to the tubes or the release of polymers from the plastic that can interfere with the subsequent analysis. It is important to use a standardized saliva collection and processing protocol for both diseased patients and healthy controls. For example, it is recommended to discard the initial 2 min of parotid secretion due to its large inter-individual variability. Moreover, for proteomic analysis, it is important to keep samples on ice during collection and processing because protein degradation in whole saliva is very rapid at room temperature and this may occur during saliva collection and handling [136]. One way to minimize misleading or artificial degradation of proteins is to minimize the processing time between sample collection and final storage. Saliva samples in research projects are often stored for long time periods before they can be analysed. The recommended storage temperature is below −20 °C until analysis. Some researchers freeze samples in liquid nitrogen to avoid problems of slow freezing of biological samples and protein inhomogeneity. The recommended temperature for saliva sample storage is −80 °C, as unusual post-translational modifications have been reported for samples stored for 3 days at −20 °C, demonstrating that protease activity is still present at −20 °C. This activity was not observed when samples were stored at −80 °C [135].

To subject saliva samples to proteomic analysis, samples are typically collected on ice as whole saliva or as selective saliva from specific salivary glands. The samples are then centrifuged to remove insoluble material and the supernatant is stored at below −20 °C until analysis [137–141]. The centrifugation step needs to be evaluated in terms of length and speed applied, because the extent of centrifugation has been shown to cause co-precipitation of specific classes of proteins such as PRP, cystatins and statherin [142]. Some researchers report the use of centrifugation in conjunction with protein precipitation (e.g. 10 % TCA/acetone/20 mM DTT) to prevent loss of proteins [143, 144], while another study reports the use of acid to eliminate
mucins and acid-insoluble proteins to generate a sample that can be directly analysed by mass spectrometry [135]. Centrifugation may sometimes be avoided if the samples are collected from single glands using cannulation, or for ductal secretion collections using a Carlson–Crittenden cup [138, 145] over the orifice of Stenson's duct [146, 147]. The main goal of a well-established and rigorous process for processing the salivary sample is to minimize artificial changes after sample collection that could lead to a 'false salivary proteome'. Low-abundance salivary proteins have been extensively studied by applying sample preparation methods involving separation and enrichment strategies, and the same strategies have been applied for the characterization of post-translational modifications (PTMs), with a special focus on phosphorylation [148, 149] and glycosylation [150]. Enrichment strategies typically involve the use of a solid phase matrix with affinity for the PTM being studied (e.g. TiO₂ for phosphorylation). There are a number of commercially available kits for biomarker discovery on saliva samples, and identified biomarkers are strictly related to the type of sample acquired, such as whole salivary samples or samples selected from a single gland [151]. A number of commercially available kits that are applicable to research or diagnostic purposes for the study of saliva include DNA Genotek (www.dnagenotek.com); Salimetrics oral swabs (http://www.salimetrics.com); Oasis Diagnostics® VerOFy® I/II; DNA SAL™ (http://www.4saliva.com); OraSure Technologies OraSure HIV specimen collection device (http://www.orasure.com); CoZart® drugs of abuse collection devices (http://www.concateno.com); and the Greiner Bio-One Saliva Collection System (http://www.gbo.com) [152].

Standard proteomic analysis can be performed using either a 'bottom-up' or 'top-down' approach. The top-down approach is used for analysis of intact proteins without protease digestion and can lead to the unbiased detection of isoforms and variants arising from sequence polymorphisms, splice variants and post-translational modifications, as compared to searching a digested peptide mixture against a specific protein database. Application of this top-down proteomics approach to saliva samples allows the identification of single nucleotide polymorphisms and new sites of phosphorylation on cystatin SN and PRP3 [153]. Moreover, small proteins and peptides are abundant in saliva, and the relatively small size of these components has enabled top-down analytical approaches to profile their abundances and identify PTMs including phosphorylation, Gln to pyro-Glu conversion and glycosylation [154].

The bottom-up proteomic approach minimizes sample complexity, increases sensitivity and is the traditional approach for PTM characterization following protease digestion or PTM release. Top-down and bottom-up proteomics approaches require sample preparation to separate or fractionate components before detection by a mass spectrometer. These separation procedures can include SDS-PAGE, liquid chromatography, isoelectric focusing, affinity chromatography for depletion or enrichment, and release of PTMs.

For the detection of low-abundance disease-specific biomolecules in human saliva, which are mainly derived from blood or GCF, a strategy needs to be implemented to enrich for low-abundance proteins by the removal of highly abundant proteins. Enrichment strategies include pre-fractionation methods, such as sequential extraction of proteins with varying buffer conditions [155], sub-cellular fractionation [156] and selective removal of highly abundant proteins via affinity methods [157].

1.2.5 Semen

Human semen is a greyish coloured body fluid that is composed of a variety of components produced by male gonads during a process called ejaculation. The main component of semen is the spermatozoa, which are ejaculated in the presence of enzymes and nutrients (seminal fluid) that help spermatozoa to survive and enable fertilization. Seminal fluid is produced by multiple male accessory glands such as the prostate, seminal vesicles, the epididymis and Cowper's gland. Seminal fluid contains acid phosphatase,
inositol, citric acid, calcium, magnesium, zinc, fructose, ascorbic acid, prostaglandins, L-carnitine and neutral alpha-glucosidase [158]. Moreover, seminal fluid contains high amounts of proteins and amino acids that range from 35 to 55 g/L and is therefore a good and easily accessible source for protein identification. However, similar to other body fluids, semen contains a number of highly abundant proteins that mask the low-abundance proteins and this makes proteomic analysis of seminal fluid difficult.

Semen samples have applications in research areas such as reproduction [159, 160] and prostate cancer and are used for many purposes in the diagnosis of male fertility, for example, for the assessment of spermatozoa morphology, motility and concentration [161, 162].

For diagnostic or research purposes, semen is collected by ejaculation into a non-toxic and clean plastic or glass container. The collection, transport and processing of semen samples should be carried out at an ambient temperature of 20–37 °C [163]. An essential step in semen sample preparation for mass spectrometry analysis is the purification of seminal fluid from sperm cells and any other cells contained in semen. This step is usually achieved by density gradient centrifugation using PureSperm or Percoll. An alternative method, swim-up, has also been described [164]. The non-invasive collection of seminal fluid and the specificity of seminal fluid to male glands make it a potentially good source for discovery of new biomarkers in prostate cancer and research towards infertility. Indeed, the application of seminal fluid in both prostate cancer research [165] and reproduction research [166, 167] has increased over the last few years.

1.2.6 Circulating Tumour Cells

Body fluids, in addition to aqueous solution, also contain solid cells. For example, 45 % of the blood is composed of a mixture of red blood cells (erythrocytes), white blood cells (leukocytes) and platelets (thrombocytes). The complete count of blood cells is routinely used in diagnosis to screen for a wide range of conditions and diseases. Any variations from normal cell morphology, composition of the cells or differences in expression of cell surface markers may indicate various disease conditions, thus an evaluation of blood cells has a practical application in diagnosis and disease treatment. Blood cells are also routinely used in research to investigate the molecular mechanism of various disease states or to develop new disease treatments. Blood for cell-based research is usually collected into tubes with anticoagulants such as heparin, EDTA or acid citrate dextrose (ACD) to ensure that the coagulation cascade is blocked and cells stay in suspension rather than forming a clotted blood sample. An initial and important step in research based on blood cells is isolation of the cells from blood plasma. Several methods to isolate specific subsets of the blood cells, namely erythrocytes [168, 169], thrombocytes [170] and lymphocytes [171], have been described.

Over the last few years, circulating tumour cells (CTCs), which are present in the blood of patients with metastatic cancer, have become very popular models to investigate certain aspects of metastatic disease. CTCs are used to determine the prognosis of metastatic progression or relapse, to monitor anti-cancer treatments, to understand the mechanism of metastatic disease and finally to use this knowledge to develop new strategies in disease treatment [172, 173]. Several methods to isolate CTCs from blood have been developed and optimised, including density gradient centrifugation [174], size-dependent selection [175], positive selection of cells based on expression of the membrane antigen EpCAM [175] or negative selection of cells based on the depletion of cells with the CD45 antigen [176]. Although CTCs are an excellent model to investigate the metastatic state of disease, the biggest disadvantage of CTCs is their very low abundance in the blood, with a yield of 1 cell per 10⁶–10⁷ leukocytes. Another limitation of research conducted on CTCs is cell heterogeneity, which makes it difficult to isolate the whole CTC
population from the blood. Both the low yield of cells and the heterogeneity of the cells
make research with CTCs challenging and limited, so only a few advances in the field have
been made. To improve isolation of CTCs from blood, combined isolation techniques have
been used. Further ex vivo culture of CTCs increases the amount of available material [177].

References

1. Sherwood L (2015) Human physiology: from cells to systems. Cengage Learning, Andover
2. McGrath J et al (2010) Guidelines for reporting experiments involving animals: the ARRIVE guidelines. Br J Pharmacol 160(7):1573–1576
3. Council for International Organizations of Medical Sciences (2002) International ethical guidelines for biomedical research involving human subjects. Bull Med Ethics (182):17
4. Miller MJ et al (2012) Guidelines for safe work practices in human and animal medical diagnostic laboratories. MMWR Surveill Summ 6(61):1–102
5. McCallum HM, Lowther GW (1996) Long-term culture of primary breast cancer in defined medium. Breast Cancer Res Treat 39(3):247–259
6. Rittie L, Fisher GJ (2005) Isolation and culture of skin fibroblasts. Methods Mol Med 117:83–98
7. Patwardhan AJ et al (2005) Comparison of normal and breast cancer cell lines using proteome, genome, and interactome data. J Proteome Res 4(6):1952–1960
8. He J et al (2014) Fingerprinting breast cancer vs. normal mammary cells by mass spectrometric analysis of volatiles. Sci Rep 4:5196
9. Rubporn A et al (2009) Comparative proteomic analysis of lung cancer cell line and lung fibroblast cell line. Cancer Genomics-Proteomics 6(4):229–237
10. Masayo Y et al (2009) The proteomic profile of pancreatic cancer cell lines corresponding to carcinogenesis and metastasis. J Proteome Bioinforma 2:1–18
11. Wu W et al (2002) Identification and validation of metastasis-associated proteins in head and neck cancer cell lines by two-dimensional electrophoresis and mass spectrometry. Clin Exp Metastasis 19(4):319–326
12. Lewis TS et al (2000) Identification of novel MAP kinase pathway signaling targets by functional proteomics and mass spectrometry. Mol Cell 6(6):1343–1354
13. Choudhary C, Mann M (2010) Decoding signalling networks by mass spectrometry-based proteomics. Nat Rev Mol Cell Biol 11(6):427–439
14. Salomon AR et al (2003) Profiling of tyrosine phosphorylation pathways in human cells using mass spectrometry. Proc Natl Acad Sci 100(2):443–448
15. Zhang Y et al (2005) Time-resolved mass spectrometry of tyrosine phosphorylation sites in the epidermal growth factor receptor signaling network reveals dynamic modules. Mol Cell Proteomics 4(9):1240–1250
16. Meierhofer D et al (2008) Quantitative analysis of global ubiquitination in HeLa cells by mass spectrometry. J Proteome Res 7(10):4566–4576
17. Xu P, Peng J (2006) Dissecting the ubiquitin pathway by mass spectrometry. Biochim Biophys Acta (BBA)-Protein Proteomics 1764(12):1940–1947
18. Bose R et al (2006) Phosphoproteomic analysis of Her2/neu signaling and inhibition. Proc Natl Acad Sci U S A 103(26):9773–9778
19. Dreger M (2003) Proteome analysis at the level of subcellular structures. Eur J Biochem 270(4):589–599
20. Drissi R, Dubois ML, Boisvert FM (2013) Proteomics methods for subcellular proteome analysis. FEBS J 280(22):5626–5634
21. Mann M (2014) Fifteen years of Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC). Methods Mol Biol 1188:1–7
22. Zanivan S et al (2013) SILAC-based proteomics of human primary endothelial cell morphogenesis unveils tumor angiogenic markers. Mol Cell Proteomics 12(12):3599–3611
23. Geiger T et al (2013) Initial quantitative proteomic map of 28 mouse tissues using the SILAC mouse. Mol Cell Proteomics 12(6):1709–1722
24. Markovic O, Markovic N (1998) Cell cross-contamination in cell cultures: the silent and neglected danger. In Vitro Cell Dev Biol Anim 34(1):1–8
25. Burdall SE et al (2003) Breast cancer cell lines: friend or foe? Breast Cancer Res 5(2):89–89
26. Masters JR (2000) Human cancer cell lines: fact and fantasy. Nat Rev Mol Cell Biol 1(3):233–236
27. Langdon SP (2004) Cell culture contamination: an overview. Methods Mol Med 88:309–317
28. Drexler HG, Uphoff CC (2002) Mycoplasma contamination of cell cultures: incidence, sources, effects, detection, elimination, prevention. Cytotechnology 39(2):75–90
29. Uphoff CC, Gignac SM, Drexler HG (1992) Mycoplasma contamination in human leukemia cell lines: I. Comparison of various detection methods. J Immunol Methods 149(1):43–53
30. Nelson-Rees WA, Flandermeyer RR, Hawthorne PK (1975) Distinctive banded marker chromosomes of human tumor cell lines. Int J Cancer 16(1):74–82
31. Drexler HG et al (2003) False leukemia–lymphoma cell lines: an update on over 500 cell lines. Leukemia 17(2):416–426
32. Yoshino K et al (2006) Essential role for gene profiling analysis in the authentication of human cell lines. Hum Cell 19(1):43–48
33. MacLeod RA et al (1999) Widespread intraspecies cross-contamination of human tumor cell lines arising at source. Int J Cancer 83(4):555–563
34. van Bokhoven A et al (2001) TSU-Pr1 and JCA-1 cells are derivatives of T24 bladder carcinoma cells and are not of prostatic origin. Cancer Res 61(17):6340–6344
35. Nims RW et al (1998) Sensitivity of isoenzyme analysis for the detection of interspecies cell line cross-contamination. In Vitro Cell Dev Biol Anim 34(1):35–39
36. Masters J et al (1988) Bladder cancer cell line cross-contamination: identification using a locus-specific minisatellite probe. Br J Cancer 57(3):284
37. van Helden PD et al (1988) Cross-contamination of human esophageal squamous carcinoma cell lines detected by DNA fingerprint analysis. Cancer Res 48(20):5660–5662
38. Nowell PC (1976) The clonal evolution of tumor cell populations. Science 194(4260):23–28
39. Wistuba II et al (1999) Comparison of features of human lung cancer cell lines and their corresponding tumors. Clin Cancer Res 5(5):991–1000
40. Wistuba II et al (1998) Comparison of features of human breast cancer cell lines and their corresponding tumors. Clin Cancer Res 4(12):2931–2938
41. Phelan K, May KM (2015) Basic techniques in mammalian cell tissue culture. Curr Protoc Cell Biol 1.1.1–1.1.22
42. Seluanov A, Vaidya A, Gorbunova V (2010) Establishing primary adult fibroblast cultures from rodents. J Vis Exp: JoVE(44)
43. Legouis D et al (2015) Ex vivo analysis of renal proximal tubular cells. BMC Cell Biol 16(1):1–11
44. Phelan MC Basic techniques for mammalian cell tissue culture. Curr Protoc Cell Biol
45. Martin BM (1994) Tissue culture techniques: an introduction. Springer Science & Business Media
46. Laschi M et al (2015) Establishment of four new human primary cell cultures from Chemo-Naïve Italian Osteosarcoma patients. J Cell Physiol 230:2718
47. Hood LE et al (2012) New and improved proteomics technologies for understanding complex biological systems: addressing a grand challenge in the life sciences. Proteomics 12(18):2773–2783
48. Ahmad Y, Lamond AI (2014) A perspective on proteomics in cell biology. Trends Cell Biol 24(4):257–264
49. Kim JB (2005) Three-dimensional tissue culture models in cancer biology. In: Seminars in cancer biology. Elsevier
50. Olechnowicz SW, Edwards CM (2014) Contributions of the host microenvironment to cancer-induced bone disease. Cancer Res 74(6):1625–1631
51. Weiswald L-B, Bellet D, Dangles-Marie V (2015) Spherical cancer models in tumor biology. Neoplasia 17(1):1–15
52. Ellem SJ, De-Juan-Pardo EM, Risbridger GP (2014) In vitro modeling of the prostate cancer microenvironment. Adv Drug Deliv Rev 79:214–221
53. Fang X et al (2013) Novel 3D co-culture model for epithelial-stromal cells interaction in prostate cancer. PLoS One 8(9):e75187
54. Lexander H et al (2006) Evaluation of two sample preparation methods for prostate proteome analysis. Proteomics 6(13):3918–3925
55. Micke P et al (2006) Biobanking of fresh frozen tissue: RNA is stable in nonfixed surgical specimens. Lab Investig 86(2):202–211
56. Guo H et al (2012) An efficient procedure for protein extraction from formalin-fixed, paraffin-embedded tissues for reverse phase protein arrays. Proteome Sci 10(1):56
57. Scicchitano MS et al (2009) Protein extraction of formalin-fixed, paraffin-embedded tissue enables robust proteomic profiles by mass spectrometry. J Histochem Cytochem 57(9):849–860
58. Kerk NM et al (2003) Laser capture microdissection of cells from plant tissues. Plant Physiol 132(1):27–35
59. Dos Remedios C et al (2003) Genomics, proteomics and bioinformatics of human heart failure. J Muscle Res Cell Motil 24(4–6):251–261
60. Folli F et al (2010) Proteomics reveals novel oxidative and glycolytic mechanisms in type 1 diabetic patients’ skin which are normalized by kidney-pancreas transplantation. PLoS One 5(3):e9923
61. Kienzl K et al (2009) Proteomic profiling of acute cardiac allograft rejection. Transplantation 88(4):553–560
62. Mas VR et al (2009) Proteomic analysis of HCV cirrhosis and HCV-induced HCC: identifying biomarkers for monitoring HCV-cirrhotic patients awaiting liver transplantation. Transplantation 87(1):143
63. Vidal BC, Bonventre JV, I-Hong Hsu S (2005) Towards the application of proteomics in renal disease diagnosis. Clin Sci (Lond) 109(5):421–430
64. Bendixen E (2014) Animal models for translational proteomics. Proteomics Clin Appl 8(9–10):637–639
65. Terp MG, Ditzel HJ (2014) Application of proteomics in the study of rodent models of cancer. Proteomics Clin Appl 8(9–10):640–652
66. Bousette N, Gramolini AO, Kislinger T (2008) Proteomics-based investigations of animal models of disease. Proteomics Clin Appl 2(5):638–653
67. Conn PM (2013) Animal models for the study of human disease. Academic, Amsterdam
68. Kooij V et al (2014) Sizing up models of heart failure: proteomics from flies to humans. Proteomics Clin Appl 8(9–10):653–664
69. Götz J, Ittner LM (2008) Animal models of Alzheimer’s disease and frontotemporal dementia. Nat Rev Neurosci 9(7):532–544
70. Edinger M et al (1999) Noninvasive assessment of tumor cell proliferation in animal models. Neoplasia 1(4):303–310
71. Raeburn D, Underwood SL, Villamil ME (1992) Techniques for drug delivery to the airways, and the assessment of lung function in animal models. J Pharmacol Toxicol Methods 27(3):143–159
72. Sowell RA, Owen JB, Butterfield DA (2009) Proteomics in animal models of Alzheimer’s and Parkinson’s diseases. Ageing Res Rev 8(1):1–17
73. Flintoft L (2008) Animal models: proteomics goes live in the mouse. Nat Rev Genet 9(9):655–655
74. Stastna M, Van Eyk JE (2012) Secreted proteins as a fundamental source for biomarker discovery. Proteomics 12(4–5):722–735
75. Hathout Y (2007) Approaches to the study of the cell secretome
76. Pavlou MP, Diamandis EP (2010) The cancer cell secretome: a good source for discovering biomarkers? J Proteome 73(10):1896–1906
77. Théry C, Ostrowski M, Segura E (2009) Membrane vesicles as conveyors of immune responses. Nat Rev Immunol 9(8):581–593
78. Bijnsdorp IV et al (2013) Exosomal ITGA3 interferes with non-cancerous prostate cell functions and is increased in urine exosomes of metastatic prostate cancer patients. J Extracell Vesicles 2
79. Jeppesen DK et al (2014) Quantitative proteomics of fractionated membrane and lumen exosome proteins from isogenic metastatic and nonmetastatic bladder cancer cells reveal differential expression of EMT factors. Proteomics 14(6):699–712
80. Hosseini-Beheshti E et al (2012) Exosomes as biomarker enriched microvesicles: characterization of exosomal proteins derived from a panel of prostate cell lines with distinct AR phenotypes. Mol Cell Proteomics: MCP. M111.014845
81. Duijvesz D et al (2011) Exosomes as biomarker treasure chests for prostate cancer. Eur Urol 59(5):823–831
82. Raimondo F et al (2011) Advances in membranous vesicle and exosome proteomics improving biological understanding and biomarker discovery. Proteomics 11(4):709–720
83. Kang G-Y et al (2014) Exosomal proteins in the aqueous humor as novel biomarkers in patients with neovascular age-related macular degeneration. J Proteome Res 13(2):581–595
84. Anthea M, H J, McLaughlin CW, Johnson S, Warner MQ, LaHart D, Wright JD (1993) Human biology and health. Prentice Hall, Englewood Cliffs
85. Anderson NL, Anderson NG (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol Cell Proteomics 1(11):845–867
86. Rodak BS, WB (2002) Hematology: clinical principles and applications, 2nd edn. Philadelphia
87. Palkuti HS (1998) Specimen control and quality control. In: Corriveau DMAF, Fritsma GA (eds) Hemostasis and thrombosis in the clinical laboratory. Lippincott, Philadelphia, pp 67–91
88. Kratz A, Ferraro M, Sluss PM, Lewandrowski KB (2004) Laboratory reference values. N Engl J Med 351:1548–1563
89. Keshishian H et al (2015) Multiplexed, quantitative workflow for sensitive biomarker discovery in plasma yields novel candidates for early myocardial injury. Mol Cell Proteomics 14:2375
90. Morrissey B et al (2013) Development of a label-free LC-MS/MS strategy to approach the identification of candidate protein biomarkers of disease recurrence in prostate cancer patients in a clinical trial of combined hormone and radiation therapy. Proteomics Clin Appl 7(5–6):316–326
91. Mullan RH et al (2007) Early changes in serum type II collagen biomarkers predict radiographic progression at one year in inflammatory arthritis patients after biologic therapy. Arthritis Rheum 56(9):2919–2928
92. Sekigawa I et al (2008) Protein biomarker analysis by mass spectrometry in patients with rheumatoid arthritis receiving anti-tumor necrosis factor-alpha antibody therapy. Clin Exp Rheumatol 26(2):261–267
93. Zhao J et al (2015) Identification of potential plasma biomarkers for esophageal squamous cell carcinoma by a proteomic method. Int J Clin Exp Pathol 8(2):1535–1544
94. Lundblad R (2003) Considerations for the use of blood plasma and serum for proteomic analysis. Internet J Genomics and Proteomics 1(2)
95. Millioni R et al (2011) High abundance proteins depletion vs low abundance proteins enrichment: comparison of methods to reduce the plasma proteome complexity. PLoS One 6(5):e19603
96. Cyr DD et al (2011) Characterization of serum proteins associated with IL28B genotype among patients with chronic hepatitis C. PLoS One 6(7):e21854
97. Haslene-Hox H et al (2011) A new method for isolation of interstitial fluid from human solid tumors applied to proteomic analysis of ovarian carcinoma tissue. PLoS One 6(4):e19217
98. Smith MP et al (2011) A systematic analysis of the effects of increasing degrees of serum immunodepletion in terms of depth of coverage and other key aspects in top-down and bottom-up proteomic analyses. Proteomics 11(11):2222–2235
99. Levreri I et al (2005) Separation of human serum proteins using the Beckman-Coulter PF2D system: analysis of ion exchange-based first dimension chromatography. Clin Chem Lab Med 43(12):1327–1333
100. Sennels L et al (2007) Proteomic analysis of human blood serum using peptide library beads. J Proteome Res 6(10):4055–4062
101. Pieper R et al (2003) Multi-component immunoaffinity subtraction chromatography: an innovative step towards a comprehensive survey of the human plasma proteome. Proteomics 3(4):422–432
102. Huang JT, McKenna T, Hughes C, Leweke FM, Schwarz E, Bahn S (2007) CSF biomarker discovery using label-free nano-LC-MS based proteomic profiling: technical aspects. J Sep Sci 30:214–225
103. Sakka L, Coll G, Chazal J (2011) Anatomy and physiology of cerebrospinal fluid. Eur Ann Otorhinolaryngol Head Neck Dis 123:309–316
104. Percy AJ, Yang J, Chambers AG, Simon R, Hardie DB, Borchers CH (2014) Multiplexed MRM with internal standards for cerebrospinal fluid candidate protein biomarker quantitation. J Proteome Res 13:3733
105. Choi YS, Choe LH, Lee KH (2010) Recent cerebrospinal fluid biomarker studies of Alzheimer’s disease. Expert Rev Proteomics 7(6):919–929
106. Stoop MP et al (2008) Multiple sclerosis-related proteins identified in cerebrospinal fluid by advanced mass spectrometry. Proteomics 8(8):1576–1585
107. Claveau D, Dankoff J (2013) Is lumbar puncture still needed in suspected subarachnoid hemorrhage after a negative head computed tomographic scan? CJEM 15:1–3
108. Seehusen DA, Reeves MM, Fomin DA (2003) Cerebrospinal fluid analysis. Am Fam Physician 68(6):1103–1108
109. Shores KS et al (2008) Use of peptide analogue diversity library beads for increased depth of proteomic analysis: application to cerebrospinal fluid. J Proteome Res 7(5):1922–1931
110. Thouvenot E et al (2008) Enhanced detection of CNS cell secretome in plasma protein-depleted cerebrospinal fluid. J Proteome Res 7(10):4409–4421
111. Lehnert S, Jesse S, Rist W, Steinacker P, Soininen H, Herukka SK, Tumani H, Lenter M, Oeckl P, Ferger B, Hengerer B, Otto M (2012) iTRAQ and multiple reaction monitoring as proteomic tools for biomarker search in cerebrospinal fluid of patients with Parkinson’s disease dementia. Exp Neurol 234(2):499–505
112. Iorio L, Avagliano F (1999) Observations on the Liber medicine orinalibus by Hermogenes. Am J Nephrol 19(2):185–188
113. Moe OW, Berry CA, Rector FC (2000) The kidney. W. B. Saunders, Philadelphia
114. Ling XB et al (2010) Urine peptidomic and targeted plasma protein analyses in the diagnosis and monitoring of systemic juvenile idiopathic arthritis. Clin Proteomics 6(4):175–193
115. Wu T et al (2013) Urinary angiostatin-a novel putative marker of renal pathology chronicity in lupus nephritis. Mol Cell Proteomics 12(5):1170–1179
116. Schaub S et al (2004) Urine protein profiling with surface-enhanced laser-desorption/ionization time-of-flight mass spectrometry. Kidney Int 65(1):323–332
117. Theodorescu D et al (2006) Discovery and validation of new protein biomarkers for urothelial cancer: a prospective analysis. Lancet Oncol 7(3):230–240
118. Zhou H et al (2006) Collection, storage, preservation, and normalization of human urinary exosomes for biomarker discovery. Kidney Int 69(8):1471–1476
119. Vestergaard P, Leverett R (1958) Constancy of urinary creatinine excretion. J Lab Clin Med 51(2):211–218
120. Mischak H (2005) Capillary electrophoresis coupled to mass spectrometry for clinical diagnostic purposes. Electrophoresis 26:2708–2716
121. Pisitkun T, Johnstone R, Knepper MA (2006) Discovery of urinary biomarkers. Mol Cell Proteomics 5(10):1760–1771
122. Weissinger EM et al (2004) Proteomic patterns established with capillary electrophoresis and mass spectrometry for diagnostic purposes. Kidney Int 65(6):2426–2434
123. Hortin GL et al (2006) Proteomics: a new diagnostic frontier. Clin Chem 52(7):1218–1222
124. Yu Y, Pieper R (2015) Urinary pellet sample preparation for shotgun proteomic analysis of microbial infection and host–pathogen interactions. Proteomic Profiling: Methods and Protocols 65–74
125. Riva A et al (2000) A high resolution SEM study of human minor salivary glands. Eur J Morphol 38(4):219–226
126. Castle D, Castle A (1998) Intracellular transport and secretion of salivary proteins. Crit Rev Oral Biol Med 9(1):4–22
127. Messana I et al (2008) Trafficking and postsecretory events responsible for the formation of secreted human salivary peptides: a proteomics approach. Mol Cell Proteomics 7(5):911–926
128. Humphrey SP, Williamson RT (2001) A review of saliva: normal composition, flow, and function. J Prosthet Dent 85(2):162–169
129. Edgar WM (1992) Saliva: its secretion, composition and functions. Br Dent J 172(8):305–312
130. Hansen AM, Garde AH, Persson R (2008) Measurement of salivary cortisol–effects of replacing polyester with cotton and switching antibody. Scand J Clin Lab Invest 68(8):826–829
131. Hansen AM, Garde AH, Persson R (2008) Sources of biological and methodological variation in salivary cortisol and their impact on measurement among healthy adults: a review. Scand J Clin Lab Invest 68(6):448–458
132. Tzioufas AG, Kapsogeorgou EK (2015) Biomarkers. Saliva proteomics is a promising tool to study
Sjogren syndrome. Nat Rev Rheumatol 11(4):202–203
133. Schafer CA et al (2014) Saliva diagnostics: utilizing oral fluids to determine health status. Monogr Oral Sci 24:88–98
134. Heflin L, Walsh S, Bagajewicz M (2009) Design of medical diagnostics products: a case-study of a saliva diagnostics kit. Comput Chem Eng 33(5):1067–1076
135. Messana I et al (2008) Facts and artifacts in proteomics of body fluids. What proteomics of saliva is telling us? J Sep Sci 31(11):1948–1963
136. Esser D et al (2008) Sample stability and protein composition of saliva: implications for its use as a diagnostic fluid. Biomark Insights 3:25–27
137. Yan W et al (2009) Systematic comparison of the human saliva and plasma proteomes. Proteomics Clin Appl 3(1):116–134
138. Navazesh M, Kumar SK (2008) Measuring salivary flow: challenges and opportunities. J Am Dent Assoc 139 Suppl:35s–40s
139. Atkinson KR et al (2008) Rapid saliva processing techniques for near real-time analysis of salivary steroids and protein. J Clin Lab Anal 22(6):395–402
140. Michishige F et al (2006) Effect of saliva collection method on the concentration of protein components in saliva. J Med Investig 53(1–2):140–146
141. Vitorino R et al (2004) Identification of human whole saliva protein components using proteomics. Proteomics 4(4):1109–1115
142. Saunte C (1983) Quantification of salivation, nasal secretion and tearing in man. Cephalalgia 3(3):159–173
143. Jessie K et al (2010) Proteomic analysis of whole human saliva detects enhanced expression of interleukin-1 receptor antagonist, thioredoxin and lipocalin-1 in cigarette smokers compared to non-smokers. Int J Mol Sci 11(11):4488–4505
144. Soares S et al (2011) Reactivity of human salivary proteins families toward food polyphenols. J Agric Food Chem 59(10):5535–5547
145. Carlson A, Crittenden A (1909) The relation of ptyalin concentration to the diet and to the rate of salivary secretion. Exp Biol Med 7(2):52–54
146. Heft MW, Baum BJ (1984) Unstimulated and stimulated parotid salivary flow rate in individuals of different ages. J Dent Res 63(10):1182–1185
147. Lashley K (1916) Reflex secretion of the human parotid gland. J Exp Psychol 1(6):461
148. Nita-Lazar A, Saito-Benz H, White FM (2008) Quantitative phosphoproteomics by mass spectrometry: past, present, and future. Proteomics 8(21):4433–4443
149. Thingholm TE, Jensen ON, Larsen MR (2009) Analytical strategies for phosphoproteomics. Proteomics 9(6):1451–1468
150. Zielinska DF et al (2010) Precision mapping of an in vivo N-glycoproteome reveals rigid topological and sequence constraints. Cell 141(5):897–907
151. Al-Tarawneh SK et al (2011) Defining salivary biomarkers using mass spectrometry-based proteomics: a systematic review. OMICS J Integr Biol 15(6):353–361
152. Pfaffe T et al (2011) Diagnostic potential of saliva: current state and future applications. Clin Chem 57(5):675–687
153. Whitelegge JP et al (2007) Protein-sequence polymorphisms and post-translational modifications in proteins from human saliva using top-down Fourier-transform ion cyclotron resonance mass spectrometry. Int J Mass Spectrom 268(2):190–197
154. Vitorino R et al (2011) Finding new posttranslational modifications in salivary proline-rich proteins. Proteomics Clin Appl 5(3–4):197–197
155. Molloy MP et al (1999) Extraction of Escherichia coli proteins with organic solvents prior to two-dimensional electrophoresis. Electrophoresis 20(4–5):701–704
156. Pasquali C, Fialka I, Huber LA (1999) Subcellular fractionation, electromigration analysis and mapping of organelles. J Chromatogr B Biomed Sci Appl 722(1):89–102
157. Krief G et al (2011) Improved visualization of low abundance oral fluid proteins after triple depletion of alpha amylase, albumin and IgG. Oral Dis 17(1):45–52
158. Owen DH, Katz DF (2005) A review of the physical and chemical properties of human semen and the formulation of a semen simulant. J Androl 26(4):459–469
159. Bartoov B et al (1999) Quantitative ultramorphological analysis of human sperm: fifteen years of experience in the diagnosis and management of male factor infertility. Arch Androl 43(1):13–25
160. Pizzol D et al (2014) Genetic and molecular diagnostics of male infertility in the clinical practice. Front Biosci (Landmark Ed) 19:291–303
161. Liu DY, Baker HW (1992) Tests of human sperm function and fertilization in vitro. Fertil Steril 58(3):465–483
162. Liu DY, Baker HW (2002) Evaluation and assessment of semen for IVF/ICSI. Asian J Androl 4(4):281–285
163. World Health Organization (1999) WHO laboratory manual for the examination of human semen and sperm-cervical mucus interaction. Cambridge University Press, Cambridge
164. Amaral A et al (2014) The combined human sperm proteome: cellular pathways and implications for basic and clinical science. Hum Reprod Update 20(1):40–62
165. Drake RR et al (2010) In-depth proteomic analyses of direct expressed prostatic secretions. J Proteome Res 9(5):2109–2116
166. Milardi D et al (2013) Proteomics of human seminal plasma: identification of biomarker candidates for fertility and infertility and the evolution of technology. Mol Reprod Dev 80(5):350–357
167. Milardi D et al (2012) Proteomic approach in the identification of fertility pattern in seminal plasma of fertile men. Fertil Steril 97(1):67–73.e1
168. Thompson CB et al (1984) A method for the separation of erythrocytes on the basis of size using counterflow centrifugation. Am J Hematol 17(2):177–183
169. Van der Vegt SGL et al (1985) Counterflow centrifugation of red cell populations: a cell age related separation technique. Br J Haematol 61(3):393–403
170. Dhurat R, Sukesh M (2014) Principles and methods of preparation of platelet-rich plasma: a review and author’s perspective. J Cutan Aesthet Surg 7(4):189–197
171. Godoy-Ramirez K et al (2004) Optimum culture conditions for specific and nonspecific activation of whole blood and PBMC for intracellular cytokine assessment by flow cytometry. J Immunol Methods 292(1–2):1–15
172. de Bono JS et al (2008) Circulating tumor cells predict survival benefit from treatment in metastatic castration-resistant prostate cancer. Clin Cancer Res 14(19):6302–6309
173. Schmidt U et al (2004) Quantification of disseminated tumor cells in the bloodstream of patients with hormone-refractory prostate carcinoma undergoing cytotoxic chemotherapy. Int J Oncol 24(6):1393–1399
174. Gertler R et al (2003) Detection of circulating tumor cells in blood using an optimized density gradient centrifugation. Recent Results Cancer Res 162:149–155
175. Farace F et al (2011) A direct comparison of CellSearch and ISET for circulating tumour-cell detection in patients with metastatic carcinomas. Br J Cancer 105(6):847–853
176. Liu Z et al (2011) Negative enrichment by immunomagnetic nanobeads for unbiased characterization of circulating tumor cells from peripheral blood of cancer patients. J Transl Med 9:70
177. Yu M et al (2014) Cancer therapy. Ex vivo culture of circulating breast tumor cells for individualized testing of drug susceptibility. Science 345(6193):216–220

2 Protein Fractionation and Enrichment Prior to Proteomics Sample Preparation
Andrew J. Alpert

Abstract
Proteins may be considered as polypeptides large enough to have a well-
defined tertiary, or three-dimensional structure. In aqueous media, this
structure is typically one in which polar and charged amino acid residues
are on the surface while hydrophobic residues tend to be sequestered in the
core and reasonably inaccessible to the aqueous environment. Proteins
that are not normally found free in aqueous media, such as membrane
proteins and apolipoproteins, can have tertiary structures that deviate from
this model. In general, the biological activity of proteins requires the
preservation of their tertiary structure, and this sets more limits upon the
chromatography than is true of peptides. In proteomics, the concern is
with which proteins are present and in what quantity rather than
maintaining biological activity. Such applications are freer to use mobile
and stationary phases that denature protein structure. However,
considerations of solubility and recovery may still set more limits on the
chromatography than is the case with peptides.

Keywords
Protein fractionation • Protein chromatography • Ion-exchange
chromatography (IEX) • Hydrophobic Interaction Chromatography
(HIC) • Size-Exclusion Chromatography (SEC) • Reversed-Phase
Chromatography (RPC) • Hydrophilic Interaction Chromatography
(HILIC) • Affinity chromatography • Multi-dimensional
chromatography for top-down proteomics

A.J. Alpert (*)
PolyLC Inc., Columbia, MD, USA
e-mail: aalpert@polylc.com

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical
Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_2

2.1 Overall Requirements

Proteins may be considered as polypeptides large enough to have a well-defined tertiary, or
three-dimensional structure. In aqueous media, this
structure is typically one in which polar and charged amino acid residues are on the surface
while hydrophobic residues tend to be sequestered in the core and reasonably inaccessible to
the aqueous environment. Proteins that are not normally found free in aqueous media, such
as membrane proteins and apolipoproteins, can have tertiary structures that deviate from this
model. In general, the biological activity of proteins requires the preservation of their
tertiary structure, and this sets more limits upon the chromatography than is true of
peptides. In proteomics, the concern is with which proteins are present and in what quantity
rather than maintaining biological activity. Such applications are freer to use mobile and
stationary phases that denature protein structure. However, considerations of solubility and
recovery may still set more limits on the chromatography than is the case with peptides.
Proteomics can involve the analysis of minor variants of individual proteins as well as the
identification and quantitation of the proteins in a complex mixture. Accordingly, this
chapter presents examples of quality control and clinical applications as well as separations
for basic research.

2.2 Modes of Chromatography

2.2.1 Ion-Exchange Chromatography (IEX)

Proteins have charged residues on the surface of their structures and so are attracted
electrostatically to a stationary phase of the opposite charge. Figure 2.1 shows the
separation of variants of ovalbumin that have differing numbers of phosphate groups. At the
isoelectric point (pI) of a protein, the amount of positive (+) and negative (−) charge is in
balance. At a pH higher than the pH corresponding to the pI, a protein has a net (−) charge
and is retained by an anion-exchange column. At a lower pH, it has a net (+) charge and is
retained by a cation-exchange column. However, chromatography is a surface interaction.
This distinguishes it from electrophoresis, which involves field effects. It is

Fig. 2.1 Anion-exchange of phosphorylation variants of ovalbumin. Sample: Ovalbumin (Sigma Grade VI [99 %]).
Column: PolyWAX LP (100 × 4.6 mm; 5-μm, 1000-Å). Gradient: 10 mM K-PO4, pH 7.0, with 60–300 mM NaCl in 20’
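
The relationship between pH, pI, and net charge described in this section can be sketched numerically. The Python sketch below is illustrative only: the side-chain pKa values are generic textbook approximations, and the residue counts describe a hypothetical protein; none of these numbers come from this chapter. It estimates net charge with the Henderson-Hasselbalch equation and locates the pI by bisection.

```python
# Approximate pKa values for ionizable groups (assumed, generic textbook figures).
BASIC_PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "N_term": 9.0}
ACIDIC_PKA = {"D": 3.9, "E": 4.1, "C_term": 2.3}


def net_charge(counts, ph):
    """Estimate the net charge of a protein at a given pH.

    counts maps a group label (keys of BASIC_PKA / ACIDIC_PKA) to the
    number of such groups in the protein.
    """
    positive = sum(n / (1.0 + 10 ** (ph - BASIC_PKA[g]))
                   for g, n in counts.items() if g in BASIC_PKA)
    negative = sum(n / (1.0 + 10 ** (ACIDIC_PKA[g] - ph))
                   for g, n in counts.items() if g in ACIDIC_PKA)
    return positive - negative


def isoelectric_point(counts, lo=0.0, hi=14.0, tol=1e-4):
    """Find the pH of zero net charge by bisection.

    Net charge decreases monotonically as pH rises, so bisection converges.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(counts, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exchanger_for_net_charge(counts, ph):
    """Name the column type that retains the protein on net charge alone."""
    return "anion-exchange" if net_charge(counts, ph) < 0 else "cation-exchange"


# Hypothetical acidic protein (residue counts are invented for illustration).
hypothetical_protein = {"K": 20, "R": 10, "H": 5, "D": 25, "E": 25,
                        "N_term": 1, "C_term": 1}
pI = isoelectric_point(hypothetical_protein)
print(f"estimated pI: {pI:.2f}")
print("pH 7.0:", exchanger_for_net_charge(hypothetical_protein, 7.0))
print("pH 3.0:", exchanger_for_net_charge(hypothetical_protein, 3.0))
```

A calculation of this kind considers net charge only. As the text goes on to explain, a cluster of like-charged residues on the protein surface can cause retention even when the net charge has the same sign as the column, so such an estimate is a starting point rather than a prediction of retention.
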
2 Protein Fractionation and Enrichment Prior to Proteomics Sample Preparation 25

possible for a protein to have non-uniform distribution of the charged residues on its surface. If there is a cluster of residues of the same charge, then the protein can bind through that cluster to an ion-exchange column of the opposite charge from the residues. The result is retention of many proteins at pH ranges one unit or more beyond their pI value, despite their having the same net charge as the column. Accordingly, some proteins are retained by both anion- and cation-exchange columns. In addition, since the protein can be highly oriented in its binding to the surface, some residues will be closer to the surface than others and can have a greater effect on the interaction. Accordingly, chromatography can distinguish between variants of a protein that differ in the position of a residue that has been derivatized, oxidized, or otherwise modified, as shown in Fig. 2.2. Such positional variants would not be separated by electrophoresis.

In general, the conditions used in IEX are mild and do not denature proteins. Many membrane and structural proteins are not readily soluble in the aqueous media that are normally used for IEX. In such cases, organic solvents or solubilizing agents such as hexafluoro-2-propanol or trifluoroethanol can be included in the mobile phases. See Figs. 2.9 and 2.11 for examples of the use of such solubilizing agents.

When counting numbers of proteins, approximately 50 % of mammalian proteins have pI values above 7 and 50 % below 7, with a minimum at pH 7 itself [1]. When counting protein abundance rather than protein numbers, though, one finds that acidic proteins, with pI values below 7, are about 4x more abundant in serum and cell lysates than are basic proteins. For this reason, anion-exchange is generally more useful for fractionation of complex mixtures of mammalian proteins when compared to cation-exchange. Cation-exchange is more useful for certain specific applications, such as quality control (QC) analysis of monoclonal antibodies and the clinical analysis of hemoglobin variants.

Elution in ion-exchange usually involves a gradient of increasing salt concentration. If an absorbance detector is used at a wavelength below 230 nm, then salts should be used that are transparent in this range. Such salts would include NaCl or KCl with phosphate, MES (2-(N-morpholino)ethanesulfonic acid), or HEPES (4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid) used as buffers. If the application requires a mobile phase that is volatile, then one

Fig. 2.2 Separation by cation-exchange of PEGylation positional variants of tumor necrosis factor soluble receptor Type I. The main product was PEGylated at the N-terminus (Met1). Side products were PEGylated at the lysine residues indicated instead of or in addition to the N-terminus. Column: PolyCAT A (200 × 4.6 mm; 5-μm, 1000-Å). Gradient: 20 mM sodium acetate, pH 5.0, with 20 % ACN, and a gradient of 1–160 mM KCl in 30’. (Adapted from J.E. Seely et al., BioPharm Intl. 18 (March 2005) 30)

can use ammonium acetate as both the buffering and the gradient salt. However, most protein applications require concentrations of salt for elution that are so high as to be incompatible with applications such as mass spectrometry that require volatile mobile phases. Alternatively, a pH gradient can be used to change the net charge of a protein to one closer to that of the stationary phase, decreasing the amount of salt required for elution. With cation-exchange, this involves a gradient of increasing pH (usually to one above pH 6.5, above which histidine residues lose their (+) charge). Conversely, the pH change with anion-exchange usually involves a decreasing pH gradient.

The charge density of the ion-exchange material varies with pH. The titration curve of a simple amine or carboxylic acid in solution features a sharp inflection point. That is not true of polyelectrolytes, in which the charge on one functional group affects the ease of charging neighboring groups. This is true whether the polyelectrolyte is in solution or immobilized on the surface of a stationary phase. As a result, titration curves of ion-exchange stationary phase materials in suspension feature a continuum of charge density varying over a wide pH range [2]. An ion-exchange material is considered to be “weak” or “strong” based just on the pH range where it starts to lose charge, not on the degree of attraction of a charged analyte to the material. Weak anion-exchange (WAX) materials are fully charged below pH 5 but are only about 5 % charged at pH 9, with a continuum of variation of charge density in between. Such materials feature primary, secondary or tertiary amines as functional groups or a mixture of the three. A strong anion-exchange (SAX) material has quaternary amine functional groups and retains most of its charge density as high as pH 12. A weak cation-exchange (WCX) material has carboxyl- functional groups which can be uncharged at pH < 4. A strong cation-exchange (SCX) material retains most (−) charge density down to pH 2. At a pH where a weak and a strong ion-exchange material both have their full charge density, there is no significant difference in performance between them, assuming all other variables are the same.

For protein applications, it is important to use ion-exchange materials that have been manufactured for the purpose. The least expensive ion-exchange materials generally feature charged groups attached to a polymeric resin. While there are resins that are hydrophilic enough for the purpose, many are not, such as those with a polystyrene-divinylbenzene base. Such materials may perform well with small analytes, but are so hydrophobic that many proteins will not elute from columns of such materials. In general, an ion-exchange material for proteins must have a thick, hydrophilic coating that hides the base material from proteins in solution. Another important property is that porous materials, such as those based on silica, must have pores wide enough for protein diffusion in and out to be facile. This requires pores at least 300 Å wide, and many proteins afford sharper, more symmetrical peaks with pores in the range 1000–1500 Å. Such materials have lower surface area than do 300-Å pore materials, but the degree of retention is usually not a problem in ion-exchange of proteins.

An exception to these general trends is the recent use of weak cation-exchange (WCX) materials with a gradient to a pH low enough to uncharge the carboxyl- groups in the coating. This can be performed with a gradient from dilute ammonium formate, pH ~ 5, to unbuffered formic acid (typically in the range 0.5–2 %). This will be discussed in more detail in Sect. 2.3.5.

If one is not sure whether to use an anion- or cation-exchange column, one solution would be to use a mixed-bed column that contains both materials. In principle, a mixed-bed column will retain all proteins. Such columns have proved to be useful for fractionation of complex mixtures of proteins. Figure 2.3 shows the uniform distribution of proteins such columns can afford. Strong retention of a protein on a mixed-bed IEX column is facilitated by either of two structural characteristics:

Fig. 2.3 Fractionation of proteins from a lysate of 4 T1 cells (mouse mammary tumor) by a mixed-bed ion-exchange column. Column: PolyCATWAX (200 × 4.6-mm; 5-μm, 1000-Å). Gradient: 10–1000-mM ammonium acetate, as shown (Adapted from Ref. [3])

(a) An extreme pI value in either direction
(b) A high percentage of charged residues of either sign [3]

The higher the percentage of charged residues, the more likely that the protein surface will contain at least one patch with several residues of the same sign through which strong binding can occur, as discussed above. A protein with a patch of this sort will be strongly retained even if the pI value is near neutrality.

High concentrations of organic solvents in the mobile phases will generally denature water-soluble proteins. However, more modest concentrations of solvents can sometimes improve selectivity, depending on the protein involved. Figure 2.4 shows an example of this with a set of closely-related glycoproteins.

2.2.2 Hydrophobic Interaction Chromatography (HIC)

Chromatography in the HIC mode starts with a high concentration of a salt whose ions are surrounded with a strongly retained sphere of hydration, such as a sulfate, phosphate or citrate. This leaves less of the water free to hydrate the polar residues of proteins. At this point the solvation of proteins is marginal, since a solute must surround itself with molecules of the solvent in order to remain in solution. When now exposed to a modestly hydrophobic surface, the protein will adsorb to the surface, thereby partitioning out of the aqueous phase. A gradient is then run of decreasing salt concentration. Proteins are resolvated – or, rather, rehydrated – and elute in order of their increase in hydrophobic character of the surface of their tertiary structure. This elution order appears to be the same as that in RPC. The difference is that the stationary phase is appreciably less hydrophobic than is true with an RPC material, and the mobile phase lacks any denaturing components. Consequently, in general HIC is a nondenaturing method. An exception would be with protein complexes in which the subunits interact through electrostatic interactions rather than hydrophobic interactions; this attraction can be disrupted by the high salt concentrations used in HIC.

HIC compares well with IEX in its high capacity and selectivity. The basis of selectivity is complementary to that of IEX, operating via hydrophobic interaction with the hydrophobic residues. Consequently, the two modes can fruitfully be used in sequence for isolating a protein

Fig. 2.4 Separation by cation-exchange of glycosylation variants of recombinant α–bungarotoxin expressed in P. pastoris. Column: PolyCAT A (200 × 4.6-mm; 5-μm, 300-Å). Gradient: 60’ linear, 50–300 mM ammonium acetate, pH 6.0. Top: No ACN. Bottom: 40 % ACN in both mobile phases (Data courtesy of Robert Rogowski and Edward Hawrot, Brown University)

from a complex mixture or simply dividing it into fractions with fewer components per fraction. This is discussed later in the Proteomics section. HIC can be extremely sensitive to minor variations in polarity, as is the case in the example in Fig. 2.5. This is helpful in quality control analysis of proteins.

HIC is also often used as the “capture” step for the initial collection of a recombinant protein in solution in a fermentation vat. A suitable salt such as ammonium sulfate is added in an amount sufficient to promote binding to a HIC material and the liquid is pumped into a HIC cartridge. It can then be eluted in a volume much lower than that of the original sample.

2.2.3 Size-Exclusion Chromatography (SEC)

SEC of proteins is performed with hydrophilic stationary phases with well-defined pore

Fig. 2.5 Separation by HIC of Fab and Fc antibody fragments and their oxidation products. The minor peaks indicated correspond to the major peaks eluting after them but with a single methionine residue oxidized to the sulfoxide form. Column: PolyPROPYL A (100 × 4.6-mm; 3-μm, 1500-Å). Gradient: Decreasing ammonium sulfate concentration in 20 mM K-PO4, pH 7.0

diameters. This is a nondenaturing mode and can be performed with moderate concentrations (100–200 mM) of volatile salts such as ammonium acetate. Being an isocratic mode, it is easy to implement. The main limitation is that it is a low-resolution mode. A general rule is that for two proteins to be resolved to baseline in SEC, they must differ in molecular weight by at least a factor of two, a characteristic that does not predispose this mode to separations based on fine differences. Given this limitation, the histogram in Fig. 2.6 suggests that the entire human proteome would yield only 5 or 6 baseline-resolved peaks in SEC. That is something of an underestimate; a good SEC column can produce about eight baseline-resolved peaks within the fractionation range, including the Total Exclusion Volume (Vo) peak and the Total Inclusion Volume (Vt) peak.

In recent years some applications have started using SEC as a filter to separate very large molecules from very small ones. These include the following:

(a) SEC-MS of intact proteins: A column is chosen with a pore diameter narrow enough to insure that the protein of interest elutes in the Vo peak, which is then directed to the mass spectrometer. Small molecules such as nonvolatile salts elute later and are directed to waste. The mobile phase must be volatile. Solvents used to date include 50 mM formic acid, in which case proteins elute in denatured forms (Fig. 2.7), or 200 mM ammonium acetate, in which case they elute with their tertiary structures intact (for “native” mass spectrometry) (Fig. 2.8).
(b) Top-down proteomics: In mass spectra of intact proteins, the signal-to-noise ratio decreases as the protein molecular weight increases [4]. Consequently, small proteins interfere with the detection of large proteins in the same sample. Identification of large proteins is facilitated by preliminary separation of the proteins < 40 KDa from proteins > 40 KDa using an SEC spin cartridge.
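
The factor-of-two rule for baseline resolution in SEC can be turned into a rough estimate of how many peaks fit within a mass range. The Python sketch below is a simplified illustration, not a method from this chapter: it assumes each successive baseline-resolved peak corresponds to at least a doubling of molecular weight, and the mass limits in the usage lines are hypothetical.

```python
import math


def baseline_resolved(mass_a_kda, mass_b_kda, factor=2.0):
    """Apply the rule of thumb: two proteins resolve to baseline in SEC
    only if their molecular weights differ by at least `factor`."""
    heavier = max(mass_a_kda, mass_b_kda)
    lighter = min(mass_a_kda, mass_b_kda)
    return heavier / lighter >= factor


def max_resolved_peaks(min_kda, max_kda, factor=2.0):
    """Upper bound on baseline-resolved peaks between two mass limits,
    assuming consecutive peaks differ in mass by exactly `factor`."""
    if min_kda <= 0 or max_kda < min_kda:
        raise ValueError("mass limits must satisfy 0 < min_kda <= max_kda")
    # Number of full `factor`-fold steps that fit in the range; the small
    # epsilon guards against floating-point rounding at exact powers.
    steps = math.floor(math.log2(max_kda / min_kda) / math.log2(factor) + 1e-9)
    return steps + 1


# Hypothetical mass ranges, chosen only to illustrate the arithmetic.
print(max_resolved_peaks(10, 320))   # 10, 20, 40, 80, 160, 320 kDa
print(max_resolved_peaks(10, 300))
print(baseline_resolved(30, 50))
```

Because the count grows only with the logarithm of the mass range, even a wide range of protein sizes gives a handful of peaks, which is consistent with the low peak counts quoted in the text.
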

Fig. 2.6 Frequency of occurrence of proteins with specific masses in the human proteome (Adapted from Ref. [4])

2.2.4 Reversed-Phase Chromatography (RPC)

RPC is the most widely used mode of HPLC. In general, it is not well-suited to protein applications. Proteins tend to denature when exposed to the hydrophobic surface. Subsequent exposure to organic solvents, sometimes featuring extremes of pH and chaotropes such as TFA, causes even more thorough loss of tertiary structure. Small proteins can tolerate these conditions. The denaturation of large proteins could expose more than a hundred hydrophobic residues for simultaneous interaction with the stationary phase. The result may be elution in peaks 15 min wide or no elution at all.

A large percentage of RPC applications with proteins involve columns with C-4 or C-8 functional groups [5]. Elution from such materials is more facile than from more hydrophobic materials. The greater retention capacity of more hydrophobic materials is not needed in any case, since proteins contain more than enough hydrophobic residues to guarantee retention. The pore diameter should be at least 300 Å, and a diameter of 500–1000 Å should be considered. Such materials are uncommon, but Polymer Laboratories offers PLRP materials with pore diameters in this range. Recently, columns of PLRP material have been shown to afford higher protein recovery and sharper peaks than does a silica-based C-4 column [6].

2.2.5 Hydrophilic Interaction Chromatography (HILIC)

The use of HILIC for intact proteins has been limited to date. The main obstacle is the tendency of many proteins to precipitate from the predominantly organic solvents used for binding in the HILIC mode. To date, applications of pure HILIC have involved membrane proteins [7, 8] and apolipoproteins [9] that do not normally freely occur in aqueous media and which therefore are compatible with the organic solvents used in HILIC, as in the example in Fig. 2.9. A combination that has been more widely used is an IEX column eluted with a predominantly organic mobile phase. Under these conditions

Fig. 2.7 SEC of antibody chains under denaturing conditions. (a) Protein components elute in the Vo peak at 3.5’ which is directed to the mass spectrometer, while the rest of the eluate is directed to waste; (b) Resulting mass spectra. Column: PolyHYDROXYETHYL A, 250 × 2.1-mm; 5-μm, 300-Å. Mobile phase: 0.1 % formic acid (Adapted from L.J. Brady et al., J. Am. Soc. Mass Spectrom. 19 (2008) 502)

hydrophilic interaction is superimposed upon the electrostatic effects. Accordingly, the column will be sensitive to variations in polarity as well as in charge. For example, in the absence of hydrophilic interaction, a column will be sensitive to the acetylation of lysine residues, which reduces the number of (+) charges. In the presence of hydrophilic interaction, it will also be sensitive to the methylation of lysine residues, which affects polarity but not charge. This combination has been used for separation of histone variants with numerous possible combinations of lysine acetylation, methylation, and other post-translational modifications (Fig. 2.10). This combination can also be used for high-resolution separation of variants of other types of proteins that do not normally occur in aqueous solution, as in the example in Fig. 2.11 of an emulsion of pulmonary surfactant proteins. Treatment of this sort removes lipids and detergents from protein samples.

2.2.6 Affinity Chromatography

This mode involves a stationary phase with some immobilized compound that has an unusually

Fig. 2.8 SEC of antibody-drug conjugates (ADC’s) under nondenaturing conditions. The conjugates are attached to free thiol groups resulting from reduction of disulfide bridges and so the conjugate content of the antibody is in multiples of two. Column: PolyHYDROXYETHYL A, 150 × 0.3-mm capillary; 5-μm, 300-Å. Mobile phase: 200 mM ammonium acetate (From: S.M. Hengel et al., Anal. Chem. 86 (2014) 3420)

Fig. 2.9 HILIC of intact mitochondrial membrane proteins. Proteins were identified by direct analysis by MS. Sample: Extract of bovine heart mitochondria. Column: PolyHYDROXYETHYL A (100 × 2.1-mm; 5-μm, 300-Å). Gradient: (a) 20 mM ammonium formate, pH 3.7, containing 0.5 % hexafluoro-2-propanol [a solubilizing agent], with 63 % 2-propanol + 22.5 % ACN; (b) Same but with no ACN and with 30 % 2-propanol (Adapted from Ref. [8])

Fig. 2.10 Separation of Histone H4 isoforms by WCX-HILIC. Column: PolyCAT A capillary, 500 × 0.1 mm; 5-μm, 1000-Å. Gradient: 1–8 % formic acid in 70 % ACN (From Ref. [10])

strong and selective interaction with a specific protein or class of proteins. The selectivity can be quite high, as with an immobilized antibody or lectin, or more general, as with the interaction of an immobilized boronic acid group with glycoproteins that contain carbohydrate residues with cis-diol groups. One can obtain a high enrichment factor with an affinity column. This is helpful when the protein of interest is a minor component in a large volume. An alternative situation is one in which an affinity column is used to deplete a sample of the proteins of highest abundance in order to facilitate the identification of the remaining proteins. The Agilent MARS column has been widely used for this.

There are two main drawbacks to affinity chromatography:

1. An affinity column may not exist for a specific separation or purification of interest;
2. The interaction with the affinity ligand is so strong that it tends to dominate the chromatography.

Consequently, affinity columns are poor at separating the retained proteins from each other. Instead, they are used in a version of solid phase extraction, which separates a mixture into components that are retained and components that are not retained. The retained components can then be separated from each other with a more general mode of chromatography, such as IEX or RPC. An example of this is presented in Fig. 2.12.

[Chromatogram: absorbance (mAU) vs. time (min, 0–70). Peak labels: “Lecithins, steroids, & other lipids” and “Surfactant Protein Variants”.]

Fig. 2.11 SCX-HILIC of pulmonary surfactant protein (SP). Sample: Emulsion of 500 parts lipids (lecithins, steroids, etc.): 1 part bovine SP. The lipids eluted in the void volume. Some of the SP isn’t soluble in water but was soluble in this mobile phase and eluted within the salt gradient. The retained peaks presumably correspond to the different SP proteins present in vivo. Column: PolySULFOETHYL A, 200 × 4.6-mm; 5-μm, 1000-Å. Mobile phase: (a) 0.1 % methylphosphonic acid + 5 mM NaClO4, pH 3.0, with 70 % ACN; (b) Same but with 100 mM NaClO4. Gradient: 5’ hold, then 0–100 % B in 60’

2.3 Examples of Applications

2.3.1 Multi-Dimensional Chromatography for Top-Down Proteomics

When an affinity method does not exist for the isolation of a protein, then the alternative is to perform sample simplification: Distribution of the components of a mixture into subsets by collection of fractions from a column used in a general-purpose mode of chromatography. The protein of interest would then represent a greater percentage of the protein in the fraction in which it resides. Extracts of biological fluids generally contain so many proteins that no single method suffices for purification of an individual component. In such cases, then, each fraction from the first run is subdivided further on the basis of properties complementary to those that governed retention in the first run. Proteins that eluted together from an IEX column because they had the same charge will probably differ in polarity and can be separated by a HIC column, for example. Common sequences are IEX followed by HIC or the opposite sequence (HIC-IEX). An example is shown in Fig. 2.13. There is some sense in using IEX as the first dimension since sample processing is minimized; one can simply add salt to the collected fractions and load them onto a HIC column directly with no need for desalting (although it may be beneficial to concentrate the sample). Fractions collected from the HIC column would probably have to be desalted prior to subsequent analysis by some other mode.

The more dimensions of fractionation used and the more fractions collected for each one, the fewer proteins there will be in each fraction. That facilitates their identification, especially the ones of low abundance. The drawback to this approach is that the number of fractions multiplies rapidly as one adds additional

[Chromatograms: absorbance (AU) vs. time (minutes). Panel a labels: “Unbound proteins”, “Lex-Glycoproteins”, “to glycine, pH 2.5”. Panel b labels: “Rerun via RPC”, “Stage 4 breast cancer plasma”, “NIST plasma control”.]

Fig. 2.12 Top: Affinity isolation of plasma glycoproteins containing the Lewis x antigen by a column with immobilized anti-Lex antibody. Bottom: Separation of the retained glycoproteins on a nonporous C-18 RPC column with an ACN gradient (Adapted from: W. Cho, K. Jung, and F.E. Regnier, Anal. Chem. 80 (2008) 5286)

fractionation steps. There is a tradeoff, then, between how many proteins one wishes to isolate or identify and how much time and work will be involved. In the example in Fig. 2.13, for example, the addition of an extra dimension of separation increased the number of fractions to be processed 35 times while increasing the number of nonredundant proteins identified from 47 to 201, an increase by a factor of 4.3. Most of the additional identifications were of proteins of low abundance.

Even if one is performing bottom-up proteomics, preliminary fractionation of the intact proteins can generate a significant increase in identifications. The fractions collected from the separation in Fig. 2.3 were individually digested with trypsin and then further fractionated with an SCX-RPC sequence. This resulted in the identification of 3135 proteins. Omitting the protein fractionation step reduced the protein identifications to 1292. In this case, then, the extra step increased the number of fractions

Fig. 2.13 An IEX-HIC-RPC sequence for 3-D fractionation of intact proteins for top-down proteomics. Sample: HEK 293 cell lysate. Step 1: Mixed-Bed IEX, with 35 1-min fractions collected. Fraction #3 [colored] was selected for further processing. Step 2: HIC. Again, 35 fractions were collected, and fraction #20 [colored] was selected for further processing. Step 3: RPC. Proteins were eluted into a Q Exactive Orbitrap mass spectrometer. Representative mass spectra are shown for the indicated peaks, along with zoom-in spectra with unit mass isotopic resolution. Starting with all 35 HIC fractions from IEX fraction #3, 201 nonredundant proteins were identified (Adapted from: S.G. Valeja et al., Anal. Chem. 87 (2015) 5363)
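
The trade-off between extra fractionation and extra identifications discussed in this section reduces to simple ratios. The short Python sketch below recomputes the gains from the counts quoted in the text (47 vs. 201 proteins for a 35-fold increase in fractions; 1292 vs. 3135 proteins for a 12-fold increase). The "gain per unit of extra work" figure and the function names are illustrative assumptions, not quantities the chapter defines.

```python
def identification_gain(ids_before, ids_after):
    """Fold increase in protein identifications."""
    return ids_after / ids_before


def gain_per_fraction(ids_before, ids_after, fraction_multiplier):
    """Fold increase in identifications divided by the fold increase in
    the number of fractions to process (an assumed, illustrative metric)."""
    return identification_gain(ids_before, ids_after) / fraction_multiplier


# Counts quoted in the text.
top_down = identification_gain(47, 201)        # extra HIC dimension
bottom_up = identification_gain(1292, 3135)    # protein fractionation first

print(f"top-down gain:  {top_down:.1f}x for 35x the fractions")
print(f"bottom-up gain: {bottom_up:.1f}x for 12x the fractions")
print(f"gain per unit of extra work: "
      f"{gain_per_fraction(47, 201, 35):.2f} vs "
      f"{gain_per_fraction(1292, 3135, 12):.2f}")
```

In both cases the number of fractions grows much faster than the number of identifications, which is the point of the trade-off: the added work is justified mainly when the proteins of interest are of low abundance.
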

12x while increasing the protein identifications 2.4x.

2.3.2 Location and Isolation of a Pure Protein from a Mixture

Isolation of an individual protein from a complex mixture usually requires several sequential steps. If no affinity method is available, then a succession of general-purpose chromatography steps is required. In general, three successive purification steps suffice for purification of proteins from complex mixtures, provided a suitable bioassay or location method is available. Success requires the ability to determine where in the eluate is the protein of interest. If the protein has unique absorption or fluorescence characteristics – for example, the absorption of light at 415 nm by proteins with a heme ring – then its location is clear. Otherwise, its presence must be ascertained by a bioassay, fraction by fraction. Figure 2.14 demonstrates one method to assess the location of a protein that binds a specific

Fig. 2.14 Identification of a target protein in a lysate that binds a drug. An antifungal compound, 4513–0042, reportedly disrupts ergosterol synthesis. It was incubated with whole yeast extract and the proteins were then fractionated on a mixed-bed IEX column. Fractions were collected and analyzed via LC-MS for the presence of 4513–0042. The shaded bars on the left represent the elution position of free 4513–0042. The single shaded bar at 65’ indicates the elution position of a probable complex of 4513–0042 and Erg6p, a yeast protein in the ergosterol pathway (Adapted from: J.N.Y. Chan et al., Mol. Cell. Prot. 11 (2012) M111.016642)

drug. The drug is added to a mixture of proteins, which are then separated. The collected fractions are individually analyzed for the drug via mass spectrometry to locate the elution position of any complexes formed by the selective binding of the drug by specific proteins in the mixture.

An alternative approach is evident in Fig. 2.15. Here, the elution position of interleukin-6 (IL-6) from a mixed-bed IEX column was ascertained. The corresponding fraction in serum was collected and digested with trypsin. Peptides unique to IL-6 were measured via Selected Reaction Monitoring (SRM) mass spectrometry. The Limit of Detection (LOD) was 50 ng IL-6/ml serum. This concentration is too high for measurement of normal levels of IL-6 in serum but may be low enough for its measurement in cases of disease.

2.3.3 QC Analysis

In contrast to the situation described above, QC applications frequently involve a protein at a high state of purity. The objective is usually to separate closely-related variants of the protein from each other. Examples of different types of applications follow:

1. Assessing the degree of deamidation of a protein.
Deamidation of susceptible asparagine (Asn) residues is the most significant nonenzymatic reaction affecting the shelf life of biologically active proteins. The consequences for biological activity of deamidation of any particular Asn residue range from trivial to crucial, depending on the protein and the location of the residue. Deamidation of Asn is promoted by the following factors:
(a) Elevated temperature and pH
(b) The presence of a sterically unhindered residue on the C-terminal side of the Asn in question.
The kinetics of deamidation are fastest with an Asn-Gly sequence. Other sequences that are frequently involved in deamidation are Asn-Ala, Asn-Asp, and Asn-Ser. Nonenzymatic deamidation proceeds via loss of

Fig. 2.15 Measurement of interleukin-6 (IL-6) in whole serum. Top: Elution of an IL-6 standard from a mixed-bed IEX column. Bottom: Fractionation of whole serum on the same column under the same conditions. The 1.5’ fraction corresponding to the elution window of IL-6 was collected, digested with trypsin, and then analyzed via LC-MS for measurement of peptides unique to IL-6 (MRM) (Adapted from: L. Bian, M. Kukula, J. Barrera, and K.A. Schug, ASMS 2015 conference, poster Th 597)
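
The separation of n-Asp and isoAsp variants discussed in this section rests on their different pKa values (about 3.9 and 3.1, as quoted in the text). The following is a minimal Python sketch of that argument, assuming each side chain behaves as an independent monoprotic acid obeying the Henderson-Hasselbalch equation; this ignores the influence of neighboring residues, so the numbers are estimates only.

```python
def fraction_ionized(pka, ph):
    """Fraction of a monoprotic acid group carrying its (-) charge at a
    given pH, from the Henderson-Hasselbalch equation."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


PKA_N_ASP = 3.9    # value quoted in the text
PKA_ISO_ASP = 3.1  # value quoted in the text
PH = 4.0

n_asp_charge = fraction_ionized(PKA_N_ASP, PH)      # roughly 0.56
iso_asp_charge = fraction_ionized(PKA_ISO_ASP, PH)  # roughly 0.89

print(f"n-Asp ionized at pH {PH}:   {n_asp_charge:.2f}")
print(f"isoAsp ionized at pH {PH}:  {iso_asp_charge:.2f}")
print(f"extra (-) charge on isoAsp: {iso_asp_charge - n_asp_charge:.2f}")
```

Under these assumptions the isoAsp variant carries about a third of a unit more negative charge at pH 4.0. That is consistent with the text's statement that it can elute earlier than the n-Asp variant from a cation-exchange column.
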

ammonia from the Asn side-chain with subsequent formation of a succinimide ring. This ring can hydrolyze unsymmetrically to form either an n-aspartyl (n-Asp) residue or an isoaspartyl (isoAsp) residue, in a ratio between 1:2 and 1:3. Conversion of a neutral Asn residue to an Asp residue adds one additional (−) charge to the protein. Accordingly, IEX is a good way to separate the native protein from various deamidation variants.
A susceptible Asp residue can also form a succinimide ring, this time proceeding via dehydration rather than deamidation. Again, a sterically unhindered residue on the C-terminal side of the Asp residue tends to promote the reaction. In contrast with deamidation, though, dehydration of susceptible Asp residues is promoted by neutral or acidic pH. The products of hydrolysis of the ring are the same as with deamidation; n-Asp and isoAsp variants, one of which is identical to the starting protein. It is possible to separate even these closely-related variants. The pKa of an n-Asp residue is around 3.9, while the pKa of an isoAsp residue is about 3.1. Consequently, at pH 4.0, an n-Asp residue has lost about half of its (−) charge while an isoAsp residue retains most of its charge. This can cause a protein variant with an isoAsp residue to elute earlier from a cation-exchange column than the same protein with an n-Asp variant.

2. Assessing the position of derivatization. Some reactions are not limited to the target residue. An example is shown in Fig. 2.2. Here, a PEG (polyethylene glycol) chain was attached

covalently to the N-terminus of the protein. There were also significant side reactions with most of the lysine residues. A cation-exchange column was able to separate these positional variants because a lysine residue is a good binding site in cation-exchange. The column is sensitive to anything that affects that binding. Some lysine residues are more important than others to the overall binding, which accounts for the sensitivity to the position of the PEG’s attachment.

3. Analysis of monoclonal antibodies. This is usually performed by cation-exchange. The heavy chains have a lysine or arginine as the C-terminal residue. A basic residue in a terminal position is readily available for interaction with a stationary phase, and so those residues play a significant role in retention. Loss of the basic residue from the end of one heavy chain causes the antibody to elute significantly sooner, and loss of the basic residues from the ends of both heavy chains leads to even earlier elution. Consequently, cation-exchange of monoclonal antibodies characteristically results in a pattern of three major peaks. The minor peaks eluting earlier than each of the major ones are generally deamidation variants. Some antibody producers treat their antibodies with carboxypeptidase B to cleave off the terminal basic residues. This does not affect the biological activity; the motive is solely to simplify the pattern in Quality Control analysis. Another concern is the degree of aggregation of antibodies. This is generally measured by Size Exclusion Chromatography.
Recently there has been considerable interest in the diagnostic and therapeutic potential of antibodies with covalently-attached drugs or toxins. These are called antibody-drug conjugates, or ADC’s. Antibody molecules vary in the number and position of conjugate molecules attached. The product of the synthesis must be analyzed to ascertain the composition of the product in this regard. In Fig. 2.8, SEC with “native” MS analysis is used to determine the number of conjugates per molecule based on mass differences. Figure 2.16 shows the physical separation of ADC’s via HIC.

2.3.4 Example of a Clinical Analysis: Hemoglobins

This may be the most widespread application in the world involving the analysis of a protein by HPLC. The analyses play a role in the control of two significant problems in public health:

(a) Glycated hemoglobin: Hemoglobin A1c (Hb A1c) has a residue of glucose covalently attached to the N-terminus of the

Fig. 2.16 Analysis via HIC of an antibody-drug conju- contain 2 or 4 molecules of the conjugate. Column:
gate (ADC). The two minor peaks eluting after the native PolyPROPYL A, 100  4.6 mm (3-μm, 1500-Å). A
antibody peak are variants that contain a single conjugate decreasing gradient of ammonium sulfate was used
in different positions. The subsequent major peaks
beta-chain. Its concentration is proportionate to the average glucose level in the blood during approximately a 1-month period. Such information is useful for diagnosis of diabetes and monitoring its treatment. About 4–5 % of the hemoglobin of a normal individual is in the form of Hb A1c. In a case of uncontrolled diabetes, the level can be as high as 15–16 %. There are a number of different assays for Hb A1c. The most common one that involves chromatography is to pass a sample through a column with an immobilized ligand of phenylboronic acid. Boronic acids form a covalent but transient 5-member ring with compounds containing cis-diol groups, including glucose residues attached to proteins. The resulting chromatogram features just two peaks: a major early peak consisting of the hemoglobins that lack sugar adducts, and a minor peak of the glycated hemoglobins, which are eluted by a step to lower pH. The area under the two peaks is then integrated to determine the percentage of glycated hemoglobin. This method does not distinguish between glycation of hemoglobin at the N-terminus of the beta chain and the less frequent glycation at a lysine residue.

(b) Analysis of hemoglobin variants: Certain parts of the world have a significant occurrence of genetic mutants of hemoglobin in the local gene pool. The occurrence of such mutations tends to coincide with a high incidence of malaria; it is speculated that the carriers of the mutations are more resistant to the effects of the disease. However, people who are homozygous for these mutations suffer effects that shorten their lives significantly. This is a significant public health problem in such countries. In sub-Saharan Africa the major variants, Hb S and Hb C, cause sickle cell anemia. A similar syndrome occurs in the India-Pakistan area (Hb D) and in Southeast Asia (Hb E). In the Mediterranean basin and a belt across the Middle East through Iran, the major hemoglobinopathy is beta-thalassemia, which is diagnosed via an elevation in the percentage of hemoglobin A2 (Hb A2).
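
The peak-area calculation described under (a) for the boronate affinity assay can be sketched as follows. This is a minimal illustration, assuming that both peaks are detected with the same response per unit of hemoglobin; the function name and the peak areas are hypothetical and are not data from this chapter.

```python
def percent_glycated(area_nonglycated: float, area_glycated: float) -> float:
    """Glycated hemoglobin as a percentage of the total integrated peak area.

    Assumes equal detector response for both peaks (an assumption of this sketch).
    """
    total = area_nonglycated + area_glycated
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return 100.0 * area_glycated / total


# Hypothetical integrated areas in arbitrary units.
normal = percent_glycated(area_nonglycated=955.0, area_glycated=45.0)
elevated = percent_glycated(area_nonglycated=845.0, area_glycated=155.0)
print(f"{normal:.1f} %")    # 4.5 %
print(f"{elevated:.1f} %")  # 15.5 %
```

Because the boronate column retains every glycated species, the value computed this way is total glycated hemoglobin; as noted above, it does not separate N-terminal glycation from glycation at lysine residues.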

Fig. 2.17 Analysis of hemoglobins via cation-exchange. Left: A composite standard, including the S and C variants associated with sickle cell anemia. Right: A clinical sample from an individual with an elevated level of hemoglobin A2. All of these variants, including hemoglobins A1c and F, are completely separated in less than 3.5 min. Column: PolyCAT A, 35 × 4.6-mm; 3-μm, 1500-Å. Gradient: An increasing NaCl gradient in a Bis-tris buffer that contains 2 mM NaCN

Erythrocytes are isolated by centrifugation of a blood sample and then lysed. The resulting solution, a hemolyzate, can be analyzed directly. An even simpler method involves blotting a drop of blood on filter paper, punching out the blot, and solubilizing and analyzing the hemoglobins. There are various tests for the variants of interest, but the most widely employed is HPLC separation via cation-exchange. Hemoglobin has an absorption maximum at 415 nm, which makes it convenient to analyze with a minimum of sample processing. Figure 2.17 shows some examples.

2.3.5 Alternatives to RPC for Direct LC-MS

Examples of SEC-MS were presented in the section on SEC, and the section on HILIC has an example of HILIC-MS of some membrane proteins.
One of the more widely used alternatives is WCX-HILIC. This is the use of a weak cation-exchange column with a gradient to a pH low enough to uncharge the carboxyl groups. While this can be performed in strictly aqueous media, the most popular combination starts with a concentration of acetonitrile in the range 60–70 %. This superimposes a significant degree of hydrophilic interaction on the electrostatic effects, promoting the retention of proteins with a net charge of either sign. Along with the decreasing pH gradient, then, there is a gradient of decreasing ACN concentration, which tunes down the hydrophilic interaction. The eluting proteins are readily analyzed directly via mass spectrometry. This combination is widely used for analysis of histones, both “top-down” [10] and “middle-down” [11, 12]. Figure 2.10 [above] shows an example of this.

2.4 Summary

The ability of bottom-up proteomics to identify more than 30–40 peptides was made possible by increasing the degree of separation prior to the mass spectrometer. At present the separation methods are a major bottleneck in the development of top-down mass spectrometry of proteins. Given the examples described above, there is reason to be optimistic that appropriate methods will be forthcoming and that progress will then depend on advances in the mass spectrometry instrumentation.

References

1. Wang H, Qian W-J, Chin MH, Petyuk VA, Barry RC, Liu T, Gritsenko MA, Mottaz MA, Moore RJ, Camp DG II, Khan AH, Smith DJ, Smith RD (2006) J Proteome Res 5:361
2. Alpert AJ, Regnier FE (1979) J Chromatogr 185:375
3. Zhang L, Yao L, Zhang Y, Xue T, Dai G, Chen K, Hu X, Xu LX (2012) J Chromatogr B 905:96
4. Compton PD, Zamdborg L, Thomas PM, Kelleher NL (2011) Anal Chem 83:6868
5. Zhang J, Roth MJ, Chang AN, Plymire DA, Corbett JR, Greenberg BM, Patrie SM (2013) Anal Chem 85:10377
6. Vellaichamy A, Tran JC, Catherman AD, Lee JE, Kellie JF, Sweet SMM, Zamdborg J, Thomas PM, Ahlf DR, Durbin KR, Valaskovic GA, Kelleher NL (2010) Anal Chem 82:1234
7. Jenö P, Scherer PE, Manningkrieg U, Horst M (1993) Anal Biochem 21:292
8. Carroll J, Fearnley IM, Walker JE (2006) Proc Natl Acad Sci U S A 103:16170
9. Tetaz T, Detzner S, Friedlein A, Molitor B, Mary J-L (2011) J Chromatogr A 1218:5892
10. Tian Z, Tolić N, Zhao R, Moore RJ, Hengel SM, Robinson EW, Stenoien DL, Wu S, Smith RD, Paša-Tolić L (2012) Genome Biol 13:R86
11. Young NL, DiMaggio PA, Plazas-Mayorca MD, Baliban RC, Floudas CA, Garcia BA (2009) Mol Cell Prot 8:2266
12. Sidoli S, Lin S, Karch KR, Garcia BA (2015) Anal Chem 87:3129

3 Sample Preparation for Mass Spectrometry-Based Proteomics; from Proteomes to Peptides

John C. Rogers and Ryan D. Bomgarden

Abstract
Mass spectrometry (MS) has become the predominant technology to
analyze proteins due to its ability to identify and characterize proteins
and their modifications with high sensitivity and selectivity (Aebersold
and Mann, Nature 422(6928):198–207, 2003; Han et al., Curr Opin Chem
Biol 12(5):483–490, 2008). While mass spectrometry instruments have
improved rapidly over the past couple of decades, mass spectrometry
results have remained largely dependent on sample preparation and qual-
ity. Sample ionization and mass measurements are susceptible to a wide
variety of interferences, including buffers, salts, polymers, and detergents.
These contaminants also impair MS system performance, often requiring
time-consuming maintenance or costly repairs to restore function. The
goal of this chapter is to describe the rationale, considerations, and general
techniques used to prepare samples for proteomic mass spectrometry
analysis.

Keywords
Protein chromatography • Protein extraction • Lysis • Protein depletion or
enrichment • Digestion • In-gel digestion • In-solution digestion • Filter-
assisted sample preparation (FASP) • Digestion comparison

J.C. Rogers (*) • R.D. Bomgarden
Thermo Fisher Scientific, Rockford, IL, USA
e-mail: john.rogers@thermofisher.com

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_3

3.1 Overview

Due to the complexity of proteomic samples and the wide variety of sample preparation techniques, a proteomics researcher must first determine the right experimental strategy. A successful proteomics experiment requires the integration of good sample preparation, instrumentation, and software (Fig. 3.1). Therefore, it is important to understand the goals and expectations of the project and to choose and optimize the best sample preparation method accordingly. For example, the sample preparation requirements for protein identification from
a gel slice are very different from the requirements to identify protein interaction networks, measure changes in the mitochondrial proteome, understand protein phosphorylation and signaling in cancer, or identify protein biomarkers of cancer metastasis in plasma [3–6]. Unlike genomic or transcriptomic research, there is no “standard” universal sample preparation method for proteomics.

Fig. 3.1 The key to proteomics success. Successful proteomics laboratories and companies recognize the importance of sophisticated sample preparation, instrumentation, and software technologies and skills. Workflows designed to maximize the overlap between these complementary technologies are an effective means of improving proteomics research

Additionally, proteomics experiments must balance the competing needs for sensitive and complete proteome coverage with the scalability of analyses (Fig. 3.2). Proteomic strategies to improve proteome coverage require multidimensional fractionation; however, this fractionation increases the sample analysis time and sacrifices throughput [7, 8]. Alternatively, MS acquisition strategies that improve the sensitivity, reproducibility, and throughput of protein quantification, such as selected reaction monitoring (SRM) or parallel reaction monitoring (PRM), limit the number of features that can be monitored [9, 10]. For this reason, proteomics research is generally divided into three categories: protein identification and characterization, proteome profiling, and targeted protein analysis.

Fig. 3.2 The proteomics conflict. It is impossible to optimize sensitivity, throughput and comprehensiveness simultaneously. Discovery proteomics strategies optimize sensitivity and comprehensiveness with few samples. Targeted proteomics strategies optimize sensitivity and scalability by limiting the number of monitored features. Note that comprehensive analysis with reasonable throughput is enabled by sample multiplexing with mass tag reagents

Protein identification and characterization is commonly performed to identify protein isoforms, splice variants, post-translational modifications, and interacting proteins [11]. These studies are typically performed after protein separation using SDS polyacrylamide gel electrophoresis (SDS-PAGE) and may also involve a protein enrichment step, such as immunoprecipitation. In contrast, proteomic profiling is typically performed on whole protein or sub-proteome extracts digested in solution. This comprehensive approach requires more instrument analysis time per sample to maximize the number of protein identifications at the expense of the number of samples that can be analyzed. Isobaric mass tags (e.g. iTRAQ and TMT) can help to address this sample throughput limitation by allowing multiple samples to be combined into a single LC-MS analysis [12–14]. Targeted protein analysis limits the number of features that are monitored to a pre-selected list of target peptides and their transitions. These methods optimize sample preparation, chromatography, instrument tuning, and fragmentation to achieve the highest sensitivity and throughput for
hundreds of samples. Ultimately, a sample preparation strategy should be chosen which generates the most biologically relevant or useful data possible for a given experiment.
Protein analysis using tandem mass spectrometry (MS/MS, or MSn) can be performed on intact proteins (“top-down” proteomics) or protein digests (“bottom-up” proteomics). Top-down proteomics is a growing field, as it permits nearly complete protein sequence coverage and enables simultaneous characterization of protein isoforms and modifications [15, 16]. However, top-down analysis is currently limited to proteins less than ~50,000 Da and requires high resolution MS instrumentation (>100,000 resolving power) to accurately identify proteins and protein isoforms. Recently, “middle-down” strategies have also been developed to reduce the sizes of intact proteins through partial digestion or using proteases that cleave at rare sites or at specific positions within a protein (e.g. antibodies, [17, 18]). Sample preparation for intact proteins typically involves multi-dimensional protein fractionation to reduce sample complexity and protein desalting to remove residual salts or other impurities that may form adducts during ionization.
Bottom-up proteomic strategies represent the vast majority of MS proteomic analyses. These methods use proteases to digest proteins at specific amino acids into peptides with a predictable terminus. Unlike proteins, peptides are more easily separated by reverse phase HPLC and ionize well by electrospray or matrix-assisted laser desorption ionization (MALDI). Importantly, peptides fragment during MS/MS to yield amino acid sequence information. Similar to proteins, multi-dimensional fractionation of peptides can be used to reduce sample complexity [19] but removal of salts, detergents and other impurities can be more difficult at the peptide level than the protein level. As peptide fractionation, liquid chromatography (LC), and MS analysis are addressed in other chapters, this chapter will primarily focus on bottom-up protein sample preparation strategies prior to LC-MS/MS analysis.
The quality and consistency of sample preparation influence the time and cost of MS analysis and the reliability of the results. For MS-based proteomics to reach its full potential as a routinely used detection technology in research and clinical settings, variability associated with the sample preparation steps that precede MS analysis must be addressed. Despite extensive literature describing various MS sample preparation methods explained below and elsewhere, there is little standardization among methods. This results in confusion for those new to MS sample preparation techniques and high variability in MS analysis results, even among expert MS laboratories.

3.2 Protein Extraction

Tissue or cell lysis is the first step in protein extraction and solubilization. Numerous techniques have been developed to obtain the highest protein yield for different organisms, sample types, subcellular fractions, or specific proteins. Due to the diversity of tissue and cell types, both physical disruption and reagent-based methods are often required to extract cellular proteins. Physical lysis equipment, such as homogenizers, bead beaters, and sonicators, is commonly used to disrupt tissues or cells in order to extract cellular contents and shear DNA. In contrast, reagent-based methods use denaturants or detergents to lyse cells and solubilize proteins. Cell lysis also liberates proteases and other catabolic enzymes, so broad-spectrum protease and phosphatase inhibitor cocktails are typically included during sample preparation to prevent nonspecific proteolysis and loss of protein phosphorylation, respectively.
Through the use of different buffers, detergents and salts, cell lysis protocols can be optimized for the best protein extraction for a particular sample or protein fraction. Strong denaturants (e.g. urea or guanidine) and ionic detergents (e.g. sodium dodecyl sulfate (SDS) or deoxycholate (SDC)) solubilize membrane proteins and denature proteins. Non-ionic or
zwitterionic detergents (e.g. Triton X-100, NP-40, digitonin, or CHAPS) have a lower critical micelle concentration and require lower detergent concentrations to solubilize proteins [20, 21]. These detergents generally solubilize membrane proteins and protein complexes with less denaturation and disruption of protein-protein interactions [21].
Unfortunately, many detergents used to solubilize proteins cause significant problems during downstream mass spectrometry analysis if they are not completely removed. In addition to cell lysis buffers, detergents used to clean laboratory glassware may also contaminate samples and LC solvents. Detergents present in the sample can:

1. Contaminate and foul autosampler needles, valves, connectors, and lines;
2. Affect liquid chromatography by reducing column capacity and performance;
3. Affect crystallization prior to matrix-assisted laser desorption ionization (MALDI) sample analysis;
4. Suppress electrospray ionization (ESI) prior to introduction into the mass spectrometer;
5. Deposit in the mass spectrometer, interfering with the spectra and reducing sensitivity of the instrument.

Flexible tubing or poor quality plastic consumables can also leach phthalates and other contaminants that can interfere with downstream LC-MS analysis [22]. Both phthalates and detergents ionize very well and overwhelm peptide signals. Polydisperse detergents, such as Triton X-100, Tween or NP-40, contain a distribution of variable length polyethylene glycol (PEG) chains that often elute throughout the LC gradient as a family of peaks separated by 44 Da mass units and overwhelm the LC-MS results. Fortunately, these leachables and detergents can often be removed by gel electrophoresis, protein precipitation, or filter-assisted sample preparation (FASP) techniques described later in this chapter.
While all detergents can affect downstream LC-MS analysis, N-octyl-beta-glucoside and octylthioglucoside are considered more compatible with mass spectrometry because they are dialyzable and monodisperse (i.e. homogeneous) [23]. In addition, a variety of mass spectrometry-compatible detergents are commercially available. Invitrosol (Thermo Scientific) contains several monodisperse detergents that elute in regions of the HPLC gradient that do not interfere with peptides or their chromatography. Cleavable detergents, such as ProteaseMax (Promega), Rapigest (Waters), PPS Silent Surfactant (Expedeon), or Progenta (Protea), degrade with heat or at low pH into products that do not interfere with LC-MS. As digestion requires incubation at 37 °C and LC-MS loading buffers contain formic acid or trifluoroacetic acid, sample preparation workflows do not require any significant modification to use these MS-compatible detergents [24].

3.3 Protein Depletion or Enrichment

Depending on the protein source and the copy number per cell, there can be a tremendous difference in the concentration between the lowest and most abundant proteins. For mammalian tissues and cell lines, protein expression can range over 6–9 orders of magnitude. For serum and plasma samples, the dynamic range can be greater than 12 orders of magnitude with serum albumin representing over 50 % of the protein content [25]. In order to get an adequate depth of protein coverage in serum, to identify relevant biomarkers, abundant protein depletion is required. Although affinity chromatography using Cibacron blue dye can be used to remove albumin, immunoaffinity using antibodies is typically required to remove other abundant proteins such as immunoglobulins, transferrin, fibrinogen, and apo-lipoproteins [26]. One advantage of using antibodies for immunodepletion is that one sample preparation technique can be used to remove the top 2–20 most abundant proteins depending on the product used. Another is that the depletion resins can be regenerated for multiple uses, though this can affect protein depletion reproducibility over time.
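
The arithmetic behind the dynamic range and depletion argument in this section can be made concrete with a short sketch. It assumes complete and perfectly specific removal of the depleted proteins and a fixed total protein load per analysis, which real depletion resins only approximate. The function names are hypothetical, and the 90 % figure is an illustrative value, not one taken from this chapter.

```python
import math


def dynamic_range_orders(most_abundant: float, least_abundant: float) -> float:
    """Dynamic range expressed in orders of magnitude (log10 of the ratio)."""
    if most_abundant <= 0 or least_abundant <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(most_abundant / least_abundant)


def enrichment_after_depletion(removed_fraction: float) -> float:
    """Fold increase in the share of the remaining proteins at a fixed load.

    removed_fraction is the fraction of total protein mass that is depleted.
    """
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    return 1.0 / (1.0 - removed_fraction)


# A 10^12-fold concentration difference spans 12 orders of magnitude.
print(f"{dynamic_range_orders(1e12, 1.0):.1f}")

# Removing albumin alone (taken here as 50 % of protein content).
print(f"{enrichment_after_depletion(0.50):.1f}-fold")

# Hypothetical multi-protein depletion removing 90 % of protein mass.
print(f"{enrichment_after_depletion(0.90):.1f}-fold")
```

The gain from depletion is modest compared with a dynamic range of many orders of magnitude, which is one reason depletion is usually combined with the enrichment and fractionation strategies discussed next.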

Protein enrichment techniques are commonly overlooked during protein sample preparation but may be necessary in order to identify and quantify biologically relevant proteins which are typically in lower abundance. One method of protein enrichment is subcellular fractionation, which separates proteins by location in a particular cellular compartment or organelle. Subcellular fractionation using sucrose density gradient centrifugation can separate vesicles and organelles including the nucleus, mitochondria, or chloroplasts from cytosolic and vesicle proteins [27, 28]. Differential extraction is another subcellular fractionation technique which uses detergents to selectively solubilize nuclear, chromatin-bound, membrane, cytosolic, and cytoskeletal proteins [29]. Another method of protein enrichment is through protein modifications. Cell surface proteins which are glycosylated can be enriched by chemical labeling of oxidized glycans, metabolic incorporation of azide-containing sugars [30–32], or lectin affinity [33]. Phosphoproteins can be enriched with immobilized metal affinity chromatography [34]. Activity-based chemical probes are another method for enrichment of enzyme subclasses such as kinases, hydrolases, and oxidases [35, 36]. Finally, affinity capture using immunoprecipitation is the method of choice for enrichment of specific protein targets or protein complexes as this technique provides the highest selectivity and sensitivity for the lowest abundant proteins [37].

3.4 Protein Preparation

Unfortunately, many protein extraction, fractionation, enrichment and depletion methods introduce salts, buffers, detergents, and other contaminants which are not MS compatible. Because of the relative difference in molecular weight, it is simplest and preferable to remove these small molecule contaminants before protein digestion. There are a variety of options to remove these small molecules, including gel electrophoresis, chromatography, dialysis, buffer exchange, size exclusion, and protein precipitation [38, 39]. Gel electrophoresis is an inexpensive, straightforward method for the removal of salts, detergents, and other small molecules prior to in-gel digestion. However, keratins from skin and dust are common contaminants which can be introduced when pouring and handling gels, so it is imperative to always wear gloves and to use MS grade reagents to minimize this contamination.
Reverse phase C4 or C8 cartridges can remove salts from proteins but concentrate non-ionic detergents and may have poor recovery of hydrophilic proteins. Strong cation exchange resins can remove anionic detergents, like deoxycholate or sodium dodecyl sulfate (SDS), but typically require salts for protein elution which then have to be removed before LC-MS analysis. Dialysis membranes and cassettes are available with a variety of molecular weight cut-offs (MWCO) and can effectively exchange buffer components to remove contaminants; but dialysis is relatively slow, requires multiple buffer changes, and may be difficult with small volumes. Spin columns or stirred-cell pressure devices with MWCO membranes can rapidly exchange buffers to remove small molecule contaminants and concentrate samples. These MWCO devices allow sequential buffer exchange steps to be performed and can be used for complete MS sample preparation in the filter-assisted sample preparation (FASP) methods. Size exclusion resins retain small molecules in porous beads while excluding proteins, enabling rapid and efficient buffer exchange with minimal sample loss, especially in a spin column format. Notably, of all of the desalting methods available, precipitation with organic solvents such as acetone or methanol/chloroform with or without organic acids (e.g. TCA or TFA) is the most common method for desalting proteins prior to MS sample preparation as it is the least expensive, simplest and most scalable option.

3.5 Protein Digestion

Trypsin is the most commonly used protease for MS sample preparation because of its high
activity, selectivity and relatively low cost. Trypsin cleaves proteins to generate peptides with a lysine or arginine residue at the carboxy terminus [40]. These basic amino acids at the end of every tryptic peptide improve peptide ionization and MS/MS fragmentation for peptide identification. Although trypsin is the most popular enzyme used for protein digestion, some protein sequences are not efficiently cleaved by trypsin or do not contain basic amino acids spaced close or far enough apart to generate peptides which can be used for protein identification. Trypsin digestion is less efficient at lysine and arginine residues followed by proline, repeated basic residues (e.g. KK, RK), or in the presence of post translational modifications (e.g. methylation, acetylation), resulting in missed cleavages [41]. Some tryptic peptides may be too small to retain on reversed phase LC columns or are not unique for a particular protein. Others may be too large and hydrophobic to identify by LC-MS. For example, 56 % of the tryptic peptides in yeast are ≤6 amino acids long, while 97 % of peptides identified by LC-MS are 7–35 amino acids [42]. These short or extremely long unidentified peptides result in incomplete protein sequence coverage, resulting in missing specific peptide sequences or sites of posttranslational modifications.
For more comprehensive proteome coverage, alternative proteases are often used to generate different peptide sequences that may not be identified from tryptic digests. Partial digestion with specific or non-selective proteases, like elastase or proteinase K, has been used to increase protein sequence coverage; but these proteases also increase the complexity and variability of digestion, making it more difficult to reproducibly identify the same peptides and proteins in replicate samples [43, 44]. Proteases with distinct cleavage specificities, such as ArgC, AspN, chymotrypsin, GluC, LysC, or LysN, produce complementary sequence information which can be combined to improve sequence coverage. This multi-enzyme approach has been used successfully by multiple laboratories to increase the number of protein identifications by >10–15 % and improve the average sequence coverage by 60–160 % [42, 45–47]. Different proteases have also been shown to provide a unique repertoire of phosphopeptides which are not observed in tryptic digests [48]. Therefore, a multiple enzyme strategy is recommended for comprehensive analysis of single proteins or complex proteomes.
Multiple studies have demonstrated that chaotropes, solvents and detergents increase the efficiency of protein digestion [49, 50]. These reagents assist in the solubilization and unfolding of proteins, especially integral and transmembrane proteins or hydrophobic stretches of protein sequence. Efficient digestion is important to maximize the number of peptides and proteins identified in a sample, and complete digestion permits the reproducible quantitation of peptides. Organic solvent additives, such as 5–20 % acetonitrile (ACN), trifluoroethanol, and methanol have been shown to improve digestion efficiency and only require vacuum centrifugation or dilution to be compatible with LC-MS analysis. Urea and guanidine chaotropes also improve protein solubilization and digestion efficiency. These salts are easily removed from proteins by desalting or dialysis, or from peptides by using reverse phase C18 tips, cartridges, or trap columns. However, urea can modify lysine residues, resulting in carbamylation artifacts [51], and some proteases are not active in guanidine. Finally, some detergents which are used for protein extraction have also been shown to aid protein digestion. Depending on the detergent, these reagents can be removed after digestion by phase transfer, detergent removal resins, or hydrolysis with low pH [24, 50, 52]. Interestingly, it is reported that a combination of 1 M guanidine and 20 % ACN with any MS compatible detergent greatly improves the digestion efficiency and specificity over any one of these additives alone [24]. While the effects of solvents, chaotropes, and detergents have been well studied for trypsin digestion, and to a lesser extent for LysC digestion, the effects of these additives on other proteases are not well understood.
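
The trypsin cleavage rule and the peptide length window discussed in this section can be illustrated with a small in silico digest. This is a simplified sketch, not the method of any particular search engine: it assumes cleavage C-terminal to every lysine or arginine unless the next residue is proline, and it models incomplete digestion only through an optional number of missed cleavages. The function names and the example sequence are hypothetical.

```python
import re
from typing import List


def tryptic_digest(sequence: str, missed_cleavages: int = 0) -> List[str]:
    """Cleave after K or R, except before P (a simplifying assumption)."""
    # Zero-width split points: preceded by K or R and not followed by P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence.upper()) if p]
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            # Join adjacent pieces to represent up to N missed cleavages.
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def in_length_window(peptides: List[str], min_len: int = 7, max_len: int = 35) -> List[str]:
    """Keep peptides within the 7-35 residue range cited in the text."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


example = "MKWVTFISLLRPGSAYSRGVFRRDAHK"  # hypothetical sequence
full = tryptic_digest(example)
print(full)                    # ['MK', 'WVTFISLLRPGSAYSR', 'GVFR', 'R', 'DAHK']
print(in_length_window(full))  # ['WVTFISLLRPGSAYSR']
```

In this toy case four of the five fully cleaved peptides fall below the length window, which mirrors the sequence coverage gaps described above and the motivation for complementary proteases.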
3 Sample Preparation for Mass Spectrometry-Based Proteomics; from Proteomes to Peptides 49

3.6 Peptide Preparation: In-Gel Digestion

Once the proteins in a complex sample are solubilized, there are three general approaches to prepare protein digests: in-gel digestion, in-solution digestion, and filter-assisted sample preparation (Fig. 3.3). All three of these methods remove contaminating detergents and other small molecules, reduce and alkylate proteins, digest proteins to peptides, and prepare peptides for mass spectrometry analysis. Sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) is the most common technique for protein analysis [39, 53]. Gel electrophoresis is a simple, inexpensive, and relatively high-resolution protein separation method that can be employed in either one dimension (1D) to resolve proteins by molecular weight or two dimensions (2D) to resolve proteins by isoelectric point and molecular weight [54]. Although 2D PAGE is not compatible with salts and ionic detergents, 1D SDS-PAGE can easily remove these and other substances which may interfere with LC-MS analysis. In fact, many academic proteomic core labs prefer or require samples to be provided in gels or gel slices because this method is so effective for sample clean up. Depending on the depth of analysis, a single band can be excised or a complex sample can be excised as a set of gel slices in a method often referred to as GeLC-MS [55]. Another advantage of gel-based fractionation methods is that they can reduce sample complexity and separate highly abundant proteins from lower-abundance proteins. Since all of the peptides from the respective protein(s) are contained in a single gel band, spot, or fraction, protein sequence coverage and posttranslational modification mapping are also improved.

After gel electrophoresis, separated proteins are detected and visualized with a variety of gel stains, including Coomassie Blue, Colloidal Coomassie, and glutaraldehyde-free silver stain. Gel bands containing protein(s) of interest are then excised, destained, reduced, and alkylated to improve digestion and peptide extraction [39]. Disulfide bonds prevent complete protein unfolding and limit proteolytic digestion. Peptides that remain linked by disulfides are also difficult to identify due to the complexity of the peptide fragment ion spectra. Protein disulfides are typically reduced with either dithiothreitol (DTT) or tris(2-carboxyethyl)phosphine (TCEP) in the presence of other denaturants (e.g. heat, SDS, urea, guanidine). Reduced cysteines are then alkylated with iodoacetamide, iodoacetic acid, chloroacetamide, 4-vinylpyridine, or N-ethylmaleimide (NEM) to prevent oxidation [56–58]. Haloacetyl-containing alkylating agents are light sensitive and must be made fresh. Alkylation reactions should be performed at pH 8.0 to avoid alkylation at other amino acids, and excess reagent should be quenched with DTT to prevent side reactions and over-alkylation of proteins. After reduction and alkylation, gel bands are digested with a protease, and the peptides are extracted using standard techniques [39]. While in-gel digestion is more prone to incomplete or less reproducible digestion and lower recovery of peptides relative to the in-solution option (50–70 % recovery), gel electrophoresis remains an important sample preparation technique prior to MS analysis (Fig. 3.3, and Supplement Method 1).

3.7 Peptide Preparation: In-Solution Digestion

In-solution digestion is a popular alternative to in-gel digestion, because it requires fewer steps and can be scaled for the analysis of samples containing less than 10 μg or greater than 1 mg of protein. For this method, proteins are first denatured with detergents and heat or with urea or guanidine chaotropes. Disulfide bonds between cysteine residues are reduced and alkylated, and then sample contaminants are typically removed by precipitation prior to digestion and cleanup. As stated above, urea has been used for many years but is not recommended because it must be made fresh, as the formation of isocyanic acid over time increases the likelihood of protein carbamylation [51]. Protein solubilization and denaturation with
50 J.C. Rogers and R.D. Bomgarden

[Figure 3.3 workflow diagram. Lysate Preparation (lysis, fractionation, depletion, enrichment, protein assay); Peptide Preparation by SDS-PAGE, in-solution w/precipitation, or filter-assisted methods (buffer exchange, reduction, alkylation, digestion); Fractionation & Clean up (detergent removal, enrichment, fractionation, desalting, peptide assay); MS]

Fig. 3.3 General protein sample preparation workflow. There are many options for the extraction of proteins from
tissue and cell lysates, protein fractionation and enrichment, and digestion to peptides for MS analysis

SDS or SDC is more effective than urea, and these detergents permit heating during the reduction of disulfides, improving protein denaturation before digestion.

Once disulfides have been reduced and alkylated, contaminating salts, reducing and alkylating reagents, detergents, and small molecule metabolites present in the sample matrix should be removed from the sample before digestion. Depending on the sample source and extraction technique, small molecule contaminants may include excess protein labeling reagents, lipids, nucleotides, and phosphoryl- or amine-containing metabolites (e.g. phosphocholine, aminoglycans) that could interfere with downstream peptide enrichment or chemical tagging [59]. These contaminants can be removed by buffer exchange using gel filtration resins, dialysis, gel electrophoresis, filtration with a molecular weight cutoff filter, or most commonly by precipitation with an acid or an organic solvent [59–69]. Polydisperse detergents must be removed prior to digestion in order to prevent downstream contamination of LC-MS equipment. Most detergents can be removed by protein precipitation with four volumes of cold (−20 °C) acetone. Precipitation with dilute deoxycholate and trichloroacetic acid, with methanol, or with a 4:1:3 ratio of methanol:chloroform:water followed by an additional three volumes of methanol, as well as partitioning with ethyl acetate, are alternative methods of detergent removal [65, 67–69]. As an alternative workflow, digestion can be performed in 0.1 % SDS or SDC, and these detergents may be removed from the peptides after digestion using a detergent removal spin column or by acidification to precipitate SDC [52, 70, 71].

Detergents, chaotropes, and organic solvent additives improve trypsin digestion efficiency and dramatically increase peptide and protein identifications in complex protein mixtures [49, 52, 71]. For tryptic digestion, the protein is

dissolved in a buffered solution at pH 8.0 (e.g. 50–100 mM ammonium bicarbonate), and digestion is performed for 4–16 h at 37 °C with agitation. Low concentrations of acetonitrile, urea, SDS, SDC, or MS-compatible detergents may be included to solubilize the precipitated protein pellets and partially denature the protein to improve digestion efficiency. Endoproteinase LysC cleaves after lysine residues, as trypsin does. Unlike trypsin, LysC can cleave at lysine residues followed by proline and is active under denaturing conditions (e.g. 8 M urea). LysC digestion is often performed for 1–4 h before tryptic digestion for more complete and reproducible digestion [72]. After digestion, peptides may be desalted off-line using reverse-phase solid-phase extraction cartridges or tips, or on-line using a trap column before MS analysis, as described in another chapter of this book.

3.8 Peptide Preparation: Filter-Assisted Sample Preparation (FASP)

Molecular weight cutoff (MWCO) filters have been used for decades to concentrate and exchange buffers for protein samples. Protocols for protein sample preparation with MWCO filters prior to MS were introduced in 2005 by Manza et al., and improved upon in 2009 and over subsequent years by the laboratory of Matthias Mann [63, 73, 74]. Filter-assisted sample preparation (FASP) utilizes SDS, heat, and urea to solubilize and denature proteins before transfer to a MWCO spin column, which is used for protein collection, concentration, and digestion. An advantage of FASP is that the detergents, salts, and small molecules can be easily removed through multiple rounds of washing. Concentrated proteins are then alkylated, washed, and digested on the membrane before elution and desalting. FASP is compatible with a wide variety of samples and has been applied to 0.2–200 μg protein samples in a wide variety of applications, including brain tissue samples, formalin-fixed paraffin-embedded slices, C. elegans, phosphoproteomic, and glycoproteomic samples [73, 75–77]. Recently, some proposed enhancements to the FASP protocol have been reported, including: 1) simultaneous reduction and alkylation to eliminate several centrifugation steps and improve alkylation specificity; 2) prior passivation of the MWCO membrane with Tween-20 for higher peptide recovery; and 3) the replacement of urea with deoxycholate for improved tryptic digestion [78].

3.9 Peptide Preparation Comparison

As noted previously, many proteomic sample preparation methods have been described in the literature (Figs. 3.3 and 3.4), and these methods are modified further by members of the same lab or by other laboratories. This makes it extremely difficult for new MS users to identify the best protocol and generate consistent results. Each of the protocols described here has advantages and disadvantages. GeLC-MS simplifies protein fractionation and keeps the peptides from the proteins in a gel band in a single fraction, but it is limited by scale, protein digestion efficiency, and peptide recovery. In-solution digestion with urea can carbamylate lysine residues, requires desalting to remove urea after digestion, and can suffer from poor protein extraction recovery without detergents. FASP is compatible with a wide variety of samples but requires many centrifugation steps, resulting in low sample processing throughput. Finally, digestion in the presence of detergent and subsequent removal of detergent with a resin, precipitation, or phase transfer extraction may not be scalable or reproducible. Since sample preparation is the most problematic area of MS-based proteome analysis, it is important to have robust, reproducible methods that can be easily adopted by novice and expert MS labs alike.

We have compared the sample preparation results from FASP and three solution-based

[Figure 3.4 flowchart. Steps of each workflow, in order:]
A. In-solution w/precipitation: Extract proteins from cell lysate with 1% SDS AmBic and heat, sonicate; Reduction; Alkylation; Acetone precipitation, 1 hr at -20 °C; LysC digestion, 2 hrs at 37 °C; Trypsin digestion, O/N at 37 °C; C18 desalting; LC-MS analysis.
B. Filter-assisted: Extract proteins from cell lysate with 4% SDS, DTT, and heat, sonicate; Remove SDS with urea washes in spin concentrator (3x15 min); Alkylation; Buffer exchange in spin concentrator (6x15 min); Trypsin digestion, 4 hrs at 37 °C; Peptide recovery by NaCl wash in spin concentrator (3x10 min); C18 desalting; LC-MS analysis.
C. AmBic/SDS: Extract proteins from cell lysate with 0.1% SDS and heat, sonicate; Reduction; Alkylation; Trypsin digestion, 4 hrs at 37 °C; C18 desalting; LC-MS analysis.
D. Urea: Extract proteins from cell lysate with 8 M urea, sonicate; Reduction; Alkylation; Trypsin digestion, 4 hrs at 37 °C; C18 desalting; LC-MS analysis.

Fig. 3.4 Comparison of standard sample preparation workflows. A summary of the optimized Pierce sample prepara-
tion protocol is compared to three other popular standard proteomic sample prep methods that were evaluated
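The statement that FASP requires many centrifugation steps can be made concrete with the spin-concentrator stages listed for workflow B in Fig. 3.4. The following is a minimal sketch in Python; the dictionary and function names are illustrative and not part of any published protocol, and only the repeat counts and durations shown in the figure are used (handling time between spins is not included).

```python
# Spin-concentrator stages of workflow B (filter-assisted) as listed in Fig. 3.4.
# Each value is (number of repeats, minutes per repeat).
FASP_SPIN_STAGES = {
    "Remove SDS with urea washes": (3, 15),
    "Buffer exchange": (6, 15),
    "Peptide recovery by NaCl wash": (3, 10),
}


def total_spins(stages):
    """Total number of spin-concentrator runs across all stages."""
    return sum(repeats for repeats, _ in stages.values())


def total_spin_minutes(stages):
    """Total centrifugation time in minutes across all stages."""
    return sum(repeats * minutes for repeats, minutes in stages.values())


if __name__ == "__main__":
    print("spins:", total_spins(FASP_SPIN_STAGES))
    print("minutes:", total_spin_minutes(FASP_SPIN_STAGES))
```

Under these figure-derived numbers the filter-assisted workflow involves twelve spins, whereas the other three workflows in Fig. 3.4 list no spin-concentrator stages.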

sample preparation methods (Fig. 3.4, [79]). We first used a step-wise approach to optimize a lysis protocol for high protein recovery from mammalian cell lysates. Protein solubilization with 0.1–4 % SDS yielded 5–40 % more protein than solubilization with 8 M urea [79]. Next, the completeness of disulfide reduction, the selectivity of alkylation at cysteine residues, and the digestion efficiency were assessed with single or double digestion (LysC-trypsin) routines. During this analysis, we discovered that improved chromatography resins and columns combined with fast, high-resolution instruments often reveal longer, more highly charged peptides with missed cleavages that are not detected on lower-resolution or slower mass spectrometers. By optimizing protocols to minimize non-selective alkylation or incompletely digested peptides, we could significantly improve the reproducibility and the number of peptide and protein identifications (Tables 3.1 and 3.2).

Reproducibility of digestion was assessed by the number of peptides and proteins identified, by the sequence coverage of a digestion indicator internal standard (Table 3.1), and by the targeted quantitative analysis of peptides from a digestion indicator internal standard. To do this, we spiked a non-mammalian protein into each lysate, processed triplicate samples according to the optimized protocol, and then quantified five peptides by targeted product ion monitoring on a Thermo Scientific Velos ion trap. The coefficients of variation (CV) were

Table 3.1 Reproducibility of LC-MS/MS results from three biological replicates

                                                      Sample 1    Sample 2    Sample 3
Number of Proteins                                    3382        3228        3376
Number of Unique Peptides                             16,333      15,939      17,048
Missed Cleavages (%)                                  7.8         8.8         8.6
Disulfide Bond Reduction (%)                          100         100         100
Cysteine Alkylation (%)                               100         100         100
Over-alkylation (%)                                   0.1         0.3         0.9
Digestion Indicator Protein Sequence Coverage (%)     62.50       62.93       65.09

HeLa cell lysate (200 μg) in 200 μL lysis buffer was spiked with 2 μg Digestion Indicator, processed with the Pierce Mass Spec Sample Prep Kit for Cultured Cells, and then analyzed by LC-MS/MS on a Q Exactive mass spectrometer
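As an illustration of how replicate reproducibility can be summarized, the sketch below computes the mean and percent coefficient of variation (CV) for several rows of Table 3.1. It is a minimal Python example using only the standard library; the function names are illustrative, and the CV here is taken over the three replicate values in the table, which is a different quantity from the targeted peptide CVs reported in the text.

```python
from statistics import mean, stdev

# Replicate values transcribed from Table 3.1 (Samples 1-3).
TABLE_3_1 = {
    "Number of Proteins": [3382, 3228, 3376],
    "Number of Unique Peptides": [16333, 15939, 17048],
    "Missed Cleavages (%)": [7.8, 8.8, 8.6],
    "Digestion Indicator Protein Sequence Coverage (%)": [62.50, 62.93, 65.09],
}


def percent_cv(values):
    """Coefficient of variation in percent, using the sample standard deviation."""
    return 100.0 * stdev(values) / mean(values)


def summarize(table):
    """Return {metric: (mean, percent CV)} for each row of replicate values."""
    return {name: (mean(vals), percent_cv(vals)) for name, vals in table.items()}


if __name__ == "__main__":
    for name, (avg, cv) in summarize(TABLE_3_1).items():
        print(f"{name}: mean = {avg:.1f}, CV = {cv:.1f} %")
```

With three replicates the sample standard deviation is only a rough estimate, so CVs computed this way are best read as an indication of spread rather than a precise figure.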

Table 3.2 Comparison of peptide and protein identification results between sample preparation methods

                                 Pierce           FASP             AmBic-SDS        Urea
Number of Proteins               3964 ± 22        3894 ± 13        3716 ± 79        3756 ± 91
Number of Unique Peptides        19,902 ± 190     18,738 ± 128     17,401 ± 587     19,398 ± 689
Missed Cleavages (%)             7.3 ± 0.1        13.9 ± 1.2       17.5 ± 1.3       9.8 ± 1.0
Disulfide Bond Reduction (%)     100              100              100              100
Methionine Oxidation (%)         3.0 ± 0.1        11.3 ± 1.5       2.6 ± 0.1        5.3 ± 0.5
Cysteine Alkylation (%)          99.8 ± 0.4       99.8 ± 0.3       100.0 ± 0.0      100.0 ± 0.0
Over-alkylation (%)              0.7 ± 0.2        0.1 ± 0.1        0.8 ± 0.6        2.4 ± 0.4

HeLa lysate samples (100 μg) were prepared according to each protocol and 500 ng was analyzed in triplicate by LC-FT MS/IT MS2 CID on an Orbitrap Elite mass spectrometer
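The missed-cleavage rows in Tables 3.1 and 3.2 depend on counting internal tryptic sites in identified peptides. The sketch below shows one common way to do this in Python. It assumes the usual rule that trypsin cleaves after K or R except when the next residue is P, and it assumes the percentage is the share of peptides with at least one missed cleavage; the chapter does not state how the reported values were defined, and the example sequences are made up for illustration.

```python
def count_missed_cleavages(peptide: str) -> int:
    """Count internal K/R residues not followed by P (assumed tryptic rule)."""
    missed = 0
    # The C-terminal residue is the expected cleavage site, so it is skipped.
    for i, residue in enumerate(peptide[:-1]):
        if residue in "KR" and peptide[i + 1] != "P":
            missed += 1
    return missed


def percent_with_missed_cleavage(peptides) -> float:
    """Percentage of peptides containing at least one missed cleavage."""
    if not peptides:
        return 0.0
    with_missed = sum(1 for p in peptides if count_missed_cleavages(p) > 0)
    return 100.0 * with_missed / len(peptides)


if __name__ == "__main__":
    # Hypothetical peptide sequences, not taken from the chapter's data.
    examples = ["LSSPATLNSR", "AKDEFGHIR", "VTKPEELR", "GGKKLMNR"]
    for pep in examples:
        print(pep, count_missed_cleavages(pep))
    print(percent_with_missed_cleavage(examples))
```

A LysC-specific version would drop the proline exception and consider only K, since the text notes that LysC can cleave at lysine residues followed by proline.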

4–15 % with a mean CV of 7 % [79]. This quantitative analysis further demonstrated the high reproducibility of sample processing using the optimized protocol.

To assess the scalability of this sample preparation protocol, 10 μg to 5 mg of HeLa cell lysate was processed according to the protocol. Analysis of equivalent volumes of peptide samples by LC-MS/MS resulted in identical chromatograms, demonstrating the scalability of this protocol over a 500-fold range of sample amounts (Fig. 3.5). This sample preparation protocol was also used for brain tissue and resulted in reproducible, high-quality peptide sample preparations, demonstrating the versatility of this method for different cell and tissue sample types (Fig. 3.6).

We found that the acetone precipitation protocol with optimized reduction, alkylation, and digestion reproducibly yielded high-quality peptide samples for LC-MS/MS analysis (Table 3.1). This method yields more protein lysate from cultured cells, is highly reproducible, is scalable, is simpler and faster than FASP, has no risk of carbamylation by urea, and results in higher protein identification rates than other popular “standard” sample preparation methods (Fig. 3.3 and Table 3.2).

3.10 Methods

3.10.1 Protein Extraction

Duplicate or triplicate HeLa S3 cell pellets, each containing 2 × 10^6 cells, were re-suspended in: (a) 0.2 mL of 0.1 M Tris–HCl, 4 % SDS, 0.1 M DTT, pH 7.6 (FASP method); (b) 0.05 M ammonium bicarbonate, 0.1 % SDS, pH 8.0 (AmBic/SDS method); (c) 0.1 M Tris–HCl, 8 M urea, pH 8.5 (urea method); or (d) Lysis Buffer from the Thermo Scientific Pierce Mass Spec Sample Prep Kit for Cultured Cells. Samples were incubated at 95 °C for 5 min, except the urea sample, which was incubated at RT for 30 min. Each cell suspension was sonicated on ice for 20 s. The cell debris was removed by

[Figure 3.5 chromatograms. Stacked base peak MS traces (relative abundance vs. time, 0–140 min) for 10 μg, 50 μg, 100 μg, 200 μg, and 5 mg HeLa lysate preparations]

Fig. 3.5 Scalability of new MS sample prep kit protocol. HeLa lysate samples (10 μg–5 mg) were prepared according to protocol. Samples (500 ng) were subjected to LC-MS/MS analysis on a Thermo Scientific Velos Pro ion trap mass spectrometer

centrifugation at 16,000 × g for 10 min and the supernatant was assayed for protein concentration using Thermo Scientific Pierce BCA Protein Assay or Thermo Scientific Pierce BCA Protein Assay Kit-Reducing Agent Compatible Assay.

3.10.2 Sample Preparation

HeLa cell lysate (100 μg) with digestion indicator (1 %, w/w) was reduced with 10 mM DTT for 45 min at 50 °C and alkylated with 50 mM iodoacetamide for 20 min in the dark at RT. Excess iodoacetamide and other contaminants were removed by acetone precipitation at −20 °C for 1 h. The protein was re-suspended in digestion buffer and digested with Lys-C (1:100, enzyme:substrate) for 2 h at 37 °C followed by digestion with trypsin (1:50, enzyme:substrate) overnight at 37 °C. Peptide samples were also prepared according to standard urea, FASP1, and AmBic/SDS workflows.

3.10.3 LC-MS and Data Analysis

A Thermo Scientific EASY-nLC 1000 HPLC system and Thermo Scientific EASYSpray Source with Thermo Scientific EasySpray Column (25 cm × 75 μm i.d., PepMap C18) were

[Figure 3.6 chromatograms. Base peak MS traces (relative abundance vs. time, 0–140 min) for mouse brain lysate, sample 1 and sample 2]

Fig. 3.6 Evaluation of sample preparation workflow with tissue samples. Mouse brain tissue (0.25 g) was homogenized with a tissue tearer and the proteins were extracted using the Thermo Scientific Pierce Mass Spec Sample Prep Kit for Cultured Cells. Tissue lysate (100 μg) was subjected to the sample preparation workflow, and 500 ng of sample was analyzed by LC-MS/MS on a Thermo Scientific Velos Pro ion trap mass spectrometer

used to separate peptides (500 ng) with a 30 % acetonitrile gradient in 0.1 % formic acid over 100–140 min at a flow rate of 300 nL/min. The samples were analyzed using a Thermo Scientific Velos Pro, a Q Exactive hybrid quadrupole-Orbitrap, or a Thermo Scientific Orbitrap Elite mass spectrometer. For data analysis, Thermo Scientific Proteome Discoverer software version 1.4 was used to search MS/MS spectra against the UniProt human database using the SEQUEST search engine with a 1 % false discovery rate. Static modifications included carbamidomethyl (C) and dynamic modifications included oxidation (M). The data set was screened by Preview software (Protein Metrics) for assessment of sample preparation quality. To assess the digestion efficiency, the Digestion Indicator protein sequence was included in the protein database. Five digestion indicator peptides were quantified manually with extracted ion chromatograms of the raw LC-MS/MS data or automatically with Thermo Scientific Pinpoint 1.2 software.

3.11 Conclusions

A variety of sample preparation methods have been described, along with a brief comparison of several in-solution and filter-assisted sample preparation methods. While each of these methods has advantages and disadvantages, all of these methods are capable of providing contaminant-free peptide samples compatible with mass spectrometric analysis. Unfortunately, none of these sample preparation methods is sufficiently simplified, standardized, or automated to enable rapid adoption and widespread use by novice or non-MS users.

In order to identify thousands of proteins from a complex lysate, it is essential to have robust sample preparation methods for protein extraction, reduction, alkylation, digestion, and clean-up. It is also essential to optimize LC and MS instrument performance, and to regularly (daily or weekly) assess instrument performance with a

standard, well-understood positive control sample. A variety of such standards are commercially available, including mixtures of isotopically labeled heavy peptides to assess chromatography, standard digests of common proteins or protein mixtures (e.g. bovine serum albumin and cytochrome C), as well as standard digests of complex proteomes from bacteria, yeast, or human cell lines from several MS reagent vendors. Regular use of standards is critical to ensuring that the instrumentation is working properly before precious samples are analyzed.

Ideally, it would be best to have a simpler, universal sample preparation method, as it would permit standardization of methods and would improve the reproducibility of results across laboratories and over time. For example, decades ago ion exchange-based DNA preparation kits rapidly supplanted the use of ultracentrifugation for plasmid DNA sample preparation. That simplification enabled broader adoption, higher throughput, and standardization of nucleic acid preparation methods. In contrast to DNA extraction from bacteria, the variety of protein sources, the diversity of proteins themselves, and protein biology in general are perhaps too complex to permit similar improvements that simplify, standardize, and automate protein sample preparation. Nevertheless, continued improvements in sample preparation robustness and ease of use are necessary for proteomics methods to be more widely adopted and to successfully advance protein MS beyond academic research or specialized MS labs and into individualized, bench-top, point-of-use or large clinical applications.

Supplementary Protocols

1. In-gel Digestion

Materials

• 25 mM ammonium bicarbonate: Dissolve 80 mg ammonium bicarbonate in 40 mL ultrapure water
• Destain solution: 25 mM ammonium bicarbonate/50 % acetonitrile (ACN). Mix 80 mg of ammonium bicarbonate with 20 mL of acetonitrile (ACN) and 20 mL of ultrapure water.
Note: if destaining glutaraldehyde-free silver stained gels, prepare separate 100 mM sodium thiosulfate and 30 mM potassium ferricyanide solutions, then make destaining solution by mixing them in a 1:1 (v:v) ratio. Protect ferricyanide solution from light.
• DTT stock solution: 10 mM DTT in 25 mM ammonium bicarbonate
• Iodoacetamide (IAM) stock solution: 20 mM in 25 mM ammonium bicarbonate (always prepare fresh, protect from light)
• 10 ng/μL Trypsin, sequencing-grade (use 25 mM ice cold ammonium bicarbonate to dilute stock trypsin solution, immediately before adding to gel pieces)

Equipment

• Gloves! (to minimize keratin contamination)
• Clean glass plate (large enough to place entire gel on and room for a working area, 8” × 8”)
• Gel-cutting devices: clean steel razor blades or surgical scalpel
• Low protein binding micro-centrifuge tubes (0.65 mL or 1.5 mL)
• Gel-loading pipette tips
• Autosampler vials with perforated caps
• SpeedVac Concentrator

Sample Processing

1. Place the gel on a clean glass plate. Cover the gel with just enough ultrapure water to prevent dehydration during the slicing process.
2. Cut the gel lane using a scalpel or razor blade (new, if possible).
3. Cut each of the excised bands into 1–2 mm cubes and transfer these cubes to a 0.65 mL low protein binding microcentrifuge tube.
4. Add ~100 μL (or enough to cover gel slices) of 25 mM ammonium bicarbonate/50 % ACN and vortex for 10 min.

5. Using a gel-loading pipette tip, extract the supernatant and discard. The procedure should be repeated until the stain is completely removed. Two additional washes should be sufficient for moderately intense bands.
6. Add 100 μL of 5 mM DTT and incubate for 30 min at 50 °C. Spin. Discard all the liquid afterwards.
7. Allow samples to cool to room temperature.
8. Add 100 μL of 20 mM iodoacetamide and incubate the gel pieces in the dark for 45 min at room temperature. Spin. Discard the liquid afterwards.
9. Wash the gel pieces with 100 μL of 25 mM ammonium bicarbonate, vortex 10 min, spin. Discard the liquid afterwards.
10. Wash the gel pieces with ~100 μL (or enough to cover) of 25 mM ammonium bicarbonate in 50 % ACN, vortex 10 min, spin. Discard the liquid.
11. Dehydrate the gel pieces in 100 % ACN for 10 min, spin and discard the liquid afterwards.
12. Dry the sample in a speed-vac for 10 min. The gel pieces are now ready for tryptic digestion.
13. Just before use, dilute or reconstitute trypsin with 50 mM ice cold ammonium bicarbonate to give a final concentration of 10 ng/μL.
14. Add trypsin solution to just cover the gel pieces.
15. Verify that the gel pieces are covered with trypsin solution.
16. Add 25 mM ammonium bicarbonate as needed to cover the gel pieces.
17. Spin briefly and incubate at 37 °C for 4 h to overnight.
18. Stop digestion by adding 20 μL of 5 % formic acid.
19. Vortex 15–20 min, spin, and transfer the digest solution (aqueous extraction) into a clean autosampler vial appropriate for LC-MS/MS.
20. To the gel pieces, add 30 μL (enough to cover) of 50 % ACN/1 % formic acid, vortex 15–20 min, spin, and transfer solution to the tube used above. Repeat this step once.
21. Concentrate peptide extracts using a speed-vac concentrator to a volume that is slightly larger than will be used for injection during LC-MS/MS analysis.
22. Store the vial with the extracted peptides at −20 °C if the samples will not be run the same day.

2. In-Solution Sample Preparation With Acetone Precipitation

Materials

• 100ABCS: 100 mM NH4HCO3 with 0.1 % sodium dodecyl sulfate, pH 8.0, 5 mL
• 50ABC: 50 mM NH4HCO3, pH 8.0, 5 mL
• 500 mM DTT in 50ABC, 0.5 mL
• 500 mM Iodoacetamide (IAM) in 50ABC, 0.5 mL (protect solution from light)
• 0.1 % acetic acid in water, 250 μL
• Lys-C Protease, MS Grade, 20 μg
• Trypsin Protease, MS Grade, 20 μg
• Pre-chilled 90 % acetone: Prepare 90 % acetone in ultrapure water (e.g., mix 45 mL of 100 % acetone with 5 mL of ultrapure water) and store at −20 °C.
• Pre-chilled 100 % acetone: Store 100 % acetone at −20 °C.
• Trifluoroacetic acid (TFA)
• Phosphate-buffered saline (PBS)

Equipment

• Low protein binding microcentrifuge tubes
• Microtip probe sonicator or nuclease (e.g., Thermo Scientific™ Pierce™ Universal Nuclease for Cell Lysis, Product No. 88700)
• Heating block
• SpeedVac Concentrator

Procedure

Cell Lysis

1. Culture cells to harvest at least 100 μg of protein. For best results, culture a minimum of 1 × 10^6 cells.

Note: Rinse cell pellets 2–3 times with 1X PBS to remove cell culture media. Pellet cells using low-speed centrifugation (i.e., < 1000 × g) to prevent premature cell lysis.
2. Lyse the cells by adding five cell-pellet volumes of 100ABCS (e.g. 100 μL of 100ABCS for a 20 μL cell pellet). Pipette sample up and down to break up the cell clumps and gently vortex sample to mix.
3. Incubate the lysate at 95 °C for 5 min.
4. Cool the lysate on ice for 5 min.
5. Sonicate lysate on ice using a microtip probe sonicator to reduce the sample viscosity by shearing DNA.
6. Centrifuge lysate at 14,000 × g for 10 min at 4 °C.
7. Carefully separate the supernatant and transfer into a new tube.
8. Determine the protein concentration of the supernatant using established methods such as the BCA Protein Assay Kit.

Reduction, Alkylation and Acetone Precipitation

Note: This procedure is optimized for 100 μg of cell lysate protein at 1 mg/mL concentration; however, the procedure may be used for 10–200 μg of cell lysate protein with an appropriate amount of reagents (DTT, IAM, Lys-C and trypsin). When using 10 μg of cell lysate, a protein concentration of 0.2–1 mg/mL may be used.
1. Add 100 μg of lysate protein to a polypropylene microcentrifuge tube and adjust the sample volume to 100 μL using 100ABCS to a final concentration of 1 mg/mL.
2. Add 2.1 μL of DTT solution to the sample (final DTT concentration is ~10 mM). Mix and incubate at 50 °C for 45 min. Discard any unused DTT solution.
3. Cool the sample to room temperature for 10 min.
4. Add 11.5 μL of IAM solution to the sample (final IAM concentration is ~50 mM). Mix and incubate at room temperature for 20 min protected from light. Discard any unused IAM solution.
5. After alkylation with IAM, immediately add 460 μL (4 volumes) of pre-chilled (−20 °C) 100 % acetone to sample. Vortex tube and incubate at −20 °C for 1 h to overnight to precipitate proteins.
6. Centrifuge at 14,000 × g for 10 min at 4 °C. Carefully remove acetone without dislodging the protein pellet.
7. Add 50 μL of pre-chilled (−20 °C) 90 % acetone, vortex to mix and centrifuge at 14,000 × g for 5 min at 4 °C.
8. Carefully remove acetone without dislodging the protein pellet. Allow the pellet to dry for 2–3 min and immediately proceed to Protein Digestion.
Note: Do not dry the acetone-precipitated protein pellet for more than 2–3 min; excess drying will make the pellet difficult to re-suspend in the Digestion Buffer.

Enzymatic Protein Digestion

9. Add 100 μL of 50ABC to the acetone-precipitated protein pellet and resuspend by gently pipetting up and down to break the pellet.
Note: An acetone-precipitated protein pellet may not completely dissolve; however, after proteolysis at 37 °C, all the protein will be solubilized.
10. Immediately before use, add 40 μL of ultrapure water to the bottom of the vial containing lyophilized Lys-C and incubate at room temperature for 5 min. Gently pipette up and down to dissolve. Store any remaining 0.5 μg/μL Lys-C solution in single-use volumes at −80 °C.
11. Add 2 μL of Lys-C (1 μg, enzyme-to-substrate ratio = 1:100) to the sample. Mix and incubate at 37 °C for 2 h.
12. Immediately before use, add 40 μL of 0.1 % acetic acid to the bottom of the vial containing trypsin and incubate at room temperature for 5 min. Gently pipette up and down to dissolve. Store

any remaining 0.5 μg/μL trypsin solution • Bench-top centrifuge


in single-use volumes at 80  C for long- • Temperature-controlled incubator or heat
term storage. block at 50  C
13. Add 4 μL of trypsin (2 μg, enzyme-to- • Thermo-mixer at 37  C
substrate ratio ¼ 1:50) to the sample. • SpeedVac Concentrator
Mix and incubate overnight at 37  C.
14. Freeze samples at 80  C to stop diges-
tion. (Optional: stop digestion by Procedure
acidifying with TFA)
15. Speed vac sample to 1–5 μL.
1. Combine up to 30 μL of a protein extract
16. Resuspend the sample in an appropriate
(0.2–400 μg) with 200 μL of UABC in the
buffer (e.g., 0.1 % TFA) for LC-MS
filter unit and centrifuge at 14,000  g for
analysis.
15 min.
Note: Proteolytic digests prepared using
2. Add 200 μL of UABC to the filter unit and
this protocol are directly compatible with
centrifuge at 14,000  g for 15 min.
LC-MS analysis. Clean-up of samples
3. Discard the flow-through from the
with C18 spin tips or columns is optional.
collection tube.
4. Add 100 μL DTT solution and mix at
600 rpm in a thermo-mixer for 1 min and
3. Filter-assisted Sample Preparation incubate at 50  C without mixing for 45 min.
(FASP) 5. Centrifuge the filter units at 14,000  g for
10 min.
6. Add 100 μL IAM solution, cover with foil,
Materials mix by gentle vortexing for 1 min, and incu-
bate in dark at room temperature without
• UABC: 8 M urea in 100 mM NH4HCO3 mixing for 30 min.
(ABC) pH 8.0. Prepare fresh, 1 mL per 7. Centrifuge the filter units at 14,000  g for
sample. 10 min.
• IAM solution: 55 mM iodoacetamide in 8. Add 100 μL of UABC to the filter unit and
UABC. Prepare 100 μL per sample. centrifuge at 14,000  g for 15 min. Repeat
• DTT solution: 50 mM DTT in UABC. Prepare this step one more time.
100 μL per sample 9. Add 100 μL of 50ABC to the filter unit and
• Trypsin: MS grade Modified Trypsin, centrifuge at 14,000  g for 10 min. Repeat
0.5 μg/μL in 50 mM NH4HCO3 in water this step one more time.
• 50ABC: 50 mM NH4HCO3 in water. Prepare 10. Transfer the filter units to new collection
0.5 mL per sample tubes.
• 25ABC: 25 mM NH4HCO3 in water. Prepare 11. Add 100 μL of 50ABC with trypsin (enzyme
0.25 mL per sample to protein ratio 1:50) and mix at 600 rpm in
Note: UABC and IAM solutions must be thermo-mixer at 37  C for 4–18 h.
freshly prepared and used within a day. IAM 12. Centrifuge the filter units at 14,000  g for
is light sensitive, so protect from light 10 min.
13. Add 50 μL of 25ABC and centrifuge the
Equipment filter units at 14,000  g for 10 min.
14. Add 50 μL of 10ABC and centrifuge the
• Low protein binding tubes filter units at 14,000  g for 10 min.
• 10 or 30 kDa cut off filter (Vivacon 500, cat # 15. Concentrate down to ~5 μL and add 0.1 %
VN01H02) FA to a final volume of ~20–25 μL.
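The enzyme-to-protein ratios and per-sample reagent volumes quoted in the protocols above reduce to simple arithmetic. The following is a minimal Python sketch of that arithmetic, not part of the published protocol: the function names, the 10 % pipetting excess and the dictionary layout are illustrative assumptions, while the 0.5 μg/μL trypsin stock, the 1:50 ratio and the per-sample volumes are taken from the text.

```python
# Per-sample reagent volumes in mL, copied from the FASP Materials list.
PER_SAMPLE_ML = {
    "UABC": 1.0,
    "IAM solution": 0.1,
    "DTT solution": 0.1,
    "50ABC": 0.5,
    "25ABC": 0.25,
}


def enzyme_volume_ul(protein_ug, ratio=50.0, stock_ug_per_ul=0.5):
    """Return the volume (uL) of protease stock for a 1:ratio (w/w) digest.

    protein_ug      -- amount of protein in the sample, in micrograms
    ratio           -- the N in an enzyme-to-protein ratio of 1:N
    stock_ug_per_ul -- protease stock concentration (0.5 ug/uL in the text)
    """
    if protein_ug <= 0 or ratio <= 0 or stock_ug_per_ul <= 0:
        raise ValueError("protein_ug, ratio and stock_ug_per_ul must be positive")
    enzyme_ug = protein_ug / ratio          # mass of enzyme required
    return enzyme_ug / stock_ug_per_ul      # convert mass to stock volume


def reagent_volumes_ml(n_samples, excess=0.10):
    """Scale the per-sample volumes to n_samples.

    excess is a hypothetical pipetting allowance (10 % by default); it is
    an assumption of this sketch, not a value given in the protocol.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    factor = n_samples * (1.0 + excess)
    return {name: round(vol * factor, 3) for name, vol in PER_SAMPLE_ML.items()}


if __name__ == "__main__":
    print(enzyme_volume_ul(100.0))            # trypsin at 1:50
    print(enzyme_volume_ul(100.0, ratio=100)) # Lys-C at 1:100
    print(reagent_volumes_ml(8))
```

For a 100 μg sample the first call gives 4 μL of trypsin stock, which agrees with the "4 μL of trypsin (2 μg, 1:50)" quoted for the in-solution digest; the second gives 2 μL for a 1:100 digest, assuming the Lys-C stock is also 0.5 μg/μL as implied by "2 μL of Lys-C (1 μg)".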
60 J.C. Rogers and R.D. Bomgarden

References 16. McLafferty FW, Breuker K, Jin M, Han X, Infusini G,


Jiang H et al (2007) Top-down MS, a powerful com-
plement to the high capabilities of proteolysis proteo-
1. Aebersold R, Mann M (2003) Mass spectrometry-
mics. FEBS J 274(24):6256–6268
based proteomics. Nature 422(6928):198–207
17. Fornelli L, Ayoub D, Aizikov K, Beck A, Tsybin YO
2. Han X, Aslanian A, Yates JR 3rd (2008) Mass spec-
(2014) Middle-down analysis of monoclonal
trometry for proteomics. Curr Opin Chem Biol 12
antibodies with electron transfer dissociation orbitrap
(5):483–490
fourier transform mass spectrometry. Anal Chem 86
3. Washam CL, Byrum SD, Leitzel K, Ali SM, Tackett
(6):3005–3012
AJ, Gaddy D et al (2013) Identification of PTHrP
18. Wu C, Tran JC, Zamdborg L, Durbin KR, Li M, Ahlf
(12–48) as a plasma biomarker associated with breast
DR et al (2012) A protease for ‘middle-down’ proteo-
cancer bone metastasis. Cancer Epidemiol Biomark
mics. Nat Methods 9(8):822–824
Prev Publ Am Assoc Cancer Res Cosponsored Am
19. Wu CC, MacCoss MJ (2002) Shotgun proteomics:
Soc Prevent Oncol 22(5):972–983
tools for the analysis of complex biological systems.
4. Huttlin EL, Ting L, Bruckner RJ, Gebreab F, Gygi
Curr Opin Mol Ther 4(3):242–250
MP, Szpyt J et al (2015) The BioPlex network: a
20. Seddon AM, Curnow P, Booth PJ (2004) Membrane
systematic exploration of the human interactome.
proteins, lipids and detergents: not just a soap opera.
Cell 162(2):425–440
Biochim Biophys Acta 1666(1–2):105–117
5. Meisinger C, Sickmann A, Pfanner N (2008) The
21. Feist P, Hummon AB (2015) Proteomic challenges:
mitochondrial proteome: from inventory to function.
sample preparation techniques for microgram-
Cell 134(1):22–24
quantity protein analysis from biological samples.
6. Bryson BD, White FM (2012) Signaling for death:
Int J Mol Sci 16(2):3537–3563
tyrosine phosphorylation in the response to glucose
22. Keller BO, Sui J, Young AB, Whittal RM (2008)
deprivation. Mol Syst Biol 8:591
Interferences and contaminants encountered in mod-
7. Wolters DA, Washburn MP, Yates JR 3rd (2001) An
ern mass spectrometry. Anal Chim Acta 627(1):71–81
automated multidimensional protein identification
23. Loo RR, Dales N, Andrews PC (1996) The effect of
technology for shotgun proteomics. Anal Chem 73
detergents on proteins analyzed by electrospray ioni-
(23):5683–5690
zation. Methods Mol Biol (Clifton, NJ) 61:141–160
8. Wang Y, Yang F, Gritsenko MA, Wang Y, Clauss T,
24. Waas M, Bhattacharya S, Chuppa S, Wu X, Jensen
Liu T et al (2011) Reversed-phase chromatography
DR, Omasits U et al (2014) Combine and conquer:
with multiple fraction concatenation strategy for pro-
surfactants, solvents, and chaotropes for robust mass
teome profiling of human MCF10A cells. Proteomics
spectrometry based analyses of membrane proteins.
11(10):2019–2026
Anal Chem 86(3):1551–1559
9. Picotti P, Aebersold R (2012) Selected reaction
25. Anderson NL, Anderson NG (2002) The human
monitoring-based proteomics: workflows, potential,
plasma proteome: history, character, and diagnostic
pitfalls and future directions. Nat Methods 9
prospects. Mol Cell Proteomics MCP 1(11):845–867
(6):555–566
26. Polaskova V, Kapur A, Khan A, Molloy MP, Baker
10. Peterson AC, Russell JD, Bailey DJ, Westphall MS,
MS (2010) High-abundance protein depletion: com-
Coon JJ (2012) Parallel reaction monitoring for high
parison of methods for human plasma biomarker dis-
resolution and high mass accuracy quantitative,
covery. Electrophoresis 31(3):471–482
targeted proteomics. Mol Cell Proteomics MCP 11
27. Huber LA, Pfaller K, Vietor I (2003) Organelle prote-
(11):1475–1488
omics: implications for subcellular fractionation in
11. Stastna M, Van Eyk JE (2012) Analysis of protein
proteomics. Circ Res 92(9):962–968
isoforms: can we do it better? Proteomics 12
28. Dunkley TP, Watson R, Griffin JL, Dupree P, Lilley
(19–20):2937–2948
KS (2004) Localization of organelle proteins by iso-
12. Savitski MM, Reinhard FB, Franken H, Werner T,
tope tagging (LOPIT). Mol Cell Proteomics MCP 3
Savitski MF, Eberhard D et al (2014) Tracking cancer
(11):1128–1134
drugs in living cells by thermal profiling of the prote-
29. Ramsby ML, Makowski GS, Khairallah EA (1994)
ome. Science (New York, NY) 346(6205):1255784
Differential detergent fractionation of isolated
13. Weekes MP, Tomasec P, Huttlin EL, Fielding CA,
hepatocytes: biochemical, immunochemical and
Nusinow D, Stanton RJ et al (2014) Quantitative
two-dimensional gel electrophoresis characterization
temporal viromics: an approach to investigate host-
of cytoskeletal and noncytoskeletal compartments.
pathogen interaction. Cell 157(6):1460–1472
Electrophoresis 15(2):265–277
14. Klein T, Fung SY, Renner F, Blank MA, Dufour A,
30. Gu B, Zhang J, Wang W, Mo L, Zhou Y, Chen L
Kang S et al (2015) The paracaspase MALT1 cleaves
et al (2010) Global expression of cell surface proteins
HOIL1 reducing linear ubiquitination by LUBAC to
in embryonic stem cells. PLoS One 5(12):e15795
dampen lymphocyte NF-kappaB signalling. Nat
31. Weekes MP, Antrobus R, Lill JR, Duncan LM, Hor S,
Commun 6:8777
Lehner PJ (2010) Comparative analysis of techniques
15. Catherman AD, Skinner OS, Kelleher NL (2014) Top
to purify plasma membrane proteins. J Biomol Tech
down proteomics: facts and perspectives. Biochem
JBT 21(3):108–115
Biophys Res Commun 445(4):683–693
3 Sample Preparation for Mass Spectrometry-Based Proteomics; from Proteomes to Peptides 61

32. Yang L, Nyalwidhe JO, Guo S, Drake RR, Semmes 46. Biringer RG, Amato H, Harrington MG, Fonteh AN,
OJ (2011) Targeted identification of metastasis- Riggins JN, Huhmer AF (2006) Enhanced sequence
associated cell-surface sialoglycoproteins in prostate coverage of proteins in human cerebrospinal fluid
cancer. Mol Cell Proteomics MCP 10(6): using multiple enzymatic digestion and linear ion
M110.007294 trap LC-MS/MS. Brief Funct Genomic Proteomic 5
33. Deeb SJ, Cox J, Schmidt-Supprian M, Mann M (2014) (2):144–153
N-linked glycosylation enrichment for in-depth cell 47. Choudhary G, Wu SL, Shieh P, Hancock WS (2003)
surface proteomics of diffuse large B-cell lymphoma Multiple enzymatic digestion for enhanced sequence
subtypes. Mol Cell Proteomics MCP 13(1):240–251 coverage of proteins in complex proteomic mixtures
34. Nilsson CL, Dillon R, Devakumar A, Shi SD, using capillary LC with ion trap MS/MS. J Proteome
Greig M, Rogers JC et al (2010) Quantitative Res 2(1):59–67
phosphoproteomic analysis of the STAT3/IL-6/ 48. Giansanti P, Aye TT, van den Toorn H, Peng M, van
HIF1alpha signaling network: an initial study in Breukelen B, Heck AJ (2015) An augmented
GSC11 glioblastoma stem cells. J Proteome Res 9 multiple-protease-based human phosphopeptide
(1):430–443 atlas. Cell Rep 11(11):1834–1843
35. Patricelli MP, Szardenings AK, Liyanage M, 49. Leon IR, Schwammle V, Jensen ON, Sprenger RR
Nomanbhoy TK, Wu M, Weissig H et al (2007) Func- (2013) Quantitative assessment of in-solution diges-
tional interrogation of the kinome using nucleotide tion efficiency identifies optimal protocols for unbi-
acyl phosphates. Biochemistry 46(2):350–358 ased protein analysis. Mol Cell Proteomics: MCP 12
36. Lemeer S, Zorgiebel C, Ruprecht B, Kohl K, Kuster B (10):2992–3005
(2013) Comparing immobilized kinase inhibitors and 50. Chen EI, Cociorva D, Norris JL, Yates JR 3rd (2007)
covalent ATP probes for proteomic profiling of kinase Optimization of mass spectrometry-compatible
expression and drug selectivity. J Proteome Res 12 surfactants for shotgun proteomics. J Proteome Res
(4):1723–1731 6(7):2529–2538
37. ten Have S, Boulon S, Ahmad Y, Lamond AI (2011) 51. Kollipara L, Zahedi RP (2013) Protein carbamylation:
Mass spectrometry-based immuno-precipitation pro- in vivo modification or in vitro artefact? Proteomics
teomics – the user’s guide. Proteomics 11 13(6):941–944
(6):1153–1159 52. Proc JL, Kuzyk MA, Hardie DB, Yang J, Smith DS,
38. Evans DR, Romero JK, Westoby M (2009) Concen- Jackson AM et al (2010) A quantitative study of the
tration of proteins and removal of solutes. Methods effects of chaotropic agents, surfactants, and solvents
Enzymol 463:97–120 on the digestion efficiency of human plasma proteins
39. Gundry RL, White MY, Murray CI, Kane LA, Fu Q, by trypsin. J Proteome Res 9(10):5422–5437
Stanley BA et al (2009) Preparation of proteins and 53. Laemmli UK (1970) Cleavage of structural proteins
peptides for mass spectrometry analysis in a bottom- during the assembly of the head of bacteriophage T4.
up proteomics workflow. In: Frederick MA et al (eds) Nature 227(5259):680–685
Current protocols in molecular biology. Chapter 10: 54. Rabilloud T, Chevallet M, Luche S, Lelong C (2010)
Unit10.25 Two-dimensional gel electrophoresis in proteomics:
40. Olsen JV, Ong SE, Mann M (2004) Trypsin cleaves past, present and future. J Proteome 73
exclusively C-terminal to arginine and lysine (11):2064–2077
residues. Mol Cell Proteomics MCP 3(6):608–614 55. Schirle M, Heurtier MA, Kuster B (2003) Profiling
41. Benore-Parsons M, Seidah NG, Wennogle LP (1989) core proteomes of human cell lines by
Substrate phosphorylation can inhibit proteolysis by one-dimensional PAGE and liquid chromatography-
trypsin-like enzymes. Arch Biochem Biophys 272 tandem mass spectrometry. Mol Cell Proteomics
(2):274–280 MCP 2(12):1297–1305
42. Swaney DL, Wenger CD, Coon JJ (2010) Value of 56. Sechi S, Chait BT (1998) Modification of cysteine
using multiple proteases for large-scale mass residues by alkylation. A tool in peptide mapping
spectrometry-based proteomics. J Proteome Res 9 and protein identification. Anal Chem 70
(3):1323–1329 (24):5150–5158
43. Wu CC, MacCoss MJ, Howell KE, Yates JR 3rd 57. Nielsen ML, Vermeulen M, Bonaldi T, Cox J,
(2003) A method for the comprehensive proteomic Moroder L, Mann M (2008) Iodoacetamide-induced
analysis of membrane proteins. Nat Biotechnol 21 artifact mimics ubiquitination in mass spectrometry.
(5):532–538 Nat Methods 5(6):459–460
44. Niessen S, McLeod I, Yates JR 3rd (2006) Direct 58. Jiang X, Shamshurin D, Spicer V, Krokhin OV (2013)
enzymatic digestion of protein complexes for MS The effect of various S-alkylating agents on the chro-
analysis. CSH Protoc 2006(7) matographic behavior of cysteine-containing peptides
45. Bian Y, Ye M, Song C, Cheng K, Wang C, Wei X in reversed-phase chromatography. J Chromatogr B
et al (2012) Improve the coverage for the analysis of Anal Technol Biomed Life Sci 915–916:57–63
phosphoproteome of HeLa cells by a tandem digestion 59. Ruhaak LR, Zauner G, Huhn C, Bruggink C, Deelder
approach. J Proteome Res 11(5):2828–2837 AM, Wuhrer M (2010) Glycan labeling strategies and
62 J.C. Rogers and R.D. Bomgarden

their use in identification and quantification. Anal 71. Bereman MS, Egertson JD, MacCoss MJ (2011)
Bioanal Chem 397(8):3457–3481 Comparison between procedures using SDS for shot-
60. Arnold U, Ulbrich-Hofmann R (1999) Quantitative gun proteomic analyses of complex samples. Proteo-
protein precipitation from guanidine hydrochloride- mics 11(14):2931–2935
containing solutions by sodium deoxycholate/ 72. Glatter T, Ludwig C, Ahrne E, Aebersold R, Heck AJ,
trichloroacetic acid. Anal Biochem 271(2):197–199 Schmidt A (2012) Large-scale quantitative assess-
61. Bensadoun A, Weinstein D (1976) Assay of proteins ment of different in-solution protein digestion
in the presence of interfering materials. Anal Biochem protocols reveals superior cleavage efficiency of tan-
70(1):241–250 dem Lys-C/trypsin proteolysis over trypsin digestion.
62. Buxton TB, Crockett JK, Moore WL 3rd, Moore WL J Proteome Res 11(11):5145–5156
Jr, Rissing JP (1979) Protein precipitation by acetone 73. Wisniewski JR, Zougman A, Mann M (2009) Combi-
for the analysis of polyethylene glycol in intestinal nation of FASP and StageTip-based fractionation
perfusion fluid. Gastroenterology 76(4):820–824 allows in-depth analysis of the hippocampal mem-
63. Manza LL, Stamer SL, Ham AJ, Codreanu SG, brane proteome. J Proteome Res 8(12):5674–5678
Liebler DC (2005) Sample preparation and digestion 74. Wisniewski JR, Zougman A, Nagaraj N, Mann M
for proteomic analyses using spin filters. Proteomics 5 (2009) Universal sample preparation method for pro-
(7):1742–1745 teome analysis. Nat Methods 6(5):359–362
64. Peterson GL (1977) A simplification of the protein 75. Wisniewski JR, Nagaraj N, Zougman A, Gnad F,
assay method of Lowry et al. which is more generally Mann M (2010) Brain phosphoproteome obtained by
applicable. Anal Biochem 83(2):346–356 a FASP-based method reveals plasma membrane pro-
65. Wessel D, Flugge UI (1984) A method for the quanti- tein topology. J Proteome Res 9(6):3280–3289
tative recovery of protein in dilute solution in the 76. Zielinska DF, Gnad F, Jedrusik-Bode M, Wisniewski
presence of detergents and lipids. Anal Biochem 138 JR, Mann M (2009) Caenorhabditis elegans has a
(1):141–143 phosphoproteome atypical for metazoans that is
66. Barritault D, Expert-Bezancon A, Guerin MF, Hayes enriched in developmental and sex determination
D (1976) The use of acetone precipitation in the proteins. J Proteome Res 8(8):4039–4049
isolation of ribosomal proteins. Eur J Biochem/ 77. Zielinska DF, Gnad F, Wisniewski JR, Mann M
FEBS 63(1):131–135 (2010) Precision mapping of an in vivo
67. Crowell AM, Wall MJ, Doucette AA (2013) N-glycoproteome reveals rigid topological and
Maximizing recovery of water-soluble proteins sequence constraints. Cell 141(5):897–907
through acetone precipitation. Anal Chim Acta 78. Erde J, Loo RR, Loo JA (2014) Enhanced FASP
796:48–54 (eFASP) to increase proteome coverage and sample
68. Yeung YG, Nieves E, Angeletti RH, Stanley ER recovery for quantitative proteomic experiments. J
(2008) Removal of detergents from protein digests Proteome Res 13(4):1885–1895
for mass spectrometry analysis. Anal Biochem 382 79. Antharavally B, Jiang X, Cunningham R,
(2):135–137 Bomgarden R, Zhang Y, Viner R et al (2013) Versa-
69. Yeung YG, Stanley ER (2010) Rapid detergent tile Mass Spectrometry Sample Preparation Procedure
removal from peptide samples with ethyl acetate for for Complex Protein Samples [cited 2015 November
mass spectrometry analysis. Current protocols in pro- 8]. Available from: https://www.thermofisher.com/us/
tein science/editorial board, John EC et al. Chapter 16: en/home/life-science/protein-biology/protein-biol
Unit 16.2 ogy-learning-center/protein-biology-resource-library/
70. Antharavally BS, Mallia KA, Rosenblatt MM, protein-biology-application-notes/mass-spectrome
Salunkhe AM, Rogers JC, Haney P et al (2011) Effi- try-sample-preparation-procedure-protein-samples.
cient removal of detergents from proteins and html
peptides in a spin column format. Anal Biochem
416(1):39–44
4 Plant Structure and Specificity – Challenges and Sample Preparation Considerations for Proteomics

Sophie Alvarez and Michael J. Naldrett

Abstract
Plants are considered simply structured organisms when compared with humans and other vertebrates: the number of organs and tissue types is very limited. The complexity comes instead from the high number and variety of plant species that exist, with >300,000 compared to 5000 in mammals. Proteomics, defined as the large-scale study of the proteins present in a tissue, cell or cellular compartment at a defined time point, was introduced in 1994. However, the first publications in the plant proteomics field only appeared at the beginning of the twenty-first century. Since these early years, the number of proteomic studies in plants has increased only linearly. The main reason stems from the challenges specific to studying plants: protein extraction from cells with variously strengthened cellulosic cell walls, and a high abundance of interfering compounds, such as phenolic compounds and pigments located in plastids throughout the plant. Indeed, the heterogeneity between different organs and tissue types, between species and between developmental stages, requires the use of optimized plant protein extraction methods as described in this section. The second bottleneck of plant proteomics, which will not be discussed or reviewed here, is the lack of genomic information. Without sequence databases for the >300,000 species, proteomic studies of plants, especially of those that are not considered economically relevant, are impossible to accomplish.

Keywords
Plant proteomics • Plant cell lysis • Plant secretome • Plant organs • Plant
meristem and suspension culture cells • Green algae and plastids • Plant
protein extraction

S. Alvarez (*) • M.J. Naldrett
Center for Biotechnology, University of Nebraska–Lincoln, Beadle Center, 1901 Vine St, Lincoln, NE 68588, USA
e-mail: salvarez@unl.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_4

Plant proteomics has been and is still an interesting approach to studying the content and abundance of proteins (protein expression) in response to changes of the environment, such as drought, cold or a challenge by pathogens. Understanding how plants grow and interact with the environment is a pre-requisite for crop improvement and increased productivity. The study of plant growth is of course not seen to be as important as studying the more immediate ailments of our human condition. However, as we move further into the twenty-first century and the world’s population now exceeds seven billion, the question of how to produce enough food and feedstock to sustain this growth is an ever more significant challenge, which could endanger the human species if this issue is not given the effort that it deserves. In addition, many plants have been used for thousands of years for their healing properties in traditional medicine. More recently, since the 1980s, plants have been bioengineered for drug production by pharmaceutical companies as they often offer faster and cheaper production alternatives. The most pertinent recent example is the production of ZMapp in tobacco (Nicotiana benthamiana) for the treatment of patients suffering from the Ebola virus disease [82].

Proteomic studies have focused on many different plant species, with one of the first and most studied being Arabidopsis thaliana, a model system in the plant field. Its popularity amongst plant biologists comes from its small genome of 135 Mbp, which was fully sequenced in 2000 (The Arabidopsis Genome Initiative [1]), the known 27,029 protein coding genes [98], and its rapid life cycle under controlled conditions (6 weeks from germination to mature seed). Although the study of Arabidopsis affords a look into most of the basic biological processes in plants, many of these plant mechanisms and their molecular regulation are not always transferable to other plants, especially to crops, because of genomic and physiological differences. Among the crops, the most studied are maize, wheat, and soybean, because of their economic relevance in food production, and more recently for their biofuel possibilities.

Below, we describe some specific structures found in plants and highlight the challenges for proteomic studies. The specific requirements and procedures for protein sample preparation of these different types of plant material are also described to help scientists overcome the unique challenges of working with plant samples.

4.1 Plant Cell Wall and Secretome

The cell wall is a structure found in eukaryotic plant and algal cells. It is composed of two layers called the primary and the secondary cell wall, which surround the plasma membrane (Fig. 4.1a). The composition of the cell wall, which varies between plant families and tissues, consists of a very complex network of cellulose microfibrils embedded in a polysaccharide matrix composed of pectin, hemicellulose and glycoproteins for the primary cell wall and a rigid skeleton of cellulose, hemicellulose and lignin for the secondary cell wall. The cell wall is not surrounded by any additional physical barrier, but the space between two cell walls from adjacent cells, called the apoplast, allows small molecules and proteins to circulate between cells, through pores in the cell wall made of proteins, called plasmodesmata (Fig. 4.1a). While the cell wall has the essential roles of maintaining the turgor pressure of plant cells by providing rigidity and of acting as a physical barrier to avoid pathogen invasion, the cell wall proteins not only contribute to the structural role of the cell wall, but are also involved in cell-cell and cell-pathogen communication, contributing to plant development and growth and also to the plant’s response to environmental changes and its adaptation. However, the heterogeneity of the cell wall and the complex network of the polysaccharides make protein extraction from the cell wall challenging. The use of mechanical tissue disruption to release proteins from plant cells is a prerequisite to the study of the total protein content of the cell. It is not, however, always suitable for studying cell wall proteins (CWP). Here, different isolation and elution methods are necessary, depending on the type of protein that is to be studied. Three different types of cell wall proteins are defined, depending on the strength of their interaction with the cell wall. Proteins with no or little interaction with cell wall components (CWP1) and proteins
weakly interacting with the matrix by van der Waals’ interactions, hydrogen bonds, and hydrophobic or ionic interactions (CWP2) are extractable with salts, using non-disruptive methods to avoid damaging the plasma membrane. These proteins, found in the apoplast, constitute the apoplastic proteome or plant secretome (i.e. secreted proteins). These have an essential function in plant-pathogen interactions and adaptation to stress. The third category of cell wall proteins are those proteins strongly bound to the cell wall (CWP3) which require disruptive methods for extraction. These methods have been optimized to reduce the contamination from cell wall polysaccharides, phenolic compounds and the intracellular proteins.

Fig. 4.1 (a) Plant cell wall diagram; (b) Vacuum infiltration and centrifugation workflow for cell wall protein isolation

Table 4.1 Salt solutions to isolate CWP1, CWP2 or CWP3 according to the sample type and species
Salt solutions | Cell wall group | Sample type and species | References
10 mM sodium phosphate, pH 6.0 | CWP1 | Root of onion (Allium cepa L.) | [28]
10 mM phosphate buffer, pH 6.0, containing 0.2 M KCl | CWP1 | Leaf blades of maize (Zea mays) | [31]
0.01 M Mes buffer pH 5.5, containing 0.2 M KCl | CWP1 | Leaf blades of tall fescue (Festuca arundinacea); root of maize | [64]; [123]
50 mM of LaCl3 | CWP1 | Pea (Pisum sativum L.) internodes | [70]
0.3 M mannitol | CWP1 | Arabidopsis rosettes | [15]
0.6 M NaCl | CWP1 and CWP2 | Potato (Solanum tuberosum L.) tubers | [74]
1 M NaCl | CWP1 and CWP2 | Maize endosperm | [24, 94]
50 mM phosphate buffer, 200 mM NaCl, pH 7.5 | CWP1 and CWP2 | Tobacco (Nicotiana tabacum) leaves | [30]
25 mM Tris–HCl pH 7.4, 50 mM EDTA (ethylenediaminetetraacetic acid), 150 mM MgCl2 | CWP1 and CWP2 | Arabidopsis seedlings | [19]
0.1 M potassium phosphate (pH 7.8) or 66 mM potassium phosphate (pH 7.6) containing 10 mM MgCl2 and 14 mM 2-ME | CWP1 and CWP2 | Leaves from Arabidopsis, wheat (Triticum aestivum) and rice (Oryza sativa) | [41]
20 mM ascorbic acid and 20 mM CaCl2 (pH 3.0) | CWP1 and CWP2 | Winter rye (Secale cereale) leaves | [43]
1 M NaCl followed by 0.4 M CaCl2 | CWP1 and CWP2 | Protoplasts from flax (Linum usitatissimum) hypocotyls | [85]
200 mM CaCl2, 50 mM Na-acetate, pH 5.5, followed by 3 M LiCl, 50 mM Na-acetate, pH 5.5 | CWP3 | Medicago sativa stem | [108, 109]
200 mM CaCl2 | CWP3 | Arabidopsis suspension cells | [12, 26]
5 mM acetate buffer, pH 4.6, 0.2 M CaCl2 and 10 μL protease inhibitor cocktail, followed by 5 mM acetate buffer, pH 4.6, 2 M LiCl and 10 μL protease inhibitor cocktail, followed by 62.5 mM Tris, 4 % SDS, 50 mM DTT, pH 6.8 (HCl) | CWP3 | Arabidopsis hypocotyls | [34]

Suspension culture cells (SCCs) are typically the system of choice to study the secretome and cell wall proteins. To isolate the apoplastic cell wall proteins (CWP1 and 2), non-disruptive methods are essential to preserve intact plasma membranes and avoid contamination from intracellular components. Two approaches can be used depending on the sample type: recovery of the SCCs’ medium and washing of cells with liquid medium and low salt solutions [14, 20]; or using vacuum infiltration-centrifugation, which consists of infusing a solution of salts into the intercellular space by use of reduced pressure (−75 to −50 kPa) for a short time (5–15 min) and harvesting the apoplastic fluid containing the proteins by gentle centrifugation (Fig. 4.1b). To dissolve the loosely bound CWP1, a low ionic strength (KCl) solution is used, while salts such as CaCl2, LiCl and NaCl are used to isolate the weakly bound CWP2 (Table 4.1). The first method, applying only to suspension cells, cannot be applied to actual plant tissues, which are more relevant for understanding responses to environmental interactions. Here the second method is more suitable. Although these methods are non-disruptive, the salt concentration has to be very carefully optimized to the type of sample under investigation in order to minimize cytosolic contamination. Excessive salt concentrations can cause the membrane to rupture and release not
only intracellular proteins but also phenolic compounds, which can lead to downstream issues with proteomic experiments as we will describe in the next section. Therefore, cytosolic contamination must be checked for, when using the non-disruptive and disruptive methods described in Table 4.1, by measuring the activity of glucose-6-phosphate dehydrogenase (G6PDH) [60].

An enzymatic approach can be used to study CWP1 and CWP2 from suspension cells or plant tissue. A mixture of enzymes, cellulase and pectinase, is used to digest the cell wall carbohydrates, supplemented with an osmoticum medium containing salts and sorbitol [58] or glucose [85] to allow plasmolysis, which leads to the production of protoplasts, i.e. plant cells which have had their cell wall removed. These protoplasts can be placed onto cell wall regeneration medium, where newly synthesized and secreted cell wall proteins are released using washes with low ionic salt solutions (Table 4.1).
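The extraction solutions discussed in this section and listed in Table 4.1 are specified as molar concentrations, so preparing them comes down to converting molarity and volume into a mass of salt. The sketch below illustrates that conversion under stated assumptions: the function and dictionary names are hypothetical, and the molar masses are for the anhydrous salts (a hydrate such as CaCl2·2H2O would need its own molar mass).

```python
# Molar masses (g/mol) of the anhydrous salts; hydrated forms differ.
MOLAR_MASS_G_PER_MOL = {
    "NaCl": 58.44,
    "KCl": 74.55,
    "CaCl2": 110.98,
    "LiCl": 42.39,
}


def salt_mass_g(salt, molarity_mol_per_l, volume_ml):
    """Return the mass (g) of anhydrous salt for the given molarity and volume."""
    if salt not in MOLAR_MASS_G_PER_MOL:
        raise KeyError(f"no molar mass recorded for {salt!r}")
    if molarity_mol_per_l < 0 or volume_ml < 0:
        raise ValueError("molarity and volume must be non-negative")
    moles = molarity_mol_per_l * (volume_ml / 1000.0)  # mol = M x L
    return moles * MOLAR_MASS_G_PER_MOL[salt]


if __name__ == "__main__":
    # Concentrations taken from Table 4.1; the 100 mL volume is arbitrary.
    for salt, molarity in [("KCl", 0.2), ("CaCl2", 0.2), ("LiCl", 2.0), ("NaCl", 1.0)]:
        print(f"{molarity} M {salt}, 100 mL: {salt_mass_g(salt, molarity, 100.0):.3f} g")
```

Because the salt concentration has to be optimized for each sample type, a helper of this kind is mainly useful for preparing a concentration series rather than a single fixed solution.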
For proteins embedded in the cell wall (CWP3), the cell wall can be described as recalcitrant to study. Disruptive methods must be used to study CWP3. The disruption is used to isolate the cell wall fraction first, followed by stringent washes with aqueous and organic solutions to remove contaminating proteins and small molecules from the intracellular compartment, prior to protein extraction using the same salts mentioned before, CaCl2 and LiCl (Table 4.1). Another approach described previously for use with Arabidopsis suspension cells [26] consists of adding a sedimentation step in glycerol followed by centrifugation to remove the intracellular compartment, which is less dense than the cell wall. This protocol was more recently adopted to extract cell wall proteins from sugarcane suspension cells [17]. In order to decrease cytosolic contamination, additional modifications have been considered. The modified protocol introduced by Feiz et al. [34], which was adopted for the study of cell wall proteins from alfalfa stem [102] and Arabidopsis suspension cells [49], consists of combining sequential steps of sedimentation using different concentrations of sucrose, followed by centrifugation, with an extensive wash with 5 mM acetate buffer, pH 4.6, to remove the sucrose, followed by two salt extraction steps, with 0.2 M CaCl2 and then 2 M LiCl, including protease inhibitors, followed by a final extraction with detergent (4 % SDS). One should keep in mind that cell wall protein extraction is not the only challenging aspect; the downstream procedure for protein identification or characterization can be far from straightforward depending on the origin and nature of the proteins extracted, such as the arabinogalactoproteins, or the Gly- or Pro-rich proteins found in the cell wall matrix.

4.2 Plant Organs and Organ Systems

Plants, unlike most mammals, have a limited number of organs, split between two categories, vegetative (i.e. root, stem and leaf) and reproductive (i.e. flower, fruit and seed). The “typical” plant body only consists of two organ systems:

• The root system (below the ground) which functions to anchor the plant to the soil, to absorb water and minerals and to store some of the products of photosynthesis
• The shoot system (above the ground) which includes the stem and conducts minerals and water from the root to the leaves with leaves being the critical location of photosynthesis for energy production and synthesis of organic and other compounds. At the reproductive stage, the shoot also includes flowers which when fertilized develop into fruit carrying the seeds for dispersion.

The shoot system and its specific components, because of their important role in energy production and reproduction, but also because of their ready availability and ease of harvesting, have been the subject of the most proteomic publications. However, the shoot system also represents the most challenging structure because of the presence of high levels of phenolic compounds and derived pigments. The phenolic compounds are secondary small molecules with at least one hydroxy-substituted aromatic ring and a large diversity of structures, the properties of which are related to specific functions and location. Some phenolic compounds can be widespread throughout the plant while others are specific to certain plants, plant organs and developmental stages. Examples showing their extreme complexity in structure and distribution range from phenolic acids such as caffeic acid (critical precursor in the phenylpropanoid biosynthesis pathway) to flavonoids such as anthocyanins (pigments found in flowers and fruit involved in color and flavor) and lignin (polymerized aromatic alcohols constituent of the secondary cell wall). These compounds present a major challenge for protein extraction. Another abundant pigment found in plants and concentrated in the leaves is chlorophyll. It has a critical role in photosynthesis,

functioning as the antenna, assisted by other components, that absorbs photons from light and converts them into energetic electrons further captured by coenzymes involved in Calvin Cycle reactions. In addition to chlorophyll, another abundant component of the photosynthesis process responsible for the fixation of CO2 is the enzyme Rubisco (Ribulose-1,5-bisphosphate carboxylase/oxygenase), also known as the most abundant protein on earth. The abundance of this protein is a major issue in proteomics, since its presence makes the study of low level proteins very challenging. Various Rubisco depletion strategies have been developed. Amongst them, small molecules such as phytate or polyethylenimine (PEI), which interact with specific proteins, have been used to precipitate Rubisco from protein samples. In the case of phytate, the interaction, done in the presence of Ca2+ at a defined pH of 6.8, removed 85 % of Rubisco from soybean leaves [56]. PEI was successfully used in combination with fractionation to increase the protein resolution, an approach called PARC (PEI-assisted Rubisco cleanup) [118]. However these interactions are not specific to Rubisco and can also precipitate non-targeted proteins. Commercial kits using immunoaffinity removal of Rubisco are also available (IgY Rubisco columns from Sigma; Rubisco depletion kit from Agrisera). However, the antibodies in these kits do not work as well for every plant species. A different approach, using differential PEG (polyethylene glycol) precipitation, was used to isolate Rubisco from a specific fraction [113]. PEG present in the fractions now containing only low levels of Rubisco could then be cleaned up using one of the protein extraction methods used for plants as described below. Clearly, this type of fractionation can also lead to losses of other groups of proteins that coprecipitate with Rubisco. There clearly is no perfect solution for increasing the dynamic range of proteins identified from tissues containing Rubisco without suffering from protein losses one way or another. An alternative strategy to increase the resolution and the dynamic range is to increase the fractionation of protein samples or consider subcellular fractionation, both of which incur time penalties.

Similar challenges are faced in the study of seeds and their germination process. Seeds, essential for propagation and multiplication of plants, are mainly composed of an endosperm, which is the storage compartment consisting of starch and storage proteins (the source of nutrients during germination) and the embryo, which will germinate into a seedling when dormancy is abolished. When the seed is studied as a whole, it can be a challenge to identify low abundance proteins amongst the abundant storage proteins. In rice, a PEG-assisted fractionation method has been used to remove the abundant storage proteins [3]. More recently, a different approach was also developed to remove the storage proteins found in soybean (i.e. glycinin and beta-conglycinin) using a precipitation step with 10 mM Ca2+ [57].

The root system is a simple structure that has been largely ignored in plant proteomics, having lost out to the shoot system. Only over the past 6 years has interest in roots increased, because of their direct involvement in the perception of stresses related to water and nutrient availability in soils. Roots do not represent any greater difficulty in terms of protein extraction, though yields are somewhat lower than from leaves. Rather, issues arise with the isolation of clean, soil or media-free, roots. The time between harvest and the subsequent soil removal steps can delay the freezing of samples, which in turn can be responsible for biological variation in the downstream studies. One alternative is to grow plants and harvest roots using sterile medium on plates, or to use hydroponic cultures. In both cases, this makes harvesting easier and faster.

In addition, some specific organs and tissues found in certain species are considered recalcitrant to protein extraction. In particular, woody plant material or tissues containing large amounts of pigments, phenolic compounds and carbohydrates (i.e. flowers and fruits). These often bring with them significant difficulties and require specific sample preparation to remove interfering compounds. The presence of the cell wall also presents another issue. There are two main methods for total protein extraction described in the literature that aim to solve these challenges.

These are the TCA (trichloroacetic acid)/acetone precipitation method and the phenol extraction method. Both have been used successfully for various plant organ and tissue samples of a large variety of species. Although these protocols were first optimized to be compatible with two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) as this was the main proteomics separation tool used 15 years ago, nowadays these protein extraction protocols are still being used for LC-based separations after adjustment of the solubilization buffer to make it mass spectrometry compatible.

The TCA/acetone precipitation method was used as early as the 1980s for protein separation using 1D and 2D-PAGE [29, 111]. Here, removal of pigments and phenolics, but also lipids and nucleic acids that could interfere with gel resolution was essential. The process consists of denaturing and precipitating proteins. Addition of 2-mercaptoethanol (2-ME) into the buffer inhibits the formation of new disulfide bonds. Most of the interfering compounds will stay soluble in the acetone, which is removed, leaving the pelleted proteins which can be solubilized using any type of 2D-PAGE or MS-compatible solubilization buffer, depending on later steps. The protocol adapted from Damerval et al. [29] with modifications is described in Table 4.2. Tissue disruption is done before the addition of precipitation buffer to improve total protein release through better tissue homogenization and cell wall breakage.

The phenol extraction method is commonly used by molecular biologists for nucleic acid extraction because it efficiently removes protein from samples. This feature meant it could be adapted for protein extraction from plant tissues [48], taking advantage of its ability to remove the interfering compounds, but also for its ability to remove DNA, to give improved 2D-PAGE separations compared to the TCA/acetone extraction. In this adapted phenol extraction method, the first step consists of separating proteins from salts, nucleic acids and carbohydrates with a Tris-phenol buffered solution mixed with a sucrose buffer. This inverts the phases putting the phenol phase at the top. The proteins, denatured and dissolved in the phenol phase, are then precipitated with a solution of methanolic ammonium acetate. The pellet is then successively washed with methanolic ammonium acetate, followed by acetone and methanol washes to remove pigments and lipids. The pellet can then be re-suspended in any buffer desired. The protocol used in our lab modified from [48] is described in Table 4.2.

It is important to note that:

1. The phenol extraction method is more favorable for solubilizing membrane-associated proteins, unlike the TCA/acetone method which favors water soluble proteins.
2. For both the TCA/acetone and phenol extraction protocols, the use of protease inhibitors is not mandatory, because these methods use denaturing conditions which have been reported to inhibit protease activity. Also throughout, the samples are kept at 4 °C or −20 °C as a precaution. However, because some proteases can still be active even in harsh conditions, it is recommended to add a cocktail of protease inhibitors, either in the acetone wash for the TCA/acetone procedure, or in the aqueous extraction buffer for the phenol extraction method.
3. Although manual grinding for tissue disruption leads in our hands to better protein yields, this step can be automated for higher throughput if laboratories are equipped with a bead beating tool which uses grinding balls and a shaking homogenizer to disrupt the tissues or cells. The use of these tools also provides better reproducibility between samples; however, attention has to be placed on keeping samples frozen during the homogenization steps to avoid protease activity.
4. Protein solubilization from the pellet is more efficient after phenol extraction than using the TCA/acetone method, resulting in reduced protein loss.

Table 4.2 TCA/acetone precipitation and phenol extraction procedures

TCA/acetone precipitation

Buffers:
Precipitation buffer: 10 % trichloroacetic acid (TCA) in acetone solution. Store at −20 °C.
Wash buffer: 100 % acetone. Store at −20 °C.

Procedure:
1. Aliquot the volume needed of COLD precipitation buffer and COLD wash buffer, and add 0.07 % 2-mercaptoethanol (2-ME) in both before use. Store at −20 °C.
2. Transfer the frozen samples (150–200 mg) to the liquid nitrogen in a mortar and grind into fine powder using the pestle. Note: Liquid nitrogen should be used to cool the equipment needed (mortar, pestle and spatula) just prior to use.
3. Using the chilled spatula transfer the ground sample into 2 mL centrifuge tubes and place them in liquid nitrogen until all the samples are ready.
4. Remove the tubes from the liquid nitrogen and add 1.5 mL of COLD precipitation buffer to each.
5. Precipitate proteins by placing the tubes at −20 °C for at least 2 h or overnight.
6. Centrifuge the tubes at 4 °C for 15 min at 16,000 × g.
7. Remove the supernatant and wash the pellet by adding 1.5 mL COLD wash buffer and transferring them to −20 °C for at least 2 h.
8. Repeat steps 6 and 7 one more time or until the pellet gets discolored.
9. Dry the pellet under vacuum or air dry in the hood.
10. Resuspend the pellet using the appropriate buffer (see Sect. 4.5. for solubilization buffer recommendations).

Phenol extraction

Buffers:
Extraction buffer (stock solution): 0.1 M Tris–HCl pH 8.8, 10 mM EDTA, 0.4 % 2-ME, 0.9 M sucrose. 2-ME must be omitted from the stock solution. Store at 4 °C.
Phenol buffer: Phenol buffered with Tris–HCl, pH 8.8 (commercial product). Store at 4 °C.
Wash buffer I: 0.1 M ammonium acetate in methanol, II: 80 % acetone and III: 70 % methanol. Store at −20 °C.

Procedure:
1. Aliquot the volume of extraction buffer needed based on the number of samples and add 2-ME at 0.4 % and the cOmplete protease cocktail inhibitor (Roche) at 1x. Keep the buffer on ice.
2. Transfer the frozen samples (100–150 mg) to the liquid nitrogen in the mortar and grind into a fine powder using the pestle. Note: Liquid nitrogen is used to cool down the equipment needed (mortar, pestle and spatula) before being used.
3. Using the chilled spatula transfer the sample into 2 mL centrifuge tubes and place them in liquid nitrogen until all the samples are ready.
4. Remove the tubes from the liquid nitrogen one by one and add 600 μL of extraction buffer and 600 μL of phenol buffer to each. Note: Use phenol and extraction buffer in the hood. If phenol droplets get on your gloves, change them as soon as possible.
5. Vortex the tubes immediately after adding the buffers and place on ice until all the samples are ready for the next step.
6. Vortex the tubes on high for 30 min using a vortexer equipped with a tube adapter, placed in a cold room at 4 °C.
7. Centrifuge the tubes for 15 min at 16,000 × g at 4 °C.
8. Remove the phenol phase (top phase) and transfer it to a new 2 mL tube. Place the tubes on ice until all the samples are ready for the next step.
9. Back-extract the aqueous phase by adding 400 μL of phenol buffer.
10. Repeat steps 5–8.
11. Remove the phenol phase and combine it with the first extraction.
12. Precipitate the phenol phase by adding 5 volumes of COLD wash buffer I. Split the sample into several 2 mL tubes if necessary before adding the wash buffer I.
13. Vortex the tubes and incubate them at −20 °C for at least 2 h or overnight.
14. Centrifuge the tubes at 4 °C for 15 min at 16,000 × g.
15. Remove the supernatant and wash the pellet by adding 1.5 mL of COLD wash buffer I and transferring to −20 °C for at least 20 min.
16. Repeat steps 12 and 13 one more time with COLD wash buffer I, then one time with COLD wash buffer II and one time with COLD wash buffer III.
17. Dry the pellet under vacuum or air.
18. Resuspend the pellet using the appropriate buffer (see Sect. 4.5. for solubilization buffer recommendations).
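
When the procedures in Table 4.2 are scaled to a batch of samples, the per-sample volumes reduce to simple arithmetic. The Python sketch below is an illustration only and not part of the published protocol: the function names, the 10 % pipetting overage, the default of two acetone washes and the reading of the 2-ME percentages as volume/volume are all assumptions. The rotor-speed helper uses the standard relation RCF = 1.118 × 10^-5 × r (cm) × rpm^2; the rotor radius has to be taken from the rotor documentation.

```python
import math


def tca_acetone_volumes(n_samples, n_washes=2, overage=0.10):
    """Reagent volumes for the TCA/acetone procedure (Table 4.2).

    Assumes 1.5 mL precipitation buffer and 1.5 mL wash buffer per wash
    per sample, and 0.07 % 2-ME read as v/v (assumption).
    """
    scale = n_samples * (1 + overage)
    precipitation_ml = 1.5 * scale
    wash_ml = 1.5 * n_washes * scale
    two_me_fraction = 0.0007  # 0.07 % expressed as a fraction
    return {
        "precipitation_buffer_mL": round(precipitation_ml, 2),
        "wash_buffer_mL": round(wash_ml, 2),
        "2ME_in_precipitation_uL": round(precipitation_ml * 1000 * two_me_fraction, 2),
        "2ME_in_wash_uL": round(wash_ml * 1000 * two_me_fraction, 2),
    }


def phenol_extraction_volumes(n_samples, overage=0.10):
    """Reagent volumes for the phenol procedure (Table 4.2).

    Assumes 600 uL extraction buffer per sample, 600 uL phenol buffer plus
    400 uL for the back-extraction, and 0.4 % 2-ME read as v/v (assumption).
    """
    scale = n_samples * (1 + overage)
    extraction_ml = 0.6 * scale
    phenol_ml = (0.6 + 0.4) * scale
    return {
        "extraction_buffer_mL": round(extraction_ml, 2),
        "phenol_buffer_mL": round(phenol_ml, 2),
        "2ME_in_extraction_uL": round(extraction_ml * 1000 * 0.004, 2),
    }


def rpm_for_rcf(rcf_g, rotor_radius_cm):
    """Rotor speed (rpm) needed to reach a given relative centrifugal force."""
    return math.sqrt(rcf_g / (1.118e-5 * rotor_radius_cm))


if __name__ == "__main__":
    print(tca_acetone_volumes(12))
    print(phenol_extraction_volumes(12))
    # 8.0 cm is a placeholder radius, not a recommendation.
    print(round(rpm_for_rcf(16000, 8.0)))
```

The volume of wash buffer I needed for step 12 of the phenol procedure is not computed, because it depends on the volume of phenol phase actually recovered.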

4.3 Plant Meristem and Suspension Culture Cells (SCCs)

Suspension culture cells (SCCs) are widely used as a model system in plant biology to investigate the molecular basis of different mechanisms and their regulation, because they bypass the complexity of regulation from an entire plant. SCCs can be grown to any quantity required and the homogeneity of the population offers greater reproducibility. For proteomic studies, they also present an easily harvestable material which often does not contain the interfering compounds found in whole plants. SCCs are produced from plant tissue culture, which is widely used for gene transformation and regeneration of modified plants, or for the propagation of identical plants without the need for crossing and seed production. The cells can be prepared from different types of explants (hypocotyls, leaves and roots) by dedifferentiation techniques which yield unorganized cell masses called callus. These calluses can then be transplanted into a new medium for plant regeneration via somatic embryogenesis, or can be transferred into flasks containing liquid culture medium for growth of cell biomass to maintain suspension cells. Cell growth goes through an exponential phase before it slows down to a plateau, when the nutrients in the medium are no longer sufficient. Regularly transferring an aliquot of the cells into new medium allows suspension cells to be maintained for years.

Undifferentiated cells are also present in planta throughout their entire life. These rapidly dividing cells located at the tips of the root and shoot, which are called apical meristems are essential for producing new cells and tissue for plant growth. In contrast to suspension cells, meristems in plants are very limited in amount and can be difficult to isolate. However, they do not contain phenolics and other interfering compounds, so can be easier to study than other types of plant samples. Meristematic cells, like stem cells can also be used to seed new SCCs without the need for going through dedifferentiation first, providing an even faster way to obtain cell suspension material.

Protein extraction from suspension cells does not require the tedious extractions described previously for organ and organ systems; however, disruption of the cells is still required to break the cell wall structure to release the proteins. Homogenization of cells in the presence of a protein solubilization buffer can be used. This method does not require precipitation of the proteins first, instead, the proteins are directly solubilized and the cell debris is removed by centrifugation. A wider diversity of buffers can be used to extract soluble proteins, but buffers suitable for extracting membrane or membrane-bound proteins can also be used, depending on the requirements of the proteomics analytical approach selected (e.g. 2D-PAGE or LC-MS-based methods, with or without offline pre-fractionation, or multi-dimensional protein identification technology (MudPIT)).

Direct solubilization of proteins from cells without prior cleanup was inspired by the O’Farrell lysis buffer which contained 9 M urea, 2 % Nonidet P-40 (NP-40), 2 % 2-ME and 2 % carrier ampholytes (any desired pH interval) [73]. This buffer consists of a mixture of solubilizing components: a chaotrope to disrupt intra- and interprotein interactions and unfold structure by breaking hydrogen bonds; a detergent to disrupt hydrophobic bonds and improve protein solubilization; and a reducing reagent to break and prevent reformation of disulfide cross-links. Modifications to the lysis buffer recipe according to sample type and approach were made using mixtures of urea and thiourea with non-ionic (NP-40 or Triton X-100) or zwitterionic (3-[(3-Cholamidopropyl) dimethylammonio]-1-propanesulfonate; CHAPS) detergents containing reducing reagents (Dithiothreitol; DTT). Others replaced urea with the anionic detergent SDS (sodium dodecyl sulfate) [40]. The use of detergents, such as SDS and Triton X-100, gives increasingly better membrane protein solubilization when compared to the TCA/acetone and phenol extraction methods previously described. However, this does oversimplify the situation somewhat. Detergents perform differently depending on sample type and whether denaturing chaotropes such as urea and thiourea are present. Therefore, if membrane protein solubilization efficiency is the goal, optimization using the biological system of interest is required [62]. Some examples of lysis buffer compositions found in the literature are given in Table 4.3 to provide a useful starting point for single step protein solubilization during homogenization. Because of the efficiency of lysis buffers at directly solubilizing proteins from tissues, this method has not only been applied to suspension cells or meristems, but also to other organs. Similarly, the TCA/acetone and phenol precipitation extractions have their place, when the removal of interfering compounds is important from suspension cells and meristems, which have been used to elicit the synthesis of secondary small molecules into the liquid medium.

It is important to note that:

1. The use of protease inhibitors is critical. Examples are: ethylenediaminetetraacetic acid (EDTA) or ethylene glycol tetraacetic acid (EGTA) – chelators; pepstatin A and phenylmethylsulfonyl fluoride (PMSF) – site blocking reagents; or a combination of both chelators and blocking site reagents; or the use of commercial mixtures such as the cOmplete protease inhibitor cocktail from Roche in the buffers to avoid protein degradation during solubilization.
2. In most cases this homogenization is followed by an acetone precipitation or desalting step to remove detergents and/or the high concentrations of salts and chaotropes which can interfere with downstream steps such as isoelectrofocusing (IEF) or trypsin digestion.
3. The presence of detergents in the final stages of sample analysis causes severe suppression of ionization in the mass spectrometer and major contamination of the separation systems that are usually online to such equipment.
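
The chapter names precipitation or desalting as the usual way to remove chaotropes before IEF or trypsin digestion, and Sect. 4.5 also mentions dilution as an option for urea. Where dilution is chosen, the volume of diluent follows from C1 × V1 = C2 × V2. The Python sketch below is only an illustration of that arithmetic; the function name is hypothetical and the target concentration is a value the user must decide, since the chapter gives no threshold.

```python
def diluent_volume_ul(sample_ul, start_molar, target_molar):
    """Volume of diluent (uL) to add so that a chaotrope drops from
    start_molar to target_molar, using C1 * V1 = C2 * V2."""
    if target_molar <= 0 or target_molar > start_molar:
        raise ValueError("target must be positive and not above the start")
    final_volume_ul = sample_ul * start_molar / target_molar
    return final_volume_ul - sample_ul


if __name__ == "__main__":
    # Example values only: 100 uL of a sample in 8 M urea, hypothetical 1 M target.
    print(diluent_volume_ul(100, 8.0, 1.0))
```

Dilution increases the sample volume and lowers the protein concentration in proportion, which is one reason precipitation or desalting may be preferred.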

Table 4.3 Examples of lysis buffer compositions used for total protein solubilization during tissue disruption and homogenization for different sample types and applications
Buffer composition | Sample type | Platform | Reference
0.5 M EDTA, pH 8.0, 1 M Tris–HCl, pH 6.8, 10 % SDS, 100 % 2-ME, 100 % glycerol and bromophenol blue powder | Arabidopsis cells | SDS-PAGE | [99]
8 M urea, 2 M thiourea, 2 mM disodium EDTA salt, 4 % CHAPS, 65 mM DTT, 2 % ampholytes (pH 3–10), and 1 % TBP (tributylphosphine) | Ginseng (Panax ginseng) cells | 2D-PAGE | [97]
9.5 M urea, 2 % NP-40, 2 % ampholine (pH 3.5–10), 5 % 2-ME and 0.05 % polyvinylpyrrolidone (PVP-40) | Rice suspension cells | 2D-PAGE | [55]
50 mM Tris–HCl, 15 mM EDTA, 1 mM NaF, 0.5 mM Na3VO3, 15 mM β-glycerophosphate, 1 mM PMSF, 1 mM DTT, 2 μg/mL pepstatin, 2 μg/mL aprotinin, 2 μg/mL leupeptin, pH 7.5 | Rice suspension cells | DIGE and iTRAQ | [61]
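
To turn a recipe such as those in Table 4.3 into amounts to weigh, molar concentrations are multiplied by the final volume and the molar mass, and percentages are scaled by the volume. The Python sketch below illustrates this for the main solid components of the 8 M urea / 2 M thiourea buffer; it is not taken from the cited papers. The function names are hypothetical, the percentages are assumed to be weight/volume, and the molar masses are standard values that should be checked against the supplier's data.

```python
# Molar masses in g/mol (standard values; verify against the product sheet).
MOLAR_MASS = {"urea": 60.06, "thiourea": 76.12, "DTT": 154.25}


def grams_for_molar(component, molar, volume_ml):
    """Mass (g) of a component for a given molarity and final volume."""
    return molar * (volume_ml / 1000.0) * MOLAR_MASS[component]


def grams_for_percent_wv(percent, volume_ml):
    """Mass (g) for a percentage read as weight/volume (assumption)."""
    return percent / 100.0 * volume_ml


def urea_thiourea_chaps_recipe(volume_ml):
    """Main solids of the 8 M urea, 2 M thiourea, 4 % CHAPS, 65 mM DTT buffer.

    EDTA, ampholytes and TBP from the same table row are left out here.
    """
    return {
        "urea_g": round(grams_for_molar("urea", 8.0, volume_ml), 3),
        "thiourea_g": round(grams_for_molar("thiourea", 2.0, volume_ml), 3),
        "CHAPS_g": round(grams_for_percent_wv(4.0, volume_ml), 3),
        "DTT_g": round(grams_for_molar("DTT", 0.065, volume_ml), 3),
    }


if __name__ == "__main__":
    print(urea_thiourea_chaps_recipe(10))
```

Because urea at these concentrations takes up a substantial part of the final volume, the solids are dissolved in less than the final volume of water and the solution is then brought up to volume.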

4.4 Green Algae and Plastids

Plants are not the only organisms with a cell wall capable of photosynthesis and therefore posing similar challenges in the study of their protein content. Algae are eukaryotic organisms more recently classified in the kingdom of protists. They are found as unicellular or multicellular organisms living in salt or fresh water environments. Most of them are photosynthetic and able to fix CO2 from the atmosphere or from organic sources using light and water to produce energy and biomass while releasing oxygen into the atmosphere. The cell wall found in algae presents a huge variety in its composition that has evolved throughout the taxa, with the later divergence in the algal taxa presenting a very similar cell wall composition to that of land plants. Their cell wall acts as a physical protection and defense against microbial attack and also, just like in plants, has a great implication in cell-cell and cell-substrate sensing and communication. Furthermore, in contrast to plants, the algal cell wall is involved in sexual reproduction. The study of the cell wall composition found in the different algal taxa has been recently reviewed [32].

In addition to the cell wall, algal cells also contain plastids that are capable of photosynthesis. Depending on the pigment accumulated, three out of the seven phyla of algae are easily identifiable: the green algae or Chlorophyta consists of plastids containing chlorophyll a and b, similar to plant chloroplasts, and also some carotenoids; the red algae or Rhodophyta contains chlorophyll a and accumulates phycobilins in their plastids; the brown algae or Phaeophyta has chlorophyll a and c plus fucoxanthin. In addition to algae, because cyanobacteria are also photosynthetic and live in water, these have been named the blue-green algae in reference to the pigments accumulating in their cells. However, cyanobacteria are prokaryotic organisms with no plastids, which instead use the pigment phycocyanin to capture light for photosynthesis. It is this group of organisms that is thought to be the ancestral precursor of the chloroplast found in plant cells. Because their cell wall is similar to gram negative bacteria, this organism will not be addressed in this section.

The green algae have been shown greater interest over the past decade because of their potential to solve the global food and fuel crisis. Indeed, some families of algae produce high levels of lipid and can be grown fast without the use of valuable farmland. However, the cost for production of energy-rich oil from algae is very expensive and more efforts still have to be made to increase production and lower the cost in order to compete with the cost of crude oil. Understanding how to increase lipid production and biomass is critical and proteomics tools are among those that are used to tease out the pathways involved. Because of the presence of chlorophyll and lipids the TCA/acetone precipitation and phenol extraction methods have been shown to be the most successful in proteomic studies [105].

Plastids like chloroplasts have their own DNA and protein biosynthesis machinery. Understanding the origin of chloroplast proteins and how they are involved in chloroplast development, signaling and interaction networks is of great interest. Plastids and chloroplasts can be isolated from algal cells or plant tissue, respectively, by using sucrose [45] or Percoll density gradients [51, 52]. Much has already been done to identify the chloroplast proteomes in plants, and the specific subproteomes of the thylakoid membrane and lumen [78, 79], and the chloroplast envelope [16].

4.5 Recommendations for Selection of the Optimal Protein Extraction Method According to Sample Type

The “best” method of protein extraction for each sample type and species (i.e. the method leading to the highest number of proteins in a reproducible manner) will be one that has been fully optimized for the particular system in hand

Table 4.4 Summary table of references using one or more of the three main methods used for plant protein extraction
according to sample type and species
Sample type   TCA/acetone precipitation   Phenol extraction   Tissue homogenization in buffer
Seedlings Arabidopsis [27]; Rice (Oryza Arabidopsis [11, 42]
sativa) [100]
Shoots Arabidopsis [9] Maize (Zea mays)
[93]
Leaves Banana (Musa spp.) [18]; Cucumber Banana [18]; Brachypodium Maize [93]; Rice
(Cucumis sativus L.) [27]; Maize distachyon [63]; Grape (Vitis [53, 54, 91]; Soybean
[10, 27, 103]; Rice [27, 100, 115]; vinifera) [66]; Olive (Olea europaea (Glycine max) [95];
Sugar beet (Beta vulgaris L.) L.) [106]; Potato (Solanum Wheat [67]
[39, 116]; Wheat (Triticum aestivum tuberosum L.) [18]
L.) [27, 80]
Stem/ Rice [100]; Soybean [95]
hypocotyl
Roots Arabidopsis [9, 50, 59]; Barley Arabidopsis [6]; Avocado (Persea Rice [11, 121];
(Hordeum vulgare L.) [110]; americana) [2]; Brassica juncea [4]; Soybean [95]
Cucumber (Cucumis sativus) [33]; Grape [66]; Rice [25]; Soybean
Rice [71, 100, 114]; Wheat [80]; [104]; Wheat [8]
Sugar beet [116]
Flowers Cannabis sativa [83]
Fruit Apple (Malus x domestica) [92, 120]; Tomato [89]
Avocado (Persea americana),
Banana and Tomato (Solanum
lycopersicum) [87];
Seeds European beech (Fagus sylvatica L.) Cacao (Theobroma cacao L.) [72]; Arabidopsis [35, 36];
(including [76]; Maize [69]; Norway maple Soybean [23, 38]; Maize [94] Camellia sinensis
endosperm (Acer platanoides L.) [77]; Rice [21]; Rice [27, 53,
and [100]; 117]; Maize [22]
embryos)
Meristems Banana [18]; Medicago truncatula Apple [18]; Banana [18, 86];
[44] Cacao [81]
Suspension Sugar beet [75]; Ginseng (Panax Sugar beet [75]; Ginseng [97] Rice [55, 61];
Cells ginseng) [97]; Grape [90] Ginseng [97]
Others Horse gram [13]; Maize xylem sap Cactus and houseleek [75]; Cotton Horse gram [13];
[5]; Rice bran, chaff and callus (Gossypium barbadense) ovules
[100]. Tomato flower bud (Zhao et al. [46]; Horse gram [13]; Poplar wood
2013); Cactus and houseleek [75] [101]; Wheat callus [122]

before committing to any further proteomics experiments. However, given no prior knowledge, it is most often the case that scientists will adopt one of the previously described methods (TCA/acetone precipitation, phenol extraction and direct extraction using lysis buffer) and stick with it. Table 4.4 summarizes the references describing the use of one or more of these methods to help with the selection of the method most likely to offer initial success for each sample type and species. Only a few studies have systematically compared methods for specific sample types, such as recalcitrant plant tissue of banana leaves [18], various fruits (banana, avocado and tomato) [87, 119], horse gram [13] and grapevine leaves, pine needles and cork oak ectomycorrhizal roots [88]. They concluded that while both TCA/acetone and phenol extractions are more suitable for recalcitrant plant tissue and give similar protein yields, the phenol extraction method gives higher quality 2D-PAGE results, showing that the cleaning steps lead to cleaner samples with fewer interfering contaminants, aided by the use of phenol which

acts as a dissociating reagent decreasing the interactions between proteins and other compounds [18]. These studies also noted that the phenol extraction method gave a more glycoprotein-rich sample with reduced levels of Rubisco [18, 87]. Other studies in samples of sugar beet cells, cactus and houseleek [75], and ginseng cells [97] showed that phenol extraction has a higher cleaning capacity, which also led to better protein yields. This result is consistent with the observation noted in Table 4.4 where the largest contribution of samples extracted by the phenol extraction method comes from the group of plant tissues considered recalcitrant.

We have to keep in mind that even if the phenol extraction method is known to give cleaner protein samples, its time-consuming use may not nowadays be required with the evolution of downstream experiments having moved away from 2D-PAGE to gel-free methods. Here, offline HPLC fractionation is often combined with LC-MS/MS. However, there is still a distinct lack of literature on the comparison of extraction methods using gel-free shotgun proteomics approaches in the plant field.

Only a few plant studies have tried to find alternative extraction protocols for gel-based approaches that match the ones previously described here. Their goal has been to combine the benefits of each. In 2006, a study performed on leaf tissue of various species (Arabidopsis, rice, wheat, maize and cucumber) showed that using a lysis buffer (composed of 7 M urea, 2 M thiourea, 4 % CHAPS, 18 mM Tris–HCl, pH 8.0) for the solubilization of protein pellets obtained from TCA/acetone precipitation improved the number of spots resolved by 2D-PAGE when compared to the use of lysis buffer without pre-TCA/acetone precipitation [27]. The addition of the TCA/acetone precipitation removes most of the non-protein compounds allowing better solubilization of the pelleted material. Another study from 2006 compared the use of TCA/acetone precipitation and the lysis buffer method, to the use of SDS buffer (2 % SDS, 60 mM DTT, 20 % glycerol and 40 mM Tris–HCl, pH 8.5) on the recalcitrant fruit tissue of apple and banana [96]. After boiling the samples in SDS-containing buffer, the proteins were precipitated using TCA/acetone and washed to remove the SDS. Although the use of SDS improves membrane protein solubilization, even small amounts of SDS left after precipitation can impair many downstream analytical approaches through the presence of the negative charge. The use of SDS has been combined with phenol extraction and with the benefits of TCA/acetone precipitation from various recalcitrant plant tissues [107]. This combination allows removal of many of the interfering phenolic compounds and pigments present in recalcitrant plant tissues; the presence of SDS helps with the solubilization of proteins before phenol extraction, but becomes very time consuming. A variation of the same protocol without the TCA/acetone extraction was more recently tested and optimized on soybean roots using the extraction buffer: 0.1 M Tris–HCl, pH 8.0, 2 % SDS, 5 % 2-ME, 30 % sucrose, 1 mM PMSF [84]. The comparison of both protocols, with or without TCA/acetone precipitation before SDS/phenol extraction did not show significant differences either in the 2D-PAGE profiles, which were very well resolved and streakless, or in protein yield and reproducibility. However for recalcitrant plant tissues, this protocol can be recommended and has been more recently optimized and described [112]. In summary, although some efforts have been made to find a faster, more reproducible and higher yield protein extraction method, the methods established in the 1980s are still very popular and have not been completely superseded.

In contrast to lysis buffers used for direct protein extraction during tissue homogenization, so-called solubilization buffers used to dissolve the protein pellets after the TCA/acetone or phenol precipitation steps have evolved with the move from 2D-PAGE to mass spectrometry-based approaches. At the point where IPG strips were introduced for 2D-PAGE the solubilization buffers used after TCA/acetone extraction were modified from the original O’Farrell lysis buffer for compatibility. These modifications involved combining chaotropes such as urea with thiourea, adding detergents such as Triton X-100, CHAPS

Table 4.5 Examples of buffer composition for protein solubilization according to the proteomics platform used

Buffer composition | Sample type | Platform | References
7 M urea, 2 M thiourea, 4 % CHAPS, 18 mM Tris-HCl (pH 8.0), 14 mM Trizma® base, two EDTA-free protease inhibitor cocktail tablets, 0.2 % Triton X-100 (R), and 50 mM DTT, to a final volume of 100 mL | Arabidopsis seedlings, leaves of rice, maize, wheat, cucumber | 2D-PAGE | [27]
5 M urea, 2 M thiourea, 2 % CHAPS, 2 % Sulfobetaine 3–10, 20 mM DTT, 5 mM TCEP, 0.75 % carrier ampholytes | Maize endosperm | 2D-PAGE | [69]
0.5 M bicine buffer, pH 8.5 containing 0.1 % SDS | Grapevine leaf | iTRAQ/LC-MS/MS | [65]
1 M urea, 0.5 M bicine, 0.09 % SDS | Arabidopsis roots | iTRAQ/LC-MS/MS | [7]
8 M urea in 500 mM triethylammonium bicarbonate (TEAB) | Wheat roots | iTRAQ/LC-MS/MS | [8]
8 M urea, 25 mM TEAB, 0.2 % Triton X-100, 0.1 % SDS, pH 8.5 | Cotton (Gossypium barbadense) ovules | iTRAQ/LC-MS/MS | [47]
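As a worked illustration of how the concentrations in Table 4.5 translate into weighed amounts, the following minimal Python sketch computes reagent masses for the main solid components of the first buffer in the table (7 M urea, 2 M thiourea, 4 % CHAPS, 50 mM DTT). It is only a sketch under stated assumptions: the molar masses are standard reference values rather than figures from this chapter, the percentage for CHAPS is assumed to mean weight per volume, and the function names are hypothetical.

```python
"""Reagent masses for a urea/thiourea solubilization buffer (illustrative sketch)."""

# Molar masses in g/mol. Standard reference values, not taken from the chapter.
MOLAR_MASS = {
    "urea": 60.06,
    "thiourea": 76.12,
    "DTT": 154.25,
}


def grams_for_molar(conc_molar: float, molar_mass: float, volume_ml: float) -> float:
    """Mass in grams of solute needed for conc_molar (mol/L) in volume_ml (mL)."""
    return conc_molar * molar_mass * volume_ml / 1000.0


def grams_for_percent_wv(percent: float, volume_ml: float) -> float:
    """Mass in grams for a % (w/v) component, i.e. grams per 100 mL."""
    return percent * volume_ml / 100.0


def buffer_recipe(volume_ml: float) -> dict:
    """Masses for 7 M urea, 2 M thiourea, 4 % CHAPS and 50 mM DTT."""
    return {
        "urea": grams_for_molar(7.0, MOLAR_MASS["urea"], volume_ml),
        "thiourea": grams_for_molar(2.0, MOLAR_MASS["thiourea"], volume_ml),
        # Assumption: the 4 % in the table is weight per volume.
        "CHAPS": grams_for_percent_wv(4.0, volume_ml),
        "DTT": grams_for_molar(0.050, MOLAR_MASS["DTT"], volume_ml),
    }


if __name__ == "__main__":
    for reagent, grams in buffer_recipe(100.0).items():
        print(f"{reagent}: {grams:.2f} g")
```

For the 100 mL final volume given in the table, this works out to roughly 42 g urea, 15 g thiourea, 4 g CHAPS and 0.77 g DTT. The remaining components (Tris, protease inhibitor tablets, Triton X-100) are left out to keep the sketch short.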

or sulfobetaine 3–10, and reducing reagents such as DTT and tributylphosphine [68]. Some examples of buffer compositions used are found in Table 4.5. These buffers were also found useful for solubilization of protein pellets after phenol extraction prior to 2D-PAGE, but are incompatible with gel-free LC-MS/MS approaches. A high concentration of chaotropes such as urea, without prior removal or dilution, can deactivate proteases such as trypsin. Native proteases can also be deactivated by high urea, a useful feature, though one offset by the fact that urea, if not prepared correctly, can extensively carbamylate amino groups and sulfhydryls, making peptide identification difficult. Detergents like CHAPS, Triton X-100 and SDS are also incompatible with liquid chromatography and mass spectrometry and must be removed prior to analysis. An acetone precipitation is often used at the protein level to remove the interfering reagents from the sample; however, protein losses are observed, and membrane proteins, for example, do not redissolve as well in buffers that do not contain detergents.

A new workflow, known as GeLC-MS/MS, was introduced in 2008 to remove chaotropes and detergents efficiently before mass spectrometry [37]. In this method, pellet solubilization is done using the standard Laemmli buffer (0.1 % 2-ME, 0.0005 % bromophenol blue, 10 % glycerol, 2 % SDS, 63 mM Tris–HCl, pH 6.8), followed by SDS-PAGE to fractionate and separate the proteins from the detrimental reagents. The gel lane is then cut into many consecutive pieces, the proteins are digested and the released peptides are analyzed by LC-MS/MS. This workflow is widely used for shotgun proteomics experiments as well as for label-free quantitative proteomics by spectrum counting. However, it is not compatible with quantitative proteomics platforms using isobaric tags (i.e. iTRAQ and TMT) for multiplex labeling. The labeling step, here required after protein digestion and before subsequent fractionation, has additional buffer compatibility requirements (absence of primary amines). The iTRAQ kit provides a dissolution buffer consisting of 0.5 M triethylammonium bicarbonate (TEAB, pH 8.5) to which SDS is added to 1 %. However, this buffer has very limited efficiency at completely solubilizing protein pellets after phenol extraction, acetone precipitation or TCA/acetone extraction. Additionally, since the SILAC approach is not suitable for plants because of limited incorporation of isotope-labeled amino acids into plant cells and tissues, iTRAQ and TMT labeling are more widely used. Solubilizing these protein pellets in a manner compatible with the labeling reagents still remains a challenge. Some examples of compatible buffer compositions for use with these quantitative labeling platforms are given in Table 4.5. The pellets do not fully redissolve, but they have been shown to give good protein recovery.

Beyond the challenges of cell walls and plant
4 Plant Structure and Specificity – Challenges and Sample Preparation. . . 77

protein extraction, once the protein is digested the techniques that can be applied at the peptide level are the same as for any other organism.

References

1. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815
2. Acosta-Muniz CH, Escobar-Tovar L, Valdes-Rodriguez S, Fernandez-Pavia S, Arias-Saucedo LJ, de la Cruz Espindola Barquera M, Gomez Lim MA (2012) Identification of avocado (Persea americana) root proteins induced by infection with the oomycete Phytophthora cinnamomi using a proteomic approach. Physiol Plant 144:59–72
3. Ahsan N, Lee SH, Lee DG, Lee H, Lee SW, Bahk JD, Lee BH (2007) Physiological and protein profiles alternation of germinating rice seedlings exposed to acute cadmium toxicity. C R Biol 330:735–746
4. Alvarez S, Berla BM, Sheffield J, Cahoon RE, Jez JM, Hicks LM (2009) Comprehensive analysis of the Brassica juncea root proteome in response to cadmium exposure by complementary proteomic approaches. Proteomics 9:2419–2431
5. Alvarez S, Goodger JQ, Marsh EL, Chen S, Asirvatham VS, Schachtman DP (2006) Characterization of the maize xylem sap proteome. J Proteome Res 5:963–972
6. Alvarez S, Hicks LM, Pandey S (2011) ABA-dependent and -independent G-protein signaling in Arabidopsis roots revealed through an iTRAQ proteomics approach. J Proteome Res 10:3107–3122
7. Alvarez S, Roy Choudhury S, Hicks LM, Pandey S (2013) Quantitative proteomics-based analysis supports a significant role of GTG proteins in regulation of ABA response in Arabidopsis roots. J Proteome Res 12:1487–1501
8. Alvarez S, Roy Choudhury S, Pandey S (2014) Comparative quantitative proteomics analysis of the ABA response of roots of drought-sensitive and drought-tolerant wheat varieties identifies proteomic signatures of drought adaptability. J Proteome Res 13:1688–1701
9. Alvarez S, Zhu M, Chen S (2009) Proteomics of Arabidopsis redox proteins in response to methyl jasmonate. J Proteome 73:30–40
10. Amiour N, Imbaud S, Clement G, Agier N, Zivy M, Valot B, Balliau T, Armengaud P, Quillere I, Canas R, Tercet-Laforgue T, Hirel B (2012) The use of metabolomics integrated with transcriptomic and proteomic studies for identifying key steps involved in the control of nitrogen metabolism in crops such as maize. J Exp Bot 63:5017–5033
11. Ban Y, Kobayashi Y, Hara T, Hamada T, Hashimoto T, Takeda S, Hattori T (2013) Alpha-tubulin is rapidly phosphorylated in response to hyperosmotic stress in rice and Arabidopsis. Plant Cell Physiol 54:848–858
12. Bayer EM, Bottrill AR, Walshaw J, Vigouroux M, Naldrett MJ, Thomas CL, Maule AJ (2006) Arabidopsis cell wall proteome defined using multidimensional protein identification technology. Proteomics 6:301–311
13. Bhardwaj J, Yadav SK (2013) A common protein extraction protocol for proteomic analysis: horse gram a case study. Am J Agric Biol Sci 8:293–301
14. Borderies G, Jamet E, Lafitte C, Rossignol M, Jauneau A, Boudart G, Monsarrat B, Esquerre-Tugaye MT, Boudet A, Pont-Lezica R (2003) Proteomics of loosely bound cell wall proteins of Arabidopsis thaliana cell suspension cultures: a critical analysis. Electrophoresis 24:3421–3432
15. Boudart G, Jamet E, Rossignol M, Lafitte C, Borderies G, Jauneau A, Esquerre-Tugaye MT, Pont-Lezica R (2005) Cell wall proteins in apoplastic fluids of Arabidopsis thaliana rosettes: identification by mass spectrometry and bioinformatics. Proteomics 5:212–221
16. Brautigam A, Hoffmann-Benning S, Weber AP (2008) Comparative proteomics of chloroplast envelopes from C3 and C4 plants reveals specific adaptations of the plastid envelope to C4 photosynthesis and candidate proteins required for maintaining C4 metabolite fluxes. Plant Physiol 148:568–579
17. Calderan-Rodrigues MJ, Jamet E, Bonassi MB, Guidetti-Gonzalez S, Begossi AC, Setem LV, Franceschini LM, Fonseca JG, Labate CA (2014) Cell wall proteomics of sugarcane cell suspension cultures. Proteomics 14:738–749
18. Carpentier SC, Witters E, Laukens K, Deckers P, Swennen R, Panis B (2005) Preparation of protein extracts from recalcitrant plant tissues: an evaluation of different methods for two-dimensional gel electrophoresis analysis. Proteomics 5:2497–2507
19. Casasoli M, Spadoni S, Lilley KS, Cervone F, De Lorenzo G, Mattei B (2008) Identification by 2-D DIGE of apoplastic proteins regulated by oligogalacturonides in Arabidopsis thaliana. Proteomics 8:1042–1054
20. Charmont S, Jamet E, Pont-Lezica R, Canut H (2005) Proteomic analysis of secreted proteins from Arabidopsis thaliana seedlings: improved recovery following removal of phenolic compounds. Phytochemistry 66:453–461
21. Chen Q, Yang L, Ahmad P, Wan X, Hu X (2011) Proteomic profiling and redox status alteration of recalcitrant tea (Camellia sinensis) seed in response to desiccation. Planta 233:583–592
22. Chen ZY, Brown RL, Rajasekaran K, Damann KE, Cleveland TE (2006) Identification of a maize kernel pathogenesis-related protein and evidence for its
involvement in resistance to Aspergillus flavus infection and Aflatoxin production. Phytopathology 96:87–95
23. Cheng L, Gao X, Li S, Shi M, Javeed H, Jing X, Yang G, He G (2010) Proteomic analysis of soybean [Glycine max (L.) Meer.] seeds during imbibition at chilling temperature. Mol Breed 26:1–17
24. Cheng WH, Taliercio EW, Chourey PS (1996) The miniature1 seed locus of maize encodes a cell wall invertase required for normal development of Endosperm and maternal cells in the Pedicel. Plant Cell 8:971–983
25. Chitteti BR, Peng Z (2007) Proteome and phosphoproteome differential expression under salinity stress in rice (Oryza sativa) roots. J Proteome Res 6:1718–1727
26. Chivasa S, Ndimba BK, Simon WJ, Robertson D, Yu XL, Knox JP, Bolwell P, Slabas AR (2002) Proteomic analysis of the Arabidopsis thaliana cell wall. Electrophoresis 23:1754–1765
27. Cho K, Torres NL, Subramanyam S, Deepak SA, Sardesai N, Han O, Williams CE, Ishii H, Iwahashi H, Rakwal R (2006) Protein extraction/solubilization protocol for monocot and dicot plant gel-based proteomics. J Plant Biol 49:413–420
28. Cordoba-Pedregosa M, Gonzalez-Reyes JA, Canadillas M, Navas P, Cordoba F (1996) Role of apoplastic and cell-wall peroxidases on the stimulation of root elongation by Ascorbate. Plant Physiol 112:1119–1125
29. Damerval C, De Vienne D, Zivy M, Thiellement H (1986) Technical improvements in two-dimensional electrophoresis increase the level of genetic variation detected in wheat-seedling proteins. Electrophoresis 7:52–54
30. Dani V, Simon WJ, Duranti M, Croy RR (2005) Changes in the tobacco leaf apoplast proteome in response to salt stress. Proteomics 5:737–745
31. de Souza IR, MacAdam JW (2001) Gibberellic acid and dwarfism effects on the growth dynamics of B73 maize (Zea mays L.) leaf blades: a transient increase in apoplastic peroxidase activity precedes cessation of cell elongation. J Exp Bot 52:1673–1682
32. Domozych DS, Ciancia M, Fangel JU, Mikkelsen MD, Ulvskov P, Willats WG (2012) The cell walls of green algae: a journey through evolution and diversity. Front Plant Sci 3:82
33. Du CX, Fan HF, Guo SR, Tezuka T, Li J (2010) Proteomic analysis of cucumber seedling roots subjected to salt stress. Phytochemistry 71:1450–1459
34. Feiz L, Irshad M, Pont-Lezica R, Canut H, Jamet E (2006) Evaluation of cell wall preparations for proteomics: a new procedure for purifying cell walls from Arabidopsis hypocotyls. Plant Methods 2:10
35. Gallardo K, Job C, Groot SP, Puype M, Demol H, Vandekerckhove J, Job D (2001) Proteomic analysis of arabidopsis seed germination and priming. Plant Physiol 126:835–848
36. Gallardo K, Job C, Groot SP, Puype M, Demol H, Vandekerckhove J, Job D (2002) Proteomics of Arabidopsis seed germination. A comparative study of wild-type and gibberellin-deficient seeds. Plant Physiol 129:823–837
37. Gao BB, Stuart L, Feener EP (2008) Label-free quantitative analysis of one-dimensional PAGE LC/MS/MS proteome: application on angiotensin II-stimulated smooth muscle cells secretome. Mol Cell Proteomics MCP 7:2399–2409
38. Hajduch M, Ganapathy A, Stein JW, Thelen JJ (2005) A systematic proteomic study of seed filling in soybean. Establishment of high-resolution two-dimensional reference maps, expression profiles, and an interactive proteome database. Plant Physiol 137:1397–1419
39. Hajheidari M, Abdollahian-Noghabi M, Askari H, Heidari M, Sadeghian SY, Ober ES, Salekdeh GH (2005) Proteome analysis of sugar beet leaves under drought stress. Proteomics 5:950–960
40. Harder A, Wildgruber R, Nawrocki A, Fey SJ, Larsen PM, Gorg A (1999) Comparison of yeast cell protein solubilization procedures for two-dimensional electrophoresis. Electrophoresis 20:826–829
41. Haslam RP, Downie AL, Raventon M, Gallardo K, Job D, Pallett KE, John P, Parry MAJ, Coleman JOD (2003) The assessment of enriched apoplastic extracts using proteomic approaches. Ann Appl Biol 143:81–91
42. Hebeler R, Oeljeklaus S, Reidegeld KA, Eisenacher M, Stephan C, Sitek B, Stuhler K, Meyer HE, Sturre MJ, Dijkwel PP, Warscheid B (2008) Study of early leaf senescence in Arabidopsis thaliana by quantitative proteomics using reciprocal 14N/15N labeling and difference gel electrophoresis. Mol Cell Proteomics MCP 7:108–120
43. Hiilovaara-Teijo M, Hannukkala A, Griffith M, Yu XM, Pihakaski-Maunsbach K (1999) Snow-mold-induced apoplastic proteins in winter rye leaves lack antifreeze activity. Plant Physiol 121:665–674
44. Holmes P, Farquharson R, Hall PJ, Rolfe BG (2006) Proteomic analysis of root meristems and the effects of acetohydroxyacid synthase-inhibiting herbicides in the root of Medicago truncatula. J Proteome Res 5:2309–2316
45. Hopkins JF, Spencer DF, Laboissiere S, Neilson JA, Eveleigh RJ, Durnford DG, Gray MW, Archibald JM (2012) Proteomics reveals plastid- and periplastid-targeted proteins in the chlorarachniophyte alga Bigelowiella natans. Genome Biol Evol 4:1391–1406
46. Hu G, Koh J, Yoo M-J, Grupp K, Chen S, Wendel JF (2013) Proteomic profiling of developing cotton fibers from wild and domesticate Gossypium barbadense. New Phytol 200:570–582
47. Hu G, Koh J, Yoo MJ, Pathak D, Chen S, Wendel JF (2014) Proteomics profiling of fiber development and domestication in upland cotton (Gossypium hirsutum L.). Planta 240:1237
48. Hurkman WJ, Tanaka CK (1986) Solubilization of plant membrane proteins for analysis by two-dimensional gel electrophoresis. Plant Physiol 81:802–806
49. Irshad M, Canut H, Borderies G, Pont-Lezica R, Jamet E (2008) A new picture of cell wall protein dynamics in elongating cells of Arabidopsis thaliana: confirmed actors and newcomers. BMC Plant Biol 8:94
50. Jiang Y, Yang B, Harris NS, Deyholos MK (2007) Comparative proteomic analysis of NaCl stress-responsive proteins in Arabidopsis roots. J Exp Bot 58:3591–3607
51. Kamal AH, Cho K, Kim DE, Uozumi N, Chung KY, Lee SY, Choi JS, Cho SW, Shin CS, Woo SH (2012) Changes in physiology and protein abundance in salt-stressed wheat chloroplasts. Mol Biol Rep 39:9059–9074
52. Kamal AH, Cho K, Komatsu S, Uozumi N, Choi JS, Woo SH (2012) Towards an understanding of wheat chloroplasts: a methodical investigation of thylakoid proteome. Mol Biol Rep 39:5069–5083
53. Komatsu S, Kajiwara H, Hirano H (1993) A rice protein library: a data-file of rice proteins separated by two-dimensional electrophoresis. TAG Theor Appl Genet Theoretische und angewandte Genetik 86:935–942
54. Komatsu S, Muhammad A, Rakwal R (1999) Separation and characterization of proteins from green and etiolated shoots of rice (Oryza sativa L.): towards a rice proteome. Electrophoresis 20:630–636
55. Komatsu S, Rakwal R, Li Z (1999) Separation and characterization of proteins in rice (Oryza sativa) suspension cultured cells. Plant Cell Tissue Organ Cult 55:183–192
56. Krishnan HB, Natarajan SS (2009) A rapid method for depletion of Rubisco from soybean (Glycine max) leaf for proteomic analysis of lower abundance proteins. Phytochemistry 70:1958–1964
57. Krishnan HB, Oehrle NW, Natarajan SS (2009) A rapid and simple procedure for the depletion of abundant storage proteins from legume seeds to advance proteome analysis: a case study using Glycine max. Proteomics 9:3174–3188
58. Kwon HK, Yokoyama R, Nishitani K (2005) A proteomic approach to apoplastic proteins involved in cell wall regeneration in protoplasts of Arabidopsis suspension-cultured cells. Plant Cell Physiol 46:843–857
59. Lan P, Li W, Wen TN, Shiau JY, Wu YC, Lin W, Schmidt W (2011) iTRAQ protein profile analysis of Arabidopsis roots reveals new aspects critical for iron homeostasis. Plant Physiol 155:821–834
60. Li ZC, McClure JW, Hagerman AE (1989) Soluble and bound apoplastic activity for peroxidase, beta-d-glucosidase, malate dehydrogenase, and nonspecific Arylesterase, in barley (Hordeum vulgare L.) and oat (Avena sativa L.) primary leaves. Plant Physiol 90:185–190
61. Liu D, Ford KL, Roessner U, Natera S, Cassin AM, Patterson JH, Bacic A (2013) Rice suspension cultured cells are evaluated as a model system to study salt responsive networks in plants using a combined proteomic and metabolomic profiling approach. Proteomics 13:2046–2062
62. Luche S, Santoni V, Rabilloud T (2003) Evaluation of nonionic and zwitterionic detergents as membrane protein solubilizers in two-dimensional electrophoresis. Proteomics 3:249–253
63. Lv DW, Subburaj S, Cao M, Yan X, Li X, Appels R, Sun DF, Ma W, Yan YM (2014) Proteome and phosphoproteome characterization reveals new response and defense mechanisms of Brachypodium distachyon leaves under salt stress. Mol Cell Proteomics MCP 13:632–652
64. Macadam JW, Sharp RE, Nelson CJ (1992) Peroxidase activity in the leaf elongation zone of tall fescue: II. Spatial distribution of apoplastic peroxidase activity in genotypes differing in length of the elongation zone. Plant Physiol 99:879–885
65. Marsh E, Alvarez S, Hicks LM, Barbazuk WB, Qiu W, Kovacs L, Schachtman D (2010) Changes in protein abundance during powdery mildew infection of leaf tissues of Cabernet Sauvignon grapevine (Vitis vinifera L.). Proteomics 10:2057–2064
66. Marsoni M, Vanini C, Campa M, Cucchi U, Espen L, Bracale M (2005) Protein extraction from grape tissues by two-dimensional electrophoresis. Vitis 44:181–186
67. Maytalman D, Mert Z, Baykal AT, Inan C, Gunel A, Hasancebi S (2013) Proteomic analysis of early responsive resistance proteins of wheat (Triticum aestivum) to yellow rust (Puccinia striformis f. sp. tritici) using ProteomeLab PF2D. Plant OMICS J 6:24–35
68. Mechin V, Consoli L, Le Guilloux M, Damerval C (2003) An efficient solubilization buffer for plant proteins focused in immobilized pH gradients. Proteomics 3:1299–1302
69. Mechin V, Thevenot C, Le Guilloux M, Prioul JL, Damerval C (2007) Developmental analysis of maize endosperm proteome suggests a pivotal role for pyruvate orthophosphate dikinase. Plant Physiol 143:1203–1219
70. Morrow DL, Jones RL (1986) Localization and partial characterization of the extracellular proteins centrifuged from pea internodes. Physiol Plant 67:397–407
71. Nam MH, Huh SM, Kim KM, Park WJ, Seo JB, Cho K, Kim DY, Kim BG, Yoon IS (2012) Comparative proteomic analysis of early salt stress-responsive proteins in roots of SnRK2 transgenic rice. Proteome Sci 10:25
72. Noah AM, Niemenak N, Sunderhaus S, Haase C, Omokolo DN, Winkelmann T, Braun HP (2013) Comparative proteomic analysis of early somatic and zygotic embryogenesis in Theobroma cacao L. J Proteome 78:123–133
73. O’Farrell PH (1975) High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250:4007–4021
74. Olivieri F, Godoy AV, Escande A, Casalongue CA (1998) Analysis of intercellular washing fluids of potato tubers and detection of increased proteolytic activity upon fungal infection. Physiol Plant 10:232–238
75. Pavokovic D, Kriznik B, Krsnik-Rasol M (2012) Evaluation of protein extraction methods for proteomic analysis of non-model recalcitrant plant tissues. Croat Chem Acta 85:177–183
76. Pawlowski TA (2007) Proteomics of European beech (Fagus sylvatica L.) seed dormancy breaking: influence of abscisic and gibberellic acids. Proteomics 7:2246–2257
77. Pawlowski TA (2009) Proteome analysis of Norway maple (Acer platanoides L.) seeds dormancy breaking and germination: influence of abscisic and gibberellic acids. BMC Plant Biol 9:48
78. Peltier JB, Cai Y, Sun Q, Zabrouskov V, Giacomelli L, Rudella A, Ytterberg AJ, Rutschow H, van Wijk KJ (2006) The oligomeric stromal proteome of Arabidopsis thaliana chloroplasts. Mol Cell Proteomics MCP 5:114–133
79. Peltier JB, Friso G, Kalume DE, Roepstorff P, Nilsson F, Adamska I, van Wijk KJ (2000) Proteomics of the chloroplast: systematic identification and targeting analysis of lumenal and peripheral thylakoid proteins. Plant Cell 12:319–341
80. Peng Z, Wang M, Li F, Lv H, Li C, Xia G (2009) A proteomic study of the response to salinity and drought stress in an introgression strain of bread wheat. Mol Cell Proteomics MCP 8:2676–2686
81. Pirovani CP, Carvalho HA, Machado RC, Gomes DS, Alvim FC, Pomella AW, Gramacho KP, Cascardo JC, Pereira GA, Micheli F (2008) Protein extraction for proteome analysis from cacao leaves and meristems, organs infected by Moniliophthora perniciosa, the causal agent of the witches’ broom disease. Electrophoresis 29:2391–2401
82. Qiu X, Wong G, Audet J, Bello A, Fernando L, Alimonti JB, Fausther-Bovendo H, Wei H, Aviles J, Hiatt E, Johnson A, Morton J, Swope K, Bohorov O, Bohorova N, Goodman C, Kim D, Pauly MH, Velasco J, Pettitt J, Olinger GG, Whaley K, Xu B, Strong JE, Zeitlin L, Kobinger GP (2014) Reversion of advanced Ebola virus disease in nonhuman primates with ZMapp. Nature 514:47
83. Raharjo TJ, Widjaja I, Roytrakul S, Verpoorte R (2004) Comparative proteomics of Cannabis sativa plant tissues. J Biomol Tech JBT 15:97–106
84. Rodrigues EP, Torres AR, da Silva Batista JS, Huergo L, Hungria M (2012) A simple, economical and reproducible protein extraction protocol for proteomics studies of soybean roots. Genet Mol Biol 35:348–352
85. Roger D, David A, David H (1996) Immobilization of flax protoplasts in agarose and alginate beads. Correlation between ionically bound cell-wall proteins and morphogenetic response. Plant Physiol 112:1191–1199
86. Samyn B, Sergeant K, Carpentier S, Debyser G, Panis B, Swennen R, Van Beeumen J (2007) Functional proteome analysis of the banana plant (Musa spp.) using de novo sequence analysis of derivatized peptides. J Proteome Res 6:70–80
87. Saravanan RS, Rose JK (2004) A critical evaluation of sample extraction techniques for enhanced proteomic analysis of recalcitrant plant tissues. Proteomics 4:2522–2532
88. Sebastiana M, Figueiredo A, Monteiro F, Martins J, Franco C, Coelho AV, Vaz F, Simoes T, Penque D, Pais MS, Ferreira S (2013) A possible approach for gel-based proteomic studies in recalcitrant woody plants. SpringerPlus 2:210
89. Shah P, Powell AL, Orlando R, Bergmann C, Gutierrez-Sanchez G (2012) Proteomic analysis of ripening tomato fruit infected by Botrytis cinerea. J Proteome Res 11:2178–2192
90. Sharathchandra RG, Stander C, Jacobson D, Ndimba B, Vivier MA (2011) Proteomic analysis of grape berry cell cultures reveals that developmentally regulated ripening related processes can be studied using cultured cells. PLoS One 6:e14708
91. Shen S, Sharma A, Komatsu S (2003) Characterization of proteins responsive to gibberellin in the leaf-sheath of rice (Oryza sativa L.) seedling using proteome analysis. Biol Pharm Bull 26:129–136
92. Shi Y, Jiang L, Zhang L, Kang R, Yu Z (2014) Dynamic changes in proteins during apple (Malus x domestica) fruit ripening and storage. Hortic Res 1:6
93. Shoresh M, Harman GE (2008) The molecular basis of shoot responses of maize seedlings to Trichoderma harzianum T22 inoculation of the root: a proteomic approach. Plant Physiol 147:2147–2163
94. Silva-Sanchez C, Chen S, Zhu N, Li QB, Chourey PS (2013) Proteomic comparison of basal endosperm in maize miniature1 mutant and its wild-type Mn1. Front Plant Sci 4:211
95. Sobhanian H, Razavizadeh R, Nanjo Y, Ehsanpour AA, Jazii FR, Motamed N, Komatsu S (2010) Proteome analysis of soybean leaves, hypocotyls and roots under salt stress. Proteome Sci 8:19
96. Song J, Braun G, Bevis E, Doncaster K (2006) A simple protocol for protein extraction of recalcitrant fruit tissues suitable for 2-DE and MS analysis. Electrophoresis 27:3144–3151
97. Sun J, Fu J, Zhou R (2014) Proteomic analysis of differentially expressed proteins induced by salicylic acid in suspension-cultured ginseng cells. Saudi J Biol Sci 21:185–190
98. Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, Radenbaugh A, Singh S, Swing V, Tissier C, Zhang P, Huala E (2008) The Arabidopsis Information Resource (TAIR): gene
structure and function annotation. Nucleic Acids Res 36:D1009–D1014
99. Tsugama D, Liu S, Takano T (2011) A rapid chemical method for lysing Arabidopsis cells for protein analysis. Plant Methods 7:22
100. Tsugita A, Kawakami T, Uchiyama Y, Kamo M, Miyatake N, Nozu Y (1994) Separation and characterization of rice proteins. Electrophoresis 15:708–720
101. Vander Mijnsbrugge K, Meyermans H, Van Montagu M, Bauw G, Boerjan W (2000) Wood formation in poplar: identification, characterization, and seasonal variation of xylem proteins. Planta 210:589–598
102. Verdonk JC, Hatfield RD, Sullivan ML (2012) Proteomic analysis of cell walls of two developmental stages of alfalfa stems. Front Plant Sci 3:279
103. Vincent D, Lapierre C, Pollet B, Cornic G, Negroni L, Zivy M (2005) Water deficits affect caffeate O-methyltransferase, lignification, and related enzymes in maize leaves. A proteomic investigation. Plant Physiol 137:949–960
104. Wan J, Torres M, Ganapathy A, Thelen J, DaGue BB, Mooney B, Xu D, Stacey G (2005) Proteomic analysis of soybean root hairs after infection by Bradyrhizobium japonicum. Mol Plant-Microbe Interact MPMI 18:458–467
105. Wang H, Alvarez S, Hicks LM (2012) Comprehensive comparison of iTRAQ and label-free LC-based quantitative proteomics approaches using two Chlamydomonas reinhardtii strains of interest for biofuels engineering. J Proteome Res 11:487–501
106. Wang W, Scali M, Vignani R, Spadafora A, Sensi E, Mazzuca S, Cresti M (2003) Protein extraction for two-dimensional electrophoresis from olive leaf, a plant tissue containing high levels of interfering compounds. Electrophoresis 24:2369–2375
107. Wang W, Vignani R, Scali M, Cresti M (2006) A universal and rapid protocol for protein extraction from recalcitrant plant tissues for proteomic analysis. Electrophoresis 27:2782–2786
108. Watson BS, Lei Z, Dixon RA, Sumner LW (2004) Proteomics of Medicago sativa cell walls. Phytochemistry 65:1709–1720
109. Watson BS, Sumner LW (2007) Isolation of cell wall proteins from Medicago sativa stems. Methods Mol Biol 355:79–92
110. Witzel K, Weidner A, Surabhi GK, Borner A, Mock HP (2009) Salt stress-induced alterations in the root proteome of barley genotypes with contrasting response towards salinity. J Exp Bot 60:3545–3557
111. Wu FS, Wang MY (1984) Extraction of proteins for sodium dodecyl sulfate-polyacrylamide gel electrophoresis from protease-rich plant tissues. Anal Biochem 139:100–103
112. Wu X, Xiong E, Wang W, Scali M, Cresti M (2014) Universal sample preparation method integrating trichloroacetic acid/acetone precipitation with phenol extraction for crop proteomic analysis. Nat Protoc 9:362–374
113. Xi J, Wang X, Li S, Zhou X, Yue L, Fan J, Hao D (2006) Polyethylene glycol fractionation improved detection of low-abundant proteins by two-dimensional electrophoresis analysis of plant proteome. Phytochemistry 67:2341–2348
114. Yan S, Tang Z, Su W, Sun W (2005) Proteomic analysis of salt stress-responsive proteins in rice root. Proteomics 5:235–244
115. Yan SP, Zhang QY, Tang ZC, Su WA, Sun WN (2006) Comparative proteomic analysis provides new insights into chilling stress responses in rice. Mol Cell Proteomics MCP 5:484–496
116. Yang L, Zhang Y, Zhu N, Koh J, Ma C, Pan Y, Yu B, Chen S, Li H (2013) Proteomic analysis of salt tolerance in sugar beet monosomic addition line M14. J Proteome Res 12:4931–4950
117. Zhang H, Lian C, Shen Z (2009) Proteomic identification of small, copper-responsive proteins in germinating embryos of Oryza sativa. Ann Bot 103:923–930
118. Zhang Y, Gao P, Xing Z, Jin S, Chen Z, Liu L, Constantino N, Wang X, Shi W, Yuan JS, Dai SY (2013) Application of an improved proteomics method for abundant protein cleanup: molecular and genomic mechanisms study in plant defense. Mol Cell Proteomics MCP 12:3431–3442
119. Zhao X, Ren J, Cui N, Fan H, Yu G, Li T (2013) Preparation of protein extraction from flower buds of Solanum lycopersicum for two-dimensional gel electrophoresis. Br Biotechnol J 3:183–190
120. Zheng Q, Song J, Campbell-Palmer L, Thompson K, Li L, Walker B, Cui Y, Li X (2013) A proteomic investigation of apple fruit during ripening and in response to ethylene treatment. J Proteome 93:276–294
121. Zhong B, Karibe H, Komatsu S, Ichimura H, Nagamura Y, Sasaki T, Hirano H (1997) Screening of rice genes from a cDNA catalog based on the sequence data-file of proteins separated by two-dimensional electrophoresis. Breed Sci 47:245–251
122. Zhou X, Wang K, Lv D, Wu C, Li J, Zhao P, Lin Z, Du L, Yan Y, Ye X (2013) Global analysis of differentially expressed genes and proteins in the wheat callus infected by Agrobacterium tumefaciens. PLoS One 8:e79390
123. Zhu J, Chen S, Alvarez S, Asirvatham VS, Schachtman DP, Wu Y, Sharp RE (2006) Cell wall proteome in the maize primary root elongation zone. I. Extraction and identification of water-soluble and lightly ionically bound proteins. Plant Physiol 140:311–325

5 Improving Proteome Coverage by Reducing Sample Complexity via Chromatography

Uma Kota and Mark L. Stolowitz

Abstract
High performance liquid chromatography (HPLC) is one of the most powerful analytical tools in proteomics and has revolutionized the field. Formerly known as high pressure liquid chromatography, the technique was introduced in the early 1960s to improve the efficiency of liquid chromatography separations by using small stationary phase particles packed in columns. Since its introduction, continued advances in column technology, the development of different stationary phase materials and improved instrumentation have allowed the full potential of the technique to be realized. The various modes of HPLC, in combination with mass spectrometry, have evolved into the principal analytical technique in proteomics. It is now common practice to combine different types of HPLC in a multidimensional workflow to identify and quantify peptides and proteins with high sensitivity and resolution from limited amounts of sample. More recently, the introduction of ultra high performance liquid chromatography (UHPLC) has further raised the level of performance, with significant increases in resolution, speed and sensitivity. The number of applications of HPLC and UHPLC in proteomics has been expanding rapidly, and these techniques will continue to be pivotal analytical tools. The aim of the following sections is to familiarize the beginner with the various HPLC methods routinely used in proteomics and to provide sufficient practical knowledge of each to develop a separation and analytical protocol.

U. Kota • M.L. Stolowitz (*)
Canary Center at Stanford for Cancer Early Detection, Palo Alto, CA 94304, USA
e-mail: mstolowitz@stanford.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_5

5.1 High Performance Liquid Chromatography

High performance liquid chromatography (HPLC) is currently one of the most powerful analytical tools that has revolutionized the field of proteomics. Formerly known as high pressure
84 U. Kota and M.L. Stolowitz

liquid chromatography, this technique was introduced in the early 1960s to improve the efficiency of liquid chromatography separations using small stationary phase particles packed in columns. Since its introduction, continued advancements in column technology, development of different stationary phase materials and improved instrumentation have allowed the full potential of this technique to be realized. The various modes of HPLC in combination with mass spectrometry have evolved into the principal analytical technique in proteomics. It is now common practice to combine different types of HPLC in a multidimensional workflow to identify and quantify peptides and proteins with high sensitivity and resolution from limited amounts of sample. More recently, the introduction of Ultra High Performance Liquid Chromatography (UHPLC) has further raised the level of performance of this technique with significant increases in resolution, speed and sensitivity. The number of applications of HPLC and UHPLC in proteomics has been rapidly expanding and will continue to

time, thus making it a reliable and cost-effective fractionation method. The resolving power of RPC is reflected by its frequent use in multidimensional separations [38, 97, 98, 128, 162, 166]. The combined desalting and purification aspects of RP-HPLC make it suitable as the final step of a multidimensional fractionation protocol, specifically prior to analysis of purified solutes by mass spectrometry [31].

Theory of Reversed-Phase Chromatography

The separation of biomolecules by RPC is based on the reversible hydrophobic interaction between the sample in the mobile phase and the stationary phase. The distribution of the sample between the two phases depends on the binding properties of the stationary phase, the hydrophobicity of the sample molecule and the composition of the mobile phase. RPC is an adsorptive process that relies on a partitioning mechanism to effect separation. Separation relies on an equilibrium between the sample molecules in the eluent and the surface of the stationary phase. The stationary phase is more
be a pivotal analytical technique. The aim of the hydrophobic than the mobile phase when an aque-
following sections is to familiarize the beginner ous/organic solvent mobile phase is used. Initial
with the various HPLC methods routinely used in conditions are primarily aqueous, favoring a high
proteomics and provide sufficient practical knowl- degree of organized water structure surrounding
edge regarding each of them to develop a separa- the sample molecule and favoring the adsorption
tion and analytical protocol. of the sample molecule from the mobile phase
onto the stationary phase. A small percentage of
organic modifier, typically 3–5 % acetonitrile is
5.2 Reversed-Phase present in order to achieve a “wetted” surface. As
Chromatography sample binds to the stationary phase, the hydro-
phobic area exposed to the mobile phase is
Reversed-Phase chromatography (RPC) is rou- minimized, thus the degree of organized water
tinely used for the high-resolution separation structure is diminished [168]. Bound samples are
of proteins, peptides and nucleic acids. Most desorbed from the stationary phase by adjusting
common applications include desalting, the polarity of the mobile phase over time by
concentrating samples, peptide mapping, purifi- increasing the final concentration of the organic
cation procedures and determining purity. RPC is solvent in the final mobile phase, such that the
one of the most widely applied separation bound molecules tend to dissociate from the sta-
techniques on an analytical scale due to several tionary phase back into the mobile phase, in the
reasons. It is a robust technique that can be order of increasing hydrophobicity.
applied to a wide range of molecules including
charged and polar molecules. This separation RPC almost always uses gradient elution
technique allows precise control of variables instead of isocratic elution. Peptides and proteins
such as organic solvent type and concentration, have a mix of accessible hydrophilic and hydro-
pH, temperature. In addition to being highly phobic amino acid side chains, hence interaction
reproducible, RPC columns are also known to with the stationary phase has the nature of a
be stable and efficient over long periods of multi-point attachment. Furthermore, although
these biomolecules adsorb strongly to the surface of an RP matrix under aqueous conditions, they desorb from the matrix within a very narrow window of organic modifier concentration. Any given biological sample typically contains a broad mixture of biomolecules with a diverse range of adsorption affinities and hence the only practical method for reverse phase separation of such samples is by gradient elution. Separation in RPC is due to the different binding properties of the solutes present in the sample as a result of the differences in their hydrophobic properties. The degree of solute binding to the stationary phase can be controlled by manipulating the hydrophobic properties of the initial mobile phase. This allows a high degree of flexibility in separation conditions, allowing one to resolve solutes that vary only slightly in their hydrophobicity. Because of its excellent resolving power and great flexibility, RPC is an indispensable technique for high performance separation of complex biological samples and purification of desired solutes. Furthermore, since binding under the initial phase is absolute, the starting concentration of the desired solute in the sample is not critical, allowing diluted samples to be applied to the column.

5.2.1 Column Characteristics

Stationary Phases and Bonding Chemistries The RPC system used for analysis of peptides and proteins usually consists of an n-alkyl-silica-based sorbent from which the solutes are eluted with gradients of increasing concentrations of organic solvent such as acetonitrile containing an ionic modifier such as trifluoroacetic acid (TFA) [2]. The chromatographic packing material used in RPC is commonly based on microparticulate porous silica that is chemically modified by a reactive silane containing an n-alkyl hydrophobic ligand. The most commonly used ligands are hydrocarbons such as n-butyl (C4), n-octyl (C8) and n-octadecyl (C18). The process of chemical immobilization of the ligands on silica results in approximately only half the silica surface being modified [1]. This is due to steric hindrance from the large and bulky C8 and C18 ligands that often prevents complete derivatization of all the silanol groups. This partial derivatization can lead to undesirable mixed mode ion exchange effects due to the residual polar silanol groups. Therefore, the residual silanols are subjected to further silanization with reactive trimethylsilane reagents to yield a so-called end-capped packing material. Reproducible chemical derivatization of the silanol surface as well as capping is critical for efficient reverse phase chromatography with batch-to-batch reproducibility.

The most important stationary phase properties that have a profound influence on retention and selectivity in RPC are the type of native silica, the silanol content and the carbon loading. Spherical silica gel is the most commonly used packing material for RPC. While the porous silica beads are chemically stable at low pH and in organic solvents used for RPC, they are chemically unstable in aqueous solution at high pH and not recommended for prolonged exposure above pH 7.5, particularly at elevated temperature. Alternatively, synthetic organic polymer-based columns have become increasingly popular as reversed phase media. The commonly used polymers are polystyrene, methylacrylate, polyethylene and polypropylene. Polystyrene-based columns have particularly been used in large scale preparative chromatography because of their excellent chemical stability, particularly under strong acidic and basic conditions [157]. Polymer-based RPC columns have several intrinsic features that give them key advantages over silica based columns. Apart from their chemical stability, polymer-based columns have uniform particles and high physical stability. The polymeric reverse-phase sorbents allow the mobile phase to perfuse through the sorbent matrix, thus allowing the transport of the solutes into the interior of the sorbent particle much more rapidly than by simple diffusion as in the case of silica particles. Consequently separations can be achieved at higher flow rates with shorter re-equilibration times.

The carbon load is dependent on the choice of n-alkyl ligands and their density. The type of ligand has a significant influence on the retention of peptides and proteins. In general proteins and large peptides are best separated on short RPC
columns that have less hydrophobic n-butyl ligands bonded to wide pore silica gels (e.g. 300 Å). This allows greater protein recovery and conformational integrity. The more hydrophobic longer alkyl chain ligands (C18 ligands) are generally useful for the separation of peptides smaller than ~2000–3000 Da and are commonly used for separation of peptides resulting from protease digestion of proteins.

Other ligands such as phenyl, including phenyl-hexyl, diphenyl and cyanopropyl ligands have also been used for RPC but afford different retention characteristics and can provide different selectivities [21, 103, 184]. Polar modified reverse phase columns (polar embedded or polar end-capped groups) enhance interaction between the peptides and the particles and result in different selectivity. In general, polar-endcapped phases display similar hydrophobic retention characteristics as conventional C18 columns, but express higher hydrogen bonding and silanol activity. Polar-embedded phases, by contrast, display the opposite behavior, with greatly reduced hydrophobic properties compared to both conventional C18 and polar-endcapped phases as well as reduced silanol activity [89].

Surface Area and Pore Size While retention is primarily controlled by the bonded phase chemistry and mobile phase chemistry, the surface area of the packing material also plays an important role. The surface area available is dependent on the pore size of the particle used for packing. The pore size of a column is selected so that the sample molecules have easy access to the pores. Smaller-pore columns are desired because of their higher surface area, as long as the analytes are sufficiently small to easily enter the pores. The surface area of a particle is inversely proportional to the pore diameter, so a 3-μm particle size, 100 Å pore column will have approximately three times the surface area of a 3-μm, 300 Å pore column. Particles with pores of 100–150 Å are used for peptides and small molecules while particles with 300 Å pore size are used for separation of proteins.

Particle Size The separation efficiency of a column depends on the particle size, column length and flow rate. The particle size is defined as the mean diameter of the silica spheres used as the support material. Large particle sizes (10–15 μm) are generally used for large scale preparative and process applications due to their increased capacity and low pressure requirements at high flow rates. Small scale preparative and analytical scale chromatography routinely use 3 μm and 5 μm size particles. Until recently, the practical particle size limit was around 3 μm since smaller particle sizes often limit the use of conventional liquid chromatography (LC) systems with a standard pressure rating of 5800 psi (400 bar). However, with the continuous demand for high throughput analysis and higher resolution, the use of sub-2 μm stationary-phase support particles was made possible by the advent of LC systems capable of handling very high back pressures (>1400 bar or 20,000 psi). This technique, termed ultra-high pressure liquid chromatography (UHPLC), allows the use of smaller particle size columns with higher efficiency and a wider range of usable flow rates, resulting in better resolution and higher sensitivity with significantly faster overall analysis time.

The substantial gain in column efficiency and sample throughput achieved by using sub-2 μm particles can be explained using the Van Deemter equation that describes the relationship between the column efficiency (measured in terms of plate height) versus the flow rate (linear velocity, μ) [160].

H = A + B/μ + Cμ

The term H refers to the plate height and is defined as the distance a compound must travel in a column needed to separate two similar analytes. Plate height (H) is derived by dividing the column length (L) by the calculated number of theoretical plates (N). It is desirable to have the smallest plate height in order to obtain the maximum number of plates. Hence a column with a higher N will provide a narrower peak at a given retention time (RT) than a column with a lower N. The term A represents “eddy diffusion or multi-path effect” that the analytes experience as they travel through a packed bed. The A term is directly proportional to particle size (dp) and is smaller in well-packed columns. The B term represents “longitudinal diffusion” of the solute
band in the mobile phase and is proportional to the diffusion coefficient (Dm) of the solute. Longitudinal diffusion has a negative effect on resolution at lower flow rates and is not significant at higher flow rates. The C term represents “resistance to mass transfer coefficient” of the analyte between the mobile phase and stationary phase. Since it takes time for analytes in the mobile phase to move into the stationary phase and vice versa, at faster flow rates, there is less time for equilibrium to be reached between the phases and the mass transfer effect on peak broadening is directly related to mobile phase velocity. As seen in the Van Deemter plot (plate height, H vs linear velocity, μ), which is a composite curve from three relatively independent parameters, the minimum point on the curve marks the minimum plate height (Hmin) and the optimum velocity (Vopt), which is the flow rate at maximum column efficiency.

Particles with small dp have shorter diffusion path lengths, thus allowing the solute to travel in and out of the particle faster. Therefore the analyte spends less time inside the particle where peak diffusion can occur. Hence the contributions from the A and B/μ terms are minimal even at higher flow rates when using columns packed with particles with small dp (sub-3 μm). At higher flow rates, the Van Deemter curve is dominated by the contribution from the C term, which is proportional to dp². Given the inherent higher efficiency of smaller-particle columns, they have come to dominate modern HPLC and are particularly useful in fast LC and in high-speed applications [41].

An alternative to sub-2 μm particles has been the development of the Fused Core particles. These particles consist of a solid 1.7 μm core with a 0.5 μm porous silica shell surrounding it (dp = 2.7 μm). These superficially porous particle columns offer significant advantages over conventional porous columns. Because the diffusion occurs in the porous outer shell and not the solid core, the fused core particle columns allow higher flow rates without sacrificing column efficiency [35]. Another advantage of the fused core particles is that their relatively large particle size greatly reduces the backpressure, thus providing a practical alternative to UHPLC. These columns provide similar efficiency, resolution and throughput as the sub-2 μm particle size columns but at conventional HPLC pressure limits [141].

Column Dimensions There are several critical parameters in HPLC that contribute to the resolution and recovery of proteins and peptides. These include column dimensions, flow rate, column temperature, mobile phase composition and gradient used for elution. Column dimensions affect the efficiency, sensitivity and speed of analysis. Further, the choice of column dimensions will depend on the chromatographic application (analytical, semi-preparative, preparative), the complexity of the sample, etc. Column dimensions consist of particle size, column length and internal diameter. The effect of particle size on efficiency has been discussed in detail above. While column efficiency is inversely proportional to dp, it is directly proportional to column length. Increasing the column length not only increases efficiency but also improves resolution. However, longer columns also lead to higher back pressure and longer analysis times. The resolution of larger molecules such as proteins and polypeptides is not significantly impacted by the column length as their interaction with the column packing happens in a single adsorption/desorption step near the top of the column and very little interaction takes place as these molecules elute down the column without affecting resolution. Column lengths play a more important role when resolving smaller peptides such as those generated from enzymatic digests where resolution can be improved by increasing the column length. The column internal diameter affects the sample capacity which is a function of sample volume. Consequently for two columns of equal diameter but differing in length, the longer column has higher sample capacity and higher resolution [142].

In general, short columns of 50–150 mm in length with 2.0–4.6 mm I.D., packed with 3- or 5-μm particles are recommended for the separation of large peptides and proteins. Longer columns, 150–250 mm and I.D. of 2.1–4.6 mm, packed with 1.8–3 μm particles or 2.7 μm fused core
particles are recommended for the separation of small peptides and enzymatic digests. Nano (0.075–0.1 mm), capillary (0.2–0.4 mm) or microbore (1–2 mm I.D.) columns are employed when sample is limited and/or higher sensitivity is required. The column dimensions also determine the flow rate used for separation, which in turn affects resolution. Typical analytical scale columns utilize flow rates ranging between 0.5 and 2.0 ml/min. With microbore columns flow rates of 50–250 μl/min are used, whereas capillary and nanobore columns typically utilize flow rates of 1–20 μl/min and 20–300 nl/min, respectively. The separation of large biomolecules is insensitive to flow rate. However, flow rate is an important factor for the separation of small peptides and protein digests in order to achieve good resolution. Additionally, the choice of column dimensions and flow rates is also determined by their compatibility with the type of HPLC/UHPLC.

Resolution can also be manipulated by controlling the operating temperature. Although reverse phase separations of proteins and peptides are normally performed at ambient temperature, the retention and/or resolution of analytes in RP-HPLC is influenced by temperature through changes in solvent viscosity. In general, an increase in temperature reduces retention in RPC and can have some effects on selectivity. Temperature variations have also been shown to affect the secondary structure of peptides and hence affect selectivity during RP-HPLC [31].

Mobile Phase The most important characteristic of RP-HPLC is the ability to manipulate the solute retention and resolution by changing the composition of the mobile phase used during the separation process. Since the peptides and proteins bind to the RPC column under aqueous conditions and elute as the hydrophobicity of the mobile phase increases, high resolution of complex mixtures is often achieved by applying a gradient of increasing organic solvent concentration. The most commonly used organic solvents in the order of their eluotropic strength are acetonitrile, methanol and isopropanol, all of them being readily miscible with water. Acetonitrile is the most popular choice for most peptide and protein fractionation protocols due to its lower viscosity (lower back pressure) and good “wetting” properties even at low concentrations of organic solvent (% B) in the mobile phase; it is also highly volatile, allowing easy sample preparation for downstream mass spectrometry analysis. Additionally, acetonitrile exhibits high optical transparency in the detection wavelength of proteins and peptides, making it suitable for UV detection [1].

Most preparative and analytical, high resolution separations of proteins and peptides are carried out using gradient elution. Method development starts with carrying out the separation with an initial mobile phase which is highly aqueous (3–5 % B) and rapidly increasing the % of organic solvent over a short period of time. The retention and elution of the analytes of interest can be further optimized by adjusting the concentration of organic solvent in the mobile phase and/or modifying the gradient length. High resolution analyses typically use longer gradients in order to allow as many components in the analyte mixture as possible to bind to the RP column and then elute them differentially to obtain a comprehensive profile. For preparative applications, the gradient conditions are optimized to allow the separation of the analyte of interest from contaminants. Desalting of samples is a low resolution application and is typically done using a step gradient. The hydrophilic contaminants and salt are eluted under low organic conditions and the more hydrophobic components are eluted at a higher concentration of the organic solvent.

Besides organic modifier, altering pH can also improve control over the selectivity and in some cases improve ionization and solubility. RP-HPLC is generally carried out with trifluoroacetic acid (TFA). This anionic counter ion interacts with the protonated groups on the proteins/peptides, suppresses their influence on the overall hydrophobicity and enhances binding to the stationary phase. Thus the use of an ion pairing agent can alter the retention behavior and subsequent selectivity.

Column Sources A partial list of the popular silica-based columns from different vendors can be found in Table 5.1. There are also several
Table 5.1 RPC column manufacturers and products

Manufacturer/vendor             Product name
Agilent                         Zorbax StableBond, Eclipse Plus, Bonus, Extend-C18, Poroshell 120 and 300
Waters                          Atlantis, Symmetry, SunFire, X-Bridge, ACQUITY, XTerra, CORTECS
Thermo/Dionex                   Acclaim, Hypersil, Hypersil GOLD, Syncronis, Accucore, Accucore XL
Sigma-Aldrich (Supelco)         Discovery, Supelcosil, Ascentis Express, Ascentis Express Peptide-ES 160 Å
GL Sciences                     Inertsil, InertSustain
The Nest Group Inc.             GRACE/Vydac®, HAISIL, PROTO™, TARGA
EMD Millipore                   Chromolith, CapRod, LiChrospher
Macherey-Nagel                  Nucleosil, Nucleoshell
AkzoNobel                       Kromasil
MacMod                          ProntoSIL, ACE Ultracore
Advanced Material Technology    Halo, Halo Peptide-ES 160 Å
Phenomenex                      Gemini, Luna, Synergi, Kinetex, Aeris Peptide

commercially available software and automated systems for HPLC method development and computer aided optimization, some of which are listed below and also reviewed in some key references [114, 151].

• DryLab (http://molnar-institute.com/drylab/)
• ChromSwordAuto (http://www.chromsword.com/en/products/method_development/chromswordauto/automated_hplc_method_development/)
• Fusion LC Method Development (www.smatrix.com/fusion_lc_method_dev.html)
• ACD/AutoChrom (http://www.acdlabs.com/products/com_iden/meth_dev/autochrom/)

Because many factors influence separation (column efficiency, type of stationary phase, flow rate, pH etc.) and challenging samples often require the simultaneous adjustment of several variables, many researchers use computer-facilitated method development. The software uses a small number of experimental runs to simulate the chromatographic separation when any of the several conditions are changed. The experimental data are used to “calibrate” the software for a given sample, after which simulated runs can be carried out by entering new conditions into the computer [151]. Alternatively, the software may summarize the results of a large number of such simulations in the form of convenient resolution maps, allowing the user to analyze the results and identify the most promising chromatographic conditions, which can then be subjected to fine optimization. Computer facilitated method development has been available for over two decades and has been predominantly used in the pharmaceutical industry. Its application to proteomics workflows has been limited and is yet to be fully utilized.

5.3 Ion Exchange Chromatography

Theory The range of concentrations in conjunction with the large number of proteins in any given proteome requires the use of multidimensional separation strategies in order to obtain comprehensive profiles. Ion exchange chromatography (IEX) is commonly used as the first fractionation step in chromatographic multidimensional separations involving proteins. Separation in IEX is based on Coulombic interactions between ionic components of proteins/peptides and the charged stationary phase [38]. The stationary phases for IEX are characterized by the nature and strength of the acidic or basic moieties covalently attached to their surfaces and the types of ions they attract. Anion exchangers contain positively charged groups and retain negatively charged analytes, whereas cation exchangers retain positively charged analytes on their negatively charged surface. The binding and elution are based on competition between the charged groups on the proteins/peptides and the charged counter ions
in the mobile phase for binding the oppositely charged groups on the stationary phase. Elution of bound proteins/peptides is commonly done by increasing the ionic strength of the mobile phase. The salt concentration in the mobile phase can be controlled such that the anion or cation population in the solution competitively displaces the analytes bound to the stationary phase. Alternatively, a change in the pH of the mobile phase alters the ionic properties of the functional groups on both the stationary phase and analytes. Thus separation of bioanalytes can be performed either by gradient or isocratic elution, allowing more variability in the design of the IEX experiments [155]. Strong ion exchangers bear functional groups that remain ionized over a wide range of pH (these include sulfonic acid and quaternary ammonium moieties) and are used to separate weakly basic and acidic analytes. The bound analytes are eluted by displacement with salts that have a higher affinity to the stationary phase exchange sites, i.e. by salt elution. Weak ion exchangers bear functional groups that are titratable over a narrow pH range (these include carboxylic acid and secondary amine moieties) and hence are used to retain and separate highly charged analytes. Further details regarding the mechanism of ion exchange chromatography are well reviewed in the following references [144, 155].

Stationary Phases The stationary phases used in IEX consist of a support material (synthetic resins, polysaccharides or silica) with charged functional groups covalently attached to it. Colloidal cellulose-based ion exchangers were the first to be used for the separation of proteins [63], but their irregular particle shape led to poor flow properties. Ion exchangers based on dextran (Sephadex), agarose (Sepharose) and cross-linked cellulose (Sephacel) have high porosity and are better suited for the separation of high molecular weight biomolecules.

The agarose and dextran bead based ion exchangers were first introduced by Pharmacia (now General Electric Health Care BioSciences) and have since led to big advances in protein separation [81]. In Sephadex ion exchangers, the charged group is attached to the glucose unit of the dextran. These ion exchangers are derived from either Sephadex G-25 or Sephadex G-50, with the former being more tightly cross-linked and rigid, while the latter is more porous and has better capacity for molecules with molecular weights greater than 300,000. The dextran beads are stable in water, salt solutions, organic solvents, alkaline and weakly acidic solutions. However, very low pH (<2) could hydrolyze the glycosidic linkage, especially at higher temperatures. Sepharose ion exchangers are based on the cross-linked agarose gel filtration medium Sepharose CL-6B and the functional groups are attached to the gel by ether linkages to the monosaccharide units. They have an exclusion limit for proteins with molecular weight of approximately 4 × 10^6. Sephacel ion exchangers are based on high-purity micro-crystalline cellulose and the functional groups are attached during their synthesis by ether linkage to glucose units of the polysaccharide chains. While Sephacel is also macroporous with an exclusion limit of 1 × 10^6, agarose and dextran beads have better flow properties [81]. These soft ion exchange chromatography media are available as dry granular powder, in pre-swollen form, or as prepacked columns for HPLC.

Organic polymer-based support materials such as styrene/divinyl benzene copolymers, polymethylacrylate and polyvinyl resins are also used. The surface of these non-porous, synthetic polymers is modified with a hydrophilic coating and bonded with a uniform ion-exchange layer in order to prevent low recovery due to their hydrophobicity. Similar to RP-HPLC columns, a plethora of IEX columns are commercially available, varying in particle size, pore size and other characteristics, depending on the type of application [167].

The charged functional group bound to the matrix determines the useful pH range and the type of ion exchanger. The total number of charged moieties and their availability determine the capacity of the ion exchanger. The different functional groups used in IEX are listed in Table 5.2. The most widely used anion and cation exchangers have immobilized diethylaminoethyl (DEAE) and carboxymethyl (CM) groups, respectively.
Table 5.2 Functional groups and pKa values of ion exchangers

Ion-exchanger type Functional group name Abbreviation Structure pKa*
Anion, weak Diethylaminoethyl DEAE -O-CH2-CH2-NH+(CH2-CH3)2 6–9
Anion, weak Dimethylaminoethyl DMAE -O-CH2-CH2-NH+(CH3)2 ~10
Anion, strong Trimethylaminoethyl TMAE -O-CH2-CH2-N+(CH3)3 –
Anion, strong Trimethylaminohydroxypropyl QA -O-CH2-CHOH-CH2-N+(CH3)3
Anion, strong Diethyl-(2-hydroxypropyl)aminoethyl QAE -O-CH2-CH2-N+(CH2-CH3)2(CH2-CHOH-CH3)
Cation, weak Carboxymethyl CM -O-CH2-COO 3.5–4.5
Cation, strong Sulfoethyl SE -O-CH2-CH2-SO3 2
Cation, strong Sulphopropyl SP -O-CH2-CHOH-CH2-O-CH2-CH2-CH2-CH2SO3 2–2.5
Cation, strong Methyl sulphonate S -O-CH2-CHOH-CH2-O-CH2-CH2-CHOH-CH2SO3 2
*Karlsson [83]; pKa values are a function of ionic strength. Values reported here are at 0.1 M NaCl
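
The practical meaning of the pKa column can be illustrated with the Henderson-Hasselbalch relation, which gives the fraction of a titratable group that carries a charge at a given mobile phase pH. The short Python sketch below is only an illustration and not part of any cited method: the function names are hypothetical, and the pKa values used in the example (about 4 for a weak carboxylic acid exchanger, about 2 for a strong sulfonic acid exchanger) are rounded, assumed values chosen to be in line with the ranges in Table 5.2. Quaternary ammonium groups carry a permanent charge and are therefore not modelled.

```python
def fraction_charged_acid(ph: float, pka: float) -> float:
    """Fraction of an acidic exchanger group (cation exchanger) that is
    deprotonated, i.e. negatively charged, at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def fraction_charged_base(ph: float, pka: float) -> float:
    """Fraction of a basic exchanger group (weak anion exchanger) that is
    protonated, i.e. positively charged, at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    # Assumed, rounded pKa values for illustration only.
    weak_cation_pka = 4.0    # carboxymethyl-type group
    strong_cation_pka = 2.0  # sulfonic acid-type group
    for ph in (2.0, 3.0, 5.0, 7.0):
        weak = fraction_charged_acid(ph, weak_cation_pka)
        strong = fraction_charged_acid(ph, strong_cation_pka)
        print(f"pH {ph:.1f}: weak {weak:.2f}, strong {strong:.2f}")
```

Under these assumed values, at pH 3 roughly 91 % of the strong exchanger groups are charged compared with roughly 9 % of the weak exchanger groups, which is consistent with the use of strong cation exchangers for separations carried out at low pH.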

DEAE is a weak base with a net positive charge, while CM is a weak acid containing a negative charge. Sulfonate {sulfopropyl (SP) and methyl sulfonate (S)} and quaternary amino groups are the commonly used strong ion exchangers.
The choice of ion exchanger for separation depends on the isoelectric point of the biomolecule and its stability at various pH values. In practice, proteins are stable and functionally active within a narrow pH range and so the choice of ion exchanger is often determined by the pH stability of the desired protein(s). If the protein(s) are stable at pH values below their pI, then a cation exchanger is used and similarly an anion exchanger is used if the protein is stable at pH values above its pI. While both strong and weak ion exchangers have been used for proteomics applications, strong cation exchangers have a considerable advantage for protein and peptide separations as they retain negative charge over the whole range from acidic to neutral pH [22]. However, the high degree of tertiary structure of proteins makes them less tolerant to drastic separation conditions. By contrast, peptides tolerate a much wider range of conditions as their native state is dominated by secondary structures, stabilized mainly by hydrogen bonding. At low pH conditions (< pH 3), under which SCX is performed, peptides are positively charged due to protonation of the N-termini and of the lysine, arginine and histidine side chains and neutralization of the carboxyl side chains of aspartate and glutamate. The overall positive charge on peptides allows them to bind to the negatively charged strong cationic stationary phase and subsequently be eluted using a linear gradient of increasing ionic strength.

Mobile Phase The buffering components used in IEX play an important role in the binding and elution of the analytes. Ion exchange resins usually have counter ions bound to their functional groups. These are normally Cl− ions for anion exchangers and Na+ ions for cation exchangers. The counter ions are held by electrostatic interaction and have specific selectivity for each type of media. The lower the specificity of the counter ion, the more readily it can be exchanged with another group with a similar charge. Thus the buffering ions must have the same charge as the functional groups of the ion exchangers, since ions of the opposite charge will participate in the ion-exchange process. The common cationic buffers used for anion exchangers include Tris, alkylamines, ammonium, triethanolamine etc. Similarly, anionic buffers recommended for cation exchangers include phosphate, acetate, formate, HEPES etc.

Protein/peptide movement down the column can be slowed down by hydrophobic attraction and hydrogen bonding with the ion exchanger. This may lead to irreversible binding or denaturation of proteins and hence poor recovery. The inclusion of acetonitrile (10–25 %) in the
ion-exchange mobile phase helps reduce hydrophobic attraction and improves retention of charged peptides.

Experimental Technique Ion exchange columns can be purchased ready to use or can be prepared in the lab by packing a column with loose ion exchange matrix according to the manufacturer’s instructions. Prior to packing, the matrix must be equilibrated with the working buffer, and after packing the column must be washed further with several volumes of the equilibration buffer. Before use, the column must be charged with counter ion by flushing it with one to two column volumes of the high-salt buffer used for elution. Once charged, the column must be washed thoroughly with the binding buffer to ensure equilibration in the low-salt buffer prior to sample loading. The sample must be prepared in the low-salt buffer and should be filtered before it is applied to the column to reduce the risk of blocking the column. Many proteins tend to aggregate in solution close to the protein’s pI and this aggregation is increased at low ionic strength. Hence, a starting buffer of 20–50 mM is recommended [144]. The flow rate used is determined by the particle size and column dimension; however, it is typical to use lower flow rates during chromatography compared to the column washing and equilibration steps.

Following sample loading, the column must be washed with 5–10 column-bed volumes of the equilibrating buffer to remove any unbound proteins and other contaminants. All the steps during the chromatography (loading, washing and elution) must be monitored by measuring the optical density (A280/A215) of the flow-through. When the washes contain little or no proteins/peptides, elution can be initiated. Target proteins/peptides can be eluted either by increasing the ionic strength or by changing the pH. Proteins and peptides are commonly eluted by increasing the ionic strength of the mobile phase. Elution can also be done by changing the buffer pH (raising the pH to elute from cation exchangers and lowering the pH for anion exchangers), but this is not frequently used since many proteins tend to be unstable or precipitate out at certain pH values. Moreover, it is more difficult to produce a continuous pH gradient at constant ionic strength on standard ion-exchange columns, since mixing of buffers of different pH results in a simultaneous change in ionic strength [20].
Elution by changing the ionic strength can be performed in a linear or stepwise fashion. A linear gradient is achieved by gradually increasing the ionic strength (usually with sodium chloride) at a constant rate using a gradient mixer. Gradient mixers allow the formation of a controlled and reproducible salt gradient that is essential for run-to-run consistency. In stepwise elution, the salt solution of the next higher concentration in the step is introduced onto the column and maintained for at least two bed volumes or until the proteins of interest have eluted. This is followed by the next higher concentration and the process is repeated until all proteins are eluted. Since most proteins elute between 0.1 and 0.4 M sodium chloride, steps at 25–50 mM increments until 0.4 M are recommended during method development [20]. The resolution achieved depends on the type of gradient used. Step elution, being a simpler and more rapid process, usually results in simultaneous elution of multiple protein peaks due to the large increase in ionic strength. Gradient elution is useful when separating proteins/peptides with very close pIs.
The concentrations of salt used for the gradient can be determined by trial and error. After initial purification runs have been analyzed, the gradient can be altered and fine-tuned to optimize the separation of the desired proteins. Following elution, the column is regenerated using a high-salt buffer. Sodium chloride (1 M) is often used to clean the column between purification runs. The salt removes tightly bound contaminants on the stationary phase and simultaneously charges the column with the counter ion.

Applications The high complexity of most proteomic samples, both in the number of proteins and their concentration range, often exceeds the separation capacity and detection power of most liquid chromatography and mass spectrometry platforms. Hence, it is imperative
to use a multidimensional chromatography approach in order to obtain a detailed proteome map. The most common “shotgun” proteomic approach involves the generation of tryptic peptides in solution, after which the sample is carried through a first dimension of separation based on charge (SCX, hydrophilic interaction chromatography (HILIC) or isoelectric focusing (IEF)) [17, 25, 30], followed by reversed-phase chromatography, which is based on hydrophobicity. SCXC is an ideal pre-fractionation method for peptides because at low pH values the vast majority of tryptic peptides bind to the strong cation exchange column, and elution with a salt or pH gradient allows the peptides to be resolved exclusively by their relative charge state [14]. A typical multidimensional chromatography workflow combines SCX with RP, where between 10 and 60 fractions from the SCX are collected, either on-line or off-line. Subsequently, each fraction is injected onto the RP column, either off-line or directly, and eluted into the ESI source with nonpolar buffer. Post-translational modifications can also influence the charge properties of peptides and hence their retention and separation. In fact, SCX chromatography has been used for the enrichment of phosphopeptides and N-terminal acetylated peptides from complex mixtures [42, 113]. The counterpart to SCXC, strong anion exchange (SAX), a separation based primarily on negative charge, has also been used in multidimensional chromatography workflows both at the protein [187] and peptide level, particularly for characterizing protein phosphorylation [36, 40, 60].

Among the different separation strategies available for peptide fractionation, SCX chromatography is a relatively simple and well established method. Compared to most other enrichment techniques, SCX chromatography is relatively simple, robust and reproducible, and can be performed on small amounts of sample [113].

5.4 Size Exclusion Chromatography

Most proteomic approaches employ one or more methods to reduce sample complexity. These might include fractionation and/or enrichment techniques that are performed either at the protein and/or peptide level. Chromatography-based protein fractionation/enrichment techniques resolve proteins based on their physico-chemical properties. Size-exclusion chromatography (SEC) separates proteins based on their molecular sizes in solution [46]. When an aqueous or aqueous/organic mobile phase is used for separation, this technique is referred to as gel-filtration chromatography. SEC can be applied in two distinct ways: as a group separation process and as a fractionation process. In group separation the sample is separated into two major groups, as in sample desalting, buffer exchange, removal of low molecular weight contaminants or removal of reagents to stop a reaction. In fractionation the sample components are separated according to their molecular size.

Theory SEC is a liquid chromatography technique where the stationary phase consists of spherical porous particles with carefully controlled pore size, through which the biomolecules diffuse based on their molecular size using an aqueous buffer as the mobile phase. The pore size of the stationary phase particles determines the molecular size range within which the separation occurs. Solute molecules larger than the available pore size are excluded from the particles and migrate through the column exclusively in the mobile phase. As the molecular size decreases with respect to the average pore size of the packing material, molecules penetrate the pores to varying degrees, with the smallest molecules diffusing furthest into the pore structure and eluting last. Thus, very large molecules elute first, in the void volume of the column, followed by smaller molecules, sequentially in the order of decreasing molecular size, with the smallest molecules eluting at the total permeation volume of the column.

Since separation is based on size, SEC is widely used as an analytical technique to determine the molecular weight distribution of proteins in their native state. The size of the proteins can be determined, provided that the SEC column has been previously calibrated with appropriate molecular weight standards. In the context of proteomics, SEC is mostly used as
a fractionation technique. Fractionation at the protein level has the advantage that it allows for maintaining important information such as post-translational modifications, polymorphisms, functional groups, cellular location, complexes/aggregates and protein interactions [12].

Stationary Phase The separation of biomolecules by size-exclusion chromatography was first demonstrated by Lindqvist and Storgårds [96], who used starch to separate peptides from amino acids. Subsequently, Porath and Flodin [132, 133] developed a cross-linked dextran gel and demonstrated the separation of proteins based on their molecular weight. This gel was made commercially available as Sephadex (GE Healthcare Life Sciences Inc.) and was the standard media for size-based separation for many years. Polymeric resins based on agar/agarose [64, 130], polyacrylamide [65, 90, 154], polyvinylethylcarbitol [90], polyvinylpyrrolidone [90] and derivatized porous silica [44, 137, 139] have been developed and used for SEC. The soft polymeric resins were found to compress under pressure and higher flow rates. This limited the speed and resolution of the chromatographic process. Alternatively, the use of derivatized porous silica for SEC was explored [86]. The high mechanical strength, non-swelling nature and inertness to a fairly wide range of conditions (temperature, solvents) proved valuable for high pressure or high performance SEC (HPSEC) applications. Despite several significant advantages over organic gels, silica-based media suffer from strong ionic interactions between the proteins and the surface silanol groups. This has been addressed by both surface modifications and the use of mobile phase additives. Diol-modified silica phases are typically used for SEC applications involving proteins and peptides [112, 135, 139]. More recently, porous hybrid materials having a mixed composition of silica and organosiloxanes [171], initially developed for reversed-phase chromatography, have been improved and extended to SEC. One such example is the ethylene bridged hybrid (BEH) particles with surface-modified diols, which have improved chemical stability and reduced silanol activity compared to silica columns [123].
The other silica-based media used for SEC are the Zorbax Porous Silica Microspheres (Agilent Technologies Inc.), which consist of extremely uniform colloidal silica beads that are agglutinated to form spherical particles. The patented polymerization process enables the control of both the particle size and pore size so as to produce column packing that will provide separation over a specific molecular range. Additionally, Zorbax PSM packed columns have excellent bed stability, high-efficiency performance and moderate back pressures [115]. Alternatively, columns packed with superficially porous silica microspheres, called “Poroshell” particles, also from Agilent Technologies Inc., have been used for SEC [87, 100]. These particles have been described in detail before (RP-HPLC, section-X). Owing to their large surface area, these particles have been shown to have higher sample loading capacities. They allow fast gradient elution separation of proteins and peptides, with good peak shapes, well within the operating pressure limits of most modern HPLC systems. Most commercially available SEC columns for protein and peptide separations are silica based, with particle sizes of 3–5 μm and pore sizes of 100–450 Å. The larger pore sizes are best suited for the analysis of monoclonal antibodies, their aggregates, very large proteins and protein complexes.

Method Development SEC is often used for the fractionation of one or more proteins of known molecular weight and is typically the first step in a multidimensional chromatography strategy for sample fractionation. Although a relatively low resolution technique, SEC has the advantages of high reproducibility, stability and relatively short analysis time [149]. It is a robust technique that can be performed in the presence of detergents or denaturing agents, at low or high ionic strength and at varying temperatures.

It is important to establish whether the aim of the experiment is group separation or high resolution fractionation prior to selecting a column. Efficient column packing is essential, particularly for high resolution fractionation. Hence, the use of prepacked columns is recommended to ensure
reproducible and high resolution fractionation. Columns must be selected based on the highest flow rate that maintains resolution while minimizing separation time. Gel filtration columns packed with sub-2 μm particles have been shown to have higher efficiency and improved resolution at higher flow rates with short run times [39]. Resolution in SEC is mostly influenced by sample volume and column dimensions. Sample volumes should ideally be between 5 and 10 % of the total column volume, as volumes beyond this range result in decreased resolution and peak distortion (i.e., tailing) [67]. This technique is independent of sample mass and hence sample concentration. However, the solubility or the viscosity of the sample may limit the concentration of sample used for separation. High viscosity causes irregular flow patterns and inconsistent separation, leading to broad peaks and high back pressure. The viscosity of the sample should be the same as that of the eluent. Column length has a significant effect on resolution in SEC. Since samples are eluted isocratically during gel filtration, increasing the column length provides a means of improving resolution [131, 138]. However, increasing column length also leads to a proportional increase in run time and peak width.
Once the right matrix and column are selected, other chromatographic variables that may be manipulated and optimized are the buffer system (type, ionic strength), pH, and solubility additives (e.g. detergents, organic solvents) [3]. Nonbinding interactions between the sample molecules and the stationary phase are dominated by electrostatic and hydrophobic interactions. SEC often employs high salt concentration and/or ionic strength buffers to reduce electrostatic interactions between the stationary phase and proteins/peptides as well as protein-protein interactions [10]. However, at very high concentrations (600–1000 mM), the peaks begin to broaden and to be retained due to hydrophobic effects, especially for peptides and strongly hydrophobic proteins [72, 73, 102]. If detergents are used to stabilize the sample, they should be present both in the sample buffer and mobile phase buffer. Samples are eluted isocratically, hence there is no need to use different buffers during the separation. The pH of the mobile phase has a significant effect on the peak shape and elution time through electrostatic, hydrophobic and solubility effects [138]. The pH-dependent ionic interactions with the stationary phase can be predicted based on the relationship between the pH of the mobile phase and the pI of the sample. Ion-exchange and ion-exclusion effects can occur at pH values below and above the pI of the protein sample, respectively [56]. The ionic strength, composition and pH of the mobile phase can be manipulated to improve resolution as long as this does not affect the stability of the sample or cause conformational changes of proteins.

Applications There are only a few reports in the literature about the application of SEC in proteome research. SEC has been used as a first dimension in multidimensional separation for proteome research both at the peptide [92, 116, 124, 125] and protein level [74, 82, 91, 150]. SEC has also been applied as a fractionation technique to study post-translational modifications such as phosphorylation [159] and glycosylation [8]. SEC was used to fractionate complex yeast tryptic digests into pools of peptides based on their size. The large post-tryptic digestion peptides were subjected to a secondary digestion followed by LC-MS/MS analysis, which led to a significant increase in identified proteins and a 32–50 % relative increase in average sequence coverage compared to a single trypsin digestion alone. This secondary digestion strategy was applied to analyze the phosphoproteomes of fission yeast and of a human cell line. SEC has also been applied to enrich N-linked glycopeptides relative to the non-glycosylated peptides from human serum digest. The glycosylation sites were identified by treating the enriched glycosylated fraction with PNGaseF followed by LC-MS/MS analysis.

5.5 Hydrophilic Interaction Liquid Chromatography (HILIC)

Although RP-HPLC is the most widely applied separation technique, this technique is not
suitable for the analysis of highly polar, hydrophilic and ionizable molecules as they are poorly retained on the hydrophobic stationary phase. Polar compounds can be separated by normal phase chromatography (NPC), in which the sample components partition between a polar stationary phase and a less polar mobile phase. The compounds are eluted in the order of decreasing hydrophobicity by increasing the polarity of the mobile phase. The mobile phase in NPC typically utilizes 100 % organic solvent or a blend of miscible organic solvents. Hydrophilic interaction chromatography (HILIC), a term first coined by Alpert [5], is a variation of NPC that still utilizes a polar stationary phase but in which the mobile phase consists of organic solvents that are water miscible. Since the coining of the term in 1990, HILIC has become an increasingly popular technique for the analysis of polar and hydrophilic compounds, particularly because this method has been shown to provide improved sensitivity compared to RPC when used in combination with electrospray ionization-based mass spectrometry [118, 119]. HILIC has also gained popularity in proteomics and is often used as an orthogonal separation method in conjunction with RP-HPLC in multidimensional separation of peptides, especially for the targeted analysis of post-translational modifications [16, 18].

Theory The HILIC mode of separation had been applied to separate amino acids [105], sugars [94, 120, 161], organic amines [15], basic drugs [47, 75] and nitrogenous bases [34] even before the actual term was coined. The publication by Alpert was the first paper to demonstrate the application of HILIC to the separation of peptides in addition to other polar compounds, as well as to discuss the separation mechanism in detail. As mentioned earlier, HILIC uses polar stationary phases such as underivatized bare silica or uncharged modified silica (diol, amino, cyano) and high levels of organic solvent. The retention mechanism works on the basis that water adsorbs onto the stationary phase to form an immobilized layer and the analyte partitions between this layer and the bulk mobile phase. This distinguishes HILIC from NPC, where the solutes adsorb directly onto the stationary phase [5]. Additionally, polar analytes can also undergo ion exchange with the charged groups on the silica surface, depending on the nature of the stationary phase. For example, underivatized silanol groups on bare silica are themselves both acidic and hydrophilic in nature. The pKa of the surface silanol group is 7.1 ± 0.5 (Hair 1970). These residual silanol groups are partially ionized and can interact with basic analytes through hydrogen bonding and electrostatic interactions. Hence, depending on the surface charge on the stationary phase, the retention mechanism can be a combination of partitioning of solutes in an aqueous two-phase system and specific interactions with the surface charged groups. Other factors governing retention are hydrogen bonding, which depends on the acidity or basicity of the peptides, and dipole-dipole interactions, which depend on the dipole moments and polarizability of the analytes [28, 37]. As mentioned above, HILIC uses aqueous-organic solvent mobile phases, typically 40–97 % acetonitrile in water or other volatile buffers, thus making it a very mass spectrometry friendly technique [9]. Since partitioning is an important component of the HILIC retention mechanism, the presence of a significant amount of water in the mobile phase is crucial for maintaining an immobilized aqueous layer on the surface of the stationary phase [118]. Unlike RPLC, gradient elution in HILIC begins with low polarity organic solvent and the polar analytes are eluted by increasing the polar aqueous content in the mobile phase. Thus the elution order in HILIC is more or less the inverse of that in RPC [5], which means this separation technique is particularly well suited for those peptides that are poorly retained on RP columns. The reader is directed to a comprehensive review by Hemström and Irgum [62] that provides an excellent background to HILIC and details about its separation mechanism.

Stationary Phases The rising popularity of HILIC as a distinct chromatographic mode for separation of protein and peptide mixtures has coincided with the development of a diverse range of stationary phase materials with different retention and selectivity properties. Separations
are typically performed using packing materials having particle sizes ranging from sub-2 μm to 10 μm and an average pore size of approximately 120 Å, thus making HILIC a high resolution technique amenable to both HPLC and UHPLC systems. The separation efficiencies of the different commercially available HILIC columns have been studied and comprehensively reviewed in the following articles [70, 84].

The most common stationary phases used for HILIC are silica-based and are available as fully porous, superficially porous, ethylene bridged hybrid (BEH) and monolithic columns [122]. The silica-based phases can be classified into two large groups: unmodified bare silica phases and polar chemically bonded phases. The first HILIC applications were developed on unmodified bare silica phases, which remain a popular choice for the separation of carbohydrates. The free silanol groups are the key chemical feature of hydrated silica surfaces and their acidity is controlled by the purity of the silica itself [69]. Chemically bonded phases are supplied by many manufacturers and include weak and strong cation exchangers [95], diol [137, 156], amino [6, 94, 152, 158], amide [152, 179], polysulfoethyl aspartamide and polyhydroxyethyl aspartamide, pentafluorophenylpropyl and amino-cyano phases [95, 118]. The chemical derivatization of the surface with polar functional groups is done in much the same way as C18 or C8 phases are prepared for RPC. The polar stationary phases can be further classified into neutral, charged and zwitterionic phases based on the charge state of the functional groups. The chemical stability of silica-based phases is limited under extreme pH values and most separations are performed in the pH range between 2.0 and 8.0.
Neutral stationary phases contain polar functional groups that are in neutral form in the range of pH 3–8 usually used for the mobile phase in HILIC. The retention mechanism is mainly based on hydrophilic interactions. Many HILIC stationary phases belong to this category, which comprises a large variety of functional groups, including diol (YMC-Triart Diol, LiChrospher 100 Diol, Inertsil Diol), cross-linked diol (Luna HILIC), amide (TSKgel Amide-80, GlycoSep N), aspartamide (PolyHYDROXYETHYL A), cyanopropyl (LiChrospher 100 CN, Alltima Cyano HP, Spherisorb CN) and cyclodextrin (Nucleodex β-OH, Cyclobond I 2000) groups. They have found application for the separation of oligosaccharides, peptides, proteins, and oligonucleotides.
Aminopropyl silica phases (Luna NH2, Hypersil APS-2 (amino), Zorbax NH2, LiChrospher 100 NH2, TSKgel NH2-100) are positively charged and among the oldest amine-based phases. The negatively charged derivatized silica phases mainly consist of stationary phases having a special poly(peptide) coating. Examples include poly(aspartic acid)-silica (PolyCAT A), poly(2-sulfoethyl aspartamide) (PolySULFOETHYL A) and poly(2-hydroxyethyl aspartamide) (PolyHYDROXYETHYL A), all manufactured by PolyLC Inc. (Columbia, MD). Poly(aspartic acid) silica was originally developed as a weak cation exchange material and used for the separation of proteins [4]. The stationary phase consists of silica material with a bonded coating of hydrophilic aspartic acid polymer. While the β-carboxy group of aspartic acid is responsible for the cation exchange capacity, it can also act as an acceptor/donor group for hydrogen bonds between solutes and the stationary phase [95]. It is this feature that makes this poly(peptide) stationary phase well suited for HILIC separations of peptides and proteins. Poly(2-hydroxyethyl aspartamide)-silica is made by incorporating ethanolamine into a coating of poly(succinimide) bonded to silica [95]. The material is neutrally charged and the retention mechanism is mainly through hydrophilic interaction, thus allowing sharper peaks and better selectivity. Although the poly(2-hydroxyethyl aspartamide) stationary phase was used for the separation of a wide range of biomolecules including peptides [19, 126, 179], this HILIC phase seems to have lost some of its momentum compared to more recent dedicated HILIC phases, due to its lower efficiency [158], limited long-term stability [188], or column bleeding,
as recently reported for a poly(succinimide)-based phase [111]. Poly(2-sulfoethyl aspartamide) was originally developed as a strong cation exchanger for peptides but has also been used for HILIC separations [7, 95]. It is synthesized by aminolysis of taurine with poly(succinimide) covalently bonded to silica and exhibits a mixed-mode effect, i.e. hydrophilic interactions and electrostatic effects [95]. Like the poly(2-hydroxyethyl aspartamide) phase, this stationary phase also exhibited column bleeding, resulting in several interfering peaks during a two-dimensional proteomics study [111].
Zwitterionic derivatized HILIC phases were introduced by Irgum and coworkers [76, 77, 78]. These phases consist of a layer of highly polar zwitterionic sulfoalkylbetaine groups (Fig. 5.2a) grafted onto wide-pore silica (ZIC-HILIC) or a polymeric support (ZIC-pHILIC). More recently, another stationary phase with phosphorylcholine functional groups bonded to silica (ZIC-cHILIC) (Fig. 5.2b) has been introduced. ZIC-cHILIC is an inverted zwitterionic stationary phase and hence shows different selectivity compared to ZIC-HILIC. The net charge on either of these phases is neutral as the oppositely charged groups are in an equimolar ratio, but they still exhibit weak ionic interactions that allow separations to be optimized using low ionic strength buffers. The charge state of the zwitterionic phases is pH independent. However, pH can affect the charge state of peptides, affecting their hydrophilicity and thereby their retention [37].

5.5.1 Mobile Phase Selection

The typical mobile phase for the HILIC separation of peptides is a water-miscible polar organic solvent such as acetonitrile, methanol or isopropanol at concentrations of up to 85 % [95]. Alcohols can be used as alternative solvents, but a higher concentration is needed in order to achieve the same degree of retention of the analyte relative to an aprotic solvent-water combination [28]. The most commonly used mobile phase solvents are listed below in order of decreasing elution strength: water > methanol > ethanol > isopropanol > acetonitrile. Acetonitrile is highly recommended due to its low viscosity and has the

Fig. 5.1 Van Deemter curve showing the relationship between plate height (H) and linear velocity (μ). The Van Deemter curve is a composite plot of the A, B/μ and Cμ terms, where each term is plotted to show their individual contributions. Hmin = minimum plate height, Vopt = optimum velocity (Figure adopted from various web sources)
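
The composite curve in Fig. 5.1 follows the van Deemter equation, H = A + B/u + C·u, where u is the linear velocity (written μ in the caption). Setting dH/du = 0 gives the optimum velocity u_opt = sqrt(B/C) and the minimum plate height H_min = A + 2·sqrt(B·C). The short Python sketch below evaluates these expressions; the function names and the coefficient values are arbitrary assumptions chosen for illustration and do not correspond to any column discussed in this chapter.

```python
import math


def plate_height(u: float, a: float, b: float, c: float) -> float:
    """Van Deemter plate height H = A + B/u + C*u at linear velocity u."""
    return a + b / u + c * u


def optimum(a: float, b: float, c: float) -> tuple[float, float]:
    """Return (u_opt, H_min) obtained by setting dH/du = 0."""
    u_opt = math.sqrt(b / c)
    h_min = a + 2.0 * math.sqrt(b * c)
    return u_opt, h_min


if __name__ == "__main__":
    # Hypothetical coefficients in arbitrary, mutually consistent units.
    a, b, c = 1.0, 4.0, 1.0
    u_opt, h_min = optimum(a, b, c)
    print(f"u_opt = {u_opt:.2f}, H_min = {h_min:.2f}")
    for u in (0.5, 1.0, u_opt, 4.0, 8.0):
        # Show the three individual contributions and their sum.
        print(f"u = {u:.2f}: A = {a:.2f}, B/u = {b / u:.2f}, "
              f"C*u = {c * u:.2f}, H = {plate_height(u, a, b, c):.2f}")
```

Below u_opt the B/u term dominates and above it the C·u term dominates, which is why the composite curve passes through a single minimum.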

Fig. 5.2 Chemical structure of the zwitterionic bonded phases with (a) sulfobetaine functional group (ZIC-HILIC) and (b) phosphorylcholine group (ZIC-cHILIC)

lowest absorbance at shorter wavelengths used to measure peptides. It is recommended to try a range of acetonitrile concentrations starting with at least 60 % to ensure sufficient hydrophilic interaction. Other solvents such as tetrahydrofuran and acetone can also be used. The different organic solvents can also be used in various concentrations to alter retention and selectivity [57].
Separation can be performed either isocratically or using a gradient. Gradient elution is performed either by increasing the amount of water, i.e. decreasing the organic solvent concentration in the mobile phase, or by an increasing salt gradient. In addition to organic solvent, mobile phase pH and buffer/salt concentrations are also critical to HILIC method development. Mobile phase pH affects the ionization state of both polar analytes and the stationary phase, consequently having a significant effect on retention and selectivity [57]. Most silica-based HILIC separations are carried out in the pH range of 3–8. Solvent pH is adjusted by the addition of buffer salts such as ammonium acetate and ammonium formate for acidic pH values and ammonium hydroxide and carbonate for high pH values. The selectivity of the separations can be altered by changing the mobile phase pH, which not only changes the ionization of the functional groups on the stationary phase (e.g. amino) but also affects the relative ionization of the analytes and thus their retention [107, 108]. Bare silica and silica-based neutral stationary phases are also affected by the mobile phase pH. Normal silanol groups are slightly acidic and can become deprotonated at higher pH values. This could
100 U. Kota and M.L. Stolowitz

lead to increased electrostatic attraction of basic compounds with the negatively charged silanol groups and result in stronger retention.

While buffers are important to prevent pH fluctuations of the mobile phase, appropriate buffer concentration is important to minimize peak broadening. The most common buffer salts used are ammonium acetate or formate (typically 5–15 mM), because of their high solubility in organic solvents, low UV absorbance and compatibility with mass spectrometry. Other buffer salts may be used; however, it is important that they are readily soluble in organic solvents and have excellent UV transparency. Ammonium bicarbonate, triethylamine phosphate (TEAP), sodium-methylphosphonate (Na-MePO4) and sodium perchlorate have also been applied in HILIC separations; however, these buffers are not volatile and cannot be used with mass spectrometry as a detector. The impact of buffer concentration on retention and selectivity is dependent on the nature of interaction between the analyte and the stationary phase. For non-ionizable compounds, retention is solely dependent on partitioning between the immobilized aqueous layer and the hydrophobic mobile phase. Thus, high buffer/salt concentration increases the retention time of these analytes. For ionizable analytes, electrostatic interaction (attractive or repulsive) is an important component of the retention mechanism. In this case, high salt concentration is necessary to disrupt the electrostatic attractions between the analyte and stationary phase. A detailed investigation by McCalley on the HILIC separation of basic compounds using bare silica columns has shown that high salt concentration improves peak shape of charged analytes and also diminishes column overloading effects [107]. In general, it is important to identify the type of electrostatic interactions between the charged analytes and the stationary phases so as to optimize the buffer/salt type and concentration in order to achieve the desired retention and selectivity [57].

Method Development It is important to select the target analytes and the objective of the method prior to starting the method development as this will determine the type of stationary phase and buffers to be used. Commercially available columns are shipped in alcohol or other organic solvents and must be conditioned prior to sample injection. The column must be first washed thoroughly with HPLC grade water (95 % or higher, at least 10–15 column volumes) to remove alcohol. Failure to remove the organic solvent could lead to precipitation of salts that are not soluble in organic solvents and damage the column. The initial washing is followed by rinsing the column with ~10 column volumes of wash buffer. The composition of the wash buffer is typically defined in the manufacturer's instructions. The wash buffer is used as is, without pH adjustment. This is followed by flushing the column again with water to remove the salt buffer prior to conditioning the column with the starting mobile phase.

To obtain optimum binding to the stationary phase, samples must be dissolved in organic solvents such as acetonitrile, methanol, ethanol or isopropanol, with acetonitrile being the first choice. Samples are typically dissolved in buffers having the same organic solvent content as the starting mobile phase. If sample solubility is an issue, then a mixture of the different organic solvents or a small percentage of water or buffer may be used to improve solubility. Water is a strong eluent, hence the amount of water used to dilute and/or dissolve the sample must be carefully adjusted as too much water can lead to peak broadening or splitting. Furthermore, large injection volumes containing a high percentage of water cause peak deterioration and loss of sensitivity [57]. Salts (KCl, NaCl etc.) may be present due to sample preparation and are insoluble in high organic solvents. Hence samples must be filtered prior to injection to prevent the precipitate from clogging the column.

Several factors influence the retention of peptides on the stationary phase. These include the hydrophilicity of the analyte, solvent pH and buffer concentration and column temperature. The importance of pH and buffer concentration has been discussed in detail above. Most HILIC separations use gradient elution and depending on the elution profile, resolution can be further optimized by changing the slope of the gradient
or by increasing the salt concentration in the mobile phase used for elution [95]. Shallow gradients yield enhanced resolution but also increase analysis time. If higher salt concentration is needed for elution, then a two-step gradient elution could be advantageous compared to a one-step linear gradient [95]. As with other forms of LC, determining the optimal gradient involves several experiments. Temperature can also be used to optimize resolution and selectivity of the HILIC separation. Increasing temperature leads to decreased retention if hydrophilic retention is the primary retention mechanism, but deviation from this behavior can occur if other retention mechanisms are involved. In HILIC, column temperature has relatively less of an effect on the retention mechanism compared to the organic solvent content of the mobile phase, its pH and buffer strength [57]. Temperature can be used to optimize the method and achieve high resolution separations.

Applications Because of its selectivity, HILIC is becoming increasingly popular as an "orthogonal" separation technique to RPC and is applied in two-dimensional separation of complex proteome samples [17, 52, 53, 71, 165]. Evidently, HILIC and RPC mobile phase buffers are not directly compatible and hence the 2D-LC setup is often performed in an off-line mode [16, 18]. HILIC has been shown to have separation power superior to both SEC and SCXC [52, 53]. Although SCXC is the most common first dimensional separation, this method has been shown to lead to incomplete recovery of hydrophobic peptides [7]. Even with the addition of organic solvents (e.g. 25 % acetonitrile) in the mobile phase, the recovery of peptides by SCXC has been found to be lower than expected [52, 53]. On the other hand, SEC has low peak capacity, which limits its utility as a first dimensional separation technique in a 2D-LC approach. The HILIC retention mechanism includes both partitioning and electrostatic interactions; hence the separation partially resembles the peptide retention in SCX mode. However, SCX separation power can be limited by the fact that the most prevalent peptides (net charge +2 and +3) are clustered together in a narrow elution window, which is not observed in HILIC separations [16, 18, 52, 53]. The electrostatic interactions ensure that HILIC separation is not merely the reverse of RPC and the hydrophilic interactions allow similarly charged peptides to elute over a wider time window [17].

HILIC is becoming an increasingly popular technique for the enrichment of post-translational modifications both at the peptide and protein level. The most common applications have been the targeted analysis of phosphorylated, glycosylated and N-terminal acetylated peptides. The attachment of a phospho-moiety to a peptide increases its hydrophilicity and lowers its pI. Phosphopeptides can be enriched by SCX at low pH owing to the fact that acidic residues such as aspartic acid and glutamic acid are neutral while the phospho-serine/tyrosine/threonine residues are negative. Tryptic phosphopeptides elute earlier than the unmodified tryptic peptides. However, multiply phosphorylated peptides are poorly retained by SCX and may even be lost. McNulty and Annan [110] were the first to explore HILIC as a first dimension of separation and enrichment for phosphopeptides. When using HILIC for phosphopeptide enrichment, retention is based on overall hydrophilicity of the peptides. Therefore, in contrast to SCX, phosphorylated peptides are strongly retained under typical HILIC conditions, allowing peptides with differing numbers of phosphorylation sites to be separated using a step-wise gradient [16, 18]. The HILIC separation is usually combined with another enrichment technique such as IMAC or TiO2 for a more comprehensive analysis of the phosphoproteome [45, 48, 110, 183].

Acetylated N-terminal tryptic peptides behave similarly to phosphorylated peptides during SCX fractionation, i.e., they tend to cluster in the first few fractions. The N-terminal charge is neutralized by acetylation, lowering the net charge of the peptide compared to the unmodified version. Boersema et al. have evaluated the use of ZIC-HILIC for the enrichment of N-acetylated peptides [17]. The neutralized
acetylated peptides have reduced hydrophilicity. The polarity is further reduced at pH 3, at which ZIC-HILIC separation was performed and the N-acetylated peptides eluted in the first fractions. Their work also showed that at higher pH conditions (pH 6.8 and 8) ZIC-HILIC has higher separation power, while at pH 3, this separation technique is most orthogonal with RPC.

Following phosphorylation, glycosylation is the second most studied PTM. Although glycoproteins/glycopeptides have been enriched by affinity techniques such as lectin-mediated capture, HILIC is a promising additional enrichment technique. The glycan group(s) contribute to the overall hydrophilicity of the modified peptide and this physicochemical property is used to separate these peptides from the non-glycosylated peptides. ZIC-HILIC and/or amide-bonded phases are commonly used for glycan and glycopeptide enrichment [13, 29, 58, 88, 106, 164, 169]. There are also several reports of applications of HILIC-SPE for desalting and/or purification of glycans and glycopeptides [104, 140, 145–147]. The advantage of performing HILIC-SPE is the possibility to elute with water, thus providing a salt-free and acid-free sample that is ideal for subsequent mass spectrometry or other detection methods [180].

5.6 Mixed Mode Chromatography

Most proteomics workflows utilize two-dimensional liquid chromatography (2D-LC) or multidimensional chromatography to fractionate complex biological samples. This often requires an elaborate experimental setup using two or more columns, collecting/processing a large number of fractions and analyzing each of them. This process is time consuming and limits the number of samples that can be characterized with high resolution and high sensitivity. More recently, mixed mode chromatography (MMC) has received significant attention as an alternative chromatography technique that can enhance selectivity beyond that of single mode separations performed separately. MMC utilizes more than one type of interaction between the stationary phase and the solutes in the mobile phase. Although most chromatography interactions are considered in terms of single modes, such as ionic or hydrophobic interactions, proteins and peptides are polyions that exhibit both hydrophilic and hydrophobic properties. Their actual chromatographic separation involves multiple modes of interaction and often unintended 'secondary interactions' with the stationary phase lead to peak tailing. Such effects were observed during the development of reversed-phase chromatography where incompletely capped silanol groups exhibited ion exchange activity, causing peak tailing and retention shift for basic compounds [54]. However, it was later realized that this "mixed mode" interaction could be a new technique to improve the resolution and selectivity of the separation by using suitable approaches such as mixing of two types of stationary phase in a single column or using biphasic columns [152, 166, 170].

Theory In contrast to single mode chromatography, MMC uses a stationary phase that is intentionally functionalized with ligands that are capable of multiple modes of interaction with the biomolecules. These multiple modes can include hydrophobic, ion exchange, affinity, electrostatic as well as hydrogen bonding, π-π and thiophilic interactions [32]. There are several MMC separation modes, named based on at least two types of interactions between the stationary phase and the solutes that occur either simultaneously or separately. Like all other chromatographic techniques, the retention mechanism in MMC is influenced by the type of ligand, the base matrix, linker and the linker chemistry. Depending on the chemical nature of the ligands (e.g. polar, non-polar, hydrophobic, hydrophilic, acidic and basic groups) and nature of the matrices, different types of mixed-mode stationary phases provide different retention mechanisms. Retention mechanism is also influenced by the size and structure of the solute molecules as well as the mobile phase. The exact mechanism of interactions between the analyte and the multimodal ligands has not been well studied, as most publications are focused on applications of MMC. Theoretical
explanations for the MMC retention mechanism, based on the different molecular interactions, are classified into three categories, detailed explanations of which can be found in [174].

In the case of protein and peptide separations, interaction between the multiple ligands on the MMC matrix and the multiple types of amino acid residues and their charge states at the contact region affects the retention of these biomolecules. The most common mixed mode separations for proteins and peptides are based on ion-exchange and hydrophobic interactions. One such example is the WAX-RP mixed mode separation, where the separation mechanism in an IEX/RP mixed mode is predicted to be a complex interplay of hydrophobic, ion exchange and ion exclusion interactions [129]. The interactions between the different types of chromatography are not independent of each other and the relative contribution of each mechanism depends on the hydrophobicity of the analyte, its charge and also the mobile-phase conditions such as pH, ionic strength and degree of organic modifier. Hence in such a system, increasing the ionic strength will disrupt the ionic bonds but the increasing salt strength will also favor stronger hydrophobic adsorption of the solute. When compared to conventional one dimensional SCX separation, mixed mode IEX/RPC has been shown to have increased fractionation efficiency resulting in a more homogeneous distribution of peptides across all fractions. Furthermore, the doubly/triply charged peptides were found to elute over a wide elution window unlike in SCX where the majority of the tryptic peptides elute within a narrow elution window [129].

Hydrophobic interaction chromatography (HIC) is also a type of mixed mode chromatographic process in which the protein of interest in the mixture binds to a dual mode (i.e., one mode for binding and another mode for elution), ionizable ligand. This chromatographic process was pioneered by Porath et al. [134] and Hjertén [66] and is considered a gentle separation technique compared to RPC and also complementary to other chromatographic modes such as IEX, SEC and affinity chromatography [173]. Hydrophobic ligands such as short alkyl chains and phenyl groups are attached to the stationary phase and separation is based on the reversible interaction between the hydrophobic amino acid side chains on the surface of a protein and the hydrophobic ligand. Proteins bind to the column in high ionic strength buffer and elution is usually performed by decreasing the salt concentration, stepwise or using a gradient. When compared to RPC, the density of the ligands on the stationary phase is much lower and HIC uses milder binding and eluting conditions that help maintain the biological activity of the target proteins [136, 173].

The biggest advantage of MMC is that selectivity can be optimized by adjusting the mobile phase ionic strength, pH and/or organic solvent. Additionally, MMC does not require ion pairing agents in the mobile phase for separating highly hydrophilic charged analytes and hence is MS compatible. The adjustable selectivity allows easy separation of analytes of varying charges and hydrophobicity in a single analysis.

Stationary Phases The stationary phase for MMC can be generated either by physical or chemical methods. The simplest form of MMC can be achieved by connecting two different types of columns in series, known as "tandem column" [43]. However, the two mobile phases used for chromatography must be compatible and work synergistically. Tandem columns also lead to high back pressure, especially if high flow rates are used for rapid separation [174]. A second approach is to pack two or more types of stationary phases into a column, termed a "biphasic column" [43]; however, packing two stationary phases homogeneously can be challenging. Rossi and Horvath [43] compared the performance of both tandem and hybrid columns using commercially available WCX, WAX and SCX and strong anion exchange (SAX) stationary phases and found their separation efficiency, including resolution, to be very similar.

Biphasic columns were first used by Yates et al. for the fractionation of tryptic peptides
in their multidimensional protein identification technology (MudPIT) [97, 98, 166, 170]. For the MudPIT approach, peptides were loaded onto a biphasic microcapillary column, packed with a 1:1 ratio of SCX and RP (C18) stationary phase. This approach led to the unbiased, comprehensive analysis of the S. cerevisiae proteome largely because it was able to detect and identify a wide variety of protein classes including proteins with extremes in pI, molecular weight, abundance, and hydrophobicity [170]. The biphasic system was further improved to a three-phase MudPIT column, which included an additional reversed-phase column prior to the SCX and RPC column and was used for the online desalting of the sample prior to fractionation [109].

A third physical approach is to pack two or more types of stationary phases into a column to generate a "mixed bed column" or "hybrid column". The first 'hybrid column' was prepared by Walshe et al. [163] by mixing together SCX and RP (C18) stationary phases and was found to exhibit chromatographic properties of both modes. Motoyama et al. [117] prepared a mixed-bed resin of a blend of anion and cation exchange (ACE) and showed improved recovery of peptides and phosphopeptides compared to SCXC alone.

The biggest advantage of using these physical methods of MMC is that the analytes can be directly transferred from one chromatographic mode to another, thereby reducing the dead volume of the system and the number of connections, and simplifying the overall experimental procedure [174].

In the chemical approach to MMC, the column consists of a single stationary phase that is derivatized with two or more functional groups (or ligands). Most MMC ligands have been designed for the purpose of protein purification, specifically for immunoglobulin purification. Hydroxyapatite is one of the oldest mixed mode chromatographic media that has been used regularly for the purification of antibodies due to its high selectivity and ease of use as it can be performed under neutral conditions [85]. The hydroxyapatite crystals generate a mixed-mode resin, where the separation is achieved by both cation exchange and metal-affinity mechanisms. The phosphate groups of the media interact electrostatically with the amines/positively charged amino groups on proteins while the calcium ions on the surface of the hydroxyapatite crystals bind to either the carboxyl or phosphoryl groups on proteins. Elution is achieved by using either a phosphate or NaCl gradient [11, 49].

The more recent mixed mode ligands are summarized in Fig. 5.3 and a more comprehensive list can be reviewed here [181]. Some of the first MMC ligands were developed by Yon and coworkers for protein chromatography [175, 176, 178]. These mixed-mode ligands with a net negative charge adsorbed proteins based on the net effect of hydrophobic interactions and electrostatic repulsion for protein purification. Hydrocarbyl ligands are also frequently used in protein chromatography. Examples of this family include the two commercially available adsorbents, hexylamine (HEA) and phenylpropylamine (PPA) Hypercel (Pall Life Sciences, NY, USA) [23]. Ligands based on alkyl amines with ω-amino groups [148] as well as the negatively charged counterparts of these ligands, i.e., carboxylic [26, 177] or sulfonic [24, 26, 55] acids have also been reported. Secondary interactions such as hydrogen-bonding have also been used as one of the interactions in MMC. The introduction of a hydrogen bonding group in the proximity of ionic groups has been shown to be beneficial for protein binding under high salt conditions [79, 80]. Based on these findings two commercial adsorbents, Capto™MMC and Capto™adhere (GE Healthcare, NJ, USA), were developed. Capto™MMC is a weak cation exchanger with a phenyl group for hydrophobic interactions and an amide group for hydrogen bonding. Capto™adhere is a strong anion exchanger, again with a phenyl group for hydrophobic interaction and a hydroxyl group for hydrogen bonding.

IEXC functional groups such as quaternary ammonium, amino, carboxyl and sulfonic groups can also be adapted to act as mixed mode ligands. Girot and coworkers [24, 55] have used 2-mercapto-5-benzimidazole sulfonic acid, a ligand of MBI Hypercel (Pall Life Sciences,

Fig. 5.3 Ligands for mixed mode chromatography (MMC) from selected commercially available mixed mode media. (a) Hexylamino (HEA) is a positively charged ligand for HEA Hypercel (Pall Life Sciences, NY, USA). (b) Phenylpropylamino (PPA) is a positively charged ligand for PPA Hypercel (Pall Life Sciences, NY, USA). (c) 4-Mercapto ethyl pyridine (MEP) is a positively charged ligand for MEP Hypercel (Pall Life Sciences, NY, USA). (d) N-benzyl-N-methyl ethanol amine is a positively charged ligand for Capto™adhere (GE Healthcare, NJ, USA). It is a multimodal strong anion exchanger. (e) 2-mercapto-5-benzimidazole sulfonic acid is a negatively charged ligand for MBI stationary phase (Pall Life Sciences, NY, USA). (f) 2-Benzamido-4-mercaptobutanoic acid is a negatively charged ligand for Capto™MMC (GE Healthcare, NJ, USA). It is a multimodal weak cation exchanger. (* Positive/Negative charge is reported at physiological pH)

NY, USA) as a multimodal ligand for the purification of antibodies. Heterocyclic mixed mode ligands have also been applied for protein purification [27, 33, 59, 172, 182]. MEP Hypercel (Pall Life Sciences, NY, USA) contains 4-mercapto-ethyl-pyridine (MEP), an ionizable ligand which is uncharged at physiological pH. Proteins are adsorbed through hydrophobic interactions and eluted by reducing the pH of the mobile phase to 4 or lower, where the ligand is positively charged. This dual mode mechanism forms the principle of hydrophobic charge induction chromatography (HCIC) [27, 50, 51, 172]. A number of new mixed mode chromatographic stationary phases have been commercialized by SIELC (Wheeling, IL) and are available under the trade name Primesep. The HPLC column choices include combinations of RP with anion, cation and zwitterionic functional groups [93].

In summary, the ligand for mixed-mode chromatography should have at least one hydrophobic moiety and one ionic moiety. The hydrophobic moiety must be carefully chosen so as to achieve a sufficiently high capacity and afford reasonable recovery. The pKa of the ionic moiety is essential for the performance of the ligand and should be estimated in ligand screening and design [181]. Secondary interactions such as hydrogen bonding can also contribute to protein-ligand binding and can be introduced either as hydrogen donors to anion-exchange ligands or hydrogen acceptors to cation-exchange ligands.

Method Development As mentioned earlier, most proteomic approaches use either IEX/RPLC [54, 121, 129] or HILIC/IEX [61, 101] combinations of separation modes. While octadecylsilane (C18) remains the preferred RP ligand, the choice of the ionic ligand will depend on the class of peptides to be enriched and/or fractionated. Gilar et al. [54] used a silica-based pentafluorophenyl (PFP) MMC column to selectively enrich for negatively charged peptides, such as phosphopeptides and sialylated glycopeptides. A stationary phase containing octadecylsilanes and dialkylamines has been used as an RPC/AEX mixed mode combination for peptide separation [68]. The HILIC/SCX mixed mode approach, first introduced by Hodges and group [185, 186], has proven to be very versatile for peptide separations versus RPC, specifically for separating highly charged species [101]. HILIC/SCX was carried out on a poly(2-sulphoethyl aspartamide)-silica (polysulphoethyl A) (PolyLC, Columbia, MD, USA) SCX column. Peptide separation was carried out in the presence of a high organic modifier concentration (60–80 % ACN) to promote hydrophilic interactions between the solute and the hydrophilic/charged SCX stationary phase, with peptides then eluted with a linear salt gradient. Peptides are generally eluted in groups in order of increasing net positive charge; within these groups, peptides are resolved in order of increasing hydrophilicity (decreasing hydrophobicity) [101].

As in all other forms of chromatography, solvent selection, pH, salt concentration and temperature influence sensitivity and resolution. In general, adsorption of proteins in MMC occurs under low-to-moderate ionic strength and neutral pH, and elution is achieved by electrostatic repulsion when the pH value is lowered below the pI of the target and pKa of the ligand. Columns are regenerated with chelating reagents, acid/base washes or high salt concentrations [143].

Summary Examples of peptide fractionation by MMC cited in the previous sections clearly demonstrate the several distinct advantages this form of chromatography has over single-mode chromatography. As compared with traditional 2D approaches, MMC has shown improved selectivity, resolution and higher sample loading capacity [61, 117, 121]. This approach offers increased separation and degrees of freedom in adjusting separation selectivity compared to any one type of chromatography. The limited number of publications that have reported the use of MMC as part of a routine proteomic sample preparation workflow suggests that this method has not yet been fully exploited. Currently, most of the MMC applications are focused on small molecules and proteins, predominantly immunoglobulin purification. This could partly be due to
the lack of a deep understanding of the mixed mode retention mechanism, which would otherwise be useful for synthesizing new ligands and stationary phases that could help accelerate development of its applications.

References

1. Aguilar M-I (2004) HPLC of peptides and proteins. HPLC of peptides and proteins, vol 251 M-I Aguilar. Springer, New York, p 3–8
2. Aguilar MI, Hearn MT (1996) High-resolution reversed-phase high-performance liquid chromatography of peptides and proteins. Methods Enzymol 270:3–26
3. Allen DP (1999) 18 – Application of size exclusion-high-performance liquid chromatography for biopharmaceutical protein and peptide therapeutics. Column handbook for size exclusion chromatography. Cs Wu. Academic Press, San Diego, 531–537
4. Alpert AJ (1983) Cation-exchange high-performance liquid chromatography of proteins on poly(aspartic acid)-silica. J Chromatogr A 266(0):23–37 (see footnote 1)
5. Alpert AJ (1990) Hydrophilic-interaction chromatography for the separation of peptides, nucleic acids and other polar compounds. J Chromatogr A 499(0):177–196 (see footnote 2)
6. Alpert AJ (2007) Electrostatic repulsion hydrophilic interaction chromatography for isocratic separation of charged solutes and selective isolation of phosphopeptides. Anal Chem 80(1):62–76 (see footnote 3)
7. Alpert AJ, Andrews PC (1988) Cation-exchange chromatography of peptides on poly(2-sulfoethyl aspartamide)-silica. J Chromatogr 443:85–96 (see footnote 4)

Footnote 1: A simple cation-exchange material for high-performance liquid chromatography of proteins was developed. Poly(succinimide) reacted rapidly with aminopropylsilica and the product was hydrolysed to poly(aspartic acid)-silica. Reaction conditions were optimized to yield a material with an ion-exchange capacity of 430 mg hemoglobin/g material. High-performance liquid chromatographic columns of the material featured excellent performance in terms of capacity, selectivity, recovery of enzyme activity, peak shape and durability. Protein standards and clinical hemoglobin samples were well resolved in minutes. Poly(succinimide)-silica was readily derivatized to give products other than poly(aspartic acid)-silica, and several such materials were prepared. Such materials could be useful for affinity chromatography or enzyme immobilization.

Footnote 2: When a hydrophilic chromatography column is eluted with a hydrophobic (mostly organic) mobile phase, retention increases with hydrophilicity of solutes. The term hydrophilic-interaction chromatography is proposed for this variant of normal-phase chromatography. This mode of chromatography is of general utility. Mixtures of proteins, peptides, amino acids, oligonucleotides, and carbohydrates are all resolved, with selectivity complementary to those of other modes. Typically, the order of elution is the opposite of that obtained with reversed-phase chromatography. A hydrophilic, neutral packing was developed for use in high-performance hydrophilic-interaction chromatography. Hydrophilic-interaction chromatography is particularly promising for such troublesome solutes as histones, membrane proteins, and phosphorylated amino acids and peptides. Hydrophilic-interaction chromatography fractionations resemble those obtained through partitioning mechanisms. The chromatography of DNA, in particular, resembles the partitioning observed with aqueous two-phase systems based on polyethylene glycol and dextran solutions.

Footnote 3: If an ion-exchange column is eluted with a predominantly organic mobile phase, then solutes can be retained through hydrophilic interaction even if they have the same charge as the stationary phase. This combination is termed electrostatic repulsion-hydrophilic interaction chromatography (ERLIC). With mixtures of solutes that differ greatly in charge, repulsion effects can be exploited to selectively antagonize the retention of the solutes that normally would be the best retained. This permits the isocratic resolution of mixtures that normally require gradients, including peptides, amino acids, and nucleotides. ERLIC affords convenient separations of highly charged peptides that cannot readily be resolved by other means. In addition, phosphopeptides can be isolated selectively from a tryptic digest.

Footnote 4: A strong cation-exchange material, poly(2-sulfoethyl aspartamide)-silica (PolySULFOETHYL Aspartamide) was developed for purification and analysis of peptides by high-performance liquid chromatography. All peptides examined were retained at pH 3, even when the amino terminus was the only basic group. Peptides were eluted in order of increasing number of basic residues with a salt gradient. Capacity was high, as was selectivity and column efficiency. This new column material displays modest mixed-mode effects, allowing the resolution of peptides having identical charges at a given pH. The selectivity can be manipulated by the addition of organic solvent to the mobile phases; this increases the retention of some peptides and decreases the retention of others. The retention in any given case may reflect a combination of steric factors and non-electrostatic interactions. Selectivity was complementary to that of reversed-phase chromatography (RPC) materials. Excellent purifications were obtained by sequential use of PolySULFOETHYL Aspartamide and RPC columns for purification of peptides from crude tissue extracts. The new cation exchanger is quite promising as a supplement to RPC for general peptide chromatography.

8. Alvarez-Manilla G, Atwood et al (2006) Tools for glycoproteomic analysis: size exclusion
chromatography facilitates identification of tryptic glycopeptides with N-linked glycosylation
sites. J Proteome Res 5(3):701–708⁵
9. Appelblad P, Jonsson P et al (2006) A practical guide to HILIC: a tutorial and application
book. Merck SeQuant AB, Umeå
10. Arakawa T, Ejima D et al (2010) The critical role of mobile phase composition in size
exclusion chromatography of protein pharmaceuticals. J Pharm Sci 99(4):1674–1692⁶
11. Ayyar BV, Arora S et al (2012) Affinity chromatography as a tool for antibody purification.
Methods 56(2):116–129⁷
12. Barbour J, Wiese S et al (2008) Mass spectrometry. Proteomics sample preparation. Wiley-VCH
Verlag GmbH & Co. KGaA, p 41–128⁸
13. Bereman MS, Williams TI et al (2009) Development of a nanoLC LTQ orbitrap mass spectrometric
method for profiling glycans derived from plasma from healthy, benign tumor control, and
epithelial ovarian cancer patients. Anal Chem 81(3):1130–1136⁹

5
Proteomic techniques, such as HPLC coupled to tandem mass spectrometry (LC-MS/MS), have proved
useful for the identification of specific glycosylation sites on glycoproteins
(glycoproteomics). Glycosylation sites on glycopeptides produced by trypsinization of complex
glycoprotein mixtures, however, are particularly difficult to identify both because a repertoire
of glycans may be expressed at a particular glycosylation site, and because glycopeptides are
usually present in relatively low abundance (2–5 %) in peptide mixtures compared to
nonglycosylated peptides. Previously reported methods to facilitate glycopeptide identification
require either several pre-enrichment steps, involve complex derivatization procedures, or are
restricted to a subset of all the glycan structures that are present in a glycoprotein mixture.
Because the N-linked glycans expressed on tryptic glycopeptides contribute substantially to
their mass, we demonstrate that size exclusion chromatography (SEC) provided a significant
enrichment of N-linked glycopeptides relative to nonglycosylated peptides. The glycosylated
peptides were then identified by LC-MS/MS after treatment with PNGase-F by the monoisotopic mass
increase of 0.984 Da caused by the deglycosylation of the peptide. Analyses performed on human
serum showed that this SEC glycopeptide isolation procedure results in at least a 3-fold
increase in the total number of glycopeptides identified by LC-MS/MS, demonstrating that this
simple, nonselective, rapid method is an effective tool to facilitate the identification of
peptides with N-linked glycosylation sites. Keywords: glycoproteomics; LC/MS/MS; glycopeptides;
N-linked glycosylation sites; size exclusion chromatography.
6
Size exclusion chromatography (SEC) is the most widely used method for aggregation analysis of
pharmaceutical proteins. However, SEC analysis has a number of limitations, and one of the most
important ones is protein adsorption to the resin. This problem is particularly severe when
using new columns, and often column preconditioning protocols are required. This review focuses
on the role that addition of various cosolvents to the mobile phase plays in suppressing that
protein adsorption. Cosolvents such as salt, amino acids, and organic solvents are often used
for this purpose. Because the protein interaction with the resin surface is highly
heterogeneous, different cosolvents affect the protein adsorption differently. We will summarize
the various effects of cosolvents on protein adsorption and retention and describe the mechanism
of the cosolvent effects. © 2009 Wiley-Liss, Inc. and the American Pharmacists Association
J Pharm Sci 99: 1674–1692, 2010.
7
The global antibody market has grown exponentially due to increasing applications in research,
diagnostics and therapy. Antibodies are present in complex matrices (e.g. serum, milk, egg yolk,
fermentation broth or plant-derived extracts). This has led to the need for development of novel
platforms for purification of large quantities of antibody with defined clinical and performance
requirements. However, the choice of method is strictly limited by the manufacturing cost and
the quality of the end product required. Affinity chromatography is one of the most extensively
used methods for antibody purification, due to its high selectivity and rapidity. Its
effectiveness is largely based on the binding characteristics of the required antibody and the
ligand used for antibody capture. The approaches used for antibody purification are critically
examined with the aim of providing the reader with the principles and practical insights
required to understand the intricacies of the procedures. Affinity support matrices and ligands
for affinity chromatography are discussed, including their relevant underlying principles of
use, their potential value and their performance in purifying different types of antibodies,
along with a list of commercially available alternatives. Furthermore, the principal factors
influencing purification procedures at various stages are highlighted. Practical considerations
for development and/or optimizations of efficient antibody-purification protocols are suggested.
8
This chapter contains sections titled: * A Practical Guideline to Electrospray Ionization Mass
Spectrometry for Proteomics Application * References * Sample Preparation for the Application of
MALDI Mass Spectrometry in Proteome Analysis * References * Sample Preparation for Label-Free
Proteomic Analyses of Body Fluids by Fourier Transform Ion Cyclotron Mass Spectrometry *
References * Sample Preparation for Differential Proteome Analysis: Labeling Technologies for
Mass Spectrometry * References * Determining Membrane Protein Localization Within Subcellular
Compartments Using Stable Isotope Tagging * References.
9
We report the development of split-less nano-flow liquid chromatography mass spectrometric
analysis of glycans chemically cleaved from glycoproteins in plasma. Porous

14. Betancourt LH, De Bock PJ et al (2013) SCX charge state selective separation of tryptic
peptides combined with 2D-RP-HPLC allows for detailed proteome mapping. J Proteomics
91(0):164–171¹⁰
15. Bidlingmeyer BA, Del Rios JK, 3 et al (1982) Separation of organic amine compounds on silica
gel with reversed-phase eluents. Anal Chem 54:442–447
16. Boersema P, Mohammed S et al (2008) Hydrophilic interaction liquid chromatography (HILIC) in
proteomics. Anal Bioanal Chem 391(1):151–159
17. Boersema PJ, Divecha N et al (2007) Evaluation and optimization of ZIC-HILIC-RP as an
alternative MudPIT strategy. J Proteome Res 6(3):937–946¹¹
18. Boersema PJ, Mohammed S et al (2008) Hydrophilic interaction liquid chromatography (HILIC)
in proteomics. Anal Bioanal Chem 391(1):151–159¹²

graphitized carbon operating under reverse-phase conditions and an amide-based stationary phase
operating under hydrophilic interaction conditions are quantitatively compared for glycan
separation. Both stationary phases demonstrated similar column efficiencies and excellent
retention time reproducibility without an internal standard to correct for retention time shift.
The 95 % confidence intervals of the mean retention times were ±4 s across 5 days of analysis
for both stationary phases; however, the amide stationary phase was observed to be more robust.
The high mass measurement accuracy of less than 2 ppm and fragmentation spectra provided highly
confident identifications along with structural information. In addition, data are compared
among samples derived from 10 healthy controls, 10 controls with a differential diagnosis of
benign gynecologic tumors, and 10 diseased epithelial ovarian cancer patients (EOC). Two
fucosylated glycans were found to be up-regulated in healthy controls and provided an accurate
diagnostic value with an area under the receiver operator characteristic curve of 0.87. However,
these same glycans provided a significantly less diagnostic value when used to differentiate EOC
from benign tumor control samples with an area under the curve of 0.73.
10
Multidimensional peptide fractionation is widely used in proteomics to reduce the complexity of
peptide mixtures prior to mass spectrometric analysis. Here, we describe the sequential use of
strong cation exchange and reversed phase liquid chromatography in both basic and acidic pH
buffers for separating tryptic peptides from complex mixtures of proteins. Strong cation
exchange exclusively separates peptides by their charge state into neutral, singly and
multi-charged species. To further reduce complexity, each peptide group was separated by
reversed phase liquid chromatography at basic pH and the resultant fractions were analyzed by
LC–MS/MS. This workflow was applied to a soluble protein lysate from mouse embryonic fibroblast
cells, and more than 5000 proteins from 29,843 peptides were identified. The high selectivity
displayed during the SCX step (93 % to 100 %) and the overlaps between proteins identified from
the SCX-separated peptide groups, are interesting assets of the procedure. Biological
significance: The present work shows how complex mixtures of peptides can be selectively
separated by SCX based essentially on the net charge of peptides. The proposed workflow results
in three well-defined subsets of peptides of specific amino acid composition, which are
representative of the constituent proteins. The very high selectivity obtained (93 % to 99 %) on
the peptide side, underscores for the first time the possibility of SCX chromatography to aid in
validating identified peptides.
11
In proteomics, a digested cell lysate is often too complex for direct comprehensive mass
spectrometric analysis. To reduce complexity, several peptide separation techniques have been
introduced including very successful two-dimensional liquid chromatography (2D-LC) approaches.
Here, we assess the potential of zwitterionic Hydrophilic Interaction Liquid Chromatography
(ZIC-HILIC) as a first dimension for the analysis of complex peptide mixtures. We show that
ZIC-HILIC separation is dramatically dependent on buffer pH in the range from 3 to 8, due to
deprotonation of acidic amino acids. ZIC-HILIC exhibits a mixed-mode effect consisting of
electrostatic and polar interactions. We developed a 2D-LC system that hyphenates ZIC-HILIC
off-line with reversed-phase (RP). The two dimensions are fairly orthogonal, and the system
performs very well in the analysis of minute amounts of complex peptide mixtures. Applying this
method to the analysis of 10 μg of a cellular nuclear lysate, we were able to confidently
identify over 1000 proteins. Compared to strong cation exchange chromatography (SCX), ZIC-HILIC
shows better chromatographic resolution and absence of clustering of prevalent +2 and +3 charged
peptides. At pH 3, ZIC-HILIC separation allows best orthogonality with RP and resembles
conventional SCX separation. A significant enrichment of N-acetylated peptides in the first
fractions is observed at these conditions. ZIC-HILIC separation at high pH (6.8 and 8), however,
enables better chromatography, resulting in more comprehensive data acquisition. With this
extended flexibility, we conclude that ZIC-HILIC is a very good alternative for the more
conventional SCX in multidimensional peptide separation strategies.
12
In proteomics, nanoflow multidimensional chromatography is now the gold standard for the
separation of complex mixtures of peptides as generated by in-solution digestion of whole-cell
lysates. Ideally, the different stationary phases used in multidimensional chromatography should
provide orthogonal separation characteristics. For this reason, the combination of strong cation
exchange chromatography (SCX) and reversed-phase (RP) chromatography is the most widely used
combination for the separation of peptides. Here, we review the potential of hydrophilic
interaction liquid chromatography (HILIC) as a separation tool in the

19. Boutin JA, Ernould AP et al (1992) Use of hydrophilic interaction chromatography for the
study of tyrosine protein kinase specificity. J Chromatogr B Biomed Sci Appl 583(2):137–143¹³
20. Boyer R (2005) Principles and reactions of protein extraction, purification, and
characterization: Ahmed, Hafiz. Biochem Mol Biol Educ 33(2):145–146
21. Boyes BE, Walker DG (1995) Selectivity optimization of reversed-phase high-performance
liquid chromatographic peptide and protein separations by varying bonded-phase functionality.
J Chromatogr A 691(1–2):337–347¹⁴
22. Boysen RI, Hearn MTW (2001) HPLC of peptides and proteins. Current protocols in protein
science, Wiley¹⁵
23. Brenac Brochier V, Schapman A et al (2008) Fast purification process optimization using
mixed-mode chromatography sorbents in pre-packed mini-columns. J Chromatogr A
1177(2):226–233¹⁶
24. Brenac V, Ravault V et al (2005) Capture of a monoclonal antibody and prediction of
separation conditions using a synthetic multimodal ligand attached on chips and beads.
J Chromatogr B 818(1):61–66¹⁷

multidimensional separation of peptides in proteomics applications. Recent work has revealed
that HILIC may provide an excellent alternative to SCX, possessing several advantages in the
area of separation power and targeted analysis of protein post-translational modifications.
13
A new HPLC method has been developed to assay tyrosine protein kinase activity. Using
hydrophilic interaction chromatography, it is possible to resolve the four components of the
incubation medium: substrate peptide, [32P]phosphorylated peptide, unreacted [γ-32P]ATP, and
32P-labelled inorganic phosphate. ATP interacts so strongly with the stationary phase material
that it can be removed selectively from the incubation medium with solid-phase extraction
cartridges packed with the same type of material. The three remaining components of interest can
then be resolved by reversed-phase or hydrophilic interaction HPLC. This procedure permits the
evaluation of almost every type of peptide as a substrate of tyrosine protein kinase.
14
Several chemical bonded-phase modified silicas were prepared using sterically protected
monofunctional silane reagents which varied widely in structure and polarity. Since some of
these bonded-phase packing materials are highly polar (hydrophilic), resistance to
acid-catalyzed bonded-phase loss by hydrolysis was examined, and observed to remain high even
for the highly polar Diol bonded-phase functionality. Modification of the surface of 300 Å pore
size, fully hydroxylated and base-deactivated silica microspheres with these sterically
protected silanes yielded HPLC column packing materials for examination of separation
selectivities in reversed-phase separations of peptide and protein mixtures. Distinct separation
selectivities were apparent for each bonded-phase functionality. Selectivity differences ranged
from limited band spacing changes for steric-protected C18 and C8 bonded-phases, to reversal of
elution order for the more polar C3 and CN bonded phases. The use of column-based selectivity
differences between sequential reversed-phase separation steps is used for the two-step HPLC
isolation of a recombinant human amyloid precursor polypeptide fragment from a crude bacterial
extract.
15
High-performance liquid chromatography (HPLC) is an essential tool for the purification and
characterization of biomacromolecules. This unit presents a thorough discussion of the eight
types of HPLC currently used, highlighting equipment and start-up procedures, recommendations
for running each type of experiment, and theoretical considerations for the separation of
peptides and proteins. This is an excellent primer for HPLC users.
16
Pre-packed MediaScout® MiniChrom columns of 2.5, 5 and 10 mL were investigated for screening
three mixed-mode chromatography sorbents (HEA, PPA and MEP HyperCel™). Packing performance was
of good quality and the three sorbents displayed higher capacity than traditional HIC sorbents
in physiological-like conditions. Each sorbent offered a unique selectivity. Bovine
β-lactoglobulin was partially purified after loading milk whey directly on HEA HyperCel sorbent.
The combination of small pre-packed columns and SELDI-MS appeared to be a valuable strategy for
high-throughput screening of chromatography sorbents and for enabling rapid process development
and optimization.
17
A synthetic ligand called 2-mercapto-5-benzimidazolesulfonic acid has been successfully used for
the specific chromatographic capture of antibodies from a cell culture supernatant. Adsorption
occurred at physiological ionic strength and pH range between 5.0 and 6.0, with some binding
capacity variations within this pH range: antibody uptake increased when the pH decreased. With
very dilute feedstocks, as was the case with the cell culture supernatant under investigation,
it was found that the pH had to be slightly lowered to get a good antibody sorption capacity. To
optimize separation conditions, a preliminary study was made using ProteinChip® Arrays that
displayed the same chemical functionalities as the resin. Arrays were analyzed using SELDI–MS.
By this means, it was possible to cross-over simultaneously different pH conditions at the
adsorption and the desorption steps. Best conditions were implemented for preparative separation
using regular lab-scale columns. At pH 5.2, antibody adsorption was not complete, while at
pH 5.0 the antibody was entirely captured. pH 9 was selected at elution, rather than pH 8.0 or
10.0, and resulted in a complete desorption of antibodies from the column. Benefits of the
prediction of separation conditions of antibodies on MBI beads using

25. Brunner E, Ahrens CH et al (2007) A high-quality catalog of the Drosophila melanogaster
proteome. Nat Biotech 25(5):576–583¹⁸
26. Burton SC, Haggarty NW et al (1997) One step purification of chymosin by mixed mode
chromatography. Biotechnol Bioeng 56(1):45–55¹⁹
27. Burton SC, Harding DRK (1998) Hydrophobic charge induction chromatography: salt independent
protein adsorption and facile elution with aqueous buffers. J Chromatogr A 814(1–2):71–81²⁰
28. Buszewski B, Noga S (2012) Hydrophilic interaction liquid chromatography (HILIC)–a powerful
separation technique. Anal Bioanal Chem 402(1):231–247²¹
29. Calvano CD, Zambonin CG et al (2008) Assessment of lectin and HILIC based enrichment
protocols for characterization of serum glycoproteins by mass spectrometry. J Proteome
71(3):304–317²²

SELDI–MS were a significant reduction in analysis time and in sample volume. This was possible
because the separation of IgG on the chip surface did mimic very well the separation on beads.
18
Understanding how proteins and their complex interaction networks convert the genomic
information into a dynamic living organism is a fundamental challenge in biological sciences. As
an important step towards understanding the systems biology of a complex eukaryote, we cataloged
63 % of the predicted Drosophila melanogaster proteome by detecting 9124 proteins from 498,000
redundant and 72,281 distinct peptide identifications. This unprecedented high proteome coverage
for a complex eukaryote was achieved by combining sample diversity, multidimensional biochemical
fractionation and analysis-driven experimentation feedback loops, whereby data collection is
guided by statistical analysis of prior data. We show that high-quality proteomics data provide
crucial information to amend genome annotation and to confirm many predicted gene models. We
also present experimentally identified proteotypic peptides matching ~50 % of D. melanogaster
gene models. This library of proteotypic peptides should enable fast, targeted and quantitative
proteomic studies to elucidate the systems biology of this model organism.
19
Mixed mode Sepharose and Perloza bead cellulose matrices were prepared using various
chemistries. These matrices contained hydrophobic (aliphatic and/or aromatic) and ionic
(carboxylate or alkylamine) groups. Hydrophobic amine ligands were attached to epichlorohydrin
activated Sepharose (mixed mode amine matrices). Hexylamine, aminophenylpropanediol and
phenylethylamine were the preferred ligands, on the basis of cost and performance. Other mixed
mode matrices were produced by incomplete attachment (0–80 %) of the same amine ligands to
carboxylate matrices. The best results were obtained using unmodified or partially
ligand-modified aminocaproic acid Sepharose and Perloza. High ligand densities were used,
resulting in high capacity. Furthermore, chymosin was adsorbed at high and low ionic strengths,
which reduced sample preparation requirements. Chymosin, essentially homogeneous by
electrophoresis, was recovered by a small pH change. The methods described were simple,
efficient, inexpensive and provided very good resolution of chymosin from a crude recombinant
source. The carboxylate matrices had the best combination of capacity and regeneration
properties. The performance of Sepharose and Perloza carboxylate matrices was similar, but
higher capacities were found for the latter. Because it is cheaper and can be used at higher
flow rates, Perloza should be better suited to large scale application. High capacity chymosin
adsorption was found with carboxymethyl ion exchange matrices, but low ionic strength was
essential for adsorption and the purity was inferior to that of the mixed mode matrices.
© 1997 John Wiley & Sons, Inc. Biotechnol Bioeng 56: 45–55, 1997.
20
A new form of protein chromatography, hydrophobic charge induction, is described. Matrices
prepared by attachment of weak acid and base ligands were uncharged at adsorption pH. At low
ligand densities, protein adsorption was typically promoted with lyotropic salts. At higher
ligand densities, chymosin, chymotrypsinogen and lysozyme were adsorbed independently of ionic
strength. A pH change released the electrostatic potential of the matrix and weakened
hydrophobic interactions, inducing elution. Matrix hydrophobicity and titration range could be
matched to protein requirements by ligand choice and density. Both adsorption and elution could
be carried out within the pH 5–9 range.
21
Hydrophilic interaction liquid chromatography (HILIC) provides an alternative approach to
effectively separate small polar compounds on polar stationary phases. The purpose of this work
was to review the options for the characterization of HILIC stationary phases and their
applications for separations of polar compounds in complex matrices. The characteristics of the
hydrophilic stationary phase may affect and in some cases limit the choices of mobile phase
composition, ion strength or buffer pH value available, since mechanisms other than hydrophilic
partitioning could potentially occur. Enhancing our understanding of retention behavior in HILIC
increases the scope of possible applications of liquid chromatography. One interesting option
may also be to use HILIC in orthogonal and/or two-dimensional separations. Bioapplications of
HILIC systems are also presented.
22
Protein glycosylation is a common post-translational modification that is involved in many
biological processes, including cell adhesion, protein–protein and receptor-ligand interactions.
The glycoproteome constitutes a source for identification of disease biomarkers since altered
protein glycosylation profiles are associated with certain human ailments. Glycoprotein analysis
by mass spectrometry of biological samples, such as blood serum, is hampered by sample
complexity and the low concentration of the potentially informative

30. Cargile BJ, Bundy JL et al (2004) Gel based isoelectric focusing of peptides and the utility
of isoelectric point in protein identification. J Proteome Res 3(1):112–119²³
31. Chen Y, Mant CT et al (2003) Temperature selectivity effects in reversed-phase liquid
chromatography due to conformation differences between helical and non-helical peptides.
J Chromatogr A 1010(1):45–61²⁴
32. Chung WK, Freed AS et al (2010) Evaluation of protein adsorption and preferred binding
regions in multimodal chromatography using NMR. Proc Natl Acad Sci 107(39):16811–16816²⁵
33. Coffinier Y, Vijayalakshmi MA (2004) Mercaptoheterocyclic ligands grafted on a poly(ethylene
vinyl alcohol) membrane for the purification of immunoglobulin G in a salt independent
thiophilic chromatography. J Chromatogr B 808(1):51–56²⁶

glycopeptides and -proteins. We assessed the utility of lectin-based and HILIC-based affinity
enrichment techniques, alone or in combination, for preparation of glycoproteins and
glycopeptides for subsequent analysis by MALDI and ESI mass spectrometry. The methods were
successfully applied to human serum samples and a total of 86 N-glycosylation sites in
45 proteins were identified using a mixture of three immobilized lectins for consecutive
glycoprotein enrichment and glycopeptide enrichment. The combination of lectin affinity
enrichment of glycoproteins and subsequent HILIC enrichment of tryptic glycopeptides identified
81 N-glycosylation sites in 44 proteins. A total of 63 glycosylation sites in 38 proteins were
identified by both methods, demonstrating distinct differences and complementarity. Serial
application of custom-made microcolumns of mixed, immobilized lectins proved efficient for
recovery and analysis of glycopeptides from serum samples of breast cancer patients and healthy
individuals to assess glycosylation site frequencies.
23
Here we present the theoretical and experimental evaluation of peptide isoelectric point as a
method to aid in the identification of peptides from complex mixtures. Predicted pI values were
found to match closely the experimentally obtained data, resulting in the development of a
unique filter that lowers the effective false positive rate for peptide identification. Due to
the reduction of the false positive rate, the cross-correlation parameters Xcorr and deltaCn
from the SEQUEST program can be lowered resulting in 25 % more peptide identifications. This
approach was successfully applied to analysis of the soluble fraction of the E. coli proteome,
where 417 proteins were identified from 1022 peptides using just 20 μg of material.
24
In order to characterize the effect of temperature on the retention behaviour and selectivity of
separation of polypeptides and proteins in reversed-phase high-performance liquid chromatography
(RP-HPLC), the chromatographic properties of four series of peptides, with different peptide
conformations, have been studied as a function of temperature (5–80 °C). The secondary structure
of model peptides was based on either the amphipathic α-helical peptide sequence
Ac-EAEKAAKEXd/lEKAAKEAEK-amide, (position X being in the centre of the hydrophobic face of the
α-helix), or the random coil peptide sequence Ac-Xd/lLGAKGAGVG-amide, where position X is
substituted by the 19 l- or d-amino acids and glycine. We have shown that the helical peptide
analogues exhibited a greater effect of varying temperature on elution behaviour compared to the
random coil peptide analogues, due to the unfolding of α-helical structure with the increase of
temperature during RP-HPLC. In addition, temperature generally produced different effects on the
separations of peptides with different l- or d-amino acid substitutions within the groups of
helical or non-helical peptides. The results demonstrate that variations in temperature can be
used to effect significant changes in selectivity among the peptide analogues despite their very
high degree of sequence homology. Our results also suggest that a temperature-based approach to
RP-HPLC can be used to distinguish varying amino acid substitutions at the same site of the
peptide sequence. We believe that the peptide mixtures presented here provide a good model for
studying temperature effects on selectivity due to conformational differences of peptides, both
for the rational development of peptide separation optimization protocols and a probe to
distinguish between peptide conformations.
25
NMR titration experiments with labeled human ubiquitin were employed in concert with
chromatographic data obtained with a library of ubiquitin mutants to study the nature of protein
adsorption in multimodal (MM) chromatography. The elution order of the mutants on the MM resin
was significantly different from that obtained by ion-exchange chromatography. Further, the
chromatographic results with the protein library indicated that mutations in a defined region
induced greater changes in protein affinity to the solid support. Chemical shift mapping and
determination of dissociation constants from NMR titration experiments with the MM ligand and
isotopically enriched ubiquitin were used to determine and rank the relative binding affinities
of interaction sites on the protein surface. The results with NMR confirmed that the protein
possessed a distinct preferred binding region for the MM ligand in agreement with the
chromatographic results. Finally, coarse-grained ligand docking simulations were employed to
study the modes of interaction between the MM ligand and ubiquitin. The use of NMR titration
experiments in concert with chromatographic data obtained with protein libraries represents a
previously undescribed approach for elucidating the structural basis of protein binding affinity
in MM chromatographic systems.
26
In this study, we attempted a limited combinatorial approach for designing affinity ligands
based on mercaptoheterocyclic components. The template, divinyl

34. Cox GB, Stout RW (1987) Study of the retention mechanisms for basic compounds on silica under
pseudo-reversed-phase conditions. J Chromatogr 384:315–336
35. Cunliffe JM, Maloney TD (2007) Fused-core particle technology as an alternative to sub-2-μm
particles to achieve high separation efficiency with low backpressure. J Sep Sci 30(18):3104–3109
36. Dai J, Jin WH et al (2006) Protein phosphorylation and expression profiling by Yin-Yang
multidimensional liquid chromatography (Yin-Yang MDLC) mass spectrometry. J Proteome Res
6(1):250–262²⁷
37. Di Palma S, Boersema PJ et al (2011) Zwitterionic hydrophilic interaction liquid
chromatography (ZIC-HILIC and ZIC-cHILIC) provide high resolution separation and increase
sensitivity in proteome analysis. Anal Chem 83(9):3440–3447²⁸
38. Di Palma S, Hennrich ML et al (2012) Recent advances in peptide separation by
multidimensional liquid chromatography for proteome analysis. J Proteome 75(13):3791–3813²⁹
sulfone structure (DVS), which was grafted on poly(ethylene vinyl alcohol) (PEVA) hollow fiber
membrane, has served for the tethering of different heterocyclic compounds as pyridine, imidazole,
purine and pyrimidine rings. Their ability to adsorb specifically IgG in a salt independent manner
out of pure IgG solution, mixture of IgG/albumin and human plasma was demonstrated. Mercapto
methyl imidazole (MMI) has shown the best adsorption of IgG in terms of binding capacity. No
subclass discrimination was observed on all tested ligands except for mercapto methyl pyrimidine
where the major IgG subclass adsorbed was IgG3. MMI gave an IgG binding capacity of 100 μg/cm² of
hollow fiber membrane surface area.
27
A system which consisted of multidimensional liquid chromatography (Yin-yang MDLC) coupled with
mass spectrometry was used for the identification of peptides and phosphopeptides. The
multidimensional liquid chromatography combines the strong-cation exchange (SCX), strong-anion
exchange (SAX), and reverse-phase methods for the separation. Protein digests were first loaded on
an SCX column. The flow-through peptides from SCX were collected and further loaded on an SAX
column. Both columns were eluted by offline pH steps, and the collected fractions were identified
by reverse-phase liquid chromatography tandem mass spectrometry. Comprehensive peptide
identification was achieved by the Yin-yang MDLC-MS/MS for a 1 mg mouse liver. In total, 14,105
unique peptides were identified with high confidence, including 13,256 unmodified peptides and
849 phosphopeptides with 809 phosphorylated sites. The SCX and SAX in the Yin-Yang system
displayed complementary features of binding and separation for peptides. When coupled with
reverse-phase liquid chromatography mass spectrometry, the SAX-based method can detect more
extremely acidic (pI < 4.0) and phosphorylated peptides, while the SCX-based method detects more
relatively basic peptides (pI > 4.0). In total, 134 groups of phosphorylated peptide isoforms were
obtained, with common peptide sequences but different phosphorylated states. This unbiased
profiling of protein expression and phosphorylation provides a powerful approach to probe protein
dynamics, without using any prefractionation and chemical derivation. Keywords: Protein
phosphorylation; Protein expression; Strong-cation exchange; Strong-anion exchange; Yin-Yang
multidimensional liquid chromatography; pH elution; Mass spectrometry.
28
The complexity of peptide mixtures that are analyzed in proteomics necessitates fractionation by
multidimensional separation approaches prior to mass spectrometric analysis. In this work, we
introduce and evaluate hydrophilic interaction liquid chromatography (HILIC) based strategies for
the separation of complex peptide mixtures. The two zwitterionic HILIC materials (ZIC-HILIC and
ZIC-cHILIC) chosen for this work differ in the spatial orientation of the positive and negative
charged groups. Online experiments revealed a pH-independent resolving power for the ZIC-cHILIC
resin while ZIC-HILIC showed a decrease in resolving power at an acidic pH. Subsequently, we
extensively evaluated the performances of ZIC-HILIC and ZIC-cHILIC as first dimension in an
off-line two-dimensional liquid chromatography (2D-LC) strategy in combination with reversed phase
(RP), with respect to peptide separation efficiency and how the retention time correlates with a
number of peptide physicochemical properties. Both resins allowed the identification of more than
20,000 unique peptides corresponding to over 3500 proteins in each experimental condition from a
remarkably low (1.5 μg) amount of starting material of HeLa lysate digestion. The resulting data
allows the drawing of a comprehensive picture regarding ZIC- and ZIC-cHILIC peptide separation
characteristics. Furthermore, the extent of protein identifications observed from such a level of
material demonstrates that HILIC can rival or surpass traditional multidimensional strategies
employed in proteomics.
29
Shotgun proteomics dominates the field of proteomics. The foundations of the strategy consist of
multiple rounds of peptide separation where chromatography provides the bedrock. Initially, the
scene was relatively simple with the majority of strategies based on some types of ion exchange
and reversed phase chromatography. The thirst to achieve comprehensivity, when it comes to
proteome coverage and the global characterization of post translational modifications, has led to
the introduction of several new separations. In this review, we attempt to provide a historical
perspective to separations in proteomics as well as indicate the principles of their operation and
rationales for their implementation. Furthermore, we provide a guide on what are the possibilities
for combining different separations in order to increase peak capacity and proteome coverage. We
aim to show how separations enrich
114 U. Kota and M.L. Stolowitz

39. Diederich P, Hansen SK et al (2011) A sub-two minutes method for monoclonal antibody-aggregate
quantification using parallel interlaced size exclusion high performance liquid chromatography.
J Chromatogr A 1218(50):9010–9018³⁰
40. Dong M, Wu M et al (2010) Coupling strong anion-exchange monolithic capillary with MALDI-TOF
MS for sensitive detection of phosphopeptides in protein digest. Anal Chem 82(7):2907–2915
41. Dong MW (2006) HPLC columns and trends. Modern HPLC for practicing scientists, Wiley 47–75³¹
42. Edelmann MJ (2011) Strong cation exchange chromatography in analysis of posttranslational
modifications: innovations and perspectives. J Biomed Biotechnol 2011:7
43. El Rassi Z, Horváth C (1986) Tandem columns and mixed-bed columns in high-performance liquid
chromatography of proteins. J Chromatogr A 359(0):255–264³²
44. Engelhardt H, Mathes D (1981) High-performance liquid chromatography of proteins using
chemically-modified silica supports. Chromatographia 14(6):325–332
45. Engholm-Keller K, Hansen TA et al (2011) Multidimensional strategy for sensitive
phosphoproteomics incorporating protein prefractionation combined with SIMAC, HILIC, and TiO2
chromatography applied to proximal EGF signaling. J Proteome Res 10(12):5383–5397³³

the world of proteomics and how further developments may impact the field.
30
In process development and during commercial production of monoclonal antibodies (mAb) the
monitoring of aggregate levels is obligatory. The standard assay for mAb aggregate quantification
is based on size exclusion chromatography (SEC) performed on a HPLC system. Advantages hereof are
high precision and simplicity, however, standard SEC methodology is very time consuming. With an
average throughput of usually two samples per hour, it neither fits to high throughput process
development (HTPD), nor is it applicable for purification process monitoring. We present a
comparison of three different SEC columns for mAb-aggregate quantification addressing throughput,
resolution, and reproducibility. A short column (150 mm) with sub-two micron particles was shown
to generate high resolution (~1.5) and precision (coefficient of variation (cv) < 1) with an assay
time below 6 min. This column type was then used to combine interlaced sample injections with
parallelization of two columns aiming for an absolute minimal assay time. By doing so, both lag
times before and after the peaks of interest were successfully eliminated resulting in an assay
time below 2 min. It was demonstrated that determined aggregate levels and precision of the
throughput optimized SEC assay were equal to those of a single injection based assay. Hence, the
presented methodology of parallel interlaced SEC (PI-SEC) represents a valuable tool addressing
HTPD and process monitoring.
31
This chapter contains sections titled: * Scope * General Column Description and Characteristics *
Column Types * Column Packing Characteristics * Modern HPLC Column Trends * Guard Columns *
Specialty Columns * Column Selection Guides * Summary * References * Internet Resources.
32
By using a cation- and an anion-exchange column in series, mixtures of acidic and basic proteins
were separated in a single chromatographic run with increasing salt gradient at pH 7.0. The serial
order of the columns was found to affect the chromatographic results, and the effect was
attributed to alteration of the salt gradient profile upon traversing the first ion-exchange
column. Single columns, packed with a binary mixture of a cation and an anion exchanger gave
similar chromatographic results as the tandem columns and thus offered an alternative approach to
the separation of both acidic and basic proteins in a single chromatographic run. A ternary mixed
phase was obtained by adding a mildly hydrophobic stationary phase to the mixture of the two ion
exchangers. This column could be used with increasing salt gradient as a cation exchanger for the
separation of basic proteins, or as an anion exchanger for the separation of acidic proteins.
Furthermore, it could be used as a “bipolar” electrostatic-interaction column with increasing salt
gradient and as a hydrophobic-interaction column with decreasing salt gradient for the separation
of both types of proteins in a single chromatographic run. The constituent stationary phases used
in the mixed-bed columns were prepared from the same silica support, i.e., they had the same
particle and pore dimensions, density, and pore volume. Besides their obvious advantages in
analytical applications, appropriate mixed stationary phases, all having retentive properties for
the components to be separated, are expected to be useful also in preparative chromatography to
“tailor” column selectivity for a given separation problem without loss of separating capacity.
33
Comprehensive enrichment and fractionation is essential to obtain a broad coverage of the
phosphoproteome. This inevitably leads to sample loss, and thus, phosphoproteomics studies are
usually only performed on highly abundant samples. Here, we present a comprehensive
phosphoproteomics strategy applied to 400 μg of protein from EGF-stimulated HeLa cells. The
proteins are separated into membrane and cytoplasmic fractions using sodium carbonate combined
with ultracentrifugation. The phosphopeptides were separated into monophosphorylated and
multiphosphorylated pools using sequential elution from IMAC (SIMAC) followed by hydrophilic
interaction liquid chromatography of the mono- and nonphosphorylated peptides and subsequent
titanium dioxide chromatography of the HILIC fractions. This strategy facilitated the
identification of >4700 unique phosphopeptides, while 636 phosphosites were changing following
short-term EGF stimulation, many of which were not previously known to be involved in
46. Fekete S, Beck A et al (2014) Theory and practice of size exclusion chromatography for the
analysis of protein aggregates. J Pharm Biomed Anal (0)³⁴
47. Flanagan RJ, Jane I (1985) High-performance liquid chromatographic analysis of basic drugs on
silica columns using non-aqueous ionic eluents. I. Factors influencing retention, peak shape and
detector response. J Chromatogr 323(2):173–189³⁵
48. Fukuda I, Hirabayashi-Ishioka Y et al (2013) Optimization of enrichment conditions on TiO2
chromatography using glycerol as an additive reagent for effective phosphoproteomic analysis.
J Proteome Res 12(12):5587–5597³⁶
49. Gagnon P, Beam K (2009) Antibody aggregate removal by hydroxyapatite chromatography. Curr
Pharm Biotechnol 10(4):440–446³⁷

EGFR signaling. We further compared three different data processing programs and found large
differences in their peptide identification rates due to different implementations of
recalibration and filtering. Manually validating a subset of low-scoring peptides exclusively
identified using the MaxQuant software revealed a large percentage of false positive
identifications. This indicates that, despite having highly accurate precursor mass determination,
peptides with low fragment ion scores should not automatically be reported in phosphoproteomics
studies.
34
Size exclusion chromatography (SEC) is a historical technique widely employed for the detailed
characterization of therapeutic proteins and can be considered as a reference and powerful
technique for the qualitative and quantitative evaluation of aggregates. The main advantage of
this approach is the mild mobile phase conditions that permit the characterization of proteins
with minimal impact on the conformational structure and local environment. Despite the fact that
the chromatographic behavior and peak shape are hardly predictable in SEC, some generic rules can
be applied for SEC method development, which are described in this review. During recent years,
some improvements were introduced to conventional SEC that will also be discussed. Of these new
SEC characteristics, we discuss (i) the commercialization of shorter and narrower columns packed
with reduced particle sizes allowing an improvement in the resolution and throughput; (ii) the
possibility of combining SEC with various detectors, including refractive index (RI), ultraviolet
(UV), multi-angle laser light scattering (MALLS) and viscometer (IV), for extensive
characterization of protein samples and (iii) the possibility of hyphenating SEC with mass
spectrometry (MS) detectors using an adapted mobile phase containing a small proportion of organic
modifiers and ion-pairing reagents.
35
The use of silica columns together with non-aqueous ionic eluents provides a stable yet flexible
system for the high-performance liquid chromatographic analysis of basic drugs. At constant ionic
strength, eluent pH influences retention via ionisation of surface silanols and protonation of
basic analytes, pKa values indicating the pH of maximum retention. At constant pH, retention is
proportional to the reciprocal of the eluent ionic strength for fully protonated analytes and
quaternary ammonium compounds. The addition of water up to 10 % (v/v) has little effect on
retention if the protonation of the analytes is unaffected. Thus, it is likely that retention is
mediated primarily via cation exchange with surface silanols. However, additional factors must
play a part with compounds such as morphine which give tailing peaks at acidic or neutral eluent
pHs.
36
Metal oxide affinity chromatography (MOAC) represented by titanium dioxide (TiO2) chromatography
has been used for phosphopeptide enrichment from cell lysate digests prior to mass spectrometry.
For in-depth phosphoproteomic analysis, it is important for MOAC to achieve high phosphopeptide
enrichment efficiency by optimizing purification conditions. However, there are some differences
in phosphopeptide selectivity and specificity enriched by various TiO2 materials and procedures.
Here, we report that binding/wash buffers containing polyhydric alcohols, such as glycerol,
markedly improve phosphopeptide selectivity from complex peptide mixtures. In addition, the
elution conditions combined with secondary amines, such as bis-Tris propane, made it possible to
recover phosphopeptides with highly hydrophobic properties and/or longer peptide lengths. To
assess the practical applicability of our improved method, we confirmed using PC3 prostate cancer
cells. By combining the hydrophilic interaction chromatography (HILIC) with the optimized TiO2
enrichment method prior to LC-MS/MS analysis, over 8300 phosphorylation sites and 2600
phosphoproteins were identified. Additionally, some dephosphorylations of those were identified by
treatment with dasatinib for a kinase inhibitor. These results indicate that our method is
applicable to understanding the profiling of kinase inhibitors such as anticancer compounds, which
will be useful for drug discovery and development.
37
Hydroxyapatite (HA) has proven in recent years to be one of the most versatile and powerful
methods for removing aggregates from antibody preparations. It is effective with IgA, IgG and IgM,
and it reduces aggregate levels from above 60 % to less than 0.1 %. Three basic elution strategies
have evolved, one that removes aggregates from a modest proportion of clones, another from the
majority, and one that appears to be universally effective. Each has distinct development and
process ramifications. This review defines what HA is, how it interacts with various classes of
biomolecules, how those interactions are controlled by different elution strategies, and how to
determine which approach may be most effective for a particular antibody. Consideration is also
given to HA’s specific strengths and limitations from an industrial perspective.
50. Ghose S, Hubbard B et al (2005) Protein interactions in hydrophobic charge induction
chromatography (HCIC). Biotechnol Prog 21(2):498–508
51. Ghose S, Hubbard B et al (2006) Evaluation and comparison of alternatives to Protein A
chromatography: mimetic and hydrophobic charge induction chromatographic stationary phases.
J Chromatogr A 1122(1–2):144–152³⁸
52. Gilar M, Olivova P et al (2005) Orthogonality of separation in two-dimensional liquid
chromatography. Anal Chem 77(19):6426–6434³⁹
53. Gilar M, Olivova P et al (2005) Two-dimensional separation of peptides using RP-RP-HPLC system
with different pH in first and second separation dimensions. J Sep Sci 28(14):1694–1703⁴⁰
54. Gilar M, Yu YQ et al (2008) Mixed-mode chromatography for fractionation of peptides,
phosphopeptides, and sialylated glycopeptides. J Chromatogr A 1191(1–2):162–170⁴¹
38
In this paper Protein A mimetic and hydrophobic charge induction chromatographic (HCIC) stationary
phases are characterized in terms of their protein adsorption characteristics and their
selectivity is compared with Protein A chromatography using a set of Chinese hamster ovary-derived
monoclonal antibodies and Fc-fusion proteins. Linear retention experiments were employed to
compare the selectivities of these resins for both non-IgG model proteins as well as antibodies
and the fusion proteins. While none of the non-IgG model proteins were observed to bind to the
Protein A resin, most of them did in fact bind to the alternative resins. In addition, while the
elution pH was similar for the model proteins and antibodies on the HCIC resin, the mimetic resins
did exhibit higher binding for the antibodies under these linear pH gradient conditions. A mixed
mode preparative isotherm model previously developed for HCIC was shown to accurately describe the
adsorption behavior of the mimetic materials as well. Host cell protein clearance profiles were
also investigated under preparative conditions using complex biological feeds and the results
indicated that while some selectivity was observed for both the HCIC and the mimetic materials,
the purification factors were in general significantly less than those obtained with Protein A. It
is important to note, however, that the selectivity of the mimetic and HCIC materials was also
observed to be antibody specific indicating that further optimization may well result in increased
selectivities for these materials.
39
Two-dimensional liquid chromatography is often used to reduce the proteomic sample complexity
prior to tandem mass spectrometry analysis. The 2D-LC performance depends on the peak capacity in
both chromatographic dimensions, and separation orthogonality. The peak capacity and selectivity
of many LC modes for peptides is not well known, and mathematical characterization for
orthogonality is underdeveloped. Consequently, it is difficult to estimate the performance of
2D-LC for peptide separation. The goal of this paper was to investigate a selectivity of common LC
modes and to identify the 2D-LC systems with a useful orthogonality. A geometric approach for
orthogonality description was developed and applied for estimation of a practical peak 2D-LC
capacity. Selected LC modes including various RP, SCX, SEC, and HILIC were combined in 2D-LC
setups. SCX-RP, HILIC-RP, and RP-RP 2D systems were found to provide suitable orthogonality. The
RP-RP system (employing significantly different pH in both RP separation dimensions) had the
highest practical peak capacity of 2D-LC systems investigated.
40
Two-dimensional high performance liquid chromatography is a useful tool for proteome analysis,
providing a greater peak capacity than single-dimensional LC. The most popular 2D-HPLC approach
used today for proteomic research combines strong cation exchange and reversed-phase HPLC. We have
evaluated an alternative mode for 2D-HPLC of peptides, employing reversed-phase columns in both
separation dimensions. The orthogonality of 2D separation was investigated for selected types of
RP stationary phases, ion-pairing agents and mobile phase pH. The pH appears to have the most
significant impact on the RP-LC separation selectivity; the greatest orthogonality was achieved
for the system with C18 columns using pH 10 in the first and pH 2.6 in the second LC dimension.
Separation was performed in off-line mode with partial fraction evaporation. The achievable peak
capacity in RP-RP-HPLC and overall performance compares favorably to SCX-RP-HPLC and holds promise
for proteomic analysis.
41
A mixed-mode chromatographic (MMC) sorbent was prepared by functionalizing the silica sorbent with
a pentafluorophenyl (PFP) ligand. The resulting stationary phase provided a reversed-phase (RP)
retention mode along with a relatively mild strong cation-exchange (SCX) retention interaction.
While the mechanism of interaction is not entirely clear, it is believed that the silanols in the
vicinity of the perfluorinated ligand act as strongly acidic sites. The 2.1 mm × 150 mm column
packed with such sorbent was applied to the separation of peptides. Linear RP gradients in
combination with salt steps were used for pseudo two-dimensional (2D) separation and fractionation
of tryptic peptides. An alternative approach of using linear cation-exchange gradients combined
with RP step gradients was also investigated. Besides the attractive forces, the ionic repulsion
contributed to the retention mechanism. The analytes with strong negatively charged sites
(phosphorylated peptides, sialylated glycopeptides) eluted in significantly different patterns
than generic tryptic peptides. This retention mechanism was employed for the isolation of
phosphopeptides or sialylated glycopeptides from non-functionalized peptide mixtures. The
mixed-mode column was utilized in conjunction with a phosphopeptide enrichment solid phase
extraction (SPE) device packed with metal oxide affinity chromatography (MOAC)
55. Girot P, Averty E et al (2004) 2-Mercapto-5-benzimidazolesulfonic acid: an effective
multimodal ligand for the separation of antibodies. J Chromatogr B 808(1):25–33⁴²
56. Golovchenko NP, Kataeva IA, Akimenko VK et al (1992) Analysis of pH-dependent protein
interactions with gel filtration medium⁴³
57. Guo Y, Wang X (2013) HILIC method development. Hydrophilic interaction chromatography, Wiley,
87–110⁴⁴
58. Hägglund P, Bunkenborg J et al (2004) A new strategy for identification of N-glycosylated
proteins and unambiguous assignment of their glycosylation sites using HILIC enrichment and
partial deglycosylation. J Proteome Res 3(3):556–566⁴⁵

sorbent. The combination of MOAC and mixed-mode chromatography (MMC) provided for an enhanced
extraction selectivity of phosphopeptides and sialylated glycopeptides from complex samples, such
as yeast and human serum tryptic digests.
42
The report describes the use of 2-mercapto-5-benzimidazolesulfonic acid (MBISA) as a ligand for
the separation of antibodies by chromatography. The ligand shows a relatively specific adsorption
property for antibodies from very crude biologicals at pH 5.0–5.5. At this pH range most of other
proteins do not interact with the resin especially when the ionic strength is similar to
physiological conditions. Several characterization studies are described such as antibody
adsorption in different conditions of ionic strength, pH and temperature. These properties are
advantageously used to selectively capture antibodies from very crude feed stocks without dilution
or addition of lyotropic salts. Demonstration was made that the adsorption mechanism is neither
based on ion exchange nor on hydrophobic associations, but rather as an assembly of a variety of
properties of the ligand itself. Binding capacity in the described conditions ranges between 25
and 30 mg/mL of resin. The sorbent does not co-adsorb albumin (Alb) and seems compatible with a
large variety of feedstocks. Quantitative antibody desorption occurs when the pH is raised above
8.5. The final purity of the antibody depends on the nature of the feedstock, and can reach levels
of purity as high as 98 %. Even with very crude biological liquids such as ascites fluids, cell
culture supernatants and Cohn fraction II + III from human plasma fractionation where the number
of protein impurities is particularly large, immunoglobulins G (IgG) were separated at high purity
level in a single step.
43
A prepacked Superose 12 HR 10/30 column was used to study the effects of elution ionic strength
and pH on the chromatographic behaviour of a strong hydrophobic Clostridium thermocellum
endoglucanase (1) and two weak hydrophobic proteins, Clostridium thermocellum endoglucanase C and
egg white lysozyme. Ion-exclusion or ion-exchange interactions between weakly hydrophobic proteins
and the gel matrix were observed at low ionic strength, depending on whether the pH of the elution
buffer was higher or lower than the pI values of the proteins. These interactions were due to the
presence of negatively charged groups on the surface of Superose and could be eliminated at any pH
by adding electrolyte at a concentration determined by its chemical identity. The optimum results
were observed with sodium sulphate at a concentration of 100 mM. The chromatographic behaviour of
strong hydrophobic endoglucanase (1) on a Superose column as a function of pH was much more
complex because of two interplaying effects, electrostatic and hydrophobic. Ideal size-exclusion
chromatography could be achieved only in a narrow range of the conditions: first, the mobile phase
must contain a weak salting-out electrolyte such as NaCl, and second, the mobile phase pH must be
high enough that hydrophobic interactions between the solute and support are balanced by their
electrostatic repulsion. At pH greater than pI, the retardation of endoglucanase (1) gradually
increased with decreasing pH as a result of lowering of repulsive electrostatic interactions
whether or not the buffer ionic strength was high. At pH less than pI a drastic increase in the
capacity factor k′ was observed owing to the additivity of hydrophobic and ion-exchange effects.
(Abstract truncated at 250 words)
44
This chapter focuses on method development employing hydrophilic interaction chromatography
(HILIC) as the chromatographic technique. Various aspects of method development are discussed
including method objectives, sample considerations, systematic method development, column and
mobile phase selection, and other operating parameters (e.g., column temperature, sample solvent,
and charged aerosol detector (CAD) or mass spectrometric (MS) detectors). The chapter provides
general guidance on HILIC method development based on a solid understanding of HILIC basics and
the authors’ experience with bioanalytical and pharmaceutical methods.
45
Characterization of glycoproteins using mass spectrometry ranges from determination of
carbohydrate-protein linkages to the full characterization of all glycan structures attached to
each glycosylation site. In a novel approach to identify N-glycosylation sites in complex
biological samples, we performed an enrichment of glycosylated peptides through hydrophilic
interaction liquid chromatography (HILIC) followed by partial deglycosylation using a combination
of endo-β-N-acetylglucosaminidases (EC 3.2.1.96). After hydrolysis with these enzymes, a single
N-acetylglucosamine (GlcNAc) residue remains linked to the asparagine residue. The removal of the
major part of the glycan simplifies the MS/MS fragment ion spectra of glycopeptides, while the
remaining GlcNAc residue enables unambiguous assignment of the glycosylation site together with
the amino acid sequence. We first tested our approach on a mixture of known glycoproteins, and
59. Hamilton GE, Luechau F et al (2000) Development of a mixed mode adsorption process for the
direct product sequestration of an extracellular protease from microbial batch cultures.
J Biotechnol 79(2):103–115⁴⁶
60. Han G, Ye M et al (2008) Large-scale phosphoproteome analysis of human liver tissue by
enrichment and fractionation of phosphopeptides with strong anion exchange chromatography.
Proteomics 8(7):1346–1361⁴⁷
61. Hartmann E, Chen Y et al (2003) Comparison of reversed-phase liquid chromatography and
hydrophilic interaction/cation-exchange chromatography for the separation of amphipathic
alpha-helical peptides with L- and D-amino acid substitutions in the hydrophilic face.
J Chromatogr A 1009(1–2):61–71⁴⁸
62. Hemström P, Irgum K (2006) Hydrophilic interaction chromatography. J Sep Sci
29(12):1784–1821⁴⁹

subsequently the method was applied to samples of human plasma obtained by lectin chromatography
followed by 1D gel-electrophoresis for determination of 62 glycosylation sites in 37
glycoproteins. Keywords: proteomics; post-translational modifications; mass spectrometry; HILIC;
endoglycosidase; lectin affinity chromatography; glycosylation; Plasma proteins.
46
Direct product sequestration of extracellular proteins from microbial batch cultures can be
achieved by continuous or intermittent broth recycle through an external extractive loop. Here, we
describe the development of a fluidisable, mixed mode adsorbent, designed to tolerate increasing
ionic strength (synonymous with extended productive batch cultures). This facilitated operations
for the integrated recovery of an extracellular acid protease from cultures of Yarrowia
lipolytica. Mixed mode adsorbents were prepared using chemistries containing hydrophobic and ionic
groups. Matrix hydrophobicity and titration ranges were matched to the requirements of integrated
protease adsorption. A single expanded bed was able to service the productive phase of growth
without recourse to the pH adjustment of the broth previously required for ion exchange
adsorption. This resulted in increased yields of product, accompanied by further increases in
enzyme specific activity. A step change from pH 4.5 to 2.6, across the isoelectric point of the
protease, enabled high resolution fixed bed elution induced by electrostatic repulsion. The
generic application of mixed mode chemistries, which combine the physical robustness of
ion-exchange ligands in sanitisation and sterilisation procedures with a selectivity, which
approaches that of affinity interactions, is discussed.
47
The mixture of phosphopeptides enriched from proteome samples are very complex. To reduce the
complexity it is necessary to fractionate the phosphopeptides. However, conventional enrichment
methods typically only enrich phosphopeptides but not fractionate

enrichment and identification of phosphopeptides. It was also demonstrated that SAX have the
ability to fractionate phosphopeptides under gradient elution based on their different interaction
with SAX adsorbent. SAX was further applied to enrich and fractionate phosphopeptides in tryptic
digest of proteins extracted from human liver tissue adjacent to tumorous region for
phosphoproteome profiling. This resulted in the highly confident identification of 274
phosphorylation sites from 305 unique phosphopeptides corresponding to 168 proteins at false
discovery rate (FDR) of 0.96 %.
48
Mixed-mode hydrophilic interaction/cation-exchange chromatography (HILIC/CEX) is a novel
high-performance technique which has excellent potential for peptide separations. Separations by
HILIC/CEX are carried out by subjecting peptides to linear increasing salt gradients in the
presence of high levels of acetonitrile, which promotes hydrophilic interactions overlaid on ionic
interactions with the cation-exchange matrix. In the present study, HILIC/CEX has been compared to
reversed-phase liquid chromatography (RP-HPLC) for separation of mixtures of diastereomeric
amphipathic alpha-helical peptide analogues, where L- and D-amino acid substitutions were made in
the centre of the hydrophilic face of the amphipathic alpha-helix. Unlike RP-HPLC, temperature had
a substantial effect on HILIC/CEX of the peptides, with a rise in temperature from 25 to 65 °C
increasing the retention times of the peptides as well as improving resolution. Our results again
highlight the potential of HILIC/CEX as a peptide separation mode in its own right as well as an
excellent complement to RP-HPLC.
49
Separation of polar compounds on polar stationary phases with partly aqueous eluents is by no
means a new separation mode in LC. The first HPLC applications were published more than 30 years
ago, and were for a long time mostly confined to carbohydrate analysis. In the early 1990s new
phases started to emerge, and the practice
phosphopeptides. In this study, the application of strong was given a name, hydrophilic interaction chromatogra-
anion exchange (SAX) chromatography for enrichment phy (HILIC). Although the use of this separation mode
and fractionation of phosphopeptides was presented. It has been relatively limited, we have seen a sudden
was found that phosphopeptides were highly enriched by increase in popularity over the last few years, promoted
SAX and majority of unmodified peptides did not bind by the need to analyze polar compounds in increasingly
onto SAX. Compared with Fe3+ immobilized metal affin- complex mixtures. Another reason for the increase in
ity chromatography (Fe3 + IMAC), almost double popularity is the widespread use of MS coupled to
phosphopeptides were identified from the same sample LC. The partly aqueous eluents high in ACN with a
when only one fraction was generated by SAX. SAX and limited need of adding salt is almost ideal for ESI. The
Fe3 + IMAC showed the complementarity in applications now encompass most categories of polar
5 Improving Proteome Coverage by Reducing Sample Complexity via Chromatography 119

63. Hirs CHW, Stein WH et al (1951) Chromatography of proteins. Ribonuclease. J Am Chem Soc 73(4):1893
64. Hjertén S (1964) The preparation of agarose spheres for chromatography of molecules and particles. Biochim Biophys Acta (BBA) – Special Sect Biophys Subjects 79(2):393–398⁵⁰
65. Hjertén S, Mosbach R (1962) “Molecular-sieve” chromatography of proteins on columns of cross-linked polyacrylamide. Anal Biochem 3(2):109–118⁵¹
66. Hjertén S, Rosengren J et al (1974) Hydrophobic interaction chromatography: the synthesis and the use of some alkyl and aryl derivatives of agarose. J Chromatogr A 101(2):281–288⁵²
67. Hong P, Koza S et al (2012) Size-exclusion chromatography for the analysis of protein biotherapeutics and their aggregates. J Liq Chromatogr Relat Technol 35(20):2923–2950⁵³
68. Huang P, Jin X et al (1999) Use of a mixed-mode packing and voltage tuning for peptide mixture separation in pressurized capillary electrochromatography with an ion trap storage/reflectron time-of-flight mass spectrometer detector. Anal Chem 71(9):1786–1791⁵⁴
69. Ibrahim MEA, Lucy CA (2013) Stationary phases for HILIC. Hydrophilic interaction chromatography, Wiley, 43–85⁵⁵
compounds, charged as well as uncharged, although HILIC is particularly well suited for solutes lacking charge where coulombic interactions cannot be used to mediate retention. The review attempts to summarize the ongoing discussion on the separation mechanism and gives an overview of the stationary phases used and the applications addressed with this separation mode in LC.
⁵⁰ A method is described for preparation of spherical agarose or agar grains, to be used as bed material for chromatographic “sieving” of molecules and particles. Due to a comparatively great hardness of these grains, they give high flow rates even if they are made small in order to increase the resolving power of the column.
⁵¹ Columns packed with cross-linked polyacrylamide have been used for chromatographic separation of high molecular weight substances, especially proteins. These columns also allow separation of large molecules from small ones, for instance proteins from amino acids, peptides, salts. There is a positive correlation between the molecular size of a protein and its Rf value.
⁵² Aliphatic and aromatic alcohols in the form of glycidyl ethers have been coupled to agarose gels. These neutral agarose derivatives, which thus contain hydrophobic substituents, have been used as adsorbents in hydrophobic interaction chromatography. The coupling yield and the degree of substitution have been determined for one aliphatic and one aromatic model substance. Different fractionation problems require different degrees of hydrophobicity of the substituents. To “tailor make” gels, the hydrophobicity can be varied in small steps by the use of aliphatic alcohols of different chain length. The agarose derivatives described have been used for the purification of proteins, demonstrated with a plasma fractionation, viruses (STNV) and even whole cells (baker’s yeast). Under suitable experimental conditions, the interactions can be very mild (enzyme activities have been recovered in a 50–100 % yield). Enzyme reactors with a high capacity can be prepared in a simple manner by applying the enzyme solution at any pH on to a suitable hydrophobically interacting bed. As the enzymes are not covalently linked to the bed, they can easily be recovered in the free form. Contrary to ion-exchange chromatography, the adsorption in hydrophobic interaction chromatography decreases with a decrease in ionic strength and temperature.
⁵³ In recent years, the use and number of biotherapeutics has increased significantly. For these largely protein-based therapies, the quantitation of aggregates is of particular concern given their potential effect on efficacy and immunogenicity. This need has renewed interest in size-exclusion chromatography (SEC). In the following review we will outline the history and background of SEC for the analysis of proteins. We will also discuss the instrumentation for these analyses, including the use of different types of detectors. Method development for protein analysis by SEC will also be outlined, including the effect of mobile phase and column parameters (column length, pore size). We will also review some of the applications of this mode of separation that are of particular importance to protein biopharmaceutical development and highlight some considerations in their implementation.
⁵⁴ A mixed-mode (reversed-phase/anion-exchange) stationary phase has been used as the capillary column packing for investigation of the separation of peptide mixtures in pressurized capillary electrochromatography (pCEC). This stationary phase contains both octadecylsilanes and dialkylamines. The amine groups of the stationary phase determine the charge density on the surface of the packing and can produce a strong and constant electroosmotic flow (EOF) at low pH. A comparison was made in terms of the capability of separating tryptic digests between the mixed-mode phase and C18 reversed phase. In addition, the constant EOF enabled the tuning of the retention and the selectivity of the separation by adjusting the mobile phase pH from 2 to 5. Furthermore, the magnitude and the polarity of the electric voltage were demonstrated to greatly influence the elution profiles of the peptides in pCEC. An ion trap storage/reflectron time-of-flight mass spectrometer was used as an on-line detector in these experiments due to its ability to provide rapid and accurate mass detection of the sample components eluting from the separation column.
⁵⁵ Literature and research on hydrophilic interaction liquid chromatography (HILIC) has increased dramatically in recent years. This has been accompanied by a correspondingly rapid increase in stationary phases developed for HILIC. This chapter first discusses all classes of stationary phases used in HILIC mode in terms of chemistry,
120 U. Kota and M.L. Stolowitz

70. Ikegami T, Tomomatsu K et al (2008) Separation efficiencies in hydrophilic interaction chromatography. J Chromatogr A 1184(1–2):474–503⁵⁶
71. Intoh A, Kurisaki A et al (2009) Separation with zwitterionic hydrophilic interaction liquid chromatography improves protein identification by matrix-assisted laser desorption/ionization-based proteomic analysis. Biomed Chromatogr 23(6):607–614⁵⁷
72. Irvine GB (2001) Determination of molecular size by size-exclusion chromatography (gel filtration). Current protocols in cell biology, Wiley⁵⁸
73. Irvine GB (2003) High-performance size-exclusion chromatography of peptides. J Biochem Biophys Methods 56(1–3):233–242⁵⁹
74. Jacobs JM, Mottaz HM et al (2003) Multidimensional proteome analysis of human mammary epithelial cells. J Proteome Res 3(1):68–75⁶⁰

available trade names, and representative applications. The classes of stationary phases include underivatized silica phase, derivatized silica phase, and nonsilica phases. Important characteristics of some selected commercial HILIC phases are summarized in a table. The table classifies HILIC phases according to their chemical nature. Then, the chapter compares these HILIC phases in terms of efficiency, retention, and selectivity.
⁵⁶ Hydrophilic interaction chromatography (HILIC) is important for the separation of highly polar substances including biologically active compounds, such as pharmaceutical drugs, neurotransmitters, nucleosides, nucleotides, amino acids, peptides, proteins, oligosaccharides, carbohydrates, etc. In the HILIC mode separation, aqueous organic solvents are used as mobile phases on more polar stationary phases that consist of bare silica, and silica phases modified with amino, amide, zwitterionic functional group, polyols including saccharides and other polar groups. This review discusses the column efficiency of HILIC materials in relation to solute and stationary phase structures, as well as comparisons between particle-packed and monolithic columns. In addition, a literature review consisting of 2006–2007 data is included, as a follow up to the excellent review by Hemström and Irgum.
⁵⁷ Comprehensive proteomic analyses necessitate efficient separation of peptide mixtures for the subsequent identification of proteins by mass spectrometry (MS). However, digestion of proteins extracted from cells and tissues often yields complex peptide mixtures that confound direct comprehensive MS analysis. This study investigated a zwitterionic hydrophilic interaction liquid chromatography (ZIC-HILIC) technique for the peptide separation step, which was verified by subsequent MS analysis. Human serum albumin (HSA) was the model protein used for this analysis. HSA was digested with trypsin and resolved by ZIC-HILIC or conventional strong cation exchange (SCX) prior to MS analysis for peptide identification. Separation with ZIC-HILIC significantly improved the identification of HSA peptides over SCX chromatography. Detailed analyses of the identified peptides revealed that the ZIC-HILIC has better peptide fractionation ability. We further demonstrated that ZIC-HILIC is useful for quantitatively surveying cell surface markers specifically expressed in undifferentiated embryonic stem cells. These results suggested the value of ZIC-HILIC as a novel and efficient separation method for comprehensive and quantitative proteomic analyses.
⁵⁸ Size-exclusion or gel filtration chromatography is one of the most popular methods for determining the sizes of proteins. Proteins in solution, or other macromolecules, are applied to a column with a defined support medium. The behavior of the protein depends on its size and that of the pores in the medium. If the protein is small relative to the pore size, it will partition into the medium and emerge from the column after larger proteins. Besides a protein’s size, this technique can also be used for protein purification, analysis of purity, and study of interactions between proteins. In this unit protocols are provided for size-exclusion high-performance liquid chromatography (SE-HPLC) and for conventional gel filtration, including calibration of columns (in terms of the Stokes radius) using protein standards.
⁵⁹ Gel filtration on soft gels has been employed for over 40 years for the separation, desalting and molecular weight estimation of peptides and proteins. Technical improvements have given rise to high-performance size-exclusion chromatography (HPSEC) on rigid supports, giving more rapid run times and increased resolution. Initially, these packings were more suitable for the separation of proteins than of peptides, but supports that operate in the fractionation range <10,000 Daltons (Da) are now available. In this report, HPSEC is described in relation to its application to peptides, especially regarding purification, estimation of molecular weight and study of molecular associations.
⁶⁰ Recent multidimensional liquid chromatography MS/MS studies have contributed to the identification of large numbers of expressed proteins for numerous species. The present study couples size exclusion chromatography of intact proteins with the separation of tryptically digested peptides using a combination of strong cation exchange and high resolution, reversed phase capillary chromatography to identify proteins extracted from human mammary epithelial cells (HMECs). In addition to conventional conservative criteria for protein identifications, the confidence levels were additionally increased through the use of peptide normalized elution times (NET) for the liquid chromatographic separation step. The combined approach resulted in a total of 5838 unique peptides identified covering 1574 different proteins with an estimated 4 % gene coverage of the human genome, as annotated by the National Center for Biotechnology Information (NCBI). This database provides a baseline for comparison against variations in other genetically and environmentally perturbed systems.

75. Jane I (1975) The separation of a wide range of drugs of abuse by high-pressure liquid chromatography. J Chromatogr A 111(1):227–233
76. Jiang W, Fischer G et al (2006) Zwitterionic stationary phase with covalently bonded phosphorylcholine type polymer grafts and its applicability to separation of peptides in the hydrophilic interaction liquid chromatography mode. J Chromatogr A 1127(1–2):82–91⁶¹
77. Jiang W, Irgum K (2001) Synthesis and evaluation of polymer-based zwitterionic stationary phases for separation of ionic species. Anal Chem 73(9):1993–2003⁶²
78. Jiang W, Irgum K (2002) Tentacle-type zwitterionic stationary phase prepared by surface-initiated graft polymerization of 3-[N,N-Dimethyl-N-(Methacryloyloxyethyl)-ammonium] propanesulfonate through peroxide groups tethered on porous silica. Anal Chem 74(18):4682–4687⁶³
79. Johansson BL, Belew M et al (2003) Preparation and characterization of prototypes for multi-modal

Proteins identified were categorized based upon intracellular location and biological process with the identification of numerous receptors, regulatory proteins, and extracellular proteins, demonstrating the usefulness of this application in the global analysis of human cells for future comparative studies. Keywords: human; HMEC; multidimensional; liquid chromatography; proteome; global; Size exclusion.
⁶¹ A novel phosphorylcholine type zwitterionic stationary phase was synthesized by graft polymerization of 2-methacryloyloxyethyl phosphorylcholine onto the surface of porous silica particles. The resulting material possesses both negatively charged phosphoric acid and positively charged quaternary ammonium groups, which renders it a low net charge over a wide pH range. The composition of the surface grafts were determined by elemental analysis and solid state NMR, and the surface charge (zeta-potential) in different buffer solutions were measured using photon correlation spectroscopy. Separation of several peptides was investigated on packed columns in the hydrophilic interaction liquid chromatography (HILIC) separation mode. It was shown that small peptides can be separated based on hydrophilic interaction and ionic interaction between the stationary phase and analyte. The organic solvent composition, the pH and the salt concentration of the eluent have strong effects on the retention time. Compared to native silica before grafting, the newly synthesized zwitterionic material gave more stable retention times for basic peptides over pH range 3–7 due to elimination of the dissociation of silanol groups.
⁶² Three different zwitterionic functional stationary phases for chromatography were synthesized on the basis of 2-hydroxyethyl methacrylate (HEMA) polymeric particles. Two synthesis routes, producing materials designated S300-ECH-DMA-PS or S300-TC-DMA-PS, involved activation of the hydroxyl groups of the HEMA material with epichlorohydrin or thionyl chloride, respectively, followed by dimethylamination and quaternizing 3-sulfopropylation with 1,3-propane sultone. The third route was accomplished by attaching methacrylate moieties to the HEMA through a reaction with methacrylic anhydride, followed by graft photopolymerization of the zwitterionic monomer 3-[N,N-dimethyl-N-(methacryloyloxyethyl)ammonium]propanesulfonate, initiated by benzoin methyl ether under 365-nm light. According to elemental analyses, both the S300-ECH-DMA-PS and S300-TC-DMA-PS materials appeared to have overall charge stoichiometries close to unity, whereas the grafted material, S300-MAA-SPE, seemed to carry an excess of anion exchange sites in addition to the zwitterionic groups. Yet all three zwitterionic stationary phases were capable of separating inorganic anions and cations simultaneously and independently using aqueous solutions of perchloric acid or perchlorate salts as eluent, albeit with markedly different selectivities. On the S300-TC-DMA-PS and S300-MAA-SPE materials, the retention times increased for cations and decreased for anions with increasing eluent concentration, whereas with the S300-ECH-DMA-PS material, the retention times of both anions and cations decreased with increasing eluent concentration. These results demonstrate the importance of choosing appropriate synthesis conditions in order to prepare covalently bonded zwitterionic separation materials with an acceptable charge balance.
⁶³ A novel stationary phase with tentacle-type zwitterionic interaction layer was synthesized by free radical graft polymerization of 3-[N,N-dimethyl-N-(methacryloyloxyethyl)ammonium]propanesulfonate (SPE) from the surface of Kromasil porous silica particles. The polymerization was initiated by thermal cleavage of tert-butylperoxy groups covalently attached to the particle surface, and the material therefore carries a tentacle-type polymeric interaction layer with 3-sulfopropylbetaine functional moieties. The composition of the surface graft was determined by elemental analysis, and the surface charge was measured using photon correlation spectroscopy. The measured zeta-potentials were close to 0 and nearly independent of pH, and the tentacle character of the interactive layers were evident from the lack of colloidal stability in the absence of salt (antipolyelectrolytic behavior) and a marked increase in column back-pressure when the concentration of perchloric acid or perchlorate salt was increased. The chromatographic properties were evaluated on columns packed with the functionalized material, and it was shown that this zwitterionic stationary phase could simultaneously and independently separate inorganic anions and cations using aqueous solutions of perchloric acid or perchlorate salts as eluents. The material was also capable of separating two acidic and three basic proteins in a single run, using gradient salt elution at constant pH.

separation aimed for capture of positively charged biomolecules at high-salt conditions. J Chromatogr A 1016(1):35–49⁶⁴
80. Johansson BL, Belew M et al (2003) Preparation and characterization of prototypes for multi-modal separation media aimed for capture of negatively charged biomolecules at high salt conditions. J Chromatogr A 1016(1):21–33⁶⁵
81. Jungbauer A. (2005) Chromatographic media for bioseparation. J Chromatogr A 1065(1):3–12⁶⁶

⁶⁴ Several prototypes of aromatic (Ar) and non-aromatic (NoAr) cation-exchange ligands suitable for capture of proteins from high conductivity (ca. 30 mS/cm) mobile phases were coupled to Sepharose™ 6 Fast Flow. These new prototypes of multi-modal cation-exchangers were found by screening a diverse library of multi-modal ligands and selecting cation-exchangers resulting in elution of test proteins at high ionic-strength. Candidates were then tested with respect to breakthrough capacity of bovine serum albumin (BSA), human IgG and lysozyme in buffers adjusted to a high conductivity. By applying a salt-step or a pH-step the recoveries were also tested. We have found that aromatic multi-modal cation-exchanger ligands based on carboxylic acids seem to be optimal for the capture of proteins at high-salt conditions. Experimental evidence on the importance of the relative position of the aromatic group in order to improve the breakthrough capacity at high-salt conditions has been found. It was also found that an amide group on the α-carbon was essential for capture of proteins at high-salt conditions. Compared to a strong cation-exchanger such as SP Sepharose™ Fast Flow the best new multi-modal weak cation-exchangers have breakthrough capacities of BSA, human IgG and lysozyme that are 10–30 times higher at high-salt conditions. The new multi-modal cation-exchangers can also be used at normal cation-exchange conditions and with either a salt-step or a pH-step (to pH-values where the proteins are negatively charged) to accomplish elution of proteins. In addition, the functional performance of the new cation-exchangers was found to be intact after treatment in 1.0 M sodium hydroxide solution for 10 days. For BSA it was also possible to design cation-exchangers based on non-aromatic carboxyl acid ligands with high capacities at high-salt conditions. A common feature of these ligands is that they contain hydrogen acceptor groups close to the carboxylic group. Furthermore, it was also possible to obtain high breakthrough capacities for lysozyme and BSA of a strong cation-exchanger (SP Sepharose™ Fast Flow) if phenyl groups were attached to the beads. Varying the ligand ratio (SP/Phenyl) could be used for optimizing the function of mixed-ligand ion-exchange media.
⁶⁵ Several prototypes of multi-modal ligands suitable for the capture of negatively charged proteins from high conductivity (28 mS/cm) mobile phases were coupled to Sepharose 6 Fast Flow. These new prototypes of multi-modal anion-exchangers were found by screening a diverse library of multi-modal ligands and selecting anion-exchangers resulting in elution of test proteins at high ionic strength. Candidates were then tested with respect to breakthrough capacity of BSA in a buffer adjusted to a high conductivity (20 mM Piperazine and 0.25 M NaCl, pH 6.0). The recovery of BSA was also tested with a salt step (from 0.25 to 2.0 M NaCl using 20 mM Piperazine as buffer, pH 6.0) or with a pH-step to pH 4.0. We have found that non-aromatic multi-modal anion-exchange ligands based on primary or secondary amines (or both) are optimal for the capture of proteins at high salt conditions. Furthermore, these new multi-modal anion-exchange ligands have been designed to take advantage not only of electrostatic but also hydrogen bond interactions. This has been accomplished through modification of the ligands by the introduction of hydroxyl groups in the proximity of the ionic group. Experimental evidence on the importance of the relative position of the hydroxyl groups on the ligand in order to improve the breakthrough capacity of BSA has been found. Compared to strong anion-exchangers such as Q Sepharose™ Fast Flow the new multi-modal weak anion-exchangers have breakthrough capacities of BSA at mobile phases of 28 mS/cm and pH 6.0 that are 20–30 times higher. The new multi-modal anion-exchangers can also be used at normal anion-exchange conditions and with either a salt step or a pH-step to acidic pH can accomplish the elution of proteins. In addition, the functional performance of the new anion-exchangers was found to be intact after treatment in 1.0 M sodium hydroxide solution for 1 week. A number of multi-modal anion-exchange ligands based on aromatic amines exhibiting high breakthrough capacity of BSA have been found. With these ligands recovery was often found to be low due to strong non-electrostatic interactions. However, for phenol derived anion-exchange media the recovery can be improved by desorption at high pH.
⁶⁶ Bioseparation processes are dominated by chromatographic steps. Even primary recovery is sometimes accomplished by chromatographic separation, using a fluidized bed instead of a fixed bed. In this review, the action principles, features of chromatography media regarding physical and chemical properties will be described. An attempt will be made to establish categories of different media. Characteristics for bioseparation are the large pores and particle sizes. To achieve sufficient capacity for ultralarge molecules, such as plasmids or nanoparticles, such as viruses, monoliths are the media of choice. In these media, the mass transport is accomplished by convection, and thus, the low diffusivity can be overcome. Common to all modern chromatography media is the fast operation. There are examples where a residence time of less than 3 min is sufficient to reach the full potential of the adsorbent.

82. Kakhniashvili DG, Bulla LA et al (2004) The human erythrocyte proteome: analysis by ion trap mass spectrometry. Mol Cell Proteomics 3(5):501–509
83. Karlsson E, Hirsh I (2011) Ion exchange chromatography. Protein purification, Wiley, 93–133⁶⁷
84. Kawachi Y, Ikegami T et al (2011) Chromatographic characterization of hydrophilic interaction liquid chromatography stationary phases: hydrophilicity, charge effects, structural selectivity, and separation efficiency. J Chromatogr A 1218(35):5903–5919⁶⁸
85. Kawasaki T, Niikura M et al (1990) Fundamental study of hydroxyapatite high-performance liquid chromatography: II. Experimental analysis on the basis of the general theory of gradient chromatography. J Chromatogr A 515(0):91–123⁶⁹
86. Kirkland JJ (1973) Porous silica microsphere column packings for high-speed liquid–liquid chromatography. J Chromatogr A 83(0):149–167⁷⁰
87. Kirkland JJ, Truszkowski FA et al (2000) Superficially porous silica microspheres for fast high-performance liquid chromatography of macromolecules. J Chromatogr A 890(1):3–13⁷¹

⁶⁷ This chapter contains sections titled: * Introduction * The Ion Exchange Process * Charge Properties of Proteins * The Stationary Phase – The Ion Exchangers * Nonionic Interactions * The Mobile Phase: Buffers and Salts * Experimental Planning and Preparation * Chromatographic Techniques * Handling of Isolated Proteins * Hydroxyapatite Chromatography * Applications * Acknowledgments * References.
⁶⁸ Fourteen commercially available particle-packed columns and a monolithic column for hydrophilic interaction liquid chromatography (HILIC) were characterized in terms of the degree of hydrophilicity, the selectivity for hydrophilic-hydrophobic substituents, the selectivity for the regio and configurational differences in hydrophilic substituents, the selectivity for molecular shapes, the evaluation of electrostatic interactions, and the evaluation of the acidic-basic nature of the stationary phases using nucleoside derivatives, phenyl glucoside derivatives, xanthine derivatives, sodium p-toluenesulfonate, and trimethylphenylammonium chloride as a set of samples. Principal component analysis based on the data of retention factors could separate three clusters of the HILIC phases. The column efficiency and the peak asymmetry factors were also discussed. These data on the selectivity for partial structural differences were summarized as radar-shaped diagrams. This method of column characterization is helpful to classify HILIC stationary phases on the basis of their chromatographic properties, and to choose better columns for targets to be separated. Judging from the retention factor for uridine, these HILIC columns could be separated into two groups: strongly retentive and weakly retentive stationary phases. Among the strongly retentive stationary phases, zwitterionic and amide functionalities were found to be the most selective on the basis of partial structural differences. The hydroxyethyl-type stationary phase showed the highest retention factor, but with low separation efficiency. Weakly retentive stationary phases generally showed lower selectivity for partial structural differences.
⁶⁹ In hydroxyapatite (HA) chromatography, competition occurs between the sample molecule and ions from the buffer for adsorption onto the crystal surface of HA. The competition mechanism for several proteins and nucleoside phosphates was analysed on the basis of the general theory of gradient chromatography that has been established recently. It was concluded that the number, x′, of adsorbing sites of HA that are covered by an adsorbed molecule, in general, tends to increase slowly with increase in molecular mass, but that the correlation between molecular mass and x′ is weak. The conclusion is consistent with the deduction made earlier that the stereochemical structure of the local molecular surface (which is highly characteristic of a molecule, and is intimately related to the x′ value) is discerned by the regular crystal surface structure of HA. The capacity factor, k′, is argued on the basis of the competition model.
⁷⁰ A new column packing for high-performance liquid chromatography, porous microspheres of silica produced by the agglutination of colloidal silica particles, has recently been introduced for use in adsorption chromatography. The narrow-size range, relatively homogeneous pore structure and short diffusion path lengths of these <10-μ particles result in very high column efficiencies, and the relatively large, highly available surface area provides for high sample capacity. The microsphere packing displays retention and efficiency characteristics which are less dependent on water content than wide-pore silica gel. Columns of the microspheres may be prepared which are reproducible in chromatographic performance, using a simple high-pressure slurry-packing procedure. More than 10,000 theoretical plates have been obtained on a single 25-cm-long column of 5-μ microspheres at carrier velocities of about 0.7 cm/sec. Plate heights of about five particle diameters and more than thirty-six effective plates/sec have been demonstrated for solutes with capacity factors (k′) in the 2–5 range. These columns may be connected in series using low-volume fittings with little loss in efficiency. Columns of the 5-μ particles appear to be limited by mobile phase mass transfer effects, contrasted to the stagnant mobile phase mass transfer limitations exhibited by similar 8- to 9-μ particles.
⁷¹ Very fast reversed-phase separations of biomacromolecules are performed using columns made with superficially porous silica microsphere column packings (“Poroshell”). These column packings consist of ultra-pure “biofriendly” silica microspheres composed of solid cores and thin outer shells with uniform pores. The excellent kinetic properties of these new column packings allow stable, high-resolution gradient chromatography of polypeptides, proteins, nucleic acids, DNA fragments, etc. in a fraction of the time required for conventional separations. Contrasted with <2-μm non-porous
124 U. Kota and M.L. Stolowitz

88. Kirsch S, Muthing J et al (2009) On-line nano-HPLC/ESI QTOF MS monitoring of alpha2-3 and alpha2-6
sialylation in granulocyte glycosphingolipidome. Biol Chem 390(7):657–672⁷²
89. Layne J (2002) Characterization and comparison of the chromatographic performance of conventional,
polar-embedded, and polar-endcapped reversed-phase liquid chromatography stationary phases. J Chromatogr A
957(2):149–164⁷³
90. Lea DJ, Sehon AH (1962) Preparation of synthetic gels for chromatography of macromolecules. Can J Chem
40(1):159–160
91. Lecchi P, Gupte AR et al (2003) Size-exclusion chromatography in multidimensional separation schemes for
proteome analysis. J Biochem Biophys Methods 56(1–3):141–152⁷⁴
92. Leitner A, Reischl R et al (2012) Expanding the chemical cross-linking toolbox by the use of multiple
proteases and enrichment by size exclusion chromatography. Mol Cell Proteomics 11(3)⁷⁵
93. Li J, Shao S et al (2008) Simultaneous determination of cations, zwitterions and neutral compounds using
mixed-mode reversed-phase and cation-exchange high-performance liquid chromatography. J Chromatogr A
1185(2):185–193⁷⁶

particles, Poroshell packings can be used optimally with existing equipments and greater sample loading
capacities, while retaining kinetic (and separation speed) advantages over conventional totally porous
particles.
72
A novel glycosphingolipidomic protocol using nano-high performance liquid chromatography coupled on-line to
electrospray ionization quadrupole time-of-flight mass spectrometry (ESI-QTOF-MS) focusing on the separation
of isomeric ganglioside structures is described here. A highly efficient separation of alpha2-3- and
alpha2-6-sialylated ganglioside species of different carbohydrate chain length was achieved on an
HILIC-amido column, followed by sensitive flow-through ESI-QTOF-MS detection and unambiguous structural
identification by tandem MS experiments. The protocol was applied to encompass the glycosphingolipidome of
human granulocytes, where 182 distinct components could be clearly identified and assigned regarding the
ganglioside type and the isomer distribution.
73
We have evaluated and compared the performance of several conventional C18 phases with those possessing
either a polar-endcapping group or a polar-embedded group within the primary alkyl ligand and found distinct
differences in the chromatographic behavior among the three groups, as well as a high degree of variability
within each group. The trend is for the polar-endcapped phases to display similar hydrophobic retention
characteristics as the conventional C18 columns, but to express higher hydrogen bonding capacities and
silanol activity. The polar-embedded phases displayed the opposite behavior, with a greatly reduced
hydrophobic nature compared to the conventional and polar-endcapped C18 phases, and also a very much reduced
silanol activity. Most interestingly, it appears that ionic or dipole interactions play a significant role in
the overall retention behavior of the polar-embedded phases towards basic and acidic analytes.
74
Size-exclusion chromatography (SEC) is a separation technique with a relatively low resolving power,
compared to those usually utilized in proteomics. Therefore, it is often overlooked in experimental
protocols, when the main goal is resolving complex biological mixtures. In this report, we introduce
innovative multidimensional schemes for proteomics analysis, in which SEC plays a practical role. Liquid
isoelectric focusing (IEF) was combined with SEC, and experimental results were compared to those obtained by
two-dimensional polyacrylamide gel electrophoresis (2D-PAGE), well-established techniques relying upon
similar criteria for separation. Additional experiments were performed to evaluate the practical contribution
of SEC in multidimensional chromatographic separations. Specifically, we evaluated the combination of SEC and
ion exchange chromatography in an analytical scheme for the mass spectrometric analysis of protein-extracts
obtained from bacterial cultures grown in stable isotope enriched media. Experimental conditions and
practical considerations are discussed.
75
Chemical cross-linking in combination with mass spectrometric analysis offers the potential to obtain
low-resolution structural information from proteins and protein complexes. Identification of peptides
connected by a cross-link provides direct evidence for the physical interaction of amino acid side chains,
information that can be used for computational modeling purposes. Despite impressive advances that were made
in recent years, the number of experimentally observed cross-links still falls below the number of possible
contacts of cross-linkable side chains within the span of the cross-linker. Here, we propose two
complementary experimental strategies to expand cross-linking data sets. First, enrichment of cross-linked
peptides by size exclusion chromatography selects cross-linked peptides based on their higher molecular mass,
thereby depleting the majority of unmodified peptides present in proteolytic digests of cross-linked
samples. Second, we demonstrate that the use of proteases in addition to trypsin, such as Asp-N, can
additionally boost the number of observable cross-linking sites. The benefits of both SEC enrichment and
multiprotease digests are demonstrated on a set of model proteins and the improved workflow is applied to
the characterization of the 20S proteasome from rabbit and Schizosaccharomyces pombe.
76
A novel mixed-mode reversed-phase and cation-exchange high-performance liquid chromatography (HPLC) method
is described to simultaneously determine four related impurities of cations, zwitterions and neutral
compounds in developmental Drug A. The commercial column is Primesep 200 containing hydrophobic alkyl chains
with embedded acidic groups in H+ form on a silica support. The mobile phase variables of acid
5 Improving Proteome Coverage by Reducing Sample Complexity via Chromatography 125

94. Linden JC, Lawhead CL (1975) Liquid chromatography of saccharides. J Chromatogr A 105(1):125–133⁷⁷
95. Lindner H, Helliger W (2004) Hydrophilic interaction chromatography. HPLC of peptides and proteins.
MI Aguilar, Springer, New York, 251, 75–88
96. Lindqvist B, Storgards T (1955) Molecular-sieving properties of starch. Nature 175(4455):511–512
97. Link AJ, Eng J et al (1999) Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol
17(7):676–682⁷⁸
98. Link AJ, Eng J et al (1999) Direct analysis of protein complexes using mass spectrometry. Nat Biotech
17(7):676–682
99. Lork KD, Unger KK Solute retention in reversed-phase chromatography as a function of stationary phase
properties: effect of n-alkyl chain length and ligand density.
100. Luo J, Zhou W et al (2013) Comparison of fully-porous beads and cored beads in size exclusion
chromatography for protein purification. Chem Eng Sci 102:99–105⁷⁹
101. Mant CT, Hodges RS (2008) Mixed-mode hydrophilic interaction/cation-exchange chromatography: separation
of complex mixtures of peptides of varying charge and hydrophobicity. J Sep Sci 31(9):1573–1584⁸⁰

additives, contents of acetonitrile and concentrations of potassium chloride have been thoroughly
investigated to optimize the separation. The retention factors as a function of the concentrations of
potassium chloride and the percentages of acetonitrile in the mobile phases are investigated to get an
insight into the retention and separation mechanisms of each related impurity and Drug A. Furthermore, the
elution orders of the related impurities and Drug A in an ion-pair chromatography (IPC) are compared to those
in the mixed-mode HPLC to further understand the chromatographic retention behaviors of each related impurity
and Drug A. The study found that the positively charged Degradant 1, Degradant 2 and Drug A were retained by
both ion-exchange and reversed-phase partitioning mechanisms. RI2, a small ionic compound, was primarily
retained by ion-exchange. RI4, a neutral compound, was retained through reversed-phase partitioning without
ion-exchange. Moreover, the method performance characteristics of selectivity, sensitivity and accuracy have
been demonstrated to be suitable to determine the related impurities in the capsules of Drug A.
77
The analysis of saccharides by liquid chromatography on an automated instrument is described. Conditions for
the resolution and quantitation of fructose, glucose, sucrose, melibiose, raffinose, betaine and three
kestose isomers as well as starch hydrolysates are given. Liquid chromatographic analysis equals the
precision and accuracy of gas–liquid chromatographic analysis. Greater analysis flexibility and reduced
sample preparation are important advantages over gas–liquid chromatographic analysis.
78
We describe a rapid, sensitive process for comprehensively identifying proteins in macromolecular complexes
that uses multidimensional liquid chromatography (LC) and tandem mass spectrometry (MS/MS) to separate and
fragment peptides. The SEQUEST algorithm, relying upon translated genomic sequences, infers amino acid
sequences from the fragment ions. The method was applied to the Saccharomyces cerevisiae ribosome leading to
the identification of a novel protein component of the yeast and human 40S subunit. By offering the ability
to identify >100 proteins in a single run, this process enables components in even the largest
macromolecular complexes to be analyzed comprehensively.
79
Size-exclusion chromatography (SEC) relies exclusively on intraparticle diffusion to separate solutes of
different molecular sizes and shapes. Thus, its feed volume can only be a small fraction of the column
volume. Much larger columns are required for SEC than other forms of liquid chromatography. Because of this,
SEC often employs less expensive soft gels in large-scale applications to reduce costs. Excessive bed
compression forces engineers to use pancake-shaped columns instead of more desirable slim columns during
scale-up. Cored beads have impenetrable rigid cores that result in lower pressure drops and better pressure
resistance. They also provide sharper peaks due to shortened radial distance for diffusion. Using a new
general rate model for SEC with cored beads, this work demonstrated that cored beads performed better than
fully-porous beads for myoglobin and ovalbumin separation through computer simulation. This theoretical work
could encourage the research and product development of cored beads for large-scale SEC that has not been
reported. © 2013 Elsevier Ltd.
80
Mixed-mode hydrophilic interaction/cation-exchange chromatography (HILIC/CEX) was applied to the separation
of two mixtures of synthetic peptide standards: (i) a 27-peptide mixture containing three groups of peptides
(each group containing nine peptides of the same net charge of +1, +2 or +3), where the
hydrophilicity/hydrophobicity of adjacent peptides within the groups varied only subtly (generally by only a
single carbon atom); and (ii) peptide pairs with the same composition but different sequences, where the sole
difference between the peptides was the position of a single amino acid substitution. HILIC/CEX is
essentially CEX chromatography in the presence of high levels of organic modifier (generally ACN). The
present study demonstrated the dramatic effect of increasing ACN concentration (optimum levels of 60–80 %,
depending on the application) on the separation of both mixtures of peptides. The greater the charge on the
peptides, the better the separation achievable by HILIC/CEX. In addition, HILIC/CEX separation of both the
peptide mixtures used in the present study was shown to be superior to that of the more

102. Mant CT, Parker JMR et al (1987) Size-exclusion high-performance liquid chromatography of peptides:
requirement for peptide standards to monitor column performance and non-ideal behaviour. J Chromatogr A
397(0):99–112⁸¹
103. Marchand DH, Croes K et al (2005) Column selectivity in reversed-phase liquid chromatography: VII.
Cyanopropyl columns. J Chromatogr A 1062(1):57–64⁸²
104. Marino K, Bones J et al (2010) A systematic approach to protein glycosylation analysis: a path through
the maze. Nat Chem Biol 6(10):713–723
105. Martin AJ, Synge RL (1941) A new form of chromatogram employing two liquid phases: a theory of
chromatography. 2. Application to the micro-determination of the higher monoamino-acids in proteins.
Biochem J 35(12):1358–1368
106. Mauko L, Nordborg A et al (2011) Glycan profiling of monoclonal antibodies using zwitterionic-type
hydrophilic interaction chromatography coupled with electrospray ionization mass spectrometry detection.
Anal Biochem 408(2):235–241⁸³
107. McCalley DV (2007) Is hydrophilic interaction chromatography with silica columns a viable alternative
to reversed-phase liquid chromatography for the analysis of ionisable compounds? J Chromatogr A
1171(1–2):46–55⁸⁴

commonly applied RP-HPLC mode. Our results highlight again the efficacy of HILIC/CEX as a peptide separation
mode in its own right as well as an excellent complement to RP-HPLC.
81
A series of five synthetic peptide polymers with the sequence Ac-(G-L-G-A-K-G-A-G-V-G)n-amide, where
n = 1–5, was employed to assess the resolving power of high-performance size-exclusion columns in peptide
separations. The peptide standards showed great versatility in monitoring both ideal (no interactions of
solutes with the column material) and non-ideal (hydrophobic and/or ionic interactions of solutes with the
column material) size-exclusion behaviour in volatile and non-volatile mobile phases. The effectiveness of
adding salts or organic solvents to overcome non-specific interactions of solutes with the column materials
was well illustrated by the standards. In addition, the advantageous use of non-ideal size-exclusion
behaviour was highlighted. The ability to predict the position and/or elution order of peptides during
size-exclusion chromatography (SEC) requires peptides to be separated by a pure size-exclusion process.
Although the peptide standards demonstrated similar ideal size-exclusion profiles in non-denaturing medium on
all the columns studied this study suggested that, if the conformational character of a peptide protein
mixture in a particular mobile phase is uncertain ideal size-exclusion behaviour is required, SEC should be
carried out under highly denaturing conditions.
82
Eleven cyanopropyl (“cyano”) columns were characterized by means of a relationship developed originally for
alkyl-silica columns. Compared to type-B alkyl-silica columns (i.e., made from pure silica), cyano columns
are much less hydrophobic (smaller H), less sterically restricted (smaller S*), and have lower hydrogen-bond
acidity (smaller A). Because sample retention is generally much weaker on cyano versus other columns (e.g.,
C8, C18), a change to a cyano column usually requires a significantly weaker mobile phase in order to
maintain comparable values of k for both columns. For this reason, practical comparisons of selectivity
between cyano and other columns (i.e., involving different mobile phases for each column) must take into
account possible changes in separation due to the change in mobile phase, as well as change in the column.
83
We present a new method for the analysis of glycans enzymatically released from monoclonal antibodies (MAbs)
employing a zwitterionic-type hydrophilic interaction chromatography (ZIC–HILIC) column coupled with
electrospray ionization mass spectrometry (ESI–MS). Both native and reduced glycans were analyzed, and the
developed procedure was compared with a standard HILIC procedure used in the pharmaceutical industry whereby
fluorescent-labeled glycans are analyzed using a TSK Amide-80 column coupled with fluorescence detection. The
separation of isobaric alditol oligosaccharides present in monoclonal antibodies and ribonuclease B is
demonstrated, and ZIC–HILIC is shown to have good capability for structural recognition. Glycan profiles
obtained with the ZIC–HILIC column and ESI–MS provided detailed information on MAb glycosylation, including
identification of some less abundant glycan species, and are consistent with the profiles generated with the
standard procedure. This new ZIC–HILIC method offers a simpler and faster approach for glycosylation analysis
of therapeutic antibodies.
84
The separation of acidic, neutral and particularly basic solutes was investigated using a bare silica column,
mostly under hydrophilic interaction chromatography (HILIC) conditions with water concentrations >2.5 % and
with >70 % acetonitrile (ACN). Profound changes in selectivity could be obtained by judicious selection of
the buffer and its pH. Acidic solutes had low retention or showed exclusion in ammonium formate buffers, but
were strongly retained when using trifluoroacetic acid (TFA) buffers, possibly due to suppression of
repulsion of the solute anions from ionised silanol groups at the low ˢₛpH of TFA solutions of aqueous ACN.
At high buffer pH, the ionisation of weak bases was suppressed, reducing ionic (and possibly hydrophilic
retention) leading to further opportunities for manipulation of selectivity. Peak shapes of basic solutes
were excellent in ammonium formate buffers, and overloading effects, which are a major problem for charged
bases in RPLC, were relatively insignificant in analytical separations using this buffer. HILIC separations
were ideal for fast analysis of ionised bases, due to the low viscosity of mobile phases with high ACN

108. McCalley DV (2013) Separation mechanisms in hydrophilic interaction chromatography. Hydrophilic
interaction chromatography, Wiley, 1–41⁸⁵
109. McDonald WH, Ohi R et al (2002) Comparison of three directly coupled HPLC MS/MS strategies for
identification of proteins from complex mixtures: single-dimension LC-MS/MS, 2-phase MudPIT, and 3-phase
MudPIT. Int J Mass Spectrom 219(1):245–251⁸⁶
110. McNulty DE, Annan RS (2008) Hydrophilic interaction chromatography reduces the complexity of the
phosphoproteome and improves global phosphopeptide isolation and detection. Mol Cell Proteomics
7(5):971–980⁸⁷
111. Mihailova A, Lundanes E et al (2006) Determination and removal of impurities in 2-D LC-MS of peptides.
J Sep Sci 29(4):576–581⁸⁸
112. Miller NT, Feibush B et al (1984) Wide-pore silica-based ether-bonded phases for separation of proteins
by high-performance hydrophobic-interaction and size exclusion chromatography. J Chromatogr A
316(0):519–536⁸⁹

content, and the favourable Van Deemter curves which resulted from higher solute diffusivities.
85
Hydrophilic interaction chromatography (HILIC) is a technique that has become increasingly popular for the
separation of polar, hydrophilic, and ionizable compounds, which are difficult to separate by reversed-phase
(RP) chromatography due to their poor retention when RP is used. HILIC typically uses a polar stationary
phase such as bare silica or a polar bonded phase, together with an eluent. This chapter considers in some
detail the various mechanisms that contribute to HILIC separations. Contributory mechanisms are likely to be
partition, adsorption, ionic interactions, and even hydrophobic retention depending on the experimental
conditions.
86
One of the most effective methods for the direct identification of proteins from complex mixtures without
first having to resolve them by polyacrylamide gel electrophoresis is to separate proteolytically generated
peptides by microcapillary HPLC and then collect data directly on the eluent using a tandem mass
spectrometer. Multidimensional HPLC separation techniques provide access to even more complex mixtures of
proteins. A set of techniques for multidimensional analysis was developed in our lab; collectively they are
known as multidimensional protein identification technology (MudPIT). These strategies employ a biphasic
column with a section of reversed phase (RP) material flanked by strong cation exchange (SCX) resin and allow
for multidimensional separation of peptides. A variation on MudPIT adds an additional section of RP material
behind the SCX and RP. This 3-phase column can be used for “online” desalting of the sample. We compare the
analysis of a complex mixture of proteins purified by their association with bovine brain microtubules using
a single-dimension LC-MS/MS column, a 2-phase (standard) MudPIT column, and a 3-phase MudPIT column. We find
that the 3-phase MudPIT column yields a greater number of protein identifications for this test sample and
allows data to be collected on a set of hydrophilic peptides not sampled using the 2-phase MudPIT column.
87
The diversity and complexity of proteins and peptides in biological systems requires powerful liquid
chromatography-based separations to optimize resolution and detection of components. Proteomics strategies
often combine two orthogonal separation modes to meet this challenge. In nearly all cases, the second
dimension is a reverse phase separation interfaced directly to a mass spectrometer. Here we report on the use
of hydrophilic interaction chromatography (HILIC) as part of a multidimensional chromatography strategy for
proteomics. Tryptic peptides are separated on TSKgel Amide-80 columns using a shallow inverse organic
gradient. Under these conditions, peptide retention is based on overall hydrophilicity, and a separation
truly orthogonal to reverse phase is produced. Analysis of tryptic digests from HeLa cells yielded numbers of
protein identifications comparable to that obtained using strong cation exchange. We also demonstrate that
HILIC represents a significant advance in phosphoproteomics analysis. We exploited the strong hydrophilicity
of the phosphate group to selectively enrich and fractionate phosphopeptides based on their increased
retention under HILIC conditions. Subsequent IMAC enrichment of phosphopeptides from HILIC fractions showed
better than 99 % selectivity. This was achieved without the use of derivatization or chemical modifiers. In a
300-μg equivalent of HeLa cell lysate we identified over 1000 unique phosphorylation sites. More than 700
novel sites were added to the HeLa phosphoproteome.
88
Problems occurring during operation of a 2-D LC-MS system for separation and identification of neuropeptides,
such as contamination of the used salts and column bleed, are described. When using polysulfoethyl
aspartamide, which is widely used as a strong cation exchange stationary phase in the first dimension,
interfering peaks were observed in the second-dimension reversed-phase chromatograms. The observed peaks,
found to be caused by column bleeding, had abundance above the threshold value and influenced the quality of
the analyses. The origin of the peaks was verified and appropriate measures are proposed. Additionally, peaks
caused by polyethylene glycols (PEGs), covering approximately 5 min of feasible chromatographic time in every
fraction, were observed. The commercial ammonium formate salts used to prepare the first-dimension mobile
phase were found to contain PEG impurities, and in subsequent work the salt solutions were prepared from
formic acid and ammonia to avoid any additional contaminations.
89
This paper examines the use of wide-pore silica-based hydrophilic ether-bonded phases for the chromatographic
separation of proteins under mild elution conditions. In particular, ether phases of the following structure

113. Mohammed S, Heck AJR (2011) Strong cation exchange (SCX) based analytical methods for the targeted
analysis of protein post-translational modifications. Curr Opin Biotechnol 22(1):9–16⁹⁰
114. Molnar I (2002) Computerized design of separation strategies by reversed-phase liquid chromatography:
development of DryLab software. J Chromatogr A 965(1–2):175–194⁹¹
115. Moody RT (1999) 3 – Zorbax porous silica microsphere columns for high-performance size exclusion
chromatography. Column handbook for size exclusion chromatography. Cs Wu. San Diego, Academic Press,
pp 75–92
116. Moore AW, Jorgenson JW (1995) Comprehensive three-dimensional separation of peptides using size
exclusion chromatography/reversed phase liquid chromatography/optically gated capillary zone
electrophoresis. Anal Chem 67(19):3456–3463
117. Motoyama A, Xu T et al (2007) Anion and cation mixed-bed ion exchange for enhanced multidimensional
separations of peptides and phosphopeptides. Anal Chem 79(10):3623–3634⁹²
118. Naidong W (2003) Bioanalytical liquid chromatography tandem mass spectrometry methods on

Si-(CH2)3-O-(CH2-CH2-O)n-R, where n = 1, 2, 3 and R = methyl, ethyl or n-butyl, have been prepared. These
phases can be employed either in high-performance hydrophobic-interaction or size-exclusion chromatography,
depending on mobile phase conditions. In the hydrophobic-interaction mode, a gradient of decreasing salt
concentration, e.g., from 3 M ammonium sulfate (pH 6.0, 25 °C), yields sharp peaks with high mass recovery of
active proteins. In this mode, retention can be controlled by salt type and concentration, as well as by
column temperature. In the size-exclusion mode, use of medium ionic strength, e.g., 0.5 M ammonium acetate
(pH 6.0) yields linear calibration of log (MW[n]) vs. retention volume. Even at 0.05 M salt concentration, no
stationary phase charge effects on protein elution are observed. These bonded-phase columns exhibit good
column-to-column reproducibility and constant retention for at least 5 months of continual use. Examples of
the high-performance separation of proteins in both modes are illustrated.
90
The multidimensional combination of strong cation exchange (SCX) chromatography and reversed phase
chromatography has emerged as a powerful approach to separate peptides originating from complex samples such
as digested cellular lysates or tissues before analysis by mass spectrometry, enabling the identification of
over 10,000s of peptides and thousands of proteins in a single sample. Although, such multidimensional
chromatography approaches are powerful, the in-depth analysis of protein post-translational modifications
still requires additional sample preparation steps, involving the specific enrichment of peptides displaying
the targeted modification. Here, we describe how in particular SCX chromatography can be used for the
targeted analysis of important post-translational modifications, such as phosphorylation and N-terminal
acetylation. Compared to other methods, SCX is less labor-intensive and more robust, and therefore likely
more easily adaptable to main-stream research laboratories.
91
The development of DryLab software is a special achievement in analytical HPLC which took place in the last
16 years. This paper tries to collect some of the historical mile stones and concepts. DryLab, being always
subject to change according to the needs of the user, never stopped being developed. Under the influence of
an ever changing science market, the DryLab development team had to consider not just scientific
improvements, but also new technological achievements, such as the introduction of Windows 1.0 and 3.1, and
later Windows NT and 2000. The recent availability of new 32-bit programming tools allowed calculations of
chromatograms to be completed more quickly so as to show peak movements which result for example from slight
changes in eluent pH. DryLab is a great success of interdisciplinary and intercontinental cooperation by many
scientists.
92
Shotgun proteomics typically uses multidimensional LC/MS/MS analysis of enzymatically digested proteins,
where strong cation-exchange (SCX) and reversed-phase (RP) separations are coupled to increase the separation
power and dynamic range of analysis. Here we report an on-line multidimensional LC method using an anion- and
cation-exchange mixed bed for the first separation dimension. The mixed-bed ion-exchange resin improved
peptide recovery over SCX resins alone and showed better orthogonality to RP separations in two-dimensional
separations. The Donnan effect, which was enhanced by the introduction of fixed opposite charges in one
column, is proposed as the mechanism responsible for improved peptide recovery by producing higher fluxes of
salt cations and lower populations of salt anions proximal to the SCX phase. An increase in orthogonality was
achieved by a combination of increased retention for acidic peptides and moderately reduced retention of
neutral to basic peptides by the added anion-exchange resin. The combination of these effects led to ~100 %
increase in the number of identified peptides from an analysis of a tryptic digest of a yeast whole cell
lysate. The application of the method to phosphopeptide-enriched samples increased by 94 % phosphopeptide
identifications over SCX alone. The lower pKa of phosphopeptides led to specific enrichment in a single salt
step resolving acidic phosphopeptides from other phospho- and non-phosphopeptides. Unlike previous methods
that use anion exchange to alter selectivity or enrich phosphopeptides, the proposed format is unique in that
it works with typical acidic buffer systems used in electrospray ionization, making it feasible for online
multidimensional LC/MS/MS applications.
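The linear size-exclusion calibration mentioned in footnote 89 (a logarithmic molecular-weight measure
plotted against retention volume) can be sketched as an ordinary least-squares line fit. This is only an
illustrative sketch under stated assumptions: the function names are hypothetical, plain log10(MW) is
assumed as the calibrated quantity, and the calibration standards are synthetic values generated from an
assumed line, not data from the cited paper.

```python
import math


def fit_sec_calibration(standards):
    """Least-squares fit of log10(MW) = intercept + slope * retention_volume.

    `standards` is a list of (molecular_weight_Da, retention_volume_mL) pairs.
    Returns (slope, intercept).
    """
    volumes = [volume for _, volume in standards]
    log_mws = [math.log10(mw) for mw, _ in standards]
    n = len(standards)
    v_mean = sum(volumes) / n
    y_mean = sum(log_mws) / n
    sxx = sum((v - v_mean) ** 2 for v in volumes)
    sxy = sum((v - v_mean) * (y - y_mean) for v, y in zip(volumes, log_mws))
    slope = sxy / sxx
    intercept = y_mean - slope * v_mean
    return slope, intercept


def estimate_mw(retention_volume, slope, intercept):
    """Read an apparent molecular weight (Da) off the fitted calibration line."""
    return 10 ** (intercept + slope * retention_volume)


# Hypothetical standards (assumption): generated from log10(MW) = 9.0 - 0.4 * V,
# so the fit should recover slope -0.4 and intercept 9.0.
HYPOTHETICAL_STANDARDS = [(10 ** (9.0 - 0.4 * v), v) for v in (8.0, 10.0, 12.0, 14.0)]

if __name__ == "__main__":
    slope, intercept = fit_sec_calibration(HYPOTHETICAL_STANDARDS)
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
    print(f"apparent MW at 11.0 mL: {estimate_mw(11.0, slope, intercept):.0f} Da")
```

Such a line is meaningful only within the fractionation range spanned by the standards; solutes that
interact with the packing (non-ideal behaviour) will not fall on it.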

underivatized silica columns with aqueous/organic mobile phases. J Chromatogr B 796(2):209–224 [93]
119. Naidong W, Shou W et al (2001) Novel liquid chromatographic–tandem mass spectrometric methods using silica columns and aqueous–organic mobile phases for quantitative analysis of polar ionic analytes in biological fluids. J Chromatogr B Biomed Sci Appl 754(2):387–399 [94]
120. Nikolov ZL, Reilly PJ (1985) Retention of carbohydrates on silica and amine-bonded silica stationary phases: application of the hydration model. J Chromatogr A 325:287–293
121. Nogueira R, Lämmerhofer M et al (2005) Alternative high-performance liquid chromatographic peptide separation and purification concept using a new mixed-mode reversed-phase/weak anion-exchange type stationary phase. J Chromatogr A 1089(1–2):158–169 [95]
122. Nogueira R, Lubda D et al (2006) Silica-based monolithic columns with mixed-mode reversed-phase/weak anion-exchange selectivity principle for high-performance liquid chromatography. J Sep Sci 29(7):966–978 [96]

[93] This review article summarizes the recent progress on bioanalytical LC–MS/MS methods using underivatized silica columns and aqueous/organic mobile phases. Various types of polar analytes were extracted by using protein precipitation (PP), liquid/liquid extraction (LLE) or solid-phase extraction (SPE) and were then analyzed using LC–MS/MS on the silica columns. Use of silica columns and aqueous/organic mobile phases could significantly enhance LC–MS/MS method sensitivity, due to the high organic content in the mobile phase. Thanks to the very low backpressure generated from the silica column with low aqueous/high organic mobile phases, LC–MS/MS methods at high flow rates are feasible, resulting in significant timesaving. Because organic solvents have weaker eluting strength than water, direct injection of the organic solvent extracts from the reversed-phase solid-phase extraction onto the silica column was possible. Gradient elution on the silica columns using aqueous/organic mobile phases was also demonstrated. Contrary to what is commonly perceived, the silica column demonstrated superior column stability. This technology can be a valuable supplement to the reversed-phase LC–MS/MS.

[94] Use of silica stationary phase and aqueous–organic mobile phases could significantly enhance LC–MS–MS method sensitivity. The LC conditions were compatible with MS detection. Analytes with basic functional groups were eluted with acidic mobile phases and detected by MS in the positive ion mode. Analytes with acid functional groups were eluted with mobile phases at neutral pH and detected by MS in the negative ion mode. Analytes poorly retained on reversed-phase columns showed good retention on silica columns. Compared with reversed-phase LC–MS–MS, 5–8-fold sensitivity increases were observed for basic polar ionic compounds when using silica columns and aqueous–organic mobile phase. Up to a 20-fold sensitivity increase was observed for acidic polar ionic compounds. Silica columns and aqueous–organic mobile phases were used for assaying nicotine, cotinine, and albuterol in biological fluids.

[95] This article describes a new complementary peptide separation and purification concept that makes use of a novel mixed-mode reversed-phase/weak anion-exchange (RP/WAX) type stationary phase. The RP/WAX is based on N-(10-undecenoyl)-3-aminoquinuclidine selector, which is covalently immobilized on thiol-modified silica particles (5 μm, 100 Å pore diameter) by radical addition reaction. Remaining thiol groups are capped by radical addition with 1-hexene. This newly developed separation material contains two distinct binding domains in a single chromatographic interactive ligand: a lipophilic alkyl chain for hydrophobic interactions with lipophilic moieties of the solute, such as in the reversed-phase chromatography, and a cationic site for anion-exchange chromatography with oppositely charged solutes, which also enables repulsive ionic interactions with positively charged functional groups, leading to ion-exclusion phenomena. The beneficial effect that may result from the combination of the two chromatographic modes is exemplified by the application of this new separation material for the chromatographic separation of the N- and C-terminally protected tetrapeptide N-acetyl-Ile-Glu-Gly-Arg-p-nitroanilide from its side products. Mobile phase variables have been thoroughly investigated to optimize the separation and to get a deeper insight into the retention and separation mechanism, which turned out to be more complex than any of the individual chromatography modes alone. A significant anion-exchange retention contribution at optimal pH of 4.5 was found only for acetate but not for formate as counter-ion. In loadability studies using acetate, peptide masses up to 200 mg could be injected onto an analytical 250 mm × 4 mm i.d. RP/WAX column (5 μm) still without touching bands of major impurity and target peptide peaks. The corresponding loadability tests with formate allowed the injection of only 25 % of this amount. The analysis of the purified peptide by capillary high-performance liquid chromatography (HPLC)-UV and HPLC–ESI-MS employing RP-18 columns revealed that the known major impurities have all been removed by a single chromatographic step employing the RP/WAX stationary phase. The better selectivity and enhanced sample loading capacity in comparison to RP-HPLC resulted in an improved productivity of the new purification protocol. For example, the yield of pure peptide per chromatographic run on RP/WAX phase was by a factor of about 15 higher compared to the standard gradient elution RP-purification protocol.

[96] This article describes the synthesis, chromatographic characterization, and performance evaluation of analytical (100 x 4.6 mm id) and semipreparative (100 x 10 mm id) monolithic silica columns with mixed-mode RP/weak
130 U. Kota and M.L. Stolowitz

anion-exchange (RP/WAX) surface modification. The monolithic RP/WAX columns were obtained by immobilization of N-(10-undecenoyl)-3-aminoquinuclidine onto thiol-modified monolithic silica columns (Chromolith) by a radical addition reaction. Their chromatographic characterization by Engelhardt and Tanaka tests revealed slightly lower hydrophobic selectivities than C-8 phases, as well as higher polarity and also improved shape selectivity than RP-18e silica rods. The surface modification enabled separation by both RP and anion-exchange chromatography principles, and thus showed complementary selectivities to the RP-18e monoliths. The mixed-mode monoliths have been tested for the separation of peptides and turned out to be particularly useful for hydrophilic acidic peptides, which are usually insufficiently retained on RP-18e monolithic columns. Compared to a corresponding particulate RP/WAX column (5 μm, 10 nm pore diameter), the analytical RP/WAX monolith caused lower system pressure drops and showed, as expected, higher efficiency (e.g. by a factor of about 2.5 lower C-term for a tetrapeptide). The upscaling from the analytical to semipreparative column dimension was also successful.

123. O'Gara JE, Wyndham KD (2006) Porous hybrid organic-inorganic particles in reversed-phase liquid chromatography. J Liquid Chromatogr Related Technol 29(7–8):1025–1045 [97]
124. Opiteck GJ, Jorgenson JW et al (1997) Two-Dimensional SEC/RPLC coupled to mass spectrometry for the analysis of peptides. Anal Chem 69(13):2283–2291 [98]
125. Opiteck GJ, Ramirez SM et al (1998) Comprehensive two-dimensional high-performance liquid chromatography for the isolation of overexpressed proteins and proteome mapping. Anal Biochem 258(2):349–361 [99]

[97] Reversed-phase chromatographic media have recently become available that are based on porous hybrid organic-inorganic particles. The present paper reviews hybrid particles that are made from organosilanes (organic moiety) and tetraalkoxysilanes (inorganic moiety). The hybrid particles are defined and classified within the context of a broader definition of hybrid materials. First syntheses and chromatographic evaluations are discussed for this class of hybrid packing materials. Publications are then described, which characterize two distinguishing chemical properties of hybrid particles vs. silica gel: 1) less acidic silanols, and 2) markedly longer lifetimes in alkaline mobile phases. These properties are achieved without sacrificing mechanical strength, as is found for fully organic particles, i.e., polymers, with the same chemical features. Literature reports are then reviewed that employ hybrid-based reversed-phase column packings for HPLC. Topics covered include fundamental retention mechanism studies, methods development studies, and applications made possible with the hybrid-based products. Further review is presented on the use of these hybrid particles for UPLC. The hybrid particles afford good mechanical strength without sacrificing retention and loading capacity, as is found for non-porous particles. Applications employing hybrid-based particles in the UPLC mode are then reported.

[98] A two-dimensional liquid chromatography system is described here which uses size exclusion liquid chromatography (SEC) followed by reversed phase liquid chromatography (RPLC) to separate the mixture of peptides resulting from the enzymatic digestion of a protein. A novel LC/LC interface, using two RPLC columns in parallel rather than storage loops, joins the two chromatographic dimensions. This new interface design permits the use of conventional analytical diameter HPLC columns, 7.8 mm for SEC and 4.6 mm for RPLC, making construction and maintenance of this system very easy. The reversed phase chromatography utilizes 1.5 μm diameter, nonporous C-18 modified silica particles, which produce fast and efficient analyses. Following the high-resolution two-dimensional chromatographic separation, an electrospray mass spectrometer detects the peptide fragments. The mass spectrometer scans a 2000 m/z range to identify the analytes from their molecular weights. The analyses of tryptic digests of ovalbumin and serum albumin are each described.

[99] A two-dimensional liquid chromatographic system is described here which uses size-exclusion liquid chromatography (SEC) followed by reversed-phase liquid chromatography (RPLC) to separate the mixture of proteins resulting from the lysis of Escherichia coli cells and to isolate the proteins that they produce. The size-exclusion

chromatography can be conducted under either denaturing or nondenaturing conditions. Peaks eluting from the first dimension are automatically subjected to reversed-phase chromatography to separate similarly sized proteins on the basis of their various hydrophobicities. The RPLC also serves to desalt the analytes so that they can be detected in the deep ultraviolet region at 215 nm regardless of the SEC mobile phase used. The two-dimensional (2D) chromatograms produced in this manner then strongly resemble the format of stained 2D gels, in that spots are displayed on an X–Y axis and intensity represents quantity of analyte. Following chromatographic separation, the analytes are deposited into six 96-well (576 total) polypropylene microtiter plates via a fraction collector. Interesting fractions are analyzed by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF/MS) or electrospray mass spectrometry (ESI/MS) depending on sample concentration, which both yield accurate (2 to 0.02 %) molecular weight information on intact proteins without any additional sample preparation, electroblotting, destaining, etc. The remaining 97 % of a fraction can then be used for other analyses, such as Edman sequencing, amino acid analysis, or proteolytic digestion and sequencing by tandem mass spectrometry. This 2D HPLC protein purification and identification system was used to isolate the src homology (SH2) domain of the nonreceptor tyrosine kinase pp60c-src and β-lactamase, both inserted into E. coli, as well as a number of native proteins comprising a small portion of the E. coli proteome.

126. Oyler AR, Armstrong BL et al (1996) Hydrophilic interaction chromatography on amino-silica phases complements reversed-phase high-performance liquid chromatography and capillary electrophoresis for peptide analysis. J Chromatogr A 724(1–2):378–383 [100]
127. Pabst M, Altmann F (2011) Glycan analysis by modern instrumental methods. Proteomics 11(4):631–643 [101]
128. Peng J, Elias JE et al (2003) Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2(1):43–50 [102]
129. Phillips HL, Williamson JC et al (2010) Shotgun proteome analysis utilising mixed mode (reversed phase-anion exchange chromatography) in conjunction with reversed phase liquid chromatography mass spectrometry analysis. Proteomics 10(16):2950–2960 [103]

[100] Hydrophilic interaction chromatography (HILIC) on amine bonded-phase silica columns provides separations of peptides that are complementary to those obtained with reversed-phase HPLC and free solution capillary electrophoresis. This is illustrated with the peptide drug atosiban and nine diastereomers. Moreover, one of the HILIC methods was suitable for coupling with electrospray mass spectrometry.

[101] The oligosaccharides attached to proteins or lipids are among the most challenging analytical tasks due to their complexity and variety. Knowing the genes and enzymes responsible for their biosynthesis, a large but not unlimited number of different structures and isomers of such glycans can be imagined. Understanding of the biological role of structural variations requires the ability to unambiguously determine the identity and quantity of all glycan species. Here, we examine which analytical strategies, with a certain high-throughput potential, may come near this ideal. After an exposé of the relevant techniques, we try to depict how analytical raw data are translated into structural assignments using retention times, mass and fragment spectra. A method's ability to discriminate between the many conceivable isomeric structures together with the time, effort and sample amount needed for that purpose is suggested as a criterion for the comparative assessment of approaches and their evolutionary stages.

[102] Highly complex protein mixtures can be directly analyzed after proteolysis by liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS). In this paper, we have utilized the combination of strong cation exchange (SCX) and reversed-phase (RP) chromatography to achieve two-dimensional separation prior to MS/MS. One milligram of whole yeast protein was proteolyzed and separated by SCX chromatography (2.1 mm i.d.) with fraction collection every minute during an 80-min elution. Eighty fractions were reduced in volume and then re-injected via an autosampler in an automated fashion using a vented-column (100 μm i.d.) approach for RP-LC-MS/MS analysis. More than 162 000 MS/MS spectra were collected with 26 815 matched to yeast peptides (7537 unique peptides). A total of 1504 yeast proteins were unambiguously identified in this single analysis. We present a comparison of this experiment with a previously published yeast proteome analysis by Yates and colleagues (Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242–7). In addition, we report an in-depth analysis of the false-positive rates associated with peptide identification using the Sequest algorithm and a reversed yeast protein database. New criteria are proposed to decrease false-positives to less than 1 % and to greatly reduce the need for manual interpretation while permitting more proteins to be identified.

[103] The 2-D peptide separations employing mixed mode reversed phase anion exchange (MM (RP-AX)) HPLC in the first dimension in conjunction with RP chromatography in the second dimension were developed and utilised for shotgun proteome analysis. Compared with strong cation exchange (SCX) typically employed for shotgun proteomic analysis, peptide separations using MM (RP-AX) revealed improved separation efficiency and increased peptide distribution across the elution gradient.

In addition, improved sample handling, with no significant reduction in the orthogonality of the peptide separations was observed. The shotgun proteomic analysis of a mammalian nuclear cell lysate revealed additional proteome coverage (2818 versus 1125 unique peptides and 602 versus 238 proteins) using the MM (RP-AX) compared with the traditional SCX hyphenated to RP-LC-MS/MS. The MM analysis resulted in approximately 90 % of the unique peptides identified present in only one fraction, with a heterogeneous peptide distribution across all fractions. No clustering of the predominant peptide charge states was observed during the gradient elution. The application of MM (RP-AX) for 2-D LC proteomic studies was also extended in the analysis of iTRAQ-labelled HeLa and cyanobacterial proteomes using nano-flow chromatography interfaced to the MS/MS. We demonstrate MM (RP-AX) HPLC as an alternative approach for shotgun proteomic studies that offers significant advantages over traditional SCX peptide separations.

130. Polson A (1961) Fractionation of protein mixtures on columns of granulated agar. Biochim Biophys Acta 50(3):565–567
131. Popovici ST, Schoenmakers PJ (2005) Fast size-exclusion chromatography: theoretical and practical considerations. J Chromatogr A 1099(1–2):92–102 [104]
132. Porath J (1960) Gel filtration of proteins, peptides and amino acids. Biochim Biophys Acta 39(2):193–207 [105]
133. Porath J, Flodin PER (1959) Gel filtration: a method for desalting and group separation. Nature 183(4676):1657–1659
134. Porath J, Sundberg L et al (1973) Salting-out in amphiphilic gels as a new approach to hydrophobic adsorption. Nature 245(5426):465–466
135. Porsch B (1993) Epoxy- and diol-modified silica: optimization of surface bonding reaction. J Chromatogr A 653(1):1–7 [106]
136. Queiroz JA, Tomaz CT, et al [107]
137. Regnier FE, Noel R (1976) Glycerolpropylsilane bonded phases in the steric exclusion chromatography of biological macromolecules. J Chromatogr Sci 14(7):316–320 [108]
138. Ricker RD, Sandoval LA (1996) Fast, reproducible size-exclusion chromatography of biological macromolecules. J Chromatogr A 743(1):43–50 [109]

[104] Fast SEC is a very interesting modification of conventional SEC. The need for it emerges from combinatorial chemistry and high-throughput experimentation, where high-speed analyses are required. The different approaches to change the speed of analysis are extensively described in this paper. Special attention is paid to the trade-off between analysis time and resolution and to the selection of optimal column lengths and flow rates. Simulations are used to design and to understand experiments. Integrity plots are constructed to judge the quality of various SEC systems. Fast separations in size-exclusion chromatography are found to be more favorable than suggested by conventional theory. The results are based on experimental data obtained for polystyrene using THF as mobile phase.

[105] 1. Mixtures of proteins, peptides and amino acids can be fractionated by filtration through beds of dextran gel containing only small amounts of carboxylic groups. 2. Group separations are readily achieved. In highly cross-linked dextran proteins and large peptides move together ahead of amino acids. In dextran gels of low degree of cross-linking peptides and even proteins may be retained on the columns, so that a fractionation of substances within these groups may be obtained. 3. Basic peptides and amino acids move slowly through the gels in certain basic solvents such as 1 M pyridine and faster in acidic solvents such as 1 M acetic acid. For acidic peptides and amino acids the influence of the solvents mentioned appears to be the reverse of that for the basic compounds. 4. Aromatic substitution has a marked effect on the migration through the gels. The relative speed of dinitrophenylated amino acids is highly dependent on the buffer used. Such influence of the buffer was not noticed for phenylalanine, tyrosine and tryptophan, although these compounds are retarded to a different extent. 5. When the columns are properly prepared, symmetrical distribution of each compound is always obtained. 6. The column capacity is very high compared to other similar column methods (chromatography and zone electrophoresis). 7. The reproducibility is very good. 8. The gels are easily regenerated in the columns and may be used daily over a period of months without detectable deterioration.

[106] The 3-glycidyloxypropyltrimethoxysilane-silica bonding reaction was investigated. The carbon and bonded epoxide content after the bonding reaction and.

[107] In this article, an overview of hydrophobic interaction chromatography (HIC) of proteins is given. After a brief description of protein hydrophobicity and hydrophobic interactions, we present the different proposed theories for the retention mechanism of proteins in HIC. Additionally, the main parameters to consider for the optimization of fractionation processes by HIC and the stationary phases available were described. Selected examples of protein fractionation by HIC are also presented.

[108] Glycerolpropylsilane bonded phases have been found to control the adsorption and/or denaturation of proteins and nucleic acids on controlled porosity glass supports. The bonded-phase thickness is 18–19 Å while the amount of glycerol moiety varies from 80 to 150 μmol/g depending on support pore diameter. It has been demonstrated that carbohydrate bonded supports may be used in the chromatography of proteins, nucleic acids, and polysaccharides.

[109] The size-dependent separation of biological macromolecules can be effectively carried out using size-exclusion chromatography (SEC) on silica-based

HPLC columns. For this technique to be successful, appropriate methods should be chosen. This paper presents practical guidelines for the development of reproducible SEC methods based upon optimized sample volume, flow-rate, column length and use of mobile phase conditions that reduce non-ideal SEC behavior, parameters often ignored in SEC. Adjustment of these parameters often results in more accurate elution times for proper molecular-mass determination, sharper peaks for improved resolution and shorter run times for increased throughput. In general, sample volume and flow-rate should be kept to a minimum for optimal resolution in SEC. Increasing column length improves resolution and may be achieved by placing columns in tandem. In addition, adjustment of the mobile phase conditions can significantly enhance resolution. However, the results are difficult to predict because the sample plays a major role in this interaction, as does the column packing. When possible, mobile phase ionic strength and pH should be altered until the peak(s) of interest elute at the expected time and with good peak shape. Finally, use of smaller-diameter columns (i.e., 4.6 mm rather than 9.4 mm) and small-diameter packing (4.5 μm) particles are also briefly discussed. The principles described here are demonstrated, using antibodies and a number of standard proteins under a variety of SEC conditions.

139. Roumeliotis P, Unger KK (1981) Assessment and optimization of system parameters in size exclusion separation of proteins on diol-modified silica columns. J Chromatogr A 218(0):535–546 [110]
140. Ruhaak LR, Hennig R et al (2010) Optimized workflow for preparation of APTS-labeled N-glycans allowing high-throughput analysis of human plasma glycomes using 48-channel multiplexed CGE-LIF. J Proteome Res 9(12):6655–6664 [111]
141. Salisbury JJ (2008) Fused-core particles: a practical alternative to sub-2 micron particles. J Chromatogr Sci 46(10):883–886 [112]
142. Sandra K, Moshir M et al (2008) Highly efficient peptide separations in proteomics: Part 1. Unidimensional high performance liquid chromatography. J Chromatogr B 866(1–2):48–63 [113]

[110] On diol-modified silica columns the retention of proteins is governed by a size exclusion effect, but superimposed on this are some secondary effects, i.e., ionic and diol-ligand interactions which can be controlled and adjusted reproducibly by varying the eluent composition. The eluent composition also affects the column efficiency and peak shape. Both dependences can be employed to obtain a better resolution of proteins than can be expected from size exclusion alone.

[111] High-throughput methods for oligosaccharide analysis are required when searching for glycan-based biomarkers. Next to mass spectrometry-based methods, which allow fast and reproducible analysis of such compounds, further separation-based techniques are needed, which allow for quantitative analysis. Here, an optimized sample preparation method for N-glycan-profiling by multiplexed capillary gel electrophoresis with laser-induced fluorescence detection (CGE-LIF) was developed, enabling high-throughput glycosylation analysis. First, glycans are released enzymatically from denatured plasma glycoproteins. Second, glycans are labeled with APTS using 2-picoline borane as a nontoxic and efficient reducing agent. Reaction conditions are optimized for a high labeling efficiency, short handling times, and only limited loss of sialic acids. Third, samples are subjected to hydrophilic interaction chromatography (HILIC) purification at the 96-well plate format. Subsequently, purified APTS-labeled N-glycans are analyzed by CGE-LIF using a 48-capillary DNA sequencer. The method was found to be robust and suitable for high-throughput glycan analysis. Even though the method comprises two overnight incubations, 96 samples can be analyzed with an overall labor allocation time of 2.5 h. The method was applied to serum samples from a pregnant woman, which were sampled during first, second, and third trimesters of pregnancy, as well as 6 weeks, 3 months, and 6 months postpartum. Alterations in the glycosylation patterns were observed with gestation and time after delivery.

[112] The benefits of sub-2 micron particle size columns have been widely researched and published. The use of these columns on ultrahigh-pressure liquid chromatography (UHPLC) instrumentation may lead to increased efficiencies and higher throughput. However, these instruments may not be readily available to the pharmaceutical chemist. Within the past year, a practical alternative has been introduced which offers increased efficiencies, but at conventional HPLC pressure limitations. These particles are called fused-core particles and are comprised of a 1.7-micron solid core encompassed by a 0.5-micron porous silica layer (dp = 2.7 micron). The goal for this research was to test these columns for efficiency and robustness utilizing a mixture of Torcetrapib and its relative impurities. Our results indicate that excellent theoretical plates (14,000) were achievable for run times less than 5 min. Compared to the Waters Acquity particles, the fused-core particles achieved approximately 80 % of the efficiency but with half the observed backpressure. Our robustness results concluded that these separations were reproducible for at least 500 injections while the % RSD for retention time, theoretical plates, peak asymmetry, and resolution was found to be less than 1 %.

[113] Sample complexity and dynamic range constitute enormous challenges in proteome analysis. The back-end technology in typical proteomics platforms, namely mass spectrometry (MS), can only tolerate a certain complexity, has a limited dynamic range per spectrum and is very sensitive towards ion suppression. Therefore, component overlap has to be minimized for successful mass spectrometric analysis and subsequent protein identification and quantification. The present review describes the advances that have been made in liquid-based separation

techniques with focus on the recent developments to boost the resolving power. The review is divided in two parts; the first part deals with unidimensional liquid chromatography and the second part with bi- and multidimensional liquid-based separation techniques. Part 1 mainly focuses on reversed-phase HPLC due to the fact that it is and will, in the near future, remain the technique of choice to be hyphenated with MS. The impact of increasing the column length, decreasing the particle diameter, replacing the traditional packed beds by monolithics, amongst others, is described. The review is complemented with data obtained in the laboratories of the authors.

143. Saraswat M, Musante L et al (2013) Preparative purification of recombinant proteins: current status and future trends. BioMed Res Int 2013:2018
144. Selkirk C (2004) Ion-exchange chromatography. Protein purification protocols. P Cutler, Humana Press, 244, 125–131
145. Selman MHJ, Hemayatkar M et al (2011) Cotton HILIC SPE microtips for microscale purification and enrichment of glycans and glycopeptides. Anal Chem 83(7):2492–2499 [114]
146. Selman MHJ, McDonnell LA et al (2010) Immunoglobulin G glycopeptide profiling by matrix-assisted laser desorption ionization Fourier transform ion cyclotron resonance mass spectrometry. Anal Chem 82(3):1073–1081 [115]
147. Selman MHJ, Niks EH et al (2010) IgG Fc N-Glycosylation changes in Lambert-Eaton myasthenic syndrome and myasthenia gravis. J Proteome Res 10(1):143–152 [116]
148. Shaltiel S, Er-El Z (1973) Hydrophobic chromatography: use for purification of glycogen synthetase. Proc Natl Acad Sci 70(3):778–781 [117]

[114] Solid-phase extraction microtips are important devices in modern bioanalytics, as they allow miniaturized sample preparation for mass spectrometric analysis. Here we introduce the use of cotton wool for the preparation of filter-free HILIC SPE microtips. To this end, pieces of cotton wool pads (approximately 500 μg) were packed into 10 μL pipet tips. The performance of the tips was evaluated for microscale purification of tryptic IgG Fc N-glycopeptides. Cotton wool HILIC SPE microtips allowed the removal of salts, most nonglycosylated peptides, and detergents such as SDS from glycoconjugate samples. MALDI-TOF-MS glycopeptide profiles were very repeatable with different tips as well as reused tips, and very similar profiles were obtained with different brands of cotton wool pads. In addition, we used cotton HILIC microtips to purify N-glycans after N-glycosidase F treatment of IgG and transferrin followed by MALDI-TOF-MS detection. In conclusion, we establish cotton wool microtips for glycan and glycopeptide purification with subsequent mass spectrometric detection.

[115] Immunoglobulin G (IgG) fragment crystallizable (Fc) glycosylation is essential for Fc-receptor-mediated activities. Changes in IgG Fc glycosylation have been found to be associated with various diseases. Here we

by reversed-phase or hydrophilic interaction solid-phase extraction. Glycopeptides are analyzed by intermediate pressure matrix-assisted laser desorption ionization Fourier transform ion cyclotron resonance mass spectrometry (MALDI-FTICR-MS). Notably, both dihydroxybenzoic acid (DHB) and α-cyano-4-hydroxycinnamic acid (CHCA) matrixes allowed the registration of sialylated as well as nonsialylated glycopeptides. Data were automatically processed, and IgG isotype-specific Fc glycosylation profiles were obtained. The entire method showed an interday variation below 10 % for the six major glycoforms of both IgG1 and IgG2. The method was found suitable for isotype-specific high-throughput IgG glycosylation profiling from human plasma. As an example we successfully applied the method to profile the IgG glycosylation of 62 human samples.

[116] N-glycosylation of the immunoglobulin Fc moiety influences its biological activity by, for example, modulating the interaction with Fc receptors. Changes in IgG glycosylation have been found to be associated with various inflammatory diseases. Here we evaluated for the first time IgG Fc N-glycosylation changes in well-defined antibody-mediated autoimmune diseases, that is, the neurological disorders Lambert-Eaton myasthenic syndrome and myasthenia gravis, with antibodies to muscle nicotinic acetylcholine receptors or muscle-specific kinase. IgGs were purified from serum or plasma by protein A affinity chromatography and digested with trypsin. Glycopeptides were purified and analyzed by MALDI-FTICR-MS. Glycoform distributions of both IgG1 and IgG2 were determined for 229 patients and 56 controls. We observed an overall age and sex dependency of IgG Fc N-glycosylation, which was in accordance with literature. All three disease groups showed lower levels of IgG2 galactosylation compared to controls. In addition, LEMS patients showed lower IgG1 galactosylation. Notably, the galactosylation differences were not paralleled by a difference in IgG sialylation. Moreover, the level of IgG core-fucosylation and bisecting N-acetylglucosamine were evaluated. The control and disease groups revealed similar levels of IgG Fc core-fucosylation. Interestingly, LEMS patients below 50 years showed elevated levels of bisecting N-acetylglucosamine on IgG1 and IgG2, demonstrating for the first time the link of changes in
describe a high-throughput IgG glycosylation profiling the level of bisecting N-acetylglucosamine with disease.
117
method. Sample preparation is performed in 96-well A homologous series of δ-aminoalkylagaroses
plate format: IgGs are purified from 2 ?L of human [Sepharose-NH(CH2)nNH2] that varied in the length of
plasma using immobilized protein A. IgGs are cleaved their hydrocarbon side chains was synthesized. This fam-
with trypsin, and the resulting glycopeptides are purified ily of agaroses was used for a new type of
5 Improving Proteome Coverage by Reducing Sample Complexity via Chromatography 135

149. Shi Y, Xiang R et al (2004) The role of liquid chromatography in proteomics. J Chromatogr A
1053(1–2):27–36¹¹⁸
150. Simpson DC, Ahn S et al (2006) Using size exclusion chromatography-RPLC and RPLC-CIEF as
two-dimensional separation strategies for protein profiling. Electrophoresis
27(13):2722–2733¹¹⁹
151. Snyder LR, Wrisley L et al (2006) Computer-aided optimization. HPLC made to measure,
Wiley-VCH Verlag GmbH & Co. KGaA, 565–623¹²⁰
152. Strege MA, Stevenson S et al (2000) Mixed-mode anion–cation exchange/hydrophilic
interaction liquid chromatography–electrospray mass spectrometry as an alternative to reversed
phase for small molecule drug discovery. Anal Chem 72(19):4629–4633
chromatography, in which retention of proteins is achieved mainly through lipophilic
interactions between the hydrocarbon side chains on the agarose and accessible hydrophobic
pockets in the protein. When an extract of rabbit muscle was subjected to chromatography on
these modified agaroses, the columns with short arms (n = 2 and n = 3) excluded glycogen
synthetase (EC 2.4.1.11), but the enzyme was retained on δ-aminobutyl-agarose (n = 4), from
which it could be eluted with a linear NaCl gradient. Higher members of this series (e.g.,
n = 6) bind the synthetase so tightly that it can be eluted only in a denatured form. A column
of δ-aminobutyl-agarose, which retained the synthetase, excluded glycogen phosphorylase
(EC 2.4.1.1), which in this column series and under the same conditions requires side chains
5-(or 6)-carbon-atoms long for retention. Therefore, it is possible to isolate glycogen
synthetase by passage of muscle extract through δ-aminobutyl-agarose, then to extract
phosphorylase by subjecting the excluded proteins to chromatography on δ-aminohexyl-agarose
(n = 6). On a preparative scale, the synthetase (I form) was purified 25- to 50-fold in one
step. This paper describes some basic features and potential uses of hydrophobic
chromatography. The relevance of the results presented here to the design and use of affinity
chromatography columns is discussed.
118
Proteomics represents a significant challenge to separation scientists because of the diversity
and complexity of proteins and peptides present in biological systems. Mass spectrometry as the
central enabling technology in proteomics allows detection and identification of thousands of
proteins and peptides in a single experiment. Liquid chromatography is recognized as an
indispensable tool in proteomics research since it provides high-speed, high-resolution and
high-sensitivity separation of macromolecules. In addition, the unique features of
chromatography enable the detection of low-abundance species such as post-translationally
modified proteins. Components such as phosphorylated proteins are often present in complex
mixtures at vanishingly small concentrations. New chromatographic methods are needed to solve
these analytical challenges, which are clearly formidable, but not insurmountable. This review
covers recent advances in liquid chromatography, as it has impacted the area of proteomics. The
future prospects for emerging chromatographic technologies such as monolithic capillary
columns, high temperature chromatography and capillary electrochromatography are discussed.
119
Bottom-up proteomics (analyzing peptides that result from protein digestion) has demonstrated
capability for broad proteome coverage and good throughput. However, due to incomplete sequence
coverage, this approach is not ideally suited to the study of modified proteins. The
modification complement of a protein can best be elucidated by analyzing the intact protein.
2-DE, typically coupled with the analysis of peptides that result from in-gel digestion, is the
most frequently applied protein separation technique in MS-based proteomics. As an alternative,
numerous column-based liquid phase techniques, which are generally more amenable to automation,
are being investigated. In this work, the combination of size-exclusion chromatography (SEC)
fractionation with RPLC-Fourier-transform ion cyclotron resonance (FTICR)-MS is compared with
the combination of RPLC fractionation with CIEF-FTICR-MS for the analysis of the Shewanella
oneidensis proteome. SEC-RPLC-FTICR-MS allowed the detection of 297 proteins, as opposed to 166
using RPLC-CIEF-FTICR-MS, indicating that approaches based on LC-MS provide better coverage.
However, there were significant differences in the sets of proteins detected and both
approaches provide a basis for accurately quantifying changes in protein and modified protein
abundances.
120
This chapter contains sections titled:
* Computer-Facilitated HPLC Method Development Using DryLab® Software: Introduction; History;
Theory; DryLab Capabilities; DryLab Operation; Mode Choices; Practical Applications of DryLab®
in the Laboratory; Conclusions; References
* ChromSword® Software for Automated and Computer-Assisted Development of HPLC Methods:
Introduction; Off-Line Mode; On-Line Mode; ChromSword® Versions; Experimental Set-Up for
On-Line Mode; Method Development with ChromSword®; Off-Line Mode (Computer-Assisted Method
Development); On-Line Mode – Fully Automated Optimization of Isocratic and Gradient
Separations; Software Functions for Automation; How Does the System Optimize Separations?;
Conclusion; References
* Multifactorial Systematic Method Development and Optimization in Reversed-Phase HPLC:
Introduction and Factorial Viewpoint; Strategy for Partially Automated Method Development;
Comparison of Commercially Available Software Packages with Regard to Their Contribution to
Factorial Method Development; Development of a New System for Multifactorial Method
Development; Selection of Stationary Phases; Optimizing Methods with HEUREKA; Evaluation of
Data with HEUREKA; Conclusion and Outlook; References.
136 U. Kota and M.L. Stolowitz

153. Štulík K, Pacáková V et al (1997) Stationary phases for peptide analysis by high
performance liquid chromatography: a review. Anal Chim Acta 352(1–3):1–19¹²¹
154. Sun K, Sehon AH (1965) The use of polyacrylamide gels for chromatography of proteins. Can J
Chem 43(4):969–976
155. Cirkovic Velickovic T, JO, Mihajlovic L (2012) Separation of amino acids, peptides, and
proteins by ion exchange chromatography. Ion exchange technology II: applications. ML
Dr. Inamuddin. Netherlands, Springer, Dordrecht
156. Tanaka H, Zhou X et al (2003) Characterization of a novel diol column for high-performance
liquid chromatography. J Chromatogr A 987(1–2):119–125¹²²
157. Toll H, Oberacher H et al (2005) Separation, detection, and identification of peptides by
ion-pair reversed-phase high-performance liquid chromatography-electrospray ionization mass
spectrometry at high and low pH. J Chromatogr A 1079(1–2):274–286¹²³
158. Tolstikov VV, Fiehn O (2002) Analysis of highly polar compounds of plant origin:
combination of hydrophilic interaction chromatography and electrospray ion trap mass
spectrometry. Anal Biochem 301(2):298–307¹²⁴

121
A survey is given of modern stationary phases employed in high performance liquid
chromatography (HPLC) analysis of peptides. The physico-chemical properties of peptides and
their consequences for the selection and optimization of the separation system are briefly
discussed, followed by a summary of the approaches to the selection and characterization of
stationary phases. The properties and applicability of various stationary phases are then
critically reviewed, including aspects such as size-exclusion, ion-exchange, reversed-phase,
hydrophobic-interaction, affinity and chiral systems, as well as some specialized separation
techniques. Emphasis is placed on the most recent literature.
122
For the investigation of a diol phase (Inertsil Diol column) in hydrophilic interaction
chromatography, urea, sucrose and glycine were used as test compounds. The chromatographic
conditions were investigated for optimal column efficiency. The column temperature used in
common reversed-phase liquid chromatography could also be used for the separation and the
flow-rate should be adjusted to 0.3–0.5 ml/min to optimize column efficiency. It is suggested
that the velocity of the hydrophilic interaction is slower than the hydrophobic interaction in
RPLC. The addition of trifluoroacetic acid is effective for the retention of glycine, but
ineffective for urea and sucrose. The diol phase exhibited sufficient chemical stability even
if exposed to water in high percentage, and could be applied with isocratic elution for the
separation/analysis of amino acids and glucose.
123
Bioactive peptides and tryptic digests of various proteins were separated under acidic and
alkaline conditions by ion-pair-reversed-phase high-performance liquid chromatography
(RP-HPIPC) in 200 μm I.D. monolithic, poly(styrene-divinylbenzene)-based capillary columns
using gradients of acetonitrile in 0.050 % aqueous trifluoroacetic acid, pH 2.1, or 1.0 %
triethylamine-acetic acid, pH 10.6. Chromatographic performances with mobile phases of low and
high-pH were practically equivalent and facilitated the separation of more than 50 tryptic
peptides of bovine serum albumin within 15–20 min with peak widths at half height between 4 and
10 s. Neither a significant change in retentivity nor efficiency of the monolithic column was
observed during 17-day operation at pH 10.6 and 50 °C. Upon separation by RP-HPIPC at high-pH,
peptide detectabilities in full-scan negative-ion electrospray ionization mass spectrometry
(negESI-MS) were about two to three times lower as compared to RP-HPIPC at low-pH with
posESI-MS detection. Tandem mass spectra obtained by fragmentation of deprotonated peptide ions
in negative ion mode yielded interpretable sequence information only in a few cases of
relatively short peptides. However, in order to obtain sequence information for peptides
separated with alkaline mobile phases, tandem mass spectrometry (MS/MS) could be performed in
positive ion mode. The chromatographic selectivities were significantly different in
separations performed with acidic and alkaline eluents, which facilitated the fractionation of
a complex peptide mixture obtained by the tryptic digestion of 10 proteins utilizing off-line,
two-dimensional RP-HPIPC at high pH × RP-HPIPC at low pH and subsequent on-line identification
by posESI-MS/MS.
124
The primary goal of metabolomic analysis is the unbiased relative quantification of every
metabolite in a biological system. A number of different metabolite-profiling techniques must
be combined to make this possible. Here we report the separation and analysis of highly polar
compounds in a proof of concept study. Compounds were separated and analyzed using hydrophilic
interaction liquid chromatography (HILIC) coupled to electrospray ionization (ESI) mass
spectrometry. Two types of HILIC microbore columns (Polyhydroxyethyl A and TSK Gel Amide 80)
were compared to normal phase silica HPLC columns. The best separations of standards mixtures
and plant samples were achieved using the Amide 80 stationary phase. ESI enabled the detection
of both positively and negatively charged metabolites, when coupled to a quadrupole ion trap
mass spectrometer using continuous polarity switching. By stepwise mass spectrometric
fragmentation of the most intense ions, unknown compounds could be identified and then included
into a custom mass spectrometric library. This method was used to detect oligosaccharides,
glycosides, amino sugars, amino acids, and sugar nucleotides in phloem exudates from petioles
of fully expanded Cucurbita maxima leaves. Quantitative analysis was performed using external
standards. The detection limit for stachyose was 0.5 ng per injection

159. Tran BQ, Hernandez C et al (2010) Addressing trypsin bias in large scale (phospho)proteome
analysis by size exclusion chromatography and secondary digestion of large post-trypsin
peptides. J Proteome Res 10(2):800–811¹²⁵
160. van Deemter JJ, Zuiderweg FJ et al (1956) Longitudinal diffusion and resistance to mass
transfer as causes of nonideality in chromatography. Chem Eng Sci 5(6):271–289¹²⁶
161. Verhaar LAT, Kuster BFM (1982) Contribution to the elucidation of the mechanism of sugar
retention on amine-modified silica in liquid chromatography. J Chromatogr A 234(1):57–64¹²⁷
162. Wagner K, Miliotis T et al (2002) An automated on-line multidimensional HPLC system for
protein and peptide mapping with integrated sample preparation. Anal Chem 74(4):809–820¹²⁸
163. Walshe M, Kelly MT et al (1995) Retention studies on mixed-mode columns in
high-performance liquid chromatography. J Chromatogr A 708(1):31–40¹²⁹

(Amide 80). The concentration of stachyose in investigated phloem samples was in the range of
1–7 mM depending on the plant.
125
In the vast majority of bottom-up proteomics studies, protein digestion is performed using only
mammalian trypsin. Although it is clearly the best enzyme available, the sole use of trypsin
rarely leads to complete sequence coverage, even for abundant proteins. It is commonly assumed
that this is because many tryptic peptides are either too short or too long to be identified by
RPLC/MS/MS. We show through in silico analysis that 20–30 % of the total sequence of three
proteomes (Schizosaccharomyces pombe, Saccharomyces cerevisiae, and Homo sapiens) is expected
to be covered by Large post-Trypsin Peptides (LpTPs) with Mr above 3000 Da. We then established
size exclusion chromatography to fractionate complex yeast tryptic digests into pools of
peptides based on size. We found that secondary digestion of LpTPs followed by LC/MS/MS
analysis leads to a significant increase in identified proteins and a 32–50 % relative increase
in average sequence coverage compared to trypsin digestion alone. Application of the developed
strategy to analyze the phosphoproteomes of S. pombe and of a human cell line identified a
significant fraction of novel phosphosites. Overall our data indicate that specific targeting
of LpTPs can complement standard bottom-up workflows to reveal a largely neglected portion of
the proteome.
126
The mechanisms of band broadening in linear, nonideal chromatography are examined. A
development is presented of a rate theory for this process, wherein nonideality is caused by:
• axial molecular diffusion; • axial eddy diffusion; • finiteness of transfer coefficient. The
correspondence with the plate theory is given, so that the results can also be expressed in
heights equivalent to a theoretical plate. The plate theory has been extended to the case of a
finite volume of feed; the requirement for this feed volume to be negligible has been examined
and a method is presented for evaluating concentration profiles obtained with a larger volume
of feed. An analysis is given of experimental results, whereby the relative contributions to
band broadening for various cooperating mechanisms could be ascertained.
127
Liquid chromatography of reducing or non-reducing sugars results in single peaks on
amine-modified silica with acetonitrile–water as eluent. In spite of the two anomeric forms of
the reducing sugars, single peaks can be obtained because mutarotation is fast under these
conditions. The bonded amine groups catalyse the mutarotation in such a way that triethylamine
added to eluent has no influence. The separation of the sugars is the result of their partition
between two liquid phases, because the composition of the stationary liquid phase appears to be
much richer in water than the eluent.
128
A comprehensive on-line two-dimensional 2D-HPLC system with integrated sample preparation was
developed for the analysis of proteins and peptides with a molecular weight below 20 kDa. The
system setup provided fast separations and high resolving power and is considered to be a
complementary technique to 2D gel electrophoresis in proteomics. The on-line system
reproducibly resolved ~1000 peaks within the total analysis time of 96 min and avoided sample
losses by off-line sample handling. The low-molecular-weight target analytes were separated
from the matrix using novel silica-based restricted access materials (RAM) with ion exchange
functionalities. The size-selective sample fractionation step was followed by anion or cation
exchange chromatography as the first dimension. The separation mechanism in the subsequent
second dimension employed hydrophobic interactions using short reversed-phase (RP) columns. A
new column-switching technique, including four parallel reversed-phase columns, was employed in
the second dimension for on-line fractionation and separation. Gradient elution and UV
detection of two columns were performed simultaneously while loading the third and regenerating
the fourth column. The total integrated workstation was operated in an unattended mode.
Selected peaks were collected and analyzed off-line by MALDI-TOF mass spectrometry. The system
was applied to protein mapping of biological samples of human hemofiltrate as well as of cell
lysates originating from a human fetal fibroblast cell line, demonstrating it to be a viable
alternative to 2D gel electrophoresis for mapping peptides and small proteins.
129
The retention properties of a column prepared by mixing together strong cation exchange (SCX)
and reversed-phase (C18) packing materials were investigated using a range of test solutes. The
column was found to exhibit chromatographic properties characteristic of both phases. The
effects of changes in eluent composition, buffer ion, ionic strength and pH on the capacity
factors of different compounds were determined. The dual nature of the retention mechanism
allowed the retention of ionisable molecules to be adjusted by altering the composition of the
aqueous component of the mobile phase

164. Wang X, Emmett MR et al (2010) Liquid chromatography electrospray ionization Fourier
transform ion cyclotron resonance mass spectrometric characterization of N-linked glycans and
glycopeptides. Anal Chem 82(15):6542–6548¹³⁰
165. Wang X, Li W et al (2005) Orthogonal method development using hydrophilic interaction
chromatography and reversed-phase high-performance liquid chromatography for the determination
of pharmaceuticals and impurities. J Chromatogr A 1083(1–2):58–62¹³¹
166. Washburn MP, Wolters D et al (2001) Large-scale analysis of the yeast proteome by
multidimensional protein identification technology. Nat Biotechnol 19(3):242–247¹³²
167. Weiss J, Jensen D (2003) Modern stationary phases for ion chromatography. Anal Bioanal
Chem 375(1):81–98
168. Westermeier R, Naven T et al (2008) Liquid chromatography techniques. Proteomics in
practice, Wiley-VCH Verlag GmbH & Co. KGaA, 151–213¹³³
169. Wohlgemuth J, Karas M et al (2010) Enhanced glyco-profiling by specific glycopeptide
enrichment and complementary monolithic nano-LC (ZIC-HILIC/RP18e)/ESI-MS analysis. J Sep Sci
33(6–7):880–890¹³⁴

while those of compounds uncharged over the pH range investigated remained unaffected. Results
were compared with those obtained on a C18 column and it was found that the acidic and weakly
basic compounds had higher capacity factors on this column whereas strongly basic compounds had
higher capacity factors on the mixed-mode column.
130
We combine liquid chromatography, electrospray ionization, and Fourier transform ion cyclotron
resonance mass spectrometry (LC ESI FT-ICR MS) to determine the sugar composition, linkage
pattern, and attachment sites of N-linked glycans. N-linked glycans were enzymatically released
from glycoproteins with peptide N-glycosidase F, followed by purification with graphitized
carbon cartridge solid-phase extraction and separation over a TSK-Gel Amide80 column under
hydrophilic interaction chromatography (HILIC) conditions. Unique glycopeptide compositions
were determined from experimentally measured masses for different combinations of glycans and
glycopeptides. The method was validated by identifying four peptides glycosylated so as to
yield 12 glycopeptides unique in glycan composition for the standard glycoprotein, bovine
alpha-2-HS-glycoprotein. We then assigned a total of 137 unique glycopeptide compositions from
18 glycoproteins from fetal bovine serum, and the glycan structures for most of the assigned
glycopeptides were heterogeneous. Highly accurate FT-ICR mass measurement is essential for
reliable identification.
131
A hydrophilic interaction chromatography (HILIC) method has been developed and validated as a
secondary or orthogonal method complementary to a reversed-phase HPLC (RP-HPLC) method for
quantitation of a polar active pharmaceutical ingredient and its three degradation products.
The HILIC method uses a diol column and a mobile phase consisting of acetonitrile/water and
ammonium chloride. The compounds of interest show significant differences in retention
behaviors with the two very different chromatographic systems, which are desired in developing
orthogonal methods. The HILIC method is validated and has met all validation acceptance
criteria for the support of drug development activities.
132
We describe a largely unbiased method for rapid and large-scale proteome analysis by
multidimensional liquid chromatography, tandem mass spectrometry, and database searching by the
SEQUEST algorithm, named multidimensional protein identification technology (MudPIT). MudPIT
was applied to the proteome of the Saccharomyces cerevisiae strain BJ5460 grown to mid-log
phase and yielded the largest proteome analysis to date. A total of 1484 proteins were detected
and identified. Categorization of these hits demonstrated the ability of this technology to
detect and identify proteins rarely seen in proteome analysis, including low-abundance proteins
like transcription factors and protein kinases. Furthermore, we identified 131 proteins with
three or more predicted transmembrane domains, which allowed us to map the soluble domains of
many of the integral membrane proteins. MudPIT is useful for proteome analysis and may be
specifically applied to integral membrane proteins to obtain detailed biochemical information
on this unwieldy class of proteins.
133
This chapter contains sections titled:
* Basic Principles of Important Liquid Chromatography Techniques: Ion Exchange Chromatography;
Reversed Phase Chromatography; Affinity Chromatography; Gel Filtration
* Strategic Approach and General Applicability
* Liquid Chromatography Techniques and Applications in Proteome Analysis: Peptide Separation;
2DLC Peptide Separation; Affinity Chromatography and LC-MS/MS; Protein Pre-fractionation
* Practical Considerations and Application of LC-based Protein Pre-fractionation: Sample
Extraction and Preparation; Experimental Setup; Ion Exchange Chromatography and Protein
Pre-fractionation; Reversed Phase Chromatography and Protein Pre-fractionation; Fraction Size
and Number of Fractions
* Critical Review and Outlook.
134
Dedicated and specific sample preparation and adequate chromatographic resolution prior to MS
are necessary for comprehensive and site-specific glycosylation analysis to compensate for high
heterogeneity of protein glycosylation, low-abundance of specific glycoforms and
ion-suppression effects caused by coelution of other peptides. This article describes a scheme
for glycopeptide profiling, which comprises HILIC batch enrichment followed by complementary
HILIC and RP-LC in 1-D and 2-D approaches. For reproducible and sensitive nano-

170. Wolters DA, Washburn MP et al (2001) An automated multidimensional protein identification
technology for shotgun proteomics. Anal Chem 73(23):5683–5690¹³⁵
171. Wyndham KD, O’Gara JE et al (2003) Characterization and evaluation of C18 HPLC stationary
phases based on ethyl-bridged hybrid organic/inorganic particles. Anal Chem
75(24):6781–6788¹³⁶
172. Xia HF, Lin DQ et al (2008) Preparation and evaluation of cellulose adsorbents for
hydrophobic charge induction chromatography. Ind Eng Chem Res 47(23):9566–9572¹³⁷
173. Xie S, Svec F et al (1997) Rigid porous polyacrylamide-based monolithic columns containing
butyl methacrylate as a separation medium for the rapid hydrophobic interaction chromatography
of proteins. J Chromatogr A 775(1–2):65–72¹³⁸

LC/ESI-MS analysis, we used ZIC-HILIC and RP18e monolithic silica capillaries and assessed
their retention characteristics and complementarity for glycopeptide separations. The
experiments revealed that pre-enrichment of glycopeptides in combination with LC employing both
phases considerably improves site-specific elucidation of glycosylation heterogeneity.
Zwitterionic hydrophilic interaction liquid chromatography showed high capability to separate
glycopeptides by their glycan composition, which coeluted on RP18e. By varying solvent
conditions, retention can be well tuned, and efficient separations were achieved even in
absence of any additives like salt or formic acid. RP18e facilitated glycopeptide separations
with high peak capacity based on peptide sequence and degree of sialylation. Implementing both
orthogonal and complementary phases in 1-D and 2-D LC setups was shown to significantly
increase the number of different identified glycoforms and possesses great potential for
comprehensive glycoproteomics approaches.
135
We describe an automated method for shotgun proteomics named multidimensional protein
identification technology (MudPIT), which combines multidimensional liquid chromatography with
electrospray ionization tandem mass spectrometry. The multidimensional liquid chromatography
method integrates a strong cation-exchange (SCX) resin and reversed-phase resin in a biphasic
column. We detail the improvements over a system described by Link et al. (Link, A. J.; Eng,
J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III.
Nat. Biotechnol. 1999, 17, 676–682) that separates and acquires tandem mass spectra for
thousands of peptides. Peptides elute off the SCX phase by increasing pI, and elution off the
SCX material is evenly distributed across an analysis. In addition, we describe the
chromatographic benchmarks of MudPIT. MudPIT was reproducible within 0.5 % between two
analyses. Furthermore, a dynamic range of 10 000 to 1 between the most abundant and least
abundant proteins/peptides in a complex peptide mixture has been demonstrated. By improving
sample preparation along with separations, the method improves the overall analysis of
proteomes by identifying proteins of all functional and physical classes.
136
The characterization and evaluation of three novel 5-μm HPLC column packings, prepared using
ethyl-bridged hybrid organic/inorganic materials, is described. These highly spherical hybrid
particles, which vary in specific surface area (140, 187, and 270 m²/g) and average pore
diameter (185, 148, and 108 Å), were characterized by elemental analysis, SEM, and nitrogen
sorption analysis and were chemically modified in a two-step process using
octadecyltrichlorosilane and trimethylchlorosilane. The resultant bonded materials had an
octadecyl surface concentration of 3.17–3.35 μmol/m², which is comparable to the coverage
obtained for an identically bonded silica particle (3.44 μmol/m²) that had a surface area of
344 m²/g. These hybrid materials were shown to have sufficient mechanical strength under
conditions normally employed for traditional reversed-phase HPLC applications, using a
high-pressure column flow test. The chromatographic properties of the C18 bonded hybrid phases
were compared to a C18 bonded silica using a variety of neutral and basic analytes under the
same mobile-phase conditions. The hybrid phases exhibited similar selectivity to the
silica-based column, yet had improved peak tailing factors for the basic analytes. Column
retentivity increased with increasing particle surface area. Elevated pH aging studies of these
hybrid materials showed dramatic improvement in chemical stability for both bonded and unbonded
hybrid materials compared to the C18 bonded silica phase, as determined by monitoring the loss
in column efficiency through 140-h exposure to a pH 10 triethylamine mobile phase at 50 °C.
137
Hydrophobic charge induction chromatography (HCIC) has been proven to be an efficient technique
for antibody purification. Several HCIC adsorbents were prepared with macroporous
cellulose–tungsten carbide composite beads (Cell-TuC) as the matrix. First, the cellulose beads
were activated by allyl bromide (AB) or divinyl sulfone (DVS), and then they were coupled with
three types of mercaptoheterocyclic groups – 4-mercapto-ethyl-pyridine hydrochloride (MEP),
2-mercapto-1-methyl-imidazole (MMI), and 2-mercapto-benzimidazole (MBI) – as the HCIC ligands.
Four types of HCIC adsorbents were obtained, labeled Cell-TuC-AB-MEP, Cell-TuC-DVS-MEP,
Cell-TuC-DVS-MMI, and Cell-TuC-DVS-MBI. The activation and coupling conditions were optimized
for high ligand density. The isotherm adsorption of immunoglobulin of egg yolk (IgY) on four
HCIC adsorbents were investigated. High adsorption capacities of IgY could be obtained for all
four adsorbents at pH 7, and low adsorption of IgY at pH 4 and of bovine serum albumin (BSA) at
pH 7 was observed, which indicates that the HCIC adsorbents prepared have a potential
application for antibody purification.

174. Yang Y, Geng X (2011) Mixed-mode chromatography and its applications to biopolymers. J
Chromatogr A 1218(49):8813–8825¹³⁹
175. Yon RJ (1972) Chromatography of lipophilic proteins on adsorbents containing mixed
hydrophobic and ionic groups. Biochem J 126(3):765–767
176. Yon RJ (1974) Enzyme purification by hydrophobic chromatography: an alternative approach
illustrated in the purification of aspartate transcarbamoylase from wheat germ (short
communication). Biochem J 137(1):127–130¹⁴⁰
177. Yon RJ (1981) Versatility of mixed-function adsorbents in biospecific protein desorption:
accidental affinity and an improved purification of aspartate transcarbamoylase from wheat
germ. Anal Biochem 113(2):219–228¹⁴¹
178. Yon RJ, Simmonds RJ (1975) Protein chromatography on adsorbents with hydrophobic and ionic
groups. Some properties of N-(3-carboxypropionyl)aminodecyl-sepharose and its interaction with
wheat-germ aspartate transcarbamoylase. Biochem J 151(2):281–290¹⁴²

138
Macroporous poly(acrylamide-co-butyl methacrylate-co-N,N′-methylenebisacrylamide) monoliths
containing up to 15 % butyl methacrylate units have been prepared
from Coomassie Blue R250-Sepharose. Experimental evidence suggests that (a) the enzyme is
adsorbed at heterogeneous sites on each column, only some of which are susceptible to
substrate-specific desorption; (b) in none of these cases is the initial adsorption essentially
biospecific, i.e., these are not cases of classical affinity chromatography; (c) in the case of
10-carboxydecylamino-Sepharose, and therefore presumably also in the other cases, the
desorption is biospecific, i.e., involves the formation of the catalytically significant
enzyme-carbamoyl phosphate complex. Substrate-specific desorption in these cases appears to
derive from accidental affinity between, on
by direct polymerization within the confines of HPLC the one hand, clusters of active (ionic, hydrophobic, aro-
columns. The hydrodynamic and chromatographic matic, etc.) groups on the protein and, on the other,
properties of these 50 mm  8 mm I.D. columns – such complementary clusters on the adsorbent, some of these
as back pressure at different flow-rates, effect of percent- interactions being perturbed when the ligands binds to the
age of hydrophobic component in the polymerization protein. Biospecific desorption from 10-carboxyde-
mixture, effect of salt concentration on the retention of cylamino-Sepharose has been incorporated as the sole
proteins, dynamic loading capacity, and recovery – were chromatographic step in a new, 8000-fold purification of
determined under conditions typical of hydrophobic inter- the enzyme. It is suggested that biospecific desorption
action chromatography. Using the monolithic column, from essentially nonbiospecific adsorbents could explain
five proteins were easily separated within only 3 min. some published purifications currently described as
139
Mixed-mode chromatography is a type of chromatog- “affinity chromatography”.
raphy in which a chromatographic stationary phase 142
1. The charge state of two derivatives of Sepharose
interacts with solutes through more than one interaction prepared by the CNBr activation method were studied by
mode. This technique has been growing rapidly because acid-base titration and by ion-exchange chromatography.
of its advantages over conventional chromatography, such Dodecyl-Sepharose exhibited cationic groups (21mumol/
as its high resolution, high selectivity, high sample load- ml of settled gel; pKa ¼ 9.6) that were tentatively
ing, high speed, and the ability to replace two convention- assigned to the coupling isourea group. 2. CPAD-
ally corresponding columns in certain circumstances. In Sepharose [N-(3-carboxypropionyl)aminodecyl-
this work, some aspects of the development of mixed- Sepharose] has anionic (carboxyl) groups (pKa ¼ 4.5)
mode chromatography are reviewed, such as stationary and cationic groups (pKa ¼ 9.6) in roughly equal
phase preparation, combinations of various separation concentrations (e coupling group. CPAD-Sepharose is
modes, separation mechanisms, typical applications to slightly negatively charged at pH 7.0 and substantially
biopolymers and peptides, and future prospects. negatively charged at pH 8.5. 3. The pKa values of
140
Two adsorbents containing similar numbers of hydro- dodecyl-Sepharose and CPAD-Sepharose are unaffected
carbon (C(10)) chains but different numbers of carboxyl by a 100-fold increase in the concentration of KCl.
groups were made by chemical modification of 4. CPAD-Sepharose has considerable affinity for wheat-
Sepharose. The use of these adsorbents to purify proteins, germ aspartate transcarbamoylase at pH 8.5 when the
under conditions where hydrophobic adsorption is partly adsorbent and enzyme are both negatively charged. The
resisted by electrostatic repulsion, is illustrated in the interaction involves the C10 chain but is relatively mod-
purification of aspartate transcarbamoylase (EC 2.1.3.2) erate compared with C10 chains associated only with
from wheat germ. positive charge. 5. Desorption of the enzyme adsorbed
141
Under appropriate experimental conditions (usually to CPAD-Sepharose can be achieved by raising the pH to
but not invariably including low ionic strength) wheat increase the electrostatic repulsion, or by introducing the
germ aspartate transcarbamoylase can be specifically detergent sodium deoxycholate. Acetone and butan-1-ol
desorbed by the substrate, carbamoyl phosphate, from also weaken the adsorption at pH 8.5. 6. High
hydroxyapatite, from N-(3-carboxypropionyl) concentrations of sodium acetate or sodium phosphate
aminooctyl-Sepharose, from 10-carboxydecylamino- induced the enzyme to bind more tightly to CPAD-
Sepharose, from Cibacron Blue F3GA-Sepharose, and Sepharose. 7. These results are discussed in terms of a
5 Improving Proteome Coverage by Reducing Sample Complexity via Chromatography 141

179. Yoshida T (2004) Peptide separation by hydrophilic- 182. Zhao G, Peng G et al (2008) 5-Aminoindole, a new
interaction chromatography: a review. J Biochem ligand for hydrophobic charge induction chromatog-
Biophys Methods 60(3):265–280143 raphy. J Chromatogr A 1211(1–2):90–98146
180. Zauner G, Deelder AM et al (2011) Recent advances 183. Zhou H, Di Palma S et al (2012). Toward a compre-
in hydrophilic interaction liquid chromatography hensive characterization of a human cancer cell
(HILIC) for structural glycomics. Electrophoresis phosphoproteome. J Proteome Res 12(1):260–271147
32(24):3456–3466144
181. Zhao G, Dong XY et al (2009) Ligands for mixed-
mode protein chromatography: principles,
influences on the performance of mixed-mode adsorbents.
characteristics and design. J Biotechnol 144
These principles should be considered in the screening
(1):3–11145
and design of mixed-mode ligands. Strategies for the
design of synthetic affinity ligands, especially the bioin-
formatics and combinatorial methods, may be adopted for
‘repulsion-controlled’ model or hydrophobic mixed-mode ligand design. More efforts are needed for
chromatography. the development of rational design and screening methods
143
Recent developments in the separation of peptides by for mixed-mode protein ligands by sophisticated compu-
high-performance liquid chromatography (HPLC) using tational and experimental approaches.
polar sorbents with less polar eluents are summarized in 146
Hydrophobic charge induction chromatography
this review. This separation mode is now commonly (HCIC) is a mixed-mode chromatography that achieves
referred to as Hydrophilic-Interaction Chromatography high adsorption capacity by hydrophobic interaction and
(HILIC). The retention mechanism and chromatographic facile elution by pH-induced charge repulsion between
behavior of polar solutes under HILIC conditions are the solute and ligand. This article reports a new medium,
studied on TSKgel Amide-80 columns, which consist of 5-aminoindole-modified Sepharose (AI-Sepharose) for
carbamoyl groups bonded to a silica gel matrix, using a HCIC. The adsorption equilibrium and kinetics of lyso-
mixture of acetonitrile (MeCN)–water containing 0.1 % zyme and bovine serum albumin (BSA) to AI-Sepharose
trifluoroacetic acid (TFA). Some applications are given in were determined by batch adsorption experiments at dif-
peptide field using Hydrophilic-Interaction ferent conditions to provide insight into the adsorption
Chromatography. properties of the medium. The influence of salt type on
144
This review presents recent progress in employing protein adsorption to AI-Sepharose corresponded with the
hydrophilic interaction liquid chromatography (HILIC) trend for other hydrophobicity-related properties in litera-
for glycan and glycopeptides analysis. After an introduc- ture. Both ligand density and salt concentration had posi-
tion of this technique, the following themes are addressed: tive influences on the adsorption of the two proteins
(i) implementation of HILIC in large-scale studies for investigated. The adsorption capacity of lysozyme, a
analyzing the human plasma N-glycome; (ii) the use of basic protein, decreased rapidly when pH decreased
HILIC UPLC (ultrahigh pressure liquid chromatography) from 7 to 3 due to the increase of electrostatic repulsion,
for fast high-resolution runs and its successful application while BSA, an acidic protein, achieved maximum adsorp-
with online MS for glycan and glycopeptide analysis; (iii) tion capacity around its isoelectric point. Dynamic
high-throughput profiling using HILIC solid-phase adsorption experiments showed that the effective pore
extraction in combination with MS detection; (iv) HILIC diffusion coefficient of lysozyme remained constant at
sample preparation for CE and CGE; (v) the latest different salt concentrations, while that of BSA decreased
glycoproteomic approaches implementing HILIC separa- with increased salt concentration due to its greater steric
tion; (vi) future perspectives of HILIC including its use in hindrance in pore diffusion. High protein recovery by
large-scale glycoproteomics studies such as the analysis adsorption at pH 7.10 elution at pH 3.0 was obtained at
of entire glycoproteomes at the glycopeptide level. a number of NaCl concentrations, indicating that the
145
Mixed-mode chromatography is a chromatographic adsorbent has typical characteristics of HCIC and
method that utilizes more than one form of interactions potentials for applications in protein purification.
147
between the stationary phase and the solutes in a feed Mass spectrometry (MS)-based phosphoproteomics
stream. Compared with other types of chromatography, has achieved extraordinary success in qualitative and
mixed-mode chromatography is advantageous in its salt- quantitative analysis of cellular protein phosphorylation.
independent adsorption, facile elution by charge repul- Considering that an estimated level of phosphorylation in
sion, and unique selectivity. Hence, it has already proved a cell is placed at well above 100?000 sites, there is still
beneficial for the separation of proteins as well as other much room for improvement. Here, we attempt to extend
purposes. In this article, mixed-mode ligands for protein the depth of phosphoproteome coverage while
purification have been reviewed. These ligands usually maintaining realistic aspirations in terms of available
have an aliphatic or aromatic group as the hydrophobic material, robustness, and instrument running time. We
moiety and an amino, carboxyl or sulfonic group as the developed three strategies, where each provided a differ-
ionic moiety. Heterocyclic groups are good ligand ent balance between these three key parameters. The first
candidates for their unique hydrophobicity and dissocia- strategy simply used enrichment by Ti4 + IMAC
tion property. Hydrogen bonding groups also have followed by reversed chromatography LC-MS (termed
142 U. Kota and M.L. Stolowitz

184. Zhou NE, Mant CT et al (1991) Comparison of hydrophilic and strong cation-exchange columns. J
silica-based cyanopropyl and octyl reversed-phase Chromatogr 548(1–2):13–24149
packings for the separation of peptides and proteins. 186. Zhu BY, Mant CT et al (1992) Mixed-mode hydro-
J Chromatogr 548(1–2):179–193148 philic and ionic interaction chromatography rivals
185. Zhu BY, Mant CT et al (1991). Hydrophilic- reversed-phase liquid chromatography for the sepa-
interaction chromatography of peptides on ration of peptides. J Chromatogr A 594
(1–2):75–86150

1D). The second strategy incorporated an additional frac-


tionation step through the use of HILIC (2D). Finally, a
third strategy was designed employing first an SCX frac-
tionation, followed by Ti4 + IMAC enrichment and
additional fractionation by HILIC (3D). A preliminary
evaluation was performed on the HeLa cell line.
149
Detecting 3700 phosphopeptides in about 2 h, the 1D Hydrophilic-interaction chromatography (HILIC) was
strategy was found to be the most sensitive but limited recently introduced as a potentially useful separation
in comprehensivity, mainly due to issues with complexity mode for the purification of peptides and other polar
and dynamic range. Overall, the best balance was compounds. The elution order of peptides in HILIC,
achieved using the 2D based strategy, identifying close which separates solutes based on hydrophilic interactions,
to 17?000 phosphopeptides with less than 1 mg of mate- should be opposite to that obtained in reversed-phase
rial in about 48 h. Subsequently, we confirmed the chromatography, which separates solutes based on hydro-
findings with the K562 cell sample. When sufficient mate- phobic interactions. Three series of peptides, two of
rial was available, the 3D strategy increased which consisted of positively charged peptides (indepen-
phosphoproteome allowing over 22?000 unique dent of pH at pH less than 7) and one of which consisted
phosphopeptides to be identified. Unfortunately, the 3D of uncharged or negatively charged peptides (dependent
strategy required more time and over 1 mg of material on pH), and which varied in overall hydrophilicity/
before it started to outperform 2D. Ultimately, combining hydrophobicity, were utilized to examine the separation
all strategies, we were able to identify over 16?000 and mechanism and efficiency of HILIC on hydrophilic and
nearly 24?000 unique phosphorylation sites from the can- strong cation-exchange columns.
cer cell lines HeLa and K562, respectively. In summary, 150
Peptide separations based upon mixed-mode hydro-
we demonstrate the need to carry out extensive fraction- philic and ionic interactions with a strong cation-exchange
ation for deep mining of the phosphoproteome and pro- column have been investigated. The peptide separations
vide a guide for appropriate strategies depending on were generally achieved by utilizing a linear increasing
sample amount and/or analysis time. salt (sodium perchlorate) gradient in the presence of aceto-
148
The performance of a silica-based C8 packing was nitrile (29–90 %, v/v) at pH 7. The presence of acetonitrile
compared with that of a less hydrophobic, silica-based in the mobile phase promotes hydrophilic interactions with
cyanopropyl (CN) packing during their application to the hydrophilic stationary phase, these hydrophilic
reversed-phase high-performance liquid chromatography interactions becoming increasingly important to the sepa-
(linear trifluoroacetic acid-water to trifluoroacetic acid- ration process as the acetonitrile concentration is increased.
acetonitrile gradients) of peptides and proteins. It was At acetonitrile concentrations of 20–50 % (v/v) in the
found that: (1) the CN column showed excellent selectiv- mobile phase, the peptides utilized in this study were eluted
ity for peptides which varied widely in hydrophobicity in order of increasing net positive charge, indicating that
and peptide chain length; (2) peptides which could not be ionic interactions were dominating the separation process.
resolved easily on the C8 column were widely separated Peptides with the same net positive charge were also well
on the CN column; (3) certain mixtures of peptides and resolved by an hydrophilic interaction mechanism, being
small organic molecules which could not be resolved on eluted in order of increasing hydrophilicity (decreasing
the C8 column were completely separated on the CN hydrophobicity). At higher acetonitrile concentrations
column; (4) impurities arising from solid-phase peptide (70–90 %, v/v), column selectivity was changed dramati-
synthesis were resolved by a wide margin on the CN cally, with hydrophilic interactions now dominating the
column, unlike on the C8 column, where these separation process. Under these conditions, specific
compounds were eluted very close to the peptide product peptides may be eluted earlier or later than less highly
of interest: and (5) specific protein mixtures exhibited charged peptides, depending upon their hydrophilic/hydro-
superior resolution and peak shape on the CN column phobic character. This mixed-mode methodology was
compared with the C8 column. The results clearly dem- compared to reversed-phase liquid chromatography of the
onstrate the effectiveness of employing stationary phases peptides at pH 2 and pH 7. The results of this comparison
of different selectivities (as opposed to the more common suggested that mixed-mode hydrophilic-ion-exchange
optimization protocol of manipulating the mobile phase) chromatography on a strong cation-exchange column rivals
for specific peptide and protein applications, an approach reversed-phase liquid chromatography for peptide
underestimated in the past. separations.
5 Improving Proteome Coverage by Reducing Sample Complexity via Chromatography 143

187. Zhu S, Zhang X et al (2012) Developing a strong high-abundance proteins depletion in human plasma.
anion exchange/RP (SAX/RP) 2D LC system for Proteomics 12(23–24):3451–3463151
188. Zywicki B, Catchpole G et al152

151
Human plasma is dominated by high-abundance α-chaconine and α-solanine, were compared for robust-
proteins which severely impede the detection of ness in high-throughput operations for over 1000 analyti-
low-abundance proteins. Unfortunately, now there is no cal runs using potato tuber samples from field trials.
efficient method for large-scale depletion of high- Glycoalkaloids were analyzed using liquid chromatogra-
abundance proteins in human plasma. In this study, we phy coupled to tandem mass spectrometry in multiple
developed a new strategy, strong anion exchange (SAX)/ reaction monitoring mode. An electrospray interface was
RP 2D LC system, which has potential for large-scale used in the detection of glycoalkaloids in positive ion
depletion of high-abundance proteins in human plasma. mode. Classical reversed phase (RP) and hydrophilic
Separation gradients of the system were optimized to interaction (HILIC) columns were investigated for chro-
ensure an extensive separation of plasma proteins. Plasma matographic separation, ruggedness, recovery, precision,
was fractionated into 67 fractions by SAX. All these and accuracy. During the validation procedure both
fractions were subjected a thorough separation by the methods proved to be precise and accurate enough in
2D RPLC and 66 peaks with high UV absorption (>20 relation to the high degree of endogenous biological
mAU) at 215 nm were collected. Proteins in these peaks variability found for field-grown potato tubers. However,
were identified by LC-MS/MS analysis. Results showed the RP method was found to be more precise, more
that 83 proteins could be identified in these peaks, accurate, and, more importantly, more rugged than the
68 among them were reported to be high- or middle- HILIC method for maintaining the analytes’ peak shape
abundance proteins in plasma. All these proteins had symmetry in high-throughput operation. When applied to
definite retention times and were mapped in the 2D the comparison of six classically bred potato cultivars to
SAX-RP system, which resulted in accurate depletion of six genetically modified (GM) lines engineered to synthe-
high-abundance proteins with ease. Our studies provide a size health beneficial inulins, the glycoalkaloid content in
convenient and effective method for large-scale depletion potato peels of all GM lines was found within the range of
of high-abundance proteins and in-depth research in the six cultivars. We suggest complementing current
human plasma proteomics. unbiased metabolomic strategies by validating quantita-
152
Two rapid methods for highly selective detection and tive analytical methods for important target analytes such
quantification of the two major glycoalkaloids in potatoes, as the toxic glycoalkaloids in potato plants.
Part II
Mass Spectrometry for Proteomics Analysis

6 Database Search Engines: Paradigms, Challenges and Solutions

Kenneth Verheggen, Lennart Martens, Frode S. Berven, Harald Barsnes, and Marc Vaudel

Abstract
The first step in identifying proteins from mass spectrometry-based shotgun proteomics data is to infer peptides from tandem mass spectra, a task generally achieved using database search engines. In this chapter, the basic principles of database search engines are introduced with a focus on open source software, and the use of database search engines is demonstrated using the freely available SearchGUI interface. This chapter also discusses how to tackle general issues related to sequence database searching and shows how to minimize their impact.

Keywords
Peptide identification • Search engines • Shotgun proteomics • Sequence
database searching

Abbreviations

PSM   Peptide Spectrum Match
PTM   Post-Translational Modification

K. Verheggen • L. Martens
Department of Medical Protein Research, VIB, Ghent, Belgium
Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium

F.S. Berven
Proteomics Unit, Department of Biomedicine, University of Bergen, Jonas Liesvei 91, N-5009 Bergen, Norway
KG Jebsen Centre for Multiple Sclerosis Research, Department of Clinical Medicine, University of Bergen, Bergen, Norway
Norwegian Multiple Sclerosis Competence Centre, Department of Neurology, Haukeland University Hospital, Bergen, Norway

H. Barsnes (*) • M. Vaudel
Proteomics Unit, Department of Biomedicine, University of Bergen, Jonas Liesvei 91, N-5009 Bergen, Norway
e-mail: harald.barsnes@biomed.uib.no

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_6

6.1 Introduction

The raw output of modern mass spectrometers used in high throughput proteomics does not provide directly interpretable information about the proteins originally present in the analyzed sample. In order to infer the protein composition of a sample, scientists rely on computational and statistical tools to process, control and interpret the data. An essential part of this process is the identification of peptides from tandem mass spectra, a task generally achieved by software referred to as (proteomics) search engines [1, 2].

Fig. 6.1 Standard workflow for sequence database searches. The output from the mass spectrometer, consisting of experimental spectra, is compared to the theoretical spectra obtained from the peptides resulting from an in silico digestion of the proteins in the search database. The matching of these two types of spectra results in a list of peptide-to-spectrum matches (PSMs), each scored according to how well the peptide matches the spectrum

Table 6.1 A chronological (non-exhaustive) list of available sequence database search algorithms

Algorithm         Published   Free
SEQUEST [3]       1994
Mascot [65]       1999
X!Tandem [6]      2004        ✓
OMSSA [10]        2004        ✓
InSpect [66]      2005        ✓
MyriMatch [7]     2007        ✓
Crux [67]         2008        ✓
MS-GF+ [9]        2010        ✓
Tide [12]         2011        ✓
MassWiz [68]      2011        ✓
Andromeda [69]    2011        ✓
Comet [11]        2012        ✓
Byonic [70]       2012
Peaks DB [71]     2012
Morpheus [72]     2013        ✓
MS Amanda [8]     2014        ✓

Note that the Published column indicates when the manuscript describing the algorithm or search engine was published, and does not necessarily correspond to when it was made available to the community. The final column indicates if the algorithm is freely available

As shown in Fig. 6.1, search engines compare the experimental spectra to theoretical protein sequences obtained from a protein sequence database. By performing an in silico processing of the theoretical sequences, the digestion and fragmentation of the actual experiment is mimicked by the software. Hence, theoretical spectra are generated and can then be compared to the experimentally obtained spectra. A match between an experimental and a theoretical spectrum is called a Peptide-to-Spectrum Match (PSM). The core output of a search engine is composed of a list of such candidate PSMs for every spectrum, with associated scores (typically reported as e-values) that provide an assessment of the quality of the match.

SEQUEST [3] was the first widely used algorithm implementing this technique, and was rapidly adopted in everyday lab practices. As listed in Table 6.1, several algorithms, commercial and academic, were subsequently made available to the community. In order to provide biologically meaningful results, these algorithms were also integrated into broader environments like the Trans-Proteomic Pipeline (TPP) [4] and OpenMS [5], where search engines can be combined with other tools, building complex proteomic workflows for protein inference, quantification and functional analysis.
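
The in silico processing described above can be illustrated with a minimal sketch. It is not how any of the search engines in Table 6.1 is implemented. It assumes a simplified trypsin rule (cleavage after K or R unless followed by P), unmodified peptides made of the 20 standard residues, singly charged b and y ions only, and a naive shared-peak count in place of a statistical score. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Monoisotopic residue masses in Da (standard amino acids only; other
# letters such as X or U would raise a KeyError in this sketch).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, added to the residue sum of an intact peptide
PROTON = 1.00728   # charge carrier for singly charged ions


def tryptic_digest(protein: str, missed_cleavages: int = 0) -> List[str]:
    """Cleave after K or R, except before P (assumed, simplified rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            pieces.append(protein[start:i + 1])
            start = i + 1
    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` neighbouring pieces.
        for j in range(i, min(i + missed_cleavages, len(pieces) - 1) + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def fragment_mzs(peptide: str) -> List[float]:
    """Theoretical spectrum: singly charged b- and y-ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for k in range(1, len(peptide)):
        ions.append(sum(masses[:k]) + PROTON)          # b ion, first k residues
        ions.append(sum(masses[k:]) + WATER + PROTON)  # y ion, remaining residues
    return sorted(ions)


def shared_peak_count(observed: List[float], peptide: str, tol: float = 0.02) -> int:
    """Toy score: theoretical fragments with an observed peak within tol Da."""
    return sum(
        any(abs(mz - peak) <= tol for peak in observed)
        for mz in fragment_mzs(peptide)
    )


def search(observed: List[float], precursor_mass: float, database: List[str],
           precursor_tol: float = 0.01) -> List[Tuple[str, int]]:
    """Return candidate PSMs (peptide, score) for one spectrum, best first."""
    psms = []
    for protein in database:
        for pep in tryptic_digest(protein):
            mass = sum(RESIDUE_MASS[aa] for aa in pep) + WATER
            if abs(mass - precursor_mass) <= precursor_tol:
                psms.append((pep, shared_peak_count(observed, pep)))
    return sorted(psms, key=lambda psm: psm[1], reverse=True)
```

Calling search with a peak list, the neutral precursor mass and a list of protein sequences returns the candidate PSMs for that spectrum. Real search engines replace the shared-peak count with calibrated scores such as e-values, and also handle modifications, higher charge states and other ion types.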
Fig. 6.2 SearchGUI main dialog. At the top the spectrum files to process, the general search settings and the output folder can be selected; in the middle the search engine(s) to use are chosen; and at the bottom optional post-processing to merge and view the results can be set up

In this chapter, we will demonstrate the use of search engines for peptide identification, with a focus on free and open source implementations. As an example, X!Tandem [6], MyriMatch [7], MS Amanda [8], MS-GF+ [9], OMSSA [10], Comet [11] and Tide [12], will be run via the SearchGUI [13] interface (http://compomics.github.io/projects/searchgui.html), a user friendly framework to operate all seven of these command line algorithms. The main steps of the process will be detailed, highlighting potential pitfalls and available solutions. SearchGUI is open source software and does not require any installation except downloading and unzipping. Upon starting the tool, the dialog displayed in Fig. 6.2 appears, allowing the user to set up the desired search parameters and start the search.

The main dialog of SearchGUI consists of three sections: the top section, ‘Input & Output’, allows the input and output files, and the shared search settings used across all algorithms to be specified (more details below); in the ‘Search Engines’ section, the user can select which search engine(s) to use and to set search engine specific parameters (by clicking on the cogwheel next to the search engine in question); finally, the ‘Post Processing’ section can be used to automatically run PeptideShaker [14] after the identification step. PeptideShaker will then import and merge the output from the different search engines, which generally gives better results than using only a single search engine [15]. Note however, that the post processing applied by a tool such as PeptideShaker is beyond the scope of this chapter and will not be detailed here. For further information on this topic, the interested reader is instead recommended to consult the extensive tutorial material on peptide and protein identification [16].
150 K. Verheggen et al.

6.2 Spectrum Input 6.3 Search Settings

The raw output of a mass spectrometer consists The parameters for the search can be set by
of all spectra (MS1 and MS2) in a vendor specific clicking the ‘Edit’ button next to the ‘Search
binary format, along with various other data Settings’ field, which opens the dialog shown in
related to the chromatography and operational Fig. 6.3. Here the common search settings for the
status of the instrument during acquisition. How- different search engines are displayed, including:
ever, search engines generally take as input (i) the database to search; (ii) the allowed
processed peak lists of MS2 spectra. Before mass tolerances; (iii) the post-translational
starting the search, signal processing steps are modifications (PTMs) to consider; and (iv) the
therefore used to transform the raw data into protease and fragmentation settings used.
peaks lists. Typical processing steps include Together these parameters define the search
noise removal, baseline correction, deisotoping space that will be used by the algorithms, which
and peak picking. Note that with high resolution is critical in three aspects: (i) it is impossible to
instruments only the latter is generally required identify peptides which are not included in the
[17]. The reference platform for converting raw search space; (ii) a large search space increases
data into peak lists is ProteoWizard [18], which the likelihood that similar peptides occur, which
can be used to generate files that are compatible are difficult to resolve [20]; and (iii) ambiguous
with most search engines. In the case of peptide identifications complicate the protein
SearchGUI, files in the mgf (Mascot Generic inference issue [21]. Using a large search space
File) format are used. More advanced spectrum thus favours the occurrence of false positive
processing options are available in the OpenMS identifications, while using a very small search
platform [5, 19]. space will lead to many false negatives and

Fig. 6.3 SearchGUI search settings dialog. At the top the (note the option to add modifications as either fixed or
protein sequence database is selected; in the middle the variable); and at the bottom the protease and fragmenta-
modifications assumed to be in the sample are chosen tion settings are inserted
6 Database Search Engines: Paradigms, Challenges and Solutions 151

unreliable scores for the reported matches. Finding the correct search settings is thus
critical when using search engines.

6.4 Protein Databases

In order to match a spectrum to its theoretical counterpart, a protein sequence database
is required. The database is in essence a list of all protein sequences that could
presumably be found in the sample. Protein sequences can be obtained from online protein
databases, generally in the text-based FASTA format. While specialized databases exist for
specific species or pathologies (e.g. The Arabidopsis Information Resource [22], TAIR, for
Arabidopsis thaliana, or TBDB [23] for Tuberculosis), generic resources for protein
sequences such as the Universal Protein knowledgebase UniProtKB [24] (http://uniprot.org)
and Ensembl [25] (http://www.ensembl.org) have been established as well.

UniProt provides annotated sets of protein sequences, deduced from sequenced genomes for a
large number of species, and consists of two main collections of sequences: Swiss-Prot and
TrEMBL. Swiss-Prot contains manually annotated and reviewed protein sequences, while
TrEMBL is automatically annotated and not yet manually curated. It is usually recommended
to search against Swiss-Prot, as this ensures that the identifications are based on high
quality protein information. If UniProt should contain no sequences for the organism under
study, Ensembl can provide a useful alternative. In essence a nucleic acid sequence
database that includes recently sequenced organisms, Ensembl also provides translated
sequences in the form of protein databases.

In order to limit the search space, it is advised to tailor the set of sequences searched
to those that are expected in the sample. This is achieved by restricting the search to
the species of interest. Species specific sequence sets can be obtained from the UniProt
website by selecting a specific taxonomy. However, it should be noted that this approach
quickly becomes very complicated when working with poorly defined samples such as
encountered in metaproteomics [26]. It is also important to note that protein databases
are in constant development, and it is therefore crucial to clearly document the version
of the database used for a given project (typically accompanied by the total number of
sequences in that database), and to only compare results obtained with the same version of
a given database.

Model organisms are well covered by UniProt. This is however not always the case for less
characterized or strongly mutated organisms, where missing proteins can potentially be
problematic. In such cases a related species with similar sequences is generally used. It
is also worth mentioning that spectra from missing peptides are prone to generate false
positive identifications [27]; this is notably the case for contaminants, which should be
included in the list of searched proteins. This is especially important when searching
non-human data, as minute amounts of human keratin, from hair or skin, often end up in the
samples. If these are not filtered out as contaminants, the search engines may very well
mistake them for evidence of proteins not actually in the sample [28]. A list of common
contaminants can be found at the Global Proteome Machine [29] (GPM) website
(http://www.thegpm.org/crap).

Although databases containing protein isoforms hold the promise of higher identification
rates, the number of peptides identified is generally stable if not diminished, while the
complexity of the subsequent protein inference step is dramatically increased [30]. Thus,
using databases with high protein ambiguity results in an increased number of proteins
based on the same peptide sequences [31]. This ambiguity is particularly problematic in
quantitative and functional analyses [32]. Thus, the option to include isoforms in the
sequence database should be considered carefully, and the data resulting from such
searches should be interpreted with due caution.
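The bookkeeping recommended in this section can be sketched in a few lines of Python. The example is only an illustration under stated assumptions: the two FASTA snippets, the accession names and the helper names (`read_fasta`, `build_search_database`) are hypothetical, and in practice the target sequences would come from a UniProt download and the contaminants from a list such as the one at the GPM website. The sketch merges a species specific sequence set with contaminant sequences, skips sequences that are already present, and reports the number of entries so that it can be documented together with the database version.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_search_database(target_handle, contaminant_handle):
    """Merge target and contaminant entries, keeping the first copy of each sequence."""
    entries, seen = [], set()
    sources = (("target", target_handle), ("contaminant", contaminant_handle))
    for source, handle in sources:
        for header, sequence in read_fasta(handle):
            if sequence in seen:
                continue  # identical sequence already in the database
            seen.add(sequence)
            entries.append((source, header, sequence))
    return entries


# Hypothetical toy input; real files would be opened with open(path).
targets = StringIO(
    ">sp|P00001|PROT1_ARATH example protein\nMKTAYIAKQR\nQISFVK\n"
    ">sp|P00002|PROT2_ARATH\nMSDNGK\n"
)
contaminants = StringIO(
    ">CONT_KERATIN hypothetical keratin fragment\nSLNNQFASFIDK\n"
    ">CONT_DUP duplicate of a target\nMSDNGK\n"
)

database = build_search_database(targets, contaminants)
n_contaminants = sum(1 for source, _, _ in database if source == "contaminant")
print(f"{len(database)} sequences searched, {n_contaminants} of them contaminants")
```

Keeping the origin of each entry (target or contaminant) makes it possible to flag contaminant hits after the search instead of mistaking them for sample proteins. Dropping duplicate sequences is a design choice of this sketch, not a requirement of any search engine.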
152 K. Verheggen et al.

6.5 Post-Translational Modifications (PTMs)

Post-translational modifications (PTMs) can be categorized according to whether they occur
in vivo or in vitro. Modifications in the first category are part of cellular mechanisms,
for example, phosphorylation as a mechanism to activate proteins. Such natural
modifications play an important part as control mechanisms for cellular regulation
[33–35]. However, modifications in this category are often present in sub-stoichiometric
amounts and are therefore unlikely to be found without prior enrichment [36]. Thus, the
choice to include in vivo modifications in a search depends on whether the experiment
actually targets these.

In vitro modifications are linked to intentional or unintentional modification due to
sample handling and preparation, where the most widely encountered modifications are
oxidation of methionine, an unintentional modification occurring due to the sample coming
in contact with air, and the intentional protection of cysteine residues by alkylation
after reduction of disulfide bonds. In the latter case, the modification is the result of
a high yield chemical reaction that ensures that nearly all relevant sites will be
modified. Another example where in vitro modifications are expected to occur in close to
100 % of cases is in label based quantification strategies such as SILAC, iTRAQ and TMT,
where the incorporation or labelling efficiency is moreover typically verified (for a
detailed example see [37]).

From the above, it is clear that PTMs are encountered in two different forms: (i) if a
modification is expected to occur at (almost) all possible modification sites, it is
referred to as a fixed (or static) modification, while (ii) a modification that is more
unpredictable is called a variable (or dynamic) modification. It is important to note that
fixed and variable modifications have a very different impact on the search space.

PTMs are identified by search engines via the mass shift they induce in the amino acid
sequence at specific positions, e.g., at particular amino acids, or at the peptide or
protein termini. For example, a peptide containing an oxidized methionine carries an extra
oxygen atom. This means that all peptides containing an oxidized methionine will have
their intact mass increased by approximately 16 Da. Moreover, each fragment ion that
contains the modified methionine will also be affected in the same way. The search engine
thus has to look for both versions of these peptides: the unmodified as well as the
modified form. Given that this split into two distinct peptide forms has to be done for
every methionine residue in a peptide, and given that peptides can contain more than one
methionine, it should be clear that adding variable modifications has a dramatic impact on
the size of the search space. And as already mentioned, increasing the search space also
increases the likelihood of false positives. It is therefore generally good practice to
evaluate the abundance of PTMs before including them in the search settings.

Fixed modifications, on the other hand, do not impact the size of the search space, as a
search engine can simply consider all potential modification sites as modified,
effectively replacing the masses of the affected residues by their modified masses and
eliminating the need to consider multiple alternatives for each peptide.

In SearchGUI, fixed and variable modifications are selected from a predefined list. The
list can be extended using the drop down menu above the table, and new modifications can
be added by clicking on the cogwheel. Modifications are saved in a search engine
independent structure [38] and can be reused in future searches.

6.6 Protease and Fragmentation

When digesting a sample using a specific protease, the peptides obtained abide by the
enzyme cleavage rules. The leading protease in proteomics is trypsin [39]. Trypsin is
commonly found in the digestive system of many vertebrates, and cleaves peptide chains at
the carboxyl-terminal side of the amino acids lysine (K) and arginine (R), except when
followed by proline (P). Due to
this cleavage specificity, the number of possible peptides is limited. Restricting the
considered peptides to those fitting the tryptic cleavage rules dramatically reduces the
search space, hence improving the search speed and reducing the number of false positives.
However, note that using the cleavage rules as a filter sets a strong dependence on the
quality of the digestion [40]. This factor is generally relaxed by allowing a given number
of missed cleavages, thus accounting for the presence of sites which are not accessible to
the enzyme [39]. Prediction tools exist that evaluate each potential cleavage site for
missed cleavage [41–43] and are compared in [44], but these approaches have so far not
been included in search engines.

It is also important to tailor the search space to the resolution of the mass
spectrometer, at both the MS1 and MS2 levels, with tolerances given either in ppm (parts
per million, tolerance relative to the precursor m/z) or in Dalton (absolute tolerance).
The tolerances depend on the resolution of the measurements and can be optimized for a
given setup [45]. Again, it should be mentioned that while relaxing the accuracy
requirements, i.e., increasing the tolerances, may result in more peptides identified, it
will in most cases also increase the number of false positives. Notably, for low
resolution mass spectrometers, it is common practice to search with a wide tolerance and
filter out the PSMs a posteriori, a method which can substantially increase the
identification rate [46].

The charge states and ion types considered by the search engine should be adapted to the
ionization technique and fragmentation type used, which is done by setting the expected
type of fragment ions and precursor charges. Fragment ion types generally consist of one
forward ion (a, b or c; all containing the original amino-terminus of the peptide) and one
rewind ion (x, y or z; all containing the original carboxyl-terminus), according to the
nomenclature by Roepstorff and Fohlman [47]. Collision Induced Dissociation (CID) and
Higher-energy Collisional Dissociation (HCD) generate mainly b and y ions, while Electron
Capture Dissociation (ECD) and Electron Transfer Dissociation (ETD) yield mainly c and z
ions. Setting the allowed precursor charge(s) depends on the ionization method used, with
the defaults being +1 for Matrix Assisted Laser Desorption/Ionization (MALDI) and +2, +3
and +4 for Electrospray Ionisation (ESI). Note that the charges encountered can differ
based on the peptides present in solution and notably depending on the protease used for
digestion, specific chemical modification [48], or by the mass spectrometer being tuned to
target specific charges.

6.7 Search Engine Specific Settings

The search settings detailed above are common to all search engines. However, most search
engines also have their own specific settings, allowing the user to customize the inner
workings of the search engine algorithm. SearchGUI provides access to these search engine
specific settings by clicking on the cogwheels located to the right of each search engine
in the main dialog (see Fig. 6.2). These advanced settings will not be described in detail
here, and it is advised to refer to the documentation of the original algorithm before
making any changes to the default values. Relevant options to inspect include the quick
PTM search options of X!Tandem and the related refinement procedure section. While the
latter can increase the identification coverage by relaxing the search parameters in a
so-called second pass search, it is known to bias the estimation of error rates [49]. It
is also advised to verify the fragmentation method selected for MyriMatch, MS Amanda, and
MS-GF+; and for the latter verify the selected detector and protocol. Also note that
MS-GF+ does not take into account the provided MS2 tolerance, as it will optimize this
setting internally [9].

6.8 Conclusion and Perspectives

One of the main challenges in peptide identification from mass spectrometry based shotgun
proteomics data is the presence of false positive
identifications. Their presence is controlled a posteriori in post processing software
through the estimation of a False Discovery Rate (FDR), as reviewed in detail by
Nesvizhskii [50]. The technique was pioneered by the PeptideProphet tool [51] using score
distribution modelling. Subsequently, the target/decoy approach [52], relying on the
inclusion of artificial, nonsensical sequences among the searched proteins, was rapidly
adopted in the field, providing more accurate error rate estimates [53]. These so-called
decoy sequences can easily be generated and appended to the original protein sequences in
SearchGUI when selecting the FASTA file in the search settings dialog.

Searching large datasets, or using a large search space containing, for example, different
species or accounting for multiple modifications, quickly becomes impractical on standard
desktop computers. Solutions have therefore been developed to speed up this process,
including the use of distributed computing [54], graphical processing units (GPUs) [55],
and the increasingly popular cloud computing [56–58]. Furthermore, by exploiting user
friendly platforms for biological data processing such as Galaxy [59–61], powerful data
analysis solutions are made available to every interested scientist.

Despite all this progress, database searching may not always be the method of choice for
identifying peptides. For example, if no sequence database is available for the species
under analysis, or if the search space cannot be reduced to a given species or a set of
cleavage rules, search engines will not be of much use. In such cases, related approaches
such as spectrum library searching [62] or de novo sequencing might be better alternatives
[63]. Notably, the latter allows for mutation tolerant identification of proteins [64] and
screening for unexpected modifications.

Acknowledgments K.V. acknowledges the support of Ghent University. L.M. acknowledges the
support of Ghent University (Multidisciplinary Research Partnership “Bioinformatics: from
nucleotides to networks”) and the IWT SBO grant ‘InSPECtor’ (120025). H.B. is supported by
the Research Council of Norway.

References

1. Mueller LN, Brusniak MY, Mani DR et al (2008) An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J Proteome Res 7:51–61
2. Vaudel M, Sickmann A, Martens L (2010) Peptide and protein quantification: a map of the minefield. Proteomics 10:650–670
3. Eng J, McCormack AL, Yates JR III (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5:976–989
4. Deutsch EW, Mendoza L, Shteynberg D et al (2010) A guided tour of the trans-proteomic pipeline. Proteomics 10:1150–1159
5. Sturm M, Bertsch A, Gropl C et al (2008) OpenMS – an open-source software framework for mass spectrometry. BMC Bioinf 9:163
6. Craig R, Beavis RC (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20:1466–1467
7. Tabb DL, Fernando CG, Chambers MC (2007) MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J Proteome Res 6:654–661
8. Dorfer V, Pichler P, Stranzl T et al (2014) MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. J Proteome Res 13:3679–3684
9. Kim S, Mischerikow N, Bandeira N et al (2010) The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Mol Cell Proteomics 9:2840–2852
10. Geer LY, Markey SP, Kowalak JA et al (2004) Open mass spectrometry search algorithm. J Proteome Res 3:958–964
11. Eng JK, Jahan TA, Hoopmann MR (2013) Comet: an open-source MS/MS sequence database search tool. Proteomics 13:22–24
12. Diament BJ, Noble WS (2011) Faster SEQUEST searching for peptide identification from tandem mass spectra. J Proteome Res 10:3871–3879
13. Vaudel M, Barsnes H, Berven FS et al (2011) SearchGUI: an open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 11:996–999
14. Vaudel M, Burkhart JM, Zahedi RP et al (2015) PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat Biotechnol 33:22–24
15. Shteynberg D, Nesvizhskii AI, Moritz RL et al (2013) Combining results of multiple search engines in proteomics. Mol Cell Proteomics 12:2383–2393
16. Vaudel M, Venne AS, Berven FS et al (2014) Shedding light on black boxes in protein identification. Proteomics 14:1001–1005
17. Mancuso F, Bunkenborg J, Wierer M et al (2012) Data extraction from proteomics raw data: an evaluation of
nine tandem MS tools using a large Orbitrap data set. J Proteome 75:5293–5303
18. Kessner D, Chambers M, Burke R et al (2008) ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 24:2534–2536
19. Kohlbacher O, Reinert K, Gropl C et al (2007) TOPP – the OpenMS proteomics pipeline. Bioinformatics 23:e191–e197
20. Colaert N, Degroeve S, Helsens K et al (2011) Analysis of the resolution limitations of peptide identification algorithms. J Proteome Res 10:5555–5561
21. Nesvizhskii AI, Aebersold R (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 4:1419–1440
22. Huala E, Dickerman AW, Garcia-Hernandez M et al (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res 29:102–105
23. Reddy TB, Riley R, Wymore F et al (2009) TB database: an integrated platform for tuberculosis research. Nucleic Acids Res 37:D499–D508
24. Apweiler R, Bairoch A, Wu CH et al (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115–D119
25. Flicek P, Amode MR, Barrell D et al (2014) Ensembl 2014. Nucleic Acids Res 42:D749–D755
26. Muth T, Benndorf D, Reichl U et al (2013) Searching for a needle in a stack of needles: challenges in metaproteomics data analysis. Mol BioSyst 9:578–585
27. Knudsen GM, Chalkley RJ (2011) The effect of using an inappropriate protein database for proteomic data analysis. PLoS One 6:e20873
28. Ghesquiere B, Helsens K, Vandekerckhove J et al (2011) A stringent approach to improve the quality of nitrotyrosine peptide identifications. Proteomics 11:1094–1098
29. Craig R, Cortens JP, Beavis RC (2004) Open source system for analyzing, validating, and storing protein identification data. J Proteome Res 3:1234–1242
30. Martens L, Hermjakob H (2007) Proteomics data validation: why all must provide data. Mol Biosyst 3:518–522
31. Barsnes H, Martens L (2013) Crowdsourcing in proteomics: public resources lead to better experiments. Amino Acids 44:1129–1137
32. Vaudel M, Sickmann A, Martens L (2014) Introduction to opportunities and pitfalls in functional mass spectrometry based proteomics. Biochim Biophys Acta 1844:12–20
33. Venne AS, Kollipara L, Zahedi RP (2014) The next level of complexity: crosstalk of posttranslational modifications. Proteomics 14:513–524
34. Olsen JV, Mann M (2013) Status of large-scale analysis of post-translational modifications by mass spectrometry. Mol Cell Proteomics 12:3444–3452
35. Pawson T, Scott JD (2005) Protein phosphorylation in signaling – 50 years and counting. Trends Biochem Sci 30:286–290
36. Loroch S, Dickhut C, Zahedi RP et al (2013) Phosphoproteomics – more than meets the eye. Electrophoresis 34:1483–1492
37. Aasebo E, Vaudel M, Mjaavatten O et al (2014) Performance of super-SILAC based quantitative proteomics for comparison of different acute myeloid leukemia (AML) cell lines. Proteomics 14:1971–1976
38. Barsnes H, Vaudel M, Colaert N et al (2011) Compomics-utilities: an open-source Java library for computational proteomics. BMC Bioinf 12:70
39. Vandermarliere E, Mueller M, Martens L (2013) Getting intimate with trypsin, the leading protease in proteomics. Mass Spectrom Rev 32:453–465
40. Burkhart JM, Schumbrutzki C, Wortelkamp S et al (2012) Systematic and quantitative comparison of digest efficiency and specificity reveals the impact of trypsin quality on MS-based proteomics. J Proteome 75:1454–1462
41. Siepen JA, Keevil EJ, Knight D et al (2007) Prediction of missed cleavage sites in tryptic peptides aids protein identification in proteomics. J Proteome Res 6:399–408
42. Lawless C, Hubbard SJ (2012) Prediction of missed proteolytic cleavages for the selection of surrogate peptides for quantitative proteomics. OMICS 16:449–456
43. Fannes T, Vandermarliere E, Schietgat L et al (2013) Predicting tryptic cleavage from proteomics data using decision tree ensembles. J Proteome Res 12:2253–2259
44. Kelchtermans P, Bittremieux W, De Grave K et al (2014) Machine learning applications in proteomics research: how the past can boost the future. Proteomics 14:353–366
45. Vaudel M, Burkhart JM, Sickmann A et al (2011) Peptide identification quality control. Proteomics 11:2105–2114
46. Beausoleil SA, Villen J, Gerber SA et al (2006) A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol 24:1285–1292
47. Roepstorff P, Fohlman J (1984) Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11:601
48. Thingholm TE, Palmisano G, Kjeldsen F et al (2010) Undesirable charge-enhancement of isobaric tagged phosphopeptides leads to reduced identification efficiency. J Proteome Res 9:4045–4052
49. Everett LJ, Bierl C, Master SR (2010) Unbiased statistical analysis for multi-stage proteomic search strategies. J Proteome Res 9:700–707
50. Nesvizhskii AI (2010) A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteome 73:2092–2123
51. Keller A, Nesvizhskii AI, Kolker E et al (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 74:5383–5392
52. Elias JE, Gygi SP (2010) Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol Biol 604:55–71
53. Ma K, Vitek O, Nesvizhskii AI (2012) A statistical model-building perspective to identification of MS/MS spectra with PeptideProphet. BMC Bioinf 13(Suppl 16):S1
54. Verheggen K, Barsnes H, Martens L (2014) Distributed computing and data storage in proteomics: many hands make light work, and a stronger memory. Proteomics 14:367–377
55. Baumgardner LA, Shanmugam AK, Lam H et al (2011) Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J Proteome Res 10:2882–2888
56. Trudgian DC, Mirzaei H (2012) Cloud CPFP: a shotgun proteomics data analysis pipeline using cloud and high performance computing. J Proteome Res 11:6282–6290
57. Muth T, Peters J, Blackburn J et al (2013) ProteoCloud: a full-featured open source proteomics cloud computing pipeline. J Proteome 88:104–108
58. Afgan E, Chapman B, Taylor J (2012) CloudMan as a platform for tool, data, and analysis distribution. BMC Bioinf 13:315
59. Giardine B, Riemer C, Hardison RC et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15:1451–1455
60. Boekel J, Chilton JM, Cooke IR et al (2015) Multi-omic data analysis using Galaxy. Nat Biotechnol 33:137–139
61. Goecks J, Nekrutenko A, Taylor J (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11:R86
62. Lam H (2011) Building and searching tandem mass spectral libraries for peptide identification. Mol Cell Proteomics 10(R111):008565
63. Allmer J (2011) Algorithms for the de novo sequencing of peptides from tandem mass spectra. Expert Rev Proteomics 8:645–657
64. Dasari S, Chambers MC, Slebos RJ et al (2010) TagRecon: high-throughput mutation identification through sequence tagging. J Proteome Res 9:1716–1726
65. Perkins DN, Pappin DJ, Creasy DM et al (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551–3567
66. Tanner S, Shu H, Frank A et al (2005) InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem 77:4626–4639
67. Park CY, Klammer AA, Kall L et al (2008) Rapid and accurate peptide identification from tandem mass spectra. J Proteome Res 7:3022–3027
68. Yadav AK, Kumar D, Dash D (2011) MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J Proteome Res 10:2154–2160
69. Cox J, Neuhauser N, Michalski A et al (2011) Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res 10:1794–1805
70. Bern M, Kil YJ, Becker C (2012) Byonic: advanced peptide and protein identification software. Curr Protoc Bioinf Chapter 13, Unit13 20
71. Zhang J, Xin L, Shan B et al (2012) PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol Cell Proteomics 11:M111 010587
72. Wenger CD, Coon JJ (2013) A proteomics search algorithm specifically designed for high-resolution tandem mass spectra. J Proteome Res 12:1377–1386
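
As a companion to Sects. 6.5 and 6.6, the following Python sketch illustrates how the cleavage rule, the number of allowed missed cleavages and a variable modification shape the search space. It is a simplified illustration and not the algorithm of any particular search engine: the function names and the toy sequence are hypothetical, the trypsin rule is the one stated in the text (cleavage after K or R, except before P), and the only variable modification considered is methionine oxidation, for which each methionine doubles the number of peptide forms. The last helper converts a relative tolerance in ppm into an absolute m/z window.

```python
def tryptic_peptides(protein, missed_cleavages=0):
    """In-silico digest: cleave after K or R unless the next residue is P."""
    # Positions right after each allowed cleavage site, plus both protein ends.
    cuts = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(protein))
    peptides = []
    for start in range(len(cuts) - 1):
        # A peptide with n missed cleavages spans n + 1 consecutive fragments.
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end < len(cuts):
                peptides.append(protein[cuts[start]:cuts[end]])
    return peptides


def variable_forms(peptide, residue="M"):
    """Number of peptide forms when every `residue` may or may not be modified."""
    return 2 ** peptide.count(residue)


def ppm_window(mz, tolerance_ppm):
    """Absolute m/z window (low, high) for a relative tolerance given in ppm."""
    delta = mz * tolerance_ppm / 1e6
    return mz - delta, mz + delta


protein = "MAGKPLRMMSKTEER"  # hypothetical toy sequence
for allowed in (0, 1, 2):
    peptides = tryptic_peptides(protein, allowed)
    forms = sum(variable_forms(p) for p in peptides)
    print(f"missed cleavages: {allowed}, peptides: {len(peptides)}, forms: {forms}")

low, high = ppm_window(500.0, 10)
print(f"10 ppm at m/z 500 spans {low:.4f} to {high:.4f}")
```

For this toy sequence the digest yields 3, 5 and 6 peptides for 0, 1 and 2 allowed missed cleavages, and variable oxidation raises the number of forms to be scored to 7, 19 and 27 respectively. Real search engines additionally restrict peptide length and often cap the number of variable modifications per peptide, which this sketch does not model.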
7 Mass Analyzers and Mass Spectrometers
Anthony M. Haag

Abstract
Mass spectrometers are comprised of three main components: an ion
source, a mass analyzer, and a detector. Ionization of the analyte occurs
in the ion source and the resulting ions are counted at the detector.
However, it is the mass analyzer that is responsible for determining the
mass-to-charge ratio (m/z) of the ions (Jennings KR, Dolnikowski GG,
Method Enzymol 193:37–61, 1990). Therefore, it is primarily the analyzer
that allows the mass spectrometer to serve its primary goal – determining
the mass of the analytes being measured. This becomes important in the
field of molecular biology, where biomolecules may be of low molecular
weight or often take on multiple charges (z) after ionization (Fenn JB,
Mann M, Meng CK, Wong SF, Whitehouse CM, Science 246:64–71,
1989). For this reason, the choice of analyzer is dependent on the
properties of the analyte after ionization and the requirements of the
experiment being performed.

Keywords
Mass spectrometer • Mass analyzer • Quadrupole mass analyzer • Ion trap
mass analyzer • Time-of-flight (TOF) mass analyzer • FT-ICR mass
analyzer • Orbitrap mass analyzer • Tandem mass analyzer • Triple
quadrupoles tandem mass analyzer • Q-TOF tandem mass analyzer •
TOF/TOF tandem mass analyzer • Product ion scan • Precursor ion
scan • Neutral loss scan • Selected reaction monitoring

A.M. Haag (*)
Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX 77030, USA
Texas Children’s Microbiome Center, Texas Children’s Hospital, Houston, TX 77030, USA
e-mail: Anthony.Haag@bcm.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical
Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_7

7.1 Introduction

Mass spectrometers are comprised of three main components: an ion source, a mass analyzer,
and a detector. Ionization of the analyte occurs in the ion source. The mass analyzer then
resolves ions based on their mass-to-charge ratio (m/z) [1]. Ions most often impact a
detector to produce a signal that is recorded. A mass spectrum is a plot of the relative
abundance of ions against their m/z. It is primarily the analyzer that allows the mass
spectrometer to serve its primary goal – determining the mass of the analyte being
measured. Because analyzers only measure the m/z of ions, some amount of mass spectral
interpretation is often required by the mass spectrometrist. This becomes important in the
field of molecular biology, where biomolecules often take on multiple charges (z) after
ionization [2]. For this reason, the m/z of a compound will often be a fraction of the
actual mass (m) of the ion.

There are many different analyzer designs available. Along with their ability to resolve
ions of different m/z, several analyzers are also capable of trapping and storing ions.
Thus, such analyzers can function in a multitude of roles. The most common types of
analyzers in commercial production include quadrupole, ion trap, time-of-flight (TOF), and
Fourier transform analyzers (ion cyclotron [ICR] and Orbitrap), along with numerous
combinations or hybrids of these analyzers. The choice of mass analyzer depends on a number
of factors and experimental considerations. Such factors may include but are not limited
to:

1. The desired m/z range to be analyzed
2. The mass of the analyte
3. The required resolving power of the analyzer
4. The ability of the analyzer to interface with the ion source of the mass spectrometer
5. The limit of detection required

Because there is no single mass analyzer that is suitable for all applications, most
laboratories will employ different mass spectrometers that utilize different analyzers.
The most commonly utilized mass analyzers are discussed below.

7.2 Quadrupole

The quadrupole mass analyzer continues to be one of the most popular types of mass
analyzer in use. Quadrupole mass analyzers are often employed in benchtop mass
spectrometers due to their low cost, compact design, durability and reliability. For these
reasons they have become the workhorse analyzer in the pharmaceutical industry. They are
often used in tandem with each other such as in triple quadrupole mass spectrometers or
with other mass analyzers such as time-of-flight (TOF) [3].

A quadrupole analyzer is essentially a mass filter, due to its ability to discriminate and
filter ions of different m/z [4]. Quadrupoles consist of four cylindrical or hyperbolic
rods in parallel with each other (Fig. 7.1). Rods opposite each other are electrically
connected together and a radio frequency (RF) potential is applied. A direct current (DC)
potential is then superimposed on the RF potential. The combination of RF and DC potential
causes ions to oscillate as they pass through the quadrupole in the z-direction. Depending
on the DC potential and frequency of the RF field, only ions of a particular m/z will have
stable trajectories. Those ions that have unstable trajectories will collide with the rods
and be filtered out. By varying the DC and RF potentials, ions of different m/z can be
scanned or “filtered” through the quadrupoles [5].

Fig. 7.1 A quadrupole mass analyzer consists of four metallic rods connected to both an RF
and DC field. Ions entering the quadrupole will oscillate as they pass through the field
between the quadrupole rods. Ions with stable oscillation trajectories will pass through
while those that are unstable will collide with the rods

A quadrupole or other multipole (hexapole or octupole) can also operate in an “RF-only”
mode, in which the DC potential is reduced and only an RF potential is applied to the rods.
This allows all ions to pass through the multipole, thereby transforming the quadrupole
analyzer into a device for transmitting ions from one
area of the mass spectrometer to another, such as moving ions from the ionization source
into another analyzer. Thus, RF-only multipoles can act as ion transmission guides within
a mass spectrometer where needed. RF-only multipoles can also act as collision cells for
performing collision-induced dissociation (CID) [6]. When an inert gas is introduced into
the collision cell and the RF-energy on the multipoles is increased, ions that are
transmitted through the collision cell will undergo fragmentation via CID. By varying the
RF-energy, the amount of ion fragmentation can be controlled.

As mentioned earlier, a major advantage of quadrupole mass spectrometers is their low cost
and compact shape and size, which make them ideal for most laboratories. They are made by
a variety of different manufacturers and have proven to be rugged and reliable for long
periods of time, and thus require little maintenance. They have excellent stability over
long periods of time, thereby reducing the need for repeated calibrations. Because
quadrupole analyzers have fast duty cycles and need a continuous flux of ions, they easily
interface with both gas chromatography (GC) and liquid chromatography (LC) equipment
[7, 8]. However, this makes quadrupole analyzers less suitable for pulsed ion sources such
as matrix assisted laser desorption/ionization (MALDI). Also, quadrupole analyzers suffer
from both limited mass ranges and poor resolution. This puts them at a disadvantage when
analyzing large molecular weight compounds that may not form multiply charged ions or
complex mixtures of compounds with similar masses.

7.3 Ion Trap

The ion trap mass analyzer is a modification of the quadrupole mass analyzer [9]. The 3D
ion trap, also known as a Paul trap, was the most common ion trap until the twenty-first
century [10, 11]. Recently, the 2D linear ion trap has become more popular because of its
numerous advantages over 3D traps in most commercially available equipment. The 3D traps
consist of two hyperbolic electrode plates facing each other and a hyperbolic ring
electrode placed in between them (Fig. 7.2). Using an oscillating RF field and a
superimposed DC electric field, similar to that in quadrupoles, ions are trapped between
the electrodes. In order to act as an analyzer, ions of different m/z are selectively
ejected from the trap by varying the RF potential. The ejected ions are then registered at
the detector. 2D traps, often referred to as linear traps, are equivalent to quadrupoles
but a potential field is applied to each end of the quadrupole in order to trap the ions
within the quadrupole itself. Ions can be selectively ejected either axially or radially
depending on the design of the 2D trap.

Fig. 7.2 In a 3D trap, ions enter a small opening in the endcap of one of the electrodes. An RF field is placed on the ring electrode, trapping the ions toward the center of the ion trap. The stability of ions within the trap is based on the RF frequency, the m/z of the ions, and the amplitude of the RF field. Ions may be selectively ejected at the opposite endcap electrode from which they entered by increasing the voltage of the RF field on the ring electrode
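
The dependence of ion stability on RF amplitude, RF frequency and m/z described for the 3D trap can be illustrated numerically. The sketch below is not taken from this chapter: it assumes the standard Mathieu parameter for an ideal 3D trap, q_z = 8zeV / (m (r0^2 + 2 z0^2) Ω^2), with ejection at the stability boundary q_z ≈ 0.908 when no DC is applied. The trap dimensions and drive frequency are illustrative defaults, and the function name `ejected_mz` is hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # unified atomic mass unit, kg
Q_EJECT = 0.908              # assumed Mathieu q_z at the stability boundary (no DC)


def ejected_mz(v_rf, r0=1.0e-2, z0=0.707e-2, f_rf=1.1e6):
    """Return the m/z (Da per charge) that reaches the stability boundary.

    v_rf is the zero-to-peak RF amplitude on the ring electrode in volts.
    r0 and z0 (metres) and f_rf (Hz) are illustrative, assumed values.
    """
    omega = 2.0 * math.pi * f_rf
    # Solve q_z = 8 z e V / (m (r0^2 + 2 z0^2) omega^2) for m / (z e), in kg/C
    mass_per_charge = 8.0 * v_rf / (Q_EJECT * (r0**2 + 2.0 * z0**2) * omega**2)
    return mass_per_charge * E_CHARGE / DALTON


# Ramping the RF amplitude ejects ions in order of increasing m/z
for volts in (1000.0, 3000.0, 5000.0, 7000.0):
    print(f"{volts:6.0f} V ejects ions up to m/z {ejected_mz(volts):6.1f}")
```

Under these assumptions the ejected m/z grows linearly with the RF amplitude, which is why ramping the voltage on the ring electrode produces a mass scan.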
160 A.M. Haag

Because ion traps have the ability to accumulate ions over time, mass spectrometers that utilize them are known for their improved sensitivity. Much like their quadrupole counterparts, ion trap analyzers have the advantage of a small and compact size, which keeps the mass spectrometers that use them affordable. For this reason, they have played a major role in expanding the field of proteomics. Much of the early development in the identification of proteins in complex mixtures was performed on mass spectrometers utilizing ion trap analyzers [12, 13].

One of the biggest disadvantages of ion trap analyzers is their low resolving power. Even at slow scan speeds, ion trap analyzers (particularly 3D models) have only unit mass resolution. With the advancement of other types of analyzers that have faster speed, better mass accuracy and superior resolution, there has been a shift away from performing proteomic analysis on mass spectrometers using only ion trap analyzers. However, due to their geometry, 2D ion trap analyzers still find widespread use in hybrid instruments, particularly those that utilize them as a precursor mass filter when performing tandem MS.

7.4 Time-of-Flight

Although time-of-flight (TOF) analyzers have been around for some time, it has been the advent of MALDI ionization (which allowed for the easy analysis of large biomolecules) that propelled TOF analyzers to the forefront [14, 15]. TOF analyzers are the easiest to conceptualize, as illustrated in Fig. 7.3. In its simplest form, a TOF analyzer consists primarily of a flight tube and an acceleration grid that acts to accelerate a “packet” of ions from the ionization source to the MS detector [16]. Essentially, if two ions of different m/z are accelerated from the ion source with the same kinetic energy and allowed to drift through a field free region of the flight tube, then their arrival times at the detector will be different.

The equation for the kinetic energy of any mass is

$$E_k = \frac{1}{2} m v^2$$

wherein $E_k$ is the kinetic energy of the ion after acceleration, m is the mass of the ion and v is its velocity. Ion velocity remains constant after acceleration as it moves through the field free

Fig. 7.3 Ions generated by the ion source are accelerated by placing a pulsed electric potential on the acceleration grid. The accelerated ions then drift through a field free region of a flight tube where they are separated based on their m/z. The greater the ion mass, the slower the drift through the flight tube. When an ion hits the detector, the mass spectrometer determines the time it took for the ion to drift through the flight tube. The drift time through the flight tube is proportional to the square root of the m/z of the ion
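
The separation shown in Fig. 7.3 can be sketched numerically. The example below assumes that every ion is accelerated through the same potential U, so that its kinetic energy is E_k = zeU, and that it then drifts a distance d at constant velocity. The accelerating voltage (20 kV) and drift length (1 m) are illustrative values, not taken from this chapter, and the function names are hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # unified atomic mass unit, kg


def flight_time(mz, accel_voltage=20_000.0, drift_length=1.0):
    """Drift time in seconds for an ion of the given m/z (Da per charge)."""
    # E_k = z*e*U and m = (m/z)*z*u, so the charge z cancels in m / (2*E_k)
    mass_over_2ek = mz * DALTON / (2.0 * E_CHARGE * accel_voltage)
    return drift_length * math.sqrt(mass_over_2ek)


def mz_from_time(t, accel_voltage=20_000.0, drift_length=1.0):
    """Recover m/z from a measured drift time (mass = 2*E_k*t^2 / d^2, per charge)."""
    return 2.0 * E_CHARGE * accel_voltage * t**2 / (drift_length**2 * DALTON)


for mz in (500.0, 1000.0, 2000.0):
    print(f"m/z {mz:7.1f} -> {flight_time(mz) * 1e6:6.2f} microseconds")
```

Quadrupling the m/z doubles the drift time, which is the square-root dependence that a TOF calibration relies on.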

region of the flight tube. Because this velocity remains constant, it is given by

$$v = \frac{d}{t}$$

wherein d is the distance the ion travels and t is the time it takes for the ion to travel from its acceleration point to the detector. Substituting v into the kinetic energy equation results in

$$E_k = \frac{1}{2} m \left(\frac{d}{t}\right)^2$$

Solving for the mass of the ion yields

$$m = \frac{2 E_k t^2}{d^2}$$

Because the initial kinetic energy ($E_k$) of the ions and the length of the flight tube (d) remain constant, mass is strictly a function of the time it takes for the ions to be detected after initial acceleration (time-of-flight).

TOF analyzers today also employ a reflectron, which reflects the ion path back in the direction of the ion source before the ions are detected [17]. This corrects for the small differences in initial kinetic energies of the ions that may occur during acceleration [18]. Other methods such as delayed extraction are also employed in order to increase resolution [19, 20]. After ions are formed, delayed extraction introduces a small delay (usually on the order of a few hundred nanoseconds) in the electric pulse of the acceleration grid before the ions are accelerated. This small delay allows the ions formed after ionization to equilibrate and have a more uniform average momentum before acceleration.

Due to the high ion transmission efficiencies of TOF analyzers, they can achieve the widest mass range of all mass analyzers. TOF analyzers allow for the separation of ions with masses of only a few daltons to well over 100 kDa [21]. This makes them the analyzer of choice for observing singly charged high mass biomolecules such as proteins [22]. Because of their ability to simultaneously measure the masses of many peptides, TOF analyzers have been the most popular analyzer for performing peptide mass fingerprinting [23, 24]. Although new tandem analyzer configurations have allowed TOF analyzers to be interfaced to ion sources that provide a continuous flux of ions, they were initially employed only with pulsed ion sources such as MALDI.

7.5 FT-ICR

FT-ICR analyzers determine m/z by measuring the cyclotron frequency of ions in a fixed magnetic field [25]. Ions are first introduced into a Penning trap, a device similar to a 3D ion trap but using a magnetic field to trap ions rather than an electric field. The ions are injected into the magnetic field from the source as a “packet”. The ions then experience a Lorentz force, which causes them to assume a circular motion in a plane perpendicular to the magnetic field (Fig. 7.4). The angular frequency, also known as the cyclotron frequency, is described by the equation

$$\omega_c = \frac{qB}{m}$$

where $\omega_c$ is the angular frequency of the ions in radians per second, m is the mass of the ion, q its charge and B is the strength of the magnetic field. However, because the ions are not in phase when initially introduced into the trap and typically have very small orbits, it is impossible to detect them. In order to detect these ions, they must be coherently excited to a larger radius within their plane of motion. This is accomplished by exciting the ions with a limited frequency sweep of a broadband RF field [26]. This excitation coherently places the ions in a higher cyclotron orbit, which allows them to be detected. As the ions are detected over time by the receiver plates, their signal intensity is digitized with respect to time and converted to a frequency spectrum via a Fourier transform. The cyclotron frequencies of the ions are inversely proportional to their m/z.

One of the biggest advantages of FT-ICR analyzers is their very high mass accuracy and

Fig. 7.4 Ions are injected into a magnetic field, in which they undergo a small cyclotron motion perpendicular to the magnetic field. A brief broadband RF pulse excites the ions into a larger and coherent cyclotron orbit. The circular motion of the ions in the magnetic field is detected by the receiver plates and a Fourier transform converts the signal to a frequency spectrum. The angular frequency of the ions is determined by their m/z
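
As a rough numerical illustration of the cyclotron relation ω_c = qB/m, the sketch below converts an m/z and a magnetic field strength into the frequency that the receiver plates would record. The function name and the chosen m/z and field values are illustrative assumptions, not taken from this chapter.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # unified atomic mass unit, kg


def cyclotron_frequency_hz(mz, b_field):
    """Cyclotron frequency in Hz for an ion of m/z (Da per charge) in a field of b_field tesla."""
    # omega_c = q*B/m; with q = z*e and m = (m/z)*z*u the charge state z cancels
    omega_c = E_CHARGE * b_field / (mz * DALTON)   # rad/s
    return omega_c / (2.0 * math.pi)


for mz in (500.0, 1000.0, 2000.0):
    print(f"m/z {mz:6.1f} at 7 T -> {cyclotron_frequency_hz(mz, 7.0) / 1e3:7.1f} kHz")
```

The frequency falls as m/z rises and scales linearly with the field, which is one way to see why stronger magnets improve FT-ICR performance.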

resolving power. A resolution of one million has been reported on instruments with magnetic field strengths as low as 1 T [27]. All aspects of FT-ICR improve with higher magnetic field: increased resolution, increased mass accuracy, increased number of ions that can be put in the cell, decreased ion coalescence, etc. Most commercially available FT-ICR analyzers operate at magnetic field strengths between 7 and 12 T. This high resolution and mass accuracy are very useful when determining the elemental composition of small molecules based on their “mass defect” [28]. For example, two compounds, one with the empirical formula C6H12 and the other with the empirical formula C5H10N, both appear to have the same mass of 84 Da. However, when calculating their mass with very high precision, C6H12 has an exact mass of 84.09389 Da and C5H10N has an exact mass of 84.08131 Da, a difference of 0.01258 Da. This is due to slight differences in the binding energies in the nuclei of the carbon and nitrogen atoms, thus causing a slight shift in their atomic mass. Unlike many other analyzers with lower resolving power, FT-ICR analyzers have the ability to obtain empirical formulas directly from mass data.

Another major application for FT-ICR is in the field of proteomics where high mass accuracy is often required. In order to maintain isotopic resolution of large molecular weight ions with multiple charge states, very high resolution must be employed. For example, a 50 kDa protein, regardless of charge state, would require an analyzer resolution of 50,000 in order to observe isotopic peaks. In order to perform top-down sequencing of proteins, it is preferable to have isotopic resolution of the protein and its MS/MS products. Because of the resolving power of FT-ICR, entire large proteins can be sequenced and identified when performing tandem MS. Post-translational modifications within an isolated protein can also be identified without having to first perform chemical or enzymatic cleavage of the isolated protein as required in bottom-up approaches [29].

Due to the need for very strong magnetic fields, FT-ICR analyzers require the use of large superconducting magnets. This introduces two major problems. First, large magnet sizes require large amounts of lab space to be available. This may also include the need for high laboratory ceilings in order to perform maintenance. Second, superconducting magnets require liquid helium as a coolant in order for them to operate. The cost of liquid helium is high and often beyond the budget of many small laboratories. The initial cost of most FT-ICR instruments is also very high.
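
The mass defect example above can be checked with a few lines of arithmetic. The sketch below sums monoisotopic atomic masses for the two formulas and estimates the resolving power (m/Δm) needed to separate them. The atomic masses are standard values rounded to seven or eight significant figures, so the last digit of each result may differ slightly from the figures quoted in the text. The helper name `exact_mass` is hypothetical.

```python
# Monoisotopic masses in Da (12C is exactly 12 by definition)
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740}


def exact_mass(composition):
    """Sum monoisotopic masses for a composition given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in composition.items())


hydrocarbon = exact_mass({"C": 6, "H": 12})       # C6H12
amine = exact_mass({"C": 5, "H": 10, "N": 1})     # C5H10N
delta = hydrocarbon - amine

# Resolving power needed to distinguish the two species, taken as m / delta_m
required_resolving_power = hydrocarbon / delta

print(f"C6H12  : {hydrocarbon:.5f} Da")
print(f"C5H10N : {amine:.5f} Da")
print(f"difference : {delta:.5f} Da")
print(f"resolving power needed : about {required_resolving_power:.0f}")
```

A resolving power of a few thousand is enough for this pair at 84 Da. The requirement grows with mass, since the same absolute mass difference becomes a smaller fraction of a heavier ion.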

Mass spectrometers that utilize FT-ICR also suffer from slow scan speeds compared to other analyzers such as time-of-flight. This makes them impractical for many LC-tandem MS experiments, such as Multi-dimensional Protein Identification Technology (MudPIT), where many different co-eluting peptides need to be analyzed at very high scan rates in order to collect as much tandem MS data as possible.

7.6 Orbitrap

Similar to the FT-ICR analyzer, the orbitrap is also a type of analyzer that makes use of a Fourier transform to convert a signal, produced by ions oscillating in a trap, from the time domain to a frequency domain [30, 31]. Unlike FT-ICR analyzers, which use a magnetic field to induce oscillation in the ions, orbitrap analyzers use an electric field to induce these oscillations [32].

The orbitrap mass analyzer is composed of three main parts: an inner spindle electrode covered by two hollow outer concave electrodes facing each other. The two outer electrodes are separated by a thin ring of dielectric material (Fig. 7.5). A voltage potential is applied between the inner and outer electrodes, creating a linear electric field between them. Ions are introduced tangentially into the orbitrap as a “packet” between the inner and outer electrodes through a hole machined into one of the outer electrodes. Due to the electric field between the inner and outer electrodes, the ion packet is bent towards the inner electrode while the tangential velocity of the ions creates an opposing centrifugal force. At a specific potential between the inner and outer electrodes, the ions remain in a spiral path around the inner electrode. However, due to the conical shape of the electrodes, a harmonic axial oscillation in the ions is induced. The outer electrodes also act as receiver plates that detect the back and forth axial harmonic motion of the ions. This image signal is digitized and transformed from the time domain to the frequency domain. As in FT-ICR, the measured frequencies are determined by the m/z of the ions; in the orbitrap, the axial frequency is inversely proportional to the square root of the m/z.

One of the major advantages of the orbitrap analyzer is its high resolving power, resulting in its use as a replacement for FT-ICR analyzers for many applications, particularly those involving proteomics. In general, FT-ICR analyzers are superior to orbitraps in the low molecular weight range, thus making them ideal for low mass

Fig. 7.5 In an orbitrap analyzer, ions enter through an opening in one of the outer electrodes. The entry of the ions is tangential to the inner electrode. At a particular potential between the inner and outer electrodes, the ions will continuously spin around the inner electrode. Ions will oscillate back and forth along the axis of the inner electrode. This oscillation is detected and transformed via a Fourier transform to obtain a mass spectrum
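
This section compares how resolving power falls off with m/z in the two Fourier transform analyzers: inversely with m/z for FT-ICR and inversely with the square root of m/z for the orbitrap. The sketch below works through that scaling. The reference resolving powers (400,000 and 100,000 at m/z 400) are hypothetical numbers chosen only to show where the two curves cross, and do not describe any particular instrument.

```python
import math


def fticr_resolving_power(mz, r_ref, mz_ref=400.0):
    """FT-ICR resolving power, assumed to scale as 1 / (m/z) from a reference point."""
    return r_ref * mz_ref / mz


def orbitrap_resolving_power(mz, r_ref, mz_ref=400.0):
    """Orbitrap resolving power, assumed to scale as 1 / sqrt(m/z) from a reference point."""
    return r_ref * math.sqrt(mz_ref / mz)


def crossover_mz(r_fticr_ref, r_orbitrap_ref, mz_ref=400.0):
    """m/z above which the orbitrap curve lies above the FT-ICR curve."""
    return mz_ref * (r_fticr_ref / r_orbitrap_ref) ** 2


# Hypothetical reference values at m/z 400, for illustration only
R_FTICR, R_ORBITRAP = 400_000.0, 100_000.0
print(f"crossover at m/z {crossover_mz(R_FTICR, R_ORBITRAP):.0f}")
for mz in (400.0, 2000.0, 10000.0):
    print(mz, round(fticr_resolving_power(mz, R_FTICR)),
          round(orbitrap_resolving_power(mz, R_ORBITRAP)))
```

Where the crossover falls depends entirely on the reference values assumed, but the different exponents mean that the orbitrap curve eventually overtakes the FT-ICR curve at sufficiently high m/z.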

compounds. However, there is a fast decrease in the resolving power of FT-ICR analyzers at higher m/z. The resolving power of FT-ICR analyzers is inversely proportional to the m/z being measured. With orbitrap analyzers, the resolving power is inversely proportional to the square root of the m/z being measured. Therefore, orbitrap analyzers often have better resolving power than FT-ICR analyzers at higher m/z [33]. This property can give the orbitrap an advantage when analyzing high molecular weight compounds such as proteins. Recently, there has been a move from bottom-up proteomic analysis to top-down analysis. Because top-down analysis requires very high resolving power, it was limited to FT-ICR analyzers and beyond the affordability of most MS labs. Orbitrap analyzers have been instrumental in overcoming this difficulty and have therefore pushed the advancement of top-down proteomics.

There are also a number of other advantages to orbitrap analyzers. Unlike instruments utilizing FT-ICR, with their large size and operating costs, orbitrap instruments are much smaller and require very little maintenance. Orbitrap analyzers also do not use magnetic fields to operate, and therefore cryogenic refrigerants such as liquid helium are not necessary and operating costs are kept low. Although counterintuitive, the resolving power of the orbitrap analyzer is increased by the decrease in size of the analyzer. The main limitation to improved orbitrap design has been the tolerances needed in the machining process during manufacture. As machining processes improve, smaller orbitrap analyzers will no doubt continue to decrease the overall size of mass spectrometers that utilize them. Their smaller design will also allow for faster acquisition rates and higher resolution.

Improvements in orbitrap analyzer design will continue to provide faster scan speeds and duty cycles. However, even with major improvements expected in the future, they will continue to be slower than TOF analyzers. This makes orbitrap analyzers potentially less ideal for performing MudPIT experiments where fast acquisition rates may outweigh the need for very high mass resolution or accuracy. Orbitrap analyzers are also very prone to space-charge effects, and therefore the number of ions entering the analyzer must be monitored by the MS software so that only a limited number of ions is trapped.

7.7 Tandem Mass Analyzers

Mass spectrometers that utilize two or more mass analyzers consecutively are known as tandem mass spectrometers [34, 35]. Tandem MS analysis is the process by which the first analyzer is used to select ions of a particular m/z value, those ions are subjected to CID (as described above for RF-only multipoles), and the resulting product ions are then analyzed using a second mass analyzer. CID is also sometimes referred to as collision-activated dissociation (CAD) and is a process by which ions are fragmented by colliding them with chemically inert gas (typically argon or nitrogen) at low pressure (~10⁻⁵ torr). The fragmentation occurs because some of the kinetic energy from the collision of the analyte ion with inert gas atoms is converted to internal energy of the ions, thus resulting in bond breakage of the analyte ion molecules [36]. The product ions formed from CID often provide information about the structure of the analyte molecules.

7.7.1 Triple Quadrupoles

Triple quadrupole mass spectrometers are one of the most commonly sold types of mass spectrometer and are one of the best examples of using analyzers in tandem [37]. In a triple quadrupole mass spectrometer, three sets of quadrupole analyzers are used in sequence (Fig. 7.6). The first analyzer is often referred to as Q1 and can scan across a range of m/z values or selectively filter ions of a selected m/z. Those ions that pass through Q1 then enter a second set of quadrupoles that are referred to as Q2. Unlike a quadrupole that operates as an analyzer, Q2 is used exclusively as a collision cell to fragment the selected ions from Q1. The product ions

Fig. 7.6 A triple quadrupole mass spectrometer consists of three quadrupole analyzers used in tandem. Ions enter the first quadrupole, Q1. Ions that pass through Q1 enter into quadrupole Q2, where ions undergo CID. The resulting fragment ions are then filtered through the final quadrupole, Q3

Table 7.1 Table of different scan modes of triple quadrupole mass analyzers
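
The scan modes of a triple quadrupole differ only in whether Q1 and Q3 are held fixed or scanned, with Q2 acting as the collision cell in every case. The sketch below encodes that summary as a small lookup table and adds a helper for the Q3 offset used in a neutral loss scan. The names `SCAN_MODES` and `neutral_loss_q3` are hypothetical, and the helper assumes that the product ion keeps the charge of its precursor.

```python
# Q1/Q3 behavior for each scan mode; Q2 is always the CID collision cell
SCAN_MODES = {
    "product ion scan": {"Q1": "fixed", "Q2": "CID", "Q3": "scan"},
    "precursor ion scan": {"Q1": "scan", "Q2": "CID", "Q3": "fixed"},
    "neutral loss scan": {"Q1": "scan", "Q2": "CID", "Q3": "scan, offset from Q1"},
    "selected reaction monitoring": {"Q1": "fixed", "Q2": "CID", "Q3": "fixed"},
}


def neutral_loss_q3(q1_mz, neutral_mass, charge=1):
    """m/z that Q3 must transmit while Q1 transmits q1_mz.

    Assumes the fragment retains the precursor charge, so the neutral
    mass appears as neutral_mass / charge on the m/z axis.
    """
    return q1_mz - neutral_mass / charge


for mode, setup in SCAN_MODES.items():
    print(f"{mode:30s} Q1={setup['Q1']:6s} Q3={setup['Q3']}")

# Loss of phosphoric acid (nominal 98 Da) from a doubly charged peptide at m/z 600
print(neutral_loss_q3(600.0, 98.0, charge=2))
```

For a doubly charged precursor the 98 Da loss corresponds to an offset of 49 on the m/z axis, which is why the charge state has to be considered when setting up a neutral loss scan.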

formed in this process can then be either scanned through the final set of quadrupoles, Q3, to obtain a mass spectrum or Q3 can be fixed in order to monitor a particular ion. The combination of fixed or scanning modes of Q1 and Q3 determines the type of scan performed [38, 39]. The most common scan modes are described below and in Table 7.1.

Product ion scan – In a product ion scan, Q1 remains fixed such that only ions of a selected m/z are filtered through the quadrupole. These ions are then fragmented via CID in Q2. The resulting product ions are then scanned and analyzed in Q3. Once the product ions are recorded, Q1 can then fix on a new m/z and the process is repeated. This technique is often used in order to determine structural information of specific analytes. For example, in a bottom-up proteomics approach, many peptides eluting off a chromatographic column can be sequenced.

Precursor ion scan – In a precursor ion scan, Q1 is scanned across the entire m/z range of the analyzer. The precursor ions subsequently pass through Q2 for CID. However, Q3 is kept fixed such that only product ions of a specific m/z are filtered through the quadrupole. The resulting spectrum is plotted as the intensity of the ions exiting Q3 with respect to the m/z value that they originated from in Q1. In other words, precursor ion scanning allows one to determine the m/z of all precursor ions that have the same product ion. This is valuable in proteomics when one wants to identify all peptides that may have the same functional

group. For example, performing a precursor ion scan at m/z = 216, a signature immonium ion for phosphotyrosine, allows one to selectively identify peptides that may contain phosphotyrosine.

Neutral loss scan – A neutral loss scan is a technique to track ions before and after the loss of a neutral group. Both Q1 and Q3 are scanned simultaneously over the entire m/z range but with Q3 offset from Q1 by an amount that corresponds to the loss of a neutral fragment from the ion. Using this method, all precursors that undergo the loss of the same neutral fragment can be monitored. Similar to precursor ion scanning, this technique can be a powerful tool for quickly and selectively identifying peptides that are post-translationally modified, such as those that have been phosphorylated. An example is the identification of peptides with phosphorylated serine or threonine. Performing low energy CID on peptides that are phosphorylated will often result in the loss of phosphoric acid (H3PO4, 98 Da) from the parent ion.

Selected reaction monitoring – Selected reaction monitoring (SRM), sometimes referred to as multiple reaction monitoring (MRM), is a popular scanning technique for the quantification of compounds in a mixture. Q1 is fixed to allow only precursors of a particular m/z to filter through the quadrupole. CID then occurs in Q2 and all fragments are sent to Q3. Q3 is fixed to only allow product ions of specific m/z to filter through. Thus, specific signature fragment ions originating from a compound of known mass can be monitored. This technique essentially allows for a single known compound to be monitored in real time. One caveat is that although the mass and potential m/z values for Q1 can be easily determined for a compound of interest, the m/z values of the product ions of that compound must be known prior to designing the SRM experiment. This can be solved by first performing a product ion scan of the compound of interest and recording the m/z of all product ions.

7.7.2 Q-TOF

Quadrupole analyzers operate most efficiently when there is a continuous stream of ions from the ion source. However, TOF analyzers require a pulse or packet of ions. In order for the two analyzers to work in tandem, the TOF analyzer is placed in an orthogonal configuration after the quadrupole analyzer [40]. This configuration allows ions that are filtered through the quadrupole to be injected orthogonally into the TOF analyzer as a packet using a set of pusher and puller plates between the two analyzers (Fig. 7.7) [41].

Some of the biggest advantages of Q-TOF tandem analyzers are their higher mass accuracy, higher resolution and increased scan speed as compared to triple quadrupole mass analyzers, and thus their ability to easily interface to liquid chromatography and perform very fast tandem MS. This allows many spectra to be acquired when there are many co-eluting compounds in a chromatographic run. Although the resolution of the data is not of the same quality as that obtained with an orbitrap or FT-ICR analyzer, it is far superior to that obtained by standard quadrupoles or ion traps.

7.7.3 TOF/TOF

Time-of-flight/time-of-flight (TOF/TOF) is a method by which two TOF analyzers are used in tandem and CID is performed between the two TOF analyzers (Fig. 7.8) [42]. This allows one to perform tandem MS on biological compounds such as peptides and oligonucleotides that are often ionized by methods such as MALDI [43]. Because of the speed at which TOF analyzers operate, sample analysis at both the MS and MS/MS level can be performed very rapidly.

In order to perform tandem MS in a TOF/TOF analyzer, very fast electronic switching must occur in a series of steps. First, ions of different m/z are separated through the flight tube based on their velocity. Second, ions of a particular m/z are

Fig. 7.7 The QTOF analyzer is a hybrid of a triple quadrupole analyzer and a time-of-flight analyzer. It is analogous to a triple quadrupole system, with the exception that the last quadrupole is replaced by a time-of-flight analyzer

Fig. 7.8 TOF/TOF analyzers combine two TOF analyzers in tandem. Ions are accelerated in the first TOF. A timed ion selector allows ions of a particular m/z to pass. The selected ions are then decelerated before passing into a collision cell where they undergo CID. The resulting fragment ions are re-accelerated into the second TOF and their time-of-flight is measured to obtain a mass spectrum
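
The timed ion selector in Fig. 7.8 passes only ions that reach the gate within a short time window. The sketch below estimates that window from the same field-free drift relation used for a single TOF analyzer. The accelerating voltage, the distance to the gate and the ±5 m/z selection width are illustrative assumptions, and the function names are hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # unified atomic mass unit, kg


def arrival_time(mz, accel_voltage=20_000.0, distance=0.5):
    """Time in seconds for an ion of m/z (Da per charge) to drift `distance` metres."""
    return distance * math.sqrt(mz * DALTON / (2.0 * E_CHARGE * accel_voltage))


def gate_window(precursor_mz, half_width_mz=5.0):
    """Open and close times of the gate for the chosen precursor m/z range."""
    return (arrival_time(precursor_mz - half_width_mz),
            arrival_time(precursor_mz + half_width_mz))


def passes_gate(mz, window):
    """True if an ion of this m/z reaches the gate while it is open."""
    t_open, t_close = window
    return t_open <= arrival_time(mz) <= t_close


window = gate_window(1000.0)
print(f"gate open for {(window[1] - window[0]) * 1e9:.0f} ns")
for mz in (980.0, 1000.0, 1020.0):
    print(mz, passes_gate(mz, window))
```

With these assumed values the gate is open for only a few tens of nanoseconds, which illustrates why very fast electronic switching is needed for precursor selection.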

selected while filtering out all others. This precursor ion selection is often performed using a Bradbury-Nielsen gate, which is essentially a timed ion selector (TIS) that filters ions based on their arrival time at the gate [44]. Third, the selected precursor ions are then passed to a set of ion optics that decelerates them to a much slower velocity. Fourth, the ions then pass through a collision cell for CID. Finally, the product ions formed are re-accelerated into a second flight tube and analyzed. The fast analysis of this analyzer combination, combined with MALDI ion sources, makes it ideal for the analysis of peptides from tryptic digests.

7.7.4 Other Tandem Analyzer Combinations

There are other combinations of mass analyzers which are far too numerous to list. In principle, the combination of mass analyzers, regardless of

their type, allows the mass spectrometrist to perform tandem MS. The choice of combination is dependent on many factors such as resolution, acquisition speed (duty cycle), mass accuracy, etc. For example, if high resolution of a product ion is required but not that of its precursor, the first analyzer may be a quadrupole or ion trap and the second analyzer an orbitrap or an FT-ICR. Newer instruments have a multitude of different analyzers that may be utilized in a number of different configurations. As newer combinations of analyzers continue to appear, the variety of tandem MS methods will also continue to grow.

7.8 Other Analyzers

Although there are a number of other types of analyzers, those that have been described herein comprise the majority of analyzers currently used in mass spectrometers. There have been many other analyzers that were once popular but have been overtaken by the current selection of analyzers for a multitude of reasons. For example, magnetic sector analyzers were one of the first analyzers used in mass spectrometry. They can have high resolution (~200,000), good stability, and significant mass accuracy, but unfortunately suffer from their large size, low resolution for precursor ion selectivity, and slow scan speeds. For these reasons magnetic sector instruments have been less ideal for interfacing to LC. Other analyzers have found a niche market for a number of reasons. The QTRAP analyzer allows a triple-quad mass spectrometer to act as a quadrupole and linear ion trap tandem mass spectrometer. Although there are a number of advantages to this type of analyzer, the demand has not propelled it to the point that it has become one of the primary analyzers used in proteomics.

There is no doubt that newer analyzers will be developed along with improvements in current ones. These advancements will continue to push the limits of current mass spectrometry. Because of the complex nature of the proteomics field, the necessity for many different avenues of approach to problem solving by mass spectrometry will undoubtedly continue to grow.

References

1. Jennings KR, Dolnikowski GG (1990) Mass analyzers. Methods Enzymol 193:37–61
2. Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM (1989) Electrospray ionization for mass spectrometry of large biomolecules. Science 246:64–71
3. Chernushevich IV, Ens W, Standing KG (1999) Orthogonal-injection TOFMS for analyzing biomolecules. Anal Chem 71:452A–461A
4. Miller PE, Denton MB (1986) The quadrupole mass filter: basic operating concepts. J Chem Educ 63:617–622
5. March RE (1997) An introduction to quadrupole ion trap mass spectrometry. J Mass Spectrom 32:351–369
6. Hayes RN, Gross ML (1990) Collision-induced dissociation. Methods Enzymol 193:237–263
7. Arpino PJ, Guiochon G (1979) LC/MS coupling. Anal Chem 51(7):692A–697A
8. Blakely CR, Vestal ML (1983) Thermospray interface for liquid chromatography/mass spectrometry. Anal Chem 55:750–754
9. Wong PSH, Cooks RG (1997) Ion trap mass spectrometry. Currentseparations.com 16(3)
10. Paul W, Steinwedel H (1953) Ein neues Massenspektrometer ohne Magnetfeld. Zeitschrift für Naturforschung A 8(7):448–450
11. Stafford GC, Kelley PE, Syka JEP, Reynolds WE, Todd JFJ (1984) Recent improvements in and applications of advanced ion trap technology. Int J Mass Spectrom Ion Process 60(1):85–98
12. Tong W, Link A, Eng JK, Yates JR (1999) Identification of proteins in complexes by solid-phase microextraction/multistep elution/capillary electrophoresis/tandem mass spectrometry. Anal Chem 71:2270–2278
13. Link AJ, Eng J, Schieltz DM, Carmack E, Mize GJ, Morris DR, Garvik BM, Yates JR (1999) Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol 17:676–682
14. Karas M, Bachmann D, Bahr U, Hillenkamp F (1987) Matrix-assisted ultraviolet laser desorption of non-volatile compounds. Int J Mass Spectrom Ion Process 78:53–68
15. Juhasz P, Roskey MT, Smirnov IP, Haff LA, Vestal ML, Martin SA (1996) Applications of delayed extraction matrix-assisted laser desorption ionization time-of-flight mass spectrometry to oligonucleotide analysis. Anal Chem 68:941–946
16. Mamyrin BA (2001) Time-of-flight mass spectrometry (concepts, achievements, and prospects). Int J Mass Spectrom 206:251–266
17. Cotter RJ (1999) The new time-of-flight mass spectrometry. Anal Chem 71:445A–451A

18. Mamyrin BA, Karataev VI, Shmikk DV, Zagulin VA (1973) The mass reflectron, a new nonmagnetic time-of-flight mass spectrometer with high resolution. Sov Phys – JETP 64:82–89
19. Vestal ML, Juhasz P, Martin SA (1995) Delayed extraction matrix-assisted laser desorption time-of-flight mass spectrometry. Rapid Commun Mass Spectrom 9(11):1044–1050
20. Juhasz P, Vestal ML, Martin SA (1997) On the initial velocity of ions generated by matrix-assisted laser desorption ionization and its effect on the calibration of delayed extraction time-of-flight mass spectra. J Am Soc Mass Spectrom 8:209–217
21. Karas M, Bahr U (1990) Laser desorption ionization mass spectrometry of large biomolecules. Trends Anal Chem 9(10):321–325
22. Hillenkamp F, Karas M (1990) Mass spectrometry of peptides and proteins by matrix assisted ultraviolet laser desorption/ionization. Methods Enzymol 193:280–295
23. Pappin DJC, Hojrup P, Bleasby AJ (1993) Rapid identification of proteins by peptide-mass fingerprinting. Curr Biol 3:327–332
24. Thiede B, Höhenwarter W, Krah A, Mattow J, Schmid M, Schmidt F, Jungblut PR (2005) Peptide mass fingerprinting. Methods 35:237–247
25. Comisarow MB, Marshall AG (1974) Fourier transform ion cyclotron resonance spectroscopy. Chem Phys Lett 25:282–283
26. Comisarow MB, Marshall AG (1974) Frequency-sweep Fourier transform ion cyclotron resonance spectroscopy. Chem Phys Lett 26:489–490
27. Gorshkov MV, Udseth HR, Anderson GA, Smith RD (2002) High performance electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry at low magnetic field. Eur J Mass Spectrom 8:169–176
28. Marshall AG, Hendrickson CL, Jackson GS (1998) Fourier transform ion cyclotron resonance mass spectrometry: a primer. Mass Spectrom Rev 17:1–35
29. Bogdanov B, Smith RD (2005) Proteomics by FTICR mass spectrometry: top down and bottom up. Mass Spectrom Rev 24:168–200
30. Scigelova M, Hornshaw M, Giannakopulos A, Makarov A (2011) Fourier transform mass spectrometry. Mol Cell Proteomics 10(7):M111.009431. doi:10.1074/mcp.M111.009431
31. Hu Q, Noll RJ, Li H, Makarov A, Hardman M, Cooks RG (2005) The orbitrap: a new mass spectrometer. J Mass Spectrom 40:430–443
32. Makarov A (2000) Electrostatic axially harmonic orbital trapping: a high-performance technique of mass analysis. Anal Chem 72:1156–1162
33. Zubarev RA, Makarov A (2013) Orbitrap mass spectrometry. Anal Chem 85:5288–5296
34. de Hoffmann E (1996) Tandem mass spectrometry: a primer. J Mass Spectrom 31:129–137
35. Yost RA, Boyd RK (1990) Tandem mass spectrometry: quadrupole and hybrid instruments. Methods Enzymol 193:154–200
36. Cooks RG (1995) Collision-induced dissociation: readings and commentary. J Mass Spectrom 30:1215–1221
37. Yost RA, Enke CG (1979) Triple quadrupole mass spectrometry. Anal Chem 51(12):1251A–1264A
38. Yost RA, Enke CG (1978) Selected ion fragmentation with a tandem quadrupole mass spectrometer. J Am Chem Soc 100(7):2274–2275
39. Domon B, Aebersold R (2006) Mass spectrometry and protein analysis. Science 312:212–217
40. Morris HR, Paxton T, Dell A, Langhorne J, Berg M, Bordoli RS, Hoyes J, Bateman RH (1996) High sensitivity collisionally-activated decomposition tandem mass spectrometry on a novel quadrupole/orthogonal-acceleration time-of-flight mass spectrometer. Rapid Commun Mass Spectrom 10:889–896
41. Chernushevich IV, Loboda AV, Thomson BA (2001) An introduction to quadrupole-time-of-flight mass spectrometry. J Mass Spectrom 36:849–865
42. Medzihradszky KF, Campbell JM, Baldwin MA, Falick AM, Juhasz P, Vestal ML, Burlingame AL (2000) The characteristics of peptide collision-induced dissociation using a high-performance MALDI-TOF/TOF tandem mass spectrometer. Anal Chem 72:552–558
43. Vestal ML, Campbell JM (2005) Tandem time-of-flight mass spectrometry. Methods Enzymol 402:79–108
44. Bradbury NE, Nielsen RA (1936) Absolute values of the electron mobility in hydrogen. Phys Rev 49:388–393
8 Top-Down Mass Spectrometry: Proteomics to Proteoforms
Steven M. Patrie

Abstract
This chapter highlights many of the fundamental concepts and technologies in the field of top-down mass spectrometry (TDMS), and provides numerous examples of contributions that TD is making in biology, biophysics, and clinical investigations. TD workflows comprise varied steps that may include non-specific or targeted preparative strategies, orthogonal liquid chromatography techniques, analyte ionization, mass analysis, tandem mass spectrometry (MS/MS), and informatics procedures. This diversity of experimental designs has evolved to manage the large dynamic range of protein expression and the diverse physicochemical properties of proteins in proteome investigations, to tackle proteoform microheterogeneity, and to determine the structure and composition of gas-phase proteins and protein assemblies.

Keywords
Review • Proteomics • Proteoform • Protein • Mass spectrometry • Top-down • Bottom-up • Informatics • Electrospray ionization • Alternative splicing • SNPs • Post-translational modifications • Fourier transform • FT-ICR • Orbitrap • Time-of-flight • Quadrupole ion trap • Native MS • Data-dependent • Data-independent • Reversed-phase • Tandem MS • Charge reduction • Supercharging • Chromatography • Hydrogen/deuterium exchange • Label-free quantitation • SILAC • Membrane proteins

S.M. Patrie (*)
Computational and Systems Biology & Biomedical Engineering Graduate Programs, University of Texas Southwestern Medical Center, Dallas, TX, USA
Department of Pathology, University of Texas Southwestern Medical Center, Dallas, TX, USA
Department of Bioengineering, University of Texas at Dallas, Richardson, TX, USA
e-mail: Steven.Patrie@UTSouthwestern.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_8

Abbreviations

2DGE 2D gel electrophoresis
μESI micro electrospray ionization
ALS acid labile surfactant
BU bottom-up
CAD collision activated dissociation
CF chromatofocusing
CFGE continuous flow gel elution
CID collision induced dissociation
CZE capillary zone electrophoresis
DB database
DD data-dependent
DI data-independent
dMS differential mass spectrometry
DT drift tube
ECD electron capture dissociation
ESI electrospray ionization
ETD electron transfer dissociation
FAIMS field asymmetric ion mobility spectrometry
FT Fourier transform
FTMS Fourier transform mass spectrometry
GELFrEE gel-eluted liquid fraction entrapment electrophoresis
HCD higher-energy collision dissociation
HDL high-density lipoprotein
HDX hydrogen/deuterium exchange
HILIC Hydrophilic interaction liquid chromatography
ICR ion cyclotron resonance
IEC Ion exchange chromatography
IEF isoelectric focusing
IMAC immobilized metal-ion affinity chromatography
IMP Integral membrane protein
IM-MS Ion mobility-mass spectrometry
IPG immobilized pH gradient
ISD in-source dissociation (nozzle/skimmer dissociation)
IRMPD infrared multiphoton dissociation
LC-MS liquid chromatography mass spectrometry
LQT linear quadrupole trap
MALDI matrix assisted laser desorption ionization
MBP myelin basic protein
MD middle-down
MS/MS tandem mass spectrometry
m/z mass to charge ratio
nESI nano electrospray ionization
pI isoelectric point
PIITA precursor ion independent top-down algorithm
PLRP Polystyrene-divinylbenzene copolymers, commercialized by Agilent
PT protein-platinum
PTM post translational modification
Q quadrupole
QIT quadrupole ion trap
RP reversed-phase
SAX strong anion exchange
SCX strong cation exchange
SDS sodium dodecyl sulfate
SEC size exclusion chromatography
SID surface-induced dissociation
S/N signal to noise
TD top-down
TDMS top-down mass spectrometry
TOF time-of-flight
TW traveling-wave
UPLC ultra-high performance LC
UVPD ultraviolet photon dissociation
WAX weak anion exchange
WCX weak cation exchange
Zip-tips® solid-phase capture/extraction tips, commercialized by Millipore

8.1 Introduction

Proteome diversity can substantially deviate from that predicted from the central dogma of biology [1] which states “the coded genetic information hard-wired into DNA is transcribed into individual transportable cassettes, composed of messenger RNA (mRNA); each mRNA cassette
contains the program for synthesis of a particular protein (or small number of proteins)” (Fig. 8.1A). A “protein” is often a mixture of poly-peptidyl products (proteoforms [2]) that are molecularly similar, sharing appreciable amino acid homology, yet chemically distinct. The chemical heterogeneity occurs through mechanisms such as allelic alterations (e.g., single nucleotide polymorphisms (SNP), or mutations), alternative splicing, and post-translational modifications (PTMs). Regarding PTMs, hundreds are currently described and may manifest both enzymatically (e.g., phosphorylation) and non-enzymatically (e.g., carbonylation) at times coincident with translation (e.g., glycosylation) or after inter- and extra-cellular displacement. The existence of such diversity ensures that reasonably specific molecular tools are available for diverse jobs (e.g., protein trafficking, protein complex formation, membrane assembly, or signal cascades). Proteoform diversification may also impact disease pathobiology [3–5]; therefore, proteoform expression ratios are being actively investigated with the goal of translation to the clinic as diagnostic or prognostic markers.

Fig. 8.1 Principles of proteome diversification and TDMS. (A) Schema highlights the information transfer from genome (DNA) through the transcriptome (messenger RNA) to the proteome (proteins). A protein often manifests as a collection of related “proteoforms” that derive from the same subset of genetic components, but are chemically diversified through polymorphisms, mutations, alternative splicing, and/or co- or post-translational modifications. (B) TD investigations will always measure the masses of the proteins/proteoforms in a sample. This is often followed by dissociation of the detected species in the mass spectrometer (e.g., MS/MS) and then informatics searches to identify the parent protein and localize positions of chemical microheterogeneity. Investigations may also seek to quantify between samples the expression changes of the protein or determine ratio changes between related proteoforms

Highlighted throughout this chapter is the tool development for protein and proteoform-level investigations that has occurred in the field of top-down mass spectrometry (TDMS) over the past few decades. This review includes discussion on advancements in sample processing, chromatography, MS and tandem MS (MS/MS or MSn), and bioinformatics. The tenet that governs TD innovations is quite simple: proteome and proteoform diversity is best interrogated when proteins are analyzed intact (Fig. 8.1B). A key distinction between protein-
centric vs. proteoform-centric studies is that while a protein centered study simply seeks to identify the proteins present within the samples, proteoform analysis attempts to localize all sources of molecular variation amongst related proteoforms. In addition, proteoform analyses seek to quantify expression changes at both the protein and proteoform levels between samples. These objectives largely set TD apart from bottom-up (BU) experiments where proteins are subjected to pre-analytical processing by proteases (e.g., trypsin) [6]. Elucidation of proteoform microheterogeneity is challenging with BU when the protein exhibits chemical diversity spatially distributed throughout the amino acid backbone [7]. An intermediate strategy between TD and BU called middle-down (MD) utilizes a limited digest at targeted amino acids to selectively cleave larger proteins into analytically manageable mid-size polypeptides [8]. While MD conceptually emulates BU, here it is considered a TD sample preparation method because it philosophically seeks to elucidate and quantify combinatorial products at the proteoform level.

8.2 Mass Spectrometry

8.2.1 Ionization

TDMS is performed on protein ions in vacuo (Fig. 8.2A). The gas-phase ions are generated by electrospray ionization (ESI) [9] or matrix assisted laser desorption ionization (MALDI) [10, 11]. ESI and MALDI are key in proteomics as they permit polypeptide analysis up to several hundred thousand Daltons (Da) without disrupting amino acid bonds, side-chain PTMs, or many non-covalent interactions in protein assemblies. Both ESI and MALDI have been applied in TD research; however, ESI is the workhorse largely because it seamlessly integrates liquid chromatography with mass spectrometry (LC-MS). ESI aerosolizes an analyte in solution using an atmospheric emitter held at a 1–4 kV voltage differential relative to the mass spectrometer’s inlet. Generally, the ion formation occurs through steps of desolvation and Coulombic explosion of shrinking droplets, and surface evaporation of the charged analyte, all of which are facilitated by heated optics in the mass spectrometer inlet. ESI of a protein leads to multiple charge (z) states that sequentially differ by one charge and are observed in a mass to charge (m/z) spectrum (Fig. 8.2B). Molecular ions can be generated in protonated, cationized, or anionized states. Traditionally TD is performed in “positive ionization mode” where ESI, aided by the presence of acids in the sample (e.g., formic acid or acetic acid) generates multiply protonated states. The charge state may be assigned with the formula $(M + n\mathrm{H})^{n+}$, where n is the number of added protons (H+) to a molecule with mass (M). Thus, the m/z is determined by $m/z = (M + n\,m_{\mathrm{H}^+})/n$ where $m_{\mathrm{H}^+}$ is the mass of a proton (1.007277 Da).

Conventional ESI utilizes high flow rates (>1 μL/min) and a counter gas to facilitate desolvation. However, investigators often use micro-ESI (μESI) and nano-ESI (nESI) which exploit small inner diameter (ID) (1–50 μm ID) fused silica capillary emitters to introduce analyte at low flow rates (200–1000 nL/min and 10–200 nL/min, respectively) [12, 13]. This leads to smaller droplets for more efficient ion formation without the need for a desolvation gas. The limits of detection for proteins analyzed with μESI and nESI are usually in the nM to μM concentration range. Additionally, since most flow regimes applied in ESI are concentration sensitive [14], ESI is amenable to quantitative studies by overall spectral counts (intensity) in the m/z spectrum.

Despite the utility of ESI, the removal of excess salt (e.g., sodium or potassium), sample buffers (e.g., Tris(hydroxymethyl)aminomethane (Tris) or phosphate buffered saline (PBS)), detergents (e.g., sodium dodecyl sulfate (SDS)) or plasticizers from samples is critical because they create spectral artifacts and contribute to chemical noise in the ESI solution
[15]. Therefore, successful implementation of ESI in proteomics often includes scaling LC columns to small diameter fused silica capillary columns (50–150 μm ID, 5–25 cm length) with integrated ESI emitters [16]. Of note, ESI is often performed with microfluidic devices [17] or robots [18] which facilitate assembly line processing for both “offline” direct infusion of individual proteins and “online” LC-MS experiments on complex mixtures. When applied with direct infusion the devices permit spectral averaging for improved spectral signal to noise (S/N) [18].

Fig. 8.2 Overview of ESI and MS data processing. (A) TD often applies ESI to aerosolize and ionize proteins suspended in solution. A mass to charge (m/z) spectrum is acquired for the parent protein and during subsequent fragmentation events. (B) A representative m/z spectrum (top) and corresponding zero charge mass spectrum (bottom) for bovine ubiquitin. The inset in the m/z spectrum highlights that, with sufficient instrument resolving power, carbon-12 (12C) and carbon-13 (13C) isotopologues for each charge state may be observed (see Sect. 8.3.4). Deconvolution of isotopologue distributions to monoisotopic masses often leads to low part per million mass accuracies. The “– #” indicates the isotopologue for which a mass is reported. (C) Representative m/z (top) and corresponding zero charge mass spectrum (bottom) for di-N-glycosylated lipocalin-type prostaglandin d synthase. The data highlights significant spectral complexity often associated with proteoform-level investigations, in this case multiple glycoproteoforms present at different ratios. Adapted with permission from Ref. [106]. Copyright 2014 by John Wiley & Sons, Inc
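
The m/z relation given in Sect. 8.2.1 for multiply protonated ions, (M + n·mH+)/n, can be illustrated with a short numerical sketch. The Python below is illustrative only: the neutral mass of 8560.0 Da is a hypothetical round number, and the function names are invented here rather than taken from any software discussed in this chapter. It first generates a charge-state series, then rearranges the same relation to recover the charge and the zero-charge mass from two neighbouring peaks, which is the basic idea behind the zero charge mass spectra shown in Fig. 8.2B.

```python
PROTON_MASS = 1.007277  # Da, the proton mass quoted in the text


def mz_for_charge(neutral_mass_da, n):
    """Return m/z for an ion carrying n added protons: (M + n*mH+)/n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (neutral_mass_da + n * PROTON_MASS) / n


def charge_from_adjacent_peaks(mz_high, mz_low):
    """Estimate the charge of the peak at mz_high from its neighbour at mz_low.

    Assumes mz_low carries exactly one more proton than mz_high. Writing
    m/z = M/n + mH+ for charges n and n + 1 and eliminating M gives
    n = (mz_low - mH+) / (mz_high - mz_low).
    """
    if mz_high <= mz_low:
        raise ValueError("mz_high must be larger than mz_low")
    return (mz_low - PROTON_MASS) / (mz_high - mz_low)


def neutral_mass(mz, n):
    """Invert the m/z relation: M = n * (m/z - mH+)."""
    return n * (mz - PROTON_MASS)


if __name__ == "__main__":
    M = 8560.0  # hypothetical neutral mass in Da, chosen only for illustration
    series = {n: mz_for_charge(M, n) for n in range(5, 14)}
    for n, mz in series.items():
        print(f"{n:2d}+  m/z = {mz:.4f}")

    # Recover charge and mass from two neighbouring peaks, as in deconvolution.
    z = round(charge_from_adjacent_peaks(series[9], series[10]))
    print("assigned charge:", z)
    print("zero-charge mass:", round(neutral_mass(series[9], z), 4))
```

Real spectra contain noise, overlapping species, and non-proton adducts, so practical deconvolution tools combine many charge states rather than a single pair of peaks; the sketch only shows the arithmetic.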

8.2.2 Mass Analyzers

To separate ions by their m/z, mass analyzers use electric or magnetic fields to apply a force that leads to both mass-dependent (Newton’s second law) and ion-dependent (Lorentz force law) accelerations. Detection typically occurs in units of time, frequency, or current. The analyzer classes most relevant to TD include: (1) quadrupole mass filters (Q), (2) quadrupole ion-trap (QIT) in the design of 3D cylindrical-hyperbolic rings or 2D linear traps with quadrupole rods (LQT), (3) time-of-flight (TOF), and (4) Fourier transform MS (FTMS) which includes ion cyclotron resonance (FT-ICR) and Orbitrap. While analyzer performance characteristics vary (Table 8.1), a notable distinction between analyzers comes in terms of processing duty cycle (scan rate) vs. spectral resolving power ($m/\Delta m_{50\%}$) and mass accuracy. The resolving power is typically determined from the minimum peak width ($\Delta m_{\mathrm{FWHM}}$) at a set m/z value (e.g., 400 m/z) which permits comparison of different instrument types. Scanning instruments such as Q, QIT, and TOF provide a high scan rate (millisecond) that is useful for high-throughput MS/MS and reliable quantitative sampling during LC-MS [19]. However, the higher acquisition rate may reduce resolving power and consequently mass accuracy (see Sect. 8.3.4). FTMS uses high-field superconducting magnets (FT-ICR) [20] or high-electric fields (Orbitrap) [21] to trap ions prior to frequency-based detection. FTMS is often performed at lower scan rate (hundreds to thousands of ms) which serves to dramatically improve spectral resolving power. On a well calibrated instrument, high spectral resolving power permits precise mass measurements at part-per-million or part-per-billion mass accuracy [22, 23]. As described in more detail below, the benefits of high resolving power and mass accuracy include the use of accurate mass tags to discriminate between elemental/chemical compositions of species in a database [24, 25], the resolution of metabolically incorporated isotopic labels for quantitation (see Sect. 8.4.2) [26], and the discrimination of PTMs with similar mass (e.g., O-phosphorylation, 79.96633 Da, vs. O-sulfonation, 79.95682 Da) [27].

Table 8.1 Mass analyzers: Typical characteristics of mass analyzers used for top-down experiments

Mass analyzer  Ionization  Resolving power   Spectral duty cycle (s)  Upper m/z range  Mass accuracy (ppm)
QIT/LTQ        ESI, MALDI  1–3000            0.02–0.2                 2000–3000        100–250
Q-TOF          ESI, MALDI  10,000–50,000     <0.01                    >100,000         5.0–15.0
Orbitrap       ESI, MALDI  15,000–250,000    0.01–1.0                 20,000–50,000    2.0–10.0
FT-ICR         ESI, MALDI  15,000–5,000,000  0.1–5.0                  50,000           0.5–5.0

Abbreviations: FT-ICR Fourier transform ion cyclotron resonance, LTQ linear trap quadrupole, QIT quadrupole ion trap, Q-TOF quadrupole time-of-flight

A notable feature of modern instruments is that they often combine mass analyzers in tandem (e.g., Q-TOF, QqQ, QqQ-FTMS, QIT-FTMS). Hybrids serve to improve dynamic range for continuous ion sources (such as ESI) [28, 29], aid selected enrichment of specific species in a sample [18, 30] and permit the parallel processing of high-resolving power scans in the FTMS with lower-resolving power MS/MS events in a separate QIT [31].

FTMS has largely dominated the TD field. Orbitraps, which do not require superconducting magnets, have been broadly accepted for routine LC-MS applications [32], while FT-ICR has been widely applied in detailed proteoform investigations where the highest resolving power and mass accuracies are required [33, 34]. However, continued innovation has decreased the performance gaps between
analyzers resulting in a broad application of TD on other classes of instrument [35–37].

8.2.3 Tandem Mass Spectrometry (MS/MS, MSn)

Mass spectrometers apply a variety of fragmentation techniques (Table 8.2) that either have slow ($10^{-5}$–1 s) or fast ($<10^{-8}$ s) activation timeframes [37, 38]. The different techniques produce distinct types of terminal fragment ions [38, 39] that are annotated by the Roepstorff, Fohlman, and Biemann nomenclature (Fig. 8.3, right) [40, 41]. In TDMS, thermal or vibrational heating of the amino acid backbone occurs through collisions with gas or photons from a laser. Examples include resonant-based collision induced dissociation (CID) [39, 42], non-resonant higher-energy collisional dissociation (HCD) or collision activated dissociation (CAD) [43], infrared multiphoton dissociation (IRMPD) [44], and ultraviolet photon dissociation (UVPD) [45, 46]. These methods predominantly cleave the polypeptide at the weakest amide bonds leading to “b” (N-terminal) and “y” (C-terminal) fragments; although the rapid activation by UVPD (~$10^{-15}$ s) results in relatively random backbone cleavage that leads to complex spectra presenting most terminal fragment types. Electron capture dissociation (ECD) [47] and electron transfer dissociation (ETD) [48] cleave the backbone N-Cα bonds to form “c” (N-terminal) and “z” (C-terminal) fragment ions. Addition of the electron onto the polypeptide forms a radical cation that rapidly cleaves the backbone. Consequently, ECD and ETD spectra are fragment rich due to the random cleavage events. In contrast with most other methods [49], ECD and ETD do not eject labile side-chain PTMs (such as phosphorylation or glycosylation) which increases their utility for localizing PTMs along the protein’s backbone [50, 51].

Table 8.2 Fragmentation techniques: Comparison of common in vacuo dissociation methods

Technique  Fragmentation  Mechanisms (cleavage site)              Special equipment           Automation
CID^a      Collision      Resonant excitation^b (b, y)            Ion trap                    DD
CAD/HCD    Collision      Non-resonant excitation^c (b, y)        Collision cell              DD, DI
ECD        Electron       Electron transfer (c, z)                Heated cathode              DD
ETD        Electron       Radical transfer (c, z)                 Chemical ionization source  DD
IRMPD      Photon         Direct excitation^d (b, y)              CO2 laser                   DD, DI
ISD        Collision      Non-resonant excitation^c (b, y)        MS inlet (nozzle/skimmer)   DI
SID        Collision      Non-resonant excitation^c (b, y)^e      Metal surface               DD, DI
UVPD       Photon         Direct excitation^d (a, b, c, x, y, z)  Excimer laser               DD, DI

Abbreviations: CID collision-induced dissociation, CAD collision activated dissociation, ECD electron capture dissociation, ETD electron transfer dissociation, HCD higher-energy collisional dissociation, IRMPD infrared multiphoton dissociation, ISD in source dissociation (nozzle/skimmer dissociation), SID surface induced dissociation, UVPD ultraviolet photo-dissociation
a Called sustained off-resonance irradiation collision-activated (induced) dissociation (SORI-CAD) when performed in FT-ICR
b Application of radio frequencies to excite/dissociate to increase kinetic energy of trapped ions
c Application of DC potentials to accelerate ions into a high pressure region or surface
d Introduction of single or multiple photons to trapped ions
e Commonly results in ejection of macromolecular assemblies (see native MS)

In high-throughput “omics” investigations, the automation of MS and MS/MS events may be divided into data-dependent (DD) vs. data-independent (DI) acquisitions (Fig. 8.4) [30, 52]. These balance highly selective MS/MS events on a single spectral target (DD) vs. throughput with parallel fragmentation of co-existing species in a spectrum (DI). A protein’s gas-phase structure and charge number, as well as the choice of MS/MS method, influence the number and position of fragmentation events for a protein. Therefore, in DD acquisitions, MS/MS methods are often exploited in parallel to improve the total number of identifications across a proteome. Alternatively, MS/MS methods may be used
sequentially (MSn) on product ions to improve the resolution of proteoform characterization [18, 30, 49, 53–55]. These experiments typically use inclusion and exclusion criteria via decision-tree methods to automatically target or adjust fragmentation variables that may be influenced by the precursor’s mass or charge (e.g., activation times and energy levels) [18, 56, 57]. DI acquisitions have been applied to study proteins >25 kDa where low resolving power broadband spectra are obtained followed by coincident dissociation of all components [32]. Alternatively, segmentation of the m/z range into multiple ~30–80 Δ m/z windows may be applied to enhance the dynamic range for intact precursor measurement and improve the sensitivity of fragment ion detection [18, 58].

Fig. 8.4 Overview of data-dependent (DD) vs. data-independent (DI) fragmentation. (A) “Omics” investigations by TD typically include reversed-phased LC where proteins are automatically interrogated over time with DD or DI fragmentation. (B) In DD analysis, automated selection, enrichment, and fragmentation of individual charge states is performed on-the-fly. (C) In broadband DI (left), mass or m/z information is not used as a pre-selection criterion and all spectral partners (different charge states or different proteins) are simultaneously fragmented. Segmented DI (right) serves as an intermediate between DD and broadband DI because fragmentation events occur sequentially on enriched m/z windows that often contain multiple charge states of more than one protein. Adapted with permission from Ref. [52]. Copyright 2004 by Elsevier

8.2.4 Data Analysis and Informatics

Algorithms and software tools are available through instrument vendors or online to support automated spectral deconvolution of low and high resolution data [59–61]. Generally, the z for a protonated ion in a spectrum is readily derived from spacing between adjacent charge states: $z_{(m/z)_i} = (m/z)_{i+1} / \left[(m/z)_{i+1} - (m/z)_i\right]$. Instruments that attain sufficient spectral resolving power may also separate peaks into isotopologues that predominantly reflect the natural variation of carbon-12 (12C) and carbon-13 (13C) in the polypeptide (Fig. 8.2B, inset). Here, z is derived by counting the number of isotopes in a single m/z unit or by way of the Δm/z difference between adjacent isotopes ($z = 1/\Delta(m/z)_{\mathrm{iso1-iso2}}$). When reporting from high-resolving power datasets either the monoisotopic mass ($^{12}\mathrm{C}_{100}\,^{13}\mathrm{C}_{0}$, i.e., polypeptide containing only 12C) or most abundant isotope ($^{12}\mathrm{C}_{100-n}\,^{13}\mathrm{C}_{n}$) mass is given, contrasting lower resolution approaches where the average molecular mass for all unresolved isotopes is reported.

As highlighted above, MS/MS serves a key role in identifying proteins and differentiating proteoforms. For example, the masses of fragments may be used for de novo sequence analysis. Here, a series of fragment ions differing in masses equal to that of distinct amino acids (i.e., sequence tag) are searched against the protein database for matches with the same consensus sequence (Fig. 8.3) [62–64]. An alternative approach to protein identification is to correlate predicted terminal fragment ions of proteins to those observed in the MS/MS spectrum. For individual spectra, diverse open access resources are available for assignment of tandem MS data to target sequences (e.g., PROWL [65], BUPID [66], and MASH Suite [67]). Similarly, Kelleher and coworkers have created the ProSight series of search engines (e.g., web-based ProSight PTM (free), ProSight Lite (free), and ProSightPC 3.0 (commercialized by ThermoFisher Scientific)) [68, 69]. ProSight uses a Poisson model to determine the significance of an identification made from MS/MS fragment matches [70]. The probability of a random identification is dependent upon the experimental mass accuracy and the number of fragments that match a protein in a database relative to the total number of fragments observed. Various scoring metrics are available to determine confidence in the identification and estimate false discovery rates (FDR) for protein identifications [32]. ProSight is also amenable to assigning confidence in complex proteoform studies [71]. For example, in work on histone H4, Pesavento et al. performed in silico “shotgun” annotation of all plausible H4 PTMs to create a database for MS/MS searches [34].

Fig. 8.3 Protein identification: A set of in silico informatics tools are used to process raw MS data, as well as identify and characterize the proteoforms present. Deconvoluted spectra are typically provided as lists of parent masses and/or associated fragment masses. Fragmentation data often supports the generation of amino acid sequence tags that are searched against databases for proteins that contain the tag within their amino acid sequence. Alternatively, since fragment ions often contain either the N- or C-terminus, observed fragment masses are searched against theoretical terminal fragment ion masses for each protein in a database

Other informatics approaches have also been created for TD. For example, Pevzner and coworkers created a spectral alignment algorithm that identifies protein forms presenting with concomitant PTMs [72]. In a follow-up report, the algorithm, MS-Align-E, was used to characterize histone H4 proteoforms, proving particularly useful for proteoform assignments in the absence of highly annotated databases [73]. The precursor ion independent top-down algorithm (PIITA) cross-correlates deconvoluted MS2 spectra to theoretical MS2 spectra in a manner similar to the SEQUEST algorithm used in ion peptide studies [74]. After identification, PIITA uses any difference observed between the observed precursor mass and theoretical precursor mass to identify and locate PTMs. Preliminary work with PIITA characterized 154 proteins at <1 % FDR from Salmonella typhimurium membrane extracts [74]. BIG-Mascot was created to extend the working mass range of the peptide-based search engine MASCOT (Matrix Science) [75]. Initial examples highlighted the identification of protein variants from 8 to 669 kDa through a combination of accurate mass tags and/or MS/MS events.

8.3 Chromatography

MS on intact proteins presents significant challenges. Increased charge and isotopologue multiplicity at higher mass (>20 kDa) quickly decrease spectral S/N [76]. This compounds upon other factors that degrade signal including charge competition between different proteins during ESI, protein solubility, and biological or technical chemical noise. Fortunately, many of these issues can be overcome by chromatographic preprocessing (e.g., molecular weight cutoff filters, dialysis, or immunoprecipitation). For complex mixtures, the observational capacity of the workflow is also increased by multidimensional steps which fractionate proteins by orthogonal physicochemical properties (e.g., size, isoelectric point (pI), hydrophobicity, and polarity). Many of these tools are briefly discussed here.
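
The remark in Sect. 8.3 that isotopologue multiplicity at higher mass decreases spectral S/N can be made concrete with a rough estimate of how the signal of one charge state is divided among 13C isotopologues. The Python sketch below is a simplified model under stated assumptions, none of which come from this chapter: only carbon isotopes are considered, the natural 13C abundance is taken as 1.07 %, and the number of carbons is estimated from an averagine-like composition of about 4.94 carbons per 111.1 Da. The function names are hypothetical.

```python
P_13C = 0.0107                      # assumed natural abundance of carbon-13
CARBONS_PER_DA = 4.9384 / 111.1254  # assumed averagine-like carbons per Da


def estimated_carbon_count(mass_da):
    """Rough number of carbon atoms in a protein of the given mass."""
    return max(1, round(mass_da * CARBONS_PER_DA))


def isotopologue_fractions(n_carbons, p=P_13C):
    """Binomial probability of k carbon-13 atoms among n carbons, k = 0..n.

    Built iteratively from P(0) so that no large binomial coefficients
    are needed.
    """
    fractions = [(1.0 - p) ** n_carbons]
    ratio = p / (1.0 - p)
    for k in range(1, n_carbons + 1):
        fractions.append(fractions[-1] * (n_carbons - k + 1) / k * ratio)
    return fractions


def most_abundant_fraction(mass_da):
    """Share of the total signal carried by the tallest isotopologue peak."""
    return max(isotopologue_fractions(estimated_carbon_count(mass_da)))


def peaks_above(mass_da, threshold=0.01):
    """Number of isotopologue peaks holding at least `threshold` of the signal."""
    fractions = isotopologue_fractions(estimated_carbon_count(mass_da))
    return sum(1 for f in fractions if f >= threshold)


if __name__ == "__main__":
    for mass in (5_000, 10_000, 20_000, 50_000, 100_000):
        print(
            f"{mass:>7d} Da: tallest peak holds "
            f"{100 * most_abundant_fraction(mass):.1f} % of the signal, "
            f"{peaks_above(mass)} peaks above 1 %"
        )
```

In this model the tallest isotopologue peak carries a shrinking share of the ion current as mass grows, and the same ions are additionally spread over more charge states, which is consistent with the drop in S/N described for proteins above 20 kDa. Other elements (N, O, S, H) broaden real distributions further.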

8.3.1 Reversed Phase Separations

Whether offline, online, or via solid-phase capture/extraction tips (products such as Millipore’s ZipTips), reversed-phase (RP) chromatography plays a pivotal role in TDMS. RP separations are mediated by the strength of the interaction between the hydrophobic domains of proteins and the non-polar stationary phase (Fig. 8.3A). Strong interactions permit simple sample clean-up from polar salts and buffers, as well as support mixture partitioning by gradient elution with increased organic solvent concentration over time. In contrast with peptide-based analysis, when using conventional porous RPLC resins proteins often elute from the RP columns over several minutes. The poor peak capacity originates from structural and proteoform-level diversity, column precipitation, and poor diffusion/mass transfer characteristics through porous media. This has prompted the application of different resin architectures or chemistries for improved resolution in TD studies. Straightforward adjustments include the use of small C(n)-alkyl chain (n = 4 vs. 18) resins which decrease the strength of protein and surface interactions. Larger pore size (1000 Å vs. 120–300 Å) has also been shown to improve peak capacity for larger proteins, and increasing the eluotropic strength of the organic mobile phase (e.g., isopropanol) improves solubility of hydrophobic proteins [77]. Additionally, extension of conventional resins to ultra-high performance LC (UPLC) permits protein separations at high back pressure (400–1600 bar) [78]. For example, Ansong et al. used a UPLC system with 5 μm particle sizes packed into 80 cm long columns and extended gradients (~4 h) to identify 563 small to mid-sized polypeptides (and 1665 proteoforms) from lysates of Salmonella typhimurium [79].

Novel RP resin architectures have also been pursued. For example, polystyrene-divinylbenzene copolymers (e.g., Agilent’s PLRP-S media) provide good mechanical and chemical stability under acidic/basic pH extremes and at elevated column temperatures (e.g., 50–65 °C). The use of higher temperatures enhances adsorption/desorption kinetics, lowers the mobile-phase viscosity, and helps to denature proteins. Vellaichamy et al. used PLRP in capillary-columns and found a ~2–3× improvement in resolution and spectral S/N versus conventional porous silica [77]. Monolithic stationary phases composed of a cross-linked network of mesoporous material (e.g., polymer, silica, and organic-silica hybrid monoliths) also provide good mass transfer characteristics for proteins. Eeltink et al. created a 200 μm × 250 mm capillary polymer monolithic column and showed peak capacities >600 for complex mixture studies [80]. Monolithic separations have also been applied in TD investigations on milk proteome [81] and the characterization of 19S and 20S proteasome proteins [82]. In a separate approach, Roth et al. showed superficially porous resins, consisting of a solid core with <1 μm porous outer shell packed into capillary columns, yield protein elution in <10 s, with $10^{4}$ quantitative dynamic range, and attomole detection limits for standard proteins and lysates on heart tissue or cell cultures [16]. Finally, predictions of lower plate height minimum in Van Deemter plots suggest that extension of RPLC to <2 μm particles will improve chromatographic resolution [83]. Wu et al. showed that 0.5 μm diameter nonporous silica particles in capillary columns limit eddy diffusion and create a “slip flow” phenomenon along capillary walls that enhances flow and decreases velocity distributions of analyte [84]. When applied in TD studies on Escherichia coli, a peak capacity of 750 was observed for a 60 min gradient.

8.3.2 Ion Exchange

Ion exchange chromatography (IEC), such as weak anion exchange (WAX), weak cation exchange (WCX), strong anion exchange (SAX), strong cation exchange (SCX), and immobilized metal-ion affinity chromatography (IMAC), separates proteins based upon charge-charge interactions between a protein and a charged resin [85]. A step-wise or linear gradient
increase in counter ion concentration (supplied by salts or changes in pH) helps to elute proteins, often under non-denaturing conditions. Combined with RPLC, both online and offline IEC support multidimensional TD experiments. For example, Sharma et al. used WAX prefractionation and online RPLC and detected 715 intact proteins from a Shewanella oneidensis MR-1 cell lysate [86]. Roth et al. used WAX/RPLC in a 2D workflow and analyzed >600 proteoforms from human primary leukocytes harvested from leukoreduction filters [87]. Similarly, SAX as a first dimension separation technique has been applied for integrated TD and BU studies on E. coli [82, 83]. These studies highlight the complementarity of the approaches, with small and larger proteins often over-represented for TD and BU, respectively [88, 89].

8.3.3 Hydrophilic Interactions

Hydrophilic interaction liquid chromatography (HILIC) separates polypeptides via a normal- or polar-stationary phase in the presence of a less polar mobile phase [90]. In HILIC, the stationary phase is typically primed with water to form a hydrophilic shell prior to addition of the organic mobile phase. Separation is achieved by partitioning the polypeptides between the hydrophilic and hydrophobic layers with gradient elution by increasing water concentration over time, resulting in elution based on hydrophilicity. When the stationary phase is supplied by IEC columns, the added ionic interactions provide additional selectivity. For example, HILIC on a WCX column has found widespread use for the sub-fractionation of histone proteoforms in TD and MD studies. Garcia et al. utilized WCX-HILIC offline with subsequent RPLC and direct infusion by μESI to identify 150 and 42 different proteoforms on histone H3.2 and H4, respectively [33, 91]. Young and coworkers extended WCX-HILIC to capillary-based columns for MD applications [4]. They created a “saltless” pH gradient for direct coupling to the mass spectrometer, and characterized >200 H3.2 and 70 H4 species from 1 μg of material in 2 h. Similarly, Tian et al. recently created an online multidimensional histone fractionation system that automatically fractionated ~7.5 μg of major histone family members by RPLC prior to metal-free WCX/HILIC-MS/MS analysis [92, 93]. They identified 105 H4, 110 H2B, 77 H2A, and 416 H3 proteoforms in a single run.

8.3.4 Capillary Zone Electrophoresis & Isoelectric Focusing

Solution-based electrophoretic approaches have received considerable attention in TD studies owing to their high separation efficiencies. Online approaches with capillary zone electrophoresis (CZE) have been used to characterize proteins from microorganisms, biofluids, protein-ligand interactions, biopharmaceuticals and dietary proteins [94–96]. For example, Sun et al. created an electrokinetically pumped “sheath-flow” ESI-CZE–MS interface, providing proof of concept for TD on standard proteins and Mycobacterium marinum secretome [97, 98]. Li et al. created a similar apparatus as part of a multidimensional scheme that size-sorted proteins into discrete molecular mass windows prior to CZE-ESI-MS/MS, identifying 30 proteins from 30 to 80 kDa from Pseudomonas aeruginosa [99]. Haselberg et al. created a “sheathless” CZE ESI interface for the characterization of 18 and 74 glycoproteoforms of recombinant human interferon-β and human erythropoietin, respectively [100]. Han et al. applied a similar approach as part of an RPLC-CZE TD workflow characterizing ~300 proteoforms from 270 ng of protein from Pyrococcus furiosus, as well as proteins in the Dam1 complex from Saccharomyces cerevisiae [101].

Chromatofocusing (CF) and isoelectric focusing (IEF) separate proteins based upon their isoelectric point (pI). CF exploits a pH gradient on IEC columns and has been applied in studies on methanosarcinides [18], membrane proteins [102] and cancer cells [103]. Similarly, a variety
of IEF systems (e.g., Rotofor, Free Flow Electrophoresis, IsoelectriQ, OFFGEL and Zoom IEF) have been applied for first-dimension fractionation of intact proteins (0.05–5 mg) [104]. Each system has different routes for generation of the pH gradient (e.g. carrier ampholytes and/or immobilized pH gradient membranes). For example, Zhang et al. recently used off-gel IEF prior to LC-MS to characterize hundreds of proteoforms in heart sarcomeres ranging from 5 to 230 kDa in mass [105, 106]. Here, protein separation occurs in solution via voltage-driven migration through an immobilized pH gradient (IPG) (Fig. 8.5A). Jensen et al. created a totally solution-based capillary IEF (cIEF) procedure for measurements of the E. coli and D. radiodurans proteomes, characterizing up to ~1000 proteoforms from a total injection of ~300 ng [107].

Fig. 8.5 (A) SDS/PAGE analysis of mouse heart myofibrils fractionated by off-gel IEF with a 3–10 immobilized pH gradient. Adapted with permission from Ref. [105]. Copyright 2013 by American Chemical Society. (B) SDS/PAGE analysis of yeast proteome fractionated by GELFrEE mass separation. Adapted with permission from Ref. [116]. Copyright 2009 by American Chemical Society

8.3.5 Size-Based Separations

Size/mass separations are an attractive option to overcome S/N bias associated with measuring intact proteins over broad mass ranges [76]. While 2D gel electrophoresis (2DGE) has unsurpassed peak capacity for proteins from 5 to 250 kDa [108], poor duty cycle for gel elution has largely prevented its use in TD. Alternatively, solution-based size exclusion chromatography (SEC) and continuous flow gel elution (CFGE) have been routinely exploited in TD. In SEC, proteins migrate through a porous polymeric column and are separated by their hydrodynamic volume, with larger proteins eluting before smaller ones [109]. Examples of SEC in TD include the characterization of lumen proteins from Arabidopsis thaliana [110], membrane protein complexes in Synechocystis sp. PC6803 [111], degradation products of biopharmaceuticals [112], and structural studies on amyloid beta oligomers [113]. Chen and Ge recently reported that a novel ultrahigh pressure SEC approach utilizing MS compatible elution buffers permitted MS analysis of proteins from 6 to 669 kDa in <7 min [114].

CFGE separates proteins on a tubular polyacrylamide gel electrophoresis column, with eluting proteins collected in fractions of increasing protein mass over time. Meng et al. used a preparative CFGE apparatus with an acid labile surfactant (ALS) to fractionate the low-mass yeast proteome in ~5 kDa windows prior to offline RPLC and μESI [115]. Use of ALS instead of SDS allowed direct processing of PAGE fractions without precipitation. Tran et al. further refined this approach creating a
gel-eluted liquid fraction entrapment electrophoresis (GELFrEE) technique that exploits smaller tube dimensions with short resolving gels [116]. GELFrEE could reproducibly separate μg to mg levels of material from 5 to 250 kDa with high recovery (Fig. 8.5B) [70, 77], and has been used in recent multidimensional TD investigations. For example, Kelleher and coworkers combined IEF, GELFrEE, and LC-MS/MS into a 4D TD workflow that provided a theoretical 4D peak capacity >100,000. The work represents the largest dynamic range for TD on mammalian cell lysate reported to date, and highlights that in TD proteomics investigations, the number of observed proteoforms will typically exceed the number of identified proteins by ~3× [32, 116]. Scaling of CFGE to microfluidic devices is also showing potential for ultrafast size-based separations. For example, Root et al. fully resolved various standard proteins through a 75 μm ID, 25 cm long fused-silica capillary coated with a poly-(N-hydroxyethylacrylamide) polymer in <3 min [117].

8.4 Frontiers

8.4.1 Comprehensive Proteoform Studies

As highlighted above, the transition to multidimensional separations and online LC-MS/MS has dramatically increased proteome coverage for complex mixtures and detailed proteoform investigations on target proteins [32, 79]. In targeted studies, the investigations into the histone family members (H1, H2a, H2b, H4 and H3) (see Sect. 8.4.3) exemplify the utility of top-down for the examination of extreme complexity associated with heterogeneous modification states [4, 33, 34, 91, 118–120]. Extreme proteoform diversification is not limited to epigenetic regulators. For example, Zhang et al. used TD to characterize 12 PTMs at 11 different sites on myelin basic protein (MBP), an intrinsically disordered protein of the myelin sheath [121]. They found diverse PTM classes, including N-terminal and internal acetylation, mono- and dimethylation, deamination, deamidation, and phosphorylation. Similar investigations have been performed on clinical biomarkers including transthyretin and hemoglobin variants [122–126], studies into the deamidation kinetics in ribonuclease A [127] and beta-2-microglobulin [128], and hundreds of nitration and oxidation events on calmodulin [129]. Additionally, Ge and coworkers have used TD to monitor phosphorylation on diverse myofibril proteins in the context of chronic heart failure [130–132]. In infectious disease research, Burnaevskiy et al. characterized a novel N-terminal demyristoylation and coincident amidation event on ADP-ribosylation factor (ARF)1p and (ARF)2p by Shigella flexneri virulence factor invasion plasmid antigen J (IpaJ) [133].

TD is also being applied for the evaluation of microheterogeneity associated with protein glycosylation [112, 134–140]. For example, Bourgoin-Voillard et al. used CID, IRMPD, ETD and ECD to characterize fragmentation dynamics of intact RNAse B and its bound N-glycans [66]. In the characterization of two isoforms of prostate specific antigen, the Association of Biomolecular Resource Facilities (ABRF) determined that TD quantified glycoproteoforms with the same reliability as conventional peptide-N-glycosidase F (PNGase F) release of N-glycans procedures [141, 142]. In another example, Zhang et al. used off-gel IEF to fractionate total cerebrospinal fluid (CSF) proteins prior to LC-MS [105, 106]. The approach permitted the generation of virtual 2D gels (pI vs mass) that resolved >200 di-N-glycosylated lipocalin-type prostaglandin D synthase glycoproteoforms directly from the CSF-proteome [106].

8.4.2 Quantitation

TD quantitation methods have largely mirrored those used in bottom-up [143]. Label-free quantitation (LFQ) offers a distinct advantage for comparison of clinical samples where
metabolic-labeling is not possible and chemical-labeling costs may be too restrictive. For example, Gucinski and Boyne examined the impact of multiple charge states, isotopologues, and resolving power on linearity of TD LFQ on protein therapeutics, highlighting that absolute quantitation by standard curves, as well as relative quantitation between analyte and internal standards, is readily possible [144]. For complex mixtures, Julka et al. combined ultraviolet detection with 2D SAX LC-MS to quantify proteins spiked into E. coli, demonstrating good linearity (R² > 0.99) over a ten-fold concentration range [145]. In biomarker discovery, Mazur et al. applied an automated differential MS (dMS) infrastructure to examine expression changes between patients in human high-density lipoprotein (HDL) [139, 146]. More recently, Ntai et al. extended LFQ to multidimensional TD workflows that sought to characterize proteome and proteoform-level effects of deleting histone deacetylase (rpd3) in budding yeast [147]. Numerous other investigations have also evaluated the reliability of TD for proteoform ratio determination [33, 87, 91, 118, 119, 148], often finding that the ratios of isobaric proteoforms, for example resulting from PTM positional isomers, must be determined from fragment ion ratios [118, 148, 149].

Metabolic labeling by SILAC (Stable Isotope Labeling by Amino Acids in Cell Culture) has found use for proteins isolated from cell cultures [86, 150, 151]. For example, Veenstra et al. introduced isotopically labeled leucine into proteins in E. coli, characterizing expression changes via cIEF and FT-ICR [152]. Parks et al. used WAX and RPLC separations with FT-ICR to characterize ¹⁴N/¹⁵N metabolic labels on histidine, leucine, and tryptophan to determine expression ratios on 231 metabolically labeled protein pairs in S. cerevisiae [153]. Similar work by Pesavento et al. monitored changes to histone H4K20 methylation during the HeLa cell cycle progression [154]. Collier et al. also applied metabolic labeling for quantitative comparison of hundreds of proteins from Aspergillus flavus and human embryonic stem cells [150, 155]. More recently, Rhoads et al. created a novel neutron-encoded mass signature strategy which labeled proteins in yeast cultures with either ¹³C₆¹⁵N₂-lysine or ²H₈-lysine [26]. The work highlighted a mass difference of 0.036 Da between the isotopologues that was not distinguished during low resolving power scans but could be discriminated upon acquiring a high-resolution scan, permitting quantitation based upon isotopologue ratios. The strategy has potential for TD metabolic-labeling studies because it permits multiplexing without increasing spectral congestion [26].

8.4.3 Biologics and Biosimilars

Protein-drugs are often modified by PTMs (such as N-glycosylation) and careful attention to the location, composition, or structure of these modifications is key to assessing both the biological efficacy and toxicity of biologics and biosimilars [156]. TD has been increasingly exploited to meet these regulatory challenges. For example, Boyne and coworkers applied TD to evaluate PTM-profiles on FDA-approved and unapproved filgrastim therapeutics [156, 157]. Additionally, they defined sequence variations for different forms of herring protamine sulfate and low-abundance impurities in the complex drug product. Similar efforts have extended TD to antibody-based drugs, with investigators creating MS and MS/MS protocols for intact monoclonal IgGs (~150 kDa) or on the two light chains (Lc, ~25 kDa each) and two heavy chains (Hc, ~50 kDa each) independently. For example, Mazur et al. used SEC with CID and ETD to characterize IgG impurities (e.g., proteolytic breakdown products) [158]. Studies on intact antibodies with Orbitrap and FT-ICR show that ETD and ECD can provide ~33 % sequence coverage [159, 160]. Zhang et al. performed ISD on an intact antibody followed by CID of ISD fragments to improve sequence coverage to 46 % and 27 % for Lc and Hc, respectively [161]. LC–MS/MS has also been performed on individual Lc and Hc after offline disulfide bond reduction [162–164]. In another novel approach, Nicolardi et al. extended reduction
steps to online LC-MS, using an inline electrochemical reduction cell to systematically release interchain disulfide bonds and disassemble Lc and Hc from the full IgG1 mAb [165].

8.4.4 Membrane Proteins

Integral membrane proteins (IMPs) play key roles in transmembrane signaling and are thought to constitute approximately a third of the proteome [102, 111]. IMPs have challenged both TD and BU workflows because of their amphipathic characteristics (harboring polar soluble domains and hydrophobic transmembrane domains) and heterogeneous PTMs [166]. Methods for solubilization of IMPs often include SDS or Triton X-100 [167, 168]. To overcome the workflow mismatch, precipitation by chloroform/methanol/water or extraction with acetone is commonly used [169]. Similarly, other agents (e.g., urea, sodium deoxycholate, or acid-labile surfactants) also enhance solubility of IMPs and are compatible with SEC and RPLC workflows [170]. Other investigations have also shown IMP solubility is enhanced through >80 % formic acid or organic solvents with strong eluotropic strength (e.g., isopropanol) [167–169, 171]. For example, Doucette et al. recently examined the effectiveness of acetone and chloroform/methanol/water precipitation for SDS removal and found that both provided >90 % recovery when resolubilization of the precipitated proteins was performed in cold (21 °C) 80 % formic acid [172].

Recent investigations have demonstrated the scale at which polytopic IMPs can be interrogated by TD. For example, Carroll et al. used Q-TOF with CID to characterize several small proteolipids (1–4 transmembrane helices) and larger proteins (1–18 transmembrane helices) from bovine mitochondrial preparations [171]. LC-MS/MS with CAD or ETD was used to characterize eleven integral and five peripheral subunits of the 750 kDa Photosystem II (PSII) complex from the eukaryotic red alga, Galdieria sulphuraria [173], and numerous subunits of the cytochrome b6f complex from chloroplasts and cyanobacteria [174]. In another example, Catherman et al. enriched mitochondrial proteins from NCI H-1299 cells and used multidimensional separations to characterize over 300 IMPs with up to 12 transmembrane helices [175]. Of note, CID has been shown to preferentially fragment in transmembrane domains [176]. Additionally, it complements ETD or ECD (with or without vibrational activation by IRMPD [177]) for high amino acid resolution of PTMs over variegated inter-, trans-, or intra-cellular domains (e.g., proteolytic processing, phosphorylation, disulfide bonds, cysteinylation, heme-modification, pyroglutamate, acetylation, amidation, formylation, N6-retinylidene, etc.) [166, 174–176].

8.4.5 Hydrogen/Deuterium Exchange

Characterization of a protein’s structure is important for understanding its function. As proteins in solution unfold and refold, hydrogen bonds break and reform. The dynamics of these structural changes can be monitored through hydrogen/deuterium exchange (HDX) at solvent-exposed amides [178]. The conservation of HDX information in intact proteins via TD analysis offers distinct benefits over peptide-based assays where 10–50 % deuterium back-exchange often results during subsequent proteolytic processing. Key to the utility of TD in HDX is that ECD leads to low H/D “scrambling” along the amino acid backbone during gas-phase dissociation. Low scrambling preserves structural information associated with H/D positions [179, 180]. Wang et al. recently used HDX with ECD to gain conformer-specific information on non-native protein states of ubiquitin [181]. Similarly, Pan et al. performed HDX with short HPLC gradients to characterize the therapeutic protein interferon α2a, a cancer drug, and several variants [182]. Their methods provided insight into the protein’s primary structure including identification of preferential oxidation on methionine residues that led to distinct PTM-induced structural changes. To minimize back-exchange over extended periods, Amon et al. recently
devised a sub-zero cooled microchip for nESI showing <5 % back-exchange over 10 min [183]. Pan et al. also extended sub-zero temperature screens to HPLC runs showing ~2 % back-exchange over 10 min elution gradients when carrying out structural comparability tests on intact antibodies [184].

8.4.6 Native MS and Protein Complex Studies

Studies on protein macromolecular assemblies in their native state are key to understanding protein function. Numerous TD protocols for “native” MS have emerged to complement conventional structural biology techniques (e.g., NMR and X-ray crystallography), providing information on the complex’s protein composition, stoichiometry, spatial arrangements of subunits, assembly dynamics, or interactions with other ligands or metal ions [185]. To support these variegated efforts, researchers have developed various MS compatible buffers and additives that can preserve native assemblies during ESI and support both controlled disassembly of the complexes and denaturation of the proteins [185, 186].

A notable distinction between native MS and conventional protein MS is the relatively high m/z values observed for protein complexes as compared to individual denatured proteins. This is because charge incorporation during ESI on folded structures is largely restricted to surface sites. The need for high m/z sampling has made ESI-Q-TOF the mainstay for native MS because TOF permits sensitive analysis largely independent of m/z. For example, Zhou et al. performed native ESI on the 801 kDa chaperonin complex GroEL and used CID and surface-induced dissociation (SID) to show that ejection of highly charged monomers is the predominant dissociation pathway in CID while SID results in extensive dissociation into a wide variety of products (Fig. 8.6) [187]. Similarly, Snijder et al. recently used ESI-Q-TOF to study the 18 MDa capsid of bacteriophage HK97 [188], and Blackwell et al. used Q-TOF with CID and SID to dissociate the heterohexamer toyocamycin nitrile hydratase into monomers and subcomplexes [189]. These efforts highlight how the choice of dissociation mode and the extent of energy deposition dictate the degree of disassembly [178, 180, 187, 189].

Fig. 8.6 Native MS. (A) Collision induced dissociation of the +71 GroEL tetradecamer. (B) Surface-induced dissociation of GroEL tetradecamer +71. The inset spectrum is a zoom-in view of the region shaded in the middle of the full SID spectrum. Charge states of several peaks discussed are selectively labeled with the corresponding colors of the dots in the legend. Adapted with permission from Ref. [187]. Copyright 2013 by American Chemical Society

The improved mass range associated with FTMS technologies (e.g., high field Orbitrap,
superconducting magnets in excess of 15 T for FT-ICR, and expanded use of absorption-mode FT data-processing modes) [30, 190–197] has significant promise for native MS. For example, FT-ICR with CAD on complexes has been used to probe nucleotides and metal ligand-binding sites [198] while ETD and ECD were used to localize the position of bound ligands and the topology of protein complexes [199]. Similarly, Li et al. used a 15 T FT-ICR-MS with an infinity trap for studies on yeast and horse liver alcohol dehydrogenase (147 kDa) [200]. Their approach permitted the isotopic resolution of a yeast ADH (yADH) tetramer (147 kDa) and showed ~40 % sequence coverage could be obtained directly from the native yADH tetramer complex when using a combination of ECD, ISD, CAD, and IRMPD methods. Also, Skinner et al. recently adapted GELFrEE for native state size separations followed by MS/MS on an Orbitrap to characterize protein complexes from mouse heart tissue and fungal secretome of Trichoderma reesei [201].

8.4.7 Ion Mobility

Ion mobility (IM) separates ions based on their collision cross section (ratio of size-to-charge) prior to MS analysis [202]. Structure-based separations derive from low-energy collisions with a neutral drift gas in the presence of electrostatic or electrodynamic fields. Separations can occur through either spatial or temporal dispersion. Spatial dispersion is accomplished via differential mobility and field asymmetric ion mobility spectrometry (FAIMS), while time separations are accomplished with drift tube (DT) or traveling-wave (TW) formats.

Since IM occurs on the millisecond timeframe, it may be applied as an orthogonal post-LC separation strategy to improve dynamic range in proteomics. For example, IM has been incorporated into nano-LC-IM-MS/MS workflows to evaluate charge-state specific fragmentation tendencies, with results generally showing MS/MS on high charge unfolded ions leads to improved sequence coverage [203]. IM has also been integrated post dissociation to reduce spectral congestion by resolving overlapping product ion series prior to MS analysis [204]. For example, Zinnel et al. created a hybrid CID-IM-MS strategy showing a 2–10× increased sequence coverage for various peptides or proteins [205].

In studies that probe gas-phase conformations of intact proteins, results on ubiquitin show that solution structure is consistent with ions produced by ESI as long as experimental conditions avoid thermal unfolding or Coulomb repulsion-induced unfolding at higher charge [206, 207]. Continued improvement in IM resolving power is also providing insight into conformational flexibility of proteins. For example, Clemmer and coworkers applied a frequency-based linear DT (overtone mobility spectrometry) to sample continuous ion sources at resolving powers >100 [208]. Additionally, Shvartsburg has improved FAIMS resolving power to ~400 through the application of elevated electric fields and hydrogen-rich collision gases. This approach separated charge states of standard proteins (up to 30 kDa) into over 20 gas-phase conformers per charge state [209]. Continued application of high resolving power IM is expected to facilitate the interrogation of isobaric proteoforms through differences in gas-phase structure [210], as well as complement native MS investigations by monitoring disassembly of heterogeneous complexes [211].

IM is also being applied in mainstream biopharmaceutical and biomarker research. Escribano et al. used IM and TDMS to analyze the specificity and structural behavior of several protein-platinum (Pt) metallodrug adducts and to determine the primary binding site(s) in dicarboxylate Pt compounds [212]. Bowers and coworkers applied IM to characterize factors leading to aberrant aggregation in neurodegeneration and neuroinflammation, such as amyloid beta and tau [213]. Young et al. expanded upon these efforts by establishing a high-throughput small molecule screen to identify inhibitors of amyloid aggregation [214]. In related work, Beveridge et al. proposed a generalized IM-MS framework
that permits more accurate prediction of the extent of disorder in proteins when compared to conventional charge hydropathy plots [215].

8.4.8 Charge Manipulation

Charge Reduction. MS and MS/MS spectra are often populated by overlapping charge state distributions associated with simultaneous detection of proteins or fragment ions. The spectral congestion may lead to missed assignment of an ion’s charge, particularly under conditions with inadequate resolving power [140]. Gas-phase “proton-transfer” (PT) reaction methods seek to reduce spectral congestion by converting multiply-charged ions into readily interpreted mono- and di-protonated species [216–219]. For example, McLuckey and coworkers have charge-reduced intact proteins with anion-based perfluorocarbons [193]. Their approach applies an electrodynamic QIT which enables frequency-based accumulation of specific charge reduced states [220]. In similar efforts, they have shown charge reduction on CID fragments provides an order of magnitude improvement in informatics scores [221, 222] (Fig. 8.7A). Efforts by Smith and coworkers have shown that novel [204] Po α particle and corona discharge sources also permit controlled proton-reduction on mixtures of electrosprayed proteins [223, 224]. For omics investigations, Chi et al. have demonstrated that benzoate anions introduced post ETD enabled the identification of E. coli 70S ribosomal proteins during online LC-MS/MS on an LTQ [225]. Similar work by Huang et al. combined CID and ETD with ion/ion proton transfer reactions in a Q-TOF to characterize unknown proteins with novel PTMs from E. coli [36].

Super-Charging. Enhancement of gas-phase charge also has benefits in TD, including improved resolving power at lower m/z for most mass analyzers, improved sensitivity of charge-sensitive detection (e.g., FTMS), and enhanced fragmentation efficiency due to a more unfolded gas-phase structure [226]. Therefore, investigations have sought to identify agents that enhance protonation beyond that typically obtained by ESI. Early observations suggest that denaturing conditions in an ESI solution containing surfactants at low concentrations enhance charging (e.g., cationic, cetyl trimethylammonium bromide (CTAB) and zwitterionic, 3-(3-cholamidopropyl) dimethylammonio-1-propanesulfonate (CHAPS), and many nonionic saccharide-based detergents) [227]. More recently, Iavarone et al. showed that addition of glycerol and m-nitrobenzyl alcohol (m-NBA) can effectively “supercharge” proteins [228]. Other diverse organic compounds or common organic solvents (e.g., DMSO) also enhance charging in either direct infusion or LC-MS applications [186, 229, 230]. For example, Teo and Donald recently compared m-NBA, sulfolane, 2-nitroanisole, ethylene carbonate (EC) and propylene carbonate (PC), showing that the latter two promote charge states higher than the theoretical maximum predicted by proton-transfer reactivity (Fig. 8.7B) [226]. Williams and coworkers have recently introduced an electrothermal supercharging method that rapidly switches between native and denaturing conditions via changing ESI potentials, showing that inclusion of sulphate and phosphate anions increases protein charging from aqueous ammonium and sodium buffers [231, 232].

Fig. 8.7 Charge reduction and supercharging: (A) Collision activated dissociation of carbonic anhydrase [M + 32H]³²⁺ (top). Simultaneous CAD and ion/ion proton charge reduction reaction (middle) with the corresponding deconvoluted spectrum (bottom). Adapted with permission from Ref. [221]. Copyright 2009 by

8.5 Concluding Remarks

Advancements in chromatography, MS, and informatics have made “birds-eye-view” (top-down) proteomics increasingly available to the masses. Key to the field’s expansion is the high resolving power afforded by modern mass spectrometers, which provides unparalleled clarity on the microheterogeneity that exists in a proteome at the proteoform-level. TD has proven to be an exceptionally powerful resource for
hypothesis-driven research on defined protein targets. With technology advances showing duty cycles and sensitivities that surpass many conventional bioassays (gel electrophoresis or western blots), it is easy to envision molecular biologists applying TD in their daily screens. While omic-level screens are still largely done in dedicated research labs, more widespread implementation is expected with the continued refinement of procedures, particularly informatics for comprehensive proteoform studies as well as the training of a new generation of researchers on TD protocols.

Acknowledgement. This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number 1R01GM115739-01A1. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Institutes of Health. This work was also supported by the Multiple Sclerosis Society [PP-1503-04034], The Darrel K. Royal Research Fund for Alzheimer’s Disease [48680-DKR], The Texas Alzheimer’s Research and Care Consortium Investigator Grant Program [354091], and the UT System Neuroscience and Neurotechnology Research Institute [363027].

References

1. Lodish H, Berk A, Zipursky SL et al (2000) Molecular cell biology, 4th edn. W. H. Freeman, New York
2. Smith LM, Kelleher NL, Consortium for Top Down Proteomics (2013) Proteoform: a single term describing protein complexity. Nat Methods 10:186–187
3. Janke C, Bulinski JC (2011) Post-translational regulation of the microtubule cytoskeleton: mechanisms and functions. Nat Rev Mol Cell Biol 12:773–786
4. Young NL, DiMaggio PA, Plazas-Mayorca MD, Baliban RC, Floudas CA, Garcia BA (2009) High throughput characterization of combinatorial histone codes. Mol Cell Proteomics 8:2266–2284
5. Jahn O, Tenzer S, Werner HB (2009) Myelin proteomics: molecular anatomy of an insulating sheath. Mol Neurobiol 40:55–72, 2758371
6. Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422:198–207
7. Meyer B, Papasotiriou DG, Karas M (2011) 100 % protein sequence coverage: a modern form of surrealism in proteomics. Amino Acids 41:291–310
8. Cannon J, Lohnes K, Wynne C, Wang Y, Edwards N, Fenselau C (2010) High-throughput middle-down analysis using an orbitrap. J Proteome Res 9:3886–3890, PMC2917504
9. Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM (1989) Electrospray ionization for mass spectrometry of large biomolecules. Science 246:64–71
10. Karas M, Hillenkamp F (1988) Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem 60:2299–2301
11. Tanaka K, Waki H, Ido Y, Akita S, Yoshida Y, Yoshida T (1988) Protein and polymer analyses up to m/z 100 000 by laser ionization time-of-flight mass spectrometry. Rapid Commun Mass Spectrom 2:151–153
12. Emmett MR, Caprioli RM (1994) Micro-electrospray mass spectrometry: ultra-high-sensitivity analysis of peptides and proteins. J Am Soc Mass Spectrom 5:605–613
13. Valaskovic GA, Kelleher NL, Little DP, Aaserud DJ, McLafferty FW (1995) Attomole-sensitivity electrospray source for large-molecule mass spectrometry. Anal Chem 67:3802–3805
14. Marginean I, Kelly RT, Prior DC, LaMarche BL, Tang K, Smith RD (2008) Analytical characterization of the electrospray ion source in the nanoflow regime. Anal Chem 80:6573–6579, PMC2692497
15. Cech NB, Enke CG (2001) Practical implications of some recent studies in electrospray ionization fundamentals. Mass Spectrom Rev 20:362–387
16. Roth MJ, Plymire DA, Chang AN, Kim J, Maresh EM, Larson SE, Patrie SM (2011) Sensitive and reproducible intact mass analysis of complex protein mixtures with superficially porous capillary reversed-phase liquid chromatography mass spectrometry. Anal Chem 83:9586–9592
17. Needham SR, Valaskovic GA (2015) Microspray and microflow LC-MS/MS: the perfect fit for bioanalysis. Bioanalysis 7:1061–1064
18. Patrie SM, Ferguson JT, Robinson DE, Whipple D, Rother M, Metcalf WW, Kelleher NL (2006) Top down mass spectrometry of <60-kDa proteins from Methanosarcina acetivorans using quadrupole FTMS with automated octopole collisionally activated dissociation. Mol Cell Proteomics MCP 5:14–25

Fig. 8.7 (continued) American Chemical Society. (B) ESI mass spectra of 45/54/1 methanol/water/acetic acid solutions containing 10 μM Cytochrome C and no supercharging additive, 0.5 % m-nitrobenzyl alcohol, 1 % sulfolane, 3 % o-nitroanisole, 10 % ethylene carbonate, or 15 % propylene carbonate. Adapted with permission from Ref. [225]. Copyright 2014 by American Chemical Society
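
The way a change in charge state moves a protein's signal along the m/z axis follows from the relationship m/z = (M + z·m_p)/z for an [M + zH]z+ ion. The following is a minimal Python sketch added for illustration only. The neutral mass of 12,360 Da is an assumed round value, roughly the size of cytochrome c, and is not taken from the spectra in Fig. 8.7; the function names are hypothetical.

```python
# Illustrative sketch: m/z positions of [M + zH]z+ ions for one neutral mass.
# ILLUSTRATIVE_MASS is an assumption for demonstration, not a measured value.

PROTON_MASS = 1.007276  # Da, mass of a proton


def mz(neutral_mass: float, z: int) -> float:
    """Return m/z of an [M + zH]z+ ion for a neutral mass given in Da."""
    if z <= 0:
        raise ValueError("charge state z must be a positive integer")
    return (neutral_mass + z * PROTON_MASS) / z


def charge_envelope(neutral_mass: float, z_min: int, z_max: int) -> dict:
    """Map each charge state in [z_min, z_max] to its m/z value."""
    return {z: mz(neutral_mass, z) for z in range(z_min, z_max + 1)}


ILLUSTRATIVE_MASS = 12360.0  # Da, hypothetical protein mass

if __name__ == "__main__":
    for z, value in charge_envelope(ILLUSTRATIVE_MASS, 8, 20).items():
        print(f"z = {z:2d}   m/z = {value:8.2f}")
```

Under these assumptions, raising the charge from 8+ to 20+ moves the ion from about m/z 1546 to about m/z 619, which is the direction of the shift expected from supercharging; charge reduction moves ions the opposite way.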

19. Bielow C, Aiche S, Andreotti S, Reinert K (2011) MSSimulator: simulation of mass spectrometry data. J Proteome Res 10:2922–2929
20. Marshall AG, Hendrickson CL, Jackson GS (1998) Fourier transform ion cyclotron resonance mass spectrometry: a primer. Mass Spectrom Rev 17:1–35
21. Hu Q, Noll RJ, Li H, Makarov A, Hardman M, Cooks RG (2005) The orbitrap: a new mass spectrometer. J Mass Spectrom 40:430–443
22. Valeja SG, Kaiser NK, Xian F, Hendrickson CL, Rouse JC, Marshall AG (2011) Unit mass baseline resolution for an intact 148 kDa therapeutic monoclonal antibody by Fourier transform ion cyclotron resonance mass spectrometry. Anal Chem 83:8391–8395
23. Williams DK, Muddiman DC (2007) Parts-per-billion mass measurement accuracy achieved through the combination of multiple linear regression and automatic gain control in a Fourier transform ion cyclotron resonance mass spectrometer. Anal Chem 79:5058–5063
24. Smith RD, Anderson GA, Lipton MS, Pasa-Tolic L, Shen Y, Conrads TP, Veenstra TD, Udseth HR (2002) An accurate mass tag strategy for quantitative and high-throughput proteome measurements. Proteomics 2:513–523
25. Conrads TP, Anderson GA, Veenstra TD, Pasa-Tolic L, Smith RD (2000) Utility of accurate mass tags for proteome-wide protein identification. Anal Chem 72:3349–3354
26. Rhoads TW, Rose CM, Bailey DJ, Riley NM, Molden RC, Nestler AJ, Merrill AE, Smith LM, Hebert AS, Westphall MS, Pagliarini DJ, Garcia BA, Coon JJ (2014) Neutron-encoded mass signatures for quantitative top-down proteomics. Anal Chem 86:2314–2319
27. Mao Y, Zamdborg L, Kelleher NL, Hendrickson CL, Marshall AG (2011) Identification of phosphorylated human peptides by accurate mass measurement alone. Int J Mass Spectrom 308:357–361
28. Senko MW, Hendrickson CL, Pasa-Tolic L, Marto JA, White FM, Guan S, Marshall AG (1996) Electrospray ionization Fourier transform ion cyclotron resonance at 9.4 T. Rapid Commun Mass Spectrom 10:1824–1828
29. Glish GL, Burinsky DJ (2008) Hybrid mass spectrometers for tandem mass spectrometry. J Am Soc Mass Spectrom 19:161–172
30. Patrie SM, Charlebois JP, Whipple D, Kelleher NL, Hendrickson CL, Quinn JP, Marshall AG, Mukhopadhyay B (2004) Construction of a hybrid quadrupole/Fourier transform ion cyclotron resonance mass spectrometer for versatile MS/MS above 10 kDa. J Am Soc Mass Spectrom 15:1099–1108
31. Michalski A, Damoc E, Lange O, Denisov E, Nolting D, Müller M, Viner R, Schwartz J, Remes P, Belford M, Dunyach J-J, Cox J, Horning S, Mann M, Makarov A (2012) Ultra high resolution linear ion trap orbitrap mass spectrometer (orbitrap elite) facilitates top down LC MS/MS and versatile peptide fragmentation modes. Mol Cell Proteomics MCP 11, O111.013698
32. Tran JC, Zamdborg L, Ahlf DR, Lee JE, Catherman AD, Durbin KR, Tipton JD, Vellaichamy A, Kellie JF, Li MX, Wu C, Sweet SMM, Early BP, Siuti N, LeDuc RD, Compton PD, Thomas PM, Kelleher NL (2011) Mapping intact protein isoforms in discovery mode using top-down proteomics. Nature 480:254, U141
33. Garcia BA, Pesavento JJ, Mizzen CA, Kelleher NL (2007) Pervasive combinatorial modification of histone H3 in human cells. Nat Methods 4:487–489
34. Pesavento JJ, Kim YB, Taylor GK, Kelleher NL (2004) Shotgun annotation of histone modifications: a new approach for streamlined characterization of proteins by top down mass spectrometry. J Am Chem Soc 126:3386–3387
35. Ginter JM, Zhou F, Johnston MV (2004) Generating protein sequence tags by combining cone and conventional collision induced dissociation in a quadrupole time-of-flight mass spectrometer. J Am Soc Mass Spectrom 15:1478–1486
36. Huang TY, McLuckey SA (2010) Top-down protein characterization facilitated by ion/ion reactions on a quadrupole/time of flight platform. Proteomics 10:3577–3588
37. Madsen JA, Gardner MW, Smith SI, Ledvina AR, Coon JJ, Schwartz JC, Stafford GC, Brodbelt JS (2009) Top-down protein fragmentation by infrared multiphoton dissociation in a dual pressure linear ion trap. Anal Chem 81:8677–8686
38. Sleno L, Volmer DA (2004) Ion activation methods for tandem mass spectrometry. J Mass Spectrom 39:1091–1112
39. Wells JM, McLuckey SA (2005) Collision-induced dissociation (CID) of peptides and proteins. Methods Enzymol 402:148–185
40. Roepstorff P, Fohlman J (1984) Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11:601
41. Biemann K (1988) Contributions of mass spectrometry to peptide and protein structure. Biomed Environ Mass Spectrom 16:99–111
42. Bean MF, Carr SA, Thorne GC, Reilly MH, Gaskell SJ (1991) Tandem mass spectrometry of peptides using hybrid and four-sector instruments: a comparative study. Anal Chem 63:1473–1481
43. Olsen JV, Macek B, Lange O, Makarov A, Horning S, Mann M (2007) Higher-energy C-trap dissociation for peptide modification analysis. Nat Methods 4:709–712
44. Little DP, Speir JP, Senko MW, O’Connor PB, McLafferty FW (1994) Infrared multiphoton dissociation of large multiply charged ions for biomolecule sequencing. Anal Chem 66:2809–2815
45. Shaw JB, Li WZ, Holden DD, Zhang Y, Griep-Raming J, Fellers RT, Early BP, Thomas PM,
Kelleher NL, Brodbelt JS (2013) Complete protein characterization using top-down mass spectrometry and ultraviolet photodissociation. J Am Chem Soc 135:12646–12651
46. Madsen JA, Boutz DR, Brodbelt JS (2010) Ultrafast ultraviolet photodissociation at 193 nm and its applicability to proteomic workflows. J Proteome Res 9:4205–4214
47. McLafferty FW, Horn DM, Breuker K, Ge Y, Lewis MA, Cerda B, Zubarev RA, Carpenter BK (2001) Electron capture dissociation of gaseous multiply charged ions by Fourier-transform ion cyclotron resonance. J Am Soc Mass Spectrom 12:245–249
48. Mikesh LM, Ueberheide B, Chi A, Coon JJ, Syka JE, Shabanowitz J, Hunt DF (2006) The utility of ETD mass spectrometry in proteomic analysis. Biochim Biophys Acta 1764:1811–1822
49. Hakansson K, Chalmers MJ, Quinn JP, McFarland MA, Hendrickson CL, Marshall AG (2003) Combined electron capture and infrared multiphoton dissociation for multistage MS/MS in a Fourier transform ion cyclotron resonance mass spectrometer. Anal Chem 75:3256–3262
50. Wuhrer M, Catalina MI, Deelder AM, Hokke CH (2007) Glycoproteomics based on tandem mass spectrometry of glycopeptides. J Chromatogr B Anal Technol Biomed Life Sci 849:115–128
51. Boersema PJ, Mohammed S, Heck AJ (2009) Phosphopeptide fragmentation and analysis by mass spectrometry. J Mass Spectrom 44:861–878
52. Patrie SM, Robinson DE, Meng F, Du Y, Kelleher NL (2004) Strategies for automating top-down protein analysis with Q-FTICR MS. Int J Mass Spectrom 234:175–184
53. Hakansson K, Cooper HJ, Emmett MR, Costello CE, Marshall AG, Nilsson CL (2001) Electron capture dissociation and infrared multiphoton dissociation MS/MS of an N-glycosylated tryptic peptic to yield complementary sequence information. Anal Chem 73:4530–4536
54. Catalina MI, Koeleman CA, Deelder AM, Wuhrer M (2007) Electron transfer dissociation of N-glycopeptides: loss of the entire N-glycosylated asparagine side chain. Rapid Commun Mass Spectrom 21:1053–1061
55. Wu SL, Huhmer AF, Hao Z, Karger BL (2007) On-line LC-MS approach combining collision-induced dissociation (CID), electron-transfer dissociation (ETD), and CID of an isolated charge-reduced species for the trace-level characterization of proteins with post-translational modifications. J Proteome Res 6:4230–4244, 2557440
56. Wenger CD, Boyne MT, Ferguson JT, Robinson DE, Kelleher NL (2008) Versatile online-offline engine for automated acquisition of high-resolution tandem mass spectra. Anal Chem 80:8055–8063
57. Rozman M, Gaskell SJ (2012) Charge state dependent top-down characterisation using electron transfer dissociation. Rapid Commun Mass Spectrom 26:282–286
58. Tipton JD, Tran JC, Catherman AD, Ahlf DR, Durbin KR, Lee JE, Kellie JF, Kelleher NL, Hendrickson CL, Marshall AG (2012) Nano-LC FTICR tandem mass spectrometry for top-down proteomics: routine baseline unit mass resolution of whole cell lysate proteins up to 72 kDa. Anal Chem 84:2111–2117
59. Zhang ZQ, Marshall AG (1998) A universal algorithm for fast and automated charge state deconvolution of electrospray mass-to-charge ratio spectra. J Am Soc Mass Spectrom 9:225–233
60. Horn DM, Zubarev RA, McLafferty FW (2000) Automated de novo sequencing of proteins by tandem high-resolution mass spectrometry. Proc Natl Acad Sci USA 97:10313–10317
61. Liu X, Inbar Y, Dorrestein PC, Wynne C, Edwards N, Souda P, Whitelegge JP, Bafna V, Pevzner PA (2010) Deconvolution and database search of complex tandem mass spectra of intact proteins: a combinatorial approach. Mol Cell Proteomics MCP 9:2772–2782, 3101958
62. Frank A, Tanner S, Bafna V, Pevzner P (2005) Peptide sequence tags for fast database search in mass-spectrometry. J Proteome Res 4:1287–1295
63. Sheng QH, Xie T, Ding DF (2000) De novo interpretation of MS/MS spectra and protein identification via database searching. Sheng wu hua xue yu sheng wu wu li xue bao Acta biochimica et biophysica Sinica 32:595–600
64. Liu X, Dekker LJ, Wu S, Vanduijn MM, Luider TM, Tolic N, Kou Q, Dvorkin M, Alexandrova S, Vyatkina K, Pasa-Tolic L, Pevzner PA (2014) De novo protein sequencing by combining top-down and bottom-up tandem mass spectra. J Proteome Res 13:3241–3248
65. Beavis R, Fenyö D (2004) Finding protein sequences using PROWL. In: Current protocols in bioinformatics. Wiley, New York
66. Bourgoin-Voillard S, Leymarie N, Costello CE (2014) Top-down tandem mass spectrometry on RNase A and B using a Qh/FT-ICR hybrid mass spectrometer. Proteomics 14:1174–1184, 4095805
67. Guner H, Close PL, Cai WX, Zhang H, Peng Y, Gregorich ZR, Ge Y (2014) MASH suite: a user-friendly and versatile software interface for high-resolution mass spectrometry data interpretation and visualization. J Am Soc Mass Spectrom 25:464–470
68. Fellers RT, Greer JB, Early BP, Yu X, LeDuc RD, Kelleher NL, Thomas PM (2014) ProSight Lite: Graphical software to analyze top-down mass spectrometry data. Proteomics
69. LeDuc RD, Taylor GK, Kim YB, Januszyk TE, Bynum LH, Sola JV, Garavelli JS, Kelleher NL (2004) ProSight PTM: an integrated environment for protein identification and characterization by
top-down mass spectrometry. Nucleic Acids Res 32:W340–W345
70. Meng F, Cargile BJ, Miller LM, Forbes AJ, Johnson JR, Kelleher NL (2001) Informatics and multiplexing of intact protein identification in bacteria and the archaea. Nat Biotechnol 19:952–957
71. LeDuc RD, Fellers RT, Early BP, Greer JB, Thomas PM, Kelleher NL (2014) The C-score: a Bayesian framework to sharply improve proteoform scoring in high-throughput top down proteomics. J Proteome Res 13:3231–3240, 4084843
72. Frank AM, Pesavento JJ, Mizzen CA, Kelleher NL, Pevzner PA (2008) Interpreting top-down mass spectra using spectral alignment. Anal Chem 80:2499–2505
73. Liu XW, Hengel S, Wu S, Tolic N, Pasa-Tolic L, Pevzner PA (2013) Identification of ultramodified proteins using top-down tandem mass spectra. J Proteome Res 12:5830–5838
74. Tsai YS, Scherl A, Shaw JL, MacKay CL, Shaffer SA, Langridge-Smith PR, Goodlett DR (2009) Precursor ion independent algorithm for top-down shotgun proteomics. J Am Soc Mass Spectrom 20:2154–2166
75. Karabacak NM, Li L, Tiwari A, Hayward LJ, Hong P, Easterling ML, Agar JN (2009) Sensitive and specific identification of wild type and variant proteins from 8 to 669 kDa using top-down mass spectrometry. Mol Cell Proteomics 8:846–856
76. Compton PD, Zamdborg L, Thomas PM, Kelleher NL (2011) On the scalability and requirements of whole protein mass spectrometry. Anal Chem 83:6868–6874, 3165072
77. Vellaichamy A, Tran JC, Catherman AD, Lee JE, Kellie JF, Sweet SM, Zamdborg L, Thomas PM, Ahlf DR, Durbin KR, Valaskovic GA, Kelleher NL (2010) Size-sorting combined with improved nanocapillary liquid chromatography-mass spectrometry for identification of intact proteins up to 80 kDa. Anal Chem 82:1234–1244, 2823583
78. MacNair JE, Opiteck GJ, Jorgenson JW, Moseley MA 3rd (1997) Rapid separation and characterization of protein and peptide mixtures using 1.5 microns diameter non-porous silica in packed capillary liquid chromatography/mass spectrometry. Rapid Commun Mass Spectrom 11:1279–1285
79. Ansong C, Wu S, Meng D, Liu X, Brewer HM, Deatherage Kaiser BL, Nakayasu ES, Cort JR, Pevzner P, Smith RD, Heffron F, Adkins JN, Pasa-Tolic L (2013) Top-down proteomics reveals a unique protein S-thiolation switch in Salmonella Typhimurium in response to infection-like conditions. Proc Natl Acad Sci USA 110:10153–10158, 3690903
80. Eeltink S, Wouters B, Desmet G, Ursem M, Blinco D, Kemp GD, Treumann A (2011) High-resolution separations of protein isoforms with liquid chromatography time-of-flight mass spectrometry using polymer monolithic capillary columns. J Chromatogr A 1218:5504–5511
81. Pierri G, Kotoni D, Simone P, Villani C, Pepe G, Campiglia P, Dugo P, Gasparrini F (2013) Analysis of bovine milk caseins on organic monolithic columns: An integrated capillary liquid chromatography-high resolution mass spectrometry approach for the study of time-dependent casein degradation. J Chromatogr A 1313:259–269
82. Lakshmanan R, Wolff JJ, Alvarado R, Loo JA (2014) Top-down protein identification of proteasome proteins with nanoLC-FT-ICR-MS employing data-independent fragmentation methods. Proteomics 14:1271–1282
83. Everley RA, Croley TR (2008) Ultra-performance liquid chromatography/mass spectrometry of intact proteins. J Chromatogr A 1192:239–247
84. Wu Z, Wei B, Zhang X, Wirth MJ (2014) Efficient separations of intact proteins using slip-flow with nano-liquid chromatography-mass spectrometry. Anal Chem 86:1592–1598, 3982985
85. Fekete S, Beck A, Veuthey JL, Guillarme D (2015) Ion-exchange chromatography for the characterization of biopharmaceuticals. J Pharm Biomed Anal 113:43
86. Sharma S, Simpson DC, Tolic N, Jaitly N, Mayampurath AM, Smith RD, Pasa-Tolic L (2007) Proteomic profiling of intact proteins using WAX-RPLC 2-D separations and FTICR mass spectrometry. J Proteome Res 6:602–610
87. Roth MJ, Parks BA, Ferguson JT, Boyne MT 2nd, Kelleher NL (2008) “Proteotyping”: population proteomics of human leukocytes using top down mass spectrometry. Anal Chem 80:2857–2866, 2615201
88. Millea KM, Krull IS, Cohen SA, Gebler JC, Berger SJ (2006) Integration of multidimensional chromatographic protein separations with a combined “top-down” and “bottom-up” proteomic strategy. J Proteome Res 5:135–146
89. Bunger MK, Cargile BJ, Ngunjiri A, Bundy JL, Stephenson JL Jr (2008) Automated proteomics of E. coli via top-down electron-transfer dissociation mass spectrometry. Anal Chem 80:1459–1467
90. Buszewski B, Noga S (2012) Hydrophilic interaction liquid chromatography (HILIC)–a powerful separation technique. Anal Bioanal Chem 402:231–247, PMC3249561
91. Pesavento JJ, Bullock CR, LeDuc RD, Mizzen CA, Kelleher NL (2008) Combinatorial modification of human histone H4 quantitated by two-dimensional liquid chromatography coupled with top down mass spectrometry. J Biol Chem 283:14927–14937, 2397456
92. Tian Z, Zhao R, Tolic N, Moore RJ, Stenoien DL, Robinson EW, Smith RD, Pasa-Tolic L (2010) Two-dimensional liquid chromatography system for online top-down mass spectrometry. Proteomics 10:3610–3620, 3010896
93. Tian Z, Tolic N, Zhao R, Moore RJ, Hengel SM, Robinson EW, Stenoien DL, Wu S, Smith RD, Pasa-Tolic L (2012) Enhanced top-down characterization of histone post-translational modifications. Genome Biol 13:R86, 3491414
94. Haselberg R, de Jong GJ, Somsen GW (2011) Capillary electrophoresis-mass spectrometry for the analysis of intact proteins 2007–2010. Electrophoresis 32:66–82
95. Haselberg R, de Jong GJ, Somsen GW (2007) Capillary electrophoresis-mass spectrometry for the analysis of intact proteins. J Chromatogr A 1159:81–109
96. Simpson DC, Ahn S, Pasa-Tolic L, Bogdanov B, Mottaz HM, Vilkov AN, Anderson GA, Lipton MS, Smith RD (2006) Using size exclusion chromatography-RPLC and RPLC-CIEF as two-dimensional separation strategies for protein profiling. Electrophoresis 27:2722–2733
97. Sun L, Knierman MD, Zhu G, Dovichi NJ (2013) Fast top-down intact protein characterization with capillary zone electrophoresis-electrospray ionization tandem mass spectrometry. Anal Chem 85:5989–5995, 3770260
98. Zhao Y, Sun L, Champion MM, Knierman MD, Dovichi NJ (2014) Capillary zone electrophoresis-electrospray ionization-tandem mass spectrometry for top-down characterization of the Mycobacterium marinum secretome. Anal Chem 86:4873–4878, 4033641
99. Li Y, Compton PD, Tran JC, Ntai I, Kelleher NL (2014) Optimizing capillary electrophoresis for top-down proteomics of 30–80 kDa proteins. Proteomics 14:1158–1164, PMC4034378
100. Haselberg R, de Jong GJ, Somsen GW (2013) Low-flow sheathless capillary electrophoresis-mass spectrometry for sensitive glycoform profiling of intact pharmaceutical proteins. Anal Chem 85:2289–2296
101. Han X, Wang Y, Aslanian A, Fonslow B, Graczyk B, Davis TN, Yates JR 3rd (2014) In-line separation by capillary electrophoresis prior to analysis by top-down mass spectrometry enables sensitive characterization of protein complexes. J Proteome Res 13:6078–6086, 4262260
102. Whitelegge JP, Laganowsky A, Nishio J, Souda P, Zhang HM, Cramer WA (2006) Sequencing covalent modifications of membrane proteins. J Exp Bot 57:1515–1522
103. Yan F, Subramanian B, Nakeff A, Barder TJ, Parus SJ, Lubman DM (2003) A comparison of drug-treated and untreated HCT-116 human colon adenocarcinoma cells using a 2-D liquid separation mapping method based upon chromatofocusing PI fractionation. Anal Chem 75:2299–2308
104. Stoyanov A (2012) IEF-based multidimensional applications in proteomics: toward higher resolution. Electrophoresis 33:3281–3290
105. Zhang J, Roth MJ, Chang AN, Plymire DA, Corbett JR, Greenberg BM, Patrie SM (2013) Top-down mass spectrometry on tissue extracts and biofluids with isoelectric focusing and superficially porous silica liquid chromatography. Anal Chem 85:10377–10384
106. Zhang JM, Corbett JR, Plymire DA, Greenberg BM, Patrie SM (2014) Proteoform analysis of lipocalin-type prostaglandin D-synthase from human cerebrospinal fluid by isoelectric focusing and superficially porous liquid chromatography with Fourier transform mass spectrometry. Proteomics 14:1223–1231
107. Jensen PK, Pasa-Tolic L, Peden KK, Martinovic S, Lipton MS, Anderson GA, Tolic N, Wong KK, Smith RD (2000) Mass spectrometric detection for capillary isoelectric focusing separations of complex protein mixtures. Electrophoresis 21:1372–1380
108. Arentz G, Weiland F, Oehler MK, Hoffmann P (2015) State of the art of 2D DIGE. Proteomics Clin Appl 9:277–288
109. Uliyanchenko E (2014) Size-exclusion chromatography – from high-performance to ultra-performance. Anal Bioanal Chem 406:6087–6094
110. Zabrouskov V, Giacomelli L, van Wijk KJ, McLafferty FW (2003) New approach for plant proteomics – characterization of chloroplast proteins of Arabidopsis thaliana by top-down mass spectrometry. Mol Cell Proteomics 2:1253–1260
111. Whitelegge J (2005) Tandem mass spectrometry of integral membrane proteins for top-down proteomics. TrAC-Trends Anal Chem 24:576–582
112. Mazur MT, Seipert RS, Mahon D, Zhou Q, Liu T (2012) A platform for characterizing therapeutic monoclonal antibody breakdown products by 2D chromatography and top-down mass spectrometry. AAPS J 14
113. Pan J, Han J, Borchers CH, Konermann L (2012) Structure and dynamics of small soluble Abeta (1–40) oligomers studied by top-down hydrogen exchange mass spectrometry. Biochemistry 51:3694–3703
114. Chen X, Ge Y (2013) Ultrahigh pressure fast size exclusion chromatography for top-down proteomics. Proteomics 13:2563–2566
115. Meng F, Cargile BJ, Patrie SM, Johnson JR, McLoughlin SM, Kelleher NL (2002) Processing complex mixtures of intact proteins for direct analysis by mass spectrometry. Anal Chem 74:2923–2929
116. Tran JC, Doucette AA (2009) Multiplexed size separation of intact proteins in solution phase for mass spectrometry. Anal Chem 81:6201–6209
117. Root BE, Zhang B, Barron AE (2009) Size-based protein separations by microchip electrophoresis using an acid-labile surfactant as a replacement for SDS. Electrophoresis 30:2117–2122
118. Garcia BA, Thomas CE, Kelleher NL, Mizzen CA (2008) Tissue-specific expression and post-translational modification of histone H3 variants. J Proteome Res 7:4225–4236
119. Boyne MT 2nd, Pesavento JJ, Mizzen CA, Kelleher NL (2006) Precise characterization of human histones in the H2A gene family by top down mass spectrometry. J Proteome Res 5:248–253
120. Dang X, Scotcher J, Wu S, Chu RK, Tolic N, Ntai I, Thomas PM, Fellers RT, Early BP, Zheng Y, Durbin KR, Leduc RD, Wolff JJ, Thompson CJ, Pan J, Han J, Shaw JB, Salisbury JP, Easterling M, Borchers CH, Brodbelt JS, Agar JN, Pasa-Tolic L, Kelleher NL, Young NL (2014) The first pilot project of the consortium for top-down proteomics: a status report. Proteomics 14:1130–1140, 4146406
121. Zhang C, Walker AK, Zand R, Moscarello MA, Yan JM, Andrews PC (2012) Myelin basic protein undergoes a broader range of modifications in mammals than in lower vertebrates. J Proteome Res 11:4791–4802, 3612544
122. Nepomuceno AI, Mason CJ, Muddiman DC, Bergen HR, Zeldenrust SR (2004) Detection of genetic variants of transthyretin by liquid chromatography-dual electrospray ionization Fourier-transform ion-cyclotron-resonance mass spectrometry. Clin Chem 50:1535–1543
123. Theberge R, Infusini G, Tong W, McComb ME, Costello CE (2011) Top-down analysis of small plasma proteins using an LTQ-orbitrap. Potential for mass spectrometry-based clinical assays for transthyretin and hemoglobin. Int J Mass Spectrom 300:130–142, 3098445
124. Edwards RL, Griffiths P, Bunch J, Cooper HJ (2012) Top-down proteomics and direct surface sampling of neonatal dried blood spots: diagnosis of unknown hemoglobin variants. J Am Soc Mass Spectrom 23:1921–1930
125. Mao P, Wang D (2014) Top-down proteomics of a drop of blood for diabetes monitoring. J Proteome Res 13:1560–1569, 3993886
126. Sarsby J, Martin NJ, Lalor PF, Bunch J, Cooper HJ (2014) Top-down and bottom-up identification of proteins by liquid extraction surface analysis mass spectrometry of healthy and diseased human liver tissue. J Am Soc Mass Spectrom 25:1953–1961, 4197381
127. Zabrouskov V, Han XM, Welker E, Zhai HL, Lin C, van Wijk KJ, Scheraga HA, McLafferty FW (2006) Stepwise deamidation of ribonuclease A at five sites determined by top down mass spectrometry. Biochemistry 45:987–992
128. Li X, Yu X, Costello CE, Lin C, O’Connor PB (2012) Top-down study of beta(2)-microglobulin deamidation. Anal Chem 84:6150–6157
129. Lourette N, Smallwood H, Wu S, Robinson EW, Squier TC, Smith RD, Pasa-Tolic L (2010) A top-down LC-FTICR MS-based strategy for characterizing oxidized calmodulin in activated macrophages. J Am Soc Mass Spectrom 21:930–939
130. Zabrouskov V, Ge Y, Schwartz J, Walker JW (2008) Unraveling molecular complexity of phosphorylated human cardiac troponin I by top down electron capture dissociation/electron transfer dissociation mass spectrometry. Mol Cell Proteomics 7:1838–1849
131. Zhang J, Guy MJ, Norman HS, Chen YC, Xu QG, Dong XT, Guner H, Wang SJ, Kohmoto T, Young KH, Moss RL, Ge Y (2011) Top-down quantitative proteomics identified phosphorylation of cardiac troponin I as a candidate biomarker for chronic heart failure. J Proteome Res 10:4054–4065
132. Dong X, Sumandea CA, Chen YC, Garcia-Cazarin ML, Zhang J, Balke CW, Sumandea MP, Ge Y (2012) Augmented phosphorylation of cardiac troponin I in hypertensive heart failure. J Biol Chem 287:848–857, 3256890
133. Burnaevskiy N, Fox TG, Plymire DA, Ertelt JM, Weigele BA, Selyunin AS, Way SS, Patrie SM, Alto NM (2013) Proteolytic elimination of N-myristoyl modifications by the Shigella virulence factor IpaJ. Nature 496:106–109, 3722872
134. Twine SM, Reid CW, Aubry A, McMullin DR, Fulton KM, Austin J, Logan SM (2009) Motility and flagellar glycosylation in Clostridium difficile. J Bacteriol 191:7050–7062
135. Chamot-Rooke J, Rousseau B, Lanternier F, Mikaty G, Mairey E, Malosse C, Bouchoux G, Pelicic V, Camoin L, Nassif X, Dumenil G (2007) Alternative Neisseria spp. type IV pilin glycosylation with a glyceramido acetamido trideoxyhexose residue. Proc Natl Acad Sci USA 104:14783–14788
136. Twine SM, Paul CJ, Vinogradov E, McNally DJ, Brisson JR, Mullen JA, McMullin DR, Jarrell HC, Austin JW, Kelly JF, Logan SM (2008) Flagellar glycosylation in Clostridium botulinum. FEBS J 275:4428–4444
137. Wagner-Rousset E, Bednarczyk A, Bussat MC, Colas O, Corvaia N, Schaeffer C, Van Dorsselaer A, Beck A (2008) The way forward, enhanced characterization of therapeutic antibody glycosylation: comparison of three level mass spectrometry-based strategies. J Chromatogr B 872:23–37
138. Reid GE, Stephenson JL, McLuckey SA (2002) Tandem mass spectrometry of ribonuclease A and B: N-linked glycosylation site analysis of whole protein ions. Anal Chem 74:577–583
139. Mazur MT, Cardasis HL, Spellman DS, Liaw A, Yates NA, Hendrickson RC Quantitative analysis of intact apolipoproteins in human HDL by top-down differential mass spectrometry. Proc Natl Acad Sci USA 107:7728–7733
140. Reid GE, McLuckey SA (2002) ‘Top down’ protein characterization via tandem mass spectrometry. J Mass Spectrom 37:663–675
141. Friedman DB, Andacht TM, Bunger MK, Chien AS, Hawke DH, Krijgsveld J, Lane WS, Lilley KS, MacCoss MJ, Moritz RL, Settlage RE, Sherman NE, Weintraub ST, Witkowska HE, Yates NA, Turck CW (2011) The ABRF Proteomics Research Group studies: educational exercises for qualitative
and quantitative proteomic analyses. Proteomics 152. Veenstra TD, Martinovic S, Anderson GA, Pasa-
11:1371–1381 Tolic L, Smith RD (2000) Proteome analysis using
142. Leymarie N, Griffin PJ, Jonscher K, Kolarich D, selective incorporation of isotopically labeled amino
Orlando R, McComb M, Zaia J, Aguilan J, Alley acids. J Am Soc Mass Spectrom 11:78–82
WR, Altmann F, Ball LE, Basumallick L, Bazemore- 153. Parks BA, Jiang L, Thomas PM, Wenger CD, Roth
Walker CR, Behnken H, Blank MA, Brown KJ, MJ, Boyne MT 2nd, Burke PV, Kwast KE, Kelleher
Bunz SC, Cairo CW, Cipollo JF, Daneshfar R, NL (2007) Top-down proteomics on a chro-
Desaire H, Drake RR, Go EP, Goldman R, matographic time scale using linear ion trap fourier
Gruber C, Halim A, Hathout Y, Hensbergen PJ, transform hybrid mass spectrometers. Anal Chem
Horn DM, Hurum D, Jabs W, Larson G, Ly M, 79:7984–7991, 2361135
Mann BF, Marx K, Mechref Y, Meyer B, 154. Pesavento JJ, Yang H, Kelleher NL, Mizzen CA
Moginger U, Neusubeta C, Nilsson J, Novotny MV, (2008) Certain and progressive methylation of his-
Nyalwidhe JO, Packer NH, Pompach P, Reiz B, tone H4 at lysine 20 during the cell cycle. Mol Cell
Resemann A, Rohrer JS, Ruthenbeck A, Sanda M, Biol 28:468–486
Schulz JM, Schweiger-Hufnagel U, Sihlbom C, 155. Collier TS, Sarkar P, Rao B, Muddiman
Song E, Staples GO, Suckau D, Tang H, Thaysen- DC. Quantitative top-down proteomics of SILAC
Andersen M, Viner RI, An Y, Valmu L, Wada Y, labeled human embryonic stem cells. J Am Soc
Watson M, Windwarder M, Whittal R, Wuhrer M, Mass Spectrom 21:879–889
Zhu Y, Zou C (2013) Interlaboratory study on dif- 156. Levy MJ, Gucinski AC, Sommers CD, Ghasriani H,
ferential analysis of protein glycosylation by mass Wang B, Keire DA, Boyne MT 2nd (2014) Analyti-
spectrometry: the ABRF glycoprotein research cal techniques and bioactivity assays to compare the
multi-institutional study 2012. Mol Cell Proteomics: structure and function of filgrastim (granulocyte-
MCP 12:2935–2951, 3790302 colony stimulating factor) therapeutics from differ-
143. Collier TS, Muddiman DC (2012) Analytical ent manufacturers. Anal Bioanal Chem
strategies for the global quantification of intact 406:6559–6567
proteins. Amino Acids 43:1109 157. Gucinski AC, Boyne MT 2nd (2014) Identification
144. Gucinski AC, Boyne MT 2nd (2012) Evaluation of of site-specific heterogeneity in peptide drugs using
intact mass spectrometry for the quantitative analysis intact mass spectrometry with electron transfer dis-
of protein therapeutics. Anal Chem 84:8045–8051 sociation. Rapid Commun Mass Spectrom
145. Julka S, Folkenroth J, Young SA (2011) Two dimen- 28:1757–1763
sional liquid chromatography-ultraviolet/mass spec- 158. Mazur MT, Seipert RS, Mahon D, Zhou Q, Liu T
trometric (2DLC-UV/MS) analyses for quantitation (2012) A platform for characterizing therapeutic
of intact proteins in complex biological matrices. J monoclonal antibody breakdown products by 2D
Chromatogr B 879:2057–2063 chromatography and top-down mass spectrometry.
146. Mazur MT, Fyhr R (2011) An algorithm for AAPS J 14:530–541, 3385834
identifying multiply modified endogenous proteins 159. Fornelli L, Damoc E, Thomas PM, Kelleher NL,
using both full-scan and high-resolution tandem Aizikov K, Denisov E, Makarov A, Tsybin YO
mass spectrometric data. Rapid Commun Mass (2012) Analysis of intact monoclonal antibody
Spectrom 25:3617–3626 IgG1 by electron transfer dissociation Orbitrap
147. Ntai I, Kim K, Fellers RT, Skinner OS, Smith AD, FTMS. Mol Cell Proteomics: MCP 11:1758–1767,
Early BP, Savaryn JP, LeDuc RD, Thomas PM, 3518117
Kelleher NL (2014) Applying label-free quantitation 160. Mao Y, Valeja SG, Rouse JC, Hendrickson CL,
to top down proteomics. Anal Chem 86:4961–4968 Marshall AG (2013) Top-down structural analysis
148. Pesavento JJ, Mizzen CA, Kelleher NL (2006) of an intact monoclonal antibody by electron capture
Quantitative analysis of modified proteins and their dissociation-Fourier transform ion cyclotron
positional isomers by tandem mass spectrometry: resonance-mass spectrometry. Anal Chem
human histone H4. Anal Chem 78:4271–4280 85:4239–4246
149. Thomas CE, Kelleher NL, Mizzen CA (2006) Mass 161. Zhang Z, Shah B (2007) Characterization of variable
spectrometric characterization of human histone H3: regions of monoclonal antibodies by top-down mass
a bird’s eye view. J Proteome Res 5:240–247 spectrometry. Anal Chem 79:5723–5729
150. Collier TS, Sarkar P, Franck WL, Rao BM, Dean 162. Bondarenko PV, Second TP, Zabrouskov V,
RA, Muddiman DC Direct comparison of stable Makarov AA, Zhang Z (2009) Mass measurement
isotope labeling by amino acids in cell culture and and top-down HPLC/MS analysis of intact monoclo-
spectral counting for quantitative proteomics. Anal nal antibodies on a hybrid linear quadrupole ion trap-
Chem 82:8696–8702 Orbitrap mass spectrometer. J Am Soc Mass
151. Waanders LF, Hanke S, Mann M (2007) Top-down Spectrom 20:1415–1424
quantitation and characterization of SILAC-labeled 163. Ren D, Pipes GD, Hambly D, Bondarenko PV,
proteins. J Am Soc Mass Spectrom 18:2058–2064 Treuheit MJ, Gadgil HS (2009) Top-down N-termi-
nal sequencing of Immunoglobulin subunits with
198 S.M. Patrie

electrospray ionization time of flight mass spectrom- human proteome: membrane proteins, mitochondria,
etry. Anal Biochem 384:42–48 and senescence. Mol Cell Proteomics: MCP
164. Liu H, Gaza-Bulseco G, Chumsae C (2009) Analysis 12:3465–3473, 3861700
of reduced monoclonal antibodies using size exclu- 176. Catherman AD, Li M, Tran JC, Durbin KR,
sion chromatography coupled with mass spectrome- Compton PD, Early BP, Thomas PM, Kelleher NL
try. J Am Soc Mass Spectrom 20:2258–2264 (2013) Top down proteomics of human membrane
165. Nicolardi S, Deelder AM, Palmblad M, van der proteins from enriched mitochondrial fractions. Anal
Burgt YE (2014) Structural analysis of an intact Chem 85:1880–1888, 3565750
monoclonal antibody by online electrochemical 177. Zabrouskov V, Whitelegge JP (2007) Increased cov-
reduction of disulfide bonds and Fourier transform erage in the transmembrane domain with activated-
ion cyclotron resonance mass spectrometry. Anal ion electron capture dissociation for top-down
Chem 86:5376–5382 Fourier-transform mass spectrometry of integral
166. Souda P, Ryan CM, Cramer WA, Whitelegge J membrane proteins. J Proteome Res 6:2205–2210
(2011) Profiling of integral membrane proteins and 178. Kaltashov IA, Bobst CE, Abzalimov RR (2009) H/D
their post translational modifications using high- exchange and mass spectrometry in the studies of
resolution mass spectrometry. Methods 55:330–336 protein conformation and dynamics: is there a need
167. Whitelegge J, Halgand F, Souda P, Zabrouskov V for a top-down approach? Anal Chem 81:7892–7899
(2006) Top-down mass spectrometry of integral 179. Abzalimov RR, Kaplan DA, Easterling ML,
membrane proteins. Expert Rev Proteomics Kaltashov IA (2009) Protein conformations can be
3:585–596 probed in top-down HDX MS experiments utilizing
168. Whitelegge JP (2005) Sequencing covalent electron transfer dissociation of protein ions without
modifications of membrane proteins. Comp hydrogen scrambling. J Am Soc Mass Spectrom
Biochem Phys A 141:S249 20:1514–1517
169. Schindler PA, Van Dorsselaer A, Falick AM (1993) 180. Pan J, Han J, Borchers CH, Konermann L (2008)
Analysis of hydrophobic proteins and peptides by Electron capture dissociation of electrosprayed pro-
electrospray ionization mass spectrometry. Anal tein ions for spatially resolved hydrogen exchange
Biochem 213:256–263 measurements. J Am Chem Soc 130:11574–11575
170. Whitelegge JP, Zhang H, Aguilera R, Taylor RM, 181. Wang G, Kaltashov IA (2014) Approach to charac-
Cramer WA (2002) Full subunit coverage liquid terization of the higher order structure of disulfide-
chromatography electrospray ionization mass spec- containing proteins using hydrogen/deuterium
trometry (LCMS+) of an oligomeric membrane pro- exchange and top-down mass spectrometry. Anal
tein: cytochrome b(6)f complex from spinach and Chem 86:7293–7298, 4144750
the cyanobacterium Mastigocladus laminosus. Mol 182. Pan J, Borchers CH (2014) Top-down mass spec-
Cell Proteomics: MCP 1:816–827 trometry and hydrogen/deuterium exchange for com-
171. Carroll J, Altman MC, Fearnley IM, Walker JE prehensive structural characterization of interferons:
(2007) Identification of membrane proteins by tan- implications for biosimilars. Proteomics
dem mass spectrometry of protein ions. Proc Natl 14:1249–1258
Acad Sci U S A 104:14330–14335, 1952138 183. Amon S, Trelle MB, Jensen ON, Jorgensen TJ
172. Doucette AA, Vieira DB, Orton DJ, Wall MJ (2014) (2012) Spatially resolved protein hydrogen exchange
Resolubilization of precipitated intact membrane measured by subzero-cooled chip-based nanoelec-
proteins with cold formic acid for analysis by mass trospray ionization tandem mass spectrometry.
spectrometry. J Proteome Res 13:6001–6012 Anal Chem 84:4467–4473
173. Thangaraj B, Ryan CM, Souda P, Krause K, Faull 184. Pan JX, Zhang SP, Parker CE, Borchers CH (2014)
KF, Weber AP, Fromme P, Whitelegge JP (2010) Subzero temperature chromatography and top-down
Data-directed top-down Fourier-transform mass mass spectrometry for protein higher-order structure
spectrometry of a large integral membrane protein characterization: method validation and application
complex: photosystem II from Galdieria sulphuraria. to therapeutic antibodies. J Am Chem Soc
Proteomics 10:3644–3656, 3517113 136:13065–13071
174. Ryan CM, Souda P, Bassilian S, Ujwal R, Zhang J, 185. Lorenzen K, van Duijn E (2010) Native mass spec-
Abramson J, Ping P, Durazo A, Bowie JU, Hasan SS, trometry as a tool in structural biology.In John EC
Baniulis D, Cramer WA, Faull KF, Whitelegge JP et al (ed) Current protocols in protein science.
(2010) Post-translational modifications of integral Chapter 17, Unit17 12
membrane proteins resolved by top-down Fourier 186. Lomeli SH, Peng IX, Yin S, Loo RR, Loo JA (2010)
transform mass spectrometry with collisionally New reagents for increasing ESI multiple charging
activated dissociation. Mol Cell Proteomics: MCP of proteins and protein complexes. J Am Soc Mass
9:791–803, 2871414 Spectrom 21:127–131, 2821426
175. Catherman AD, Durbin KR, Ahlf DR, Early BP, 187. Zhou M, Jones CM, Wysocki VH (2013) Dissecting
Fellers RT, Tran JC, Thomas PM, Kelleher NL the large noncovalent protein complex GroEL with
(2013) Large-scale top-down proteomics of the
8 Top-Down Mass Spectrometry: Proteomics to Proteoforms 199

surface-induced dissociation and ion mobility-mass FTICR MS experiment. J Am Soc Mass Spectrom
spectrometry. Anal Chem 85:8262–8267 25:2060–2068
188. Snijder J, Rose RJ, Veesler D, Johnson JE, Heck 201. Skinner OS, Do Vale LH, Catherman AD,
AJR (2013) Studying 18 MDa virus assemblies Havugimana PC, Sousa MV, Compton PD, Kelleher
with native mass spectrometry. Angew Chem Int NL (2015) Native GELFrEE: a New separation tech-
Ed 52:4020–4023 nique for biomolecular assemblies. Anal Chem
189. Blackwell AE, Dodds ED, Bandarian V, Wysocki 87:3032–3038
VH (2011) Revealing the quaternary structure of a 202. May JC, Goodwin CR, McLean JA (2015) Ion
heterogeneous noncovalent protein complex through mobility-mass spectrometry strategies for untargeted
surface-induced dissociation. Anal Chem systems, synthetic, and chemical biology. Curr Opin
83:2862–2865, 3343771 Biotechnol 31:117–121, PMC4297680
190. Belov ME, Damoc E, Denisov E, Compton PD, 203. Sowell RA, Koeniger SL, Valentine SJ, Moon MH,
Horning S, Makarov AA, Kelleher NL (2013) From Clemmer DE (2004) Nanoflow LC/IMS-MS and LC/
protein complexes to subunit backbone fragments: a IMS-CID/MS of protein mixtures. J Am Soc Mass
multi-stage approach to native mass spectrometry. Spectrom 15:1341–1353
Anal Chem 85:11163–11173 204. McKenna T (2007) Top-down sequencing using the
191. Ahlf DR, Compton PD, Tran JC, Early BP, Thomas SynaptHigh Definition Mass
PM, Kelleher NL (2012) Evaluation of the compact Spectrometry™(HDMS™) System. Nat Methods|
high-field orbitrap for top-down proteomics of Application Notes
human cells. J Proteome Res 11:4308–4314 205. Zinnel NF, Pai PJ, Russell DH (2012) Ion mobility-
192. Beu SC, Blakney GT, Quinn JP, Hendrickson CL, mass spectrometry (IM-MS) for top-down proteo-
Marshall AG (2004) Broadband phase correction of mics: increased dynamic range affords increased
FT-ICR mass spectra via simultaneous excitation sequence coverage. Anal Chem 84:3390–3397
and detection. Anal Chem 76:5756–5761 206. Wyttenbach T, Bowers MT (2011) Structural stabil-
193. Makarov A, Denisov E, Lange O (2009) Perfor- ity from solution to the Gas phase: native solution
mance evaluation of a high-field Orbitrap mass ana- structure of ubiquitin survives analysis in a solvent-
lyzer. J Am Soc Mass Spectrom 20:1391–1396 free Ion mobility-mass spectrometry environment. J
194. Schaub TM, Hendrickson CL, Horning S, Quinn JP, Phys Chem B 115:12266–12275
Senko MW, Marshall AG (2008) High-performance 207. Shi HL, Pierson NA, Valentine SJ, Clemmer DE
mass spectrometry: Fourier transform ion cyclotron (2012) Conformation types of ubiquitin [M + 8H]
resonance at 14.5 Tesla. Anal Chem 80:3985–3990 (8+) ions from water:methanol solutions: evidence
195. Scigelova M, Hornshaw M, Giannakopulos A, for the N and A states in aqueous solution. J Phys
Makarov A (2011) Fourier transform mass spectrom- Chem B 116:3344–3352
etry. Mol Cell Proteomics 10:M111 009431, 208. Ewing MA, Conant CR, Zucker SM, Griffith KJ,
3134075 Clemmer DE (2015) Selected overtone mobility
196. Xian F, Hendrickson CL, Blakney GT, Beu SC, spectrometry. Anal Chem 87:5132–5138
Marshall AG (2010) Automated broadband phase 209. Shvartsburg AA (2014) Ultrahigh-resolution differ-
correction of Fourier transform Ion cyclotron reso- ential ion mobility separations of conformers for
nance mass spectra. Anal Chem 82:8807–8812 proteins above 10 kDa: onset of dipole alignment?
197. Dyachenko A, Wang G, Belov M, Makarov A, de Anal Chem 86:10608–10615
Jong RN, van den Bremer ET, Parren PW, Heck AJ 210. Shvartsburg AA, Zheng YP, Smith RD, Kelleher NL
(2015) Tandem native mass-spectrometry on (2012) Ion mobility separation of variant histone
antibody-drug conjugates and submillion Da tails extending to the “middle-down” range. Anal
antibody-antigen protein assemblies on an orbitrap Chem 84:4271–4276
EMR equipped with a high-mass quadrupole mass 211. Cui W, Zhang H, Blankenship RE, Gross ML (2015)
selector. Anal Chem 87:6095–6102 Electron-capture dissociation and ion mobility mass
198. Yin S, Loo JA (2010) Elucidating the site of protein- spectrometry for characterization of the hemoglobin
ATP binding by top-down mass spectrometry. J Am protein assembly. Protein Sci 24:1325
Soc Mass Spectrom 21:899–907 212. Escribano E, Madurga S, Vilaseca M, Moreno V
199. Zhang H, Cui W, Wen J, Blankenship RE, Gross ML (2014) Ion mobility and Top-down MS complemen-
(2010) Native electrospray and electron-capture dis- tary approaches for the structural analysis of protein
sociation in FTICR mass spectrometry provide models bound to anticancer metallodrugs. Inorg
top-down sequencing of a protein component in an Chim Acta Part B 423:60–69
intact protein assembly. J Am Soc Mass Spectrom 213. Do TD, Economou NJ, Chamas A, Buratto SK, Shea
21:1966–1968, 2991543 JE, Bowers MT (2014) Interactions between
200. Li H, Wongkongkathep P, Van Orden SL, Ogorzalek amyloid-beta and Tau fragments promote aberrant
Loo RR, Loo JA (2014) Revealing ligand binding aggregates: implications for amyloid toxicity. J Phys
sites and quantifying subunit variants of noncovalent Chem B 118:11220–11230
protein complexes in a single native top-down
200 S.M. Patrie

214. Young LM, Saunders JC, Mahood RA, Revill CH, 223. Frey BL, Lin Y, Westphall MS, Smith LM (2005)
Foster RJ, Tu L-H, Raleigh DP, Radford SE, Controlling gas-phase reactions for efficient charge
Ashcroft AE (2015) Screening and classifying reduction electrospray mass spectrometry of intact
small-molecule inhibitors of amyloid formation proteins. J Am Soc Mass Spectrom 16:1876–1887,
using ion mobility spectrometry–mass spectrometry. 1489883
Nat Chem 7:73–81 224. Scalf M, Westphall MS, Smith LM (2000) Charge
215. Beveridge R, Covill S, Pacholarz KJ, Kalapothakis reduction electrospray mass spectrometry. Anal
JMD, MacPhee CE, Barran PE (2014) A mass- Chem 72:52–60
spectrometry-based framework to define the extent 225. Chi A, Bai DL, Geer LY, Shabanowitz J, Hunt DF
of disorder in proteins. Anal Chem 86:10979–10991 (2007) Analysis of intact proteins on a chro-
216. McLuckey SA, Glish GL, Van Berkel GJ (1991) matographic time scale by electron transfer dissoci-
Charge determination of product ions formed from ation tandem mass spectrometry. Int J Mass
collision-induced dissociation of multiply Spectrom 259:197–203
protonated molecules via ion/molecule reactions. 226. Teo CA, Donald WA (2014) Solution additives for
Anal Chem 63:1971–1978 supercharging proteins beyond the theoretical maxi-
217. Abzalimov RR, Kaltashov IA (2010) Electrospray mum proton-transfer limit in electrospray ionization
ionization mass spectrometry of highly heteroge- mass spectrometry. Anal Chem 86:4455–4462
neous protein systems: protein ion charge state 227. Loo RRO, Dales N, Andrews PC (1994) Surfactant
assignment via incomplete charge reduction. Anal effects on protein-structure examined by
Chem 82:7523–7526 electrospray-ionization mass-spectrometry. Protein
218. Hassell KM, LeBlanc YC, McLuckey SA (2011) Sci 3:1975–1983
Chemical noise reduction via mass spectrometry 228. Iavarone AT, Jurchen JC, Williams ER (2001)
and ion/ion charge inversion: amino acids. Anal Supercharged protein and peptide ions formed by
Chem 83:3252–3255, 3084898 electrospray ionization. Anal Chem 73:1455–1460,
219. Pitteri SJ, McLuckey SA (2005) Recent 1414801
developments in the ion/ion chemistry of high-mass 229. Valeja SG, Tipton JD, Emmett MR, Marshall AG
multiply charged ions. Mass Spectrom Rev (2010) New reagents for enhanced liquid chro-
24:931–958 matographic separation and charging of intact pro-
220. McLuckey S (2009) Peptide and protein Ion/Ion tein ions for electrospray ionization mass
reactions in electrodynamic Ion traps: tools and spectrometry. Anal Chem 82:7515–7519, 2932825
methods. In: Lipton M, Paša-Tolic L (eds) Mass 230. Iavarone AT, Jurchen JC, Williams ER (2000)
spectrometry of proteins and peptides. Humana Effects of solvent on the maximum charge state
Press, New York, pp 395–412 and charge state distribution of protein ions pro-
221. McLuckey SA, Reid GE, Wells JM (2002) Ion duced by electrospray ionization. J Am Soc Mass
parking during ion/ion reactions in electrodynamic Spectrom 11:976–985
ion traps. Anal Chem 74:336–346 231. Cassou CA, Williams ER (2014) Anions in electro-
222. Liu J, Huang TY, McLuckey SA (2009) Simulta- thermal supercharging of proteins with electrospray
neous transmission mode collision-induced dissocia- ionization follow a reverse Hofmeister series. Anal
tion and ion/ion reactions for top-down protein Chem 86:1640–1647, PMC3983018
identification/characterization using a quadrupole/ 232. Sterling HJ, Cassou CA, Susa AC, Williams ER
time-of-flight tandem mass spectrometer. Anal (2012) Electrothermal supercharging of proteins in
Chem 81:2159–2167, 2667222 native electrospray ionization. Anal Chem
84:3795–3801, PMC3328611
Part III
Bioinformatic Tools for Proteomics Data Analysis and Interpretation

9 Platforms and Pipelines for Proteomics Data Analysis and Management
Marius Cosmin Codrea and Sven Nahnsen

Abstract
Since mass spectrometry was introduced as the core technology for large-scale
analysis of the proteome, the speed of data acquisition, dynamic ranges of
measurements, and data quality have improved continuously. These improvements
are driven by regular launches of new methodologies and instruments.

Keywords
Bioinformatics • Proteomics data processing • Protein identification •
Protein quantification • Data processing pipeline • Trans proteomic
pipeline • OpenMS pipeline • The Central Proteomics Facilities Pipeline
(CPFP) • MaxQuant pipeline • Scaffold pipeline • Sorcerer pipeline • IPA/
IP2 pipeline

Abbreviations
FDR False Discovery Rate
GO Gene Ontology
GUI Graphical User Interface
I/O input, output
iTRAQ Isobaric tags for relative and absolute quantitation
M/Z mass-to-charge
PTM Post-Translational Modification
RT retention time
SILAC Stable isotope labeling by amino acids in cell culture
SRM Selected Reaction Monitoring
TB terabyte
TPP Trans-Proteomic Pipeline

M.C. Codrea • S. Nahnsen (*)
Quantitative Biology Center (QBiC), University of Tübingen, Auf der Morgenstelle 10, 72076 Tübingen, Germany
e-mail: sven.nahnsen@uni-tuebingen.de

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_9

9.1 Introduction

Since mass spectrometry was introduced as the core technology for large-scale analysis of the proteome, the speed of data acquisition, dynamic ranges of measurements, and data quality have improved continuously. These improvements are driven by regular launches of new methodologies and instruments.

A consequence of higher throughput and performance is the increased size and complexity of the data. Mass spectrometry studies using the latest technology can readily generate datasets in the TB range (e.g., [1, 2]). These datasets contain millions of spectra that need to be converted into biological insights, and manual completion of this task is prohibitive. Hence, the importance of bioinformatics tools for the analysis of proteomics data is growing rapidly. Besides efficient implementations of the features required to meet these demands, there is a strong need for flexible and user-friendly interfaces. Software tools in the field of proteomics need to be broadly interoperable to allow scalable integration across the complete workflow. The processing and analysis software should be usable by biologists, instrument technicians and computer scientists alike.

From a computational point of view, algorithms for the identification and quantification of peptides and proteins from a selection of mass spectra are at the core of the workflow. For the remainder of this chapter, these fundamental tasks are summarized as data processing. Following data processing, proteomics studies usually require bioinformatics data analysis, which comprises statistical assessment of quantitative data, functional enrichment, visualization and the integration with other -OMICS technologies. The underlying data in proteomics can be derived from targeted, data-independent acquisition, or shotgun proteomics experiments [3]. We will mainly focus on tools for shotgun proteomics in this chapter. In shotgun proteomics (also referred to as bottom-up proteomics), proteins are enzymatically digested to peptides and separated using chromatography systems, most frequently reversed-phase chromatography on a high-performance liquid chromatography (HPLC) system. The eluate of the chromatographic separation is ionized (e.g., via matrix-assisted laser desorption/ionization or, more commonly, via electrospray ionization) and injected online into the mass spectrometer. A mass spectrometry experiment is frequently set up as a two-stage mass measurement. The first mass measurement records mass-to-charge ratios and intensities of the intact peptide ions in a survey scan mass spectrum (MS1). At the same time, a certain number of peptide ions are automatically selected for fragmentation based on their observed intensity; this method is called data-dependent acquisition. Different methods for peptide fragmentation have been established (e.g., collision-induced dissociation (CID), higher-energy collisional dissociation (HCD), electron-transfer dissociation (ETD), etc.). Following fragmentation, the resulting product ions are measured and recorded in a fragment ion or tandem MS spectrum (MS/MS or MS2). This measurement can take place consecutively in the same mass analyzer (tandem MS in time) or on a hybrid instrument with an additional mass analyzer (tandem MS in space). Classically, the MS2 spectra are used to identify peptides by database searching, while the MS1 spectra allow one to estimate their relative quantities [4]. Such proteomics experiments can produce hundreds of gigabytes of data that need automated bioinformatics data management, processing and analysis tools.

Bioinformatics tools comprise a diverse selection of approaches for processing, analyzing and managing mass spectrometry-based proteomics data. The underlying architecture varies from monolithic commercial applications to free and open source software libraries.

Common to all tools that enable the complete proteomics processing and analysis workflow are algorithmic solutions to search tandem MS spectra against a protein database, as well as methods for the statistical post-processing of identification results and quantification. The development of tandem MS search engines has been a topic in computational proteomics research since the emergence of the field in the early 1990s. These developments include Sequest [5], Mascot [6], OMSSA [7] and Andromeda [8], among many others (for a comprehensive review, see [9]). As a second step of the computational identification, the raw search results are passed to post-processing tools, such as PeptideProphet [10] or Percolator [11].

Computational tools that provide the analysis workflow of proteomics data are available from the instrument vendors or bioinformatics firms.
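As a rough illustration of the data-dependent acquisition step described in this section, the sketch below picks the N most intense precursor ions from one MS1 survey scan as candidates for fragmentation. The function name, the representation of peaks as (m/z, intensity) tuples, the value of N and the example numbers are all assumptions made for illustration; real instrument control software applies further selection criteria that are not covered here.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); hypothetical representation


def select_precursors(ms1_peaks: List[Peak], top_n: int = 3) -> List[Peak]:
    """Return the top_n most intense MS1 peaks, most intense first."""
    ranked = sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)
    return ranked[:top_n]


# Made-up survey scan, used only to exercise the function.
survey_scan = [(400.2, 1.5e4), (512.8, 9.0e5), (623.3, 3.2e5), (701.4, 7.7e3), (845.9, 4.1e5)]
selected = select_precursors(survey_scan, top_n=3)
for mz, intensity in selected:
    print(f"fragment precursor at m/z {mz:.1f} (intensity {intensity:.0f})")
```

In this toy scan the precursors at m/z 512.8, 845.9 and 623.3 would be passed on for MS2 acquisition, in that order.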
Oftentimes, such tools are also available as open source and freeware applications that are frequently developed as bioinformatics research projects. The applicability of such tools ranges from single PCs to high-performance compute clusters. While many tools are implemented in an operating system-independent fashion, some tools are available for one operating system only. With the advent of very complex and data-intense experiments, scalability is becoming an important factor to consider when choosing an adequate software suite or data analysis strategy. While it is difficult to draw a clear-cut functional grouping of software tools and platforms, there is a conceptual difference among the available tools: monolithic and modular software tools.

Monolithic applications are usually very user-friendly and are less complex to apply. Most of the monolithic software platforms in proteomics [12] are easy to deploy and their usage is facilitated through an intuitive graphical user interface (GUI). The most prominent examples of monolithic analysis software/platforms in proteomics are MaxQuant [13] and commercial packages, such as proprietary software from the instrument vendors. A drawback of large monolithic applications is that the entire codebase is packaged and software maintenance is an advanced task, parts of which can only be done by the developers. Most importantly for the data-intense field of proteomics, monolithic applications are difficult to scale for high-throughput data processing and to customize. Adding new functionalities is also very difficult, if not impossible.

Classically, the counterpart of monolithic applications is a modularized collection of software tools. Prominent examples in proteomics include the Trans-Proteomic Pipeline (TPP) [14] and OpenMS [15]. Comparative details for these and others can be found in a recent benchmarking paper [16].

Modularized software applications group individual parts of the whole functionality into separate modules that may also have stand-alone functionality. Even individual algorithms can be refactored into modules, which reduces the complexity of the tasks. A general advantage of this approach is a maintainable codebase and easier development. Most community-wide software projects have an underlying modularized design. Due to the separation of individual tasks, the modularized design is more scalable than the one-codebase, monolithic application. On the other hand, the refactoring of tasks and algorithms frequently introduces an I/O overhead.

For proteomics data processing and analysis, the user has the option to choose between different software solutions. While the solutions can be clearly associated with either a monolithic or a modular architecture, the choice between these two options is not trivial. The technical challenges in computational proteomics were recently discussed in a series of reviews [17]. Valuable insights into the state of the art in open source libraries for proteomics data analysis can be found in [18]. This chapter briefly introduces the technical background of the different solutions and provides guidance for selecting the most suitable tool for a given dataset. The choice of the software to analyze research data is at the forefront of the factors that contribute to the success of research projects, so the most suitable tool should be chosen carefully. The main questions each user of proteomics software will need to address are: What is the level of expertise in software application and development? Is it enough to use existing tools or will it be required to implement additional functionality? What degree of flexibility (for both developers and users) is needed? What is the size of data that needs to be processed? Do I have enough hardware resources to run my analysis in-house and is my software compatible with these resources and the data that will be generated?

9.2 Material and Methods

As described above, common to all software tools for the analysis of shotgun proteomics data are processing strategies for data I/O, for
peptide and protein identification as well as different quantification techniques. We describe the nodes of such a processing workflow as depicted in Fig. 9.1. We use the modular workflow description for illustration, but the underlying functionality applies to monolithic implementations as well.

9.2.1 Data I/O

To process or analyze any mass spectrometry data, it is essential to have access to the required information. Most instruments write their own binary files in proprietary formats, and accessing the content is only possible if software tools can use the necessary libraries encoding this functionality. The content and the structure of the data files vary among the proprietary formats (e.g., Thermo Fisher Scientific *.raw; ABI/Sciex *.wiff; Agilent *.d). Due to the continuous development of technology and the addition of new features, these formats are frequently updated. Software tools that analyze MS data produce and require information beyond the raw spectra and general acquisition settings that are encoded in the raw files.

Fortunately, there is a viable community effort towards the implementation of common, open standard file formats. As a result of this effort, the HUPO Proteomics Standards Initiative (PSI) has introduced a wide variety of XML-based open formats [20].

For example, mzML is the current standard format for storing MS data (i.e., spectra), whereas the mzIdentML, mzTab and mzQuantML formats are for storing analysis results (peptide/protein identification and quantification). The usage of these standard formats in proteomics software development is illustrated in a recent tutorial [21]. The emerging mz5 format [22] has been introduced as an alternative to the XML encoding. It is a more compact and efficient file format, since it avoids the heavy load of XML tags (http://www.hdfgroup.org/HDF5/). Despite the strong need to streamline data analysis in open formats, software tools are still lagging behind. Open formats are indispensable not only for sustainable proteomics research, but also for facilitating software development and maintenance. They encourage reproducible data analysis, enhance data sharing and enable benchmarking of analysis algorithms.

The ProteoWizard suite of proteomics data tools [23, 24] provides the "msConvert" utility for converting common mass spectrometer file formats to mzML and mzXML.

While Fig. 9.1 shows the conversion step as a node, it should be noted that not all tools and pipelines require this step; some can read the binary formats directly.

9.2.2 Signal Processing

Appropriate pre-processing of the mass spectra can significantly improve the quality of the results. The most common methods include: (1) filtering or "denoising"; (2) baseline correction, which eliminates systematic trends; (3) normalization; (4) peak detection.

Modern high-resolution instruments [25, 26] have made the raw data signal processing steps much simpler than they were years ago. Signal processing is no longer a critical step in the workflow, but is occasionally needed.

9.2.3 Feature Finding

Mass spectrometers measure eluting analytes over a certain period of time, resulting in the analyte's elution profile. Furthermore, the elemental compositions of the molecular species give rise to isotopic patterns. Integrating over the elution profile and the isotopic pattern, all signals for one analyte can be summed up and peptide feature intensities can be derived [27]. Figure 9.2 illustrates the distribution of the isotopic peaks of a peptide feature with charge two. Feature finding, in this case, aims at automatically collecting all the individual peaks that are visible along the m/z and the RT axes. When electrospray ionization is used, peptides are frequently observed with different charges (e.g., z = +1, z = +2, z = +3) and therefore peaks
[Fig. 9.1 workflow diagram. Panels: (A) LC-MS raw data; (B) protein DB / spectral libraries; (C) data processing nodes: data conversion (convert binary format to open standards, e.g., *.mzML), signal processing (denoising, baseline correction and peak picking), feature finding (identify and integrate all signals to feature intensity; overlay with IDs), database search (annotate and score MS2 spectra with peptide sequences), quantification (map alignment, linking of tuples, normalization, protein quantification), protein inference (identify and score proteins from a given set of peptides); (D) protein/peptide tables; (E) statistics (statistical assessment of quantification; marker identification) and biological analysis (functional annotation; pathway analysis; data integration)]

Fig. 9.1 A typical workflow for the analysis of shotgun proteomics data. (a) exemplifies the data acquired during the LC-MS runs; (b) corresponds to the knowledge about the samples that is needed a priori, e.g., which organisms are analyzed. Nodes outlined in (c) are summarized as data processing tools; a detailed description of these tasks is outlined below. The data processing workflow generates an output of peptide and protein IDs (d) along with their differential or absolute quantification. The format of this output varies from pipeline to pipeline and ranges from tsv tables to graphical output and, more recently, to the open standard exchange format mzTab [19]. (e) After completion of the data processing, the resulting information is subject to data analysis. These analyses involve the statistical assessment of differential protein expression and/or biomarker identification (especially in clinical studies). Performing functional annotation or enrichment analyses facilitates the biological interpretation of high-throughput proteomics data. With the emergence of multi-OMICS data, the biological analysis frequently also involves data integration steps
can be observed at approximately m/1, m/2 and m/3, where m is the mass of the analyte. Charge determination aims at identifying peak groups that can explain the presence of an analyte at a certain charge. For MS1 peaks that are selected for MS2 fragmentation, this is referred to as “precursor charge”. Tools that merge different charge states of the same peptide are called charge state deconvolution tools. Such tools assemble the quantities of the different charge state features into a single decharged peptide feature.

For accurate quantification, algorithmic tools are needed to find the isotopic groups and combine the corresponding peaks into features and, more importantly, to collate the individual peak intensities into single feature intensities and optionally perform charge state deconvolution. Independent of the nature of the data (with or without labels), feature finding is the most crucial step in any quantification workflow.

Fig. 9.2 A peptide feature captures all information of an eluting peptide. Shown here is a doubly charged feature with its monoisotopic ion measured at 602.8 Th. The whole peptide elutes from the column over 1 min

9.2.4 Identification

Peptide Identification Classically, in a tandem MS setup, MS2 spectra are used to identify peptides by matching the fragment spectra against the theoretical spectra derived from the target protein database (i.e., known sequences). In brief, using database searching, peptide sequences are assigned to these spectra as outlined above. Following database searches, statistical assessment is required to distinguish correct from incorrect identifications, most commonly using the target/decoy strategy [28]. Using this strategy, reversed, shuffled or randomized protein sequences can serve as negative controls and thus help to estimate the overall FDR.

Protein Inference Following peptide identification, parent protein identity can be inferred from its daughter peptides. Due to false discoveries on the peptide level, as well as the fact that peptides can map to multiple proteins, protein inference with accurate error estimation remains a difficult problem. The continual release of tools providing solutions for protein inference attempts to alleviate this problem. ProteinProphet [29] is one of the most widely used tools, but other tools include MAYU [30] and the more recent BP-Quant [31], which specifically addresses the problem of alternative splicing and other proteoforms, the core challenge in protein inference.
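As a rough illustration of the target/decoy strategy described under Peptide Identification, the sketch below estimates the FDR at a score threshold as the number of decoy matches divided by the number of target matches at or above that threshold. The PSM scores, the function names and the simple decoys/targets estimator are assumptions made for this example; several estimator conventions are in use, and this is not the method of any particular tool named in this chapter.

```python
from typing import List, Tuple

# Each PSM is (score, is_decoy); higher scores mean better matches.
# All values below are invented for illustration.
Psm = Tuple[float, bool]


def estimate_fdr(psms: List[Psm], threshold: float) -> float:
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    Uses the simple estimator FDR ~ (#decoys) / (#targets), which assumes
    that wrong target matches are about as frequent as decoy matches.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def score_cutoff_for_fdr(psms: List[Psm], max_fdr: float) -> float:
    """Return the lowest observed score whose estimated FDR is within `max_fdr`.

    Returns infinity if no observed score satisfies the requirement.
    """
    best = float("inf")
    for score in sorted({s for s, _ in psms}, reverse=True):
        if estimate_fdr(psms, score) <= max_fdr:
            best = score
    return best


example_psms = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, False), (7.5, True),
    (7.1, False), (6.8, False), (6.0, True), (5.5, False), (5.1, True),
]

print(estimate_fdr(example_psms, 7.0))          # 1 decoy / 5 targets
print(score_cutoff_for_fdr(example_psms, 0.2))  # lowest score with FDR <= 20 %
```

In practice the decoy sequences would come from a reversed, shuffled or randomized copy of the target database that is searched together with it, as described above.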
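Returning briefly to feature finding (Sect. 9.2.3), the relation between charge state and observed m/z can be made concrete with a small calculation. The sketch below back-calculates a neutral mass from the doubly charged feature at 602.8 Th of Fig. 9.2, predicts where the same peptide would appear at other charge states, and lists the expected isotopic peak positions. It assumes that protons are the only charge carriers; the function names and the rounded constants are choices made for this illustration.

```python
PROTON_MASS = 1.007276      # Da, mass of one charge-carrying proton (rounded)
ISOTOPE_SPACING = 1.003355  # Da, approximate 13C - 12C mass difference


def mz_from_mass(neutral_mass: float, charge: int) -> float:
    """m/z of a peptide with the given neutral mass carrying `charge` protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def mass_from_mz(mz: float, charge: int) -> float:
    """Neutral (decharged) mass of a peak observed at `mz` with `charge` protons."""
    return charge * (mz - PROTON_MASS)


def isotope_peaks(mono_mz: float, charge: int, n_peaks: int = 4) -> list:
    """Expected m/z positions of the first isotopic peaks of a feature."""
    return [mono_mz + i * ISOTOPE_SPACING / charge for i in range(n_peaks)]


# Doubly charged monoisotopic peak taken from the caption of Fig. 9.2.
neutral = mass_from_mz(602.8, 2)  # about 1203.585 Da

for z in (1, 2, 3):
    print(z, round(mz_from_mass(neutral, z), 3))

# Isotopic peaks of a doubly charged feature are spaced by about 0.5 Th.
print([round(p, 3) for p in isotope_peaks(602.8, 2)])
```

The isotopic spacing of roughly 1/z Th is what charge determination exploits: a spacing near 0.5 Th points to z = 2, and a spacing near 0.33 Th to z = 3.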

9.2.5 Quantification

Protein quantification can be done by attaching labels to proteins or peptides, chemically [32] or metabolically [33]. Label-free strategies are another approach to protein quantification. For a comprehensive review of quantification strategies, we refer to [34]. The choice of the data analysis strategy depends on the methods that were used to generate the data. For the quantitative analysis of labeled data, many dedicated tools are available [35, 36]. Recently, label-free quantification has been gaining interest due to the practical simplicity of data generation and the expansion of software applications for quantifying label-free data [4, 37]. Most of the libraries and computational frameworks provide algorithmic solutions for a wide range of quantitative data. Besides feature-based quantification, an alternative approach to assessing differential peptide and protein quantities comprises methods that are summarized as spectral counting. These methods rely on counting the number of MS/MS events that can be associated with a certain protein and thereby allow for differential quantification. Spectral counting methods have been comprehensively reviewed in [38].

9.2.6 Alignment and Normalization of Multiple Runs

Quantitative studies usually involve a panel of samples reflecting the experimental design of interest. While recent advances in large-scale multiplexing [39] can reduce the number of runs significantly, moderately large experiments (e.g., for clinical studies) still require multiple injections into the LC-MS setup. If analytes need to be quantified across multiple LC-MS runs, the unavoidable technical variability needs to be corrected. Algorithmic solutions for this are summarized as map alignment, with a map referring to the 2D space (retention time (RT) vs. mass-to-charge (m/z)) of an LC-MS run. The variation in m/z is marginal on the latest instrumentation; the RT dimension, however, can be quite variable within one experiment, and these variabilities can even be non-linear [27]. All tools discussed within this chapter provide algorithmic solutions that account for these variabilities (see [40] for the algorithm as implemented in TOPP [41]). A broader overview of different alignment tools can be found in [42].

After the accurate identification of features and the calculation of individual feature intensities, the next step in the quantification node needs to account for systematic biases that have been introduced during sample preparation and/or the measurements. This procedure is commonly referred to as map normalization. Normalization steps also include an optional assessment of biological variation in biological replicates; such analyses, however, should be done with caution, since biological variation is an inherent property of any biological system. Nonetheless, normalization is beneficial for any quantitative set-up, and is essential for label-free analyses due to their much higher technical variation in comparison to labeled data, where multiple samples are measured in the same run.

9.2.7 Statistical Analysis

Various quality-control measures, such as mass error or charge distribution during an LC run, can readily give a diagnostic on the dataset [43]. Moreover, simple descriptive statistics, scatter plots, clustering or PCA plots can be informative in their own right. However, in non-trivial experiments, statistical analysis has to match the original experimental design, and therefore expert knowledge is often needed to choose the right package and method [Chap. 11].

9.3 Tools and Platforms

The following section outlines the most commonly used software solutions in proteomics. This section details the underlying functionality
and applicability of the tools and, if available, it points the user to the resources and provides information on the licensing for the individual software applications. Table 9.1 summarizes the major properties of the individual tools.

Table 9.1 Summary of the software tools and platforms for proteomics data analysis

Tool      License                            Interface           Current version(b)  Platforms(c)  File formats                URL
TPP       Open-source (GPL v. 2.0 and LGPL)  Command-line, Web   2.4.2               W, L          mz(ML|XML)                  http://tools.proteomecenter.org
OpenMS    Open-source                        Command-line, TOPP  1.11                W, L, M       mz(ML|XML|Data)             http://open-ms.sourceforge.net/
CPFP      Open-source (CDDL(a))              Web, MySQL          2.1.1               L, M          mz(ML|XML)                  http://cpfp.sourceforge.net/
MaxQuant  Freeware                           GUI                 1.5.2.8             W             Thermo .RAW, mzXML          www.maxquant.org
Scaffold  commercial                         GUI                 4.4.1               W, L, M       Major vendor formats        www.proteomesoftware.com
Sorcerer  commercial                         Web                 Visit URL           L             Major vendor formats        www.sagenresearch.com
IPA       commercial                         Web                 IP 2                L             ms1, ms2, mzXML, DTASelect  www.integratedproteomics.com

(a) CDDL: OSI approved Common Development and Distribution License
(b) December 2014
(c) Operating systems: W Windows, L Linux, M Mac OS

9.3.1 Trans-Proteomic Pipeline (Open Source)

The Trans-Proteomic Pipeline (TPP) is one of the most mature suites of software tools for the analysis of LC-MS/MS data [14]. The tools cover all the steps in a shotgun proteomics analysis workflow, from raw data conversion to protein-level identification and quantification [44] (see Fig. 9.1). Within TPP, particular emphasis has been placed on the statistical validation of the identifications. Typically, peptide identifications from different search engines are validated by PeptideProphet [10] and refined and merged with iProphet [45]. Subsequently, protein inference is performed with ProteinProphet [29] and results are provided at different false discovery rates (FDR).

TPP is shipped with the Comet [46] and X!TANDEM [47] search engines, but the currently supported engines are: SEQUEST [5]; MS-GF+ [48]; Inspect [49]; OMSSA [7]; MyriMatch [50]; Mascot [6].

In addition to the traditional database (sequence) search, TPP provides the SpectraST tool as an alternative approach [51]. SpectraST is a spectral library building and searching tool wherein: (1) previously observed and identified peptide MS/MS spectra are compiled and stored in “spectral libraries” and (2) newly observed spectra to be identified are matched against the entire target spectral library. This approach shows great potential in complementing and/or substituting the classical sequence searching.

TPP includes ASAPRatio [52] and XPRESS [53] for relative abundances of proteins from ICAT-reagent labeled data. iTRAQ and TMT labeled samples can be analyzed and quantified with the Libra TPP module.

TPP offers a web-based GUI (called Petunia), which gives access to the tools and data in a visual environment as an alternative to the command-line interface.
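The idea behind spectral library searching, i.e., matching a newly observed spectrum against previously identified spectra, can be sketched with a normalized dot product (cosine similarity) between binned spectra. This is a generic illustration and not SpectraST's actual scoring function; the bin width, the peak lists and the library entry names are invented for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs


def bin_spectrum(spectrum: Spectrum, bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    bins: Dict[int, float] = defaultdict(float)
    for mz, intensity in spectrum:
        bins[int(mz // bin_width)] += intensity
    return dict(bins)


def cosine_similarity(a: Spectrum, b: Spectrum, bin_width: float = 1.0) -> float:
    """Normalized dot product of two binned spectra (0 = no overlap, 1 = identical)."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def best_library_match(query: Spectrum, library: Dict[str, Spectrum]) -> Tuple[str, float]:
    """Return the name and score of the library spectrum most similar to the query."""
    scored = ((name, cosine_similarity(query, spec)) for name, spec in library.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical library of two previously identified peptide spectra.
library = {
    "PEPTIDEA": [(175.1, 30.0), (304.2, 100.0), (417.2, 55.0), (530.3, 20.0)],
    "PEPTIDEB": [(147.1, 80.0), (262.1, 40.0), (375.2, 100.0), (488.3, 10.0)],
}
query = [(175.1, 28.0), (304.2, 95.0), (417.3, 60.0), (530.3, 15.0)]

print(best_library_match(query, library))
```

Unlike a sequence database search, this comparison uses the measured fragment intensities of the library spectra, which is one reason the approach can complement classical sequence searching.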
9.3.2 OpenMS (Open Source)

OpenMS has been designed as a software framework for mass spectrometry [15]. As such, it provides data structures and algorithms to rapidly design and assemble analysis pipelines. OpenMS is developed in the C++ programming language and its code is freely available under the 3-clause BSD license at https://github.com/OpenMS/OpenMS. Besides the core structures, OpenMS is shipped with TOPP, the OpenMS Proteomics Pipeline [41], which is a collection of precompiled building blocks that can be chained together to form production-ready processing pipelines. Both the library and the TOPP tools are available for all major operating systems (Mac OS, Windows and Linux). The TOPP tools include nodes for data handling, raw data signal processing, peptide and protein identification, as well as for the quantification of peptides and proteins using different labeling strategies (e.g., isobaric labeling or SILAC) or label-free approaches [4]. OpenMS tools and the resulting pipelines can easily be executed on high-performance computing clusters using different workflow systems, e.g., Galaxy [54], thus providing a scalable solution for large-scale data centers.

Furthermore, the OpenMS framework also provides tools for the visualization of MS raw data and analysis results, as well as a pipeline designing tool, the OpenMS Proteomics Pipeline Assistant, TOPPAS [55]. Using TOPPAS, the user can intuitively build customized processing pipelines.

9.3.3 CPFP (Web-Based Freeware)

The Central Proteomics Facilities Pipeline (cpfp.sourceforge.net) is in essence a web-based wrapper around TPP tools, various search engines (Mascot [6], OMSSA [7], and X!TANDEM [47]) and a MySQL back-end for storing spectra and results [56]. It is primarily suited for core facilities, providing an easy-to-use web interface to upload data, trigger workflows and browse the results. The analysis pipeline covers identification, quantitation and validation of peptides and proteins [14].

9.3.4 MaxQuant (Freeware)

MaxQuant is a proteomics software application designed for quantitative analysis of LC-MS/MS data [13]. It is a freely available, closed-source (written in C# using the .NET Framework), monolithic, and Windows-only application (www.maxquant.org). Its algorithms are particularly tailored for high-resolution data such as Thermo Orbitrap and FT.

MaxQuant provides all steps for a shotgun proteomics analysis workflow (see Fig. 9.1), organized into the “Quant” module, the Andromeda search engine [8] and the “Identify” module. It can provide: (1) protein identification, (2) feature-based label-free quantification and (3) quantification for SILAC, TMT and iTRAQ-labeled samples.

In essence, the user has to choose the raw data files and the target database and make the appropriate parameter settings (e.g., SILAC labels, mass tolerances). This can be done via the GUI, which is a typical desktop application interface. The workflow then runs all the necessary intermediate steps (e.g., feature detection, MS/MS searches, filtering, protein assembly and quantification) in what appears to the user as a single analysis run.

MaxQuant is equipped with a built-in “Viewer” for data inspection and browsing the results. The authors now recommend the “Perseus” framework (www.perseus-framework.org/) for subsequent statistical analysis of MaxQuant output.

9.3.5 Scaffold (Commercially Available)

Scaffold (Proteome Software, Portland OR, USA, www.proteomesoftware.com) is a feature-rich software suite to assist in analysis, visualization, quantification, annotation and validation of complex LC-MS/MS experiments. It supports a
wide variety of search engines: Mascot [6], Mascot Distiller (Matrix Science, London, UK, http://www.matrixscience.com/distiller.html), Proteome Discoverer (Thermo Fisher, Bremen, Germany, http://www.thermoscientific.com), Spectrum Mill (Agilent, Santa Clara, USA, http://www.chem.agilent.com/), SEQUEST [5], IdentityE/PLGS (Waters, Manchester, UK, www.waters.com), OMSSA [7], X!TANDEM [47] and MaxQuant/Andromeda [8, 13]. Validation is achieved by the PeptideProphet/ProteinProphet algorithms [14] with an enhanced protein grouping method [57]. It supports label-free quantitation (MS1 precursor intensity as well as MS2 spectral counting). Scaffold Q+ and Q+S can perform iTRAQ, TMT and SILAC based quantitation. Basic statistics such as the t-test, ANOVA or the Kruskal-Wallis test are included and offer built-in differential expression analysis. Data can be filtered using various criteria such as peptide/protein probabilities (FDR), search engine scores, expression values (fold change), etc.

Scaffold maintains multiple GO annotation databases and allows GO filtering as well as categorical GO term quantitation. Visualization spans from raw MS/MS spectra to peptide and protein coverage, differential expression, modifications, GO annotation, Venn diagrams, as well as intensity scatterplots and quantitation charts.

Quality control visualization includes: search engine scatterplot comparisons, ROC plots for sensitivity/specificity, error estimates and randomized permutation calculation.

The Scaffold PTM module provides PTM site localization and probability, motif validation and sequence visualization and filtering.

9.3.6 Sorcerer (Commercially Available)

Sorcerer (Sage-N Research, Milpitas CA, USA, www.sagenresearch.com) platforms include tightly integrated hardware and software solutions. The Enterprise solution is customizable and scalable, and provides very high-throughput aggregate analysis and integrated optimized storage. SORCERER™ 2 is also a fully integrated data analysis system, particularly tailored for labs with moderate throughput (or high, but not continuous, throughput). SORCERER-V (standing for virtual) is a scaled-down, yet complete platform packed into a virtual machine that can run on a regular modern PC. This is offered as an entry-level product for scientists to explore and start building data analysis and data-mining platforms.

SORCERER analysis is accessible via Scaffold or the Trans-Proteomic Pipeline.

9.3.7 IPA/IP2 (Commercially Available)

Integrated Proteomics Pipeline (IP2, Integrated Proteomics Applications, Inc., San Diego, CA, USA, www.integratedproteomics.com) provides complete solutions for proteomics data analysis. The core methods for protein identification, quantification, filtering and analysis are Sequest/ProLuCID [5] (http://fields.scripps.edu/downloads.php), DTASelect2 [58, 59], and Census [60]. It uses internal file formats: ms1 and ms2 (RawExtract, http://fields.scripps.edu/downloads.php) [61]. It has a project-oriented web interface and features GO analysis and PTM analysis as well as basic statistics support.

9.4 Technical Aspects and Data Dissemination

9.4.1 Computation and Data Management

While the software tools can, in principle, run on a single PC, the true computational throughput and performance are achieved when running in parallel on computer clusters or on clouds. One can, for example, run MaxQuant on a regular PC or even on a modern laptop for projects with a small number of samples. Obviously, this strategy does not scale well with respect to the ever-accumulating data and the expected turnaround times. Dedicated hardware and advanced tools
and IT knowledge are needed to securely store, back up and retrieve the data.

9.4.2 Resources and Repositories

The major freely accessible resources of protein sequence, annotation and functional information are: UniProt (Universal Protein Resource, www.uniprot.org); Ensembl (www.ensembl.org) and NCBI (The National Center for Biotechnology Information, www.ncbi.nlm.nih.gov).

Modern proteomics journals require that, upon publication of an article, the data (both raw and processed) be made publicly available. The ProteomeXchange consortium (www.proteomexchange.org) has emerged as the primary coordinator of the main existing proteomics repositories. The current repository of choice for tandem MS/MS datasets is PRIDE (www.ebi.ac.uk/pride) [62] and, for Selected Reaction Monitoring (SRM) datasets, the PASSEL component of PeptideAtlas (www.peptideatlas.org/passel/) [63]. The consortium collects, centralizes and disseminates the raw data, processed data (as published by the contributors) as well as the essential metainformation about the dataset (e.g., species, tissue, genetic background or health state).

9.5 Conclusion

The growing variety of proteomics software tools and platforms reflects an increasing interest in the field of proteomics and its expected impact on life and medical sciences. The application areas of the described tools range from specialized experiments to generic solutions for the most commonly performed experiments. Libraries and their modular building blocks primarily fulfill the latter functionality. Specialized applications are most frequently covered by monolithic stand-alone applications. Therefore, the choice of appropriate software tools and platforms is primarily driven by the needs and capabilities of one’s lab. Commercial, turn-key solutions can cut down on the necessary IT support, while setting up, developing and maintaining a custom, open-source platform may require extensive IT and bioinformatics knowledge and assistance.

Small proteomics labs or labs without dedicated bioinformatics staff may opt for pre-assembled, GUI-based applications to avoid IT overhead. More advanced or large-scale facilities will require higher flexibility and scalability in their bioinformatics applications and infrastructure.

Since the technological development in proteomics is an on-going process, it can be anticipated that new software applications will emerge and that proteomics software development will remain a vibrant field of bioinformatics.

References

1. Kirkwood KJ et al (2013) Characterization of native protein complexes and protein isoform variation using size-fractionation-based quantitative proteomics. Mol Cell Proteomics 12(12):3851–3873
2. Kim MS et al (2014) A draft map of the human proteome. Nature 509(7502):575–581
3. Domon B, Aebersold R (2010) Options and considerations when selecting a quantitative proteomics strategy. Nat Biotechnol 28(7):710–721
4. Weisser H et al (2013) An automated pipeline for high-throughput label-free quantitative proteomics. J Proteome Res 12:1628
5. Eng JK, McCormack AL, Yates JR (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5(11):976–989
6. Perkins DN et al (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18):3551–3567
7. Geer LY et al (2004) Open mass spectrometry search algorithm. J Proteome Res 3(5):958–964
8. Cox J et al (2011) Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res 10(4):1794–1805
9. Eng JK et al (2011) A face in the crowd: recognizing peptides through database search. Mol Cell Proteomics 10(11):R111.009522
10. Keller A et al (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 74(20):5383–5392
11. Kall L et al (2007) Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 4(11):923–925
12. Zhang R et al (2010) Evaluation of computational platforms for LS-MS based label-free Quantitative-Proteomics: a global view. J Proteomics Bioinform 3:260–265
13. Cox J, Mann M (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26(12):1367–1372
14. Deutsch EW et al (2010) A guided tour of the trans-proteomic pipeline. Proteomics 10(6):1150–1159
15. Sturm M et al (2008) OpenMS – an open-source software framework for mass spectrometry. BMC Bioinf 9:163
16. Hoekman B et al (2012) msCompare: a framework for quantitative analysis of label-free LC-MS data for comparative biomarker studies. Mol Cell Proteomics 11:M111.015974
17. Aebersold R (2011) Editorial: from data to results. Mol Cell Proteomics 10(11):E111.014787
18. Perez-Riverol Y et al (2014) Open source libraries and frameworks for mass spectrometry based proteomics: a developer’s perspective. Biochim Biophys Acta 1844(1 Pt A):63–76
19. Griss J et al (2014) The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience. Mol Cell Proteomics 13(10):2765–2775
20. Deutsch EW (2012) File formats commonly used in mass spectrometry proteomics. Mol Cell Proteomics 11(12):1612–1621
21. Gonzalez-Galarza FF et al (2014) A tutorial for software development in quantitative proteomics using PSI standard formats. Biochim Biophys Acta 1844(1 Pt A):88–97
22. Wilhelm M et al (2012) mz5: space- and time-efficient storage of mass spectrometry data sets. Mol Cell Proteomics 11(1):O111.011379
23. Kessner D et al (2008) ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics (Oxford, England) 24(21):2534–2536
24. Chambers MC et al (2012) A cross-platform toolkit for mass spectrometry and proteomics. Nat Biotechnol 30(10):918–920
25. Olsen JV et al (2009) A dual pressure linear ion trap orbitrap instrument with very high sequencing speed. Mol Cell Proteomics 8(12):2759–2769
26. Kelstrup CD et al (2014) Rapid and deep proteomes by faster sequencing on a benchtop quadrupole ultra-high-field orbitrap mass spectrometer. J Proteome Res 13(12):6187–6195
27. Nahnsen S et al (2013) Tools for label-free peptide quantification. Mol Cell Proteomics 12(3):549–556
28. Elias JE, Gygi SP (2007) Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 4(3):207–214
29. Nesvizhskii AI et al (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 75(17):4646–4658
30. Reiter L et al (2009) Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol Cell Proteomics 8(11):2405–2417
31. Webb-Robertson BJ et al (2014) Bayesian proteoform modeling improves protein quantification of global proteomic measurements. Mol Cell Proteomics 13(12):3639–3646
32. Gygi SP et al (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17(10):994–999
33. Ong S-E et al (2002) Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 1(5):376–386
34. Bantscheff M et al (2007) Quantitative mass spectrometry in proteomics: a critical review. Anal Bioanal Chem 389(4):1017–1031
35. Liao Z et al (2012) IsoQuant: a software tool for stable isotope labeling by amino acids in cell culture-based mass spectrometry quantitation. Anal Chem 84(10):4535–4543
36. Wen B et al (2014) IQuant: an automated pipeline for quantitative proteomics based upon isobaric tags. Proteomics 14(20):2280–2285
37. Cox J et al (2014) Accurate proteomewide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics 13(9):2513–2526
38. Lundgren DH et al (2010) Role of spectral counting in quantitative proteomics. Expert Rev Proteomics 7(1):39–53
39. Dephoure N, Gygi SP (2012) Hyperplexing: a method for higher-order multiplexed quantitative proteomics provides a map of the dynamic response to rapamycin in yeast. Sci Signal 5(217):rs2
40. Lange E et al (2007) A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics (Oxford, England) 23(13):i273–i281
41. Kohlbacher O et al (2007) TOPP-the OpenMS proteomics pipeline. Bioinformatics (Oxford, England) 23(2):e191–e197
42. Lange E et al (2008) Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinf 9:375
43. Walzer M et al (2014) qcML: an exchange format for quality control metrics from mass spectrometry experiments. Mol Cell Proteomics 13(8):1905–1913
44. Keller A, Shteynberg D (2011) Software pipeline and data analysis for MS/MS proteomics: the trans-proteomic pipeline. Methods Mol Biol 694:169–189
45. Shteynberg D et al (2011) iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol Cell Proteomics 10(12):M111.007690
46. Eng JK, Jahan TA, Hoopmann MR (2013) Comet: an open-source MS/MS sequence database search tool. Proteomics 13(1):22–24
47. Craig R, Beavis RC (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics (Oxford, England) 20(9):1466–1467
48. Kim S, Pevzner PA (2014) MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun 5:5277
49. Tanner S et al (2005) InsPecT: Identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem 77(14):4626–4639
50. Tabb DL, Fernando CG, Chambers MC (2007) MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J Proteome Res 6(2):654–661
51. Lam H et al (2008) Building consensus spectral libraries for peptide identification in proteomics. Nat Methods 5(10):873–875
52. Li XJ et al (2003) Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Anal Chem 75(23):6648–6657
53. Han DK et al (2001) Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat Biotechnol 19(10):946–951
54. Goecks J et al (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86
55. Junker J et al (2012) TOPPAS: a graphical workflow editor for the analysis of high-throughput proteomics data. J Proteome Res 11(7):3914–3920
56. Trudgian DC et al (2010) CPFP: a central proteomics facilities pipeline. Bioinformatics 26(8):1131–1132
57. Searle BC (2010) Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics 10(6):1265–1269
58. Tabb DL, McDonald WH, Yates JR 3rd (2002) DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J Proteome Res 1(1):21–26
59. Cociorva D, Tabb LD, Yates JR (2007) Validation of tandem mass spectrometry database search results using DTASelect. Curr Protoc Bioinformatics Chapter 13: p. Unit 13.4
60. Park SK et al (2008) A quantitative analysis software tool for mass spectrometry-based proteomics. Nat Methods 5(4):319–322
61. McDonald WH et al (2004) MS1, MS2, and SQT-three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. Rapid Commun Mass Spectrom 18(18):2162–2168
62. Vizcaino JA et al (2009) A guide to the proteomics identifications database proteomics data repository. Proteomics 9(18):4276–4283
63. Deutsch EW, Lam H, Aebersold R (2008) PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep 9(5):429–434
10 Tandem Mass Spectrum Sequencing: An Alternative to Database Search Engines in Shotgun Proteomics

Thilo Muth, Erdmann Rapp, Frode S. Berven, Harald Barsnes, and Marc Vaudel

Abstract
Protein identification via database searches has become the gold standard
in mass spectrometry based shotgun proteomics. However, as the quality
of tandem mass spectra improves, direct mass spectrum sequencing gains
interest as a database-independent alternative. In this chapter, the general
principle of this so-called de novo sequencing is introduced along with
pitfalls and challenges of the technique. The main tools available are
presented with a focus on user friendly open source software which can
be directly applied in everyday proteomic workflows.

Keywords
de novo identification • Mass spectrum sequencing • Quality control •
Visualization

Abbreviations
PSM Peptide Spectrum Match
FDR False Discovery Rate
m/z Mass over Charge
BLAST Basic Local Alignment Search Tool
PTM Post-Translational Modification

T. Muth • E. Rapp
Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany
glyXera GmbH, Magdeburg, Germany

F.S. Berven
Proteomics Unit, Department of Biomedicine, University of Bergen, Jonas Liesvei 91, N-5009 Bergen, Norway
KG Jebsen Centre for Multiple Sclerosis Research, Department of Clinical Medicine, University of Bergen, Bergen, Norway
Norwegian Multiple Sclerosis Competence Centre, Department of Neurology, Haukeland University Hospital, Bergen, Norway

H. Barsnes (*) • M. Vaudel
Proteomics Unit, Department of Biomedicine, University of Bergen, Jonas Liesvei 91, N-5009 Bergen, Norway
e-mail: harald.barsnes@biomed.uib.no

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_10

10.1 Introduction

In the early days of shotgun proteomics, three paradigms emerged for the computational derivation of peptide sequences from tandem mass spectra as a replacement for chemical strategies like Edman degradation [1]:

(i) Spectral matching as inherited from small-molecule analysis approaches [2]
(ii) Automated sequencing of spectra as exemplified by the SEQPEP algorithm [3]
(iii) Making use of the growing protein sequence databases to restrain the search space [4], giving birth to the first search engines, like the pioneer algorithms SEQUEST [5] and MOWSE (later employed in Mascot) [6, 7].

Mass spectrum sequencing is generally termed de novo peptide identification, in contrast to database search engines, since it does not rely on a priori knowledge about the possible peptide sequences. By definition, it consists of building a sequence from the spectrum fragment ions, by chaining peaks separated by amino acid characteristic masses. While in the ideal case all fragment ions can be assigned and a full peptide sequence from the N-terminus to the C-terminus built, peptides generally fragment unevenly, rendering the detection of some fragment ions unlikely or even impossible [8]. As a result, peaks are generally missing and full sequence assignments are usually not possible, introducing sequence ambiguities or gaps due to missing fragments. Conversely, the occurrence of a high number of peaks, typically observed for multiply charged precursors, neutral losses and noisy spectra, makes the sequencing practically impossible.

The success of de novo identification thus strongly relies on the quality of the fragmentation and resolution of the mass spectrometer, and is computationally demanding. Three decades ago, the resolution of mass spectrometers and the computational speed made spectral library searching and direct sequencing challenging for everyday lab practices, and search engines were thus rapidly established as the gold standard for shotgun protein identification. Nowadays, however, the enhanced fragmentation quality, sub-ppm resolution of modern mass spectrometers, and increased computational speed and parallelization of computers make it realistic to (re)introduce mass spectrum sequencing.

Thereby, peptides are inferred from the spectra in an unbiased way, independently from any database, thus providing the unique potential to identify protein isoforms, mutated sequences, and unexpected modifications. However, this advantage comes at the cost of high computational complexity and challenging protein inference.

In this chapter, we will present the different paradigms of de novo identification and the algorithmic implementations. We will also illustrate how mass spectrum sequencing can be integrated in standard proteomic workflows via user-friendly interfaces. Finally, we will discuss the remaining challenges for a complete integration of mass spectrum sequencing in everyday practices.

10.2 Paradigms and Algorithms

Two main paradigms emerged in mass spectrum sequencing: tag-based approaches, and complete sequence de novo identification algorithms. While the latter attempts to derive the entire peptide sequence from the spectrum, the tag approach only partially identifies the peptide via a sequence tag of a few amino acids. The rationale behind the tag approach is that while complete fragment ion coverage is rare, spectra generally present a short series of high-intensity peaks providing a high quality tag which can be used for further identification. Mann and Wilm pioneered tag sequencing, suggesting that these “islands” of sequence ions present valuable information complementary to database search results [9]. The approach was applied for non-sequenced organisms in MultiTag [10], and

Table 10.1 Software available for mass spectrum sequencing

Sequencing type | Name | Publication year | Number of citations (Total/Average/Trend) | Free | Maintained
Tag-based | MultiTag [10] | 2003 | 88/7.33/↘ | Yes | No
Tag-based | GutenTag [11] | 2003 | 175/14.58/↘ | Yes | Yes
Tag-based | DirecTag [12] | 2008 | 32/4.57/↗ | Yes | Yes
Full de novo | Lutefisk [20] | 1997 | 238/13.22/↘ | Yes | No
Full de novo | SeqMS [21] | 2000 | 36/2.40/↘ | Yes | No
Full de novo | Sub [22] | 2003 | 59/4.92/↘ | Yes | ?
Full de novo | NovoHMM [23] | 2005 | 73/7.30/↗ | Yes | No
Full de novo | Audens [24] | 2005 | 37/3.70/↘ | Yes | ?
Full de novo | MSNovo [25] | 2007 | 38/4.75/↗ | Yes | ?
Full de novo | Vonode [26] | 2010 | 11/2.2/→ | Yes | ?
Full de novo | GenoMS [27] | 2010 | 14/2.80/↗ | Yes | ?
Full de novo | CompNovo [28] | 2009 | 18/3.00/→ | Yes | No
Full de novo | MetaSPS [29] | 2013 | N/A | Yes | ?
Full de novo | pNovo+ [30] | 2013 | N/A | Yes | ?
Full de novo | PepNovo+ [31] | 2005 | 280/28.0/↗ | Yes | No
Full de novo | UniNovo [32] | 2013 | N/A | Yes | No
Full de novo coupled with database search | PEAKS [33, 42] | 2003 | 364/30.33/↗ | No | Yes
Full de novo coupled with database search | Byonic [34] | 2007 | 64/8.00/↗ | No | Yes
Peptide assembly | TagRecon [39] | 2010 | 30/6.00/↗ | Yes | Yes
Sequence similarity | FASTA [38] | 1988 | 9457/350.56/↘ | Yes | ?
Sequence similarity | BLAST [37] | 1990 | 38,684/1548.08/↗ | Yes | ?
Sequence similarity | PepExplorer [41] | 2014 | N/A | Yes | ?
Protein inference | MSDA [40] | 2014 | N/A | Yes | Yes
Protein inference | IdPicker [44] | 2007 | 136/17.00/↗ | Yes | Yes
Graphical interface | BumberDash | | N/A | Yes | Yes
Graphical interface | DeNovoGUI [47] | 2013 | N/A | Yes | Yes

The table lists the software mentioned in this book chapter, classified by use case, and provides the publication year and corresponding reference. For tools published earlier than 2013, the number of citations according to Thomson Reuters™ Web of Science™ is given. Finally, the table indicates whether the software is free and maintained. Whenever a tool could not be found, it was marked as not maintained. Note that the number of citations is solely given as an indicator of the tool usage. The total and average numbers of citations per year are given, as well as the trend for the last 3 years’ citation average relative to the global average: ↗ increasing number of citations, → stable number of citations, and ↘ decreasing number of citations

in generic tools for mass spectrum sequencing by the Tabb lab: GutenTag [11] and DirecTag [12], as listed in Table 10.1.

Tag algorithms have the advantage that they can be extremely fast. However, they are criticized for requiring clearly defined consecutive lists of amino acids, and for providing only limited information about the sequence. Figure 10.1 displays the distribution of Peptide Spectrum Matches (PSMs) according to the length of the longest tag which can be derived from their spectrum annotation in a standard shotgun proteomic run (a tryptic HeLa digest measured on a Q Exactive, data from [13]) obtained from the combination of five search engines (MS Amanda [14], MS-GF+ [15], Myrimatch [16], OMSSA [17], and X!Tandem [18]) using PeptideShaker (http://www.ncbi.nlm.nih.gov/pubmed/25574629). From the figure, it is clear that a tag of at least three amino acids can

Fig. 10.1 Distribution of Peptide Spectrum Matches (PSMs) according to the length of the longest tag which can be derived from their spectrum annotation in a standard shotgun proteomic experiment obtained from the combination of five search engines (see text for details). PSMs are sorted into four categories: (i) Not Validated: PSMs which do not pass a 1 % False Discovery Rate (FDR) threshold, (ii) Doubtful: PSMs passing a 1 % FDR threshold but not the quality filters embedded in PeptideShaker, and (iii) Confident: PSMs passing a 1 % FDR threshold and the quality filters. The PSMs with a tag length of at least three are circled in blue, comprising 92 % of the validated PSMs and 96 % of the confident PSMs. When no combination of two annotated peaks separated by a single amino acid mass could be found, the PSM was categorized in the ‘0’ category

be derived from 92 % of the PSMs validated at 1 % False Discovery Rate (FDR), and from 96 % of the PSMs passing the quality filters implemented in PeptideShaker. The potential identification rate of tag approaches thus appears to be comparable to that of search engines. However, it requires the intervention of a downstream algorithm to infer the complete peptide sequence, a point which will be touched upon in the following section.

Besides the tag-based approaches, several algorithms have been developed aiming at a complete sequencing of tandem mass spectra. As schematized in Fig. 10.2, the standard approach relies on a spectrum graph that consists of vertices and edges: the peaks of the spectrum are converted into vertices with attributed m/z values [19]. If the mass difference between two different peaks corresponds to the mass of an amino acid, possibly carrying a modification, an edge is drawn to connect the respective vertices. This procedure is repeated until a full path is found that connects the N-terminal with the C-terminal vertices. Additionally, these connections are scored, for example, based on the intensity of the peaks or the accuracy of the peak m/z matching. The de novo sequencing algorithms then try to find the path with the best score, and this path is translated back into a peptide sequence suggestion. Some algorithms also include peptide fragmentation models in order to provide statistical significance for the scoring: can a peak be explained by a predicted fragmentation rule or is it simply a random match?
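
The spectrum graph idea described above can be illustrated with a minimal Python sketch. It is not the implementation of any of the tools discussed in this chapter. It assumes that the peaks have already been converted to neutral prefix (b-ion) masses, that artificial vertices for the N-terminus (mass 0) and the C-terminus (total residue mass) have been added, and that a path is scored simply by the summed intensity of the peaks it visits. The residue masses are standard monoisotopic values; the function name, the tolerance and the example peak list are illustrative assumptions.

```python
# Standard monoisotopic residue masses in Da. Leucine and isoleucine are
# isobaric, so "L" stands for both in this sketch.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def best_path(peaks, tol=0.01):
    """Return (score, sequence) of the best path through the spectrum graph.

    peaks: list of (prefix_mass, intensity); the lightest entry is taken as
    the N-terminal vertex and the heaviest as the C-terminal vertex.
    Returns None when no path connects the two terminal vertices.
    """
    peaks = sorted(peaks)
    best = [None] * len(peaks)   # best[i]: best (score, sequence) ending at vertex i
    best[0] = (0.0, "")
    for j in range(1, len(peaks)):
        for i in range(j):
            if best[i] is None:          # vertex i is not reachable
                continue
            delta = peaks[j][0] - peaks[i][0]
            for residue, mass in RESIDUE_MASS.items():
                if abs(delta - mass) <= tol:
                    # An edge exists: extend the best path ending at vertex i
                    candidate = (best[i][0] + peaks[j][1], best[i][1] + residue)
                    if best[j] is None or candidate[0] > best[j][0]:
                        best[j] = candidate
    return best[-1]


# Hypothetical peak list: prefix masses of a short peptide plus one noise peak
EXAMPLE_PEAKS = [
    (0.0, 0.0), (129.04259, 100.0), (228.11100, 80.0), (343.13794, 120.0),
    (400.00000, 50.0),  # noise peak, not connected to any other vertex
    (506.20127, 90.0), (619.28533, 110.0), (732.36939, 70.0), (888.47050, 0.0),
]

score, sequence = best_path(EXAMPLE_PEAKS)
print(sequence, score)  # EVDYLLR 570.0
```

Because vertices are visited in order of increasing mass, a single pass is enough to find the best-scoring path. Real spectra additionally require handling of missing peaks, y-ions, charge states and modifications, none of which is attempted here.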

Fig. 10.2 The spectrum de novo sequencing principle. The spectrum is converted into a spectrum graph, and the peptide sequence is then derived from the graph
[Figure content: spectrum of b-ion peaks at 129, 228, 343, 506, 619 and 732; spectrum graph connecting consecutive peaks with ΔM = 99, ΔM = 115, ΔM = 163, ΔM = 113 and ΔM = 113; derived peptide sequence EVDYLLR]

Amino Acid | Mass
Glu (E) | 129.04259 Da
Val (V) | 99.06841 Da
Asp (D) | 115.02694 Da
Tyr (Y) | 163.06333 Da
Leu (L) | 113.08406 Da

De novo identification algorithms are available as both free and commercial software, see Table 10.1. One of the most popular pioneering algorithms is Lutefisk [20], which was followed by SeqMS [21], Sub [22], NovoHMM [23], Audens [24], and MSNovo [25]. Vonode [26] and GenoMS [27] were subsequently developed specifically for proteogenomic studies. CompNovo [28], MetaSPS [29] and pNovo+ [30] were developed for datasets combining multiple complementary fragmentation techniques. Finally, we have the tools of the Pevzner group, PepNovo+ [31] and UniNovo [32]. PEAKS [33] (http://www.bioinfor.com) and Byonic [34] (http://www.proteinmetrics.com) are among the most commonly encountered commercial software tools supporting de novo sequencing strategies in their workflows. Most of these tools are indexed in the ‘OMICtools’ platform [35] (http://omictools.com), which maintains links to the respective web pages.

10.3 Mapping de novo Sequences onto Protein Databases

The result of sequencing algorithms is a list of potential peptide sequences, possibly containing sequence gaps where the amino acid sequence could not be inferred. For most biological studies, however, protein level information is necessary to draw meaningful conclusions. Thus, these sequences, or partial sequences, are mapped to known protein sequences, for example, to UniProtKB [36] reference proteomes. The challenge is to provide relevant results in a reasonable time, without losing the hits not exactly matching the sequences in the database, as in the case of sequence mutations. The most frequently chosen option is to proceed with a sequence similarity search using the BLAST [37] (Basic Local Alignment Search Tool) or FASTA [38] algorithms available online, for example, from the UniProt website (http://uniprot.org). These approaches, however, lose the information of the precursor mass and thus do not take mass gaps into account. Moreover, these heuristics only resolve a limited set of mutations. Dedicated software has therefore been developed, such as TagRecon [39] for DirecTag results, MSDA [40] for PepNovo+, and PepExplorer [41] as a more generic solution supporting several algorithms. As a direct result, mass spectrum sequencing output can be readily interpreted, much like standard database search engine results.
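
To make the role of mass gaps more concrete, the following Python sketch maps a sequence tag with an N-terminal mass gap onto a small protein collection. It is a toy illustration under stated assumptions and does not reproduce how TagRecon, MSDA or PepExplorer work: it only accepts exact tag matches (with isoleucine and leucine treated as identical), ignores modifications and mutations, and checks only the N-terminal gap. The function names, tolerance and example proteins are hypothetical.

```python
# Standard monoisotopic residue masses in Da; isoleucine is folded into leucine
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def fold_isobaric(sequence):
    """Replace I by L, since the two residues cannot be told apart by mass."""
    return sequence.replace("I", "L")


def map_tag(tag, n_terminal_gap, proteins, tol=0.02):
    """Find tag occurrences whose preceding residues explain the N-terminal gap.

    Returns a list of (accession, start, matched_sequence), where start is the
    0-based position of the first residue accounting for the mass gap.
    """
    folded_tag = fold_isobaric(tag)
    hits = []
    for accession, sequence in proteins.items():
        folded = fold_isobaric(sequence)
        pos = folded.find(folded_tag)
        while pos != -1:
            # Walk towards the protein N-terminus, summing residue masses
            # until the mass gap is reached or exceeded
            total, start = 0.0, pos
            while start > 0 and total < n_terminal_gap - tol:
                start -= 1
                total += RESIDUE_MASS[folded[start]]
            if abs(total - n_terminal_gap) <= tol:
                hits.append((accession, start, sequence[start:pos + len(tag)]))
            pos = folded.find(folded_tag, pos + 1)
    return hits


# Hypothetical protein sequences and a tag preceded by a gap of mass E + V
PROTEINS = {"P1": "MKEVDYIIRSA", "P2": "GGDYLLAAA"}
GAP = RESIDUE_MASS["E"] + RESIDUE_MASS["V"]

print(map_tag("DYLL", GAP, PROTEINS))  # [('P1', 2, 'EVDYII')]
```

The tag DYLL occurs in both example proteins, but only in P1 do the residues preceding it add up to the mass gap. This is the kind of information a plain sequence similarity search discards.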

In order to fully benefit from the advantages of both database search and mass spectrum sequencing, efforts have been put toward unifying the two approaches. This is for instance the case in the abovementioned commercial software (PEAKS [33] and Byonic [34]) where sequencing results are combined with database results [42]. The best representative of such efforts in academic freeware is IdPicker [43, 44], which, in a user-friendly interface, combines the strength of virtually any database search engine results (thanks to the standard mzIdentML format [45]), with the mass spectrum sequencing results of DirecTag combined with TagRecon, and the spectral matching results of Pepitome [46].

10.4 Using Sequencing Algorithms

Most of the algorithms presented above have to be run on the command line, which may require additional technical expertise. In order to achieve the transfer of these algorithms to lab practices, it is therefore vital to provide user-friendly interfaces, together with respective teaching material [13], enabling the use of the tools and the inspection of the results without the need for advanced computer science skills. An example of such an interface is BumberDash (http://fenchurch.mc.vanderbilt.edu/software.php), which is dedicated to software from the Tabb group and allows operating the group’s command line tools via a graphical user interface, before gathering the results in IdPicker. Notably, the Audens de novo algorithm also comes with a graphical interface.

Here, we present how to run the popular sequencing tools PepNovo+ and DirecTag, as representatives for de novo sequencing and tag-based approaches, and inspect their results in a user-friendly interface called DeNovoGUI [47] (http://compomics.github.io/projects/denovogui.html), an easy-to-use, open source software which does not require any specific installation. When starting the tool, the main dialog opens as displayed in Fig. 10.3.

Fig. 10.3 The main DeNovoGUI dialog. The user provides the desired input, settings, and output for the tools to process at the top. The user then selects the algorithm(s) to operate, DirecTag and/or PepNovo+, and starts the sequencing

Under ‘Input & Output’ located at the top of the interface, the user provides the peak list file(s) to analyze, the sequencing settings to use, and the output folder where the results will be saved. The results presented in this chapter are obtained from the example file included in DeNovoGUI, which can be accessed via the ‘File’ -> ‘Load Example’ menu. The sequencing settings can be edited by clicking on the ‘Edit’ button, opening the dialog shown in Fig. 10.4. This dialog allows for adjusting the general sequencing settings, such as mass tolerances, and algorithm specific settings. The user can also select post-translational modifications (PTMs), and even add custom modifications using the compomics-utilities structure [48], by clicking the cogwheel above the table listing the PTMs.

Fig. 10.4 This dialog allows editing the sequencing settings. General settings are listed at the top, notably including mass tolerances. These are followed by DirecTag specific settings, and finally post-translational modifications (PTMs). Above the table listing the PTMs, a drop down menu allows for displaying a more extended list of modifications and a cogwheel allows the creation of user-defined modifications

In the ‘Sequencing Methods’ section of the main interface, the PepNovo+ and the DirecTag algorithms can be selected. As soon as the settings and input files have been chosen, the mass spectrum sequencing can be started by clicking the ‘Start Sequencing!’ button. While the algorithms are running, the user is informed about the status of the sequencing and a progress bar is shown. When the sequencing has finished, the results are stored in the provided output folder (in the tools’ respective original formats), and the detailed results are parsed and displayed in the DeNovoGUI interface, as shown in Fig. 10.5. Note that previous sequencing result files can be opened directly via the ‘File’ -> ‘Open’ menu option.

At the top of the results display, all the input spectra are listed in the ‘Query Spectra’ table, and the de novo peptide sequences are shown for each selected spectrum in the ‘De Novo Peptides’ table. The ‘Query Spectra’ table displays information collected from the original spectra, such as title, precursor m/z, charge and identification state, while the ‘De Novo Peptides’ table shows details obtained from the de novo sequencing results on the selected spectrum: peptide sequence, precursor m/z and charge, terminal mass gaps and scores. Note that the mass gaps are annotated on the sequence as well, and that PTMs are indicated using a user customizable color coding. Finally, the last column allows for a direct online BLAST of the selected sequence.

At the bottom, the currently selected spectrum is displayed with the fragment ion annotation corresponding to the selected de novo peptide solution. A sequence overlay annotates the

amino acids between fragment ion peaks. A menu under the spectrum allows customization of the spectrum annotation. Note that from the top menu and spectrum contextual menu, different export options are available, allowing the export of publication level illustrations, Microsoft Excel compatible tab separated tables, a simple matching to protein databases export, and an export of the whole dataset compatible with BLAST. Thus, the complete workflow from spectra sequencing, via interpretation of the de novo results, to the export of the results for further processing in other software, is supported within the same user-friendly framework.

Fig. 10.5 Display of sequencing results in DeNovoGUI. At the top, the sequenced spectra can be selected by the user. The sequencing results of both algorithms on that file are listed in the middle table. At the bottom, the selected sequence is annotated on the selected spectrum with the amino acids annotated between the peaks

10.5 Conclusion and Perspectives

Mass spectrum sequencing is a technique that is fully independent of an external protein database resource. This unbiased approach becomes more efficient and accurate as the quality of spectra produced by high accuracy and high resolution mass spectrometers increases. In principle, sequencing algorithms are able to retrieve previously unknown or mutated peptide sequences as well as unexpected PTMs. This approach can be used complementarily to database searches in fully integrated environments.

One of the main issues barely touched upon in the literature, and beyond the scope of this chapter, is the evaluation of the quality of sequencing matches and the estimation of a reliable false discovery rate as done with database searches using the target/decoy strategy [49]. This can be especially challenging when evaluating matches containing sequence mutations.

Modern computational power and the use of computer clusters allow for a valuable integration of mass spectrum sequencing into any proteomics workflow. As mass spectrum sequencing performance is improving with better software and hardware optimizations, and is made easier to handle by relying on user-friendly interfaces, the application of this promising technique will surely increase in shotgun proteomic studies. Ideally, it should become integrated in standard proteomic workflows as an alternative or an add-on to conventional database search engines, which then would be able to provide
improved identification coverage at controlled error rates.

Acknowledgements T.M. and E.R. acknowledge the support by Max Planck Society. H.B. is supported by the Research Council of Norway.

References

1. Edman P, Begg G (1967) A protein sequenator. Eur J Biochem 1:80–91
2. Martinsen DP, Song B-H (1985) Computer applications in mass spectral interpretation: a recent review. Mass Spectrom Rev 4:461–490
3. Johnson RS, Biemann K (1989) Computer program (SEQPEP) to aid in the interpretation of high-energy collision tandem mass spectra of peptides. Biomed Environ Mass Spectrom 18:945–957
4. Henzel WJ, Billeci TM, Stults JT et al (1993) Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases. Proc Natl Acad Sci U S A 90:5011–5015
5. Eng JK, McCormack AL, Yates JR (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5:976–989
6. Pappin DJ, Hojrup P, Bleasby AJ (1993) Rapid identification of proteins by peptide-mass fingerprinting. Curr Biol 3:327–332
7. Perkins DN, Pappin DJ, Creasy DM et al (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551–3567
8. Barsnes H, Eidhammer I, Martens L (2011) A global analysis of peptide fragmentation variability. Proteomics 11:1181–1188
9. Mann M, Wilm M (1994) Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem 66:4390–4399
10. Sunyaev S, Liska AJ, Golod A et al (2003) MultiTag: multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. Anal Chem 75:1307–1315
11. Tabb DL, Saraf A, Yates JR 3rd (2003) GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal Chem 75:6415–6421
12. Tabb DL, Ma ZQ, Martin DB et al (2008) DirecTag: accurate sequence tags from peptide MS/MS through statistical scoring. J Proteome Res 7:3838–3846
13. Vaudel M, Venne AS, Berven FS et al (2014) Shedding light on black boxes in protein identification. Proteomics 14:1001–1005
14. Dorfer V, Pichler P, Stranzl T et al (2014) MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. J Proteome Res 13:3679–3684
15. Kim S, Gupta N, Pevzner PA (2008) Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J Proteome Res 7:3354–3363
16. Tabb DL, Fernando CG, Chambers MC (2007) MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J Proteome Res 6:654–661
17. Geer LY, Markey SP, Kowalak JA et al (2004) Open mass spectrometry search algorithm. J Proteome Res 3:958–964
18. Craig R, Beavis RC (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20:1466–1467
19. Chen T, Kao MY, Tepel M et al (2001) A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J Comput Biol 8:325–337
20. Taylor JA, Johnson RS (1997) Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 11:1067–1075
21. Fernandez-de-Cossio J, Gonzalez J, Satomi Y et al (2000) Automated interpretation of low-energy collision-induced dissociation spectra by SeqMS, a software aid for de novo sequencing by tandem mass spectrometry. Electrophoresis 21:1694–1699
22. Lu B, Chen T (2003) A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry. J Comput Biol 10:1–12
23. Fischer B, Roth V, Roos F et al (2005) NovoHMM: a hidden Markov model for de novo peptide sequencing. Anal Chem 77:7265–7273
24. Grossmann J, Roos FF, Cieliebak M et al (2005) AUDENS: a tool for automated peptide de novo sequencing. J Proteome Res 4:1768–1774
25. Mo L, Dutta D, Wan Y et al (2007) MSNovo: a dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry. Anal Chem 79:4870–4878
26. Pan C, Park BH, McDonald WH et al (2010) A high-throughput de novo sequencing approach for shotgun proteomics using high-resolution tandem mass spectrometry. BMC Bioinf 11:118
27. Castellana NE, Pham V, Arnott D et al (2010) Template proteogenomics: sequencing whole proteins using an imperfect database. Mol Cell Proteomics 9:1260–1270
28. Bertsch A, Leinenbach A, Pervukhin A et al (2009) De novo peptide sequencing by tandem MS using complementary CID and electron transfer dissociation. Electrophoresis 30:3736–3747
29. Guthals A, Clauser KR, Frank AM et al (2013) Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides. J Proteome Res 12:2846–2857
30. Chi H, Chen H, He K et al (2013) pNovo+: de novo peptide sequencing using complementary HCD and ETD tandem mass spectra. J Proteome Res 12:615–625
31. Frank A, Pevzner P (2005) PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal Chem 77:964–973
32. Jeong K, Kim S, Pevzner PA (2013) UniNovo: a universal tool for de novo peptide sequencing. Bioinformatics 29:1953–1962
33. Ma B, Zhang K, Hendrie C et al (2003) PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 17:2337–2342
34. Bern M, Cai Y, Goldberg D (2007) Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Anal Chem 79:1393–1400
35. Henry VJ, Bandrowski AE, Pepin AS et al (2014) OMICtools: an informative directory for multi-omic data analysis. Database J Biol Databases Curation 2014. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25024350
36. Apweiler R, Bairoch A, Wu CH et al (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115–D119
37. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410
38. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85:2444–2448
39. Dasari S, Chambers MC, Slebos RJ et al (2010) TagRecon: high-throughput mutation identification through sequence tagging. J Proteome Res 9:1716–1726
40. Carapito C, Burel A, Guterl P et al (2014) MSDA, a proteomics software suite for in-depth Mass Spectrometry Data Analysis using grid computing. Proteomics 14:1014–1019
41. Leprevost FV, Valente RH, Borges DL et al (2014) PepExplorer: a similarity-driven tool for analyzing de novo sequencing results. Mol Cell Proteomics 13(9):2480–2489
42. Zhang J, Xin L, Shan B et al (2012) PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol Cell Proteomics 11:M111.010587
43. Ma ZQ, Dasari S, Chambers MC et al (2009) IDPicker 2.0: improved protein assembly with high discrimination peptide identification filtering. J Proteome Res 8:3872–3881
44. Zhang B, Chambers MC, Tabb DL (2007) Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. J Proteome Res 6:3549–3557
45. Jones AR, Eisenacher M, Mayer G et al (2012) The mzIdentML data standard for mass spectrometry-based proteomics results. Mol Cell Proteomics 11:M111.014381
46. Dasari S, Chambers MC, Martinez MA et al (2012) Pepitome: evaluating improved spectral library search for identification complementarity and quality assessment. J Proteome Res 11:1686–1695
47. Muth T, Weilnbock L, Rapp E et al (2014) DeNovoGUI: an open source graphical user interface for de novo sequencing of tandem mass spectra. J Proteome Res 13(2):1143–1146
48. Barsnes H, Vaudel M, Colaert N et al (2011) compomics-utilities: an open-source Java library for computational proteomics. BMC Bioinf 12:70
49. Elias JE, Gygi SP (2010) Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol Biol 604:55–71

11 Visualization, Inspection and Interpretation of Shotgun Proteomics Identification Results

Ragnhild R. Lereim, Eystein Oveland, Frode S. Berven, Marc Vaudel, and Harald Barsnes

Abstract
Shotgun proteomics is a high throughput technique for protein identifica-
tion able to identify up to several thousand proteins from a single sample.
In order to make sense of this large amount of data, proteomics analysis
software is needed, aimed at making the data intuitively accessible to
beginners as well as experienced scientists. This chapter provides insight
on where to start when analyzing shotgun proteomics data, with a focus on
explaining the most common pitfalls in protein identification analysis and
how to avoid them. Finally, the move to seeing beyond the list of
identified proteins and to putting the results into a bigger biological
context is discussed.

Keywords
Protein identification • Visualization • Protein annotation • Validation

Abbreviations
PTM Post-Translational Modification
PSM Peptide Spectrum Match
FDR False Discovery Rate
FNR False Negative Rate
GO Gene Ontology
PI Protein Inference

R.R. Lereim • F.S. Berven
Proteomics Unit, Department of Biomedicine, University of Bergen, Jonas Liesvei 91, N-5009 Bergen, Norway
KG Jebsen Centre for Multiple Sclerosis Research, Department of Clinical Medicine, University of Bergen, Bergen, Norway
Norwegian Multiple Sclerosis Competence Centre, Department of Neurology, Haukeland University, Bergen, Norway

E. Oveland
KG Jebsen Centre for Multiple Sclerosis Research, Department of Clinical Medicine, University of Bergen, Bergen, Norway
Norwegian Multiple Sclerosis Competence Centre, Department of Neurology, Haukeland University, Bergen, Norway
Department of Clinical Medicine, University of Bergen, Bergen, Norway

M. Vaudel • H. Barsnes (*)
Proteomics Unit, Department of Biomedicine, University of Bergen, Jonas Liesvei 91, N-5009 Bergen, Norway
e-mail: harald.barsnes@biomed.uib.no

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_11


11.1 Background

In shotgun proteomics, thousands of proteins are digested into peptides prior to mass spectrometry, and the generated MS/MS spectra are matched to theoretical peptides from a protein sequence database using dedicated algorithms [1]. These matches, termed Peptide Spectrum Matches (PSMs), are scored and ranked, and the best match per spectrum is used as the peptide candidate to infer the proteins. By design, shotgun proteomics thus investigates peptides and not proteins, a fact that gives rise to numerous computational difficulties when trying to figure out which peptide belongs to which protein [2, 3], a task made even more complicated by the existence of post-translational modifications (PTMs). In order to avoid the most common pitfalls when analyzing shotgun proteomics data, a basic understanding of the computational and statistical methods is needed.

Inherent to the shotgun approach, the matching between theoretical and experimental spectra will generate false positives, i.e., wrong matches passing the validation threshold. The control of the share of false positive matches, the False Discovery Rate (FDR), and its optimization are an important focus of the data interpretation [4, 5]. Generally, so-called decoy sequences are included in the database, which are used to match the experimental data to the theoretical data [6]. The distribution of decoy hits is then used to evaluate the quality of the identifications as reviewed in detail elsewhere [4, 7].

As multiple proteins can share one or several peptide sequences, a PSM can end up being used as evidence for the wrong protein, referred to as the protein inference problem [3]. To ensure correct identification, protein inference cases may have to be inspected manually. Furthermore, the localization of a PTM in a protein can be of biological importance, but the exact location is often difficult to assess based on mass spectrometry data alone. Algorithms exist to estimate the quality of the localization [8], but statistics regarding PTM localization still ought to be further evaluated to ensure correct identification.

The goal of most shotgun experiments is to put the results into a bigger biological context, often by relating the results to information available in protein annotation databases, e.g., related to genes, protein functions, protein structures or biological pathways [9, 10]. There are several software solutions aimed at visualizing and interpreting shotgun proteomic data (http://www.ncbi.nlm.nih.gov/pubmed/25504833), and setting the findings into a bigger biological context, such as the freely available MaxQuant [11], or commercial alternatives like ProteomeDiscoverer (Thermo Scientific, Thermo Fisher Scientific Inc.) or Scaffold (Proteome Software, Inc.).

In this chapter, the open source and freely available analysis software PeptideShaker (http://www.ncbi.nlm.nih.gov/pubmed/25574629) will be used as an example to show what can be achieved via the use of such software packages. The main concepts and knowledge should however be transferable to most shotgun analysis tools.

11.2 Shotgun Proteomics Data

In PeptideShaker, a new project can be created based on the results from multiple identification algorithms, plus a set of identification parameters, a protein sequence database, and one or more spectrum files in the standard mgf format (http://www.matrixscience.com/help/data_file_help.html#GEN). If the search has not already been performed, it is possible to use multiple search engines via SearchGUI [12] and open the data in PeptideShaker. Public and private datasets stored in the PRIDE database [13], made available via the ProteomeXchange consortium [14], can also be reanalyzed via the “PRIDE Reshake” feature. In this chapter, the dataset made available by the developers (ProteomeXchange accession PXD000674) will be used. The dataset can be loaded by clicking “Open Example” in the PeptideShaker Welcome Dialog. For detailed tutorials on project creation, tool usage and proteomics identification in general, please see http://compomics.com/bioinformatics-for-proteomics [15].
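To give a feeling for what a spectrum file in the mgf format mentioned in Sect. 11.2 contains, the following minimal Python sketch reads spectra from such a file. It is only an illustration and not the reader used by PeptideShaker or SearchGUI: the function name read_mgf is hypothetical, and the sketch assumes the common layout in which each spectrum is enclosed by BEGIN IONS and END IONS lines, with KEY=value parameter lines followed by peak lines holding an m/z value and an intensity.

```python
from typing import Dict, Iterable, Iterator


def read_mgf(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dictionary per BEGIN IONS ... END IONS block (hypothetical helper)."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;!/":
            continue
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is None:
            # Lines outside a spectrum block (global parameters) are ignored here.
            continue
        elif "=" in line and not line[0].isdigit():
            # Parameter line such as TITLE=..., PEPMASS=... or CHARGE=...
            key, value = line.split("=", 1)
            spectrum["params"][key.strip().upper()] = value.strip()
        else:
            # Peak line: m/z followed by an (optional) intensity.
            fields = line.split()
            mz = float(fields[0])
            intensity = float(fields[1]) if len(fields) > 1 else 0.0
            spectrum["peaks"].append((mz, intensity))


# Made-up example spectrum, for illustration only.
example = """BEGIN IONS
TITLE=example spectrum
PEPMASS=500.25
CHARGE=2+
100.1 10.0
200.2 20.5
END IONS
"""
spectra = list(read_mgf(example.splitlines()))
```

Each returned dictionary keeps the parameters as strings and the peaks as (m/z, intensity) pairs, which is usually enough for a quick sanity check of a file before it is loaded into an analysis tool.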
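The target/decoy idea outlined in Sect. 11.1 can be illustrated with a short Python sketch. It assumes a search against a concatenated database with equally many target and decoy sequences, so that the number of decoy hits above a score threshold serves as an estimate of the number of false target hits. The function names and the (score, is_decoy) input format are hypothetical, and this is a simplified estimator rather than the procedure implemented in any particular software.

```python
from typing import Optional, Sequence, Tuple

Psm = Tuple[float, bool]  # (score, is_decoy); higher score = better match (assumption)


def estimate_fdr(psms: Sequence[Psm], threshold: float) -> float:
    """Estimate the FDR among target PSMs scoring at or above the threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0 if decoys == 0 else float("inf")
    # Simplified estimator: decoy hits approximate the false target hits.
    return decoys / targets


def score_threshold_for_fdr(psms: Sequence[Psm], max_fdr: float = 0.01) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR stays within max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Made-up scores, for illustration only.
psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
strict = score_threshold_for_fdr(psms, max_fdr=0.01)
relaxed = score_threshold_for_fdr(psms, max_fdr=0.25)
```

Lowering the stringency (a larger max_fdr) admits more matches at the price of more expected false positives, which is the trade-off between sensitivity and specificity discussed later in the chapter.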

11.3 Getting an Overview

After loading the data, the PeptideShaker “Overview” tab displays the combined search engine identification results at the protein, peptide and PSM level (Fig. 11.1). Three linked tables are used to represent the proteins, peptides and PSMs, meaning that selecting a protein displays the identified peptides of that protein; similarly, selecting a peptide displays the corresponding PSMs. In addition to the tables (http://www.ncbi.nlm.nih.gov/pubmed/25422159), the tab includes visualization of the PSMs in a spectrum viewer [16] and a display of the protein sequence coverage. The spectrum viewer allows for inspection of the quality of the PSMs, while the sequence coverage at the bottom shows the location of the identified peptides in a linear representation of the protein, with the selected peptide in blue. Notably, PTMs identified for the protein are also mapped onto the sequence using user-defined color coding. A targeted search for specific proteins or peptides is facilitated by a search box in the upper right corner. In short, the “Overview” tab quickly gives the user an overview of the search result and provides direct interaction with the data. Additional tabs in the upper right corner can be used to further investigate different aspects of the shotgun proteomics result.

Fig. 11.1 PeptideShaker overview tab: (1) the search box allows for targeted investigation of proteins and peptides; (2) select other tabs for additional analysis and quality control; (3) the protein table displays details on the proteins identified in the dataset; (4) the peptide table lists details on the peptides used to identify the selected protein; (5) the PSM table lists details on the PSMs used to identify the selected peptide; (6) the spectrum viewer displays the selected PSM; and (7) the sequence coverage of the selected protein is displayed in the sequence coverage panel

11.4 Protein Inference

A protein is identified either by peptides that can only derive from that specific protein, so-called unique peptides, or by peptides that can derive from several distinct proteins, so-called shared or degenerate peptides. The latter is often due to protein isoforms (or more generally proteoforms), but the proteins can also be unrelated. When a group of proteins cannot be distinguished by a unique peptide, a so-called ambiguity group is created [3], and a representative protein is chosen for the group (also sometimes referred to as a leading protein).
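The notion of an ambiguity group can be made concrete with a small Python sketch. It uses a deliberately simplified rule, stated here as an assumption: proteins supported by exactly the same set of identified peptides cannot be told apart and are merged into one group. The function name, the input format and the alphabetical choice of the representative protein are hypothetical placeholders; actual software, including PeptideShaker, applies more elaborate rules.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def ambiguity_groups(peptide_to_proteins: Dict[str, Iterable[str]]) -> List[dict]:
    """Group proteins that are supported by an identical set of peptides."""
    # Invert the mapping: which peptides support each protein?
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins with the same peptide evidence end up in the same group.
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)

    result = [
        {
            # Placeholder rule: first accession in alphabetical order.
            "representative": min(members),
            "members": sorted(members),
            "peptides": sorted(peptides),
        }
        for peptides, members in groups.items()
    ]
    return sorted(result, key=lambda group: group["representative"])


# Made-up peptide-to-protein mapping, for illustration only.
mapping = {
    "PEPTIDEA": ["P1", "P2"],  # shared (degenerate) peptide
    "PEPTIDEB": ["P1", "P2"],  # shared (degenerate) peptide
    "PEPTIDEC": ["P3"],        # unique peptide
}
groups = ambiguity_groups(mapping)
```

In this toy mapping, P1 and P2 have no distinguishing peptide and therefore share a group, whereas P3 is identified by a unique peptide and stands alone.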

Related proteins may have similar functions, and unless the experiment focuses on a specific proteoform, having a group of related proteins will usually have limited impact on the outcome. Groups of unrelated proteins are however more problematic, given that they can lead to incorrect biological interpretations. Note that the ambiguity groups that are created, and the chosen leading protein, can be different when comparing different analysis software. In practical terms, this means that the same peptides can lead to different protein identifications depending on which algorithm is used. In PeptideShaker, the protein inference (PI) status is color coded in the protein and peptide tables, and clicking it displays details on the respective peptide to protein matching. It is here possible to change the protein representing the group and its PI status. However, this reduces the reproducibility, and such changes should therefore be well grounded and documented.

11.5 Inspecting Spectrum Identifications

For proteins, peptides and PSMs, the quality of the identification is indicated in the confidence and validation columns. Low confidence often results from poor peptide to spectrum matching. The spectra for each peptide can be inspected either in the “Overview” tab or (in more detail) in the “Spectrum IDs” tab. When selecting a spectrum in the “Overview” tab, the peaks that can be explained by the peptide fragmentation are annotated and outlined in red in the spectrum viewer. The spectrum viewer also includes several additional plots that can be used to manually investigate the PSM quality (Fig. 11.2).

Fig. 11.2 PSM investigation in the PeptideShaker overview tab: (a) Example of a PSM classified as confident in PeptideShaker. (1) The most intense fragment ions identified and their intensities are illustrated with colored bars. (2) A histogram displays the intensities covered by the fragment ions (green), and the background intensities (grey). (3) The mass error is plotted against the m/z for every annotated fragment ion. (4) The PSM spectrum showing annotated peaks (red) and background peaks (gray). (5) A bubble plot of the fragment ion mass error plotted against the m/z, where the size of the bubble represents the peak intensity. (b) Example of a PSM classified as doubtful in PeptideShaker: few detected peptide fragment ions, most of the high intensity peaks are not detected, and the annotated peaks have a high mass error

A high quality PSM generally has clearly defined peptide fragment ion peaks, covering all the most intense peaks in the spectrum with low mass errors, i.e., the difference between the masses of the peaks in the experimental spectrum
and the theoretical spectrum [17]. Note that artifacts during mass spectrometry analysis can result in increasing mass error with increasing m/z values. Therefore, a high mass error is not necessarily due to a false PSM. High mass errors resulting from wrongly annotated spectra are typically sporadic, distributing with a great spread and no clear trend with increasing m/z. The distribution of the mass errors can be visualized by clicking the “Bubble Plot” option below the spectrum viewer in the “Overview” tab (Fig. 11.2a(5)).

As manual investigation of PSMs is time consuming, matches passing the statistical threshold are generally trusted. However, if a peptide or protein of interest is based only on spectra with low confidence, it is advised to further verify the presence of this peptide or protein.

11.6 Search Engine Performance

The performance of the identification algorithms, generally search engines, can be compared in the “Spectrum IDs” tab. A spectrum can be assigned to different peptides, by one or more algorithms. The matches for one particular spectrum can be viewed via the table at the top, where all the spectra generated by the mass spectrometer are listed. Selecting a spectrum displays the peptides inferred by each algorithm, showing search engine agreement (or disagreement), and the match retained by PeptideShaker. By selecting a PSM, the spectrum will be annotated accordingly, thus making it possible to compare the spectrum annotation for the different peptide candidates and inspect their validity.

As search engines have slightly different approaches, there will be differences in the numbers of annotated spectra. When the results of a given algorithm clearly deviate from the others, it may indicate that the given search engine under- or over-performed on the given experiment, e.g., a specific search engine may perform best/worst using certain identification parameters or database types. It is important to verify the source of such differences to avoid any bias in the final result. The individual search engine performance can also be compared to the combined result generated by the analysis software. If the combined result is impaired by one of the search engines, one should consider excluding it, as this may improve the overall identification rate.

11.7 Validating the Identifications

Shotgun analysis software statistically validates the identification results at a user-defined false discovery rate (FDR) threshold, and provides a confidence level illustrating the quality of the match. Both FDR and confidence are generally estimated using the target/decoy approach [6]. For an extensive description of how these error rates are calculated, see Nesvizhskii et al. [4]. Matches can also be further validated by automated expert inspection of the matches [18].

Importantly, a 1 % FDR threshold indicates that there is an estimated one false discovery per 100 validated entries. This means that even validated proteins might be false positives, which is important to keep in mind during the data analysis. The false negative rate (FNR) is also stated as an estimate of how many correct matches are left out due to the FDR threshold.

The “Validation” tab provides an overview of the total number of validated entries, and the associated FDR and FNR levels. It allows the user to tune the statistical thresholds, balancing between sensitivity and specificity. Experiments requiring high quality results should be validated at a stringent FDR (typically 1 %), while experiments interested in a high identification coverage can tune the validation threshold toward FNR minimization.

In PeptideShaker the validation approach combines statistical validation and expert inspection, resulting in three color coded categories: (i) Confident – indicating that the statistical threshold was passed as well as the quality filters (green); (ii) Doubtful – indicating that the statistical threshold was passed but not the quality filters (yellow); and (iii) Not Validated – indicating that the statistical threshold was not
passed (red). Clicking the icon representing the validation level (found in the rightmost columns in the tables in the “Overview” tab) opens a dialog with details on the match validation criteria.

The following default quality filters are employed at the PSM, peptide and protein level: (i) PSMs must have a low mass error and a high fragment ion sequence coverage; (ii) peptides have to be identified by at least two confident PSMs; and (iii) proteins have to be identified by at least two confident peptides and at least two confident PSMs. Together with the statistical validation, labeling the entries as confident or doubtful on the basis of these quality filters makes it easier to find out which identifications to trust.

11.8 Overall Quality Control

Analysis software collects statistics that allow for optimization of search parameters and evaluation of the success rate of the experiment. These quality control statistics can vary from software to software. The “QC Plots” tab in PeptideShaker displays quality metrics at the protein, peptide and PSM level, thus making it straightforward to evaluate the overall quality of the validated entries before continuing the analysis. At the protein level there are statistics on how many peptides the proteins are identified by, plus the distributions of the sequence coverage and protein lengths. The peptide QC plots include statistics on how many peptides have missed cleavages (due to incomplete digestion), the number of PSMs the peptides are identified by, and the peptide lengths. The PSM statistics include the precursor mass error and the precursor charge.

These metrics can be used as a measure of the success of the laboratory procedure and the parameters used during data analysis. For example, if the precursor mass deviation of the PSMs is large, tailoring the mass accuracy parameters might improve the number of identifications [5]. However, large mass errors may also indicate a calibration issue with the mass spectrometer [19]. If any of the quality metrics are unexpected, the source of the problem should be detected and eliminated before continuing the analysis.

11.9 Validating PTMs

Modified peptides are often much less abundant than unmodified peptides in vivo. Therefore, to detect modified peptides, the samples are commonly enriched for a certain type of PTM prior to mass spectrometry analysis. Searching for PTMs also results in several computational difficulties. First of all, the modified and unmodified peptides are most often counted as separate identifications. This can lead to an increased number of peptides for each protein, even though the protein sequence coverage remains unchanged. In practical terms, this means that proteins can pass the quality control filter of at least two confident peptides per protein on the basis of a single peptide sequence.

In order to accurately determine the PTM localization in a peptide, a large share of the peptide fragments ought to be detected. The appearance of an m/z addition in all fragment ions past a certain point in the peptide sequence, named site determining ions [20], indicates the location of the PTM. However, as usually not all peptide fragment ions are identified, the False Localization Rate (FLR) is typically higher than the False Discovery Rate (FDR), which does not account for PTM localization. In such cases the algorithms calculate the probability for the PTM localization at the modifiable residues within the peptide using PTM localization scores. In PeptideShaker, the popular A-score [20] and PhosphoRS [21] PTM probabilistic localization scores can be used complementarily to the D-score [22]. As a result, confidently localized PTMs are indicated on peptide sequences with a colored background while ambiguous sites are shown with a white background throughout the interface. For sites of interest, the relevant PSMs can be inspected by selecting them in the “Variable Modifications” table in the “Modifications” tab. Overlapping
peptides, listed in the “Related Peptides” table with the same PTM localization can then also be used as an additional quality control. More details on PTM localization inspection can be found elsewhere [8, 23].

11.10 Biological Context

In shotgun analysis software, protein annotations can be used to further understand the identified proteins by linking to commonly used protein and gene knowledge databases. A detailed list of resources and specific tools can be found in dedicated reviews [10, 24]. The ones integrated in PeptideShaker are highlighted in the following sections, as a brief introduction to the annotation possibilities.

In the “Overview” tab in PeptideShaker, the protein accession numbers are linked to the UniProt knowledgebase [25], and the chromosome annotation is provided using Ensembl [26]. Clicking the protein accession number opens the UniProt web page for the given protein, while clicking the chromosome number displays the related Ensembl gene name as well as a list of Gene Ontology (GO) terms for that gene. This provides easy access to the basic gene and protein information about the leading protein of the identified protein ambiguity group.

GO analysis can be conducted for the entire dataset in the “GO Analysis” tab. A subset of the available GO terms (a so-called GO Slim) is used to annotate the validated proteins in the dataset. The frequency of proteins annotated by each GO term is compared to the annotation frequency of the same term for the studied species in Ensembl. This can be used to see if the dataset has a significantly higher or lower frequency of proteins with gene information linked to a specific GO term (such as “aging”, “cell division”, etc.) in the selected organism. Information about a specific GO term can be accessed by clicking the GO identifier linked to the EBI QuickGO web service [27].

The “3D Structures” tab uses information from the Protein Data Bank (PDB) [28] to map the identified peptides and PTMs onto the 3D structure of the protein, displayed via Jmol [29]. By selecting specific peptides, the researcher can investigate their location on the protein structure. This function can be used to resolve PTM location conflicts [30], as PTMs are located on the protein according to their function, typically at reactive sites at the surface. Additional information about the structures is available by clicking the PDB identifier.

Annotation can be collected manually, but this can be time consuming, and knowing which database to use is not always easy. The “Annotation” tab can be used to obtain annotations from several online databases and resources. This can be done for a single protein, or for the complete list of validated proteins, and includes pathway databases such as STRING [31] and Reactome [32], protein functional databases such as DAVID [33], protein interaction databases such as IntAct [34] and protein signature databases such as InterPro [35]. In addition, there are databases that collect information from multiple resources, such as DASty [36].

Finally, it is important to keep in mind that databases are not static entities, and change with the constant input from new literature. For this reason, the database version used for annotation should always be stated in the publication, and the quality of the data should also be carefully considered [37, 38].

11.11 Conclusions and Perspectives

Visualization of shotgun proteomics results allows the researcher to investigate both the identification algorithms' performance and the quality of the experimental results. User-friendly and visual analysis software interfaces thus empower the experimentalists, allowing them to critically interpret their data using state of the art algorithms without demanding advanced knowledge in (bio)informatics.

The computational difficulties in interpreting and combining data from several search engines, as highlighted in this chapter, show the importance of using high quality analysis software as a tool to interact with and understand proteomics
data. Several software packages exist, with different ways of selecting leading proteins in PI ambiguity groups, using quality control filters, and algorithms for combining results from different search engines. For this reason, directly comparing results from different software ought to be done with caution.

To conclude, the more the researcher knows about the bioinformatics tools used for the analysis, the better the results of the analysis. However tempting, manual interference with the results should be done with the utmost caution, by both experimentalists and bioinformaticians, as it reduces reproducibility and risks introducing interpretation biases.

PeptideShaker allows the collecting of data from a single mass spectrometry run, and can also analyze several fractions together. However, comparing one project to another has to be done manually, by exporting the data and comparing them in programs such as Perseus (http://www.maxquant.org). Given that an increasing number of proteomics experiments aim at comparing different conditions measured in parallel, there is a strong need for a broader free interface allowing intuitive comparison of multiple projects, as available in commercial software.

Acknowledgments R.R.L, E.O. and F.S.B. acknowledge the support by Kjell Alme’s Legacy for Research in Multiple Sclerosis, and the Kristian Gerhard Jebsen Foundation. H.B. is supported by the Research Council of Norway.

References

1. Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422:198–207
2. Duncan MW, Aebersold R, Caprioli RM (2010) The pros and cons of peptide-centric proteomics. Nat Biotechnol 28:659–664
3. Nesvizhskii AI, Aebersold R (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 4:1419–1440
4. Nesvizhskii AI (2010) A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 73:2092–2123
5. Vaudel M, Burkhart JM, Sickmann A et al (2011) Peptide identification quality control. Proteomics 11:2105–2114
6. Elias JE, Gygi SP (2010) Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol Biol 604:55–71
7. Vaudel M, Sickmann A, Martens L (2012) Current methods for global proteome identification. Expert Rev Proteomics 9:519–532
8. Chalkley RJ, Clauser KR (2012) Modification site localization scoring: strategies and performance. Mol Cell Proteomics 11:3–14
9. Barsnes H, Martens L (2013) Crowdsourcing in proteomics: public resources lead to better experiments. Amino Acids 44:1129–1137
10. Vizcaino JA, Mueller M, Hermjakob H et al (2009) Charting online OMICS resources: a navigational chart for clinical researchers. Proteomics Clin Appl 3:18–29
11. Cox J, Mann M (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26:1367–1372
12. Vaudel M, Barsnes H, Berven FS et al (2011) SearchGUI: an open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 11:996–999
13. Vizcaino JA, Cote RG, Csordas A et al (2013) The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res 41:D1063–D1069
14. Vizcaino JA, Deutsch EW, Wang R et al (2014) ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol 32:223–226
15. Vaudel M, Venne AS, Berven FS et al (2014) Shedding light on black boxes in protein identification. Proteomics 14:1001–1005
16. Barsnes H, Vaudel M, Colaert N et al (2011) Compomics-utilities: an open-source Java library for computational proteomics. BMC Bioinf 12:70
17. Barsnes H, Eidhammer I, Martens L (2011) A global analysis of peptide fragmentation variability. Proteomics 11:1181–1188
18. Helsens K, Timmerman E, Vandekerckhove J et al (2008) Peptizer, a tool for assessing false positive peptide identifications and manually validating selected results. Mol Cell Proteomics 7:2364–2372
19. Olsen JV, de Godoy LM, Li G et al (2005) Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol Cell Proteomics 4:2010–2021
20. Beausoleil SA, Villen J, Gerber SA et al (2006) A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol 24:1285–1292
21. Savitski MM, Lemeer S, Boesche M et al (2011) Confident phosphorylation site localization using the Mascot Delta Score. Mol Cell Proteomics 10:M110.003830
22. Vaudel M, Breiter D, Beck F et al (2013) D-score: a search engine independent MD-score. Proteomics 13:1036–1041
23. Olsen JV, Mann M (2013) Status of large-scale analysis of post-translational modifications by mass spectrometry. Mol Cell Proteomics 12:3444–3452
24. Vaudel M, Sickmann A, Martens L (2014) Introduction to opportunities and pitfalls in functional mass spectrometry based proteomics. Biochim Biophys Acta 1844:12–20
25. Apweiler R, Bairoch A, Wu CH et al (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115–D119
26. Flicek P, Amode MR, Barrell D et al (2011) Ensembl 2011. Nucleic Acids Res 39:D800–D806
27. Binns D, Dimmer E, Huntley R et al (2009) QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics 25:3045–3046
28. Sussman JL, Lin D, Jiang J et al (1998) Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr 54:1078–1084
29. Herraez A (2006) Biomolecules in the computer: Jmol to the rescue. Biochem Mol Biol Educ 34:255–261
30. Vandermarliere E, Martens L (2013) Protein structure as a means to triage proposed PTM sites. Proteomics 13:1028–1035
31. von Mering C, Huynen M, Jaeggi D et al (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 31:258–261
32. Croft D, O’Kelly G, Wu G et al (2011) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res 39:D691–D697
33. da Huang W, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4:44–57
34. Kerrien S, Aranda B, Breuza L et al (2012) The IntAct molecular interaction database in 2012. Nucleic Acids Res 40:D841–D846
35. Hunter S, Jones P, Mitchell A et al (2012) InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res 40:D306–D312
36. Villaveces JM, Jimenez RC, Garcia LJ et al (2011) Dasty3, a WEB framework for DAS. Bioinformatics 27:2616–2617
37. Muller T, Schrotter A, Loosse C et al (2011) Sense and nonsense of pathway analysis software in proteomics. J Proteome Res 10:5398–5408
38. Khatri P, Sirota M, Butte AJ (2012) Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8:e1002375
12 Protein Inference
Zengyou He, Ting Huang, Can Zhao, and Ben Teng

Abstract
Protein inference is one of the most important steps in protein identification, which transforms peptides identified from tandem mass spectra into a list of proteins. In this chapter, we provide a brief introduction to this problem and present a short summary of the existing protein inference methods in the literature.

Keywords
Protein identification • Protein inference

Z. He (*) • C. Zhao • B. Teng
School of Software, Dalian University of Technology, Dalian, China
e-mail: zyhe@dlut.edu.cn

T. Huang
College of Computer and Information Science, Northeastern University, Boston, MA, USA

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical
Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_12

12.1 Problem Statement and Challenges

Protein inference describes the process used to assemble identified peptides into a list of proteins that are believed to be present in a sample. The standard input of protein inference can be considered as a bipartite graph [1], as shown in Fig. 12.1. The two sets of nodes in this bipartite graph represent the identified peptides reported by the peptide identification algorithms and the candidate proteins, respectively. Additionally, each peptide vertex has a corresponding identification score or probability. Protein vertices are the candidate proteins that may be present in the sample. If the sequence of a protein vertex in the database contains the sequence of at least one peptide vertex, this protein is a candidate protein resulting in a connection between the peptide and the protein in the bipartite graph. The task of protein inference is to make a selection from the candidate proteins that best explains all the identified peptides.

Fig. 12.1 The standard input of the protein inference problem. Here protein R1 is a one-hit wonder and peptides P2 and P3 are degenerate peptides [bipartite graph with protein vertices R1 to R4 and peptide vertices P1 to P6]

The two biggest challenges in protein inference are how to tackle degenerate peptides and one-hit wonders. Degenerate peptides are peptides that are shared by multiple candidate proteins. It is difficult to distinguish from which protein any given degenerate peptide originated. One-hit wonders are proteins that match with only one identified peptide. Since current peptide identification algorithms are not perfect, this peptide may be discovered by chance and the
reliability of the one-hit wonders cannot be guaranteed. To address these problems, different protein inference algorithms and tools have been developed during the past decade.

12.2 Algorithms and Tools

The available methods for solving the protein inference problem can be categorized into two classes [1]: the bipartite graph model and the supplementary information model, as shown in Fig. 12.2. This classification is based on the different input information that inference algorithms use to assemble the identified peptides.

12.2.1 Bipartite Graph Model

The algorithms that belong to the bipartite graph model mainly use the information generated from bipartite graphs, such as peptide-protein relationships and peptide identification scores. They can be subdivided into three categories based on the different models used: the parsimonious model, the statistical model and the optimistic model.

12.2.1.1 Parsimonious Model

Since protein inference algorithms are aimed at finding a subset of protein vertices that can cover all the identified peptides, it is natural to apply the parsimony principle (Occam’s razor principle) to solve the inference problem. More precisely, the objective is to report a minimum subset of proteins that can “explain” all identified peptides. In practice, a greedy algorithm is often used to find the solution efficiently. The greedy algorithm typically works as follows: it first selects the protein that matches with the largest number of peptides and removes all its matching peptides from the identified peptide set; then it repeats the first step until the identified peptide set is empty. The selected proteins are considered to be present in the sample. For example, applying the parsimony principle to the sample in Fig. 12.1 will successively report proteins {R4, R3, R1} or {R4, R2, R1}.

IDPicker [2, 3] is a typical method which uses the parsimony principle for protein inference. It reports the minimum protein identifications through a greedy algorithm. Meanwhile, in its latest version, IDPicker has been extended to integrate multiple peptide identification scores generated by different peptide identification methods.

DBParser [4], MassSieve [5] and LDFA [6] also employ the parsimony analysis to remove redundant protein identifications. But LDFA is a little different from the other parsimonious methods. It assigns the shared peptide to the corresponding protein according to peptide detectability, rather than the number of sibling peptides matched to the same protein. Peptide detectability is an intrinsic property of the peptide. It indicates the probability of detecting a peptide in a standard sample by a standard proteomics routine if its parent protein is present.

The methods that employ the parsimony principle have deterministic results and fast running speeds. They require very few parameters and thus are easy to use. However, only reporting the minimum number of proteins in any given sample may lead to the loss of useful information. For example, homologous proteins are likely to have the same set of identified peptides and they may all be present in the sample. Unfortunately, the parsimonious methods will
12 Protein Inference 239

Fig. 12.2 The classification of protein inference methods
[Figure: tree diagram. Bipartite graph model: parsimonious model, statistical model (parametric model, non-parametric model), optimistic model. Supplementary information model: raw MS/MS data, PMF data, peptide expression information, mRNA expression data, PPI network, gene model]
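The greedy parsimony procedure described in Sect. 12.2.1.1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by IDPicker or any other tool. The peptide-protein edges below are an assumption chosen to be consistent with the text's description of Fig. 12.1 (R1 is a one-hit wonder, P2 and P3 are shared), since the figure's edges are not recoverable here. The function name `greedy_parsimony` and the alphabetical tie-breaking rule are also hypothetical.

```python
def greedy_parsimony(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the
    most still-unexplained peptides, until every peptide is explained."""
    uncovered = set().union(*protein_to_peptides.values())
    selected = []
    while uncovered:
        # Ties are broken alphabetically (an arbitrary choice for this sketch);
        # max() returns the first protein with the highest count.
        best = max(
            sorted(protein_to_peptides),
            key=lambda protein: len(protein_to_peptides[protein] & uncovered),
        )
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected


# Assumed edges for a Fig. 12.1-like example (not taken from the figure itself).
example = {
    "R1": {"P1"},
    "R2": {"P2", "P3"},
    "R3": {"P2", "P3"},
    "R4": {"P4", "P5", "P6"},
}

print(greedy_parsimony(example))  # ['R4', 'R2', 'R1']
```

With a different tie-breaking rule the same procedure would return R3 instead of R2, which is why the text lists both {R4, R3, R1} and {R4, R2, R1} as possible outcomes.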

probably only report one of them. Moreover, each reported protein will not have a score and the probability that any one of the selected proteins is present in the sample is unknown.

12.2.1.2 Statistical Model

According to the assumptions made by these statistical methods, the methods can be divided into two categories: the non-parametric model and the parametric model.

The non-parametric model does not rely on the assumption that the data are drawn from a given parametric probability distribution. ProteinProphet [7] is the most widely used method to solve the protein inference problem. ProteinProphet employs an iterative procedure to estimate protein probabilities. It first computes the protein probability as the probability that at least one identified peptide corresponding to the protein is correct, and then re-computes the peptide weight conditioned on the protein probabilities. The above iteration process continues until convergence. ProteinProphet also considers the number of sibling peptides in the scoring procedure to facilitate the assignment of degenerate peptides to the most likely protein. ProteinProphet is integrated into the popular Trans-Proteomic Pipeline software.

MSBayesPro [8] describes two Bayesian approaches to address the protein inference problem. The basic Bayesian model assumes that all the peptides have equal identification scores. Another advanced model incorporates the peptide identification scores into the Bayesian model. Moreover, MSBayesPro provides a Gibbs sampling algorithm to quickly approximate the protein posterior probabilities. MSBayesPro has two important features: (1) the use of peptide detectability; (2) the use of both identified and non-identified peptides. These salient features will help improve the identification accuracy.

ProteinLP [9] uses the joint probability that both a protein and its constituent peptide are present in the sample as the unknown variable to compute the protein probability. It first makes a mathematical transformation of such joint probability to obtain a new variable. Then, both the peptide probability and protein probability are represented as a formula that is built on the linear combination of these new variables. Finally, the protein inference problem is

formulated as a linear programming problem. Since ProteinLP is based on a linear programming (LP) model, it can be solved efficiently with existing LP software packages.

The parametric model first assumes that the data follows some form of probability distribution and then makes an inference about the parameters of the distribution. Since parametric methods make more assumptions than non-parametric methods, they may produce more accurate protein probability estimations if these additional assumptions are correct. PROT_PROBE [10] is a typical method for protein inference using the parametric model. Each protein identification result is modeled as a random Bernoulli event which has two outcomes: a protein is either identified or not. The probability of the protein identification at each Bernoulli event is determined either from the relative length of the protein in the database (null hypothesis) or from the hyper-geometric probabilities of peptides (alternative hypothesis). By comparing the two distributions, the one that the protein belongs to is determined.

12.2.1.3 Optimistic Model

In contrast to the parsimonious model, which reports the minimum list of protein identifications, the optimistic model returns all potential proteins that meet some simple criterion. The two-peptide rule is a typical example of an optimistic model. It reports all the candidate proteins matching at least two peptides without any further filtering. For instance, applying the two-peptide rule to the example in Fig. 12.1 will report proteins R2, R3 and R4 to the user.

DTASelect [11] also falls into the category of an optimistic model. In this method, a protein is regarded as being present in the sample if it matches a sufficient number of different peptides or at least one peptide that appears many times.

The optimistic model is simple to understand and easy to use. However, if the filtering condition is overly strict, some true protein identifications would be missed. Alternatively, if the filtering condition is overly liberal, the set of reported proteins would include too many false positives.

12.2.2 Supplementary Information Model

In the bipartite graph model, it is difficult to further improve the identification performance, no matter how ideal the algorithm is. This is because the input information of this model is limited. For example, proteins R2 and R3 are very difficult to distinguish based only on the information shown in Fig. 12.1. In order to improve the identification accuracy, some supplementary information can be incorporated into the protein inference process. Such supplementary information can facilitate identification of proteins that may not be identifiable with high confidence by MS/MS evidence alone. So far, there are six types of supplementary information that have been used: raw MS/MS data, peptide mass fingerprinting data, peptide expression profiles, protein interaction networks, mRNA expression data and gene models.

Raw MS/MS Data This model takes advantage of the raw MS/MS spectra information. Protein identification includes two steps: peptide identification and protein inference. This separation may lead to a significant loss of information during the protein inference. For example, suppose only the best-matched peptide is reported for each spectrum. For a particular spectrum, if this best-matched peptide is incorrect, then the information about the second-ranked, possibly correct peptide, is not available to protein inference algorithms. Thus, the raw MS/MS data model directly conducts protein inference from the raw spectra in order to obtain better identification results. HSM [12] is a typical protein inference method that utilizes raw MS/MS data. It is an integrated statistical model, which jointly assesses the confidence of the peptides and proteins identified from raw MS/MS data.

Peptide Mass Fingerprinting Data There are two types of data for identifying proteins in the sample: single-stage MS data and MS/MS data. Shotgun proteomics is based on tandem mass spectrometry data. Peptide mass fingerprinting (PMF) is the identification method that utilizes

single stage MS data. PMF assumes that every protein has a set of peptides and thus the masses of these peptides can form its unique fingerprint. PMF matches observed peptide masses with theoretical peptide masses to identify proteins. MS-based methods provide wider coverage than MS/MS-based methods, while their identification accuracy is lower since MS data have less information than MS/MS data. It is a natural idea to combine MS data and MS/MS data in a unified model so that the identification performance can be improved. The PSC method [13] combines the MS data and MS/MS data together under a partial set covering model to identify the proteins in the sample.

Peptide Expression Information Peptide expression information, such as peptide intensity information, is widely used in label-free quantitative proteomics studies. Recently, it has been used to improve protein identifications. PIPER [14] assumes that peptides derived from the same protein should have similar expression profiles. Thus, according to the known peptide expression profiles, PIPER can filter out some identified proteins to obtain more accurate inference results.

Protein Interaction Network Most of the protein inference methods consider the candidate proteins independently. In fact, two or more proteins usually bind together to carry out the biological functions, which form the protein-protein interaction network (PPI network). That is, certain proteins are correlated with each other. Thus, it is reasonable to consider the protein-protein interaction information in the protein inference procedure. CEA [15] tries to revive the eliminated proteins by incorporating the protein interaction network. It assumes that a non-confident protein will become confident if it has a sufficient number of confident neighbor proteins in the PPI network. The model uses the relationships among proteins to adjust the identification results generated by other protein inference methods.

mRNA Expression Data mRNA expression information during transcription can be used to help estimate the protein probability as well. For example, MSpresso [16] re-calculates protein identification probabilities given their mRNA abundances.

Gene Model Compared to protein interaction network and mRNA expression data, it is easier to obtain accurate gene information quickly. A DNA segment can generate multiple proteins and these proteins are related. The existence of one protein may indicate that other proteins originating from the same gene are also present in the sample. The typical application of the gene model is Markovian Inference of Proteins and Gene Models (MIPGEM) [17]. It addresses the problem of protein and gene model inference through a probabilistic graphical model.

Different supplementary information models have their own characteristics. Methods that incorporate MS-related data (raw MS/MS data, PMF data and peptide expression data) can be applied to the analysis of any sample since such data are always available. In contrast, approaches that use other biological data can only work when the required supplementary information is available.

12.3 Validation for Protein Identifications

Since none of the protein inference algorithms are perfect, controlling the quality of inferred proteins is as important as developing protein inference algorithms. For a long time, the assessment of inferred proteins has been confused with the validation of peptide identifications. In fact, inferred proteins are more biologically relevant than identified peptides in a proteomics experiment. Therefore, it is vital to control the quality of the identification results at the protein-level. However, the accurate assessment of the confidence of protein identifications remains an open question. To date, several research efforts have been made to estimate the protein-level error rate in terms of false discovery rate (FDR).

On one hand, some methods rely on the use of decoy databases during FDR estimation. In these methods, the MS/MS spectra are first searched

against a target-decoy database and then the number of false positive protein identifications is estimated according to the number of decoy entries. The naive target-decoy method and MAYU are two examples in this category. For the naive target-decoy method, FDR is calculated by doubling the ratio of the number of decoy proteins to the total number of protein identifications. MAYU [18] uses a more sophisticated statistical model to estimate the expected number of false positive protein identifications.

On the other hand, the decoy-free method evaluates the protein inference results without searching a decoy database. For instance, the method in [19] uses a random permutation method to estimate the confidence of each protein in terms of p-value and calculates the FDR from these p-values.

12.4 Conclusions

Researchers have proposed many solutions from different angles to tackle the protein inference problem. However, the performance of currently available protein inference methods is still far from satisfactory in practice. Therefore, more research efforts are still needed in this direction.

Acknowledgement This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176 and the Fundamental Research Funds for the Central Universities of China (DUT14QY07).

References

1. Huang T, Wang J, Yu W et al (2012) Protein inference: a review. Brief Bioinform 13(5):586–614
2. Zhang B, Chambers MC, Tabb DL (2007) Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. J Proteome Res 6(9):3549–3557
3. Ma Z-Q, Dasari S, Chambers MC et al (2009) IDPicker 2.0: improved protein assembly with high discrimination peptide identification filtering. J Proteome Res 8(8):3872–3881
4. Yang X, Dondeti V, Dezube R et al (2004) DBParser: web-based software for shotgun proteomic data analyses. J Proteome Res 3(5):1002–1008
5. Slotta DJ, Mcfarland MA, Markey SP (2010) MassSieve: panning MS/MS peptide data for proteins. Proteomics 10(16):3035–3039
6. Alves P, Arnold RJ, Novotny MV et al (2007) Advancement in protein inference from shotgun proteomics using peptide detectability. Pac Symp Biocomput 12:409–420
7. Nesvizhskii AI, Keller A, Kolker E et al (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 75(17):4646–4658
8. Li YF, Arnold RJ, Li Y et al (2009) A Bayesian approach to protein inference problem in shotgun proteomics. J Comput Biol 16(8):1–11
9. Huang T, He Z (2012) A linear programming model for protein inference problem in shotgun proteomics. Bioinformatics 28(22):2956–2962
10. Sadygov RG, Liu H, Yates JR (2004) Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal Chem 76(6):1664–1671
11. Tabb DL, Mcdonald H, Yates JR (2002) DTASelect and contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J Proteome Res 1:21–26
12. Shen C, Wang ZH, Shankar G et al (2008) A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry. Bioinformatics 24(2):202–208
13. He Z, Yang C, Yu W (2011) A partial set covering model for protein mixture identification using mass spectrometry data. IEEE/ACM Trans Comput Biol Bioinform 8(2):368–380
14. Kearney P, Butler H, Eng K et al (2008) Protein identification and peptide expression resolver: harmonizing protein identification with protein expression data. J Proteome Res 7(1):234–244
15. Li J, Zimmerman LJ, Park B-H et al (2009) Network-assisted protein identification and data interpretation in shotgun proteomics. Mol Syst Biol 5:303
16. Ramakrishnan SR, Vogel C, Prince JT et al (2009) Integrating shotgun proteomics and mRNA expression data to improve protein identification. Bioinformatics 25(11):1397–1403
17. Gerster S, Qeli E, Ahrens CH et al (2010) Protein and gene model inference based on statistical modeling in k-partite graphs. Proc Natl Acad Sci U S A 107(27):12101–12106
18. Reiter L, Claassen M, Schrimpf SP et al (2009) Protein identification false discovery rates for very large proteomics datasets generated by tandem mass spectrometry. Mol Cell Proteomics 8(11):2405–2417
19. Teng B, Huang T, He Z (2014) Decoy-free protein-level false discovery rate estimation. Bioinformatics 30(5):675–681
13 Modification Site Localization in Peptides

Robert J. Chalkley

Abstract
There are a large number of search engines designed to take mass spectrometry fragmentation spectra and match them to peptides from proteins in a database. These peptides could be unmodified, but they could also bear modifications that were added biologically or during sample preparation. As a measure of reliability for the peptide identification, software normally calculates how likely a given quality of match could have been achieved at random, most commonly through the use of target-decoy database searching (Elias and Gygi, Nat Methods 4(3): 207–214, 2007). Matching the correct peptide but with the wrong modification localization is not a random match, so results with this error will normally still be assessed as reliable identifications by the search engine. Hence, an extra step is required to determine site localization reliability, and the software approaches to measure this are the subject of this part of the chapter.

Keywords
Modification site localization • False localization rate • Peak picking

13.1 Approaches

Site localization scoring approaches can be broken down broadly into two camps:

1. Those that make use of score/probability differences reported directly from the search engine that was used for peptide identification
2. Those that independently calculate a score based on an estimation of how likely a given site-determining peak may have been observed at random

I will give three examples of software employing each approach, but a more in-depth coverage of a wider range of tools has previously been published [2]. Table 13.1 summarizes approaches used by these six software tools, which will be described in more detail below.

R.J. Chalkley (*)
Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA 94143, USA
e-mail: chalkley@cgl.ucsf.edu



H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical
Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_13

Table 13.1 Comparison of site localization software

Software | Peak picking | Scoring: probability or difference score | Representing ambiguity | Search engine results applied to
A-Score | N peaks per 100 Th | Probability | Reports best site; does not indicate best alternative location | Sequest. Can be applied to multiple search engines via Scaffold
Mascot Delta Score | N peaks per 110 Th | Difference score | Reports best site; does not indicate best alternative location | Mascot
PhosphoRS | Variable number of peaks per 100 Th | Probability | Reports probability for all sites | Sequest and Mascot via ProteomeDiscoverer
PTM Score | N peaks per 100 Th | Probability | Reports probability for all sites | Andromeda (MaxQuant)
SLIP Score | N most intense peaks in each half of observed m/z range | Difference score | Lists all sites within score threshold | Protein Prospector
Variable modification localization score | 25 peaks with highest S/N after precursor and isotope removal | Difference score | Lists all sites within score threshold | Spectrum Mill
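The "N peaks per 100 Th" peak-picking scheme listed in Table 13.1 can be illustrated with a short Python sketch. It is only a schematic of the idea and does not reproduce any tool's actual code. The function name `pick_peaks`, the choice to start bins at multiples of the bin width, and the example spectrum are all assumptions made for illustration.

```python
from collections import defaultdict


def pick_peaks(peaks, n_per_bin=4, bin_width=100.0):
    """Keep the n_per_bin most intense peaks in each m/z bin.

    peaks is a list of (mz, intensity) tuples. Bins are assumed to start at
    multiples of bin_width (100 Th here; Table 13.1 lists 110 Th for Mascot).
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // bin_width)].append((mz, intensity))

    picked = []
    for index in sorted(bins):
        # Rank peaks in this bin by intensity and keep the strongest ones.
        strongest = sorted(bins[index], key=lambda peak: peak[1], reverse=True)[:n_per_bin]
        picked.extend(sorted(strongest))  # restore m/z order within the bin
    return picked


# Hypothetical spectrum: four peaks in the 100-200 bin, two in the 200-300 bin.
spectrum = [(101.1, 5.0), (150.2, 50.0), (175.0, 20.0), (199.9, 1.0),
            (250.3, 8.0), (260.0, 9.0)]
print(pick_peaks(spectrum, n_per_bin=2))
```

With `n_per_bin=2` the two weakest peaks in the first bin are discarded while both peaks in the second bin are kept, so a dense low-mass region cannot crowd out sparse regions of the spectrum.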

Examples of search engine based site localization scoring include Mascot Delta Score [3], SLIP scoring in Protein Prospector [4] and variable modification localization scoring in Spectrum Mill [5]. These scores are automatically reported by Protein Prospector and Spectrum Mill, whereas Mascot Delta Score is calculated separately by software that processes the Mascot search result output. In each case the localization score is derived by determining a score or probability difference between the top scoring peptide / site combination and the next highest scoring match of the same peptide but with a different modification site localization. Spectrum Mill reports arbitrary scores for peptide identifications, and the resulting site localization is scored on the same scale. Mascot and Protein Prospector both report probability / expectation value scores. Site localization scores reported by these two software programs are derived from differences in reported probabilities for peptides identified with different modification site localizations. Scores are reported on a log10 scale, such that a score of 10 represents an order of magnitude difference and 20 represents two orders of magnitude difference in probability score. It is important to note that although these values are derived from probability scores they are not probability measures for site localization, so should be treated simply as arbitrary scores.

Examples of software that calculate scores based on estimating the probability of matching a peak at random include A-Score [6] (which is also available as part of Scaffold [7]), PTM Score [8] in MaxQuant and PhosphoRS [9] (which is also available in ProteomeDiscoverer [10]). Both A-Score and PTM Score treat observed masses as integer values (which is a reasonable step for low mass accuracy ion trap CID data, but less so for high mass accuracy fragmentation data), then calculate probabilities under the assumption that if, for example, four peaks per 100 m/z are considered, then the probability of randomly matching a peak is 4 in 100. In the case of A-Score the resulting output is converted into a score that is -10 × log10(p), so 13 corresponds to 95 % confidence and 20 corresponds to 99 % confidence. In the case of PTM Score and PhosphoRS they invert their probabilities of random matching into probabilities of correct localization, then normalize values for all sites in the peptide so they sum to 1 (for a singly modified peptide); i.e. the peptide is definitely modified somewhere.
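The arithmetic behind these probability-based scores can be made concrete with a small Python sketch. It is a hedged illustration rather than the code of A-Score, PTM Score or PhosphoRS. The function names are hypothetical, the cumulative binomial is shown as one common way to turn a per-peak random match probability into a probability for a whole set of matches, and reading "invert" as taking 1/p before normalizing is an assumption, since the text gives no explicit formula.

```python
import math


def random_match_probability(peaks_per_window, window_width=100.0):
    """Chance of matching a peak at random, e.g. 4 peaks per 100 m/z -> 0.04."""
    return peaks_per_window / window_width


def cumulative_binomial(n_trials, n_matches, p):
    """P(at least n_matches of n_trials site-determining ions match by chance)."""
    return sum(
        math.comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(n_matches, n_trials + 1)
    )


def minus_ten_log10(p):
    """Convert a probability into a -10 x log10(p) score."""
    return -10.0 * math.log10(p)


def normalize_sites(random_probabilities):
    """Invert per-site random-match probabilities (assumed here to mean 1/p)
    and rescale so that the values over all candidate sites sum to 1."""
    weights = {site: 1.0 / p for site, p in random_probabilities.items()}
    total = sum(weights.values())
    return {site: weight / total for site, weight in weights.items()}


p = random_match_probability(4)            # 0.04, as in the text's example
print(round(minus_ten_log10(0.05), 1))     # 13.0 (95 % confidence)
print(round(minus_ten_log10(0.01), 1))     # 20.0 (99 % confidence)
print(normalize_sites({"S3": 0.001, "T5": 0.1}))
```

The two printed scores reproduce the 13 and 20 thresholds quoted in the text, and the normalized site values always sum to 1, which encodes the assumption that a singly modified peptide is modified at exactly one of the candidate sites.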

In order to be able to localize the site of modification in a peptide it is necessary to observe fragments formed by cleavage between two potential sites of modification; if such a fragment is not observed, then the site localization should be reported as ambiguous. Unfortunately, a given MS/MS spectrum will usually contain a mixture of fragments from the component of interest, but also ‘background’ ions derived from other co-isolated precursors, or maybe electrical noise. Hence, software needs to make a decision as to which peaks should be considered during scoring and site localization. This choice is normally made on the basis of intensity. Software could use a constant intensity threshold across the whole spectrum, as is the case for Spectrum Mill, but most split the spectrum into parts and then pick a certain number of the most intense peaks in each part. For example, Protein Prospector divides the observed mass range in half and then considers an equal number of peaks (as a default 20) per half. A-Score, PTM Score and Mascot split the spectrum into bins of m/z 100 (or m/z 110 in the case of Mascot), then use a constant number of peaks per m/z bin. PhosphoRS performs the same binning, but can vary the number of peaks used within each bin.

13.2 Assessing Performance

For benchmarking performance of software identifying peptides, the use of target-decoy database searching to calculate false discovery rates (FDRs) allows comparison of tools on a level playing field [1]. However, there is no equivalent approach that can be used to calculate a modification site false localization rate (FLR). The only practical way to do this is to produce datasets where the correct answers are known and hope scores have the same meaning when analyzing other data. Two approaches have been used to create such datasets.

The first is to create synthetic peptide libraries with known modification sites. The publications describing Mascot Delta Score [3] and PhosphoRS [9] created libraries of about 180 phosphopeptides for benchmarking their performance, and the former of these datasets was also used for evaluating A-Score [3] and SLIP scoring in Protein Prospector [4]. More recently a much larger synthetic phosphopeptide library of greater than 100,000 phosphopeptides was created using a limited number of seed sequences, then permuting the residues −1 and +1 from the modified residue, and this reference dataset was used for comparing PTM Score, PhosphoRS and Mascot Delta Score [11]. An interesting result from this comparison was the surprisingly high complementarity of tools; i.e. each tool reliably identified a different subset of sites, so combining multiple tools on a dataset could significantly increase the number of reliable site localizations.

The other approach employed for creating a dataset of known answers was to take data where there is only one potential site of modification, then measure how often there is an assignment to decoy amino acid residues [4]. Using proline and glutamine residues as decoy potential phosphorylation sites, about 10,000 phosphopeptide spectra were assessed, from which false localization rates for a given score could be calculated for SLIP scoring in Protein Prospector.

13.3 Effect of Fragment Mass Accuracy

As previously stated, both A-Score and PTM Score were designed for analysis of low mass accuracy fragmentation data and assume only unit mass accuracy. Hence, they do not make use of higher accuracy mass measurement. By narrowing the mass window bins, e.g. using 0.1 Da instead of 1 Da bins, this information could be utilized, and this type of approach is what PhosphoRS does. With narrow mass bins, for most windows it will be impossible to produce a peptide fragment within the given mass range, so the approach of assuming equal likelihood of matching a peak in all bins (and hence the probability calculation) falls down. However, as final probabilities are all normalized to sum to 1, it is unclear how much of an issue this really is. Search engine site localization scoring

will automatically make use of mass accuracy through the fragment mass tolerance during database searching, so may have advantages for analyzing high mass accuracy data.

13.4 Handling Ambiguity

It is rare to get fragmentation of all peptide bonds in a peptide, especially in CID or HCD fragmentation, which both have a strong sequence preference in bond cleavage [12]. ETD produces more even fragmentation, meaning you are more likely to be able to determine modification sites from this type of data [11], but there will still be spectra where a site cannot be reliably determined. Software programs handle this issue in different ways. In the case of A-Score and Mascot Delta Score, the software has to assign a site, even if the score is 0, and they do not indicate what the alternative modification location would be. Protein Prospector and Spectrum Mill employ localization score thresholds, and if the score is below this threshold then they report all residues that could be modified. In the case of PTM Score and PhosphoRS they report probabilities for all sites. An attractive feature of the Protein Prospector output is that it provides hyperlinks to annotated spectra where if there is localization ambiguity then it will annotate with both/all localizations and indicate which peaks (if any) are unique to one localization interpretation (Fig. 13.1). This feature is also accessible through a web interface to support data sharing and publication, using MS-Viewer [13].

13.5 Application

Phosphorylation is the post-translational modification that has seen the greatest need for site localization scoring due to the combination of it being heavily studied, but also because it can occur on several amino acids, most commonly serine, threonine and tyrosine residues, so there are often multiple potential sites of phosphorylation present in a given peptide. Indeed, several of the tools described in this chapter were specifically designed for phosphopeptide analysis, although there is no reason why they cannot be adapted for other PTMs. However, the user must appreciate that each software calculates

Fig. 13.1 Annotation of alternative site localization. Protein Prospector reported the site localization of the HexNAc sugar modification in this peptide as ambiguous between Thr-2 and Ser-3. Annotating the alternatives shows z ions consistent with both site localizations, suggesting this may be a mixture spectrum of the two different singly modified versions co-eluting
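The threshold-based handling of ambiguity described in Sect. 13.4, and the kind of ambiguous call shown in Fig. 13.1, can be sketched in Python as follows. This is a schematic under stated assumptions, not the logic of Protein Prospector or Spectrum Mill. The function name `report_sites`, the scores and the threshold value are hypothetical, and "within score threshold" is interpreted here as a score difference to the best site that is smaller than the threshold.

```python
def report_sites(site_scores, threshold):
    """Return the single best site if it beats the runner-up by at least
    `threshold`; otherwise return every site within `threshold` of the best."""
    ranked = sorted(site_scores.items(), key=lambda item: item[1], reverse=True)
    best_score = ranked[0][1]
    if len(ranked) == 1 or best_score - ranked[1][1] >= threshold:
        return [ranked[0][0]]
    # Ambiguous: list all candidate sites that cannot be ruled out.
    return [site for site, score in ranked if best_score - score < threshold]


# Hypothetical per-site scores for one modified peptide.
print(report_sites({"Thr-2": 31.0, "Ser-3": 29.5, "Ser-7": 12.0}, threshold=6.0))
print(report_sites({"Thr-2": 31.0, "Ser-3": 20.0}, threshold=6.0))
```

In the first call the two leading sites differ by less than the threshold, so both are reported as alternatives; in the second call the difference is large enough for a single site to be reported.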

scores/probabilities under the assumption that only the residues specified could bear a particular modification. This becomes a complication when different modifications produce a similar or identical mass change. For example, lysine methylation is an important biological modification [14]. However, methylation of carboxylic acid residues can easily be introduced during sample handling (e.g. staining and storage of gels in solutions containing methanol and acid is common) and also many single amino acid substitutions such as serine to threonine, or valine to leucine or isoleucine produce the same mass shift. Hence, it is important to make sure software is considering any potential location for the modification. It is also worth remembering that during the peptide identification step itself the reliability of identifications of modified peptides may be lower than for unmodified because the search engine generally considers many more modified peptides than unmodified, leading to a higher proportion of the incorrect answers at a given FDR threshold being modified peptides [15].

13.6 Conclusions

Using modified peptide identifications from a search engine without any evaluation of site localization reliability produces many incorrect results. There are several tools that can evaluate site localization reliability, although in many cases the choice of tool is dictated by the search engine that was used for the peptide identification step, such that a user may not formally have any choice as to which they use. Reassuringly, these tools seem to perform reasonably at evaluating the reliability of the results they report at a given acceptance threshold, although they all clearly have many false negative results where they correctly identify the modification site but score the result below a confidence threshold. Hence, there is clear room for improvement in the performance of these tools.

References

1. Elias JE, Gygi SP (2007) Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 4(3):207–214
2. Chalkley RJ, Clauser KR (2012) Modification site localization scoring: strategies and performance. Mol Cell Proteomics 11(5):3–14
3. Savitski MM, Lemeer S, Boesche M, Lang M, Mathieson T, Bantscheff M, Kuster B (2011) Confident phosphorylation site localization using the Mascot Delta Score. Mol Cell Proteomics 10(2):M110.003830
4. Baker PR, Trinidad JC, Chalkley RJ (2011) Modification site localization scoring integrated into a search engine. Mol Cell Proteomics 10(7):M111.008078
5. Spectrum Mill – Agilent Technologies Inc. Available from: http://www.chem.agilent.com/en-US/Products/software/chromatography/ms/spectrummillformasshunterworkstation/pages/default.aspx
6. Beausoleil SA, Villen J, Gerber SA, Rush J, Gygi SP (2006) A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol 24(10):1285–1292
7. Scaffold – Proteome Software. Available from: http://www.proteomesoftware.com/products/ptm/
8. Olsen JV, Blagoev B, Gnad F, Macek B, Kumar C, Mortensen P, Mann M (2006) Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 127(3):635–648
9. Taus T, Kocher T, Pichler P, Paschke C, Schmidt A, Henrich C, Mechtler K (2011) Universal and confident phosphorylation site localization using phosphoRS. J Proteome Res 10(12):5354–5362
10. Proteome Discoverer – Thermo Scientific. Available from: http://www.thermoscientific.com/en/product/proteome-discoverer-software.html
11. Marx H, Lemeer S, Schliep JE, Matheron L, Mohammed S, Cox J, Mann M, Heck AJ, Kuster B (2013) A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics. Nat Biotechnol 31(6):557–564
12. Kapp EA, Schutz F, Reid GE, Eddes JS, Moritz RL, O’Hair RA, Speed TP, Simpson RJ (2003) Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal Chem 75(22):6251–6264
13. Baker PR, Chalkley RJ (2014) MS-viewer: a web-based spectral viewer for proteomics results. Mol Cell Proteomics 13(5):1392–1396
14. Moore KE, Gozani O (2014) An unexpected journey: lysine methylation across the proteome. Biochim Biophys Acta 1839(12):1395–1403
15. Chalkley RJ (2013) When target-decoy false discovery rate estimations are inaccurate and how to spot
Acknowledgement This work was supported by NIH instances. J Proteome Res 12(2):1062–1064
NIGMS grant 8P41GM103481.
14 Useful Web Resources
Andre Bui and Maria D. Person

Abstract
An increasing number of web resources are available for aiding in proteo-
mics research. Databases contain repositories of proteins and associated
information. A recent article by Chen et al. (Genomics Proteomics Bioin-
formatics 13(1):36–39, 2015) evaluates a number of MS-based proteomics
repositories containing MS and expression data, including repositories
devoted to cataloguing high confidence post-translational modifications.
Many sites have tools developed by research labs that are shared with the
community and online tutorials and videos for learning how to use the
tools. This chapter contains a selection of web sites useful for proteomics
analyses but is by no means comprehensive. Using a search engine such as
Google is the easiest way to find the sites using the name given below.

Keywords
Proteomics • Informatics • Web resources • Mass spectrometry • Protein
identification • Databases

An increasing number of web resources are available for aiding in proteomics research. Databases contain repositories of proteins and associated information. A recent article by Chen et al. [1] evaluates a number of MS-based proteomics repositories containing MS and expression data, including repositories devoted to cataloguing high confidence post-translational modifications. Many sites have tools developed by research labs that are shared with the community and online tutorials and videos for learning how to use the tools. This chapter contains a selection of web sites useful for proteomics analyses but is by no means comprehensive. Using a search engine such as Google is the easiest way to find the sites using the name given below.

A. Bui • M.D. Person (*)
Proteomics Facility, Institute for Cellular and Molecular Biology and College of Pharmacy, The University of Texas at Austin, Austin, TX, USA
e-mail: mperson@austin.utexas.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_14

ExPASy Launched by the Swiss Institute of Bioinformatics (SIB), the Expert Protein Analysis System (ExPASy) is a bioinformatics resource portal that collects various web resources and repositories for protein and proteomics analyses. Proteomics and protein specific applications include various molecular weight calculators, software for 2-D PAGE analysis, sequence alignment tools, and protein structure and modeling tools.

UniProt UniProt is a repository of proteome sequence databases for multiple species stored in FASTA formats that can be used across various proteomics software packages for analysis. A European consortium of EMBL, SIB, and PIR maintains it. The UniProt databases contain both manually annotated and reviewed Swiss-Prot databases and TrEMBL databases, which are automatically annotated and unreviewed. Individual protein information can also be accessed, which includes function, protein localization, commonly associated PTMs, sequence and sequence similarity, and structural information, along with references. Full proteomes for each species can be downloaded as FASTA files for use in database search algorithms and these include all proteins validated, either with or without the isoforms.

14.1 Human Proteome Resources

There are a number of species-specific genomics resources [2], and 2014 saw publication of papers detailing the most in-depth exploration of the human proteome by mass spectrometry methods [3, 4]. Over 10,000 gene products were detected in both studies using high resolution FT-MS data, and covering all major organs as well as several cell types. However, the extensive datasets did not detect every protein and there is controversy over the accurate calculation of false discovery rates that have a major impact on the number of identified proteins. In 2015, the antibody based complementary Human Protein Atlas was completed [5].

neXtProt As an elaboration of the annotation seen in UniProt Swiss-Prot entries, neXtProt is limited to annotating and organizing data on human proteins from SIB and GeneBio. Information on function, expression, interaction, localization, proteomics, structures, GO terms and medical implications is presented.

Human Protein Atlas This site contains a spatial proteomics atlas resolved at the single cell level with images from immunohistochemical (IHC) staining for predicted proteins from a consortium of investigators in Sweden and India. The proteins are organized by tissue using a gene centric approach. There are atlases for 44 tissues and 46 cell lines, staining from 20 types of cancer, with the first complete draft released in 2015 containing 13 million annotated images, all provided in an accessible platform. Abundance estimates are made based on RNA-seq FPKM values, including coverage plots of the reads, while protein expression is visualized with tissue IHC. While there is RNA evidence for all the predicted protein encoding genes, the antibody evidence exists for 83 % of the predicted proteins, about 17 K proteins. Tissue enriched and enhanced proteins are identified, and housekeeping, secretome, membrane, regulatory, isoform, cancer and druggable proteomes are defined and explored. The antigen peptide library of protein epitope signature tags (PrEST) has enabled defined standards for LC-MS/MS based quantitation. From these, ratios between RNA and protein levels have been defined for a subset of proteins in specific cell lines. The primary caveat with the use of antibody-based data is cross reactivity, estimated at causing 25–50 % of IHC staining. However, extensive validation of antibodies is performed, including checking if paired antibodies display identical behavior in human tissue and using genetic methods like siRNA and CRISPR on cell lines. A subcellular proteome atlas is being developed for release in 2016 to detail protein expression in subcellular compartments.

Human Proteome Map This LC-MS/MS based resource provides a graphical overview of expression levels in multiple tissues. These are the results of a large-scale project on 17 adult and seven fetal tissues, and six hematopoietic cells. The proteins in this database represent over 17,000 genes. Proteotypic peptide sequences are given for each protein to facilitate targeted studies.
ProteomicsDB An LC-MS/MS based database of human proteins with coverage of 93 % of predicted proteins, over 18,000 proteins represented. You can search for individual proteins and find projects where this protein has been identified and the proteotypical peptide for the protein. Tissue expression levels are displayed, and RNA-Seq data has been incorporated in a heat map display format.

14.2 Protein Identification and Quantitation

Mascot Matrix Science has an extensive collection of educational materials on the protein identification process as well as help for one of the most commonly used, platform independent database search algorithms for protein identification.

ProteinProspector Developed by the UCSF Mass Spectrometry Facility, ProteinProspector is a suite of platform independent tools designed for MS based proteomics experiments, ranging from initial experimental design to data processing and database searching. Some of the more useful toolsets are highlighted here:

MS-Digest: a tool for performing in silico digests of proteins and protein databases using various available proteases, with parameters supplied by the user.
MS-Viewer: allows for the annotation and visualization of MS/MS spectra from database searches.
DB-Stat: mines for statistical information from supplied FASTA databases, such as total number of entries, entries within a selected molecular weight range, mass of the longest protein, and other desired information.

Trans Proteomic Pipeline The Trans Proteomic Pipeline (TPP) is a collection of freeware software tools for MS/MS based proteomics analysis originating from the Ruedi Aebersold lab. The workflow of the TPP is designed to be platform independent, handling vendor specific file formats by converting to the universal standard file format. The search engine of choice performs database searching, while peptide and protein validations are handled by the PeptideProphet and ProteinProphet toolsets. Quantitation and visualization modules are also available, making the TPP an all-in-one freeware platform for tandem mass spectrometry based protein identification experiments.

MaxQuant The site contains tools for protein identification in Andromeda, for protein quantitation by stable isotope labeled and label free methods, and for statistical analysis and visualization in Perseus developed by the Jürgen Cox lab. They have given particular attention to accurate false discovery rate calculations, using q values to determine protein FDRs that are more stringent than the PSM or peptide FDRs given by other programs. Tools for learning the software include videos and help sites, and a summer course is held every year to master the software.

pFind Studio pFind Studio is a freeware suite of programs designed for computational analysis of MS/MS based proteomics experiments from the Chinese Academy of Sciences. Included in the package are:

pFind: protein identification software
pLink: designed for the analysis of cross-linked peptides and SUMOylation
pNovo: De novo peptide sequencing using various fragmentation methods
pLabel: spectral annotation software for visualizing spectral matches in MS/MS results

Skyline Skyline is a platform independent, freely available application designed for chromatography-based quantitation using MS1 and MS2 ion intensities developed by the Michael MacCoss lab. Skyline originated as a
toolset for the development of SRM/MRM assays for triple quadrupole instruments, but has expanded to include parallel reaction monitoring (PRM), DIA, and targeted DDA methods. Alongside its method development features, Skyline offers the ability to QC individual peptides for quantitation, a robust toolset for visualizing the results of individual quantitative MS runs, and the ability to export customized reports depending on the experimental requirements. Tutorials show how to use the software, and extensive visualization is possible for manual quality control. Webinars are held as new tools are developed.

Panorama Also developed by the MacCoss lab, Panorama is freely available software that is designed to act as a repository server application for Skyline targeted proteomics experiments. Panorama is designed to be a platform for sharing and organizing data in the Skyline format for easy visualization and access. Information such as results, spectral annotations, or spectral library chromatograms can be shared among collaborators.

CRAPome The Contaminant Repository for Affinity Purification (CRAPome) is an annotated database provided by a collaboration between the Alexey Nesvizhskii and Anne-Claude Gingras labs which contains negative controls considered as common protein contaminants in mass spectrometry experiments. The CRAPome also contains software available for the analysis of data generated from tandem MS experiments against the CRAPome database.

14.3 Protein Modifications

Luciphor2 Luciphor2 is an enhanced version of the original Luciphor. While the original implementation focused solely on site localization and scoring of phosphorylation events on individual peptides, Luciphor2 expands this to include any PTM. Using a JAVA based implementation, the advantage of Luciphor2 over the original is not only its ability to evaluate any PTM but also to score results from any search tool. Luciphor2 can process PeptideProphet XML files derived from the TPP or tab-delimited files with scores from any protein search engine.

ProteomeScout A compendium of information on PTMs from six large databases and additional experiments created by the Naegle lab. Query with protein name to get information about modifications, binding partners, mutations, domains and structural elements.

ProSight Lite Developed for top-down analysis of protein sequences with fragment matching for a variety of fragment types by the Neil Kelleher lab at Northwestern University. ProSight Lite is available for free download. Using deconvoluted MS and MS/MS data acquired from intact proteins, protein modifications can be mapped.

14.4 Protein Interactions

String-DB A database of protein-protein interactions maintained by a European consortium including the University of Copenhagen, EMBL and University of Zurich. The information is culled from many sources including experimental evidence from pull down mass spectrometry experiments, co-expression of transcripts, genomic context and retrieval from the literature. The interactions are displayed graphically with protein balls connected by lines showing the type or confidence of the evidence. The protein structures from Protein Data Bank are accessed with a single click, making this site a great source for graphics.

DAVID A source for functional classification of user data based on enrichment of GO terms and other annotation metrics from SAIC-Frederick.
14.5 Orbitrap Information

PlanetOrbitrap A website designed to act as an umbrella and informational repository for the Thermo Orbitrap family of instruments. Included is a science library that has access to peer-reviewed scientific papers, application notes and technical guides, poster presentations from conferences, product support notes, and webinars for the application of the Orbitrap to various experimental needs. A community forum is also available for members to interact and get troubleshooting and tips from the wide network of Thermo Orbitrap users.

Acknowledgments Thanks to Michelle Gadush for website suggestions. This work was supported by CPRIT grant RP110782 and the UT system proteomics core facility network grant.

References

1. Chen T, Zhao J, Ma J, Zhu Y (2015) Web resources for mass spectrometry-based proteomics. Genomics Proteomics Bioinformatics 13(1):36–39. doi:10.1016/j.gpb.2015.01.004. Review
2. Tang B, Wang Y, Zhu J, Zhao W (2015) Web resources for model organism studies. Genomics Proteomics Bioinformatics 13(1):64–68. doi:10.1016/j.gpb.2015.01.003. Review
3. Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, Madugundu AK, Kelkar DS, Isserlin R, Jain S, Thomas JK, Muthusamy B, Leal-Rojas P, Kumar P, Sahasrabuddhe NA, Balakrishnan L, Advani J, George B, Renuse S, Selvan LD, Patil AH, Nanjappa V, Radhakrishnan A, Prasad S, Subbannayya T, Raju R, Kumar M, Sreenivasamurthy SK, Marimuthu A, Sathe GJ, Chavan S, Datta KK, Subbannayya Y, Sahu A, Yelamanchi SD, Jayaram S, Rajagopalan P, Sharma J, Murthy KR, Syed N, Goel R, Khan AA, Ahmad S, Dey G, Mudgal K, Chatterjee A, Huang TC, Zhong J, Wu X, Shaw PG, Freed D, Zahari MS, Mukherjee KK, Shankar S, Mahadevan A, Lam H, Mitchell CJ, Shankar SK, Satishchandra P, Schroeder JT, Sirdeshmukh R, Maitra A, Leach SD, Drake CG, Halushka MK, Prasad TS, Hruban RH, Kerr CL, Bader GD, Iacobuzio-Donahue CA, Gowda H, Pandey A (2014) A draft map of the human proteome. Nature 509(7502):575–581. doi:10.1038/nature13302. PubMed PMID: 24870542; PubMed Central PMCID: PMC4403737
4. Wilhelm M, Schlegl J, Hahne H, MoghaddasGholami A, Lieberenz M, Savitski MM, Ziegler E, Butzmann L, Gessulat S, Marx H, Mathieson T, Lemeer S, Schnatbaum K, Reimer U, Wenschuh H, Mollenhauer M, Slotta-Huspenina J, Boese JH, Bantscheff M, Gerstmair A, Faerber F, Kuster B (2014) Mass-spectrometry-based draft of the human proteome. Nature 509(7502):582–587. doi:10.1038/nature13319
5. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I, Edlund K, Lundberg E, Navani S, Szigyarto CA, Odeberg J, Djureinovic D, Takanen JO, Hober S, Alm T, Edqvist PH, Berling H, Tegel H, Mulder J, Rockberg J, Nilsson P, Schwenk JM, Hamsten M, von Feilitzen K, Forsberg M, Persson L, Johansson F, Zwahlen M, von Heijne G, Nielsen J, Pontén F (2015) Proteomics. Tissue-based map of the human proteome. Science 347(6220):1260419. doi:10.1126/science.1260419
15 Mass Spectrometry-Based Protein Quantification
Yun Chen, Fuqiang Wang, Feifei Xu, and Ting Yang

Abstract
Quantification of individual proteins and even entire proteomes is an
important theme in proteomics research. Quantitative proteomics is an
approach to obtain quantitative information about proteins in a sample.
Compared to qualitative or semi-quantitative proteomics, this approach
can provide more insight into the effects of a specific stimulus, such as a
change in the expression level of a protein and its posttranslational
modifications, or to a panel of proposed biomarkers in a given disease
state. Proteomics methodologies, along with a variety of bioinformatics
approaches, are a major tool in quantitative proteomics. As the theory and
technological aspects underlying the proteomics methodologies will be
extensively described in Chap. 20, and protein identification as a prereq-
uisite of quantification has been discussed in Chap. 17, we will focus on
the quantitative proteomics bioinformatics algorithms and software tools
in this chapter. Our goal is to provide researchers and newcomers a
rational framework to select suitable bioinformatics tools for data analy-
sis, interpretation, and integration in protein quantification. Before doing
so, a brief overview of quantitative proteomics is provided.

Keywords
Protein quantification bioinformatics • Quantitative signal processing •
ASAPRatio • MaxQuant • Progenesis QI • APEX • Trans-Proteomic
Pipeline (TPP) • IsobariQ and IQuant • Targeted proteomics •
PeptideAtlas • Skyline • ATAQS

Y. Chen (*) • F. Wang • F. Xu • T. Yang
School of Pharmacy, Nanjing Medical University, 818 Tian Yuan East Road, Nanjing 211166, China
e-mail: ychen@njmu.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_15

15.1 Brief Introduction of Quantitative Proteomics

Despite a number of recent developments in proteomics-associated technologies, such as two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) and protein microarrays [1], mass spectrometry (MS)-based proteomics remains an essential technique for quantitative proteome analyses. Our focus here is on MS-based proteomics.

In MS-based proteomics, two fundamental approaches are currently employed: top-down and bottom-up proteomics. In top-down proteomics, intact proteins or large fragments are subjected to mass spectrometry. Bottom-up proteomics relies on proteolytic peptides, which are generated by enzymatic digestion of proteins. Of several such strategies that have been developed, all involve the digestion of proteins into peptides, typically with trypsin, followed by chromatographic separation, ionization and mass spectrometric analysis of the complex peptide samples. Due to the protein size limitation (<50 kDa) and the general reduction of sensitivity by one order of magnitude in top-down proteomics [2], bottom-up proteomics is more commonly used. Currently, bottom-up proteomics has three types of classifications:

1. Relative and absolute quantification (according to the information they can provide)
2. Label-based and label free proteomics (according to the underlying methodology)
3. Discovery and targeted proteomics (according to the pre-selected range of proteins) (Table 15.1)

Targeted proteomics (hypothesis-driven proteomics) is a new concept in protein quantification for highly selective and high-throughput analysis of one or more target proteins, corresponding to discovery (shotgun) proteomics [3].

Relative quantification compares the specific protein level in different samples, with results being expressed as a relative fold change of protein abundance, whereas absolute quantification is the determination of the exact amount or mass concentration of a protein, for example, in units of ng/mL of a plasma biomarker or in mol/cell of a cellular protein. Both relative and absolute quantification can be achieved using isotopic/isobaric labeling or label-free strategies. Stable isotope labeling typically compares naturally abundant stable isotope peptides to physicochemically identical peptides with atoms enriched in a heavy stable isotope at the MS level. Isobaric labeling allows peptides or proteins to be labeled with isobaric reagents and is usually detected at the MS/MS level [4]. Labeling technologies include in vivo labeling via metabolic incorporation or in vitro labeling via chemical reactions. Metabolic incorporation such as SILAC (stable isotope labeling with amino acids in cell culture) introduces stable isotope labeled amino acids in cells. Chemical reactions such as iTRAQ or TMT incorporate amine-specific isobaric tags onto sites such as the N-terminus, C-terminus, cysteines, lysines, and tyrosines. Alternatively, label-free quantification uses ion signal intensities acquired by the mass spectrometer (i.e., ion intensity measurement) or the number of spectra matched to peptides from a protein (i.e., spectral counting) as a proxy to assess the protein quantities within the sample.

Absolute quantification provides accurate protein amount information by spiking protein or peptide samples with known concentrations of heavy isotope labeled synthetic peptides. As an essential approach in absolute quantification, targeted proteomics has recently taken front stage in the proteomics community [5]. Targeted proteomics specifically refers to absolute quantification using selected/multiple reaction monitoring (SRM or MRM) on a triple quadrupole instrument and a stable isotope labeled internal standard [3]. Targeted proteomics strategies limit the number of proteins that are monitored and optimize the chromatography instrument tuning and acquisition methods to achieve the highest sensitivity and throughput for hundreds or thousands of samples. Discovery proteomics often requires large sample quantities and multi-dimensional fractionation, which diminishes sensitivity and throughput [6].
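The distinction between relative and absolute quantification drawn in this section can be made concrete with a small calculation. The following is a minimal sketch, not a published implementation: it assumes that the endogenous (light) peptide and the spiked heavy-labeled standard respond identically in the mass spectrometer, so that their intensity ratio equals their molar ratio. The names `SpikedPeptide`, `absolute_amount_fmol`, `fold_change` and `mol_per_cell` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpikedPeptide:
    """Signals for one peptide measured together with its heavy standard."""
    light_intensity: float    # ion intensity of the endogenous (light) peptide
    heavy_intensity: float    # ion intensity of the heavy-labeled standard
    heavy_amount_fmol: float  # known amount of standard spiked into the sample


def absolute_amount_fmol(p: SpikedPeptide) -> float:
    """Absolute amount of the light peptide, assuming equal response factors."""
    if p.heavy_intensity <= 0:
        raise ValueError("heavy standard intensity must be positive")
    return p.light_intensity / p.heavy_intensity * p.heavy_amount_fmol


def fold_change(amount_a: float, amount_b: float) -> float:
    """Relative quantification: abundance in sample A relative to sample B."""
    if amount_b <= 0:
        raise ValueError("reference amount must be positive")
    return amount_a / amount_b


def mol_per_cell(amount_fmol: float, n_cells: float) -> float:
    """Convert an absolute amount in fmol into mol per cell."""
    return amount_fmol * 1e-15 / n_cells


if __name__ == "__main__":
    treated = SpikedPeptide(2.0e6, 1.0e6, 50.0)
    control = SpikedPeptide(0.5e6, 1.0e6, 50.0)
    a, b = absolute_amount_fmol(treated), absolute_amount_fmol(control)
    print(a, b, fold_change(a, b), mol_per_cell(a, 1e6))
```

The same heavy/light ratio yields either an absolute amount (when the spiked amount is known) or only a fold change (when two samples are compared), which is why the text notes that both kinds of result can come from labeling strategies.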
Table 15.1 Classification of proteomics

 | Absolute quantification | Relative quantification
Label-based | AQUA, SISCAPA, PSAQ, Absolute SILAC, FLEXIQuant, QCONcat | Metabolic: 15N, SILAC; Chemical: ICAT, ICPL, iTRAQ, TMT, IPTL, DML, mTRAQ; Enzymatic: 18O
Label-free | PAI, emPAI, APEX, Top3, iBAQ | Ion intensity (XIC); Spectral counting
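Of the label-free indices listed in Table 15.1, PAI and emPAI are simple enough to compute by hand. The sketch below follows the way these indices are commonly defined: PAI is the number of observed peptides divided by the number of observable peptides for a protein, emPAI is 10 raised to the PAI minus one, and the molar fraction of a protein is its emPAI divided by the sum of all emPAI values. The function names and the example protein labels are hypothetical, and how "observable" peptides are counted is left to the caller.

```python
from typing import Dict


def pai(n_observed: int, n_observable: int) -> float:
    """Protein abundance index: observed / observable peptides."""
    if n_observable <= 0:
        raise ValueError("number of observable peptides must be positive")
    return n_observed / n_observable


def empai(n_observed: int, n_observable: int) -> float:
    """Exponentially modified PAI: 10**PAI - 1."""
    return 10 ** pai(n_observed, n_observable) - 1


def mol_percent(empai_values: Dict[str, float]) -> Dict[str, float]:
    """Protein content in mol %, from emPAI values of all proteins in a sample."""
    total = sum(empai_values.values())
    if total <= 0:
        raise ValueError("sum of emPAI values must be positive")
    return {name: value / total * 100 for name, value in empai_values.items()}


if __name__ == "__main__":
    # Hypothetical proteins given as (observed, observable) peptide counts.
    counts = {"protein_A": (10, 10), "protein_B": (5, 10)}
    values = {name: empai(obs, total) for name, (obs, total) in counts.items()}
    print(values)
    print(mol_percent(values))
```

Because the index is exponential in the peptide coverage, a protein with all observable peptides detected scores far higher than one with half of them detected, rather than merely twice as high.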

15.2 Bioinformatic Tools in Quantitative Proteomics

Data analysis in MS-based proteomics is more challenging than for other high-throughput technologies and remains a principal bottleneck in proteomics [7]. There are a number of bioinformatics tools available in quantitative proteomics. These tools can be combined in various ways to generate different proteomics data analysis pipelines. Table 15.2 provides a summary of quantitative proteomics approaches, along with the associated software tools. Theoretically, all the approaches are achieved by providing/comparing the amounts of peptides and proteins. There are common challenges for calculating quantitative values at the protein and peptide level. Thus, these common issues will be first presented with bioinformatics solutions, followed by a selective description of several software tools for both relative and absolute quantification. As the majority of tools have been described in protein identification, we will focus on their quantitative features in data analysis. Additionally, targeted proteomics and its recently developed tools, such as Skyline and ATAQS, will be extensively illustrated due to their novelty in proteomics.

Taking previous reviews into account [8–10], Table 15.3 shows the details of most available software tools, including the supported instruments, free/commercial software, type of data, database search engine, requisite input files and software dependencies, as well as programming languages and supported operating systems. We hope that this information can provide a starting point for further reading or an initial guide for newcomers.

15.3 Common Issues in Proteomics Quantification

Data analysis in relative proteomics quantification generally includes raw data processing, followed by an ion chromatogram ratio calculation to infer the peptide abundance ratio, followed by relative protein abundances calculation using peptide ratios [11]. Absolute quantification is often performed in a manner similar to relative quantification. Absolute quantity of a peptide is calculated by comparing its ion intensity with the ion intensity of an identical chemically synthesized heavy isotope labeled peptide spiked in with known concentration as an internal standard. Many of the same problems encountered in relative quantification still occur in absolute quantification, and the existing software for relative quantification could be easily adapted for absolute quantitative purposes. Figure 15.1 is an overview of quantitative proteomics data analysis. Common computational and statistical issues will be interpreted below. Baseline subtraction, noise filtering, mass calibration, retention time alignment, and peak detection that are primarily employed in spectra processing and protein identification will also be briefly described for completeness.
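The three steps of relative quantification named in Sect. 15.3 (processed ion chromatograms, peptide abundance ratios, protein abundance ratios) can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the algorithm of any tool in Table 15.2: peak areas are obtained by trapezoidal integration of an already baseline-corrected extracted ion chromatogram, and peptide ratios are summarized to a protein ratio by the median of the log2 ratios, which is only one of several possible roll-up rules. All function names are hypothetical.

```python
import math
from statistics import median
from typing import Sequence


def xic_area(times: Sequence[float], intensities: Sequence[float]) -> float:
    """Area under an extracted ion chromatogram by the trapezoidal rule."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (intensities[i] + intensities[i - 1]) / 2
    return area


def peptide_log2_ratio(area_a: float, area_b: float) -> float:
    """log2 abundance ratio of one peptide between samples A and B."""
    if area_a <= 0 or area_b <= 0:
        raise ValueError("peak areas must be positive")
    return math.log2(area_a / area_b)


def protein_log2_ratio(peptide_log2_ratios: Sequence[float]) -> float:
    """Summarize peptide ratios to a protein ratio (median, an assumption)."""
    if not peptide_log2_ratios:
        raise ValueError("at least one peptide ratio is required")
    return median(peptide_log2_ratios)


if __name__ == "__main__":
    # One peptide measured in two samples over the same retention time window.
    rt = [0.0, 1.0, 2.0]
    area_a = xic_area(rt, [0.0, 40.0, 0.0])
    area_b = xic_area(rt, [0.0, 10.0, 0.0])
    ratios = [peptide_log2_ratio(area_a, area_b), 1.5, 2.5]
    print(protein_log2_ratio(ratios))
```

Working in log2 space keeps up- and down-regulation symmetric around zero, and the median limits the influence of a single poorly integrated peptide on the protein-level result.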
Table 15.2 Proteomics strategies and associated software tools

Strategy | Software | URL
Label-based, metabolic | MSQuant | http://msquant.alwaysdata.net/
 | Maxquant | http://maxquant.org/
 | MFPaQ | http://mfpaq.sourceforge.net/
 | OpenMS | http://open-ms.sourceforge.net/
 | Proteome Discoverer | http://www.thermoscientific.com
 | WARP-LC | http://www.bdal.com/products/software/warp-lc/
 | PVIEW | http://compbio.cs.princeton.edu/pview/
 | Elucidator | http://www.rosettabio.com/
 | ASAPRatio | http://tools.proteomecenter.org/wiki/
 | Mascot Distiller | http://Matrixscience.com/distiller.html
 | Scaffold | http://www.proteomesoftware.com/products/
 | Census | http://fields.scripps.edu/census/
 | PEAKSQ | http://www.bioinfomaticssolutions.com/products/peaks/quantification.php
 | MaXIC-Q | http://ms64.iis.sinica.edu.tw/MaXIC-Q_web/index.html
Label-based, chemical | Mascot Distiller | http://Matrixscience.com/distiller.html
 | MaXIC-Q | http://ms64.iis.sinica.edu.tw/MaXIC-Q_web/index.html
 | Maxquant | http://maxquant.org/
 | MFPaQ | http://mfpaq.sourceforge.net/
 | OpenMS | http://open-ms.sourceforge.net/
 | PeakQuant | http://www.medizinisches-proteom-center.de/software
 | ProRata | http://code.google.com/p/prorata/
 | Proteios | http://www.proteios.org/
 | TPP | http://www.proteomecenter.org/software.php
 | -Libra | http://tools.proteomecenter.org/wiki/
 | VIPER | http://omics.pnl.gov/software/VIPER.php
 | X-Tracker | http://www.x-tracker.info/
 | Proteome Discoverer | http://www.thermoscientific.com
 | WARP-LC | http://www.bdal.com/products/software/warp-lc/
 | Mascot | http://www.matrixscience.com
 | PVIEW | http://compbio.cs.princeton.edu/pview/
 | IsobariQ | http://www.biotek.uio.no/research/thiede_group/software
 | jTraqX | http://sourceforge.net/projects/protms
 | IQuant | http://sourceforge.net/projects/iquant/
 | Rover | http://genesis.ugent.be/rover/
 | VEMS | http://www.portugene.com/software.html
 | Elucidator | http://www.rosettabio.com/
 | XPRESS | http://tools.proteomecenter.org/wiki/
 | ASAPRatio | http://tools.proteomecenter.org/wiki/
 | ZoomQuant | http://proteomics.mcw.edu/zoomquant.html
 | PEAKSQ | http://www.bioinfomaticssolutions.com/products/peaks/quantification.php
 | ProteinPilot | http://absciex.com/
 | Multi-Q | http://ms.iis.sinica.edu.tw/Multi-Q/
 | iTracker | http://www.cranfield.ac.uk/health/researchareas/bioinformatics/page6801.html
 | MSQuant | http://msquant.alwaysdata.net/
 | Scaffold Q+ | http://www.proteomesoftware.com/
Label-based, enzymatic | MSQuant | http://msquant.alwaysdata.net/
 | ProRata | http://code.google.com/p/prorata/
 | VIPER | http://omics.pnl.gov/software/VIPER.php
 | ZoomQuant | http://proteomics.mcw.edu/zoomquant.html
 | Proteome Discoverer | http://www.thermoscientific.com
 | WARP-LC | http://www.bdal.com/products/software/warp-lc/
 | PVIEW | http://compbio.cs.princeton.edu/pview/
 | Mascot Distiller | http://Matrixscience.com/distiller.html
 | PEAKSQ | http://www.bioinfomaticssolutions.com/products/peaks/quantification.php
Intensity-based | Maxquant | http://maxquant.org/
 | MSQuant | http://msquant.alwaysdata.net/
 | -SpecArray | http://tools.proteomecenter.org/wiki/
 | Corra | http://tools.proteomecenter.org/wiki/index.php?title=Software:Corra
 | Expression E | http://www.waters.com/waters/nav.htm?cid=10011719
 | IDEAL-Q | http://ms.iis.sinica.edu.tw/IDEAL-Q/
 | Mascot Distiller | http://www.matrixscience.com/distiller.html
 | MassHunter Mass Profiler | http://www.chem.agilent.com/en-US/Products/software/chromatography/ms/masshunterprofiling/pages/default.aspx
 | msBID | http://tools.proteomecenter.org/wiki/index.php?title=Software:msBID
 | MsXelerator | http://www.msmetrix.com/
 | Profile Analysis | http://www.bdal.com/products/software/profileanalysis/overview.html
 | Progenesis LC-MS | http://www.nonlinear.com/
 | ProteinQuant | http://www.ncgg.indiana.edu/
 | ProtQuant | http://www.agbase.msstate.edu/cgi-bin/tools/index.cgi
 | QuanLynx | http://www.waters.com/waters/nav.htm?locale=de_DE&cid=513662
 | Refiner MS | http://www.genedata.com/products/expressionist/modules.html
 | Peaks Q | http://www.bioinfomaticssolutions.com/products/peaks/quantification.php
 | PVIEW | http://compbio.cs.princeton.edu/pview/
 | VIPER | http://omics.pnl.gov/software/VIPER.php
 | TOPP | http://open-ms.sourceforge.net/news.php
 | SuperHirn | http://tools.proteomecenter.org/wiki/
 | SIEVE | http://www.thermoscientific.com/
 | PEPPeR | http://www.broadinstitute.org/cancer/software/genepattern/desc/proteomics
 | msInspect | http://proteomics.fhcrc.org/CPL/msinspect/index.html
 | Msight | http://www.expasy.org/MSight/
 | DeCyder MS | http://www.gelifesciences.com/
Spectral counting | APEX | http://pfgrc.jcvi.org/index.php/bioinformatics/apex.html
 | ProteoIQ | http://bioinquire.com
 | Abacus | http://abacustpp.sourceforge.net/
 | emPAI (Mascot) | http://www.matrixscience.com/
 | emPAICalc | http://empai.iab.keio.ac.jp/
 | Scaffold 3 | http://www.proteomesoftware.com/
Targeted proteomics | Skyline | http://proteome.gs.washington.edu/software/skyline
 | MaxQuant | http://maxquant.org/
 | ATAQS | http://tools.proteomecenter.org/ATAQS/ATAQS.html
 | TIQAM | http://tools.proteomecenter.org/TIQAM/TIQAM.html
 | MRMaid | http://138.250.31.29/mrmaid/
 | SRMCollider | http://www.srmcollider.org/srmcollider/srmcollider.py
 | MRMer | http://proteomics.fhcrc.org/CPL/MRMer.html
Table 15.3 Detailed information of selected software tools

Software | Version | Technique | Compatible search engine | Type of data | Instruments | Input files | Distribution | Language | Operating system (OS)
ASAPRatio | – | ICPL, ICAT, SILAC | SEQUEST | MS2 | Any via mzXML or mzML/Thermo | Via TPP/pepXML (generated in the TPP) | Free | C | Windows, Linux, OSX/Mac
APEX | 1.1.0 | Improvement of SC | SEQUEST, MASCOT and X!Tandem | LC-MS/MS | Any via protXML files | .fasta, .oi, protXML | Free open source | Java | Windows, OSX, Linux
Census | 1.72/2.3 | 15N, SILAC, iTRAQ, SC | – | MS1/MS2 | Any via mzXML | MS1/MS2, DTASelect, mzXML, pepXML | Free | Java | Windows, OSX, Linux
Corra | v3.0 | Intensity-based | SEQUEST | LC-MS | Any via mzXML | mzXML | Free | Java | Linux
IQuant | 2.0.1 | iTRAQ, TMT | MASCOT | MS1/MS2 | Any via mzXML | mzXML | Free open source | Java | Windows XP, Ubuntu 8.04 (the Hardy Heron), and Mac OS 10.4.11 platforms
IsobariQ | 1.1 | iTRAQ, TMT/IPTL | MASCOT | MS1/MS2 | Any via mzXML | mzXML | Free | C++ | Windows
iTracker | 1.1 | iTRAQ | SEQUEST, MASCOT | MS2 | All machines that export noncentroided spectra and support .dta or .mgf file | .mgf, .dta | Free | Windows exe Perl | Windows, Linux
Libra | – | iTRAQ, TMT | SEQUEST | – | Any via mzXML or mzML/Thermo | Via TPP/pepXML or summary.html files | Free | C | Windows, Linux, OSX
MapQuant | 2.1.1 | Label free image recognition | SEQUEST | LC-MS | Thermo, Waters | OpenRawSEQUEST/mzXML and MQScript files | Free open source | Visual C++/C | Windows, Linux
MaxQuant | 1.2.2.5/1.4.1.2 | SILAC, ICPL, Label free/Intensity-based | MASCOT | MS1/MS2 | LTQ, Orbitrap, FT-ICR (Thermo) | .raw (Thermo) | Free | C# | Windows
mProphet | 1.0.4.1 | SRM, AQUA, QconCAT, PSAQ | – | LC-MS/MS | Any via mzXML | mzXML, .xls | Free open source | Perl/R | Windows, Linux
MRMaid | 2.0 | SRM, label-free | – | LC-MS/MS | – | .mgf, .pkl, or mzXML | Free | Java | Web-based
MRMer | – | SRM, label-free | – | LC-MS | Waters | mzXML | Free | Java | Windows, OSX, Linux
MSQuant | 2.0b6 | 15N, SILAC, ICAT, 18O, label-free/Intensity-based | MASCOT | MS1/MS2 | QSTAR (ABI), Q-ToF (Waters), LTQ, FT, Orbitrap (Thermo) | .raw (Thermo), .dat (Waters), .wiff (ABI)/MASCOT html and raw spectral file (.wiff, .raw and .dat supported) | Free open source | C# and VB .NET | Windows
Multi-Q | 1.6.5.4 | iTRAQ | SEQUEST, MASCOT and X!Tandem | LC-MS/MS | Any via mzXML/Applied Biosystems, Thermo, Waters & Bruker Daltonics | mzXML, .wiff (ABI)/Vendor formats (.wiff, .raw) are converted into a reduced mzXML | Free | VB .NET | Windows, Web
OpenMS | 1.8/1.11.1 | iTRAQ, SILAC, label-free | X!Tandem, SEQUEST, MASCOT, OMSSA | MS1/MS2 | Any via mzXML or mzML | .dta, mzData, mzXML, mzML | Free open source | C++ | Windows, Linux, OSX
PeakQuant | 1.5.42 | 15N, SILAC, iTRAQ | – | MS1/MS2 | Any via mzXML | mzXML | Free | Java | Windows, Linux, OSX
PEPPeR | – | Label-free (Intensity-based) | – | MS1/MS2 | Any via mzXML | mzXML | Free | Perl | Windows
Progenesis QI | 1.0 | Label-free | MASCOT | LC-MS | Any | mzXML, mzML and NetCDF | Commercial | – | –
ProRata | 1.0 | 15N, SILAC, ICAT, 18O | SEQUEST | LC-MS | Any via mzXML/Thermo | mzXML/mzXML and DTASelect output file | Free open source | C++ | Windows, Linux
Proteios | 2.16/2.19 | iTRAQ, TMT | Mascot, X!Tandem, OMSSA | MS1/MS2 | Any via mzXML | mzML | Free open source | Java | Windows, Linux, OSX
QUIL | – | 18O, ICAT | SEQUEST | LC-MS | LCQ, LTQ, FT-ICR (Thermo) | – | Available on request/Free | Visual C++ | Windows
Qupe | – | 15N | – | LC-MS | LTQ, Orbitrap (Thermo) | mzXML | Web | Java | Web-based
Skyline | 1.2/2.5 | SRM, label-free | – | LC-MS | Any via mzXML | mzXML, pepXML | Free open source | C# | Windows
STEM | – | 18O | MASCOT | LC-MS | Waters | .pkl (ProteinLynx, Waters)/MASCOT .dat file and .raw file | Free | – | Windows
TPP | 4.5.0 | ICAT, SILAC, iTRAQ | Any | MS1/MS2 | | mzXML, mzML | Free | – | Windows, Linux, OSX
VIPER | 3.48.456/3.49 | 18O, ICAT/Intensity-based | – | MS1/MS2 | Any via mzML | .pek, .CSV, .mzXML, .mzData, .raw (Thermo) | Free open source | – | Windows
XPRESS | – | ICPL, ICAT, SILAC, 14N/15N | SEQUEST | – | Any via mzXML or mzML/Thermo | Via TPP/pepXML (generated in the TPP) | Free | C | Windows, Linux, OSX
X-Tracker | 1.3 | iTRAQ, 15N | – | MS1/MS2 | Any via mzML | mzML, mzIdentML | Free open source | Java | Windows, Linux, OSX
ZoomQuant | – | 18O | SEQUEST | LC-MS | LTQ (Thermo) | .raw (Thermo)/Uses various formats: (i) Xcalibur raw file, (ii) SEQUEST .out file processed to internal .colon file | Free | Perl | Windows, Linux, OSX
Some information was cited from [9]
Y. Chen et al.
15 Mass Spectrometry-Based Protein Quantification 263

[Figure 15.1: workflow boxes. Data quality assessment; Background subtraction and baseline correction; Noise filtering; Peak detection; Mass calibration and retention time alignment; Construction of single ion chromatograms; Quantification at peptide level; Quantification at protein level]

Fig. 15.1 General software workflow for quantitative proteomics

15.3.1 Data Quality Assessment

The proteomics process is sensitive to changes in sample preparation and spectra collection, especially for label-free approaches. Extreme caution must be used to maintain the same sample preparation protocol throughout the experiments. Introduction of any systematic bias into the data collection and sample handling will significantly impact the result, even if sophisticated bioinformatics tools are used [12]. This topic has been extensively investigated by Hilario et al. [13].

15.3.2 Background Subtraction and Baseline Correction

The baseline signal must be subtracted from the raw spectrum because the detector may overestimate the number of ions arriving at its surface, especially in low-molecular-weight regions. Recently, this type of correction has no longer been necessary, and the general assumption has become that commercial mass spectrometry instrument software will remove the background signal automatically [14].

15.3.3 Noise Filtering

Two types of errors are often present in experimental data, systematic error and random error [15]. Systematic error can be caused by a variety of factors, such as drift of calibration constants with time or temperature, which can be easily prevented by routine calibration [14]. Random error, also called noise, can be divided into low and high frequency noise. The aim of the noise filtering step is to remove the random noise in the mass spectra and to enhance the signal to noise ratio (S/N). In general, noise filtering is performed before identifying peptide peaks. The selected features left after the removal of the noise are often called peaks. There are several methods to remove the noise and select the peaks, including (1) filter methods, (2) wrapper methods, and (3) embedded methods [16]. The filter methods are most commonly used, and a wavelet filter has been reported to perform best among these filters (e.g., average filter, Savitzky-Golay filter, Gaussian filter, Kaiser window, and wavelet based filters) [17].

15.3.4 Peak Detection

After noise filtering, the charge state is defined by analyzing the isotope distribution, and peak overlap is also resolved. The algorithms and tools associated with isotope and charge state deconvolution have been reviewed elsewhere [18, 19]. The peaks are detected using methods including the isotopic cluster identification method by Horn et al. [13], the local maximum peak detection method by Yasui et al. [14, 15], or the mean-spectrum undecimated discrete wavelet transform-based peak detection method by Morris et al. [16]. The resulting spectral information is subsequently subjected to database searching. Commercial search engines such as
MASCOT, SEQUEST and Phenyx support many relevant instruments and their fragmentation methods. It is possible to perform analyses with publicly available tools as well, for instance, VEMS v3.0. The most common databases used in searching are NCBI’s Entrez Protein, RefSeq, IPI, Swiss-Prot, UniProt, and TrEMBL [2]. Once peptide identification has been deemed acceptable, the identified peptide information is used to locate specific peptide elution time in quantification applications.

15.3.5 Mass Calibration and Retention Time Alignment

Software provides advanced systems for mass calibration. The mass dimension rarely requires calibration, and the data alignment can conveniently be reduced to a simpler problem of aligning the retention time dimension [20]. A variety of alignment approaches have been suggested, including dynamic time warping, correlation optimized warping, parametric time warping, and peak alignment [21]. Good alignment is especially required for label-free quantification. According to the experimental design, the alignment can be performed before or after peptide identification.

15.3.6 Construction of Single Ion Chromatograms

The steps described above are also employed in protein identification. The following workflow including peptide ion chromatogram extraction, quantification at the peptide level and quantification at the protein level represents the major concept in quantitative proteomics.

Determination of the correct start and end points of peptide elution peaks in chromatograms is crucial for accurate and precise quantification results. The ion chromatogram extraction process is a computationally intensive step in quantitative analysis. Strictly speaking, an updated peak profile is reconstructed using the identified peptide information. There are three possible m/z values that can be used to define the m/z value for a given peptide in order to integrate the signal and extract its ion chromatogram: the experimentally observed m/z reported by the instrument’s software, the experimental m/z reported by the search engine (which may differ), and the exact theoretical m/z calculated from the sequence in a given ion charge state [8]. Currently, most software tools use the theoretical peptide sequence mass to determine the m/z value. Using the determined m/z value, a single ion chromatogram can be extracted. However, the construction method for the ion chromatogram varies amongst the software tools [8]. Some tools construct several single ion chromatograms (from multiple charge states) and average them, whereas some others construct only one ion chromatogram from the most abundant precursor ion.

15.3.7 Quantification at Peptide Level

After defining peptide elution peaks, the next step is to calculate peptide abundances for the light and/or heavy peptides. There are several different algorithms to calculate peptide abundances: peak area, least squares regression and principal component analysis [22]. The peak area approach calculates the area of peaks. In the least squares regression approach, the peak profiles of light and heavy peptides are converted into a scatterplot based on their ion intensities. The slope of the regression provides a measure of the background-subtracted ratio, the intercept provides a measure of the ratio of the two backgrounds, and the correlation coefficient provides a measure of the ratio quality [23]. Principal component analysis generates a similar scatterplot of ion intensities from both light and heavy peptides, and calculates two principal components and their values. The slope of the first principal component indicates the peptide abundance ratio. The criteria of peak area integration depend on the S/N of the chromatogram, the chromatogram peak shape, and even individual users’ biases.
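
The peak area and least squares regression approaches of Section 15.3.7 can be illustrated with a small sketch. This is not the implementation of any tool in Table 15.3: the function names, the heavy-over-light direction of the ratio, and the toy intensity profiles are all assumptions made for illustration.

```python
from statistics import mean


def peak_area(times, intensities):
    """Trapezoidal area under one extracted ion chromatogram peak."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


def regression_ratio(light, heavy):
    """Least-squares fit heavy = slope * light + intercept over paired scans.

    Returns (slope, intercept, r); r is the Pearson correlation coefficient,
    used here as a simple indicator of ratio quality.
    """
    mx, my = mean(light), mean(heavy)
    sxx = sum((x - mx) ** 2 for x in light)
    syy = sum((y - my) ** 2 for y in heavy)
    sxy = sum((x - mx) * (y - my) for x, y in zip(light, heavy))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5


# Hypothetical profiles: true heavy/light ratio of 2 on top of constant
# backgrounds of 10 (light) and 30 (heavy) intensity units.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
light = [10.0, 110.0, 210.0, 110.0, 10.0]
heavy = [30.0, 230.0, 430.0, 230.0, 30.0]

area_ratio = peak_area(times, heavy) / peak_area(times, light)
slope, intercept, r = regression_ratio(light, heavy)
print(f"area ratio {area_ratio:.3f}, slope {slope:.3f}, "
      f"intercept {intercept:.1f}, r {r:.3f}")
```

In this toy case the uncorrected area ratio is pulled away from 2 by the unequal backgrounds, whereas the regression slope is not, which is the motivation the text gives for reading the slope as a background-subtracted ratio.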

15.3.8 Quantification at Protein Level

If there is more than one peptide ratio for a target protein, the individual peptide values must be combined in some fashion. There are three different approaches for calculation [8, 22]. One approach is to calculate the mean or median of all peptide measurements, fitting the experimental values to a normal distribution [24]. The second method is a weighted average in which peptides with given weights, based on scores such as the quality or standard deviation, are used to derive a protein abundance ratio. The third approach is to calculate the protein ratio from an estimated likelihood function and the significance of the protein ratio is also related to the maximum of the likelihood function [25]. Sometimes, protein ratio calculation is complicated because several identified peptides are not unique to the target protein and may occur in other proteins. These peptides that cannot be used to estimate the final protein ratio should be removed as outliers. Nesvizhskii and Aebersold have provided a review of resolving multipeptide/protein issues [26]. Statistical tests (e.g., t test or ANOVA) assign significance levels to ratio estimations and help to control error rates.

Many of the bioinformatics tools in quantitative proteomics follow the procedures described above. However, the algorithms implemented in these tools are used to correct potential artifacts created by the different proteomics approaches [8, 18]; thus, no single software package is suitable for all quantitative strategies. In addition, their application is restricted by the nature of the experiment to be performed and available instrumentation. To further explain bioinformatics tools, we would like to provide more details about several selected tools in the next section of this chapter. These software packages were chosen because sufficient details of the implemented algorithms are available in the respective publications, whereas the others are “black box” designs tied to instrument vendors or are commercial products.

15.4 Selected Bioinformatics Tools

15.4.1 Automated Statistical Analysis of Protein Abundance Ratios (ASAPRatio)

ASAPRatio is currently applied for ICAT data analysis. The algorithms of ASAPRatio utilize Savitzky-Golay smoothing filter, statistics for weighted samples, and Dixon’s test for outliers, to evaluate relative protein abundance ratios and their associated errors [24]. In the construction of single ion chromatograms, ASAPRatio considers signal integrated over three isotopic peaks, for each isotopic variant and for four charge states. Background subtraction and outlier removal are then performed prior to the calculation of an abundance ratio for each peptide. For each unique peptide, the abundance ratio is calculated for each observed charge state and then all valid abundance ratios from the different charge states are collected, weighted by the sum of the two corresponding elution peak areas. Ratios are averaged for individual peaks, and then over all peaks (using weights in both cases). If there are more than three ion ratios, a final ‘unique peptide ratio’ is produced for each peptide. If there is more than one peptide ratio for a particular protein, ASAPRatio uses a weighted mean of the peptide ratios to calculate the protein ratio, using estimated errors for the peptide ratios [8].

15.4.2 MaxQuant

MaxQuant, produced by Matthias Mann’s group, was developed based on MSQuant and has similar properties to MSQuant [9]. Since it is designed for analyzing large mass spectrometric data sets, MaxQuant is more suitable for high-resolution data generated by the Thermo Orbitrap and FT mass spectrometers. This software supports label-free methods in addition to SILAC and ICPL (Isotope-coded protein label). Peaks are detected in each MS scan by fitting a Gaussian peak shape to the three central raw data
points [27]. Using correlation analysis and graph theory, MaxQuant detects peaks, isotope clusters and SILAC peptide pairs as three-dimensional objects in m/z, elution time and signal intensity space. It currently uses Mascot to generate peptide candidates for MS/MS spectra. The subsequent analysis includes robust processing and filtering for peptide mass accuracy and false discovery rate (FDR) thresholds at protein and peptide level [28]. Protein ratios are calculated as the median of all peptide ratios and can be normalized to correct for unequal protein amounts.

15.4.3 Progenesis QI

The Progenesis QI software package from Nonlinear Dynamics, Waters is a software solution for label-free quantification. Ion intensities are employed to provide quantification. Progenesis QI is capable of processing a large number of replicates, and has an accessible graphical user interface allowing users to view their MS data in two- or three-dimensional (2D or 3D) maps to verify if peptides have been quantified accurately. The peptide outlines mark the boundaries of each peptide isotope. The peptide abundance is the sum of the intensities within the isotope boundaries. To obtain the protein abundance, the sum of all unique normalized peptide ion abundances for a specific protein on each run is calculated. Furthermore, post-processing software (i.e., the Progenesis Post-Processor) extends the application of Progenesis QI to label-based quantification by embedding Progenesis QI in the analysis of stable isotope labeling data and top3 pseudo-absolute quantification [29]. The validated quantification range has been reported as 2–1000 fmol.

15.4.4 APEX (Absolute Protein Expression)

APEX is a modified spectral counting technique. Spectral counting techniques typically infer the relative quantity of a protein by dividing the number of MS-identified tryptic peptides derived from that protein by the total number of MS-identified peptides [30]. However, this technique is confounded by peptide physicochemical properties, which affect MS detection and result in each peptide having a different detection probability. In APEX, machine-learning algorithms are used to predict weighting factors for each peptide-spectrum match (PSM) based on the predicted properties of the peptide. The spectral count is weighted accordingly and used to calculate the protein abundance. The user-supplied normalization factor, typically an estimate of total protein concentration, converts the relative abundance values into absolute terms. Thus, quantification results can be improved over basic spectral counting.

15.4.5 Trans-Proteomic Pipeline (TPP)

TPP, a collection of software tools, is instrument-independent and supports commonly used proteomics workflows [31]. Importantly, the pipeline uses open, standard data formats and calculates estimates of sensitivity and error rates, thus allowing for meaningful data exchange. TPP relies on, and integrates in its workflow, external tools for peptide identification (SEQUEST, Mascot, COMET, PeptideProphet and X!Tandem) and protein identification (ProteinProphet) [32, 33]. Quantification analysis tools such as ASAPRatio described above and Libra can also be used in the pipeline flow for peptide and protein quantification. Due to its special usage, we will give more details here based on an example data set of Tandem Mass Tags (TMT) 6-plex labeling (named as dataset HuN9; searched using Mascot 2.1 against human protein sequences of UniProt 2013_12). TMT is an isobaric compound that allows peptides from up to six samples to be identified and quantified in a single experiment. The intensities of six reporter ions are used for the quantification of peptides in different samples. Figure 15.2 shows representative mass spectra of a surrogate peptide used for quantification.
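
As a minimal sketch of how six reporter ion intensities might be read from one MS/MS peak list, the example below sums the signal found near each reporter m/z. The reporter m/z values are approximate, rounded figures supplied here as assumptions, and the tolerance, function name and peak list are hypothetical; this is not how Libra or any TPP component is implemented.

```python
# Approximate TMT 6-plex reporter ion m/z values (assumed, rounded to three
# decimals); check the reagent documentation before using them on real data.
TMT6_REPORTERS = {
    "126": 126.128,
    "127": 127.125,
    "128": 128.134,
    "129": 129.131,
    "130": 130.141,
    "131": 131.138,
}


def reporter_intensities(peaks, tol=0.01):
    """Sum the intensity of MS/MS peaks within +/- tol m/z of each reporter.

    peaks is an iterable of (mz, intensity) pairs from a single spectrum.
    Channels without a matching peak are reported as 0.0.
    """
    totals = {channel: 0.0 for channel in TMT6_REPORTERS}
    for mz, intensity in peaks:
        for channel, reference in TMT6_REPORTERS.items():
            if abs(mz - reference) <= tol:
                totals[channel] += intensity
    return totals


# Hypothetical peak list: four reporter peaks plus one unrelated fragment ion.
spectrum = [
    (126.1277, 100.0),
    (127.1248, 50.0),
    (129.1315, 20.0),
    (131.1382, 5.0),
    (175.1190, 999.0),
]
channels = reporter_intensities(spectrum)
print(channels)
```

A fixed absolute tolerance is the simplest choice for a sketch; a relative (ppm) window would be a natural alternative on high-resolution instruments.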

Fig. 15.2 Representative parent and product ion spectra of GVFHQTVSR, one of the four unique peptides of protein
P12109 (Gene name: COL6A1, collagen alpha-1(VI) chain)

We used TPP to interpret the search result of HuN9. After loading the Mascot result, the TPP pipeline uses PeptideProphet and ProteinProphet to validate the identification of peptides and proteins [34]. With Libra, each peptide channel was normalized by the sum of peptides’ channels. The values that deviated from the average by more than two sigma were removed. The protein level of each sample (labeling) was calculated as the median of normalized values of the corresponding ions of peptides. Using two peptides of protein P12109 (GVFHQTVSR and GDEGPPGSEGAR) as quantifiers, Fig. 15.3 shows the raw intensities of peptides (a), normalized values for peptides (b) and final relative expression levels of protein using individual peptides (c). The dataset HuN9 is designed as a 3 versus 3 sample comparison (Group 1: TMT126-128 samples; Group 2: TMT129-131 samples). This protein shows a significant decrease in group 2 (average fold change 1.7, p value < 0.001).

15.4.6 IsobariQ and IQuant

As isobaric labeling is more efficient in peptide labeling and, thus, is more widely used to find differentially expressed proteins between two samples from different physiological or pathological states, the recently developed tools IQuant and IsobariQ are also discussed here, followed by their comparison using the same data set in TPP.

IsobariQ was developed in C++ for the Windows platform and released under the GNU General Public License version 3. The statistical language R and the server Rserve must be
installed separately. The user can choose between three different types of normalization: (1) Division by median, (2) Variance stabilizing normalization (VSN), (3) Division by channel sum (reporter ions only). Once all the peptides are successfully quantified and normalized, the protein ratios are calculated as the median of the individual peptide ratios (reporter ions) or pooled standard deviation of all quantification points (IPTL). The user can select which peptides are included in the calculation of protein values. IsobariQ uses z-statistics, similar to how MaxQuant treats SILAC data, to address the significance of a protein ratio. The Benjamini-Hochberg method is applied for multiple testing correction.

Fig. 15.3 Raw intensities (a) and normalized values (b) of surrogate peptides and relative quantitative values (c) of P12109

IQuant is implemented in Java and R, provides a GUI as well as a command-line interface, and works on both Windows and Linux systems [35]. It integrates Mascot Percolator and advanced statistical algorithms to process the mass data. Reporter ion abundances can be normalized using VSN or a median-based approach. The VSN method has also been adopted in IsobariQ. Non-unique peptides and outlier peptide ratios are removed prior to quantitative calculation. A weighted approach is employed to evaluate the ratios of protein quantity based on reporter ion intensities [36]. To estimate the statistical significance of the protein quantitative ratios, IQuant adopts the permutation test, a non-parametric approach. For each protein, IQuant, like IsobariQ, provides a significance evaluation that is corrected for multiple hypothesis testing by the Benjamini-Hochberg method [37].

Both IQuant and IsobariQ perform the analysis based on the Mascot identification results. They both need the identification result of Mascot (“dat” file) as input. IQuant further needs the fasta file of a sequence database for protein inference. The key steps shared by these two tools in quantitative proteomics are: tag impurity correction, peptide quantification, peptide ratio normalization, and protein quantification. As described above, they both require installation of R software for performing VSN [11]. VSN provides a robust variant of the maximum-likelihood estimator for differential expression, which was originally used for microarray data. Both tools provide peptide normalization based on division by median. IsobariQ additionally supports normalization by channel sum. For protein quantification, the protein ratios were finally calculated as the median values of the
individual peptide ratios in IsobariQ, whereas IQuant employed a weighted approach to evaluate the protein ratios [12].

To further illustrate the different normalization method performances, the previous HuN9 data set was processed. As shown in Table 15.4, the normalization by median-based approach produced results similar to the theoretical prediction in this study. However, VSN is recommended by IQuant as the default method to solve the issue of heterogeneity of variance among peaks.

We further compared the quantification results of IsobariQ and IQuant using the example file provided by IQuant (File name: tte-1-1.dat; iTRAQ8Plex labeling). IQuant quantified more proteins than IsobariQ, likely due to its use of MascotPercolator to improve protein identification (Fig. 15.4a). We also calculated the ratios (116/114) of 495 common proteins using VSN and median normalization methods in IQuant and IsobariQ, respectively. The quantification results showed good correlations between these two tools, with a Spearman coefficient of 0.977 (Fig. 15.4b).

There are many excellent software tools available, and there is no consensus on how to calculate protein and peptide abundances. The selection of these tools is partly restricted by the nature of the experiment and available instrumentation, as well as the type of information the end-user is looking for.

15.5 Targeted Proteomics by SRM

Due to the special experimental design and data analysis of targeted proteomics, this approach is discussed individually here. In a targeted analysis for protein quantification, liquid chromatography coupled on-line to SRM (LC/SRM) assays are developed to detect fragment ion signals from proteolytic peptides derived from target proteins [38, 39]. The precursor peptide ion is selected in the first mass analyzer (Q1), while a peptide fragment ion, generated by collision-induced dissociation in Q2, is selected in the third mass analyzer (Q3). The precursor/product ion m/z pair, referred to as the transition, is used to yield the chromatogram. The area under the curve of the chromatogram provides a quantitative measurement for each desired peptide and target protein. Selected as Method of the Year 2012, LC/MS/MS-based targeted proteomics allows researchers to quantify proteins with high sensitivity, high selectivity and wide dynamic ranges [40]. Low-abundance proteins of interest are not ignored as they are in discovery proteomics. If the retention time of the surrogate peptides is used as a constraint in data acquisition (scheduled SRM), several hundred peptides can be quantified during a single LC/MS/MS run [41, 42].

The most critical step in the establishment of a targeted proteomics assay is the selection of proteolytic peptides that (1) are unique to a candidate protein, (2) would ionize efficiently, (3) are completely digested (carry no miscleavages) and (4) can generate high-quality SRM (high S/N). Given these criteria that must be met for each transition, designing SRM assays for a protein can be time-consuming, and the workload increases rapidly as the number of target proteins is increased. To streamline this process, freely available software resources have been developed. There are primarily two contrasting approaches, theoretical and experimental, either using in silico prediction by various algorithms or based on spectral evidence using existing mass

Table 15.4 Comparison of different methods for peptide normalization


Normalization | Ratio 127/126 | Ratio 128/126 | Ratio 129/126 | Ratio 130/126 | Ratio 131/126
None | 0.95 | 0.90 | 0.93 | 0.96 | 0.91
Sum (IsobariQ) | 1.01 | 1.01 | 1.01 | 1.01 | 1.01
Median (IsobariQ) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
VSN (IQuant and IsobariQ) | 0.98 | 1.04 | 1.00 | 1.00 | 1.00
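
The two simpler normalizations compared in Table 15.4 (division by median and division by channel sum) can be sketched as follows. The example assumes that each reporter channel is scaled by its own median or sum over all peptides; that reading, the function names and the toy intensities are assumptions for illustration and do not reproduce the IsobariQ or IQuant code. VSN is not covered.

```python
from statistics import median


def normalize_channels(table, method="median"):
    """Divide each reporter channel by its own median or sum over all peptides.

    table maps a channel label to a list of peptide intensities, with the
    same peptide order in every channel.
    """
    if method not in ("median", "sum"):
        raise ValueError("method must be 'median' or 'sum'")
    scale = median if method == "median" else sum
    normalized = {}
    for channel, values in table.items():
        factor = scale(values)
        normalized[channel] = [v / factor for v in values]
    return normalized


def median_ratio(table, numerator, denominator):
    """Median over peptides of the per-peptide channel ratio."""
    pairs = zip(table[numerator], table[denominator])
    return median(n / d for n, d in pairs)


# Hypothetical data: channel 127 was loaded with 10 % less material than 126,
# but no peptide actually differs between the two samples.
raw = {
    "126": [100.0, 200.0, 400.0],
    "127": [90.0, 180.0, 360.0],
}
for method in (None, "median", "sum"):
    data = raw if method is None else normalize_channels(raw, method)
    print(method, round(median_ratio(data, "127", "126"), 3))
```

Both scalings remove the loading offset in this toy case; they differ on real data because the sum is dominated by the most intense peptides, whereas the median is not.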

Fig. 15.4 Comparison of the quantification results of IsobariQ and IQuant. (a) The number of proteins obtained from each software. (b) The correlation of the protein ratios obtained from IQuant and IsobariQ. The PSMs were filtered with a q-value equal to or less than 0.01, and only proteins with equal to or more than two peptides were used for further analysis
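
The agreement shown in Fig. 15.4b is summarized by a Spearman coefficient, that is, a Pearson correlation computed on ranks. A minimal standard-library sketch is given below; the function names and the four toy protein ratios are hypothetical and are not the data behind the figure.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical protein ratios for the same four proteins from two tools.
tool_a = [0.5, 1.0, 2.0, 4.0]
tool_b = [0.6, 2.5, 0.9, 3.0]
print(round(spearman(tool_a, tool_b), 3))
```

Because only ranks enter the calculation, the coefficient is unaffected by any monotonic rescaling of the ratios, such as a log transform.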

Table 15.5 Predictive computational models for peptide selection


Prediction method Website
ESP predictor http://www.broadinstitute.org/cancer/software/genepattern/modules/ESPPredictor.htm
STEPP http://cbb.pnnl.gov/portal/software/stepp.html
Peptide sieve (PAGE-ESI) http://tools.proteomecenter.org/wiki/index.php?title=Software:PeptideSieve
Peptide detectability http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/

spectral data from either public repositories or in-house experiments (e.g., spectra recorded during global discovery experiments). In a theoretical SRM design, it is possible to predict which peptides and product ions are most appropriate for SRM protein quantification by several computational tools (Table 15.5). However, it should be noted that the mechanisms of proteolysis, ionization, and fragmentation are not yet sufficiently well understood to produce accurate models for best SRM transition predictions. The current models can only assist in selecting high-responding peptides, particularly in the absence of experimental data. For example, the ESP predictor considers 550 physicochemical properties to model the peptide response. For each physicochemical property, it computes the property value by averaging over all amino acids in each peptide. The software simulates a peptide response using the Random Forest algorithm. As reported previously, the ESP predictor can achieve a success rate of 89% at selecting one or more high-responding peptides per protein on average [43]. There are some empirical criteria for peptide selection, which may be helpful at the initial stage of quantitative proteomics and are listed in Table 15.6.

The experimental approach uses experimentally obtained peptide spectra as evidence, and several software tools have been developed to extract the necessary information from those spectra to build SRM assays. Publicly available spectral repositories include PRIDE, GPMDB, PeptideAtlas and others (Table 15.7). The

Table 15.6 Empirical criteria for peptide selection in SRM


Necessary conditions
1. The amino acid sequence of the peptide is unique for a target protein
2. Length between 6 and 16 amino acids
3. No posttranslational modifications and no single nucleotide polymorphism
4. No methionine or cysteine residues are included
5. For membrane proteins, no transmembrane region
6. For trypsin digestion, no continuous sequence of arginine or lysine residues (RR, KK, RK, KR) occurs in the digestion region
7. For trypsin digestion, the peptide does not include a proline residue at the C-terminal side of an arginine or lysine residue (RP or KP) in the digestion region
Additional conditions
1. No histidine residue
2. Containing one of leucine, isoleucine, valine, alanine or proline residue
3. Hydrophobic amino acids should comprise less than 40% of the peptide
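
Several of the criteria in Table 15.6 depend only on the amino acid sequence and can be checked automatically. The sketch below covers necessary conditions 2, 4, 6 and 7 and the three additional conditions; uniqueness, modifications, polymorphisms and transmembrane regions need external annotation and are not checked. The function names, the optional `context` argument (the peptide together with its flanking residues) and the set of residues treated as hydrophobic are assumptions, since the table does not define them.

```python
import re

# Assumed hydrophobic residue set; Table 15.6 does not specify one.
HYDROPHOBIC = set("AVILMFW")


def failed_necessary_conditions(peptide, context=None):
    """Return the sequence-based necessary conditions that a peptide fails.

    context is the peptide plus flanking residues from the protein, so that
    RR/KK/RK/KR and RP/KP motifs at the cleavage sites can be seen. If it is
    omitted, only the peptide itself is examined.
    """
    region = context or peptide
    failures = []
    if not 6 <= len(peptide) <= 16:
        failures.append("length outside 6-16 residues")
    if set(peptide) & set("MC"):
        failures.append("contains methionine or cysteine")
    if re.search(r"[RK][RK]", region):
        failures.append("consecutive arginine/lysine residues")
    if re.search(r"[RK]P", region):
        failures.append("proline after arginine or lysine")
    return failures


def additional_conditions_met(peptide):
    """Count how many of the three additional conditions hold (0 to 3)."""
    fraction = sum(aa in HYDROPHOBIC for aa in peptide) / len(peptide)
    return sum([
        "H" not in peptide,
        any(aa in peptide for aa in "LIVAP"),
        fraction < 0.40,
    ])


# The two P12109 quantifier peptides named in the text, plus a made-up
# sequence that violates several conditions.
for sequence in ("GVFHQTVSR", "GDEGPPGSEGAR", "MKRPAK"):
    print(sequence, failed_necessary_conditions(sequence),
          additional_conditions_met(sequence))
```

Returning the list of failed conditions, rather than a single pass/fail flag, makes it easier to see why a candidate was rejected when few peptides are available for a protein.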

Table 15.7 Public proteomics spectral repositories


Proteomics repositories Website
PeptideAtlas http://www.peptideatlas.org/speclib/
GPMDB ftp://ftp.thegpm.org/projects/xhunter/libs/
PRIDE http://www.ebi.ac.uk/pride
NIST http://www.peptideatlas.org/speclib/
MacCoss http://proteome.gs.washington.edu/software/bibliospec/documentation/libs.html

software tools are Targeted Identification for Quantitative Analysis by Multiple reaction monitoring (TIQAM), MRMer, SRMCollider, MaRiMba, MRMaid, Skyline and ATAQS as listed in Table 15.2, or commercial, with the software platforms provided by mass spectrometer vendors (e.g., SRM Workflow software (based on SIEVE), Pinpoint and P3 predictor (Thermo Scientific), mTRAQ-reagent-based MRMPilot software and multiple reaction monitoring initiated detection and sequencing (MIDAS) Workflow Designer (Applied Biosystems), VerifyE and TargetLynx™ Application Manager (Waters), MassHunter Optimizer (Agilent Technologies)) [44]. Commercial software tools are not freely accessible and the algorithms are generally not published. This chapter focuses, therefore, on the freely available, platform-independent informatics resources for SRM transition design.

The initial software packages for SRM assay development were often single-user packages, such as MRMaid and MaRiMba, and they were limited in their specific scope. Newer software packages, such as Skyline and ATAQS, aim to integrate the entire targeted proteomic workflow and may be more comprehensive. We will give a simple example, re-analyzing data from recently published papers from our laboratory to explore the applicability of PeptideAtlas and quantitative capability of Skyline and ATAQS.

15.5.1 PeptideAtlas

Current proteomics repositories have been based on shotgun proteomics data. Databases such as PeptideAtlas are candidates that have the potential to handle SRM data. Features have been added to PeptideAtlas to leverage shotgun data in support of SRM experiment design [17]. An SRM-specific section of PeptideAtlas, SRMAtlas, has been created as a combined catalog of best-available transitions selected from either PeptideAtlas shotgun data, data collected for whole proteome synthetic tryptic peptide libraries [18], published validated transitions, and theoretical transition prediction approaches. SRMAtlas encompasses four levels of information and an algorithm (PeptideAtlas Best-transition Selection Tool (PABST)) to intelligently merge the levels with a weighting and

scoring technique to provide ranked lists of peptides and transitions for all proteins for a species. Using these data, generic SRM measurements can be set up for protein quantification. PeptideAtlas SRM Experiment Library (PASSEL) is an active repository for SRM experimental data acquired in real-world studies. Unlike SRMAtlas, this repository is specifically designed to store and present SRM experimental data in a publicly accessible manner [45].

Using HSP27 (P04792), recently investigated in our laboratory, as an example [46], we can obtain a bulk list of recommended peptides and transitions and the desired user-settable parameters. Each transition was listed with its attributes, including final score and the source of the peptide, as depicted in Fig. 15.5. The list may also be constrained to a subset of these classes, for example, to only optimize the transitions selected from a real QQQ spectrum. This table may be optimal for us to use in the quantification of real samples. In our study, the doubly charged ion VSASPLLYTLIEK was most abundant, and the corresponding transition of m/z 717.2 → 1089.8 had the greatest S/N in LC/SRM. Notably, only proteins with trypsin digestion can be processed using PeptideAtlas in the current version.

It is important to be judicious in the use of tryptic peptides in SRM assay development because public MS/MS spectra databases often lack information about the experimental methods and MS instrumentation used to obtain these spectra [47]. The predicted chromatographic and mass spectrometric behavior of peptides is not always sufficiently accurate to omit the need for experimental verification. While spectra generated on a triple-quadrupole instrument are often preferred, consensus ion trap spectra are often used as a substitute when these are not available [48, 49]. Thus, the ion peak intensity ranking in a library is usually different from that provided by experiments (Table 15.8).

The peptide and transition information of proteins may also be queried programmatically by other software via web service interfaces, as described in detail at the SRMAtlas access help page (http://www.srmatlas.org/doc/webServiceAccess.php). This capability enables users to import the results from SRMAtlas directly into SRM bioinformatics tools such as ATAQS and Skyline.

15.5.2 Skyline

Skyline is a Windows client application for targeted proteomics method creation and quantitative data analysis. It is open source and freely available [50]. Skyline can not only establish an initial set of peptides and transitions but also allows us to further refine and optimize these initial instrument methods after experimental runs.

Skyline supports all major publicly available spectral libraries. New spectral libraries can also be built, for example, for post-translational modifications (PTMs) that are unavailable in public libraries. Skyline can support peptide and transition picking both in silico and from spectral libraries automatically. Peptide settings include the following: presence or absence of specific residues (including heavy amino acids), enzyme, peptide length, and charge states. Transition settings include the following: collision energy (CE) and declustering (set to instrument vendor-specific values if necessary), product ion m/z greater than the precursor, and monoisotopic or average masses. Retention time (RT) can also be predicted ab initio using a selection of “calculators”, such as SSRCalc [51]. Matching spectra are shown in a graph pane with ion peak intensity ranking expressed in both the graph and document tree (Fig. 15.6). Several empirical criteria are valuable and provided for the creation of a new targeted assay, for instance, start with more transitions than required, prefer singly charged y-ions, etc. The resulting list of transitions can be exported in MS-vendor-specific formats, such as Agilent, Thermo, and Waters, so they can easily be scheduled in MS for quantitative monitoring later. Finally, empirical measurements in the experimental context are performed. After acquired SRM data are imported, subsequent method refinement is

Fig. 15.5 Result of PABST processing for HSP27. The list of peptides is provided in reverse sorted order with the best
peptides appearing first, i.e., those with the highest value in the “Adj SS” column

carried out based on these results to achieve a highly effective instrument method.

Skyline fully supports protein quantification, with dialogs for defining static and heavy isotope modifications (Fig. 15.7) and assigning them broadly or explicitly to individual peptides. After importing the results files, Skyline calculates ratios between the unlabeled peptide

Table 15.8 Comparison of the Atlas library rank and the experimental rank of transitions for the peptide QLSSGVSEIR

Compound group         Compound name  Precursor ion  Product ion  Ion name  Library rank  Experimental rank
sp|P04792|HSPB1_HUMAN  QLSSGVSEIR     538.3          834.4        y6        2             3
sp|P04792|HSPB1_HUMAN  QLSSGVSEIR     538.3          660.4        y5        5             4
sp|P04792|HSPB1_HUMAN  QLSSGVSEIR     538.3          504.3        y4        1             1
sp|P04792|HSPB1_HUMAN  QLSSGVSEIR     538.3          417.2        y3        3             5
sp|P04792|HSPB1_HUMAN  QLSSGVSEIR     538.3          288.2        b5        4             2

The data were obtained using an Agilent Series 1200 HPLC system (Agilent Technologies, Waldbronn, Germany) and a 6410 Triple Quad LC/MS mass spectrometer (Agilent Technologies, Santa Clara, CA, USA)
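
How far the library ranking departs from the experimental ranking in Table 15.8 can be summarized with a rank correlation. The sketch below is an illustration added here, not an analysis performed in the cited study: it applies the standard Spearman formula for untied ranks to the two rank columns of the table, keyed by product ion m/z.

```python
# Library and experimental ranks of the five QLSSGVSEIR transitions,
# keyed by product ion m/z as listed in Table 15.8.
ranks = {
    834.4: (2, 3),
    660.4: (5, 4),
    504.3: (1, 1),
    417.2: (3, 5),
    288.2: (4, 2),
}

def spearman_rho(pairs):
    """Spearman rank correlation for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(pairs)
    d2 = sum((a - b) ** 2 for a, b in pairs)
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman_rho(list(ranks.values()))
print(f"Spearman rho = {rho:.2f}")  # prints: Spearman rho = 0.50

# Transitions whose rank differs between the library and the experiment.
moved = sorted(mz for mz, (lib, exp) in ranks.items() if lib != exp)
print("rank changed for product ions:", moved)
```

Only the top-ranked transition (m/z 504.3) keeps its position, which is consistent with the recommendation to verify library-derived transitions experimentally.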

and the labeled internal standard and provides direct editing of integration boundaries [50]. The comma separated value format can also be obtained for further analysis with statistical tools such as Excel and R.

15.5.3 ATAQS (Automated and Targeted Analysis with Quantitative SRM)

ATAQS is an integrated software platform that supports all stages of targeted, SRM-based proteomics experiments including target selection, transition optimization and post acquisition data analysis [52]. ATAQS is written in Java and provides a graphic user interface for the popular browser Mozilla Firefox. Unlike Skyline, which is a desktop application with manual inspection for validation, ATAQS provides modules with algorithms that collectively support all steps of the SRM assay development and deployment workflow for targeted proteomic experiments. For example, mProphet in ATAQS can indicate which transition group has the higher validated score. ATAQS can be easily extended and customized by the user with the addition of user-implemented algorithms at any of the workflow steps. The software uses FireGoose to connect to various Web services [53]. Among these Web services are PeptideAtlas, TIQAM, PIPE2 (to generate a list of proteins to design a SRM assay as well as for various analysis tasks), and PABST. A peptide transition list was generated using PABST based on user-defined criteria (Fig. 15.8).

Because the ATAQS software is primarily designed to support multiple users at an institution, several points deserve attention. (1) Because a number of algorithms were implemented in ATAQS and it is an institution-wide computing resource, the installation requirement is higher compared to the others (e.g., Tomcat v6.0.26 or higher, Java 1.6, Ant v1.7.1, MySQL v5.0.77, Firefox 3.6.x, Firegoose-0.8.259.xpi, Adobe Flash Player 10, R 2.11.1). (2) The protein identification could be more reliable. As stated by the developers, Skyline uses a single score (a hydrophobicity value from SSRCalc) to provide confidence in identification, whereas ATAQS performs two sequential selections based on two separate peptide detectability algorithms (Peptide Sieve and Peptide Detectability Predictor) and user-defined criteria (e.g., number of amino acids (peptide length), amino acid composition and uniqueness of sequence (does not map to more than one protein or one region in the genome)). (3) Peak signals in ATAQS are smoothed by discrete Fourier transformation and integrated using mQuest, compared to Skyline using CRAWDAD algorithms for chromatogram retention time alignment, warping and peak integration [49]. (4) ATAQS requires the experimenters to convert the file to

Fig. 15.6 The Skyline spectral library explorer showing spectral views of HSP27 (a) and its phosphorylation (b)

(A)

(B)
PeptideSequence ProteinName ReplicateName PeptideRetentionTime RatioLightToHeavy Transition AreaRatio
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S1 5.33 0.4262 S - y8+ 0.3263+/-0.0303
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S1 5.33 0.4262 G - y6+ 0.6576+/-0.0398
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S1 5.33 0.4262 S - y4+ 0.4792+/-0.0513
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S1 5.33 0.4262 E - y3+ 0.4818+/-0.0242
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S1 5.33 0.4262 I - y2+ 0.3766+/-0.0354
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S2 5.33 0.8842 S - y8+ 0.6836+/-0.0381
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S2 5.33 0.8842 G - y6+ 1.3461+/-0.0781
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S2 5.33 0.8842 S - y4+ 0.9876+/-0.0768
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S2 5.33 0.8842 E - y3+ 0.9947+/-0.0452
QLSSGVSEIR sp|P04792|HSPB1_HUMAN HSP27 S2 5.33 0.8842 I - y2+ 0.7836+/-0.0701

Fig. 15.7 The Skyline result of HSP27 quantification using the surrogate peptide QLSSGVSEIR and the
corresponding [D8]Val isotope-labeled internal standard. Spectral view (a) and exported area ratio result (b) are
provided
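
The transitions exported in Fig. 15.7b can be checked against theoretical fragment masses. The following sketch is an illustration added here and is not part of Skyline: it computes monoisotopic m/z values for the unlabeled peptide from standard residue masses, with function names chosen for this example. Under these assumptions the five values reproduce the product ion m/z values of Table 15.8 and correspond to the y8, y6, y4, y3 and y2 labels exported by Skyline.

```python
# Monoisotopic residue masses (Da) for the amino acids present in QLSSGVSEIR.
RESIDUE_MASS = {
    "Q": 128.05858, "L": 113.08406, "S": 87.03203, "G": 57.02146,
    "V": 99.06841, "E": 129.04259, "I": 113.08406, "R": 156.10111,
}
WATER = 18.01056   # H2O, added once per peptide or y-type fragment
PROTON = 1.00728   # mass of the charge carrier

def precursor_mz(peptide, charge):
    """m/z of the intact peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge

def y_ion_mz(peptide, n, charge=1):
    """m/z of the y_n ion: the last n residues plus water plus `charge` protons."""
    fragment = sum(RESIDUE_MASS[aa] for aa in peptide[-n:]) + WATER
    return (fragment + charge * PROTON) / charge

peptide = "QLSSGVSEIR"
print(f"precursor 2+: {precursor_mz(peptide, 2):.1f}")   # 538.3
for n in (8, 6, 4, 3, 2):
    print(f"y{n}+: {y_ion_mz(peptide, n):.1f}")           # 834.4, 660.4, 504.3, 417.2, 288.2
```

The heavy [D8]Val internal standard is not modeled here; its precursor and every fragment containing the valine residue would be shifted by the mass of the label.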

Fig. 15.8 The generated transition list in ATAQS



an open source format such as TraML, mzXML or mzML, whereas Skyline is the only open source program that can read all native vendor file formats.

15.6 Conclusions

As techniques for quantitative proteomics continue to grow, bioinformatics software tools are similarly expanding in number. Proteomics researchers are now faced with many tools to choose from, all with different advantages and disadvantages. Software that is developed for a particular type of mass spectrometer or method may be inadvertently or intentionally optimized for data from that instrument or proteomics approach and may be less well suited for more general use. The result is sometimes also influenced by the familiarity and expertise of the users with the programs being run. Thus, we cannot claim which tool is better. Currently, the rational approach in quantitative proteomics is to select the bioinformatics tools optimally suited to address the specific proteomics issue under consideration and the associated information.

References

1. Wang Y, Li H, Chen S (2010) Advances in quantitative proteomics. Front Biol 5:195–203
2. Jungblut PR (2014) The proteomics quantification dilemma. J Proteomics 107:98–102
3. Doerr A (2010) Targeted proteomics. Nat Methods 7:837–842
4. Nogueira FC, Palmisano G, Schwammle V, Campos FA, Larsen MR, Domont GB et al (2012) Performance of isobaric and isotopic labeling in quantitative plant proteomics. J Proteome Res 11:3046–3052
5. Yocum AK, Chinnaiyan AM (2009) Current affairs in quantitative targeted proteomics: multiple reaction monitoring-mass spectrometry. Brief Funct Genomic Proteomic 8:145–157
6. Colangelo CM, Chung L, Bruce C, Cheung KH (2013) Review of software tools for design and analysis of large scale MRM proteomic datasets. Methods 61:287–298
7. Patterson SD, Aebersold RH (2003) Proteomics: the first decade and beyond. Nat Genet 33(Suppl):311–323
8. Lau KW, Jones AR, Swainston N, Siepen JA, Hubbard SJ (2007) Capture and analysis of quantitative proteomic data. Proteomics 7:2787–2799
9. Gonzalez-Galarza FF, Lawless C, Hubbard SJ, Fan J, Bessant C, Hermjakob H et al (2012) A critical appraisal of techniques, software packages, and standards for quantitative proteomic analysis. Omics 16:431–442
10. Lemeer S, Hahne H, Pachl F, Kuster B (2012) Software tools for MS-based quantitative proteomics: a brief overview. Methods Mol Biol 893:489–499
11. Park SK, Venable JD, Xu T, Yates JR 3rd (2008) A quantitative analysis software tool for mass spectrometry-based proteomics. Nat Methods 5:319–322
12. Matthiesen R, Datta S. Feature selection and machine learning with mass spectrometry data. In: Mass spectrometry data analysis in proteomics. Humana Press, pp 237–262
13. Hilario M, Kalousis A, Pellegrini C, Muller M (2006) Processing and classification of protein mass spectra. Mass Spectrom Rev 25:409–449
14. Matthiesen R, Matthiesen R. LC-MS spectra processing. In: Mass spectrometry data analysis in proteomics. Humana Press, pp 47–63
15. Mortensen P, Gouw JW, Olsen JV, Ong SE, Rigbolt KT, Bunkenborg J et al (2010) MSQuant, an open source platform for mass spectrometry-based quantitative proteomics. J Proteome Res 9:393–403
16. Azuaje F, Dopazo J (2005) Integrative data analysis and visualization: introduction to critical problems, goals and challenges. In: Data analysis and visualization in genomics and proteomics. Wiley, Hoboken, pp 1–9
17. Yang C, He Z, Yu W (2009) Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinf 10:4
18. Matthiesen R (2007) Methods, algorithms and tools in computational proteomics: a practical point of view. Proteomics 7:2815–2832
19. Ryu SY (2014) Bioinformatics tools to identify and quantify proteins using mass spectrometry data. Adv Protein Chem Struct Biol 94:1–17
20. Matthiesen R, Azevedo L, Amorim A, Carvalho AS (2011) Discussion on common data analysis strategies used in MS-based proteomics. Proteomics 11:604–619
21. Vandenbogaert M, Li-Thiao-Te S, Kaltenbach HM, Zhang R, Aittokallio T, Schwikowski B (2008) Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics 8:650–672
22. Lu B, Xu T, Park SK, McClatchy DB, Liao L, Yates JR 3rd (2009) Shotgun protein identification and quantification by mass spectrometry in neuroproteomics. Methods Mol Biol 566:229–259
23. MacCoss MJ (2005) Computational analysis of shotgun proteomics data. Curr Opin Chem Biol 9:88–94
24. Li XJ, Zhang H, Ranish JA, Aebersold R (2003) Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Anal Chem 75:6648–6657
25. Pan C, Kora G, McDonald WH, Tabb DL, VerBerkmoes NC, Hurst GB et al (2006) ProRata: a quantitative proteomics program for accurate protein abundance ratio estimation with confidence interval evaluation. Anal Chem 78:7121–7131
26. Nesvizhskii AI, Aebersold R (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 4:1419–1440
27. Cox J, Mann M (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26:1367–1372
28. Jagtap P, Bandhakavi S, Higgins L, McGowan T, Sa R, Stone MD et al (2012) Workflow for analysis of high mass accuracy salivary data set using MaxQuant and ProteinPilot search algorithm. Proteomics 12:1726–1730
29. Qi D, Brownridge P, Xia D, Mackay K, Gonzalez-Galarza FF, Kenyani J et al (2012) A software toolkit and interface for performing stable isotope labeling and top3 quantification using Progenesis LC-MS. Omics 16:489–495
30. Braisted JC, Kuntumalla S, Vogel C, Marcotte EM, Rodrigues AR, Wang R et al (2008) The APEX quantitative proteomics tool: generating protein quantitation estimates from LC-MS/MS proteomics results. BMC Bioinf 9:529
31. Pedrioli PG (2010) Trans-proteomic pipeline: a pipeline for proteomic analysis. Methods Mol Biol 604:213–238
32. Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N et al (2010) A guided tour of the trans-proteomic pipeline. Proteomics 10:1150–1159
33. Keller A, Eng J, Zhang N, Li XJ, Aebersold R (2005) A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 1:2005.0017
34. Nesvizhskii AI, Keller A, Kolker E, Aebersold R (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 75:4646–4658
35. Wen B, Zhou R, Feng Q, Wang Q, Wang J, Liu S (2014) IQuant: an automated pipeline for quantitative proteomics based upon isobaric tags. Proteomics 14:2280–2285
36. Breitwieser FP, Muller A, Dayon L, Kocher T, Hainard A, Pichler P et al (2011) General statistical modeling of data from protein relative expression isobaric tags. J Proteome Res 10:2758–2766
37. Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I (2001) Controlling the false discovery rate in behavior genetics research. Behav Brain Res 125:279–284
38. Doerr A (2011) Targeted proteomics. Nat Methods 8:43
39. Kelter G, Steinbach D, Konkimalla VB, Tahara T, Taketani S, Fiebig HH et al (2007) Role of transferrin receptor and the ABC transporters ABCB6 and ABCB7 for resistance and differentiation of tumor cells towards artesunate. PLoS One 2:e798
40. Aebersold R (2013) Method of the year 2012. Nat Methods 10:1
41. Lange V, Malmstrom JA, Didion J, King NL, Johansson BP, Schafer J et al (2008) Targeted quantitative analysis of Streptococcus pyogenes virulence factors by multiple reaction monitoring. Mol Cell Proteomics 7:1489–1500
42. Kiyonami R, Schoen A, Prakash A, Peterman S, Zabrouskov V, Picotti P et al (2011) Increased selectivity, analytical precision, and throughput in targeted proteomics. Mol Cell Proteomics 10:M110.002931
43. Fusaro VA, Mani DR, Mesirov JP, Carr SA (2009) Prediction of high-responding peptides for targeted protein assays by mass spectrometry. Nat Biotechnol 27:190–198
44. Cham Mead JA, Bianco L, Bessant C (2010) Free computational resources for designing selected reaction monitoring transitions. Proteomics 10:1106–1126
45. Farrah T, Deutsch EW, Kreisberg R, Sun Z, Campbell DS, Mendoza L et al (2012) PASSEL: the PeptideAtlas SRM experiment library. Proteomics 12:1170–1175
46. Xu F, Yang T, Fang D, Xu Q, Chen Y (2014) An investigation of heat shock protein 27 and P-glycoprotein mediated multi-drug resistance in breast cancer using liquid chromatography-tandem mass spectrometry-based targeted proteomics. J Proteomics 108:188–197
47. Proc JL, Kuzyk MA, Hardie DB, Yang J, Smith DS, Jackson AM et al (2010) A quantitative study of the effects of chaotropic agents, surfactants, and solvents on the digestion efficiency of human plasma proteins by trypsin. J Proteome Res 9:5422–5437
48. Sherwood CA, Eastham A, Lee LW, Risler J, Vitek O, Martin DB (2009) Correlation between y-type ions observed in ion trap and triple quadrupole mass spectrometers. J Proteome Res 8:4243–4251
49. Prakash A, Tomazela DM, Frewen B, Maclean B, Merrihew G, Peterman S et al (2009) Expediting the development of targeted SRM assays: using data from shotgun proteomics to automate method development. J Proteome Res 8:2733–2739
50. MacLean B, Tomazela DM, Shulman N, Chambers M, Finney GL, Frewen B et al (2010) Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26:966–968
51. Krokhin OV, Craig R, Spicer V, Ens W, Standing KG, Beavis RC et al (2004) An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC: its application to protein peptide mapping by off-line HPLC-MALDI MS. Mol Cell Proteomics 3:908–919
52. Brusniak MY, Kwok ST, Christiansen M, Campbell D, Reiter L, Picotti P et al (2011) ATAQS: a computational software tool for high throughput transition optimization and validation for selected reaction monitoring mass spectrometry. BMC Bioinf 12:78
53. Shannon PT, Reiss DJ, Bonneau R, Baliga NS (2006) The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinf 7:176

16 Bioinformatics Tools for Proteomics Data Interpretation

Karla Grisel Calderón-González, Jesús Hernández-Monge, María Esther Herrera-Aguirre, and Juan Pedro Luna-Arias

Abstract
Biological systems function via intricate cellular processes and networks in which RNAs, metabolites, proteins and other cellular compounds have a precise role and are exquisitely regulated (Kumar and Mann, FEBS Lett 583(11):1703–1712, 2009). The development of high-throughput technologies, such as Next Generation DNA Sequencing (NGS) and DNA microarrays for sequencing genomes or metagenomes, has triggered a dramatic increase in the last few years in the amount of information stored in GenBank and the UniProt Knowledgebase (UniProtKB). GenBank release 210, reported in October 2015, contains 202,237,081,559 nucleotides corresponding to 188,372,017 sequences, whilst a further 1,222,635,267,498 nucleotides corresponding to 309,198,943 sequences come from Whole Genome Shotgun (WGS) projects. In the case of UniProtKB/Swiss-Prot, release 2015_12 (December 9, 2015) contains 196,219,159 amino acids that correspond to 550,116 entries. Meanwhile, UniProtKB/TrEMBL (release 2015_12 of December 9, 2015) contains 18,388,518,871 amino acids corresponding to 55,270,679 entries. Proteomics has also improved our knowledge of the proteins that are being expressed in cells at a certain time of the cell cycle. It has also allowed the identification of molecules forming part of multiprotein complexes and of an increasing number of posttranslational modifications (PTMs) that are present in proteins, as well as the variants of proteins expressed.

K.G. Calderón-González • M.E. Herrera-Aguirre • J.P. Luna-Arias (*)
Departamento de Biología Celular, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (Cinvestav-IPN), Av. Instituto Politécnico Nacional 2508, Col. San Pedro Zacatenco, Gustavo A. Madero, C.P. 07360 Ciudad de México, Mexico
e-mail: jpluna@cell.cinvestav.mx; jpluna@cinvestav.mx; jpluna2003@gmail.com

J. Hernández-Monge
Instituto de Física, Universidad Autónoma de San Luis Potosí, Av. Manuel Nava 6, Zona Universitaria, C.P. 78290 San Luis Potosí, S.L.P., Mexico

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_16

Keywords
Proteomics data interpretation • Interactome mapping • Gene Ontology •
STRING • MINT • IntAct • HPRD • BioGRID • PIPs • MPIDB • TAIR •
PANTHER • DAVID • KEGG • IPA

Biological systems function via intricate cellular processes and networks in which RNAs, metabolites, proteins and other cellular compounds have a precise role and are exquisitely regulated [1]. The development of high-throughput technologies, such as Next Generation DNA Sequencing (NGS) and DNA microarrays for sequencing genomes or metagenomes, has triggered a dramatic increase in the last few years in the amount of information stored in GenBank and the UniProt Knowledgebase (UniProtKB). GenBank release 210, reported in October 2015, contains 202,237,081,559 nucleotides corresponding to 188,372,017 sequences, whilst a further 1,222,635,267,498 nucleotides corresponding to 309,198,943 sequences come from Whole Genome Shotgun (WGS) projects. In the case of UniProtKB/Swiss-Prot, release 2015_12 (December 9, 2015) contains 196,219,159 amino acids that correspond to 550,116 entries. Meanwhile, UniProtKB/TrEMBL (release 2015_12 of December 9, 2015) contains 18,388,518,871 amino acids corresponding to 55,270,679 entries. Proteomics has also improved our knowledge of the proteins that are being expressed in cells at a certain time of the cell cycle. It has also allowed the identification of molecules forming part of multiprotein complexes and of an increasing number of posttranslational modifications (PTMs) that are present in proteins, as well as the variants of proteins expressed.

Considering that human cells contain between 20,000 and 30,000 protein-encoding genes and the possibility that there could be approximately four alternative splice variants for each gene [2], the total number of proteins that could be expressed at a certain time would range between 80,000 and 120,000. Moreover, assuming four PTMs in each protein, the total number of proteins in a cell would range between 320,000 and 480,000. However, when we consider the more than 400 different PTMs that have been found [3], the number of proteins in a cell would easily grow to more than one million.

Proteins do not function alone; they usually carry out their function by interacting with one or more partners. The main goal of the protein-protein interaction map is to catalogue interactions and to define the interactome. These interactions are currently determined using a vast array of technologies, including yeast two hybrid systems, tag-fusion proteins for the identification of interacting proteins, co-immunoprecipitation, chemical crosslinking, phage display, FRET (Fluorescence Resonance Energy Transfer), SPR (Surface Plasmon Resonance), tandem affinity purification, protein microarrays, protein domains, etc. Many of these techniques, if not all, use mass spectrometry and non-redundant gene and protein databases as the main tools for the identification of peptides and proteins. Many of the cellular protein-protein interaction networks have been catalogued and a number of interactome databases have been established. There are several protein-protein interaction databases freely available via the World Wide Web that can be used to determine the putative functions of a protein based on its direct or indirect interactions. Protein-protein interaction maps in these databases are, in general, based on the information published, mostly in PubMed. In this section, we describe some of the most important databases available, including STRING, MINT, IntAct, HPRD, BioGRID, PIPs, MPIDB and TAIR. Furthermore, additional tools such as

Gene Ontology, PANTHER, DAVID, KEGG, and IPA, among others, have been developed to facilitate data mapping into these databases. We are certain that these tools will be useful in understanding the intricate interactions and functions of proteins in cells.

16.1 Gene Ontology

Many proteins are conserved through evolution and consequently share the same functions. However, the systems of nomenclature for genes and proteins stay divergent despite repeated evaluation of gene similarities by experts [4]. In order to tackle this challenge, the Gene Ontology (GO) consortium was created. The aim of the GO project is to provide a structured vocabulary to define specific biological domains that describe gene products in different organisms [5]. The GO project began in 1998 as a collaborative effort between three organism databases: FlyBase (Drosophila), the Mouse Genome Informatics (MGI) project and the Saccharomyces Genome Database (SGD). The GO Consortium has been continuously growing due to the deposition of several animal, microbial and plant genome databases [6], as well as the recent addition of ontology areas, such as cell cycle and cilia-related terms, as well as multicellular organism processes [7]. By using these ontologies, it is possible to graph structures that comprise cellular components, molecular functions, biological processes, and the relationships between them in a species-independent manner [7]. In other words, GO is divided into two modules: the ontologies, called GO ontology, which include defined terms and their relationships, and the GO annotations, which cover gene products and defined terms [8]. The GO annotation is generated either by a curator or automatically through predictive methods (95 % by the latter).

The gene ontology relationships are developed like a tree, depicting a hierarchy from more general terms to more specific ones. Terms are linked by three possible relationships: “is_a”, “part_of”, and “positively regulates/negatively regulates”. The “is_a” is a simple relationship between a class and a subclass. The “part_of” relationship is more complex than the former. C is part of D means that whenever C is present, it always belongs to D; for instance, an organelle (C) is always part of a cell (D), but not all cells have the same organelles. In the GO website (http://geneontology.org), a variety of browsers provide visualization and query capabilities for GO. For example, the AmiGO browser provides a web interface for searching and displaying ontologies, term definitions and associated annotated gene products for diverse organism databases [6]. The GO Online SQL (Structured Query Language) Environment (GOOSE) for AmiGO 2 allows users to freely enter SQL queries in the GO database. On the other hand, the PANTHER Classification System, which is further described next, provides enrichment analysis tools for GO.

16.2 PANTHER

PANTHER (Protein ANalysis THrough Evolutionary Relationships) is a classification system that combines ontology, gene function, pathways and statistical tools. This classification system can analyze sequencing, gene expression, and proteomics data [9]. PANTHER is a large database of gene families developed as a resource for family and subfamily classification of proteins [10]. PANTHER has two main components: PANTHER library (PANTHER/LIB) and PANTHER index (PANTHER/X). PANTHER library is a collection of protein families and subfamilies represented as phylogenetic trees assembled using Hidden Markov statistical models (HMMs) and a multiple sequence alignment algorithm (MSA) (Fig. 16.1a) [9–12]. PANTHER index is a set of ontological abbreviated terms that describe the function of proteins in biological processes or molecular functions [10–12]. In addition,

A Sequences
B

Clusters
Identification and
curation of pathways

Multiple sequence
alignment (MSA)

CellDesigner software
Phylogenetic Hidden Markov
tree Model (HMM)

PANTHER Pathway diagrams


Family/Subfamily Reaction class and
relationships

Sequences Molecular class or


(Individual proteins) pathway component PANTHER Pathway

Cell type or cellular


components

PANTHER Library

PANTHER Pathways

Fig. 16.1 PANTHER data overview. PANTHER has components or a particular molecular class. Then,
two main modules: (a) PANTHER Library which is a pathways are drawn and curated by expert curators
collection of families and subfamilies of proteins. This using the CellDesigner software. Pathways are built
library is constructed from a selection of sequences built based on molecular class or pathway component, reaction
into clusters. These clusters are then used to generate class and relationships, and cell type or cellular
multiple sequence alignments (MSA), phylogenetic components. The pathway component is a link between
trees, and statistical HMMs. (b) PANTHER Pathways various PANTHER modules
are built using literature databases related to pathway

PANTHER has a Pathway module, in which the pathways are represented as diagrams generated with CellDesigner software (Fig. 16.1b) [13]. This module uses a defined vocabulary to describe pathways and their components, including pathway class and components, molecular class, reaction class, reaction relationships, cell type, and cellular components [14, 15]. PANTHER pathways are related to protein sequences in the PANTHER/LIB and, therefore, are also connected with families/subfamilies and HMM analysis (Fig. 16.1) [9, 10, 12]. Pathways are created and annotated by expert curators, according to evidence found in the literature. Moreover, pathways can be curated with the Pathway curation software (http://curation.pantherdb.org/) [14, 15]. Some of the pathways included in the PANTHER database are Cell cycle, DNA replication, General transcription regulation, Glycolysis, and Tricarboxylic acid cycle, among others (http://www.pantherdb.org/pathway/pathwayList.jsp). The PANTHER database contains the following information:

1. Genes (104 genomes; 1,424,953 total genes; 1,026,421 genes in PANTHER families with phylogenetic trees, MSA and HMMs)
2. Families (11,928 families and 83,190 subfamilies)
3. Pathways (177 pathways, 3092 pathway components, 2447 sequences related to pathways, and 2447 references captured for the pathways)
4. Ontologies (550 terms in PANTHER GO slim: 257 biological process terms, 70 cellular component terms, and 223 molecular function terms; 243 protein class terms; 41,603 terms used in GO database annotations, including 9942 molecular functions, 27,852 biological processes, and 3809 cellular component terms) (http://www.pantherdb.org/data).

The main window in PANTHER is composed of two main toolbars. The first one contains different links to individual topics (Fig. 16.2, items 1–5), as well as an option for registration, login and contact (Fig. 16.2, items 6–8). The second toolbar contains different options for data analysis, including gene list analysis, browse, sequence search, cSNP scoring, and keyword

Fig. 16.2 PANTHER Classification System website. The main window in PANTHER contains two main toolbars. The first toolbar on top has links to different options including: (1) PANTHER data, (2) PANTHER tools, (3) workspace, (4) downloads and (5) help/tutorial, and a section for (6) registration, (7) login, and (8) contact. The second toolbar, right under the first one, is for data analysis: (9) Gene list analysis, (10) browse, (11) sequence search, (12) cSNP scoring, and (13) keyword search. PANTHER also includes: (14) Quick keyword search, (15) whole genome function views, (16) genome statistics, (17) publications, and (18) recent publications describing PANTHER [16]

search (Fig. 16.2, items 9–13). In addition, PANTHER has a panel for keyword search and quick links (Fig. 16.2, items 14–18) [16]. When a list of genes or proteins is analyzed, different functional classification views can be obtained, including a gene list, bar charts or pie charts. Also, genes or proteins can be statistically analyzed through an enrichment test or a statistical overrepresentation test [17]. The PANTHER Ontology Browser, also called PANTHER Prowler, browses and retrieves results (e.g. molecular function, biological process, cellular component, protein class, pathway, and species) for input data related to ontology terms, such as genes and families [11, 17]. The PANTHER HMM sequence-scoring (sequence search) tool can be used to search and compare protein sequences with the HMMs of the PANTHER library. The top hit HMM can be observed in the results page, which also contains a statistical value for significance [17]. The Evolutionary Analysis of Coding SNPs (cSNP scoring) tool estimates the probability of a specific amino-acid change [17]. The keyword search tool can be used to obtain a variety of information, such as genes, families, pathways, and ontology terms for the protein of interest. However, we will focus on the generation of graphs for proteins classified in different categories.

16.3 PANTHER Gene List Analysis

To perform a gene list analysis using the PANTHER website (http://pantherdb.org), go to the gene list analysis toolbar (Fig. 16.3) and enter the

Fig. 16.3 Procedure to perform gene list analysis in PANTHER. The red section denotes the three main steps: (1) Enter the IDs of proteins to be analyzed, (2) select the organism, and (3) select the type of analysis to be performed

IDs of the genes or proteins in your list (Ensembl, Ensembl_PRO, Ensembl_TRS, Gene ID, Gene symbol, GI, HGNC, IPI, UniGene, UniProtKB ID) into the window, separating IDs by a space or comma. IDs can also be uploaded as a txt file. Then select the list type for the query data (i.e. ID List, Previously exported gene list, Workspace list or PANTHER Generic Mapping File) and the organism of interest for analysis. In our example, we selected “ID list” and “Homo sapiens”. Afterward, choose the type of analysis you would like to perform. For example, we selected the “functional classification” viewed as a pie chart. Finally, click on the submit key (Fig. 16.3). In the results webpage, genes can be classified according to Molecular Function, Biological Process, Cellular Component, Protein Class, and Pathway (Fig. 16.4a). The chart obtained for a certain process can change for other processes. In addition, pie charts can be changed to bar charts and vice versa (Fig. 16.4b). The list of genes obtained in each ontological classification can be exported as a txt file. Classification categories may also contain different subcategories. When the cursor is placed over a category in a chart, a message containing the following information will be displayed: category name and its corresponding identifier, number of genes included from your list, the corresponding percentage of gene hits against the total number of identified genes, and the percentage of gene hits against the total number of function hits (Fig. 16.4a). When a subcategory is selected, the corresponding gene list will be displayed (Fig. 16.5). As an example, we classified a list of overexpressed proteins in common between Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines, which were recently described by Calderón-González et al. [18]. These proteins were categorized into Molecular functions and Cellular components (Fig. 16.4). In the first category, the most representative processes were Binding and Catalytic activity, with 25 and 21 genes, respectively (Figs. 16.4a and 16.5a). For the Cellular component classification, the categories with the highest number of genes were Cell part (14 genes) and Macromolecular complex (10 genes) (Fig. 16.4b).

16.4 DAVID

The Database for Annotation, Visualization, and Integrated Discovery (DAVID) was developed in 2003 to address the emerging challenges posed by the post-genomic era [19]. DAVID, as well as other tools for the analysis of large gene lists, is based on the principle of enrichment of genes that are functionally related to an altered gene/protein (identified by high-throughput technologies). These enriched genes might potentially cooperate within a determined group and/or biological process [20]. DAVID is composed of the DAVID Knowledgebase and five annotation tools:

1. DAVID Functional Annotation
2. DAVID Gene Functional Classification
3. DAVID Gene ID Conversion
4. DAVID Gene Name Viewer
5. NIAID Pathogen Annotation Browser.

The DAVID Knowledgebase is constructed around the “DAVID Gene Concept”, which includes tens of millions of gene/protein identifiers from several major public databases. This data concentration eliminates annotation redundancy among different resources and allows the organization of gene identifiers into more than 40 functional classification categories, e.g. Ontology (more than 40 million records), Protein-protein interactions (more than four million), Disease gene associations (9000), Pathways (above 50,000), Functional categories (more than 6.9 million), etc. [21].
DAVID Gene Functional Classification: This tool is useful for breaking large lists of genes into more manageable modules ordered according to their functional relationships. These functionally organized modules are very useful in processing large amounts of information, switching from a gene-centric analysis to a module-centric analysis [21].
DAVID Functional Annotation Tool Suite: The Functional Annotation Tool Suite displays three ways of combining results: Functional Annotation Clustering, Functional Annotation Chart and Functional Annotation Table. The

Fig. 16.4 Functional classification of proteins up-regulated in both Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines. The proteins were classified into (a) Biological Processes and (b) Cellular Components. The figure also shows the change from a pie chart to a bar chart
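
The two percentages that PANTHER reports for a chart category (Sect. 16.3) differ only in their denominator: one is relative to the number of classified genes, the other to the total number of function hits, which can be larger because one gene may fall into several categories. The following minimal Python sketch illustrates the arithmetic under that reading. The function name `classification_percentages` and the toy gene identifiers are hypothetical, and the code is not PANTHER's implementation.

```python
def classification_percentages(category_to_genes, total_genes):
    """Return {category: (n_genes, pct_of_genes, pct_of_function_hits)}.

    category_to_genes: dict mapping a category name to the set of gene IDs
                       from the user's list assigned to that category.
    total_genes:       number of genes from the list that were classified.
    """
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    # A gene can fall into several categories, so the number of function
    # hits can exceed the number of genes.
    total_hits = sum(len(genes) for genes in category_to_genes.values())
    if total_hits == 0:
        return {}
    rows = {}
    for category, genes in category_to_genes.items():
        n = len(genes)
        rows[category] = (n, 100.0 * n / total_genes, 100.0 * n / total_hits)
    return rows


# Hypothetical toy input: four classified genes, two categories.
toy = {
    "binding": {"GENE_A", "GENE_B", "GENE_C"},
    "catalytic activity": {"GENE_B", "GENE_D"},
}
for name, (n, pct_genes, pct_hits) in classification_percentages(toy, 4).items():
    print(f"{name}: {n} genes, {pct_genes:.1f}% of genes, {pct_hits:.1f}% of hits")
```

Because GENE_B is counted in both categories, the toy example has five function hits for four genes, which is why the two percentages differ.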

Functional Annotation Clustering tool allows the user to group genes depending on the degree of their functional association. It is performed with a novel algorithm that measures relationships among annotation terms. This process is useful to eliminate the redundant relationships that exist in many-genes-to-many-terms cases (i.e. when one gene is associated with many different redundant terms and one term is associated with many genes) [21]. An additional feature of this

Fig. 16.5 Classification of Biological Processes for proteins up-regulated in both Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines. (a) Biological processes pie chart displaying different categories of processes, e.g. Metabolic Processes. (b) List of genes involved in the selected Metabolic Processes

clustering tool is the ability to rank the importance of annotation groups with an enrichment score based on the geometric mean of the enrichment p-values (EASE scores) of all annotation terms in the group; the annotation clustering tool also provides a link to a 2-D viewer for related gene-term relationships, allowing a fast way to focus on the genes that have common annotation terms [22]. On the other hand, the Functional Annotation Chart tool can be used to perform the typical gene-GO term enrichment analysis (similar to other tools) to identify the most relevant (overrepresented) biological terms associated with a given gene list. However, DAVID offers extended annotation coverage in comparison to other enrichment analysis tools. The enhanced annotation coverage includes not only the GO terms but more than 40 annotation categories, such as protein-protein interactions, protein functional domains, disease associations, bio-pathways, sequence features, gene tissue expression, etc. This tool is helpful to identify enriched annotation terms associated with the gene list of interest in a linear tabular text format. Similar to the Annotation Clustering Tool, the Functional Annotation Chart also provides links to further explore the list of interacting proteins, link gene-disease associations and visualize genes on BioCarta and KEGG pathway maps [21]. Finally, the Functional Annotation Table tool is a query engine for the DAVID Knowledgebase without statistical testing. It delivers annotation information in a table format for every gene from the user's gene list. This is a particularly useful tool when users want to have a closer look at some specific genes of interest and explore their annotation information.
DAVID’s Gene ID Conversion tool allows conversion of the user's input gene or gene product identifiers from any type to another in a comprehensive and high-throughput manner with a uniquely enhanced ID-ID mapping database leveraging heterogeneous annotations [23].
DAVID’s Gene Name Viewer is another tool useful to quickly attach meaning to a list of gene IDs, translating them into their corresponding gene names. Thus, before proceeding to an in-depth analysis, researchers can quickly have an overview of gene names to gain insight into their biological system and have an a priori general idea of interesting processes that might be involved.
DAVID’s NIAID Pathogen Browser: The National Institute of Allergy and Infectious Diseases (NIAID) has defined three categories of priority pathogens, A, B and C. These pathogens are important for biodefense purposes and have become attractive study subjects because of the increasing research funding available to study them. The DAVID NIAID Pathogen Browser is provided as a support tool for researchers who would like to explore the biology of the priority pathogen types. For example, one may choose the word “anthrax” and type the keyword “toxin”; the result is a list of genes from Bacillus anthracis that match the typed keyword. This tool may assist researchers in understanding the biology of a priority pathogen if the gene list retrieved from the DAVID NIAID Pathogen Browser is further analyzed by one of DAVID’s Bioinformatics Resources [21].
Analysis of gene lists: To carry out an optimal gene list analysis, the list should: (1) contain a sufficient number of genes/proteins, ranging from hundreds to thousands (e.g. 100–2000), (2) only include genes with statistical significance that show a notable up- or down-regulation, and (3) show reproducibility between experimental replicates [22].
The DAVID bioinformatics resources website is organized in two main toolbars (Fig. 16.6). On top there are different links, such as Start Analysis, Shortcut to DAVID Tools, and Technical Center, among others. On the left side, there are other shortcuts to DAVID Tools that also offer a brief explanation for each tool. The recently added DAVID NIAID Pathogen Annotation Browser tool can be found on the top menu in Shortcut to DAVID Tools.
It is straightforward to upload a gene list for DAVID bioinformatics analysis (Fig. 16.7a). Firstly, go to https://david.ncifcrf.gov/gene2gene.jsp and select Start analysis. On the left side choose upload in the list manager, then: (1) Copy/paste the gene lists to be analyzed into box A; a text file or a gene IDs list can also be

Fig. 16.6 DAVID Bioinformatic Resources Website. This website has two main toolbars. The toolbar on the top has links to: (1) Start Analysis, (2) Shortcut to DAVID Tools, (3) Technical Center, (4) Downloads and APIs, (5) Terms of Service, (6) Why David, and (7) About Us. The toolbar on the left side (8) has links to Tools that offer a brief explanation for each of DAVID’s tools. Additionally, in (2) we can find the recently added tool NIAID Pathogen Annotation Browser (9)

uploaded in box B; (2) choose the corresponding gene identifier type for your input gene IDs; alternatively, use the ID conversion tool to find (or convert to) the correct gene identifier; (3) select the type of list you are submitting, either gene list or gene background. The general guideline is to set up a pool of genes as the population background, which usually includes all the genes that could possibly be detected (e.g. all the probes included in a particular DNA microarray). Since most studies are done on a genome-wide scale, there is usually no need to set a background (the default background is the entire genome); (4) submit the list. The different analysis suites that can be applied to the submitted gene list are then displayed (Fig. 16.7b); the submitted list is shown on the left (highlighted in the Gene List Manager) (Fig. 16.7b). By clicking Start Analysis, users can go back at any time to upload another gene list or to access any analytical tool suite of interest.
In this section, a couple of examples are presented to showcase a few of the most widely used tools from DAVID’s toolbox, using gene lists corresponding to proteins down-regulated in both Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines studied by Calderón-González et al. [18]. Selecting the Functional Annotation Tool (Fig. 16.7b) opens the Annotation Summary Results, which display the number and percentage of genes (from the submitted gene list) involved in different GO categories

Fig. 16.7 Uploading data into DAVID’s gene list manager. (a) On the left side: (1) Upload a gene list (copy/paste the gene list), (2) Choose the corresponding gene identifier, (3) Select the type of list, either gene list or gene background, (4) Submit the gene list. (b) Once the user has submitted the gene list, the Analysis Wizard shows the gene list currently being analyzed and the shortcuts for the different DAVID Analysis tools

(Fig. 16.8). In each category, users can click on Chart to obtain an individual chart report for the selected category. Users can choose a number of categories for further analysis in the Combined Annotation Tools (Fig. 16.8). A table divided into several annotation clusters will be obtained by clicking on Annotation Clustering Tool. Every annotation cluster is formed by a group of terms from functionally related genes. Taken all together, the chance to identify a biological significance increases (Fig. 16.9). The degree of similarity between annotations is measured by Kappa statistics. This tool also provides a link to generate a 2D-view map that allows a fast way to associate genes that have common annotation terms.
From this very specific gene list, we observed an enriched group of genes involved in mitochondrial function. Notably, this result correlates well with those obtained with the tools explored previously. Since the submitted gene list corresponds to down-regulated genes in a proteomic approach, this result suggests that MCF7, T47D and MDA-MB-231 breast cancer

Fig. 16.8 Functional Annotation Tool Suite. (1) Gene List Manager showing the list that is being analyzed. (2) Annotation Summary results displaying different categories: (3) the number and (4) percentage of genes involved. (5) Clicking on this box will generate a chart report of functional categories. (6) The user can choose the number of categories to be considered for further analysis in the Combined Annotation Tools (7) by checking the check boxes next to each category. In the example shown, the list “UP_REG LIST1” (49 genes, “Homo sapiens” as background) is being analyzed; the percentage for a category is the number of list genes involved in that category divided by the total number of genes in the list (e.g. 5/49 = 0.102 or 10.2 %). The combined reports comprise the chart (a redundant list of annotation terms of the selected categories), the clustering tool (clusters of non-redundant annotation terms for the selected categories), and the table report of all selected categories
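
The quantities shown in DAVID's summary and clustering reports can be reproduced with a few lines of Python. The sketch below is illustrative only and rests on stated assumptions: the percentage is list hits divided by list size (as in the 5/49 example of Fig. 16.8); the EASE score is computed as a one-sided Fisher exact p-value after subtracting one gene from the list hits, with the 2×2 table laid out as list versus background; and a cluster's enrichment score is the geometric mean of its members' EASE scores on a minus log10 scale. The function names are hypothetical and this is not DAVID's own code.

```python
from math import exp, lgamma, log10


def percent_of_list(list_hits, list_total):
    """Percentage of the submitted genes that fall into one category."""
    return 100.0 * list_hits / list_total


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums hypergeometric probabilities of tables with the same margins whose
    top-left cell is at least a.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    log_denom = _log_comb(total, col1)
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += exp(_log_comb(row1, x)
                 + _log_comb(total - row1, col1 - x)
                 - log_denom)
    return min(p, 1.0)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Conservative Fisher variant: one list hit is removed before testing.

    Assumed table layout: columns are user list and background, rows are
    "in category" and "not in category".
    """
    a = max(list_hits - 1, 0)
    return fisher_upper_tail(a, pop_hits,
                             list_total - list_hits,
                             pop_total - pop_hits)


def cluster_enrichment_score(ease_scores):
    """Minus log10 of the geometric mean of the member terms' EASE scores."""
    return -sum(log10(p) for p in ease_scores) / len(ease_scores)


# Hypothetical numbers: 3 of 300 list genes and 40 of 30,000 background genes
# belong to the same category.
print(round(percent_of_list(5, 49), 1))           # 5/49 from Fig. 16.8
print(ease_score(3, 300, 40, 30000))
print(cluster_enrichment_score([0.01, 0.001, 0.0001]))
```

Removing one hit makes the score more conservative than the plain Fisher exact test, which penalizes categories supported by very few genes. A smaller EASE score and a larger cluster enrichment score both indicate stronger enrichment.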

cell lines have an impaired mitochondrial function in comparison to the MCF10A control cell line. For instance, the genes encoding NADH-coenzyme Q reductase, 3,2 trans-enoyl-Coenzyme A isomerase, cytochrome c oxidase, and malate dehydrogenase had a high EASE score and are involved in mitochondrial inner membrane function.

Fig. 16.9 An example of the Functional Annotation Clustering Tool. This image shows the results obtained by searching the “DWN_REG LIST”. The search results show three clusters (1), each categorized further according to different terms (2). The clusters are ranked according to their enrichment score (3) and stringency [...] involved in each term are shown for each cluster [...] as well as the EASE score (6) and the related term [...]. The links to obtain the gene list in each annotation [...] (8) and a 2D-Map View (9) are provided. In-figure annotations: the overall enrichment score of a group is based on the EASE scores of its member terms (the higher, the more enriched); the EASE score is a modified Fisher exact p-value (the smaller, the more enriched)

16.5 KEGG

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource designed for understanding and interpreting biological systems using high-throughput data [24–26]. KEGG is composed of 17 databases organized into four categories:

1. Systems information: KEGG PATHWAY (pathway maps), KEGG BRITE (functional hierarchies and table files) and KEGG MODULE (pathway, structural complex, functional set and signature modules). These databases are manually created using published literature
2. Genomic information: KEGG ORTHOLOGY (KEGG Orthology (KO) groups), KEGG GENOME (complete genomes), KEGG GENES (gene catalogs), KEGG SSDB (sequence similarity database for genes), DGENES (draft genomes) and MGENES (metagenomes). The information about genes and genomes is
obtained from different databases, such as RefSeq (prokaryotes, eukaryotes, plasmids and viruses), GenBank (prokaryotes), and PubMed (addendum: collection of manually created protein sequence entries)
3. Chemical information, also called KEGG LIGAND: KEGG COMPOUND (metabolites and other small molecules), KEGG GLYCAN (glycans), KEGG REACTION (biochemical reactions), KEGG RPAIR (reactant pairs), KEGG RCLASS (reaction class), and KEGG ENZYME (enzyme nomenclature)
4. Health information, commonly called KEGG MEDICUS: KEGG DISEASE (human diseases), KEGG DRUG (drugs), KEGG DGROUP (drug groups), KEGG ENVIRON (crude drugs and health related substances), JAPIC (drug labels in Japan) and DailyMed (links to drug labels in USA) [26].

The annotation system in KEGG is based on the correlation between functional information and orthologous groups (KEGG Orthology or KO) through the assignment of KO identifiers (K numbers). This information is stored in the KO database and is independent of the KEGG GENES database that contains completely sequenced genomes [26]. The KO system is essential for connecting the genomic information with systemic functional information: genes are converted to K numbers, which leads to an automatic reconstruction of KEGG PATHWAYS and other networks [26, 27]. Currently, KEGG has more than 4000 complete genomes annotated with the KO system [26]. KEGG has several analysis tools:

1. KEGG Mapper, which is the interface used for KEGG Mapping. It is composed of the KEGG BRITE, MODULE, and PATHWAY mapping tools, which map genes, proteins, small molecules, etc. (also called objects) onto all BRITE functional hierarchies, modules and pathway maps, respectively [28]
2. KEGG Atlas is a graphical interface to navigate the global integrated maps in KEGG. Maps available are Metabolism (Biosynthesis of amino acids, Biosynthesis of secondary metabolites, Carbon metabolism, Degradation of aromatic compounds, Fatty acid metabolism, Microbial metabolism in diverse environments, and 2-Oxocarboxylic acid metabolism) and Cancer pathway [29]
3. BlastKOALA: KOALA is defined as KEGG Orthology And Links Annotation. BlastKOALA is used for the annotation of completely sequenced genomes. This tool utilizes the Pangenomes database
4. GhostKOALA: this tool is designed for metagenome annotation and it uses the Pangenomes and Viruses databases [26, 27]
5. BLAST/FASTA performs searches of similar sequences
6. SIMCOMP searches for similar chemical structures

Pathway Maps Analysis. To map proteins of interest into Pathways, go to the KEGG website (http://www.genome.jp/kegg/) and, under the Data-oriented entry points, click on the KEGG PATHWAY key (Fig. 16.10). In the Pathway Mapping menu, select the mapping tool of interest: Search Pathway, Search&Color Pathway or Color Pathway. As an example, the up- and down-regulated proteins found in common between Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines from Calderón-González et al. were analyzed with the Search&Color Pathway tool [18]. Up-regulated proteins were colored in red, whilst down-regulated polypeptides were presented in green (Fig. 16.11). To perform this analysis, an organism must be selected first by clicking on the org key, after which a new window is displayed to find the three- or four-letter KEGG organism code. Type the desired organism in the window and then click on select. In this example, H. sapiens has the hsa code. The next step is to enter the IDs in UniProtKB format, each followed by the word red or green as mentioned before. Other compatible ID formats are KEGG-Identifiers, NCBI-GeneID and NCBI-ProteinID. Alternatively, a file containing IDs can be uploaded. To perform

Fig. 16.10 KEGG website. This image shows the different links provided in KEGG’s website, including KEGG Home, KEGG Database, KEGG Objects, KEGG Software, among others. The website also provides several tools for data analysis, including KEGG Mapper, KEGG Atlas, BlastKOALA, GhostKOALA, BLAST/FASTA, and SIMCOMP. KEGG Pathway modules are highlighted in a red box

the search, the following options were selected: (1) to include aliases and (2) to display objects not found in the search (Fig. 16.12a). The result window shows a list of pathways where proteins were mapped, as well as a list of protein IDs that were not found (Fig. 16.12a). A list of proteins found in each pathway, including their UniProtKB IDs and KEGG H. sapiens database codes, is also displayed (Fig. 16.12b). Clicking a particular UniProtKB ID will display the information for the selected ID (Fig. 16.13a). On the other hand, if the code of the H. sapiens organism in KEGG is selected, a new window containing KEGG information about that protein, including Gene name, Disease, KEGG Orthology, Structure, Motifs in the protein, and Pathways, among other information, will be displayed (Fig. 16.13b). Finally, when a certain pathway is selected, an

Fig. 16.11 KEGG pathway mapping tool. This image shows the general procedure for mapping proteins in the Search & Color Pathway module. The format of IDs as well as the organism need to be selected. Protein accession numbers are followed by the word red or green to highlight up- or down-regulated proteins, respectively

image is generated where up- or down-regulated proteins are highlighted in red or green, respectively (Fig. 16.14). In the case of the breast cancer cell lines, most quantified proteins mapped to metabolic processes, with 22 polypeptides [5 up-regulated (↑) and 17 down-regulated (↓)]: ↓ 3HIDH, ↑ SAHH3, ↓ IVD (Amino acid metabolism), ↑ CMBL (Hydrolase), ↓ CISY (Carbon metabolism, 2-Oxocarboxylic acid metabolism, biosynthesis of amino acids, carbohydrate metabolism), ↓ AL1A3 (Carbohydrate metabolism, amino acid metabolism, metabolism of other amino acids, xenobiotics biodegradation and metabolism, chemical carcinogenesis),

Fig. 16.12 Search & Color Pathway result. (a) A list of proteins that were not found is shown at the top. The list of different pathways is also displayed with the number of proteins involved. (b) Two examples of proteins involved in RNA transport and DNA replication processes

↓ AATM (Carbon metabolism, 2-Oxocarboxylic acid metabolism, biosynthesis of amino acids, amino acid metabolism, fat digestion and absorption), ↓ HCDH (Fatty acid metabolism, carbohydrate metabolism, lipid metabolism, amino acid metabolism), ↓ HXK1 (Carbon metabolism, carbohydrate metabolism, biosynthesis of other secondary metabolites, HIF-1 signaling pathway, insulin signaling pathway, carbohydrate digestion and absorption, central carbon metabolism in cancer, endocrine and metabolic diseases), ↓ ACADM (Carbon metabolism, fatty acid metabolism, carbohydrate metabolism, lipid metabolism, amino acid metabolism, metabolism of other amino acids, PPAR signaling pathway), ↑ METK2 (Biosynthesis of amino acids, amino acid metabolism), ↓ MDHM (Carbon metabolism, carbohydrate metabolism, amino acid metabolism), ↓ NDUBA, ↓ NDUS3 (Energy metabolism, neurodegenerative diseases,

Fig. 16.13 Additional information for proteins in KEGG Database. The proteins displayed in each pathway have a link
to additional information: (a) UniProtKB website and (b) KEGG database

endocrine and metabolic diseases), ↓ DHB12 (Fatty acid metabolism, lipid metabolism), ↓ ODPB (Carbon metabolism, carbohydrate metabolism, HIF-1 signaling pathway, glucagon signaling pathway, central carbon metabolism in cancer), ↑ PGAM1 (Carbon metabolism, biosynthesis of amino acids, carbohydrate metabolism, amino acid metabolism, glucagon signaling pathway, central carbon metabolism in cancer), ↓ CYC (Energy metabolism, cellular processes, pathways in cancer, neurodegenerative diseases, cardiovascular diseases, endocrine and metabolic diseases, infectious diseases), ↓ RPN1 (Glycan biosynthesis and metabolism, folding, sorting

Fig. 16.14 Proteins mapped into KEGG PATHWAYS. Polypeptides found up- or down-regulated in both Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines were submitted to KEGG mapping. Some of the processes found to be affected are (a) the RNA transport process and (b) the DNA replication process. Up-regulated proteins are colored in red and down-regulated proteins are in green

and degradation), ↓ NLTP (Lipid metabolism, cellular processes, PPAR signaling pathway), ↓ SPEE (Amino acid metabolism, metabolism of other amino acids), ↑ PYR1 (Nucleotide metabolism, amino acid metabolism). Other mapped pathways were RNA transport, with 5 proteins (↑ IMB1, ↑ RAN, ↑ EIF3B, ↑ EIF3F, ↑ EIF3I) (Fig. 16.14a), and DNA replication, with 4 polypeptides involved (↑ MCM3, ↑ MCM4, ↑ MCM6, ↑ PCNA) (Fig. 16.14b).
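
The input for the Search&Color Pathway tool described in this section, one identifier per line followed by the word red or green, can be prepared programmatically from lists of up- and down-regulated proteins. The following Python sketch is a minimal illustration under that assumption. The helper name `build_search_and_color_input` is hypothetical, and the UniProtKB accessions in the example (intended to stand for PCNA, RAN and MDHM) are included for illustration and should be checked against UniProt before use.

```python
def build_search_and_color_input(up_ids, down_ids,
                                 up_color="red", down_color="green"):
    """Return text with one 'ID color' pair per line.

    up_ids / down_ids: sequences of protein identifiers (e.g. UniProtKB
    accessions) for up- and down-regulated proteins, respectively.
    """
    # A protein cannot be colored both ways, so reject overlapping lists.
    overlap = set(up_ids) & set(down_ids)
    if overlap:
        raise ValueError(f"IDs listed as both up and down: {sorted(overlap)}")
    lines = [f"{protein_id} {up_color}" for protein_id in up_ids]
    lines += [f"{protein_id} {down_color}" for protein_id in down_ids]
    return "\n".join(lines)


# Illustrative accessions only; verify them against UniProt before use.
up_regulated = ["P12004", "P62826"]    # e.g. PCNA, RAN
down_regulated = ["P40926"]            # e.g. MDHM
print(build_search_and_color_input(up_regulated, down_regulated))
```

The resulting text can be pasted into the input window or saved to a file for upload, after selecting the organism (hsa for H. sapiens) and the matching ID format.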

16.6 Ingenuity Pathway Analysis (IPA)

Ingenuity Pathway Analysis (IPA, QIAGEN Redwood City, www.qiagen.com/ingenuity) is a software application platform developed for analysis, understanding, integration and interpretation of biological data [30]. Ingenuity can analyze data acquired using platforms such as microarrays, proteomics, metabolomics, etc. IPA uses QIAGEN’s Ingenuity Knowledge Base, in which contents extracted from articles, biomedical literature, reviews, internally curated knowledge, and other sources are structured into Ontology terms. The information in this platform is categorized into several knowledgebases:

1. Ingenuity expert information, including Ingenuity expert findings and Ingenuity expert assist findings
2. Ingenuity supported third party information, including MicroRNA-mRNA interactions (miRecords, TarBase, TargetScan)
3. Protein-Protein Interactions, including BIND, Cognia, DIP, Interactome studies, MINT, and MIPS
4. Additional sources: An open access database of genome-wide association results, BIOGRID, Breast cancer information core (BIC), Catalogue of somatic mutations in cancer (COSMIC), Chemical Carcinogenesis Research Information System (CCRIS), ClinicalTrials.gov, ClinVar, DrugBank, GO, GVK Biosciences, Hazardous Substances Data Bank (HSDB), HumanCyc, IntAct, miRBase, Mouse Genome Database (MGD), Obesity Gene Map Database, and Online Mendelian Inheritance in Man (OMIM).

The principal components of the IPA suite are:

1. Core Analyze
2. IPA-Tox
3. IPA-Biomarker
4. IPA-Metabolomics (Fig. 16.15)

Fig. 16.15 The main page of the Ingenuity Pathway Analysis suite. All functions are listed in two main tabs, Learning IPA and shortcuts. The shortcut tab contains the dataset and pathway options, as well as different analysis options, including Core, IPA-Tox, IPA-Biomarker and IPA-Metabolomics

Core Analysis consists of classified data sets mapped into biological processes, networks and pathways. The IPA-Tox module includes data classified in the context of toxicological processes. In this tool the toxicity and safety of compounds are evaluated. IPA-Tox keeps track of the biological processes that are related to compound toxicity at various biochemical and molecular levels. The IPA-Biomarker tool is used to identify and prioritize potential biomarker candidates. The selection of these putative biomarkers is based on their biological characteristics. Finally, the fourth application, IPA-Metabolomics, analyzes metabolomics data, which are then contextualized into biological insights (metabolism and cell physiology).

IPA supports several types of identifiers, including Affymetrix, Affymetrix SNP ID, Agilent, CAS registry number, CodeLink, dbSNP, Ensembl, GenBank, Entrez gene, Gene Symbol-mouse, Gene Symbol-rat and Gene Symbol-human (HUGO/HGNC), GenPept, GI number, Human Metabolome Database (HMDB), Illumina, Ingenuity, International Protein Index, KEGG, Life Technologies (Applied Biosystems), miRBase (mature), miRBase (stemloop), PubChem CID, RefSeq, UCSC hg18 and 19, UniGene and UniProtKB/Swiss-Prot accession number. The confidence levels reported by IPA are either experimentally determined or theoretically predicted. Tissues and cell lines covered by IPA include tissue and primary cells from nervous and other organ systems and cell lines from breast cancer, cervical, central nervous system (CNS), colon, hepatoma, immune, kidney, leukemia, lung, lymphoma, macrophage, melanoma, myeloma, neuroblastoma, osteosarcoma, ovarian, pancreatic, prostate and teratocarcinoma model systems. Mutations covered include functional effect, inheritance mode, translation impact, unclassified mutation, zygosity and wild type.

IPA analysis core protocol: To use IPA, a license needs to be purchased, but a trial version can be used for a limited period of time. To perform an analysis in IPA, an analysis dataset first needs to be created (Fig. 16.16). To create an analysis dataset, go to the Annotate datasets

Fig. 16.16 Creation of a dataset with the IPA software. Red rectangles spotlight the basic steps to perform an analysis
for a dataset

option in the IPA window (Fig. 16.15), select the file you wish to analyze and save the file. For illustration purposes, we analyzed proteins differentially expressed in common in Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines from Calderón-González et al. [18]. It is necessary to specify the following information for the data that you wish to analyze:

1. File format: Flexible format
2. Column header: Yes
3. Identifier type: UniProt/Swiss-Prot accession
4. Array platform: In this case, it does not apply

Then the observation names must be edited, specifying the ID of proteins; in our case, the observations were 1 (114:113, MCF7/MCF 10A), 2 (117:113, T47D/MCF 10A), and 3 (115:113, MDA-MB-231/MCF 10A), according to data number. Finally, the quantitative data format must be specified; in our case we chose Exp Ratio (Fig. 16.16).

To carry out IPA Core analyses, we first uploaded the dataset previously created and then specified the parameters according to the goals of our study. The IPA platform gives different options to filter the data. We set the filter parameters for breast cancer as follows:

1. General settings: Ingenuity Knowledge Base (genes only), considering direct and indirect relationships
2. Networks: 25 interaction networks with 35 molecules per interactome, including endogenous chemicals (default parameters)
3. Data sources: All
4. Confidence: All
5. Species: Human with stringent filter
6. Tissues and cell lines: Mammary gland as organ and all breast cancer cell lines of the database
7. Mutations: All

At the end of the page, cutoff values are selected. We focused on up- and down-regulated proteins (Fig. 16.17). The statistical significance was determined by Fisher's Exact Test, for which the p-value cutoff was set at 0.05. As a result of this analysis, we obtained three summary results, one for each observation. Then, we performed a Core Comparison Analysis using the option Core: Compare analysis. The procedure also requires

Fig. 16.17 Core parameters needed for IPA analysis. Figure shows the different parameters that need to be set to perform and delimit a Core Analysis. In this case the analysis was focused on breast cancer

selecting files for comparison. The summary results for all observations are reported in a single file. The Core Analysis result window shows different toolbars:

1. Canonical Pathways (Chart and HeatMap)
2. Upstream Analysis (Table and HeatMap)
3. Diseases & Functions (Chart and HeatMap)
4. Regulator effects (Table)
5. Networks (Networks for each observation or overlapping networks)
6. Molecules (Tables)

We focused our analysis on the canonical pathway results, obtained as a chart (Fig. 16.18a) or a HeatMap (Fig. 16.18b). In both cases, the number of up- and down-regulated proteins and their statistical probability were reported. Some of the

Fig. 16.18 Classification of proteins found up- or down-regulated in both Luminal A and Claudin-Low breast cancer
cell lines into canonical pathways with IPA software. The result can be displayed as (a) Bar chart or (b) Heatmap
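
The p-values attached to each canonical pathway in Fig. 16.18 come from the Fisher's Exact Test mentioned in the text. As a rough illustration of what such an over-representation test computes, the following minimal Python sketch evaluates a right-tailed test from the hypergeometric distribution. The function name and all counts are hypothetical placeholders rather than values from the dataset, and the right-tailed form is an assumption; IPA's internal implementation may differ.

```python
from math import comb


def right_tailed_fisher(overlap, dataset_size, pathway_size, background_size):
    """Probability of drawing at least `overlap` pathway members by chance.

    dataset_size molecules are drawn from background_size molecules,
    of which pathway_size belong to the pathway of interest.
    """
    total = comb(background_size, dataset_size)
    max_overlap = min(dataset_size, pathway_size)
    favorable = sum(
        comb(pathway_size, i)
        * comb(background_size - pathway_size, dataset_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return favorable / total


# Hypothetical counts, chosen only for illustration.
p_value = right_tailed_fisher(
    overlap=6, dataset_size=150, pathway_size=30, background_size=20000
)
print(f"p-value = {p_value:.3g}; below the 0.05 cutoff: {p_value < 0.05}")
```

With a cutoff of 0.05, as used in the analysis described here, a pathway would be retained whenever the computed probability falls below that threshold.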

processes affected were: Fatty acid oxidation I (↓ACADM, ↓ECI1, ↓HADH, ↓IVD, ↓SCP2, ↓SLC27A4 with a p-value of 3.57 × 10⁻⁸), aspartate degradation II (↓GOT2 and ↓MDH2, p-value of 3.78 × 10⁻⁴), cell cycle control of chromosomal replication (↑MCM3, ↑MCM4 and ↑MCM6, p-value 1.01 × 10⁻³), telomere extension by telomerase (↑XRCC5 and ↑XRCC6, p-value 5.44 × 10⁻³), and protein ubiquitination pathway (↑HSP90AB1, ↑PSMA3, ↑PSMC1, ↑PSMD2, ↓PSMD3, and ↑PSMD7, p-value 8.65 × 10⁻³).

Diseases functions are divided into two categories, Diseases and Bio Functions and Tox Functions. We only obtained the first category. We found the affected processes to be:

1. Cell-to-cell signaling and interaction: Formation of focal adhesions (↓CTNND1 and ↑STMN1, p-value 1.30 × 10⁻³)
2. Cellular assembly and organization: Formation of focal adhesions (↓CTNND1 and ↑STMN1, p-value 2.39 × 10⁻²) and polymerization of microtubules (↑STMN1, p-value 2.39 × 10⁻²)
3. Cellular function and maintenance: Formation of focal adhesions (↓CTNND1 and ↑STMN1, p-value 1.30 × 10⁻³) and polymerization of microtubules (↑STMN1, p-value 2.39 × 10⁻²)
4. Cell death and survival: Anoikis (↓CTNND1 and ↑ILK, p-value 3.99 × 10⁻³) and cytotoxicity of breast cancer cell lines (↓RELA, p-value 3.17 × 10⁻²)
5. Drug metabolism: Synthesis and oxidation of tretinoin (↓ALDH1A3, p-value 8.02 × 10⁻³)
6. Cellular development: Epithelial-mesenchymal transition of breast cancer cell lines (↑ILK and ↑STMN1, p-value 4.45 × 10⁻²) among other processes

The interactome data obtained in three separate experiments were processed, resulting in identification of two principal networks related to: (1) Cellular development, cellular growth and proliferation, cellular movement, cell death and survival, and cancer, with a score of 19 and 14 molecules involved (↓ALDH1A3, ↓CTSD, ↓DLG1, ↓EZR, ↑FUS, ↑ILK, ↑KPNB1, ↓MVP, ↓RELA, ↓S100A8, ↑SET, ↓SLC25A5, ↑XRCC5 and ↑XRCC6) (Fig. 16.19a). (2) Cell death and survival, cellular development, DNA replication, recombination and repair, cancer and hereditary disorder, with 12 proteins (↑ABCF2, ↑CAD, ↓CTNND1, ↓CYCS, ↑HSP90AB1, ↓LGALS3BP, ↑MAT2A, ↑MCM6, ↑MSH6, ↑NUMA1, ↑PCNA, ↑SNRPG) and a score of 15 (Fig. 16.19b). Proteins in red and green represent the up- and down-regulated proteins, respectively. Small molecules are shown in gray to highlight their relationship with our proteins. Created networks can be exported to IPA pathway for subcellular localization and decoration of the network with organelles and backgrounds.

16.7 Biomarkers Module

To perform biomarker filtration, we used the Biomarkers module. As a first step in using the Biomarkers module, we selected the analysis dataset function and chose a dataset created previously. Next we chose the following parameters:

1. Species: Human
2. Tissues and cell lines: mammary gland as organ and breast cancer cell lines
3. Molecules: All
4. Diseases: Cancer
5. Biofluids: All
6. Biomarkers: All biomarker applications (diagnosis, disease progression, efficacy, not applicable, prognosis, response to therapy, safety and unspecified application) and breast disease (breast cancer, breast carcinoma, ductal carcinoma, ductal carcinoma in situ, infiltrating ductal breast carcinoma, infiltrating lobular breast carcinoma, invasive ductal breast cancer, lobular breast cancer, mammary neoplasm, metastatic breast cancer) (Fig. 16.20a).

We then ran the analysis, saved the results, and performed a comparative analysis on our

Fig. 16.19 IPA Networks of proteins found up- or down-regulated in both Luminal A and Claudin-Low breast cancer cell lines. The up- and down-regulated proteins are represented by molecules in red and green color, respectively. (a) Interactome related to cellular development, cellular growth and proliferation, cellular movement, cell death and survival, and cancer. (b) Interactome involved in cell death and survival, cellular development, DNA replication, recombination and repair, cancer and hereditary disorder

datasets. In this analysis, we had three datasets to compare (Fig. 16.20b) and only considered proteins found in all three datasets. We found four candidate biomarkers common between the Luminal A and Claudin-low cells falling into different biomarker application categories: unspecified application (↑KHSRP protein found in nucleus and ↓S100A8 with cytoplasmic localization), diagnosis, efficacy (↓RELA localized in nucleus and ↑STMN1 found in cytoplasm). RELA was also found related to the drug NF-kappa B decoy (Fig. 16.21). All proteins were found in blood and all are related to cancer; however, they are not unique to this disease, as they are found in other diseases.

16.8 Protein-Protein Interactions Databases

16.8.1 STRING

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database of

Fig. 16.20 Filter parameters for biomarker analysis in IPA software. (a) Creating a filter for putative biomarkers. (b)
Comparison analysis between all observations (MCF7, T47D and MDA-MB-231)

known and predicted protein interactions [31]. This database was developed by the Center for Protein Research (CPR), The European Molecular Biology Laboratory (EMBL), The Swiss Institute of Bioinformatics (SIB), The University of Copenhagen (KU), The Technische Universität Dresden (TUD), and The Universität Zürich (UZH). STRING version 10.0 has 9,643,763 proteins from 2031 organisms. The main objective of this database is to integrate, predict and unify several protein-protein interactions [31, 32]. Associations between proteins can be physical (direct) or functional (indirect). The functional associations are defined as the interaction between two proteins that participate or contribute in the same cellular process or metabolic pathway, as well as other functional processes [32–34].

Fig. 16.21 Result of biomarker filter. Figure shows the four biomarkers common to Luminal A and Claudin-low breast cancer cell lines
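
The comparison step that produces Fig. 16.21 keeps only the candidates present in every dataset, which amounts to a set intersection. The short Python sketch below illustrates that logic only. The four shared entries are the candidates reported in the text, whereas the `PLACEHOLDER_*` entries and the function name are hypothetical and stand in for candidates that would appear in a single cell line.

```python
def common_candidates(datasets):
    """Return the candidates present in every dataset, sorted by name."""
    candidate_sets = [set(candidates) for candidates in datasets.values()]
    return sorted(set.intersection(*candidate_sets))


# Shared entries come from the text; PLACEHOLDER_* entries are hypothetical.
biomarker_hits = {
    "MCF7": ["KHSRP", "S100A8", "RELA", "STMN1", "PLACEHOLDER_A"],
    "T47D": ["KHSRP", "S100A8", "RELA", "STMN1", "PLACEHOLDER_B"],
    "MDA-MB-231": ["KHSRP", "S100A8", "RELA", "STMN1", "PLACEHOLDER_C"],
}

shared = common_candidates(biomarker_hits)
print(shared)
```

Only the entries found in all three observations survive, mirroring the restriction to proteins found in all three datasets.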

The STRING database uses the following types of information to predict possible interactions:

1. Genomic data
2. High throughput experiments
3. Co-expression
4. Data extracted from literature

STRING imports knowledge about protein-protein interactions from other databases such as IntAct, MINT, BioGRID, Reactome, KEGG, BIND, HPRD, DIP, NCI-Nature Pathway Interaction, GO, and EcoCyc [33]. In addition, STRING has a large collection of predicted interactions that are produced de novo using prediction algorithms [33, 35]. De novo predictions are made using genomic context such as conserved genomic neighborhood, gene fusion events, and co-occurrence of genes across genomes [34]. STRING also searches for genes with similar transcriptional responses across a variety of conditions (co-expression) [33]. Literature is another source of protein association information. In this case, STRING obtains information directly from all abstracts in the PubMed database [36]. Finally, STRING assigns a probabilistic confidence score to all associations, obtained by comparing the association predictions against a reference database. STRING uses the KEGG database because it is manually curated [32, 37].

The STRING website is composed of two components: the first deals with protein analysis and the second covers the platforms (Fig. 16.22). The results window displays the networks of protein-protein associations. The resulting interactome is represented by connecting lines. Each of these lines represents a different type of evidence. Networks can be viewed in three forms:

1. Evidence view, in which connections are color coded as follows: neighborhood (green), gene fusion (red), co-occurrence (blue), co-expression (black), experiments (purple), database (light blue), text mining (yellow), and homology (gray)
2. Confidence view, in which the thickness of connecting lines correlates with the strength of the associations
3. Interaction view, in which the type of interaction is color coded as follows: activation (brilliant green), inhibition (red), binding (blue), phenotype (brilliant blue), catalysis (purple), posttranslational modifications (lilac), reaction (black) and expression (olive green)


Fig. 16.22 STRING window view. The STRING webpage has different options to perform interaction analysis. The search can be done by the name of the protein or a protein sequence. The analysis can be performed for multiple proteins in the same way. In addition, the main page has various tabs with information about this platform
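
The search shown in Fig. 16.22 can also be expressed as a programmatic request. The sketch below only builds a request URL and does not contact the server. It assumes the REST layout of the current public STRING service (`/api/<format>/network` with `identifiers`, `species` and `required_score` parameters), which may differ from the version 10.0 interface described in this chapter; the function name is hypothetical, and 9606 is the NCBI taxonomy identifier for human.

```python
from urllib.parse import urlencode

# Assumed base URL of the current public STRING REST service.
STRING_API = "https://string-db.org/api"


def string_network_url(identifiers, species=9606, required_score=900,
                       output_format="tsv"):
    """Build a network query URL; scores are on a 0-1000 scale (900 = 0.900)."""
    params = {
        # Multiple identifiers are separated by carriage returns.
        "identifiers": "\r".join(identifiers),
        "species": species,
        "required_score": required_score,
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


url = string_network_url(["NUDC", "ZW10"])
print(url)
```

The default `required_score` of 900 corresponds to the highest confidence setting (0.900) in the web interface.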

STRING also has an interactive view. In this option the network can be reordered by moving the proteins in the network. In the advanced option, the network can be enriched into GO Biological Processes, GO Molecular Functions, GO Cellular Components, KEGG Pathways, PFAM domains, INTERPRO domains, and Protein-Protein interactions. For each enrichment category, a new window is displayed containing a list of the different processes, the number of proteins involved, and a p-value.

16.8.2 Protein-Protein Interaction Networks

To determine the protein-protein interactions of the overexpressed NUDC protein, found exclusively in the Claudin-low breast cancer cell line [18], we accessed the STRING website http://string-db.org/.

To generate a network of protein interactions, a list (one or more) of protein names, accession numbers, or sequences, as well as the organism or species they originated from, needs to be specified (Fig. 16.22). At the bottom of the result window there is a parameter box. The options in the parameter box are used to select the active prediction algorithm. The confidence score and the number of interactors can also be adjusted (Fig. 16.23). The interactome can be seen according to evidence (Fig. 16.24a), confidence (Fig. 16.24b) and action (Fig. 16.24c). In each network, a score is generated according to each protein's interaction evidence. In addition, a brief description for each protein is also displayed (Fig. 16.24). NUDC protein is associated with PAFAH1B1

Fig. 16.23 STRING results view. A window containing different parameters is shown at the bottom. The active
prediction methods as well as the confidence of the interactions in the network can be selected in this window

(platelet-activating factor acetylhydrolase 1b), PLK1 (polo-like kinase 1), NDEL1 (nudE nuclear distribution E homolog (A. nidulans)-like 1), HSP90AA1 (heat shock protein 90 kDa alpha), BTRC (beta-transducin repeat containing E3 ubiquitin protein ligase), NDE1 (nudE nuclear distribution E homolog 1 (A. nidulans)), ZW10 (ZW10, kinetochore associated, homolog (Drosophila)), FBXW11 (F-box and WD repeat domain containing 11), CLIP1 (CAP-GLY domain containing linker protein 1) and ZWILCH (Zwilch, kinetochore associated, homolog (Drosophila)). All interactions have a score above 0.90. In

Fig. 16.24 Interaction network of NUDC protein. This polypeptide is overexpressed exclusively in Claudin-low breast cancer cell line. The interactome can be seen in three options. (a) Evidence view, where the color lines represent the diverse evidences of interactions: Green, neighborhood; red, gene fusion; blue, co-occurrence; black, co-expression; purple, experiments; light blue, database; yellow, text mining; gray, homology. (b) Confidence view, where thicker lines represent stronger associations. (c) Interaction view, where the different modes of action are represented by different colors. Brilliant green, activation; red, inhibition; blue, binding; brilliant blue, phenotype; purple, catalysis; lilac, PTMs; black, reaction; olive green, expression. The three view modes provide a score of the different evidence of interaction

addition, the network was enriched into GO Biological Processes. Processes that showed enrichment with statistical significance were:

1. Mitotic prometaphase (4.940 × 10⁻¹³)
2. Mitotic anaphase (8.089 × 10⁻¹²)
3. Mitotic M phase (6.309 × 10⁻¹¹)
4. M phase (6.309 × 10⁻¹¹)
5. Mitotic cell cycle phase (4.300 × 10⁻¹⁰)
6. Cell cycle phase (4.300 × 10⁻¹⁰)

All processes mentioned above have at least eight proteins involved. We selected the cell cycle phase process as an example. The proteins enriched in this process are shown in red (Fig. 16.25a). We selected the interacting proteins NUDC and ZW10 as examples to extract interaction information. ZW10 was selected because it is an essential component of the mitotic checkpoint that prevents cells from prematurely exiting mitosis. The evidence supporting the functional link between these two proteins is the following:

1. Co-expression (putative homologs are co-expressed in other species, score 0.065)
2. Association in curated database (score 0.900)
3. Co-mentioned in PubMed abstracts (score 0.285)

Also, putative homologs are mentioned together in other species (score 0.192). The combined score is 0.938. There is also activity evidence, such as catalysis (score 0.900), binding (score 0.900) and reaction (score 0.900), that supports the interaction between these two proteins (Fig. 16.25b). For proteins selected in a network, STRING displays a window with information about their 3D structure, as well as links to Ensembl, GeneCards, KEGG, Nextprot and UniProt. Also, STRING can show the protein sequence and the sequence of its homologs in organisms stored in STRING. NUDC has three 3D structures obtained from the Protein Data Bank (PDB) (Fig. 16.25c). As mentioned above, STRING can perform network analysis for multiple proteins as well. We performed an interactome analysis for the up- and down-regulated proteins common in Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines [18]. In this case, we used the highest confidence (0.900) possible to generate our interaction network. The network has several interaction nodes related to:

1. Energy metabolism
2. Translation
3. Proteasome
4. Replication and repair
5. Transcription

Red and green arrows indicate up- and down-regulated proteins, respectively (Fig. 16.26).

16.8.3 MINT

The Molecular INTeraction database, or MINT, is an open source database of experimentally verified protein-protein interactions developed at the Università degli Studi di Roma Tor Vergata [38, 39]. The webpage can be found at http://mint.bio.uniroma2.it/mint/Welcome.do (Fig. 16.27). The current version of the MINT database (November 2015) contains 241,458 interactions, corresponding to 35,553 proteins and 5554 PMIDs (PubMed unique identifiers). Species included are Drosophila melanogaster, Saccharomyces cerevisiae, Caenorhabditis elegans, mammals and viruses, with mammal databases being the main datasets. Evidence for protein-protein interactions includes association studies, co-localization, direct interactions, interactions in the form of complexes, enzymatic reactions, and high throughput studies. Protein-protein interactions have been identified by a number of methods, including co-immunoprecipitation with either anti-bait or anti-tag antibodies, fluorescence microscopy, peptide arrays, protein arrays, pull down experiments, SPR, tandem affinity isolation, two hybrid arrays, two hybrid pooling, and two hybrid systems. Additionally, the MINT database is freely available for academic and commercial users.

Fig. 16.25 Interaction network of NUDC overexpressed protein found exclusively in Claudin-low breast cancer cell line. STRING platform provides different information for the generated network. (a) Network enrichment for GO Biological Processes. The proteins in red, which have a statistical significance (p-value), are involved in cell cycle phase. (b) Evidence supporting interaction between NUDC and ZW10. (c) 3D protein structure information
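
The individual evidence scores reported for the NUDC and ZW10 link (Fig. 16.25b: 0.065, 0.900, 0.285 and 0.192) can be related to the combined score of 0.938 with a short calculation. The Python sketch below is a simplified illustration and rests on two assumptions that the chapter does not state: that channels are combined as independent probabilities, and that a prior of 0.041 is removed from each channel and added back once at the end. The function and constant names are hypothetical, and STRING's own procedure includes further steps (for example homology corrections) that are ignored here.

```python
ASSUMED_PRIOR = 0.041  # assumption: prior probability of a random association


def combine_scores(channel_scores, prior=ASSUMED_PRIOR):
    """Combine per-channel scores as independent evidence, correcting for a prior."""
    prob_no_support = 1.0
    for score in channel_scores:
        # Remove the prior from each channel before combining.
        corrected = max(score - prior, 0.0) / (1.0 - prior)
        prob_no_support *= 1.0 - corrected
    combined = 1.0 - prob_no_support
    # Add the prior back once at the end.
    return combined * (1.0 - prior) + prior


# Channel scores reported for NUDC-ZW10 in the text.
nudc_zw10 = [0.065, 0.900, 0.285, 0.192]
combined_score = combine_scores(nudc_zw10)
print(f"combined score = {combined_score:.3f}")
```

Under these assumptions the result lies within about 0.001 of the reported 0.938, which suggests that the combined score behaves like a probabilistic union of the separate evidence channels rather than a simple sum or maximum.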

There are three additional databases available via the MINT website: HomoMINT, Domino, and VirusMINT. The first one is an inferred network for human; the second is specialized in domain-peptide interactions, and the last is a protein-protein interaction database specialized on viruses.

Protein interaction searches in MINT database (Fig. 16.28a) can be carried out using PubMed ID, D.O.I, or author's name. Alternatively, this

Fig. 16.26 STRING interaction network of proteins found up- or down-regulated in both Luminal A (MCF7 and T47D) and Claudin-low (MDA-MB-231) breast cancer cell lines. This list has interaction nodes related to: (1) Energy metabolism, (2) Translation, (3) Proteasome degradation, (4) Replication and repair, (5) Transcription. Colored lines represent different evidence of interaction: Green, neighborhood; red, gene fusion; blue, co-occurrence; black, co-expression; purple, experiments; light blue, database; yellow, text-mining; gray, homology. Red arrows indicate up-regulation and green arrows down-regulation. A box with information about some proteins is also shown

Fig. 16.27 Homepage of the Molecular INTeraction database, MINT

database can be searched against protein or gene name, protein accession number (Protein AN) or keywords. Protein accession numbers recognized by MINT search engine are FlyBase, Ensembl, Human Identified Gene Encoded Large Protein Analyzed database (HUGE), Nematode database (WormBase), OMIM, REACTOME pathway database, the Saccharomyces Genome Database (SGD), and Universal Protein Resource Knowledgebase (UniProtKB).

To demonstrate how MINT database works, we selected the vesicle-fusing ATPase NSF (P46459) for analysis. This protein is part of a set of proteins that were found overexpressed in several breast cancer cell lines [18]. To follow our analysis, click on the Search tab and type P46459 (Fig. 16.28, arrow 1), then select the organism (Fig. 16.28, arrow 2), and then press the Search key (Fig. 16.28, arrow 3). Results show certain information for the queried protein, including its ID, species, synonyms, domains found in query, a link to its role in diseases, its gene ontology, references covering the target protein, prediction of its modular domain interactions (ADAN), and its orthologs in MINT database (Fig. 16.28). Results also display a window containing a list of molecules interacting with the target according to MINT database, evidence for each interaction and a global score for each interaction (Fig. 16.28).

Fig. 16.28 MINT search webpage. (a) Search in MINT can be performed using: (1) Gene or protein name, Protein ID or keywords and the species of interest or the whole database, (2) Protein sequence in FASTA format, (3) a list of proteins. (b, c) Result of a query for vesicle-fusing ATPase NSF from Homo sapiens (UniProtKB/Swiss-Prot ID P46459). (c) List of NSF interactors

Clicking on the MINT viewer will generate a list of interactions that are displayed as a function of score threshold. For each partner, a number showing evidence for interaction is shown (Fig. 16.29). As an example, we clicked on number 4 and a new window appeared showing the partner name, ID, and techniques used to determine the interaction, as well as a PubMed identifier containing this information (Fig. 16.29).

Fig. 16.29 Binary interactions of the N-ethylmaleimide-sensitive fusion protein NSF viewed in MINT database. (a) Basic information queried for NSF. (b) Binary interaction map of NSF with 15 interactors found in MINT database. (c) Selecting number 4 in (b), a new window is displayed showing the name of the corresponding interactor (GABBR2, Gamma-aminobutyric acid type B receptor subunit 2) and the experimental methods used to determine this interaction, as well as the PMID for the publication describing it

16.8.4 IntAct

IntAct is a database of protein-protein interactions, as well as a suite of analytical tools, at The European Bioinformatics Institute (EBI), which is part of the European Molecular Biology Laboratory (EMBL) [40, 41]. All information has been curated by experts at the IntAct team.

This freely available database can be accessed through its webpage http://www.ebi.ac.uk/intact/.

As of November 26th, 2015 this database had registered 355,819 interactions, which included 89,340 interactors (proteins) described in 36,864 experiments, 13,892 PMIDs, and 564,831 binary interactions. Methods used for the determination of protein-protein interactions include tandem affinity purification, anti-tag co-immunoprecipitation, two hybrid systems, pull down experiments, two hybrid arrays, anti-bait co-immunoprecipitation, two hybrid pooling approach, and co-sedimentation, among others. The information mainly comes from human (42.5 %), various S. cerevisiae strains (22.8 %), Mus musculus (11.3 %), and D. melanogaster (8.1 %). Other species included are Escherichia coli, C. elegans, A. thaliana, Campylobacter jejuni, etc. MINT and IntAct databases have recently joined their individual efforts to optimize resources as the MIntAct project, thus avoiding duplication of activities [42].

The IntAct model has three main components: interactions, interactors, and experiments used to determine interactions. Protein interactions are inferred using scientific publications, including binary interactions or complexes. An interactor can be defined as a biological molecule (mainly a protein) involved in a specific interaction. An interaction is not circumscribed to binary interactions only; it also includes interactions with more partners identified in the experiment performed, e.g. precipitation of multi-protein complexes. Searches in the IntAct database can be performed in different ways, including name of gene, protein, RNA or chemical compound, or UniProtKB, ChEBI (Chemical Entities of Biological Interest), RNA Central, PMID or IMEx (International Molecular Exchange) IDs. The principal page of IntAct (Fig. 16.30) contains links to other websites that might be of interest. These sites include MINT, UniProtKB, The Swiss Institute of Bioinformatics (SIB), The Interologous Interaction Database (I2D), The Innate Immune Response Database (Innate Database), Molecular Connections, The Extracellular Matrix Interactions Database (MatrixDB), The Modular Approach to Cellular Functions Resource (MB Info), a curated resource for functional analysis of agricultural plant and animal gene products (AgBase), and The cardiovascular Gene Annotation database at the London's Global University (UCL).

As an example of the function of IntAct, we selected the protein XRCC6 (X-ray repair cross-complementing protein 6, UniProtKB ID P12956), which was found overexpressed in both Luminal A and MDA-MB-231 breast cancer cell lines [18]. This protein is a single-stranded DNA-dependent and ATP-dependent 3′–5′ DNA helicase involved in DNA non-homologous end joining (NHEJ) required for double-strand break repair and V(D)J recombination. To reproduce our analysis, in the search window (Fig. 16.30) type XRCC6 or the ID P12956 and push the search key. A new window will appear on screen with the results for your query (Fig. 16.31). To date, 324 binary interactions have been found for the XRCC6 protein. These interactions are displayed as a table, where molecule A is your query or bait, and B molecules are proteins interacting with your query. For each interaction, the interaction detection methods, their corresponding IDs, and the source database are shown. When you click on the interactors tab, a new page will be shown containing a list of all interactors, showing the type of interactor, the number of interactions described, a link to access the description in UniProtKB, and a description of the interaction (Fig. 16.32). More information, including interactions described, the

Fig. 16.30 Homepage of the IntAct Molecular Interaction Database

chromosome location in the Ensembl webpage, the mRNA expression for the interactor in the Expression Atlas webpage, and pathways, is displayed when interactors are searched separately. The map of interactions for your query can be displayed in three layouts: force directed (Fig. 16.33), radial (Fig. 16.34) or circle (Fig. 16.35). In all cases, you can zoom in the graph with the tool window at the bottom.

Search can also be performed for a list of identifiers. The result will be more complex as all interactions for each member of your list will be shown. As an example, we only show the graph for ten proteins overexpressed in Luminal A and MDA-MB-231 breast cancer cell lines [18], where a total of 1101 binary interactions were found in the database (Figs. 16.36, 16.37 and 16.38).

16.8.5 HPRD

The Human Protein Reference Database (HPRD) is a free web resource containing information on human proteins, including an information summary for each protein, their PTMs, protein-protein interactions, expression levels in tissues, mRNA and protein sequences, non-protein interactions, alternate names, participation in diseases, and domains found in proteins. All the information stored in this database is curated by a group of expert biologists from the Pandey Lab at Johns Hopkins University and the Institute of Bioinformatics in Bangalore, India [43]. The current version of HPRD is 9. It contains information for 30,047 proteins, 41,327 protein-protein interactions, 93,710 PTMs, 112,158

Fig. 16.31 List of binary interactions found for XRCC6 (the X-ray repair cross-complementing protein 6 from Homo sapiens, UniProtKB/Swiss-Prot ID P12956) in IntAct database. A total of 324 interactions were found for this protein

sites of protein expression, 22,490 sites of intracellular localization, 470 domains, and 453,521 PMIDs. In addition, two other applications have been recently added, the PhosphoMotif Finder and NetPath resources, which allow the identification of phosphorylation motifs for known kinases/phosphatases and binding motifs for phospho serine/threonine or phospho tyrosine in a compendium of signaling pathways in humans [43].

To perform a search, click on the Query key, type your query and push the Search button on the upper left part on screen (Fig. 16.39, arrow). There are several options for a query, including Protein Name, Accession Number (RefSeq, GenBank, OMIM, UniProtKB and Entrez Gene Name), HPRD identifier, Gene Symbol, Chromosome locus, Molecular Class (e.g. Nuclease, Serine Proteinase, Translation Regulatory protein, Glycosylase, etc.), PTMs (e.g. ADP

Fig. 16.32 List of binary interactions found for XRCC6 (the X-ray repair cross-complementing protein 6 from Homo sapiens, UniProtKB/Swiss-Prot ID P12956) in IntAct database. There are 150 proteins, three chemical compounds (XAV939, 15-deoxy-Delta(12,14)-prostaglandin J2 and Midostaurin), 26 nucleic acid molecules, and four genes (Klk3, kallikrein-related peptidase 3 encoding gene; Tmps2, Transmembrane protease serine 2). Here only a list of 20 protein interactors is shown

Ribosylation, Glycation, Nitration, Sumoylation, Ubiquitination), Cellular Component, Domain Name, Motif, Expression Site, Length of Protein sequence, Molecular Mass, and Diseases (Fig. 16.40). To present an example, we searched NUMA1. Results are shown in Fig. 16.41. Information retrieved includes the name of protein (NUMA1 corresponds to the Nuclear mitotic apparatus protein 1, isoform 1), Molecular Class (Structural protein), Molecular Function (Structural molecule activity), and Biological Process (Cell growth and/or maintenance). Seven additional tabs are provided, which are Summary, Sequence, Interactions, External Links, Alternate Names, Diseases, PTMs, and Substrates. The General tab contains the
Fig. 16.33 Force-directed layout of the interaction map found for XRCC6 in IntAct database. XRCC6 protein is at the
center of the map

Fig. 16.34 Radial layout of the interaction map found for XRCC6 in IntAct database. XRCC6 protein query is at the
center of the map

Fig. 16.35 Circle layout of the interaction map found for XRCC6 in IntAct database. XRCC6 protein query is located
at the top of the map

corresponding HPRD ID 01236, Gene symbol NUMA1, Molecular Weight 238259 Da, Chromosome location 11q13, intracellular localization, domains and motifs, and sites of tissue gene expression (Fig. 16.41). The sequence of NUMA1 and its corresponding mRNA are obtained by clicking on the Sequence tab (Fig. 16.42). A list of proteins that interact with NUMA1, and types of experiment and interactions (direct or in a complex) are shown in Fig. 16.43.

Alternatively, it is possible to search HPRD by browsing Molecule Class, Domains, Motifs, PTMs, and Localization by pushing the Browse key on the right of the main webpage (Fig. 16.39). Furthermore, access to Human Proteinpedia, Pathways, PhosphoMotif Finder, or downloading the complete HPRD is possible using the main menu.

16.8.6 BioGRID

The Biological General Repository for Interaction Datasets (BioGRID, http://thebiogrid.org), like many other protein-protein interaction databases, has as its main goals to curate and organize interaction data and to make them freely available. The funding partners of this important database are the National Institutes of Health (NIH), the
Fig. 16.36 Interaction map found for PSA3, SYWC, MCM4, SMAP, DDB1, EIF3, PYR1, MCM3, SSRP1 and METK2 proteins in IntAct database. Force directed layout of the network showing many more interactions that are contained in the IntAct database

Canadian Institutes of Health Research (CIHR), Genome Canada, and GenomeQuébec. Many other institutions have joined efforts with BioGRID, including the Université de Montréal, Princeton University, Mount Sinai Hospital, University of Edinburgh, SGD, FlyBase, GeneDB, NCBI, WormBase, MaizeGDB, MINT, IntAct, String, MatrixDB, SIB, GO, UniProt, Reactome, Cytoscape, and many others that can be found in the BioGRID webpage. The current version of BioGRID database (3.4.131, December 2015) has information for several model organisms, including A. thaliana, C. elegans, Candida albicans, Danio rerio, Dictyostelium discoideum, D. melanogaster, H. sapiens, Mus musculus, Neurospora crassa, Plasmodium falciparum, S. cerevisiae, Schizosaccharomyces pombe, Xenopus laevis,

Fig. 16.37 Radial layout of the network found for PSA3, SYWC, MCM4, SMAP, DDB1, EIF3, PYR1, MCM3,
SSRP1 and METK2 proteins in IntAct database

among other eukaryotic organisms. Furthermore, it has information on prokaryotes, such as B. subtilis, E. coli, Mycobacterium tuberculosis, and Streptococcus pneumoniae. Some viruses are included as well, e.g. Hepatitis C virus, Human Herpesvirus, Human Immunodeficiency virus, and Human Papillomavirus type 16 [44–46]. In its current version, the BioGRID database contains 749,213 non-redundant interactions, corresponding to 63,026 gene products and 45,623 unique publications. BioGRID database also includes 11,329 non-redundant interactions between 4851 unique chemical compounds and 2464 gene products accumulated from 8875 scientific publications. BioGRID also contains PTM information. A total of 19,981 PTMs corresponding to 18,578 unassigned sites, 3165 unique proteins, and 14,999 genes retrieved from 4317 publications are stored in this database.
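The notion of a non-redundant interaction can be illustrated with a short sketch. It assumes, for illustration only, that a raw record is a tuple of two gene symbols, a method label and a publication label, and that two records are redundant when they report the same unordered pair of genes; BioGRID's own counting rules may differ. The records, labels and function names below are hypothetical.

```python
from collections import Counter


def evidence_per_pair(records):
    """Count how many raw records support each unordered gene pair.

    Each record is (gene_a, gene_b, method, publication). The pair key is
    sorted, so (A, B) and (B, A) collapse into one non-redundant interaction.
    This definition of redundancy is an assumption made for illustration.
    """
    counts = Counter()
    for gene_a, gene_b, _method, _publication in records:
        counts[tuple(sorted((gene_a, gene_b)))] += 1
    return counts


def with_min_evidence(counts, minimum):
    """Keep only the pairs supported by at least `minimum` raw records."""
    return {pair: n for pair, n in counts.items() if n >= minimum}


# Made-up records: method and publication labels are placeholders.
raw_records = [
    ("MCM6", "MCM2", "method-1", "publication-1"),
    ("MCM2", "MCM6", "method-2", "publication-2"),
    ("MCM6", "CDT1", "method-1", "publication-3"),
]

pair_counts = evidence_per_pair(raw_records)
print(len(raw_records), "raw records,", len(pair_counts), "non-redundant interactions")
print(with_min_evidence(pair_counts, 2))
```

With these three made-up records the sketch reports two non-redundant interactions, and only the MCM2/MCM6 pair remains when at least two supporting records are required.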

Fig. 16.38 Circle layout of the interaction map found for PSA3, SYWC, MCM4, SMAP, DDB1, EIF3, PYR1, MCM3,
SSRP1 and METK2 proteins in IntAct database

To perform a search in BioGRID database, type your query (gene name, identifier or keywords) in the gene search window and select the species (Fig. 16.44). It is important to note that only one protein at a time can be searched. Alternatively, searches can be done by PubMed publication. However, searching of Multiple Genes or Publications will be available soon. As an example of a search, we selected the MCM6 protein, which was found overexpressed in both Luminal A and MDA-MB-231 breast cancer cell lines [18]. Results indicate that MCM6, the Minichromosome maintenance complex component 6, is involved in four GO Biological Processes:

1. DNA replication
2. DNA strand elongation involved in DNA replication
3. G1/S transition of mitotic cell cycle
4. Mitotic cell cycle

Fig. 16.39 Homepage of the Human Protein Reference Database HPRD

Fig. 16.40 Query webpage of the Human Protein Reference Database HPRD

Fig. 16.41 HPRD query result for the Nuclear Mitotic Apparatus Protein 1, NUMA1. This screenshot shows a putative PTM map as well as a summary for NUMA1 indicating the chromosome localization, subcellular localization, domains, and tissues where the protein is expressed

This protein is also involved in four GO Functions:

1. ATP binding
2. ATP-dependent DNA helicase activity
3. Identical protein binding
4. Protein binding

MCM6 is also part of three GO Components:

1. MCM complex
2. Nucleoplasm
3. Nucleus (Fig. 16.45, arrows 1–3)

In order of significance according to the number of physical interactions, MCM6 has 82 interactors, among them MCM2, MCM4, MCM7, MCM10, MCMBP, MCM3, CDT1, TONSL, MCM5, HIST1H4A, SSRP1, ASF1B, CDKN2A, ASF1A, MMS22L, and ING5 (Fig. 16.45). When the interactions option is selected, a list of 142 interactions is displayed on screen, indicating the name of interactor, its role in the interaction, name of the species, code for the experimental evidence, source of the dataset, whether the interaction is from high- or low-throughput screening experiments, a

Fig. 16.42 Protein and DNA sequences for NUMA1 in HPRD

score for each interaction, the name of the person who curated the information, and additional notes (Fig. 16.46). When the Network tab is selected, three different layouts can be obtained: Concentric circles (Fig. 16.47), Single circle (Fig. 16.48), and Grid (Fig. 16.49). If the number of minimum evidence is changed to five for example, the number of interactions will drop (Fig. 16.50), thus reducing the complexity of the interaction map. When the PTM sites tab is selected, the amino acid sequence of the query is displayed and those residues with an identified PTM are highlighted in blue. Additional information such as the type of modification indicated as well as the source of information are also provided if PTM option is selected (Fig. 16.51). In the case of MCM6, there are 35 Lysine residues marked as ubiquitinated and two additional non-assigned PTMs (neddylation and sumoylation) (Fig. 16.52).

16.8.7 PIPs

The Human Protein-Protein Interaction Prediction (PIPs) is a specialized database containing a

Fig. 16.43 List of protein interactors of NUMA1 queried in HPRD

catalogue of predicted human protein-protein interactions that have been probabilistically determined using a Bayesian model, which takes into account several modules: Expression, Orthology, Localization, Domain co-occurrence, PTMs co-occurrence, Disorder, and Transitive. Expression considers information from a number of gene expression profiles. Orthology uses the interactions that have been determined for orthologues from fly, human, worm and yeast. Localization is determined by using a human subcellular localization predictor (PSLT) in different subcellular compartments. Domain co-occurrence uses the information stored in InterPro (Protein sequence analysis and classification, http://www.ebi.ac.uk/interpro) and Pfam (Protein families, http://pfam.xfam.org) protein domain databases. PTM co-occurrence uses the information contained in HPRD and UniProtKB. Disorder refers to the prediction of intrinsic protein disorder obtained with the VSL2 predictor. Finally, Transitive is a module which involves the local topology of networks, considering all modules described above [47].

PIPs database is located at the University of Dundee and the current version (December 2015) contains 37,606 interactions with a score > 1.0, indicating a high probability of occurrence. To

Fig. 16.44 Homepage of the Biological General Repository for Interaction Datasets, BioGRID

perform a search, an ID in IPI, RefSeq or UniProtKB format must be entered in the search window. As an example, when TBP was used to initiate a query, results were displayed in several boxes, each containing a number of interactions with a certain score. In this case, there were 65 interactions when a score value ≥ 1.0 was selected. For score values equal to or larger than 2.5, 12.5, 25, 250, and 2500, there were 33, 15, 13, 7, and 3 interactions, respectively. When the number of interactions for a score ≥ 1.0 is selected, a list of interactors and the scores for each module used will be displayed on the screen.
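To make the scoring concrete, the following sketch shows one way a naive Bayesian combination and the score cut-offs could look. It assumes that each module contributes an independent likelihood ratio and that the final score is the prior odds multiplied by all of them, so that a score above 1.0 means an interaction is more likely than not. The module values, the prior odds, the example scores and the function names are hypothetical; the actual PIPs model [47] may differ in its details.

```python
from math import prod


def interaction_score(prior_odds, likelihood_ratios):
    """Posterior odds under a naive Bayes assumption.

    The prior odds are multiplied by the likelihood ratio of every module,
    treating the modules as independent sources of evidence.
    """
    return prior_odds * prod(likelihood_ratios.values())


def count_by_cutoff(scores, cutoffs):
    """For each cut-off, count the scores greater than or equal to it."""
    return {cutoff: sum(score >= cutoff for score in scores) for cutoff in cutoffs}


# Hypothetical likelihood ratios for one protein pair (not PIPs data).
modules = {"expression": 10.0, "orthology": 50.0, "localization": 4.0}
score = interaction_score(prior_odds=0.001, likelihood_ratios=modules)

# Hypothetical scores for the predicted partners of one query protein.
example_scores = [0.5, 1.0, 2.4, 2.5, 13.0, 30.0, 300.0, 3000.0]
cutoffs = (1.0, 2.5, 12.5, 25, 250, 2500)

print(round(score, 3))
print(count_by_cutoff(example_scores, cutoffs))
```

Because every cut-off counts all scores at or above it, the counts can only decrease as the cut-off rises, which is the pattern seen in the boxes returned for TBP.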

Fig. 16.45 Result summary for the Minichromosome Maintenance Complex Component 6, MCM6, queried in
BioGRID. A total of 82 interactors were found in database

16.8.8 MPIDB

The Microbial Protein Interaction Database (MPIDB) at the Craig Venter Institute (http://jcvi.org/mpidb/about.php) is a database whose main goal is to gather information for all known protein interactions from microbial organisms [48]. The current version of MPIDB is 2009-11-18 and contains 24,295 interactions that have been experimentally determined for 250 species of bacteria. This number of interactions corresponds to 7810 proteins and 24,295 interactors. Like many other databases, MPIDB also imports information from other databases, including IntAct, Database of Interacting Proteins (DIP), The Biomolecular Interaction

Fig. 16.46 List of interactions found for MCM6 in BioGRID

Network Database (BIND) and MINT. Search can be performed using the name of a protein (UniProtKB ID or locus name) or by selecting species name. Results will be displayed as a table containing the UniProtKB ID, name of protein, interactor, loci of query and interactor, species for query and interactor and the number of pieces of evidence for such interaction.

16.8.9 TAIR

The Arabidopsis Information Resource (TAIR) at Phoenix Bioinformatics (https://www.arabidopsis.org) is a database of information for the plant research model A. thaliana.
This database contains the whole A. thaliana genome sequence, analysis, structure and

Fig. 16.47 Map of interactions for MCM6 in BioGRID database. Layout of interaction map is shown in concentric
circles, where query protein is at the center

annotation of genes, information for all proteins encoded in its genome, data from gene expression experiments, genome maps, pathways, and other information useful to the scientific community [49]. Like other databases, experts from TAIR curate information using published experiments before entering them in this database. Search in TAIR can be performed in several ways: DNA/Clones, Ecotypes, Genes, Gene Ontology, Plant Ontology, Keywords, Locus, Markers, Microarray element, Microarray expression, People/Labs, Polymorphism/Alleles, Protein, Protocols, PMIDS, Seed/Germplasm, and Text. TAIR webpage also contains tools for

Fig. 16.48 Map of interactions for MCM6 in BioGRID database. Layout of interaction map is shown as a single circle,
where MCM6 query protein is located at the top of the map

analysis of sequences, as well as viewers for maps and sequences. It is recommended to register in TAIR to download the whole genome sequence.

16.8.10 GeneCards

The Human Gene Database (GeneCards, http://www.genecards.org) is another useful database covering the human genome [50–53]. This database was created by scientists at the Weizmann Institute of Science and LifeMap Sciences. Search can be done using keywords, symbols, aliases, or identifiers. Information that can be retrieved from this database includes:

1. Aliases for query
2. Links to HGNC (HUGO Gene Nomenclature Committee, http://www.genenames.org), Entrez Gene at NCBI, Ensembl (genome databases for vertebrates and other

Fig. 16.49 Grid layout of the map of interactions for MCM6 in BioGRID database. MCM6 query protein is located at
the top left corner of the map

eukaryotic species, http://www.ensembl.org/index.html), OMIM (http://www.omim.org), and UniProtKB
3. Summaries of queries retrieved from different sources
4. Genomics data for query, including Regulatory Elements, Genomic location, Genomic region view, and RefSeq DNA sequence
5. Protein information such as Protein ID, Length in amino acids, Molecular Mass, Quaternary structure, Three dimensional structure from OCA (Browser-database for protein structure/function, http://oca.weizmann.ac.il/oca-docs/oca-home.html), Proteopedia (The free, collaborative 3D-encyclopedia of proteins & other

Fig. 16.50 Grid layout of the map of interactions for MCM6 in BioGRID database using a minimum value of 5 as
evidence

molecules, http://proteopedia.org/wiki/index.php/Main_Page), Alternative splice forms, Data of protein expression in Proteomics DB (https://www.proteomicsdb.org/proteomicsdb/#overview), PaxDB (Protein Abundance Across Organisms, http://pax-db.org/#!home), MOPED (Multi-Omics Profiling Expression Database, https://www.proteinspire.org/MOPED/mopedviews/proteinExpressionDatabase.jsf), MaxQB (The MaxQuant DataBase, http://maxqb.biochem.mpg.de/mxdb/), and PTMs, Domains in InterPro (Protein sequence analysis and classification, http://www.ebi.ac.uk/interpro), ProtoNet (Automatic Hierarchical Classification of Proteins, http://www.protonet.cs.huji.ac.il/requested/cluster_card.php?global=protonet|no|6|61|lifetime|1|2|2&cluster=4023630&releaseid=6&firstEnterTimeClient=&blast=11053692|274977&clusteringNum=61)
6. Functions retrieved from UniProtKB, Enzyme Number; Gene Ontology; Phenotypes; Animal models for query; links to CRISPR products, miRNAs, siRNAs, shRNAs, clone products, etc.

Fig. 16.51 PTMs reported for MCM6 in BioGRID database. There are a few sites shown to carry ubiquitination for
MCM6. Reference is also provided

7. Localization of genes in chromosomes and subcellular location of proteins
8. Pathways
9. Drugs for query
10. Transcripts: Reference sequence (RefSeq), Ensembl, Unigene Clusters
11. Expression in tissues: GeneAnalytics (http://geneanalytics.genecards.org/?utm_source=genecards&utm_medium=banner&utm_campaign=genecards&utm_content=banner_expression)
12. Orthologs
13. Paralogs
14. Variants
15. Disorders in MalaCards (The Human Disease Database, http://www.malacards.org)
16. Publications

Fig. 16.52 PTMs reported for MCM6 in BioGRID database. Other PTMs are also shown in this figure for MCM6,
including neddylation, sumoylation, as well as other ubiquitination sites

In addition, there are many links to companies that might have products for the protein of interest, such as antibodies, immunofluorescence, animal models, silencing, etc.

Acknowledgements We thank the Instituto de Ciencia y Tecnología del Distrito Federal (ICyTDF), now renamed Secretaría de Ciencia, Tecnología e Innovación de la Ciudad de México (SECITI), for its support with the project ICyTDF-J.LA (CM-272/12-SECITI/033/2012), and Consejo Nacional de Ciencia y Tecnología (Conacyt) from Mexico, with the project number SALUD-2009-01-113674, both granted to Dr. Juan Pedro Luna Arias.

References

1. Kumar C, Mann M (2009) Bioinformatics analysis of mass spectrometry-based proteomics data sets. FEBS Lett 583(11):1703–1712

2. Su Z, Wang J, Yu J, Huang X, Gu X (2006) Evolution of alternative splicing after gene duplication. Genome Res 16(2):182–189
3. Twyman RM (2004) Principles of proteomics. Garland Biosciences/BIOS Scientific Publishers, Hampshire
4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 25(1):25–29
5. Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res 11(8):1425–1433
6. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C et al (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32(Database issue):D258–D261
7. Gene Ontology C (2015) Gene ontology consortium: going forward. Nucleic Acids Res 43(Database issue):D1049–D1056
8. Rhee SY, Wood V, Dolinski K, Draghici S (2008) Use and misuse of the gene ontology annotations. Nat Rev Genet 9(7):509–515
9. Mi H, Muruganujan A, Casagrande JT, Thomas PD (2013) Large-scale gene function analysis with the PANTHER classification system. Nat Protoc 8(8):1551–1566
10. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A (2003) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13(9):2129–2141
11. Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S et al (2003) PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res 31(1):334–341
12. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ et al (2005) The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 33(Database issue):D284–D288
13. Funahashi A, Jouraku A, Matsuoka Y, Morohashi M, Kikuchi N, Kitano H (2008) CellDesigner 3.5: a versatile modeling tool for biochemical networks. Proc IEEE 96(8):1254
14. Mi H, Guo N, Kejariwal A, Thomas PD (2007) PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res 35(Database issue):D247–D252
15. Mi H, Thomas P (2009) PANTHER pathway: an ontology-based pathway database coupled with data analysis tools. Methods Mol Biol 563:123–140
16. PANTHER User Manual (2015). http://pantherdb.org/help/PANTHER_user_manual.pdf
17. Mi H, Muruganujan A, Thomas PD (2013) PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res 41(Database issue):D377–D386
18. Calderon-Gonzalez KG, Valero Rustarazo ML, Labra-Barrios ML, Bazan-Mendez CI, Tavera-Tapia A, Herrera-Aguirre M, Sanchez Del Pino MM, Gallegos-Perez JL, Gonzalez-Marquez H, Hernandez-Hernandez JM et al (2015) Data set of the protein expression profiles of Luminal A, Claudin-low and overexpressing HER2(+) breast cancer cell lines by iTRAQ labelling and tandem mass spectrometry. Data Brief 4:292–301
19. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA (2003) DAVID: database for annotation, visualization, and integrated discovery. Genome Biol 4(5):P3
20. da Huang W, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1–13
21. Huang DW, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R, Baseler MW, Lane HC et al (2007) DAVID bioinformatics resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res 35(Web Server issue):W169–W175
22. da Huang W, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57
23. da Huang W, Sherman BT, Stephens R, Baseler MW, Lane HC, Lempicki RA (2008) DAVID gene ID conversion tool. Bioinformation 2(10):428–430
24. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res 32(Database issue):D277–D280
25. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34(Database issue):D354–D357
26. Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M (2015) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44:457
27. Kanehisa M, Sato Y, Morishima K (2015) BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J Mol Biol 428:726
28. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M (2012) KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 40(Database issue):D109–D114

29. Okuda S, Yamada T, Hamajima M, Itoh M, Katayama T, Bork P, Goto S, Kanehisa M (2008) KEGG Atlas mapping for global analysis of metabolic pathways. Nucleic Acids Res 36(Web Server issue):W423–W426
30. Chaiboonchoe A, Samarasinghe S, Kulasiri D, Salehi-Ashtiani K (2014) Integrated analysis of gene network in childhood leukemia from microarray and pathway databases. BioMed Res Int 2014:278748
31. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P (2007) STRING 7--recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 35(Database issue):D358–D362
32. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33(Database issue):D433–D437
33. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M et al (2009) STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 37(Database issue):D412–D416
34. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 31(1):258–261
35. Harrington ED, Jensen LJ, Bork P (2008) Predicting biological networks from genomic data. FEBS Lett 582(8):1251–1258
36. Marcotte EM, Xenarios I, Eisenberg D (2001) Mining literature for protein-protein interactions. Bioinformatics 17(4):359–363
37. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P et al (2011) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 39(Database issue):D561–D568
38. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G (2002) MINT: a molecular INTeraction database. FEBS Lett 513(1):135–140
39. Licata L, Briganti L, Peluso D, Perfetto L, Iannuccelli M, Galeota E, Sacco F, Palma A, Nardozza AP, Santonico E et al (2012) MINT, the molecular interaction database: 2012 update. Nucleic Acids Res 40(Database issue):D857–D861
40. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A et al (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res 32(Database issue):D452–D455
41. Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R et al (2007) IntAct--open source resource for molecular interaction data. Nucleic Acids Res 35(Database issue):D561–D565
42. Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N et al (2014) The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 42(Database issue):D358–D363
43. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A et al (2009) Human protein reference database--2009 update. Nucleic Acids Res 37(Database issue):D767–D772
44. Breitkreutz BJ, Stark C, Tyers M (2003) The GRID: the general repository for interaction datasets. Genome Biol 4(3):R23
45. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34(Database issue):D535–D539
46. Chatr-Aryamontri A, Breitkreutz BJ, Oughtred R, Boucher L, Heinicke S, Chen D, Stark C, Breitkreutz A, Kolas N, O’Donnell L et al (2015) The BioGRID interaction database: 2015 update. Nucleic Acids Res 43(Database issue):D470–D478
47. Scott MS, Barton GJ (2007) Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinf 8:239
48. Goll J, Rajagopala SV, Shiau SC, Wu H, Lamb BT, Uetz P (2008) MPIDB: the microbial protein interaction database. Bioinformatics 24(15):1743–1744
49. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M et al (2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40(Database issue):D1202–D1210
50. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D (1997) GeneCards: integrating information about genes, proteins and diseases. Trends Genet 13(4):163
51. Safran MC-CV, Shmueli O, Rosen N, Benjamin-Rodrig H, Ophir R, Yanai I, Shmoish M, Lancet D (2003) The GeneCards family of databases: GeneCards, GeneLoc, GeneNote and GeneAnnot. In: Proceedings of the IEEE Computer Science Bioinformatics Conference CSB2003
52. Stelzer GHA, Dalah A, Rosen N, Shmoish M, Iny Stein T, Sirota A, Madi A, Safran M, Lancet D (2008) GeneCards: one stop site for human gene research. FISEB (ILANIT)
53. Harel A, Inger A, Stelzer G, Strichman-Almashanu L, Dalah I, Safran M, Lancet D (2009) GIFtS: annotation landscape analysis with GeneCards. BMC Bioinf 10:348
Part IV
Applications of Proteomics Technologies
in Biological and Medical Sciences
17 Identification, Quantification, and Site Localization of Protein Posttranslational Modifications via Mass Spectrometry-Based Proteomics

Mi Ke, Hainan Shen, Linjue Wang, Shusheng Luo, Lin Lin, Jie Yang, and Ruijun Tian

Abstract
Posttranslational modifications (PTMs) are important biochemical processes for regulating various signaling pathways and determining specific cell fate. Mass spectrometry (MS)-based proteomics has been developed extensively in the past decade and is becoming the standard approach for systematic characterization of different PTMs on a global scale. In this chapter, we will explain the biological importance of various PTMs and summarize key innovations in PTM enrichment strategies, high-performance liquid chromatography (HPLC)-based fractionation approaches, mass spectrometry detection methods, and lastly bioinformatic tools for PTM-related data analysis. With great effort in recent years by the proteomics community, highly efficient enrichment methods and comprehensive resources have been developed. This chapter will specifically focus on five major types of PTMs: phosphorylation, glycosylation, ubiquitination/sumoylation, acetylation, and methylation.

Keywords
Posttranslational modifications (PTMs) • Phosphoproteomics •
Glycoproteomics • Ubiquitinome and sumoylated proteome •
Methylome • Acetylome • PTM enrichment • Metal oxide affinity
chromatography (MOAC) • TiO2 • IMAC • Immunoprecipitation •
Peptide fractionation • Orthogonality • Lectin affinity • Hydrazine-based
purification • Anti-k(GG) Ab • KMBD MBT-based purification • Lysine-
methylation • Lysine-acetylation • Serial PTM enrichment • PTM mass
spectrometry • CID • HCD • ETD • MS3

M. Ke • H. Shen • L. Wang • S. Luo • L. Lin • J. Yang
Department of Chemistry, South University of Science and Technology of China, 518055 Shenzhen, China

R. Tian (*)
Department of Chemistry, South University of Science and Technology of China, 518055 Shenzhen, China
Shenzhen Key Laboratory of Cell Microenvironment, South University of Science and Technology of China, 518055 Shenzhen, China
e-mail: tian.rj@sustc.edu.cn

© Springer International Publishing Switzerland 2016


H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical
Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_17
346 M. Ke et al.

17.1 Introduction

Posttranslational modifications (PTMs) alter the biochemical properties of a protein by the addition of a chemical group to one or more of its amino acid residues. PTMs are extremely important for protein function as they can influence the activity state, stability, localization, turnover, and interactions with other proteins [1]. To date, more than 200 posttranslational modifications of proteins are known to occur physiologically. Analysis of these modifications is a great challenge, as biologically significant PTMs usually occur with low stoichiometry and at low abundance. Since mass spectrometry (MS)-based proteomic methodologies have demonstrated tremendous potential for quantitatively profiling various PTMs, increasing attention has been given to the development of powerful proteomic approaches to explore different PTMs in various biological systems [2].

In this chapter, we will provide a comprehensive review of MS-based proteomic analysis of different PTMs, including sample enrichment approaches, high-performance liquid chromatography (HPLC)-based fractionation approaches, MS detection techniques and bioinformatics methods. We will specifically focus our discussion on five major types of PTMs: phosphorylation, glycosylation, ubiquitination/sumoylation, acetylation, and methylation.

17.2 Biological Functions of PTMs

Posttranslational modifications (PTMs) are biochemical processes for modifying proteins with various chemical groups such as phosphate, glycan, methyl, acetyl, ubiquitin, etc. The majority of eukaryotic proteins (50–70 %) are modulated by different PTMs in space- and time-dependent manners, which are crucial for various biological functions. Over 200 types of PTMs have been identified so far, most of which are irreversible and lead to permanent changes in protein conformation and function [3]. With reversible phosphorylation as an example (Fig. 17.1), many key PTMs with indispensable biological functions are enzyme-dependent and reversible. These dynamic PTM modulations provide finely tuned biochemical mechanisms for the regulation of key physiological states in cells and their responses to the external environment. PTMs regulate the activity of proteins and also significantly expand the complexity of the biological system. Some of the major physiological roles of PTMs are summarized as follows:

1. PTMs can significantly change the three-dimensional structure of proteins, usually to modulate their function. A case in point is the hydroxylation of proline and lysine residues in collagens, which stabilizes their coiled structure [4].
2. PTMs such as phosphorylation are key factors for regulating dynamic protein–protein interactions.
3. The sub-cellular localization of some proteins is directed by PTMs. For example, proteins modified with glycosylphosphatidylinositol on cysteine are usually directed to the cellular membrane.
4. Protein stability and half-life are also modulated by PTMs. It is well known that the K48-ubiquitination tag is a signal for protein degradation by the proteasome pathway.

17.2.1 Phosphorylation

Phosphorylation is one of the most ubiquitous and important PTMs. It is a process that involves the kinase-catalyzed transfer of a phosphate group from ATP to specific amino acid residues, mainly serine (S), threonine (T), tyrosine (Y), and the recently discovered histidine (H). Reversible protein phosphorylation is a widely used modulation method which is utilized by both eukaryotic and prokaryotic organisms. Over 1/3 of proteins are modified by phosphorylation in eukaryotes at any given time [5, 6]. During signal transduction, upon the binding of a secreted ligand, the receptors are often phosphorylated, which subsequently activates
17 Identification, Quantification, and Site Localization of Protein. . . 347

Fig. 17.1 Protein phosphorylation is a typical model for reversible and enzyme-dependent PTM. Phosphorylation is catalyzed by the “writer” kinases and dephosphorylation is performed by the “eraser” phosphatases. Protein phosphorylation mainly happens on specific amino acid residues including serine (S), threonine (T), and tyrosine (Y)

dynamic intracellular signaling pathways. It is currently known that many human diseases, for instance cancer and Alzheimer’s disease (AD), result from abnormal phosphorylation of key functional proteins. For example, Tau protein holds more than 20 phosphorylation sites in AD which are rigorously modulated by various protein kinases, such as cyclin-dependent kinase 5 (CDK5), mitogen-activated protein kinase (MAPK), etc. [7]

17.2.2 Glycosylation

Protein glycosylation is a process which involves the attachment of sugar moieties to proteins. Most of the proteins in the plasma membrane, endoplasmic reticulum (ER), and extracellular environment are glycosylated. Glycosylation is structurally the most complex PTM; hence, the mechanism of this process is highly ordered, and its extent and complexity correlate closely with the level of evolution [8]. Different types of glycosylation have been well characterized, including N-linked glycosylation, O-linked glycosylation, C-mannosylation, glypiation (glycosylphosphatidylinositol anchor), etc. [9, 10]. Oligosaccharides vary in terms of the function and structure of their sugar residues (e.g. galactose, glucose, mannose, N-acetylgalactosamine, N-acetylglucosamine, fucose, sialic acid, etc.). Among them, N-linked and O-linked glycosylation are the most widely studied. N-linked glycans (essentially made up of two N-acetylglucosamine and three mannose residues) are always attached to asparagine (Asn) residues within the consensus sequence Asn-X-Ser/Thr (X represents any amino acid except proline) [11]. N-glycosylation is the most common glycosylation modification. O-glycosylated proteins also play roles in various cellular functions, especially in cell metabolism. O-glycosylation occurs mainly on serine and threonine side chains, and can sometimes occur on oxidized forms of lysine and proline residues [12]. Glycosylation plays a role in many important cellular functions, such as cell adhesion, receptor activation, endocytosis, cell immune responses, etc. The sensitive recognition of protein-protein and cell-cell interactions, typically in the intracellular microenvironment, is the most well studied example of the functionality of protein glycosylation [13].

17.2.3 Ubiquitination and Sumoylation

Ubiquitin is a small (76 amino acids) polypeptide that is usually characterized by two glycine residues (diGly) in the C-terminal domain. The ubiquitin peptide chain contains seven lysine residues at positions 6, 11, 27, 29, 33, 48 and 63, from the N- to the C-terminus, through which it can be attached to substrates [14]. Ubiquitin

modification functions as a signaling mechanism, mainly by activating the protein degradation machinery. Attachment of ubiquitin to other proteins is catalyzed by an activating enzyme (E1), a ubiquitin-conjugating enzyme (E2) and a ubiquitin ligase (E3). This process plays significant roles in regulating cell apoptosis, transcription modulation, DNA repair, etc. In eukaryotes, over 30 % of newly synthesized proteins are degraded because of damage to their structure. Mono-ubiquitination regulates cellular functions by altering the activity of proteins, changing the binding affinity with other proteins, and transporting specific proteins to their site of activity [15]. A single ubiquitin moiety linked to a substrate is often a signal for additional linkages of ubiquitin molecules onto the existing ubiquitin, thus forming a polyubiquitin chain. Typically, K48-linked polyubiquitin chains target proteins for proteolysis. A chain of at least four ubiquitin molecules on a condemned protein can be recognized by the 26S proteasome [16].

In addition to the well-studied ubiquitin system, later studies have uncovered other ubiquitin-like modifications including small ubiquitin-related modifier (SUMO) [17], Nedd8 [18], Atg8 [19], and ISG15 [20], etc. Unlike ubiquitination, sumoylation is a reversible and multifunctional modification, participating in many cellular signaling pathways. There are three well-characterized SUMO proteins in humans: SUMO1, SUMO2 and SUMO3. SUMO2 and SUMO3 share about 95 % sequence homology and have only a few conspicuous functional differences. Conversely, the homology between SUMO2/3 and SUMO1 is only 44 % (Fig. 17.2) and they also have distinct target proteins.

17.2.4 Acetylation

Protein acetylation involves the introduction of an acetyl group into a polypeptide by replacing an active hydroxyl group. Lysine has emerged as the main acetylation site for many key functional proteins, such as histones, transcription regulators, and enzymes associated with glycolysis [21]. Like reversible phosphorylation, the acetylation process is catalyzed by a pair of enzymes, histone acetyltransferase (HAT) and histone deacetylase (HDAC). Acetylation regulates protein activity and can crosstalk with other PTMs such as phosphorylation and methylation in the dynamic control of transcription activity [22], cellular signaling [23], etc. Acetylation reduces the electrostatic attraction between histone 4 and the phosphate-rich, negatively charged DNA backbone, thereby loosening chromatin structure and resulting in increased transcription activity [22].

17.2.5 Methylation

Protein methylation is a common PTM that modifies proteins with methyl groups, mostly on lysine (K) and arginine (R) amino acid residues. Arginine methylation is catalyzed mainly by two classes of arginine methyltransferases (PRMTs):

– Type I PRMTs including PRMT1, PRMT3, PRMT4, PRMT6, and PRMT8
– Type II PRMTs including PRMT5 and PRMT7

Both classes of enzymes can catalyze arginine monomethylation. Type I PRMTs can also add

Fig. 17.2 Amino acid sequence alignment of SUMO1, SUMO2 and SUMO3, showing their different degrees of sequence similarity

methyl groups to the arginine side chain, forming asymmetric dimethylarginine, while type II enzymes can further catalyze the symmetric dimethylation (Fig. 17.3a) [24]. Lysines can be mono-, di-, or trimethylated by lysine methyltransferases (KMTs), and this modification can also be reversed by demethylases (DMTs) (Fig. 17.3b) [25, 26]. Generally, methylation has been reported to regulate RNA processing, gene transcription, DNA damage repair, and signal transduction [27].

17.3 Proteomic Strategies for PTMs Analysis

The basic procedure for PTM analysis is roughly the same as the procedure used for the identification of proteins in ‘classical’ proteomics research. However, PTM analyses are generally more difficult for the following reasons:

1. The endogenously modified proteins only constitute a small fraction of the total protein numbers (low stoichiometry).
2. Since the covalent bond between the PTM and the amino acid side chain is typically labile, it is often difficult to maintain the peptides in their modified state during sample preparation and subsequent ionization in mass spectrometry.
3. PTMs are frequently transient in the dynamic homeostasis of nature.

Therefore, more effective sample preparation methods, more sensitive detection technology and more comprehensive data analysis strategies are needed in the analysis of PTMs.

17.3.1 Conventional Analysis Methods

Conventionally, PTM analysis has been carried out by laborious biochemical approaches, including two-dimensional gel electrophoresis (2D-GE), western blot analysis, autoradiography and Edman sequencing.

2D-GE is a classical separation method that separates proteins on the basis of their isoelectric point and molecular weight [28]. 2D-GE has been successfully used to directly separate different types of modified proteins. For example, phosphorylation changes the charge of a protein and is often indicated by a horizontal trail of protein spots on a two-dimensional gel. Once the proteins have been isolated, a variety of detection techniques can be used in succession. The proteins in the gel can be unselectively visualized by staining the gel with Coomassie blue or colloidal silver. The staining intensity of the gel spots roughly reflects the protein amount, providing information on the relative proportion of the various modified states.

The specific subsets of PTM-modified proteins present in the gel can also be selectively detected and visualized. This can be achieved by using a PTM-specific staining reagent to develop the gel, by using PTM-specific antibodies for western blotting, or by incorporating PTM-specific radiolabels into the proteins. For example, phosphoproteins can be selectively stained and visualized with phosphate-specific fluorescent probes (such as BO-IMI, in which a BODIPY dye is attached to a reactive imidazole group) [29]. Western blot analysis with antibodies against specific phosphorylation sites is widely used to detect different types of phosphoproteins (S, T, and Y) [30, 31]. Likewise, nitrated proteins can be detected with anti-nitrotyrosine antibodies. In these cases, the quality of the antibody, including specificity and sensitivity, is critical for the detection.

Autoradiography is an alternative detection technology that was widely used in the past, although it is sometimes expensive and hazardous. Proteins are labeled (in vivo or in vitro) with radioactive PTM precursors before extraction and separation, and subsequently visualized by autoradiography. A number of specific radiolabeling agents are available, such as [32P]-phosphate or [γ-32P]-ATP for

Fig. 17.3 Overview of arginine and lysine methylation. (a) Arginine methylation is catalyzed by protein arginine methyltransferases (PPRMTs) I or II [24], of which PPRMT I can transfer a second methyl group to the same guanidino nitrogen of arginine, denoted as asymmetric dimethylation, while PPRMT II catalyzes

phosphoproteins, [3H]-inositol for GPI-anchored proteins, and [3H]-myristyl for N-myristoylated proteins.

Edman degradation, the classical protein sequencing technique, can be used to locate modification sites. The proteolytic peptide fractions are applied to the sequencer and their amino acid sequence is determined. Modified amino acids become apparent through their absence or a retention time shift in the corresponding sequencing cycle. Edman sequencing in combination with radiolabeling was once widely used for characterizing phosphorylation sites. For this, 32P-labeled proteins were digested into peptides and separated, and the candidate phosphorylation sites were identified by recording the cycle in which the radiolabeled amino acid was released [32].

While feasible, the traditional methods mentioned above suffer from various shortcomings. 2D-GE separations are difficult to achieve for low-abundance, acidic, basic, hydrophobic, very large, or very small proteins. Furthermore, this technique has reproducibility issues, a limited dynamic range and a low throughput, which hinders its application in the global characterization of PTMs. Antibody-based western blot analyses show poor performance in the detection of some types of PTMs due to steric hindrance of the recognition site. Autoradiography is hazardous, and radioisotopes of carbon and hydrogen are rather weak emitters, which makes it difficult to efficiently detect the corresponding modified proteins (for example, 14C or 3H in the case of protein methylation and acetylation). Edman sequencing is tedious, requires large amounts of starting material, and has a lengthy analysis cycle. This is especially true when radiolabeling is involved, which limits its application in high-throughput studies.

Compared with those conventional analysis methods, mass spectrometry (MS) has emerged as a powerful technique to analyze PTMs due to its high efficiency, sensitivity, and selectivity. The extensive application of MS-based proteomics in PTM analysis is the result of the development of effective enrichment strategies, faster and more sensitive MS detection techniques, and powerful bioinformatics methods. All these aspects will be described in detail in the subsequent paragraphs.

17.3.2 Enrichment of PTMs Prior to MS Analysis

PTMs are often found at sub-stoichiometric levels and modified peptides represent a small proportion of all peptides present in a total cell lysate, which is why they require enrichment/purification to improve their measurement prior to mass spectrometry identification. Table 17.1 summarizes the well-established enrichment methods for specific PTMs.

17.3.2.1 Phosphorylation

Enrichment
Phosphorylation is one of the most extensively studied PTMs due to its biological significance in cell signaling and regulation. Due to the sub-stoichiometric and highly dynamic nature of phosphorylation, large-scale studies of the phospho-proteome require sophisticated experimental workflows that primarily hinge upon achieving a highly efficient, highly specific enrichment. Several affinity enrichment protocols have been established for enriching phosphorylated peptides from complex proteome digests such as cell lysates. These methods include metal oxide affinity chromatography

Fig. 17.3 (continued) the formation of symmetric dimethylarginine by adding the second methyl group to a different guanidino nitrogen atom of arginine. (b) Lysine methylation is catalyzed by KMTs, which usually are histone methyltransferases (HMTs) [25], adding one, two, or three methyl groups to the ε-amino nitrogen of lysine, forming monomethyl (Kme1), dimethyl (Kme2), or trimethyl (Kme3) lysine, respectively (Note: AdoMet, S-adenosylmethionine (SAM); AdoH, S-adenosyl-L-homocysteine (SAH); PPRMT, protein arginine methyltransferase; KMT, lysine methyltransferase)

Table 17.1 PTMs MS shift and the reporter fragments observed by collision-based dissociation

| PTMs type | Amino acid modified | Mass shift/Gross formula shift (stable in MS/MS fragmentation) | Diagnostic ions (specific fragment ions in MS/MS fragmentation) | Neutral loss (labile in MS/MS fragmentation) |
| Phosphorylation | Tyr | +79.9663 Da (HPO3) [96, 97] | 216.0426 Da (+) [98] | 79.9663 Da (HPO3) [99] |
| Phosphorylation | Ser/Thr/Tyr | +79.9663 Da (HPO3) [96, 97] | 97 Da (−) [93]; 78.9591 Da (−) [97]; 63 Da (−) [87] | 97.976 Da (H3PO4) [97, 100] |
| Glycosylation | N-linked (Asn) | >800 Da [83]; variable [96] | 204.087 Da (+) [101] | 203.079 Da (HexNAc) [102]; 162.053 Da (Hexose) [102] |
| Glycosylation | O-linked (Ser/Thr) | +203.0793 Da (HexNAc) [101]; +162.0528 Da (Hexose) [101]; +291.095 Da (Sialic acids) [102]; +365.148 Da (HexHexNAc) [102] | 163.0606 Da (+) [101]; 366.140 Da (+) [101]; 246.0977 Da (+) [101]; 292.103 Da (+) [102]; 274.093 Da (+) [102] | 291.095 Da (Sialic acids) [102]; 365.148 Da (HexHexNAc) [102] |
| Acetylation | Ser/Thr/Lys | +42.0105 Da (CH3CO) [101] | 126.0913 Da (+) [103]; 143.1179 Da (+) [103] | n/a |
| Methylation | Lys/Arg | +14.01565 Da (CH3) [96, 101]; +28.0313 Da (C2H6) [96]; +42.04695 Da (C3H9) [96] | 71.06 Da (+), 46.06 Da (+) (Dimethylation) [104] | n/a |
| Ubiquitination | Lys | +114.043 Da (Gly-Gly) [105] | n/a | n/a |

Note: (+): in positive-mode; (−): in negative-mode
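The mass shifts in Table 17.1 are what a search engine adds to a peptide's unmodified mass when it tests for a modified form. The following is a minimal sketch of that arithmetic, not the workflow of any particular software package. The PTM shifts are copied from the table; the residue masses, the water and proton masses, and the function names `peptide_mass` and `mz` are assumptions introduced here for illustration (standard monoisotopic values rounded to five decimals).

```python
# Monoisotopic residue masses in Da (standard values, assumed here; not from the chapter).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free N- and C-termini
PROTON = 1.007276   # mass of one charge-carrying proton

# Mass shifts (Da) taken from Table 17.1.
PTM_SHIFT = {
    "phospho": 79.9663,
    "acetyl": 42.0105,
    "methyl": 14.01565,
    "dimethyl": 28.0313,
    "trimethyl": 42.04695,
    "diGly": 114.043,
    "HexNAc": 203.0793,
    "Hex": 162.0528,
}


def peptide_mass(sequence, ptms=()):
    """Neutral monoisotopic mass of a peptide plus the listed PTM shifts."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    mass += sum(PTM_SHIFT[name] for name in ptms)
    return mass


def mz(mass, charge):
    """m/z of the positively charged ion with the given number of protons."""
    return (mass + charge * PROTON) / charge


unmodified = peptide_mass("TVDSPK")
phospho = peptide_mass("TVDSPK", ["phospho"])
print(f"TVDSPK neutral mass:        {unmodified:.4f} Da")
print(f"phospho-TVDSPK neutral mass: {phospho:.4f} Da")
print(f"phospho-TVDSPK 2+ m/z:       {mz(phospho, 2):.4f}")
```

The peptide TVDSPK is the example shown in Fig. 17.4. The sketch only tracks the total shift; it does not localize the modification to a residue, which requires fragment-ion evidence such as the diagnostic ions and neutral losses listed in the table.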

(MOAC), immobilized metal ion affinity chromatography (IMAC), immunoprecipitation-based enrichment, and domain-based enrichment.

Metal Oxide Affinity Chromatography (MOAC)
MOAC represents one of the most commonly used strategies for phosphopeptide enrichment. This technique is based on the affinity that phosphate groups have towards metal oxides. Several metal oxides, including TiO2 [33], ZrO2 [34] and Nb2O5 [35], have been successfully used for this purpose. TiO2 is the most popular MOAC substrate, with high enrichment efficiency and specificity. In a typical TiO2-based MOAC procedure, the sample is mixed with an acidic buffer (e.g. 0.1 % (v/v) trifluoroacetic acid) to protonate acidic residues of non-phosphorylated peptides, preventing their adsorption to TiO2. After a washing step, phosphopeptides are eluted from the TiO2 column under alkaline conditions, such as ammonium bicarbonate at a pH of 9. Nevertheless, TiO2-based MOAC enrichment can suffer from reduced specificity due to the competitive binding of acidic amino acid residues (e.g. Glu and Asp) in non-phosphopeptides. Considerable efforts have been made to improve the specificity of this protocol by introducing competitive additives such as 2,5-dihydroxybenzoic acid (DHB) [36], phthalic acid [37] and glutamic acid [38] into the loading buffers.

Immobilized Metal Ion Affinity Chromatography (IMAC)
IMAC is another widely used affinity purification technique for phosphopeptide enrichment. The affinity between phosphopeptides and IMAC resin is caused by electrostatic interactions between the negatively charged phosphate groups of phosphopeptides and the positively charged metal ions that are bound to a solid support via iminodiacetic acid (IDA) or

nitrilotriacetic acid (NTA) ligands. Various metal ions have been tested for their efficiency in phosphorylated peptide enrichment, such as Fe3+ [39], Ti4+ [40], Zr4+ [41], Ga3+ [42], etc. The general procedure is similar to MOAC. First, the tryptic digest is dissolved in IMAC binding buffer and loaded onto an IMAC column for incubation. Then nonphosphorylated peptides are removed by washing the resin with IMAC binding buffer. Phosphopeptides are then eluted from the beads at high pH or with phosphate salts. IMAC was first introduced by Andersson and Porath in 1986 for the enrichment of phosphoproteins [43], and has been extensively improved by many other researchers. To reduce the nonspecific binding of nonphosphorylated acidic peptides to the IMAC resin, Ficarro et al. developed a technique that blocks the carboxylic groups that are present at the C-terminus of peptides and in acidic residues (i.e. Glu and Asp) by methyl esterification [44]. Despite having increased specificity toward phosphopeptides, this approach suffers from incomplete reaction and side derivatization reactions which might complicate the MS identification. IMAC enrichment has a bias towards multi-phosphorylated peptides, which necessitates the implementation of complementary strategies such as SIMAC (sequential elution from IMAC) [45]. The SIMAC approach combines both Fe-IMAC and TiO2 enrichment strategies for phosphopeptide enrichment in a consecutive manner. A typical SIMAC workflow starts with an IMAC enrichment to first capture multi-phosphorylated peptides. The flow-through and acid-eluted fractions are then collected and subjected to TiO2 enrichment, to capture most of the mono-phosphorylated peptides. Using such a strategy, Thingholm and coworkers were able to double the identification of phosphorylation sites as compared with single TiO2 enrichment [46].

A new type of IMAC approach with immobilized metal ions was developed by Zou et al. for highly efficient enrichment of phosphorylated peptides [47]. This resin, which uses Ti4+, outperformed the other phosphopeptide enrichment methods that use other metal ions or metal oxides (Fe3+-IMAC, Zr4+-IMAC, TiO2 and ZrO2). The high specificity and efficiency of Ti4+-IMAC is mainly due to the flexibility of the spacer arm that is linked to the polymer beads, and also to the specific interaction between the immobilized Ti4+ and the phosphate groups that prevents binding of acidic peptides [48].

Immunoprecipitation-Based Enrichment
Tyrosine phosphorylation often occurs at very low abundance and the occupancy is estimated at about 0.5 % of all human phosphorylation events, with the majority occurring on serine (~90 %) or threonine (~10 %) residues [49]. Therefore, the aforementioned approaches are not well-suited for the study of tyrosine phosphorylation. Immunoprecipitation (IP) with immobilized antibodies against phosphotyrosine (pTyr) is a well-established strategy for the enrichment of pTyr-carrying phosphopeptides. With the highly specific commercially available antibodies against pTyr (e.g., PY100), Rikova et al. identified 4551 phosphotyrosine sites on 2700 different proteins and characterized tyrosine kinase signaling across 41 non-small cell lung cancer (NSCLC) cell lines and over 150 NSCLC tumors (Rikova, 2007, Cell). In addition to the pTyr-specific antibodies, pTyr-binding domains such as Src homology 2 (SH2) domains can also be used to enrich tyrosine-phosphorylated proteins. Using the SH2 domain of the adapter protein Grb2 (GST-SH2 fusion protein), Blagoev et al. identified 228 proteins. However, this approach is limited to those phosphotyrosine-containing proteins that interact with the SH2-containing bait used in the assay.

IP approaches are not commonly used for phosphoserine and phosphothreonine enrichment, mainly because highly specific antibodies against pThr and pSer do not exist. Some studies have employed antibodies raised against the consensus motifs in phosphothreonine and phosphoserine peptides [50, 51]. However, yields of such approaches were relatively low, because those antibodies did not bind all pS/pT sites with the same efficiency.

Fractionation

Ion-Exchange Chromatography
Ion-exchange chromatography is a charge-based strategy for the enrichment of phosphopeptides according to the interaction between the negatively charged phosphate group and the Strong Cation Exchange (SCX) or Strong Anion Exchange (SAX) matrix. SCX chromatography has been one of the most popular fractionation strategies for sample complexity reduction in phosphoproteomics experiments [39]. The principle of SCX for fractionation of phosphopeptides is illustrated in Fig. 17.4. Under acidic conditions (e.g. pH 2.7), the N-terminal amino group and the C-terminal Lys/Arg residue of most tryptic peptides are protonated, giving a net charge of 2+, whereas mono-phosphopeptides have a charge state of only 1+ because the attached phosphate group contributes one negative charge (left panel). The net charge of a phosphopeptide is decreased by one unit for each added phosphate group. This means that phosphopeptides have a decreased affinity (mono-phosphorylated) or no affinity (multi-phosphorylated) for the SCX media [52].

In contrast to SCX chromatography, SAX chromatography tends to retain the negatively charged phosphopeptides more effectively than nonphosphorylated peptides [53]. SAX was shown to have a better selectivity for multiply phosphorylated peptides and was initially introduced to compensate for one of the main issues associated with SCX, which is the relative inability to retain strongly acidic, negatively charged multi-phosphopeptides [54, 55]. Dai et al. have devised a multidimensional liquid chromatography (Yin-Yang MDLC) approach combining SCX and SAX to profile the phospho-proteome of mouse liver [54]. In this approach, protein digests were first loaded onto an SCX column. Flow-through peptides from SCX were then collected and further loaded onto an SAX column. Both the SCX and SAX columns were eluted offline by a pH gradient to fractionate the phosphopeptides for subsequent RP-LC/MS identification.

Reverse Phase Chromatography
Reverse phase chromatography (RPC) fractionation of proteins/peptides is based on hydrophobic interactions of the proteins/peptides with the RPC stationary phase. Theoretically, phosphopeptides are less retained by an RP column and elute earlier than their nonphosphorylated counterparts, due to their reduced hydrophobicity as a result of the attached phosphate groups. RPC is often used as a second dimension separation for phosphopeptide fractionation because of its superior separation efficiency and excellent compatibility with LC/MS. Despite its excellent ability to fractionate phosphopeptides, RPC is less commonly used for offline fractionation due to the lack of orthogonality with inline RPC LC/MS. For this reason, high-pH RPC was introduced by Gilar et al. as the first dimension of separation for peptide mixtures (Gilar 2005,

[Figure 17.4: left panel, charge states at pH 2.7 of the tryptic peptide Thr-Val-Asp-Ser-Pro-Lys without (net charge 2+) and with (net charge 1+) a phosphate group on Ser; right panel, SCX chromatogram (OD220 versus time, 0 to 35 min) annotated with example peptides and their solution charge states, e.g. AQSGSDSS*PEPK 1+, ENS*PAAFPDR 1+, TVDS*PK 1+, DSS*VPETPDNER 1+, LFQLGPPS*PVK 1+, AYS*PEVR 1+, Ac*AEELVLER 1+, TLLEQLDDDQ 1+, TVDSPK 2+, YFLVGAGAIGCELLK 2+, LVLDSHIWAFK 3+.]

Fig. 17.4 Scheme for phosphopeptide enrichment by SCX chromatography. At pH 2.7, most peptides produced by trypsin proteolysis have a solution charge of 2, whereas phosphopeptides have a charge state of only 1 (left panel). SCX chromatography separation at pH 2.7 of a HeLa cell lysate after trypsin digestion. The dashed line indicates the salt gradient. Some identified peptides from the collected fractions are shown. Phosphorylation sites are denoted by an asterisk (right panel) (adapted from [52])
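The charge rule behind Fig. 17.4 can be written down as a short calculation. The sketch below is a deliberate simplification and rests on assumptions that are not stated as a formula in the chapter: at pH 2.7 every Lys, Arg and His side chain and a free N-terminus carry one positive charge, carboxyl groups are treated as neutral, and each phosphate group contributes one negative charge. The function name `scx_net_charge` and the `n_term_free` flag (set to False for N-terminally acetylated peptides) are hypothetical.

```python
BASIC_RESIDUES = set("KRH")  # side chains assumed fully protonated at pH 2.7


def scx_net_charge(peptide, n_term_free=True):
    """Approximate solution charge of a peptide at pH 2.7.

    `peptide` is a one-letter sequence in which '*' follows each
    phosphorylated residue, as in Fig. 17.4 (e.g. "TVDS*PK").
    """
    n_phospho = peptide.count("*")
    sequence = peptide.replace("*", "")
    positive = sum(1 for aa in sequence if aa in BASIC_RESIDUES)
    if n_term_free:
        positive += 1  # protonated N-terminal amino group
    return positive - n_phospho  # one negative charge per phosphate group


for pep in ["TVDSPK", "TVDS*PK", "LVLDSHIWAFK", "AQSGSDSS*PEPK"]:
    print(pep, scx_net_charge(pep))

# N-terminally acetylated peptide from the figure: no free amino terminus.
print("Ac-AEELVLER", scx_net_charge("AEELVLER", n_term_free=False))
```

Under these assumptions the rule gives the charge states annotated in the figure for these peptides, and it shows why acetylated N-terminal peptides and C-terminal peptides lacking Lys/Arg fall into the same 1+ fractions as mono-phosphopeptides.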

AC) showing excellent orthogonality with low-pH RPC, comparable to SCX-RP (Wang 2011, PROTEOMICS) for shotgun proteome analysis. The orthogonality of the high-pH RPC and low-pH RPC could be explained by the dramatic change in charge distribution within the peptide chain as a result of mobile-phase pH (Gilar 2005, AC). This approach was then adopted and refined by Zou et al. for global phosphopeptide analysis, which resulted in the identification of 30 % more peptides in mouse liver compared to a conventional RPLC approach (Song 2010, AC).

HILIC/ERLIC
Hydrophilic interaction liquid chromatography (HILIC) and electrostatic repulsion hydrophilic interaction chromatography (ERLIC) are promising alternatives to ion-exchange and RP chromatography for the pre-fractionation and enrichment of phosphopeptides based on phosphopeptide hydrophilicity (polarity). HILIC uses a polar sorbent (e.g. TSK gel amide) to retain the highly hydrophilic phosphate groups. A loading buffer containing organic solvent is used to promote hydrophilic interactions between phosphopeptides and the polar sorbent. Non-phosphorylated peptides, which are less hydrophilic, elute in the early fractions, followed by singly and multiply phosphorylated peptides in a gradient of increasing water. Alpert et al. first introduced ERLIC for the separation of phosphopeptides in 2008. This chromatography mode simultaneously uses hydrophilic interaction and electrostatic repulsion on a weak anion exchange (WAX) column to separate phosphopeptides [56]. When performed at low pH, non-phosphopeptides are protonated and are electrostatically repulsed by the WAX column, while the phosphopeptides, due to the presence of phosphate groups, are still negatively charged and electrostatically retained by the ERLIC column. With an increasing salt gradient, phosphopeptides elute according to the number of phosphate groups, with monophosphorylated peptides eluting first.

Orthogonality in 2D-LC
Liquid chromatography (LC) has become the method of choice for the fractionation of peptides in complex mixtures due to its high resolving power and compatibility with downstream MS. By combining the resolving power of two orthogonal chromatography modes (2D-LC), complex peptide mixtures can be further simplified due to the increased resolution and higher peak capacity of the combined methods [54]. Gilar et al. comprehensively investigated the orthogonality of SCX, SEC, HILIC and RP for the 2D separation of defined peptide mixtures and showed that SCX-RP, HILIC-RP, and RP-RP (performed at high pH for the first dimension followed by low pH for the second dimension) provided the best combinations in terms of orthogonality (Fig. 17.5).

The multidimensional combination of SCX and RPC has emerged as a powerful approach to separate phosphopeptides before analysis by mass spectrometry. By applying a multi-dimensional SCX-IMAC-RPC procedure, Gygi et al. were able to identify more than 5500 phosphoproteins with over 13,000 phosphorylation sites in mouse liver [57] and Drosophila embryos [58]. McNulty et al. have demonstrated that HILIC could also be a good first dimension for the multidimensional separation of phosphopeptides by providing better orthogonality to the subsequent RPC than SCX. Using HILIC-RPC they were able to achieve higher coverage of the HeLa phosphoproteome compared to SCX-RPC [59]. More recently, Song et al. established a new RPC-RPC approach for in-depth phosphopeptide analysis [60]. They operated the first dimension of RPC separation at high pH (i.e. pH 10) and collected time-based fractions. They then pooled early fractions with late fractions that were collected in equal time intervals to decrease the total number of fractions before the second dimension RPLC-tandem mass spectrometry (MS/MS) at low pH. The resulting highly orthogonal 2D separation yielded 30 % more phosphopeptide identifications when compared to the conventional RPLC approach (Fig. 17.6).
[Figure 17.5: six normalized retention-time plots (panels A to F; both axes scaled from 0.0 to 1.0). The x-axis in every panel is RP C18, pH 2.6. The y-axis labels, in the order extracted, are RP Phenyl, pH 2.6; RP PFP, pH 2.6; RP C18, pH 10; SEC, pH 4.5; HILIC, pH 4.5; and SCX, pH 3.25. Legends group peptides by pI (pI > 7.5, pI < 5.5, 5.5 < pI < 7.5) and by charge state (1+ to 5+).]

Fig. 17.5 Orthogonality of selected 2D-LC systems
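The normalized retention plots of Fig. 17.5 can be turned into a rough number with a few lines of code. The sketch below is only an illustration: it rescales retention times to the 0 to 1 range used on the figure axes and reports the fraction of grid cells that contain at least one peptide. This bin-coverage measure, the 5 x 5 grid, and the retention times are assumptions made for the example and are not necessarily the metric or data used in the cited study.

```python
def normalize(times):
    """Scale retention times to the 0-1 range used on the axes of Fig. 17.5."""
    lo, hi = min(times), max(times)
    if hi == lo:
        raise ValueError("all retention times are identical")
    return [(t - lo) / (hi - lo) for t in times]


def bin_coverage(first_dim, second_dim, bins=5):
    """Fraction of cells in a bins x bins grid holding at least one peptide."""
    occupied = set()
    for x, y in zip(normalize(first_dim), normalize(second_dim)):
        # min() keeps a normalized value of exactly 1.0 inside the last bin
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        occupied.add((col, row))
    return len(occupied) / (bins * bins)


# Hypothetical retention times (min) of the same ten peptides on three columns.
rp_low_ph = [5, 12, 18, 25, 31, 38, 44, 51, 57, 64]
correlated = [t * 0.9 + 2 for t in rp_low_ph]        # behaves like the first column
spread_out = [40, 8, 55, 20, 62, 3, 33, 48, 15, 27]  # retains peptides differently

print("correlated pair:", bin_coverage(rp_low_ph, correlated))
print("orthogonal pair:", bin_coverage(rp_low_ph, spread_out))
```

A pair of dimensions whose retention times are strongly correlated fills only the diagonal of the grid, whereas a more orthogonal pair occupies more cells and therefore makes better use of the available 2D separation space.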


[Figure 17.6: (A) chromatogram (absorbance against t/min, 0 to 90 min) for the separation of phosphopeptides by RPLC at high pH, with the 90 collected fractions combined into 45 pooled fractions before analysis by RPLC-MS/MS at low pH. (B to D) Plots of first-dimension fraction number against second-dimension retention time (t/min), with R² = 0.42559 (B), R² = 0.42742 (C) and R² = 0.00287 (D).]

Fig. 17.6 The high pH RPC/low pH RPC approach with high orthogonality for separation of phosphopeptides. (a) The scheme for high pH RPC fractionation of phosphopeptides; (b) 2D retention plots for a hypothetical 2D separation of peptides, with 90 fractions; (c) reducing fraction number by pooling adjacent fractions; and (d) reducing fraction number by pooling equal-interval fractions
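The two pooling schemes named in panels (c) and (d) of Fig. 17.6 can be written down explicitly. The following is a minimal sketch, assuming 90 first-dimension fractions reduced to 45 pools and assuming that equal-interval pooling pairs fraction i with fraction i + 45; the function names are hypothetical and the exact pairing used by the authors may differ.

```python
def pool_adjacent(n_fractions, n_pools):
    """Pool neighbouring fractions: (1, 2), (3, 4), ..."""
    if n_fractions % n_pools:
        raise ValueError("n_fractions must be a multiple of n_pools")
    size = n_fractions // n_pools
    return [list(range(start, start + size))
            for start in range(1, n_fractions + 1, size)]


def pool_equal_interval(n_fractions, n_pools):
    """Pool fractions separated by a fixed interval: (1, 46), (2, 47), ..."""
    if n_fractions % n_pools:
        raise ValueError("n_fractions must be a multiple of n_pools")
    return [list(range(start, n_fractions + 1, n_pools))
            for start in range(1, n_pools + 1)]


adjacent = pool_adjacent(90, 45)
interval = pool_equal_interval(90, 45)
print("adjacent pooling:      ", adjacent[0], "...", adjacent[-1])
print("equal-interval pooling:", interval[0], "...", interval[-1])
```

Adjacent pooling keeps peptides of similar first-dimension retention together, so each pool still covers a narrow window of the second dimension. Equal-interval pooling mixes an early and a late fraction in every pool, which spreads each pool over the whole second-dimension gradient.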

Reducing Sample Complexity to Enhance Phosphoproteome Coverage
Even with the most efficient and specific enrichment strategies, the PTM sample complexity will exceed the resolving power of state-of-the-art LC-MS/MS systems. In order to reduce sample complexity and to increase the depth of PTM coverage, a combination of enrichment procedures and proper fractionation strategies is necessary. For instance, Trinidad et al. combined SCX fractionation with IMAC to study phosphorylation in mouse brain and reported a three-fold increase in phosphopeptide identification compared to SCX alone, demonstrating that a combination of fractionation and specific phosphopeptide enrichment is essential for large-scale phosphoproteomic studies. Peptide fractionation strategies such as SCX, SAX, HILIC and RPC coupled with specific enrichment strategies (e.g. IMAC, TiO2) were comprehensively studied for their ability to reduce the sample complexity and to enhance the coverage of the PTM-ome. Gygi et al. have identified 5635 unique phosphorylation sites from 2328 proteins from mouse liver [57] and 13,720 different phosphorylation sites from 2702 proteins in developing Drosophila embryos [39] by applying a two-step phosphopeptide enrichment procedure consisting of SCX chromatography followed by IMAC. Similarly, Olsen et al. used SCX fractionation followed by TiO2 enrichment to characterize the dynamics of human cell cycle phosphorylation [33, 61]. They detected 6600 phosphorylation sites from 2244 proteins in epidermal growth factor stimulated HeLa cancer cells
[33]. However, one important disadvantage associated with SCX fractionation is that phosphopeptides are not equally distributed in all fractions due to varying charge distributions amongst mono- and polyphosphorylated peptides. To address this problem, McNulty et al. applied a HILIC-IMAC instead of the SCX-IMAC approach for fractionation of phosphopeptides, which resulted in higher coverage of the HeLa cell phosphoproteome [59]. In this study, the authors also showed that the use of IMAC prior to a HILIC separation of phosphopeptides resulted in an increased contamination with non-phosphopeptides. The percentage of phosphopeptides increased to 99 % when performing IMAC on the HILIC fractions, which indicated the importance of a prefractionation step in reducing sample complexity and improving enrichment efficiency.

17.3.2.2 Glycosylation

N-Glycosylation
Lectin
Lectin affinity enrichment is an efficient strategy for glycoprotein/glycopeptide enrichment. Different types of lectins, immobilized on solid supports such as agarose or magnetic beads, are used to enrich glycoproteins/glycopeptides according to their glycan structures. The enrichment efficiency for different glycopeptides can be significantly increased by using lectins with broad specificities [62]. Alternatively, lectins with narrow specificity can be utilized as “structure specific affinity selectors”. Concanavalin A (Con A), wheat germ agglutinin (WGA), peanut agglutinin (PNA) and Aleuria aurantia lectin (AAA) are some of the most widely used lectins for enriching N-linked glycosylated proteins. Con A is a plant lectin that has high affinity for a series of high-mannose and hybrid-type N-glycans [63]. WGA recognizes N-acetylglucosamine and sialic acid residues, while PNA is specific to T-antigen, which is commonly found in O-glycans [64]. AAA, on the other hand, shows broad specificity towards L-Fuc-containing glycans [47]. Figure 17.7 provides a detailed summary of different lectins that have been used for the enrichment of N-glycosylated proteins. The ability of different lectins to recognize specific glycosylation motifs was used to develop a multi-lectin affinity system that can achieve a

comprehensive enrichment of glycoproteins from biological fluids [65]. Moreover, high-performance lectin affinity columns and microcolumns have been developed that can be used directly in line with LC-MS/MS systems [66].

Fig. 17.7 N-linked glycans and their binding lectins [67]. Schematic illustration of various N-linked glycans attached to the polypeptide chain and several lectins with different binding specificity for the non-reducing end of the oligosaccharide. Lectins with affinity for specific oligosaccharides are denoted above or to the side of the chains (e.g., Con A, Jacalin, ...). Abbreviations: AAL, Aleuria aurantia agglutinin (lectin); RCA120, Ricinus communis agglutinin; SNA, Sambucus nigra agglutinin; SSA, Sambucus sieboldiana agglutinin; WGA, wheat germ agglutinin; Man, mannose; GlcNAc, N-acetylglucosamine; Gal, galactose; Fuc, fucose; Sia, sialic acid. Nx[ST] refers to a consensus tripeptide sequence for N-linked glycosylation.

Hydrazide Chemistry
Hydrazide chemistry, developed by Aebersold et al., is one of the most efficient techniques for N-linked glycopeptide enrichment [68]. As highlighted in Fig. 17.8, glycoproteins are oxidized with sodium periodate to generate aldehydes in the carbohydrates, which then react with hydrazide groups immobilized on resin to form hydrazone bonds. After the removal of nonglycosylated peptides, the N-glycopeptides are selectively released from the resin by PNGase F cleavage for LC-MS analysis. In 2007 another group [69] modified this method to capture glycopeptides rather than glycoproteins to minimize sample

loss and increase sensitivity. The development and application of this method have been well described in the literature [70, 71].
Based on the hydrazide chemistry method described above, Wollscheid et al. [72] developed a cell surface capturing technology for labeling and enriching the cell surface exposed N-glycoproteome before cell lysis.

[Figure 17.8: (a) workflow applied in parallel to two samples: glycoprotein oxidation, coupling, wash, proteolysis, wash and collect, isotope labeling (d0 for one sample, d4 for the other), followed by release and analysis. (b) Reaction scheme showing oxidation of the oligosaccharide chain on the protein and coupling of the oxidized sugar to the hydrazide.]

Fig. 17.8 Schematic diagram of quantitative analysis of N-linked glycopeptides [68]. (a) Strategy for quantitative analysis of glycopeptides. Proteins from two biological samples are oxidized and coupled to hydrazide resin. Nonglycosylated peptides are removed by proteolysis and extensive washes. The immobilized glycopeptides are isotope labeled by succinic anhydride carrying d0 or d4 tags. The beads are then combined and the isotopically tagged peptides are released by PNGase F and analyzed by LC-MS/MS. (b) Oxidation of a carbohydrate to an aldehyde followed by covalent coupling to hydrazide resin.

O-Glycosylation
The chemical/enzymatic photochemical cleavage (CEPC) method was used in O-GlcNAcylated peptide enrichment [73]. In this method (Fig. 17.8), O-GlcNAcylated peptides are first enzymatically labeled with azidogalactosamine (GalNAz). The free azido group in GalNAz is then conjugated to the alkyne group in a photocleavable biotin probe (PC-PEG-biotin-alkyne) through copper-catalyzed azide-alkyne cycloaddition (CuAAC). The biotinylated peptides are then enriched using avidin affinity chromatography, and subsequently released via photochemical cleavage. O-GlcNAc-modified peptides enriched by this method are tagged with a basic aminomethyltriazolacetylgalactosamine (AMTGalNAc) group that facilitates ETD identification and site localization of O-GlcNAc-modified peptides [74].
A handful of complementary methods have been developed for the enrichment and identification of O-GlcNAcylation. Teo and coworkers [73] obtained three antibodies capable of immunoprecipitating glycoproteins from HEK293T cell lysates. While each antibody captures a slightly different subset of targets, a total of 215 putatively O-GlcNAcylated proteins were isolated and identified by shotgun mass spectrometry. Anonsen et al. [75] used the same strategy (combining antibody-based enrichment with downstream MS analyses) to study the glycoproteome of N. gonorrhoeae.

17.3.2.3 Ubiquitination and Sumoylation

Tagging the Chain
Affinity tag based enrichment strategies are often used for ubiquitinome analysis. Typically, cells are transfected and ubiquitin is expressed with an epitope tag, such as a histidine tag or hemagglutinin (HA) tag at the N-terminus, to facilitate subsequent affinity purification using nickel beads (for the histidine tag) or an anti-epitope antibody. By using a yeast model system expressing 6xHis-tagged ubiquitin, Peng et al. [76] provided the first successful profiling of ubiquitinated proteins and ubiquitination sites using LC-MS/MS. Generally, a large percentage of proteins purified using a single-step ubiquitinome purification are not ubiquitinated (impurities include proteins with multiple histidine residues in a short sequence). For this reason, tandem affinity tags for two-step purification were developed. Tagwerker et al. [77] described a fused tandem histidine-biotin tag (HB-tag) strategy for two-step purification of the ubiquitinated proteome under fully denaturing conditions. The HB-tagged proteins were sequentially purified by Ni2+ chelate chromatography and streptavidin resins to greatly reduce the background of nonspecific proteins.

Anti-k(GG)
Recently, a monoclonal antibody-based peptide enrichment strategy has been developed for large-scale analysis of ubiquitination sites by targeting diglycine-modified lysine, k(GG), moieties. This antibody can specifically target, with high efficiency, the diglycine adduct left at sites of ubiquitination after trypsin digestion. By using an anti-k(GG) antibody for enriching diglycine-containing peptides, Guoqiang Xu and coworkers identified 374 diglycine-modified lysines on 236 ubiquitinated proteins, in which 72 % of these proteins and 92 % of the ubiquitination sites were reported for the first time [78]. In
another recent study by Kim et al. [79], more than 19,000 diGly-modified lysine residues from 5000 proteins were identified. This study proved the feasibility of global ubiquitinome profiling for the first time.

17.3.2.4 Methylation

KMBD MBT for Lysine Methylation
Recently a new strategy for enrichment of methylated proteins was introduced that relies on the affinity of naturally occurring 3xMBT domain repeats of L3MBTL1 for protein methylation. This affinity strategy was introduced as a universal method for detection and identification of proteins carrying a mono- or dimethylated lysine residue [33].

Antibody for Methylation
Protein methylation is a posttranslational modification that adds one or more methyl groups to the guanidino group of arginine or the primary amine of lysine residue side chains. Currently a high-throughput method for isolation and identification of lysine methylation does not exist due, in large part, to the lack of a specific antibody for methyl lysine [26]. Recently Michael’s group [14] developed highly specific antibodies against methyl arginine and lysine motifs. These highly specific antibodies recognize monomethyl arginine; symmetric and asymmetric dimethyl arginine (sDMA and aDMA); and monomethyl, dimethyl, and trimethyl lysine motifs. When these antibodies were used to enrich methyl peptides, over 1000 arginine methylation sites and 160 lysine methylation sites were identified, which is the largest number of methylation sites identified in a single study to date. Other useful arginine methyl-specific antibodies have been developed, such as ASYM24 and ASYM25, which are specific for aDMA, and SYM10 and SYM11, which recognize sDMA [6].

17.3.2.5 Acetylation
Immunoprecipitation using monoclonal antibodies is the main enrichment strategy for acetylated lysine residues. Using this antibody enrichment strategy, a new study showed that more than 20 % of mitochondrial proteins are acetylated [80]. A global analysis of lysine acetylation using an immunoprecipitation technique in a human cell line has recently identified 3600 sites on 1750 proteins [81].

17.3.2.6 Serial Enrichment of Different PTMs
More recently, a strategy for serial enrichments of different PTMs (SEPTM) from the same biological sample has been proposed by Mertins et al. [82]. This approach enables the analysis of the phosphoproteome, ubiquitinome and acetylome from the same biological sample without decreasing the quality of each individual PTM analysis. With their streamlined sequential use of IMAC (for phosphorylated peptides), K(GG)-specific antibodies (for ubiquitinated peptides) and K(Ac)-specific antibodies (for lysine-acetylated peptides), more than 20,000 phosphorylation sites, 15,000 ubiquitination sites, and 3000 acetylation sites were identified, of which 0.3 % of peptides contained different types of modifications. The SEPTM approach, although in its infancy, might open a new avenue for the systematic analysis of various PTMs to study PTM crosstalk in cell signaling.

17.3.3 Mapping PTMs With Mass Spectrometry

Posttranslationally modified proteins are covalently modified with specific chemical groups. PTMs often occur at low stoichiometry and are often labile during mass spectrometry [83]. Due to these characteristics, global detection of PTMs requires mass spectrometers with high resolution, high scan speed, and high sensitivity [84]. The major goals of PTM analysis are (i) identifying modified proteins, (ii) localizing modification sites on specific amino acids in the protein sequence, (iii) measuring the stoichiometry of the modified sites, and (iv) accurately quantifying the dynamic changes of these covalent modifications. Achieving these goals requires
the right mass spectrometer with proper ionization, fragmentation, and detection technologies. Since phosphorylation is one of the most important and well-studied PTMs in biological systems, we use it as an example to explain different ionization and fragmentation techniques that are commonly used to improve global phosphoproteome analysis.

17.3.3.1 Ionization Strategies
Successful detection of a PTM by mass spectrometry is often challenging due to decreased peptide ionization efficiency as a result of PTM chemistry. For example, reduced ionization efficiency of phosphorylated peptides compared to their non-modified counterparts has been reported for both electrospray ionization (ESI) [85, 86] and matrix-assisted laser desorption/ionization (MALDI) [85]. This is mainly due to the addition of a negatively charged phosphate group, which reduces the ionization yield of phosphopeptides in positive-ion mode [86]. Negative-ion mode can yield more ions for phosphopeptides [87], but MS/MS scans for peptide sequencing still need to be done in positive mode. Non-specific adsorption of phosphopeptides to stainless steel parts in the LC-MS system can also contribute to a high detection limit [85, 88]. To compensate for reduced ionization efficiency and sensitivity, pre-fractionation or enrichment of modified peptides is necessary.

17.3.3.2 Fragmentation Methods
Peptide PTM sites are usually detected according to the shift in the fragment ion m/z (Table 17.1); for example, an 80 Da mass shift reports the addition of an HPO3 group. Since a number of PTMs, such as phosphorylation and glycosylation, are labile during standard CID fragmentation, different types of fragmentation strategies have been developed for labile PTM analysis, including collision-based methods (CAD, HCD, MSA) and electron-based methods (ECD, ETD).

Collision-Induced Dissociation (CID)/Collision-Activated Decomposition (CAD)
Collision-induced dissociation (CID) is also known as collision-activated decomposition (CAD). In standard CID/CAD fragmentation, protonated peptides collide with an inert neutral gas following acceleration by an electric potential in the vacuum of the mass spectrometer. Non-modified peptides are generally fragmented at their backbone amide bonds, which results in b- and y-ion ladders that cover the peptide sequence from its N- and C-terminus, respectively. However, due to the neutral loss of the phosphate group from phosphorylated peptides, the MS analysis of labile phosphorylation by CAD fragmentation is often challenging. As shown in Fig. 17.9a, the phosphate group is the preferred site for protonation and subsequent nucleophilic attack from a neighboring amide carbonyl group [89, 90]. This results in a dominant neutral loss peak, while sequence-informative ions can rarely be observed, as shown in Fig. 17.9b [91]. The extent of neutral loss depends on parameters such as charge state, the chemical structure of the modified amino acid, the availability of mobile protons, the peptide amino acid sequence, the amount of collision energy exerted in fragmentation and the type of mass spectrometer. Neutral loss is frequently observed in ion trap mass spectrometers, which have lower collision energy and relatively longer activation time compared to QqQ or QTOF mass spectrometers. Moreover, the extent of neutral loss also appears to depend on the ratio of charge state to the number of basic amino acid residues [92]. When the ratio is higher, less neutral loss is observed. Since the mobile proton is available, the energy applied to charge-directed backbone fragmentation can be much lower. Phosphorylated tyrosine residues often lose HPO3 (80 Da), while loss of H3PO4 (98 Da) is more frequently observed for phosphorylated serine and threonine residues [93]. This is mainly due to the fact that the C-O bond of a phosphorylated tyrosine residue is stronger than that of phosphorylated serine and threonine residues.
Apart from the neutral loss issue, gas-phase rearrangement of phosphate groups between different amino acid residues has been observed in ion trap instruments. This rearrangement complicates the correct and confident localization of phosphorylation sites. Palumbo et al. demonstrated that in gas phase and prior

Fig. 17.9 Neutral loss in CAD fragmentation. (a) Fragmentation pattern with loss of phosphoric acid from a
multiply protonated phosphopeptide by CAD (Adapted from [90]). (b) CAD spectrum of the [M + 2H]2+ ion of
RLPIFNRIpSVSE (m/z 756), dominated by neutral loss of phosphoric acid (Adapted from [94])
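The m/z values behind Fig. 17.9b can be checked with a short calculation. The sketch below uses standard monoisotopic masses (rounded to five decimals) to compute the doubly protonated precursor of RLPIFNRIpSVSE and the product ions expected after neutral loss of H3PO4 or HPO3. The helper names are hypothetical, and the phosphate is passed as a count rather than parsed from the sequence string.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE = {
    "R": 156.10111, "L": 113.08406, "P": 97.05276, "I": 113.08406,
    "F": 147.06841, "N": 114.04293, "S": 87.03203, "V": 99.06841,
    "E": 129.04259,
}
WATER = 18.01056    # H2O
PROTON = 1.00728    # H+
HPO3 = 79.96633     # mass added by one phosphate group
H3PO4 = 97.97690    # phosphoric acid, equal to HPO3 + H2O


def neutral_mass(sequence, n_phospho=0):
    """Monoisotopic mass of the neutral peptide with n_phospho phosphates."""
    return sum(RESIDUE[aa] for aa in sequence) + WATER + n_phospho * HPO3


def mz(mass, charge):
    """m/z of a species of the given neutral mass carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


peptide = "RLPIFNRISVSE"              # RLPIFNRIpSVSE without the 'p' marker
mass = neutral_mass(peptide, n_phospho=1)
precursor = mz(mass, 2)               # [M + 2H]2+
loss_h3po4 = mz(mass - H3PO4, 2)      # 98 Da loss, typical for pSer/pThr
loss_hpo3 = mz(mass - HPO3, 2)        # 80 Da loss, typical for pTyr

print(f"[M + 2H]2+          m/z {precursor:.2f}")
print(f"[M + 2H - H3PO4]2+  m/z {loss_h3po4:.2f}")
print(f"[M + 2H - HPO3]2+   m/z {loss_hpo3:.2f}")
```

For a doubly charged precursor the 98 Da loss appears as a shift of about 49 m/z units, which is why neutral loss peaks have to be interpreted together with the charge state.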

to fragmentation, phosphate groups can transfer to an unmodified hydroxyl-containing amino acid residue [95].

MS3 Scan and Multistage Activation (MSA)
To obtain more sequence information for peptides carrying labile modifications that show neutral loss in CID MS/MS, the MS3 (MS/MS/MS) scan mode has been developed in ion trap mass spectrometers. An MS3 scan is usually triggered in a data-dependent manner when a major neutral loss peak is observed. However, MS3 analysis may not yield an unambiguous phosphosite localization due to the loss of H3PO4 prior to MS3 fragmentation. Additional complications in MS3 data interpretation may arise when the combined losses of HPO3 from the phosphorylated residue and H2O from another
non-phosphorylated residue yield a neutral loss of 98 Da [95].
An alternative approach to the MS3 scan is the MSA scan, also termed pseudo MS3 scan, which uses a supplemental selective activation of the common neutral loss products and activation of the precursor ion simultaneously, and then records all the fragment ions [94]. Since the MSA spectrum contains both MS2 and data-dependent MS3 spectra, there is no need for MSA to isolate and fragment the neutral loss ions. A detailed comparison of the scan cycles for MS2, data-dependent neutral loss MS3 and MSA has been presented by Gygi et al. [91]. Compared with the MS2 and MS3 spectra, the MSA spectrum contains the relative intensity for both b/y-ions and b/y-98 ions without the neutral loss peak. Successful MSA dissociation is usually performed on an ion trap mass spectrometer (LTQ), with relatively low mass resolution. However, it has been shown that the standard MS2 scan outperforms MSA and MS3 when the experiments are performed on high mass accuracy instrumentation (e.g. LTQ Orbitrap), in which accurate precursor mass determination is achieved in a high-throughput manner. The extra time needed to perform additional fragmentation in MSA or MS3 reduces the opportunities to sequence additional peptides [91].

High-Energy Collisional Dissociation (HCD)
High-energy collisional dissociation or higher-energy C-trap dissociation (HCD), also termed “beam-type collisional activation”, performs fragmentation in higher-energy collision dissociation cells with higher energy and shorter activation time as compared with ion trap CID. HCD tends to produce fewer neutral loss peaks and more sequence-specific fragment ions, which predominantly comprise y-ions and some b-ions. The proportion of b-ions is smaller than that of y-ions because b-ions tend to be fragmented further to a-ions in HCD mode. With higher energy, HCD can also produce smaller species than ion trap CID, such as immonium ions, to help identify specific modified residues [98, 106]. Since HCD fragmentation overcomes the “low mass cutoff” issue of ion trap fragmentation, it has been widely adopted for PTM analysis [98].
A detailed comparison of HCD and CID fragmentation of a synthetic phosphopeptide is shown in Fig. 17.10. In the low-mass region, HCD produced a clear a2/b2 ion pair and y1 and y2 ions with relatively higher abundance. Furthermore, the phosphotyrosine-specific immonium ion at m/z 216.0426 can be detected in the HCD spectrum with high confidence. However, a full consensus has not yet been reached on whether a higher resolution, slower acquisition speed HCD-based strategy or a lower resolution, faster acquisition speed CID-based acquisition is better for modified peptide identification (e.g. phosphorylation) [106, 107].

Electron-Based Dissociations
Compared with CID, electron-based dissociation methods, such as electron capture dissociation (ECD) and electron transfer dissociation (ETD), yield sequence fragments while maintaining the modified group. In ECD, precursor ions are bombarded with near-thermal electrons (<0.2 eV). The basic amide carbonyl oxygen can abstract a proton from an amino acid residue in the sequence. The N-Cα bond then dissociates with a very low energy barrier, leading to c- and z-type ions. In this process the peptide ion captures an electron, which charge-reduces the peptide into a radical cation [108] (Fig. 17.11). In the ETD process, fluoranthene radical anions are used as reagents that transfer an electron to a peptide with multiple charges. This reaction reduces the peptide charge by one and then triggers peptide backbone fragmentation to produce a series of complementary c- and z-type fragment ions [109] (Fig. 17.11).
Figure 17.12 is a comparison between ETD and CAD analysis of a phosphopeptide [89]. The CAD spectrum (Fig. 17.12a) lacks sufficient fragmentation and cannot be matched to a correct sequence by database searching. In contrast, ETD fragmentation results in a more successful identification with near-complete backbone fragmentation (Fig. 17.12b). Proline, which frequently occurs in PTM motifs, does not cleave at its backbone amide bond due to its side chain
[Figure 17.10: (a) LTQ-MS/MS spectrum of YFMpTEpYVATR (relative abundance against m/z, 200 to 1,200), dominated by the doubly protonated precursor after loss of H3PO4 at m/z 671.78, with labeled a2, b2 and y-series ions. (b) HCD-MS/MS spectrum of the same peptide (m/z 100 to 1,100) with labeled a2 (283.14), b2 (311.14) and y-series ions, and an inset covering m/z 210 to 220 that shows the pY immonium ion at m/z 216.0426.]

Fig. 17.10 Detection of a synthetic phosphopeptide containing the sequence YFMpTEpYVATR (“pY” indicates phosphotyrosine). (a) CID fragmentation by linear ion trap in Orbitrap mass spectrometer. (b) HCD fragmentation of the same phosphopeptide. Inset: close-up of the region with the phosphotyrosine-specific immonium ion at m/z 216.0426 (Adapted from [98])
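Several of the peak labels in Fig. 17.10 can be reproduced from first principles. The sketch below is an illustration only: it computes singly charged b- and y-ions for YFMpTEpYVATR from standard monoisotopic masses, treating a lowercase "p" as a marker that adds HPO3 to the following residue. The parsing convention and function names are assumptions made for this example.

```python
# Monoisotopic residue masses (Da) for the amino acids in YFMpTEpYVATR.
RESIDUE = {
    "Y": 163.06333, "F": 147.06841, "M": 131.04049, "T": 101.04768,
    "E": 129.04259, "V": 99.06841, "A": 71.03711, "R": 156.10111,
}
WATER, PROTON, CO = 18.01056, 1.00728, 27.99491
HPO3, H3PO4 = 79.96633, 97.97690


def residue_masses(sequence):
    """Per-residue masses; a 'p' prefix adds HPO3 to the next residue."""
    masses, phospho = [], False
    for ch in sequence:
        if ch == "p":
            phospho = True
            continue
        masses.append(RESIDUE[ch] + (HPO3 if phospho else 0.0))
        phospho = False
    return masses


def b_ion(masses, n):
    """Singly charged b_n ion: first n residues plus one proton."""
    return sum(masses[:n]) + PROTON


def y_ion(masses, n):
    """Singly charged y_n ion: last n residues plus water and one proton."""
    return sum(masses[-n:]) + WATER + PROTON


masses = residue_masses("YFMpTEpYVATR")
precursor_2plus = (sum(masses) + WATER + 2 * PROTON) / 2

print(f"b2 {b_ion(masses, 2):.2f}   a2 {b_ion(masses, 2) - CO:.2f}")
for n in (1, 3, 4, 5, 6):
    print(f"y{n} {y_ion(masses, n):.2f}")
print(f"[M + 2H - H3PO4]2+ {precursor_2plus - H3PO4 / 2:.2f}")
```

The y5 ion is the first y-ion that contains the phosphotyrosine, so its mass includes the additional 79.97 Da; comparing such shifted and unshifted fragments is what allows a modification site to be localized.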

ring structure, and so the N-Cα bond N-terminal to the proline does not fragment by ETD [110]. As can be seen in Fig. 17.12b, the c- and z-type ions at the N-terminal side of proline are not detectable. This limitation does not exist in CAD fragmentation (Fig. 17.12a). Furthermore, CAD was found to be more effective than ETD in fragmenting peptides with lower net charge. Usually phosphoserine and phosphothreonine motifs with one or more basic residues fragment better by ETD. For large-scale, extensive identification of protein phosphorylation, though, both CID and ETD should be applied in a complementary manner to increase the number of identifications and the coverage of the peptide sequence [89, 111].
Several studies have confirmed that the efficiency of ETD fragmentation depends on charge state (>2) [89, 111] and on charge density (i.e. the ratio of charge to the number of amino acid residues) [93, 112]. Therefore, peptides generated by Lys-C or trypsin with higher charge density will be excellent targets for ETD analysis. ETD can break the backbone randomly in longer peptides, such as peptides generated by the proteinase Lys-C [113]. ETD can be

Fig. 17.11 Proposed ECD/ETD fragmentation mechanism of phosphorylated peptides (Adapted from [93])

Fig. 17.12 Analysis of phosphopeptide TRQsPQTLKR (“s” indicates phosphoserine). (a) CAD mass spectra of
the sequence. (b) ETD mass spectra of the same sequence (Adapted from [89])
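The c- and z-type ions discussed for Fig. 17.12b, including the gap next to proline, can be listed with a short calculation. This is a sketch under stated assumptions: it uses standard monoisotopic masses, reports singly charged c ions and radical z ions (often written z•), follows the caption in using a lowercase "s" for phosphoserine, and flags the cleavage N-terminal to proline as not expected. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the amino acids in TRQsPQTLKR.
RESIDUE = {
    "T": 101.04768, "R": 156.10111, "Q": 128.05858, "S": 87.03203,
    "P": 97.05276, "L": 113.08406, "K": 128.09496,
}
WATER, PROTON = 18.01056, 1.00728
NH3, NH2 = 17.02655, 16.01872
HPO3 = 79.96633


def parse(sequence):
    """Return (letters, masses); a lowercase 's' denotes phosphoserine."""
    letters, masses = [], []
    for ch in sequence:
        if ch == "s":
            letters.append("S")
            masses.append(RESIDUE["S"] + HPO3)
        else:
            letters.append(ch)
            masses.append(RESIDUE[ch])
    return letters, masses


def etd_ions(sequence):
    """List (c label, c m/z, z label, z m/z, blocked) for each backbone bond."""
    letters, masses = parse(sequence)
    n = len(masses)
    ions = []
    for i in range(1, n):
        # Bond between residue i and residue i + 1 (1-based numbering).
        # The ring of proline prevents the N-Calpha cleavage on its N-terminal side.
        blocked = letters[i] == "P"
        c = sum(masses[:i]) + NH3 + PROTON
        z = sum(masses[i:]) + WATER - NH2 + PROTON
        ions.append((f"c{i}", c, f"z{n - i}", z, blocked))
    return ions


for c_label, c_mz, z_label, z_mz, blocked in etd_ions("TRQsPQTLKR"):
    note = "  (not expected: N-terminal to Pro)" if blocked else ""
    print(f"{c_label:>3} {c_mz:9.3f}   {z_label:>3} {z_mz:9.3f}{note}")
```

For this peptide the sketch flags the c4/z6 pair, which corresponds to the bond between the phosphoserine and the proline.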
implemented on various instruments, such as Q-TOF [114] and linear ion trap-Orbitrap hybrid instruments [115].
Recently, novel hybrid fragmentation techniques have been developed based on the charge state and m/z value of the precursor ion. As reported by Heck et al. [121], a novel hybrid fragmentation technique termed EThcD (combining electron-transfer and higher-energy collision dissociation) is used for unambiguous phosphorylation site localization. ETD cannot cleave the N-Cα bond N-terminal to proline, which hinders phosphosite localization in proline-rich peptides. However, EThcD can specifically address this issue by generating both b/y- and c/z-type ions.
A data-dependent decision tree (DT) method was developed by Coon and co-workers. They designed and embedded a data-dependent decision tree algorithm in a QLT-Orbitrap capable of both CAD and ETD dissociation. Following the MS1 analysis in the Orbitrap, six data-dependent MS/MS activations were performed in either ETD-only, CAD-only or DT-based selection mode, with product ion analysis performed in the QLT. In the DT-based selection, CID or ETD is chosen automatically and in real time for every MS/MS event to fragment the precursor depending on its charge state and m/z. They compared the CAD-only and ETD-only analyses with DT-based selection in large-scale proteome analyses. Their data showed that the DT approach identified more phosphopeptides (7422) than either CAD (2801) or ETD (5874) alone [122]. More recently, Jyoti S. et al. have compared a data-dependent neutral loss-triggered ETD (DDNL) strategy to DT. In a DDNL method performed on an LTQ Orbitrap Velos hybrid mass spectrometer, all peptides were fragmented by CID and, if a prominent neutral loss peak corresponding to the loss of phosphoric acid was observed, the precursor ion was isolated for fragmentation with ETD [123].

Detection of Other Types of PTMs
Apart from phosphorylation, considering the different degrees of lability in other types of PTMs, variable polarities, charge states of precursor ions, etc., it is important to choose and optimize fragmentation strategies and scan modes suitable for each PTM analysis. A number of strategies have been proposed to improve sequence- and site-diagnostic fragmentation, including the use of neutral loss-triggered MS3 and MSA in ion traps, HCD, ETD/ECD, or a combination of these approaches [121, 122]. For example, monitoring a specific diagnostic ion of aDMA in a precursor ion scan can differentiate it from its isomer sDMA. That is because dimethylarginine is sufficiently stable under CID conditions, resulting in cleavage of the peptide backbone to provide sequence information [104].
Compared to phosphorylation, glycosylation is structurally a much more complex PTM with highly heterogeneous glycan structures. Most of the major fragmentation methods described above have been used for characterizing intact glycopeptides. HCD, with higher fragmentation energies and higher mass accuracy, is also a favored approach for glycosylation characterization. HCD, which generates diagnostic oxonium ions and Y1 ions (e.g. peptide + acetylglucosamine), has been proven highly effective for locating glycosylation sites [124]. ETD and ECD yield sequence fragments while maintaining the carbohydrate structure, thereby enabling site localization [110, 125].
Recently, Cooper et al. developed an HCD product ion-triggered ETD approach which effectively improves the accuracy and sensitivity in identifying both the glycosylation site and the peptide sequence simultaneously [126]. As shown in Fig. 17.13, after a full MS scan of an N-linked glycopeptide in the Orbitrap, an HCD MS/MS scan was triggered for a precursor ion at m/z 645.6194. If the diagnostic HexNAc (N-acetylhexosamine) oxonium ions (m/z 204.09) and HexHexNAc oxonium ions (m/z 366.14) were among the top 20 most abundant peaks, ETD MS/MS was then triggered in the linear ion trap to fragment the precursor ion (m/z 645.62) (Fig. 17.13c). The advantage of this approach is that the structure of the glycan and the sequence of the glycopeptide were determined simultaneously from HCD and ETD spectra

Fig. 17.13 HCD product ion-triggered ETD MS/MS of Lys-C digest of ribonuclease B. (a) A full MS scan (m/z 380–1600) recorded in the Orbitrap at retention time of 25.16 min. (b) HCD MS/MS spectrum of precursor ions at m/z 645.6194. (c) In linear ion trap supplemental activation ETD MS/MS of precursor ions with m/z 645.62 (Adapted from [126])

respectively. This approach is also routinely used in site mapping of O-linked glycosylated peptides. However, O-GalNAc and O-GlcNAc cannot be distinguished from each other by their signature ions alone [127]. In a recent study reported by Hart et al. [128], O-GlcNAc-modified peptides were specifically labeled with AMT-GalNAc for both enrichment and better fragmentation in ETD scans. As a result, the enriched peptides appeared in charge states of +3 or higher, which increased the fragmentation efficiency in ETD compared to untagged, native O-GlcNAc peptides. The AMT-GalNAc-GlcNAc modification introduces three major diagnostic oxonium fragment ions, which can be readily detected by HCD.

In summary, with the advancement of both hardware and software in hybrid mass spectrometry, a combination of variable fragmentation modes (e.g. HCD plus ETD) and different scan modes is a powerful approach for mapping various PTMs. The features of the different fragmentation methods are summarized in Table 17.2.

Recently, ion mobility spectrometry (IMS) was applied in PTM analysis. This technology can potentially separate isomers and/or variants, such as phosphorylated variants [129], glyco-isoforms [130], and variants of methylated and acetylated histone peptides [131]. Additionally, pulsed Q dissociation (PQD) combined with ETD can be applied to analyze O-GlcNAc peptides [132].
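The trigger rule of the HCD product ion-triggered ETD workflow (oxonium ions at m/z 204.09 and 366.14 among the 20 most abundant HCD peaks) can be illustrated with a short Python sketch. This is only an illustration, not the instrument's acquisition software: the function name should_trigger_etd, the representation of a spectrum as (m/z, intensity) pairs and the 0.02 m/z matching tolerance are assumptions introduced here. The sketch also assumes that both diagnostic ions must be present, which is one possible reading of the text.

```python
from typing import List, Tuple

# Diagnostic oxonium ion m/z values quoted in the text.
HEXNAC_MZ = 204.09
HEXHEXNAC_MZ = 366.14


def should_trigger_etd(
    peaks: List[Tuple[float, float]],
    top_n: int = 20,
    tol: float = 0.02,  # assumed matching tolerance in m/z units
    required: Tuple[float, ...] = (HEXNAC_MZ, HEXHEXNAC_MZ),
) -> bool:
    """Return True if every required oxonium ion is among the top_n most intense peaks.

    peaks is a list of (m/z, intensity) pairs from one HCD MS/MS spectrum.
    """
    # Keep only the top_n most abundant peaks of the HCD spectrum.
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    # Each diagnostic ion must match at least one of those peaks within tol.
    return all(any(abs(mz - target) <= tol for mz, _ in top) for target in required)


# Hypothetical spectra: 25 background peaks plus the two diagnostic ions.
background = [(500.0 + i, 100.0 + i) for i in range(25)]
glyco_spectrum = background + [(204.09, 1000.0), (366.14, 900.0)]
weak_spectrum = background + [(204.09, 1000.0), (366.14, 1.0)]

print(should_trigger_etd(glyco_spectrum))
print(should_trigger_etd(weak_spectrum))
```

If either diagnostic ion alone should be sufficient, the outer all() can be replaced by any(); the ranking step stays the same.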

Table 17.2 Features of different fragmentation methods for PTM detection

Fragment method: CAD/CID
  Advantages: 1. High speed; 2. High sensitivity [106]
  Limitations: 1. Lower net charged peptides required; 2. Labile PTMs are lost [89]; 3. Selective cleavage [116]
  Type of fragment ions: Neutral losses, less b-, y- product ions compared with MSA [94]

Fragment method: HCD
  Advantages: 1. No low mass cutoff and multiple cleavage events leading to richer fragments [106, 107]; 2. Higher resolution, higher dynamic range and less noisy compared with CID spectra [107]; 3. The neutral loss of phosphoric acid is unproductive and more sequence-specific fragment ions [106]
  Limitations: 1. Slower scan cycle [107]
  Type of fragment ions: b- and y- ions, b/y- additional neutral losses of NH3, H2O, HPO3, H3PO4 [117]; a2/b2 fragment ion pair; internal fragments, immonium ion [98, 117]

Fragment method: ETD/ECD
  Advantages: 1. Labile PTMs preserved [89]; 2. Break randomly for longer peptides, such as peptides generated by the proteinase Lys C [113, 118]
  Limitations: 1. Multiply charged (charge state > 2) peptides required [89, 119]; 2. The activation time is longer [111]; 3. Limited fragmentation efficiency of doubly charged species [111]
  Type of fragment ions: Mainly c- and z- fragment ions [89, 120]

Fragment method: MS3/MSA
  Advantages: 1. “Neutral loss” issue largely addressed; 2. Especially useful in low-mass resolution instrumentation [91]
  Limitations: 1. Need extra analysis time; 2. Low-abundance, sequence informative product ions are lost after isolation of the major neutral loss product in MS3 [94]; 3. Most effective for singly and doubly charged peptides [94]
  Type of fragment ions: b- and y-ions, along with several new cleavages (e.g. b/y-H3PO4), devoid of the major neutral loss fragment ion [94]
Note: CID/CAD indicates collision-induced dissociation/collision-activated decomposition; HCD indicates high-energy collisional dissociation or higher-energy C-trap dissociation; ETD indicates electron transfer dissociation; ECD indicates electron capture dissociation; MS3 indicates the data-dependent MS3 method; MSA indicates multistage activation, also named the pseudo-MS3 method

17.3.4 Bioinformatics Methods for Predicting and Identifying PTMs

MS-based methods provide tools for efficient localization of PTM sites on a global scale. The large amounts of data generated by modern mass spectrometers can lead to false identifications and should be interpreted carefully with detailed statistical analysis. In this section, several MS data interpretation software packages for confident localization of various PTM sites are discussed. Following that, a series of databases containing large datasets from global PTM analyses are introduced. These databases are useful resources for MS-based proteomic studies. Furthermore, we will sum up related bioinformatic tools for sophisticated analysis of biological pathways associated with PTMs. In this section we will cover the latest bioinformatics approaches that can be used to mine and analyze large datasets generated by MS-based workflows.

17.3.4.1 Localization of PTM Sites

Protein PTMs are important for understanding cell signaling and other important biological mechanisms. The raw MS data should be carefully processed to localize PTM sites with minimal errors. Traditional manual validation is hardly practical in large-scale PTM localization. In the past few years, a number of probability-based scoring systems have been developed for accurately localizing PTM sites on peptide sequences. In this part, several commonly used automated PTM localization algorithms, such as Ascore and PTMscore, will be discussed.

Ascore is an algorithm for localizing protein phosphorylation sites. When a peptide with multiple possible phosphosites (S/T/Y residues) is identified, Ascore measures the probability of each possible site being the phosphosite using “site-determining ions” extracted from MS/MS spectra. Site-determining ions are the critical b/y ions that can distinguish the correct phosphosite. However, it is always possible for an irrelevant peak to be randomly annotated as a site-determining ion in the MS/MS spectra. Ascore utilizes a cumulative binomial model to calculate the probability of a peak being randomly matched to one of the site-determining ions in the MS/MS spectra. A higher score implies a smaller probability of a random match and a higher confidence in phosphosite determination. An Ascore ≥ 19 indicates a 99 % or greater chance of correct phosphorylation site localization. An Ascore between 15 and 19 ensures >90 % certainty for the localization, and a score of 3–15 corresponds to a success rate of around 80 %. An Ascore < 3 means that the peptide MS/MS spectrum contains few or no site-determining ions for proper phosphosite localization. Figure 17.14 illustrates a processing example of Ascore [133].

PTMscore is another widely used localization tool. PTMscore has a similar algorithm

Fig. 17.14 Localizing a PTM site with Ascore [133]. Different possible PTM sites within a single peptide can be differentiated with the site-determining ions (Fig. 17.14c). Ascore measures the probability of all detected site-determining ions to be random matches and the two candidate sites with the highest score (lowest chance of their site-determining ions being random matches) are picked to calculate the Ascore (the score difference between them). (Cited from Nature Biotechnology, 24(10), 1285–1292 with permission [133])
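The scoring idea behind Ascore (a cumulative binomial probability of random matches, followed by a score difference between the two best candidate sites) can be illustrated with a minimal Python sketch. It assumes the usual convention of converting a probability P into a score as -10·log10(P), which is not spelled out in the text. The function names and the example numbers (8 site-determining ions, a 4 % chance of a random match) are hypothetical, and the spectrum preprocessing of a full implementation is omitted.

```python
import math


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more random matches."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def site_score(n_ions: int, n_matched: int, p_random: float) -> float:
    """Assumed convention: score = -10 * log10(P).

    A smaller probability of a purely random match gives a higher score.
    """
    prob = binomial_tail(n_ions, n_matched, p_random)
    return -10.0 * math.log10(prob)


def ascore_like(best_site_score: float, second_site_score: float) -> float:
    """Score difference between the two best candidate sites."""
    return best_site_score - second_site_score


# Hypothetical example: 8 site-determining ions per candidate site.
# Candidate A matches 6 of them, candidate B only 2.
score_a = site_score(8, 6, 0.04)
score_b = site_score(8, 2, 0.04)
print(round(ascore_like(score_a, score_b), 1))
```

The resulting difference would then be compared with the thresholds quoted in the text (for example, 19 for roughly 99 % certainty).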

as Ascore, which is also based on random binomial distributions [33]. Putative site-determining b and y ions are generated and matched against the actual MS spectra. The four most intense fragment ions in every 100 Da m/z interval of the MS2 or MS3 spectra are picked out. All possible combinations of the phosphorylation sites are tested (the putative ions against the actual spectra). The PTMscore algorithm then generates a score (PTM score) for each combination. According to the PTM score and the motifs, all tested peptides can be classified into four categories. Class I collects the phosphorylation sites with the highest localization probability (>0.75). In classes II and III, the sites have a localization probability that varies from 25 to 75 %. The sites in class II have to match at least one of 22 kinase motifs, whereas in class III this criterion is removed. A site with a probability of less than 25 % is sorted into class IV, which has the lowest confidence level.

Mascot Delta Score is another phosphorylation site localization scoring tool that is similar to Ascore in terms of sensitivity and specificity. However, it is worth mentioning that the Mascot Delta Score outperforms Ascore in tyrosine (Y) phosphosite localization. The Mascot Delta Score is calculated as the difference between the two top Mascot scores; these scores are used to identify the peptide sequence but can hardly localize the PTM site on their own. The Mascot Delta Score can deal with ions from various fragmentation techniques (each needs to be optimized separately) [134].

PhosphoRS is a newly developed PTM localization tool that is compatible with all common fragmentation techniques, such as ECD, ETD, HCD, and CID. Compared to Mascot Delta Score and Ascore, PhosphoRS can identify more phosphorylation sites at the same confidence level (>99 %) [135]. The fundamental algorithm of PhosphoRS applies a binomial probability. However, the algorithm should be optimized separately for each fragmentation technique. A comparison between PhosphoRS and other scoring systems is shown in Fig. 17.15a. The results show that the various fragmentation modes lead to similar accuracy with PhosphoRS analysis (Fig. 17.15b).

sLoMo (Site Localization of Modification) is a localization tool developed from the Ascore algorithm and capable of analyzing both CID and ECD/ETD generated ions. Furthermore, sLoMo can be used to perform site localization on a variety of modifications, such as oxidation and phosphorylation. The scoring algorithm uses a Poisson random distribution, which is similar to the cumulative binomial distribution. sLoMo is

Fig. 17.15 (a) The diagram reveals the numbers of non-redundant phosphorylation sites in the same sample using various localization tools. (b) The comparison among MSA, ETD, and HCD generated data. The percentage and the absolute numbers of phospho-sites are visualized. (Cited from Journal of Proteome Research, 10(12), 5354–5362 with permission [135])
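The four PTMscore confidence classes described in this section (class I above 0.75, classes II and III between 25 and 75 % with or without a kinase motif match, class IV below 25 %) amount to a simple decision rule. The Python sketch below is only an illustration of that rule: the function name ptm_site_class and the boolean motif flag are assumptions introduced here, and how the localization probability and the motif match are obtained is outside its scope.

```python
def ptm_site_class(localization_probability: float, matches_kinase_motif: bool) -> str:
    """Assign a phosphosite to class I-IV from its localization probability.

    The thresholds follow the text: > 0.75 is class I, 0.25 to 0.75 is class II
    (kinase motif matched) or class III (no motif required), < 0.25 is class IV.
    """
    if not 0.0 <= localization_probability <= 1.0:
        raise ValueError("localization_probability must lie between 0 and 1")
    if localization_probability > 0.75:
        return "I"
    if localization_probability >= 0.25:
        return "II" if matches_kinase_motif else "III"
    return "IV"


# Hypothetical sites: (localization probability, kinase motif matched?)
for prob, motif in [(0.98, False), (0.60, True), (0.60, False), (0.10, True)]:
    print(prob, motif, ptm_site_class(prob, motif))
```

Sites falling exactly on 0.25 or 0.75 are placed in classes II and III here, since the text defines class I as strictly greater than 0.75 and class IV as less than 25 %.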

also compatible with different data formats, such as Sequest and OMSSA [136].

Protein Prospector is a search engine that reports all modifications present in an identified peptide [137]. The core localization tool in Protein Prospector is called SLIP (Site Localization in Peptide). SLIP scores are generated by comparing how tightly the acquired MS/MS spectra match hypothetical MS/MS spectra generated by in silico fragmentation of peptides modified at all possible sites.

Oscore is a tool that exclusively differentiates O-GlcNAc peptides from unmodified peptides. It utilizes eight O-GlcNAcylation spectral features to calculate the sum of the normalized intensities divided by a rank value [132]. The score was tested with more than 700 GlcNAc spectra from the O-GlcNAc peptide database and about 11,300 non-O-GlcNAc spectra. An Oscore lower than 2.0 indicates the existence of an O-GlcNAc peptide.

PTMap is a sequence alignment software designed for accurate, full-spectrum identification of posttranslationally modified proteins [138]. This software integrates two logical score systems, the SUnmatched score and the PTMap score. A high SUnmatched score indicates that there are a large number of unmatched peaks with significant intensities in the MS/MS spectra, whereas the PTMap score estimates how well the sequence and the MS/MS spectra explain each other. A PTMap score over 1.0 indicates a confident match between the identified PTM sites and the hypothetical site. It is worth mentioning that PTMap is the only software that can identify novel PTMs with high accuracy.

17.3.4.2 PTMs Related Databases

Proteomics is a rapidly evolving field with an increasing number of datasets generated daily for different PTMs by various high-throughput LC-MS/MS platforms. Currently, MS-based proteomics approaches can map about 50,000 phosphorylation sites directly in a single cell line [81]. Table 17.3 summarizes various large-scale MS-based PTM studies from the past few years. With the increasing number of datasets, well-curated databases are becoming an indispensable resource for PTM localization and biological analysis.

Table 17.3 Statistics of large-scale PTM mapping
PTM type         Sites    Proteins  Reference
Phosphorylation  50,000   7832      [139]
Ubiquitination   20,000   5000      [140]
Acetylation      3600     1750      [81]
Methylation      1160     N/A       [141]
N-Glycosylation  6367     2352      [142]
O-Glycosylation  177      602       [143]
Sumoylation      N/A      593       [144]

There are a variety of databases with specific features and emphasis that contain large amounts of MS/MS data covering various PTMs. Comprehensive databases, such as PhosphoSitePlus or SysPTM2.0, aim at providing coverage for multiple PTMs. PhosphoSitePlus (http://www.phosphosite.org) is an updated version of PhosphoSite, which covers other common PTMs such as acetylation, methylation, ubiquitination and O-linked glycosylation in addition to phosphorylation [145]. Currently, 245,509 phosphorylation sites are stored in PhosphoSitePlus, more than in any other database. This database includes crucial information regarding the biological functions and structures of various modified proteins. PhosphoSitePlus is one of the most dynamic and continuously updated databases covering protein PTM information. In addition to PhosphoSitePlus, some newly developed databases also provide comprehensive information about various PTMs. dbPTM3.0 (http://dbptm.mbc.nctu.edu.tw) houses integrated data from 11 public resources along with manually curated MS/MS-based PTM data extracted from the research literature. This database stores information for more than 200,000 PTM sites with related information including PTM-regulated protein-protein interactions and the topologies of PTM-carrying transmembrane proteins [146]. Compendium of Protein Lysine Modifications (CPLM) is a specific database for protein lysine modifications (PLMs), which occur at the ε-amino groups of lysine residues. CPLM stores information for 200,000 sites from 12 different types of PLMs and the co-occurrences of

various PLMs on the same modification site (http://cplm.biocuckoo.org) [147].

Other than general databases, specific databases have also been developed to store information about specific types of PTMs. Due to the significant role that protein phosphorylation plays in crucial cellular processes, such as cellular growth and intercellular signaling, several databases are dedicated entirely to this modification. PhosphoSitePlus, mentioned above, is one of the largest PTM resources. Another commonly used phosphorylation database is a eukaryotic protein database called Phospho.ELM (http://phospho.elm.eu.org/), which provides information such as phosphopeptide sequence, absolute position, UniProt accession number and upstream motif information [148]. As of its latest update in 2012, Phospho.ELM had collected 42,914 non-redundant phospho-sites. PHOsphorylation SIte DAtabase (PHOSIDA) is another database built using MS data from screening datasets of phosphosites (http://www.phosida.com) [149]. In addition to the original 6600 phosphosites observed in HeLa cells, the database has also gathered information from other species. PHOSIDA has stored 70,095 phosphorylation sites and around 10,000 acetylation and N-glycosylation sites.

Like phosphorylation, different types of glycosylation play significant roles in many biological processes. O-GLYCBASE is one of the earliest glycoprotein databases; its collection holds 243 experimentally verified O-glycosylated proteins including 2413 different sites [150]. Unipep (http://www.unipep.org) is an N-linked glycosylation database that covers 9651 N-glycosylated peptides, including parent protein sequences and modification sites (predicted and identified) along with the related motifs (if available) [151]. GlycoProtDB (http://jcggdb.jp/rcmg/gpdb) is another database, constructed using data collected from a series of experiments in which nine mouse tissues and samples from other species such as Homo sapiens were systematically analyzed for detection of glycopeptides and their sites [152]. Another publicly available knowledgebase, UniCarbKB, is built on data stored in a previous database called GlycoSuiteDB. It also integrates data from other protein glycosylation resources, such as EUROCarbDB (structural), UniCarb-DB (experimental LC-MS/MS data), etc. [153].

Several ubiquitination/ubiquitin-like conjugation databases have also been introduced in recent years, including UUCD, SCUD and UbiProt [154–156]. Ubiquitin and Ubiquitin-like Conjugation Database 2.0 (UUCD: http://uucd.biocuckoo.org/) stores 117,703 proteins from 144 eukaryotic species [154]. A total of 1831 different ubiquitin-related enzymes and protein domains were collected from manually curated data and classified into various families. Saccharomyces Cerevisiae Ubiquitination Database (SCUD: http://scud.kaist.ac.kr/) is another ubiquitination database that specifically records 940 ubiquitinated proteins and 73 related enzymes in baker’s yeast [155]. Another resource named UbiProt (http://ubiprot.org.ru/) is a database that summarizes various ubiquitination protein substrates [156]. Each protein entry contains information about ubiquitination sites, conjugation cascade (polyubiquitin topography), literature references and links to related databases.

17.3.4.3 Pathway Analysis

PTM site localization tools and databases provide detailed PTM-related information and a distinct understanding of specific PTM distributions in specific proteins or cells. However, in proteomics research, scientists often aim at solving specific biological problems that require mapping a specific signaling pathway in the target organisms. As a result, signaling pathway analysis is a crucial step in hypothesis generation or testing; it combines PTM site information with the interaction dynamics of the system that may regulate a series of biological activities. Among various protein PTMs, phosphorylation signaling based on kinase and phosphatase activities is critical to nearly all cellular regulatory processes in archaeal, prokaryotic and eukaryotic organisms [157]. The pathway repository and analysis tools are essential in proteomics research and hence,

some of the representative tools will be introduced in the following section.

Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database storing genomic and related functional information provided by bioinformatic analysis of genomics, proteomics and metabolomics data [158]. KEGG PATHWAY is one of KEGG’s sub-databases and stores graphical representations of various cellular signaling pathways (http://www.kegg.jp/kegg/pathway.html). The database collects not only the metabolic pathways that are largely conserved in various species, but also the more complex regulatory pathways, such as signal transduction, cell cycle, etc. Furthermore, KEGG PATHWAY can automatically generate pathway diagrams that differ from existing reference pathways using manually provided data. As of October 5th, 2014, KEGG PATHWAY stored 465 manually drawn reference pathway maps and 318,245 computationally generated pathway maps in total.

KinomeXplorer is another useful platform for modeling interactions between various kinases and substrates present in human and other major eukaryotic model organisms [159]. The platform includes an improved NetworKIN, an algorithm that systematically predicts the motif-based network of kinases and their substrates, and an algorithm called NetPhorest for classifying phospho-sites according to the kinases and phospho-binding domains. The NetworKIN algorithm first identifies the kinase motif in a phosphoprotein sequence. Then, a tool named STRING is used to construct a network of specific interactions for each substrate [160]. NetPhorest also includes a comprehensive online atlas of linear motifs from specific kinases and phospho-binding domains. It also includes a series of probability-based classifiers for sorting the phosphorylation sites in terms of their linear motifs [161]. KinomeXplorer can also calculate the likelihood of various kinase-substrate relationships by combining the information from the NetworKIN and NetPhorest analyses, and reports the most probable kinase for a specific phosphorylation site as the one with the largest calculated score.

To sum up, pathway analysis provides crucial information to guide downstream research by mapping large-scale PTM identifications, especially phosphorylation sites, for a better understanding of the biological functions of PTMs.

17.4 PTM Crosstalk

In the past decade various enrichment strategies have been developed for global analysis of various protein posttranslational modifications. Immobilized metal affinity chromatography (IMAC) and TiO2 affinity enrichment methods are ubiquitously used for phosphopeptide enrichment. Antibodies have been raised to specifically recognize acetyllysine-containing peptides to study protein acetylation. Ubiquitinated peptides are enriched with antibodies against the diGly moiety, a remnant of the ubiquitin chain. To reduce the complexity of PTM samples and increase the coverage, peptide fractionation methods such as strong cation exchange, HILIC or isoelectric focusing are usually used. Orthogonal combinations of different enrichment and fractionation approaches have been examined to study crosstalk between various PTMs.

Using S. cerevisiae as a model organism in a study focused on crosstalk between phosphorylation and ubiquitination, researchers identified about 2100 phosphorylation sites co-localizing with 2189 ubiquitination sites in about 466 proteins [162], using two different serial enrichment methodologies (Fig. 17.16). The first PTM purification step used cobalt-NTA (nitrilotriacetic acid) affinity media to purify His-tagged ubiquitinated proteins, followed by trypsin digestion of half the flow-through and enrichment of diGly peptides with a monoclonal antibody against lysine-diGly. The rest of the proteins after Ub-enrichment were then digested with another specific enzyme, LysC, and subjected to phosphopeptide enrichment with IMAC or TiO2. In the second enrichment strategy, SCX chromatography is used to separate tryptic peptides by their solution charge after trypsin digestion, followed by diGly peptide enrichment. Bioinformatic investigation

Fig. 17.16 Two enrichment strategies in the context of the proteasome inhibition experiment [162]

of the data suggests that phosphorylation sites co-localized with ubiquitination sites were more conserved than the rest, demonstrating the functional importance of PTM crosstalk.

The prevalence of co-occurring modifications and the role they might play in regulating protein function are not fully understood. The same study also showed that certain proteasome substrates require specific phosphorylation for degradation, denoted as phosphodegrons. In a SILAC experiment, Btz-mediated proteasome inhibition caused, on average, a more than twofold increase in 12.9 % of ubiquitination sites and 3.4 % of phosphorylation sites on ubiquitinated proteins, suggesting either that already ubiquitinated proteins may be further ubiquitinated to increase the stoichiometry for faster degradation, or that ubiquitination regulates the phosphorylation state of proteins [162]. In some kinases, enrichment of ubiquitination sites near the domain activation loop and in the glycine-rich region is a mechanism for kinase regulation.

In another study focused on crosstalk between phosphorylation, ubiquitination and acetylation, a fine-tuned method for serial enrichment of these PTMs from the same sample (SEPTM) has been described. Serial enrichment from high pH reverse phase chromatography fractions [163] greatly increases the quality and quantity of peptide coverage. A small percentage (5 %) of the fractionated peptides was analyzed by LC-MS/MS and the remainder (95 %) was subjected to subsequent finely designed serial PTM enrichments. The original 24 fractions were pooled into 12 fractions for phosphopeptide enrichment (IMAC), then into 6 fractions for ubiquitination enrichment (anti-K(GG) antibody), with the rest used for acetylated peptide enrichment (anti-K(Ac) antibody). This serial enrichment combined with LC-MS/MS allowed detection of more than 20,000 phosphorylation, 15,000 ubiquitination and 3000 acetylation sites in about 8000 proteins, providing a deeper view of PTM crosstalk.

Possible crosstalk between O-GlcNAcylation and phosphorylation-mediated signaling has been explored numerous times in the past with limited success. Protein phosphorylation is catalyzed by hundreds of distinct kinases, whereas O-GlcNAc cycling is controlled mainly by two enzymes: polypeptide beta-N-acetylglucosaminyl transferase (OGT), which adds the modification, and beta-D-N-acetylglucosaminidase (OGA), which removes it; both gain specificity via transient associations with many other proteins. Based on this knowledge, a study in 2008 investigated the crosstalk between phosphorylation and glycosylation by detecting the changes in site-specific phosphorylation when GlcNAcylation is globally increased by inhibition of OGA [164]. As a result of GlcNAcylation up-regulation, more than 280 phosphorylation sites were found to be down-regulated and 148 sites up-regulated, suggesting an elaborate interplay between these two posttranslational modifications. This study also yielded the hypothesis that there might be competition between these two PTMs for the occupancy of the same or proximal sites, thereby regulating each other’s activity.

Though the above-mentioned study helped to understand the principles behind crosstalk between O-GlcNAcylation and phosphorylation, the study of crosstalk between glycosylation and other PTMs has had limited success, as glycosylation site mapping is still limited by the state of technology. These limitations are in large part due to the following factors: (1) low stoichiometry of O-GlcNAcylation at each site on proteins; (2) low ionization efficiency of O-GlcNAcylated peptides; (3) the lability of the β-linkage between the O-GlcNAc moiety and Ser/Thr. These problems have been investigated for a long time, which has resulted in optimized sample enrichment methods and a new generation of mass spectrometry fragmentation methods, such as electron capture dissociation (ECD) and electron transfer dissociation (ETD). A study reported in 2010 by Zihao Wang’s group described an efficient enrichment protocol specifically for O-GlcNAc-modified proteins and peptides that uses a small amount of sample for comprehensive mapping of O-GlcNAc-modified amino acids [128]. This method made use of a new biotin reagent named PC-PEG-biotin-alkyne for enrichment of O-GlcNAc-modified peptides [165]. This reagent contains a photo-cleavable 1, 2-(nitro-phenyl) ethyl moiety that reacts with O-GlcNAc-modified peptides but can later be released by photoactive cleavage (UV 254 nm), leaving a basic amino-methyltriazole tag at the O-GlcNAc modification site. The biotinylated peptides were enriched by affinity chromatography, released from the solid carrier and finally analyzed by ETD-MS. A heavy isotope labeled version of the photo-cleavable biotin alkyne is currently under synthesis for site-specific O-GlcNAc quantification. It was later shown that the flow-through from the avidin chromatography can be further enriched for other posttranslational modifications. This idea was first applied to investigate crosstalk between glycosylation and phosphorylation, in which phosphatase was inhibited during the labeling process to prevent the loss of phospho-sites [128]. The investigation of the interplay between phosphorylation and GlcNAcylation using a serial enrichment protocol, combined with SILAC, has mapped and quantified over 120 specific O-GlcNAc-modified residues and over 350 phosphorylated residues from only 15 μg of sample by MS/MS analysis.

References

1. Scott JD, Pawson T (2009) Cell signaling in space and time: where proteins come together and when they’re apart. Science 326(5957):1220–1224
2. Bensimon A, Heck AJ, Aebersold R (2012) Mass spectrometry-based proteomics and network biology. Annu Rev Biochem 81:379–405
3. Christopher W (2006) Posttranslational modification of proteins: expanding nature’s inventory. Roberts and Co. Publishers, Englewood, Colo., p xxi
4. Bakri Y et al (2005) Balance of MafB and PU.1 specifies alternative macrophage or dendritic cell fate. Blood 105(7):2707–2716

5. Macek B, Mann M, Olsen JV (2009) Global and site-specific quantitative phosphoproteomics: principles and applications. Annu Rev Pharmacol Toxicol 49:199–221
6. Mann M et al (2002) Analysis of protein phosphorylation using mass spectrometry: deciphering the phosphoproteome. Trends Biotechnol 20(6):261–268
7. Ihara Y, Nukina N, Miura R, Ogawara M (1986) Phosphorylated tau protein is integrated into paired helical filaments in Alzheimer’s disease. J Biochem 99(6):1807–1810
8. Pedersen B, Holscher T, Sato Y, Pawlinski R, Mackman N (2005) A balance between tissue factor and tissue factor pathway inhibitor is required for embryonic development and hemostasis in adult mice. Blood 105(7):2777–2782
9. Spiro RG (2002) Protein glycosylation: nature, distribution, enzymatic formation, and disease implications of glycopeptide bonds. Glycobiology 12(4):43R–56R
10. Lechner J, Wieland F (1989) Structure and biosynthesis of prokaryotic glycoproteins. Annu Rev Biochem 58:173–194
11. Trombetta ES (2003) The contribution of N-glycans and their processing in the endoplasmic reticulum to glycoprotein biosynthesis. Glycobiology 13(9):77R–91R
12. Gemmill TR, Trimble RB (1999) Overview of N- and O-linked oligosaccharide structures found in various yeast species. Biochim Biophys Acta 1426(2):227–237
13. Rudd PM, Elliott T, Cresswell P, Wilson IA, Dwek RA (2001) Glycosylation and the immune system. Science 291(5512):2370–2376
14. Kravtsova-Ivantsiv Y, Ciechanover A (2012) Non-canonical ubiquitin-based signals for proteasomal degradation. J Cell Sci 125(Pt 3):539–548
15. Hicke L (1999) Gettin’ down with ubiquitin: turning off cell-surface receptors, transporters and channels. Trends Cell Biol 9(3):107–112
21. Zhao S et al (2010) Regulation of cellular metabolism by protein lysine acetylation. Science 327(5968):1000–1004
22. Chaurasia MK et al (2014) A prawn core histone 4: derivation of N- and C-terminal peptides and their antimicrobial properties, molecular characterization and mRNA transcription. Microbiol Res 170:78
23. Yang XJ, Seto E (2008) Lysine acetylation: codified crosstalk with other posttranslational modifications. Mol Cell 31(4):449–461
24. Huang DT, Walden H, Duda D, Schulman BA (2004) Ubiquitin-like protein activation. Oncogene 23(11):1958–1971
25. Black JC, Van Rechem C, Whetstine JR (2012) Histone lysine methylation dynamics: establishment, regulation, and biological impact. Mol Cell 48(4):491–507
26. Jellinger KA (2010) The neuropathologic substrate of Parkinson disease dementia. Acta Neuropathol 119(1):151–153
27. Munshi A, Shafi G, Aliya N, Jyothy A (2009) Histone modifications dictate specific biological readouts. J Genet Genomics 36(2):75–88
28. Rabilloud T, Chevallet M, Luche S, Lelong C (2010) Two-dimensional gel electrophoresis in proteomics: past, present and future. J Proteomics 73(11):2064–2077
29. Wang P, Giese RW (1998) Phosphate-specific fluorescence labeling with BO-IMI: reaction details. J Chromatogr A 809(1–2):211–218
30. Abu-Lawi KI, Sultzer BM (1995) Induction of serine and threonine protein phosphorylation by endotoxin-associated protein in murine resident peritoneal macrophages. Infect Immun 63(2):498–502
31. Arad-Dann H, Beller U, Haimovitch R, Gavrieli Y, Ben-Sasson SA (1993) Immunohistochemistry of phosphotyrosine residues: identification of distinct intracellular patterns in epithelial and steroidogenic tissues. J Histochem Cytochem 41(4):513–519
32. MacDonald JA, Mackey AJ, Pearson WR, Haystead TAJ (2002) A strategy for the rapid identification of phosphorylation sites in the phosphoproteome. Mol Cell Proteomics 1(4):314–322
16. Hicke L (2001) Protein regulation by monoubiquitin.
Nat Rev Mol Cell Biol 2(3):195–201 33. Olsen JV et al (2006) Global, in vivo, and site-
17. Impens F, Radoshevich L, Cossart P, Ribet D (2014) specific phosphorylation dynamics in signaling
Mapping of SUMO sites and analysis of networks. Cell 127(3):635–648
SUMOylation changes induced by external stimuli. 34. Sugiyama N et al (2007) Phosphopeptide enrichment
Proc Natl Acad Sci U S A 111(34):12432–12437 by aliphatic hydroxy acid-modified metal oxide
18. Kamitani T, Kito K, Nguyen HP, Yeh ET (1997) chromatography for nano-LC-MS/MS in proteomics
Characterization of NEDD8, a developmentally applications. Mol Cell Proteomics 6(6):1103–1109
down-regulated ubiquitin-like protein. J Biol Chem 35. Ficarro SB, Parikh JR, Blank NC, Marto JA (2008)
272(45):28557–28562 Niobium (V) oxide (Nb2O5): application to
19. Ohsumi Y (2001) Molecular dissection of phosphoproteomics. Anal Chem 80(12):4606–4613
autophagy: two ubiquitin-like systems. Nat Rev 36. Larsen MR, Thingholm TE, Jensen ON,
Mol Cell Biol 2(3):211–216 Roepstorff P, Jørgensen TJD (2005) Highly selective
20. Loeb KR, Haas AL (1992) The interferon-inducible enrichment of phosphorylated peptides from peptide
15-kDa ubiquitin homolog conjugates to intracellu- mixtures using titanium dioxide microcolumns. Mol
lar proteins. J Biol Chem 267(11):7806–7813 Cell Proteomics 4(7):873–886
378 M. Ke et al.

37. Bodenmiller B, Mueller LN, Mueller M, Domon B, 52. Beausoleil SA et al (2004) Large-scale characteriza-
Aebersold R (2007) Reproducible isolation of dis- tion of HeLa cell nuclear phosphoproteins. Proc Natl
tinct, overlapping segments of the phosphoproteome. Acad Sci U S A 101(33):12130–12135
Nat Methods 4(3):231–237 53. Han G et al (2008) Large-scale phosphoproteome
38. Wu J, Shakey Q, Liu W, Schuller A, Follettie MT analysis of human liver tissue by enrichment and
(2007) Global profiling of phosphopeptides by tita- fractionation of phosphopeptides with strong anion
nia affinity enrichment. J Proteome Res 6 exchange chromatography. Proteomics 8
(12):4684–4689 (7):1346–1361
39. Villen J, Gygi SP (2008) The SCX/IMAC enrich- 54. Gilar M, Olivova P, Daly AE, Gebler JC (2005)
ment approach for global phosphorylation analysis Orthogonality of separation in two-dimensional liq-
by mass spectrometry. Nat Protoc 3(10):1630–1638 uid chromatography. Anal Chem 77(19):6426–6434
40. Zhou H et al (2013) Robust phosphoproteome 55. Reinders J, Sickmann A (2005) State-of-the-art in
enrichment using monodisperse microsphere-based phosphoproteomics. Proteomics 5(16):4052–4061
immobilized titanium (IV) ion affinity chromatogra- 56. Alpert AJ (2008) Electrostatic repulsion hydrophilic
phy. Nat Protoc 8(3):461–480 interaction chromatography for isocratic separation
41. Feng S et al (2007) Immobilized zirconium ion affin- of charged solutes and selective isolation of
ity chromatography for specific enrichment of phosphopeptides. Anal Chem 80(1):62–76
phosphopeptides in phosphoproteome analysis. Mol 57. Villén J, Beausoleil SA, Gerber SA, Gygi SP (2007)
Cell Proteomics 6(9):1656–1665 Large-scale phosphorylation analysis of mouse liver.
42. Posewitz MC, Tempst P (1999) Immobilized gallium Proc Natl Acad Sci 104(5):1488–1493
(III) affinity chromatography of phosphopeptides. 58. Zhai B, Villen J, Beausoleil SA, Mintseris J, Gygi SP
Anal Chem 71(14):2883–2892 (2008) Phosphoproteome analysis of drosophila
43. Andersson L, Porath J (1986) Isolation of metanogaster embryos. J Proteome Res 7
phosphoproteins by immobilized metal (Fe3+) affin- (4):1675–1682
ity chromatography. Anal Biochem 154(1):250–254 59. McNulty DE, Annan RS (2008) Hydrophilic interac-
44. Ficarro SB et al (2002) Phosphoproteome analysis tion chromatography reduces the complexity of the
by mass spectrometry and its application to Saccha- phosphoproteome and improves global
romyces cerevisiae. Nat Biotechnol 20(3):301–305 phosphopeptide isolation and detection. Mol Cell
45. Engholm-Keller K et al (2012) TiSH–a robust and Proteomics 7(5):971–980
sensitive global phosphoproteomics strategy 60. Song CX et al (2010) Reversed-phase-reversed-
employing a combination of TiO2, SIMAC, and phase liquid chromatography approach with high
HILIC. J Proteome 75(18):5749–5761 orthogonality for multidimensional separation of
46. Thingholm TE, Jensen ON, Robinson PJ, Larsen MR phosphopeptides. Anal Chem 82(1):53–56
(2008) SIMAC (sequential elution from IMAC), a 61. Sano A, Nakamura H (2004) Chemo-affinity of tita-
phosphoproteomics strategy for the rapid separation nia for the column-switching HPLC analysis of
of monophosphorylated from multiply phosphopeptides. Anal Sci 20(3):565–566
phosphorylated peptides. Mol Cell Proteomics 7 62. Kaji H et al (2003) Lectin affinity capture, isotope-
(4):661–671 coded tagging and mass spectrometry to identify
47. Zhou H et al (2008) Specific phosphopeptide enrich- N-linked glycoproteins. Nat Biotechnol 21
ment with immobilized titanium Ion affinity chroma- (6):667–672
tography adsorbent for phosphoproteome analysis. J 63. Wang L et al (2006) OK—Concanavalin A-captured
Proteome Res 7(9):3957–3967 glycoproteins in healthy human urine. Mol Cell Pro-
48. Beltran L, Casado P, Rodriguez-Prados JC, Cutillas teomics 5(3):560–562
PR (2012) Global profiling of protein kinase 64. Wisniewski JR, Nagaraj N, Zougman A, Gnad F,
activities in cancer cells by mass spectrometry. J Mann M (2010) Brain phosphoproteome obtained
Proteome 77:492–503 by a FASP-based method reveals plasma membrane
49. Hunter T, Sefton BM (1980) Transforming gene- protein topology. J Proteome Res 9(6):3280–3289
product of Rous-sarcoma virus phosphorylates tyro- 65. Yang Z, Hancock WS (2005) Monitoring glycosyla-
sine. Proc Natl Acad Sci U S A-Biol Sci 77 tion pattern changes of glycoproteins using multi-
(3):1311–1315 lectin affinity chromatography. J Chromatogr A
50. Matsuoka S et al (2007) ATM and ATR substrate 1070(1–2):57–64
analysis reveals extensive protein networks respon- 66. Madera M, Mechref Y, Novotny MV (2005) Com-
sive to DNA damage. Science 316(5828):1160–1166 bining lectin microcolumns with high-resolution
51. Gronborg M et al (2002) A mass spectrometry-based separation techniques for enrichment of
proteomic approach for identification of serine/thre- glycoproteins and glycopeptides. Anal Chem 77
onine-phosphorylated proteins by enrichment with (13):4081–4090
phospho-specific antibodies – Identification of a 67. Kaji H, Yamauchi Y, Takahashi N, Isobe T (2007)
novel protein, Frigg, as a protein kinase A substrate. Mass spectrometric identification of N-linked
Mol Cell Proteomics 1(7):517–527 glycopeptides using lectin-mediated affinity capture
17 Identification, Quantification, and Site Localization of Protein. . . 379

and glycosylation site-specific stable isotope tag- 83. Mann M, Jensen ON (2003) Proteomic analysis of
ging. Nat Protoc 1(6):3019–3027 post-translational modifications. Nat Biotechnol 21
68. Zhang H, X-j L, Martin DB, Aebersold R (2003) (3):255–261
Identification and quantification of N-linked 84. Tian R (2014) Exploring intercellular signaling by
glycoproteins using hydrazide chemistry, stable iso- proteomic approaches. Proteomics 14(4–5):498–512
tope labeling and mass spectrometry. Nat Biotechnol 85. Gropengiesser J, Varadarajan BT, Stephanowitz H,
21(6):660–666 Krause E (2009) The relative influence of phosphor-
69. Sun B et al (2007) Shotgun glycopeptide capture ylation and methylation on responsiveness of
approach coupled with mass spectrometry for com- peptides to MALDI and ESI mass spectrometry. J
prehensive glycoproteomics. Mol Cell Proteomics 6 Mass Spectrom 44(5):821–831
(1):141–149 86. Gao Y, Wang Y (2007) A method to determine the
70. Alley WR Jr, Mann BF, Novotny MV (2013) High- ionization efficiency change of peptides caused by
sensitivity analytical approaches for the structural phosphorylation. J Am Soc Mass Spectrom 18
characterization of glycoproteins. Chem Rev 113 (11):1973–1976
(4):2668–2732 87. Witze ES, Old WM, Resing KA, Ahn NG (2007)
71. Sun B, Hood L (2014) Protein-centric Mapping protein post-translational modifications
N-glycoproteomics analysis of membrane and with mass spectrometry. Nat Methods 4
plasma membrane proteins. J Proteome Res 13 (10):798–806
(6):2705–2714 88. Tuytten R et al (2006) Stainless steel electrospray
72. Wollscheid B et al (2009) Mass-spectrometric iden- probe: a dead end for phosphorylated organic
tification and relative quantification of N-linked cell compounds? J Chromatogr A 1104(1–2):209–221
surface glycoproteins. Nat Biotechnol 27 89. Swaney DL, Wenger CD, Thomson JA, Coon JJ
(4):378–386 (2009) Human embryonic stem cell
73. Teo CF et al (2010) Glycopeptide-specific monoclo- phosphoproteome revealed by electron transfer dis-
nal antibodies suggest new roles for O-GlcNAc. Nat sociation tandem mass spectrometry. Proc Natl Acad
Chem Biol 6(5):338–343 Sci 106(4):995–1000
74. Alfaro JF et al (2012) Tandem mass spectrometry 90. Syka JEP, Coon JJ, Schroeder MJ, Shabanowitz J,
identifies many mouse brain O-GlcNAcylated Hunt DF (2004) Peptide and protein sequence analy-
proteins including EGF domain-specific O-GlcNAc sis by electron transfer dissociation mass spectrome-
transferase targets. Proc Natl Acad Sci 109 try. Proc Natl Acad Sci U S A 101(26):9528–9533
(19):7280–7285 91. Villen J, Beausoleil SA, Gygi SP (2008) Evaluation
75. Anonsen JH, Vik A, Egge-Jacobsen W, Koomey M of the utility of neutral-loss-dependent MS3
(2012) An extended spectrum of target proteins and strategies in large-scale phosphorylation analysis.
modification sites in the general O-linked protein Proteomics 8(21):4444–4452
glycosylation system in Neisseria gonorrhoeae. J 92. Palumbo AM, Tepe JJ, Reid GE (2008) Mechanistic
Proteome Res 11(12):5781–5793 insights into the multistage gas-phase fragmentation
76. Peng J et al (2003) A proteomics approach to under- behavior of phosphoserine- and phosphothreonine-
standing protein ubiquitination. Nat Biotechnol 21 containing peptides. J Proteome Res 7(2):771–779
(8):921–926 93. Boersema PJ, Mohammed S, Heck AJR (2009)
77. Tagwerker C et al (2006) A tandem affinity tag for Phosphopeptide fragmentation and analysis by
two-step purification under fully denaturing mass spectrometry. J Mass Spectrom 44(6):861–878
conditions – Application in ubiquitin profiling and 94. Schroeder MJ, Shabanowitz J, Schwartz JC, Hunt
protein complex identification combined with DF, Coon JJ (2004) A neutral loss activation method
in vivo cross-linking. Mol Cell Proteomics 5 for improved phosphopeptide sequence analysis by
(4):737–748 quadrupole ion trap mass spectrometry. Anal Chem
78. Xu G, Paige JS, Jaffrey SR (2010) Global analysis of 76(13):3590–3598
lysine ubiquitination by ubiquitin remnant 95. Palumbo AM, Reid GE (2008) Evaluation of
immunoaffinity profiling. Nat Biotechnol 28 Gas-phase rearrangement and competing fragmenta-
(8):868–873 tion reactions on protein phosphorylation site assign-
79. Kim W et al (2011) Systematic and quantitative ment using collision induced dissociation-MS/MS
assessment of the ubiquitin-modified proteome. and MS3. Anal Chem 80(24):9735–9747
Mol Cell 44(2):325–340 96. Cain JA, Solis N, Cordwell SJ (2014) Beyond gene
80. Kim SC et al (2006) Substrate and functional diver- expression: the impact of protein post-translational
sity of lysine acetylation revealed by a proteomics modifications in bacteria. J Proteome 97:265–286
survey. Mol Cell 23(4):607–618 97. Hung C-W, Schlosser A, Wei J, Lehmann WD
81. Choudhary C et al (2009) Lysine acetylation targets (2007) Collision-induced reporter fragmentations
protein complexes and co-regulates major cellular for identification of covalently modified peptides.
functions. Science 325(5942):834–840 Anal Bioanal Chem 389(4):1003–1016
82. Mertins P et al (2013) Integrated proteomic analysis 98. Olsen JV et al (2007) Higher-energy C-trap dissoci-
of post-translational modifications by serial enrich- ation for peptide modification analysis. Nat Methods
ment. Nat Methods 10(7):634–637 4(9):709–712
380 M. Ke et al.

99. Li X et al (2007) Large-scale phosphorylation anal- 114. Xia Y et al (2006) Implementation of ion/ion
ysis of alpha-factor-arrested Saccharomyces reactions in a quadrupole/time-of-flight tandem
cerevisiae. J Proteome Res 6(3):1190–1197 mass spectrometer. Anal Chem 78(12):4146–4154
100. Myung S et al (2011) High-capacity ion trap coupled 115. McAlister GC et al (2008) A proteomics grade elec-
to a time-of-flight mass spectrometer for comprehen- tron transfer dissociation-enabled hybrid linear ion
sive linked scans with no scanning losses. Int J Mass trap-Orbitrap mass spectrometer. J Proteome Res 7
Spectrom 301(1–3):211–219 (8):3127–3136
101. Chaze T et al (2014) O-Glycosylation of the 116. Wysocki VH, Tsaprailis G, Smith LL, Breci LA
N-terminal region of the serine-rich adhesin Srr1 of (2000) Special feature: commentary – mobile and
streptococcus agalactiae explored by mass spectrom- localized protons: a framework for understanding
etry. Mol Cell Proteomics 13(9):2168–2182 peptide dissociation. J Mass Spectrom 35
102. Larsen MR, Trelle MB, Thingholm TE, Jensen ON (12):1399–1406
(2006) Analysis of posttranslational modifications of 117. Michalski A, Neuhauser N, Cox J, Mann M (2012) A
proteins by tandem mass spectrometry. systematic investigation into the nature of tryptic
Biotechniques 40(6):790–798 HCD spectra. J Proteome Res 11(11):5479–5491
103. Melo-Braga MN et al (2012) Modulation of protein 118. Zubarev RA, Kelleher NL, McLafferty FW (1998)
phosphorylation, N-Glycosylation and Electron capture dissociation of multiply charged
Lys-Acetylation in grape (Vitis vinifera) mesocarp protein cations. A nonergodic process. J Am Chem
and exocarp owing to lobesia botrana infection. Mol Soc 120(13):3265–3266
Cell Proteomics 11(10):945–956 119. Cooper HJ, Hakansson K, Marshall AG (2005) The
104. Rappsilber J, Friesen WJ, Paushkin S, Dreyfuss G, role of electron capture dissociation in biomolecular
Mann M (2003) Detection of arginine dimethylated analysis. Mass Spectrom Rev 24(2):201–222
peptides by parallel precursor ion scanning mass 120. Bakhtiar R, Guan ZQ (2005) Electron capture disso-
spectrometry in positive ion mode. Anal Chem 75 ciation mass spectrometry in characterization of
(13):3107–3114 post-translational modifications. Biochem Biophys
105. Na CH, Peng J (2012) Analysis of ubiquitinated Res Commun 334(1):1–8
proteome by quantitative mass spectrometry. 121. Frese CK et al (2013) Unambiguous phosphosite
Methods Mol Biol 893:417–429 localization using Electron-Transfer/Higher-Energy
106. Jedrychowski MP et al (2011) Evaluation of HCD- collision Dissociation (EThcD). J Proteome Res 12
and CID-type fragmentation within their respective (3):1520–1525
detection platforms for murine phosphoproteomics. 122. Swaney DL, McAlister GC, Coon JJ (2008) Decision
Mol Cell Proteomics 10(12):M111 009910 tree-driven tandem mass spectrometry for shotgun
107. Nagaraj N, D’Souza RCJ, Cox J, Olsen JV, Mann M proteomics. Nat Methods 5(11):959–964
(2010) Feasibility of large-scale phosphoproteomics 123. Collins MO, Wright JC, Jones M, Rayner JC,
with higher energy collisional dissociation fragmen- Choudhary JS (2014) Confident and sensitive
tation. J Proteome Res 9(12):6786–6794 phosphoproteomics using combinations of collision
108. Syrstad EA, Turecek F (2005) Toward a general induced dissociation and electron transfer dissocia-
mechanism of electron capture dissociation. J Am tion. J Proteome 103:1–14
Soc Mass Spectrom 16(2):208–224 124. Hart-Smith G, Raftery MJ (2012) Detection and
109. Chi A et al (2007) Analysis of phosphorylation sites characterization of low abundance glycopeptides
on proteins from Saccharomyces cerevisiae by elec- via higher-energy C-Trap dissociation and orbitrap
tron transfer dissociation (ETD) mass spectrometry. mass analysis. J Am Soc Mass Spectrom 23
Proc Natl Acad Sci U S A 104(7):2193–2198 (1):124–140
110. Mikesh LM et al (2006) The utility of ETD mass 125. Hakansson K et al (2001) Electron capture dissocia-
spectrometry in proteomic analysis. Biochim tion and infrared multiphoton dissociation MS/MS
Biophys Acta 1764(12):1811–1822 of an N-glycosylated tryptic peptide to yield comple-
111. Frese CK et al (2011) Improved peptide identifica- mentary sequence information. Anal Chem 73
tion by targeted fragmentation using CID, HCD and (18):4530–4536
ETD on an LTQ-Orbitrap Velos. J Proteome Res 10 126. Singh C, Zampronio CG, Creese AJ, Cooper HJ
(5):2377–2388 (2012) Higher Energy Collision Dissociation
112. Good DM, Wirtala M, McAlister GC, Coon JJ (HCD) product ion-triggered Electron Transfer Dis-
(2007) Performance characteristics of electron trans- sociation (ETD) mass spectrometry for the analysis
fer dissociation mass spectrometry. Mol Cell Prote- of N-linked glycoproteins. J Proteome Res 11
omics 6(11):1942–1951 (9):4517–4525
113. Molina H, Horn DM, Tang N, Mathivanan S, Pandey 127. Zhao P et al (2011) Combining high-energy C-trap
A (2007) Global proteomic profiling of dissociation and electron transfer dissociation for
phosphopeptides using electron transfer dissociation protein O-GlcNAc modification site assignment. J
tandem mass spectrometry. Proc Natl Acad Sci U S Proteome Res 10(9):4088–4104
A 104(7):2199–2204
17 Identification, Quantification, and Site Localization of Protein. . . 381

128. Wang Z et al (2010) Enrichment and site mapping of metabolic-regulation. Rev Aquat Sci 4
O-linked N-acetylglucosamine by a combination of (2–3):225–259
chemical/enzymatic tagging, photochemical cleav- 144. Owens DR (2002) New horizons – alternative routes
age, and electron transfer dissociation mass spec- for insulin therapy. Nat Rev Drug Discov 1
trometry. Mol Cell Proteomics 9(1):153–160 (7):529–540
129. Shvartsburg AA, Singer D, Smith RD, Hoffmann R 145. Hornbeck PV et al (2012) PhosphoSitePlus: a com-
(2011) Ion mobility separation of isomeric prehensive resource for investigating the structure
phosphopeptides from a protein with variant modifi- and function of experimentally determined post-
cation of adjacent residues. Anal Chem 83 translational modifications in man and mouse.
(13):5078–5085 Nucleic Acids Res 40(D1):D261–D270
130. Creese AJ, Cooper HJ (2012) Separation and identi- 146. Lu CT et al (2013) dbPTM 3.0: an informative
fication of isomeric glycopeptides by high field resource for investigating substrate site specificity
asymmetric waveform Ion mobility spectrometry. and functional association of protein post-
Anal Chem 84(5):2597–2601 translational modifications. Nucleic Acids Res 41
131. Shvartsburg AA, Zheng Y, Smith RD, Kelleher NL (D1):D295–D305
(2012) Ion mobility separation of variant histone 147. Liu ZX et al (2014) CPLM: a database of protein
tails extending to the “middle-down” range. Anal lysine modifications. Nucleic Acids Res 42(D1):
Chem 84(10):4271–4276 D531–D536
132. Hahne H, Kuster B (2011) A novel two-stage tandem 148. Dinkel H et al (2011) Phospho.ELM: a database of
mass spectrometry approach and scoring scheme for phosphorylation sites-update 2011. Nucleic Acids
the identification of O-GlcNAc modified peptides. J Res 39:D261–D267
Am Soc Mass Spectrom 22(5):931–942 149. Gnad F, Gunawardena J, Mann M (2011) PHOSIDA
133. Beausoleil SA, Villen J, Gerber SA, Rush J, Gygi SP 2011: the posttranslational modification database.
(2006) A probability-based approach for high- Nucleic Acids Res 39:D253–D260
throughput protein phosphorylation analysis and 150. Gupta R, Birch H, Rapacki K, Brunak S, Hansen JE
site localization. Nat Biotechnol 24(10):1285–1292 (1999) O-GLYCBASE version 4.0: a revised data-
134. Savitski MM et al (2011) Confident phosphorylation base of O-glycosylated proteins. Nucleic Acids Res
site localization using the mascot delta score. Mol 27(1):370–372
Cell Proteomics 10(2):M110.003830 151. Zhang H et al (2006) UniPep – a database for human
135. Taus T et al (2011) Universal and confident phos- N-linked glycosites: a resource for biomarker dis-
phorylation site localization using phosphoRS. J Pro- covery. Genome Biol 7(8):R73
teome Res 10(12):5354–5362 152. Kaji H et al (2012) Large-scale identification of
136. Bailey CM et al (2009) SLoMo: automated site N-glycosylated proteins of mouse tissues and con-
localization of modifications from ETD/ECD mass struction of a glycoprotein database, GlycoProtDB. J
spectra. J Proteome Res 8(4):1965–1971 Proteome Res 11(9):4553–4566
137. Baker PR, Trinidad JC, Chalkley RJ (2011) Modifi- 153. Campbell MP et al (2014) UniCarbKB: building a
cation site localization scoring integrated into a knowledge platform for glycoproteomics. Nucleic
search engine. Mol Cell Proteomics 10(7): Acids Res 42(D1):D215–D221
M111.008078 154. Gao TS et al (2013) UUCD: a family-based database
138. Chen Y, Chen W, Cobb MH, Zhao YM (2009) of ubiquitin and ubiquitin-like conjugation. Nucleic
PTMap-A sequence alignment software for unre- Acids Res 41(D1):D445–D451
stricted, accurate, and full-spectrum identification 155. Lee WC, Lee M, Jung JW, Kim KP, Kim D (2008)
of post-translational modification sites. Proc Natl SCUD: Saccharomyces Cerevisiae Ubiquitination
Acad Sci U S A 106(3):761–766 Database. BMC Genomics 9:7
139. Sharma K et al (2014) Ultradeep human 156. Chernorudskiy AL et al (2007) UbiProt: a database
phosphoproteome reveals a distinct regulatory nature of ubiquitinated proteins. Bmc Bioinf 8:126
of Tyr and Ser/Thr-based signaling. Cell Rep 8:1583 157. Fiedler D et al (2009) Functional organization of the
140. Udeshi ND et al (2013) Refined preparation and use S-cerevisiae phosphorylation network. Cell 136
of anti-diglycine remnant (K-epsilon-GG) antibody (5):952–963
enables routine quantification of 10,000 s of 158. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclo-
ubiquitination sites in single proteomics pedia of Genes and Genomes. Nucleic Acids Res 28
experiments. Mol Cell Proteomics 12(3):825–831 (1):27–30
141. Guo AL et al (2014) Immunoaffinity enrichment and 159. Horn H et al (2014) KinomeXplorer: an integrated
mass spectrometry analysis of protein methylation. platform for kinome biology studies. Nat Methods
Mol Cell Proteomics 13(1):372–387 11(6):603–604
142. Zielinska DF, Gnad F, Wisniewski JR, Mann M 160. Linding R et al (2008) NetworKIN: a resource for
(2010) Precision mapping of an in vivo exploring cellular phosphorylation networks.
N-glycoproteome reveals rigid topological and Nucleic Acids Res 36:D695–D699
sequence constraints. Cell 141(5):897–907 161. Miller ML et al (2008) Linear motif atlas for
143. Mommsen TP, Plisetskaya EM (1991) Insulin in phosphorylation-dependent signaling. Sci Signal 1
fishes and agnathans – history, structure, and (35):ra2
382 M. Ke et al.

162. Swaney DL et al (2013) Global analysis of phos- specific phosphorylation dynamics in response to
phorylation and ubiquitination cross-talk in protein globally elevated O-GlcNAc. Proc Natl Acad Sci U
degradation. Nat Methods 10(7):676–682 S A 105(37):13793–13798
163. Wang Y et al (2011) Reversed-phase chromatogra- 165. Olejnik J, Sonar S, Krzymanska-Olejnik E,
phy with multiple fraction concatenation strategy for Rothschild KJ (1995) Photocleavable biotin
proteome profiling of human MCF10A cells. PRO- derivatives: a versatile approach for the isolation of
TEOMICS 11(10):2019–2026 biomolecules. Proc Natl Acad Sci U S A 92
164. Wang Z, Gucek M, Hart GW (2008) Cross-talk (16):7590–7594
between GlcNAcylation and phosphorylation: site-
18 Protein-Protein Interaction Detection Via Mass Spectrometry-Based Proteomics

Benedetta Turriziani, Alexander von Kriegsheim, and Stephen R. Pennington

Abstract
Analysis of protein-protein interactions is one of the mainstays of mass spectrometry-based proteomics, and recent developments, which have simplified the methodology, have permitted non-specialised laboratories to adopt the approach. We introduce and review three complementary methods which allow for the targeted, global and site-specific analysis of protein complexes. Co-precipitation of endogenous or ectopically expressed proteins and their complexes, followed by proteomic analysis, allows for the discovery and accurate quantification of specific protein interactions. Complementary methods, such as co-purification of entire complexes based on physico-chemical attributes, can give a snapshot of the composition and dynamics of protein complexes on a global scale. Cross-linking, on the other hand, can pinpoint the amino acids involved in protein-protein interactions to such a resolution that the likely complex can be reconstructed computationally.

Keywords
Protein complex • Co-purification • Cross-linking • Co-elution •
Interaction proteomics

B. Turriziani • A. von Kriegsheim
Systems Biology Ireland, Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland

S.R. Pennington (*)
School of Medicine and Medical Sciences, UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin 4, Ireland
e-mail: stephen.pennington@ucd.ie

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_18

18.1 Introduction

High-throughput DNA sequencing allowed, for the first time, the correlation of pathologies with specific gene aberrations. Unfortunately, the genomic information itself is not enough for a comprehensive understanding of the mechanisms that bring about the pathological transformations. The number of proteins is larger than the number of coding genes, due to the presence of additional levels of regulation during and after protein translation. Events such as splicing and post-translational modifications, which regulate protein activation, localization and degradation, all add to proteome complexity. All these factors are regulated by the interaction of proteins with other proteins, working cooperatively in intricate signalling networks. For these reasons, dynamic and static protein-protein interactions are essential cornerstones of all signalling networks. Thus, identifying and quantifying protein interactions is of great interest within the biological and medical sciences as well as the emerging field of systems biology.

Interaction proteomics aims to study protein interactions in an unbiased manner, and the information obtained by this approach can be used to determine how proteins assemble in complexes and form networks. These methods must not only aim to identify novel interactions, but should ideally reveal precise protein collocation in the specific biological system investigated. In order to maximize the information gained from interaction screens, interaction studies are ideally designed not only to identify specific binders but also to accurately quantify dynamic changes in protein-protein interactions in response to perturbations.

Since the focus of research shifted from genome to proteome, a large number of methods have been developed to detect protein-protein interactions [1]. One of the most widely used methods is yeast two-hybrid screening (Y2H). The system was first described by Fields and Song in 1989 [2] using a Saccharomyces cerevisiae model. The yeast two-hybrid system is a genetic screening system in which protein interactions are detected by the signal of a reporter gene. Two plasmids are required to perform the screening. In the first plasmid the protein of interest, called the bait, is fused with the DNA-binding domain (BD) of a yeast transcription factor, generally Gal4. The BD binds a region upstream of the promoter of the reporter gene. In the other plasmid a second protein, called the prey, is fused with the activation domain (AD) of the transcription factor Gal4. Both the BD and AD of Gal4 are required for the activation of the reporter gene. Thus, the transcription of the reporter gene only occurs when bait and prey interact with each other and form a functional transcription factor complex. Therefore, the interaction between bait and prey can be detected by the signal resulting from reporter gene expression [3].

Although genome-wide interaction screens using Y2H have been undertaken, the method has several drawbacks. One of the major limitations of Y2H is the reliability of the data generated. The rate of false positives and false negatives among the interactions identified can be higher than 60 % in some cases [4]. The main reasons for this unreliability are connected to both biological and technical factors. The biochemical differences in protein translation and regulation between the host, which is generally yeast, and the subject of study are the first parameter to keep in mind. Expressing two proteins in yeast may lead to a physiologically irrelevant interaction, as in physiological conditions these two proteins may be expressed in different compartments, in different cells, or even at different times in the cell cycle. Crucially, protein interactions are frequently regulated by post-translational modifications (PTMs), which may not occur in yeast. For instance, if the interaction between two proteins requires the phosphorylation of one of them, in yeast this interaction may not occur. Since such regulation by PTMs is frequent, this is an aspect that dramatically affects the efficiency of Y2H as a tool for interaction proteomics studies. A further limitation is that Y2H fails to provide information about the dynamics of an interaction. The last aspect that makes Y2H-generated data unreliable is that it only detects direct binary interactions. Taken together, since there is interest in studying dynamic protein complexes in their physiological context, all these limitations highlight the need for complementary methods.

The possibility to work in near physiological conditions and, more importantly, to provide information about the stoichiometry and dynamics of protein complexes has made affinity purification coupled to mass spectrometry (AP-MS) [5] a suitable complementary method to Y2H. The AP
requires purification of proteins and their complexes by enriching the bait protein through
affinity purification of endogenous or tagged proteins with specific antibodies or other
affinity tags. In essence, the tagged or endogenous bait protein and its complexes are
captured by an immobilized matrix, processed and subsequently analysed by mass
spectrometry. Commonly used tags are Flag, GFP, Myc, His and HA, each of which varies in
terms of efficiency of purification [6]. In order to reduce the complexity of the sample,
eluted proteins are subsequently separated by PAGE, the bands of interest excised, in-gel
digested and then identified by mass spectrometry [7].

Even though this analysis workflow allows us to overcome some of the limitations inherent
to Y2H, the gel-based fractionation means that the method requires significant machine
time and numerous handling steps [8], and suffers a loss of sensitivity due to
contamination and sample loss. Overall, these limitations have made gel-based
fractionation impractical for large-scale interactome analysis in the average biological
laboratory. Thus, new gel-free methods which reduce handling and analysis time have been
developed.

18.2 Co-immunoprecipitation (Co-IP)

Recent advances in mass spectrometry technology have resulted in increasingly sensitive
instruments, perfectly suitable for protein complex identification. These advancements
have given a big boost to the application of AP-MS in interaction proteomics studies, but
have also highlighted the necessity for novel purification protocols, which appear to be
the main constraining step towards new improvements. Several purification strategies have
been improved since then, especially protocols involving immunoprecipitation. One of the
most efficient technologies, which is still in use, was developed in 1999 by Rigaut et al.
using a tandem affinity purification tag (TAP) [9]. This method requires expression of a
fusion protein consisting of a bait protein and two epitope tags used for the affinity
purification. The TAP plasmid contains three modules: the tags, the Staphylococcus aureus
protein A (ProtA) IgG binding domain and a calmodulin binding domain (CBD), separated by a
TEV (tobacco etch virus) protease cleavage site. Completing this procedure requires a
two-step affinity purification. In the first step the lysate is incubated with IgG
Sepharose beads. The ProtA of the TAP tag interacts with the IgG on the beads so that only
the bait and its interactors remain bound. The rest is removed by several washes. The
complexes are then treated with TEV protease. Treatment with TEV has two roles: first, to
disrupt the link between IgG beads and bait, and second, to expose the CBD for the next
purification step. The eluate is then incubated a second time, with calmodulin beads.
After a second elution, samples are further fractionated and separated by SDS
polyacrylamide gel or directly analysed by mass spectrometry. A general protocol for
tandem affinity purification was described by Puig et al. [10]. Double step purification
improves the purity of the target complex compared to a single-step affinity purification
protocol. Despite improved sensitivity in the identification of protein-protein
interactions in TAP purification methods, major limitations still remain. The main ones
are the low overall recovery and the large number of cells or amount of lysate that is
required. In addition, as a consequence of the lengthy protocol, only stable complexes
remain intact and dynamic interactions are all but lost. Novel, more rapid methods have
been developed using alternative affinity tags. Although these improvements have reduced
the time required for performing the experiments, none of the methods has been able to
improve the recovery of weak and dynamic interactions. Thus, classical TAP purification
methods have been largely abandoned.

A good alternative strategy to TAP, which still retains the specificity of the double step
purification, was established by Rees et al. [11]. In the method they developed, parallel
affinity capture (iPAC) was coupled to mass spectrometry using D. melanogaster as a model
organism. Similar to the previous protocol, the
386 B. Turriziani et al.

method is based on a double affinity tag system. A construct is generated as previously
described [12], containing tags in an exon flanked by splicing sites. This way, only
plasmids in which the target gene was inserted correctly will be translated into a
functional protein. The tags used in this type of approach are a marker of expression to
check if the bait is expressed in the cells, in this case the yellow fluorescent protein
(YFP), and two affinity tags for the immunoprecipitation and purification protocol, StrepII
and FLAG. Unlike the TAP methods, in which the two purifications are done sequentially, in
the iPAC approach they are performed in parallel. In one sample the bait is isolated by an
immobilized anti-FLAG antibody. In parallel, a second, identical sample is incubated with
StrepII beads. The principle of this method is that, since the bait carries both tags, the
two independent immunoprecipitations should give similar results or at least contain a
subset of common proteins. The workflow of the method is illustrated in Fig. 18.1a.

The interactomes isolated from the two samples after mass spectrometry analysis are
compared with tagless controls purified with both the FLAG and the StrepII beads.
Contaminants are identified by a cross analysis


Fig. 18.1 Schematic representation of co-affinity-precipitation workflows. (a) In the iPAC
method the bait is double tagged with Flag and streptavidin and is expressed in cells. The
cell lysate is then incubated with beads recognising either Flag or streptavidin in two
parallel incubations. After binding of the complexes to the beads, unspecific interactors
are washed away, whereas the protein complexes are retained and analysed by mass
spectrometry. Proteins identified in both reactions are considered specific interactors.
(b) On-bead digestion workflow: the lysate is incubated with a bait-specific antibody
followed by washes and digestion. After digestion the interacting proteins are identified
by mass spectrometry. (c) In BioID the bait is tagged with a promiscuous biotin ligase
leading to the selective biotinylation of proteins proximate to the bait-ligase fusion
protein. Proteins forming complexes with the bait are likely to be biotinylated, while
non-interacting proteins will remain untagged. The biotinylated proteins are purified by
affinity purification. The proteins are subsequently identified via mass spectrometry
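The iPAC classification logic (panel a, together with the tagless controls) amounts to a few set operations. The following is a minimal sketch only, not the authors' software: the function name, the dictionary keys and the protein identifiers are all hypothetical, and real data would first need scoring and filtering of the identifications.

```python
def classify_ipac(flag_hits, strep_hits, control_hits):
    """Sort iPAC identifications into background and specific interactors.

    All three arguments are collections of protein identifiers:
    the FLAG pull-down, the StrepII pull-down and the tagless controls.
    """
    flag_hits, strep_hits, control_hits = set(flag_hits), set(strep_hits), set(control_hits)
    shared = flag_hits & strep_hits  # seen in both parallel purifications
    return {
        # found in both pull-downs and in the tagless controls
        "bead_background": shared & control_hits,
        # found in one pull-down but not the other (matrix-dependent binders)
        "single_matrix_only": flag_hits ^ strep_hits,
        # found in both pull-downs and absent from the controls
        "specific": shared - control_hits,
    }


# Hypothetical identifiers, for illustration only.
flag = {"BAIT", "PREY_A", "PREY_B", "STICKY_1", "FLAG_ONLY"}
strep = {"BAIT", "PREY_A", "PREY_B", "STICKY_1", "STREP_ONLY"}
control = {"STICKY_1", "FLAG_ONLY", "STREP_ONLY"}

result = classify_ipac(flag, strep, control)
print(sorted(result["specific"]))
```

Only proteins recovered with both affinity matrices and missing from the tagless controls end up in the "specific" group, which mirrors the requirement that the two independent purifications agree.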

among the different controls. First, the data sets from FLAG, StrepII and controls are
compared, and proteins common to the three sets of data are classified as contaminants
and, as a group, are called the BEADome. Secondly, the FLAG and StrepII pull-downs are
compared for proteins binding to one of the affinity matrices independently of the bait.
Thus, all the proteins that are identified in one pull-down and not in the other are
classified as contaminants. The remaining identified proteins that overlap between the
FLAG and StrepII sets of data are then classified as specific interactions. The protocol
can be performed using different tags according to specific experimental requirements. To
prove the effectiveness of the protocol, both methods have been performed and their
efficiency evaluated in terms of stability of the bait, interactome recovery after all of
the procedural steps and identified interactome. Overall, iPAC combines the benefits of a
double purification with different tags, in terms of efficiency and quality of the
results, with the ability to detect weak and transient interactions.

More recently, another approach has been developed that is based on a single purification
step using biotin ligase as a tag. The work by Couzens et al. focused on the mammalian
Hippo pathway and how the phosphorylation state affects its interaction network [13]. Two
complementary methods were performed in parallel: FLAG affinity purification (FLAG-AP) and
a new method of biotinylation followed by streptavidin affinity purification (BioID).
Nineteen proteins were tagged with a promiscuous biotin ligase (BirA) that promotes the
covalent linkage of a biotin moiety not only to the proteins directly interacting with the
bait, but also to proteins present in its proximity (Fig. 18.1c). In this way, when the
bait forms a protein complex, all the proteins of this complex are labelled with biotin.
Incubation with streptavidin beads allows purification of biotinylated proteins and their
identification by mass spectrometry. A comparative analysis of the data sets generated
with FLAG-AP and BioID revealed only a partial overlap. The BioID data set contains a
larger number of potential interactors compared to the FLAG-AP in native conditions. More
interestingly, most of the interactors identified using BioID, but not FLAG-AP, were
proteins of specific cellular compartments which are generally more difficult to isolate
due to their poor solubility, including proteins associated with membranes, the
centrosome, chromatin and cell junctions. This last observation makes BioID a suitable
method to study proteins from specific localizations, which would otherwise be difficult
to analyse. Another advantage of BioID seems to be the gentler conditions in which the
experiment is performed, which allow the recovery of weak interactions usually lost during
FLAG-AP. One limitation of the BioID protocol is the incubation time necessary to label
the cell with biotin, which in the cited work was about 24 h. From this point of view it
may be difficult to reconcile this method with dynamic interactions.

In recent years there has been a trend to abandon gel-based fractionation for
unfractionated in-solution or on-bead digestions. This trend has been triggered by
improvements in the acquisition speed of mass spectrometers and the resolution of uHPLC
chromatographic systems. These new approaches have several advantages. Chief among them is
the reduced need for fractionation, which drastically reduces the acquisition time and
sample handling [14, 15]. Indeed, just a few steps are required to go from a lysate to the
final analysis on the mass spectrometer. This not only shortens the analysis time, but
also reduces the overall level of contamination. In addition, sample recovery is improved,
as in-gel digestion results in loss of material due to inefficient elution and recovery
from gel slices. Finally, streamlined protocols are compatible with high throughput,
automated robotic handling stations, which will permit the large-scale interaction screens
required for systems biology. Excitingly, using a “double barrel” column LC method coupled
to an ultra-fast Q-Exactive HF, the Mann group has recently published a method which uses
96-well plates for all handling steps and, by using fast gradients, has broken through the
24 h per-plate machine time [16]. This unparalleled speed and, therefore, reduced cost
per sample, finally puts

mass spectrometry based interaction proteomics in the same ball park as targeted antibody
approaches, such as western blotting, in terms of cost, sensitivity and sample throughput.

Quantitative proteomics has proven essential for well-controlled proteomics experiments
and interactome proteomics is no exception. Unfortunately, stable isotope labeling by
amino acids in cell culture (SILAC), which is one of the most widely utilized
methodologies for quantitative proteomics, is not applicable in IP-MS experiments. SILAC,
which allows the mixing of the samples prior to any processing, cannot be used because in
IP-MS samples can only be combined after the IP has been performed. This limitation is due
to the dynamic exchanges that could occur between heavy and light complexes during the
incubation step with the affinity matrix. Thus, alternative methods, such as label free
quantification (LFQ) [17], are increasingly used. The LFQ method works with any source
material, as it does not require any form of labelling. An example of a strategy using LFQ
in conjunction with on-bead digestion was elaborated by Turriziani et al. [18]. The
protocol is based on classical IP protocols used in most biological laboratories and is
therefore easily implemented without specialist knowledge or equipment.

All of the strategies described in this section have advantages and weaknesses, but in
general they offer good alternatives to the original IP-MS protocols, and each contributes
its strengths to overcome the main limitations of IP-MS methods.

Regardless of all these improvements, all of these strategies use exogenous, tagged
proteins as baits, which are often overexpressed. As highlighted before, there could be
many issues related to the process of tagging a protein or simply overexpressing it.

On the other hand, the immunoprecipitation of endogenous proteins has several pitfalls as
well. Even if an antibody is found which efficiently enriches the protein, the antibody
may cross-react with other, unrelated proteins. Thus, endogenous IPs have to be well
controlled in order to avoid false positives. An efficient method to overcome this
shortcoming is represented by an approach which combines quantitative immunoprecipitation
with a knockdown strategy (QUICK) [19]. In this work the authors use a SILAC-based
quantification [20] and a control in which the bait protein is transiently knocked down by
small interfering RNA (siRNA). As a result, proteins interacting with the bait are reduced
in the control, whereas the concentrations of unspecific and cross-reacting proteins are
not affected by the knockdown. Alternatively, a negative control can be generated by using
a knock-out cell line generated by CRISPR-Cas9. The system is based on the nuclease Cas9,
which introduces a double-strand break in the target gene and can lead to silencing of the
gene by introducing frame-shift mutations [21].

18.3 Co-elution

All previously mentioned approaches isolate a protein of interest and its interactors by
(immuno)-precipitation. Alternatively, protein correlation profiling (PCP) aims not to
detect how a single protein relates to its interactome, but rather to determine the
composition of all intact complexes (Fig. 18.2). The underlying assumption is that protein
complexes co-purify when separated based on their biochemical characteristics, such as
size, density or hydrophobicity [22]. The conditions in which the separation is performed
are tailored to preserve the interactions and integrity of the protein complex. Complexes
are frequently separated via chromatography using disparate gradients and solid phases. As
mentioned, the principle of the method is based on the fact that interacting proteins and
protein complexes will co-elute with the same profile. After fractionating individual
complexes, the proteins are digested, analysed and quantified by mass spectrometry.
Several protocols have been developed to demonstrate the efficiency of this approach. An
example is the work of Andersen et al., which focusses on the human centrosome [23]. The
isolated centrosomes were separated by a sucrose

Fig. 18.2 General summary of co-elution principles. After lysis, protein complexes are
separated based on their various biochemical properties. The complexes are collected into
different fractions, digested and analyzed by mass spectrometry. Proteins identified in
the same fraction are likely to be in the same complex. (a) The complexes are separated
according to their density in a sucrose gradient. The heavier complexes will be located
towards the bottom of the tube, whereas lighter ones will remain at the top. (b) Schematic
of size-exclusion chromatography. The column is filled with a porous gel. Bigger complexes
travel faster through the gel and are the first ones to be collected. Smaller complexes
are trapped in the porous gel and are collected towards the end. (c) Representation of ion
exchange chromatography. The complexes pass through the column and interact with the
charged matrix. The retention time depends on the overall charge of the complexes.
Positively charged complexes travel faster through the column, whereas negatively charged
complexes are retained longer due to their interaction with the positively charged matrix
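The co-elution principle summarised in Fig. 18.2 can be illustrated by correlating the elution profiles of proteins across fractions. The sketch below is one simple way to do this and is not the scoring used in any of the cited studies: the Pearson correlation, the 0.9 threshold, the function names and the intensity values are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long elution profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-elution information
    return cov / sqrt(var_x * var_y)


def coeluting_pairs(profiles, threshold=0.9):
    """Return protein pairs whose profiles correlate at or above the threshold."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        r = pearson(prof_a, prof_b)
        if r >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical intensities over eight fractions, for illustration only.
profiles = {
    "subunit_1": [0, 1, 8, 10, 3, 0, 0, 0],
    "subunit_2": [0, 2, 9, 9, 2, 0, 0, 0],
    "unrelated": [0, 0, 0, 1, 2, 9, 10, 1],
}

pairs = coeluting_pairs(profiles)
print(pairs)
```

A high correlation only suggests membership of the same complex. Unrelated complexes of similar size or density can overlap in the same fractions, which is why additional validation is needed.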

gradient (Fig. 18.2a). Different fractions were collected, analysed and individually
quantified. The results were validated using two orthogonal approaches. The authors
initially compared the protein content of different fractions with the elution profiles of
centrosome markers. Additionally, they selected a few candidates and individually
validated them by Co-IP. Overall, the combination of these approaches allowed the authors
to identify most of the known centrosomal complexes, and groups of likely novel complexes
were further partially validated by co-immunofluorescence. This study showed that PCP is a
reliable method to characterise multi-protein complexes by co-elution profiling. In
particular, PCP showed good efficiency in the characterization of cellular structures
which are difficult to isolate, especially those associated with organelles. As
highlighted by the authors, this technique is compatible with isotope labelling, which can
provide quantitative information about protein complex dynamics over time. Nevertheless,
this strategy failed to detect a considerable fraction of the proteins associated with the
centrosome. Possible causes include low abundance of the missed proteins or their
participation in complexes that are not stably associated with the centrosome, raising
questions about the effectiveness of this method for detecting weak and dynamic
interactions.

Although the PCP approaches have some advantages over AP-MS, they also present a set of
technical challenges that need to be overcome in order to improve the depth and efficiency
of

this method. The first complication is related to the effectiveness of the separation
strategies.

New methods of separation used in combination with mass spectrometry range from various
chromatography techniques to gel electrophoresis. One of the most well-established methods
to separate proteins in purification strategies is size exclusion chromatography (SEC).
SEC consists of a chromatographic column with a porous stationary phase, generally
agarose, which fractionates proteins and complexes by their ability to migrate through the
pores or to be trapped in the stationary material (Fig. 18.2b). In recent work, Kirkwood
et al. have coupled SEC and mass spectrometry to isolate and characterize soluble protein
complexes from human osteosarcoma cells [24]. The study focussed on how protein isoforms
and post-translational modifications influence the association with distinct complexes.
The experiments were performed in the absence of detergent in order to preserve the
interactions. In addition, a lysate collected in denaturing conditions was used as a
negative control. The complexes were separated in a SEC column and 40 fractions were
collected. After separation, the different fractions were digested separately and analysed
by mass spectrometry. Overall, 8000 proteins were identified and clustered according to
their elution profiles. Subsequently, protein complexes were assigned to clusters by
correlating known interactions to the elution profiles. Interestingly, it was possible to
define differential behaviour of various protein isoforms. The authors reported the case
of the heterochromatin protein-1 binding protein 3 (HP1BP3), which is essential for the
modulation of chromatin functions. This protein has four known isoforms, three of which
eluted with similar elution profiles, while isoform 3 migrated differently, indicating it
might be part of a distinct complex. Isoform 3 lacks a particular protein interaction
domain that is present in the other three isoforms, which might explain why it behaves
differently. Similarly, the authors of this study identified post-translational
modifications that alter complex formation. As an example, NUDT5 was identified with 13
peptides, one of them showing a serine phosphorylation. The protein presented two elution
peaks, in fractions 21 and 28. Unlike the rest of the peptides, which eluted in both
peaks, the phosphorylated peptide was only present in the second peak and co-eluted with a
specific complex. The sequence of the phosphorylation site matched the ATM/ATR kinase
consensus motif, suggesting that the phosphorylation of NUDT5 by these kinases is
necessary for its interaction with the second complex. As highlighted by the authors, the
application of efficient separation strategies, like SEC, to PCP helps to overcome some
problems connected to the sensitivity and effectiveness of gradient separations. The
improved separation of complexes by SEC, in comparison to sucrose gradients, improves the
resolution and specificity of detected complexes. Additionally, clustering of co-eluting
proteins with components of known complexes facilitates the detection of new interactors
of known complexes.

Recently, Kristensen et al. have described a strategy to study the dynamics of the
interactome in HeLa cells as a result of epidermal growth factor (EGF) treatment. In this
paper they combined the SILAC labelling method for protein quantification with SEC
separation and mass spectrometry [25]. The aim of the study was to identify and quantify
the changes in protein-protein interactions following stimulation with EGF. The light
medium was used to label an internal standard for the identification of proteins in each
fraction, while medium and heavy SILAC media were used to label control and treated cells,
respectively. The ratio of heavy over medium protein signal was used to quantify the
dynamic changes in protein interactions following EGF treatment. As expected, the
co-elution profile of some proteins, represented by their corresponding chromatography
peaks in SEC, changed after the treatment. Some of these proteins were bound to different
complexes, while others were associated with the same complex but their stoichiometry was
altered upon EGF stimulation. About 350 proteins showed a different behaviour compared to
the control, among which were a number of well-known components of the EGF signalling
pathway.
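The heavy-over-medium comparison described above can be sketched per SEC fraction as a log2 ratio. This is an illustrative sketch only and not the analysis pipeline of the cited study: the two-fold cutoff, the minimum-intensity filter, the function names and the intensity values are assumptions.

```python
from math import log2


def log2_hm_ratios(heavy, medium, min_intensity=1.0):
    """log2(heavy / medium) per fraction; None where either signal is too low."""
    ratios = []
    for h, m in zip(heavy, medium):
        if h < min_intensity or m < min_intensity:
            ratios.append(None)  # protein not reliably detected in this fraction
        else:
            ratios.append(log2(h / m))
    return ratios


def changed_fractions(heavy, medium, fold=2.0):
    """1-based fraction numbers where treated and control differ by at least `fold`."""
    cutoff = log2(fold)
    return [
        index
        for index, ratio in enumerate(log2_hm_ratios(heavy, medium), start=1)
        if ratio is not None and abs(ratio) >= cutoff
    ]


# Hypothetical intensities of one protein over seven fractions.
medium = [0, 10, 100, 400, 100, 10, 0]  # control cells
heavy = [0, 10, 100, 100, 100, 40, 0]   # EGF-treated cells

changed = changed_fractions(heavy, medium)
print(changed)  # [4, 6]
```

In this made-up profile the protein is depleted from fraction 4 and enriched in fraction 6 after treatment, the kind of shift that would point to a change in complex association or stoichiometry.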

Beyond the advantages already listed for the general PCP protocols, the combination of PCP
with a quantitative method like SILAC is an effective way to track changes in dynamic
protein interactions as a result of various perturbations.

In a recent study Havugimana et al. [26] have performed an extensive analysis of the
soluble protein interactome in mammalian cell lines (HeLa S3 and HEK293) using a
combination of different fractionation methods. Initially, protein complexes were
separated via ion exchange chromatography (IEX-HPLC) in non-denaturing conditions
(Fig. 18.2c). To study protein interactions that could be disrupted by salt, a second
method of fractionation was used in parallel, involving sucrose gradient centrifugation
coupled with isoelectric focusing. By combining these methods, 364 previously unannotated
complexes were identified and linked to the pathologies they were studying. The quality of
the method was assessed by comparing the co-elution profiles of 20 well-known complexes as
references. The issue of overlapping profiles and consequential false positives was solved
by the development of a computational algorithm that correlated the data with previous
functional genomic and evolutionary correlations [27, 28]. This robust computational
method improved the reliability of the data by identifying and filtering false positives,
which resulted in a high confidence physical interaction network. Using this strategy, an
accurate characterization of protein complexes can be achieved. The main concern about the
described method is the large number of fractions collected (over 1000), which was
necessary to achieve the desired resolution. In addition, the bioinformatics analysis is
challenging. These issues limit the use of this approach for most groups. More recently, a
computational method has been developed to facilitate the analysis of protein interaction
data generated with PCP. The study combined a SEC approach with SILAC and was aimed at
detecting protein complexes altered after infection with Salmonella enterica. The authors
developed analytical tools which allowed the generation of interaction maps with a single
script. Overall, this reduced the time for the analysis from weeks to days, and the
details of the computational method are described in the paper published by Scott
et al. [29]. This method is not specific for SEC and can be applied to other fractionation
techniques.

18.4 Cross-Linking

A remarkable number of new methods have been developed to improve the quality of
interactome data in terms of sensitivity, reliability and throughput. While
co-immunoprecipitation and co-elution strategies have proven effective for the
characterization of protein interactions, both suffer from a loss of weak and transitory
interactions. One way to stabilize this type of interaction is to use chemical
cross-linking to covalently link weakly interacting proteins. Chemical cross-linkers are
molecules capable of generating a covalent bond between two polymers or macromolecules and
their use is well established in chemistry. The use of chemical cross-linkers to
characterize protein interactions was first reported in the 1970s [30], but what gave a
bigger incentive to use this technology more widely was the development of proteomic
approaches based on mass spectrometry. The cross-linking strategy relies on converting
protein interactions into strong covalent bonds that remain resistant even in denaturing
conditions. As such, cross-linking has been an attractive method to investigate weak and
transient interactions. As a general principle, a cross-linker consists of two reactive
groups separated by a spacer. The two groups can react with side chains of amino acids
that are close to each other, especially thiols, carboxylic acids and amino groups, which
are more reactive (Fig. 18.3).

There is a large variety of cross-linkers available, each of them with specific features
in terms of reactivity and mechanism of action. One of the most commonly used
cross-linking agents is formaldehyde (FA), which can be used not only to fix
protein-protein interactions, but also the interaction between proteins and nucleic acids

Fig. 18.3 Cross-linking principle. Intact cells or cell lysate is incubated with a
cross-linker composed of one or two reactive groups separated by a spacer of various
lengths. The reactive groups bind the proteins in a complex and generate a covalent bond
which stabilises the complexes. Complexes are then isolated by affinity purification,
digested and identified by mass spectrometry. Individual, cross-linked peptides can
indicate proximity of the peptides in the intact protein complexes, which can suggest a
likely complex assembly

[31, 32]. FA has been used in several protocols in order to stabilize weaker and transient
interactions prior to affinity purification [33], for example to study the interactome of
a constitutively active form of M-Ras [34]. Myc-tagged M-Ras was expressed in cells and
interactions between the bait and its interactors were fixed in cells by FA. After the
cross-linking reaction, the proteins were purified by an anti-Myc IP. The analysis showed
that the method was able to efficiently purify the tagged bait and its interacting
complexes, and the authors were able to identify several new interactors. Nevertheless,
since the protocol is based on the tagging of the bait, it presents the same limitations
previously described. More recently, a rapid immunoprecipitation mass spectrometry of
endogenous proteins (RIME) procedure has been developed by Mohammed et al. [35]. The method
couples cross-linking using FA with an in-solution digest. In contrast to the previous
method, the bait is not tagged or exogenously expressed. Instead, the endogenous bait is
enriched with a mixture of antibodies specifically raised against the protein. In this
specific study, the interactome of the oestrogen receptor (ER) was analysed in breast
cancer cell lines. In addition, SILAC labelling was used to quantify the interaction
differences induced by either oestrogen or tamoxifen. The observation that the interaction
between GREB1 and ER is predominant in ER+ specimens and that its expression is decreased
in tamoxifen-resistant cells led to speculation on a role of this protein in the hormonal
response.

The presented methods are only examples of systematic analysis of protein-protein
interactions by cross-linking, and it is evident that these strategies are valuable tools
to study low abundance proteins and transient interactions. More importantly, the
sensitivity of these protocols allows working with a small amount of biological starting
material, which makes the analysis feasible for primary cultures. Additionally,
cross-linking with FA facilitates the elimination of major contaminants and non-specific
interactors, because the attained stability allows stringent lysis and washes.

Aside from FA, a large number of other cross-linkers have been developed over the years,
each with different characteristics to take into consideration when planning an
experiment. The length of the cross-linker has a significant influence on its capability
to perform an efficient reaction. For example, long cross-linkers are often associated
with an increased rate of non-specific interactions [36], as longer cross-linkers can not
only link non-interacting proteins that are simply proximal to the bait but, due to their
size, also alter the structure of the linked protein by internal cross-linking. Another
factor that influences the cross-linking efficiency is the hydrophobicity and the size of
the cross-linker itself. These factors have to be considered to find cross-linkers that
can permeate the membrane or access the inner surface of protein complexes efficiently. In
addition, one main feature to consider is the specificity of the reactive groups on the
cross-linker, which target specific amino acid residues [37]. Chemical cross-linking
methods also require a lot of optimization: cross-linker concentration and incubation
times must all be specifically tested for each experimental set-up and target.

Chemical cross-linking is a good strategy to stabilize weak interactions and to help
detect low abundance proteins in interaction proteomics studies but, as discussed above,
the process of adding a reactive chemical cross-linker to the sample can generate
artefacts if the cross-linker is not carefully selected. Photo-cross-linking is an
alternative strategy which can overcome some of the limitations and has recently been
applied to the study of protein interactions. In contrast to chemical cross-linking
strategies, a natural amino acid is replaced by a modified, reactive analogue. In essence,
specific amino acids contain a photo-reactive diazirine group, which can be activated upon
exposure to ultraviolet light [38] to become a reactive intermediate that covalently binds
to an acceptor group within a neighbouring protein. The amino acids most frequently
modified are leucine and methionine. A good example of photo-cross-linking applied to
interaction proteomics is the work of Suchanek et al. [39]. Photo-reactive isoleucines,
leucines and methionines were introduced into COS7 cells by using engineered tRNA. The
expression of the modified amino acids did not interfere with protein biosynthesis and the
exposure to UV light did not affect cell viability, as assessed by the authors. The new
method was used to study protein-protein interactions of a regulatory complex involved in
lipid homeostasis. An HA-tagged PGRMC1 and a Myc-tagged Insig1 were expressed in COS7 with
or without photo-methionine (photo-Met) and their interaction was validated via SDS-PAGE
after either HA or Myc immunoprecipitation. As highlighted by the authors,
photo-cross-linking showed the same specificity as the chemical approach, with an
advantage over conventional chemical cross-linking in the characterization of large
complexes. Since modified amino acids are already part of the protein sequence,
photo-cross-linking does not have the two major limitations of chemical cross-linking,
namely the risk of altering the protein structure and limited accessibility to the core of
the protein complex. On the other hand, the specificity may be altered by the irradiation
and the duration of photo-activation, which could generate unspecific interactions due to
reactive intermediates.

In addition to facilitating the detection of interactors, cross-linking can generate
further information. Since photo cross-linkers have the ability to connect amino acids in
their proximity, they can link residues in separate domains of the same protein. The
requirement of physical proximity for cross-linking can help provide structural
information [40]. One example of where
394 B. Turriziani et al.

structural information is extracted using photo cross-linking experiments is the characterization of the RNA polymerase II/TFIIF transcriptional complex in Saccharomyces cerevisiae [41]. In this study the authors used bis(sulfosuccinimidyl)suberate (BS3), a cross-linker that reacts with the amine groups of lysines, and were able to identify different linkage sites between the subunits of the large RNA polymerase complex and the transcription factor TFIIF. The method provided information about the direct interactions between the two complexes, and identified the regions within TFIIF which directly bind the RNA polymerase surface. This approach is especially suited for studying the native configuration of transcription factor complexes, which are generally challenging to study. Nevertheless, this method still has some limitations [42]. Although the rate of false positives is low, there is still the risk of generating artefacts. The cross-linking has to be well calibrated in order to decrease the amount of unspecific links. Furthermore, the identification of cross-linked peptides is challenging, despite the emergence of new analytical tools, due to the increased complexity of the linked peptides [42]. In our opinion it would be advantageous to develop cross-linkers which can be reliably split in an ion trap. The linked peptides could then be fragmented individually by MS3, which would reduce the complexity while still retaining the information about the linkage. A fragmentable cross-linker, in conjunction with targeted enrichment methods for linked peptides, would allow researchers to determine protein-protein interactions and, more importantly, the exact site of interaction.

18.5 Conclusions

While it is clear that proteomics, especially interactome studies, can play a critical role in determining how pathological events are initiated by molecular events, initial attempts to develop such methodologies have suffered from a number of technical limitations. These limitations have been addressed to a large extent in recent years to make proteomic network mapping reliable, fast and able to cope with dynamic changes in the variety of networks. In this sense, large-scale studies have been carried out to develop new protocols to attempt to define the interactome of several organisms.

Novel protocols to process samples in an easier, faster way, in conjunction with a new generation of reliable, sensitive LC-MS/MS platforms, have democratized the mass spectrometry-based detection of protein complexes. Systematic analyses, which a few years ago were only accessible to large, specialized groups, are now within the reach of more applied biological researchers. The possibility to study protein complexes in every laboratory without the need to work with specialized facilities has opened up new opportunities. Initial proteomics studies have already given us a glimpse of how protein interactions link up to build intricate and complex dynamic networks which are difficult, if not impossible, to decipher without these advances. Moreover, due to rapid developments of new methods, per-sample costs have been reduced and can now easily compete with other high and medium throughput proteomics methods. In contrast with other techniques, mass spectrometry-based proteomics still retains the unique ability to assess and quantify what is known, but also to discover new interactions and links in a systematic and unbiased fashion.

References

1. Phizicky EM, Fields S (1995) Protein-protein interactions: methods for detection and analysis. Microbiol Rev 59(1):94–123
2. Fields S, Song O (1989) A novel genetic system to detect protein-protein interactions. Nature 340(6230):245–246
3. Parrish JR, Gulyas KD, Finley RL Jr (2006) Yeast two-hybrid contributions to interactome mapping. Curr Opin Biotechnol 17(4):387–393
4. Vidalain PO et al (2004) Increasing specificity in high-throughput yeast two-hybrid experiments. Methods 32(4):363–370
5. Gingras AC et al (2007) Analysis of protein complexes using mass spectrometry. Nat Rev Mol Cell Biol 8(8):645–654
18 Protein-Protein Interaction Detection Via Mass Spectrometry-Based Proteomics 395

6. Chang IF (2006) Mass spectrometry-based proteomic analysis of the epitope-tag affinity purified protein complexes in eukaryotes. Proteomics 6(23):6158–6166
7. Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198–207
8. Alvarado R et al (2010) A comparative study of in-gel digestions using microwave and pressure-accelerated technologies. J Biomol Tech 21(3):148–155
9. Rigaut G et al (1999) A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 17(10):1030–1032
10. Puig O et al (2001) The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods 24(3):218–229
11. Rees JS et al (2011) In vivo analysis of proteomes and interactomes using Parallel Affinity Capture (iPAC) coupled to mass spectrometry. Mol Cell Proteomics 10(6):M110.002386
12. Morin X et al (2001) A protein trap strategy to detect GFP-tagged proteins expressed from their endogenous loci in Drosophila. Proc Natl Acad Sci U S A 98(26):15050–15055
13. Couzens AL et al (2013) Protein interaction network of the mammalian Hippo pathway reveals mechanisms of kinase-phosphatase interactions. Sci Signal 6(302):rs15
14. Hubner NC et al (2010) Quantitative proteomics combined with BAC TransgeneOmics reveals in vivo protein interactions. J Cell Biol 189(4):739–754
15. Domon B, Aebersold R (2006) Mass spectrometry and protein analysis. Science 312(5771):212–217
16. Hosp F et al (2015) A double-barrel liquid chromatography-tandem mass spectrometry (LC-MS/MS) system to quantify 96 interactomes per day. Mol Cell Proteomics 14(7):2030–2041
17. Zhu W, Smith JW, Huang CM (2010) Mass spectrometry-based label-free quantitative proteomics. J Biomed Biotechnol 2010:840518
18. Turriziani B et al (2014) On-beads digestion in conjunction with data-dependent mass spectrometry: a shortcut to quantitative and dynamic interaction proteomics. Biology (Basel) 3(2):320–332
19. Selbach M, Mann M (2006) Protein interaction screening by quantitative immunoprecipitation combined with knockdown (QUICK). Nat Methods 3(12):981–983
20. Ong SE et al (2002) Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 1(5):376–386
21. Waldrip ZJ et al (2014) A CRISPR-based approach for proteomic analysis of a single genomic locus. Epigenetics 9(9):1207–1211
22. Larance M, Lamond AI (2015) Multidimensional proteomics for cell biology. Nat Rev Mol Cell Biol 16(5):269–280
23. Andersen JS et al (2003) Proteomic characterization of the human centrosome by protein correlation profiling. Nature 426(6966):570–574
24. Kirkwood KJ et al (2013) Characterization of native protein complexes and protein isoform variation using size-fractionation-based quantitative proteomics. Mol Cell Proteomics 12(12):3851–3873
25. Kristensen AR, Gsponer J, Foster LJ (2012) A high-throughput approach for measuring temporal changes in the interactome. Nat Methods 9(9):907–909
26. Havugimana PC et al (2012) A census of human soluble protein complexes. Cell 150(5):1068–1081
27. Alberts B (1998) The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell 92(3):291–294
28. Hartwell LH et al (1999) From molecular to modular cell biology. Nature 402(6761 Suppl):C47–C52
29. Scott NE et al (2015) Development of a computational framework for the analysis of protein correlation profiling and spatial proteomics experiments. J Proteomics 118:112–129
30. Clegg C, Hayes D (1974) Identification of neighbouring proteins in the ribosomes of Escherichia coli. A topographical study with the cross-linking reagent dimethyl suberimidate. Eur J Biochem 42(1):21–28
31. Sutherland BW, Toews J, Kast J (2008) Utility of formaldehyde cross-linking and mass spectrometry in the study of protein-protein interactions. J Mass Spectrom 43(6):699–715
32. Toth J, Biggin MD (2000) The specificity of protein-DNA crosslinking by formaldehyde: in vitro and in Drosophila embryos. Nucleic Acids Res 28(2):e4
33. Bousquet-Dubouch MP et al (2009) Affinity purification strategy to capture human endogenous proteasome complexes diversity and to identify proteasome-interacting proteins. Mol Cell Proteomics 8(5):1150–1164
34. Vasilescu J, Guo X, Kast J (2004) Identification of protein-protein interactions using in vivo cross-linking and mass spectrometry. Proteomics 4(12):3845–3854
35. Mohammed H et al (2013) Endogenous purification reveals GREB1 as a key estrogen receptor regulatory factor. Cell Rep 3(2):342–349
36. Hwang YJ, Granelli J, Lyubovitsky J (2012) Effects of zero-length and non-zero-length cross-linking reagents on the optical spectral properties and structures of collagen hydrogels. ACS Appl Mater Interfaces 4(1):261–267
37. Zybailov BL et al (2013) Large scale chemical cross-linking mass spectrometry perspectives. J Proteomics Bioinform 6(Suppl 2):001
38. Gomes AF, Gozzo FC (2010) Chemical cross-linking with a diazirine photoactivatable cross-linker investigated by MALDI- and ESI-MS/MS. J Mass Spectrom 45(8):892–899
39. Suchanek M, Radzikowska A, Thiele C (2005) Photo-leucine and photo-methionine allow identification of protein-protein interactions in living cells. Nat Methods 2(4):261–267
40. Rappsilber J et al (2000) A generic strategy to analyze the spatial organization of multi-protein complexes by cross-linking and mass spectrometry. Anal Chem 72(2):267–275
41. Chen ZA et al (2010) Architecture of the RNA polymerase II-TFIIF complex revealed by cross-linking and mass spectrometry. EMBO J 29(4):717–726
42. Rappsilber J (2011) The beginning of a beautiful friendship: cross-linking/mass spectrometry and modelling of proteins and multi-protein complexes. J Struct Biol 173(3):530–540
19 Protein Structural Analysis via Mass Spectrometry-Based Proteomics
Antonio Artigues, Owen W. Nadeau, Mary Ashley Rimmer,
Maria T. Villar, Xiuxia Du, Aron W. Fenton,
and Gerald M. Carlson

Abstract
Modern mass spectrometry (MS) technologies have provided a versatile
platform that can be combined with a large number of techniques to
analyze protein structure and dynamics. These techniques include the
three detailed in this chapter: (1) hydrogen/deuterium exchange (HDX),
(2) limited proteolysis, and (3) chemical crosslinking (CX). HDX relies on
the change in mass of a protein upon its dilution into deuterated buffer,
which results in varied deuterium content within its backbone amides.
Structural information on surface exposed, flexible or disordered linker
regions of proteins can be achieved through limited proteolysis, using a
variety of proteases and only small extents of digestion. CX refers to the
covalent coupling of distinct chemical species and has been used to
analyze the structure, function and interactions of proteins by identifying
crosslinking sites that are formed by small multi-functional reagents,
termed crosslinkers. Each of these MS applications is capable of revealing
structural information for proteins when used either with or without other
typical high resolution techniques, including NMR and X-ray
crystallography.

Keywords
Protein structural analysis • Hydrogen/Deuterium Exchange (HDX) •
Limited proteolysis • Chemical Crosslinking (CX)

A. Artigues (*) • O.W. Nadeau • M.A. Rimmer • M.T. Villar • A.W. Fenton • G.M. Carlson
Department of Biochemistry and Molecular Biology, University of Kansas Medical Center, Kansas City, KS, USA
e-mail: aartigues@kumc.edu

X. Du
Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_19
398 A. Artigues et al.

19.1 Hydrogen/Deuterium Exchange

19.1.1 Introduction

Protein functions commonly rely on conformational changes within the protein. In some cases these conformational changes include large sections or entire domains of the protein. In other cases, protein conformational changes are restricted to small specific regions of the protein. Extensive conformational changes are associated with the folding of proteins during or immediately after their synthesis in vivo, when they acquire their native conformation. Knowledge of the location of functionally relevant conformational changes within the protein, and of the magnitude and rates of conformational interconversion among various protein conformations (i.e., dynamics), is of great importance to the understanding of protein function.

Direct or indirect evidence of protein conformational changes has been deduced through the use of several spectroscopic techniques, including circular dichroism, electron paramagnetic resonance, intrinsic protein fluorescence, UV–vis and IR spectroscopy, and it is not uncommon to use a combination of these techniques to obtain a general description of the structure and dynamics of the protein system under consideration. Measurements of protein dynamics traditionally have been done by determining 15N NMR relaxation times and calculating S2, the average order parameter, a measure of the motion of the N-H vector at peptide amide linkages. Higher order parameters indicate less freedom of movement. Motions measured by these techniques are on the pico- to nanosecond time scale but may also indicate if slower motions might be occurring. To fully understand the dynamics of a particular protein, it is desirable to span as wide a time range as possible.

Hydrogen exchange is a well-understood phenomenon, and in conjunction with mass spectrometry (MS) is a useful method for studying protein dynamics and structure. This exchange was first discovered in the early 1950s by Kaj Ulrik Linderstrøm-Lang and Aase Hvidt, scientists at the Carlsberg Laboratory in Copenhagen. They discovered that both the polar side chain hydrogens and the peptide group hydrogens undergo continual exchange with the hydrogens from the solvent. Using density gradient tubes, Linderstrøm-Lang and Hvidt developed a novel method to measure this exchange of amide backbone hydrogens with a heavier isotope, deuterium [1, 2]. With this method, they were able to show that the newly discovered α-helices and β-sheets in native proteins do indeed have the proposed hydrogen-bonded backbone structures [1, 3]. Despite having extremely limited resolution and accuracy at that time, Linderstrøm-Lang and Hvidt were able to derive equations and propose mechanisms that are still being used today in hydrogen/deuterium exchange (HDX) methodologies [1].

During the following 40 years, many developments were made using hydrogen exchange in conjunction with different techniques, including NMR, tritium gel-filtration, and circular dichroism. Some of these advances include showing that the chemical nature of adjacent side chains has a major effect on the exchange rate [4], measuring the rates of both acid- and base-catalyzed exchange [5], developing protein fragmentation and HPLC separation methods [6], and site-resolved HDX [7]. Finally, in 1991 Katta and Chait showed that HDX could be used with electrospray ionization mass spectrometry, removing many of the limitations associated with applications of HDX, including the size of the protein that can be studied [8]. With the use of MS to analyze the exchange, the use of HDX to study protein structure continues to advance, with the development of faster and more automated software for both analyzing data and running samples [9], and cold boxes for HPLC to maintain low temperatures during injection to avoid back exchange [10]. As a result, the size and type of proteins being studied with HDX, as well as the number of people employing this method, continue to grow.

Recently, HDX in combination with MS has been used to characterize protein movements in
solution over a time range from milliseconds to several hours. This technique has become increasingly popular to characterize conformational changes and the dynamic transitions between the conformations in proteins. The purpose of this chapter is to describe the procedures and methods for HDX MS to novice users. Thus, we will first describe a basic methodology and a simple experimental set-up. Then, we will discuss alternative workflows, caveats and potential problems, and complementary techniques.

HDX MS is a method in which deuterium atoms present in buffer replace hydrogen atoms in the protein [11–16]. Of all the hydrogen atoms present in a protein polypeptide, only hydrogen atoms in O-H, N-H and S-H groups can be replaced with deuterium atoms present in the buffer. As a further limitation, only those present in the amide linkages can be measured by HDX MS (all other hydrogens exchange too rapidly during sample handling to be detected by mass spectrometry). The amino acid sequences of peptides (and thereby their locations in the protein) and their mass (and thereby the identification of which peptides undergo deuterium exchange) are detected by enzymatic (most often pepsin) digestion of the protein into peptides and peptide mass evaluation by MS. The total number of exchangeable protons and the rate of exchange events are both dependent on the equilibrated protein conformational average and the rate of dynamic transitions between conformations. Therefore, HDX MS is a sensitive technique for evaluating both changes in average conformation of the peptide backbone chain and changes in its dynamics.

A number of attributes of HDX MS make it ideal for evaluating macromolecular systems:

1. Mass spectrometry requires low concentrations of protein. This can remove some of the ambiguity that arises at higher protein concentrations (such as those required for many NMR studies).
2. Deuterium-labeling of a protein results in the introduction of multiple reporting groups (one reporter/protein residue) with minimum structural modification of the protein. This results in higher resolution information compared to many other techniques.
3. The number of exchanging protons can be determined. Each proton that is exchanged for a deuteron adds 1 atomic mass unit (amu) to the average molecular mass of a protein. Thus, the increase in the mass determines the number of deuterons incorporated.
4. By observing the isotopic pattern for a given protein or peptide fragment (discussed in detail below), HDX MS can distinguish between localized unfolding events (referred to as EX2 kinetics and seen as a binomial isotope pattern) and more global, or cooperative, unfolding events (referred to as EX1 kinetics and seen as a bimodal isotopic pattern).
5. There is no upper limit to the size of the macromolecule that can be analyzed by HDX MS. This is due to the fact that, for detailed analysis of deuterium content in specific regions (peptides) of the protein, the protein is proteolyzed before mass analysis.
6. Measurements are for proteins in solution with no dependency on crystal growth, as is required for X-ray crystallography.
7. As mentioned above, protein dynamics can be probed on a much longer time scale than is accessible with many other techniques (e.g., NMR relaxation). HDX MS can probe dynamics ranging from milliseconds to several hours, and perhaps longer. As a result, HDX MS can increase significantly the overall description of dynamic motions within a protein.

19.1.2 Theoretical Basis and Experimental Design for HDX MS

The theory and methodology used to study protein conformation and dynamics using HDX MS have been described in several reviews [12, 16–20]. In the absence of secondary structure
restraints, HDX for a specific polypeptide is dependent on the temperature and pH of the reaction. The most common experimental procedure for HDX is continuous labeling. In this method, the exchange is initiated by making a large dilution of a concentrated stock of the protein into deuterated buffer. The progress of the exchange reaction is sampled at different times. Under these conditions, the chemistry and the thermodynamic parameters of HDX are well established [21–23]. The rate of HDX at the protein amide linkages is acid or base catalyzed, and can be expressed as follows:

\[ k_{hdx} = k_{H}[\mathrm{H}^{+}] + k_{OH}[\mathrm{OH}^{-}] \tag{19.1} \]

Thus, the rate of HDX for a specific polypeptide is dependent on the pH and temperature of the reaction. This rate, as determined experimentally, has a minimum in the pH range 2.3–2.5. Figure 19.1 illustrates the theoretical rates of HDX for rat liver mitochondrial aspartate aminotransferase, a 49,000 Da globular protein, in the absence of secondary structure restraints, calculated at 0 °C and at both pH 7.5 and 2.3. At pH 7.5 HDX is very fast ($t_{1/2}$ = 0.014 min) and the exchange is completed almost instantly. However, there is a minimum exchange rate at pH 2.4. At this pH, minutes are required before complete exchange occurs. This sensitivity of exchange rates to pH requires careful control of pH during exchange. However, the same pH sensitivity provides a tool to quench exchange by quickly lowering temperature and pH, a step necessary during mass analysis.

Fig. 19.1 Theoretical rates of hydrogen/deuterium exchange of mitochondrial aspartate aminotransferase (plot of D/molecule versus time in min, with curves for pH 2.4 and pH 7.5). The theoretical rate of HDX at 0 °C and pH 7.5 or 2.4 was calculated for mitochondrial aspartate aminotransferase (MW 44,597 Da) according to a previously published algorithm [22] using HXPEP, written and kindly provided by Zhongqi Zhang (Amgen, Thousand Oaks, CA)

In the absence of any structural constraints, the hydrogen atoms of solvent-exposed amide linkages exchange at their free, unmodified rates. However, if the amide hydrogen atoms are involved in stable internal hydrogen-bonding, or are not exposed to solvent, they will exchange more slowly. In native proteins, the local differences in these rates are due to the fact that the structure of these molecules is not rigid, but has a certain degree of mobility. This mobility has been called "breathing", and can be visualized as shown in Fig. 19.2. The kinetics of HDX can be described according to the following kinetic equation:

\[ k_{hdx} = \frac{k_{op}\,k_{e}}{k_{cl} + k_{op} + k_{e}} \tag{19.2} \]

where $k_{cl}$, $k_{op}$ and $k_{e}$ are the rate constants of closing, opening and chemical hydrogen/deuterium exchange, respectively. For proteins in their native state, a common assumption is that $k_{cl} \gg k_{op}$ and $k_{e} \gg k_{op}$.

Depending on the relative values of the kinetic constants, two extreme kinetic behaviors can be found. When $k_{cl} \ll k_{e}$, the exchange rate is determined by the first order rate constant $k_{op}$. Thus, $k_{hdx}$ is dependent exclusively on the conformation of the protein. This first extreme behavior is defined as EX1. EX1 kinetics are rarely observed. However, EX1 exchange can be observed under experimental conditions that favor the unfolded state [24, 25] of proteins (high temperature or in the presence of chaotropic or
unfolding reagents). On a mass spectrometer, EX1 is characterized by a bimodal transition from one mass (i.e., undeuterated) to the final (deuterated) species (Fig. 19.2). In other words, two isotopic envelopes are detected, one for the undeuterated peptide-ion and a second one for the fully deuterated ion. The relative intensity of these two isotopic envelopes changes over time as the exchange reaction proceeds.

In contrast to the conditions that define EX1, when $k_{cl} \gg k_{e}$, $k_{hdx}$ is second order and depends exclusively on the factors determining the chemical hydrogen/deuterium exchange. In this case the rate of exchange is given by $k_{hdx} = K_{op} k_{e}$, with $K_{op} = k_{op}/k_{cl}$. This second extreme behavior is defined as EX2. The EX2 mechanism is most commonly observed for proteins in the folded state. EX2 behavior is characterized by a monotonic change of the isotopic envelope with the progress of the exchange reaction (Fig. 19.2). The EX1 kinetic mechanism reflects the activation energy for segmental opening and the EX2 mechanism represents the sum of all energies of opening and proton transfer. In EX2, the free energy difference ($\Delta G^{0}$) of the opening event can be described according to the following equation:

\[ \Delta G^{0} = -RT \ln K_{op} = -RT \ln\left(k_{hdx}/k_{e}\right) \tag{19.3} \]

where $K_{op}$ is the equilibrium constant of the opening/closing reaction (Fig. 19.2).

Fig. 19.2 Schematic representation of the mechanism of hydrogen/deuterium exchange (panels: closed, open-H and open-D states along a reaction coordinate; isotopic envelopes over m/z 1040–1044 for EX1 and EX2). Hydrogen atoms in the peptide backbone (top panel) can exchange with hydrogen (blue) or deuterium (red) atoms in water in a process dependent on accessibility and breathing (opening and closing) of the protein. In the EX1 regime (left panel), opening is faster than closing ($k_{op} \gg k_{cl}$), and the rate of exchange is determined by the rate constant of opening. In the EX2 regime (right panel), closing is faster than opening ($k_{cl} \gg k_{op}$), and the reaction is dependent on the rate of opening and chemical exchange. The isotopic patterns show the theoretical exchange pattern of a triply charged peptide with m/z = 1040.08 under EX1 or EX2 exchange regimes

Based on these concepts, most HDX MS experimental designs rely on two different stages [14]: exchange and quenching. In the first stage, reaction conditions (i.e., pH and temperature) are designed to allow HDX while the protein undergoes normal folding/function. In the second stage, the HDX is quenched by rapidly decreasing the temperature (to 0 °C or below) and pH (to pH 2.3–2.5). Deuterium content in the protein is then analyzed by mass spectrometry.

19.1.3 Equipment

• Cooling HPLC interface. To reduce back exchange during mass analysis of the intact protein or its peptides, all experimental steps after HDX are performed at low pH and temperature. The simplest instrumental set-up consists of immersing the solvents, columns and all parts of an HPLC in an ice bath, or enclosing the entire HPLC set-up in a refrigerated chamber. For better control of temperature, we designed a Semi-Automatic Interface for Deuterium Exchange (SAIDE, Fig. 19.3) [10]. This interface consists of a TVC –S2 box (Mecour) equipped with a 6-port valve (Cheminert, N60 SS) and a 4-port valve (Cheminert, C2). The 6-port valve is equipped with a through-the-handle
external loop injector and holds the sample loop (10 μL). The sample loop acts as the reaction vessel during protease digestion. The reversed phase column bridges the two valves, and the 4-port valve directs flow to either waste or to the mass spectrometer. Other specialized equipment is available that performs automatic sample pick up, mixing, injection and data acquisition, although at a considerable expense [18].
• High performance liquid chromatograph (HPLC). The system should be able to deliver flows between 20 and 50 μL/min. We use a quaternary HPLC MS pump (ThermoFisher Scientific).
• Chromatographic columns. A reverse phase C8 (MicroTech Scientific, Zorbax C8 SB Wide Pore Guard Column 2.5 cm × 0.2 cm) is needed to desalt the protein when measuring global rate of the exchange in the intact protein. As an alternative, a reverse phase C4 may be required to desalt highly hydrophobic proteins (MicroTech Scientific, Zorbax C4 SB Wide Pore Guard Column 2.5 cm × 0.2 cm). A reverse phase C18 column (MicroTech Scientific, Zorbax C18 SB Wide Pore Guard Column 2.5 cm × 0.2 cm) is needed to resolve peptic peptides and identify regions with deuterium incorporation.
• Mass spectrometer. The mass spectrometers useful for HDX MS characterization of macromolecular complexes are Tandem Mass Spectrometers. That is, those that allow for at least two different stages of mass analysis: one to scan for the peptide-ions (parent ions) present in the sample, and the second to scan for the fragment ions produced after a specific parent ion has undergone a stage of fragmentation (see Section II: Mass Spectrometry). A high resolving power mass spectrometer, such as an FT-ICR or Orbitrap, is recommended. However, other mass spectrometers with lower resolving power have been used. Because of the high flows used for peptide elution, the ESI tip must be chosen carefully. A 100 μm ID tip with an opening of 30 μm has proved to be ideal for our experimental set-up.

Fig. 19.3 Mass spectrometer rigged for HDX MS. (A) The cooling box (SAIDE) is located right before the ESI source of a high resolving mass spectrometer (LTQ FT) and after the HPLC pumps (HPLC). (B) Detail of the SAIDE box showing the internal components of the unit: two valves (4-port and 6-port), one reverse phase column, loop and fluid lines. The box is used for temperature control during all stages of protein digestion, peptide desalting and chromatographic elution of peptides
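As a rough planning aid for the equipment listed above, the following Python sketch estimates the empty-tube volume of a guard column and the time needed to pump one such volume, or to sweep the 10 μL sample loop, at the quoted flow rates. It assumes that "2.5 cm × 0.2 cm" means length × inner diameter and it ignores the volume occupied by the packing material, so the column figure is an upper bound on the liquid volume rather than a measured value. The function names are illustrative only and do not come from any instrument software.

```python
import math


def tube_volume_ul(length_cm: float, inner_diameter_cm: float) -> float:
    """Empty-cylinder volume in microlitres (1 cm^3 = 1000 uL)."""
    radius_cm = inner_diameter_cm / 2.0
    return math.pi * radius_cm ** 2 * length_cm * 1000.0


def sweep_time_min(volume_ul: float, flow_ul_per_min: float) -> float:
    """Minutes needed to pump one volume at the given flow rate."""
    if flow_ul_per_min <= 0:
        raise ValueError("flow rate must be positive")
    return volume_ul / flow_ul_per_min


# Assumption: 2.5 cm is the column length and 0.2 cm its inner diameter;
# the packing is ignored, so this overestimates the liquid volume.
COLUMN_UL = tube_volume_ul(2.5, 0.2)
LOOP_UL = 10.0  # sample loop volume quoted in the text

if __name__ == "__main__":
    for flow in (20.0, 50.0):  # flow range quoted in the text, in uL/min
        column_min = sweep_time_min(COLUMN_UL, flow)
        loop_min = sweep_time_min(LOOP_UL, flow)
        print(f"{flow:.0f} uL/min: column {column_min:.1f} min, loop {loop_min:.2f} min")
```

Under these assumptions one empty-column volume is about 79 μL, which corresponds to roughly 1.6 min at 50 μL/min and 3.9 min at 20 μL/min. Estimates of this kind help judge how long the sample spends in the cooled flow path, where back exchange is slowed but not abolished.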
19.1.4 Materials

• Protein or protein system of interest.
• Pepsin. Make a pepsin (Worthington) stock solution by dissolving an appropriately weighed amount of pepsin in 200 mM ammonium formate, pH 2.3, at a final protein concentration of 1.6 mg/ml. Pepsin concentration can be estimated from its absorbance at 280 nm using a 1 % absorptivity coefficient of 1.4.
• Protein stock buffer. A buffer appropriate for your particular protein system.
• Deuteration buffers. Buffers adequate for your protein system made in D2O (99.9 % D2O, ACROS Organics). Note that a correction factor must be introduced when measuring the pH of deuterated buffers to account for the differences in activity of protium vs. deuterium: pD = pH(read) + 0.4, where pH(read) is the uncorrected pH meter reading.
• Quench buffer. Quench buffer is 200 mM NH4CH3COOH, pH 2.3, ice cold. Other buffer compositions can be used (e.g., ammonium phosphate). Note that in some cases it might be necessary to supplement the quench buffer with a low amount of a denaturing or chaotropic agent (e.g., 0.6 M guanidine hydrochloride) to achieve full unfolding of the protein and efficient pepsin digestion.
• HPLC solvents. Two solvents are needed to create a gradient. Solvent A is 0.05 % trifluoroacetic acid (TFA, MS grade) in H2O. Solvent B is 0.05 % TFA in acetonitrile.

19.1.5 Experimental Procedure

Figure 19.4A outlines the procedures involved in a continuous labeling experiment. Usually, a stock solution of the protonated protein is diluted into a deuterated buffer and the direction of the exchange is H→D (on-exchange). Figure 19.4B outlines the reverse procedure (off-exchange), in which a protein is first fully exchanged with deuterium and the exchange reaction then proceeds in the opposite direction. This method has been used to study the reversible unfolding of a protein. The procedure outlined below describes the steps involved in a continuous labeling, on-exchange procedure. Other experimental procedures are possible, however, and the particular design will depend on the question of interest.

19.1.5.1 Initiate Exchange Reaction
(A) The exchange reaction is initiated by making a 1:10 (or higher) dilution of a concentrated stock solution of the protein or protein system of interest into a buffer made in D2O.
(B) At different time points, the exchange reaction is sampled by taking an aliquot. Two mass measurements can be made: the global rate of exchange in the intact protein (see Sect. 19.1.5.2 Global rate of exchange) or the rate of exchange in pepsin-generated peptides (see Sect. 19.1.5.3 Location of deuterium exchange along the peptide backbone).

19.1.5.2 Global Rate of Exchange
To obtain a global rate of exchange, the change in mass of the protein at different times following the initiation of the exchange reaction is measured. For the measurement of the mass of the intact protein, mass analysis is performed by direct injection of an aliquot of the labeling reaction mixture on a C4 or C8 nano-column. Following desalting at 0–15 % B at high flow, the protein is eluted using a step gradient of acetonitrile (0–60 % B in B + A in 15 min) and analyzed on-line by mass spectrometry.

19.1.5.3 Location of Deuterium Exchange Along the Peptide Backbone
To identify the residues involved in the hydrogen/deuterium exchange reaction, it is first necessary to identify the peptides resulting from the proteolysis of the protein. This first stage is performed under control conditions; that is, in the absence of deuterium in the buffers but under conditions otherwise identical to those used to measure the exchange. This results in a list of peptides of
404 A. Artigues et al.

[Figure 19.4: two flow diagrams. Panel A: protein stock solution → dilute in deuterated buffer → HDX → quench at t1, t2, t3 … tn → LC MS of the intact protein (global deuterium content/rate of exchange), or proteolysis → peptic peptides → LC MS (deuterium level in peptides). Panel B: protein stock solution → dilute in deuterated buffer → deuterium-labeled protein → dilute in protiated buffer → HDX → quench at t1, t2, t3 … tn → the same two LC MS read-outs.]

Fig. 19.4 HDX MS general experimental procedure. The scheme shows the steps to perform continuous labeling HDX on-exchange (A) or off-exchange (B) experimental procedures. HDX is initiated by making a dilution of a concentrated stock of the protein into a deuterated buffer. At different time points the reaction is sampled by taking an aliquot and measuring the mass of the intact protein (global exchange) or of the proteolytic fragments (deuterium level in peptides) with the aid of a mass spectrometer

interest. Then, the experiment is repeated under the exchange conditions using deuterated buffers.

Peptic mass maps
(a) The first step is to make a dilution (1:10–1:20) of the protein stock in the protonated buffer. This dilution is equivalent to the dilution that will be made later in deuterated buffer to initiate the exchange reaction.
(b) Peptic digestions of the protein are performed by making a second 1:10 dilution of an aliquot of the protein in ice cold 200 mM ammonium formate (pH 2.3) containing pepsin at a final protein:protease ratio of 1:1 (w:w). Note that the protein:protease ratio must be optimized experimentally.
(c) Inject the reaction sample immediately into the loop of the 6-port valve on the SAIDE interface.
(d) Allow pepsin digestion to proceed for 2–5 min (time of digestion must be optimized for each protein).
(e) Switch the 6-port valve and start the HPLC gradient. The resulting enzyme digest is desalted on a C18 nano-column at 75–100 μL/min for 2 min while the flow on the 4-port valve is diverted to waste.
(f) Following desalting, switch the 4-port valve to direct the flow to the mass spectrometer ESI source for peptide detection.

(g) Elute peptides using a 2–40 % gradient of 0.05 % TFA in acetonitrile (solvent B) in 0.05 % TFA in H2O (solvent A) in 15 min. The peptides are detected on-line using a high resolving power mass spectrometer. Figure 19.5 shows a representative elution profile of a peptide digest using our chromatographic system. The MS settings should be optimized for detection of peptides using a high flow mobile phase. Data are acquired under automatic control to perform MS followed by tandem mass scans of the four to six most intense ions, using an exclusion list of 2–4 min, depending on the capabilities of your mass spectrometer and chromatographic system.

Measurement of deuterium content in peptides
(a) The exchange reaction is initiated as indicated above, using deuterated buffer instead of the protonated buffer.
(b) At different times during the exchange reaction, remove an aliquot and dilute it in the quench buffer in the presence of pepsin, as before (see Sect. 19.1.5.3.A. b–g).
(c) MS analysis is performed as above with the exception that the mass spectrometer is operated to perform mass analysis only (no MS/MS).

[Figure 19.5: four panels, each plotting relative intensity. A: chromatogram against time (min), 5–25 min. B: mass spectrum, m/z 600–1800. C: expanded mass spectrum, m/z 856–860. D: tandem mass spectrum, m/z 400–1600, with labeled b- and y-ions.]

Fig. 19.5 Representative chromatographic profile and data analysis. (A) Base line chromatographic profile of a peptide digest of mitochondrial aspartate aminotransferase. (B) Mass spectrum at 8.5 min of elution. (C) A magnification of the mass spectrum of panel B, showing the doubly charged ion with m/z of 856.5 corresponding to the peptic peptide AHNPTGTDPTEEEWK. (D) Tandem mass spectrum of the same peptide; to simplify the figure, only the most prominent b- and y- ions are indicated

19.1.6 Data Analysis

19.1.6.1 Peptide Identification
When working with pure proteins, as is the case in HDX MS, statistical tools for False Discovery Rate (FDR) and peptide/protein probabilities calculation are, as a general rule, not useful. Instead, peptide identification is based on parameters that rely on the quality of the tandem mass spectra. When data are acquired on a high resolving power mass spectrometer and Proteome Discoverer is used to analyze them, peptide identifications are made using an in-house protein database. This database includes the protein of interest, pepsin and common contaminant protein sequences. The database is made assuming that pepsin has no specificity, using a fragment ion mass tolerance of 20 ppm, and a parent ion tolerance of 0.30 Da. Peptide identifications are accepted if they can be established at Xcorr score

of at least 1.5, 2.0 or 2.5 for peptides with 1, 2 or 3 charges, respectively, with a ΔCorrelation score larger than 0.08. Note that manual inspection and validation of some tandem mass spectra may be required. See Chap. 14 for more information on tandem mass spectrometry peptide/protein sequencing and identification.

19.1.6.2 Deuterium Content
The change in deuterium content is measured as the difference between the average masses of the deuterated and undeuterated forms of the protein. Many software packages can be used, and usually the instrument manufacturer will provide a program to obtain this measurement. Specialized software is recommended. HDExaminer (Sierra Analytics) is a commercial software package that performs automatic isotopic envelope isolation and measurement of the average mass and deuterium content of the peptides, and can plot the results in a variety of formats, including the comparison of multiple states of a protein. There are, however, several free tools for the same purpose: HDXFinder [26], HD desktop [27] and its successor HDX Workbench [9], HX Express [28], Hexicon [29, 30] and MagTran [31], among others.

19.1.6.3 Mathematical Analysis
A. Curve fitting – Eq. 19.2 describes the exchange reaction for a single amide linkage. In theory, one could expect one phase per amide linkage. However, multiple protons in the peptide might exchange and individual rate constants of exchange cannot be measured. In practice, the exchange reaction is fitted to an exponential rise (on-exchange) or decay (off-exchange):

D_t = \sum_{i=1}^{n} A_i \left(1 - e^{-k_i t}\right)    (19.4)

where D_t is the deuterium content at time t, and A_i and k_i are the amplitude and the rate constant for the ith phase. Usually, multiple HDX reactions are grouped into fast, medium and slow phases (n = 3).

B. Use of overlapping peptides – Because pepsin has low selectivity for cleavage site, pepsin digestion results in the production of multiple overlapping peptides. Statistical and logical analysis of the deuterium content of these overlapping peptides can provide higher spatial resolution than that obtained at the peptide level. Some of the programs mentioned above will apply logical restrictions and will provide a value for the amount of deuterium incorporated/retained in smaller units than obtained at the peptide level.

C. Additional considerations – When calculating the total number of exchanged H/D, one must keep several things in mind. (1) Any HDX at the N-terminal end of the peptide is lost during proteolysis. (2) Previous studies have demonstrated that any HDX at the second amide linkage is also lost very quickly during the chromatographic step [22, 32]. (3) Proline in peptide bonds does not have an exchangeable proton at its amide linkage.

19.1.7 Alternative Workflows

As mentioned above, the generic experimental protocol outlined in Fig. 19.4 can be modified to fit specific questions. In most cases these require additional equipment. For example, manual mixing, as indicated in the protocol outlined above, allows the measurement of deuterium content after the first few seconds of exchange (10 s), but exchange reactions that occur below that threshold cannot be measured. For rapid mixing and quenching of the reaction in the time range below seconds, a quench flow instrument is required. In this situation, quench flow in combination with HDX MS has been used to access these very fast rates of exchange of enzymes during catalysis [33]. In pulse labeling experiments, an additional pump is used to expose the protein sample briefly to a pulse of deuterium and quench it quickly. This method has been used to study intermediates of folding pathways of proteins [34–36].
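The curve fitting described in Sect. 19.1.6.3 can be illustrated with a short sketch. The Python code below is not taken from any of the software packages cited above; it only evaluates the multi-phase on-exchange model of Eq. 19.4 and performs a coarse single-phase fit by scanning a list of trial rate constants. The function names, amplitudes, rate constants and trial grid are hypothetical values chosen for illustration, and a real analysis would normally use a proper non-linear least-squares routine.

```python
import math


def deuterium_content(t, phases):
    """Eq. 19.4: D_t = sum_i A_i * (1 - exp(-k_i * t)).

    phases is a list of (A_i, k_i) pairs; t and 1/k_i share the same time unit.
    """
    return sum(a * (1.0 - math.exp(-k * t)) for a, k in phases)


def fit_single_phase(times, observed, k_grid):
    """Coarse least-squares fit of one phase, D_t = A * (1 - exp(-k * t)).

    For each trial k the best amplitude A has a closed form, so only k is
    scanned. Returns (A, k, sum of squared errors), or None if no fit is possible.
    """
    best = None
    for k in k_grid:
        basis = [1.0 - math.exp(-k * t) for t in times]
        denom = sum(b * b for b in basis)
        if denom == 0.0:
            continue  # no information at this k (e.g. all times are zero)
        a = sum(b * d for b, d in zip(basis, observed)) / denom
        sse = sum((a * b - d) ** 2 for b, d in zip(basis, observed))
        if best is None or sse < best[2]:
            best = (a, k, sse)
    return best


if __name__ == "__main__":
    # Hypothetical fast, medium and slow phases (amplitude in Da, k in 1/min).
    phases = [(3.0, 5.0), (2.0, 0.5), (1.0, 0.01)]
    for t in (0.17, 1.0, 10.0, 100.0):
        print(t, round(deuterium_content(t, phases), 3))
```

Scanning trial rate constants keeps the example free of external dependencies; the price is that the fitted k can only be as fine as the grid supplied.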

Most HDX MS studies make use of in-solution pepsin digestion. However, immobilized pepsin columns have been shown to improve digestion efficiency [37, 38]. In some cases there is too much back exchange, rendering the data unusable. Care must be taken in the choice of support used to conjugate the protein [39].

19.1.8 Complementary Methodology

It has been observed that ESI of proteins in an unfolded state will produce more highly charged envelopes than those produced by ESI of proteins in native conditions. This indicates that the protein ions in gas phase retain some of the structure that the protein had in solution, thus the charge distribution of the protein ions is an indication of the global structure of the protein. This is thought to be a consequence of the higher exposure of potentially charged residues that are otherwise protected in the core of the protein in the native state.

To obtain higher spatial resolution it would be necessary to interpret the tandem mass spectrum. However, due to the low energy of fragmentation of CID, this fragmentation method results in scrambling of deuterium among the resulting fragment ions [40–46]. Thus, the CID mass spectrum of these peptides cannot be used to determine the position of the deuterium in amide linkages. The development of ETD, a more energetic method of fragmentation, results in the efficient fragmentation of peptides with little or no scrambling, and interpretation of the tandem mass spectra of these peptides results in amino acid resolution.

In hydroxyl radical labeling [47], a protein solution is exposed briefly to oxidative conditions. This results in oxidative modifications of solvent exposed amino acid side chains. This can be achieved either by chemical reaction using Fenton chemistry [48] or by UV cleavage of hydrogen peroxide in fast photochemical oxidation of proteins (FPOP) [49]. The appearance of amino acid residues covalently modified with oxygen can be identified by tandem mass spectrometry following trypsin digestion. When interpreting these data, it is important to keep in mind that the labeling of individual amino acid residues is determined not only by their accessibility to solvent but also by their intrinsic reactivity. The reactivity of amino acid side chains is as follows: Cys > Met > Trp > Tyr > Phe > Cystine > His > Leu ~ Ile > Arg ~ Lys ~ Val > Ser ~ Thr ~ Pro > Gln ~ Glu > Asp ~ Asn > Ala > Gly [47]. For detailed discussion of this methodology see the reviews by Chance [50] and Konermann [51].

19.1.9 Problems and Caveats

19.1.9.1 Back Exchange
A primary concern is the loss of deuterium during sample handling for mass analysis. Reducing pH to quench exchange requires the addition of acid. This quenching results in a reduced exchange rate, not a complete absence of exchange. Furthermore, reduced pH also exposes the now deuterated protein to additional protons. Also, the deuterated protein is further exposed to protonated buffer during the HPLC stage of desalting/peptide separation. Therefore, deuterons can be replaced with buffer protons during data acquisition steps in a process known as back-exchange. In order to minimize loss of deuterium, mass measurement must be taken quickly, usually within the first few minutes following quenching. Despite efforts to work quickly, the back exchange of side chains is too rapid to be assessed with normal mass spectrometry methodologies and is the reason that HDX MS is limited to detecting information about the peptide backbone.

In most cases, two states of the protein are compared (control and experimental condition). Thus, assuming that the experimental conditions are maintained constant for each state, the differences in total deuterium content and/or rate constants in identical peptides are used to describe different states of the protein. However, if a fully deuterated form of the protein is available, the following equation can be used

to correct for the loss of deuterium during the analytical stages [13]:

D_t = (m_t - m_H) / (m_D - m_H)    (19.5)

where D_t is the content of deuterium at time t, m_t is the average mass at time t, m_H the mass of the undeuterated peptide and m_D the mass of the fully deuterated peptide.

19.1.9.2 Overlapping Peptides
To reduce back exchange, peptides are eluted using sharp gradients. In most cases there are only 30 min for data collection after quenching of the exchange reaction, which includes protease digestion, desalting and peptide separation. Moreover, the use of an enzyme with low selectivity results in the co-elution of multiple peptides. The isotopic envelopes of these peptides change in shape and average mass as the exchange reaction proceeds. This often results in the overlapping of peptide isotopic envelopes. Most software applications resolve this problem by either extracting ion chromatograms (HDXFinder) or by curve fitting a theoretical envelope to the experimental data (HDExaminer, HD Desktop). The use of a high resolving power mass spectrometer alleviates this problem. However, each peptide assignment must be validated individually.

19.1.9.3 Spatial Resolution
The spatial resolution of HDX MS detected with simple mass measurements is at the peptide level. Most HDX MS studies published to date have been made using this mode of operation. As a result, such experimental designs provide medium resolution, i.e. deuterium content is measured at the peptide level. To gain more spatial resolution using this experimental design, multiple overlapping peptides are required, and the assignment of deuterium content is provided by logical analysis of these multiple overlapping peptides. However, this is not always possible, since certain regions of the protein may not produce the necessary number of overlapping peptides to obtain the degree of resolution desired.

19.2 Limited Proteolysis

19.2.1 Introduction

The development of the concept of “limited proteolysis” is widely attributed to work from the Linderstrom-Lang laboratory in the 1940s [3]. Among other studies, his laboratory demonstrated that proteins could be “enzymatically modified without serious degradation” by restricting proteolysis [52]. Subsequently, the Neurath laboratory also made extensive use of this technique to study the structure of proteins [53, 54]. Unlike the complete proteolysis that is normally used for mass spectrometry, limited proteolysis refers to proteolysis that is halted by some means, so that complete degradation of the protein does not occur (see Sect. 19.2.3.2 for details on quenching proteolysis). Limited, controlled, in vitro proteolysis is a simple, but powerful, tool to study the conformation of proteins.

Proteases have a variety of specificities, i.e., residues at which they preferentially cleave. This specificity controls the sites of cleavage based on the primary structure of proteins not showing higher order structure. With the added dimension of folding, however, the normal specificities of proteases are no longer the only factor dictating cleavage location. Secondary structure will obscure sites from proteases, regardless of exposure, as will any additional structure that hides regions within folds or causes stereochemical constraints [55]. Accessibility to the protease active site by the protein target becomes more restricted upon folding, thus the structure of the substrate contributes to the selectivity of the protease.

As an experimental technique, limited proteolysis was initially used to cleave larger proteins or complexes into separate domains to study them individually. It was first used to probe protein structure by Neurath in 1980, when he observed that most globular proteins were relatively resistant to proteolysis until they were denatured [53]. He proposed that, as with other enzymatic reactions, optimal proteolytic activity

occurred when there was complete complementarity between the substrate structure and the active site of the protease. The ability of the protease to cleave the substrate also depends on the location of a potential cleavage site within the structure, as only solvent exposed regions will be accessible in a tightly folded protein. Neurath proposed a model in which functional domains of proteins are tightly packed, and therefore relatively protease resistant, whereas linker regions or loops are surface exposed and more susceptible to proteolysis [53, 54]. Using crystal structures and limited proteolysis to confirm correlations between flexibility and cleavage, this model became the basis of limited proteolysis theory: that is, limited proteolysis occurs only at sites on a protein’s structure that are solvent accessible and flexible enough to conform and fit within the active site of the protease [56–59]. Generally, this solvent accessibility and flexibility occurs at specific region(s) of a protein; so that even when multiple proteases with different specificities are used, the cleavage sites are clustered together, although not necessarily with cleavage at the same residue [55].

Because protease specificity still plays a role in determining cleavage sites, it is important to use proteases with broad specificities, along with multiple proteases with differing specificities. This will ensure that the regions being targeted reflect their exposure in the tertiary structure, rather than their primary structure. Therefore, it is also imperative to maintain the protein’s higher order structure. When planning and executing an experiment, it is essential to keep in mind the basic premise of limited proteolysis: brief proteolysis of surface exposed regions while maintaining the protein core. Because proteolysis of a protein can cause conformational changes, it should not be allowed to proceed for too long, as regions that were not originally surface exposed may become so, causing results to be skewed. If the protein core becomes compromised, information about the structure is no longer reliable.

Limited proteolysis was initially analyzed using SDS-PAGE and Edman degradation; however, with the development of MS to study proteins in the late 1980s [60, 61], it became the preferred method of analysis for limited proteolysis. MS has allowed the applications and capabilities of limited proteolysis to greatly increase. With the use of MS, it is now possible to easily identify the exact sites where proteolysis occurs, providing a map of the regions cleaved by the brief proteolysis, allowing for detailed identification of the flexible and surface exposed regions. Unlike NMR or crystallography, MS requires only a minimal amount of protein to obtain structural information, and the ratio of protein to protease is key, rather than the absolute amount of either. Limited proteolysis and MS can also be used on proteins of any size, as there are no minimum or maximum protein size restrictions. It can be used on single-domain proteins, multi-domain proteins, multi-subunit proteins, etc. Another advantage of limited proteolysis/MS is the ability of MS to analyze complex mixtures [62, 63].

19.2.2 Limited Proteolysis Applications

Limited proteolysis can be used to study different aspects of protein structure, several of which are described below. Because surface accessibility and flexibility are required for proteolysis to occur, the most obvious application of limited proteolysis is the identification of exposed loops and disordered regions. By employing proteases of different specificities and limiting proteolysis, while maintaining the protein core, it is possible to map exposed loops and identify regions of disorder. This can be used to complement NMR or crystallography data [64, 65], or even to replace these techniques if they cannot be used on the protein of interest. Crystallography can be especially difficult for disordered or dynamic regions, as it results in low resolution. Limited proteolysis can be used to confirm the disorder and dynamic properties of these regions [66, 67]. Likewise, as first proposed by Neurath, multi-domain proteins often have flexible and disordered linker segments joining the domains, and these will be preferentially cleaved during partial proteolysis [57, 68]. Therefore, identification of

domains and their exact boundaries is possible. This separation of domains was one of the first applications of limited proteolysis, as seen in several early papers [54, 69, 70]. More recently, this application has been used in conjunction with MS for the specific identification of linker regions. For example, applying these techniques to the E. coli transcriptional activator protein NtrC, a protein with three separate domains, Bantscheff et al. [57] developed a system combining limited proteolysis, MS, and SDS-PAGE to identify two flexible linker regions. Limited proteolysis can also be used to cleave flexible linker regions to produce separate domains, making feasible the study of single domains and potential folding intermediates [71].

Another application of limited proteolysis is the study of complexes formed between proteins and their targets. This is possible because the interface regions of the complex will initially be solvent accessible on the surface of the protein, but become protected when the complex forms. Therefore, by first performing limited proteolysis on an isolated protein and then on the protein in complex, it is possible to identify the interface regions, although regions affected by conformational changes prompted by the interaction may also show changes in the level of protection. Different peptide maps for the two protein states, free and in complex, will be observed by MS following the limited proteolysis. An example of this approach is the study of the calmodulin-melittin complex [72]. The authors performed limited proteolysis on free calmodulin, free melittin, and the calmodulin-melittin complex, observing different peptide maps for the free proteins vs. the proteins in complex. From the regions that changed, they designed a model showing the sites of interactions between melittin and calmodulin. Similar applications of limited proteolysis to protein complexes include identifying regions of protein-DNA and protein-RNA interaction and identifying antibody epitopes [73–75].

Regardless of the experimental design – identifying domain linkers, mapping exposed loops, or interactions – another use of limited proteolysis is comparing changes in those regions upon protein activation, mutagenesis, or ligand binding. In these cases, the limited proteolysis of the protein in its basal state is compared to that of the altered protein. If there are conformational changes occurring on the surface of the protein, the resultant peptide maps can show regions of differential proteolysis, indicating they are more or less flexible or exposed.

19.2.3 Methodology

19.2.3.1 Optimization
The most basic rule to keep in mind when designing and executing a limited proteolysis experiment is that the protein core must remain intact, or it is no longer “limited” proteolysis, and information about the protein structure may no longer be valid. Because this is so essential, experiments must be performed under conditions that maintain the stability and structure of the protein being studied, regardless of the optimal conditions for the proteases being used.

Because it is important to ensure that the higher order structure, and not the protease specificity, dictates the sites of cleavage, it is advisable to use multiple proteases with varying specificities and some with broad specificities. This means, however, that the individual proteases will most likely not be cleaving under their optimal conditions (pH, temperature, etc.). Given that maintaining target protein stability is the most important factor, one must first identify conditions that are optimal for protein stability. This will include conditions such as buffer, pH, temperature, and duration of proteolysis. Once these conditions are determined, the concentration of proteases required for sufficient, yet limited proteolysis, can be optimized. Because sub-optimal conditions will undoubtedly be used for some of the proteases, it will likely be necessary to use different ratios of protein to protease for each protease in order to ensure similar levels of proteolysis with minimal cleavage. Examples of this are shown in Table 19.1.

Another important experimental variable to optimize is the quenching step, because different proteases may be typically inhibited differently. The ideal quenching step, however, is one that can

Table 19.1 Protease specificities and final concentrations (a)

Protease               Specificity                      Kinase:protease ratio
Thermolysin            Hydrophobic                      15:1
Chymotrypsin           Aromatic                         2000:1
Protease V8 (Glu C)    Asp and Glu                      150:1
Trypsin                Arg and Lys                      5000:1
Ficin                  Nonspecific                      10,000:1
Arg C (Clostripain)    Arg                              10:1
Lys C                  Lys                              50:1
Papain                 Nonspecific                      10,000:1
Proteinase A           Nonspecific                      100:1
Subtilisin             Nonspecific                      200,000:1
Pepsin                 Aromatic, acidic, hydrophobic    10:1

(a) Different proteases can be, and should be, used in limited proteolysis experiments. Listed above are examples of proteases and the protein:protease ratios that were used in limited proteolysis experiments at pH 6.8 on the glycogenolytic enzyme phosphorylase kinase [76]. While these ratios will likely differ for other proteins, these are reasonable starting points for optimization. Other proteases that are commonly used include Proteinase K, elastase, and Asp-N
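As a rough illustration of how the ratios in Table 19.1 translate into amounts at the bench, the sketch below computes the protease mass needed for a given protein mass. It assumes the ratios are weight:weight, which the table does not state explicitly; the function name and the 100 µg example are hypothetical.

```python
def protease_mass_ug(protein_mass_ug, ratio):
    """Protease mass (µg) for a protein:protease ratio such as "5000:1".

    Assumes a weight:weight ratio (an assumption, not stated in Table 19.1).
    Thousands separators as printed in the table ("10,000:1") are accepted.
    """
    protein_part, protease_part = ratio.replace(",", "").split(":")
    return protein_mass_ug * float(protease_part) / float(protein_part)


if __name__ == "__main__":
    # Hypothetical example: 100 µg of target protein with three of the listed ratios.
    for name, ratio in [("Trypsin", "5000:1"), ("Ficin", "10,000:1"), ("Pepsin", "10:1")]:
        print(name, ratio, protease_mass_ug(100.0, ratio), "µg")
```

Because the listed ratios span more than four orders of magnitude, the protease stocks will usually need serial dilution so that the required amounts correspond to pipettable volumes.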

be used for all proteases in the study. If more than one quenching method is used, it should be shown that neither the results nor the protein are affected. Finally, as discussed further in Sect. 19.2.3.2, quenching must be both rapid and complete.

19.2.3.2 Quenching of Proteolysis
For reproducibility and to avoid too much proteolysis, it is important to ensure hydrolysis is quenched effectively. In ideal quenching conditions, proteases should be stopped instantly and completely. The requirement for instant protease deactivation excludes many protease inhibitors that inhibit chemically, e.g., active-site-directed affinity labels, as they may act relatively slowly. Quenching by changing conditions, such as pH, can be useful; however, if the quenching pH must be altered prior to analysis, the possibility for renewed proteolysis must be considered. Often quenching is achieved by adding trifluoroacetic acid or acetonitrile, although protein precipitation may occur. Denaturants can also be used to quench; however, some proteases still show residual activity in the presence of denaturants. The denatured protein that is being studied will likely be an even better substrate for proteolysis than its native counterpart. When analysis is performed by MALDI, proteolysis has sometimes been quenched by addition of the matrix solution or by pipetting an aliquot of the hydrolysis mixture directly onto the plate [66, 73]. The bottom line is that whatever quenching condition one chooses to employ, it is imperative to experimentally test it to establish with certainty that quenching does, in fact, occur.

19.2.3.3 Mass Spectrometry
MALDI and ESI MS are both capable of analyzing limited proteolysis data. MALDI-MS is tolerant of buffers and does not require desalting the samples, both desirable features. ESI-MS does require desalting, but chromatographic separation of complex mixtures allows for sequencing of more peptides, particularly desirable in complex mixtures.

19.2.3.4 Peptide Identification
Given that limited proteolysis is typically used on purified, known proteins, the use of standard protocols, which employ probabilities and false discovery rates, is not essential. Peptide identification in limited proteolysis is similar to that used in HDX-MS (Sect. 19.1.6.1) and general peptide identification is discussed in more detail in Chap. 14. Typically, a region will be targeted, rather than a specific residue, and if different proteases with different specificities are used, it is likely there will be overlapping peptides covering the same region. This indicates consistency
412 A. Artigues et al.

of the data and the flexibility and exposure of that region. Proteolysis will likely result in sub-digestions, i.e., after a region has been initially cleaved, the protease may continue to act on that peptide, resulting in multiple smaller peptides from the same region. These sub-digestions can be ignored in favor of the longer peptides that cover the same region. In fact, by considering sub-digestions cautiously, one can avoid over-interpreting the putative importance of specific residues within the larger region that encompasses them.

19.2.3.5 General Protocol
(A) Proteolysis: Incubate protein with protease at the optimized ratio determined previously (Sect. 19.2.3.1) under conditions (buffer, pH, and temperature) best suited for protein stability
(B) Quenching: After incubation for the appropriate time(s), remove an aliquot and quench the reaction (Sect. 19.2.3.2)
(C) MS: Prepare samples following the protocol established for the MS to be used. Take care to maintain quenched conditions, so as not to resume proteolysis. Keep all peptides for analysis. See Sect. 19.2.5 for discussion of peptide release.

19.2.4 Representative Results and Data Presentation

Organization and presentation of data are largely dependent on the main point of the experiment, the type of experiment performed, and the protein(s) involved. Table 19.2 and Fig. 19.6 show several possible ways to present results.

19.2.5 Caveats Concerning Limited Proteolysis

A possibility that is not often considered is whether all peptides formed during limited proteolysis are actually released from the parent protein following quenching. This is not an important concern when product analysis is carried out by MALDI, as all peptides should be observed; however, the binding of proteolyzed peptides may be a concern with other analytical methods, as some peptides could be missing in the final product analysis. The non-covalent binding of otherwise free peptides by a proteolyzed parent has been observed with the protein phosphorylase kinase, a 1.3 MDa complex of multiple subunits. Following selective chymotryptic hydrolysis of its largest subunit (to the extent that no remaining trace of it was observed on SDS-PAGE), there were only small changes in the structure of the proteolyzed parent as observed by electron microscopy [77], despite the fact that the degraded subunit accounts for 43 % of that parent complex’s total mass. Consequently, evaluating a variety of conditions for the quenching of proteolysis, or between proteolysis and the removal of remaining parent protein prior to analysis, could prove advantageous in assuring maximum release of generated peptides. Note also that if the parent protein is precipitated prior to analysis, peptides derived from it could be trapped within the precipitate.

A caveat that was discussed in Sect. 19.2.3.4 is the production of smaller peptides from the sub-digestion of initially released larger parent peptides, which may potentially produce peptides too small to detect. If a proteolysis time-course is run, these sub-digestion peptides are likely to be observed later than their parent peptides. A time course can also show the later secondary appearance of less readily cleaved peptides from different regions of the protein. A caveat concerning interpretation of the appearance of these unique secondary peptides is that, instead of representing regions less readily cleaved, they could also represent a new conformation of the protein induced by an initial proteolysis event. A new proteolytically induced conformational change is especially problematic for proteins whose function is controlled by so-called intrasteric regulation [78] (i.e., a region of primary structure in the protein is autoregulatory through its interaction with other regions of the protein, generally the active site)
19 Protein Structural Analysis via Mass Spectrometry-Based Proteomics 413

[79]. For many of these proteins, the autoregulation can be overcome by limited proteolysis, resulting in a new conformation with a different activity. Thus, an important control to include in limited proteolysis experiments is the determination of functional changes following proteolysis. This concern also suggests that keeping the extent of proteolysis relatively limited is prudent.

Table 19.2 Representative data (a)

Trypsin Pepsin Arg C
WT Mutant WT Mutant WT Mutant
1–20 1–23
83–90 83–90 83–89 83–89 83–90 83–90
150–161 150–161 150–158 150–158 150–161 150–161

(a) When comparing mutants, activated proteins, or complex formation, it is necessary that all conformers or states are included in the table in a format that makes comparison easy. Because using multiple proteases is advisable, shown here is a table in which the results from different proteases are compared side by side for the different conformers of protein. The titles (wild type and mutant) can be exchanged for non-activated vs. activated protein forms, complexed protein vs. free, etc.

Fig. 19.6 Representative results. When mapping exposed loops and regions of disorder, it is helpful to visually demonstrate the protein structure and sites of cleavage. Demonstrated here is a way to conveniently show regions that are targeted by various proteases. This figure also demonstrates clearly that different proteases are targeting the same region, further implying flexibility and exposure. Depending on the size of the protein, the line representing residues could be substituted for the actual sequence. Alternatively, if the protein is too large, the representative residue lines used in this figure may more clearly portray the results, and regions that are cleaved can be magnified to highlight cleavage details
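
The side-by-side comparison of Table 19.2, together with the advice in Sect. 19.2.3.4 to favor the longest peptides over their sub-digestion products, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the residue ranges echo the two complete rows of Table 19.2, and the extra trypsin peptide (150–155) is an invented sub-digest added to show the filtering step.

```python
from typing import Dict, List, Optional, Set, Tuple

Peptide = Tuple[int, int]  # inclusive (start, end) residue numbers


def drop_subdigests(peptides: List[Peptide]) -> List[Peptide]:
    """Keep only peptides that are not fully contained in another observed peptide."""
    unique = sorted(set(peptides))
    return [
        p for p in unique
        if not any(q != p and q[0] <= p[0] and p[1] <= q[1] for q in unique)
    ]


def shared_regions(by_protease: Dict[str, List[Peptide]]) -> List[Peptide]:
    """Return residue ranges covered by at least one peptide from every protease."""
    covered: Optional[Set[int]] = None
    for peptides in by_protease.values():
        residues = {r for start, end in peptides for r in range(start, end + 1)}
        covered = residues if covered is None else covered & residues
    ranges: List[Peptide] = []
    for r in sorted(covered or []):
        if ranges and r == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], r)  # extend the current run of residues
        else:
            ranges.append((r, r))            # start a new run
    return ranges


# Hypothetical wild-type observations; (150, 155) is an invented sub-digest.
wild_type = {
    "Trypsin": [(83, 90), (150, 161), (150, 155)],
    "Pepsin": [(83, 89), (150, 158)],
    "Arg C": [(83, 90), (150, 161)],
}
longest = {name: drop_subdigests(peps) for name, peps in wild_type.items()}
print(longest["Trypsin"])       # [(83, 90), (150, 161)]
print(shared_regions(longest))  # [(83, 89), (150, 158)]
```

Regions reported by every protease are the ones most plausibly exposed and flexible; running the same comparison on each conformer (wild type vs. mutant, free vs. complexed) gives the side-by-side view the table is meant to convey.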

19.2.6 Side Chain Modification as a Complementary Technique to Partial Proteolysis

Historically the goal of this method has been to identify relatively reactive nucleophilic amino acid side chains that are accessible to the electrophilic reagent used to covalently modify them. Thus, the residues modified are likely to be on the surface of the protein and could be within, or adjacent to, the exposed loops implicated by partial proteolysis. Identification of modified residues can, therefore, corroborate results from partial proteolysis. Over the years, more complex methods of side chain modification having a considerably wider range of amino acid targets, such as oxidation by hydroxyl radicals [80–82], have been developed, but the underlying idea of preferentially modifying surface residues remains the same. An increase in the variety of side chains that can be modified does, however, add greatly to the power of the technique, making it complementary to HDX. Unlike HDX, however, the covalent modifications are irreversible, potentially facilitating analysis.

The general method of side chain modification could reasonably be called chemical or protein “footprinting”. Historically the term footprinting has connoted protection of DNA chains from cleavage by DNA-binding proteins. Similarly, the term “protein footprinting” has been used to denote cleavage of a protein at specific residues subsequent to its modification by a chemical reagent [83, 84]. The same term has also been used, however, to describe the analyses through side chain modification of nearly every characteristic of proteins (structure, dynamics, binding, etc.) with cleavage occurring after modification prior to MS analyses. Consequently, to avoid potential confusion in terminology, we call this approach side chain modification, rather than protein footprinting.

There are few variables in carrying out side chain modification experiments: choices of modifying reagent and of modifying conditions (time, pH and concentration of modifier with respect to protein). The conditions used will affect the rate, and perhaps the extent, of modification, and deciding on which conditions to use is an empirical process. One wants to obtain a reasonable amount of modification in a reasonable amount of time; however, what represents a reasonable amount of modification is not always obvious. Certainly, enough product should be formed to be able to argue that it truly represents the conformation of a large population of the native protein as opposed to the conformations of minor components produced by denaturation, oxidation, post-translational modification, or minor proteolysis during protein purification. On the other hand, one does not want so much modification that the conformation of the protein could be altered by the modifications or the conditions employed to modify it. Consequently, control analyses should be carried out to characterize the properties of the protein following modification. Evaluating full retention of biological function and of the higher order structure of the protein after modification are two necessary controls. Many studies do not address the extent of modification, nor its reproducibility. The latter is necessary to assure that similar results are obtained with multiple independent protein preparations. If one is comparing two conformations, e.g. apo-protein vs. ligand-bound, misleading information is less likely if modifications of both are kept in the linear phase.

19.3 Crosslinking

19.3.1 Introduction

Chemical crosslinking refers to the covalent coupling of separate functional moieties. This technique has been used for over 50 years to analyze the structure, function and interactions of proteins by identifying crosslinking sites formed by small multi-functional reagents, termed crosslinkers. The coupling of protein crosslinking with modern MS techniques (CXMS) has led to a resurgence in the field, with new instruments and crosslinking technologies being developed to facilitate identification of conjugates (crosslinked proteins and/or peptides) from ever smaller amounts (nmole to

pmole) of sample. CXMS is a bottom-up approach, in that the protein is first crosslinked and then digested specifically with proteases to generate peptides for detection by MS. A limiting factor in the analysis of proteins by CXMS is the extensive array of products (including many side products) that are possible from such digests. These product arrays are too complex to be annotated manually and require the use of search engines that have been developed specifically to identify cross-linked peptides. Our intent in this chapter is to expose novice users to: (A) CXMS approaches that minimize the generation of side products and maximize structurally useful conjugates, (B) available conventional, mass and affinity tag crosslinking reagents, and (C) search engine technologies for identifying conjugates.

19.3.1.1 Advantages and Applications
Crosslinking provides low- to medium-resolution structural information for proteins that are not amenable to high resolution techniques such as NMR and X-ray crystallography. It is a versatile technique that, in its simplest form, has been used to determine nearest neighbors and the minimal subunit stoichiometry in multi-oligomeric complexes [85]. In its more complex forms, in combination with Western blotting, immunoprecipitation, various protein labeling methods, top-down MS and CXMS approaches, it has been used successfully to study protein-protein interactions (PPI) in transient and stable complexes [86], providing maximum distance information for these targets in both in vitro and in vivo studies (reviewed in [87, 88]). Recent advances in the detection of peptides from complex mixtures by modern MS and supporting search engine technologies have provided a robust platform for the development of CXMS and its primary use in the identification of crosslinked peptides from digests of crosslinked proteins. CXMS provides a range of structural information, and the resolution of this information is dependent on how specifically a crosslinking (CX) site can be localized. Identification of crosslinking sites, which provides the highest structural resolution, requires the identification of crosslinked amino acid side chains. CX sites may be used to determine the proximity of domains and amino acid side chains in protein monomers or complexes, to identify potential intra- or intermolecular protein binding sites, and to provide structural constraints for theoretical protein models [89–91]. Many search algorithms and specialized reagents have been developed to enrich and enhance the detection of conjugates and the more numerous side products from the digests of crosslinked proteins [90, 92–94], making this approach readily accessible to researchers with access to MS and proteomics facilities.

19.3.1.2 Chemistry of Crosslinking

Crosslinking Reagents
The range of structural information gained from CXMS is inherently dependent upon the type of cross-linking reagent (CXR) used. The largest and most commonly used classes of CXRs are bifunctional molecules containing two reactive groups that are connected by an intervening spacer group. Bifunctional CXRs are further divided into two subgroups, based on whether they contain identical (homobifunctional) or different (heterobifunctional) reactive groups. Many different reactive groups with varying chemistries have been incorporated into CXRs (Table 19.3). However, there are five functional groups that are commonly used, because they react with protein side chains in aqueous solutions at near physiological pH [85]. N-hydroxysuccinimide (NHS) and imidoester groups react preferentially with the N-termini of proteins, as well as the imidazole and ε-amine groups of histidine and lysine, respectively. Sulfo-derivatives of the NHS group are also available to increase the solubility of CXRs with large hydrophobic spacer groups. Maleimide and alkyl halide groups are targeted primarily by the free thiols of cysteine. As opposed to the functional groups above, aryl azides are promiscuous, and upon exposure to UV, insert non-selectively as nitrenes at active hydrogen-carbon bonds or undergo ring expansion to form dehydroazepines [87], which react both with nucleophiles and active hydrogen-containing species.

Table 19.3 Selected reactive groups of typical crosslinking reagents

Group name                          Amino acid preferentially targeted
N-Hydroxysuccinimide ester (NHS)    Lysine
Maleimide                           Cysteine
Alkyl halide                        Cysteine
Imidoester                          Lysine
Phenyl azide                        Non-specific
Carbodiimide (b)                    Aspartic and glutamic acid

(a) In the reactive group chemical structures (structure column not reproduced here), R denotes spacer and second reactive group, except for the carbodiimide
(b) Zero-length crosslinking reagent that activates carboxyl groups for subsequent attack by proximal amines

Spacers or linkers are chemical moieties that covalently couple the reactive functional groups of a crosslinker. Besides determining the distance between the reactive groups, spacers also influence the geometry of crosslinking and the solubility of the CXR. CXRs with long alkyl spacers are generally hydrophobic and cover a broad range of crosslinking distances between potential nucleophiles due to the flexibility of the linker. Spacers may also contain functional groups that allow for their cleavage by specific reagents, such as periodate or DTT, which cleave intervening glycol or disulfide groups, respectively. Crosslinkers that contain these groups are members of a subclass of bifunctional reagents, termed cleavable CXRs. In addition to chemical cleavage sites, CXRs with spacers containing MS-cleavable functional groups have been developed to facilitate bond breaking by collision-induced dissociation (CID) and/or electron transfer dissociation (ETD) in mass spectrometers. Such reagents are used as reporter groups to aid in the identification of crosslinked peptides from complex mixtures [95, 96]. Spacers comprising affinity tags such as biotin and Click chemistry labels are employed to enrich low-abundance conjugates [97, 98], and even more complex forms that contain both affinity and mass tags have been synthesized to simultaneously enhance enrichment and identification of crosslinked peptides [99]. CXRs containing functional spacers are often identified as trifunctional or multifunctional reagents; however, the term trifunctional also refers to CXRs that contain three reactive groups that emanate from a central spacer group or atom, each of which is capable of reacting with a distinct site on protein targets.

Zero-length CXRs refer to molecules that directly couple amino acid side chains without an intervening spacer. These reagents generally modify and activate functional groups on specific side chains for subsequent attack by an adjacent protein nucleophile, such as the ε-amine of lysine. For example, N-substituted carbodiimides react with the carboxylates of Asp and Glu residues to form acylisourea intermediates that facilitate the formation of isoamide bonds with

proximal lysine residues (Table 19.3). Free thiols may also target these reactive intermediates to form thioester linkages; however, these conjugates are relatively unstable by comparison with the corresponding amide linkage. For complete and thorough reviews of crosslinking reagents see the works of Wong and Hermanson [87, 88].

Proteins as Reactants
Proteins as reactants add to the complexity of products generated in cross-linking reactions, because they are polyvalent structures, containing multiple reactive amino acid side chains with varying reactivities that are dependent upon their microenvironments in the protein complex. The microenvironment of an amino acid depends on the dynamics of the region encompassing the location of that amino acid in the tertiary structure, its solvent accessibility, and its interactions with and chemical composition of its nearest neighbors. On the basis of hydrophobicity, amino acids may be divided into two major classes, nonpolar and polar. Polar residues can be separated into those containing side chains with nonionizable (asparagine, glutamine, serine and threonine) and ionizable (histidine, lysine, arginine, tyrosine, cysteine, aspartate and glutamate) functional groups. With the occasional exception of tryptophan, the latter group is primarily targeted by CXRs.

Products of Crosslinking
As previously mentioned, crosslinking and subsequent digestion of a protein and/or protein complexes generate a vast array of products that must be accounted for to detect crosslinked peptides. Figure 19.7 shows examples of the types of products that are typically observed when two proteins are treated with a bifunctional crosslinker. In addition to crosslinking between the two proteins (intermolecular) and within each protein (intramolecular), numerous mono-modifications occur as well. Moreover, crosslinking is a continuous process, and if not carefully controlled, results in the formation of multiple protein conjugates, progressing from crosslinked dimers to large polymeric arrays. Subsequent digestion of the crosslinked proteins significantly increases the number of possible products, particularly if the CXR targets side chains that are also substrates of the protease used, which results in incomplete digestion of the crosslinked protein targets [100]. Estimates suggest that the number of potential peptide products from such digests increases exponentially with the size of the protein [101], necessitating the use of bioinformatics approaches to annotate all possible products.

Data Analysis
For two reasons, analysis of CXMS data is not trivial and requires dedicated software tools. The first is that the number of candidates that must be considered is enormous in comparison to regular proteome-wide peptide analyses. The second is that the abundance of crosslinked proteins is much lower than that of non-modified proteins, and the data analysis algorithm must be sufficiently sensitive to identify small signal peaks amongst dominating neighboring peaks.

A number of software tools have been developed in the past decade for CXMS data analysis. In the following sections, we will explain the basic data analysis principles, look into the computational algorithms behind these tools, examine their pros and cons, and finally provide our perspectives on future development of data analysis algorithms and software tools for CXMS analysis.
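
The first of these two points can be made concrete with a rough count. The Python sketch below is an illustration under stated assumptions, not the method of any particular search engine: it assumes a simplified trypsin rule (cleavage after K or R unless the next residue is P), up to two missed cleavages, a minimum peptide length, and an amine-reactive CXR so that only lysine-containing peptides are paired. The function names and the toy sequence are hypothetical.

```python
from typing import List


def cleavage_fragments(sequence: str) -> List[str]:
    """Split a sequence after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    return fragments


def candidate_peptides(sequence: str, missed_cleavages: int = 2,
                       min_length: int = 5) -> List[str]:
    """Join up to `missed_cleavages` + 1 adjacent fragments into candidate peptides."""
    fragments = cleavage_fragments(sequence)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return sorted(peptides)


def count_pair_candidates(peptides: List[str]) -> int:
    """Unordered pairs (a peptide may pair with itself) of lysine-containing peptides."""
    n = sum(1 for p in peptides if "K" in p)
    return n * (n + 1) // 2


# Hypothetical toy sequence, used only to show how the counts are obtained.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"
peptides = candidate_peptides(toy_protein)
print(len(peptides), "linear candidates")
print(count_pair_candidates(peptides), "crosslinked pair candidates")
```

Even this simplest count grows quadratically with the number of linkable peptides, before modifications, multiple linkage sites per peptide, or dead-end products are considered, which is why every query spectrum faces a far larger candidate list than in a conventional peptide search.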

Fig. 19.7 Products of protein crosslinking



19.3.2 Methodology

Crosslinking is a specialized form of general protein chemical modification, both of which are empirical processes. It is simply impossible to predict under which conditions and with which CXR a given protein will undergo crosslinking. Variables such as time, reaction component concentrations, pH and CXRs must be screened to maximize the specificity and selectivity of crosslinking. Specificity refers to the preferential stable modification of a protein side chain functional group by a specific class of CXR reactive group. Selectivity, on the other hand, denotes the potential for detecting observed protein interactions by crosslinking. Both of these parameters are inter-related and the extent to which one is controlled significantly affects the other. Ultimately, successful crosslinking of proteins to obtain maximum yields of a desired conjugate depends on these two factors. Crosslinking is the first step in the CXMS pathway to identifying CX sites in any protein or complex of interest. Optimization of subsequent proteolysis and detection steps is also critical and the corresponding protocols, instrumentation, and software will be discussed in the following sections.

19.3.2.1 pH
Most CXRs contain electrophilic reactive groups that are targeted by protein nucleophiles in reactions. These reactions generally involve either displacement of a leaving group or direct addition to a double bond with adjacent electron withdrawing groups on the CXR to form a covalent bond between it and the attacking amino acid side-chain. In terms of Lewis acid–base theory, the reactivity of an amino acid side chain is directly related to the nucleophilicity (or electron-pair donating capability) of its side chain functional group, which in turn can be expressed in terms of the ratio of its electron donor/base (A−) and electron acceptor/acid (HA) forms in solution. This ratio can be estimated theoretically using the Henderson-Hasselbalch equation, which implies mathematically that for a nucleophile to exist equally in its conjugate base and acid forms, the pH value must equal its pKa.

$$\mathrm{pH} = \mathrm{p}K_a + \log\left(\frac{[\mathrm{A}^-]}{[\mathrm{HA}]}\right)$$

For one and two unit increases in pH above the pKa, the percentage of the basic form increases correspondingly from 50 to about 91 and 99 %, respectively. Thus at alkaline pH values, the nucleophilicity for basic R-S− and R-NH2 protein nucleophiles is greater than that of their corresponding acid forms (R-SH and R-NH3+) at low pH values.

The relative order of nucleophilicity for protein functional groups involved in crosslinking reactions is: R-S− > R-NH2 > R-COO− ≅ R-O−. With the exception of zero-length crosslinkers, most conventional, commercially available CXRs are designed to react preferentially with thiolate or amine-containing protein nucleophiles. Examining the range of theoretical pKa values for cysteine (8.8–9.1) and lysine (9.3–9.5) [87], one might conclude that they are poor nucleophiles at neutral pH. However, in the microenvironments of proteins, these side chains are often reactive and covalently modified by CXRs. Thus optimization of pH is critical in controlling the outcome of crosslinking. For example, crosslinking at high pH values might seem prudent to increase the reactivity of amino acid side chains; however, it also significantly diminishes the selectivity of a CXR for its intended target and may diminish the specificity of crosslinking by increasing unwanted side reactions and the formation of large conjugates, rendering the results uninterpretable. Moreover, hydrolysis of many CXR reactive groups increases significantly and competes with crosslinking at high pH values, generating excessive dead-end modifications. A general rule of thumb is that pH and all other variables in the crosslinking reaction should be adjusted through screening to maximize the formation of detectable desirable low mass conjugates.
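
The relationship between pH, pKa and the fraction of a side chain in its nucleophilic (basic) form follows directly from the Henderson-Hasselbalch equation and can be checked numerically. The short Python sketch below is illustrative only; the function name is hypothetical, and the pKa values of 9.0 and 9.4 are chosen from within the theoretical ranges quoted above for cysteine and lysine rather than measured for any particular protein.

```python
def fraction_deprotonated(pH: float, pKa: float) -> float:
    """Fraction in the conjugate-base form, from pH = pKa + log10([A-]/[HA])."""
    ratio = 10 ** (pH - pKa)  # [A-]/[HA]
    return ratio / (1 + ratio)


# At the pKa, and one and two pH units above it
for delta in (0, 1, 2):
    percent = 100 * fraction_deprotonated(9.0 + delta, 9.0)
    print(f"pH = pKa + {delta}: {percent:.1f} % basic form")
# pH = pKa + 0: 50.0 % basic form
# pH = pKa + 1: 90.9 % basic form
# pH = pKa + 2: 99.0 % basic form

# Assumed theoretical pKa values, evaluated at a near-physiological pH of 7.5
for name, pKa in (("cysteine thiol", 9.0), ("lysine amine", 9.4)):
    percent = 100 * fraction_deprotonated(7.5, pKa)
    print(f"{name}: {percent:.2f} % basic form at pH 7.5")
```

With theoretical pKa values, only a few percent of either side chain is in the reactive form near neutral pH, which is why the shift in apparent pKa caused by the protein microenvironment matters so much in practice.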

19.3.2.2 Uses of Conventional and Mass/Affinity Tag CXRs
In the following section, conventional CXRs are defined as those not containing mass tag/reporter and/or affinity tags. Because crosslinking is an empirical process, a CXR is generally chosen for a protein target from screens of reagents with multiple chemistries under multiple conditions. That having been said, there are many commercially available CXRs with properties that are advantageous for specific types of analyses. Zero to short (2–3 Å) length CXRs are preferable for detecting protein interactions, in that their product conjugates are more likely to represent an actual interaction within or between protein targets, i.e. the specificity of crosslinking is maximized. For such analyses, conventional or specialized mass/affinity tag-containing reagents with large crosslinking spans (>2–3 Å) should be avoided. In low resolution crosslinking experiments in which the identification of one or more binding partners is sufficient, longer span CXRs with affinity or mass tags are more advantageous, simply because they generally increase the likelihood of isolating and/or detecting crosslinked products. Hydrophobic, water insoluble CXRs are typically used for screening protein interactions that are stabilized by hydrophobic binding surfaces, whereas hydrophilic, water soluble reagents are often employed for labeling charged, solvent accessible residues on the surfaces of proteins.

Homobifunctional CXRs are used primarily in one-step crosslinking reactions, in which all components are present in the reaction. Heterobifunctional reagents containing two chemically distinct functional groups are often exploited for use in two-step crosslinking experiments. In such experiments, a protein target is first modified under conditions that favor the reactivity of one functional group, followed by purification of the labeled complex to remove non-covalently bound reagent and to facilitate its exchange into conditions that favor reaction of the second CXR functional group. For example, CXRs containing both photo- and thermo-reactive functional groups (Table 19.1) are typically used in these reactions, with the protein first labeled with the thermo-reactive group and then purified in the dark, following which the modified complex is exposed to UV radiation to activate and promote crosslinking by the photoactive group.

Some specialized CXRs contain affinity or mass tags. Affinity tagged crosslinked proteins are enriched using affinity purification media. Mass tagged crosslinked proteins, on the other hand, generate peptides with unique isotopic signatures that aid in the detection and identification of crosslinks and dead-end side modifications. For the most part these reagents use the same chemistries as conventional crosslinkers, most of which incorporate NHS groups to target lysine ε-amines. Many strategies have been introduced to follow sequentially labeled precursor ions (ionized intact crosslinked peptide) and their collision products, with mass/affinity tag combination CXRs created to reduce the complexity of the product pool and to facilitate cleavage of both peptide arms of the crosslink. Several notable examples include CLIP [98], which utilizes a bis-NHS CXR, with a terminal Click alkyne tag for enrichment (using biotin azide) and an NO2 reporter group that both enhances water solubility and acts as a neutral loss reporter during MS-induced fragmentation. Several groups have developed CXRs that fragment during MS/MS to release small molecules that provide mass signatures for crosslinked peptides [95], termed protein interaction reporters [102]. Using a different ligation approach, Trnka and Burlingame synthesized a novel CXR, diformyl ethynylbenzene (DEB), which forms Schiff bases with lysine ε-amines that are subsequently reduced to secondary amines with cyanoborohydride [91]. The authors demonstrate that reduction to an amine, rather than an acylation product formed by NHS groups, provides two additional protonation centers. Additionally, incorporation of an intervening rigid ring spacer in DEB decreases the m/z ratio of the conjugate for more optimal fragmentation by high resolution ETD and electron capture dissociation (ECD), providing more complete fragmentation along the peptide

backbone. Moreover, the reagent contains a clickable moiety for addition of affinity or mass tags for purification or generation of diagnostic reporter ions during MS/MS fragmentation. Digestion of DEB crosslinked proteins also generates high charge state gas phase precursor ions (4–6+), which allows for their exclusion from native and dead-end modified peptide ions using charge dependent precursor selection [91]. More recently Ihling et al. have developed a CXR with a spacer that contains an N-oxy-tetramethylpiperidine linked to benzene (TEMPO), which contains a CID-labile NO-C bond [103]. This reagent facilitates free radical initiated peptide sequencing (FRIPS), generating open shell radicals that provide signatures for determining the sequence and location of the CX site on crosslinked peptides by successive MS2 and MS3 analyses. More solutions for reducing the complexity of crosslinking products are likely as the list of these reagents that exploit high resolution tandem MS continues to grow.

19.3.2.3 Equipment
The initial stages in the analysis of proteins by crosslinking require very basic equipment, commonly found in most biochemical laboratories. These include various SDS-PAGE apparatuses (large and mini gel versions) to analyze protein crosslinking products, circulating water baths and incubators to control for temperature, and light boxes for viewing stained gels. In-gel proteolysis techniques require a laminar flow hood, bench-top centrifuge and vacuum centrifuge. After optimizing the yield and proteolysis for a crosslinked protein, MS technologies are employed to analyze the digest. There are many different configurations for mass spectrometers; however, high resolution instruments with fast duty cycles almost always produce the best data for analysis by search engines, because considerable mass accuracy is required to sort out the mass degeneracy resulting from the diversity of the peptide pool generated after the digestion of crosslinked proteins [100]. High resolution instruments also have faster acquisition time and shorter duty cycles (percentage of a time window required to make a measurement), which increase the potential for analyzing low intensity ions typically associated with crosslinked peptides during a given run. Orbitrap MS instruments best fulfill these requirements [104]. In addition to the parameters listed above, Orbitraps come in different tandem MS configurations, with the most advanced being capable of carrying out CID, ETD and higher-energy C-trap dissociation (HCD) fragmentation of precursor ions. See Chap. 6 for more detailed descriptions of mass spectrometers.

19.3.2.4 Data Analysis Using Search Engines
The goal of data analysis is to identify crosslinked peptides. Crosslinked peptides include inter-crosslinked peptides, intra-crosslinked peptides, and dead-end crosslinked peptides. Identification of intra-crosslinked peptides and dead-end crosslinked peptides may be achieved by using software tools that were designed originally to identify regular (i.e. non-crosslinked) peptides from shotgun proteomics experiments; however, identification of inter-crosslinked peptides is extremely difficult. This is because inter-crosslinked peptides include two peptides and the search algorithm must search each experimental spectrum (i.e. query spectrum) against all of the possible pairs of peptides. Figure 19.8 illustrates a general data analysis procedure that comprises several steps that are explained in detail below.

In the first step, sample proteins are digested in silico to generate all of the possible peptides using a digestion rule, which uses the known chemistry of the protease selected to determine where cleavage should take place along the amide backbone. For example, if trypsin is selected, then the algorithm generates all possible peptides arising from cleavage C-terminal to lysine and arginine, except when these residues are located N-terminal in the primary sequence to proline. Experimentally it is not uncommon for trypsin to miss one or more of its cleavage sites, so peptides with 1–2 miscleavages are also considered. Peptides that are too short or extremely hydrophilic are often lost in wash steps prior to injection in the mass spectrometer, and large peptides with masses greater than 4000 Da are often not efficiently cleaved and transmitted. Therefore algorithms must be flexible enough to
accommodate these and other results that deviate from ideal theoretical conditions. To accommodate these possibilities, search engines typically incorporate two user input parameters that may be adjusted to narrow the range of peptide chain length, depending upon the capabilities of the instrument being used, and the number of miscleavage sites (NMC).

Fig. 19.8 Data analysis workflow

In the second step, the peptides generated in the first step are combined in pairs, and their masses are calculated and annotated in extremely large databases, based on sequence and potential chemical modifications. Possible modifications are defined from rule sets that take into account the possible chemistries and residues that are potentially targeted by any given CXR. Depending on the flexibility of the search engine, a user may manually limit the number of potential crosslinked products from any given peptide during the run. More powerful programs may also include products that may result from crosslinking between three or more peptides using the parameters described above.

Experimental spectra are pre-processed in the third step to account for variations in noise in tandem MS signals and to normalize low and high abundance peaks, both of which are generally important in conjugate identification. Pre-processing of an experimental spectrum separates signal peaks from noise peaks, removes the latter, and normalizes the resulting signal peaks so that low and high intensity peaks are scaled differently. Normalization permits amplification of low intensity peaks, which are often characteristic of crosslinked peptides and thus allows them to be weighted to a greater extent in subsequent scoring rounds. Programs that are designed to carry out this procedure are generally capable of detecting more conjugates than those that simply analyze a given number of the most intense peaks in the spectrum.

Usually, the terminal step in processing is to score the spectral similarity between processed experimental spectra from step three and theoretically generated spectra for all potential candidates generated in step two. Existing programs calculate spectral similarity in different ways, either by cross-correlation or simply by counting the number of matches detected between peaks from experimental MS/MS and theoretical fragmentation spectra. Candidates are first generated, and these consist of all of the crosslinks with calculated masses that fall within a defined range bracketing the precursor mass measured in the experimental MS/MS spectrum. For each of the candidate crosslinks, a theoretical fragmentation spectrum is generated. As opposed to the general processing of non-modified peptides, only b- and y-ions are generally considered for crosslinks fragmented by CID and only c- and z-ions are considered for crosslinks fragmented by ETD. This is because each crosslink contains two or more peptides and their theoretical fragmentation spectra become very complicated if other ion types, such as a-ions and those arising from loss of H2O and NH3, are also considered. Existing software tools are summarized in Table 19.4.
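
The first two steps of the procedure, together with candidate selection by precursor mass, can be illustrated with a minimal Python sketch. It is not the implementation used by any of the search engines discussed here. The function names (digest, peptide_mass, candidate_pairs), the toy protein sequence, the length limits, the 10 ppm tolerance and the spacer mass of 138.06808 Da (a value commonly quoted for DSS/BS3-type reagents) are all assumptions made for illustration; the residue masses are standard monoisotopic values.

```python
from itertools import combinations_with_replacement

# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def digest(seq, max_missed=2, min_len=5, max_len=40):
    """Step 1: in silico tryptic digest with missed cleavages and a length filter."""
    # Trypsin cleaves C-terminal to K or R unless the next residue is P.
    cuts = [0] + [i + 1 for i in range(len(seq) - 1)
                  if seq[i] in "KR" and seq[i + 1] != "P"] + [len(seq)]
    peptides = set()
    for i in range(len(cuts) - 1):
        # j - i - 1 is the number of missed cleavage sites (NMC) in the peptide.
        for j in range(i + 1, min(i + 2 + max_missed, len(cuts))):
            pep = seq[cuts[i]:cuts[j]]
            if min_len <= len(pep) <= max_len:
                peptides.add(pep)
    return sorted(peptides)


def peptide_mass(pep):
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    return sum(MONO[aa] for aa in pep) + WATER


def candidate_pairs(peptides, precursor_mass, linker_mass, tol_ppm=10.0, site="K"):
    """Step 2 plus candidate selection: peptide pairs matching the precursor mass."""
    # Simplifying assumption: a linked residue blocks cleavage, so the reactive
    # residue must not be the C-terminal residue of the peptide.
    linkable = [p for p in peptides if site in p[:-1]]
    tol = precursor_mass * tol_ppm * 1e-6
    hits = []
    for a, b in combinations_with_replacement(linkable, 2):
        mass = peptide_mass(a) + peptide_mass(b) + linker_mass
        if abs(mass - precursor_mass) <= tol:
            hits.append((a, b, round(mass, 4)))
    return hits


# Toy example: hypothetical sequence and an assumed spacer mass of 138.06808 Da.
peptides = digest("MKTAYIAKQRPLSEKGHWR")
target = peptide_mass("MKTAYIAK") + peptide_mass("QRPLSEKGHWR") + 138.06808
for a, b, mass in candidate_pairs(peptides, target, linker_mass=138.06808):
    print(a, b, mass)
```

Because every linkable peptide is paired with every other, the number of candidates grows roughly with the square of the number of peptides, which is why limiting chain length and NMC in the first step matters so much in practice. A real search engine would also handle chemical modifications, dead-end products and fragment-level scoring, all of which this sketch omits.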

Table 19.4 Search engines

Name                  Publication year  Reference
PeptideMap            1997              [105]
ASAP and MS2Assign    2000              [106]
GPMAW                 2001              [107]
X-Link                2002              [108]
Popitam               2003              [109]
MS2PRO                2003              [110]
Links and MS2Link     2004              [111]
CLPM                  2005              [112]
XLINK                 2006              [113]
VIRTUAL-MSLAB         2006              [114]
SearchXLinks          2006              [115, 116]
Pro-Crosslink         2006              [117]
X!Link                2007              [118, 119]
X-Links               2007              [120]
CrossSearch           2008              [100]
MS-3D                 2008              [121]
xComb                 2010              [122]
xQuest                2010              [123, 124]
Mass-Matrix           2010              [125]
CRUX                  2010              [126]
MS-Bridge             2010              [127]
Xlink-Identifier      2011              [128]
CrossWork             2011              [129]
StavroX               2012              [130]
pLink                 2012              [131]
SQID-XLinK            2012              [132]
Hekate                2013              [133]
XLPM                  2014              [134]
MXDB                  2014              [135]
AnchorMS              2014              [136]
SIM-XL                2015              [137]

19.3.3 General Protocols

Because crosslinking is an entirely empirical process, the following sections will focus primarily on developing screens, rather than explicit protocols, to determine the best conditions and reagents for optimizing the yield and digestion of a desired conjugate from protein targets. Because CXMS is a bottom-up process, we will assume in these screens that protein reactants are purified to near homogeneity.

19.3.3.1 Crosslinking Screens
Ideally in any reagent screen, it is advisable to first analyze the target protein under conditions that maintain its native state or that support a known function and/or interaction with a specific partner. The concentration of protein should be sufficient for visualization using general gel staining procedures. Under such conditions, either the time or concentration can be varied for the CXR. In one-step crosslinking screens, the concentration of CXR is generally varied from a 10- to 500-fold molar excess over the protein target, initially for a fixed time of 15 min. Conversely, greater than 500-fold molar excesses of CXR are incubated with the protein for short time periods, ranging between 1 and 10 min. Using small gel formats with 15–20 wells, 3–4 reagents may be assessed per gel, and as many as 16 reagents may be tested in 1 day. If any conjugates are observed, then reaction conditions may be varied to optimize formation of the desired conjugate. During the screening process, care must be taken to assure that accessory components (e.g., buffers, salts, stabilizing reagents) are compatible with the crosslinker being used. For example, amine-containing buffers such as Tris should be avoided when using NHS-substituted CXRs or any other functional group that targets amines. To avoid large quantities of side-product formation, excessive amounts of crosslinker should be avoided, and only the amount required to generate sufficient amounts of the desired conjugate should be used. Additionally, extremely high pH values should be avoided, because most conventional CXR reactive groups are susceptible to hydrolysis and are either rapidly deactivated or preferentially mono-modify the protein target to form dead-end side products.

Screens using heterobifunctional CXRs to form conjugates in two-step crosslinking protocols are more complicated than one-step screens, because of the intermediate purification step required between successive modification steps with each of the two functional groups of the CXR (see Sect. 19.3.2.2). A rapid assessment of conditions required for two-step crosslinking can be achieved by using small one-mL spin columns loaded with desalting gel media to partially purify the complex after the first modification step and to exchange it into reaction media that are compatible with the second modification
step. For example, when screening conditions for optimizing crosslinking with a heterobifunctional CXR containing photo-reactive azido and NHS functional groups, time- and concentration-dependent modification of the protein target with the NHS group is carried out in the first step, as described above under one-step crosslinking. To avoid activating the azido group, reactions should be carried out using Eppendorf tubes that are not transparent to light and the total volume for each condition should not exceed 100 μl. Reactions are simply quenched by removing excess reagent with the desalting spin column using a benchtop centrifuge in the dark. The desalting gel should be equilibrated in a buffer solution that is compatible with the second photolysis step. Multiple samples may be loaded onto crystallization trays that contain shallow wells, and exposed simultaneously to UV light using a simple hand-held lamp that is placed over the tray for 2–3 min. The reactions are then quenched using SDS-buffer and loaded onto gels to analyze product formation. Although the spin columns do not remove all excess CXR and do not permit a complete exchange of conditions, they provide an efficient method for narrowing the conditions required for optimal crosslinking of the target.

19.3.3.2 Digestion of Conjugates for MS Analyses
In-gel or hetero-phase and in-solution digestions are the two most common approaches for hydrolyzing crosslinked proteins, and MS facilities generally provide basic protocols to follow for sample submission or provide services to perform these procedures. However, the preparation of protein samples, and specifically crosslinked proteins, for MS analyses is a critical and often overlooked component of CXMS. The ultimate goal of this process is to maximize the coverage of the crosslinked protein, which requires optimal cleavage and recovery of the peptide components of the conjugate. Both in-gel and in-solution methods require similar components, which include a targeting protease or chemical to catalyze hydrolysis at specific sites along the amide chain, a denaturant to unfold the protein to enhance cleavage along the backbone, a reducing agent (typically either DTT or 2-mercaptoethanol) to reduce cysteine disulfides and an alkylating agent (iodoacetic acid or iodoacetamide) to modify free thiols generated by reduction. The latter two steps are carried out to prevent refolding. Proteins have unique properties and are targeted to different extents by specific proteases. Covalent attachments introduced by crosslinking usually further complicate proteolysis by affecting the reproducibility and completeness of the digestion. With crosslinking, proteolysis is an empirical process and must be optimized by varying solution conditions and the general components discussed above [138]. Typically, the reaction steps are carried out in the following order: denaturation, reduction, alkylation and proteolysis. Historically, denaturants such as urea, guanidinium hydrochloride and SDS were used and subsequently diluted after reduction and alkylation steps to concentrations tolerated by the protease; however, they interfere with and are poorly tolerated by MS. To address this problem, more MS-friendly denaturants such as Rapigest™ (Waters, Milford MA) [139], sodium deoxycholate (SDC) [139] or sodium 3-[(2-methyl-2-undecyl-1,3-dioxolan-4-yl)methoxy]-1-propanesulfonate (ALS) [140] have been developed. Alternatively, spin concentrators and various filters have been developed to facilitate exchange of secondary chemicals and denaturants without significant loss of the conjugate prior to proteolysis [139, 141]. Additional methods for improving protein denaturation, including thermal (IR and microwave radiation), ultrasonic and solvent-based techniques, are summarized in an excellent review by Hustoft et al. [142]. After denaturation, engineered forms of trypsin are generally used to carry out proteolysis, because they specifically cleave amide backbones after lysine and arginine, function well in low concentrations of multiple denaturants, and are relatively resistant to autolysis. A recent report suggests that tandem application of Lys-C (lysine-specific protease) and trypsin promotes more efficient cleavage of protein substrates than trypsin alone [138]. Despite all the possible choices in such reactions, some of the
following parameters are good starting points for in-solution digestions. First, the conjugate may be reduced with DTT (10-fold molar excess) and denatured concomitantly in either 0.1 % ALS or SDC at elevated temperatures (~50–85 °C) for 1 h. This is followed by alkylation with iodoacetic acid (40-fold molar excess) in the dark for 30 min at 30 °C. After alkylation, DTT is added in excess of iodoacetic acid to prevent alkylation of trypsin. The denatured protein may then be exchanged into 25 mM ammonium bicarbonate using a 3000 MW cutoff spin concentrator (EMD Millipore) and digested overnight at 30 °C using a 25-fold excess (w/w) of sequencing grade trypsin (Promega). Peptides may be recovered by several rounds of centrifugation and washes with 10 % acetonitrile in 25 mM ammonium bicarbonate. Peptides are then concentrated to remove acetonitrile or lyophilized in a vacuum concentrator.

In-gel digestion uses the resolving power of SDS-PAGE to isolate the desired conjugate from complex mixtures of crosslinking products, significantly reducing the number of products to be analyzed. On the other hand, gels can hinder peptide recovery, depending to a great extent on the type of extraction procedure used. Several aspects of this technique are unique compared to in-solution methods, based on the polyacrylamide matrix, which limits diffusion of the reactants and protease necessary for generating peptides [143]. Thus, the ratios of protease to substrate are generally much higher than those typically used in solution. Additionally, the gel sections containing the conjugate must be treated with solvents (typically 50 % acetonitrile in 25 mM ammonium bicarbonate) to remove SDS and other gel solution components that inhibit the activity of the protease. Another important consideration is that there are many handling steps that can potentially introduce contaminants, particularly keratins. Thus all reagents must be prepared carefully and any instruments used must be cleaned scrupulously before carrying out the procedure. Gloves and sterile sleeve protectors should be worn at all times. Specific details for gel-phase proteolysis conditions are outlined in a published protocol [143] and can be accessed online at the UCSF mass spectrometry website.

19.3.3.3 Data Input
As discussed in Sect. 19.3.2.3, the use of search engines generally requires little from the user. Most are designed with interfaces that allow the user to upload the sequence(s) and reagents being tested. Additionally, some parameters such as the number of allowed side products and crosslinks per conjugate may also be adjustable. Typically, one should limit these parameters in the first round of an analysis; first to minimize computing space and time, and second to avoid extensive data output. Some programs ask the user to specify the reactive groups of the CXR and the mass of the intervening spacer (after modification), as well as the mass of dead-end products. Users with limited knowledge of crosslinking or chemistry should avoid the latter programs.

19.3.4 Caveats

Perhaps the greatest mistake made by even experienced users of the crosslinking technique is to overinterpret results. First, there is a tendency in the literature for users to define a detected CX site on a protein as a binding site, no matter the span of the CXR. The specificity and, therefore, the probability that crosslinking represents an actual binding event is greatest when zero-length chemically coupled residues on opposing binding partners are identified. CX sites that are detected using CXRs with crosslinking spans greater than 2–3 Å should be discussed in terms of the proximity of the linked residues, defined by the range of distances that the spacer can occupy in solution [144]. Another common misconception is that the absence of crosslinking indicates absence of interaction [145]. In this case there are many more reasons why crosslinking does not occur, based on incompatibility of the CXR with the chemistry, geometry and/or solvent accessibility of the protein-protein interaction surface(s). There are many examples for which CX sites are purportedly identified by
simply matching the experimental mass of a precursor ion with the theoretical mass of a crosslinked peptide. With large proteins it has been demonstrated that a large number of dead-end and crosslinked peptides may account for a single precursor ion within the resolving limits (~3 ppm) of a high resolution FT instrument [100]. Even when the masses of a precursor and its corresponding fragments are well matched to those for a theoretically generated candidate, there is a reasonable potential for misidentification, based on the limited resolving capabilities (>200 ppm) of typical collision cells, i.e. significant error in the identification of fragment ions generated by tandem MS. High resolution measurements of fragment ions are now possible in new Orbitrap MS instruments using HCD, significantly increasing the potential for boosting confidence levels in matching assignments. Finally, corroborating evidence from alternative methods is always desirable for any interaction that is detected or suggested by crosslinking.

19.3.5 Representative Results and Complementary Techniques

Because of its versatility, CXMS has been used in combination with many complementary techniques developed to detect protein-protein interactions. These studies often are focused on determining the structure of proteins and their complexes and on theoretically modeling non-homologous proteins. Several examples of these combined approaches will be discussed in terms of how each complements the other. With the development of MS instruments that are capable of transmitting large macromolecular complexes [146], top-down MS has become a well-established method for analyzing the interaction of proteins and/or subunits in large protein complexes that are not amenable to NMR and X-ray crystallographic methods [147], providing a potent alternative and complementary approach to crosslinking [94]. The basic approach relies on the transmission of a partially hydrated protein complex in near-native conditions, in which the topological arrangement and interactions of its protein components are probed either by CID after injection [148] or the introduction of sub-stoichiometric amounts of small molecules that destabilize the complex prior to injection [149]. Maps defining the interactions of integral subunits in the complex are constructed based on the composition and number of subcomplexes detected [150]. Top-down MS also is capable of detecting differences in the stability of a complex in different conformational states [151]. For example, Lane et al. showed that the native, non-activated form of the (αβγδ)4 phosphorylase kinase complex (PhK) is more stable than its active phosphorylated form by demonstrating that the percentage of intact phosphorylated complex decreased with respect to that of the native under identical conditions [151]. In that study, phosphorylation of the complex also perturbed interactions of the subunits in the complex, resulting in preferential interactions among the regulatory β subunits, also detected by crosslinking in a previous study [152]. In a parallel study, these investigators combined CXMS, immuno EM, cryoEM, modeling and biochemical data to determine the location of the regulatory β subunits in the PhK complex [89]. The topology and location of the subunits in the connecting bridge region of the bilobal complex were determined using top-down MS, and CXMS was used to constrain an atomic model of the β subunit (generated by I-TASSER [153]) to facilitate its docking in the bridges of the cryoEM 3D structure. Aebersold and coworkers have also used CXMS to provide distance constraints in combination with tandem affinity purification to model a protein phosphatase 2A network of interactions [154]. CXMS has become the method of choice for constraining theoretical models [155], and is widely used in integrative structural modeling (ISM), an approach in which theoretical models of variable resolution are scored, based on their agreement with constraints provided by different forms of experimental data, commonly referred to as input data [156]. ISM approaches using CXMS have been used to model complex macromolecular assemblies,
including the yeast eIF3:eIF5 complex [90] and the photoreceptor phosphodiesterase hetero-oligomer [157].

References

1. Baldwin RL (2011) Early days of protein hydrogen exchange: 1954–1972. Proteins 79:2021–2026
2. Hvidt A, Linderstrom-Lang K (1954) Exchange of hydrogen atoms in insulin with deuterium atoms in aqueous solutions. Biochim Biophys Acta 14:574–575
3. Schellman JA, Schellman CG (1997) Kaj Ulrik Linderstrom-Lang (1896–1959). Protein Sci 6:1092–1100
4. Sheinblatt M (1970) Determination of an acidity scale for peptide hydrogens from nuclear magnetic resonance kinetic studies. J Am Chem Soc 92:2505–2509
5. Molday RS, Englander SW, Kallen RG (1972) Primary structure effects on peptide group hydrogen exchange. Biochemistry 11:150–158
6. Rosa JJ, Richards FM (1979) An experimental procedure for increasing the structural resolution of chemical hydrogen-exchange measurements on proteins: application to ribonuclease S peptide. J Mol Biol 133:399–416
7. Wagner G, Wuthrich K (1982) Amide proton exchange and surface conformation of the basic pancreatic trypsin inhibitor in solution. Studies with two-dimensional nuclear magnetic resonance. J Mol Biol 160:343–361
8. Katta V, Chait BT (1991) Conformational changes in proteins probed by hydrogen-exchange electrospray-ionization mass spectrometry. Rapid Commun Mass Spectrom 5:214–217
9. Pascal BD, Willis S, Lauer JL, Landgraf RR, West GM, Marciano D, Novick S, Goswami D, Chalmers MJ, Griffin PR (2012) HDX workbench: software for the analysis of H/D exchange MS data. J Am Soc Mass Spectrom 23:1512–1521
10. Villar MT, Miller DE, Fenton AW, Artigues A (2010) SAIDE: A Semi-Automated Interface for Hydrogen/Deuterium Exchange Mass Spectrometry. Proteomica 6:63–69
11. Englander JJ, Rogero JR, Englander SW (1985) Protein hydrogen exchange studied by the fragment separation method. Anal Biochem 147:234–244
12. Wales TE, Engen JR (2006) Hydrogen exchange mass spectrometry for the analysis of protein dynamics. Mass Spectrom Rev 25:158–170
13. Zhang Z, Smith DL (1993) Determination of amide hydrogen exchange by mass spectrometry: a new tool for protein structure elucidation. Protein Sci 2:522–531
14. Smith DL, Deng Y, Zhang Z (1997) Probing the non-covalent structure of proteins by amide hydrogen exchange and mass spectrometry. J Mass Spectrom 32:135–146
15. Hamuro Y, Coales SJ, Southern MR, Nemeth-Cawley JF, Stranz DD, Griffin PR (2003) Rapid analysis of protein structure and dynamics by hydrogen/deuterium exchange mass spectrometry. J Biomol Tech 14:171–182
16. Englander SW (2006) Hydrogen exchange and mass spectrometry: a historical perspective. J Am Soc Mass Spectrom 17:1481–1489
17. Busenlehner LS, Armstrong RN (2005) Insights into enzyme structure and dynamics elucidated by amide H/D exchange mass spectrometry. Arch Biochem Biophys 433:34–46
18. Chalmers MJ, Busby SA, Pascal BD, He Y, Hendrickson CL, Marshall AG, Griffin PR (2006) Probing protein ligand interactions by automated hydrogen/deuterium exchange mass spectrometry. Anal Chem 78:1005–1014
19. Chalmers MJ, Busby SA, Pascal BD, Southern MR, Griffin PR (2007) A two-stage differential hydrogen deuterium exchange method for the rapid characterization of protein/ligand interactions. J Biomol Tech 18:194–204
20. Hoofnagle AN, Resing KA, Ahn NG (2004) Practical methods for deuterium exchange/mass spectrometry. Methods Mol Biol 250:283–298
21. Englander SW, Downer NW, Teitelbaum H (1972) Hydrogen exchange. Annu Rev Biochem 41:903–924
22. Bai Y, Milne JS, Mayne L, Englander SW (1993) Primary structure effects on peptide group hydrogen exchange. Proteins 17:75–86
23. Weis DD, Wales TE, Engen JR, Hotchko M, Ten Eyck LF (2006) Identification and characterization of EX1 kinetics in H/D exchange mass spectrometry by peak width analysis. J Am Soc Mass Spectrom 17:1498–1509
24. Ferraro DM, Lazo N, Robertson AD (2004) EX1 hydrogen exchange and protein folding. Biochemistry 43:587–594
25. Krishna MM, Hoang L, Lin Y, Englander SW (2004) Hydrogen exchange methods to study protein folding. Methods 34:51–64
26. Miller DE, Prasannan CB, Villar MT, Fenton AW, Artigues A (2012) HDXFinder: automated analysis and data reporting of Deuterium/Hydrogen exchange mass spectrometry. J Am Soc Mass Spectrom 23:425–429
27. Pascal BD, Chalmers MJ, Busby SA, Griffin PR (2009) HD desktop: an integrated platform for the analysis and visualization of H/D exchange data. J Am Soc Mass Spectrom 20:601–610
28. Weis DD, Engen JR, Kass IJ (2006) Semi-automated data processing of hydrogen exchange mass spectra using HX-express. J Am Soc Mass Spectrom 17:1700–1703
29. Lou X, Kirchner M, Renard BY, Kothe U, Boppel S, Graf C, Lee CT, Steen JA, Steen H, Mayer MP,
Hamprecht FA (2010) Deuteration distribution estimation with improved sequence coverage for HX/MS experiments. Bioinformatics 26:1535–1541
30. Lindner R, Lou X, Reinstein J, Shoeman RL, Hamprecht FA, Winkler A (2014) Hexicon 2: automated processing of hydrogen-deuterium exchange mass spectrometry data with improved deuteration distribution estimation. J Am Soc Mass Spectrom 25:1018–1028
31. Zhang Z, Marshall AG (1998) A universal algorithm for fast and automated charge state deconvolution of electrospray mass-to-charge ratio spectra. J Am Soc Mass Spectrom 9:225–233
32. Connelly GP, Bai Y, Jeng MF, Englander SW (1993) Isotope effects in peptide group hydrogen exchange. Proteins 17:87–92
33. Liu YH, Konermann L (2006) Enzyme conformational dynamics during catalysis and in the ‘resting state’ monitored by hydrogen/deuterium exchange mass spectrometry. FEBS Lett 580:5137–5142
34. Hu W, Walters BT, Kan ZY, Mayne L, Rosen LE, Marqusee S, Englander SW (2013) Stepwise protein folding at near amino acid resolution by hydrogen exchange and mass spectrometry. Proc Natl Acad Sci U S A 110:7684–7689
35. Wintrode PL, Rojsajjakul T, Vadrevu R, Matthews CR, Smith DL (2005) An obligatory intermediate controls the folding of the alpha-subunit of tryptophan synthase, a TIM barrel protein. J Mol Biol 347:911–919
36. Yang H, Smith DL (1997) Kinetics of cytochrome c folding examined by hydrogen exchange and mass spectrometry. Biochemistry 36:14992–14999
37. Busby SA, Chalmers MJ, Griffin PR (2007) Improving digestion efficiency under H/D exchange conditions with activated pepsinogen coupled columns. Int J Mass Spectrom 259:130–139
38. Ahn J, Jung MC, Wyndham K, Yu YQ, Engen JR (2012) Pepsin immobilized on high-strength hybrid particles for continuous flow online digestion at 10,000 psi. Anal Chem 84:7256–7262
39. Wu Y, Kaveti S, Engen JR (2006) Extensive deuterium back-exchange in certain immobilized pepsin columns used for H/D exchange mass spectrometry. Anal Chem 78:1719–1723
40. Tsybin YO, Haselmann KF, Emmett MR, Hendrickson CL, Marshall AG (2006) Charge location directs electron capture dissociation of peptide dications. J Am Soc Mass Spectrom 17:1704–1711
41. Demmers JA, Rijkers DT, Haverkamp J, Killian JA, Heck AJ (2002) Factors affecting gas-phase deuterium scrambling in peptide ions and their implications for protein structure determination. J Am Chem Soc 124:11191–11198
42. Ferguson PL, Konermann L (2008) Nonuniform isotope patterns produced by collision-induced dissociation of homogeneously labeled ubiquitin: implications for spatially resolved hydrogen/deuterium exchange ESI-MS studies. Anal Chem 80:4078–4086
43. Ferguson PL, Pan J, Wilson DJ, Dempsey B, Lajoie G, Shilton B, Konermann L (2007) Hydrogen/deuterium scrambling during quadrupole time-of-flight MS/MS analysis of a zinc-binding protein domain. Anal Chem 79:153–160
44. Jorgensen TJ, Bache N, Roepstorff P, Gardsvoll H, Ploug M (2005) Collisional activation by MALDI tandem time-of-flight mass spectrometry induces intramolecular migration of amide hydrogens in protonated peptides. Mol Cell Proteomics 4:1910–1919
45. Jorgensen TJ, Gardsvoll H, Ploug M, Roepstorff P (2005) Intramolecular migration of amide hydrogens in protonated peptides upon collisional activation. J Am Chem Soc 127:2785–2793
46. Kim MY, Maier CS, Reed DJ, Deinzer ML (2001) Site-specific amide hydrogen/deuterium exchange in E. coli thioredoxins measured by electrospray ionization mass spectrometry. J Am Chem Soc 123:9860–9866
47. Xu G, Takamoto K, Chance MR (2003) Radiolytic modification of basic amino acid residues in peptides: probes for examining protein-protein interactions. Anal Chem 75:6995–7007
48. Sharp JS, Becker JM, Hettich RL (2003) Protein surface mapping by chemical oxidation: structural analysis by mass spectrometry. Anal Biochem 313:216–225
49. Hambly DM, Gross ML (2005) Laser flash photolysis of hydrogen peroxide to oxidize protein solvent-accessible residues on the microsecond timescale. J Am Soc Mass Spectrom 16:2057–2063
50. Takamoto K, Chance MR (2006) Radiolytic protein footprinting with mass spectrometry to probe the structure of macromolecular complexes. Annu Rev Biophys Biomol Struct 35:251–276
51. Konermann L, Stocks BB, Pan Y, Tong X (2010) Mass spectrometry combined with oxidative labeling for exploring protein structure and folding. Mass Spectrom Rev 29:651–667
52. Linderstrom-Lang K, Ottesen M (1947) A new protein from ovalbumin. Nature 159:807
53. Neurath H (1979) Limited proteolysis, protein folding and physiological regulation. In: Jaenicke R (ed) Protein folding. Elsevier/North-Holland Biomedical Press, University of Regensburg, Regensburg
54. Bloxham DP, Ericsson LH, Titani K, Walsh KA, Neurath H (1980) Limited proteolysis of pig heart citrate synthase by subtilisin, chymotrypsin, and trypsin. Biochemistry (Mosc) 19:3979–3985
55. Fontana A, de Laureto PP, Spolaore B, Frare E (2012) Identifying disordered regions in proteins by limited proteolysis. Methods Mol Biol 896:297–318
56. Fontana A, Fassina G, Vita C, Dalzoppo D, Zamai M, Zambonin M (1986) Correlation between
sites of limited proteolysis and segmental mobility in 70. Potter RL, Taylor SS (1980) The structural domains
thermolysin. Biochemistry (Mosc) 25:1847–1851 of cAMP-dependent protein kinase
57. Bantscheff M, Weiss V, Glocker MO (1999) Identi- I. Characterization of two sites of proteolytic cleav-
fication of linker regions and domain borders of the age and homologies to cAMP-dependent protein
transcription activator protein NtrC from kinase II. J Biol Chem 255:9706–9712
Escherichia coli by limited proteolysis, in-gel diges- 71. Fontana A, de Laureto PP, Spolaore B, Frare E,
tion, and mass spectrometry. Biochemistry (Mosc) Picotti P, Zambonin M (2004) Probing protein struc-
38:11012–11020 ture by limited proteolysis. Acta Biochim Pol
58. Hubbard SJ (1998) The structural aspects of limited 51:299–321
proteolysis of native proteins. Biochim Biophys Acta 72. Scaloni A, Miraglia N, Orru S, Amodeo P, Motta A,
1382:191–206 Marino G, Pucci P (1998) Topology of the
59. Hubbard SJ, Eisenmenger F, Thornton JM (1994) calmodulin-melittin complex. J Mol Biol
Modeling studies of the change in conformation 277:945–958
required for cleavage of limited proteolytic sites. 73. Cohen SL, Ferre-D’Amare AR, Burley SK, Chait BT
Protein Sci 3:757–768 (1995) Probing the solution structure of the
60. Karas M, Hillenkamp F (1988) Laser desorption DNA-binding protein Max by a combination of pro-
ionization of proteins with molecular masses exceed- teolysis and mass spectrometry. Protein Sci
ing 10,000 daltons. Anal Chem 60:2299–2301 4:1088–1099
61. Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse 74. Monti M, Pucci P (2006) Limited proteolysis mass
CM (1989) Electrospray ionization for mass spec- spectrometry of protein complexes. In: Mass spec-
trometry of large biomolecules. Science 246:64–71 trometry of protein interactions. Wiley, Hoboken, pp
62. Suh MJ, Pourshahian S, Limbach PA (2007) Devel- 63–82
oping limited proteolysis and mass spectrometry for 75. Suckau D, Kohl J, Karwath G, Schneider K,
the characterization of ribosome topography. J Am Casaretto M, Bitter-Suermann D, Przybylski M
Soc Mass Spectrom 18:1304–1317 (1990) Molecular epitope identification by limited
63. Feng Y, De Franceschi G, Kahraman A, Soste M, proteolysis of an immobilized antigen-antibody
Melnik A, Boersema PJ, de Laureto PP, Nikolaev Y, complex and mass spectrometric peptide mapping.
Oliveira AP, Picotti P (2014) Global analysis of Proc Natl Acad Sci U S A 87:9848–9852
protein structural changes in complex proteomes. 76. Trempe MR, Carlson GM (1987) Phosphorylase
Nat Biotechnol 32:1036–1044 kinase conformers. Detection by proteases. J Biol
64. Orru S, Dal Piaz F, Casbarra A, Biasiol G, De Chem 262:4333–4340
Francesco R, Steinkuhler C, Pucci P (1999) Confor- 77. Trempe MR, Carlson GM, Hainfeld JF, Furcinitti
mational changes in the NS3 protease from hepatitis PS, Wall JS (1986) Analyses of phosphorylase
C virus strain Bk monitored by limited proteolysis kinase by transmission and scanning transmission
and mass spectrometry. Protein Sci 8:1445–1454 electron microscopy. J Biol Chem 261:2882–2889
65. Zappacosta F, Pessi A, Bianchi E, Venturini S, 78. Kemp BE, Pearson RB (1991) Intrasteric regulation
Sollazzo M, Tramontano A, Marino G, Pucci P of protein kinases and phosphatases. Biochim
(1996) Probing the tertiary structure of proteins by Biophys Acta 1094:67–76
limited proteolysis and mass spectrometry: the case 79. Kobe B, Kemp BE (1999) Active site-directed pro-
of Minibody. Protein Sci 5:802–813 tein regulation. Nature 402:373–376
66. Bothner B, Dong XF, Bibbs L, Johnson JE, Siuzdak 80. Xu G, Chance MR (2005) Radiolytic modification
G (1998) Evidence of viral capsid dynamics using and reactivity of amino acid residues serving as
limited proteolysis and mass spectrometry. J Biol structural probes for protein footprinting. Anal
Chem 273:673–676 Chem 77:4549–4555
67. Fontana A, Zambonin M, Polverino de Laureto P, De 81. Kiselar JG, Chance MR (2010) Future directions of
Filippis V, Clementi A, Scaramella E (1997) Probing structural mass spectrometry using hydroxyl radical
the conformational state of apomyoglobin by limited footprinting. J Mass Spectrom 45:1373–1382
proteolysis. J Mol Biol 266:223–230 82. Zhang H, Gau BC, Jones LM, Vidavsky I, Gross ML
68. Villa JA, Cabezas M, de la Cruz F, Moncalian G (2011) Fast photochemical oxidation of proteins for
(2014) Use of limited proteolysis and mutagenesis to comparing structures of protein-ligand complexes:
identify folding domains and sequence motifs criti- the calmodulin-peptide model system. Anal Chem
cal for wax ester synthase/acyl coenzyme A: 83:311–318
diacylglycerol acyltransferase activity. Appl Envi- 83. Hanai R, Wang JC (1994) Protein footprinting by the
ron Microbiol 80:1132–1141 combined use of reversible and irreversible lysine
69. Graves DJ, Hayakawa T, Horvitz RA, Beckman E, modifications. Proc Natl Acad Sci U S A
Krebs EG (1973) Studies on the subunit structure of 91:11904–11908
trypsin-activated phosphorylase kinase. Biochemis- 84. Tu BP, Wang JC (1999) Protein footprinting at
try (Mosc) 12:580–585 cysteines: probing ATP-modulated contacts in
cysteine-substitution mutants of yeast DNA
19 Protein Structural Analysis via Mass Spectrometry-Based Proteomics 429

topoisomerase II. Proc Natl Acad Sci U S A azides and terminal alkynes. Angew Chem Int Ed
96:4862–4867 Engl 41:2596–2599
85. Nadeau OW, Carlson GM (2005) Protein 98. Chowdhury SM, Du X, Tolic N, Wu S, Moore RJ,
interactions captured by chemical cross-linking. In: Mayer MU, Smith RD, Adkins JN (2009) Identifica-
Golemis E, Adams PD (eds) Protein-protein tion of cross-linked peptides after click-based
interactions : a molecular cloning manual, 2nd edn. enrichment using sequential collision-induced disso-
Cold Spring Harbor Laboratory Press, Cold Spring ciation and electron transfer dissociation tandem
Harbor, pp 105–127 mass spectrometry. Anal Chem 81:5524–5532
86. Nadeau OW (2006) Protein interaction analysis: 99. Vellucci D, Kao A, Kaake RM, Rychnovsky SD,
chemical cross-linking. In: Ganten D, Ruckpaul K Huang L (2010) Selective enrichment and identifica-
(eds) Encyclopedic reference of genomics and pro- tion of azide-tagged cross-linked peptides using
teomics in molecular medicine. Springer, Berlin chemical ligation and mass spectrometry. J Am Soc
87. Hermanson GT (2008) Bioconjugate techniques, 2nd Mass Spectrom 21:1432–1445
edn. Elsevier Academic Press, Amsterdam/Boston 100. Nadeau OW, Wyckoff GJ, Paschall JE, Artigues A,
88. Wong SS (1993) Chemistry of protein conjugation Sage J, Villar MT, Carlson GM (2008) CrossSearch,
and cross-linking. CRC Press, Boca Raton a user-friendly search engine for detecting chemi-
89. Nadeau OW, Lane LA, Xu D, Sage J, Priddy TS, cally cross-linked peptides in conjugated proteins.
Artigues A, Villar MT, Yang Q, Robinson CV, Mol Cell Proteomics 7:739–749
Zhang Y, Carlson GM (2012) Structure and location 101. Maiolica A, Cittaro D, Borsotti D, Sennels L,
of the regulatory beta subunits in the (alphabeta- Ciferri C, Tarricone C, Musacchio A, Rappsilber J
gammadelta)4 phosphorylase kinase complex. J (2007) Structural analysis of multiprotein complexes
Biol Chem 287:36651–36661 by cross-linking, mass spectrometry, and database
90. Politis A, Schmidt C, Tjioe E, Sandercock AM, searching. Mol Cell Proteomics 6:2200–2211
Lasker K, Gordiyenko Y, Russel D, Sali A, Robinson 102. Hoopmann MR, Weisbrod CR, Bruce JE (2010)
CV (2015) Topological models of heteromeric pro- Improved strategies for rapid identification of chem-
tein assemblies from mass spectrometry: application ically cross-linked peptides using protein interaction
to the yeast eIF3:eIF5 complex. Chem Biol reporter technology. J Proteome Res 9:6323–6333
22:117–128 103. Ihling C, Falvo F, Kratochvil I, Sinz A, Schafer M
91. Trnka MJ, Burlingame AL (2010) Topographic stud- (2015) Dissociation behavior of a bifunctional
ies of the GroEL-GroES chaperonin complex by tempoactive ester reagent for peptide structure anal-
chemical cross-linking using diformyl ysis by free radical initiated peptide sequencing
ethynylbenzene: the power of high resolution elec- (FRIPS) mass spectrometry. J Mass Spectrom
tron transfer dissociation for determination of both 50:396–406
peptide sequences and their attachment sites. Mol 104. Jedrychowski MP, Huttlin EL, Haas W, Sowa ME,
Cell Proteomics 9:2306–2317 Rad R, Gygi SP (2011) Evaluation of HCD- and
92. Paramelle D, Miralles G, Subra G, Martinez J (2013) CID-type fragmentation within their respective
Chemical cross-linkers for protein structure studies detection platforms for murine phosphoproteomics.
by mass spectrometry. Proteomics 13:438–456 Mol Cell Proteomics 10:M111.009910
93. Singh P, Panchaud A, Goodlett DR (2010) Chemical 105. Fenyo D (1997) A software tool for the analysis of
cross-linking and mass spectrometry as a mass spectrometric disulfide mapping experiments.
low-resolution protein structure determination tech- Comput Appl Biosci 13:617–618
nique. Anal Chem 82:2636–2642 106. Young MM, Tang N, Hempel JC, Oshiro CM, Taylor
94. Stengel F, Aebersold R, Robinson CV (2012) Join- EW, Kuntz ID, Gibson BW, Dollinger G (2000)
ing forces: integrating proteomics and cross-linking High throughput protein fold identification by using
with the mass spectrometry of intact complexes. Mol experimental constraints derived from intramolecu-
Cell Proteomics 11:R111.014027 lar cross-links and mass spectrometry. Proc Natl
95. Chowdhury SM, Munske GR, Tang X, Bruce JE Acad Sci U S A 97:5802–5806
(2006) Collisionally activated dissociation and elec- 107. Peri S, Steen H, Pandey A (2001) GPMAW–a soft-
tron capture dissociation of several mass ware tool for analyzing proteins and peptides. Trends
spectrometry-identifiable chemical cross-linkers. Biochem Sci 26:687–689
Anal Chem 78:8183–8193 108. Taverner T, Hall NE, O’Hair RA, Simpson RJ
96. Tang X, Munske GR, Siems WF, Bruce JE (2005) (2002) Characterization of an antagonist
Mass spectrometry identifiable cross-linking strat- interleukin-6 dimer by stable isotope labeling,
egy for studying protein-protein interactions. Anal cross-linking, and mass spectrometry. J Biol Chem
Chem 77:311–318 277:46487–46492
97. Rostovtsev VV, Green LG, Fokin VV, Sharpless KB 109. Hernandez P, Gras R, Frey J, Appel RD (2003)
(2002) A stepwise huisgen cycloaddition process: Popitam: towards new heuristic strategies to improve
copper(I)-catalyzed regioselective “ligation” of protein identification from tandem mass spectrome-
try data. Proteomics 3:870–878
430 A. Artigues et al.

110. Kruppa GH, Schoeniger J, Young MM (2003) A top approach to protein-protein interaction analysis. J
down approach to protein structural studies using Proteome Res 9:2508–2515
chemical cross-linking and Fourier transform mass 123. Leitner A, Walzthoeni T, Aebersold R (2014)
spectrometry. Rapid Commun Mass Spectrom Lysine-specific chemical cross-linking of protein
17:155–162 complexes and identification of cross-linking sites
111. Kellersberger KA, Yu E, Kruppa GH, Young MM, using LC-MS/MS and the xQuest/xProphet software
Fabris D (2004) Top-down characterization of pipeline. Nat Protoc 9:120–137
nucleic acids modified by structural probes using 124. Leitner A, Walzthoeni T, Kahraman A, Herzog F,
high-resolution tandem mass spectrometry and Rinner O, Beck M, Aebersold R (2010) Probing
automated data interpretation. Anal Chem native protein structures by chemical cross-linking,
76:2438–2445 mass spectrometry, and bioinformatics. Mol Cell
112. Tang Y, Chen Y, Lichti CF, Hall RA, Raney KD, Proteomics 9:1634–1649
Jennings SF (2005) CLPM: a cross-linked peptide 125. Xu H, Hsu PH, Zhang L, Tsai MD, Freitas MA
mapping algorithm for mass spectrometric analysis. (2010) Database search algorithm for identification
BMC Bioinf 6 Suppl 2:S9 of intact cross-links in proteins and peptides using
113. Seebacher J, Mallick P, Zhang N, Eddes JS, tandem mass spectrometry. J Proteome Res
Aebersold R, Gelb MH (2006) Protein cross-linking 9:3384–3393
analysis using mass spectrometry, isotope-coded 126. McIlwain S, Draghicescu P, Singh P, Goodlett DR,
cross-linkers, and integrated computational data Noble WS (2010) Detecting cross-linked peptides by
processing. J Proteome Res 5:2270–2282 searching against a database of cross-linked peptide
114. de Koning LJ, Kasper PT, Back JW, Nessen MA, pairs. J Proteome Res 9:2488–2495
Vanrobaeys F, Van Beeumen J, Gherardi E, de 127. Chu F, Baker PR, Burlingame AL, Chalkley RJ
Koster CG, de Jong L (2006) Computer-assisted (2010) Finding chimeras: a bioinformatics strategy
mass spectrometric analysis of naturally occurring for identification of cross-linked peptides. Mol Cell
and artificially introduced cross-links in proteins and Proteomics 9:25–31
protein complexes. FEBS J 273:281–291 128. Du X, Chowdhury SM, Manes NP, Wu S, Mayer
115. Schnaible V, Wefing S, Resemann A, Suckau D, MU, Adkins JN, Anderson GA, Smith RD (2011)
Bucker A, Wolf-Kummeth S, Hoffmann D (2002) Xlink-identifier: an automated data analysis platform
Screening for disulfide bonds in proteins by MALDI for confident identifications of chemically cross-
in-source decay and LIFT-TOF/TOF-MS. Anal linked peptides using tandem mass spectrometry. J
Chem 74:4980–4988 Proteome Res 10:923–931
116. Wefing S, Schnaible V, Hoffmann D (2006) 129. Rasmussen MI, Refsgaard JC, Peng L, Houen G,
SearchXLinks. A program for the identification of Hojrup P (2011) CrossWork: software-assisted iden-
disulfide bonds in proteins from mass spectra. Anal tification of cross-linked peptides. J Proteome
Chem 78:1235–1241 74:1871–1883
117. Gao Q, Xue S, Doneanu CE, Shaffer SA, Goodlett 130. Gotze M, Pettelkau J, Schaks S, Bosse K, Ihling CH,
DR, Nelson SD (2006) Pro-CrossLink. Software tool Krauth F, Fritzsche R, Kuhn U, Sinz A (2012)
for protein cross-linking and mass spectrometry. StavroX–a software for analyzing crosslinked
Anal Chem 78:2145–2149 products in protein interaction studies. J Am Soc
118. Lee YJ, Lackner LL, Nunnari JM, Phinney BS Mass Spectrom 23:76–87
(2007) Shotgun cross-linking analysis for studying 131. Yang B, Wu YJ, Zhu M, Fan SB, Lin J, Zhang K,
quaternary and tertiary protein structures. J Proteome Li S, Chi H, Li YX, Chen HF, Luo SK, Ding YH,
Res 6:3908–3917 Wang LH, Hao Z, Xiu LY, Chen S, Ye K, He SM,
119. Lee YJ (2009) Probability-based shotgun cross- Dong MQ (2012) Identification of cross-linked
linking sites analysis. J Am Soc Mass Spectrom peptides from complex samples. Nat Methods
20:1896–1899 9:904–906
120. Anderson GA, Tolic N, Tang X, Zheng C, Bruce JE 132. Li W, O’Neill HA, Wysocki VH (2012) SQID-
(2007) Informatics strategies for large-scale novel XLink: implementation of an intensity-incorporated
cross-linking analysis. J Proteome Res 6:3412–3421 algorithm for cross-linked peptide identification.
121. Yu ET, Hawkins A, Kuntz ID, Rahn LA, Rothfuss A, Bioinformatics 28:2548–2550
Sale K, Young MM, Yang CL, Pancerella CM, 133. Holding AN, Lamers MH, Stephens E, Skehel JM
Fabris D (2008) The collaboratory for MS3D: a (2013) Hekate: software suite for the mass spectro-
new cyberinfrastructure for the structural elucidation metric analysis and three-dimensional visualization
of biological macromolecules and their assemblies of cross-linked protein samples. J Proteome Res
using mass spectrometry-based approaches. J Prote- 12:5923–5933
ome Res 7:4848–4857 134. Jaiswal M, Crabtree N, Bauer MA, Hall R, Raney
122. Panchaud A, Singh P, Shaffer SA, Goodlett DR KD, Zybailov BL (2014) XLPM: efficient algorithm
(2010) xComb: a cross-linked peptide database for the analysis of protein-protein contacts using
19 Protein Structural Analysis via Mass Spectrometry-Based Proteomics 431

chemical cross-linking mass spectrometry. BMC molecular cloning manual. Cold Spring Harbor Lab-
Bioinf 15 Suppl 11:S16 oratory Press, New York, pp 75–91
135. Wang J, Anania VG, Knott J, Rush J, Lill JR, Bourne 146. Sobott F, Hernandez H, McCammon MG, Tito MA,
PE, Bandeira N (2014) Combinatorial approach for Robinson CV (2002) A tandem mass spectrometer
large-scale identification of linked peptides from for improved transmission and analysis of large mac-
tandem mass spectrometry spectra. Mol Cell Proteo- romolecular assemblies. Anal Chem 74:1402–1407
mics 13:1128–1136 147. Benesch JL, Robinson CV (2006) Mass spectrome-
136. Mayne SL, Patterton HG (2014) AnchorMS: a bioin- try of macromolecular assemblies: preservation and
formatics tool to derive structural information from dissociation. Curr Opin Struct Biol 16:245–251
the mass spectra of cross-linked protein complexes. 148. Benesch JL, Ruotolo BT, Simmons DA, Robinson
Bioinformatics 30:125–126 CV (2007) Protein complexes in the gas phase: tech-
137. Lima DB, de Lima TB, Balbuena TS, Neves- nology for structural genomics and proteomics.
Ferreira AG, Barbosa VC, Gozzo FC, Carvalho Chem Rev 107:3544–3567
PC (2015) SIM-XL: a powerful and user-friendly 149. Hernandez H, Robinson CV (2001) Dynamic protein
tool for peptide cross-linking analysis. J Proteomics complexes: insights from mass spectrometry. J Biol
129:51 Chem 276:46685–46688
138. Glatter T, Ludwig C, Ahrne E, Aebersold R, Heck 150. Hernandez H, Dziembowski A, Taverner T,
AJ, Schmidt A (2012) Large-scale quantitative Seraphin B, Robinson CV (2006) Subunit architec-
assessment of different in-solution protein digestion ture of multimeric complexes isolated directly from
protocols reveals superior cleavage efficiency of tan- cells. EMBO Rep 7:605–610
dem Lys-C/trypsin proteolysis over trypsin diges- 151. Lane LA, Nadeau OW, Carlson GM, Robinson CV
tion. J Proteome Res 11:5145–5156 (2012) Mass spectrometry reveals differences in sta-
139. Leon IR, Schwammle V, Jensen ON, Sprenger RR bility and subunit interactions between activated and
(2013) Quantitative assessment of in-solution diges- nonactivated conformers of the (alphabeta-
tion efficiency identifies optimal protocols for unbi- gammadelta)4 phosphorylase kinase complex. Mol
ased protein analysis. Mol Cell Proteomics Cell Proteomics 11:1768–1776
12:2992–3005 152. Fitzgerald TJ, Carlson GM (1984) Activated states
140. Nomura E, Katsuta K, Ueda T, Toriyama M, Mori T, of phosphorylase kinase as detected by the chemical
Inagaki N (2004) Acid-labile surfactant improves cross-linker 1,5-difluoro-2,4-dinitrobenzene. J Biol
in-sodium dodecyl sulfate polyacrylamide gel pro- Chem 259:3266–3274
tein digestion for matrix-assisted laser desorption/ 153. Zhang Y (2008) I-TASSER server for protein 3D
ionization mass spectrometric peptide mapping. J structure prediction. BMC Bioinf 9:40
Mass Spectrom 39:202–207 154. Herzog F, Kahraman A, Boehringer D, Mak R,
141. Wisniewski JR, Zougman A, Nagaraj N, Mann M Bracher A, Walzthoeni T, Leitner A, Beck M, Hartl
(2009) Universal sample preparation method for pro- FU, Ban N, Malmstrom L, Aebersold R (2012)
teome analysis. Nat Methods 6:359–362 Structural probing of a protein phosphatase 2A net-
142. Hustoft HK, Reubsaet L, Greibrokk T, Lundanes E, work by chemical cross-linking and mass spectrom-
Malerod H (2011) Critical assessment of etry. Science 337:1348–1352
accelerating trypsination methods. J Pharm Biomed 155. Rappsilber J (2011) The beginning of a beautiful
Anal 56:1069–1078 friendship: cross-linking/mass spectrometry and
143. Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann modelling of proteins and multi-protein complexes.
M (2006) In-gel digestion for mass spectrometric J Struct Biol 173:530–540
characterization of proteins and proteomes. Nat 156. Schneidman-Duhovny D, Pellarin R, Sali A (2014)
Protoc 1:2856–2860 Uncertainty in integrative structural modeling. Curr
144. Green NS, Reisler E, Houk KN (2001) Quantitative Opin Struct Biol 28:96–104
evaluation of the lengths of homobifunctional pro- 157. Zeng-Elmore X, Gao XZ, Pellarin R, Schneidman-
tein cross-linking reagents used as molecular rulers. Duhovny D, Zhang XJ, Kozacka KA, Tang Y,
Protein Sci 10:1293–1304 Sali A, Chalkley RJ, Cote RH, Chu F (2014) Molec-
145. Nadeau OW, Carlson GM (2002) Chemical cross- ular architecture of photoreceptor phosphodiesterase
linking in studying protein-protein interactions. In: elucidated by chemical cross-linking and integrative
Golemis E (ed) Protein-protein interactions : a modeling. J Mol Biol 426:3713–3728
Part V
Clinical Proteomics
20 Introduction to Clinical Proteomics
John E. Wiktorowicz and Allan R. Brasier

Abstract
Within the context of this section, biomarkers are defined as a panel of
proteins and peptides that are predictive of the risk for developing a
pathological condition. It is important to note here that the use of the
descriptor ‘panel’ is purposeful in that single “biomarkers” are rarely
sufficient to permit accurate prediction of a pathological condition.
More specifically, the primary application of a biomarker panel is that it
serves as a molecular indicator of the severity of a disease or its early
response to treatment. In this way, biomarkers enable the application of
precision medicine, an approach that tailors specific interventions to those
individuals that would most benefit. For a recent comprehensive review of
the proteomic-based biomarker development process with a focus on
bladder cancer, the reader is directed to Frantzi et al. [Clin Transl Med
3:7, 2014], or a special issue with multiple reviews [Stuhler and
Poschmann, Biochim Biophys Acta Proteins Proteomics
1844:859–1058, Elsevier, B V, 2014].

Keyword
Clinical proteomics

J.E. Wiktorowicz (*) • A.R. Brasier
The University of Texas Medical Branch, Galveston, TX, USA
e-mail: jowiktor@utmb.edu

H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation,
Analysis and Practical Applications, Advances in Experimental Medicine and
Biology 919, DOI 10.1007/978-3-319-41448-5_20

20.1 Overview

Within the context of this section, biomarkers are defined as a panel of
proteins and peptides that are predictive of the risk for developing a
pathological condition. It is important to note here that the use of the
descriptor ‘panel’ is purposeful in that single “biomarkers” are rarely
sufficient to permit accurate prediction of a pathological condition.
More specifically, the primary application of a biomarker panel is that it
serves as a molecular indicator of the severity of a disease or its early
response to treatment. In this way, biomarkers enable the application of
precision medicine, an approach that tailors specific interventions to those
individuals that would
most benefit. For a recent comprehensive review of the proteomic-based
biomarker development process with a focus on bladder cancer, the reader is
directed to Frantzi et al. [1], or a special issue with multiple reviews [2].

Despite the great interest in biomarkers and their potential impact in
clinical practice, surprisingly few biomarker panels have been translated to
clinical practice. A recent survey (2001–2014) of PubMed yielded 241 papers
describing biomarker studies using a proteomic approach. Unfortunately, as
has been noted extensively, very few have advanced to a Verification stage,
much less Validation. The reasons for this barren landscape are manifold,
and we will examine a few significant issues below.

Fig. 20.1 Process flow for the development of novel protein biomarker
candidates [3]. ‘Numbers of analytes’ refers to the number of proteins
expected to be evaluated as candidate biomarkers in each phase of
development. ‘Numbers of samples’ refers to the sample requirements for each
phase. LC-MS/MS liquid chromatography tandem mass spectrometry, SID stable
isotope dilution, MRM multiple reaction monitoring (Reprinted by permission
from Macmillan Publishers, Ltd: Rifai, N. et al., Nature Biotechnology,
24:971–983, 2006)

The currently accepted process [3] that leads to biomarker approval for
clinical use is summarized in the column labeled “Phase” in Fig. 20.1. The
biomarker development process proceeds in distinct phases in which protein
markers are initially identified, assayed, and selected for optimal
performance. For the purposes of this work, Discovery is a phase that
employs a broad survey of proteins using semi-high throughput assays.
Qualification is a phase involving independent measurement of differentially
expressed proteins, typically within the Discovery samples. Verification
refers to confirmation of the differentially expressed proteins within an
independent, second clinical cohort. As noted, there is an inverse
relationship between the number of candidates and the number of samples as
the candidates move through the confirmation process. Survival of a
candidate marker is dependent upon the quality of the quantitative analysis
and statistical tools used to narrow the field. In this analysis, the
authors emphasized a mass spectrometry approach for discovery through
verification, and argued for analyses of proximal biofluids. Published in
2006, the conclusions drawn were optimistic in that if the suggested
approaches were utilized by the proteomics biomarker community, greater
numbers of biomarkers would survive the
development process. Many of these suggestions have been implemented in
proteomic biomarker publications since then. Unfortunately, up until this
year, the barren landscape of proteomically derived validated biomarkers has
remained essentially unchanged. Clearly, additional factors have confounded
the attempts to bring candidate biomarkers to full biomarker status, and our
purpose here is to provide an overview of the challenges that we have
encountered in our own research.

Our experience has led to some modification of the development workflow
described in Fig. 20.1. To summarize the salient points, success in
biomarker identification critically relies on disease definition,
consideration of the goals of the biomarker panel, the strategy for
selection of candidate biomarkers that will constitute the panel,
statistical modeling, and alternative quantitative proteomic tools to
identify the most robust markers. These steps are then followed by feature
evaluation, model evaluation, and if necessary, model refinement prior to
Verification (Fig. 20.2).

Fig. 20.2 Modified biomarker development process. The general terminology
remains unchanged; however, greater attention is paid to study design, where
investigated diseases should be well defined with clear diagnostic criteria,
and with a second sample set for a statistically powered Verification phase.
Further modifications include the use of heuristics and animal models to
supplement candidate markers identified in the Discovery phase. Finally, the
discovery marker set is reduced with appropriate statistical tools and used
to create models correlating their pattern of abundance with the relevant
goals of the study. This model is evaluated and refined upon confirmation
after the Qualification phase

20.2 Candidate Biomarker Selection

The initial assembly of candidate biomarkers will involve a combination of
prior knowledge of pathophysiology and quantitative (or semi-quantitative)
proteomics surveys of relevant animal models or patient-derived biofluids
from well-designed clinical studies. At the outset, it is critical to define
the disease for which the biomarker is being developed. A “disease” must be
identifiable using objective criteria that are reproducible across multiple
sites and are independent of observer bias. It is not uncommon in clinical
practice for diseases to be diagnosed
using a combination of criteria. For example, the diagnosis of dengue
hemorrhagic fever is made on the basis of variable types of hemorrhage,
plasma leak syndromes, and hemoconcentration [4]. The diagnosis of severe
asthma is made on the basis of a constellation of symptoms, pharmacological
responses, and symptomatic controls [5]. Rheumatoid arthritis is a syndrome
of joint stiffness, cutaneous manifestations, and variable amounts of joint
destruction. The important point here is that despite exhibiting similar
constellations of signs and symptoms that satisfy a clinical definition of a
disease, the patients may exhibit distinct pathophysiologic processes that
will contribute to variability in the selection of biomarker panels. For
example, petechiae in dengue hemorrhagic fever may be due to
antibody-induced thrombocytopenia, whereas the plasma leak may be due to
complement-mediated endothelial damage. Qualifying or verifying biomarkers
in populations with these distinct pathophysiologies may result in markers
that generalize only to a subpopulation.

20.3 Considerations in Biomarker Use/Clinical Application

Another important consideration in biomarker development is whether the
biomarker panel is actionable. Consider dengue hemorrhagic fever (DHF): in
endemic dengue regions, DHF is a relatively rare event, with an estimated
5–10 % of all patients with acute dengue infections manifesting DHF. In this
case, the identification of a biomarker panel is valuable for clinicians in
resource-poor areas to prioritize which patients should be closely
monitored, and/or given intravenous hydration. Conversely, if the
application of the biomarker will not impact case management, there will be
little acceptance or utilization of the test. Having a clear understanding
of the application of the biomarker and how its application will contribute
to more efficient clinical management is important in project selection.

Candidate biomarkers are selected from multiple sources of information. An
important source of knowledge is prior pathophysiological studies, when this
information is available. Information from relevant animal models, when
these are available, can be valuable for selecting candidate markers for
Qualification. Animal models can be useful for several reasons:

– Inbred animal strains have reduced genetic variations that contribute to
  distinct protein expression patterns
– The timing and onset of disease can be more precisely controlled than
  possible in human clinical studies
– Proximal biofluid sampling is possible

When these sources of information are limited, the final source of
candidates comes from quantitative or semi-quantitative proteomics samples
from observational human studies. This latter domain requires an objective
definition of the disease and meticulous control of sample collection
protocols. Sample collection protocols must be standardized and assiduously
implemented, and patient information (day symptoms appeared, etc.) must be
noted accurately. Many biomarker studies are on diseases that are relatively
rare and therefore require a multi-site clinical design. In this setting,
the presence of disease must be identifiable using objective criteria that
are reproducible across multiple sites and are independent of observer bias.

Finally, because the current publication environment requires confirmation
of candidate markers through the Verification stage, it is critical to have
identified a second, larger cohort of samples to be used for confirmation
before initiating the biomarker Discovery stage.

As a side note, it is appropriate to point out that in the past, the
published panels of Qualified markers have been used to inform targeted
proteomic studies and these have led to Verified biomarker panels currently
commercialized or undergoing clinical trials [6, 7]. However, this
source of potential targets is now in jeopardy, due to new journal publication guidelines that require confirmation of candidate markers through the Verification phase before manuscripts are even accepted for review. Continuation of this policy will impose severe consequences on the field due to two factors:

– Few academic researchers have the resources to fund the large effort required for candidate Verification (second, larger sample cohort)
– The policy will hinder the acquisition of such funding by preventing the necessary publication(s) of supportive preliminary studies

A more considered approach to break the vicious circle would permit publication at the Qualification stage (same samples, alternative quantification/identification tools) provided that confirmation is robust and statistically valid; otherwise, this historically rich source of potential targets will disappear, greatly increasing the difficulty and cost of developing new effective biomarker panels.

20.4 Discovery (Chap. 20)

As a starting point, we define proteomics biomarker discovery as the global proteomic analysis of a sufficient number of samples that can ensure a power of at least 0.8 that will result in a panel of candidate biomarkers. This usually results in the estimate of 30+ samples each for case and control.

The type and source of samples (biofluids, proximal fluids, tissues, cells, etc.) dictate the range of analytical options that can be applied. Since proteomics discovery encompasses a multi-step, multi-technique workflow, each sample type requires a customized strategy for separations, quantification, and identification. Since there are only 20,000+ genes and, by some estimates, more than 1,000,000 protein isoforms, the vast majority of proteomic complexity is encompassed by post-transcriptional mechanisms. Accordingly, comparisons of cases and controls to extract only differences in abundance should not be the only goal pursued. Careful selection of analytical approaches must reflect the need to detect and quantify these post-transcriptional modifications. These considerations also drive separation, quantification, and identification approaches. Finally, a comprehensive search of the literature can provide additional inputs into the list of candidates.

As a statement of general principles for discovery, protein losses must be minimized so that quantification can be accurate and precise. We have used all-liquid fractionation to limit the possibility of irreversible surface binding in the presence of denaturant (e.g., urea). In the case of biofluids, where high abundance proteins bind large numbers of peptides and proteins, urea also serves to dissociate any protein interactions. To track and permit normalization of protein losses, internal standards must be added as early as possible in the sample extraction/preparation phase.

20.5 Candidate Panel Selection/Statistical Approaches (Chap. 21)

Typically upon quantification of protein/peptide signals, a “first level” of statistical analysis (e.g., non-parametric t-test or ANOVA) will establish a level of statistical significance to a narrowed list of candidates, decreasing the demands placed upon Qualification. Statistical methods involve not only difference testing, but need to inform candidate biomarker selection by incorporating additional information, including group-wise variance and identification of correlated markers. An important source of candidate biomarkers includes incorporation of heuristics to assemble candidate markers for Qualification and Verification. These and other factors will be discussed in detail in Chap. 20 – Discovery. Similarly, after each phase of candidate biomarker development, a combination of statistical approaches, including non-parametric hypothesis testing, feature reduction, hierarchical and non-hierarchical cluster analysis, and model building with receiver-operator analysis is used to confirm selection of

candidate panels and appropriate predictive models.

20.6 Qualification (Chap. 22)

Qualification is defined as the workflow for confirming (or rejecting) the accuracy of the statistically selected list of proteins and peptides and/or the PTMs developed in the Discovery phase. By definition, the samples to be assayed are the identical samples used in Discovery, but analyzed by an alternate, quantitative, higher-throughput technique. While exact confirmation of the levels of abundance or PTM changes is not expected, a statistically derived trend consistent with the analytes’ behaviors in the Discovery phase is required for confirmation. Finally, to determine if biological mechanism(s) may be rationalized, pathway analysis may be applied to examine networks and multivariate classification of patients and variably expressed proteins identified in the Discovery phase.

The process of feature selection, statistical modeling and Qualification may be an iterative process. It sometimes is the case that features initially identified in Discovery do not exhibit significant differences between cases and controls, or that the models do not perform well. In this case, the selection, statistical modeling and Qualification process may be repeated (indicated by yellow arrow in Fig. 19.2).

20.7 Verification (Chap. 22)

Verification is likewise defined as a confirmatory analysis of the qualified surviving candidates, but performed on an entirely different set of samples, whose numbers satisfy statistical power considerations for the analytical approach to be taken. Obviously, a critical consideration is the need for the samples to have been selected according to the identical clinical endpoints, objectively derived. Any deviation from the original selection criteria will lead to errors and non-confirmation of the qualified candidates. We will discuss the approach of Verification not of individual markers, but of the marker panel.

20.8 UTMB Clinical Proteomics Center (CPC)

The UTMB CPC was composed of 11 investigators organized into seven technology teams funded through a 6-year contract mechanism. The two major goals were to:

1. Develop, standardize, and apply a protein biomarker discovery pipeline that incorporates quantitative pre-fractionation, 2-dimensional gel electrophoresis (2DE) and tandem liquid chromatography (LC)-mass spectrometry (MS), including MS-based identification
2. Develop predictive models of infectious outcomes that will drive further studies in Validation by collaborating investigators

The scope of our work was to serve as a proteomics resource for early stages of biomarker development (Discovery through Verification) for human cohort studies proposed by clinical investigators in the scientific community. During the conduct of the CPC contract, five projects were approved:

1. To identify a predictive panel of severity of dengue infection
2. To identify predictors associated with Helicobacter pylori induced peptic ulcers
3. To identify predictors of chagasic cardiomyopathy
4. To identify diagnostics of invasive aspergillosis in immunosuppressed patients
5. To identify predictors of acute rickettsial disease in acute spotted fever cases

Each project was unique in starting material and proteomic discovery platform and the development process followed the path described above. Our contract did not provide resources for Validation. During the conduct of the program, the biomarker development program

evolved to better address the challenges in biomarker candidate selection and model development/refinement.

The following chapters will describe our refinement of the biomarker development strategy, and include the separations, statistical, and mass spectrometric approaches we used to identify and confirm candidate biomarkers for the projects enumerated above. Our goals were to utilize a broad spectrum of proteomics tools to generate predictive candidate markers in concert with our NIAID Clinical Proteomics Center mandate to provide a panel of effective candidates that could be carried through to the Validation phase by a subsequent funding mechanism.

References

1. Frantzi M, Bhat A, Latosinska A (2014) Clinical proteomic biomarkers: relevant issues on study design & technical considerations in biomarker development. Clin Transl Med 3:7
2. Stuhler K, Poschmann G (2014) Biomarkers: a proteomic challenge. Biochim Biophys Acta Proteins Proteomics 1844:859–1058, Elsevier, B V
3. Rifai N, Gillette MA, Carr SA (2006) Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol 24:971–983
4. World Health Organization (2009) Dengue: guidelines for diagnosis, treatment, prevention and control. World Health Organization, Geneva
5. ad-hoc writing committee of the Assembly on Allergy, I. a. I (2000) Proceedings of the ATS workshop on refractory asthma. Current understanding, recommendations, and unanswered questions. Am J Respir Crit Care Med 162:2341–2351
6. Sun W, Hu G, Long G, Wang J, Liu D, Hu G (2014) Predictive value of a serum-based proteomic test in non-small-cell lung cancer patients treated with epidermal growth factor receptor tyrosine kinase inhibitors: a meta-analysis. Curr Med Res Opin 30:2033–2039
7. Li X-j, Hayward C, Fong P-Y, Dominguez M, Hunsucker SW, Lee LW, McLean M, Law S, Butler H, Schirm M, Gingras O, Lamontagne J, Allard R, Chelsky D, Price ND, Lam S, Massion PP, Pass H, Rom WN, Vachani A, Fang KC, Hood L, Kearney P (2013) A blood-based proteomic classifier for the molecular characterization of pulmonary nodules. Sci Transl Med 5:207ra142
Discovery of Candidate Biomarkers
21
John E. Wiktorowicz and Kizhake V. Soman

Abstract
Properly performed, biomarker discovery can lead to effective candidates
that can ultimately serve as predictors of disease, medical condition,
define therapeutic parameters, and many other applications in medicine.
Preferably, biomarkers comprise a panel of indicators, e.g. proteins and/or
peptides that can be predictive or diagnostic of the medical condition of
interest. Emphasis here is placed on “panel,” as single candidates are
rarely sufficient to provide the necessary sensitivity and specificity. To
develop an effective panel that survives the development process
described in Chap. 19, proper experimental design and attention to impor-
tant statistical parameters are critical to ensure success. Errors in discov-
ery can lead to an inefficient use of expensive resources, as these may not
be uncovered until the latter stages in biomarker development. Hence,
accuracy, precision, and an estimate of the power of the proposed analyses
are critical in the discovery of the panel of candidate biomarkers by
proteomic methods, as is the selection of statistical approaches to refine
and appropriately reduce the dataset for subsequent confirmatory assays.

Keywords
Biomarker discovery • Plasma • Serum • Antibody depletion • Protein
pool

J.E. Wiktorowicz (*) • K.V. Soman
The University of Texas Medical Branch, Galveston, TX, USA
e-mail: jowiktor@utmb.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_21

21.1 Introduction

Properly performed, biomarker discovery can lead to effective candidates that can ultimately serve as predictors of disease, medical condition, define therapeutic parameters, and many other applications in medicine. Preferably, biomarkers comprise a panel of indicators, e.g. proteins and/or peptides that can be predictive or diagnostic of the medical condition of interest. Emphasis here is placed on “panel,” as single candidates are rarely sufficient to provide the necessary

sensitivity and specificity. To develop an effective panel that survives the development process described in Chap. 19, proper experimental design and attention to important statistical parameters are critical to ensure success. Errors in discovery can lead to an inefficient use of expensive resources, as these may not be uncovered until the latter stages in biomarker development. Hence, accuracy, precision, and an estimate of the power of the proposed analyses are critical in the discovery of the panel of candidate biomarkers by proteomic methods, as is the selection of statistical approaches to refine and appropriately reduce the dataset for subsequent confirmatory assays.

Simply put, the power of a study is an estimate of the chance of detecting a real difference of a given size [1]. In statistical terms, power is the probability of rejecting the null hypothesis when it is false. In proteomics, a power analysis is commonly used to estimate the number of samples necessary to achieve >80 % power; that number depends upon the desired level of significance, the size of the difference to be detected, and the sample assay variance (mean and standard deviation). There are a number of software and web-based resources that can be used to estimate the number of samples necessary to achieve a certain power, given the parameters enumerated above. One very important caveat is that the power analysis must be performed a priori, that is, before the actual experiment is performed.

In summary, care should be taken to understand the collection nuances for the tissues or biofluids to be used as the sample source, as well as selection of proper pre-separation treatments to maximize recovery and minimize artifacts. Along with these considerations, the proteomic strategy and quantification should reflect carefully chosen methods tailored to the nuances and number of samples and goals of the study.

In our discussion of these factors in this chapter, and consistent with the other chapters in this Section, we will focus on biofluid samples, in particular, plasma and serum, used to develop candidate biomarkers for our NIAID-funded Clinical Proteomics Center for Infectious Disease and Biodefense (CPC). Moreover, the goals of the following studies were to “discover” panels of proteins and peptides that could serve as potential predictors for the risk of developing clinically severe sequelae of infectious disease (Dengue Fever and Chagas cardiomyopathy), or that could serve as a diagnostic tool for the infectious agent (invasive aspergillosis). All investigations proceeded through Discovery, Qualification, and Verification as highlighted in Chap. 19. In this chapter, we will discuss Discovery in pursuit of candidates for these three diseases.

21.2 Sample Source: Plasma and Serum

In 2005, the Human Proteome Organization (HUPO) published a compendium of studies resulting from the years-long Plasma Proteome Initiative [2]. In it, the authors highlighted the tactical successes with the following recommendations:

– Selection of plasma over serum
– EDTA over heparin
– Minimum number of freeze-thaw cycles
– A number of other important procedural recommendations for the use of biofluids in biomarker discovery

They also notably highlighted significant tactical flaws, including the use of bottom-up, label-free mass spectrometric approaches, among others. The compendium was notable in its honest appraisal and wide-ranging recommendations for the improvement of biofluid-based candidate biomarker discovery, and the potential biomarker investigator would be well-served to examine this work carefully, despite its growing age.

Several important features are worth highlighting as they formed our discovery strategy for our Center projects. As had been widely noted, these include the enormous concentration range of plasma proteins (10–12 orders of dynamic range), the under-sampling that ensues upon depletion of the most highly abundant proteins that bind over 200 lower abundance

proteins, and the endogenous proteases in plasma that may degrade important proteins to peptides. In addition, it is noteworthy that as much as 12 % of plasma peptides may constitute the natural peptidome [3]. As a result, the peptide “degradome” [4] consists of naturally occurring peptides, which may serve as legitimate candidate markers, and those generated through artifactual influences, which may not.

The last issue becomes increasingly relevant in emphasizing the importance of standard operating procedures in plasma and sera collection. Unless specified, most plasma collection protocols do not include collection containers that have protease inhibitors or are specified for use for protein analysis (e.g., Becton-Dickinson BD™ P100 Kit). Therefore, variations in the collection and/or processing parameters may result in differences in plasma protein stability, leading to false signals in the differential analyses. To date, there is no widely accepted method for evaluating the quality of plasma for proteomic purposes. Our attempts to use capillary electrophoresis to gauge plasma quality by monitoring the four most abundant protein peaks did not lead to obvious correlation with plasma quality. Even the most highly abundant proteins show considerable variation from individual to individual, and so, at this point, the only way to assure quality is meticulous adherence to the standard collection protocols, and limiting freeze-thaw cycles to at most two.

Typically, unbiased proteomic investigations of biofluids consist of the comparison of biofluids from normal (control) versus affected individuals (case). Since there is no opportunity for multiplexing via discriminating reagents at the pre-fractionation stage (presumably due to the expense of labeling the high abundance proteins that constitute 95–99 % of the largely uninformative total protein that will be removed in later steps), generating critically reproducible fractions for differential comparison from sequential or parallel fractionation is a challenge. As described in Chap. 19, replicate analyses with readily available samples should be performed to estimate variances for power analysis before precious samples are processed. Because most projects involve dozens of biological samples, technical replicates at this stage are cost- and time-prohibitive, and not recommended; hence a priori power analysis is critical to provide an estimate of the number of samples statistically necessary.

21.2.1 Fractionation

In consideration of the above and other challenges, we have devised an efficient and reproducible platform for fractionation and quantification of both proteins and peptides from biofluids that achieves differential analysis from the same low-volume samples, called the Biofluids Analytical Platform (BAP) that we used routinely in our NIAID CPC studies (Fig. 21.1). The BAP utilizes a fluorescent internal size standard that defines a protein/peptide molecular size cutoff by which biofluid proteins are pooled separately from peptides after size-exclusion chromatography (SEC). Denatured samples (to disrupt molecular interactions) are mixed with a fluorescence-labeled protein standard and fractionated by SEC. Protein and peptide pools are created automatically according to the elution profile of the internal standard by a programmable UV monitor that controls a compatible fraction collector, thereby ensuring reproducible collection and allowing multiple columns to be used simultaneously (Fig. 21.2). Since SEC is a non-adsorptive fractionation support, recoveries are routinely very high, and in our hands, reproducibly result in >95 % recovery of input proteins and peptides. Thus SEC permits fractionation before quantification with high recoveries, a necessary consideration in quantitative biomarker discovery. In addition, the ability to fractionate by size allows the use of urea as a denaturant prior to sample injection to ensure dissociation of peptides and proteins from the high abundance proteins. Other advantages of SEC include the ability to exchange buffers and remove small molecules (including urea) and plasma electrolytes, and the dilution of proteins as they pass through the column, minimizing the

Fig. 21.1 Biofluids analytical platform. Samples are denatured with urea, and thaumatin (a plant protein) labeled with Alexa-488 is added. The samples are separately loaded onto a size-exclusion column through an HPLC pump controlled by a computer. The effluent is monitored by a UV spectrometer that controls a fraction collector. When the fluorescent dye is detected, the fraction collector is triggered to advance. Protein pools are defined by all fractions collected before the end of the fluorescent thaumatin peak, while peptide pools are defined by all fractions between the end of the thaumatin peak and the beginning of the free Alexa peak. The protein pools are antibody depleted of the 14 most abundant proteins in human plasma and saturation-labeled with BODIPY-Fl. Proteins are separated by 2DGE, analyzed, and identified by MS. Peptides are labeled via trypsin-mediated oxygen exchange, pools from sample 1 and 2 mixed, and separated and analyzed by on-line RP-LC-MS/MS
fluorescent thaumatin peak, while peptide pools are

re-association of high and other abundant proteins.

Since potential biomarkers may be proteins, peptides, or both, comprehensive differential analysis must permit their recovery, ideally from the same sample. A further potential benefit of analyzing both pools from the same sample is that the concordant appearance of an observed peptide and its parent protein may signal that the peptide is an artifact of proteolysis in the biofluid, and therefore unlikely as an accurate candidate biomarker.

Finally, with only 20,300 genes in the human genome, the great complexity of proteins within any biological sample primarily reflects the plethora of post-transcriptional/translational modifications (PTM). There is little theoretical justification in the assumption that candidate biomarkers will not reflect one or more PTMs, including naturally occurring proteolytic clipping or enzymatically catalyzed cross-linking. This also justifies the top-down strategy of fractionating and separating intact proteins and peptides, where their PTM status might be lost in a peptide-centric, bottom-up approach. As will be seen below, we identified both PTM classes as engendering candidate biomarkers in several of our CPC projects.

21.2.2 Antibody Depletion

After BAP fractionation, proteins are largely disassociated and diluted, so depletion of the most abundant of them is performed without fear of excessive losses. Optimization of the depletion, however, is critical, and typically is monitored by 1D SDS-PAGE of the untreated protein pool, the depleted pool, and the proteins recovered from the depletion columns. To establish enrichment of non-abundant proteins, equal amounts of proteins are loaded in each lane of the gel. If depletion was efficient, comparison with the undepleted lane should reveal that the high

Fig. 21.2 Biofluids analytical platform separations output. In its current configuration, the BAP consists of four low-pressure columns (A–D) and generates four separate protein and peptide pools, as described in the text and Fig. 21.1 legend. When peaks are detected by the UV monitor, an event marker (vertical line) is placed on the chart. The event markers and the fractions used for the protein and peptide pools for “Column D” are highlighted in the Figure as examples. Note, no plasma-specific protein signals are detected, and therefore pool compositions are strictly governed by the thaumatin retention time internal standard, providing maximum fractionation reproducibility from sample to sample.
Note: UV tracings are purposely shifted to permit uncomplicated viewing
protein and peptide pools for “Column D” are highlighted

abundance proteins should be diminished in intensity, while the faint or undetectable proteins should be enriched in the depleted lane. The depleted proteins should appear in equal intensity in the recovered gel lane.

21.3 Analysis

We segment our discussion into analyses specific to the protein pool and those specific to the peptide pool. At the completion of the statistical reduction of the candidate protein and peptide biomarkers, we compare the lists to determine overlaps and uniqueness of these panels. Where overlaps exist, we select the protein partner as the candidate biomarker, to be added to the list of other proteins and remaining uniquely appearing peptides to generate a comprehensive set of candidate biomarkers for subsequent confirmation.

21.3.1 Protein Pools

The protein pool, consisting of diluted proteins eluting before the end of the internal standard peak (23 kDa  Alexa-thaumatin  ~17 kDa; Fig. 21.1), is permitted to partially renature during its slow elution and subsequent storage at 4 °C overnight. After this period, the pool undergoes antibody depletion (IgY) by two successive passages through a column of antibodies specific for the 14 most abundant plasma proteins (Sigma-Aldrich).
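
The list-merging rule described in Sect. 21.3 (where a significant peptide and its parent protein both appear, keep the protein) can be expressed with simple set operations. The Python sketch below is illustrative only; the function name `merge_candidates`, the data layout, and the placeholder identifiers are hypothetical and do not correspond to results from the CPC projects.

```python
def merge_candidates(protein_hits, peptide_hits):
    """Combine significant proteins and peptides into one candidate list.

    protein_hits : set of protein identifiers judged significant
    peptide_hits : dict mapping peptide sequence -> identifier of its parent protein
    """
    # Peptides whose parent protein is already a protein-level candidate
    overlap = {parent for parent in peptide_hits.values() if parent in protein_hits}
    # Peptides with no significant parent protein stay on the list as peptides
    unique_peptides = sorted(
        seq for seq, parent in peptide_hits.items() if parent not in protein_hits
    )
    return {
        "proteins": sorted(protein_hits),   # overlapping entries are represented by the protein
        "peptides": unique_peptides,
        "overlap": sorted(overlap),
    }


# Placeholder identifiers, not real accessions or sequences
protein_hits = {"PROT_A", "PROT_B"}
peptide_hits = {"SEQ1": "PROT_A", "SEQ2": "PROT_C"}
panel = merge_candidates(protein_hits, peptide_hits)
print(panel)
```

Keeping the overlap as a separate entry is a design choice made here so that concordant peptide and protein observations can still be reviewed, since Sect. 21.2.1 notes that such concordance may indicate proteolysis in the biofluid.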

Because technical replicates are not performed, we select covalent saturation labeling that is specific for cysteine residues for the protein pools [5, 6]. Since all proteomic analyses alkylate cysteines before analysis, and >92 % of human proteins contain at least one cysteine residue [7], we simply alkylate with a fluorescent dye of high extinction coefficient at saturating ratios of dye:protein thiol (>50-fold) [5, 8] as determined by amino acid analysis. Saturation labeling with an uncharged (as opposed to neutral) dye, e.g. Bodipy-Fl, ensures reproducible quantification with no change in electrophoretic mobility. Separation followed by fluorescence quantification is accomplished by 2D gel electrophoresis (2DGE), and identification of differentially abundant proteins by MS/MS.

21.3.2 Peptide Pools

The peptides in the peptide pools obtained from the BAP to be compared are differentially labeled with 16O/18O by trypsin-mediated exchange under conditions previously established to ensure maximum incorporation of two oxygen isotopes at each carboxy-terminus [9]. Under these circumstances, matched controls and cases are 16O and 18O labeled, respectively. The incorporation of the first oxygen during tryptic exposure is catalytic and performed at pH 8.0. The second exchange is slow and non-catalytic, and is performed at pH 6.0 [10]. Maximum incorporation of 18O is dependent upon peptide length and slow, so incubations are performed over 24 h (Fig. 21.3). Peptides thus labeled are mixed with their 16O labeled controls, separated by RP-HPLC, and quantified and identified by tandem electrospray MS/MS.

21.4 NIAID CPC Project 1-Dengue Fever

21.4.1 Introduction

Dengue Fever (DF) is a mosquito-borne disease caused by a single-stranded RNA virus of

[Figure 21.3 plot: % maximum incorporation (0–100 %) versus time (0–24 h) for the 16O, 1 × 18O, and 2 × 18O species]

Fig. 21.3 Time course optimization of H2(18)O peptide labeling. Seven peptides of varying length (945, 985, 1580, 1742, 1929, 2256, and 2759 Da) were exposed to trypsin-mediated isotope exchange for varying times as indicated in the figure. At the appropriate time, solutions were acidified with TFA and analyzed by RP-LC-MS/MS. Incorporation of the stable isotope was calculated and the averages of the seven peptides for each isotope substitution at each time point was normalized against the highest incorporation value. The maximum level of incorporation of the doubly labeled 18O peptide can be seen after 22 h
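
The normalization described in the Fig. 21.3 legend (average over peptides at each time point, then scale to the highest average) can be illustrated as follows. This Python sketch reflects one reading of the legend, namely that each isotope species is normalized against its own highest time-point average; the function name `percent_of_maximum` and the example values are hypothetical and are not the data plotted in the figure.

```python
from statistics import mean


def percent_of_maximum(incorporation):
    """Average per-peptide incorporation at each time point and scale to the maximum.

    incorporation : dict mapping time (h) -> list of per-peptide incorporation values
                    for one isotope species (e.g. the doubly labeled 18O form)
    """
    averages = {time: mean(values) for time, values in incorporation.items()}
    highest = max(averages.values())
    if highest <= 0:
        raise ValueError("no incorporation detected at any time point")
    # Express every time-point average as a percentage of the highest average
    return {time: 100.0 * avg / highest for time, avg in sorted(averages.items())}


# Hypothetical incorporation fractions for two peptides at three time points
doubly_labeled = {2: [0.2, 0.3], 8: [0.5, 0.7], 22: [0.8, 0.8]}
profile = percent_of_maximum(doubly_labeled)
for time, pct in profile.items():
    print(f"{time:>2} h: {pct:.1f} % of maximum")
```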

4 serotypes. The mosquito thrives in tropical and sub-tropical environs in which 1/3 to 1/2 of the world’s population lives. Initial infection confers life-long immunity against subsequent infection by the identical serotype, but not against heterologous infection. Most infections are self-limiting; however, a small percentage of infected individuals develop a life-threatening syndrome characterized by vascular leakage and hemorrhage (Dengue Hemorrhagic Fever, DHF), defined by classical WHO criteria, or severe complicated Dengue disease (DFC), defined as cases that do not satisfy WHO criteria for DHF but nevertheless exhibit hemorrhage or thrombocytopenia within 7 days of the onset of symptoms. Early intervention of supportive therapy for these individuals significantly enhances survival. However, in challenging environments, the establishment of risk for developing DHF immediately upon clinical presentation would be of critical importance for the health of the patient. Because of this, the pursuit of biomarkers has resulted in literature suggestive of several candidate biomarker leads [11–14], although none have been confirmed by quantitative studies as required for proper candidate biomarker Qualification and Verification (see Chap. 22). Our challenge was therefore to identify a panel of candidate biomarkers that could accurately define the risk of developing the survivable hemorrhagic form of the infection and by extension the normal Qualification and Verification stages of biomarker development.

21.4.2 Study

Our Center was presented with two DF proposals, both representing Latin American cohorts. Only one (DF-Brazil), however, had sufficient numbers of patients for Discovery as well as a second cohort to take through qualification [15, 16]. As described in Chap. 22, the goal of our studies was to develop candidate biomarkers that might define the risk of developing DHF or DFC, so our discovery study design focused on plasma collected from 110 patients presenting in the acute stage with normally resolving DF (n = 59) compared to those who later developed DHF (n = 22), or DFC (n = 29).

21.4.3 Analysis

After 2DGE separations and analysis of the BAP protein and peptide pools, 1311 proteins were quantified, and 121 were judged significantly changed in DHF with respect to DF using non-parametric statistical analysis [16]. To reduce the candidate panel further, statistical tools were used (Chap. 19) and as a result, the panel was reduced to 15 proteins that accurately classified patients into DF, DHF, and DFC phenotypes. The significant proteins identified are listed in Table 21.2 and the feature-reduced set of 15 proteins in Table 21.3. The significant peptides found in the analysis are listed in Table 21.4. Note that the Dengue NS1 protein and Complement factors in Table 21.3 were not obtained from Discovery, but resulted from heuristic methods.

The peptides from the BAP peptide pools were quantified by 16O/18O ratios, i.e. acute (18O-labeled) and convalescent (16O-labeled) samples from the same patient were mixed before MS, and the log2 normalized ratios for each peptide detected from DF and DHF samples were compared by t-test (Table 21.4). The three statistically significant peptides from DF and DHF all showed increased abundance in DHF.

Of the proteins and peptides identified, several are most useful as justification of our strategy to pursue post-translational modifications with analysis of intact proteins. One notable example is the high molecular size albumin (~200+ kDa). This protein is not depleted by the depletion antibodies, and is diminished in the patients suffering from the hemorrhagic sequel of DF compared to normal plasma or patients who resolved their uncomplicated DF [15, 16]. While the biochemistry is currently under investigation in our facility, the protein is likely covalently cross-linked by some cross-linking agent that appears depleted from the viral infection. It is clear that a “bottom-up”

Table 21.1 Summary of discovery results from the analysis of proteins and peptides in the NIAID-CPC projects discussed in the chapter

                               Total number analyzed        Significant from discovery      Predictive features by MARS or other methods
Project                        Protein spots  Peptides (a)  Protein spots  Unique peptides  Protein spots  Peptides
Dengue fever, Brazil           1311           8873          121            639              6              3
Invasive Aspergillosis         556            4402          66             360              9              3
Chagasic cardiomyopathy (b)    635            ND            36             ND               7              ND

(a) Probability ≥ 0.95; FDR < 0.01; Analysis by ProteoIQ™ (Premier Biosoft)
(b) Discovery was performed on purified PBMCs
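
The peptide comparison described in Sect. 21.4.3 (log2 ratios of 18O to 16O signal, compared between DF and DHF by t-test) can be sketched in Python as below. Several points are assumptions: the text does not say which t-test variant or which normalization was used, so the sketch uses Welch's unequal-variance form and omits normalization; the function names and intensity values are hypothetical. Only the t statistic and its degrees of freedom are computed, since a p-value requires the t distribution, which the Python standard library does not provide.

```python
from math import log2, sqrt
from statistics import mean, variance


def log2_ratios(pairs):
    """pairs: list of (intensity_18O, intensity_16O) for one peptide, one pair per patient."""
    return [log2(heavy / light) for heavy, light in pairs]


def welch_t(group_a, group_b):
    """Welch's t statistic (mean of a minus mean of b) and its approximate degrees of freedom."""
    var_a = variance(group_a) / len(group_a)   # squared standard error, group a
    var_b = variance(group_b) / len(group_b)   # squared standard error, group b
    t = (mean(group_a) - mean(group_b)) / sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical acute (18O) / convalescent (16O) intensities for one peptide
df_patients = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
dhf_patients = [(4.0, 1.0), (8.0, 1.0), (2.0, 1.0)]

t_stat, dof = welch_t(log2_ratios(dhf_patients), log2_ratios(df_patients))
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

A positive t statistic here means the peptide's acute-to-convalescent ratio is higher in the DHF group than in the DF group.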

Table 22.2 Proteins identified from 2DGE gel spots in the dengue fever project
Spot Spot Spot MW UniProt Peptide Seq. Fold
# pIa (kDa)a Protein name accession countb coveragec change p (t-test)
Identified by MALDI TOF/TOF
73 7.20 >250 Complement C3 P01024 21 18.3 1.81 0.03335
80 6.18 >250 Serum albumin; Flags: P02768 14 29.2 1.33 0.06969
Precursor
83 7.30 200 Complement C3 P01024 26 22.7 1.72 0.01575
201 5.74 119 Alpha-2-macroglobulin P01023 21 20.6 1.43 0.01962
204 5.80 117 Alpha-2-macroglobulin P01023 22 21.2 1.46 0.01788
224 3.55 100 Keratin, type I cytoskeletal P13645 16 32.4 2.80 0.06689
10
306 7.30 80 Complement C3 P01024 31 27.2 2.18 0.04624
330 9.19 71 Complement C4-B P0C0L5 18 13.8 1.74 0.05309
335 9.09 70 Complement C4-B P0C0L5 20 15.9 1.79 0.05494
434 9.20 49 Ig gamma-1 chain C region P01857 6 23.6 2.41 0.08858
444 9.09 48 Ig gamma-1 chain C region P01857 6 23.6 2.71 0.06011
486 5.27 44 Antithrombin-III P01008 13 34.9 1.64 0.04141
719 9.92 31 Keratin, type I cytoskeletal P13645 16 33.9 1.38 0.01334
10
784 6.89 28 Complement C3 P01024 15 8.9 1.52 0.06435
1434 4.11 13 Keratin, type II cytoskeletal 1 P04264 16 30.4 2.03 0.10884
1483 4.13 12 Keratin, type I cytoskeletal 9 P35527 19 49.4 1.91 0.01366
1516 7.96 11 Keratin, type I cytoskeletal 9 P35527 20 50.08 1.43 0.05599
Identified by LC-MS
6 4.76 >250 Alpha-2-macroglobulin P01023 5 3.7 1.54 0.06468
81 6.68 >250 Complement C3 P01024 11 9.0 1.69 0.05986
85 3.62 191 Desmoplakin P15924 7 2.8 2.00 0.02492
90 3.49 184 Alpha-2-macroglobulin P01023 15 12.1 2.01 0.01044
94 6.40 184 Alpha-2-macroglobulin P01023 12 10.4 1.15 0.07054
108 6.55 181 Alpha-2-macroglobulin P01023 10 7.6 1.56 0.04054
127 4.86 169 Alpha-2-macroglobulin P01023 6 4.3 1.52 0.01788
221 6.53 102 Complement factor B P00751 3 2.2 1.43 0.03297
303 7.69 81 Isoform 2 of Complement C4-A P0C0L4 7 4.1 1.93 0.03246
325 9.39 73 Isoform 2 of Complement C4-A P0C0L4 10 6.2 1.72 0.05882
350 9.78 69 Isoform 2 of Complement C4-A P0C0L4 7 4.2 1.38 0.00536
373 3.36 63 Alpha-1-antichymotrypsin P01011 2 5.7 2.04 0.00162
385 8.08 61 Complement C3 P01024 2 1.4 1.52 0.03671
394 3.36 60 Alpha-1-antichymotrypsin P01011 4 9.9 2.48 0.02543
411 3.40 56 Ig gamma-1 chain C region P01857 5 19.4 3.23 0.00698
421 3.47 52 Ig gamma-1 chain C region P01857 3 12.1 2.69 0.01273
441 4.73 48 Alpha-1-antichymotrypsin P01011 2 5.7 1.23 0.03502
450 8.66 48 Ig gamma-1 chain C region P01857 2 9.4 1.62 0.04675
451 8.91 48 Ig gamma-1 chain C region P01857 5 19.4 2.11 0.04242
457 5.73 47 Fibrinogen gamma chain P02679 3 6.5 1.16 0.03002
458 8.06 47 Ig gamma-1 chain C region P01857 3 13.0 1.60 0.05222
465 4.49 46 Leucine-rich alpha-2-glycoprotein P02750 2 5.5 1.43 0.02089
493 5.07 44 Alpha-1-antitrypsin P01009 2 4.8 2.61 0.00261
506 9.47 43 Isoform 2 of Complement C4-A P0C0L4 2 1.1 1.48 0.02166
539 6.51 39 Plasma serine protease inhibitor P05154 4 10.6 1.75 0.00800
546 4.79 39 Desmoplakin P15924 24 10.9 1.42 0.02830
556 5.38 38 Complement C3 P01024 5 3.7 1.41 0.05469
563 5.07 38 Zinc-alpha-2-glycoprotein P25311 3 10.4 1.43 0.02424
564 5.24 38 Complement C3 P01024 3 2.8 1.41 0.01681
565 5.29 38 Haptoglobin P00738 2 6.4 1.54 0.02545
566 5.52 38 Haptoglobin P00738 1 3.0 1.74 0.02893
567 5.67 38 Complement C3 P01024 5 4.4 1.85 0.05336
584 5.68 36 Apolipoprotein E P02649 3 12.0 1.69 0.06733
604 7.78 36 Complement C3 P01024 11 9.3 1.62 0.02817
721 4.83 31 Isoform 2 of Clusterin P10909 55 2.0 1.60 0.01231
776 6.55 29 Complement C3 P01024 4 3.3 1.44 0.03974
891 3.45 23 60 kDa heat shock protein, mitochondrial P10809 2 9.1 2.08 0.07175
964 5.77 20 Junction plakoglobin F5GWP8 6 19.6 1.65 0.00919
1031 9.82 19 Ig kappa chain C region P01834 2 35.8 1.98 0.00745
1138 6.38 18 Haptoglobin P00738 2 6.2 1.99 0.02162
1159 5.35 18 Haptoglobin P00738 2 6.2 1.93 0.05555
1232 8.03 17 60 kDa heat shock protein, mitochondrial P10809 1 5.2 1.41 0.02449
1256 7.06 16 Desmoplakin P15924 11 4.6 1.57 0.01826
1318 7.54 15 Isoform 2 of Dermcidin P81605 3 21.5 2.42 0.00207
1416 9.06 13 Serum amyloid A-4 protein P35542 2 14.6 1.47 0.01291
1459 9.32 12 Alpha-1-antitrypsin P01009 2 4.8 1.42 0.00300
1490 8.44 12 Isoform 2 of Dermcidin P81605 2 11.6 1.67 0.03968
(a) The pI and MW are from 2DGE spots by gel calibration
(b) “Peptide count” for MALDI and “Exclusive unique peptide count” for LC-MS identifications
(c) Percent of the protein sequence covered by the mapped peptides
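
The sequence coverage defined in the last footnote can be illustrated with a short calculation. The following is a minimal sketch only: it assumes that each identified peptide is located in the protein by exact substring match, and the function name `sequence_coverage` and the toy sequences are hypothetical, not taken from the study or from the search software that was used.

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Percent of residues in `protein` covered by at least one peptide.

    Assumption (not from the chapter): peptides map by exact substring match.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue  # ignore empty strings
        start = protein.find(pep)
        while start != -1:
            # mark every residue spanned by this occurrence of the peptide
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy example: two overlapping peptides cover residues 1-5 of a 10-residue sequence.
coverage = sequence_coverage("ABCDEFGHIJ", ["ABC", "CDE"])
print(f"{coverage:.1f} % coverage")
```

Overlapping peptides are counted once per residue, which is why the calculation marks positions rather than summing peptide lengths.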

Table 21.3 Predictive biomarkers for dengue fever (a)
Biological function   Biomarker candidate               Short name  Swiss-Prot accession
Dengue                Dengue NS1 (b)                    NS1         Q67431
Complement            Complement factor 4A (b)          CO4A        P0C0L4
                      Complement factor H (b)           CFH         P08603
                      Complement factor D (b)           CFD         P00746
Acute phase reactant  A2-macroglobulin                  A2M         P01023
                      Alpha 1 anti-trypsin              A1AT        P00760
                      Fibrinogen, alpha                 FIBA        P02671
                      Fibrinogen, beta                  FIBB        P02675
                      Ferritin, light chain             FRIL        P02792
                      Haptoglobin                       HPT         P00738
Plasma protein        Leucine-rich alpha2 glycoprotein  AG2L        P02750
                      High MW albumin                   HMWAIb      P02761
Immunoglobulin        Immunoglobulin J                  IGJ         P01591
                      Immunoglobulin kappa, C region    IGKC        P01834
                      IgG-gamma-1, C region             IGHG1       P01857
Cytoskeletal          Keratin 1                         KRT1        P04264
                      Tropomyosin 4                     TMPH        P67936
                      Low MW desmoplakin                DESP        P15924
                      Vimentin                          VIME        P08670
(a) Adapted from Table 3 (Ref. [16]), with permission
(b) NS1 and complement factors were obtained by heuristics as outlined in the introduction

Table 21.4 Significant BAP peptides found in dengue fever discovery
Peptide          Protein ID                  Gene name  Ratio (a)  p-value (b) (t-test)
YWGVASFLQK       Retinol-binding protein 4   RET-4      1.53       0.028
YAASSYLSLTPEQWK  Ig Lambda-7 chain C-region  LAC-7      1.59       0.031
DLATVYVDVLK      Serum albumin               ALBU       2.54       0.013
(a) DHF/DF from acute vs convalescent peptide from the same patient
(b) Comparison of log2 normalized ratios from DF and DHF
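
The two footnotes describe a per-patient acute/convalescent ratio followed by a t-test on log2 ratios between DF and DHF. A minimal sketch of that kind of comparison is given below. It is an illustration under stated assumptions, not the authors' pipeline: the unequal-variance (Welch) form of the t statistic is assumed because the chapter does not say which variant was used, the function names are hypothetical, and the intensities are invented. Converting the statistic to a p-value requires the t distribution, which the Python standard library does not provide (a statistics package such as SciPy offers `scipy.stats.ttest_ind`).

```python
import math
from statistics import mean, variance


def log2_ratios(acute, convalescent):
    """Per-patient log2(acute / convalescent) for paired intensity lists."""
    return [math.log2(a / c) for a, c in zip(acute, convalescent)]


def welch_t(x, y):
    """Welch t statistic and approximate degrees of freedom for two samples."""
    vx = variance(x) / len(x)  # squared standard error of group x
    vy = variance(y) / len(y)  # squared standard error of group y
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Invented intensities for one peptide: acute and convalescent draws per patient.
df_ratios = log2_ratios([2.0, 4.0, 8.0], [1.0, 1.0, 1.0])     # DF patients
dhf_ratios = log2_ratios([4.0, 8.0, 16.0], [1.0, 1.0, 1.0])   # DHF patients
t_stat, dof = welch_t(df_ratios, dhf_ratios)
print(t_stat, dof)
```

Working on log2 ratios makes a two-fold increase and a two-fold decrease symmetric around zero, which is the usual reason for transforming ratios before a t-test.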

approach to a candidate biomarker discovery effort would likely have missed this molecule; it would have appeared as simply albumin.
We determined from our analyses that several factors initially confounded the goals of our study, and likely confounded the other studies mentioned in the previous paragraph. We determined that gender plays a large role in differentiating DF and DHF phenotypes (Fig. 21.4). As can be seen from the principal component analysis (PCA), when male and female samples were analyzed together, DF and DHF did not obviously cluster into separate groups, whereas within each gender DF and DHF clustered clearly apart.
To further investigate our findings, we took advantage of the samples that were collected at intermediate time points between the acute and convalescent times. In those studies, we observed considerable differences in the proteins identified as statistically significant, as well as sample data clustering by PCA, regardless of gender (Fig. 21.4d). However, as these times approached the onset of hemorrhage, we surmised that we were observing the development of the severe symptoms of DHF, and that these differences were less predictive than diagnostic. Thus it could be argued that our analysis was somewhat underpowered, due to these additional factors.
These and other factors provide evidence for the importance of strict adherence to collection standard operating procedures, sensitivity to gender effects, collection times, and other

Fig. 21.4 Principal component analysis of DF and DHF, male and female. (a) Combined male and female DF (pink) and DHF (blue) analyses. (b) Male DF vs. DHF. Clear separation of the two sample cohorts can be observed. (c) Female DF vs. DHF. Separation of the two disease states is clearly observable. (d) Analysis of intermediate time points for collection (days 3–6 after initial clinical presentation). Here, male (M) and female (F) samples are indistinguishable, but DF and DHF are clearly separated. Key: Each dot represents the behavior of one sample (gel data). DF gel data are pink, DHF gel data are blue

potentially confounding variables, as emphasized in Chap. 22.

21.5 NIAID CPC Project 2-Invasive Aspergillosis (IA)

21.5.1 Introduction

Aspergillus is one of the most common invasive fungal pathogens in hospitalized patients in the United States. IA accounts for the most deaths due to fungal pathogens, and is among the top three fungal killers in the world. The most common victims are patients who are immunocompromised after organ transplants, or due to AIDS or neutropenic cancer. Infection most commonly occurs from the inhalation of airborne fungal spores. After the initial pulmonary disease, IA spreads via blood to other organs. Since blood cultures are rarely positive for the fungus, IA is difficult to diagnose and to control. The goals of this project were to:

1. Identify a panel of candidate biomarkers from plasma for rapid and accurate diagnosis
2. Confirm differential protein and peptide abundance by a targeted, qualitative and quantitative approach
3. Verify these candidate biomarkers by testing their ability to discriminate between uninfected and Aspergillus-infected samples, and samples infected by non-Aspergillus molds that produce similar clinical symptoms.

21.5.2 Study

We received and analyzed plasma samples from 34 patients clinically diagnosed with IA (“case”), from 17 patients of the cohort prior to their developing IA (“autocontrol”), and from 34 uninfected subjects matched to the infected patients by gender, disease, and immunosuppressed state (“matched control”). We proceeded with the sample processing and analysis protocols described above for the Dengue Fever project.
From the analysis of the protein pools by 2D electrophoresis, we observed a total of 556 aligned spots that satisfied our criteria for quantitative analysis. An ANOVA comparison of the three sample groups (case, autocontrol, and matched control) yielded 66 differential spots (p ≤ 0.05), which were identified by MALDI-TOF/TOF MS. These proteins are listed in Table 21.5. A feature reduction and biomarker panel development approach similar to the one used in the Dengue Fever project described above led to a predictive panel of six proteins listed in Table 21.6 (for a generalized discussion of these statistical tools, see Chap. 19).
The peptide pool was analyzed by the stable isotope method (16/18O) as described in the “Peptide Pools” section above to compare case vs. autocontrol. Only the 15 peptides detected in at least 50 % of the samples were included in the comparison (Table 21.7). Three of these peptides (highlighted in the table) were found to be significantly different between case and control.

21.6 NIAID CPC Project 3-Chagasic Cardiomyopathy

21.6.1 Introduction

Chagas disease is a parasitic disease caused by Trypanosoma cruzi (T. cruzi) infection, and is a serious health threat in Latin America. According to a WHO report [17], there are 16–18 million people infected, and 25 % of the population of Latin America, i.e., ~120 million people, are at risk of infection. Due to migration and organ transplantation, it is estimated that about 300,000 infected patients live in the United States. The disease exhibits acute and chronic clinical forms. The acute phase, starting several days post-infection, is characterized by nonspecific symptoms, although skin reactions at the site of infection (chagoma) may be suggestive. Occasionally cardiac symptoms appear, but resolve normally within 6–8 weeks with the production of anti-T. cruzi antibodies [18]. After recovery from the acute phase, patients enter an asymptomatic, chronic phase. About one-third of the chronic patients progress to develop cardiomyopathy in the form of an apical aneurysm as long as 30 years later, which can result in heart failure and death, or gastrointestinal abnormalities [18]. Because chronic infections are difficult to treat, early detection and treatment of those who are at high risk of developing chagasic cardiomyopathy is critical. The goal of this project was to identify a panel of candidate biomarkers that was capable of defining those at risk of developing chagasic cardiomyopathy.

21.6.2 Study

We obtained peripheral blood mononuclear cells (PBMCs) from four groups: healthy volunteers (Group 1), Chagas seropositive but cardio-asymptomatic (Group 2), Chagas seropositive cardio-symptomatic (Group 3), and non-Chagas, cardio-symptomatic patients. Our

Table 21.5 Proteins identified from 2D gel analysis in the invasive Aspergillosis project
Gel spot No.  pI (2D gel)  MW, kD (2D gel)  Protein name  Swiss-Prot accession  Protein score  Abund. ratio (a)  Abund. ratio (b)  p-value (ANOVA) (c)
183 7.66 51 Fibrinogen beta chain P02675 202 1.20 1.16 0.03015
187 5.99 49 Fibrinogen beta chain P02675 177 1.16 1.14 0.04969
282 7.51 36 Alpha-mannosidase 2 Q16706 30 2.56 1.41 0.00311
300 8.19 35 Ig kappa chain V-III region SIE P01620 153 1.01 1.54 0.00034
303 7.15 35 Ig kappa chain V-III region SIE P01620 81 1.12 1.82 0.00067
346 7.04 31 Ferritin light chain P02792 167 2.29 2.12 0.00049
586 8.12 17 Leucine-rich alpha-2-glycoprotein P02750 281 1.36 1.71 0.00178
103 7.31 73 Complement factor B P00751 124 1.37 1.31 0.03761
356 7.38 31 Hemopexin HPX P02790 65 1.49 1.36 0.00574
412 3.76 25 Serum amyloid A-4 protein P35542 46 2.04 1.44 0.00169
508 5.37 19 Fibrinogen alpha chain P02671 405 1.08 1.25 0.03188
588 8.21 17 Leucine-rich alpha-2-glycoprotein P02750 302 1.28 1.43 0.00239
747 3.96 12 Fibrinogen alpha chain P02671 180 1.10 1.37 0.00566
200 5.60 48 Complement C3 P01024 341 1.12 1.31 0.00530
245 7.25 42 Complement C4-A P0C0L4 40 1.33 1.35 0.01312
348 7.55 31 Histidine protein methyltransferase 1 homolog METTL18 O95568 27 1.33 2.32 0.04394
359 5.22 30 Hemopexin HPX P02790 79 1.29 1.28 0.01005
360 6.35 30 Hemopexin HPX P02790 31 1.21 1.28 0.03737
364 9.26 29 Keratin, type II cytoskeletal 1 KRT1 P04264 68 1.71 1.12 0.04369
408 6.68 26 Serum amyloid A-4 protein SAA4 P35542 91 1.57 1.31 0.04750
458 8.10 20 MEF2-activating motif and SAP domain-containing transcriptional regulator MAMSTR Q6ZN01 29 2.21 1.58 0.04684
468 7.03 20 Apolipoprotein A-II APOA2 P02652 85 1.22 1.36 0.02118
494 5.00 19 Alpha-1-antichymotrypsin SERPINA3 P01011 640 1.17 1.38 0.00531
496 7.74 19 Alpha-1-antichymotrypsin SERPINA3 P01011 633 1.17 1.33 0.02074
502 5.51 19 Alpha-1-antichymotrypsin SERPINA3 P01011 228 1.27 1.60 0.00440
568 4.87 17 Alpha-1-antitrypsin SERPINA1 P01009 43 1.28 1.33 0.00301
580 6.33 17 Leucine-rich alpha-2-glycoprotein LRG1 P02750 375 1.12 1.29 0.00869
581 9.18 17 Leucine-rich alpha-2-glycoprotein LRG1 P02750 399 1.17 1.37 0.00998
650 3.85 16 Alpha-1-antitrypsin SERPINA1 P01009 162 1.60 1.29 0.03102
653 4.81 16 Alpha-1-antitrypsin SERPINA1 P01009 272 1.37 1.37 0.04244
695 5.05 14 Alpha-1-acid glycoprotein 1 ORM1 P02763 91 1.44 1.54 0.00113
696 7.82 14 Alpha-1-acid glycoprotein 1 ORM1 P02763 51 1.45 1.51 0.00384
735 8.83 13 Apolipoprotein A-I APOA1 P02647 239 1.07 1.35 0.02470
737 7.59 13 Apolipoprotein A-I APOA1 P02647 97 1.28 1.44 0.00622
739 6.74 12 Apolipoprotein A-I APOA1 P02647 368 1.02 1.33 0.02157
764 4.81 12 Alpha-1-acid glycoprotein 1 ORM1 P02763 113 1.42 1.51 0.00112
828 6.70 35 Serum albumin ALB P02768 97 1.63 1.26 0.02209
498 5.60 19 Alpha-1-antichymotrypsin SERPINA3 P01011 587 1.25 1.35 0.03639
499 5.75 19 Alpha-1-antichymotrypsin SERPINA3 P01011 521 1.45 1.42 0.04196
501 6.14 19 Alpha-1-antichymotrypsin SERPINA3 P01011 522 1.16 1.43 0.00648
503 5.25 19 Alpha-1-antichymotrypsin SERPINA3 P01011 503 1.17 1.50 0.00426
583 8.99 17 Leucine-rich alpha-2-glycoprotein LRG1 P02750 359 1.11 1.35 0.00294
585 9.09 17 Leucine-rich alpha-2-glycoprotein LRG1 P02750 337 1.14 1.33 0.00450
589 5.90 17 Leucine-rich alpha-2-glycoprotein LRG1 P02750 377 1.20 1.31 0.00479
639 3.52 16 Alpha-1-antitrypsin SERPINA1 P01009 433 1.38 1.33 0.04594
654 8.16 16 Alpha-1-antitrypsin SERPINA1 P01009 368 1.29 1.48 0.00094
691 4.57 14 Alpha-1-acid glycoprotein 1 ORM1 P02763 268 1.24 1.51 0.01454
697 3.55 14 Alpha-1-acid glycoprotein 1 ORM1 P02763 227 1.24 1.48 0.00586
766 3.99 12 Alpha-1-acid glycoprotein 1 ORM1 P02763 243 1.25 1.45 0.00207
767 4.64 12 Alpha-1-acid glycoprotein 1 ORM1 P02763 228 1.29 1.39 0.02195
49 9.47 109 Serum albumin ALB P02768 455 1.09 1.12 0.02791
115 8.14 73 Serum albumin ALB P02768 207 1.50 1.62 0.00015
116 8.20 73 Serum albumin ALB P02768 385 1.40 1.39 0.00398
117 6.22 73 Complement C4-A C4A P0C0L4 157 1.20 1.56 0.03590
748 6.22 12 Fibrinogen alpha chain FGA P02671 180 1.16 1.42 0.04536
176 5.52 51 Fibrinogen beta chain FGB P02675 303 1.34 1.29 0.00182
178 5.60 51 Fibrinogen beta chain FGB P02675 403 1.28 1.27 0.00770
399 6.25 26 Putative uncharacterized Q9HD87 28 1.88 1.76 0.00120
protein C6orf50 C6orf50
401 6.95 26 Transthyretin TTR P02766 40 1.66 1.48 0.00137
406 4.00 26 Transthyretin TTR P02766 231 1.68 1.50 0.00034
407 5.35 26 Transthyretin TTR P02766 239 1.86 1.62 0.00012
447 5.36 20 Apolipoprotein C-III APOC3 P02656 148 1.56 1.63 0.01367
560 7.65 17 Zinc-alpha-2-glycoprotein AZGP1 P25311 100 1.02 1.25 0.03762
850 5.56 12 Histidine protein methyltransferase 1 homolog METTL18 O95568 30 1.48 1.39 0.02281
857 6.84 15 Annexin A10 ANXA10 Q9UJ72 29 2.17 2.21 0.00010
860 6.76 14 Transthyretin TTR P02766 42 2.60 1.97 0.00001
(a) Case vs. Auto control (same individual, sampled before and after infection)
(b) Case vs. Matched control
(c) Case vs. Auto control vs. Matched control
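
The three-group comparison in the last footnote is a one-way ANOVA per spot. The sketch below shows how the F statistic behind such a p-value is formed. It is illustrative only: the function name and the spot intensities are hypothetical, and turning F into a p-value needs the F distribution, which is not in the Python standard library (SciPy's `scipy.stats.f_oneway` returns both values).

```python
from statistics import mean


def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA over two or more groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented normalized intensities for one gel spot in the three sample groups.
case = [1.0, 2.0, 3.0]
autocontrol = [2.0, 3.0, 4.0]
matched_control = [3.0, 4.0, 5.0]
print(one_way_anova_f(case, autocontrol, matched_control))
```

A large F means the group means differ by more than the scatter within groups would explain; the abundance ratios in the table then indicate which pairwise contrast drives the difference.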

Table 21.6 Predictive proteins identified from 2DGE analysis in the Aspergillosis project
Spot No. Protein name UniProt accession
115 Serum albumin ALB P02768
200 Complement C3 P01024
494 Alpha-1-antichymotrypsin SERPINA3 P01011
654 Alpha-1-antitrypsin SERPINA1 P01009
359 Hemopexin HPX P02790
399 Putative uncharacterized protein C6orf50 Q9HD87

Table 21.7 Statistical analysis of discovery BAP peptides in the invasive Aspergillosis project
Peptide                     Mean (Case)  Mean (Control)  P         Valid N (Case)  Valid N (Control)
VPQVSTPTLVEVSR *            20.155       23.47225        0.000016  18              17
DALSSVQESQVAQQAR            24.8204      24.31583        0.632213  16              14
DALSSVQESQVAQQAR            24.6258      24.55044        0.924253  24              23
TTPPVLDSDGSFFLYSK           21.05433     21.31827        0.632961  19              18
GWVTDGFSSLK                 23.90974     24.27239        0.651187  18              17
AVMDDFAAFVEK                21.9167      22.31393        0.589658  17              15
STAAM STYTGIFTDQVLSVLKG EE  20.32244     20.67465        0.638155  20              18
SPELQAEAK                   21.61976     22.37701        0.21187   22              20
GPSVFPLAPSSK *              21.49173     23.33829        0.001563  24              22
MGPTELLIEMEDWK              20.35843     19.69993        0.206339  18              16
MGPTELLIEMEDWK              20.70003     20.52631        0.618689  19              19
YAASSYLSLTPEQWK             25.73156     24.65533        0.083302  16              15
TEGDGVYTLNDK                21.69819     22.15719        0.504701  20              19
TEGDGVYTLNNEK               21.85964     20.63309        0.482638  21              19
SVLGQLGITK *                20.69461     22.10105        0.000647  16              15
Peptides marked with * are statistically significant (p ≤ 0.05)
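
The "Valid N" columns reflect the rule, stated in the study description, that only peptides detected in at least 50 % of the samples entered the comparison. A minimal sketch of such a detection filter follows. The data layout (a dictionary of per-sample values with `None` for a missing detection), the function name, and the peptide labels are assumptions made for illustration; the chapter does not describe how the filter was implemented.

```python
def filter_by_detection(intensities, min_fraction=0.5):
    """Keep peptides detected in at least `min_fraction` of the samples.

    `intensities` maps a peptide to its per-sample values, with None where
    the peptide was not detected (hypothetical layout).
    Returns a dict of the retained peptides with their detected values only.
    """
    kept = {}
    for peptide, values in intensities.items():
        detected = [v for v in values if v is not None]
        if values and len(detected) / len(values) >= min_fraction:
            kept[peptide] = detected
    return kept


# Invented log2 intensities across four samples.
example = {
    "PEPTIDEA": [20.1, None, 21.3, 20.8],   # detected in 3 of 4 samples: kept
    "PEPTIDEB": [None, None, None, 22.0],   # detected in 1 of 4 samples: dropped
    "PEPTIDEC": [19.5, None, None, 19.9],   # detected in exactly half: kept
}
print(sorted(filter_by_detection(example)))
```

Filtering before testing keeps group means from being dominated by a handful of detections, at the cost of discarding peptides that are present in only one group.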

goal was to characterize the proteomic differences between Groups 2 and 3, seeking candidate biomarkers that would allow classification of infected noncardio-symptomatic patients who are most likely to develop chagasic cardiomyopathy.
This project departs from the others in that PBMCs were analyzed, rather than plasma. The primary rationale is that blood has been shown to reflect the progress of infection [19], and that T. cruzi infection induces activation of inflammatory cells (macrophages, neutrophils) that release cytotoxic reactive oxygen species (ROS) and reactive nitrogen species (RNS) for the control of the parasite [20]. We and our collaborator (Dr. N. Garg, UTMB) reasoned that an effective approach would be to globally investigate the oxidative status of PBMC proteins, namely, cysteinyl-S-nitrosylation (SNO), a widely recognized prototype of redox signaling.
In this 2DGE study, we employed the SNO by fluorescence approach (SNOFlo) to measure differential SNO [8, 21, 22]. Protein differential abundance was measured using the saturation fluorescence approach as in the other two projects described above [5]. Briefly, the SNOFlo analysis involves treating one half of each sample with ascorbate (Asc) to remove existing SNO modifications, labeling both the treated (Asc+) and untreated (Asc-) aliquots with the Bodipy-FL dye, and comparing spot intensities of the Asc+ and Asc- 2-D gels from Group 2 samples with those of Group 3. The degree of differential Cys-S nitrosylation is obtained by the calculation of a p-value and a “Ratio of Ratios” (or RoR; see [8]). Differential abundance between Groups 2 and 3 was calculated as in the other two projects. In the asymptomatic and symptomatic groups we had n = 25 and n = 28 samples, respectively, that we could take through the entire 2-DE analyses, leading to a total of 53 Asc+ and 53 Asc- gel images for analysis. After gel alignment, spot filtering, and editing, there were 635 spots for quantitative sample group comparisons. Based on t-test p-values and fold-changes (RoR for SNOFlo), there were a total of 33 spots that were significantly different either in abundance, or nitrosylation, or both. These spots were picked and identified by MALDI TOF/TOF mass spectrometry. These 33 proteins are listed in Table 21.8. The identified proteins are marked on the reference gel in Fig. 21.5.
A predictive set of proteins was arrived at by an approach involving classification modeling with MARS, Ensemble methods, Treenet, Generalized pathseeker (GPS) and Random Forests (RF), which are described in detail in Chap. 19 in this volume. Using this approach, we reduced the protein set to the seven proteins listed in Table 21.9.

21.7 Conclusions

We have described the importance of analyzing both proteins and peptides from plasma, as well as the importance of recognizing that post-translational modifications can convert so-called “nuisance” proteins (e.g., high-abundance, high molecular weight proteins) into potential biomarkers by virtue of ionic and/or size isomerization. This is not surprising, as many plasma proteins represent leakage of cellular proteins into the plasma due to the molecular pathology of the disease, and by their very nature, suggest structural modifications that facilitate their leakage. We have also seen that modifications resulting in either increased or decreased molecular size, notably those of high-abundance proteins, qualified these proteins as candidates by virtue of their unique size. This character would have been lost in a bottom-up strategy.
Table 21.8 Significant proteins identified by 2DGE analysis in the chagasic cardiomyopathy project
Gel spot No.  Spot pI  Spot MW (kD)  Protein name  UniProt accession  MS ID protein score  Abundance ratio  p-Value  SNO (a)  p-Value
63 7.53 99 Vinculin GN=VCL PE=1 SV=4 P18206 236 1.32 0.182504 1.32 0.033111
141 5.97 63 Serum albumin (Fragment) GN=ALB PE=4 SV=1 H0YA55 167 1.25 0.204533 1.69 0.027258
165 4.19 55 Isoform 3 of Integrin alpha-IIb GN=ITGA2B P08514-3 268 1.17 0.339162 1.06 0.040421
170 9.41 55 Isoform H7 of Myeloperoxidase GN=MPO P05164-3 79 1.75 0.125809 1.71 0.034137
261 6.57 41 POTE ankyrin domain family member F OS=Homo sapiens GN=POTEF PE=1 SV=2 A5A3E0 121 1.39 0.011500 1.35 0.037733
267 6.41 41 Actin, cytoplasmic 2, N-terminally processed GN=ACTG1 PE=3 SV=1 F5H0N0 106 1.30 0.034774 1.34 0.024770
273 8.84 41 Isoform 2 of Fibrinogen alpha chain OS=Homo sapiens GN=FGA P02671-2 54 1.46 0.035493 1.04 0.044564
339 6.33 34 Tubulin beta chain OS=Homo sapiens GN=TUBB PE=3 SV=1 Q5JP53 121 1.20 0.034097 1.19 0.012075
355 5.18 32 Talin 1 GN=TLN1 PE=2 SV=1 Q5TCU6 187 1.07 0.350146 1.26 0.037653
382 4.68 30 Actin, cytoplasmic 1, N-terminally processed GN=ACTB PE=2 SV=1 B4E335 468 1.10 0.291819 1.17 0.024110
385 5.72 29 Vimentin OS=Homo sapiens GN=VIM PE=3 SV=1 F5H288 171 1.43 0.006230 1.41 0.006024
389 6.09 28 Annexin A3 OS=Homo sapiens GN=ANXA3 PE=1 SV=3 P12429 549 1.41 0.027378 1.34 0.021102
404 7.51 26 Actin, cytoplasmic 1, N-terminally processed GN=ACTB PE=2 SV=1 B4DW52 133 1.13 0.116141 1.27 0.020184
411 7.98 26 Unconventional myosin-IXa GN=MYO9A PE=4 SV=1 H3BMM1 46 1.02 0.516115 1.13 0.043125
425 8.63 25 Peptidyl-prolyl cis-trans isomerase A GN=PPIA PE=1 SV=2 P62937 110 1.10 0.196105 1.14 0.045972
438 8.84 23 WD repeat-containing protein 49 GN=WDR49 PE=4 SV=1 F8WBC8 46 1.37 0.413441 1.09 0.047502
502 7.12 20 Keratin, type II cytoskeletal 1 GN=KRT1 PE=1 SV=6 P04264 153 1.09 0.619820 1.27 0.023729
506 8.92 20 Parathyroid hormone 2 receptor (Fragment) GN=PTH2R PE=4 SV=1 H7C0B0 42 1.33 0.051258 1.50 0.042012
524 9.34 19 Keratin, type I cytoskeletal 10 OS=Homo sapiens GN=KRT10 PE=1 SV=6 P13645 286 1.55 0.046182 1.35 0.033418
535 7.96 19 Proteasome subunit beta type-2 GN=PSMB2 PE=1 SV=1 P49721 82 1.08 0.532056 1.23 0.023404
563 5.99 18 Actin, cytoplasmic 2, N-terminally processed (Fragment) OS=Homo sapiens GN=ACTG1 PE=4 SV=1 I3L1U9 137 1.83 0.030927 2.47 0.045743
572 5.80 18 Ferritin light chain GN=FTL PE=1 SV=2 P02792 74 1.23 0.067410 1.15 0.032468
592 4.38 17 Myosin regulatory light chain 12B OS=Homo sapiens GN=MYL12B PE=1 SV=2 O14950 320 1.19 0.047581 1.07 0.040693
605 5.64 16 ATP synthase subunit alpha OS=Homo sapiens GN=ATP5A1 PE=2 SV=1 A8K092 95 1.27 0.038451 1.17 0.058813
627 5.47 15 Annexin GN=ANXA6 PE=3 SV=1 E5RIU8 103 1.44 0.150651 1.38 1.040693
640 5.12 15 Actin, cytoplasmic 1, N-terminally processed GN=ACTB PE=3 SV=1 G5E9R0 193 1.06 0.915841 1.19 0.402444
644 7.71 15 Keratin, type I cytoskeletal 10 GN=KRT10 PE=1 SV=6 P13645 47 1.00 0.214532 1.20 2.040693
650 7.26 15 Heterogeneous nuclear ribonucleoprotein A1 (Fragment) GN=HNRNPA1 PE=4 SV=1 F8W646 102 1.47 0.016761 1.52 0.913503
735 4.32 10 SH3 domain-binding glutamic acid-rich-like protein 3 GN=SH3BGRL3 PE=1 SV=1 Q9H299 349 1.29 0.029492 1.30 3.040693
744 4.40 0 Ras-related protein Rap-1b GN=RAP1B PE=2 SV=1 B4DQI8 58 1.23 0.180346 1.22 0.800487
758 5.44 38 Actin, cytoplasmic 1, N-terminally processed GN=ACTB PE=2 SV=1 B4E335 624 1.01 0.956162 1.15 4.040693
816 7.47 78 Actin, cytoplasmic 1, N-terminally processed GN=ACTB PE=2 SV=1 B4E335 454 1.50 0.158319 1.44 0.576176
878 6.90 10 Protein S100-A11 OS=Homo sapiens GN=S100A11 PE=1 SV=2 P31949 246 1.60 0.037919 1.19 5.040693
(a) Ratio of ratios (change in SNO normalized against change in abundance) [8]
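
The exact definition of the ratio of ratios is given in Ref. [8]. The sketch below is only one plausible reading of the footnote, offered as an illustration: it assumes that the Asc+ intensity of a spot stands in for total labelable protein and the Asc- intensity for the fraction not blocked by SNO, so that the within-group Asc-/Asc+ ratio is compared between two groups. The function name, argument names, and intensities are hypothetical.

```python
def ratio_of_ratios(asc_minus_ref, asc_plus_ref, asc_minus_test, asc_plus_test):
    """Between-group ratio of the within-group Asc-/Asc+ spot intensity ratios.

    Assumption (not confirmed by the chapter): Asc+ reflects abundance and
    Asc- reflects unmodified thiols, so dividing one by the other normalizes
    the SNO-related signal against abundance within each group.
    """
    ratio_ref = asc_minus_ref / asc_plus_ref      # reference group, e.g. Group 2
    ratio_test = asc_minus_test / asc_plus_test   # test group, e.g. Group 3
    return ratio_test / ratio_ref


# Invented mean spot intensities for one spot in the two groups.
ror = ratio_of_ratios(asc_minus_ref=50.0, asc_plus_ref=100.0,
                      asc_minus_test=80.0, asc_plus_test=100.0)
print(round(ror, 2))
```

Under this reading, a value of 1 means the nitrosylation-related signal changed in step with abundance, so a spot can be significant in the SNO columns without being significant in abundance, as several rows of the table show.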

Fig. 21.5 The significant proteins found by 2DGE analysis and identified by MALDI TOF/TOF in the project are shown on the reference gel used in the experiment. The protein spot numbers are marked in the figure, and the corresponding protein names are in Table 21.8

Table 21.9 Predictive proteins identified from 2DGE analysis in the chagasic cardiomyopathy project
Spot No. Protein name Swiss prot accession
141 Serum albumin (Fragment) H0YA55
273 Isoform 2 of Fibrinogen alpha chain P02671-2
339 Tubulin beta chain Q5JP53
385 Vimentin F5H288
389 Annexin A3 P12429
650 Heterogeneous nuclear ribonucleoprotein A1 (Fragment) F8W646
735 SH3 domain-binding glutamic acid-rich-like protein 3 Q9H299

References
1. Glanz SA (2005) Primer of biostatistics. McGraw-Hill, New York
2. (2005) Special issue: exploring the human plasma proteome. The HUPO Plasma Proteome Project (HPPP). Proteomics 5:3223–3549
3. Richter R, Schulz-Knappe P, Schrader M, Standker L, Jurgens M, Tammen H, Forssmann W-G (1999) Composition of the peptide fraction in human blood plasma: database of circulating human peptides. J Chromatogr B Biomed Sci Appl 726:25–35
4. Villanueva J, Martorella AJ, Lawlor K, Philip J, Fleisher M, Robbins RJ, Tempst P (2006) Serum peptidome patterns that distinguish metastatic thyroid carcinoma from cancer-free controls are unbiased by gender and age. Mol Cell Proteomics 5:1840–1852
5. Pretzer E, Wiktorowicz JE (2008) Saturation fluorescence labeling of proteins for proteomic analyses. Anal Biochem 374:250–262
6. Tyagarajan K, Pretzer EL, Wiktorowicz JE (2003) Thiol-reactive dyes for fluorescence labeling of proteomic samples. Electrophoresis 24:2348–2358
7. Miseta A, Csutora P (2000) Relationship between the occurrence of cysteine in proteins and the complexity of organisms. Mol Biol Evol 17:1232–1239
8. Wiktorowicz JE, Stafford S, Rea H, Urvil P, Soman K, Kurosky A, Perez-Polo JR, Savidge TC (2011) Quantification of cysteinyl S-nitrosylation by fluorescence in unbiased proteomic studies. Biochemistry 50:5601–5614
9. Yao X, Freas A, Ramirez J, Demirev PA, Fenselau C (2001) Proteolytic 18O labeling for comparative proteomics: model studies with two serotypes of adenovirus. Anal Chem 73:2836–2842
10. Miyagi M, Rao KCS (2007) Proteolytic 18O-labeling strategies for quantitative proteomics. Mass Spectrom Rev 26:121–136
11. Khedr A, Hegazy M, Kamal A, Shehata MA (2015) Profiling of esterified fatty acids as biomarkers in the blood of dengue fever patients using a microliter-scale extraction followed by gas chromatography and mass spectrometry. J Sep Sci 38:316–324
12. Poole-Smith BK, Gilbert A, Gonzalez AL, Beltran M, Tomashek KM, Ward BJ, Hunsperger EA, Ndao M (2014) Discovery and characterization of potential prognostic biomarkers for dengue hemorrhagic fever. Am J Trop Med Hyg 91:1218–1226
13. Thayan R, Huat TL, See LL, Tan CP, Khairullah NS, Yusof R, Devi S (2009) The use of two-dimension electrophoresis to identify serum biomarkers from patients with dengue haemorrhagic fever. Trans R Soc Trop Med Hyg 103:413–419
14. Lee CY, Seet RC, Huang SH, Long LH, Halliwell B (2009) Different patterns of oxidized lipid products in plasma and urine of dengue fever, stroke, and Parkinson’s disease patients: cautions in the use of biomarkers of oxidative stress. Antioxid Redox Signal 11:407–420
15. Brasier AR, Garcia J, Wiktorowicz JE, Spratt HM, Comach G, Ju H, Recinos A 3rd, Soman K, Forshey BM, Halsey ES, Blair PJ, Rocha C, Bazan I, Victor SS, Wu Z, Stafford S, Watts D, Morrison AC, Scott TW, Kochel TJ (2012) Discovery proteomics and nonparametric modeling pipeline in the development of a candidate biomarker panel for dengue hemorrhagic fever. Clin Transl Sci 5:8–20
16. Brasier AR, Zhao Y, Wiktorowicz JE, Spratt HM, Nascimento EJM, Cordeiro MT, Soman KV, Ju H, Recinos A, Stafford S, Wu Z, Marques ETA, Vasilakis N (2015) Molecular classification of outcomes from dengue virus 3 infections. J Clin Virol 64:97–106
17. WHO (2002) Control of chagas’ disease. Second report of the WHO Expert Committee. WHO Tech Rep Ser 905:1–109, Geneva
18. Duran-Rehbein GA, Vargas-Zambrano JC, Cuellar A, Puerta CJ, Gonzalez JM (2014) Mammalian cellular culture models of Trypanosoma cruzi infection: a review of the published literature. Parasite 21:38
19. Wen JJ, Dhiman M, Whorton EB, Garg NJ (2008) Tissue-specific oxidative imbalance and mitochondrial dysfunction during Trypanosoma cruzi infection in mice. Microbes Infect 10:1201–1209
20. Gupta S, Wen JJ, Garg NJ (2009) Oxidative stress in chagas disease. Interdiscip Perspect Infect Dis 190354
21. Savidge TC, Urvil P, Oezguen N, Ali K, Choudhury A, Acharya V, Pinchuk I, Torres AG, English RD, Wiktorowicz JE, Loeffelholz M, Kumar R, Shi L, Nie W, Braun W, Herman B, Hausladen A, Feng H, Stamler JS, Pothoulakis C (2011) Host S-nitrosylation inhibits clostridial small molecule-activated glucosylating toxins. Nat Med 17:1136–1141
22. Sheffield-Moore M, Wiktorowicz JE, Soman KV, Danesi CP, Kinsky MP, Dillon EL, Randolph KM, Casperson SL, Gore DC, Horstman AM, Lynch JP, Doucet BM, Mettler JA, Ryder JW, Ploutz-Snyder LL, Hsu JW, Jahoor F, Jennings K, White GR, McCammon SD, Durham WJ (2013) Sildenafil increases muscle protein synthesis and reduces muscle fatigue. Clin Transl Sci 6:463–468

22 Statistical Approaches to Candidate Biomarker Panel Selection
Heidi M. Spratt and Hyunsu Ju

Abstract
The statistical analysis of robust biomarker candidates is a complex
process, and is involved in several key steps in the overall biomarker
development pipeline (see Fig. 22.1, Chap. 19). Initially, data visualiza-
tion (Sect. 22.1, below) is important to determine outliers and to get a feel
for the nature of the data and whether there appear to be any differences
among the groups being examined. From there, the data must be
pre-processed (Sect. 22.2) so that outliers are handled, missing values
are dealt with, and normality is assessed. Once the processed data has been
cleaned and is ready for downstream analysis, hypothesis tests (Sect. 22.3)
are performed, and proteins that are differentially expressed are identified.
Since the number of differentially expressed proteins is usually larger than
warrants further investigation (50+ proteins versus just a handful that will
be considered for a biomarker panel), some sort of feature reduction
(Sect. 22.4) should be performed to narrow the list of candidate
biomarkers down to a more reasonable number. Once the list of proteins
has been reduced to those that are likely most useful for downstream
classification purposes, unsupervised or supervised learning is performed
(Sects. 22.5 and 22.6, respectively).

Keywords
Candidate biomarker selection • Data inspection • Data consistency •
Outlier detection • Data normalization • Data transformations • Data
clustering • Machine learning

H.M. Spratt (*) • H. Ju, Ph.D
The University of Texas Medical Branch, 301 University Blvd, Galveston, TX 77555-1148, USA
e-mail: hespratt@utmb.edu; hsjuser@gmail.com

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_22

The statistical analysis of robust biomarker candidates is a complex process, and is involved in several key steps in the overall biomarker development pipeline (see Fig. 22.1, Chap. 19). Initially, data visualization (Sect. 22.1, below) is
important to determine outliers and to get a feel for the nature of the data and whether there appear to be any differences among the groups being examined. From there, the data must be pre-processed (Sect. 22.2) so that outliers are handled, missing values are dealt with, and normality is assessed. Once the processed data has been cleaned and is ready for downstream analysis, hypothesis tests (Sect. 22.3) are performed, and proteins that are differentially expressed are identified. Since the number of differentially expressed proteins is usually larger than warrants further investigation (50+ proteins versus just a handful that will be considered for a biomarker panel), some sort of feature reduction (Sect. 22.4) should be performed to narrow the list of candidate biomarkers down to a more reasonable number. Once the list of proteins has been reduced to those that are likely most useful for downstream classification purposes, unsupervised or supervised learning is performed (Sects. 22.5 and 22.6, respectively).

Fig. 22.1 Histograms for IP_10 Cytokine data. Dengue Fever is on the top; Dengue Hemorrhagic Fever is on the bottom

The statistical analysis of proteomics data to identify candidate biomarkers and ultimately, the development of predictive models is a complex, multi-step and iterative process. Candidate biomarker selection requires involvement by dedicated statisticians and bioinformaticians with in-depth knowledge of experimental design, insight about how experimental data was generated, as well as a grasp of the types of data structures that the proteomics experiment generated. For these reasons, analysts should be involved in the biomarker study design from the very beginning. Doing so also allows them to obtain a better understanding of the resultant data and any nuances associated with them. Further, they can also assist with experimental details to ensure that the proper analyses can be performed at the end of the experiment. Such an appreciation of the data obtained helps drive strategies for handling outliers or missing data, the pre-processing approaches frequently necessary when working with omics data, and the appropriate selection of hypothesis tests for analyzing the data.

The goal of learning methods is to classify the samples into two or more groups based on a subset of proteins that are most useful for distinguishing between the groups. This subset of proteins is commonly referred to as candidate biomarkers for the classification. The result of supervised learning is a variable importance list that ranks those proteins which are most likely to
separate one group of interest from another. This variable importance list is ordered by each protein’s ability to discriminate one group from another. In order for a classification task to generalize to samples outside the initial discovery samples, some sort of resampling needs to be employed (Sect. 22.7). Resampling techniques can be as simple as setting aside a separate sample set to validate the performance of the classification algorithm, or can involve cross-validation techniques where some of the discovery data are left out of the training and are instead used for testing the trained model. Additionally, methods exist for assessing the ability of a supervised learning algorithm to correctly classify samples from each of the groups of interest. Examining the prediction success or receiver operating characteristic (ROC) curves gives the user a feel for how well the classification algorithm performs (Sect. 22.8). Ideally, for a biomarker panel that can be used to distinguish one group from another, the classification algorithm should be able to predict class identity just as well on the testing dataset as on the training dataset.

The end result of the biomarker discovery pipeline mentioned above is a list of candidate biomarkers that can be used to distinguish a future sample as belonging to a particular group. However, the experimentation/data analysis process does not end with the creation of a predictive model. This is just the initial discovery phase where a candidate biomarker panel has been identified, and subjected to qualification using independent quantitative proteomics measurements. The next phase is the verification phase, wherein the same biomarker panel has to prove a successful predictor in an independent dataset. This step is critical to the survival of a biomarker panel for further study, by demonstrating the ability of said biomarker panel to generalize to additional samples. Once the biomarker panel has been verified in an independent dataset, further downstream steps can be taken to validate (Chap. 19: Introduction), produce, and market a diagnostic test based on the discovered biomarker panel.

22.1 Data Inspection/Visualization

Proteomics data typically have a high degree of variability, due both to biological variability from one sample to another and technical variability relating to the technology used, as well as to inherent differences between proteins (e.g., isoforms and post-translational modifications). In addition, proteomics experiments are frequently performed on small sample sizes (less than ten samples per group). The resultant data typically contains over 1000 variables, which results in a wide data set: one that has small n (sample size) and large p (number of variables).

The first step in working with any data set should be data inspection and/or data visualization. This process involves checking the data for consistency of type, examining the dataset for missing values or outliers, as well as graphically displaying the data to better understand the nature and behavior of the various observations.

22.1.1 Data Consistency

Checking the data for consistency involves examining the values present for each individual variable. If the data is supposed to be numeric, one should check that all the values are actually numbers, and that there are no textual strings present. Bioplex cytokine assay data frequently are returned from the instrument with values such as “OOR<”. It is up to the data analyst to determine what this value represents (here, an actual value exists but lies below the limit of detection of the instrument), and what to replace this value with. We will discuss data replacement in following sections. An example of this is presented in Table 22.1. If the data is supposed to be positive values only, do any of the columns have negative values? This can be easily checked simply by calculating the minimum for all variables. Another way to check data consistency is to make sure that the data is matched correctly
by subject, if matching is mentioned in the study design. Matching is a statistical technique where members of one group are “matched” to members of another group with regards to possibly confounding variables in order to minimize the effect the confounder has on the treatment effect. If the study design suggests that individuals will be matched for gender and age, then the data analyst should verify that males are matched with males, females are matched with females, and individuals with a similar age are matched to other individuals within that same age range. If any data consistency issues are present, they should be corrected before any type of analysis is performed. Doing so often involves communication with the PI as well as with the technical staff that generated the dataset.

Table 22.1 Sample of initial data file

Sample No.   IP-10      MIP-1a    TNF-a      VEGF       TRAIL
1            36800.84   718.23    28017.48   44634.68   21562.09
2            13247.18   2675.18              10569.1    5360.15
3            2682.51    OOR >     5006.67    2790.2     1359.8
4            10.28      5.4       18.75      9.04       1.9
5            3.34       1.57      5.33       3.37       2.39
6            0          *5.80     *7.11      167.62     OOR <

Table 22.1 demonstrates several examples of data inconsistencies. For instance, the last value in the IP-10 column is a 0. This was a value that was initially missing, but the researcher changed all missing values to 0 for that cytokine. MIP-1a has two issues. The first is a value of “OOR >” (out of range, high) when a numeric value is expected. The second is a value of *5.80 when a strict numeric is expected. The analyst needs to determine how best to handle such instances, often in consultation with the lab performing the experiment or the PI of the project. TNF-a has a true missing value as well as an “*” that needs to be dealt with. VEGF has a negative value when only positive values are possible. Thus, further investigation is needed as to why the negative value is present. Lastly, TRAIL has a value of “OOR <” (out of range, low) that needs to be properly handled.

22.1.2 Missing Values/Outlier Detection

Dealing with missing values and outliers presents many challenges for the data analyst. Frequently, basic science experimentalists will replace a missing value with a value of 0. This value of 0 can have many different definitions. For instance, a value of 0 might indicate a plausible, real value, but one that fell below the detection limit of the instrument. Instead of placing a value for that particular data point, a researcher might opt to call it a 0. How the analyst handles a 0 value depends on the true meaning of that 0 value. In other situations, the researcher might opt to replace a missing value with a 0 value. This is done because some software packages are unable to handle missing values, or because the researcher thinks missing values make the dataset look ugly. Thus, it is important to determine if any such substitutions have been made within a given dataset. If multiple 0 values are observed, the PI or research technician should be consulted to determine the true meaning of these 0 values.

Common ways to deal with missing values include simply leaving those samples out of the data analysis, data imputation, or choosing analysis methods that ignore missing values (such as those mentioned in Sects. 22.6.2, 22.6.3, and 22.6.4). Several methods exist for data imputation, which will be discussed in the following section.

Another common issue in proteomics data sets, as well as other omics data sets, is the presence of outliers. Outliers are individual data points, large or small, that lie further from the
majority of the data than would ordinarily be expected and can have an exaggerated influence on the fit of a given algorithm. An outlier may indicate a bad data point: one that results from improper coding, or possibly an experiment gone awry. Outliers heavily influence descriptive statistics such as the mean, as well as impacting the types of hypothesis testing that can validly/reliably be performed on the data. Thus, their detection is an important step in the analysis pipeline. A simple method for detecting outliers is the creation of a boxplot, discussed in the next section, which will graphically display the absence/presence of any outliers. Specifically, a data value is often said to be an outlier if it lies more than 1.5 × IQR (interquartile range) below the 25th percentile or above the 75th percentile of the dataset. If the outlier is the result of improper data entry, its value should be easily corrected. If the outlier is the result of an experiment gone awry, then the value should be removed from the dataset and treated as a missing value. If, however, the cause of the outlier cannot be attributed to either of those two instances, the value must remain in the dataset and appropriate procedures should be utilized downstream, i.e. ones that are robust to the presence of outliers.

22.1.3 Graphical Methods

Many types of graphical methods exist to display proteomics data. These include histograms, boxplots, scatterplots, and quantile-quantile plots (also known as q-q plots), among others. Plots can tell one about the presence of outliers within the data, about possible relationships amongst variables within the dataset, about the validity of certain hypothesis test assumptions (such as whether the data is normally distributed), or about possible differences between groups.

Histograms arrange the data points into bins of equal width, where the height of each bar represents the number or proportion of data points that lie within each bin. Histograms are useful for determining whether there are outliers within the data (are there single bins which are separated by many empty bins from the rest of the data?), as well as giving a feel for the shape/distribution of the data. One can visually determine if the data is symmetric (possibly normally distributed) or skewed (not normally distributed). A skewed distribution is one that is not symmetrical, but rather has a long tail in one direction. Example histograms are presented in Fig. 22.1. These histograms represent Bioplex cytokine assay values for IP-10 for patients with Dengue Fever (DF) and Dengue Hemorrhagic Fever (DHF), taken from our NIAID Clinical Proteomics Center Dengue Fever (CPC) project (see Chap. 20 for description). These histograms represent data that is highly skewed, as the shape of the histogram is not symmetric, but rather shifted towards the left. In addition, outliers are present within the DHF subjects as there are three bins that are separated from the rest of the data.

Boxplots show the shape of the distribution, the central value of a dataset, and the variability within the dataset, by displaying the median, the interquartile range (IQR), as well as any potential outliers. As the name implies, the graphs have a box shape. The middle 50 % of the data is displayed within the central rectangle, the median value is frequently displayed as a line within the central rectangle, and whiskers are displayed above and below the central rectangle, representing the range up to some multiple of the IQR (commonly 1.5) beyond the edges of the box. The upper hinge (edge) of the box indicates the 75th percentile of the data, and the lower hinge (edge) of the box indicates the 25th percentile of the data. In addition, individual outliers are usually displayed as stars or circles on the plot. If no outliers are present, the ends of the whiskers represent the largest and smallest value within a dataset. An example of a boxplot is presented in Fig. 22.2. Like histograms, boxplots can be used to assess the distribution of a given dataset. For data that is symmetric (and thus possibly normally distributed), the median line will lie roughly in the center of the rectangle. In addition, the whisker above the rectangle will be roughly the same length as the whisker below the rectangle. For skewed data (and thus data that is probably not normally distributed), the median line will lie much closer to the top or bottom of the rectangle than the middle, and the whiskers above and
below the rectangle will not be the same length. Boxplots can also be used to examine if there are differences between two or more groups by looking for overlapping rectangles when multiple boxplots are drawn on the same plot.

Fig. 22.2 Boxplots for IP_10 Cytokine data. Dengue Fever is on the left; Dengue Hemorrhagic Fever is on the right

The example in Fig. 22.2 represents IP-10 Bioplex cytokine assay data for the Dengue Fever project (the same as is shown for the histogram in Fig. 22.1). The value of IP-10 for patients with Dengue Fever (DF) is shown in the left boxplot, and the value of IP-10 for patients with Dengue Hemorrhagic Fever (DHF) is shown in the right boxplot. Both of these groups have outliers, which are represented by the circles and the stars (figure created using SPSS v20). In addition, both of the boxplots represent data that is skewed as the median line is not in the middle of the rectangle. The whisker above both boxes is also much longer than the whisker below each box.

A scatterplot is a graphical representation of how two different variables relate to each other. The values for one variable are plotted along the x-axis, while the values for another variable are plotted along the y-axis. Scatterplots are useful for detecting a correlation between two variables, as well as whether there appears to be some sort of functional relationship between the two variables. If the two variables are correlated, there will be an obvious trend in the location of the dots on the scatterplot. Scatterplots can be used to determine if the functional relationship is a linear one, a quadratic one, a logarithmic one, or one of many different types of functions. This is extremely useful if some sort of modeling is to be done later on.

Another mechanism for aiding with determining whether the dataset is normally distributed is the q-q plot. The q-q plot is a special scatterplot. As the name implies, it is one that uses the quantiles of the data to create the plot. Along the x axis are the quantiles of the experimental data. Along the y axis are the quantiles of a specified distribution that has the same mean and standard deviation as the experimental data. The specified distribution is a normal distribution if the assumption of normality is being assessed. If all the data points lie on the line y = x, then the experimental data perfectly matches data that is normally distributed. If, however, there is deviation away from such a straight line, some amount of skewness is present.

Figure 22.3a, b represent q-q plots for the IP-10 Bioplex cytokine assay data discussed above. The shape of both of these plots is very non-linear, which indicates again that the data is highly skewed. Ideally, we would like all of the points to lie on the straight diagonal line, which would mean the experimental data exactly follows a normal distribution.
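The consistency checks of Sect. 22.1.1 and the boxplot rule for outliers can be sketched in a few lines of Python. This is a minimal illustration and not the procedure used for the Dengue data: the function names, the flag labels, and the treatment of a leading "*" are assumptions made for the example, and the quartiles come from Python's default `statistics.quantiles` method, which can differ slightly from the hinges reported by SPSS or other packages.

```python
import statistics


def classify_cell(raw):
    """Return (value, flag) for one raw text cell of an assay export.

    The flag names are hypothetical labels chosen for this sketch.
    """
    text = raw.strip().replace(" ", "")
    if text == "":
        return None, "missing"
    if text == "OOR<":
        return None, "below_range"
    if text == "OOR>":
        return None, "above_range"
    has_asterisk = text.startswith("*")
    try:
        value = float(text.lstrip("*"))
    except ValueError:
        return None, "not_numeric"
    if value < 0:
        return value, "negative"
    if value == 0:
        # A 0 may be a real value or a stand-in for a missing one.
        return value, "zero_needs_review"
    return value, ("asterisk" if has_asterisk else "ok")


def tukey_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Values lying outside the fences, in their original order."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Cells resembling the last row of Table 22.1.
    for cell in ["0", "*5.80", "*7.11", "167.62", "OOR <"]:
        print(repr(cell), classify_cell(cell))
    # Made-up numbers with one value far from the rest.
    print(find_outliers([1, 2, 3, 4, 5, 6, 7, 100]))
```

Every flag other than "ok" marks a cell that needs a decision from the analyst, ideally in consultation with the laboratory; the sketch only locates such cells and does not replace them.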
Fig. 22.3 (a) Q-Q plot of IP-10 cytokine data for Dengue Hemorrhagic Fever, (b) Q-Q plot of IP-10 cytokine data for
Dengue Fever
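The coordinates of a normal q-q plot such as those in Fig. 22.3 can be computed directly. The sketch below follows the description in Sect. 22.1.3: the x value is a sorted data point and the y value is the matching quantile of a normal distribution with the same mean and standard deviation. The plotting position (i - 0.5)/n and both function names are assumptions made for this example; statistical packages use several slightly different conventions.

```python
import statistics


def normal_qq_points(values):
    """Return (sample quantile, normal quantile) pairs for a q-q plot."""
    data = sorted(values)
    n = len(data)
    # Reference normal distribution with the sample's mean and SD.
    reference = statistics.NormalDist(statistics.mean(data),
                                      statistics.stdev(data))
    points = []
    for i, x in enumerate(data, start=1):
        p = (i - 0.5) / n  # plotting position (assumed convention)
        points.append((x, reference.inv_cdf(p)))
    return points


def qq_max_deviation(values):
    """Largest distance from the line y = x, in units of the sample SD."""
    sd = statistics.stdev(values)
    return max(abs(x - y) for x, y in normal_qq_points(values)) / sd


if __name__ == "__main__":
    symmetric = [1, 2, 3, 4, 5]
    skewed = [1, 2, 3, 4, 100]  # made-up data with a long right tail
    print(qq_max_deviation(symmetric))
    print(qq_max_deviation(skewed))
```

Plotting the pairs with any graphics package reproduces the usual q-q plot; the deviation summary is only a rough numerical companion to the visual check and is not a formal test of normality.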

22.2 Pre-processing

Data pre-processing methods refer to the addition, deletion, or transformation of the proteomics data in some fashion before downstream analysis is performed. Pre-processing of the data is a critical step to ensure that the results obtained from statistical testing are both valid and correct. How the data is pre-processed can sometimes have a dramatic effect on the output of the model creation process. Some procedures, such as classification and regression trees (Sect. 22.6.2), are fairly insensitive to data pre-processing, while other methods such as logistic regression are not [18]. Pre-processing includes dealing with outliers and missing values through techniques such as imputation as well as normalizing or transforming the dataset to meet hypothesis testing and/or modeling assumptions.

22.2.1 Missing Values/Imputation

Missing values within a dataset present an important, and often overlooked, challenge to downstream data analysis. The reasons for the missing data might bias the results, so the underlying mechanism needs to be considered when determining the most appropriate method for handling missing values. Rubin [25] defines three types of mechanisms that cause data to be missing: data missing completely at random, data missing at random, and data missing not at random. Data missing completely at random means the missing values are truly just randomly missing (and thus ignorable). What this means is that there is no relationship between the value that is missing and either the observed variables or the unobserved parameters of interest. This is the easiest mechanism for a data value to be missing and results in unbiased data analysis. Data missing at random (but not completely) happens when the probability of a missing value depends on some observed values but does not depend on any data that has not been observed (or the group assignment). Unfortunately, missing at random has a somewhat confusing name, as it does not mean missing completely at random, which is what the name implies. Data that is missing not at random means the probability of a missing value depends on the variable that is missing. This type of “missingness” often arises in survey analysis where the respondent fails to answer a question because of the nature of the question (e.g., income level).

Some analytic techniques (such as those that deal with repeated measurements on the same sample) require that there are no missing values. Several methods exist that will allow the end user
to overcome missing data. The first is to simply remove the sample which resulted in the incomplete data. While this is the simplest approach, it is often not preferred since the sample size (which is often small to begin with) will be reduced. Other methods include some form of data imputation, where missing values are substituted with appropriate imputed values. Common imputation methods include single imputation methods such as data replacement with a set value, data replacement with a mean or median, simple regression, as well as model-based methods such as multiple imputation and maximum likelihood [19].

Deletion can occur in one of two ways: list-wise deletion (complete case elimination, as mentioned above) or pairwise deletion. List-wise deletion results in removing an entire observation from the dataset. The advantage of this method is that it is simple, and one can compare the results of one variable to that of another as the dataset is the same for all variables. The disadvantages include a reduction in the power of the analysis since the size of the dataset has been reduced, and a loss of generality, since downstream analysis is not based on all the information collected. Pairwise deletion involves just removing the data that is missing for a given variable, and leaving that subject in the analysis of other variables where information is present. The advantages of pairwise deletion are that this method uses all the data that has been collected for the analysis of a given variable and keeps as many cases/samples as possible. The disadvantage is that it becomes difficult to compare the results from one variable to another as the same samples were not used to generate all results.

Single imputation methods are fairly straightforward. For cytokine data, replacement of character data (such as OOR > or OOR <) is often done with ten times the largest value observed in the dataset or one tenth the lowest value observed within the dataset, respectively. This is because the definition of OOR > is “Out-of-Range High”, which means that a numeric value should exist for that data point, but said value is above the detection range of the instrument. The approach is similar for OOR < values.

Imputation using the mean or median value replaces any missing value for a given variable with either the average or the median value for that variable. The disadvantages to using this method are that the overall variability for that variable will be reduced and the correlation/covariance estimates are also weakened. Simple regression involves replacing the missing data with that obtained from fitting a regression equation to the remaining data. This method works well if there is more than one variable of interest. The advantage of this method is that it uses information from all the data that is obtained. The disadvantages are that the overall measure of variability within the dataset is diminished and that the model fit and correlation estimates will likely be better than if a value had been initially obtained.

Model-based imputation methods include multiple imputation and maximum likelihood. The maximum likelihood method does not impute any data, but rather uses each case's available data to compute maximum likelihood estimates [9]. The maximum likelihood estimate of a parameter is the value of the parameter that is most likely to have resulted in the observed data. The likelihood is computed separately for those cases with complete data on some variables and those with complete data on all variables. These two likelihoods are then maximized together to find the estimates. Like multiple imputation, this method gives unbiased parameter estimates and standard errors. One advantage is that it does not require the careful selection of variables used to impute values that the multiple imputation method requires.

An additional imputation method is K-nearest neighbors (KNN) imputation [1]. With this technique, the k nearest neighbor algorithm is utilized where proteins behave as the neighbors and the distance between proteins is based on the correlation between two proteins.

22.2.2 Normalization

The most common form of data pre-processing is what is commonly known as normalizing (or standardizing) the data. This process centers
and rescales the data. To center the data for a given variable, each value has the overall variable mean subtracted from the original value. To scale the data variable, the centered data is then divided by the standard deviation of each data variable. Centering results in the data variable having a mean of zero, and scaling results in the variable having a standard deviation of one. Normalizing the data is commonly done to improve the numeric stability of some classification techniques. Support Vector Machine modeling (Sect. 22.6.5) is one technique that requires the data to be normalized before analysis. Principal Component Analysis is another technique that benefits from centering and scaling the data. The downside to centering and scaling data is that the data are no longer in their original units, so it is sometimes challenging to interpret the normalized computer output. However, simple arithmetic manipulations can convert the model fit back to the original scale.

22.2.3 Transformations

Data transformations are frequently used when the data appear skewed. The most common forms of data transformations are the logarithmic transformation, the square root transformation, or the inverse transformation. Many statistical hypothesis tests have an underlying assumption that the data be normally distributed, the variances be equal (homoscedasticity), or both. Biological data, which are often skewed, frequently do not meet these assumptions. Data transformations often make the assumptions more valid. To perform a data transformation, simply take the logarithm (any base will do as long as the chosen base is consistent from one data value to another) of every entry for a given variable. Likewise, the square root of the data value or the inverse can be taken as well. The goal is to help transform the proteomic data into a dataset that is not skewed or one that has similar variance between two or more groups of interest. Both of these concepts are important assumptions for parametric testing mentioned in Sect. 22.3.1. Once the transformation has been applied to every data point, the transformed data should be used for downstream analysis. It is important to check whether the transformation helped make the data appear more normally distributed. Valid methods for examining normality of the data include checking boxplots as well as q-q plots, both of which were described in Sect. 22.1.3. If the data contain multiple groups (such as Dengue Fever vs Dengue Hemorrhagic Fever, or Chagas disease with cardiomyopathy versus Chagas disease without cardiomyopathy, Chap. 20), each group should be assessed for symmetry individually. The logarithmic transformation frequently works well for intensity data (such as that from 2D gel or cytokine experiments) while the square root transformation works well for count data.

22.3 Hypothesis Testing

While exploratory techniques are an important component to guide investigators to promising hypotheses about mechanisms and structure, the classical techniques for inference such as hypothesis testing and confidence interval construction provide a useful and generally accepted metric for validating or rejecting hypotheses of interest. Built on the classical adversarial construction of proof against a null hypothesis of no discovery, hypothesis testing provides researchers with a way to summarize and quantify evidence that is generally invariant across most fields of science.

Statistical hypothesis tests fall into two categories: parametric and nonparametric. As mentioned above in Sect. 22.2.3, parametric tests require extra assumptions for their validity. These assumptions are that the data come from a simple random sample, are normally distributed, and also that the variances are homogeneous. If the normality or variance assumptions are violated, parametric tests are not appropriate for a dataset. Nonparametric techniques have no assumptions about the distribution of the data. However, they do require randomness of the data and independence of the samples.
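Because the choice between parametric and nonparametric tests depends on how the data look after pre-processing, it can help to see the steps of Sects. 22.2.2 and 22.2.3 in code. The sketch below applies a base-2 logarithm, centers and scales the result, and uses a moment-based skewness coefficient as a rough numerical check of symmetry. The function names and the example intensities are invented for illustration, and the skewness coefficient is an addition of this sketch; it complements the boxplots and q-q plots recommended above and does not replace them.

```python
import math
import statistics


def log2_transform(values):
    """Base-2 logarithm of every entry; all values must be positive."""
    return [math.log2(v) for v in values]


def standardize(values):
    """Center to mean 0 and scale to standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def skewness(values):
    """Moment-based skewness: about 0 for symmetric data,
    positive when the long tail points to the right."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    # Made-up intensities that double from one sample to the next.
    intensities = [2, 4, 8, 16, 32, 64, 128]
    logged = log2_transform(intensities)
    z_scores = standardize(logged)
    print("skewness before:", skewness(intensities))
    print("skewness after: ", skewness(logged))
    print("mean and SD of z-scores:",
          statistics.mean(z_scores), statistics.stdev(z_scores))
```

When several groups are compared, each group would be checked separately, and whichever transformation is chosen would be applied to every group.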
22.3.1 Parametric Tests

Parametric tests include one-sample t-tests, two-sample t-tests, paired t-tests, and multiple
versions of analysis of variance (ANOVA). One-sample t-tests are the simplest form of
hypothesis test. They are used when researchers want to determine whether a parameter (such as
the mean) of a variable matches a published value. The null hypothesis is that the mean of the
obtained variable is equal to the published or hypothesized value, and the alternative
hypothesis is that the mean of the observed variable is not equal to the null value. This type
of test is most often used when researchers are new to a specific technique or instrument and
want to check that they are performing the experiment properly or that they have used the
correct settings for an instrument. As the name implies, the values of only one condition are
being measured. In other words, only a control sample is examined.

The two-sample t-test is an extension of the one-sample t-test. Instead of just one group being
compared to some known value, the objective of a two-sample t-test is to look for differences
between two groups: typically a control sample and an infected sample. Here, the two samples
should be sampled independently from each other. This means that the samples are not matched or
related in some fashion. An additional assumption for the two-sample t-test to be valid is that
the variance of group 1 (e.g., controls) is similar to the variance of group 2 (e.g., infected).
If this assumption is violated, there is a version that accounts for unequal variances, called
Welch’s correction, which can be used instead. For either form of the t-test, each group needs
to be checked for normality to meet that assumption. Thus, one should check whether the control
samples are normally distributed and separately check whether the infected/treated samples are
also normally distributed. If either group violates the normality assumption, a parametric test
may not be appropriate. However, a transformation should typically be attempted before
reverting to a nonparametric test. If the transformation is applied to one group, that same
transformation must be applied to all other groups. The transformed data should then be checked
for normality. If the transformation makes the samples look more normally distributed, then the
transformed data (not the raw data) should be analyzed via Student’s t-test. For the Dengue
Fever example, the data were transformed using log base 2 and then analyzed using the
two-sample t-test with Welch’s correction. The results indicate that 107 protein spots are
differentially abundant between the DF and DHF samples at a significance level of 0.05. These
107 spots will be the input for the examples in Sects. 22.6.2, 22.6.3, 22.6.4, 22.6.5, 22.6.6,
and 22.6.7.

Paired t-tests are similar to two-sample t-tests; however, instead of the two samples being
independent, they are required to be dependent (matched). This means either that they have been
matched to account for possible confounding factors such as age, gender, race, etc., or that
the samples are from the same patient over time, such as pre- and post-measurements around the
administration of some drug. Just as the previous t-tests assume that the data are normally
distributed, the paired t-test does as well. However, when checking for normality with paired
data, the difference between groups is assessed instead of the normality of each group
separately. This is because the formula for the test statistic is based on the difference
instead of each group individually. Thus, to determine if paired data are normally distributed,
the pre measurement would be subtracted from the post measurement and that value would be
plotted on a q-q plot.

Analysis of Variance (ANOVA) techniques are used when there are more than two groups being
compared or there are multiple factors being investigated. There are many forms of ANOVA,
including one-way ANOVA (three or more groups being compared at once), two-way ANOVA (at least
two factors with at least two levels each), and repeated measures or mixed-model ANOVA (in
which at least one factor has multiple measurements on the same individual over time). The
basic premise for ANOVA is that
22 Statistical Approaches to Candidate Biomarker Panel Selection 473

instead of mean values being considered, the amount of variability both within a group and
between groups is being compared. For one-way ANOVA, a single null hypothesis is examined: that
there is no difference among the group means. For two-way ANOVA, multiple null hypotheses are
examined: that there is no difference due to the first factor; (typically) that there is no
difference due to the second factor; and that there is no interaction between the first and
second factor.

Unlike with t-tests, where you can immediately conclude whether group 1 is significantly
different from group 2 based on the observed p-value, with ANOVA all that is known from the
initial results of the hypothesis test is that at least one group differs from the others
within a given factor. Post-hoc tests, such as Tukey’s or Dunnett’s tests, can be used to
determine exactly where the differences lie. Tukey’s post-hoc test compares all levels of a
factor to each other. Dunnett’s test, on the other hand, compares each level to a control level
only.

For example, if you are comparing a control strain of a disease to an attenuated strain and to
a virulent strain, the initial results of a one-way ANOVA will tell you that at least one of
the strains is different from the others, but the exact differences cannot be determined.
Running Tukey’s test will compare control to attenuated, control to virulent, and attenuated to
virulent, which might allow one to conclude, for instance, that only control and virulent
differ. Dunnett’s test, on the other hand, would only compare control to attenuated and control
to virulent, but would not compare attenuated to virulent (which may not be a hypothesis of
interest for some studies).

22.3.2 Nonparametric Tests

Nonparametric tests include chi-square tests, the Mann–Whitney test, the Wilcoxon Signed Rank
test, and the Kruskal-Wallis test. None of these tests requires the data to have a specific
shape. Because of the lack of assumptions, nonparametric tests should only be used if the data
are highly skewed or the variances are not homogeneous between groups. Whereas the test
statistic for a parametric test is based on the actual data values, the test statistic for a
nonparametric test is based on the ranks of the data instead, which discards some of the
information in the data. As a result, if the data meet the assumptions for using a parametric
test, such a test should be preferred over a nonparametric equivalent.

22.3.3 Multiple Hypothesis Corrections

When dealing with proteomics experiments, and other “omics” experiments as well, instead of
just testing one protein at a time, researchers typically examine many (often hundreds or
thousands of) hypotheses at a time. Doing so increases the probability of false positives: that
is, incorrectly rejecting a null hypothesis when no difference between groups exists. This is a
serious problem in many basic science experiments, and needs to be dealt with accordingly. The
method for correcting the number of false positives, and bringing the number of false positives
back to a more reasonable level, is known as multiple hypothesis correction. There are two
approaches to controlling the false positive rate when one is testing multiple hypotheses
simultaneously. They are known as the Family Wise Error Rate (FWER) and False Discovery Rate
(FDR) corrections.

The FWER is the probability of wrongly rejecting any of the null hypotheses. The most common
FWER correction is the Bonferroni correction [21], although Tukey’s test corrects for FWER in
the ANOVA setting. FWER corrections are considered to be conservative methods for controlling
for multiple hypothesis tests, and frequently result in no proteins remaining significantly
differentially abundant in a proteomics experiment.

FDR, on the other hand, seeks to control the proportion of false positives among the complete
set of rejected null hypotheses (rather than the probability of any false positives). The most
common FDR method is the Benjamini-Hochberg method [2]. FDR procedures allow

for more potential false positives than FWER methods, but they have increased power when
compared to FWER methods. As a result, FDR methods are less conservative than FWER methods, and
usually result in more proteins being significantly differentially abundant between two groups.

22.4 Feature Reduction

A major problem in mining large datasets is the “curse of dimensionality”: that is, model
efficacy decreases as more variables are added. In many omics experiments, we not only want to
learn which genes/proteins differ from one group to another, but we would also like to build a
predictive model to determine possible biomarkers for things such as disease progression or
diagnosis. However, as more and more variables are added to the model, the computational time
increases and the information gained becomes minimal. Feature reduction aims to decrease the
number of input variables to the model; it moderates the effect of the curse of dimensionality
by removing irrelevant or redundant variables or noisy data. Feature reduction has the
following positive effects: speeding up the processing time of the algorithm, enhancing the
quality of the data, increasing the predictive power of the algorithm, and making the results
more understandable.

22.4.1 Hypothesis Testing Results

One technique for reducing the number of variables to be included in predictive analysis is to
eliminate those variables which show no variable-wise significant difference between groups
without adjustment for multiple testing. This means some form of hypothesis test has been run
on the dataset, and the insignificant variables (p-value > 0.05) are removed. Frequently in
omics data analysis, removing only those variables with a p-value > 0.05 still results in a
large number (greater than 100) of variables of interest. In this case, a more restrictive
p-value cut-off is used for the downstream analysis.

22.4.2 Significance Analysis of Microarrays (SAM)

Significance analysis of microarrays (SAM) is a widely used permutation-based approach for
identifying differentially expressed genes while assessing statistical significance using false
discovery rate (FDR) adjustment in high-dimensional datasets [23]. SAM can be applied to
proteomics data since protein abundance microarrays are a high-throughput technology capable of
generating large quantities of proteomics data. The SAM algorithm compares t statistics, with
multiple hypothesis testing adjustments, to determine which hypotheses to reject so as to
minimize the number of false positives and false negatives; it does so by permuting the columns
of the protein abundance matrix. Resampling (permutation) can be used to estimate p-values,
which avoids having to model the joint distribution of the test statistics. Two-sample t-test
procedures require parametric Gaussian assumptions. An attractive feature of SAM as a multiple
testing procedure is that it does not rely on parametric assumptions and does not involve any
complex estimation procedures. SAM uses permutations (100 by default) to estimate the FDR and
computes a modified t-statistic, which measures the strength of the relationship between
protein abundance and disease outcome. It also accounts for feature-specific fluctuations in
signal and adjusts for increasing variation in features with low signal-to-noise ratios. Data
are presented as a scatter plot of expected (x-axis) vs. observed (y-axis) relative differences
between groups, where deviations from the expected relative differences that exceed a threshold
are identified and considered “significant”. The solid line indicates where the observed and
expected relative differences are identical, and the dotted lines are drawn at the threshold
delta value. The delta was chosen to minimize cross-validation error. The high-ranking features
in the SAM results are marked in red

(induced protein) and green (suppressed protein). For our CPC aspergillosis study, 110 spots
out of a total of 655 spots in the 2D-gel data were selected for differentiating case
vs. control, using 100 permutations and an FDR of 5 % for delta = 0.35. The Microsoft Excel
add-in SAM package can be used with specific option filtering. There are several options, for
example, multi-class, two-class paired, and two-class unpaired response types using the t-test,
Wilcoxon test, and analysis of variance test. The limitation of the SAM procedure is that it is
a univariate approach and cannot take into account the correlation structure between features,
as multivariate regression modeling can. An example of a SAM result for the aspergillosis data
is shown in Fig. 22.4.

22.4.3 Fold-Change

Fold change refers to the values for the control samples being divided by the obtained values
for the treated samples. If this results in a value less than one, then the inverse value is
taken and a negative sign is added. Thus, fold-change values range from −infinity to −1, and
from 1 to +infinity. A fold-change cut-off value of 2 is frequently used in proteomics
experiments when looking for differential expression. Proteins with an absolute fold-change
greater than 2 are thought to be differentially abundant between groups of interest. Thus, only
proteins that exhibit such characteristics are considered for downstream analysis. The
fold-change cut-off is sometimes increased (to 2.5 or 3) if the number of proteins that have
such a fold-change is large.

22.4.4 PCA

Principal component analysis (PCA) is useful for the classification as well as compression of a
dataset. The main goal of PCA is to decrease the dimensionality of the dataset by finding a new
set of variables, called principal components, that represent the majority of the information
present within the original dataset. The information is related to the variation present within
the original dataset and is calculated from the covariance among the original variables. The
number of important principal components is typically smaller than the initial number of
variables in the dataset. This new variable space will reduce the complexity and noise within
the dataset and

Fig. 22.4 SAM result for Aspergillosis dataset (SAM plot sheet of observed score vs. expected score; Significant: 110; False Discovery Rate: 5.00 %)

reveal hidden characteristics within the data. The principal components are uncorrelated
(orthogonal) with each other and are also ordered by the total fraction of information about
the original dataset they contain. The first principal component accounts for as much of the
variability in the original dataset as possible, and each subsequent component accounts for as
much of the remaining variability as possible. The process for determining the principal
components is based on the eigenvalues and eigenvectors of the covariance matrix. The results
are presented in the form of scores (projections of the data onto the eigenvectors) and
loadings (the eigenvector coefficients).

22.5 Unsupervised Learning

Machine learning falls into two categories of methods: those that are considered to be
unsupervised, and those that are considered to be supervised. The primary difference between
the two is what is assumed to be known at the start of the process. For unsupervised learning,
the “truth” is not assumed to be known, nor is it used in the process. “Truth”, in our context,
is knowing which group a sample belongs to. In an experiment distinguishing between Dengue
Fever and Dengue Hemorrhagic Fever, the “truth” would be which patients have Dengue Fever, and
which patients have Dengue Hemorrhagic Fever. For supervised methods, the “truth” is required
for each algorithm. Hierarchical clustering, K-means clustering, and PCA are all examples of
unsupervised learning methods.

22.5.1 Hierarchical Clustering

Hierarchical clustering seeks to group available data into clusters by the formation of a
dendrogram. Hierarchical clustering is based on two key principles: (1) members of each cluster
are more closely related to other members of that cluster than they are to members of another
cluster, and (2) elements in different clusters are further apart from each other than they are
from members of their own cluster. The process by which samples are grouped into clusters is
determined by a measure of similarity between the objects. Various measures of similarity
exist, including Euclidean distance, Manhattan (city-block) distance, and Pearson correlation.
Euclidean distance is the most commonly used measure of similarity for proteomics experiments,
but it is sensitive to outliers within the data. The Manhattan distance requires that the data
be standardized before use. The Pearson correlation is a similarity measure that is scale
invariant, but it is not as intuitive to use as the other measures of similarity.

Not only must one measure the similarity (distance) between two data points, but one must also
determine how to measure the distance between two clusters. This distance can be calculated in
at least three ways: (1) the minimum distance between any two objects in the different
clusters; (2) the maximum distance between any two objects in the different clusters; or
(3) the average distance between all objects in one cluster and all objects in the other
cluster. In addition to choosing measures of similarity and distance, one can build the
dendrogram either via top-down (divisive) methods or bottom-up (agglomerative) methods. For
agglomerative methods, each object first belongs to its own cluster, and the closest clusters
are merged step by step. For divisive methods, the process is reversed: all objects start in a
single cluster, which is split step by step. Figure 22.5 represents the results of hierarchical
clustering on the Dengue Fever dataset. The input data is the log2 transformed 2D gel data
using only the 107 spots that were significantly different based on the t-test analysis. As the
reader can see, this dataset is challenging. Ideally, the DF subjects should cluster together,
apart from the DHF subjects. Unfortunately, there is some amount of overlap between the
diseases, as the clusters are not solely one disease or the other.

22.5.2 K-means Clustering

K-means clustering is similar to hierarchical clustering; however, instead of obtaining
n clusters at the end, the data samples are grouped into a pre-specified number, k < n, of
clusters. The goal of k-means clustering is to partition the data into k subsets which are
significantly different from each other. K-means

clustering is most useful when the user knows a priori the number of clusters that the data
should belong to; e.g., if the data samples come from control, attenuated, and virulent strains
of a disease, one would expect three clusters to be created. Methods do exist, however, to aid
the user in estimating the appropriate number of clusters. With both k-means clustering and
hierarchical clustering, the user has the ability to examine in a graphical fashion how similar
different groups of data are, and whether there are some proteins that will enable one to
easily discriminate one group (e.g., control) from another group (e.g., infected).

Fig. 22.5 Hierarchical clustering of Dengue Fever study. Subjects labeled 1–30 are subjects with Dengue Fever; Subjects labeled 31–52 are subjects with Dengue Hemorrhagic Fever

22.5.3 PCA

As mentioned above in Sect. 22.4.4, PCA is used to identify patterns in the data. PCA expresses
data in such a way that it highlights differences and similarities between both groups and
samples within each group. In a data set with many correlations, an ordination technique is
needed to look at the overall structure of the available data. PCA is based on linear
correlation between the data values, and transforms the original variables into new,
uncorrelated variables. Consider m observations (e.g., protein abundance levels) on n variables
(e.g., conditions/individuals). This results in an m × n data matrix. PCA reduces the
dimensionality of the data matrix by identifying r new variables, where r < n. Each of these
new variables is a principal component (PC). Each PC is a linear combination of the original
n variables.

To perform PCA, start with the m × n matrix of protein abundance data: m rows correspond to
proteins (expression levels), n columns correspond to conditions/individuals. Apply data
standardization, such as the logarithmic transformation or scaling and centering the data such
that the mean value is 0 and the standard deviation is 1. Calculate the covariance matrix of
the dataset, C. Find the eigenvectors and eigenvalues of the matrix C. Create n new variables,
$PC_1, \dots, PC_n$, that are linear functions of the original n variables:

\[ PC_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n \]
\[ PC_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n \]
\[ \vdots \]
\[ PC_n = a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n \]

The coefficients above (referred to as “loadings”), $a_{ij}$, represent the linear correlation
between the original variables, $x_j$, and the $PC_i$. The coefficients are chosen to satisfy
three requirements: (1) the variance of each PC is as large as possible; (2) the PCs are
uncorrelated with each other; and (3) the squared coefficients in each row sum to 1
($a_{11}^2 + a_{12}^2 + \dots + a_{1n}^2 = 1$). Thus, the end result of PCA is that the data has
been transformed so it is expressed in terms of the patterns between samples and groups.
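The PCA steps above (standardize, compute the covariance matrix C, find its eigenvectors,
project the data) can be sketched in plain Python as follows. This is a minimal illustration
under stated assumptions, not a reference implementation: the function names and the small data
matrix are made up, and power iteration is swapped in for a full eigendecomposition, so only the
first principal component is recovered. In practice a numerical library would normally be used.

```python
import math


def standardize(matrix):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    m, n = len(matrix), len(matrix[0])
    cols = []
    for j in range(n):
        col = [row[j] for row in matrix]
        mean = sum(col) / m
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (m - 1))
        cols.append([(v - mean) / sd for v in col])
    return [[cols[j][i] for j in range(n)] for i in range(m)]


def covariance_matrix(z):
    """n x n covariance matrix C of the (already centered) columns."""
    m, n = len(z), len(z[0])
    return [[sum(z[i][a] * z[i][b] for i in range(m)) / (m - 1)
             for b in range(n)] for a in range(n)]


def first_component(c, iterations=500):
    """Leading eigenvector (unit length) and eigenvalue of C by power iteration.

    Assumes the all-ones start vector is not orthogonal to the leading eigenvector.
    """
    n = len(c)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        new = [sum(c[a][b] * vec[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    eigenvalue = sum(vec[a] * sum(c[a][b] * vec[b] for b in range(n))
                     for a in range(n))
    return vec, eigenvalue


# Hypothetical m x n matrix: 5 observations (rows) on 3 variables (columns).
data = [
    [1.0, 2.0, 1.2],
    [2.0, 4.1, 0.8],
    [3.0, 5.9, 3.1],
    [4.0, 8.2, 2.9],
    [5.0, 9.9, 5.3],
]
z = standardize(data)
c = covariance_matrix(z)
loadings, eigenvalue = first_component(c)

# Scores: projection of each standardized row onto the first eigenvector.
scores = [sum(a * x for a, x in zip(loadings, row)) for row in z]

# After standardization the trace of C equals n, so this is the fraction
# of total variance captured by the first principal component.
explained = eigenvalue / len(c)
print(f"PC1 loadings: {[round(a, 3) for a in loadings]}")
print(f"fraction of variance explained by PC1: {explained:.3f}")
```

Because the three made-up variables are strongly correlated, most of the variance falls on the
first component; with real proteomics data several components are usually needed.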

22.6 Supervised Learning/Classification

Machine learning is the study of how to build systems that learn from experience. It is a
subfield of artificial intelligence and utilizes theory from cognitive science, information
theory, and probability theory. Machine learning usually involves a training set of data as
well as a test set of data. These are both from the same dataset, and the system is “trained”
using the training data, and then run on the test data to classify it and test the model. There
are two types of machine learning algorithms: supervised and unsupervised learning. In
unsupervised learning, we simply have a set of data points. We do not know the classes
associated with these data points. In supervised learning, we also know which classes the
training data belong to. Machine learning has recently been applied in the areas of medical
diagnosis, bioinformatics, stock market analysis, classifying DNA sequences, speech
recognition, and object recognition.

22.6.1 Logistic Regression

The main objective of logistic regression is to model the relationship between a set of
continuous, categorical, or dichotomous variables and a dichotomous outcome that is modeled via
the logit function. Whereas typical linear regression seeks to regress one variable onto
another (typically continuous data), logistic regression seeks to model the probability of a
binary outcome. Logistic regression is the method used when the outcome is a “yes/no” response
rather than a continuous one. Traditionally, such problems were solved by ordinary least
squares regression or linear discriminant analysis. However, these approaches were found to be
less than optimal due to their strict assumptions (normality, linearity, constant error
variance, and continuity for ordinary least squares regression, and multivariate normality with
equal variances and covariances for discriminant analysis). A logistic regression equation
takes the form:

\[ \ln\left(\frac{p}{1 - p}\right) = \alpha + \sum_{j=1}^{k} \beta_j X_j + e \]

where p is the probability that event Y occurs, P(Y = 1), p/(1 − p) is the odds, and
ln[p/(1 − p)] is the log odds (the logit).

Thus, logistic regression is the method used for a binary, rather than a continuous, outcome.
The logistic regression model does not necessarily require the assumptions of some other
regression models, such as the assumption in linear discriminant analysis that the variables
are normally distributed. Maximum likelihood estimation is used to solve for the logistic
regression equation estimates. Recent techniques such as penalized shrinkage and regularization
estimation, and also lasso-type regularized logistic regression models, have been developed to
improve prediction accuracy in classification.

Logistic regression assumes a linear association between the features and the log odds of the
response. However, one has the ability to add logarithmic transformations or squares of the
data to increase the performance of the model. One of the key disadvantages of logistic
regression is that the method does not accommodate missing values. Additionally, logistic
regression is unable to deal with variables that are highly correlated, except when using the
lasso or ridge penalties. Lastly, including variables that are not important features can
hinder (decrease) the performance of the model. For this reason, logistic regression cannot be
used as an additional feature selection technique. It can, however, be used in combination with
other feature selection techniques.

22.6.2 CART

Classification and regression trees (CART, [3]) are a nonparametric method for building
decision trees to classify data. CART is highly useful for our applications because it does not
require initial variable selection. The three main components of CART are creating a set of
rules for splitting each node in a tree, deciding when a tree is fully grown, and assigning a
classification

to each terminal node of the tree [22]. Decision trees, such as CART, have a human-readable
split at each node, which is a binary test on some feature in the data set. The basic algorithm
for building the decision tree seeks a feature that splits the data (here into two groups) so
as to maximize the separation between the classes contained in the parent node. CART is a
recursive algorithm, which means that once it has decided on an appropriate split resulting in
two child nodes, the child nodes then become the new parent nodes, and the process is carried
on down the branches of the tree. CART can use cross-validation techniques to determine the
accuracy of the decision trees.

To build a decision tree, the following need to be determined: (1) which variable should be
tested at a node, (2) when should a node be declared a terminal node and further splitting
stop, and (3) if a terminal node contains objects from different classes, how should the class
of a terminal node be determined? The process for doing so is listed below.

1. Start by splitting a variable at all of its split points. The sample splits into two binary
   nodes at each split point.
2. Select the best split in the variable in terms of the reduction in impurity (heterogeneity).
3. Repeat steps 1 & 2 for all variables at the root node.
4. Assign classes to the nodes according to a rule that minimizes misclassification costs.
5. Repeat steps 1–4 for each non-terminal node.
6. Grow a very large tree Tmax until all terminal nodes are either small or pure or contain
   identical measurement vectors.
7. Prune and choose the final tree using cross-validation.

Some of the advantages of CART are that it can easily handle data sets which are complex in
structure, it is extremely robust and not very affected by outliers, and it can use a
combination of both categorical and continuous data. Missing data values do not pose any
obstacle to CART, as it develops alternative split points for the data that can be used to
classify the data when there are missing values. Additionally, variables used within the CART
framework are not required to meet any distributional assumptions (such as being normally
distributed or having equal variances within groups). CART can also handle correlated data.

CART also has several disadvantages. CART tends to overfit data, so one should plan to trim
(prune) the model so that it can be most useful. Unfortunately, how much to prune the tree is a
matter of personal choice. Many software implementations of CART have automatic pruning as an
option. The tree structures within CART may be unstable. This means that even small changes in
the sample data can result in a drastically different tree. Lastly, while the tree is optimal
at each individual split, it might not be globally optimal.

CART was run on the Dengue Fever example data mentioned in prior sections of this chapter.
Namely, the log2 transformed data from the 107 significant 2D gel spots was used as input to
the CART algorithm. Tenfold cross-validation was selected since the sample size is fairly small
(no more than 30 subjects within each class). Figure 22.6a shows the classification tree that
was best able to discriminate DF from DHF samples. Figure 22.6b shows the variable importance
for the CART model. Table 22.2a shows the prediction success for the training data and
Table 22.2b shows the prediction success for the testing data. Figure 22.7 shows the ROC curves
for both the training and testing datasets. The blue curve represents the training data, and
the red curve represents the testing data. The AUC for the training data is 0.90 and the AUC
for the testing data is 0.47.

22.6.3 RF

Random Forests (RF), developed by L. Breiman [4], offers several unique and extremely useful
features, which include built-in estimation of prediction accuracy, measures of feature
importance, and a measure of similarity between sample inputs. RF improves upon classical

decision trees such as CART while still keeping many of the appealing properties of tree
methods. Decision trees are known for their ability to select the most informative descriptors
among many and to ignore the irrelevant ones. By being an ensemble of trees, RF inherits this
attractive property and exploits the statistical power of ensembles. The RF algorithm is very
efficient, especially when the number of descriptors is very large. This efficiency over
traditional CART methods arises from two general areas. The first is that CART requires some
amount of pruning of the tree to reach optimal prediction strength; RF, however, does not do
any pruning, which reduces computation time. Second, RF uses only a small number of descriptors
to test the splitting performance at each node instead of doing an exhaustive search, as does
CART.

Fig. 22.6 (a) CART tree for DF vs DHF comparison, (b) Variable importance for the CART model

Table 22.2 (a) Prediction success for the training data, (b) Prediction success for the testing data

(a)  Class   Total   Prediction: DF (n = 30)   Prediction: DHF (n = 22)
     DF      30      27                        3
     DHF     22      3                         19
     Total   52      Correct = 90 %            Correct = 87 %

(b)  Class   Total   Prediction: DF (n = 30)   Prediction: DHF (n = 22)
     DF      30      15                        15
     DHF     22      15                        7
     Total   52      Correct = 50 %            Correct = 32 %

RF thus builds many trees and determines the most likely splits based upon a comparison within
the ensemble of trees. The procedure makes use of both a training dataset and a test dataset.
It proceeds as follows: First, a sample is bootstrapped from the training dataset. Then, for
each bootstrapped sample, a classification tree is grown. Here, RF modifies the CART algorithm
by choosing the best split from a randomly selected subset of the descriptors, instead of
choosing the best split among all variables. This means that at each node, a user-defined
number of variables are examined to determine the best split/variable amongst that list. This
number is typically small, on the order of five to ten variables to choose from. At each node
of the tree, a separate list of variables is considered.
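The per-tree steps described above (draw a bootstrap sample, pick a random subset of variables
at the node, keep the best split within that subset) can be sketched as follows. This is a
deliberately simplified illustration, not Breiman's algorithm: each "tree" is a one-split stump
rather than a fully grown tree, the function names and toy data are made up, and only two of
four variables are tried per node because the toy data has so few columns. Samples left out of
a bootstrap draw (out-of-bag samples) are kept so that accuracy can be estimated on data a
given tree has not seen.

```python
import random
from collections import Counter


def best_stump(rows, labels, features):
    """Best single split among the given features, scored by training errors."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            l_class = Counter(left).most_common(1)[0][0]
            r_class = Counter(right).most_common(1)[0][0]
            errors = (sum(y != l_class for y in left)
                      + sum(y != r_class for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_class, r_class)
    return best


def grow_forest(rows, labels, n_trees=200, n_try=2, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random variable subset."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        features = rng.sample(range(n_features), n_try)   # random variable subset
        stump = best_stump([rows[i] for i in idx],
                           [labels[i] for i in idx], features)
        if stump is not None:
            out_of_bag = set(range(n)) - set(idx)
            forest.append((stump[1:], out_of_bag))
    return forest


def predict(forest, row, sample_id=None):
    """Majority vote; with sample_id, only trees where it was out-of-bag vote."""
    votes = Counter()
    for (f, threshold, l_class, r_class), out_of_bag in forest:
        if sample_id is None or sample_id in out_of_bag:
            votes[l_class if row[f] <= threshold else r_class] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy data (made up): columns 0 and 1 separate the classes, 2 and 3 are noise.
rows = [
    [1.0, 5.1, 0.30, 7.0], [1.2, 4.8, 0.90, 6.1], [0.9, 5.3, 0.50, 6.6],
    [1.4, 4.9, 0.10, 7.4], [1.1, 5.0, 0.80, 6.3], [1.3, 5.2, 0.40, 7.1],
    [3.9, 2.1, 0.60, 6.8], [4.2, 1.8, 0.20, 6.2], [4.0, 2.3, 0.70, 7.3],
    [3.8, 2.0, 0.35, 6.5], [4.1, 1.9, 0.85, 7.2], [4.3, 2.2, 0.45, 6.4],
]
labels = ["DF"] * 6 + ["DHF"] * 6

forest = grow_forest(rows, labels)
oob_hits = sum(predict(forest, r, i) == y
               for i, (r, y) in enumerate(zip(rows, labels)))
oob_accuracy = oob_hits / len(rows)
print(f"out-of-bag accuracy: {oob_accuracy:.2f}")
```

Restricting each split to a random subset of variables is what decorrelates the trees; the
majority vote then averages out the stumps that happened to split on a noise variable.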

Fig. 22.7 ROC Curves for both the training and testing datasets. The blue curve represents the training data, and the red curve represents the testing data. The AUC for the training data is 0.90 and the AUC for the testing data is 0.47 (x-axis: False Pos. Rate; y-axis: True Pos. Rate)

This procedure of creating trees is repeated until a sufficiently large number of trees have
been computed, usually 500 or more.

In practice, some form of cross-validation technique is used to test the prediction accuracy of
any computational technique. RF performs a bootstrapping cross-validation procedure in parallel
with the training step of its algorithm. This method allows some of the data to be left out at
each step, and then used later to estimate the accuracy of the classifier after each instance
(i.e., tree) has been completed.

The advantages of RF are that high levels of predictive accuracy are delivered automatically,
and there are only a few control parameters to experiment with. Additionally, RF works equally
well for classification situations as well as regression situations. RF is the most resistant
to overfitting of the models discussed in this chapter. This means the algorithm typically
generalizes well to new data. RF is a quick algorithm, which means it creates results rapidly
even with thousands of potential predictors. This is because RF does not use all variables at
each level of the tree building process. RF does not require prior feature reduction, as it can
perform variable selection during the tree building process. RF also has the ability to handle
missing values. There are only a couple of disadvantages to RF. First, the algorithm can
overfit some datasets that are extremely noisy. Additionally, the classifications created by RF
can be difficult to interpret, as the splits are not listed in the results file (i.e., the user
does not know what value of a

variable to classify as one group versus the other group). The results list only the important variables that can be used to distinguish one group from another.
Figures 22.8 and 22.9 and Table 22.3 depict the results of running RF on the Dengue Fever data using a default of 500 trees. Figure 22.8 shows the resultant variable importance for the top twenty most important spots. Table 22.3 shows the prediction success for the models. Figure 22.9 shows the ROC curve for the data. The AUC for the ROC is 0.77.

22.6.4 MARS

Multivariate Adaptive Regression Splines (MARS) is a robust nonparametric modeling approach for feature reduction and model building [12]. MARS is a multivariate regression method that can estimate complex nonlinear relationships using a sequence of spline functions of the predictor variables. Regression splines seek to find thresholds and breaks in relationships between variables and are very well suited to identifying changes in the behavior of individuals or processes over time. The basic concept behind regression splines is to model using potentially discrete linear or nonlinear functions of given analytes over differing intervals. The resulting piecewise curve, referred to as a spline, is represented by basis functions within the model.
MARS builds models of the form

$$ f(x) = \sum_{i=1}^{k} c_i B_i(x). $$

Each basis function $B_i(x)$ takes one of the following three forms: (1) a constant, of which there is just one such term, the intercept; (2) a hinge function, which has the form max(0, x − const) or max(0, const − x), where MARS automatically selects variables and values of those variables for knots of the hinge functions; or (3) a product of two or more hinge functions. These basis functions can model interactions between two or more variables.
This algorithm has the ability to search through a large number of candidate predictor variables to determine those most relevant to the classification model. The specific variables to use and their exact parameters are identified by

Fig. 22.8 Random forests variable importance for the top 20 most important spots
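To make the MARS model form concrete, the sketch below evaluates a fitted model written as an intercept plus a weighted sum of hinge functions and products of hinge functions. It shows only how a prediction is assembled from basis functions, not the search that selects them; the function names, the tuple layout, and every coefficient and knot value are hypothetical.

```python
def hinge(x, knot, direction=1):
    """Hinge function: max(0, x - knot) if direction is +1,
    max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x, intercept, terms):
    """Evaluate f(x) = intercept + sum_i c_i * B_i(x).

    `terms` is a list of (coefficient, hinges) pairs, where `hinges` is a
    list of (variable_index, knot, direction) tuples. A basis function with
    one hinge is a single-variable term; with two or more hinges it is their
    product, i.e. an interaction term.
    """
    total = intercept
    for coefficient, hinges in terms:
        basis = 1.0
        for variable, knot, direction in hinges:
            basis *= hinge(x[variable], knot, direction)
        total += coefficient * basis
    return total


# Hypothetical model with two variables and a knot at 3.0 on variable 0.
model_terms = [
    (2.0, [(0, 3.0, +1)]),                  # 2.0 * max(0, x0 - 3)
    (-0.5, [(0, 3.0, -1)]),                 # -0.5 * max(0, 3 - x0)
    (0.1, [(0, 3.0, +1), (1, 1.0, +1)]),    # interaction of x0 and x1
]

above_knot = mars_predict([5.0, 4.0], 1.0, model_terms)
below_knot = mars_predict([1.0, 4.0], 1.0, model_terms)
```

The pair of mirrored hinges at the same knot gives the fitted curve a different slope on each side of the threshold, which is how MARS represents a break in the relationship between a variable and the outcome.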

Fig. 22.9 ROC curve for the data. The AUC for the ROC is 0.77 (x-axis: False Pos. Rate; y-axis: True Pos. Rate)

an intensive search procedure that is fast in comparison to other methods. The optimal functional form for the variables in the model is based on regression splines called basis functions.

Table 22.3 Prediction success for the models
Class   Total   Predicted DF (n = 11)   Predicted DHF (n = 41)
DF      30      10                      20
DHF     22      1                       21
Total   52      Correct = 33 %          Correct = 95 %

MARS uses a two-stage process for constructing the optimal classification model. The first half of the process involves creating an overly large model by adding basis functions that represent either single variable transformations or multivariate interaction terms. The model becomes more flexible and complex as additional basis functions are added. The process is complete when a user-specified number of basis functions have been added. In the second stage, MARS deletes basis functions in order of least contribution to the model until the optimum one is reached. By allowing for the model to take on many forms as well as interactions, MARS is able to reliably track the very complex data structures that are often present in high-dimensional data. By doing so, MARS effectively reveals important data patterns and relationships that other models often struggle to detect. Missing values are not a problem because they are dealt with via nested variable techniques. Cross-validation techniques are used within MARS to avoid over-fitting the

classification model, and randomly selected test data can also be used to avoid the issue. The end result is a classification model based on single variables and interaction terms which will determine class identity. Thus, MARS excels at finding thresholds and breaks in relationships between variables and as such is very well suited for identifying changes in the behavior of individuals or processes over time.
Some of the advantages of MARS are that it can model predictor variables of many forms, whether continuous or categorical, and that it can tolerate large numbers of input predictor variables. As a nonparametric approach, MARS does not make any underlying assumptions about the distribution of the predictor variables of interest. MARS is also a relatively fast algorithm, which means you can get results for large datasets in under a minute. In addition, like CART and RF, MARS also has the ability to handle missing values within a dataset so that imputation techniques are not necessary.
MARS also has several disadvantages. The algorithm performs in such a fashion that the results are easily overfit to a specific dataset. While MARS allows interaction terms to appear in the model, such interaction terms are extremely difficult to interpret biologically. In addition, confidence intervals for predictive variables cannot be calculated directly.
Table 22.4 and Figs. 22.10 and 22.11 show the results of running MARS on the log2 transformed Dengue Fever dataset. The model was created using tenfold cross-validation and allowed for 107 potential basis functions to be included. Figure 22.10 shows the variable importance; Table 22.4a shows the prediction success rates for the training data; Table 22.4b shows the prediction success rates for the testing data; and Figure 22.11 shows the ROC curves for the training and testing data. The blue curve represents the training data and the red curve represents the testing data. The AUC for the training data is 1.0 and the AUC for the testing data is 0.63.

Table 22.4 (a) MARS prediction success rates for the training data, (b) MARS prediction success rates for the testing data
(a)
Class   Total   Predicted DF (n = 29)   Predicted DHF (n = 23)
DF      30      29                      1
DHF     22      0                       22
Total   52      Correct = 97 %          Correct = 100 %
(b)
Class   Total   Predicted DF (n = 32)   Predicted DHF (n = 20)
DF      30      20                      10
DHF     22      12                      10
Total   52      Correct = 67 %          Correct = 45 %

22.6.5 SVM

Support vector machines (SVMs) are based on simple ideas that originated in the area of statistical learning theory [16]. SVMs apply a transformation to highly dimensional data to enable researchers to linearly separate the various features and classes. As it turns out, this transformation avoids calculations in high dimension space. The popularity of SVMs owes much to the simplicity of the transformation as well as their ability to handle complex classification and regression problems. SVMs are trained with a learning algorithm from optimization theory and tested on the remainder of the available data that were not part of the training dataset [6]. The main aim of support vector machines is to devise a computationally effective way of learning optimal separating parameters for two classes of data.
SVMs project the data into higher dimensional space where different classes or categories are linearly or orthogonally separable by locating a hyperplane (basically, a line or surface that linearly separates data) within the space of data points that can separate multiple classes of data. SVMs also maximize the width of a band separating the data from the hyperplane so that the linear separation is optimal. SVMs use an implicit mapping of the input data, commonly referred to as Φ, into a highly dimensional feature space defined by some kernel function. The learning then occurs in the feature space, and the data points appear in dot products with other data points [20]. One particularly nice property of

Fig. 22.10 MARS variable importance

Fig. 22.11 ROC curves for the training and testing data. The blue curve represents the training data and the red curve represents the testing data. The AUC for the training data is 1.0 and the AUC for the testing data is 0.63 (x-axis: False Pos. Rate; y-axis: True Pos. Rate)

SVMs is that once a kernel function has been selected and validated, it is possible to work in spaces of any dimension. Thus, it is easy to add new data into the formulation since the complexity of the problem will not be increased by doing so.
The advantage of SVMs is that they are not data-type dependent. This means that categorical as well as quantitative data can be analyzed together. SVMs are also not dimension dependent. They have the ability to map the data into higher dimensions in order to find a dimension where the data appear to separate into different groups. Additionally, there are various kernel functions that can be used to map the input data into the feature space, the most popular being the radial basis function (or Gaussian) kernel. SVMs also can be used to classify more than just two groups at a time.
There are several disadvantages to using SVMs. Most notably, SVMs are seen as "black box" algorithms and thus fewer researchers are willing to use them because they do not fully understand the algorithm. Additionally, SVMs have an extensive memory requirement because of the quadratic programming necessary to complete the transformation to higher order dimensions. Thus, the SVM algorithm can be very slow in the testing phase. The choice of the kernel function is subjective, which means that choosing one kernel function over another will possibly result in a different classification. Lastly, the data needs to be normalized (scaled and centered) before using the algorithm.

22.6.6 TreeNet

Sometimes the CART algorithm builds a non-smooth, step-function classification boundary, which leads to large variance and unstable results, so an alternative ensemble classification model is needed to improve accuracy by increasing randomness through resampling methods. In binary classification, if a fitted model misclassifies some observations, that model can be applied again, but with extra weight given to the misclassified observations. Then, after a large number of fitting attempts, each with difficult-to-classify observations given relatively more weight, overfitting can be reduced if the fitted values from the different fitting attempts are combined. Boosting is an algorithm that combines the outputs from many weak classifiers to produce a powerful classifier [15]. A stochastic gradient-boosted model (TreeNet) is a generalized tree boosting method that provides an accurate and effective off-the-shelf procedure for data mining [10]. The algorithm generates thousands of small decision trees built in a sequential error-correcting process to converge to an accurate model. At each iteration, a subsample of the training sample is drawn at random without replacement from the full training sample to improve robustness to outlier-contaminated data. The variance of the individual base learner is increased at each iteration, but the correlation between the estimates at different iterations is decreased; therefore, the variance of the combined model is reduced. TreeNet performs consistently well in predictive accuracy across many different kinds of data while maintaining the ability to train the model quickly compared with a single CART classifier. The variable importance measure, on a percentage scale, indicates how the variables contribute to the classification predictions. The TreeNet graph provides the relative influence of each variable and the root mean square (RMS) error, a measure of the differences between values predicted by a model and the values actually observed, to assess the power of the model. The TreeNet model is also a black box approach: the classifiers are complex and the results are hard to interpret, unlike a general probabilistic framework for reaching a particular answer, and weak classifiers that are too complex can lead to over-fitting. The TreeNet algorithm requires no prior knowledge about the weak learner and is easy and quick to run.

22.6.7 Generalized Path Seeker (GPS) Based on AIC and BIC

The comparisons of penalized-regression methods in binary response and logistic

regression such as the ridge penalty ($\alpha\sum\beta_i^2$), lasso penalty ($\alpha\sum|\beta_i|$), and elastic net (combined $\alpha\sum|\beta_i| + (1-\alpha)\sum\beta_i^2$) were conducted. Ridge regression can only shrink the coefficients, but lasso regression can both shrink the coefficients and perform variable selection. Elastic net regression can identify the group effect, where strongly correlated features tend to be in the model together. The corresponding log-likelihood function of β (L) is given by

$$ L = \log L(\beta) = Y^{T} X \beta - \sum_{i} \log\bigl(1 + \exp(x_i \beta)\bigr). $$

The coefficient vector β that minimizes the penalized negative log-likelihood is

$$ \hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}} \; -\sum_{i} \bigl( y_i \log p_i + (1 - y_i) \log(1 - p_i) \bigr) + \mathrm{Penalty}(\beta), $$

where $p_i = P(y = 1 \mid x)$. To estimate the coefficients, we perform generalized path seeker (GPS), a high speed lasso-style regression from Friedman [11], to regularize the regression. GPS performs regularized regression based on the generalized elastic net family of penalties. The efficient least angle regression (LARS) algorithm of Efron et al. [8] finds the entire regularization path in an iterative way with little computational effort. For a binary outcome variable and logistic regression models, the lasso estimator is obtained by penalizing the negative log-likelihood with the L1-norm through the absolute constraint on the regression coefficients, $\alpha\|\beta\|_1 = \alpha\sum|\beta_i|$.
The Akaike information criterion (AIC) is given by

$$ \mathrm{AIC} = -2\ln(L) + 2(p + 1), $$

where ln(L) is the binomial log-likelihood for the model, and p is the number of covariates estimated in the model.
The Bayesian information criterion (BIC) is given by

$$ \mathrm{BIC} = -2\ln(L) + \ln(n)\,(p + 1), $$

where n is the sample size, and p is defined as in AIC.
Among the models having different numbers of covariates, the one yielding the smallest AIC and BIC values is selected as the optimal model. In AIC and BIC, the binomial log-likelihood may be viewed as a measure of the goodness-of-fit of a model, with the number of parameters functioning as a penalty for model complexity. The complexity penalty term α is chosen by the AIC or BIC criterion to evaluate the negative log-likelihood. The elastic net, $\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2$, combines the L1 and L2 penalizing terms and possesses a grouping effect, i.e., in a set of variables that have high pairwise correlations, the elastic net groups the correlated variables together. Lasso and elastic net are especially well suited to wide data, meaning data with more predictors than observations in a linear regression model. The regularization model outputs provide piece-wise linear regression path plots along with cross validation to identify important predictors. This procedure is applicable for variable selection for the parametric linear components. If the parametric assumptions are not satisfied, we need a nonparametric approach such as the MARS model, which goes beyond linearity of the features related to disease outcomes.
Figures 22.12 and 22.13 and Tables 22.5a, b depict the results of running GPS on the Dengue Fever data using tenfold cross-validation. Figure 22.12 shows the resultant variable importance for the top twenty most important spots. Table 22.5a shows the prediction success for the training data; Table 22.5b shows the prediction success for the testing data. Figure 22.13 shows the ROC curve for the data. The blue curve represents the training data; the red curve represents the testing data. The AUC for the training ROC is 1.0; the AUC for the testing data is 0.92.

Fig. 22.12 GPS variable importance for the top 20 most important spots

22.7 Resampling Techniques

22.7.1 Training/Testing Sets

A key concept in machine learning is the creation of a predictive model based on a training dataset, and then assessing the ability of the model to perform on an independent testing dataset (mentioned in Sect. 22.6). Ideally, the training data should be collected separately from the testing data. This can mean that discovery samples are used for the training data and validation samples are used for the testing data. Another way to create training and testing datasets is to set aside some of the training data to be used instead for the testing data. If the study contains more than 60 samples in a given group, this is the preferred method for machine learning algorithms. How much of the training data to set aside for the testing data is up to the user. Frequently, 70–80 % of the dataset samples are retained for the training of the predictive model, with the additional 20–30 % being set aside to test the model performance. For the majority of the work performed by the Clinical Proteomics Center, the analysis was performed by using cross-validation techniques (mentioned below in Sect. 22.7.3).

22.7.2 Bootstrapping

Bootstrap resampling [7] is a general method for inference that has been applied to a variety of statistical problems too difficult to solve analytically. The standard nonparametric bootstrap resampling treats the sample data as a population and samples with replacement repeatedly to produce an approximation to a statistic's sampling distribution. As a result, reliable confidence intervals and hypothesis tests are easily calculated, often with properties superior to standard parametric techniques. In predictive modeling, bootstrap resampling has been found to "smooth" out discontinuities in many fitting algorithms. The resulting model is typically less variable without a substantial increase in bias.

22.7.3 CV/k-fold CV

CV gives an accurate and robust indication of how well an algorithm can make new predictions [17]. CV is an important technique for avoiding testing hypotheses that may be inferred from the data, but don't actually exist. CV is appropriate for each of the classification methods we will discuss. One well-accepted method for cross validation is termed "k-fold" CV. Here the full dataset is divided into k subsets and the holdout method, where a set amount of data is withheld from the analysis, is repeated k times. Each time, one of the k subsets is used as the test set and the remaining subsets are used as the training sets.
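The k-fold procedure can be sketched in a few lines: shuffle the sample indices, deal them into k subsets, and let each subset act once as the test set while the rest form the training set. The code below is a minimal illustration under stated assumptions. The function names and the fixed seed are hypothetical, and the majority-class "classifier" is only a stand-in so that the sketch runs; the class sizes (30 DF, 22 DHF) and the tenfold setting follow the Dengue Fever example.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_error(labels, k, fit, predict, seed=0):
    """Average the misclassification rate over the k test folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(labels), k, seed):
        model = fit(train)
        wrong = sum(1 for i in test if predict(model, i) != labels[i])
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / len(fold_errors)


labels = ["DF"] * 30 + ["DHF"] * 22   # class sizes from the Dengue Fever tables


def fit_majority(train):
    """Placeholder model: remember the most common training label."""
    return Counter(labels[i] for i in train).most_common(1)[0][0]


def predict_majority(model, i):
    """Placeholder prediction: always return the remembered label."""
    return model


cv_error = cross_validated_error(labels, 10, fit_majority, predict_majority)
```

In practice `fit` and `predict` would wrap whichever classifier is being assessed; the splitting and averaging logic stays the same.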

Fig. 22.13 ROC curve for the data. The blue curve represents the training data; the red curve represents the testing data. The AUC for the training ROC is 1.0; the AUC for the testing data is 0.92 (x-axis: False Pos. Rate; y-axis: True Pos. Rate)
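An AUC value such as those quoted in the ROC figure captions can be computed directly from classifier scores without drawing the curve, because the area under the ROC curve equals the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below uses that pairwise formulation; the function name and the example labels and scores are hypothetical and are not taken from the Dengue Fever data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier output, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # pair ranked correctly
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for three positive and three negative samples.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auc = roc_auc(example_labels, example_scores)   # 8 of 9 pairs ranked correctly
```

An AUC of 1.0 means every positive sample outscores every negative one, while 0.5 corresponds to random ordering, which is why a large gap between training and testing AUC signals overfitting.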

Table 22.5 (a) Prediction success for the training data, (b) Prediction success for the testing data
(a)
Class   Total   Predicted DF (n = 31)   Predicted DHF (n = 21)
DF      30      30                      0
DHF     22      1                       21
Total   52      Correct = 100 %         Correct = 95 %
(b)
Class   Total   Predicted DF (n = 27)   Predicted DHF (n = 25)
DF      30      24                      6
DHF     22      3                       19
Total   52      Correct = 80 %          Correct = 86 %

The average error across all trials is then computed to assess the predictive power of the classification technique used. The advantage of the k-fold CV method is that many combinations of training set vs. test set trials are used to calculate an average predictive error; so this method provides an estimate of an algorithm's predictive power that is much less dependent upon the initial selection of members for the training set.

22.8 Model Diagnostics/Performance/Quality Assessment

In practice, it is often customary for a supervised classification to be conducted using several modeling approaches. The investigators then examine the model performance using a variety of criteria, as well as look for convergence of informative features. A widely accepted

approach in model evaluation is to evaluate the area under the receiver operating characteristic (ROC) curve [14].

22.8.1 Receiver-Operating Characteristic (ROC) Curves/AUC

For a given technique, multiple models are frequently created. One method used to evaluate and compare the various models is by ROC curves [24]. An ROC curve is a graphical plot of the sensitivity vs. 1-specificity of a binary classifier system as its discrimination threshold is varied. This is an equivalent representation of a plot of the fraction of true positives vs. the fraction of false positives. The assumption is that the samples on each side of the binary classifier are from a separate population, and the ROC curve is a graphical presentation of the validity of this assumption. The area under the ROC curve (AUC) measurements indicate the ability of a model to discriminate amongst the outcome groups. Figures 22.7, 22.9, and 22.11 show the ROC curves for the Dengue Fever study comparing DF vs. DHF.
For the choice of regularization parameter, criteria such as cross-validation, generalized cross-validation (GCV), the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) can be used. Generalized cross-validation can be viewed as an approximation to cross-validation,

$$ \mathrm{GCV} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{y_i - f(x_i)}{1 - k/n} \right]^{2}, $$

where n is the number of observations, y is the dependent variable, x is the independent variable(s), and k is the effective number of parameters or degrees of freedom in the model. The effective degrees of freedom is the means by which the GCV error function puts a penalty on adding variables to the model. The effective degrees of freedom is chosen by the modeler. The GCV can be used to rank the variables in importance. To rank the variables in importance, the GCV is computed with and without each variable in the model.

22.8.2 Deviance/Residual Plots

Model checking is an important procedure for assessing model adequacy in multiple linear or logistic regression. Often the interest is to assess the linear or non-linear association of binary responses with features. The logistic regression model assumes that the logit of the outcome is a linear combination of features. When model assumptions are not satisfied, problems arise: the confidence intervals of the coefficients are wide and the statistical tests are incorrect and inefficient. We examine whether our model has all of the relevant predictors and whether their linear association is appropriate.
Next, we evaluate the partial residual plot as a diagnostic graphical tool for identifying the nonlinear relationship between the logit of the disease outcome and features for additive models. A partial residual plot (Fig. 22.14) is a scatterplot of the partial correlation of each independent feature with the dependent outcome after removing the linear effects of the other independent features in the model. The log-likelihood ratio test statistic is twice the difference in log-likelihoods of the linear and nonlinear fits of each feature. For each feature, we also examine the log-likelihood ratio-test p-values comparing the negative binomial log-likelihood (i.e., the deviance $d_i = 2\left[ y_i \log\frac{y_i}{\hat{p}_i} + (n_i - y_i) \log\frac{n_i - y_i}{n_i - \hat{p}_i} \right]$) between the full model and the reduced model. After performing the log-likelihood ratio test, for nonlinear models with a p-value less than 0.05 it is preferable to use a non-parametric fit such as the MARS model. An example of a partial residual plot of lymphocyte clinical data for the Dengue data is shown in Fig. 22.14. It shows the non-linearity of lymphocytes to the logit of Dengue Hemorrhagic Fever.
In proteomic studies, some proteins cannot be accurately measured, which leads to measurement error problems. It is well known that ignoring measurement error in covariates leads to

Fig. 22.14 Partial residual plot (x-axis: Lymphocytes; y-axis: Partial residual)

biased estimates of the covariate effects. There are a number of measurement error models reported in the literature [5, 13].
Measurement error in the predictors, lack-of-fit error (under-fitting and over-fitting), and error due to omitting relevant important predictors can cause poor performance when building models, especially in terms of reproducibility of the training model on test data. Statistical methods that include random effects, such as linear mixed effect models, can quantify between variation, within variation and unwanted noise variation. Therefore, the model performance estimators should be evaluated from a test set. We need to examine the similarity between training and test set samples to assess the reproducibility of the model. We observed that verification sample variations in aspergillosis are much larger than those in the qualification samples. We know that the final optimal classification model can be used to predict the probability of new data being in the disease group in the training samples. The final classification model could be optimized in terms of minimal noise in the predictors and response.

Bibliography

1. Batista G, Monard M (2002) A study of K-nearest neighbour as an imputation method. Hybrid Intelligent Systems, Santiago, Chile, pp 251–260
2. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:125–133
3. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth, Belmont
4. Breiman L (2001) Random forests–random features. University of California, Berkeley
5. Carroll R, Ruppert A, Stefanski L, Crainiceanu C (2006) Measurement error in nonlinear models: a modern perspective, 2nd edn. CRC Press, London
6. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines: and other kernel-based learning methods. Cambridge University Press, Cambridge
7. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26
8. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499
9. Enders C (2001) A primer on maximum likelihood algorithms available for use with missing data. Struct Equ Model Multidiscip J 8:128–141
10. Friedman J (1999) Greedy function approximation: a gradient boosting machine. Department of Statistics, Stanford University
Stanford University

11. Friedman J (2012) Fast sparse regression and classification. Int J Forecast 28:722–738
12. Friedman J (1991) Multivariate adaptive regression splines. Ann Stat 19:1–41
13. Fuller W (1987) Measurement error models. Wiley, New York
14. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
15. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning; data mining, inference and prediction. Springer, New York
16. Karatzoglou A, Meyer D, Hornik K (2006) Support vector machines in R. J Stat Softw 15:1–28
17. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. Fourteenth international joint conference on artificial intelligence, Montreal, Canada, pp 1137–1143
18. Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York
19. Little R, Rubin D (2002) Statistical analysis with missing data, 2nd edn. Wiley & Sons, New York
20. Scholkopf B, Smola A (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge, MA
21. Shaffer J (1995) Multiple hypothesis testing. Annu Rev Psychol 46:561–584
22. Steinberg D, Colla P (1995) CART: tree-structured nonparametric data analysis. Salford Systems, San Diego
23. Tusher V, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98:5116–5121
24. Zweig MH, Campbell G (1993) Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 39:561–577
25. Rubin D (1976) Inference and missing data. Biometrika 63:581–592
23 Qualification and Verification of Protein Biomarker Candidates
Yingxin Zhao and Allan R. Brasier

Abstract
The importance of biomarkers has long been recognized by the public,
scientific community, and industry. Yet despite extensive efforts and
funding investments in biomarker discovery, only 109 protein biomarkers
in plasma or serum were approved by the US Food and Drug Administra-
tion throughout 2008 (Anderson NL. Clin Chem 56:177–185, 2010), and
even fewer protein biomarkers are currently used routinely in the clinic. In
recent years, the introduction of new protein biomarkers approved by the
US Food and Drug Administration has fallen to an average of 1.5 per year
(a median of only 1 per year) (Anderson NL. Clin Chem 56:177–185,
2010). The low efficiency of biomarker development is due to several
reasons, including the poor quality of clinical samples, the gap between
subjective clinical definition of a disease and objective protein
measurements, and high false discovery rate of differentially expressed
proteins identified in the initial discovery phase (Rifai N, Gillette MA,
Carr SA. Nat Biotechnol 24:971–983, 2006). It has become clear that the
vast majority of differentially expressed proteins identified in the discovery
phase will ultimately fail as useful clinical biomarkers, and only a few
true positive candidates can move through the biomarker development
pipeline. Isolation of true biomarkers from the large pool of differentially
expressed proteins identified in the discovery phase becomes the greatest
challenge and the bottleneck in most biomarker pipelines. To succeed,
after the initial discovery study (see Chap. 20), the authenticity of biomarker
candidates needs to be tested in a pilot study with high throughput,
high accuracy and reasonable cost. This essential process is addressed by
the qualification and verification phase of the biomarker development
pipeline.

Y. Zhao (*) • A.R. Brasier


The University of Texas Medical Branch, 301 University
Blvd, 77555 Galveston, TX, USA
e-mail: yizhao@utmb.edu; arbrasie@utmb.edu

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical
Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_23

Keywords
Biomarker verification • ELISA-based verification • Selected Reaction
Monitoring (SRM) • Parallel reaction monitoring (PRM) • Accurate
inclusion mass screening (AIMS) • Data-independent MS/MS
acquisition (DIA-MS/MS) • Cohort selection

Abbreviations

AIF         All-ion fragmentation
AIMS        Accurate inclusion mass screening
AQUA        absolute quantification peptides standard
CART        classification and regression trees
CE          capillary electrophoresis
CFD         complement factor D
CV          coefficient of variation
DDA-MS/MS   data-dependent MS/MS acquisition
DIA-MS/MS   Data-independent MS/MS acquisition
ELISA       Enzyme-linked immunosorbent assay
FDR         false discovery rate
FWHM        full width at half maximum
HCD         higher energy C-trap dissociation
HPLC        high performance liquid chromatography
HR/AM       high resolution and mass accuracy
IPed        immuno-precipitated
LC          liquid chromatography
LLOQ        lower limit of quantification
LOD         limit of detection
MARS        multivariate adaptive regression splines
MS          mass spectrometry
MS/MS       tandem mass spectrometry
PAcIFIC     Precursor acquisition independent from ion count
PRM         parallel reaction monitoring
QCAT        concatemer of standard peptides
QQQ-MS      triple quadrupole mass spectrometry
SAM         significance analysis of microarray
SEC         size-exclusion chromatography
SID         stable isotope dilution
SISCAPA     stable isotope-labeled standards with capture by anti-peptide antibodies
SOPs        standard operating protocols
SRM         selected reaction monitoring
SWATH       Sequential window acquisition of all theoretical mass spectra

23.1 Overview

The importance of biomarkers has long been recognized by the public, scientific community, and industry. Yet despite extensive efforts and funding investments in biomarker discovery, only 109 protein biomarkers in plasma or serum were approved by the US Food and Drug Administration throughout 2008 [1], and even fewer protein biomarkers are currently used routinely in the clinic. In recent years, the introduction of new protein biomarkers approved by the US Food and Drug Administration has fallen to an average of 1.5 per year (a median of only one per year) [1]. The low efficiency of biomarker development is due to several reasons, including the poor quality of clinical samples, the gap between subjective clinical definition of a disease and objective protein measurements, and high false discovery rate of differentially expressed proteins identified in the initial discovery phase [2]. It has become clear that the vast majority of differentially expressed proteins identified in the discovery phase will ultimately fail as useful clinical biomarkers, and only a few true positive candidates can move through the biomarker development pipeline. Isolation of true biomarkers from the large pool of differentially expressed proteins identified in the discovery
phase becomes the greatest challenge and the bottleneck in most biomarker
pipelines. To succeed, after the initial discovery study (see Chap. 20), the
authenticity of biomarker candidates needs to be tested in a pilot study
with high throughput, high accuracy and reasonable cost. This essential
process is addressed by the qualification and verification phase of the
biomarker development pipeline.

The aims of the qualification and verification phase in the biomarker
development pipeline are:

1. To confirm the differential expression of candidates observed in the
   discovery phase
2. To verify the correlation of the biomarker candidates to the disease over
   a relatively large population of patients
3. To confirm the performance of the statistical model combining the
   biomarker panel.

The qualification and verification phase, therefore, is a critical phase in
the transition from discovery to clinical applications. Three major factors
influence the feasibility of a biomarker qualification and verification
study:

1. The availability of biospecimens from a well-curated cohort
2. The availability of highly specific and quantitative assays for the
   biomarker candidates of interest
3. The expense of assay development and of applying the assays to measure a
   large number of targeted analytes across many samples.

23.1.1 Biospecimens from Clinical Cohort

Success in the qualification and verification phase relies on a rigorous
clinical study design and attention to detail in sample acquisition,
archival and tracking. Biomarker studies typically seek to identify
combinations of proteins whose measurement will serve as a molecular
indicator of the severity of a disease or its early response to treatment.
This application of biomarkers enables the application of precision
medicine, an approach that tailors specific interventions to those
individuals that would most benefit. Described in Chap. 20, the “discovery
phase” entails the application of high throughput proteomics measurements to
broadly sample proteins that distinguish between two disease states. The
discovery phase typically is applied to a small number of representative
cases and controls in a cohort. The qualification phase will measure the
candidates in the samples used in the discovery phase. The verification
phase involves measuring the candidates in an independent, larger sample of
similar cases and controls, frequently from multiple collaborating clinical
sites. In order for the verification phase to be meaningful, reproducible,
observer-independent criteria for case definition need to be applied.

Moreover, significant attention to detail in uniform sample acquisition and
storage is paramount. There is increasing recognition that “center effects”
(variations in sample acquisition, processing and storage) may have a
profound impact on the discovery, qualification, and verification phases of
the biomarker development pipeline. To overcome this issue, multi-site
clinical studies should develop and rigorously adhere to standard operating
protocols (SOPs) for sample acquisition/archival at the onset of the study.
Although the techniques for quality assessment/quality control of proteomics
samples are currently limited, sample quality should be monitored where
possible prior to the application of qualification and verification assays.

The number of samples used in a verification study needs to provide
sufficient power to assess the sensitivity and specificity of a candidate
biomarker panel. The sample size for the verification stage depends on
multiple factors including the analytical variation of the assays, the
biological variation between patients, the concentration of biomarker
candidates in clinical samples, and the effect size (the difference in the
biomarker’s abundance between cases and controls). The statistical design
for biospecimen size in verification studies should take these factors into
account [3].
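
The dependence of sample size on effect size and on analytical and biological variation can be illustrated with a textbook calculation. The sketch below is not the design procedure of ref. [3]; it assumes a two-sided comparison of mean abundance between cases and controls, a normal approximation, equal group sizes, and independent biological and analytical variation. The function name, its parameters and the example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, sd_biological, sd_analytical,
                      alpha=0.05, power=0.80):
    """Approximate number of cases (and of controls) needed.

    effect_size, sd_biological and sd_analytical must share one unit,
    for example log2 abundance. Normal approximation for a two-sided,
    two-sample comparison of means with equal group sizes.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    # Assumption: biological and analytical variation are independent,
    # so their variances add.
    total_variance = sd_biological ** 2 + sd_analytical ** 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * total_variance / effect_size ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    print(samples_per_group(1.0, 1.0, 0.0))  # assay adds no variation
    print(samples_per_group(1.0, 1.0, 0.5))  # assay adds variation
    print(samples_per_group(0.5, 1.0, 0.5))  # smaller effect size
```

Under these assumptions, a noisier assay or a smaller difference between cases and controls both raise the required number of samples, which is why assay precision and cohort size have to be planned together.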

23.1.2 Requirements for Qualification and Verification Assays

The transition from discovery study to qualification and verification
usually requires the transition from the unbiased, quantitative or
semi-quantitative approaches used in the discovery to a targeted and much
more precise, reproducible, quantitative approach. If such assays for
biomarker candidates are not readily available, they need to be established
de novo within a short lead-time. The analytical performance of the
biomarker qualification and verification assays, including accuracy,
precision, repeatability, reproducibility, sensitivity, specificity, and
linear dynamic range, should be validated to meet the predicted needs of the
study. The assays need to have high selectivity and sufficient sensitivity
to detect and quantify the analytes targeted in a highly complex matrix
(such as human plasma). Because the goals of biomarker qualification and
verification are to confirm and verify the relative changes observed in the
discovery study and to evaluate the model performance in their combination,
but not to measure the actual amount of analytes in biological samples, true
accuracy is usually not required in qualification/verification studies.
However, the assays need to have high repeatability and reproducibility so
that they can be used to precisely and consistently measure relative changes
in a large number of targeted analytes across many samples. Ideally, the
assay can be standardized across laboratories.

Because all biomarker candidates identified from a discovery phase need to
be tested in hundreds of samples over a short period of time and with
reasonable cost, confirmatory technologies should have a high throughput
capability for analyzing hundreds of samples with good precision and
accuracy, be capable of multiplexing to evaluate a significant number of
biomarker candidates at a time, require minimal sample consumption (because
the sample amount may be limited), and have low assay cost.

23.2 Platforms for Qualification/Verification, Advantages and Disadvantages

1. Enzyme-linked immunosorbent assay (ELISA)

ELISA has been extensively used in verification of biomarkers. It offers
extraordinary sensitivity (low pg/mL) [4, 5]. This technique has high sample
throughput, and is capable of analyzing hundreds of samples with good
precision. For example, ELISA can reliably measure interleukin (IL)-6 at
concentrations as low as 0.15 pg/mL with a coefficient of variation (CV) of
5 % [2]. However, only a small number of potential biomarker candidates have
immunoassay-grade antibody pairs available. Developing a new, clinical-grade
ELISA assay is costly ($100,000–$4 million per biomarker candidate),
time-consuming (1–1.5 years), and associated with a high failure rate [6].
It is even more difficult to develop multiplex ELISA assays for a large
number of protein targets because of the possible cross-reactivity between
antibodies [7, 8]. Taken together, ELISA technology is not well-suited for
quantifying a large number of protein candidates in the qualification and
verification study.

2. Selected reaction monitoring (SRM)

A number of targeted mass spectrometry approaches have emerged recently,
such as accurate inclusion mass screening (AIMS), parallel reaction
monitoring (PRM), SRM, and data-independent acquisition (DIA-MS/MS) coupled
with targeted data extraction. These approaches have tremendous promise for
specific, reproducible, and quantitative measurements of changes of proteins
of interest in clinical research. Among them, SRM is currently the most
widely used approach for biomarker qualification and verification.

SRM-MS has emerged as a favorable alternative to immunoassays for
qualification and
verification of candidate biomarkers. In an SRM-MS assay, one or two
signature proteotypic peptides are selected to stoichiometrically represent
the protein candidate of interest. The SRM analysis of these signature
peptides is performed on a triple quadrupole mass spectrometer (QQQ-MS). In
SRM assays, the precursor ion of interest is preselected in the first mass
filter (Q1), and stimulated to fragment by collision-induced dissociation in
the second quadrupole (Q2). Several preselected fragments are analyzed by
the third mass filter (Q3). The signals of the fragment ions are then
monitored over the chromatographic elution time. SRM-MS offers several
attractive features as a qualification/verification assay. First, because
only preselected precursor-product ion transitions are monitored in SRM
mode, the noise level is significantly reduced and thereby SRM assays
decrease the lower detection limit for peptides by up to 100-fold in
comparison to full scan MS/MS analysis. Second, if the precursor-product ion
transition of one proteotypic peptide is unique to the protein of origin, it
is not only distinguishable from other MS signals in one LC run, but it is a
characteristic signature for the protein of interest. Therefore, the two
filtering stages in SRM mode result in near-absolute structural specificity
for the target protein, representing a significant advantage over
immunoassays. Third, because no affinity reagent is typically needed, SRM
assays can be rapidly and cost-efficiently developed in comparison to
immunoassays. Finally, SRM assays have multiplexing capability. Hundreds of
precursor-product ion transitions can be monitored in SRM mode over one LC
run, allowing for the simultaneous quantification of tens to one hundred
protein biomarker candidates in parallel.

SRM-MS in combination with stable-isotope dilution (SID-SRM-MS) is a
target-driven approach for direct quantification of target proteins in a
complex mixture [9]. In stable isotope dilution experiments, 13C- or
15N-labeled absolute quantification peptide standards (AQUA) [9], concatemer
of standard peptides (QCAT) [10, 11], or isotope-labeled full-length target
proteins [12, 13] are added to the sample as the internal standard. The
sample is trypsin digested, and the resultant mix of unlabeled and labeled
peptides is analyzed by SRM-MS. Absolute quantification of the target
protein can be done by comparing the abundance of the known internal
standard peptide with its native peptide when well-qualified isotope-labeled
full-length protein standards are available.

The use of stable isotope-labeled peptides as internal standards has
significantly increased the detection confidence and measurement precision
in SRM experiments. In SRM, only 3–5 fragment ions from the preselected
precursor ions are typically monitored. When it is used for analyzing the
target analytes from a highly complex system such as plasma, this assay may
be prone to matrix-related interference. Co-eluting matrix components can
produce the same SRM transitions as the analytes of interest, resulting in
false-positive identification and inaccurate quantification. Matrix
components can also cause ion suppression by competing for available protons
in the spray droplets. When matrix components co-elute with analytes of
interest, they will cause variation in ion current response in different
samples, severely affecting the precision, accuracy, and sensitivity of
quantification. The stable isotope-labeled peptides have the same structures
as their endogenous counterparts and, as a result, co-elute in LC
fractionation. When ion suppression occurs, the suppression will affect both
endogenous and stable isotope-labeled peptides to the same degree.
Therefore, the ratio of analyte to its internal standard will not be
affected by ion suppression. The LC retention time of stable isotope-labeled
peptides can also be used as the landmark to pinpoint the LC peak of the
endogenous peptide. Furthermore, stable isotope-labeled peptides generate
identical sets of fragment ions as the endogenous counterparts. The relative
abundances of the fragment ions of stable isotope-labeled peptides can serve
as a reference to distinguish the true signal of targeted native peptides
from other co-eluting isobaric peptides. It will be important to demonstrate
that the LC retention time and the relative abundances of the fragment ions
of the native peptide are near identical with the stable isotope
labeled internal standards. This usually requires a significant amount of
time and effort to manually inspect the SRM data to ensure the accuracy of
quantification [14–16]. Several bioinformatics tools, such as mProphet [17]
and AudIT [18], have been developed to overcome these problems. mProphet
uses criteria such as relative intensities from reference spectra,
correlation with the reference spectra, retention time deviation, and
co-elution to generate a single score and compute the error rate of the
measurement. AudIT identifies contaminated transitions. It relies on
reference peptides and technical replicates.

SID-SRM is well-suited for highly reproducible quantification across many
samples and, in fact, also across different mass spectrometers and
laboratories. Recently, the Clinical Proteomic Tumor Analysis Consortium led
a landmark multisite assessment study with a focus on the reproducibility of
SID-SRM-MS assays between runs, between laboratories, and between mass
spectrometer manufacturers [19]. In this study, the precision and
reproducibility of SRM-based measurements of proteins spiked in a background
of human plasma were assessed over nine different laboratories with mass
spectrometers from two manufacturers. The results are very promising, with a
10–23 % inter-laboratory CV, a variance that includes variations in sample
preparation and MS platforms.

Compared to ELISA, an SRM-MS assay can be developed with a short lead-time
(1–3 months). A critical step in SRM-MS assay development is the selection
of suitable transitions for a target peptide [14]. Consideration is given to
fragment ions that provide the highest signal intensity and lowest level of
interfering signals. We previously reported a pathway for SRM assay
development and optimization, an approach that requires both empirical and
bioinformatics tools [14]. Several interfaces (for example, MRMaid [20],
MRMer [21], and MaRiMba [22]) use fragment spectra from shotgun experiments
to help in designing favorable transitions for target peptides. For SRM
assay design in analyses of complex samples it is also important to infer
retention times. Software has been developed to realign and to predict
elution times [3, 23]. Transitions extracted for an SRM assay need to be
confirmed by addressing the likelihood that the chosen transitions and their
intensity distributions are associated with the target peptide. Several
freely available software products (for example TIQAM, ATAQS [24], and
Skyline [25]) integrate many of the above-mentioned tasks and automate assay
development for peptides (peptide and transition selection), data
evaluation, and analysis of SRM traces. A publicly available SRM assay
database, SRMAtlas (www.srmatlas.org), features SRM assays for about 99 % of
human proteins. This database was generated from high-quality measurements
of natural and synthetic peptides conducted on a QQQ mass spectrometer and
is intended as a resource for SRM-based proteomic workflows. Furthermore, to
consider the detectability of the SRM assays, PASSEL [26] was created as a
combined catalog of the best-available transitions selected from
PeptideAtlas shotgun data and SRMAtlas, providing the validation information
of all assays in the context of a specific sample. Huttenhain et al. [27]
developed SRM assays for 1172 cancer-associated proteins. Using these SRM
assays in the clinical samples, 182 proteins were detected in depleted
plasma and 408 proteins were detected in urine. These databases of SRM
assays are, therefore, valuable resources for designing and accelerating
biomarker qualification/verification studies.

Some advancement in instrument design has helped to improve the sensitivity
and specificity of SRM assays. For example, in most SRM analyses, the first
quadrupole (Q1) usually uses unit resolution (m/z window 0.7 full width at
half maximum (FWHM)). This large m/z window allows other co-eluting sample
constituents with similar m/z to pass through Q1 and interfere with
detection of the desired target. The frequency of these interferences
increases as the complexity of the sample increases. Narrower mass windows
in Q1 will increase selectivity for precursor ions, at the cost of a steep
decline in signal as these windows are narrowed to <0.5 FWHM. The Thermo
Scientific TSQ Quantum line of triple quadrupole mass spectrometers offers a
new technique called highly selected reaction
monitoring (H-SRM). With the advancement of the technology, the m/z window
in Q1 can be narrowed to 0.1–0.2 FWHM to increase the specificity without
sacrificing sensitivity. The practical advantage of H-SRM is that it
dramatically reduces isobaric chemical noise, thereby increasing the
signal-to-noise (S/N) [15, 16], which translates to improved lower limits of
quantification (LLOQs) and higher confidence in the quantification results.
Improvements in the design of the nano-electrospray source and interface and
the application of ion-funnel technology to triple-quadrupole mass
spectrometers have been proven to increase the ionization efficiency and ion
transmission, thus improving the LLOQ of SRM-MS [28, 29]. Application of
further stages of ion filtering in QQQ MS increases the sensitivity and
specificity of SRM in MRM3. This technique uses a hybrid quadrupole/linear
ion trap instrument and monitors reconstructed ion chromatograms on
secondary product ions derived from a trapped primary product ion [30, 31].
MRM3 can improve the limit of quantification by a factor of two to four and
enables protein biomarker quantification in the low ng/mL range in
non-depleted human serum without using immunoaffinity enrichment. The
drawback of this method is that it requires much longer acquisition times
(350 ms) for each transition in comparison to regular SRM (6–20 ms), which
reduces the number of data points that can be sampled over a given
chromatographic peak and the number of peptides that can be monitored in one
acquisition cycle.

3. Parallel reaction monitoring (PRM)

SRM is primarily performed on a triple quadrupole MS. With the newly
introduced high resolution and mass accuracy (HR/AM) instruments (e.g., Q
Exactive quadrupole-Orbitrap or quadrupole-TOF mass spectrometers), a new
targeted proteomics approach referred to as PRM has been developed [32, 33].
PRM has been used to measure amyloid-β, a biomarker for Alzheimer disease,
in cerebrospinal fluid. The assay shows performance similar to SRM, with a
recovery of 100 % (15 %), and intra-assay and inter-assay imprecision of 5
and 6.4 % [34]. The operation of PRM is similar to that of SRM. The
precursor ions of the target peptides are isolated in the quadrupole mass
filter and transferred to higher energy C-trap dissociation (HCD) cells for
fragmentation. The fragment ions are measured by an HR/AM Orbitrap mass
analyzer instead of the third quadrupole used in SRM. The use of an Orbitrap
mass analyzer presents specific advantages. First, instead of the 3–5
transitions monitored by the Q3 mass analyzer in SRM, PRM acquires a full
MS/MS spectrum which contains all of the potential fragment ions of one
targeted peptide, which can significantly improve the confidence of
identification of the LC peaks of target peptides. Second, the Orbitrap
provides additional data on assay selectivity. In the case of complex
samples, the interfering matrix ions co-isolated with the precursors of
target peptides can sometimes generate fragment ions which have m/z values
similar to those of the monitored transitions. These two signals sometimes
cannot be separated by a quadrupole mass analyzer with an isolation width of
0.7–1.0 m/z and may cause false positive identification and inaccurate
quantification. The Orbitrap mass analyzer can separate fragment ions with
an m/z difference higher than 10 ppm; this mass accuracy and resolution is
much greater than that of the quadrupole. This feature enables PRM
technology to more effectively separate fragment ions of interest from
interfering ions and improve the selectivity of quantification. The enhanced
selectivity and specificity of the PRM method can result in better
sensitivity of quantification [32, 35]. Performance comparison between PRM
and SRM shows that the linearity and dynamic range of PRM can also rival the
traditional SRM approach. However, it is clear that SRM has superior
quantitative precision [33]. The imprecision of PRM arises largely because
PRM relies on the Orbitrap mass analyzer, which is fundamentally less
sensitive and has a slower data acquisition rate than a quadrupole mass
analyzer. Quadrupole mass analyzers operate at a duty cycle nearing 100 %
and have the ability to sample more points over a given chromatographic
peak, thus providing a more accurate
quantification of the LC peak and, in turn, greater precision and run-to-run
repeatability. The Orbitrap requires a much longer scan time and a
40–120 ms Orbitrap injection time, which significantly decreases the duty
cycle of acquisition. This reduces the data points sampled over a given
chromatographic peak, resulting in lower precision and repeatability of
quantification. This feature limits the number of possible peptides that can
be monitored in one PRM acquisition cycle. To increase multiplex capability,
PRM requires time-scheduled acquisition, which relies on the availability of
a high-quality local spectral library with well-calibrated peptide
chromatographic elution times. Unlike SRM, PRM does not require significant
effort for assay development, but it requires the high-quality local
spectral library to confirm the identity of the analytes and assess
measurement quality, especially when stable isotope-labeled standards are
not used.

4. Accurate inclusion mass screening (AIMS)

AIMS is another emerging targeted mass spectrometry-based proteomic
technique [36]. In AIMS acquisition, a list of pre-selected precursor ions
is used to generate an “inclusion list” for MS acquisition [37, 38]. Only
precursors represented on the “inclusion list” will be selected for
fragmentation if they are detectable in a survey scan. Compared to the
untargeted data-dependent LC-MS/MS acquisition (DDA-MS/MS) approach used in
the discovery study, AIMS significantly improves the level of
reproducibility, sensitivity, and dynamic range by restricting detection and
fragmentation to only those peptides derived from proteins of interest. It
is at least fourfold more efficient at detecting peptides of interest than
DDA-MS/MS [36]. The analytical performance of AIMS is less satisfactory than
SRM in terms of accuracy, sensitivity, specificity, and dynamic range.
However, because AIMS has the ability for time-scheduled monitoring of over
1000 peptides in a single LC-MS run, it can be used as a targeted approach
for data-dependent triage and prioritization of hundreds of candidate
biomarkers in a time- and cost-effective manner [39]. In a newly developed
targeted MS-based pipeline for biomarker verification, AIMS was implemented
between discovery and SRM-based verification study to confirm the
detectability of the candidates in plasma [39]. Only the candidates detected
in the plasma by AIMS will be advanced to SRM-based assay development for
more sophisticated quantitative comparison of the levels of the candidates
in cases vs controls. This strategy allows one to test a much larger number
of candidates than would have been possible with traditional SID-SRM-MS
based verification.

5. Data-independent MS/MS acquisition (DIA-MS/MS)

DIA-MS/MS is a new MS/MS acquisition technology [40, 41]. DIA-MS/MS carries
the acronyms Precursor Acquisition Independent From Ion Count (PAcIFIC)
[42], All-ion Fragmentation (AIF) [43], and Sequential window acquisition of
all theoretical mass spectra (SWATH) [44, 45]. DIA-MS/MS is an approach
where tandem mass spectra are acquired at every m/z value without regard for
whether a precursor ion is observed or not. In DIA-MS/MS, the direct
relationship between fragments and the precursor from which they originate
is lost, and assigning fragments to precursors can depend on the targeted
data extraction and the availability of extensive spectral libraries such as
PeptideAtlas [46, 47]. DIA-MS/MS demonstrates better sensitivity,
reproducibility and dynamic range than DDA-MS/MS, and allows consistent
quantification of proteins spanning a wide range of concentrations, e.g.,
125–10^6 copies/cell [44], a range well within the needs for quantifying
host cellular response profiles. Data-independent acquisition itself is not
a targeted approach, but in combination with targeted data analysis, it can
be used as an alternative to SRM assays in clinical research. In this
approach, a quantitative, digitalized proteomic recording (SWATH maps) will
be generated for each clinical sample as a personalized digital
representation for each patient [48]. The profile of proteins of interest
can then subsequently be extracted in a targeted
fashion using assay information derived from mass spectrometric reference
maps. In a recent study, N-linked glycoproteins in human plasma were
enriched with solid phase extraction, then analyzed by both SWATH maps and
SRM [45]. SWATH maps coupled with targeted data extraction showed less
sensitivity than SRM, but achieved a higher analyte throughput and
comparable dynamic range, reproducibility, and accuracy if stable
isotope-labeled peptides of analytes were used as internal standards. This
finding indicates that SWATH maps can be used as a targeted, reproducible
quantitative approach for biomarker qualification/verification in less
complicated samples [45]. Furthermore, SWATH maps are permanent digital maps
and can be easily re-examined for qualification/verification of new sets of
biomarker candidates without reanalyzing the physical samples [48]. Although
SWATH maps require little assay development, they can be useful only when
high-quality MS/MS spectral reference maps with well-calibrated elution
times are available and can be replicated on the instrument used for SWATH
MS analysis. SWATH generates highly complex and overlapping MS/MS spectra,
and significant bioinformatic effort is required for analyzing SWATH data.
Some special bioinformatic tools, such as openSWATH [49] and Spectronaut
(www.biognosys.ch), have been developed for facilitating targeted data
extraction from SWATH maps data and quantification.

We therefore summarize the benefits and tradeoffs inherent to each platform
for biomarker verification with respect to the main factors characterizing
measurements: accuracy, sensitivity, specificity, reproducibility,
precision, dynamic range, sample throughput, analyte throughput, ease of
assay development, and ease of data analysis (Fig. 23.1). Each method
entails a compromise that maximizes the performance at some level, while
reducing it at others. For example, PRM has higher specificity than SRM, but
lower reproducibility and precision of quantification; SWATH can
significantly improve analyte throughput but at the cost of specificity and
accuracy. Given that SRM has the best overall analytical performance, it is
considered the gold standard approach for biomarker qualification and
verification.

Because the odds of discovery of a clinically useful biomarker or biomarker
panel are extremely low, a large number of biomarker candidates must be
tested in a qualification phase. Developing SID-SRM assays for every
candidate identified by a discovery study will become very costly and time
consuming. A small number of candidates must be selected from the many
hundreds of available candidates. Therefore, the qualification phase can be
further divided into two steps: triage and quantification.
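
The split into triage and quantification can be pictured as a filter followed by a capped selection. The sketch below is only an illustration of that idea, not the authors' pipeline: the `Candidate` fields, the `discovery_score` used for ranking, and the `max_assays` budget are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    protein: str
    detected_in_triage: bool  # hypothetical flag from the lower-cost screen
    discovery_score: float    # hypothetical ranking score from discovery


def select_for_quantification(candidates, max_assays):
    """Keep triage-positive candidates, best discovery score first.

    max_assays stands in for the number of quantification assays
    that the study can afford to develop.
    """
    if max_assays < 0:
        raise ValueError("max_assays must be non-negative")
    passed = [c for c in candidates if c.detected_in_triage]
    ranked = sorted(passed, key=lambda c: c.discovery_score, reverse=True)
    return ranked[:max_assays]


if __name__ == "__main__":
    # Placeholder candidates, invented for illustration.
    pool = [
        Candidate("protein_A", True, 3.1),
        Candidate("protein_B", False, 4.0),
        Candidate("protein_C", True, 1.2),
        Candidate("protein_D", True, 2.5),
    ]
    for chosen in select_for_quantification(pool, max_assays=2):
        print(chosen.protein)
```

A candidate that is not detectable in the triage screen is dropped regardless of how promising it looked in discovery, so assay development effort is spent only on analytes that can be measured in the target matrix.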

Fig. 23.1 Performance profiles comparing technical advantages and disadvantages of target MS platforms used in biomarker verification study

In the triage step, the biomarker candidates are measured by targeted but less costly assays [39]. Among the platforms available for biomarker qualification/verification, PRM, AIMS, and SWATH are able to test and triage large numbers of candidates at lower expense and with less lead time for assay development. These assays can be developed easily if local, high-quality MS/MS spectral reference maps with well-calibrated elution times are available and can be replicated on the instrument used for analyzing the clinical samples. Only the candidates that pass the triage step are advanced to the more expensive SID-SRM quantification. This staged qualification/verification strategy makes it possible to test as many candidates as possible at reasonable cost and in reasonable time, improving the chance of discovering clinically useful biomarker panels.

23.3 Pre-fractionation and Enrichment Technologies

Ideally, SRM assays would be applied to verify biomarker candidates directly from plasma or serum without upfront sample fractionation. Direct analysis is efficient, reproducible, high throughput, and less prone to errors and analytical variation. In recent studies, high- and medium-abundance human plasma proteins have been quantified by multiplexed SRM without further sample preparation. Kuzyk et al. reported the simultaneous quantification of 45 major plasma proteins with a CV below 20 % for 94 % of the measured peptides [50]. Anderson et al. reported that 47 major plasma proteins were quantified with in-run CVs of 2–22 % [51]. The least abundant protein quantified, L-selectin, had a measured concentration of 0.67 μg/mL, 4–5 orders of magnitude lower than the concentration of albumin in plasma. Addona et al. tested the LLOQ of SRM assays of target proteins in human plasma [18]. Eight of ten tested peptides had median LLOQ values between 0.66 and 2.0 fmol/μL when peptides were added into 1:60 diluted plasma (equivalent to 0.70–3.34 μg/mL of protein in plasma). These studies demonstrate that SRM assays can reliably quantify the classic plasma protein biomarkers at concentrations above 1 μg/mL directly in plasma. This LLOQ, however, is not sufficient for unambiguous detection and quantification of lower-concentration protein biomarkers, such as tissue leakage products, interleukins, and cytokines, directly from plasma (Fig. 23.2). The lack of sensitivity when SRM assays are applied directly to plasma is mainly caused by matrix-related interference and ion suppression. Plasma is an extremely complex mixture of proteins spanning a concentration range of 11 orders of magnitude, in the presence of endogenous salts, lipids, and metabolites. These matrix components have a deleterious effect on the sensitivity of SRM assays. Competition for ionization between the analytes of interest and other endogenous species (such as salts, lipids, and metabolites) or exogenous species (such as polymers extracted from plastic tubes) causes the ion suppression effect. When these interfering species elute at the same time as the analyte of interest, the analyte signal is suppressed [52]. Some matrix components can also produce the same product ions as those monitored for the analytes of interest, giving rise to chemical and biological noise that reduces the S/N ratio needed for detection and quantification. To overcome these sensitivity barriers, a variety of sample preparation strategies have been developed for targeted protein quantification. They aim to reduce sample complexity while maintaining the requirements for high accuracy, reproducibility, and throughput.

23.3.1 Depletion of High-Abundance Proteins

Depletion of the highest abundance plasma proteins using affinity columns is the simplest way to reduce sample complexity. In one study, Keshishian et al. reported that depletion of the 12 highest abundance plasma proteins improved the SRM assay LLOQ to 25 ng/mL [2]. The combination of depletion with strong cation exchange chromatography (SCX) further
23 Qualification and Verification of Protein Biomarker Candidates 503

Fig. 23.2 Comparison of the LLOQ of different strategies for the quantification of protein biomarkers in plasma. A
schematic diagram of the source and target concentration ranges of candidate plasma biomarkers. At right is LLOQ of
current reported verification assay (Taken from Zhao, Current Proteomics, permission required)
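
The LLOQ values compared in Fig. 23.2 are reported in different units from study to study (fmol/μL of peptide in diluted plasma versus μg/mL of protein in plasma). As a minimal sketch of that unit conversion, the Python function below turns a peptide LLOQ into an equivalent plasma protein concentration, assuming one mole of signature peptide per mole of protein and that "1:60 diluted" means a 60-fold dilution. The function name and the 17.7 kDa molecular weight are illustrative assumptions, not values taken from the cited study.

```python
def peptide_lloq_to_plasma_ug_per_ml(lloq_fmol_per_ul, dilution_factor, protein_mw_da):
    """Convert a peptide LLOQ in diluted plasma to ug/mL of protein in neat plasma.

    1 fmol/uL equals 1 nmol/L; nmol/L multiplied by g/mol gives ng/L;
    dividing ng/L by 1e6 gives ug/mL.
    """
    nmol_per_l_in_plasma = lloq_fmol_per_ul * dilution_factor
    ng_per_l = nmol_per_l_in_plasma * protein_mw_da
    return ng_per_l / 1e6


# Hypothetical 17.7 kDa protein with a 0.66 fmol/uL LLOQ in 60-fold diluted plasma
example = peptide_lloq_to_plasma_ug_per_ml(0.66, 60, 17_700)
print(f"{example:.2f} ug/mL")  # prints "0.70 ug/mL"
```

The same LLOQ in fmol/μL therefore maps to a higher mass concentration for a larger protein, which is one reason a reported range in μg/mL depends on which proteins were tested.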

improved the LLOQ of the SRM assay to 1–10 ng/mL with a CV below 15 % [53]. However, this approach is impractical for biomarker qualification/verification because extensive prefractionation of samples into many subfractions substantially reduces the throughput of the entire assay.

23.3.2 Enrichment of Target Proteins or Peptides Using Affinity Chromatography

Specifically isolating the target proteins or peptides from human plasma by affinity purification is the most efficient way to reduce sample complexity. This approach is based on the highly specific interaction between the targeted proteins and affinity ligands, such as antibodies, aptamers, or lectins.

Pre-fractionation is especially useful for quantification of low-abundance proteins in plasma. In our recent qualification and verification study of a dengue fever biomarker panel, we found that the circulating level of one of the biomarker candidates, Complement Factor D (CFD), was below the LLOQ of the SID-SRM-MS assay and could not be detected in unfractionated plasma. To address this issue, we developed an assay in which CFD was first immunoprecipitated (IPed) from plasma with an anti-CFD antibody and then quantified with SID-SRM-MS [54]. The CFD protein in each sample

was IPed with biotin-conjugated anti-CFD antibody. The complex of CFD and its antibody was captured by streptavidin magnetic beads. Stable isotope-labeled CFD signature peptide was spiked into each sample, the proteins were trypsin-digested, and CFD abundance was quantified with SID-SRM-MS. This approach significantly improved the sensitivity of the assay.

IP-SRM can be multiplexed using a mixture of magnetic beads carrying different antibodies to increase the throughput of the assay. Nicol et al. used this approach to quantify multiple proteins from human sera simultaneously [55]. The assays extend the LLOQ of the SRM assay to the low ng/ml range with good accuracy.

A newly emerging immuno-affinity-SRM approach termed stable isotope-labeled standards with capture by anti-peptide antibodies (SISCAPA) was developed by Anderson et al. [56]. It uses immobilized anti-peptide antibodies to enrich the target peptides together with the previously spiked synthetic stable isotope-labeled peptides. With this method, more than 1000-fold enrichment of target peptides in a plasma digest can be achieved. In several studies, individual SISCAPA-SRM assays have been successfully configured for quantifying biomarkers in the ng/μL range in plasma with CV < 20 % [56–58]. Protein concentrations determined by this method correlate highly with results obtained using a commercial immunoassay [57, 59], demonstrating that the method can quantify low-abundance proteins with high accuracy. SISCAPA-SRM-MS can potentially be multiplexed to measure many peptides in one assay by using a mixture of magnetic beads carrying different anti-peptide antibodies. Whiteaker et al. demonstrated that up to nine peptides can be enriched simultaneously with an LLOQ in the low ng/ml range (from 10 μl of plasma) and a median coefficient of variation of 12.6 % [58]. They also demonstrated that the LLOQ can be extended to the low pg/ml range of protein concentration when larger volumes of plasma (1 ml) are used. This method holds great promise for verifying biomarker candidates. Interlaboratory evaluation of SISCAPA indicated that its limits of detection were at or below 1 ng/ml for the assayed proteins in 30 μl of plasma. Assay reproducibility was acceptable for verification studies, with median intra- and inter-laboratory CVs above the limit of quantification of 11 % and <14 %, respectively, for the entire immuno-MRM-MS assay process, including enzymatic digestion of plasma [60]. SISCAPA has two advantages over immunoaffinity capture of target proteins: (1) it avoids potential interference from endogenous antibodies in the sample, because they are digested to peptides by trypsin, and (2) anti-peptide antibodies are easier to generate than anti-protein antibodies. The limitation of this type of enrichment strategy is that a specific antibody must be generated for each tryptic peptide used for a target protein. An alternative approach is the use of aptamers, oligonucleotide sequences with molecular recognition properties that are selected from combinatorial oligonucleotide libraries [61]. Aptamers bind protein ligands with high affinity and specificity [62]. They are easy to generate because they are chemically synthesized, which enables standardization of assays across multiple lots, a feature not possible with polyclonal antibodies, for example.

23.3.3 Sample Fractionations for Protein Adduct or Fragments

Potential biomarkers may be proteins with post-translational modifications or peptide fragments derived from endogenous proteins. To quantify these candidates unambiguously, they first have to be separated from their canonical forms. In our recent biomarker discovery study of dengue fever, we found that a high molecular weight (>250 kDa) form of albumin is associated with dengue fever virus infection [63]. The nature of this protein is incompletely characterized, but it is probably a covalently linked polymer [63]. To verify the high molecular weight albumin isoform, in our NIAID-funded Clinical Proteomics Center, we developed a capillary electrophoresis

(CE) based fractionation approach. For CE fractionation, plasma samples were separated after spike-in with Beckman protein size standards. The 250 kDa fraction was collected into a receiving vial. The SDS in each collected CE fraction was removed with an SDS sample cleaning kit (Bio-Rad). The protein pellets were redissolved in 8 M urea. The proteins were digested with trypsin and quantified with the SID-SRM-MS assay. Similarly, for peptide fragments derived from endogenous proteins, size-based separation approaches such as size-exclusion chromatography (SEC) can be used. For example, in our recent biomarker discovery study of Aspergillosis (Discovery of Candidate Biomarkers, Chap. 20), we identified 26 small peptides in plasma. These peptides are fragments of endogenous proteins such as albumin, apolipoprotein A-I, and haptoglobin. To quantify these peptides, we first used size-exclusion chromatography to separate the denatured plasma into protein and peptide pools (MW <17 kDa). The concentration of the 26 peptide fragments in the peptide pool was then quantified with SID-SRM-MS.

The qualification and verification strategies that were used for Dengue fever virus-3, infectious Aspergillosis, and Chagasic Cardiomyopathy are summarized in Tables 23.1, 23.2, 23.3, and 23.4.

23.4 Feature Reduction/Candidate Selection

The qualification/verification phase seeks to reduce the number of candidate biomarkers to those most informative for general application in a clinical setting. Another goal of qualification/verification is to test the statistical model that combines several of the informative features. Feature reduction aims to decrease the number of input variables to the model. A smaller number of input variables enhances the quality of the data, increases the predictive power of the biomarker panel, and makes the results more understandable and more robust for application to broader populations. This is a statistical approach that uses quantitative information derived from any of the qualification/verification assays described above. Approaches for feature reduction include pairwise statistical comparison; significance analysis of microarrays (SAM), a technique that estimates the false discovery rate (FDR) in high-dimensional datasets; regression modeling; and machine learning techniques such as classification and regression trees (CART), multivariate adaptive regression splines (MARS), or ensemble methods. The application of these approaches is described more fully in Chap. 20.

23.5 Consideration in Designing Quantification/Verification Study

1. Selection of sample cohorts for verification study

As described in the Introduction to Proteomic-derived Biomarkers (Chap. 20), the samples in the qualification phase are the same samples used in the discovery phase. The verification phase involves measuring the candidates independently in a larger number of samples, collected from patients with a similar diagnosis and from control patients, that are separate from those assayed in the discovery phase of the biomarker pipeline. For the qualification/verification phase to be meaningful, reproducible, observer-independent criteria for case definition need to be applied. Samples should represent a meaningful sampling of the patient cohort. Specifically, the biospecimens should be derived from components of the cohort that meet the same objective criteria for cases and controls as those used for the discovery analysis.

2. Statistical design for verification study

The statistical design for the verification phase should be developed based on considerations of the effect size, the outcomes (classes) in the experimental cohort, and the experimental goal, e.g., is the focus to test the performance of a

Table 23.1 Qualification and verification strategies for candidate plasma proteins for Dengue fever virus-3
                                                                 Qualification/Verification strategy
Biomarker candidate                       Gene name  Accession # Pre-fraction  Quantification  SRM signature peptide
Alpha-1-antitrypsin                       SERPINA1   P01009      –             SID-SRM-MS      SVLGQLGITK
Leucine-rich alpha-2-glycoprotein         LRG1       P02750      –             SID-SRM-MS      GQTLLAVAK
Alpha-2-macroglobulin                     A2M        P01023      –             SID-SRM-MS      QGIPFFGQVR
Serum albumin                             ALB        P02768      –             SID-SRM-MS      LVNEVTEFAK
Apolipoprotein A-I                        APOA1      P02647      –             SID-SRM-MS      DYVSQFEGSALGK
Apolipoprotein C-III                      APOC3      P02656      –             SID-SRM-MS      DALSSVQESQVAQQAR
Complement factor D                       CFD        P00746      –             SID-SRM-MS      VQVLLGAHSLSQPEPSK
Complement factor H                       CFH        P08603      –             SID-SRM-MS      SPDVINGSPISQK
Complement C4-A                           C4A        P0C0L4      –             SID-SRM-MS      VGDTLNLNLR
Desmoplakin                               DSP        P15924      –             SID-SRM-MS      TLELQGLINDLQR
Fibrinogen alpha chain                    FGA        P02671      –             SID-SRM-MS      GSESGIFTNTK
Fibrinogen beta chain                     FGB        P02675      –             SID-SRM-MS      SILENLR
Ferritin light chain                      FTL        P02792      –             SID-SRM-MS      LNQALLDLHALGSAR
Hemopexin                                 HPX        P02790      –             SID-SRM-MS      NFPSPVDAAFR
Haptoglobin                               HP         P00738      –             SID-SRM-MS      VGYVSGWGR
Ig gamma-1 chain C region                 IGHG1      P01857      –             SID-SRM-MS      GPSVFPLAPSSK
Immunoglobulin J chain                    JCHAIN     P01591      –             SID-SRM-MS      ENISDPTSPLR
Ig kappa chain C region                   IGKC       P01834      –             SID-SRM-MS      TVAAPSVFIFPPSDEQLK
Keratin                                   KRT1       P04264      –             SID-SRM-MS      SLDLDSIIAEVK
Dengue-2 virus NS1 nonstructural protein  NS1        Q67431      –             SID-SRM-MS      SCTLPPLR
Tropomyosin alpha-4 chain                 TPM4       P67936      –             SID-SRM-MS      LVILEGELER
Vimentin                                  VIM        P08670      –             SID-SRM-MS      VELQELNDR
Complement Factor D                       CFD        P00746      IP            SID-SRM-MS      VQVLLGAHSLSQPEPSK
Low MW Desmoplakin                        DSP        P15924      CE            SID-SRM-MS      TLELQGLINDLQR
High MW albumin                           ALB        P02768      CE            SID-SRM-MS      LVNEVTEFAK
For each of the candidate plasma proteins, SID-SRM-MS assays were developed. Shown is the protein accession
number, common name, pre-fraction technology, and signature sequence
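
Every assay in Table 23.1 relies on SID-SRM-MS for quantification. As a rough sketch of the underlying arithmetic, assuming a single-point calibration in which the endogenous (light) peptide and its stable isotope-labeled (heavy) counterpart respond identically, the endogenous amount is the light-to-heavy peak-area ratio multiplied by the amount of heavy standard spiked in. The function name, peak areas, and spike amount below are hypothetical.

```python
def sid_quantify(light_areas, heavy_areas, heavy_spiked_fmol):
    """Estimate the endogenous peptide amount by stable isotope dilution.

    light_areas / heavy_areas: peak areas of the monitored transitions for the
    endogenous and the isotope-labeled peptide (same transitions, same order).
    Returns the endogenous amount in the same unit as heavy_spiked_fmol.
    """
    if len(light_areas) != len(heavy_areas):
        raise ValueError("light and heavy transition lists must have equal length")
    heavy_total = sum(heavy_areas)
    if heavy_total <= 0:
        raise ValueError("heavy standard signal must be positive")
    ratio = sum(light_areas) / heavy_total
    return ratio * heavy_spiked_fmol


# Hypothetical peak areas for three transitions of one signature peptide
light = [12000.0, 8000.0, 4000.0]
heavy = [24000.0, 16000.0, 8000.0]
print(sid_quantify(light, heavy, heavy_spiked_fmol=50.0))  # prints 25.0
```

Summing transitions before taking the ratio is only one possible choice; per-transition ratios can also be compared to flag interference on a single transition.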

Table 23.2 Qualification and verification strategies for candidate plasma proteins for infectious Aspergillosis
                                                                 Qualification/Verification strategy
Biomarker candidate                       Gene name  Accession # Pre-fraction  Quantification  SRM signature peptide
Alpha-1-acid glycoprotein 1               ORM1       P02763      –             SID-SRM-MS      YVGGQEHFAHLLILR
Alpha-1-antitrypsin                       SERPINA1   P01009      –             SID-SRM-MS      SVLGQLGITK
Alpha-1-antichymotrypsin                  SERPINA3   P01011      –             SID-SRM-MS      EIGELYLPK
Serum albumin                             ALB        P02768      –             SID-SRM-MS      LVNEVTEFAK
Apolipoprotein A-I                        APOA1      P02647      –             SID-SRM-MS      DYVSQFEGSALGK
Apolipoprotein C-III                      APOC3      P02656      –             SID-SRM-MS      DALSSVQESQVAQQAR
Fibrinogen alpha chain                    FGA        P02671      –             SID-SRM-MS      GLIDEVNQDFTNR
Fibrinogen beta chain                     FGB        P02675      –             SID-SRM-MS      SILENLR
Leucine-rich alpha-2-glycoprotein         LRG1       P02750      –             SID-SRM-MS      GQTLLAVAK
For each of the candidate plasma proteins, SID-SRM-MS assays were developed. Shown is the protein accession
number, common name, pre-fraction technology, and signature sequence

biomarker to differentiate cases vs controls, or to evaluate the statistical model? The reader should refer to Statistical Approaches (Chap. 22) for more details.

3. Selection of assays – Fit-for-purpose concept

We propose a staged, fit-for-purpose strategy for designing a biomarker qualification/verification study [64, 65]. Depending on the number of biomarker candidates, their concentration in clinical samples, the feasibility of de novo assay development for the candidates, the analytical performance of the assays, and the cost of assay development and application for measuring a large number of targeted analytes across many samples, a qualification/verification study can consist of three steps: triage, quantification, and verification (Fig. 23.3). Triage and quantification are performed in the qualification phase with the same samples used in the discovery phase. One important lesson learned from the past 10 years of biomarker discovery studies is that the odds of identifying a clinically useful biomarker panel are extraordinarily low. To increase the chance of identifying a successful biomarker panel, researchers usually assemble a candidate pool for the qualification study from several sources, including local proteomic and transcriptional profiling experiments as well as data from the published literature. The candidate pool can become very large, and these candidates may not be directly associated with the disease of interest. When hundreds of candidates have to be tested in the qualification study, the study should start with a triage process that tests these candidates while containing cost. The goal of this triage process is to reduce the initial list of candidates to a small subset that will be quantified with SID-SRM in the quantification stage. The technology used in this step should have a higher capacity to triage a large number of candidates at lower expense and with shorter lead time for assay development. The assay should have enough specificity and precision to measure semi-quantitatively the relative changes in the level of a large number of analytes across

Table 23.3 Qualification and verification strategies for candidate plasma peptides for infectious Aspergillosis
                                                      Qualification/Verification strategy
Biomarker candidate           Gene name   Accession # Pre-fraction  Quantification  SRM signature peptide
Serum albumin                 ALBU_671    P02768      BAP           SID-SRM-MS      AVMDDFAAFVEK
Serum albumin                 ALBU_734    P02768      BAP           SID-SRM-MS      RHPDYSVVLLLR
Serum albumin                 ALBU_756    P02768      BAP           SID-SRM-MS      VPQVSTPTLVEVSR
Serum albumin                 ALBU_820    P02768      BAP           SID-SRM-MS      KVPQVSTPTLVEVSR
Apolipoprotein A-I            APOA1_615   P02647      BAP           SID-SRM-MS      QGLLPVLESFK
Apolipoprotein A-I            APOA1_618   P02647      BAP           SID-SRM-MS      DLATVYVDVLK
Apolipoprotein A-II           APOA2_486   P02652      BAP           SID-SRM-MS      SPELQAEAK
Apolipoprotein A-II           APOA2_578   P02652      BAP           SID-SRM-MS      SKEQLTPLIK
Apolipoprotein A-II           APOA2_600   P02652      BAP           SID-SRM-MS      VKSPELQAEAK
Glutathione peroxidase 3      GPX3_657    P22352      BAP           SID-SRM-MS      FLVGPDGIPIMR
Glutathione peroxidase 3      GPX3_665    P22352      BAP           SID-SRM-MS      FLVGPDGIPIM[Oxid]R
Haptoglobin                   HPT_720     P00738      BAP           SID-SRM-MS      TEGDGVYTLNNEK
Haptoglobin-related protein   HPTR_448    P00739      BAP           SID-SRM-MS      NPANPVQR
Haptoglobin-related protein   HPTR_656    P00739      BAP           SID-SRM-MS      TEGDGVYTLNDK
Ig kappa chain C region       IGKC_973    P01834      BAP           SID-SRM-MS      TVAAPSVFIFPPSDEQLK
Ig lambda-3 chain C regions   LAC3_495    P0CG06      BAP           SID-SRM-MS      AGVETTTPSK
Ig lambda-6 chain C region    LAC6_872    P0CF74      BAP           SID-SRM-MS      YAASSYLSLTPEQWK
Retinol binding protein 4     RBP4_599    Q5VY30      BAP           SID-SRM-MS      YWGVASFLQK
For each of the candidate plasma proteins, SID-SRM-MS assays were developed. Shown is the protein accession
number, common name, pre-fraction technology, and signature sequence
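
An SRM assay for a signature peptide starts from its precursor m/z. The sketch below computes the monoisotopic m/z of a peptide from standard residue masses. The chapter does not explain the numeric suffixes in the Gene name column of Table 23.3; reading them as the integer part of the doubly charged precursor m/z is our assumption, and it agrees with the values computed here for three entries. The function and constant names are illustrative.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056      # one H2O per peptide (free N- and C-termini)
PROTON = 1.00728      # mass of one charge-carrying proton
OXIDATION = 15.99491  # one added oxygen, e.g. methionine oxidation


def precursor_mz(sequence, charge=2, n_oxidations=0):
    """Monoisotopic m/z of an unmodified (or oxidized) peptide at a given charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    neutral += n_oxidations * OXIDATION
    return (neutral + charge * PROTON) / charge


for label, seq, n_ox in [
    ("ALBU_671", "AVMDDFAAFVEK", 0),
    ("GPX3_657", "FLVGPDGIPIMR", 0),
    ("GPX3_665", "FLVGPDGIPIMR", 1),  # FLVGPDGIPIM[Oxid]R
]:
    print(label, round(precursor_mz(seq, charge=2, n_oxidations=n_ox), 2))
```

Heavy-labeled standards shift this value by the mass of the incorporated isotopes divided by the charge, so the same function could be extended with a label offset if needed.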

large number of samples. The validation of the assays for triage will be minimal, covering specificity, precision, and run-to-run variation. Accuracy of quantification is not required. Although stable isotope-labeled standards for each analyte are not required for the triage process, we recommend spiking a constant set of stable isotope-labeled peptides corresponding to certain housekeeping proteins into the samples in the same amount. These standards can serve as benchmarks for normalization of run-to-run reproducibility and as landmarks for calibration of LC retention time. Targeted MS assays such as PRM, AIMS, and SWATH with targeted data extraction are well suited for this purpose. They can monitor the entire set of fragment ions for each analyte with high resolution and high mass accuracy.
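
The two uses of the spiked housekeeping standards described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' procedure: runs are scaled by the median intensity ratio of the shared standards, and retention times are mapped from the reference library to the current run with an ordinary least-squares line. All names and numbers are hypothetical.

```python
from statistics import median


def normalization_factor(reference_run, run):
    """Median ratio (reference / run) of the spiked standards' intensities.

    Both arguments map standard peptide names to measured intensities.
    Multiplying every analyte intensity in `run` by the returned factor
    puts it on the scale of `reference_run`.
    """
    ratios = [reference_run[p] / run[p]
              for p in reference_run if p in run and run[p] > 0]
    if not ratios:
        raise ValueError("no shared standards with positive intensity")
    return median(ratios)


def fit_rt_calibration(library_rt, observed_rt):
    """Least-squares line: observed_rt = slope * library_rt + intercept."""
    n = len(library_rt)
    mean_x = sum(library_rt) / n
    mean_y = sum(observed_rt) / n
    sxx = sum((x - mean_x) ** 2 for x in library_rt)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(library_rt, observed_rt))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical standards: the second run has half the signal and a 2 min delay
reference = {"std1": 1000.0, "std2": 2000.0, "std3": 4000.0}
second_run = {"std1": 500.0, "std2": 1000.0, "std3": 2000.0}
print(normalization_factor(reference, second_run))  # prints 2.0

slope, intercept = fit_rt_calibration([10.0, 20.0, 30.0], [12.0, 22.0, 32.0])
print(slope, intercept)  # prints 1.0 2.0
```

The median is used so that one standard affected by interference does not dominate the scaling; a nonlinear retention-time model may be preferable when gradients differ substantially between the library and the run.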

Table 23.4 Qualification and verification strategies for candidate protein markers for Chagasic Cardiomyopathy
                                                                             Qualification/Verification strategy
Biomarker candidate                                   Gene name  Accession # Pre-fraction  Quantification  SRM signature peptide
Serum albumin                                         ALB        P02768      –             SID-SRM-MS      LVNEVTEFAK
Annexin A3                                            ANXA3      P12429      –             SID-SRM-MS      LTFDEYR
Fibrinogen alpha chain                                FGA        P02671      –             SID-SRM-MS      GLIDEVNQDFTNR
Heterogeneous nuclear ribonucleoprotein A1            HNRNPA1    P09651      –             SID-SRM-MS      LFIGGLSFETTDESLR
SH3 domain-binding glutamic acid-rich-like protein 3  SH3BGRL3   Q9H299      –             SID-SRM-MS      VYSTSVTGSR
Tubulin-5                                             TUBB       P07437      –             SID-SRM-MS      YLTVAAVFR
Vimentin                                              VIM        P08670      –             SID-SRM-MS      VELQELNDR
For each of the candidate proteins, SID-SRM-MS assays were developed. Shown is the protein accession number,
common name, pre-fraction technology, and signature sequence

Fig. 23.3 Multistage, targeted proteomic workflow for biomarker qualification and verification
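
One way to read the fit-for-purpose idea behind Fig. 23.3 is as a lookup from a candidate's expected plasma concentration to the least demanding strategy whose reported LLOQ reaches it. The sketch below uses the approximate LLOQ values quoted in Sect. 23.3 (about 1 μg/mL for direct SRM, 25 ng/mL after depletion, about 1 ng/mL with immunoaffinity enrichment) and falls back to ELISA below that, as the chapter recommends for cytokines and interleukins. Collapsing the reported ranges into single thresholds is our simplification, and the names are hypothetical.

```python
# (strategy, approximate reported LLOQ in ng/mL), from least to most sample preparation
STRATEGIES = [
    ("direct SID-SRM on unfractionated plasma", 1000.0),
    ("depletion of high-abundance proteins, then SID-SRM", 25.0),
    ("immunoaffinity enrichment (IP-SRM or SISCAPA), then SID-SRM", 1.0),
]
FALLBACK = "ELISA (or SRM with enrichment from a larger plasma volume)"


def choose_strategy(expected_ng_per_ml):
    """Return the first strategy whose approximate LLOQ is at or below the expected level."""
    for name, lloq_ng_per_ml in STRATEGIES:
        if expected_ng_per_ml >= lloq_ng_per_ml:
            return name
    return FALLBACK


for conc in (5_000.0, 100.0, 5.0, 0.01):
    print(f"{conc} ng/mL -> {choose_strategy(conc)}")
```

In practice the choice also depends on antibody availability, cost, and how well credentialed the candidate is, which a single concentration threshold does not capture.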

In the absence of stable isotope-labeled peptides as internal standards for each target analyte, these approaches rely heavily on a reference database of standard spectra for each analyte to construct time-scheduled data acquisitions and to confirm the identification of the analytes. The acquired MS/MS spectra are compared with authentic standard spectra to examine the agreement of the relative abundance of fragment ions and of the LC retention time. The identification confidence is determined by the number of fragment ions observed and by the correlation of the observed LC retention time of the analyte with its predicted retention time. It should be noted that SRM without a stable isotope-labeled peptide for each analyte is not a reliable tool for the triage process, because SRM usually monitors only 3–5 transitions with moderate mass accuracy and unit

resolution. This technique cannot provide sufficient confidence in detecting candidate biomarkers in the absence of a stable isotope-labeled peptide standard. If SRM is the only platform available for the study, low-cost, unpurified stable isotope-labeled peptides for each targeted analyte should be used to provide the confidence needed for LC peak identification.

Measurements in the triage step are semi-quantitative, allowing only rough estimates of relative abundance changes of the targeted proteins. The small set of candidates derived from the triage step requires additional quantification with SID-SRM-MS to confirm the observed changes. In addition to prioritizing the candidates for more accurate quantification, the triage step also determines which protein candidates can be quantified directly from clinical samples and which need additional sample fractionation or enrichment to improve the limits of detection and quantification.

In the quantification step, the list of candidates can first be divided into several groups based on their concentration in clinical samples: extremely low abundance proteins such as cytokines and interleukins, medium-to-low abundance proteins such as tissue leakage products, and classic plasma proteins. For cytokine and interleukin candidates, ELISA is the assay of first choice because well-validated ELISA assays are commercially available for most of them. The analytical performance of ELISA is acceptable for these studies. Developing SID-SRM assays for low-abundance proteins such as cytokines and interleukins is very challenging and requires a significant amount of time and effort to find suitable antibodies for the candidates. Even with antibody enrichment, the sensitivity of SRM will not reach the pg/ml LLOQ required to quantify cytokines and interleukins. As a result, a much larger biospecimen volume is required for their quantification by SRM. For tissue leakage products and classic plasma proteins, SID-SRM is the primary choice for quantification. SRM can be applied to verify classic plasma proteins directly from clinical samples without upfront sample fractionation. For tissue-leakage proteins, certain strategies for sample fractionation or enrichment are usually required in order to quantify the candidates with acceptable sensitivity and specificity (Fig. 23.2). If antibodies are not readily available, IP-SRM and SISCAPA-SRM are not recommended for less credentialed candidates because of the tremendous effort required to develop suitable antibodies.

Stable isotope-labeled internal standards are required in SRM assays to provide the highest level of detection confidence and measurement precision. Stable isotope-labeled tryptic peptide standards are the most commonly used internal standards. They can provide sufficient precision and reproducibility to confirm the differential expression of candidates in the disease and to eliminate the false positive candidates identified in the discovery phase. In this approach, however, the accuracy of quantification is only moderate, because stable isotope-labeled peptide standards do not account for differences in trypsin digestion efficiency. Assays using stable isotope-labeled peptide standards therefore need to be validated to demonstrate moderate precision, reproducibility, and specificity. The outcome of the quantification process is the list of candidates that correlate highly with the disease of interest. These candidates then advance to more rigorous verification.

The goals of the verification process are threefold: first, to confirm that the small subset of candidates that survived the triage step truly reflects the disease presence, severity, or outcome; second, to establish the specificity and sensitivity of the biomarker panel for its intended use; and third, to implement a suitable sample fractionation/enrichment approach for the targets, if applicable. Trypsin digestion and its requisite sample handling have been found to contribute the most to assay variability. It has been shown that using a stable isotope-labeled protein as an internal standard, instead of stable isotope-labeled peptides, to account for losses in the digestion process nearly doubles assay accuracy [60]. Therefore, to increase the accuracy of quantification in the verification phase, labeled full-length proteins, or winged peptides with 2–6 amino acids of native flanking sequence at the N- and C-termini of the tryptic peptide analyte, or

concatemer of standard peptides should be added at the start of trypsin digestion to serve as more robust internal standards. The purity and quantity of internal standards must be established. For "winged" peptides, quantification is usually done by HPLC and amino acid analysis. If the concentrations of targeted proteins are below the LLOQ of SID-SRM-MS assays and the proteins cannot be quantified directly from clinical samples, suitable strategies to enrich the targeted proteins should be established. IP-SRM or SISCAPA is the first choice for this purpose because both have been proven to be very efficient ways to enrich the targeted proteins, with high precision and repeatability compared to other approaches.

Similarly, the confidence in the accuracy of the qualification/verification assay should increase as the credential of the biomarker candidate increases. Although achieving total accuracy in mass spectrometry-based protein quantification is not possible, the assays used for high-credential candidates should have high specificity, reproducibility, precision (less than 35 % CV), and sensitivity for target quantification [65]. Analytical validation assays are evaluated based on their precision, linear dynamic range, and sensitivity (LOD and LLOQ). If a prefractionation/enrichment step is implemented prior to MS analysis, that step also needs to be validated as part of the overall assay validation for factors such as run-to-run variation, recovery, and carryover. Ideally, the assays for high-credential candidates should be capable of being standardized across laboratories and translated into clinical assays.

23.6 Summary

By far the most challenging step in the biomarker development pipeline is isolating the true biomarkers from a large pool of differentially expressed proteins identified in the discovery phase. The large size of the initial candidate pool is due to several factors, including a high false positive discovery rate, the poor quality of clinical samples, the high complexity of clinical samples, and the lack of highly specific and quantitative assays for quantifying all proteins in biofluids. Recent advances in targeted MS-based technologies such as AIMS, PRM, SWATH, and SID-SRM-MS show the potential to alleviate the bottleneck in the biomarker pipeline. Among them, SID-SRM-MS assays have proven to be the most reliable approach for biomarker qualification/verification. With the progress made in recent years, it is becoming a more realistic possibility that the SID-SRM-MS approach can also be developed into an FDA-approvable assay for clinical testing. MS-based clinical assays can complement traditional immunoassays well, especially for protein biomarkers that high-quality ELISA assays cannot detect, or in cases where protein isoforms or posttranslational modifications constitute the biomarker. In this chapter, we proposed a fit-for-purpose, staged biomarker qualification/verification workflow to verify the hundreds of candidates generated in the discovery phase in a cost-effective and rapid manner. This workflow starts with a data-dependent biomarker candidate triage step using semi-quantitative AIMS, PRM, or SWATH approaches, followed by SID-SRM-MS based qualification and verification for candidates that survive the triage. The accuracy and precision of qualification/verification assays for final candidates need to be confirmed at every step. The rigor of biomarker assay validation should increase as the credential of the biomarker candidate increases. This continuous and evolving fit-for-purpose strategy will conserve resources and effort in the qualification/verification stages of biomarker development and increase the chance of identifying a successful biomarker panel.

References

1. Anderson NL (2010) The clinical plasma proteome: a survey of clinical assays for proteins in plasma and serum. Clin Chem 56:177–185
2. Rifai N, Gillette MA, Carr SA (2006) Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol 24:971–983
3. Spicer V, Grigoryan M, Gotfrid A, Standing KG, Krokhin OV (2010) Predicting retention time shifts

associated with variation of the gradient slope in peptide RP-HPLC. Anal Chem 82:9678–9685
4. Roobol MJ, Carlsson SV (2013) Risk stratification in prostate cancer screening. Nat Rev Urol 10:38–48
5. Del Mastro L, Lambertini M, Bighin C, Levaggi A, D’Alonzo A, Giraudi S, Pronzato P (2012) Trastuzumab as first-line therapy in HER2-positive metastatic breast cancer patients. Expert Rev Anticancer Ther 12:1391–1405
6. Carr SA, Anderson L (2008) Protein quantification through targeted mass spectrometry: the way out of biomarker purgatory? Clin Chem 54:1749–1752
7. Hoofnagle AN, Wener MH (2009) The fundamental flaws of immunoassays and potential solutions using tandem mass spectrometry. J Immunol Methods 347:3–11
8. Krastins B, Prakash A, Sarracino DA, Nedelkov D, Niederkofler EE, Kiernan UA, Nelson R, Vogelsang MS, Vadali G, Garces A, Sutton JN, Peterman S, Byram G, Darbouret B, Perusse JR, Seidah NG, Coulombe B, Gobom J, Portelius E, Pannee J, Blennow K, Kulasingam V, Couchman L, Moniz C, Lopez MF (2013) Rapid development of sensitive, high-throughput, quantitative and highly selective mass spectrometric targeted immunoassays for clinically important proteins in human plasma and serum. Clin Biochem 46:399–410
9. Gerber SA, Rush J, Stemman O, Kirschner MW, Gygi SP (2003) Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proc Natl Acad Sci U S A 100:6940–6945
10. Beynon RJ, Doherty MK, Pratt JM, Gaskell SJ (2005) Multiplexed absolute quantification in proteomics using artificial QCAT proteins of concatenated signature peptides. Nat Methods 2:587–589
11. Pratt JM, Simpson DM, Doherty MK, Rivers J, Gaskell SJ, Beynon RJ (2006) Multiplexed absolute quantification for proteomics using concatenated signature peptides encoded by QconCAT genes. Nat Protoc 1:1029–1043
12. Dupuis A, Hennekinne JA, Garin J, Brun V (2008) Protein Standard Absolute Quantification (PSAQ) for improved investigation of staphylococcal food poi-

activated NF-kappaB/RelA complexes using ssDNA aptamer affinity-stable isotope dilution-selected reaction monitoring-mass spectrometry. Mol Cell Proteomics 10:M111 008771
17. Reiter L, Rinner O, Picotti P, Huttenhain R, Beck M, Brusniak MY, Hengartner MO, Aebersold R (2011) mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat Methods 8:430–435
18. Abbatiello SE, Mani DR, Keshishian H, Carr SA (2010) Automated detection of inaccurate and imprecise transitions in peptide quantification by multiple reaction monitoring mass spectrometry. Clin Chem 56:291–305
19. Addona TA, Abbatiello SE, Schilling B, Skates SJ, Mani DR, Bunk DM, Spiegelman CH, Zimmerman LJ, Ham AJ, Keshishian H, Hall SC, Allen S, Blackman RK, Borchers CH, Buck C, Cardasis HL, Cusack MP, Dodder NG, Gibson BW, Held JM, Hiltke T, Jackson A, Johansen EB, Kinsinger CR, Li J, Mesri M, Neubert TA, Niles RK, Pulsipher TC, Ransohoff D, Rodriguez H, Rudnick PA, Smith D, Tabb DL, Tegeler TJ, Variyath AM, Vega-Montoto LJ, Wahlander A, Waldemarson S, Wang M, Whiteaker JR, Zhao L, Anderson NL, Fisher SJ, Liebler DC, Paulovich AG, Regnier FE, Tempst P, Carr SA (2009) Multi-site assessment of the precision and reproducibility of multiple reaction monitoring-based measurements of proteins in plasma. Nat Biotechnol 27:633–641
20. Mead JA, Bianco L, Ottone V, Barton C, Kay RG, Lilley KS, Bond NJ, Bessant C (2009) MRMaid, the web-based tool for designing multiple reaction monitoring (MRM) transitions. Mol Cell Proteomics 8:696–705
21. Martin DB, Holzman T, May D, Peterson A, Eastham A, Eng J, McIntosh M (2008) MRMer, an interactive open source and cross-platform system for data extraction and visualization of multiple reaction monitoring experiments. Mol Cell Proteomics 7:2270–2278
22. Sherwood CA, Eastham A, Lee LW, Peterson A, Eng JK, Shteynberg D, Mendoza L, Deutsch EW, Risler J,
soning outbreaks. Proteomics 8:4633–4636 Tasman N, Aebersold R, Lam H, Martin DB (2009)
13. Brun V, Dupuis A, Adrait A, Marcellin M, Thomas D, MaRiMba: a software application for spectral library-
Court M, Vandenesch F, Garin J (2007) Isotope- based MRM transition list assembly. J Proteome Res
labeled protein standards: toward absolute quantita- 8:4396–4405
tive proteomics. Mol Cell Proteomics 6:2139–2149 23. Krokhin OV, Spicer V (2010) Predicting peptide
14. Zhao Y, Brasier AR (2013) Applications of selected retention times for proteomics. Curr Protoc Bioinfor-
reaction monitoring (SRM)-mass spectrometry matics. Wiley
(MS) for quantitative measurement of signaling 24. Brusniak MY, Kwok ST, Christiansen M,
pathways. Methods 61:313–322 Campbell D, Reiter L, Picotti P, Kusebauch U,
15. Zhao Y, Tian B, Edeh CB, Brasier AR (2013) Quanti- Ramos H, Deutsch EW, Chen J, Moritz RL,
fication of the dynamic profiles of the innate immune Aebersold R (2011) ATAQS: A computational soft-
response using multiplex selected reaction ware tool for high throughput transition optimization
monitoring-mass spectrometry. Mol Cell Proteomics and validation for selected reaction monitoring mass
12:1513–1529 spectrometry. BMC Bioinf 12:78
16. Zhao Y, Widen SG, Jamaluddin M, Tian B, Wood 25. MacLean B, Tomazela DM, Shulman N,
TG, Edeh CB, Brasier AR (2011) Quantification of Chambers M, Finney GL, Frewen B, Kern R, Tabb
23 Qualification and Verification of Protein Biomarker Candidates 513

DL, Liebler DC, MacCoss MJ (2010) Skyline: an targeted assay development for biomarker verifica-
open source document editor for creating and tion. Mol Cell Proteomics 7:1952–1962
analyzing targeted proteomics experiments. Bioinfor- 37. Schmidt A, Gehlenborg N, Bodenmiller B, Mueller
matics 26:966–968 LN, Campbell D, Mueller M, Aebersold R, Domon B
26. Farrah T, Deutsch EW, Kreisberg R, Sun Z, Campbell (2008) An integrated, directed mass spectrometric
DS, Mendoza L, Kusebauch U, Brusniak MY, approach for in-depth characterization of complex
Huttenhain R, Schiess R, Selevsek N, Aebersold R, peptide mixtures. Mol Cell Proteomics 7:2138–2150
Moritz RL (2012) PASSEL: the PeptideAtlas 38. Schmidt A, Claassen M, Aebersold R (2009) Directed
SRMexperiment library. Proteomics 12:1170–1175 mass spectrometry: towards hypothesis-driven prote-
27. Huttenhain R, Soste M, Selevsek N, Rost H, Sethi A, omics. Curr Opin Chem Biol 13:510–517
Carapito C, Farrah T, Deutsch EW, Kusebauch U, 39. Whiteaker JR, Lin C, Kennedy J, Hou L, Trute M,
Moritz RL, Nimeus-Malmstrom E, Rinner O, Sokal I, Yan P, Schoenherr RM, Zhao L, Voytovich
Aebersold R (2012) Reproducible quantification of UJ, Kelly-Spratt KS, Krasnoselsky A, Gafken PR,
cancer-associated proteins in body fluids using Hogan JM, Jones LA, Wang P, Amon L, Chodosh
targeted proteomics. Sci Transl Med 4:142ra194 LA, Nelson PS, McIntosh MW, Kemp CJ, Paulovich
28. Kelly RT, Page JS, Zhao R, Qian WJ, Mottaz HM, AG (2011) A targeted proteomics-based pipeline for
Tang K, Smith RD (2008) Capillary-based multi verification of biomarkers in plasma. Nat Biotechnol
nanoelectrospray emitters: improvements in ion trans- 29:625–634
mission efficiency and implementation with capillary 40. Geromanos SJ, Vissers JP, Silva JC, Dorschel CA, Li
reversed-phase LC-ESI-MS. Anal Chem 80:143–149 GZ, Gorenstein MV, Bateman RH, Langridge JI
29. Page JS, Tang K, Kelly RT, Smith RD (2008) (2009) The detection, correlation, and comparison of
Subambient pressure ionization with nanoelec- peptide precursor and product ions from data indepen-
trospray source and interface for improved sensitivity dent LC-MS with data dependant LC-MS/MS. Prote-
in mass spectrometry. Anal Chem 80:1800–1805 omics 9:1683–1695
30. Fortin T, Salvador A, Charrier JP, Lenz C, 41. Venable JD, Dong MQ, Wohlschlegel J, Dillin A,
Bettsworth F, Lacoux X, Choquet-Kastylevsky G, Yates JR (2004) Automated approach for quantitative
Lemoine J (2009) Multiple reaction monitoring analysis of complex peptide mixtures from tandem
cubed for protein quantification at the low nano- mass spectra. Nat Methods 1:39–45
gram/milliliter level in nondepleted human serum. 42. Panchaud A, Scherl A, Shaffer SA, von Haller PD,
Anal Chem 81:9343–9352 Kulasekara HD, Miller SI, Goodlett DR (2009) Pre-
31. Jeudy J, Salvador A, Simon R, Jaffuel A, Fonbonne C, cursor acquisition independent from ion count: how to
Leonard JF, Gautier JC, Pasquier O, Lemoine J (2014) dive deeper into the proteomics ocean. Anal Chem
Overcoming biofluid protein complexity during 81:6481–6488
targeted mass spectrometry detection and quantifica- 43. Geiger T, Cox J, Mann M (2010) Proteomics on an
tion of protein biomarkers by MRM cubed (MRM3). Orbitrap benchtop mass spectrometer using all-ion
Anal Bioanal Chem 406:1193–1200 fragmentation. Mol Cell Proteomics 9:2252–2261
32. Gallien S, Duriez E, Crone C, Kellmann M, 44. Gillet LC, Navarro P, Tate S, Rost H, Selevsek N,
Moehring T, Domon B (2012) Targeted proteomic Reiter L, Bonner R, Aebersold R (2012) Targeted data
quantification on quadrupole-Orbitrap mass spec- extraction of the MS/MS spectra generated by data-
trometer. Mol Cell Proteomics 11:1709–1723 independent acquisition: a new concept for consistent
33. Peterson AC, Russell JD, Bailey DJ, Westphall MS, and accurate proteome analysis. Mol Cell Proteomics
Coon JJ (2012) Parallel reaction monitoring for high 11:O111
resolution and high mass accuracy quantitative, 45. Liu Y, Huttenhain R, Surinova S, Gillet LC,
targeted proteomics. Mol Cell Proteomics Mouritsen J, Brunner R, Navarro P, Aebersold R
11:1475–1488 (2013) Quantitative measurements of N-linked
34. Leinenbach A, Pannee J, Dulffer T, Huber A, glycoproteins in human plasma by SWATH-MS. Pro-
Bittner T, Andreasson U, Gobom J, Zetterberg H, teomics 13:1247–1256
Kobold U, Portelius E, Blennow K, proteins, I. S. 46. Deutsch EW, Lam H, Aebersold R (2008)
D. W. G. o. C. (2014) Mass spectrometry-based can- PeptideAtlas: a resource for target selection for
didate reference measurement procedure for quantifi- emerging targeted proteomics workflows. EMBO
cation of amyloid-beta in cerebrospinal fluid. Clin Rep 9:429–434
Chem 60:987–994 47. Deutsch EW (2010) The PeptideAtlas project.
35. Gallien S, Bourmaud A, Kim SY, Domon B (2014) Methods Mol Biol 604:285–296
Technical considerations for large-scale parallel 48. Liu Y, Huttenhain R, Collins B, Aebersold R (2013)
reaction monitoring analysis. J Proteome 100: Mass spectrometric protein maps for biomarker dis-
147–159 covery and clinical research. Expert Rev Mol Diagn
36. Jaffe JD, Keshishian H, Chang B, Addona TA, 13:811–825
Gillette MA, Carr SA (2008) Accurate inclusion 49. Rost HL, Rosenberger G, Navarro P, Gillet L,
mass screening: a bridge from unbiased discovery to Miladinovic SM, Schubert OT, Wolski W, Collins
514 Y. Zhao and A.R. Brasier

BC, Malmstrom J, Malmstrom L, Aebersold R (2014) 59. Hoofnagle AN, Becker JO, Wener MH, Heinecke JW
OpenSWATH enables automated, targeted analysis of (2008) Quantification of thyroglobulin, a
data-independent acquisition MS data. Nat Biotechnol low-abundance serum protein, by immunoaffinity
32:219–223 peptide enrichment and tandem mass spectrometry.
50. Kuzyk MA, Smith D, Yang J, Cross TJ, Jackson AM, Clin Chem 54:1796–1804
Hardie DB, Anderson NL, Borchers CH (2009) Mul- 60. Kuhn E, Whiteaker JR, Mani DR, Jackson AM,
tiple reaction monitoring-based, multiplexed, absolute Zhao L, Pope ME, Smith D, Rivera KD, Anderson
quantification of 45 proteins in human plasma. Mol NL, Skates SJ, Pearson TW, Paulovich AG, Carr SA
Cell Proteomics 8:1860–1877 (2012) Interlaboratory evaluation of automated,
51. Anderson L, Hunter CL (2006) Quantitative mass multiplexed peptide immunoaffinity enrichment cou-
spectrometric multiple reaction monitoring assays pled to multiple reaction monitoring mass spectrome-
for major plasma proteins. Mol Cell Proteomics try for quantifying proteins in plasma. Mol Cell
5:573–588 Proteomics 11:M111 013854
52. Furey A, Moriarty M, Bane V, Kinsella B, Lehane M 61. Tuerk C, Gold L (1990) Systematic evolution of
(2013) Ion suppression; a critical review on causes, ligands by exponential enrichment: RNA ligands to
evaluation, prevention and applications. Talanta bacteriophage T4 DNA polymerase. Science
115:104–122 249:505–510
53. Keshishian H, Addona T, Burgess M, Mani DR, 62. Nery AA, Wrenger C, Ulrich H (2009) Recognition of
Shi X, Kuhn E, Sabatine MS, Gerszten RE, Carr SA biomarkers and cell-specific molecular signatures:
(2009) Quantification of cardiovascular biomarkers in aptamers as capture agents. J Sep Sci 32:1523–1530
patient plasma by targeted mass spectrometry and 63. Brasier AR, Garcia J, Wiktorowicz JE, Spratt HM,
stable isotope dilution. Mol Cell Proteomics Comach G, Ju H, Recinos A 3rd, Soman K, Forshey
8:2339–2349 BM, Halsey ES, Blair PJ, Rocha C, Bazan I, Victor
54. Brasier AR, Zhao Y, Wiktorowicz JE, Spratt HM, SS, Wu Z, Stafford S, Watts D, Morrison AC, Scott
Nascimento EJ, Cordeiro MT, Soman KV, Ju H, TW, Kochel TJ, the Venezuelan Dengue Fever Work-
Recinos A 3rd, Stafford S, Wu Z, Marques ET Jr, ing, G. (2012) Discovery proteomics and nonparamet-
Vasilakis N (2015) Molecular classification of ric modeling pipeline in the development of a
outcomes from dengue virus 3 infections. J Clin candidate biomarker panel for dengue hemorrhagic
Virol 64:97–106 fever. Clin Transl Sci 5:8–20
55. Nicol GR, Han M, Kim J, Birse CE, Brand E, 64. Lee JW, Devanarayan V, Barrett YC, Weiner R,
Nguyen A, Mesri M, FitzHugh W, Kaminker P, Allinson J, Fountain S, Keller S, Weinryb I,
Moore PA, Ruben SM, He T (2008) Use of an Green M, Duan L, Rogers JA, Millham R, O’Brien
immunoaffinity-mass spectrometry-based approach PJ, Sailstad J, Khan M, Ray C, Wagner JA (2006) Fit-
for the quantification of protein biomarkers from for-purpose method development and validation for
serum samples of lung cancer patients. Mol Cell Pro- successful biomarker measurement. Pharm Res
teomics 7:1974–1982 23:312–328
56. Anderson NL, Anderson NG, Haines LR, Hardie DB, 65. Carr SA, Abbatiello SE, Ackermann BL, Borchers C,
Olafson RW, Pearson TW (2004) Mass spectrometric Domon B, Deutsch EW, Grant RP, Hoofnagle AN,
quantification of peptides and proteins using Stable Huttenhain R, Koomen JM, Liebler DC, Liu T,
Isotope Standards and Capture by Anti-Peptide MacLean B, Mani DR, Mansfield E, Neubert H,
Antibodies (SISCAPA). J Proteome Res 3:235–244 Paulovich AG, Reiter L, Vitek O, Aebersold R,
57. Kuhn E, Addona T, Keshishian H, Burgess M, Mani Anderson L, Bethem R, Blonder J, Boja E,
DR, Lee RT, Sabatine MS, Gerszten RE, Carr SA Botelho J, Boyne M, Bradshaw RA, Burlingame AL,
(2009) Developing multiplexed assays for troponin I Chan D, Keshishian H, Kuhn E, Kinsinger C, Lee JS,
and interleukin-33 in plasma by peptide Lee SW, Moritz R, Oses-Prieto J, Rifai N, Ritchie J,
immunoaffinity enrichment and targeted mass spec- Rodriguez H, Srinivas PR, Townsend RR, Van Eyk J,
trometry. Clin Chem 55:1108–1117 Whiteley G, Wiita A, Weintraub S (2014) Targeted
58. Whiteaker JR, Zhao L, Zhang HY, Feng LC, Piening peptide measurements in biology and medicine: best
BD, Anderson L, Paulovich AG (2007) Antibody- practices for mass spectrometry-based assay develop-
based enrichment of peptides on magnetic beads for ment using a fit-for-purpose approach. Mol Cell Pro-
mass-spectrometry-based quantification of serum teomics 13:907–917
biomarkers. Anal Biochem 362:44–54
24 Protocol for Standardizing High-to-Moderate Abundance Protein Biomarker Assessments Through an MRM-with-Standard-Peptides Quantitative Approach

Andrew J. Percy, Juncong Yang, Andrew G. Chambers, Yassene Mohammed, Tasso Miliotis, and Christoph H. Borchers

A.J. Percy • J. Yang • A.G. Chambers
University of Victoria – Genome British Columbia Proteomics Centre, Vancouver Island Technology Park, #3101 – 4464 Markham St., Victoria, BC V8Z 7X8, Canada

Y. Mohammed
University of Victoria – Genome British Columbia Proteomics Centre, Vancouver Island Technology Park, #3101 – 4464 Markham St., Victoria, BC V8Z 7X8, Canada
Center for Proteomics and Metabolomics, Leiden University Medical Center, 2333 ZA, Leiden, Netherlands

T. Miliotis
AstraZeneca R&D, Innovative Medicines, S-431 83, Mölndal, Sweden

C.H. Borchers (*)
University of Victoria – Genome British Columbia Proteomics Centre, Vancouver Island Technology Park, #3101 – 4464 Markham St., Victoria, BC V8Z 7X8, Canada
Department of Biochemistry and Microbiology, University of Victoria, Petch Building Room 207, 3800 Finnerty Rd., Victoria, BC V8P 5C2, Canada
e-mail: christoph@proteincentre.com

© Springer International Publishing Switzerland 2016
H. Mirzaei and M. Carrasco (eds.), Modern Proteomics – Sample Preparation, Analysis and Practical Applications, Advances in Experimental Medicine and Biology 919, DOI 10.1007/978-3-319-41448-5_24

Abstract
Quantitative mass spectrometry (MS)-based approaches are emerging as a
core technology for addressing health-related queries in systems biology
and in the biomedical and clinical fields. In several ‘omics disciplines
(proteomics included), an approach centered on selected or multiple
reaction monitoring (SRM or MRM)-MS with stable isotope-labeled
standards (SIS), at the protein or peptide level, has emerged as the most
precise technique for quantifying and screening putative analytes in
biological samples. To enable the widespread use of MRM-based protein
quantitation for disease biomarker assessment studies and its ultimate
acceptance for clinical analysis, the technique must be standardized to
facilitate precise and accurate protein quantitation. To that end, we have
developed a number of kits for assessing method/platform performance,
as well as for screening proposed candidate protein biomarkers in various
human biofluids. Collectively, these kits utilize a bottom-up LC-MS
methodology with SIS peptides as internal standards and quantify proteins
using regression analysis of standard curves. This chapter details the
methodology used to quantify 192 plasma proteins of high-to-moderate
abundance (covering a range of 6 orders of magnitude, from 31 mg/mL for
albumin to 18 ng/mL for peroxiredoxin-2), and a 21-protein subset
thereof. We also describe the application of this method to patient samples
for biomarker discovery and verification studies. Additionally, we introduce
our recently developed Qualis-SIS software, which is used to expedite
the analysis and assessment of protein quantitation data in control and
patient samples.

Keywords
Biomarker • Internal standards • MRM • Plasma • Proteomics •
Quantitation • Standardization
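
The abstract notes that the kits quantify proteins by regression analysis of standard curves, with SIS peptides as internal standards. As a minimal sketch of that idea (not the Qualis-SIS implementation), the Python below fits an unweighted least-squares line to calibration points and back-calculates a concentration from a measured response ratio. The function names, the choice of an unweighted fit, and the calibration values are all assumptions made for illustration; the last helper simply checks the concentration span quoted in the abstract.

```python
import math


def fit_standard_curve(concentrations, ratios):
    """Unweighted least-squares fit of ratio = slope * concentration + intercept."""
    n = len(concentrations)
    if n < 2 or n != len(ratios):
        raise ValueError("need at least two paired calibration points")
    mean_x = sum(concentrations) / n
    mean_y = sum(ratios) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(concentrations, ratios))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def back_calculate(ratio, slope, intercept):
    """Invert the fitted line to estimate a concentration from a response ratio."""
    return (ratio - intercept) / slope


def orders_of_magnitude(high, low):
    """Base-10 span between two concentrations given in the same unit."""
    return math.log10(high / low)


# Hypothetical calibration points (fmol/uL vs. peak-area ratio), chosen to lie
# exactly on ratio = 0.04 * concentration + 0.01 for illustration only.
cal_conc = [0.5, 2.5, 25.0, 250.0]
cal_ratio = [0.04 * c + 0.01 for c in cal_conc]

slope, intercept = fit_standard_curve(cal_conc, cal_ratio)
estimate = back_calculate(0.41, slope, intercept)  # hypothetical measured ratio

# Span quoted in the abstract: 31 mg/mL down to 18 ng/mL, both expressed in ng/mL.
span = orders_of_magnitude(31e6, 18.0)
```

Across a wide concentration range, a weighted fit (for example 1/x²) is often preferred in calibration work; the unweighted form is used here only to keep the sketch short.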

24.1 Introduction

MS-based protein quantitation is increasingly utilized to determine differences between samples from healthy and diseased patients for biomarker (i.e., biological indicators of disease or disorder) and systems biology studies. Although quantitation can be performed using a relative technique, such as iTRAQ (isobaric tags for relative and absolute quantitation [1]) or TMT (tandem mass tag [2]), techniques that provide exact endogenous concentrations (often reported in ng/mL units), as opposed to fold changes of abundance levels, are more informative and better suited for applications where the analysis of pre-clinical and clinical samples is the ultimate goal. Such quantitative techniques are commonly referred to as “absolute”, and require the use of isotopically labeled standards (typically expressed in bacterial media, in the case of proteins [3], or chemically synthesized, for peptides) and a targeted form of MS detection (usually MRM-MS with electrospray ionization, ESI, for gas phase ionization of the chromatographic eluent) to be employed within a bottom-up analytical workflow [4–6]. In this generalized approach, proteotypic peptides serve as molecular surrogates for the target proteins. The isotopically labeled standards are typically labeled with ¹³C and/or ¹⁵N, as opposed to ¹⁸O or ²H, and these labels are incorporated into amino acids within a protein or the C-terminal residue of a tryptic peptide. Collectively, the standards are used for normalization of the peptide signal and LC-MS conditions. In MRM-MS, specific precursor-product ion pairs (referred to as transitions) are used for peptide detection. Generating peptide specific transitions requires a priori knowledge of the analyte and its dissociation upon collisional activation (also referred to as collision induced dissociation or CID). While the use of MRM is common and is classically performed on a triple quadrupole mass spectrometer, directed quantitation has also recently been accomplished by parallel reaction monitoring (PRM) on a hybrid quadrupole-Orbitrap (i.e., Q Exactive) mass spectrometer [7–9] and by MS/MS^ALL with SWATH acquisition on a quadrupole time-of-flight (QTOF) [10] or a hybrid quadrupole linear ion trap (QTRAP) mass spectrometer [11]. Mechanistically in PRM, for instance, all product ions that lie within a specified mass range and emanate from a specifically fragmented precursor are detected in the high resolution, high mass accuracy Orbitrap analyzer. An attractive feature of this technique, as well as MS/MS^ALL, is that it allows the post-analysis mining of previously collected (or archived) MS/MS data, and therefore allows the selection of alternate quantitative transitions if interference with the target(s) is observed.

The most desirable sample sources for biomarker research and clinical measurement are ideally non-invasive, such as urine or saliva. Although blood plasma and serum are semi-invasive, they are still commonly used for monitoring and stratifying diseases. Plasma and serum are used because they are relatively inexpensive to collect and analyze, and carry a wide dynamic range of proteins (approximating or exceeding
10 orders of magnitude [12]) that are secreted, released, or leaked from neighboring cells, tissues, or organs into the systemic circulation. The fluid therefore paints a physiological picture of the health status of an individual, which is imperative for disease diagnosis and prognosis. It is important to note here that there is a distinction between plasma and serum since the two are often incorrectly used interchangeably by the proteomics field. Plasma and serum are both derived from whole blood, with serum collected from plasma after coagulation. It is through the coagulation process that an assembly of mid-abundance proteins (e.g., fibrinogen, prothrombin, thrombin, and a host of coagulation factors – notably II, V, and VIII) are at least partially removed. Serum is, however, generally disfavored by the Human Proteome Organization (HUPO [13, 14]) since coagulation can cause additional proteins to be unintentionally removed through non-specific interactions and is also a highly variable process, with the results being dependent upon the coagulation conditions and the nature of the collection tube [15]. It is for these reasons that our blood-based assay developments and analyses are commonly conducted with plasma, with the exception being our dried fluid spot quantitative analyses where the spots originate from whole blood [16].

As inferred above, plasma is an inherently complex biofluid, carrying thousands of potentially measurable proteins spanning the low mg/mL (or millimolar; encompassing serum albumin and the immunoglobulins, among others) to low pg/mL (or attomolar; which includes the interleukins and cytokines) concentration range. An active area of biomedical research centers on developing sensitive methods to accurately and reproducibly quantify proteins at the lower end of the concentration range since these candidates are considered to have the greatest diagnostic potential. Targeted quantitative methods for detection of proteins with concentrations below the MRM detection limit often use anti-protein [17] or anti-peptide [12, 18, 19] antibodies for immunoaffinity enrichment or alternatively the implementation of multidimensional separations (increasingly with alkaline and acidic RPLC (reversed-phase liquid chromatography) [20, 21], less commonly with strong cation-exchange and RPLC configurations [22]) for peptide fractionation. Additional techniques developed for deeper protein quantification involve the upfront use of immunodepletion for high-abundance protein removal via antibody-based affinity interactions [22–26]. Depletion, however, is disfavored from a cost and throughput perspective, as well as for the potential of target protein loss through non-specific or non-covalent interactions with the depletion cartridge or depleted proteins. An added detraction of this technique is the potential underestimation of protein concentration, as was demonstrated recently by Percy et al. in the side-by-side comparison of a depletion-based and depletion-free, multiplexed quantitative proteomic assay of cerebrospinal fluid [27]. Nonetheless, despite the increasing emphasis on low-abundance proteins, antibody- and fractionation-free quantitative proteomic methods should also be developed for the screening of higher-abundance protein markers since these are also informative and correlate with multiple diseases such as cancer and cardiovascular disease (CVD) [28, 29]. This is why we have developed sets of highly multiplexed (defined as enabling multi-analyte detection in a single analytical run) MRM assays for the precise quantitation of high-to-moderate abundance, candidate protein biomarkers in undepleted and non-enriched human plasma [20, 30–32].

The protein biomarker pipeline is essentially comprised of four stages – discovery, verification, pre-clinical validation, and clinical validation. Although quantitative MRM or PRM methods can be used to assess marker utility at all levels, their greatest value lies in the discovery and verification phases. Once the lengthy list of potential candidate markers has been screened and condensed according to statistical significance, resources can then be invested in the development of antibodies, which is a costly and developmentally intensive process [33]. At the validation stages of biomarker assessment, shorter lists of verified candidates (typically <10) are interrogated against a larger number
of samples (on the order of 1000s at the validation stage vs. 10s–100s in the preceding stages [33]). While ELISAs (enzyme linked immunosorbent assays) are often considered to be the “gold-standard” for clinical applications [34], emerging techniques, such as iMALDI (immuno matrix-assisted laser desorption/ionization; where detection of captured peptides occurs by MALDI-TOF-MS without prior chromatographic separation [35]) and SISCAPA (stable isotope standards and capture with anti-peptide antibodies via LC-MS [18, 36] or MALDI-MS [37, 38] detection) could alternatively be employed.

To expedite biomarker verification, the targeted quantitative methods must be standardized. This should facilitate improved method reproducibility and transferability and lead to a more rapid and accurate evaluation of the candidate protein biomarkers in a given biological fluid [39, 40]. To this end, a variety of kits have been developed for the quantitative proteomics community. Stemming from work done in our laboratory, QC kits are developed to evaluate the performance of an LC-MS system and/or one type of sample preparation in a targeted quantitative proteomic workflow [41, 42]. Recently, we have also developed several biomarker assessment kits (BAKs) for screening various protein panels against patient plasma samples for biomarker discovery or verification studies. The methods collectively utilize an antibody-/fractionation-free approach, a rigorously optimized and evaluated bottom-up LC/MRM proteomic workflow, and our well characterized SIS peptides. The targeted proteins are either putative biomarkers for CVD and cancer or have unknown disease associations. Each BAK contains a collection of key starting materials (i.e., reference plasma, trypsin, and SIS peptide mixture), a detailed protocol, an LC-MS acquisition method, data analysis software, and a troubleshooting guide. This chapter will detail the protocol and provide the rationale behind the development and application of two recent biomarker assessment kits – BAK-192 for discovery and a custom BAK-21 for verification – for MRM-based quantitative proteomic studies. Also provided is a description and implementation of our recently developed Qualis-SIS software [43] for quantitative proteomic applications.

24.2 Targeted Quantitation Method – Strategy, Description, and Rationale

The principal checkpoints we use in developing sensitive and specific MRM-based quantitative proteomic assays, such as the BAK-192 and BAK-21, involve protein/peptide target selection, SIS peptide production, solution/sample preparation, interference screening, and protein quantitation (see Fig. 24.1 for our generalized workflow). Additional important steps include balancing the concentrations of the mixture of SIS peptides to their corresponding natural (or NAT) peptide signals (balancing helps reduce analytical variation between analyses [44]), as well as optimizing the MRM transitions (including their collision energies) and LC gradient. This section expands upon that basic framework developed to quantify multiplexed panels of plasma proteins for assessment as potential biomarkers via a bottom-up LC/MRM approach using SIS peptides. By outlining our strategy and rationale behind each development step, the user will obtain the necessary tools for extending the quantitative method to alternative panels and types of samples. Nonetheless, the applications that these BAKs are designed for are discussed in the section that follows.

24.2.1 Protein and Peptide Selection

The first step in our quantitative proteomic method development is generating a list of potential biomarkers in human plasma. These putative biomarkers are selected from prior discovery experiments or from literature reports, and typically exist in a wide range of concentrations. Tryptic peptides (ideally a minimum of 2) are then chosen to act as molecular surrogates for each biomarker. Selection is based on adherence
to a set of qualification criteria [45], with the most notable ones indicated below:

• Peptides must be unique to the target biomarker (human in this case; determined from a BLASTp search).
• Peptides must have been previously observed in tandem MS proteomic studies (revealed in the Global Proteome Machine and PeptideAtlas databases).
• Peptides must not contain a missed tryptic cleavage site (Keil rules obeyed [46]).
• Peptides must be between 5 and 25 residues in length to ensure acceptable ionization and gas-phase fragmentation.

To reduce error and subjectivity, the rules have recently been assembled into a software tool we named PeptidePicker, which automates candidate identification and ranks the selected peptide(s) for a given protein within a specified proteome (human or mouse) [47]. This program, we note, is an advancement over the PeptideSieve tool (developed by the Seattle Proteome Centre), which predicts proteotypic propensity based solely on the physicochemical properties of the peptides expected to result from a digest of a given protein [48]. Due to the accuracy and enhanced speed of peptide selection in PeptidePicker (ca. 50 proteins per hour compared to 8 per day in PeptideSieve [47]), the time devoted to bioinformatics is significantly reduced, allowing more time to be spent on the rest of assay development. Furthermore, PeptidePicker reduces human error and provides users with a standardized method for target peptide selection of any panel of biomarkers.

Fig. 24.1 General workflow for MRM assay development. Protein/peptide selection is a bioinformatics exercise aided by previously collected data or curated databases, as well as by software tools, such as PeptidePicker. The internal standards employed are SIS peptides, which are synthesized, purified, and characterized for more accurate protein quantitation. MRM transition optimization and screening for chemical interference in the sample matrix are performed empirically, while protein quantitation is performed on the interference-free peptides via standard curves.

24.2.2 SIS Peptide Production

Once the proteotypic peptides have been selected, their heavy isotope labeled analogues are synthesized, purified, and characterized. These are essential steps for obtaining absolute and precise, but not necessarily accurate, endogenous protein concentrations. In our laboratory, synthesis is performed in-house on an Overture peptide synthesizer (Protein Technologies) using Fmoc chemistry. To enable chromatographic alignment of heavy isotope coded peptides with the regular NAT peptides (which greatly assists in the subsequent interference testing step), [¹³C]/[¹⁵N] isotopes (Cambridge Isotope Laboratories) are incorporated at the C-terminal residue of tryptic peptides, typically leading to +8 Da (from [¹³C₆, ¹⁵N₂]-lysine) or +10 Da ([¹³C₆, ¹⁵N₄]-arginine) mass shifts. Purification is also performed in-house by RPLC, with the fractions of interest confirmed by MALDI-TOF-MS on an Ultraflex III TOF/TOF mass spectrometer (Bruker Daltonik). After lyophilization of the pooled target fractions,
amino acid analysis (AAA) and capillary zone electrophoresis (CZE) are then performed for absolute concentration and purity determination, respectively. Of relevance here, the average purity of the 487 target peptides used in the discovery BAK-192 is 92 %.

24.2.3 Sample Preparation and LC-MS Processing

It is our general practice to prepare small sample sets (i.e., <20) manually in polypropylene Maxymum recovery microtubes (Axygen), but automate the preparation of larger sets of samples with a robot (Freedom EVO 150 platform; Tecan) in 96-well microtiter plates. A generalized flow chart of our sample preparation and processing workflow is illustrated in Fig. 24.2. It should be noted that our robot is configured to automate only the liquid handling steps, with centrifugation and incubation occurring externally.

Toward the preparation of plasma proteolytic digests, a ten-fold diluted plasma sample (20 μL for the control and 6 μL of raw fluid per patient) is denatured, reduced, alkylated, and quenched with 1 % sodium deoxycholate (10 % initially), 5 mM tris(2-carboxyethyl)phosphine (50 mM initially), 10 mM iodoacetamide (100 mM initially), and 10 mM dithiothreitol (100 mM initially), respectively, all prepared in 25 mM ammonium bicarbonate. The protein denaturation and Cys-Cys reduction steps occur simultaneously for 30 min at 60 °C, while Cys alkylation and iodoacetamide quenching are performed subsequently for 30 min at 37 °C. Thereafter, proteolysis is achieved by the addition of 23.3 μL TPCK-treated trypsin (Worthington) (1.8 mg in 2 mL of 25 mM ammonium bicarbonate; prepared immediately before addition) at a 10:1 substrate:enzyme ratio. After overnight incubation at 37 °C, proteolysis is arrested by the step-wise addition of a chilled SIS peptide mixture (concentration balanced; 50 μL at 250 to 0.5 fmol/μL for the control or 50 μL at 25 fmol/μL for the patient plasma) and a chilled formic acid (FA) solution (277 μL at 1 %) to a digest aliquot (277 μL; pooled from 4 digests in the control prep). The SIS mixes used in the control will be used to prepare the calibration

Fig. 24.2 Overview of our sample preparation and processing workflow. The plasma proteins are unfolded and the disulfide bridges are cleaved and capped by a series of denaturation, reduction, alkylation, and quenching steps prior to tryptic proteolysis. Labeled peptide standards are spiked post-digestion to prevent chemical modification which can occur during proteolysis. After the sample is concentrated by solid phase extraction, the peptide mixture is separated by RPLC and detected by dynamic MRM on a QqQ mass spectrometer. Plasma protein quantitation is achieved through SPM or regression analysis of the standard control curve
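
The spiking and dilution figures quoted for this workflow can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the standard concentrations (A to F), the 50 μL spike volume, and the trypsin preparation stated in the protocol, while the function and variable names are hypothetical and not part of any published tool.

```python
# Calibration standards as listed in the protocol (fmol/uL of SIS peptide).
STANDARDS_FMOL_PER_UL = {
    "A": 0.5, "B": 2.5, "C": 12.5, "D": 25.0, "E": 125.0, "F": 250.0,
}
SPIKE_VOLUME_UL = 50.0  # volume of SIS mixture added per digest aliquot


def fold_range(standards):
    """Ratio of the highest to the lowest standard concentration."""
    values = sorted(standards.values())
    return values[-1] / values[0]


def step_factors(standards):
    """Dilution factor between adjacent standards, starting from the stock."""
    values = sorted(standards.values(), reverse=True)
    return [high / low for high, low in zip(values, values[1:])]


def spiked_amount_fmol(conc_fmol_per_ul, volume_ul=SPIKE_VOLUME_UL):
    """Total SIS peptide (fmol) delivered by one spike."""
    return conc_fmol_per_ul * volume_ul


def trypsin_added_ug(volume_ul=23.3, mass_mg=1.8, solvent_ml=2.0):
    """Mass of trypsin (ug) in the added volume; mg/mL equals ug/uL."""
    return volume_ul * (mass_mg / solvent_ml)


print(fold_range(STANDARDS_FMOL_PER_UL))                # concentration span of the curve
print(step_factors(STANDARDS_FMOL_PER_UL))              # alternating 2- and 5-fold steps
print(spiked_amount_fmol(STANDARDS_FMOL_PER_UL["F"]))   # fmol spiked at standard F
print(round(trypsin_added_ug(), 2))                     # ug of trypsin per digest
```

The first value reproduces the 500-fold range stated for the standard curve. The trypsin mass, combined with the stated 10:1 substrate:enzyme ratio, implies roughly ten times that mass of protein substrate per digest.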
24 Protocol for Standardizing High-to-Moderate Abundance Protein Biomarker. . . 521

curves. These mixtures each contain a fixed amount of endogenous peptide and an increasing concentration of synthetic peptide (over a 500-fold concentration range). The resulting dilution series prepared from each reference standard is as follows: 250 fmol/μL stock (standard F), 125 fmol/μL (standard E), 25 fmol/μL (standard D), 12.5 fmol/μL (standard C), 2.5 fmol/μL (standard B), and 0.5 fmol/μL (standard A; all prepared in 0.1 % FA). A merit of the deoxycholate surfactant is that it is acid insoluble and can therefore be readily removed by simple centrifugation (10 min at 12,000 rpm). This is in contrast to sodium dodecyl sulfate, which damages the LC column and causes signal suppression if not properly removed. Following centrifugation, the peptide supernatant is concentrated by solid phase extraction (SPE) using a polymeric RP sorbent (10 mg Oasis HLB; Waters). The extraction steps are as follows:

1. wash with 1 mL methanol,
2. condition with 1 mL water,
3. load with 556 μL of 0.1 % FA followed by 444 μL of digest supernatant,
4. wash with 1 mL water, and
5. elute with 300 μL of 50 % acetonitrile (ACN) in 0.1 % FA.

The eluate is then lyophilized and rehydrated in 100 μL of 0.1 % FA for LC-MRM/MS.

The LC-MS system we routinely use for the BAKs consists of a 1290 Infinity system that is interfaced to a 6490 triple quadrupole (QqQ) mass spectrometer (all from Agilent Technologies) via a standard-flow ESI source (operated in the positive ionization mode). The LC column is a Zorbax Eclipse Plus RP-UHPLC column (2.1 × 150 mm, 1.8 μm particles). The separation occurs over a 43 min gradient (1.5–81 % mobile phase B; mobile phase compositions: 0.1 % FA in water for A and 0.1 % FA in ACN for B) at a flow rate of 0.4 mL/min and a temperature of 50 °C. A 4 min post-acquisition step using mobile phase A is allotted for column equilibration. The specific gradient we employ is as follows (time in min, %B): 0, 1.5; 1.5, 6.3; 16, 13.5; 18, 13.77; 33, 22.5; 38, 40.5; 39, 81; 42.9, 81; and 43, 1.5. Note that standard flow rates are used instead of conventional nano-flow rates due to the superior analytical merits (in terms of reproducibility and sensitivity) found for the standard flow system when 10× the material is loaded onto a wider-bore column [49]. The mass spectrometer is operated in the dynamic MRM mode (i.e., scheduled retention times for enhanced analyte specificity and reduced duty cycle) with 1 min detection windows and cycle times approximating 850 ms (see [32] and its supplemental tables for the general and specific acquisition parameters).

24.2.4 Interference Reduction and Screening

Interference is commonly observed in the quantitative analysis of human plasma. These interferences exist despite the m/z and retention time filtering in scheduled MRM acquisitions, and are attributed largely to the inherent complexity of blood plasma, as well as to the low resolution QqQ mass spectrometer employed. Tryptic proteolysis further increases the complexity as it converts thousands of plasma proteins into millions of peptides. This increased complexity raises the possibility of non-target ion transmission in the quadrupole mass analyzers (Q1 and Q3), which necessitates interference reduction and screening techniques in quantitative proteomic studies.

Interferences can be reduced by minimizing concurrent MRM transitions, so our method development first involves optimizing the LC gradient to produce an even distribution of peptides across the chromatographic space. To ensure the accuracy of quantitative results, the control and sample are first screened for interference. This is conducted empirically in our laboratory, as opposed to theoretically using a program such as SRM Collider [50]. In the analysis of the control (also referred to as the reference) sample digest, interferences are determined by monitoring the SIS and NAT responses (i.e., peak areas) under matrix-free and matrix-containing conditions (both at

n = 2). The variability in these calculated response ratios indicates the presence or absence of interferences in the MRM ion channels. For a given peptide to be interference-free, the average relative ratios between a SIS transition in buffer or plasma, and NAT transition in plasma, must have CVs below 20 %. Further, the NAT and SIS signals must be the same in both peak shape and retention time. Figure 24.3a shows a typical example of an interference-free and an interference-containing peptide. In this example, the interference observed in the NAT transition of YWGVASFLQK and the high variability of two of its three average relative ratios preclude its use for protein quantitation.

The aforementioned approach is suitable for the inspection of control samples, but an alternative strategy must be adopted for interference screening in patient samples. Our recommended strategy requires a minimum of two peptides to be targeted for a given protein in order to construct peptide correlation plots (ratios of quantifier NAT/SIS relative responses), as first introduced by Agger et al. [51]. The linearity of each plot is then examined for outliers, with those that deviate requiring further inspection of their SIS and NAT peptide extracted ion chromatograms (XICs) to evaluate the level of interference. We recently demonstrated the implementation of this strategy in the quantitative analysis of 40 CVD-linked proteins (inferred from an average of three peptides per protein) across a small CVD patient cohort (n = 18; blood plasma supplied by Bioreclamation). As illustrated in Fig. 24.3b, the peptides SFNPNSPGK and IQNILTEEPK can effectively serve as surrogates for serum paraoxonase/arylesterase 1 (P27169) in all of the measured samples since they are interference-free, while peptide VVLSQGSK cannot be used to quantify sex hormone-binding globulin (P04278) in the CVD patient sample marked with the arrow due to interference. The advantage of this approach is that it requires the peptide responses of only the quantifier transitions, which enables BAK-192 to be processed with a single acquisition method. The use of multiple transitions (customarily with 1 quantifier and 2 qualifiers) for enhanced interference evaluation for BAK-192 discovery requires 3 LC-MS acquisition methods (2922 total transitions for the 487 peptides, with 1461 transitions targeted for each of the two peptide forms). In this case, multiple methods are required to reduce the duty cycle and obtain sufficient points across a chromatographic peak (defined as 10–15) for improved ion statistics.

24.2.5 Plasma Protein Quantitation

The MRM data is first examined with MassHunter Quantitative Analysis software (Agilent; Skyline can alternatively be used) for verification, peak selection, and integration. Thereafter, the processed data is entered into our in-house developed software tool, Qualis-SIS, for analysis. This tool requires two input files for each of the reference and sample data sets. These files carry peptide- and protein-related information, with SIS and NAT responses required for the former (retention time, peak width, and symmetry; other metrics can additionally be included) and protein molecular weights and SIS peptide concentrations required for the latter. After defining a small number of criteria (e.g., regression weighting, precision and accuracy requirements) for each concentration level of the standard curve, the tool automatically performs the following three functions: (1) generates and extracts assay information from standard control curves, (2) determines the endogenous protein concentrations in the patient samples, and (3) assesses the quality of the quantitative sample measurement with respect to the assay's linear dynamic range. The following information is provided by each control curve: endogenous protein concentration, dynamic range, lower and upper limits of quantitation (LLOQ and ULOQ), and regression equation (slope and y-intercept) with coefficient of determination (R^2). In the analysis of the samples, each measured concentration (derived from the relative response measurements, also referred to as single point measurement (SPM), and from linear regression analysis) is plotted on each peptide's standard curve. The quality assessment page

Fig. 24.3 Interference screening strategies for MRM transitions monitored in control and patient plasma digests. (a) Representative XICs of 3 SIS and NAT transitions measured in buffer and control plasma for the interference-free peptide VGYVSGWGR (from haptoglobin, P00738) and the interference-containing peptide YWGVASFLQK (from retinol-binding protein 4; P02753). (b) Relative response (RR) correlation plots

indicates whether or not the results should be trusted through a color-coded matrix. In the matrix, green denotes an acceptable quantitative value (due to its presence within the assay's range of linearity), yellow indicates that caution should be exercised, while red suggests that the value should be discarded (see Fig. 24.4 for an example of each classification type from the CVD-directed quantitative study indicated above). The assessment is based on the relationship of the concentration to the linear dynamic range as well as its deviation from the LOQ and the user-defined confidence threshold. The comprehensive and summarized results can then be exported for subsequent reporting and statistical treatment.

24.3 Method Implementation and Practical Biomarker Applications

Through rigorous evaluation and refinement, a well characterized set of MRM assays has been developed for quantifying a multiplexed panel of 192 candidate disease markers in unfractionated human plasma. The method centers on a bottom-up UHPLC/MRM workflow and uses concentration-balanced SIS peptides as internal standards. The quantified proteins are of high-to-moderate abundance, with concentrations spanning 6 orders of magnitude, from 31 mg/mL (for serum albumin, P02768) to 18 ng/mL (for peroxiredoxin-2, P32119); see Fig. 24.5a for the quantitation range. These endogenous concentrations were derived from standard control curves (based on 144 proteins [52]) and/or individual XIC measurements (based on an additional 48 proteins [20]) using peptides as surrogates (487 interference-free in total). Regarding the curves, these were constructed from a strict set of qualification criteria, which our developed software, Qualis-SIS, accurately applies in an automated and rapid manner. The result of this analysis was a set of assays with average linear dynamic ranges of 10^2–10^3, protein LLOQs between 5 ng/mL and 260 ng/mL (based on quantifier peptides), and average R^2 values of 0.980. The assay reproducibility is high, with average variability of <6 % for relative responses and <0.1 % for retention times routinely obtained over replicate analyses [52]. These quantitative panels can now be applied in discovery- and verification-directed proteomic studies to help bridge the gap between biomarker discovery and validation.

In the classical sense, protein biomarker discovery is accomplished through bottom-up (or shotgun) LC-MS/MS using a multidimensional protein identification technology (MudPIT) in conjunction with data dependent acquisition (DDA). In DDA, a subset of peptide precursor ions, detected in the survey scan, are selected for CID based on abundance, yielding a collection of complete product ion spectra. Typical acquisition instruments for this include the quadrupole time-of-flight (QTOF) and hybrid ion trap-Orbitrap mass spectrometers. While technological advancements have enabled broad classes of putative protein biomarkers to be identified through DDA, their detection sensitivity and sample-to-sample reproducibility are limited due to the intensity-driven, stochastic nature of the precursor ion selection process [53, 54]. To overcome these inherent issues, data independent acquisition (DIA) strategies, such as MS/MS^ALL [55], have been proposed. This is based on the acquisition of complete product ion spectra generated from the dissociation of all precursors measured in given SWATH windows (typically 25 amu spanning from 400 to 1200 m/z) over the chromatographic run. While this may provide

Fig. 24.3 (continued) and peptide XICs for the interference-free peptides SFNPNSPGK and IQNILTEEPK from serum paraoxonase/arylesterase 1 (P27169) and the interference-containing peptide VVLSQGSK from sex hormone-binding globulin (P04278) in the CVD patient sample marked with the arrow. These figures were reprinted from [41] and [32], respectively, with permission
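
The CV-based screening criterion for control samples (Sect. 24.2.4) can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that a transition's "relative ratio" is its peak area expressed as a fraction of the summed area of the monitored transitions, and that the CV is taken across the three conditions (SIS in buffer, SIS in plasma, NAT in plasma). The peak areas and function names are hypothetical; only the 20 % threshold comes from the text.

```python
from statistics import mean, stdev

CV_LIMIT_PERCENT = 20.0  # threshold stated in the text


def relative_ratios(peak_areas):
    """Express each transition's peak area as a fraction of the summed area."""
    total = sum(peak_areas)
    return [area / total for area in peak_areas]


def transition_cvs(sis_buffer, sis_plasma, nat_plasma):
    """CV (%) of each transition's relative ratio across the three conditions."""
    per_condition = [
        relative_ratios(areas) for areas in (sis_buffer, sis_plasma, nat_plasma)
    ]
    cvs = []
    for ratios in zip(*per_condition):  # one tuple of three ratios per transition
        cvs.append(100.0 * stdev(ratios) / mean(ratios))
    return cvs


def is_interference_free(sis_buffer, sis_plasma, nat_plasma, limit=CV_LIMIT_PERCENT):
    """True when every transition's CV falls below the limit."""
    return all(cv < limit for cv in transition_cvs(sis_buffer, sis_plasma, nat_plasma))


# Hypothetical peak areas for three transitions of one peptide.
sis_buffer = [1000.0, 500.0, 250.0]
sis_plasma = [980.0, 510.0, 240.0]
nat_clean = [490.0, 250.0, 126.0]
nat_interfered = [490.0, 250.0, 600.0]  # third transition inflated by a co-eluting ion

print(is_interference_free(sis_buffer, sis_plasma, nat_clean))
print(is_interference_free(sis_buffer, sis_plasma, nat_interfered))
```

The sketch covers only the ratio-variability test. The additional requirement that NAT and SIS signals agree in peak shape and retention time would need the chromatographic traces themselves and is not modeled here.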

Fig. 24.4 Examples of patient sample results from the Qualis-SIS data analysis software tool. The examples show cases where the quantitative results are (a) acceptable (TAAQNLYEK from apolipoprotein

enhanced reproducibility and throughput over a DDA-based method, an MRM-based methodology, such as that described above, can instead be employed at the discovery stage for improved sensitivity, throughput, and reproducibility.

The discovery BAK-192 platform allows the interrogation of 192 proteins using 487 peptides as molecular surrogates. In this targeted application, the candidates will be assessed by quantitatively comparing the patient sample results with those from healthy controls. Ideally, a minimum of three process replicates (also referred to as "analytical replicates" that encompass the entire preparatory workflow) should be obtained, but only replicates that are quantitatively reproducible and interference-free should be used for comparison. To be statistically significant, a fold change ratio exceeding 1.5 and a p value <0.05 are desired [56]. While this biomarker panel is rather small, it covers a broad concentration range of proteins that can be consistently quantified without laborious pre-fractionation, which can in itself introduce variability. For more comprehensive biomarker discovery efforts, however, pre-fractionation is undoubtedly required. Using a scaled-up sample preparation method, we have recently developed a multidimensional LC-MRM workflow for quantifying a broader and deeper (by two orders of magnitude in concentration range) panel of putative protein markers in human plasma [20]. In that method, the LCs are operated under alkaline and acidic mobile phase conditions for altered peptide selectivity, using an ACN gradient with constant 10 mM ammonium hydroxide (pH 10) in the former dimension and an ACN gradient with constant 0.1 % FA (pH 3) in the latter. Both dimensions additionally utilize RP stationary phases and standard-flow rates. Using SPM and, more recently, standard curves for a smaller protein panel (e.g., the low abundance targets osteopontin and matrix metalloproteinase 9 at 7 and 16 ng/mL, respectively), 253 proteins (inferred from 625 peptides) were quantified across an 8 order-of-magnitude concentration range. This panel can also represent a useful starting point for assessing potential biomarker candidates at lower concentrations.

In a separate study focused on biomarker verification, a 21-plex protein assay was selected by a group of investigators based on their previous proteomic discovery results and our 1D LC-MRM/MS quantitative capabilities. The overall aim of their study was to determine whether these proteins play a role in the resolution and remission of type 2 diabetes after bariatric surgery. Bariatric surgery is of considerable research interest as it has rapid and dramatic effects on glycemic control. Recent studies by Mingrone et al. [57] and Schauer et al. [58] found bariatric surgery to be more effective than conventional medical therapy in controlling hyperglycemia in severely obese patients with type 2 diabetes, leading to long-term benefits on macro- and microvascular disease [59]. Since some bariatric procedures, such as biliopancreatic diversion, improve glycemic control in people with diabetes, understanding this additional effect could provide insight into the pathogenesis of type 2 diabetes and assist in the development of new drug modalities. To address this unanswered question, we are currently engaged in a project involving a cohort of 20 morbidly obese, insulin-resistant patients whose plasma was collected over a 13-point time-course (from before surgery to 28 days post-surgery).

Sample preparation and analysis of the BAK-21 is as described above. This requires standard curves to be prepared for each of the 5 plates of 50 samples. Preliminary results for the concentration distribution from this study are shown in Fig. 24.5b. To aid in standardization, key starting materials (i.e., reference plasma, trypsin, and SIS peptide mixture) and

Fig. 24.4 (continued) C-II; P02655), (b) intermediate (IIPHHNYNAAINK from coagulation factor IX; P00740), or (c) unacceptable (TLEAQLTPR from heparin cofactor 2; P05546). The results were obtained from the same patient plasma sample used in the CVD study
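
The quantitation and quality-assessment logic summarized for Qualis-SIS can be approximated with a short sketch. The code below is not the Qualis-SIS implementation: it assumes an unweighted least-squares fit of relative response against standard concentration, a single point measurement computed as the NAT/SIS response ratio multiplied by the spiked SIS concentration, and a hypothetical `tolerance` parameter standing in for the user-defined confidence threshold. The response values are invented for illustration; only the six standard concentrations come from the protocol.

```python
def fit_line(x, y):
    """Unweighted least-squares fit; returns slope, intercept, and R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def concentration_from_curve(response, slope, intercept):
    """Invert the fitted line to read a concentration off the standard curve."""
    return (response - intercept) / slope


def spm_concentration(nat_area, sis_area, sis_conc):
    """Single point measurement: NAT/SIS response ratio times the SIS concentration."""
    return (nat_area / sis_area) * sis_conc


def classify(conc, lloq, uloq, tolerance=0.2):
    """Hypothetical traffic-light rule relative to the linear range."""
    if lloq <= conc <= uloq:
        return "green"   # inside the linear range
    if lloq * (1.0 - tolerance) <= conc <= uloq * (1.0 + tolerance):
        return "yellow"  # just outside the range; treat with caution
    return "red"         # far outside the range; discard


# Standards A to F (fmol/uL) with invented relative responses.
STD_CONC = [0.5, 2.5, 12.5, 25.0, 125.0, 250.0]
STD_RESPONSE = [0.011, 0.049, 0.26, 0.49, 2.52, 4.99]

slope, intercept, r_squared = fit_line(STD_CONC, STD_RESPONSE)
sample_conc = concentration_from_curve(1.0, slope, intercept)
print(slope, intercept, r_squared)
print(sample_conc, classify(sample_conc, lloq=0.5, uloq=250.0))
```

In practice the regression weighting and the precision and accuracy requirements per concentration level are user-defined inputs, so the unweighted fit and the fixed tolerance used here are simplifications.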

Fig. 24.5 Quantitation results from the multiplexed MRM analysis of control plasma. The range of protein concentrations shown in (a) was determined from the BAK-192 discovery analysis, while the concentration distribution in (b) is from the BAK-21 verification analysis
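
Two pieces of arithmetic underlie concentration ranges such as those plotted here: converting a molar peptide concentration into a protein mass concentration (which is why protein molecular weights are needed as input), and expressing a range in orders of magnitude. The sketch below illustrates both. The function names are hypothetical, and the molecular weight of 66,500 Da is an approximate value for serum albumin used only as an example; it is not taken from the chapter.

```python
import math


def fmol_per_ul_to_ng_per_ml(conc_fmol_per_ul, mol_weight_da):
    """Convert fmol/uL to ng/mL.

    1 fmol/uL equals 1 nmol/L; multiplying by g/mol gives ng/L,
    and dividing by 1000 gives ng/mL.
    """
    return conc_fmol_per_ul * mol_weight_da / 1000.0


def orders_of_magnitude(high, low):
    """Base-10 logarithm of the ratio between two concentrations."""
    return math.log10(high / low)


# Endpoints of the BAK-192 range quoted in the text, both in ng/mL.
ALBUMIN_NG_PER_ML = 31e6        # 31 mg/mL
PEROXIREDOXIN_2_NG_PER_ML = 18.0

print(fmol_per_ul_to_ng_per_ml(1.0, 66500.0))  # illustrative molecular weight
print(round(orders_of_magnitude(ALBUMIN_NG_PER_ML, PEROXIREDOXIN_2_NG_PER_ML), 1))
```

The second value comes out slightly above 6, consistent with the statement that the quantified proteins span 6 orders of magnitude.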

acquisition/analysis methods have been assembled. The final MRM acquisition method consists of a maximum of two proteotypic peptides per protein (39 total) and three transitions per peptide, which will be used for interference screening and protein quantitation of the patient samples, as outlined above. To ensure consistent performance of the LC-MS platform, daily/monthly QC kits will also be run before and after each plate. These kits require only simple rehydration of the lyophilized, SIS-spiked plasma digest(s) prior to LC-MRM/MS analysis, with evaluations achieved through value tracking and correlation to the reference values in the kits. These QC kits, we note, have already proven useful in diagnosing instrument errors and performance deficits in intra-/inter-lab studies in the past [41, 42], and should help again here to validate the experimental workflow and analytical system.

24.4 Summary

We have developed a set of highly specific and robust MRM-based assays for quantifying a large panel of 192 high-to-moderate abundance candidate protein markers in antibody- and fractionation-free human plasma. The 192 proteins (inferred from 487 peptides) are designed to be implemented in targeted, biomarker discovery-based studies, while a subset

panel of 21 targets has been designed for biomarker verification in a diabetes-centric study. To help standardize the process, essential materials required to complete the entire protocol (from sample preparation and processing to quantitative analysis) have been assembled into kits, as described here for the BAK-192 and BAK-21. Additionally, our recently developed Qualis-SIS software offers an automated means of quantifying proteins in reference and patient samples through regression analysis of standard curves or through SPM. To aid in quality assessment, the results are illustrated in a color-coded matrix for rapid visualization and evaluation. Continued developments are focused on extending these panels for more comprehensive discovery and verification of putative, or unknown, protein biomarkers. Nonetheless, the strategies, kits, and tools discussed here act as a useful starting point for biomarker evaluation of a panel of proteins of interest in patient samples.

Acknowledgements We wish to thank Genome Canada for STIC (Science and Technology Innovation Centre) funding and support. Carol Parker (UVic-Genome BC Proteomics Centre) is acknowledged for assisting in the manuscript editing process.

Competing Interests
CHB is the director of the Centre and the Chief Scientific Officer of MRM Proteomics, which has commercialized the performance kits (namely the PeptiQuant LC-MS Platform and PeptiQuant MRM/MS Workflow kits) and the assessment kits (PeptiQuant Human Discovery Assay kit, or BAK-192, and BAK-21) described here.

References

1. Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S et al (2004) Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 3:1154–69
2. Dayon L, Sanchez JC (2012) Relative protein quantification by MS/MS using the tandem mass tag technology. Methods Mol Biol 893:115–27
3. Picard G, Lebert D, Louwagie M, Adrait A, Huillet C, Vandenesch F et al (2012) PSAQ™ standards for accurate MS-based quantification of proteins: from the concept to biomedical applications. J Mass Spectrom 47:1353–63
4. Villanueva J, Carrascal M, Abian J (2014) Isotope dilution mass spectrometry for absolute quantification in proteomics: concepts and strategies. J Proteomics 96:184–99
5. Gillette MA, Carr SA (2013) Quantitative analysis of peptides and proteins in biomedicine by targeted mass spectrometry. Nat Methods 10:28–34
6. Picotti P, Aebersold R (2012) Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nat Methods 9:555–66
7. Gallien S, Duriez E, Crone C, Kellmann M, Moehring T, Domon B (2012) Targeted proteomic quantification on quadrupole-orbitrap mass spectrometer. Mol Cell Proteomics 11:1709–23
8. Gallien S, Bourmaud A, Kim SY, Domon B (2014) Technical considerations for large-scale parallel reaction monitoring analysis. J Proteomics 100:147–59
9. Peterson AC, Russell JD, Bailey DJ, Westphall MS, Coon JJ (2012) Parallel reaction monitoring for high resolution and high mass accuracy quantitative, targeted proteomics. Mol Cell Proteomics 11:1475–88
10. Gillet LC, Navarro P, Tate S, Röst H, Selevsek N, Reiter L et al (2012) Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol Cell Proteomics 11:O111.016717
11. Picotti P, Clément-Ziza M, Lam H, Campbell DS, Schmidt A, Deutsch EW et al (2013) A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature 494:266–70
12. Anderson NL, Anderson NG (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol Cell Proteomics 1:845–67
13. Omenn GS (2007) The HUPO Human Plasma Proteome Project. Proteomics Clin Appl 1:769–79
14. Omenn GS (2004) The Human Proteome Organization Plasma Proteome Project pilot phase: reference specimens, technology platform comparisons, and standardized data submissions and analyses. Proteomics 4:1235–40
15. Caisey JD, King DJ (1980) Clinical chemical values for some common laboratory animals. Clin Chem 26:1877–9
16. Chambers AG, Percy AJ, Yang J, Camenzind AG, Borchers CH (2013) Multiplexed quantitation of endogenous proteins in dried blood spots by multiple reaction monitoring mass spectrometry. Mol Cell Proteomics 12:781–91
17. Berna M, Ott L, Engle S, Watson D, Solter P, Ackermann B (2008) Quantification of NTproBNP in rat serum using immunoprecipitation and LC/MS/MS: a biomarker of drug-induced cardiac hypertrophy. Anal Chem 80:561–6
18. Whiteaker JR, Zhao L, Lin C, Yan P, Wang P, Paulovich AG (2012) Sequential multiplexed analyte quantification using peptide immunoaffinity

enrichment coupled to mass spectrometry. Mol Cell Proteomics 11:M111.015347. doi:10.1074/mcp.M111
19. Whiteaker JR, Zhao L, Frisch C, Ylera F, Harth S, Knappik A et al (2014) High-affinity recombinant antibody fragments (Fabs) can be applied in peptide enrichment immuno-MRM assays. J Proteome Res 13:2187–96
20. Percy AJ, Simon R, Chambers AG, Borchers CH (2014) Enhanced sensitivity and multiplexing with 2D LC/MRM-MS and labeled standards for deeper and more comprehensive protein quantitation. J Proteomics 106:113–24
21. Shi T, Fillmore TL, Sun X, Zhao R, Schepmoes AA, Hossain M et al (2012) Antibody-free, targeted mass-spectrometric approach for quantification of proteins at low picogram per milliliter levels in human plasma/serum. Proc Natl Acad Sci U S A 109:15395–400
22. Keshishian H, Addona T, Burgess M, Kuhn E, Carr SA (2007) Quantitative, multiplexed assays for low abundance proteins in plasma by targeted mass spectrometry and stable isotope dilution. Mol Cell Proteomics 6:2212–29
23. Huttenhain R, Soste M, Selevsek N, Rost H, Sethi A, Carapito C et al (2012) Reproducible quantification of cancer-associated proteins in body fluids using targeted proteomics. Sci Transl Med 4:142ra94
24. Liu T, Hossain M, Schepmoes AA, Fillmore TL, Sokoll LJ, Kronewitter SR et al (2012) Analysis of serum total and free PSA using immunoaffinity depletion coupled to SRM: correlation with clinical immunoassay tests. J Proteomics 75:4747–57
25. Rezeli M, Végvári A, Ottervald J, Olsson T, Laurell T, Marko-Varga G (2011) MRM assay for quantitation of complement components in human blood plasma – a feasibility study on multiple sclerosis. J Proteomics 75:211–20
26. Keshishian H, Addona T, Burgess M, Mani DR, Shi X, Kuhn E et al (2009) Quantification of cardiovascular biomarkers in patient plasma by targeted mass spectrometry and stable isotope dilution. Mol Cell Proteomics 8:2339–49
27. Percy AJ, Yang J, Chambers AG, Simon R, Hardie DB, Borchers CH (2014) Multiplexed MRM with internal standards for cerebrospinal fluid candidate protein biomarker quantitation. J Proteome Res 13:3733–47
28. Chambers AG, Percy AJ, Simon R, Borchers CH (2014) MRM for the verification of cancer biomarker proteins: recent applications to human plasma and serum. Expert Rev Proteomics 11:137–48
29. Percy AJ, Byrns S, Chambers AG, Borchers CH (2013) Targeted quantitation of CVD-linked plasma proteins for biomarker verification and validation. Expert Rev Proteomics 10:567–78
30. Domanski D, Percy AJ, Yang J, Chambers AG, Hill JS, Cohen Freue GV et al (2012) MRM-based multiplexed quantitation of 67 putative cardiovascular disease biomarkers in human plasma. Proteomics 12:1222–43
31. Percy AJ, Chambers AG, Yang J, Borchers CH (2013) Multiplexed MRM-based quantitation of candidate cancer biomarker proteins in undepleted and non-enriched human plasma. Proteomics 13:2202–15
32. Percy AJ, Chambers AG, Yang J, Hardie D, Borchers CH (2014) Advances in multiplexed MRM-based protein biomarker quantitation toward clinical utility. Biochim Biophys Acta 1844:917–26
33. Paulovich AG, Whiteaker JR, Hoofnagle AN, Wang P (2008) The interface between biomarker discovery and clinical validation: the tar pit of the protein biomarker pipeline. Proteomics Clin Appl 2:1386–402
34. Wilson R (2013) Sensitivity and specificity: twin goals of proteomics assays. Can they be combined? Expert Rev Proteomics 10:135–49
35. Camenzind AG, van der Gugten JG, Popp R, Holmes DT, Borchers CH (2013) Development and evaluation of an immuno-MALDI (iMALDI) assay for angiotensin I and the diagnosis of secondary hypertension. Clin Proteomics 10:20
36. Anderson NL, Anderson NG, Haines LR, Hardie DB, Olafson RW, Pearson TW (2004) Mass spectrometric quantitation of peptides and proteins using stable isotope standards and capture by anti-peptide antibodies (SISCAPA). J Proteome Res 3:235–44
37. Anderson NL, Razavi M, Pearson TW, Kruppa G, Paape R, Suckau D (2012) Precision of heavy-light peptide ratios measured by MALDI-tof mass spectrometry. J Proteome Res 11:1868–78
38. Sparbier K, Wenzel T, Dihazi H, Blaschke S, Müller GA, Deelder AM et al (2009) Immuno-MALDI-TOF MS: new perspectives for clinical applications of mass spectrometry. Proteomics 9:1442–50
39. Kennedy JJ, Abbatiello SE, Kim K, Yan P, Whiteaker JR, Lin C et al (2014) Demonstrating the feasibility of large-scale development of standardized assays to quantify human proteins. Nat Methods 11:149–55
40. Addona TA, Abbatiello SE, Schilling B, Skates SJ, Mani DR, Bunk DM et al (2009) Multi-site assessment of the precision and reproducibility of multiple reaction monitoring-based measurements of proteins in plasma. Nat Biotechnol 27:633–41
41. Percy AJ, Chambers AG, Smith DS, Borchers CH (2013) Standardized protocols for quality control of MRM-based plasma proteomic workflow. J Proteome Res 12:222–33
42. Percy AJ, Chambers AG, Yang J, Jackson AM, Domanski D, Burkhart J et al (2013) Method and platform standardization in MRM-based quantitative plasma proteomics. J Proteomics 95:66–76
43. Mohammed Y, Percy AJ, Chambers AG, Borchers CH (2015) Qualis-SIS: automated standard curve generation and quality assessment for multiplexed targeted quantitative proteomic experiments with labeled standards. J Proteome Res 14:1137–46
44. Kuzyk MA, Smith D, Yang J, Cross TJ, Jackson AM, Hardie DB et al (2009) Multiple reaction monitoring-based, multiplexed, absolute quantitation of

45 proteins in human plasma. Mol Cell Proteomics MRM-based protein biomarker quantitation toward
8:1860–77 clinical utility. Biochim Biophys Acta 2014:917–26
45. Kuzyk MA, Parker CE, Domanski D, Borchers CH 53. Tabb DL, Vega-Montoto L, Rudnick PA, Variyath
(2013) Development of MRM-based assays for the AM, Ham A-JL, Bunk DM et al (2010) Repeatability
absolute quantitation of plasma proteins. Methods and reproducibility in proteomic identifications by
Mol Biol 1023:53–82 liquid chromatography-tandem mass spectrometry. J
46. Rodriguez J, Gupta N, Smith RD, Pevzner PA (2008) Proteome Res 9:761–76
Does trypsin cut before proline? J Proteome Res 54. Domon B, Aebersold R (2010) Options and
7:300–5 considerations when selecting a quantitative proteo-
47. Mohammed Y, Domanski D, Jackson AM, Smith DS, mics strategy. Nat Biotechnol 28:710–21
Deelder AM, Palmblad M et al (2014) PeptidePicker: 55. R€ost HL, Rosenberger G, Navarro P, Gillet L,
a scientific workflow with web interface for selecting Miladinović SM, Schubert OT et al (2014)
appropriate peptides for targeted proteomics OpenSWATH enables automated, targeted analysis
experiments. J Proteomics 106:151–61 of data-independent acquisition MS data. Nat
48. Mallick P, Schirle M, Chen SS, Flory MR, Lee H, Biotechnol 32:219–23
Martin D et al (2007) Computational prediction of 56. Ni X, Li X, Guo Y, Zhou T, Guo X, Zhao C
proteotypic peptides for quantitative proteomics. Nat et al (2014) Quantitative proteomics analysis of
Biotechnol 12:125–31 altered protein expression in the placental villous tis-
49. Percy AJ, Chambers AG, Yang J, Domanski D, sue of early pregnancy loss using isobaric tandem
Borchers CH (2012) Comparison of standard- and mass tags. Biomed Res Int 2014:647143
nano-flow liquid chromatography platforms 57. Mingrone G, Iaconelli A, Leccesi L, Nanni G,
for MRM-based quantitation of putative plasma Pomp A, Castagneto M et al (2012) Bariatric surgery
biomarker proteins. Anal Bioanal Chem versus conventional medical therapy for type 2 diabe-
404:1089–101 tes. N Engl J Med 366:1577–85
50. R€ost H, Malmstr€om L, Aebersold R (2012) A compu- 58. Schauer PR, Kashyap SR, Wolski K, Brethauer SA,
tational tool to detect and avoid redundancy in Kirwan JP, Pothier CE et al (2012) Bariatric surgery
selected reaction monitoring. Mol Cell Proteomics versus intensive medical therapy in obese patients
11:540–9 with diabetes. N Engl J Med 366:1567–76
51. Agger SA, Marney LC, Hoofnagle AN (2010) Simul- 59. Sj€ostr€
om L, Peltonen M, Jacobson P, Ahlin S,
taneous quantification of apolipoprotein a-I and apoli- Andersson-Assarsson J, Anveden Å et al (2014) Asso-
poprotein B by liquid-chromatography-multiple- ciation of bariatric surgery with long-term remission
reaction-monitoring mass spectrometry. Clin Chem of type 2 diabetes and with microvascular and
56:1804–13 macrovascular complications. JAMA 311:2297–304
52. Percy AJ, Chambers AG, Yang J, Hardie DB,
Borchers CH (1844) Advances in multiplexed
Index

A
Absolute protein expression (APEX), 257, 259, 260, 266
Accurate inclusion mass screening (AIMS), 496, 500, 502, 505, 508, 511
Acetylome, 361
Affinity chromatography, 14, 31–34, 46, 47, 103, 107, 108, 115, 116, 118, 134, 135, 140, 181, 351, 360, 374, 376, 503–504
Antibody depletion, 446–447
Anti-k(GG) Ab, 360
ATAQS, 257, 259, 271, 272, 274–277, 498

B
BioGRID, 282, 301, 308, 323–329, 331–339
Bioinformatics, 141, 173, 204, 205, 213, 234, 249, 257–259, 263, 265–266, 272, 277, 281–339, 346, 351, 369–374, 391, 417, 464, 478, 498, 501, 519
Biological fluids, 8–16, 34, 129, 359, 518
Biomarker discovery, 8–10, 14, 185, 438, 439, 444, 445, 462, 465, 494, 504, 505, 507, 511, 518, 524, 526, 527
Biomarker verification, 500, 501, 518, 526, 528

C
Candidate biomarker selection, 437–439, 464
Cell culture, 3–9, 58, 110, 117, 181, 185, 256, 388
Central proteomics facilities pipeline (CPFP), 210, 211
Chemical crosslinking (CX), 282, 391, 393, 414, 415, 418, 420, 424
Clinical proteomics, 8, 435–441, 444, 467, 488, 504
Cohort selection, 495
Collision induced dissociation (CID), 153, 159, 164–167, 177, 184, 185, 187–189, 204, 244, 246, 259, 302, 362–365, 367, 369, 371, 407, 416, 420, 421, 425, 516, 524
Column selection, 114

D
Data
  clustering, 452
  consistency, 465–466
  inspection, 211, 465
  normalization, 470
  processing pipeline, 204, 205, 207
  transformations, 471
Data-independent MS/MS acquisition (DIA-MS/MS), 406, 500–502
DAVID, 233, 252, 283, 287–293
Digestion, 4, 5, 11, 12, 14, 42, 46–59, 72, 76, 95, 109, 113, 120, 130, 131, 135–137, 148, 153, 232, 256, 271, 272, 298, 354, 360, 374, 386–388, 399, 402–404, 406–408, 412, 417, 420, 422–424, 504, 510, 511
  comparison, 49, 51, 52

E
Electron transfer dissociation (ETD), 153, 177, 184–186, 188, 189, 204, 246, 360, 362, 364–369, 371, 376, 407, 416, 419–421
ELISA-based verification, 496

F
Fast photo oxidation of proteins (FPOP), 407
Filter-assisted sample preparation (FASP), 46, 49, 51, 55, 59
FT-ICR mass analyzer, 161–164

G
Gene ontology (GO), 212, 233, 252, 283, 285, 290, 308, 309, 312, 313, 315, 324, 326, 328, 334, 337
Glycoproteomics, 51, 108, 139, 141
Green algae and plastids, 73

H
HDX. See Hydrogen/deuterium exchange (HDX)
Higher-energy collisional dissociation (HCD), 153, 177, 204, 246, 362, 364, 365, 367–369, 371, 420, 425, 499
HILIC. See Hydrophilic interaction liquid chromatography (HILIC)
Human protein reference database (HPRD), 282, 308, 319, 320, 323, 327–330
Hydrazine-based purification, 359
Hydrogen/deuterium exchange (HDX), 186–187, 398–411, 414

Hydrophilic interaction liquid chromatography (HILIC), 30–31, 41, 93, 95–99, 107, 110–120, 123, 125–127, 131, 133, 134, 136, 138, 139, 141–143, 182, 355, 357, 358, 374
Hydrophobic interaction chromatography (HIC), 27–29, 34, 36, 39, 103, 110, 119, 132, 140

I
Immobilized metal affinity chromatography (IMAC), 101, 114, 118, 127, 141, 142, 181, 352, 353, 357, 361, 374, 375
Immunoprecipitation, 44, 47, 180, 282, 312, 318, 352, 353, 361, 385, 386, 388, 391–393, 415
In-gel digestion, 47, 49, 56, 135, 387, 424
Ingenuity pathway analysis (IPA), 210, 212, 283, 301–307
In-solution digestion, 49–51, 109, 423
IntAct, 282, 301, 308, 318–326, 332, 367
Interactome mapping, 282, 305
Ion exchange chromatography (IEX), 24–27, 30, 33, 34, 36, 37, 89–91, 103, 106, 109, 112, 118, 119, 124, 125, 129, 130, 137, 138, 140, 142, 181–182, 354, 389, 391, 502
Ion trap mass analyzer, 159
IsobariQ and Iquant, 267–270

K
KEGG. See Kyoto encyclopedia of genes and genomes (KEGG)
KMBD MBT-based purification, 361
Kyoto encyclopedia of genes and genomes (KEGG), 283, 290, 293–296, 299, 300, 302, 308, 309, 312, 374

L
Lectin affinity, 47, 112, 118, 358, 359
Limited proteolysis, 408–413
Lysine-acetylation, 31, 361
Lysine-methylation, 247, 350, 351, 361
Lysis, 12, 45, 46, 50, 52, 53, 57, 58, 72, 74, 75, 130, 360, 389, 393

M
Machine learning, 266, 476, 478, 487, 488, 505
Mass analyzer, 157–168, 176–177, 189, 204, 269, 499, 521
Mass spectrometer, 11, 14, 29, 31, 36, 41, 46, 52–55, 100, 119, 127, 130, 136, 147, 148, 150, 153, 157–168, 173, 174, 177, 182, 189, 204, 206, 218, 224, 231, 234, 256, 265, 271, 277, 361–365, 367, 369, 387, 401, 402, 404, 405, 408, 416, 420, 497–499, 516, 519–521, 524
MAXQuant, 115, 205, 210–212, 228, 234, 244, 251, 258–260, 265–266, 268, 337
MaxQuant pipeline, 205, 211
Metal oxide affinity chromatography (MOAC), 115–117, 351–353
Methylome, 346
Microbial protein interaction database (MPIDB), 282, 332–333
MINT, 282, 301, 308, 312–318, 324, 333
MOAC. See Metal oxide affinity chromatography (MOAC)
Mobile phase, 25–32, 34, 84–88, 90, 92–103, 106, 108, 111, 115–117, 119, 120, 122–133, 136–139, 142, 181, 182, 355, 405, 521, 526
MS3, 363, 364, 367, 369, 420
Multi-dimensional chromatography for top-down proteomics, 34–36

N
Neutral loss scan, 166

O
OpenMS pipeline, 205, 211
Orbitrap mass analyzer, 163, 499
Orthogonality, 109, 116, 128, 132, 354–357
Outlier detection, 466–467

P
PA/IP2 pipeline, 212
PANTHER, 283–287
Parallel reaction monitoring (PRM), 44, 252, 348, 406, 409–410, 499–502, 508, 511, 516, 517
Peptide
  chromatography, 107
  fractionation, 45, 93, 106, 109, 120, 354, 357, 374, 517
  identification, 48, 76, 108, 111, 112, 115, 120, 128, 131, 149, 150, 153, 208, 210, 218, 237–241, 243, 244, 247, 264, 266, 355, 357, 364, 405–406, 411–412
PeptideAtlas, 213, 270–272, 274, 498, 500, 519
Phosphoproteomics, 114, 115, 127, 141, 354
Plant
  cell lysis, 64–66, 73, 76
  meristem and suspension culture cells, 71–72
  organs, 67–71
  protein extraction, 74
  proteomics, 63, 68
  secretome, 65
Plasma, 5, 6, 9–11, 15, 35, 44, 45, 64, 65, 108, 113, 117, 118, 133, 134, 141, 143, 438, 444–449, 451, 452, 454, 458, 496–510, 516–528
Post translational modification (PTM), 14, 49, 94, 102, 153, 180, 184, 212, 228, 232, 233, 244–246, 328–330, 346–349, 351, 357, 361, 362, 367–376, 440, 446
  enrichment, 375
  mass spectrometry, 346, 349
Precursor ion scan, 165–166, 367
Product ion scan, 165, 166
Progenesis QI, 261, 266
Protein
  chromatography, 104, 111
  depletion or enrichment, 46–47
  extraction, 7, 45–48, 51, 53–54, 64, 67–69, 71, 73, 75, 110
  fractionation, 23–41, 45, 50, 51, 88, 93, 132
  grouping, 212
  identification, 15, 35, 36, 43, 44, 48, 50, 52, 53, 67, 71, 104, 113, 120, 127, 133, 138, 139, 149, 163, 178, 206, 211, 212, 218, 230, 238, 240–242, 251–252, 257, 264, 266, 269, 274, 524
  inference, 148, 150, 151, 208, 210, 218, 219, 228–230, 237–242, 268
  pool, 446–448, 454
  quantification, 44, 209, 255–276, 390, 502, 511, 517
  quantification bioinformatics, 265
  structural analysis, 397–425
Protein–protein interaction prediction (PIPs), 282, 329–332
Proteomics data interpretation, 281–339
Proteomics data processing, 205

Q
Q-TOF tandem mass analyzer, 166
Quadrupole mass analyzer, 158, 499, 516, 521
Quantitative signal processing, 256, 257

R
Reducing sample complexity, 83–142, 357, 358, 502
Reversed-phase chromatography (RPC), 27, 30, 33, 35, 36, 41, 84–89, 96, 97, 101–107, 129, 141, 142, 204, 354, 355

S
Sample origin, 4, 5, 7
SAPRatio, 210, 258, 265, 266
Scaffold pipeline, 212, 244
Search engines, 55, 147–154, 180, 204, 210–212, 217–234, 243–247, 249, 251, 252, 257, 260, 262–264, 266, 315, 372, 415, 420–422, 424
Selected reaction monitoring (SRM), 37, 44, 166, 213, 252, 256, 260, 269–274, 496–511, 521
Sequence database searching, 148, 151
Serial PTM enrichment, 375
Serum, 6, 7, 9–11, 25, 37, 38, 46, 95, 108, 111, 112, 117, 120, 122, 130, 133, 134, 136, 138, 139, 141, 444–445, 450–452, 455–457, 499, 502, 506–509, 516, 517, 522, 524
Shotgun proteomics, 12, 75, 76, 113, 128, 139, 153, 204, 205, 207, 210, 211, 217–224, 227–234, 240, 256, 271, 420
Size-exclusion chromatography (SEC), 28–32, 39, 93–95, 101, 103, 108, 114–117, 124–126, 128–133, 135, 183, 186, 389–391, 445, 505
Skyline, 251, 252, 257, 259, 271–277, 498, 522
Sorcerer pipeline, 212
Stationary phase, 24, 26–28, 30, 31, 39, 84–107, 109, 111, 114, 116, 118–123, 127–129, 132, 136, 140–142, 181, 182, 354, 390, 526
STRING, 231, 251, 282, 306–309, 312–314, 324, 374

T
Tandem mass analyzer, 164
Targeted proteomics, 44, 252, 256, 257, 259, 269, 271, 272, 274, 438, 509
The Arabidopsis Information Resource (TAIR), 151, 282, 333–335
Time-of-flight (TOF) mass analyzer, 158, 160
Titanium dioxide (TiO2), 14, 101, 115, 352, 353, 357, 374
TOF/TOF tandem mass analyzer, 166
Trans-proteomic pipeline (TPP), 148, 205, 210, 211, 239, 251, 252, 262, 266–267
Triple quadrupoles tandem mass analyzer, 164

U
Ubiquitinome and sumoylated proteome, 361
