This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSIPN.2017.2668145, IEEE
Transactions on Signal and Information Processing over Networks
Abstract: Multidimensional scaling (MDS) is a popular dimensionality reduction technique that has been widely used for network visualization and cooperative localization. However, the traditional stress minimization formulation of MDS necessitates the use of batch optimization algorithms that are not scalable to large-sized problems. This paper considers an alternative stochastic stress minimization framework that is amenable to incremental and distributed solutions. A novel linear-complexity stochastic optimization algorithm is proposed that is provably convergent and simple to implement. The applicability of the proposed algorithm to localization and visualization tasks is also expounded. Extensive tests on synthetic and real datasets demonstrate the efficacy of the proposed algorithm.

Index Terms: Multidimensional Scaling, Stochastic SMACOF, Visualization, Localization, Big data.

Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
The author Sandeep Kumar was supported by TCS Research Scholarship Program TCS/CS/2011191G. Ketan Rajawat and Sandeep Kumar are with the Department of Electrical Engineering, Indian Institute of Technology Kanpur, Kanpur, UP 208016, India, email: {ketan, sandkr}@iitk.ac.in
This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the author. The material includes three videos and one pdf [readme.pdf, Video1-Multi-agent tracking.avi, Video2-MovieLens.mp4 and Video3-Newcomb.avi]. Contact [ketan@iitk.ac.in] for further questions about this work.
2373-776X (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Multidimensional scaling addresses the problem of embedding relational data onto a low-dimensional subspace. Originally proposed in the context of psychometrics and marketing [1], MDS and its variants have since found applications in social networks [2]–[6], genomics [7], computational chemistry [8], machine learning [9], and wireless networks [10]. As an exploratory technique, MDS is often used as a first step towards uncovering the structure inherent to high-dimensional data. In the context of machine learning and data mining, the pairwise dissimilarities are calculated using high- or infinite-dimensional nodal attributes, and MDS yields a distance-preserving, low-dimensional embedding. Of particular importance are the embeddings obtained in two- or three-dimensional Euclidean spaces, which serve as perceptual maps for visualizing relationships between objects. In the context of social networks, such representations reveal interconnections between people and communities, and are often more insightful than simpler metrics such as centrality and density. Different from the classical MDS framework that utilizes principal component analysis, modern MDS formulations are based on the minimization of a non-convex stress function [1]. Since the stress function is a weighted sum of squared fitting errors, it allows for the possibility of missing and noisy dissimilarities. Consequently, variants of the stress minimization problem have been developed for robust MDS [11], visualization of time-varying data [12], and cooperative localization of static [10], [13]–[15] and mobile networks [13]. Popular algorithms for solving the stress minimization problem include scaling by majorizing a complicated function (SMACOF) [1], semidefinite programming [16], alternating directions method of multipliers [14], [15], and distributed SMACOF [10].

The attractiveness of the MDS framework has however started to diminish with the advent of the data deluge. Specifically, when embedding $N$ objects, the per-iteration complexity and memory requirements of the aforementioned algorithms increase at least as $O(N^2)$, making them impractical for large-scale problems. To this end, approximate versions of SMACOF have been proposed for large-scale visualization applications [17], [18]. Nevertheless, most approximate MDS algorithms are still too complex for large-scale data, and cannot be generalized to other applications such as cooperative localization of large networks.

Visualization or localization of time-varying data is even more challenging since the iterative majorization algorithm must converge at every time instant [10], [12], [13]. In mobile sensor networks, carrying out a large number of iterations at each time instant incurs a tremendous communication overhead, and is generally impractical. For instance, the distributed weighted MDS approach [10] still requires at least $N$ operations per iteration per time instant, which is prohibitive for large networks. For large-scale applications, where localization or visualization is constrained by the per-iteration complexity and memory requirements, it is instead desirable to have an online algorithm. Towards this end, the goal is to obtain an adaptive algorithm that processes dissimilarity measurements in a sequential or online manner. For instance, an adaptive algorithm can allow visualization of large networks by reading and processing the pairwise dissimilarities in small batches. Similarly, the communication cost required for large-scale network localization can be reduced by processing only a few range measurements at a time.

This paper considers the stress minimization problem in a stochastic setting, where the dissimilarity measurements and the weights are modeled as random time-varying quantities with unknown distributions. The first contribution of this paper is a novel stochastic SMACOF algorithm that processes the dissimilarities in an online fashion, and is therefore applicable to both static and time-varying scenarios (Sec. III). The proposed algorithm is not only scalable, but is also amenable to a distributed and asynchronous implementation in ad hoc networks (Sec. IV). As our second contribution, it is shown that the trajectory of the stochastic SMACOF algorithm remains close to that of an averaged algorithm, which itself converges to a stationary point of the stochastic stress minimization problem (Sec. III-B). The analysis borrows
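tools from spectral graph theory, stochastic approximation, and convergence analysis of the SMACOF algorithm. Finally, as the third contribution, the performance of the proposed algorithm is tested extensively on various synthetic and real-world data sets (Sec. V). The numerical tests confirm the applicability of the stochastic SMACOF algorithm to a variety of scenarios.

The notation used in this paper is as follows. Bold upper (lower) case letters denote matrices (vectors). The $(m, n)$-th entry of a matrix $\mathbf{A}$ is denoted by $[\mathbf{A}]_{mn}$. $\mathbf{I}_N$ is the $N \times N$ identity matrix, $\mathbf{0}$ denotes the all-zero matrix or vector, and $\mathbf{1}$ denotes the all-one matrix or vector, depending on the context. For a vector $\mathbf{x}$, $\|\mathbf{x}\|$ denotes its $\ell_2$ norm. For a matrix $\mathbf{A}$, $\|\mathbf{A}\|$ denotes its Frobenius norm, $\|\mathbf{A}\|_2$ denotes the $\ell_2$ norm, $\mathrm{tr}(\mathbf{A})$ denotes its trace, and $\det(\mathbf{A})$ denotes its determinant.

Starting with an initial $\hat{\mathbf{X}}^{(0)}$, the SMACOF update at the $k$-th iteration entails carrying out the following update:

$$\hat{\mathbf{X}}^{(k+1)} = \arg\min_{\mathbf{X}} \; \mathrm{tr}(\mathbf{X}^T \mathbf{L} \mathbf{X}) - 2\,\mathrm{tr}\big(\mathbf{X}^T \mathbf{B}(\hat{\mathbf{X}}^{(k)}) \hat{\mathbf{X}}^{(k)}\big) \tag{6}$$

$$\phantom{\hat{\mathbf{X}}^{(k+1)}} = \mathbf{L}^{\dagger} \mathbf{B}(\hat{\mathbf{X}}^{(k)}) \hat{\mathbf{X}}^{(k)} \tag{7}$$

where (7) follows since $\mathbf{B}(\mathbf{X})\mathbf{X}$ lies in the range space of $\mathbf{L}$. Observe that since $\mathbf{L}$ is rank-deficient, the solution to (6) is not unique. However, when the weights $\{w_{mn}\}$ specify a fully connected graph $\mathcal{G} := (\{1, \ldots, N\}, \mathcal{E})$, both $\mathbf{L}$ and $\mathbf{B}(\mathbf{X})$ have rank $N - 1$, with the null space of $\mathbf{L}$ spanned by $\mathbf{1}$. Therefore, any solution to (6) is of the form $\mathbf{L}^{\dagger} \mathbf{B}(\hat{\mathbf{X}}^{(k)}) \hat{\mathbf{X}}^{(k)} + \mathbf{1}\mathbf{c}^T$ for some vector $\mathbf{c}$ whose length equals the embedding dimension. Further, if the initial $\hat{\mathbf{X}}^{(0)}$ is chosen such that it is centered at the origin, i.e., $\mathbf{1}^T \hat{\mathbf{X}}^{(0)} = \mathbf{0}$, the updates in (7) ensure that $\mathbf{1}^T \hat{\mathbf{X}}^{(k)} = \mathbf{0}$ for all $k \geq 1$.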
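The update (7) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: it assumes the standard SMACOF definitions of the weighted Laplacian $\mathbf{L}$ and of $\mathbf{B}(\mathbf{X})$ (their defining equations are not part of this excerpt), and the helper names `pairwise_dist`, `laplacian`, `b_matrix`, `stress`, and `smacof_step`, as well as the toy data, are hypothetical.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distances between all pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def laplacian(W):
    """Weighted Laplacian: off-diagonal -w_mn, diagonal sum_k w_mk (assumed definition)."""
    return np.diag(W.sum(axis=1)) - W

def b_matrix(X, W, D, tiny=1e-12):
    """B(X): off-diagonal -w_mn * delta_mn / ||x_m - x_n||, rows summing to zero."""
    dist = pairwise_dist(X)
    B = -W * D / np.maximum(dist, tiny)   # tiny only guards the zero diagonal here
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(axis=1))
    return B

def stress(X, W, D):
    """Weighted stress: sum over m < n of w_mn (delta_mn - ||x_m - x_n||)^2."""
    return 0.5 * np.sum(W * (D - pairwise_dist(X)) ** 2)

def smacof_step(X, W, D, L_pinv):
    """One update of the form (7): X <- pinv(L) B(X) X."""
    return L_pinv @ (b_matrix(X, W, D) @ X)

# Toy problem (hypothetical data): exact distances between 30 random 2-D points.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2))
D = pairwise_dist(truth)
W = np.ones_like(D) - np.eye(30)          # fully connected graph, unit weights
L_pinv = np.linalg.pinv(laplacian(W), rcond=1e-10, hermitian=True)

X = rng.normal(size=(30, 2))
X -= X.mean(axis=0)                        # origin-centered initialization
history = [stress(X, W, D)]
for _ in range(50):
    X = smacof_step(X, W, D, L_pinv)
    history.append(stress(X, W, D))
```

Because the pseudo-inverse annihilates the all-one direction, the iterates stay origin-centered without an explicit re-centering step, which mirrors the remark that $\mathbf{1}^T \hat{\mathbf{X}}^{(k)} = \mathbf{0}$ is preserved by (7).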
III. ONLINE EMBEDDING VIA STOCHASTIC SMACOF

A. Algorithm outline

Given $\{\delta_{mn}(t)\}$ and $\{w_{mn}(t)\}$, and starting with an arbitrary origin-centered $\hat{\mathbf{X}}_0$, the updates for the proposed stochastic SMACOF algorithm take the form

$$\hat{\mathbf{X}}_{t+1} = (1 - \mu)\hat{\mathbf{X}}_t + \mu\, \mathbf{L}_t^{\dagger} \mathbf{B}_t(\hat{\mathbf{X}}_t) \hat{\mathbf{X}}_t, \qquad t \geq 0 \tag{10}$$

where

$$[\mathbf{B}_t(\mathbf{X})]_{mn} = \begin{cases} -\dfrac{w_{mn}(t)\,\delta_{mn}(t)}{\sqrt{\|\mathbf{x}_m - \mathbf{x}_n\|^2 + \epsilon_x}} & m \neq n \\[2ex] -\sum_{k=1,\, k \neq m}^{N} [\mathbf{B}_t(\mathbf{X})]_{mk} & m = n \end{cases} \tag{11}$$

with $\epsilon_x$ being a small positive constant that ensures that the entries of $\mathbf{B}_t(\mathbf{X})$ stay bounded for all $\mathbf{X}$. The update rule can be viewed as a stochastic version of the SMACOF algorithm with the following modifications: (a) at each time instant, only one iteration of SMACOF is executed using the modified definition of $\mathbf{B}_t(\mathbf{X})$ in (11); (b) the estimated coordinates $\hat{\mathbf{X}}_t$ at time $t$ are used for initialization at $t + 1$; and (c) the estimated coordinates $\hat{\mathbf{X}}_{t+1}$ are constructed by taking a convex combination of $\hat{\mathbf{X}}_t$ and the SMACOF output. The last modification endows the algorithm with tracking capabilities since the parameter $\mu$ may be interpreted as the forgetting factor, and can be tuned in accordance with the rate of change of $\{\delta_{mn}(t)\}$ and $\{w_{mn}(t)\}$. For example, the embedding at time $t + 1$ can be forced to be close to that at time $t$ by setting $\mu \ll 1$. Finally, the proposed update rule subsumes the SMACOF algorithm for static scenarios, where we set $\delta_{mn}(t) = \delta_{mn}$ and $w_{mn}(t) = w_{mn}$ for all $t$, and $\mu = 1$.

The update rule in (10) is valid only if the graph $\mathcal{G}_t$ defined by $\{w_{mn}(t)\}$ is connected for all $t \geq 1$. In the case when $\mathcal{G}_t$ has more than one connected component, the coordinates within each component must be updated separately. Let $\mathcal{C}_t^j$ be the set of nodes belonging to the $j$-th component and $\mathbf{I}_t^j$ be the $|\mathcal{C}_t^j| \times N$ selection matrix containing the rows of $\mathbf{I}_N$ corresponding to the elements in $\mathcal{C}_t^j$. Defining $\mathbf{L}_t^{(j)} := \mathbf{I}_t^j \mathbf{L}_t (\mathbf{I}_t^j)^T$ and $\mathbf{B}_t^{(j)}(\mathbf{X}_t) := \mathbf{I}_t^j \mathbf{B}_t(\mathbf{X}_t) (\mathbf{I}_t^j)^T$, the update rule for the nodes in $\mathcal{C}_t^j$ is given by

$$\hat{\mathbf{X}}_{t+1}^{(j)} = (\mathbf{I} - \mu \mathbf{J}_t)\hat{\mathbf{X}}_t^{(j)} + \mu\, (\mathbf{L}_t^{(j)})^{\dagger} \mathbf{B}_t^{(j)}(\hat{\mathbf{X}}_t^{(j)}) \hat{\mathbf{X}}_t^{(j)} \tag{12}$$

where $\mathbf{J}_t := \mathbf{I} - \mathbf{1}\mathbf{1}^T / |\mathcal{C}_t^j|$ is the $|\mathcal{C}_t^j| \times |\mathcal{C}_t^j|$ centering matrix, which ensures that the coordinate center of each component does not change after the update, i.e., $\mathbf{1}^T \hat{\mathbf{X}}_{t+1}^{(j)} = \mathbf{1}^T \hat{\mathbf{X}}_t^{(j)}$. The general update rule

$$\hat{\mathbf{X}}_{t+1} = (\mathbf{I} - \mu \mathbf{L}_t^{\dagger} \mathbf{L}_t)\hat{\mathbf{X}}_t + \mu\, \mathbf{L}_t^{\dagger} \mathbf{B}_t(\hat{\mathbf{X}}_t) \hat{\mathbf{X}}_t \tag{13}$$

subsumes the forms specified in (10) and (12), irrespective of the number of connected components in $\mathcal{G}_t$, since it holds that

$$[\mathbf{L}_t^{\dagger} \mathbf{L}_t]_{mn} = \begin{cases} 1 - 1/|\mathcal{C}_t^j| & m = n \in \mathcal{C}_t^j \\ -1/|\mathcal{C}_t^j| & m \neq n,\; m, n \in \mathcal{C}_t^j \\ 0 & \text{otherwise.} \end{cases} \tag{14}$$

In contrast to the classical SMACOF algorithm, the proposed algorithm is flexible enough to be used in a number of different scenarios. As already discussed, a specific choice of parameters allows us to interpret the SMACOF algorithm as a special case of the proposed algorithm. On the other hand, the stochastic SMACOF can also be used to solve very large-scale MDS problems, where the full set of measurements $\{\delta_{mn}\}$ cannot be processed simultaneously. Instead, it is possible to apply (13) on a small subset of observations, corresponding to a subgraph $\mathcal{G}_t$. A special case occurs when exactly one edge is chosen per time instant and per cluster, i.e., $|\mathcal{C}_t^j| = 2$, and the updates in (13) reduce to those encountered in the stochastic proximity embedding (SPE) algorithm [6],

$$\hat{\mathbf{x}}_i(t+1) = \Big(1 - \frac{\mu}{2}\Big)\hat{\mathbf{x}}_i(t) + \frac{\mu}{2}\,\frac{\delta_{ij}(t)}{\|\hat{\mathbf{x}}_i(t) - \hat{\mathbf{x}}_j(t)\|}\,\hat{\mathbf{x}}_i(t) + \frac{\mu}{2}\Big(1 - \frac{\delta_{ij}(t)}{\|\hat{\mathbf{x}}_i(t) - \hat{\mathbf{x}}_j(t)\|}\Big)\hat{\mathbf{x}}_j(t) \tag{15}$$

and likewise for node $j$. The proposed stochastic SMACOF is therefore a generalization of SPE, applied to components of arbitrary sizes. Since the updates in (13) for any two clusters $\mathcal{C}_t^j$ and $\mathcal{C}_t^k$ do not depend on each other, the proposed algorithm can also be implemented in a distributed and asynchronous manner. Such an implementation is particularly suited to the range-based localization problems that arise in wireless networks.

Finally, akin to the classical adaptive filtering algorithms such as LMS, the proposed algorithm can also be applied to time-varying scenarios, i.e., when $\delta_{mn}(t)$ is non-stationary. The applications of interest include localization of time-varying networks, and visualization of time-varying data. In both cases, the first term $(\mathbf{I} - \mu \mathbf{J}_t)\hat{\mathbf{X}}_t^{(j)}$ in the per-component form (12) of the update (13) serves as a momentum term. That is, a small $\mu$ encourages $\hat{\mathbf{X}}_{t+1}$ to stay close to $\hat{\mathbf{X}}_t$, resulting in a smooth trajectory of $\{\hat{\mathbf{X}}_t\}$. On the other hand, a large value of $\mu$ enables tracking in highly time-varying scenarios, while making the updates sensitive to noise [23, Ch-21] [24, Ch-9]. Further implementation details pertaining to the localization and visualization problems are discussed in Sec. IV. Before proceeding with the asymptotic analysis, the following remark is due.

Remark 1. Building further on the link with adaptive algorithms, $\mu$ may be interpreted as a forgetting factor that downweights the past information. When $\mu$ is a constant that is strictly greater than zero, the algorithm forgets the old data exponentially quickly, thus offering superior tracking capability. In contrast, it is possible to have a long-memory version of the algorithm with a time-varying $\mu_t \to 0$. As $t \to \infty$, such an algorithm would no longer track the changes in $\delta_{mn}(t)$, and can be applied to static scenarios where the algorithm can stop once the embeddings converge. While the bounds developed here apply only to the case of constant $\mu > 0$, a diminishing step size is in fact utilized in Sec. V.

B. Asymptotic Performance

In general, establishing convergence of stochastic algorithms for non-convex problems is quite challenging [20]. Here, the asymptotic performance of the proposed algorithm is established in two steps. First, it is shown that the trajectory of the stochastic SMACOF algorithm stays close to that of an averaged algorithm, in an almost sure sense. This part involves establishing a hovering theorem, and utilizes techniques from
stochastic approximation [24]–[26]. Next, it is shown that the averaged algorithm converges to a stationary point of (8).

1) Assumptions: For the purposes of establishing convergence, a simplified setting is considered, wherein the graph $\mathcal{G}_t$ at each $t$ consists of $N/p \geq 1$ components of size $p$ each. Let $j_m(t) := \{j \mid m \in \mathcal{C}_t^j\}$ be the index of the component to which node $m$ belongs at time $t$, and define $\boldsymbol{\Phi}_t \in \mathbb{R}^{N \times N}$ such that

$$[\boldsymbol{\Phi}_t]_{mn} := \begin{cases} -1/N & j_m(t) \neq j_n(t) \\ -1/N + \mu/p & j_m(t) = j_n(t),\; m \neq n \\ (1 - \mu) - 1/N + \mu/p & m = n. \end{cases}$$

(A1) The random processes $\{w_{mn}(t)\}_{t \geq 0}$ and $\{\delta_{mn}(t)\}_{t \geq 0}$ are independent identically distributed (i.i.d.).
(A2) The random variables $\{\delta_{mn}(t)\}$ have support $(0, C_\delta]$, while the weights $\{w_{mn}(t)\}$ have support $\{0\} \cup [\epsilon_w, 1]$.
(A3) The online algorithm is initialized such that $\|(\mathbf{I} - \mathbf{1}\mathbf{1}^T/N)\hat{\mathbf{X}}_0\| \leq C_x$.
(A4) There exists $t_0$ such that for any $\mu \in (0, 1)$, there exists $\varrho \in (0, 1)$ such that $\big\|\prod_{s=\tau+1}^{t} \boldsymbol{\Phi}_s\big\|_2 < \varrho^{\,t-\tau}$ for all $t - \tau \geq t_0$.
(A5) For each $t$, the non-zero weights $\{w_{mn}(t)\}_{m,n}$ are i.i.d. with $\bar{w} := \mathbb{E}[w_{mn}(t)]$.

The i.i.d. assumption in (A1) is standard in the analysis of most stochastic approximation algorithms. For the applications at hand, the support of $\delta_{mn}(t)$ and $w_{mn}(t)$ is naturally finite. It is required from (A2) that the non-zero weights be bounded away from zero. Such a condition is required to ensure the numerical stability of the Laplacian system of equations that must be solved at every iteration [cf. (10), (13)]. Specifically, (A2) implies the following result.

Lemma 1. Under (A2), it holds that $\|\mathbf{L}_t^{\dagger}\|_2 \leq C_L := (N - 1)^2 / (2\epsilon_w)$ for all $t \geq 1$.

The proof of Lemma 1 is provided in Appendix A. The initial configuration can always be normalized to satisfy the bound in (A3). Assumption (A4) restricts the extent to which the graphs $\mathcal{G}_t$ can stay disconnected over time. To obtain intuition on (A4), observe first that the largest eigenvalue of $\boldsymbol{\Phi}_t$ is $1 - \mu$ if $\mathcal{G}_t$ has a single connected component and one otherwise. Consequently, if all $\{\mathcal{G}_s\}_{s=\tau+1}^{t}$ are connected, (A4) holds with $\varrho = 1 - \mu$. Conversely, it holds that $\big\|\prod_{s=\tau+1}^{t} \boldsymbol{\Phi}_s\big\|_2 = 1$ if and only if (a) each of $\{\mathcal{G}_s\}_{s=\tau+1}^{t}$ has more than one component, and (b) the components do not change over time, i.e., the component index $j_m(s)$ of every node $m$ is the same for all $s$. Intuitively, (A4) allows $\{\mathcal{G}_s\}$ to have multiple connected components at each $s \geq 1$, as long as the nodes belonging to these components keep changing over time.

Finally, (A5) is perhaps the most restrictive, and may not always be easy to satisfy. For instance, the weights are not identically distributed in the context of dynamic network localization (cf. Sec. IV-A), since non-zero weights are often assigned to neighboring nodes only. Likewise, weights selected via Sammon mapping also result in non-identically distributed weights. The assumption however greatly simplifies the proof of convergence for the averaged algorithm. Having stated the assumptions, the averaging analysis is presented in the subsequent subsection.

2) Hovering Theorem: The proposed stochastic SMACOF algorithm will be related to an averaged algorithm with updates

$$\tilde{\mathbf{X}}_{t+1} = (1 - \mu\gamma)\tilde{\mathbf{X}}_t + \mu\, \mathbf{B}_a(\tilde{\mathbf{X}}_t)\tilde{\mathbf{X}}_t \tag{16}$$

where the time-invariant function $\mathbf{B}_a(\mathbf{X}) := \mathbb{E}[\mathbf{L}_t^{\dagger} \mathbf{B}_t(\mathbf{X})]$ and $\gamma = \frac{N(p-1)}{p(N-1)}$. Assuming that both algorithms start from the same initialization, i.e., $\hat{\mathbf{X}}_0 = \tilde{\mathbf{X}}_0$, the following proposition states the main result of this section.

Proposition 1. Under (A1)-(A5), and for $\mu < 1$, it holds for the updates generated by (13) and (16), that

$$\max_{1 \leq t \leq 1/\mu} \big\|\hat{\mathbf{X}}_t - \tilde{\mathbf{X}}_t\big\| \leq c(\mu) \tag{17}$$

where the random variable $c(\mu) \to 0$ as $\mu \to 0$ almost surely, i.e., with probability 1.

Intuitively, Proposition 1 states that the trajectory of the proposed stochastic algorithm in (13) stays close to that of the averaged algorithm in (16). Further, the stochastic oscillations of (13) are small if $\mu$ is also small. However, choosing too small a value of $\mu$, which is also the step-size in (16), will generally result in a slower convergence rate for any such iterative algorithm. The parameter $\mu$ may therefore be seen as controlling the trade-off between the convergence rate and asymptotic accuracy. Further characterization of this trade-off is pursued via numerical tests in Sec. V.

Alternatively, consider the case when $T$ updates of (13) are performed with $\mu = 1/T$. For this case, the bound in (17) becomes

$$\max_{1 \leq t \leq T} \big\|\hat{\mathbf{X}}_t - \tilde{\mathbf{X}}_t\big\| \leq c(1/T) \tag{18}$$

where $c(1/T) \to 0$ almost surely as $T \to \infty$. In other words, the stochastic oscillations can be made arbitrarily small if a sufficient number of updates can be performed. It is remarked that such results are commonplace in the stochastic approximation literature [24]–[26].

Next, an outline of the proof of Proposition 1 is presented, while the details are deferred to Appendix B. The overall structure of the proof is similar to that in [24]. Significant differences exist in the details however, since workarounds are introduced in order to avoid making any assumptions on the boundedness of $\hat{\mathbf{X}}_t$. It is emphasized that such a modification is generally not possible in a vast majority of problems, and is not trivial. It is however possible here due to the special structure of the update (13) that depends only on the differences between pairs of rows of $\hat{\mathbf{X}}_t$; see (49).

Proof of Proposition 1: The difference between the iterates generated by (13) and (16) is given by

$$\boldsymbol{\Delta}_{t+1} := \hat{\mathbf{X}}_{t+1} - \tilde{\mathbf{X}}_{t+1} = \boldsymbol{\Delta}_t - \mu\big(\mathbf{L}_t^{\dagger}\mathbf{L}_t \hat{\mathbf{X}}_t - \gamma \tilde{\mathbf{X}}_t\big) + \mu\big(\mathbf{L}_t^{\dagger}\mathbf{B}_t(\hat{\mathbf{X}}_t)\hat{\mathbf{X}}_t - \mathbf{B}_a(\tilde{\mathbf{X}}_t)\tilde{\mathbf{X}}_t\big) \tag{19}$$
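Assuming that both the algorithms start from the same initialization, i.e., $\hat{\mathbf{X}}_0 = \tilde{\mathbf{X}}_0$, it follows that

$$\boldsymbol{\Delta}_{t+1} = -\mu \sum_{\tau=0}^{t} \big(\mathbf{L}_\tau^{\dagger}\mathbf{L}_\tau \hat{\mathbf{X}}_\tau - \gamma \tilde{\mathbf{X}}_\tau\big) + \mu \sum_{\tau=0}^{t} \big(\mathbf{L}_\tau^{\dagger}\mathbf{B}_\tau(\hat{\mathbf{X}}_\tau)\hat{\mathbf{X}}_\tau - \mathbf{B}_a(\tilde{\mathbf{X}}_\tau)\tilde{\mathbf{X}}_\tau\big)$$

$$\phantom{\boldsymbol{\Delta}_{t+1}} = -\mu\gamma \sum_{\tau=1}^{t} \boldsymbol{\Delta}_\tau + \mathbf{K}_t^1 + \mathbf{K}_t^2 + \mathbf{K}_t^3 \tag{20}$$

where for all $t \geq 0$,

$$\mathbf{K}_t^1 = \mu \sum_{\tau=0}^{t} \big(\mathbf{L}_\tau^{\dagger}\mathbf{B}_\tau(\hat{\mathbf{X}}_\tau)\hat{\mathbf{X}}_\tau - \mathbb{E}[\mathbf{L}_\tau^{\dagger}\mathbf{B}_\tau(\hat{\mathbf{X}}_\tau)]\hat{\mathbf{X}}_\tau\big) \tag{21a}$$

$$\mathbf{K}_t^2 = \mu \sum_{\tau=0}^{t} \big(\gamma \mathbf{I} - \mathbf{L}_\tau^{\dagger}\mathbf{L}_\tau\big)\hat{\mathbf{X}}_\tau \tag{21b}$$

$$\mathbf{K}_t^3 = \mu \sum_{\tau=1}^{t} \big(\mathbf{B}_a(\hat{\mathbf{X}}_\tau)\hat{\mathbf{X}}_\tau - \mathbf{B}_a(\tilde{\mathbf{X}}_\tau)\tilde{\mathbf{X}}_\tau\big). \tag{21c}$$

The following intermediate lemma develops bounds on the three terms in (21), and constitutes the key step in the proof.

Lemma 2. The following bounds hold for $i = 1, 2$:

$$\big\|\mathbf{K}_t^i\big\| \leq \mu\, d_t^i + \mu^2 C_i \sum_{\tau=1}^{t} \epsilon_\tau^i \tag{22}$$

$$\big\|\mathbf{K}_t^3\big\| \leq \mu\, C_3 \sum_{\tau=1}^{t} \|\boldsymbol{\Delta}_\tau\| \tag{23}$$

where the constants $C_1$, $C_2$, and $C_3$ are independent of $t$, and the constants $d_t^1$, $\epsilon_t^1$, $d_t^2$, and $\epsilon_t^2$ are such that

$$\frac{d_t^i}{t} \to 0, \qquad \frac{\epsilon_t^i}{t} \to 0 \tag{24a}$$

for $i = 1, 2$, almost surely as $t \to \infty$.

The proof of Lemma 2 is provided in Appendix B. The norm of $\boldsymbol{\Delta}_{t+1}$ can therefore be bounded by applying the triangle inequality on (19) as follows:

$$\|\boldsymbol{\Delta}_{t+1}\| \leq \mu (C_3 + 1) \sum_{\tau=1}^{t} \|\boldsymbol{\Delta}_\tau\| + f(\mu) \tag{25}$$

where we have used the fact that $\|\mathbf{J}\|_2 = 1$ and

$$f(\mu) := \max_{0 \leq t \leq 1/\mu} \; \mu\big(d_t^1 + d_t^2\big) + \mu^2 \sum_{\tau=1}^{t} \big(C_1 \epsilon_\tau^1 + C_2 \epsilon_\tau^2\big). \tag{26}$$

It is further shown in Appendix B that $f(\mu) \to 0$ almost surely as $\mu \to 0$. Proposition 1 then follows from the application of the discrete Bellman-Gronwall Lemma [24] on (25), which yields

$$\|\boldsymbol{\Delta}_t\| \leq f(\mu)\big(1 + \mu(C_3 + 1)\big)^t = f(\mu)\, e^{t \log(1 + \mu(C_3 + 1))} \leq f(\mu)\, e^{\mu t (C_3 + 1)} \leq f(\mu)\, e^{C_3 + 1} := c(\mu). \tag{27}$$

3) Convergence of the Averaged Algorithm: Having established that the trajectory of the stochastic algorithm hovers around that of the averaged algorithm, we complete the proof by establishing that the averaged algorithm converges to a local minimum of (1). The challenge here is that the updates in (16) do not resemble those in other classical algorithms such as SMACOF or gradient descent. For notational brevity, let $\bar{\delta}_{mn} := \mathbb{E}[\delta_{mn}(t)] / \sqrt{\|\mathbf{x}_m - \mathbf{x}_n\|^2 + \epsilon_x}$ and $\mathbf{J} = \mathbf{I} - \mathbf{1}\mathbf{1}^T/N$, and note the following result.

Lemma 3. Under (A1)-(A5), it holds that

$$[\mathbf{B}_a(\mathbf{X})]_{mn} = \begin{cases} -\dfrac{\gamma}{N}\,\bar{\delta}_{mn} & m \neq n \\[2ex] \dfrac{\gamma}{N} \sum_{k \neq m} \bar{\delta}_{mk} & m = n. \end{cases}$$

The proof of Lemma 3 is provided in Appendix C. For the rest of the section, we will assume that $\epsilon_x \ll 1$ and thus negligible. Therefore from Lemma 3, we have that

$$\bar{\sigma}(\mathbf{X}) = \sum_{m<n} \mathbb{E}[w_{mn}(t)\delta_{mn}(t)^2] + \mathrm{tr}(\mathbf{X}^T \bar{\mathbf{L}} \mathbf{X}) - 2\,\mathrm{tr}(\mathbf{X}^T \bar{\mathbf{B}}(\mathbf{X})\mathbf{X}) \tag{28}$$

where $\bar{\mathbf{L}} := \mathbb{E}[\mathbf{L}_t] = \bar{w} p \gamma \mathbf{J}$ and

$$[\bar{\mathbf{B}}(\mathbf{X})]_{mn} := \begin{cases} -\dfrac{\bar{w} p \gamma}{N}\,\bar{\delta}_{mn} & m \neq n \\[2ex] \dfrac{\bar{w} p \gamma}{N} \sum_{k \neq m} \bar{\delta}_{mk} & m = n. \end{cases} \tag{29}$$

The main result of this subsection is stated as the following proposition.

Proposition 2. The mean-stress values $\bar{\sigma}(\tilde{\mathbf{X}}_t)$ decrease monotonically with $t$ and converge to a stationary point of (8).

Proof: Without loss of generality, let $\sum_{m<n} \mathbb{E}[w_{mn}(t)\delta_{mn}(t)^2] = 1$, and define $\eta^2(\mathbf{X}) := \frac{1}{\mu\gamma}\mathrm{tr}(\mathbf{X}^T \bar{\mathbf{L}} \mathbf{X})$ and $\rho(\mathbf{X}) := \frac{1}{2}\big(\frac{1}{\mu\gamma} - 1\big)\mathrm{tr}(\mathbf{X}^T \bar{\mathbf{L}} \mathbf{X}) + \mathrm{tr}(\mathbf{X}^T \bar{\mathbf{B}}(\mathbf{X})\mathbf{X})$, and observe that $\bar{\sigma}(\mathbf{X}) = 1 + \eta^2(\mathbf{X}) - 2\rho(\mathbf{X})$. Similarly, define the mapping $\Gamma(\mathbf{X}) := (1 - \mu\gamma)\mathbf{X} + \frac{\mu}{\bar{w} p}\bar{\mathbf{B}}(\mathbf{X})\mathbf{X}$, so that the updates in (16) become $\tilde{\mathbf{X}}_{t+1} = \Gamma(\tilde{\mathbf{X}}_t)$.

Given any two embeddings $\mathbf{X}$ and $\mathbf{Y}$, the following bounds hold from the Cauchy-Schwarz inequality:

$$\mathrm{tr}(\mathbf{X}^T \bar{\mathbf{L}} \mathbf{X}) \geq \mathrm{tr}\big((2\mathbf{X} - \mathbf{Y})^T \bar{\mathbf{L}} \mathbf{Y}\big) \tag{30}$$

$$\mathrm{tr}(\mathbf{X}^T \bar{\mathbf{B}}(\mathbf{X})\mathbf{X}) \geq \mathrm{tr}(\mathbf{X}^T \bar{\mathbf{B}}(\mathbf{Y})\mathbf{Y}) \tag{31}$$

which allows us to conclude that

$$\rho(\mathbf{X}) \geq \frac{1}{\mu\gamma}\mathrm{tr}\big(\mathbf{X}^T \bar{\mathbf{L}}\,\Gamma(\mathbf{Y})\big) - \frac{1}{2}\Big(\frac{1}{\mu\gamma} - 1\Big)\mathrm{tr}(\mathbf{Y}^T \bar{\mathbf{L}} \mathbf{Y}) \tag{32}$$

$$\bar{\sigma}(\mathbf{X}) \leq 1 + \eta^2(\mathbf{X}) + \Big(\frac{1}{\mu\gamma} - 1\Big)\mathrm{tr}(\mathbf{Y}^T \bar{\mathbf{L}} \mathbf{Y}) - \frac{2}{\mu\gamma}\mathrm{tr}\big(\mathbf{X}^T \bar{\mathbf{L}}\,\Gamma(\mathbf{Y})\big)$$

$$\phantom{\bar{\sigma}(\mathbf{X})} = 1 + (1 - \mu\gamma)\,\eta^2(\mathbf{Y}) - \eta^2(\Gamma(\mathbf{Y})) + \eta^2(\mathbf{X} - \Gamma(\mathbf{Y})) \tag{33}$$

where the equalities hold for $\mathbf{X} = \mathbf{Y}$. Denote the right-hand side of (33) by $\psi_{\mathbf{Y}}(\mathbf{X})$, and observe that $\psi_{\mathbf{Y}}(\mathbf{X}) \geq \psi_{\mathbf{Y}}(\Gamma(\mathbf{Y}))$ for all $\mathbf{X}$. This yields the main inequality that $\bar{\sigma}(\mathbf{Y}) = \psi_{\mathbf{Y}}(\mathbf{Y}) \geq \psi_{\mathbf{Y}}(\Gamma(\mathbf{Y})) \geq \bar{\sigma}(\Gamma(\mathbf{Y}))$. In other words, we have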
that $\bar{\sigma}(\tilde{\mathbf{X}}_t) \geq \bar{\sigma}(\tilde{\mathbf{X}}_{t+1})$, so that the non-negative sequence $\bar{\sigma}_t := \bar{\sigma}(\tilde{\mathbf{X}}_t)$ is non-increasing and therefore convergent to a limit, say $\bar{\sigma}^\star$. By the squeeze theorem for limits [27], it also holds that $\psi_{\tilde{\mathbf{X}}_t}(\tilde{\mathbf{X}}_{t+1}) \to \bar{\sigma}^\star$, yielding the following limits

$$\lim_{t \to \infty} \eta^2(\tilde{\mathbf{X}}_t) = \frac{1 - \bar{\sigma}^\star}{\mu\gamma} \tag{34}$$

$$\lim_{t \to \infty} \rho(\tilde{\mathbf{X}}_t) = \frac{(1 - \bar{\sigma}^\star)(1 + \mu\gamma)}{2\mu\gamma} \tag{35}$$

$$\lim_{t \to \infty} \eta^2(\tilde{\mathbf{X}}_t - \tilde{\mathbf{X}}_{t+1}) = 0 \tag{36}$$

Since the matrices $\{\tilde{\mathbf{X}}_t\}_{t \geq 0}$ are origin centered, the result in (36) can equivalently be written as $\|\tilde{\mathbf{X}}_t - \tilde{\mathbf{X}}_{t+1}\| \to 0$ as $t \to \infty$. Denoting the limit point of $\tilde{\mathbf{X}}_t$ by $\tilde{\mathbf{X}}^\star$, it can be seen that $\nabla\bar{\sigma}(\tilde{\mathbf{X}}^\star) = \mathbf{0}$.

IV. IMPLEMENTATION ASPECTS

A. Multi-agent network localization

Multidimensional scaling has been widely used for localization, where inter-node distances are often obtained from time-of-arrival or received signal strength measurements [10], [14], [28], [29]. Wireless network localization is challenging because the pairwise distance measurements are noisy, time-varying due to mobility, fading, and synchronization errors, and often partially missing, due to the limited range of the sensors. Further, the limited battery life and resource constraints at the nodes impose restrictions on the communication and computational load that the network can tolerate [13], [14].

Towards addressing these limitations, the stochastic SMACOF algorithm for network localization works by judiciously choosing $\{w_{mn}(t)\}$ to limit the communication and computational cost at each update. The idea is to partition the network into several non-overlapping clusters (or components), chosen randomly at each time $t$. The coordinates within a cluster are updated as in (13). Only neighboring nodes are included within each cluster, thus eliminating the need for multihop communication between far off nodes. Finally, since the updates at different components are independent of each other, the localization algorithm is run asynchronously as follows.

S1. At a given time $t$, a node $j$ randomly declares itself as a cluster head, and solicits cluster members from among its neighbors $n \in \mathcal{N}_j$. Available neighbors respond with their current location estimates $\hat{\mathbf{x}}_n(t)$, resulting in a star-shaped cluster $\mathcal{C}_t^j$. Once locked as cluster members, these nodes respond only to the messages from node $j$.
S2. The cluster head performs distance measurements between itself and all its neighbors and collects $\delta_{jn}(t)$ for all $n \in \mathcal{C}_t^j \setminus \{j\}$.
S3. The cluster head performs the update in (13) with appropriately chosen weights $\{w_{jn}(t)\}$, and broadcasts the new location estimates to each node in $\mathcal{C}_t^j \setminus \{j\}$.
S4. Nodes in $\mathcal{C}_t^j \setminus \{j\}$, upon receiving the new location estimates (or upon timeout or error events), release their locks and become available.

As originally intended, the proposed algorithm can also be applied to mobile networks. The algorithm is expected to perform well as long as the node velocities are not too high. The asynchronous nature of the algorithm allows for delayed updates at nodes, balanced battery usage within the network, and communication errors. In general, it is also possible to apply multiple updates of the form in (13) per time instant, without incurring any extra communication cost.

Nodes may declare themselves as cluster heads using a random backoff-based contention mechanism such as CSMA, and solicit neighbors by simply sending an RTS packet. An update at a cluster thus takes up at most two message exchanges. More complicated protocols that ensure recovery from collisions and robustness to errors can also be used [30]. The online algorithm is flexible, and allows clusters of any shape or size, depending on the communication and computational resources available within the network. The non-zero weights, corresponding to available distance measurements, can be chosen according to the estimated noise variance [10], [28], following Sammon mapping [1], [31], or simply as unity.

It is remarked that the node coordinates obtained from (S1)-(S4) are relative and centered at the origin. In applications where node coordinates are required with respect to a set of GPS-enabled anchor nodes, appropriate rotation and translation operations must be applied at each node. Since the anchor nodes are generally not power constrained, it is possible for them to determine these transformations [10], and convey the result to all other nodes. As shown later in Sec. V, it is generally sufficient to calculate the transformations periodically every few time slots.

Finally, similar to the SMACOF algorithm, the stochastic SMACOF algorithm is sensitive to initialization. A random initialization may result in the algorithm getting trapped in a poor local minimum. In practice, superior location estimation performance is obtained if the initialization is at least roughly correct. Simple low-complexity localization algorithms can be used for initialization. For instance, nodes can roughly triangulate themselves using noisy distance estimates from the anchor nodes [32].

B. Large network visualization

It is possible to visualize $N$ objects in a 2- or 3-dimensional Euclidean space by applying MDS to the pairwise dissimilarities $\{\delta_{mn}\}$. The SMACOF algorithm is however ill-suited for large-scale visualization since it requires at least $O(N^2)$ operations per iteration. Further, even processing the full measurements $\{\delta_{mn}\}$ simultaneously may not be feasible for datasets with more than a hundred thousand objects.

Visualization via stochastic embedding can be achieved by partitioning the objects into several subsets of reasonable sizes, and performing the updates in (13). The following steps are performed for each $t \geq 1$.

1) Partition the $N$ objects into random, mutually exclusive subsets $\mathcal{C}_t^j$ with $p$ nodes per subset.
2) For each subset, randomly choose a small fraction $f_t$ of pairs and measure (calculate or fetch from memory) distances $\delta_{mn}$ for the chosen pairs. Let $\mathcal{F}_t^j$ denote the set of chosen pairs for each cluster $j$ and time $t$.
3) Apply the update in (13) for each subset $\mathcal{C}_t^j$.
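The three visualization steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `cluster_update` and `stochastic_pass`, the argument names `p`, `frac`, `mu`, and `eps_x`, and the toy data are hypothetical; unit weights are assumed on the chosen pairs, and the step size is written as `mu`. The pseudo-inverse is computed densely with NumPy for simplicity, whereas a large-scale implementation would presumably use a sparse Laplacian solver.

```python
import numpy as np

def cluster_update(Xc, Dc, Wc, mu, eps_x=1e-9):
    """Update of the form (13) restricted to one subset, with B_t as in (11)."""
    L = np.diag(Wc.sum(axis=1)) - Wc                       # subset Laplacian
    L_pinv = np.linalg.pinv(L, rcond=1e-10, hermitian=True)
    diff = Xc[:, None, :] - Xc[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + eps_x)       # regularized distances
    B = -Wc * Dc / dist
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(axis=1))                    # zero row sums
    return Xc - mu * (L_pinv @ (L @ Xc)) + mu * (L_pinv @ (B @ Xc))

def stochastic_pass(X, D, p, frac, mu, rng):
    """Steps 1)-3) for a single time instant."""
    X_new = X.copy()
    perm = rng.permutation(X.shape[0])                     # step 1: random partition
    for start in range(0, len(perm), p):
        idx = perm[start:start + p]
        k = len(idx)
        mask = np.triu(rng.random((k, k)) < frac, 1)       # step 2: random pairs
        Wc = (mask | mask.T).astype(float)                 # unit weights (assumption)
        Dc = D[np.ix_(idx, idx)]                           # fetch only needed distances
        X_new[idx] = cluster_update(X[idx], Dc, Wc, mu)    # step 3: update (13)
    return X_new

# Toy run (hypothetical data): 60 objects with exact 2-D distances.
rng = np.random.default_rng(1)
truth = rng.normal(size=(60, 2))
D = np.sqrt(((truth[:, None] - truth[None]) ** 2).sum(axis=-1))
X0 = rng.normal(size=(60, 2))
X = X0.copy()
for t in range(200):
    X = stochastic_pass(X, D, p=12, frac=0.5, mu=0.5, rng=rng)
```

Since the pairs drawn inside a subset need not form a connected graph, the sketch relies on the general form (13), where the pseudo-inverse handles each connected component separately. Setting `p` to the number of objects, `frac=1.0`, and `mu=1.0` reduces one pass to a single classical SMACOF iteration.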
Compared to the localization algorithm, in this case all pairwise distances are available a priori and without noise, but cannot be read or processed simultaneously. The aforementioned steps make $\{w_{mn}(t)\}$ sparse and thus reduce the per-iteration complexity. Algorithm 1 summarizes the implementation of stochastic SMACOF for large network visualization.

Algorithm 1 Stochastic SMACOF for Large Network Visualization
1: Initialize $X_0$ and set $\epsilon$ to some value in $(0, 1)$
2: for $t = 1, 2, \ldots$ do
3:   Partition the set $\mathcal{N}$ into $C$ disjoint subsets $\{\mathcal{C}_t^j\}_{j=1}^{C}$
4:   for $j = 1, \ldots, C$ do
5:     Measure or fetch from memory pairwise distances $\{\delta_{mn}(t)\}$ for a subset of object pairs $(m, n) \in \mathcal{F}_t^j$.
6:     Set weights $w_{mn}(t) = 1$ for all $(m, n) \in \mathcal{F}_t^j$.
7:     Perform the update in (13) for each subset $\mathcal{C}_t^j$.
8:   end for
9: end for

Again, as envisioned earlier, the algorithm is also applicable to visualization of dynamic networks. The idea here is to create an animation consisting of embeddings that vary over time. By specifying a small enough value for $\epsilon$ in (13), it is possible to force the embeddings to change slowly over time, thus preserving the user's mental map [12]. Unlike existing algorithms, however, the proposed algorithm allows visualization of very large datasets.

First, assume that each $L_t^j$ is sparse, i.e., $q \ll p^2$, so that $f(N)/N = q/p \ll p$. In this case, since the per-iteration complexity is given by $O(f(N)\log(p))$, the value of $\log(p)$ should be as small as possible. It can be seen that the choice
$$p \approx O\!\left(\left(\frac{f(N)}{N}\right)^{\alpha}\right), \qquad q \approx O\!\left(\left(\frac{f(N)}{N}\right)^{\alpha+1}\right) \tag{37}$$
for some $\alpha \geq 1$ results in the complexity $O(f(N)\log(f(N)/N))$, while ensuring that $L_t^j$ is still sparse with $q \approx O(p^{1+1/\alpha})$. Note that it is not necessary for $\alpha$ to be very large, as long as the sparse Laplacian solvers can still be used. On the other hand, when $L_t^j$ is dense so that $q \approx O(p^2)$, the per-iteration complexity is given by $O(Nq) = O(f(N)p)$. In this case, it holds that $f(N)/N = q/p \approx p$, so that one must choose $p \approx O(f(N)/N)$ and $q \approx O(f(N)^2/N^2)$. Consequently, the optimal iteration complexity for this case becomes $O(f(N)^2/N)$.

Table I shows a few example choices of $f(N)$ and the corresponding per-iteration complexity values. It can be observed that when $f(N)$ is almost linear in $N$, so is the per-iteration complexity, regardless of the sparsity of $L_t^j$. On the other hand, using a sparse $L_t^j$ becomes important when $f(N)$ is large.

TABLE I: Algorithm complexity for different choices of $f(N)$
Non-zero weights $f(N)$              | sparse $L_t^j$              | dense $L_t^j$
$O(N^{1+\beta})$, $0 < \beta \leq 1$ | $O(N^{1+\beta}\log(N))$     | $O(N^{1+2\beta})$
$O(N \log N)$                        | $O(N \log N \log\log N)$    | $O(N \log^2 N)$
$O(N^{3/2})$                         | $O(N^{3/2}\log N)$          | $O(N^2)$
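The loop structure of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: update (13) is not reproduced in this section, so the step used below, a damped Guttman-transform update of the form $X \leftarrow X - \epsilon L^{\dagger}(LX - B(X)X)$ applied per cluster, is an assumption standing in for it. The function names (`stress`, `damped_step`, `stochastic_smacof`), the dense pseudo-inverse, and all parameter values are hypothetical choices made for readability rather than scalability.

```python
import numpy as np

def stress(X, D):
    """Raw stress over all pairs: sum_{m<n} (D[m, n] - ||x_m - x_n||)^2."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return float(((D[iu] - dist[iu]) ** 2).sum())

def damped_step(Xc, pairs, delta, eps, smooth=1e-12):
    """Assumed stand-in for update (13), restricted to one cluster.

    Xc: (p, d) positions of the cluster nodes; pairs: (K, 2) local index
    pairs carrying unit weights; delta: (K,) target distances for the pairs.
    """
    p = Xc.shape[0]
    L = np.zeros((p, p))
    B = np.zeros((p, p))
    for (m, n), d_target in zip(pairs, delta):
        d_now = np.sqrt(np.sum((Xc[m] - Xc[n]) ** 2) + smooth)
        L[m, n] = L[n, m] = -1.0                # w_mn = 1 for sampled pairs
        B[m, n] = B[n, m] = -d_target / d_now
    np.fill_diagonal(L, -L.sum(axis=1))
    np.fill_diagonal(B, -B.sum(axis=1))
    # Dense pseudo-inverse for clarity; a sparse Laplacian solver would be
    # needed to reach the per-iteration complexity discussed in the text.
    return Xc - eps * np.linalg.pinv(L) @ (L @ Xc - B @ Xc)

def stochastic_smacof(D, X0, n_clusters, pairs_per_cluster, eps, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    X = X0.copy()
    N = D.shape[0]
    for _ in range(n_iter):
        # Step 3: random partition of the node set into disjoint clusters.
        for cluster in np.array_split(rng.permutation(N), n_clusters):
            rows, cols = np.triu_indices(len(cluster), k=1)
            k = min(pairs_per_cluster, len(rows))
            pick = rng.choice(len(rows), size=k, replace=False)
            pairs = np.stack([rows[pick], cols[pick]], axis=1)
            # Step 5: fetch only the distances of the sampled pairs.
            delta = D[cluster[pairs[:, 0]], cluster[pairs[:, 1]]]
            # Steps 6-7: unit weights and one damped update per cluster.
            X[cluster] = damped_step(X[cluster], pairs, delta, eps)
    return X

rng = np.random.default_rng(1)
truth = rng.uniform(0.0, 1.0, size=(40, 2))
D = np.linalg.norm(truth[:, None, :] - truth[None, :, :], axis=-1)
X0 = rng.standard_normal((40, 2))
X = stochastic_smacof(D, X0, n_clusters=4, pairs_per_cluster=20, eps=0.3, n_iter=200)
print(stress(X0, D), stress(X, D))
```

Only the sampled pairs enter each cluster update, which is what keeps the weight matrix sparse. A smaller `eps` makes consecutive embeddings change more slowly, in line with the dynamic-visualization remark above.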
Further, as shown in the inset, the proposed algorithm hovers above the averaged algorithm, with steady-state deviation decreasing with $\epsilon$.
It is remarked that the SGD algorithm, with updates specified in (9), tended to diverge in the presence of noisy distance measurements, different weight choices, and poor initializations. For instance, when using Sammon mapping,
neighbors do not form clusters. Similarly, to limit the computational complexity at each node, cluster heads respond to at most 10 nearest neighbors. With these settings, the computational and communication complexity incurred by the network at every time slot is approximately $N/5$. The computational and communication complexity of the SMACOF variants in [10], [14], [33] is also normalized appropriately. As a first order approximation, it is assumed that these algorithms require $O(N)$ message exchanges per iteration. Equivalently, if we allow $N/5$ message exchanges per iteration, and assume that 10 iterations are required for convergence, SMACOF requires about 50 time slots for convergence. For obtaining the plots however, SMACOF is run till convergence, and the number of iterations incurred was often more than 50. Both algorithms start with an initial estimate of the node locations. Approximate node estimates can be quickly obtained using simple techniques such as those in [32]. For the purpose of simulations, the initial locations are chosen as $\hat{x}_m(0) = x_m(0) + v_m$ where $v_m \sim \mathcal{N}(0, N/100)$. Warm starts are

Fig. 3: Estimation error for an example run of the online and SMACOF algorithms. (Horizontal axis: $t$, from 0 to 700; curves: SMACOF, online, lower bound.)

[Figure: localization error for $v = 0.002$, $0.005$, $0.01$, $0.02$.]
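The slot-count normalization and the perturbed initialization described above can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the $O(N)$ cost per SMACOF iteration is taken with unit constant, the second argument of $\mathcal{N}(0, N/100)$ is interpreted as a per-coordinate variance, and the $10 \times 10$ deployment area is an arbitrary assumption.

```python
import numpy as np

def smacof_time_slots(n_nodes, msgs_per_slot, n_iterations=10):
    """Time slots needed if one SMACOF iteration costs about n_nodes messages."""
    slots_per_iteration = n_nodes / msgs_per_slot
    return n_iterations * slots_per_iteration

def perturbed_start(true_xy, variance, rng):
    """Initial estimate = true location + zero-mean Gaussian noise (assumed i.i.d.)."""
    return true_xy + rng.normal(0.0, np.sqrt(variance), size=true_xy.shape)

N = 200
# N/5 message exchanges per slot and 10 iterations give 5 slots per iteration.
print(smacof_time_slots(N, msgs_per_slot=N / 5))  # 50.0

rng = np.random.default_rng(0)
x_true = rng.uniform(0.0, 10.0, size=(N, 2))      # hypothetical deployment area
x_init = perturbed_start(x_true, variance=N / 100, rng=rng)
```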
suggests, $a(G)$ captures the overall connectivity of the graph. On the other hand, if $G_t$ has $K \geq 2$ connected components $\{G_t^k\}_{k=1}^{K}$, the $K$ smallest eigenvalues of $L_t$ are zero, so the smallest non-zero eigenvalue is simply $a(G_t) = \min_k a(G_t^k)$. Next, we establish a lower bound on the algebraic connectivity of the weighted graph $G_t$.

Proof of Lemma 1: If $G_t$ is connected, the second smallest eigenvalue is given by
$$a(G_t) = N \min_{\mathbf{1}^T y = 0,\; y \neq 0} \frac{\sum_{m<n} w_{mn}(y_m - y_n)^2}{\sum_{m<n}(y_m - y_n)^2}. \tag{39}$$

exists a path $\mathcal{P}$ between any two nodes $m$ and $n$, such that
$$(\tilde{y}_m - \tilde{y}_n)^2 = \Big[\sum_{(i,j)\in\mathcal{P}} (\tilde{y}_i - \tilde{y}_j)\Big]^2 \leq (N-1)\sum_{(i,j)\in\mathcal{P}} (\tilde{y}_i - \tilde{y}_j)^2 \tag{40}$$
$$\leq (N-1)\sum_{(i,j)\in\mathcal{E}} (\tilde{y}_i - \tilde{y}_j)^2 \tag{41}$$
where (40) holds since $\mathcal{P}$ may contain at most $N-1$ edges. Summing both sides over all edges in the graph, we have that
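Substituting (42) into (39) for $y = \tilde{y}$, we have that
$$a(G) = N\,\frac{\sum_{m<n} w_{mn}(\tilde{y}_m - \tilde{y}_n)^2}{\sum_{m<n}(\tilde{y}_m - \tilde{y}_n)^2} \tag{43}$$
$$\geq \frac{2\sum_{(m,n)\in\mathcal{E}} w_{mn}(\tilde{y}_m - \tilde{y}_n)^2}{(N-1)^2\sum_{(m,n)\in\mathcal{E}}(\tilde{y}_m - \tilde{y}_n)^2} \geq \frac{2\underline{w}}{(N-1)^2} \tag{44}$$
which is the required bound. If $G_t$ is not connected, it holds for a component $G_t^k$ with $p$ nodes that $a(G_t^k) \geq 2\underline{w}/(p-1)^2 \geq 2\underline{w}/(N-1)^2$, so that we again have $a(G_t) = \min_k a(G_t^k) \geq 2\underline{w}/(N-1)^2$, which is the desired result.

APPENDIX B
PROOF OF LEMMA 2

Before proceeding with the proof, we state some basic results, and introduce necessary notation. In the subsequent analysis, we will repeatedly use the following inequalities [40]
$$\|AB\| \leq \|A\|_2 \|B\| \leq \|A\|\,\|B\| \tag{45}$$
where $A$ and $B$ are matrices of compatible sizes. For notational brevity, $d_{mn} := \sqrt{\|x_m - x_n\|^2 + \epsilon_x}$ and $\tilde{d}_{mn} := \sqrt{\|\tilde{x}_m - \tilde{x}_n\|^2 + \epsilon_x}$, and note that $d_{mn}, \tilde{d}_{mn} \geq \sqrt{\epsilon_x}$.

We begin by defining the total deviation functions corresponding to $K_t^1$ and $K_t^2$ as
$$D_t^1(X) := \sum_{\tau=1}^{t} \Big( L_\tau^{\dagger} B_\tau(X) X - \mathbb{E}\big[L_\tau^{\dagger} B_\tau(X) X\big] \Big) \tag{46}$$
$$D_t^2(X) := \sum_{\tau=1}^{t} \Big( L_\tau^{\dagger} L_\tau - \mathbb{E}\big[L_\tau^{\dagger} L_\tau\big] \Big) X \tag{47}$$
The following lemma lists several preliminary results required in deriving the bounds in Lemma 2.

Lemma 4. There exists $t_0 < \infty$, such that for all $t \geq t_0$, it holds that
$$\|L_t^{\dagger} B_t(X) X - L_t^{\dagger} B_t(\tilde{X}) \tilde{X}\| \leq C_3 \|X - \tilde{X}\| \tag{48a}$$
$$\|L_t^{\dagger} B_t(X) X\| \leq C_4 \tag{48b}$$
$$\|J X_t\| \leq C_5 \tag{48c}$$
$$\|D_t^1(X)\| \leq d_t^1 \tag{48d}$$
$$\|D_t^1(X) - D_t^1(\tilde{X})\| \leq \kappa_t^1 \|X - \tilde{X}\| \tag{48e}$$
$$\|D_t^2(X)\| \leq d_t^2 \tag{48f}$$
$$\|D_t^2(X) - D_t^2(\tilde{X})\| \leq \kappa_t^2 \|X - \tilde{X}\| \tag{48g}$$
where $J = I - \mathbf{1}\mathbf{1}^T/N$, $C_3$ and $C_4$ are constants, while the random variables $d_t^1$, $d_t^2$, $\kappa_t^1$, and $\kappa_t^2$ follow (24). Results in (48f) and (48g) also require $X$ to be such that $\|JX\| \leq C_5$.

The proof is organized into four steps, each considering one or more inequalities.

Proof of (48a) and (48b): Observe that the $m$-th row of $B_t(X)X$ for each $t \geq 0$ can be written as
$$[B_t(X)X]_{m,:} = \sum_{n \neq m} \frac{w_{mn}(t)\,\delta_{mn}(t)}{d_{mn}} (x_m - x_n)$$
which implies that
$$\big\|[B_t(X)X]_{m,:}\big\| \leq \sum_{n \neq m} |w_{mn}(t)\,\delta_{mn}(t)| \leq N C_\delta. \tag{49}$$
The bound in (48b) therefore follows from the use of (45),
$$\|L_t^{\dagger} B_t(X) X\| \leq \|L_t^{\dagger}\|_2 \,\|B_t(X) X\| \leq \frac{N^2 C_\delta}{\lambda_L} \tag{50}$$
which yields $C_4 = N^2 C_\delta/\lambda_L$. Likewise, the $m$-th row of $B_t(X)X - B_t(\tilde{X})\tilde{X}$ becomes
$$\big[B_t(X)X - B_t(\tilde{X})\tilde{X}\big]_{m,:} = \sum_{n \neq m} w_{mn}(t)\,\delta_{mn}(t) \left( \frac{x_m - x_n}{d_{mn}} - \frac{\tilde{x}_m - \tilde{x}_n}{\tilde{d}_{mn}} \right). \tag{51}$$
Adding and subtracting the term $(\tilde{x}_m - \tilde{x}_n)/d_{mn}$ to each term within the summation in (51), it can be seen that
$$\frac{x_m - x_n}{d_{mn}} - \frac{\tilde{x}_m - \tilde{x}_n}{\tilde{d}_{mn}} = \frac{x_m - \tilde{x}_m - x_n + \tilde{x}_n}{d_{mn}} + (\tilde{x}_m - \tilde{x}_n)\left(\frac{1}{d_{mn}} - \frac{1}{\tilde{d}_{mn}}\right)$$
$$= \frac{x_m - \tilde{x}_m - x_n + \tilde{x}_n}{d_{mn}} + \frac{\tilde{x}_m - \tilde{x}_n}{d_{mn}\tilde{d}_{mn}} \cdot \frac{\tilde{d}_{mn}^2 - d_{mn}^2}{d_{mn} + \tilde{d}_{mn}}. \tag{52}$$
Further, the term $\tilde{d}_{mn}^2 - d_{mn}^2$ can be written compactly as
$$\tilde{d}_{mn}^2 - d_{mn}^2 = \tilde{x}_m^T\tilde{x}_m + \tilde{x}_n^T\tilde{x}_n - 2\tilde{x}_m^T\tilde{x}_n - x_m^T x_m - x_n^T x_n + 2 x_m^T x_n$$
$$= (x_m - x_n + \tilde{x}_m - \tilde{x}_n)^T(\tilde{x}_m - \tilde{x}_n - x_m + x_n). \tag{53}$$
Consequently, it is possible to write (51) as
$$\big[B_t(X)X - B_t(\tilde{X})\tilde{X}\big]_{m,:} = \sum_{n \neq m} w_{mn}(t)\,\delta_{mn}(t)\, A_{mn}\big((x_m - \tilde{x}_m) - (x_n - \tilde{x}_n)\big)$$
where the matrix $A_{mn}$ is given by
$$A_{mn} = \frac{1}{d_{mn}} I - \frac{(\tilde{x}_m - \tilde{x}_n)(x_m - x_n + \tilde{x}_m - \tilde{x}_n)^T}{d_{mn}\tilde{d}_{mn}(d_{mn} + \tilde{d}_{mn})}. \tag{54}$$
Thus, the full difference becomes
$$\mathrm{vec}\big(B_t(X)X - B_t(\tilde{X})\tilde{X}\big) = A_t(X, \tilde{X})\,\mathrm{vec}\big(X - \tilde{X}\big) \tag{55}$$
where the $(m, n)$-th $p \times p$ block of $A_t(X, \tilde{X})$ is given by
$$\big[A_t(X, \tilde{X})\big]_{mn} := \begin{cases} -A_{mn}\, w_{mn}(t)\,\delta_{mn}(t) & m \neq n \\ \sum_{n \neq m} A_{mn}\, w_{mn}(t)\,\delta_{mn}(t) & m = n \end{cases} \tag{56}$$
Next, repeated use of the triangle inequality yields
$$\|A_{mn}\|^2 \leq \frac{2}{d_{mn}^2}\left( \|I\|^2 + \frac{\|\tilde{x}_m - \tilde{x}_n\|^2\, \|\tilde{x}_m - \tilde{x}_n + x_m - x_n\|^2}{\tilde{d}_{mn}^2 (d_{mn} + \tilde{d}_{mn})^2} \right).$$
Here, it holds from the definition of $\tilde{d}_{mn}$ that $\|\tilde{x}_m - \tilde{x}_n\|/\tilde{d}_{mn} \leq 1$. Similarly, it holds that
$$\|\tilde{x}_m - \tilde{x}_n + x_m - x_n\|^2 \tag{57}$$
$$\leq \|\tilde{x}_m - \tilde{x}_n\|^2 + \|x_m - x_n\|^2 + 2\|\tilde{x}_m - \tilde{x}_n\|\,\|x_m - x_n\| \leq d_{mn}^2 + \tilde{d}_{mn}^2 + 2 d_{mn}\tilde{d}_{mn} = (d_{mn} + \tilde{d}_{mn})^2 \tag{58}$$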
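The scalar identities used in (52) and (53), and the bound in (57)-(58), can be sanity-checked numerically. The following sketch draws random difference vectors and tests each relation; the variable names, the helper `smoothed`, and the value chosen for the smoothing constant `eps_x` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eps_x = 1e-3  # illustrative value of the smoothing constant under the square root

def smoothed(v):
    """Smoothed norm sqrt(||v||^2 + eps_x), as in the definition of d_mn."""
    return np.sqrt(v @ v + eps_x)

def check_once(dim=3):
    a = rng.standard_normal(dim)      # plays the role of x_m - x_n
    b = rng.standard_normal(dim)      # plays the role of x~_m - x~_n
    d, d_t = smoothed(a), smoothed(b)
    # Reciprocal difference used in (52).
    ok52 = np.isclose(1 / d - 1 / d_t, (d_t**2 - d**2) / (d * d_t * (d + d_t)))
    # Factorization of the squared-distance difference in (53).
    ok53 = np.isclose(d_t**2 - d**2, (a + b) @ (b - a))
    # Bound (57)-(58): ||a + b||^2 <= (d + d~)^2.
    ok58 = (a + b) @ (a + b) <= (d + d_t) ** 2
    return bool(ok52 and ok53 and ok58)

all_ok = all(check_once() for _ in range(1000))
print(all_ok)
```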
2
Therefore, the bound on kAmn k becomes The law of large numbers therefore implies that D1t (X)/t 0
1as t
almost surely
.1This also implies that there exists d1t
2(N + 1) 2 1
kAmn k (59) such that Dt (X) dt and dt /t 0 as t .
The Lipschitz continuity of D1t (X) can similarly be shown
Similarly, it holds for At (X, X) that
using (48a). Towards this end, observe that
2 t1
X
=
D1t (X) D1t (X) L B (X)X E[L B (X)X]
2 XX X
C2 2
At (X, X) kAmn k + kAmn k
=1
m n6=m n6=m
XX t1
2
X
3C2 X
L B (X) E[L B (X)
X]
kAmn k (60)
m n6=m =1
3 t1
N (N 1)(N + 1) 6N
3C2 < C2
X
X
L B (X)X B (X)
(61) =
x
=0
which in turn, yields the bound X
E[L B (X)X B (X) ]
(73)
2
2 6N 3 C .
At (X, X) The vectorized version of the first term can be written as
(62)
x
X
vec L B (X)X L B (X)
The Lipschitz continuity of Lt Bt (X)X thus follows as
= I L vec B (X)X B (X) X
(74)
X
X
Lt Bt (X)X Lt Bt (X)
Lt
Bt (X)X Bt (X)
2 = I L A (X, X)vec XX (75)
r
N C 6N
,
X X (63) Using a similar transformation on the second term of (73), the
L x vectorized version of the right-hand side can be written as
q
so that C3 = NCL
6N
x .
vec D1t (X) D1t (X)
In order to establish the Lipschitz continuity of D2t (X), Such a t0 () exists within [0, T /] for all T /t0 ().
= C0 (X X),
observe that D2t (X) D2t (X) where Therefore, given , if t t0 (), it holds that
t
t
X P [dt ] = 1 (93)
C0t := L L E[L L ] (81)
=0 for all /t0 ()Cd . On the other hand, if t > t0 (), (93)
Since each summand in (81) is zero mean and bounded, it holds for all T /t0 (/T ). Combining the two cases, it
holds from law of large numbers that C0t /t 0 almost holds that max0tT / dt 0, with probability one as
exists t2 such that 0.
2 as t
surely .
Consequently, there
2
X X
, and 2 /t 0 almost For the other two terms, observe similarly that given , there
Dt (X) D2t (X) t t
surely as t . exists T and C such that
Proof of Lemma 2: Bounds in (22) can be derived by
P [t /t C ] = 1 t, (94)
observing that for 1 t and = 1, 2, it holds that
and P [t /t ] = 1 t > T . (95)
) D (X
D (X 1 ) = K K +
1 1
Thus, given , if t T , it holds that
D (X ) D
1 1 (X 1 ). (82) " #
t r
2
X
1
Summing (82) over = 1, . . . , t, it follows that P = 1, , s.t, . (96)
=2
T C
t
X
t)
Dt (X 0)
D0 (X = Kt K0 + +1 )
D (X )
D (X Similarly, the result in (96) holds for t > T for all T
.
T/T 2
=1
t0 () and Cd such that Next, we show that the random variables mn and mk
are identically distributed for n 6= k 6= m. Without loss of
P [dt /t Cd ] =1 t, (91)
generality, let m = 1. Also, let Lnki denote the (N 2)
and P [dt /t ] =1 t > t0 (). (92) (N 2) submatrix of Lt + 11T /N after the removal of rows
(1, i) and columns (n, k). The Laplace expansion of 1n along pair of nodes (m, n) belongs to the same connected component
the k-th column yields is given by (p 1)/(N 1), yielding the required expression
X1 p 1 mn m 6= n
wki (t))(1)n+i+k |Lnk
h i
1n = ( i | E Lt Bt (X) = P
N mn p(N 1) mk m = n.
i6=1,n,k
k6=m
1 1 X
( wkn (t))(1)k |Lnk n |( + wki (t))(1)n |Lnk
k |
N N R EFERENCES
i6=k
X 1 [1] I. Borg and P. J. Groenen, Modern multidimensional scaling: Theory
= ( wki (t))(1)n+i+k |Lnk i | and applications. Springer Science & Business Media, 2005.
N [2] L. V. D. Maaten and G. Hinton, Visualizing data using t-SNE, Journal
i6=1
X of Machine Learning Research, vol. 9, no. Nov, pp. 25792605, 2008.
( wki (t) + 2wkn (t))(1)n |Lnk
k | (100) [3] A. Platzer, Visualization of SNPs with t-SNE, PloS One, vol. 8, no. 2,
i6=k,n 2013.
[4] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, Line: Large-
scale information network embedding, in Proc. of the Intl. Conf. on
Likewise, the expansion of 1k along the n-th column yields World Wide Web, 2015, pp. 106777.
X 1 [5] L. van der Maaten and G. Hinton, Visualizing non-metric similarities
1k = ( wni (t))(1)n+i+k |Lnk i |
in multiple maps, Machine Learning, vol. 87, no. 1, pp. 3355, 2012.
N [6] D. K. Agrafiotis, Stochastic proximity embedding, Journal of compu-
i6=1 tational chemistry, vol. 24, no. 10, pp. 121521, 2003.
X
[7] J. Tzeng, H. H. Lu, and W.-H. Li, Multidimensional scaling for large
( wki (t) + 2wkn (t))(1)k |Lnk
n | (101) genomic data sets, BMC bioinformatics, vol. 9, no. 1, p. 179, 2008.
i6=k,n [8] J. Y. Choi, S.-H. Bae, J. Qiu, B. Chen, and D. Wild, Browsing large-
scale cheminformatics data with dimension reduction, Concurrency and
It can be seen that the first terms in (100) and (101) are Computation: Practice and Experience, vol. 23, no. 17, pp. 231525,
identically distributed since wni (t) and wki (t) are identical 2011.
[9] M. Beatty and B. Manjunath, Dimensionality reduction using multi-
(cf. (A5)). Further, performing n k row exchanges on Lnk n , dimensional scaling for content-based retrieval, in Proc. of the ICIP,
it is possible to obtain L nk which only differs from Lnk in vol. 2, 1997, pp. 835838.
n k
the k-th row. Indeed, the elements of the k-th row of L nk are [10] J. A. Costa, N. Patwari, and A. O. Hero III, Distributed weighted-
n
multidimensional scaling for node localization in sensor networks, ACM
{(1/N wki (t))}i6=k,n , while the elements of the k-th row of Trans. on Sensor Networks, vol. 2, no. 1, pp. 3964, 2006.
Lnk
n are {(1/N wni (t))}i6=k,n . Since the determinant is linear [11] P. A. Forero and G. B. Giannakis, Sparsity-exploiting robust multidi-
in its rows, it follows that |Lnk nk n+k nk mensional scaling, IEEE Trans. on Signal Proc., vol. 60, no. 8, pp.
n | and |Ln | = (1) Lk
411834, 2012.
are identically distributed. In summary, we have that the [12] K. S. Xu, M. Kliger, and A. O. Hero III, A regularized graph
distributions of mn and mk are identical for all k 6= n 6= m. layout framework for dynamic network visualization, Data Mining and
Next, define identical random variables mn := Knowledge Discovery, vol. 27, no. 1, pp. 84116, 2013.
[13] S. Kumar, R. Kumar, and K. Rajawat, Cooperative localization of
wmn (t)(mmP mn ) for each n 6= m, so that mobile networks via velocity-assisted multidimensional scaling, IEEE
= NN1 n6=m mn . Since Gt is connected, it holds Trans. on Signal Process., vol. 64, no. 7, pp. 17441758, 2016.
that > 0. Therefore from symmetry, we have that [14] A. Simonetto and G. Leus, Distributed maximum likelihood sensor
network localization, IEEE Trans. on Signal Proc., vol. 62, no. 6, pp.
14241437, 2014.
mn N 1 mn 1 [15] S. Kumar, R. Jain, and K. Rajawat, Asynchronous optimization over
E[ ]= E[ P ]= (102)
N
n6=m mn N heterogeneous networks via consensus admm, IEEE Trans. on Signal
and Inf. Proc. over Networks, 2016 (to be published).
Further, using the fact that E[mn ] = E[mk ] for each k 6= n, [16] P. Biswas, T.-C. Lian, T.-C. Wang, and Y. Ye, Semidefinite pro-
gramming based algorithms for sensor network localization, ACM
it can be seen that Transactions on Sensor Networks, vol. 2, no. 2, pp. 188220, 2006.
[17] S.-H. Bae, J. Qiu, and G. Fox, Adaptive interpolation of multidimen-
h i 1 P mn m 6= n sional scaling, Procedia Computer Science, vol. 9, pp. 393 402, 2012.
E Lt Bt (X) = mk m = n.
[18] S. Ingram, T. Munzner, and M. Olano, Glimmer: Multilevel MDS on the
mn N GPU, IEEE Trans. on Visualization and Computer Graphics, vol. 15,
k6=m
no. 2, pp. 249261, 2009.
[19] L. Bottou, On-line Learning in Neural Networks, D. Saad, Ed. New
which is the required result. York, NY, USA: Cambridge University Press, 1998.
Finally, if Gt consists of multiple connected components, the [20] C. D. Sa, C. Re, and K. Olukotun, Global convergence of stochastic
quantity Lt Bt (X) is a permuted version of the block-diagonal gradient descent for some non-convex matrix problems, in Proc. of the
Intl. Conf. on Machine Learning, 2015, pp. 233241.
matrix with N/p block matrices of size p p each. Let j [21] J. Mairal, Stochastic majorization-minimization algorithms for large-
denote the determinant of j-th block, and the random variables scale optimization, in Advances in Neural Information Processing
Systems, 2013, pp. 22832291.
jmn be similarly defined block-wise. Proceeding along similar [22] O. Cappe and E. Moulines, On-line expectationmaximization algo-
lines, it can be seen that rithm for latent data models, Journal of the Royal Statistical Society:
Series B (Statistical Methodology), vol. 71, no. 3, pp. 593613, 2009.
jmn p1 j 1 [23] A. H. Sayed, Adaptive filters. John Wiley & Sons, 2011.
E[ j
]= E[ P mn j ] = . (103) [24] V. Solo and X. Kong, Adaptive signal processing algorithms: stability
p n6=m mn
p and performance. Prentice-Hall, Inc., 1994.
[25] H. Kushner and G. G. Yin, Stochastic approximation and recursive
Consequently, [Lt Bt (X)]mn is non-zero if and only if the algorithms and applications. Springer, 2003, vol. 35.
[26] V. S. Borkar et al., Stochastic approximation, Cambridge Books, 2008.
node pair (m, n) belong to the same component, and is zero [27] W. Rudin, Principles of mathematical analysis. McGraw-Hill New
otherwise. From (A5), we have that the probability that a given York, 1964, vol. 3.