This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSIPN.2017.2668145, IEEE
Transactions on Signal and Information Processing over Networks
Stochastic Multidimensional Scaling

Ketan Rajawat, Member, IEEE, and Sandeep Kumar, Student Member, IEEE
Abstract: Multidimensional scaling (MDS) is a popular dimensionality reduction technique that has been widely used for network visualization and cooperative localization. However, the traditional stress minimization formulation of MDS necessitates the use of batch optimization algorithms that are not scalable to large-sized problems. This paper considers an alternative stochastic stress minimization framework that is amenable to incremental and distributed solutions. A novel linear-complexity stochastic optimization algorithm is proposed that is provably convergent and simple to implement. The applicability of the proposed algorithm to localization and visualization tasks is also expounded. Extensive tests on synthetic and real datasets demonstrate the efficacy of the proposed algorithm.

Index Terms: Multidimensional Scaling, Stochastic SMACOF, Visualization, Localization, Big data.

Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
The author Sandeep Kumar was supported by TCS Research Scholarship Program TCS/CS/2011191G. Ketan Rajawat and Sandeep Kumar are with the Department of Electrical Engineering, Indian Institute of Technology Kanpur, Kanpur, UP 208016, India, email: {ketan, sandkr}@iitk.ac.in
This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the author. The material includes three videos and one pdf [readme.pdf, Video1-Multi-agent tracking.avi, Video2-MovieLens.mp4 and Video3-Newcomb.avi]. Contact [ketan@iitk.ac.in] for further questions about this work.
2373-776X (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Multidimensional scaling addresses the problem of embedding relational data onto a low-dimensional subspace. Originally proposed in the context of psychometrics and marketing [1], MDS and its variants have since found applications in social networks [2]-[6], genomics [7], computational chemistry [8], machine learning [9], and wireless networks [10]. As an exploratory technique, MDS is often used as a first step towards uncovering the structure inherent to high-dimensional data. In the context of machine learning and data mining, the pairwise dissimilarities are calculated using high- or infinite-dimensional nodal attributes, and MDS yields a distance-preserving, low-dimensional embedding. Of particular importance are the embeddings obtained in two- or three-dimensional Euclidean spaces, which serve as perceptual maps for visualizing relationships between objects. In the context of social networks, such representations reveal interconnections between people and communities, and are often more insightful than simpler metrics such as centrality and density. Different from the classical MDS framework that utilizes principal component analysis, modern MDS formulations are based on the minimization of a non-convex stress function [1]. Since the stress function is a weighted sum of squared fitting errors, it allows for the possibility of missing and noisy dissimilarities. Consequently, variants of the stress minimization problem have been developed for robust MDS [11], visualization of time-varying data [12], and cooperative localization of static [10], [13]-[15] and mobile networks [13]. Popular algorithms for solving the stress minimization problem include scaling by majorizing a complicated function (SMACOF) [1], semidefinite programming [16], alternating directions method of multipliers [14], [15], and distributed SMACOF [10].

The attractiveness of the MDS framework has however started to diminish with the advent of the data deluge. Specifically, when embedding $N$ objects, the per-iteration complexity and memory requirements of the aforementioned algorithms increase at least as $O(N^2)$, making them impractical for large-scale problems. To this end, approximate versions of SMACOF have been proposed for large-scale visualization applications [17], [18]. Nevertheless, most approximate MDS algorithms are still too complex for large-scale data, and cannot be generalized to other applications such as cooperative localization of large networks.

Visualization or localization of time-varying data is even more challenging since the iterative majorization algorithm must converge at every time instant [10], [12], [13]. In mobile sensor networks, carrying out a large number of iterations at each time instant incurs a tremendous communication overhead, and is generally impractical. For instance, the distributed weighted MDS approach [10] still requires at least $N$ operations per iteration per time instant, which is prohibitive for large networks. For large-scale applications, where localization or visualization is constrained by the per-iteration complexity and memory requirements, it is instead desirable to have an online algorithm. Towards this end, the goal is to obtain an adaptive algorithm that processes dissimilarity measurements in a sequential or online manner. For instance, an adaptive algorithm can allow visualization of large networks by reading and processing the pairwise dissimilarities in small batches. Similarly, the communication cost required for large-scale network localization can be reduced by processing only a few range measurements at a time.

This paper considers the stress minimization problem in a stochastic setting, where the dissimilarity measurements and the weights are modeled as random time-varying quantities with unknown distributions. The first contribution of this paper is a novel stochastic SMACOF algorithm that processes the dissimilarities in an online fashion, and is therefore applicable to both static and time-varying scenarios (Sec. III). The proposed algorithm is not only scalable, but is also amenable to a distributed and asynchronous implementation in ad hoc networks (Sec. IV). As our second contribution, it is shown that the trajectory of the stochastic SMACOF algorithm remains close to that of an averaged algorithm, which itself converges to a stationary point of the stochastic stress minimization problem (Sec. III-B). The analysis borrows
tools from spectral graph theory, stochastic approximation, and convergence analysis of the SMACOF algorithm. Finally, as the third contribution, the performance of the proposed algorithm is tested extensively on various synthetic and real-world data sets (Sec. V). The numerical tests confirm the applicability of the stochastic SMACOF algorithm to a variety of scenarios.

The notation used in this paper is as follows. Bold upper (lower) case letters denote matrices (vectors). The $(m,n)$-th entry of a matrix $\mathbf{A}$ is denoted by $[\mathbf{A}]_{mn}$. $\mathbf{I}_N$ is the $N \times N$ identity matrix, $\mathbf{0}$ denotes the all-zero matrix or vector, and $\mathbf{1}$ denotes the all-one matrix or vector, depending on the context. For a vector $\mathbf{x}$, $\|\mathbf{x}\|$ denotes its $\ell_2$ norm. For a matrix $\mathbf{A}$, $\|\mathbf{A}\|$ denotes its Frobenius norm, $\|\mathbf{A}\|_2$ denotes the $\ell_2$ norm, $\mathrm{tr}(\mathbf{A})$ denotes its trace, and $\det(\mathbf{A})$ denotes its determinant.

II. BACKGROUND AND PROBLEM STATEMENT

A. Classical MDS and SMACOF

The classical MDS framework seeks $P$-dimensional embedding vectors $\{\mathbf{x}_n\}_{n=1}^N$, given the pairwise distances or dissimilarities $\{\delta_{mn}\}_{(m,n)\in\mathcal{E}}$, where $\mathcal{E} \subseteq \{(m,n) \mid 1 \le m < n \le N\}$, between $N$ different nodes or objects, denoted by the set $\mathcal{N} := \{1, \ldots, N\}$. The embedding vectors, collected into the rows of $\mathbf{X} \in \mathbb{R}^{N \times P}$, are estimated by solving the following non-convex optimization problem [1]
\[ \hat{\mathbf{X}} = \arg\min_{\mathbf{X}} \sum_{1 \le m < n \le N} w_{mn}\left(\delta_{mn} - \|\mathbf{x}_m - \mathbf{x}_n\|\right)^2 \tag{1} \]
where $w_{mn}$ is the weight associated with the measurement $\delta_{mn}$, and is set to zero for all $(m,n) \notin \mathcal{E}$. The non-zero weights can be chosen in a number of ways, depending on the application, and are often simply set to one. The objective function in (1) is referred to as the stress function, and is henceforth denoted by $\sigma(\mathbf{X})$. It can be seen that the optimum $\hat{\mathbf{X}}$ obtained in (1) is not unique, and exhibits translational, rotational, and reflectional ambiguity.

The stress-minimization problem in (1) is non-convex, and can be solved up to a local optimum using the well known SMACOF algorithm. Expanding the stress function, we obtain
\begin{align}
\sigma(\mathbf{X}) &= \sum_{m<n} w_{mn}\left(\delta_{mn}^2 + \|\mathbf{x}_m - \mathbf{x}_n\|^2 - 2\delta_{mn}\|\mathbf{x}_m - \mathbf{x}_n\|\right) \tag{2} \\
&= \sum_{m<n} w_{mn}\delta_{mn}^2 + \mathrm{tr}(\mathbf{X}^T\mathbf{L}\mathbf{X}) - 2\,\mathrm{tr}(\mathbf{X}^T\mathbf{B}(\mathbf{X})\mathbf{X}) \tag{3}
\end{align}
where,
\[ [\mathbf{L}]_{mn} = \begin{cases} -w_{mn} & m \ne n \\ \sum_{k \ne m} w_{mk} & m = n \end{cases} \tag{4} \]
\[ [\mathbf{B}(\mathbf{X})]_{mn} = \begin{cases} -\dfrac{w_{mn}\delta_{mn}}{\|\mathbf{x}_m - \mathbf{x}_n\|} & m \ne n,\ \mathbf{x}_m \ne \mathbf{x}_n \\ 0 & m \ne n,\ \mathbf{x}_m = \mathbf{x}_n \\ -\sum_{k \ne m} [\mathbf{B}(\mathbf{X})]_{mk} & m = n. \end{cases} \tag{5} \]
The SMACOF algorithm works by iteratively majorizing the last term in (3) with a linear function and subsequently minimizing the majorized stress function with respect to $\mathbf{X}$. Starting with an initial $\mathbf{X}^{(0)}$, the SMACOF update at the $k$-th iteration entails carrying out the following update:
\begin{align}
\mathbf{X}^{(k+1)} &= \arg\min_{\mathbf{X}}\ \mathrm{tr}(\mathbf{X}^T\mathbf{L}\mathbf{X}) - 2\,\mathrm{tr}(\mathbf{X}^T\mathbf{B}(\mathbf{X}^{(k)})\mathbf{X}^{(k)}) \tag{6} \\
&= \mathbf{L}^{\dagger}\mathbf{B}(\mathbf{X}^{(k)})\mathbf{X}^{(k)} \tag{7}
\end{align}
where (7) follows since $\mathbf{B}(\mathbf{X})\mathbf{X}$ lies in the range space of $\mathbf{L}$. Observe that since $\mathbf{L}$ is rank-deficient, the solution to (6) is not unique. However, when the weights $\{w_{mn}\}$ specify a fully connected graph $\mathcal{G} := (\{1, \ldots, N\}, \mathcal{E})$, both $\mathbf{L}$ and $\mathbf{B}(\mathbf{X})$ have rank $N-1$, with the null space of $\mathbf{L}$ being spanned by $\mathbf{1}$. Therefore, any solution to (6) is of the form $\mathbf{L}^{\dagger}\mathbf{B}(\mathbf{X}^{(k)})\mathbf{X}^{(k)} + \mathbf{1}\mathbf{c}^T$ for $\mathbf{c} \in \mathbb{R}^P$. Further, if the initial $\mathbf{X}^{(0)}$ is chosen such that it is centered at the origin, i.e., $\mathbf{1}^T\mathbf{X}^{(0)} = \mathbf{0}$, the updates in (7) ensure that $\mathbf{1}^T\mathbf{X}^{(k)} = \mathbf{0}$ for all $k \ge 1$.

B. Stochastic MDS

This paper considers the MDS problem in a stochastic setting, where the weights and the dissimilarities or distance measurements are random variables with unknown distributions. Specifically, given $\{\delta_{mn}(t)\}$ and $\{w_{mn}(t)\}$, the stochastic stress minimization problem is formulated as
\[ \min_{\mathbf{X}}\ \bar{\sigma}(\mathbf{X}) := \sum_{m<n} \mathbb{E}\left[w_{mn}(t)\left(\delta_{mn}(t) - \|\mathbf{x}_m - \mathbf{x}_n\|\right)^2\right]. \tag{8} \]
In the absence of the distribution information, the expression for $\bar{\sigma}(\mathbf{X})$ cannot be evaluated in closed-form, and the SMACOF algorithm cannot be applied. Instead, (8) must be solved using a stochastic optimization algorithm. Of particular interest are the so-called online algorithms that can process the observations $\{\delta_{mn}(t)\}$, $\{w_{mn}(t)\}$ in an incremental manner. Within this context, efficient implementations of the stochastic (sub-)gradient descent (SGD) method have been used to solve very large-scale problems [19]. The SGD updates utilize the subgradient of the instantaneous objective function, and for the present case, take the form:
\[ \hat{\mathbf{X}}_{t+1} = \hat{\mathbf{X}}_t + \mu\left(\mathbf{B}_t(\hat{\mathbf{X}}_t)\hat{\mathbf{X}}_t - \mathbf{L}_t\hat{\mathbf{X}}_t\right) \tag{9} \]
where $\mu \in (0,1)$ is the learning rate or step size parameter. While the performance of the SGD has been well-studied for convex problems, the same is not true for non-convex problems, such as the one in (8). Indeed, the standard SGD algorithm does not necessarily converge for many non-convex problems [20]. In the present case also, the SGD method exhibits divergent behavior; see Sec. V. The general-purpose stochastic majorization-minimization method [21] is also not applicable in the present case since it requires a strongly convex surrogate function. On the other hand, problem-specific stochastic algorithms have been developed and applied with great success. Examples include the online expectation-maximization and the online matrix factorization approaches [19], [22]. Along similar lines, the next section details the stochastic version of the SMACOF algorithm, and studies its asymptotic properties.
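As a point of reference for the batch recursion of Sec. II-A, the following is a minimal NumPy sketch of the SMACOF update (7) built from the definitions (4) and (5). It is an illustration under stated assumptions rather than the authors' implementation: the function names (`laplacian`, `b_matrix`, `stress`, `smacof`), the unit weights, the random test configuration, and the use of a dense pseudo-inverse are all choices made here for brevity.

```python
import numpy as np

def laplacian(W):
    """Weighted Laplacian of (4): off-diagonal -w_mn, diagonal sum_k w_mk."""
    return np.diag(W.sum(axis=1)) - W

def b_matrix(X, W, D):
    """Matrix B(X) of (5) for embedding X, weights W and dissimilarities D."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    ratio = np.zeros_like(dist)
    # Entries with coinciding points (including the diagonal) stay zero.
    np.divide(W * D, dist, out=ratio, where=dist > 0)
    return np.diag(ratio.sum(axis=1)) - ratio

def stress(X, W, D):
    """Stress function (1), summed over pairs m < n."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return float(np.sum(W[iu] * (D[iu] - dist[iu]) ** 2))

def smacof(W, D, X0, iters=50):
    """Repeat the update (7), X <- pinv(L) B(X) X, from a centered start."""
    L_pinv = np.linalg.pinv(laplacian(W))
    X = X0 - X0.mean(axis=0)
    history = [stress(X, W, D)]
    for _ in range(iters):
        X = L_pinv @ b_matrix(X, W, D) @ X
        history.append(stress(X, W, D))
    return X, history

# Hypothetical test data: 20 planar points, exact distances, unit weights.
rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 2))
D = np.linalg.norm(truth[:, None, :] - truth[None, :, :], axis=-1)
W = np.ones((20, 20)) - np.eye(20)
X_hat, history = smacof(W, D, rng.normal(size=(20, 2)))
```

The list `history` records the stress after every iteration, which makes the monotone decrease guaranteed by the majorization step easy to inspect. The dense pseudo-inverse costs far more than is acceptable for large $N$; it is used here only to keep the sketch short.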
III. ONLINE EMBEDDING VIA STOCHASTIC SMACOF

A. Algorithm outline

Given $\{\delta_{mn}(t)\}$ and $\{w_{mn}(t)\}$, and starting with an arbitrary origin-centered $\hat{\mathbf{X}}_0$, the updates for the proposed stochastic SMACOF algorithm take the form,
\[ \hat{\mathbf{X}}_{t+1} = (1-\epsilon)\hat{\mathbf{X}}_t + \epsilon\,\mathbf{L}_t^{\dagger}\mathbf{B}_t(\hat{\mathbf{X}}_t)\hat{\mathbf{X}}_t, \quad t \ge 0 \tag{10} \]
where,
\[ [\mathbf{B}_t(\mathbf{X})]_{mn} = \begin{cases} -\dfrac{w_{mn}(t)\,\delta_{mn}(t)}{\sqrt{\|\mathbf{x}_m - \mathbf{x}_n\|^2 + \epsilon_x}} & m \ne n \\ -\sum_{k \ne m} [\mathbf{B}_t(\mathbf{X})]_{mk} & m = n \end{cases} \tag{11} \]
with $\epsilon_x$ being a small positive constant that ensures that the entries of $\mathbf{B}_t(\mathbf{X})$ stay bounded for all $\mathbf{X}$. The update rule can be viewed as a stochastic version of the SMACOF algorithm with the following modifications: (a) at each time instant, only one iteration of SMACOF is executed using the modified definition of $\mathbf{B}_t(\mathbf{X})$ in (11); (b) the estimated coordinates $\hat{\mathbf{X}}_t$ at time $t$ are used for initialization at $t+1$; and (c) the estimated coordinates $\hat{\mathbf{X}}_{t+1}$ are constructed by taking a convex combination of $\hat{\mathbf{X}}_t$ and the SMACOF output. The last modification endows the algorithm with tracking capabilities since the parameter $\epsilon$ may be interpreted as the forgetting factor, and can be tuned in accordance with the rate of change of $\{\delta_{mn}(t)\}$ and $\{w_{mn}(t)\}$. For example, the embedding at time $t+1$ can be forced to be close to that at time $t$ by setting $\epsilon \ll 1$. Finally, the proposed update rule subsumes the SMACOF algorithm for static scenarios, where we set $\delta_{mn}(t) = \delta_{mn}$ and $w_{mn}(t) = w_{mn}$ for all $t$, and $\epsilon = 1$.

The update rule in (10) is valid only if the graph $\mathcal{G}_t$ defined by $\{w_{mn}(t)\}$ is connected for all $t \ge 1$. In the case when $\mathcal{G}_t$ has more than one connected component, the coordinates within each component must be updated separately. Let $\mathcal{C}_t^j$ be the set of nodes belonging to the $j$-th component and $\mathbf{I}_t^j$ be the $|\mathcal{C}_t^j| \times N$ selection matrix containing the rows of $\mathbf{I}_N$ corresponding to the elements in $\mathcal{C}_t^j$. Defining $\mathbf{L}_t^{(j)} := \mathbf{I}_t^j\mathbf{L}_t(\mathbf{I}_t^j)^T$ and $\mathbf{B}_t^{(j)}(\mathbf{X}_t) := \mathbf{I}_t^j\mathbf{B}_t(\mathbf{X}_t)(\mathbf{I}_t^j)^T$, the update rule for the nodes in $\mathcal{C}_t^j$ is given by
\[ \hat{\mathbf{X}}_{t+1}^{(j)} = (\mathbf{I} - \epsilon\mathbf{J}_t)\hat{\mathbf{X}}_t^{(j)} + \epsilon\,(\mathbf{L}_t^{(j)})^{\dagger}\mathbf{B}_t^{(j)}(\hat{\mathbf{X}}_t)\hat{\mathbf{X}}_t^{(j)} \tag{12} \]
where $\mathbf{J}_t := \mathbf{I} - \mathbf{1}\mathbf{1}^T/|\mathcal{C}_t^j|$ is the $|\mathcal{C}_t^j| \times |\mathcal{C}_t^j|$ centering matrix which ensures that the coordinate center of each component does not change after the update, i.e., $\mathbf{1}^T\hat{\mathbf{X}}_{t+1}^{(j)} = \mathbf{1}^T\hat{\mathbf{X}}_t^{(j)}$. The general update rule
\[ \hat{\mathbf{X}}_{t+1} = (\mathbf{I} - \epsilon\,\mathbf{L}_t^{\dagger}\mathbf{L}_t)\hat{\mathbf{X}}_t + \epsilon\,\mathbf{L}_t^{\dagger}\mathbf{B}_t(\hat{\mathbf{X}}_t)\hat{\mathbf{X}}_t \tag{13} \]
subsumes the forms specified in (10) and (12), irrespective of the number of connected components in $\mathcal{G}_t$, since it holds that
\[ [\mathbf{L}_t^{\dagger}\mathbf{L}_t]_{mn} = \begin{cases} 1 - 1/|\mathcal{C}_t^j| & m = n \in \mathcal{C}_t^j \\ -1/|\mathcal{C}_t^j| & m \ne n,\ m, n \in \mathcal{C}_t^j \\ 0 & \text{otherwise.} \end{cases} \tag{14} \]

In contrast to the classical SMACOF algorithm, the proposed algorithm is flexible enough to be used in a number of different scenarios. As already discussed, a specific choice of parameters allows us to interpret the SMACOF algorithm as a special case of the proposed algorithm. On the other hand, the stochastic SMACOF can also be used to solve very large-scale MDS problems, where the full set of measurements $\{\delta_{mn}\}$ cannot be processed simultaneously. Instead, it is possible to apply (13) on a small subset of observations, corresponding to a subgraph $\mathcal{G}_t$. A special case occurs when exactly one edge is chosen per time instant and per cluster, i.e., $|\mathcal{C}_t^j| = 2$, and the updates in (13) reduce, up to a factor of two in the step size $\epsilon$, to those encountered in the stochastic proximity embedding (SPE) algorithm [6],
\[ \mathbf{x}_i(t+1) = (1-\epsilon)\mathbf{x}_i(t) + \epsilon\,\frac{\delta_{ij}(t)}{\|\mathbf{x}_i(t) - \mathbf{x}_j(t)\|}\,\mathbf{x}_i(t) + \epsilon\left(1 - \frac{\delta_{ij}(t)}{\|\mathbf{x}_i(t) - \mathbf{x}_j(t)\|}\right)\mathbf{x}_j(t) \tag{15} \]
and likewise for node $j$. The proposed stochastic SMACOF is therefore a generalization of SPE, applied to components of arbitrary sizes. Since the updates in (13) for any two clusters $\mathcal{C}_t^j$ and $\mathcal{C}_t^k$ do not depend on each other, the proposed algorithm can also be implemented in a distributed and asynchronous manner. Such an implementation is particularly suited to the range-based localization problems that arise in wireless networks.

Finally, akin to the classical adaptive filtering algorithms such as LMS, the proposed algorithm can also be applied to time-varying scenarios, i.e., when $\delta_{mn}(t)$ is non-stationary. The applications of interest include localization of time-varying networks, and visualization of time-varying data. In both cases, the first term $(\mathbf{I} - \epsilon\mathbf{J}_t)\hat{\mathbf{X}}_t^{(j)}$ in the update (12) serves as a momentum term. That is, a small $\epsilon$ encourages $\hat{\mathbf{X}}_{t+1}$ to stay close to $\hat{\mathbf{X}}_t$, resulting in a smooth trajectory of $\{\hat{\mathbf{X}}_t\}$. On the other hand, a large value of $\epsilon$ enables tracking in highly time-varying scenarios, while making the updates sensitive to noise [23, Ch-21], [24, Ch-9]. Further implementation details pertaining to the localization and visualization problems are discussed in Sec. IV. Before proceeding with the asymptotic analysis, the following remark is due.

Remark 1. Building further on the link with adaptive algorithms, $\epsilon$ may be interpreted as a forgetting factor that downweights the past information. When $\epsilon$ is a constant that is strictly greater than zero, the algorithm forgets the old data exponentially quickly, thus offering superior tracking capability. In contrast, it is possible to have a long-memory version of the algorithm with a time-varying $\epsilon_t \to 0$. As $t \to \infty$, such an algorithm would no longer track the changes in $\delta_{mn}(t)$, and can be applied to static scenarios where the algorithm can stop once the embeddings converge. While the bounds developed here apply only to the case of constant $\epsilon > 0$, a diminishing step size is in fact utilized in Sec. V.

B. Asymptotic Performance

In general, establishing convergence of stochastic algorithms for non-convex problems is quite challenging [20]. Here, the asymptotic performance of the proposed algorithm is established in two steps. First, it is shown that the trajectory of the stochastic SMACOF algorithm stays close to that of an averaged algorithm, in an almost sure sense. This part involves establishing a hovering theorem, and utilizes techniques from
stochastic approximation [24]-[26]. Next, it is shown that the averaged algorithm converges to a stationary point of (8).

1) Assumptions: For the purposes of establishing convergence, a simplified setting is considered, wherein the graph $\mathcal{G}_t$ at each $t$ consists of $N/p \ge 1$ components of size $p$ each. Let $j_m(t) := \{j \mid m \in \mathcal{C}_t^j\}$ be the index of the component to which node $m$ belongs at time $t$, and define $\boldsymbol{\Phi}_t \in \mathbb{R}^{N \times N}$ such that
\[ [\boldsymbol{\Phi}_t]_{mn} := \begin{cases} -1/N & j_m(t) \ne j_n(t) \\ -1/N + \epsilon/p & j_m(t) = j_n(t),\ m \ne n \\ (1-\epsilon) - 1/N + \epsilon/p & m = n. \end{cases} \]
(A1) The random processes $\{w_{mn}(t)\}_{t \ge 0}$ and $\{\delta_{mn}(t)\}_{t \ge 0}$ are independent identically distributed (i.i.d.).
(A2) The random variables $\{\delta_{mn}(t)\}$ have support $(0, C_\delta]$, while the weights $\{w_{mn}(t)\}$ have support $\{0\} \cup [\epsilon_w, 1]$.
(A3) The online algorithm is initialized such that $\|(\mathbf{I} - \mathbf{1}\mathbf{1}^T/N)\hat{\mathbf{X}}_0\| \le C_x$.
(A4) There exists $t_0 < \infty$ such that for any $\epsilon \in (0,1)$, there exists $\varrho \in (0,1)$ such that $\left\|\prod_{s=\tau+1}^{t}\boldsymbol{\Phi}_s\right\|_2 < \varrho^{\,t-\tau}$ for all $t - \tau \ge t_0$.
(A5) For each $t$, the non-zero weights $\{w_{mn}(t)\}_{m,n}$ are i.i.d. with $\bar{w} := \mathbb{E}[w_{mn}(t)]$.

The i.i.d. assumption in (A1) is standard in the analysis of most stochastic approximation algorithms. For the applications at hand, the support of $\delta_{mn}(t)$ and $w_{mn}(t)$ is naturally finite. It is required from (A2) that the non-zero weights be bounded away from zero. Such a condition is required to ensure the numerical stability of the Laplacian system of equations that must be solved at every iteration [cf. (10), (13)]. Specifically, it is shown in Appendix A that (A2) implies the following result.

Lemma 1. Under (A2), it holds that $\|\mathbf{L}_t^{\dagger}\|_2 \le (N-1)^2/(2\epsilon_w)$ for all $t \ge 1$.

The proof of Lemma 1 is provided in Appendix A. The initial configuration can always be normalized to satisfy the bound in (A3). Assumption (A4) restricts the extent to which the graphs $\mathcal{G}_t$ can stay disconnected over time. To obtain intuition on (A4), observe first that the largest eigenvalue of $\boldsymbol{\Phi}_t$ is $1-\epsilon$ if $\mathcal{G}_t$ has a single connected component and one otherwise. Consequently, if all $\{\mathcal{G}_s\}_{s=\tau+1}^{t}$ are connected, (A4) holds with $\varrho = 1-\epsilon$. Conversely, it holds that $\left\|\prod_{s=\tau+1}^{t}\boldsymbol{\Phi}_s\right\|_2 = 1$ if and only if (a) each of the graphs $\{\mathcal{G}_s\}_{s=\tau+1}^{t}$ has more than one component, and (b) the components do not change over time, i.e., the component index $j_m(s)$ of every node $m$ is the same for all $s$. Intuitively, (A4) allows $\{\mathcal{G}_s\}$ to have multiple connected components at each $s \ge 1$, as long as the nodes belonging to these components keep changing over time.

Finally, (A5) is perhaps the most restrictive, and may not always be easy to satisfy. For instance, the weights are not identically distributed in the context of dynamic network localization (cf. Sec. IV-A), since non-zero weights are often assigned to neighboring nodes only. Likewise, weights selected via Sammon mapping also result in non-identically distributed weights. The assumption however greatly simplifies the proof of convergence for the averaged algorithm. Having stated the assumptions, the averaging analysis is presented in the subsequent subsection.

2) Hovering Theorem: The proposed stochastic SMACOF algorithm will be related to an averaged algorithm with updates,
\[ \bar{\mathbf{X}}_{t+1} = (1 - \epsilon\gamma)\bar{\mathbf{X}}_t + \epsilon\,\mathbf{B}_a(\bar{\mathbf{X}}_t)\bar{\mathbf{X}}_t \tag{16} \]
where the time-invariant function $\mathbf{B}_a(\mathbf{X}) := \mathbb{E}[\mathbf{L}_t^{\dagger}\mathbf{B}_t(\mathbf{X})]$ and $\gamma = \frac{N(p-1)}{p(N-1)}$. Assuming that both algorithms start from the same initialization, i.e., $\hat{\mathbf{X}}_0 = \bar{\mathbf{X}}_0$, the following proposition states the main result of this section.

Proposition 1. Under (A1)-(A5), and for $\epsilon < 1$, it holds for the updates generated by (13) and (16), that
\[ \max_{1 \le t \le 1/\epsilon} \left\|\hat{\mathbf{X}}_t - \bar{\mathbf{X}}_t\right\| \le c(\epsilon) \tag{17} \]
where the random variable $c(\epsilon) \to 0$ almost surely as $\epsilon \to 0$.

Intuitively, Proposition 1 states that the trajectory of the proposed stochastic algorithm in (13) stays close to that of the averaged algorithm in (16). Further, the stochastic oscillations of (13) are small if $\epsilon$ is also small. However, choosing too small a value of $\epsilon$, which is also the step-size in (16), will generally result in a slower convergence rate for any such iterative algorithm. The parameter $\epsilon$ may therefore be seen as controlling the trade-off between the convergence rate and asymptotic accuracy. Further characterization of this trade-off is pursued via numerical tests in Sec. V.

Alternatively, consider the case when $T$ updates of (13) are performed with $\epsilon = 1/T$. For this case, the bound in (17) becomes
\[ \max_{1 \le t \le T} \left\|\hat{\mathbf{X}}_t - \bar{\mathbf{X}}_t\right\| \le c(1/T) \tag{18} \]
where $c(1/T) \to 0$ almost surely as $T \to \infty$. In other words, the stochastic oscillations can be made arbitrarily small if a sufficient number of updates can be performed. It is remarked that such results are commonplace in the stochastic approximation literature [24]-[26].

Next, an outline of the proof of Proposition 1 is presented, while the details are deferred to Appendix B. The overall structure of the proof is similar to that in [24]. Significant differences exist in the details however, since workarounds are introduced in order to avoid making any assumptions on the boundedness of $\hat{\mathbf{X}}_t$. It is emphasized that such a modification is generally not possible in a vast majority of problems, and is not trivial. It is however possible here due to the special structure of the update (13) that depends only on the differences between pairs of rows of $\hat{\mathbf{X}}_t$; see (49).

Proof of Proposition 1: The difference between the iterates generated by (13) and (16) is given by
\[ \boldsymbol{\Delta}_{t+1} := \hat{\mathbf{X}}_{t+1} - \bar{\mathbf{X}}_{t+1} = \boldsymbol{\Delta}_t - \epsilon\,\mathbf{L}_t^{\dagger}\mathbf{L}_t\hat{\mathbf{X}}_t + \epsilon\gamma\,\bar{\mathbf{X}}_t + \epsilon\left(\mathbf{L}_t^{\dagger}\mathbf{B}_t(\hat{\mathbf{X}}_t)\hat{\mathbf{X}}_t - \mathbf{B}_a(\bar{\mathbf{X}}_t)\bar{\mathbf{X}}_t\right). \tag{19} \]
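Assuming that both the algorithms start from the same initialization, i.e., $\hat{\mathbf{X}}_0 = \bar{\mathbf{X}}_0$, it follows that
\begin{align}
\boldsymbol{\Delta}_{t+1} &= -\epsilon\sum_{\tau=0}^{t}\left(\mathbf{L}_\tau^{\dagger}\mathbf{L}_\tau\hat{\mathbf{X}}_\tau - \gamma\bar{\mathbf{X}}_\tau\right) + \epsilon\sum_{\tau=0}^{t}\left(\mathbf{L}_\tau^{\dagger}\mathbf{B}_\tau(\hat{\mathbf{X}}_\tau)\hat{\mathbf{X}}_\tau - \mathbf{B}_a(\bar{\mathbf{X}}_\tau)\bar{\mathbf{X}}_\tau\right) \nonumber \\
&= -\epsilon\gamma\sum_{\tau=1}^{t}\mathbf{J}\boldsymbol{\Delta}_\tau + \mathbf{K}_t^1 + \mathbf{K}_t^2 + \mathbf{K}_t^3 \tag{20}
\end{align}
where for all $t \ge 0$,
\begin{align}
\mathbf{K}_t^1 &= \epsilon\sum_{\tau=0}^{t}\left(\mathbf{L}_\tau^{\dagger}\mathbf{B}_\tau(\hat{\mathbf{X}}_\tau)\hat{\mathbf{X}}_\tau - \mathbb{E}[\mathbf{L}_\tau^{\dagger}\mathbf{B}_\tau(\hat{\mathbf{X}}_\tau)]\hat{\mathbf{X}}_\tau\right) \tag{21a} \\
\mathbf{K}_t^2 &= \epsilon\sum_{\tau=0}^{t}\left(\gamma\mathbf{I} - \mathbf{L}_\tau^{\dagger}\mathbf{L}_\tau\right)\hat{\mathbf{X}}_\tau \tag{21b} \\
\mathbf{K}_t^3 &= \epsilon\sum_{\tau=1}^{t}\left(\mathbf{B}_a(\hat{\mathbf{X}}_\tau)\hat{\mathbf{X}}_\tau - \mathbf{B}_a(\bar{\mathbf{X}}_\tau)\bar{\mathbf{X}}_\tau\right). \tag{21c}
\end{align}
The following intermediate lemma develops bounds on the three terms in (21), and constitutes the key step in the proof.

Lemma 2. The following bounds hold for $i = 1, 2$
\begin{align}
\left\|\mathbf{K}_t^i\right\| &\le \epsilon\, d_t^i + \epsilon^2 C_i\sum_{\tau=1}^{t}\nu_\tau^i \tag{22} \\
\left\|\mathbf{K}_t^3\right\| &\le \epsilon\, C_3\sum_{\tau=1}^{t}\|\boldsymbol{\Delta}_\tau\| \tag{23}
\end{align}
where the constants $C_1$, $C_2$, and $C_3$ are independent of $t$, and the quantities $d_t^1$, $\nu_t^1$, $d_t^2$, and $\nu_t^2$ are such that
\[ \frac{d_t^i}{t} \to 0, \qquad \frac{\nu_t^i}{t} \to 0 \tag{24a} \]
for $i = 1, 2$, almost surely as $t \to \infty$.

The proof of Lemma 2 is provided in Appendix B. The norm of $\boldsymbol{\Delta}_{t+1}$ can therefore be bounded by applying the triangle inequality on (20) as follows.
\[ \|\boldsymbol{\Delta}_{t+1}\| \le \epsilon(C_3 + 1)\sum_{\tau=1}^{t}\|\boldsymbol{\Delta}_\tau\| + f(\epsilon) \tag{25} \]
where we have used the fact that $\|\mathbf{J}\|_2 = 1$ and
\[ f(\epsilon) := \max_{0 \le t \le 1/\epsilon}\left[\epsilon\,(d_t^1 + d_t^2) + \epsilon^2\sum_{\tau=1}^{t}\left(C_1\nu_\tau^1 + C_2\nu_\tau^2\right)\right]. \tag{26} \]
It is further shown in Appendix B that $f(\epsilon) \to 0$ almost surely as $\epsilon \to 0$. Proposition 1 then follows from the application of the discrete Bellman-Gronwall Lemma [24] on (25), which yields
\[ \|\boldsymbol{\Delta}_t\| \le f(\epsilon)\left(1 + \epsilon(C_3+1)\right)^t = f(\epsilon)\,e^{t\log(1+\epsilon(C_3+1))} \le f(\epsilon)\,e^{\epsilon t(C_3+1)} \le f(\epsilon)\,e^{C_3+1} := c(\epsilon). \tag{27} \]

3) Convergence of the Averaged Algorithm: Having established that the trajectory of the stochastic algorithm hovers around that of the averaged algorithm, we complete the proof by establishing that the averaged algorithm converges to a stationary point of (8). The challenge here is that the updates in (16) do not resemble those in other classical algorithms such as SMACOF or gradient descent. For notational brevity, let $\bar{\delta}_{mn} := \mathbb{E}[\delta_{mn}(t)]/\sqrt{\|\mathbf{x}_m - \mathbf{x}_n\|^2 + \epsilon_x}$ and $\mathbf{J} = \mathbf{I} - \mathbf{1}\mathbf{1}^T/N$, and note the following result.

Lemma 3. Under (A1)-(A5), it holds that
\[ [\mathbf{B}_a(\mathbf{X})]_{mn} = \frac{\gamma}{N}\begin{cases} -\bar{\delta}_{mn} & m \ne n \\ \sum_{k \ne m}\bar{\delta}_{mk} & m = n. \end{cases} \]
The proof of Lemma 3 is provided in Appendix C. For the rest of the section, we will assume that $\epsilon_x \ll 1$ and thus negligible. Therefore from Lemma 3, we have that
\[ \bar{\sigma}(\mathbf{X}) = \sum_{m<n}\mathbb{E}[w_{mn}(t)\,\delta_{mn}(t)^2] + \mathrm{tr}\left(\mathbf{X}^T\bar{\mathbf{L}}\mathbf{X}\right) - 2\,\mathrm{tr}\left(\mathbf{X}^T\bar{\mathbf{B}}(\mathbf{X})\mathbf{X}\right) \tag{28} \]
where $\bar{\mathbf{L}} := \mathbb{E}[\mathbf{L}_t] = \bar{w}p\gamma\,\mathbf{J}$ and
\[ [\bar{\mathbf{B}}(\mathbf{X})]_{mn} := \frac{\bar{w}p\gamma}{N}\begin{cases} -\bar{\delta}_{mn} & m \ne n \\ \sum_{k \ne m}\bar{\delta}_{mk} & m = n. \end{cases} \tag{29} \]
The main result of this subsection is stated as the following proposition.

Proposition 2. The mean-stress values $\bar{\sigma}(\bar{\mathbf{X}}_t)$ decrease monotonically with $t$, and the iterates converge to a stationary point of (8).

Proof: Without loss of generality, let $\sum_{m<n}\mathbb{E}[w_{mn}(t)\,\delta_{mn}(t)^2] = 1$, and define $\eta^2(\mathbf{X}) := \frac{1}{\epsilon\gamma}\mathrm{tr}\left(\mathbf{X}^T\bar{\mathbf{L}}\mathbf{X}\right)$ and $\rho(\mathbf{X}) := \frac{1}{2}\left(\frac{1}{\epsilon\gamma} - 1\right)\mathrm{tr}\left(\mathbf{X}^T\bar{\mathbf{L}}\mathbf{X}\right) + \mathrm{tr}\left(\mathbf{X}^T\bar{\mathbf{B}}(\mathbf{X})\mathbf{X}\right)$, and observe that $\bar{\sigma}(\mathbf{X}) = 1 + \eta^2(\mathbf{X}) - 2\rho(\mathbf{X})$. Similarly, define the mapping $\boldsymbol{\Gamma}(\mathbf{X}) := (1 - \epsilon\gamma)\mathbf{X} + \frac{\epsilon}{\bar{w}p}\bar{\mathbf{B}}(\mathbf{X})\mathbf{X}$, so that the updates in (16) become $\bar{\mathbf{X}}_{t+1} = \boldsymbol{\Gamma}(\bar{\mathbf{X}}_t)$.

Given any two embeddings $\mathbf{X}$ and $\mathbf{Y}$, the following bounds hold from the Cauchy-Schwarz inequality:
\begin{align}
\mathrm{tr}(\mathbf{X}^T\bar{\mathbf{L}}\mathbf{X}) &\ge \mathrm{tr}((2\mathbf{X} - \mathbf{Y})^T\bar{\mathbf{L}}\mathbf{Y}) \tag{30} \\
\mathrm{tr}(\mathbf{X}^T\bar{\mathbf{B}}(\mathbf{X})\mathbf{X}) &\ge \mathrm{tr}(\mathbf{X}^T\bar{\mathbf{B}}(\mathbf{Y})\mathbf{Y}) \tag{31}
\end{align}
which allows us to conclude that
\begin{align}
\rho(\mathbf{X}) &\ge \frac{1}{\epsilon\gamma}\mathrm{tr}\left(\mathbf{X}^T\bar{\mathbf{L}}\boldsymbol{\Gamma}(\mathbf{Y})\right) - \frac{1-\epsilon\gamma}{2\epsilon\gamma}\mathrm{tr}\left(\mathbf{Y}^T\bar{\mathbf{L}}\mathbf{Y}\right) \tag{32} \\
\bar{\sigma}(\mathbf{X}) &\le 1 + \eta^2(\mathbf{X}) + \frac{1-\epsilon\gamma}{\epsilon\gamma}\mathrm{tr}\left(\mathbf{Y}^T\bar{\mathbf{L}}\mathbf{Y}\right) - \frac{2}{\epsilon\gamma}\mathrm{tr}\left(\mathbf{X}^T\bar{\mathbf{L}}\boldsymbol{\Gamma}(\mathbf{Y})\right) \nonumber \\
&= 1 + (1-\epsilon\gamma)\,\eta^2(\mathbf{Y}) - \eta^2(\boldsymbol{\Gamma}(\mathbf{Y})) + \eta^2(\mathbf{X} - \boldsymbol{\Gamma}(\mathbf{Y})) \tag{33}
\end{align}
where equality holds for $\mathbf{X} = \mathbf{Y}$. Denote the right-hand side of (33) by $\tau_{\mathbf{Y}}(\mathbf{X})$, and observe that $\tau_{\mathbf{Y}}(\mathbf{X}) \ge \tau_{\mathbf{Y}}(\boldsymbol{\Gamma}(\mathbf{Y}))$ for all $\mathbf{X}$. This yields the main inequality that $\bar{\sigma}(\mathbf{Y}) = \tau_{\mathbf{Y}}(\mathbf{Y}) \ge \tau_{\mathbf{Y}}(\boldsymbol{\Gamma}(\mathbf{Y})) \ge \bar{\sigma}(\boldsymbol{\Gamma}(\mathbf{Y}))$. In other words, we have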
that t)
(X t+1 ), so that the non-negative sequence
(X The asynchronous nature of the algorithm allows for delayed
t :=
(Xt ) is non-increasing and therefore convergent to updates at nodes, balanced battery usage within the network,
a limit, say . By squeeze theorem for limits [27], it also and communication errors. In general, it is also possible to
holds that X
t (Xt+1 ) , yielding the following limits apply multiple updates of the form in (13) per time instant,
without incurring any extra communication cost.
lim 2 (Xt ) = (1
)/ (34)
t Nodes may declare themselves as cluster heads using a
lim (Xt ) = (1
)(1 )/2 (35) random backoff-based contention mechanism such as CSMA,
t
and solicit neighbors by simply sending an RTS packet.
t X
lim 2 (X t+1 ) = 0 (36)
t An update at a cluster thus takes up at most two message
t }t0 are origin centered, the result in exchanges. More complicated protocols that ensure recovery
Since the matrices {X

from collisions, and robustness or errors can also be used [30].
(36) can equivalently be written as X
t Xt+1 0 as

The online algorithm is flexible, and allows clusters of any
t . Denoting the limit point of X t by X , it can be shape or size, depending on the communication and computa-
seen that
(X ) = 0. tional resources available within the network. The non-zero
weights, corresponding to available distance measurements,
IV. I MPLEMENTATION A SPECTS can be chosen according to the estimated noise variance [10],
A. Multi-agent network localization [28], following Sammon mapping [1], [31], or simply as unity.
It is remarked that the node coordinates obtained from
Multidimensional scaling has been widely used for local-
(S1)-(S4) are relative and centered at the origin. In applica-
ization, where inter-node distances are often obtained from
tions where node coordinates are required with respect to a
time-of-arrival or received signal strength measurements [10],
set of GPS-enabled anchor nodes, appropriate rotation and
[14], [28], [29]. Wireless network localization is challenging
translation operations must be applied at each node. Since
because the pairwise distance measurements are noisy, time-
the anchor nodes are generally not power constrained, it is
varying due to mobility, fading, and synchronization errors,
possible for them to determine these transformations [10], and
and often partially missing, due to the limited range of the sen-
convey the result to all other nodes. As shown later in Sec.
sors. Further, the limited battery life and resource constraints
V, it is generally sufficient to calculate the transformations
at the nodes impose restrictions on the communication and
periodically every few time slots.
computational load that the network can tolerate [13], [14].
Finally, similar to the SMACOF algorithm, the stochastic
Towards addressing these limitations, the stochastic SMA-
SMACOF algorithm is sensitive to initialization. A random
COF algorithm for network localization works by judiciously
initialization may result in the algorithm getting trapped in
choosing {wmn (t)} to limit the communication and computa-
a poor local minimum. In practice, superior location es-
tional cost at each update. The idea is to partition the network
timation performance is obtained if the initialization is at
into several non-overlapping clusters (or components), chosen
least roughly correct. Simple low-complexity localization al-
randomly at each time t. The coordinates within a cluster
gorithms can be used for initialization. For instance, nodes can
are updated as in (13). Only neighboring nodes are included
roughly triangulate themselves using noisy distance estimates
within each cluster, thus eliminating the need for multihop
from the anchor nodes [32].
communication between far off nodes. Finally, since the updates at different components are independent of each other, the localization algorithm is run asynchronously as follows.
S1. At a given time t, a node j randomly declares itself as a cluster head, and solicits cluster members from among its neighbors $n \in \mathcal{N}_j$. Available neighbors respond with their current location estimates $\hat{x}_n(t)$, resulting in a star shaped cluster $\mathcal{C}_t^j$. Once locked as cluster members, these nodes respond only to the messages from node j.
S2. The cluster head performs distance measurements between itself and all its neighbors and collects $\delta_{jn}(t)$ for all $n \in \mathcal{C}_t^j \setminus \{j\}$.
S3. The cluster head performs the update in (13) with appropriately chosen weights $\{w_{jn}(t)\}$, and broadcasts the new location estimates to each node in $\mathcal{C}_t^j \setminus \{j\}$.
S4. Nodes in $\mathcal{C}_t^j \setminus \{j\}$, upon receiving the new location estimates (or upon timeout or error events), release their locks and become available.
As originally intended, the proposed algorithm can also be applied to mobile networks. The algorithm is expected to perform well as long as the node velocities are not too high.

B. Large network visualization
It is possible to visualize N objects in a 2 or 3 dimensional Euclidean space by applying MDS to the pairwise dissimilarities $\{\delta_{mn}\}$. The SMACOF algorithm is however ill-suited for large-scale visualization since it requires at least $O(N^2)$ operations per iteration. Further, even processing the full measurements $\{\delta_{mn}\}$ simultaneously may not be feasible for datasets with more than a hundred thousand objects. Visualization via stochastic embedding can be achieved by partitioning the objects into several subsets of reasonable sizes, and performing the updates in (13). The following steps are performed for each $t \geq 1$.
1) Partition the N objects into random, mutually exclusive subsets $\mathcal{C}_t^j$ with p nodes per subset.
2) For each subset, randomly choose a small fraction $f_t$ of pairs and measure (calculate or fetch from memory) distances $\delta_{mn}$ for the chosen pairs. Let $\mathcal{F}_t^j$ denote the set of chosen pairs for each cluster j and time t.
3) Apply the update in (13) for each subset $\mathcal{C}_t^j$.
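The partitioning and pair-selection steps 1) and 2) of the visualization procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `partition_objects` and `sample_pairs`, the fixed random seed, and the choice to enumerate all pairs of a subset before sampling are not taken from the paper, and the update in (13) itself is not shown.

```python
import random
from itertools import combinations


def partition_objects(num_objects, subset_size, rng):
    """Step 1: randomly split indices 0..N-1 into disjoint subsets of at most p objects."""
    indices = list(range(num_objects))
    rng.shuffle(indices)
    return [indices[i:i + subset_size] for i in range(0, num_objects, subset_size)]


def sample_pairs(subset, fraction, rng):
    """Step 2: keep a random fraction of the unordered pairs (m, n) within one subset."""
    all_pairs = list(combinations(sorted(subset), 2))
    if not all_pairs:
        return []
    num_kept = max(1, round(fraction * len(all_pairs)))
    return rng.sample(all_pairs, num_kept)


rng = random.Random(0)  # fixed seed, used only to make the sketch reproducible
subsets = partition_objects(num_objects=20, subset_size=5, rng=rng)
chosen_pairs = [sample_pairs(subset, fraction=0.3, rng=rng) for subset in subsets]
```

Distances would then be computed or fetched only for the pairs in `chosen_pairs`, which is what keeps the weights sparse. Enumerating all pairs costs on the order of p^2 per subset, which is acceptable for moderate p; for large subsets one would sample index pairs directly instead.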
Compared to the localization algorithm, in this case all pairwise distances are available a priori and without noise, but cannot be read or processed simultaneously. The aforementioned steps result in making $\{w_{mn}(t)\}$ sparse and thus reducing the per-iteration complexity. Algorithm 1 summarizes the implementation of stochastic SMACOF for large network visualization.

Algorithm 1 Stochastic SMACOF for Large Network Visualization
Initialize $\hat{X}_0$ and set the step size to some value in (0, 1)
for t = 1, 2, . . . do
    Partition the set $\mathcal{N}$ into C disjoint subsets $\{\mathcal{C}_t^j\}_{j=1}^{C}$
    for j = 1, . . . , C do
        Measure or fetch from memory pairwise distances $\{\delta_{mn}(t)\}$, for a subset of object pairs $(m, n) \in \mathcal{F}_t^j$.
        Set weights $w_{mn}(t) = 1$ for all $(m, n) \in \mathcal{F}_t^j$.
        Perform the update in (13) for each subset $\mathcal{C}_t^j$.
    end for
end for

Again, as envisioned earlier, the algorithm is also applicable to visualization of dynamic networks. The idea here is to create an animation consisting of embeddings that vary over time. By specifying a small enough value for the step size in (13), it is possible to force the embeddings to change slowly over time, thus preserving the user's mental map [12]. Unlike existing algorithms however, the proposed algorithm can allow visualization of very large datasets.

C. Algorithm complexity
Unlike the SMACOF algorithm, whose per-iteration complexity is $O(N^2)$, the stochastic SMACOF algorithm processes the data in small batches and can therefore be implemented at near-linear complexity. This is because if $G_t$ consists of multiple components of size p each, the updates in (13) decouple and can even be carried out in parallel. Further, the weights for each cluster are chosen to be sparse, i.e., the $p \times p$ matrix $L_t^{(j)}$ has at most $q \ll p^2$ non-zero elements. The problem of solving a sparse Laplacian system of equations has been well studied, and state-of-the-art solvers return a solution in time $O(q \log p)$ for each component. Thus, using N/p sparse matrices $\{L_t^j\}$ results in an overall complexity of $O\big(\tfrac{N}{p}\, q \log p\big)$. As we will show next, the appropriate choice of the batch size p results in a near-linear complexity. The complexity results obtained in this section are summarized in Table I.
Note that a sublinear per-iteration complexity of $O(q \log p)$ is also achievable by updating only one component per iteration. Such an implementation would however require a proportionally large number of iterations. Alternatively, the per-iteration complexity of the algorithm can be calibrated using the total number of dissimilarity measurements processed per iteration, given by $f(N) = q(N/p)$. To this end, we provide approximate rules for choosing p and q so as to minimize the per-iteration complexity, given the total number of non-zero weights f(N).
First, assume that each $L_t^j$ is sparse, i.e., $q \ll p^2$, so that $f(N)/N = q/p \ll p$. In this case, since the per-iteration complexity is given by $O(f(N) \log p)$, the value of $\log p$ should be as small as possible. It can be seen that the choice
$$p \approx O\!\left(\left(\frac{f(N)}{N}\right)^{\kappa}\right), \qquad q \approx O\!\left(\left(\frac{f(N)}{N}\right)^{\kappa+1}\right) \qquad (37)$$
for some $\kappa \geq 1$ results in the complexity $O(f(N) \log(f(N)/N))$, while ensuring that $L_t^j$ is still sparse with $q \approx O(p^{1+1/\kappa})$. Note that it is not necessary for $\kappa$ to be very large, as long as the sparse Laplacian solvers can still be used. On the other hand, when $L_t^j$ is dense so that $q \approx O(p^2)$, the per-iteration complexity is given by $O(Nq) = O(f(N)\,p)$. In this case, it holds that $f(N)/N = q/p \approx p$, so that one must choose $p \approx O(f(N)/N)$ and $q \approx O(f(N)^2/N^2)$. Consequently, the optimal iteration complexity for this case becomes $O(f(N)^2/N)$.
Table I shows a few example choices of f(N) and the corresponding per-iteration complexity values. It can be observed that when f(N) is almost linear in N, so is the per-iteration complexity, regardless of the sparsity of $L_t^j$. On the other hand, using a sparse $L_t^j$ becomes important when f(N) is large.

TABLE I: Algorithm complexity for different choices of f(N)
Non-zero weights f(N)                    sparse $L_t^j$                  dense $L_t^j$
$O(N^{1+\beta})$, $0 < \beta \ll 1$      $O(N^{1+\beta} \log N)$         $O(N^{1+2\beta})$
$O(N \log N)$                            $O(N \log N \log\log N)$        $O(N \log^2 N)$
$O(N^{3/2})$                             $O(N^{3/2} \log N)$             $O(N^2)$

V. SIMULATION RESULTS
This section provides simulation results evaluating the performance of the proposed algorithm. The general properties of the stochastic SMACOF algorithm are first characterized using numerical tests. Next, simulation results are provided for the online localization algorithm, evaluating its performance in various mobile network scenarios. Finally, applicability to large-scale visualization is demonstrated by running the algorithm on two different datasets. Before proceeding, it is remarked that the proposed stochastic SMACOF is better suited to applications where the size of the dataset is large, preferably N > 50. Indeed, if the problem at hand is small (say N < 20), conventional SMACOF would likely be faster, since the proposed algorithm generally requires more iterations to converge. The computational advantage arising from processing only a few distance measurements per time instant becomes significant only when N is sufficiently large.

A. Algorithm Behavior
This section provides several numerical tests that allow us to study various properties of the stochastic SMACOF algorithm. Towards this end, consider a network with 100 nodes, distributed uniformly over a $10 \times 10$ planar area. The measured distances between nodes m and n are given by $\delta_{mn}(t) = \|x_m - x_n\| + v_{mn}(t)$, where $v_{mn}(t) \sim \mathcal{N}(0, 0.01)$. Negative distance measurements were discarded by setting the corresponding $w_{mn}(t) = 0$. The algorithm is run for different
values of the step size, with p = 25 and about 35% density of non-zeros¹. All non-zero weights are chosen to be unity.
¹ Non-zero locations are generated randomly, and the number of non-zeros vary between different instantiations.
1) Transient performance: Fig. 1 shows the sequence of normalized stress values obtained from an example run of the algorithm [cf. (13)]. For comparison, the stress values obtained from running the averaged algorithm (cf. (16)) and the SMACOF algorithm for weighted MDS (cf. (7)) are also plotted. All algorithms are initialized with the same randomly chosen configuration. The MDS algorithm runs with all-one weights, while the updates for the averaged algorithm are obtained via empirical averaging.

Fig. 1: Performance of the stochastic SMACOF algorithm, the averaged algorithm, and the SMACOF algorithm (normalized stress versus t, shown for step sizes 0.3, 0.1, and 0.05, together with the MDS solution).

As expected, the convergence speed of the algorithm varies monotonically with the step size. Consistent with Proposition 1, the trajectory of the proposed algorithm follows that of the averaged algorithm. As expected, the steady-state stress value achieved by the averaged algorithm is very close to that of SMACOF. Further, as shown in the inset, the proposed algorithm hovers above the averaged algorithm, with steady-state deviation decreasing with the step size.
It is remarked that the SGD algorithm, with updates specified in (9), tended to diverge in the presence of noisy distance measurements, different weight choices, and poor initializations. For instance, when using Sammon mapping, i.e., $w_{mn} = 1/\delta_{mn}$, the noisy measurement model specified earlier, and a step size of 0.05, the SGD algorithm converged for only 19 out of 100 test runs. In contrast, no divergent behavior was ever observed for the proposed algorithm even with measurement noise $v_{ij} \sim \mathcal{N}(0, 10)$.
2) Steady state performance: The algorithm is allowed to run for 5000 time instants with different values of the step size, and the minimum, mean, and maximum steady-state stress values are evaluated. We set $T_{ss} = [4801, \ldots, 5000]$ and evaluate
$$\sigma_{\min} = \min_{t \in T_{ss}} \sigma(\hat{X}_t), \qquad \sigma_{\text{mean}} = \frac{1}{|T_{ss}|} \sum_{t \in T_{ss}} \sigma(\hat{X}_t), \qquad \sigma_{\max} = \max_{t \in T_{ss}} \sigma(\hat{X}_t).$$
Starting with the same initialization, the entire experiment is repeated for 100 Monte Carlo iterations. Fig. 2 shows the minimum, mean, and maximum steady state errors plotted against the step size. As expected, the stress values converge to a small non-zero value that decreases with the step size.

Fig. 2: Hovering property of the online algorithm (minimum, mean, and maximum steady-state stress versus the step size).

B. Dynamic Network Localization
The localization performance of the proposed algorithm is studied on a mobile network. Supplemental Video 1² shows an example run of the algorithm on a mobile network with N = 8 and a step size of 0.3. The performance of the algorithm is further analyzed by carrying out simulations over networks with different sizes and node velocities. For a mobile network with N nodes, nodes are deployed randomly with an average density of one node per unit area. Nodes can measure distances and communicate within a radius of $\sqrt{N}/2$. For all values of N, five nodes are randomly chosen to be anchors. The node velocities are initialized randomly and updated according to the following model $v_{mn}(t+1) = \alpha\, v_{mn}(t) + \sqrt{1 - \alpha^2}\, n_v(t)$, where $v_{mn}(0), n_v(t) \sim \mathcal{N}(0, \sigma_v^2 I)$. The mobility parameter $\sigma_v$ is directly proportional to the average speed of the nodes, and influences the tracking performance of the algorithms used.
² https://www.youtube.com/watch?v=-MQFR3yiv7U
The performance of the proposed algorithm is compared with the weighted MDS solution obtained by running the SMACOF algorithm till convergence. The non-zero weights, corresponding to node pairs within the communication radius of each other, are all set to one. Note however that a direct comparison between the SMACOF solution and the proposed algorithm is unfair, since SMACOF is too complex to be directly implemented in a mobile network. Even among cooperative localization techniques that focus on efficient implementation (see e.g. [10], [14], [28], [29]), localization requires several iterations per time instant. In contrast, the proposed algorithm is asynchronous, and incurs linear or sublinear complexity, but is inaccurate for the first few time instants.
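The velocity model used in the mobile network simulations can be sketched as follows. The sketch assumes a first-order Gauss-Markov reading of the model, v(t+1) = alpha * v(t) + sqrt(1 - alpha^2) * n_v(t), with v(0) and n_v(t) drawn from a zero-mean Gaussian with per-coordinate variance sigma_v^2. The function name `simulate_velocities`, the name `alpha` for the memory parameter, the two-dimensional velocities, and all numerical values are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def simulate_velocities(num_nodes, num_steps, alpha, sigma_v, rng, dim=2):
    """Return velocities[t][m], the dim-vector velocity of node m at time t.

    Assumed model: v(t+1) = alpha * v(t) + sqrt(1 - alpha**2) * n_v(t),
    with v(0) and n_v(t) drawn i.i.d. from N(0, sigma_v**2) per coordinate.
    """
    gain = math.sqrt(1.0 - alpha ** 2)
    current = [[rng.gauss(0.0, sigma_v) for _ in range(dim)] for _ in range(num_nodes)]
    velocities = [current]
    for _ in range(num_steps):
        current = [
            [alpha * v + gain * rng.gauss(0.0, sigma_v) for v in node_velocity]
            for node_velocity in current
        ]
        velocities.append(current)
    return velocities


rng = random.Random(1)  # fixed seed, used only to make the sketch reproducible
trajectory = simulate_velocities(num_nodes=2000, num_steps=20, alpha=0.9, sigma_v=0.01, rng=rng)
final_values = [v for node_velocity in trajectory[-1] for v in node_velocity]
empirical_variance = sum(v * v for v in final_values) / len(final_values)
```

Under this assumed form the per-coordinate variance stays at sigma_v^2 for every t, since alpha^2 * sigma_v^2 + (1 - alpha^2) * sigma_v^2 = sigma_v^2. The parameter sigma_v alone therefore sets the average node speed, which matches its description as the mobility parameter.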
In order to perform a fair comparison between algorithms, the following modifications are adopted. First, a time-slotted version of the stochastic SMACOF algorithm is considered. Within each time slot, the network forms several clusters, and performs steps (S1)-(S4). In order to reduce the overhead associated with cluster formation, nodes with fewer than 5 neighbors do not form clusters. Similarly, to limit the computational complexity at each node, cluster heads respond to at most 10 nearest neighbors. With these settings, the computational and communication complexity incurred by the network at every time slot is approximately N/5. The computational and communication complexity of the SMACOF variants in [10], [14], [33] is also normalized appropriately. As a first order approximation, it is assumed that these algorithms require O(N) message exchanges per iteration. Equivalently, if we allow N/5 message exchanges per iteration, and assume that 10 iterations are required for convergence, SMACOF requires about 50 time slots for convergence. For obtaining the plots however, SMACOF is run till convergence, and the number of iterations incurred was often more than 50. Both algorithms start with an initial estimate of the node locations. Approximate node estimates can be quickly obtained using simple techniques such as those in [32]. For the purpose of simulations, the initial locations are chosen as $\hat{x}_m(0) = x_m(0) + v_m$ where $v_m \sim \mathcal{N}(0, N/100)$. Warm starts are utilized at subsequent time slots by initializing SMACOF with the previously estimated node locations.
Fig. 3 shows an example run of the two algorithms with $\sigma_v = 0.01$, N = 50, and a step size of 0.5. The best possible estimation error obtained by solving the MDS problem is also shown for comparison. Observe that the proposed algorithm is inaccurate initially, and gradually approaches its steady state value. Interestingly, the transient period required by the proposed algorithm is small, especially when compared to the 50 time slots required by the SMACOF implementation.

Fig. 3: Estimation error for an example run of the online and SMACOF algorithms (estimation error $\|\hat{X}_t - X_t\|$ versus t, shown for SMACOF, the online algorithm, and the lower bound).

Next, the steady-state localization error of the two algorithms is compared. Both algorithms are run for 700 iterations, and the maximum localization error incurred in the last 200 iterations is evaluated as $e_\ell = \max_{t \in T_{ss}} \frac{1}{N} \|\hat{X}_t - X_t\|$ where $T_{ss} = [501, \ldots, 700]$. The entire process is repeated for 100 Monte-Carlo repetitions. For the proposed algorithm, the value of the step size is tuned a priori to minimize the localization error. Fig. 4 shows the steady-state localization error incurred by the online and SMACOF algorithms, plotted for different values of N and $\sigma_v$. It is evident that the proposed algorithm performs significantly better than the complexity-normalized SMACOF. In particular, while the performance of the two algorithms deteriorates with increasing node mobility, the gap between their performance also increases. This is because at higher node speeds, the node locations change significantly within the 50 time slots required by SMACOF to run. Observe that for a given average node velocity, the performance of all algorithms appears to improve with N. However, this is simply because the average node distances increase with N, thereby reducing the relative average node speeds.

Fig. 4: Localization error for the online and MDS algorithms for different network sizes and average node velocities (localization error versus N, shown for stochastic SMACOF and SMACOF with $\sigma_v$ = 0.002, 0.005, 0.01, and 0.02).

Fig. 5 shows an example run of the algorithm on a mobile network with N = 8 and a step size of 0.3. The network has four static anchors placed at the four corners of the $1 \times 1$ region, that provide the necessary translation and rotation information to all other nodes. For simplicity, only one 8-node cluster is formed at each time instant by a randomly selected node. The actual and estimated node locations are shown as circles and squares respectively, with markers drawn every 10 time instants. The nodes move in the direction indicated by decreasing marker sizes. As evident from the figure, the trajectory of the estimated node locations converges to the actual trajectory within 30-40 time instants, and follows it thereafter.

C. Large-scale Visualization
This section demonstrates the use of the stochastic SMACOF algorithm for large-scale visualization. Given the plethora of highly sophisticated visualization algorithms a full-fledged comparison is beyond the scope of the present work. Instead,
we only present the visualizations obtained from running the proposed algorithm for both static and dynamic datasets. The proposed algorithms are implemented in MATLAB and run on an Intel Core i7 CPU. This is in contrast to the state-of-the-art visualization algorithms that require large compute clusters with hundreds of processors for similar-sized datasets [17].

Fig. 5: Example run of the dynamic network localization algorithm. Marker size decreases with time to indicate the direction of motion.

1) PubChem Dataset: We consider a subset of 800,000 unique chemical compounds taken from the PubChem compound database [34], [35]. The structural information about each compound is represented by its 166 bit MACCS fingerprint. The dissimilarity between two compounds with binary fingerprints h and g is calculated using the Tanimoto score [36, Ch-8], given by
$$\delta = 1 - \frac{\sum_i h_i \wedge g_i}{\sum_i h_i \vee g_i} \qquad (38)$$
where $\wedge$ and $\vee$ denote the logical AND and OR operators respectively. It is remarked that for this case, it is no longer possible to load an $N \times N$ matrix in the memory. Following the discussion in Sec. IV-B, we use p = 100 and q = 50, so as to obtain linear complexity per iteration. The simulation is run for 5000 iterations, and the value of the step size is reduced every 1000 iterations from 0.2 to 0.001. Fig. 6 shows the visualization obtained from the stochastic SMACOF algorithm. Each dot represents a compound, and is colored according to its molecular complexity, a measure available from the PubChem dataset. Specifically, the blue dots represent simpler (lower complexity) molecules, while green, yellow, and red colored dots represent progressively more complex molecules. It is observed that MDS yields two distinct clusters of compounds, while the lower complexity compounds are scattered towards the edges. The visualization obtained here is comparable to those obtained in [8], [17].

Fig. 6: Visualization of PubChem Datasets.

2) MovieLens Dataset: The proposed algorithm is used to perform dynamic visualization of the 27,000 movies on the MovieLens database [37]. To this end, the time-stamp associated with each movie rating is utilized to generate a dynamic network $G_t$ that only contains the movies released and rated till the week number t. The distance between two movies is estimated from their cosine similarities. Supplemental Video 2 shows a visualization of the evolution of the movie-space over the duration 1995-2015. Each movie is colored in accordance with its popularity, and the newly released movies start at the origin. From the video, it can be seen that the popular movies move quickly (within a few weeks) towards the edge of the graph, while the less popular ones tend to remain near the center. Further details regarding the video are included in the accompanying readme file. Also see the video at the link³.
³ https://www.youtube.com/watch?v=iJbY3HPHAUM
3) Newcomb Fraternity Dataset: The dynamic visualization of the Newcomb Fraternity dataset [38] is considered. Since the dataset consists of only 16 nodes, and yields only 14 snapshots overall, computational complexity is not an issue. Nevertheless, the dynamic visualization is obtained so that it may be compared with the regularized MDS technique of [12]. Supplemental Video 3⁴ shows the dynamic visualization obtained from running the stochastic SMACOF algorithm for 50 iterations per time slot with a step size of 0.2. The video is generated following a procedure similar to that in [12]. The resulting video is quite similar to the one obtained via the graph-regularized framework of [12]. Intuitively, the momentum term in the updates in (13) plays the role of the regularization term here, and keeps the embeddings from changing too quickly.
⁴ https://www.youtube.com/watch?v=G9geUI3U7Tw&feature=youtu.be

VI. CONCLUSION
The multidimensional scaling (MDS) problem is considered within a stochastic setting, and a novel stochastic scaling by majorizing a complicated function (SMACOF) algorithm is proposed. The proposed algorithm is highly scalable, and is applicable to visualization and localization problems of very large sizes. Asymptotic analysis of the stochastic SMACOF algorithm shows that it stays close to the trajectory of an averaged algorithm, which itself converges to a stationary point of the stochastic stress minimization problem. Implementation details, as well as the computational complexity analysis of the proposed algorithms are also provided. The performance of the proposed algorithm is discussed for large-scale localization and visualization examples. The efficacy of the proposed algorithm is demonstrated for localization of mobile networks, and visualization of both static and dynamic networks.

APPENDIX A
LOWER BOUND ON THE ALGEBRAIC CONNECTIVITY
In order to obtain intuition on (A3), consider the undirected graph $G_t$ whose edges have weights $\{w_{mn}(t)\}$, and recall that $L_t$ is the graph Laplacian of $G_t$. The eigenvalues of $L_t$ constitute the spectrum of the graph $G_t$ [39]. If $G_t$ is connected, the smallest eigenvalue of $L_t$ is zero, while the second-smallest eigenvalue $a(G_t) = 1/\|L_t^\dagger\|_2$ is always non-zero and is referred to as the algebraic connectivity of $G_t$. As the name
suggests, a(G) captures the overall connectivity of the graph. On the other hand, if $G_t$ has $K \geq 2$ connected components $\{G_t^k\}_{k=1}^{K}$, the K smallest eigenvalues of $L_t$ are zero, so the smallest non-zero eigenvalue is simply $a(G_t) = \min_k a(G_t^k)$. Next, we establish a lower bound on the algebraic connectivity of the weighted graph $G_t$.
Proof of Lemma 1: If $G_t$ is connected, the second smallest eigenvalue is given by
$$a(G_t) = N \min_{1^T y = 0,\; y \neq 0} \frac{\sum_{m<n} w_{mn} (y_m - y_n)^2}{\sum_{m<n} (y_m - y_n)^2}. \qquad (39)$$
Here, the minimum is attained by the corresponding eigenvector $\bar{y}$, that satisfies $L_t \bar{y} = a(G_t)\, \bar{y}$. Recall that $E := \{(m, n) \mid w_{mn} \in [\epsilon_w, 1]\}$, and observe that since $G_t$ is connected, there exists a path P between any two nodes m and n, such that
$$(\bar{y}_m - \bar{y}_n)^2 = \Big[\sum_{(i,j) \in P} (\bar{y}_i - \bar{y}_j)\Big]^2 \leq (N-1) \sum_{(i,j) \in P} (\bar{y}_i - \bar{y}_j)^2 \qquad (40)$$
$$\leq (N-1) \sum_{(i,j) \in E} (\bar{y}_i - \bar{y}_j)^2 \qquad (41)$$
where (40) holds since P may contain at most N - 1 edges. Summing both sides over all node pairs m < n, of which there are N(N-1)/2, we have that
$$\sum_{m<n} (\bar{y}_m - \bar{y}_n)^2 \leq \frac{N(N-1)^2}{2} \sum_{(m,n) \in E} (\bar{y}_m - \bar{y}_n)^2. \qquad (42)$$
Substituting (42) into (39) for y = y , we have that which implies that

ym yn )2
P X
wmn ( [Bt (X)X]m,:

|wmn (t)mn (t)|. N C . (49)
a(G) = N m<n P (43)
ym yn )2
( n6=m
P m<n
2 (m,n)E wmn ( ym yn )2 2w The bound in (48b) therefore follows from the use of (45),
(44)
N 2 C
P
(N 1)2 (
(m,n)E my yn ) 2 (N 1)2

Lt Bt (X)X Lt kBt (X)Xk . (50)
2 L
which is the required bound. If Gt is not connected, it holds for
a component Gtk with p nodes that a(Gtk ) 2w /(p 1)2 which yields C4 = N 2 C /L . Likewise, the m-th row of
2w /(N 1)2 , so that we again have a(Gt ) = min a(Gtk ) X
Bt (X)X Bt (X) becomes
2w /(N 1)2 , which is the desired result.
X
Bt (X)X Bt (X)

m,:

A PPENDIX B X xm xn m x
x n
P ROOF OF L EMMA 2 = wmn (t)mn (t) . (51)
n6=m
dmn dmn
Before proceeding with the proof, we state some basic
xm x
Adding and subtracting the term ( n )/dmn to each term
results, and introduce necessary notation. In the subsequent
within the summation in (51), it can be seen that
analysis, we will repeatedly use the following inequalities [40]
xm xn xm xn
kABk kAk2 kBk kAk kBk (45)
dmn
dmn
where A and B matrices xm xm xn xn 1 1
q of compatible sizes. For nota- = xm x
+ ( n)
tional brevity, dmn :=
2
kxm xn k + x and dmn :=
dmn dmn dmn dmn
xm xm xn xn xm xn
dmn d2mn
2

q
k
xm x n k2 + x , and note that dmn , dmn . = + .
dmn dmn dmn dmn (dmn + dmn )
We begin by defining the total deviation functions corre- (52)
sponding to K1t and K2t as
Further, the term d2mn d2mn can be written compactly as
t
X
D1t (X) := L B (X)X E[L B (X)X] d2mn d2mn = x
Tm x Tn x xTm x
n xTm xm xTn xn + 2xTm xn

(46) m + x n 2
=1
= (xm xn + x n )T (xm x
m x n xn )
m + x (53)
X t
D2t (X) := L L E[L L ] X

(47) Consequently, it is possible to write (51) as,
=1
X

Bt (X)X Bt (X)

m,:
The following lemma lists several preliminary results re- X
quired in deriving the bounds in Lemma 2. = wmn (t)mn (t)Amn ((xm x m ) (xn x
n ))
n6=m
Lemma 4. There exists t0 < , such that for all t t0 , it
where the matrix Amn is given by
holds that

1 xm x
( n )(xm xn + x
m x n )T
X
Lt Bt (X)X Lt Bt (X) Amn = I+ . (54)

C3 X X (48a) dmn dmn dmn (dmn + dmn )

Lt Bt (X)X C4 (48b) Thus, the full difference becomes

X
Bt (X)X Bt (X) = At (X, X)vec

JXt C5 (48c) XX (55)
1
Dt (X) dt1 is given by
where the (m, n)-th p p block of At (X, X)
(48d)
(
X X
1 1

Dt (X) Dt (X) (48e)
t := A mn wmn (t)mn (t) m 6= n

2
Dt (X) d2t
At (X, X) P (56)
(48f) n6=m Amn wmn (t)mn (t) m = n
2 X X

2
Dt (X) D2t (X)

t (48g) Next, repeated use of the triangle inequality yields
T 2
where J = I 11 /N , C3 and C4 are constants, while the kAmn k
random variables d1t , d2t , t1 , and t2 follow (24). Results in !
2 2k
xm x n k2 k
xm x n + xm xn k2
(48f) and (48g) also require X to be such that kJXk C5 . kIk +
d2mn d2mn (dmn + dmn )2
The proof is organized into four steps, each considering one
or more inequalities. Here, it holds from the definition of dmn that
Proof of (48a) and (48b): Observe that the m-th row of k n k /dmn 1. Similarly, it holds that
xm x
Bt (X)X for each t 0 can be written as k
xm x
n + xm xn k 2
(57)
X wmn (t)mn (t) 2 2
[Bt (X)X]m,: = (xm xn ) kxm x
n k + kxm xn k + 2 k
xm x
n k kxm xn k
dmn d + d + 2dmn dmn = (dmn + dmn )2
2 2
(58)
n6=m mn mn
2
Therefore, the bound on kAmn k becomes The law of large numbers therefore implies that D1t (X)/t 0
1as t
almost surely .1This also implies that there exists d1t
2(N + 1) 2 1
kAmn k (59) such that Dt (X) dt and dt /t 0 as t .

The Lipschitz continuity of D1t (X) can similarly be shown

Similarly, it holds for At (X, X) that
using (48a). Towards this end, observe that
2 t1
X
=
D1t (X) D1t (X) L B (X)X E[L B (X)X]
2 XX X

C2 2
At (X, X) kAmn k + kAmn k
=1
m n6=m n6=m
XX t1
2
X
3C2 X
L B (X) E[L B (X)
X]

kAmn k (60)
m n6=m =1
3 t1
N (N 1)(N + 1) 6N
3C2 < C2
X
X
L B (X)X B (X)

(61) =
x
=0
which in turn, yields the bound X
E[L B (X)X B (X) ]

(73)
2
2 6N 3 C .

At (X, X) The vectorized version of the first term can be written as
(62)
x
X
vec L B (X)X L B (X)

The Lipschitz continuity of Lt Bt (X)X thus follows as
= I L vec B (X)X B (X) X

(74)
X X
Lt Bt (X)X Lt Bt (X) Lt Bt (X)X Bt (X)

2 = I L A (X, X)vec XX (75)
r
N C 6N ,

X X (63) Using a similar transformation on the second term of (73), the
L x vectorized version of the right-hand side can be written as
q
so that C3 = NCL
6N
x .
vec D1t (X) D1t (X)

Proof of (48c): Observe that Lt J = Lt and JLt = Lt . t1

!
X

Right multiplying both sides of (13) by J, it follows that = C (X, X) E[C (X, X)] vec X X (76)
=0
t+1 = J(I L Lt )X
JX t + JL B (X t )X
t (64)
t t t
= I L A (X, X) is bounded as

where C (X, X)
= (JJ JL Lt J)X
t
t + L B (X
t )X
t t
t (65)

C (X, X) Lt A (X, X) C3 . It is therefore

= (J Lt Lt )JXt + Lt Bt (Xt )Xt (66) 2

possible to write
= (J
Lt Lt )(J Lt1 Lt1 )JXt1 + Lt Bt (Xt )Xt
t X X

1
Dt (X) D1t (X)

+ (J Lt Lt )Lt1 Bt1 (X
t1 )X
t1 (67) (77)
Continuing in a similar manner, taking norm on both sides of t1

X

(67), applying triangle inequality, and using (48b) yields where, t =
E[C (X, X)]
C (X, X) (78)

t =0
X
JXt+1 Q0t 2 JX 0 + (1 + kQt k2 )C4 (68)

Since the term within the norm is a bounded zero-mean
=1
random variable, it follows from law of large numbers that
where Qt := = (JLt Lt ). Next, from (A4), there exists
Qt
t1
some t0 < and % < 1 such that kQt k %t +1 for all 1X E[C (X, X)]
0
C (X, X) (79)
t + 1 t0 . Since kQt k 1 for all t + 1, bound in t =0
(68) becomes
with probability 1 as t . This also implies that t /t 0

%t almost surely as t .
JXt+1 Cx %t + C4 (1 + t0 + )

1% 1) Proof of (48f) and (48g): Observe that the zero mean
1 random variable D2t (X) can be written as
= Cx + C4 (1 + t0 + ) =: C5 (69)
1% t
X
D2t (X) = L L E[L L ] JX

for all t t0 . (80)
Proof of (48d) and (48e): Observe that each term of =0
D1t (X) in (46) is zero mean, and bounded as
so that it follows from (48c) that L L E[L L ] JX

(L B (X)X E[L B (X)X]

(70) 2C5 for all X such that kJXk C5 . Invoking the law of large

(L B (X)X + E[L B (X)X]

(71) numbers as before, D2t (X)/t 0 almost surely as t .
Consequently, there exists d2t such that D2t (X) d2t and
(L B (X)X + E[ L B (X)X ] 2C4 (72) d2t /t 0 almost surely as t .
In order to establish the Lipschitz continuity of D2t (X), Such a t0 () exists within [0, T /] for all T /t0 ().
= C0 (X X),
observe that D2t (X) D2t (X) where Therefore, given , if t t0 (), it holds that
t
t
X P [dt ] = 1 (93)
C0t := L L E[L L ] (81)
=0 for all /t0 ()Cd . On the other hand, if t > t0 (), (93)
Since each summand in (81) is zero mean and bounded, it holds for all T /t0 (/T ). Combining the two cases, it
holds from law of large numbers that C0t /t 0 almost holds that max0tT / dt 0, with probability one as
exists t2 such that 0.
2 as t
surely . Consequently, there
2 X X , and 2 /t 0 almost For the other two terms, observe similarly that given , there

Dt (X) D2t (X) t t
surely as t . exists T and C such that
Proof of Lemma 2: Bounds in (22) can be derived by
P [t /t C ] = 1 t, (94)
observing that for 1 t and = 1, 2, it holds that
and P [t /t ] = 1 t > T . (95)
) D (X
D (X 1 ) = K K +
1 1
Thus, given , if t T , it holds that
D (X ) D
1 1 (X 1 ). (82) " #
t r
2
X
1
Summing (82) over = 1, . . . , t, it follows that P = 1, , s.t, . (96)
=2
T C
t
X
t)
Dt (X 0)
D0 (X = Kt K0 + +1 )
D (X )
D (X Similarly, the result in (96) holds for t > T for all T
.
T/T 2
=1
0 ), a bound on K can be derived

Observing that K0 = D0 (X t
by using (48d) and (48e) as follows: A PPENDIX C
Xt P ROOF OF L EMMA 3
t )
kKt k Dt (X
)
D (X +1 ) D (X mn (t)
+ (83) For notational convenience, let mn (t) :=

=1 kxm xn k2 +x
t
X and recall that mn = E[mn (t)]. The proof is divided into

dt + X X (84) two parts. In the first part, we consider the case when Gt is

+1
=1 connected, so that p = N . In this case, the goal is to show
Xt that
= dt + )X
L B (X
L L X
(85)

(
mn

=1
h

i m 6= n
N E[Lt Bt (X)] = P mn m = n. (97)
t mn
n6=m
X
dt +

C4 + L L JX (86)
=1 Since the graph is connected, it holds that Lt = (Lt +
t
X 11T /N )1 11T /N . Let mn denote the (m, n)-th co-
dt + (C4 + C5 ) (87) factor of Lt + 11T /N and := det(Lt + 11T /N ), so
=1
that [Lt ]mn = mn / 1/N . PSince Lt has zero row and
M
so that C1 = C2 = (C 4 + C5 ) for = 1, 2. column sums, we also have that n=1 mn = . Therefore,
The bound on K3t follows form applying triangle inequal- expanding along the m-th row, the expression for becomes
ity on (21c), and using (48a) as follows: M
X 1 X
t1 = wmn (t)(mm mn ) + mn (98)
N n=1
X
)X
E[L B (X L B (X
)X
]
3
Kt
(88) n6=m
=1 N X
t1 = wmn (t)(mm mn ) (99)
X N 1
)X
E[ L B (X
L B (X
)X
(89) n6=m
]
=1 for each 1 m N . Straightforward manipulations allow us
t1 t1
X

X to conclude that
C3 X X = C k k (90)

3

=1 =1 mn (t)wmn (t)(mm mn )
m 6= n
h i 1 P
k6=m,n wnk (t)nk (t)(mn mk )

Finally, to show that ft () fT () 0 for the interval Lt Bt (X) =
mn
0 t T /, observe that for = 1, 2, it holds that dt
P

w mk (t)mk (t)(mm mk ) m = n.
T dt /t. From (24), it is known that given any , there exists k6=m
t0 () and Cd such that Next, we show that the random variables mn and mk
are identically distributed for n 6= k 6= m. Without loss of
P [dt /t Cd ] =1 t, (91)
generality, let m = 1. Also, let Lnki denote the (N 2)
and P [dt /t ] =1 t > t0 (). (92) (N 2) submatrix of Lt + 11T /N after the removal of rows
(1, i) and columns (n, k). The Laplace expansion of 1n along pair of nodes (m, n) belongs to the same connected component
the k-th column yields is given by (p 1)/(N 1), yielding the required expression

X1 p 1 mn m 6= n
wki (t))(1)n+i+k |Lnk
h i
1n = ( i | E Lt Bt (X) = P
N mn p(N 1) mk m = n.
i6=1,n,k
k6=m
1 1 X
( wkn (t))(1)k |Lnk n |( + wki (t))(1)n |Lnk
k |
N N R EFERENCES
i6=k
X 1 [1] I. Borg and P. J. Groenen, Modern multidimensional scaling: Theory
= ( wki (t))(1)n+i+k |Lnk i | and applications. Springer Science & Business Media, 2005.
N [2] L. V. D. Maaten and G. Hinton, Visualizing data using t-SNE, Journal
i6=1
X of Machine Learning Research, vol. 9, no. Nov, pp. 25792605, 2008.
( wki (t) + 2wkn (t))(1)n |Lnk
k | (100) [3] A. Platzer, Visualization of SNPs with t-SNE, PloS One, vol. 8, no. 2,
i6=k,n 2013.
[4] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, Line: Large-
scale information network embedding, in Proc. of the Intl. Conf. on
Likewise, the expansion of 1k along the n-th column yields World Wide Web, 2015, pp. 106777.
X 1 [5] L. van der Maaten and G. Hinton, Visualizing non-metric similarities
1k = ( wni (t))(1)n+i+k |Lnk i |
in multiple maps, Machine Learning, vol. 87, no. 1, pp. 3355, 2012.
N [6] D. K. Agrafiotis, Stochastic proximity embedding, Journal of compu-
i6=1 tational chemistry, vol. 24, no. 10, pp. 121521, 2003.
X
[7] J. Tzeng, H. H. Lu, and W.-H. Li, Multidimensional scaling for large
( wki (t) + 2wkn (t))(1)k |Lnk
n | (101) genomic data sets, BMC bioinformatics, vol. 9, no. 1, p. 179, 2008.
i6=k,n [8] J. Y. Choi, S.-H. Bae, J. Qiu, B. Chen, and D. Wild, Browsing large-
scale cheminformatics data with dimension reduction, Concurrency and
It can be seen that the first terms in (100) and (101) are Computation: Practice and Experience, vol. 23, no. 17, pp. 231525,
identically distributed since wni (t) and wki (t) are identical 2011.
[9] M. Beatty and B. Manjunath, Dimensionality reduction using multi-
(cf. (A5)). Further, performing n k row exchanges on Lnk n , dimensional scaling for content-based retrieval, in Proc. of the ICIP,
it is possible to obtain L nk which only differs from Lnk in vol. 2, 1997, pp. 835838.
n k
the k-th row. Indeed, the elements of the k-th row of L nk are [10] J. A. Costa, N. Patwari, and A. O. Hero III, Distributed weighted-
n
multidimensional scaling for node localization in sensor networks, ACM
{(1/N wki (t))}i6=k,n , while the elements of the k-th row of Trans. on Sensor Networks, vol. 2, no. 1, pp. 3964, 2006.
Lnk
n are {(1/N wni (t))}i6=k,n . Since the determinant is linear [11] P. A. Forero and G. B. Giannakis, Sparsity-exploiting robust multidi-
in its rows, it follows that |Lnk nk n+k nk mensional scaling, IEEE Trans. on Signal Proc., vol. 60, no. 8, pp.
n | and |Ln | = (1) Lk
411834, 2012.
are identically distributed. In summary, we have that the [12] K. S. Xu, M. Kliger, and A. O. Hero III, A regularized graph
distributions of mn and mk are identical for all k 6= n 6= m. layout framework for dynamic network visualization, Data Mining and
Next, define identical random variables mn := Knowledge Discovery, vol. 27, no. 1, pp. 84116, 2013.
[13] S. Kumar, R. Kumar, and K. Rajawat, Cooperative localization of
wmn (t)(mmP mn ) for each n 6= m, so that mobile networks via velocity-assisted multidimensional scaling, IEEE
= NN1 n6=m mn . Since Gt is connected, it holds Trans. on Signal Process., vol. 64, no. 7, pp. 17441758, 2016.
that > 0. Therefore from symmetry, we have that [14] A. Simonetto and G. Leus, Distributed maximum likelihood sensor
network localization, IEEE Trans. on Signal Proc., vol. 62, no. 6, pp.
14241437, 2014.
mn N 1 mn 1 [15] S. Kumar, R. Jain, and K. Rajawat, Asynchronous optimization over
E[ ]= E[ P ]= (102)
N
n6=m mn N heterogeneous networks via consensus admm, IEEE Trans. on Signal
and Inf. Proc. over Networks, 2016 (to be published).
Further, using the fact that E[mn ] = E[mk ] for each k 6= n, [16] P. Biswas, T.-C. Lian, T.-C. Wang, and Y. Ye, Semidefinite pro-
gramming based algorithms for sensor network localization, ACM
it can be seen that Transactions on Sensor Networks, vol. 2, no. 2, pp. 188220, 2006.
[17] S.-H. Bae, J. Qiu, and G. Fox, Adaptive interpolation of multidimen-
h i 1 P mn m 6= n sional scaling, Procedia Computer Science, vol. 9, pp. 393 402, 2012.
E Lt Bt (X) = mk m = n.

[18] S. Ingram, T. Munzner, and M. Olano, Glimmer: Multilevel MDS on the
mn N GPU, IEEE Trans. on Visualization and Computer Graphics, vol. 15,
k6=m
no. 2, pp. 249261, 2009.
[19] L. Bottou, On-line Learning in Neural Networks, D. Saad, Ed. New
which is the required result. York, NY, USA: Cambridge University Press, 1998.
Finally, if Gt consists of multiple connected components, the [20] C. D. Sa, C. Re, and K. Olukotun, Global convergence of stochastic
quantity Lt Bt (X) is a permuted version of the block-diagonal gradient descent for some non-convex matrix problems, in Proc. of the
Intl. Conf. on Machine Learning, 2015, pp. 233241.
matrix with N/p block matrices of size p p each. Let j [21] J. Mairal, Stochastic majorization-minimization algorithms for large-
denote the determinant of j-th block, and the random variables scale optimization, in Advances in Neural Information Processing
Systems, 2013, pp. 22832291.
jmn be similarly defined block-wise. Proceeding along similar [22] O. Cappe and E. Moulines, On-line expectationmaximization algo-
lines, it can be seen that rithm for latent data models, Journal of the Royal Statistical Society:
Series B (Statistical Methodology), vol. 71, no. 3, pp. 593613, 2009.
jmn p1 j 1 [23] A. H. Sayed, Adaptive filters. John Wiley & Sons, 2011.
E[ j
]= E[ P mn j ] = . (103) [24] V. Solo and X. Kong, Adaptive signal processing algorithms: stability
p n6=m mn
p and performance. Prentice-Hall, Inc., 1994.
[25] H. Kushner and G. G. Yin, Stochastic approximation and recursive
Consequently, [Lt Bt (X)]mn is non-zero if and only if the algorithms and applications. Springer, 2003, vol. 35.
[26] V. S. Borkar et al., Stochastic approximation, Cambridge Books, 2008.
node pair (m, n) belong to the same component, and is zero [27] W. Rudin, Principles of mathematical analysis. McGraw-Hill New
otherwise. From (A5), we have that the probability that a given York, 1964, vol. 3.
[28] N. Patwari, J. N. Ash, S. Kyperountas, A. O. Hero, R. L. Moses, and N. S. Correal, "Locating the nodes: cooperative localization in wireless sensor networks," IEEE Signal Process. Mag., vol. 22, no. 4.
[29] H. Wymeersch, J. Lien, and M. Z. Win, "Cooperative localization in wireless networks," Proceedings of the IEEE, vol. 97, no. 2, pp. 427–450, 2009.
[30] I. Demirkol, C. Ersoy, F. Alagoz et al., "MAC protocols for wireless sensor networks: a survey," IEEE Commun. Mag., vol. 44, no. 4.
[31] W. S. Torgerson, "Multidimensional scaling of similarity," Psychometrika, vol. 30, no. 4, pp. 379–393, 1965.
[32] C. Savarese, J. M. Rabaey, and J. Beutel, "Location in distributed ad-hoc wireless sensor networks," in Proc. of the IEEE ICASSP, vol. 4, 2001, pp. 2037–2040.
[33] L. Dong, "Cooperative localization and tracking of mobile ad hoc networks," IEEE Trans. on Signal Proc., vol. 60, no. 7.
[34] S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker et al., "PubChem substance and compound databases," Nucleic Acids Research, p. gkv951, 2015.
[35] E. E. Bolton, Y. Wang, P. A. Thiessen, and S. H. Bryant, "PubChem: integrated platform of small molecules and biological activities," Annual Reports in Computational Chemistry, vol. 4, pp. 217–241, 2008.
[36] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to data mining. Pearson Education India, 2006.
[37] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Trans. on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, 2016.
[38] T. M. Newcomb, The acquaintance process. Holt, Rinehart & Winston, 1961.
[39] B. Mohar, "Laplace eigenvalues of graphs - a survey," Discrete Mathematics, vol. 109, no. 1-3, pp. 171–183, 1992.
[40] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3.

Ketan Rajawat (S'06-M'12) received his B.Tech and M.Tech degrees in Electrical Engineering from the Indian Institute of Technology (IIT) Kanpur, in 2007, and his Ph.D. degree in Electrical and Computer Engineering from the University of Minnesota in 2012. Currently, he is an assistant professor in the Department of Electrical Engineering, IIT Kanpur. His research interests lie in the areas of Signal Processing and Communication networks. His current research focuses on distributed optimization algorithms, cross-layer network optimization, resource allocation, and dynamic network monitoring.

Sandeep Kumar (S'16) received his B.Tech in Electronic and Telecommunication Engineering from College of Engineering Roorkee, India in 2011; and his M.Tech in Electrical Engineering from the Indian Institute of Technology Kanpur, in 2013. He is currently pursuing the Ph.D. degree at IIT Kanpur in Electrical Engineering. He is a recipient of the TCS Research Fellowship. His research area spans parallel and distributed signal processing, optimization and learning over networks, localization and dynamic network monitoring.
