
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.


CLRIC: Collecting Lane-Based Road Information Via Crowdsourcing
Luliang Tang, Xue Yang, Zhen Dong, and Qingquan Li

Abstract: Lane-based road network information, such as the number and locations of traffic lanes on a road, plays an important role in intelligent transportation systems. In this paper, we propose a Collecting Lane-based Road Information via Crowdsourcing (CLRIC) method, which automatically extracts the detailed lane structure of roads from crowdsourcing data collected by vehicles. First, CLRIC filters high-precision GPS data from the raw trajectories using region growing clustering with prior knowledge. Second, CLRIC mines the number and locations of traffic lanes through an optimized constrained Gaussian mixture model. Experiments are conducted with taxi GPS trajectories in Wuhan, China, and the results show that CLRIC can quantify and display detailed road networks, including the number and locations of traffic lanes, when compared with satellite imagery and human interpretation.

Index Terms: Lane-based road information, crowdsourcing data, high-precision GPS data filtering, spatiotemporal GPS trajectories.

Manuscript received June 7, 2015; revised November 1, 2015 and January 10, 2016; accepted January 19, 2016. This work was supported by the National Science Foundation of China under Grants 41571430, 41271442, and 40801155. The Associate Editor for this paper was Z.-H. Mao.
L. Tang, X. Yang, and Z. Dong are with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China.
Q. Li is with the Department of Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, Shenzhen 518060, China.
Color versions of one or more of the figures in this paper are available online.
Digital Object Identifier 10.1109/TITS.2016.2521482
1524-9050 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

I. INTRODUCTION

ACCURATE lane-based road network data (the number and locations of traffic lanes) is crucial for ensuring reliable and safe driving in next-generation navigation, especially for intelligent transportation systems (ITS) such as advanced driver assistance systems and autonomous driving. The number of lanes is also important for inferring the type of road and for estimating traffic flow capacity. At present, lane-based information such as the number, turn rules, and locations of traffic lanes is usually acquired from high-definition video/images, laser point clouds, or DGPS/INS (Differential GPS/Inertial Navigation System) trajectories with accuracies of about 0.5–4 m [1]–[4]. Manual and semi-manual methods are time-consuming and labor-intensive [5].

Crowdsourcing is a low-cost and efficient way to extract useful information from data acquired by crowd participants or volunteers. The crowdsourcing method has been successfully applied in OpenStreetMap (OSM) for road-level map construction, which uses the Global Positioning System (GPS) for localization [6]–[9]. However, low-end GPS devices and urban canyons with tall buildings reduce the position accuracy of GPS data to about 10–15 m in urban areas, so it is challenging to extract lane-based road information from low-precision crowdsourcing GPS data.

In this paper, we propose CLRIC: collecting lane-level road network information via crowdsourcing. CLRIC can automatically extract the detailed lane structure of roads using crowdsourcing GPS data collected by vehicles. CLRIC is based on two key observations. The first observation is that high-precision GPS trajectories with accuracies of about 3 m still exist in raw vehicle trajectories, based on GPS error analysis [10]. Thus, region growing clustering with prior knowledge (RGCPK) is used in the CLRIC system to select high-precision GPS data from low-precision raw GPS data. The second observation is that vehicle trajectories contain abundant information regarding road networks [11]–[13], traffic conditions [14], [15], points of interest, and driving behaviors [16], [17]. The detailed lane structure of roads can also be mined by using clustering methods, based on the assumption that GPS trajectories tend to cluster near the center of each lane with some spread. Thus, CLRIC fits an optimized constrained Gaussian mixture model to perpendicular cross sections of the trajectory vectors across a road, and then determines the exact number and locations of traffic lanes. In summary, the contributions of this paper are the following.

1) We present a trajectory optimization method to select high-precision data from crowdsourcing GPS data. The position accuracy of the selected data is about 3 m.
2) We design a model for mining lane information based on an optimized constrained Gaussian mixture method.

The remainder of this article is organized as follows. In Section II, related studies on trajectory optimization and traffic lane information extraction from GPS trajectories are reviewed. In Section III, CLRIC is fully described. In Section IV, a series of experiments on Wuhan datasets is used to demonstrate the advantages and effectiveness of CLRIC. In Section V, some conclusions and directions for future research are given.

II. RELATED WORK

GPS does not work perfectly in urban areas. When GPS receivers are used in street canyons with tall buildings, shadowing and multi-path effects result in low positional accuracy. Besides, gathering GPS data via crowdsourcing way is relatively
informal compared with the professional way, so the raw GPS data is mixed with many outliers. At present, there are several ways to optimize raw trajectories, such as filtering, map matching, and clustering. Filtering is suitable where trajectory data with a high sampling rate is particularly noisy, or when it is necessary to derive other quantities such as speed or direction from trajectory data [18]. Map matching is another way to optimize raw trajectory data, in which each trajectory is matched to the road centerline it corresponds to [19]. In addition, some researchers have proposed using clustering methods to remove outliers. In [14], the authors used a kernel density method to identify and remove outliers. The authors of [4] sort all the data points in ascending order according to their distances from the median and then choose 95% of the sorted data points as the experimental data. However, all these methods [4], [14], [18], [19] have their defects. Filtering is sensitive to the sampling rate of GPS data, so it is poorly suited to GPS data with a low sampling rate. Map matching is valid for road-level information extraction such as road network updating and traffic flow detection, but it is useless for lane-based road information extraction because each GPS point is matched to the road centerline. Besides, the existing clustering methods [4], [14] depend heavily on parameter settings and cannot remove outliers that are mixed into high-density point clusters.

Extracting information from pre-processed GPS data is a key issue in the geographic domain. In this study, we focus on lane-based information extraction from GPS data. There has also been work on completely automated methods aimed at inferring road maps from crowdsourcing data. Those methods include matching GPS traces to prototypical shapes [20], using an incremental method to process GPS traces that can be used to generate road maps [21], [22], and applying clustering methods or artificial algorithms to extract road networks from GPS traces [23]–[25]. Besides, OpenStreetMap uses user-contributed GPS trajectories to create free digital maps that are open for editing from registered users. Likewise, WikiMapia, Google Maps, and other map applications let users update maps. The methods proposed in [20]–[25] can generate and update road-level maps from crowdsourcing data, while detailed road network generation has gradually shifted down to lane-based road network information such as the number and locations of traffic lanes.

Lane-based information extraction from vehicle trajectories starts with differential GPS data and concludes with a refinement of an existing map, including finding lanes and lane transitions through the intersections [26], [27]. This process involves smoothing and filtering the GPS data, matching it to an existing map, spline fitting for the road centerlines, clustering to find lanes, and refinement of the intersection geometry [28]. The authors of [29] proposed using vehicle trajectories collected by mobile phones equipped with GPS and MEMS (Micro-Electro-Mechanical System) sensors to generate lane-level road maps in open areas. The lane-level information was extracted by statistically analyzing the probability density distribution of trajectories based on non-parametric kernel density estimation. However, the methods discussed in [26]–[29] are based on the assumption that GPS trajectories from different lanes are well separated. For low-precision crowdsourcing GPS data, this assumption is seriously violated, and therefore we propose CLRIC to extract lane structure from a mass of low-precision crowdsourcing GPS data in urban areas.

Fig. 1. CLRIC architecture.

III. COLLECTING LANE-BASED ROAD INFORMATION VIA CROWDSOURCING

The overview of the CLRIC system is shown in Fig. 1. As seen in Fig. 1, CLRIC includes two steps:

Step 1) Select high-precision data from crowdsourcing data based on region growing clustering with prior knowledge (RGCPK). The positional accuracy of the selected data can approach 3 m.
Step 2) Extract and optimize lane-based information, such as the number and locations of traffic lanes, from the selected high-precision data using an optimized constrained Gaussian mixture model and road construction rules.

Fig. 2. Trajectories and trajectory vectors. (a) and (b) show the trajectory and trajectory vector respectively, where N indicates north and $\theta$ is the angle between vector $v_i$ and north.

A. Crowdsourcing Data: GPS Trajectories

In the real world, the trajectory of a moving object is continuous and is usually called the original route. However, it is hard to acquire continuous trajectories through existing positioning techniques or to store them in a database. Thus, an alternative feasible solution taken in GPS-enabled services is to store only a set of sampled positions of a trajectory, as a tuple $Trace_i = \langle p_1, p_2, \ldots, p_n \rangle$, where $p_i$ is a GPS-sampled spatial-temporal point, which is itself a tuple $\langle x_i, y_i, t_i, s_i \rangle$ ($i = 1, 2, \ldots, n$), as shown in Fig. 2(a).

Here $t_i$ is the time stamp when the data is collected, $\langle x_i, y_i \rangle$ is the geographic location of the moving object, and $s_i$ contains some extra features of the moving object such as the vehicle number, driving direction, and speed. These sampled trajectories are called raw GPS trajectories. For a trajectory vector $v_i = \langle p_i, p_{i+1} \rangle$, $p_i$ and $p_{i+1}$ are regarded as the start point and the end point respectively, and the direction from $p_i$ to $p_{i+1}$ is regarded as the vector direction of $v_i$ (Fig. 2(b)).

For lane-based information extraction, trace information such as location, driving direction, and collection time is very important. However, due to the errors caused by data sampling and encryption in GPS navigation services, many GPS records are not precise and thus generally need to be optimized.

B. RGCPK: Region Growing Clustering With Prior Knowledge

Most drivers keep driving along the centerline of a lane and change lanes in a short time. The GPS trajectory reflects this tendency of the vehicle. Thus, high-precision data has two features: tracking points with high positional accuracy always cluster together along the centerline of each lane; and the angle between two adjacent trajectory vectors of a trajectory does not change much unless the vehicle is at an intersection or changing lanes [30]. Based on this observation, RGCPK clusters trajectories based on their similarity and then selects high-precision data from the clusters. The prior knowledge for RGCPK is extracted from the similarity between high-precision DGPS trajectories and synchronized low-precision GPS trajectories. The positioning accuracies of a DGPS trajectory and its synchronized GPS trajectory are about 0.5 m and 10–15 m respectively, and their sampling interval is 1 s. To evaluate the similarity between a DGPS trajectory vector and its synchronized GPS trajectory vector, a novel vector similarity evaluation model (VSEM) is presented.

Fig. 3. The similarity evaluation of trajectory vectors.

1) The Similarity Evaluation Model of Trajectory Vectors: The difference between two trajectory vectors is reflected in the vector direction and the vertical (perpendicular) distance between the start points of the vectors. As shown in Fig. 3, $v_i \langle (x_i, y_i), (x_{i+1}, y_{i+1}) \rangle$ and $v_j \langle (x_j, y_j), (x_{j+1}, y_{j+1}) \rangle$ are two different trajectory vectors. The angles to north of $v_i$ and $v_j$ are $\theta_i$ and $\theta_j$ respectively. The magnitude of the vectors is not discussed here because we only care about the location accuracy of trajectories, rather than the sampling rate and speed of moving objects that decide the magnitude of the vectors.

Thus, the similarity measure in this paper is defined in the form of linear weighting:

$$\mathrm{sim}(v_i, v_j) = \omega_1 e^{-\mathit{diff}_{Hd}} + \omega_2 e^{-\mathit{diff}_{\theta_{ij}}} \tag{1}$$

where $\mathit{diff}_{Hd}$ and $\mathit{diff}_{\theta_{ij}}$ represent the vertical difference and angular difference of $v_i$ and $v_j$ respectively, and $\omega_1$ and $\omega_2$ are the weights of the vertical distance difference and the angular difference, with $\omega_1 + \omega_2 = 1$. In general, the similarity of two vectors ranges from 0 to 1, with a value of 1 for two identical vectors and a value of 0 for two completely separated vectors. We define $\mathit{diff}_{Hd}$ and $\mathit{diff}_{\theta_{ij}}$ as follows:

$$\mathit{diff}_{Hd} = \frac{\max(Hd_{ij}, Hd_{ji})}{Dis_{conne}} \tag{2}$$

$$\mathit{diff}_{\theta_{ij}} = 1 - \cos(\theta) \tag{3}$$

where $Dis_{conne}$ is a constant decided by the lane width and used to constrain the similarity of vectors on the same lane, as GPS traces tend to cluster near the center of each lane. $Hd_{ij}$ is the vertical distance from the starting point of $v_i$ to $v_j$, and $Hd_{ji}$ is computed in the same way. $\theta$ is the angular difference of $v_i$ and $v_j$ (Fig. 3). These quantities can be estimated as:

$$\theta = |\theta_i - \theta_j| \tag{4}$$

$$Hd_{ij} = \frac{|x_i (y_{j+1} - y_j) + y_i (x_j - x_{j+1}) + (x_{j+1} y_j - x_j y_{j+1})|}{\sqrt{(y_{j+1} - y_j)^2 + (x_j - x_{j+1})^2}} \tag{5}$$

$$Hd_{ji} = \frac{|x_j (y_{i+1} - y_i) + y_j (x_i - x_{i+1}) + (x_{i+1} y_i - x_i y_{i+1})|}{\sqrt{(y_{i+1} - y_i)^2 + (x_i - x_{i+1})^2}} \tag{6}$$

The similarity calculation is used not only for prior knowledge extraction from DGPS and low-precision GPS data but also for crowdsourced data clustering and selection. Thus, we use the correlation of the vertical distance difference and the angular difference with the measuring errors to estimate $\omega_1$ and $\omega_2$.
Fig. 4. Similarity of vector pairs. The similarities of some vector pairs are shown in the yellow rectangle window.

Given a set of GPS trajectories $T = \langle Trace_1, Trace_2, \ldots, Trace_s \rangle$, its synchronized DGPS trajectories are denoted as $DT = \langle Dt_1, Dt_2, \ldots, Dt_s \rangle$. The position accuracies of $DT$ and $T$ are 0.5 m and 10–15 m respectively, so $T$ is regarded as the measurements and the high-accuracy $DT$ is regarded as the true value of $T$. Let $Trace_i = \langle p_1, p_2, \ldots, p_n \rangle$ and $Dt_i = \langle rp_1, rp_2, \ldots, rp_n \rangle$, with $Trace_i \in T$, $Dt_i \in DT$, $i = 1, 2, \ldots, s$. Their trajectory vectors are denoted as $Tv_i = \langle v_1, v_2, \ldots, v_{n-1} \rangle$ and $Dv_i = \langle rv_1, rv_2, \ldots, rv_{n-1} \rangle$. The differences in vertical distance and angle between $Tv_i$ and $Dv_i$ are calculated and denoted as $D_i = \langle d_1, d_2, \ldots, d_{n-1} \rangle$ and $A_i = \langle a_1, a_2, \ldots, a_{n-1} \rangle$, $i = 1, 2, \ldots, s$. The measurement error of $T$ is denoted as $\varepsilon_i = \langle \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \rangle$, $\varepsilon_j = |p_j - rp_j|$, $i = 1, 2, \ldots, s$, $j = 1, 2, \ldots, n$. Then $\omega_1$ and $\omega_2$ are estimated as:

$$\omega_1 = \frac{\sum_{i=1}^{s} r_D}{\sum_{i=1}^{s} r_D + \sum_{i=1}^{s} r_A} \tag{7}$$

$$\omega_2 = 1 - \omega_1 \tag{8}$$

where $r_D$ and $r_A$ are the correlations of $D_i$ with $\varepsilon_i$ and of $A_i$ with $\varepsilon_i$, and can be estimated from the covariance matrix.

2) The Prior Knowledge of RGCPK Extraction: The similarity $Sim\langle Trace_i, Dt_i \rangle$ of $Trace_i$ and $Dt_i$ can be computed according to the similarity evaluation model, and the results are recorded as $Trace_i = \langle (p_1, s_1), (p_2, s_2), \ldots, (p_n, s_n) \rangle$, where $s_j$ is the similarity value of $v_j$ and $rv_j$. Following the tendency of the moving object, $s_j$ is also recorded as the similarity value of $p_j$, with $s_n = s_{n-1}$, $j = 1, 2, \ldots, n$, as shown in Fig. 4.

A high similarity value indicates a higher positional accuracy of the tracking point. Assume that $Ts = \langle Ts_1, Ts_2, \ldots, Ts_l \rangle$ is the set of similarity thresholds and $ST_i$ represents the data set that satisfies $Ts_i$, $ST_i \subseteq T$. Then the percentage of $ST_i$ is calculated as:

$$Per_i = \frac{N(ST_i)}{N(T)} \tag{9}$$

where $Per_i$ represents the percentage of $ST_i$, and $N(ST_i)$ and $N(T)$ are the numbers of tracking points of $ST_i$ and $T$, respectively. The measuring error of $ST_i$ can also be estimated as:

$$\varepsilon T_i = \frac{\sum \varepsilon}{N(ST_i)} \tag{10}$$

where $\varepsilon T_i$ is the measuring error of $Ts_i$, and $\sum \varepsilon$ is the sum of the measuring errors of the tracking points in $ST_i$, $i = 1, 2, \ldots, l$. The results of $Ts_i$, $\varepsilon T_i$ and $Per_i$ are recorded as $RST_i = \langle Ts_i, \varepsilon T_i, Per_i \rangle$ and regarded as the prior knowledge of RGCPK.

3) The Principle of RGCPK: The prior knowledge includes $Ts$ and $Per$ derived from $RST_j = \langle Ts_j, \varepsilon T_j, Per_j \rangle$, $j = 1, 2, \ldots, l$. $Ts$ is regarded as the threshold of clustering and $Per$ is used to select high-precision data from the entire set of clusters. The key steps of RGCPK are as follows:

Step 1: Initialize the cluster label of all trajectory vectors as un-clustered and the current cluster label $C_{CL} = 0$.
Step 2: Randomly select a trajectory vector labeled un-clustered from the trajectory as the seed trajectory vector $v_s$ and label it as $CL(v_s) = C_{CL}$, as shown in Fig. 5(a).
Step 3: Search the adjacent trajectory vector of $v_s$, denoted by $v_{sn}$. $v_s$ and $v_{sn}$ are merged into one cluster and labeled as $CL(v_{sn}) = C_{CL}$ if they satisfy $Sim(v_s, v_{sn}) > Ts$, where $Sim(v_s, v_{sn})$ is the similarity between $v_s$ and $v_{sn}$, and $Ts$ is the similarity threshold.
Step 4: Take $v_{sn}$ as the seed trajectory vector $v_s$ and return to Step 3, as shown in Fig. 5(b).
Step 5: One cluster is acquired when the seed trajectory vector $v_s$ cannot be merged with its adjacent trajectory vectors $v_{sn}$. Then let $C_{CL} = C_{CL} + 1$ and return to Step 2.
Step 6: Repeat Steps 2–5 until all the trajectory vectors in the trajectory are labeled, as shown in Fig. 5(c).
Step 7: Treat the clusters as trajectory vectors and further merge the clusters using these steps. The final result of clustering is shown in Fig. 5(d).
Step 8: After clustering, $Per$ is regarded as the selectivity for data selection. The proportion of the tracking points of each cluster relative to the total is computed. All clusters are sorted in descending order according to their proportion, and the proportions are summed from the first value until the accumulated value reaches $Per$. The aggregated clusters are then regarded as the high-precision data and selected from all clusters, as shown in Fig. 5(e).

C. Lane-Based Road Information Collection

1) The Principle of Number and Locations of Traffic Lanes Detection: We fit a Constrained Gaussian Mixture Model (CGMM) to perpendicular cross sections of the traces across the road, based on the assumption that GPS trajectories tend to cluster near the center of each lane with some spread due to GPS noise and other vagaries. The CGMM can be defined as:

$$p(x) = \sum_{j=1}^{k} w_j \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right) \tag{11}$$
where $k$ is the number of Gaussian components; each component corresponds to one lane, providing an automatic lane count. $w_1, \ldots, w_k$ are the weights of the components, corresponding to the relative traffic volume in each lane. The weights have to be positive and normalized, that is, $w_j > 0$, $j = 1, 2, \ldots, k$, and $w_1 + w_2 + \cdots + w_k = 1$. The parameters $\mu_1, \ldots, \mu_k$ are the means of the trajectories for each component and correspond to the centerline of each lane. $\sigma$ is the standard deviation of the trajectories for each component and is set to the same value for all components because adjacent lanes usually have the same width. The expectation-maximization (EM) algorithm is used to infer the unknown parameters $\theta_j^{(m)} = (w_j^{(m)}, \mu_j^{(m)}, \sigma^{(m)})$, where $m$ is the number of iterations. The concrete procedure for obtaining the parameters $\theta_j$ follows [4].

Fig. 5. The basic principle of RGCPK.

Fig. 6. The raw tracking points. (a) shows the real image data of the intersection and (b) shows the raw GPS records around it.
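To make the EM fit of Eq. (11) concrete, the sketch below runs plain EM on a one-dimensional Gaussian mixture whose components share a single $\sigma$. It is an illustration under stated assumptions only: the additional constraints and exact update rules of [4] are not reproduced, and the function name, the evenly spaced initialization, and the stopping tolerance are hypothetical.

```python
import math

def fit_tied_gmm_1d(xs, k, n_iter=200, tol=1e-9):
    """EM for a 1-D Gaussian mixture with k components and one shared sigma."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    # Assumed initialization: means evenly spaced over the data range.
    mu = [lo + (hi - lo) * (j + 1) / (k + 1) for j in range(k)]
    w = [1.0 / k] * k
    sigma = max((hi - lo) / (2.0 * k), 1e-6)
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp, ll = [], 0.0
        norm = sigma * math.sqrt(2.0 * math.pi)
        for x in xs:
            dens = [w[j] * math.exp(-0.5 * ((x - mu[j]) / sigma) ** 2) / norm
                    for j in range(k)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
            ll += math.log(total)
        # M-step: update weights, means, and the shared sigma.
        nk = [max(sum(r[j] for r in resp), 1e-12) for j in range(k)]
        w = [nk[j] / n for j in range(k)]
        mu = [sum(r[j] * x for r, x in zip(resp, xs)) / nk[j] for j in range(k)]
        var = sum(r[j] * (x - mu[j]) ** 2
                  for r, x in zip(resp, xs) for j in range(k)) / n
        sigma = max(math.sqrt(var), 1e-6)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return w, mu, sigma

# Synthetic lateral offsets around three lane centers 3.5 m apart (made-up data).
offsets = [c + d for c in (0.0, 3.5, 7.0)
           for d in (-0.5, -0.25, 0.0, 0.25, 0.5)] * 20
weights, centers, shared_sigma = fit_tied_gmm_1d(offsets, 3)
```

In this sketch the fitted means play the role of the lane centerlines and the weights approximate the relative traffic volume per lane; sharing one $\sigma$ mirrors the equal-lane-width assumption stated above.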
The key task in the CGMM model is selecting the number of components. A common practice is to estimate $\theta_k = \{w_i, \mu_i, \sigma_i\}_{i=1}^{k}$ for a set of candidate values of $k$, and then select the $k$ that minimizes the following function:

$$R_{srm}(p(x_i \mid \theta_k)) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, p(x_i \mid \theta_k)) + \lambda J(p(x_i \mid \theta_k)) \tag{12}$$

$$k^{*} = \arg\min_{k} R_{srm}(p(x_i \mid \theta_k)) \tag{13}$$

$$L(x_i, p(x_i \mid \theta_k)) = -\log(p(x_i \mid \theta_k)) \tag{14}$$

where $R_{srm}(p(x_i \mid \theta_k))$ is the structural risk model (structural risk minimization, SRM), $L(x_i, p(x_i \mid \theta_k))$ is the empirical risk used to evaluate the goodness of fit, $J(p(x_i \mid \theta_k))$ is a regularization term that penalizes complex models, and $\lambda > 0$ is the regularization parameter. Equation (12) expresses a trade-off between model fitness and model complexity in order to achieve good generalization.

The authors of [4] proposed a new regularizer, $R_{LS}$, based on the relation between the number of lanes and the spread of trajectories. Their test results show its advantages over other methods such as the Akaike information criterion ($R_{AIC}$) and the Bayesian information criterion ($R_{BIC}$).

However, low-end GPS devices and urban canyons mean that GPS tracking points deviate from their original positions. As shown in Fig. 6(b), many tracking points deviate from their original positions, and their proportion of the total is far more than 5%. So the regularizer $R_{LS}$ in [4] is not suitable for confirming the number of lanes from GPS data at 10–15 m accuracy.

In this paper, we present a new regularizer $J(p(x_i \mid \theta_k))$ for confirming the number of lanes, as shown in equation (15).

$$J(p(x_i \mid \theta_k)) = J_{TSW}(p(x_i \mid \theta_k)) = [\ldots] \tag{15}$$

$$[\ldots] \tag{16}$$

(The right-hand side of (15) and the whole of (16) are not recoverable from the source text.)

where $Dw$ is the total spread width of the optimized trajectories, the value of $(Dw/k)$ refers to the lane width, and $\Delta\mu_k$ is the change between two adjacent $\mu_j$, which equals the detected lane width, $j = 1, 2, \ldots, k$. Equation (15) indicates the degree of consistency between $(Dw/k)$ and $\Delta\mu_k$. Equation (16) shows the
calculation of […]. The reasoning process and explanation of the parameters are as in [4]. According to (11), the parameter $\mu_i$ reflects the location of each lane, as shown in Fig. 7.

Fig. 7. Lane centerline. (a) is the fitting result of CGMM and the location of each lane is shown in (b).

2) Identification and Optimization of the Number of Lanes: The accuracy of lane number detection has a great impact on the extraction of lane locations and turning rules. Given a set of trajectories $AT$ that start from $Intersection_1$ and end at $Intersection_2$, the number of traffic lanes is detected based on the CGMM.

As shown in Fig. 8, we choose a rectangular window to sample cross sections, and the length and width of the rectangular window are set as $rh$ and $rw$. Then we fit a CGMM to the intersections between the GPS trajectories and a sampling line perpendicular to the road centerline, confirm the number of lanes according to (12)–(16), and record the results as $Nlane_i$, $i = 1, 2, \ldots, t$.

Fig. 8. The detection of the number of lanes.

In addition, the spread width $Dw$ of the optimized trajectories in each sampling cross section is acquired by the Width Detection Algorithm. The direction of the road centerline and the sampling line are set as the horizontal axis and the longitudinal axis, respectively. The intersection between the sampling line and the road centerline is set as the origin, as shown in Fig. 9. The width $w$ of the sliding window can be set to any value within the scope of the desired precision, and the length of the sliding window is the same as the length of the rectangular window.

Fig. 9. Width detection method.

The details of the method are as follows:

    Coordinate origin: Ro_i, i = 1, 2, ..., t
    Horizontal axis: the direction of the road centerline
    Longitudinal axis: Uy_i = 0; Dy_i = 0
    Sliding window: length = rh; width = w; Proportion = 0
    /* Assignment */
    for each sampling cross section do
        repeat
            move the sliding window along the positive direction and the
            negative direction of the longitudinal axis and accumulate the
            Proportion (Proportion = number of points currently in the
            sliding window / all points in the current sampling cross section)
        until Proportion == 100%
        set Dw_i = max|Uy_i| + max|Dy_i|
        set coordinate origin to Ro_(i+1); Uy_(i+1) = 0; Dy_(i+1) = 0
    end for

Fig. 10. The construction rule of road. Part1 and Part2 are the most likely parts of road to add lanes.

In most cases, the value of $Nlane_i$ between two intersections remains the same except where lanes are added at an intersection, as shown in Fig. 10. But some incorrect classifications of the number of lanes still exist due to the effects of uncertain traffic flows in each lane or inaccuracies in the GPS data.
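The width detection procedure listed above can be sketched in Python for a single sampling cross section. This is an illustrative sketch, not the authors' implementation: it assumes each tracking point has already been reduced to a signed lateral offset (in meters) from the road centerline, and the function name and sample values are hypothetical.

```python
def spread_width(offsets, w):
    """Estimate Dw for one cross section by sliding a window of width w
    outward from the centerline in both directions until every point
    has been covered (Proportion == 100%)."""
    total = len(offsets)
    if total == 0:
        return 0.0
    covered = 0
    step = 0
    up_extent = 0.0    # largest |Uy| reached on the positive side
    down_extent = 0.0  # largest |Dy| reached on the negative side
    while covered < total:
        lo, hi = step * w, (step + 1) * w
        # Positive side (including points exactly on the centerline).
        n_up = sum(1 for y in offsets if y >= 0 and lo <= y < hi)
        # Negative side, measured by absolute offset.
        n_down = sum(1 for y in offsets if y < 0 and lo < -y <= hi)
        if n_up:
            up_extent = hi
        if n_down:
            down_extent = hi
        covered += n_up + n_down
        step += 1
    return up_extent + down_extent

# Made-up lateral offsets (m) of tracking points in one cross section.
sample_offsets = [-6.2, -1.0, 0.5, 3.4, 7.9]
dw = spread_width(sample_offsets, 0.5)
```

The result is quantized to the window width `w`, which is why the text notes that `w` only needs to lie within the desired precision.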
Thus, we present a method to optimize the results of lane number extraction, as follows.

First, $Nlane_{i+1}$ is compared with $Nlane_i$ and $Nlane_{i+2}$: $Nlane_{i+1}$ is replaced by $Nlane_i$ when $Nlane_i$ and $Nlane_{i+1}$ are different, and $Nlane_i$ and $Nlane_{i+2}$ are the same.

Secondly, the results from the first step are clustered according to the value of $Nlane_i$ and their arrangement; for instance, $Nlane_e, Nlane_{e+1}, Nlane_{e+2}, \ldots, Nlane_{e+c}$ are clustered when their values are the same, $e < t$, $e + c < t$. Each cluster also corresponds to a number of lanes. Assume there are $s$ clusters, recorded as $C_j = \langle Nl_j, nc_j \rangle$, where $Nl_j$ is the number of lanes of cluster $C_j$ and $nc_j$ is the total number of $Nlane_i$ that belong to $C_j$, $j = 1, 2, \ldots, s$.

Finally, $C_{j+1}$ is compared with $C_j$: $Nl_{j+1}$ of $C_{j+1}$ is replaced by $Nl_j$ of $C_j$ when $Nl_{j+1}$ and $Nl_j$ are different and $nc_{j+1} < cv$, where $cv$ is a constraint value that depends largely on the road construction rules.

IV. EXPERIMENTS

CLRIC includes two steps: high-precision data selection and lane-based information extraction. Thus, to evaluate the performance of CLRIC, we used two different types of data sets: a training data set of ten shuttle vehicle traces and a test data set of thousands of taxi GPS traces. Both data sets were collected in urban areas. The training data set was used to extract prior knowledge for high-precision data selection and to verify the performance of the region growing clustering with prior knowledge method. The test data set was used as the main data source for lane-based information extraction. The two data sets are introduced as follows.

The training data set was collected by shuttle vehicles. Each shuttle vehicle was equipped with a GPS logger and an Inertial Measurement Unit (IMU) that recorded two kinds of traces: GPS traces based on the GPS single-point positioning technique, and synchronized DGPS traces based on differential global positioning technology. The positional accuracies of the GPS and DGPS data in urban areas were about 10–15 m and 0.5 m respectively. The sampling interval for the training data set was 1 s. The data collection period for the shuttle vehicles was seven days. We obtained about 40 thousand GPS and DGPS points, shown in Fig. 11. The prior knowledge for high-precision data selection from the crowdsourcing data was extracted from part of the training data set by analyzing the similarity of the DGPS data and its synchronized GPS data. The remaining training trajectory data were used to evaluate the performance of RGCPK.

Fig. 11. The collection of DGPS trajectories and synchronized GPS trajectories. (a) shows the driving region of the shuttle vehicles; (b) is a magnification of (a) in which the black points and blue points represent GPS data and DGPS data respectively.

Fig. 12. Crowdsourcing data collected by taxis. (a) indicates the road network for taxi driving and (b) shows the raw trajectories collected by taxis.

The test data set was collected by thousands of taxis based on the point positioning technique in Wuhan; the GPS devices were placed at the center of the taxi roofs. The sampling interval of the taxi traces ranged from 10 s to 20 s, while their positioning accuracy ranged from 10 m to 15 m in urban areas. Each taxi recorded traces for an average of 14 days. We collected about 200 billion GPS points, shown in Fig. 12. According to each tracking point location and heading direction, we got about
1000 trajectory segments based on the clustering and partitioning method [31], and the number of trajectories in each trajectory segment ranges from 100 to 1000. Those trajectory segments represent one-way roads that start from one intersection and end at another intersection.

Fig. 13. A part of the training data set. Red marks and blue marks represent GPS data and its high-precision synchronized DGPS data.

A. Region Growing Clustering With Prior Knowledge (RGCPK) for High-Precision Data Selection

RGCPK is the first step in the CLRIC system. We took a part of the trajectories from the training data set as samples, as shown in Fig. 13. The weights in the similarity evaluation model are estimated by (7) and (8), as shown in Table I.

The similarity degree of each tracking point is estimated according to the similarity evaluation model of trajectory vectors. We analyzed the measured value of error ($\varepsilon T$/m) and the proportion of the data set ($Per_i$) under different similarity thresholds ($Ts$). The similarity threshold ($Ts$), measurement error ($\varepsilon T$/m), and percentage of the total data set it represents ($Per$) are shown in Table II.

In Table II, $Ts$, $\varepsilon T$ and $Per$ represent the similarity threshold, measurement error, and selectivity, respectively; $\varepsilon T$ also indicates the expected quality for data selection.

We picked 0.71 as the clustering threshold when our expected quality was 3 m, and 55.4% is regarded as the selectivity for data selection from all clusters based on RGCPK. Fig. 14 shows the results of RGCPK for the test data set and a part of the training data set. The red solid points and black empty circles represent the high-precision data and the outliers, respectively.

B. Detection of the Number and Locations of Traffic Lanes

The initial estimates $w_j^{(0)}, \mu_j^{(0)}, \sigma^{(0)}$ ($j = 1, \ldots, k$) are assigned as follows: $k = 2, 3, 4, 5$ because there are four types of traffic lanes designated by the road construction standards in China; $\mu_1^{(0)}, \ldots, \mu_k^{(0)}$ are the centerlines for each possible lane and are acquired by segmenting the road from the road centerline at equal intervals (set to 3.5 m); $\sigma^{(0)}$ is the half width of a lane and […] equation (16) are recommended by [4]. The $rh$ and $rw$ of the rectangular window were set to 5 m and 50 m based on the road construction standards: the width of a one-way road is smaller than 50 m in China, and adding a lane on a road generally occurs within 50 m of an intersection.

Given a set of optimized trajectories $S_T$ that start from one intersection and end at another intersection, with 120 thousand tracking points, the number and locations of lanes were detected as follows:

1) The Number of Traffic Lanes Detected Using Optimized Constraint Gaussian Mixture Model (CGMM): As shown in Fig. 15, we trained the CGMM with $k = 2, 3, 4, 5$ on a sample $s_1$ from $S_T$ containing 4,000 data points. The densities of those CGMMs and their individual components, multiplied by the number of data points, are shown in (a), (b), (c) and (d), respectively. The number of lanes of $s_1$ was identified using equations (12)–(16).

The number of lanes of $s_1$ is 4 according to the computational results. Then we fit CGMMs with $k = 2, 3, 4, 5$ to each sample from $S_T$ and calculated the number of lanes of all samples, as shown in Fig. 16. The true value of the number of lanes of $S_T$ was obtained by field observation. From Fig. 16, we see that incorrect lane number identifications exist in the results.

2) Result Optimization: The process of result optimization was used to improve the accuracy of lane number identification. As shown in Fig. 17, incorrect identifications are re-identified based on the result optimization algorithm. The constraint value $cv$ is set to 3 due to the road construction standard, which stipulates that adding lanes happens within 50 m of an intersection and at bus stops, as the length of a bus stop is more than 10 m.

As a complement to the lane number measures, Fig. 18 presents a visualization of the location of each lane of $S_T$, where red lines depict the centerline of each lane.

From Fig. 18, we see that the accuracy of the location of each lane depends largely on the accuracy of lane number identification. Incorrect identifications of the number of lanes lead to wrong lane locations. Misidentifications of the number of lanes seen in Fig. 18 also indicate that our method has difficulty dealing with data sets from complex intersections; that have different traffic flows between adjacent lanes caused by the traffic lights; or where there are driving
set as 1.75 m according to road construction standards in China. restrictions. In addition, the centerlines of each lane on a one-
Values of regularization parameter and other parameters in way road cannot be connected without gaps using our method.
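The initialization and the comparison of k = 2, 3, 4, 5 described above can be illustrated with a minimal sketch. It is not the optimized CGMM of this paper: it runs plain EM on a one-dimensional Gaussian mixture with one shared σ, and it substitutes the Bayesian information criterion (BIC) for equations (12)-(16), which are not reproduced here. The function names, the equal initial weights, the placement of the first centerline half a lane from the road edge, and the synthetic offsets are all assumptions made for illustration; only the 3.5 m interval and the 1.75 m half width come from the text.

```python
import numpy as np

LANE_SPACING = 3.5   # m, interval between initial lane centerlines (from the text)
HALF_WIDTH = 1.75    # m, initial sigma, half of a lane width (from the text)


def initial_estimates(k):
    """Initial means, weights, and shared sigma for a k-lane model.

    Offsets are measured from the road edge, so the first centerline sits
    half a lane in (an assumption). Equal weights are also an assumption.
    """
    mu = HALF_WIDTH + LANE_SPACING * np.arange(k)
    w = np.full(k, 1.0 / k)
    return mu, w, HALF_WIDTH


def log_likelihood(x, mu, w, sigma):
    """Log-likelihood of offsets x under a shared-sigma Gaussian mixture."""
    z = (x[:, None] - mu[None, :]) / sigma
    log_p = np.log(w) - 0.5 * z**2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    m = log_p.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum())


def fit_gmm(x, k, n_iter=200):
    """Plain EM for a 1-D mixture of k Gaussians with one shared sigma."""
    mu, w, sigma = initial_estimates(k)
    for _ in range(n_iter):
        # E-step: responsibility of each lane for each point
        z = (x[:, None] - mu[None, :]) / sigma
        log_r = np.log(w) - 0.5 * z**2
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and the shared sigma
        nk = r.sum(axis=0) + 1e-9
        w = nk / nk.sum()
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu[None, :]) ** 2).sum() / len(x))
    return mu, w, sigma


def select_k(x, candidates=(2, 3, 4, 5)):
    """Pick the lane count with the lowest BIC (a stand-in criterion)."""
    scores = {}
    for k in candidates:
        mu, w, sigma = fit_gmm(x, k)
        n_params = 2 * k  # k means + (k - 1) weights + 1 sigma
        ll = log_likelihood(x, mu, w, sigma)
        scores[k] = -2.0 * ll + n_params * np.log(len(x))
    return min(scores, key=scores.get)


# Synthetic lateral offsets for a four-lane road (made-up data, not from the paper)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(c, 0.6, 1000) for c in initial_estimates(4)[0]])
print("selected lane count:", select_k(x))
```

Sharing one σ across components keeps any single lane from collapsing onto a few points, which is one reason a constrained mixture is attractive for noisy GPS offsets; the fitted means then serve as lane centerline estimates.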


Fig. 14. High-precision data selection: (a) shows the selection result for the test data set, and (b) shows the selection result for part of the training data set. The red-solid points and black-empty circles represent outliers and high-precision data.

Fig. 15. CGMM results. (a), (b), (c) and (d) indicate the overview of k = 2, k = 3, k = 4, k = 5, respectively.

Fig. 16. The lane detection results. The black-solid points and red-empty circles represent the true value and detection results for the number of lanes.

Fig. 17. The optimized results of the number of lanes detection. The blue-empty circles and black-solid points represent the optimized results and truth value of the number of lanes.
C. Quantitative Evaluation

1) The Performance of Region Growing Clustering With Prior Knowledge Method (RGCPK): To evaluate the performance of the proposed RGCPK, we applied it to two data sets, seen in Fig. 19. The first, taken from the training data set, was used to estimate the accuracy of the selected data; the other was used to assess the performance of RGCPK for lane number identification.

The measurement errors of the selected data points were computed against synchronized DGPS data. The results show that the position accuracy of the selected data reaches 3.02 ± 1.2 m, where 3.02 m is the average value and 1.2 m is the standard deviation.

For the test data set, we could not estimate the position accuracy of the selected data because no high-precision synchronized DGPS data were available. Thus, the performance of RGCPK for crowdsourcing data was evaluated by comparing it with the



accuracy of the lane number identification in the test data set before and after optimization. Based on this, the accuracy of lane number identification for data selected from the crowdsourcing data is 85.2%, whereas the accuracy for the raw test data set reaches only 42.3%.

Fig. 18. The locations of each lane, where the yellow line depicts the centerline of each lane and the white line shows the incorrect identification of lane

Fig. 19. The results of RGCPK using data from the training data set, where red-empty circles depict the selected data marks and blue points show the synchronized DGPS data points of those selected data points.

2) Quantitative Evaluation for Lane Number Identification: The quantitative evaluation of our lane number identification was done by comparing it to human-interpreted results. This comparison shows that the proposed method CLRIC achieved good performance in extracting lane numbers, with an overall accuracy of 85.2%; however, there was also a 14.8% chance of incorrectly identifying the number of lanes. In urban areas, it is challenging to mine lane-based information from low-precision crowdsourcing data, especially for roads with complex intersections, viaducts, or tunnels. In summary, the reasons for incorrect identification include the following: our method cannot distinguish trajectories on roads above and below viaducts because the experimental data contain no elevation information, so the lane information for overlapping roads in the study area was misclassified; the number of lanes can be missed because of GPS signal loss in tunnels; and our method has difficulty with data sets from complex intersections combined with viaducts.

In addition, to evaluate the performance of CLRIC, we presented a qualitative comparison of lane number identification based on CLRIC, the Constraint Gaussian Mixture Model (CGMM) [4], and Kernel Density Estimation (KDE) [29]. The results of the lane number identification comparison using the test data set are shown in Table III. According to these results, the CLRIC system, which employs region growing clustering with prior knowledge and an optimized constraint Gaussian mixture model, achieved the highest prediction precision of the three models. The comparison experiment also shows that the methods in [4] and [29] rely too much on positional accuracy and sampling rate.

3) Evaluation for the Location of Lane Extraction: To evaluate the location of each extracted lane, we randomly selected about 200 trajectory segments from the test data set as sample data and compared the detected lane width of those samples with the actual lane width. The actual lane width of those samples was obtained by field measurement. Based on the measurement results, there are two basic types of lane width in the sample data: 3.75 m (type1) and 3.5 m (type2). Fig. 20 shows that most detected lane widths are higher than the actual value for both type1 and type2. The standard deviation of the detected lane width for type1 and type2 is 0.1904 and 0.2337, respectively. The average difference between the detected and true lane width for type1 and type2 is 0.3564 m and 0.4585 m, respectively, which indicates that the lane locations extracted for type1 are slightly better than those for type2, but neither is fully consistent with the actual lane width.

Fig. 20. Lane width comparison between detecting value and actual value.

These differences between the detected and actual values for the different types of lane width are caused by urban canyon streets, which make the positional accuracy of GPS data differ between areas. In urban areas, the positional accuracy of GPS data collected in streets with low-rise buildings is better than that of data collected in streets with high-rise buildings. We will study this problem in depth in future work.

V. CONCLUSION

In this paper, we propose CLRIC, an automated method to extract lane-based information, such as the number and locations of traffic lanes on a road, via crowdsourcing. CLRIC filters the high-precision GPS data from the raw trajectories using region growing clustering with prior knowledge and mines the number and locations of traffic lanes through an optimized constrained Gaussian mixture model, which promises to be a low-cost, real-time way to collect lane-based road information. However, the proposed method still has room for improvement, and our future work will focus on the extraction of lane information in complex road environments such as tunnels and overpasses.


REFERENCES

[1] A. B. Hillel, R. Lerner, D. Levi, and G. Raz, "Recent progress in road and lane detection: A survey," Mach. Vis. Appl., vol. 25, no. 3, pp. 727-745, Apr. 2014.
[2] M. Thuy and F. León, "Lane detection and tracking based on lidar data," Metrol. Meas. Syst., vol. 17, no. 3, pp. 311-321, 2010.
[3] B. Yang, Z. Dong, and W. Dai, "Hierarchical extraction of urban objects from mobile laser scanning data," ISPRS J. Photogramm. Remote Sens., vol. 99, pp. 45-57, Jan. 2015.
[4] Y. Chen and J. Krumm, "Probabilistic modeling of traffic lanes from GPS traces," in Proc. 18th SIGSPATIAL Int. Conf. Adv. Geographic Inf. Syst., 2010, pp. 81-88.
[5] A. G. O. Yeh et al., "Hierarchical polygonization for generating and updating lane-based road network information for navigation from road markings," Int. J. Geographical Inf. Sci., vol. 29, no. 9, pp. 1-24, 2015.
[6] B. Zhou et al., "ALIMC: Activity landmark-based indoor mapping via crowdsourcing," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 5, pp. 1-11, Oct. 2015.
[7] N. D. Lane, S. B. Eisenman, M. Musolesi, E. Miluzzo, and A. T. Campbell, "Urban sensing systems: Opportunistic or participatory?" in Proc. 9th Workshop Mobile Comput. Syst. Appl., 2008, pp. 11-16.
[8] M. Haklay and P. Weber, "OpenStreetMap: User-generated street maps," IEEE Pervasive Comput., vol. 7, no. 4, pp. 12-18, Oct.-Dec. 2008.
[9] B. Hull et al., "Cartel: A distributed mobile sensor computing system," in Proc. 4th Int. Conf. Embedded Netw. Sens. Syst., 2006, pp. 125-138.
[10] H. W. Mckenzie, C. L. Jerde, D. R. Visscher, E. H. Merrill, and M. A. Lewis, "Inferring linear feature use in the presence of GPS measurement error," Environ. Ecol. Stat., vol. 16, no. 4, pp. 531-546, 2009.
[11] J. Wang et al., "A novel approach for generating routable road maps from vehicle GPS trajectories," Int. J. Geographical Inf. Sci., vol. 29, no. 1, pp. 69-91, Jan. 2014.
[12] L. Tang, F. Huang, X. Zhang, and H. Xu, "Road network change detection based on floating car data," J. Netw., vol. 7, no. 7, pp. 1063-1070, 2012.
[13] P. Yin et al., "Mining GPS data for trajectory recommendation," in Advances in Knowledge Discovery and Data Mining. New York, NY, USA: Springer-Verlag, 2014, pp. 50-61.
[14] C. de Fabritiis, R. Ragona, and G. Valenti, "Traffic estimation and prediction based on real time floating car data," in Proc. IEEE 11th Int. ITSC, 2008, pp. 197-203.
[15] L. Tang, X. Chang, and Q. Li, "Public travel route optimization based on ant colony optimization algorithm and taxi GPS data," Chin. J. Highway Transp., vol. 24, no. 2, pp. 89-95, 2011.
[16] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma, "Mining interesting locations and travel sequences from GPS trajectories," in Proc. Int. World Wide Web Conf., 2009, pp. 791-800.
[17] D. Sun et al., "Urban travel behavior analyses and route prediction based on floating car data," Transp. Lett. Int. J. Transp. Res., vol. 6, no. 3, pp. 118-125, Jul. 2014.
[18] W. C. Lee and J. Krumm, "Trajectory preprocessing," in Computing With Spatial Trajectories. New York, NY, USA: Springer-Verlag, 2011, pp. 3-33.
[19] S. Brakatsoulas, D. Pfoser, R. Salas, and C. Wenk, "On map-matching vehicle tracking data," in Proc. 31st Int. Conf. Very Large Data Bases, 2005, pp. 853-864.
[20] Y. Yanagisawa, J. Akahani, and T. Satoh, "Shape-based similarity query for trajectory of mobile objects," in Proc. 4th Int. Conf. Mobile Data Manage., Melbourne, Vic., Australia, Jan. 21-24, 2003, pp. 63-77.
[21] R. Bruntrup, S. Edelkamp, S. Jabbar, and B. Scholz, "Incremental map generation with GPS traces," in Proc. IEEE Intell. Transp. Syst., Sep. 13-15, 2005, pp. 574-579.
[22] J. Li, Q. Qin, C. Xie, and Y. Zhao, "Integrated use of spatial and semantic relationships for extracting road networks from floating car data," Int. J. Appl. Earth Observ. Geoinf., vol. 19, no. 10, pp. 238-247, 2012.
[23] A. Fathi and J. Krumm, "Detecting road intersections from GPS traces," in Geographic Information Science. Berlin, Germany: Springer-Verlag, 2010, pp. 56-69.
[24] G. Agamennoni, J. Nieto, and E. M. Nebot, "Robust inference of principal road paths for intelligent transportation systems," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 1, pp. 298-308, Mar. 2011.
[25] L. Cao and J. Krumm, "From GPS traces to a routable road map," in Proc. 17th ACM SIGSPATIAL Int. Conf. Adv. Geographic Inf. Syst., 2009, pp. 3-12.
[26] S. Rogers, P. Langley, and C. Wilson, "Mining GPS data to augment road models," in Proc. 5th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA, 1999, pp. 104-113.
[27] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl, "Constrained k-means clustering with background knowledge," in Proc. 18th ICML, San Francisco, CA, USA, 2001, pp. 577-584.
[28] S. Edelkamp and S. Schrödl, "Route planning and map inference with global positioning trajectories," Comput. Sci. Perspective, vol. 2598, pp. 128-151, 2003.
[29] A. Uduwaragoda, A. S. Perera, and S. A. D. Dias, "Generating lane level road data from vehicle trajectories using kernel density estimation," in Proc. IEEE 16th Int. Annu. ITSC, Oct. 6-9, 2013, pp. 384-391.
[30] X. Liu et al., "Road recognition using coarse-grained vehicular traces," HP Lab., Palo Alto, CA, USA, 2012, pp. 1-10.
[31] J. G. Lee and J. Han, "Trajectory clustering: A partition-and-group framework," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2007, pp. 593-604.

Luliang Tang received the Ph.D. degree from Wuhan University, Wuhan, China, in 2007. He is currently a Professor with Wuhan University. His research interests include space-time GIS, GIS for transportation, and change detection.

Xue Yang received the M.Eng. degree from Wuhan University, Wuhan, China, in 2013. She is currently working toward the Ph.D. degree in the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. Her research interests include intelligent transportation system, spatiotemporal data analysis, and information mining.

Zhen Dong received the M.Eng. degree from Wuhan University, Wuhan, China, in 2013. He is currently working toward the Ph.D. degree in the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. His research interests include intelligent transportation system, computer vision, and LiDAR data processing.

Qingquan Li received the Ph.D. degree in geographic information system (GIS) and photogrammetry from Wuhan Technical University of Surveying and Mapping, Wuhan, China, in 1998. He is currently a Professor with Shenzhen University, Guangdong, China, and Wuhan University, Wuhan. His research areas include dynamic data modeling in GIS, surveying engineering, and intelligent transportation system.