
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 11, NOVEMBER 2016

Ranking Highlights in Personal Videos by Analyzing Edited Videos

Min Sun, Ali Farhadi, Tseng-Hung Chen, and Steve Seitz, Fellow, IEEE

Abstract: We present a fully automatic system for ranking domain-specific highlights in unconstrained personal videos by analyzing online edited videos. A novel latent linear ranking model is proposed to handle noisy training data harvested online. Specifically, given a targeted domain such as surfing, our system mines the YouTube database to find pairs of raw videos and their corresponding edited videos. Leveraging the assumption that an edited video is more likely to contain highlights than the trimmed parts of the raw video, we obtain pair-wise ranking constraints to train our model. The learning task is challenging due to the amount of noise and variation in the mined data. Hence, a latent loss function is incorporated to mitigate the issues caused by the noise. We efficiently learn the latent model on a large number of videos (about 870 min in total) using a novel EM-like procedure. Our latent ranking model outperforms its classification counterpart and is fairly competitive compared with a fully supervised ranking system that requires labels from Amazon Mechanical Turk. We further show that a state-of-the-art audio feature (mel-frequency cepstral coefficients) is inferior to a state-of-the-art visual feature. By combining both audio-visual features, we obtain the best performance in the dog activity, surfing, skating, and viral video domains. Finally, we show that impressive highlights can be detected without additional human supervision for seven domains (i.e., skating, surfing, skiing, gymnastics, parkour, dog activity, and viral video) in unconstrained personal videos.

Index Terms: Video summarization, video highlight detection.

Manuscript received September 21, 2015; revised April 27, 2016 and August 7, 2016; accepted August 8, 2016. Date of publication August 17, 2016; date of current version September 13, 2016. This work was supported in part by MOST under Grant 103-2218-E-007-025, in part by Microsoft, in part by Google, in part by Intel, in part by NSF under Grant IIS-1218683, in part by ONR under Grant N00014-13-1-0720, and in part by ONR MURI under Grant N00014-10-1-0934. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Janusz Konrad. M. Sun and T.-H. Chen are with National Tsing Hua University, Hsinchu 300, Taiwan (e-mail: sunmin@ee.nthu.edu.tw). A. Farhadi and S. Seitz are with the University of Washington, Seattle, WA, USA. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2016.2601147

I. INTRODUCTION

THANKS to the wide usage of mobile devices, we increasingly share large amounts of videos, which are typically captured in a casual and unplanned way. This trend is likely to accelerate with more wearable devices like GoPro, HTC Re, etc. becoming available on the consumer market. On YouTube alone, 400 hours of video are uploaded every minute. More impressively, a recent trendy video streaming service, Periscope, claims that 40 years of videos are watched per day [3]. However, among this incredible amount of videos, most are not fun to watch; the best videos have usually been carefully and manually edited to feature the highlights and trim out the boring segments. As a result, most unedited videos are only watched once and become the digital dark matter [4] of the web.

One way to solve this problem is to develop computer algorithms to do the editing for us, i.e., we'd provide raw video footage, and out would pop a high quality edited video. Indeed, both Google and Facebook recently released products that seek to achieve similar goals. Google's Auto-Awesome movie feature generates a video summary from all the footage of an event (complete with filters and background music!). However, Auto-Awesome works best when the event videos are short and contain only highlights. It is not clear how it can handle raw personal videos typically a few minutes long. Facebook's new Look Back feature provides similar functionality, but focuses on photos and your most popular posts instead of videos. These applications motivate the importance of research in automatic video editing.

As a step towards this goal, we address the problem of detecting domain-specific highlights in raw videos (see Fig. 1(a)). Many prior works [5]-[12] have explored the highlight selection problem in limited domains (e.g., baseball games, cricket games, etc.). Recently, many approaches [13]-[15] have been proposed to handle unconstrained personal videos. However, most methods require large amounts of human-crafted training data. Since the definition of a highlight is highly dependent on the domain of interest (e.g., blowing out the candles on a birthday cake, a ski jump, raising glasses in a toast), it is not clear that these techniques are scalable to handle video content from all domains.

Instead, we ask a crucial question: can we learn to detect highlights by analyzing how users (with domain knowledge) edit videos? There is a wealth of edited video content on YouTube, along with the raw source material. This content captures highlights spanning a vast range of different activities and actions. Furthermore, we show that it is possible to identify the mapping of raw source material to edited highlights, leading to a wealth of training data. In this work, we introduce (1) a novel system to automatically harvest domain-specific information from YouTube, and (2) a novel latent ranking model which is trained with the harvested noisy data (see Fig. 1(a)). Leveraging the assumption that an edited video is more likely to contain highlights than the trimmed parts of the raw video, we formulate the highlight selection problem as a pair-wise ranking problem between short video clips in the raw video (see rank constraints in Fig. 1(b)). We introduce latent variables into the ranking model to accommodate variation of highlight selection across different users.

Fig. 1. System overview: Given surfing videos, we train a latent linear ranking SVM to predict the h-factors fully automatically (Panel (a)). Our system automatically harvests training data online by mining raw and edited videos on YouTube related to surfing. The raw and edited pair of videos gives us pair-wise rank constraints, as shown in Panel (b). Note that the harvested data is noisy due to the variation of highlights selected by the users on YouTube (Panel (c)).

For instance, user A might select a very long duration clip as a highlight, whereas user B prefers shorter clips (see Fig. 1(c)). We use a novel EM-like procedure to learn the latent ranking model.

Our approach has several advantages. First, our latent ranking model consistently outperforms its classification counterpart (see Fig. 4). Second, it can be efficiently trained on a large number of videos by taking advantage of a newly developed solver [16] for the linear ranking SVM. Third, our latent model nicely takes care of the noise in our automatically harvested training data in six out of seven domains (see Fig. 5). Finally, we demonstrate that results using automatically harvested YouTube data are fairly competitive as compared to those obtained from a fully supervised ranking approach trained on specially constructed training sets commissioned on Amazon Mechanical Turk (see Fig. 6). Hence, we demonstrate state-of-the-art weakly-supervised performance, while achieving much greater scalability (avoiding the need to manually construct new annotated datasets for training).

A preliminary version of the paper is described in [17]. In this paper, we update the performance by including second-order information in the Fisher vector representation. As a result, the visual feature dimension increases to 85200. Using this more powerful visual feature, our latent ranking model achieves a better accuracy than the one reported in [17]. In addition to the six domains reported in [17], a new viral video domain is added. The new viral video domain contributes about 38% of the total number of videos in our dataset. We also conduct additional sensitivity analysis with respect to the hyper-parameter of the classification and ranking SVM. Finally, we include additional audio cues in our system. Our experiments show that the visual feature alone is powerful and confirm that the audio feature alone achieves significantly worse results. By combining both audio-visual features, we obtain the best performance in the dog activity, surfing, skating, and viral video domains.

II. RELATED WORK

Our work is highly related to video summarization. There is a large literature on video summarization (see review [18]), including techniques based on keyframes [19]-[22]. In the following, we focus on subjects most relevant to our work.

A. Video Summarization

With the unprecedented growth of online videos, many researchers have recently focused on summarizing casually captured personal videos. Gygli et al. [13] propose a supervised training approach to detect interesting moments in a personal video. Potapov et al. [15] also propose a supervised training approach for detecting highlights in each specific domain. Gygli et al. [14] further propose to optimize multiple objectives for finding both interesting and representative moments. However, all these methods require collecting videos with manually labeled interesting moments for training. In contrast, our method can be trained on a large amount of automatically harvested personal videos. Li et al. [23] also propose to optimize multiple objectives, but focus on transfer learning from text. There are also a few unsupervised approaches. Zhao and Xing [24] propose a fully unsupervised approach using sparse coding to summarize a video in quasi-real-time. Yang et al. [25] propose a recurrent auto-encoder to extract video highlights in an unsupervised manner and achieve impressive results. Finally, there are a few works with additional assumptions. Chu et al. [26] propose a co-summarization approach which leverages a large number of edited videos in the same domain as reference videos. Sun et al. [27] propose to summarize unconstrained personal videos into a ranked list of image composites. However, the proposed method relies on a robust human detector to summarize only human activities. Song et al. [28] propose to utilize freely available video titles to guide the video summarization process. Similarly, Liu et al. [29] propose to select video thumbnails according to both visual and side information (e.g., the video title). However, these approaches require videos to be associated with high quality video titles.

Many methods have been recently proposed for summarizing a video with a known type of content. Given an ego-centric video, we know hands, objects, and faces are important cues. References [30], [31] propose to summarize a video according to discovered interesting objects and faces. Kopf et al. [32] focus on removing camera shake in ego-centric videos and propose to convert a long ego-centric video into a short and significantly stabilized time-lapse video called a hyper-lapse.

Arev et al. [33] propose a method to summarize multiple videos capturing the same event by analyzing camera viewing directions. Xu et al. [34] recently propose to utilize gaze tracking information (such as fixation and saccade) to summarize ego-centric videos. Similarly, given a video uploaded to e-commerce websites for selling cars and trucks, Khosla et al. [35] propose to use web-image priors (i.e., canonical viewpoints of cars and trucks online) to select frames to summarize the video. Our method is another step in this direction. However, our proposed system is not restricted to handling only ego-centric videos, where cues from hands and objects are easier to extract, nor does it rely on discovered canonical viewpoints, which are shown to have generalization issues in other domains such as cooking [35]. Instead, our method is feature independent, and we directly harvest information about how users select domain-specific highlights to create their own edited videos.

B. Sports Video Analysis

Highlight detection in broadcast sports videos has attracted several researchers [5]-[12] due to the popularity of such videos. Compared to other video types, such as personal videos, broadcast sports videos have well defined structure and rules. A long sports game often can be divided into parts, and only a few of these parts contain certain well defined highlights. For example, common highlights are the score event in soccer games, the hit event in baseball games, and the bucket event in basketball games. Due to this well defined structure, specifically designed mid-level and high-level audio-visual features, such as player trajectories, crowds, audience cheering, goal or score events, etc., are used in many methods. One exception is [12], which uses easy-to-extract low-level visual features. Most of the methods treat highlight detection as a binary classification task, where each part of a training video is labeled as a true highlight or not. We argue that for unconstrained personal videos, it is ambiguous for humans to label highlights with binary labels. This also imposes an extra burden on annotators. Hence, it is important for a method to scale up by naturally harvesting crowd-sourced information online.

C. Crowd-Sourced Highlight Discovery

Recently, many researchers have demonstrated the ability to discover highlights from crowd-sourced data such as Twitter. Olsen et al. show that user interaction data from an interactive TV application can be mined to detect events [36]. Hannon et al. use Twitter data in PASSEV [37], combining summaries of tweet frequency and user-specified searches for terms in tweets to generate a highlight reel. The main difference between these approaches and others in computer vision is that crowd-sourced data (users' annotations but not the video data) are always required as an input. Hence, these methods cannot work well on videos with no or very few associated crowd-sourced data. On the contrary, our method harvests both crowd-sourced and video data from YouTube for training our latent ranking model. After training, our model can be applied to rank highlights in any video.

D. Learning to Rank

Learning to rank has been an important learning technique in recent years, because of its applications to search engines and online advertisement. In computer vision, researchers have adopted the ranking approach mainly for the image retrieval task [38], [39]. Recently, Parikh and Grauman have introduced the concept of relative attributes [40], which opens up new opportunities to learn human-nameable visual attributes using a ranking function. Wang et al. [41] propose a novel deep ranking model that employs deep learning techniques to learn a similarity metric directly from images. In this work, we propose one of the first ranking models for domain-specific video highlight detection. Our domains include popular actions such as skating, surfing, etc. Our proposed latent ranking method can also be considered as a Multiple Instance Learning (MIL) problem, where the automatically harvested edited video segment corresponds to an unknown set of positive highlight moments. Similar to [38], we found that formulating domain-specific video highlight detection as a MIL ranking task outperforms a MIL classification task.

III. VIDEO HIGHLIGHTS

A video highlight is a moment (a very short video clip) of major or special interest in a video. More formally, a highlightness measure h (later referred to as the h-factor) for every moment in a video can be defined such that moments with high h are typically selected as highlights. Hence, the goal of detecting highlights is equivalent to learning a function f(x) to predict the h-factor (i.e., h = f(x)) given features x extracted from a moment in the video (see Fig. 1(a)). One straightforward way to learn f(x) is to treat it as a supervised regression problem. However, labeling the h-factor consistently across many videos is hard for humans. On the contrary, it is much easier for humans to rank pairs of moments based on their highlightness. Hence, we propose to formulate the video highlights problem as a pair-wise ranking problem, described next.

A. Pair-Wise Ranking

To establish notation, we use i as a unique index for all moments in a set of videos. Each moment is associated with a tuple (y, q, x), where y \in \mathbb{R} is the relative h-factor, q \in Q is the index of the video containing the moment, Q is a set of videos, and x is the features extracted from the moment. The set of ranking constraints corresponding to the q-th video is defined as

P_q \equiv \{(i, j) \mid q_i = q_j = q,\ y_i > y_j\},   (1)

where q_i = q_j = q ensures that only moments from the q-th video are compared to each other according to the relative h-factor y. Note that y only needs to be a relative value to express the order within each video. The full set of pair-wise ranking constraints in the dataset is defined as

P \equiv \bigcup_{q \in Q} P_q = \{(i, j) \mid q_i = q_j,\ y_i > y_j\},   (2)

where q_i = q_j ensures that only moments from the same video are enforced with the ranking constraints.

Our goal is to learn a function f(x) such that

f(x_i) > f(x_j), \quad \forall (i, j) \in P,   (3)

which means none of the pair-wise ranking constraints is violated. We adopt the L2-regularized, L2-loss linear ranking SVM formulation to learn f(x; w) = w^T x as follows:

\min_w \ \frac{1}{2} w^T w + \lambda \sum_{(i,j) \in P} \max(0, 1 - w^T (x_i - x_j))^2,   (4)

where w is the linear model parameter and \lambda > 0 is a regularization parameter. Ideally, the optimal model parameters can be learned since the optimization problem is convex. However, the large number of pair-wise constraints (|P|) becomes the main difficulty in training the ranking SVM efficiently. We solve the problem efficiently by taking advantage of a newly proposed fast truncated Newton solver which employs order-statistic trees to avoid evaluating all the pairs [16]. Note that deep learning based non-linear ranking methods (e.g., Zhao et al. [42]) could also be used to train on a large-scale dataset such as our videos dataset (about 870 minutes in total).
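The pair-wise objective in Eq. 4 can be approximated with off-the-shelf tools by training a linear classifier on difference vectors x_i - x_j. The sketch below is only an illustration of that idea: it assumes precomputed per-moment features, uses scikit-learn's LinearSVC (squared hinge loss, no intercept) rather than the specialized truncated Newton solver of [16], and all variable names (e.g., videos, train_ranking_svm) are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def build_pairwise_differences(videos):
    """videos: list of per-video moment lists [(y, x), ...], where y is the
    relative h-factor and x is the feature vector of a moment.
    Builds the same-video pair set P of Eq. 2 and returns difference
    vectors with +/-1 labels."""
    diffs, labels = [], []
    for moments in videos:
        for (y_i, x_i) in moments:
            for (y_j, x_j) in moments:
                if y_i > y_j:                               # (i, j) in P_q, Eq. 1
                    diffs.append(x_i - x_j); labels.append(+1)
                    diffs.append(x_j - x_i); labels.append(-1)  # mirrored pair keeps classes balanced
    return np.vstack(diffs), np.array(labels)

def train_ranking_svm(videos, C=1.0):
    X, y = build_pairwise_differences(videos)
    # Squared hinge loss on difference vectors mimics the L2-loss ranking SVM of Eq. 4;
    # fit_intercept=False since a bias term does not change the ranking.
    clf = LinearSVC(loss="squared_hinge", C=C, fit_intercept=False)
    clf.fit(X, y)
    return clf.coef_.ravel()                                # w, so the h-factor is f(x) = w @ x

# Hypothetical toy usage: two videos, six 8-D moments each, binary relative h-factors.
rng = np.random.default_rng(0)
videos = [[(float(k % 2), rng.normal(size=8)) for k in range(6)] for _ in range(2)]
w = train_ranking_svm(videos)
print(w.shape)  # (8,)
```

The C parameter here plays the role of the inverse of the regularization weight in Eq. 4; the paper itself uses the dedicated rankSVM solver of [16] via liblinear rather than this pairwise-expansion trick.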

B. Ranking From Edited Videos

Asking humans to order moments within each video is feasible but not scalable. Since the definition of a highlight is highly dependent on the domain of interest, human annotators without domain knowledge might find the task ambiguous and tedious. We argue that we can naturally harvest such information from edited videos. Assume that we have raw videos which are used to create the edited videos, and that users make a (somewhat) informed decision to select the moments in the raw video to be included in the edited video. We then get the following ranking constraints:

y_i > y_j, \quad \forall i \in E_q,\ \forall j \in R_q \setminus E_q,   (5)

where q is the index of the raw video, E_q is the set of moments in the raw video which are included in the edited video, and R_q is the set of moments in the raw video. Eq. 5 states that the relative h-factor of a moment included in the edited video (E_q) is higher than that of the rest of the moments in the raw video (R_q \setminus E_q) (see Fig. 1(b)).

Given many pairs of raw videos and their edited versions, the linear ranking SVM becomes

\min_w \ \frac{1}{2} \|w\|^2 + \lambda \sum_{q \in Q} L(q; w),   (6)

L(q; w) = \sum_{i \in E_q} \sum_{j \in R_q \setminus E_q} \max(0, 1 - w^T (x_i - x_j))^2,   (7)

where Q is a set of raw videos, and L(q; w) is the loss of violating the ranking constraints in the q-th raw video. As a result, we can even extract ranking constraints from training data where users edit their videos by simply trimming them. The main advantage of this approach is that a wealth of such data exists in the digital world. We describe in Sec. IV how we can harvest such data online automatically.

C. Handling Noisy Data

Not all the users online are experts. The start and end of a specific highlight can vary significantly depending on the users. For example, user A might select a loose highlight with a few minutes included in the edited version. In contrast, user B might select a tight highlight with only a few seconds included (see Fig. 1(c)). Especially for the loose highlight, the constraints expressed in Eq. 5 might not always be true. In order to address this problem, we propose a latent loss as follows:

L_{z_q}(q; w) = \sum_{j \in R_q \setminus E_q} \max(0, 1 - w^T (x_{z_q} - x_j))^2,   (8)

z_q = \arg\max_{i \in E_q} f(x_i; w),   (9)

where z_q is the best highlighted moment in E_q. Note that the latent loss relaxes the constraints in Eq. 5 to

y_{z_q} > y_j, \quad \forall j \in R_q \setminus E_q,   (10)

where only the best highlighted moment z_q in E_q must have a higher relative h-factor than the rest of the moments (R_q \setminus E_q) in the q-th raw video.

1) Latent Linear Ranking SVM: A latent linear ranking SVM incorporating the latent loss is defined as

\min_{w, Z} \ \frac{1}{2} \|w\|^2 + \lambda \sum_{q \in Q_T} L(q; w) + \lambda \sum_{q \in Q_L} L_{z_q}(q; w),   (11)

where Q_T is a set of raw videos with tightly selected highlights, Q_L is a set of raw videos with loosely selected highlights, \{Q_T, Q_L\} is a partition of Q (i.e., Q_T \cap Q_L = \emptyset, Q_T \cup Q_L = Q), and Z = \{z_q\}_{q \in Q_L} is the set of latent best highlighted moments in Q_L.

2) EM-Like Approach: Given Q_T and Q_L, we solve for the model parameters w and the set of latent best highlighted moments Z iteratively using an EM-like approach.

1) We set Q = Q_T and obtain our initial w by solving Eq. 6.
2) Given the initial w, we estimate our initial Z by solving Eq. 9 (E-step).
3) Then, we obtain w by solving Eq. 11 while fixing Z (M-step).
4) Given w, we estimate Z by solving Eq. 9 (E-step).
5) Go to step 3 if the estimated latent variables Z have changed; otherwise, stop the procedure.

In our experiments, the EM-like approach typically stops within five iterations.
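A minimal sketch of this EM-like alternation is given below. The M-step solver is passed in as a callable (m_step), which is an assumed stand-in for the liblinear-based solver of Eq. 11 rather than the actual implementation; only the E-step of Eq. 9 and the stopping test on Z are spelled out, and the helper names are hypothetical.

```python
import numpy as np

def e_step(w, loose_videos):
    """Eq. 9: for each loosely-labeled video pick the moment in E_q
    with the highest current h-factor w^T x."""
    return [int(np.argmax(E @ w)) for (E, _) in loose_videos]

def em_like_training(tight_videos, loose_videos, m_step, max_iter=20):
    """tight_videos / loose_videos: lists of (E_q, R_minus_E_q) feature arrays.
    m_step: callable returning w given the current latent selection Z
    (a stand-in for solving Eq. 6 / Eq. 11)."""
    w = m_step(tight_videos, loose_videos, Z=None)   # step 1: Eq. 6 on Q_T only
    Z = e_step(w, loose_videos)                      # step 2: initial E-step
    for _ in range(max_iter):
        w = m_step(tight_videos, loose_videos, Z=Z)  # step 3: M-step, Eq. 11
        Z_new = e_step(w, loose_videos)              # step 4: E-step, Eq. 9
        if Z_new == Z:                               # step 5: stop when Z is stable
            break
        Z = Z_new
    return w, Z

# Hypothetical usage with a dummy M-step (mean difference direction) on toy 8-D data.
rng = np.random.default_rng(1)
make = lambda: (rng.normal(size=(3, 8)), rng.normal(size=(10, 8)))
tight, loose = [make() for _ in range(2)], [make() for _ in range(2)]
dummy_m_step = lambda t, l, Z: np.mean([E.mean(0) - R.mean(0) for E, R in t], axis=0)
w, Z = em_like_training(tight, loose, dummy_m_step)
print(len(Z))  # one latent highlight index per loosely-labeled video
```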
D. Self-Paced Model Selection

One simple way to define \{Q_T, Q_L\} is to select videos with highlights shorter than a fixed length (see Sec. V-A4). However, such a simple partition might not be the best partition. We propose a self-paced model selection procedure to evaluate K partitions of Q, guided by the pair-wise accuracy of every video, A = \{a_q\}_q, as follows.

1) We set Q_T = Q and obtain w by solving Eq. 11.
2) Given the model w, we evaluate the pair-wise accuracy A on the training videos.

3) Given A, we order the videos from low to high pair-wise accuracy, and evenly split the videos into K mutually exclusive sets.
4) Starting from the set with the lowest accuracy, we move one set at a time from Q_T to Q_L.
5) For each partition of Q_T and Q_L, we solve Eq. 11 to obtain a new model w and a new pair-wise accuracy A on the training videos.
6) Finally, we select the model with the highest mean pair-wise accuracy, mean(A).

The pair-wise accuracy a_q for the q-th video is defined as

a_q = \frac{\sum_{(i,j) \in P_q} \mathbb{1}(w^T x_i > w^T x_j)}{|P_q|},   (12)

where \mathbb{1}(\cdot) is an indicator function which is one if a pair-wise constraint is satisfied by our learned model (w^T x), P_q (defined in Eq. 1) is the set of pair-wise constraints for the q-th video, and |P_q| is the number of pairs. The pair-wise accuracy is a normalized value between 0 and 1. Hence, it is comparable for different partitions of Q_T and Q_L.
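The per-video pair-wise accuracy of Eq. 12 is straightforward to compute from the learned weight vector. The sketch below shows one way to do it, under the assumption that each video is given as an array of moment features plus relative h-factors; the helper names are illustrative.

```python
import numpy as np

def pairwise_accuracy(w, X, y):
    """Eq. 12: fraction of same-video pairs (y_i > y_j) that the model
    w ranks correctly (w^T x_i > w^T x_j).
    X: (n_moments, d) features of one video; y: (n_moments,) relative h-factors."""
    scores = X @ w
    correct, total = 0, 0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                       # (i, j) in P_q
                total += 1
                correct += scores[i] > scores[j]
    return correct / total if total else float("nan")

# Hypothetical usage: accuracy per training video, then mean(A) for self-paced model selection.
rng = np.random.default_rng(2)
w = rng.normal(size=8)
videos = [(rng.normal(size=(12, 8)), rng.integers(0, 2, size=12)) for _ in range(5)]
A = [pairwise_accuracy(w, X, y) for X, y in videos]
print(np.nanmean(A))
```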
E. Comparison to Binary Highlight Classification

Recall that previous highlight selection methods [5]-[12] formulate their problem as a binary highlight classification problem. Although we argue that computing highlights is intrinsically a ranking problem, we show that our problem can also be considered as a binary classification problem by setting

\{y_i = +1;\ \forall i \in E_q,\ \forall q \in Q\} \quad \text{and} \quad \{y_j = -1;\ \forall j \in R_q \setminus E_q,\ \forall q \in Q\},   (13)

where y \in \{-1, +1\} becomes a binary label. Next we discuss the advantages and disadvantages of a binary classification problem.

The binary classification problem is highly related to a pair-wise ranking problem which ignores the video index and expresses the following constraints:

P \equiv \{(i, j);\ y_i > y_j\}.   (14)

In this case, moments are also compared to each other across videos. However, training a classification model is much more efficient than training a ranking model. This is because the huge number (i.e., quadratic in the number of moments) of pair-wise violation losses (Eq. 7) is replaced by losses with respect to a separating hyperplane as defined below:

L_C(q; w) = \sum_{i;\ q_i = q} \max(0, 1 - y_i (w^T x_i)).   (15)

The number of losses in Eq. 15 is linear in the number of moments. The advantages of a classification model are that (1) it implicitly incorporates more pair-wise constraints in Eq. 14 than the constraints in Eq. 2, and (2) it can be solved more efficiently using many existing methods. Nevertheless, the newly added pair-wise constraints which compare moments from two different videos (i.e., \{(i, j);\ q_i \neq q_j,\ y_i > y_j\}) could be harmful. For example, the highlight in video q_1 might be less interesting than many non-highlighted moments in video q_2.

We can also define a latent linear classification SVM to handle the noisy data by replacing L(q; w) with L_C(q; w) and L_{z_q}(q; w) with L_{C z_q}(q; w) in Eq. 11, where L_{C z_q}(q; w) is defined as

L_{C z_q}(q; w) = \max(0, 1 - y_{z_q} (w^T x_{z_q}))^2 + \sum_{j \in R_q \setminus E_q} \max(0, 1 - y_j (w^T x_j))^2, \quad z_q = \arg\max_{i \in E_q} f(x_i; w).   (16)

The latent linear binary classification SVM model can also be learned using our EM-like procedure (Sec. III-C2). In Fig. 4, we demonstrate that our proposed latent ranking model is consistently better than the latent classification model. This shows that the additional constraints in the latent classification model are indeed harmful.

IV. HARVESTING YouTube VIDEOS

Nowadays, many videos are shared online through websites such as YouTube. We propose a novel approach to harvest a large amount of raw (i.e., unedited) videos and their edited highlights from YouTube. To find the corresponding raw videos and edited highlights, we can use [43] to efficiently identify duplicated frames between a pair of videos. Once a set of consecutively duplicated frames is matched for a pair of videos (q_1, q_2), we define these matched frames as the automatically harvested highlight E. Then, we identify the longer video among q_1 and q_2 as the raw video R. Note that [43] is robust, but it can miss matches when the video includes after effects. Hence, our dataset is built with high precision.

Theoretically, our proposed method is feasible even for a large scale database such as YouTube. In particular, Google is already using a similar procedure to find videos with copyright violations. However, this is a daunting task for individual researchers like us, since we need to retrieve a large number of videos to find enough matched pairs of edited and raw videos. Therefore, we explore the following two approaches to retrieve a smaller set of videos that is likely to contain raw video and highlight pairs.

• YouTube video editor. YouTube has an online editor called the YouTube video editor. If a video has been used as content to create other edited videos, the attribution of the video keeps track of which videos used it as content (Fig. 2-Left). This suggests that they are likely to be raw and edited pairs of videos. Hence, we use the YouTube API to query videos generated by the YouTube video editor to retrieve a smaller set of candidate raw and edited pairs of videos.
• YouTube Channels. There exist many YouTube channels curating highlight compilations. The compilations in these channels typically credit the original videos in the description of the video (see original links in Fig. 2-Right). The original links typically point to other YouTube videos corresponding to the raw video of the highlights.

Given these candidate pairs of raw and highlight videos, we use [43] to efficiently identify the duplicated frames as the highlights. Inevitably, our current data is limited by the availability of videos generated by the YouTube video editor and curated by YouTube channels including original link information. For instance, we were not able to retrieve enough videos for domains such as cooking, fixing a bike, etc., since for those common domains users have less incentive to curate these videos or to edit the videos collaboratively using the YouTube video editor. Nevertheless, our current system works well for common action and animal related domains such as skating, dog activity, parkour, viral video, etc. We describe the selected domains and the data statistics in Sec. V-C.
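The raw/edited pairing step only needs a way to detect runs of near-duplicate frames shared by two videos. The sketch below uses a simple average-hash comparison as a stand-in for the multiresolution image querying of [43]; it illustrates the idea rather than the matching procedure actually used in the paper, and the helper names are hypothetical.

```python
import numpy as np

def average_hash(frame, size=8):
    """frame: (H, W) grayscale array. Block-average down to size x size and
    threshold at the mean to get a 64-bit binary signature."""
    h, w = frame.shape
    small = frame[: h - h % size, : w - w % size]
    small = small.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def matched_frame_runs(raw_frames, edited_frames, max_dist=8, min_run=10):
    """Return index runs of raw frames whose hash nearly matches some edited
    frame; consecutive matched frames form a harvested highlight segment E."""
    edited_hashes = [average_hash(f) for f in edited_frames]
    raw_hashes = [average_hash(f) for f in raw_frames]
    matched = [any(np.count_nonzero(rh != eh) <= max_dist for eh in edited_hashes)
               for rh in raw_hashes]
    runs, start = [], None
    for idx, m in enumerate(matched + [False]):      # sentinel closes the last run
        if m and start is None:
            start = idx
        elif not m and start is not None:
            if idx - start >= min_run:
                runs.append((start, idx))
            start = None
    return runs

# Hypothetical usage on synthetic grayscale frames (the edited video reuses raw frames 30..59).
rng = np.random.default_rng(3)
raw = [rng.random((64, 64)) for _ in range(100)]
edited = [f.copy() for f in raw[30:60]]
print(matched_frame_runs(raw, edited))  # expected to report a run around (30, 60)
```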
Fig. 2. The left panel shows the raw (top link) and edited highlight (bottom link) pair that we retrieved using the property of the YouTube video editor. The right panel shows a curated highlight compilation (top) and a list of original links (indicated by a red box) pointing to the raw videos.
highlight dataset harvested by our system automatically. In the
following sections, we first give details of our implementation
such as feature representation, parameter setting, etc. Then,
we describe details about the dataset and our highlight detec-
tion evaluation. Finally, we report quantitative and qualitative
results on our novel YouTube highlight dataset.

A. Implementation Details
We describe our representation and training parameters in
detail.
1) Moment Definition: We first define each moment as
a 100 frames clip evenly sampled across each raw video.
The start and end frames of each moment is then aligned to
the nearest shot boundary within 50 frames away. We use the
3rd party code1 to detect shot boundaries according to the
pixel-wise RGB difference between two consecutive frames.
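A small sketch of this moment construction is given below: clips of roughly 100 frames are laid out over the video and their endpoints are snapped to the nearest detected shot boundary when one lies within 50 frames. The shot boundary detector shown here is a hypothetical threshold on the mean RGB frame difference, not the shotdetect tool used in the paper.

```python
import numpy as np

def detect_shot_boundaries(frames, thresh=30.0):
    """Hypothetical detector: a boundary wherever the mean absolute RGB
    difference between consecutive frames exceeds a threshold."""
    diffs = [np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
             for i in range(1, len(frames))]
    return [i for i, d in enumerate(diffs, start=1) if d > thresh]

def snap(frame_idx, boundaries, max_shift=50):
    """Move frame_idx to the nearest shot boundary if it is within max_shift frames."""
    if not boundaries:
        return frame_idx
    nearest = min(boundaries, key=lambda b: abs(b - frame_idx))
    return nearest if abs(nearest - frame_idx) <= max_shift else frame_idx

def define_moments(n_frames, boundaries, clip_len=100, max_shift=50):
    """Evenly sampled ~100-frame clips with endpoints aligned to shot boundaries."""
    moments = []
    for start in range(0, n_frames, clip_len):
        end = min(start + clip_len, n_frames)
        moments.append((snap(start, boundaries, max_shift),
                        snap(end, boundaries, max_shift)))
    return moments

# Hypothetical usage: a 1000-frame video with shot boundaries at frames 130 and 460.
print(define_moments(1000, boundaries=[130, 460]))
```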
2) Feature Representation: Given these moments, we extract the state-of-the-art dense trajectory motion feature [2] for action classification (other scene, object, and audio features can also be included in our framework). Then, the dimension of the dense trajectory features is reduced by half using PCA. The dense trajectory features within each moment are mapped to a learned Gaussian mixture codebook to generate a Fisher vector including both first- and second-order statistics, with a fixed dimension (85200). The Gaussian mixture model with 200 mixture components is learned using the training raw videos of all domains.

3) Highlight Moment Definition: Recall that the automatically harvested highlights are obtained at frame level. Hence, we define a moment as a highlight moment (in E) if at least 70% of its frames are covered by the frame-based highlights. A moment is not a highlight (in R \setminus E) if at most 30% of its frames are covered by the frame-based highlights. All the remaining moments are not included in training. For the manually annotated highlights, we need to accumulate the annotations from five turkers for each moment. We define a soft-highlightness score for each moment. For a moment covered 70% by a manually annotated highlight, it will accumulate a 0.7 soft-highlightness score. After accumulating all the soft-highlightness scores, we define the moments with a score equal to or greater than 2 as ground truth positive highlights, and the moments that have not been selected as negative ground truth for evaluation.
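The coverage rules above translate into a few lines of bookkeeping. The sketch below (with hypothetical helper names and fractional overlap computed on frame index intervals) labels harvested moments with the 70%/30% thresholds and accumulates turker soft-highlightness scores.

```python
def coverage(moment, segment):
    """Fraction of the moment's frames covered by a highlight segment.
    moment and segment are (start_frame, end_frame) intervals."""
    (ms, me), (ss, se) = moment, segment
    overlap = max(0, min(me, se) - max(ms, ss))
    return overlap / max(1, me - ms)

def harvested_label(moment, highlight_segments):
    """+1 if >=70% covered (in E), -1 if <=30% covered (in R \\ E), else None (unused)."""
    c = max((coverage(moment, s) for s in highlight_segments), default=0.0)
    if c >= 0.7:
        return +1
    if c <= 0.3:
        return -1
    return None

def soft_highlightness(moment, turker_segments):
    """Accumulate fractional coverage over all turker annotations; moments with
    an accumulated score >= 2 are treated as ground truth positives."""
    return sum(coverage(moment, s) for s in turker_segments)

# Hypothetical usage: one 100-frame moment, one harvested segment, five turker picks.
moment = (200, 300)
print(harvested_label(moment, [(180, 290)]))          # 90% covered -> +1
print(soft_highlightness(moment, [(190, 280)] * 5))   # 5 * 0.8 = 4.0 >= 2
```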

Fig. 4. Left panel: performance comparison between our latent ranking model and its classification counterpart. The latent ranking model (blue bar) is consistently better than the latent classification model (red bar), with an average improvement of 28.6% in mAP. Right panel: sensitivity analysis of the ranking and classification models. We test λ = {0.001, 0.01, 0.1, 1, 10}.

Fig. 5. Performance comparison among non-latent ranking (red bar), latent ranking (blue bar), and self-paced latent ranking (green bar) models. Our latent
ranking models outperform non-latent models except for Parkour.

4) Tight Highlight Q_T Definition: By default, we define a video highlight to be tight if the training highlight consists of no more than 2 moments. All the rest of the videos are considered to have loose highlights. Note that the self-paced model selection procedure will modify Q_T and Q_L as described in Sec. III-D.

5) Model Training: For all the experiments, several regularization parameters λ = {0.001, 0.01, 0.1, 1, 10} have been considered. We report the average mean average precision across these five values for a robust and fair comparison. We use the liblinear package [44] to train our models in their primal form. For self-paced model selection, we set K = 5.

B. Highlight Detection Evaluation

Within each video, the best method should first detect the ground truth highlighted moments rather than other moments. We calculate the average precision of highlight detection for each testing video and report the mean average precision (mAP) summarizing the performance over all videos. Note that unlike object detection, which accumulates all the detections from images to calculate the average precision, highlight detection treats each video separately, since a highlighted moment in one video is not necessarily more interesting than a non-highlighted moment in another video.
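This per-video protocol can be written compactly with scikit-learn's average_precision_score. The sketch below assumes each test video comes with ground truth moment labels and predicted h-factors; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def video_ap(h_factors, gt_labels):
    """Average precision of ranking the moments of ONE video by predicted h-factor.
    gt_labels: 1 for ground truth highlight moments, 0 otherwise."""
    return average_precision_score(gt_labels, h_factors)

def mean_average_precision(test_videos):
    """mAP over videos: AP is computed per video and never pooled across videos."""
    aps = [video_ap(h, y) for h, y in test_videos]
    return float(np.mean(aps))

# Hypothetical usage: 3 test videos with 10 moments each, 2 highlight moments per video.
rng = np.random.default_rng(4)
test_videos = []
for _ in range(3):
    y = np.zeros(10, dtype=int)
    y[rng.choice(10, 2, replace=False)] = 1
    test_videos.append((rng.normal(size=10), y))
print(mean_average_precision(test_videos))
```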
C. YouTube Highlight Dataset

By harvesting freely available data from YouTube as described in Sec. IV, we have collected data for the dog activity, gymnastics, parkour, skiing, surfing, skating, and viral domains. The first six domains are selected because about 20% of the retrieved raw videos in these domains are publicly downloadable. The viral domain is collected from the curated Jukinvideo YouTube channel (https://www.youtube.com/user/JukinVideoDotCom).

Our new dataset is very challenging since (1) it contains a variety of videos captured by portable devices, (2) the start and end of a specific highlight can vary significantly depending on the behavior of the users, and (3) there are irrelevant videos included in the dataset. Irrelevant videos typically include videos of interviews, slideshows of images, and product commercials. For instance, the commercial at https://www.youtube.com/watch?v=4jdtNuVezzE is in the skiing domain. We split our raw dataset into half training and the other half testing. Then, we manually label the irrelevant videos in both training and testing so that the fully supervised explicitly crowd-sourced ranking baseline (see Sec. V-D.4) is not trained with irrelevant videos. However, our fully automatic system is trained agnostic to the labels of the irrelevant videos.

For testing, none of the irrelevant videos are evaluated. Moreover, as mentioned in Sec. V-B, we evaluate on videos where turkers reach consensus on highlighted moments. The statistics of the training and the evaluated testing data for each domain are shown in Table I. For the training data, we further show the number of relevant videos and videos with tightly selected highlights (Q_T) in each domain in Table I. The dataset and codes are available at https://github.com/tsenghungchen/ranking-highlights-in-personal-videos. Our dataset in total has 764 videos with a total accumulated time of 1044 minutes (62681 seconds), which is at a similar scale to the state-of-the-art large scale action recognition dataset [45] (1600 minutes).

TABLE I: Statistics of Our Data Harvested From YouTube.

Fig. 6. Performance comparison between a fully supervised ranking model trained with turkers' annotations (referred to as turk), our latent ranking trained on relevant videos (referred to as latent-rank-sub), and our latent ranking trained with all harvested data (referred to as latent-rank). Our models are fairly competitive compared to turk, and surprisingly outperform turk significantly on parkour.

Fig. 7. Performance comparison between our model with the audio feature only, the visual feature only, audio-visual-standardize (MFCC + dense trajectory) features, and audio-visual-non-standardize. We show that the audio feature alone performs much worse than the visual feature. By combining both features, we achieve the best average mAP, and the best mAP in the dog activity, surfing, skating, and viral domains. Finally, we show that standardization is important when combining heterogeneous features.

D. Highlight Detection Results

Our fully automatic latent ranking method performs well on the novel YouTube highlight dataset, with a mean average precision of 56.3% across seven domains. We compare our system with other sophisticated methods below.

1) Latent Ranking v.s. Classification: In Fig. 4-Left, we compare the mean average precision for every domain between our latent ranking model (Sec. III-C) and a latent classification model (Sec. III-E), where both models are trained using noisy data harvested from YouTube. Our latent ranking model (56.3% mAP) consistently outperforms the latent classification model (27.7% mAP) in all domains, with an average improvement of 28.6% in mAP. (Note that the latent classification model using the 26000-dimension feature achieves 46% mAP in [17], which is still 18.3% less accurate than our latent ranking model. We believe that this is because more powerful features overfit a wrong classification model on the training set of a ranking problem and cause even worse generalization accuracy on the testing set.)

As mentioned in the implementation details, we show the performance of both the ranking and classification models with different regularization parameter values in Fig. 4-Right. The performance for λ = {0.01, 0.1, 1, 10} is similar, which shows that our latent ranking method is not sensitive to the change of λ.

Fig. 8. Examples of moments ranked from high (left) to low (right) according to our predicted h-factors for surfing, skating, parkour, gymnastics, dog activity, skiing, and viral videos. We show one sampled frame from each moment to represent it.

In order to check statistical significance, we compare the AP pairs of the latent classification v.s. the latent ranking methods for each testing video. Given the AP pairs, we perform a sign test (i.e., only testing which AP is higher). The p-value of the null hypothesis (i.e., the latent classification and the latent ranking are not distinguishable) is 2e-6. Hence, the null hypothesis is strongly rejected.
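A paired sign test of this kind only needs the per-video AP pairs. The sketch below uses scipy's exact binomial test on the number of videos where one method wins (ties dropped), which is one standard way to implement it; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import binomtest

def sign_test(ap_a, ap_b):
    """Two-sided sign test on paired per-video APs (method A vs. method B).
    Ties are discarded; under the null, wins for A are Binomial(n, 0.5)."""
    ap_a, ap_b = np.asarray(ap_a), np.asarray(ap_b)
    wins_a = int(np.sum(ap_a > ap_b))
    n = int(np.sum(ap_a != ap_b))
    return binomtest(wins_a, n, p=0.5, alternative="two-sided").pvalue

# Hypothetical usage: 40 test videos where method A wins on most of them.
rng = np.random.default_rng(5)
ap_b = rng.uniform(0.2, 0.6, size=40)
ap_a = np.clip(ap_b + rng.uniform(-0.05, 0.25, size=40), 0, 1)
print(sign_test(ap_a, ap_b))  # small p-value: the two methods are distinguishable
```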

Fig. 9. Visualization of our predicted h-factors (red lines) and raw turkers' annotations (blue bars), overlaid with manually sampled raw frames. For gymnastics, dog activity, and skiing, the main characters in some frames are too small; hence, we show manually cropped zoomed-in versions of the frames. Click to watch videos on YouTube: surfing link, skating link, parkour link, skiing link, gymnastics link, dog link, viral video link.

2) Latent v.s. Non-Latent Ranking: We mentioned in Sec. III-C that the latent model can mitigate issues caused by noisy labels in our harvested data. We show in Fig. 5 that our latent ranking outperforms non-latent ranking in dog activity, gymnastics, skiing, surfing, skating, and viral. However, the improvement is less significant as compared to our preliminary results in [17].

Fig. 10. Failure examples. Visualization of our predicted h-factors (red lines) and raw turkers' annotations (blue bars), overlaid with manually sampled raw frames. These examples show that our predictions are not too far off from the turkers' annotations.

On average across domains, our latent ranking model is 2% better than the non-latent ranking model in mean AP. We believe this is because our new feature has increased the performance such that our latent model has less room for improvement over a non-latent ranking model. In order to check statistical significance, we compare the AP pairs of the non-latent v.s. the latent ranking methods for each testing video. On the surfing, skating, and viral domains, the p-value (probability) of the null hypothesis (i.e., not distinguishable) is 0.064. Hence, the probability that the null hypothesis is true is very low. However, on the whole dataset, the p-value is 0.643. Note that this is possible even when the mAP of latent ranking is significantly higher (e.g., in gymnastics and skiing), since the sign test does not consider the difference in AP, but mAP does.

3) Latent v.s. Self-Paced Latent Ranking: In Fig. 5, we also show the performance of our self-paced model selection procedure. On average, our self-paced model selection procedure is slightly better than the latent ranking method (about 1% improvement). In particular, it outperforms both the non-latent and latent ranking methods in skiing and skating. However, in the sign test, the p-value is 0.853. Hence, statistically, our self-paced latent ranking is on par with the latent ranking.

4) Explicitly Crowd-Sourcing v.s. Naturally Harvesting Data: In order to analyze the effect of our model trained using noisy data harvested online, we also trained a fully supervised ranking model (Sec. III-A) with soft-highlight scores accumulated from turkers. This is a time consuming and ambiguous task, since labeling highlights in raw personal videos is not a well defined task for turkers unrelated to the events in the videos. Moreover, we assume that the human annotators will discover the irrelevant videos. Hence, the irrelevant videos in the training set are not used to train the fully supervised ranking model. However, this means that the fully supervised ranking model is trained on about 69% of the full training data. In order to have a fair comparison with the fully supervised ranking model, we also train our latent ranking method on 69% of the full training data (referred to as latent-rank-sub). The comparison between the fully supervised ranking model trained with multiple turkers' annotations and our proposed latent ranking models (latent-rank-sub and latent-rank) is shown in Fig. 6. Intuitively, the fully supervised ranking model should perform much better. However, in parkour, our full latent ranking model (latent-rank) is even superior to the fully supervised ranking model trained with turkers' annotations. This shows that our fully automatic system is very competitive compared to the approach requiring annotations from a crowd-sourcing platform. Moreover, our latent-rank-sub indeed achieves better accuracy than our latent-rank, which is trained with both relevant and irrelevant videos. This shows that training with irrelevant videos in general hurts the accuracy.

5) Visual Features v.s. Audio Features: In addition to the visual feature, which includes motion and appearance features, personal videos also include audio information, which is not utilized in the previous experiments. Intuitively, the audio feature should be a reliable cue, since highlights could come along with loud audio from the scene and objects being captured or from the cameramen themselves. However, we also notice that some of our videos contain no sound or only background music. We conduct the following experiments to test whether the audio feature is really reliable by itself. For the audio feature, we extract state-of-the-art Mel-frequency cepstral coefficients (MFCCs) [1] for each moment. This corresponds to an 11x13 matrix containing 13 MFCCs (the first 12 DCT coefficients and the log energy) for each moment; then, we convert it into a 143-dimension feature vector. When combining both visual and audio features, we standardize every entry of the features and then directly concatenate them. In Fig. 7, we show that the audio feature alone performs worse than the visual feature. Given the AP pairs, we perform a sign test. Since the p-value of the null hypothesis is 2e-6, the null hypothesis is strongly rejected. Moreover, by combining both features, we achieve the best mAP in the dog activity, surfing, skating, and viral domains. We believe the audio feature in gymnastics and skiing is less informative, since those videos are typically captured at a distance from the event locations. We also show that standardization before direct concatenation is very critical. However, in the sign test, the p-value is 0.298. Hence, statistically, combining both features is on par with the visual feature.
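A minimal sketch of this audio-visual combination is shown below. It assumes the librosa package is available for MFCC extraction and uses illustrative shapes (an 11x13 MFCC block flattened to 143 dimensions, standardized and concatenated with a visual vector); it is not the exact feature pipeline of the paper.

```python
import numpy as np
import librosa

def mfcc_feature(waveform, sr, n_frames=11):
    """13 MFCCs per audio frame, subsampled to a fixed 11x13 block and flattened
    into a 143-dimension vector for one moment."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)      # (13, T)
    idx = np.linspace(0, mfcc.shape[1] - 1, n_frames).astype(int)  # pick 11 columns
    return mfcc[:, idx].T.ravel()                                   # (143,)

def standardize_columns(X, eps=1e-8):
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def combine_audio_visual(audio_feats, visual_feats):
    """Standardize every entry of each modality, then concatenate."""
    return np.hstack([standardize_columns(np.asarray(audio_feats)),
                      standardize_columns(np.asarray(visual_feats))])

# Hypothetical usage: 5 moments, 1-second audio clips at 22050 Hz, 64-D visual features.
rng = np.random.default_rng(6)
audio = np.vstack([mfcc_feature(rng.normal(size=22050).astype(np.float32), 22050)
                   for _ in range(5)])
visual = rng.normal(size=(5, 64))
print(combine_audio_visual(audio, visual).shape)  # (5, 207)
```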
6) Qualitative Results: We show sets of moments ranked from high to low according to our predicted h-factors for different domains in Fig. 8.

Our data is challenging and realistic, since it includes videos captured by widely used cameras such as GoPro and cellphone cameras. Detailed visualization of our predicted h-factors and raw turkers' annotations, overlaid with manually sampled raw frames for each domain, is shown in Fig. 9. A few failure examples are shown in Fig. 10. These examples show that our predictions are not too far off from the turkers' annotations.

VI. CONCLUSION

As a step towards automatic video editing, we introduce a fully automatic system for ranking video highlights in unconstrained personal videos by analyzing online edited videos. Our system and the proposed novel latent ranking model are shown to be superior to the binary classification counterpart and to a non-latent ranking model that does not handle noise in the harvested data. Our system is also fairly competitive compared to a fully supervised ranking system requiring annotations from Amazon Mechanical Turk. We believe our system paves the way toward learning a large number of domain-specific highlights, since more and more users' behavioral data can be harvested online. Moreover, we show that our results are not sensitive to the values of the hyper-parameters for training the ranking SVM, and that audio features alone are inferior to visual features. Finally, in the dog, surfing, and skating domains, audio features are complementary to visual features.

In the future, we plan to focus on two main directions. First, we would like to use a bank of domain-specific highlight rankers to describe every moment in a raw video. Given this high-level understanding of moments, we hope to reliably analyze how to fully automatically generate an edited video. Second, we would like to take advantage of deep learning techniques to learn a joint audio-visual feature representation, whereas combining two state-of-the-art hand-crafted features seems to be less effective in our experiments. We believe that this could fundamentally improve the accuracy of the video summarization task.

REFERENCES

[1] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357-366, Aug. 1980.
[2] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proc. ICCV, Dec. 2013, pp. 3551-3558. [Online]. Available: https://hal.inria.fr/hal-00873267
[3] TechCrunch. (2015). Periscope Has 10M Registered Users Watching 40 Years of Video Per Day. [Online]. Available: http://techcrunch.com/2015/08/12/periscope-has-10m-registered-users-watching-40-years-of-video-per-day/
[4] The Online Economy. (2012). Internet's Digital Dark Matter. [Online]. Available: http://www.onlineeconomy.org/internets-digital-dark-matter/
[5] D. Yow, B. Yeo, M. Yeung, and B. Liu, "Analysis and presentation of soccer highlights from digital video," in Proc. ACCV, 1995, pp. 499-503.
[6] Y. Rui, A. Gupta, and A. Acero, "Automatically extracting highlights for TV baseball programs," in Proc. ACM Multimedia, 2000, pp. 105-115.
[7] S. Nepal, U. Srinivasan, and G. Reynolds, "Automatic detection of 'Goal' segments in basketball videos," in Proc. ACM Multimedia, 2001, pp. 261-269.
[8] J. Wang, C. Xu, E. Chng, and Q. Tian, "Sports highlight detection from keyword sequences using HMM," in Proc. ICME, 2004, pp. 599-602.
[9] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, "Highlights extraction from sports video based on an audio-visual marker detection framework," in Proc. ICME, 2005, p. 4.
[10] M. H. Kolekar and S. Sengupta, "Event-importance based customized and automatic cricket highlight generation," in Proc. ICME, 2006, pp. 1617-1620.
[11] A. Hanjalic, "Adaptive extraction of highlights from a sport video based on excitement modeling," IEEE Trans. Multimedia, vol. 7, no. 6, pp. 1114-1122, Dec. 2005.
[12] H. Tang, V. Kwatra, M. Sargin, and U. Gargi, "Detecting highlights in sports videos: Cricket as a test case," in Proc. ICME, 2011, pp. 1-6.
[13] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in Proc. ECCV, 2014, pp. 505-520.
[14] M. Gygli, H. Grabner, and L. Van Gool, "Video summarization by learning submodular mixtures of objectives," in Proc. CVPR, 2015, pp. 3090-3098.
[15] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, "Category-specific video summarization," in Proc. ECCV, Jun. 2014, pp. 540-555.
[16] C.-P. Lee and C.-J. Lin, "Large-scale linear rankSVM," Neural Comput., vol. 26, no. 4, pp. 781-817, 2013.
[17] M. Sun, A. Farhadi, and S. Seitz, "Ranking domain-specific highlights by analyzing edited videos," in Proc. ECCV, 2014, pp. 787-802.
[18] R. Borgo et al., "A survey on video-based graphics and video visualization," in Proc. EUROGRAPHICS, 2011, pp. 1-23.
[19] D. Liu, G. Hua, and T. Chen, "A hierarchical visual model for video object summarization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 12, pp. 2178-2190, Dec. 2010.
[20] Y. Gong and X. Liu, "Video summarization using singular value decomposition," in Proc. CVPR, 2000, pp. 174-180.
[21] C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang, "Video summarization and scene detection by graph modeling," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 2, pp. 296-305, Feb. 2005.
[22] E. Mendi, H. B. Clemente, and C. Bayrak, "Sports video summarization based on motion analysis," Comput. Elect. Eng., vol. 39, no. 3, pp. 790-796, 2013.
[23] L. Li, K. Zhou, G.-R. Xue, H. Zha, and Y. Yu, "Video summarization via transferrable structured learning," in Proc. World Wide Web, 2011, pp. 287-296.
[24] B. Zhao and E. P. Xing, "Quasi real-time summarization for consumer videos," in Proc. CVPR, 2014, pp. 2513-2520.
[25] H. Yang, B. Wang, S. Lin, D. Wipf, M. Guo, and B. Guo, "Unsupervised extraction of video highlights via robust recurrent auto-encoders," in Proc. ICCV, 2015, pp. 4633-4641.
[26] W.-S. Chu, Y. Song, and A. Jaimes, "Video co-summarization: Video summarization by visual co-occurrence," in Proc. CVPR, Jun. 2015, pp. 3584-3592.
[27] M. Sun, A. Farhadi, B. Taskar, and S. Seitz, "Salient montages from unconstrained videos," in Proc. ECCV, 2014, pp. 472-488.
[28] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing Web videos using titles," in Proc. CVPR, 2015, pp. 5179-5187.
[29] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, "Multi-task deep visual-semantic embedding for video thumbnail selection," in Proc. CVPR, 2015, pp. 3707-3715.
[30] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in Proc. CVPR, 2013, pp. 2714-2721.
[31] Y. J. Lee, J. Ghosh, and K. Grauman, "Discovering important people and objects for egocentric video summarization," in Proc. CVPR, 2012, pp. 1346-1353.
[32] J. Kopf, M. F. Cohen, and R. Szeliski, "First-person hyper-lapse videos," in Proc. SIGGRAPH, 2014, Art. no. 78.
[33] I. Arev, H. S. Park, Y. Sheikh, J. Hodgins, and A. Shamir, "Automatic editing of footage from multiple social cameras," in Proc. SIGGRAPH, 2014, Art. no. 81.
[34] J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg, and V. Singh, "Gaze-enabled egocentric video summarization via constrained submodular maximization," in Proc. CVPR, 2015, pp. 2235-2244.
[35] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan, "Large-scale video summarization using Web-image priors," in Proc. CVPR, 2013, pp. 2698-2705.
[36] D. R. Olsen and B. Moon, "Video summarization based on user interaction," in Proc. EuroITV, 2011, pp. 115-122.
[37] J. Hannon, K. McCarthy, J. Lynch, and B. Smyth, "Personalized and automatic social summarization of events in video," in Proc. IUI, 2011, pp. 335-338.
[38] Y. Hu, M. Li, and N. Yu, "Multiple-instance ranking: Learning to rank images for image retrieval," in Proc. CVPR, 2008, pp. 1-8.
[39] B. Siddiquie, R. Feris, and L. Davis, "Image ranking and retrieval based on multi-attribute queries," in Proc. CVPR, 2011, pp. 801-808.
[40] D. Parikh and K. Grauman, "Relative attributes," in Proc. ICCV, 2011, pp. 503-510.
[41] J. Wang et al., "Learning fine-grained image similarity with deep ranking," in Proc. CVPR, 2014, pp. 1386-1393.
[42] F. Zhao, Y. Huang, L. Wang, and T. Tan, "Deep semantic ranking based hashing for multi-label image retrieval," in Proc. CVPR, 2015, pp. 1556-1564.
[43] C. E. Jacobs, A. Finkelstein, and D. H. Salesin, "Fast multiresolution image querying," in Proc. SIGGRAPH, 1995, pp. 277-286.
[44] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learn. Res., vol. 9, pp. 1871-1874, Jun. 2008.
[45] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," in Proc. CRCV-TR, 2013.

Min Sun received the Ph.D. degree from the University of Michigan, Ann Arbor. He is currently an Assistant Professor with National Tsing Hua University. He is also a Post-Doctoral Researcher with the University of Washington. His research interests include object recognition, image/video understanding, human pose estimation, machine learning, and deep learning. He received the best paper awards in 3DRR 2007, CVGIP 2015, and CVGIP 2016.

Ali Farhadi received the Ph.D. degree from the CS Department, University of Illinois at Urbana-Champaign, under the supervision of Prof. D. Forsyth. He is currently an Assistant Professor with the CSE Department, University of Washington. He spent a year as a Post-Doctoral Fellow with the Robotics Institute, Carnegie Mellon University, under the supervision of Prof. M. Hebert and Prof. A. Efros. He received the Best Student Paper Award in CVPR 2011.

Tseng-Hung Chen is currently pursuing the degree with the Department of Electrical Engineering, Vision Science Laboratory, National Tsing Hua University, under the supervision of Prof. M. Sun. His research interests are in the areas of computer vision, natural language processing, and deep learning. His projects focus on natural language generation and visual semantic understanding.

Steve Seitz received the Ph.D. degree in computer sciences from the University of Wisconsin in 1997. He is currently a Professor with the Department of Computer Science and Engineering, University of Washington. He also directs an imaging group at Google's Seattle office. He spent one year visiting the Vision Technology Group at Microsoft Research and the subsequent two years as an Assistant Professor with the Robotics Institute, Carnegie Mellon University. He joined the University of Washington in 2000 as a Faculty Member. He received the David Marr Prize for the best paper at the International Conference on Computer Vision, and he has received an NSF CAREER Award, an ONR Young Investigator Award, and an Alfred P. Sloan Fellowship. He is interested in problems in computer vision and computer graphics. His current research focuses on 3-D modeling and visualization from large photo collections.
