
TSINGHUA SCIENCE AND TECHNOLOGY  ISSN 1007-0214  05/12  pp51-59  Volume 17, Number 1, February 2012

Video Copy Detection Based on Spatiotemporal Fusion Model*


Jianmin Li**, Yingyu Liang, Bo Zhang
State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract: Content-based video copy detection is an active research field due to the need for copyright protection and business intellectual property protection. This paper gives a probabilistic spatiotemporal fusion approach for video copy detection. This approach directly estimates the location of the copy segment with a probabilistic graphical model. The spatial and temporal consistency of the video copy is embedded in the local probability functions. An effective local descriptor and a two-level descriptor pairing method are used to build a video copy detection system to evaluate the approach. Tests show that it outperforms the popular voting algorithm and the probabilistic fusion framework based on the Hidden Markov Model, improving the F-score (F1) by 8%.

Key words: video copy detection; probabilistic graphical model; spatiotemporal fusion model

Introduction
With the rapid progress of the Internet and multimedia, more and more videos are available online. However, it is reported that about 27% of the videos in a video search result obtained from YouTube, Google Video, and Yahoo! Video are duplicate or near-duplicate copies of a popular version[1]. A video copy is a segment of video derived from another video, usually by means of various transformations such as addition, deletion, and modification (of aspect, color, contrast, encoding, etc.), as well as camcording[2]. Figure 1 shows some video copy examples[3]. Therefore, video copy detection has become an important technology in many applications, such as copyright protection, video databases, and video retrieval. For example, copy detection is an alternative way to augment traditional watermark technology for

Received: 2011-09-02; revised: 2011-09-30
* Supported by the National Key Basic Research and Development (973) Program of China (No. 2007CB311003)
** To whom correspondence should be addressed. E-mail: lijianmin@tsinghua.edu.cn; Tel: 86-10-62795869

copyright protection. Illegal copies of a specific video can be found on some video sharing websites, although the copies are not the same binary file as the original video due to some kind of transformation. Video sharing websites need only store one video as a representative of all its copies, which greatly reduces the storage. Finally, users will be more satisfied if the near-duplicate videos in the results returned by video search engines are merged into one. Video copy detection technologies have received much attention in the field of video retrieval in recent years, with many approaches proposed. Although there are some exceptions, such as video histograms[4] computed for the whole video, most video copy detection algorithms are frame based. The procedure can be divided into three phases, i.e., descriptor extraction, descriptor matching, and sequence determination. First, global or local descriptors are extracted from selected keyframes to describe the videos. Then descriptors similar to those in the query are retrieved from the video database. These two steps generally employ image processing and retrieval techniques.

[Fig. 1 Video copy examples[3]: picture in picture, blur, insertion of pattern, strong re-encoding, noise, contrast, change in gamma, mirroring, ratio, crop, shift, and text insertion]
The Harris and Stephens detector[5], the Scale-Invariant Feature Transform (SIFT)[6], and Speeded Up Robust Features (SURF)[7] are widely used in the descriptor extraction step. In the matching step, most recent methods utilize index structures such as hash tables[8,9] and tree structures[10]. Some index structures are specially designed for transformed image retrieval, such as the distortion-based probabilistic similarity search[11] and Hamming embedding[12]. In many methods, each keyframe is treated individually for feature extraction and the similarity search, although the video is a sequence of frames. Therefore, the utilization of temporal consistency, along with spatial consistency, relies on the determination step, where partial results from the previous steps are fused to decide whether there is a copy relationship between the query video and the videos in the database. Usually, this last step is performed by voting on the referenced video[13-16], counting the number or score of descriptor pairs that approximately fit the same parameters of the transformation model. The voting approaches are efficient and work well when the descriptor pairs are relatively precise. However, they do not take into account the temporal context of the pairs and do not

thoroughly utilize the temporal consistency. Therefore, highly scored false pairs cannot be eliminated by exploring the temporal context. In addition, the voting approaches are concerned with counting the pairs consistent with the optimal parameters, which makes it difficult to locate the temporal position of the copy segment in the video because consistent and inconsistent pairs may be mixed together in the time sequence. Other approaches are based on the Hidden Markov Model. For example, Gengembre and Berrani[17] treat the descriptor pairs as observations and estimate the source video identifier and time offset hidden behind the video. This approach deals with the temporal consistency and also makes the process more reliable, since it reduces the impact of imprecision in the similarity searches. However, the approach is not robust to noise in the descriptor pairs, since a very similar false positive descriptor may cause a sharp change in the estimated video identifier. As observed in tests, a copied segment in a query may be fragmented into several segments by highly scored noise, resulting in a relatively low recall rate. Another disadvantage is that the


parameters in the model are set manually, which may affect its performance and generalization ability on different databases. This paper presents a Probabilistic SpatioTemporal Fusion (PSTF) approach which directly estimates the most probable starting point and end point of the copy segment by simultaneously exploring the spatial and temporal consistency. The results are based on a probabilistic graphical model[18,19] which represents the relationship between the segment's spatiotemporal location and the descriptor pairs obtained in the previous steps, using the transformation parameters as the bridge. Both spatial and temporal consistency are embedded into the local probabilistic functions. The parameters of the local probabilistic functions in the model are set by the parameter learning approach in graphical model theory, which guarantees better performance and generalization ability than manual setting. The copy segment location is inferred efficiently with the variable elimination method.

1 Related Work

In the feature extraction step, descriptors for selected keyframes are usually used to describe the video, which reduces the time complexity. The various kinds of descriptors used in computer vision, image, and video processing fall into two categories, global and local descriptors. Global descriptors, such as the ordinal measure[20], have poor performance with local transformations, such as shift, cropping, and camcording[15], while local descriptors are robust to both global and local transformations. The Harris and Stephens detector[5] is one of the most widely used detectors and is associated with local descriptors, e.g., a 20-dimensional descriptor for a local patch[11,13,14]. Although the detection has low computational cost, this simple descriptor is neither scale-invariant nor discriminating, which hinders the performance. Another widely used descriptor is SIFT[6], which has shown the best object recognition performance[21]. With a Hessian-based detector, SURF[7] is another scale-invariant and robust local descriptor, with performance similar to SIFT and speed similar to the Harris descriptor, which is much faster than SIFT. Once extracted, the descriptors of the reference videos are organized into a database.
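As a concrete illustration of the descriptor extraction step, the minimal Python sketch below samples keyframes at a fixed interval and extracts local descriptors with OpenCV. SIFT is used here only because it is readily available in standard OpenCV builds; the system described in this paper uses SURF, and the function name and sampling step are illustrative assumptions rather than the authors' implementation.

```python
import cv2  # pip install opencv-python

def extract_keyframe_descriptors(video_path, sample_step=25):
    """Sample one keyframe every `sample_step` frames and extract local descriptors.

    Returns a list of (t, (x, y), descriptor) tuples, where t is the keyframe index
    used as the temporal coordinate of the descriptor.
    """
    detector = cv2.SIFT_create()          # stand-in for the SURF descriptor used in the paper
    cap = cv2.VideoCapture(video_path)
    results, frame_idx, kf_idx = [], 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_step == 0:  # keyframe selection by uniform sampling
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            kps, descs = detector.detectAndCompute(gray, None)
            if descs is not None:
                for kp, d in zip(kps, descs):
                    results.append((kf_idx, kp.pt, d))
            kf_idx += 1
        frame_idx += 1
    cap.release()
    return results
```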

To pair the query video descriptors with similar ones in the database, they are searched among the database descriptors using an indexing search (e.g., Ref. [13]) to speed up the search procedure. The commonly used index structures include hash tables[8,9] and tree structures[10]. A probabilistic similarity search approach uses a space partitioning based on Hilbert's space-filling curves[13]. Approximate Nearest Neighbor (ANN) search[10] recursively divides the data space into blocks according to the data points to be indexed and organizes them into a kd-tree. The index takes up little space, and the time for an approximate search grows only logarithmically with the amount of data, which makes it suitable for large-scale similarity searches. After the pairing step, a set of descriptor pairs is obtained relating the query video and each referenced video, i.e., each video with descriptors returned by the similarity search. Each pair is associated with the identifiers of the videos to which the descriptors belong, with temporal positions for global descriptors and spatiotemporal positions for local descriptors[15]. The determination step then decides whether there is a copy relationship between the two videos based on the pairing information. This last step is usually performed by a vote on the referenced video[13-16], counting the number or the score of the descriptor pairs that fit the transformation model described by the same parameters. Each pair corresponds to a constraint, generally a hyperplane, in the parameter space. The parameters that approximately satisfy the most constraints are estimated, and the number or the sum of the scores of the descriptor pairs consistent with these parameters is counted as the similarity score. For example, Spatial-Temporal-Scale Registration (STSR)[16] is a voting approach that considers both spatial and temporal constraints. The parameter space is divided into cubes, and a vote is performed for the query and each referenced video by assigning the descriptor pairs to cubes. The cube with the maximum score is selected as the result, with the first and last frames containing descriptors in the cube considered to be the beginning and end points of the copied segment. A Probabilistic Fusion Framework (PFF) based on the Hidden Markov Model[17] estimates the most probable source video identifier and time offset by exploring information from the keyframe pairs. A continuous segment with the same source video identifier is considered to be a copied segment.
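The following sketch illustrates the pairing step with an exact kd-tree from SciPy standing in for the approximate nearest neighbor index; the distance threshold, the similarity score formula, and the metadata layout are assumptions made for illustration, not the actual implementation of any cited system.

```python
import numpy as np
from scipy.spatial import cKDTree  # pip install scipy

def build_index(ref_descs):
    """Build a (here exact) kd-tree index over the (N, D) reference descriptors."""
    return cKDTree(ref_descs)

def pair_query_descriptors(index, ref_meta, query_descs, query_meta, max_dist=0.4):
    """Return descriptor pairs (query_meta, ref_meta, score) for the determination step.

    ref_meta[i] and query_meta[k] hold (video_id, t, x, y) for the corresponding descriptor.
    """
    pairs = []
    dists, idxs = index.query(np.asarray(query_descs), k=1)  # nearest reference descriptor each
    for qm, dist, ridx in zip(query_meta, dists, idxs):
        if dist < max_dist:                        # keep only sufficiently similar descriptors
            score = 1.0 - dist / max_dist          # a simple similarity score in [0, 1]
            pairs.append((qm, ref_meta[ridx], score))
    return pairs
```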


The approach models the different parameters and inputs of the determination step and explores the temporal consistency. The maximum score path method[22] uses the returned keyframes as nodes and the similarity scores as node weights to construct a graph. Edges are added if the following temporal constraints are satisfied: the two nodes belong to the same referenced video and the difference of their time offsets is below a certain threshold. If the score of the maximum score path exceeds a threshold, the path defines a copied segment. STSR and PFF are two typical algorithms for exploring video consistency. Many other fusion methods can be viewed as special cases or variants of these. For example, the maximum score path model is equivalent to a special Hidden Markov Model in which the transition probability is 1 if the temporal constraints are satisfied and 0 otherwise, with the likelihood probability associated with the similarity score.
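As a rough illustration of the maximum score path idea, the sketch below accumulates node scores with a simple dynamic program over matched keyframes; the node representation and the offset threshold are hypothetical.

```python
def max_score_path(nodes, max_offset_diff=2.0):
    """Toy sketch of the maximum score path method described above.

    nodes: list of (query_time, ref_video_id, time_offset, score), sorted by query_time.
    An edge connects an earlier node to a later one when both come from the same
    reference video and their time offsets differ by less than max_offset_diff.
    Returns the best accumulated score over all paths.
    """
    n = len(nodes)
    best = [nodes[i][3] for i in range(n)]          # best path score ending at node i
    for j in range(n):
        qt_j, vid_j, off_j, score_j = nodes[j]
        for i in range(j):
            qt_i, vid_i, off_i, _ = nodes[i]
            if vid_i == vid_j and qt_i < qt_j and abs(off_i - off_j) < max_offset_diff:
                best[j] = max(best[j], best[i] + score_j)
    return max(best) if best else 0.0
```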

2 Problem Descriptions

The formal definition of video copy detection can be described as follows. Each video V is a frame sequence V = {V_i, 1 ≤ i ≤ len(V)}. Given a video database R = {R^j, 1 ≤ j ≤ len(R)} and a query video Q = {Q_i, 1 ≤ i ≤ len(Q)}, the task of video copy detection is to determine, for each video R^j = {R_i^j, 1 ≤ i ≤ len(R^j)}, whether there exist 1 ≤ m < n ≤ len(Q) + 1 and 1 ≤ x < y ≤ len(R^j) + 1 such that the segment {Q_i, m ≤ i < n} is a copy of {R_i^j, x ≤ i < y}, i.e., it equals Transform({R_i^j, x ≤ i < y}).

For the query video Q, the selected keyframes are denoted QKF = {QKF_i, 1 ≤ i ≤ len(QKF)}. A segment copied from the original segment of R^j is thus represented as QKF^j = {QKF_i, u^j ≤ i < v^j}, or (j, u, v) for simplicity.

Each descriptor d extracted from the keyframes is associated with a spatiotemporal location d.loc = (x, y, t)^T, where x and y are measured in pixels and t is measured as the keyframe number. The descriptor qd from the query video Q and a similar one rd retrieved from a video R^j in the database form a descriptor pair dp = (qd, rd, score). The transformation from rd's spatiotemporal location rd.loc = (x_r, y_r, t_r)^T to qd's location qd.loc = (x_q, y_q, t_q)^T is

    x_q = x_s x_r + x_f,
    y_q = y_s y_r + y_f,        (1)
    t_q = t_s t_r + t_f

Note that v_s = (x_s, y_s, t_s)^T are the scale parameters and v_f = (x_f, y_f, t_f)^T are the offset parameters. For all descriptor pairs between two video segments that are the same under some transformation, these parameters should be constants, i.e., a static point in the parameter space PSpace.

After the similarity search, the descriptor pairs between Q and R form a sequence PS = {PS_i, 1 ≤ i ≤ len(QKF)}, where PS_i = {dp_ik, 1 ≤ k ≤ NDP_i} are the pairs between the descriptors in QKF_i and those in some keyframes of R. Each pair dp_ik in PS_i has a score dp_ik.score derived from the similarity of the descriptors in the pair. The determination step uses the pair sequence PS as input to solve the following maximum a posteriori hypothesis problem:

    max_{1 ≤ j ≤ len(R), 1 ≤ u < v ≤ len(Q)} p(j, u, v | PS)        (2)

In practice, the system returns the segments with the top k highest probabilities or the segments with probabilities above a predefined threshold.
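To make Eq. (1) concrete, the sketch below fits the scale and offset of one coordinate from a set of descriptor pairs by least squares. This is only an illustration of the constraint each pair imposes on the parameters; the system described later actually estimates the parameters with a modified RANSAC method, and the function name and sample values here are hypothetical.

```python
import numpy as np

def fit_scale_offset(ref_coords, query_coords):
    """Least-squares fit of q = s * r + f for one coordinate (x, y, or t).

    ref_coords, query_coords: 1-D arrays of matched coordinates from descriptor pairs.
    Returns (s, f), the scale and offset of Eq. (1) for that dimension.
    """
    A = np.stack([ref_coords, np.ones_like(ref_coords)], axis=1)
    (s, f), *_ = np.linalg.lstsq(A, query_coords, rcond=None)
    return s, f

# Example: reference x-coordinates scaled by 1.3 and shifted by 20 pixels
r = np.array([10.0, 50.0, 120.0, 200.0])
q = 1.3 * r + 20.0
print(fit_scale_offset(r, q))   # approximately (1.3, 20.0)
```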

3 Probabilistic Spatiotemporal Fusion Approach

A video copy detection system was constructed to evaluate the PSTF approach. As illustrated in Fig. 2, the system framework consists of offline and online parts. The offline part deals with the videos in the database, while the online part deals with the query video. The offline part can be divided into descriptor extraction and index construction phases. The online part can be divided into descriptor extraction, descriptor pairing, and determination steps. In the descriptor extraction step, SURF local descriptors are extracted after keyframes are selected, every Sd frames for the videos in the database and every Sq frames for the query video. In the descriptor pairing step, points in the query video are paired with similar ones in the database using a two-level pairing approach to reduce the amount of indexed data and the time spent in the search. First, at most Nsf similar frames {Fs_i} in the database are retrieved for each query keyframe Fq, based on an ANN index over the frame descriptors, where a frame descriptor is a histogram of the local descriptors extracted from one frame. Then each SURF descriptor in the similar frames and its nearest neighbor found in the query frame make up a descriptor pair. In the determination step, the probabilistic spatiotemporal fusion approach uses descriptor pairs



[Fig. 2 System framework]

between the query and the reference video as input to locate the possible copied segment.

The problem in Eq. (2) can be converted into the following problem:

    max_{1 ≤ j ≤ len(R)} max_{1 ≤ u < v ≤ len(Q)} p(u, v | PS^j)        (3)

where PS = ∪_{1 ≤ j ≤ len(R)} PS^j and PS^j is the pair sequence between the query Q and video R^j in the database. Therefore, the main probabilistic inference task is to compute the conditional probability p(u, v | PS^j). A probabilistic graph is built to model the relationship between the copied segment [u, v) and the observed pair sequence PS^j between the query video Q and a particular reference video R^j. Hidden variables representing the transformation parameters are introduced as the bridge between [u, v) and PS^j. The model is shown in Fig. 3, with the details described in the following subsections. Throughout this section, N is used as an abbreviation for len(QKF).

3.1 Variable definition

In the model, [u, v) is an abbreviation for the copy segment {QKF_i, u ≤ i < v}. When u = v the segment is empty, which means that the query contains no copied segment from R^j. A variable f_i ∈ PSpace is defined to describe the transformation. When QKF_i is not a copy of any frame in R^j, f_i is defined as ∅. For the computations and storage, PSpace is divided into N_p = N_xs N_ys N_ts N_xf N_yf N_tf cubes by dividing the range of the scale in the x, y, t directions into N_xs, N_ys, N_ts segments and the ranges of the offsets in the x, y, t directions into N_xf, N_yf, N_tf segments. ∅ is treated as a special cube. Then f_i can be represented by the cube containing f_i. For simplicity, instead of using PS_i directly as the observation variables, the observation variables obv_i are defined as the transformation parameters derived from the descriptor pairs PS_i. If PS_i is empty, obv_i = ∅. Otherwise, obv_i.v_s and obv_i.v_f are obtained using a modified RANSAC (Random Sample Consensus) method that takes the pair scores dp_ik.score into consideration. The sum of the scores of the corresponding pairs is defined as obv_i.score.

[Fig. 3 Probabilistic spatiotemporal fusion model]

3.2 Local probabilistic function

The local probabilistic function is defined for each variable based on the variable's definition. A uniform distribution on 1 ≤ u < v ≤ N + 1 is assumed for [u, v) since no prior knowledge about the location of the copied segment is assumed.

3.2.1 Local probabilistic function for the hidden variables

For 1 ≤ i ≤ N, the local probabilistic function for f_i is defined as:

    p(f_i | u, v, f_{i-1}) = δ(∅),         i < u or i ≥ v;
                             1 / N_p,      i = u;                (4)
                             δ(f_{i-1}),   u < i < v

where δ(x) = 1 if f_i = x and δ(x) = 0 otherwise. This probabilistic function means: (a) For a frame i that does not belong to the copied segment [u, v), the transform parameter should be ∅. (b) For the frame i that is the first frame of the copied segment, f_i has a uniform distribution on PSpace, so that the system can deal with a wide range of transformations. However, prior knowledge about the transformation can be used to adopt a different distribution to achieve better performance. (c) For the other frames in the copied segment, the transform parameter should be the same as that of the previous frame. In the case of f_1, f_0 is treated as a null variable, so the distribution of f_1 can also be defined by this uniform equation for a concise statement.
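A minimal sketch of Eq. (4) follows, with the transformation cubes encoded as integers and the empty value as None; this encoding is an illustrative assumption, not the paper's data structure.

```python
def p_hidden(f_i, f_prev, i, u, v, n_cubes):
    """Sketch of the local probability p(f_i | u, v, f_{i-1}) of Eq. (4).

    Transformation cubes are encoded as integers 0..n_cubes-1 and the
    "no copy" value (the empty set in the paper) as None.
    """
    if i < u or i >= v:                 # frame outside the copied segment [u, v)
        return 1.0 if f_i is None else 0.0
    if i == u:                          # first copied frame: uniform over the cubes
        return 0.0 if f_i is None else 1.0 / n_cubes
    # remaining copied frames: parameters must match the previous frame
    return 1.0 if f_i == f_prev else 0.0
```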


3.2.2 Distribution of the observation variables

Even if query frame i has similar frames in the reference video, the similar frames may not be retrieved correctly due to descriptor or index limitations, i.e., obv_i = ∅ or obv_i.v_s ≠ f_i.v_s. Therefore, if f_i ≠ ∅, the distribution of the observation variables is defined as:

    p(obv_i | f_i) = 1 − P_d,                                    obv_i = ∅;
                     0,                                          obv_i.v_s ≠ f_i.v_s;        (5)
                     P_d P_t ∫_s f_u du + P_d (1 − P_t) / N_p,   otherwise

where P_d is the probability of detection, indicating the probability that keyframes will be returned for a copied keyframe in the query after the search procedure, P_t is the probability that the returned keyframe is the corresponding one for the query keyframe, s indicates the cube containing (obv_i − f_i), and

    f_u = 1 / ((2π)³ σ_xs σ_ys σ_ts σ_xf σ_yf σ_tf) exp(−(u.x_s)²/(2σ_xs²) − (u.y_s)²/(2σ_ys²) − (u.t_s)²/(2σ_ts²) − (u.x_f)²/(2σ_xf²) − (u.y_f)²/(2σ_yf²) − (u.t_f)²/(2σ_tf²))        (6)

is the normal distribution, with σ_xs, σ_ys, σ_ts, σ_xf, σ_yf, and σ_tf being the half lengths of the parameter cube in the corresponding directions. If f_i = ∅, i.e., query frame i does not have a similar frame in the reference video, the function is defined as:

    p(obv_i | f_i) = P_nd,               obv_i = ∅;
                     (1 − P_nd) / N_p,   obv_i ≠ ∅        (7)

where P_nd is the probability that the search process correctly returns no keyframe in R^j when searching for frames similar to QKF_i.
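The observation model of Eqs. (5)-(7) can be sketched as follows; the integral of the Gaussian over a cube is approximated here by the density at the observed deviation times the cube volume, and the vector encoding of the parameters and the scale-mismatch test are assumptions made for brevity.

```python
import numpy as np

def p_obs(obv, f, P_d, P_t, P_nd, n_cubes, sigmas, cube_volume):
    """Sketch of the observation probabilities of Eqs. (5)-(7).

    obv, f: length-6 parameter vectors (x_s, y_s, t_s, x_f, y_f, t_f), or None for the
    "empty" value.  sigmas: half lengths of the parameter cube in each direction.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    if f is None:                                  # Eq. (7): the frame has no counterpart in R^j
        return P_nd if obv is None else (1.0 - P_nd) / n_cubes
    if obv is None:                                # Eq. (5), first case: nothing was retrieved
        return 1.0 - P_d
    u = np.asarray(obv, dtype=float) - np.asarray(f, dtype=float)
    if np.any(np.abs(u[:3]) > sigmas[:3]):         # scale mismatch: a loose reading of case 2 of Eq. (5)
        return 0.0
    # 6-D Gaussian density at the deviation u; the integral over cube s is approximated
    # by this density times the cube volume.
    norm = 1.0 / ((2.0 * np.pi) ** 3 * np.prod(sigmas))
    gauss = norm * np.exp(-0.5 * np.sum((u / sigmas) ** 2))
    return P_d * P_t * gauss * cube_volume + P_d * (1.0 - P_t) / n_cubes
```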
3.3 Probabilistic inference

p(u, v | PS^j) has to be calculated to infer the most probable copy segment. Note that

    p(u, v | PS^j) = p(u, v | {obv_i}_1^N)
                   = p(u, v, {obv_i}_1^N) / p({obv_i}_1^N)
                   = [ Σ_{f_1} ⋯ Σ_{f_N} p(u, v) Π_{i=1}^N p(f_i | f_{i-1}, u, v) p(obv_i | f_i) ] / p({obv_i}_1^N)        (8)

The marginal probability p(u, v, {obv_i}_1^N) can be effectively computed with the variable elimination method. Define

    p_i = p(f_i | u, v, f_{i-1}) p(obv_i | f_i);
    β_{i-1}(u, v, f_{i-1}) = Σ_{f_i} p_i Σ_{f_{i+1}} p_{i+1} ⋯ Σ_{f_N} p_N = Σ_{f_i} p_i β_i(u, v, f_i)        (9)

then

    p(u, v, {obv_i}_1^N) = Σ_{f_1} ⋯ Σ_{f_N} p(u, v) Π_{i=1}^N p_i
                         = p(u, v) Σ_{f_1} p_1 Σ_{f_2} p_2 ⋯ Σ_{f_N} p_N
                         = p(u, v) β_0(u, v, f_0)        (10)

The quantities β_{i-1}(u, v, f_{i-1}) are computed iteratively for i = N, N−1, ..., 1 to finally obtain p(u, v, {obv_i}_1^N).
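A minimal sketch of the backward recursion of Eqs. (9) and (10) is given below. It assumes p_hidden and p_obs are the local probability sketches given earlier with their model constants already bound (e.g., via functools.partial); the cost is O(N·S²) for S states, i.e., the N_p cubes plus the empty value.

```python
def marginal_uv(u, v, observations, n_cubes, p_hidden, p_obs, p_uv):
    """Sketch of Eqs. (9)-(10): p(u, v, {obv_i}) computed by variable elimination.

    observations: the obv_i values for i = 1..N.
    p_hidden(f_i, f_prev, i, u, v) and p_obs(obv_i, f_i) are the local probability
    functions; p_uv is the prior probability of the segment [u, v).
    """
    N = len(observations)
    states = list(range(n_cubes)) + [None]      # the N_p cubes plus the "empty" value
    beta = {f: 1.0 for f in states}             # beta_N(u, v, f_N) = 1
    for i in range(N, 0, -1):                   # i = N, N-1, ..., 1
        obv = observations[i - 1]
        beta = {
            f_prev: sum(p_hidden(f, f_prev, i, u, v) * p_obs(obv, f) * beta[f]
                        for f in states)
            for f_prev in states
        }
    # f_0 is a null variable, so beta_0 does not depend on its value
    return p_uv * beta[None]
```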
3.4 Model training

The parameters in the graphical model, such as P_d, P_t, and P_nd, can be obtained automatically by gathering statistical information on a training set, to achieve better performance. NT query videos with transformations and lengths similar to the test query videos were generated as the training set. For query Q_i, there are CQ_i copy keyframes and NCQ_i non-copy keyframes. After the similarity search, the similar keyframes are distributed among videos {R_i^j, 1 ≤ j ≤ NR_i} from the database. For the CQ_i copy keyframes, R_i^j contributes RCQ_ij similar keyframes, among which TRCQ_ij are true corresponding keyframes. For the NCQ_i non-copy keyframes, R_i^j contributes RNCQ_ij similar keyframes. Therefore,

    P_d  = Σ_{i=1}^{NT} Σ_{j=1}^{NR_i} RCQ_ij / Σ_{i=1}^{NT} CQ_i;
    P_t  = Σ_{i=1}^{NT} Σ_{j=1}^{NR_i} TRCQ_ij / Σ_{i=1}^{NT} Σ_{j=1}^{NR_i} RCQ_ij;        (11)
    P_nd = 1 − Σ_{i=1}^{NT} Σ_{j=1}^{NR_i} RNCQ_ij / Σ_{i=1}^{NT} NCQ_i
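Eq. (11) amounts to simple ratios of counts gathered on the training set. A hedged sketch, with the array names chosen here for illustration:

```python
import numpy as np

def learn_parameters(CQ, NCQ, RCQ, TRCQ, RNCQ):
    """Sketch of Eq. (11): estimate P_d, P_t, P_nd from training-set counts.

    CQ, NCQ: per-query counts of copy / non-copy keyframes (length NT).
    RCQ, TRCQ, RNCQ: per-query arrays of per-reference-video counts.
    """
    total_rcq = sum(np.sum(r) for r in RCQ)
    P_d = total_rcq / np.sum(CQ)
    P_t = sum(np.sum(t) for t in TRCQ) / total_rcq
    P_nd = 1.0 - sum(np.sum(r) for r in RNCQ) / np.sum(NCQ)
    return P_d, P_t, P_nd
```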

4 Tests and Results

4.1 Test setup

The tests compare STSR[16], PFF[17], and this probabilistic spatiotemporal fusion method.


The system parameters were set in preliminary experiments, in which the keyframe sample rate for the database Sd, the keyframe sample rate for the query Sq, the size of the histogram D, and the number of similar frames returned Nsf were varied over reasonable ranges, and the settings producing the best performance with respect to the returned similar frames were selected. The offset ranges in the x, y, t directions were set to [−1000, 1000], [−1000, 1000], and [−10 000, 10 000]. The scales were set to x_s ∈ [0.5, 1.5], y_s ∈ [0.5, 1.5], and t_s = Sd/Sq. Tests were also used to evaluate the influence of different values of N_xs, N_ys, N_ts, N_xf, N_yf, and N_tf on the system performance, and those resulting in the best performance were selected. Tests were performed on the video set MUSCLE-VCD-2007[23] provided by CIVR07. It includes 100 hours of videos with 352 × 288 resolution and is designed for evaluating the performance of video copy detection systems. The query generation used 10 untransformed segments of 1 min length obtained from different videos in the database. Eight query videos were generated from each untransformed segment by applying different transformations mixing zoom in/out, shift, color adjustment, brightness adjustment, and mean smoothing blur, with the transform parameters listed in Table 1. In total, 80 query videos were generated. These videos formed the evaluation query set together with 15 query videos containing no copied segments. The model was trained with another 10 untransformed segments of 1 min length and 1 training query generated from each untransformed segment, i.e., NT = 10. The tests all used the parameters learned from this training set. The performance is measured by the precision p, which is the number of correctly matched keyframes divided by the total number of matched keyframes, the recall r, which is the number of correctly matched keyframes divided by the total number of keyframes in the copied segments, and the F1 score, which is the harmonic mean of the precision and the recall.
Table 1 Test transformation parameters
Transform            Parameters
Zoom                 {0.7, 1.3}
Shift                {−20, 20}
Color adjustment*    {−20, 20}
Brightness           {1.0, 1.2}
Smoothing times      {0, 3}
* The color adjustment denotes the increment in the R/G/B value of each pixel in the copied segment.
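For reference, the precision, recall, and F1 defined above can be computed from keyframe counts as follows; the counts in the example are made up.

```python
def f1_score(true_matched, total_matched, total_copy_keyframes):
    """Precision, recall, and F1 as defined above, from keyframe counts."""
    p = true_matched / total_matched
    r = true_matched / total_copy_keyframes
    return p, r, 2 * p * r / (p + r)

# e.g., 90 correct out of 95 matched keyframes, 100 copy keyframes in total
print(f1_score(90, 95, 100))   # (0.947..., 0.9, 0.923...)
```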

4.2 Comparison of the determination step

The first test focuses on evaluating the effectiveness of the PSTF approach. Since all steps in the system affect the result, the best way to compare the different approaches for the determination step is to compare their results on the same input descriptor pair sequence. Therefore, a copy keyframe recall was manually designated, with percentages from 0 to 100. Given the copy segment keyframe recall r_c, the similar keyframes to be returned are randomly selected, with the probability of obtaining a true corresponding keyframe being r_c. The effectiveness of the approaches for various scenes was evaluated using different copy keyframe recalls. For each r_c, a descriptor pair sequence was formed to evaluate the effectiveness of each approach. The results in Figs. 4-6 show that STSR achieves high recall but low precision, while PFF achieves high precision but low recall. PSTF has recall similar to STSR and precision similar to PFF, resulting in a great improvement in F1 compared with the other two approaches, especially when the copy keyframe recall is low. The significant improvement demonstrates that the exploration of the spatial and temporal consistency in the PSTF approach effectively reduces the false alarms and missed detections.
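The simulated pairing used in this test can be sketched roughly as below; the number of distractor keyframes and the data layout are not specified in the text and are assumptions of this illustration only.

```python
import random

def simulate_pairs(copy_keyframes, r_c, noise_pool, n_noise=2):
    """Rough sketch of the simulated pairing: for each copy keyframe, the true
    corresponding keyframe is returned with probability r_c, plus random noise pairs."""
    sequence = []
    for kf in copy_keyframes:
        returned = []
        if random.random() < r_c:                       # true match kept with probability r_c
            returned.append(kf["true_match"])
        returned += random.sample(noise_pool, n_noise)  # distractor keyframes
        sequence.append(returned)
    return sequence
```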

[Fig. 4 Precision of STSR, PFF, and PSTF]
[Fig. 5 Recall of STSR, PFF, and PSTF]
[Fig. 6 F1 score of STSR, PFF, and PSTF]

4.3 System level effectiveness

In the first test, the descriptor pair sequences are randomly generated to simulate a designated copy keyframe recall. This provides insight into the effectiveness of PSTF for different copy keyframe recalls. However, the sequences generated may differ from the sequences obtained in a real video copy detection system. For example, the similarity search returns highly similar descriptors, while the simulation randomly selects descriptors, so the true copy descriptors in the simulated sequences tend to have lower similarity scores. The second test therefore evaluates complete systems with the different determination approaches on the query set of the first test. The results presented in Table 2 show that PSTF has an F1 8% higher than the other two methods. The precision and recall show the same trends as in the first test. These results confirm the advantages of using PSTF in the system.
Table 2 System level effectiveness
Method   Recall    F1        Precision
STSR     0.9132    0.8653    0.8222
PFF      0.7553    0.8590    0.9956
PSTF     0.9052    0.9442    0.9867

4.4 Discussion

Both tests show that STSR achieves high recall but low precision, while PFF achieves high precision but low recall. The STSR approach can use stricter constraints to determine the copied segment, such as smaller cubes for the voting. This eliminates more inconsistent pairs, which increases the precision but reduces the recall. With PFF, the transition probability function can be adjusted to make it more robust to highly scored false keyframes, but the identified copied segments may then be extended at the ends. The tests show that PSTF achieves both high recall and precision, resulting in a high F1, which is a tradeoff between precision and recall.

5 Conclusions

A probabilistic spatiotemporal fusion approach was developed which embeds spatial and temporal consistency into a probabilistic graphical model with appropriate probability functions. The variable elimination algorithm efficiently estimates the location of the copied segment. An effective local descriptor and a two-level descriptor pairing method are used to build a video copy detection system to evaluate the segment estimation approach. Tests show that the system is very effective, outperforming the typical voting approach and a probabilistic framework based on the Hidden Markov Model.

References

[1] Wu Xiao, Hauptmann A G, Ngo C-W. Practical elimination of near-duplicates from web video search. In: Proceedings of the 15th International Conference on Multimedia. Augsburg, Germany, 2007: 218-227.
[2] Guidelines for TRECVID 2008 Evaluation. http://www-nlpir.nist.gov/projects/tv2008/tv2008.htm, 2008.
[3] Kraaij W. TRECVID-2009 content-based copy detection task overview. In: Proceedings of TRECVID Workshop, 2009.
[4] Liu Lu, Lai Wei, Hua Xiansheng, et al. Video histogram: A novel video signature for efficient web video duplicate detection. In: Proceedings of the 13th International Multimedia Modeling Conference. Singapore, 2006: 94-103.
[5] Harris C, Stephens M. A combined corner and edge detector. In: Proceedings of the Fourth Alvey Vision Conference. Manchester, UK, 1988: 147-151.
[6] Lowe D G. Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. Kerkyra, Greece, 1999: 1150-1157.
[7] Bay H, Tuytelaars T, Van Gool L. SURF: Speeded up robust features. In: Proceedings of the 9th European Conference on Computer Vision. Graz, Austria, 2006: 404-417.
[8] Datar M, Immorlica N, Indyk P, et al. Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry. Brooklyn, NY, USA, 2004: 253-262.
[9] Coskun B, Sankur B, Memon N. Spatio-temporal transform-based video hashing. IEEE Transactions on Multimedia, 2006, 8(6): 1190-1208.
[10] Arya S, Mount D M, Netanyahu N S, et al. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 1998, 45(6): 891-923.
[11] Joly A, Frelicot C, Buisson O. Robust content-based video copy identification in a large reference database. In: Proceedings of the Second International Conference on Image and Video Retrieval. Urbana-Champaign, IL, USA, 2003: 511-516.
[12] Jegou H, Douze M, Schmid C. Hamming embedding and weak geometric consistency for large scale image search. In: Proceedings of the 10th European Conference on Computer Vision. Marseille, France, 2008: 304-317.
[13] Joly A, Buisson O, Frelicot C. Content-based copy detection using distortion-based probabilistic similarity search. IEEE Transactions on Multimedia, 2007, 9(2): 293-306.
[14] Law-To J, Buisson O, Gouet-Brunet V, et al. Robust voting algorithm based on labels of behavior for video copy detection. In: Proceedings of the 14th Annual ACM International Conference on Multimedia. Santa Barbara, CA, USA, 2006: 835-844.
[15] Law-To J, Chen Li, Joly A, et al. Video copy detection: A comparative study. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval. Amsterdam, The Netherlands, 2007: 371-378.
[16] Chen Shi, Wang Tao, Wang Jinqiao, et al. A spatial-temporal-scale registration approach for video copy detection. In: Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing. Tainan, China, 2008: 407-415.
[17] Gengembre N, Berrani S-A. A probabilistic framework for fusing frame-based searches within a video copy detection system. In: Proceedings of the 7th International Conference on Content-Based Image and Video Retrieval. Niagara Falls, Canada, 2008: 211-220.
[18] Lauritzen S L. Graphical Models. USA: Oxford University Press, 1996.
[19] Jordan M I. Graphical models. Statistical Science, 2004, 19(1): 140-155.
[20] Hua Xiansheng, Chen Xian, Zhang Hongjiang. Robust video signature based on ordinal measure. In: Proceedings of 2004 International Conference on Image Processing. Singapore, 2004: 685-688.
[21] Mikolajczyk K, Schmid C. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(10): 1615-1630.
[22] Zhang Yongdong, Gao Ke, Tang Sheng, et al. TRECVID 2008 content-based copy detection by MCG-ICT-CAS. In: Proceedings of TRECVID Workshop, 2008.
[23] Law-To J, Joly A, Boujemaa N. MUSCLE-VCD-2007: A live benchmark for video copy detection. http://www-rocq.inria.fr/imedia/civr-bench/, 2007.
