Article

Early versus late fusion in semantic video analysis

Authors: Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders

MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia

November 2005

Pages 399 - 402

https://doi.org/10.1145/1101149.1101236

Published: 06 November 2005

Abstract

Semantic analysis of multimodal video aims to index segments of interest at a conceptual level. Reaching this goal requires analysis of several information streams, which must be fused at some point in the analysis. In this paper, we consider two classes of fusion schemes: early fusion and late fusion. The former fuses modalities in feature space; the latter fuses modalities in semantic space. We show by experiment on 184 hours of broadcast video data and for 20 semantic concepts that late fusion tends to give slightly better performance for most concepts. However, for those concepts where early fusion performs better, the difference is more significant.
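The distinction between the two schemes can be illustrated with a minimal sketch. It is not the authors' implementation: the logistic-regression scorer, the synthetic "visual" and "textual" features, and all function names below are assumptions made for illustration only. Early fusion concatenates the unimodal features and learns one concept classifier; late fusion learns one classifier per modality and then a second-stage classifier on their concept scores.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression scorer by gradient descent (stand-in classifier)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Return concept scores in (0, 1) for each row of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def early_fusion(visual, textual, y):
    """Fuse in feature space: concatenate features, then learn one classifier."""
    w = train_logreg(np.hstack([visual, textual]), y)
    return lambda v, t: predict_proba(w, np.hstack([v, t]))

def late_fusion(visual, textual, y):
    """Fuse in semantic space: per-modality scores feed a second-stage classifier."""
    w_v = train_logreg(visual, y)
    w_t = train_logreg(textual, y)

    def unimodal_scores(v, t):
        return np.column_stack([predict_proba(w_v, v), predict_proba(w_t, t)])

    # Simplification: the second stage is trained on in-sample unimodal scores.
    w_f = train_logreg(unimodal_scores(visual, textual), y)
    return lambda v, t: predict_proba(w_f, unimodal_scores(v, t))

# Synthetic data (hypothetical): one binary concept, two modalities.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n).astype(float)
visual = rng.normal(size=(n, 4)) + y[:, None]
textual = rng.normal(size=(n, 3)) + 0.5 * y[:, None]

early_scores = early_fusion(visual, textual, y)(visual, textual)
late_scores = late_fusion(visual, textual, y)(visual, textual)
```

Both functions return a scorer that takes visual and textual features and yields one concept score per segment. In this sketch late fusion trains three classifiers where early fusion trains one, which reflects the structural difference between the schemes rather than any result reported in the paper.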

Index Terms

Early versus late fusion in semantic video analysis
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Published In

MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia

November 2005

1110 pages

ISBN:1595930442

DOI:10.1145/1101149

General Chairs: Hongjiang Zhang (Microsoft Research Asia, China), Tat-Seng Chua (National University of Singapore, Singapore)

Program Chairs: Ralf Steinmetz (Technische Universitat Darmstadt, Germany), Mohan Kankanhalli (National University of Singapore, Singapore), Lynn Wilcox (FXPAL)

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM05: 2005 13th Annual ACM International Conference on Multimedia

November 6 - 11, 2005

Hilton, Singapore

Acceptance Rates

MULTIMEDIA '05 Paper Acceptance Rate 49 of 312 submissions, 16%;

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Bibliometrics

Total Citations: 505
Total Downloads: 3,149
Downloads (Last 12 months): 214
Downloads (Last 6 weeks): 33
