Multimodal Video Sentiment Analysis Using Audio and Text Data

Yanyan Wang *

School of Science, Zhongyuan University of Science, No.41 Zhongyuan Rd, Zhengzhou, China.

*Author to whom correspondence should be addressed.


Abstract

Video-sharing websites such as YouTube and TikTok are becoming increasingly popular. A reliable way to analyze a video’s sentiment would improve the user experience and help in designing better ranking and recommendation systems [1,2]. In this project, we used both the acoustic and the textual information of a video to predict its sentiment level. For audio data, we applied transfer learning, using a pre-trained VGGish model as a feature extractor to obtain abstract audio embeddings [6]. We then used the MOSI dataset [5] to fine-tune the VGGish model and achieved a test accuracy of 90% for binary classification. For text data, we compared a traditional bag-of-words model with an LSTM model. We found that the LSTM model with word2vec embeddings outperformed the bag-of-words model and achieved a test accuracy of 84% for binary classification.

Keywords: Video sentiment analysis, multimodal data, transfer learning, abstract feature extraction, text mining
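
As an illustration of the bag-of-words text representation mentioned in the abstract, the following is a minimal sketch in plain Python. It is not the author's pipeline: the helper names `build_vocab` and `bag_of_words` and the two toy sentences are assumptions made for this example only. It shows one property of the representation that may help explain why a sequence model such as an LSTM can do better: word order is discarded, so two sentences with opposite sentiment can map to the same vector.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(doc, vocab):
    """Return a count vector for doc; tokens outside vocab are ignored."""
    counts = Counter(doc.lower().split())
    # vocab preserves insertion (sorted) order, so position matches the index
    return [counts.get(tok, 0) for tok in vocab]


# Toy sentences (hypothetical, not taken from the MOSI dataset)
positive = "the movie was good not bad"
negative = "the movie was bad not good"

vocab = build_vocab([positive, negative])
vec_pos = bag_of_words(positive, vocab)
vec_neg = bag_of_words(negative, vocab)

# Both sentences contain the same words, so their count vectors coincide
same_vector = vec_pos == vec_neg
```

Here `same_vector` is `True` even though the two sentences express opposite opinions. A classifier that sees only these counts cannot tell them apart, whereas a model that reads tokens in order can in principle use the position of "not".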


How to Cite

Wang, Yanyan. 2021. “Multimodal Video Sentiment Analysis Using Audio and Text Data”. Journal of Advances in Mathematics and Computer Science 36 (7):30-37. https://doi.org/10.9734/jamcs/2021/v36i730381.
