Multimodal Video Sentiment Analysis Using Audio and Text Data
Yanyan Wang *
School of Science, Zhongyuan University of Science, No.41 Zhongyuan Rd, Zhengzhou, China.
*Author to whom correspondence should be addressed.
Abstract
Video-sharing platforms such as YouTube and TikTok are becoming increasingly popular. A reliable way to analyze a video's sentiment would improve the user experience and help in designing better ranking and recommendation systems [1,2]. In this project, we used both the acoustic and the textual information of a video to predict its sentiment level. For audio data, we applied transfer learning, using a pre-trained VGGish model as a feature extractor to obtain abstract audio embeddings [6]. We then fine-tuned the VGGish model on the MOSI dataset [5] and achieved a test accuracy of 90% for binary classification. For text data, we compared a traditional bag-of-words model with an LSTM model. The LSTM model with word2vec embeddings outperformed the bag-of-words model and achieved a test accuracy of 84% for binary classification.
Keywords: Video sentiment analysis, multimodal data, transfer learning, abstract feature extraction, text mining
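As a rough illustration of the bag-of-words baseline named in the abstract, the following minimal sketch turns an utterance into a vector of token counts. The whitespace tokenizer, the function names `build_vocab` and `bag_of_words`, and the toy sentences are assumptions made for illustration only; they are not taken from the paper or from the MOSI dataset.

```python
from collections import Counter


def build_vocab(texts):
    """Map each distinct lowercase token to a column index, in first-seen order."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Return a count vector for `text`; out-of-vocabulary tokens are ignored."""
    counts = Counter(text.lower().split())
    vec = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:
            vec[vocab[token]] = n
    return vec


# Toy utterances (hypothetical, not from MOSI): same words, different order.
texts = ["the movie was not good", "good the movie was not"]
vocab = build_vocab(texts)
v1 = bag_of_words(texts[0], vocab)
v2 = bag_of_words(texts[1], vocab)
```

Both toy utterances map to the same count vector because the representation discards word order. A sequence model such as an LSTM reads the tokens in order, which is one plausible reason for the gap between the two text models reported in the abstract.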