Chonbuk National University,
Keywords: music-video multimodal network, fine-tuned network, emotion analysis, generalized mean
Summary: A multimodal neural network for music-video emotion analysis performs better when it uses 2D convolutions to extract music information and 3D convolutions to extract video information. We start from pre-trained networks, fine-tune some of their top layers, and gather both low-level and high-level features. These features are then passed through a recurrent neural network to preserve the time-varying characteristics of the audio and video. The proposed network is evaluated on test datasets, and the results are visualized using various metrics.
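The pipeline described in the summary can be illustrated with a minimal PyTorch sketch: a 2D-convolutional branch for an audio spectrogram, a 3D-convolutional branch for a video clip, a recurrent layer over the time axis of each branch, and a fused classifier. This is not the authors' implementation. The class name `MusicVideoEmotionNet`, the helper `freeze`, the layer sizes, the choice of GRU, the number of emotion classes, and the input shapes are all assumptions made for illustration. The convolution layers here are randomly initialized stand-ins for pre-trained networks, and the gathering of low-level and high-level features is not modeled.

```python
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> None:
    """Stop gradient updates for every parameter in `module` (hypothetical helper)."""
    for param in module.parameters():
        param.requires_grad = False


class MusicVideoEmotionNet(nn.Module):
    """Illustrative two-branch network; all sizes are assumed, not from the paper."""

    def __init__(self, num_classes: int = 6, channels: int = 16, hidden: int = 32):
        super().__init__()
        # Audio branch: 2D convolution over a (frequency, time) spectrogram.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse frequency, keep time
        )
        # Video branch: 3D convolution over (time, height, width) frames.
        self.video_cnn = nn.Sequential(
            nn.Conv3d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse space, keep time
        )
        # Recurrent layers keep the time-varying structure of each modality.
        self.audio_rnn = nn.GRU(channels, hidden, batch_first=True)
        self.video_rnn = nn.GRU(channels, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, 1, freq, time) -> (batch, time, channels)
        a = self.audio_cnn(audio).squeeze(2).transpose(1, 2)
        # video: (batch, 3, frames, height, width) -> (batch, frames, channels)
        v = self.video_cnn(video).flatten(2).transpose(1, 2)
        _, h_a = self.audio_rnn(a)  # h_a: (num_layers, batch, hidden)
        _, h_v = self.video_rnn(v)
        # Fuse the final hidden state of each modality and classify.
        fused = torch.cat([h_a[-1], h_v[-1]], dim=1)
        return self.classifier(fused)


model = MusicVideoEmotionNet()
# Mimic fine-tuning only the upper layers: freeze the convolutional front ends.
freeze(model.audio_cnn[0])
freeze(model.video_cnn[0])

audio = torch.randn(2, 1, 40, 50)       # assumed: 40 frequency bins, 50 time steps
video = torch.randn(2, 3, 8, 32, 32)    # assumed: 8 RGB frames of 32x32 pixels
logits = model(audio, video)            # shape (2, 6)
```

Freezing the convolutional layers while leaving the recurrent and classifier layers trainable is one simple way to express "fine-tune only the top layers"; which layers the authors actually unfreeze is not stated in the summary.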