Music Audio and Motion Capture (MAMC) Project:
Music performance AI with cross-modal and cross-cultural data

Left: 3-D body motion capture of a performing musician; Right: analysis of the performance audio
Music Audio and Motion Capture (MAMC) is a two-year project funded by Academia Sinica that applies big data and deep learning methods 1) to construct a large-scale music performance dataset, and 2) to model the connection between elementary and high-level musical elements in cross-modal (music acoustics and musicians' body movement) and cross-cultural (Western and Asian music) music performance. MAMC is a sub-project under Automatic Music Concert Animation (AMCA), a core project supported by the Institute of Information Science, Academia Sinica, which aims at generating multimedia content using artificial intelligence (AI).

This project systematically collected and recorded a professional-level, large-scale MAMC dataset covering diverse types of music performance (6,420 recordings by 64 professional musicians, more than 1,000,000 musical notes, and more than 230,000 seconds of recorded material). High-quality cross-modal data were collected, including 3D motion capture of 32 body joints of each musician, high-resolution audio recordings, and the musicians' note-by-note professional annotations of high-level aesthetic semantics. Innovative technologies including a 3D motion capture system, Music Information Retrieval (MIR) techniques, and deep learning models such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Sequence Generative Adversarial Networks (SeqGAN) are applied to analyze the musicians' performed audio and body movement.
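As an illustration of how such cross-modal data can be brought onto a common time axis before being fed to deep learning models, the sketch below aligns motion capture frames with audio analysis frames. This is a minimal, hypothetical example rather than the project's actual pipeline: the frame rates, the 32-joint layout, and the use of librosa log-mel features are illustrative assumptions.

```python
# Hypothetical sketch: aligning 3D motion capture frames (32 joints) with
# audio analysis frames so both modalities share a common time axis.
# Frame rates, feature choices, and shapes are illustrative assumptions.
import numpy as np
import librosa

def load_aligned_features(audio_path, mocap_xyz, mocap_fps=120.0,
                          sr=44100, hop_length=512):
    """Return (audio_frames, mocap_frames) with one row per analysis frame.

    mocap_xyz: array of shape (n_mocap_frames, 32, 3) holding the 3-D
    positions of 32 body joints captured at `mocap_fps`.
    """
    y, _ = librosa.load(audio_path, sr=sr)
    # Log-mel spectrogram as a generic audio representation.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length)
    audio_frames = librosa.power_to_db(mel).T              # (n_frames, n_mels)

    # Time stamp of each audio frame, then pick the nearest mocap frame.
    frame_times = librosa.frames_to_time(
        np.arange(audio_frames.shape[0]), sr=sr, hop_length=hop_length)
    mocap_idx = np.clip((frame_times * mocap_fps).round().astype(int),
                        0, len(mocap_xyz) - 1)
    mocap_frames = mocap_xyz[mocap_idx].reshape(len(mocap_idx), -1)  # (n_frames, 96)
    return audio_frames, mocap_frames
```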

The large-scale, systematic, and cross-modal collection of high-quality music acoustic and motion capture data in this project is unprecedented in previous research. Such a large-scale, long-term construction of a professionally annotated dataset is the essential basis for a breakthrough in current music information retrieval research: it raises the level of analysis from low-level elements to high-level semantics and aesthetics, and serves as the foundation for comparing diverse performance versions in music big data. Collaborating with experienced researchers in Taiwan and the UK, experts in diverse disciplines including musicology, information science, animation engineering, and biomechanics contribute their domain knowledge to the analysis of music performance. The deep learning models in this research are developed into an AI module for music performance analysis and assessment, which can be applied to musicians' training.

Music Audio and Motion Capture (MAMC) is a two-year research project funded by Academia Sinica. Its aims are: 1) to build a large-scale research dataset of music performance, including high-accuracy 3-D motion capture of performers' body movements, high-quality performance recordings, and professional musicians' semantic annotations of the recorded performances; and 2) to develop cross-modal and cross-cultural models that connect micro-level audio/movement signal features with macro-level musical semantics. MAMC is a sub-project of Automatic Music Concert Animation (AMCA), a core project funded by the Institute of Information Science, Academia Sinica, which aims to generate multimedia content using artificial intelligence.

This project recorded more than six thousand music performance audio and video recordings, covering professional musicians' performance movements and audio in Western classical music (piano, violin, cello) and Asian music (guqin, sitar). Music Information Retrieval techniques are used to analyze the audio files, while deep learning models, including Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Generative Adversarial Networks (GAN), are used to build cross-modal models linking performance audio and body movement.

The large-scale, systematic collection and analysis of cross-modal music performance data in this project provides an essential foundation for comparing different performance versions within music performance big data. The research is conducted with an international team, including scholars from the Department of Music and the Department of Sport Science at the University of Edinburgh, the Department of Physical Therapy at National Yang Ming Chiao Tung University, the Department of Traditional Music at Taipei National University of the Arts, and the Institute of Information Science at Academia Sinica.

Guqin performance analysis using Music Information Retrieval (MIR) techniques

Left: classification of left-hand guqin playing techniques; Right: correspondence between playing techniques and the pentatonic mode structure

This research applies Music Information Retrieval and deep learning techniques to analyze Guqin music, and aims to explore the connection between the global tonal structure of a piece and the playing techniques used in Guqin performance. Incorporating mode and playing technique analysis, the study demonstrates that the structural relationship between notes is crucial for detecting mode, and that this information also provides additional guidance for local-level playing technique detection. A new dataset was compiled for Guqin performance analysis. Mode detection is achieved by pattern matching, and the predicted results are combined with audio features as input to a Convolutional Neural Network for playing technique detection. The results show that global and local features are interconnected in Guqin music. Our analysis identifies the key components that affect the recognition of mode and playing technique, and challenging cases arising from the unique properties of the Guqin audio signal are discussed for further research.

This study uses Music Information Retrieval (MIR) techniques and AI models to analyze the relationship between the modal structure of the Chinese pentatonic scale and left-hand guqin playing techniques. Historical guqin performance recordings were collected and annotated by experts. A pattern-matching algorithm was developed to automatically analyze the modal structure of a piece from its audio recording, and left-hand playing techniques were automatically classified with a Convolutional Neural Network. The results show that the distribution of left-hand playing techniques corresponds to the overall modal structure of the piece; the study also discusses the distinctive acoustic characteristics that set the guqin apart from other plucked string instruments.
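The sketch below is a minimal illustration, under assumed representations, of the two-stage approach described above: a pitch-class histogram of the piece is matched against templates of the five pentatonic modes (gong, shang, jue, zhi, yu), and the predicted mode is then appended to per-note audio features before the playing technique classifier. The template construction, matching score, and feature shapes are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of mode detection by template matching on a pitch-class
# histogram, and of "conjoining" the predicted mode with per-note audio
# features for the playing-technique classifier. Not the actual implementation.
import numpy as np

GONG_SCALE = [0, 2, 4, 7, 9]                   # gong-mode pentatonic degrees (semitones)
MODE_NAMES = ["gong", "shang", "jue", "zhi", "yu"]

# Build a 12-bin template for each mode by rotating the scale to its tonic.
MODE_TEMPLATES = {}
for tonic, name in zip(GONG_SCALE, MODE_NAMES):
    degrees = sorted((d - tonic) % 12 for d in GONG_SCALE)
    template = np.zeros(12)
    template[degrees] = 1.0
    MODE_TEMPLATES[name] = template

def detect_mode(pitch_class_histogram):
    """Return (mode_name, key) whose template best matches the normalized
    12-bin pitch-class histogram of the whole piece."""
    best, best_score = None, -np.inf
    for name, template in MODE_TEMPLATES.items():
        for key in range(12):                  # try all transpositions
            score = float(np.dot(pitch_class_histogram, np.roll(template, key)))
            if score > best_score:
                best, best_score = (name, key), score
    return best

def conjoin_mode_with_features(note_features, mode_index, n_modes=5):
    """Append a one-hot mode vector to each note's audio feature vector,
    forming the input of the playing-technique classifier."""
    one_hot = np.zeros(n_modes)
    one_hot[mode_index] = 1.0
    return np.hstack([note_features, np.tile(one_hot, (len(note_features), 1))])
```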

Expressive semantics in orchestral conducting movement

Left: 3-D motion capture of an orchestral conductor; Right: recognition of expressive semantics in conducting

In this research, we explore an approach for finding effective descriptors of musical features in conducting movement, and for extracting complex expressive semantics from elementary conducting kinematic variations in a valid musical context. The study proposes a multi-task learning model that jointly identifies dynamic, articulation, and phrasing cues from conducting kinematics. A professional conducting movement dataset was compiled using a high-resolution motion capture system. The ReliefF algorithm is applied to select significant features from the conducting movement, and a recurrent neural network (RNN) is implemented to identify multiple movement cues. The experimental results reveal the key elements of conducting movement that communicate musical expressiveness, and highlight the advantage of multi-task learning in the complete musical context over single-task learning. To the best of our knowledge, this is the first attempt to explore multiple expressive semantic cues in conducting movement kinematics using a recurrent neural network.

This study explores the movement semantics of orchestral conductors. 3-D motion capture technology is used to record conductors' body movements and to analyze the kinematic features of conducting motion. A feature selection algorithm identifies the principal features of the movement, which are then automatically classified by a Recurrent Neural Network to recognize the phrasing, dynamics, and articulation cues conveyed in the conducting movement.
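A minimal PyTorch sketch of the multi-task idea described above follows: a shared recurrent encoder over ReliefF-selected kinematic features, with separate classification heads for dynamic, articulation, and phrasing cues. The layer sizes, class counts, and pooling choice are assumptions for illustration rather than the model reported in the study.

```python
# Hypothetical PyTorch sketch of a multi-task recurrent model over conducting
# kinematics: one shared encoder, three heads for dynamic, articulation, and
# phrasing cues. Layer sizes and class counts are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskConductingRNN(nn.Module):
    def __init__(self, n_features=64, hidden=128,
                 n_dynamic=3, n_articulation=2, n_phrasing=2):
        super().__init__()
        # Shared recurrent encoder over ReliefF-selected kinematic features.
        self.encoder = nn.GRU(input_size=n_features, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        # One classification head per expressive cue (multi-task learning).
        self.dynamic_head = nn.Linear(2 * hidden, n_dynamic)
        self.articulation_head = nn.Linear(2 * hidden, n_articulation)
        self.phrasing_head = nn.Linear(2 * hidden, n_phrasing)

    def forward(self, x):
        # x: (batch, time, n_features) sequences of kinematic descriptors.
        out, _ = self.encoder(x)
        pooled = out.mean(dim=1)               # simple temporal pooling
        return (self.dynamic_head(pooled),
                self.articulation_head(pooled),
                self.phrasing_head(pooled))

# Training would sum one cross-entropy loss per head, e.g.
#   loss = ce(dyn_logits, dyn_y) + ce(art_logits, art_y) + ce(phr_logits, phr_y)
```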