eXCube2: Explainable Brain-Inspired Spiking Neural Network Framework for Emotion Recognition from Audio, Visual, and Multimodal Audio-Visual Data
This paper introduces a biomimetic framework and novel brain-inspired AI (BIAI) models based on spiking neural networks (SNNs) for emotion recognition from audio (speech), visual (face), and integrated multimodal audio-visual data. The developed framework, named eXCube2, uses a three-dimensional SNN that is spatially structured according to a human brain template. The BIAI models developed in eXCube2 are trainable on spatio- and spectro-temporal data using brain-inspired learning rules. These models are explainable, in that they reveal patterns in the data, and are adaptable to new data. The eXCube2 models are implemented as software systems and tested on speech and video recordings of subjects expressing emotional states. The use of a brain template for the SNN structure enables brain-inspired tonotopic and stereo mapping of audio inputs, topographic mapping of visual data, and the combined use of both modalities. This novel approach not only brings AI-based emotion recognition closer to human perception, but also results in higher accuracy and better explainability than existing AI systems. This is demonstrated through experiments on benchmark datasets, achieving classification accuracy above 80% on single-modality data and 90% when multimodal audio-visual data are used and a “don’t know” output is introduced. The paper further discusses possible applications of the proposed eXCube2 framework to other audio, visual, and audio-visual data for solving challenging problems, such as recognizing the emotional states of people of different origins; brain state diagnosis (e.g., Parkinson’s disease, Alzheimer’s disease, ADHD, dementia); measuring response to treatment over time; evaluating satisfaction responses from online clients; human–robot interaction; chatbots; and interactive computer games. The SNN-based implementation of BIAI also enables the use of neuromorphic chips and platforms, leading to reduced power consumption, smaller device size, higher performance accuracy, and improved adaptability and explainability.
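The abstract does not specify how the “don’t know” output is produced; a minimal sketch, assuming it acts as a confidence threshold applied to the fused audio-visual class scores (so the system abstains rather than guessing when no emotion class is sufficiently supported), could look like the following. All names, the score values, and the threshold are illustrative, not taken from the paper.

```python
import numpy as np

def classify_with_reject(class_scores, labels, threshold=0.6):
    """Map fused class scores to an emotion label, or to "don't know"
    when the top score falls below a confidence threshold.

    class_scores : 1-D array of per-class scores (assumed normalized to sum to 1)
    labels       : emotion class names, in the same order as the scores
    threshold    : minimum top-class confidence required to commit to a label
    """
    scores = np.asarray(class_scores, dtype=float)
    best = int(np.argmax(scores))
    if scores[best] < threshold:
        return "don't know"          # abstain instead of guessing
    return labels[best]

# Illustrative usage with hypothetical fused audio-visual scores
emotions = ["neutral", "happy", "sad", "angry"]
print(classify_with_reject([0.10, 0.72, 0.08, 0.10], emotions))  # -> happy
print(classify_with_reject([0.30, 0.28, 0.22, 0.20], emotions))  # -> don't know
```

Under such a reject option, accuracy is computed only over the samples on which the system commits to a label, which is one plausible reading of how the above-90% multimodal figure relates to the “don’t know” output.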