ABSTRACT:
To improve the efficiency of multimodal fusion in human-robot interaction (HRI), an improved technique for combining visual and auditory data is proposed. The robotic auditory system acquires audio with a microphone array, estimates the azimuth of the sound source with the MUSIC algorithm, and recognizes speech with an end-to-end gated CNN. The visual system uses a two-layer neural network to detect and recognize dynamic gestures. An improved D-S evidence theory algorithm based on a rule-based intention voter is designed to fuse the outputs of the two modules and determine the intention of the current interactive object. Experimental results validate the efficiency and accuracy of the multimodal fusion system.
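
The paper's improved, rule-voter-based fusion is not specified in the abstract; the following is only a minimal sketch of the classical Dempster-Shafer combination rule that such a fusion builds on, assuming hypothetical intent labels ("grasp", "point") and mass values for the speech and gesture modules.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset hypotheses to masses)
    using the classical Dempster rule. Illustrative only; the paper's improved
    rule-based intention voter is not reproduced here."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("Total conflict: sources cannot be combined")
    # Normalize by the non-conflicting mass
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

# Hypothetical outputs from the speech and gesture recognition modules
speech = {frozenset({"grasp"}): 0.7, frozenset({"grasp", "point"}): 0.3}
gesture = {frozenset({"grasp"}): 0.6, frozenset({"point"}): 0.4}
print(dempster_combine(speech, gesture))  # {"grasp"}: ~0.83, {"point"}: ~0.17
```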