
Problem Definition

Domotics and smart-home technologies will make it possible to connect more and more domestic devices (e.g. cameras, air conditioning, alarm systems, appliances, hi-fi, TV) to a common network. Users will be able to access information at any time through any means of communication (telephone, personal computer, TV). Voice interaction and the automatic monitoring of acoustic information in the environment are important directions to pursue in order to exploit the forthcoming interactive technology platforms. The DICIT project focuses on the problems of acoustic scene analysis and speech interaction in noisy and reverberant environments by means of microphone networks.

State of the Art

According to recent studies on human-machine interfaces, augmenting remote control devices with speech is an important opportunity worth investigating. In recent years, new devices have been introduced to the market that integrate remote control and speech recognition to control a set-top box (STB) for a TV platform. However, these devices are not easy to use: the set of admitted commands is quite restricted, the system is trained to work with a single specific user, and the maximum user-device distance at which recognition performance remains acceptable is rather short. Voice interaction at a distance from the microphones is a challenging task because of the disturbing effects of room acoustics (reflections, reverberation), background noise and overlapping acoustic events. Moreover, when interacting with a TV setup, the (possibly multichannel) sound produced by the TV itself needs to be compensated for. Effective acoustic scene analysis, able to classify and interpret a broad range of acoustic events, would also significantly enhance the performance of currently available automatic surveillance systems based on audio/video sensors.

Research Objectives and Technologies

One of the most challenging objectives of DICIT is the development of distant-talking interaction between a user (or multiple users) and a device. For this purpose, a sensor network composed of one or more microphone arrays will be used. The acquisition module will be integrated with several signal preprocessing steps: acoustic event detection, acoustic source localization, acoustic source separation, beamforming and multi-channel echo cancellation. The objective is to obtain an input signal of sufficient quality to ensure satisfactory automatic speech recognition performance while allowing the use of flexible language. This will require robust acoustic and language modeling, achieved by means of speaker identification and corresponding adaptation. Natural language understanding, mixed-initiative dialogue management and response generation are the other fundamental components of the foreseen smart interface. Another critical objective of DICIT is continuous acoustic monitoring by the system. This will require smart acoustic scene analysis, involving detection, localization, classification and interpretation of all possible acoustic events, including those produced by non-stationary disturbances.
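To make two of the preprocessing steps listed above more concrete, the following minimal sketch illustrates time-delay estimation between two microphones (a building block of source localization) and delay-and-sum beamforming. It is a textbook illustration only, not the DICIT implementation: the function names, the integer-sample delays and the brute-force cross-correlation are all assumptions chosen for readability.

```python
def estimate_delay(ref, sig, max_lag):
    """Estimate how many samples `sig` lags behind `ref`.

    Brute-force cross-correlation over lags in [-max_lag, max_lag];
    a positive result means `sig` is a delayed copy of `ref`.
    (Hypothetical helper, not part of any DICIT software.)
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += r * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(channels, delays):
    """Align each channel by its integer delay, then average.

    `channels` is a list of equal-length sample lists and `delays[i]`
    is the lag of channel i relative to the reference channel.
    Samples shifted in from outside the signal are treated as zero.
    """
    length = len(channels[0])
    out = [0.0] * length
    for channel, delay in zip(channels, delays):
        for n in range(length):
            m = n + delay
            if 0 <= m < length:
                out[n] += channel[m]
    return [value / len(channels) for value in out]
```

In use, the delay of each microphone relative to a reference microphone would first be obtained with `estimate_delay`, and the resulting list of delays passed to `delay_and_sum`. Averaging the aligned channels reinforces the sound coming from the estimated direction while uncorrelated noise partially cancels. A real front-end would typically use fractional delays and frequency-domain processing instead.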

Foreseen Prototypes

Three prototype systems are foreseen:

  • initial PC-based STB prototype, which comprises: a TV tuner for program selection, a display for TV viewing and visual feedback to the user, a microphone array and audio acquisition board for voice input, an infrared interface for remote control input, a speaker system for program audio and speech synthesis output, a DVD player/recorder for handling video content, and a network connection to support real access to TV program information. These devices are typically available as standard options or adapter cards from several commercial manufacturers.
  • final STB prototype, which incorporates an actual TV set-top box, preserving the overall functionality of the initial STB prototype but implemented as a software/hardware package more suitable for commercial realization. The list of software modules and internal interfaces is expected to remain relatively unchanged when moving to a set-top platform.
  • acoustic-based surveillance prototype, based on integrating the multi-microphone front-end and the acoustic scene analysis components into the given architecture. Two sets of acoustic events will be defined: one representing contexts normally observed in a living-room environment, and the other corresponding to situations that may warrant triggering an alarm.
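The two event sets foreseen for the surveillance prototype could be represented along the following lines. This is a sketch under stated assumptions: the event labels, the `DetectedEvent` type, the `should_trigger_alarm` function and the confidence threshold are all hypothetical placeholders, since the project text does not specify the actual event inventory or decision rule.

```python
from dataclasses import dataclass

# Hypothetical labels: the real event sets are to be defined by the project.
NORMAL_EVENTS = frozenset({"speech", "tv_audio", "door_closing", "footsteps"})
ALARM_EVENTS = frozenset({"glass_breaking", "scream", "smoke_alarm"})


@dataclass(frozen=True)
class DetectedEvent:
    label: str         # class assigned by the acoustic scene analysis
    confidence: float  # classifier score in [0, 1] (assumed convention)


def should_trigger_alarm(event, threshold=0.8):
    """Raise an alarm only for alarm-set events detected with enough confidence."""
    return event.label in ALARM_EVENTS and event.confidence >= threshold
```

Keeping the two sets disjoint and explicit makes the alarm policy easy to audit. The threshold trades missed alarms against false alarms and would need tuning on recorded data.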

Market Potential

The technologies for acoustic scene analysis and distant speech interaction developed within DICIT should be regarded as a potential framework for a much larger variety of applications. Voice-interactive technologies can be applied to in-car voice-enabled systems, a rather adverse context for automatic speech recognition (ASR) technology. As for interactive TV and related services, speech and dialogue capabilities are particularly in demand in high-end STB models, which are a growing segment of the world market. Finally, acoustic scene analysis technology is expected to have an impact in the sector of acoustic-based surveillance, where a multi-microphone acquisition front-end can be combined with visual/motion detection sensors.