

DICIT acoustic WOZ data (Dec 2009)
FBK-irst has conducted Wizard of Oz experiments with the aim of collecting useful data for testing signal processing algorithms in the scenario foreseen by the DICIT project.

More info and download here!


Achievements during the third year (Oct, 2008 - Sept, 2009)

  • The final STB prototype was realized, including new multi-microphone front-end, ASR, and NLU components that are more robust and effective than those of the previous prototype. Improved screens and dialogue flow (both with speech and RC) were implemented based on experience from the evaluation of the first prototype. Finally, new STB hardware and firmware were deployed, which enable fast channel switching, a Common Interface for scrambled programs, faster communication with the dialog manager, and new layouts for the OSD interface. A detailed description of the work can be found in the public deliverables D2.4 and D4.2.
  • A calibration procedure was defined to ensure a more coherent performance evaluation across different installations, sites, and environmental acoustic conditions; this required very accurate, standardized settings of all the hardware and software components (see the public deliverable D3.2).
  • Methodologies, criteria, and tools for the final STB prototype evaluation were addressed and defined based on the experience gained in the first evaluation. An improved annotation schema has been developed that supports both performance evaluation of several modules of the prototype and usability evaluation of the dialogue. The technical framework supporting this methodology has been extended.
  • An extensive usability evaluation of the final STB prototype with 172 subjects has been conducted. The prototype was tested in all three languages and at all partner sites (including IBM in the US). The results show a clear improvement in speed and speech recognition accuracy, which were two weaknesses of the first prototype. The subjective feedback from the participants was strongly positive and the objective metrics yielded satisfactory results. A detailed description of the final prototype evaluation can be found in the public deliverable D6.5.
  • A new speaker verification system that exploits the phoneme class segmentation provided by the speech recognition engine has been developed. The new system aims at improving the performance of the previous one especially in the case of short utterances. The system has been tested under the DICIT scenario examining in particular the effect of the residual echo (see D4.2).
  • A novel combination of Stereo Acoustic Echo Cancellation (2C-AEC) and Blind Source Separation (BSS) algorithms was investigated. A real-time system combining 2C-AEC with determined or underdetermined BSS has been set up. To compare the source localization capabilities of the DICIT prototype to the new system, a comparative analysis of the classical Global Coherence Field-based source localization and the BSS inherent source localization has been conducted (see D3.2).
  • Using and disseminating knowledge

    • DICIT was present at two international events, namely ICT Lyon 2008 and IFA 2009. In both cases, a real-time prototype was demonstrated throughout the event, under very challenging noisy conditions.
    • Invited talks describing the objectives and the main achievements were given at international conferences and workshops, for instance in a plenary talk at EUSIPCO 2009 and at JEITA 2008.
    • The project was presented at invited seminars, lectures, and conferences (e.g. SpeechTek) worldwide.
    • New videos were produced which introduce the project objectives, the most recent achievements, and the real use of the prototypes. They are available elsewhere on this web site.
    • The project was disseminated at the Researcher Nights held in Trento at FBK and in Erlangen at FAU.

Achievements during the second year (Oct, 2007 - Sept, 2008)

  • A real-time multi-microphone front-end was integrated. It reacts to distant-talking speech input by localizing the active speaker, beamforming towards the corresponding direction, performing acoustic echo cancellation to remove contributions from the loudspeakers, applying accurate speech activity detection, and finally providing a speech chunk as input to the recognizer.
  • A robust recognition engine was obtained. Its acoustic models were trained to take into account the specific conditions of the distant-talking interactive TV scenario, which include reverberation, background noise, the effects of beamforming, and residual echo after cancellation. Language modelling was set up for the English language by collecting more than 15000 written utterances from a panel of potential users; the Italian and German versions of the Language Model were then derived by translating and integrating the English one.
  • A multi-modal dialogue application for the targeted STB task was realized by using EBGuide Studio and EB Guide.
  • The first STB prototype was realized, which includes all the functionalities planned in the project workplan. The prototype supports three languages, namely English, German, and Italian, and it was tested and made ready for the evaluation campaign conducted with real users at the premises of two industrial partners, i.e., Amuser and Elektrobit. For this purpose, specific evaluation methodologies, criteria, and tools were addressed and defined: in particular, a technical framework to support evaluation activities was created.
  • Acoustic event detection components were developed for the surveillance prototype. The prototype, which consists of two main components, a PC platform and an Alarm Panel, was then integrated. An evaluation of the prototype system was conducted for three different scenarios: intrusion, false alarm for inside sounds, and false alarm for outside sounds. The evaluation confirmed the effectiveness of the prototype’s design and implementation.
  • The distant-talking speaker identification task was explored, and the related progress made it possible to develop a prototype embedding a speaker identification component in the given real-time multi-microphone front-end; the resulting prototype runs a simplified command-and-control recognition task for very noisy and reverberant contexts.
  • The DICIT consortium met the AMIDA, LUNA, and VISNET-II consortia to discuss methodologies and tools for evaluating some of the addressed technologies, for instance multi-microphone space-time audio processing, spoken language understanding, and spoken dialogue systems. As a result of this action, possible cooperation activities at the international level are foreseen, aiming at a joint effort on the evaluation of multi-microphone processing techniques.
  • Using and disseminating knowledge

    • A portion of the WOZ acoustic data was further annotated and is now ready for public distribution to the research communities working in the scientific fields addressed by DICIT.
    • Invited talks describing the objectives and the main achievements were given at international conferences and workshops, for instance at Acoustics 2008 and CASTNESS 2008; moreover, the project was publicized at other international events and fairs, such as IFA 2008 and Langtech 2008.
    • Over the two years, the project was presented worldwide at invited seminars (e.g., University of Southern California, Google, SpeechCycle, etc.) and lectures (e.g., several distinguished lectures given by Prof. Kellermann - FAU). Papers were accepted for presentation at international conferences and workshops based on peer review.
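
The front-end described at the top of this list ends with speech activity detection, which hands a speech chunk to the recognizer. The page does not say how the DICIT detector works, so the following is only a minimal sketch of one common approach, an energy threshold with a short hangover. The function names, the frame length, the threshold, and the hangover value are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame.

    A trailing partial frame is ignored.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech_chunks(
    samples: Sequence[float],
    frame_len: int = 160,      # assumed: 10 ms at 16 kHz
    threshold: float = 0.01,   # assumed energy threshold
    hangover: int = 2,         # quiet frames tolerated inside a chunk
) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) pairs for stretches of active frames.

    A chunk is closed only after more than `hangover` consecutive quiet
    frames, so short pauses do not split an utterance.
    """
    energies = frame_energies(samples, frame_len)
    chunks = []
    start = None   # index of the first frame of the open chunk, if any
    quiet = 0      # consecutive quiet frames seen inside the open chunk
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                end_frame = i - quiet + 1  # first quiet frame after speech
                chunks.append((start * frame_len, end_frame * frame_len))
                start = None
                quiet = 0
    if start is not None:
        # Signal ended while a chunk was open; drop trailing quiet frames.
        end_frame = len(energies) - quiet
        chunks.append((start * frame_len, end_frame * frame_len))
    return chunks
```

In a real distant-talking setting a fixed threshold would probably be too fragile, since background noise and residual echo change over time; an adaptive noise-floor estimate is a typical refinement.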

Achievements during the first year (Oct, 2006 - Sept, 2007)

  • A Market Study was conducted, based on a focus group methodology, to understand user needs and requirements in the interactive TV application context. The results of the study provided very useful information about the possible trend of this market, about the expectations of the users and, consequently, about the STB prototype design.
  • WOZ Experiments were conducted to better understand the way people would use the envisaged system, in order to design the first prototype user interface. Data were collected with two different methodologies and goals: dialogue data were aimed at characterizing user behaviour, vocabulary, language, etc.; acoustic data were collected to support experimental tasks, for instance speaker localization and speaker ID, for the development of the multi-microphone processing components.
  • The Architecture for the first DICIT prototypes was successfully defined along with the tools and standards to be used, as described in the public Deliverable D2.1. The initial integration of a first simple speech-based prototype was accomplished using the CIMA framework to run an SCXML application based on the DICIT dialog specification.
  • As to the Multi-microphone Front-end, a stereo Acoustic Echo Cancellation module was adapted to the first prototype requirements, with a performance increase compared to the given baseline; activities were conducted on the designs for beamforming and on concepts for its combination with multichannel AEC. Progress was achieved on speaker localization, in particular with the design of a new approach to handle potential multiple-speaker conditions, and on a speech activity detection component for real-time robust speech recognition in the given scenario.
  • A Distant-talking ASR baseline was developed as a reference to evaluate the impact of front-end processing. To train acoustic models, contaminated databases were produced for the given environmental acoustics and languages. Acoustic Model adaptation was shown to improve the expected ASR accuracy for the first DICIT prototype. Additionally, the IBM Embedded ViaVoice ASR engine was successfully integrated into the CIMA package, which will form the basis for the first prototype.
  • Several other key components were implemented for the Spoken Dialogue System of the first DICIT prototype, including the NLU engine, Dialog Manager, TTS engine, and GUI output. A strategy for the implementation of multimodality within this prototype was developed. The multimodal dialog flow was designed, including the prompts, the TV layouts, and the usage of the User Profile. The resulting components have been brought together under the CIMA framework, which provides much of the infrastructure for the prototype.
  • A Restricted Focus Group (RFG) of industries was set up. The purpose of this action is to have an external advisory team that follows the project and provides feedback on how technically sound the project is and how it may be redirected to better focus on the most relevant issues. From this activity, a possible exploitation action for the most innovative and promising technologies might eventually be identified.
  • Using and disseminating knowledge

    • Various Dissemination Activities have been conducted so far. An official and publicly available project web site has been established. At that site, most of the information about DICIT can be found and downloaded (the brochure, state-of-the-art documents, etc.). A public Deliverable (D7.1) is available for further information about the project as well as its dissemination and use plan.
    • The DICIT consortium attended the “DG INFSO-E1 ALL FP6 PROJECTS” Workshop held on 4-6 December 2006 at the European Commission, Luxembourg. The contribution to the event consisted of two oral presentations outlining some of the project objectives and the related state of the art.


August 2010

This video was realized by FBK, at the end of the third year, to summarize the main goals and the final achievements of the project.

September 2009

DICIT at IFA 2009 - The final DICIT prototype was presented at IFA 2009-TecWatch (Berlin, September 2009), an event that aimed to show the innovative technological potential of the world of digital media and modern home appliances. Note that also in this case the DICIT prototype provided satisfactory performance despite the very noisy operating conditions.

Download video clip (WMV 43.548 KB) | (m4v 42.592 KB)

December 2009

December 2008

This real-time command-and-control prototype was presented at ICT 2008. Besides speaker localization, stereo acoustic echo cancellation, beamforming, voice activity detector, and distant-talking speech recognition, it also includes quite accurate speaker identification capabilities.

November 2008

This video shows the prototype realized by the DICIT consortium during the first half of the project.

October 2008

Examples of signals in the DICIT front-end.


sound 1 Close-talk signal
sound 2 Single far microphone
sound 3 Signal after front-end processing

DICIT front-end schema

April 2008

Prototype based on FBK-irst technologies (presented by Marco Matassoni and Maurizio Omologo, FBK-irst). The system includes the following components: mono acoustic echo cancellation, multi-speaker localization and tracking, delay-and-sum beamforming, speech activity detection, distant-talking speech recognition, distant-talking speaker identification, and real-time control of an STB device. Both Italian and English are supported.
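
Delay-and-sum beamforming, one of the components listed above, can be illustrated in a few lines. The following is a minimal sketch, not the FBK-irst implementation: it assumes the per-microphone delays are already known (for example from the localization stage) and are whole numbers of samples. The function name `delay_and_sum` and its arguments are hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]],
    delays: Sequence[int],
) -> List[float]:
    """Time-align the microphone channels and average them.

    delays[m] is the number of samples by which the source signal arrives
    later at microphone m than at the earliest microphone (assumed known
    and non-negative). Reading channel m at index t + delays[m] undoes
    that lag, so the source adds up coherently while uncorrelated noise
    is averaged down.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Only output samples for which every shifted channel has data.
    n_out = min(len(ch) - d for ch, d in zip(channels, delays))
    output = []
    for t in range(n_out):
        total = sum(ch[t + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output
```

With the correct delays the source is recovered unchanged, while wrong delays make the channels partly cancel each other. A practical system would also need fractional delays, typically handled by interpolation or in the frequency domain.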

Download video clip (41.320 KB)

December 2006

This video clip shows a hypothetical application of the DICIT system. Four users are watching a movie and control playback using only their voice. Commands include volume setting and play controls. In the clip the commands are actually executed by hand; no speech recognizer is running. The target of the project is a real system that works in this way, with many more functionalities. A localization system (based on GCF) is shown running in the left part of the video. It tracks the position of the speakers using 2 reverse T-shaped arrays of 4 microphones each. The arrays are located over the projection wall. Using the raw microphone signals would lead to poor localization performance, because they also contain the stereo movie sound coming from the loudspeakers. The array signals are therefore post-processed with an Acoustic Echo Cancellation technique (developed by FAU): the movie sound is removed from the signals and the clean stream is used by the localization system.
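
To make the echo cancellation step concrete, here is a minimal single-channel sketch based on the standard NLMS adaptive filter. It is not the stereo AEC developed by FAU, whose details are not given on this page. The function name and the `taps`, `mu`, and `eps` parameters are illustrative assumptions.

```python
from typing import List, Sequence


def nlms_echo_cancel(
    mic: Sequence[float],
    ref: Sequence[float],
    taps: int = 8,       # assumed length of the modelled echo path
    mu: float = 0.5,     # NLMS step size, stable for 0 < mu < 2
    eps: float = 1e-8,   # avoids division by zero on silent input
) -> List[float]:
    """Remove the echo of `ref` (loudspeaker signal) from `mic`.

    An adaptive FIR filter estimates the loudspeaker-to-microphone path.
    The returned signal is the microphone signal minus the estimated
    echo, i.e. what remains for localization or recognition.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for m, r in zip(mic, ref):
        history = [r] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate
        # Normalizing by the reference power keeps the update stable
        # regardless of the loudspeaker volume.
        power = sum(x * x for x in history) + eps
        step = mu * error / power
        weights = [w + step * x for w, x in zip(weights, history)]
        residual.append(error)
    return residual
```

In the scenario above, `ref` would be one loudspeaker channel of the movie sound and `mic` one array microphone. A stereo setup is harder than this sketch suggests, because the two loudspeaker channels are strongly correlated.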

Download video clip (18.157 KB)

May 2008

First release of the Set-Top-Box prototype that was presented at the DICIT event held at FBK-irst on May 9, 2008 (presented by Rajesh Bakchandran, IBM Research).

The prototype includes the following technologies: acoustic echo cancellation, multi-speaker localization and tracking, microphone array processing and beamforming, speech activity detection, distant-talking speech recognition, natural language understanding, spoken dialogue, access to EPG, control of the TV system.

Download video (57.324 KB)