3-D Sound Source Location and Auditory Scene Segmentation


Kevin D. Donohue (donohue@engr.uky.edu) and

Harikrishnan Unnikrishnan (harikrishnan@uky.edu)

Center for Visualization and Virtual Environments

University of Kentucky

July 25, 2008


Auditory scene segmentation is analogous to video scene or image segmentation, where the goal is to identify and isolate groups of signal components (such as pixels or sound component samples) over space and time that belong to the same object or event. For example, in video scenes the objects of interest may be faces, so the group of pixels associated with a face must be detected in a frame (time instance) and then linked to detected face pixels in subsequent frames until the face no longer exists in the field of view. The groups of pixels associated with each detected face would then be extracted and analyzed to determine face properties based on the information required by a higher-level application. In auditory scene segmentation, the objects of interest may be voices, so signal components associated with individual voices must be detected and tracked over time and space. Since the auditory scene may include many voices and other noises, filtering must be performed to extract the voice of interest from all the other sounds in the signal. This is typically done with spatially distributed microphones and electronic beamforming (spatial filtering). In addition, other filtering techniques can be applied to enhance the signal of interest based on expected signal characteristics (temporal filtering). Once the audio object (sometimes referred to as an audio stream) is extracted, further analysis can be performed to determine properties of the stream.
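To make the idea of spatial filtering concrete, the following is a minimal delay-and-sum beamformer sketch, not the implementation used in this experiment. The function name, the speed-of-sound constant, and the frequency-domain (circular) alignment are assumptions chosen for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def delay_and_sum(signals, mic_positions, focus_point, sample_rate):
    """Steer a microphone array to focus_point and return the averaged signal.

    signals:       (num_mics, num_samples) array, one row per microphone
    mic_positions: (num_mics, 3) microphone coordinates in meters
    focus_point:   (3,) candidate source position in meters
    """
    signals = np.asarray(signals, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)
    focus_point = np.asarray(focus_point, dtype=float)
    num_samples = signals.shape[1]

    # Propagation delay from the focus point to each microphone, in seconds.
    distances = np.linalg.norm(mic_positions - focus_point, axis=1)
    delays = distances / SPEED_OF_SOUND

    # Advance each channel by its delay with a phase shift in the frequency
    # domain. This is a circular shift, which is adequate for a short sketch.
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / sample_rate)
    spectra = np.fft.rfft(signals, axis=1)
    aligned = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])

    # Average the aligned channels: sound from the focus point adds
    # coherently, while sound from other positions tends to cancel.
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)
```

Evaluating the power of this output at each point of a grid of candidate positions gives a map in the spirit of a steered-response power search, with peaks where sources are likely to be.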


Scene segmentation is a low-level operation in perception and scene understanding [1]. The example presented here shows how audio scene segmentation can be performed using a set of distributed microphones, a steered-response power (SRP) algorithm [2], an adaptive threshold procedure [3], and a rule-based algorithm for combining detected sound sources over space and time into streams. In this experiment, two people talk while moving in a 3.6 x 3.6 x 2.25 meter space surrounded by 16 microphones mounted on an audio cage built from aluminum struts. The speakers' motion in the cage included sitting, standing, and walking. The scene was also captured with a video camera so that the speakers' motions can be compared with their detected positions in the 3-D audio detection space.


A 3-D SRP algorithm was applied over the volume with a 4 cm step in all directions [2]. The results were thresholded, rejecting positions below the threshold, with the threshold chosen to give a false-alarm rate close to 1 in 100,000 [3]. Since the sound source location (SSL) frames are computed every 20 ms, with over 400,000 voxels per frame, there are many opportunities for false alarms. After thresholding, an algorithm was applied to connect detections of the same speaker over a sequence of frames, using spatial cues, to create an audio stream. Each audio stream was then rendered as a circular marker with a unique color. The marker size was varied according to the strength of the detected source (the larger the marker, the stronger the likelihood of a source).
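The scale of the false-alarm problem, and the idea of linking detections by spatial proximity, can be illustrated with a short sketch. It rests on several assumptions that the text above does not confirm: the grid includes both end points of each dimension, the false-alarm rate applies per voxel per frame, and the linking rule is a simple greedy nearest-neighbor match. The rule-based algorithm actually used is not specified here, and all function names and the max_jump_m parameter are hypothetical.

```python
import math


def grid_voxel_count(dims_m, step_m):
    """Grid points per frame when each dimension is sampled every step_m meters.

    Assumes both end points of each dimension are included.
    """
    return math.prod(int(math.floor(d / step_m + 1e-9)) + 1 for d in dims_m)


def expected_false_alarms(num_voxels, false_alarm_rate):
    """Expected false alarms per frame for a per-voxel false-alarm rate."""
    return num_voxels * false_alarm_rate


def link_detections(frames, max_jump_m):
    """Greedy nearest-neighbor linking of per-frame detections into streams.

    frames: list of frames, each a list of (x, y, z) detections in meters.
    Returns a list of streams; each stream is a list of (frame_index, position).
    """
    streams = []
    for t, detections in enumerate(frames):
        unclaimed = list(detections)
        for stream in streams:
            last_t, last_pos = stream[-1]
            if last_t != t - 1 or not unclaimed:
                continue  # stream ended earlier, or nothing left to claim
            nearest = min(unclaimed, key=lambda p: math.dist(p, last_pos))
            if math.dist(nearest, last_pos) <= max_jump_m:
                stream.append((t, nearest))
                unclaimed.remove(nearest)
        # Detections not claimed by an existing stream start new streams.
        streams.extend([[(t, p)] for p in unclaimed])
    return streams


voxels = grid_voxel_count((3.6, 3.6, 2.25), 0.04)
per_frame = expected_false_alarms(voxels, 1e-5)
per_second = per_frame / 0.020  # one frame every 20 ms
```

Under these assumptions the grid has about 470,000 points, which gives roughly 4.7 expected false alarms per frame, or over 200 per second, before any linking is applied. In this simple version a single missed detection ends a stream; a practical rule set would likely bridge short gaps and discard streams that last only a frame or two.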


The SSL movie shows the rendered audio streams played along with the video recording of the experiment. It can be played from:


(QuickTime format)

3D_Movie Part 1


3D_Movie Part 2


The movie shows that the sources could be tracked with little disturbance from false detections. An example of false detections is shown in Fig. 1. The very small markers indicate points that barely crossed the threshold, most likely the result of side lobes from the main speaker, who is marked with the larger red ball. In this case they are false alarms. The large size of the red marker indicates that it crossed the threshold by a significant amount. There are also occasional missed detections. Fig. 2 shows a missed detection for the second speaker: his voice did not cause the SSL detection statistic to cross the threshold.


Fig. 1. Example frame showing false detections.

Fig. 2. Example frame showing a missed detection.



The automatic detection and segmentation of audio streams presents significant challenges in an exciting research area that draws on contributions from signal processing, computer engineering, and cognitive psychology. Results of this research have applications in smart rooms (office and home automation, human-computer interfaces), monitoring of those needing special care, surveillance, and systems for studying the complex behaviors of humans, animals, and machines.


The ultimate goal is to be able to engineer systems with powerful computing nodes, massive data access, and superhuman sensor abilities that mimic the ability of human cognition to perceive important activities in a space of interest and respond appropriately.




[1] Albert S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, Cambridge, Massachusetts, 1994.


[2] K.D. Donohue, J. Hannemann, and H.G. Dietz, “Performance of Phase Transform for Detecting Sound Sources in Reverberant and Noisy Environments,” Signal Processing, Vol. 87, No. 7, pp. 1677-1691, July 2007.


[3] K.D. Donohue, K.S. McReynolds, and A. Ramamurthy, “Sound Source Detection Threshold Estimation using Negative Coherent Power,” Proceedings of IEEE SoutheastCon 2008, pp. 575-580, April 2008.