Scream Yell: analysis

Concept

The purpose of this installation is to incite viewers to scream. When a viewer first makes contact with the machine, it begins to tell them that people are unwilling to assert themselves and don't deserve their freedom. It then goads the viewer into yelling. If the viewer finally starts yelling, it shows a video of the viewer from when she or he was being more passive. The goal is to play with people's comfort zones of self-assertion and influence. It creates a cognitive dissonance by simultaneously chastising the viewer for doing what they are told and telling the viewer to yell, which they can only do by breaking what they are normally told (e.g., not to yell in public).

Technology

The code is written in MaxMSP, Jitter, and JavaScript. It functions as a simple state machine, moving between states based on input from the user. The inputs to the state machine are as follows:

  1. Speech starts
  2. Speech ends
  3. Yelling starts
  4. Yelling ends
  5. Video finishes
  6. Internal timer times out

Each state of the state machine responds differently to those inputs. The states are as follows (a minimal sketch of the state machine appears after the list):

  1. Nothing (nothing is going on)
  2. Introduction (someone has begun to speak into the mic)
  3. Talkify (establish the paradigm of conversation by asking for responses to simple questions)
  4. Rant (provide the core philosophy in an uncomfortable manner, to hopefully put the viewer slightly on edge)
  5. Incite (play a variety of videos attacking the viewer and encouraging them to yell)
  6. Yell (play a video of the viewer when she or he was first stepping up to the machine)
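
As a rough illustration, the control logic can be sketched in JavaScript as a table of transitions between those states. The state and event names below mirror the two lists above, but the transitions shown are only a representative subset, not the full logic of the patch:

    // Minimal sketch of the installation's control logic as a state machine.
    // Only a few representative transitions are shown; the full logic lives
    // in the Max patch and its js object.

    var state = "nothing";

    // transitions[currentState][event] -> next state
    var transitions = {
        nothing:      { speechStart: "introduction" },
        introduction: { speechEnd: "talkify", timeout: "nothing" },
        talkify:      { speechEnd: "rant", timeout: "nothing" },
        rant:         { videoDone: "incite" },
        incite:       { yellStart: "yell", timeout: "nothing" },
        yell:         { videoDone: "nothing" }
    };

    // Called whenever a classifier, the video player, or a timer reports an
    // event, e.g. handleEvent("speechStart").
    function handleEvent(eventName) {
        var next = transitions[state] && transitions[state][eventName];
        if (next) {
            state = next;
        }
    }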

Speech/Non-Speech and Yell/Non-Yell Classification

I tried several approaches to classifying input as speech/non-speech: first a hand-built contextual model, and then support vector machines, which proved much more successful.

Contextual Analysis

Using the outputs from several analyzers for MaxMSP, I constructed a model based on what speech seemed likely to exhibit. I band-passed the signal to the region between 400 Hz and 2 kHz, and applied a progressive filter: I averaged the amplitude in each frequency bin of a fast Fourier transform over time, then filtered that average out of the incoming signal, so that fast-changing signals would pass through but unchanging signals would not.
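
A rough sketch of that progressive filter, written as plain JavaScript rather than as the actual MaxMSP objects (the bin count and smoothing factor here are illustrative, not the values I used):

    // Sketch of the per-bin progressive filter: keep a slow-moving average of
    // each FFT bin's amplitude and remove it from the incoming frame, so that
    // steady background energy is suppressed and fast-changing energy remains.

    var NUM_BINS = 512;      // illustrative FFT size
    var SMOOTHING = 0.95;    // closer to 1 = slower-moving average
    var average = [];
    for (var i = 0; i < NUM_BINS; i++) {
        average[i] = 0;
    }

    // frame: array of NUM_BINS bin amplitudes for the current FFT frame
    function progressiveFilter(frame) {
        var filtered = [];
        for (var i = 0; i < NUM_BINS; i++) {
            // update the running average for this bin
            average[i] = SMOOTHING * average[i] + (1 - SMOOTHING) * frame[i];
            // subtract the average so only fast-changing energy passes through
            filtered[i] = Math.max(frame[i] - average[i], 0);
        }
        return filtered;
    }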

Next, I gauged, based on this simple model, that speech would likely have a particular centroid, a particular amplitude, and a particular noisiness (kurtosis), and tried to classify accordingly. This approach produced very poor results: I was unable to distinguish speech from yelling effectively, and unable to reliably distinguish speech from other room noise.
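
In spirit, this contextual classifier amounted to a handful of hard thresholds on those features. A hypothetical sketch (the threshold values are invented for illustration and are not the ones I tried):

    // Hypothetical threshold-based contextual classifier. The feature values
    // come from the Max analyzer objects; the thresholds are illustrative.

    // features: { amplitude: ..., centroid: ..., kurtosis: ... } for one frame
    function looksLikeSpeech(features) {
        return features.amplitude > 0.05 &&   // loud enough to matter
               features.centroid > 400 &&     // spectral centroid inside the
               features.centroid < 2000 &&    // band-passed speech range
               features.kurtosis > 2.0;       // peaky rather than flat/noisy
    }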

Support Vector Machines: partially contextual analysis

My next effort was to feed the data from these same analytical tools (pitch, amplitude, noisiness, centroid), computed on the filtered, band-passed signal, into a support vector machine (based on the command-line program Tiny SVM). I built a MaxMSP patch to train and interface with that command-line utility. Support vector machines operate by finding the optimal hyperplane that separates two data sets in multidimensional space. Once the hyperplane has been calculated from a training data set whose examples are marked "true" or "false", the support vector machine can classify test data simply by seeing which side of the hyperplane the test data falls on.
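
In the linear case, that final classification step is just the sign of a dot product. Training and prediction here were handled by the Tiny SVM command-line tool, so the following is only an illustration of the idea, not the interface I used:

    // Illustration of how a trained linear SVM classifies a feature vector:
    // training yields a weight vector w and bias b defining the separating
    // hyperplane; classification checks which side of that hyperplane the
    // new feature vector x falls on.

    // w: weight vector, b: bias (both produced by training); x: feature vector
    function svmClassify(w, b, x) {
        var score = b;
        for (var i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return score >= 0;   // e.g. true = "speech", false = "non-speech"
    }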

I hoped that the support vector machine would be able to construct a better model than I could by hand, by finding characteristics of speech and non-speech that my model did not capture. This approach proved very effective at distinguishing speech from non-speech, but very poor at distinguishing speech from yelling.

Support Vector Machines: non-contextual analysis

Next, I tried forgoing all contextual information in the analysis and simply feeding the raw input signal into the support vector machine. To get the data set, I used the output of Miller Puckette's bonk~, which is simply the energy in 11 frequency bins of the input signal. I trained the support vector machine on examples consisting of three concatenated frames of bonk~'s output. This approach produced my highest success rates yet at classifying speech versus non-speech and yelling versus speech.
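
Concretely, each example handed to the SVM was three consecutive bonk~ frames laid end to end into a single 33-element vector. A sketch of that concatenation (the buffering details here are illustrative, not the actual patch logic):

    // Sketch of building one SVM feature vector from bonk~ output: keep the
    // last three 11-bin frames and concatenate them into a 33-element vector.

    var NUM_FRAMES = 3;    // concatenate three consecutive frames
    var history = [];      // most recent frames, oldest first

    // bins: array of 11 energies from one bonk~ report
    function onBonkFrame(bins) {
        history.push(bins);
        if (history.length > NUM_FRAMES) {
            history.shift();
        }
        if (history.length === NUM_FRAMES) {
            var featureVector = [];
            for (var i = 0; i < NUM_FRAMES; i++) {
                featureVector = featureVector.concat(history[i]);
            }
            return featureVector;   // 33 values, one SVM example
        }
        return null;                // not enough frames yet
    }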

One challenge associated with non-contextual analysis like this is that I have no idea (other than the raw mathematical model) what features the support vector machine is using to classify the data. It is entirely possible that the classification is picking up on features that are unlikely to be reproduced in a particular installation space (such as microphone peculiarities or room characteristics). It seems that, with this approach, the support vector machine will need to be trained on location for the classification scheme to succeed in an installation.

Thanks very much to Hany Farid and Dan Rockmore for their input, advice and encouragement in this project.