Building a Speech Emotion Recognition System Using Machine Learning

Supriya Nagpal
5 min read · Sep 22, 2024


In today’s world, human-computer interaction is becoming more sophisticated. One exciting area of research is Speech Emotion Recognition (SER), where machines learn to recognize human emotions from speech. This technology holds great potential for applications in customer service, mental health monitoring, and user experience design. In this post, we’ll explore how to build a Speech Emotion Recognition system step by step using machine learning.

What is Speech Emotion Recognition?

Speech Emotion Recognition is a process where a machine learns to classify emotions based on audio signals. Imagine a system that can listen to your voice and determine whether you’re happy, sad, angry, or surprised. By analyzing specific features of the speech signal, such as tone, pitch, and energy, machines can be trained to detect the emotional state of a speaker.

Project Overview

In this project, we build a system that processes raw speech, extracts relevant features, and trains machine learning models to classify the emotions present in the speech. The project pipeline can be broken down into the following steps:

  1. Data Collection
  2. Preprocessing
  3. Feature Extraction
  4. Model Training
  5. Model Evaluation

Let’s dive into each of these steps in detail.

Step 1: Data Collection

Every machine learning project starts with data. For this project, we need a dataset containing speech samples labeled with corresponding emotions.

There are several publicly available datasets designed for Speech Emotion Recognition, such as TESS (Toronto Emotional Speech Set). These datasets include speech samples where actors express various emotions like anger, happiness, sadness, and surprise. These labels are crucial because they allow our machine learning model to learn which features correspond to specific emotions.

For example:

  • A speaker may say the same sentence in different emotional tones.
  • Each recording is labeled with its respective emotion, which forms the foundation for training the model.
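
As a concrete (and hedged) example, the sketch below collects file paths and emotion labels for a TESS-style dataset. It assumes the archive has been extracted to a local data/tess folder (a hypothetical path) and that each filename ends with its emotion label, e.g. OAF_back_angry.wav.

```python
from pathlib import Path

DATA_DIR = Path("data/tess")  # hypothetical location of the extracted dataset

file_paths, labels = [], []
for wav_path in sorted(DATA_DIR.rglob("*.wav")):
    # In TESS, the emotion is the last underscore-separated token of the filename,
    # e.g. OAF_back_angry.wav -> "angry"
    file_paths.append(wav_path)
    labels.append(wav_path.stem.split("_")[-1].lower())

print(f"Found {len(file_paths)} recordings, emotions: {sorted(set(labels))}")
```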

Once we have the dataset, the next step is to prepare the data for analysis.

Step 2: Preprocessing

Raw audio data comes in many different forms and often requires cleaning and standardization before it can be fed into a machine learning model. The preprocessing phase ensures the data is consistent and ready for feature extraction. Some common preprocessing steps include:

  • Noise Reduction: Background noise can interfere with feature extraction, so we apply techniques like spectral gating or bandpass filters to reduce noise.
  • Resampling: Not all audio files have the same sampling rate. To standardize the data, we resample all audio to a consistent sampling rate, usually 16 kHz or 44.1 kHz, depending on the use case.
  • Trimming Silence: Often, speech recordings include unnecessary silence at the beginning or end. By trimming these silent portions, we can reduce file size and focus only on the important part — the speech itself.
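
Putting these steps into code, here is a minimal preprocessing sketch using librosa. It resamples on load and trims silence; the target rate and trim threshold are illustrative choices, and a dedicated denoising library (such as noisereduce) could be slotted in where the comment indicates.

```python
import librosa

TARGET_SR = 16_000  # assumed target sampling rate for this project

def preprocess(path):
    # librosa resamples to TARGET_SR when sr is passed explicitly
    audio, sr = librosa.load(path, sr=TARGET_SR)
    # Noise reduction (e.g. spectral gating) could be applied here.
    # Trim leading/trailing segments quieter than roughly 30 dB below the peak
    trimmed, _ = librosa.effects.trim(audio, top_db=30)
    return trimmed, sr
```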

By the end of this step, we have clean and standardized audio files ready for feature extraction.

Step 3: Feature Extraction

After preprocessing the audio data, we extract features from the speech signals. These features are crucial for the model to understand the underlying characteristics of the speech that correspond to different emotions. In Speech Emotion Recognition, we use several important features, including:

1. Mel Frequency Cepstral Coefficients (MFCCs)

MFCCs are one of the most commonly used features in speech recognition tasks. They capture the spectral properties of the audio and represent the power spectrum on a mel scale, which mimics how humans perceive sound.

  • Why MFCCs? The human ear does not perceive all frequencies equally. MFCCs account for this by warping the frequency axis onto the mel scale, which compresses the higher frequencies and keeps the representation closer to how humans actually hear.
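
A short librosa sketch of MFCC extraction: averaging the coefficients over time yields one fixed-length vector per clip, and the number of coefficients (40 here) is a common but arbitrary choice.

```python
import numpy as np
import librosa

def mfcc_features(audio, sr, n_mfcc=40):
    # mfcc has shape (n_mfcc, n_frames); average over frames for a clip-level vector
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)
```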

2. Formant Frequencies

Formants are resonance frequencies in the vocal tract that help define different speech sounds. Changes in formant frequencies can reflect changes in vocal tract shape due to different emotions. For example, anger might cause a speaker’s formants to shift due to changes in mouth or throat tension.
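
Formant estimation is not a one-liner, but a rough and commonly used approximation is to fit a linear predictive coding (LPC) model and read formant candidates off the roots of the prediction polynomial. The sketch below follows that approach; the LPC order and frequency cutoff are assumptions, and real systems often do this per frame on voiced segments only.

```python
import numpy as np
import librosa

def estimate_formants(audio, sr, order=8):
    # Pre-emphasis boosts high frequencies so formant peaks stand out
    emphasized = librosa.effects.preemphasis(audio)
    lpc_coeffs = librosa.lpc(emphasized, order=order)
    roots = np.roots(lpc_coeffs)
    roots = roots[np.imag(roots) > 0]  # keep one root per complex-conjugate pair
    freqs = np.arctan2(np.imag(roots), np.real(roots)) * sr / (2.0 * np.pi)
    return np.sort(freqs[freqs > 90])  # drop near-DC candidates
```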

3. Pitch and Pitch Contour

Pitch refers to the fundamental frequency of a speech signal. Emotion often alters the pitch of speech — for example, anger may raise the pitch, while sadness may lower it. Pitch contour, which captures variations in pitch over time, is another useful feature.
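
As an illustration, librosa's probabilistic YIN (pyin) tracker can estimate the fundamental frequency frame by frame; the mean gives an overall pitch and the standard deviation is a crude summary of the pitch contour. The frequency bounds below are generic speech ranges, not values prescribed by any particular dataset.

```python
import numpy as np
import librosa

def pitch_features(audio, sr):
    f0, voiced_flag, voiced_probs = librosa.pyin(audio, fmin=65, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]  # keep voiced frames only
    if f0.size == 0:
        return 0.0, 0.0
    return float(np.mean(f0)), float(np.std(f0))  # mean pitch, contour variability
```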

4. Speech Rate

The rate at which a person speaks can be a good indicator of emotion. For instance, when excited or angry, people tend to speak faster, while sadness or boredom might slow down speech.

5. Energy

Energy measures the intensity or loudness of a speech signal. Emotions like anger or excitement often correspond with higher energy levels, whereas calmness or sadness may produce lower energy signals.

6. Zero Crossing Rate

This measures the rate at which the audio signal changes its sign (crosses the zero axis). It is useful for capturing the temporal dynamics of the speech signal and can provide additional insight into emotions.
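
The last two features above are straightforward to compute with librosa; the sketch below returns clip-level averages of the root-mean-square energy and the zero-crossing rate.

```python
import numpy as np
import librosa

def energy_and_zcr(audio):
    rms = librosa.feature.rms(y=audio)                 # frame-level energy, shape (1, n_frames)
    zcr = librosa.feature.zero_crossing_rate(y=audio)  # frame-level ZCR, shape (1, n_frames)
    return float(np.mean(rms)), float(np.mean(zcr))
```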

Step 4: Model Training

Now that we’ve extracted the features from the speech signals, we can move on to the core of the project: model training. Here, we train machine learning models to classify the emotions based on the extracted features.

Machine Learning Models for SER

Several types of machine learning models can be used for Speech Emotion Recognition:

  • Support Vector Machines (SVM): SVM is a powerful algorithm for classification tasks and is often used in SER projects for its ability to handle high-dimensional data.
  • Convolutional Neural Networks (CNN): CNNs are typically used in image processing tasks, but they can also be adapted for audio signals. In SER, CNNs can learn patterns in the speech signal’s spectrograms, which represent the frequency spectrum of the signal over time.
  • Recurrent Neural Networks (RNN): Speech is inherently sequential, and RNNs, particularly Long Short-Term Memory (LSTM) networks, are designed to handle sequential data. LSTMs can capture dependencies in time-series data, making them a great choice for recognizing emotions based on how speech changes over time.
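
To make the LSTM option concrete, here is one possible Keras model, assuming the inputs are sequences of frame-level features such as MFCC frames of shape (time_steps, n_features); the layer sizes and class count are illustrative rather than tuned values.

```python
import tensorflow as tf

def build_lstm_model(time_steps, n_features, num_classes):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_features)),
        tf.keras.layers.LSTM(64),            # summarizes the sequence into one vector
        tf.keras.layers.Dropout(0.3),        # light regularization
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```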

Training the Model

During training, the model learns to map the features extracted from the audio signals to their corresponding emotion labels. The dataset is split into training and validation sets to allow the model to learn and then validate its performance on unseen data.
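
In code, the split-and-train step might look like the scikit-learn sketch below, where X is the matrix of clip-level feature vectors and y the emotion labels produced by the earlier steps (both assumed to exist already); scaling the features before the SVM usually helps noticeably.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_clips, n_features) feature matrix, y: emotion labels (assumed available)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```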

Step 5: Model Evaluation

Once the model is trained, the next step is to evaluate its performance. In machine learning, performance is often measured using metrics like accuracy and F1-score.

Accuracy

Accuracy represents the percentage of correct predictions made by the model. However, accuracy alone is not always a good indicator, especially in imbalanced datasets where certain emotions may occur more frequently than others.

F1-Score

The F1-score is a better measure of model performance when dealing with imbalanced data. It is the harmonic mean of precision (the proportion of positive predictions that are actually correct) and recall (the proportion of actual positive cases the model correctly identifies), so it only rewards models that do well on both.
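
Continuing the scikit-learn sketch above, both metrics (plus a per-emotion breakdown) are one call each; the macro-averaged F1 treats every emotion class equally, which is what we want on imbalanced data.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Macro F1:", f1_score(y_val, y_pred, average="macro"))
print(classification_report(y_val, y_pred))  # per-emotion precision, recall, F1
```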

Evaluating the model helps us understand how well it is performing and where there’s room for improvement. If the model’s accuracy is low, we may revisit feature extraction or experiment with different models and configurations.

Conclusion

Speech Emotion Recognition is a fascinating intersection of human language and machine learning. By following the steps outlined in this post — from data collection and preprocessing to feature extraction and model training — we’ve built a system that can recognize human emotions based on speech.

While this project provides a solid foundation, there’s always room to expand. Future improvements could include more advanced model architectures, such as transformers, or the incorporation of additional data, like facial expressions or physiological signals, to improve emotion recognition accuracy.
