Last updated: 28-05-2020

We are exposed to many different sounds every day: car horns, sirens, music, and so on. How about teaching a computer to classify such sounds into categories automatically?

In this blog post, we will learn techniques to classify urban sounds into specific categories (such as car horns and sirens) with machine learning. Earlier blog posts covered classification problems where the data can be easily expressed in vector form. For example, in the text dataset, each word in the corpus became a feature and its tf-idf score became the feature's value. Likewise, in the anomaly detection dataset, we saw two features, “throughput” and “latency”, that were fed into a classifier to predict outliers. With sound, however, feature extraction is less straightforward. Today, we will first see which features can be extracted from a sound dataset and how easy it is to extract them in Python using an open-source library called Librosa.

To get started with this tutorial, please make sure you have the following tools installed:

  • Tensorflow 2.x
  • Librosa
  • Numpy
  • Matplotlib
  • Scikit-learn (used for KFold)


We need a labelled dataset that we can use to train a machine learning model. Fortunately, researchers have open-sourced an annotated dataset of urban sounds. It contains 8,732 labelled sound clips (up to 4 seconds each) from ten classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. The dataset is pre-split into 10 folds. To get the dataset, please visit the following link, and if you use this dataset in your research, kindly don’t forget to acknowledge it. The sound files in this dataset are in .wav format; if you have files in another format such as .mp3, it is a good idea to convert them to .wav first, because .mp3 is a lossy compression format (check this link for more information). To keep things manageable, we will pre-process the sound files and save the extracted features as NumPy arrays so that they are easy to use afterwards.

Let's read some sound files and visualize them to understand how the clips differ from one another. Matplotlib's specgram method performs all the required calculation and plotting of the spectrogram. Likewise, Librosa provides handy methods for plotting waveforms and log power spectrograms. Looking at the plots shown in Figures 1 and 2, we can see apparent differences between sound clips of different classes.

### Load necessary libraries ###
import glob
import os
import librosa
import librosa.display  # plotting helpers such as specshow live in this submodule
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
from sklearn.model_selection import KFold

import tensorflow as tf
from tensorflow import keras

%matplotlib inline
plt.style.use('ggplot')

### Define helper functions ###
def load_sound_files(file_paths):
    raw_sounds = []
    for fp in file_paths:
        X,sr = librosa.load(fp)
        raw_sounds.append(X)  # keep the decoded waveform for plotting
    return raw_sounds

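def plot_waves(sound_names,raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25,60), dpi=900)
    for n,f in zip(sound_names,raw_sounds):
        # Assumed layout: one subplot per clip, stacked vertically
        plt.subplot(len(sound_names),1,i)
        # waveshow is called waveplot in older Librosa releases
        librosa.display.waveshow(np.array(f),sr=22050)
        plt.title(n.title())
        i += 1
    plt.suptitle('Figure 1: Waveplot',x=0.5, y=0.915,fontsize=18)
    plt.show()

def plot_specgram(sound_names,raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25,60), dpi = 900)
    for n,f in zip(sound_names,raw_sounds):
        plt.subplot(len(sound_names),1,i)  # assumed layout, as in plot_waves
        specgram(np.array(f), Fs=22050)
        plt.title(n.title())
        i += 1
    plt.suptitle('Figure 2: Spectrogram',x=0.5, y=0.915,fontsize=18)
    plt.show()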

def plot_log_power_specgram(sound_names,raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25,60), dpi = 900)
    for n,f in zip(sound_names,raw_sounds):
        plt.subplot(len(sound_names),1,i)  # assumed layout: one subplot per clip
        # Convert the power spectrogram to decibels relative to its peak
        D = librosa.power_to_db(np.abs(librosa.stft(f))**2, ref=np.max)
        librosa.display.specshow(D,x_axis='time' ,y_axis='log')
        plt.title(n.title())
        i += 1
    plt.suptitle('Figure 3: Log power spectrogram',x=0.5, y=0.915,fontsize=18)
    plt.show()

def extract_feature(file_name):
    X, sample_rate = librosa.load(file_name)
    stft = np.abs(librosa.stft(X))
    # Each feature matrix is averaged over time to get one vector per clip
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T,axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T,axis=0)
    return mfccs,chroma,mel,contrast,tonnetz

def parse_audio_files(parent_dir,sub_dir,file_ext='*.wav'):
    features, labels = np.empty((0,193)), np.empty(0) # 193 => total features 
    for fn in glob.glob(os.path.join(parent_dir, sub_dir, file_ext)):
        mfccs, chroma, mel, contrast,tonnetz = extract_feature(fn)
        ext_features = np.hstack([mfccs,chroma,mel,contrast,tonnetz])
        features = np.vstack([features,ext_features])
        # The class label is the second hyphen-separated field of the file name
        labels = np.append(labels, int(os.path.basename(fn).split('-')[1]))
    return np.array(features, dtype=np.float32), np.array(labels, dtype = np.int8)

### Plot a few sound clips along with their spectrograms ###
sound_file_paths = ["57320-0-0-7.wav","24074-1-0-3.wav",
sound_names = ["air conditioner","car horn","children playing",
               "dog bark","drilling","engine idling",
               "gun shot","jackhammer","siren","street music"]

raw_sounds = load_sound_files(sound_file_paths)

Sound Wave Plot

Sound Spectrogram
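To make concrete what a log power spectrogram contains, here is a minimal NumPy-only sketch. It is not how Matplotlib or Librosa implement it (librosa.stft, for example, pads and centres frames by default). The function name log_power_spectrogram, the 440 Hz test tone, and the frame settings are assumptions chosen purely for illustration.

```python
import numpy as np

def log_power_spectrogram(signal, n_fft=2048, hop_length=512):
    """Return a (n_fft // 2 + 1, n_frames) log-power spectrogram in dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # Power spectrum of every frame: shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Decibels relative to the loudest bin; the floor avoids log(0).
    db = 10.0 * np.log10(np.maximum(power, 1e-10) / power.max())
    return db.T

sr = 22050                              # sampling rate assumed for the demo
t = np.arange(sr) / sr                  # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)    # synthetic 440 Hz tone (illustrative)

spec = log_power_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 2048          # centre frequency of the strongest bin
print(spec.shape, round(peak_hz, 1))
```

Each column of the result describes how energy is distributed over frequency in one short time window, which is why clips from different classes look so different when plotted this way.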

Feature Extraction

To extract useful features from the sound data, we will use the Librosa library. It provides several methods to extract a variety of features from a sound clip. We are going to use the methods listed below:

  • melspectrogram: Compute a mel-scaled power spectrogram
  • mfcc: Mel-frequency cepstral coefficients
  • chroma_stft: Compute a chromagram from a waveform or power spectrogram
  • spectral_contrast: Compute spectral contrast, using method defined in [1]
  • tonnetz: Computes the tonal centroid features (tonnetz), following the method of [2]
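The feature vector size of 193 used later in the code can be checked with a little arithmetic. The sketch below uses random stand-in matrices instead of real Librosa output. The per-feature row counts are assumptions based on Librosa's defaults as I understand them (12 chroma bins, 128 mel bands, 7 spectral contrast rows, 6 tonnetz dimensions), plus the 40 MFCCs the tutorial requests explicitly. The frame count is hypothetical.

```python
import numpy as np

# Rows produced per frame by each extractor (assumed Librosa defaults,
# except n_mfcc=40, which the tutorial sets explicitly).
FEATURE_ROWS = {"mfcc": 40, "chroma": 12, "mel": 128, "contrast": 7, "tonnetz": 6}

rng = np.random.default_rng(0)
n_frames = 173  # hypothetical number of analysis frames for one clip

# Stand-ins for the (rows, n_frames) matrices the feature methods return.
frame_features = {
    name: rng.random((rows, n_frames)) for name, rows in FEATURE_ROWS.items()
}

# Average each matrix over time, then concatenate into one vector per clip.
clip_vector = np.hstack([m.T.mean(axis=0) for m in frame_features.values()])
print(clip_vector.shape)
```

Averaging over time discards the temporal order of the frames, which is what lets clips of different lengths share one fixed-size input for a feedforward network.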

To make the process of feature extraction from sound clips easy, let us define two helper functions. The first, parse_audio_files, takes the parent directory name, a subdirectory within the parent directory, and a file extension (default is .wav) as input. It iterates over all the files within the subdirectory and invokes the second helper function, extract_feature, on each of them. extract_feature takes a file path as input, reads the file with the librosa.load method, and extracts and returns the features mentioned above. These functions convert raw sound clips into informative features (along with a class label for each sound clip) that we can directly use to learn a model with the classification method of our choice. Note that the class label of each sound clip is in the file name. For example, if the file name is 108041-9-0-4.wav, then the class label is 9. Splitting the file name on the hyphen (-) and taking the second item of the resulting array gives us the class label. To summarize, we will iterate over each file within a fold to extract features and the corresponding labels, and save them as NumPy arrays.
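The label parsing described above can be sketched on its own. The helper name label_from_path and the example directory are hypothetical, and the mapping from class IDs to names assumes that the IDs 0 to 9 follow the same order as the class list given earlier.

```python
import os

# Assumed mapping: class IDs 0-9 in the order the classes were listed earlier.
CLASS_NAMES = ["air conditioner", "car horn", "children playing", "dog bark",
               "drilling", "engine idling", "gun shot", "jackhammer",
               "siren", "street music"]

def label_from_path(path):
    """Return the class ID stored as the second hyphen-separated field."""
    file_name = os.path.basename(path)      # drop the directory part
    return int(file_name.split("-")[1])

# Hypothetical location of the example file mentioned in the text.
example = "UrbanSounds8K/audio/fold1/108041-9-0-4.wav"
label = label_from_path(example)
print(label, CLASS_NAMES[label])
```

Working on the base name rather than on a fixed position in the full path keeps the parsing correct no matter how deeply the audio folder is nested.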

# Pre-process and extract feature from the data
parent_dir = 'UrbanSounds8K/audio/'
save_dir = "UrbanSounds8K/processed/"
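sub_dirs = np.array(['fold1','fold2','fold3','fold4','fold5',
                     'fold6','fold7','fold8','fold9','fold10'])
os.makedirs(save_dir, exist_ok=True)  # np.savez does not create missing directories
for sub_dir in sub_dirs:
    features, labels = parse_audio_files(parent_dir,sub_dir)
    # One .npz file per fold, holding that fold's features and labels
    np.savez("{0}{1}".format(save_dir, sub_dir), features=features, labels=labels)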
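As a small illustration of the storage format, the sketch below saves and reloads dummy arrays with np.savez and np.load. The file name fold_demo, the temporary directory, and the array contents are made up for the example; only the keys features and labels mirror the tutorial.

```python
import os
import tempfile
import numpy as np

# Dummy stand-ins for one fold's extracted features and labels.
features = np.zeros((3, 193), dtype=np.float32)
labels = np.array([0, 9, 3], dtype=np.int8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fold_demo")
    # np.savez appends the .npz extension when it is missing.
    np.savez(path, features=features, labels=labels)
    saved_files = os.listdir(tmp_dir)
    # Arrays are retrieved by the keyword names used when saving.
    with np.load(path + ".npz") as data:
        loaded_features = data["features"]
        loaded_labels = data["labels"]

print(saved_files, loaded_features.shape, loaded_labels.tolist())
```

Storing the features once per fold means the slow audio decoding and feature extraction only has to run a single time, however often the classifier is retrained.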

Sound Classifier

Note: If you want to use another standard machine learning algorithm (such as from scikit-learn or any other library) for this purpose, feel free to do so. The goal of this tutorial is to provide an illustrative demonstration of training a neural network with Tensorflow/Keras.

Now that we have pre-processed our dataset, it’s time to implement a simple feedforward neural network in Keras (Tensorflow) to classify each sound clip into one of the categories.

The code provided below defines a standard feedforward neural network with three hidden layers of 256, 128, and 64 units, respectively, followed by a softmax output layer. Compared with the previous version of this tutorial (from 2016), implementing and training neural networks has become much easier!

### Define feedforward network architecture ###
def get_network():
    input_shape = (193,)
    num_classes = 10
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(256, activation="relu", input_shape=input_shape))
    model.add(keras.layers.Dense(128, activation="relu"))
    model.add(keras.layers.Dense(64, activation="relu"))
    model.add(keras.layers.Dense(num_classes, activation = "softmax"))

    return model
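To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per output unit. The sketch below is plain Python and does not require Tensorflow. The helper name dense_param_count is hypothetical.

```python
# Layer widths: 193 input features, three hidden layers, 10 output classes.
layer_sizes = [193, 256, 128, 64, 10]

def dense_param_count(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each consecutive layer pair."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

total_params = dense_param_count(layer_sizes)
print(total_params)
```

More than half of the parameters sit in the first layer, so changing the size of the feature vector or of the first hidden layer has the largest effect on model size.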

The last step is to train and evaluate the network with the extracted features using 10-fold cross-validation. The process is straightforward: we read the data of all training folds, combine them, train the network, and then evaluate its performance on the held-out fold. We repeat this process ten times, so that each fold is held out once, and average the obtained accuracies to get the final performance estimate.
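The splitting logic itself can be illustrated without any data. This is a minimal sketch of holding out one predefined fold at a time. The generator name leave_one_fold_out is hypothetical, and the fold names fold1 to fold10 are assumed from the dataset's folder layout.

```python
# Assumed fold names, one per dataset sub-directory.
folds = ["fold{}".format(i) for i in range(1, 11)]

def leave_one_fold_out(fold_names):
    """Yield (training folds, held-out fold) once for every fold."""
    for held_out in fold_names:
        train = [name for name in fold_names if name != held_out]
        yield train, held_out

splits = list(leave_one_fold_out(folds))
for train, test in splits[:2]:
    print("test:", test, "| train:", ", ".join(train))
```

Keeping the dataset's own folds intact, rather than reshuffling individual clips, avoids placing clips cut from the same original recording in both the training and the test set.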

### Train and evaluate via 10-Folds cross-validation ###
accuracies = []
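folds = np.array(['fold1','fold2','fold3','fold4','fold5',
                  'fold6','fold7','fold8','fold9','fold10'])
load_dir = "UrbanSounds8K/processed/"
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(folds):
    x_train, y_train = [], []
    for ind in train_index:
        # Collect the features and labels of every training fold
        data = np.load("{0}/{1}.npz".format(load_dir,folds[ind]))
        x_train.append(data["features"])
        y_train.append(data["labels"])
    x_train = np.concatenate(x_train, axis = 0)
    y_train = np.concatenate(y_train, axis = 0)
    data = np.load("{0}/{1}.npz".format(load_dir,folds[test_index][0]))
    x_test = data["features"]
    y_test = data["labels"]
    # Possibly do mean normalization here on x_train and 
    # x_test but using only x_train's mean and std.
    model = get_network()
    # Optimizer and loss are assumptions: integer labels with a softmax
    # output call for sparse categorical cross-entropy.
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs = 50, batch_size = 24, verbose = 0)
    l, a = model.evaluate(x_test, y_test, verbose = 0)
    accuracies.append(a)
    print("Loss: {0} | Accuracy: {1}".format(l, a))
print("Average 10 Folds Accuracy: {0}".format(np.mean(accuracies)))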
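The normalization hinted at in the code comment could look like the sketch below: the mean and standard deviation are computed on the training features only and then applied to both sets, so no information from the held-out fold leaks into training. The function name standardize, the eps constant, and the random stand-in data are assumptions for illustration.

```python
import numpy as np

def standardize(x_train, x_test, eps=1e-8):
    """Scale both sets using statistics estimated on the training set only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    # eps guards against division by zero for constant features.
    return (x_train - mean) / (std + eps), (x_test - mean) / (std + eps)

# Random stand-ins for the 193-dimensional feature matrices.
rng = np.random.default_rng(0)
x_train = rng.normal(5.0, 3.0, size=(100, 193))
x_test = rng.normal(5.0, 3.0, size=(20, 193))

x_train_n, x_test_n = standardize(x_train, x_test)
print(x_train_n.mean(axis=0)[:3], x_train_n.std(axis=0)[:3])
```

In the cross-validation loop, such a call would go right before the network is trained, once for every split.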

In this short tutorial, we saw how to extract features from a sound dataset and train a feedforward neural network model in Keras (Tensorflow) to categorize sound clips. I would encourage you to check the Librosa documentation and experiment with different neural network configurations (e.g. changing the number of neurons or hidden layers, or introducing dropout) to improve recognition.

The Python notebook is available at the following link. If you have any questions or feedback, please comment below.