We are all exposed to different sounds every day: car horns, sirens, music and so on. How about teaching a computer to classify such sounds automatically into categories?

In this blog post, we will learn techniques to classify urban sounds into categories using machine learning. Earlier blog posts covered classification problems where the data can easily be expressed in vector form. For example, in a textual dataset each word in the corpus becomes a feature and its tf-idf score becomes the value; likewise, the anomaly detection dataset had two features, “throughput” and “latency”, that were fed into a classifier. When it comes to sound, however, feature extraction is not quite as straightforward. Today, we will first see what features can be extracted from sound data and how easy it is to extract them in Python using the open-source library Librosa.

To get started with this tutorial, please make sure you have the following tools installed:

  • TensorFlow
  • Librosa
  • NumPy
  • Matplotlib

Dataset

We need a labelled dataset that we can feed into a machine learning algorithm. Fortunately, researchers have published such a dataset of urban sounds. It contains 8,732 labelled sound clips (up to 4 seconds each) from ten classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. By default the dataset is divided into ten folds. To get the dataset, please visit the following link, and if you use it in your research kindly don't forget to acknowledge the authors. The sound files in this dataset are in .wav format; if you have files in another format such as .mp3, it is best to convert them to .wav first, because .mp3 is a lossy compression format (check this link for more information). To keep things simple, we will use sound files from only the first three folds, namely fold1, fold2 and fold3.
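
If you do need to convert .mp3 files to .wav, one option (not part of the original workflow, shown here only as a sketch) is the pydub library, which requires ffmpeg to be installed; the file names below are placeholders:

from pydub import AudioSegment

# Hypothetical example: decode an .mp3 clip and re-export it as .wav
clip = AudioSegment.from_mp3("example_clip.mp3")
clip.export("example_clip.wav", format="wav")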

Let’s read some sound files and visualise them to understand how different each sound clip is from the others. Matplotlib’s specgram method performs all the required calculation and plotting of the spectrum. Likewise, Librosa provides handy methods for plotting waveforms and log power spectrograms. By looking at the plots shown in Figures 1, 2 and 3, we can see apparent differences between sound clips of different classes.

import glob
import os
import librosa
import librosa.display  # waveplot and specshow live in this submodule
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from matplotlib.pyplot import specgram
%matplotlib inline

def load_sound_files(file_paths):
    raw_sounds = []
    for fp in file_paths:
        X,sr = librosa.load(fp)
        raw_sounds.append(X)
    return raw_sounds

def plot_waves(sound_names,raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25,60), dpi = 900)
    for n,f in zip(sound_names,raw_sounds):
        plt.subplot(10,1,i)
        librosa.display.waveplot(np.array(f),sr=22050)
        plt.title(n.title())
        i += 1
    plt.suptitle("Figure 1: Waveplot",x=0.5, y=0.915,fontsize=18)
    plt.show()
    
def plot_specgram(sound_names,raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25,60), dpi = 900)
    for n,f in zip(sound_names,raw_sounds):
        plt.subplot(10,1,i)
        specgram(np.array(f), Fs=22050)
        plt.title(n.title())
        i += 1
    plt.suptitle("Figure 2: Spectrogram",x=0.5, y=0.915,fontsize=18)
    plt.show()

def plot_log_power_specgram(sound_names,raw_sounds):
    i = 1
    fig = plt.figure(figsize=(25,60), dpi = 900)
    for n,f in zip(sound_names,raw_sounds):
        plt.subplot(10,1,i)
        D = librosa.logamplitude(np.abs(librosa.stft(f))**2, ref_power=np.max)
        librosa.display.specshow(D,x_axis='time' ,y_axis='log')
        plt.title(n.title())
        i += 1
    plt.suptitle("Figure 3: Log power spectrogram",x=0.5, y=0.915,fontsize=18)
    plt.show()
sound_file_paths = ["57320-0-0-7.wav","24074-1-0-3.wav","15564-2-0-1.wav","31323-3-0-1.wav",
"46669-4-0-35.wav","89948-5-0-0.wav","40722-8-0-4.wav",
"103074-7-3-2.wav","106905-8-0-0.wav","108041-9-0-4.wav"]

sound_names = ["air conditioner","car horn","children playing",
"dog bark","drilling","engine idling", "gun shot",
"jackhammer","siren","street music"]

raw_sounds = load_sound_files(sound_file_paths)

plot_waves(sound_names,raw_sounds)
plot_specgram(sound_names,raw_sounds)
plot_log_power_specgram(sound_names,raw_sounds)

Sound Wave Plot

Sound Spectrogram

Sound Log Power Spectrogram

Feature Extraction

To extract useful features from the sound data, we will use the Librosa library. It provides several methods for extracting different features from sound clips. We are going to use the methods listed below (a short sketch after the list shows the feature dimensions they produce):

  • melspectrogram: Compute a mel-scaled power spectrogram
  • mfcc: Compute Mel-frequency cepstral coefficients
  • chroma_stft: Compute a chromagram from a waveform or power spectrogram
  • spectral_contrast: Compute spectral contrast, using the method defined in [1]
  • tonnetz: Compute the tonal centroid features (tonnetz), following the method of [2]
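
With librosa’s default parameters (and n_mfcc=40, as used below), these features yield per-frame vectors of 40 (MFCCs), 12 (chroma bins), 128 (mel bands), 7 (spectral-contrast bands) and 6 (tonnetz) dimensions. Averaging each over time and concatenating the results gives a 193-dimensional feature vector per clip, which is why the code below pre-allocates np.empty((0,193)). A minimal sanity-check sketch, reusing one of the clips loaded earlier:

# Sketch: confirm the per-feature dimensions that add up to 193
X, sample_rate = librosa.load("57320-0-0-7.wav")
stft = np.abs(librosa.stft(X))
print(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).shape[0])       # 40
print(librosa.feature.chroma_stft(S=stft, sr=sample_rate).shape[0])        # 12
print(librosa.feature.melspectrogram(X, sr=sample_rate).shape[0])          # 128
print(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).shape[0])  # 7
print(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).shape[0])  # 6
# 40 + 12 + 128 + 7 + 6 = 193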

To make feature extraction from the sound clips easy, two helper methods are defined. The first, parse_audio_files, takes a parent directory name, the sub-directories within it and a file extension (default .wav) as input. It then iterates over all the files within those sub-directories and calls the second helper, extract_feature, which takes a file path as input, reads the file with librosa.load, and extracts and returns the features discussed above. These two methods are all that is required to convert raw sound clips into informative features (along with a class label for each clip) that we can feed directly into our classifier. Remember, the class label of each sound clip is encoded in its file name: for example, if the file name is 108041-9-0-4.wav then the class label is 9. Splitting the file name on '-' and taking the second item gives us the class label.
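
As a tiny illustration of that label parsing (a sketch using os.path.basename so it works regardless of how many directories precede the file name):

# Sketch: pull the class label out of a file path
fn = "Sound-Data/fold1/108041-9-0-4.wav"
label = os.path.basename(fn).split('-')[1]  # '9'

With that in mind, here are the two helpers: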

def extract_feature(file_name):
    X, sample_rate = librosa.load(file_name)
    stft = np.abs(librosa.stft(X))
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T,axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
    mel = np.mean(librosa.feature.melspectrogram(X, sr=sample_rate).T,axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X),
    sr=sample_rate).T,axis=0)
    return mfccs,chroma,mel,contrast,tonnetz

def parse_audio_files(parent_dir,sub_dirs,file_ext="*.wav"):
    features, labels = np.empty((0,193)), np.empty(0)
    for label, sub_dir in enumerate(sub_dirs):
        for fn in glob.glob(os.path.join(parent_dir, sub_dir, file_ext)):
            try:
                mfccs, chroma, mel, contrast, tonnetz = extract_feature(fn)
            except Exception as e:
                print("Error encountered while parsing file: ", fn)
                continue
            ext_features = np.hstack([mfccs,chroma,mel,contrast,tonnetz])
            features = np.vstack([features,ext_features])
            # the class label is the second field of the file name, e.g. 108041-9-0-4.wav -> 9
            labels = np.append(labels, os.path.basename(fn).split('-')[1])
    return np.array(features), np.array(labels, dtype=int)

def one_hot_encode(labels):
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    one_hot_encode = np.zeros((n_labels,n_unique_labels))
    one_hot_encode[np.arange(n_labels), labels] = 1
    return one_hot_encode
parent_dir = 'Sound-Data'
tr_sub_dirs = ["fold1","fold2"]
ts_sub_dirs = ["fold3"]
tr_features, tr_labels = parse_audio_files(parent_dir,tr_sub_dirs)
ts_features, ts_labels = parse_audio_files(parent_dir,ts_sub_dirs)

tr_labels = one_hot_encode(tr_labels)
ts_labels = one_hot_encode(ts_labels)
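
As a quick illustration (a sketch with toy labels, not data from the dataset) of what one_hot_encode produces:

# Sketch: one-hot encoding of three toy labels drawn from three classes
print(one_hot_encode(np.array([0, 2, 1])))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]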

Classification using Multilayer Neural Network

Note: if you want to use scikit-learn or any other library to train the classifier, feel free to do so. The goal of this tutorial is to provide an implementation of a neural network in TensorFlow for classification tasks.

Now that we have our training and testing sets ready, let’s implement a two-layer neural network in TensorFlow to classify each sound clip into its category.

The code below defines the configuration parameters required by the neural network model, such as the number of training epochs, the number of neurons in each hidden layer and the learning rate. The variable sd is the standard deviation used to initialise the weights, set to 1/sqrt(n_dim).

training_epochs = 50
n_dim = tr_features.shape[1]
n_classes = 10
n_hidden_units_one = 280 
n_hidden_units_two = 300
sd = 1 / np.sqrt(n_dim)
learning_rate = 0.01

Now we define placeholders for the features and class labels, which TensorFlow will fill with data at runtime, along with the weights and biases for the hidden and output layers of the network. For non-linearity we use tanh in the first hidden layer and the sigmoid function in the second; the output layer uses softmax, as we are dealing with a multiclass classification problem.

X = tf.placeholder(tf.float32,[None,n_dim])
Y = tf.placeholder(tf.float32,[None,n_classes])

W_1 = tf.Variable(tf.random_normal([n_dim,n_hidden_units_one], mean = 0, stddev=sd))
b_1 = tf.Variable(tf.random_normal([n_hidden_units_one], mean = 0, stddev=sd))
h_1 = tf.nn.tanh(tf.matmul(X,W_1) + b_1)

W_2 = tf.Variable(tf.random_normal([n_hidden_units_one,n_hidden_units_two], 
mean = 0, stddev=sd))
b_2 = tf.Variable(tf.random_normal([n_hidden_units_two], mean = 0, stddev=sd))
h_2 = tf.nn.sigmoid(tf.matmul(h_1,W_2) + b_2)

W = tf.Variable(tf.random_normal([n_hidden_units_two,n_classes], mean = 0, stddev=sd))
b = tf.Variable(tf.random_normal([n_classes], mean = 0, stddev=sd))
y_ = tf.nn.softmax(tf.matmul(h_2,W) + b)

init = tf.initialize_all_variables()

The cross-entropy cost function will be minimised using a gradient descent optimizer; the code below initialises the cost function and the optimizer. It also defines the operations used to compute the accuracy of the model’s predictions.

cost_function = -tf.reduce_sum(Y * tf.log(y_))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

correct_prediction = tf.equal(tf.argmax(y_,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
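
One caveat not addressed in the original code: taking tf.log of a softmax output can produce NaNs once a predicted probability underflows to zero. A hedged alternative (a sketch, assuming a TensorFlow 1.x release that accepts these keyword arguments) is to let TensorFlow fuse the softmax and cross-entropy into one numerically stable op:

# Sketch: numerically safer cross-entropy computed from the pre-softmax logits
logits = tf.matmul(h_2, W) + b
cost_function = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)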

We have all the required pieces in place. Now let’s train the neural network model, visualise whether the cost decreases with each epoch, and make predictions on the test set using the following code:

cost_history = np.empty(shape=[0],dtype=float)
y_true, y_pred = None, None
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):            
        _,cost = sess.run([optimizer,cost_function],feed_dict={X:tr_features,Y:tr_labels})
        cost_history = np.append(cost_history,cost)
    
    y_pred = sess.run(tf.argmax(y_,1),feed_dict={X: ts_features})
    y_true = sess.run(tf.argmax(ts_labels,1))
    print("Test accuracy: ",round(session.run(accuracy, 
    	feed_dict={X: ts_features,Y: ts_labels}),3))

fig = plt.figure(figsize=(10,8))
plt.plot(cost_history)
plt.axis([0,training_epochs,0,np.max(cost_history)])
plt.show()

from sklearn.metrics import precision_recall_fscore_support  # scikit-learn provides this metric

p,r,f,s = precision_recall_fscore_support(y_true, y_pred, average="micro")
print("F-Score:", round(f,3))

NN Cost Per Epoch

In this tutorial, we saw how to extract features from a sound dataset and train a two-layer neural network model in TensorFlow to categorise sounds. I would encourage you to check the documentation of Librosa and experiment with different neural network configurations, e.g. by changing the number of neurons, the number of hidden layers, or introducing dropout (a small sketch follows).
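
For example, a hedged sketch of adding dropout after the first hidden layer; keep_prob is a new placeholder that is not part of the code above (feed it a value such as 0.5 during training and 1.0 at test time):

# Sketch: dropout applied to the first hidden layer (TensorFlow 1.x API)
keep_prob = tf.placeholder(tf.float32)
h_1 = tf.nn.tanh(tf.matmul(X, W_1) + b_1)
h_1_drop = tf.nn.dropout(h_1, keep_prob)  # randomly zeroes units, scales the rest
h_2 = tf.nn.sigmoid(tf.matmul(h_1_drop, W_2) + b_2)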

The Python notebook is available at the following link. If you have any questions or feedback, please comment below.