In this short blog post, I will cover some of the ideas for sensor fusion i.e. combining data from multiple sensors and different ways to represent (sensor) input data for deep neural networks.

In the last post, we saw how to apply convolutional neural network on accelerometer data for human activity recognition. The input data had three components (x, y, and z) from an accelerometer. The sliding window approach was applied to get segments of fixed size with class labels, that fed into a deep net for activity recognition. The depthwise convolution operation was applied to the input, which learns different weights for different input channels, in our case different accelerometer components. The learned features from convolution and pooling layers are then fed into a feed-forward neural network for classification or regression. Below, I wrote down some of the ideas for sensor fusion I learned from different sources.

1) An interesting way to represent input data for the convolutional neural network is to keep x, y and z components separate and apply separate convolution and/or pooling operation to learn different features independently [1]. At a later stage, the output of convolution or pooling layers can be flattened and combined. These new features then feed into densely connected layers or support vector machine to make predictions.

2) The features extracted using traditional methods from multiple sensors can be concatenated to form a large input vector and any algorithm can be used for learning e.g. d-dimensional logistic regression.

3) Another idea is to apply FFT or do spectrogram analysis on accelerometer components and feed new representation as input into a deep net. The spectrogram basically represents changes in energy content of a signal as a function of frequency and time. These representations of the raw signal can bring advantages, to learn interesting features by reducing the complexity of the task. To get more information, please consult [2].

4) In case of a dataset with multiple accelerometer sensors having same sampling rate. The 2D segments (like images) can be extracted, where either each row or column can represent x, y and z components from each of the sensor. I would highly recommend interested reader to check following papers [3], [4].

5) If the data is from multiple sensors having different sampling rates. The first thing to do is to always time align the dataset. Afterwards, a different convolutional neural network can be applied independently on each sensorâ€™s data to learn features. These learned features can then combined and feed into LSTM to learn interaction between different sensors. More information on this approach can be found in [5].

These techniques can be seen as an example of **early fusion** as each sensor's information is combined before doing inference. Last but not least, another strategy is to learn independent models from each sensor's data (an ensemble) and take average/majority vote for the prediction. This method can be stated as **late fusion** and it lowers the influence of insignificant signals.

I discussed some of the ideas and techniques I picked while reading papers, if you have more interesting thoughts, suggestion or feedback, please comment below.

*Last Edited: 9-11-2017*

### References:

- Cui, Zhicheng, Wenlin Chen, and Yixin Chen. "Multi-scale convolutional neural networks for time series classification." arXiv preprint arXiv:1603.06995 (2016).
- Alsheikh, Mohammad Abu, et al. "Deep activity recognition models with triaxial accelerometers." arXiv preprint arXiv:1511.04664 (2015).
- Hammerla, Nils Y., Shane Halloran, and Thomas Ploetz. "Deep, convolutional, and recurrent models for human activity recognition using wearables." arXiv preprint arXiv:1604.08880 (2016).
- Yang, Jianbo, et al. "Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition." IJCAI. 2015.
- Yao, Shuochao, et al. "DeepSense: A Unified Deep Learning Framework for Time-Series Mobile Sensing Data Processing." arXiv preprint arXiv:1611.01942 (2016).