Over the past few months I have grown interested in Apache Spark, Machine Learning and Time Series, and I thought I would play around with them.
In this post I will explain how to predict a user's physical activity (like walking, jogging, sitting…) using Spark, the Spark-Cassandra connector and MLlib.
The entire code and data sets are available on my GitHub account.
A few words about Apache Spark & Cassandra
Apache Spark started as a research project at the University of California, Berkeley in 2009, and is an open-source project written mostly in Scala. In a nutshell, Apache Spark is a fast and general engine for large-scale data processing.
Spark's main strength is in-memory processing, but it can also process data on disk, and it integrates fully with Hadoop to process data from HDFS. Spark provides three main APIs, in Java, Scala and Python. In this post I chose the Java API.
Spark offers an abstraction called resilient distributed datasets (RDDs), which are immutable and lazy data collections partitioned across the nodes of a cluster.
MLlib is a standard component of Spark that provides machine learning primitives on top of Spark. It contains common algorithms (regression, classification, recommendation, optimization, clustering…), as well as basic statistics and feature extraction functions.
An example: user’s physical activity recognition
The availability of acceleration sensors creates exciting new opportunities for data mining and predictive analytics applications. In this post, I will consider data from accelerometers to perform activity recognition.
I have used labeled accelerometer data collected from users who carried a device in their pocket while performing different activities (walking, sitting, jogging, ascending stairs, descending stairs, and standing).
The accelerometer measures acceleration in all three spatial dimensions, as follows:
- Z-axis captures the forward movement of the leg
- Y-axis captures the upward and downward movement of the leg
- X-axis captures the horizontal movement of the leg
The plots below show the characteristics of each activity. Because these activities are periodic, a window of a few seconds is sufficient to find characteristics specific to each activity. (I have used the Cityzen Data visualization tools from my company for the following diagrams.)
We observe repeating waves and peaks for the repetitive activities: walking, jogging, ascending stairs and descending stairs. The Upstairs and Downstairs activities are very similar, while the more static activities like standing or sitting show no periodic behavior, only different amplitudes.
Data into Cassandra
I have pushed my data into Cassandra using the cql shell.
Because I need to group my data by (user_id, activity) and then sort it by timestamp, I defined a composite primary key with (user_id, activity) as the partition key and timestamp as the clustering column.
Just below is an example of what my data looks like.
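A minimal sketch of the table definition and a sample row, assuming keyspace, table and column names of my own choosing (the important part is the primary key layout described above):

```sql
-- Partition key (user_id, activity) groups rows per user and activity;
-- the clustering column "timestamp" keeps them sorted by time.
CREATE TABLE activity_data.users (
    user_id   int,
    activity  text,
    timestamp bigint,
    acc_x     double,
    acc_y     double,
    acc_z     double,
    PRIMARY KEY ((user_id, activity), timestamp)
);

INSERT INTO activity_data.users (user_id, activity, timestamp, acc_x, acc_y, acc_z)
VALUES (1, 'Walking', 1424686735000, -0.69, 12.68, 0.5);
```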
And now, how to retrieve the data from Cassandra with the Spark-Cassandra connector:
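A sketch with the connector's Java API (keyspace and table names match the hypothetical schema above, and this assumes a Cassandra node reachable at the configured host):

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RetrieveData {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Activity recognition")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Pull one (user, activity) partition; rows come back sorted by
        // the clustering column "timestamp".
        JavaRDD<CassandraRow> data = javaFunctions(sc)
                .cassandraTable("activity_data", "users")
                .select("timestamp", "acc_x", "acc_y", "acc_z")
                .where("user_id = ? AND activity = ?", 1, "Walking");

        System.out.println(data.count() + " records retrieved");
    }
}
```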
As you can imagine, my data was not clean, and I needed to prepare it in order to extract my features. This is certainly the most time-consuming part of the work, but also the most exciting for me.
My data comes from a CSV file, and it was acquired over different, sequential days. So I needed to determine the different recording intervals for each user and each activity. From these intervals, I then extracted the windows on which I computed my features.
Here is a diagram to explain what I did, along with the code.
First, retrieve the data for each (user, activity) pair, sorted by timestamp.
Then, search for jumps between records in order to define the recording intervals and the number of windows per interval.
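The jump-detection step can be sketched in plain Java (names and the gap threshold are my own; the original works on Spark RDDs, but the logic is the same):

```java
import java.util.ArrayList;
import java.util.List;

public class Intervals {
    /**
     * Given timestamps sorted ascending (in ms), returns the [start, end]
     * index pairs of the recording intervals: a new interval begins
     * wherever the gap between two consecutive records exceeds maxGapMs.
     */
    public static List<int[]> findIntervals(long[] ts, long maxGapMs) {
        List<int[]> intervals = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < ts.length; i++) {
            if (ts[i] - ts[i - 1] > maxGapMs) { // jump: close current interval
                intervals.add(new int[]{start, i - 1});
                start = i;
            }
        }
        intervals.add(new int[]{start, ts.length - 1});
        return intervals;
    }

    public static void main(String[] args) {
        // Two recordings separated by a large gap between 150 and 5000.
        long[] ts = {0, 50, 100, 150, 5000, 5050, 5100};
        for (int[] iv : findIntervals(ts, 1000)) {
            System.out.println(iv[0] + "-" + iv[1]);
        }
    }
}
```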
Determine and compute features for the model
Each of these activities demonstrates characteristics that we will use to define the features of the model. For example, the plot for walking shows a series of high peaks on the y-axis, spaced at intervals of approximately 0.5 seconds, while for jogging the interval is closer to 0.25 seconds. We also notice that the range of the y-axis acceleration is greater for jogging than for walking, and so on. This analysis step is essential and (again) takes time, in order to determine the best features for our model.
After several tests with different feature combinations, the ones I chose are described below (they are basic statistics):
- Average acceleration (for each axis)
- Variance (for each axis)
- Average absolute difference (for each axis)
- Average resultant acceleration (1/n * sum [√(x² + y² + z²)])
- Average time between peaks (max) (Y-axis)
Features computation using Spark and MLlib
Now let’s compute the features to build the predictive model!
Average acceleration and variance
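These two features reduce to simple per-axis statistics over a window; a minimal standalone version (the original computes them on Spark RDDs):

```java
public class BasicFeatures {
    // Mean acceleration of one axis over a window of samples.
    public static double mean(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    // Population variance of one axis over a window.
    public static double variance(double[] v) {
        double m = mean(v), sum = 0;
        for (double x : v) sum += (x - m) * (x - m);
        return sum / v.length;
    }

    public static void main(String[] args) {
        double[] window = {1, 2, 3, 4};
        System.out.println("mean = " + mean(window));
        System.out.println("variance = " + variance(window));
    }
}
```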
Average absolute difference
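This feature is the mean of the absolute deviations from the window mean, computed per axis; a plain-Java sketch:

```java
public class AbsDiffFeature {
    // Average absolute difference: mean of |x_i - mean(x)| over the window.
    public static double avgAbsDiff(double[] v) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        double sum = 0;
        for (double x : v) sum += Math.abs(x - mean);
        return sum / v.length;
    }

    public static void main(String[] args) {
        System.out.println(avgAbsDiff(new double[]{1, 2, 3, 4}));
    }
}
```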
Average resultant acceleration
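This one combines the three axes, following the formula given above: 1/n * sum [√(x² + y² + z²)]. A plain-Java sketch:

```java
public class ResultantFeature {
    // Average resultant acceleration: mean magnitude of the (x, y, z)
    // acceleration vector over the window.
    public static double avgResultant(double[] x, double[] y, double[] z) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
        }
        return sum / x.length;
    }

    public static void main(String[] args) {
        double[] x = {3, 0}, y = {4, 0}, z = {0, 0};
        System.out.println(avgResultant(x, y, z));
    }
}
```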
Average time between peaks
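One way to sketch this feature in plain Java: treat any sample close enough to the window maximum as a peak, then average the gaps between consecutive peak timestamps (the tolerance parameter is my own simplification):

```java
import java.util.ArrayList;
import java.util.List;

public class PeakFeature {
    /**
     * Average time between peaks on one axis. A sample counts as a peak
     * when it comes within `tolerance` of the window maximum; returns 0
     * when there are fewer than two peaks (e.g. static activities).
     */
    public static double avgTimeBetweenPeaks(long[] ts, double[] v, double tolerance) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : v) max = Math.max(max, x);
        List<Long> peaks = new ArrayList<>();
        for (int i = 0; i < v.length; i++) {
            if (v[i] >= max - tolerance) peaks.add(ts[i]);
        }
        if (peaks.size() < 2) return 0;
        double sum = 0;
        for (int i = 1; i < peaks.size(); i++) {
            sum += peaks.get(i) - peaks.get(i - 1);
        }
        return sum / (peaks.size() - 1);
    }

    public static void main(String[] args) {
        long[] ts = {0, 100, 200, 300, 400};
        double[] y = {0, 10, 0, 10, 0};
        System.out.println(avgTimeBetweenPeaks(ts, y, 0.5));
    }
}
```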
The model: Decision Trees
Just to recap: we want to determine the user's activity from the data, where the possible activities are walking, jogging, sitting, standing, downstairs and upstairs. So it is a classification problem.
You could also use other algorithms available in MLlib, such as Random Forest or Multinomial Logistic Regression (from Spark 1.3).
Remark: with the chosen features, predictions for the "up" and "down" activities are pretty bad. One trick would be to define more relevant features to obtain a better prediction model.
Below is the code that shows how to load our dataset and split it into training and test datasets.
Let's use DecisionTree.trainClassifier to fit our model. The model is then evaluated against the test dataset, and an error is calculated to measure the accuracy of the algorithm.
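A sketch of this step with the MLlib Java API (`data` is assumed to be an existing JavaRDD<LabeledPoint> built from the features above, and the parameter values are illustrative, not the tuned ones):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;

// Split the labeled data into training and test sets.
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.6, 0.4});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

int numClasses = 6;
// Empty map: all features are treated as continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
String impurity = "gini";
int maxDepth = 9;
int maxBins = 32;

final DecisionTreeModel model = DecisionTree.trainClassifier(
        trainingData, numClasses, categoricalFeaturesInfo,
        impurity, maxDepth, maxBins);

// Test error = fraction of test samples whose predicted label
// differs from the actual label.
double testErr = (double) testData.filter(p ->
        model.predict(p.features()) != p.label()).count()
        / testData.count();
System.out.println("Test error: " + testErr);
```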
| classes number | mean error (Random Forest) | mean error (Decision Tree) |
|---|---|---|
| 4 (4902 samples) | 1.3% | 1.5% |
| 6 (6100 samples) | 13.4% | 13.2% |
In this post we have demonstrated how to use Apache Spark and MLlib to predict a user's physical activity.
The feature extraction step is quite long, because you need to test and experiment to find the best possible features. The data preparation takes a long time too, but it is exciting.
If you find a better way or implementation to prepare the data or compute the features, do not hesitate to send a pull request or open an issue on GitHub.