Data Breach Detection using Machine Learning Models

Data breaches are, by their very nature, hard to detect, especially when they unfold over a period of time rather than at a single point in time. This is very common, since attackers do their best to replicate “normal” user behavior. In this context, even though a particular action taken by a user may not be cause for concern, a collection of seemingly harmless actions may indeed represent a real threat to the security of an Organisation.

These characteristics make it very difficult to find a bullet-proof method for detecting attacks. As attackers get more sophisticated, Organisations will require more intelligent security tools to protect their data. In this article we will go through different machine learning models that can be applied depending on the specifics of each circumstance.

Machine learning models are well suited to detecting anomalies in access to personal or sensitive data. In that context, the timestamps of the events that constitute access to the data are critical to the effectiveness of the defense against such attacks, as they allow the periodicity of the actions to be taken into account. For this reason, anomaly detection in time series data should provide the most effective methods for recognizing an attack. However, the hard part is identifying which type of outlier corresponds to the sort of behavior that constitutes a data breach.

From a classical point of view, an outlier is “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980). More recently, other definitions have been suggested in the literature, along with a characterization of different types of outliers as a way to better detect them.

On one hand, there are point outliers, which are data points that show unusual behavior at a particular point in time when compared either to the other values in the time series (global outlier) or to their neighboring points (local outlier). On the other hand, there are subsequence outliers, which are runs of consecutive points in time whose joint behavior is unusual, even though each observation by itself is not necessarily a point outlier.
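The difference between global and local point outliers can be illustrated with a minimal sketch. It is not taken from any of the methods discussed in this article: the function names, the z-score rule, the neighbor-median rule, the thresholds and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, median, pstdev


def global_outliers(series, z_threshold=3.0):
    """Indices whose z-score against the WHOLE series exceeds the threshold."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > z_threshold]


def local_outliers(series, window=2, tolerance=8.0):
    """Indices that differ from the median of their NEIGHBORS by more than tolerance."""
    flagged = []
    for i, x in enumerate(series):
        # Up to `window` points on each side, excluding the point itself.
        neighbors = series[max(0, i - window):i] + series[i + 1:i + 1 + window]
        if neighbors and abs(x - median(neighbors)) > tolerance:
            flagged.append(i)
    return flagged


# Hypothetical hourly counts of records read by one account: a steady upward
# trend with two injected anomalies (index 3 and index 12).
reads = [10, 12, 14, 30, 18, 20, 22, 24, 26, 28, 30, 32, 120, 36, 38, 40]

print(global_outliers(reads))  # only the extreme value at index 12
print(local_outliers(reads))   # indices 3 and 12
```

The value 30 at index 3 is well within the overall range of the series, so the global rule misses it, but it stands out clearly against its immediate neighbors.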

With this in mind, it is important to understand that a data breach can happen either at a single instant or over a period of time, so a method, or a collection of methods, able to detect both types of outliers is essential to properly detect data breaches.

Taking all these factors into account, methods have been developed that meet these requirements and can be useful in the successful detection and prevention of attacks. Different models can be deployed depending on whether the data is univariate or multivariate.

A univariate time series consists of an ordered set of observations, each recorded at a particular point in time. A multivariate time series, on the other hand, consists of an ordered set of k-dimensional vectors, each recorded at a particular point in time. Given that, for the purpose of detecting data breaches, the events representing access to sensitive data should be described by a number of different variables, in this article we will consider methods that can be applied to multivariate time series.
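As a small illustration of the two definitions, the sketch below represents both kinds of series as lists of (timestamp, value) pairs. The variables used for the multivariate case (records read, megabytes downloaded, failed logins) are hypothetical and only meant to suggest what an access event might carry.

```python
from datetime import datetime, timedelta

start = datetime(2024, 1, 1, 9, 0)
timestamps = [start + timedelta(hours=h) for h in range(4)]

# Univariate: one scalar observation per timestamp
# (hypothetical: number of records read in that hour).
univariate = list(zip(timestamps, [12, 15, 11, 14]))

# Multivariate: one k-dimensional vector per timestamp
# (hypothetical: records read, megabytes downloaded, failed logins).
multivariate = list(zip(timestamps, [
    (12, 3.4, 0),
    (15, 4.1, 1),
    (11, 2.9, 0),
    (14, 3.8, 0),
]))

k = len(multivariate[0][1])  # dimensionality of each observation
print(k)
```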

In the case of point outlier detection, the techniques described below have proven successful:

  • Kieu et al. (2018) have suggested a two-module approach. First, the multivariate time series is enriched by adding statistical features to the raw data, in order to better capture how the derived features change across time. Then, the enriched data is fed into a deep neural network based autoencoder, in which the most representative features are established through dimensionality reduction. Data points that deviate from the established representative features can be seen as outliers.
  • Su et al. (2019) have suggested a more complex method which combines a Gated Recurrent Unit (GRU), to capture the temporal dependencies between multivariate observations, with a Variational Autoencoder (VAE), to map observations to random variables. This model has the advantage of providing interpretable results, as the reconstruction probabilities given in the model output can be used to determine anomalies.
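The enrichment step of the first approach can be sketched as follows. This is only an illustration of the general idea of adding statistical features to raw data: the choice of rolling mean and rolling standard deviation, the window size and the function name `enrich` are assumptions, not the features used by Kieu et al., and the autoencoder itself is not shown.

```python
from statistics import mean, pstdev


def enrich(series, window=3):
    """For each observation, append the rolling mean and rolling standard
    deviation of every variable over the trailing window (assumed features)."""
    enriched = []
    for t in range(len(series)):
        recent = series[max(0, t - window + 1):t + 1]
        row = list(series[t])
        for j in range(len(series[t])):
            column = [obs[j] for obs in recent]
            row.append(mean(column))
            row.append(pstdev(column))
        enriched.append(row)
    return enriched


# Hypothetical observations: (records read, megabytes downloaded).
raw = [(10, 1.0), (12, 1.0), (11, 1.0), (40, 9.0)]
enriched = enrich(raw, window=2)

for row in enriched:
    print(row)
```

Each enriched row keeps the raw values and adds two derived features per variable, so a sudden jump such as the last observation shows up both in the raw values and in a much larger rolling standard deviation.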

In the case of subsequence outlier detection, other methods have been developed:

  • Cheng et al. (2008, 2009) use the Radial Basis Function to compute the similarity between multivariate points of the series, which are then represented through a graph.
  • Jones et al. (2014) created a method that first selects a set of observations that are representative of each dimension, and then uses nonlinear functions to find pairs of related features. These pairs then allow the values of each feature to be estimated from the values of another feature. A new time series is then created from these estimations, where subsequences of one variable are built from the learned functions and data from another variable.
  • Munir et al. (2019) developed a method that consists of two steps. First, they build a time series predictor based on Convolutional Neural Networks (CNN) that is used to predict entire new subsequences. Then, an anomaly detector based on the Euclidean distance between the actual and predicted values for each variable flags outliers whenever that distance surpasses a defined threshold.
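The second step of the last approach, thresholding the Euclidean distance between actual and predicted values, can be sketched as below. To keep the example self-contained, the CNN predictor is replaced by a naive persistence forecast (each observation is predicted to equal the previous one). That substitution, the function names, the threshold and the sample data are all assumptions made for illustration.

```python
import math


def persistence_forecast(series):
    """Stand-in predictor, NOT a CNN: predicts each observation as the previous one."""
    return [series[0]] + series[:-1]


def flag_anomalies(actual, predicted, threshold):
    """Indices where the Euclidean distance between the actual and the
    predicted observation exceeds the threshold."""
    return [
        t for t, (a, p) in enumerate(zip(actual, predicted))
        if math.dist(a, p) > threshold
    ]


# Hypothetical observations: (records read, megabytes downloaded).
actual = [(10, 1), (11, 1), (10, 2), (60, 9), (62, 9), (11, 1)]
predicted = persistence_forecast(actual)

anomalies = flag_anomalies(actual, predicted, threshold=20.0)
print(anomalies)  # flags the start (index 3) and the end (index 5) of the burst
```

With a persistence forecast only the abrupt changes are flagged. A predictor that has learned the normal pattern of the series would be expected to keep flagging observations for as long as the unusual behavior lasts.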

In conclusion, this article goes through some anomaly detection methods for time series data that, due to their characteristics, should be helpful in the detection of data breaches. It is important to stress that the success of these particular algorithms depends on a number of factors, such as the characteristics of the dataset, the parameters chosen, the metrics used to evaluate the method and so on. Therefore, they may not be applicable to every case, and it is of utmost importance to adapt the chosen methods to the reality of each Organisation.
