Data Analytics and Machine Learning Science

Thrust Area Leader:
Gregory Ditzler, The University of Arizona

Current approaches in cybersecurity are primarily signature-based: they detect a malicious attack by looking for a known signature, and they fail to identify malicious activities that do not exactly match a signature. Recently, machine learning and data science have proven incredibly successful in many application-driven fields, including cybersecurity. In fact, on some classification tasks, such as image classification, machine learning has been shown to outperform humans. Therefore, in this proposal, we seek to integrate machine learning into each of the cybersecurity and forensic tasks to improve upon the state-of-the-art. Furthermore, many machine learning models are static in the sense that once they are trained, they are never updated again. Unfortunately, many real-world classification tasks are not static and naturally evolve or change over time. It is against this background that we focus on machine learning techniques that model an adversary in a cyber environment. We will therefore research change detection algorithms and study how knowledge of a machine learning model and its adversary can be exploited to make the algorithm more robust in a cyber environment. Our machine learning and data science expertise will be applied to cybersecurity, and we will focus on the entire data science pipeline (i.e., data collection, preprocessing, learning, and classification) to improve upon the state-of-the-art in cybersecurity.

Research Approach:
To achieve the aforementioned goals, we propose to: (1) research how to learn in a dynamic and changing environment, such that learning can be done with high accuracy while change is detected in a complex environment, (2) leverage adversarial information (i.e., data and information about the attack model) to improve the accuracy of the system, (3) develop a data science pipeline for modeling the adversary from data collection to classification, (4) identify and develop metrics of success for learning in such environments, and (5) benchmark the proposed approaches against state-of-the-art algorithms in cybersecurity.

The following are descriptions of some of the research projects in this thrust area:


Machine Learning Techniques to Adjust to Continuous Changes in Data Streams

The goal of this project is to develop an Adaptive Big Data Analytics Environment (ABDAE) that can adjust computations to respond promptly to rapid changes in data and cyber-physical environments. Addressing this problem space will require adaptive big streaming data analytics that can: (a) model cyber-physical infrastructures that encompass realistically complex critical infrastructure operations, (b) ingest the massive data sets needed to capture the complexity of large-scale dynamic systems, and (c) process and update the analytics results in a timely manner in order to test contrasting mechanistic models and drive the next set of analyses. The outcomes of this project will be: (1) a proof-of-concept algorithm that implements active change detection and passive learning models for non-stationary environments; (2) a framework for hybrid/passive learning methods for learning in high-volume data streams; (3) a recommendation system for cybersecurity applications that face non-stationary streaming environments; and (4) a benchmark using synthetic and real-world data sets along with statistics that measure the overall efficacy of the proposed approaches.

Our first focus will be on developing and identifying algorithms for change detection in large-volume data streams. We have identified change detection algorithms that use the probability of error from a classifier to detect a change in the data stream. Furthermore, many of the approaches described in this section not only detect the change, but also provide a warning of a potential change in the data stream.

The change detection algorithms can be implemented as part of monitoring the classification error of a model. We now describe the change detection algorithms that use classification error as a mechanism to aid in detecting change in a data stream. We should note that we are primarily interested in increases in the error rate, since we expect the error to drop or converge as new data are presented over time. An increase in the error indicates that some property of the data stream has changed and that the learner should be reset. We have also identified the computational intelligence CUSUM (CI-CUSUM) change detection test developed by Alippi and Roveri (2008). CI-CUSUM provides two completely automatic tests for detecting nonstationary phenomena that require neither a priori information nor assumptions about the process generating the data. In particular, it offers an effective computational-intelligence-inspired test for multidimensional data, a scenario where traditional change detection methods are generally not applicable or scarcely effective.
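To make the error-monitoring idea concrete, the following is a minimal sketch of a CUSUM-style detector over a classifier's error stream (this is an illustration, not the CI-CUSUM algorithm itself; the drift allowance `delta` and the warning/change thresholds are hypothetical tuning parameters):

```python
class ErrorRateChangeDetector:
    """Monitor a stream of 0/1 classification errors with a one-sided
    CUSUM-style statistic; raise a warning first, then signal a change."""

    def __init__(self, delta=0.005, warn_threshold=2.0, drift_threshold=4.0):
        self.delta = delta      # allowed drift in the error rate per step
        self.warn = warn_threshold
        self.drift = drift_threshold
        self.mean = 0.0         # running mean of the error
        self.n = 0
        self.cusum = 0.0        # accumulates only upward deviations

    def update(self, error):
        """error is 1 if the classifier misclassified the sample, else 0."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        # accumulate only increases of the error rate above the running mean
        self.cusum = max(0.0, self.cusum + (error - self.mean - self.delta))
        if self.cusum > self.drift:
            self.cusum = 0.0    # reset after signalling so monitoring continues
            return "change"
        if self.cusum > self.warn:
            return "warning"
        return "stable"
```

Because the statistic only accumulates upward deviations, a converging learner keeps it near zero, while a sustained increase in error first crosses the warning threshold and then signals a change, mirroring the warning/detection behavior described above.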

Supervised classifiers will be identified and learned from cybersecurity data sources that are domain specific (e.g., features collected from specific types of text that are often labeled). Each data source will have certain feature sets that will allow us to choose the classifier that best fits that type of data. For example, the Hoeffding tree is an online algorithm for learning a decision tree classifier; these algorithms have the desirable quality that they can easily handle many data types that are not as trivial to work with for other classifiers (e.g., neural networks). If the data are sampled from real-valued quantities and the decision rule is simple, then an online linear classifier can be learned by using stochastic gradient descent to minimize a convex loss function (e.g., the hinge loss would produce a support vector machine with a linear kernel). To address the possibility of learning in a nonstationary environment, we propose to use ensemble classifiers, which have been a popular solution for learning in nonstationary environments using a passive strategy that continuously updates the parameters of the model whenever labeled data are available. Our framework trains multiple classifiers on each of the data sources when labeled data are available (see Figure 3, which shows models being added to the ensemble as new data become available, with adaptive weights assigned to the different models to attempt to maximize the accuracy of the final model). Ensembles have been shown to provide desirable theoretical and empirical traits, which is why we propose to learn a classifier on each data source and aggregate their decisions (Freund and Schapire, 1997; Breiman, 1996; Breiman, 2001). We will investigate coupling the change detection algorithms with the supervised classifiers to enable them to rapidly learn in a changing cybersecurity environment.
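The online linear classifier mentioned above can be sketched as follows: a minimal stochastic-gradient-descent update on the L2-regularized hinge loss, which corresponds to a linear SVM (the learning rate and regularization constant are illustrative choices, not tuned values from this project):

```python
import numpy as np

def sgd_hinge(stream, dim, lr=0.01, lam=0.001):
    """Online linear classifier trained by SGD on the hinge loss.
    stream yields (x, y) pairs with x a feature vector and y in {+1, -1}."""
    w = np.zeros(dim)
    b = 0.0
    for x, y in stream:                        # one labeled sample at a time
        margin = y * (np.dot(w, x) + b)
        if margin < 1:                         # hinge loss is active: update
            w = (1 - lr * lam) * w + lr * y * x
            b += lr * y
        else:                                  # only the L2 penalty shrinks w
            w = (1 - lr * lam) * w
    return w, b
```

Because each update touches a single sample, the same routine works whether the labeled data arrive in a batch or as a stream, which is the setting the passive ensemble strategy above assumes.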
Furthermore, we will also investigate the feasibility of using deep neural networks in these dynamically changing environments. Co-PI Ditzler's group has recently been working in this area and plans to contribute to the research in neural networks as well as to training students in applied machine learning for cybersecurity.


Approaches to Counter Adversarial Manipulation of Machine Learning Algorithms

Given the significance of the problem and the urgent need for adversarial big data analytic capabilities for Intrusion Prevention/Detection Systems (IPS/IDS) in a cyber environment, we have identified a novel approach to implement these capabilities: a multilayer framework that is disposable, diverse, and autonomous. Our framework can broadly be applied to problems that face large volumes of data, and it uses machine learning to leverage adversarial data. Our goal is to address the following machine learning limitations:
– The increasing prevalence of classification problems with massive volumes of streaming data challenges state-of-the-art data mining algorithms. Application areas include – but are not limited to – climate, remote sensing, fraud detection, web usage tracking, IPS/IDS, and malware detection for cybersecurity data. Many "commercial-off-the-shelf" classifiers are of little use if they cannot cope with learning in some of these harsh environments.
– Recent research has shown that many classifiers are susceptible to attacks even when the exact form of the model (e.g., logistic regression or neural network) is unknown to the adversary.

The security and privacy of machine learning algorithms have been shown to be vulnerable across many classes of models (e.g., logistic regression, neural networks, etc.). In fact, adversaries in machine learning have been a topic of recent interest and concern due to the mathematical flaws in the algorithms. Furthermore, cybersecurity domains pose an even larger threat than many other applications of machine learning because of how easily an adversary can influence the training data or even the testing data. Application areas related to this proposal include – but are not limited to – remote sensing, fraud detection, web usage tracking, IPS/IDS, and malware detection for cybersecurity data. We will model the adversary in cybersecurity from preprocessing the data, to learning a model for classification, to making predictions on unseen data (i.e., a typical data science pipeline for analyzing data). It should be noted that the adversary's impact on the preprocessing of data is often overlooked (Liu and Ditzler, 2019; Elderman et al., 2017). Therefore, we have identified a data science pipeline that leverages information-theoretic feature selection (FS) – the process of identifying informative features in data – to understand how the adversary can negatively impact FS. Once robust and relevant features have been selected, we will focus on learning a classifier by leveraging adversarial data from cybersecurity. Co-PI Ditzler's team was the first to show how to insert data samples into a training dataset to poison an information-theoretic feature selection algorithm. Note that while cybersecurity is one area that can benefit from adversarial data, such data are often not used in benchmarks. Therefore, our efforts will focus on mathematical models of an adversary and a data science pipeline for processing cybersecurity data.
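As a minimal illustration of information-theoretic FS, features can be ranked by their empirical mutual information with the class label. This sketch assumes discrete feature values and is not the specific algorithm attacked in (Liu and Ditzler, 2019); it only shows the kind of statistic a poisoning adversary would target:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in bits between two
    equal-length sequences of discrete values."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in pxy.items():
        p_joint = c / n
        mi += p_joint * np.log2(p_joint / ((px[xv] / n) * (py[yv] / n)))
    return mi

def rank_features(X, y):
    """Rank the columns of the (numpy) feature matrix X by mutual
    information with the label sequence y, most informative first."""
    scores = [mutual_information(X[:, j].tolist(), list(y)) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: -scores[j])
```

An adversary who can insert crafted training samples can inflate or deflate these scores, demoting genuinely informative features, which is precisely the vulnerability the proposed pipeline is meant to study.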


Cybersecurity Detection, Protection and Forensic Analysis

Thrust Area Leader:
Salim Hariri, University of Arizona

Current cyberattack analysis and detection tools are mainly static and manually intensive. At the same time, the complexity of cyber systems, their dynamic behavior, and the proliferation of heterogeneous static and mobile devices make these tools incapable of accurately characterizing current states, detecting malicious attacks, stopping them or their fast propagation, and minimizing their impacts.
Research is needed to better understand human cognitive processes in relation to how alerts are processed and how best to present alerts to analysts so that appropriate action can be taken at the time of a cyberattack. In fact, what is needed is a paradigm shift in the way we model and characterize attacks, identify them, and develop prompt responses to stop them or their propagation and minimize their impacts on mission-critical operations. We urgently need to develop a cybersecurity science that provides the mathematical foundations to build the next generation of autonomic monitoring tools: tools that continuously monitor the cyber system resources and services 24/7, analyze the current state and predict the next operational state, perform anomaly behavior analysis to detect attacks, and proactively either recommend actions to stop attacks and minimize their impacts or respond automatically, depending on the severity of the attacks and their potential impacts on the overall system operations.

Research Approach:
The main research activities will include (1) the development of a theoretical framework to perform anomaly behavior analysis of cyber systems, protocols, and applications; (2) the development of cyber-social data structures and metrics that can be used to integrate social and cyber activities to improve the accuracy and the time to detect insider attacks; (3) the development of a bioinspired self-protection system; and (4) the development of a methodology to perform continuous forensic monitoring, analysis, and protection.


Theoretical Framework for Anomaly Behavior Analysis of Cyber Systems and Protocols

Our anomaly behavior analysis methodology, as shown in Figure 2, is defined over a universe space U, which is a finite set of events. Since we are modeling the overall cyber system behavior, we can consider the event set U as all possible transitions in the system. U is partitioned into two subsets N and A, which respectively denote the Normal and Abnormal events, such that N∪A=U and N∩A=∅. To model the space U, we need a representation map that represents the events in U for further analysis.

Thus, the representation map R maps the events in U to their representation patterns in U^R, written U ⇒^R U^R. Likewise, N^R and A^R respectively represent the normal events set N and the abnormal events set A, such that N ⇒^R N^R, A ⇒^R A^R, and N^R∪A^R=U^R. A detector is defined as a system D=(f,M) with two components f and M, where f is the anomaly characterization function, defined as f: U^R→[0,1], and M is the memory of the system that keeps the extracted normal patterns from N^R as a normal behavior model. With an output between 0 and 1, the function f specifies the degree of abnormality of a sample event sequence s∈U^R by comparing it with the stored normal model M: the greater the value of f, the greater the degree of abnormality of the sample s. The detector D is a binary classifier, which classifies a sample s∈U^R as normal or abnormal by comparing it with the normal model M. We can define D as:

D(s) = { abnormal,  if f(s,M) > τ_i
       { normal,    otherwise
where τ_i is the i-th element of an n-dimensional threshold vector. Detection occurs when the detector classifies a sample as abnormal. The detection errors are defined over a test set U_t^R⊆U^R. Two types of errors are considered for a detector: false positives and false negatives. A false positive occurs when a sample from the normal set N^R is detected as an abnormal event, defined as ε^+ = {s∈N^R │ D(s)=abnormal}; a false negative occurs when the detector classifies an abnormal sample s∈A^R as a normal event (an undetected anomaly), that is, ε^- = {s∈A^R │ D(s)=normal}.
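The detector formalism above translates directly into code. In this sketch, the characterization function f, the normal model M, and the threshold τ are application-specific placeholders supplied by the caller, matching the abstract definitions rather than any particular protocol:

```python
def make_detector(normal_model, f, tau):
    """Build the binary detector D = (f, M): a sample s is flagged
    abnormal when its anomaly score f(s, M) exceeds the threshold tau."""
    def D(s):
        return "abnormal" if f(s, normal_model) > tau else "normal"
    return D

def error_rates(D, normal_samples, abnormal_samples):
    """Empirical false-positive and false-negative rates of D:
    FP = normal samples flagged abnormal; FN = abnormal samples missed."""
    fp = sum(1 for s in normal_samples if D(s) == "abnormal")
    fn = sum(1 for s in abnormal_samples if D(s) == "normal")
    return fp / len(normal_samples), fn / len(abnormal_samples)
```

Sweeping tau and recomputing these two rates traces out the detector's operating curve, which is how a threshold such as the one discussed for Figure 4 would be chosen.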

We have applied this methodology to detect attacks against several protocols, including WiFi, DNS, BACnet, Modbus, and Bluetooth (Alipour, 2015; Satam, 2015; Pan, 2014; Mallouhi, 2011; Satam, 2018). In what follows, we briefly describe how to apply the approach to model the behavior of the WiFi protocol, as shown in Figure 3. In our protocol behavior analysis approach, we consider the frequency of a sequence of protocol transitions over a period of time as a measure of whether or not the protocol is behaving normally. During the training phase, state transitions are represented as a multiset of n-gram patterns (N_T^R), and their statistical properties are captured in the corresponding normal behavior model (M). During the testing phase, the frequency of any N consecutive transitions of the protocol in each session S_l is compared with the frequency of the normal transitions stored in the normal behavior model M. The difference between these two values specifies the anomaly degree for each sub-session S_(l,∆T).
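A simplified version of this n-gram analysis might look like the following sketch, which builds the normal model as a multiset of n-grams and scores a test session by the fraction of its n-grams never observed during training (a coarse stand-in for the full frequency comparison used in the actual methodology):

```python
from collections import Counter

def ngram_counts(transitions, n=3):
    """Multiset of the n-grams (n consecutive protocol transitions)
    occurring in a sequence of transitions."""
    return Counter(tuple(transitions[i:i + n])
                   for i in range(len(transitions) - n + 1))

def anomaly_score(session, normal_model, n=3):
    """Score a session by the fraction of its n-grams that do not
    appear in the normal behavior model (0 = fully normal, 1 = fully novel)."""
    observed = ngram_counts(session, n)
    total = sum(observed.values())
    if total == 0:
        return 0.0
    unseen = sum(c for gram, c in observed.items() if gram not in normal_model)
    return unseen / total
```

Training amounts to one call, e.g. `M = ngram_counts(normal_transitions)`, after which sub-sessions whose score exceeds the chosen threshold τ are flagged as anomalous.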

Figure 4 depicts the a-score (anomaly score) distributions of both attack and normal traffic in the same graph. By comparing the two distributions, we observe that the normal and abnormal traffic can be easily differentiated with a good margin, as indicated by the blue dashed area; the a-score threshold τ can therefore be set to any value between 6 and 15. A detailed discussion of our approach and these evaluation results is presented in (Alipour, 2013).
