Applications of data analytics and machine learning tools to the enhanced design of modern communication networks and security applications

  • Author: Ignacio Martín Martínez
  • Thesis supervisor: José Alberto Hernández Gutiérrez
  • Defense: Universidad Carlos III de Madrid (Spain), 2019
  • Language: Spanish
  • Thesis committee: Andres Marin Lopez (chair), Ignacio de Miguel Jiménez (secretary), Marco Ruffini (member)
  • Doctoral programme: Programa de Doctorado en Ingeniería Telemática, Universidad Carlos III de Madrid
  • Abstract
    • Lately, Artificial Intelligence (AI) and Machine Learning (ML) have become game-changing technologies thanks to their ability to generalize from data and infer algorithmic behaviors that cover a broader range of cases than humans can. In essence, these technologies aim to bring human-like intelligence to computing tasks so that machines can take over different functions. As a result, computer capabilities can be enhanced much faster than before, since programming no longer requires explicitly coding every possible outcome, but instead feeding algorithms with enough data to produce meaningful results.

      In spite of these advantages, the adoption and development of AI in many fields is still at an early stage: many companies are either unaware of its potential or unable to extract value from their data for different reasons. Additionally, AI technologies usually have large computing-power requirements and rest on a very strong mathematical and computational background.

      Thus, ML problems require expert practitioners who can understand and learn from the field of application in order to succeed. However, acquiring and maintaining an up-to-date team of engineers skilled in both is not within every company's reach, and the successful application of these technologies is often constrained by this limitation. Therefore, the aim of this thesis is to facilitate the application of these technologies across a broad range of knowledge areas by systematizing processes and methodologies.

      To this end, this thesis contributes to advancing the state of the art in a specific field whose lessons can be transposed to others: the Internet infrastructure. Specifically, the contributions focus on two areas, namely cybersecurity and optical WDM networks. Both fields are relevant to the advancement of the Internet infrastructure and have suffered from complex, time-consuming processes that could be improved with the help of Machine Learning. Furthermore, both pose interesting automation challenges, as the most time-consuming tasks are usually performed by humans, who tend to become overwhelmed by continuously growing management and planning requirements.

      At the beginning of the PhD period, we surveyed the literature on ML applied to cybersecurity and networking and found that both fields were starting to leverage AI tools in their processes. The first problem observed is that both fields come from a data-scarce paradigm in which collecting logs or operational data is not always expected, which complicates the acquisition of labeled datasets. Furthermore, many of the problems previously addressed are very straightforward ML classification/regression problems, leaving more complex formulations as open challenges. Overall, the state of the art shows that ML applied to cybersecurity and optical WDM networks is interesting from both academic and industrial viewpoints and has large potential for improvement.

      The proposed solutions target different problems observed in the study of the literature and leverage a range of AI techniques, from classical ML, feature selection and itemset mining algorithms to latent variable modeling, locality-sensitive hashing methods and even deep learning. In our experiments, different tools and formulations are proposed according to the most relevant problems observed in their respective fields.

      On the security side, we focus on the Android ecosystem, one of the largest and most popular, comprising a vast number of applications, operating system versions and even application markets. With a penetration above 80% of mobile devices, Android has attracted developers of both legitimate applications and malware, which has become a problem. At the current pace, malware generation rates are much higher than any security team or expert can handle, letting many malware samples go undetected and crippling many anti-malware solutions, such as antivirus engines, which depend on human-generated malware signatures and footprints.

      In this light, we propose a new approach to malware detection and application quality assessment that relies on application meta-information, that is, the data describing the application's details, such as description, category or permissions, that are not contained in the application code. Meta-data is leveraged as the input feature set for two important tasks in the Android security landscape: ML-based malware detection at runtime and scalable repackaging detection through semantic clustering of the textual elements of meta-data.

      The first application consists of using meta-data as the input features of a Machine Learning classification system, trained on a labeled collection of Android applications from Google Play, to determine whether they are malware or not. Specifically, logistic regression, SVM and random forest algorithms are evaluated over differently configured malware-goodware datasets to assess the overall ability of meta-information to detect malware.
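
      As a minimal sketch of this setup, the snippet below cross-validates the three classifier families over a flattened meta-data table; the CSV file and its column names are illustrative assumptions, not the thesis's actual dataset schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical dump of Google Play meta-data with a binary malware label.
df = pd.read_csv("play_metadata.csv")
X = df[["developer_reputation", "issuer_reputation", "app_size",
        "n_files", "n_images"]]          # assumed numeric meta-data features
y = df["is_malware"]                     # 1 = malware, 0 = goodware

# Cross-validate the three families of classifiers named in the abstract.
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("linear SVM", LinearSVC()),
                  ("random forest", RandomForestClassifier(n_estimators=200))]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```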

      In the experiments, distinct meta-data features are evaluated with feature selection methods such as Step-AIC, Pearson's chi-squared test or the Gini index. Application permissions, as declared in the Google Play meta-data fields, are considered for analysis and reduced using feature hashing techniques. In the end, the results indicate a remarkably high ability of meta-data to identify malware patterns, reaching F-score values of nearly 0.9, with reputation-based scores for application developers and signature issuers proving clearly relevant.
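
      The following sketch illustrates the two preprocessing steps just mentioned, feature hashing of permission lists and chi-squared feature scoring, on toy data (the permission lists and labels are made up):

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2

# Toy permission lists per application; real inputs would come from the
# Google Play meta-data fields mentioned above.
app_permissions = [["INTERNET", "SEND_SMS"], ["INTERNET"],
                   ["READ_CONTACTS", "SEND_SMS"], ["INTERNET", "CAMERA"]]
y = [1, 0, 1, 0]  # toy labels: 1 = malware, 0 = goodware

# Hash variable-length permission lists into fixed-width count vectors
# (alternate_sign=False keeps counts non-negative, as chi2 requires).
hasher = FeatureHasher(n_features=64, input_type="string", alternate_sign=False)
X = hasher.transform(app_permissions)

# Keep the hashed features most dependent on the label by chi-squared score.
selector = SelectKBest(chi2, k=10)
X_reduced = selector.fit_transform(X, y)
```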

      The final model comprises a subset of 15 features, including developer and certificate-issuer reputations, upload and update information, application size, and the number of files or images, among others. Moreover, the proposed solution is fast, thanks to the speed of ML algorithms at prediction time, and robust, supported by experiments showing that many of the selected features are useful yet redundant, so the system keeps working when other features are missing or altered.

      Arising from the observations of this initial machine learning analysis, antivirus (AV) engines from multi-scanner tools, i.e. systems that return the scanning results of several AV solutions rather than a single one, are inspected using data analytics and AI technologies. This study is motivated by their proven lack of consensus at the detection and categorization levels, observed in the literature and verified in our work on malware detection using ML with meta-data. The aim of this analysis is twofold: to advance the understanding of AV detection patterns and policies, and to improve multi-engine detection by proposing different aggregation and cleaning tools.

      Initially, AV engine detections are inspected, showing that most engines disagree when detecting malware, to the extent of never fully agreeing on the detection of a single application. Moreover, different detection patterns are observed, namely leader, follower and eccentric engines, and validated by means of Principal Component Analysis, rule mining and pairwise detection correlations. These experiments show that eccentric engines exhibit mostly zero correlation with all the rest, whereas rule mining reveals links between the detections of follower and leader engines.
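
      A minimal sketch of the pairwise-correlation step, assuming the detections are available as a 0/1 matrix with one row per sample and one column per engine (the file name and layout are assumptions):

```python
import pandas as pd

# Hypothetical 0/1 matrix: one row per sample, one column per AV engine,
# cell = 1 if that engine flagged the sample as malware.
det = pd.read_csv("av_detections.csv", index_col=0)

# Pairwise Pearson correlations between engine detection vectors.
corr = det.corr()

# Engines whose average absolute correlation with the others is near zero
# behave like the "eccentric" engines described above (the self-correlation
# of 1.0 on the diagonal is excluded from the average).
n = len(corr)
eccentricity = ((corr.abs().sum() - 1.0) / (n - 1)).sort_values()
print(eccentricity.head(10))
```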

      Besides, Principal Component Analysis verifies that, contrary to expectations, AV engines are not always redundant and must all be used for completeness: for the studied dataset, PCA can only reduce the number of engines to 48, which indicates high variability in something that should exhibit a clear pattern. Finally, an estimation of the per-application malware risk based on Structural Equation Models is proposed. The model produces a probabilistic estimate of whether an input sample is malware according to the engines that trigger a malware detection in their analysis.
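
      The dimensionality check can be sketched as follows, reusing the same hypothetical detection matrix; the 95% variance threshold is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Same hypothetical 0/1 engine-detection matrix as in the previous sketch.
det = pd.read_csv("av_detections.csv", index_col=0)

pca = PCA().fit(det.values)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Number of principal components needed to explain 95% of the variance;
# the thesis reports that 48 engines remain for its dataset.
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{n_components} components explain 95% of the variance")
```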

      Regarding malware families, we propose a lightweight categorization scheme, SignatureMiner, that achieves scores comparable to other alternatives in the literature. SignatureMiner follows a semi-supervised approach that cleans AV detection signatures and applies min-hashing to them to present the user with a list of similar tokens resembling actual malware families. Supported by this system, any user can develop a set of labeling rules that compress different tokens under the umbrella of a clearly distinguished malware family. Thanks to its user supervision, SignatureMiner achieves performance similar to its competitors while having fewer requirements, such as large quantities of available data.
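
      A hedged sketch of the token-similarity step, using the datasketch library's MinHash over character shingles of cleaned signature tokens (the tokens shown are made-up examples, and the shingle size is an assumption):

```python
from datasketch import MinHash

def token_minhash(token: str, k: int = 3, num_perm: int = 128) -> MinHash:
    """Min-hash signature over the character k-shingles of a cleaned token."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(token) - k + 1, 1)):
        m.update(token[i:i + k].encode("utf8"))
    return m

# Made-up signature tokens; real ones come from cleaned AV detection labels.
a = token_minhash("androrat")
b = token_minhash("android.androrat.a")
print(a.jaccard(b))  # estimated Jaccard similarity between the two tokens
```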

      Using this system, we normalize and categorize AV signatures into 41 distinct families and three broader categories, namely adware, harmful and unknown. These categories are assigned to each family depending on the nature of its threats: either annoying but (up to some extent) harmless applications crowded with advertisements in many forms (ad bars, screens, clickbait...) or serious malware applications targeting much more specific aims and far more disturbing to users, including premium-service subscriptions, information leakage or privacy-threatening attacks, among others.

      In fact, we observe that the unknown category stems from AV engines that do not report concise family information, even though every malware sample should belong to one of the two aforementioned categories. As a result, we propose an ML classifier that assigns a specific category to unknown malware based on which engines detect each sample. The system is trained and validated on the samples where an adware or harmful family is reported, and later used to unveil that nearly 51% of the applications assigned the unknown label are in fact harmful.

      The logistic regression algorithm is chosen because it provides probabilistic outcomes along with model weights that can be interpreted as a measure of which category each AV engine focuses on, either harmful (positive weight) or adware (negative weight). Finally, a random forest classifier is used to improve performance, obtaining an accuracy above 0.9 after hyper-parameter tuning.
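
      A minimal sketch of this classifier, assuming a training table with one 0/1 column per engine and a family-derived category label (the file and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training split: one 0/1 column per AV engine plus the
# category ("harmful" or "adware") derived from the reported family.
labeled = pd.read_csv("labeled_families.csv")
X = labeled.drop(columns=["category"])
y = (labeled["category"] == "harmful").astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Interpret the weights: positive = harmful-leaning engine,
# negative = adware-leaning engine.
leaning = pd.Series(clf.coef_[0], index=X.columns).sort_values()
print(leaning)
```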

      After the complete malware analysis, including AV engine exploration and improvements, we return to meta-data and explore it as a repackaging-detection indicator. Repackaging is a very efficient way of creating Android malware: it consists of downloading an application from a market, introducing custom malicious code and uploading the application to the market again, so that unsuspecting users mistake it for the original application and run the malicious code on their devices. Despite the facilities this approach provides to malware developers, it has a downside: the cloned (or repackaged) application must keep its market meta-data almost identical to maximize its effect, since it relies on deceiving users into believing they are installing the legitimate application; thus, any major change in meta-data might be easily detected. From a detection perspective, tracking applications with very similar meta-data can surface potential cloners and their victims together.

      To demonstrate this approach, we download the meta-data of a large collection of nearly 1.3 million Android applications from Google Play and apply min-hashing, an item-based similarity clustering algorithm, to generate groups of potentially cloned applications called app-sets. App-sets are conceived as application groups in which clones and victims should concur, provided they have very similar meta-data fields.
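
      The grouping can be sketched with datasketch's MinHashLSH, which buckets applications whose meta-data shingle sets have a high estimated Jaccard similarity; the package names, descriptions and the 0.8 threshold below are illustrative assumptions:

```python
from datasketch import MinHash, MinHashLSH

def description_minhash(text: str, num_perm: int = 128) -> MinHash:
    """Min-hash signature over the word set of an application description."""
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode("utf8"))
    return m

# Toy package -> description map; the real input is 1.3M Google Play entries.
apps = {
    "com.example.puzzle": "a fun free puzzle game with hundreds of levels",
    "com.clone.puzzle":   "a fun free puzzle game with hundreds of levels!",
    "com.other.todo":     "a simple to-do list and reminder application",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # similarity threshold assumed
hashes = {pkg: description_minhash(d) for pkg, d in apps.items()}
for pkg, m in hashes.items():
    lsh.insert(pkg, m)

# Candidate app-set for one application: near-duplicate descriptions.
print(lsh.query(hashes["com.example.puzzle"]))
```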

      Then, using title edit distance and description cosine similarity, we create a unified scoring metric, the Application Similarity Index (ASI), which modulates the description similarity between two applications by the number of edits their titles need to turn one into the other. This value enables app-set ranking, so that app-sets can be sorted by repackaging likelihood for optimal use of the analysis resources (typically human analysts). It is worth noting that this is a ranking approach: it solely provides an ordering of app-sets to be inspected and validated, leaving to the analyst the task of determining which application is the victim and which the repackager.
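
      The thesis's exact ASI formula is not reproduced here; the sketch below simply combines the two named ingredients, TF-IDF cosine similarity of descriptions dampened by the Levenshtein distance between titles, in one plausible way:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance between two titles."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def application_similarity_index(title_a, title_b, desc_a, desc_b):
    tfidf = TfidfVectorizer().fit_transform([desc_a, desc_b])
    cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    # Dampen description similarity by the title edit distance; the exact
    # modulation used in the thesis may differ from this one.
    return cosine / (1 + edit_distance(title_a, title_b))
```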

      This approach, called CloneSpot, has been able to identify more than 420K suspicious applications inside more than 50K app-sets out of the 1.3M collection. For validation, we have tracked the removal rates of these applications and observed that, one year after the initial collection, nearly half of the suspicious applications had been removed, affecting a considerable number of app-sets.

      On the optical WDM network side, we contribute to the introduction of Machine Learning by proposing an integral pipeline framework that improves the development of ML-powered network protocols as enhanced heuristics emulating optimal solutions in many areas. The framework comprises data generation, modeling and validation, and network implementation. In this thesis we focus on the first two steps, developing proof-of-concept solutions for both.

      Dataset generation and data labeling are addressed with Netgen, a versatile network data generator built on Net2Plan. Netgen works as a wrapper around Net2Plan, using the tool as its core and enabling the parallel execution of many instances to speed up the data generation process. To ease the process even further, Netgen may be fed existing traffic-matrix data or asked to produce variations of a "canonical" traffic matrix by adding statistical noise. Netgen also handles the data structure, returning networking datasets in an ML-friendly CSV format.
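
      Netgen's actual interface is not reproduced here; the snippet below only illustrates the noise idea it implements, deriving training samples from a toy "canonical" traffic matrix (the matrix size, units and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "canonical" 3-node traffic matrix (e.g. demands in Gbps).
canonical = np.array([[0., 10., 5.],
                      [10., 0., 8.],
                      [5., 8., 0.]])

def noisy_variant(tm, rel_sigma=0.1):
    """One statistical variation of the canonical matrix."""
    noise = rng.normal(1.0, rel_sigma, tm.shape)  # multiplicative Gaussian noise
    variant = np.clip(tm * noise, 0.0, None)      # demands stay non-negative
    np.fill_diagonal(variant, 0.0)                # no self-traffic
    return variant

dataset = [noisy_variant(canonical) for _ in range(1000)]
```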

      Finally, for the modeling and validation step of the framework, this thesis chooses the classical Routing and Wavelength Assignment (RWA) problem in optical WDM networks to be approximated through ML. RWA may be solved either through Integer Linear Programming (ILP) formulations, which are NP-hard but provide optimal solutions, or through heuristic algorithms that produce suboptimal solutions much faster. Using ML, we produce an alternative formulation of the problem: a heuristic ML algorithm that uses ILP-solved data to learn how to assign Routing and Wavelength Classes (RWCs) to incoming traffic matrices so as to yield a feasible network solution.

      Thanks to the predictive abilities of ML, the proposed heuristic is faster than a common ILP and most heuristic approaches, and it obtains much better solutions than heuristic approaches thanks to the learning process over optimally labeled examples. In other words, the proposed algorithm learns to emulate ILP solutions much faster than the ILP itself. To avoid returning infeasible cases, the system considers the 10 top-ranked RWCs suggested by the ML model, introducing flexibility into the ILP-based solutions.
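
      The top-10 fallback can be sketched as follows; `model` stands for any trained classifier with predict_proba and `is_feasible` for the network feasibility check, both of which are stand-ins rather than the thesis's actual API:

```python
import numpy as np

def assign_rwc(model, traffic_matrix, is_feasible, k=10):
    """Try the k most probable Routing and Wavelength Classes in order."""
    probs = model.predict_proba(traffic_matrix.reshape(1, -1))[0]
    for rwc in np.argsort(probs)[::-1][:k]:   # most likely classes first
        if is_feasible(rwc, traffic_matrix):
            return int(rwc)
    return None  # no top-k class was feasible; fall back to ILP or heuristic
```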

      We use logistic regression and deep learning algorithms and obtain correct RWA assignments with very high probability (above 70% in all cases) and very quickly for medium-sized networks such as the well-known Abilene topology. In this light, we demonstrate that ML can be used to develop advanced networking protocols that emulate optimal solutions while taking into account dynamic parameters, such as the network load or traffic at each moment.

      In sum, this thesis builds different AI-based components that serve their fields toward specific targets, enhancing required functionalities and capabilities through systematic approaches and methodologies that can be replicated in other components and fields. As a result, all the pieces of the thesis contribute to the design and development of the concept of AI as a Service (AIaaS), a paradigm for integrating AI technologies into specific knowledge areas with limited expertise in both AI and the area itself.

      Future work involves using more data and more complex models that can enrich the existing collections. Indeed, new data points and fields would enable further exploration of the potential of the different components. In addition, the ideas and methods of this thesis could be transposed to other knowledge areas where AI/ML has not yet been introduced. Moreover, all the components have been proposed as building blocks of the aforementioned AIaaS concept, whose design and development would be the most direct and straightforward continuation of this work.

