Efficient network traffic classifier: composition approach

Hossein Doroud

Author: Hossein Doroud
Thesis supervisor: Andres Marin Lopez
Defended: Universidad Carlos III de Madrid (Spain), 2019
Language: Spanish
Thesis committee: Carlos García Rubio (chair), Francisco Javier Simó Reigadas (secretary), Robert Bifulco (member)
Doctoral programme: Doctoral Programme in Telematics Engineering, Universidad Carlos III de Madrid
Subjects:
- Mathematics
  - Computer science
    - Informatics
Links
- Open-access thesis at: e-Archivo
Abstract
- Internet Service Providers (ISPs) are eager to obtain metadata from the traffic they carry. This metadata is a valuable asset that helps ISPs improve their services and reduce their operational costs.
  
  Classifying network traffic by the application (app) that generates it is vital for today’s ISPs and network providers. They use Network Traffic Classification (NTC) to improve many aspects of their networks. NTC enables an ISP to detect the traffic requirements of emerging applications and to develop its infrastructure accordingly. ISPs can also perform traffic engineering and improve Quality of Service, and can therefore offer new services to their customers with better pricing. NTC can serve as the core of an automated Intrusion Detection System (IDS), detecting patterns of denial-of-service attacks, suspicious user activity, and worms. In addition, NTC can help an automated IDS re-allocate network resources dynamically according to customers’ priorities. NTC is also a fundamental component of ISP-based Lawful Interception solutions, since governments require ISPs to provide Lawful Interception of IP data traffic: the ISP can report the network activity of a specific user at a specific time upon the government’s request.
  
  Over the past two decades, a large body of research has contributed to the evolution of NTC. Most of it follows one of three approaches, each corresponding to a different aspect of a flow.
  
  The port-based approach is the oldest, dating from the early days of the Internet. It classifies flows by their destination port number. Although it is fast and consumes little memory and processing, it suffers from a high false-positive ratio, because many apps either have no registered port number (e.g., Team Fortress 2) or use ’well-known’ port numbers, for example when ICMP proxies are used to bypass firewalls or during registration at hotel hotspots. This limits the accuracy of port-based classifiers to between 30% and 70%. Nevertheless, some apps, such as email, still have to use registered port numbers to communicate with different servers, so the port number remains valuable for classifying specific apps.
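The port-based idea can be sketched as a plain table lookup. This is only an illustrative sketch: the table `REGISTERED_PORTS` and the function `classify_by_port` are hypothetical names, and the handful of entries stands in for a full registry of port assignments.

```python
# Hypothetical lookup table with a few well-known registered ports.
REGISTERED_PORTS = {
    25: "smtp",
    53: "dns",
    80: "http",
    110: "pop3",
    143: "imap",
    443: "https",
}


def classify_by_port(dst_port):
    """Return the app label for a destination port, or None when the port is unknown."""
    return REGISTERED_PORTS.get(dst_port)
```

A flow to port 25 is labelled as email traffic, while a flow to an unregistered port stays unclassified, which is where the accuracy limits described above come from.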
  
  Payload-based classification, or Deep Packet Inspection (DPI), is another approach; it applies string matching to the packet payload to find app signatures. It is widely regarded as the most accurate approach in the field, and the ground truth of many publications is based on it. Despite its high accuracy, it has three main drawbacks: I) It is expensive in terms of memory and processing, because it must store the app signatures in a data structure and check every byte of the packet payload exhaustively to find a signature. It is therefore not practical for online use on a high-speed link, and many research works have implemented DPI on hardware such as NetFPGA to meet high-speed link requirements. II) Looking inside user data raises privacy concerns. III) It is ineffective against encrypted flows and against applications with no existing signature, such as evolving, polymorphic, and zero-day malware.
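A minimal sketch of signature matching on a payload follows, assuming three simplified signatures written as regular expressions. The names `SIGNATURES` and `classify_by_payload` are hypothetical, and real DPI engines use far larger rule sets and more efficient matching structures.

```python
import re

# Illustrative signatures only; they match the first bytes of a flow.
SIGNATURES = {
    "http": re.compile(rb"^(GET|POST|HEAD|PUT|DELETE) \S+ HTTP/1\.[01]"),
    "ssh": re.compile(rb"^SSH-\d\.\d+-"),
    "bittorrent": re.compile(rb"^\x13BitTorrent protocol"),
}


def classify_by_payload(payload: bytes):
    """Return the first app whose signature matches the payload, else None."""
    for app, pattern in SIGNATURES.items():
        if pattern.search(payload):
            return app
    return None
```

An encrypted payload matches none of the patterns and returns `None`, which illustrates drawback III above.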
  
  Although both the port-based and DPI approaches have limitations, they are still widely used in practice and are adopted by commercial products (e.g., Cisco NBAR, Ipoque PACE), mainly to implement application-level firewalls. In particular, many commercial solutions apply DPI to encrypted HTTPS traffic through a man-in-the-middle technique based on trusted SSL certificates (e.g., SonicWall DPI-SSL, A10 Thunder SSLi). In this approach, a DPI classifier is placed between an end user and the corresponding server. Each endpoint establishes an SSL tunnel to the DPI classifier, and the classifier forwards the traffic to the other side. The DPI classifier can therefore decrypt and classify the traffic while the communication remains protected by encryption.
  
  In addition, BlindBox has been proposed to apply DPI directly to encrypted traffic. BlindBox encrypts the DPI rules with the endpoints’ public key and uses the encrypted rules to detect different apps.
  
  To overcome the aforementioned issues, Machine Learning (ML) has been used to recognize apps from the statistical fingerprints of their flows. This approach is the dominant trend in the field of NTC. ML classifiers automatically build a model that maps flow-level statistical features to apps during a training phase, and then use the model to classify new flows. Because the resulting model is much smaller than the signatures used by DPI approaches, and its computational complexity is far lower than that of regular expression matching, it is more efficient in terms of memory and processing. In addition, since it does not require access to the packet payload, it raises no privacy concerns and remains practical for encrypted flows. However, real-life tests show that it can recognize only a few traffic categories and is not accurate enough for mission-critical applications.
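The train-then-classify workflow can be sketched with a deliberately simple model. The sketch below uses a nearest-centroid rule rather than any algorithm named in the thesis, and the three features (mean packet size, packet size deviation, mean inter-arrival time) are assumptions chosen for illustration; `flow_features`, `train` and `classify` are hypothetical names.

```python
import math
from statistics import mean, pstdev


def flow_features(packet_sizes, timestamps):
    """Summarise a flow as (mean size, size std-dev, mean inter-arrival time)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return (mean(packet_sizes), pstdev(packet_sizes), mean(gaps) if gaps else 0.0)


def train(labelled_flows):
    """Training phase: average the feature vectors of each app into one centroid."""
    grouped = {}
    for app, features in labelled_flows:
        grouped.setdefault(app, []).append(features)
    return {app: tuple(mean(col) for col in zip(*rows)) for app, rows in grouped.items()}


def classify(model, features):
    """Classification phase: pick the app whose centroid is closest to the flow."""
    return min(model, key=lambda app: math.dist(model[app], features))
```

Only packet sizes and timing are read, never the payload, which is why this family of classifiers still works on encrypted flows. A practical version would at least scale the features so that packet size does not dominate the distance.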
  
  In this thesis, I propose a practical approach, named Chain, for improving the efficiency of traditional traffic classification techniques. It chains fast classification stages (port-based and machine-learning-based), which are combined to lower their false-positive rate, with a more precise but more time- and resource-demanding stage based on DPI. Experimental results demonstrate that Chain obtains results in line with DPI approaches in terms of Precision, Recall, Accuracy and Area Under the Curve (AUC), while being 45% faster than nDPIng, a well-known DPI implementation.
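The chaining idea can be sketched as follows. The abstract does not spell out how the fast stages are combined, so the rule used here (accept the fast verdict only when the port-based and ML stages agree, otherwise hand the flow to DPI) is an assumption, as are the function name `chain_classify` and the fallback to the ML verdict when DPI finds no signature.

```python
def chain_classify(flow, port_stage, ml_stage, dpi_stage):
    """Classify a flow with cheap stages first and DPI only when needed.

    Each stage is a callable that takes the flow and returns a label or None.
    Returns a (label, deciding_stage) pair.
    """
    port_label = port_stage(flow)
    ml_label = ml_stage(flow)

    # Assumed combination rule: the fast stages decide only when they agree.
    if port_label is not None and port_label == ml_label:
        return port_label, "fast"

    # Otherwise pay for the expensive payload inspection.
    dpi_label = dpi_stage(flow)
    if dpi_label is not None:
        return dpi_label, "dpi"

    # Assumed fallback: keep the ML verdict when DPI has no matching signature.
    return ml_label, "ml-fallback"
```

Under this assumption the speed-up comes from the share of flows that never reach the DPI stage, and payloads are inspected only for flows that the fast stages could not settle.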
  
  I have implemented Chain on the Traffic Identification Engine platform and evaluated its performance with a dataset published by the CBA research group at the Technical University of Catalunya. Although this dataset was generated and captured specifically for NTC research, it contains traffic from 2014 and is outdated. I have therefore developed a platform named GTEngin to collect a new ground truth derived from mobile apps. GTEngin is scalable and can control multiple Android devices simultaneously. It also runs different apps on different devices at a given time, so the captured traffic is a mixture of flows from different devices and apps, which reflects a real-life scenario. I built a ground truth consisting of 5,000 network flows generated by 15 popular Android apps such as Facebook and Instagram.
  
  Next, I re-evaluated the performance of my proposal with the new ground truth generated by the most popular Android applications. The experiment shows that, thanks to its modular design, Chain adapts to new network traffic more flexibly than a state-of-the-art DPI classifier. Moreover, the performance of the traditional classifiers drops when mobile app traffic is used as the ground truth. The outdated signature database of nDPIng and the dynamic port number assignment of mobile apps can explain the drop for the DPI-based and port-based classifiers, respectively. The drop for ML classifiers, however, needs to be investigated by studying mobile traffic and the mobile Internet ecosystem.
  
  Mobile applications typically connect to two types of online services: first-party services controlled by the application developers, and third-party services integrated into mobile apps for advertising and tracking purposes, or to embed other services such as online payment and weather reports. These third-party services may in turn rely on Cloud Service Providers (CSPs) to outsource their cloud infrastructure. It has been shown that third-party traffic dominates all app traffic in the mobile ecosystem, which demonstrates the importance of studying this type of mobile traffic in addition to first-party traffic.
  
  I leveraged accurate traffic fingerprints from thousands of mobile apps that were collected through crowdsourcing with Lumen. Lumen Privacy Monitor is a mobile application, available for free on Google Play, that aims to promote mobile transparency and user awareness. Users run Lumen to identify data leaks and the presence of online tracking services in their apps. Lumen runs locally on the device and intercepts all network traffic without requiring root permissions, by using the Android VPN permission. The results show that mobile apps rely significantly on CSPs, with the major CSPs being used by 85% of the apps. This reliance is motivated by app developers’ interest in integrating new services into their apps or in improving the user experience by getting closer to their users.
  
  The tight relationship between mobile apps and CSPs, together with the fact that a limited number of CSPs dominate the market, poses a new challenge for NTC. A single CSP can support many apps from different categories and provide them with similar resources. For example, several apps with different functionalities can use the same ad network to increase their revenue. In this case, the apps generate flows with the same network footprint when they connect to the ad network. Such communication produces traffic from various apps with the same statistical pattern regardless of each app’s functionality. This mixture increases the ambiguity of ML-based NTC, so CSP-related traffic can be considered noise for the ML approach.
  
  I rely on the Pointer (PTR) records of destination IP addresses to detect CSP-related traffic and filter it out of the ground truth collected with GTEngin. The ground truth contains 742 unique IP addresses, and I could not obtain a PTR record for 25% of them. For those addresses, I rely on Autonomous System (AS) information: a flow is CSP-related if its destination IP address is hosted by an AS whose IP blocks all belong to a given CSP. During filtering, 69% of the flows are detected as CSP-related, and 30% of these detections are based on AS information. Filtering out the CSP-related traffic reduces the ground truth to 1,353 network flows. The majority of the filtered flows result from app communication with well-known CSPs such as Google and Facebook.
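The two-step detection (PTR record first, AS information only when no PTR record exists) can be sketched as below. The lookups are passed in as plain dictionaries so that the sketch runs without network access; the suffix table, the AS table, the AS number 64500 (taken from the range reserved for documentation) and the function name `detect_csp` are all illustrative assumptions, not the thesis's actual data.

```python
# Illustrative PTR suffixes mapped to the CSP that operates them.
CSP_PTR_SUFFIXES = {"1e100.net": "Google", "amazonaws.com": "Amazon"}

# Hypothetical AS whose IP blocks all belong to a single CSP.
CSP_ONLY_ASNS = {64500: "ExampleCSP"}


def detect_csp(dst_ip, ptr_records, ip_to_asn):
    """Return the CSP behind a destination IP address, or None if it is not CSP-related."""
    ptr = ptr_records.get(dst_ip)
    if ptr is not None:
        host = ptr.rstrip(".").lower()
        for suffix, csp in CSP_PTR_SUFFIXES.items():
            if host == suffix or host.endswith("." + suffix):
                return csp
        return None
    # No PTR record: fall back to the AS that hosts the address.
    return CSP_ONLY_ASNS.get(ip_to_asn.get(dst_ip))
```

Flows for which `detect_csp` returns a provider would be dropped from the ground truth before training and evaluation.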
  
  I use the ground truth without the CSP-related traffic to evaluate the performance of Chain and of C4.5, an ML algorithm. C4.5 is a supervised decision tree classifier, and many research works report its superiority over conventional ML algorithms in the field of NTC. The experimental results clearly show that filtering out CSP-related traffic improves precision in most classes. However, the classification of Instagram instances does not follow this trend. A model learned from an imbalanced training dataset is biased toward the classes with the most instances, and filtering CSP-related traffic reduces the number of Instagram instances from 236 to 30. Although C4.5 is more robust than other conventional ML algorithms when dealing with imbalanced training data, the imbalance still affects the classifier’s performance in extreme cases like Instagram.
  
  The performance of Chain also increases, owing to its modular and sequential design: the performance improvement of Chain’s ML module raises Chain’s overall performance. However, Chain does not inherit the full precision improvement of the ML module; a fraction of the improvement is lost in the next module, which is based on the DPI approach.
  
  The Chain classifier is a new approach for improving the efficiency of DPI traffic classification techniques. Since DPI is widely used in application firewalls, Chain can be used to perform online classification faster and with fewer computational resources.
  
  DPI adapts poorly to a new ecosystem. Extracting signatures from the vast number of new apps that enter the Internet ecosystem every day requires intensive human work, and updating a DPI database with the new signatures adds further overhead to adapting a DPI classifier to a new ecosystem. Chain, on the other hand, can be adapted quickly and with the same level of complexity as an ML classifier. In that case, Chain performs close to the ML classifier, which is much better than a DPI classifier with an outdated signature database, although Chain’s DPI module must be updated for Chain to reach its maximum performance. Chain therefore outperforms a DPI classifier when the traffic mix is in a transition phase caused by an emerging trend (e.g., the emergence of mobile traffic in the Internet ecosystem).
  
  The multi-stage architecture is especially suited to the recently standardized approach for network services defined in the Service Function Chaining Architecture. Both stages of Chain can be implemented as Virtual Network Functions (VNFs) and can act as Classifiers for network policies that depend on the app that generated the traffic. Implementing them as VNFs allows them to be instantiated separately, benefiting fully from the scalability and elasticity of Network Function Virtualization to dynamically adjust resources and traffic forwarding according to the traffic mix that traverses the Chain stages. Service Chaining also provides a standard mechanism for encapsulating flow metadata that affects traffic forwarding and service chaining: the Network Service Header (NSH). In the case of Chain, the first stage encapsulates in the NSH either its own classification response (and the flow exits the Chain services), or the verdict of ML together with the subsequent classification service (the next stage of Chain).
  
  Chain is also relatively more privacy-friendly than full DPI, because it extracts features from the packet payload only when classification has not been possible by other, privacy-preserving means.
  
  This thesis opens a line of research that can be continued and extended in many directions. Chain is designed to operate online and is implemented on a platform that also supports online operation. However, Chain’s online classification performance has not yet been investigated and remains an open field for future research.
  
  Furthermore, implementing Chain on hardware such as NetFPGA can be considered a next step toward investigating its online performance characteristics. The multi-stage architecture of Chain is also especially suited to implementation as a chain of VNFs, in which stages with different memory and computational costs can be scaled and deployed independently according to the dynamic traffic mix and rate.
  
  Detecting illegitimate access to a network resource falls within the scope of intrusion detection. Many IDSs use DPI to detect malware and malicious activity by looking for their signatures in the traffic. Although studying the performance of Chain in different network applications is beyond the scope of this thesis, it would be interesting to investigate the performance gain achievable by replacing conventional DPI classifiers with Chain in IDSs.
  
  I have proposed an algorithm to detect CSP communication based on PTR records and AS numbers (ASNs). However, my solution cannot distinguish between the various services that a CSP can provide. CSP services fall into two main categories: I) standard services such as advertisement and CDN, and II) general services such as cloud computing, which can be customized to support the main functionalities of an app. Although traffic from the second category is valuable for NTC, traffic from the first category reduces the contrast between the traffic generated by different apps and reduces the precision of ML-based classifiers. It is therefore essential to distinguish between these two categories of CSP-related traffic, in order to filter out the noise and preserve the valuable data. Identifying CSP services is an interesting open research area that can be addressed in future work.
  
  A vast number of apps are published worldwide, and GTEngin provides a great opportunity to build a ground truth from their traffic. App developers continually integrate new functionality into their apps to satisfy their customers. Although GTEngin labels the captured flows by process name, NTC needs a ground truth with a higher resolution in order to develop a classifier that can detect the various functionalities of a given app. GTEngin should be updated to this end.
  
  Finally, the Internet of Things (IoT) is expected to bring sensors, computation and communication into all aspects of human life. Supporting communication between IoT devices is therefore one of the constraints of 5G designs. Traffic generated by IoT devices is expected to shape future network traffic, so it is essential to update GTEngin to support IoT devices in order to build the ground truth required by any NTC study targeting IoT traffic.
