Understanding road scenes using deep neural networks

  • Author: Hamed H. Aghdam
  • Thesis supervisor: Domènec Puig Valls
  • Defense: Universitat Rovira i Virgili (Spain), 2018
  • Language: Spanish
  • Thesis committee: Petia Radeva Ivanova (chair), Antonio Moreno Ribas (secretary), Rodrigo Moreno Serrano (member)
  • Doctoral programme: Programa de Doctorado en Ingeniería Informática y Matemáticas de la Seguridad, Universidad Rovira i Virgili
  • Abstract
    • According to the National Safety Council, medically consulted motor-vehicle injuries for the first six months of 2015 were estimated at about 2,254,000. The World Health Organization also reported about 1,250,000 road-traffic fatalities in 2015. These studies show that human error is the sole cause of 57% of all accidents and a contributing factor in over 90% of them.

      Human error can be due to distraction, drowsiness, lack of experience, long response time, an unstable mental state or similar issues. In contrast, an autonomous car does not get distracted or sleepy, its response time is constant across conditions, and it does not suffer from an unstable mental state or lack of experience. Consequently, replacing human drivers with autonomous cars can reduce the number of accidents in which a human is a contributing factor.

      An autonomous car must perform many different tasks, and it must be at least as intelligent as a human driver. Among these tasks, analyzing the road scene using visual information is crucial for performing the others successfully. This includes segmenting a scene into meaningful regions in order to identify drivable areas, obstacles and humans, as well as distinguishing dynamic objects from static ones for 3D reconstruction. In addition, some objects, such as traffic signs, have to be detected and classified so that an autonomous car can conform to road rules.

      Detecting other objects, such as pedestrians, might not be crucial since these objects are already segmented by the segmentation module, and an autonomous car might not need to detect, identify or count pedestrians. Similarly, it might not need to classify the brand of a car or count the number of cars; it may only use the information in the segmented regions for 3D reconstruction or motion planning. However, detecting and classifying traffic signs is crucial for driving a car safely.

      The first step in solving the above computer vision problems is to find an appropriate representation for image patches. Despite great efforts in the computer vision community, hand-engineered features were not able to properly model large classes of natural objects. The advent of convolutional neural networks, large datasets and parallel computing hardware changed the course of computer vision. Instead of designing feature vectors by hand, convolutional neural networks learn a composite feature transformation that makes classes of objects linearly separable in their feature space.
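
      To make this concrete, the sketch below (in Python with PyTorch; the layer sizes are illustrative assumptions, not the thesis architecture) shows a ConvNet as a stack of learned feature transformations followed by a single linear classifier:

      # A minimal sketch: convolutions learn the feature transformation;
      # a linear layer classifies in the learned feature space.
      import torch
      import torch.nn as nn

      class TinyConvNet(nn.Module):
          def __init__(self, num_classes=43):  # e.g. 43 classes in GTSRB
              super().__init__()
              self.features = nn.Sequential(
                  nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                  nn.MaxPool2d(2),
                  nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1),
              )
              self.classifier = nn.Linear(32, num_classes)

          def forward(self, x):
              return self.classifier(self.features(x).flatten(1))

      scores = TinyConvNet()(torch.randn(1, 3, 48, 48))  # one 48x48 RGB patch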

      Recently, convolutional neural networks have surpassed humans in tasks such as the classification of natural objects and the classification of traffic signs. After this great success, convolutional neural networks have become the first choice for learning features from image data. One of the fields that has been greatly influenced by convolutional neural networks is the automotive industry. Tasks such as pedestrian detection, car detection, traffic sign recognition, traffic light recognition and road scene understanding are rarely done using hand-crafted features anymore.

      Nonetheless, convolutional neural networks are computationally demanding models, while one essential requirement in autonomous cars is that the models for the above tasks be computationally efficient so that more tasks can be executed on a single device. Consequently, it is crucial to find minimal network architectures that are able to perform the above tasks accurately.

      Convolutional neural networks were originally proposed for learning a representation and a linear classifier on image patches, and it was not trivial to use them for detection and segmentation problems. In this thesis, we study three different problems in autonomous cars, namely detection, classification and segmentation, and propose accurate and computationally efficient networks for each of them. In addition, we thoroughly study the sensitivity of neural networks to adversarial samples and propose a few methods for increasing the stability of networks against noise.

      To be more specific, we first propose a method based on visual attributes and Bayesian networks for the classification of traffic signs. Then, we propose two networks for the classification of traffic signs and a method for creating ensembles by formulating ensemble creation as an optimal subset selection problem. Next, we empirically study the sensitivity of neural networks to small perturbations (i.e., low-magnitude additive noise), called adversarial samples, and propose two methods for alleviating this problem. Moreover, we propose a new objective function to generate adversarial samples close to the decision boundary.
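
      The abstract does not spell out the proposed objective function. As a generic illustration of how a low-magnitude additive perturbation is computed from a network's gradient, the sketch below implements the well-known fast gradient sign method, which is not the method proposed in the thesis:

      # Generic gradient-based adversarial perturbation (FGSM), shown only
      # to illustrate the idea of low-magnitude additive noise.
      import torch
      import torch.nn.functional as F

      def fgsm(model, x, label, eps=0.01):
          x = x.clone().requires_grad_(True)
          F.cross_entropy(model(x), label).backward()
          # Step in the direction that increases the classification loss.
          return (x + eps * x.grad.sign()).detach()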

      Specifically, two state-of-the-art ConvNets for the classification of traffic signs create ensembles of 25 and 20 networks, which collectively need 3,208,042,500 and 1,445,265,400 multiplications, respectively, to compute the classification score. This huge number of arithmetic operations is the result of redundancy in the ensembles, the high number of parameters and the choice of activation functions. Using our formulation of ensemble creation, we create an ensemble consisting of 5 ConvNets that needs 382,699,560 multiplications to compute the classification score. Compared to the two other ensembles, our ensemble reduces the number of multiplications by 88% and 73%, respectively.
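
      These percentages follow directly from the operation counts:

      # Reduction in multiplications relative to the two reference ensembles.
      ours, ens_25, ens_20 = 382_699_560, 3_208_042_500, 1_445_265_400
      print(1 - ours / ens_25)  # ~0.88, i.e. 88% fewer multiplications
      print(1 - ours / ens_20)  # ~0.74, i.e. roughly 73% fewer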

      Our experiments on the GTSRB dataset show that, despite the huge reduction in the number of arithmetic operations, our network improves the classification accuracy by 0.1% compared to Ciresan et al. (2012), and its accuracy is only 0.04% lower than that of Jin et al. (2014). In addition, our second network correctly classifies 99.55% of the test images, and an ensemble of 3 instances of the second network correctly classifies 99.70% of the samples.

      The input to the classification module is patches of traffic signs. These patches are generated by a traffic sign detector which locates traffic signs in a large image. For this purpose, we first propose a lightweight and accurate network for detecting traffic signs and show how to implement the scanning-window technique within our ConvNet by means of dilated convolutions. Then, we propose a new approach for detecting traffic signs by formulating the detection problem as a segmentation problem.
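
      As a minimal illustration of the scanning-window idea (with assumed layer sizes, not the thesis network), a fully convolutional stack with dilated convolutions scores every window of a high-resolution image in a single forward pass:

      # Illustrative only: dilation enlarges the receptive field without
      # subsampling, so the patch classifier is evaluated densely.
      import torch
      import torch.nn as nn

      dense_detector = nn.Sequential(
          nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(),
          nn.Conv2d(16, 16, kernel_size=3, dilation=2), nn.ReLU(),
          nn.Conv2d(16, 1, kernel_size=1),  # per-location detection score
      )
      score_map = dense_detector(torch.randn(1, 3, 480, 640))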

      Specifically, a lightweight ConvNet is designed for detecting traffic signs in high-resolution images. This network is trained in two steps. In the first step, it is trained using negative samples that are randomly selected from the training images. The network is then applied to the training images in order to mine hard negative samples, which are used in the second step to train the network on more appropriate data. Using this technique, the average precision of the detection ConvNet increases to 99.89% on the GTSDB dataset. We further analyze the time-to-completion of the network in different settings and show how to use statistical information from the dataset to speed up the forward pass of the network. Using this information, 37.72 high-resolution images can be processed per second with high accuracy to locate traffic signs.
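
      The two-step scheme can be summarized as follows; this is a hypothetical outline, and the helper functions are assumptions rather than thesis code:

      # Hypothetical sketch of training with hard-negative mining; the
      # helpers `train_one_pass`, `score_windows` and `overlaps_a_sign`
      # are assumed, not taken from the thesis.
      def train_detector(model, images, random_negatives,
                         train_one_pass, score_windows, overlaps_a_sign):
          # Step 1: train with randomly selected negative windows.
          train_one_pass(model, negatives=random_negatives)

          # Mine hard negatives: background windows the current model
          # confidently (and wrongly) classifies as traffic signs.
          hard = [w for w, score in score_windows(model, images)
                  if score > 0.5 and not overlaps_a_sign(w)]

          # Step 2: retrain using the mined, more informative negatives.
          train_one_pass(model, negatives=random_negatives + hard)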

      The last step in our thesis is to segment a scene into semantically meaningful regions. We thoroughly study the networks and approaches that have been proposed for segmenting images, and we propose a new network which is faster and, on some metrics, more accurate than state-of-the-art networks while requiring less memory. Our network consists of a series of fire modules in the encoder and decoder parts, and it upsamples feature maps by replicating elements instead of using the max-unpooling operation or interpolation techniques. Our experiments on the Cityscapes dataset show that our network generates more accurate results than other networks that subsample the input image by a factor of 2 before feeding it to the network. Specifically, the mean instance-level class IoU and the mean instance-level category IoU of our network are 47.8 and 75.5, respectively, which are higher than those of more computation- and memory-intensive networks.
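
      For illustration, a fire module squeezes channels with a 1x1 convolution and expands them with parallel 1x1 and 3x3 convolutions, while replication-based upsampling simply repeats each feature value; the channel sizes below are assumptions, not the thesis configuration:

      # Illustrative fire module and replication-based (nearest-neighbour)
      # upsampling; sizes are assumed, not the thesis configuration.
      import torch
      import torch.nn as nn

      class Fire(nn.Module):
          def __init__(self, c_in, squeeze, expand):
              super().__init__()
              self.squeeze = nn.Conv2d(c_in, squeeze, kernel_size=1)
              self.e1 = nn.Conv2d(squeeze, expand, kernel_size=1)
              self.e3 = nn.Conv2d(squeeze, expand, kernel_size=3, padding=1)
              self.relu = nn.ReLU()

          def forward(self, x):
              s = self.relu(self.squeeze(x))
              # Concatenate the parallel 1x1 and 3x3 expand branches.
              return self.relu(torch.cat([self.e1(s), self.e3(s)], dim=1))

      # Upsample by replicating each element 2x2 instead of max-unpooling.
      upsample = nn.Upsample(scale_factor=2, mode='nearest')
      y = upsample(Fire(64, 16, 32)(torch.randn(1, 64, 32, 64)))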

