Convolutional neural networks for joint object detection and pose estimation in traffic scenes

  • Author: Carlos Guindel Gómez
  • Thesis supervisors: José María Armingol Moreno (supervisor), David Martín Gómez (co-supervisor)
  • Defense: Universidad Carlos III de Madrid (Spain), 2019
  • Language: Spanish
  • Thesis committee: Felipe Jiménez Alonso (chair), Basam Musleh Lancis (secretary), Eduardo José Molinos Vicente (member)
  • Doctoral program: Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática, Universidad Carlos III de Madrid
  • Abstract
    • Advanced driver assistance systems (ADAS) have been present in vehicles for many years, showing the convenience of introducing technology into road transport modes. The next step on the horizon is the autonomous vehicle, which involves replacing the driver with automated systems capable of performing all the tasks associated with driving. This technological advance is expected to lead to a significant improvement in transport safety by eliminating the primary cause of traffic accidents: human error. In addition to this, autonomous vehicles have the potential to alleviate other problems associated with transport, including pollution and traffic jams, by allowing more efficient journeys and promoting the use of new transport paradigms, such as car sharing.

      One of the main obstacles preventing the widespread adoption of this technology today is the enormous difficulty involved in the perception of the vehicle's environment. The safe navigation of a vehicle requires, among many other things, the development of a model of the traffic environment that contains all the relevant elements present in it, including other road users. These users are, from the perspective of autonomous driving, potential obstacles that can compromise safety during operation and must be reliably identified as far in advance as possible. While this is a relatively simple task for a human being, automated systems still struggle to correctly interpret the information that reaches them through the sensors on board the vehicle.

      Conveniently, significant progress has been made in artificial intelligence and, particularly, computer vision over the last decade due to the emergence of a series of techniques under the name of deep learning (DL). These methods extend the classic idea of the artificial neural network by incorporating a large number of layers that increase the representational capacity of the models. DL algorithms are nowadays the method of choice for solving a wide variety of problems, as they achieve unprecedented accuracy and, at the same time, reduce the need to manually craft specific solutions unsuitable for other domains. The advantage of DL methods comes from their ability to automatically learn the most useful features for each application through a backpropagation training process that uses annotated samples to minimize a predefined cost function.

      This thesis aims to take advantage of these methods to improve scene understanding by the perception system of an automated vehicle. More concretely, the study focuses on the reliable identification of dynamic objects in the surroundings of the vehicle, using vision sensors and laser (lidar) scanners. Although there are numerous works with similar approaches in the scientific literature, few proposals take into account the unique challenges of on-board systems, such as the need to design for computational efficiency. Besides, the particular difficulties posed by traffic environments, such as poor structure or frequent occlusions, require specific solutions that extend the capabilities of existing multi-purpose algorithms.

      In order to achieve this goal, there are a series of preliminary considerations to be taken into account. Among them, issues related to on-board sensors stand out since they are, ultimately, the only source of information for the perception system. Three modalities are considered in this work: monocular cameras, stereo cameras, and lidar scanners. Monocular cameras provide color images conveying appearance information. In addition, stereo vision systems, inspired by the binocular vision or stereopsis mechanism, make use of epipolar geometry to reconstruct the geometry of the environment from a pair of images captured by two cameras displaced from each other. To that end, the images must be processed with a stereo matching procedure, which provides a dense map assigning a disparity, and hence a depth value, to each point of the image. Finally, lidar devices emit infrared light beams that allow for a precise and robust measurement of distances to the different elements of the scene. The most modern devices feature several light beams rotating around a common axis, providing a relatively dense point cloud over a 360° field of view around the vehicle.
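
      For illustration, the following minimal sketch shows how a dense disparity map can be computed from a rectified stereo pair and converted into depth by triangulation, using OpenCV's semi-global matcher. The file names and the focal length and baseline values are placeholders (KITTI-like), not the exact setup of the thesis.

      ```python
      import cv2
      import numpy as np

      # Rectified stereo pair; paths are placeholders.
      left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
      right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

      # Semi-global block matching; parameters are illustrative, not tuned.
      matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
      disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point output

      # Triangulation: depth Z = f * B / d for every pixel with a valid disparity d.
      f = 721.5  # focal length in pixels (assumed, KITTI-like)
      B = 0.54   # stereo baseline in meters (assumed, KITTI-like)
      depth = np.where(disparity > 0, f * B / np.maximum(disparity, 1e-6), 0.0)
      ```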

      Among other considerations, this thesis analyzes the criteria for selecting the sensor setup of the vehicle, presents alternatives for the representation of data from the studied sensor modalities, and addresses a particularly critical issue in multi-modal sensor setups: sensor calibration. In particular, a novel method for the automatic calibration of the extrinsic parameters between a stereo vision system and a multilayer lidar sensor is presented, as a prerequisite for their joint use. The approach makes use of a fiducial marker, from which reference points are extracted in both data representations. Then, a registration step is used to compute the 3D transform between the two sensors. The method has advantages over existing approaches, both in ease of use and in the accuracy of its results. Stereo and lidar data are used by several of the other algorithms proposed in this thesis, so this calibration approach is a contribution of great interest for their implementation.
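
      The registration step mentioned above can be illustrated with the standard SVD-based (Kabsch/Horn) solution for the least-squares rigid transform between corresponding 3D reference points; this is a generic sketch of that technique, not the exact implementation of the thesis.

      ```python
      import numpy as np

      def rigid_transform(src, dst):
          """Least-squares rigid transform (R, t) mapping src onto dst.

          src, dst: (N, 3) arrays of corresponding reference points, e.g.
          extracted from the lidar cloud and the stereo reconstruction.
          """
          c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
          H = (src - c_src).T @ (dst - c_dst)  # cross-covariance matrix
          U, _, Vt = np.linalg.svd(H)
          D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
          R = Vt.T @ D @ U.T
          t = c_dst - R @ c_src
          return R, t
      ```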

      From this point on, the algorithms proposed for the characterization of objects in the scene are built upon a modern DL method for the detection of objects in RGB images: Faster R-CNN. As usual in object detection, this framework relies on deep convolutional neural networks (CNNs), a structure particularly suitable for image processing. The method consists of two well-differentiated stages: one aimed at generating regions of interest through a region proposal network (RPN), and another intended for the classification and refinement of the proposals, which provides the final detection result. This framework was selected for its compelling features, well suited to on-board perception: it is highly efficient, reduces to a minimum the number of parameters that must be adjusted manually, and outperforms single-stage alternatives in the detection of distant objects.
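
      As a point of reference, the two-stage pipeline can be exercised with the Faster R-CNN reference implementation shipped in torchvision; this is only a usage sketch of that public implementation, not the model configuration used in the thesis.

      ```python
      import torch
      import torchvision

      # Pretrained reference implementation (RPN + classification/refinement stage).
      model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
      model.eval()

      image = torch.rand(3, 375, 1242)  # dummy tensor with KITTI-like dimensions
      with torch.no_grad():
          # Stage 1: the RPN proposes regions of interest.
          # Stage 2: proposals are classified and refined into final boxes.
          (pred,) = model([image])

      print(pred["boxes"].shape, pred["labels"], pred["scores"])
      ```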

      This work investigates in detail the fitness of the method for object detection in traffic environments using the KITTI vision benchmark suite as a testbed. The KITTI object detection benchmark contains data from a stereo camera and a 64-layer lidar device recorded in real traffic environments in the mid-size city of Karlsruhe, Germany. Data from both modalities are profusely annotated, which makes this dataset one of the best choices to train and test automotive perception systems. Besides, the diversity of scenarios included in the dataset guarantees the representativeness of the data and poses a strong challenge to the algorithms tested on it. Because a publicly available dataset is used, the contributions introduced in this thesis can be fairly compared with other methods in the literature.
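
      For reference, each object annotation in the KITTI benchmark is a plain-text line that can be parsed as follows (the field layout is the one published with the benchmark):

      ```python
      def parse_kitti_label(line):
          """Parse one line of a KITTI object label file."""
          f = line.split()
          return {
              "type": f[0],                               # 'Car', 'Pedestrian', 'Cyclist', ...
              "truncated": float(f[1]),
              "occluded": int(f[2]),
              "alpha": float(f[3]),                       # observation angle, [-pi, pi]
              "bbox": [float(v) for v in f[4:8]],         # left, top, right, bottom (px)
              "dimensions": [float(v) for v in f[8:11]],  # height, width, length (m)
              "location": [float(v) for v in f[11:14]],   # x, y, z in camera coordinates (m)
              "rotation_y": float(f[14]),                 # heading (yaw) angle
          }
      ```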

      As a result of the analysis performed on these data, some measures are proposed to optimize the performance of Faster R-CNN under these circumstances, e.g., the modification of the set of anchors used by the region proposal network. The mean average precision (mAP) of this baseline approach reaches 67.4% on the moderately difficult samples from the selected KITTI validation set.
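
      The kind of anchor-set modification mentioned above can be sketched as follows; the scales and aspect ratios are illustrative assumptions, not the values selected in the thesis. Smaller scales, for instance, favor the detection of distant (hence small) objects.

      ```python
      import numpy as np

      def make_anchors(scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
          """Return (len(scales) * len(ratios), 4) anchors centered at the
          origin, as (x1, y1, x2, y2); ratios are width/height."""
          anchors = []
          for s in scales:
              for r in ratios:
                  w, h = s * np.sqrt(r), s / np.sqrt(r)  # keeps the area close to s**2
                  anchors.append([-w / 2, -h / 2, w / 2, h / 2])
          return np.array(anchors)
      ```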

      Additionally, an alternative method is proposed to further improve the accuracy of object detection by including data from a stereo vision system. This approach achieves non-negligible gains in detection performance (around one mAP point) by including the disparity map computed from the stereo pair as a fourth input channel. The detection CNN can thus perform spatial reasoning in a straightforward manner, improving the segmentation of objects and, with it, the region proposal stage.
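
      A minimal sketch of this idea in PyTorch: the disparity map is concatenated as a fourth channel and the backbone's first convolution is widened accordingly. Layer sizes are illustrative; the initialization comment describes one common practice, not necessarily the thesis procedure.

      ```python
      import torch
      import torch.nn as nn

      rgb = torch.rand(1, 3, 375, 1242)        # color image
      disparity = torch.rand(1, 1, 375, 1242)  # disparity map from stereo matching
      x = torch.cat([rgb, disparity], dim=1)   # (1, 4, H, W) network input

      # First convolution widened from 3 to 4 input channels. A common trick is
      # to copy the pretrained RGB filters and initialize the extra channel,
      # e.g., with their mean, before fine-tuning.
      conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3)
      features = conv1(x)
      ```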

      Despite the satisfactory performance of the Faster R-CNN detector, it only provides information about the position of the objects in the image and assigns them an estimate of their category (e.g., car, pedestrian, or cyclist). However, the decision modules of an autonomous vehicle require a complete characterization of the potential obstacles in the surroundings. In particular, knowledge about the orientation of objects on the road plane, i.e., their heading angle, is of vital importance to enable accurate forecasting of their future trajectory. In this work, a novel approach to embed orientation estimation within the Faster R-CNN detection framework is presented. The orientation is inferred from a single frame and is therefore based on appearance features instead of motion features, which would require robust tracking of the objects. For this reason, the estimated variable is actually the observation angle, or viewpoint, from which objects are seen, which decouples the orientation estimation problem from the relative position between camera and object. This viewpoint angle is, ultimately, the only parameter needed to describe the orientation of objects under the reasonable assumption that road users move on the road plane.
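
      In the KITTI camera convention, the decoupling works through the viewing ray: the observation angle alpha and the heading angle rotation_y are related by the direction from the camera to the object, so the heading can be recovered once the object's position on the road plane is known.

      ```python
      import math

      def heading_from_viewpoint(alpha, x, z):
          """Recover the heading (rotation_y) from the estimated viewpoint
          (alpha) and the object position (x, z) in camera coordinates,
          following the KITTI convention alpha = rotation_y - atan2(x, z)."""
          ry = alpha + math.atan2(x, z)
          return math.atan2(math.sin(ry), math.cos(ry))  # wrap to [-pi, pi]
      ```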

      The approach proposed in this thesis naturally extends the Faster R-CNN framework with an additional branch for viewpoint estimation. This branch makes use of the same features as the detection branches, which means that both estimates can be made simultaneously, without the new functionality significantly affecting the computational cost of inference. The idea behind this approach is that detection and orientation estimation are linked problems that can be solved simultaneously using the same set of shared features. During training, the new task is straightforwardly incorporated into the multi-task loss function of the Faster R-CNN paradigm, so it has no relevant adverse effect on the original detection performance.
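
      The shape of such a head can be sketched as follows: all branches consume the same per-region features, and training simply adds a viewpoint term to the usual multi-task loss (L = L_cls + L_bbox + lambda * L_view). Layer sizes, the number of bins, and the loss weighting are illustrative assumptions.

      ```python
      import torch.nn as nn

      class DetectionHead(nn.Module):
          """Second-stage head with an extra, class-aware viewpoint branch."""

          def __init__(self, feat_dim=1024, num_classes=4, num_bins=8):
              super().__init__()
              self.cls = nn.Linear(feat_dim, num_classes)              # category scores
              self.bbox = nn.Linear(feat_dim, num_classes * 4)         # box refinement
              self.view = nn.Linear(feat_dim, num_classes * num_bins)  # viewpoint bins

          def forward(self, region_features):
              # One forward pass yields detection and viewpoint estimates at once.
              return (self.cls(region_features),
                      self.bbox(region_features),
                      self.view(region_features))
      ```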

      The proposal adopts a discrete approach to orientation estimation: the circle is discretized into bins, each representing a range of angles, and every detected object is assigned one of them. A separate probability distribution over the possible bins is estimated for each category, making the classification class-aware. Although the discrete philosophy limits the precision with which the orientation of objects can be estimated, it has certain advantages in terms of integrating the task into the CNN and has proved sufficient for meaningful modeling of the environment. The validity of the method is assessed on the KITTI dataset, where it achieves performance similar to other comparable methods (60.8% mean average orientation similarity, mAOS, on the moderately difficult samples from the KITTI testing set) using only a small fraction of their run time.
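
      The discretization itself reduces to mapping angles to bin indices and back to bin centers; a sketch follows, where the number of bins is an assumption for illustration.

      ```python
      import math

      NUM_BINS = 8  # assumed for illustration
      BIN_WIDTH = 2 * math.pi / NUM_BINS

      def angle_to_bin(alpha):
          """Map an angle in [-pi, pi) to a bin index in [0, NUM_BINS)."""
          return int((alpha + math.pi) // BIN_WIDTH) % NUM_BINS

      def bin_to_angle(b):
          """Return the center angle of bin b (the decoded viewpoint)."""
          return -math.pi + (b + 0.5) * BIN_WIDTH
      ```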

      The experimental work then extends to the analysis of a variety of factors that affect the performance of the method. These factors include the selection of the CNN that acts as a feature extractor, the size of the input image, and the tuning of different hyperparameters, among others. The study presents a whole range of configuration options that alter the tradeoff between precision and computation time, offering different alternatives that may be adopted depending on the particular requirements of the application.

      On the other hand, the proposed inference method also serves as a testbed for a study that aims to quantify the influence of training data on algorithms of this type. To this end, training samples from the Cityscapes dataset, originally aimed at semantic segmentation of the scene, are incorporated into the experimental framework. The results show that, with a small increase in the size of the training set, significant improvements (up to 7.7 mAP points) can be achieved thanks to the diversity introduced by the new samples. This demonstrates the vital importance of training data for the performance of DL algorithms in general, and of on-board perception algorithms in particular.

      Furthermore, this thesis outlines two complementary lines of research aimed at improving the proposed joint detection and orientation estimation method to achieve performance on a par with the most recent works in the field. The first one proposes a hybrid between classification and regression that improves the resolution of the orientation estimate. The proposal consists of incorporating an additional regression branch into the Faster R-CNN framework to provide a fine adjustment of the viewpoint value estimated by the classification part. The second line of work makes use of a modern implementation of Faster R-CNN that includes numerous improvements to the method, such as new CNN models or a feature pyramid network (FPN) for feature extraction, specially designed to improve the operation of the method at very different scales. The results of both proofs of concept are highly promising (the best model reaches 67.4% mAOS on the moderately difficult samples from the KITTI validation set), showing the potential of the method beyond the specific implementation analyzed in this thesis.
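
      The classification-regression hybrid can be summarized by its decoding step: the winning bin provides a coarse viewpoint and the regressed residual refines it within the bin. The sketch below is purely illustrative of that idea, not the thesis implementation.

      ```python
      import math

      def decode_viewpoint(bin_index, residual, num_bins=8):
          """Combine the winning bin's center with a regressed offset,
          assumed to lie within +/- half a bin width."""
          bin_width = 2 * math.pi / num_bins
          alpha = -math.pi + (bin_index + 0.5) * bin_width + residual
          return math.atan2(math.sin(alpha), math.cos(alpha))  # wrap to [-pi, pi]
      ```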

      Once the category and orientation of the objects in the camera's field of view have been identified, it quickly becomes evident that decision-making modules also require an accurate estimate of their location with respect to the automated vehicle. There are works in the literature aimed at obtaining the pose of objects from monocular images alone; however, the accuracy required by automotive applications makes it advisable to include information from sensors capable of capturing the geometry of the environment, thus enabling spatial reasoning. This is the case of stereo vision devices and lidar scanners, which are the sensor modalities employed in this work for this purpose.

      Two alternatives to obtain the location of objects in the surroundings of the vehicle are proposed. The first one is based on stereo information and aims to complement the previously introduced detection and orientation estimation method. The approach assigns a depth estimate to each detection by making use of the depth values provided by the stereo matching algorithm in the region of the object. By combining standard dimensions for each category with the viewpoint angle estimated by the detection framework, every object can be located in a top view of the environment of the vehicle as a rotated 2D box. Experimental validation of the method is carried out using different alternatives for stereo matching and feature extraction. The results prove the robustness of the proposed method: the median Euclidean localization error is only 0.5 m. The performance of the method is better in the short-to-medium range (0-20 m from the vehicle) because of the limitations inherent to stereo vision.
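
      The localization step can be sketched as follows, assuming a pinhole camera with intrinsics K: a robust depth is taken from the stereo depth map inside the 2D box, back-projected to obtain the lateral position, and combined with the estimated viewpoint and category-standard dimensions. Function and variable names are hypothetical.

      ```python
      import numpy as np

      def locate_object(depth_map, box, alpha, K, dims=(1.6, 4.0)):
          """box: (x1, y1, x2, y2) in pixels; K: 3x3 intrinsics; dims: assumed
          standard (width, length) in meters for the category.
          Returns the rotated 2D box (center, heading, size) in the top view."""
          x1, y1, x2, y2 = [int(v) for v in box]
          roi = depth_map[y1:y2, x1:x2]
          Z = np.median(roi[roi > 0])          # median depth is robust to outliers
          u = 0.5 * (x1 + x2)                  # horizontal center of the box (px)
          X = (u - K[0, 2]) * Z / K[0, 0]      # back-projection to camera coordinates
          heading = alpha + np.arctan2(X, Z)   # heading recovered from the viewpoint
          return (X, Z), heading, dims
      ```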

      When all the modules proposed in this thesis are used together, an object-based model of the vehicle environment can be built. The complete pipeline has been deployed on an in-house research platform (an instrumented vehicle), providing satisfactory results in real traffic environments.

      As a complement to all previous developments, the framework developed for joint detection and orientation estimation is applied to a completely different domain: the data provided by the lidar sensor, conveniently represented in bird's eye view (BEV). The objective is, in this case, the 3D detection of the surrounding objects, which implies the simultaneous detection, classification, and localization of the different agents within the lidar's field of view. This is possible because, in this case, objects are represented in the same coordinates used for localization. Although the general scheme of the detection method coincides with that used for color images, several modifications are proposed to adapt it to the new nature of the data. The target output is now a set of axis-aligned boxes enclosing the objects of interest, which are additionally provided with an estimate of their yaw angle. If a standard width value is assumed for each type of obstacle, a 3D cuboid representing the position, orientation, and size of each agent can be obtained, in an inference process that takes place almost end-to-end in a single CNN framework. The method presents relevant improvements over other similar works, such as the possibility of simultaneously detecting cars, pedestrians, and cyclists, or its ability to work only with information from a lidar sensor, without requiring additional devices to carry out detection in a 360° range around the vehicle. The results are, in turn, comparable to those of other, more sophisticated alternatives (30.4% mAP BEV on the moderately difficult samples from the KITTI testing set).
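
      The BEV representation can be illustrated with a simple rasterization of the point cloud into a grid holding, for instance, the maximum height per cell; the cell size and range below are assumptions, and practical BEV encodings typically stack several such channels (height, intensity, density).

      ```python
      import numpy as np

      def pointcloud_to_bev(points, x_range=(0, 70), y_range=(-35, 35), cell=0.1):
          """points: (N, 3) lidar points (x forward, y left, z up).
          Returns a 2D grid with the maximum height observed in each cell."""
          nx = int((x_range[1] - x_range[0]) / cell)
          ny = int((y_range[1] - y_range[0]) / cell)
          bev = np.full((nx, ny), -np.inf, dtype=np.float32)
          ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
          iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
          mask = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
          np.maximum.at(bev, (ix[mask], iy[mask]), points[mask, 2])
          bev[np.isneginf(bev)] = 0.0  # mark empty cells
          return bev
      ```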

      All the contributions introduced in this thesis and included in this summary are aimed, in short, at improving the understanding of the road environment by the perception systems of an automated vehicle, with the ultimate goal of enabling its autonomous navigation through real traffic scenarios. The extensive experimentation carried out on real data confirms the validity and soundness of the proposed methods, although, naturally, their application to industrial-grade autonomous driving systems still presents some challenges that fall outside the scope of this thesis. One of them is the necessary integration of the proposed algorithms with other essential modules in the perception stack of an autonomous vehicle, such as detectors of infrastructure elements (e.g., road lanes or traffic lights). An annex presents some examples of these types of applications.

      Apart from these considerations, the work undertaken in this thesis contributes to the resolution of a pressing problem in the field of on-board perception systems, namely the characterization of the road users with whom some interaction may occur, and opens up avenues of future research for the improvement and development of applications of this type.

