Video alignment consists of integrating multiple independently recorded video sequences into a single sequence. This means registering them both in time (synchronizing frames) and in space (image registration) so that the video sequences can be fused or compared pixel-wise. Despite being relatively little known, many applications today may benefit from robust and efficient video alignment methods. For instance, video surveillance requires integrating video sequences recorded of the same scene at different times in order to detect changes. The problem of aligning videos has been addressed before, but only in the relatively simple cases of fixed or rigidly attached cameras and simultaneous acquisition. In addition, most works rely on restrictive assumptions that reduce the difficulty of the problem, such as a linear time correspondence or knowledge of the complete trajectories of corresponding scene points in the images; to some extent, these assumptions limit the practical applicability of the solutions developed so far. In this thesis, we focus on the challenging problem of aligning sequences recorded at different times from independent moving cameras following similar but non-coincident trajectories. More precisely, this thesis covers four studies that advance the state of the art in video alignment. First, we analyze and develop a probabilistic framework for video alignment, that is, a principled way to integrate multiple observations and prior information. Within this framework, two different approaches are presented that combine several purely visual features (image intensities, visual words, and dense motion-field descriptors) with global positioning system (GPS) information. Second, we reformulate the problem into a single joint alignment framework, since previous works on video alignment adopt a divide-and-conquer strategy, i.e., they first solve the synchronization and then register corresponding frames.
This also generalizes the 'classic' case of a fixed geometric transform and linear time mapping. Third, we exploit the time domain of the video sequences directly in order to avoid an exhaustive cross-frame search; this provides relevant information for learning the temporal mapping between pairs of video sequences. Finally, we adapt these methods to the online setting for road detection and vehicle geolocation. The qualitative and quantitative results presented in this thesis on a variety of real-world pairs of video sequences show that the proposed methods are robust to varying imaging conditions, different image content (e.g., incoming and outgoing vehicles), variations in camera velocity, and different scenarios (indoor and outdoor), going beyond the state of the art. Since it is difficult to fully present the results for the investigated datasets in a written document, they can also be viewed at http://www.cvc.uab.es/~fdiego/Thesis/.
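To make the synchronization step of the divide-and-conquer strategy concrete, the following is a minimal illustrative sketch (not the thesis's method, which learns a general temporal mapping and handles spatial registration jointly): it estimates an integer temporal offset between two sequences of per-frame descriptors by maximizing average cosine similarity under the assumed mapping t_b = t_a + offset. All names and the toy data are hypothetical.

```python
import numpy as np

def synchronize(desc_a, desc_b, max_offset=10):
    """Estimate an integer temporal offset between two descriptor
    sequences by maximizing average cosine similarity.
    Illustrative sketch only: assumes a pure shift, i.e. t_b = t_a + offset,
    whereas the thesis targets general, non-linear temporal mappings."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)

    a, b = normalize(desc_a), normalize(desc_b)
    best_offset, best_score = 0, -np.inf
    for offset in range(-max_offset, max_offset + 1):
        # indices of frames that overlap under the candidate mapping
        ia = np.arange(len(a))
        ib = ia + offset
        keep = (ib >= 0) & (ib < len(b))
        if keep.sum() < 2:
            continue
        # mean cosine similarity over the overlapping frames
        score = np.mean(np.sum(a[ia[keep]] * b[ib[keep]], axis=1))
        if score > best_score:
            best_score, best_offset = score, offset
    return best_offset

# toy demo: sequence b is sequence a delayed by 3 frames
rng = np.random.default_rng(0)
a = rng.standard_normal((30, 16))
b = np.vstack([rng.standard_normal((3, 16)), a])
```

A full pipeline would then register each matched frame pair spatially (e.g., by estimating a homography between them), which is exactly the two-stage decomposition the joint framework above avoids.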