In this work, we address the problem of cross-view geo-localization, which estimates the location of a street-view image by matching it against a database of geo-tagged aerial images. The cross-view matching task is extremely challenging due to the drastic appearance and geometry differences across views. Unlike existing methods, which rely predominantly on CNNs, we propose a novel evolving geo-localization Transformer (EgoTR) that exploits the self-attention property of the Transformer to model global dependencies, thereby significantly reducing visual ambiguities in cross-view geo-localization. We further leverage the Transformer's positional encoding to help the EgoTR understand and correspond geometric configurations between ground and aerial images. In contrast to state-of-the-art approaches, which impose strong assumptions on prior geometric knowledge, the EgoTR flexibly learns its positional embeddings through the training objective.
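To make the idea concrete, the following is a minimal sketch of the general approach the abstract describes: a two-branch Transformer encoder with learnable positional embeddings, one branch per view, trained to produce comparable embeddings for retrieval. All architectural details here (patch size, depth, embedding dimension, mean pooling) are illustrative assumptions, not the actual EgoTR configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTBranch(nn.Module):
    """A minimal Transformer encoder branch with learnable positional
    embeddings. Sizes and pooling are placeholder assumptions, not the
    paper's configuration."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patchify the image into a sequence of tokens.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learnable positional embeddings: the geometric layout is learned
        # from the training objective rather than hand-crafted priors.
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        x = self.encoder(x + self.pos_emb)  # global self-attention over patches
        return F.normalize(x.mean(dim=1), dim=-1)  # one descriptor per image

# One branch per view; localization reduces to nearest-neighbour retrieval
# over cosine similarity between query and reference descriptors.
ground_net, aerial_net = ViTBranch(), ViTBranch()
g = ground_net(torch.randn(2, 3, 224, 224))  # street-view queries
a = aerial_net(torch.randn(2, 3, 224, 224))  # geo-tagged aerial references
similarity = g @ a.t()                       # (2, 2) matching scores
```

In this style of pipeline, the geo-tagged aerial image with the highest similarity to a query determines the predicted location; the self-attention layers let every patch attend to every other patch, which is the global-dependency modeling the abstract attributes to the Transformer.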
This work is licensed under a Creative Commons Attribution 4.0 International License.