#ArtificialReality

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers.
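The abstract does not name the analysis tool, but layer-wise representation comparisons of this kind are commonly done with linear centered kernel alignment (CKA). The sketch below is a minimal, assumed illustration of that idea: a small CKA function applied to two activation matrices from the same batch of images; the variable names, shapes, and random toy data are purely illustrative, not taken from the paper.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two activation matrices of shape (n_examples, n_features)."""
    # Center each feature dimension across examples.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # Similarity of the two (implicit) Gram structures, normalised to [0, 1].
    numerator = (y.T @ x).norm(p="fro") ** 2
    denominator = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
    return numerator / denominator

# Toy usage: compare flattened activations from two layers on the same image batch.
acts_vit_block = torch.randn(512, 768)    # hypothetical ViT block features
acts_cnn_stage = torch.randn(512, 2048)   # hypothetical CNN stage features (pooled)
print(float(linear_cka(acts_vit_block, acts_cnn_stage)))
```

Computing such a score for every pair of layers in a ViT and a CNN would yield the kind of cross-layer similarity map on which claims like "ViT representations are more uniform across layers" can be based.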

Virtual Reality (VR) & Augmented Reality (AR) & Mixed Reality (MR)

Reality these days is starting to feel almost like a mirage of ourselves. It is turning into a longing for artificial enhancements of our society and of human beings. Nothing, or almost nothing, surprises us anymore, and we have grown accustomed to watching parallel realities being created that our physical senses find increasingly hard to tell apart. They even come to threaten our peaceful survival with fictions of rule-breaking realities, such as deepfakes or fake news.