Building machines that see and recognize the world from visual inputs is not easy. Over the last decade, there have been tremendous efforts toward this goal, and new methods have emerged for processing visual inputs, whether single images or streams of images, known as videos. These advances have allowed us to build systems that can recognize objects in 2D and 3D, identify their class and pose, or track them over time. While not perfect, these models work robustly enough that they are now part of our everyday life, in our homes via smart home devices and in our daily activities via our smartphones. In this talk, I will give an overview of state-of-the-art visual recognition models I've worked on, covering a range of visual tasks such as object detection, human-object interaction, human pose tracking, and 3D object understanding. Moreover, I will discuss the computational needs of such state-of-the-art AI models, which are met by developing novel and efficient tools that in turn have ignited a small revolution within AI.