One of the toughest problems for biologically inspired models of vision is explaining how one can properly categorize objects while recognizing novel instances within a category. For example, there are infinitely many things you could categorize as a rectangle, yet you have no problem deciding whether something is rectangular. The problem becomes fiendish when considering more complex objects such as faces.
It is exceedingly unlikely that visual information is stored in memory for all angles of all objects encountered, as this would require virtually incalculable storage space. Object recognition is therefore probably not based on rote comparison against stored images. An interesting and demonstrable model, described over a series of papers by researchers at MIT, proposes that the top-down component of object recognition involves a set of 2D prototype objects from which a set of object transforms is learned. These learned transforms can then be applied to novel concrete instances from the same object class (Poggio & Vetter, 1997).
The best way to illustrate this is by example. I implemented the Poggio and Vetter (1997) model in Matlab and tested it on a set of cuboid prototypes. One can think of these as representing the cube-like objects of prior experience. For this implementation, the 3D cuboid vertices were randomly generated. The 3D representations were tilted by a few degrees to avoid an accidental (degenerate) view and then projected to 2D using a perspective projection. The resulting 2D cuboids are rendered in figure 1.
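My implementation was in Matlab; a minimal Python/NumPy sketch of this prototype-generation step might look as follows. The function names, size range, and camera distance are all illustrative choices, not values from the original implementation.

```python
import numpy as np

def random_cuboid(rng):
    """Return the 8 vertices of a cuboid with random side lengths, centered at the origin."""
    w, h, d = rng.uniform(0.5, 2.0, size=3)  # illustrative size range
    corners = np.array([[x, y, z] for x in (0, w) for y in (0, h) for z in (0, d)])
    return corners - corners.mean(axis=0)

def rotate_x(points, degrees):
    """Rotate a set of 3D points about the x-axis."""
    t = np.radians(degrees)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t), np.cos(t)]])
    return points @ R.T

def project(points, camera_dist=10.0):
    """Perspective-project 3D points onto the image plane."""
    z = points[:, 2] + camera_dist
    return points[:, :2] / z[:, None]

rng = np.random.default_rng(0)
cuboid3d = rotate_x(random_cuboid(rng), 5)  # small tilt avoids a degenerate view
cuboid2d = project(cuboid3d)                # 8 vertices, each an (x, y) pair
```

Repeating the last two lines for each prototype yields the set of 2D cuboids shown in figure 1.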
Of course, one rarely observes an object from just one angle, so we’ll add another. In reality, there would be many, though not all, possible transformations of the object class; for simplicity, we’ll consider a single rotation about the x-axis by 20 degrees. The resulting object class is rendered in figure 2.
Upon this rotation, the model learns how to transform the set of 2D vertices defining the non-rotated objects into the 2D vertices defining the rotated objects. In this implementation, the transform is learned by a single-layer neural network, though an analytic solution also exists if the prototypes form a linearly independent set; for details, see Poggio and Vetter (1997). At this point, we have a single function that will transform any 2D cuboid into the same cuboid rotated by 20 degrees, or at least a close approximation when the class is not exactly linear.
This can be demonstrated by generating a new random cuboid, one that is not in the prototype class, and using the learned transformation function to rotate it. Figure 3 depicts the novel cuboid in red rendered in the same location as its rotated version in blue; overlaying the two makes a successful transformation easy to verify.
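The learning and transfer steps can be sketched in Python/NumPy as below (again, the original was Matlab and all names and parameters here are illustrative). One simplification: Poggio and Vetter's linearity result holds exactly under orthographic projection, so this sketch uses one; the least-squares fit stands in for the single-layer network, being its analytic counterpart.

```python
import numpy as np

def random_cuboid(rng):
    """8 vertices of a cuboid with random side lengths, centered at the origin."""
    w, h, d = rng.uniform(0.5, 2.0, size=3)
    corners = np.array([[x, y, z] for x in (0, w) for y in (0, h) for z in (0, d)])
    return corners - corners.mean(axis=0)

def rotate_x(points, degrees):
    """Rotate a set of 3D points about the x-axis."""
    t = np.radians(degrees)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t), np.cos(t)]])
    return points @ R.T

def view(points):
    """Orthographic projection: drop z, flatten to a 16-vector."""
    return points[:, :2].ravel()

rng = np.random.default_rng(1)

# Prototype class: tilted cuboids seen before and after a 20-degree rotation.
protos = [rotate_x(random_cuboid(rng), 5) for _ in range(50)]
X = np.stack([view(p) for p in protos])                # views before rotation
Y = np.stack([view(rotate_x(p, 20)) for p in protos])  # views after rotation

# Learn one linear map W with X @ W ≈ Y; least squares is the analytic
# counterpart of a single-layer linear network trained on the prototypes.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# A novel cuboid, not among the prototypes: rotate its 2D view with W alone.
novel = rotate_x(random_cuboid(rng), 5)
predicted = (view(novel) @ W).reshape(8, 2)            # learned 2D-to-2D transform
actual = rotate_x(novel, 20)[:, :2]                    # ground-truth rotated view
print(np.max(np.abs(predicted - actual)))              # ~0 for this linear class
```

The residual is at numerical precision because cuboid views span only a three-dimensional linear subspace (one dimension per side length), so the novel view falls inside the span of the prototypes.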
It is key to recognize that no traditional rotation in 3D space is performed on the novel cuboid. The novel object is projected into 2D space and then rotated via the learned 2D-to-2D transform.
An interesting prediction implied by this model is that when an individual mentally rotates an object, performance should degrade with the number of degrees of rotation, due to repeated applications of learned transforms. For instance, to rotate the novel cuboid 40 degrees we could simply apply the learned transform twice, which should take twice as long. It also follows that performance should degrade linearly, since each application of the transform covers a fixed increment and is assumed to take a constant amount of time. In fact, this is exactly what was reported in the classic mental rotation study by Shepard and Metzler (1971): reaction time for deciding whether two objects observed from different angles were the same object increased linearly with the angle of rotation between them.
It would be interesting to see if performance holds for rotations of different increments. Shepard and Metzler tested only one increment; all of their rotations were in multiples of 20 degrees. Differences in performance would be expected if individuals stored more than one rotation transform for an object class. For instance, if asked to rotate something by 40 degrees, one may apply two iterations of 20 degrees or several iterations of a more granular transform. However, one may be able to jump straight to a 45- or 90-degree rotation, especially for cubes.
Furthermore, experiments investigating the effects of different object classes would provide additional insight. Although Shepard and Metzler argued their objects were novel, they were all composed of cubes. It’s possible that the visual system applies cube transforms individually to the cubes composing the object. What would happen if pyramids were used as the atomic component instead of cubes? The Poggio and Vetter model examined here couples transforms to an object class, so differences should be expected when manipulating object class; more specifically, performance gains due to learning effects would not be expected to carry over to a new class.
Poggio, T., & Vetter, T. (1997). Linear Object Classes and Image Synthesis From a Single Example Image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 733-741.
Shepard, R., & Metzler, J. (1971). Mental Rotation of Three-Dimensional Objects. Science, 171(3972), 701-703.