Epistemic guidance of visual attention for robotic agents in dynamic visual scenes
Humans and many animals selectively sample important parts of their visual surroundings to carry out everyday activities such as foraging or finding prey or mates. Selective attention lets them use the brain's limited resources efficiently, deploying their sensory apparatus to collect the data believed pertinent to the organism's current task at hand. Robots and other computational agents operating in dynamic environments are similarly exposed to a wide variety of stimuli, which they must process with limited sensory and computational resources. Computational models of visual attention have therefore long been of interest: they enable artificial systems to select the necessary information from complex, cluttered visual environments, reducing the data-processing burden.

Biologically inspired computational saliency models have previously been used to selectively sample a visual scene, but they have limited capacity to deal with dynamic environments and no capacity to reason about uncertainty when planning a sampling strategy. These models typically treat contrast in colour, shape, or orientation as salient and sample locations of the scene in descending order of salience; after each observation, the area around the sampled location is suppressed by an inhibition-of-return mechanism so that it is not immediately revisited (a minimal sketch of this loop is given below).

This thesis generalises the traditional saliency model by using an adaptive Kalman filter estimator to model the agent's understanding of the world, together with a utility-function-based approach to describe what the agent cares about in the visual scene (also sketched below). This allows agents to adopt a richer set of perceptual strategies than is possible with the classical winner-take-all mechanism of the traditional saliency model. In contrast with the traditional approach, inhibition of return is achieved without implementing an extra mechanism on top of the underlying structure: observing a location reduces the estimator's uncertainty there, which lowers the expected value of an immediate revisit. The thesis demonstrates five utility functions that encapsulate the perceptual state valued by the agent; each produces a distinct perceptual behaviour matched to particular scenarios. The resulting visual attention distributions of the five proposed utility functions are demonstrated on five real-life videos. In most of the experiments, pixel intensity is used as the source of the saliency map; since the proposed approach is independent of the particular saliency map, it can be combined with other, more complex saliency-map models. Moreover, the underlying structure of the model is sufficiently general and flexible to serve as the basis for a new range of more sophisticated gaze-control systems.
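The traditional winner-take-all loop described above can be summarised in a few lines. The following Python sketch is illustrative only: the saliency map is raw pixel intensity, and the suppression radius and fixation count are assumed values chosen for the example, not parameters taken from the thesis.

```python
import numpy as np

def classical_scanpath(saliency, n_fixations=5, ior_radius=10):
    """Visit locations in descending order of salience, suppressing a
    disc around each fixation (inhibition of return) once visited."""
    s = saliency.astype(float)
    h, w = s.shape
    yy, xx = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)  # winner-take-all
        fixations.append((y, x))
        # Inhibition of return: block a disc around the sampled location
        # so it cannot win again on later iterations.
        s[(yy - y) ** 2 + (xx - x) ** 2 <= ior_radius ** 2] = -np.inf
    return fixations

# Usage: pixel intensity of a random greyscale frame as the saliency map.
rng = np.random.default_rng(0)
frame = rng.random((64, 64))
print(classical_scanpath(frame, n_fixations=3))
```

Note that the inhibition-of-return step here is an explicit extra mechanism layered on top of the selection rule, which is exactly what the proposed approach avoids.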
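By contrast, the following is a hedged sketch of the utility-driven alternative. Each pixel's intensity is tracked by an independent scalar Kalman filter (with fixed rather than adaptive noise parameters, for brevity), and gaze is directed to the region of highest expected utility. The variance-based utility shown here, "look where the estimate is most uncertain", is only one plausible example; the thesis's five utility functions are not reproduced, and all names and parameter values below are assumptions.

```python
import numpy as np

class PixelwiseKalman:
    """One independent scalar Kalman filter per pixel (illustrative model)."""
    def __init__(self, shape, q=1e-3, r=1e-2):
        self.x = np.zeros(shape)   # state estimate (pixel intensity)
        self.p = np.ones(shape)    # estimate variance
        self.q, self.r = q, r      # process / observation noise (assumed)

    def predict(self):
        # Random-walk dynamics: variance grows everywhere each frame, so
        # previously observed regions gradually become interesting again.
        self.p += self.q

    def update(self, frame, mask):
        # Only pixels inside the gazed region (mask) receive observations.
        k = self.p / (self.p + self.r)                 # Kalman gain
        self.x[mask] += (k * (frame - self.x))[mask]
        self.p[mask] *= (1.0 - k)[mask]

def choose_gaze(kf, fovea_radius=8):
    """Pick the pixel maximising utility; here utility = posterior variance.
    Inhibition of return needs no extra mechanism: updating a region shrinks
    its variance, so its utility drops until prediction noise accumulates."""
    y, x = np.unravel_index(np.argmax(kf.p), kf.p.shape)
    yy, xx = np.mgrid[0:kf.p.shape[0], 0:kf.p.shape[1]]
    mask = (yy - y) ** 2 + (xx - x) ** 2 <= fovea_radius ** 2
    return (y, x), mask

# Usage on a synthetic intensity sequence.
rng = np.random.default_rng(1)
kf = PixelwiseKalman((64, 64))
for _ in range(3):
    frame = rng.random((64, 64))
    kf.predict()
    gaze, mask = choose_gaze(kf)
    kf.update(frame, mask)
    print("fixated", gaze)
```

Swapping the utility function in choose_gaze changes the perceptual behaviour without touching the estimator, which is the flexibility the abstract attributes to the utility-function-based design.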