Top PDF Modeling and Predicting Object Attention in Natural Scenes

Modeling and Predicting Object Attention in Natural Scenes

The suggestion of the present data that saliency drives attention indirectly, by predicting interesting objects, reconciles earlier findings: saliency map features do not need to drive attention [9, 22, 129] despite saliency’s undisputed correlation with fixations during free viewing [99]. However, we do not argue that saliency maps fully answer how interesting objects are selected, or that saliency map features causally drive object recognition. Further research using targeted manipulations of object properties is needed to analyze which stimulus features drive attention, and how they relate to features that make an object interesting, characteristic, or diagnostic for a scene and to different types of recall (tokens, types, scene gist, object positions, etc.). However, our data suggest that the allocation of attention is preceded by some pre-attentive scene understanding. This is in line with data [13, 14] showing that even the earliest guidance of attention and fixation depends on whether an object is semantically plausible in a scene. The minimum requirement for such a decision is a coarse pre-attentive recognition of the scene context, or gist, and some form of pre-attentive figure-ground segmentation. Taken with our present data, this strongly suggests that attention cannot be understood as a mere preprocessing step for recognition; rather, both need to be handled in a common framework.

Text Detection and Translation from Natural Scenes

Sign text can vary in font, size, orientation, and position; it can be blurred by motion or occluded by other objects. Because it originates in 3-D space, text on signs in scene images can be distorted by the slant, tilt, and shape of the objects on which it appears (Ohya 1994). Furthermore, unconstrained background clutter can resemble signs in appearance, causing false detections. Moreover, languages impose another level of variation in text. For example, Chinese characters are composed of many segments, and the layout of Chinese characters in signs, which is based on pictographic characters, differs from the layout used in European languages. Handling Chinese characters therefore requires more elaborate modeling. Figure 1 shows an example of a Chinese sign with both vertical and horizontal layouts as well as distortion due to warping.

Novel Colors Correction Approaches for Natural Scenes and Skin Detection Techniques

The sensitivities of the camera greatly affect the skin-color distribution for the same person under the same illumination. Several computer vision algorithms have been developed for skin color detection. A skin detector typically transforms a given pixel into an appropriate color space and then uses a skin classifier to label it as a skin or non-skin pixel. A skin classifier defines a decision boundary of the skin color class in the color space based on certain rules. Skin detection based on color modeling has gained great popularity because of its fast processing and its independence from geometric variations of the face pattern. Also, many experts claim that human skin has a characteristic color and can be easily recognized, so using skin color modeling approaches for skin detection is a natural direction suggested by skin color properties and common sense.
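
The pipeline described above (map each pixel into a suitable color space, then apply a rule-based classifier) can be illustrated with a minimal sketch. The code below is not taken from the paper: the YCbCr conversion and the fixed chrominance thresholds are commonly cited heuristics, assumed here purely for illustration.

```python
import numpy as np

# Illustrative sketch (not from the paper): a rule-based skin classifier that
# maps RGB pixels to the YCbCr color space and applies fixed chrominance
# bounds as the decision boundary. The thresholds are common heuristics,
# not values reported in this document.

def rgb_to_ycbcr(img):
    """img: H x W x 3 uint8 RGB array -> float Y, Cb, Cr channels."""
    img = img.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def skin_mask(img, cb_range=(77, 127), cr_range=(133, 173)):
    """Label each pixel as skin (True) or non-skin (False)."""
    _, cb, cr = rgb_to_ycbcr(img)
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))

# Example: classify a random image and report the skin-pixel fraction.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
    mask = skin_mask(image)
    print("fraction of pixels labeled skin:", mask.mean())
```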

Interactions of visual attention and object recognition: computational modeling, algorithms, and psychophysics

The number of fixations used for recognition and learning depends on the resolution of the images and on the amount of visual information. In low-resolution images with few objects, three fixations may be sufficient to cover the relevant parts of the image. In high-resolution images with a large amount of information, up to 30 fixations are required to sequentially attend to most or all object regions. Humans and monkeys, too, need more fixations to analyze scenes with richer information content (Sheinberg and Logothetis 2001; Einhäuser et al. 2006). The number of fixations required for a set of images is determined by monitoring, for a few typical examples from the set, after how many fixations the serial scanning of the saliency map starts to cycle. Cycling usually occurs when the salient regions have covered approximately 40–50% of the image area. We use the same number of fixations for all images in an image set to ensure consistency throughout the respective experiment. It is common in object recognition to use interest operators (Harris and Stephens 1988) or salient feature detectors (Kadir and Brady 2001) to select features for learning an object model. This is different, however, from selecting an image region and limiting the learning and recognition of objects to this region.
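
As a rough illustration of how such a cycling criterion could be monitored, the sketch below (a simplification for illustration, not the authors' implementation) serially scans a saliency map with winner-take-all selection and decaying inhibition of return, and reports when the scan first revisits an already-attended region, along with the image coverage at that point. The radius and decay parameters are assumed values.

```python
import numpy as np

# Sketch: winner-take-all scanning of a saliency map with decaying
# inhibition of return (IOR); cycling is declared when the next winner
# falls inside a previously attended region.

def scan_until_cycle(saliency, radius=16, ior_decay=0.8, max_fixations=60):
    sal = saliency.astype(np.float64)
    inhibition = np.zeros_like(sal)
    visited = np.zeros_like(sal, dtype=bool)
    ys, xs = np.mgrid[0:sal.shape[0], 0:sal.shape[1]]
    fixations = []
    for k in range(max_fixations):
        y, x = np.unravel_index(np.argmax(sal - inhibition), sal.shape)
        if visited[y, x]:                      # winner lies in an attended region: cycling
            return fixations, k, visited.mean()
        fixations.append((y, x))
        disk = (ys - y) ** 2 + (xs - x) ** 2 <= radius ** 2
        visited |= disk                        # mark the attended region
        inhibition *= ior_decay                # older inhibition fades over time ...
        inhibition[disk] = sal[disk] + 1.0     # ... while the new region is suppressed
    return fixations, max_fixations, visited.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    smap = rng.random((128, 128))              # stand-in for a computed saliency map
    fixations, n, coverage = scan_until_cycle(smap)
    print(f"cycling after {n} fixations, {coverage:.0%} of the image covered")
```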

The search template for object detection in naturalistic scenes

descriptive adjective (e.g., “red purse”), are both less effective than an exact image but more effective than a single word cue (e.g., “purse”; Castelhano & Heaven, 2010; Malcolm & Henderson, 2009, 2010; Schmidt & Zelinsky, 2009; Vickery et al., 2005). Search templates approximate, rather than perfectly represent, the features of an object, which allows for a certain amount of flexibility between the cue and the target image (Bravo & Farid, 2009; Vickery et al., 2005). Templates activated by a specific image cue can still guide attention to targets that do not exactly match the cue (Bravo & Farid, 2009; Li & DiCarlo, 2008; Vuilleumier, Henson, Driver, & Dolan, 2002), perhaps because we expect certain changes in the appearance of an object (in lighting, distance, occlusion, etc.) in the dynamic natural world. Evidence for this comes from a study by Bravo and Farid (2009), which showed that a specific image could efficiently cue targets that varied in size or orientation. Additional evidence comes from a study by Ghose and Liu (2013); in one experiment, subjects were instructed to sort images as “new” or “old”, in which repetitions of a specific image were considered “old” and different viewpoints of the same image were considered “new”. The authors found that subjects responded “old” more often than “new” to different viewpoints of the specific image. This suggests that subjects involuntarily activated a view-invariant template for the specific targets.

Acoustic scanning of natural scenes by echolocation in the big brown bat, Eptesicus fuscus

In visual animals, eye movements during scanning of a scene consist of a series of fixations interspersed with saccades that move the eye from one point to the next. Birds move the whole head to fixate the eyes (Eckmeier et al., 2008). In primate vision, behavioral (Rizzolatti et al., 1987) and neurophysiological studies (Kustov and Robinson, 1996) suggest a tight coupling between eye movement and attention systems. In the echolocating bat, we suggest that the direction of the beam axis may correspond to the bat's attention to different objects in the scene. Any listening animal will turn its head toward a sound of interest, exploiting the maximum directional acuity of hearing along the midline, as demonstrated elegantly in a thorough series of experiments with barn owls (Konishi, 1993). Echolocating bats can take advantage of the added directionality of the outgoing sound to reduce echo intensity from an off-axis object by controlling the aim of the sonar beam.

Attentional Mechanisms in Natural Scenes

In daily life, attention is often directed to high-level object attributes, such as when we look out for cars before crossing a road. Previous work using MEG decoding investigated the influence of such category-based attention on the time course of object category representations (Kaiser et al., 2016). Attended object categories were more strongly represented than unattended categories from 180 ms after scene onset. In Chapter 4, we used a similar approach to determine when, relative to this category-level modulation, attention is spatially focused on the target. Results showed that the location of both target and distracter objects could be accurately decoded shortly after scene onset (50 ms). However, spatial attentional selection, reflected in better decoding of the target location than of the distracter location, emerged only later in time (240 ms). Target presence itself (irrespective of location and category) could be decoded from 180 ms after stimulus onset. Combined with the earlier work, these results indicate that naturalistic category search operates through an initial spatially global modulation of category processing that then guides attention to the location of the target. This "feature-to-location-based selection" (Hopf et al., 2004), also referred to as a "global-to-local" process (Campana et al., 2016), has been proposed in classical theories of attentional selection, among them Guided Search (Wolfe et al., 1989; Wolfe, 1994) and Reverse Hierarchy Theory (RHT; Hochstein & Ahissar, 2002; Ahissar et al., 2009), and demonstrated for simple stimuli in artificial displays (Treisman and Sato, 1990; Cave, 1999; Hopf et al., 2004; Eimer, 2014; Campana et al., 2016).
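
For readers unfamiliar with the analysis style summarized here, the sketch below illustrates generic time-resolved decoding: a separate classifier is trained on the sensor pattern at each time point, and its cross-validated accuracy traces when information (e.g., target location) becomes decodable. The data, trial counts, and labels are simulated placeholders, not the MEG data analyzed in Chapter 4 or in Kaiser et al. (2016).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Generic time-resolved decoding sketch: one classifier per time point,
# cross-validated accuracy as a function of time after stimulus onset.

def time_resolved_decoding(X, y, cv=5):
    """X: trials x sensors x timepoints, y: trial labels.
    Returns decoding accuracy per time point."""
    n_times = X.shape[2]
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    acc = np.empty(n_times)
    for t in range(n_times):
        acc[t] = cross_val_score(clf, X[:, :, t], y, cv=cv).mean()
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_trials, n_sensors, n_times = 200, 64, 60    # placeholder dimensions
    y = rng.integers(0, 2, n_trials)              # e.g., target on the left vs. right
    X = rng.normal(size=(n_trials, n_sensors, n_times))
    X[:, :8, 30:] += y[:, None, None] * 0.5       # inject a signal in later time points
    accuracy = time_resolved_decoding(X, y)
    print("peak decoding accuracy:", accuracy.max().round(2))
```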

Predicting Japanese Word Order in Double Object Constructions

Because Japanese exhibits flexible word order, potential factors that predict the word order of a given construction in Japanese have recently been investigated, particularly in the field of computational linguistics (Yamashita and Kondo, 2011; Orita, 2017). One of the major findings relevant to the current study is ‘long-before-short’, whereby a long noun phrase (NP) tends to be scrambled ahead of a short NP (Yamashita and Chang, 2001). This paper sheds light on those factors in double object constructions (DOC), where either (1) an indirect object (IOBJ) or (2) a direct object (DOBJ) can precede the other object:

Concurrency Issues in Object Oriented Modeling

advantage of decreasing costs of hardware, but also for more stringent reliability (and other) constraints, the constant need to tackle more complex problems, and especially problems whose solution is expressed naturally in concurrent terms (thus avoiding overspecification), the discipline of building concurrent programs is now spreading to areas and (uninitiated) people not expected a few years ago. The quest for new design techniques, new paradigms, etc., has increased correspondingly, but we are still far from a panacea.

Object model. Though the object-oriented model has roots in simulation and artificial intelligence, it remained relatively unknown until the eighties, when it sprang to notoriety with the Smalltalk phenomenon. The uniformity of the approach ("everything is an object or a message between objects") captured the interest of a community burdened by the complexity of the problems it was tackling and the tools it was using. To name just two of the expectations raised, it was anticipated that programming would move closer to design, and that reusability of software would become practical. While initial results have not been miraculous, there are signs that indicate that this is a (maybe 'the') right approach to building software.

Anomaly detection through spatio-temporal context modeling in crowded scenes

Considering the fact that mutual interference of several human body parts can occur in the same block, we propose an atomic motion pattern representation using a Gaussian Mixture Model.


Validating the Effectiveness of Object-Oriented Metrics for Predicting Maintainability

In this paper, an attempt has been made to use a subset of class-level object-oriented metrics to predict software maintainability. A Neuro-GA approach was used to design a model, employing 10-fold and 5-fold cross-validation for the QUES and UIMS software systems. Four analysis approaches with different metric sets were considered for estimating the maintainability of the QUES and UIMS software. These techniques have the ability to predict the output based on historical data. The software metrics are taken as input data to train the network and estimate the maintainability of the software product. From this analysis it can be concluded that the Neuro-GA approach obtained promising results when compared with the work of Van Koten et al. and Zhou et al. The results also showed that the identified subset of metrics yielded improved maintainability prediction with higher accuracy.
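
The evaluation setup described here (class-level metrics as inputs, a neural model, k-fold cross-validation against a maintainability target) can be sketched as follows. This is only an assumed skeleton: the genetic-algorithm component of the Neuro-GA approach is omitted, and the metric names and data are placeholders rather than the UIMS/QUES datasets.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: object-oriented metrics as inputs, a small neural-network
# regressor as the model, and k-fold cross-validated error as the measure
# of maintainability-prediction quality. Metric names are placeholders.

METRICS = ["WMC", "DIT", "NOC", "CBO", "RFC", "LCOM", "SIZE1", "SIZE2"]

def evaluate(X, y, n_splits=10):
    """k-fold cross-validated mean absolute error of the regressor."""
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0),
    )
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((100, len(METRICS)))                    # placeholder metric values
    y = X @ rng.random(len(METRICS)) * 10 + rng.normal(0, 0.5, 100)  # placeholder target
    print("10-fold MAE:", round(evaluate(X, y, n_splits=10), 3))
```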

Natural Language Descriptions of Human Activities Scenes: Corpus Generation and Analysis

In comparison, some categories, such as ‘SitDown’, ‘SitUp’, and ‘StandUp’, had substantially lower F1 scores than the rest. There were two potential reasons why the annotators did not pay sufficient attention to these actions. Firstly, these actions were performed very quickly in the context of some videos. For example, when a person sat down or stood up during an eating scene, the annotators would have focused on eating (rather than sitting down or standing up) in their description. Secondly, these actions often overlapped with another action by different humans in the video, which the annotators might have found more important for description. The overall outcome of the classification experiment indicates that the corpus is a reliable tool for assessing natural language description of video streams.

6 Findings from the Corpus Analysis

The corpus is important for the following reasons: (1) limiting this study to a clearly defined and manageable domain; (2) identifying the most important HLFs that should be extracted by image processing techniques in order to describe seman-

Detecting Text in Natural Scenes with Connected Component Clustering and Nontext Filtering

Recognizing text in natural scene images is becoming a popular research area due to the widespread availability of low-cost image-capturing devices such as digital cameras and mobile phones. Various scene text detection and recognition methods have received much attention over the last decades. Among them, text detection and recognition in camera-based images have been considered very important problems in the computer vision community [1], [2]. In this paper, we present an innovative scene text detection algorithm with the help of two machine

Modeling Localness for Self-Attention Networks

examined whether it is necessary to apply localness modeling to all the layers. Finally, given that TRANSFORMER consists of encoder- and decoder-side self-attention as well as encoder-decoder attention networks, we checked which types of attention networks benefit most from the localness modeling. To eliminate the influence of control variables, we conducted the first two ablation studies on encoder-side self-attention networks only.

Window Prediction Strategies. As shown in Table 1, all the proposed window prediction strategies consistently improve the model performance over the baseline, validating the importance of localness modeling in self-attention networks. Among them, the layer-specific and query-specific windows outperform their fixed counterpart, showing the benefit that a flexible mechanism is able to capture varying local context according to layer and query information. Moreover, the flexible strategy does not rely on hand-crafted parameters (e.g., a pre-defined window size), which makes the model robustly applicable to other language pairs and NLP tasks. Considering the training speed, we use the query-specific prediction mechanism as the default setting in subsequent experiments.
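
For context, localness modeling of this kind can be sketched as a Gaussian bias, centered on a query-predicted position with a query-predicted window size, that is added to the attention logits before the softmax. The sketch below is a simplified single-head reading of that idea, not the released implementation; the weight shapes and sigmoid-based predictions are assumptions.

```python
import numpy as np

# Sketch: query-specific localness for self-attention. Each query predicts a
# center position and a window size; a Gaussian bias built from them is added
# to the scaled dot-product logits so that nearby keys receive more weight.

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(Q, K, V, Wp, Wd):
    """Q, K, V: (seq_len, d); Wp, Wd: (d,) vectors predicting center / window."""
    n, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)                      # standard scaled dot-product
    center = n / (1.0 + np.exp(-(Q @ Wp)))             # predicted center P_i in [0, n]
    window = n / (1.0 + np.exp(-(Q @ Wd)))             # predicted window D_i in [0, n]
    sigma = window / 2.0 + 1e-6
    pos = np.arange(n)[None, :]                        # key positions j
    gauss_bias = -((pos - center[:, None]) ** 2) / (2.0 * sigma[:, None] ** 2)
    return softmax(logits + gauss_bias) @ V            # favor keys near the center

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 10, 16
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    Wp, Wd = rng.normal(size=d), rng.normal(size=d)
    out = local_self_attention(Q, K, V, Wp, Wd)
    print(out.shape)  # (10, 16)
```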

Novel Colors Correction Approaches for Natural Scenes and Skin Detection Techniques

Against this backdrop of the commercialization of the internet, this study attempts to help users quantify such non-transparent behavior, ultimately minimizing computer [r]


Modeling Human Reading with Neural Attention

Attention-based neural architectures employ either soft attention or hard attention. Soft attention distributes real-valued attention values over the input, making end-to-end training with gradient descent possible. Hard attention mechanisms make discrete choices about which parts of the input to focus on, and can be trained with reinforcement learning (Mnih et al., 2014). In NLP, soft attention can mitigate the difficulty of compressing long sequences into fixed-dimensional vectors, with applications in machine translation (Bahdanau et al., 2015) and question answering (Hermann et al., 2015). In computer vision, both types of attention can be used for selecting regions in an image (Ba et al., 2015; Xu et al., 2015).
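
A minimal sketch of the contrast drawn above: soft attention computes a differentiable weighted average over all input states, while hard attention samples a single position, which is why it is typically trained with reinforcement learning rather than plain gradient descent. The shapes and scoring function below are illustrative assumptions.

```python
import numpy as np

# Soft attention: convex combination of all states (differentiable).
# Hard attention: a sampled, discrete choice of one state (non-differentiable).

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(states, query):
    """states: (n, d), query: (d,). Returns a weighted average of the states."""
    scores = states @ query               # alignment scores
    weights = softmax(scores)             # real-valued attention over the input
    return weights @ states               # differentiable context vector

def hard_attention(states, query, rng):
    """Samples one input position from the attention distribution."""
    weights = softmax(states @ query)
    idx = rng.choice(len(states), p=weights)   # discrete, non-differentiable choice
    return states[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.normal(size=(6, 8))      # e.g., encoder states for 6 tokens
    query = rng.normal(size=8)
    print(soft_attention(states, query).shape, hard_attention(states, query, rng).shape)
```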

Visual attention and object categorization: from psychophysics to computational models

In a biological system, any high-level representation must be built from lower-level representations, and in vision this means that all representations must ultimately trace back to the retinal input. Many categorization models presuppose that the high-level (external) features used by the experimenter to define the objects are the same as those used internally by the observer when making a categorization decision. For example, many categorization studies have used a set of circles with bisecting lines, defined by two features: the diameter of the circle and the angle of the bisecting line (see Figure 2.1). This approach has certainly been fruitful, and MDS studies (Chapter 4) have demonstrated strong similarities between the external and internal feature representations. Nevertheless, apparent irregularities in the categorization process that might be inexplicable in terms of high-level representations could appear entirely natural in the light of biological early vision. At the least, features such as the angle of the bisecting line are not likely to be represented explicitly by neurons involved in visual perception; rather, a population of neurons might form a distributed representation, in which each neuron responds preferentially to a single range of orientations. Whether such differences have an effect on the output of categorization models is an empirical question. We have tested a set of hybrid models, in which we adapted a hierarchical model of early vision (“HMAX”) based on Riesenhuber and Poggio [1999]. HMAX operates directly in image space, in contrast to the categorization models described above, which operate in feature space. Our approach was to extract a new feature space representation from the output of HMAX, which could then be used as an
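
As a loose illustration of the hybrid idea (derive a feature-space representation from an image-space front end and hand it to a categorization model), the sketch below computes max-pooled oriented-edge energy, roughly in the spirit of HMAX S1/C1 layers. It is a stand-in, not the HMAX implementation used in the thesis; the pooling size and orientation binning are assumptions.

```python
import numpy as np

# Reduced illustration: oriented-edge energy channels (S1-like) followed by
# local max pooling (C1-like), flattened into a feature vector that a
# downstream categorization model could consume.

def oriented_energy(img, n_orient=4, pool=8):
    """img: 2-D array. Returns a vector of max-pooled oriented edge responses."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # orientation in [0, pi)
    feats = []
    for k in range(n_orient):                         # one channel per orientation band
        lo, hi = k * np.pi / n_orient, (k + 1) * np.pi / n_orient
        channel = np.where((ang >= lo) & (ang < hi), mag, 0.0)
        h, w = channel.shape
        channel = channel[:h - h % pool, :w - w % pool]
        blocks = channel.reshape(h // pool, pool, w // pool, pool)
        feats.append(blocks.max(axis=(1, 3)).ravel()) # local max pooling
    return np.concatenate(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64))                      # stand-in for an input image
    x = oriented_energy(image)
    print(x.shape)   # feature-space representation for a categorization model
```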

Event Prediction and Object Motion Estimation in the Development of Visual Attention

Of particular interest is how infants behave when the target disappears, for example behind an occluder. Infants that are 7-9 weeks old continue to look at the edge of the occluder where the object disappears for 1 second before finding it again (Rosander & von Hofsten, 2004). Infants that are 12 weeks old move their eyes as soon as the target becomes visible again. This delay also decreases with each trial, which indicates that the infant starts to anticipate where the object will reappear. Some of these effects have been seen in younger infants as well, but they have not been reliable. It is possible that the younger infants would have performed better if the object had been made invisible instead of occluded, since the occluder distracts attention from the target (Jonsson & von Hofsten, 2003).

Effective Attention Modeling for Neural Relation Extraction

(3) Entity Attention (EA) (Shen and Huang, 2016): This is the combination of a CNN model and an attention model. Words are represented using word embeddings and two positional embeddings. A CNN with max-pooling is used to extract global features. Attention is applied with respect to the two entities separately. The vector representation of every word is concatenated with the word embedding of the last token of the entity. This concatenated representation is passed to a feed-forward layer with tanh activation and then another feed-forward layer to get a scalar attention score for every word. The original word representations are averaged based on the attention scores to get the attentive feature vectors. A CNN-extracted feature vector and two attentive feature vectors with respect to the two entities are concatenated and passed to a feed-forward layer with softmax to determine the relation.
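
The attention computation described above can be sketched as follows. This is a schematic reconstruction, not the original EA model: the CNN feature extractor is reduced to a placeholder max-pool, and all shapes and weights are illustrative.

```python
import numpy as np

# Schematic sketch of entity-specific attention: each word representation is
# concatenated with the embedding of the entity's last token, scored with a
# small tanh feed-forward net, and the softmax-normalized scores weight an
# average of the word vectors (one attentive feature vector per entity).

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entity_attentive_vector(H, e_emb, W1, b1, w2):
    """H: (n, d) word representations; e_emb: (d,) last-token embedding of one entity."""
    n, d = H.shape
    concat = np.concatenate([H, np.tile(e_emb, (n, 1))], axis=1)  # (n, 2d)
    hidden = np.tanh(concat @ W1 + b1)                            # feed-forward + tanh
    scores = hidden @ w2                                          # scalar score per word
    alpha = softmax(scores)                                       # attention weights
    return alpha @ H                                              # attentive feature vector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, h = 12, 50, 32
    H = rng.normal(size=(n, d))                  # word + positional representations
    e1, e2 = H[3], H[9]                          # last tokens of the two entities
    W1, b1, w2 = rng.normal(size=(2 * d, h)), np.zeros(h), rng.normal(size=h)
    att1 = entity_attentive_vector(H, e1, W1, b1, w2)
    att2 = entity_attentive_vector(H, e2, W1, b1, w2)
    cnn_feat = H.max(axis=0)                     # placeholder for the CNN-extracted feature
    final = np.concatenate([cnn_feat, att1, att2])   # passed to a softmax layer in the model
    print(final.shape)
```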

A Decomposable Attention Model for Natural Language Inference

Our method is motivated by the central role played by alignment in machine translation (Koehn, 2009) and previous approaches to sentence similarity modeling (Haghighi et al., 2005; Das and Smith, 2009; Chang et al., 2010; Fader et al., 2013), natural language inference (Marsi and Krahmer, 2005; MacCartney et al., 2006; Hickl and Bensley, 2007; MacCartney et al., 2008), and semantic parsing (Andreas et al., 2013). The neural counterpart to alignment, attention (Bahdanau et al., 2015), which is a key part of our approach, was originally proposed and has been predominantly used in conjunction with LSTMs (Rocktäschel et al., 2016; Wang and Jiang, 2016) and to a lesser extent with CNNs (Yin et al., 2016). In contrast, our use of attention is purely based on word embeddings, and our method essentially consists of feed-forward networks that operate largely independently of word order.
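
The "attention purely based on word embeddings" idea can be sketched as follows: a small feed-forward network is applied to each embedding, unnormalized attention is the dot product of the projected embeddings, and each sentence is softly aligned to the other. The weights and dimensions below are illustrative assumptions, not the trained model.

```python
import numpy as np

# Sketch of attention computed from word embeddings with feed-forward
# networks only (no recurrence): project each embedding, take pairwise dot
# products as attention scores, and form soft alignments in both directions.

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feed_forward(X, W1, W2):
    """Small position-wise feed-forward net applied to each embedding."""
    return np.maximum(X @ W1, 0.0) @ W2          # ReLU MLP, order-independent

def decomposable_attend(A, B, W1, W2):
    """A: (m, d) premise embeddings, B: (n, d) hypothesis embeddings.
    Returns soft alignments of B to each a_i and of A to each b_j."""
    Fa, Fb = feed_forward(A, W1, W2), feed_forward(B, W1, W2)
    E = Fa @ Fb.T                                # unnormalized attention scores e_ij
    beta = softmax(E, axis=1) @ B                # subphrase of B aligned to each a_i
    alpha = softmax(E, axis=0).T @ A             # subphrase of A aligned to each b_j
    return beta, alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, h = 50, 64
    A, B = rng.normal(size=(7, d)), rng.normal(size=(9, d))
    W1, W2 = rng.normal(size=(d, h)) * 0.1, rng.normal(size=(h, d)) * 0.1
    beta, alpha = decomposable_attend(A, B, W1, W2)
    print(beta.shape, alpha.shape)               # (7, 50) and (9, 50)
```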
