Object localization and segmentation have been popular research problems in the computer vision community. Various approaches have leveraged the audio modality to better perform these tasks, with the central idea of associating visual motion with audio. Fisher et al.  proposed joint statistical modeling to perform this task using mutual information. Izadinia et al.  consider the problem of moving-sounding object segmentation, using CCA to correlate audio and visual features. The video features, consisting of mean velocity and acceleration computed over spatio-temporal segments, are correlated with audio. The magnitude of the learned video projection vector indicates the strength of association between corresponding video segments and the audio. Several other works have followed the same line of reasoning while using different video features to represent motion [86, 143]. The effectiveness of CCA can be illustrated with a simple example of a video of a person dribbling a basketball (see Fig. 2). Simplifying Izadinia et al.'s visual feature extraction methodology, we compute the optical flow and use the mean velocity calculated over 40 × 40 blocks as the visual representation and mel-spectra as the audio representation. The heat map in Fig. 2 shows the correlation between each image block and the audio. Areas with high correlation correspond to regions with motion. If we instead use a soft co-factorization model , it is indeed possible to track the image blocks correlated with the audio in each frame.
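As a minimal illustration of this idea (a simple per-block Pearson correlation rather than the CCA model itself, with entirely synthetic motion and audio values), one can correlate each block's per-frame mean motion magnitude with a per-frame audio energy and read the result as a heat map:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def block_audio_correlation(block_motion, audio_energy):
    """Correlate each block's per-frame mean velocity with the audio energy.

    block_motion: dict mapping block index -> list of per-frame motion magnitudes
    audio_energy: list of per-frame audio energies (e.g., summed mel spectrum)
    Returns a heat-map-like dict of block -> correlation.
    """
    return {b: pearson(m, audio_energy) for b, m in block_motion.items()}

# Synthetic example: block 0 moves in sync with the audio, block 1 does not.
random.seed(0)
audio = [abs(math.sin(0.5 * t)) for t in range(100)]
motion = {
    0: [a + random.gauss(0, 0.05) for a in audio],    # moving-sounding region
    1: [random.gauss(0.5, 0.2) for _ in range(100)],  # unrelated background
}
heat = block_audio_correlation(motion, audio)
```

Blocks containing the dribbling motion would show correlations near 1, while static or unrelated regions stay near 0.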
An increasing number of researchers work in computational auditory scene analysis (CASA). However, a set of tasks, each with a well-defined evaluation framework and commonly used datasets, does not yet exist. Thus, it is difficult to compare results and algorithms fairly, which hinders research in the field. In this paper we introduce a newly-launched public evaluation challenge dealing with two closely related tasks of the field: acoustic scene classification and event detection. We give an overview of the tasks involved; describe the process of creating the datasets; and define the evaluation metrics. Finally, illustrative results for both tasks using baseline methods applied to these datasets are presented, accompanied by open-source code.
We have to highlight that the ROC curves in Figure 4.16a) lie below the y = x straight line. This is because true positives are counted only when a method successfully detects the ROI depicting the abnormal event within a frame; if the system fails to identify this ROI and detects other regions instead, we count the detection as a false negative. Consequently, this criterion allows us to determine whether a method is capable of precisely detecting the area of the scene where unusual events happen. Alternatively, we can label the whole frame as abnormal whenever any region is abnormal, i.e., by following the usual frame-level criterion. This evidently increases AUC values but prevents measuring whether the method can detect the exact regions that generate the anomaly. Figure 4.16c) shows ROC curves using such a frame-level criterion. We can see that the ROC curves now lie above the y = x straight line, as expected. The ADCSF also attains the best performance under this frame-level criterion.
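The two evaluation criteria can be sketched as follows; the IoU threshold and the (x1, y1, x2, y2) box format are assumptions made for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def region_level_tp(detections, gt_roi, thresh=0.5):
    """Region-level criterion: true positive only when some detected
    region actually localises the annotated abnormal ROI."""
    return any(iou(d, gt_roi) >= thresh for d in detections)

def frame_level_abnormal(detections):
    """Frame-level criterion: any detection marks the whole frame abnormal."""
    return len(detections) > 0

# A frame whose detections miss the true ROI: the frame-level criterion
# calls it abnormal, while the region-level one counts a false negative.
gt = (10, 10, 30, 30)
dets = [(100, 100, 140, 140)]
```

This makes explicit why the frame-level criterion inflates AUC: any detection anywhere scores the frame, even when the localisation is wrong.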
The core idea of this class of algorithms is that a shape is well represented by its high-curvature points . A contour can then be described by using such points as the vertices of a piecewise linear interpolation. Several heuristics have been designed for this purpose. Teh and Chin  determine the curvature at each point based on a support region, and detect the dominant points through a non-maxima suppression process. Other approaches rely on the detection of salient points. Held et al.  first apply a coarse-to-fine smoothing to identify dominant points, and then define a hierarchical approximation based on perceptual significance. Zhu and Chirlian  determine the importance of each point by transforming the curve into polar coordinates and then calculating the relevant derivatives. Within this class of algorithms it is also possible to identify methods that search for the most significant points using relaxation labeling . The paper focuses on the contour extraction of shapes, but the extension of the work to trajectory analysis is straightforward. In this approach, the left and right slopes and the curvature are evaluated and collected in an attribute list associated with each point of the input curve. This information determines the initial probability of each point being a side (a linear piece, in the case of a trajectory) or an angle (a point with strong curvature). The relaxation process iteratively updates the probabilities until convergence. The resulting angle points can then be used as a meaningful representation of the whole trajectory.
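A minimal sketch of curvature-based dominant point detection (turning angle plus non-maxima suppression; the angle threshold is a hypothetical choice, and the cited methods use more elaborate support regions):

```python
import math

def turning_angle(p_prev, p, p_next):
    """Absolute turning angle at p (radians); 0 for collinear points."""
    a1 = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    a2 = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    d = abs(a2 - a1)
    return min(d, 2 * math.pi - d)

def dominant_points(curve, angle_thresh=0.3):
    """Keep interior points whose turning angle is a local maximum above a
    threshold, plus the endpoints: a piecewise-linear approximation."""
    angles = [0.0]
    for i in range(1, len(curve) - 1):
        angles.append(turning_angle(curve[i - 1], curve[i], curve[i + 1]))
    angles.append(0.0)
    keep = [curve[0]]
    for i in range(1, len(curve) - 1):
        if (angles[i] >= angle_thresh
                and angles[i] >= angles[i - 1]
                and angles[i] >= angles[i + 1]):
            keep.append(curve[i])
    keep.append(curve[-1])
    return keep

# An L-shaped trajectory: only the corner survives as an interior vertex.
L = [(x, 0) for x in range(5)] + [(4, y) for y in range(1, 5)]
```

For the L-shaped input, `dominant_points(L)` returns the two endpoints plus the corner (4, 0), which is exactly the piecewise-linear representation the text describes.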
features of auditory inputs) than integration, which is schema-based (auditory experience and knowledge) [1,32,35,36]. Both processes make hearing a single speech stream in a crowded room possible. The MMN is an index of using this information (the segregated input and the neural representation of the relevant context) as the basis for detecting what has changed in the environment. Auditory attention, by modifying the initial organization of the sound input, affects event formation and how the information is represented and stored in memory [9,37,38]. Kujala et al., using the MMN, investigated the segregation of speech sounds in children with dyslexia. They showed differences between the brain processes of discrimination and recognition of sound changes in dyslexic versus normal children. In addition, their results indicated deficits in processing and attention, and speech perception difficulties related to speech sound segregation in dyslexic children . Lepisto et al. studied the segregation and integration of auditory streams indexed by the MMN in children with Asperger's syndrome. These children have problems in detecting speech stimuli and attending to these sounds in crowded environments. In this study, there were MMN differences (amplitude decrease and no response) between these patients and normal children. The results indicated difficulties in concurrent segregation and integration of auditory streams, followed by speech perception difficulties in noisy environments . It could be said that
[Fig. 4. Two deviant stimulus paradigms used to show one or two MMNs depending on the context.]
fundamental frequency, energy band ratio, and silence ratio are extracted. A rule-based heuristic procedure incorporating these audio features is performed, aiming at classifying the shots into one of the following classes: silence, speech, music, and environmental sounds. An event is confirmed as a dialogue if at least 40% of its shots contain speech. The facial analysis includes the detection of frontal faces. A simple face tracking system is employed, which retains only the faces appearing in several consecutive frames. A 2-speaker dialogue is assumed to have no more than one face in most of its component shots; hence, when more faces are detected it is relabeled as a multiple-speaker dialogue. The system was evaluated with encouraging results on three movies containing 80 events in total. When audio and facial cues were integrated, the false alarms were eliminated, yielding a precision rate of 100% and a recall rate higher than 83% in all movies. However, the number of heuristic rules and employed thresholds requires a large validation set, in addition to the test set, in order to experimentally verify the rules and their associated thresholds.
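The two heuristic rules above can be sketched as follows; the 40% speech threshold is taken directly from the text, while the strict-majority reading of "most of its shots" is our assumption:

```python
def classify_event(shot_audio_classes, faces_per_shot):
    """Toy version of the rule-based event labeling described in the text.

    shot_audio_classes: audio class label per shot
                        ('silence', 'speech', 'music', 'environmental')
    faces_per_shot: number of tracked frontal faces in each shot
    """
    n = len(shot_audio_classes)
    speech_ratio = sum(c == 'speech' for c in shot_audio_classes) / n
    if speech_ratio < 0.4:
        return 'non-dialogue'            # too little speech for a dialogue
    multi_face_shots = sum(f > 1 for f in faces_per_shot)
    if multi_face_shots > len(faces_per_shot) / 2:
        return 'multiple-speaker dialogue'  # >1 face in most shots
    return '2-speaker dialogue'
```

For example, five speech shots with at most one face per shot yield a 2-speaker dialogue, whereas the same shots with two faces throughout are relabeled as a multiple-speaker dialogue.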
Automatic recognition of human activities and behaviors is still a challenging problem for many reasons, including the limited accuracy of the data acquired by sensing devices, the high variability of human behaviors, and the gap between visual appearance and scene semantics. Symbolic approaches can significantly simplify the analysis, turning raw data into chains of meaningful patterns. This allows getting rid of most of the clutter produced by low-level processing operations, embedding significant contextual information into the data, and using simple syntactic approaches to match incoming sequences against models. In this context we propose a symbolic approach to learn and detect complex activities through sequences of atomic actions. Compared to previous methods based on Context Free Grammars (CFGs), we introduce several important novelties, such as the capability to learn actions from both positive and negative samples, the possibility of efficiently re-training the system in the presence of misclassified or unrecognized events, and a parsing procedure that correctly detects activities even when they are concatenated and/or nested within each other. Experimental validation on three datasets with different characteristics demonstrates the robustness of the approach in classifying complex human behaviors.
Although there is a vast amount of literature on facial expression recognition, few papers have focused exclusively on smile detection. Caifeng Shan  proposed a smile detection system in which pixel intensity differences are used as features. AdaBoost was adopted to choose and combine weak classifiers based on intensity differences to form a strong classifier. This approach attains 85% accuracy by examining 20 pairs of pixels and 88% accuracy with 100 pairs of pixels. The two major approaches to face image analysis are local-feature-based and image-vector-based. Y. Shinohara et al.  propose a hybrid of these two approaches, using higher-order Local Auto-Correlation (HLAC) features and Fisher weight maps. Experimental results showed that the recognition rate using the Fisher weight map (FWM) and HLAC features was 97.9%, while the Fisher faces method achieved 93.8% and HLAC without a weight map 72.9%. The hybrid method thus outperformed both the Fisher faces method and the HLAC-features-based method. It achieved high accuracy, but only limited data were used in the study. A. Ito et al.  describe a method to detect smiles and laughter sounds from video of natural dialogue. A six-dimensional feature vector describing the lip and cheeks is extracted and fed to a perceptron classifier for smile detection. Tested on three video sequences, this method achieves an accuracy of 60%–85%. U. Kowalik et al.  describe BROAFERENCE, a test bed for studying future-oriented multimedia services and applications in distributed environments. For feature extraction, a 16-dimensional feature vector derived by tracking eight mouth points is used with a neural network classifier to detect smiles; however, no performance evaluation was reported. O. Deniz et al.  describe both a face detector and a smile detector for PUIs. The smile detector is based on the Viola–Jones cascade classifier.
Trained on 5812 images (2436 positive and 3376 negative), the detector achieves an accuracy of 96.1% on 4928 test images. However, the face images used are mainly frontal, with limited imaging conditions. Whitehill et al.  presented a comprehensive study on practical smile detection. They collected the GENKI database, comprising 63,000 real-life face images from the Web. They investigated different parameters, including the size and type of data sets, image registration, facial representation, and machine learning algorithms. Their study suggested that high detection accuracy is achievable in real-life situations, with
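Shan's pixel-intensity-difference features and their AdaBoost-style combination can be sketched as below; the tiny image, pixel pairs, and stump parameters are entirely made up for illustration and are not the published detector:

```python
def pixel_pair_features(image, pairs):
    """One weak feature per pixel pair: the grey-value difference
    between the two (x, y) locations (image indexed as image[y][x])."""
    return [image[y1][x1] - image[y2][x2] for (x1, y1), (x2, y2) in pairs]

def decision_stump(feature, thresh, polarity):
    """AdaBoost weak classifier on a single intensity difference."""
    return 1 if polarity * feature > polarity * thresh else 0

def strong_classifier(feats, stumps):
    """Weighted vote of weak classifiers, as AdaBoost combines them.
    stumps: list of (feature_index, thresh, polarity, alpha)."""
    score = sum(alpha * (2 * decision_stump(feats[i], t, p) - 1)
                for i, t, p, alpha in stumps)
    return 1 if score > 0 else 0

# Toy 2x2 grey image and two hypothetical pixel pairs.
image = [[10, 200],
         [30, 40]]
pairs = [((0, 0), (1, 0)), ((0, 1), (1, 1))]
feats = pixel_pair_features(image, pairs)
stumps = [(0, -50, -1, 1.0)]  # made-up threshold / polarity / weight
smile = strong_classifier(feats, stumps)
```

The appeal of such features is that each one costs only two pixel reads and a subtraction, which is what makes examining 20–100 pairs per face cheap.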
Abstract—Emerging applications of wireless sensor networks (WSNs) require real-time event detection to be provided by the network. In a typical event-monitoring WSN, multiple reports are generated by several nodes when a physical event occurs, and are then forwarded through multi-hop communication to a sink that detects the event. To improve the event detection reliability, timely delivery of a certain number of packets is usually required. Traditional timing analyses of WSNs, however, focus either on individual packets or on traffic flows from individual nodes. In this paper, a spatio-temporal fluid model is developed to capture the delay characteristics of event detection in large-scale WSNs. More specifically, the distribution of the event detection delay from multiple reports is modeled. Accordingly, metrics such as mean delay and soft delay bounds are analyzed for different network parameters. Motivated by the fact that queue build-up in WSNs with low-rate traffic is negligible, a lower-complexity model is also developed. Testbed experiments and simulations are used to validate the accuracy of both approaches. The resulting framework can be utilized to analyze the effects of network and protocol parameters on event detection delay in order to realize real-time operation in WSNs. To the best of our knowledge, this is the first approach that provides a transient analysis of event detection delay when multiple reports via multi-hop communication are needed.
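The metrics involved (mean delay and a soft delay bound for k-out-of-n report delivery) can be illustrated with a toy Monte Carlo sketch; this is not the paper's fluid model, and the exponential per-hop delays, hop count, and rates are assumed values:

```python
import random

def detection_delay(n_reports, k_required, hops, per_hop_mean, rng):
    """Delay until the k-th of n reports reaches the sink; each report
    crosses `hops` hops, each adding an exponential queueing +
    transmission delay with mean `per_hop_mean` seconds."""
    delays = sorted(
        sum(rng.expovariate(1 / per_hop_mean) for _ in range(hops))
        for _ in range(n_reports)
    )
    return delays[k_required - 1]  # event detected at the k-th arrival

rng = random.Random(42)
# 10 reports per event, 5 needed for detection, 4 hops, 20 ms mean per hop.
samples = [detection_delay(10, 5, 4, 0.02, rng) for _ in range(2000)]
mean_delay = sum(samples) / len(samples)
soft_bound = sorted(samples)[int(0.95 * len(samples))]  # 95% soft delay bound
```

The 95th-percentile "soft bound" exceeds the mean, which is the gap a transient analysis must quantify for real-time guarantees.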
The literature on carried object detection is relatively young, but rapidly growing. The earliest works are based on silhouette and motion [3, 8]. More recently, these features have been complemented by taking into account protrusions and the way carried objects modify human appearance . A limitation of the above approaches is that they rely heavily on fitting people to silhouettes and on the availability of visible protrusions, respectively. Other approaches reformulate the carried object detection problem as one of identifying pedestrians with human walking patterns, by searching for shape outliers [10, 13] or by analysing subjects' gait , but without localising the object itself. Very few works have tried to improve tracking by including events. Wang et al.  join pedestrian tracking and event detection into a single optimisation problem; however, their events describe human motion with respect to the viewpoint (left, right, away or towards the camera), which makes them extrinsic rather than intrinsic events, i.e. events defined with respect to the viewpoint rather than the actors. In  event recognition and tracking are not optimised jointly; tracking requires event knowledge, and only running the optimisation for each event class allows the results of the event choices to be compared. Carried object tracking is performed using spatial consistency between person and object in , but the system only considers the carry event and is thus able to track an object only while it is being carried.
Our second experiment was the evaluation of both the audio module alone and the whole multimodal technique. Table III presents the results of the audio, video and multimodal approaches for the movies A Beautiful Mind, Back to the Future, Gone in 60 Seconds, Ice Age and Pirates of the Caribbean. The analysis of the results shows that, compared with the audio technique, the visual technique presented better precision and recall. It is therefore possible to see that images capture an important part of the video semantics. On the other hand, the audio technique can reach a performance similar to the visual one at a considerably lower computational cost. In addition, the audio technique was able to identify scene cuts not detected by the visual technique, especially when consecutive shots were visually similar but the subject changed between them. This reinforces our point about multimodality. This fact explains the better recall obtained by the multimodal technique, which merges the true positives found by the two techniques. However, the multimodal technique suffered a reduction in precision compared to the pure visual and aural techniques, since the false positives from the two techniques are incorporated into the final result.
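The effect of merging the two techniques' detections can be sketched with hypothetical cut positions; the union keeps the true positives of both (raising recall) while also accumulating both techniques' false positives (lowering precision):

```python
def precision_recall(detected, ground_truth):
    """Precision/recall for sets of detected vs. true scene-cut positions."""
    tp = len(detected & ground_truth)
    p = tp / len(detected) if detected else 0.0
    r = tp / len(ground_truth) if ground_truth else 0.0
    return p, r

# Hypothetical frame indices of true and detected scene cuts.
gt = {10, 25, 40, 55, 70}
audio_cuts = {10, 25, 90}        # misses some cuts, adds one false positive
visual_cuts = {10, 40, 55, 120}  # different misses, another false positive
fused = audio_cuts | visual_cuts  # multimodal union of both detectors

p_a, r_a = precision_recall(audio_cuts, gt)
p_v, r_v = precision_recall(visual_cuts, gt)
p_f, r_f = precision_recall(fused, gt)
```

In this toy setting the fused recall (0.8) exceeds both single-modality recalls, while the fused precision drops below the visual technique's, mirroring the behaviour reported in the text.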
However, we are yet to predict both of these tasks simultaneously with a single model, which is the main contribution of this paper. When recognising environmental sounds, humans use prior knowledge of likely events in the scene and their prior experiences of the scenes to classify environments, as demonstrated by one such listening test in . This, plus further evidence that context information such as a scene descriptor increases SED accuracy by machines , motivates this work to build a single recognition model by learning both scene and event data concurrently. To the best of our knowledge, this is the first attempt to create a system for joint ASC and SED. With respect to prior work in sound scene analysis, this is a novel proof-of-concept which has the potential to optimise future ASC and SED systems, whereby robust ASC and SED inputs are coupled before training a single model to predict both scenes and events jointly. In the SED literature, detection and recognition refer to the same problem. This is because SED systems are evaluated not just on spotting events (with onsets and offsets), but on both spotting the events and assigning a label to them. For clarity, in the rest of this paper, we use the term classification for matching class labels, detection for spotting events, and recognition for both classification and detection.
e1, e2, e3, e4 may be, respectively, "player1 possesses the ball", "player1 kicks the ball", "the ball approaches player2", "player2 gets in possession of the ball".
4. TYPES OF EVENTS
One of the most interesting aspects of soccer analysis is the ability to recognize events, such as a kick, goal, pass, offside, card, or ball possession, from ordinary video. Most of the videos previously used for event recognition are captured by multiple fixed cameras that observe the positions of all the players and the ball on the soccer field . The use of such cameras improves the overall accuracy of object tracking, but they are computationally expensive. The video fragments we have used are easily accessible from the internet.
6 Lecture Notes in Computer Science
3 Results and Discussion
The accuracies of our liveness detection approach were assessed on the LivDet 2011 dataset . This dataset is one of the most used in the literature and thus allows for a comparison with a large number of methods. LivDet 2011 consists of four datasets of images acquired with different devices (Biometrika, Digital Persona, Italdata and Sagem). For each device, 2000 live images of different subjects and 2000 fake images obtained with different materials (such as gelatine, latex, PlayDoh, silicone and wood glue) were collected. Images were divided into a training and a test set, each containing an equal number of live and fake images, with fake images equally distributed among the different materials. The LivDet 2011 datasets were acquired using a consensual method , where the subject actively cooperated to create a mold of his/her finger, thus obtaining surrogates of better quality, i.e. more difficult to detect, than those created from latent fingerprints. Experiments were organized as follows. First, we optimized the parameters of each method, i.e. the parameter C of the linear SVM and the task weights for MTJSRC, with a 5-fold cross-validation procedure on the training set. Then, we computed the classification capabilities of both individual and grouped attributes. A preliminary result, not detailed for the sake of brevity, is that the individual attributes perform consistently worse than their combination with other attributes, demonstrating the strength of multiview approaches. As for grouped attributes, we tested different combinations of the candidate attributes described in Section 2.3 and, for each candidate attribute, we tested different parameter settings.
We found that the best results were obtained for WLD using three different image scales (referred to in the results as W3), and for BSIF using different window sizes, 5x5 (B5), 15x15 (B15) and 17x17 (B17), while for LPQ we obtained similar results computing phase information with either the Short-Term Fourier Transform (LS) or Gaussian derivative quadrature filter pairs (LG).
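The 5-fold parameter selection described above can be sketched generically; the toy threshold "classifier" below merely stands in for the linear SVM (whose C would be tuned the same way), and all data are synthetic:

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(params, data, labels, train_eval, k=5, seed=0):
    """Pick the parameter value with the best mean accuracy over k folds.
    `train_eval(param, tr_x, tr_y, te_x, te_y) -> accuracy` is supplied
    by the caller (here it would wrap a linear SVM with C=param)."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(data), k, rng)
    best, best_acc = None, -1.0
    for p in params:
        accs = []
        for fold in folds:
            te = set(fold)
            tr_x = [data[i] for i in range(len(data)) if i not in te]
            tr_y = [labels[i] for i in range(len(data)) if i not in te]
            te_x = [data[i] for i in fold]
            te_y = [labels[i] for i in fold]
            accs.append(train_eval(p, tr_x, tr_y, te_x, te_y))
        acc = sum(accs) / k
        if acc > best_acc:
            best, best_acc = p, acc
    return best, best_acc

# Toy stand-in for the SVM: a 1-D threshold classifier whose "C" is the
# threshold itself, so the best value is recoverable by cross-validation.
data = [x / 10 for x in range(40)]
labels = [1 if x >= 2.0 else 0 for x in data]

def train_eval(c, tr_x, tr_y, te_x, te_y):
    preds = [1 if x >= c else 0 for x in te_x]
    return sum(p == y for p, y in zip(preds, te_y)) / len(te_y)

best_c, acc = cross_validate([0.5, 1.0, 2.0, 4.0], data, labels, train_eval)
```

The same `cross_validate` skeleton also covers the MTJSRC task weights: only the `train_eval` callback changes.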
Sound event detection (SED) involves detecting the onset and offset times of sound events in audio and providing a textual descriptor for each [13-14]. Common classification methods used for SED are adopted from conventional speech recognition . A typical SED pipeline consists basically of feature extraction followed by classification . A classification method that has recently become popular is the Convolutional Neural Network (CNN) . In surveillance, SED architectures include additional background subtraction, object tracking and situational analysis processes in the pipeline . The latest research on an improved pipeline suggests a verification step after the SED process to reduce false positives . Figure 1 shows the extended SED pipeline.
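The extended pipeline (feature extraction, then classification, then verification) can be sketched with toy stand-ins for each stage; the energy feature, threshold classifier, and minimum-duration verification rule are illustrative assumptions, not the surveyed systems:

```python
def extract_features(frame):
    """Toy per-frame feature: mean energy (a real system would use
    e.g. log-mel spectra)."""
    return sum(s * s for s in frame) / len(frame)

def classify(energy, thresh=0.1):
    """Toy frame classifier on the extracted feature."""
    return 'event' if energy > thresh else 'background'

def verify(frame_labels, min_run=3):
    """Verification step: keep only events spanning >= min_run consecutive
    frames, suppressing isolated false positives."""
    out, run = [], 0
    for lab in frame_labels:
        run = run + 1 if lab == 'event' else 0
        out.append('event' if run >= min_run else 'background')
    # back-fill the first frames of each accepted run
    for i in range(len(out) - 1, 0, -1):
        if out[i] == 'event' and frame_labels[i - 1] == 'event':
            out[i - 1] = 'event'
    return out

# Nine audio frames: one isolated loud frame, then a sustained event.
frames = ([[0.0] * 8, [0.9] * 8, [0.0] * 8, [0.0] * 8]
          + [[0.8] * 8] * 4 + [[0.0] * 8])
raw = [classify(extract_features(f)) for f in frames]
detected = verify(raw)
```

The isolated loud frame is classified as an event by the raw pipeline but rejected by verification, while the four-frame event survives, which is precisely the false-positive reduction the verification step is added for.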
When performing scene clustering, we randomly select one frame from each sliding window to generate the SC Graph. The clustering result of each frame is used to label the corresponding sliding window. The graph-based deep Gaussian mixture model is built from three graph convolutional layers and two fully-connected layers. In particular, the graph convolutional network runs as GC(512, 128, ReLU)-Drop(0.5)-GC(128, 32, ReLU)-Drop(0.5)-GC(32, 4, none). The architecture of the fully-connected network is FC(4, 32, ReLU)-Drop(0.5)-FC(32, 10, softmax). Layer(a, b, f) denotes a graph convolutional (GC) or fully-connected (FC) layer, where the layer-specific trainable weight matrix has size a × b and the activation function is f. Drop(p) refers to a dropout layer with parameter p. During clustering, the number of clustering centers M denotes the preset number of scene types, and we set M to 10 to cover common scenes (e.g., campuses, highways, subways, etc.). The training batch size for scene clustering is set to 1024, and we use the RMSprop optimizer with a 0.0001 learning rate to train the clustering model.
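A plain-Python shape trace of the described stack (random toy weights, dropout omitted as at inference time; the uniform adjacency normalisation and graph size are placeholders, not the paper's construction):

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def gc_layer(A, H, W, act=None):
    """One graph convolution: H' = act(A @ H @ W), with A the (normalised)
    adjacency of the SC graph, H node features, W the a x b weight matrix."""
    out = matmul(matmul(A, H), W)
    return act(out) if act else out

rng = random.Random(0)
n = 6  # toy number of nodes (sampled frames) in the SC graph
A = [[1.0 / n] * n for _ in range(n)]  # placeholder normalised adjacency
H = [[rng.gauss(0, 1) for _ in range(512)] for _ in range(n)]

# GC(512, 128, ReLU) -> GC(128, 32, ReLU) -> GC(32, 4, none)
for d_in, d_out, act in [(512, 128, relu), (128, 32, relu), (32, 4, None)]:
    W = [[rng.gauss(0, 0.01) for _ in range(d_out)] for _ in range(d_in)]
    H = gc_layer(A, H, W, act)

# FC(4, 32, ReLU) -> FC(32, 10, softmax): per-node posteriors over M = 10
W1 = [[rng.gauss(0, 0.1) for _ in range(32)] for _ in range(4)]
W2 = [[rng.gauss(0, 0.1) for _ in range(10)] for _ in range(32)]
posteriors = [softmax(row) for row in matmul(relu(matmul(H, W1)), W2)]
```

The trace confirms the dimensions implied by the Layer(a, b, f) notation: 512-dimensional node features are reduced to 4 through the GC stack, and the FC head maps them to a 10-way (M = 10) cluster posterior per node.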
In this work we present a new dataset of literary events—events that are depicted as taking place within the imagined space of a novel. While previous work has focused on event detection in the domain of contemporary news, literature poses a number of complications for existing systems, including complex narration, the depiction of a broad array of mental states, and a strong emphasis on figurative language. We outline the annotation decisions of this new dataset and compare several models for predicting events; the best performing model, a bidirectional LSTM with BERT token representations, achieves an F1 score of 73.9. We then apply this model to a corpus of novels split across two dimensions—prestige and popularity—and demonstrate that there are statistically significant differences in the distribution of events for prestige.
Multi-Modal Cues: As with face-to-face interaction, MultiView can support multiple types of cues concurrently. During calibration, a person setting up the system was able to look at someone at the remote site and point in a direction to say tacitly, "Hey you, go that way." He was able to use two non-verbal deictic cues – gaze to identify the person and hand gestures to identify the direction he wanted them to move – at the same time. No verbal communication was required. Life-Size Images: Reeves and Nass have shown that the size of a display can affect the level of cognitive arousal, and we wished to preserve this effect . Many common systems use typical computer monitors to display the video stream and, oftentimes, the image itself is only a fraction of the screen. GAZE-2  uses the entirety of the monitor's real estate, but the actual images of people are quite small; the rest of the monitor space is required for recreating a sense of spatial relations among the participants. Hydra  uses small LCD panels as a display. In MultiView, the entirety of the display is used.
information fusion is based on homography mapping of the foreground information from multiple cameras and for multiple parallel planes. Unlike most recent algorithms, which transmit and project foreground bitmaps, it approximates each foreground silhouette with a polygon and projects the polygon vertices only. In addition, an alternative approach to estimating the homographies for multiple parallel planes is presented; it is based on the observed pedestrians and does not resort to vanishing point estimation. The ability of this algorithm to remove cast shadows in moving object detection is also investigated. Results are demonstrated on open video datasets.
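Projecting only the polygon vertices (rather than warping a full foreground bitmap) amounts to applying the plane homography to each vertex; the 3x3 matrix and silhouette polygon below are made-up values for illustration:

```python
def apply_homography(H, pts):
    """Project 2-D points with a 3x3 homography (homogeneous division)."""
    out = []
    for x, y in pts:
        xh = H[0][0] * x + H[0][1] * y + H[0][2]
        yh = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        out.append((xh / w, yh / w))
    return out

# Hypothetical camera-to-ground-plane homography.
H_ground = [[1.0, 0.2,   5.0],
            [0.0, 1.1,   2.0],
            [0.0, 0.001, 1.0]]
# Four vertices approximating a pedestrian's foreground silhouette.
silhouette_poly = [(100, 200), (120, 200), (125, 260), (95, 260)]
projected = apply_homography(H_ground, silhouette_poly)
```

Only four points are transmitted and projected here, versus every foreground pixel of the bitmap, which is the bandwidth and computation saving the text refers to.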
As cybercrime is increasingly becoming a problem for the global community (McAfee, 2014), cyber-security is becoming the focus of many governments and organisations. Existing cyber-security strategies need to be both accurate and efficient at processing real-time network traffic. This paper explored the need for an IDS detection approach that is accurate at classifying traffic and efficient enough at processing network traffic to keep up with predicted future traffic flows. Research has been conducted into improving individual aspects of intrusion detection approaches, such as accuracy; however, little to no research has been conducted on developing an intrusion detection approach that is both accurate and efficient.