From Photo Streams to Evolving Situations
Mengfan Tang a, ∗ , Feiping Nie b , Siripen Pongpaichet c , Ramesh Jain a a Department of Computer Science, University of California, Irvine, USA b School of Computer Science and Center for OPTical IMagery Analysis and Learning(OPTIMAL), Northwestern Polytechnical University, China c Faculty of Information and Communication Technology, Mahidol University, Thailand
Abstract
Photos are becoming spontaneous, objective, and universal sources of information. This paper develops evolving situation recognition using photo streams coming from disparate sources, combined with the advances of deep learning. Using visual concepts in photos together with space and time information, we formulate situation detection as a semi-supervised learning problem and propose new graph-based models to solve it. To extend the method to unknown situations, we introduce a soft label method which enables the traditional semi-supervised learning framework to accurately predict predefined labels as well as effectively form new clusters. To overcome noisy data, which degrades graph quality and leads to poor recognition results, we take advantage of two kinds of noise-robust norms which can eliminate the adverse effects of outliers in visual concepts and improve the accuracy of situation recognition. Finally, we demonstrate the idea and the effectiveness of the proposed model on Yahoo Flickr Creative Commons 100 Million.
Keywords: evolving situations, semi-supervised learning, new label discovery, $\ell_1$-norm, capped norm, outlier elimination
∗ Corresponding author
Preprint submitted to JVCI September 4, 2018

1. Introduction

In today's data-rich environment, big data is readily available from many open sources, proprietary sources, IoT devices, and databases. The act of leveraging human mobility to sense the environment is referred to as "participatory sensing". Increasingly, participatory citizen sensing and crowdsourcing are playing more important roles in understanding current trends and evolving situations. Among diverse participatory sensing data, photos are one of the biggest data sources. Every minute, millions of photos are uploaded by people to social media. Photos have been used for emerging applications such as event recognition [1], trend analysis [2], and cultural dynamics [3]. Photos provide information without a language.

Deep learning frameworks have been successfully used in video and image analysis, and visual concepts can be accurately recognized. This concept detection technology has been commercialized by a company called "Clarifai", which provides photo understanding services that solve real-world problems for businesses and developers. Accurate concept detectors provide informative and objective multimedia micro-reports. It is a good time to explore the use of geo-tagged photos for situation recognition. Different from event detection, which is based on data at one particular space and time, a situation is defined as the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future [4]. We use photos as the information source for radical improvement in the quality of participatory sensing data. Photo data has associated geo-spatial and other useful information. A "situation" is characterized by its space-time-theme nature. With information on space and time, the objective and factual nature of photos makes them the best resource for situation recognition [5].
For example, when a situation occurs, people take photos related to the situation, which enables prompt detection of its occurrence simply by observing the photos. Social situations and trends are usually detected by the co-occurrence of visual concepts. For example, "Olympic Games" is always associated with concepts such as people, sport, bar, and stadium, evolving in a certain pattern. The event model of "Olympic Games" can be defined as a bag of these visual concepts. Detection of situations is then transformed into a photo clustering problem, in which each cluster represents a situation [5].

Photo clustering assigns photos to groups which share the same semantic concepts. Many traditional methods, such as K-means, support vector machines, and spectral clustering, can be used to infer a photo's label using labeled and unlabeled photos. In particular, by using labeled and unlabeled data together, semi-supervised learning can assign labels to the remaining photos by assuming that neighboring data points are likely to have the same label. However, traditional semi-supervised learning frameworks require at least one data sample for each label and lack the ability to discover new situations.
Figure 1: Semi-supervised Learning Framework for Evolving Situation Recognition under the Condition of Unknown-Labeled Data
Graph-based semi-supervised learning methods have been the state of the art in photo annotation and photo understanding [6]. Harmonic energy minimization [7] uses Gaussian fields to propagate label information to unlabeled data. As this method is a kind of random walk, the output can be interpreted as the probabilities of classifying the data points into the given labeled clusters. The classification depends heavily on the labeled data, which makes it sensitive to noise in the labeled data. Some of the existing graph-based models use a quadratic form of graph embedding. However, a major drawback is the sensitivity of the results to outliers. A robust graph-based learning method can overcome this drawback through the use of noise-robust norms. Among these methods, models using the $\ell_0$-norm have demonstrated effective performance [8]. Unfortunately, the computational expense grows because these models involve subset selection problems. To reduce the computational burden, the $\ell_0$-norm has been replaced by relaxations such as the $\ell_1$-norm.

In this paper, we combine the advantages of noise-resistant norms and the soft label method and propose new graph-based models. These models learn an efficient graph embedding by utilizing the $\ell_1$-norm and the capped $\ell_p$-norm to remove data outliers. The soft label method empowers the semi-supervised learning framework to accurately predict labels as well as discover new clusters. Furthermore, efficient iterative algorithms are adopted to solve the proposed optimization problems. The proposed framework provides a powerful tool to investigate situation recognition under the condition of noisy and unknown-labeled data. To verify the effectiveness and efficiency of the proposed models, we apply them to Yahoo Flickr Creative Commons data.
2. The Proposed Framework
Suppose we have millions of photos uploaded to the Web every day. Can we use the photos to observe real-world situations? A "situation" is defined on space, time, and concept information. For example, when a public situation occurs, people take photos related to this situation. People act as sensors, enabling prompt detection of the situation's occurrence. We develop a framework that treats photos as micro-reports to detect evolving situations. The framework makes use of visual concepts, space, and time information and includes three components: data collection, clustering, and filtering by space and time.

An overview of the proposed framework is shown in Fig. 1. The framework first collects historical photos and their labeled situations. For each photo, a deep learning concept detector is applied to obtain visual concepts. We do not aim to develop new concept detection methods or to improve the existing concept detector; we use these concepts as features to detect situations. At the data collection stage, photos are associated with visual concepts and situation labels. It is well known that labeled data on the web is far scarcer than unlabeled data. Semi-supervised learning methods can be naturally adopted to target the problem of recognizing new photo labels. However, traditional semi-supervised learning methods cannot be directly applied because they require that every photo belong to at least one predefined cluster. In real-world applications, unknown or new situations may exist in unseen photos. On the other hand, noise may also exist in the new data. Thus, in the clustering stage, we need to handle both the new situation problem and the noise problem. We introduce a soft label method which can effectively form new clusters as well as accurately predict known labels. We incorporate noise-robust norms to eliminate the adverse effects of outliers in visual concepts and thus improve the accuracy of situation recognition.
At the end of the second stage, photos are labeled by situation. At the last stage, we use space and time information to further understand when and where these situations happen. For example, given photos of the situation "Olympic Games", time information can be used to determine whether it is a "Winter Olympic Games" or a "Summer Olympic Games". Another example, from the space perspective, is "holi". "Holi" is originally an Indian festival. Gradually, it has come to be celebrated in the United States and Europe. The proposed framework can show how the "holi" trend evolves at different locations.
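The space and time filtering stage described above can be sketched as a simple aggregation. This is only an illustration under stated assumptions: the photo records are plain dictionaries, and the field names `situation`, `taken`, `lat`, and `lon` as well as the grid size `cell_deg` are hypothetical and do not come from the paper.

```python
from collections import Counter
from datetime import datetime

def space_time_histogram(photos, situation, cell_deg=1.0):
    """Count photos of one situation per (month, grid cell).

    `photos` is a list of dicts with hypothetical keys:
    'situation' (str), 'taken' (datetime), 'lat' and 'lon' (float).
    """
    counts = Counter()
    for p in photos:
        if p["situation"] != situation:
            continue
        month = p["taken"].strftime("%Y-%m")           # time bucket
        cell = (int(p["lat"] // cell_deg),              # coarse latitude bin
                int(p["lon"] // cell_deg))              # coarse longitude bin
        counts[(month, cell)] += 1
    return counts

photos = [
    {"situation": "hanami", "taken": datetime(2012, 3, 30), "lat": 35.68, "lon": 139.69},
    {"situation": "hanami", "taken": datetime(2012, 3, 31), "lat": 35.70, "lon": 139.75},
    {"situation": "hanami", "taken": datetime(2012, 4, 2), "lat": 34.69, "lon": 135.50},
    {"situation": "holi", "taken": datetime(2012, 3, 8), "lat": 28.61, "lon": 77.21},
]
hist = space_time_histogram(photos, "hanami")
```

Peaks in such a histogram indicate when and where a labeled situation is concentrated; a finer grid or daily buckets could be substituted without changing the idea.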
Given $n$ photos $\{x_1, \cdots, x_m, x_{m+1}, \cdots, x_n\}$ and the labeled situation set $U = \{1, \cdots, u\}$, the photos $\{x_1, \cdots, x_m\}$ are labeled with known situations $y \in U$ and the remaining photos $\{x_{m+1}, \cdots, x_n\}$ are not labeled. The goal is to use the visual concepts of the photos as features to predict the labels of the unlabeled photos and to discover a new situation if one is present in the photos. Predicting the labels of unlabeled photos using both labeled and unlabeled data is a semi-supervised learning problem. To extend the traditional semi-supervised learning framework to new situation discovery, we introduce one more label, denoted $u+1$, so that $\hat{U} = \{1, \cdots, u+1\}$. This simple setting addresses the new label discovery problem, which the traditional semi-supervised learning framework cannot solve.

Consider a connected graph of photos, $G = (V, E)$. Nodes in $V$ correspond to the $n$ data points. Nodes in $M = (1, 2, \cdots, m)$ are situation-labeled photos. Nodes in $U = (m+1, \cdots, m+u)$ are photos with unknown situation labels. The edges $E$ of the graph are described by the similarity matrix $W \in \mathbb{R}^{n \times n}$, where $w_{ij}$ is the similarity between the pair of vertices $x_i$ and $x_j$. One example of the similarity $w_{ij}$, using a Gaussian function, is

$$w_{ij} = \begin{cases} \exp\left(-\sum_{z=1}^{p} \dfrac{(x_{iz} - x_{jz})^2}{\sigma_z^2}\right), & j \in N_i \text{ or } i \in N_j, \\ 0, & \text{otherwise,} \end{cases}$$

where $x_i = (x_{i1}, \cdots, x_{ip})$ has $p$ feature dimensions, $N_i$ is the set of indices of the neighbors of $x_i$, and $\sigma_z$ ($z = 1, \cdots, p$) are parameters associated with the features.

Given a similarity matrix $W$, the key idea of graph-based methods is that nodes connected by a large weight in $W$ have similar values. In other words, the observations $y_1, \cdots, y_n$ change smoothly on the graph. Based on this smoothness assumption, the task is to assign labels to the nodes in $U$.
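The Gaussian neighborhood similarity above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `gaussian_knn_graph` is hypothetical, and taking $N_i$ to be the `k` nearest neighbors under the scaled distance is an assumption.

```python
import numpy as np

def gaussian_knn_graph(X, k, sigma):
    """Symmetric similarity matrix: w_ij = exp(-sum_z (x_iz - x_jz)^2 / sigma_z^2)
    when j is among the k nearest neighbors of i or vice versa, else 0."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (p,))
    Z = X / sigma                                   # per-feature scaling by sigma_z
    sq = np.sum(Z**2, axis=1)
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]          # indices of the k nearest neighbors
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    mask |= mask.T                                  # j in N_i or i in N_j
    return np.where(mask, np.exp(-dist), 0.0)

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
W = gaussian_knn_graph(X, k=1, sigma=1.0)
```

A single scalar `sigma` is broadcast to all features here; passing a length-$p$ array gives one bandwidth per feature, as in the formula.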
[9] proposed a general graph-based semi-supervised learning method for new class discovery:

$$\min_F \sum_{i,j} \hat{w}_{ij} \|f_i - f_j\|_2^2 + \sum_{i=1}^{n} u_i \hat{d}_i \|f_i - y_i\|_2^2, \qquad (1)$$

where the normalized weights are $\hat{w}_{ij} = w_{ij} / \sqrt{d_i d_j}$ with $d_i = \sum_j w_{ij}$. By optimizing Eq. (1), we obtain the soft label matrix $F \in \mathbb{R}^{n \times (c+1)}$. The label of $x_i$ is then calculated as

$$y_i = \arg\max_{j \le c+1} F_{ij}.$$

This model has two terms. The first term acts as a regularizer, which controls the smoothness of the predicted labels on the graph. The second term determines the degree of matching between the predicted labels and the initial labels. The parameters $u_i$ and $\hat{d}_i$ balance the trade-off between these two terms.

Traditional methods such as (1) are built on the $\ell_2$-norm of the graph embedding. The quadratic form makes these methods sensitive to noise and outliers. To overcome noisy data, which degrades graph quality and leads to poor recognition results, we use noise-robust norms which can eliminate the adverse effects of outliers in visual concepts and improve the accuracy of situation recognition. In particular, we propose one $\ell_1$-norm method and one capped norm method. The $\ell_1$-norm method is expressed as

$$\min_F \sum_{i,j} \hat{w}_{ij} \|f_i - f_j\|_2 + \sum_{i=1}^{n} u_i \hat{d}_i \|f_i - y_i\|_2^2. \qquad (2)$$

Capped norm based loss functions have been used for various purposes. For example, the capped norm has been used for unsupervised photo clustering [10]. Here, we use the capped $\ell_p$-norm as a robust and stable loss function to resist outliers. We use this property to project data onto a manifold where the similarity of data points is adjusted without input from outliers. If the distance between two label vectors is large, the corresponding similarity value is not updated. The capped norm based loss function of a $c$-dimensional vector $u \in \mathbb{R}^{1 \times c}$ is $\min(\|u\|_2^p, \theta)$, where $\theta$ is a parameter.
The value of this loss function is $\|u\|_2^p$ if $\|u\|_2^p$ is smaller than $\theta$, and $\theta$ otherwise. This loss function is more robust to outliers than an uncapped norm because it has the threshold $\theta$ for outliers.

The proposed capped $\ell_p$-norm method is expressed as

$$\min_F \sum_{i,j} \hat{w}_{ij} \min(\|f_i - f_j\|_2^p, \theta) + \sum_{i=1}^{n} u_i \hat{d}_i \|f_i - y_i\|_2^2, \qquad (3)$$

where $w_{ij}$ is the similarity in the graph between $x_i$ and $x_j$, $0 < p \le 2$, and $\theta$ is a parameter.

The above formulation benefits from control over the input data. The clustering results depend on the quality of the input data graph and are often sensitive to the particular graph construction method. We address this problem from two perspectives: graph initialization and graph similarity adaptation. We will introduce a method that gives a "good" initialization. For graph similarity adaptation, if the distance between two label vectors is large, the corresponding similarity value is not updated because of the capped norm. By optimizing Eq. (2) and (3), we obtain the soft label matrix $F \in \mathbb{R}^{n \times (c+1)}$. As for Eq. (1), the label of $x_i$ is calculated as

$$y_i = \arg\max_{j \le c+1} F_{ij}.$$

In this subsection, we introduce the optimization algorithms that solve problems (2) and (3). To optimize the objective function easily, we rewrite (2) in matrix form,

$$\min_F \sum_{i,j} \hat{w}_{ij} \|f_i - f_j\|_2 + Tr\left((F - Y)^T U \hat{D} (F - Y)\right),$$

where $U$ is the diagonal matrix with entries $u_i$ and $\hat{D}$ is a diagonal matrix with diagonal entries $\hat{D}_{ii} = \sum_j \hat{W}_{ij}$, $\forall i$.
Taking the derivative of this objective with respect to $F$ and setting it to zero, we have

$$\bar{L} F + U \hat{D} (F - Y) = 0,$$

where $\bar{L}$ is the Laplacian matrix of $\bar{W}$, and the $ij$-th element of $\bar{W}$ is defined by

$$\bar{W}_{ij} = \frac{W_{ij}}{\|f_i - f_j\|_2}.$$

Thus,

$$F = (\bar{L} + U \hat{D})^{-1} U \hat{D} Y.$$

Because $\bar{L}$ depends on $F$, we propose an iterative algorithm to obtain the solution.

In the objective function, the similarity matrix $W \in \mathbb{R}^{n \times n}$ is required, and the structure of the graph plays an important role in the performance of graph-based clustering. We use the method of [11] to generate the initial graph for the proposed model. This method has only one integer parameter, the number of neighbors, which is easy to tune:

$$\min_{w_i^T \mathbf{1} = 1,\; w_i \ge 0,\; w_{ii} = 0} \sum_{j=1}^{n} \|x_i - x_j\|_2^2 \, w_{ij} + \lambda \sum_{j=1}^{n} w_{ij}^2,$$

where $w_i$ is the $i$-th row of the similarity matrix $W$. Denoting $e_{ij} = \|x_i - x_j\|_2^2$, with the distances from $x_i$ sorted in ascending order, and using the Lagrangian method, the optimal similarities are

$$\hat{w}_{ij} = \begin{cases} \dfrac{e_{i,k+1} - e_{ij}}{k \, e_{i,k+1} - \sum_{h=1}^{k} e_{ih}}, & j \le k, \\ 0, & j > k, \end{cases} \qquad (4)$$

where $k$ is the number of neighbors. Because $\hat{w}_{ij}$ is simpler to compute than Gaussian similarities, this construction is well suited to large-scale graphs.

Algorithm 1: Semi-supervised $\ell_1$-norm Method
  Initialize $W_{ij}$;
  Construct graph for $w_{ij} \in W$ using Equation (4).
  repeat
    Calculate $\bar{L}_t = \bar{D}_t - \bar{W}_t$, where $(\bar{W}_t)_{ij} = W_{ij} / \|(f_t)_i - (f_t)_j\|_2$
    Calculate $F_{t+1} = (\bar{L}_t + U \hat{D})^{-1} U \hat{D} Y$
    $t = t + 1$;
  until converge;
  Assign labels to $x_i$: $y_i = \arg\max_{j \le c+1} F_{ij}$

To optimize the objective function in Problem (3) easily, we rewrite it in matrix form,

$$\min_F \sum_{i,j} \hat{w}_{ij} \min(\|f_i - f_j\|_2^p, \theta) + Tr\left((F - Y)^T U \hat{D} (F - Y)\right).$$

Because the proposed model is a weighted sum of concave functions, it can be solved in a general re-weighted optimization framework [12].

The general re-weighted optimization problem is

$$\min_{x \in \mathcal{C}} f(x) + \sum_i h_i(g_i(x)), \qquad (5)$$

where $h_i(x)$ is an arbitrary concave function on the domain of $g_i(x)$. This general problem can be solved by Algorithm 2.

Algorithm 2: Solving the Problem (5)
  Initialize $x \in \mathcal{C}$;
  repeat
    Calculate $D_i = h_i'(g_i(x))$
    Solve the following problem: $\min_{x \in \mathcal{C}} f(x) + \sum_i Tr(D_i^T g_i(x))$
  until converge;

For problem (3), denote $h(x) = \min(x^{p/2}, \theta)$ and $x = \|f_i - f_j\|_2^2$. Because $h(x)$ is a concave function with respect to $x$, the supergradient of $h(x)$ can be obtained:

$$h'(x) = \begin{cases} \dfrac{p}{2} x^{\frac{p}{2} - 1}, & x^{p/2} \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$

With $s_{ij} = h'(\|f_i - f_j\|_2^2)$, the first part of the proposed model can be written in the following form:

$$\min_{F^T F = I} \sum_{i,j} w_{ij} h'(x) x = \min_{F^T F = I} \sum_{i,j} w_{ij} s_{ij} \|f_i - f_j\|_2^2 = \min_{F^T F = I} \sum_{i,j} \tilde{s}_{ij} \|f_i - f_j\|_2^2 = \min_{F^T F = I} Tr(F^T L_{\tilde{s}} F),$$

where $\tilde{s}_{ij} = w_{ij} s_{ij}$.

Fixing $\tilde{s}_{ij}$, the objective function becomes

$$\min_F Tr(F^T L_{\tilde{s}} F) + Tr\left((F - Y)^T U \hat{D} (F - Y)\right). \qquad (6)$$

Taking the derivative of (6) with respect to $F$ and setting it to zero, we have

$$L_{\tilde{s}} F + U \hat{D} (F - Y) = 0.$$

Thus,

$$F = (L_{\tilde{s}} + U \hat{D})^{-1} U \hat{D} Y.$$

We apply an iterative algorithm to solve the optimization problem (3). One of the most important aspects of an iterative algorithm is convergence. Because the objective function is an instance of the general re-weighted optimization problem (5), the algorithm is theoretically proven to converge [12].

According to Algorithm 2, we have the following Algorithm 3 to solve Problem (3).
Algorithm 3: Semi-supervised capped $\ell_p$-norm Method
  Initialize $s_{ij}$ and set $s_{ij} = 1$;
  Construct graph for $w_{ij} \in W$ using Equation (4).
  repeat
    Calculate $\tilde{s}$ by $\tilde{s}_{ij} = w_{ij} s_{ij}$
    Update $F$ by $F = (L_{\tilde{s}} + U \hat{D})^{-1} U \hat{D} Y$;
    Calculate $s_{ij} = \frac{p}{2} \|f_i - f_j\|_2^{p-2}$ if $\|f_i - f_j\|_2^p \le \theta$, and $s_{ij} = 0$ otherwise
  until converge;
  Assign labels to $x_i$: $y_i = \arg\max_{j \le c+1} F_{ij}$
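The graph construction of Eq. (4) and the re-weighted iterations of Algorithm 3 can be sketched together in NumPy. This is a rough sketch under several assumptions that are not stated in the paper: the function names are hypothetical, the graph is symmetrized by averaging, the normalized weights $\hat{w}_{ij}$ of Eq. (3) are used in the re-weighting step, unlabeled rows of `Y` put all their mass on the extra column $c+1$, a small `eps` guards against division by zero, and a fixed iteration count stands in for a convergence test.

```python
import numpy as np

def adaptive_knn_graph(X, k):
    """Closed-form k-NN weights in the style of Eq. (4); requires n >= k + 2."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    E = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # e_ij
    np.fill_diagonal(E, np.inf)               # keeps w_ii = 0
    order = np.argsort(E, axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        e = E[i, order[i, :k + 1]]            # k+1 smallest distances, ascending
        denom = k * e[k] - e[:k].sum()
        if denom > 1e-12:
            W[i, order[i, :k]] = (e[k] - e[:k]) / denom
        else:                                 # all distances tie: uniform fallback (assumption)
            W[i, order[i, :k]] = 1.0 / k
    return (W + W.T) / 2.0                    # symmetrization is an assumption

def capped_lp_ssl(W, Y, u, p=1.0, theta=1.0, n_iter=30, eps=1e-6):
    """Re-weighted iterations in the spirit of Algorithm 3.

    W: (n, n) symmetric similarities, Y: (n, c+1) initial labels,
    u: (n,) per-photo fitting weights u_i.
    """
    d = np.maximum(W.sum(axis=1), eps)
    W_hat = W / np.sqrt(np.outer(d, d))       # w_ij / sqrt(d_i d_j)
    d_hat = np.maximum(W_hat.sum(axis=1), eps)
    UD = np.diag(u * d_hat)                   # U * D_hat, both diagonal
    S = np.ones_like(W_hat)                   # s_ij = 1 initially
    for _ in range(n_iter):
        Ws = W_hat * S                        # re-weighted similarities
        L = np.diag(Ws.sum(axis=1)) - Ws      # graph Laplacian of Ws
        F = np.linalg.solve(L + UD, UD @ Y)   # F = (L + U D_hat)^{-1} U D_hat Y
        diff = F[:, None, :] - F[None, :, :]  # O(n^2 c) memory, fine for a sketch
        dist = np.sqrt((diff**2).sum(axis=2))
        # s_ij = (p/2) ||f_i - f_j||^(p-2) if ||f_i - f_j||^p <= theta, else 0
        S = np.where(dist**p <= theta,
                     0.5 * p * np.maximum(dist, eps)**(p - 2.0), 0.0)
    return F, F.argmax(axis=1)

# Toy data: two known situations with one labeled photo each, plus an unseen one.
X = np.array([0, 1, 2.2, 3.6, 20, 21, 22.2, 23.6, 40, 41, 42.2, 43.6]).reshape(-1, 1)
W = adaptive_knn_graph(X, k=2)
Y = np.zeros((12, 3)); Y[:, 2] = 1.0          # unlabeled rows start in the extra class
Y[1] = [1, 0, 0]; Y[5] = [0, 1, 0]            # labeled photos
u = np.full(12, 0.01); u[[1, 5]] = 100.0      # strong fit for labeled, weak for unlabeled
F, pred = capped_lp_ssl(W, Y, u, p=1.0, theta=1.0)
```

In this toy setting the third group contains no labeled photo, so its rows stay on the extra column and it surfaces as a new cluster, while the first two groups inherit the labels of their labeled members. Replacing the update of `S` with `1 / dist` gives the re-weighting used in Algorithm 1.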
3. Experiments
The Yahoo Flickr Creative Commons 100M (YFCC100M) dataset [13] is used to demonstrate our idea. As described in the previous sections, deep learning approaches from Clarifai are used to detect 1,570 concepts in all the sampled photos. We do not aim to develop new concept detection methods or to improve the existing concept detector; we use these concepts as features to detect situations. The top-500 concepts are used as features because of the long-tailed distribution of concepts. The top-30 concepts are 'people', 'nature', 'indoor', 'sport', 'landscape', 'plant', 'architecture', 'music', 'performer', 'tree', 'demonstrator', 'vehicle', 'building', 'concert', 'cherry blossom', 'water', 'musician', 'blossom', 'crowd', 'sakura', 'outfit', 'ensemble', 'text', 'stage', 'slope', 'riverbank', 'animal', 'lake', 'road', and 'sign'.

We perform two experiments to verify the idea of situation recognition and the proposed models. The first is to show that the proposed noise-robust norms are effective in reducing the adverse effect of outliers. The second is to apply the verified models to YFCC100M data and show the evolving situation recognition.

Table 1: Accuracy (in percentage) for label-known data and label-unknown data. GSS is the Semi-supervised Learning Method for Unknown Labels model (Eq. (1)), SSL is the $\ell_1$-norm Semi-supervised Learning Method (Eq. (2)), and SSC is the capped $\ell_p$-norm Semi-supervised Learning Method (Eq. (3)).

Method   Label-known Data   Label-unknown Data
GSS      92.80              91.40
SSL      93.64              97.40
SSC      96.06              95.10
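The restriction to the top-500 concepts mentioned above can be sketched as a column selection on a photo-by-concept matrix. This is only an illustration: ranking concepts by how many photos they are detected in is an assumption (the paper does not state the ranking criterion), and the function name and the layout of `scores` are hypothetical.

```python
import numpy as np

def top_k_concept_features(scores, k):
    """Keep the k most frequently detected concepts as feature columns.

    scores: (n_photos, n_concepts) detector confidences, 0 where a concept is absent.
    Returns the reduced feature matrix and the indices of the kept concepts.
    """
    scores = np.asarray(scores, dtype=float)
    freq = (scores > 0).sum(axis=0)                  # photos per concept
    keep = np.argsort(-freq, kind="stable")[:k]      # most frequent first
    return scores[:, keep], keep

scores = np.array([[0.9, 0.0, 0.2],
                   [0.8, 0.0, 0.0],
                   [0.7, 0.1, 0.3]])
features, kept = top_k_concept_features(scores, k=2)
```

With a long-tailed concept distribution, the discarded columns are those detected in only a few photos, so the reduced matrix stays much smaller while covering most detections.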
We use six situations to demonstrate the idea and the proposed framework: "hanami", "Olympic Games", "protest", "flood", "bluesfest", and "holi". We use the users' tags as situation labels. We randomly select 500 photos for each situation from before 2010 as the labeled data, and then randomly select 3000 photos from after 2010 as testing data. Among the unlabeled data, 500 photos do not belong to any of the predefined situations and act as noise. In total, we use 3000 labeled photos and 3000 unlabeled photos. We present an experimental study of GSS [9], the semi-supervised $\ell_1$-norm method (SSL), and the semi-supervised capped $\ell_p$-norm method (SSC) for situation detection. The proposed methods have parameters $u_i$. For a fair comparison, we use the same parameter setting for the compared methods and the proposed methods. We use one $u$ for the labeled data and another $u$ for the unlabeled data to control the smoothness term and the fitting term of the model. The parameter values considered are $u = (1, \ldots)$, $\theta = (0.\ldots, \ldots, 10)$ and $p = (0.\ldots, \ldots)$.

Table 1 compares the proposed methods with the $\ell_2$-norm based GSS model. For the label-known data, SSL improves the accuracy over GSS from 92.80% to 93.64%, and SSC reaches 96.06%. For the label-unknown data, SSL and SSC reach 97.40% and 95.10%, compared with 91.40% for GSS.

In order to detect the evolving situations, we further analyze the space and time context of the labeled photos output by our clustering method and apply the proposed framework to the data after the year 2011. Fig. 2 shows that we can successfully detect the situation of "hanami" in Japan. Tokyo, Osaka, and Hiroshima are the three most popular places for "hanami". It takes place in late March and early April every year. Similarly, we demonstrate the detection of "holi". "Holi" is an Indian festival. People in the United States and Europe have recently started to celebrate this festival. Fig. 3 shows that "holi" is celebrated not only in India but also widely in the United States and Europe, in March every year.

Figure 2: The Evolving Situation of "Hanami"

Figure 3: The Situation of "Holi" in Space and Time
4. Conclusion
Situation recognition is important in many real-world applications, ranging from disaster response to politics and social happenings. With the explosive growth of user-contributed photos and videos, there is a great opportunity to detect evolving situations in order to respond to them. With geo-tagged photo streams and concept detection now available, it is the right time to explore situation recognition. We investigate the use of visual concepts for situation recognition and develop new methodologies to tackle technical challenges lying in noisy data and existing approaches. Situation recognition will be enriched by developing methods that combine visual concepts with physical sensory data. Real-time situation recognition and predictive analysis are important components of many new challenges facilitated by big data.
References

[1] Y. Wang, M. S. Kankanhalli, Tweeting cameras for event detection, in: WWW, 2015.
[2] X. Jin, A. Gallagher, L. Cao, J. Luo, J. Han, The wisdom of social multimedia: using flickr for prediction and forecast, in: ACM Multimedia, 2010.
[3] N. Hochman, L. Manovich, Zooming into an instagram city: Reading the local through social media, First Monday 18 (7). URL http://firstmonday.org/ojs/index.php/fm/article/view/4711