Tsung-Yu Lin
University of Massachusetts Amherst
Publications
Featured research published by Tsung-Yu Lin.
International Conference on Computer Vision | 2015
Tsung-Yu Lin; Aruni RoyChowdhury; Subhransu Maji
We propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using the outer product at each location of the image and pooled to obtain an image descriptor. This architecture can model local pairwise feature interactions in a translationally invariant manner, which is particularly useful for fine-grained categorization. It also generalizes various orderless texture descriptors such as the Fisher vector, VLAD and O2P. We present experiments with bilinear models where the feature extractors are based on convolutional neural networks. The bilinear form simplifies gradient computation and allows end-to-end training of both networks using image labels only. Using networks initialized from the ImageNet dataset followed by domain-specific fine-tuning, we obtain 84.1% accuracy on the CUB-200-2011 dataset, requiring only category labels at training time. We present experiments and visualizations that analyze the effects of fine-tuning and the choice of the two networks on the speed and accuracy of the models. Results show that the architecture compares favorably to the existing state of the art on a number of fine-grained datasets while being substantially simpler and easier to train. Moreover, our most accurate model is fairly efficient, running at 8 frames/sec on an NVIDIA Tesla K40 GPU. The source code for the complete system will be made available at http://vis-www.cs.umass.edu/bcnn.
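A minimal PyTorch sketch of the bilinear pooling described in the abstract is shown below. The class name, the use of two truncated ImageNet VGG-16 streams, and the normalization constants are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class BilinearCNN(nn.Module):
    """Sketch of a bilinear CNN: two feature extractors whose local features
    are combined with an outer product at each location and pooled."""
    def __init__(self, num_classes):
        super().__init__()
        # Two streams; here both are truncated VGG-16 conv stacks (an assumption;
        # the paper also evaluates asymmetric pairs of networks).
        self.stream_a = models.vgg16(weights="IMAGENET1K_V1").features[:-1]
        self.stream_b = models.vgg16(weights="IMAGENET1K_V1").features[:-1]
        self.fc = nn.Linear(512 * 512, num_classes)

    def forward(self, x):
        fa = self.stream_a(x)                          # (N, 512, H, W)
        fb = self.stream_b(x)                          # (N, 512, H, W)
        n, ca, h, w = fa.shape
        cb = fb.shape[1]
        fa = fa.reshape(n, ca, h * w)
        fb = fb.reshape(n, cb, h * w)
        # Outer product at each location, averaged over locations.
        phi = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)   # (N, ca, cb)
        phi = phi.reshape(n, ca * cb)
        # Signed square-root and L2 normalization of the pooled descriptor.
        phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-10)
        phi = nn.functional.normalize(phi)
        return self.fc(phi)
```

Because every operation above is differentiable, the whole model can be trained end-to-end from image labels alone, which is the property the abstract highlights.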
Computer Vision and Pattern Recognition | 2016
Tsung-Yu Lin; Subhransu Maji
A number of recent approaches have used deep convolutional neural networks (CNNs) to build texture representations. Nevertheless, it is still unclear how these models represent texture and what invariances to categorical variations they capture. This work conducts a systematic evaluation of recent CNN-based texture descriptors for recognition and attempts to understand the nature of invariances captured by these representations. First, we show that the recently proposed bilinear CNN model [25] is an excellent general-purpose texture descriptor and compares favorably to other CNN-based descriptors on various texture and scene recognition benchmarks. The model is translationally invariant and, without requiring spatial jittering of the data, obtains better accuracy on the ImageNet dataset than corresponding models trained with spatial jittering. Based on recent work [13, 28], we propose a technique to visualize pre-images, providing a means for understanding the categorical properties that are captured by these representations. Finally, we show preliminary results on how a unified parametric model of texture analysis and synthesis can be used for attribute-based image manipulation, e.g., to make an image more swirly, honeycombed, or knitted. The source code and additional visualizations are available at http://vis-www.cs.umass.edu/texture.
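As a rough illustration of pre-image visualization, the sketch below optimizes an image so that its bilinear statistics match those of a target image. It uses a single-stream, single-layer simplification with a plain pixel parameterization; the paper's actual procedure differs (and the function names here are hypothetical).

```python
import torch

def bilinear_features(img, net):
    """Gram-style bilinear statistics of conv features for one image
    (a single-stream simplification of the symmetric B-CNN descriptor)."""
    f = net(img)                                   # (1, C, H, W)
    _, c, h, w = f.shape
    f = f.reshape(c, h * w)
    return (f @ f.t()) / (h * w)                   # (C, C)

def invert(target_img, net, steps=500, lr=0.05):
    """Find a pre-image whose bilinear statistics match the target's."""
    with torch.no_grad():
        target = bilinear_features(target_img, net)
    x = torch.rand_like(target_img, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(bilinear_features(x, net), target)
        loss.backward()
        opt.step()
        x.data.clamp_(0, 1)                        # keep pixels in a valid range
    return x.detach()

# Illustrative usage: net could be a truncated conv stack, e.g. the first
# blocks of torchvision's VGG-16 features, applied to a (1, 3, H, W) tensor.
```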
Workshop on Applications of Computer Vision | 2016
Aruni Roy Chowdhury; Tsung-Yu Lin; Subhransu Maji; Erik G. Learned-Miller
The recent explosive growth in convolutional neural network (CNN) research has produced a variety of new architectures for deep learning. One intriguing new architecture is the bilinear CNN (B-CNN), which has shown dramatic performance gains on certain fine-grained recognition problems [15]. We apply this new CNN to the challenging new face recognition benchmark, the IARPA Janus Benchmark A (IJB-A) [12]. It features faces from a large number of identities in challenging real-world conditions. Because the face images were not identified automatically using a computerized face detection system, it does not have the bias inherent in such a database. We demonstrate the performance of the B-CNN model beginning from an AlexNet-style network pre-trained on ImageNet. We then show results for fine-tuning using a moderate-sized and public external database, FaceScrub [17]. We also present results with additional fine-tuning on the limited training data provided by the protocol. In each case, the fine-tuned bilinear model shows substantial improvements over the standard CNN. Finally, we demonstrate how a standard CNN pre-trained on a large face database, the recently released VGG-Face model [20], can be converted into a B-CNN without any additional feature training. This B-CNN improves upon the CNN performance on the IJB-A benchmark, achieving 89.5% rank-1 recall.
Conference on Multimedia Modeling | 2011
Tung-Ying Lee; Tsung-Yu Lin; Szu-Hao Huang; Shang-Hong Lai; Shang-Chih Hung
In a network of cameras, people localization is an important issue. Traditional methods rely on camera calibration and combine background subtraction results from different views to locate people in three-dimensional space. Previous methods usually solve the localization problem iteratively based on background subtraction results, neglecting high-level image information. In order to fully exploit the image information, we propose incorporating human detection into multi-camera video surveillance. We develop a novel method that combines human detection and background subtraction for multi-camera human localization using convex optimization. This convex optimization problem is independent of the image size; in fact, the problem size depends only on the number of locations of interest on the ground plane. Experimental results show that this combination performs better than background subtraction-based methods and demonstrate the advantage of combining these two complementary types of information.
International Conference on Image Processing | 2014
Tsung-Yu Lin; Tyng-Luh Liu
Recent advances in tackling large-scale computer vision problems have encouraged the use of extremely high-dimensional descriptors to encode image data. Under such a setting, we focus on how to efficiently carry out similarity search using binary codes. We observe that most popular high-dimensional descriptors induce feature vectors with an implicit 2-D structure, and we exploit this property to reduce the computational cost and complexity. Specifically, our method generalizes the Iterative Quantization (ITQ) framework to handle extremely high-dimensional data in two steps. First, we restrict the dimensionality-reduction projection to a block-diagonal form and determine it by independently solving several moderate-size PCA sub-problems. Second, we replace the full rotation in ITQ with a bilinear rotation to improve efficiency in both training and testing. Our experimental results on a large-scale dataset and comparisons with a state-of-the-art technique are promising.
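A NumPy sketch of the bilinear-rotation idea follows, under the assumption that the alternating updates mirror standard ITQ (binarize, then solve two orthogonal Procrustes problems, one per rotation); the function name and defaults are illustrative, and the block-diagonal PCA projection is assumed to have been applied to the input already.

```python
import numpy as np

def bilinear_itq_codes(X, d1, d2, iters=50, seed=0):
    """Binary codes via a bilinear rotation: each (d1*d2)-dim descriptor is
    reshaped into a d1 x d2 matrix and rotated by two small orthogonal
    matrices R1, R2 instead of one full (d1*d2) x (d1*d2) rotation.
    X is assumed to be already projected (e.g. by block-diagonal PCA)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X.reshape(n, d1, d2)
    # Random orthogonal initializations for the two small rotations.
    R1 = np.linalg.qr(rng.standard_normal((d1, d1)))[0]
    R2 = np.linalg.qr(rng.standard_normal((d2, d2)))[0]
    for _ in range(iters):
        B = np.sign(R1.T @ V @ R2)                       # current binary codes
        # Update R1 with R2 fixed: orthogonal Procrustes on sum_i (V_i R2) B_i^T.
        U, _, Vt = np.linalg.svd(np.einsum('nij,nkj->ik', V @ R2, B))
        R1 = U @ Vt
        # Update R2 with R1 fixed: orthogonal Procrustes on sum_i (R1^T V_i)^T B_i.
        U, _, Vt = np.linalg.svd(np.einsum('nji,njk->ik', R1.T @ V, B))
        R2 = U @ Vt
    return (R1.T @ V @ R2).reshape(n, d1 * d2) > 0       # boolean codes
```

The efficiency gain is that the learned parameters are two small rotations of size d1 x d1 and d2 x d2 rather than a single dense rotation over the full descriptor dimension.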
British Machine Vision Conference | 2017
Tsung-Yu Lin; Subhransu Maji
Archive | 2015
Aruni RoyChowdhury; Tsung-Yu Lin; Subhransu Maji; Erik G. Learned-Miller
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2018
Tsung-Yu Lin; Aruni RoyChowdhury; Subhransu Maji
arXiv: Computer Vision and Pattern Recognition | 2015
Tsung-Yu Lin; Aruni RoyChowdhury; Subhransu Maji
arXiv: Computer Vision and Pattern Recognition | 2018
Tsung-Yu Lin; Subhransu Maji; Piotr Koniusz