Xiaoyin Che
Hasso Plattner Institute
Publications
Featured research published by Xiaoyin Che.
acm multimedia | 2013
Xiaoyin Che; Haojin Yang; Christoph Meinel
In this paper we propose a solution which segments lecture videos by analyzing their supplementary synchronized slides. The slide content is derived automatically through an OCR (Optical Character Recognition) process with an approximate accuracy of 90%. We then partition the slides into different subtopics by examining their logical relevance. Since the slides are synchronized with the video stream, the subtopics of the slides indicate exactly the segments of the video. Our evaluation reveals that the average segment length per lecture ranges from 5 to 15 minutes, and 45% of the segments obtained from the test datasets are logically reasonable.
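A minimal sketch of the grouping step, assuming OCR'd slide titles and their timestamps are already available: consecutive slides whose titles share enough words are merged into one subtopic, and each subtopic's first timestamp marks a video segment. The Jaccard measure and the threshold are illustrative choices, not the paper's exact logical-relevance model.

```python
# Hypothetical sketch: group synchronized slides into subtopics by the lexical
# overlap of their OCR'd titles, then map subtopic boundaries to video segments.

def jaccard(a: set, b: set) -> float:
    """Word-set overlap between two slide titles."""
    return len(a & b) / len(a | b) if a | b else 0.0

def segment_slides(slides, threshold=0.3):
    """slides: list of (start_time_sec, ocr_title) in playback order.
    Returns a list of (segment_start_time, [titles]) subtopics."""
    segments = []
    for start, title in slides:
        words = set(title.lower().split())
        if segments and jaccard(words, segments[-1]["words"]) >= threshold:
            segments[-1]["titles"].append(title)
            segments[-1]["words"] |= words
        else:
            segments.append({"start": start, "titles": [title], "words": set(words)})
    return [(s["start"], s["titles"]) for s in segments]

if __name__ == "__main__":
    demo = [(0, "TCP Handshake"), (180, "TCP Handshake Details"),
            (420, "IP Routing"), (600, "IP Routing Tables")]
    for start, titles in segment_slides(demo):
        print(f"segment @ {start}s: {titles}")
```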
International Journal of Information and Education Technology | 2016
Xiaoyin Che; Sheng Luo; Cheng Wang; Christoph Meinel
“Internetworking with TCP/IP” is a massive open online course (MOOC) offered by the Germany-based MOOC platform “openHPI”, which has been taught in German, English and, recently, Chinese, with similar content. In this paper, the authors, who worked jointly as the teacher and teaching assistants of this course, share ideas derived from their daily teaching experience, an analysis of course statistics, a comparison of performance across the different language versions, and feedback from user questionnaires. Additionally, our motivation, attempts and suggestions regarding MOOC localization are also discussed.
conference of the international speech communication association | 2016
Xiaoyin Che; Sheng Luo; Haojin Yang; Christoph Meinel
In this paper we propose a solution that detects sentence boundaries in speech transcripts. First we train a purely lexical model with a deep neural network, which takes word vectors as its only input feature. A simple acoustic model is also prepared. Because the models work independently, they can be trained with different data. In the next step, the posterior probabilities of both the lexical and the acoustic model are fed into a heuristic 2-stage joint decision scheme to classify sentence boundary positions. This approach ensures that the models can be updated or switched freely in actual use. Evaluation on TED Talks shows that the proposed lexical model achieves good results: 75.5% accuracy on ASR transcripts containing recognition errors and 82.4% on error-free manual references. The joint decision scheme can further improve accuracy by 3-10% when acoustic data is available.
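An illustrative sketch of how such a two-stage joint decision could look, assuming each candidate position carries a lexical posterior and, optionally, an acoustic posterior. The thresholds and the simple averaging used here are stand-ins for demonstration; the paper's exact scheme may differ.

```python
def decide_boundary(p_lexical, p_acoustic=None,
                    accept=0.8, reject=0.2, joint_threshold=0.5):
    """Return True if the position is classified as a sentence boundary.

    Stage 1: trust the lexical model alone when it is confident.
    Stage 2: for uncertain cases, average in the acoustic posterior
             (e.g. derived from pause duration) if it is available.
    """
    if p_lexical >= accept:
        return True
    if p_lexical <= reject:
        return False
    if p_acoustic is None:          # no audio available: fall back to lexical
        return p_lexical >= joint_threshold
    return 0.5 * (p_lexical + p_acoustic) >= joint_threshold

if __name__ == "__main__":
    print(decide_boundary(0.9))        # confident lexical model -> True
    print(decide_boundary(0.5, 0.7))   # uncertain, acoustic helps -> True
    print(decide_boundary(0.4, 0.1))   # uncertain, acoustic rejects -> False
```

Because the two models only meet in this decision step, either one can be retrained or swapped out without touching the other, which is the flexibility the abstract emphasizes.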
international conference on web-based learning | 2015
Xiaoyin Che; Haojin Yang; Christoph Meinel
In this paper, we propose an automated adaptive solution that generates a logical, accurate and detailed tree-structured outline for video-based online lectures by extracting the attached slides and reconstructing their content. The proposed solution begins with slide-transition detection and optical character recognition, and then proceeds with a static method that analyzes the layout of each single slide and the logical relations within the slide series. Features of the slide series under processing, such as a fixed title position, are identified and applied in adaptive rounds to improve the outline quality. Our experiments show that the overall accuracy of the final lecture outline reaches 85%, which is about 13% higher than that of the static method.
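A simplified sketch of turning OCR'd slide lines into a tree-structured outline, assuming each line comes with its left indentation in pixels and that the first line sits at the (learned) fixed title position. The real adaptive layout analysis in the paper is considerably richer; this only shows the tree-building idea.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)

def build_outline(slide_lines, indent_step=40):
    """slide_lines: list of (indent_px, text) for one slide, title first.
    Returns the root Node of the outline tree."""
    root = Node(slide_lines[0][1])            # fixed title position -> root
    stack = [(0, root)]                       # (level, node)
    for indent, text in slide_lines[1:]:
        level = 1 + indent // indent_step     # deeper indentation -> deeper level
        node = Node(text)
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root

def print_outline(node, depth=0):
    print("  " * depth + node.text)
    for child in node.children:
        print_outline(child, depth + 1)

if __name__ == "__main__":
    lines = [(0, "Routing"), (40, "Static routing"), (80, "Pros"),
             (80, "Cons"), (40, "Dynamic routing")]
    print_outline(build_outline(lines))
```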
international conference on artificial neural networks | 2016
Sheng Luo; Haojin Yang; Cheng Wang; Xiaoyin Che; Christoph Meinel
With the rapid increase in the number of surveillance cameras, the amount of surveillance video is growing quickly. How to automatically and efficiently recognize semantic actions and events in surveillance videos therefore becomes an important problem. In this paper, we investigate state-of-the-art Deep Learning (DL) approaches for human action recognition and propose an improved two-stream ConvNets architecture for this task. In particular, we propose to use the Motion History Image (MHI) as the motion representation for training the temporal ConvNet, which achieves impressive results in both accuracy and recognition speed. In our experiments, we conducted an in-depth study of important network options and compared our approach with the latest deep networks for action recognition. The detailed evaluation results show the superior ability of the proposed approach, which achieves state-of-the-art performance in the surveillance video context.
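A rough sketch of computing a Motion History Image from a clip, which is the kind of motion representation fed to the temporal ConvNet. The frame-difference threshold and history duration below are illustrative defaults, and the ConvNet itself is omitted.

```python
import cv2
import numpy as np

def motion_history_image(video_path, duration=0.5, diff_thresh=32):
    """Return the final MHI of the clip as a float32 image in [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    ok, frame = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mhi = np.zeros(prev.shape, np.float32)
    t = 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t += 1.0 / fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        motion = cv2.absdiff(gray, prev) > diff_thresh
        prev = gray
        mhi[motion] = t                                  # stamp moving pixels with current time
        mhi[(~motion) & (mhi < t - duration)] = 0.0      # forget motion older than `duration`
    cap.release()
    return np.clip((mhi - (t - duration)) / duration, 0.0, 1.0)

# Example: mhi = motion_history_image("clip.avi"); cv2.imwrite("mhi.png", mhi * 255)
```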
international conference on multimedia retrieval | 2015
Haojin Yang; Cheng Wang; Xiaoyin Che; Sheng Luo; Christoph Meinel
In this paper we showcase a system for real-time text detection and recognition. We apply deep features created by Convolutional Neural Networks (CNNs) for both the text detection and the word recognition task. For text detection we follow the common localization-verification scheme, which has already shown excellent results in numerous previous works. In the text localization stage, textual regions are roughly detected by an MSER (Maximally Stable Extremal Regions) detector with a high recall rate. False alarms are then eliminated by a CNN classifier, and the remaining text regions are further grouped into words. In the word recognition stage, we developed a skeleton-based text binarization method for segmenting text from its background. A CNN-based recognizer is then applied to recognize the characters. Initial experiments show the strong ability of deep features for text classification compared with commonly used visual features. Our current implementation achieves real-time performance for recognizing scene text on a standard PC with a webcam.
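A minimal sketch of the localization-verification idea: a high-recall MSER detector proposes candidate regions, and a verifier removes false alarms. The CNN verifier is replaced here by a placeholder aspect-ratio heuristic; in the paper a trained CNN scores each candidate patch instead.

```python
import cv2

def detect_text_candidates(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)       # boxes: one (x, y, w, h) per region
    return gray, boxes

def looks_like_text(gray, box):
    """Placeholder verifier: keep regions with a plausible size and aspect ratio.
    A CNN classifier over the cropped patch would be used here instead."""
    x, y, w, h = box
    return h > 8 and 0.1 < w / float(h) < 10.0

def detect_text(image_path):
    gray, boxes = detect_text_candidates(image_path)
    return [tuple(b) for b in boxes if looks_like_text(gray, b)]

# Example: for x, y, w, h in detect_text("scene.jpg"): print(x, y, w, h)
```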
conference on multimedia modeling | 2015
Cheng Wang; Haojin Yang; Xiaoyin Che; Christoph Meinel
In this paper, we propose a concept-based multimodal learning model (CMLM) for generating document topics by modeling textual and visual data. Our model considers cross-modal concept similarity and unlabeled image concepts, and it is capable of processing documents with a missing modality. The model can extract semantic concepts from unlabeled images and combine them with the text modality to generate document topics. Our comparison experiments on news document topic generation show that, in the multimodal scenario, CMLM generates more representative topics for a given document than latent Dirichlet allocation (LDA) based topics.
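A toy sketch of the cross-modal similarity idea, assuming image concepts (e.g. classifier labels) and document words share an embedding space, so that cosine similarity can pull visual concepts into the textual topic. The tiny hand-made vectors below are stand-ins; a real system would use pre-trained embeddings and the full CMLM model.

```python
import numpy as np

# Hand-made stand-in embeddings; politics-like and sports-like words cluster together.
embed = {
    "election":   np.array([1.0, 0.9, 0.0]),
    "vote":       np.array([0.9, 1.0, 0.1]),
    "parliament": np.array([0.8, 0.7, 0.2]),
    "goal":       np.array([0.0, 0.1, 1.0]),
    "stadium":    np.array([0.1, 0.0, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_topic(doc_words, image_concepts, sim_thresh=0.9):
    """Add image concepts that are similar enough to any document word."""
    topic = set(doc_words)
    for concept in image_concepts:
        if concept in embed and any(
            cosine(embed[concept], embed[w]) > sim_thresh
            for w in doc_words if w in embed
        ):
            topic.add(concept)
    return topic

# "parliament" joins the topic, "stadium" does not.
print(expand_topic(["election", "vote"], ["parliament", "stadium"]))
```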
IEEE Transactions on Learning Technologies | 2018
Xiaoyin Che; Haojin Yang; Christoph Meinel
Textbook highlighting is widely considered to be beneficial for students. In this paper, we propose a comprehensive solution to highlight online lecture videos at both the sentence and the segment level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts, and slides, in order to support online learners in this era of e-learning, especially with MOOCs. Sentence-level lecture highlighting mainly uses acoustic features from the audio, and the output is embedded in the subtitle files of the corresponding MOOC videos. Compared with ground truth created by experts, the precision is over 60 percent, which is better than the baseline works and was also welcomed in user feedback. Segment-level lecture highlighting, on the other hand, works with statistical analysis, mainly by exploring the speech transcripts, the lecture slides and their connections. With ground truth created by a large number of users, an evaluation shows that the overall accuracy can reach 70 percent, which is fairly promising. Finally, we also attempt to find potential correlations between these two types of lecture highlights.
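A simplified sketch of sentence-level highlighting: rank subtitle sentences by an acoustic prominence score (here just RMS energy over the corresponding audio span) and mark the top fraction as highlights. The real system uses a richer acoustic feature set and writes the marks back into the MOOC subtitle files; the feature choice and the 30% cut-off below are assumptions.

```python
import numpy as np

def rms_energy(samples):
    return float(np.sqrt(np.mean(np.square(samples)))) if len(samples) else 0.0

def highlight_sentences(audio, sample_rate, sentences, top_fraction=0.3):
    """audio: 1-D numpy array; sentences: list of (start_sec, end_sec, text).
    Returns (text, is_highlight) pairs in the original order."""
    scores = [
        rms_energy(audio[int(s * sample_rate):int(e * sample_rate)])
        for s, e, _ in sentences
    ]
    cutoff = sorted(scores, reverse=True)[max(0, int(len(scores) * top_fraction) - 1)]
    return [(text, score >= cutoff) for (_, _, text), score in zip(sentences, scores)]

if __name__ == "__main__":
    sr = 16000
    audio = np.concatenate([0.1 * np.ones(sr), 0.8 * np.ones(sr), 0.2 * np.ones(sr)])
    sents = [(0, 1, "intro"), (1, 2, "key point"), (2, 3, "aside")]
    print(highlight_sentences(audio, sr, sents))   # only "key point" is highlighted
```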
international conference on advanced learning technologies | 2017
Xiaoyin Che; Sheng Luo; Haojin Yang; Christoph Meinel
In this paper we propose an integrated framework for automatic bilingual subtitle generation for lecture videos, especially for MOOCs. The framework consists of Automatic Speech Recognition (ASR), Sentence Boundary Detection (SBD), and Machine Translation (MT). We then quantitatively evaluate the auto-generated subtitles, subtitles produced manually from scratch, and auto-generated subtitles with manual modification, in terms of accuracy and time expenditure, in both the original and the target language. The results show that the auto-generated subtitles in the original language (English) are already fairly accurate. By using them as a draft, human subtitle producers can save 54% of their working time and simultaneously reduce the error rate by 54.3%, which is a significant improvement. However, the effectiveness of machine-translated subtitles (English to Chinese) is limited. Overall, if the proposed framework is applied, the total working time for preparing bilingual subtitles can be shortened by approximately one third, with no decline in quality.
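A schematic sketch of the ASR -> SBD -> MT plumbing and the SRT output it would produce. The three stage functions are placeholders standing in for real engines (an ASR service, a sentence-boundary detector such as the one sketched earlier, and an MT system); the pause threshold and the "[zh]" marker are purely illustrative.

```python
def asr(audio_path):
    """Placeholder ASR: return (word, start_sec, end_sec) tuples."""
    return [("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("bye", 1.2, 1.5)]

def sbd(words):
    """Placeholder SBD: split the word stream into sentences at long pauses."""
    sentences, current = [], []
    for i, (w, s, e) in enumerate(words):
        current.append((w, s, e))
        next_gap = words[i + 1][1] - e if i + 1 < len(words) else None
        if next_gap is None or next_gap > 0.25:
            sentences.append(current)
            current = []
    return sentences

def mt(sentence_text):
    """Placeholder MT: a real engine would translate English to Chinese."""
    return f"[zh] {sentence_text}"

def to_srt(sentences):
    """Render bilingual SRT entries: English line plus translated line."""
    def fmt(t):
        ms = int(round(t * 1000))
        return f"{ms // 3600000:02d}:{ms // 60000 % 60:02d}:{ms // 1000 % 60:02d},{ms % 1000:03d}"
    lines = []
    for i, sent in enumerate(sentences, 1):
        text = " ".join(w for w, _, _ in sent)
        lines += [str(i), f"{fmt(sent[0][1])} --> {fmt(sent[-1][2])}", text, mt(text), ""]
    return "\n".join(lines)

print(to_srt(sbd(asr("lecture.mp4"))))
```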
international conference on neural information processing | 2016
Sheng Luo; Haojin Yang; Cheng Wang; Xiaoyin Che; Christoph Meinel
The explosive growth of surveillance cameras and their round-the-clock recording produce massive amounts of surveillance video data. How to efficiently retrieve the rare but important events inside these videos is therefore an urgent problem. Recently, deep convolutional networks have shown outstanding performance for event recognition on general videos. We therefore study the characteristics of the surveillance video context and propose a very competitive ConvNets approach for real-time event recognition on surveillance videos. Our approach adopts two-stream ConvNets to recognize the spatial and temporal information of an action, respectively. In particular, we propose to use fast feature cascades and Motion History Images as the templates for the spatial and temporal streams. We conducted our experiments on the UCF-ARG and UT-Interaction datasets. The experimental results show that our approach achieves superior recognition accuracy and runs in real time.
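A hedged sketch of the cascade idea for the spatial stream: a pre-trained OpenCV Haar cascade localizes the person, and the crop (rather than the full surveillance frame) would be fed to the spatial ConvNet, with an MHI (see the earlier sketch) feeding the temporal stream. The specific cascade file is an illustrative choice and may differ from the fast feature cascades used in the paper.

```python
import cv2

def person_crops(frame, cascade_path=cv2.data.haarcascades + "haarcascade_fullbody.xml"):
    """Return cropped person regions from one BGR surveillance frame."""
    cascade = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]

# Example: crops = person_crops(cv2.imread("frame.jpg"))
```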