Younes Samih | Researchain

Publication

Featured research published by Younes Samih.

Workshop on Computational Approaches to Code Switching | 2016

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks

Younes Samih; Suraj Maharjan; Mohammed Attia; Laura Kallmeyer; Thamar Solorio

This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both Spanish-English and MSA-Egyptian dialect. The system uses word-level and character-level representations to identify code-switching. For the MSA-Egyptian dialect, the system does not rely on any language-specific knowledge or linguistic resources, such as part-of-speech (POS) taggers, morphological analyzers, gazetteers or word lists, to obtain state-of-the-art performance.

Proceedings of the Third Arabic Natural Language Processing Workshop | 2017

A Neural Architecture for Dialectal Arabic Segmentation

Younes Samih; Mohammed Attia; Mohamed Eldesouki; Ahmed Abdelali; Hamdy Mubarak; Laura Kallmeyer; Kareem Darwish

The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmentation of words into their constituent tokens is an important processing step for natural language processing. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks without any normalization or reliance on lexical features or linguistic resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that heavily depend on additional resources.
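Treating segmentation as character-level sequence labeling can be illustrated with a minimal sketch. It converts a segmented word into one label per character and back again. The two-tag scheme (`B` for a character that begins a segment, `I` for any other character), the `+` separator, and the function names are assumptions chosen for illustration; they are not taken from the paper, and the neural model that would predict the labels is omitted.

```python
def to_char_labels(segmented, sep="+"):
    """Turn a segmented word such as 'w+b+ktAb' into (char, label) pairs.

    'B' marks the first character of a segment, 'I' every other character.
    """
    pairs = []
    for segment in segmented.split(sep):
        for i, ch in enumerate(segment):
            pairs.append((ch, "B" if i == 0 else "I"))
    return pairs


def from_char_labels(pairs, sep="+"):
    """Rebuild the segmented string from (char, label) pairs."""
    out = []
    for ch, label in pairs:
        # A 'B' label after the first character opens a new segment.
        if label == "B" and out:
            out.append(sep)
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    labeled = to_char_labels("w+b+ktAb")
    print(labeled)
    print(from_char_labels(labeled))
```

A sequence labeler trained on such pairs only has to predict one tag per character, which is why no lexicon or normalization step is strictly required by this formulation.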

Natural Language Engineering | 2016

Arabic spelling error detection and correction

Mohammed Attia; Pavel Pecina; Younes Samih; Khaled Shaalan; Josef van Genabith

A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.
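The interplay of the three components can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' system: the dictionary is a small hypothetical word list, the error model is plain Levenshtein distance rather than the paper's re-ranker, and the language model is reduced to unigram counts. The names `edit_distance`, `correct`, and `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct(word, unigram_counts, max_distance=2):
    """Return the best dictionary candidate for word, or word itself.

    unigram_counts doubles as the dictionary (its keys) and as a crude
    language model (its values). Candidates are ranked by edit distance
    first and by descending count second.
    """
    if word in unigram_counts:
        return word  # detection step: known words are left alone
    candidates = []
    for entry, count in unigram_counts.items():
        distance = edit_distance(word, entry)
        if distance <= max_distance:
            candidates.append((distance, -count, entry))
    if not candidates:
        return word
    return min(candidates)[2]


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = {"book": 50, "boot": 5, "back": 20}
    print(correct("bouk", counts))
    print(correct("boox", counts))
```

Each component can be improved independently here: a larger dictionary changes which words are flagged, a better error model changes the candidate ranking, and a cleaner language model changes how ties are broken.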

Conference on Computational Natural Language Learning | 2017

Learning from Relatives: Unified Dialectal Arabic Segmentation

Younes Samih; Mohamed Eldesouki; Mohammed Attia; Kareem Darwish; Ahmed Abdelali; Hamdy Mubarak; Laura Kallmeyer

Arabic dialects not only share a common koine but also exhibit shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is contingent on geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.

Workshop on Computational Approaches to Code Switching | 2016

SAWT: Sequence Annotation Web Tool

Younes Samih; Wolfgang Maier; Laura Kallmeyer

We present SAWT, a web-based tool for the annotation of token sequences with an arbitrary set of labels. The key properties of the tool are simplicity and ease of use for both annotators and administrators. SAWT runs in any modern browser, including browsers on mobile devices, and has only minimal server-side requirements.

Language Resources and Evaluation | 2012