Thamar Solorio

University of Alabama at Birmingham

Publication

Featured research published by Thamar Solorio.

Proceedings of the 3rd ACM workshop on Artificial intelligence and security | 2010

Lexical feature based phishing URL detection using online learning

Aaron Blum; Brad Wardman; Thamar Solorio; Gary Warner

Phishing is a form of cybercrime where spammed emails and fraudulent websites entice victims to provide sensitive information to the phishers. The acquired sensitive information is subsequently used to steal identities or gain access to money. This paper explores the possibility of utilizing confidence-weighted classification combined with content-based phishing URL detection to produce a dynamic and extensible system for detection of present and emerging types of phishing domains. Our system is capable of detecting emerging threats as they appear and can therefore provide increased protection against zero-hour threats, unlike traditional blacklisting techniques, which function reactively.

empirical methods in natural language processing | 2008

Learning to Predict Code-Switching Points

Thamar Solorio; Yang Liu

Predicting possible code-switching points can help develop more accurate methods for automatically processing mixed-language text, such as multilingual language models for speech recognition systems and syntactic analyzers. We present in this paper exploratory results on learning to predict potential code-switching points in Spanish-English. We trained different learning algorithms using a transcription of code-switched discourse. To evaluate the performance of the classifiers, we used two different criteria: 1) measuring precision, recall, and F-measure of the predictions against the reference in the transcription, and 2) rating the naturalness of artificially generated code-switched sentences. Average scores for the code-switched sentences generated by our machine learning approach were close to the scores of those generated by humans.
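The first evaluation criterion named in the abstract above (precision, recall, and F-measure against the reference transcription) can be illustrated with a minimal sketch. It assumes that switch points are represented as sets of token indices; that representation, the function name `switch_point_prf`, and the sample indices are hypothetical and are not taken from the paper.

```python
def switch_point_prf(predicted, reference):
    """Precision, recall, and F-measure of predicted switch points.

    Both arguments are collections of token indices at which a switch
    is predicted or annotated (an assumed representation).
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: 4 predicted points, 3 annotated points, 2 in common.
p, r, f = switch_point_prf({3, 7, 12, 20}, {3, 12, 15})
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Here two of the four predictions are correct and two of the three annotated points are found, so precision is 1/2, recall is 2/3, and the F-measure is their harmonic mean, 4/7.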

empirical methods in natural language processing | 2008

Part-of-Speech Tagging for English-Spanish Code-Switched Text

Thamar Solorio; Yang Liu

Code-switching is an interesting linguistic phenomenon commonly observed in highly bilingual communities. It consists of mixing languages in the same conversational event. This paper presents results on Part-of-Speech tagging Spanish-English code-switched discourse. We explore different approaches to exploit existing resources for both languages that range from simple heuristics, to language identification, to machine learning. The best results are achieved by training a machine learning algorithm with features that combine the output of an English and a Spanish Part-of-Speech tagger.

international conference on computational linguistics | 2004

A language independent method for question classification

Thamar Solorio; Manuel Pérez-Coutiño; Manuel Montes-y-Gómez; Luis Villaseñor-Pineda; Aurelio López-López

Previous works on question classification are based on complex natural language processing techniques: named entity extractors, parsers, chunkers, etc. While these approaches have proven to be effective they have the disadvantage of being targeted to a particular language. We present here a simple approach that exploits lexical features and the Internet to train a classifier, namely a Support Vector Machine. The main feature of this method is that it can be applied to different languages without requiring major modifications. Experimental results of this method on English, Italian and Spanish show that this approach can be a practical tool for question answering systems, reaching a classification accuracy as high as 88.92%.

north american chapter of the association for computational linguistics | 2015

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution

Upendra Sapkota; Steven Bethard; Manuel Montes; Thamar Solorio

Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character n-grams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.
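As a rough illustration of what grouping character n-grams by affix and punctuation can look like, the sketch below labels each n-gram by where it falls in a word. The six labels, the function name `typed_char_ngrams`, and the sample text are assumptions made for this example; the abstract does not list the paper's actual categories, so this is a simplified stand-in rather than the authors' scheme.

```python
from collections import Counter
import string


def typed_char_ngrams(text, n=3):
    """Count character n-grams tagged with a coarse positional label."""
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if any(ch in string.punctuation for ch in gram):
            label = "punct"        # contains a punctuation mark
        elif any(ch.isspace() for ch in gram):
            label = "space"        # spans a word boundary
        else:
            # Treat the text edges as if they were whitespace.
            before = text[i - 1] if i > 0 else " "
            after = text[i + n] if i + n < len(text) else " "
            starts_word = not before.isalnum()
            ends_word = not after.isalnum()
            if starts_word and ends_word:
                label = "whole-word"
            elif starts_word:
                label = "prefix"
            elif ends_word:
                label = "suffix"
            else:
                label = "mid-word"
        counts[(label, gram)] += 1
    return counts


counts = typed_char_ngrams("walked, talked")
for (label, gram), freq in sorted(counts.items()):
    print(f"{label:10s} {gram!r} {freq}")
```

Keeping the label next to the n-gram means that the same three characters are counted separately when they act as a suffix and when they sit in the middle of a word, which is what allows each group to be evaluated on its own.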

2010 eCrime Researchers Summit | 2010

Authorship attribution of web forum posts

Sangita R. Pillay; Thamar Solorio

Extracting useful information from user generated text on the web is an important ongoing research area in natural language processing, machine learning, and data mining. Online tools like emails, news groups, blogs, and web forums provide an effective communication platform for millions of users around the globe and also provide an added advantage of anonymity. Millions of people post information on different web forums daily. The possibility of exchanging sensitive information between anonymous users on these web forums cannot be ruled out. This document proposes a two-stage approach that combines unsupervised and supervised learning to perform authorship attribution on web forum posts. The first stage uses clustering techniques to group the data sets into stylistically similar clusters. The second stage uses the resulting clusters from stage one as features to train different machine learning classifiers. This two-stage approach is an effort towards reducing the complexity of the classification task and boosting the prediction accuracy.

international conference on computational linguistics | 2006

An unsupervised language independent method of name discrimination using second order co-occurrence features

Ted Pedersen; Anagha Kulkarni; Roxana Angheluta; Zornitsa Kozareva; Thamar Solorio

Previous work by Pedersen, Purandare and Kulkarni (2005) has resulted in an unsupervised method of name discrimination that represents the context in which an ambiguous name occurs using second order co-occurrence features. These contexts are then clustered in order to identify which are associated with different underlying named entities. It also extracts descriptive and discriminating bigrams from each of the discovered clusters in order to serve as identifying labels. These methods have been shown to perform well with English text, although we believe them to be language independent since they rely on lexical features and use no syntactic features or external knowledge sources. In this paper we apply this methodology in exactly the same way to Bulgarian, English, Romanian, and Spanish corpora. We find that it attains discrimination accuracy that is consistently well above that of a majority classifier, thus providing support for the hypothesis that the method is language independent.

workshop on computational approaches to code switching | 2014

Overview for the First Shared Task on Language Identification in Code-Switched Data

Thamar Solorio; Elizabeth Blair; Suraj Maharjan; Steven Bethard; Mona T. Diab; Mahmoud Ghoneim; Abdelati Hawwari; Fahad AlGhamdi; Julia Hirschberg; Alison Chang; Pascale Fung

We present an overview of the first shared task on language identification on code-switched data. The shared task included code-switched data from four language pairs: Modern Standard Arabic-Dialectal Arabic (MSA-DA), Mandarin-English (MAN-EN), Nepali-English (NEP-EN), and Spanish-English (SPA-EN). A total of seven teams participated in the task and submitted 42 system runs. The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs. In contrast, the language pairs with the highest F-measure were SPA-EN and NEP-EN. The task made evident that language identification in code-switched data is still far from solved and warrants further research.

north american chapter of the association for computational linguistics | 2009

A Corpus-Based Approach for the Prediction of Language Impairment in Monolingual English and Spanish-English Bilingual Children

Keyur Gabani; Melissa Sherman; Thamar Solorio; Yang Liu; Lisa M. Bedore; Elizabeth D. Peña

In this paper we explore a learning-based approach to the problem of predicting language impairment in children. We analyzed spontaneous narratives of children and extracted features measuring different aspects of language including morphology, speech fluency, language productivity and vocabulary. Then, we evaluated a learning-based approach and compared its predictive accuracy against a method based on language models. Empirical results on monolingual English-speaking children and bilingual Spanish-English speaking children show the learning-based approach is a promising direction for automatic language assessment.

atlantic web intelligence conference | 2004

Toward a Document Model for Question Answering Systems

Manuel Pérez-Coutiño; Thamar Solorio; Manuel Montes-y-Gómez; Aurelio López-López; Luis Villaseñor-Pineda

The problem of acquiring valuable information from the large amounts available today in electronic media requires automated mechanisms that are more natural and efficient than existing ones. Information retrieval systems are evolving toward systems capable of answering specific questions formulated by the user in her/his own language. The expected answers from such systems are short and accurate sentences, instead of large document lists. On the other hand, the state of the art of these systems is focused mainly on the resolution of factual questions, whose answers are named entities (dates, quantities, proper nouns, etc.). This paper proposes a model to represent source documents that are then used by question answering systems. The model is based on a representation of a document as a set of named entities (NEs) and their local lexical context. These NEs are extracted and classified automatically by an off-line process. The entities are then taken as instance concepts in an upper ontology and stored as a set of DAML+OIL resources which could be used later by question answering engines. The paper presents a case study with a news collection in Spanish and some preliminary results.
