Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Christan Grant is active.

Publication


Featured research published by Christan Grant.


Proceedings of the Second Workshop on Data Analytics in the Cloud | 2013

GPText: Greenplum parallel statistical text analysis framework

Kun Li; Christan Grant; Daisy Zhe Wang; Sunny Khatri; George Chitouras

Many companies keep large amounts of text data inside relational databases. Several challenges exist in using state-of-the-art systems to perform analysis on such datasets. First, expensive data transfer costs must be paid up front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this paper, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib, an open-source library for scalable in-database analytics that can be installed on PostgreSQL and Greenplum. In addition, GPText contributes a linear-chain conditional random field (CRF) module to MADlib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an eDiscovery application built on the GPText framework.
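
The CRF module mentioned above powers sequence-labeling tasks such as part-of-speech tagging and named entity recognition. As a rough illustration of what such a decoder computes (a minimal sketch with made-up scores and labels, not the MADlib implementation), the following Python snippet runs Viterbi decoding over a linear-chain model:

```python
import numpy as np

def viterbi(emission, transition):
    """Return the highest-scoring label sequence for a linear-chain model.

    emission:   (T, L) array, score of label l at position t
    transition: (L, L) array, score of moving from label i to label j
    """
    T, L = emission.shape
    score = np.empty((T, L))
    back = np.zeros((T, L), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        # candidate[i, j]: best score ending in label i at t-1, then taking label j at t
        candidate = score[t - 1][:, None] + transition + emission[t][None, :]
        back[t] = candidate.argmax(axis=0)
        score[t] = candidate.max(axis=0)
    # Follow back-pointers from the best final label.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 4 tokens, 3 invented labels, random scores.
labels = ["NOUN", "VERB", "OTHER"]
rng = np.random.default_rng(0)
print([labels[i] for i in viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))])
```

In GPText the corresponding inference runs natively inside the MPP database over many documents in parallel rather than in client-side Python, but the per-sequence dynamic program is the same.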


Conference on Information and Knowledge Management | 2012

MADden: query-driven statistical text analytics

Christan Grant; Joir-dan Gumbs; Kun Li; Daisy Zhe Wang; George Chitouras

In many domains, structured data and unstructured text are both important resources for data analysis. Statistical text analysis must be performed over text data to extract structured information for further query processing. Typically, developers need to connect multiple tools and build offline batch processes to perform text analytic tasks. MADden is an integrated system developed for relational database systems such as PostgreSQL and Greenplum for real-time, ad hoc query processing over structured and unstructured data. MADden implements four important text analytic functions that we have contributed to the MADlib open-source library for textual analytics. In this demonstration, we show the capability of the MADden text analytic library using computational journalism as the driving application, with real-time declarative query processing over multiple data sources containing both structured and text information.
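
As a sketch of the query-driven style this enables, the snippet below issues one SQL statement that mixes a relational predicate with an in-database text analytic function. The connection string, table, columns, and the entity_extract UDF are hypothetical stand-ins, not the actual MADden/MADlib API:

```python
import psycopg2

# Hypothetical schema and UDF; the point is that extraction runs inside the
# database, in the same statement as the relational filter, instead of in a
# separate offline batch pipeline.
conn = psycopg2.connect("dbname=news user=analyst")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT author, entity_extract(body) FROM articles WHERE section = %s",
        ("politics",),
    )
    for author, entities in cur.fetchall():
        print(author, entities)
conn.close()
```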


Information Integration and Web-based Applications & Services | 2010

Morpheus: a deep web question answering system

Christan Grant; Clint P. George; Joir-dan Gumbs; Joseph N. Wilson; Peter J. Dobbins

When users search the deep web, the essence of their search is often found in a previously answered query. The Morpheus question answering system reuses prior searches to answer similar user queries. Queries are represented in a semistructured format that contains query terms and referenced classes within a specific ontology. Morpheus answers questions by using methods from prior successful searches. The system ranks stored methods based on a similarity quasimetric defined on assigned classes of queries. Similarity depends on the class heterarchy in an ontology and its associated text corpora. Morpheus revisits the prior search pathways of the stored searches to construct possible answers. Realm-based ontologies are created using Wikipedia pages, associated categories, and the synset heterarchy of WordNet. This paper describes the entire process with emphasis on the matching of user queries to stored answering methods.
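
To make the ranking idea concrete, here is a toy, asymmetric class distance over an invented heterarchy. It only illustrates ranking stored methods by how far a query's class must be generalized to reach a method's class; it is not the Morpheus quasimetric itself:

```python
# Invented class heterarchy: each class maps to its ancestors, nearest first.
ancestors = {
    "espresso_machine": ["kitchen_appliance", "appliance", "artifact"],
    "laptop": ["computer", "appliance", "artifact"],
}

def quasi_distance(query_cls, method_cls):
    """Smaller is better; asymmetric because only the query class is generalized."""
    if query_cls == method_cls:
        return 0
    chain = ancestors.get(query_cls, [])
    if method_cls in chain:
        return 1 + chain.index(method_cls)   # steps needed to generalize the query
    return len(chain) + 2                    # unrelated class: worst score

# Two stored answering methods, each tagged with the class it was built for.
stored_methods = {"price_of_kitchen_appliance": "kitchen_appliance",
                  "specs_of_computer": "computer"}
query_class = "espresso_machine"
ranked = sorted(stored_methods,
                key=lambda m: quasi_distance(query_class, stored_methods[m]))
print(ranked)   # the kitchen-appliance method ranks first for an espresso-machine query
```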


Very Large Data Bases | 2017

In-database batch and query-time inference over probabilistic graphical models using UDA-GIST

Kun Li; Xiaofeng Zhou; Daisy Zhe Wang; Christan Grant; Alin Dobra; Christopher Dudley

To meet customers' pressing demands, enterprise database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes use user-defined aggregates (UDAs), a data-driven operator, to implement analytical techniques in parallel. However, UDAs alone are not sufficient to implement statistical algorithms in which most of the work is performed by iterative transitions over a large state that cannot be naively partitioned due to data dependencies. Typically, this type of statistical algorithm requires pre-processing to set up the large state in the first place and demands post-processing after the statistical inference. This paper presents general iterative state transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA-GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile batch applications, cross-document coreference and image denoising, and one query-time inference application, marginal inference queries over probabilistic knowledge graphs. All three applications use probabilistic graphical models, which encode complex relationships among variables and are powerful for a wide range of problems. We show that the in-database framework allows us to tackle a 27 times larger problem than a scalable distributed solution for the first application and achieves a 43 times speedup over the state of the art for the second application. For the third application, we implement query-time inference using the UDA-GIST framework and apply it over a probabilistic knowledge graph, achieving a 10 times speedup over sequential inference. To the best of our knowledge, this is the first in-database query-time inference engine over a large probabilistic knowledge base. We show that the UDA-GIST framework for data- and graph-parallel computations can support both batch and query-time inference efficiently in databases.
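
The control flow of the UDA-GIST pattern can be sketched in a few lines of ordinary Python: one aggregate pass builds the shared state, a GIST loop applies transition rounds until nothing changes, and a final pass extracts results. The state and the transition rule below are invented placeholders, not the database operator:

```python
from dataclasses import dataclass, field

@dataclass
class State:
    values: dict = field(default_factory=dict)   # e.g. node -> current label

def uda_build(rows):
    """UDA phase: one pass over the input rows to set up the large shared state."""
    state = State()
    for node, initial_label in rows:
        state.values[node] = initial_label
    return state

def gist_transition(state, round_no):
    """One GIST round: update every entry; return how many entries changed."""
    changed = 0
    for node, label in list(state.values.items()):
        new_label = min(label, round_no)          # stand-in for a real inference update
        if new_label != label:
            state.values[node] = new_label
            changed += 1
    return changed

def gist_run(state, max_rounds=10):
    for r in range(max_rounds):
        if gist_transition(state, r) == 0:        # converged: nothing changed this round
            break
    return state

def uda_extract(state):
    """Final UDA phase: post-process the converged state into result tuples."""
    return sorted(state.values.items())

rows = [("a", 5), ("b", 2), ("c", 9)]
print(uda_extract(gist_run(uda_build(rows))))     # every label driven down to 0
```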


ACM SIGSPATIAL Workshop on Recommendations for Location-based Services and Social Networks | 2018

Multi-stage Collaborative Filtering for Tweet Geolocation

Keerti Banweer; Austin Graham; Joe Ripberger; Nina Cesare; Elaine O. Nsoesie; Christan Grant

Data from social media platforms such as Twitter can be used to analyze severe weather reports and foodborne illness outbreaks. Government officials use online reports for early estimation of the impact of catastrophes and to aid resource distribution. For online reports to be useful they must be geotagged, but location information is often not available: fewer than one percent of users share their location, and acquiring a significant sample of geolocated messages is prohibitively expensive. In this paper, we propose a multi-stage iterative model based on the popular matrix factorization technique. The algorithm uses the available partial information and exploits the relationships among messages, locations, and keywords to recommend locations for non-geotagged messages. We present this model for geotagging messages using recommender systems and discuss potential applications and next steps for this work.
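
A bare-bones sketch of the underlying matrix-factorization step follows, with invented data and dimensions; the paper's model is multi-stage and also exploits keyword relationships, which is what lets it place messages with no geotag at all:

```python
import numpy as np

rng = np.random.default_rng(1)
n_msgs, n_locs, k = 6, 4, 2
observed = {(0, 1): 1.0, (1, 1): 1.0, (2, 3): 1.0, (3, 0): 1.0}   # geotagged messages only

M = rng.normal(scale=0.1, size=(n_msgs, k))   # message factors
L = rng.normal(scale=0.1, size=(n_locs, k))   # location factors

lr, reg = 0.05, 0.01
for _ in range(500):                          # plain SGD over the observed entries
    for (i, j), r in observed.items():
        err = r - M[i] @ L[j]
        M[i] += lr * (err * L[j] - reg * M[i])
        L[j] += lr * (err * M[i] - reg * L[j])

scores = M @ L.T                              # predicted message-location affinities
print("message 1 / location 1 affinity:", round(float(scores[1, 1]), 2))   # close to 1.0
```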


Journal of Data and Information Quality | 2017

A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction

S. Goldberg; Daisy Zhe Wang; Christan Grant

The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text annotations in a database. The current state-of-the-art structured prediction methods, however, are likely to contain errors, and it is important to be able to manage the overall uncertainty of the database. On the other hand, the advent of crowdsourcing has enabled humans to aid machine algorithms at scale. In this article, we introduce pi-CASTLE, a system that optimizes and integrates human and machine computing as applied to a complex structured prediction problem involving Conditional Random Fields (CRFs). We propose strategies grounded in information theory to select a token subset, formulate questions for the crowd to label, and integrate these labelings back into the database using a method of constrained inference. On both a text segmentation task over academic citations and a named entity recognition task over tweets, we show an order-of-magnitude improvement in accuracy gain over baseline methods.
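
The selection step can be illustrated with a small sketch: given per-token marginal label distributions such as a CRF produces, route the highest-entropy tokens to the crowd. The labels and probabilities below are invented, and pi-CASTLE's actual selection and constrained re-inference are more involved:

```python
import math

def entropy(dist):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Invented per-token marginals over three labels {TITLE, AUTHOR, VENUE},
# as a trained CRF might produce for one citation string.
marginals = {
    "Proc.":   [0.10, 0.05, 0.85],
    "Wang":    [0.05, 0.90, 0.05],
    "Dynamic": [0.40, 0.25, 0.35],   # genuinely ambiguous token
    "2012":    [0.02, 0.03, 0.95],
}

k = 1
to_crowd = sorted(marginals, key=lambda t: entropy(marginals[t]), reverse=True)[:k]
print("ask the crowd about:", to_crowd)   # the highest-entropy token: ['Dynamic']
```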


Information Reuse and Integration | 2016

Query-Driven Sampling for Collective Entity Resolution

Christan Grant; Daisy Zhe Wang; Michael L. Wick

Entity resolution (ER) is the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy due to localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. However, for large datasets this is an extremely expensive process. One key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven ER and join-driven ER. We implement novel variations of the MCMC Metropolis-Hastings algorithm to generate biased samples, and selectivity-based scheduling algorithms to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with extractions from a newswire dataset containing 71 million mentions.
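
A toy sketch of the biased-sampling idea follows: a Metropolis-Hastings sampler over mention-to-entity assignments whose proposal distribution favors mentions that match the query, so inference effort concentrates on the entities the query needs. The similarity function, scores, and bias weights are invented for illustration and are not the paper's algorithm:

```python
import math, random

random.seed(0)
mentions = ["IBM Corp", "I.B.M.", "Apple Inc", "apple pie"]
query_terms = {"ibm"}                        # the selection query only cares about IBM

def tokens(s):
    return set(s.lower().replace(".", "").split())

def similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / max(len(ta | tb), 1)

def state_score(assign):
    """Reward co-clustered mentions that look alike; penalize dissimilar ones."""
    s = 0.0
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if assign[i] == assign[j]:
                s += similarity(mentions[i], mentions[j]) - 0.3
    return s

# Proposal bias: mentions matching the query are picked three times as often,
# so sampling effort concentrates on the entities the query actually needs.
weights = [3.0 if tokens(m) & query_terms else 1.0 for m in mentions]

assign = list(range(len(mentions)))          # start: one entity per mention
best, best_score = assign[:], state_score(assign)
for _ in range(2000):
    i = random.choices(range(len(mentions)), weights=weights)[0]
    proposal = assign[:]
    proposal[i] = random.randrange(len(mentions))   # move mention i to another entity id
    delta = state_score(proposal) - state_score(assign)
    if delta >= 0 or random.random() < math.exp(delta):
        assign = proposal
        if state_score(assign) > best_score:
            best, best_score = assign[:], state_score(assign)

print(dict(zip(mentions, best)))             # "IBM Corp" and "I.B.M." share an entity id
```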


Journal of Data and Information Quality | 2015

A Challenge for Long-Term Knowledge Base Maintenance

Christan Grant; Daisy Zhe Wang

Knowledge bases (KBs) are repositories of interconnected facts with an inference engine. Companies are increasingly populating KBs with facts from disparate sources to create a central repository of information that provides users with a richer and more integrated experience [Herman and Delurey 2013]. Additionally, inference over the constructed KB can produce new facts not explicitly mentioned in the KB. Google is now employing KBs to surface additional information for user search [Dong et al. 2014a]. Manually constructed KBs, such as YAGO [Hoffart et al. 2013] and DBpedia [Auer et al. 2007], are increasingly being used as the gold standard and ground truth for newer KBs [Dong et al. 2014b]. However, as the number of KBs inside an organization grows, each requires a sufficiently high level of quality and must be meticulously maintained. Both YAGO and DBpedia were constructed from Wikipedia data. Within Wikipedia, the median lag between the occurrence of a notable event and the addition of that event was measured at 356 days [Frank et al. 2012]. This fact spurred many efforts to discover methods to automatically build, extend, and clean KBs [Frank et al. 2012; Ellis et al. 2012; Ji et al. 2014; Surdeanu and Ji 2014]. In these contests, teams build systems to explore the creation of Web-scale KBs; however, by and large, these contests stop short of designing systems for deployment in production. We believe that two main questions remain wholly understudied across research communities: in KBs, over time, (1) what stale information needs to be cleaned, and (2) when should this information be updated?


Text Retrieval Conference | 2011

Online Topic Modeling for Real-time Twitter Search

Christan Grant; Clint P. George; Chris Jenneisch; Joseph N. Wilson


North American Chapter of the Association for Computational Linguistics | 2012

Automatic Knowledge Base Construction using Probabilistic Extraction, Deductive Reasoning, and Human Feedback

Daisy Zhe Wang; Yang Chen; S. Goldberg; Christan Grant; Kun Li

Collaboration


Dive into Christan Grant's collaborations.

Top Co-Authors

Kun Li, University of Florida
Yan Liang, University of Oklahoma
Nina Cesare, University of Washington
Andrew Halterman, Massachusetts Institute of Technology