Mayank Kejriwal | Researchain


Publication

Featured researches published by Mayank Kejriwal.

International Conference on Data Mining | 2013

An Unsupervised Algorithm for Learning Blocking Schemes

Mayank Kejriwal; Daniel P. Miranker

A pairwise comparison of data objects is a requisite step in many data mining applications, but has quadratic complexity. In applications such as record linkage, blocking methods may be applied to reduce the cost. That is, the data is first partitioned into a set of blocks, and pairwise comparisons are computed only for pairs within each block. To date, blocking methods have required either that the blocking scheme be given or that training data be provided so that supervised learning algorithms can determine a blocking scheme. In either case, a domain expert is required. This paper develops an unsupervised method for learning a blocking scheme for tabular data sets. The method is divided into two phases. First, a weakly labeled training set is generated automatically in time linear in the number of records of the entire dataset. The second phase casts blocking key discovery as a Fisher feature selection problem. The approach is compared to a state-of-the-art supervised blocking key discovery algorithm on three real-world databases and achieves favorable results.
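The cost reduction that blocking provides can be illustrated with a minimal sketch. It assumes a hand-picked blocking key (the `zip` field) on toy records; the record fields, `zip_key`, and the helper names are hypothetical, and this is not the learned blocking scheme the paper describes.

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, blocking_key):
    """Group record ids by the value returned by a blocking key function."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[blocking_key(record)].append(record_id)
    return blocks

def candidate_pairs(blocks):
    """Return the set of record-id pairs that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Toy data; field names and values are illustrative assumptions.
records = {
    1: {"name": "John Smith", "zip": "78701"},
    2: {"name": "Jon Smith", "zip": "78701"},
    3: {"name": "Mary Jones", "zip": "10001"},
    4: {"name": "M. Jones", "zip": "10001"},
    5: {"name": "Alan Turing", "zip": "94105"},
}

def zip_key(record):
    """Hypothetical hand-picked blocking key: the zip code field."""
    return record["zip"]

pairs = candidate_pairs(block_records(records, zip_key))
# Number of comparisons without blocking: n * (n - 1) / 2
all_pairs = len(records) * (len(records) - 1) // 2
print(len(pairs), "candidate pairs instead of", all_pairs)
```

With this key only the pairs inside each zip-code block are compared, so the quadratic all-pairs comparison is avoided. A poorly chosen key would instead separate true duplicates, which is why the choice of blocking scheme matters.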

European Semantic Web Conference | 2015

Semi-supervised Instance Matching Using Boosted Classifiers

Mayank Kejriwal; Daniel P. Miranker

Instance matching concerns identifying pairs of instances that refer to the same underlying entity. Current state-of-the-art instance matchers use machine learning methods. Supervised learning systems achieve good performance by training on significant amounts of manually labeled samples. To alleviate the labeling effort, this paper presents a minimally supervised instance matching approach that is able to deliver competitive performance using only 2% training data and little parameter tuning. As a first step, the classifier is trained in an ensemble setting using boosting. Iterative semi-supervised learning is used to improve the performance of the boosted classifier even further, by re-training it on the most confident samples labeled in the current iteration. Empirical evaluations on a suite of six publicly available benchmarks show that the proposed system outcompetes optimization-based minimally supervised approaches in 1-7 iterations. The system's average F-Measure is shown to be within 2.5% of that of recent supervised systems that require more training samples for effective performance.
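The iterative re-training step can be sketched as a generic self-training loop. To keep the example short and dependency-free, the base learner below is a simple 1-D nearest-centroid model standing in for the boosted classifier, so this shows only the shape of the loop, not the paper's system. The similarity scores, `per_round`, and `max_rounds` are illustrative assumptions.

```python
def fit_centroids(samples):
    """Fit a 1-D nearest-centroid model from (score, label) pairs."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def predict(model, x):
    """Return (label, confidence); confidence is the gap between centroid distances."""
    pos_c, neg_c = model
    d_pos, d_neg = abs(x - pos_c), abs(x - neg_c)
    label = 1 if d_pos < d_neg else 0
    return label, abs(d_pos - d_neg)

def self_train(labeled, unlabeled, per_round=2, max_rounds=7):
    """Repeatedly label the most confident unlabeled samples and re-train."""
    labeled, pool = list(labeled), list(unlabeled)
    model = fit_centroids(labeled)
    for _ in range(max_rounds):
        if not pool:
            break
        # Rank the remaining pool by the current model's confidence.
        scored = sorted(pool, key=lambda x: predict(model, x)[1], reverse=True)
        confident, pool = scored[:per_round], scored[per_round:]
        # Add the confident samples with their predicted labels, then re-train.
        labeled += [(x, predict(model, x)[0]) for x in confident]
        model = fit_centroids(labeled)
    return model, labeled

# Each number is a hypothetical similarity score for one candidate instance pair.
seed = [(0.9, 1), (0.1, 0)]  # the small manually labeled set
unlabeled = [0.95, 0.85, 0.8, 0.6, 0.3, 0.2, 0.05, 0.15]

model, labeled = self_train(seed, unlabeled)
```

Starting from two labeled pairs, the loop labels the remaining eight pairs over four rounds, taking the most confident ones first. In practice the stand-in model would be replaced by a boosted ensemble that outputs a confidence score.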

International Conference on Big Data | 2015

A pipeline for extracting and deduplicating domain-specific knowledge bases

Mayank Kejriwal; Qiaoling Liu; Ferosh Jacob; Faizan Javed

Building a knowledge base (KB) describing domain-specific entities is an important problem in industry, examples including KBs built over companies (e.g. Dun & Bradstreet), skills (LinkedIn, CareerBuilder) and people (inome). The task involves several engineering challenges, including devising effective procedures for data extraction, aggregation and deduplication. Data extraction involves processing multiple information sources in order to extract domain-specific data instances. The extracted instances must be aggregated and deduplicated; that is, instances referring to the same underlying entity must be identified and merged. This paper describes a pipeline developed at CareerBuilder LLC for building a KB describing employers, by first extracting entities from both global, publicly available data sources (Wikipedia and Freebase) and a proprietary source (Infogroup), and then deduplicating the instances to yield an employer-specific KB. We conduct a range of pilot experiments over three independently labeled datasets sampled from the extracted KB, and comment on some lessons learned.

Journal of Web Semantics | 2015

An unsupervised instance matcher for schema-free RDF data

Mayank Kejriwal; Daniel P. Miranker

This article presents an unsupervised system that performs instance matching between entities in schema-free Resource Description Framework (RDF) files. Rather than relying on domain expertise or manually labeled samples, the system automatically generates its own heuristic training set. The training sets are first used by the system to align the properties in the input graphs. The property alignment and training sets are used together to simultaneously learn two functions, one for the blocking step of instance matching and the other for the classification step. Finally, the learned functions are used to perform instance matching. The full system is implemented as a sequence of components that can be iteratively executed to boost performance. Evaluations on a suite of ten test cases show individual components to be competitive with state-of-the-art baselines. The system as a whole is shown to compete effectively with adaptive supervised approaches.

International Semantic Web Conference | 2014

Populating Entity Name Systems for Big Data Integration

Mayank Kejriwal

An Entity Name System (ENS) is a thesaurus for entities. An ENS is a fundamental component of data integration systems, serving instance matching needs across multiple data sources. Populating an ENS in support of co-referencing Linked Open Data (LOD) is a Big Data problem. Viable solutions to the long-standing Entity Resolution (ER) problem are required, meeting specific requirements of heterogeneity, scalability and automation. In this thesis, we propose to develop and implement algorithms for an ER system that address the three key criteria. Preliminary results demonstrate potential system feasibility.

IEEE Transactions on Big Data | 2017

Knowledge Graphs for Social Good: An Entity-centric Search Engine for the Human Trafficking Domain

Mayank Kejriwal; Pedro A. Szekely

Web advertising related to Human Trafficking (HT) activity has been on the rise in recent years. Answering entity-centric questions over crawled HT Web corpora to assist investigators in the real world is an important social problem, involving many technical challenges. This paper describes a recent entity-centric knowledge graph effort that resulted in a semantic search engine to assist analysts and investigative experts in the HT domain. The overall approach takes as input a large corpus of advertisements crawled from the Web, structures it into an indexed knowledge graph, and enables investigators to satisfy their information needs by posing investigative search queries to a special-purpose semantic execution engine. We evaluated the search engine on real-world data collected from over 90,000 webpages, a significant fraction of which correlates with HT activity. Performance on four relevant categories of questions, measured by mean average precision, was found to be promising, outperforming a learning-to-rank approach on three of the four categories. The prototype uses open-source components and scales to terabyte-scale corpora. Principles of the prototype have also been independently replicated, with similarly successful results.
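For readers unfamiliar with the metric named above, mean average precision can be computed as in the following minimal sketch. The queries, document ids, and relevance judgments are made-up toy values, not data from the paper.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries with their ranked results and relevant documents.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"y"}),            # AP = 1/2
]
map_score = mean_average_precision(runs)
```

Here the first query has an average precision of 5/6 and the second 1/2, giving a mean of 2/3. The metric rewards rankings that place relevant results near the top.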

european semantic web conference | 2015