IEEE Transactions on Knowledge and Data Engineering | 2019
Unsupervised Statistical Text Simplification
Abstract
Most recent approaches to text simplification (TS) draw on insights from machine translation to learn simplified text from monolingual corpora of complex and simplified sentence pairs, and their effectiveness relies strongly on the availability of large amounts of parallel sentences. However, a serious problem has haunted TS for decades: corpora of parallel sentences for TS are scarce or not fit for the learning task. In this paper, we focus on one especially useful and challenging problem, unsupervised TS that does not use a single parallel sentence. We present the first unsupervised text simplification system built on a phrase-based machine translation system, which leverages a careful initialization of phrase tables and language models. Specifically, we utilize the ‘ordinary’ English Wikipedia as a massive knowledge base. We obtain word frequencies and word embeddings from Wikipedia to populate the phrase tables, and we gather simplified and complex sentences from Wikipedia using the Flesch reading-ease score to train the simplified and complex language models. On the widely used WikiLarge and WikiSmall benchmarks, our approach obtains 39.08 and 25.12 SARI points, respectively, even outperforming supervised baselines.
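To illustrate how sentences might be separated by readability, the following is a minimal sketch of the standard Flesch reading-ease formula, 206.835 - 1.015 (words / sentences) - 84.6 (syllables / words), applied to individual sentences. It is not the system's actual implementation. The syllable counter is a rough vowel-group heuristic, and the function names and the thresholds `simple_min` and `complex_max` are hypothetical values chosen for illustration only; the abstract does not state the cut-offs that were used.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (assumed heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in endings such as "-le".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch reading-ease score; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def split_by_readability(sentences, simple_min=80.0, complex_max=30.0):
    """Bucket sentences into simple and complex sets.

    The thresholds are hypothetical; sentences scoring between them
    are discarded as neither clearly simple nor clearly complex.
    """
    simple, complex_ = [], []
    for sentence in sentences:
        score = flesch_reading_ease(sentence)
        if score >= simple_min:
            simple.append(sentence)
        elif score <= complex_max:
            complex_.append(sentence)
    return simple, complex_
```

Under these assumptions, a short sentence of monosyllabic words lands in the simple set, while a sentence dense with polysyllabic words lands in the complex set; each set could then serve as training data for its own language model.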