Archive | 2019

Syntax is clearer on the other side - Using parallel corpus to extract monolingual data

 

Abstract


This paper describes the construction of a training corpus of Hungarian sentences labelled according to a syntactic criterion: the syntactic role of volt ‘was/had’, a very common multifunctional word. The labels are assigned by a rule-based algorithm that determines the function of the target word from the English counterparts of the sentences, which are extracted from a parallel corpus. The rationale is that the required syntactic information is easier to retrieve in English than in Hungarian. The algorithm achieved fair accuracy, but it needs further improvement before its output can serve as reliable training data. The resulting training corpus was tested with FastText’s text classifier; the results show that the targeted disambiguation problem can be solved with neural-network-based text classification.
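The labelling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the two label names, the single "had" versus "was/were" rule, and the example sentence pairs are all assumptions made for illustration. The only external convention used is fastText's supervised input format, in which each line starts with a `__label__` prefix.

```python
import re

# Hypothetical label names; the abstract does not list the paper's label set.
POSSESSION = "possession"
COPULA = "copula_or_existential"


def label_volt(hu_sentence, en_sentence):
    """Label the role of 'volt' using cues from the English counterpart.

    Returns None when the Hungarian sentence lacks 'volt' or when the
    English side gives no unambiguous cue. This toy rule ignores cases
    such as the English past perfect ("had been"), which a real rule
    set would have to handle.
    """
    if not re.search(r"\bvolt\b", hu_sentence, flags=re.IGNORECASE):
        return None
    en = en_sentence.lower()
    has_had = re.search(r"\bhad\b", en) is not None
    has_was = re.search(r"\b(was|were)\b", en) is not None
    if has_had and not has_was:
        return POSSESSION
    if has_was and not has_had:
        return COPULA
    return None


def to_fasttext_line(label, hu_sentence):
    """Format one training example in fastText's supervised input format."""
    return f"__label__{label} {hu_sentence}"


# Illustrative sentence pairs (invented, not taken from the paper's corpus).
pairs = [
    ("Volt egy kutyám.", "I had a dog."),
    ("A ház nagy volt.", "The house was big."),
    ("Holnap esni fog.", "It will rain tomorrow."),
]

lines = []
for hu, en in pairs:
    label = label_volt(hu, en)
    if label is not None:
        lines.append(to_fasttext_line(label, hu))

for line in lines:
    print(line)
```

Sentence pairs for which no rule fires are dropped rather than guessed, which trades corpus size for label reliability. That trade-off matters when the output is meant to serve as training data.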

DOI 10.18653/v1/W19-7813
Language English

Full Text