Proceedings of the 21st ACM Symposium on Document Engineering | 2021

A novel approach on the joint de-identification of textual and relational data with a modified Mondrian algorithm

Abstract

Traditional approaches to data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joint, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. We evaluate our approach with two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach can reduce information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity. As rx-anon is a framework approach, it can be reused and extended with other anonymization algorithms, privacy models, and textual similarity metrics.
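
To make two of the concepts named in the abstract concrete, the following is a minimal Python sketch of a k-anonymity check and of the standard Normalized Certainty Penalty for a single numeric attribute (partition range divided by global range). It is not the rx-anon implementation and does not reproduce the paper's adapted score for textual attributes. The function names, the attribute names `age` and `zip`, and the toy records are all hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` occurs at least k times."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def ncp_numeric(partition_values, all_values):
    """Standard NCP for one numeric attribute: the value range inside a
    partition divided by the range over the whole dataset (0 = no loss)."""
    global_range = max(all_values) - min(all_values)
    if global_range == 0:
        return 0.0  # attribute is constant, so generalizing it loses nothing
    return (max(partition_values) - min(partition_values)) / global_range


# Hypothetical toy table whose quasi-identifiers are already generalized.
records = [
    {"age": "20-29", "zip": "891**"},
    {"age": "20-29", "zip": "891**"},
    {"age": "30-39", "zip": "890**"},
    {"age": "30-39", "zip": "890**"},
]

two_anonymous = is_k_anonymous(records, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(records, ["age", "zip"], k=3)

# A partition covering ages 20..29 in a dataset whose ages span 20..60.
penalty = ncp_numeric([20, 29], [20, 60])
```

In this toy table each equivalence class holds two records, so the 2-anonymity check passes and the 3-anonymity check fails. A lower penalty means the generalized range is narrow relative to the full domain, which is the sense in which the abstract speaks of reducing information loss.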

DOI 10.1145/3469096.3469871

Language English

Journal Proceedings of the 21st ACM Symposium on Document Engineering
