Achieving High Data Utility K-Anonymization Using Similarity-Based Clustering Model

Mohammad Rasool SARRAFI AGHDAM; Noboru SONEHARA

doi:10.1587/transinf.2015INP0019

Summary:

In data sharing, privacy has become one of the main concerns, particularly when the shared datasets involve individuals and contain private, sensitive information. A model that is widely used to protect the privacy of individuals when publishing micro-data is k-anonymity. It reduces the linking confidence between private sensitive information and a specific individual by generalizing the identifier attributes of each individual so that they are indistinguishable from those of at least k-1 others in the dataset. K-anonymity can also be defined as clustering with the constraint of a minimum of k tuples in each group. However, the accuracy of the data in a k-anonymous dataset decreases because of the large information loss caused by generalization and suppression. Moreover, most current approaches are designed for numerical continuous attributes; for categorical attributes they do not perform efficiently and depend on attribute hierarchical taxonomies, which often do not exist. In this paper we propose a new model for k-anonymization called Similarity-Based Clustering (SBC). It is based on clustering, and it measures similarity and calculates distances between tuples containing numerical and categorical attributes without hierarchical taxonomies. Based on this model, a bottom-up greedy algorithm is proposed. Our extensive study on two real datasets shows that the proposed algorithm, in comparison with existing well-known algorithms, offers much higher data utility and significantly reduces information loss. Data utility is maintained above 80% over a wide range of k values.
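
The idea of k-anonymity as clustering with at least k tuples per group can be illustrated with a minimal Python sketch. It is not the SBC model or the authors' algorithm, whose similarity measure is not given in this summary. As an assumption, it swaps in a generic Gower-style distance (range-scaled difference for numerical attributes, simple mismatch for categorical ones, so no taxonomy is required) and a naive nearest-neighbour grouping. All function names and the toy records are hypothetical.

```python
from collections import Counter


def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance; an assumed stand-in, not the paper's SBC measure."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            span = numeric_ranges[i]
            # Numerical attribute: absolute difference scaled by the attribute range.
            total += abs(x - y) / span if span else 0.0
        else:
            # Categorical attribute: simple mismatch, no taxonomy needed.
            total += 0.0 if x == y else 1.0
    return total / len(a)


def greedy_k_clusters(records, k, numeric_ranges):
    """Greedily build clusters of at least k records (hypothetical helper)."""
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    remaining = list(range(len(records)))
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Sort unassigned records by distance to the seed and take the k-1 nearest.
        remaining.sort(key=lambda j: mixed_distance(records[seed], records[j], numeric_ranges))
        clusters.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    # Fewer than k records are left: attach each to the cluster with the nearest seed.
    for j in remaining:
        nearest = min(clusters, key=lambda c: mixed_distance(records[c[0]], records[j], numeric_ranges))
        nearest.append(j)
    return clusters


def generalize(records, clusters, numeric_ranges):
    """Replace each record by its cluster summary: (min, max) or a set of categories."""
    out = [None] * len(records)
    for cluster in clusters:
        rows = [records[i] for i in cluster]
        summary = tuple(
            (min(col), max(col)) if i in numeric_ranges else tuple(sorted(set(col)))
            for i, col in enumerate(zip(*rows))
        )
        for i in cluster:
            out[i] = summary
    return out


def is_k_anonymous(rows, k):
    """True if every distinct quasi-identifier tuple occurs at least k times."""
    return all(count >= k for count in Counter(rows).values())


# Toy quasi-identifiers (age, occupation); the values are made up for illustration.
records = [(25, "nurse"), (52, "teacher"), (27, "nurse"), (55, "teacher")]
numeric_ranges = {0: 30}  # attribute 0 is numerical, with range 55 - 25
clusters = greedy_k_clusters(records, 2, numeric_ranges)
anonymized = generalize(records, clusters, numeric_ranges)
print(clusters, is_k_anonymous(anonymized, 2))
```

In this sketch the raw records are not 2-anonymous, while the generalized ones are. Grouping similar tuples first keeps the generalized ranges narrow, which is the intuition behind reducing information loss through clustering.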

Publication: IEICE TRANSACTIONS on Information Vol.E99-D No.8 pp.2069-2078

Publication Date: 2016/08/01

Publicized: 2016/05/31

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2015INP0019

Type of Manuscript: Special Section PAPER (Special Section on Security, Privacy and Anonymity of Internet of Things)

Authors

Mohammad Rasool SARRAFI AGHDAM
School of Multidisciplinary, Informatics Department
Noboru SONEHARA
School of Multidisciplinary, Informatics Department, National Institute of Informatics (NII)

Keyword

anonymization, privacy preserving data mining, K-anonymity, algorithm

Cite this

Mohammad Rasool SARRAFI AGHDAM, Noboru SONEHARA, "Achieving High Data Utility K-Anonymization Using Similarity-Based Clustering Model" in IEICE TRANSACTIONS on Information, vol. E99-D, no. 8, pp. 2069-2078, August 2016, doi: 10.1587/transinf.2015INP0019.
Abstract: In data sharing, privacy has become one of the main concerns, particularly when the shared datasets involve individuals and contain private, sensitive information. A model that is widely used to protect the privacy of individuals when publishing micro-data is k-anonymity. It reduces the linking confidence between private sensitive information and a specific individual by generalizing the identifier attributes of each individual so that they are indistinguishable from those of at least k-1 others in the dataset. K-anonymity can also be defined as clustering with the constraint of a minimum of k tuples in each group. However, the accuracy of the data in a k-anonymous dataset decreases because of the large information loss caused by generalization and suppression. Moreover, most current approaches are designed for numerical continuous attributes; for categorical attributes they do not perform efficiently and depend on attribute hierarchical taxonomies, which often do not exist. In this paper we propose a new model for k-anonymization called Similarity-Based Clustering (SBC). It is based on clustering, and it measures similarity and calculates distances between tuples containing numerical and categorical attributes without hierarchical taxonomies. Based on this model, a bottom-up greedy algorithm is proposed. Our extensive study on two real datasets shows that the proposed algorithm, in comparison with existing well-known algorithms, offers much higher data utility and significantly reduces information loss. Data utility is maintained above 80% over a wide range of k values.
URL: https://globals.ieice.org/en_transactions/information/10.1587/transinf.2015INP0019/_p

@ARTICLE{e99-d_8_2069,
author={Mohammad Rasool SARRAFI AGHDAM and Noboru SONEHARA},
journal={IEICE TRANSACTIONS on Information},
title={Achieving High Data Utility K-Anonymization Using Similarity-Based Clustering Model},
year={2016},
volume={E99-D},
number={8},
pages={2069-2078},
abstract={In data sharing, privacy has become one of the main concerns, particularly when the shared datasets involve individuals and contain private, sensitive information. A model that is widely used to protect the privacy of individuals when publishing micro-data is k-anonymity. It reduces the linking confidence between private sensitive information and a specific individual by generalizing the identifier attributes of each individual so that they are indistinguishable from those of at least k-1 others in the dataset. K-anonymity can also be defined as clustering with the constraint of a minimum of k tuples in each group. However, the accuracy of the data in a k-anonymous dataset decreases because of the large information loss caused by generalization and suppression. Moreover, most current approaches are designed for numerical continuous attributes; for categorical attributes they do not perform efficiently and depend on attribute hierarchical taxonomies, which often do not exist. In this paper we propose a new model for k-anonymization called Similarity-Based Clustering (SBC). It is based on clustering, and it measures similarity and calculates distances between tuples containing numerical and categorical attributes without hierarchical taxonomies. Based on this model, a bottom-up greedy algorithm is proposed. Our extensive study on two real datasets shows that the proposed algorithm, in comparison with existing well-known algorithms, offers much higher data utility and significantly reduces information loss. Data utility is maintained above 80% over a wide range of k values.},
keywords={anonymization, privacy preserving data mining, K-anonymity, algorithm},
doi={10.1587/transinf.2015INP0019},
ISSN={1745-1361},
month={August},
}

TY - JOUR
TI - Achieving High Data Utility K-Anonymization Using Similarity-Based Clustering Model
T2 - IEICE TRANSACTIONS on Information
SP - 2069
EP - 2078
AU - Mohammad Rasool SARRAFI AGHDAM
AU - Noboru SONEHARA
PY - 2016
DO - 10.1587/transinf.2015INP0019
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 8
JA - IEICE TRANSACTIONS on Information
Y1 - August 2016
AB - In data sharing, privacy has become one of the main concerns, particularly when the shared datasets involve individuals and contain private, sensitive information. A model that is widely used to protect the privacy of individuals when publishing micro-data is k-anonymity. It reduces the linking confidence between private sensitive information and a specific individual by generalizing the identifier attributes of each individual so that they are indistinguishable from those of at least k-1 others in the dataset. K-anonymity can also be defined as clustering with the constraint of a minimum of k tuples in each group. However, the accuracy of the data in a k-anonymous dataset decreases because of the large information loss caused by generalization and suppression. Moreover, most current approaches are designed for numerical continuous attributes; for categorical attributes they do not perform efficiently and depend on attribute hierarchical taxonomies, which often do not exist. In this paper we propose a new model for k-anonymization called Similarity-Based Clustering (SBC). It is based on clustering, and it measures similarity and calculates distances between tuples containing numerical and categorical attributes without hierarchical taxonomies. Based on this model, a bottom-up greedy algorithm is proposed. Our extensive study on two real datasets shows that the proposed algorithm, in comparison with existing well-known algorithms, offers much higher data utility and significantly reduces information loss. Data utility is maintained above 80% over a wide range of k values.
ER -