Deokyeon Jang, Minsoo Kim and Yon Dohn Chung

Privacy-Constrained Relational Data Perturbation: An Empirical Evaluation

Abstract: The release of relational data containing personal sensitive information poses a significant risk of privacy breaches. To preserve privacy while publishing such data, techniques that protect the sensitive information must be applied. One popular technique for this purpose is data perturbation, which is widely used for privacy-preserving data release because of its simplicity and efficiency. However, data perturbation has limitations that hinder its practical application, and alternative solutions are needed to overcome them. In this study, we propose a novel approach to preserving privacy in the release of relational data containing personal sensitive information. The approach comprises an intuitive, syntactic privacy criterion for data perturbation and two perturbation methods for relational data release. Through experiments with synthetic and real data, we evaluate the performance of our methods.

Keywords: Anonymization, Data Perturbation, Data Privacy, Personal Data Protection

1. Introduction

Data privacy over statistical databases is important because databases contain personal sensitive information that should not be revealed when analyzing and publishing data [1-5]. There have been various data privacy studies: syntactic privacy methods [6-8] (such as k-anonymity, l-diversity, and t-closeness) and semantic privacy methods [9-11] (such as ε-differential privacy, (ε, δ)-differential privacy, and μ-Gaussian differential privacy). Syntactic privacy research focuses on preventing privacy breaches directly from the data, whereas semantic privacy research focuses on breaches from the programs (or algorithms) processing the data. That is, the former is used for publishing (or releasing) data in an anonymized manner, and the latter is used for analyzing (or deep learning over) data in a privacy-preserving manner without releasing the data. Although some studies [12-15] generate and publish synthetic data using the notion of differential privacy, they assume the usage of the released data (e.g., clustering, regression, and so on) is known a priori. In this paper, we focus on privacy-preserving relational data publication (a.k.a. de-identification), which modifies the source relation such that no individual record can be identified from the released relation.

Although semantic privacy (mostly differential privacy) has become the major trend among academic researchers in recent years, the syntactic privacy issue (i.e., data release after anonymization/de-identification) is still recognized as important in industrial fields. Especially in Korea, data exchange and combination after pseudonymization/anonymization has proliferated in recent years and is, moreover, encouraged by the government [16,17].

For privacy-preserving data publication, two kinds of approaches have been explored in the past: (1) the generalization-based approach [6-8] and (2) the perturbation-based approach [16,18-20]. The first modifies data records such that multiple data records have the same values for the QI (quasi-identifier) attributes [6]. For example, the k-anonymity model [6,7] requires that every data record has at least "k-1" other data records with the same contents. Fig. 1 shows an example.
Suppose there are five data records with the same height, weight, and age values in the published relation. The individuals behind these five records are not distinguishable (Fig. 1(b)), that is, they are de-identified. The second approach, on the other hand, modifies the data such that the published values are not the same as the original ones (Fig. 1(c)). Usually, randomly generated noise is used for the perturbation (i.e., changing the data values). In spite of the very different policies they adopt, both approaches achieve the same goal: anonymization.

Comparing the two approaches, the generalization-based approach makes it possible to enforce the degree of anonymity in a quantitative manner: larger k (and also l) values denote stronger anonymity [6,7]. However, there are situations, such as when the data exhibits high dimensionality, where achieving k-anonymity via generalization may not be feasible [21]. In particular, datasets with numerous attributes that can serve as quasi-identifiers are hard to de-identify without a significant amount of data loss. In such cases, the perturbation-based approach may be preferred, as it can provide privacy protection while accommodating the limitations posed by the data's high dimensionality. On the other hand, unlike generalization-based privacy models such as k-anonymity [6] and l-diversity [7], the perturbation-based approach lacks a way of quantitatively controlling the leakage of privacy. In practical applications, the absence of a quantitative privacy measure is a significant obstacle to adoption; therefore, it is important to define privacy quantification [22].

In this respect, this paper proposes a privacy-constrained data perturbation method, namely ρ-RDP; in particular, we focus on relational data perturbation via additive noise. Fig. 2 shows the overview of our proposed method. Given a relation, we perturb the data by adding random noise. Here, we measure the amount of possible privacy leakage from the perturbed data and guarantee that it satisfies the specified criterion, the ρ-safeness measure, which will be explained in Section 2. The output data can then be safely utilized by the public.

The rest of the paper is organized as follows. In Section 2, we define a novel syntactic privacy measure ρ for data perturbation; as far as we know, there has been no research defining a syntactic privacy metric for data perturbation. In Section 3, we propose two data perturbation methods satisfying the given privacy constraint. Lastly, in Section 4, we evaluate the proposed methods with various experiments in comparison with the generalization-based anonymization method.

2. Problem Definition

In this section, we formally define the problem of RDP that we tackle in the paper. We assume a relation is given where all attributes are in continuous, non-negative numeric domains in the range [0, 1]. In practice, any numeric values can be used without loss of generality with appropriate scaling and transformation. Categorical attributes can also be used if relevant scoring functions and distance measures are specified. We assume the relation has no identifying attributes.

Definition 1 (RDP): Let R be a relation consisting of N records, where each record $r_i$ is defined as $r_i = [r_{i1}, r_{i2}, \ldots, r_{ij}, \ldots, r_{iM}]$, where $i = 1, \ldots, N$, $j = 1, \ldots, M$, and $0 \leq r_{ij} \leq 1$.
The perturbation of $r_i$ into $r_i'$ means the addition of a random noise in the range [-1, 1] to each attribute value: $r_{ij}' = r_{ij} + X_{ij}$. Then, the perturbed relation, denoted as $R'$, is the set of N records $r_i' = [r_{i1}', r_{i2}', \ldots, r_{ij}', \ldots, r_{iM}']$, where $X_{ij}$ follows a random distribution.

RDP simply randomizes the attribute values of the original data records into those of the result relation in a one-to-one way. Due to the randomness, the privacy leakage is restricted in probabilistic ways [1,2,4,23]. However, the deficiency of privacy quantification [24] makes it challenging for practitioners in the field to readily adopt it. In this regard, we propose an intuitive, syntactic privacy criterion through which the amount of privacy leakage from $r_i'$ is measured solely based on its original record $r_i$.

Definition 2 (ρ-safeness): A perturbed record $r_i'$ is called "ρ-safe" compared with its original record $r_i$ if the dissimilarity between $r_i$ and $r_i'$ is greater than ρ $(0 \leq \rho \leq 1)$, where $\mathrm{dissimilarity}(r_i, r_i') = \sum_{j=1}^{M} |r_{ij} - r_{ij}'| / M$.

The dissimilarity measure represents the ratio of the distance between the two records $r_i$ and $r_i'$ to the size of the multidimensional space. We use the Manhattan ($L_1$) distance for the dissimilarity between two records, treating the set of M attribute values as a point in an M-dimensional, normalized space. For example, the privacy safeness between $r = (0.3)$ and $r' = (0.7)$ is 0.4, and the record $r' = (0.1, 0.4)$ is 0.45 privacy-safe compared with its original record $r = (0.7, 0.1)$. (Interestingly, the distance concept has been used as a data utility measure, not a privacy one, in the literature [6,7,11,15]. This is because those works control the risk of re-identification by the size (i.e., "k" or "l") of the so-called equivalence class.) Fig. 3 shows some examples of ρ-safe perturbation results, where the ρ values are 0.01, 0.1 and 0.5.

Definition 3 (ρ-RDP): Given a relation R with N records and M attributes, the problem of privacy-constrained RDP with privacy safeness ρ is to generate a relation $R'$ from R such that $R'$ contains N data records $r_i'$ with M attribute values $r_{ij}'$ per record, where $r_{ij}' = r_{ij} + X$, X is a noise in the range [-1, 1], and $\mathrm{dissimilarity}(r_i, r_i') > \rho$. The noise X is generated by a function f that produces M noises for each record based on a specific distribution.

ρ-RDP randomizes the attribute values of the original data records into those of the result relation in a one-to-one way such that all records are "ρ-safe." Unlike the basic RDP, the amount of privacy protection (i.e., privacy riskiness) is quantitatively guaranteed via the ρ-safeness measure.

3. Privacy-Constrained Data Perturbation Methods

Algorithm 1 shows the ρ-RDP framework satisfying our proposed privacy constraint, ρ-safeness. The algorithm adds noise to each attribute value (Line 5) and checks the privacy constraint (Lines 7–8; if necessary, a domain integrity check and relevant conversion can be added here according to the target applications).
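To make the ρ-safeness check concrete, the following is a minimal Python sketch of the dissimilarity measure and the constraint test of Definition 2; the function names dissimilarity and is_rho_safe are our own illustrative choices and are not taken from the released implementation.

```python
import numpy as np

def dissimilarity(r, r_prime):
    """Normalized Manhattan (L1) distance between an original record r
    and its perturbed version r_prime, as in Definition 2."""
    r, r_prime = np.asarray(r, dtype=float), np.asarray(r_prime, dtype=float)
    return np.abs(r - r_prime).sum() / r.size

def is_rho_safe(r, r_prime, rho):
    """A perturbed record is rho-safe if its dissimilarity to the original exceeds rho."""
    return dissimilarity(r, r_prime) > rho

# Examples from the text:
print(dissimilarity([0.3], [0.7]))                    # approx. 0.4
print(dissimilarity([0.7, 0.1], [0.1, 0.4]))          # approx. 0.45
print(is_rho_safe([0.7, 0.1], [0.1, 0.4], rho=0.4))   # True
```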
If the perturbed value does not satisfy the privacy constraint, we retry the perturbation. For the noise generation, we use the value distortion methods in [24] and consider the following two approaches:

1. Uniform noise: $|X| \sim U(0, 1)$
2. Laplace noise: $X \sim \mathrm{Lap}(\rho, 0.1)$

The former is a very straightforward approach for generating additive noise: a value in the range [0, 1] is randomly chosen and added to or subtracted from the original attribute value $r_{ij}$. However, although we need noise to provide ρ-safeness, too much noise is not desirable, since it degrades the data utility of the released relation. Under the restriction of satisfying the required ρ-safeness (i.e., $\sum_{j=1}^{M} |r_{ij} - r_{ij}'| > \rho \cdot M$), we should minimize the error (i.e., the difference between the original and perturbed values). In order to provide an adequate amount of noise, we devise the second method: noise is generated from the Laplace distribution with mean ρ (we use the scale parameter 0.1 for simplicity) and added to or subtracted from the original attribute value with a half chance. This is based on the observation that the Chebyshev ($L_\infty$) distance is equal to $\max_j |r_{ij} - r_{ij}'|$. That is, in order to guarantee the required privacy constraint with less error, we try to add random noises whose absolute values are as close to ρ as possible and distribute them evenly over the M attributes.

4. Performance Evaluation

In this section, we carry out experiments for performance evaluation and describe the results. We implemented our method using Python (version 3.9.16) and executed the experiments on a computer with 96 GB of memory and an Intel Core i9-10900 (2.80 GHz) processor (our implementation is available from the GitHub repository [25]). For comparison with the conventional method, we used ARX [26-28], which is recognized as a de facto standard, open-source software for anonymizing personal data.

Our ρ-RDP framework per se releases ρ-safe data records, irrespective of noise type. In addition, it incurs a reasonable processing overhead of $O(N \cdot M)$ complexity. Therefore, in this section, we focus on evaluating the data utility of data released via our methods and via the conventional data perturbation method with random noise, i.e., the basic RDP in Definition 1 (note that the data records perturbed via the conventional method cannot guarantee our proposed privacy measure, ρ-safeness). We have also included k-anonymity in the comparison to evaluate the data utility of the generalization-based approach against the perturbation-based approach.

Firstly, we compare statistics of the original and perturbed relations using synthetic data with various settings (N = 1,000 and 10,000; M = 10 and 20; ρ = 0.01 and 0.1). The synthetic data used in our first and second evaluations is generated from a uniform distribution U(0, 1). Table 1 shows the statistics obtained from the conventional method, k-anonymity with k=3 and k=5, and our proposed methods. The results are averaged over 20 trials, where the measures we use are the mean absolute error (MAE), the variance of absolute errors (VAE), and the Jensen-Shannon distance (JSD) between the original and de-identified data distributions, $\mathrm{JSD}(r_{i*}, r_{i*}')$. JSD provides a means of quantifying the similarity between two probability distributions.
When computing JSD, we bound the perturbed values to the range [0, 1]. Within Table 1, the minimum values are highlighted in bold, meaning that the method associated with these bolded values is the closest to the original data in terms of statistics.

Table 1.
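For reference, the three utility measures can be computed along the following lines. This is a minimal sketch under our assumptions: the helper name utility_stats is ours, and we interpret the reported JSD as the mean per-record Jensen-Shannon distance over the M attribute values, computed with scipy.spatial.distance.jensenshannon (which normalizes its input vectors).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def utility_stats(R, R_pert):
    """MAE, VAE, and mean per-record Jensen-Shannon distance between an
    original relation R and its perturbed version R_pert (both N x M arrays)."""
    R, R_pert = np.asarray(R, dtype=float), np.asarray(R_pert, dtype=float)
    abs_err = np.abs(R - R_pert)
    mae = abs_err.mean()   # mean absolute error
    vae = abs_err.var()    # variance of absolute errors
    # Bound perturbed values to [0, 1] before comparing distributions, as in the text.
    R_bounded = np.clip(R_pert, 0.0, 1.0)
    jsd = np.mean([jensenshannon(R[i], R_bounded[i]) for i in range(len(R))])
    return mae, vae, jsd
```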
Since shuffling the data records after the perturbation ensures plausible deniability for individuals [24], we need not care about coincidental matches, that is, $r_i' = r_j$ $(i \neq j)$. Hence, the privacy risk measures used in [14,18], such as the hitting rate or the distance to the closest record (DCR), are not considered here.

Secondly, we compare the performance of linear queries. A linear query computes a linear combination of the counts within a data vector, encompassing various aggregation tasks such as counting and summing. Since the linear query can represent a wide range of common aggregation queries, it is one of the most popularly used query forms in data analyses [5]. We use the four queries $L_1$–$L_4$ below, and Fig. 4 depicts the query result error compared with the original data.
$$\begin{aligned}
& (L_1)\;\; r_{i,1} + r_{i,2} + r_{i,3} + \cdots + r_{i,M-1} + r_{i,M} \\
& (L_2)\;\; r_{i,1} - r_{i,2} + r_{i,3} - \cdots + r_{i,M-1} - r_{i,M} \\
& (L_3)\;\; r_{i,1} + 2 \cdot r_{i,2} + 3 \cdot r_{i,3} + \cdots + (M-1) \cdot r_{i,M-1} + M \cdot r_{i,M} \\
& (L_4)\;\; r_{i,1} - 2 \cdot r_{i,2} + 3 \cdot r_{i,3} - \cdots + (M-1) \cdot r_{i,M-1} - M \cdot r_{i,M}
\end{aligned}$$

In Table 1 and Fig. 4, the conventional perturbation methods with random noise do not provide accurate data compared with our methods, since they do not add noise adequately for the given privacy requirement. In the case of k-anonymity, the accuracy is comparable to that of our proposed methods. Regarding our methods, the Laplace noise method outperforms the uniform noise method, because the amount of noise added by the former is just enough to satisfy the given privacy requirement rather than excessive. The distribution of the source data is also preserved well, especially by the Laplace method. Obviously, perturbation with a small ρ provides more accuracy.

Lastly, we evaluate the data utility with respect to aggregation query processing using a real dataset (UCI Adult data [29]) in Table 2. The Adult dataset consists of information on 32,561 individuals. Each record in the dataset contains a total of 15 attributes: six numeric attributes and nine categorical attributes. In this experiment, only the numeric attributes are perturbed, and the domain is set to [min-value, max-value] for each attribute. For the case of k-anonymity, certain numeric attributes, including age, education-num, capital-gain, capital-loss, and hours-per-week, are designated as quasi-identifiers, and the data was anonymized using the data anonymization tool ARX [26]. The results of the SQL queries are displayed in Table 2, where the bolded values indicate the highest similarity with the original data. We use the following SQL queries $A_1$–$A_6$:

($A_1$) select avg(age) from adult where race = 'White' and sex = 'Female';
($A_2$) select count(*) from adult where marital-status = 'Divorced' and workclass = 'Private' and age > 40;
($A_3$) select avg(hours-per-week) from adult where education = 'Doctorate' and occupation = 'Prof-specialty';
($A_4$) select count(*) from adult where education-num < 10 and age > 30 and relationship = 'Unmarried';
($A_5$) select avg(capital-loss) from adult where education = 'Bachelors' and age < 40;
($A_6$) select count(*) from adult where education-num > 9 and hours-per-week < 45 and salary = '>50K'.

Table 2.
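As an illustration, the following is a minimal pandas sketch of how a query such as $A_1$ could be evaluated against both the original and the perturbed Adult data. The DataFrame variables adult and adult_pert and the helper query_a1 are hypothetical names of ours, not part of the released implementation, and we assume the categorical values have been trimmed of surrounding whitespace.

```python
import pandas as pd

def query_a1(df: pd.DataFrame) -> float:
    """select avg(age) from adult where race = 'White' and sex = 'Female'"""
    selected = df[(df["race"] == "White") & (df["sex"] == "Female")]
    return selected["age"].mean()

# Hypothetical usage: 'adult' holds the original data, 'adult_pert' the
# perturbed release; the query error is the absolute difference.
# error_a1 = abs(query_a1(adult) - query_a1(adult_pert))
```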
In the case of the third experiment, the performance results of our methods and the conventional method are not clearly distinguishable, since the queries return aggregation results and all methods are based on probability distributions with mean 0; nevertheless, our methods (especially with the Laplace noise) outperform the conventional one for the same reason as in the first and second experiments. When applying k-anonymity, certain attributes experience a substantial loss of data, causing query results that differ significantly from the original ones compared to our methods. For example, in query $A_5$, the values of the attribute capital-loss undergo significant suppression, leading to notable differences between the query result from the anonymized data and that from the original data.

5. Conclusion

In this paper, we proposed a syntactic privacy criterion, "ρ-safeness," for data perturbation, through which relational data can be perturbed in a quantitatively controlled, privacy-preserving way. To the best of our knowledge, there were no perturbation-based data anonymization methods supporting the quantitative control of privacy. We also devised two perturbation methods that guarantee the given privacy criterion ρ. Through experiments with synthetic and real data, we empirically evaluated the performance of the proposed methods in comparison with the conventional generalization-based method. Based on the results, we found that the proposed method effectively anonymizes relational data in an efficient manner.

Biography

Deokyeon Jang
https://orcid.org/0009-0006-8912-0421
He received the B.S. degree from the Department of Computer Science and Engineering, Korea University, Seoul, South Korea, in 2023, where he is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering. His research interests include data privacy and machine learning for databases.

Biography

Minsoo Kim
https://orcid.org/0000-0003-3450-9721
He received the B.S. degree in computer and information science from Korea University, Sejong, South Korea, in 2015, and the Ph.D. degree from the Department of Computer Science and Engineering, Korea University, Seoul, South Korea, in 2023. His research interests include array databases and distributed/parallel processing of large-scale data.

Biography

Yon Dohn Chung
https://orcid.org/0000-0003-2070-5123
He received the B.S. degree in computer science from Korea University, Seoul, South Korea, in 1994, and the M.S. and Ph.D. degrees in computer science from KAIST, Daejeon, South Korea, in 1996 and 2000, respectively. He was an Assistant Professor with the Department of Computer Engineering, Dongguk University, Seoul, from 2003 to 2006. He joined the Department of Computer Science and Engineering, Korea University, as a faculty member in 2006, where he is currently a professor. His research interests include database systems, spatial databases, and data privacy.

References