Privacy-preserving distributed clustering
 Zekeriya Erkin^{1},
 Thijs Veugen^{1, 2},
 Tomas Toft^{3} and
 Reginald L Lagendijk^{1}
DOI: 10.1186/1687-417X-2013-4
© Erkin et al.; licensee Springer. 2013
Received: 23 July 2013
Accepted: 31 October 2013
Published: 9 November 2013
Abstract
Clustering is an important tool in data mining and is widely used in online services in medical, financial and social environments. The main goal of clustering is to create sets of similar objects in a data set. The data set to be clustered can be owned by a single entity, or in some cases, information from different databases is pooled to enrich the data so that the merged database can improve the clustering effort. However, in either case, the content of the database may be privacy sensitive and/or commercially valuable, such that the owners may not want to share their data with any other entity, including the service provider. Such privacy concerns lead to trust issues between entities, which clearly damage the functioning of the service and may even block cooperation between entities with similar data sets. To enable joint efforts with private data, we propose a protocol for distributed clustering that limits information leakage to the untrusted service provider that performs the clustering. To achieve this goal, we rely on cryptographic techniques, in particular homomorphic encryption, and improve on the state of the art in processing encrypted data in two ways: we take the distributed structure of the system into account, and we reduce computation and communication costs by data packing. While our construction can be easily adjusted to a centralized or a distributed computing model, we rely on a set of particular users that help the service provider with computations. Experimental results indicate that the presented work is an efficient way of deploying a privacy-preserving clustering algorithm in a distributed manner.
1 Introduction
As a powerful tool in data mining, clustering is widely used in several domains, including finance, medicine and social networks, to group similar objects based on a similarity metric. In many cases, the entity that performs the clustering operation has access to the whole database, while in some other cases, databases from different resources are merged to improve the performance of the clustering algorithms. A number of examples can be given as follows:

Social networks. Users are clustered by the service provider based on their profile data. The clustering result can be used for creating self-help groups or generating recommendations. Obviously, in many cases, users would not like to share their profile data with anyone except the people in the same group.

Banking. Several banks might want to merge their customer databases for credit card fraud detection or to classify their users based on past transactions to identify profitable customers.

Medical domain. Different holders of medical databases might be willing to pool their data for medical research, either for scientific, economic or marketing reasons [1]. Another case can be the Centre for Disease Control that would like to identify trends based on data from different insurance companies [2].
However, regardless of the application setting with one or more data resources, in many cases, data are privacy sensitive or commercially valuable: the data owners might not want to reveal their sensitive data to the service provider, for instance in social networks, as the data can be processed for other purposes, transferred to other third parties without user consent or stolen by outsiders. In the case of multiple data resources from different entities as in banking, the data owners might not want to take risks in sharing their customer data with other competitors. Clearly, such privacy-related concerns might result in several drawbacks: people not joining social networks or database owners preferring to process data on their own.
In this paper, we focus on a setting with a central entity that provides services based on clustering of multiple users, each one having a private preference vector. Our goal is to prevent the service provider from learning the privacy-sensitive data of the users, without substantially degrading the performance of the clustering algorithm. Thus, we focus on the following:

Privacy. To protect the privacy of users, we encrypt the preference vectors and provide only these encrypted vectors to the service provider, who does not have the decryption key. However, it is still possible for the service provider to cluster people using our cryptographic protocol. Throughout the protocol, the preferences, intermediate cluster assignments and the final results of the clustering algorithm are all encrypted and thus unknown to the service provider or any other person in the network. This approach, which has proved itself useful in the field of privacy-enhanced technologies [3], guarantees privacy protection to the users of the social network without disrupting the service.

Performance. While processing encrypted data as explained above provides privacy protection, it also comes with a price: expensive operations on the encrypted data, in terms of computational and communication costs. To improve the efficiency, we approach this challenge in two directions: (1) custom-tailored cryptographic protocols that use data packing and (2) a setting in which the service provider creates user sets and assigns additional responsibilities to one of the users in each set to be able to use less expensive cryptographic subprotocols for the computations, avoiding expensive computations such as the ones in [4]. Moreover, having such a construction, centralized or distributed clustering scenarios can be realized, as discussed further below.
The service provider is defined as the entity that wants to cluster users based on their private preference vectors. Each user also participates in the clustering computations, and a set of users, named helper users, are chosen randomly to perform additional tasks. As the number of user sets increases, it becomes easier to parallelize operations and thus achieve better performance. However, this setting with one set of users and a single helper user can also be considered to realize clustering algorithms for the scenarios with multiple entities, each having a private database: users belong to different entities, and the helper user becomes a privacy service provider [4]. Thus, our construction can easily be reshaped according to the application.
In this paper, we choose the K-means clustering algorithm for finding groups of similar people. We choose the K-means algorithm since it is known to be a very efficient data mining tool that is widely used in practice, as its implementation is simple and the algorithm converges quickly [5]. Our goal is to provide an efficient, privacy-preserving version of the clustering algorithm. Even though the idea of processing encrypted data for clustering has been addressed before in the literature, its realization in an efficient way has been a challenge. To improve the state of the art, we contribute in the following aspects:

We propose a flexible setting, which can be interpreted as a centralized or a distributed environment with several servers. This enables a wide variety of business models.

We build our system based on the semi-honest security model, in which we assume that involved parties are following the protocol steps. For the application settings, where the central entity is expected to go beyond the bounds of the protocol, our protocol can be tweaked to work in the malicious model with a cost of increased computation and communication [6]. However, we provide an alternative that is in between: we distribute trust among a number of helper users instead of relying on a single party. Especially in a setting with distributed databases, this substantially limits the power of a malicious central party.

We exploit the construction with helper users to avoid more expensive cryptographic protocols such as secure comparisons [4], achieving significant performance gain compared to related work in the field.

We employ custom-tailored cryptographic protocols with data packing [4, 7, 8] to reduce the communication and computation costs of using homomorphic encryption.
We emphasize that our proposal is an improvement of the ideas from [9] and [10] in terms of efficiency and requires reasonable security and business assumptions. Our main contribution is to show that realizing privacy-preserving K-means clustering with existing tools is feasible to deploy. To support our claim of achieving high efficiency, we also give the test results of our proposal on a synthetic data set of 100,000 users.
The rest of the paper is organized as follows: Section 2 gives an overview of the state of the art. Section 3 briefly introduces the K-means clustering algorithm and homomorphic encryption and presents our security assumptions and the notation used throughout the paper. Section 4 describes the privacy-preserving version of the K-means clustering algorithm in detail. Section 5 discusses the security aspects of our proposal, and Section 6 presents the complexity analysis and the numerical test results. Section 7 gives a discussion on the practicality of deploying our protocol in real life. Finally, Section 8 concludes this paper.
2 Related work
The idea of privacy-preserving data mining was introduced by Agrawal and Srikant [11] and Lindell and Pinkas [1]. In their work, the aim is to extract information from users' private data without having to reveal individual data items. Since then, a number of privacy protection mechanisms for finding similar items or people have been proposed in [2, 12–16], which address the widely applied K-means clustering algorithm. The proposed methods apply either cryptographic tools [2, 12–14] or randomization techniques from signal processing [15, 16] to protect the private data, which are either horizontally or vertically partitioned.
In general, the cryptographic proposals are based on secure multiparty computation techniques [6], which make any two-party privacy-preserving data mining problem solvable, for instance by using Yao's secure circuit evaluation method [17]. Even though Yao's method can be used to implement any function in a privacy-preserving manner, heavy computation or communication costs in such circuits make the solutions feasible only for small circuit sizes. However, algorithms like clustering require large circuit sizes for realization. In [2, 12, 14], the authors attempt to solve the clustering problem in a two-party setting which is suitable for deploying techniques based on secret sharing. Apart from the difference in settings, [2] suffers from a problem during the centroid update procedure where an integer division is misinterpreted as multiplication by the inverse, which is not correct, as explained with an example in [12]. On the other hand, [13] has a multiuser setting but requires three non-colluding entities for the clustering algorithm, and the authors overcome the problem of updating centroids by allowing users to perform the division algorithm locally. In order to do that, the users learn the intermediate cluster assignments, meaning more information leakage.
As a different approach from using secure multiparty computation techniques, Oliveira and Zaïane [15, 16] suggested using techniques from signal processing based on randomization and geometric transformation of data to hide private data of individuals. In these works, the privacy of the users is achieved by perturbing their data in a predefined way. Then, the data is made publicly available for processing. This approach is fast since the operations can be handled by each user simultaneously. However, data perturbation leads to unavoidable data leakage [18, 19].
In [9], Erkin et al. proposed a method based on encryption and secure multiparty computation techniques for clustering users in a centralized system. In that work, Erkin et al. kept the preference vector of each user in the system hidden from all other users and the service provider and revealed the centroid locations to the service provider to achieve better performance in terms of runtime and bandwidth. The proposed method requires the participation of all users, and the average communication and computation cost is high due to homomorphic encryption. In [10], Beye et al. proposed an improved version of K-means clustering based on a three-party setting. In that work, users' private data are stored by one party and the decryption key by another. A third party helps with the computations. Due to this three-party setting, Beye et al. proposed a highly efficient algorithm based on garbled circuits [20] that does not require oblivious transfer protocols [6]. While the overall system is highly efficient, the authors rely on trusting three separate parties that may not collude.
3 Preliminaries
In this section, we briefly introduce the K-means clustering algorithm, present our security assumptions, describe homomorphic encryption and introduce the notation used throughout the paper.
3.1 K-means clustering
Data clustering is a common technique for statistical data analysis where data is partitioned into smaller subgroups with their members sharing a common property [5]. As a widely used technique, K-means clusters data into K groups using an iterative algorithm. Particularly, each user i is represented as a point in an R-dimensional space, denoted with P_{ i } = (p_{(i,1)},…,p_{(i,R)}), and assigned to the closest cluster among K clusters, {C_{1},…,C_{ K }}. The algorithm starts with choosing the constant value K, which is the number of clusters in the data set. Each cluster k is represented by its centre (also named centroid), C_{ k } = (c_{(k,1)},…,c_{(k,R)}), which is initially a random point. In every iteration, the distances D_{(i,k)} between the i-th user P_{ i } and cluster centre C_{ k } for k ∈ {1,…,K} are calculated and each user is assigned to the closest cluster. Once every user is associated with a cluster, centroid locations are recalculated by taking the arithmetic mean of the users' locations within that cluster. For the next iteration, the distances are recalculated and users are assigned to the closest cluster. This procedure, given in Algorithm 1, is repeated until either a certain number of iterations is reached or centroid locations converge.
Algorithm 1 The K-means clustering algorithm.
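The iterative procedure described above can be sketched in a few lines of Python. This is only an illustration of plain (non-private) K-means under stated assumptions: the function name `kmeans`, the initialisation from randomly sampled data points, the fixed seed and the rule of keeping the old centroid for an empty cluster are choices made for this sketch and are not taken from the paper.

```python
import random

def kmeans(points, K, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are K randomly sampled data points.
    centroids = [tuple(p) for p in rng.sample(points, K)]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the centroid with the smallest squared distance.
        new_assignment = [
            min(range(K),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_assignment == assignment:  # assignments stable: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the arithmetic mean of its members.
        for k in range(K):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], K=2)` places the first two points in one cluster and the last two in the other.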
3.2 Security assumptions
We consider the semi-honest security model, which assumes that the involved parties are honest and follow the defined protocol steps but are also curious to obtain more information. Therefore, the parties can store previously exchanged messages to deduce more information than they are entitled to. This model does not consider any malicious activity by the parties such as manipulating the original data.
We assume that the service provider creates groups of people randomly to help in computations, in which a number of people take more responsibility in computations. Our assumption is that these randomly chosen users do not collude with the service provider in revealing other users' personal data. The risk of information leakage by such parties is reduced, as explained later, by randomly choosing such helper users in each iteration of the algorithm. Note that as the number of helper users increases, the security of the system improves since the trust is divided among multiple entities, rather than one single entity.
Note that even though we assume that the service provider acts according to the protocol description, it is possible that the service provider can create dummy users and assign them as helper users. There are two approaches to cope with this problem. Firstly, each user participating in the computations can be asked to use certificates to prove their identity. Secondly, the helper users can be chosen truly at random by deploying another subprotocol. For this, the ideas from [21] can be used. Furthermore, in the case of malicious acts requiring input verification by the users, techniques known from cryptography like commitment schemes and zero-knowledge proofs can be deployed at the cost of increased complexity.
Finally, we also assume that all underlying communication channels are secure: both integrity and authentication of all messages are obtained via standard means, e.g. IPSec or SSL/TLS [22].
3.3 Homomorphic encryption
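The Paillier cryptosystem is additively homomorphic: multiplying two cipher texts yields an encryption of the sum of the underlying messages, and raising a cipher text to the power of a public constant yields an encryption of the message multiplied by that constant. The encryption of a message m is computed as
$[[m]] = g^{m} \cdot r^{n} \bmod n^{2}$,
where n is a product of two large prime numbers, g generates a subgroup of order n and r is a random number in ${\mathbb{Z}}_{n}^{\ast}$. Note that the message space is ${\mathbb{Z}}_{n}$ and the cipher text space is ${\mathbb{Z}}_{{n}^{2}}^{\ast}$. For decryption and further details, we refer readers to [23].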
In addition to the homomorphic property, Paillier is semantically secure. Informally, this means that one cannot distinguish between encryptions of known messages and random messages. This is achieved by having multiple possible cipher texts for each plain text and choosing randomly between these. This property is required as the messages to be encrypted in this paper are from a very small range compared to the message space of the cryptosystem. Throughout this paper, we denote a Paillier encryption of a message m by [ [m] ]_{ pk } for the sake of simplicity.
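To make the additive homomorphism concrete, the following toy Python sketch implements textbook Paillier with very small primes. It is an illustration only and not the implementation used in the paper: the function names, the choice g = n + 1 (which has order n modulo n²) and the tiny default primes are assumptions made for readability, and the key size is far too small for any real use.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy key generation; the tiny default primes are for illustration only."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                 # assumption: the common choice g = n + 1
    mu = pow(lam, -1, n)      # decryption constant, valid for g = n + 1
    return (n, g), (lam, mu)

def encrypt(pk, m, rng=random):
    n, g = pk
    while True:               # pick r from Z_n^*
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

pk, sk = keygen()
n2 = pk[0] ** 2
a, b = encrypt(pk, 12), encrypt(pk, 30)
sum_plain = decrypt(pk, sk, (a * b) % n2)      # product of cipher texts -> sum
scaled_plain = decrypt(pk, sk, pow(a, 5, n2))  # exponentiation -> scalar multiple
```

Here `sum_plain` equals 12 + 30 and `scaled_plain` equals 5 · 12, which are the two operations the protocol relies on when processing encrypted preference vectors.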
3.4 Notation
List of symbols
Symbol  Description
M  Number of groups created by the service provider
N_{ u }  Total number of users in the social network
N_{ g }  Number of users in each group
K  Number of clusters
R  Dimension of preferences
G_{ m }  Group m
H_{ m }  Helper user of group m
P_{ i }  Preference set of user i
p_{(i,r)}  r-th coordinate of P_{ i }
C_{ k }  Cluster centroid k
c_{(k,r)}  r-th coordinate of cluster k
D_{(i,k)}  Euclidean distance between P_{ i } and C_{ k }
γ_{(i,k)}  Binary value for the k-th cluster: 1 for the closest cluster, 0 otherwise
w  Bit length of p_{(i,r)} and c_{(k,r)}
ℓ  Compartment size of ${\tilde{C}}_{r}$ in bits
Δ  Compartment size of ${\tilde{\Gamma}}_{i}$ in bits
σ  Statistical security parameter
n  Paillier modulus
α,β,ϕ,ψ,ρ  Random values
[ [·] ]_{ H }  Encryption of a plain text using the public key of H_{ m }
${\tilde{P}}_{i}$  Packed preferences of user i
${\tilde{C}}_{r}$  Packed centroid for dimension r
${\tilde{D}}_{(i,k)}$  Packed Euclidean distance between P_{ i } and C_{ k }
${\tilde{\Gamma}}_{i}$  Packed vector of γ's for user i
${\tilde{S}}_{(i,r)}$  Packed partial input of user i for dimension r
${\tilde{P}}_{{\Sigma}_{(m,r)}}$  Packed sum of preferences of users in G_{ m }
${\tilde{P}}_{{\Sigma}_{r}}$  Packed sum of preferences of all users for dimension r
${\tilde{\Gamma}}_{{\Sigma}_{m}}$  Packed total number of users in each cluster for G_{ m }
${\tilde{\Gamma}}_{\Sigma}$  Packed total number of users in each cluster
4 Privacypreserving user clustering
The protocol involves three roles:
1. The service provider has a business interest in providing services to users in a social network.
2. A user participates in a social network and would like to find other, similar users based on his/her preferences.
3. A helper is a user who helps the service provider with the computations.
Each iteration of the clustering protocol consists of the following steps:
1. The service provider creates M groups of users and, for each group G_{ m } with m ∈ {1,…,M}, selects a random user H_{ m } for each iteration, as illustrated in Figure 1. We assume that there are N_{ g } users in each group and the total number of users is N_{ u } = N_{ g } · M.
2. The service provider informs every user in G_{ m } about the public key to be used for encryption in that iteration, which is the public key of H_{ m }.
3. The service provider sends the encrypted cluster centroids to the users. Each user computes K encrypted Euclidean distances, one for every cluster centroid, and sends them to the service provider.
4. The service provider interacts with H_{ m } to obtain an encrypted vector for each user, whose elements indicate the closest cluster to that user. Then, the service provider sends this vector to the user.
5. Each user computes his/her partial input for updating the centroid locations using the encrypted vector and sends it to the service provider.
6. The service provider aggregates the partial inputs from all users in G_{ m } and interacts with H_{ m } to obtain the clustering result of G_{ m } in plain text.
7. Finally, the service provider combines the clustering results from all groups and obtains the new centroids for that iteration.
Steps 1 to 7 are repeated either for a certain number of iterations or up to a point where the cluster centroids do not change significantly. After the final iteration, the service provider runs a protocol with H_{ m } to send the index of the closest cluster to each user. Hereafter, we describe the above procedure in detail for a single group.
4.1 Steps 1 and 2: grouping and key distribution
The first step of the K-means clustering algorithm is for the service provider to choose K points in an R-dimensional space as the initial cluster centroids. Next, the service provider creates M groups consisting of N_{ g } users each and picks a random user from every group who will help the service provider with the computations for the current iteration. Note that in order for the helper users to be clustered as well, the service provider treats each of them as an ordinary user in a different group. The service provider then disseminates the public key of the helper user to the other users in the group.
4.2 Step 3: computing the encrypted distances
The approach of [9] for computing the encrypted distances uses the homomorphic property of the cryptosystem without any optimization and therefore introduces a considerable computational overhead for user i. More precisely, that computation requires K(R + 1) encryptions by the service provider and one encryption, KR exponentiations and K(R + 1) multiplications modulo n^{2} by each user. When these expensive operations are repeated for N_{ u } users, the clustering algorithm becomes considerably expensive and thus impractical in real life.
To reduce this cost, the K centroid coordinates of each dimension r are packed into a single value ${\tilde{C}}_{r}$. The bit length of each ${\tilde{C}}_{r}$ is now K × ℓ bits. We set the size of each compartment to ℓ = 2w + ⌈log R⌉ bits, with w being the bit length of c_{(k,r)}, to accommodate the distances computed in the subsequent steps, which are the sum of R positive numbers of size 2w bits.
User i then sends $[[\tilde{D}_{i}^{2}]]_{H}$ to the service provider.
Remark 1. Each squared distance $D_{(i,k)}^{2}$ consists of at most ℓ bits. We assume that K · ℓ is smaller than the bit length of n, the modulus that defines the message space of the Paillier encryption scheme, meaning that all of the K distances can be packed in one encryption. Note that |p_{(i,r)} − c_{(k,r)}| ≤ max(p_{(i,r)},c_{(k,r)}), and thus, $D_{(i,k)}^{2} \le R \cdot 2^{2w}$.
Notice that due to the way we compute distances using data packing, there is a gain by a factor of K in the number of operations on the encrypted data compared to [9]. The computation of the packed encrypted distances only requires R + 1 encryptions by the service provider and one encryption, R exponentiations and R + 1 multiplications by each user.
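The effect of data packing on the distance computation can be illustrated on plain integers. The sketch below is not the encrypted protocol: it only shows that, with compartments of ℓ = 2w + ⌈log R⌉ bits, one linear expression per dimension yields all K squared distances at once. The function names and the packing order (cluster k in the k-th lowest compartment) are assumptions made for this sketch. Under encryption, the additions would become cipher text multiplications and the multiplications by p_{(i,r)} would become exponentiations.

```python
def pack(values, ell):
    """Place values[k] in the k-th ell-bit compartment of one integer."""
    return sum(v << (ell * k) for k, v in enumerate(values))

def unpack(packed, ell, count):
    """Read back `count` compartments of ell bits each."""
    mask = (1 << ell) - 1
    return [(packed >> (ell * k)) & mask for k in range(count)]

def packed_squared_distances(p, centroids, w):
    """All K squared distances of point p, packed in a single integer."""
    R, K = len(p), len(centroids)
    ell = 2 * w + (R - 1).bit_length()   # ell = 2w + ceil(log2 R)
    ones = pack([1] * K, ell)            # a 1 in every compartment
    total = 0
    for r in range(R):
        c_r = pack([c[r] for c in centroids], ell)          # packed c_(k,r)
        c_sq_r = pack([c[r] ** 2 for c in centroids], ell)  # packed c_(k,r)^2
        # (p - c)^2 = p^2 - 2pc + c^2, evaluated for all K compartments at once
        total += p[r] ** 2 * ones - 2 * p[r] * c_r + c_sq_r
    return total, ell

packed, ell = packed_squared_distances((1, 5, 2), [(0, 0, 0), (1, 5, 3), (7, 7, 7)], w=3)
distances = unpack(packed, ell, 3)
```

With w = 3 and R = 3 the compartment size is 8 bits, and `distances` holds the three squared distances of the point to the three centroids.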
4.3 Step 4: finding the closest cluster
After having obtained $[[\tilde{D}_{i}^{2}]]_{H}$, the service provider interacts with H_{ m } to find out the minimum distance, hence the closest cluster. To achieve this, the service provider sends the packed distances to H_{ m }, who has the decryption key. H_{ m } decrypts the cipher text and obtains the packed distances in clear. Note that H_{ m } does not know the identity of the owner of the computed distances. After decryption, H_{ m } unpacks the distances and creates a vector (γ_{(i,1)},γ_{(i,2)},…,γ_{(i,K)}), where γ_{(i,k)} is 1 if and only if $D_{(i,k)}^{2}$ is the minimum distance (so user i is in cluster number k), and 0 otherwise. Before sending these binary values to the service provider, H_{ m } encrypts them using his/her public key.
The service provider then packs the K encrypted binary values into a single encryption $[[\tilde{\Gamma}_{i}]]_{H}$, in which each γ_{(i,k)} occupies its own compartment of Δ bits, where Δ = w + ⌈log N_{ u }⌉, N_{ u } being the total number of users in the system. This gives one packed $\tilde{\Gamma}_{i}$ with a compartment size of w + ⌈log N_{ u }⌉ bits. The service provider then sends the encrypted $\tilde{\Gamma}_{i}$ to user i.
Remark 2. Notice that in the above procedure, H_{ m } will learn how many users in his/her group belong to each cluster. To hide this information from H_{ m }, the service provider uses a different permutation, π_{ i }, independently chosen for each user to shuffle the order of clusters during the creation of the $\tilde{C}_{r}$ values. The order is corrected when the service provider applies the inverse permutation, $\pi_{i}^{-1}$, on the received [ [γ_{(i,k)}] ]'s. As this permutation is necessary and can only be done by the service provider, H_{ m } cannot apply data packing himself, which would simplify the computations otherwise.
Remark 3. We assumed in Equation 8 that packing K γ_{(i,k)}'s, each within a compartment size of w + ⌈log N_{ u }⌉ bits, is possible. This is a valid assumption in practical cases since the Paillier modulus, even at a weak security level, is 1,024 bits. Given that K = 10 and w = 3, N_{ u } can be as large as 2^{99}.
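The interplay between the permutation of Remark 2 and the helper's minimum search can be sketched on plain values as follows. This is a simplified simulation, not the encrypted protocol: the permutation is applied directly to the distances here, whereas in the text it is applied when the packed centroids are created, and all function names are assumptions of this sketch.

```python
import random

def helper_closest(distances):
    """Helper side: one-hot vector marking the minimum of the distances it sees."""
    best = min(range(len(distances)), key=distances.__getitem__)
    return [1 if k == best else 0 for k in range(len(distances))]

def closest_with_permutation(distances, rng):
    """Service-provider side: hide the cluster order from the helper."""
    K = len(distances)
    perm = list(range(K))
    rng.shuffle(perm)                                    # pi_i, fresh for each user
    permuted = [distances[perm[j]] for j in range(K)]    # what the helper sees
    gamma_permuted = helper_closest(permuted)
    gamma = [0] * K
    for j in range(K):
        gamma[perm[j]] = gamma_permuted[j]               # apply the inverse permutation
    return gamma

def pack_gamma(gamma, delta):
    """Pack the binary values into compartments of delta bits each."""
    return sum(g << (delta * k) for k, g in enumerate(gamma))

gamma = closest_with_permutation([30, 1, 65], random.Random(1))
```

Whatever permutation is drawn, `gamma` marks the second cluster, which has the smallest distance. The arithmetic of Remark 3 can be checked in the same way: a 1,024-bit modulus split into K = 10 compartments leaves 1024 // 10 = 102 bits per compartment, and subtracting w = 3 leaves 99 bits for ⌈log N_{ u }⌉.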
4.4 Step 5: computing partial inputs
Using the homomorphic property, user i raises the encrypted packed vector to the power of each of his/her preference values and obtains $[[\tilde{S}_{(i,r)}]] = [[\tilde{\Gamma}_{i}]]^{p_{(i,r)}}$ for r ∈ {1,…,R}. The result of this operation is R encryptions, each of which contains K packed values. Each compartment of the encryptions contains the product of γ_{(i,k)} and p_{(i,r)} for k ∈ {1,…,K}. It is clear that K − 1 compartments consist of zeros and only the compartment with the index of the closest cluster contains exactly p_{(i,r)}. User i, finally, sends $[[\tilde{S}_{(i,r)}]]$ for r ∈ {1,…,R} to the service provider.
The service provider multiplies the cipher texts received from the users in G_{ m }, which yields the encrypted sums $[[\tilde{P}_{\Sigma_{(m,r)}}]]$ for r ∈ {1,…,R}. This results in R encryptions, one for each dimension, each of which has K packed sums of preferences of users in G_{ m }.
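The partial inputs and their aggregation can again be illustrated on plain integers. The sketch below replaces the homomorphic operations by ordinary arithmetic (an exponentiation of a cipher text becomes a multiplication, a product of cipher texts becomes a sum); the function names and the example values are assumptions made for illustration.

```python
def pack(values, delta):
    """Place values[k] in the k-th delta-bit compartment of one integer."""
    return sum(v << (delta * k) for k, v in enumerate(values))

def unpack(packed, delta, count):
    mask = (1 << delta) - 1
    return [(packed >> (delta * k)) & mask for k in range(count)]

def partial_inputs(p, gamma_packed):
    # Plain-text counterpart of raising the encrypted packed vector to p_(i,r).
    return [p_r * gamma_packed for p_r in p]

def aggregate_group(users, delta):
    """users: list of (preference vector, one-hot gamma list) for one group."""
    R = len(users[0][0])
    gamma_sum = 0                 # packed number of users per cluster
    pref_sums = [0] * R           # packed preference sums, one per dimension
    for p, gamma in users:
        gamma_packed = pack(gamma, delta)
        gamma_sum += gamma_packed
        for r, s in enumerate(partial_inputs(p, gamma_packed)):
            pref_sums[r] += s
    return gamma_sum, pref_sums

# Three users, two clusters; w = 3 and N_u = 3 give delta = 3 + 2 = 5 bits.
group = [((1, 2), [1, 0]), ((3, 4), [0, 1]), ((5, 6), [1, 0])]
gamma_sum, pref_sums = aggregate_group(group, delta=5)
counts = unpack(gamma_sum, 5, 2)
sums_dim0 = unpack(pref_sums[0], 5, 2)
```

In this example two users fall into the first cluster and one into the second, and `sums_dim0` holds the per-cluster sums of the first coordinate. The compartment size Δ = w + ⌈log N_{ u }⌉ is what keeps these sums from overflowing into a neighbouring compartment.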
4.5 Step 6: aggregating partial inputs
The service provider first masks the aggregated cipher texts by homomorphically adding random values α_{ m } and β_{(m,r)} for r ∈ {1,…,R}. Then, the service provider sends $[[\tilde{\Gamma}_{\Sigma_{m}} + \alpha_{m}]]_{H}$ and $[[\tilde{P}_{\Sigma_{(m,r)}} + \beta_{(m,r)}]]_{H}$ to H_{ m }. After decrypting them, H_{ m } could send $\tilde{\Gamma}_{\Sigma_{m}} + \alpha_{m}$ and $\tilde{P}_{\Sigma_{(m,r)}} + \beta_{(m,r)}$ to the service provider, who could remove the masking by subtracting the random values, but this would reveal sensitive information to the service provider about the distribution of users in each group. To avoid this information leakage, H_{ m } also applies masking by adding random values, ϕ_{ m } and ψ_{(m,r)}, which are computed as described below, and sends the resulting masked values to the service provider.
The random values ϕ_{ m } and ψ_{(m,r)} of size K · Δ + σ bits are generated by a single helper user prior to the start of the iteration such that $\sum_{m=1}^{M} \varphi_{m} = 0$ and $\sum_{m=1}^{M} \psi_{(m,r)} = 0$ for r ∈ {1,…,R}. Each of these random values is then encrypted with the public key of the corresponding helper user and sent to the service provider, who passes them to the corresponding H_{ m }. Finally, each helper user sends $\tilde{\Gamma}_{\Sigma_{m}} + \alpha_{m} + \varphi_{m}$ and $\tilde{P}_{\Sigma_{(m,r)}} + \beta_{(m,r)} + \psi_{(m,r)}$ to the service provider.
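The double masking can be sketched with ordinary integers. In this illustration the zero-sum condition is enforced over the integers by letting the last mask be the negated sum of the others; in the protocol the values live in the Paillier message space, so this construction, the function names and the 40-bit mask size are assumptions of the sketch rather than details taken from the paper.

```python
import secrets

def zero_sum_masks(M, bits):
    """M integers that add up to zero; the first M - 1 have at most `bits` bits."""
    masks = [secrets.randbits(bits) for _ in range(M - 1)]
    masks.append(-sum(masks))
    return masks

def recover_total(masked_values, alphas):
    """Service-provider side: the phi's cancel, the known alphas are subtracted."""
    return sum(masked_values) - sum(alphas)

group_values = [5, 7, 9]                            # per-group results (toy numbers)
alphas = [secrets.randbits(40) for _ in group_values]   # service-provider masks
phis = zero_sum_masks(len(group_values), 40)            # helper-side masks
masked = [v + a + f for v, a, f in zip(group_values, alphas, phis)]
total = recover_total(masked, alphas)
```

The service provider only sees the entries of `masked`, each of which is blinded by a helper-side value it does not know, yet `total` equals the sum of the group values.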
4.6 Step 7: obtaining the new cluster centroids
The service provider adds up the masked values received from all M helper users. Recall that $\sum_{m=1}^{M} \varphi_{m} = 0$ and $\sum_{m=1}^{M} \psi_{(m,r)} = 0$. Since the service provider knows $\sum_{m=1}^{M} \alpha_{m}$ and $\sum_{m=1}^{M} \beta_{(m,r)}$, he subtracts them from the total and obtains $\tilde{\Gamma}_{\Sigma} = \sum_{m=1}^{M} \tilde{\Gamma}_{\Sigma_{m}}$ and $\tilde{P}_{\Sigma_{r}} = \sum_{m=1}^{M} \tilde{P}_{\Sigma_{(m,r)}}$ for r ∈ {1,…,R}. Notice that $\tilde{P}_{\Sigma_{r}} = \sum_{i \in C_{1}} p_{(i,r)} \,\|\, \dots \,\|\, \sum_{i \in C_{K}} p_{(i,r)}$ and $\tilde{\Gamma}_{\Sigma} = \sum_{i \in C_{1}} \gamma_{(i,1)} \,\|\, \dots \,\|\, \sum_{i \in C_{K}} \gamma_{(i,K)}$. Dividing each unpacked sum of preferences by the corresponding number of users gives the coordinates of the new centroids.
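The final unpacking and division can be sketched as follows. The rounding to the nearest integer and the rule of keeping the previous centroid when a cluster is empty are assumptions of this sketch; the function name and the example values are hypothetical as well.

```python
def new_centroids(packed_sums, packed_counts, delta, K, old_centroids):
    """Unpack per-cluster sums and counts and return the updated centroids.

    packed_sums[r] holds the K packed preference sums for dimension r,
    packed_counts holds the K packed cluster sizes.
    """
    mask = (1 << delta) - 1
    counts = [(packed_counts >> (delta * k)) & mask for k in range(K)]
    centroids = []
    for k in range(K):
        if counts[k] == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            centroids.append(old_centroids[k])
            continue
        # Assumption: coordinates are rounded back to integers.
        centroids.append(tuple(
            round(((ps >> (delta * k)) & mask) / counts[k]) for ps in packed_sums
        ))
    return centroids

# Two clusters, two dimensions, delta = 5 bits per compartment:
# sums (6, 8) over 2 users in cluster 0, sums (7, 1) over 1 user in cluster 1.
packed_sums = [6 + (7 << 5), 8 + (1 << 5)]
packed_counts = 2 + (1 << 5)
updated = new_centroids(packed_sums, packed_counts, 5, 2, [(0, 0), (0, 0)])
```

Here `updated` contains the centroid (3, 4) for the first cluster and (7, 1) for the second.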
4.7 Termination control and obtaining the cluster index
The service provider checks whether the predetermined termination condition is reached at the end of each iteration. Since centroid locations and the number of iterations are known to the service provider in plain text, termination control is considered to be costless. Once the termination condition is reached, i.e. when a certain number of iterations is reached or when centroids do not move significantly, the cluster index of the user, which is the nonzero element in the encrypted vector $[[\tilde{\Gamma}_{i}]]_{H}$, should be delivered to the user in plain text. For this purpose, after the last iteration, the service provider sends $[[\tilde{\Gamma}_{i}]]_{H}$ to user i, who masks it with a random number ρ_{ i } of size log(K R) + σ bits to get $[[\tilde{\Gamma}_{i} + \rho_{i}]]_{H}$, and sends it back to the service provider. The service provider sends the cipher text to H_{ m } to be decrypted. After receiving the plain text, the service provider sends $\tilde{\Gamma}_{i} + \rho_{i}$ to user i, who can easily obtain the cluster index by subtracting ρ_{ i } from the decrypted value and checking which compartment holds the nonzero value. Notice that all messages pass through the service provider instead of being sent directly to the helper user. This is unavoidable in our design as the users do not know the identity of the helper user. As an alternative approach, the encrypted random value ρ_{ i } can also be sent directly to the service provider at the start of the protocol. Having this random number, the service provider can mask $\tilde{\Gamma}_{i}$ and send it to the helper user to be decrypted. In this way, transmission of $\tilde{\Gamma}_{i}$ to user i can be avoided.
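The masking round that delivers the cluster index can be sketched on plain integers as well. The sketch omits encryption entirely and only shows the mask, unmask and compartment lookup. The function names are hypothetical, and the mask length of K · Δ + σ bits with σ = 40 is an assumption of this sketch that differs from the size stated in the text.

```python
import secrets

def mask_packed_gamma(gamma_packed, K, delta, sigma=40):
    """User side: blind the packed vector before it is sent for decryption."""
    rho = secrets.randbits(K * delta + sigma)   # assumption: mask length
    return gamma_packed + rho, rho

def cluster_index(masked_plain, rho, K, delta):
    """User side: remove the mask and find the compartment that holds the 1."""
    gamma_packed = masked_plain - rho
    mask = (1 << delta) - 1
    for k in range(K):
        if (gamma_packed >> (delta * k)) & mask:
            return k
    raise ValueError("no cluster marked in the packed vector")

# K = 3 clusters, delta = 5 bits; the user belongs to the cluster with index 2.
masked_value, rho = mask_packed_gamma(1 << 10, K=3, delta=5)
index = cluster_index(masked_value, rho, K=3, delta=5)
```

The helper user only ever sees `masked_value`, while the user, who knows `rho`, recovers the index of his/her cluster.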
5 Security discussion
In this section, we present arguments to show that our protocol is secure under the semi-honest security model. Recall that this model expects involved parties, namely the service provider, the helper users and all other users, to be honest in following the protocol steps. These parties are also assumed to be curious so they can keep previous messages to deduce more information than they are entitled to. This model does not consider corrupted parties. In this paper, we consider only one flavour of security threat: information leakage. In the following, we present an informal discussion on this issue.
Before discussing what information each party can derive from the received messages, we need to point out what information is allowed to be accessed. Remember that the number of clusters is public information. While the users to be clustered receive only the index of the cluster they are assigned to, the service provider does not obtain any information on the preferences of the users nor the final clustering results. However, a helper user obtains the distances between a user and the cluster centres. Even though the distances are permuted and the identity of the user is unknown to the helper user, the helper user can still acquire information about the distances between clusters and the number of assignments for each cluster. Note that enabling helper users to access permuted distances seems to be a reasonable compromise to achieve better performance in computation. Although not formally proved, we believe that the privacy risks created here will be rather harmless to the users, particularly when the number of clusters grows. Considering that in each iteration a different user will be assigned as a helper user, the amount of information on the distances between centres diminishes.
Recall that every user, including the helper users, only interacts with the service provider using a secured channel. The public key of the helper user is also delivered to the users in that group by the service provider. On the basis of this information, we analyze what information can be inferred from exchanged messages.
5.1 Service provider
The service provider receives from the helper users the encrypted packed distances of the users $[[\tilde{D}_{(i,k)}^{2}]]_{H}$ and the encrypted binary values [ [γ_{(i,k)}] ]_{ H } that show the closest cluster for each user. As the Paillier cryptosystem is semantically secure, meaning that it is infeasible for a computationally bounded adversary to derive significant information about the plain text when its cipher text and the public key used are known, the service provider cannot obtain meaningful information from the cipher texts without the decryption key. However, the service provider also receives the sum of preferences and the number of users in each cluster within the group from the helper users in plain text. To prevent the service provider from accessing this per-group information, the helper users mask their messages using random numbers, ϕ_{ m } and ψ_{(m,r)}, in such a way that when these masked values are all added up, the random values cancel each other out and the service provider gets the final result of the clustering for that iteration. The values received are statistically indistinguishable from random values with the same sum, but completely independent of the group sums. Therefore, the service provider does not have access to any information that might harm the users.
Notice that the way helper users perturb data prevents the service provider from obtaining meaningful information about each user group. This data perturbation technique will serve its purpose as long as the random numbers are generated properly and helper users do not cooperate with the service provider. Within the semi-honest model, we assume that random number generation is performed properly. Selecting a number of helper users randomly for each iteration also reduces the risk of possible cooperation.
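The cancelling masks can be illustrated with a small sketch. It makes two assumptions that are not taken from the text: the masks are drawn modulo a public `MODULUS` (a simplification of the statistical masking described above), and one routine generates all masks at once, whereas in a deployment the helper users would have to agree on them without involving the service provider. All names are hypothetical.

```python
import secrets

MODULUS = 2**64  # hypothetical public modulus, larger than any true total


def zero_sum_masks(count: int, modulus: int = MODULUS) -> list:
    """Return `count` random masks whose sum is 0 modulo `modulus`."""
    masks = [secrets.randbelow(modulus) for _ in range(count - 1)]
    masks.append(-sum(masks) % modulus)
    return masks


def mask_group_sums(group_sums: list, modulus: int = MODULUS) -> list:
    """Each helper user adds its own mask to its group sum before sending."""
    masks = zero_sum_masks(len(group_sums), modulus)
    return [(s + m) % modulus for s, m in zip(group_sums, masks)]


def aggregate(masked_sums: list, modulus: int = MODULUS) -> int:
    """Service provider side: the masks cancel, only the overall total remains."""
    return sum(masked_sums) % modulus
```

For example, with group sums `[12, 7, 30, 5]` the service provider sees four random-looking values, yet `aggregate` returns 54, the total over all groups.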
5.2 Helper user
In each group, all computations on the encrypted data are performed by using the public key of the helper user. As pointed out before, the helper user receives and sends data from and to the service provider only. To prevent the helper user from knowing the number of users in each cluster, the service provider applies a different permutation for each user during packing of the centroids. As a result of the different permutations, the helper user cannot observe the actual cluster with the minimum distance. Note that while the helper user learns the distances between users and the K centroids, it cannot know the distance between a specific user and a certain cluster since both are kept hidden from the helper user. Furthermore, since the helper user changes in each iteration, deducing meaningful information from the computed distances becomes infeasible.
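The effect of the per-user permutation can be simulated in plain text as below. This is only a sketch under simplifying assumptions: in the protocol the distances and the resulting binary vector are encrypted under the helper user's key, whereas here they are ordinary lists, and the function names are hypothetical.

```python
import secrets


def random_permutation(size: int) -> list:
    """Draw a fresh permutation of 0..size-1 (one per user)."""
    perm = list(range(size))
    secrets.SystemRandom().shuffle(perm)
    return perm


def permute(values: list, perm: list) -> list:
    """Position j of the result holds values[perm[j]]."""
    return [values[src] for src in perm]


def helper_one_hot(permuted_distances: list) -> list:
    """Helper side: mark the position of the smallest (permuted) distance."""
    best = min(range(len(permuted_distances)), key=permuted_distances.__getitem__)
    return [1 if j == best else 0 for j in range(len(permuted_distances))]


def unpermute(values: list, perm: list) -> list:
    """Service provider side: undo permute() with the same permutation."""
    original = [0] * len(values)
    for j, src in enumerate(perm):
        original[src] = values[j]
    return original


def assign_cluster(distances: list) -> list:
    """Return the one-hot cluster assignment in the true cluster order."""
    perm = random_permutation(len(distances))
    seen_by_helper = permute(distances, perm)
    gamma_permuted = helper_one_hot(seen_by_helper)
    return unpermute(gamma_permuted, perm)
```

The helper user only learns a position in the permuted order, which changes from user to user, while the service provider maps the result back to the true cluster order.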
Hiding the sum of preferences in each group by applying permutation, on the other hand, is not possible. To hide this information, the service provider masks the values by adding random values, which guarantees that the helper user cannot infer meaningful information.
5.3 Users
Users receive encrypted messages from the service provider and without the decryption key, they cannot access the content. As a result, users cannot obtain information on the intermediate values of the clustering algorithm.
Although out of the scope of our semi-honest model, users are able to manipulate the clustering output by providing fake input data. As long as the size of the input is correct, such an attack would not be prevented. However, since a user only obtains the index of his cluster, such an attack is not likely to lead to information leakage.
6 Performance
In this section, we present the complexity analysis of the privacy-preserving K-means clustering algorithm and experimental results on its performance.
6.1 Complexity analysis
The privacy-preserving version of the clustering algorithm presented in this paper has a number of disadvantages compared to the version in plain text. User preferences, which are usually small non-negative integers, grow to large numbers, e.g. 2,048 bits, after encryption. On top of that, addition and multiplication on the plain text become multiplication and exponentiation modulo n^{2}, which are computationally time-consuming. Moreover, transmission of the data from users to the service provider and vice versa requires more bandwidth than the plain version. Finally, realization of the algorithm involves interactive steps, requiring data exchange between the users and the service provider, which do not exist in the clustering algorithm with plain text data.
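To make the cost of working in the encrypted domain concrete, the following toy Paillier sketch shows how a plain-text addition turns into a multiplication modulo n^{2}, and a multiplication by a public constant into an exponentiation modulo n^{2}. The primes are two small Mersenne primes chosen purely for illustration, so the key is far too short to be secure. The generator g = n + 1 is a common simplification that is assumed here rather than taken from the text.

```python
import math
import secrets

# Two small Mersenne primes; illustration only, NOT a secure key size.
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)  # modular inverse (Python 3.8+)


def encrypt(m: int) -> int:
    """Paillier encryption of 0 <= m < N with generator g = N + 1."""
    while True:
        r = secrets.randbelow(N)
        if r > 0 and math.gcd(r, N) == 1:
            break
    # (1 + N)^m = 1 + m*N (mod N^2), so no exponentiation is needed for g^m.
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(c: int) -> int:
    """Recover m from a cipher text using the private values PHI and MU."""
    return (pow(c, PHI, N2) - 1) // N * MU % N


def add_encrypted(c1: int, c2: int) -> int:
    """Plain-text addition becomes cipher-text multiplication mod N^2."""
    return c1 * c2 % N2


def scale_encrypted(c: int, k: int) -> int:
    """Multiplying the plain text by a public k becomes exponentiation mod N^2."""
    return pow(c, k, N2)
```

Note that every cipher text is an element modulo n^{2} and therefore about twice as long as n, which is where the 2,048-bit figure for a 1,024-bit modulus comes from.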
Table 2 Communication and computational complexity for SP, H_{ m } and user i for one iteration of the algorithm

| | Our proposal: SP | Our proposal: H_{ m } | Our proposal: User | Algorithm in [9]: SP | Algorithm in [9]: User |
|---|---|---|---|---|---|
| Encryption | $\mathcal{O}(N_{u}R)$ | $\mathcal{O}(N_{g}K)$ | $\mathcal{O}(1)$ | $\mathcal{O}(N_{u}K(R+\ell))$ | $\mathcal{O}(K(R+\ell))$ |
| Decryption | - | $\mathcal{O}(N_{g}R)$ | - | $\mathcal{O}(N_{u}K(R+\ell))$ | - |
| Multiplication | $\mathcal{O}(N_{u}R)$ | - | $\mathcal{O}(R)$ | $\mathcal{O}(N_{u}KR)$ | $\mathcal{O}(K(R+\ell^{2}))$ |
| Exponentiation | $\mathcal{O}(N_{u}K)$ | - | $\mathcal{O}(R)$ | - | $\mathcal{O}(K(R+\ell^{2}))$ |
| Communication | $\mathcal{O}(N_{u}(R+K))$ | $\mathcal{O}(N_{g}(R+K))$ | $\mathcal{O}(R)$ | $\mathcal{O}(N_{u}K(R+\ell))$ | $\mathcal{O}(K(R+\ell))$ |
As seen in Table 2, the privacy-preserving K-means clustering algorithm has linear complexity in the number of users, similar to the original version on plain text. However, the cost of working in the encrypted domain has been significantly reduced compared to [9], which has a complexity comparable to [10]. The computational and communication gains come from the effective use of data packing, from eliminating the need for the expensive secure comparison protocol used in [9] and from involving helper users in the computations.
The complexity analysis also shows that our proposal has lower complexity compared to the previous works in [2], [12] and [14]. The communication complexity in [2] is $\mathcal{O}(N_{u}nKR)$ bits, and the computational complexity for the two-party setting is $\mathcal{O}(N_{u}KR)$ encryptions and multiplications for one party and $\mathcal{O}(N_{u}KR)$ exponentiations and multiplications for the other. [12] claims to have the same level of communication complexity as [2] but does not provide the computational complexity. [14], on the other hand, has a communication complexity of $\mathcal{O}(K^{3}nR)$. Its computational complexity is $\mathcal{O}(K^{3}R)$ encryptions and $\mathcal{O}(N_{u}K^{3}R)$ multiplications for one party and $\mathcal{O}(K^{3}R)$ exponentiations, $\mathcal{O}(K^{2})$ encryptions and $\mathcal{O}(N_{u}K^{3}R)$ multiplications for the other party.
6.2 Performance analysis
Table 3 Parameters

| Symbol | Value |
|---|---|
| N_{ u } | 100,000 |
| N_{ g } | 1,562 |
| M | 64 |
| R | 12 |
| K | 10 |
| n | 1,024 bits |
| ℓ | 10 |
| w | 3 bits |
| σ | 40 bits |
As for the bandwidth, we only consider the transmitted encrypted messages. For the parameters given in Table 3, the service provider sends and receives 1.215 GB of data, while the amount of data transmitted for a helper user is 9.4 kB. An ordinary user sends and receives only 6.8 kB of data. For the same set of parameters, the work in [9] requires the service provider and each user to transmit 7.8 GB and 82 kB of data, respectively. The significant difference in the amount of transmitted data is a result of data packing, as shown in the complexity analysis.
7 Discussion
Our proposal in this paper outperforms the most closely related protocol, given in [9], which is also based on cryptographic tools within the semi-honest security model. Note that the privacy-preserving protocol in [10] that hides the cluster centroids from the service provider has a complexity comparable to [9]. Even though the numerical results on a data set of 100,000 users show that our protocol is a promising candidate for real-life deployment, we believe the performance of our proposal in a real implementation can be improved further for the following reasons. Firstly, an appropriate number of helper users can be determined by assessing the number of users in the system and the users’ resources in terms of bandwidth and computation. This leads to a number of groups in which the helper user can process encrypted data without disrupting the user’s other activities. Secondly, after choosing the optimum number of helper users based on the aforementioned criteria, the overall performance of the privacy-preserving clustering algorithm will be determined by the performance of the service provider. Note that all operations by the service provider can be performed completely in parallel. Since a multiple-server model, or a cloud, is widely used in business, the overall runtime of the privacy-preserving clustering algorithm is expected to be within reasonable boundaries in real life.
With respect to bandwidth usage, our protocol employs data packing to the fullest extent. Note that using the Paillier cryptosystem, we face data expansion by a factor of 64, assuming that 32-bit numbers become 2,048-bit cipher texts after encryption. We reduce this expansion considerably by deploying data packing.
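The expansion factor quoted above, and the way packing reduces it, can be checked with a few lines of arithmetic. The sketch below assumes that each packed value occupies exactly 32 bits of the plain text. This is a hypothetical layout: the actual compartments in the protocol reserve additional bits, so the real gain is smaller than this idealised figure.

```python
CIPHERTEXT_BITS = 2048  # Paillier cipher text for a 1,024-bit modulus n
VALUE_BITS = 32         # size of one plain value, as assumed in the text


def expansion_factor(values_per_ciphertext: int = 1) -> float:
    """Cipher-text bits spent per bit of useful plain-text data."""
    return CIPHERTEXT_BITS / (VALUE_BITS * values_per_ciphertext)
```

Without packing, `expansion_factor()` gives 64.0. Packing 16 values into one cipher text would bring the idealised factor down to 4.0.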
A major aspect to be considered in deploying the privacy-preserving K-means clustering algorithm in real life is the set of security assumptions. Our model assumes the honest participation of all parties. While the semi-honest security model can be considered too simplistic, it is still adequate for real-world applications where the service provider and the users have incentives to act according to the protocol, as seen in the sugar beet auction system [24]. Note that a protocol with ‘proper measures against malicious parties’ will be much more expensive computationally and hence impractical for large-scale deployment. A protocol with less strict but still realistic security guarantees is therefore preferred. To that end, we distribute the trust of the system between multiple parties, preventing a single malicious party from learning sensitive data. In a distributed model consisting of independent database owners, security risks are smaller because collusion between the service provider and one of the helper users will be less likely.
A second aspect to consider is the active participation of all users in the system. It is our conclusion that without introducing (semi-)trusted third parties, users’ data cannot be processed without their participation. Fortunately, due to our construction, only the helper user needs to be online during the clustering procedure. Once the encrypted data are sent to the service provider, users can go offline for the rest of the computation. If the same helper users are to be used, other users can stay offline not only during that iteration but during the whole clustering; however, this would require minor changes in the protocol, for example the encrypted distances would have to be computed by the service provider. Note that using the same helper users will lead to a setting similar to [10] with dedicated key holders. However, it is our motivation to distribute trust among multiple random helper users in each iteration for privacy protection, which requires helper users to be online during each iteration.
8 Conclusion
In this paper, we present an efficient, privacy-preserving K-means clustering algorithm in a social network setting. We present a mechanism where the private data of the users, sensitive intermediate values and the final clustering assignments are protected by means of encryption. The service provider, who does not have the decryption key, can still perform clustering without being able to access the content of private data. While the approach of processing encrypted data presents a concrete privacy protection for the users, it also introduces performance drawbacks compared to the version with plain text due to data expansion after encryption and expensive operations on the encrypted data. Previous work has shown different approaches to reduce the complexity of privacy-preserving K-means clustering, such as using semi-trusted third parties. In this work, we build a mechanism on the common server-client model and reduce the costs by employing data packing. In this way, we reduce the number of encryptions by a factor of K, thus introducing a considerable gain in terms of communication and computation. We also avoid interactive protocols such as secure comparison by exploiting the distributed setting. We also distribute trust among multiple random users for each iteration of the protocol, which introduces a computational gain proportional to the number of such users. The resulting cryptographic protocol is significantly more efficient compared to previous work in the semi-honest security model. We also analyze the effects of different choices of parameters on the performance of the cryptographic protocol. Experimental results support our claim on the feasibility of privacy-preserving K-means clustering: it takes 26 min to cluster 100,000 users. This result, which can be improved further on a real system, encourages the deployment of privacy-preserving K-means clustering algorithms based on homomorphic encryption.
Declarations
Acknowledgements
This work is supported by the Dutch COMMIT programme.
References
1. Lindell Y, Pinkas B: Privacy preserving data mining. In CRYPTO ’00: Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology. Springer, London; 2000:36–54.
2. Jagannathan G, Wright RN: Privacy preserving distributed K-means clustering over arbitrarily partitioned data. In KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, New York; 2005:593–599.
3. Lagendijk R, Erkin Z, Barni M: Encrypted signal processing for privacy protection. IEEE Signal Process. Mag 2013, 30(1):82–105.
4. Erkin Z, Veugen T, Toft T, Lagendijk RL: Generating private recommendations efficiently using homomorphic encryption and data packing. IEEE Trans. Inf. Forensics Secur 2012, 7(3):1053–1066.
5. Fukunaga K: Introduction to Statistical Pattern Recognition. Academic, San Diego; 1990.
6. Goldreich O: Foundations of Cryptography II. Cambridge University Press, Cambridge; 2004.
7. Troncoso-Pastoriza JR, Katzenbeisser S, Celik MU, Lemma AN: A secure multidimensional point inclusion protocol. In ACM Workshop on Multimedia and Security. ACM, Dallas; 2007:109–120.
8. Bianchi T, Piva A, Barni M: Composite signal representation for fast and storage-efficient processing of encrypted signals. IEEE Trans. Signal Process 2009, 5(1):180–187.
9. Erkin Z, Veugen T, Toft T, Lagendijk R: Privacy-preserving user clustering in a social network. In 1st IEEE Workshop on Information Forensics and Security (WIFS09). IEEE, London; 2009:96–100.
10. Beye M, Erkin Z, Lagendijk R: Efficient privacy preserving k-means clustering in a three-party setting. IEEE Workshop on Information Forensics and Security (WIFS ’11) (Foz do Iguaçu, 29 Nov–2 Dec 2011)
11. Agrawal R, Srikant R: Privacy-preserving data mining. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Volume 29(2). ACM, New York; 2000:439–450.
12. Bunn P, Ostrovsky R: Secure two-party k-means clustering. In Proceedings of the 14th ACM Conference on Computer and Communications Security. ACM, New York; 2007:486–497.
13. Clifton C, Kantarcioglu M, Vaidya J: Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations 2003, 4(2):28–34.
14. Jagannathan G, Pillaipakkamnatt K, Wright R: A new privacy-preserving distributed k-clustering algorithm. Proceedings of the Sixth SIAM International Conference on Data Mining (Bethesda, 20–22 Apr 2006)
15. Oliveira S, Zaiane O: Privacy preserving clustering by data transformation. Proceedings of the 18th Brazilian Symposium on Databases (Manaus, 6–10 October 2003, pp. 304–318)
16. Oliveira S, Zaiane O: Achieving privacy preservation when sharing data for clustering. In Secure Data Management, ed. by W Jonker, M Petković. Proceedings of the VLDB 2004 Workshop, SDM 2004, Toronto, Canada, August 30, 2004. Lecture Notes in Computer Science, vol. 3178. Springer, Berlin; 2004:67–82.
17. Yao ACC: How to generate and exchange secrets (extended abstract). In Proceedings of the 27th Annual IEEE Symposium on Foundations of Computer Science. IEEE, Toronto; 1986:162–167.
18. Huang Z, Du W, Chen B: Deriving private information from randomized data. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, New York; 2005:37–48.
19. Kargupta H, Datta S, Wang Q, Sivakumar K: On the privacy preserving properties of random data perturbation techniques. In Proceedings of the ICDM 2003. IEEE, Melbourne; 2003:99–106.
20. Kolesnikov V, Sadeghi AR, Schneider T: Improved garbled circuit building blocks and applications to auctions and computing minima. In Cryptology and Network Security, ed. by JA Garay, A Miyaji, A Otsuka. Proceedings of the 8th International Conference, CANS 2009, Kanazawa, Japan, December 12–14, 2009. Lecture Notes in Computer Science, vol. 5888. Springer, Berlin; 2009:1–20.
21. Kononchuk D, Erkin Z, van der Lubbe JCA, Lagendijk RL: Privacy-preserving user data oriented services for groups with dynamic participation. In Computer Security – ESORICS 2013, ed. by J Crampton, S Jajodia, K Mayes. Proceedings of the 18th European Symposium on Research in Computer Security, Egham, UK, September 9–13, 2013. Lecture Notes in Computer Science, vol. 8134. Springer, Berlin; 2013:418–442.
22. Doraswamy N, Harkins D: IPSec: The New Security Standard for the Internet, Intranets, and Virtual Private Networks. Prentice-Hall, Upper Saddle River; 1999.
23. Paillier P: Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology – EUROCRYPT ’99, ed. by J Stern. Proceedings of the International Conference on the Theory and Application of Cryptographic Techniques, Prague, Czech Republic, May 2–6, 1999. Lecture Notes in Computer Science, vol. 1592. Springer, Berlin; 1999:223–238.
24. Bogetoft P, Christensen DL, Damgård I, Geisler M, Jakobsen TP, Krøigaard M, Nielsen JD, Nielsen JB, Nielsen K, Pagter J, Schwartzbach MI, Toft T: Multiparty computation goes live. IACR Cryptology ePrint Arch 2008, 2008: 68.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.