To meet the above-stated demands for long-term integrity and confidentiality protection, we have derived a protection scheme, which is described in this chapter.
Full-retrieval data
Unprocessed raw reads, e.g., stored in compressed FASTQ format, and the resulting alignments, e.g., stored in CRAM format, are usually accessed only as a whole; a long-term protection scheme for that use case was proposed in [13]. The scheme presented here in Section 4.3 enhances the integrity protection scheme of [13] so that a large number of small data items can be protected together efficiently.
Random access data
As opposed to whole-data integrity proofs, our scheme provides random access integrity proofs of genomic variation data on the finest level possible—per position in the reference genome.
We view genomic variation data like VCF/BCF files as a table G, where for each genome position i, G[i] denotes the corresponding variant data entry in G. If there is no mutation at position i, we set G[i] to 0. Note that we do not need to actually store those 0s, as the absence of a variation implicitly represents a 0. However, the scheme also needs to create commitments for the absence of variants so that absence can also be proven. Since a human genome has about \(3\cdot 10^{9}\) positions, this is the size of table G and the number of commitments that have to be created, independent of the underlying data format.
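The table view of G with implicit zeros can be illustrated with a minimal sketch. The class name `VariantTable`, the 0-based indexing, and the string encoding of variants (e.g., `"A>G"`) are assumptions made for illustration only; they are not part of the scheme or of the VCF/BCF formats.

```python
from typing import Dict, Union

GENOME_LENGTH = 3 * 10**9  # approximate number of positions in a human genome


class VariantTable:
    """Sparse view of table G: only positions that carry a variant are stored."""

    def __init__(self, variants: Dict[int, str], length: int = GENOME_LENGTH):
        self._variants = dict(variants)  # position -> variant entry (hypothetical encoding)
        self.length = length

    def __getitem__(self, i: int) -> Union[str, int]:
        if not 0 <= i < self.length:
            raise IndexError(i)
        # The absence of a stored variant implicitly represents the entry 0.
        return self._variants.get(i, 0)


# Two stored variants; every other position reads as 0 without being stored.
G = VariantTable({1_000: "A>G", 2_500_000: "C>T"})
```

Only the two variant entries occupy memory here, yet every one of the roughly \(3\cdot 10^{9}\) positions has a well-defined value that can later be committed to.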
For genome data G, generated and signed by a sequencing laboratory, the scheme generates an integrity proof P. The validity period of such a proof is limited because the cryptographic primitives used for its generation have a limited validity period. Therefore, the proof is updated regularly. Furthermore, we describe how a partial integrity proof for a subset G′⊂G can be extracted from P, and how such a partial integrity proof is verified. Our scheme thus delivers random access to G′⊂G with integrity proofs while keeping the remaining data G∖G′ private. We also present a security analysis of the proposed scheme. The scheme uses components of the schemes Lincos [13] and Mops [12]. More information on the cryptographic primitives used (i.e., timestamps, commitments, hashes, and signatures) can be found in the respective publications.
Scheme description
Our scheme for long-term integrity protection of genomic data provides the algorithms Protect, Update, PartialProof, and Verify. Algorithm Protect generates the initial integrity proof when genomic data is stored. Algorithm Update updates the integrity proof if a cryptographic primitive in use (e.g., the hash function) is at risk of becoming insecure. Algorithm PartialProof generates a partial integrity proof for verification of a subset of the genomic data. Algorithm Verify allows a verifier to verify the integrity of a given genomic dataset using a given partial integrity proof.
Initial protection
The initial integrity proof P for sequenced genome data G is generated by the sequencing laboratory using algorithm Protect (Algorithm 1). The algorithm takes as input genome data G, an information-theoretically hiding commitment algorithm Com [28], a hash algorithm Hash, a signing algorithm Sign, and a time-stamping algorithm TS. The algorithm first uses algorithm Com to generate commitments and decommitments to all entries in G. The commitments serve as placeholders for the data items and do not themselves leak information, while the decommitments can be used to prove the connection between a commitment and the corresponding data item. Then, it uses the hash algorithm Hash to compute a Merkle hash tree (MHT) [14] over the generated commitment values. The root node of the generated tree is then signed using algorithm Sign and timestamped using the trusted timestamp authority TS [29]. The output of the initial protection algorithm is an integrity proof P, which contains the commitments, the decommitments, the MHT, the signature, and the timestamp.
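The commit and decommit interface used in the first step of Protect can be sketched as follows. Note the important caveat: the sketch uses a salted SHA-256 hash as a stand-in, which is only computationally hiding, whereas the scheme requires an information-theoretically hiding commitment algorithm Com [28]. The function names `commit` and `verify_commitment` and the byte encoding of entries are assumptions for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple


def commit(message: bytes) -> Tuple[bytes, bytes]:
    """Return (commitment c, decommitment d) for a message.

    Stand-in only: a salted hash is NOT information-theoretically hiding.
    """
    d = secrets.token_bytes(32)  # fresh randomness acts as the decommitment
    c = hashlib.sha256(d + message).digest()
    return c, d


def verify_commitment(message: bytes, c: bytes, d: bytes) -> bool:
    """Check that d opens commitment c to the given message."""
    expected = hashlib.sha256(d + message).digest()
    return hmac.compare_digest(expected, c)


# Commit to every entry, including the value 0 for "no variant at this position".
entries: Dict[int, bytes] = {1000: b"A>G", 1001: b"0"}  # hypothetical encoding
commitments: Dict[int, bytes] = {}
decommitments: Dict[int, bytes] = {}
for pos, value in entries.items():
    commitments[pos], decommitments[pos] = commit(value)
```

The commitments would then become the leaves of the hash tree, while the decommitments stay with the data owner until a position is disclosed.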
In our algorithm listings, we denote by MHT:(Hash,L)→T an algorithm that, on input a hash algorithm Hash and a set of leaf nodes L, outputs an MHT T. Furthermore, we denote the root of an MHT T by T.r.
Protection update
Timestamps, hash values, and commitments have limited validity periods, which in turn limits the validity period of the corresponding integrity proof. The overall validity of an integrity proof is therefore prolonged regularly by the genome database by running Algorithm 2. The input parameter op∈{upCHT,upHT,upT} determines which primitives are updated: op=upCHT updates commitments, hashes, and timestamps; op=upHT updates only hashes and timestamps; and op=upT updates only timestamps. For op=upCHT, new information-theoretically hiding commitments are generated first. Then, a new MHT T is generated, and finally the root of T is timestamped. The output of the update algorithm is an updated integrity proof P′.
In the algorithm listings, we denote by AuthPath(T,i)→A an algorithm that on input MHT T and leaf index i, outputs the authentication path A from leaf node i to root node T.r.
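The two helper algorithms MHT and AuthPath, together with the path check a verifier performs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in the paper's listings: SHA-256 as Hash, the domain-separation prefixes `0x00` and `0x01`, and the padding of odd levels by duplicating the last node are all choices made here for brevity.

```python
import hashlib
from typing import List


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_mht(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, from the hashed leaves (level 0) up to the root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [H(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by duplicating the last node
            levels[-1] = level
        level = [H(b"\x01" + level[k] + level[k + 1]) for k in range(0, len(level), 2)]
        levels.append(level)
    return levels


def root(levels: List[List[bytes]]) -> bytes:
    """T.r in the paper's notation."""
    return levels[-1][0]


def auth_path(levels: List[List[bytes]], i: int) -> List[bytes]:
    """Sibling hashes on the way from leaf i to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[i ^ 1])  # sibling index differs in the lowest bit
        i //= 2
    return path


def verify_path(leaf: bytes, i: int, path: List[bytes], r: bytes) -> bool:
    """Recompute the root from leaf i and its authentication path."""
    node = H(b"\x00" + leaf)
    for sibling in path:
        node = H(b"\x01" + node + sibling) if i % 2 == 0 else H(b"\x01" + sibling + node)
        i //= 2
    return node == r


leaves = [b"c0", b"c1", b"c2", b"c3", b"c4"]  # placeholders for commitment values
T = build_mht(leaves)
A = auth_path(T, 4)
```

Because the path for leaf i contains only sibling hashes, a verifier can recompute the root without seeing any other leaf.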
Generate partial integrity proof
A data owner may want to create a partial integrity proof P′ for a subset G′⊂G such that P′ does not reveal any information about G∖G′. This can be done using Algorithm 3. The algorithm extracts from P all information relevant for proving the integrity of G′ and outputs them in form of a partial integrity proof P′. In particular, the partial integrity proof contains the commitments corresponding to the positions contained in G′, the corresponding hash tree authentication paths, as well as the corresponding timestamps and the corresponding signature.
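To get a feeling for the size of the authentication paths in a partial integrity proof, the following back-of-the-envelope calculation may help. It assumes a balanced binary hash tree over all \(3\cdot 10^{9}\) positions and a 32-byte hash output (e.g., SHA-256); neither the tree shape nor the hash function is fixed by the text above, so the numbers are indicative only.

```python
def auth_path_length(num_leaves: int) -> int:
    """Number of sibling hashes on a leaf-to-root path in a balanced binary tree."""
    if num_leaves < 1:
        raise ValueError("tree needs at least one leaf")
    # Integer equivalent of ceil(log2(num_leaves)), avoiding floating-point rounding.
    return (num_leaves - 1).bit_length()


NUM_POSITIONS = 3 * 10**9   # approximate number of positions in a human genome
HASH_BYTES = 32             # assumption: 256-bit hash output

path_len = auth_path_length(NUM_POSITIONS)
path_bytes = path_len * HASH_BYTES
print(f"{path_len} hashes per path, {path_bytes} bytes per disclosed position and tree")
```

Under these assumptions, the authentication path for a single position stays around one kilobyte per hash tree, even though the tree covers the whole genome.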
Verification
A verifier receives partial genome data G′ and a corresponding partial integrity proof P′. Additionally, it uses a trusted verification algorithm Ver and reads the current time \(t_{n+1}\). It then uses Algorithm 4 to verify the integrity of G′.
The trusted verification algorithm Ver is used for verifying the validity of timestamps, hashes, commitments, and signatures. It can be realized by leveraging trusted public key certificates that include verification parameters and validity periods. It must provide the following functionality. If \(\textsf{Ver}_{\textsf{TS}}(m,\textsf{ts};t)=1\), then ts is a valid timestamp for m at time t, meaning that the cryptographic algorithms used for generating the timestamp are considered secure at time t. The time that the timestamp ts refers to is denoted by ts.t. Hence, \(\textsf{Ver}_{\textsf{TS}}(m,\textsf{ts};t)=1\) means that it is safe to believe at time t that data m existed at time ts.t. Similarly, \(\textsf{Ver}_{\textsf{MHT}}(m,a,r;t)=1\) means that at time t, a is a valid authentication path for m through a hash tree with root r. \(\textsf{Ver}_{\textsf{Com}}(m,c,d;t)=1\) means that at time t, d is a valid decommitment from commitment c to message m. \(\textsf{Ver}_{\textsf{Sign}}(m,\sigma;t)=1\) means that at time t, σ is a valid signature for message m. We refer to Section 5.2 for more details on how the validity periods of the cryptographic primitives are derived.
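The time-dependent nature of Ver can be illustrated with a small sketch for the hash case: a check succeeds only if the cryptographic relation holds and the algorithm is still considered valid at the reference time t. The table `TRUSTED_PARAMETERS`, the abstract integer time units, and the function name `ver_hash` are hypothetical; in practice the validity periods would come from trusted certificates, as stated above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidityPeriod:
    not_before: int
    not_after: int

    def covers(self, t: int) -> bool:
        return self.not_before <= t <= self.not_after


# Hypothetical trust store with made-up validity periods in abstract time units.
TRUSTED_PARAMETERS = {"sha256": ValidityPeriod(not_before=0, not_after=100)}


def ver_hash(message: bytes, digest: bytes, algorithm: str, t: int) -> bool:
    """Return True iff digest matches message AND the algorithm is trusted at time t."""
    period = TRUSTED_PARAMETERS.get(algorithm)
    if period is None or not period.covers(t):
        return False  # algorithm unknown or no longer considered secure at time t
    return hashlib.new(algorithm, message).digest() == digest


digest = hashlib.sha256(b"m").digest()
```

The same pattern would apply to timestamps, commitments, and signatures: the reference time t is an explicit argument, so a proof element can be judged against the moment at which it was superseded rather than against the present.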
We use the shorthand notations \(t_{\textsf{NxTs}}(i)\), \(t_{\textsf{NxHa}}(i)\), and \(t_{\textsf{NxCo}}(i)\) to denote update times with respect to a given partial integrity proof \(P^{\prime} = [\sigma, P^{\prime}_{1}, \ldots, P^{\prime}_{n}]\). By \(t_{\textsf{NxTs}}(i)\) we denote the time of the next timestamp update after \(P^{\prime}_{i}\), i.e., \(t_{\textsf{NxTs}}(i) = \min\{\textsf{ts}_{j}.t : j > i\}\). Likewise, by \(t_{\textsf{NxHa}}(i)\) we denote the time of the next hash tree update after \(P^{\prime}_{i}\), and by \(t_{\textsf{NxCo}}(i)\) we denote the time of the next commitment update after \(P^{\prime}_{i}\).
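The definition of the next timestamp update time translates directly into code. The sketch below uses 1-based indices as in the text. One point is an assumption rather than something the definition states: for the last element (no j>i exists), the function falls back to the verification time \(t_{n+1}\), which matches how the current time is used by the verifier.

```python
from typing import Sequence


def t_nx_ts(i: int, ts_times: Sequence[int], t_now: int) -> int:
    """Time of the next timestamp update after proof element i (1-based).

    ts_times[k - 1] holds ts_k.t. If no later timestamp exists, the current
    verification time t_now is returned (assumption, see lead-in).
    """
    n = len(ts_times)
    if not 1 <= i <= n:
        raise IndexError(i)
    later = [ts_times[j - 1] for j in range(i + 1, n + 1)]
    return min(later) if later else t_now


# Three timestamps taken at abstract times 10, 20, 30; verification happens at time 40.
times = [10, 20, 30]
```

With these values, the timestamp created at time 10 has to be valid at time 20, the one created at time 20 at time 30, and the most recent one at the verification time 40.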
The verification function Verify of the genome data protection scheme works as follows. It checks whether the integrity proof has been constructed correctly and whether the cryptographic primitives have been updated before becoming invalid. We refer the reader to the security analysis in the next section for more details on the security of this scheme.
Security analysis
We now analyze the security of the proposed scheme and argue that it fulfills the requirements described in Section 3.3.
Confidentiality
We observe that a partial integrity proof P′ for genome data G′⊂G does not reveal any information about the remaining data G∖G′ by the following argument. Let \(P^{\prime} = (\sigma,P^{\prime}_{1},\ldots,P^{\prime}_{n})\) be a partial integrity proof for G′, where \(P^{\prime}_{i} = (\textsf{op}_{i},C^{\prime}_{i},D^{\prime}_{i},A^{\prime}_{i},T_{i}.r,\textsf{ts}_{i})\). We observe that for every i∈{1,…,n}, \(\textsf{op}_{i}\), \(C^{\prime}_{i}\), and \(D^{\prime}_{i}\) are independent of G∖G′ because of the information-theoretic hiding property of the commitments. Furthermore, \(A^{\prime}_{i}\) contains authentication paths that depend only on information-theoretically hiding commitments and thus does not reveal any information as long as the decommitment values are not revealed. Hence, the tree root \(T_{i}.r\), the timestamp \(\textsf{ts}_{i}\), and the signature σ are also independent of G∖G′.
Integrity
Next, we show that it is infeasible for an adversary, who cannot break any of the used cryptographic primitives within their validity period, to present a valid partial integrity proof P′ for partial genome data G′ if G′ has not been originally signed by the laboratory.
For our security analysis, we consider an adversary that can potentially become computationally more powerful over time and use methods developed in [30–32] for arguing about the knowledge of an adversary at an earlier point in time. For this, we require that the timestamp, commitment, and hash algorithms chosen by the user are extractable. Thereby, we are able to show that if an adversary presents a valid integrity proof, then the signed data together with the signature must have been known at a point when the corresponding signature scheme was considered valid. If the signature is valid for the data, then it follows that the data is authentic.
Here, we use the following notation to express the knowledge of the adversary. For any data m and time t, we write \(m \in \mathcal {K}[t]\) to denote that the adversary knows m at time t. We remark that for any t<t′, \(m \in \mathcal {K}[t]\) implies \(m \in \mathcal {K}[t^{\prime }]\).
Extractable timestamping [30, 32] guarantees that if at some time t, a timestamp ts and message m are known and ts is considered valid for m at time t, then m must have been known at time ts.t, or in the notation introduced above:
$$ (m,\textsf{ts}) \in \mathcal{K}[t] \land \textsf{Ver}_{\textsf{TS}}(m,\textsf{ts};t) \implies m \in \mathcal{K}[\textsf{ts}.t] \text{.} \tag{1} $$
Moreover, extractable commitments [31] guarantee that if a commitment value is known at time t, and a message m and a valid decommitment value are known at a later time t′>t, then the message m was already known at commitment time t, i.e.:
$$ c \in \mathcal{K}[t] \land (m,d) \in \mathcal{K}[t^{\prime}] \land \textsf{Ver}_{\textsf{Com}}(m,c,d;t^{\prime}) \implies m \in \mathcal{K}[t] \text{.} \tag{2} $$
Extractable hash trees [32] provide similar guarantees, i.e., for any hash tree root value r, message m, hash tree authentication path a, and times t,t′:
$$ r \in \mathcal{K}[t] \land (m,a) \in \mathcal{K}[t^{\prime}] \land \textsf{Ver}_{\textsf{MHT}}(m,a,r;t^{\prime}) \implies m \in \mathcal{K}[t] \text{.} \tag{3} $$
Furthermore, we know that if a signature σ and a message m are known at some time t, and σ is considered valid for m at time t, then by the existential unforgeability of the signatures it follows that m is authentically signed [30, 33]:
$$ (m,\sigma) \in \mathcal{K}[t] \land \textsf{Ver}_{\textsf{Sign}}(m,\sigma;t) \implies m\ \text{is authentic} \text{.} \tag{4} $$
Finally, it is known that signing the root of a Merkle tree preserves the integrity of the leaves. Furthermore, if the leaves are commitments, the authenticity of the committed messages is preserved. That is, for any hash tree root value r, signature σ, commitment c, hash tree authentication path a, message m, decommitment d, and times t,t′,t′′:
$$\begin{aligned} &(r,\sigma) \in \mathcal{K}[t] \land \textsf{Ver}_{\textsf{Sign}}(r,\sigma;t) \;\land\\ &(c,a) \in \mathcal{K}[t^{\prime}] \land \textsf{Ver}_{\textsf{MHT}}(c,a,r;t^{\prime}) \;\land\\ &(m,d) \in \mathcal{K}[t^{\prime\prime}] \land \textsf{Ver}_{\textsf{Com}}(m,c,d;t^{\prime\prime})\\ &\qquad\implies m\ \text{is authentic} \text{.} \end{aligned} \tag{5} $$
We now show that it is infeasible to produce a valid integrity proof for genome data that is not authentically signed. Assume an adversary outputs (G′,P′) at some point in time \(t_{n+1}\) and let Ver be a verification function trusted by the verifier. We show that if P′ is a valid partial integrity proof for data G′ (i.e., Verify(Ver,G′,P′)=1), then the signature σ for G′ is not a forgery.
Let \(P^{\prime} = (\sigma,P^{\prime}_{1},\ldots,P^{\prime}_{n})\), where \(P^{\prime}_{i} = (\textsf{op}_{i},C^{\prime}_{i},D^{\prime}_{i},A^{\prime}_{i},T_{i}.r,\textsf{ts}_{i})\). Define \(P^{\prime\prime}_{i} = (\sigma,P^{\prime}_{1},\ldots,P^{\prime}_{i})\) and \(t_{i} = \textsf{ts}_{i}.t\). In the following, we show recursively for i∈[n,…,1] that, given Verify(Ver,G′,P′)=1, statement \(\textsf{St}(i) = \langle (G^{\prime},P^{\prime\prime}_{i}) \in \mathcal{K}[t_{i+1}] \rangle\) holds.
We observe that St(n) is trivially true because the adversary presents valid (G′,P′) at \(t_{n+1}\) by assumption. Next, we show that if St(i) holds, then St(i−1) also holds. Given St(i), we observe that by \(\textsf{Ver}_{\textsf{TS}}([\sigma,(T_{1}.r,\ldots,T_{i}.r),(\textsf{ts}_{1},\ldots,\textsf{ts}_{i-1})],\textsf{ts}_{i};t_{\textsf{NxTs}}(i))=1\) and (1), we have \([\sigma,(T_{1}.r,\ldots,T_{i}.r),(\textsf{ts}_{1},\ldots,\textsf{ts}_{i-1})] \in \mathcal{K}[t_{i}]\). Furthermore, by
$$\textsf{Ver}_{\textsf{MHT}}(\textsf{CA}^{\prime}(i,j), A^{\prime}_{i}[j], T_{i}.r; t_{\textsf{NxHa}}(i)) = 1$$
and (3), we have \(\textsf{CA}^{\prime}(i,j) \in \mathcal{K}[t_{i}]\) for every j∈G′. Finally, by
$$\textsf{Ver}_{\textsf{Com}}([G^{\prime}[j], D^{\prime}_{1}[j], \ldots, D^{\prime}_{i-1}[j]], C^{\prime}_{i}[j], D^{\prime}_{i}[j]; t_{\textsf{NxCo}}(i)) = 1$$
and (2), we have \([G^{\prime}[j], D^{\prime}_{1}[j], \ldots, D^{\prime}_{i-1}[j]] \in \mathcal{K}[t_{i}]\) for every j∈G′. Combined, we obtain \((G^{\prime},P^{\prime\prime}_{i-1}) \in \mathcal{K}[t_{i}]\), which means that St(i−1) holds.
We observe that St(1), \(\textsf{Ver}_{\textsf{TS}}([\sigma,T_{1}.r],\textsf{ts}_{1};t_{\textsf{NxTs}}(1))=1\), and (1) imply that \([\sigma, T_{1}.r] \in \mathcal{K}[t_{1}]\). Furthermore, by \(\textsf{Ver}_{\textsf{Sign}}(T_{1}.r,\sigma;t_{1})=1\) and (4), we obtain that σ is genuine for \(T_{1}.r\). Finally, we observe that for every i∈G′, \(\textsf{Ver}_{\textsf{MHT}}(C^{\prime}_{1}[i], A^{\prime}_{1}[i], T_{1}.r; t_{\textsf{NxHa}}(1)) = 1\) and \(\textsf{Ver}_{\textsf{Com}}(G^{\prime}[i], C^{\prime}_{1}[i], D^{\prime}_{1}[i]; t_{\textsf{NxCo}}(1)) = 1\), and we obtain by (5) that σ is a genuine signature for G′.