# Markov Modelling of Fingerprinting Systems for Collision Analysis

Neil J. Hurley (corresponding author), Félix Balado and Guénolé C. M. Silvestre

**2008**:195238

**DOI:** 10.1155/2008/195238

© Neil J. Hurley et al. 2008

**Received:** 8 May 2007

**Accepted:** 3 December 2007

**Published:** 12 December 2007

## Abstract

Multimedia fingerprinting, also known as robust or perceptual hashing, aims at representing multimedia signals through compact and perceptually significant descriptors (hash values). In this paper, we examine the probability of collision of a certain general class of robust hashing systems that, in its binary alphabet version, encompasses a number of existing robust audio hashing algorithms. Our analysis relies on modelling the fingerprint (hash) symbols by means of Markov chains, which is generally realistic due to the hash synchronization properties usually required in multimedia identification. We provide theoretical expressions of performance, and show that the use of -ary alphabets is advantageous with respect to binary alphabets. We show how these general expressions explain the performance of Philips fingerprinting, whose probability of collision had only been previously estimated through heuristics.

## 1. Introduction

Multimedia fingerprinting, also known as robust or perceptual hashing, aims at representing multimedia signals through compact and perceptually significant descriptors (hash values). Such descriptors are obtained through a hashing function that maps signals surjectively onto a space of much lower dimension. This function is akin to a cryptographic hashing function in the sense that, in order to perform nearly unique identification from the hash values, perceptually different signals (according to some relevant distance) must lead with high probability to clearly different descriptors. Equivalently, the probability of collision between the descriptors corresponding to perceptually different signals must be kept low. Unlike in cryptographic hashing, signals that are perceptually close must lead to similar robust hashes. Despite this difference, the probability of collision remains the parameter that determines the "resolution" of a method for identification purposes.

A large number of robust hashing algorithms have been proposed recently. This flurry of activity calls for a more systematic examination of robust hashing strategies and their performance properties. In this paper, we take a step in that direction by examining the probability of collision of a certain general class of robust hashing systems, rather than analyzing a particular method. In its binary alphabet version, the class considered broadly encompasses several existing algorithms, in particular, a number of robust audio hashing algorithms [1–4]. We will show that the -ary alphabet version of the class provides an advantage over the binary version for fixed storage size. In order to keep our exposition simple, other issues such as robustness to distortions or to desynchronization are not considered in this analysis. The study of the tradeoffs brought about by the simultaneous consideration of these issues is left as further work. We must also note that we will be dealing with *unintentional* collisions due to the inherent properties of the signals to be hashed. A related problem not tackled in this paper is the analysis of *intentional* forgeries of signals, perhaps under distortion constraints, in order to maximize the probability of collision.

*feature extraction block*, a function, , is applied to extract a set of feature vectors, which we assume to be real-valued with dimension . The feature extraction function is

*hashing block*, in which the continuous feature vector values are mapped to a finite alphabet of hash symbols, that is, quantized. In many methods, this hashing block is implemented through the application of a scalar hashing function to each scalar feature vector value, which we denote as

where is the alphabet of hash symbols whose size is given by .

In any hashing system, a distance measure must be established in order to determine the closeness between hash values. The distance commonly used for comparing sequences of discrete-alphabet symbols is the Hamming distance, defined as the number of indices at which the two sequences hold different symbols. Therefore, when comparing any two hash symbols, their Hamming distance can only take the values 0 or 1.
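As a minimal illustration of this distance, the following Python sketch counts position-wise mismatches between two equal-length symbol sequences. The function name `hamming_distance` and the sample sequences are illustrative choices and do not come from the paper.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length symbol sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Each per-position (elemental) distance is 0 or 1, whatever the alphabet size.
    return sum(1 for x, y in zip(a, b) if x != y)


# Illustrative sequences (made up for this example).
binary_a = [0, 1, 1, 0, 1]
binary_b = [0, 0, 1, 1, 1]
quaternary_a = [3, 0, 2, 2]
quaternary_b = [3, 1, 2, 0]

d_binary = hamming_distance(binary_a, binary_b)
d_quaternary = hamming_distance(quaternary_a, quaternary_b)
```

Note that a mismatch between symbols 0 and 3 counts exactly as much as one between 0 and 1: the distance only records whether the symbols differ.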

*Markov chain*. In particular, we assume that

for all . Furthermore, we assume that the process is stationary, that is, with statistics independent of . We will also focus without loss of generality on one particular element of the feature vector. Hence, we will write the relevant random variables of the feature vector as and to represent the distributions of the feature value at and , respectively, for any , dropping the implicit index .

Finally note that, although methods which deal with real-valued fingerprints could be deemed in principle to belong to this class (using very large values of ), they rely on the use of mean square error distances instead of the Hamming distance. Thus, their study is not covered by the class of methods studied here.

Notation

Lowercase boldface letters such as represent column vectors, while matrices are represented by upper case Roman letters such as . is a matrix with the elements of in the diagonal and zero elsewhere. The symbols and denote the identity and the all-zero matrices, respectively, whereas denotes an all-ones vector, all of suitable size depending on the context. denotes the trace of . The operator stacks sequentially the columns of an matrix into an column vector. The symbol denotes the Kronecker (or direct) product of two matrices, and denotes their Hadamard (component-wise) product. Finally, denotes the Kronecker delta function.

## 2. Probability of Collision

To fix a point of operation, we consider hash sequences of symbols (assumed integer) which have fixed bit size (storage size). We investigate the probability of collision between two such independent sequences of symbols generated from the Markov chain with transition matrix , whose elements are defined in (4). Note that is a column-stochastic matrix, so that .
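The setting above can be sketched in a few lines of Python: check that a matrix is column-stochastic and draw two independent symbol sequences from the corresponding chain. Since definition (4) is not reproduced here, the indexing convention `P[i][j] = Pr(next = i | current = j)` is an assumption, chosen because it makes the columns sum to one. The matrix values, the names `is_column_stochastic` and `sample_chain`, and the sequence length are purely illustrative.

```python
import random


def is_column_stochastic(P, tol=1e-12):
    """True if all entries are nonnegative and every column sums to one."""
    n = len(P)
    return all(
        all(P[i][j] >= 0 for i in range(n))
        and abs(sum(P[i][j] for i in range(n)) - 1.0) <= tol
        for j in range(n)
    )


def sample_chain(P, length, initial, rng):
    """Draw `length` symbols, assuming P[i][j] = Pr(next = i | current = j)."""
    n = len(P)
    seq = [initial]
    for _ in range(length - 1):
        current = seq[-1]
        column = [P[i][current] for i in range(n)]  # distribution of the next symbol
        seq.append(rng.choices(range(n), weights=column)[0])
    return seq


# Illustrative binary transition matrix (columns sum to one).
P = [[0.9, 0.2],
     [0.1, 0.8]]

rng = random.Random(0)
seq_a = sample_chain(P, 16, initial=0, rng=rng)
seq_b = sample_chain(P, 16, initial=1, rng=rng)
```

The two sequences are generated independently from the same chain, which is the situation whose Hamming distance is analysed in this section.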

with the Hamming distance between the elements of the two sequences. If the random variables were independent, we could apply the central limit theorem (CLT) to for large , in order to compute the probability (6). Although there are short-term dependencies created by the Markov chain, these vanish in the long term. We may therefore invoke a broader version of the CLT for locally correlated signals [6]. In summary, the result in [6] states that, provided the second and third moments of are bounded, tends to the normal distribution. Finally, notice that is discrete, so applying the CLT entails approximating a distribution supported on the nonnegative integers by a distribution supported on the whole real line.

with . We tackle the computation of the statistics required for this approximation in Section 3, and particular cases in Section 5.
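A minimal sketch of such a Gaussian approximation follows. It assumes that a collision is declared when the Hamming distance does not exceed a threshold, so that the probability of collision is a lower tail of the distance distribution; this reading, the function names, and the numerical values are assumptions made for illustration, not a reproduction of the paper's expressions.

```python
import math


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def clt_collision_probability(mean, variance, threshold):
    """Gaussian approximation to Pr(Hamming distance <= threshold)."""
    return gaussian_cdf(threshold, mean, math.sqrt(variance))


# Illustrative case: 256 independent fair bits, so the distance has
# mean 256 / 2 and variance 256 / 4; the threshold is 35% of the length.
p_example = clt_collision_probability(mean=128.0, variance=64.0, threshold=0.35 * 256)
```

With dependent symbols only the mean and variance passed to the function change; the approximation itself keeps the same form.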

We investigate this direct approach in Section 4. Finally, in Section 6 we propose a Chernoff bound to , which is useful when the CLT assumption is not accurate or when the exact computation presents computational difficulties.

## 3. Mean and Variance of Hamming Distance

In this section, we derive the mean and variance of the Hamming distance using the Markov chain of symbol transitions , defined by (4). To proceed, we assume that represents an irreducible, aperiodic Markov chain.

If is symmetric, then the symbols are equally likely in equilibrium and .
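The equilibrium distribution of an irreducible, aperiodic chain can be obtained numerically by repeatedly applying the transition matrix to any starting distribution. The sketch below does this in plain Python, again assuming the convention `P[i][j] = Pr(next = i | current = j)`; the function name `equilibrium` and both example matrices are illustrative.

```python
def equilibrium(P, iterations=10_000, tol=1e-14):
    """Equilibrium distribution by power iteration, with P[i][j] = Pr(next = i | current = j)."""
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)  # start from a point mass on symbol 0
    for _ in range(iterations):
        nxt = [sum(P[i][j] * pi[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Non-symmetric example: the equilibrium is not uniform.
pi_asym = equilibrium([[0.9, 0.2],
                       [0.1, 0.8]])

# Symmetric example: the symbols are equally likely in equilibrium.
pi_sym = equilibrium([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8]])
```

For the non-symmetric matrix the iteration settles at probabilities 2/3 and 1/3, whereas the symmetric matrix yields the uniform distribution, in line with the remark above.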

Using the probabilities (12) and (13), we can derive the mean and variance of the Hamming distance between two independent hash sequences of symbols, *assuming that the process starts in the equilibrium distribution* (11). This is tantamount to assuming , in which case and , that is, we can drop the index and write . When the initial symbol is chosen with uniform probability from this condition holds if the transition matrix is symmetric. Even if all values for the initial symbol are not equiprobable in reality, the assumption is not too demanding whenever convergence to equilibrium is fast. We investigate a more general case for binary hashes in Section 5.
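Under the equilibrium assumption, the mean and variance can also be evaluated numerically from first principles, which gives a way to cross-check closed-form results. The sketch below does not reproduce the paper's expressions. It uses two facts that hold for two independent stationary chains: symbols at the same index coincide with probability equal to the sum of the squared equilibrium probabilities, and the variance of a sum equals the sum of the per-position variances plus twice the sum of the lag covariances. The convention `P[i][j] = Pr(next = i | current = j)` and all names are assumptions.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def hamming_mean_variance(P, pi, N):
    """Mean and variance of the Hamming distance between two independent
    length-N sequences, both started in the equilibrium distribution pi."""
    n = len(P)
    match = sum(p * p for p in pi)         # Pr(symbols at the same index coincide)
    mean = N * (1.0 - match)
    variance = N * match * (1.0 - match)   # sum of the per-position variances
    Pk = [row[:] for row in P]             # holds P**k, starting with k = 1
    for k in range(1, N):
        # Pr(the sequences coincide both at index t and at index t + k)
        joint = sum((Pk[j][i] * pi[i]) ** 2 for i in range(n) for j in range(n))
        variance += 2.0 * (N - k) * (joint - match * match)
        Pk = matmul(P, Pk)
    return mean, variance


# Independent fair bits: the distance is binomial with mean N/2 and variance N/4.
mean_iid, var_iid = hamming_mean_variance([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], 10)

# Binary symmetric chain with flip probability 0.1 and two symbols per sequence.
mean_bs, var_bs = hamming_mean_variance([[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5], 2)
```

Positive correlation between consecutive symbols leaves the mean unchanged but inflates the variance, which is what makes collisions more likely than in the independent case.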

## 4. The Stochastic Process of Elemental Distances

In this section, we investigate the stochastic process of elemental distances, that is, the process that generates the sequence . Through an analysis of this process, we arrive at a full expression for the probability of collision, which is exact in the case of binary hashing sequences with symmetric transition matrices. This is possible because, as we will show, the elemental distance process is itself a Markov chain when the alphabet is binary and the transition matrix is symmetric. Even in the non-binary case, we note that the elemental distance process is well approximated by a Markov chain, so the expression obtained for the probability of collision can be interpreted as a good approximation to the true collision probability.

Note that (24) is a weighted sum of the off-diagonal elements of with weights depending on and summing to one. The remaining two components of are given by and .

It follows that, whenever the diagonal elements of are all equal *and* the off-diagonals are all equal, the dependence of on factors from (23) and (24), and is independent of the time-step . In this case, the process of elemental distances is itself a stationary Markov chain. Let us assume that has the structure with and . In this case, as , we can see that with and . As we have discussed above, this is the structure that allows the dependence on to be cancelled in (23) and (24). For binary hashes, observe that symmetry implies that is always of the form above, so the conditions are always fulfilled in that case.
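The role of this structure can be checked numerically. For two independent chains whose current symbols are `i` and `k`, the probability that their next symbols coincide is the inner product of columns `i` and `k` of the transition matrix. If that value depends only on whether `i == k`, the next elemental distance depends on the past only through the current elemental distance. The sketch below assumes the convention `P[i][j] = Pr(next = i | current = j)`; the helper names and numerical values are illustrative.

```python
def next_match_probability(P, i, k):
    """Pr(next symbols coincide | current symbols are i and k), independent chains."""
    return sum(P[j][i] * P[j][k] for j in range(len(P)))


def equal_diagonal_matrix(q, alpha):
    """q x q matrix with alpha on the diagonal and equal off-diagonal entries."""
    beta = (1.0 - alpha) / (q - 1)
    return [[alpha if r == c else beta for c in range(q)] for r in range(q)]


def distinct_values(P, same_symbol):
    """Distinct next-match probabilities over all current pairs of the given kind."""
    q = len(P)
    return {round(next_match_probability(P, i, k), 12)
            for i in range(q) for k in range(q) if (i == k) == same_symbol}


structured = equal_diagonal_matrix(4, 0.7)
from_match = distinct_values(structured, same_symbol=True)      # current distance 0
from_mismatch = distinct_values(structured, same_symbol=False)  # current distance 1

# A column-stochastic matrix without this structure, for contrast.
unstructured = [[0.9, 0.2],
                [0.1, 0.8]]
from_match_unstructured = distinct_values(unstructured, same_symbol=True)
```

For the structured matrix each set contains a single value, so the transition probabilities of the elemental distance do not depend on the underlying symbols. For the unstructured matrix the set has two values, which is why an approximation is needed in that case.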

On the other hand, even when the elemental distances do not follow a Markov chain, since , the equilibrium probability, the elemental distance process is well approximated by the Markov chain with transition matrix obtained by replacing in (23) and (24) with , such that . From now on, we will refer loosely to the *elemental distance Markov chain*, meaning, when appropriate, the Markov chain derived from this approximation.

### 4.1. Probability of Collision

Expression (28) gives the exact probability of collision when the sequence of elemental distances is a Markov chain. In other cases, it will lead to an approximation. Consequently, the analysis is exact for and symmetric, in which case ( ) can be determined easily from .
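When the elemental distances form a two-state Markov chain, the exact distribution of the Hamming distance can also be obtained by dynamic programming over the pair (current elemental distance, accumulated distance). The sketch below is not expression (28), but an alternative numerical route to the same distribution for the binary symmetric case. It uses the fact that the XOR of two independent binary symmetric chains with flip probability `a` is itself a symmetric two-state chain with flip probability `2a(1 - a)`. All names and values are illustrative, and a collision is assumed to mean that the distance does not exceed a threshold.

```python
def elemental_flip_probability(a):
    """Flip probability of the XOR of two independent binary symmetric chains."""
    return 2.0 * a * (1.0 - a)


def distance_distribution(N, flip, p_one=0.5):
    """Exact pmf of the number of ones among N states of a symmetric two-state chain."""
    # dist[s][d] = Pr(current elemental distance = s, accumulated distance = d)
    dist = [[0.0] * (N + 1) for _ in range(2)]
    dist[0][0] = 1.0 - p_one
    dist[1][1] = p_one
    for _ in range(N - 1):
        new = [[0.0] * (N + 1) for _ in range(2)]
        for s in (0, 1):
            for d in range(N + 1):
                p = dist[s][d]
                if p == 0.0:
                    continue
                new[s][d + s] += p * (1.0 - flip)        # elemental distance unchanged
                new[1 - s][d + (1 - s)] += p * flip      # elemental distance flips
        dist = new
    return [dist[0][d] + dist[1][d] for d in range(N + 1)]


def collision_probability(N, flip, threshold):
    """Pr(Hamming distance <= threshold) from the exact pmf."""
    pmf = distance_distribution(N, flip)
    return sum(pmf[: threshold + 1])


pmf_small = distance_distribution(2, elemental_flip_probability(0.1))
pmf_iid = distance_distribution(4, 0.5)
```

The recursion costs on the order of N squared operations and avoids large combinatorial coefficients, at the price of not giving a closed-form expression.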

## 5. Binary Hashes with Symmetric Transition Matrix

While (31) holds under the assumption that the distribution of is the equilibrium distribution, it is also possible to derive the exact mean and variance of from an arbitrary initial distribution. This case is interesting, since, although the symbol sequences are assumed to be generated from independent sources, at the application level, the first bit of the hash sequence corresponding to the input signal is sometimes aligned with that of the hash sequences in the database. We can handle this scenario by assuming that the distance between the initial pair of bits is zero.

### 5.1. Exact Mean and Variance

Noting that as , this expression coincides with (31) as when .
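The effect of an aligned first bit can be illustrated with a short computation. The sketch assumes binary symmetric chains with flip probability `a`, an initial pair of identical bits that is counted as the first of the N positions, and an elemental distance chain with flip probability `2a(1 - a)`. The indexing may differ from the one used in the paper's expressions, and the function names are hypothetical. The closed form in the second function is simply the geometric sum of the per-position probabilities, not a quotation of the paper's result.

```python
def mean_distance_aligned_start(N, a):
    """Mean Hamming distance over N positions when the first pair of bits coincides."""
    flip = 2.0 * a * (1.0 - a)   # flip probability of the elemental distance chain
    p_one = 0.0                  # Pr(elemental distance = 1) at the first position
    total = 0.0
    for _ in range(N):
        total += p_one
        p_one = p_one * (1.0 - flip) + (1.0 - p_one) * flip
    return total


def mean_distance_closed_form(N, a):
    """Same quantity via a geometric sum; requires 0 < a < 1."""
    lam = (1.0 - 2.0 * a) ** 2   # equals 1 - 2 * flip
    return 0.5 * (N - (1.0 - lam ** N) / (1.0 - lam))


mean_short = mean_distance_aligned_start(5, 0.1)
mean_long = mean_distance_aligned_start(5000, 0.1)
```

For short sequences the mean distance per symbol stays visibly below 1/2, whereas for long sequences the influence of the initial alignment fades and the per-symbol mean approaches the equilibrium value of 1/2.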

## 6. Chernoff Bounding

For large and small probabilities, the CLT approximation can deviate substantially from the true probabilities, because it relies only on the first two moments of the true distribution. In addition, the exact computation (28) can run into numerical difficulties due to the combinatorial terms involved. It is therefore of interest to see what can be obtained by Chernoff bounding (6). Besides providing a strict upper bound, this strategy also yields the error exponent followed by the integral of the tail of the distribution of .

It is not possible to optimize this expression analytically in closed form. Nonetheless, numerical optimization can be easily undertaken, as (41) is just a weighted sum of powers of .
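A numerical optimization of this kind can be sketched as follows. The code does not implement expression (41). It applies the generic Chernoff bound on a lower tail, Pr(D <= t) <= exp(s t) E[exp(-s D)] for any s >= 0, to a symmetric two-state elemental distance chain, and minimizes over a grid of values of `s`. The chain model, the grid search, the names, and the parameter values are all assumptions made for illustration.

```python
import math


def mgf_negative(N, flip, s, p_one=0.5):
    """E[exp(-s * D)], with D the number of ones among N states of a
    symmetric two-state Markov chain with the given flip probability."""
    w = math.exp(-s)
    v0, v1 = 1.0 - p_one, p_one * w   # weights of paths ending in state 0 / state 1
    for _ in range(N - 1):
        v0, v1 = (v0 * (1.0 - flip) + v1 * flip,
                  (v0 * flip + v1 * (1.0 - flip)) * w)
    return v0 + v1


def chernoff_bound(N, flip, threshold, s_max=10.0, steps=2000):
    """Upper bound on Pr(D <= threshold), minimized over a grid of s in [0, s_max]."""
    best = 1.0
    for k in range(steps + 1):
        s = s_max * k / steps
        best = min(best, math.exp(s * threshold) * mgf_negative(N, flip, s))
    return best


bound_iid = chernoff_bound(32, 0.5, 8)     # independent fair bits
bound_corr = chernoff_bound(2, 0.18, 0)    # correlated elemental distances
```

Because every nonnegative `s` yields a valid bound, a coarse grid can only loosen the result and never invalidate it; a finer grid or a one-dimensional minimizer merely tightens it.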

## 7. Empirical Results

Matlab source code and data associated with the empirical results given below can be downloaded from http://www.ihl.ucd.ie.

### 7.1. Synthetic Markov Chains

The CLT approximation has good agreement in the binary case for , but is significantly less accurate for 4-ary hashes. This is because, in the second case, the pdf of is significantly skewed, as zero distances are more likely to occur. As a result, the CLT approximation underestimates the tail of the true distribution. The Chernoff bound, also shown in Figure 1, follows the same shape as the exact distribution and is tighter for high values of than the CLT approximation.

### 7.2. The Philips Method

We show in this subsection how the Markov modelling that we have described is applicable to the hashing method proposed by Haitsma et al. [1], commonly known as the Philips method. Moreover, we show how previous work on modelling this particular method allows the parameters of the Markov chain to be obtained analytically.

In previous work [8], we developed a model that allows the analysis of the performance of the Philips method under additive noise and desynchronisation. Using this model, the transition matrix of the Markov chain associated with the bitstream of the Philips hash can be determined analytically as follows. In [8] we analysed the bit error that results from desynchronisation, that is, the lack of alignment between the original framing used in the acquisition stage and the framing that takes place in the identification stage.

where is the correlation coefficient corresponding to that band and that level of desynchronization. This model was shown therein to give very good agreement with empirical results, even with real audio (and hence nonstationary) input signals.

In the results presented in Figure 2, and hence the correction factor for this value of is . In summary, our analysis is able to tackle dependencies without resorting to any heuristics.

#### 7.2.1. Real Audio Signals

Although our model assumes stationarity, which is clearly not the case for real audio signals, good agreement is found between the model predictions and empirical data. The greatest discrepancy appears in the AC/DC audio and may be due to greater dynamics in this song. To improve the results, we could follow the approach used in [8], in which real audio signals are approximated by stationary stretches, and apply our model separately to each stretch. While this approach can provide the probability of collision within each stationary stretch, combining these into an overall probability of collision could prove problematic.

## 8. Conclusion

We have examined the probability of collision of a certain general class of robust hashing systems that can be described by means of Markov chains. We have given theoretical expressions for the performance of general chains of -ary hashes, by deriving the mean and variance of the distance between independent hashes and applying a CLT approximation for the probability distribution. We have been able to derive an expression for the distribution, which is exact for binary symmetric hashes and gives a very good approximation otherwise. We have confirmed the accuracy of the Gaussian distribution on binary hashes once the hash sequence is sufficiently large. Moreover, we derived the binary transition matrix for the Philips method and showed that the Markov chain model has very good agreement with empirical results for this method. While we have shown that for , -ary chains have an advantage over binary chains from the point of view of collision, higher order alphabets will inevitably lead to a degradation of performance under additive noise and desynchronisation error. The performance tradeoffs that result will be examined in future work.

## Appendices

### A. Variance of an -Ary Hash Sequence

Using (17) and (A.12) in (15) we finally obtain (21).

### B. Variance of Binary Symmetric Hash Sequence

Finally, inserting (36) and (B.5) into (15), we arrive at (39).

## References

1. Haitsma J, Kalker T, Oostveen J: **Robust audio hashing for content identification.** *Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI '01), September 2001, Brescia, Italy*, 117-125.
2. Mihçak MK, Venkatesan R: **A perceptual audio hashing algorithm: a tool for robust audio identification and information hiding.** In *Proceedings of the 4th International Workshop on Information Hiding (IHW '01), April 2001, Pittsburgh, Pa, USA, Lecture Notes in Computer Science*. *Volume 2137*. Springer; 51-65.
3. Baluja S, Covell M: **Content fingerprinting using wavelets.** *Proceedings of the 3rd European Conference on Visual Media Production (CVMP '06), November 2006, London, UK*, 209-212.
4. Kim S, Yoo CD: **Boosted binary audio fingerprint based on spectral subband moments.** *Proceedings of the 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), April 2007, Honolulu, Hawaii, USA*, **1:** 241-244.
5. Haitsma J, Kalker T: **A highly robust audio fingerprinting system.** *Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR '02), October 2002, Paris, France*, 107-115.
6. Blum M: **On the central limit theorem for correlated random variables.** *Proceedings of the IEEE* 1964, **52**(3):308-309.
7. Magnus JR, Neudecker H: *Matrix Differential Calculus with Applications in Statistics and Econometrics*. 2nd edition. John Wiley & Sons, New York, NY, USA; 1999.
8. Balado F, Hurley NJ, McCarthy EP, Silvestre GCM: **Performance analysis of robust audio hashing.** *IEEE Transactions on Information Forensics and Security* 2007, **2**(2):254-266.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.