Combining PRNU and noiseprint for robust and efficient device source identification

PRNU-based image processing is a key asset in digital multimedia forensics. It allows for reliable device identification and effective detection and localization of image forgeries, in very general conditions. However, performance degrades significantly in challenging conditions involving data of low quality and quantity. These include working on compressed and cropped images, or estimating the camera PRNU pattern from only a few images. To boost the performance of PRNU-based analyses in such conditions, we propose to leverage the image noiseprint, a recently proposed camera-model fingerprint that has proved effective for several forensic tasks. Numerical experiments on datasets widely used for source identification prove that the proposed method ensures a significant performance improvement in a wide range of challenging situations.


Introduction
Image attribution is a fundamental task in the context of digital multimedia forensics and has gained a great deal of attention in recent years. Being able to identify the specific device that acquired a given image represents a powerful tool in the hands of investigators fighting such heinous crimes as terrorism and child pornography.
Most successful techniques for device identification rely on the photo response non-uniformity (PRNU) [1,2], a sort of fingerprint left by the camera in each acquired photo. The PRNU pattern originates from subtle random imperfections of the camera sensor which affect, in a deterministic way, all acquired images. Each device has its specific PRNU pattern, which can be accurately estimated by means of sophisticated processing steps, provided a large number of images acquired by the device itself is available. Given the camera PRNU pattern, image attribution is relatively easy and very reliable, in ideal conditions. However, the performance degrades rapidly when the operating conditions are less favourable, since it is ultimately related to the quantity of available data and their quality [3]. In particular, performance may be severely impaired when i) tests take place on small image crops, ii) a limited number of images is available for estimating the device PRNU. These challenging cases are of great interest for real-world applications, as clarified in the following.
(Correspondence to: verdoliv@unina.it, University Federico II of Naples, Via Claudio 21, 80125 Naples, Italy. Tel.: +39 081 76-83929)
Modern cameras generate very large images, with millions of pixels, which provide abundant information for reliable device identification. However, large-scale investigations are often necessary. The analyst may be required to process a large number of images, for example all images downloaded from a social account, and look for their provenance in a huge database of available cameras (PRNU patterns). This calls for an inordinate processing time, unless suitable methods are used to reduce computation, typically involving some form of data summarization [4,5,6]. Working on small crops, rather than on the whole image, is a simple and effective way to achieve such a goal. In addition, the PRNU pattern can also be used to perform image forgery localization [2,7]. In this case, a PRNU-based sliding-window analysis is necessary, on patches of relatively small size.
Turning to the second problem, it may easily happen, in criminal investigations, that only a few images acquired by the camera of interest are available to estimate its PRNU pattern. Moreover, the number and type of source cameras may be themselves unknown, and must be estimated in a blind fashion based on a relatively small set of unlabeled images [8,9,10,11]. In such cases, one must cope with low-quality estimation rather than abandoning the effort.
In this work, we tackle these problems and propose a novel source identification strategy which improves the performance of PRNU-based methods when only a few, or even just one, image is available for estimation, and when only small images may be processed. To this end, we rely on a recent approach for camera model identification [12] and use it to improve the PRNU-based source device identification performance. Camera model identification has received great attention in recent years, with a steady improvement of performance, thanks to the availability of huge datasets on which learning-based detectors can be trained, and to the introduction of convolutional neural networks (CNNs). The supervised setting guarantees very good performance [13,14], especially if deep networks are used. However, such solutions are highly vulnerable to attacks [15,16]. To gain higher robustness, unsupervised or semi-supervised methods may be used. For example, in [17] features are extracted through a CNN, while classification relies on machine learning methods. Interestingly, only the classification step needs to be re-trained when testing on camera models that are not present in the training set. Likewise, in [18] it has been shown that proper fine-tuning strategies can be applied to camera model identification, a task that shares many features with other forensic tasks. Of course, this makes the problem easier to face, given that in a realistic scenario it is not possible to include all possible camera models in the training phase. A further step in this direction can be found in [12], where the use of a new fingerprint has been proposed, called noiseprint, related to camera model artifacts and extracted by means of a CNN trained in Siamese modality. Noiseprints can be used in PRNU-like scenarios but require much less data to reach a satisfactory performance [19].
The main idea of this paper is to make use of noiseprints to support PRNU-based device identification. In fact, although noiseprints allow only for model identification, they are much stronger than PRNU patterns, and more resilient to challenging conditions involving restrictions on the number and size of images.
In the rest of the paper, after providing background material on PRNU and noiseprint in Section II, and describing the proposed method in Section III, we carry out a thorough performance analysis, in Section IV, on datasets widely used in the literature and in several conditions of interest, proving the potential of the proposed approach. Finally, Section V concludes the paper.

Background
Device-level source identification relies on specific marks that each individual device leaves on the acquired images. Barring trivial cases, like the presence of multiple defective pixels in the sensor, most identification methods resort to the photo response non-uniformity (PRNU) pattern. In fact, due to unavoidable inaccuracies in sensor manufacturing, sensor cells are not perfectly uniform and generate pixels with slightly different luminance in the presence of the same light intensity. Accordingly, a simplified multiplicative model for an image I generated by a given camera is

I = I^0 + I^0 K + Θ

where I^0 is the true image, K is the PRNU pattern, Θ accounts for all other sources of noise, and all operations are pixel-wise. The PRNU is unique for each device, stable in time, and present in all images acquired by the device itself. Therefore it can be regarded as a legitimate device fingerprint, and used to perform a large number of forensic tasks [2,20,7]. The PRNU of camera C_i can be estimated from a suitable number of images, say, I_{i,1}, ..., I_{i,N}, acquired by the camera itself. First, noise residuals are computed by means of some suitable denoising algorithm f(·)

W_{i,n} = I_{i,n} − f(I_{i,n})

The denoiser removes to a large extent the original image, I^0_{i,n}, regarded as a disturbance here, in order to emphasize the multiplicative pattern [21]. Then the PRNU, K_i, can be estimated by plain average or according to a maximum-likelihood rule, in order to reduce the residual noise. Moreover, to remove unwanted periodicities, related to JPEG compression or model-based artifacts, the estimated fingerprint is further filtered by subtracting the averages of each row and column and by using a Wiener filter in the DFT domain [1,22]. Eventually, the estimate converges to the true PRNU as N → ∞.
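The estimation pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the denoiser f(·) is left as a parameter (a wavelet-based denoiser is typically used), and the final Wiener filtering in the DFT domain is omitted for brevity.

```python
import numpy as np

def estimate_prnu(images, denoise):
    """Rough PRNU estimate from a stack of same-size grayscale images.

    `denoise` is any denoising function f(I) -> I_hat; it is a parameter
    here because the choice of denoiser is an implementation detail.
    """
    # Noise residuals: W = I - f(I)
    W = np.stack([I - denoise(I) for I in images])
    I = np.stack(images)
    # Maximum-likelihood-style estimate: K = sum(W * I) / sum(I * I)
    K = (W * I).sum(axis=0) / (I * I).sum(axis=0)
    # Subtract row and column averages to suppress periodic artifacts
    K = K - K.mean(axis=1, keepdims=True)
    K = K - K.mean(axis=0, keepdims=True)
    return K
```

With a perfect denoiser the residuals would contain only the scaled PRNU plus noise, and the ratio above converges to K as the number of images grows.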
Assuming the true reference PRNU is known, we can check whether image I_m was acquired by camera C_i based on the normalized cross-correlation (NCC) between W_m and K_i

NCC(W_m, K_i) = ⟨W_m, K_i⟩ / (‖W_m‖ · ‖K_i‖)

with ⟨·,·⟩ and ‖·‖ indicating inner product and Euclidean norm, respectively. The computed NCC will be a random variable, due to all kinds of disturbances, with a positive mean if the image was taken by camera C_i, and zero mean otherwise. In the following, to allow easier interpretation, we present results in terms of a PRNU-based pseudo-distance, defined as the complement to 1 of the NCC

D_PRNU(i, m) = 1 − NCC(W_m, K_i)

Methods based on PRNU have shown excellent performance for source identification [2] in ideal conditions. However, the NCC variance increases both when the image size decreases and when the number of images used to estimate the reference PRNU decreases, in which case decisions may become unreliable. Both conditions may easily occur in real-world practice, as recalled in the Introduction. Therefore, further evidence, coming from camera-model features, is highly welcome to improve identification performance.
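The NCC and the derived pseudo-distance are straightforward to compute; a minimal sketch:

```python
import numpy as np

def ncc(W, K):
    """Normalized cross-correlation between residual W and reference PRNU K."""
    return float(np.vdot(W, K) / (np.linalg.norm(W) * np.linalg.norm(K)))

def d_prnu(W, K):
    """PRNU-based pseudo-distance: complement to 1 of the NCC."""
    return 1.0 - ncc(W, K)
```

By construction, D_PRNU is close to 0 for a matching device (NCC near its positive mean) and close to 1 for a non-matching one (NCC near zero).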
Indeed, it is well known that, besides device-specific traces, each image also bears model-specific traces, related to the processes carried out inside the camera, such as demosaicking or JPEG compression. Such traces are treated as noise in the PRNU estimation process, and mostly removed, but they can provide precious help to tell apart cameras of different models.
Recently, a camera-model fingerprint has been proposed [12], called noiseprint, extracted by means of a suitably trained convolutional neural network (CNN). A noiseprint is an image-size pattern, like the PRNU, in which the high-level scene content is removed and model-related artifacts are emphasized. This is obtained by training the CNN in a Siamese configuration, as illustrated in Fig.1. Two identical versions of the same net are fed with pairs of patches extracted from unrelated images. Such pairs have positive label when they come from the same model and have the same spatial position, and negative label otherwise. During training, thanks to a suitable loss function, the network learns to emphasize the similarities among positive pairs, that is, the camera model artifacts. In [12] noiseprints have been shown to enable the accomplishment of numerous forensic tasks, especially image forgery localization.
Similarly to the reference PRNU pattern, the reference noiseprint of a model is obtained by averaging a large number of noiseprints extracted by images acquired by the same model (not necessarily the same device) [19], in formulas where φ (·) is the function implemented by the CNN, R i is the estimated reference pattern of the i-th model and, I i,n is the n-th image taken by the i-th model. To verify whether a given image I m was acquired by camera model M i , we extract its noiseprint, φ (I m ), and compute the mean square error (MSE) with respect to the model reference pattern R i . For homogeneity with the previous case, this is also called NP-based distance Again D NP (i, m) is expected to be small if the m-th image was indeed acquired by the i-th model and large otherwise, and its variance depends, again, on the size and number of images used to estimate the reference pattern. While the image noiseprint does not allow to single out the camera that acquired a given image, it can be used to discard or play down a large number of candidates, that is, all cameras whose model does not fit the test noiseprint. Moreover, the model-related traces found in noiseprints are much stronger than the PRNU traces found in image residuals, as clearly shown in the example of Fig.2, hence noiseprintbased decisions keep being reliable even over small regions and when a limited number of images is available for estimation [19].
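The reference-pattern averaging and the NP-based distance can be sketched as follows; here `phi` stands in for the trained noiseprint CNN of [12], which we treat as a black-box function (an assumption of this sketch).

```python
import numpy as np

def reference_noiseprint(images, phi):
    """Average the noiseprints phi(I) of images from one camera model.

    `phi` is a placeholder for the trained CNN that maps an image to
    its noiseprint; any image-to-array function works for illustration.
    """
    return np.mean([phi(I) for I in images], axis=0)

def d_np(I, R, phi):
    """Noiseprint-based distance: MSE between phi(I) and the reference R."""
    diff = phi(I) - R
    return float(np.mean(diff * diff))
```

As with the PRNU, the more images are averaged, the less residual noise survives in R_i, and the more stable D_NP becomes.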

Proposed method
In this work, we want to improve the performance of device source identification in two critical scenarios: i) only a few images are available to estimate the reference PRNU pattern; and ii) only a small crop of the image is used for testing. The few-image scenario accounts for cases where the physical device is not available, so that the only available images are recovered from a hard disk or maybe from a social account of a suspect. Instead, the small-region scenario is motivated by the need to reduce memory and time resources for very large-scale analyses. In addition, a good performance on small regions allows one to use this approach for image forgery detection and localization, especially in the critical case of small-size forgeries. In Fig.3, we show the histograms of the PRNU-based (left) and noiseprint-based (right) distances computed on the widespread VISION dataset [23] in various situations (different rows). Each subplot shows three different histograms: 1. same-device (green): the distance is evaluated between an image acquired by camera C of model M and the reference pattern of the same camera; 2. different-device same-model (blue): the distance is evaluated between an image acquired by camera C of model M and the reference pattern of a different camera C′ of the same model M; 3. different-device different-model (red): the distance is evaluated between an image acquired by camera C of model M and the reference pattern of a different camera C′ of a different model M′.
On the first row, we consider a nearly ideal situation, where tests are performed on 1024×1024 crops of the image and plenty of images are available (we limit them to N=100) to estimate the reference patterns. The subplot on the left shows that the PRNU-based distance separates very well same-device from different-device samples. On the contrary, the two different-device distributions (same-model and different-model) overlap largely, since the PRNU does not bear model-specific information. Then, when the number of images used to estimate the reference pattern decreases (second row, N=1) or the analysis crop shrinks significantly (third row, d=64), same-device and different-device distributions are not so well separated anymore, and PRNU-based decisions become unreliable. Eventually, in the extreme case of N=1 and d=64 (fourth row), all distributions collapse. The right side of the figure shows histograms of the noiseprint-based distance. In the ideal case (top row) we now observe a very good separation between same-model and different-model histograms, while the two same-model distributions overlap, as noiseprints do not carry device-related information. Unlike with PRNU, however, when the analysis conditions deviate from the ideal (following rows), the same-model and different-model distributions remain reasonably well separated, allowing for reliable model discrimination even in the worst case (fourth row). This suggests the opportunity to use the noiseprint-based distance to support decisions insofar as different models are involved. In summary, our proposal is to use noiseprint-based model-related information to support PRNU-based device identification. Assuming the camera model of the image under analysis is known, the search for the source device can be restricted to devices of the same model, thereby reducing the risk of wrong identification, especially in the most critical cases.
However, in real-world scenarios, camera models may not be known in advance, calling for a preliminary model identification phase, which is itself prone to errors. Therefore, with the hierarchical procedure outlined before, there is the non-negligible risk of excluding right away the correct device. For this reason, we prefer to exploit the two pieces of information jointly, rather than hierarchically, by suitably combining the two distances, as shown pictorially in Fig.4. Note that, in this worst-case scenario, where no model-related information is known a priori, the noiseprint reference pattern is estimated from the very same images used to estimate the PRNU pattern. Hence, the resulting performance represents a lower bound for more favourable conditions.
We consider the following binary hypothesis test: H_0: the m-th image was not acquired by the i-th device; H_1: the m-th image was acquired by the i-th device;
and propose three different strategies to combine the two distances.
SVM: as a first strategy, we adopt a linear support-vector machine (SVM) classifier. Relying on a large dataset of examples, the SVM finds the hyperplane that best separates samples of the two hypotheses, maximizing the distance from the hyperplane to the nearest points of each class. Then, we use the oriented distance from the hyperplane as a score to identify the device that acquired the image. If a limited number of candidate devices is given, the maximum-score device is chosen. Otherwise, the score is compared with a threshold to make a binary decision, and the threshold itself may be varied to obtain a performance curve. These criteria apply to the following cases as well.
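A linear SVM fusing the two distances can be sketched with scikit-learn; the cluster centers and spreads below are synthetic, for illustration only (the paper trains on real distance pairs from the VISION dataset).

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy training set of (D_PRNU, D_NP) pairs; values are invented,
# chosen so that same-device pairs have small distances.
rng = np.random.default_rng(0)
same = rng.normal([0.2, 0.1], 0.05, size=(200, 2))   # H1: same device
diff = rng.normal([0.9, 0.8], 0.05, size=(200, 2))   # H0: different device
X = np.vstack([same, diff])
y = np.array([1] * 200 + [0] * 200)

svm = LinearSVC(C=1.0).fit(X, y)
# Oriented distance from the separating hyperplane, used as fusion score
scores = svm.decision_function(X)
```

In the closed-set scenario the candidate device with maximum score is chosen; in the open-set scenario the score is thresholded.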
Likelihood-Ratio Test: to formulate a likelihood-ratio test (LRT), we model D_PRNU and D_NP as random variables with a jointly Gaussian distribution under both hypotheses

(D_PRNU, D_NP) ~ N(µ_0, Σ_0) under H_0,   (D_PRNU, D_NP) ~ N(µ_1, Σ_1) under H_1

The parameters µ_0, Σ_0, µ_1, Σ_1 are estimated, for each situation of interest, on a separate training set. In particular, we use two methods to estimate these parameters: the classical ML approach, and a robust approach based on the minimum covariance determinant (MCD) proposed in [24]. With this Gaussian model and the estimated parameters, we can compute the log-likelihood ratio and use it as decision statistic.
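Under the Gaussian model, the decision statistic is the difference of the two log-densities; a minimal sketch, with the parameters passed in as if already estimated on a training set:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_ratio(d, mu0, S0, mu1, S1):
    """LRT decision statistic log p(d|H1) - log p(d|H0) for the 2-D
    feature d = (D_PRNU, D_NP), under jointly Gaussian hypotheses."""
    return (multivariate_normal(mu1, S1).logpdf(d)
            - multivariate_normal(mu0, S0).logpdf(d))
```

A positive value favours H_1 (same device); since the two covariance matrices generally differ, the resulting decision boundary is quadratic rather than linear.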
Fisher's Linear Discriminant: this approach looks for the direction along which the distributions of the two classes are best separated, measuring the separation as the ratio of the inter-class to intra-class variance. The weight vector of the optimal direction, w_opt, is given by

w_opt ∝ (Σ_0 + Σ_1)^{−1} (µ_1 − µ_0)

where, again, the means and covariance matrices under the two hypotheses, µ_0, Σ_0, µ_1, Σ_1, are estimated on a training set. In this case, the score is given by the projection of the vector formed by the two distances along the optimal direction.
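The Fisher direction and score amount to one linear solve and one dot product; a minimal sketch (the specific formula for w_opt matches the standard two-class FLD, assumed to be the one used here):

```python
import numpy as np

def fld_direction(mu0, S0, mu1, S1):
    """Fisher direction w_opt = (S0 + S1)^(-1) (mu1 - mu0)."""
    return np.linalg.solve(np.asarray(S0) + np.asarray(S1),
                           np.asarray(mu1) - np.asarray(mu0))

def fld_score(d, w):
    """Score: projection of the distance pair d = (D_PRNU, D_NP) on w."""
    return float(np.dot(w, d))
```

Unlike the LRT, this always yields a linear decision boundary, which, as the experiments below suggest, is sufficient for this feature space.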

Experiments
In this section, we assess the performance of the proposed device identification method in all its variants, and in various realistic scenarios of interest. We consider two typical scenarios (Fig.5): 1. closed-set. The set of candidate sources is predefined and one is required to associate the test image with one of the devices in the set. Therefore, the task can be regarded as a multi-class classification problem. Accordingly, we evaluate the classification performance in terms of accuracy, that is, probability of correct decision. 2. open-set or verification. The set of candidate sources is not given a priori. In this case, one can only decide whether or not the test image was acquired by a certain device. Therefore, the problem is now binary classification, and we evaluate the classification performance, as customary, in terms of probability of correct detection P_D and probability of false alarm P_FA, summarized by a receiver operating characteristic (ROC) curve, computed for varying decision threshold, and eventually by the area under such a curve (AUC).
Fig. 5 On the left, the closed-set scenario, where the test image is assigned to one of the devices in a set of known cameras. In the verification or open-set scenario, on the right, the task is to decide whether the test image was acquired by a certain device or not.
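Given a set of scores for same-device and different-device pairs, the open-set figure of merit can be computed directly; a sketch with synthetic scores (the values are invented, chosen only so that same-device pairs score higher):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic verification scores, for illustration only
rng = np.random.default_rng(1)
scores_same = rng.normal(1.0, 0.3, 500)   # ground-truth positives
scores_diff = rng.normal(0.0, 0.3, 500)   # ground-truth negatives
y_true = np.r_[np.ones(500), np.zeros(500)]
y_score = np.r_[scores_same, scores_diff]

# Area under the ROC curve obtained by sweeping the decision threshold
auc = roc_auc_score(y_true, y_score)
```

An AUC of 0.5 corresponds to coin tossing, while values close to 1 indicate that almost every threshold choice yields high P_D at low P_FA.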

Datasets
Three different datasets have been employed for our experiments, two for training and one for testing. In fact, in order to avoid any bias, we trained the CNN that extracts the noiseprints and the source identification classifiers on disjoint sets of data. Concerning noiseprints, the network is trained on a large dataset of images publicly available on dpreviewer.com. Our collection comprises 625 camera models, 600 for training the network and 25 for validation, with a number of images per model ranging from 8 to 173. The parameters of the source identification classifiers, instead, have been estimated on the VISION dataset [23], often used for camera model identification, which comprises images acquired by 29 different models, most of them with only one device. Finally, tests have been conducted on the Dresden dataset [25], proposed originally for camera source identification, comprising images from 25 camera models, 18 of which feature two or more different devices. We selected these latter models with all their devices, for a grand total of 66. Details are reported in Tab.1.

Results
To assess the source identification performance of the conventional PRNU-only and the proposed PRNU+noiseprint methods we consider several challenging cases obtained by varying the size of the image crop used for testing and the number of images used for estimating the reference patterns. In addition, we consider also the case of JPEG images compressed at two quality factors, aimed at simulating a scenario in which images are downloaded from social network accounts, where compression and resizing are routinely performed. Note that the PRNU and noiseprint reference patterns are both estimated from the very same images, since no prior information is assumed to be available on the camera models.
To gain some insight into the role of the two distances, Fig.6 shows the scatter plots of D_PRNU and D_NP for images of the VISION dataset in the same cases of Fig.3. Here, however, we only show same-device (green) and different-device (red) points, irrespective of models. In the most favourable case of d=1024 and N=100 (top-left), the two clusters can be separated very effectively by a vertical line, that is, based only on the PRNU distance. Instead, in the worst case of d=64 and N=1 (bottom-right), the PRNU-based distance is basically useless, while the noiseprint-based distance provides by itself a pretty good separation, with residual errors mostly due to same-model different-device points. In intermediate cases, both pieces of information help to separate the two clusters effectively. Note that a weak correlation between the two distances exists for same-device points, likely due to the imperfect rejection of unwanted contributions. Of course, a different classifier must be designed for each situation of interest, based on the corresponding training set.
We now analyze the performance on the test dataset, Dresden, never used in training, beginning with the closed-set scenario, Tab.2. On the rows, we consider all combinations of d=64, 256, 1024, and N=1, 10, 100. When N=1 or 10, results are averaged over 10 repetitions with different reference patterns. The leftmost columns show results for the two baselines, based only on the PRNU distance (that is, the conventional method of [2]) and only on the noiseprint distance. Again, the performance of the conventional method is very good in the ideal case, but worsens significantly in more challenging situations, and is only slightly better than random choice when N=1 and d=64. On the contrary, noiseprint-only identification is never very good, since there are always from 2 to 5 indistinguishable devices for each model, but remains remarkably stable in all conditions, suggesting that models keep being identified accurately, with only minor impairments even in the most challenging cases. In the following column we show a third reference, called "ideal", which represents an upper bound for the performance of the proposed approach. Here, we assume perfect model identification, and rely on the PRNU-based distance only for the final choice within the restricted set of same-model cameras. In the most favourable case (d=1024, N=100) the PRNU-only performance was already quite good, and only a marginal improvement is observed. Likewise, in the worst case (d=64, N=1), PRNU is unreliable, and the fusion improves only marginally upon the noiseprint-only performance. However, in intermediate cases, large improvements are observed, with the accuracy growing from 0.649 to 0.793 (case d=1024, N=1) or from 0.342 to 0.610 (d=64, N=100). This performance gain fully justifies our interest in this approach; the proposed methods need only work reasonably close to this upper bound.
In the following five columns we report results for the various versions of the proposed method, based on the support vector machine classifier (SVM), the likelihood ratio test with ML estimation (LRT) and robust estimation (r-LRT) of parameters, and Fisher's linear discriminant in the same two versions (FLD and r-FLD). Results are fully satisfactory. Taking for example the FLD column, the gap with respect to the ideal reference remains quite small in all cases of interest and, consequently, a large improvement is observed with respect to the conventional method. This confirms that the noiseprint-based model classification remains largely successful in most conditions, and helps improve the overall performance. As for the various versions of the proposed approach, there seems to be no consistent winner, with FLD providing slightly better results, on average, and SVM showing an isolated bad point (d=1024, N=1), maybe due to some overfitting. In particular, the nonlinear decision boundary of the LRT criterion does not seem to ensure improvements over the linear boundaries of SVM and FLD, confirming that the two classes are linearly well separated in the feature space. The versions based on robust estimation of parameters perform on par with, or slightly worse than, the ML counterparts.
Tab.3, structurally identical to Tab.2, provides results for the open-set scenario in terms of area under the ROC curve (AUC). All considerations made for the closed-set scenario continue to hold here. Of course, numbers are much larger than before, because we are considering binary decisions, where an AUC of 0.5 is equivalent to coin tossing and good results correspond to AUCs close to 1. This is actually the case for the proposed method. Considering again the FLD column, the AUC is never less than 0.935 and not far from that of the ideal reference, while it is often much larger than the AUC of the conventional method. Actually, it is worth emphasizing that the noiseprint-only method also has quite good performance indicators. In hindsight, this is not too surprising. This scenario, in fact, fits the case of large-scale analysis, where a very large number of candidate sources must be considered. Accurate model identification allows one to reject right away most of the candidates (small P_FA, large AUC), and hence to focus on a limited set of candidates, to be analyzed with greater care and resources. Like before, differences among the various versions of the proposal are negligible.
The next four tables refer to the case of images compressed using JPEG, with QF=90 (Tab.4 and Tab.5) and with QF=80 (Tab.6 and Tab.7), always for both the closed-set and open-set scenarios. First of all, with reference to the closed-set scenario, let us analyze the performance of the conventional method as the image quality degrades. Only in the ideal case is the accuracy fully satisfactory, while it decreases dramatically in all other conditions, for example, from 0.649 (uncompressed) to 0.364 (QF=80) for (d=1024, N=1). In fact, JPEG compression filters out as noise most of the faint traces on which source identification methods rely. This is also true for the noiseprint traces. However, in the same case as before, with the robust-FLD version, the proposed method still grants an accuracy of 0.540, with a more limited loss from the 0.752 accuracy on uncompressed images, and a large gain with respect to the conventional method. The same behavior is observed, with random fluctuations, in all other cases, and also in the open-set scenario, so we refrain from a tedious detailed analysis. However, it is worth pointing out that, in the presence of compression, the versions based on robust estimation (r-LRT and r-FLD) provide a consistent, and often significant, improvement over those relying on ML estimation.

Conclusions
In this paper, we proposed to use noiseprint, a camera-model image fingerprint, to support PRNU-based forensic analyses. Numerical experiments prove that the proposed approach ensures a significant performance improvement in several challenging situations easily encountered in real-world applications. This is only a first step in this direction, and there is certainly much room for further improvement. In future work, we want to extend the proposed approach to improve PRNU-based image forgery detection and localization, and also to perform accurate blind image clustering, an important problem in multimedia forensics.
Table 7 AUC on the Dresden dataset in the open-set scenario with compression (QF=80). Fusion preserves a reasonable performance even with scarce data, with robust FLD almost always the best method.