Boosting CNN-based primary quantization matrix estimation of double JPEG images via a classification-like architecture

The estimation of the primary quantization matrix in double JPEG images is important in several applications, in particular for splicing localization. In addition to traditional statistical approaches, deep learning has recently been exploited to design a well-performing estimator, by training a Convolutional Neural Network (CNN) model to solve the estimation as a standard regression problem. In this paper, we propose the use of a classification-like CNN architecture to solve the estimation, exploiting the integer nature of the quantization coefficients and using a suitable loss function for training that takes into account both the accuracy and the Mean Square Error (MSE) of the estimation. The capability of the method to work under general operative conditions, regarding the alignment of the second compression grid with that of the first compression and the combinations of the JPEG qualities of the first and second compression, is very relevant in practical applications, where this information is unknown a priori. Results confirm the effectiveness of the proposed technique compared to state-of-the-art methods based on statistical analysis and deep learning regression.


Introduction
Detection of double JPEG (DJPEG) compression is one of the most widely studied problems in image forensics, see for instance [1][2][3]. The interest of researchers in this topic is motivated by the fact that double compression can reveal important information about the past history of an image. Such information can be obtained by estimating the quality of the first JPEG compression and, moreover, by estimating the primary quantization matrix used for the first JPEG compression. Given an image with several copy-pasted regions, it is possible to identify the different origins of the tampered regions by recognizing that they were first compressed with different JPEG qualities and, more generally, that they are characterized by different primary quantization matrices (while there is no standard definition of the JPEG quality, the concept of quantization matrix is a standard one [4]).
Several methods have been proposed in the literature for the estimation of the primary quantization matrix. Many of them exploit statistical modeling of DCT coefficients [5][6][7][8]. A common feature of all these approaches is that they work only under particular operative conditions and settings regarding the relationship between the JPEG qualities of the first and second compression, and the alignment of the 8 × 8 grid of the first compression with the second one. For instance, the method in [7] works only when the two compressions are aligned and the second quantization step is lower than the first one, that is, when the quality of the second JPEG is higher than the quality of the first one (hence, $QF_1 < QF_2$ for the standard quantization matrix case). Similarly, the method in [6] is designed for the aligned JPEG case and cannot estimate the first quantization step when this is a divisor of the second one. The algorithm proposed in [5] can work both in the aligned and non-aligned cases; however, the performance drops when $QF_1 > QF_2$. Finally, the method in [8] works in the non-aligned case only. Another drawback of such model-based techniques and of approaches that rely on hand-crafted features is that their performance tends to decrease significantly when they are applied to small patches, which prevents the application of these methods to the local estimation of the quantization matrix of the first compression (useful for tampering localization applications).
arXiv:2012.00468v1 [cs.CV] 1 Dec 2020
A method for primary quantization matrix estimation based on Convolutional Neural Networks (CNN) has been recently proposed in [9]. This method can work under very general operative conditions and on small (64 × 64) patches, and it has been shown to outperform previous approaches, both in terms of accuracy and mean square error (MSE) of the estimation. In particular, in [9], the CNN is trained to minimize the squared difference between the predicted values of the quantization coefficients and the true values, hence the MSE of the estimation is minimized. Some works in the deep learning literature, however, show that CNNs are better suited to classification than to regression problems. CNNs can in fact achieve remarkably accurate results when trained to predict categorical variables, drawn from discrete probability distributions of data [10,11]. Whenever possible, switching to a classification problem, or considering hybrid methods that combine classification with regression, has been shown to yield better results [12]. In [12], for instance, soft values are estimated by using a quantized regression architecture that first obtains a quantized estimate (using the softmax followed by the cross-entropy loss), and then refines it through regression of the residual.
Given the above, in this paper we focus on improving the performance of CNN-based estimation of the primary quantization matrix by turning the regression into a classification-like problem, with the design of a suitable CNN architecture. Our approach starts from the observation that the quantization coefficients can only take integer values. Therefore, we design a structure such that the estimation of a vector of integer values, namely all the coefficients of the quantization matrix, can be performed in a classification-like fashion. For the implementation of the network (internal layers), we consider the same CNN architecture already considered in [9] with good results, namely DenseNet [13]. Similarly to [9], the CNN-based estimator is designed to work under very general operative conditions, i.e., when the second compression grid is either aligned or not with that of the first compression, and for every combination of qualities of the first and second compression. The capability of the method to work under both aligned and non-aligned DJPEG, and for all possible combinations of JPEG qualities, is very relevant in practical applications, where this information is not known a priori, which makes the adoption of dedicated methods impractical. As with the method in [9], another remarkable strength of the proposed estimator is that it works on small patches, which opens the way to the application of the method to tampering localization.
The rest of this paper is organized as follows: Section 2 recaps the main concepts of double compression and introduces the notation. The proposed method is described in Section 3. Then, Section 4 details the experimental methodology and the results are reported and discussed in Section 5. We conclude the paper with some final remarks in Section 6.

Basic concepts and notation
We denote by $Q$ the quantization matrix, that is, the 8 × 8 matrix with the quantization steps of the DCT coefficients considered for the compression. A double compression occurs when an image compressed with a given $Q_1$ is decompressed (decompression involves dequantization and inverse DCT) and compressed again with a second quantization matrix $Q_2$. The elements of $Q_1$ can be conveniently arranged in a zig-zag ordered vector of dimensionality 64 [4]. We denote by $q_1$ this 64-dim vector built from $Q_1$. As commonly done in the literature [5][6][7][9], we focus on the first elements of $q_1$ and restrict the estimation to those coefficients. We denote by $(q_1)_{N_c} = [q_{1,1}, q_{1,2}, ..., q_{1,N_c}]$ the vector of the first $N_c$ coefficients of $q_1$. The coefficients at the medium-high DCT frequencies are in fact more difficult to estimate accurately, due to the stronger quantization usually applied to them; however, since these coefficients are not very discriminative (they tend to be similar for most quantization matrices), their estimation is less important.
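The zig-zag arrangement and the truncation to the first $N_c$ coefficients can be illustrated with a minimal Python sketch. The function names and the toy matrix are hypothetical and only serve the illustration; they are not taken from the released code.

```python
def zigzag_indices(n=8):
    """Return the (row, col) pairs of an n x n matrix in zig-zag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals run from bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def first_coefficients(Q, n_c):
    """Flatten the square matrix Q in zig-zag order and keep the first n_c entries."""
    return [Q[r][c] for r, c in zigzag_indices(len(Q))[:n_c]]


# Toy matrix whose entries encode their own position (8 * row + col).
Q_demo = [[8 * r + c for c in range(8)] for r in range(8)]
q_demo = first_coefficients(Q_demo, 6)
```

With the toy matrix, `q_demo` lists the positions visited first by the scan, which makes the ordering easy to check by eye.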
When a JPEG image is compressed a second time, the second compression grid can be either aligned or non-aligned to the first compression grid. The case of a non-aligned DJPEG corresponds to the most frequent scenario in practice. A grid misalignment occurs locally when image splicing is performed, that is, when a region of a single JPEG image is copy-pasted into another image, since in this case the alignment between the compression grids is rarely preserved. On a global level, we have a non-aligned DJPEG when the image is cropped in between the former and second compression stage, or some processing is applied causing a de-synchronization.
The quality of the JPEG compression is often summarized by compression software by means of the JPEG Quality Factor (QF), whose values range from 0 to 100 (QF values lower than 50, however, are seldom used in practice nowadays since they correspond to extremely low qualities). A QF value specifies a quantization matrix $Q$ (standard quantization matrix). For convenience, in the rest of the paper, we refer to the JPEG Quality Factor (QF). Note that, in principle, the proposed estimator can be applied to estimate any quantization matrix of the first compression, be it standard or non-standard. In the rest of the paper, we denote by $QF_2$ the QF of the second compression and by $QF_1$ the QF of the first one.
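How a QF value maps to a standard quantization matrix depends on the software. The sketch below assumes the widely used IJG/libjpeg convention (quality scaling applied to the example luminance table of the JPEG standard, Annex K); the text does not state which convention is meant, so both the table and the scaling rule are assumptions made for illustration.

```python
# Example luminance quantization table from the JPEG standard (Annex K).
BASE_LUMA = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quant_matrix(qf):
    """Luminance quantization matrix for a quality factor (IJG-style scaling, assumed)."""
    qf = max(1, min(100, qf))  # libjpeg clamps the quality to [1, 100]
    scale = 5000 // qf if qf < 50 else 200 - 2 * qf
    # Scale each base step, round, and clamp to the baseline range [1, 255].
    return [[max(1, min(255, (v * scale + 50) // 100)) for v in row]
            for row in BASE_LUMA]


Q50 = quant_matrix(50)   # equals the base table under this convention
Q90 = quant_matrix(90)
Q100 = quant_matrix(100)
```

Under this convention, higher QF values give smaller quantization steps, which is why $QF_1 < QF_2$ corresponds to a second quantization step lower than the first.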

Proposed CNN classification-like estimator
The proposed method starts from the observation that the values $q_{1,i}$ we want to estimate are discrete. In [9], where a regression problem is addressed to estimate $(q_1)_{N_c}$, the values obtained at the output of the CNN are finally quantized to get the estimated vector $(\hat{q}_1)_{N_c}$ (specifically, rounding is performed on each element of the output vector independently, yielding $\hat{q}_{1,i}$, $i = 1, ..., N_c$). However, for the estimation of discrete quantities (categorical distributions), it is preferable to resort to the softmax followed by the cross-entropy loss, which is well suited to backpropagation (see [12]). Therefore, we propose to switch from regression to a classification-like problem. To do so, we consider a custom structure of the output layer, trained first with a basic loss function and then with a refined loss function, both described in the following.

General structure
The architecture that we considered for the internal layers of the CNN, and in particular, the feature extraction part, is a dense structure, namely the DenseNet backbone architecture [13], and is described in Section 4.1. In the following, we describe the specific structure of the output layer.
In the proposed structure, each to-be-estimated coefficient $q_{1,i}$ ($i = 1, 2, ..., N_c$), which takes discrete values, is encoded as a one-hot vector. The dimensionality of the encoded vector is determined by the number of possible values that $q_{1,i}$ may take. Assuming that the quality of the image cannot be too low, for every $i$ the estimated coefficient $\hat{q}_{1,i}$ may take a limited number of values, that is, $1 \le \hat{q}_{1,i} \le q^M_{1,i}$. For simplicity, we set $q^M_{1,i}$ equal to the value of the $i$-th coefficient when $QF_1 = 50$ (minimum JPEG quality considered). [1] To get the desired output, we set the size of the logit-level output to $q^M_{1,1} + q^M_{1,2} + \cdots + q^M_{1,N_c}$; then, the softmax is applied block-wise to each of the $N_c$ blocks, where the $i$-th block has $q^M_{1,i}$ inputs, $i = 1, 2, ..., N_c$.
[1] This corresponds to assuming that the former JPEG quality is always higher than or equal to $QF_1 = 50$, which is often the case in practice and is thus not a big restriction (as a consequence, $Q_1$ matrices corresponding to lower qualities are not correctly estimated).
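As a minimal sketch of the output structure described above, the following plain-Python code one-hot encodes each coefficient and applies an independent softmax to each block of logits. The helper names are hypothetical, and the toy sizes are chosen only to keep the example readable.

```python
import math


def one_hot_blocks(q1, q_max):
    """Encode each coefficient q (1 <= q <= m) as a one-hot vector of length m."""
    blocks = []
    for q, m in zip(q1, q_max):
        if not 1 <= q <= m:
            raise ValueError(f"coefficient {q} outside [1, {m}]")
        blocks.append([1.0 if j == q else 0.0 for j in range(1, m + 1)])
    return blocks


def block_softmax(logits, q_max):
    """Split a flat logit vector into blocks of sizes q_max and softmax each one."""
    out, start = [], 0
    for m in q_max:
        z = logits[start:start + m]
        z_max = max(z)  # subtract the maximum for numerical stability
        e = [math.exp(v - z_max) for v in z]
        total = sum(e)
        out.append([v / total for v in e])
        start += m
    return out


q_max_demo = [3, 2]                      # toy upper bounds for two coefficients
y_demo = one_hot_blocks([2, 1], q_max_demo)
f_demo = block_softmax([0.0, 0.0, 0.0, 1.0, 1.0], q_max_demo)
```

Each block of `f_demo` sums to one, so every coefficient gets its own categorical distribution over its admissible values.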
For training the network we consider the following basic custom loss:
$$ L(x) = -\sum_{i=1}^{N_c} \sum_{j=1}^{q^M_{1,i}} y_{i,j} \log f_{i,j}(x), \tag{1} $$
where $y_{i,j}$ denotes the ground-truth label corresponding to $q_{1,i}$. According to Eq. (1), a cross-entropy loss is first computed on each block separately, then the loss is defined as the sum of all the cross-entropy loss terms. Figure 1 illustrates the scheme of the CNN considered for the estimation, with specific focus on the output layer. In the figure, $y$ denotes the ground-truth vector of the $N_c$ one-hot encoded vectors, having dimensionality $q^M_{1,1} + q^M_{1,2} + \cdots + q^M_{1,N_c}$, and $f(x)$ the output soft vector of the CNN, having the same dimensionality as $y$. Formally, $y = y_1 \oplus y_2 \oplus \cdots \oplus y_{N_c}$, where $\oplus$ denotes the horizontal concatenation, that is, $y_i = (y_{i,j})_{j=1}^{q^M_{1,i}}$, and, similarly, $f(x) = f_1(x) \oplus f_2(x) \oplus \cdots \oplus f_{N_c}(x)$ with $f_i(x) = (f_{i,j}(x))_{j=1}^{q^M_{1,i}}$. As we said, $q^M_{1,i}$, for every $i$, is determined by the value assumed by the $i$-th coefficient in the quantization matrix corresponding to the lowest $QF_1$ considered (that we set to 50 in our experiments). Then, the final estimated vector $(\hat{q}_1)_{N_c}$ is given by
$$ \hat{q}_{1,i} = \arg\max_{j} f_{i,j}(x), \quad i = 1, ..., N_c. $$
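The sum of per-block cross-entropy terms and the per-block arg max decoding can be sketched as follows. This is an illustrative plain-Python version with hypothetical names, not the TensorFlow implementation used in the experiments; the small `eps` inside the logarithm is an added numerical safeguard.

```python
import math


def block_cross_entropy(y_blocks, f_blocks, eps=1e-12):
    """Sum over blocks of the cross-entropy between one-hot labels and soft outputs."""
    return -sum(
        y * math.log(f + eps)
        for y_block, f_block in zip(y_blocks, f_blocks)
        for y, f in zip(y_block, f_block)
    )


def decode(f_blocks):
    """Estimated coefficient per block: 1-based position of the largest soft output."""
    return [max(range(len(f)), key=f.__getitem__) + 1 for f in f_blocks]


y_blocks = [[0.0, 1.0, 0.0], [1.0, 0.0]]   # true coefficients: 2 and 1
f_blocks = [[0.2, 0.7, 0.1], [0.4, 0.6]]   # toy network output
loss = block_cross_entropy(y_blocks, f_blocks)
q_hat = decode(f_blocks)
```

In this toy case the first coefficient is decoded correctly and the second is not, and only the soft values at the true positions (0.7 and 0.4) enter the loss.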

Refined loss function
For a given image $x$ and final predicted vector $(\hat{q}_1)_{N_c}$, the accuracy of the estimation is averaged over all the $N_c$ coefficients, that is,
$$ \mathrm{Accuracy}(x) = \frac{1}{N_c} \sum_{i=1}^{N_c} \mathbb{1}\{\hat{q}_{1,i} = q_{1,i}\}, $$
where $\mathbb{1}\{\cdot\}$ is the indicator function. The new classification-like structure trained with the loss function $L$ attempts to maximize the accuracy of the estimation, without caring about the MSE of the estimation. From Eq. (1), it is easy to see that solutions yielding large MSE values are not penalized compared to those yielding lower values, for the same soft values associated to the '1' positions in vector $y$ ($L(x)$ takes the same value). Said differently, an incorrect decision on the value of $q_{1,i}$ for some $i$, which results in a different wrong one-hot encoded vector, may lead to the same value of the loss function in Eq. (1) regardless of the estimated value, or better yet, regardless of the difference between the true and estimated value, i.e., $|q_{1,i} - \hat{q}_{1,i}|$.
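A small numerical example may help to see why the basic loss ignores the size of the error. In the hypothetical setting below (a single coefficient with five admissible values and true value 2), two wrong predictions assign the same soft value to the true position, so the cross-entropy term is identical even though one estimate is off by 1 and the other by 3.

```python
import math


def accuracy(q_true, q_hat):
    """Fraction of coefficients that are estimated exactly."""
    return sum(t == e for t, e in zip(q_true, q_hat)) / len(q_true)


def decode(f_blocks):
    """1-based arg max of each block of soft outputs."""
    return [max(range(len(f)), key=f.__getitem__) + 1 for f in f_blocks]


q_true = [2]
f_near = [[0.05, 0.20, 0.65, 0.05, 0.05]]  # arg max at 3, absolute error 1
f_far = [[0.05, 0.20, 0.05, 0.05, 0.65]]   # arg max at 5, absolute error 3

# Cross-entropy only looks at the soft value in the true position (0.20 in both cases).
loss_near = -math.log(f_near[0][q_true[0] - 1])
loss_far = -math.log(f_far[0][q_true[0] - 1])

err_near = abs(decode(f_near)[0] - q_true[0])
err_far = abs(decode(f_far)[0] - q_true[0])
```

Both predictions have zero accuracy and the same loss, while their squared errors differ by a factor of nine.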
Since both the accuracy and the MSE of the estimation are important in practice, we would like to get high accuracy for the estimation, without paying (much) in terms of MSE. In order to solve this issue, we investigated two possible solutions, which correspond to two possible refinements of the loss function. The first solution was to use a "smooth" categorical cross-entropy loss that keeps all the advantages of the standard cross-entropy loss, but at the same time assigns different weights to the errors depending on the position of the "1" inside each one-hot encoded vector, that is, depending on $|q_{1,i} - \hat{q}_{1,i}|$ for each $i$. The second solution, which gave better results, was to perform classification and regression jointly by considering a combined loss (as done by some approaches in the literature of deep learning and standard machine learning [14][15][16]). Given the classification-like architecture considered, defining a suitable loss function that takes into account the distance between the estimate and the true value, and then penalizes large values of such distance, is not obvious. To do so, we define a vector $d_y$ that reports in each position the distance from the '1' in the corresponding one-hot encoded vector $y_i$ (see Figure 2). Formally, let $d_y = d_{y_1} \oplus d_{y_2} \oplus \cdots \oplus d_{y_{N_c}}$, with $d_{y_i} = (|j - q_{1,i}|)_{j=1}^{q^M_{1,i}}$, where $q_1$ is the true vector of coefficients.
Figure 2 Vector $d_y$ of the relative distances obtained from $y$.
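The distance vector can be built directly from the true coefficients, as in the following minimal sketch. It assumes that position $j$ of the $i$-th block corresponds to the value $j$ (consistent with $1 \le \hat{q}_{1,i} \le q^M_{1,i}$) and that the distance is the absolute difference of positions; the function name is hypothetical.

```python
def distance_blocks(q1, q_max):
    """For each block, distance of every position j from the position of the '1'."""
    return [[abs(j - q) for j in range(1, m + 1)] for q, m in zip(q1, q_max)]


# Two coefficients with true values 2 and 1 and upper bounds 4 and 3.
d_demo = distance_blocks([2, 1], [4, 3])
```

Each block of `d_demo` is zero exactly where the one-hot vector has its '1' and grows linearly away from it.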
Then, we define the combined loss function $L_r$ in Eq. (5) as a weighted combination of two terms, where $c$ is a constant, $0 < c < 1$, determining the trade-off between them. We observe that, for each $i$, the contribution to the second term is large when $\arg\max_j f_{i,j}(x)$, that is $\hat{q}_{1,i}$, is far from $q_{1,i}$, and small otherwise (the second term is 0 when $\arg\max_j f_{i,j}(x) = q_{1,i}$ for every $i$, that is, in the case of ideal estimation). Then, the refined loss indirectly takes into account the MSE of the estimation via the second term. Moreover, the second term is continuously differentiable and is therefore well suited to backpropagation.
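To make the role of the two terms concrete, the sketch below uses one plausible form of a combined loss: the block cross-entropy weighted by $c$, plus $(1 - c)$ times the distance-weighted sum of the soft outputs, $\sum_i \sum_j |j - q_{1,i}| f_{i,j}(x)$. Both the form of the second term and the assignment of the weights are assumptions made for illustration and should not be read as the exact definition of $L_r$.

```python
import math


def cross_entropy(q1, f_blocks, eps=1e-12):
    """Block cross-entropy: minus the log of the soft value at each true position."""
    return -sum(math.log(f[q - 1] + eps) for q, f in zip(q1, f_blocks))


def distance_term(q1, f_blocks):
    """Assumed second term: soft outputs weighted by their distance from the truth."""
    return sum(
        abs(j - q) * f_j
        for q, f in zip(q1, f_blocks)
        for j, f_j in enumerate(f, start=1)
    )


def combined_loss(q1, f_blocks, c=0.8):
    """Illustrative combined loss; the weighting scheme is an assumption."""
    return c * cross_entropy(q1, f_blocks) + (1 - c) * distance_term(q1, f_blocks)


q_true = [2]
f_near = [[0.05, 0.20, 0.65, 0.05, 0.05]]  # most mass one step from the truth
f_far = [[0.05, 0.20, 0.05, 0.05, 0.65]]   # most mass three steps from the truth
loss_near = combined_loss(q_true, f_near)
loss_far = combined_loss(q_true, f_far)
```

Unlike the basic loss, which is identical for the two predictions, this combined loss is larger for the prediction whose mass lies farther from the true value, and its second term vanishes when all the mass sits on the true position.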
Some preliminary experiments we carried out confirmed that, as expected, adopting the loss function

Experimental methodology
In this section, we describe in detail the backbone architecture of the network, the procedure of dataset construction and the training setting considered in our experiments.
The proposed solution is compared with the state-of-the-art approach in [9] based on deep learning and, for the aligned case, also with those in [5,7] based on statistical analysis. While the method in [9] always outperforms all the other previous methods in the non-aligned scenario (e.g. [5,8]), in the aligned case there are some cases where the accuracies achieved by the methods in [5,7], tailored for the aligned scenario, are superior to those of [9].

Backbone architecture
For the design of the internal layers of the CNN, we considered the DenseNet architecture [13], which was also considered in [9]. This backbone architecture has been recently adopted for several image forensic tasks, see for instance [17][18][19], yielding improved performance compared to that achieved with traditional CNN architectures (e.g. residual-based networks).
The main feature of the dense structure is that it connects each layer to every other layer in a dense block in a feed-forward fashion. In this way, the features extracted by the various layers are used by the subsequent layers throughout the same dense block (hierarchical structure). The dense connectivity has been shown in [13] to mitigate the vanishing gradient problem. The number of links in the network increases compared to traditional CNN architectures, passing from $l$ to $l(l − 1)/2$ for each dense block, where $l$ is the number of layers in the block. However, as an advantage, the number of (to-be-trained) parameters is significantly reduced. Following the original dense structure (see [13]), we considered a network depth of 40, with 3 dense blocks and growth rate $k = 12$. Each dense block consists of 12 convolutional layers and is followed by a transition layer, where 2 × 2 average pooling is performed to decrease the input size. All the convolutions have kernel size 3 × 3 × 12. The default dropout of 0.2 is considered. An initial convolution with 24 ($2k$) filters of size 3 × 3 is performed before the first dense block. For more details on the dense structure we refer to [13]. After the last dense block, global average pooling is performed and the feature vector is fed to the fully connected layer.
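The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes the usual DenseNet convention that, for a non-bottleneck network with 3 dense blocks, each block has (depth − 4) / 3 layers, and that transition layers do not compress the number of feature maps; the second assumption is not stated in the text, so the final feature-vector size is only indicative.

```python
def layers_per_block(depth, n_blocks=3):
    """Convolutional layers in each dense block (non-bottleneck DenseNet convention)."""
    return (depth - 4) // n_blocks


def links_per_block(n_layers):
    """Number of links within a dense block, using the l(l - 1)/2 count quoted above."""
    return n_layers * (n_layers - 1) // 2


def final_channels(depth, growth_rate, n_blocks=3):
    """Feature maps after the last block, assuming no compression in the transitions."""
    initial = 2 * growth_rate  # initial convolution with 2k filters
    return initial + n_blocks * layers_per_block(depth, n_blocks) * growth_rate


n_layers = layers_per_block(40)      # 12 layers per dense block
n_links = links_per_block(n_layers)  # 66 links per dense block
n_features = final_channels(40, 12)  # size of the pooled feature vector
```

The value of `n_layers` matches the 12 convolutional layers per dense block reported for depth 40.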
The number of output nodes of the fully-connected layer is set to $q^M_{1,1} + q^M_{1,2} + \cdots + q^M_{1,N_c}$. A softmax is applied to each block of $q^M_{1,i}$ nodes independently, for a total of $N_c$ softmaxes, as illustrated in Figure 1.
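As a rough indication of the resulting output size, the sketch below sums the first 15 zig-zag coefficients of the standard luminance table at QF = 50. It assumes $N_c = 15$ (the number of coefficients shown in the result figures) and the example luminance table of the JPEG standard, which under the IJG convention coincides with the QF = 50 matrix; both are assumptions made for illustration.

```python
from itertools import accumulate

# First 15 zig-zag coefficients of the Annex K luminance table (QF = 50 under IJG scaling).
Q50_ZIGZAG_15 = [16, 11, 12, 14, 12, 10, 16, 14, 13, 14, 18, 17, 16, 19, 24]

n_outputs = sum(Q50_ZIGZAG_15)  # total number of logits

# Start index of each softmax block inside the flat output vector.
block_starts = [0] + list(accumulate(Q50_ZIGZAG_15))[:-1]
```

Under these assumptions the fully-connected layer would have 226 output nodes, split into 15 blocks whose start offsets are given by `block_starts`.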

Datasets
As in [9], a model for $Q_1$ estimation is trained for a fixed value of $QF_2$. This does not represent a limitation in practice since the information on the second compression is always available. The final quantization matrix, in fact, can be recovered from the JPEG file, and is necessary to decompress the image, getting the image in the pixel domain. Moreover, when the image is re-saved in an uncompressed format, the quantization matrix of the last compression can be accurately estimated [20]. As a drawback, a model has to be trained for every matrix of second quantization, which may be time-consuming (the same happens with the method in [9]). However, since training has to be performed only once, this does not represent a big issue. Moreover, our experiments show that a network trained for a given $QF_2$ generalizes well to a different $QF_2$ or, more generally, to a different $Q_2$ matrix, when the difference is not too large (a ±2 mismatch in the QF results in a very small decrease of performance), hence a limited number of models can be trained. The training and testing datasets are built as described in the following.
We considered the RAISE dataset [21] with 8156 native (tiff) images, which was split into a training and a test set. Specifically, 7000 images were considered for training, while the remaining 1156 were reserved for testing. The images were then compressed first with several values of $QF_1$ and then with the prescribed $QF_2$, thus obtaining several double compressed versions (values of $QF_1$ both larger and smaller than $QF_2$ were considered). JPEG compression was performed with OpenCV. To simulate the misalignment, we applied a random grid shift $(r, c)$ between the two compressions, with $r$ and $c$ randomly selected in the [0 : 7] range. Therefore, the JPEG is non-aligned with probability 63/64, while the aligned scenario (which corresponds to the case $r = c = 0$) occurs with probability 1/64.
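The stated probabilities follow from counting the 64 possible shifts. The sketch below also shows one simple way to realize a grid shift, namely dropping $r$ rows and $c$ columns from the decompressed image before the second compression; the text does not specify how the shift was implemented, so the cropping helper is a hypothetical illustration.

```python
import random


def random_grid_shift(rng):
    """Draw a shift (r, c) uniformly, with r and c in [0, 7]."""
    return rng.randrange(8), rng.randrange(8)


def crop_for_misalignment(pixels, r, c):
    """Drop the first r rows and c columns so that the 8 x 8 grid is shifted by (r, c)."""
    return [row[c:] for row in pixels[r:]]


# Exhaustive count: only (0, 0) leaves the two grids aligned.
all_shifts = [(r, c) for r in range(8) for c in range(8)]
p_aligned = sum(1 for s in all_shifts if s == (0, 0)) / len(all_shifts)

rng = random.Random(0)
r, c = random_grid_shift(rng)
toy_image = [[0] * 16 for _ in range(16)]
shifted = crop_for_misalignment(toy_image, 3, 5)
```

With uniform sampling of $r$ and $c$, `p_aligned` equals 1/64 and the remaining 63 shifts give a non-aligned DJPEG.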
To build the dataset of patches used for training, we proceeded as follows: for every $QF_1$, we cropped the DJPEG images in the training set into patches of size 64 × 64 × 3; then, from each image we took 100 patches in random positions; we stopped collecting patches when a total number of $10^5$ patches was reached (coming then from 1000 images) for each given $QF_1$. [2] For our experiments we set $QF_2 = 90$ and $QF_2 = 80$. Specifically, for $QF_2 = 90$, we built the training dataset $D_{90}$ by considering $QF_1 \in [60 : 98]$, for a total of $3.9 \times 10^6$ patches. For $QF_2 = 80$, we built the set $D_{80}$ by considering $QF_1 \in [55 : 98]$, for a total of $4.4 \times 10^6$ patches.
The test patch set was obtained in the same way. In this case, for every $QF_1$, all the 1156 images in the test set were considered, each one contributing 100 random patches (for a total of 115600 patches). The Dresden dataset [22], consisting of 1491 raw images, was also considered to test the performance under dataset mismatch (hence, for each $QF_1$, the performance was tested on a total of 149100 patches).
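The dataset sizes reported above can be verified with simple arithmetic, as in the following sketch (the constant and function names are hypothetical).

```python
PATCHES_PER_IMAGE = 100
TRAIN_PATCHES_PER_QF1 = 10 ** 5  # reached after 1000 images per QF1


def training_set_size(qf1_min, qf1_max):
    """Total training patches when every integer QF1 in [qf1_min, qf1_max] is used."""
    return (qf1_max - qf1_min + 1) * TRAIN_PATCHES_PER_QF1


size_d90 = training_set_size(60, 98)      # D_90
size_d80 = training_set_size(55, 98)      # D_80
raise_test = 1156 * PATCHES_PER_IMAGE     # RAISE test patches per QF1
dresden_test = 1491 * PATCHES_PER_IMAGE   # Dresden test patches per QF1
```

The ranges [60 : 98] and [55 : 98] contain 39 and 44 values of $QF_1$, respectively, which gives the totals quoted in the text.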
For the implementation of the proposed method and the custom loss, we used TensorFlow version 2.2. Model training and testing were carried out in Python via TensorFlow, using Keras API.
We ran our experiments using two Nvidia GeForce RTX 2080 Ti GPUs (11 GB GDDR6 each). For the optimization, the Adam solver was used with learning rate $10^{-5}$. The batch size for training and testing was set to 32 images. We obtained our models using the $L_r$ loss by training the network for 100 epochs. After this number of epochs we verified that the loss decreases very slightly (less than 0.01% at every iteration) and the accuracy of the estimation cannot be improved further by letting the training go on (only incurring the risk of overfitting). The weight in the combined loss $L_r$ in Eq. (5) was set to $c = 0.8$.
The code is publicly available and can be found at the github link https://tinyurl.com/yxhl32w5.

Comparison with existing methods
The average performance results achieved by the CNN model for the new architecture, trained with the $L_r$ loss on $D_{90}$, are reported in Table 1, where they are compared to those achieved by the CNN model in [9], trained for the same value of $QF_2 = 90$. The performance results are averaged under the same setting considered for the training regarding the alignment of the DJPEG, i.e., the test patches are DJPEG, aligned with probability 1/64 and non-aligned with probability 63/64. The performance results of the new model are superior to those achieved by the method in [9] both in terms of accuracy and of MSE.
Table 1 Average performance of the proposed CNN estimator and [9] for $QF_2 = 90$. The DJPEG is non-aligned with probability 63/64, aligned with probability 1/64 (same setting considered for the training of the models).
[2] For every $QF_1$, a random shuffle was applied to the 7000 DJPEG images compressed with ($QF_1$, $QF_2$), so the subset of images considered for every $QF_1$ was never the same.
The performance results achieved by the methods in the aligned DJPEG scenario are reported in Table 2. The performance results in the aligned scenario are slightly inferior to those reported in the
Figure 5 Average accuracy when $QF_1 < QF_2$ (top left) and when $QF_1 \ge QF_2$ (top right), average MSE when $QF_1 < QF_2$ (bottom left) and when $QF_1 \ge QF_2$ (bottom right) of the estimation for each of the 15 DCT coefficients, for $QF_2 = 90$, in the aligned DJPEG scenario.
Table 2 Average performance of the proposed CNN estimator in the aligned case for $QF_2 = 90$, and comparison with the state-of-the-art.
method can also outperform [7], which is designed for the aligned case, both in terms of MSE and accuracy.
The estimation accuracy and MSE for each DCT coefficient are reported for the non-aligned and aligned case in Figures 3 and 4, respectively, where the results are averaged over all the values of $QF_1$. Comparison with the methods in [5,7] is also reported for the aligned case. It can be observed that the proposed method always performs better than existing methods, especially in the aligned case, where it greatly outperforms the methods in [5,7] for all the 15 DCT coefficients. Figure 5 shows the averaged results when $QF_1 < QF_2$ and $QF_1 \ge QF_2$ respectively, in the case $QF_2 = 90$ for the aligned scenario. It can be noticed that when $QF_1 \ge QF_2$ the proposed method clearly outperforms the methods in [5,7], both in terms of accuracy and MSE. When $QF_1 < QF_2$, the proposed method still outperforms the state-of-the-art methods.
The average performance results obtained for the case $QF_2 = 80$ (model trained on $D_{80}$) are reported in Table 5.2 for the non-aligned scenario and Table 5.2 for the aligned scenario. A performance loss is experienced by all the methods. This was expected since, with a smaller $QF_2$, the second quantization tends to erase more of the traces of the first compression, thus making the estimation harder. Nevertheless, the proposed method has an advantage over the state-of-the-art. We have noticed that the CNN model in [9] does better than our method in the aligned case in terms of MSE. However, our method is better in terms of accuracy, and in the non-aligned case, which is the main focus of our method. Figures 6 and 7 report the results on 15 DCT coefficients in the case $QF_2 = 80$, averaged over all the values of $QF_1$, for the non-aligned and aligned case respectively. We see that the gain of the proposed method is confirmed.

Generalization capability
The generalization capability of the model is tested by considering several sources of mismatch, i.e., the second compression quality $QF_2$ and the image database.
The results in the presence of $QF_2$ mismatch are reported in Figure 8, where the model trained on $D_{90}$ is tested on images compressed with $QF_2 = 92$, for the general setting considered for training regarding the alignment (the DJPEG is aligned with probability 1/64). We see that the drop in performance is limited, proving a certain generalization capability, and is similar for the two methods. The performance of the proposed CNN remains superior to that of [9]: the total AvgAcc and AvgMSE are respectively 0.591 and 0.859 for our method, and 0.500 and 0.944 for the method in [9].
To assess the impact that dataset mismatch has on the performance of our CNN-based estimator, we also evaluate the performance of the estimator on DJPEG images coming from the Dresden dataset. Figure 9 reports the results of our tests. The total AvgAcc and AvgMSE are respectively 0.644 and 0.538 for our method, and 0.523 and 0.694 for the method in [9].

Application to tampering localization
Given that the CNN estimator works on small image patches, the method can be straightforwardly applied on sliding windows over a JPEG image to get a map with the estimated primary quantization coefficients $(\hat{q}_1)_{N_c}$ for each 64 × 64 block, which can be useful to localize possible tampered regions in a DJPEG image. Notably, by looking at those maps the tampering can be exposed in the general scenario where both the background and the foreground (tampered areas) are DJPEG, that is, both the background and the copy-pasted region were originally JPEG (compressed with different qualities) and undergo a second JPEG compression after forging. This corresponds to a very common scenario in practice. In this scenario, methods that try to detect and localize tampering by looking at the presence or absence of typical double compression artifacts do not work, see for instance [3][23][24][25], just to mention a few of them. These methods in fact implicitly assume that the background is single compressed, while the foreground is double compressed, or vice versa.
In order to get a localization map, we first divide the input image $x$ of size $V \times L \times 3$ into overlapping blocks of size 64 × 64 with stride $s = 1$; then, each block is fed to the CNN, which returns a vector with the first $N_c$ estimated quantization coefficients. Let $QM(i, j, :) = f([x_{ij}])$ be the network output when the input is the $(i, j)$-th image block of size 64 × 64 × 3; then, $QM(i, j, :) = (\hat{q}_1([x_{ij}]))_{N_c}$. In this way, for each $k = 1, ..., N_c$ we obtain a map $QM(:, :, k)$ with the estimated values of the $k$-th coefficient for each block. Figure 10 shows two examples. Both tampered images have two distinct tampered areas, where the copy-pasted regions have different first JPEG qualities, that is, $QF_{1,1} = 95$ and $QF_{1,2} = 85$ for the first example and $QF_{1,1} = 65$ and $QF_{1,2} = 95$ for the second one. The first JPEG quality for the background of the two examples is 75. The last quality factor for both examples is $QF_2 = 90$. All the JPEG grids are non-aligned. For the sake of visualization a color map is reported. The color map shows that the two tampered regions have a different $q_{1,k}$ from the background; interestingly, the color map also reveals that the $q_{1,k}$ value differs between the two regions, hence that they correspond to two distinct tamperings (the copy-pasted regions come from different donor images, having different JPEG compression qualities).
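The sliding-window procedure can be sketched as follows. The function `estimate` stands in for the trained CNN and is a placeholder supplied by the caller; images are represented as nested lists, and a small block size is used in the demo only to keep it fast. All names are hypothetical.

```python
def quantization_maps(image, estimate, block=64, stride=1):
    """Apply `estimate` to every block x block window and collect the outputs.

    Returns a nested list QM such that QM[i][j] is the vector of estimated
    coefficients for the window whose top-left corner is (i * stride, j * stride).
    """
    height, width = len(image), len(image[0])
    rows = range(0, height - block + 1, stride)
    cols = range(0, width - block + 1, stride)
    return [
        [estimate([row[j:j + block] for row in image[i:i + block]]) for j in cols]
        for i in rows
    ]


def coefficient_map(qm, k):
    """Map of the k-th estimated coefficient (1-based), i.e. QM(:, :, k)."""
    return [[vec[k - 1] for vec in row] for row in qm]


# Demo with a stand-in estimator that just returns the top-left pixel of the window.
toy_image = [[10 * i + j for j in range(7)] for i in range(6)]
qm = quantization_maps(toy_image, lambda patch: [patch[0][0]], block=4)
first_map = coefficient_map(qm, 1)
```

With stride 1, an image of size $V \times L$ yields $(V - 63) \times (L - 63)$ windows of size 64 × 64, so in practice a larger stride may be preferable when computation time matters.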
In general, if the qualities of the former JPEG are close, it is harder to visualize and expose the tampering by simple inspection of the $N_c$ maps of the $q_{1,k}$ coefficients of each block, that is $QM(:, :, k)$, all the more so because some of the coefficients might have the same value. In these cases, we could resort to clustering to get a tampering localization map from the vectors of the estimated $q_{1,k}$ values for each position. This interesting analysis is left as future work.

Conclusions
In this paper, we proposed a method for primary quantization matrix estimation via CNNs, that resorts to a classification-like architecture to perform the estimation of the quantization coefficients. Thanks to the adoption of a simil-classification structure, the new CNN estimator achieves improved performance with respect to the CNN regression-based method in [9], both in terms of accuracy and MSE.
Notably, the proposed method is a general one, which can work under a wide variety of operative conditions, i.e., when the second compression grid is either aligned or not with that of the first compression, and for every combination of qualities of the first and second compression. Regarding the JPEG alignment, the method is designed to work in particular for the case of non-aligned double JPEG compression (the aligned case is assumed to occur with probability 1/64). A method capable of dealing with primary quantization estimation in the non-aligned scenario is very relevant when the proposed estimator is used for image tampering localization (when a region of a JPEG image is copy-pasted into another JPEG image, very likely the alignment between the compression grids is not preserved and the final JPEG is not aligned with the grid of the spliced area). Despite its generality, the proposed method also outperforms the existing, dedicated, state-of-the-art solutions for the aligned scenario in most of the cases. The method also provides very good performance in the challenging case of $QF_1 > QF_2$, where state-of-the-art methods based on statistical analysis often fail. More importantly, the estimator works on small image patches, which opens the way to the application of the method to tampering localization (see Section 5.3).
As future research, the application of the method for image tampering localization in DJPEG images is then an interesting direction, possibly with the identification of different tampering sources. In addition, the robustness of the estimator in presence of adversarial attacks could also be investigated. From a more general perspective, we believe that a similar architecture to the one proposed in this paper could also be exploited and applied to address other estimation problems which are relevant in image forensics.