Potential advantages and limitations of using information fusion in media forensics—a discussion on the example of detecting face morphing attacks

Information fusion, i.e., the combination of expert systems, has a huge potential to improve the accuracy of pattern recognition systems. During the last decades, various application fields started to use different fusion concepts extensively. The forensic sciences are still hesitant when it comes to blindly applying information fusion. Here, a potentially negative impact on the classification accuracy, if fusion is wrongly used or parameterized, as well as the increased complexity (and the inherently higher costs of plausibility validation) of fusion are in conflict with the fundamental requirements of forensics. The goals of this paper are to explain the reasons for this reluctance to accept such a potentially very beneficial technique and to illustrate the practical issues arising when applying fusion. For these practical discussions, the exemplary application scenario of morphing attack detection (MAD) is selected with the goal of facilitating the understanding between the media forensics community and forensic practitioners. As general contributions, it is illustrated why the naive assumption that fusion would make the detection more reliable can fail in practice, i.e., why fusion sometimes behaves differently in a field application than in the lab. As a result, the constraints and limitations of the application of fusion are discussed and its impact on (media) forensics is reflected upon.
As technical contributions, the current state of the art of MAD is expanded by:
- The introduction of likelihood-based fusion and a fusion ensemble composition experiment to extend the set of methods (majority voting, sum-rule, and Dempster-Shafer Theory of evidence) used previously
- The direct comparison of the two evaluation scenarios "MAD in document issuing" and "MAD in identity verification" using a realistic and some less restrictive evaluation setups
- A thorough analysis and discussion of the detection performance issues and the reasons why fusion, in a majority of the test cases discussed here, leads to worse classification accuracy than the best individual classifier


Introduction
Information fusion has a long research history, and its core concept, the combination of outputs of different expert systems, has been rigorously studied and applied for at least two decades in various application domains. The concept of fusion has been studied under many different terminologies, e.g., classifier ensembles [1], combining pattern classifiers [2], or cooperative agents [3]. As a result of the growing popularity of machine learning at the time and of the practical problems arising from ever increasing feature space complexities, in 2002 [4] stated that "instead of looking for the best set of features and the best classifier, now we look for the best set of classifiers and then the best combination method." This statement was rephrased by [5] into "the role of information fusion […] is to determine the best set of experts in a given problem domain and devise an appropriate function that can optimally combine the decisions rendered by the individual experts [...]." In [2], the following three types of reasons why a classifier ensemble might be better than a single classifier are identified: statistical (instead of picking a potentially inadequate single classifier, it is a safer option to use a set of unrelated ones and consider all their outputs), computational (some training algorithms use hill-climbing or random methods, which might lead to different local optima when initialized differently), and representational (it is possible that the classifier space considered for a problem does not contain an optimal classifier). Whatever the exact reason for choosing a fusion approach instead of a single classifier, [2] explicitly warns that "an improvement on the single best classifier or on the group's average performance, for the general case, is not guaranteed. What is exposed here are only 'clever heuristics' [...]".
In summary, by combining classifiers (or other expert systems), the applicants hope for a more accurate decision at the expense of increased complexity.
The huge potential for accuracy improvement gained by applying fusion has been well illustrated in many fields of applied pattern recognition. A good example is the field of biometric user authentication where, e.g., [5] shows various benefits that this field can draw from fusion at different steps of the pattern recognition pipeline. When it comes to blindly applying information fusion, among the disciplines that are currently still hesitant are the forensic sciences. Here, the potentially negative impact on classification accuracy as well as the increased complexity (and the inherently higher costs of plausibility validation) of fusion are in conflict with fundamental requirements for (media) forensics (as is discussed in more detail in section 2.1). The goals of this paper are to explain the reasons for this reluctance to accept a potentially very beneficial technique such as information fusion and to illustrate the practical problems of applying fusion. To this end, an exemplary application scenario from media forensics called face morphing attack detection (MAD) is selected. This scenario is currently a hot research topic because this kind of attack poses a recent and currently unsolved threat to face-image-based authentication scenarios such as border crossing using travel documents (i.e., passports), see section 2.3.
By facilitating the understanding of the reluctance to blindly use fusion in (media) forensics as well as of the potential pitfalls of practically applied fusion techniques, the hope is to facilitate acceptance both in the media forensics community and in the community of forensic practitioners. To achieve this, the paper provides the following contributions:

a) As general contributions, it is illustrated why (even with a set of classifiers relevant to a specific problem) the naive assumption that fusion would make the detection more reliable can fail in practice, i.e., why fusion sometimes behaves differently in a field application than in the lab and often delivers lower detection performance than single detectors. As a result, the constraints and limitations of the application of fusion are discussed and its impact on (media) forensics is reflected upon. The two main aspects addressed in this discussion are the generalization power of classification models and the relationship between training and test data sets. In the evaluations, it is shown that both aspects, despite being similar in nature, have to be considered separately for applied information fusion.

b) As technical contributions for face morphing attack detection (MAD), the current state of the art is expanded by:
- Introduction of likelihood ratio (LR) based fusion for MAD to extend the set of methods (majority voting, sum-rule, and Dempster-Shafer Theory (DST) of evidence [6]) used in [7].
- Direct comparison of the two evaluation scenarios "MAD in document issuing" and "MAD in identity verification."
- Analysis and discussion of the detection performance issues found with the fusion-based detectors (note: questions of feature or classifier selection are out of scope for this paper). The results show that:
  - Fusion can fail even when a set of accurate individual classifiers is available.
  - The results presented for the fusion detectors are in the vast majority of cases worse than the results of the best individual classifier used.
  - Trained thresholding and weighting strategies as well as sophisticated (context-adapted) fusion methods (especially DST- and LR-based ones) can under specific circumstances perform significantly worse than unweighted, simplistic fusion approaches like the sum-rule or majority voting.
  - Different fusion ensemble composition strategies (i.e., using all available detectors vs. selecting a subset of those) have an influence on the decision error rates.
  - For the two evaluation scenarios "MAD in document issuing" (SC1) and "MAD in identity verification" (SC2), different detection and fusion trends are observed, resulting from differences in the inherent characteristics of the application scenarios (esp. the amount and type of data available for investigations).
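For reference in the later discussion, the two simplistic fusion approaches mentioned above, the sum-rule and majority voting, can be sketched in a few lines of Python (the detector scores and the 0.5 threshold are purely illustrative assumptions):

```python
def sum_rule(scores):
    """Unweighted sum-rule (average rule): the mean of the detector scores."""
    return sum(scores) / len(scores)

def majority_vote(scores, threshold=0.5):
    """Majority voting on the thresholded individual decisions.

    Returns True ("morph") if more than half of the detectors fire above
    the threshold; scores are in [0, 1], high values indicating a morph.
    """
    votes = sum(1 for s in scores if s >= threshold)
    return votes > len(scores) / 2

# Hypothetical scores of three individual MAD detectors for one image:
scores = [0.8, 0.3, 0.6]
fused_score = sum_rule(scores)    # ~0.567, above a 0.5 decision threshold
decision = majority_vote(scores)  # True, two of three detectors vote "morph"
```

Note that neither method uses trained weights; this is exactly the property that makes them attractive baselines against the trained DST- and LR-based schemes discussed later.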
The rest of the paper is structured as follows: section 2 discusses related work on requirements for media forensic methods, the current state of the art in face morphing attack detection (MAD), and information fusion approaches in MAD. In section 3, the investigation concept from [7] is summarized and extended into the concept for fusion-based face morphing attack detection used in this paper. Section 4 defines the evaluation setup (incl. the two application scenarios "MAD in document issuing" and "MAD in identity verification"). Section 5 presents the evaluation results and their discussion, while in section 6 the conclusions are drawn from the presented results.

Related work
Technical capabilities (such as accuracy) are by far not the most significant characteristics of forensic methods. In general, those are rated by practitioners in criminal investigations by their maturity, i.e., by their scientific admissibility. Section 2.1 discusses some issues of scientific admissibility in European contexts (where, due to the very nature of the EU and its member states, it is currently much less well regulated than, for example, in the USA) to establish an understanding of the requirements and limitations for forensic methods originating from this field. Section 2.2 briefly summarizes the media forensics application domain selected for this paper, face morphing attack detection (MAD). More detailed overviews of the research activities in this field, which has been very active since 2014, can be found in the two survey papers [8,9].
Several studies have demonstrated that both manually and automatically generated high-quality morphs can be recognized as such neither by algorithms nor by human examiners [10][11][12][13], and that even low-quality morphs pose a threat to the identity verification process if it is completely automated. This explains the urgent need for automated face morphing detectors. At the time of writing this paper, none of the existing research initiatives working on this specific image manipulation detection problem has been able to present detectors that achieve sufficient detection accuracy on a wide range of morphed images (see the ongoing NIST FRVT MORPH challenge [14]). As a logical consequence, fusion approaches are used to combine the existing detectors and thereby improve the overall performance. The state-of-the-art approaches in information fusion for MAD are briefly discussed in section 2.3.

Requirements for media forensic methods in terms of scientific admissibility
When working in media forensics, the question of determining the maturity of methods arises. In lab tests analyzing data for which ground truth information exists, an answer to that question is easy. In that case, the degree of agreement between ground truth label and detector response can simply be used to express the accuracy of the method.
In field applications of forensics, there usually exists no ground truth information for an object under investigation. In these cases, other means of establishing the maturity or suitability of a forensic method have to be used. In forensics, the whole field of work looking into this aspect is termed "scientific admissibility." It is a very complex topic on which Champod and Vuille state in [15]: "The scientific admissibility of evidence, while subject to fairly precise rules in United States law, [...], is seldom addressed in European legal writings, [...]. The question of scientific reliability is seen as intrinsically linked with the assessment of the actual evidence, that is with the determination of its probative value […]." Researchers in the fields of computer science and applied pattern recognition have to rely on the verdict of legal experts defining the hurdles media forensics approaches have to take before achieving the ultimate goal of court admissibility. Looking at [15], it can be stated that there is no EU-wide regulation on scientific admissibility questions, but that there are common principles that would have to be considered. The in-depth analysis of the current legal situation in [15] presents a non-exhaustive list of such principles, containing in its core the following aspects:
- Methods should be peer reviewed and accepted within the corresponding scientific community.
- Error rates associated with a method should be precisely known.
- Standards for the application and maintenance of methods should exist.
This list is very similar to the criteria used by judges in the USA to address the questions of court admissibility for forensic (and other) methods, i.e., the so-called Daubert and FRE702 criteria [15]. While pointing out the benefits of such selection principles, Champod and Vuille also voice some criticism of their application: for peer-reviewed methods, they point out that "this criterion does not indicate whether a technique accepted in scientific literature has been used properly in a given case," and regarding the issue of ascertaining the error rates of a test, they claim that those "can prove misleading if not all its complexities are understood" [15].
In the context of the work presented in this paper, those statements imply two important things: First, a very careful investigation of the precise constraints for the application of a method such as information fusion is required for any specific forensic application case. Second, the associated complexities in practical application (such as the attempt to improve MAD used for illustration purposes within this paper) have to be clearly and openly discussed.

Face morphing attacks and their detection
Face images in documents are an established and well accepted means of identity verification. Current electronic machine readable travel documents (eMRTD) are equipped with digital portraits to automate the identity verification process. The automation saves manpower and enhances security due to switching from subjective (officers) to objective (automated face recognition systems) matching of faces. The benefit of automation is especially relevant in high-throughput applications like an airport border control. However, the automation entails the risk of face morphing attacks [16].
In publications such as [12,16], it has been shown that the blending of face images (here called face morphing) of two or more persons can lead to a face image resembling the faces of all persons involved. Using such an image as a reference in a document is referred to as face morphing attack because it enables illicit document sharing among several users. Such morphing attacks have been shown to be effective in an automated border control (ABC) scenario giving a wanted criminal a chance to cross a border with a chosen (i.e., wrong) identity [10,17,18]. Document issuing procedures are different depending on the country and its national regulations. In many countries, the biometric face image can be (and often is) submitted as a hard copy. Here, the attack aims at fooling an officer at the document issuing office by submitting a morphed face image. As long as persons are allowed to submit images to the document issuing office during the document generation, face morphing attacks will remain a severe threat to photo-ID-based verification. Indeed, if an officer accepts a morphed face image, the issued document would pass all integrity checks, and if an automated face recognition (AFR) system matches a live face with a morphed document image, access will be granted to an impostor.
The risk of the morphing attack can be reduced by supporting both officers and AFR systems with a dedicated morph detector. The only way to completely remove the threat of such attacks would be to take the picture directly in the controlled environment of the issuing office and to ensure that there is no malware-enabled morphing attack embedded into the digital part of the document issuing pipeline either. The question whether to take the picture directly in place is a political issue, which has in the past led to many controversial discussions (e.g., in France and Germany) between governmental regulation and the photo industry. But even if this problem were solved for one country, there would still be the issues of legacy passports (which might remain valid for up to 10 years) as well as foreign documents. Figure 1 depicts the life-cycle of a document with a face morphing attack present. While publications such as [19] also discuss the role of forensics (and anti-forensics) in the quality assessment (QA) performed by the attacker during the morph generation process, in the scope of this paper, only the image forensic analysis of the images submitted into the document creation and the corresponding analysis in every document usage (e.g., in an ABC gate) are relevant. These two investigation points represent the evaluation scenarios "MAD in document issuing" (SC1) and "MAD in identity verification" (SC2) considered in this paper. They are discussed in detail in section 4.
The face morphing attack detection (MAD) approaches are typically categorized into two groups regarding whether a trustworthy reference face image is presented or not. The first group is often referred to as single-image or no-reference MAD approaches. The second group is referred to as two-image differential or reference-based MAD approaches. Despite the fact that reference-based MAD has more potential for robust operation, the no-reference MAD approaches are better represented in the literature.
Within the group of reference-based MAD approaches, as pointed out in [21], there are two subcategories: reconstruction-based approaches and approaches comparing or combining features extracted from both images. The most prominent examples from the first subcategory try to reconstruct a likely original face (from the assumedly morphed face image provided) by making use of a trustworthy reference face image captured live from the person in front of a camera. This process is often referred to as de-morphing. The detection is done in this case by comparing the reconstructed image with the reference one. The de-morphing is done either by inversion of the common morphing procedure [22] or by applying neural networks such as an autoencoder [23] or generative adversarial networks (GAN) [24]. Alternative approaches to implement reference-based MAD could also rely on reference feature vectors instead of complete face images.
The approaches from the second subcategory extract features from both presented images (probe document image and trustworthy reference image) and either compare them to each other [13], combine them for further classification [25], or even train an additional classifier based on difference vectors [26]. The common problem of all single-image MAD approaches based on "hand-made" or "hand-crafted" features is that they do not detect morphing as such but rather traces of image manipulations. Since there is a set of legitimate image manipulations, such as in-plane rotation, cropping, scaling, and even some kinds of filtering, the morphing characteristics can easily be simulated to prevent detection. The more sophisticated single-image MAD approaches (like [27]) make use of deep convolutional neural networks (DCNN), which are trained to automatically extract features characterizing morphing artifacts based on a large set of samples. If a training set is large and diverse enough, covering all frequently used image manipulations, there is a chance that the network will learn not the characteristics of a particular dataset, but actual characteristics of morphing. Training of different DCNN architectures for morphing detection was conducted in [17,26,28], applying transfer learning with pre-trained networks as well as learning from scratch. In [29], a feature-level fusion of two DCNNs (AlexNet and VGG19) trained by means of transfer learning is shown to outperform BSIF features.
The majority of the aforementioned detectors are trained with morphed face images created by the standard morphing approach, which roughly includes three steps: alignment of faces, warping of face components given by polygons (usually triangles), and blending of color values [12,17,30]. However, the recent trend is the application of GAN to create realistic face images [31,32]. The performance of MAD approaches in detecting standard morphs and morphs produced by GAN is compared in [33,34]. Several MAD approaches are compared within the framework of the ongoing NIST FRVT MORPH challenge [14].

Information fusion approaches in face morphing attack detection
Decision-making systems can be fused at four different levels [2]: data level, feature level, classifier level, and combination (or decision) level. The earlier the fusion is applied, the higher the implementation costs (esp. the required computation power), but also the higher the expected accuracy.
A huge number of different fusion approaches exist, ranging from simplistic methods, like the sum-rule (also known as average rule, meaning the linear combination of matching scores with equal weights) or majority voting, to complex schemes like the Dempster-Shafer Theory (DST) of evidence [35]. Since DST has a theoretical foundation for handling contradicting and missing decisions of expert systems, it has been successfully applied in a wide range of applications [36]. There exist different ways of how exactly to implement fusion based on DST; for details of our own realization, we refer to section 4.3. For the question which fusion method should be chosen, there exists, to the best of the authors' knowledge, no universally agreed upon theory. Some experts put a strong focus on one specific method, e.g., Kittler et al. in [37], where the authors claimed that the sum-rule is not only simple, intuitive, and remarkably robust, but in their experiments also outperformed all other aggregation operators tested. Other experts, like Ho [4] and Kuncheva [38], explicitly refrain from giving any generalized recommendation, acknowledging the fact that, even when a critical mass of single classification models has been accumulated in a field of application, there are still open questions regarding their combination and the interpretation of the combination output.

(Fig. 1 caption fragment: "... [19], combined morph generated based on [12], original face images taken from the ECVP face dataset [20]"; Kraetzer et al., EURASIP Journal on Information Security (2021) 2021:9)
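As an illustration of the DST principle (the concrete realization used in this paper is described in section 4.3), Dempster's rule of combination can be sketched for the binary frame {morph, genuine}; the mass values below are hypothetical:

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for the binary frame {morph, genuine}.

    Each mass function is a dict with keys 'morph', 'genuine', and
    'either' (the mass on the whole frame, modelling uncertainty).
    A minimal sketch, not the paper's own DST realization.
    """
    # Conflict: mass assigned to contradictory singletons.
    k = m1['morph'] * m2['genuine'] + m1['genuine'] * m2['morph']
    if k >= 1.0:
        raise ValueError("total conflict, combination undefined")
    norm = 1.0 - k
    combined = {
        'morph': (m1['morph'] * m2['morph']
                  + m1['morph'] * m2['either']
                  + m1['either'] * m2['morph']) / norm,
        'genuine': (m1['genuine'] * m2['genuine']
                    + m1['genuine'] * m2['either']
                    + m1['either'] * m2['genuine']) / norm,
    }
    combined['either'] = 1.0 - combined['morph'] - combined['genuine']
    return combined

# Two hypothetical detectors: one fairly confident, one rather uncertain.
m1 = {'morph': 0.7, 'genuine': 0.1, 'either': 0.2}
m2 = {'morph': 0.4, 'genuine': 0.2, 'either': 0.4}
fused = dempster_combine(m1, m2)  # belief in 'morph' rises above 0.7
```

The normalization by 1 - k redistributes the conflicting mass; it is this explicit handling of conflict and uncertainty ('either') that distinguishes DST from a simple score average.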
If, within media forensics, the field of image manipulation detection is considered (which also contains MAD as a research question), the same wide range of methods is used in research papers, ranging from the simple to the complex. A good example in this domain is the work of Fontani et al. in [39,40]. In those papers, the authors apply with DST a very sophisticated approach to the image manipulation detection task and additionally use its benefits to counter anti-forensics.
A face morphing attack detector is by its nature a binary pattern classifier. The methods for combining such pattern classifiers have been thoroughly studied for a long time, e.g., in [38]. The paper [7] summarizes the state of the art in information fusion for MAD and extends it by introducing DST to this field. The test results presented there show that the error rates with the DST-based fusion are significantly lower compared to those of individual detectors as well as of some simplistic fusion approaches applied previously (majority voting and average rule). Here, the work from [7] is used as the basis for this paper, taking its fusion framework and extending it even further by including likelihood-based fusion. The reason to do so is the prominent role that the forensic sciences currently attribute to the usage of likelihood ratios in expert testimony, see, e.g., [41] for the example of footwear marks (and underlying forensic analyses, see, e.g., [42]).
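To give an intuition for likelihood-based fusion, the following sketch combines detector scores via a product of likelihood ratios, assuming (purely for illustration) Gaussian class-conditional score distributions estimated on training data; the parameter values are hypothetical, and this is not the exact realization evaluated later:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio_fusion(scores, morph_params, genuine_params):
    """Naive (independence-assuming) likelihood-ratio fusion.

    For each detector i, LR_i = p(score_i | morph) / p(score_i | genuine);
    the fused LR is the product of the per-detector ratios. LR > 1
    supports the morph hypothesis.
    """
    lr = 1.0
    for s, (mu_m, sd_m), (mu_g, sd_g) in zip(scores, morph_params, genuine_params):
        lr *= gaussian_pdf(s, mu_m, sd_m) / gaussian_pdf(s, mu_g, sd_g)
    return lr

# Hypothetical per-detector (mean, std) pairs estimated on training scores:
morph_params = [(0.8, 0.1), (0.7, 0.15)]
genuine_params = [(0.2, 0.1), (0.3, 0.15)]
lr = likelihood_ratio_fusion([0.75, 0.65], morph_params, genuine_params)
is_morph = lr > 1.0
```

The dependence of each LR_i on density estimates fitted to training data is exactly where such a scheme can break in the field: if the test-time score distributions differ from the training ones, the ratios, and hence the fused decision, become unreliable.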
While many scientific publications address applying fusion under lab conditions, only very few address the question of generalization as well as the applicability to forensic procedures within the context of criminal investigations. In [43], classical probabilities are replaced by Shafer belief functions, and an analogy of Bayes' rule is introduced that is capable of overcoming the traditional inability to distinguish between lack of belief and disbelief. Besides the mathematical modeling, the consequences of applying the fusion theory in legal practice are discussed. The authors conclude that there is still a lot of room for explaining the advantages and limitations of using information fusion to forensic researchers as well as to the actual practitioners in criminal investigations. Here, the discussion of the advantages and disadvantages of information fusion is continued, and its limitations, if applied under real-life conditions, are empirically demonstrated.

The concept of fusion-based face morphing attack detection
In theory, a necessary and sufficient condition for a combination or fusion of classifiers to be more accurate than any of its members is that the individual classifiers are accurate and diverse. An accurate classifier has a classification performance better than random guessing, and two diverse classifiers make errors on different data points [44]. In practice, experimental evidence has been provided that, for the case of classifiers with a low level of dependence, a consensual decision is likely to be more accurate than any of the individual decisions [45]. It has also been shown that lowering the correlation among classifiers increases the accuracy of the combination [46].
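The role of diversity can be made tangible with a small Monte Carlo sketch (hypothetical classifiers, not the MAD detectors evaluated in this paper): five classifiers that are each correct with probability 0.7 yield a clearly better majority vote when they err independently than when they all copy a single decision:

```python
import random

def majority_accuracy(n_classifiers, p_correct, rho_identical, trials=20000, seed=7):
    """Accuracy of a majority vote over simulated binary classifiers.

    Each classifier is correct with probability p_correct. With
    probability rho_identical, all classifiers copy one single draw
    (no diversity); otherwise, they err independently.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        if rng.random() < rho_identical:
            votes = [rng.random() < p_correct] * n_classifiers
        else:
            votes = [rng.random() < p_correct for _ in range(n_classifiers)]
        if sum(votes) > n_classifiers / 2:
            correct += 1
    return correct / trials

independent = majority_accuracy(5, 0.7, rho_identical=0.0)  # ~0.84 (binomial value: 0.837)
identical = majority_accuracy(5, 0.7, rho_identical=1.0)    # ~0.70, no gain over one classifier
```

This is the "statistical" argument of [2] in miniature; note that it silently assumes that the individual accuracy of 0.7 carries over from training to field data, which is precisely the assumption challenged later in this paper.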
The application of fusion to MAD approaches, and especially of the Dempster-Shafer Theory (DST), was initially discussed in [7]. In the experiments performed there, the fusion always outperforms the individual classifiers in terms of lower error rates. The evaluation concept from that paper is considered here as a reference. It is expanded, and it is demonstrated that under certain conditions the superiority of fusion does not hold. In particular, it is illustrated why the assumption that fusion would make the detection more reliable can nevertheless fail in practice. This enables a discussion of the constraints and limitations of the application of fusion and a reflection upon the impact of the generalization power of single classifiers as well as fusion methods and of the relationship between training and test data sets. Figure 2 roughly depicts the initial evaluation concept.
The concept consists of five major components:

1. The set D of individual morphing attack detectors. Each individual morphing detector is considered as a black box (i.e., they are used as pre-trained methods, implying that we have no influence on the training of the classification model). The input for an individual detector is a face image and the output is a score between 0 and 1. High scores indicate morphs and low scores genuine samples.
2. The set of approaches for establishing weights for individual decisions in the fused one. In the case of DST, the mass (belief) functions are required. The process of deriving such parameters is referred to as training in Fig. 2.
3. The set of fusion approaches F. A fusion approach gets a list of individual decisions and the "importance" of each decision and returns the consensual decision.
4. The evaluation data, which includes training data for establishing fusion parameters (e.g., weights or mass functions) and test data for the estimation of error rates. The training and test datasets are created by splitting the AMSL Face Morph Image Data Set (made available via: https://omen.cs.uni-magdeburg.de/disclaimer/index.php). This dataset was initially created to simulate a border control scenario and includes cropped and JPEG-compressed face images which do not exceed 15 kByte and, therefore, fit onto the chip of an eMRTD. In the evaluation, this application scenario is referred to as "MAD in identity verification" (SC2). For creating morphed face images, the combined morphing approach from [30] is applied.
5. Comparison of individual detectors and fusion approaches. As a performance metric, we have chosen the error rates of the classification approaches.
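For component 5, the error rates can be sketched as follows (a minimal illustration with hypothetical scores; the names APCER/BPCER in the docstring follow common MAD terminology, while the exact metrics of this paper are defined in section 4.4):

```python
def error_rates(scores, labels, threshold=0.5):
    """Classification error rates at a fixed decision threshold.

    labels: 1 for morph, 0 for genuine; scores in [0, 1], with high
    values indicating morphs. Returns the rate of missed morphs and
    the rate of falsely flagged genuine images (in MAD terminology
    often reported as APCER and BPCER, respectively).
    """
    morphs = [s for s, y in zip(scores, labels) if y == 1]
    genuines = [s for s, y in zip(scores, labels) if y == 0]
    missed_morph_rate = sum(1 for s in morphs if s < threshold) / len(morphs)
    false_alarm_rate = sum(1 for s in genuines if s >= threshold) / len(genuines)
    return missed_morph_rate, false_alarm_rate

# Hypothetical detector scores and ground truth labels:
scores = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1]
labels = [1,   1,   1,   0,   0,   0]
miss, false_alarm = error_rates(scores, labels)  # (1/3, 1/3)
```

At a fixed threshold, the two rates trade off against each other; reporting both is what allows the later comparison of fusion approaches against the best individual classifier.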
Here, this concept and its components are re-used and extended by the following: (1) providing a better separation between the training and test datasets by using completely different data sources, (2) adding a fusion approach based on forensic likelihood ratios, (3) adding two types of morphed face images: complete and splicing morphs [12], and (4) adding the application scenario "MAD in document issuing" (SC1).
For scientific rigor, it has been ensured in communication with the authors of the MAD approaches that the datasets used for training of the individual detectors do not overlap with the datasets used for training and testing of the fusion approaches. Figure 3 depicts the evaluation concept for this paper. The components from [7] and the modifications and extensions summarized in section 3 are apparent in the comparison to Fig. 2.

Evaluation setup
The representation of the evaluation scenario is done by either using images in their native format and resolution (for application scenario "MAD in document issuing," SC1) or in the format specified for ICAO-compliant eMRTD (for application scenario "MAD in identity verification," SC2). The evaluation scenarios are discussed in more detail in section 4.1. In section 4.2, the single classifiers used for MAD are discussed, while section 4.3 summarizes the fusion methods evaluated (including the strategies for the determination of decision thresholds and score normalization). Section 4.4 introduces the performance metrics and section 4.5 the databases that are used to create the evaluation data sets.

Detailed specification of two evaluation scenarios
So far, the evaluation of morphing attack detection (MAD) mechanisms has not been focused on the application scenario. The MAD approaches were rather classified into two groups regarding whether a trustworthy reference face image is presented or not (reference-based vs. single-image/no-reference approaches; see section 2.2). Here, the two application scenarios "MAD in document issuing" (SC1) and "MAD in identity verification" (SC2), representing the two forensic checks required in the document life-cycle of a face-image-based identity document (see Fig. 1), are considered. Table 1 compares both application scenarios. The most intuitive mapping would be to link single-image MAD approaches to SC1 and reference-based MAD approaches to SC2. In fact, both application scenarios can be set up in such a way that a reference image is available. For SC2, taking a "live" face image is an inherent part of the procedure. Note that this image could be used solely for face recognition and ignored by the MAD module. For the document issuing in SC1, a webcam could be installed next to the officer at the issuing authority, providing a possibility for capturing "live" face images of an applicant.
No-reference MAD approaches are limited to the search for content-independent statistical anomalies or content-dependent visual artifacts caused by the morphing process. Such methods often apply techniques developed within the context of digital image forensics (see section 2.2). Reference-based MAD algorithms try to reconstruct the morphing process aiming at predicting the face of an "accomplice" and comparing this face to the trustworthy "live" image. Hence, the presence of a reference face image rather gives additional options for the choice of detection mechanisms, but does not determine the application scenario.
In contrast, the face image format in SC2 is very closely defined by national and international regulations, especially by the International Civil Aviation Organization (ICAO) standardization of eMRTD. The restrictions on the digital image that may be stored in an eMRTD stem from antiquated physical storage limitations. For instance, the current generation of German (and other countries') passports limits the free space for a digital face image to 15 kB. During the application for a new document, an applicant submits a printed face photograph of the size of 35 × 45 mm. These images are scanned at a resolution of 300 dpi and undergo lossy compression before they are stored in the passport. The submission of printed face images is in fact the main vulnerability making the face morphing attack easy to execute. The reason is that the printing process destroys almost all traces of image manipulation, so that human examiners are highly prone to errors when categorizing such images [12]. The straightforward way to reduce the danger of the morphing attack is to prescribe the submission of high-resolution digital face photographs of decent quality. This done, the image resolution would no longer be an issue, at least for the document issuing scenario. As described in section 2.2, taking the picture directly in the controlled environment of the issuing office would limit the threat posed by morphing attacks. This is not only a political issue but would also require the elimination of further attack vectors. The file format used in this paper to implement SC2 is a face image compliant with the ICAO specifications for eMRTD: 531 × 413 pixels (inter-eye distance of at least 120 pixels), in JPEG2000 format, compressed to fit the 15 kB size constraint. The file format to implement SC1 is not that narrowly defined; here, the original file format of the reference databases (see section 4.5) is used.

Morph attack detection approaches
In this paper, five morph attack detection (MAD) approaches are examined. The first one (D keypoints ) is based on localization and counting of keypoints [19]. The keypoint-based morphing detector indirectly quantifies the blending effect as an indispensable part of the morphing process. Blending leads to a reduction of face details and therefore to a reduction of "significant corners" and edge pixels. The detector counts the relative number of keypoints in the face region detected by different approaches as well as the relative number of edge pixels. For classification within D keypoints , a linear support vector machine (SVM) was trained based on 24-dimensional feature vectors with a dataset of 2000 genuine and 2000 morphed high-resolution passport images. These morphs were created using the approaches from [12,30].
The other four MAD approaches are based on Deep Convolutional Neural Networks (DCNN). Two of them, designated as D ArXivNaive and D ArXivMC, are described in [26]. The other two, designated as D BIOSIGNaive and D BIOSIGMC, are described in [17]. All four of these detectors are based on the VGG19 network. Transfer learning is applied to build a binary classifier from the classification model originally trained for the ILSVRC challenge. The training dataset comprises approximately 2000 genuine images and the same number of morphs. Genuine images were collected from several public face databases and scraped from the internet. The major difference between the classifiers lies in the approach for generating the morphed face images used for training. While D ArXivNaive is an older detector trained with lower-quality morphs and D ArXivMC is the same detector with an updated data augmentation strategy in the training, the D BIOSIGNaive and D BIOSIGMC detectors used sophisticated morphing with artificially added high frequencies to compensate for the blurring effect of the blending operation when creating their training data. The differences between the Naive and the MC (multiclass/complex morphs) versions lie in the composition of the training data: For Naive, 50% genuine images and 50% complete morphs are used. For MC, 50% genuine images and a mix of complete and partial morphs are used, with the aim of forcing the network to take all available information into account for its decision-making (i.e., preventing it from focusing on selected face regions like the eyes to detect morphing attacks). The details on the training concept for the Naive and MC versions of the detectors used here can be found in [17].

Fusion approaches
Here, each MAD approach operates as a "black box" returning a matching score for an input sample. As a consequence of the evaluation concept, fusion on signal level is out of scope for this paper and fusion on feature level (see section 2.3) is not feasible. Hence, the detection accuracy gain from one fusion approach at the decision level (majority voting) and three fusion approaches at the matching score level (weighted linear combination, Dempster-Shafer Theory (DST) of evidence, and forensic likelihood ratios (LR)) is explored. Below, the fusion operators F are described in detail:

Majority voting (F M )
The naive consensus pattern of simple majority [38] is used for opinion combination. If the number of votes for every alternative is equal, the majority rule returns "no decision."
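This consensus rule can be sketched as follows (an illustrative helper, not the exact implementation used in the paper; binary detector decisions are assumed to be coded as 1 for "morphed" and 0 for "genuine"):

```python
def majority_vote(decisions):
    """Fuse binary MAD decisions (1 = morphed, 0 = genuine) by simple majority.

    Returns 1, 0, or None ("no decision") when the vote is tied.
    """
    votes_morph = sum(decisions)
    votes_genuine = len(decisions) - votes_morph
    if votes_morph > votes_genuine:
        return 1
    if votes_genuine > votes_morph:
        return 0
    return None  # tie: the majority rule returns "no decision"
```

For an even number of detectors, such ties (e.g., 2 vs. 2 votes) must be handled explicitly by the surrounding application.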

Weighted linear combination (F WLC )
The sum-rule (or weighted linear combination) extends the average rule by assigning different weights to the output of the individual classifiers to be combined. For the case of the same weights, the fusion strategy is often referred to as average rule. Here, two different strategies are used: average rule as well as weighted linear combination with pre-determined weights (see section 5.1 for details on these two strategies).
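The sum-rule can be sketched as follows (an illustrative helper, not the exact implementation used here; when the weights are omitted or equal, it degenerates to the average rule):

```python
def weighted_linear_combination(scores, weights=None):
    """Sum-rule fusion of normalized matching scores in [0, 1].

    With omitted (or equal) weights this reduces to the average rule.
    """
    if weights is None:
        weights = [1.0] * len(scores)  # average rule
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

The fused score is again in [0, 1] and can be thresholded like an individual matching score.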

Fusion based on Dempster-Shafer Theory (F DST )
The Dempster-Shafer Theory (DST) is based on two concepts: belief functions, which represent degrees of belief for one question derived from subjective probabilities for a related question, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence.
In our case, the frame of discernment is defined as Θ = {mor, gen}, with m(mor)/m(gen) representing the basic beliefs that the face is morphed/genuine respectively, and m(Θ) is a mass of uncertainty. A degree of belief (mass) is assigned to each subset. As proposed in [7], we construct mass functions as cumulative distribution functions of matching scores obtained from an experiment. Let p mor (s) and p gen (s) be the approximations of probability density functions of scores for verification attempts with morphed and genuine images respectively. For a detector outcome s* ranging from 0 to 1, we define the mass m(mor) as an area under p mor (s) between 0 and s* and m(gen) as an area under p gen (s) between s* and 1, and the mass of uncertainty as a complement to the sum of both masses:

m(mor) = ∫_0^s* p mor (s) ds,  m(gen) = ∫_s*^1 p gen (s) ds,  m(Θ) = 1 − m(mor) − m(gen)

Note that we interpret the detector outcome s* (also called matching score) as a decision confidence with 1 for 100% confidence that the image is morphed and 0 for 100% confidence that the image is genuine.
Technically, the three masses are calculated for each morphing detector based on the matching scores of training samples and stored as a parameter of our fusion engine. At the time of decision-making, for each outcome s i * of the i th detector, we obtain the values m i (mor), m i (gen), and m i (Θ) as the nearest points on the corresponding discrete mass curves.
Dempster's rule of combination for two beliefs from independent sources is given by:

m(A) = (1/K) · Σ_{B ∩ C = A} m 1 (B) · m 2 (C),  with K = 1 − Σ_{B ∩ C = ∅} m 1 (B) · m 2 (C)

where m(A) represents the combined mass on A (a given member of the power set), m 1 and m 2 represent the masses of the first and second items of evidence respectively, and K represents the normalization constant. The second term in K describes the conflict between the two items of evidence. If it is equal to 1, then K is equal to 0, implying that these two items contradict each other and cannot be combined by applying Dempster's rule. An efficient application of Dempster's rule for the computation of combined belief can be found in [6].
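For the two-class frame Θ = {mor, gen} used here, Dempster's rule can be written out explicitly. The following sketch (a hypothetical helper; each mass assignment is assumed to be a dict over the keys 'mor', 'gen', and 'theta' summing to 1) combines two items of evidence:

```python
def dempster_combine(m1, m2):
    """Combine two basic belief assignments over Theta = {mor, gen}.

    Each argument is a dict {'mor': ..., 'gen': ..., 'theta': ...} summing to 1.
    Returns the combined masses, or None if the conflict is total (K == 0).
    """
    # Conflict: joint mass where the two items of evidence contradict each other.
    conflict = m1['mor'] * m2['gen'] + m1['gen'] * m2['mor']
    K = 1.0 - conflict
    if K == 0.0:
        return None  # fully contradictory evidence cannot be combined
    mor = (m1['mor'] * m2['mor'] + m1['mor'] * m2['theta']
           + m1['theta'] * m2['mor']) / K
    gen = (m1['gen'] * m2['gen'] + m1['gen'] * m2['theta']
           + m1['theta'] * m2['gen']) / K
    theta = (m1['theta'] * m2['theta']) / K
    return {'mor': mor, 'gen': gen, 'theta': theta}
```

Combining more than two detectors is done by applying the rule iteratively, since Dempster's rule is associative and commutative.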

Fusion using likelihood ratios (F LR )
Likelihood ratios (LR) are used in forensics in order to express uncertainty [47]. The basic concept relies on the quotient of the probabilities of the correctness of two hypotheses with respect to an observation within binary decisions which are common in forensics. Semantically, the LR describe how much more probable one of the hypotheses is in comparison to a complementary one when specific observations can be made.
Within the scope of a forensic comparison of face images, LR are discussed, e.g., in [42] and are already used in forensic practice in some countries, as shown, e.g., in [41] for a case involving footwear marks in the UK. Sometimes the observed LR are mapped to particular confidence levels for the hypothesis in order to make the result more accessible to forensic laymen, as the requirements for particular LR differ between forensic domains; see, e.g., [48]. Generally, a likelihood ratio close to 1 indicates a weak decision, as the probabilities for the two hypotheses are almost identical.
With the availability of multiple detection algorithms, a fusion using LR is also possible as suggested, e.g., in [49] for multiple biometric matchers. For each detection algorithm, a quality value needs to be determined as a weight in the fusion algorithm.
In our experiments, the LR for a single detector D providing confidence levels c in a two-class problem is determined by the quotient of the detector's confidence for a sample s toward a genuine sample, c D (gen), divided by the confidence toward a morphed sample, c D (mor):

LR(s, D) = c D (gen) / c D (mor)

Note that the inverse of the LRs is used in the experiments performed here, in order to achieve a defined value of zero for a confident decision. Usually the tested hypothesis (in this case, whether an image is a morph) would be used as the numerator. As a result, the F LR shows the same behavior. In addition to that, it is possible to normalize F LR using the number of detectors (in this paper 5). Otherwise, this number would have to be taken into account during the interpretation of the fusion operator.
The LR-based fusion score F LR of a sample image in question for the k = 5 detectors D = {D keypoints , D ArXivNaive , D arXivMC , D BIOSIGNaive , D BIOSIGMC } is determined as the quotient of the weighted sum of the LRs toward a genuine sample (LR g ) divided by the weighted sum of the LRs toward a morph (LR m ):

F LR (s) = (Σ i w i · LR g (s, D i )) / (Σ j w j · LR m (s, D j )),  with LR g (s, D) = 1 / LR m (s, D) and LR m (s, D) = c D (mor) / c D (gen)

The factor w i /w j represents here the weighting factor for the LR fusion as described in section 5.1. A quotient F LR (s) closer to zero indicates a larger confidence of the decision toward a morph.
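A hedged sketch of this fusion score (hypothetical function, not the authors' implementation; it assumes each detector reports a confidence pair (c_gen, c_mor) with c_gen + c_mor = 1, and the epsilon guard against division by zero for fully confident detectors is an added assumption):

```python
def lr_fusion(confidences, weights):
    """LR-based fusion score F_LR for a list of (c_gen, c_mor) confidence pairs.

    Sums the weighted LRs toward "genuine" and toward "morph" per detector and
    returns their quotient; values close to 0 indicate a confident "morph" call.
    """
    eps = 1e-9  # assumption: guard against fully confident detectors (c = 0)
    lr_gen = sum(w * (c_gen / max(c_mor, eps))
                 for w, (c_gen, c_mor) in zip(weights, confidences))
    lr_mor = sum(w * (c_mor / max(c_gen, eps))
                 for w, (c_gen, c_mor) in zip(weights, confidences))
    return lr_gen / lr_mor
```

When all detectors are confident the image is a morph, the numerator collapses while the denominator grows, driving the quotient toward zero as described above.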

Normalization
In order to perform a reasonable fusion, the matching scores of the individual classifiers should be brought into the same range. The detectors D ArXivNaive , D arXivMC , D BIOSIGNaive , and D BIOSIGMC return negative values for genuine faces and positive values for morphed faces; the default decision threshold is 0. In contrast, the detector D keypoints returns values between 0 and 1, with lower values for genuine faces and higher values for morphed faces; the default decision threshold is 0.5. Within the training phase performed in this paper using the DEFACTO dataset (see section 4.5), we perform min-max normalization of the matching scores and adapt the default decision thresholds. As a result, the normalized matching scores of all detectors then range from 0 to 1 and the new default decision thresholds can be found in Table 3 (column τ fixed ). For each classifier, the MIN and MAX values of the matching scores are stored to perform the min-max score normalization in the evaluation phase. The aforementioned decision thresholds are also stored as parameters of the fusion and are used in the evaluations in SC1 and SC2.
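The normalization step can be sketched as follows (illustrative helpers; clipping evaluation scores that fall outside the stored training range to [0, 1] is an added assumption, as the paper does not state how such out-of-range scores are handled):

```python
def fit_minmax(train_scores):
    """Store MIN and MAX of the training-phase scores as normalization parameters."""
    return min(train_scores), max(train_scores)

def apply_minmax(score, lo, hi):
    """Map a raw detector score into [0, 1] using the stored training range.

    Assumption: scores outside the training range are clipped to [0, 1].
    """
    s = (score - lo) / (hi - lo)
    return min(max(s, 0.0), 1.0)
```

The same (lo, hi) pair fitted on DEFACTO scores would then be reused unchanged in the SC1 and SC2 evaluations.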

Performance metrics
Morphing detection is a standard two-class problem with two possible outcomes, "passport image is morphed" or "passport image is not morphed," and two types of errors: a morphed image is recognized as non-morphed and vice versa. Driven by the idea that the morphing attack can be seen as a special case of the presentation attack, the detection performance metrics from the presentation attack detection testing standard [50] are adopted. The Attack Presentation Classification Error Rate (APCER) describes the proportion of morphed face images incorrectly classified as genuine (bona fide) and the Bona Fide Presentation Classification Error Rate (BPCER) describes the proportion of genuine (bona fide) face images incorrectly classified as morphed. MAD approaches are typically designed to report two values: a binary decision on whether the image is morphed or not and a confidence score for this decision from the interval [0; 1]. Higher values indicate higher confidence that the image is morphed. In fact, the binary decision is derived from the confidence score by comparing it to an algorithm-dependent predefined decision threshold. Hence, APCER and BPCER are opposing functions of the decision threshold. Formally, the BPCER is computed as the proportion of bona fide images over the threshold and the APCER as the proportion of morphed images below the threshold. At the development stage, when an algorithm can be evaluated with different decision thresholds, the more informative way to compare algorithms is to draw the detection error trade-off (DET) curves (respectively the area under curve (AUC)) on the same plot. Traditionally, BPCER is seen as a convenience measure and APCER as a security measure. The DET curve represents BPCER as a function of APCER. Here, also the half total error rate (HTER) is used as an average of BPCER and APCER at the fixed decision threshold to compare performances more easily.
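These metrics can be sketched as follows (an illustrative helper; the convention that scores at or above the threshold count as "morphed" follows the definitions above):

```python
def mad_error_rates(morph_scores, bona_fide_scores, threshold):
    """Compute APCER, BPCER, and HTER at a fixed decision threshold.

    APCER: proportion of morphs scored below the threshold (missed attacks).
    BPCER: proportion of bona fide images scored at/above the threshold.
    HTER:  average of APCER and BPCER.
    """
    apcer = sum(s < threshold for s in morph_scores) / len(morph_scores)
    bpcer = sum(s >= threshold for s in bona_fide_scores) / len(bona_fide_scores)
    hter = (apcer + bpcer) / 2.0
    return apcer, bpcer, hter
```

Sweeping the threshold over [0, 1] and plotting BPCER against APCER yields the DET curve discussed above.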

Evaluation datasets
There are four databases used in the experiments in this paper: The DEFACTO database [51], containing morphs and genuine face images, is used for the training of the fusion methods (see Fig. 2). This database is chosen as a neutral dataset for training because the authors ensured that it was not used in the creation (i.e., training) of any of the five "black box" individual detectors and the morphing method used in it is unknown to them. By this choice, a realistic evaluation setup can be ensured, with training data (DEFACTO material) having an unknown similarity to test data (for SC1 and SC2; see Fig. 1), reflecting the constraints that will be encountered in field application. The following datasets (and subsets) are used: The DEFACTO dataset contains 200 genuine face images and 39980 morphs. Since using the whole dataset would represent an extremely strong bias toward morphs, only a subset of 2309 randomly selected morphed images is used. Three other databases are used for the evaluations comparing the performances of the single classifiers and the fusion methods: For two of them (the ECVP (aka Utrecht) [20] and London Set [52] databases), morphed images are generated using the approaches from [12,30]. The subsets of morphed images are denoted as complete, splicing, and combined according to the generation method used. Additionally, as a source of further genuine face images, mugshots from the Alabama News Network [53] are taken.
Using the original sized images (and morphs based on those), the experiments simulate the passport issuing scenario (SC1). In order to simulate the verification scenario (SC2), the images are down-scaled (to 413 × 531 pixels) and compressed using the JPEG2000 format in a way that the image size does not exceed 15 kilobyte (kB) as described in section 4.1. Figure 3 shows the exact evaluation concept and Table 2 summarizes the information about the image (sub-)sets used in our experiments.

Evaluation results and discussion
This section contains a large number of results from different empirical evaluations as well as their interpretation. It is structured as follows: Section 5.1 summarizes the DEFACTO experiments, which serve as a baseline as well as an estimator for fusion weights (or mass functions). Section 5.2 evaluates the individual detectors and fusion methods (using the full ensemble of detectors) for the two simulated application scenarios SC1 and SC2. Section 5.3 discusses the impact of the performed fusion on the field of MAD. Section 5.4 determines the impact of using smaller ensembles (i.e., subsets of the available detectors) for fusion. Section 5.5 determines the impact of less restrictive assumptions in the evaluation setup composition on the error rates achieved in fusion. Section 5.6 provides a final summary and generalization of the obtained results.

DEFACTO training and baseline experiments
The experiments with the DEFACTO dataset have two objectives: 1. Fair comparison of the MAD approaches to each other regarding their error rates on a disjoint dataset. In fact, the face images in the DEFACTO dataset do not overlap with those used for the training of the MAD approaches. Moreover, the morphing procedure used for DEFACTO differs significantly from those used for the individual MAD approaches.
2. Training of the fusion parameters, including fusion weights and decision thresholds of the individual MAD approaches as well as mass curves for the DST-based fusion. The importance (or, in other words, the credibility) of a detector in the fusion is given by its fusion weight. Here, we consider two thresholding strategies, "fixed" and "adaptive," to define at the same time the decision thresholds and weights (the latter only for F WLC and F LR ): For the "fixed" strategy, we rely on the default decision thresholds suggested by the developers of the MAD approaches and assign equal weights for fusion approaches that accept weights. This trivial strategy (which considers all available detectors as being equally important) is typically the only choice if no additional evaluation of classifiers can be performed, or if there is a suspicion that the evaluation dataset does not fit the in-field data.
For the "adaptive" strategy, we set a new decision threshold at the point at which the EER of a MAD approach is reached. Additionally, we calculate the fusion weights for F WLC and F LR based on the EER values.
To be more precise, the complements of the doubled EER values are used as the weights of the individual MAD approaches in the fusion. Since the possible EER values for a binary classifier range from 0 (for a perfect classifier) to 0.5 (for a random guess) and the weight should spread over the interval [0, 1], the EER value is multiplied by 2 before being subtracted from 1, see Equation (11):

w i = 1 − 2 · EER i (11)
Here, i represents one of the five MAD approaches. Figure 4 shows the DET curves of the five addressed MAD approaches on the original-sized DEFACTO images. Crossings with the dashed black line represent the EER of the detectors. Regarding the EER, the three detectors D ArXivNaive , D BIOSIGMC , and D BIOSIGNaive demonstrate comparable performances, with D BIOSIGMC achieving the best performance by a small margin. The D ArXivMC demonstrates slightly worse performance and the D keypoints is by far the worst detector. Table 3 lists the EER values of the individual MAD approaches, the decision thresholds τ at which the EER are reached, and the weights assigned to the approaches for fusion for both strategies "fixed" and "adaptive." If the fusion is done at the decision level, the decision thresholds are used to derive decisions from matching scores.
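The weight mapping of Equation (11) can be sketched as follows (a trivial helper, shown for completeness):

```python
def eer_weight(eer):
    """Map an EER in [0, 0.5] to a fusion weight in [0, 1].

    A perfect classifier (EER = 0) receives weight 1,
    a random guess (EER = 0.5) receives weight 0.
    """
    return 1.0 - 2.0 * eer
```

With this mapping, a weak detector with an EER of 0.29 would, for instance, receive a weight of 0.42.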
The mass functions for the DST fusion are demonstrated in Fig. 5. The mass curves for the "genuine" and "morphed" matching scores reproduce the classic error curves so that the crossing point indicates the EER.
What can be observed from the results in Table 3 is that D BIOSIGMC outperforms the other four detectors by presenting the smallest EER (resp. the highest AUC). As a result, it is assigned the highest weight for the fusion operations. The results for D keypoints confirm what was already indicated in Fig. 4: Despite its good performance on other image sets, this detector performs significantly worse here than the other four. As a result, it is assigned, with 0.42, the lowest weight for the fusion.
If the EER locations (the projection of the EER onto the x-axis) and the uncertainty curves shown in Fig. 5 are analyzed, it can be seen that four of the five curves (resp. EER locations) are shifted from the center to the left (indicating a bias toward morphed images) and only D keypoints is shifted to the right with a strong bias toward genuine images. The amount of the shift correlates with the ranking of the detectors: D BIOSIGMC shows the smallest shift (a nearly centered uncertainty curve with a very small skew) while the other detectors show an increasing shift (and skew) with their higher EER.

Experiments with individual detectors and fusion methods
The sections 5.2.1 and 5.2.2 summarize the results on the performance of the individual detectors and fusion methods evaluated with the two simulated application scenarios.

Scenario SC1 ("MAD in document issuing")
Figure 6 shows the DET curves for the tests on complete, splicing, and combined morphs in SC1. The individual classifier performance is displayed by solid lines (with the same color coding as in Fig. 4), and the performance of the fusion methods is given as dashed lines (where a continuous space of operation points is possible) or symbols (in case only one operation point, either the "fixed" setting or the "adaptive" one, is possible). For all three morphing types, the individual classifier D arXivNaive achieves the best performance for SC1, followed by the weighted linear combination (F WLC ). The three single classifiers D BIOSIGNaive , D BIOSIGMC , and D keypoints show the lowest performance. F M with the "fixed" and "adaptive" thresholding strategies achieves the lowest performance of the fusion methods. The more sophisticated fusion operators (F DST and F LR ) perform better than F M , and in some cases F DST even outperforms F WLC , but both show a significant bias toward morphed images. Especially for F DST , this is apparent with an APCER close to 0 at a BPCER of roughly 0.2. Figure 7 shows the DET curves for the tests on complete, splicing, and combined morphs in SC2. The same color coding and symbols are used as in Figs. 4 and 6.

Scenario SC2 ("MAD in identity verification")
The general performances of the individual and fusion-based detectors in SC2 are very similar to the SC1 results shown in Fig. 6. A slight decrease in the detection performances can be observed for all tested methods. This decrease can be attributed to the fact that the 15 kB image format used in SC2 generally leaves less room for media forensic investigations of image manipulation. What is remarkable is that the more sophisticated fusion operators (F DST and F LR ), while also showing some performance decrease, lose some of their bias toward morphed images. Especially for the splicing morphs, it can be observed in Fig. 7 that F DST shows an APCER larger than 0, even slightly outperforming all other detectors at the corresponding APCER values.

Discussion of the impact of fusion to face morphing attack detection
Tables 4, 5, and 6 summarize the results. Table 4 presents a baseline using only the individual classifiers, showing that D ArXivNaive performs best in testing in both application scenarios SC1 and SC2 on all three morph types. Tables 5 and 6 present the single classifier and fusion results for the "fixed" (Table 5) and "adaptive" (Table 6) thresholding strategies. The difference lies in the basic assumption about the similarity of the training data (here DEFACTO) and the material encountered in field application (here, the mix of ECVP, London, and Alabama material, either in the original (for SC1) or the 15 kB version (SC2)). While the "adaptive" setting is the setting encountered in most lab experiments, the "fixed" one (which assumes a much lower similarity between training and test data) is a more realistic assumption, leading to more trustworthy error estimates in this media forensic analysis.
When focusing on the single classifier results obtained for both thresholding strategies ("fixed" decision threshold and fusion weights vs. "adaptive" decision threshold and fusion weights), it can be seen that D BIOSIGMC , which performed best on the DEFACTO dataset (see Fig. 4 in section 5.1), demonstrates significantly worse performance in the evaluations in both application scenarios SC1 and SC2. In two of the six tests (the two evaluations run on splicing morphs), it actually shows the lowest performance (i.e., the highest HTER). When looking at Tables 5 and 6, these results are confirmed. For both thresholding strategies and all three different morphing types, D BIOSIGMC achieves the second lowest detection performance, followed only by D keypoints . The best performance for a single classifier is in all cases achieved by D arXivNaive with the "fixed" decision threshold.
When comparing the single classifier and fusion results in Tables 5 and 6, the general picture established in section 5.2 is confirmed: In nearly all cases for SC1 as well as SC2, the fusion approaches fail to outperform the best individual detector. Neither for selected morphing approaches nor for one of the two thresholding strategies does the fusion generally outperform the best single classifier, even though in one case, for SC2 and splicing morphs, it comes close (the best single classifier is D arXivNaive with "fixed" at an HTER of 8.5% and the best fusion is F LR with "adaptive" at an HTER of 8.92%). Most interestingly, the DST-based fusion, which is the most sophisticated fusion strategy and which is highly regarded in many other application fields, leads here in all cases to poor performance. For the thresholding strategies, it can be summarized that for the four classifiers D BIOSIGNaive , D BIOSIGMC , D arXivNaive , and D arXivMC , there is a tendency that the best results are obtained with the "fixed" decision threshold, while for D keypoints , in the majority of the cases, better results are obtained with the "adaptive" decision threshold. For the fusion, no clear tendency as to which thresholding strategy leads to better results can be observed.
When considering the differences in the detection performance for the three tested morph types (combined, complete, and splicing), it can be summarized that all detection approaches discussed here yield very similar detection performances (both in SC1 as well as SC2).

Variation of the fusion ensemble
During the review phase for this journal paper, the reviewers raised the question why it is assumed that a fusion using all five single classifiers is the optimal choice. Alternative fusion ensembles using three or four classifiers might be capable of outperforming the whole set of five, especially when removing the weakest candidate (D keypoints ). To address this issue, Table 7 compares the results of three different fusion ensembles for the "fixed" decision thresholds. The results shown are for the complete set of 5 detectors as baseline, the best performing ensemble of 4 (here D BIOSIGNaive , D BIOSIGMC , D arXivNaive , and D arXivMC ; the evaluations performed in this case were a complete leave-one-out sequence, but only the most relevant result is presented here) and the ensemble of three with the most disparate characteristics (D arXivNaive , D BIOSIGMC , D keypoints ; i.e., selection by limiting redundancy). The results show an apparent decrease of the HTER for SC1 and SC2 when switching from an ensemble of 5 (denoted as "5 det" in Table 7) to an ensemble of (the most suitable) 4 detectors (denoted as "4 det" in Table 7). When compared to the single detector performance reported in Table 5 above, it can be seen that the best ensemble of 4 also seems to outperform the individual detectors. Some of the figures presented have to be considered very carefully, since they hide a problem in the scheme: deadlocks. These are next to impossible for cases where the individual weighting prevents them (e.g., in case of the F WLC ), but they are especially relevant for the majority vote, where significant numbers of "undecided" events occurred (e.g., cases where 2 detectors predicted one class and the other 2 the opposite class) that are not reported in the table. Over the various tested ensembles, these "undecided" events amount to up to 10% of all majority vote cases.
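The ensemble search sketched above amounts to an exhaustive enumeration of detector subsets (a hypothetical helper, not the authors' evaluation code; hter_of stands for a callback that fuses a given subset on the evaluation data and returns its HTER):

```python
from itertools import combinations

def best_ensemble(detectors, hter_of, sizes=(3, 4, 5)):
    """Return the detector subset with the lowest HTER.

    detectors: list of detector identifiers.
    hter_of:   callable mapping a subset (tuple) to its fused HTER.
    sizes:     ensemble sizes to enumerate (covers leave-one-out and smaller).
    """
    candidates = [subset for k in sizes
                  for subset in combinations(detectors, k)]
    return min(candidates, key=hter_of)
```

For five detectors this enumerates only 10 + 5 + 1 = 16 candidate ensembles, so an exhaustive search is cheap; the cost lies entirely in evaluating each fused ensemble.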
In case of the chosen ensemble of 3 detectors (denoted as "3 det" in Table 7) all HTER values increased significantly, showing that this ensemble (which more strongly relies on the opinion of the rather weak D keypoints ) is outperformed by the bigger ensembles.
Similar to Table 7, Table 8 reports the same ensemble tests for the "adaptive" thresholding strategy. Here, too, the best ensemble of 4 detectors shows better results than the complete ensemble of 5. In contrast to the "fixed" thresholding strategy discussed above, the performance increase obtained by leaving D keypoints out seems smaller, but the number of "undecided" events is also much smaller (less than 3%), so that the gain has to be considered higher here. This performance gain is also evident in the comparison to the single detector results discussed in Table 6.
Like in the case of the "fixed" thresholding strategy, the tested cases of 3 detector ensembles showed significantly worse results, increasing the HTER to 18% or even higher.
Summarizing the results of these detector ensemble selection experiments, it has to be said that the best performing set of 4 detectors outperformed the complete ensemble of 5 for both thresholding strategies ("fixed" and "adaptive") and for SC1 as well as SC2. For fusion methods that are prone to deadlock or "undecided" situations (esp. the majority vote), the even number of detectors caused a small issue in this case, generating in the worst case up to 10% deadlock results that would have to be handled in application. All results for the chosen ensemble of the 3 most dissimilar detectors proved nearly fatal for the system performance, since the HTER was significantly increased in all these cases.

Discussion on alternative evaluation setups
Another issue raised during the review phase for this journal is the choice of a realistic but rather challenging experimental scenario in which the dataset used for training is disjoint from the ones used for testing. The question was how an overlap between training and testing sets (i.e., more favorable conditions for the individual detectors) would influence the outcome of the experiments. To address this question, two different sets of less realistic experimental setups are discussed below: first, a tenfold stratified cross-validation with disjoint sets of genuine samples and morphs, and second, an even less realistic (i.e., more lab-condition-like) test with a static percentage split on a set containing genuine images and morphs that are derived directly from these genuine images.
For the first of these alternative setups, additional tests are performed here to show how a deviation from rigorous evaluation routines is reflected in the error rates obtained. Table 9 summarizes the results for the "fixed" as well as the "adaptive" thresholding strategy. Comparing the results in Table 9 to the results in Tables 4 and 5, the single detector performances with the "fixed" thresholding remain nearly unchanged, while the HTER values of the fusions decrease (e.g., from 11.85% to 2.6% in case of F LR in SC1 for combined morphs, or from 13.70% to 5.9% in case of F LR in SC2 for combined morphs). For the "adaptive" thresholding, the reported single detector HTER values improve significantly (e.g., from 9.62% to 2.2% for D ArXivNaive in SC1 for combined morphs). In some cases, they get really close to the EER values for the corresponding experiment, which represent the best values that could be achieved in this test. The fusion results for this thresholding strategy see an even more significant drop in the HTER values presented (e.g., from 13.41% to 2.8% for F M in SC1 for combined morphs). For the second, even less realistic (i.e., more lab-condition-like) test, no additional experiments have to be performed here. Instead, results from an earlier publication on fusion in face morph attack detection are re-used. As authors of [7], we used a static percentage split (50%:50%) on a set containing genuine images (originating from exactly one public database) and morphs that are derived directly from these genuine images to perform initial tests with DST in this field. The results presented were astonishing HTER values of less than 1%. While the results did indicate the potential benefit of using fusion in MAD, the observed lack of realism in the setup made us question the actual extent of the performance increase we could realistically hope for.
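The tenfold stratified split of the first alternative setup can be sketched as follows (a minimal illustration, not the authors' exact partitioning code; stratification is approximated here by distributing each class round-robin over the folds, and shuffling beforehand is left out for brevity):

```python
def stratified_folds(genuine, morphs, k=10):
    """Split genuine and morph sample lists into k folds with the class ratio kept.

    Returns a list of k (genuine_fold, morph_fold) pairs; in cross-validation,
    each fold serves once as the test set while the rest is used for training.
    """
    folds = [([], []) for _ in range(k)]
    for i, g in enumerate(genuine):
        folds[i % k][0].append(g)  # spread genuine samples round-robin
    for i, m in enumerate(morphs):
        folds[i % k][1].append(m)  # spread morphs round-robin
    return folds
```

Note that this per-fold stratification keeps the class ratio constant but, unlike the disjoint-population setup used in the main experiments, training and test folds are drawn from the same parent population.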
This realization motivated the research work on the empirical limitations of using information fusion and the constraints for its application that led to this journal paper.
Summarizing the results obtained on alternative (i.e., less realistic) evaluation setups, it has to be said that the error rates achieved when drawing training and test data from the same parent population are obviously lower than in a setup with disjoint populations. In the experiments discussed above, the fusion approaches benefit more from the unrealistic, lab-condition-like evaluation setups than the single detectors, and the "adaptive" thresholding strategy benefits more than the "fixed" one.
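The difference between the two thresholding strategies, and the deflating effect of drawing evaluation and test data from the same parent population, can be sketched in a few lines of Python. All score distributions below are purely hypothetical; this is an illustration of the metrics, not the implementation used in the experiments. The "fixed" strategy keeps a detector's default decision threshold, while the "adaptive" strategy re-estimates the threshold at the EER point:

```python
import numpy as np

def apcer_bpcer(genuine, morph, thr):
    """Scores >= thr are classified as morphs.
    APCER: morphs accepted as bona fide; BPCER: bona fide rejected."""
    return float(np.mean(morph < thr)), float(np.mean(genuine >= thr))

def hter(genuine, morph, thr):
    apcer, bpcer = apcer_bpcer(genuine, morph, thr)
    return (apcer + bpcer) / 2.0

def eer_threshold(genuine, morph):
    """Pick the candidate threshold where APCER and BPCER are closest."""
    cands = np.unique(np.concatenate([genuine, morph]))
    gaps = [abs(np.subtract(*apcer_bpcer(genuine, morph, t))) for t in cands]
    return float(cands[int(np.argmin(gaps))])

rng = np.random.default_rng(42)
# Hypothetical in-field scores, shifted w.r.t. the detector's training domain
gen = rng.normal(0.45, 0.10, 1000)
mor = rng.normal(0.85, 0.10, 1000)

fixed = hter(gen, mor, 0.5)                        # "fixed" default threshold
matched = hter(gen, mor, eer_threshold(gen, mor))  # tuned on the test population
print(fixed, matched)  # the "matched" HTER is deflated
```

Under the distribution shift assumed here, re-estimating the threshold on data from the same population as the test data yields a much lower HTER than the fixed default threshold, mirroring the effect seen for the "adaptive" strategy in the less realistic setups.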

Summary on the fusion experiments results
There are three main reasons why fusion fails to outperform the best individual classifier in the results discussed in section 5.3:

1. Lack of diversity of the individual detectors. The detectors D ArXivNaive, D arXivMC, D BIOSIGMC, and D BIOSIGNaive are developed by the same research group and rely on training of DCNNs with similar data sets but strong variances in data augmentation. Hence, it is very likely that these detectors make mistakes on the same samples in field application. Only the D keypoints detector relies on entirely different morphing detection clues and is developed by another research group using a different data set for training. In theory, an assumed cluster of four apparently very similar detectors might introduce a strong bias into the fusion that should be avoided at all costs. In practice, our experiment on different ensembles of classifiers showed a better performance if only those four detectors are used instead of all five.
2. Lack of performance in individual detectors. It can be seen from the evaluation with the DEFACTO dataset that D keypoints lacks generalization power. The default decision threshold of 0.5 is far away from the sub-optimal (i.e., containing an offset due to the training data vs. test data mismatch) threshold of 0.87252 obtained from its evaluation. Even higher are the sub-optimal decision thresholds with the mixed test data set (London, ECVP, and Alabama images).

3. Lack of similarity between the training and test data. Different proprietary data sets are used for training the individual classifiers, which is a very common case, but the datasets for adjusting fusion parameters (evaluation data set) and for actual testing are also very different from each other and from the training data set. One could argue that it makes no sense to use different data sources for adjusting fusion parameters and for testing, but this is the real-life situation. In practice, it is very difficult to precisely foresee and provide significant in-field data at the stage of system development or parameter adjustment. Moreover, there is no guarantee that the in-field data that will be obtained in the future is even similar to the presented training data.
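The diversity issue named in point 1 can be made measurable. A minimal sketch (with invented decision vectors) using the pairwise disagreement measure, i.e., the fraction of samples on which two detectors decide differently:

```python
import numpy as np

def disagreement(d1, d2):
    """Fraction of samples on which two detectors' binary decisions differ.
    Low pairwise disagreement means low ensemble diversity."""
    return float(np.mean(np.asarray(d1) != np.asarray(d2)))

# Hypothetical binary decisions (1 = morph) of three detectors on 8 samples
det_a = [1, 1, 0, 1, 0, 0, 1, 1]
det_b = [1, 1, 0, 1, 0, 1, 1, 1]  # near-clone of det_a (shared training data)
det_c = [1, 0, 1, 1, 0, 0, 0, 1]  # based on independent detection clues

print(disagreement(det_a, det_b))  # 0.125 -> little to gain from fusing a and b
print(disagreement(det_a, det_c))  # 0.375 -> complementary errors
```

Detectors that agree on almost every sample, like det_a and det_b here, contribute little complementary information to a fusion, which mirrors the situation of the four DCNN-based detectors discussed above.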
The case study performed in this paper clearly demonstrates that if the training, evaluation, and test datasets lack similarity, the adaptation of classifier parameters such as a decision threshold may lead to performance degradation. This can be well explained using the example of the classifier D ArXivNaive, which shows the best generalization power in the tests performed. The classifier is well trained with the default decision threshold of 0.59072. An attempt to adapt the decision threshold based on the DEFACTO data set actually fails, shifting it to 0.39958 and resulting in an EER of 10%. As a consequence, the APCER and BPCER values are imbalanced in the test, leading to HTER values of approximately 9.5% in SC1 and 10.5% in SC2 (see Table 6). However, if there is no adaptation of the decision threshold, the sub-optimal (i.e., offset) thresholds of 0.594687, 0.600357, and 0.566983 are close to the default one and the APCER and BPCER values are well balanced in SC1, leading to HTER values of 1.91%, 1.81%, and 2.77% for combined, complete, and splicing morphs, respectively (see Table 5). In contrast, the sub-optimal thresholds in SC2 would be 0.499938, 0.507648, and 0.492199 for combined, complete, and splicing morphs, respectively, which are far away from the default value of 0.59072. Hence, in the test within SC2, the APCER and BPCER values are imbalanced, leading to HTER values of 7.14%, 5.77%, and 8.50% for combined, complete, and splicing morphs, respectively. The same situation can be observed with the detectors D arXivMC, D BIOSIGMC, and D BIOSIGNaive. Considering the results of the different fusion strategies, it can be said that in almost all cases the APCER and BPCER values are imbalanced when training, evaluation, and test datasets lack similarity.
This leads to the conclusion that pre-determining the proper decision thresholds (as well as the fusion weights) in real-life conditions (where the training, evaluation, and in-field data might be dramatically different) is hardly possible.
When considering alternative (less strict) evaluation setups, where training and test data show an artificial similarity due to the fact that they have been drawn from the same parent distribution, we see in section 5.5 significantly lower HTER values not only for the fusion results but in some cases also for the individual detectors.
The results present clear indicators that the similarity between the training and test data is the dominating factor for the error rates achieved. If this similarity is an artificial one (e.g., in an unrealistic setup where training, parameterization, and test data are drawn from the same parent population) instead of a natural one (i.e., the fusion as well as the individual detectors are suitably well trained), the low error rates obtained are meaningless.
The practical consequence of these three issues is that one of the individual detectors (obviously accurate, but far from perfect in its performance) outperforms four different fusion approaches, ranging from simplistic to very sophisticated and applied in different parameterizations, in all the tests performed in section 5.3. It is only surpassed by fusion approaches as soon as either the ensemble of detectors used in the fusion is optimized (as done by removing one disturbing detector in section 5.4) or the similarity between training and test data is increased (as in section 5.5).
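The effect of a disturbing ensemble member can be illustrated with a small, purely synthetic sketch (all detector characteristics below are invented for illustration and do not model the actual detectors): a sum-rule fusion over an ensemble that contains one detector with systematically misleading scores can fall behind the best individual detector, while pruning that detector lets the fusion recover:

```python
import numpy as np

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 1000)  # 1 = morph, 0 = genuine

def det_scores(margin_shift, noise):
    """Hypothetical detector scores centred near the true label; a larger
    margin_shift pushes scores toward the 0.5 decision boundary."""
    return np.clip(labels + margin_shift * (1 - 2 * labels)
                   + rng.normal(0, noise, labels.size), 0.0, 1.0)

good = det_scores(0.25, 0.12)        # the strong individual detector
complementary = det_scores(0.28, 0.15)
disturbing = det_scores(0.55, 0.15)  # systematically on the wrong side of 0.5

def accuracy(s):
    return float(np.mean((s >= 0.5).astype(int) == labels))

fused_all = (good + complementary + disturbing) / 3  # sum-rule, full ensemble
fused_pruned = (good + complementary) / 2            # disturbing detector removed
print(accuracy(good), accuracy(fused_all), accuracy(fused_pruned))
```

With these assumed score distributions, the full-ensemble fusion is dragged below the best individual detector, while the pruned ensemble improves on the full one, mirroring the effect of the ensemble optimization in section 5.4.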

Conclusions
The results presented in the empirical evaluations in this paper demonstrate that fusion can fail even with a set of relevant individual classifiers. This can be seen in both application scenarios ("MAD in document issuing" and "MAD in identity verification") evaluated in this paper. Here, the three reasons for this phenomenon discussed above are (a) low diversity of the detectors, (b) lack of performance in individual detectors, and (c) lack of similarity between the training and test data.
Summarizing the lessons learned from the approach of using fusion for MAD as done in this paper, and drawing some generalizations toward other media forensics classification or decision problems, the following has to be said: The requirements for (media) forensic methods in terms of scientific admissibility (or Daubert compliance) are obviously important! Methods should indeed be published and peer reviewed, their error rates should be precisely known, and standards for the application of the methods should be established. But the threat that Champod and Vuille identify, that ascertaining the error rates of a test "can prove misleading if not all its complexities are understood" [15], plays a very significant role, as demonstrated in the evaluations performed here.
Besides the requirements for individual expert systems to be used in forensic investigations (including their accuracy), additional constraints have to be observed when it comes to information fusion. These are, at least: the diversity of the detectors, which has to be ascertained either by knowledge about the precise means of decision generation and the diversity of those means, or empirically; an independent and thorough benchmarking of detectors, to also establish an idea of the generalization power of the performance claims made by their creators; and considerations on the similarity/correlation between the training data available (during training of the individual classifiers and the training of the fusion methods) and the data to be expected in field application. If very precise assumptions on the application data are possible, weighting might be applicable in fusion. Otherwise, only unweighted fusion strategies like majority voting or the sum-rule should be employed, if any fusion is used in those cases at all.
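The two unweighted fusion strategies mentioned can be sketched as follows (a minimal illustration with made-up decisions and scores, not the implementation used in the experiments):

```python
import numpy as np

def majority_vote(decisions):
    """Unweighted majority vote over binary decisions (n_detectors, n_samples)."""
    d = np.asarray(decisions)
    return (2 * d.sum(axis=0) > d.shape[0]).astype(int)

def sum_rule(score_mat, thr=0.5):
    """Unweighted sum-rule: average the detector scores, then threshold."""
    return (np.mean(np.asarray(score_mat), axis=0) >= thr).astype(int)

votes = [[1, 0, 1, 0],
         [1, 1, 0, 0],
         [1, 0, 0, 1]]
score_mat = [[0.9, 0.4, 0.6, 0.2],
             [0.8, 0.6, 0.3, 0.1],
             [0.7, 0.2, 0.2, 0.6]]

print(majority_vote(votes))   # [1 0 0 0]
print(sum_rule(score_mat))    # [1 0 0 0]
```

Neither strategy requires per-detector weights, which is exactly why they remain applicable when no reliable assumptions about the in-field data are possible.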
The diversity issue becomes very problematic if features (as the means to represent a decision problem in a feature space) are not hand-crafted by experts but learned, e.g., by DCNNs. In this paper, the diversity problem of the detectors used here as "black boxes" has been established in direct contact with the developers of those methods, which is hardly an option in most field applications.
Also, the recent trend to generate synthetic data sets for the training of pattern recognition methods (either traditional or neural network based) introduces another degree of freedom into the characteristics of datasets. In publications such as [54], this approach is used to avoid tedious data collection tasks while creating sufficiently sized data sets for modern-day data-greedy classifiers. The problem here is the influence of the synthesis process on its output (i.e., the synthesis-specific artifacts), which will become part of the model trained by each classifier. It is related to the question of source characteristics imposing themselves onto trained models, but carries a different degree of relevance for forensic application scenarios.

Acknowledgements
The authors wish to thank Tom Neubert for providing the keypoints detector used. The experiments in sections 5.4 and 5.5 were inspired by the reviewers in the first round of reviews for this journal paper. We wish to express our gratitude to those experts unknown to us, because we feel that their recommendations significantly helped to improve this paper.
Authors' contributions CK worked on the media forensic perspectives and fusion theory parts as well as the interpretation of results. AM worked on the biometric perspective, the dataset creation, the conduction of the experiments (classifier selection and fusion operator implementation), and the interpretation of the experimental results. JD did the initial structuring of the work and defined its focus and scope (incl. suggesting the two application scenarios as well as the usage of DST- and LR-based fusion). MH did the theoretical and practical work on likelihood-based fusion and the interpretation of its results. The authors read and approved the final manuscript.

Funding
The work in this paper has been funded in part by the German Federal Ministry of Education and Research (BMBF) through the research program under the contract no. FKZ: 16KIS0509K (research project ANANAS). The work in this paper has been funded in part by the Deutsche Forschungsgemeinschaft (DFG) under contract no. 421860277 (research project GENSYNTH). Open Access funding enabled and organized by Projekt DEAL.

Availability of data and materials
The empirical work in this paper is based on the following publicly available datasets:
- The AMSL Face Morph Image Data Set (made available via: https://omen.cs.uni-magdeburg.de/disclaimer/index.php; last accessed Sept. 10, 2020)
- The Utrecht/ECVP set, as part of the Psychological Image Collection at Stirling (PICS) (available at: http://pics.stir.ac.uk/2D_face_sets.htm; last accessed Sept. 10, 2020)
- The London DB, made available by L. DeBruine and B. Jones as: Face Research Lab London Set: https://figshare.com/articles/dataset/Face_Research_Lab_London_Set/5047666 (last accessed Sept. 10, 2020)
- The DEFACTO dataset (including the face morphing subset used in this paper), introduced in [51] and available at: https://defactodataset.github.io/ (last accessed Sept. 10, 2020)
- The Alabama database used, which is the collection of mugshots of the Alabama News Network (available at: https://www.alabamanews.net/mugshots/; last accessed Sept. 10, 2020)

Declarations