In this section, a robust video steganography method against video transcoding is proposed for hidden communication over social media. A PCA-based strategy is first presented to select robust embedding regions. Then, side information compression is conducted to reduce the transmission bandwidth cost. In addition, video preprocessing is used to improve the applicability of our method over various social media channels, and channel coding is introduced to enhance the robustness against video transcoding. Finally, the dual-channel joint embedding and extraction processes based on the Y and U components are described.
The overall steganographic framework is shown in Fig. 3. First, the cover video is decoded to generate the sequence of YUV components. Then, secret messages are embedded into the Y component and side information into the U component. Finally, the modulated YUV sequence is encoded to generate the stego video. It is worth mentioning that encoding the secret messages with BCH codes is optional: the error correction bits reduce the bit error rate but decrease the available message capacity. In the experiments, BCH codes are used to analyze their error correction capability.
PCA selection of embedding regions
Video transcoding introduces different levels of noise to different regions of a video: noise levels vary from frame to frame, and even across regions within the same frame. Thus, the most direct way to improve robustness is to select the regions least affected by video transcoding. In [22], two-thirds of the frames in a video are selected as robust frames to carry secret messages. However, some regions remain sensitive to video transcoding even within a selected robust frame, so the BER of that method is still high. It is therefore necessary to score the pixel blocks of a given video and select the blocks that are robust against video transcoding.
Principal component analysis (PCA) is the process of computing the principal components of data. Often, only the first few principal components are used and the remaining ones are ignored. The proportion of the first principal component reflects the distribution characteristics of the data. In video transcoding, the number of bits used to code the prediction residuals of macroblocks changes: the low-frequency DCT coefficients are retained, while most high-frequency DCT coefficients become 0. Consequently, video transcoding strongly affects macroblocks whose prediction residuals contain a large proportion of high-frequency DCT coefficients, whereas blocks with a large proportion of low-frequency DCT coefficients are the preferred option for message embedding. Moreover, the larger the proportion of low-frequency DCT coefficients, the larger the proportion of the first principal component. Thus, the proportion of the first principal component is a reasonable and straightforward criterion for scoring pixel blocks.
In this paper, the proportion of the first principal component in the DWT domain is calculated as the assessment criterion to determine robust embedding regions. Video frames are divided into non-overlapping n×n blocks. PCA is performed on each block in a frame, and the proportion of the first principal component is calculated. A threshold T is carefully set for the selection of robust blocks: the blocks whose proportion of the first principal component is greater than T are selected as robust blocks and marked “0,” while the remaining blocks are labeled “1.”
For a given block X, DWT is performed to obtain the LL sub-band. Next, PCA is conducted by:
$$ \begin{aligned} \mathbf{C} &= \frac{2}{n} \mathbf{X}_{\text{LL}} \mathbf{X}^{\mathrm{T}}_{\text{LL}} = \frac{2}{n} \left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\mathrm{T}}\right) \left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\mathrm{T}}\right)^{\mathrm{T}}\\ &= \frac{2}{n} \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\mathrm{T}} \mathbf{V} \mathbf{\Sigma}^{\mathrm{T}} \mathbf{U}^{\mathrm{T}} = \frac{2}{n} \mathbf{U}\mathbf{\Sigma}^{2} \mathbf{U}^{\mathrm{T}} \end{aligned} $$
(7)
where XLL is the LL sub-band of X, n is the size of X, and C is the covariance matrix of XLL. SVD is performed as XLL=UΣVT where UUT=I, VVT=I. The proportion p1 of the first principal component is calculated by:
$$ p_{1} = \frac{\sigma_{1}^{2}}{\sum_{i=1}^{r} \sigma_{i}^{2}} $$
(8)
where the diagonal matrix Σ=diag(σ1,σ2,...,σr,0,...,0) contains the singular values, and r represents the number of nonzero singular values.
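The scoring rule of Eqs. (7)–(8) can be sketched as follows. This is only an illustration: a single-level Haar transform stands in for the paper's DWT (the wavelet basis is not fixed here), and the block size and threshold values are assumptions.

```python
import numpy as np

def first_pc_proportion(block):
    """Score a block by the proportion p1 of its first principal
    component (Eq. 8), computed on a single-level Haar LL sub-band
    (a minimal stand-in for the paper's DWT step)."""
    # LL sub-band: scaled average of each 2x2 neighbourhood.
    ll = 0.5 * (block[0::2, 0::2] + block[0::2, 1::2]
                + block[1::2, 0::2] + block[1::2, 1::2])
    # The squared singular values of X_LL are proportional to the
    # eigenvalues of the covariance matrix C in Eq. 7.
    s = np.linalg.svd(ll, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

def label_blocks(frame, n=8, T=0.9):
    """Mark each n x n block '0' (robust, p1 > T) or '1' (sensitive)."""
    labels = []
    for i in range(0, frame.shape[0] - n + 1, n):
        for j in range(0, frame.shape[1] - n + 1, n):
            p1 = first_pc_proportion(frame[i:i + n, j:j + n].astype(float))
            labels.append('0' if p1 > T else '1')
    return labels
```

A smooth block concentrates its energy in the first component (p1 close to 1) and is labeled robust; a block whose LL sub-band varies irregularly scores lower and is skipped.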
A simple test is conducted to verify the effectiveness of PCA. A randomly selected video with 180 frames is transcoded with a constant rate factor (CRF) value of 26. The average mean square error (MSE) of regions within each frame is shown in Fig. 4. The average MSE of the PCA-selected regions is 56% of that of all regions and 11% of that of the non-selected regions. This result illustrates that PCA is effective in selecting the regions least affected by video transcoding.
Side information compression
The side information, formed by all labels of the pixel blocks in a video, must be reliably transmitted over lossy channels to synchronize the embedding and extraction regions. However, it is too long to be reliably transmitted without compression. Moreover, hidden communication would be pointless if the side information for region synchronization were longer than the communication message itself.
To reduce the transmission bandwidth cost, side information compression is designed by imitating JPEG compression, and channel coding is then used to achieve reliable transmission over lossy channels. In JPEG compression, quantization, run-length encoding (RLE), and Huffman coding are combined to compress the image data effectively. By analogy, side information compression is conducted as follows:
(1) The side information is converted into one-dimensional (1D) data.

(2) The data sequence is quantized with a carefully chosen quantization step size, and RLE stores it as pairs of data values and run counts.

(3) Huffman coding encodes the sequence of run counts.
An example of side information compression is illustrated in Fig. 5. The final compressed side information is formed by splicing the first element of the data-value sequence with the encoded data-count sequence.
Data transformation
Since digital video is 3D data, the side information generated by PCA selection is also 3D data. Thus, data transformation aims at converting the side information to 1D data for subsequent processing. Firstly, the side information is scanned from left to right and top to bottom, forming a 2D matrix. Each row of the 2D matrix represents the side information of the corresponding frame. Then, the 2D matrix is scanned by column and converted to a 1D sequence.
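The two-step transformation can be illustrated with a toy 3D array; the sizes and label values below are made up for illustration only.

```python
import numpy as np

# Hypothetical side information: 2 frames, each with a 2x2 grid of
# block labels ('0' = robust, '1' = sensitive), stored as a 3D array.
si_3d = np.array([[[0, 1],
                   [0, 0]],      # frame 0
                  [[1, 0],
                   [0, 1]]])     # frame 1

# Step 1: raster-scan each frame (left to right, top to bottom) so
# each row of the 2D matrix holds the side information of one frame.
si_2d = si_3d.reshape(si_3d.shape[0], -1)

# Step 2: scan the 2D matrix column by column to obtain the 1D sequence.
si_1d = si_2d.flatten(order='F')
```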
Quantization
Quantization is a lossy process. First, the 1D sequence SI is divided into several segments. If the flag “1” exists in a given segment, all flags in the segment are changed to “1.” Assuming that the length of each segment is Δ2, the quantization process is expressed by:
$$ f_{i}=\left\{\begin{array}{ll} \text{“1”}, & \exists f_{j} = \text{“1”},\ \lceil \frac{i}{\Delta_{2}} \rceil - 1 < \frac{j}{\Delta_{2}} \leq \lceil \frac{i}{\Delta_{2}} \rceil \\ \text{“0”}, & \text{otherwise} \end{array}\right. $$
(9)
where fi,fj∈SI, and i and j represent the positions of fi and fj in SI, respectively. After quantization, the original data sequence cannot be recovered without auxiliary information. Thus, the quantized side information is used to determine the robust regions.
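Equation (9) amounts to a segment-wise OR. A minimal sketch, assuming the sequence is represented as a Python string of flag characters:

```python
def quantize_si(si, delta2):
    """Eq. 9: within each segment of length delta2, if any flag is
    '1', set every flag in that segment to '1'."""
    out = ''
    for k in range(0, len(si), delta2):
        seg = si[k:k + delta2]
        out += ('1' if '1' in seg else '0') * len(seg)
    return out
```

For example, with Δ2 = 4 the sequence "00000100" becomes "00001111": the single "1" in the second segment spreads to the whole segment, which is what makes the result losslessly run-length compressible in large runs.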
Run-length encoding
RLE is a form of lossless data compression in which runs of data are stored as a single data value and count rather than as the original run. Since SI contains only “0” and “1,” RLE compresses the side information efficiently. In the RLE output, the data-value sequence alternates between “0” and “1,” and all values in the data-count sequence are integer multiples of Δ2.
Huffman coding
A Huffman code is an optimal prefix code. By constructing the Huffman coding table, the sequence of data count generated by RLE can be further compressed.
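A generic Huffman construction is sketched below; note the paper uses a preset coding table shared between sender and receiver, so this is only an illustration of how such a table can be built from the count statistics.

```python
import heapq
from collections import Counter

def huffman_table(symbols):
    """Build a prefix-free Huffman code table for the RLE count
    sequence (generic construction for illustration)."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate one-symbol case
        return {next(iter(freq)): '0'}
    # Heap entries: [weight, tiebreak id, [symbol, code], ...]
    heap = [[w, i, [s, '']] for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = '0' + pair[1]          # left branch prepends 0
        for pair in hi[2:]:
            pair[1] = '1' + pair[1]          # right branch prepends 1
        heapq.heappush(heap, [lo[0] + hi[0], next_id] + lo[2:] + hi[2:])
        next_id += 1
    return dict(tuple(p) for p in heapq.heappop(heap)[2:])
```

Frequent counts (small multiples of Δ2) receive short codewords, which is where the compression gain over raw RLE comes from.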
Side information compression is thus conducted through data transformation, quantization, RLE, and Huffman coding. Figure 6 shows the compression rate of the side information in ten randomly selected videos. The average compression rate is 0.041, which demonstrates the effectiveness of side information compression.
A decoder is built at the receiver of the hidden communication by inverting the side information compression. First, the data-count sequence is decoded using the preset Huffman coding table. Then, the first element of the compressed side information is used to generate the data-value sequence, whose length equals that of the data-count sequence and which alternates between “0” and “1.” Third, the data-value and data-count sequences are fed to the RLE decoder to recover the quantized side information. Finally, the robust regions are determined from the quantized side information for subsequent message extraction.
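The receiver-side decoder can be sketched as follows; the coding table passed in is assumed to be the preset Huffman table shared with the sender.

```python
def decode_side_info(first_value, bitstream, table):
    """Invert the compression: Huffman-decode the run counts, rebuild
    the alternating data-value sequence from its first element, and
    expand the runs back into the quantized side information."""
    inverse = {code: count for count, code in table.items()}
    counts, buf = [], ''
    for bit in bitstream:            # prefix-free => greedy decode works
        buf += bit
        if buf in inverse:
            counts.append(inverse[buf])
            buf = ''
    si, value = '', first_value
    for c in counts:                 # run values alternate '0'/'1'
        si += value * c
        value = '1' if value == '0' else '0'
    return si
```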
Video preprocessing
In the research on robust steganography, the preprocessing strategy is widely used. Zhao et al. [4] proposed transport channel matching (TCM) to enhance the algorithm’s robustness on lossy channels. Their experimental result showed that the image differences caused by JPEG compression would gradually decrease when an image was compressed multiple times with the same quality factor.
Inspired by their work, we construct a lossy channel with fixed transcoding parameters and perform multiple recompressions to simulate the TCM operation. As can be observed from Fig. 7, as the number of video compressions increases, the MSE gradually decreases and the structural similarity (SSIM) gradually increases; each MSE or SSIM value is calculated between the videos before and after a compression. This result indicates that a video compressed on a specific lossy channel is more robust against video transcoding than the raw input. In practice, different lossy channels have different video transcoding mechanisms that are invisible to users. Video preprocessing is therefore introduced to improve the applicability of our method over various channels. First, the original video is uploaded to a given lossy channel as the first input. Then, its transcoded version is obtained from this channel as the next input. This operation is repeated several times, and the final transcoded video is downloaded as the cover for embedding secret messages. In this way, the robustness and applicability of our method are improved.
However, multiple recompressions on a specific social media channel are unsafe and damage the visual quality. As shown in Fig. 8, the peak signal-to-noise ratio (PSNR) and SSIM gradually decrease, and the visual quality is greatly degraded by multiple recompressions. Thus, we use a local transcoder to perform video preprocessing and set the number of compressions to 3, balancing visual quality, robustness, and security.
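The intuition behind TCM-style preprocessing can be illustrated with a toy model in which a uniform pixel quantizer stands in for the fixed-parameter transcoder; this is an assumption for illustration, not the real H.264 pipeline. The first pass changes the frame, while subsequent passes with the same parameters change almost nothing, mirroring the decreasing MSE in Fig. 7.

```python
import numpy as np

def channel(frame, step=16.0):
    """Toy stand-in for a fixed-parameter transcoder: uniform
    quantization of pixel values (not a real codec)."""
    return np.round(frame / step) * step

def preprocess(frame, rounds=3, step=16.0):
    """Pass the cover through the simulated channel several times, as
    in our TCM-style video preprocessing (3 rounds in the paper)."""
    for _ in range(rounds):
        frame = channel(frame, step)
    return frame
```

Because the quantizer is idempotent, the preprocessed cover is a (near) fixed point of the channel, so uploading it causes far less distortion to the embedded message.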
Joint embedding of Y and U channels
In video encoding, the Y and U components are encoded separately, so message embedding in the Y component does not affect message embedding in the U component. The two components can therefore be used jointly to carry secret messages.
In Section 3.1, a PCA-based selection strategy is proposed to enhance the robustness against video transcoding. The regions with a high proportion of the first principal component are selected to embed secret messages, and side information is generated to label the PCA-selected regions. Even though side information compression is carefully designed, a reliable channel is still needed to transmit the compressed side information. Since the side information is long and bit errors occur easily over lossy channels, building a separate channel to transmit it would be both costly and insecure.
In our proposed scheme, combining the Y component with the U component solves the synchronization of embedding and extraction regions. Two steganographic channels based on the Y and U components are built to transmit both side information and secret messages in a single video. At the sender, the embedding regions selected by PCA determine the content of the side information. At the receiver, the side information determines the extraction regions where secret messages can be correctly extracted.
Joint embedding of the Y and U components is shown in Fig. 9. The specific embedding process is described as follows.
(1) Perform video preprocessing based on multiple recompressions. The original video C0 is compressed to adapt to the target channel, and the compressed video is used as the cover C1 to reduce the impact of video transcoding in the target channel.

(2) Decode C1 to obtain the Y and U components. Divide the Y component into non-overlapping n×n blocks and the U component into n/2×n/2 blocks.

(3) Extract the elements based on DWT and SVD to construct the embedding domains EDY and EDU.

(4) Perform principal component analysis and calculate the proportion of the first principal component of each block in the Y component. Then, generate the side information SI according to the predefined threshold T.

(5) Compress the side information to obtain the compressed side information SI′.

(6) According to SI′, select robust elements in EDY to embed the scrambled secret message based on QIM.

(7) Calculate the proportion of the first principal component of each frame in the U component. Select the k frames with the largest proportions as robust frames and generate the protocol that labels them. The length of the protocol equals the number of video frames.

(8) Splice the protocol and the compressed side information together. Then, encode the spliced information with BCH codes to eliminate bit errors in the lossy transmission.

(9) Embed the encoded information into the U components of the selected k frames based on QIM.

(10) Reconstruct the modulated Y and U components using inverse SVD and inverse DWT. Encode the modulated Y and U components together with the V component, setting the CRF value to 0, to generate the stego video S.
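The QIM modulation used in steps (6) and (9) can be sketched on a scalar carrier; here the carrier would be a DWT-SVD element such as the first singular value of a block, and the step size delta is a hypothetical parameter, not a value fixed by the paper.

```python
import numpy as np

def qim_embed(value, bit, delta=20.0):
    """Embed one bit by quantizing the carrier onto one of two
    interleaved lattices: even multiples of delta for bit 0,
    odd multiples for bit 1."""
    q = np.round((value - bit * delta) / (2 * delta))
    return 2 * delta * q + bit * delta

def qim_extract(value, delta=20.0):
    """Recover the bit as the parity of the nearest lattice index."""
    return int(np.round(value / delta)) % 2
```

Extraction stays correct as long as transcoding perturbs the carrier by less than delta/2, which is why QIM is paired with the PCA-selected robust regions.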
Message extraction is the inverse process of message embedding. The specific extraction process is explained as follows.

(1) Obtain the transcoded video S′ from the target channel. Decode S′ to obtain the Y and U components.

(2) Divide the Y component into non-overlapping n×n blocks and the U component into non-overlapping n/2×n/2 blocks.

(3) Extract the elements based on DWT and SVD to construct the embedding domains EDY and EDU.

(4) Extract the fixed-length encoded information from EDU and decode it with BCH codes to recover the protocol. Then, determine the selected k frames.

(5) Extract the encoded side information from the k frames in EDU and decode it to recover the compressed side information SI′.

(6) According to SI′, determine the extraction regions and select robust elements in EDY.

(7) Extract the scrambled secret message from the selected elements based on QIM, then descramble it to recover the secret message.
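The scrambling and descrambling in the steps above can be realized as a key-driven permutation of the message bits; the paper does not specify its scrambling method, so the construction below (and the key value) is a hypothetical stand-in.

```python
import numpy as np

def scramble(bits, key=2023):
    """Permute the message bits with a key-seeded permutation
    (hypothetical scrambling scheme, shared key assumed)."""
    perm = np.random.default_rng(key).permutation(len(bits))
    return [bits[p] for p in perm]

def descramble(bits, key=2023):
    """Invert the permutation using the same shared key."""
    perm = np.random.default_rng(key).permutation(len(bits))
    out = [0] * len(bits)
    for i, p in enumerate(perm):
        out[p] = bits[i]
    return out
```

Scrambling spreads any burst errors introduced by transcoding uniformly over the message, which improves the effectiveness of the BCH error correction.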