James D. Gordy

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
Supervisor: Dr. L. T. Bruton, Department of Electrical and Computer Engineering
Dr. A. Eberlein, Department of Electrical and Computer Engineering
Dr. H. Leung, Department of Electrical and Computer Engineering
Dr. M. Collins, Department of Geomatics Engineering
Date
Abstract
Digital watermarking is the process of embedding sideband data within the samples of a digital audio, image, or video signal. The watermark must be imperceptible to the intended audience of the host signal, and must withstand distortion from common signal processing operations. In this thesis, implementations of, and improvements to, digital audio, image, and video watermarking algorithms are described. In addition, a novel performance evaluation framework is introduced and used to compare the algorithms in terms of bit rate, perceptual quality, computational complexity, and robustness to signal processing. Watermarks embedded in a transform-domain representation of the host signal withstand signal processing operations better than time- or spatial-domain approaches. In addition, incorporating perceptual models of human hearing and vision improves both the imperceptibility of the watermark data and its resilience to signal processing operations. However, the cost of transform-domain embedding and perceptual analysis is an increase in computational complexity.
Acknowledgements
First of all, I would like to express my sincere thanks to Dr. Bruton for his supervision of my research, and for the advice, encouragement, and support that have kept me focused on my work. I can honestly say that his enthusiasm and interest have made my time at the University of Calgary a more enjoyable and rewarding experience. I would also like to gratefully acknowledge the generous financial support of the Natural Sciences and Engineering Research Council (NSERC), the Department of Electrical and Computer Engineering, and Dr. Bruton. My research and this thesis would not have been possible without their assistance. Finally, I wish to thank Norm Bartley for helping to keep the lab running smoothly, for his encouraging conversations, and for his helpful suggestions. My fellow students in the department, particularly Chad Dreveny, Remi Gurski, and Mark Chakravorti, receive many thanks for their friendship, helpful suggestions, and lively lunchtime discussions.
Contents
Abstract
Acknowledgements
Dedication
Contents
List of Tables
List of Figures
List of Symbols

Chapter 1 Introduction
1.1 Digital Media and Copyright Protection
1.2 Requirements Analysis
1.2.1 Imperceptibility
1.2.2 Robustness to Signal Processing
1.2.3 Private vs. Public Watermarks
1.3 Watermark Embedding and Extraction Systems
1.3.1 Perceptual Analysis
1.3.2 Key Generation
1.3.3 Encoding and Decoding
1.3.4 Watermark Insertion and Extraction
1.3.4.1 The Discrete Fourier Transform (DFT)
1.3.4.2 The Discrete Cosine Transform (DCT)
1.3.4.3 The Discrete Wavelet Transform (DWT)
1.4 A Framework for Performance Evaluation
1.4.1 Bit Rate
1.4.2 Perceptual Quality
1.4.3 Computational Complexity
1.4.4 Robustness to Signal Processing
1.5 Scope and Outline of Thesis
2.1 The Human Audio System (HAS)
2.1.1 Frequency Sensitivity
2.1.2 Frequency Masking
2.1.3 Other Psychoacoustic Concepts
2.1.4 The MPEG Layer I Psychoacoustic Model
2.2 The Human Visual System (HVS)
2.2.1 Frequency Sensitivity
2.2.2 Frequency Masking
2.2.3 Spatial and Luminance Masking
2.2.4 Colour Sensitivity
2.2.5 Temporal Masking
2.2.6 Human Vision Models
2.2.6.1 Spatial Domain Models
2.2.6.2 Frequency Domain Models
2.3 Summary
3.2 Echo Coding
3.2.1 Encoder Structure
3.2.2 Decoder Structure
3.2.3 Implementation and Proposed Improvements
3.2.3.1 Selection of the Echo Amplitude and no
3.2.3.2 Discussion
3.3 Phase Coding
3.3.1 Encoder Structure
3.3.2 Decoder Structure
3.3.3 Implementation Details
3.4 Spread Spectrum Coding
3.4.1 Encoder Structures
3.4.1.1 Direct Sequence Spread Spectrum
3.4.1.2 Frequency Hopped Spread Spectrum
3.4.2 Decoder Structures
3.4.3 Probability of Bit Error
3.4.4 Implementation and Proposed Improvements
3.4.4.1 Selection of the Watermark Magnitude
3.4.4.2 Prefiltering to Improve Decoding Reliability
3.4.4.3 Discussion
3.5 Frequency Masking
3.5.1 Encoder Structure
3.5.2 Decoder Structure
3.5.3 Probability of Bit Error
3.5.4 Implementation and Proposed Improvements
3.5.4.1 Construction of Filter Coefficients
3.5.4.2 Selection of the Watermark Magnitude
3.5.4.3 Prefiltering to Improve Decoding Reliability
3.5.4.4 Discussion
3.6 Performance Evaluation
3.6.1 Effect of Block Size
3.6.2 Perceptual Quality
3.6.3 Computational Complexity
3.6.4 Robustness to Signal Processing
3.6.4.1 Linear and Nonlinear Filtering
3.6.4.2 Additive and Coloured Noise
3.6.4.3 Linear and Nonlinear Quantization
3.6.4.4 Lossy Compression
3.7 Summary
4.1 Conventions
4.2 Spread Spectrum Techniques
4.2.1 Encoder Structures
4.2.1.1 Direct Sequence Spread Spectrum
4.2.1.2 Frequency Hopped Spread Spectrum
4.2.2 Decoder Structures
4.2.3 Probability of Bit Error
4.2.4 Implementation and Proposed Improvements
4.2.4.1 Selection of the Watermark Magnitude and S
4.2.4.2 Spatial Domain Masking Analysis: DSSS-SM
4.2.4.3 Frequency Domain Masking Analysis: FHSS-FMW and FHSS-FMT
4.2.4.4 Prefiltering to Improve Decoding Reliability
4.2.4.5 Discussion
4.3 Multiresolution Embedding
4.3.1 Encoder Structure
4.3.2 Decoder Structure
4.3.3 Discussion
4.4 Performance Evaluation
4.4.1 Effect of Block Size
4.4.2 Perceptual Quality
4.4.3 Computational Complexity
4.4.4 Robustness to Signal Processing
4.4.4.1 Mean and Lowpass Filtering
4.4.4.2 Highpass Filtering
4.4.4.3 High-emphasis Filtering
4.4.4.4 Wiener Filtering
4.4.4.5 Median Filtering
4.4.4.6 Additive and Coloured Noise
4.4.4.7 Quantization
4.4.4.8 Histogram Equalization
4.4.4.9 Lossy Compression
4.5 Summary
5.1 Conventions
5.2 Frame-By-Frame Watermarking
5.2.1 Direct Sequence Spread Spectrum (DSSS)
5.2.2 Spatial Masking Analysis: DSSS-SM
5.2.3 Frequency Hopped Spread Spectrum (FHSS)
5.2.4 Frequency Domain Masking Analysis: FHSS-FMW and FHSS-FMT
5.2.5 Multiresolution Embedding
5.2.6 Discussion
5.3 Temporal Multiresolution Watermarking
5.3.1 Encoder and Decoder Structures
5.3.2 Selection of Wavelet Basis Functions
5.3.3 Selection of Quantization Levels
5.3.4 Discussion
5.4 Performance Evaluation
5.4.1 Effect of Block Size
5.4.2 Perceptual Quality
5.4.3 Computational Complexity
5.4.4 Robustness to Signal Processing
5.4.4.1 Frame Averaging
5.4.4.2 Frame Reordering
5.4.4.3 Frame Downsampling
5.4.4.4 Lossy Compression
5.5 Summary
Chapter 6 Conclusions
6.1 Summary of Results
6.1.1 Tradeoffs in Watermarking Systems
6.2 Opportunities for Further Research
6.2.1 Further Investigations and Improvements
6.2.2 Watermark Invertibility
6.2.3 Applications of Digital Watermarking
6.2.4 Information Theory and Digital Watermarking
6.2.5 Current Standardization Efforts
List of Tables
2.1 Minimum quantization matrix QMIN(k1, k2) constructed by measuring sensitivity to 2D-DCT basis functions.
3.1 SNR of watermarked audio signals versus original host signals (in decibels).
3.2 Audio watermarking algorithm CPU timings (in seconds).
4.1 Wavelet quantization levels for a 512 × 512 image at the standard viewing distance.
4.2 PSNR of watermarked images versus original images (in decibels).
4.3 Image watermarking algorithm timings (in seconds).
4.4 Bit error rate due to histogram equalization (in percent).
5.1 PSNR of watermarked video signals versus original sequences (in decibels).
5.2 Video watermarking algorithm CPU timings (in seconds).
List of Figures
1.1 Block diagram of a typical watermark embedding system. Dashed lines indicate optional blocks.
1.2 Block diagram of a typical watermark extraction system. Dashed lines indicate optional blocks.
1.3 Example of a subband filter bank and the lowpass and highpass decomposition filters.
2.1 Subset of the 32 overlapping filters modelling the bandpass channels within the Human Audio System.
2.2 Plot of TA(f), the absolute detection threshold of the Human Audio System.
2.3 Logarithmic mapping from frequencies to the Bark scale.
2.4 Raised detection threshold for a 15 dB masking signal at 5 kHz.
2.5 Power spectrum and corresponding absolute and raised detection threshold functions, TA(f) and TM(f), for a sample audio sequence.
2.6 Passband filter responses of the two-dimensional Cortex filters used to represent the set of visual channels.
2.7 Observed frequencies are dependent upon the image width and the viewing distance, standardized to six times the image width.
2.8 Plot of C(f), the visual contrast detection threshold function.
2.9 Weighting function used to determine the raised contrast detection threshold in the presence of a masking signal [23].
2.10 Raised detection thresholds of zero-mean additive white noise in the presence of (a) luminance masking and (b) spatial masking.
2.11 The optical point spread function.
2.12 Example of perceptual analysis using Girod's model of the Human Visual System.
2.13 Effect of a strong 2D-DCT coefficient on adjacent coefficients within the minimum quantization matrix.
3.1 Magnitude and phase responses of an echo filter with an echo amplitude of 0.1 delayed by no = 5 samples.
3.2 Structure of the echo coding algorithm's encoder.
3.3 Transition bands employed to minimize phase difference between blocks containing different bits.
3.4 Bit error rate of the echo coding algorithm as the echo amplitude varies, for different echo delays (N).
3.5 Example of applying an echo filter kernel to an audio signal, and detection of the echo filter delay using the cepstrum.
3.6 Structure of the phase coding algorithm's encoder.
3.7 Magnitude spectrum of a PN sequence, |P(e^jω)|.
3.8 Block diagram of the DSSS encoder.
3.9 Block diagram of the FHSS encoder.
3.10 Error rate as a function of SNR for the spread spectrum algorithms.
3.11 Highpass filter used to prefilter host signals watermarked with the DSSS algorithm.
3.12 Block diagram of the spread spectrum decoder with prefiltering prior to decoding.
3.13 Comparison of DSSS decoding using highpass prefiltering and AR modeling.
3.14 Comparison of FHSS decoding using highpass prefiltering and AR modeling.
3.15 Block diagram of the frequency masking encoder.
3.16 Bit error rate as a function of block size for audio watermarking algorithms.
3.17 Bit error rate after filtering for audio watermarking algorithms.
3.18 Bit error rate in the presence of additive and coloured noise for audio watermarking algorithms.
3.19 Linear and nonlinear quantization functions for K = 5 bits per sample.
3.20 Bit error rate after quantization using linear and two nonlinear bit allocation functions.
3.21 Bit error rate due to lossy compression as a function of bit rate.
4.1 Example of a 512 × 512 image divided into 16 × 16 blocks in the spatial domain. Each block will be used to embed one bit of data.
4.2 Two-dimensional highpass filter used to prefilter host images watermarked with the DSSS and FHSS algorithms.
4.3 Decomposition filters used to compute the 2D-DWT.
4.4 N × N composite images made from the multiresolution decomposition subimages and quantization levels.
4.5 Example of a four-level wavelet decomposition of a 512 × 512 pixel version of LENNA.
4.6 Sample images used in the performance evaluation of image watermarking algorithms.
4.7 Bit error rate versus block size for the six watermarking algorithms compared.
4.8 LENNA image watermarked with the DSSS and DSSS-SM algorithms.
4.9 LENNA image watermarked using the FHSS, FHSS-FMW, and FHSS-FMT algorithms.
4.10 LENNA image watermarked using the multiresolution algorithm.
4.11 Bit error rate from mean filtering for image watermarking algorithms.
4.12 Bit error rate from lowpass filtering for image watermarking algorithms.
4.13 Bit error rate from highpass filtering for image watermarking algorithms.
4.14 Bit error rate from high-emphasis filtering for image watermarking algorithms.
4.15 Bit error rate from Wiener filtering for image watermarking algorithms.
4.16 Bit error rate from median filtering for image watermarking algorithms.
4.17 Bit error rate due to additive white Gaussian noise.
4.18 Bit error rate due to coloured Gaussian noise.
4.19 Bit error rate due to linear quantization.
4.20 Bit error rate due to JPEG compression, as a function of compression quality.
5.1 Example of an image sequence divided into blocks in the spatial domain, as well as blocks temporally. Each three-dimensional block will be used to embed one bit of data.
5.2 Example of computing the temporal DWT on a video signal four frames in length.
5.3 Sample sequences used in the performance evaluation of video watermarking algorithms.
5.4 Bit error rate versus block size for video watermarking algorithms.
5.5 Bit error rate versus frame averaging for video watermarking algorithms.
5.6 Bit error rate versus frame reordering for video watermarking algorithms.
5.7 Bit error rate versus frame downsampling for video watermarking algorithms.
5.8 PSNR versus compression ratio for sample video signals.
5.9 Bit error rate due to MPEG compression as a function of bit rate.
List of Symbols
2D-DCT          Two-dimensional Discrete Cosine Transform
2D-DFT          Two-dimensional Discrete Fourier Transform
2D-DWT          Two-dimensional Discrete Wavelet Transform
a(n)            One-dimensional autoregressive (AR) model coefficients
a(n1, n2)       Two-dimensional autoregressive (AR) model coefficients
A(z)            One-dimensional autoregressive (AR) model
A(z1, z2)       Two-dimensional autoregressive (AR) model
c(k)            DCT weighting function
c(n1, n2)       Saturated ganglion image function of the HVS
cINH(n1, n2)    Gain-controlled retinal image function
C               Correlator output value
C(f)            Contrast detection threshold function of the HVS
C(f, fm)        Raised contrast detection threshold function
C(k1, k2)       2D-DCT function
CD              Compact Disc
D(n1, n2)       Localized distortion image
DCT             Discrete Cosine Transform
DFT             Discrete Fourier Transform
DSSS            Direct Sequence Spread Spectrum
DSSS-SM         DSSS with spatial masking analysis
DWT             Discrete Wavelet Transform
DVD             Digital Versatile Disc
E[ ]            Statistical expectation operator
f               Frequency
fo              Normalized frequency
fs              Sampling frequency
f(k1, k2)       2D-DCT coefficient frequency
fo(k1, k2)      Normalized 2D-DCT frequency
FHSS            Frequency Hopped Spread Spectrum
FHSS-FMW        FHSS with Watson's frequency domain masking analysis
FHSS-FMT        FHSS with Tewfik's frequency domain masking analysis
FIR             Finite Impulse Response
h(n)            Impulse response of a linear filter
hHP(n)          Highpass filter impulse response
hINH(n1, n2)    Optical inhibition function of the HVS
hLOCAL(n1, n2)  Localized distortion spread function
hLP(n)          Lowpass filter impulse response
hPSF(n1, n2)    Optical point spread function of the HVS
H(e^jω)         Complex frequency response of h(n)
H(z)            Z-transform of h(n)
HAS             Human Audio System
HDTV            High Definition Television
HVS             Human Visual System
JPEG            Joint Photographic Experts Group
k(f/fm)         Contrast detection threshold weighting function
km              DFT index of masking signal
kSAT            Saturation level of the HVS
l(n1, n2)         Monitor luminance image function
lRETINA(n1, n2)   Retinal image function
M(k)              Magnitude frequency response
MPEG              Moving Picture Experts Group
MSE               Mean Squared Error
p(n)              One-dimensional pseudorandom sequence
p(n1, n2)         Two-dimensional pseudorandom sequence
p(n1, n2, n3)     Three-dimensional pseudorandom sequence
PB                Probability of bit error, or bit error rate
P(k)              Power spectrum
PSNR              Peak Signal to Noise Ratio
QL(k1, k2)        Luminance masking minimum quantization matrix
QMIN(k1, k2)      Minimum 2D-DCT quantization matrix
Q(k1, k2)         Raised 2D-DCT quantization matrix
Q(x)              Complementary error function
R(d, km)          Raised detection threshold function
sMONITOR          Minimum monitor luminance level
S                 Subset of DCT or 2D-DCT coefficients
SNR               Signal to Noise Ratio
TA(f)             Absolute detection threshold function of the HAS
TG(f)             Global masking threshold function
TM(f)             Frequency masking threshold function of the HAS
TR(k, km)         Raised detection threshold function
x(n)              Digital audio signal of length N samples
x(n1, n2)         Digital image of size N1 × N2 pixels
x(n1, n2, n3)     Digital video signal of size N1 × N2 × N3 pixels
x(n)              Watermarked audio signal
x(n1, n2)         Watermarked image
x(n1, n2, n3)     Watermarked video signal
x̂(n)              Real-valued cepstrum of x(n)
x̃(n)              Approximated or corrupted audio signal
x̃(n1, n2)         Approximated or corrupted image
x̃(n1, n2, n3)     Approximated or corrupted video signal
x̃COMP(n1, n2)     Composite image from 2D-DWT decomposition
X(k)              Transform domain audio signal
X(k1, k2)         Transform domain digital image
X(k1, k2, n3)     Frame-by-frame transform domain video signal
X(n1, n2, k3)     Temporal multiresolution video signal
X(k)              Watermarked transform domain audio signal
X(k1, k2)         Watermarked transform domain image
X(k1, k2, n3)     Watermarked frame-by-frame transform video signal
X(n1, n2, k3)     Watermarked temporal multiresolution video signal
v(n)              Additive white Gaussian noise (AWGN) signal
v(n1, n2)         Two-dimensional AWGN signal
vx(n)             Prediction error filter noise function
vx(n1, n2)        Two-dimensional prediction error filter noise function
w(m)              One-dimensional watermark signal of size M bits
w(m1, m2)         Two-dimensional watermark signal of size M1 × M2 bits
w(k1, k2)         2D-DCT visual model weighting function
w(m1, m2, m3)     Three-dimensional watermark signal of size M1 × M2 × M3 bits
w̃(m)              Extracted one-dimensional watermark
w̃(m1, m2)         Extracted two-dimensional watermark
w̃(m1, m2, m3)     Extracted three-dimensional watermark
w(n1, n2)         Linearized visual model parameter
(k1, k2)          Linearized visual model parameter
                  Linearized visual model parameter
                  Time-varying watermark magnitude function (audio)
                  Space-varying watermark magnitude function (images)
                  Time/space-varying watermark magnitude function (video)
                  Impulse function
                  Two-dimensional impulse function
                  Image distortion function
                  Phase frequency response difference
                  Orientation weighting factor
                  Phase frequency response
                  Correlation coefficient for two signals
                  Variance of AWGN process v(n)
                  2D-DCT coefficient angle
Chapter 1 Introduction
1.1 Digital Media and Copyright Protection
A great deal of information is now being created, stored, and distributed in digital form. Newspapers and magazines, for example, have gone online to provide real-time coverage of stories with high-quality audio, still images, and even video sequences. The growth in use of public networks such as the Internet has further fueled the online presence of publishers by providing a quick and inexpensive way to distribute their work. The explosive growth of digital media is not limited to news organizations, however. Commercial music may be purchased and downloaded from the Internet, stock photography vendors digitize and sell photographs in electronic form, and Digital Versatile Disc (DVD) systems provide movies with clear images and CD-quality sound. Unfortunately, media stored in digital form are vulnerable in a number of ways. First, digital media may be simply copied and redistributed, either legally or illegally, at low cost and with no loss of information. In addition, today's fast computers allow digital media to be easily manipulated, so it is possible to incorporate portions of a digital signal into one's own work without regard for copyright restrictions placed upon the work. Encryption is an obvious way to make the distribution of digital media more secure, but often there is no way to protect information once it has been
decrypted into its original form. The ability of pirates to easily copy works is one of the last hurdles that keeps publishers from completely adopting online distribution systems. Legislation has been enacted recently in an effort to stop digital piracy. In the United States, for example, the Digital Millennium Copyright Act (DMCA) was passed in late 1998. The bill specifies and clarifies copyright rules for downloading and viewing copyrighted material from public networks such as the Internet [1]. These rules govern the concept of "fair use" (copying of material for personal or academic purposes) and limit the distribution of copyrighted digital media. The bill also criminalizes the use of technologies for removing copyright notices or defeating copy-protection devices. The Canadian government is also reviewing and amending its copyright laws accordingly. However, given the ease with which digital media can be copied and manipulated, it is necessary to have technologies for tightly coupling copyright information with digital signals. Digital watermarking is seen as a partial solution to the problem of securing copyright ownership. Essentially, watermarking is defined as the process of embedding sideband data directly into the samples of a digital audio, image, or video signal. Sideband data is typically "extra" information that must be transmitted along with a digital signal, such as block headers or time synchronization markers. It is important to realize that a watermark is not transmitted in addition to a digital signal, but rather as an integral part of the signal samples. The value of watermarking comes from the fact that regular sideband data may be lost or modified when the digital signal is converted between formats, whereas the samples of the digital signal are (typically) unchanged. To clarify this concept further, it is useful to consider an analogy between digital watermarks and paper watermarks.
Watermarks have traditionally been used as a form of authentication for legal documents and paper currency. A watermark is embedded within the fibres of paper when it is first constructed, and it is essentially invisible unless held up to a light or viewed at a particular angle. More importantly, a watermark is very difficult to remove without destroying the paper itself, and it is not transferred if the paper is photocopied. The goals of digital watermarking are similar, and it will be shown in the next section that digital watermarks require similar properties. Before the concept of watermarking can be explored further, three important definitions must first be established. A host signal is a raw digital audio, image, or video signal that will be used to contain a watermark. A watermark itself is loosely defined as a set of data, usually in binary form, that will be stored or transmitted through a host signal. The watermark may be as small as a single bit, or as large as the number of samples in the host signal itself. It may be a copyright notice, a secret message, or any other information. Watermarking is the process of embedding the watermark within the host signal. Finally, a key may be necessary to embed a watermark into a host signal, and it may be needed to extract the watermark data afterwards.
1.2.1 Imperceptibility
Most importantly, the watermark signal should be imperceptible to the end user who is listening to or viewing the host signal. This means that the perceived "quality" of the host signal should not be degraded by the presence of the watermark. Ideally, a typical user should not be able to differentiate between watermarked and unwatermarked signals. In [2], the importance of incorporating perceptual modeling techniques into watermarking systems is further discussed. There are two reasons why it is important to ensure that the watermark signal is imperceptible. First, the presence or absence of a watermark should not detract from the primary purpose of the host signal: conveying high-quality audio or visual information. In addition, perceptible distortion may indicate the presence of a watermark, and perhaps its precise location within a host signal. This knowledge may be used by a malicious party to distort, replace, or remove the watermark data.
regardless of intention, will not severely disrupt the quality or value of the watermarked signal. One of the benefits of digital media is that they can be represented at an arbitrarily high level of resolution. For example, consider a piece of commercial music sampled at 44.1 kHz and stored at 16 bits per sample. Downsampling and quantizing the signal to a rate of 8 kHz and 8 bits per sample will greatly reduce the value of the music because its quality will be poor.
Figure 1.1: Block diagram of a typical watermark embedding system. Dashed lines indicate optional blocks.
Figure 1.2: Block diagram of a typical watermark extraction system. Dashed lines indicate optional blocks.
Perceptual analysis, also called perceptual modeling, is based on psychoacoustic and psychovisual models of the human audio system and human visual system, respectively. The response of the human ear to a piece of music, for example, varies with time and with the frequency-domain characteristics of the music. Perceptual models predict which portions of the host signal are not perceivable to the audience and may be manipulated without a loss of perceptible quality. Perceptual models were originally developed as an addition to lossy audio, image, and video compression systems. By identifying portions of the host signal that are imperceptible (or redundant), the system may remove them to increase the coding rate. In watermarking systems, imperceptible portions of the host signal are employed as "channels" in which watermark data may be placed. It will be shown in later chapters that perceptual analysis is often computationally expensive, and so it may not be feasible to perform this step in time-critical applications or where computing power is limited. In these cases it may be possible to determine a maximum and uniform level of distortion that may be applied to a host signal.
This is simply the process of converting watermark data, such as copyright information (text) or some other data, to and from a form that can more easily be used with a watermarking algorithm (usually binary). At this stage, a key may also be used to encode or decode the watermark to and from a more secure form. Many cryptographic algorithms and standards exist, such as the commonly used Data Encryption Standard (DES) and the Rivest-Shamir-Adleman (RSA) techniques. Although encryption will not be considered in this thesis, a comprehensive source of cryptography concepts and algorithms can be found in [5]. Before the watermark is inserted into the host signal, this stage also allows the opportunity to employ error correcting codes (channel coding) to guard against possible errors at the receiver due to signal processing. The application of channel coding to watermarking techniques will not be examined in this thesis, but the implications of such an improvement will be discussed in Chapter 6. A good source of channel coding techniques for error detection and correction can be found in [6].
the host signal, then let α(n) represent the allowable strength of the watermark as a function of time, spatial position, or transform domain coefficient. It will be shown in Chapter 2 that in many cases α(n) represents a per-sample maximum distortion of x(n), indicating that it is a strictly positive value. In an additive approach, one of the most commonly used, the watermarked signal, x̃(n), is obtained by simply adding the weighted watermark to samples of the host signal using the following formula:

x̃(n) = x(n) + α(n) w(n)   (1.1)
where the bipolar nature of w(n) means the host signal is increased or decreased, depending on the watermark bit. In the context of a private watermarking system, the watermark data may be extracted by subtracting the original host signal from the watermarked version. Without access to the original signal, as in a public watermarking system, the presence of the original signal within the watermarked version must be removed or minimized. In a multiplicative embedding approach, the watermark is multiplied by both the host signal and the strength function:

x̃(n) = x(n) [1 + α(n) w(n)]   (1.2)
Extracting the watermark data is more difficult, requiring division of the watermarked signal by the original signal (if it is available at the decoder). This approach is not often used in public watermarking systems. Finally, a common non-linear approach to embedding involves quantization of the host signal sample values, and then perturbing the quantized values by a fraction of the quantization level:

x̃(n) = Δ(n) [x(n) / Δ(n)] + (1/4) w(n) Δ(n)   (1.3)

where Δ(n) represents an allowable quantization level, rather than a maximum watermark strength, and [·] represents the rounding operator. Since w(n) is bipolar, the sample value (after quantization) is increased or decreased according to the watermark bit. Extracting the watermark data is a simple matter of quantizing the watermarked signal by the same levels in Δ(n), and then determining the sign of the result. There are three commonly used signal transforms for watermarking in the transform domain. In the sections that follow, they are briefly reviewed.
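As an illustrative sketch (not part of the algorithms described later; the function names are ours), the additive rule of Equation 1.1 and the quantization rule of Equation 1.3 may be written together with their extraction procedures as follows:

```python
import numpy as np

def embed_additive(x, w, alpha):
    """Additive embedding (Equation 1.1): x~(n) = x(n) + alpha(n) w(n)."""
    return x + alpha * w

def extract_additive(xw, x):
    """Private extraction: subtract the original host signal and take
    the sign of the difference to recover the bipolar watermark."""
    return np.sign(xw - x).astype(int)

def embed_quantize(x, w, delta):
    """Quantization embedding (Equation 1.3): round to a multiple of
    delta(n), then perturb by one quarter of the quantization level."""
    return delta * np.round(x / delta) + 0.25 * w * delta

def extract_quantize(xw, delta):
    """Re-quantize by the same levels; the sign of the remainder
    (ideally +/- 0.25) gives the watermark bit."""
    r = xw / delta - np.round(xw / delta)
    return np.where(r >= 0, 1, -1)
```

Note that the additive extractor assumes a private system in which the original x(n) is available at the decoder, while the quantization extractor needs only the levels Δ(n).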
The forward and inverse Discrete Fourier Transform (DFT) of a signal N samples in length are given by:

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πnk/N}   (1.4)

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{j2πnk/N}   (1.5)
The DFT can be extended to two (and higher) dimensions. The two-dimensional DFT (2D-DFT) of an N1 × N2 image is given by:

X(k1, k2) = Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} x(n1, n2) e^{−j2π(n1 k1 / N1 + n2 k2 / N2)}   (1.6)

x(n1, n2) = (1 / (N1 N2)) Σ_{k1=0}^{N1−1} Σ_{k2=0}^{N2−1} X(k1, k2) e^{j2π(n1 k1 / N1 + n2 k2 / N2)}   (1.7)

In this thesis, the operations of computing the forward and inverse DFT, regardless of dimension, will be denoted DFT{·} and IDFT{·}, respectively.
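The DFT pair above corresponds directly to the FFT routines of standard numerical libraries. A brief sketch (using NumPy, purely for illustration) confirms that a direct evaluation of Equation 1.4 matches the library FFT, and that the inverse transform recovers the signal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Direct evaluation of Equation 1.4 for a short one-dimensional signal.
N = 16
x = rng.standard_normal(N)
n = np.arange(N)
X = np.array([np.sum(x * np.exp(-2j * np.pi * n * k / N)) for k in range(N)])
assert np.allclose(X, np.fft.fft(x))        # matches the library FFT

# Forward and inverse 2D-DFT (Equations 1.6 and 1.7) round-trip.
img = rng.standard_normal((8, 8))
assert np.allclose(np.fft.ifft2(np.fft.fft2(img)).real, img)
```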
The DCT, introduced in [8], produces real-valued coefficients that do not suffer from the symmetry constraints of the DFT. It is commonly used for audio, image, and video compression as the heart of transform-based coders. The DCT may be computed from the DFT, so fast algorithms such as the FFT may be used to compute it. Other efficient implementations of the DCT exist [9]. The forward and inverse Discrete Cosine Transform of a signal N samples in length may be written by the following equations:

X(k) = c(k) Σ_{n=0}^{N−1} x(n) cos[π(2n+1)k / (2N)]   (1.8)

x(n) = Σ_{k=0}^{N−1} c(k) X(k) cos[π(2n+1)k / (2N)]   (1.9)

where 0 ≤ k ≤ N−1, and

c(k) = sqrt(1/N) for k = 0;  c(k) = sqrt(2/N) for 1 ≤ k ≤ N−1   (1.10)

The equations above are commonly referred to as the DCT pair. Like the DFT, the DCT can be extended to two dimensions:

X(k1, k2) = c(k1) c(k2) Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} x(n1, n2) cos[π(2n1+1)k1 / (2N1)] cos[π(2n2+1)k2 / (2N2)]   (1.11)

x(n1, n2) = Σ_{k1=0}^{N1−1} Σ_{k2=0}^{N2−1} c(k1) c(k2) X(k1, k2) cos[π(2n1+1)k1 / (2N1)] cos[π(2n2+1)k2 / (2N2)]   (1.12)

where c(k) is the same as in Equation 1.10. In this thesis, the operations of computing the forward and inverse DCT, regardless of dimension, will be denoted DCT{·} and IDCT{·}, respectively.
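The DCT pair of Equations 1.8-1.10 can be implemented directly; the following sketch (a naive O(N²) evaluation, for illustration only) verifies that the pair is its own inverse:

```python
import numpy as np

def dct(x):
    """Forward DCT of Equation 1.8 with the weights c(k) of Equation 1.10."""
    N = len(x)
    n = np.arange(N)
    c = np.where(np.arange(N) == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return c * np.array([np.sum(x * np.cos(np.pi * (2 * n + 1) * k / (2 * N)))
                         for k in range(N)])

def idct(X):
    """Inverse DCT of Equation 1.9."""
    N = len(X)
    k = np.arange(N)
    c = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return np.array([np.sum(c * X * np.cos(np.pi * (2 * n + 1) * k / (2 * N)))
                     for n in range(N)])

x = np.random.default_rng(0).standard_normal(32)
assert np.allclose(idct(dct(x)), x)   # the DCT pair reconstructs the signal
```

In practice a fast FFT-based routine would be used, as the text notes; the direct form above simply mirrors the equations.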
along with downsampling of the filtered signals [10]. If certain conditions are met in the design of the filters, then perfect reconstruction of the original signal can be obtained by using a reconstruction scheme of upsampling and filtering to remove spectral aliasing effects. Figure 1.3(a) shows an example of a two-band filtering system employing a lowpass and highpass filter bank. The frequency responses of the two filters are shown in Figure 1.3(b). In this study, the Discrete Wavelet Transform (DWT) is used to implement subband decomposition and reconstruction filter banks [11]. The DWT can be extended to work in two and higher dimensions by using separable filters working on separate dimensions. An excellent description of a two-dimensional wavelet filtering scheme can be found in [12]. In this thesis, the operations of computing the forward and inverse DWT, regardless of dimension, will be denoted DWT{·} and IDWT{·}, respectively.
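A minimal sketch of the two-band perfect-reconstruction scheme of Figure 1.3, using Haar filters as the simplest example of a filter pair that satisfies the perfect reconstruction conditions (the choice of Haar filters is ours, for illustration):

```python
import numpy as np

def haar_analysis(x):
    """One-level two-band decomposition: lowpass h = [1, 1]/sqrt(2) and
    highpass g = [1, -1]/sqrt(2), followed by downsampling by two."""
    s = 1.0 / np.sqrt(2.0)
    lo = s * (x[0::2] + x[1::2])
    hi = s * (x[0::2] - x[1::2])
    return lo, hi

def haar_synthesis(lo, hi):
    """Upsample and filter; the Haar pair reconstructs the signal exactly."""
    s = 1.0 / np.sqrt(2.0)
    x = np.empty(2 * len(lo))
    x[0::2] = s * (lo + hi)
    x[1::2] = s * (lo - hi)
    return x
```

Applying the analysis step recursively to the lowpass output yields the multi-level DWT used in this study.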
Figure 1.3: Example of a subband filter bank (a) and the lowpass and highpass half-band decomposition filters (b).
and let w̃(n) ∈ {−1, +1} represent the extracted watermark. The bit error rate, expressed as a percentage, is given by:

P_B = (100 / M) Σ_{n=0}^{M−1} |w(n) − w̃(n)| / 2   (1.13)
In some applications, the embedded watermark may be used as a signature representing the author or copyright owner. In this case, it is useful to measure how well the extracted watermark correlates with the signature. A threshold value may then be set to decide whether the extracted watermark is acceptable or not. This correlation coefficient is given by [13]:

ρ(w, w̃) = Σ_{n=0}^{M−1} w(n) w̃(n) / sqrt( Σ_{n=0}^{M−1} w²(n) · Σ_{n=0}^{M−1} w̃²(n) )   (1.14)

where 0 ≤ ρ ≤ 1. ρ = 1 indicates perfect correlation, while an extremely low value reveals that the watermarks are dissimilar. It is very important to note that bit error rate is used as a measure of how well single bits may be extracted from the host signal, and that it is decoupled from the reliability of the watermark itself. For example, assume a system in which one of 1024 possible watermarks will be embedded to represent one of 1024 possible copyright owners. If a binary encoding is used, then a minimum of log₂ 1024 = 10 bits must be embedded. If the bit error rate of the system is ten percent (or one bit in ten), then the watermark has a reliability of only 50 percent, since a single bit error will cause an incorrect copyright owner to be identified. However, if a longer watermark is used to represent the 1024 possible copyright owners, then a bit error rate of ten percent may be acceptable. The evaluation framework proposed here is based upon four performance metrics: bit rate, perceptual quality, computational complexity, and robustness to signal processing. Each of these metrics is described in the following sections. Recently, other researchers have proposed a benchmark for comparing the performance of watermarking algorithms [14]. However, their approach differs from this one in several ways. First of all, their benchmark is limited to the study of image watermarking
algorithms, and only those that require access to the original image at the decoder (private watermarks). In addition, their benchmark lacks two important aspects: comparison by bit rate and by computational complexity.
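The two detection measures, the bit error rate of Equation 1.13 and the correlation coefficient of Equation 1.14, can be computed as follows (a sketch; the function names are ours, and the bit error rate is expressed through a bit-mismatch count, which is equivalent for bipolar values):

```python
import numpy as np

def bit_error_rate(w, w_ext):
    """Percentage of differing bits between the embedded and extracted
    bipolar watermarks (equivalent to Equation 1.13 for values in {-1, +1})."""
    return 100.0 * np.mean(w != w_ext)

def correlation(w, w_ext):
    """Normalized correlation between embedded and extracted
    watermarks (Equation 1.14)."""
    return np.sum(w * w_ext) / np.sqrt(np.sum(w ** 2) * np.sum(w_ext ** 2))
```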
a different algorithm. The images are viewed under standard conditions, displayed a fixed distance from the viewer, and under a known level of ambient light. The subject is asked to select the image that has a better "quality". This process is performed for a large number of subjects and host images, and for every possible combination of algorithms, resulting in a ranking of the algorithms. Obviously, a formal perceptual quality study requires a significant amount of time and resources. In this study, a simpler measure of quality will be used: the signal-to-noise ratio (SNR). This is simply the power of the host signal over the distortion power introduced by the watermarking algorithm. Although not as robust as a more formal study, SNR will be used because it is simple to implement and provides a rough measure of quality. If x(n) represents an audio signal of length N samples, and x̃(n) is the watermarked version, then the SNR is given by [16]:

SNR = 10 log₁₀ [ Σ_{n=0}^{N−1} x²(n) / Σ_{n=0}^{N−1} (x̃(n) − x(n))² ]   (1.15)
For images and video sequences, the peak signal-to-noise ratio (PSNR) will be used [13]:

PSNR = 10 log₁₀ [ N1 N2 / Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} (x̃(n1, n2) − x(n1, n2))² ]   (1.16)

where it is assumed that x(n1, n2) and x̃(n1, n2) are normalized to the interval [0, 1]. PSNR is commonly used as a performance metric for digital image and video compression algorithms. Perceptual quality is dependent upon the intended application of a watermarking system. In some situations, a detectable amount of distortion may be acceptable if it ensures a higher bit rate or more reliable encoding. In other cases, it may be required that watermark data be completely imperceptible to a user.
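Equations 1.15 and 1.16 translate directly into code; a brief sketch (function names are ours, and the PSNR form assumes signals normalized to [0, 1] so the peak value is one):

```python
import numpy as np

def snr_db(x, xw):
    """Signal-to-noise ratio of Equation 1.15, in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((xw - x) ** 2))

def psnr_db(x, xw):
    """Peak signal-to-noise ratio of Equation 1.16, in dB, for images
    normalized to the interval [0, 1] (so the peak signal value is 1)."""
    return 10.0 * np.log10(x.size / np.sum((xw - x) ** 2))
```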
required to implement each algorithm. In a classical sense, complexity often refers to "big-O" analysis, in which the complexity of an algorithm is roughly determined asymptotically as a function of the size of the input [17]. An algorithm with O(N²) complexity, for example, requires on the order of N² processing steps for an input of size N. In time-critical applications, or where computing power is limited, selection of a watermarking algorithm requires more quantitative information. In this investigation, actual time in CPU cycles will be used as a measure of complexity.
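As an illustration of this quantitative approach, run time can be measured with a high-resolution timer. The sketch below uses wall-clock time via Python's `perf_counter` rather than raw CPU cycle counts (a simplification of the measure used in this thesis); taking the best of several runs reduces contamination from unrelated system activity:

```python
import time
import numpy as np

def best_time(fn, *args, repeats=5):
    """Best wall-clock time of fn(*args) over several runs; the minimum
    is the least affected by other processes on the machine."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

x = np.random.default_rng(0).standard_normal(4096)
t_fft = best_time(np.fft.fft, x)
assert t_fft > 0.0
```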
and masking. The study of these concepts has led to the development of mathematical models of human perception, and the models may be used to determine maximum allowable levels of distortion resulting from embedded watermark data. Some of these models will be introduced, along with a description of their implementation and necessary modifications. In Chapters 3-5 the main results of this investigation will be presented, and the structure of these three chapters will be similar. First of all, a selection of watermarking algorithms for digital audio, image, and video signals will be described, along with details of their implementation considerations. Where possible, improvements to the algorithms will be proposed, and they all will be evaluated with respect to the evaluation framework proposed earlier in this chapter. Finally, in Chapter 6, the primary results of this investigation will be reviewed, and possible applications of watermarking technology will be examined. In closing, recommendations will be made for future research efforts in this area.
Section 2.2, along with the myriad of masking properties that have been studied by psychologists. Three psychovisual models are presented: one in the spatial domain and two that operate in the frequency domain. Modifications to these models will also be proposed that allow them to be more easily incorporated into watermarking systems.
Figure 2.1: Subset of the 32 overlapping filters modelling the bandpass channels within the Human Audio System.
Figure 2.2: Plot of TA(f), the absolute detection threshold of the Human Audio System.

The HAS is most sensitive to frequencies between 500 Hz and 8 kHz, called the mid-band frequencies.
sounds consisting of similar frequencies are detected and sent to the brain by the same mechanism. A channel responding to a masking signal may be less sensitive to weaker signals of similar frequency. If a critical frequency band does not contain any single strong tones, then the combination of frequency components within the critical band may collectively serve to mask the presence of any individual tone within the band. This concept is referred to as noise-masks-tone, as the set of frequency components within the band are considered "noisy". The raised detection threshold at frequency f due to a masking signal at fm is a function of the distance between the two frequencies and the power of the masking signal. In psychoacoustic studies, frequencies and distances between frequencies are sometimes given in Barks. The Bark scale is a logarithmic mapping from frequencies in Hz, as shown in Figure 2.3. Figure 2.4 shows an example of the raised detection threshold for a masking signal with a power of 15 dB at a frequency of 5 kHz. The plot, compared to Figure 2.2, illustrates how much the absolute detection threshold is raised for all other frequencies due to the presence of the masking signal. It is clear that the effect is maximized for frequencies adjacent to the masking frequency. Another frequency-domain characteristic of the HAS is the concept of noise-masks-noise. The presence of a noise-like audio signal, with a relatively flat spectrum and no prominent frequency components, tends to mask additional noise applied to the signal. For example, consider a piece of music containing background crowd noise. Low levels of additive noise will not be perceivable to a listener.
Figure 2.3: The Bark scale, a logarithmic mapping from frequency (kHz) to Barks.
Figure 2.4: Raised detection threshold due to a 15 dB masking signal at 5 kHz.
"echo") in an audio signal, provided that the echo delay is not too long. In Section 3.3, another algorithm will be introduced that relies upon the inability of the HAS to distinguish the phase difference between two distinct tones (but of the same frequency) that are slightly out of phase by a constant factor. Discussion of these other aspects of the HAS is reserved for those sections.
P(k) = |X(k)|²   (2.1)

where X(k), for 0 ≤ k ≤ N−1, represents the Discrete Fourier Transform (DFT) of the current block. Since xm(n) is real-valued, P(k) is symmetric around a frequency of one-half the sampling rate. In the following steps, P(k) is considered for only half of the frequency components, or 0 ≤ k ≤ N/2 − 1. The frequency of a component of P(k) is given by

f = (fs / N) k   (2.2)

where fs is the sampling frequency.
2. Divide the power spectrum into 32 equal-width critical frequency bands to approximate the set of bandpass channels of the human ear across the range of audible frequencies. For N = 512 samples, for example, each critical band consists of 512 / (2 · 32) = 8 coefficients from the power spectrum.
3. Identify tonal and non-tonal components in the power spectrum. A tonal component is defined as a local maximum of P(k), i.e., a coefficient satisfying P(k−1) < P(k) ≥ P(k+1). A non-tonal component is defined as the sum of the power spectrum coefficients within a critical frequency band. If k1 and k2 are the boundary indexes of a critical band, then the power of a non-tonal component Pm is given by:

Pm = Σ_{k=k1}^{k2} P(k)   (2.6)

The frequency of a non-tonal component within its critical band, km, is determined by an average of the band's frequencies, weighted by the power spectrum coefficient associated with each frequency:

km = Σ_{k=k1}^{k2} k P(k) / Σ_{k=k1}^{k2} P(k)   (2.7)
4. Remove tonal and non-tonal components that are below the absolute detection threshold TA(f), for it is assumed that they will not be audible to the listener. Also remove tonal components that are less than one-half of a critical band width from a neighbouring tonal component, since the response of the HAS to one such component will mask the other.

5. For each remaining tonal and non-tonal frequency component at index km, compute the raised detection threshold as a function of all other audible frequencies. Let d represent the distance between a frequency of interest and the tonal component frequency, k − km, measured on the Bark scale. The raised detection threshold at index k due to the presence of a tonal or non-tonal component at km is given by [21]:

TR(k, km) = P(km) + R(d, km)   (2.8)
where R(d, km) is a piecewise-continuous function of the tonal component power and the distance of frequency k from the masking frequency km [21]:

R(d, km) =
  17(d + 1) − (0.4 P(km) + 6),        −3 ≤ d < −1
  (0.4 P(km) + 6) d,                  −1 ≤ d < 0
  −17 d,                               0 ≤ d < 1
  −(d − 1)(17 − 0.15 P(km)) − 17,      1 ≤ d < 8   (2.9)
The result is a set of raised detection threshold levels, TR(k, km), one for each tonal and non-tonal component, indicating how each individual tonal and non-tonal component raises the detection threshold level for all other frequencies.

6. Compute a global masking threshold level as the sum of the raised detection thresholds for all of the tonal and non-tonal components. This global function provides the raised masking threshold for a single frequency resulting from the contribution of all tonal and non-tonal components:

TG(k) = Σ_{km} TR(k, km)   (2.10)
Convert the global function into a function of frequency, TG(f), using Equation 2.2.

7. Finally, compute the frequency masking threshold function as the maximum of the absolute detection threshold (from Section 2.1.1) and the global masking threshold:

TM(f) = max{TG(f), TA(f)}   (2.11)

This final step is performed because a raised detection threshold will still be inaudible if it lies below the absolute detection threshold.

The original MPEG psychoacoustic model specifies a fixed block size of N = 512 samples, and values for the absolute detection threshold function, TA(f), are provided for N/2 = 256 frequencies between 0 ≤ f ≤ 22.050 kHz, based on a sampling rate of 44.1 kHz. The formula for the raised detection threshold function of Equation 2.9 is also tuned to a block size of 512 samples. In addition, tonal and non-tonal components are removed in Step 4 above if they are less than four coefficients apart, and non-tonal components are constructed from the average power within eight power spectrum coefficients, because each critical band consists of eight coefficients. However, to incorporate this perceptual model into watermarking algorithms, it will be necessary for convenience to operate on variable block sizes. It will be shown in Chapter 3 that the psychoacoustic model may be tightly coupled to watermarking algorithms using variable block sizes. Where a larger number of samples is needed, missing values of the absolute detection threshold function, TA(f), will be obtained using bilinear interpolation. Interpolation will not be required for TR(k, km), the raised detection threshold function, because it is a function of the difference in frequencies. Where the block size is smaller than 512 samples, required values for the absolute detection threshold will be obtained using a nearest-neighbour approach. Since the power of TA(f) and TR(k, km) will be affected by a change in block size, they will also be scaled. For TA(f) provided in dB and a block size of N samples, the modified absolute detection threshold function is given by:

T'A(f) = TA(f) + 20 log₁₀(N / 512)   (2.12)

A similar modification will be made to the raised detection threshold function. Because the MPEG model specifies a division into 32 critical frequency bands to approximate the filter bank structure of the HAS, the block size will be limited to N ≥ 64 samples. Also, the model was designed to approximate the short-term response of the HAS by analyzing audio samples of less than 30 ms in duration. By increasing the block size too far, it is likely that the model will no longer be accurate. Figure 2.5 shows an example of the power spectrum of an audio signal sampled at 44.1 kHz and 512 samples in length. The plot illustrates the power spectrum of the signal, along with the absolute detection threshold function TA(f) and the masking threshold function TM(f) computed using the procedure described above.
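Steps 1-3 of the model, together with the block-size scaling of Equation 2.12, can be sketched as follows. This is a simplified illustration in our own notation: it identifies tonal peaks as simple local maxima and treats every critical band as a non-tonal component, and it omits the thresholding, spreading, and combination of Steps 4-7:

```python
import numpy as np

def analysis_steps(x, fs=44100.0):
    """Power spectrum (2.1), 32 equal-width critical bands, and
    tonal / non-tonal components (2.6, 2.7) for one block of audio."""
    N = len(x)
    P = np.abs(np.fft.fft(x)) ** 2
    P = P[: N // 2]                        # real input: keep the lower half
    f = fs * np.arange(N // 2) / N         # Equation 2.2
    bands = P.reshape(32, -1)              # Step 2: 32 equal-width bands
    k = np.arange(1, len(P) - 1)
    tonal = k[(P[k] > P[k - 1]) & (P[k] >= P[k + 1])]   # local maxima
    width = len(P) // 32
    Pm = bands.sum(axis=1)                 # Equation 2.6, one per band
    km = np.array([np.sum(np.arange(b * width, (b + 1) * width) * bands[b])
                   / Pm[b] for b in range(32)])          # Equation 2.7
    return P, f, tonal, Pm, km

def scale_absolute_threshold(TA_db, N):
    """Block-size scaling of the absolute threshold (Equation 2.12)."""
    return TA_db + 20.0 * np.log10(N / 512.0)
```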
Figure 2.5: Power spectrum and corresponding absolute and raised detection threshold functions, TA(f) and TM(f), for a sample audio sequence.
Figure 2.6: Passband filter responses of the two-dimensional Cortex filters used to represent the set of visual channels.

Spatial frequencies are measured in cycles per degree of vision, which differs from the spatial frequency of the image based on its sampling rate. Cycles per degree represents the observed frequency of a stimulus incident upon the retina, and the quantity depends upon the width of the image and the distance from the image to the viewer. In the description of psychovisual properties that follows, the standard viewing distance is assumed to be six times the width of the image, as illustrated in Figure 2.7. For an image of size N × N pixels, and assuming a viewing distance of six times the image width, a normalized spatial frequency of f cycles per pixel may be converted to f' cycles per degree using the following transformation:

f' = N f / (2 arctan(1/12))   (2.13)

where the arctangent is evaluated in degrees. For example, a 512 × 512 image may possess spatial frequencies ranging from 0 ≤ f ≤ 0.5 cycles per pixel. If the image is displayed at the standard distance, the observed frequencies range from 0 ≤ f' ≤ 27 cycles per degree. For a 256 × 256 image displayed at the corresponding standard distance, the observed frequencies are limited to a maximum of f' = 13 cycles per degree. For the remainder of this discussion of
34
WIDT
HEIGHT
N PIXELS
N PIX
ELS
DIS
TA
NC
E=
6x
WI
DT
Figure 2.7: Observed frequencies are dependent upon the image width and the viewing distance, standardized to six times the image width.
visual models, spatial frequencies will often be specified in cycles per degree to explain physical properties independent of image properties.
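The conversion of Equation 2.13 can be written as a small helper (a sketch in our own notation; the rounding in the assertions reproduces the 27 and 13 cycles-per-degree figures quoted above):

```python
import numpy as np

def cycles_per_degree(f_cpp, N):
    """Convert a normalized spatial frequency in cycles/pixel to
    cycles/degree for an N-pixel-wide image viewed at six times its
    width (Equation 2.13).  At that distance the full image width
    subtends 2*arctan(1/12) degrees of vision."""
    width_deg = 2.0 * np.degrees(np.arctan(1.0 / 12.0))
    return N * f_cpp / width_deg

# Maximum observable frequencies at the standard viewing distance.
assert round(cycles_per_degree(0.5, 512)) == 27
assert round(cycles_per_degree(0.5, 256)) == 13
```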
C = 2 (LMAX − LMIN) / (LMAX + LMIN)   (2.14)
The result is a frequency-dependent contrast detection threshold function, C(f). Figure 2.8 shows a plot of contrast sensitivity as a function of spatial frequency [23]. Frequency components of an image with a contrast below the detection threshold are not visible. From the plot, it is clear that the HVS is most sensitive to spatial frequencies around 3 cycles per degree. The method described above determines the sensitivity of the HVS to sinusoidal stimuli of a single spatial frequency. Sinusoids form the basis functions used in Fourier analysis of signals, and so C(f) represents the sensitivity of the HVS to Fourier basis functions. However, it is possible to determine sensitivity to other basis functions as well. In Section 2.2.6.2, quantization matrices will be described that are based upon measured sensitivity to 2D-DCT basis functions. In Section 4.3, the sensitivity of the HVS to 2D-DWT basis functions will be described.
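The contrast definition of Equation 2.14, peak-to-peak luminance over mean luminance, in code form (a trivial sketch; the function name is ours):

```python
def contrast(L_max, L_min):
    """Contrast of a sinusoidal stimulus (Equation 2.14):
    peak-to-peak luminance divided by the mean luminance."""
    return 2.0 * (L_max - L_min) / (L_max + L_min)

# A full-range grating has contrast 2; a shallow grating far less.
assert contrast(1.0, 0.0) == 2.0
assert abs(contrast(0.6, 0.4) - 0.4) < 1e-12
```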
Figure 2.8: Contrast sensitivity of the HVS as a function of spatial frequency [23].
Like the HAS, studies have determined that masking effects occur in human vision as well. The presence in an image of a strong frequency component (a masking signal) will mask the presence of other components of similar spatial frequency (masked signals). In particular, the contrast detection threshold function, C(f), of a masked signal is raised by the presence of a masking signal [23]. This effect is maximized for signals of the same frequency and orientation. The raised detection threshold at any spatial frequency f due to the presence of a masking signal at frequency fm is given by:

C(f, fm) = C(f) max{1, [k(f/fm) Cm]^ε}   (2.15)

where C(f) denotes the original detection threshold at f, Cm is the contrast of the masking component, and ε is a tunable parameter usually set to 0.649 [23]. k(f/fm) is a weighting function, illustrated in Figure 2.9. It is clear from the plot that the masking effect is highest for spatial frequencies close to the masking frequency, and decreases as the frequencies differ. In addition, the masking effect increases with Cm, the contrast of the masking signal. The effect of frequency masking is also dependent upon orientation. A masking signal will raise the detection threshold for a weaker signal of similar frequency and orientation, but the effect will become less pronounced as the angle between the two signals increases. The masking signal will have little or no effect on the contrast detection threshold for a signal that is oriented 90 degrees away. This effect can be modeled as a Gaussian weighting factor as a function of the difference in orientation from the masking signal [24]:

Θ(Δθ) = exp[−0.5 (Δθ / σθ)²]   (2.16)

where Δθ is the angle in degrees between the masking signal and the signal of interest, and σθ = 15°. Θ(Δθ) is applied to C(f, fm), the raised contrast detection function, to compensate for the orientation between signals.
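Equations 2.15 and 2.16 may be sketched as follows (the function names and the scalar interface are ours; `k_ratio` stands for the tabulated weighting function k(f/fm) of Figure 2.9, which is treated here as a given input value):

```python
import numpy as np

def raised_contrast_threshold(C_f, k_ratio, C_m, eps=0.649):
    """Raised contrast detection threshold in the presence of a
    masking component (Equation 2.15)."""
    return C_f * max(1.0, (k_ratio * C_m) ** eps)

def orientation_weight(dtheta_deg, sigma=15.0):
    """Gaussian fall-off of the masking effect with the angle between
    masker and masked signal, in degrees (Equation 2.16)."""
    return np.exp(-0.5 * (dtheta_deg / sigma) ** 2)

# A weak masker leaves the threshold unchanged (the max with 1 in 2.15);
# the orientation weight vanishes for perpendicular signals.
assert raised_contrast_threshold(0.01, 0.5, 1.0) == 0.01
assert np.isclose(orientation_weight(0.0), 1.0)
assert orientation_weight(90.0) < 1e-6
```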
Figure 2.9: Weighting function used to determine the raised contrast detection threshold in the presence of a masking signal [23].
Spatial masking effects can be modeled in the spatial domain, as opposed to frequency sensitivity and masking, which are primarily frequency domain effects. In addition, frequency sensitivity and masking effects are global in nature, but the HVS processes effects locally as well. There are two main characteristics: luminance masking and spatial masking [25]. Luminance masking describes the fact that the ability to detect noise or distortion varies with the mean luminance of the image region. When superimposed onto an image of uniform intensity, the visibility of zero-mean white noise varies with the intensity of the image. Figure 2.10(a) shows a plot of the detection threshold, measured in noise variance, as a function of intensity for an 8-bit image. Spatial masking occurs in an image around sharp changes in intensity (or edges). On either side of the edge, the detection threshold of additive noise or distortion is raised. Figure 2.10(b) shows a plot of the detection threshold, measured in noise variance, of additive noise on either side of an edge. In this figure, the raised threshold is a function of the observed distance from the edge in degrees of vision. Negative degrees in the plot correspond to a dark region, while a uniform whiter region lies within positive degrees.
Figure 2.10: Raised detection thresholds of zero-mean additive white noise in the presence of (a) luminance masking and (b) spatial masking.
Masking effects in the HVS are not limited to strictly spatial or frequency stimuli. The time-varying nature of video signals evokes two different responses: the flicker frequency and temporal masking [15]. The flicker frequency results from a lowpass temporal filter in the HVS, limiting the temporal response to roughly 24-30 Hz (frames per second). Temporal frequency components occurring at a faster rate generally are not perceivable. However, the flicker frequency also depends greatly upon spatial frequencies within the video signal. Flicker frequency is not yet a useful property for watermarking, since it is not likely that a digital video sequence will be sampled at a higher rate, and the complex relationship between spatial frequencies and temporal frequencies has not been widely studied. Temporal masking occurs when the local mean intensity of a video sequence changes abruptly, such as during a scene change, a rapidly moving object, or a bright flash. During a rapid intensity change, the detection threshold of additive noise or distortion is elevated for a period of between 50 and 100 ms before and after the change [25]. Unfortunately, temporal masking is not yet useful for watermarking. Scene changes are relatively sparse compared to the overall length of sequences, and current models of the HVS do not accurately model the response of human vision to moving objects.
relationships between two masking characteristics. HVS models may be roughly divided into two approaches, operating chiefly in the spatial domain or in the frequency / transform domain. In the following sections, implementations of some of these models will be described.
lMONITOR(n1, n2) = LMONITOR + sMONITOR · x(n1, n2)^γ   (2.17)

where LMONITOR = 0.35 × 10⁻ candelas per square meter, sMONITOR = 15, and γ = 2.2. The screen luminance is then converted into an image incident upon the eye's retina by convolving the luminance function with the optical point spread function (PSF) of the eye:

lRETINA(n1, n2) = lMONITOR(n1, n2) ∗ hPSF(n1, n2)   (2.18)
Figure 2.11: The optical point spread function.

The optical point spread function hPSF(n1, n2) has a circularly symmetric Gaussian impulse response with a half-bandwidth of 1/60th of a degree of vision (1 arcmin), given by:

hPSF(n) = exp[−0.5 (|n| / σ)²]   (2.19)

where |n|² = n1² + n2², and σ is the distance in pixels corresponding to a half-bandwidth at 1 arcmin of vision, obtained using the conversion of Equation 2.13. A plot of the PSF is shown in Figure 2.11. Girod's model also approximates the adaptive gain control present within the retina cells. The gain control allows the eye to have a large dynamic range, and is represented by the formula:
cINH(n1, n2) = lRETINA(n1, n2) ∗ hINH(n1, n2) + LAD   (2.20)

where hINH(n1, n2) is a Gaussian impulse response from Equation 2.19 with a half-bandwidth equal to σ = 8 arcmin of vision, and LAD = 7 candelas per square meter. Finally, the visual model predicts the saturation characteristics of the ganglion cells. This portion of the model is nonlinear, and the saturation of the ganglion cells results in luminance and spatial masking effects:
c(n1, n2) = [lRETINA(n1, n2) − cINH(n1, n2)] / [cINH(n1, n2) + kSAT · max{0, lRETINA(n1, n2) − cINH(n1, n2)}]   (2.21)

where kSAT = 8. The result of this processing is an approximation of the stimulus transmitted to the brain by the ganglion cells. The monitor luminance and saturation characteristics resulting in c(n1, n2) are nonlinear operations, and Girod introduced a linearization of his model about an operating point specified by the undistorted input image x(n1, n2). Linearization allows the model to be used for determining whether distortion in the image, denoted Δx(n1, n2), will be visible to a human observer. Determining whether this distortion is perceivable is achieved by using a localized detection threshold. This is done by convolving the square of the distortion image with a Gaussian impulse response hLOCAL(n1, n2) of half-bandwidth equal to σ = 13 arcmin of vision. The result is a localized distortion image given by:
$$ D(n_1, n_2) = \Delta c^{2}(n_1, n_2) * h_{LOCAL}(n_1, n_2) \qquad (2.22) $$
If $D(n_1, n_2)$ exceeds a pre-specified threshold somewhere within the image, then the distortion is deemed to be perceivable by the viewer. Obviously, $\Delta x(n_1, n_2)$ may be used as a weighting function, $\alpha(n_1, n_2)$ from Section 1.3.1, to control the strength of an additive watermark signal. Computing the weighting function using Girod's model directly is difficult, because the model only indicates whether a particular distortion function would be detectable by the viewer. Tewfik et al. proposed a method of incorporating Girod's model into perceptual image compression and watermarking algorithms [27, 28]. In their approach, simplifications are made so that a reverse procedure may be followed: given an input image $x(n_1, n_2)$, determine the maximum distortion function $\Delta x(n_1, n_2)$ that remains imperceptible to the viewer. The maximum distortion function may then be used to weight a watermark signal. The linearization of Girod's model about the operating point $x(n_1, n_2)$ may be expressed by the following equations. First of all, the error in the monitor luminance due to the distortion function is given by:
$$ \Delta l_{MONITOR}(n_1, n_2) = w_1(n_1, n_2)\, \Delta x(n_1, n_2) \qquad (2.23) $$

and the error in the signal transmitted to the brain is given by:

$$ \Delta c(n_1, n_2) = w_2(n_1, n_2)\left[ \Delta l_{RETINA}(n_1, n_2) - w_3(n_1, n_2)\, \Delta c_{INH}(n_1, n_2) \right] \qquad (2.24) $$
The weighting functions used in the above equations result from the linearization of the model about an operating point $x(n_1, n_2)$. In the expressions below, a bar indicates parameters computed from the non-linear model with $x(n_1, n_2)$ as the input:

$$ w_1(n_1, n_2) = \left. \frac{d\, l_{MONITOR}(n_1, n_2)}{d\, s(n_1, n_2)} \right|_{\bar{s}(n_1, n_2)} \qquad (2.25) $$

$$ w_2(n_1, n_2) = \frac{1}{\bar{c}_{INH}(n_1, n_2) + k_{SAT} \max\{0,\; \bar{l}_{RETINA}(n_1, n_2) - \bar{c}_{INH}(n_1, n_2)\}} \qquad (2.26) $$

$$ w_3(n_1, n_2) = \bar{c}(n_1, n_2) \qquad (2.27) $$
Figure 2.12(a) shows an example of a $512 \times 512$ image $x(n_1, n_2)$ with intensity values between 0 and 255. Figure 2.12(b) shows the scaled result of the corresponding masking analysis using Girod's model. The masking threshold values range over $2 \le T(n_1, n_2) \le 12$. From the illustration, it is clear that the detection thresholds are highest around edges (such as on the shoulder in the image) and in regions of uniform intensity.
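The forward stages of the model can be sketched numerically as follows. This is a minimal illustration under the equation forms given above; `L_MON` and `px_per_arcmin` are placeholder values (the thesis obtains the pixel conversion from Equation 2.13), and unit DC gain is assumed for each Gaussian impulse response.

```python
import numpy as np

def gaussian_psf(sigma_px):
    """Circularly symmetric Gaussian impulse response (form of Equation 2.19)."""
    r = max(1, int(np.ceil(3 * sigma_px)))
    n1, n2 = np.mgrid[-r:r + 1, -r:r + 1]
    h = np.exp(-0.5 * (n1 ** 2 + n2 ** 2) / sigma_px ** 2)
    return h / h.sum()  # unit DC gain (an assumption)

def conv_same(x, h):
    """2D 'same'-size circular convolution via the FFT."""
    r = (h.shape[0] - 1) // 2
    H = np.fft.rfft2(np.pad(h, ((0, x.shape[0] - h.shape[0]),
                                (0, x.shape[1] - h.shape[1]))))
    y = np.fft.irfft2(np.fft.rfft2(x) * H, s=x.shape)
    return np.roll(y, (-r, -r), axis=(0, 1))  # recentre the kernel

def girod_forward(s, px_per_arcmin=1.0, L_MON=1.0, s_MON=15.0, gamma=2.2,
                  L_AD=7.0, k_SAT=8.0):
    """Forward stages: monitor -> retina (2.18) -> inhibition (2.20) -> saturation (2.21).
    L_MON and px_per_arcmin are placeholders, not the thesis's calibration."""
    l_mon = L_MON * (s / s_MON) ** gamma                              # monitor luminance
    l_ret = conv_same(l_mon, gaussian_psf(1.0 * px_per_arcmin))       # Eq. 2.18
    c_inh = conv_same(l_ret, gaussian_psf(8.0 * px_per_arcmin)) + L_AD  # Eq. 2.20
    return l_ret / (c_inh + k_SAT * np.maximum(0.0, l_ret - c_inh))   # Eq. 2.21

s = np.full((64, 64), 128.0)   # a spatially uniform test image
c = girod_forward(s)
```

A uniform input produces a uniform response, which is the degenerate case of the model; the weighting functions $w_1$, $w_2$, $w_3$ then follow by differentiating these stages at the operating point.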
Figure 2.12: Example of perceptual analysis using Girod's model of the Human Visual System.

The frequency-domain models described next create image-adaptive $8 \times 8$ quantization matrices for use in the JPEG compression algorithm. The matrix provided in the baseline JPEG standard is constant, and does not take advantage of masking concepts that may vary between images [29]. The models described in this section employ the 2D-DCT coefficients of each block, denoted $C(k_1, k_2)$, to vary the quantization matrix for each block. The models may be used for watermarking an image on a block-by-block basis using the quantization approach of Equation 1.3. All of the frequency-domain models begin with an $8 \times 8$ image- and block-independent basic quantization matrix. Since each DCT coefficient represents a frequency component, the basic matrix was constructed by measuring the sensitivity to each 2D-DCT basis function [30, 31]. The result is a minimum set of quantization levels $Q_{MIN}(k_1, k_2)$ that allow for perceptually transparent modification of each DCT coefficient, as shown in Table 2.1. Watson built upon the basic quantization matrix by incorporating support for luminance masking effects and a simple adjustment for frequency masking effects [32]. Support for luminance masking was added by using a simple correction factor based
          k_2:  0    1    2    3    4    5    6    7
  k_1 = 0:     14   10   11   14   19   25   34   45
  k_1 = 1:     10   11   11   12   15   20   26   33
  k_1 = 2:     11   11   15   18   21   25   31   38
  k_1 = 3:     14   12   18   24   28   33   39   47
  k_1 = 4:     19   15   21   28   36   43   51   59
  k_1 = 5:     25   20   25   33   43   54   64   74
  k_1 = 6:     34   26   31   39   51   64   77   91
  k_1 = 7:     45   33   38   47   59   74   91  108

Table 2.1: Minimum quantization matrix $Q_{MIN}(k_1, k_2)$ constructed by measuring sensitivity to 2D-DCT basis functions.

upon the mean intensity of each block:
$$ Q_L(k_1, k_2) = Q_{MIN}(k_1, k_2) \left( \frac{C(0,0)}{\bar{C}(0,0)} \right)^{\alpha_T} \qquad (2.28) $$

where $C(0,0)$ is the DC coefficient of the image block representing the mean intensity, $\bar{C}(0,0)$ is the DC coefficient corresponding to an average monitor intensity of 128 (for a monitor displaying 8-bit pixels), and $\alpha_T = 0.649$. Rudimentary frequency masking was also incorporated into Watson's model by using a simple equation representing the frequency masking effect of a single coefficient. For a strong DCT coefficient, the contrast detection threshold is raised not only for adjacent frequency components, but also for the masking coefficient itself. In Watson's model, the raised contrast detection threshold was considered only for the masking coefficient:

$$ Q(k_1, k_2) = \max\left\{ Q_L(k_1, k_2),\; |C(k_1, k_2)|^{w(k_1, k_2)}\, Q_L(k_1, k_2)^{1 - w(k_1, k_2)} \right\} \qquad (2.29) $$

where $0 \le w(k_1, k_2) \le 1$ is an exponent controlling the effect of the DCT coefficient magnitude $C(k_1, k_2)$ on the raised detection threshold. In Watson's model, $w(0,0) = 0$, and $w$ is equal to 0.7 for all other coefficients. To incorporate Watson's quantization matrix into a watermarking algorithm, it will be necessary to analyze blocks larger or smaller than the $8 \times 8$ blocks used in his model. To achieve this, Watson's model will require two modifications. First of all,
the minimum quantization matrix, $Q_{MIN}(k_1, k_2)$, will be constructed for larger blocks using bilinear interpolation of the original $8 \times 8$ matrix provided in Table 2.1. This is a reasonable modification, because the distribution of spatial frequencies represented by the DCT coefficients will be the same in blocks of different size (there will simply be a finer or coarser resolution of them). In addition, it will be necessary to ensure that $\bar{C}(0,0)$ in Equation 2.28 reflects the average monitor intensity of an $M \times M$ block. Tewfik et al. proposed an improved model based on a more complicated frequency masking analysis than the single-frequency model used in Watson's model [33]. However, the authors provided few details about their approach, so what follows is a description of the frequency masking analysis implemented for this study. Assume that an $N \times N$ image has been divided into a set of $M \times M$ blocks, and that for each block the 2D-DCT has been computed as $C(k_1, k_2)$.
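Watson's two adjustments can be sketched directly from Table 2.1 and Equations 2.28-2.29. The reference DC coefficient $\bar{C}(0,0)$ is passed in as a parameter here, since its numeric value depends on the DCT normalization in use.

```python
import numpy as np

# Minimum quantization matrix Q_MIN(k1, k2) from Table 2.1.
Q_MIN = np.array([
    [14, 10, 11, 14, 19, 25, 34, 45],
    [10, 11, 11, 12, 15, 20, 26, 33],
    [11, 11, 15, 18, 21, 25, 31, 38],
    [14, 12, 18, 24, 28, 33, 39, 47],
    [19, 15, 21, 28, 36, 43, 51, 59],
    [25, 20, 25, 33, 43, 54, 64, 74],
    [34, 26, 31, 39, 51, 64, 77, 91],
    [45, 33, 38, 47, 59, 74, 91, 108],
], dtype=float)

def watson_quantization(C, C00_ref, alpha_T=0.649):
    """Image-adaptive quantization matrix for one 8x8 block of 2D-DCT
    coefficients C, per Equations 2.28 (luminance) and 2.29 (frequency)."""
    # Luminance masking: scale by the block's DC coefficient.
    Q_L = Q_MIN * (C[0, 0] / C00_ref) ** alpha_T
    # Frequency masking: a strong coefficient raises its own threshold.
    w = np.full((8, 8), 0.7)
    w[0, 0] = 0.0
    return np.maximum(Q_L, np.abs(C) ** w * Q_L ** (1 - w))

# A block at the reference intensity with no AC energy stays at Q_MIN.
C = np.zeros((8, 8))
C[0, 0] = 1024.0
Q = watson_quantization(C, C00_ref=1024.0)
```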
1. Begin with the minimum quantization matrix $Q_{MIN}(k_1, k_2)$ and apply the luminance masking modification from Watson's model to produce $Q_L(k_1, k_2)$ (Equation 2.28).
2. For each 2D-DCT coefficient, determine the normalized spatial frequency of the coefficient in cycles per pixel:

$$ f_0(k_1, k_2) = \frac{1}{2M} \sqrt{k_1^2 + k_2^2} \qquad (2.30) $$

for $0 \le k_1, k_2 \le M - 1$. Then convert the spatial frequency into cycles per degree using the transformation of Equation 2.13:

$$ f(k_1, k_2) = \frac{N}{2 \arctan\left( \frac{1}{6} \right)}\, f_0(k_1, k_2) \qquad (2.31) $$
3. For every spatial frequency in $f(l_1, l_2)$, compute the effect of its corresponding 2D-DCT coefficient $C(l_1, l_2)$ on every other frequency in $f$. This is performed by adapting the raised contrast detection threshold function of Equation 2.15 to raise quantization levels. The effect of the 2D-DCT coefficient $C(l_1, l_2)$ on the raised quantization level $Q(k_1, k_2)$ is given by:

$$ \Delta Q(k_1, k_2, l_1, l_2) = Q(k_1, k_2) \max\left\{ 1,\; k\!\left( \frac{f(k_1, k_2)}{f(l_1, l_2)} \right) |C(l_1, l_2)|^{w}\, Q(l_1, l_2)^{-w} \right\} \qquad (2.33) $$

where $w = 0.7$, and $k(\cdot)$ is the frequency masking weighting function of Figure 2.9, normalized to 1 at a ratio of $f(k_1, k_2)/f(l_1, l_2) = 1$. The equation above employs the raised quantization coefficient from Watson's model, and weights it for adjacent frequencies. The masking effect is maximized when the two frequencies are the same, and is low when the two frequencies are dissimilar. The result is a set of $M \times M$ functions representing the raised quantization levels arising from every DCT coefficient in $C(k_1, k_2)$.
4. Apply the weighting factor of Equation 2.16 to correct for the difference in angular orientation between $(l_1, l_2)$ and $(k_1, k_2)$.
5. Finally, for every frequency in $f(k_1, k_2)$, compute the raised quantization level as the sum of the frequency masking effects from all other DCT coefficients. This is performed by using a summation rule of the form [21]:

$$ Q(k_1, k_2) = \left[ \sum_{l_1=0}^{M-1} \sum_{l_2=0}^{M-1} \Delta Q(k_1, k_2, l_1, l_2)^{\beta} \right]^{\frac{1}{\beta}} \qquad (2.34) $$

where $\beta = 2$. An example of the raised quantization matrix resulting from a single 2D-DCT coefficient is shown in Figure 2.13. The plot shows how much the minimum quantization matrix $Q_{MIN}(k_1, k_2)$ for a block size of $M = 64$ will be raised as a result of a strong 2D-DCT coefficient at $k_1 = 29$ and $k_2 = 11$. It is clear from the plot that the masking effect is pronounced for 2D-DCT components with frequencies and orientations close to those of the masking signal.
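The five steps above can be sketched as follows. The shape assumed here for the weighting function $k(\cdot)$ is a Gaussian in the log of the frequency ratio, peaking at 1 when the ratio is 1 (the thesis takes the actual shape from Figure 2.9), and the orientation correction of step 4 is omitted for brevity.

```python
import numpy as np

def raised_quantization(C, Q_base, w=0.7, beta=2.0):
    """Pairwise frequency masking between all 2D-DCT coefficients of an M x M
    block (Equations 2.30, 2.33, 2.34). k(.) is an assumed shape; the
    orientation correction (step 4) is omitted."""
    M = C.shape[0]
    k1, k2 = np.mgrid[0:M, 0:M]
    fk = (np.sqrt(k1 ** 2 + k2 ** 2) / (2.0 * M)).ravel()   # cycles/pixel, Eq. 2.30
    Cf, Qf = np.abs(C).ravel(), Q_base.ravel()
    masker = fk > 0                                   # skip the DC term as a masker
    ratio = fk[:, None] / fk[None, masker]            # f(k1,k2) / f(l1,l2)
    k_weight = np.exp(-0.5 * (np.log(np.maximum(ratio, 1e-12)) / 0.5) ** 2)
    # Eq. 2.33: effect of each masking coefficient l on each threshold k.
    dQ = Qf[:, None] * np.maximum(
        1.0, k_weight * Cf[None, masker] ** w * Qf[None, masker] ** (-w))
    # Eq. 2.34: beta-norm summation over all maskers.
    return (np.sum(dQ ** beta, axis=1) ** (1.0 / beta)).reshape(M, M)

Q_base = np.ones((8, 8))
C0 = np.zeros((8, 8))
C1 = np.zeros((8, 8))
C1[3, 2] = 100.0                     # one strong masking coefficient
Q_quiet = raised_quantization(C0, Q_base)
Q_masked = raised_quantization(C1, Q_base)
```

As in Figure 2.13, the strong coefficient raises the levels most for frequencies near its own, and the raised matrix is never below the quiet-block result.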
Figure 2.13: Effect of a strong 2D-DCT coefficient on adjacent coefficients within the minimum quantization matrix.
2.3 Summary
The goal of this chapter was to introduce the Human Audio System (HAS) and the Human Visual System (HVS), along with a description of models from the literature that were implemented for this study. Masking was introduced as a key concept throughout this chapter. Essentially, it may be described as the presence of a strong signal "masking" the ability of humans to detect other signals with similar characteristics. The HAS is subject to masking effects due to frequency sensitivity and frequency masking. An implementation of the MPEG Layer I psychoacoustic model was described to take advantage of these effects. The HVS is slightly more complex, offering masking effects due to spatial frequency sensitivity, frequency masking, luminance masking, and spatial masking. Spatial frequencies differ from audio frequencies in that an observed spatial frequency varies with the distance of the viewer from the image. Two additional effects were introduced, colour sensitivity and temporal masking, but these will not be considered in this investigation. Three visual models were described and implemented: one in the spatial domain and two that analyze an image in the 2D-DCT domain.
The algorithms were also selected to represent a range of computational complexities and implementation structures. Since the focus of this thesis is on public watermarking algorithms, the techniques chosen do not require access to the original signal in order to extract the watermark data. In some cases, however, having such access may improve the decoding process. The chapter is organized as follows. Sections 3.2 - 3.5 provide a description of the audio watermarking algorithms, including the theory behind them, encoder and decoder structures, and implementation details. This is followed in Section 3.6 by a performance evaluation of the algorithms with respect to perceptual quality, bit rate, computational complexity, and robustness to signal processing operations. Finally, a review of the chapter's findings is provided in Section 3.7.
3.1 Conventions
In order to provide a common basis to describe and compare the algorithms in this chapter, the following conventions are used. It is assumed that $x(n)$ represents a discrete-time host audio signal of length $N$ samples. This signal is divided into $B = \lfloor N/M \rfloor$ blocks of $M$ samples each. The signal is divided into blocks because it can be assumed that most audio signals exhibit local stationarity within blocks of less than 30 ms in length. In this case, second-order stationarity allows for analysis of the audio signal's mean and variance, which is useful for some of the algorithms. $\bar{x}(n)$ represents the watermarked audio signal, while $x_m(n)$ and $\bar{x}_m(n)$ indicate the $m$th block in the original and watermarked signals, respectively, for $0 \le m \le B - 1$. As mentioned in Section 1.4.1, dividing the host signal into blocks is a simple method for allowing a variable number of bits to be embedded. Therefore, it is assumed that one watermark bit is embedded in each block, and this sequence of $B$ bits is denoted by $w(m) \in \{-1, +1\}$, for $0 \le m \le B - 1$. A bit extracted from the watermarked signal is denoted $\tilde{w}(m)$.
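As a concrete sketch of these conventions (assuming that trailing samples which do not fill a complete block are simply left unmarked; the thesis does not specify this case):

```python
import numpy as np

def split_blocks(x, M):
    """Divide a host signal x(n) of N samples into B = floor(N/M) blocks of M samples."""
    B = len(x) // M
    return x[:B * M].reshape(B, M)

# A 30 ms block at a 44.1 kHz sampling rate is 1323 samples; one bit
# w(m) in {-1, +1} would be embedded in each of the B rows returned.
x = np.arange(10.0)
blocks = split_blocks(x, M=4)   # B = 2; the last 2 samples are left over
```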
The first algorithm studied, called the echo coding algorithm, embeds data into a host signal by adding a small amount of resonance, or echo. Bender et al., who introduced the algorithm, contend that natural signals such as recorded speech and music already contain resonance introduced during the recording process, such as the echoes present within a studio or concert hall [36]. They claim that the human ear is accustomed to hearing this slight resonance in commercial music, so adding more will not significantly impair the quality of the sound. This artificial form of resonance can be modeled mathematically as a linear system consisting of an impulse followed by a weighted, delayed impulse:

$$ h(n) = \delta(n) + \alpha\, \delta(n - n_o) \qquad (3.1) $$
where $\alpha$ is the magnitude of the echo, and $n_o$ is the echo delay. The impulse response of the system, $h(n)$, is referred to as an echo filter. It will be shown that $\alpha$ should be kept small compared to the magnitude of the host audio signal. Bender et al. report that an echo delay of less than 1/1000th of a second is not perceivable by humans. It is important to analyze the distortion introduced into a signal by the addition of echo. In the frequency domain, the echo filter's magnitude and phase responses are functions of both the echo delay and the amplitude of the delayed signal:

$$ |H(e^{j\omega})| = \sqrt{1 + 2\alpha \cos(\omega n_o) + \alpha^2} \qquad (3.2) $$

$$ \angle H(e^{j\omega}) = \arctan\left[ \frac{-\alpha \sin(\omega n_o)}{1 + \alpha \cos(\omega n_o)} \right] \qquad (3.3) $$

Figure 3.1 shows a plot of the magnitude and phase frequency responses for $\alpha = 0.1$ and a delay of $n_o = 5$ samples. It can be seen that the echo introduces a significant distortion into the host signal's magnitude response, along with a nonlinear phase response. Such a large gain distortion would probably be unacceptable in many applications. The magnitude of these distortions varies with frequency and is directly proportional to $\alpha$. Therefore, $\alpha$ should be kept small to minimize the distortion.
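The responses of Equations 3.2 and 3.3 are easy to verify numerically; for $\alpha = 0.1$ the gain ripples between 0.9 and 1.1, which is the distortion noted above.

```python
import numpy as np

def echo_filter_response(alpha, n_o, num_points=512):
    """Frequency response of h(n) = delta(n) + alpha * delta(n - n_o),
    i.e. H(e^jw) = 1 + alpha * e^(-jw n_o) (Equations 3.2 and 3.3)."""
    w = np.linspace(0.0, np.pi, num_points)
    H = 1.0 + alpha * np.exp(-1j * w * n_o)
    return w, np.abs(H), np.angle(H)

w, mag, phase = echo_filter_response(alpha=0.1, n_o=5)
```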
Figure 3.1: Magnitude and phase responses of an echo filter with an echo amplitude of $\alpha = 0.1$ delayed by $n_o = 5$ samples.
Figure 3.2: Structure of the echo coding algorithm's encoder.

An echo filter may be used for watermarking by varying the filter delay according to the bit, $w(m)$, to be embedded within the host signal. This approach will be described further in the following section.
$$ x_{-1}(n) = x_m(n) + \alpha\, x_m(n - \delta_{-1}) \qquad (3.4) $$

$$ x_{+1}(n) = x_m(n) + \alpha\, x_m(n - \delta_{+1}) \qquad (3.5) $$
Finally, a mixer stage takes the two echoed source signals and selects one of them as the current output block, acting as a 2:1 multiplexer, according to the bit to be embedded within the block:

$$ \bar{x}_m(n) = \begin{cases} x_{-1}(n) & \text{if } w(m) = -1 \\ x_{+1}(n) & \text{if } w(m) = +1 \end{cases} \qquad (3.6) $$
Figure 3.3: Transition bands employed to minimize the phase difference between blocks containing different bits.

At the boundary between two blocks containing different bits, there will be a significant change of phase in the watermarked signal. This phase change may become audible to the listener, so to minimize the distortion there are two modifications that can be made to the algorithm. First of all, the echo delays $\delta_{-1}$ and $\delta_{+1}$ should be set close together, ideally only one or two samples apart. This will minimize the difference in magnitude and phase responses between the two echo filter kernels. In addition, a transition period can be incorporated into the encoder's mixer stage, in order to "ramp down" the current block's signal and "ramp up" the next block's signal over the course of a fixed number of samples around the boundary between the two blocks. This modification to the structure introduces two additional mixer signals, $s_{-1}(n)$ and $s_{+1}(n)$, illustrated in Figure 3.3, which can be used to rewrite Equation 3.6 in the form

$$ \bar{x}_m(n) = s_{-1}(n)\, x_{-1}(n) + s_{+1}(n)\, x_{+1}(n) \qquad (3.7) $$
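A sketch of the encoder of Figures 3.2 and 3.3 follows; the echo amplitude, delays, and ramp length here are illustrative choices, not the thesis's parameters.

```python
import numpy as np

def delay(x, d):
    """x(n - d) with zero padding at the start."""
    return np.concatenate([np.zeros(d), x[:-d]])

def echo_encode(x, bits, M, alpha=0.1, d_minus=10, d_plus=12, ramp=64):
    """Embed one bit per M-sample block by selecting between two echo kernels
    (Equations 3.4-3.6), crossfading over `ramp` samples at bit transitions
    (the mixer signals of Equation 3.7)."""
    y = {-1: x + alpha * delay(x, d_minus),   # x_{-1}(n), Eq. 3.4
         +1: x + alpha * delay(x, d_plus)}    # x_{+1}(n), Eq. 3.5
    out = np.empty_like(x)
    for m, b in enumerate(bits):              # 2:1 multiplexer, Eq. 3.6
        out[m * M:(m + 1) * M] = y[b][m * M:(m + 1) * M]
    for m in range(1, len(bits)):             # smooth phase jumps (Figure 3.3)
        if bits[m] != bits[m - 1]:
            t = np.linspace(0.0, 1.0, ramp)
            seg = slice(m * M, m * M + ramp)
            out[seg] = (1 - t) * y[bits[m - 1]][seg] + t * y[bits[m]][seg]
    return out

marked = echo_encode(np.ones(256), bits=[-1, +1], M=128)
```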
At the decoder, the embedded data bit may be retrieved by determining the length of the echo delay introduced into the host signal within the current block. This may be accomplished by analyzing the cepstrum of the watermarked signal [7]. The real-valued cepstrum of a signal $x(n)$ is formed from the inverse Discrete Fourier Transform of the natural logarithm of the magnitude frequency response of $x(n)$:

$$ \hat{x}(n) = \mathcal{F}^{-1}\left\{ \ln\left| \mathcal{F}\{x(n)\} \right| \right\} \qquad (3.8) $$
Let $y(n)$ denote the convolution of a signal $x(n)$ with an echo filter kernel $h(n)$ as defined in Equation 3.1. In [7], it is shown that the cepstrum of $y(n)$ may be written as

$$ \hat{y}(n) = \hat{x}(n) + \hat{h}(n) \qquad (3.9) $$

where $\hat{x}(n)$ and $\hat{h}(n)$ represent the cepstra of $x(n)$ and $h(n)$, respectively. A more precise mathematical expression can also be derived for $\hat{h}(n)$ by noting that for $|\alpha| < 1$, $\log(1 + \alpha)$ may be written as a power series expansion of the form

$$ \log(1 + \alpha) = \sum_{k=1}^{\infty} (-1)^{k+1} \frac{\alpha^k}{k} \qquad (3.10) $$
Using the z-transform representation of the echo filter from Equation 3.1, $\hat{h}(n)$ may be obtained from the inverse z-transform of $\hat{H}(z)$ [7]:

$$ \hat{h}(n) = \mathcal{Z}^{-1}\{\hat{H}(z)\} = \mathcal{Z}^{-1}\{\log(1 + \alpha z^{-n_o})\} = \sum_{k=1}^{\infty} \frac{(-1)^{k+1} \alpha^k}{k}\, \delta(n - k n_o) \qquad (3.11) $$

In other words, the cepstrum of the watermarked block consists of the cepstrum of $x_m(n)$ plus an infinite series of decaying impulses at integer multiples of either $\delta_{-1}$ or $\delta_{+1}$, depending on the embedded bit:

$$ \hat{\bar{x}}_m(n) = \hat{x}_m(n) + \sum_{k=1}^{\infty} \frac{(-1)^{k+1} \alpha^k}{k}\, \delta(n - k\, \delta_{w(m)}) \qquad (3.12) $$
If the original signal is available at the decoder, then $\hat{h}(n)$ can be obtained by computing the cepstrum of $x_m(n)$ and simply subtracting it from the cepstrum of the watermarked block. Within the framework of a public watermarking system, since the cepstrum of $x_m(n)$ is present and will interfere with the signal at the decoder, it is necessary to enhance $\hat{h}(n)$ by ensuring that $\alpha$ is large enough to make the largest impulse of $\hat{h}(n)$ detectable at the receiver. The cepstrum $\hat{\bar{x}}_m(n)$ will possess a peak at the echo delay. If a peak occurs at $\hat{\bar{x}}_m(\delta_{-1})$, then a $-1$ bit was encoded in the block. Otherwise, a peak at $\hat{\bar{x}}_m(\delta_{+1})$ indicates that a $+1$ bit was embedded.
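The cepstral decoder can be sketched as follows; the bit decision simply compares the cepstrum at the two candidate delays. The sanity check below applies it to the echo kernel alone, where Equation 3.11 predicts the peak exactly.

```python
import numpy as np

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum (Equation 3.8)."""
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12), n=len(x))

def echo_decode(block, d_minus, d_plus):
    """Decide the embedded bit from the cepstral peaks at the candidate delays."""
    c = real_cepstrum(block)
    return -1 if c[d_minus] > c[d_plus] else +1

# Kernel alone: h(n) = delta(n) + 0.5 * delta(n - 10), i.e. delay d_minus = 10.
h = np.zeros(1024)
h[0], h[10] = 1.0, 0.5
```

In a public system the host cepstrum is added on top of this peak, which is exactly the interference illustrated in Figure 3.5.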
Figure 3.4: Bit error rate of the echo coding algorithm for varying delays ($N$).
in order to minimize distortion in the host signal's magnitude and phase. However, in the previous section it was noted that if the original signal is not available at the decoder, then $\alpha$ should be large in order to detect the echo delay through the interference caused by the presence of $\hat{x}(n)$ in the cepstrum of the watermarked block. These conflicting constraints introduce a tradeoff into the echo coding algorithm: to increase the reliability of the encoding, the quality of the host signal must be compromised to a certain degree. To illustrate this tradeoff, consider the plots in Figure 3.5, which show an audio signal of $M = 2048$ samples to which an echo filter kernel has been applied with $\alpha = 0.1$ and $n_o = 10$ samples. The figure shows $\hat{y}(n)$, the cepstrum of the output signal, where the impulses of $\hat{h}(n)$ are visible at delays of $n_o$ samples. Included in the plots are $\hat{x}(n)$ and $\hat{h}(n)$, the constituent components of $\hat{y}(n)$. It can be seen that the cepstrum of the original signal interferes with the detection of the echo filter impulses.
3.2.3.2 Discussion
The echo coding algorithm possesses certain features that make it attractive as a watermarking technology, most notable of which is its simplicity. The encoder is a simple linear system, which makes it easy to implement in hardware or incorporate into an existing audio recording system. In addition, no private key sequence is required to embed or extract data from the host signal. The algorithm is also relatively resistant to synchronization errors, so misalignment by a few samples, up to the limits of the transition band, will not significantly affect computation of the cepstrum at the decoder. Unfortunately, certain features of the echo coding technique may limit its practical use. First of all, computation of the cepstrum and autocorrelation at the receiver may be too expensive if the audio signal is divided into large blocks and if the FFT cannot be employed. Since no private key sequence is required to "unlock" the watermark data, it is easy to detect and remove the watermark by applying an inverse
Figure 3.5: Example of applying an echo filter kernel to an audio signal, and detection of the echo filter delay using the cepstrum. (a) $x(n)$; (b) $\hat{y}(n)$; (c) $\hat{x}(n)$; (d) $\hat{h}(n)$.
filter of the form

$$ H^{-1}(z) = \frac{1}{1 + \alpha z^{-n_o}} \qquad (3.13) $$

where $n_o$ corresponds to the desired filter delay to remove. Finally, the algorithm relies upon a relatively weak property of the human audio system to embed data, and the quality of the output signal suffers as a result of having to increase the distortion to improve the reliability of the encoding.
(3.14)
The human auditory system is not able to discern the phase difference between the two signals (unless, of course, $\phi_o$ is such that the two signals are completely out of phase). However, if the phase difference is allowed to vary with time, according to
(3.15)
then the listener will be able to detect this. Note that although the time-varying phase difference may be detectable, the constant phase difference between the sinusoids, $\phi_o$, will not be discernible.
Bender et al., authors of the echo coding algorithm reviewed earlier, proposed another watermarking scheme based on the HAS sensitivity to phase described above [36]. In their approach, they divide the host audio sequence into a set of equal-length segments and compute the DFT for each segment, equivalent to computing the STFT. However, as will be seen shortly, the algorithm introduces a constant phase change in the segments of the host signal while maintaining the time-varying phase difference of the original signal.
(3.16)
To embed a bit of information into the current block, the phase of the first subblock, $\phi_0(k)$, is replaced with a unique phase signature $\Phi(k)$ corresponding to the desired data bit:

$$ \phi_0'(k) = \begin{cases} \Phi(k) & \text{if } w(m) = -1 \\ -\Phi(k) & \text{if } w(m) = +1 \end{cases} \qquad (3.17) $$

The phase of each subsequent subblock is replaced with this phase signature plus the sum of the phase differences up to the given subblock. In this manner, the relative phase of each subblock is preserved, and the long-term phase of the block itself is maintained:

$$ \phi_i'(k) = \phi_0'(k) + \sum_{j=1}^{i} \Delta\phi_j(k) \qquad (3.18) $$
As a final step, the watermarked block $\bar{x}_m(n)$ is obtained by computing the inverse DFT of each subblock using the original magnitude response $M_i(k)$ and the modified phase response $\phi_i'(k)$. A block diagram of the encoder, illustrating the STFT, phase difference calculation, and phase signal reconstruction, is shown in Figure 3.6.
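A sketch of the encoder follows. The phase signature $\Phi(k)$ used here is an arbitrary assumed shape; the thesis does not specify one.

```python
import numpy as np

def phase_encode_block(x_block, bit, num_sub=4):
    """Phase coding sketch: replace the first subblock's phase with a signature
    +/-Phi(k) (Equation 3.17) and propagate the original subblock-to-subblock
    phase differences (Equation 3.18)."""
    L = len(x_block) // num_sub
    subs = x_block[:num_sub * L].reshape(num_sub, L)
    spectra = np.fft.rfft(subs, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    dphi = np.diff(phase, axis=0)                            # Delta-phi_j(k)
    Phi = np.linspace(-np.pi / 2, np.pi / 2, mag.shape[1])   # assumed signature
    new_phase = np.empty_like(phase)
    new_phase[0] = Phi if bit == -1 else -Phi                # Eq. 3.17
    for i in range(1, num_sub):
        new_phase[i] = new_phase[0] + dphi[:i].sum(axis=0)   # Eq. 3.18
    # Inverse DFT with original magnitudes and modified phases.
    return np.fft.irfft(mag * np.exp(1j * new_phase), n=L, axis=1).ravel()

rng = np.random.default_rng(0)
marked = phase_encode_block(rng.normal(size=1024), bit=-1)
```

Because the signature is written into the first subblock exactly, a decoder can recover it from the phase of that subblock alone, which is also why corrupting the first subblock destroys the watermark.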
$$ \tilde{w}(m) = \begin{cases} -1 & \text{if } \rho(\Phi) > \rho(-\Phi) \\ +1 & \text{if } \rho(\Phi) < \rho(-\Phi) \end{cases} \qquad (3.19) $$

where $\rho(\pm\Phi)$ denotes the correlation coefficient between the phase extracted from the first subblock and each of the two candidate phase signatures.
Note that access to the original signal is not required to extract the embedded bits, and that even having x(n) available at the decoder will not be useful. The decoder also requires synchronization of the watermarked signal with the block and subblock boundaries.
Figure 3.6: Block diagram of the phase coding algorithm's encoder.
The subblock size was set to $M/16$, where $M$ is the size of each block in samples. The primary advantage of the phase coding algorithm is that, like the echo coding algorithm described earlier, the encoder and decoder structures are conceptually simple. In particular, it should be noted that the DFT and IDFT computations on each of the subblocks may be performed in parallel. In addition, the use of the correlation coefficient as a means of deciding between bits at the decoder means that one would expect the algorithm to perform well in the presence of additive noise. Unfortunately, it will be shown in Section 3.6 that the watermarked signals generated using this algorithm are of relatively poor quality, even with the single phase coefficient modification introduced above. Also, because only the first subblock is used to decode the bits, synchronization of the watermarked signal is necessary to correctly extract them. In addition, removing or corrupting the first subblock effectively destroys the watermark data. Delaying the subblock by $n_o$ samples introduces a linear phase change into the signal.
PN sequences generated by maximal-length linear feedback shift registers (LFSRs) are often used because they are simple to design, analyze, and implement in hardware [5]. PN sequences have several properties that make them attractive as spreading signals [16]. First, PN sequences typically have the same statistical properties as white noise, such as a wide and relatively flat power spectrum; a data signal spread with a PN sequence will occupy a correspondingly wide bandwidth. Second, PN sequences are periodic and deterministic in nature, meaning they can be predicted. This is important at the receiver for synchronization, because the spreading signal must be matched with the spread data signal so that the information can be decoded properly. Spread spectrum techniques can be incorporated into a digital watermarking algorithm by recalling that in an additive model of watermarking, the watermark data signal is "corrupted" by a noisy channel consisting of the host audio signal:
$$ \bar{x}_m(n) = x_m(n) + \alpha\, w(m) \qquad (3.21) $$

where $\alpha\, w(m)$ denotes the weighted data signal that is to be transmitted. Spreading the data signal introduces two important benefits that are attractive from both a communications theory and a digital watermarking viewpoint. First, the spread signal is more resilient to jamming noise. The power spectrum of the host audio signal is usually not flat, but possesses a lowpass or bandpass characteristic. By increasing the bandwidth of the data signal to occupy frequencies separate from those of the host signal, the encoding can be made more reliable. Second, since PN sequences appear as random signals and spread the bandwidth of the data signal, it becomes more difficult for an unauthorized party to detect and remove the watermark from the host signal.
Figure 3.7: Magnitude spectrum of a PN sequence, $|P(e^{j\omega})|$.

Two commonly used spread spectrum techniques are direct sequence spread spectrum (DSSS) and frequency hopped spread spectrum (FHSS). In the former, a watermark signal is modulated directly by a PN sequence:

$$ y(n) = w(n)\, p(n) \qquad (3.22) $$

where $w(n)$ denotes the watermark signal and $p(n)$ represents the PN sequence. If $p(n)$ occupies a wide spectral bandwidth (like white noise), then the bandwidth of $y(n)$ will be expanded due to the convolution of the two signals in the frequency domain:

$$ Y(e^{j\omega}) = W(e^{j\omega}) * P(e^{j\omega}) \qquad (3.23) $$

Figure 3.7 illustrates the magnitude spectrum of a zero-mean PN sequence 512 samples in length. In the FHSS approach, the PN sequence is used to randomly select from a set of predefined frequencies, and these frequencies are used to control the carrier frequency of the data signal. The carrier frequency "hops" across the spectrum at set intervals of time, in a pattern determined by the PN sequence.
In the discussion that follows, it is assumed that a bipolar PN sequence of the form $p(n) \in \{-1, +1\}$ is available for use at the encoder and decoder, and that the sequence has zero mean and a relatively flat power spectrum.
$$ \bar{x}_m(n) = x_m(n) + \alpha\, w(m)\, p(n) \qquad (3.24) $$

In the equation above, $\alpha$ represents a constant weighting factor that can be used to control the level of distortion added to the host signal. Since $w(m)$ is constant within the block, the spectrum of the added noise assumes the shape of the spectrum of $p(n)$:

$$ \alpha\, w(m)\, p(n) \;\Leftrightarrow\; \alpha\, [W(e^{j\omega}) * P(e^{j\omega})] = \alpha\, w(m)\, [\delta(\omega) * P(e^{j\omega})] = \alpha\, w(m)\, P(e^{j\omega}) \qquad (3.25) $$

where $*$ denotes the convolution operator. A block diagram of the DSSS encoder is shown in Figure 3.8.
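The DSSS embedding of Equation 3.24, together with the correlation decoding developed below in Equations 3.30-3.33, can be sketched as:

```python
import numpy as np

def dsss_embed(block, bit, p, alpha):
    """x_bar_m(n) = x_m(n) + alpha * w(m) * p(n)   (Equation 3.24)."""
    return block + alpha * bit * p

def dsss_decode(block, p):
    """Correlate with the synchronized PN sequence and take the sign."""
    return +1 if np.dot(block, p) >= 0 else -1

rng = np.random.default_rng(1)
M = 4096
p = rng.choice([-1.0, 1.0], size=M)     # stand-in for an LFSR PN sequence
host = rng.normal(0.0, 0.1, size=M)     # a quiet host block
marked = dsss_embed(host, bit=-1, p=p, alpha=0.05)
```

With these values the correlator's signal term is $\alpha M = 204.8$, well above the host-interference term, so the sign decision is reliable; a louder host or smaller $\alpha$ shrinks that margin, which is exactly the tradeoff quantified by Equation 3.37.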
Figure 3.8: Block diagram of the DSSS encoder.

In the FHSS-style algorithm, the discrete cosine transform (DCT) is used to transform the original signal block, $x_m(n)$, into the frequency domain:

$$ X_m(k) = c(k) \sqrt{\frac{2}{M}} \sum_{n=0}^{M-1} x_m(n) \cos\left[ \frac{(2n+1)k\pi}{2M} \right], \qquad c(0) = \tfrac{1}{\sqrt{2}},\; c(k) = 1 \text{ otherwise} \qquad (3.26) $$
The result is a set of $M$ frequency-domain coefficients, where $M$ is the length of the block in samples. Then, a subset of $S \ll M$ coefficients is selected to contain watermark data:

$$ \mathcal{S} = \{ s_i \in \mathbb{Z} \;|\; 0 \le s_i \le M - 1,\; 0 \le i \le S - 1 \} \qquad (3.27) $$
The coefficients are modified by using a PN sequence, $p(k) \in \{-1, +1\}$, of length $S$ samples, modulating the bit to be embedded within the block with this short PN sequence, and then adding this noise-like sequence to the selected coefficients:

$$ \bar{X}_m(k) = \begin{cases} X_m(k) + \alpha\, w(m)\, p(k) & \text{if } k \in \mathcal{S} \\ X_m(k) & \text{otherwise} \end{cases} \qquad (3.28) $$

where, as with the DSSS algorithm, $\alpha$ is a parameter used to control the noise power. The final step is to construct the watermarked signal by using the inverse DCT to convert the modified frequency-domain signal into the watermarked block:

$$ \bar{x}_m(n) = \mathrm{DCT}^{-1}\left\{ \bar{X}_m(k) \right\} \qquad (3.29) $$

The subset of $S$ modified coefficients may be fixed for the entire audio sequence, or it may vary with each block. It is important to note the difference between this
Figure 3.9: Block diagram of the FHSS encoder.

approach and an implementation of FHSS in a digital communications system. Rather than modulate the frequency of a single carrier wave, in this approach a set of $S$ carrier waves is used. The techniques are similar, in that the frequencies of the modified coefficients may vary with each block, so that they appear to "hop" across the spectrum. A PN sequence is still required to spread the bit within each block, and in this case a second PN sequence may be used to determine the frequencies to modify for each block. In their image watermarking scheme, Cox et al. proposed sorting the DCT coefficients by magnitude, and then selecting those with the largest magnitude for modification. An illustration of the FHSS encoder structure is shown in Figure 3.9. There are two key differences between the DSSS and FHSS algorithms. First of all, since the addition of the noise-like signal is performed in the frequency domain for FHSS, the noise power is spread throughout the watermarked block in the time domain. Since only a subset of frequencies is used, the noise power at each modified coefficient, localized in frequency, may be increased without causing a corresponding increase in the time-domain noise variance. In other words, $\alpha$ may be increased in the frequency domain without creating a corresponding increase of noise in the time domain.
A similar decoding process is used for both the DSSS and FHSS algorithms. Consider first the case of the DSSS decoder. The embedded bit is extracted by computing the correlation of the watermarked signal block with a synchronized version of the PN sequence:

$$ C = \sum_{n=0}^{M-1} \bar{x}_m(n)\, p(n) = \sum_{n=0}^{M-1} \left[ x_m(n) + \alpha\, w(m)\, p(n) \right] p(n) = \sum_{n=0}^{M-1} x_m(n)\, p(n) + \alpha\, w(m) \sum_{n=0}^{M-1} p^2(n) \qquad (3.30) $$

Given that $p(n)$ is a noise-like signal with zero mean, as is the case with most PN sequences, the correlation of the original signal with $p(n)$ in the equation above may be assumed to be very low:

$$ \sum_{n=0}^{M-1} x_m(n)\, p(n) \approx 0 \qquad (3.31) $$

Since $p(n)$ is bipolar, $p^2(n) = 1$ for all $n$, and the correlation reduces to:

$$ C \approx \alpha\, w(m) \sum_{n=0}^{M-1} p^2(n) = \alpha\, M\, w(m) \qquad (3.32) $$

Since $w(m)$ is bipolar, the extracted bit may be obtained from the sign of the correlation computed above:

$$ \tilde{w}(m) = \mathrm{sgn}(C) = \mathrm{sgn}[\alpha\, M\, w(m)] = w(m) \qquad (3.33) $$
For the case of FHSS encoding, the watermarked signal $\bar{x}_m(n)$ is first transformed into the frequency domain using the DCT, and the correlation of the marked coefficients with the PN sequence is computed in the same manner as in Equation 3.30 above. Note that in addition to the synchronized PN sequence, the frequencies of the marked coefficients must also be available at the decoder. If the original audio block is available, such as within a private watermarking framework, then it may be subtracted from the watermarked block prior to correlation
in order to improve the reliability of detection at the decoder. If the original signal is not available, then one may have to introduce additional processing at the decoder in case the assumption of Equation 3.31 does not hold.
(3.34)
This is a reasonable assumption, for it was shown in Chapter 1 that additive noise is a commonly used attack on watermarked signals. Also assume for now that the original signal block, $x_m(n)$, is available at the decoder and can be subtracted from the watermarked signal. The resulting signal presented to the correlation receiver has the form:

$$ \tilde{x}_m(n) = \alpha\, w(m)\, p(n) + v(n) \qquad (3.35) $$

Applying the correlation formula of Equation 3.30, the extracted bit with a noise term may be written as

$$ C = \sum_{n=0}^{M-1} \tilde{x}_m(n)\, p(n) = \sum_{n=0}^{M-1} \left[ \alpha\, w(m)\, p(n) + v(n) \right] p(n) = \alpha\, M\, w(m) + \sum_{n=0}^{M-1} v(n)\, p(n) \qquad (3.36) $$
Since $v(n)$ has zero mean and is uncorrelated with $p(n)$, for $M \gg 1$ the second term in the correlation summation will be approximately zero. Therefore, for large block sizes, it is predicted that the spread spectrum algorithms possess a strong resilience to additive noise distortions.
However, when the block size is small, the correlation of p(n) and v(n) will not be zero, and it is important to be able to quantify the corresponding bit error for varying noise power. This was obtained experimentally by using Equation 3.36 for varying block sizes and noise ratios, and the results are shown in Figure 3.10. This figure illustrates the probability of bit error, P_B, as a function of signal-to-noise ratio (SNR) for various block sizes. The probability of bit error may be approximated mathematically by the expression [16]:
\[
P_B = Q\!\left(\frac{\alpha\sqrt{M}}{\sigma_v}\right) \tag{3.37}
\]
where M denotes the block size in samples (for DSSS), or the number of modified frequency-domain coefficients in the FHSS algorithm, σ_v is the standard deviation of the additive noise, and α is the power of the watermark signal from Equation 3.24. Q(x), the complementary error function, is defined as:
\[
Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\!\left(\frac{-u^2}{2}\right) du. \tag{3.38}
\]
From the P_B equation above, it is clear that either increasing the block size or increasing the watermark power has a significant effect on the reliability of the encoding. For the case where x_m(n), the original host signal, is not available at the receiver, the presence of the host signal will cause decoding problems. Solutions to this problem are proposed in the next section.
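The predicted error probability of Equations 3.37 and 3.38 is easy to evaluate numerically. The sketch below expresses Q(x) through the standard-library erfc; unit-variance noise is assumed by default, which is a convention chosen here for illustration.

```python
import math

def Q(x):
    """Gaussian tail (complementary error) function: Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def predicted_ber(alpha, M, sigma_v=1.0):
    """Predicted bit error probability P_B = Q(alpha * sqrt(M) / sigma_v) (sketch)."""
    return Q(alpha * math.sqrt(M) / sigma_v)

assert abs(Q(0.0) - 0.5) < 1e-12
# Increasing either the block size or the watermark power lowers the error rate.
assert predicted_ber(0.05, 2048) < predicted_ber(0.05, 512)
assert predicted_ber(0.10, 512) < predicted_ber(0.05, 512)
```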
Figure 3.10: Error rate as a function of SNR for the spread spectrum algorithms.
It is important to recognize that with the FHSS approach, α may be increased, because by modifying only a subset of frequency components, the noise power is spread throughout the time domain signal. In other words, a higher level of noise at a subset of frequencies corresponds to a lower power of noise within each sample in the time domain. In this investigation, the number of coefficients modified in the frequency domain was set to S = M/32, or roughly three percent of the coefficients.
Applying the AR model coefficients a(n) to the host signal block produces the prediction error:
\[
x_m(n) \ast a(n) = v_x(n) \tag{3.40}
\]
where v_x(n) is a random signal of variance σ_x² corresponding to the prediction error. When applied to the watermarked signal block, x̃_m(n), the output of the filter is
\[
\tilde{x}_m(n) \ast a(n) = [x_m(n) + \alpha\, w(m)\, p(n)] \ast a(n) = \alpha\, w(m)\,[p(n) \ast a(n)] + v_x(n) \tag{3.41}
\]
In other words, convolution of the AR model coefficients with the watermarked signal results in two signals: p(n) ∗ a(n) weighted by the watermark bit, and the random prediction error v_x(n). However, recall that p(n) is a noise-like signal with zero mean and a variance σ_p² = 1. The output of the correlator will be
\[
c(m) = \sum_{n=0}^{M-1} [\tilde{x}_m(n) \ast a(n)]\, p(n) = \sum_{n=0}^{M-1} \{\alpha\, w(m)\,[p(n) \ast a(n)] + v_x(n)\}\, p(n) \tag{3.42}
\]
Neglecting the contribution of the correlation of the random prediction error and the PN sequence, the correlator output has the following form:
\[
c(m) = \alpha\, w(m) \sum_{n=0}^{M-1} [p(n) \ast a(n)]\, p(n) \tag{3.43}
\]
However, it was shown in [7] that when a random signal is convolved with a linear filter, the cross-correlation of the filter output with the random input is a function of only the variance of the input and the filter coefficients. This property can be used to simplify the expected value of the correlator output:
\[
E[c(m)] = \alpha\, w(m)\, \sigma_p^2 \sum_{n=0}^{K-1} a(n) = \alpha\, w(m) \sum_{n=0}^{K-1} a(n) \tag{3.44}
\]
noting the change in summation to reflect the fact that the filter length K will be shorter than the length of the watermarked block, and E[·] denotes the statistical expectation operator. As long as the sum of the AR model coefficients is greater than zero,
\[
\sum_{n=0}^{K-1} a(n) > 0, \tag{3.45}
\]
then the sign of the correlator output may be used to extract the embedded bit.
Figure 3.11: Highpass filter used to prefilter host signals watermarked with the DSSS algorithm.

In this investigation, a symmetric highpass Finite Impulse Response (FIR) filter of length K = 11 samples, constructed using a Hamming window, was used as a prefilter for decoding the DSSS algorithm. An FIR filter was used because it provides a linear phase response, is independent of the host signal (it may be constructed ahead of time), and because the watermark signal is spread throughout the spectrum of the host signal. The frequency response of the filter is shown in Figure 3.11. The AR modeling technique was applied at the decoder of the FHSS algorithm by constructing an 11th-order whitening filter for each watermarked block. The filter order was chosen as a tradeoff between the accuracy of the AR model and the computational cost of computing the model coefficients. A time-averaged estimate of the block's autocorrelation function r(n) was computed, and the AR coefficients were constructed using r(n) and the Levinson-Durbin recursion [42]. A block diagram of
the improved decoder is shown in Figure 3.12. The AR modeling technique was chosen over the highpass prefilter because the latter would not improve decoding for the FHSS algorithm, since not every frequency component is modified. If the distribution of the watermarked subset of frequencies is random, then on average half of the modified coefficients would lie in the lower frequency band. To illustrate how much these two modifications improve the decoding reliability of the DSSS and FHSS algorithms, refer to Figure 3.13 and Figure 3.14. These are experimental plots of the bit error rate (BER), as a function of α, for the two spread spectrum algorithms under three conditions: no prefiltering, highpass prefiltering, and AR modeling. In these tests, a monophonic audio signal from the performance evaluation of Section 3.6 was watermarked using a block size of 2048 samples.

Figure 3.12: Block diagram of the spread spectrum decoder with prefiltering prior to decoding.
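The Levinson-Durbin construction of the whitening filter described above can be sketched as follows. The AR(1) test signal and the model order here are arbitrary choices made for illustration, not the thesis test material.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients a(0..order), a(0) = 1,
    from autocorrelation lags r(0..order) using the Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                                # prediction error power
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err     # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Synthesize an AR(1) process x(n) = 0.9 x(n-1) + e(n) and recover its model.
rng = np.random.default_rng(1)
e = rng.standard_normal(4096)
x = np.zeros(4096)
for n in range(1, 4096):
    x[n] = 0.9 * x[n - 1] + e[n]
r = np.array([np.dot(x[:4096 - m], x[m:]) / 4096 for m in range(3)])  # time-averaged r(n)
a, err = levinson_durbin(r, 2)
assert abs(a[1] + 0.9) < 0.05      # whitening filter is approximately (1, -0.9)
```

Convolving a(n) with the watermarked block then whitens the host prior to correlation, as in the decoder of Figure 3.12.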
3.4.4.3 Discussion
The spread spectrum watermarking techniques described in this section possess several advantages over the echo coding and phase coding algorithms studied previously. First of all, the encoder and decoder architectures are quite simple to implement in hardware, and they are based on well-understood aspects of digital communications theory. Equation 3.37 reveals that the reliability of the watermark encoding can be
Figure 3.13: Comparison of DSSS decoding using highpass prefiltering and AR modeling.
Figure 3.14: Comparison of FHSS decoding using highpass prefiltering and AR modeling.
improved by either increasing the noise power α, or by increasing the length of the block size M. In addition, redundant copies of the watermark may be embedded by using a different PN sequence to spread each copy, because statistically the noise-like PN sequences are uncorrelated. Finally, it is important to note that the watermark signal is additive, so the only distortion introduced into the host signal is low-level noise. Although the autocorrelation property of PN sequences allows for easier synchronization of the host signal at the decoder, a malevolent party may use the same process to determine the bit sequence in order to disrupt or remove the watermark. For example, if an LFSR generator of length n bits is used as the PN sequence, then the Berlekamp-Massey algorithm can be used to reconstruct the LFSR generator structure given only 2n bits of the sequence [43]. In quiet periods of the host audio signal, for example, the only signal present may be the watermark signal, exposing the sequence to such an analysis and attack. This would suggest using a more robust and unpredictable pseudorandom number generator, such as the Blum-Blum-Shub algorithm [44]. However, doing so would require a more complex synchronization system at the decoder. Finally, it should be noted that neither of these two spread spectrum techniques takes advantage of the complex masking properties of the human audio system described in Chapter 2.
Recall that in Chapter 2 the psychoacoustic properties of the Human Audio System (HAS) were introduced. In particular, the chapter explored the absolute detection threshold function T_A(f), based on empirical studies examining the minimum power required for a single tone to be perceptible to a human listener. This function is independent of the audio signal. In addition, the frequency masking concepts of tone-masks-noise, noise-masks-tone, and noise-masks-noise were described. The MPEG Layer I psychoacoustic model was introduced as a standardized procedure for determining the masking threshold function, T_M(f), that is dependent on the local frequency-domain properties of the audio signal. Tewfik et al. have proposed an audio watermarking algorithm that takes advantage of this psychoacoustic model [21]. An implementation of their approach, described in the following sections with modifications and improvements, uses the masking threshold function to control the magnitude and spectral properties of an additive noise-like watermark signal. Doing so ensures that the distortion is inaudible to the listener.
\[
T_M(f) \approx \sum_{n=0}^{K-1} t(n)\, e^{-j2\pi f n} \tag{3.46}
\]
Also, normalize t(n) so that it provides unity gain, or \(\sum_{n=0}^{K-1} t(n) = 1\).

3. Use the coefficients as a noise-shaping filter by convolving it with the block's bit, w(m), spread by a bipolar PN sequence p(n) ∈ {−1, +1}. Note that this
Figure 3.15: Block diagram of the frequency masking encoder.

is similar to the direct sequence spread spectrum (DSSS) encoding process described in Section 3.4, but the additive noise-like signal no longer has a flat power spectrum. Convolving the spread data signal with the filter model has the effect of shaping the spectrum of the noise-like signal to approximate the frequency masking threshold of the host signal. This spread data signal is then added to the host signal:
\[
\tilde{x}_m(n) = x_m(n) + \alpha\, w(m)\, [p(n) \ast t(n)] \tag{3.47}
\]
where t(n) denotes the filter coefficients computed above. Again, α is a constant weighting factor that determines the maximum power of the noise signal. A block diagram of the frequency masking algorithm's encoder is shown in Figure 3.15.
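The shaping-and-adding step of Equation 3.47 can be sketched as below. The unity-gain filter t here is a simple moving average standing in for the masking-threshold approximation filter (a deliberate simplification for illustration, not the MPEG psychoacoustic model), and the block size and alpha are arbitrary.

```python
import numpy as np

M, K = 2048, 10
rng = np.random.default_rng(2)
pn = rng.choice([-1.0, 1.0], size=M)             # bipolar PN sequence p(n)
host = 0.1 * np.sin(2 * np.pi * 440 * np.arange(M) / 44100.0)

# Stand-in for the masking filter t(n): any unity-gain FIR works for the sketch.
t = np.ones(K) / K                               # sum(t) == 1 (unity gain)

alpha = 0.05
for bit in (1, -1):
    shaped = np.convolve(bit * pn, t)[:M]        # spectrally shaped watermark signal
    marked = host + alpha * shaped               # additive embedding as in Eq. 3.47
    # Correlation decoding; the host is subtracted here for illustration only.
    c = np.sum((marked - host) * pn)
    assert int(np.sign(c)) == bit
```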
The watermark bits are extracted using the same correlation receiver as Equation 3.30:
\[
\tilde{w}(m) = \mathrm{sign}\!\left[\sum_{n=0}^{M-1} \tilde{x}_m(n)\, p(n)\right] \tag{3.48}
\]
Note that the same conditions and assumptions of the DSSS decoder apply here. In particular, the PN sequence is assumed to be available at the receiver and synchronized with the watermarked signal. If no prefiltering is applied prior to computing the correlation, it is also assumed that the approximation of Equation 3.31 holds here as well:
\[
\sum_{n=0}^{M-1} x_m(n)\, p(n) \approx 0 \tag{3.49}
\]
If the original block, xm (n), is available at the decoder, then it may be subtracted from the watermarked block prior to decoding.
Applying the correlation formula of Equation 3.30, the extracted bit with a noise term may be written as
\[
C = \sum_{n=0}^{M-1} \{\alpha\, w(m)\,[p(n) \ast t(n)] + v(n)\}\, p(n) \tag{3.51}
\]
However, if it is assumed that p(n) has the same properties as white noise, such as a flat power spectrum, zero mean, and variance σ_p², then the first term in the summation can be simplified. In [7], it is shown that when white noise is passed through a linear filter, the cross-correlation of the output and input signals is a function of only the noise variance and the filter coefficients. Let p'(n) denote the convolution of p(n) and t(n) in the correlation equation above. The cross-correlation function of the PN sequence and the filtered PN sequence is given by
\[
E[p'(n)\, p(n - m)] = \sigma_p^2\, t(m) \tag{3.52}
\]
where E[·] represents the statistical expectation operator. Substituting this into the receiver correlator equation yields a simplified expression for the expected value of the correlator output:
\[
C = \sum_{n=0}^{M-1} \alpha\, w(m)\, E[p'(n)\, p(n)] + \sum_{n=0}^{M-1} v(n)\, p(n) \tag{3.53}
\]
Since p(n) ∈ {−1, +1}, its variance will be 1. Recall that t(n) is constructed to have unity gain, so the sum of the coefficients will be one. t(n) is also limited to K samples, and assumed to be zero elsewhere. Therefore, the simplified correlator equation may be written as
\[
C = \alpha K\, w(m) + \sum_{n=0}^{M-1} v(n)\, p(n) \tag{3.54}
\]
Like the spread spectrum case, for a block size of M ≫ 1 the second term will cancel out. When the block size is not large, then it is important to be able to quantify P_B, the probability of bit error. This was accomplished experimentally by determining P_B for various signal-to-noise ratios. The expression may be approximated by
\[
P_B \approx Q\!\left(\frac{\alpha\sqrt{K}}{\sigma_v}\right) \tag{3.55}
\]
where Q(x) denotes the complementary error function of Equation 3.38. Therefore, the bit error rate is a function of the noise power and the filter length, not the length of the block.
between the frequency masking threshold function and the frequency response of the Kth-order filter [13]. This method was chosen because it provides a close match of the filter coefficients to the masking threshold function. K was set to 10 as a tradeoff between filter accuracy and the cost of computing the coefficients.
3.5.4.2 Selection of α
By experiment, it was found that α may be set to 25 percent of the dynamic range of the host audio signal. This is a significant improvement over the standard DSSS algorithm described previously. Note that this does not mean that the noise-like watermark signal has the same magnitude at each time interval. Since the PN sequence's spectrum is shaped to match the frequency masking function, α represents the maximum level of distortion in the time domain.
3.5.4.4 Discussion
Since the frequency masking algorithm is based on the direct sequence spread spectrum algorithm, it possesses most of the advantages and disadvantages described in Section 3.4.4.3. However, note that since the spectral characteristics of T_M(f) are used to shape the noise-like watermark signal, the power α can be maximized while ensuring that the watermark is imperceptible to the listener. As a result, the quality of the watermarked audio signal is better for larger values of α than with the algorithms described in earlier sections. Unfortunately, the encoder is computationally expensive, because for each signal block the frequency masking threshold function must be computed and an approximation filter constructed prior to encoding. In addition, the correlation decoder is susceptible to the same synchronization problems as the spread spectrum algorithms.
earlier. In general, each of the audio signals was watermarked using the five algorithms 100 times, and the results averaged for each algorithm. In all cases, a different and random watermark signal was generated for each run. This was done in order to remove any dependency of extraction on the watermark data itself. The audio signals were chosen to represent five different classes of commercial music (blues, classical, country, folk, and pop / rock) so that the signals would have a variety of spectral properties. Classical music, for example, is composed of primarily single-tone signals localized in time, such as notes played on a piano. Contrast this with blues, which typically contains music from low-frequency instruments such as the cello.
Figure 3.16: Bit error rate as a function of block size for audio watermarking algorithms.
corresponding to a bit rate of approximately 21 bits per second for a sample rate of 44.1 kHz. In the experiments that follow, a block size of 2048 samples was used. However, it is important to note that depending on the desired reliability of the watermark, a longer block size may be necessary to produce a lower error rate. In practice, a sensible approach to selecting a watermarking algorithm would be to first determine the desired bit error rate for the application, say one percent, and then select from the algorithms that can meet the requirement. Another important tradeoff can be seen from this experiment. If the watermarked signal is truncated, or if entire blocks of the signal are simply removed, then the bits embedded in the affected blocks will be lost. This is particularly important for small blocks, for it is possible to remove small sets of samples at random without affecting the quality of the watermarked signal. In order to guard against such processing, larger blocks should be used, but this limits the bit rate of the system.
Audio Signal   Echo Coding   Phase Coding   DSSS    FHSS    Frequency Masking
BLUES1         21.45         26.23          54.38   49.43   20.14
BLUES2         23.86         27.65          54.19   49.25   24.31
COUNTRY1       16.63         21.67          54.48   49.53   19.54
COUNTRY2       21.34         25.05          54.22   49.27   17.94
CLASSICAL1     21.54         23.53          54.05   49.10   26.43
CLASSICAL2     23.82         29.02          54.07   49.12   28.38
FOLK1          13.52         17.89          54.59   49.64   17.65
FOLK2          14.21         18.22          54.57   49.62   16.94
POP1           14.75         19.08          54.96   50.01   18.53
POP2           14.51         19.09          54.47   49.52   17.84
Average        18.53         22.78          54.40   49.45   20.77

Table 3.1: SNR of watermarked audio signals versus original host signals (in decibels).

function of the host signal. Therefore, it is slightly misleading to only consider the SNR of this approach.
Audio Signal   Echo Coding   Phase Coding   DSSS   FHSS    Frequency Masking
BLUES1         9.40          10.41          4.47   53.60   395.89
BLUES2         10.26         9.91           4.60   54.69   363.26
COUNTRY1       9.94          11.25          4.71   54.82   404.29
COUNTRY2       10.42         12.24          3.26   54.71   422.47
CLASSICAL1     9.19          10.20          4.16   55.29   376.16
CLASSICAL2     10.53         9.90           4.18   54.67   407.16
FOLK1          10.22         10.50          3.24   55.19   415.08
FOLK2          9.08          11.41          3.44   52.78   358.15
POP1           10.83         9.90           4.81   53.98   361.18
POP2           9.94          9.55           3.90   53.84   401.42
Average        9.98          10.53          4.08   54.35   390.51

Table 3.2: Audio watermarking algorithm CPU timings (in seconds).

be transformed into the frequency domain at both the encoder and decoder. However, more computational resources are required to compute the autoregressive (AR) model of the watermarked signal at the receiver in order to whiten the signal prior to decoding. Unless the DCT computation and AR modeling can be implemented more efficiently, the FHSS algorithm may be limited in application. The frequency masking algorithm is clearly the most expensive, requiring an average of 391 seconds to encode each ten-second sample. This is because the masking threshold function, T_M(f), must be computed for each block. In addition, a noise-shaping filter must be constructed with a frequency response that approximates the threshold function. Note that the timings vary more for each run of this algorithm than the others, because each host signal has different spectral properties. Due to its complexity, it may not be possible to implement the frequency masking algorithm to run in real time applications.
or lossy compression, should not completely destroy a watermark embedded within the signal. Measuring how well each watermark can survive distortions provides another tool for choosing between algorithms for a particular application, particularly if the distortions to which the host signal may be subjected are known in advance. The signal processing operations were selected because they do not severely distort the subjective quality of the audio signal, and because they can be used to represent or simulate "real world" distortions.
\[
h(n) = (1 - A)\, h_{LP}(n) + A\, h_{HP}(n) \tag{3.56}
\]
where h_LP(n) and h_HP(n) represent halfband lowpass and highpass filters of length 11 samples, with A varying over 0 ≤ A ≤ 1. High-emphasis filtering was included in this investigation because it simulates the function of a graphic equalizer in stereo systems.
Wiener filtering. For each watermarked signal block, x_m(n), a Kth-order forward prediction filter was constructed and applied to the signal:
\[
\tilde{x}_m(n) = \sum_{k=1}^{K} x_m(n - k)\, h(k) \tag{3.57}
\]
where the coefficients of h(n) were chosen to minimize the mean squared error (MSE) between x_m(n) and x̃_m(n), for 1 ≤ K ≤ 15. The output of the prediction filter is an approximation of the watermarked signal, plus a random prediction error signal. Wiener filtering was used to simulate linear predictive coding (LPC), a common low bit rate audio compression technique.
Median filtering. A median filter was used, designed to replace each sample of the audio signal with the median of its K previous samples, for 1 ≤ K ≤ 15. Median filtering is a non-linear process often used for reducing high-frequency noise in a signal.
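The causal median filter described above can be sketched as follows; the toy input is chosen to show how an isolated spike (high-frequency noise) is rejected.

```python
import numpy as np

def causal_median_filter(x, K):
    """Replace each sample with the median of its K previous samples,
    matching the causal variant described for the robustness tests."""
    y = x.copy()
    for n in range(K, len(x)):
        y[n] = np.median(x[n - K:n])
    return y

x = np.array([0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0])   # isolated spike at n = 3
y = causal_median_filter(x, 3)
assert x[3] == 9.0 and y[3] == 0.0                   # the spike is rejected
```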
The results of this experiment are shown in Figure 3.17. Generally, the phase coding algorithm was the most resilient to almost all of the filtering operations applied, with the exception of highpass and high-emphasis filtering. This is surprising, given that the algorithm is relatively simple to implement and possesses a low computational complexity. This is also disappointing, because the quality of the watermarked signals is much lower than that of the other algorithms studied. The DSSS, FHSS, and frequency masking algorithms provided the best performance under highpass and high-emphasis filtering. This is not surprising, given the fact that the algorithms distribute their signal energy across the entire spectrum, and that the audio signals often possess low-frequency components.
Figure 3.17: Bit error rate after filtering for audio watermarking algorithms.
Figure 3.18: Bit error rate in the presence of additive and coloured noise for audio watermarking algorithms.
For additive noise, the watermarked signal was corrupted with a zero-mean white noise signal v(n):
\[
\tilde{x}(n) = x(n) + v(n) \tag{3.58}
\]
and for coloured noise, the signal was corrupted with noise of the same power, but multiplied by a normalized version of the watermarked signal. Since x(n) was already normalized to lie within the interval 0 ≤ x(n) ≤ 1, the corrupted signal may be written as
\[
\tilde{x}(n) = x(n) + x(n)\, v(n). \tag{3.59}
\]
For each algorithm, the bit error rate was computed as a function of SNR in decibels. A block size of M = 2048 samples was used, and the results of this experiment are shown in Figure 3.18.
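The two noise conditions of Equations 3.58 and 3.59 can be generated as sketched below. The exact scaling convention used in the thesis experiments is not specified here, so the SNR normalization in this sketch is an assumption.

```python
import numpy as np

def corrupt(x, snr_db, coloured=False, rng=None):
    """Corrupt x with zero-mean white noise at the given SNR in dB (as in Eq. 3.58);
    if coloured, the noise is additionally multiplied by the signal (as in Eq. 3.59)."""
    rng = rng if rng is not None else np.random.default_rng()
    v = rng.standard_normal(len(x))
    target_pv = np.mean(x ** 2) / 10.0 ** (snr_db / 10.0)
    v *= np.sqrt(target_pv / np.mean(v ** 2))        # scale noise to the target power
    return x + x * v if coloured else x + v

rng = np.random.default_rng(3)
x = 0.5 + 0.5 * np.sin(2 * np.pi * np.arange(8192) / 64.0)   # signal normalized to [0, 1]
y = corrupt(x, snr_db=20.0, rng=rng)
measured = 10.0 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
assert abs(measured - 20.0) < 1e-9                           # SNR hits the target
```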
In the presence of additive and coloured noise, it is clear that the spread spectrum algorithms (DSSS, FHSS, and frequency masking) perform quite well. It was shown in Section 3.4.3 that the presence of noise does not have a significant impact on spread spectrum techniques for larger block sizes or larger α. Of the three, the frequency masking algorithm performs well due to the larger watermark power. The echo coding algorithm performs poorly in additive and coloured noise environments, because the presence of noise will affect the cepstrum used to extract the watermark bits. Decoding the bits involves evaluating the cepstrum at the two echo filter delays, so additive noise will increase the chance that bits are incorrectly decoded. Phase coding performs very well under noise conditions, and there is a good reason for this. In the study of communications systems, it has been shown that systems employing angle modulation (either frequency or phase) are more resilient to severe noise than amplitude modulation schemes (such as the spread spectrum approaches) [16].
Figure 3.19: Linear and nonlinear quantization functions for K = 5 bits per sample.
Figure 3.20: Bit error rate after quantization using linear and two nonlinear bit allocation functions.
From the plots, it can be seen that the DSSS, FHSS, and frequency masking algorithms provide the most resilience to linear and nonlinear quantization, even at low bits per sample. In the previous experiment it was shown that the spread spectrum algorithms perform well in noisy environments. Quantization of a signal introduces random noise with a variance that varies with the quantization step size, so the results of this section should correspond.
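The link between quantization and additive noise noted above can be sketched as follows. Signals are assumed normalized to [0, 1) as in this chapter; the step-squared-over-twelve variance rule used in the assertion is the standard high-rate approximation, not a result from the thesis experiments.

```python
import numpy as np

def quantize(x, bits):
    """Uniform (linear) quantization of a signal in [0, 1) to 2**bits levels,
    with mid-point reconstruction."""
    levels = 2 ** bits
    return np.floor(x * levels) / levels + 0.5 / levels

rng = np.random.default_rng(4)
x = rng.random(100_000)                      # uniformly distributed test signal
for bits in (4, 8):
    err = quantize(x, bits) - x
    delta = 2.0 ** -bits                     # quantization step size
    # Quantization acts like additive noise of variance approximately delta**2 / 12.
    assert abs(np.var(err) - delta ** 2 / 12.0) < delta ** 2 / 50.0
```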
Figure 3.21: Bit error rate due to lossy compression as a function of bit rate.
quality MPEG compression, are significantly better than echo and phase coding. These are the bit rates most commonly used for distributing music over the Internet.
3.7 Summary
The primary goals of this chapter were to review a selection of audio watermarking algorithms from the literature, and to evaluate them using the framework proposed in Chapter 1. Five algorithms were chosen from the literature to represent several unique approaches to embedding data within audio signals: echo coding, phase coding, direct sequence and frequency hopped spread spectrum, and frequency masking. Since the focus of this thesis is on public watermarks, the original signal could not be used at the decoder to assist in extracting data, presenting an interesting problem for some algorithms. In addition to a description of each algorithm, suggestions were provided on how they could be implemented and improved. Key among these improvements was the incorporation of a whitening filter at the receiver of the DSSS and FHSS algorithms, based on an autoregressive model of the host signal, in an effort to minimize the presence of the host signal. Another goal of this chapter was to evaluate the algorithms using the performance analysis framework introduced in Chapter 1. It was found that the echo coding and phase coding algorithms provided the poorest quality output, while signals watermarked with the spread spectrum and frequency masking algorithms had a higher quality. It was shown that there is a tradeoff between algorithm robustness and bit rate. With respect to signal processing, it was found that the echo coding and phase coding algorithms provided the best resilience to linear and nonlinear filtering operations, with the exception of highpass and high-emphasis filtering. The three spread spectrum algorithms (DSSS, FHSS, and frequency masking) proved considerably robust to additive noise, quantization distortion, and lossy compression. The three techniques were notably less resilient to coloured noise and lowpass filtering operations, with the exception of the frequency masking algorithm.
Six image watermarking algorithms will be reviewed in this chapter, and improvements to their encoder and decoder structures will be proposed. Another goal of this chapter is to apply the performance analysis framework proposed in Chapter 1 as a means of comparing the algorithms. The algorithms evaluated in this chapter were selected to represent the three different approaches to embedding data: spatial domain, frequency domain, and spatial / frequency (multiresolution). They were also chosen to represent a range of computational complexities and implementation structures. Since the focus of this thesis is on public watermarking algorithms, the techniques chosen do not require access to the original image in order to extract the watermark data. In some cases, however, having such access may improve the decoding process. The chapter is organized as follows. Sections 4.2 - 4.3 provide a description of the image watermarking algorithms, including the theory behind them, encoder and decoder structures, and implementation details. This is followed in Section 4.4 by a performance evaluation of the algorithms with respect to perceptual quality, bit rate, computational complexity, and robustness to signal processing operations. Finally, a review of the chapter's findings is provided in Section 4.5.
4.1 Conventions
Similar conventions to those used in the study of audio watermarking algorithms will be used in this chapter. First of all, it is assumed that x(n1, n2) represents a digital host image of size N1 × N2 pixels. This signal is divided into a set of M1 × M2 blocks of size M × M pixels, where M1 = ⌊N1/M⌋ and M2 = ⌊N2/M⌋, as shown in Figure 4.1. Like audio signals, the image is divided into blocks because although images are typically nonstationary as a whole, they exhibit local stationarity within smaller regions. In this case, second-order stationarity allows for analysis of the image's local mean and variance, which is useful for some of the algorithms. x̃(n1, n2) represents the watermarked image, while x_{m1,m2}(n1, n2) and x̃_{m1,m2}(n1, n2)
Figure 4.1: Example of a 512 × 512 image divided into 16 × 16 blocks in the spatial domain. Each block will be used to embed one bit of data.

indicate the ⟨m1, m2⟩ block in the original and watermarked images, respectively, for 0 ≤ m1 ≤ M1 − 1 and 0 ≤ m2 ≤ M2 − 1. Finally, it is assumed that one bit is embedded in each block, and this sequence of M1 × M2 bits is denoted by w(m1, m2) ∈ {−1, +1}, for 0 ≤ m1 ≤ M1 − 1 and 0 ≤ m2 ≤ M2 − 1. As mentioned in the previous chapter, dividing the image up into variable sized blocks conveniently allows for a variable number of bits to be embedded within the image. A bit extracted from the watermarked image is denoted by w̃(m1, m2).
from the initial discussion that, for the basic spread spectrum algorithms, the noise power had to be maintained at a very low level in order to keep the distortion inaudible to the listener, as the Human Audio System is sensitive to low levels of noise at mid-band frequencies. As described in Chapter 2, the Gaussian optical point spread function of the Human Visual System has a lowpass frequency response. Therefore, it is predicted that the human eye will be more tolerant to high frequency noise. Many image watermarking algorithms implicitly take advantage of the lowpass frequency response of the Human Visual System. Two groups (Hartung and Girod, and Cox et al.) introduced image watermarking algorithms that are based upon this premise [40, 41]. The approach of Hartung and Girod operates in the spatial domain, manipulating host image pixels in the same manner as the Direct Sequence Spread Spectrum (DSSS) algorithm introduced in Chapter 3. The algorithm of Cox et al. embeds watermark data into the two-dimensional DCT of the host image. However, in this investigation the two algorithms have been modified so that an arbitrary amount of watermark data may be added, as described in the following sections. In this study, PN sequences are again used as spreading signals for the same reasons they were employed in the previous chapter for audio watermarking: they possess the same statistical properties as white noise, they are deterministic, and they occupy frequencies in excess of the host image's spectrum. The two spread spectrum techniques introduced in the previous chapter, direct sequence (DSSS) and frequency hopped spread spectrum (FHSS), are extended in this discussion to the two-dimensional case. The extension of these algorithms is relatively straightforward, and it will be seen that similar assumptions and improvements apply to the 2D case.
\[
\tilde{x}_{m_1,m_2}(n_1, n_2) = x_{m_1,m_2}(n_1, n_2) + \alpha\, w(m_1, m_2)\, p(n_1, n_2) \tag{4.1}
\]
In the equation above, α represents a constant weighting factor that can be used to control the level of noise added to the host signal. In Section 4.2.4, suitable values for α will be established. Since w(m1, m2) is constant within the block, the spectrum of the added noise assumes the shape of the spectrum of p(n1, n2):
\[
\alpha\, w(m_1, m_2)\, p(n_1, n_2) \;\longleftrightarrow\; \alpha\, w(m_1, m_2)\, P(k_1, k_2) \tag{4.2}
\]
where P(k1, k2) denotes the spectrum of the PN sequence. For the FHSS algorithm, each block is first transformed into the frequency domain using the 2D-DCT:
\[
X_{m_1,m_2}(k_1, k_2) = \mathrm{DCT}[x_{m_1,m_2}(n_1, n_2)] \tag{4.3}
\]
The result is a set of M × M frequency domain coefficients, where M is the size of the block in pixels. Then, a subset of S ≤ M × M coefficients is selected to contain watermark data:
\[
\mathcal{S} = \{\, s_{i,j} \in \mathbb{Z} \mid 0 \le s_{i,j} \le M - 1,\; 0 \le i, j \le S - 1 \,\} \tag{4.4}
\]
The coefficients are modified by using a PN sequence, p(k1, k2) ∈ {−1, +1}, of size S samples, modulating the bit to be embedded within the block with this short PN sequence, and then adding this noise-like sequence to the selected coefficients:
\[
\tilde{X}_{m_1,m_2}(k_1, k_2) = X_{m_1,m_2}(k_1, k_2) + \begin{cases} \alpha\, w(m_1, m_2)\, p(k_1, k_2) & \langle k_1, k_2 \rangle \in \mathcal{S} \\ 0 & \text{otherwise} \end{cases} \tag{4.5}
\]
where, as with the DSSS algorithm, α is a parameter used to control the noise power. The final step is to construct the watermarked image by using the inverse 2D-DCT to convert the modified frequency domain signal into the watermarked block:
\[
\tilde{x}_{m_1,m_2}(n_1, n_2) = \mathrm{IDCT}[\tilde{X}_{m_1,m_2}(k_1, k_2)] \tag{4.6}
\]
As before, the subset of S modified coefficients may be fixed for the entire image, or it may vary with each block. Methods of selecting the coefficients will be discussed in Section 4.2.4.
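The embedding of Equations 4.5 and 4.6 can be sketched with an explicit orthonormal DCT matrix. This is an illustration only: the function names, the coefficient subset, and the use of a matrix-product DCT (rather than the thesis implementation) are my own assumptions:

```python
import numpy as np

def dct_matrix(M):
    """Orthonormal DCT-II matrix C, so that X = C @ x @ C.T is the 2D-DCT of x
    and x = C.T @ X @ C is the inverse."""
    k = np.arange(M)[:, None]
    n = np.arange(M)[None, :]
    C = np.sqrt(2.0 / M) * np.cos(np.pi * (2 * n + 1) * k / (2 * M))
    C[0, :] /= np.sqrt(2.0)
    return C

def fhss_embed_block(block, subset, bit, alpha, pn):
    """Sketch of Eqs. 4.5-4.6: spread the bit over a chosen subset of 2D-DCT
    coefficients, then return the watermarked block via the inverse transform."""
    C = dct_matrix(block.shape[0])
    X = C @ block @ C.T
    for (k1, k2), chip in zip(subset, pn):
        X[k1, k2] += alpha * bit * chip
    return C.T @ X @ C

rng = np.random.default_rng(2)
block = rng.uniform(0.0, 255.0, (8, 8))
subset = [(3, 4), (5, 2), (6, 6)]   # hypothetical coefficient subset S
pn = np.array([1.0, -1.0, 1.0])     # short bipolar PN sequence
marked = fhss_embed_block(block, subset, 1, 5.0, pn)
```

Because C is orthogonal, only the selected coefficients change; all others pass through the forward and inverse transforms unaltered.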
At the decoder, the watermark bit is recovered by correlating the received block with the PN sequence used at the encoder:

C = \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} \tilde{x}_{m_1 m_2}(n_1, n_2) \, p(n_1, n_2)
  = \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} \left[ x_{m_1 m_2}(n_1, n_2) + \alpha \, w(m_1, m_2) \, p(n_1, n_2) \right] p(n_1, n_2)
  = \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} x_{m_1 m_2}(n_1, n_2) \, p(n_1, n_2) + \alpha \, w(m_1, m_2) \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} p^2(n_1, n_2) \qquad (4.7)

Given that p(n_1, n_2) is a noise-like signal with zero mean, the correlation of the original signal with the PN sequence in the equation above may be assumed to be low:

\sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} x_{m_1 m_2}(n_1, n_2) \, p(n_1, n_2) \approx 0 \qquad (4.8)
resulting in a weighted bipolar watermark bit at the correlator output, C \approx \alpha \, M^2 \, w(m_1, m_2). The extracted bit may be obtained from the sign of this output:

\tilde{w}(m_1, m_2) = \mathrm{sign}[C] \qquad (4.9)
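The correlation detector of Equations 4.7 through 4.9 can be sketched as follows. The mean removal and the use of a flat host block (where the Equation 4.8 assumption holds exactly once the mean is removed) are illustrative choices of mine, not steps prescribed by the thesis:

```python
import numpy as np

def dsss_decode_block(received, pn):
    """Sketch of Eqs. 4.7-4.9: correlate the mean-removed block with the PN
    sequence; the host term of Eq. 4.7 is assumed near zero (Eq. 4.8),
    leaving C ~ alpha * M^2 * w, so the bit is the sign of C."""
    C = np.sum((received - received.mean()) * pn)
    return 1 if C >= 0 else -1

# round trip over a flat (uniform-intensity) 16 x 16 host block
rng = np.random.default_rng(3)
pn = rng.choice([-1.0, 1.0], size=(16, 16))
host = np.full((16, 16), 128.0)
decoded = [dsss_decode_block(host + 2.0 * bit * pn, pn) for bit in (1, -1)]
```

On textured host blocks the host term of Equation 4.7 is only approximately zero, which is exactly why the prefiltering improvements discussed later in this section are needed.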
A similar procedure is used to decode watermark bits in the FHSS algorithm, but the correlation is performed only on the 2D-DCT coefficients that were modified during the embedding process. Note that the set of S modified coefficients must be available at the receiver.

When the watermarked image is corrupted by additive noise v(n_1, n_2), the block presented to the correlator is

\hat{x}_{m_1 m_2}(n_1, n_2) = \tilde{x}_{m_1 m_2}(n_1, n_2) + v(n_1, n_2) \qquad (4.10)

Applying the correlation formula of Equation 4.7, the extracted bit with a noise term may be written as:

C = \alpha \, M^2 \, w(m_1, m_2) + \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} v(n_1, n_2) \, p(n_1, n_2) \qquad (4.11)
Since v(n_1, n_2) has zero mean and is uncorrelated with p(n_1, n_2), for large block sizes it is predicted that the algorithms possess a strong resilience to additive noise distortions. When the block size is not large, the probability of bit error may be approximated by the expression:

P_B = Q\left( \frac{\alpha M}{\sigma_v} \right) \qquad (4.12)

where \sigma_v^2 is the variance of the additive noise and

Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left( -\frac{u^2}{2} \right) du \qquad (4.13)

It is clear from the P_B equation above that either increasing the block size or increasing the watermark power has a significant effect on the reliability of the encoding.
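Equations 4.12 and 4.13 can be evaluated directly with the standard complementary error function; the function names below are illustrative:

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) (Eq. 4.13), via the complementary
    error function: Q(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bit_error_prob(alpha, M, sigma_v):
    """Approximate probability of bit error, P_B = Q(alpha * M / sigma_v)
    (Eq. 4.12), for block side M and noise standard deviation sigma_v."""
    return q_function(alpha * M / sigma_v)
```

As the text notes, doubling either the block size M or the watermark weight \alpha drives P_B down sharply, since Q(x) decays faster than exponentially in x.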
The basic spread spectrum algorithm may be improved by incorporating Girod's spatial domain model of the Human Visual System into the encoder. This improvement was originally proposed by Tewfik et al in their image watermarking system [28]. Their modifications to Girod's visual model were described in Section 2.2.6.1. Prior to encoding, the masking values for the image as a whole are computed using Girod's visual model. This has the effect of producing a tolerable error image \phi(n_1, n_2) representing the maximum watermark distortion on a pixel-by-pixel basis. The analysis is performed for the entire image, because in some cases spatial masking effects will occur at the boundary between blocks. The allowable error is maximized within regions of uniform intensity and where spatial masking effects occur. Then, for each block to be watermarked, the minimum value of \phi(n_1, n_2) within the block is selected as the fixed watermark power:

\alpha_0 = \min \{ \phi_{m_1 m_2}(n_1, n_2) \} \qquad (4.14)

where \alpha_0 is the new constant watermark power used throughout the block. The minimum value is used to ensure that the watermark distortion is still imperceptible. By using the localized value of \alpha_0, the watermark can take advantage of masking characteristics of the image that are localized in nature. Note, however, that the benefit of this modification is maximized for smaller blocks, and decreases as the block size is increased.
For each block to be watermarked, the 2D-DCT algorithm is first used to transform the block into the frequency domain, denoted X_{m_1 m_2}(k_1, k_2), as described earlier for the FHSS watermarking algorithm. In addition, a two-dimensional bipolar PN sequence p(k_1, k_2) is constructed for each block, and this sequence is combined with w(m_1, m_2), the bit to be embedded. After this, the following steps are taken:

1. Compute Q(k_1, k_2), the raised frequency detection threshold levels, using either the Tewfik or Watson analysis algorithm.

2. Quantize the 2D-DCT coefficients using the masking threshold levels, and then modify each quantized coefficient by plus or minus a quarter of the quantization level, according to the PN sequence:

\tilde{X}_{m_1 m_2}(k_1, k_2) = \left[ \frac{X_{m_1 m_2}(k_1, k_2)}{Q(k_1, k_2)} \right] Q(k_1, k_2) + \frac{1}{4} \, w(m_1, m_2) \, p(k_1, k_2) \, Q(k_1, k_2) \qquad (4.15)

where [\cdot] denotes the rounding operator.

At the decoder, the embedded bit is extracted by first computing the frequency masking threshold levels, T_M(k_1, k_2), either from the original host image or an approximation based on the watermarked image. The bit is extracted from the block by determining the PN sequence bits from a quantized version of the watermarked 2D-DCT block:

w(m_1, m_2) \, p(k_1, k_2) = \frac{4}{Q(k_1, k_2)} \left( \tilde{X}_{m_1 m_2}(k_1, k_2) - \left[ \frac{\tilde{X}_{m_1 m_2}(k_1, k_2)}{Q(k_1, k_2)} \right] Q(k_1, k_2) \right) \qquad (4.16)

and then computing the correlation of the extracted bits with the original PN sequence. Note that the watermark is not added to the 2D-DCT coefficients, as was the case in the original FHSS algorithm. If the masking coefficients computed at the decoder are a close approximation to those used at the encoder, then the bit error will be zero in a distortionless environment (i.e., no additive noise or other corruption). Tewfik et al only explored the use of their masking analysis technique in
the proposed watermarking algorithm. In this investigation, both the Watson model and Tewfik's model were implemented for use in the modified FHSS algorithm, and they are denoted FHSS-FMW and FHSS-FMT, respectively. Watson's model does not incorporate the frequency masking analysis of Tewfik's approach, but it will be shown in Section 4.4.3 that the former is less computationally expensive to implement.
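The quantize-and-offset embedding of Equation 4.15, and the residual-based extraction of Equation 4.16, can be sketched directly on an array of coefficients. The function names are mine, and a random array stands in for the 2D-DCT coefficients and masking thresholds:

```python
import numpy as np

def qim_embed(X, Q, bit, pn):
    """Eq. 4.15 sketch: round each coefficient to its masking level Q, then
    offset by +- Q/4 according to the PN-spread bit."""
    return np.round(X / Q) * Q + 0.25 * bit * pn * Q

def qim_decode(Xw, Q, pn):
    """Eq. 4.16 sketch: the residual about the nearest quantizer level is
    +- Q/4 in the absence of distortion; despread its sign with the PN
    sequence and correlate to recover the bit."""
    residual = Xw - np.round(Xw / Q) * Q
    chips = np.sign(residual) * pn
    return 1 if chips.sum() >= 0 else -1

rng = np.random.default_rng(4)
X = rng.uniform(-50.0, 50.0, (8, 8))    # stand-in for 2D-DCT coefficients
Q = rng.uniform(4.0, 12.0, (8, 8))      # stand-in for thresholds Q(k1, k2)
pn = rng.choice([-1.0, 1.0], size=(8, 8))
results = [qim_decode(qim_embed(X, Q, bit, pn), Q, pn) for bit in (1, -1)]
```

Because the offset magnitude Q/4 is less than half a quantization step, re-rounding at the decoder recovers the same lattice point the encoder used, so extraction is exact in a distortionless environment, just as the text states.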
2. Attempt to minimize the presence of the original host image by employing a whitening filter constructed from a two-dimensional autoregressive (AR) model of the image block. A K \times K model is given by the expression:

A(z_1, z_2) = \sum_{n_1=0}^{K-1} \sum_{n_2=0}^{K-1} a(n_1, n_2) \, z_1^{-n_1} z_2^{-n_2} \qquad (4.17)

where a(0, 0) = 1. The AR model coefficients for a particular image block may be obtained from an estimate of the two-dimensional autocorrelation function of the block, and then using either a 2D form of the normal equations, or a 2D form of the Levinson-Durbin recursion [46]. The AR model may be used as a whitening filter by convolving the watermarked image block with the coefficients computed for that block. The result is a two-dimensional random process corresponding to the prediction error:

e_{m_1 m_2}(n_1, n_2) = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} a(i, j) \, \tilde{x}_{m_1 m_2}(n_1 - i, n_2 - j) \qquad (4.18)
Assuming that the power of the noise-like watermark signal, controlled by \alpha, is much less than that of the host image, the AR coefficients computed for the watermarked block, \tilde{x}(n_1, n_2), will be close to those computed for the original image block. Therefore, the convolution of the watermarked block with the AR coefficients will result in \hat{x}(n_1, n_2), the image presented to the correlator:

\hat{x}_{m_1 m_2}(n_1, n_2) = \tilde{x}_{m_1 m_2}(n_1, n_2) * a(n_1, n_2) \qquad (4.19)

It was shown in the previous chapter that the convolution of a random process with a linear filter will result in a cross-correlation between the filter output and the random input that depends only upon the variance of the random signal and the filter coefficients [7]. Therefore, from a derivation similar to that used in Section 3.4.4.2, the expected value of the correlator output for the block will be a weighted version of the watermark bit embedded within the block:

E[C] = \alpha \, w(m_1, m_2) \sum_{n_1=0}^{K-1} \sum_{n_2=0}^{K-1} a(n_1, n_2) \qquad (4.20)

where K is the size of the set of AR model coefficients, and the extracted bit, \tilde{w}(m_1, m_2), may be taken as the sign of the correlator output.
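The prediction-error filtering of Equations 4.17 and 4.18 can be sketched as a sliding inner product with the AR coefficient array. For illustration the coefficients below are hand-picked (a 2 x 2 first-difference predictor) rather than fitted by the normal equations, and the filter is applied only over the interior where the full window fits:

```python
import numpy as np

def whiten(block, a):
    """Prediction-error (whitening) filter of Eqs. 4.17-4.18: slide the K x K
    coefficient array a, with a[0, 0] = 1, over the block and take inner
    products. Only the 'valid' interior is returned, so no border handling
    is assumed."""
    K = a.shape[0]
    M = block.shape[0]
    out = np.zeros((M - K + 1, M - K + 1))
    for n1 in range(M - K + 1):
        for n2 in range(M - K + 1):
            out[n1, n2] = np.sum(a * block[n1:n1 + K, n2:n2 + K])
    return out

# a 2 x 2 predictor with weights +1, -1, -1, +1 annihilates any planar trend,
# a crude stand-in for the slowly varying host content the AR model removes
a = np.array([[1.0, -1.0], [-1.0, 1.0]])
n1, n2 = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
plane = 3.0 * n1 + 2.0 * n2 + 100.0
err = whiten(plane, a)
```

With fitted AR coefficients the same operation suppresses the host image while largely passing the broadband watermark, which is what improves the correlator's signal-to-host ratio.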
Figure 4.2 shows a plot of the magnitude response of the two-dimensional finite impulse response (FIR) filter considered in this investigation. The filter is of size 11 \times 11 coefficients, and was constructed using the McClellan frequency transformation method [47]. While it is computationally expensive to construct such a filter, the construction only has to be performed at design time, with the result incorporated into the decoder.

Figure 4.2: Two-dimensional highpass filter used to prefilter host images watermarked with the DSSS and FHSS algorithms.
In Section 3.4.4.2 it was shown experimentally that a highpass prefilter greatly improves the decoding reliability for the audio DSSS algorithm, while the AR modeling technique improves the performance of the FHSS decoder. In a similar manner, it was found that the highpass prefiltering modification provided the best decoding performance for the DSSS and DSSS-SM algorithms. It was also discovered that a 3 \times 3 whitening filter provided the best decoding performance for the FHSS algorithm. This size of AR model is in agreement with studies of image compression using two-dimensional linear predictive coding (2D-LPC), where it was found that a square of coefficients larger than three or four samples per side provided little coding gain [46]. In the performance analysis section, this modification to the algorithm decoders will be used. It is important to remember that for the FHSS algorithms with frequency domain masking analysis (FHSS-FMT and FHSS-FMW), no such prefiltering will be used at the decoder because the algorithms use quantization rather than an additive watermark.
4.2.4.5 Discussion
There are significant differences between the spread spectrum algorithms implemented for this study and the original versions from the literature. Most notable is the division of the host image into a set of blocks, which allows for a variable number of bits to be embedded within a host image. In the original FHSS algorithm proposed by Cox et al, for example, the authors compute the 2D-DCT of the entire image. They also recommend constructing a watermark using samples drawn from a Gaussian process, and they require access to the original image in order to extract the watermark. The result is a signature embedded into the host image, rather than the arbitrary set of watermark data achieved by using a block-by-block approach. In addition, a novel approach used in this study is the incorporation of Watson's frequency domain masking analysis into the quantized watermarking algorithm proposed by Tewfik et al. It will be shown in Section 4.4 that Watson's model is less computationally expensive than Tewfik's model, but offers a similar level of performance when incorporated into the FHSS-FMW algorithm. The basic DSSS and FHSS algorithms are straightforward to implement, and are computationally efficient. If the block size is a power of two, then the 2D-FFT or other fast algorithms can be used to compute the 2D-DCT of each image block. Also, the watermarking of blocks can be performed in parallel. It is predicted that incorporating spatial masking analysis into the DSSS algorithm will improve the imperceptibility of the distortion and maximize the watermark power in regions of the image that possess sharp edges or uniform intensity. The more complex frequency domain masking analysis techniques added to the FHSS algorithm should also improve algorithm performance by their use of luminance masking, frequency sensitivity, and frequency masking characteristics of the host image. However, the spread spectrum algorithms are subject to the same PN sequence synchronization problems as described in Section 3.4.4.3. This makes them quite susceptible to cropping and geometric transformations. Also, it will be shown in the performance evaluation of Section 4.4 that incorporating masking analysis increases the computational requirements of the algorithms.
Level   LL        LH        HL        HH
1       29.7551   53.1615   53.1615   155.0356
2       20.3071   29.2656   29.2656   64.7006
3       17.9397   21.8632   21.8632   38.4151
4       19.7667   21.0311   21.0311   30.3196

Table 4.1: Wavelet quantization levels for a 512 \times 512 image at the standard viewing distance.

The model of Watson et al is based on a four-level wavelet decomposition of the image using the 9-7 biorthogonal filters originally proposed for image compression [12]. In their approach, the authors determined the minimum amount of noise in wavelet coefficients, at each level of resolution and orientation, that would be detectable to a viewer seated at the standard viewing distance from the image. They did this by using a psychovisual study and a large number of test subjects. Random noise was injected into wavelet coefficients at a single resolution and orientation, and the resulting image was presented to the viewer after computing an inverse wavelet transformation on the noisy coefficients. The noise was increased until it became detectable in the resulting image. The result was a set of quantization levels, one for each resolution and orientation in the four-level decomposition. Quantization levels from Watson et al depend on the spatial resolution of the image, which in turn depends upon the image size and distance from the viewer. For a 512 \times 512 image located at the standard viewing distance of six times the image width, the quantization levels associated with four levels of decomposition are shown in Table 4.1. The four orientations LL, LH, HL, and HH correspond to the four subimages at each level of the multiresolution decomposition. They represent lowpass, horizontal, vertical, and diagonal components, respectively. Podilchuk and Zeng have attempted to incorporate these wavelet quantization levels into an image watermarking scheme [49]. However, their approach is deficient in several ways. First of all, their algorithm embeds a watermark signature into the
Figure 4.3: Decomposition filters g(n) and h(n) used to compute the 2D-DWT.

host image, not an arbitrary set of data bits. In addition, access to the original image is required for extracting the signature at the decoder (a private watermark system). In the following sections, a modified version of their multiresolution watermarking scheme is proposed that allows embedding of data, and does not require access to the original image.
1. Compute the four-level two-dimensional DWT of the host image x(n_1, n_2). The result of each successive level of decomposition is a set of four (N / 2^M) \times (N / 2^M) "subimages", where M denotes the level. These four downsampled subimages represent a lowpass representation of the image and three detail images corresponding to horizontal (LH), vertical (HL), and diagonal (HH) components. The lowpass image is filtered and downsampled again to produce the next level of subimages.

2. From the set of ten subimages created by the DWT, construct an N \times N composite image x_{\mathrm{comp}}(n_1, n_2) using the subimages, as shown in Figure 4.4-(a) and Figure 4.5. The composite image has the same dimensions as the original image, but is comprised of subimages.

3. Construct an N \times N quantization matrix Q(n_1, n_2), based on the allowable quantization level for each subimage's resolution and orientation from Table 4.1. For example, the region of Q(n_1, n_2) corresponding to the lowpass (LL) subimage at the fourth level of decomposition would be (N/16) \times (N/16) samples in size, and assigned a quantization level of 19.7667. Figure 4.4-(b) illustrates the structure of the N \times N quantization matrix.

4. Divide the composite image into a set of blocks of size M \times M pixels, and for each block construct a PN sequence p(n_1, n_2, n_3) \in \{-1, +1\} with which to spread each bit within the block.

5. Embed the spread bits into the blocks of the composite image by quantizing the composite image coefficients using the quantization matrix, and then modifying each quantized coefficient by plus or minus a quarter of the quantization level, according to the PN sequence:

\tilde{x}_{\mathrm{comp}}(n_1, n_2) = \left[ \frac{x_{\mathrm{comp}}(n_1, n_2)}{Q(n_1, n_2)} \right] Q(n_1, n_2) + \frac{1}{4} \, w(m_1, m_2) \, p(n_1, n_2, n_3) \, Q(n_1, n_2) \qquad (4.21)

6. Finally, compute the inverse DWT of the modified subimages to produce the resulting watermarked image:

\tilde{x}(n_1, n_2) = \mathrm{IDWT}[\tilde{x}_{\mathrm{comp}}(n_1, n_2)] \qquad (4.22)

Note that the blocks containing bits can overlap with the multiresolution subimages, so that some of the data may be embedded within different spatial frequency bands
(a) Composite image from subimages. (b) Composite quantization matrix.

Figure 4.4: N \times N composite images made from the multiresolution decomposition subimages and quantization levels.

and different orientations. By dividing the composite image into M \times M blocks, it can be used to hold the same amount of data as the other watermarking algorithms. Also note the similarity of this approach to the modified FHSS algorithm incorporating frequency domain masking analysis. In both algorithms, the frequency or spatial / frequency domain coefficients are quantized with respect to the maximum quantization level, resulting in perfect reconstruction of the data bits in the absence of distortion. An example of the wavelet decomposition performed on the LENNA image is shown in Figure 4.5.
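Step 3 above, building the N x N quantization matrix from the Table 4.1 levels, can be sketched as follows. The layout convention (level-m subimages in the top-left quadrants, LL in the corner, LH to the right, HL below, HH on the diagonal) is my assumption about Figure 4.4-(b), not a detail stated in the text:

```python
import numpy as np

# Quantization levels from Table 4.1, indexed by level and orientation
WAVELET_Q = {
    1: {"LL": 29.7551, "LH": 53.1615, "HL": 53.1615, "HH": 155.0356},
    2: {"LL": 20.3071, "LH": 29.2656, "HL": 29.2656, "HH": 64.7006},
    3: {"LL": 17.9397, "LH": 21.8632, "HL": 21.8632, "HH": 38.4151},
    4: {"LL": 19.7667, "LH": 21.0311, "HL": 21.0311, "HH": 30.3196},
}

def quantization_matrix(N, levels=4):
    """Fill each multiresolution subimage region of the N x N matrix
    Q(n1, n2) with its Table 4.1 quantization level (step 3 sketch)."""
    Q = np.zeros((N, N))
    for m in range(1, levels + 1):
        s = N // 2 ** m                      # side length of a level-m subimage
        Q[:s, s:2 * s] = WAVELET_Q[m]["LH"]  # horizontal detail
        Q[s:2 * s, :s] = WAVELET_Q[m]["HL"]  # vertical detail
        Q[s:2 * s, s:2 * s] = WAVELET_Q[m]["HH"]  # diagonal detail
    s = N // 2 ** levels
    Q[:s, :s] = WAVELET_Q[levels]["LL"]      # deepest lowpass subimage
    return Q
```

For N = 512 this reproduces the example in step 3: the (N/16) x (N/16) corner region holds the level-4 LL value of 19.7667.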
Figure 4.5: Example of a four-level wavelet decomposition of a 512 \times 512 pixel version of LENNA.
1. Compute the four-level two-dimensional DWT of the watermarked image, and again construct an N \times N composite image from the subimages:

\hat{x}_{\mathrm{comp}}(n_1, n_2) = \text{composite of the subimages of } \mathrm{DWT}[\tilde{x}(n_1, n_2)] \qquad (4.23)

2. Divide the composite image into a set of M \times M blocks, and determine the spread data sequence for each block from a quantized version of the watermarked block:

w(m_1, m_2) \, p(n_1, n_2) = \frac{4}{Q(n_1, n_2)} \left( \hat{x}_{\mathrm{comp}}(n_1, n_2) - \left[ \frac{\hat{x}_{\mathrm{comp}}(n_1, n_2)}{Q(n_1, n_2)} \right] Q(n_1, n_2) \right) \qquad (4.24)

3. Compute the correlation of the extracted spread sequence with the original PN sequence in order to extract the embedded data bit:

C = \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} w(m_1, m_2) \, p^2(n_1, n_2) = M^2 \, w(m_1, m_2) \qquad (4.25)

4. Finally, obtain the embedded bit from the sign of the correlation:

\tilde{w}(m_1, m_2) = \mathrm{sign}[C] \qquad (4.26)
4.3.3 Discussion
The multiresolution embedding algorithm possesses a number of advantages over the spread spectrum techniques discussed earlier. Like the FHSS algorithm, the multiresolution approach spreads watermark data throughout the spatial domain of the host image. However, the division of blocks in the 2D-DWT domain results in data being embedded within different frequency bands and orientations. The quantization levels are fixed for the given standard viewing distance and wavelet basis functions, so it is not necessary to compute them again at the receiver, as is required by the FHSS-FMW and FHSS-FMT algorithms. However, it should be noted that the quantization levels represent the sensitivity of the HVS to wavelet basis functions at various resolutions and orientations in the 2D-DWT domain. This sensitivity is image independent, so other aspects of perceptual masking, such as luminance masking and frequency masking, are not considered. In addition, the quantization levels are only valid for the 9-7 basis functions studied by Watson et al, and it is not clear how the levels could be adjusted for use with other basis functions.
Figure 4.6: Sample images used in the performance evaluation of image watermarking algorithms: (a) BARBARA, (b) BOAT, (c) FROG, (d) GOLDHILL, (e) LENNA, (f) MANDRILL, (g) MONARCH, (h) MOUNTAIN, (i) PEPPERS, (j) ZELDA.
Figure 4.7: Bit error rate versus block size for the six watermarking algorithms compared (DSSS, DSSS-SM, FHSS, FHSS-FMW, FHSS-FMT, and multiresolution).
FMW, and multiresolution algorithms had an error rate of zero under all block sizes. It was explained earlier that in the absence of distortion, these algorithms are expected to produce no bit error. All of the algorithms had an error rate of less than five percent at M \times M = 16 \times 16 samples. In the experiments that follow, a block size of 16 square samples was used. As with audio watermarking, a similar problem exists with respect to blocks and watermarked images. If the image is cropped, then any bits embedded within affected blocks would be lost. In order to prevent this, it is obvious that larger blocks should be used, but this limits the bit rate of the watermarking system. Therefore, there is a tradeoff between block size and bit rate. As an example, Figures 4.8 - 4.10 show the original 512 \times 512 LENNA image along with versions watermarked using the six algorithms described in this chapter. It is clear from the images that the distortion introduced by the algorithms does not degrade the perceptual quality of the host image, when seen from a standard viewing distance of six times the image width.
Figure 4.8: LENNA image watermarked with the DSSS and DSSS-SM algorithms.
Figure 4.9: LENNA image watermarked using the FHSS, FHSS-FMW, and FHSS-FMT algorithms.
Image      DSSS    DSSS-SM   FHSS    FHSS-FMW   FHSS-FMT   Multiresolution
BARBARA    38.59   32.43     37.85   28.44      25.69      31.49
BOAT       38.59   29.96     37.93   27.74      26.03      32.01
FROG       38.62   39.73     37.16   32.77      26.02      31.81
GOLDHILL   38.59   32.09     38.00   29.00      25.93      31.86
LENNA      38.59   32.18     37.54   29.48      26.48      32.26
MANDRILL   38.59   30.12     38.45   26.33      24.53      30.57
MONARCH    38.60   37.09     35.95   27.90      23.94      29.63
MOUNTAIN   38.72   37.20     36.05   31.16      25.53      31.40
PEPPERS    38.59   31.49     37.17   28.62      26.31      32.15
ZELDA      38.59   32.80     38.01   30.26      26.68      32.45
Average    38.61   33.51     37.41   29.17      25.71      31.56

Table 4.2: PSNR of watermarked images versus original images (in decibels).
Image      DSSS   DSSS-SM   FHSS    FHSS-FMW   FHSS-FMT   Multiresolution
BARBARA    4.36   7.81      13.75   30.99      69.15      9.79
BOAT       4.31   7.69      13.69   30.97      71.09      9.77
FROG       4.32   7.66      13.66   30.85      74.52      9.75
GOLDHILL   4.31   7.63      13.64   30.85      72.09      9.71
LENNA      4.31   7.62      13.67   30.83      69.17      9.71
MANDRILL   4.28   7.61      13.64   30.84      71.50      9.71
MONARCH    4.47   7.85      14.09   31.85      69.78      10.02
MOUNTAIN   4.52   7.99      14.31   32.31      68.26      10.14
PEPPERS    4.64   8.20      14.80   33.30      68.86      10.45
ZELDA      4.62   8.18      14.64   32.97      68.09      10.35
Average    4.41   7.82      13.99   31.58      70.45      9.93

Table 4.3: Image watermarking algorithm timings (in seconds).

the power of the noise-like watermark signal added to the host image. In contrast, the PSNR of the FHSS-FMW and FHSS-FMT algorithms may not be predicted as easily. Quantization of the 2D-DCT coefficients is performed by these two techniques, which introduces random noise in the form of quantization errors. If the quantization matrices were uniform, then the noise variance could be predicted by the quantization levels. However, the quantization levels vary with frequency and with the 2D-DCT of the host image block, so the PSNR tends to vary between images. The PSNR of the multiresolution algorithm is more constant for all of the sample images because the quantization level is uniform for each level and orientation of the decomposition.
The DSSS algorithm with spatial masking analysis (DSSS-SM), along with the multiresolution algorithm, is slightly more expensive because it must either filter the host image or compute a forward and inverse transform. The FHSS algorithm is more expensive, requiring an average of 14 seconds to encode and decode each host image. This is not surprising considering that each block must be transformed into the frequency domain using the 2D-DCT, and more resources are required to compute the AR model of the watermarked image at the decoder. Not surprisingly, the FHSS algorithms with frequency domain masking, FHSS-FMW and FHSS-FMT, require the most time to encode and decode the host images. In addition to computation of the frequency threshold levels of each DCT block, the algorithms must perform the same steps at the decoder in order to approximate the masking levels of the original image. This is required because the original image is typically not available at the receiver. Of the two, the FHSS-FMT algorithm requires over twice the amount of time to run, since it involves computing the complex frequency masking characteristics of each image block.
Figure 4.11: Bit error rate from mean filtering for image watermarking algorithms.

Each watermarked image was mean filtered over a neighbourhood of K \times K pixels:

\tilde{x}(n_1, n_2) = \frac{1}{K^2} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} x(n_1 - i, n_2 - j) \qquad (4.27)

where \tilde{x} denotes the averaged pixel. Mean filtering, essentially a lowpass filtering operation, has the effect of removing high-frequency noise from a signal. For lowpass filtering, a lowpass symmetric half-band filter of size K \times K samples was constructed using a frequency sampling technique, for 1 \le K \le 15. Lowpass filtering is used prior to downsampling an image, or to remove high-frequency noise. As the filter order increases, more high-frequency components of the watermarked image are attenuated by the mean and lowpass filtering operations. The results of this experiment are shown in Figures 4.11 - 4.12. The DSSS and multiresolution algorithms perform poorly under these operations. DSSS employs a highpass prefilter at the decoder under the assumption that the host image has a lowpass magnitude response, but mean and lowpass filtering remove these high-frequency components. In Figure 4.5, it is clear that roughly 3/4 of the DWT coefficients lie within the high-frequency subimages at the first level of decomposition. Therefore, roughly 3/4 of the watermark data will be corrupted by lowpass filtering operations. In contrast, the block-based FHSS algorithms perform better because they operate in the frequency domain, and spread watermark data
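The moving average of Equation 4.27 can be sketched directly; this illustration returns only the interior region where the full window fits, which sidesteps any assumption about how the thesis handled image borders:

```python
import numpy as np

def mean_filter(x, K):
    """K x K moving average (Eq. 4.27 sketch): each output pixel is the mean
    of a K x K neighbourhood; only the 'valid' interior is returned."""
    M = x.shape[0]
    out = np.empty((M - K + 1, M - K + 1))
    for n1 in range(M - K + 1):
        for n2 in range(M - K + 1):
            out[n1, n2] = x[n1:n1 + K, n2:n2 + K].mean()
    return out

# averaging a linear intensity ramp leaves a (shifted) linear ramp
img = np.arange(36, dtype=float).reshape(6, 6)
smoothed = mean_filter(img, 3)
```

High-frequency content is attenuated while slowly varying content passes nearly unchanged, which is why this operation corrupts highpass-embedded watermark data so effectively.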
Figure 4.12: Bit error rate from lowpass filtering for image watermarking algorithms.

from each bit throughout the spectrum. Lowpass filtering operations will affect the high-frequency components, but there will still be a correlation between the lowpass components and the PN sequence used to spread the data.
Figure 4.13: Bit error rate from highpass filtering for image watermarking algorithms.
The high-emphasis filter was constructed as a weighted combination of lowpass and highpass kernels:

h(n_1, n_2) = (1 - A) \, h_{LP}(n_1, n_2) + A \, h_{HP}(n_1, n_2) \qquad (4.28)

where h_{LP}(n_1, n_2) and h_{HP}(n_1, n_2) represent filters of 11 \times 11 samples, with A varying over 0 \le A \le 1. The high-emphasis filtering operation, also known as unsharp masking, was included in this study because it is commonly used to remove high-frequency noise while retaining sharp edges and features in an image. It was expected that as A approaches 0 and 1, the performance of the watermarking algorithms would approximate those of the lowpass and highpass filtering experiments, respectively. The results, plotted in Figure 4.14, support this.
\tilde{x}_{m_1 m_2}(n_1, n_2) = \sum_{(i, j) \in \mathrm{ROS}} x_{m_1 m_2}(n_1 - i, n_2 - j) \, a(i, j) \qquad (4.29)
Figure 4.14: Bit error rate from high-emphasis filtering for image watermarking algorithms.

where (i, j) \in \mathrm{ROS} is a K \times K region of support of previous pixels. The coefficients of a(i, j) were chosen to minimize the mean squared error (MSE) between \tilde{x}_{m_1 m_2}(n_1, n_2) and x_{m_1 m_2}(n_1, n_2), for square blocks of 1 \le K \le 15. Wiener filtering was used to simulate the effects of two-dimensional linear predictive image coding [46]. As the prediction filter order increases, the model more closely matches the watermarked image block. The output of the prediction filter is an approximation to the image block, plus random noise corresponding to the prediction error. The variance of this noise decreases with an increase in the filter order, and depends upon the host image being modeled. Figure 4.15 shows the result of this experiment.
Figure 4.15: Bit error rate from Wiener filtering for image watermarking algorithms.
Figure 4.16: Bit error rate from median filtering for image watermarking algorithms.
suffers when more pixels are altered. Transform-domain algorithms (FHSS-FMW, FHSS-FMT, and multiresolution) work better because their coefficients are affected less by modifications to individual pixels.
For additive white noise, each watermarked image was corrupted as

\hat{x}(n_1, n_2) = \tilde{x}(n_1, n_2) + v(n_1, n_2) \qquad (4.30)

and for coloured noise, the image was distorted with noise of the same power, but multiplied by a normalized version of the watermarked signal. Since the pixels of the host images lie in the interval 0 \le x(n_1, n_2) \le 255, the corrupted image may be written as

\hat{x}(n_1, n_2) = \tilde{x}(n_1, n_2) + \frac{\tilde{x}(n_1, n_2)}{255} \, v(n_1, n_2) \qquad (4.31)

For each algorithm, the bit error rate was computed as a function of peak signal-to-noise ratio (PSNR) in decibels. A block size of 16 \times 16 pixels was used, and the results of this experiment are shown in Figure 4.17 and Figure 4.18. From the plots, it is clear that the FHSS, FHSS-FMW, FHSS-FMT, and multiresolution algorithms provided the best resilience to additive and coloured noise. Of these, the more complicated FHSS-FMT technique was the best overall, particularly at extremely low PSNR. The performance of the DSSS, DSSS-SM, and multiresolution algorithms was comparable for additive noise, but the multiresolution algorithm provides the least resilience to coloured noise. Error rates for the coloured noise case were lower than for additive noise because the noise power was determined before multiplication with the normalized image. This tends to skew the noise ratio, but this is not a serious problem because the interest lies in the performance of the algorithms compared with each other.
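The two distortions of Equations 4.30 and 4.31 and the PSNR measure can be sketched as follows; the function names and the flat test image are illustrative. The sketch also reproduces the observation above: shaping the same-power noise by the normalized image reduces its effective power, so the measured PSNR is higher:

```python
import numpy as np

def add_white_noise(x, sigma, rng):
    """Additive white Gaussian noise (Eq. 4.30)."""
    return x + rng.normal(0.0, sigma, x.shape)

def add_coloured_noise(x, sigma, rng):
    """Noise multiplied by the normalized image (Eq. 4.31): brighter regions
    receive proportionally more distortion."""
    return x + (x / 255.0) * rng.normal(0.0, sigma, x.shape)

def psnr(x, x_hat, peak=255.0):
    """Peak signal-to-noise ratio in decibels."""
    mse = np.mean((x - x_hat) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(5)
flat = np.full((64, 64), 128.0)   # mid-grey stand-in image
p_white = psnr(flat, add_white_noise(flat, 8.0, rng))
p_coloured = psnr(flat, add_coloured_noise(flat, 8.0, rng))
```

For this mid-grey image the colouring scales the noise by 128/255, raising the PSNR by roughly 6 dB relative to the white-noise case.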
Figure 4.17: Bit error rate due to additive white Gaussian noise.
Figure 4.18: Bit error rate due to coloured Gaussian noise.
Additive noise has a flat power spectrum, and so the distortion has the same level for all transform-domain coefficients. The transform-domain algorithms (FHSS-FMW, FHSS-FMT, and multiresolution) perform better than the spatial-domain approaches because the quantization levels are not the same for each coefficient. The multiresolution algorithm quantization levels vary with orientation and DWT level, and the FHSS-FMW and FHSS-FMT levels vary with each 2D-DCT coefficient. The same level of additive noise will affect a coefficient with a smaller quantization level more than one with a larger level.
4.4.4.7 Quantization
The purpose of this experiment was to investigate the performance of each algorithm under distortion due to linear quantization of the watermarked images. The images, originally represented as greyscale images with 8 bits per pixel (bpp), were linearly requantized to K bpp, for 1 \le K \le 7. As before, for each algorithm the encoder was run on each image, followed by quantization, and then decoding to extract the bits. A block size of 16 \times 16 pixels was used, and the results of this experiment are shown in Figure 4.19. From the plot, it can be seen that the FHSS, FHSS-FMW, and FHSS-FMT algorithms provide the most resilience to pixel quantization, particularly at extremely coarse quantization levels. However, the performance of the DSSS and DSSS-SM algorithms is comparable down to three bpp. The process of quantization introduces a noise-like error, referred to as quantization noise, into the watermarked image. The mean and variance of this distortion depend on the quantization step size and on whether rounding or truncation quantization is applied. With rounding quantization, used in this investigation, the noise has zero mean and a variance \sigma_q^2 equal to [7]:

\sigma_q^2 = \frac{q^2}{12} \qquad (4.32)

where q is the quantization step size. If the intensity levels of the watermarked image are uniformly distributed, and the quantization noise is uncorrelated with the
Figure 4.19: Bit error rate (in percent) due to linear requantization of the watermarked images.
Image      DSSS   DSSS-SM   FHSS   FHSS-FMW   FHSS-FMT   Multiresolution
BARBARA    2.66   2.93      0.47   0.00       0.00       1.07
BOAT       0.55   0.66      0.04   0.00       0.00       0.17
FROG       0.08   0.04      0.43   0.00       0.00       1.63
GOLDHILL   0.51   0.31      0.23   0.00       0.00       1.79
LENNA      0.27   0.08      0.04   0.00       0.00       0.35
MANDRILL   5.16   9.02      0.51   0.04       0.00       2.69
MONARCH    7.66   10.94     4.65   0.35       0.00       1.27
MOUNTAIN   0.27   0.27      1.80   0.00       0.00       1.46
PEPPERS    0.31   0.08      0.00   0.00       0.00       0.82
ZELDA      0.04   0.04      0.00   0.00       0.00       0.67
Average    1.75   2.44      0.82   0.04       0.00       1.19

Table 4.4: Bit error rate due to histogram equalization (in percent).

watermarked image, then it may be possible to predict the results of quantization using the additive noise data from Section 4.4.4.6. For a representation of 3 bits per pixel, or 2^3 = 8 intensity levels, the quantization step size would be 256 / 8 = 32 intensity levels, corresponding to quantization noise with variance \sigma_v^2 \approx 85. The PSNR from this noise is approximately 29 decibels. However, the bit error rates due to additive noise at this level, shown in Figure 4.17, do not correspond to those from quantization to 3 bpp. This discrepancy arises because the host images do not have a uniform distribution of intensity levels.
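The worked example above (Equation 4.32 applied to 3 bpp requantization) can be checked numerically; the function names are illustrative:

```python
import math

def quantization_noise_variance(bits_per_pixel, full_scale=256):
    """Variance of rounding-quantization noise, sigma_q^2 = q^2 / 12
    (Eq. 4.32), for an image requantized to the given bits per pixel."""
    q = full_scale / 2 ** bits_per_pixel   # quantization step size
    return q * q / 12.0

def predicted_psnr(bits_per_pixel, peak=255.0):
    """PSNR in decibels implied by the uniform quantization-noise model."""
    return 10.0 * math.log10(peak ** 2 / quantization_noise_variance(bits_per_pixel))
```

At 3 bpp the step size is q = 32, giving a variance near 85 and a predicted PSNR of roughly 29 dB, matching the figures quoted in the text; each additional bit per pixel raises the predicted PSNR by about 6 dB.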
However, the DSSS and DSSS-SM algorithms do not have a large bit error rate, at roughly two percent. In this process, the dynamic range of image pixel values is increased so that the histogram distribution occupies all possible values (0 - 255 for an 8-bit image, for example), hence making the probability distribution roughly uniform. However, the spatial- and transform-domain relationships between adjacent pixels and frequency coefficients are preserved during the process, so it is less likely that the watermark data will be disrupted.
Figure 4.20: Bit error rate due to JPEG compression, as a function of compression quality.
of watermark data from coefficient quantization is limited to individual frequencies. A watermark bit is spread throughout the set of transform-domain coefficients, so it is less likely that quantization of a coefficient will affect more than a portion of the watermark bit.
4.5 Summary
The goal of this chapter was to review six image watermarking algorithms from the literature: direct sequence spread spectrum (DSSS) and DSSS with spatial masking analysis (DSSS-SM), frequency hopped spread spectrum (FHSS) and FHSS with two different frequency domain masking analysis improvements (FHSS-FMW and FHSS-FMT), and multiresolution embedding. The FHSS-FMW algorithm is the result of replacing the frequency masking analysis of Tewfik et al. with a simpler frequency domain masking analysis process introduced by Watson. The multiresolution embedding algorithm described is an adapted version from the literature, modified so that an arbitrary amount of watermark data may be embedded within an image. Another goal of this chapter was to evaluate the algorithms using the performance evaluation framework introduced in Chapter 1. From these results it is clear that the multiresolution embedding algorithm, offering average computational complexity and the best perceptual quality, performed poorly under simple signal processing operations. Better performance was observed from the DSSS and DSSS-SM algorithms, but it should be noted that the added complexity of spatial masking analysis does not significantly improve the performance of the spread spectrum technique. Overall, the best performance was seen with the FHSS algorithm and its frequency masking variants, FHSS-FMW and FHSS-FMT. In all cases, the FHSS-FMT approach offered the most resilience to signal processing operations.
before extracting the watermark. Some of these approaches work by slightly adjusting the variable length codes of DCT block coefficients in "I" frames [54]. Another approach works by modifying the block motion vectors used to construct the B and P frames [45]. Uncompressed-domain algorithms work by embedding and extracting watermark data before and after any compression algorithms are applied, respectively. These approaches are of greater interest because it is important to consider how well a watermark survives compression. In this chapter the focus will be on uncompressed-domain algorithms. In the previous two chapters, a selection of digital audio and image watermarking algorithms were implemented and compared. The evaluation is extended in this chapter to the study of seven digital video watermarking algorithms. Another goal of this chapter is to apply the performance analysis framework proposed in Chapter 1 as a means of comparing the algorithms. The algorithms evaluated in this chapter were selected to represent the three different approaches to embedding data: spatial domain, frequency domain, and spatial / frequency (multiresolution). They were also chosen to represent a range of computational complexities and implementation structures. Since the focus of this thesis is on public watermarking algorithms, the techniques chosen do not require access to the original signal in order to extract the watermark data. The chapter is organized as follows. Sections 5.2 - 5.3 provide a description of the video watermarking algorithms, including the theory behind them, encoder and decoder structures, and implementation details. This is followed in Section 5.4 by a performance evaluation of the algorithms with respect to bit rate, perceptual quality, computational complexity, and robustness to signal processing operations. Finally, a review of the chapter's findings is provided in Section 5.5.
Figure 5.1: Example of an image sequence divided into blocks in the spatial domain, as well as blocks temporally. Each three-dimensional block will be used to embed one bit of data.
5.1 Conventions
Similar conventions to those used in the previous two chapters will be used in this investigation of video algorithms. First of all, it is assumed that x(n1, n2, n3) represents a digital video signal of size N1 × N2 pixels spatially, and N3 frames temporally. Each frame of this signal is divided into a set of M1 × M2 blocks of size M × M pixels, where M1 = ⌊N1/M⌋ and M2 = ⌊N2/M⌋, as shown in Figure 5.1. As explained in Chapters 3 and 4, a division of the host signal into blocks is a convenient way of embedding a variable amount of watermark bits. The sequence is further divided into a set of M3 blocks of M frames temporally, where M3 = ⌊N3/M⌋. x̂(n1, n2, n3) represents the watermarked video signal, while x_{m1 m2 m3}(n1, n2, n3) and x̂_{m1 m2 m3}(n1, n2, n3) indicate the <m1, m2, m3> block in the original and watermarked signals, respectively, for 0 ≤ m1 ≤ M1 − 1, 0 ≤ m2 ≤ M2 − 1, and 0 ≤ m3 ≤ M3 − 1. Finally, it is assumed that one bit is embedded in each block, and this sequence of M1 × M2 × M3 bits is denoted by w(m1, m2, m3) ∈ {−1, +1}, for 0 ≤ m1 ≤ M1 − 1, 0 ≤ m2 ≤ M2 − 1, and 0 ≤ m3 ≤ M3 − 1. A bit extracted from the watermarked signal is denoted by w̃(m1, m2, m3).
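These block-division conventions can be made concrete with a small sketch. This is illustrative only: the frames-first array layout and the helper names are assumptions, not notation from the thesis.

```python
import numpy as np

def watermark_capacity(shape, M=16):
    """Number of embeddable bits, M1 * M2 * M3, with Mi = floor(Ni / M)."""
    N3, N1, N2 = shape                     # frames first, then rows, columns
    return (N1 // M) * (N2 // M) * (N3 // M)

def extract_block(x, m1, m2, m3, M=16):
    """Return the <m1, m2, m3> block of a video array laid out as
    (N3 frames, N1 rows, N2 columns)."""
    return x[m3 * M:(m3 + 1) * M,
             m1 * M:(m1 + 1) * M,
             m2 * M:(m2 + 1) * M]

video = np.zeros((64, 256, 256))           # N3 = 64 frames of 256 x 256 pixels
print(watermark_capacity(video.shape))     # 16 * 16 * 4 = 1024 bits
```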
be adapted for embedding data into digital video signals: direct sequence spread spectrum (DSSS) and DSSS with spatial domain masking analysis (DSSS-SM), frequency hopped spread spectrum (FHSS) and FHSS with frequency domain masking analysis (FHSS-FMW and FHSS-FMT), and multiresolution embedding. In the following sections, extensions of the six algorithms will be described. It is important to note that no other researchers have analyzed the use of image watermarking algorithms in this manner. All of the techniques from the previous chapter employ a spreading signal, a two-dimensional pseudonoise (PN) sequence, to distribute the energy of the watermark data throughout the spectrum of the host image. For blocks of digital video, a three-dimensional PN sequence will be constructed of the form p(n1, n2, n3) ∈ {−1, +1}, with the same dimensions as the block to be watermarked. This PN sequence will be used to spread the watermark data throughout the spectrum of the video block. Likewise, the embedded bit will be extracted using the correlation of the watermarked block, after possible prefiltering, with the original PN sequence. Recall that the properties and advantages of PN sequences were discussed in Section 3.4, and the value of using a correlation receiver was explained in Section 3.4.2.
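A minimal sketch of 3D PN spreading and the correlation receiver follows. This is illustrative code: the key-seeded generator stands in for the PN sequence construction of Section 3.4, and the strength parameter alpha is a hypothetical name, not one from the thesis.

```python
import numpy as np

def pn_sequence(shape, key):
    """3D pseudonoise sequence p(n1, n2, n3) in {-1, +1}; seeding from a
    shared secret key lets the decoder regenerate the same sequence."""
    return np.random.default_rng(key).choice([-1, 1], size=shape)

def embed_bit(block, w, pn, alpha=2.0):
    """Additive spread spectrum embedding: add the spread bit w * pn,
    scaled by a strength alpha, to the host block."""
    return block + alpha * w * pn

def extract_bit(block_wm, pn):
    """Correlation receiver: sign of the correlation of the watermarked
    block (optionally prefiltered first) with the PN sequence."""
    return 1 if np.sum(block_wm * pn) >= 0 else -1
```

Because two different PN sequences have very low correlation, a decoder without the key recovers essentially random bits, which is the security property exploited throughout the thesis.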
The spatial domain masking analysis is performed on each entire frame prior to encoding, and the resulting masking levels are then divided into spatial blocks corresponding to the size of the host video block. For each block, the minimum masking level is selected as the single weighting value for the block, and is used to weight the spread watermark data added to the image block. The two-dimensional highpass prefilter is employed at the decoder, also on a frame-by-frame basis.
The forward and inverse 2D-DWT are computed on each entire frame prior to encoding, and the composite image and quantization matrix are then constructed for each frame. The set of composite images is divided into 3D blocks using M × M spatial blocks and M frames. The quantization levels are used to quantize the wavelet coefficients of the composite image, and to perturb them with the spread watermark data, as described in Section 4.3.
5.2.6 Discussion
The frame-by-frame video watermarking algorithms possess a number of the same advantages and disadvantages as their constituent image watermarking techniques. First of all, they are simple to implement, as the only real difference is in the embedding of spread watermark data into blocks of video rather than within spatial blocks on individual frames. Individual frames may be processed in parallel when computing the 2D-DCT and 2D-DWT transformations, and when performing spatial or frequency domain masking analysis. However, each of the frame-by-frame video watermarking algorithms is subject to synchronization problems at the receiver. As described in Section 3.4.4.3, a watermarked block and the PN sequence used to spread the watermark data must be perfectly registered in order for the correlator to work properly.
a static background. Subsequent frames correspond to components that change with increasing temporal frequency. The value of embedding watermark data into the temporal wavelet domain should be clear: after computing the inverse transform of the wavelet frames, the watermark will exist throughout the video signal, and at various temporal scales. Watermark data embedded into the DC frame, for example, would exist within every frame of the video signal. As a result, it is likely that the embedded watermark will be more resilient to compression and other signal processing operations. In the following sections, an implementation of the temporal multiresolution watermarking scheme of Tewfik et al. will be described in more detail.
1. Compute a multilevel discrete wavelet transform (DWT) of the entire video signal along its temporal axis, to a depth of ⌊log2 N3⌋ levels. The result is a set of N3 wavelet frames:

X(n1, n2, k3) = DWT_{n3}[x(n1, n2, n3)]    (5.1)

where 0 ≤ k3 ≤ N3 − 1 indexes the temporal wavelet frames. The wavelet basis functions used to compute the forward and inverse DWT will be discussed in Section 5.3.2.

2. Divide the set of wavelet frames into a set of M1 × M2 × M3 video blocks, denoted X_{m1 m2 m3}(n1, n2, k3), as described in Section 5.1. For each video block, construct a 3D pseudonoise sequence with which to spread the watermark bit, w(m1, m2, m3), to be embedded in the block.
3. Quantize the coefficients within each block using a constant quantization level Δ, and then perturb them by a quarter of the quantization level to embed the spread watermark data:

X̂_{m1 m2 m3}(n1, n2, k3) = Δ ⌊X_{m1 m2 m3}(n1, n2, k3) / Δ⌋ + (Δ/4) w(m1, m2, m3) p(n1, n2, k3)    (5.2)

Appropriate values for Δ will be discussed in Section 5.3.3.

4. Compute the inverse DWT along the temporal axis to obtain the watermarked video signal:

x̂(n1, n2, n3) = IDWT_{k3}[X̂(n1, n2, k3)]    (5.3)
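The steps above can be sketched end to end. The code below is illustrative only: it uses a single-level Haar transform along the temporal axis in place of the multilevel DWT of Section 5.3.2, and a round-to-nearest quantizer as one plausible reading of (5.2).

```python
import numpy as np

def temporal_haar_dwt(x):
    """Single-level Haar DWT along the temporal (first) axis: returns
    lowpass ("static") and highpass ("dynamic") wavelet frames."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)
    return lo, hi

def temporal_haar_idwt(lo, hi):
    """Inverse of temporal_haar_dwt (perfect reconstruction)."""
    x = np.empty((2 * lo.shape[0],) + lo.shape[1:])
    x[0::2] = (lo + hi) / np.sqrt(2)
    x[1::2] = (lo - hi) / np.sqrt(2)
    return x

def quantize_embed(X, w, pn, delta=8.0):
    """Quantize coefficients to multiples of delta, then perturb them by
    a quarter of the quantization level times the spread bit."""
    return delta * np.round(X / delta) + 0.25 * delta * w * pn

def correlate_extract(X_wm, pn, delta=8.0):
    """Recover the bit: in the distortion-free case the residual after
    requantization is exactly (delta / 4) * w * pn."""
    residual = X_wm - delta * np.round(X_wm / delta)
    return 1 if np.sum(residual * pn) >= 0 else -1
```

Embedding into the lowpass ("DC") frames produced by the transform is what spreads the watermark across every frame of the reconstructed video.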
The above procedure assumes that the original and watermarked video signals are real-valued, but in practice they will require a discrete representation (such as 8 bits per pixel, for example). Computation of the temporal DWT and application of the watermark data will be done using real-valued signals, but the watermarked signal will be rounded so that it fits within the same format as the host signal. Figure 5.2 shows the result of computing a single-level temporal DWT on a video signal of four frames. The outputs are two lowpass frames and two highpass frames, representing static and dynamic temporal components in the video signal.
Figure 5.2: Example of computing the temporal DWT on a video signal four frames in length.
2 percent of the image pixel range, or 2 - 5 intensity levels for an 8-bit image. For this study, a constant quantization level of Δ = 8 was used for every wavelet coefficient frame, corresponding to a maximum watermark strength of Δ/2 = 4 levels within each coefficient frame.
5.3.4 Discussion
As mentioned earlier, the temporal multiresolution watermarking algorithm possesses a number of attractive features. First of all, watermark data is spread throughout the video signal, and at various levels of temporal support. As a result, it is predicted that the algorithm is more robust to signal processing operations than the simpler frame-by-frame techniques introduced in Section 5.2. Limiting the DWT computation to the temporal axis means that the operation may be performed in parallel for each pixel. However, having to compute a multilevel DWT on the temporal axis for every pixel is still computationally expensive, particularly for long video signals. Tewfik et al. suggest dividing the video signal into scenes, both to limit the number of frames required for each DWT computation, and to produce a DC wavelet frame containing static components from a single scene. Finally, it is not likely that this algorithm can be used to embed watermark data in real time, given the computational cost and the need to wait for a scene change before computing the DWT.
64 frames long, and represented by 8 bits per pixel. The watermarking algorithms were implemented in MATLAB under Linux on an Intel Pentium PC running at 166 MHz. Each algorithm was implemented using the parameters specified in the implementation details presented earlier. In general, each of the video signals was watermarked using each algorithm 100 times, and the results averaged for each algorithm. In all cases, a different, random watermark signal was generated for each run. This was done in order to remove any dependency of extraction on the watermark data itself.
Figure 5.3: Sample sequences used in the performance evaluation of video watermarking algorithms: (a) AMERICA, (b) FOOTBALL, (c) SALESMAN, (d) TENNIS, (e) TREVOR, (f) WESTERN.
Figure 5.4: Bit error rate versus block size for video watermarking algorithms.
Video Signal   DSSS    DSSS-SM   FHSS-FMW   FHSS-FMT   Multiresolution   Temporal MR
AMERICA        38.12   37.41     32.74      27.69      39.60             30.81
FOOTBALL       37.98   40.27     29.34      26.86      42.69             29.34
SALESMAN       38.03   37.86     31.92      28.25      40.82             31.59
TENNIS         38.02   38.11     30.83      27.59      38.71             28.33
TREVOR         37.99   39.07     31.54      25.44      45.67             32.43
WESTERN        38.07   40.14     33.35      26.57      42.29             34.25
Average        38.04   38.81     31.62      27.07      41.63             31.13

Table 5.1: PSNR of watermarked video signals versus original sequences (in decibels).
Video Signal   DSSS    DSSS-SM   FHSS     FHSS-FMW   FHSS-FMT   Multiresolution   Temporal MR
AMERICA        70.37   125.83    220.00   505.40     1128.16    154.81            1748.47
FOOTBALL       71.71   126.74    221.86   505.59     1126.29    153.80            1741.36
SALESMAN       69.97   124.43    220.54   506.72     1127.44    156.99            1734.92
TENNIS         72.74   125.98    223.02   504.93     1127.54    155.48            1740.87
TREVOR         70.42   126.37    220.79   505.90     1126.27    156.33            1735.48
WESTERN        70.67   123.53    222.13   506.08     1126.51    156.23            1734.39
Average        70.91   125.48    221.39   505.77     1127.37    155.61            1739.25

Table 5.2: Video watermarking algorithm CPU timings (in seconds).

The distortion introduced by the frame-by-frame algorithms, measured in PSNR, is similar to that found for static images in Section 4.4.2. The temporal multiresolution technique produced an average PSNR of 31 dB, which lies within the range of distortions introduced by the frame-by-frame algorithms.
video signals. For the frame-by-frame techniques, the cost to watermark each frame depends upon the spatial dimensions N1 × N2, so processing five frames takes approximately five times longer than a single frame. The temporal multiresolution algorithm uses a multilevel DWT of depth ⌊log2 N3⌋, which would suggest logarithmically increasing complexity for larger N3. However, for each level of decomposition the length of the video signal is halved by downsampling. Therefore, the total number of operations per pixel is bounded by

Σ_{i=0}^{⌊log2 N3⌋} N3 / 2^i < 2 N3    (5.4)
The performance of digital video watermarking algorithms under these distortions will be examined further in the following sections.
x̃(n1, n2, j) = (1/K) Σ_{i=0}^{K−1} x(n1, n2, j + i)    (5.5)
where x̃ denotes the approximated frame. This is essentially mean filtering, a form of lowpass filtering considered in the study of audio and image watermarking algorithms. In this case, however, it is performed along the temporal axis. For small K (less than 3), frame averaging does not have a significant impact on the quality of the test video signals, but the lowpass filtering effect increases with K. In this experiment, K was varied over 1 ≤ K ≤ 16 frames, and the results for each algorithm are shown in Figure 5.5. From the plot, it is clear that the temporal multiresolution algorithm provided better resilience to the averaging filter than the frame-by-frame algorithms. One likely reason for this is that most of the test video signals did not contain a great deal of high frequency temporal components, the exception being the FOOTBALL sequence. An averaging filter would remove such components. Of the frame-by-frame approaches, the frequency hopped spread spectrum algorithms with frequency domain masking analysis, FHSS-FMW and FHSS-FMT, performed better than the others. This result is not unexpected given the results of the image watermarking evaluation presented in the previous chapter.
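The averaging distortion can be sketched as follows (illustrative; a causal window starting at frame j is assumed, since the thesis does not specify the exact window placement):

```python
import numpy as np

def average_frames(x, K):
    """Temporal mean filter: each output frame is the mean of up to K
    consecutive input frames, taken along the temporal (first) axis."""
    N3 = x.shape[0]
    out = np.empty_like(x, dtype=float)
    for j in range(N3):
        # Window is shortened near the end of the sequence
        out[j] = x[j:min(j + K, N3)].mean(axis=0)
    return out
```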
Figure 5.5: Bit error rate versus frame averaging for video watermarking algorithms.
that spread spectrum algorithms require synchronization of the watermarked signal with the PN sequence used to spread watermark data. Obviously, frame reordering will affect the performance of these algorithms, since a portion of the 3D blocks will no longer be synchronized if a frame is replaced with another. In this experiment, the similarity between two frames x(n1, n2, i) and x(n1, n2, j) was measured using the normalized mean squared error (MSE) between them:

MSE = (1 / (N1 N2)) Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} [x(n1, n2, i) − x(n1, n2, j)]²    (5.6)
where 0 ≤ MSE ≤ 1 and it is assumed that the two frames are normalized to the interval [0, 1]. Each pair of frames in the sample video signals was examined using the similarity measure, and if the MSE was less than T, a variable threshold value, then the pair was placed into a pool of possible frame reordering candidates. After the set of candidate frame pairs was constructed, a random number of the frame pairs were selected for interchanging. This process was repeated for a range of threshold values, and a plot of the bit error rate is shown in Figure 5.6. From the plot, it is clear that the performance of each algorithm degrades quickly with a decrease in the threshold value, because more frame pairs enter into the pool of candidates for interchanging. However, the temporal multiresolution technique performs slightly better than the frame-by-frame algorithms at low threshold values. This likely results from the fact that reordering takes place in the time / spatial domain, while watermark embedding and extraction are performed in the temporal DWT domain. In contrast, the performances of the frame-by-frame algorithms are roughly similar to each other. This is because frame reordering disrupts the synchronization of the watermarked signal with the PN sequence in a similar manner for each algorithm. It appears that the FHSS-FMW and FHSS-FMT techniques offer slightly more resilience to reordering, but distortion introduced by the algorithms into the host signal tends to increase the MSE between frame pairs. An increase in MSE tends to decrease the number of frames reordered for a given threshold value.
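The candidate-selection step can be sketched as follows (illustrative code; frames are assumed normalized to [0, 1] as in (5.6)):

```python
import numpy as np

def frame_mse(a, b):
    """Normalized mean squared error between two equal-size frames."""
    return np.mean((a - b) ** 2)

def reorder_candidates(x, T):
    """Frame pairs (i, j), i < j, whose MSE falls below the threshold T:
    candidates for a perceptually innocuous interchange."""
    N3 = x.shape[0]
    return [(i, j)
            for i in range(N3) for j in range(i + 1, N3)
            if frame_mse(x[i], x[j]) < T]
```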
Figure 5.6: Bit error rate versus frame reordering for video watermarking algorithms.
For digital video signals with high frame rates (over 15 frames per second) and smoothly varying components in the scene, it is possible to downsample the video signal along the temporal axis by a factor of K. If the video is lowpass filtered prior to downsampling to remove aliasing effects, then the missing frames may be reconstructed using an interpolation filter [7]. This would obviously be useful as a compression scheme, and for low downsampling factors (2 - 3) the distortion of the video would not be too significant. In this experiment, frames from the entire watermarked video signal were downsampled by a factor of K, for 2 ≤ K ≤ 16, and then reconstructed from the previous and next frames using simple bilinear interpolation of pixels. For example, for a factor of K = 2, the ith frame was reconstructed according to

x̃(n1, n2, i) = (1/2) [x(n1, n2, i − 1) + x(n1, n2, i + 1)]    (5.7)

where x̃ denotes the reconstructed frame. A plot of the bit error rate as a function of downsampling factor is shown in Figure 5.7. From the plot, it is clear that the performance of each algorithm degrades quickly with an increase in the downsampling factor, particularly for the frame-by-frame algorithms. This is because more frames are removed, and the spread watermark data in reconstructed frames does not correlate well with the PN sequence used to spread the data. However, the temporal multiresolution technique performs slightly better than the frame-by-frame algorithms as the downsampling factor increases.
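For K = 2, the drop-and-reconstruct operation of (5.7) can be sketched as (illustrative only):

```python
import numpy as np

def drop_and_interpolate(x):
    """Simulate temporal downsampling by K = 2: every odd-indexed frame
    is discarded and rebuilt as the mean of its two neighbours."""
    y = x.astype(float).copy()
    for i in range(1, x.shape[0] - 1, 2):
        y[i] = 0.5 * (x[i - 1] + x[i + 1])
    return y
```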
Figure 5.7: Bit error rate versus frame downsampling for video watermarking algorithms.
The Moving Picture Experts Group (MPEG) standard is widely used in many popular systems such as Digital Versatile Discs (DVDs), and has been adopted for use in the high definition television (HDTV) standard. Therefore, this is an important experiment. A tutorial on the MPEG standards may be found in [56]. In this investigation, the grayscale 256 × 256 sample video sequences were sampled at 30 frames per second, corresponding to a bit rate of approximately 15.7 million bits per second (Mbps). The MPEG codec was applied to the watermarked video signals for varying compression rates, measured in bits per pixel (bpp). The MPEG coder and decoder used were the "Berkeley MPEG-1 Video Encoder" and the "Berkeley MPEG Player", respectively [57, 58]. A standard group of pictures (GOP) of 15 frames was used, with a repeating pattern of "I", "B", and "P" frames of the form "IBBPBBP". MPEG encoding was performed for varying bit rates, up to 6 bits per pixel, by modifying the scale factors applied to the block DCT quantization matrices. The compressed video was then decoded back into its raw digital format for extracting the watermark data. The peak signal to noise ratio (PSNR), a common measure of digital image and video quality, is obviously dependent upon the level of compression applied to the signal. Figure 5.8 shows a plot of the PSNR for each of the sample video signals as a function of the compression ratio. Although the effects of DCT coefficient quantization occur at every compression level, previous researchers have found that artifacts produced by the MPEG standard only begin to become visible at levels below 2 bpp [56]. This is a result of the high level of temporal redundancy, and to a lesser extent spatial redundancy, present within video signals. It is important to note that the PSNR above 2 bpp is above 30 dB for each of the sample signals, which corresponds with the perceptual quality results presented in Section 5.4.2.
Figure 5.9 shows a plot of the bit error rate of the extracted watermarks as a function of the compression rate in bits per pixel. From these results, it is clear that the temporal multiresolution algorithm again outperformed the frame-by-frame algorithms, particularly at rates of less than 1 bpp. The main reason for this result
Figure 5.8: PSNR versus compression ratio for sample video signals.
Figure 5.9: Bit error rate due to MPEG compression as a function of bit rate.
is that the MPEG compression algorithm works on a frame-by-frame basis to remove temporal and spatial redundancies, while the temporal multiresolution algorithm embeds watermark data throughout the temporal axis of the video signal. Therefore it is likely that more watermark bits may be correctly extracted. Of the frame-by-frame algorithms, the FHSS-FMW and FHSS-FMT algorithms performed slightly better than the others. This is not too surprising given the results of the lossy image compression comparison presented in Section 4.4.4.9, where it was found that the FHSS algorithms performed well under JPEG compression.
5.5 Summary
Compared to image watermarking algorithms, few techniques exist in the literature for explicitly embedding watermark data into digital video signals. However, video is nothing more than a sequence of still images, so intuitively image watermarking approaches may be easily extended into the temporal dimension. In this chapter, the six watermarking algorithms of the previous chapter were adapted for use in "frame-by-frame" watermarking of digital video: DSSS, DSSS-SM, FHSS, FHSS-FMW, FHSS-FMT, and the spatial multiresolution technique. Also, an implementation was described of a novel temporal multiresolution watermarking system specifically designed for video signals. Another aspect of this chapter was to evaluate the algorithms using the performance evaluation framework introduced in Chapter 1. It was noted that high-quality video signals possess two unique properties: a high bit rate resulting from a relatively high temporal sampling rate, and a high level of temporal redundancy. The distortion of video from watermarking using the frame-by-frame algorithms, measured in PSNR, corresponds to the results of Section 4.4.2, while temporal multiresolution introduced no more distortion than the former. With respect to computational complexity, it was found that the temporal multiresolution algorithm proved twice as expensive as
the most complex frame-by-frame algorithm, but the per-frame cost of the former does not increase with the length of the video signal. Resilience to signal processing was only measured using operations unique to video: frame averaging, reordering, downsampling, and lossy (MPEG) compression. For each of these experiments, the temporal multiresolution algorithm proved far more resilient to signal processing than the frame-by-frame techniques.
Chapter 6 Conclusions
The previous three chapters have provided a review of many techniques for embedding data within digital audio, image, and video signals, along with details on implementation strategies. In addition, a performance evaluation was conducted for each class of algorithms in order to compare them with respect to a common set of criteria. In this final chapter, the primary results of this investigation are summarized, and several key areas are listed as possible avenues of further research in this field.
reveal this result, because the only way to prove it is to perform a formal perceptual quality study as described in Section 1.4.2. Results from the performance evaluations help to support these conclusions. In particular, algorithms incorporating perceptual modeling (the frequency masking audio watermarking algorithm, and the spread spectrum techniques with masking analysis: DSSS-SM, FHSS-FMW, and FHSS-FMT) had a signal to noise ratio below that of the algorithms that did not use perceptual modeling. This indicates that more distortion from watermarking is introduced by these techniques. In addition, in many cases these algorithms performed better under common signal processing operations, as shown in Section 3.6.4 and Section 4.4.4, most notably lowpass and Wiener filtering, additive noise, and lossy compression. It was also discovered that many watermarking algorithms from the literature rely upon spread spectrum techniques from digital communications theory in order to securely encode and decode watermark data. This is because spread spectrum systems possess several unique properties:

1. Watermark data, when "spread" using a pseudorandom (PN) sequence, is distributed throughout the spectrum of the host image, including portions not already occupied by image components.

2. As shown in Section 3.4.2, the spread spectrum correlation receiver is highly robust to additive noise distortion, and it was also shown that its reliability increases with the block size and the magnitude of the spread watermark data.

3. Extra security of the watermark data is obtained from using a PN sequence, because the correlation of two different PN sequences is very low. However, perfect synchronization of the watermarked host signal and the PN sequence is required to correctly extract watermark data at the receiver.
From the performance evaluation conducted throughout this thesis, it is clear that a combination of spread spectrum techniques and transform domain embedding produces a more robust watermark. In particular, the frequency hopped spread spectrum
(FHSS) algorithm and its variants (FHSS-FMW and FHSS-FMT) proved particularly resilient to processing. The main reason is that embedding watermarks in the transform domain tends to distribute their energy throughout the temporal or spatial extent of the host signal. Recall that the focus of this thesis is on public watermarking algorithms, where the original signal is not available at the receiver to assist in extracting watermark data from the host signal. In many cases, the presence of the host signal may interfere with extraction of the watermark. This is especially true of additive watermarks, such as the direct sequence and frequency hopped spread spectrum (DSSS and FHSS) techniques. In contrast, a quantization approach provides error-free decoding, but only if the watermarked signal is distributed in a distortion-free environment. In this thesis, improvements were introduced for the DSSS and FHSS approaches to audio, image, and video watermarking for reducing the presence of the host signal. Employing a highpass prefilter, or a "whitening" filter constructed from an autoregressive (AR) model of the host signal, both work well to reduce interference from the host signal.
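As one simple illustration of host-signal suppression (a first-difference highpass filter standing in for the prefilters and AR whitening filters described in the thesis; not the thesis's own filter design):

```python
import numpy as np

def highpass_prefilter(x):
    """First-difference highpass prefilter along the first axis:
    attenuates the strongly correlated (lowpass) host component while
    largely preserving the noise-like spread watermark, improving the
    odds at the correlation receiver."""
    return np.diff(x, prepend=x[:1], axis=0)
```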
3. Higher perceptual quality, resulting from lower distortion from embedded watermark data, usually corresponds to less robustness against signal processing. For example, consider results from the audio watermarking chapter. The phase coding algorithm for audio signals produces poor quality signals (measured in SNR), but provides high resilience to processing. In contrast, the DSSS approach has better quality (due to the low watermark amplitude), but is less robust to processing operations.
works "best" for watermarking different signals. It is clear that frequency domain approaches spread watermark data throughout the time or spatial domain of the host signal, which is valuable for providing resilience to signal processing operations. Transform- and wavelet-based compression algorithms achieve coding gains by seeking a more compact representation of a signal's time or spatial domain sample values. However, it is not clear that there is any benefit to using different transform kernels or basis functions for watermarking. The performance evaluation only considered robustness to individual signal processing operations. In practice, it is likely that a number of operations would be performed on a given signal, and it would be useful to know which watermarking algorithm proves more resilient to combinations of operations. For example, an image from a digital camera may undergo lowpass filtering to remove noise, followed by histogram equalization to widen the dynamic range of the image, and completed by lossy (JPEG) compression before being posted onto an Internet web site. If the set of possible signal processing operations is known in advance of watermarking, then it may be possible to construct modifications to each watermarking algorithm so that the embedded data will survive the distortions. In addition, robustness to more sophisticated signal processing operations should be considered. For example, the algorithms incorporating perceptual analysis (frequency masking, DSSS-SM, FHSS-FMW, and FHSS-FMT) produce localized increases in watermark strength due to localized masking effects within the host signal. However, it is possible for an attacker (as defined in Section 1.2.2) to use these models to localize an attack on a watermark as well, possibly with little or no perceivable loss of signal quality.
Quite often, high quality digital signals are stored or transmitted in analog form, because not all consumers have access to the Internet or other sources of digital media. For example, digital video may be converted to a standard television format (NTSC) and then broadcast or recorded on an analog tape. Similarly, digital images are printed in magazines and newspapers, and audio is still recorded and sold on cheap
cassettes. Future investigation of the robustness of watermarking algorithms should take the digital-to-analog (D/A) and analog-to-digital (A/D) conversion process into account, for it is likely that some techniques would not survive the process well.
181 In addition to copyright protection frameworks, other interesting applications of watermarking are beginning to emerge in the literature. For example, two novel applications were recently described: audio-in-video and video-in-video 61]. As mentioned in Section 5.4.4, the high bit rate of raw digital video allows for a large amount of watermark data to be embedded within the signal. For audio-in-video, the authors embed four speech signals within a 360 240 pixel video signal at 30 frames per second. The speech is sampled at 8 kHz and represented with 8 bits per sample, and compressed to 2400 bits per second using a CELP speech compression algorithm 62]. The value of this approach is that the embedded speech signals could represent additional audio tracks, perhaps in di erent languages. Since the speech is embedded within the video signal itself, the bit rate of the signal does not need to be increased to accomodate the extra speech. In a similar manner, the authors embed a small video signal, compressed using the MPEG algorithm, within the host video signal as a form of video-in-video.
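The figures quoted above can be checked with a short back-of-the-envelope calculation; the per-frame and per-pixel numbers below are derived here for illustration and are not stated in [61].

```python
# Payload arithmetic for the audio-in-video example.
frame_w, frame_h = 360, 240   # video resolution (pixels)
fps = 30                      # frames per second
speech_rate = 2400            # bits/s per CELP-compressed speech track
tracks = 4                    # number of embedded speech signals

payload_bps = tracks * speech_rate        # total embedded bit rate
bits_per_frame = payload_bps / fps        # watermark bits per video frame
pixels_per_frame = frame_w * frame_h      # host samples available per frame
bits_per_pixel = bits_per_frame / pixels_per_frame

print(payload_bps)                 # 9600
print(bits_per_frame)              # 320.0
print(round(bits_per_pixel, 5))    # roughly 0.0037 bits per pixel
```

At well under one embedded bit per hundred pixels, the payload is small relative to the host, which is consistent with the claim that the video bit rate need not grow to carry the extra speech.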
vidual pixels and the spatial (or interpixel) redundancy between pixels [46]. Addition of a noise-like watermark signal has the effect of increasing the "randomness" of the image, which would reduce both the coding and interpixel redundancies. Watermarks may affect the compression rate achieved by lossy techniques as well. For example, a watermarked block of 8 × 8 pixels used in the JPEG image compression algorithm may contain frequency components in the 2D-DCT domain that do not exist in the original block of pixels. These components may not be removed by coefficient quantization, leading to a larger compressed image size. Similar effects occur in the MPEG video compression scheme, so it is useful to consider ways of embedding watermark data that minimize the effects on compression.
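The 2D-DCT effect described above can be demonstrated with a small sketch (my own illustrative code, not part of the thesis experiments): a naive 2-D DCT-II of an 8 × 8 block is quantized uniformly, and the surviving coefficients are counted before and after adding a checkerboard perturbation that stands in for a noise-like watermark. The ±4 amplitude and quantizer step are arbitrary illustrative values.

```python
import math

N = 8  # JPEG block size

def dct2(block):
    """Naive 2-D DCT-II of an N x N block (as used in JPEG)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def nonzero_after_quantization(block, step=16):
    """Count DCT coefficients that survive uniform quantization."""
    return sum(1 for row in dct2(block) for coef in row
               if round(coef / step) != 0)

# A flat block compresses to a single nonzero (DC) coefficient...
flat = [[128] * N for _ in range(N)]
print(nonzero_after_quantization(flat))  # 1

# ...while a checkerboard perturbation introduces high-frequency
# coefficients that the quantizer does not remove.
wm = [[128 + (4 if (x + y) % 2 else -4) for y in range(N)]
      for x in range(N)]
print(nonzero_after_quantization(wm))    # more than 1
```

The extra surviving coefficients in the watermarked block are exactly the components that would enlarge the entropy-coded JPEG output.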
Bibliography
[1] Pamela Samuelson. Good News and Bad News on the Intellectual Property Front. Communications of the ACM, 42(3):19–24, March 1999.
[2] J. S. Lauritzen, Adar Pelah, and David Tolhurst. Perceptual Rules for Watermarking Images: A Psychophysical Study of the Visual Basis for Digital Pattern Encryption. In Proceedings of SPIE Human Vision and Electronic Imaging IV, volume 3644, pages 392–402, 1999.
[3] Ingemar Cox and Jean-Paul Linnartz. Some General Methods for Tampering With Watermarks. IEEE Journal on Selected Areas in Communications, 16(4):587–593, May 1998.
[4] Scott Craver, Nasir Memon, Boon-Lock Yeo, and Minerva Yeung. Resolving Rightful Ownerships with Invisible Watermarking Techniques: Limitations, Attacks, and Implications. IEEE Journal on Selected Areas in Communications, 16(4):573–586, May 1998.
[5] Bruce Schneier. Applied Cryptography. John Wiley & Sons, New York, 2nd edition, 1995.
[6] Stephen Wicker. Error Control Systems for Digital Communication and Storage. Prentice Hall, Englewood Cliffs, NJ, 1995.
[7] Alan Oppenheim and Ronald Schafer. Discrete-Time Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1989.
[8] Nasir Ahmed, T. Raj Natarajan, and K. R. Rao. Discrete Cosine Transform. IEEE Transactions on Computers, C-23(1):90–93, January 1974.
[9] Ephraim Feig and Shmuel Winograd. Fast Algorithms for the Discrete Cosine Transform. IEEE Transactions on Signal Processing, 40(9):2174–2193, September 1992.
[10] Martin Vetterli. Multidimensional Subband Coding: Some Theory and Algorithms. Signal Processing, 6(2):97–112, April 1984.
[11] Olivier Rioul and Martin Vetterli. Wavelets and Signal Processing. IEEE Signal Processing Magazine, 8(4):14–38, October 1991.
[12] Marc Antonini, Michel Barlaud, Pierre Mathieu, and Ingrid Daubechies. Image Coding Using Wavelet Transform. IEEE Transactions on Image Processing, 1(2):205–220, April 1992.
[13] Rafael Gonzalez and Richard Woods. Digital Image Processing. Addison Wesley, Reading, MA, 1992.
[14] Martin Kutter and Fabien Petitcolas. Fair Benchmark for Image Watermarking Systems. In Proceedings of SPIE Security and Watermarking of Multimedia Contents, volume 3657, pages 226–239, 1999.
[15] Arun Netravali and Barry Haskell. Digital Pictures: Representation, Compression, and Standards, chapter Visual Psychophysics. Plenum Press, New York, 2nd edition, 1995.
[16] Bernard Sklar. Digital Communications: Fundamentals and Applications. Prentice Hall, Englewood Cliffs, NJ, 2nd edition, 1988.
[17] Niklaus Wirth. Algorithms and Data Structures. Prentice Hall, Englewood Cliffs, NJ, 1986.
[18] Nikil Jayant, James Johnston, and Robert Safranek. Signal Compression Based on Models of Human Perception. Proceedings of the IEEE, 81(10):1385–1422, October 1993.
[19] Peter Noll. MPEG Digital Audio Coding. IEEE Signal Processing Magazine, 14(5):59–81, September 1997.
[20] Davis Pan. Tutorial on MPEG/Audio Compression. IEEE Multimedia Magazine, 2(2):60–74, Summer 1995.
[21] Mitchell Swanson, Bin Zhu, Ahmed Tewfik, and Laurence Boney. Robust Audio Watermarking Using Perceptual Masking. Signal Processing, 66(3):337–355, May 1998.
[22] Charles Stromeyer III and Bela Julesz. Spatial-Frequency Masking in Vision: Critical Bands and Spread of Masking. Journal of the Optical Society of America, 62(10):1221–1232, October 1972.
[23] Gordon Legge and John Foley. Contrast Masking in Human Vision. Journal of the Optical Society of America, 70(12):1458–1471, December 1980.
[24] J. F. Delaigle, C. De Vleeschouwer, and B. Macq. Watermarking Algorithm Based on a Human Visual Model. Signal Processing, 66(3):319–335, May 1998.
[25] Bernd Girod. The Information Theoretical Significance of Spatial and Temporal Masking in Video Signals. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display, volume 1077, pages 178–187, 1989.
[26] Martin Kutter, Frederic Jordan, and Frank Bossen. Digital Watermarking of Color Images Using Amplitude Modulation. Journal of Electronic Imaging, 7(2):326–332, April 1998.
[27] Bin Zhu and Ahmed Tewfik. Low Bit Rate Near-Transparent Image Coding. In Proceedings of SPIE Wavelet Applications II, volume 2491, pages 173–184, 1995.
[28] Mitchell Swanson, Bin Zhu, and Ahmed Tewfik. Robust Data Hiding for Images. In IEEE Digital Signal Processing Workshop, pages 37–40, 1996.
[29] Gregory Wallace. The JPEG Still Picture Compression Standard. Communications of the ACM, 34(4):30–44, April 1991.
[30] Albert Ahumada and Heidi Peterson. Luminance-Model-Based DCT Quantization for Color Image Compression. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display III, volume 1666, pages 365–374, 1992.
[31] Heidi Peterson, Albert Ahumada, and Andrew Watson. An Improved Detection Model for DCT Coefficient Quantization. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display IV, volume 1913, pages 191–201, 1993.
[32] Andrew Watson. DCT Quantization Matrices Visually Optimized for Individual Images. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display IV, volume 1913, pages 202–216, 1993.
[33] Mitchell Swanson, Mei Kobayashi, and Ahmed Tewfik. Multimedia Data-Embedding and Watermarking Technologies. Proceedings of the IEEE, 86(6):1064–1087, June 1998.
[34] Francois Gauthier. Fundamentals of Digital Radio Broadcasting (DRB) in Canada. Technical report, Communications Research Centre, October 1996.
[35] SDMI Portable Device Specification - Part 1. http://www.sdmi.org, July 1999.
[36] Walter Bender, Daniel Gruhl, Norishige Morimoto, and Anthony Lu. Techniques for Data Hiding. IBM Systems Journal, 35(3/4):313–335, 1996.
[37] Alan Oppenheim and Jae Lim. The Importance of Phase in Signals. Proceedings of the IEEE, 69(5):529–541, May 1981.
[38] Correspondence with Walter Bender, September 1999.
[39] Raymond Pickholtz, Donald Schilling, and Laurence Milstein. Theory of Spread-Spectrum Communications - A Tutorial. IEEE Transactions on Communications, COM-30(5):855–884, May 1982.
[40] Frank Hartung and Bernd Girod. Digital Watermarking of Uncompressed and Compressed Video. Signal Processing, 66(3):283–301, May 1998.
[41] Ingemar Cox, Joe Kilian, Thomas Leighton, and Talal Shamoon. Secure Spread Spectrum Watermarking for Multimedia. IEEE Transactions on Image Processing, 6(12):1673–1687, December 1997.
[42] Simon Haykin. Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, NJ, 3rd edition, 1995.
[43] James Massey. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory, IT-15(1):122–127, January 1969.
[44] Lenore Blum, Manuel Blum, and Michael Shub. A Simple Unpredictable Pseudo-Random Number Generator. SIAM Journal of Computing, 15(2):364–383, May 1986.
[45] Frank Hartung and Martin Kutter. Multimedia Watermarking Techniques. Proceedings of the IEEE, 87(7):1079–1107, July 1999.
[46] Petros Maragos, Ronald Schafer, and Russel Mersereau. Two-Dimensional Linear Prediction and Its Application to Adaptive Predictive Coding of Images. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-32(6):1213–1229, December 1984.
[47] Dan Dudgeon and Russel Mersereau. Multidimensional Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1984.
[48] Andrew Watson, Gloria Yang, Joshua Solomon, and John Villasenor. Visual Thresholds for Wavelet Quantization Error. In Proceedings of SPIE Human Vision and Electronic Imaging, volume 2657, pages 382–392, 1996.
[49] Christine Podilchuk and Wenjun Zeng. Image-Adaptive Watermarking Using Visual Models. IEEE Journal on Selected Areas in Communications, 16(4):525–539, May 1998.
[50] Bhavesh Bhatt and David Birks. Digital Television: Making it Work. IEEE Spectrum Magazine, 34(10):19–28, October 1997.
[51] Advanced Television Systems Committee. ATSC Digital Television Standard. http://www.atsc.org, September 1995.
[52] Zdzislaw Papir and Andrew Simmonds. Competing for Throughput in the Local Loop. IEEE Communications Magazine, 37(5):61–66, May 1999.
[53] Sara Robinson. Copyright Lawsuits Test Limits of New Digital Media. The New York Times, January 24, 2000.
[54] Frank Hartung and Bernd Girod. Fast Public-Key Watermarking of Compressed Video. In IEEE International Conference on Image Processing, volume 1, pages 528–531, 1997.
[55] Mitchell Swanson, Bin Zhu, and Ahmed Tewfik. Multiresolution Scene-Based Video Watermarking Using Perceptual Models. IEEE Journal on Selected Areas in Communications, 16(4):540–550, May 1998.
[56] Thomas Sikora. MPEG Digital Video-Coding Standards. IEEE Signal Processing Magazine, 14(5):82–100, September 1997.
[57] Plateau Research Group. Berkeley MPEG-1 Video Encoder Users Guide. http://bmrc.berkeley.edu/research/mpeg.
[58] Plateau Research Group. Berkeley MPEG Player. http://bmrc.berkeley.edu/research/mpeg.
[59] Michael Eckert and Andrew Bradley. Perceptual Quality Metrics Applied to Still Image Compression. Signal Processing, 70(3):177–200, November 1998.
[60] Keith Hill. The Role of Identifiers in Managing and Protecting Intellectual Property in the Digital Age. Proceedings of the IEEE, 87(7):1228–1238, July 1999.
[61] Mitchell Swanson, Bin Zhu, and Ahmed Tewfik. Data Hiding for Video-in-Video. In IEEE International Conference on Image Processing, pages 676–679, 1997.
[62] Allan Gersho. Advances in Speech and Audio Compression. Proceedings of the IEEE, 82(6):900–918, June 1994.
[63] Bo Tao and Michael Orchard. Coding and Modulation in Watermarking and Data Hiding. In Proceedings of SPIE Security and Watermarking of Multimedia Contents, volume 3657, pages 503–510, 1999.
[64] Alan Bell. The Dynamic Digital Disk. IEEE Spectrum Magazine, 36(10):28–35, October 1999.