
THE UNIVERSITY OF CALGARY

Performance Evaluation of Digital Watermarking Algorithms


by

James D. Gordy

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

CALGARY, ALBERTA April, 2000


© James D. Gordy 2000

THE UNIVERSITY OF CALGARY FACULTY OF GRADUATE STUDIES


The undersigned certify that they have read, and recommend to the Faculty of Graduate Studies for acceptance, a thesis entitled "Performance Evaluation of Digital Watermarking Algorithms" submitted by James D. Gordy in partial fulfilment of the requirements for the degree of Master of Science.

Supervisor, Dr. L. T. Bruton, Department of Electrical and Computer Engineering
Dr. A. Eberlein, Department of Electrical and Computer Engineering
Dr. H. Leung, Department of Electrical and Computer Engineering
Dr. M. Collins, Department of Geomatics Engineering

Date

Abstract
Digital watermarking is the process of embedding sideband data within the samples of a digital audio, image, or video signal. The watermark must be imperceptible to the intended audience of the host signal, and must withstand distortion from common signal processing operations. In this thesis, implementations and improvements of digital audio, image, and video watermarking algorithms are described. In addition, a novel performance evaluation framework is introduced and used to compare the algorithms in terms of bit rate, perceptual quality, computational complexity, and robustness to signal processing. Watermarks embedded in a transform domain representation of the host signal withstand signal processing operations better than time or spatial domain approaches. In addition, incorporating perceptual models of human hearing and vision improves the imperceptibility of watermark data and its resilience to signal processing operations. However, the cost of transform domain and perceptual analysis is an increase in computational complexity.


Acknowledgements
First of all, I would like to express my sincere thanks to Dr. Bruton for his supervision of my research, and for his advice, encouragement, and support that have kept me focused on my work. I can honestly say that his enthusiasm and interest have made my time at the University of Calgary a more enjoyable and rewarding experience. I would also like to gratefully acknowledge the generous financial support of the Natural Sciences and Engineering Research Council (NSERC), the Department of Electrical and Computer Engineering, and Dr. Bruton. My research and this thesis would not have been possible without their assistance. Finally, I wish to thank Norm Bartley for helping to keep the lab running smoothly, his encouraging conversations, and for his helpful suggestions. My fellow students in the department, particularly Chad Dreveny, Remi Gurski, and Mark Chakravorti, receive many thanks for their friendship, helpful suggestions, and lively lunchtime discussions.


To all the girls I've loved before.

Contents

Abstract   iii
Acknowledgements   iv
Dedication   v
Contents   vi
List of Tables   xii
List of Figures   xiii
List of Symbols   xvii

Chapter 1  Introduction   1
  1.1 Digital Media and Copyright Protection   1
  1.2 Requirements Analysis   3
    1.2.1 Imperceptibility   3
    1.2.2 Robustness to Signal Processing   4
    1.2.3 Private vs. Public Watermarks   5
  1.3 Watermark Embedding and Extraction Systems   6
    1.3.1 Perceptual Analysis   6
    1.3.2 Key Generation   8
    1.3.3 Encoding and Decoding   9
    1.3.4 Watermark Insertion and Extraction   9
      1.3.4.1 The Discrete Fourier Transform (DFT)   11
      1.3.4.2 The Discrete Cosine Transform (DCT)   12
      1.3.4.3 The Discrete Wavelet Transform (DWT)   12
  1.4 A Framework for Performance Evaluation   13
    1.4.1 Bit Rate   16
    1.4.2 Perceptual Quality   16
    1.4.3 Computational Complexity   17
    1.4.4 Robustness to Signal Processing   18
  1.5 Scope and Outline of Thesis   18

Chapter 2  Perceptual Modeling Techniques   20
  2.1 The Human Audio System (HAS)   21
    2.1.1 Frequency Sensitivity   21
    2.1.2 Frequency Masking   23
    2.1.3 Other Psychoacoustic Concepts   24
    2.1.4 The MPEG Layer I Psychoacoustic Model   27
  2.2 The Human Visual System (HVS)   31
    2.2.1 Frequency Sensitivity   35
    2.2.2 Frequency Masking   37
    2.2.3 Spatial and Luminance Masking   39
    2.2.4 Colour Sensitivity   39
    2.2.5 Temporal Masking   41
    2.2.6 Human Vision Models   41
      2.2.6.1 Spatial Domain Models   42
      2.2.6.2 Frequency Domain Models   45
  2.3 Summary   51

Chapter 3  Audio Watermarking   52
  3.1 Conventions   53
  3.2 Echo Coding   54
    3.2.1 Encoder Structure   56
    3.2.2 Decoder Structure   58
    3.2.3 Implementation and Proposed Improvements   59
      3.2.3.1 Selection of α and no   59
      3.2.3.2 Discussion   61
  3.3 Phase Coding   63
    3.3.1 Encoder Structure   64
    3.3.2 Decoder Structure   65
    3.3.3 Implementation Details   65
  3.4 Spread Spectrum Coding   67
    3.4.1 Encoder Structures   70
      3.4.1.1 Direct Sequence Spread Spectrum   70
      3.4.1.2 Frequency Hopped Spread Spectrum   70
    3.4.2 Decoder Structures   73
    3.4.3 Probability of Bit Error   74
    3.4.4 Implementation and Proposed Improvements   75
      3.4.4.1 Selection of α   75
      3.4.4.2 Prefiltering to Improve Decoding Reliability   77
      3.4.4.3 Discussion   80
  3.5 Frequency Masking   82
    3.5.1 Encoder Structure   83
    3.5.2 Decoder Structure   84
    3.5.3 Probability of Bit Error   85
    3.5.4 Implementation and Proposed Improvements   86
      3.5.4.1 Construction of Filter Coefficients   86
      3.5.4.2 Selection of α   87
      3.5.4.3 Prefiltering to Improve Decoding Reliability   87
      3.5.4.4 Discussion   88
  3.6 Performance Evaluation   88
    3.6.1 Effect of Block Size   89
    3.6.2 Perceptual Quality   91
    3.6.3 Computational Complexity   92
    3.6.4 Robustness to Signal Processing   93
      3.6.4.1 Linear and Nonlinear Filtering   94
      3.6.4.2 Additive and Coloured Noise   97
      3.6.4.3 Linear and Nonlinear Quantization   98
      3.6.4.4 Lossy Compression   101
  3.7 Summary   103

Chapter 4  Image Watermarking   104
  4.1 Conventions   105
  4.2 Spread Spectrum Techniques   106
    4.2.1 Encoder Structures   107
      4.2.1.1 Direct Sequence Spread Spectrum   108
      4.2.1.2 Frequency Hopped Spread Spectrum   108
    4.2.2 Decoder Structures   109
    4.2.3 Probability of Bit Error   110
    4.2.4 Implementation and Proposed Improvements   111
      4.2.4.1 Selection of α and S   111
      4.2.4.2 Spatial Domain Masking Analysis: DSSS-SM   111
      4.2.4.3 Frequency Domain Masking Analysis: FHSS-FMW and FHSS-FMT   112
      4.2.4.4 Prefiltering to Improve Decoding Reliability   114
      4.2.4.5 Discussion   117
  4.3 Multiresolution Embedding   118
    4.3.1 Encoder Structure   120
    4.3.2 Decoder Structure   122
    4.3.3 Discussion   124
  4.4 Performance Evaluation   125
    4.4.1 Effect of Block Size   127
    4.4.2 Perceptual Quality   128
    4.4.3 Computational Complexity   132
    4.4.4 Robustness to Signal Processing   133
      4.4.4.1 Mean and Lowpass Filtering   133
      4.4.4.2 Highpass Filtering   135
      4.4.4.3 High-emphasis Filtering   136
      4.4.4.4 Wiener Filtering   136
      4.4.4.5 Median Filtering   137
      4.4.4.6 Additive and Coloured Noise   139
      4.4.4.7 Quantization   142
      4.4.4.8 Histogram Equalization   144
      4.4.4.9 Lossy Compression   145
  4.5 Summary   147

Chapter 5  Video Watermarking   148
  5.1 Conventions   150
  5.2 Frame-By-Frame Watermarking   151
    5.2.1 Direct Sequence Spread Spectrum (DSSS)   152
    5.2.2 Spatial Masking Analysis: DSSS-SM   153
    5.2.3 Frequency Hopped Spread Spectrum (FHSS)   153
    5.2.4 Frequency Domain Masking Analysis: FHSS-FMW and FHSS-FMT   153
    5.2.5 Multiresolution Embedding   154
    5.2.6 Discussion   154
  5.3 Temporal Multiresolution Watermarking   154
    5.3.1 Encoder and Decoder Structures   155
    5.3.2 Selection of Wavelet Basis Functions   156
    5.3.3 Selection of Quantization Levels   157
    5.3.4 Discussion   158
  5.4 Performance Evaluation   158
    5.4.1 Effect of Block Size   159
    5.4.2 Perceptual Quality   159
    5.4.3 Computational Complexity   162
    5.4.4 Robustness to Signal Processing   163
      5.4.4.1 Frame Averaging   164
      5.4.4.2 Frame Reordering   164
      5.4.4.3 Frame Downsampling   168
      5.4.4.4 Lossy Compression   168
  5.5 Summary   173

Chapter 6  Conclusions   175
  6.1 Summary of Results   175
    6.1.1 Tradeoffs in Watermarking Systems   177
  6.2 Opportunities for Further Research   178
    6.2.1 Further Investigations and Improvements   178
    6.2.2 Watermark Invertibility   180
    6.2.3 Applications of Digital Watermarking   180
    6.2.4 Information Theory and Digital Watermarking   181
    6.2.5 Current Standardization Efforts   182

List of Tables

2.1 Minimum quantization matrix QMIN(k1, k2) constructed by measuring sensitivity to 2D-DCT basis functions.   47
3.1 SNR of watermarked audio signals versus original host signals (in decibels).   92
3.2 Audio watermarking algorithm CPU timings (in seconds).   93
4.1 Wavelet quantization levels for a 512 × 512 image at the standard viewing distance.   119
4.2 PSNR of watermarked images versus original images (in decibels).   131
4.3 Image watermarking algorithm timings (in seconds).   132
4.4 Bit error rate due to histogram equalization (in percent).   144
5.1 PSNR of watermarked video signals versus original sequences (in decibels).   161
5.2 Video watermarking algorithm CPU timings (in seconds).   162

List of Figures

1.1 Block diagram of a typical watermark embedding system. Dashed lines indicate optional blocks.   7
1.2 Block diagram of a typical watermark extraction system. Dashed lines indicate optional blocks.   7
1.3 Example of a subband filter bank and the lowpass and highpass decomposition filters.   14
2.1 Subset of the 32 overlapping filters modelling the bandpass channels within the Human Audio System.   22
2.2 Plot of TA(f), the absolute detection threshold of the Human Audio System.   23
2.3 Logarithmic mapping from frequencies to the Bark scale.   25
2.4 Raised detection threshold for a 15 dB masking signal at 5 kHz.   26
2.5 Power spectrum and corresponding absolute and raised detection threshold functions, TA(f) and TM(f), for a sample audio sequence.   32
2.6 Passband filter responses of the two-dimensional Cortex filters used to represent the set of visual channels.   33
2.7 Observed frequencies are dependent upon the image width and the viewing distance, standardized to six times the image width.   34
2.8 Plot of C(f), the visual contrast detection threshold function.   36
2.9 Weighting function used to determine the raised contrast detection threshold in the presence of a masking signal [23].   38
2.10 Raised detection thresholds of zero-mean additive white noise in the presence of (a) luminance masking and (b) spatial masking.   40
2.11 The optical point spread function.   43
2.12 Example of perceptual analysis using Girod's model of the Human Visual System.   46
2.13 Effect of a strong 2D-DCT coefficient on adjacent coefficients within the minimum quantization matrix.   50
3.1 Magnitude and phase responses of an echo filter with an echo amplitude of α = 0.1 delayed by no = 5 samples.   55
3.2 Structure of the echo coding algorithm's encoder.   56
3.3 Transition bands employed to minimize phase difference between blocks containing different bits.   57
3.4 Bit error rate of the echo coding algorithm for varying α and different echo delays (N).   60
3.5 Example of applying an echo filter kernel to an audio signal, and detection of the echo filter delay using the cepstrum.   62
3.6 Structure of the phase coding algorithm's encoder.   66
3.7 Magnitude spectrum of a PN sequence, |P(e^jω)|.   69
3.8 Block diagram of the DSSS encoder.   71
3.9 Block diagram of the FHSS encoder.   72
3.10 Error rate as a function of SNR for the spread spectrum algorithms.   76
3.11 Highpass filter used to prefilter host signals watermarked with the DSSS algorithm.   79
3.12 Block diagram of the spread spectrum decoder with prefiltering prior to decoding.   80
3.13 Comparison of DSSS decoding using highpass prefiltering and AR modeling.   81
3.14 Comparison of FHSS decoding using highpass prefiltering and AR modeling.   81
3.15 Block diagram of the frequency masking encoder.   84
3.16 Bit error rate as a function of block size for audio watermarking algorithms.   90
3.17 Bit error rate after filtering for audio watermarking algorithms.   96
3.18 Bit error rate in the presence of additive and coloured noise for audio watermarking algorithms.   97
3.19 Linear and nonlinear quantization functions for K = 5 bits per sample.   99
3.20 Bit error rate after quantization using linear and two nonlinear bit allocation functions.   100
3.21 Bit error rate due to lossy compression as a function of bit rate.   102
4.1 Example of a 512 × 512 image divided into 16 × 16 blocks in the spatial domain. Each block will be used to embed one bit of data.   106
4.2 Two-dimensional highpass filter used to prefilter host images watermarked with the DSSS and FHSS algorithms.   116
4.3 Decomposition filters used to compute the 2D-DWT.   120
4.4 N × N composite images made from the multiresolution decomposition subimages and quantization levels.   122
4.5 Example of a four-level wavelet decomposition of a 512 × 512 pixel version of LENNA.   123
4.6 Sample images used in the performance evaluation of image watermarking algorithms.   126
4.7 Bit error rate versus block size for the six watermarking algorithms compared.   127
4.8 LENNA image watermarked with the DSSS and DSSS-SM algorithms.   129
4.9 LENNA image watermarked using the FHSS, FHSS-FMW, and FHSS-FMT algorithms.   130
4.10 LENNA image watermarked using the multiresolution algorithm.   131
4.11 Bit error rate from mean filtering for image watermarking algorithms.   134
4.12 Bit error rate from lowpass filtering for image watermarking algorithms.   135
4.13 Bit error rate from highpass filtering for image watermarking algorithms.   136
4.14 Bit error rate from high-emphasis filtering for image watermarking algorithms.   137
4.15 Bit error rate from Wiener filtering for image watermarking algorithms.   138
4.16 Bit error rate from median filtering for image watermarking algorithms.   138
4.17 Bit error rate due to additive white Gaussian noise.   140
4.18 Bit error rate due to coloured white Gaussian noise.   141
4.19 Bit error rate due to linear quantization.   143
4.20 Bit error rate due to JPEG compression, as a function of compression quality.   146
5.1 Example of an image sequence divided into blocks in the spatial domain, as well as blocks temporally. Each three-dimensional block will be used to embed one bit of data.   150
5.2 Example of computing the temporal DWT on a video signal four frames in length.   157
5.3 Sample sequences used in the performance evaluation of video watermarking algorithms.   160
5.4 Bit error rate versus block size for video watermarking algorithms.   161
5.5 Bit error rate versus frame averaging for video watermarking algorithms.   165
5.6 Bit error rate versus frame reordering for video watermarking algorithms.   167
5.7 Bit error rate versus frame downsampling for video watermarking algorithms.   169
5.8 PSNR versus compression ratio for sample video signals.   171
5.9 Bit error rate due to MPEG compression as a function of bit rate.   172

List of Symbols

2D-DCT   Two-dimensional Discrete Cosine Transform
2D-DFT   Two-dimensional Discrete Fourier Transform
2D-DWT   Two-dimensional Discrete Wavelet Transform
a(n)   One-dimensional autoregressive (AR) model coefficients
a(n1, n2)   Two-dimensional autoregressive (AR) model coefficients
A(z)   One-dimensional autoregressive (AR) model
A(z1, z2)   Two-dimensional autoregressive (AR) model
c(k)   DCT weighting function
c(n1, n2)   Saturated ganglion image function of the HVS
cINH(n1, n2)   Gain-controlled retinal image function
C   Correlator output value
C(f)   Contrast detection threshold function of the HVS
C(f, fm)   Raised contrast detection threshold function
C(k1, k2)   2D-DCT function
CD   Compact Disc
D(n1, n2)   Localized distortion image
DCT   Discrete Cosine Transform
DFT   Discrete Fourier Transform
DSSS   Direct Sequence Spread Spectrum
DSSS-SM   DSSS with spatial masking analysis
DWT   Discrete Wavelet Transform
DVD   Digital Versatile Disc
E[·]   Statistical expectation operator
f   Frequency
fo   Normalized frequency
fs   Sampling frequency
f(k1, k2)   2D-DCT coefficient frequency
fo(k1, k2)   Normalized 2D-DCT frequency
FHSS   Frequency Hopped Spread Spectrum
FHSS-FMW   FHSS with Watson's frequency domain masking analysis
FHSS-FMT   FHSS with Tewfik's frequency domain masking analysis
FIR   Finite Impulse Response
h(n)   Impulse response of a linear filter
hHP(n)   Highpass filter impulse response
hINH(n1, n2)   Optical inhibition function of the HVS
hLOCAL(n1, n2)   Localized distortion spread function
hLP(n)   Lowpass filter impulse response
hPSF(n1, n2)   Optical point spread function of the HVS
H(e^jω)   Complex frequency response of h(n)
H(z)   Z-Transform of h(n)
HAS   Human Audio System
HDTV   High Definition Television
HVS   Human Visual System
JPEG   Joint Photographic Experts Group
k(f/fm)   Contrast detection threshold weighting function
km   DFT index of masking signal
kSAT   Saturation level of the HVS
l(n1, n2)   Monitor luminance image function
lRETINA(n1, n2)   Retinal image function
M(k)   Magnitude frequency response
MPEG   Moving Picture Experts Group
MSE   Mean Squared Error
p(n)   One-dimensional pseudorandom sequence
p(n1, n2)   Two-dimensional pseudorandom sequence
p(n1, n2, n3)   Three-dimensional pseudorandom sequence
PB   Probability of bit error, or bit error rate
P(k)   Power spectrum
PSNR   Peak Signal to Noise Ratio
QL(k1, k2)   Luminance masking minimum quantization matrix
QMIN(k1, k2)   Minimum 2D-DCT quantization matrix
Q(k1, k2)   Raised 2D-DCT quantization matrix
Q(x)   Complementary error function
R(d, km)   Raised detection threshold function
sMONITOR   Minimum monitor luminance level
S   Subset of DCT or 2D-DCT coefficients
SNR   Signal to Noise Ratio
TA(f)   Absolute detection threshold function of the HAS
TG(f)   Global masking threshold function
TM(f)   Frequency masking threshold function of the HAS
TR(k, km)   Raised detection threshold function
x(n)   Digital audio signal of length N samples
x(n1, n2)   Digital image of size N1 × N2 pixels
x(n1, n2, n3)   Digital video signal of size N1 × N2 × N3 pixels
x̄(n)   Watermarked audio signal

x̄(n1, n2)   Watermarked image
x̄(n1, n2, n3)   Watermarked video signal
x̂(n)   Real-valued cepstrum of x(n)
x̃(n)   Approximated or corrupted audio signal
x̃(n1, n2)   Approximated or corrupted image
x̃(n1, n2, n3)   Approximated or corrupted video signal
xCOMP(n1, n2)   Composite image from 2D-DWT decomposition
X(k)   Transform domain audio signal
X(k1, k2)   Transform domain digital image
X(k1, k2, n3)   Frame-by-frame transform domain video signal
X(n1, n2, k3)   Temporal multiresolution video signal
X̄(k)   Watermarked transform domain audio signal
X̄(k1, k2)   Watermarked transform domain image
X̄(k1, k2, n3)   Watermarked frame-by-frame transform video signal
X̄(n1, n2, k3)   Watermarked temporal multiresolution video signal
v(n)   Additive white Gaussian noise (AWGN) signal
v(n1, n2)   Two-dimensional AWGN signal
vx(n)   Prediction error filter noise function
vx(n1, n2)   Two-dimensional prediction error filter noise function
w(m)   One-dimensional watermark signal of size M bits
w(m1, m2)   Two-dimensional watermark signal of size M1 × M2 bits
w(k1, k2)   2D-DCT visual model weighting function
w(m1, m2, m3)   Three-dimensional watermark signal of size M1 × M2 × M3 bits
w̃(m)   Extracted one-dimensional watermark
w̃(m1, m2)   Extracted two-dimensional watermark
w̃(m1, m2, m3)   Extracted three-dimensional watermark
w1(n1, n2)   Linearized visual model parameter
w2(n1, n2)   Linearized visual model parameter
w3(n1, n2)   Linearized visual model parameter
α(n)   Time varying watermark magnitude function (audio)
α(n1, n2)   Space varying watermark magnitude function (images)
α(n1, n2, n3)   Time / space varying watermark magnitude function (video)
δ(n)   Impulse function
δ(n1, n2)   Two-dimensional impulse function
Δx(n1, n2)   Image distortion function
Δφ(·)   Phase frequency response difference
θ(·)   Orientation weighting factor
φ(k)   Phase frequency response
ρ(x, y)   Correlation coefficient for two signals
σ²v   Variance of AWGN process v(n)
θ(k1, k2)   2D-DCT coefficient angle

Chapter 1 Introduction
1.1 Digital Media and Copyright Protection
A great deal of information is now being created, stored, and distributed in digital form. Newspapers and magazines, for example, have gone online to provide real-time coverage of stories with high-quality audio, still images, and even video sequences. The growth in use of public networks such as the Internet has further fueled the online presence of publishers by providing a quick and inexpensive way to distribute their work. The explosive growth of digital media is not limited to news organizations, however. Commercial music may be purchased and downloaded from the Internet, stock photography vendors digitize and sell photographs in electronic form, and Digital Versatile Disc (DVD) systems provide movies with clear images and CD-quality sound.

Unfortunately, media stored in digital form are vulnerable in a number of ways. First of all, digital media may be simply copied and redistributed, either legally or illegally, at low cost and with no loss of information. In addition, today's fast computers allow digital media to be easily manipulated, so it is possible to incorporate portions of a digital signal into one's own work without regard for copyright restrictions placed upon the work. Encryption is an obvious way to make the distribution of digital media more secure, but often there is no way to protect information once it has been decrypted into its original form. The ability for pirates to easily copy works is one of the last hurdles that keeps publishers from completely adopting online distribution systems.

Legislation has been enacted recently in an effort to stop digital piracy. In the United States, for example, the Digital Millennium Copyright Act (DMCA) was passed in late 1998. The bill specifies and clarifies copyright rules for downloading and viewing copyrighted material from public networks such as the Internet [1]. These rules govern the concept of "fair use" (the copying of material for personal or academic purposes) and limit the distribution of copyrighted digital media. The bill also criminalizes the use of technologies for removing copyright notices or defeating copy-protection devices. The Canadian government is also reviewing and amending its copyright laws accordingly. However, given the ease with which digital media can be copied and manipulated, it is necessary to have technologies for tightly coupling copyright information with digital signals.

Digital watermarking is seen as a partial solution to the problem of securing copyright ownership. Essentially, watermarking is defined as the process of embedding sideband data directly into the samples of a digital audio, image, or video signal. Sideband data is typically "extra" information that must be transmitted along with a digital signal, such as block headers or time synchronization markers. It is important to realize that a watermark is not transmitted in addition to a digital signal, but rather as an integral part of the signal samples. The value of watermarking comes from the fact that regular sideband data may be lost or modified when the digital signal is converted between formats, but the samples of the digital signal are (typically) unchanged. To clarify this concept further, it is useful to consider an analogy between digital watermarks and paper watermarks.
Watermarks have traditionally been used as a form of authentication for legal documents and paper currency. A watermark is embedded within the fibres of paper when it is first constructed, and it is essentially invisible unless held up to a light or viewed at a particular angle. More importantly, a watermark is very difficult to remove without destroying the paper itself, and it is not transferred if the paper is photocopied. The goals of digital watermarking are similar, and it will be shown in the next section that digital watermarks require similar properties.

Before the concept of watermarking can be explored further, three important definitions must first be established. A host signal is a raw digital audio, image, or video signal that will be used to contain a watermark. A watermark itself is loosely defined as a set of data, usually in binary form, that will be stored or transmitted through a host signal. The watermark may be as small as a single bit, or as large as the number of samples in the host signal itself. It may be a copyright notice, a secret message, or any other information. Watermarking is the process of embedding the watermark within the host signal. Finally, a key may be necessary to embed a watermark into a host signal, and it may be needed to extract the watermark data afterwards.
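These three definitions can be made concrete with a toy example. The following Python sketch is not one of the algorithms studied in this thesis; it is a deliberately simple least-significant-bit scheme in which the host signal is an array of integer samples, the watermark is a short bit string, and the key seeds the pseudorandom choice of embedding positions. The function names and the key value are illustrative assumptions, not notation from the thesis.

```python
import numpy as np

def embed_lsb(host, watermark_bits, key=12345):
    """Embed watermark bits into the least significant bits of
    pseudorandomly chosen samples of an integer-valued host signal."""
    rng = np.random.default_rng(key)  # the key seeds the PRNG
    positions = rng.choice(host.size, size=len(watermark_bits), replace=False)
    marked = host.copy()
    for pos, bit in zip(positions, watermark_bits):
        marked[pos] = (marked[pos] & ~1) | bit  # overwrite the LSB only
    return marked

def extract_lsb(marked, num_bits, key=12345):
    """Recover the watermark; the same key regenerates the positions."""
    rng = np.random.default_rng(key)
    positions = rng.choice(marked.size, size=num_bits, replace=False)
    return [int(marked[pos] & 1) for pos in positions]

# Toy host signal (e.g. 8 audio samples) and a 3-bit watermark.
host = np.array([118, 42, 200, 73, 9, 154, 87, 31], dtype=np.int32)
bits = [1, 0, 1]
marked = embed_lsb(host, bits)
recovered = extract_lsb(marked, len(bits))
```

Note how the watermark travels as part of the sample values themselves, and how extraction is impossible without the key. Such LSB substitution is trivially destroyed by requantization or noise, which is precisely why the robust transform-domain methods of later chapters are needed.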

1.2 Requirements Analysis


It is important to define the requirements of a watermarking system, because they can be used to define criteria for comparing systems and selecting one for use in a particular application. There are two key aspects of a watermarking system: imperceptibility of the watermark within the host signal, and security of the watermark against common operations or attacks. Security may be further broken down into two features: robustness to signal processing, and whether the watermark is private or public in nature. The exact requirements, of course, will vary between applications. These three requirements are examined further in the following sections.

1.2.1 Imperceptibility
Most importantly, the watermark signal should be imperceptible to the end user who is listening to or viewing the host signal. This means that the perceived "quality" of the

host signal should not be distorted by the presence of the watermark. Ideally, a typical user should not be able to differentiate between watermarked and unwatermarked signals. In [2], the importance of incorporating perceptual modeling techniques into watermarking systems is further discussed. There are two reasons why it is important to ensure that the watermark signal is imperceptible. First, the presence or absence of a watermark should not detract from the primary purpose of the host signal, that of conveying high-quality audio or visual information. In addition, perceptible distortion may indicate the presence of a watermark, and perhaps its precise location within a host signal. This knowledge may be used by a malicious party to distort, replace, or remove the watermark data.

1.2.2 Robustness to Signal Processing


Another important requirement is that watermark signals must be reasonably resilient to common signal processing operations. Once a host signal is encoded with watermark data, distortions may be applied to the signal before, during, and after distribution across the Internet. These distortions may be designed to improve the quality of the host signal or compress it before transmission, and they may or may not significantly disrupt the host signal. Examples include noise reduction (lowpass or mean filtering), feature enhancement (histogram equalization), and lossy compression.

Throughout this thesis, the term attacker is used to describe an individual who attempts to corrupt or remove a digital watermark from a host signal. An attacker intentionally applies signal processing operations as a means of destroying the watermark signal without severely disrupting the quality of the host signal. Examples of this include requantization of signal values, and the introduction of correlated or uncorrelated noise into the watermarked signal. In [3], Cox and Linnartz discuss these and other common signal processing operations that could be applied to watermarked signals. In the context of this study, it is reasonable to assume that such distortions,

regardless of intention, will not severely disrupt the quality or value of the watermarked signal. One of the benefits of digital media is that they can be represented at an arbitrarily high level of resolution. For example, consider a piece of commercial music sampled at 44.1 kHz and stored at 16 bits per sample. Downsampling and quantizing the signal to a rate of 8 kHz and 8 bits per sample will greatly reduce the value of the music because its quality will be poor.

1.2.3 Private vs. Public Watermarks


Craver et al. recently introduced the distinction between two types of watermarking systems, private and public [4]. A private watermarking system requires that the original signal be present at the decoder in order to extract watermark information. In contrast, a public system does not require access to the original signal in order to decode the watermark. The terms private and public also indicate, to a degree, the intended audience of watermark data transmitted through a host signal. A public system, for example, would typically be used to send watermarks to the end-users of a host signal, whereas a private watermarking system would intuitively be more secure.

Private watermarking may not be practical for applications where a large number or volume of host signals are generated, or for applications where watermark data is intended for a large number of end-users to decode. Examples of such applications are high-definition digital television (HDTV) and broadcast digital radio. In these cases, it is not feasible to transmit both the original and watermarked versions of the host signal. However, private watermarking may be suitable for other applications, such as online stock photography shops, where a private library of digital media is maintained by the business, and watermarked versions sold to consumers. In this investigation, the emphasis will be on public watermarking algorithms, since it is likely that they will be of more use in practical applications.

1.3 Watermark Embedding and Extraction Systems


In this section, a generic watermark embedding and extraction system is presented, in an attempt to capture the various systems and configurations that have been presented in the literature. It should be noted that watermarking is essentially a communications system, and so it is difficult to represent every possible aspect or configuration. Figure 1.1 shows a block diagram of a typical embedding system. In its simplest form, such a system has two inputs, the host signal and the watermark data, and a single output, the watermarked version of the signal. As shown in the diagram, this process may be represented by two blocks: an encoder and an embedding function. In the former, watermark data is converted into a form suitable for applying to the signal, and in the latter, the encoded watermark is actually applied to the host signal. Likewise, Figure 1.2 shows a block diagram of a typical extraction system. It is clear from the illustration that the extraction procedure is almost an inverse of the embedding process. Depending on the intended application, two additional steps may be performed during the embedding and extraction process: perceptual analysis and key generation. Note that the original signal is an optional input to the extraction system. The presence or absence of this signal indicates the difference between a private and a public watermarking system, as discussed in Section 1.2.3. In the following sections, the blocks of the embedding and extraction systems are described in more detail.

1.3.1 Perceptual Analysis


The goal of the perceptual analysis stage is to analyze the host signal to determine how much distortion from the watermark may be introduced before it becomes perceivable by the intended audience of the host signal. This information can be used to vary the strength of the watermark temporally and / or spatially.

Figure 1.1: Block diagram of a typical watermark embedding system (perceptual analysis, key generation, watermark encoding, and watermark insertion blocks operating on the host signal and watermark). Dashed lines indicate optional blocks.

Figure 1.2: Block diagram of a typical watermark extraction system (perceptual analysis, key generation, watermark extraction, and watermark decoding blocks operating on the watermarked signal and, optionally, the original signal). Dashed lines indicate optional blocks.

Perceptual analysis, also called perceptual modeling, is based on psychoacoustic and psychovisual models of the human audio system and human visual system, respectively. The response of the human ear to a piece of music, for example, varies with time and with the frequency-domain characteristics of the music. Perceptual models predict which portions of the host signal are not perceivable to the audience and may be manipulated without a loss of perceptible quality. Perceptual models were originally developed as an addition to lossy audio, image, and video compression systems. By identifying portions of the host signal that are imperceptible (or redundant), the system may remove them to increase the coding rate. In watermarking systems, imperceptible portions of the host signal are employed as "channels" in which watermark data may be placed. It will be shown in later chapters that perceptual analysis is often computationally expensive, and so it may not be feasible to perform this step in time-critical applications or where computing power is limited. In these cases it may be possible to determine a maximum and uniform level of distortion that may be applied to a host signal.

1.3.2 Key Generation


As mentioned earlier, using a key to encode and decode a watermark is optional, but it may increase the security of a watermarking system. Recall from Section 1.2.3 that public watermarks, while not as secure as private systems, are more feasible for broadcasting digital audio or video signals. By using a private key, incorporated with limited access into a receiver or decoding system, a public watermark can be made more secure. In addition to generating keys, it may be useful to have a more formal infrastructure for generating, storing, and accessing watermark keys. In practice, key management for watermarking serves a purpose similar to that of keys used for cryptography [5]. However, the subject of key management is beyond the scope of this thesis.

1.3.3 Encoding and Decoding

This is simply the process of converting watermark data, such as copyright information (text) or some other data, to and from a form that can more easily be used with a watermarking algorithm (usually binary). At this stage, a key may also be used to encode or decode the watermark to and from a more secure form. Many cryptographic algorithms and standards exist, such as the commonly used Data Encryption Standard (DES) and the Rivest-Shamir-Adleman (RSA) techniques. Although encryption will not be considered in this thesis, a comprehensive source of cryptography concepts and algorithms can be found in [5]. Before the watermark is inserted into the host signal, this stage also allows the opportunity to employ error correcting codes (channel coding) to guard against possible errors at the receiver due to signal processing. The application of channel coding to watermarking techniques will not be examined in this thesis, but the implications of such an improvement will be discussed in Chapter 6. A good source of channel coding techniques for error detection and correction can be found in [6].
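Although channel coding is not examined in this thesis, the idea can be illustrated with a minimal sketch: a rate-1/3 repetition code, in which each watermark bit is embedded three times and recovered by majority vote. The bipolar {-1, +1} bit convention follows the notation used later in this chapter; the repetition code itself is a generic textbook technique, chosen here only for brevity.

```python
def repeat_encode(bits, r=3):
    """Encode each bipolar watermark bit by repeating it r times (rate-1/r code)."""
    return [b for b in bits for _ in range(r)]

def repeat_decode(bits, r=3):
    """Decode by majority vote over each group of r received bits."""
    out = []
    for i in range(0, len(bits), r):
        group = bits[i:i + r]
        out.append(+1 if sum(group) > 0 else -1)
    return out

# A single flipped bit within a group is corrected by the majority vote.
sent = [+1, -1, +1]
coded = repeat_encode(sent)
coded[1] = -coded[1]          # channel flips one repetition of the first bit
assert repeat_decode(coded) == sent
```

More powerful codes (Hamming, BCH, convolutional) correct more errors per embedded bit, at the cost of a lower effective bit rate; this tradeoff is revisited in Chapter 6.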

1.3.4 Watermark Insertion and Extraction


Watermark data, after being transformed into a binary form and possibly encoded, may be inserted into the host signal in either the time or spatial domain, or in a transform domain. In the former, samples of the host signal are manipulated directly. In the transform domain, the host signal is first converted into a transform representation by using a discrete transform kernel. Watermark data is then applied to the transform domain coefficients, and the watermarked signal is obtained from an inverse transform of the modified coefficients. The watermark data may be applied to the host signal in several ways, but the most common are additive, multiplicative, and non-linear. Let x(n) represent a host signal (regardless of domain), and w(n) represent binary watermark data of the form w(n) \in \{-1, +1\} to be embedded. If a perceptual analysis is performed on

the host signal, then let \alpha(n) represent the allowable strength of the watermark as a function of time, spatial position, or transform domain coefficient. It will be shown in Chapter 2 that in many cases \alpha(n) represents a per-sample maximum distortion of x(n), indicating that it is a strictly positive value. In an additive approach, one of the most commonly used, the watermarked signal, \tilde{x}(n), is obtained by simply adding the weighted watermark to samples of the host signal using the following formula:

\tilde{x}(n) = x(n) + \alpha(n) w(n)    (1.1)

where the bipolar nature of w(n) means the host signal is increased or decreased, depending on the watermark bit. In the context of a private watermarking system, the watermark data may be extracted by subtracting the original host signal from the watermarked version. Without access to the original signal, as in a public watermarking system, the presence of the original signal within the watermarked version must be removed or minimized. In a multiplicative embedding approach, the watermark is multiplied by both the host signal and the strength function:

\tilde{x}(n) = x(n) \left[ 1 + \alpha(n) w(n) \right]    (1.2)

Extracting the watermark data is more difficult, requiring division of the watermarked signal by the original signal (if it is available at the decoder). This approach is not often used in public watermarking systems. Finally, a common non-linear approach to embedding involves quantization of the host signal sample values, and then perturbing the quantized values by a fraction of the quantization level:

\tilde{x}(n) = \Delta(n) \left( \left[ \frac{x(n)}{\Delta(n)} \right] + \frac{1}{4} w(n) \right)    (1.3)

where \Delta(n) represents an allowable quantization level, rather than a maximum watermark strength, and [\,\cdot\,] represents the rounding operator. Since w(n) is bipolar, the sample value (after quantization) is increased or decreased according to the watermark bit. Extracting the watermark data is a simple matter of quantizing the watermarked signal by the same levels in \Delta(n), and then determining the sign of the result.

There are three commonly used signal transforms for watermarking in the transform domain. In the sections that follow, they are briefly reviewed.
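As an illustrative sketch (not the exact implementations evaluated in later chapters), the additive rule of Equation 1.1 and the quantization rule of Equation 1.3 might be coded as follows. The constant strength alpha and quantization step delta used in the example are hypothetical values chosen only for the demonstration; the sign-of-residual extractor follows the description above.

```python
def embed_additive(x, w, alpha):
    """Equation 1.1 with a constant strength: add the weighted bipolar watermark."""
    return [xi + alpha * wi for xi, wi in zip(x, w)]

def embed_quantized(x, w, delta):
    """Equation 1.3: round each sample to the nearest quantization level,
    then perturb it by a quarter of the step in the direction of the bit.
    (Python's round() uses banker's rounding at exact midpoints, which is
    sufficient for this sketch.)"""
    return [delta * (round(xi / delta) + 0.25 * wi) for xi, wi in zip(x, w)]

def extract_quantized(y, delta):
    """Recover each bit from the sign of the residual after requantization."""
    return [+1 if yi - delta * round(yi / delta) > 0 else -1 for yi in y]

x = [0.31, -0.42, 0.77, 0.06]
w = [+1, -1, -1, +1]
y = embed_quantized(x, w, delta=0.1)
assert extract_quantized(y, delta=0.1) == w
```

Note that the quantization scheme is public in the sense of Section 1.2.3: the decoder needs only the step sizes, not the original samples, whereas the additive scheme requires further processing to suppress the host signal at the decoder.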

1.3.4.1 The Discrete Fourier Transform (DFT)


Complete details on the DFT and its properties can be found in [7], but some key points are related here. The DFT generally produces complex-valued frequency domain coefficients. For a real-valued input x(n), its DFT possesses certain symmetry constraints that must be maintained in order to obtain a real-valued inverse transform. Fast algorithms exist for computing the DFT, such as the Fast Fourier Transform (FFT). The forward and inverse Discrete Fourier Transform of a signal N samples in length may be written by the following transform pair:

X(k) = \frac{1}{N} \sum_{n=0}^{N-1} x(n) e^{-j 2\pi nk/N}    (1.4)

x(n) = \sum_{k=0}^{N-1} X(k) e^{j 2\pi nk/N}    (1.5)

The DFT can be extended to two (and higher) dimensions. The two-dimensional DFT (2D-DFT) of an N_1 \times N_2 image is given by:

X(k_1, k_2) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x(n_1, n_2) e^{-j 2\pi (n_1 k_1 / N_1 + n_2 k_2 / N_2)}    (1.6)

x(n_1, n_2) = \frac{1}{N_1 N_2} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} X(k_1, k_2) e^{j 2\pi (n_1 k_1 / N_1 + n_2 k_2 / N_2)}    (1.7)

In this thesis, the operations of computing the forward and inverse DFT, regardless of dimension, will be denoted DFT and IDFT, respectively.
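A direct O(N^2) implementation of the one-dimensional transform pair can serve as a reference sketch; practical systems would use an FFT instead. The 1/N normalization is placed on the forward transform, matching the convention of Equations 1.4 and 1.5.

```python
import cmath

def dft(x):
    """Forward DFT of Equation 1.4, with the 1/N factor on the forward transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N)) / N
            for k in range(N)]

def idft(X):
    """Inverse DFT of Equation 1.5."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

x = [1.0, 2.0, 0.0, -1.0]
X = dft(x)
y = idft(X)

# Round trip recovers the real-valued input.
assert all(abs(yi.real - xi) < 1e-9 and abs(yi.imag) < 1e-9 for yi, xi in zip(y, x))

# Conjugate symmetry of a real signal's DFT: X(k) = conj(X(N - k)).
assert abs(X[1] - X[3].conjugate()) < 1e-9
```

The symmetry assertion illustrates the constraint mentioned above: a watermark applied to DFT coefficients of a real signal must preserve this conjugate symmetry, or the inverse transform will not be real-valued.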

1.3.4.2 The Discrete Cosine Transform (DCT)


The DCT, introduced in [8], produces real-valued coefficients that do not suffer from the symmetry constraints of the DFT. It is commonly used for audio, image, and video compression as the heart of transform-based coders. The DCT may be computed from the DFT, so fast algorithms such as the FFT may be used to compute it. Other efficient implementations of the DCT exist [9]. The forward and inverse Discrete Cosine Transform of a signal N samples in length may be written by the following equations:

X(k) = c(k) \sum_{n=0}^{N-1} x(n) \cos\left[ \frac{\pi (2n+1) k}{2N} \right]    (1.8)

x(n) = \sum_{k=0}^{N-1} c(k) X(k) \cos\left[ \frac{\pi (2n+1) k}{2N} \right]    (1.9)

where 0 \leq k \leq N-1, and

c(k) = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & 1 \leq k \leq N-1 \end{cases}    (1.10)

The equations above are commonly referred to as the DCT pair. Like the DFT, the DCT can be extended to two dimensions:

X(k_1, k_2) = c(k_1) c(k_2) \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x(n_1, n_2) \cos\left[ \frac{\pi (2n_1+1) k_1}{2N_1} \right] \cos\left[ \frac{\pi (2n_2+1) k_2}{2N_2} \right]    (1.11)

x(n_1, n_2) = \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} c(k_1) c(k_2) X(k_1, k_2) \cos\left[ \frac{\pi (2n_1+1) k_1}{2N_1} \right] \cos\left[ \frac{\pi (2n_2+1) k_2}{2N_2} \right]    (1.12)

where c(k) is the same as in Equation 1.10. In this thesis, the operations of computing the forward and inverse DCT, regardless of dimension, will be denoted DCT and IDCT, respectively.
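The one-dimensional pair of Equations 1.8-1.10 (the orthonormal DCT-II and its inverse) can be sketched directly; as with the DFT, this direct O(N^2) form is for reference only, and fast algorithms would be used in practice.

```python
import math

def c(k, N):
    """Scale factors of Equation 1.10."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct(x):
    """Forward DCT of Equation 1.8 (orthonormal DCT-II)."""
    N = len(x)
    return [c(k, N) * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                          for n in range(N))
            for k in range(N)]

def idct(X):
    """Inverse DCT of Equation 1.9."""
    N = len(X)
    return [sum(c(k, N) * X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(N))
            for n in range(N)]

x = [4.0, 3.0, 5.0, 10.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(idct(dct(x)), x))
```

Because the basis is orthonormal, the round trip is exact up to floating-point error; a constant signal concentrates all of its energy in the k = 0 (DC) coefficient.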

1.3.4.3 The Discrete Wavelet Transform (DWT)


A subband filtering system allows a signal to be separated into different frequency bands by employing a combination of lowpass, bandpass, and / or highpass filters, along with downsampling of the filtered signals [10]. If certain conditions are met in the design of the filters, then perfect reconstruction of the original signal can be obtained by using a reconstruction scheme of upsampling and filtering to remove spectral aliasing effects. Figure 1.3-(a) shows an example of a two-band filtering system employing a lowpass and highpass filter bank. The frequency responses of the two filters are shown in Figure 1.3-(b). In this study, the Discrete Wavelet Transform (DWT) is used to implement subband decomposition and reconstruction filter banks [11]. The DWT can be extended to work in two and higher dimensions by using separable filters working on separate dimensions. An excellent description of a two-dimensional wavelet filtering scheme can be found in [12]. In this thesis, the operations of computing the forward and inverse DWT, regardless of dimension, will be denoted DWT and IDWT, respectively.
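As a sketch of the analysis and synthesis structure, a single-level Haar filter bank (the simplest perfect-reconstruction pair) is shown below. The specific wavelet filters used in this study are not the Haar pair; it is chosen here only because its lowpass and highpass branches reduce to scaled sums and differences.

```python
import math

def haar_dwt(x):
    """One level of Haar analysis: lowpass (scaled sum) and highpass (scaled
    difference) branches, each downsampled by two. Assumes len(x) is even."""
    s = math.sqrt(2.0)
    low = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return low, high

def haar_idwt(low, high):
    """Perfect-reconstruction synthesis: upsample and recombine the two bands."""
    s = math.sqrt(2.0)
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / s)
        x.append((l - h) / s)
    return x

x = [9.0, 7.0, 3.0, 5.0]
low, high = haar_dwt(x)
assert all(abs(a - b) < 1e-9 for a, b in zip(haar_idwt(low, high), x))
```

Repeating the analysis on the lowpass branch yields the multi-level decomposition used by DWT-based watermarking schemes, where watermark bits are typically applied to selected subband coefficients before synthesis.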

1.4 A Framework for Performance Evaluation


From the previous section, it is clear that there are a number of approaches to embedding data within a host signal. Indeed, the watermarking algorithms evaluated in this thesis were chosen to represent a variety of approaches to embedding data. Since the algorithms differ greatly, it is vital to have a common set of criteria for comparing the algorithms. In this section, a novel framework for evaluating the performance of watermarking algorithms is proposed. Some components of the framework are based upon the probability of bit error, and how it varies in response to different evaluation criteria. Bit error refers to the difference between the data embedded within a host signal, and the data extracted at the receiver. Depending upon the watermarking application, there are a number of ways bit error can be determined and used. Bit error can be estimated directly from the watermark bits. Earlier it was assumed that w(n) \in \{-1, +1\} represents an embedded watermark M bits in length,

Figure 1.3: Example of a subband filter bank and the lowpass and highpass decomposition filters. Panel (a) shows the filter bank block diagram; panel (b) shows the magnitude responses of the half-band decomposition filters.

and let \tilde{w}(n) \in \{-1, +1\} represent the extracted watermark. The bit error rate, expressed as a percentage, is given by:

P_B = \frac{100}{M} \sum_{n=0}^{M-1} e(n), \qquad e(n) = \begin{cases} 1, & w(n) \neq \tilde{w}(n) \\ 0, & w(n) = \tilde{w}(n) \end{cases}    (1.13)

In some applications, the embedded watermark may be used as a signature representing the author or copyright owner. In this case, it is useful to measure how well the extracted watermark correlates with the signature. A threshold value may then be set to decide whether the extracted watermark is acceptable or not. This correlation coefficient is given by [13]:

\rho(w, \tilde{w}) = \frac{\sum_{n=0}^{M-1} w(n) \tilde{w}(n)}{\sqrt{\sum_{n=0}^{M-1} w^2(n) \sum_{n=0}^{M-1} \tilde{w}^2(n)}}    (1.14)

where 0 \leq \rho \leq 1. \rho = 1 indicates perfect correlation, while an extremely low value reveals that the watermarks are dissimilar.

It is very important to note that bit error rate is used as a measure of how well single bits may be extracted from the host signal, and that it is decoupled from the reliability of the watermark itself. For example, assume a system in which one of 1024 possible watermarks will be embedded to represent one of 1024 possible copyright owners. If a binary encoding is used, then a minimum of log_2 1024 = 10 bits must be embedded. If the bit error rate of the system is ten percent (or one bit in ten), then the watermark has a reliability of only 50 percent, since a single bit error will cause an incorrect copyright owner to be identified. However, if a longer watermark is used to represent the 1024 possible copyright owners, then a bit error rate of ten percent may be acceptable.

The evaluation framework proposed here is based upon four performance metrics: bit rate, perceptual quality, computational complexity, and robustness to signal processing. Each of these metrics is described in the following sections. Recently, other researchers have proposed a benchmark for comparing the performance of watermarking algorithms [14]. However, their approach differs from this one in several ways. First of all, their benchmark is limited to the study of image watermarking algorithms, and only those that require access to the original image at the decoder (private watermarks). In addition, their benchmark lacks two important aspects: comparison by bit rate and computational complexity.
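Both decoder-side measures of Equations 1.13 and 1.14 are straightforward to compute; the sketch below assumes bipolar watermark sequences, as used throughout this chapter.

```python
import math

def bit_error_rate(w, w_ext):
    """Equation 1.13: percentage of extracted bits that differ from the embedded ones."""
    M = len(w)
    return 100.0 * sum(1 for a, b in zip(w, w_ext) if a != b) / M

def correlation(w, w_ext):
    """Equation 1.14: normalized correlation between embedded and extracted marks."""
    num = sum(a * b for a, b in zip(w, w_ext))
    den = math.sqrt(sum(a * a for a in w) * sum(b * b for b in w_ext))
    return num / den

w     = [+1, -1, +1, +1, -1]
w_ext = [+1, -1, -1, +1, -1]   # one of five bits flipped at the decoder
assert bit_error_rate(w, w_ext) == 20.0
assert abs(correlation(w, w_ext) - 0.6) < 1e-9
```

For bipolar sequences of equal length M, each bit error reduces the correlation by 2/M, so a detection threshold on rho corresponds directly to a tolerated bit error rate.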

1.4.1 Bit Rate


In the context of this thesis, bit rate refers to the number of bits, M, of watermark data that may be reliably embedded within a host signal per unit of time or space. Audio and video signals contain a temporal dimension, and so bit rates for these watermarking algorithms are given in bits per second. In contrast, still images are limited to spatial dimensions, so bit rates for image watermarks are given in bits per pixel. In this study, a number of modifications to algorithms from the literature will be proposed that allow for a variable number of bits to be inserted. Foremost among these is the division of the host signal into variable-sized blocks. This is an important modification, because it makes the algorithms more flexible. Bit error rate will be measured as a function of bit rate, but it will be shown that the bit rate is limited by certain tradeoffs. As the desired bit rate increases, so does the bit error rate. Knowing the capacity limitations of watermarking algorithms is a useful design criterion.

1.4.2 Perceptual Quality


Because watermarks introduce distortion into host signals, another way of evaluating algorithms is to compare the magnitude of this distortion and how perceivable it is to the intended audience of the host signal. Perceptual quality is a measure of imperceptibility, obtained by determining both the amount of distortion introduced into a host signal by a watermarking algorithm, and how detectable the distortion is. A proper method of determining quality is to conduct a formal study using a large number of subjects and host signals [15]. As a simple example, consider image watermarking. Two images are presented to a viewer, each watermarked with

a different algorithm. The images are viewed under standard conditions, displayed a fixed distance from the viewer, and under a known level of ambient light. The subject is asked to select the image that has a better "quality". This process is performed for a large number of subjects and host images, and for every possible combination of algorithms, resulting in a ranking of the algorithms. Obviously, a formal perceptual quality study requires a significant amount of time and resources.

In this study, a simpler measure of quality will be used: the signal-to-noise ratio (SNR). This is simply the power of the host signal over the distortion power introduced by the watermarking algorithm. Although not as robust as a more formal study, SNR will be used because it is simple to implement and provides a rough measure of quality. If x(n) represents an audio signal of length N samples, and \tilde{x}(n) is the watermarked version, then the SNR is given by [16]:

SNR = 10 \log_{10} \frac{\sum_{n=0}^{N-1} x^2(n)}{\sum_{n=0}^{N-1} \left[ x(n) - \tilde{x}(n) \right]^2}    (1.15)

For images and video sequences, the peak signal-to-noise ratio (PSNR) will be used [13]:

PSNR = 10 \log_{10} \frac{255^2}{\frac{1}{N_1 N_2} \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \left[ x(n_1, n_2) - \tilde{x}(n_1, n_2) \right]^2}    (1.16)

where x(n_1, n_2) and \tilde{x}(n_1, n_2) take 8-bit values in the range [0, 255]. PSNR is commonly used as a performance metric for digital image and video compression algorithms. Perceptual quality is dependent upon the intended application of a watermarking system. In some situations, a detectable amount of distortion may be acceptable if it ensures a higher bit rate or more reliable encoding. In other cases, it may be required that watermark data be completely imperceptible to a user.
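The two quality measures of Equations 1.15 and 1.16 can be sketched as follows; the PSNR function operates on flattened pixel lists and assumes 8-bit values via a default peak of 255, which can be adjusted for other ranges.

```python
import math

def snr_db(x, x_wm):
    """Equation 1.15: host signal power over watermark distortion power, in dB."""
    sig = sum(v * v for v in x)
    err = sum((a - b) ** 2 for a, b in zip(x, x_wm))
    return 10.0 * math.log10(sig / err)

def psnr_db(x, x_wm, peak=255.0):
    """Equation 1.16: squared peak value over the mean squared error, in dB.
    Pixels are passed as flattened lists; peak assumes 8-bit images."""
    n = len(x)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_wm)) / n
    return 10.0 * math.log10(peak * peak / mse)

pixels    = [100.0, 120.0, 90.0, 110.0]
pixels_wm = [101.0, 119.0, 91.0, 109.0]   # each pixel perturbed by one level
# MSE = 1, so PSNR = 10 log10(255^2), roughly 48.1 dB.
assert abs(psnr_db(pixels, pixels_wm) - 10.0 * math.log10(255.0 ** 2)) < 1e-9
```

Both measures compare the host and watermarked versions sample by sample, so they penalize all distortion equally; unlike the perceptual models of Chapter 2, they do not account for where the distortion is least visible or audible.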

1.4.3 Computational Complexity


Another important tool for evaluating algorithms is measurement of the amount of time required to embed a watermark into a host signal, and then extract the information afterwards. This is important as a rough measure of the computational resources

required to implement each algorithm. In a classical sense, complexity often refers to "big-O" analysis, in which the complexity of an algorithm is roughly determined asymptotically as a function of the size of the input [17]. An algorithm with O(N^2) complexity, for example, requires on the order of N^2 processing steps for an input of size N. In time-critical applications, or where computing power is limited, selection of a watermarking algorithm requires more quantitative information. In this investigation, actual time in CPU cycles will be used as a measure of complexity.
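Cycle-accurate measurement is platform specific; as a hedged stand-in, the kind of quantitative comparison intended can be illustrated with wall-clock timing from a high-resolution timer. The timing helper and the quadratic example function below are illustrative choices, not part of this thesis's measurement methodology.

```python
import time

def time_operation(fn, *args, repeats=5):
    """Return the best-of-n wall-clock time for one call to fn, in seconds.
    Taking the minimum over several runs reduces scheduling noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Example workload with O(N^2) behaviour, mirroring the big-O discussion above.
def quadratic(n):
    return sum(i * j for i in range(n) for j in range(n))

t_small = time_operation(quadratic, 100)
t_large = time_operation(quadratic, 200)
assert t_small > 0 and t_large > 0
```

Timing both the embedding and the extraction stage of each algorithm on identical inputs gives the relative complexity comparison used in later chapters.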

1.4.4 Robustness to Signal Processing


Recalling the requirements analysis of Section 1.2, it was stated that a "good" watermarking system should be reasonably resistant to signal processing operations. In this study, common signal processing operations will be selected that represent typical "real-world" operations such as noise removal, lossy compression, and additive noise. In later chapters, these operations will be applied to watermarked signals, and then the resulting bit error rate at the decoder will be used as a performance metric.

1.5 Scope and Outline of Thesis


The major objective of this thesis is to present a performance evaluation of watermarking algorithms for embedding data within digital audio, image, and video signals. This will be accomplished by using the framework of performance metrics proposed in Section 1.4. The focus of the study is on public watermarks, which by definition do not require access to the original signal in order to decode the embedded data. Where necessary, modifications will be presented for algorithms that allow them to be compared on this common basis. In addition, modifications to the algorithms will be proposed that improve their performance. Earlier it was stated that perceptual analysis is an important stage in watermark embedding. In Chapter 2, the human audio and visual systems are introduced, along with psychoacoustic and psychovisual concepts such as frequency sensitivity

and masking. The study of these concepts has led to the development of mathematical models of human perception, and the models may be used to determine maximum allowable levels of distortion resulting from embedded watermark data. Some of these models will be introduced, along with a description of their implementation and necessary modifications.

In Chapters 3-5 the main results of this investigation will be presented, and the structure of these three chapters will be similar. First, a selection of watermarking algorithms for digital audio, image, and video signals will be described, along with details of their implementation considerations. Where possible, improvements to the algorithms will be proposed, and they will all be evaluated with respect to the evaluation framework proposed earlier in this chapter. Finally, in Chapter 6, the primary results of this investigation will be reviewed, and possible applications of watermarking technology will be examined. In closing, recommendations will be made for future research efforts in this area.


Chapter 2 Perceptual Modeling Techniques


In the previous chapter it was shown that perceptual analysis is an important tool for watermark embedding, as it allows a system to determine a time- and / or space-varying watermark "strength" that maximizes the watermark magnitude while ensuring that the distortion is imperceptible to an end-user. The goal of this chapter is to introduce the human audio and visual systems, and to describe some limitations and aspects of human perception that may be exploited for watermarking. In particular, the concept of masking will be introduced, where the presence of a strong signal tends to "mask" weaker signals with similar characteristics. A selection of mathematical models of human perception from the literature will be examined, along with a description of how they were implemented and adapted for use in this study. Throughout this chapter, it is important to remember that mathematical models of human perception are still quite primitive. The models that do exist are based upon psychological studies designed to reveal the response of the human auditory and visual systems to various stimuli. The models are not able to predict the correct response of humans all of the time, but may be considered a heuristic. The chapter is organized as follows. In Section 2.1 the Human Audio System is introduced, and the psychoacoustic properties of frequency sensitivity and frequency masking are introduced. In addition, an implementation of the MPEG Layer I psychoacoustic model is described. The Human Visual System is then introduced in

Section 2.2, along with the myriad of masking properties that have been studied by psychologists. Three psychovisual models are presented, one in the spatial domain and two that operate in the frequency domain. Modifications to these models will also be proposed that allow them to be more easily incorporated into watermarking systems.

2.1 The Human Audio System (HAS)


The purpose of the human audio system is to convert sound pressure waves into stimuli that are sent to the brain for processing. The exact response varies between individuals, but in general the HAS can detect very low frequencies and up to approximately 20 kHz. The response varies with time, the intensity of the sound, and the frequency-domain characteristics of the sound. Physically, the HAS may be approximated by a series of 32 overlapping filter banks with bandwidths that increase with frequency, as shown in Figure 2.1 [18]. The filter bank represents "channels", also referred to as critical frequencies, through which sounds of similar frequency are processed. The bandpass nature of these channels limits the frequency resolution of the HAS, and also gives rise to two key concepts: frequency sensitivity and frequency masking. These concepts will be further described in the following sections.

2.1.1 Frequency Sensitivity


Frequency sensitivity is used to describe the sensitivity of the HAS to a single sinusoidal tone of variable power. This sensitivity is obtained from psychoacoustic studies that present test subjects with a tone of fixed frequency, whose power is increased until the tone becomes audible to the listeners. The result, obtained for the range of audible frequencies, is the absolute detection threshold function, T_A(f). Frequency components of an audio signal below the threshold level are not audible. Figure 2.2 shows a plot of the absolute detection threshold, in decibels, as a function of frequency. It is clear from this diagram that the HAS is most sensitive to

Figure 2.1: Subset of the 32 overlapping filters modelling the bandpass channels within the Human Audio System.

Figure 2.2: Plot of T_A(f), the absolute detection threshold of the Human Audio System.

frequencies between 500 Hz and 8 kHz, called the mid-band frequencies.

2.1.2 Frequency Masking


Frequency masking is related to the frequency detection threshold described above, and it results from the filter bank structure of the HAS. There are three properties that result from the frequency masking aspects of the HAS: tone-masks-tone, noise-masks-tone, and noise-masks-noise. First of all, the presence of a strong tone (a single frequency component of relatively high power) tends to "mask" weaker adjacent tones (tone-masks-tone). This strong tone is often referred to as a masking signal or masking frequency. In particular, the absolute detection threshold function, T_A(f), is raised significantly for frequencies around the strong tone, and falls exponentially as the distance from the masking frequency increases. This occurs because

24 sounds consisting of similar frequencies are detected and sent to the brain by the same mechanism. A channel responding to a masking signal may be less sensitive to weaker signals of similar frequency. If a critical frequency band does not contain any single strong tones, then the combination of frequency components within the critical band may collectively serve to mask the presence of any individual tone within the band. This concept is referred to as noise-masks-tone, as the set of frequency components within the band are considered \noisy". The raised detection threshold at frequency f due to a masking signal at fm is a function of the distance between the two frequencies and the power of the masking signal. In psychoacoustic studies, frequencies and distances between frequencies are sometimes given in Barks. The Bark scale is a logarithmic mapping from frequencies in Hz, as shown in Figure 2.3. Figure 2.4 shows an example of the raised detection threshold for a masking signal with a power of 15 dB at a frequency of 5 kHz. The plot, compared to Figure 2.2, illustrates how much the absolute detection threshold is raised for all other frequencies due to the presence of the masking signal. It is clear that the e ect is maximized for frequencies adjacent to the masking frequency. Another frequency-domain characteristic of the HAS is the concept of noisemasks-noise. The presence of a noise-like audio signal, with a relatively at spectrum and no prominent frequency components, tends to mask additional noise applied to the signal. For example, consider a piece of music containing background crowd noise. Low levels of additive noise will not be perceivable to a listener.
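The Hz-to-Bark mapping of Figure 2.3 is commonly approximated with Zwicker's closed-form fit. The thesis does not reproduce its formula, so the function below is an assumed stand-in:

```python
import math

def hz_to_bark(f_hz):
    """Map a frequency in Hz to the Bark scale (Zwicker's approximation,
    an assumption; the thesis only plots the mapping)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

# Masking spread is measured as a Bark distance d = z(k) - z(km),
# e.g. between a 5 kHz masker and a 5.5 kHz component:
print(round(hz_to_bark(5500.0) - hz_to_bark(5000.0), 2))
```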

Figure 2.3: Logarithmic mapping from frequencies to the Bark scale.

Figure 2.4: Raised detection threshold for a 15 dB masking signal at 5 kHz.

2.1.3 Other Psychoacoustic Concepts

In addition to the formal (and well-studied) concepts of frequency sensitivity and frequency masking, there are several other aspects of hearing that have been exploited for watermarking. In Section 3.2, an algorithm will be introduced that takes advantage of the assumption that the HAS will tolerate a certain amount of resonance (or "echo") in an audio signal, provided that the echo delay is not too long. In Section 3.3, another algorithm will be introduced that relies upon the inability of the HAS to distinguish the phase difference between two distinct tones of the same frequency that are slightly out of phase by a constant factor. Discussion of these other aspects of the HAS is reserved for those sections.

2.1.4 The MPEG Layer I Psychoacoustic Model


As part of their initiative to develop a video compression system, the Moving Picture Experts Group (MPEG) has developed a standard for compressing wideband audio signals [19, 20]. The system relies upon a sophisticated psychoacoustic model that permits a high level of compression (on the order of 10-to-1) without a loss of perceptual quality. The MPEG audio compression scheme involves three formats, or layers, of increasing complexity and increasing compression rates. At the heart of all three layers is a psychoacoustic model that determines the frequency masking characteristics of a portion of the original audio signal. The model is applied to blocks of the audio signal, because most signals exhibit statistical stationarity within blocks of up to 30 ms in length. The result is a frequency masking threshold function, TM(f), indicating the allowable level of distortion that may be introduced at each frequency, if any, before the distortion becomes audible to the listener. What follows is an overview of the steps followed to compute the frequency masking function in the implementation used for this study [21]. Let x(n) represent an audio signal sampled at a rate fs = 44.1 kHz.

1. Divide the audio signal into blocks of N samples. Let block m be denoted xm(n), for 0 ≤ n ≤ N − 1. For a sampling rate of fs = 44.1 kHz and N = 512, for example, this corresponds to blocks of about 12 ms in length. For each block, compute the power spectrum function P(k) in accordance with:

   P(k) = |X(k)|^2    (2.1)

   where X(k), for 0 ≤ k ≤ N − 1, represents the Discrete Fourier Transform (DFT) of the current block. Since xm(n) is real-valued, P(k) is symmetric around a frequency of one-half the sampling rate. In the following steps, P(k) is considered for only half of the frequency components, or 0 ≤ k ≤ N/2 − 1. The frequency of a component of P(k) is given by

   f = (fs / N) k    (2.2)

   where fs is the sampling frequency.

2. Divide the power spectrum into 32 equal-width critical frequency bands to approximate the set of bandpass channels of the human ear across the range of audible frequencies. For N = 512 samples, for example, each critical band consists of (512/2)/32 = 8 coefficients from the power spectrum.

3. Identify tonal and non-tonal components in the power spectrum. A tonal component is defined as a local maximum of P(k):

   P(k − 1) < P(k) > P(k + 1)    (2.3)

   such that adjacent coefficients differ in power by at least 7 dB:

   P(k) − P(k − 1) ≥ 7 dB    (2.4)
   P(k) − P(k + 1) ≥ 7 dB    (2.5)

   A non-tonal component is defined as the sum of the power spectrum coefficients within a critical frequency band. If k1 and k2 are the boundary indexes of a critical band, then the power of a non-tonal component Pm is given by:

   Pm = Σ_{k=k1}^{k2} P(k)    (2.6)

   The frequency of a non-tonal component within its critical band, km, is determined by an average of the band's frequencies, weighted by the power spectrum coefficient associated with each frequency:

   km = [ Σ_{k=k1}^{k2} k P(k) ] / [ Σ_{k=k1}^{k2} P(k) ]    (2.7)

4. Remove tonal and non-tonal components that are below the absolute detection threshold TA(f), for it is assumed that they will not be audible to the listener. Also remove tonal components that are less than one-half of a critical band width from a neighbouring tonal component, since the response of the HAS to one such component will mask the other.

5. For each remaining tonal and non-tonal frequency component at index km, compute the raised detection threshold as a function of all other audible frequencies. Let d represent the distance between a frequency of interest and the tonal component frequency, k − km, measured on the Bark scale. The raised detection threshold at index k due to the presence of a tonal or non-tonal component at km is given by [21]:

   TR(k, km) = P(km) − 6.025 − 0.275 z(km) + R(d, km)    (2.8)

   where z(km) denotes the frequency of component km expressed in Barks, and R(d, km) is a piecewise-continuous function of the component power and the distance of frequency k from the masking frequency km [21]:

   R(d, km) = 17(d + 1) − [0.4 P(km) + 6]          for d < −1
              [0.4 P(km) + 6] d                    for −1 ≤ d < 0
              −17 d                                for 0 ≤ d < 1
              −(d − 1)[17 − 0.15 P(km)] − 17       for 1 ≤ d
                                                       (2.9)

   The result is a set of raised detection threshold levels, TR(k, km), one for each tonal and non-tonal component, indicating how each individual component raises the detection threshold level for all other frequencies.

6. Compute a global masking threshold level as the sum of the raised detection thresholds for all of the tonal and non-tonal components. This global function provides the raised masking threshold at a single frequency resulting from the contribution of all tonal and non-tonal components:

   TG(k) = Σ_{km} TR(k, km)    (2.10)

   Convert the global function into a function of frequency, TG(f), using Equation 2.2.

7. Finally, compute the frequency masking threshold function as the maximum of the absolute detection threshold (from Section 2.1.1) and the global masking threshold:

   TM(f) = max { TG(f), TA(f) }    (2.11)

   This final step is performed because a raised detection threshold will still be inaudible if it lies below the absolute detection threshold.

The original MPEG psychoacoustic model specifies a fixed block size of N = 512 samples, and values for the absolute detection threshold function, TA(f), are provided for 512/2 = 256 frequencies between 0 ≤ f ≤ 22.050 kHz, based on a sampling rate of 44.1 kHz. The formula for the raised detection threshold function of Equation 2.9 is also tuned to a block size of 512 samples. In addition, tonal and non-tonal components are removed in Step 4 above if they are less than four coefficients apart, and non-tonal components are constructed from the average power within eight power spectrum coefficients, because each critical band consists of eight coefficients. However, to incorporate this perceptual model into watermarking algorithms, it will be necessary to operate on variable block sizes. It will be shown in Chapter 3 that the psychoacoustic model may be tightly coupled to watermarking algorithms using variable block sizes. Where a larger number of samples is needed, missing values of the absolute detection threshold function, TA(f), will be obtained using bilinear interpolation. Interpolation will not be required for TR(k, km), the raised detection threshold function, because it is a function of the difference in frequencies. Where the block size is smaller than 512 samples, required values for the absolute detection threshold will be obtained using a nearest-neighbour approach. Since the power of TA(f) and TR(k, km) will be affected by a change in block size, they will also be scaled. For TA(f) provided in dB and a block size of N samples, the modified absolute detection threshold function is given by:

   T'A(f) = TA(f) + 20 log10 (N / 512)    (2.12)

A similar modification will be made to the raised detection threshold function. Because the MPEG model specifies a division into 32 critical frequency bands to approximate the filter bank structure of the HAS, the block size will be limited to N ≥ 64 samples. Also, the model was designed to approximate the short-term response of the HAS by analyzing audio samples of less than 30 ms in duration. By increasing the block size too far, it is likely that the model will no longer be accurate. Figure 2.5 shows an example of the power spectrum of an audio signal sampled at 44.1 kHz and 512 samples in length. The plot illustrates the power spectrum of the signal, along with the absolute detection threshold function TA(f) and the masking threshold function TM(f) computed using the procedure described above.
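Steps 1 and 3 of the procedure above can be sketched as follows. This is a simplified illustration (it keeps only the tonal-peak test and omits non-tonal components, the pruning of Step 4, and the Bark-domain spreading), not the full implementation used in the study:

```python
import numpy as np

def tonal_components(x):
    """Return power-spectrum bin indices of tonal components: local
    maxima of P(k) = |X(k)|^2 that exceed both neighbours by >= 7 dB
    (the >= 7 dB conditions imply the local-maximum test of Eq. 2.3)."""
    N = len(x)
    P_db = 10 * np.log10(np.abs(np.fft.fft(x)) ** 2 + 1e-12)  # Eq. 2.1, in dB
    half = P_db[: N // 2]            # x(n) real => spectrum is symmetric
    return [k for k in range(1, N // 2 - 1)
            if half[k] - half[k - 1] >= 7.0 and half[k] - half[k + 1] >= 7.0]

# A bin-aligned sinusoid in one 512-sample block yields a single tonal peak:
N = 512
x = np.sin(2 * np.pi * 40 * np.arange(N) / N)  # bin 40 ~ 3.4 kHz at fs = 44.1 kHz
print(tonal_components(x))                     # -> [40]
```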

Figure 2.5: Power spectrum and corresponding absolute and raised detection threshold functions, TA(f) and TM(f), for a sample audio sequence.

2.2 The Human Visual System (HVS)

The purpose of the HVS is to convert light rays incident upon the retina into a series of impulses sent from the ganglion cells to the human brain for processing. The structure and response of the HVS cause it to share a number of concepts with the Human Audio System described in the previous section. In particular, the HVS can be modeled as a set of overlapping two-dimensional bandpass filters with center frequencies varying in spatial frequency and orientation, as shown in Figure 2.6 [22]. From the illustration, it is clear that the filter bandwidths are narrow for low spatial frequencies, and widen with an increase in spatial frequency. These filters, also referred to as the cortex filters, represent channels through which visual stimuli of similar frequency and orientation are transmitted to the brain. The bandpass nature of these channels limits the frequency resolution of the HVS, and also gives rise to several important concepts described in the following sections.

Figure 2.6: Passband filter responses of the two-dimensional cortex filters used to represent the set of visual channels.

One important difference between the HVS and HAS is in the definition of frequency. In psychovisual studies, the spatial frequency of an image is often given in cycles per degree of vision, which differs from the spatial frequency of the image based on its sampling rate. Cycles per degree represents the observed frequency of a stimulus incident upon the retina, and the quantity depends upon the width of the image and the distance from the image to the viewer. In the description of psychovisual properties that follows, the standard viewing distance is assumed to be six times the width of the image, as illustrated in Figure 2.7. At this distance, the image width subtends 2 arctan(1/12) degrees of vision, so for an image of size N × N pixels a normalized spatial frequency of f0 cycles per pixel may be converted to f cycles per degree using the following transformation:

   f = [ N / (2 arctan(1/12)) ] f0    (2.13)

where the arctangent is evaluated in degrees. For example, a 512 × 512 image may possess spatial frequencies ranging from 0 ≤ f0 ≤ 0.5 cycles per pixel. If the image is displayed at the standard distance, the observed frequencies range from 0 ≤ f ≤ 27 cycles per degree. For a 256 × 256 image displayed at the corresponding standard distance, the observed frequencies are limited to a maximum of f = 13 cycles per degree.

Figure 2.7: Observed frequencies are dependent upon the image width and the viewing distance, standardized to six times the image width.

For the remainder of this discussion of visual models, spatial frequencies will often be specified in cycles per degree to explain physical properties independent of image properties.
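The cycles-per-pixel to cycles-per-degree conversion reduces to a small helper; at the standard distance of six image widths, half the image width over the viewing distance is (1/2)/6 = 1/12:

```python
import math

def cpp_to_cpd(f0_cpp, n_pixels):
    """Convert a normalized spatial frequency (cycles/pixel) to an
    observed frequency (cycles/degree) for an n_pixels-wide image
    viewed at six times its width."""
    # Full angle subtended by the image width, in degrees.
    degrees_subtended = 2.0 * math.degrees(math.atan(1.0 / 12.0))
    return f0_cpp * n_pixels / degrees_subtended

# Nyquist (0.5 cycles/pixel) for 512- and 256-pixel-wide images:
print(round(cpp_to_cpd(0.5, 512)), round(cpp_to_cpd(0.5, 256)))  # -> 27 13
```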

2.2.1 Frequency Sensitivity


Frequency sensitivity, also known as contrast sensitivity, refers to the response of the HVS to a stimulus of a single spatial frequency [15]. The HVS is capable of detecting frequencies between 0 and 50 cycles per degree. This frequency response is obtained from psychovisual studies in a similar manner to that used for determining frequency sensitivity for the HAS. A sinusoidal image of a single spatial frequency is displayed to a test subject, and the intensity of the stimulus, measured as contrast, is increased until the sinusoid becomes visible to the viewer. Contrast is given by the ratio of the difference to the sum of the peak and trough luminances of the sinusoidal stimulus:

   C = (LMAX − LMIN) / (LMAX + LMIN)    (2.14)

The result is a frequency-dependent contrast detection threshold function, C(f). Figure 2.8 shows a plot of contrast sensitivity as a function of spatial frequency [23]. Frequency components of an image with a contrast below the detection threshold are not visible. From the plot, it is clear that the HVS is most sensitive to spatial frequencies around 3 cycles per degree.

The method described above determines the sensitivity of the HVS to sinusoidal stimuli of a single spatial frequency. Sinusoids form the basis functions used in Fourier analysis of signals, and so C(f) represents the sensitivity of the HVS to Fourier basis functions. However, it is possible to determine sensitivity to other basis functions as well. In Section 2.2.6.2, quantization matrices will be described that are based upon measured sensitivity to 2D-DCT basis functions. In Section 4.3, the sensitivity of the HVS to 2D-DWT basis functions will be described.
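Reading the contrast definition above as the standard Michelson contrast, the computation is a one-liner:

```python
def michelson_contrast(l_max, l_min):
    """Contrast of a sinusoidal stimulus: the difference of peak and
    trough luminance over their sum (the standard Michelson form,
    assumed here to match Equation 2.14)."""
    return (l_max - l_min) / (l_max + l_min)

# A grating swinging between 60 and 40 cd/m^2 has contrast 0.2:
print(michelson_contrast(60.0, 40.0))
```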

Figure 2.8: Plot of C(f), the visual contrast detection threshold function.

2.2.2 Frequency Masking

Studies have determined that, as in the HAS, masking effects occur in human vision as well. The presence in an image of a strong frequency component (a masking signal) will mask the presence of other components of similar spatial frequency (masked signals). In particular, the contrast detection threshold function, C(f), of a masked signal is raised by the presence of a masking signal [23]. This effect is maximized for signals of the same frequency and orientation. The raised detection threshold at any spatial frequency f due to the presence of a masking signal at frequency fm is given by:

   C(f, fm) = C(f) max { 1, [ k(f/fm) Cm ]^ε }    (2.15)

where C(f) denotes the original detection threshold at f, Cm is the contrast of the masking component, and ε is a tunable parameter usually set to 0.649 [23]. k(f/fm) is a weighting function, illustrated in Figure 2.9. It is clear from the plot that the masking effect is highest for spatial frequencies close to the masking frequency, and decreases as the frequencies differ. In addition, the masking effect increases with Cm, the contrast of the masking signal.

The effect of frequency masking is also dependent upon orientation. A masking signal will raise the detection threshold for a weaker signal of similar frequency and orientation, but the effect will become less pronounced as the angle between the two signals increases. The masking signal will have little or no effect on the contrast detection threshold for a signal that is oriented 90 degrees away. This effect can be modeled as a Gaussian weighting factor as a function of the difference in orientation from the masking signal [24]:

   Θ(Δθ) = exp [ −0.5 (Δθ / σθ)^2 ]    (2.16)

where Δθ is the angle in degrees between the masking signal and the signal of interest, and σθ = 15°. Θ(Δθ) is applied to C(f, fm), the raised contrast detection function, to compensate for the orientation between signals.
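The two masking relations above can be combined in a short sketch. The weighting function k(f/fm) is only given graphically (Figure 2.9), so it is passed in by the caller, and applying the orientation factor to the threshold raise, rather than to the whole threshold, is an interpretive assumption:

```python
import math

def raised_contrast_threshold(c_f, k_weight, c_mask, d_theta_deg,
                              eps=0.649, sigma_theta=15.0):
    """Raised contrast detection threshold in the presence of a masker.

    c_f         base threshold C(f)
    k_weight    value of k(f/fm) from Figure 2.9 (caller-supplied)
    c_mask      contrast Cm of the masking component
    d_theta_deg orientation difference between masker and signal, degrees
    """
    raise_factor = max(1.0, (k_weight * c_mask) ** eps)              # Eq. 2.15
    orientation = math.exp(-0.5 * (d_theta_deg / sigma_theta) ** 2)  # Eq. 2.16
    # Attenuate the raise (not the base threshold) with orientation.
    return c_f * (1.0 + orientation * (raise_factor - 1.0))
```

With no masker influence (k_weight = 0) the base threshold is returned unchanged; a strong aligned masker raises it most, and a 90-degree orientation difference nearly removes the raise.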

Figure 2.9: Weighting function used to determine the raised contrast detection threshold in the presence of a masking signal [23].

2.2.3 Spatial and Luminance Masking

Spatial masking effects can be modeled in the spatial domain, as opposed to frequency sensitivity and masking, which are primarily frequency domain effects. In addition, frequency sensitivity and masking effects are global in nature, but the HVS processes effects locally as well. There are two main characteristics: luminance masking and spatial masking [25].

Luminance masking describes the fact that the ability to detect noise or distortion varies with the mean luminance of the image region. When superimposed onto an image of uniform intensity, the visibility of zero-mean white noise varies with the intensity of the image. Figure 2.10(a) shows a plot of the detection threshold, measured in noise variance, as a function of intensity for an 8-bit image.

Spatial masking occurs in an image around sharp changes in intensity (edges). On either side of the edge, the detection threshold of additive noise or distortion is raised. Figure 2.10(b) shows a plot of the detection threshold, measured in noise variance, of additive noise on either side of an edge. In this figure, the raised threshold is a function of the observed distance from the edge in degrees of vision. Negative degrees in the plot correspond to a dark region, while a uniform whiter region lies within positive degrees.

2.2.4 Colour Sensitivity


The retina contains rod cells, which are sensitive to intensity, and cone cells, which are sensitive to red, green, and blue wavelengths of light [13]. The masking concepts of the HVS examined so far are related to the intensity response of rod cells. The cone cells allow humans to perceive colour, but this sensitivity is not uniform among the three types of cones. In particular, the response of blue cones is much less than that of the red or green cones. A watermarking algorithm exists that employs the colour sensitivity of the blue channel to embed data within colour images [26]. However, it will not be used as part of the performance evaluation, since the other algorithms under consideration are for grayscale images only.

Figure 2.10: Raised detection thresholds of zero-mean additive white noise in the presence of (a) luminance masking and (b) spatial masking.

2.2.5 Temporal Masking

Masking effects in the HVS are not limited to strictly spatial or frequency stimuli. The time-varying nature of video signals evokes two different responses: the flicker frequency and temporal masking [15]. The flicker frequency results from a lowpass temporal filter in the HVS, limiting the temporal response to roughly 24 - 30 Hz (frames per second). Temporal frequency components occurring at a faster rate generally are not perceivable. However, the flicker frequency also depends greatly upon spatial frequencies within the video signal. Flicker frequency is not yet a useful property for watermarking, since it is not likely that a digital video sequence will be sampled at a higher rate, and the complex relationship between spatial frequencies and temporal frequencies has not been widely studied.

Temporal masking occurs when the local mean intensity of a video sequence changes abruptly, such as during a scene change, a rapidly moving object, or a bright flash. During a rapid intensity change, the detection threshold of additive noise or distortion is elevated for a period of between 50 - 100 ms before and after the change [25]. Unfortunately, temporal masking is not yet useful for watermarking: scene changes are relatively sparse compared to the overall length of sequences, and current models of the HVS do not accurately model the response of human vision to moving objects.

2.2.6 Human Vision Models


Models for human vision were originally developed as an addition to lossy image and video compression algorithms, in an effort to reduce bit rates while ensuring that the resulting signal is perceptually lossless. Many models have been introduced, but most of them take advantage of only one or two of the frequency and spatial masking concepts described in the previous sections [24]. This is because most psychovisual experiments have been designed to study the effect of one masking concept, such as frequency sensitivity, and from such experiments it is often difficult to construct relationships between two masking characteristics. HVS models may be roughly divided into two approaches, operating chiefly in the spatial domain or in the frequency / transform domain. In the following sections, implementations of some of these models will be described.

2.2.6.1 Spatial Domain Models


Spatial domain models of the HVS analyze the masking characteristics of a host image by processing it in the spatial domain. The model implemented in this study comes from a model of the HVS proposed by Girod [25]. His model predicts luminance, spatial, and temporal masking effects for an image or video sequence. Girod's model is nonlinear, and may be roughly divided into three sets of operations: formation of an image, displayed on a monitor, into an image upon the human retina; modeling of the adaptive gain control mechanism of the human eye; and modeling of the saturation characteristics of the human ganglion cells. It was mentioned in Section 2.2.5 that the effects of temporal masking are not particularly useful for watermarking applications, so in the description of Girod's model all references to temporal responses will be disregarded.

Input to the model is assumed to be an N × N image x(n1, n2) represented by intensity levels from 0 - 255. It is also assumed in this implementation that the image is displayed at the standard viewing distance of six times the image width. The pixel intensity levels are first converted into a screen luminance function emitted by the computer monitor:

   l(n1, n2) = LMONITOR [ x(n1, n2) + sMONITOR ]^γ    (2.17)

where LMONITOR = 0.35 × 10^−3 candelas per square meter, sMONITOR = 15, and γ = 2.2. The screen luminance is then converted into an image incident upon the eye's retina by convolving the luminance function with the optical point spread function (PSF) of the eye:

   lRETINA(n1, n2) = l(n1, n2) ∗ hPSF(n1, n2)    (2.18)

Figure 2.11: The optical point spread function.

The optical point spread function hPSF(n1, n2) has a circularly symmetric Gaussian impulse response with a half bandwidth of 1/60th of a degree of vision (1 arcmin), given by

   hPSF(n) = exp [ −0.5 ( |n| / σ )^2 ]    (2.19)

where n^2 = n1^2 + n2^2, and σ is the distance in pixels corresponding to a half-bandwidth of 1 arcmin of vision, obtained using the conversion of Equation 2.13. A plot of the PSF is shown in Figure 2.11.

Girod's model also approximates the adaptive gain control present within the retina cells. The gain control allows the eye to have a large dynamic range, and is represented by the formula:

   cINH(n1, n2) = lRETINA(n1, n2) / [ (lRETINA ∗ hINH)(n1, n2) + LAD ]    (2.20)

where hINH(n1, n2) is a Gaussian impulse response of the form of Equation 2.19 with a half bandwidth equal to 8 arcmin of vision, and LAD = 7 candelas per square meter.

Finally, the visual model predicts the saturation characteristics of the ganglion cells. This portion of the model is nonlinear, and the saturation of the ganglion cells results in luminance and spatial masking effects:

   c(n1, n2) = cINH(n1, n2)                                           for cINH(n1, n2) ≤ 1
   c(n1, n2) = 1 + (1/kSAT) log { kSAT [ cINH(n1, n2) − 1 ] + 1 }     for cINH(n1, n2) > 1
                                                                          (2.21)

where kSAT = 8. The result of this processing is an approximation of the stimulus transmitted to the brain by the ganglion cells.

The monitor luminance and saturation characteristics resulting in c(n1, n2) are nonlinear operations, and Girod introduced a linearization of his model about an operating point specified by the undistorted input image x(n1, n2). Linearization allows the model to be used for determining whether distortion in the image, denoted Δx(n1, n2), will be visible to a human observer. Determining whether this distortion is perceivable is achieved by using a localized detection threshold. This is done by convolving the square of the distortion image with a Gaussian impulse response hLOCAL(n1, n2) of half bandwidth equal to 13 arcmin of vision. The result is a localized distortion image given by:

   D(n1, n2) = Δc^2(n1, n2) ∗ hLOCAL(n1, n2)    (2.22)

If D(n1, n2) exceeds a pre-specified threshold somewhere within the image, then the distortion is deemed to be perceivable by the viewer. Obviously the maximum imperceptible distortion may be used as the weighting function of Section 1.3.1 to control the strength of an additive watermark signal. Computing the weighting function using Girod's model directly is difficult, because the model only indicates whether a particular distortion function would be detectable by the viewer. Tewfik et al proposed a method of incorporating Girod's model into perceptual image compression and watermarking algorithms [27, 28]. In their approach, simplifications are made so that a reverse procedure may be followed: given an input image x(n1, n2), determine the maximum distortion function Δx(n1, n2) that remains imperceptible to the viewer. The maximum distortion function may then be used to weight a watermark signal.

The linearization of Girod's model about the operating point x(n1, n2) may be expressed by the following equations. First of all, the error in the monitor luminance due to the distortion function is given by:

   ΔlMONITOR(n1, n2) = w1(n1, n2) Δx(n1, n2)    (2.23)

and the error in the signal transmitted to the brain is given by:

   Δc(n1, n2) = w2(n1, n2) [ ΔlRETINA(n1, n2) − w3(n1, n2) ΔcINH(n1, n2) ]    (2.24)

The weighting functions used in the above equations result from the linearization of the model about the operating point x(n1, n2). In the expressions below, an overbar indicates parameters computed from the non-linear model with x(n1, n2) as the input:

   w1(n1, n2) = d l̄MONITOR(n1, n2) / d x(n1, n2)    (2.25)
   w2(n1, n2) = 1 / ( c̄INH(n1, n2) + kSAT max { 0, l̄RETINA(n1, n2) − c̄INH(n1, n2) } )    (2.26)
   w3(n1, n2) = c̄INH(n1, n2)    (2.27)

Figure 2.12(a) shows an example of a 512 × 512 image x(n1, n2) with intensity values between 0 - 255. Figure 2.12(b) shows the scaled result of the corresponding masking analysis using Girod's model. The masking threshold values range from 2 ≤ T(n1, n2) ≤ 12. From the illustration, it is clear that the detection thresholds are highest around edges (such as on the shoulder in the image) and in regions of uniform intensity.
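The forward (nonlinear) part of Girod's model can be sketched as below. This is an illustrative approximation: the Gaussian half-bandwidths are given directly in pixels rather than derived from arcminutes of vision via the viewing-geometry conversion, and the blurs are applied with FFT-based circular convolution:

```python
import numpy as np

def girod_response(image, gamma=2.2, l_monitor=0.35e-3, s_monitor=15.0,
                   l_ad=7.0, k_sat=8.0, sigma_psf=1.0, sigma_inh=8.0):
    """Approximate stimulus transmitted to the brain for an 8-bit image.

    sigma_psf and sigma_inh are Gaussian widths in pixels, standing in
    for the 1-arcmin and 8-arcmin bandwidths of the model.
    """
    def gauss_blur(img, sigma):
        # Unit-DC-gain Gaussian applied in the frequency domain.
        f1 = np.fft.fftfreq(img.shape[0])
        f2 = np.fft.fftfreq(img.shape[1])
        h = np.exp(-2.0 * (np.pi * sigma) ** 2 *
                   (f1[:, None] ** 2 + f2[None, :] ** 2))
        return np.real(np.fft.ifft2(np.fft.fft2(img) * h))

    lum = l_monitor * (image + s_monitor) ** gamma               # display non-linearity
    l_retina = gauss_blur(lum, sigma_psf)                        # optical PSF
    c_inh = l_retina / (gauss_blur(l_retina, sigma_inh) + l_ad)  # adaptive gain control
    excess = np.maximum(c_inh - 1.0, 0.0)                        # ganglion saturation
    return np.where(c_inh <= 1.0, c_inh,
                    1.0 + np.log(k_sat * excess + 1.0) / k_sat)
```

On a uniform mid-gray image this yields a constant response below the saturation knee; image structure pushes local values toward the saturating branch.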

Figure 2.12: Example of perceptual analysis using Girod's model of the Human Visual System: (a) original image; (b) normalized masking threshold T(n1, n2).

2.2.6.2 Frequency Domain Models

Frequency domain models typically operate on blocks of the image that have been converted into the frequency domain using the two-dimensional Discrete Cosine Transform described in Section 1.3.4.2. They were originally designed as an attempt to create image-adaptive 8 × 8 quantization matrices for use in the JPEG compression algorithm. The matrix provided in the baseline JPEG standard is constant, and does not take advantage of masking concepts that may vary between images [29]. The models described in this section employ the 2D-DCT coefficients of each block, denoted C(k1, k2), to vary the quantization matrix for each block. The models may be used for watermarking an image on a block-by-block basis using the quantization approach of Equation 1.3.

All of the frequency-domain models begin with an 8 × 8 image- and block-independent basic quantization matrix. Since each DCT coefficient represents a frequency component, the basic matrix was constructed by measuring the sensitivity to each 2D-DCT basis function [30, 31]. The result is a minimum set of quantization levels QMIN(k1, k2) that allow for perceptually transparent modification of each DCT coefficient, as shown in Table 2.1.

        k2:  0    1    2    3    4    5    6    7
   k1:  0   14   10   11   14   19   25   34   45
        1   10   11   11   12   15   20   26   33
        2   11   11   15   18   21   25   31   38
        3   14   12   18   24   28   33   39   47
        4   19   15   21   28   36   43   51   59
        5   25   20   25   33   43   54   64   74
        6   34   26   31   39   51   64   77   91
        7   45   33   38   47   59   74   91  108

Table 2.1: Minimum quantization matrix QMIN(k1, k2) constructed by measuring sensitivity to 2D-DCT basis functions.

Watson built upon the basic quantization matrix by incorporating support for luminance masking effects and a simple adjustment for frequency masking effects [32]. Support for luminance masking was added by using a simple correction factor based upon the mean intensity of each block:

   QL(k1, k2) = QMIN(k1, k2) [ C(0, 0) / C̄(0, 0) ]^α    (2.28)

where C(0, 0) is the DC coefficient of the image block representing the mean intensity, C̄(0, 0) is the DC coefficient corresponding to an average monitor intensity of 128 (for a monitor displaying 8-bit pixels), and α = 0.649.

Rudimentary frequency masking was also incorporated into Watson's model by using a simple equation representing the frequency masking effect of a single coefficient. For a strong DCT coefficient, the contrast detection threshold is raised not only for adjacent frequency components, but also for the masking coefficient itself. In Watson's model, the raised contrast detection threshold was considered only for the masking coefficient:

   Q(k1, k2) = max { QL(k1, k2), |C(k1, k2)|^w(k1,k2) QL(k1, k2)^(1 − w(k1,k2)) }    (2.29)

where 0 ≤ w(k1, k2) ≤ 1 is an exponent controlling the effect of the DCT coefficient magnitude C(k1, k2) on the raised detection threshold. In Watson's model, w(0, 0) = 0, and w(k1, k2) is equal to 0.7 for all other coefficients.

To incorporate Watson's quantization matrix into a watermarking algorithm, it will be necessary to analyze blocks larger or smaller than the 8 × 8 blocks used in his model. To achieve this, Watson's model will require two modifications. First of all, the minimum quantization matrix, QMIN(k1, k2), will be constructed using bilinear interpolation of the original 8 × 8 matrix provided in Table 2.1 for larger blocks. This is a reasonable modification, because the distribution of spatial frequencies represented by DCT coefficients will be the same in blocks of different size (there will simply be a finer or coarser resolution of them). In addition, it will be necessary to ensure that C̄(0, 0) in Equation 2.28 reflects the average monitor intensity of an M × M block.

Tewfik et al proposed an improved model based on a more complicated frequency masking analysis than the single-frequency model used in Watson's model [33]. However, the authors provided few details about their approach, so what follows is a description of the frequency masking analysis implemented for this study. Assume that an N × N image has been divided into a set of M × M blocks, and that for each block the 2D-DCT has been computed as C(k1, k2).

1. Begin with the minimum quantization matrix QMIN(k1, k2) and apply the luminance masking modification from Watson's model (Equation 2.28) to produce QL(k1, k2).

2. For each 2D-DCT coefficient, determine the normalized spatial frequency of the coefficient in cycles per pixel:

f0(k1, k2) = (1 / 2M) √(k1² + k2²)   (2.30)
for 0 ≤ k1, k2 ≤ M − 1. Then convert the spatial frequency into cycles per degree using the transformation of Equation 2.13:

f(k1, k2) = [N / (2 arctan(1/6))] f0(k1, k2)   (2.31)

Also determine the orientation of the coefficient in degrees:

θ(k1, k2) = arctan(k2 / k1)   (2.32)

3. For every spatial frequency in f(l1, l2), compute the effect of its corresponding 2D-DCT coefficient C(l1, l2) on every other frequency in f. This is performed by adapting the raised contrast detection threshold function of Equation 2.15 to raise quantization levels. The effect of the 2D-DCT coefficient C(l1, l2) on the raised quantization level Q(k1, k2) is given by:

Q(k1, k2; l1, l2) = Q(k1, k2) max{ 1, κ[f(k1,k2)/f(l1,l2)] |C(l1, l2)|^w Q(l1, l2)^(−w) }   (2.33)

where w = 0.7, and κ is the frequency masking weighting function of Figure 2.9, normalized to 1 at a ratio of f(k1,k2)/f(l1,l2) = 1. The equation above employs the raised quantization coefficient from Watson's model, and weights it for adjacent frequencies. The masking effect is maximized when the two frequencies are the same, and is low when the two frequencies are dissimilar. The result is a set of M × M functions representing the raised quantization levels arising from every DCT coefficient in C(k1, k2).
4. Apply the weighting factor of Equation 2.16 to correct for the difference in angular orientation between θ(l1, l2) and θ(k1, k2).

5. Finally, for every frequency in f(k1, k2), compute the raised quantization level as the sum of the frequency masking effects from all other DCT coefficients. This is performed by using a summation rule of the form [21]:

Q(k1, k2) = [ Σ_{l1=0}^{M−1} Σ_{l2=0}^{M−1} Q(k1, k2; l1, l2)^β ]^{1/β}   (2.34)

where β = 2.

An example of the raised quantization matrix resulting from a single 2D-DCT coefficient is shown in Figure 2.13. The plot shows how much the minimum quantization matrix QMIN(k1, k2) for a block size of M = 64 will be raised as a result of a strong 2D-DCT coefficient at k1 = 29 and k2 = 11. It is clear from the plot that the masking effect is pronounced for 2D-DCT components with frequencies and orientations close to that of the masking signal.

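The single-coefficient masking step of Equation 2.29 can be sketched in a few lines of NumPy. This is a simplified illustration only: the uniform QL matrix and the coefficient values below are made up for the example and are not taken from Table 2.1.

```python
import numpy as np

def watson_contrast_masking(C, QL, w=0.7):
    """Raise each quantization threshold per Equation 2.29:
    Q(k1,k2) = max{ QL, |C|^w * QL^(1-w) }, with w(0,0) = 0 so the
    DC coefficient does not mask itself."""
    W = np.full(C.shape, w)
    W[0, 0] = 0.0
    return np.maximum(QL, np.abs(C) ** W * QL ** (1.0 - W))

# A single strong coefficient raises only its own threshold here; the
# multi-masker steps 3-5 above would additionally weight this term by
# kappa and pool the contributions over all coefficients.
QL = np.full((8, 8), 2.0)
C = np.zeros((8, 8))
C[3, 4] = 100.0
Q = watson_contrast_masking(C, QL)
```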
Figure 2.13: Effect of a strong 2D-DCT coefficient on adjacent coefficients within the minimum quantization matrix.

2.3 Summary

The goal of this chapter was to introduce the Human Audio System (HAS) and Human Visual System (HVS), along with a description of models from the literature that were implemented for this study. Masking was introduced as a key concept throughout this chapter. Essentially, it may be described as the presence of a strong signal "masking" the ability of humans to detect other signals with similar characteristics. The HAS is subject to masking effects due to frequency sensitivity and frequency masking. An implementation of the MPEG Layer I psychoacoustic model was described to take advantage of these effects. The HVS is slightly more complex, offering masking effects due to spatial frequency sensitivity, frequency masking, luminance masking, and spatial masking. Spatial frequencies differ from audio frequencies in that an observed spatial frequency varies with the distance of the viewer from the image. Two additional effects were introduced, colour sensitivity and temporal masking, but they will not be considered in this investigation. Three visual models were described and implemented: one operating in the spatial domain, and two that analyze an image in the 2D-DCT domain.


Chapter 3 Audio Watermarking


High-quality commercial music is increasingly being created, stored, and distributed in digital form. For example, many Internet web sites allow users to purchase and download music, both quickly and inexpensively. Such music is usually compressed using the MPEG Layer III or Windows Media formats, which allow for high levels of compression (on the order of 10:1) without any loss in perceptual quality [19]. In addition, digital radio broadcasts have begun in many metropolitan areas, airing freely available CD-quality music [34]. As discussed in Chapter 1, however, the widespread availability and use of these technologies has led to rampant piracy. Several organizations are currently considering ways of incorporating copyright protection mechanisms, including digital watermarks, into their compression standards and playback devices [35]. The purpose of this chapter is to review five audio watermarking algorithms that have been proposed in the literature, and to propose improvements to their encoder and decoder structures. Another goal of this chapter is to apply the performance analysis framework proposed in Chapter 1 as a means of comparing the algorithms. The results of this comparison should be useful when deciding on a watermarking algorithm for a particular application. The algorithms evaluated in this chapter were chosen to represent two different approaches to embedding data: time domain and frequency domain. They were

also selected to represent a range of computational complexities and implementation structures. Since the focus of this thesis is on public watermarking algorithms, the techniques chosen do not require access to the original signal in order to extract the watermark data. In some cases, however, having such access may improve the decoding process. The chapter is organized as follows. Sections 3.2 - 3.5 provide a description of the audio watermarking algorithms, including the theory behind them, encoder and decoder structures, and implementation details. This is followed in Section 3.6 by a performance evaluation of the algorithms with respect to perceptual quality, bit rate, computational complexity, and robustness to signal processing operations. Finally, a review of the chapter's findings is provided in Section 3.7.

3.1 Conventions
In order to provide a common basis to describe and compare the algorithms in this chapter, the following conventions are used. It is assumed that x(n) represents a discrete-time host audio signal of length N samples. This signal is divided into B = ⌊N/M⌋ blocks of M samples each. The signal is divided into blocks because it can be assumed that most audio signals exhibit local stationarity within blocks of less than 30 ms in length. In this case, second-order stationarity allows for analysis of the audio signal's mean and variance, which is useful for some of the algorithms. x̄(n) represents the watermarked audio signal, while xm(n) and x̄m(n) indicate the mth block in the original and watermarked signals, respectively, for 0 ≤ m ≤ B − 1. As mentioned in Section 1.4.1, dividing the host signal into blocks is a simple method for allowing a variable number of bits to be embedded. Therefore, it is assumed that one watermark bit is embedded in each block, and this sequence of B bits is denoted by w(m) ∈ {−1, +1}, for 0 ≤ m ≤ B − 1. A bit extracted from the watermarked signal is denoted w̃(m).

3.2 Echo Coding


The first algorithm studied, called the echo coding algorithm, embeds data into a host signal by adding a small amount of resonance, or echo. Bender et al, who introduced the algorithm, contend that natural signals such as recorded speech and music already contain resonance introduced during the recording process, such as the echoes present within a studio or concert hall [36]. They claim that the human ear is accustomed to hearing this slight resonance in commercial music, so adding more will not significantly impair the quality of the sound. This artificial form of resonance can be modeled mathematically as a linear system consisting of an impulse followed by a weighted, delayed impulse:

h(n) = δ(n) + α δ(n − n0)   (3.1)

where α is the magnitude of the echo, and n0 is the echo delay. The impulse response of the system, h(n), is referred to as an echo filter. It will be shown that α should be kept small compared to the magnitude of the host audio signal. Bender et al report that an echo delay of less than 1/1000th of a second is not perceivable by humans.

It is important to analyze the distortion introduced into a signal by the addition of echo. In the frequency domain, the echo filter's magnitude and phase responses are functions of both the echo delay and the amplitude of the delayed signal:

|H(e^jω)| = √[ 1 + 2α cos(ω n0) + α² ]   (3.2)

∠H(e^jω) = arctan[ −α sin(ω n0) / (1 + α cos(ω n0)) ]   (3.3)

Figure 3.1 shows a plot of the magnitude and phase frequency responses for α = 0.1 and a delay of n0 = 5 samples. It can be seen that the echo introduces a significant distortion into the host signal's magnitude response, along with a nonlinear phase response. Such a large gain distortion would probably be unacceptable in many applications. The magnitudes of these distortions vary with frequency and are directly proportional to α. Therefore, α should be kept small to minimize the distortion.
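The closed forms of Equations 3.2 and 3.3 can be checked numerically against the DFT of the filter's impulse response. This is a small verification sketch; the FFT length of 512 is arbitrary.

```python
import numpy as np

alpha, n0 = 0.1, 5                      # echo amplitude and delay (samples)
h = np.zeros(512)
h[0], h[n0] = 1.0, alpha                # h(n) = delta(n) + alpha*delta(n - n0)

H = np.fft.rfft(h)                      # sampled frequency response
w = 2 * np.pi * np.fft.rfftfreq(512)    # omega on [0, pi]

# Equation 3.2: |H| = sqrt(1 + 2*alpha*cos(w*n0) + alpha^2)
mag = np.sqrt(1 + 2 * alpha * np.cos(w * n0) + alpha ** 2)
# Equation 3.3: angle(H) = arctan(-alpha*sin(w*n0) / (1 + alpha*cos(w*n0)))
ph = np.arctan2(-alpha * np.sin(w * n0), 1 + alpha * np.cos(w * n0))
```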

Figure 3.1: Magnitude and phase responses of an echo filter with an echo amplitude of α = 0.1 delayed by n0 = 5 samples.

Figure 3.2: Structure of the echo coding algorithm's encoder.

An echo filter may be used for watermarking by varying the filter delay according to the bit, w(m), to be embedded within the host signal. This approach is described further in the following section.

3.2.1 Encoder Structure


The structure of the algorithm's encoder is shown in Figure 3.2. In the first stage, x(n) is passed through two equal-length echo filters, h−1(n) and h+1(n), each composed of a unit impulse followed by a second impulse of smaller magnitude, as discussed in the previous section. The echoes are delayed by δ−1 and δ+1 samples, respectively. Mathematically,

x̄−1(n) = xm(n) + α xm(n − δ−1)   (3.4)
x̄+1(n) = xm(n) + α xm(n − δ+1)   (3.5)

Finally, a mixer stage takes the two echoed source signals and selects one of them as the current output block, acting as a 2:1 multiplexer, according to the bit to be embedded within the block:

x̄m(n) = { x̄−1(n) if w(m) = −1
        { x̄+1(n) if w(m) = +1   (3.6)

Figure 3.3: Transition bands employed to minimize the phase difference between blocks containing different bits.

At the boundary between two blocks containing different bits, there will be a significant change of phase in the watermarked signal. This phase change may become audible to the listener, so to minimize the distortion there are two modifications that can be made to the algorithm. First of all, the echo delays δ−1 and δ+1 should be set close together, ideally only one or two samples apart. This will minimize the difference in magnitude and phase responses between the two echo filter kernels. In addition, a transition period can be incorporated into the encoder's mixer stage, in order to "ramp down" the current block's signal and "ramp up" the next block's signal, over the course of a fixed number of samples around the boundary between the two blocks. This modification to the structure introduces two additional mixer signals, s−1(n) and s+1(n), illustrated in Figure 3.3, which can be used to rewrite Equation 3.6 in the form

x̄m(n) = x̄−1(n) s−1(n) + x̄+1(n) s+1(n).   (3.7)

Note that at any instant of time, s−1(n) + s+1(n) = 1.
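The encoder of Equations 3.4 - 3.7 can be sketched as follows. The helper names and parameter values are illustrative only: delays of 44 and 45 samples correspond to roughly 1 ms at a 44.1 kHz sample rate, and the linear crossfade (a moving average of the bit-selection signal) guarantees s−1(n) + s+1(n) = 1 at every sample.

```python
import numpy as np

def echo(x, alpha, d):
    """Return x(n) + alpha * x(n - d)  (Equations 3.4/3.5)."""
    y = x.astype(float)
    y[d:] += alpha * x[:-d]
    return y

def embed_echo(x, bits, M, alpha=0.1, d_minus=44, d_plus=45, ramp=64):
    """Echo-coding encoder sketch: one bit per M-sample block, choosing the
    echo delay per bit, with a linear transition band between blocks."""
    xm, xp = echo(x, alpha, d_minus), echo(x, alpha, d_plus)
    s = np.zeros(len(x))                  # 0 -> bit -1, 1 -> bit +1
    for m, b in enumerate(bits):
        s[m * M:(m + 1) * M] = 1.0 if b == +1 else 0.0
    # smooth the selection signal so the mixer crossfades over `ramp` samples
    s = np.convolve(s, np.ones(ramp) / ramp, mode="same")
    return xm * (1.0 - s) + xp * s        # Equation 3.7
```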

3.2.2 Decoder Structure

At the decoder, the embedded data bit may be retrieved by determining the length of the echo delay introduced into the host signal within the current block. This may be accomplished by analyzing the cepstrum of the watermarked signal [7]. The real-valued cepstrum of a signal x(n) is formed from the inverse Discrete Fourier Transform of the natural logarithm of the magnitude frequency response of x(n):

x̂(n) = IDFT{ log |DFT{x(n)}| }   (3.8)

Let y(n) denote the convolution of a signal x(n) with an echo filter kernel h(n) as defined in Equation 3.1. In [7], it is shown that the cepstrum of y(n) may be written as

ŷ(n) = x̂(n) + ĥ(n)   (3.9)

where x̂(n) and ĥ(n) represent the cepstra of x(n) and h(n), respectively. A more precise mathematical expression can also be derived for ĥ(n) by noting that for |α| < 1, log(1 + α) may be written as a power series expansion of the form

log(1 + α) = Σ_{k=1}^{∞} (−1)^{k+1} α^k / k.   (3.10)

Using the z-transform representation of the echo filter from Equation 3.1, ĥ(n) may be obtained from the inverse z-transform of Ĥ(z) [7]:

ĥ(n) = Z^{−1}{ Ĥ(z) } = Z^{−1}{ log(1 + α z^{−n0}) } = Σ_{k=1}^{∞} (−1)^{k+1} (α^k / k) δ(n − k n0)   (3.11)

In other words, the cepstrum of the watermarked block consists of the cepstrum of xm(n) plus an infinite series of decaying impulses at integer multiples of either δ−1 or δ+1, depending on the embedded bit:

x̄̂m(n) = x̂m(n) + { ĥ−1(n) if w(m) = −1
                  { ĥ+1(n) if w(m) = +1   (3.12)

If the original signal is available at the decoder, then ĥ(n) can be obtained by computing the cepstrum of xm(n) and simply subtracting it from the cepstrum of the watermarked block. Within the framework of a public watermarking system, since the cepstrum of xm(n) is present and will interfere with the signal at the decoder, it is necessary to enhance ĥ(n) by ensuring that α is large enough to make the largest impulse of ĥ(n) detectable at the receiver. In this case, x̄̂m(n) will possess a peak at the echo delay. If a peak occurs at x̄̂m(δ−1), then a −1 bit was encoded in the block. Otherwise, a peak at x̄̂m(δ+1) indicates that a +1 bit was embedded.
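The cepstral detection described above can be sketched as follows. This assumes the decoder is synchronized to the block boundaries, and the helper names are illustrative; a large α (0.5 in the test) is used so the cepstral peak stands well clear of the host interference.

```python
import numpy as np

def real_cepstrum(x):
    """x_hat(n) = IDFT{ log|DFT{x(n)}| }  (Equation 3.8)."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)))).real

def decode_echo_block(xb, d_minus, d_plus):
    """Pick the bit whose candidate delay shows the larger cepstral peak
    (Equation 3.12)."""
    c = real_cepstrum(xb)
    return -1 if c[d_minus] > c[d_plus] else +1
```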

3.2.3 Implementation and Proposed Improvements


3.2.3.1 Selection of α and n0
From the discussion in Section 3.2.2, it should be clear that the echo delays must be long enough that the resonance is detectable at the decoder, but not so long that the echo becomes noticeable to the listener. Bender et al report that for most audio signals, an echo delay of less than 1=1000th of a second is imperceptible to humans. The values of and no used in this investigation were determined experimentally. The collection of ten host signals used later in this chapter, in the performance evaluation of Section 3.6, were also employed as host signals for testing the echo coding algorithm with varying parameters. Each of the host signals was watermarked 100 times with random watermark bits, using a block size of M = 2048 samples, and the results averaged. This process was repeated for three echo delay lengths of N = 0:01, 0:001, and 0:0001 seconds. Figure 3.4 shows the results of this experiment, and it is clear that the performance of the echo coding algorithm increases with both the magnitude of the echo and the echo delay. These results correspond to the discussion in the previous section. In this investigation, = 0:1, while ; and were set to approximately N = 0:001 of a second. This corresponds to sample delays of ; = bfs=1000c and = bfs=1000c + 1, where fs denotes the sample rate of the audio signal in Hz. In Section 3.2.1 it was shown that , the amplitude of the echo, should be small
1 +1 1 +1

Figure 3.4: Bit error rate of the echo coding algorithm for different echo amplitudes (α) and varying delays (N). (Axes: bit error rate in percent versus α, for N = 0.01, 0.001, and 0.0001.)

in order to minimize distortion in the host signal's magnitude and phase. However, in the previous section it was noted that if the original signal is not available at the decoder, then α should be large in order to detect the echo delay through the interference caused by the presence of x̂(n) in the cepstrum of the watermarked block. These conflicting constraints introduce a tradeoff into the echo coding algorithm: to increase the reliability of the encoding, the quality of the host signal must be compromised to a certain degree. To illustrate this tradeoff, consider the plots in Figure 3.5, which show an audio signal of M = 2048 samples, to which an echo filter kernel has been applied with α = 0.1 and n0 = 10 samples. The figure shows ŷ(n), the cepstrum of the output signal, where the impulses of ĥ(n) are visible at delays of n0 samples. Included in the plots are x̂(n) and ĥ(n), the constituent components of ŷ(n). It can be seen that the cepstrum of the original signal interferes with the detection of the echo filter impulses.

3.2.3.2 Discussion
The echo coding algorithm possesses certain features that make it attractive as a watermarking technology, and most notable is its simplicity. The encoder is a simple linear system, which makes it easy to implement in hardware or incorporate into an existing audio recording system. In addition, no private key sequence is required to embed or extract data from the host signal. The algorithm is also relatively resistant to synchronization errors, so misalignment by a few samples, up to the limits of the transition band, will not significantly affect computation of the cepstrum at the decoder. Unfortunately, certain features of the echo coding technique may limit its practical use. First of all, computation of the cepstrum and autocorrelation at the receiver may be too expensive if the audio signal is divided into large blocks and if the FFT cannot be employed. Since no private key sequence is required to "unlock" the watermark data, it is easy to detect and remove the watermark by applying an inverse

(Figure 3.5 panels: a. x(n); b. ŷ(n) = x̂(n) + ĥ(n); c. x̂(n); d. ĥ(n).)
Figure 3.5: Example of applying an echo filter kernel to an audio signal, and detection of the echo filter delay using the cepstrum.

echo filter of the form

H^{−1}(z) = 1 / (1 + α z^{−n0})   (3.13)

where n0 corresponds to the desired filter delay to remove. Finally, the algorithm relies upon a relatively weak property of the human audio system to embed data, and the quality of the output signal suffers as a result of having to increase the distortion to improve the reliability of the encoding.
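For illustration, the inverse filter of Equation 3.13 amounts to the recursion x(n) = y(n) − α x(n − n0), which removes a known echo exactly. This sketch assumes the attacker knows α and n0, which is precisely the weakness noted above.

```python
import numpy as np

def remove_echo(y, alpha, n0):
    """Apply H^{-1}(z) = 1 / (1 + alpha*z^{-n0})  (Equation 3.13)
    as the time-domain recursion x(n) = y(n) - alpha*x(n - n0)."""
    x = np.array(y, dtype=float)
    for n in range(n0, len(x)):
        x[n] -= alpha * x[n - n0]
    return x
```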
3.3 Phase Coding


The phase coding algorithm embeds data into an audio signal by taking advantage of the human auditory system (HAS) response to phase information. In an important work on the topic, Oppenheim and Lim discussed the importance of phase information in audio and image signals [37]. They note that for most audio signals, particularly speech, the short-term phase of the signal, arising from the Short-Time Fourier Transform (STFT) over a small time interval, is less important than the long-term phase over the course of the entire signal. The authors suggest that long-term phase information, containing sharp edges ("features"), contributes more to the intelligibility of an audio signal than the magnitude response. Another way of approaching the HAS sensitivity to phase is to consider the superposition of two sinusoidal signals that are out of phase by a constant factor:

x(t) = cos(2π f t) + cos(2π f t + φ0)   (3.14)

The human auditory system is not able to discern the phase difference between the two signals (unless, of course, φ0 is such that the two signals are completely out of phase). However, if the phase difference is allowed to vary with time, according to

x(t) = cos(2π f t) + cos[2π f t + φ0 + φ(t)]   (3.15)

then the listener will be able to detect this. Note that although the time-varying phase difference may be detectable, the constant phase difference between the sinusoids, φ0, will not be discernable.

Bender et al, authors of the echo coding algorithm reviewed earlier, proposed another watermarking scheme based on the HAS sensitivity to phase described above [36]. In their approach, they divide the host audio sequence into a set of equal-length segments and compute the DFT for each segment, equivalent to computing the STFT. However, as will be seen shortly, the algorithm introduces a constant phase change in the segments of the host signal while maintaining the time-varying phase difference of the original signal.

3.3.1 Encoder Structure


As described in [36], the first step of the phase coding algorithm is to compute the STFT of the current block in the host signal, xm(n). This is performed by dividing xm(n) into a set of L equal-length subblocks, and for each of these subblocks the DFT is used to obtain the magnitude spectrum Mi(k) and phase spectrum φi(k), for 0 ≤ i ≤ L − 1. The next step is to determine the phase difference between subblocks on a frequency-by-frequency basis:

Δφi(k) = φi(k) − φi−1(k).   (3.16)

To embed a bit of information into the current block, the phase of the first subblock, φ0(k), is replaced with a unique phase signature corresponding to the desired data bit:

φ̄0(k) = { φ−1(k) if w(m) = −1
        { φ+1(k) if w(m) = +1   (3.17)

The phase of each subsequent subblock is replaced with this phase signature plus the sum of phase differences up to the given subblock. In this manner, the relative phase of each subblock is preserved, and the long-term phase of the block itself is maintained:

φ̄i(k) = φ̄0(k) + Σ_{j=1}^{i} Δφj(k)   (3.18)

As a final step, the watermarked block x̄m(n) is obtained by computing the inverse DFT of each subblock using the original magnitude response Mi(k) and the modified phase response φ̄i(k). A block diagram of the encoder, illustrating the STFT, phase difference calculation, and phase signal reconstruction, is shown in Figure 3.6.
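The encoder steps of Equations 3.16 - 3.18 can be sketched as follows, shown here in the single-component form adopted later in Section 3.3.3. The function name, the bin index k_sig, and the test sizes are illustrative only; the real-valued FFT is used so the rebuilt subblocks remain real.

```python
import numpy as np

def phase_encode(xb, L, bit, k_sig):
    """Phase-coding encoder sketch: set the phase of DFT bin k_sig in the
    first subblock to +/- pi/4 per the bit (Equation 3.17, one component),
    then rebuild every later subblock from the new starting phase plus the
    original subblock-to-subblock phase differences (Equation 3.18)."""
    sub = xb.reshape(L, -1)
    X = np.fft.rfft(sub, axis=1)
    mag, ph = np.abs(X), np.angle(X)
    dph = np.diff(ph, axis=0)                          # Equation 3.16
    ph_new = ph.copy()
    ph_new[0, k_sig] = bit * np.pi / 4                 # Equation 3.17
    ph_new[1:] = ph_new[0] + np.cumsum(dph, axis=0)    # Equation 3.18
    y = np.fft.irfft(mag * np.exp(1j * ph_new), n=sub.shape[1], axis=1)
    return y.ravel()
```

Because only the starting phases change, all phase differences between adjacent subblocks, and every magnitude spectrum, are preserved exactly.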

3.3.2 Decoder Structure


When compared with the algorithm's encoder, the decoder structure is much simpler to implement. The embedded bit is extracted by computing the DFT of the rst subblock, extracting its phase response, (k), and comparing this response to ; (k) and (k) using the statistical correlation coe cient introduced in Chapter 1. The larger value of corresponds to the embedded bit:
0 1 +1

w(m) = ~

8 > < > :

;1 if (
+1 if (

;1 ) >
1

( ; )< (

+1 +1

) )

(3.19)

Note that access to the original signal is not required to extract the embedded bits, and that even having x(n) available at the decoder will not be useful. The decoder also requires synchronization of the watermarked signal with the block and subblock boundaries.
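The decision rule of Equation 3.19 reduces to a comparison of two correlation coefficients. In this sketch the signature vectors phi_m and phi_p are assumed known at the decoder, and the helper names are illustrative.

```python
import numpy as np

def rho(a, b):
    """Statistical correlation coefficient used in the decision rule."""
    return np.corrcoef(a, b)[0, 1]

def phase_decode(yb, L, phi_m, phi_p):
    """Equation 3.19: compare the first subblock's phase response with the
    two candidate signatures and return the bit with the larger rho."""
    ph = np.angle(np.fft.rfft(yb.reshape(L, -1)[0]))
    return -1 if rho(ph, phi_m) > rho(ph, phi_p) else +1
```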

3.3.3 Implementation Details


In this investigation, it was found that replacing the entire first subblock phase, φ̄0(k), with a phase signature introduced too much distortion into the host signal [38]. Recall that one of the goals of watermarking is to minimize the distortion introduced into the host signal. In this implementation, only one phase component was altered in φ̄0(k), corresponding to a mid-band frequency of approximately 10 kHz (the exact component varies with the sample rate). The phase component was set to ±π/4 in accordance with

φ̄0(k) = { −π/4 if w(m) = −1
        { +π/4 if w(m) = +1   (3.20)

Figure 3.6: Structure of the phase coding algorithm's encoder.

The subblock size was set to M/16, where M is the size of each block in samples. The primary advantage of the phase coding algorithm is that, like the echo coding algorithm described earlier, the encoder and decoder structures are conceptually simple. In particular, it should be noted that the DFT and IDFT computations on each of the subblocks may be performed in parallel. In addition, use of the correlation coefficient as a means of deciding between bits at the decoder means that one would expect the algorithm to perform well in the presence of additive noise. Unfortunately, it will be shown in Section 3.6 that the watermarked signals generated using this algorithm are of relatively poor quality, even with the single phase coefficient modification introduced above. Also, because only the first subblock is used to decode the bits, synchronization of the watermarked signal is necessary to correctly extract them. In addition, removing or corrupting the first subblock effectively destroys the watermark data. Delaying the subblock by n0 samples introduces a linear phase change into the signal.


3.4 Spread Spectrum Coding


Spread spectrum techniques were invented in the 1950s as a means of improving the security and reliability of digital communications systems, and today they are regularly employed in wireless systems. Spread spectrum systems all share the following key characteristics [39]. A narrowband data signal, such as a frequency shift keying (FSK) signal for example, is converted into a spread signal by modulating it with a wideband spreading signal that is independent of the data signal. This process causes the spread signal to occupy a spectral bandwidth far in excess of the bandwidth of the original data signal. The data signal at the decoder is recovered by correlating the spread signal with a synchronized copy of the spreading signal, a process also known as despreading. Pseudorandom sequences, also referred to as pseudonoise or PN sequences, are commonly used as spreading signals in spread spectrum systems. In particular,

maximal-length linear feedback shift registers (LFSRs) are often used because they are simple to design, analyze, and implement in hardware [5]. PN sequences have several properties that make them attractive as spreading signals [16]:

- PN sequences typically have the same statistical properties as white noise, such as a wide and relatively flat power spectrum. A data signal spread with a PN sequence will occupy a correspondingly wide bandwidth.

- PN sequences are periodic and deterministic in nature, meaning they can be predicted. This is important at the receiver for synchronization, because the spreading signal must be matched with the spread data signal so that the information can be decoded properly.

Spread spectrum techniques can be incorporated into a digital watermarking algorithm by recalling that in an additive model of watermarking, the watermark data signal is "corrupted" by a noisy channel consisting of the host audio signal:

x̄m(n) = xm(n) + α w(m)   (3.21)

where α w(m) denotes the weighted data signal that is to be transmitted. Spreading the data signal introduces two important benefits that are attractive from both a communications theory and digital watermarking viewpoint:

- The spread signal is more resilient to jamming noise. The power spectrum of the host audio signal is usually not flat, but possesses a lowpass or bandpass characteristic. By increasing the bandwidth of the data signal to occupy frequencies separate from those of the host signal, the encoding can be made more reliable.

- Since PN sequences appear as random signals and spread the bandwidth of the data signal, it becomes more difficult for an unauthorized party to detect and remove the watermark from the host signal.

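A maximal-length PN sequence can be generated with a Fibonacci LFSR. In this sketch the feedback taps at stages 7 and 4 correspond to a primitive degree-7 polynomial, giving the full period 2^7 − 1 = 127; the tap choice and seed are illustrative.

```python
import numpy as np

def lfsr_pn(length, taps=(7, 4), degree=7, seed=0b0000001):
    """Bipolar PN sequence from a maximal-length Fibonacci LFSR. The
    feedback bit is the XOR of the tapped stages; any nonzero seed
    yields the same m-sequence up to a cyclic shift."""
    state = [(seed >> i) & 1 for i in range(degree)]   # stages 1..degree
    out = []
    for _ in range(length):
        out.append(state[-1])                          # output the last stage
        fb = 0
        for t in taps:
            fb ^= state[t - 1]
        state = [fb] + state[:-1]                      # shift in the feedback
    return np.array([1.0 if b else -1.0 for b in out])
```

One period of the resulting m-sequence contains 64 ones and 63 minus-ones (nearly zero mean), and its periodic power spectrum is flat at every nonzero frequency, matching the "noise-like" properties listed above.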
Figure 3.7: Magnitude spectrum of a PN sequence, |P(e^jω)|.

Two commonly used spread spectrum techniques are direct sequence spread spectrum (DSSS) and frequency hopped spread spectrum (FHSS). In the former, a watermark signal is modulated directly by a PN sequence:

y(n) = w(n) p(n)   (3.22)

where w(n) denotes the watermark signal and p(n) represents the PN sequence. If p(n) occupies a wide spectral bandwidth (like white noise), then the bandwidth of y(n) will be expanded due to the convolution of the two signals in the frequency domain:

Y(e^jω) = W(e^jω) ⊛ P(e^jω).   (3.23)

Figure 3.7 illustrates the magnitude spectrum of a zero-mean PN sequence 512 samples in length. In the FHSS approach, the PN sequence is used to randomly select from a set of predefined frequencies, and these frequencies are used to control the carrier frequency of the data signal. The carrier frequency "hops" across the spectrum at set intervals of time, in a pattern determined by the PN sequence.

3.4.1 Encoder Structures

In the discussion that follows, it is assumed that a bipolar PN sequence of the form p(n) ∈ {−1, +1} is available for use at the encoder and decoder, and that the sequence has zero mean and a relatively flat power spectrum.

3.4.1.1 Direct Sequence Spread Spectrum


The algorithm described here is a variant of the spread spectrum image watermarking method introduced by Hartung and Girod [40]. In this approach, no carrier wave is used to hold the data bit, as would normally be the case for a bandpass digital communications system. As before, the audio signal is divided into blocks as described in Section 3.1. The data bit for the current block, w(m), is spread simply by modulating it with the PN sequence. The resulting noise-like signal is then added to the original block to construct the watermarked signal:

x̄m(n) = xm(n) + α w(m) p(n)   (3.24)

In the equation above, α represents a constant weighting factor that can be used to control the level of distortion added to the host signal. Since w(m) is constant within the block, the spectrum of the added noise assumes the shape of the spectrum of p(n):

α w(m) p(n) ↔ α [W(e^jω) ⊛ P(e^jω)]
            ↔ α w(m) [δ(e^jω) ⊛ P(e^jω)]
            ↔ α w(m) P(e^jω)   (3.25)

where ⊛ denotes the convolution operator. A block diagram of the DSSS encoder is shown in Figure 3.8.
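A minimal NumPy sketch of Equation 3.24, together with the sign-of-correlation decoding derived later in Section 3.4.2, follows. The block length, α, and the random PN sequence are illustrative only.

```python
import numpy as np

def dsss_embed(xm, bit, p, alpha=0.1):
    """Equation 3.24: add the PN-spread bit to the time-domain block."""
    return xm + alpha * bit * p

def dsss_decode(ym, p):
    """Equations 3.30-3.33: the sign of the correlation with p(n)
    recovers the bipolar bit."""
    return 1 if np.dot(ym, p) >= 0 else -1
```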

3.4.1.2 Frequency Hopped Spread Spectrum


Figure 3.8: Block diagram of the DSSS encoder.

The FHSS algorithm described here is a variant of the method introduced by Cox et al in their study of image watermarking algorithms [41]. In this approach, the discrete cosine transform (DCT) is used to transform the original signal block, xm(n), into the frequency domain in accordance with:

Xm(k) = DCT[xm(n)].   (3.26)

The result is a set of M frequency-domain coefficients, where M is the length of the block in samples. Then, a subset of S ≤ M coefficients is selected to contain watermark data:

S = { si ∈ Z | 0 ≤ si ≤ M − 1, 0 ≤ i ≤ S − 1 }   (3.27)

The coefficients are modified by using a PN sequence, p(k) ∈ {−1, +1}, of length S samples, modulating the bit to be embedded within the block with this short PN sequence, and then adding this noise-like sequence to the selected coefficients:

X̄m(k) = Xm(k) + { α w(m) p(k) if k ∈ S
                { 0            otherwise   (3.28)

where, as with the DSSS algorithm, α is a parameter used to control the noise power. The final step is to construct the watermarked signal by using the inverse DCT to convert the modified frequency domain signal into the watermarked block:

x̄m(n) = IDCT[X̄m(k)]   (3.29)

The subset of S modified coefficients may be fixed for the entire audio sequence, or it may vary with each block.

Figure 3.9: Block diagram of the FHSS encoder.

It is important to note the difference between this approach and an implementation of FHSS in a digital communications system. Rather than modulate the frequency of a single carrier wave, in this approach a set of S carrier waves is used. The techniques are similar, in that the frequencies of the modified coefficients may vary with each block, so that they appear to "hop" across the spectrum. A PN sequence is still required to spread the bit within each block, and in this case a second PN sequence may be used to determine the frequencies to modify for each block. In their image watermarking scheme, Cox et al proposed sorting the DCT coefficients by magnitude, and then selecting those with the largest magnitude for modification. An illustration of the FHSS encoder structure is shown in Figure 3.9.

There are two key differences between the DSSS and FHSS algorithms. First of all, since addition of the noise-like signal is performed in the frequency domain for FHSS, the noise power is spread throughout the watermarked block in the time domain. Since only a subset of frequencies is used, the noise power at each modified coefficient, localized in frequency, may be increased without causing a corresponding increase in the time-domain noise variance. In other words, α may be increased in the frequency domain without creating a corresponding increase of noise in the time domain.
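The FHSS embedding of Equations 3.26 - 3.29 can be sketched with an explicit orthonormal DCT-II matrix. The coefficient subset S, α, and the sizes are illustrative, and a correlation decoder matching Section 3.4.2 is included for completeness.

```python
import numpy as np

def dct_matrix(M):
    """Orthonormal DCT-II analysis matrix (rows are basis vectors)."""
    n = np.arange(M)
    D = np.sqrt(2.0 / M) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * M))
    D[0] /= np.sqrt(2.0)
    return D

def fhss_embed(xm, bit, p, S, alpha=1.0):
    """Equations 3.26-3.29: perturb the DCT coefficients indexed by S
    with the PN-spread bit, then invert the orthonormal transform."""
    D = dct_matrix(len(xm))
    X = D @ xm                     # Equation 3.26
    X[S] += alpha * bit * p        # Equation 3.28
    return D.T @ X                 # Equation 3.29 (inverse = transpose)

def fhss_decode(ym, p, S):
    """Correlate the marked DCT coefficients with the PN sequence."""
    Y = dct_matrix(len(ym)) @ ym
    return 1 if np.dot(Y[S], p) >= 0 else -1
```

Because the transform is orthonormal, a given noise power added in the DCT domain maps to the same total energy spread across the whole time-domain block, which is the FHSS advantage noted above.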

3.4.2 Decoder Structures


A similar decoding process is used for both the DSSS and FHSS algorithms. Consider first the case of the DSSS decoder. The embedded bit is extracted by computing the correlation of the watermarked signal block with a synchronized version of the PN sequence:

C = \sum_{n=0}^{M-1} \bar{x}_m(n) p(n)
  = \sum_{n=0}^{M-1} [x_m(n) + \alpha w(m) p(n)] p(n)
  = \sum_{n=0}^{M-1} x_m(n) p(n) + \alpha w(m) \sum_{n=0}^{M-1} p^2(n)     (3.30)

Given that p(n) is a noise-like signal with zero mean, as is the case with most PN sequences, the correlation of the original signal with p(n) in the equation above may be assumed to be very low:

\sum_{n=0}^{M-1} x_m(n) p(n) \approx 0     (3.31)

implying that Equation 3.30 may be written as

C \approx \alpha w(m) \sum_{n=0}^{M-1} p^2(n) = \alpha M w(m)     (3.32)

Since w(m) is bipolar, the extracted bit may be obtained from the sign of the correlation computed above:

\tilde{w}(m) = \mathrm{sign}[C] = \mathrm{sign}[\alpha M w(m)]     (3.33)

For the case of FHSS encoding, the watermarked signal \bar{x}_m(n) is first transformed into the frequency domain using the DCT, and the correlation of the marked coefficients with the PN sequence is computed in the same manner as in Equation 3.30 above. Note that in addition to the synchronized PN sequence, the frequencies of the marked coefficients must also be available at the decoder. If the original audio block is available, such as within a private watermarking framework, then it may be subtracted from the watermarked block prior to correlation in order to improve the reliability of detection at the decoder. If the original signal is not available, then additional processing may have to be introduced at the decoder in case the assumption of Equation 3.31 does not hold.
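The correlation receiver of Equations 3.30-3.33 reduces to the sign of an inner product. A minimal Python/NumPy sketch, assuming a synchronized PN sequence and illustrative parameter values rather than the thesis settings:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 2048
alpha = 0.02                               # illustrative watermark amplitude
pn = rng.choice([-1.0, 1.0], size=M)       # synchronized PN sequence p(n)
host = 0.1 * rng.standard_normal(M)        # stand-in for the host block x_m(n)

def dsss_embed(x, w, pn, alpha):
    # Additive spread-spectrum watermark: x + alpha * w(m) * p(n)
    return x + alpha * w * pn

def dsss_decode(x_marked, pn):
    # C = sum x(n) p(n); the host term is assumed ~0 (Eq. 3.31),
    # leaving alpha * M * w(m), whose sign recovers the bit (Eq. 3.33).
    return 1 if np.dot(x_marked, pn) >= 0 else -1

w = -1
marked = dsss_embed(host, w, pn, alpha)
decoded = dsss_decode(marked, pn)
```

With these values the watermark term alpha * M dominates the residual host correlation, so the sign test recovers the bit reliably.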

3.4.3 Probability of Bit Error


Spread spectrum techniques are well established in the digital communications literature, and it is possible to derive a theoretical bound for the performance of the algorithms in the presence of additive noise [16]. Assume that the watermarked block is distorted by additive white Gaussian noise (AWGN) of zero mean and variance \sigma_v^2:

\tilde{x}_m(n) = \bar{x}_m(n) + v(n)     (3.34)

This is a reasonable assumption, for it was shown in Chapter 1 that additive noise is a commonly used attack on watermarked signals. Also assume for now that the original signal block, x_m(n), is available at the decoder and can be subtracted from the watermarked signal. The resulting signal presented to the correlation receiver has the form:

\tilde{x}_m(n) = \alpha w(m) p(n) + v(n)     (3.35)

Applying the correlation formula of Equation 3.30, the extracted bit with a noise term may be written as

C = \sum_{n=0}^{M-1} \tilde{x}_m(n) p(n)
  = \sum_{n=0}^{M-1} [\alpha w(m) p(n) + v(n)] p(n)
  = \sum_{n=0}^{M-1} [\alpha w(m) p^2(n) + v(n) p(n)]     (3.36)

Since v(n) has zero mean and is uncorrelated with p(n), for M \gg 1 the second term in the correlation summation will be approximately zero. Therefore, for large block sizes, it is predicted that the spread spectrum algorithms possess a strong resilience to additive noise distortions.

However, when the block size is small, the correlation of p(n) and v(n) will not be zero, and it is important to be able to quantify the corresponding bit error for varying noise power. This was obtained experimentally by using Equation 3.36 for varying block sizes and noise ratios, and the results are shown in Figure 3.10. This figure illustrates the probability of bit error, P_B, as a function of signal-to-noise ratio (SNR) for various block sizes. The probability of bit error may be approximated mathematically by the expression [16]:

P_B = Q\left( \frac{\alpha \sqrt{M}}{\sigma_v} \right)     (3.37)

where M denotes the block size in samples (for DSSS), or the number of modified frequency-domain coefficients in the FHSS algorithm, and \alpha is the power of the watermark signal from Equation 3.24. Q(x), the complementary error function, is defined as:

Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left( -\frac{u^2}{2} \right) du.     (3.38)

From the P_B equation above, it is clear that either increasing the block size or increasing the watermark power has a significant effect on the reliability of the encoding. For the case where x_m(n), the original host signal, is not available at the receiver, the presence of the host signal will cause decoding problems. Solutions to this problem are proposed in the next section.
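The behaviour predicted by Equation 3.37 (error rate falling as the block size grows at a fixed noise level) can be checked with a small Monte Carlo experiment. The noise level, block sizes, and trial count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def ber_awgn(M, alpha, sigma_v, trials=2000):
    """Empirical bit error rate of the correlation receiver under AWGN,
    assuming the host block has been subtracted (Eq. 3.35-3.36)."""
    errors = 0
    for _ in range(trials):
        w = rng.choice([-1.0, 1.0])
        pn = rng.choice([-1.0, 1.0], size=M)
        received = alpha * w * pn + sigma_v * rng.standard_normal(M)
        if np.sign(np.dot(received, pn)) != w:
            errors += 1
    return errors / trials

# Larger blocks should give fewer errors at the same noise level.
high = ber_awgn(M=4, alpha=0.5, sigma_v=1.0)
low = ber_awgn(M=64, alpha=0.5, sigma_v=1.0)
```

For M = 4 the prediction is roughly Q(1), around 16 percent, while for M = 64 it is Q(4), effectively zero; the empirical rates should follow the same ordering.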

3.4.4 Implementation and Proposed Improvements


3.4.4.1 Selection of α

In Chapter 2 it was shown that the human auditory system (HAS) has a wide dynamic range that is sensitive to low-level noise at mid-band frequencies (500 Hz - 8 kHz). Therefore, in order to keep the watermark signal below the threshold of hearing, it was found that \alpha must be set to an extremely low value, typically one percent of the dynamic range of the audio signal. Unfortunately, from Equation 3.37 it is obvious that this will reduce the reliability of the encoding.

Figure 3.10: Error rate as a function of SNR for the spread spectrum algorithms.

It is important to recognize that with the FHSS approach, \alpha may be increased, because by modifying only a subset of frequency components, the noise power is spread throughout the time-domain signal. In other words, a higher level of noise at a subset of frequencies corresponds to a lower power of noise within each sample in the time domain. In this investigation, the number of coefficients modified in the frequency domain was set to S = M/32, or roughly three percent of the coefficients.

3.4.4.2 Prefiltering to Improve Decoding Reliability


As mentioned earlier, within a public watermarking framework it is assumed in Equation 3.31 that the correlation of the host signal and the PN sequence will be approximately zero. This is often not a valid assumption, and there are two improvements that can be incorporated to minimize the presence of the host signal prior to decoding:

1. Hartung and Girod suggest the use of a highpass prefilter to remove as much of the host signal as possible, based on the assumption that x_m(n) can be modeled as a lowpass signal [40]. This is a simple approach, because a fixed filter may be constructed ahead of time and incorporated into the receiver.

2. Employ a whitening filter constructed from a K-th-order autoregressive (AR) model of the watermarked signal block [42]. This is a novel approach that has not yet been considered in the watermarking literature. Using a z-transform representation:

A(z) = \sum_{i=0}^{K-1} a(i) z^{-i}     (3.39)

where a(0) = 1. Assuming that the noise-like watermark signal has a much smaller power than the host signal, the AR model may be used to whiten x_m(n) by convolving the two signals:

x_m(n) * a(n) = v_x(n)     (3.40)

where v_x(n) is a random signal of variance \sigma_x^2 corresponding to the prediction error. When applied to the watermarked signal block, \bar{x}_m(n), the output of the whitening filter presents the following signal to the correlator:

\bar{x}_m(n) * a(n) = [x_m(n) + \alpha w(m) p(n)] * a(n)
                    = \alpha w(m) [p(n) * a(n)] + v_x(n)     (3.41)

In other words, convolution of the AR model coefficients with the watermarked signal results in two signals: p(n) * a(n) weighted by the watermark bit, and the random prediction error v_x(n). However, recall that p(n) is a noise-like signal with zero mean and a variance \sigma_p^2 = 1. The output of the correlator will be

c(m) = \sum_{n=0}^{M-1} \tilde{x}(n) p(n)
     = \sum_{n=0}^{M-1} \{ \alpha w(m) [p(n) * a(n)] + v_x(n) \} p(n)     (3.42)

Neglecting the contribution of the correlation of the random prediction error and the PN sequence, the correlator output has the following form:

c(m) \approx \alpha w(m) \sum_{n=0}^{M-1} p(n) [p(n) * a(n)]     (3.43)

However, it was shown in [7] that when a random signal is convolved with a linear filter, the cross-correlation of the filter output with the random input is a function of only the variance of the input and the filter coefficients. This property can be used to simplify the expected value of the correlator output:

E[c(m)] = \alpha w(m) \sum_{n=0}^{M-1} E\{ p(n) [p(n) * a(n)] \}
        = \alpha w(m) \sigma_p^2 \sum_{n=0}^{K-1} a(n)     (3.44)

noting the change in summation to reflect the fact that the filter length will be shorter than the length of the watermarked block, and E[\cdot] denotes the statistical expectation operator. As long as the sum of the AR model coefficients is greater than zero:

\sum_{n=0}^{K-1} a(n) > 0     (3.45)

then the sign of the correlator output may be used to extract the embedded bit.


Figure 3.11: Highpass filter used to prefilter host signals watermarked with the DSSS algorithm.

In this investigation, a symmetric highpass finite impulse response (FIR) filter of length K = 11 samples, constructed using a Hamming window, was used as a prefilter for decoding the DSSS algorithm. An FIR filter was used because it provides a linear phase response, is independent of the host signal (it may be constructed ahead of time), and because the watermark signal is spread throughout the spectrum of the host signal. The frequency response of the filter is shown in Figure 3.11. The AR modeling technique was applied at the decoder of the FHSS algorithm by constructing an 11th-order whitening filter for each watermarked block. The filter order was chosen as a tradeoff between the accuracy of the AR model and the computational cost of computing the model coefficients. A time-averaged estimate of the block's autocorrelation function r(n) was computed, and the AR coefficients were constructed using r(n) and the Levinson-Durbin recursion [42]. A block diagram of

the improved decoder is shown in Figure 3.12.

Figure 3.12: Block diagram of the spread spectrum decoder with prefiltering prior to decoding.

The AR modeling technique was chosen over the highpass prefilter because the latter would not improve decoding for the FHSS algorithm, since not every frequency component is modified. If the distribution of the watermarked subset of frequencies is random, then on average half of the modified coefficients would lie in the lower frequency band. To illustrate how much these two modifications improve the decoding reliability of the DSSS and FHSS algorithms, refer to Figure 3.13 and Figure 3.14. These are experimental plots of the bit error rate (BER), as a function of \alpha, for the two spread spectrum algorithms under three conditions: no prefiltering, highpass prefiltering, and AR modeling. In these tests, a monophonic audio signal from the performance evaluation of Section 3.6 was watermarked using a block size of 2048 samples.
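The AR-model prefilter described above can be sketched as follows. The Levinson-Durbin recursion solves the Yule-Walker equations from a biased autocorrelation estimate; the AR(1) test signal and the filter order below are illustrative assumptions of the example, not the thesis code:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients a(0..order),
    with a(0) = 1, given autocorrelation values r(0..order)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[1:i][::-1]   # update interior coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # updated prediction-error power
    return a

def whiten(x, order=11):
    """Whitening filter: convolve the block with its own AR model (Eq. 3.40)."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:] / len(x)
    a = levinson_durbin(r, order)
    return np.convolve(x, a)[:len(x)], a

rng = np.random.default_rng(3)
# A strongly lowpass AR(1) test signal standing in for a host audio block:
v = rng.standard_normal(4096)
x = np.empty_like(v)
x[0] = v[0]
for n in range(1, len(v)):
    x[n] = 0.9 * x[n - 1] + v[n]
white, a = whiten(x, order=11)
```

For this AR(1) input the whitened output should have far lower variance than the input, and the leading model coefficient a(1) should land near -0.9.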

3.4.4.3 Discussion
The spread spectrum watermarking techniques described in this section possess several advantages over the echo coding and phase coding algorithms studied previously. First of all, the encoder and decoder architectures are quite simple to implement in hardware, and they are based on well-understood aspects of digital communications theory. Equation 3.37 reveals that the reliability of the watermark encoding can be

81
50 NO PREFILTER HIGHPASS AR MODELING

45

40

35

BER (PERCENT)

30

25

20

15

10

0.1

0.2

0.3 0.4 0.5 0.6 0.7 ALPHA (PERCENT OF HOST MAGNITUDE)

0.8

0.9

Figure 3.13: Comparison of DSSS decoding using highpass prefiltering and AR modeling.
Figure 3.14: Comparison of FHSS decoding using highpass prefiltering and AR modeling.

improved by either increasing the noise power \alpha, or by increasing the length of the block size M. In addition, redundant copies of the watermark may be embedded by using a different PN sequence to spread each copy, because statistically the noise-like PN sequences are uncorrelated. Finally, it is important to note that the watermark signal is additive, so the only distortion introduced into the host signal is low-level noise. Although the autocorrelation property of PN sequences allows for easier synchronization of the host signal at the decoder, a malevolent party may use the same process to determine the bit sequence in order to disrupt or remove the watermark. For example, if an LFSR generator of length n bits is used as the PN sequence, then the Berlekamp-Massey algorithm can be used to reconstruct the LFSR generator structure given only 2n bits of the sequence [43]. In quiet periods of the host audio signal, for example, the only signal present may be the watermark signal, exposing the sequence to such an analysis and attack. This would suggest using a more robust and unpredictable pseudorandom number generator, such as the Blum-Blum-Shub algorithm [44]. However, doing so would require a more complex synchronization system at the decoder. Finally, it should be noted that neither of these two spread spectrum techniques takes advantage of the complex masking properties of the human auditory system described in Chapter 2.

3.5 Frequency Masking


In the previous section, two spread spectrum watermarking algorithms were introduced, and it was shown that they offer improved transmission reliability, and less distortion of the host audio signal, than the echo coding and phase coding algorithms. However, it was also noted that the power of the spread data signal had to be kept at an extremely low level to ensure that the additive noise would be imperceptible to the listener. By reducing the magnitude of the distortion, it is more likely that the watermark signal could be damaged or destroyed by simple signal processing

operations. Recall that in Chapter 2 the psychoacoustic properties of the human auditory system (HAS) were introduced. In particular, the chapter explored the absolute detection threshold function T_A(f), based on empirical studies examining the minimum power required for a single tone to be perceptible to a human listener. This function is independent of the audio signal. In addition, the frequency masking concepts of tone-masks-noise, noise-masks-tone, and noise-masks-noise were described. The MPEG Layer I psychoacoustic model was introduced as a standardized procedure for determining the masking threshold function, T_M(f), that is dependent on the local frequency-domain properties of the audio signal. Tewfik et al. have proposed an audio watermarking algorithm that takes advantage of this psychoacoustic model [21]. An implementation of their approach, described in the following sections with modifications and improvements, uses the masking threshold function to control the magnitude and spectral properties of an additive noise-like watermark signal. Doing so ensures that the distortion is inaudible to the listener.


3.5.1 Encoder Structure


The encoder algorithm is described by the following procedure:

1. Given x_m(n), the current block of the original audio signal, use the MPEG Layer I psychoacoustic modeling procedure described in Chapter 2 to compute T_M(f), the frequency masking threshold for the block.

2. Construct a K-th-order finite impulse response (FIR) filter to approximate T_M(f):

T_M(f) \approx \sum_{n=0}^{K-1} t(n) e^{-j 2\pi f n}.     (3.46)

Also, normalize t(n) so that it provides unity gain, or \sum_{n=0}^{K-1} t(n) = 1.

3. Use the coefficients as a noise-shaping filter by convolving it with the block's bit, w(m), spread by a bipolar PN sequence p(n) \in \{-1, +1\}. Note that this

is similar to the direct sequence spread spectrum (DSSS) encoding process described in Section 3.4, but the additive noise-like signal no longer has a flat power spectrum. Convolving the spread data signal with the filter model has the effect of shaping the spectrum of the noise-like signal to approximate the frequency masking threshold of the host signal. This spread data signal is then added to the host signal:

\bar{x}_m(n) = x_m(n) + \alpha [w(m) p(n) * t(n)]     (3.47)

where t(n) denotes the filter coefficients computed above. Again, \alpha is a constant weighting factor that determines the maximum power of the noise signal. A block diagram of the frequency masking algorithm's encoder is shown in Figure 3.15.

Figure 3.15: Block diagram of the frequency masking encoder.
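The shaping step of Equation 3.47 amounts to filtering the PN-spread bit through t(n) before addition, and decoding reuses the DSSS correlator. In the sketch below, the unity-gain filter t is an arbitrary lowpass stand-in for the masking-threshold approximation (computing the real MPEG Layer I threshold is beyond the scope of the snippet), and decoding assumes the original block is available and subtracted, as in the private-watermarking case:

```python
import numpy as np

rng = np.random.default_rng(5)

def shaped_embed(x, w, pn, t, alpha):
    """Eq. 3.47: add the PN-spread bit, spectrally shaped by t(n)."""
    shaped = np.convolve(w * pn, t)[:len(x)]
    return x + alpha * shaped

M = 2048
pn = rng.choice([-1.0, 1.0], size=M)
# Hypothetical unity-gain shaping filter standing in for t(n)
# (coefficients sum to one, as required by the encoder procedure):
t = np.array([0.4, 0.3, 0.2, 0.1])
host = 0.1 * rng.standard_normal(M)
marked = shaped_embed(host, w=1, pn=pn, t=t, alpha=0.25)
# Decoding uses the same correlator as DSSS (Eq. 3.48), host subtracted:
decoded = 1 if np.dot(marked - host, pn) >= 0 else -1
```

Even though the shaping filter spreads each chip across several samples, the lag-zero cross-correlation term alpha * t(0) * M dominates, so the sign test still recovers the bit.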

3.5.2 Decoder Structure


Since the data signal is spread by using a PN sequence as in the DSSS algorithm, decoding the signal is accomplished using a correlator at the receiver of the form in

Equation 3.30:

\tilde{w}(m) = \mathrm{sign}\left[ \sum_{n=0}^{M-1} \bar{x}_m(n) p(n) \right]     (3.48)

Note that the same conditions and assumptions of the DSSS decoder apply here. In particular, the PN sequence is assumed to be available at the receiver and synchronized with the watermarked signal. If no prefiltering is applied prior to computing the correlation, it is also assumed that the approximation of Equation 3.31 holds here as well:

\sum_{n=0}^{M-1} x_m(n) p(n) \approx 0     (3.49)

If the original block, x_m(n), is available at the decoder, then it may be subtracted from the watermarked block prior to decoding.

3.5.3 Probability of Bit Error


Like the spread spectrum algorithms described in the previous section, it is possible to derive a general theoretical bound for the performance of the frequency masking algorithm in the presence of AWGN. Note that the spread data signal does not possess a fixed power; rather, convolution of the model filter with p(n) causes the magnitude of the data signal to change with each sample. As before, assume that the watermarked block is corrupted with additive noise of zero mean and variance \sigma_v^2, and that the original block is available and subtracted from the signal prior to decoding:

\tilde{x}_m(n) = \alpha [w(m) p(n) * t(n)] + v(n)     (3.50)

Applying the correlation formula of Equation 3.30, the extracted bit with a noise term may be written as

C = \sum_{n=0}^{M-1} \{ \alpha w(m) [p(n) * t(n)] + v(n) \} p(n)     (3.51)

However, if it is assumed that p(n) has the same properties as white noise, such as a flat power spectrum, zero mean, and variance \sigma_p^2, then the first term in the summation can be simplified. In [7], it is shown that when white noise is passed through a linear filter, the cross-correlation of the output and input signals is a function of only the noise variance and the filter coefficients. Let p'(n) denote the convolution of p(n) and t(n) in the correlation equation above. The cross-correlation function of the PN sequence and the filtered PN sequence is given by

c(m) = E[p(n-m) p'(n)] = \sigma_p^2 t(m)     (3.52)

where E[\cdot] represents the statistical expectation operator. Substituting this into the receiver correlator equation yields a simplified expression for the expected value of the correlator output:

C = \sum_{n=0}^{M-1} [\alpha w(m) t(n) + v(n) p(n)]     (3.53)

Since p(n) \in \{-1, +1\}, its variance will be 1. Recall that t(n) is constructed to have unity gain, so the sum of the coefficients will be one. t(n) is also limited to K samples, and assumed to be zero elsewhere. Therefore, the simplified correlator equation may be written as

C = \alpha K w(m) + \sum_{n=0}^{M-1} v(n) p(n)     (3.54)

Like the spread spectrum case, for a block size of M \gg 1 the second term will cancel out. When the block size is not large, it is important to be able to quantify P_B, the probability of bit error. This was accomplished experimentally by determining P_B for various signal-to-noise ratios. The expression may be approximated by

P_B \approx Q\left( \frac{\alpha \sqrt{K}}{\sigma_v} \right)     (3.55)

where Q(x) denotes the complementary error function of Equation 3.38. Therefore, the bit error rate is a function of the noise power and the filter length, not the length of the block.

3.5.4 Implementation and Proposed Improvements


3.5.4.1 Construction of Filter Coefficients
In this investigation, the filter coefficients used to approximate T_M(f) were constructed using an iterative least-squares technique designed to minimize the error between the frequency masking threshold function and the frequency response of the K-th-order filter [13]. This method was chosen because it provides a close match of the filter coefficients to the masking threshold function. K was set to 10 as a tradeoff between filter accuracy and the cost of computing the coefficients.

3.5.4.2 Selection of α

By experiment, it was found that \alpha may be set to 25 percent of the dynamic range of the host audio signal. This is a significant improvement over the standard DSSS algorithm described previously. Note that this does not mean that the noise-like watermark signal has the same magnitude at each time interval. Since the PN sequence's spectrum is shaped to match the frequency masking function, \alpha represents the maximum level of distortion in the time domain.

3.5.4.3 Prefiltering to Improve Decoding Reliability


When access to the original block is not possible, the assumption of Equation 3.31, that of low correlation between the host audio signal and the PN sequence, may not be valid. In this case, the presence of x_m(n) at the correlation receiver may interfere with the decoding process, so it is necessary to remove as much of the host signal as possible. The two improvements proposed for the DSSS algorithm, highpass filtering and autoregressive modeling, may be applied to the frequency masking algorithm as well. In this investigation, the AR modeling technique of Section 3.4.4.2 was applied at the decoder of the frequency masking algorithm, and a 10th-order whitening filter was constructed for each watermarked block. As before, an estimate of the block's autocorrelation function r(n) was obtained, and the AR coefficients were constructed using the Levinson-Durbin recursion [42]. This improved decoder has the same structure as for the spread spectrum algorithm shown in Figure 3.12.

3.5.4.4 Discussion


Since the frequency masking algorithm is based on the direct sequence spread spectrum algorithm, it possesses most of the advantages and disadvantages described in Section 3.4.4.3. However, note that since the spectral characteristics of T_M(f) are used to shape the noise-like watermark signal, the power can be maximized while ensuring that the watermark is imperceptible to the listener. As a result, the quality of the watermarked audio signal is better for larger values of \alpha than for the algorithms described in earlier sections. Unfortunately, the encoder is computationally expensive, because for each signal block the frequency masking threshold function must be computed and an approximation filter constructed prior to encoding. In addition, the correlation decoder is susceptible to the same synchronization problems as the spread spectrum algorithms.

3.6 Performance Evaluation


The purpose of this section is to compare the five audio watermarking algorithms reviewed in the previous sections. Recall that in Chapter 1 a framework was introduced for evaluating watermarking algorithms. In particular, it was proposed that bit rate, perceptual quality, computational complexity, and robustness to signal processing were key aspects by which watermarking algorithms could be evaluated. In this section, this framework was used to evaluate the watermarking algorithms. In this investigation, a set of ten monophonic audio signals was selected for watermarking. Each signal was ten seconds in length and of CD quality, sampled at a rate of 44.1 kHz and linearly quantized to 16 bits per sample. This corresponds to a signal length of N = 441000 samples. Each signal was normalized to lie within the interval 0 \le x(n) \le 1. The watermarking algorithms were implemented in MATLAB under Linux on an Intel Pentium PC running at 166 MHz. Each algorithm was implemented using the parameters specified with the implementation details presented

earlier. In general, each of the audio signals was watermarked using the five algorithms 100 times, and the results averaged for each algorithm. In all cases, a different and random watermark signal was generated for each run. This was done in order to remove any dependency of extraction on the watermark data itself. The audio signals were chosen to represent five different classes of commercial music (blues, classical, country, folk, and pop/rock) so that the signals would have a variety of spectral properties. Classical music, for example, is composed primarily of single-tone signals localized in time, such as notes played on a piano. Contrast this with blues, which typically contains music from low-frequency instruments such as the cello.

3.6.1 Effect of Block Size


This experiment was designed to determine the bit error rate of each watermarking algorithm as a function of M, the block size in samples. It is assumed that the decoder does not have access to the original audio signal, so the algorithms must rely upon the improvements introduced. For each algorithm, the encoder and decoder were run on each audio signal, and the bit error rate (BER) determined from the extracted bits. This process was performed for block sizes between 2 and 32768 samples, and the results are shown in Figure 3.16. It was predicted in Section 3.4.3 that for the spread spectrum algorithms, either an increase in block size or an increase in watermark signal power would decrease the bit error rate. From the plot, it is clear that the reliability of the DSSS, FHSS, and frequency masking algorithms decreases with a decreasing block size (higher bit rate). Frequency masking performs better because the magnitude of \alpha is higher than that for the others. The reliability of the phase coding algorithm does not depend on the block size, and so its error rate is zero. These results illustrate a common tradeoff of most digital watermarking algorithms: as the bit rate increases, the reliability of the encoding decreases. All of the algorithms have an error rate of less than five percent at M = 2048 samples, corresponding to a bit rate of approximately 21 bits per second for a sample rate of 44.1 kHz.

Figure 3.16: Bit error rate as a function of block size for audio watermarking algorithms.

In the experiments that follow, a block size of 2048 samples was used. However, it is important to note that depending on the desired reliability of the watermark, a longer block size may be necessary to produce a lower error rate. In practice, a sensible approach to selecting a watermarking algorithm would be to first determine the desired bit error rate for the application, say one percent, and then select from the algorithms that can meet the requirement. Another important tradeoff can be seen from this experiment. If the watermarked signal is truncated, or if entire blocks of the signal are simply removed, then the bits embedded in the affected blocks will be lost. This is particularly important for small blocks, for it is possible to remove small sets of samples at random without affecting the quality of the watermarked signal. In order to guard against such processing, larger blocks should be used, but this limits the bit rate of the system.

3.6.2 Perceptual Quality


This experiment was designed to determine the amount of distortion each watermarking algorithm introduces into the host signal by using the signal-to-noise ratio (SNR) of the watermarked signal versus the original host signal. Each of the ten audio signals was watermarked with the five algorithms, and the SNR computed. A block size of M = 2048 samples was used, and the results of this experiment are shown in Table 3.1. From the table, it is clear that based on the signal-to-noise ratio, the echo coding, phase coding, and frequency masking algorithms introduce the most distortion into the host signal. It was predicted in earlier sections that the echo coding and phase coding algorithms severely distort the host signal, and these results support that idea. The spread spectrum algorithms have a higher SNR because the power of the watermark had to be maintained at a low level. Recall that the frequency masking algorithm tends to "hide" watermark distortion by shaping the watermark data to match the frequency masking threshold

Audio Signal   Echo Coding   Phase Coding   DSSS    FHSS    Frequency Masking
BLUES1         21.45         26.23          54.38   49.43   20.14
BLUES2         23.86         27.65          54.19   49.25   24.31
COUNTRY1       16.63         21.67          54.48   49.53   19.54
COUNTRY2       21.34         25.05          54.22   49.27   17.94
CLASSICAL1     21.54         23.53          54.05   49.10   26.43
CLASSICAL2     23.82         29.02          54.07   49.12   28.38
FOLK1          13.52         17.89          54.59   49.64   17.65
FOLK2          14.21         18.22          54.57   49.62   16.94
POP1           14.75         19.08          54.96   50.01   18.53
POP2           14.51         19.09          54.47   49.52   17.84
Average        18.53         22.78          54.40   49.45   20.77

Table 3.1: SNR of watermarked audio signals versus original host signals (in decibels).

function of the host signal. Therefore, it is slightly misleading to only consider the SNR of this approach.
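The figures in Table 3.1 are the ratio of host-signal power to watermark-distortion power, expressed in decibels. A sketch of the measurement, using synthetic placeholder signals rather than the thesis test set:

```python
import numpy as np

def snr_db(host, marked):
    """SNR of a watermarked signal versus the original host, in dB."""
    noise = marked - host
    return 10.0 * np.log10(np.sum(host ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(9)
host = rng.uniform(0.0, 1.0, size=44100)          # normalized host, 0 <= x(n) <= 1
marked = host + 0.001 * rng.standard_normal(44100)  # low-level additive watermark
measured = snr_db(host, marked)
```

With the placeholder noise level above, the measured SNR falls in the mid-50 dB range, comparable to the DSSS column of the table.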

3.6.3 Computational Complexity


This experiment was designed to determine the computational cost of each watermarking algorithm. This is important to know, for it may not be possible to implement more expensive algorithms in particular applications. For each algorithm, timings were extracted for the encoder and decoder run on each audio signal. A block size of M = 2048 samples was used, and the results are shown in Table 3.2. From the table, it is immediately obvious that the DSSS algorithm is the most efficient, requiring an average of 4 seconds to encode and decode each ten-second audio sample. The echo coding and phase coding algorithms also compare favourably, each requiring approximately ten seconds to embed and decode data into the host signals. This is not surprising given the simple structures of the three algorithms, and it is clear from this data that they can be easily implemented in real time. The FHSS algorithm is considerably more expensive, requiring an average of 54 seconds to encode and decode each host signal. First of all, the host signal must

Audio Signal   Echo Coding   Phase Coding   DSSS   FHSS    Frequency Masking
BLUES1         9.40          10.41          4.47   53.60   395.89
BLUES2         10.26         9.91           4.60   54.69   363.26
COUNTRY1       9.94          11.25          4.71   54.82   404.29
COUNTRY2       10.42         12.24          3.26   54.71   422.47
CLASSICAL1     9.19          10.20          4.16   55.29   376.16
CLASSICAL2     10.53         9.90           4.18   54.67   407.16
FOLK1          10.22         10.50          3.24   55.19   415.08
FOLK2          9.08          11.41          3.44   52.78   358.15
POP1           10.83         9.90           4.81   53.98   361.18
POP2           9.94          9.55           3.90   53.84   401.42
Average        9.98          10.53          4.08   54.35   390.51

Table 3.2: Audio watermarking algorithm CPU timings (in seconds).

be transformed into the frequency domain at both the encoder and decoder. However, more computational resources are required to compute the autoregressive (AR) model of the watermarked signal at the receiver in order to whiten the signal prior to decoding. Unless the DCT computation and AR modeling can be implemented more efficiently, the FHSS algorithm may be limited in application. The frequency masking algorithm is clearly the most expensive, requiring an average of 391 seconds to encode each ten-second sample. This is because the masking threshold function, T_M(f), must be computed for each block. In addition, a noise-shaping filter must be constructed with a frequency response that approximates the threshold function. Note that the timings vary more for each run of this algorithm than for the others, because each host signal has different spectral properties. Due to its complexity, it may not be possible to implement the frequency masking algorithm to run in real-time applications.

3.6.4 Robustness to Signal Processing


As described in Chapter 1, robustness to signal processing is a desirable feature of any watermarking system. Common operations, such as noise reduction, linear filtering,

or lossy compression, should not completely destroy a watermark embedded within the signal. Measuring how well each watermark can survive distortions provides another tool for choosing between algorithms for a particular application, particularly if the distortions to which the host signal may be subjected are known in advance. The signal processing operations were selected because they do not severely distort the subjective quality of the audio signal, and because they can be used to represent or simulate "real world" distortions.

3.6.4.1 Linear and Nonlinear Filtering


The purpose of this experiment was to determine the bit error rate of each algorithm when the watermarked signal was subjected to filtering operations. Common signal processing operations, such as equalization and noise suppression, are constructed as linear filters. As before, for each algorithm, the encoder was run on each audio signal, followed by a filtering operation, and decoding to extract the bits. A block size of M = 2048 samples was used, and the following filtering operations were considered:

Mean filtering. An averaging filter of length K samples was applied to the watermarked signals, for 1 \le K \le 15. Mean filtering, essentially a lowpass filtering operation, has the effect of removing high-frequency noise from a signal.

Lowpass filtering. A lowpass symmetric halfband filter of length K samples was constructed using the Hamming window method, and applied to the watermarked signals, for 1 \le K \le 15.

Highpass filtering. A highpass symmetric halfband filter of length K samples was constructed using the Hamming window method, and applied to the watermarked signals, for 1 \le K \le 15.

High-emphasis filtering. A high-emphasis filter, 11 samples in length, was constructed using a weighted superposition of halfband lowpass and highpass filters:

h(n) = (1 - A) h_{LP}(n) + A h_{HP}(n)     (3.56)

where h_{LP}(n) and h_{HP}(n) represent halfband lowpass and highpass filters of length 11 samples, with A varying over 0 \le A \le 1. High-emphasis filtering was included in this investigation because it simulates the function of a graphic equalizer in stereo systems.
Wiener ltering. For each watermarked signal block, xm (n), a K th-order forward prediction lter was constructed and applied to the signal:

xm (n) = ~

k=1

xm(n ; k)h(k)

(3.57)

where the coe cients of h(n) were chosen to minimize the mean squared error (MSE) between xm (n) and xm (n), for 1 K 15. The output of the prediction ~ lter is an approximation of the watermarked signal, plus a random prediction error signal. Wiener ltering was used to simulate linear predictive coding (LPC), a common low bit rate audio compression technique.
Median ltering. A median lter was used, designed to replace each sample of the audio signal with the median of its K previous samples, for 1 K 15. Median ltering is a non-linear process often used for reducing high-frequency noise in a signal.
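The filter constructions above can be made concrete with a short sketch. The following Python/NumPy fragment (an illustration for this discussion, not code from the thesis) builds the Hamming-windowed halfband lowpass filter, derives the highpass filter from it by (-1)^n modulation, and forms the high-emphasis combination of Equation 3.56; the quarter-band cutoff and the unit DC-gain normalization are assumptions of this sketch:

```python
import numpy as np

def halfband_pair(K):
    """Design length-K halfband lowpass/highpass filters via the
    Hamming window method (cutoff at one quarter of the sample rate)."""
    n = np.arange(K) - (K - 1) / 2.0
    h_lp = 0.5 * np.sinc(0.5 * n) * np.hamming(K)   # windowed ideal halfband LP
    h_lp /= h_lp.sum()                              # normalize DC gain to 1
    h_hp = h_lp * (-1.0) ** np.arange(K)            # modulate LP to highpass
    return h_lp, h_hp

def high_emphasis(K, A):
    """Weighted superposition h(n) = (1 - A) h_LP(n) + A h_HP(n)  (Eq. 3.56)."""
    h_lp, h_hp = halfband_pair(K)
    return (1 - A) * h_lp + A * h_hp
```

Setting A = 0 recovers the pure lowpass filter and A = 1 the pure highpass filter, the two endpoints of the high-emphasis sweep in Figure 3.17(d).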

The results of this experiment are shown in Figure 3.17. Generally, the phase coding algorithm was the most resilient to almost all of the filtering operations applied, with the exception of highpass and high-emphasis filtering. This is surprising, given that the algorithm is relatively simple to implement and possesses a low computational complexity. It is also disappointing, because the perceptual quality of signals watermarked with phase coding is much lower than that of the other algorithms studied. The DSSS, FHSS, and frequency masking algorithms provided the best performance under highpass and high-emphasis filtering. This is not surprising, given that these algorithms distribute their watermark energy across the entire spectrum, and that the audio signals often possess predominantly low-frequency components.

Figure 3.17: Bit error rate after filtering for audio watermarking algorithms. (a) Mean filtering; (b) Lowpass filtering; (c) Highpass filtering; (d) High-emphasis filtering; (e) Wiener filtering; (f) Median filtering. Each panel plots BER (percent) against filter size in samples (panel (d): against the weighting A), for the echo coding, phase coding, DSSS, FHSS, and frequency masking algorithms.

Figure 3.18: Bit error rate in the presence of additive and coloured noise for audio watermarking algorithms. (a) Additive noise; (b) Coloured noise. Each panel plots BER (percent) against SNR in decibels for the five algorithms.

3.6.4.2 Additive and Coloured Noise


The purpose of this experiment was to determine the performance of each algorithm in the presence of additive and coloured noise. As described in Chapter 1, additive white Gaussian noise is a commonly used attack on watermarked signals in an effort to hamper decoding at the receiver. For additive noise, the watermarked signal was corrupted with an AWGN process of zero mean and variance σ_v² in accordance with

x̃(n) = x(n) + v(n)    (3.58)

and for coloured noise, the signal was corrupted with noise of the same power, but multiplied by a normalized version of the watermarked signal. Since x(n) was already normalized to lie within the interval 0 ≤ x(n) ≤ 1, the corrupted signal may be written as

x̃(n) = x(n) + x(n) v(n)    (3.59)

For each algorithm, the bit error rate was computed as a function of SNR in decibels. A block size of M = 2048 samples was used, and the results of this experiment are shown in Figure 3.18.
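The two corruption models of Equations 3.58 and 3.59 amount to a few lines of NumPy. In this illustrative sketch (not the thesis code), the noise variance is chosen to hit a target SNR, which is how the results in Figure 3.18 are parameterized:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, snr_db, coloured=False):
    """Corrupt a watermarked signal x (normalized to [0, 1]) with zero-mean
    Gaussian noise at a target SNR.  Additive case: Eq. 3.58.  Coloured case:
    the noise is scaled sample-by-sample by the signal itself, Eq. 3.59."""
    p_sig = np.mean(x ** 2)
    sigma_v = np.sqrt(p_sig / 10.0 ** (snr_db / 10.0))   # target noise std
    v = rng.normal(0.0, sigma_v, x.shape)
    return x + x * v if coloured else x + v
```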

In the presence of additive and coloured noise, it is clear that the spread spectrum algorithms (DSSS, FHSS, and frequency masking) perform quite well. It was shown in Section 3.4.3, however, that the presence of noise does not have a significant impact on spread spectrum techniques for larger block sizes or larger watermark power α. Of the three, the frequency masking algorithm performs well due to its larger watermark power. The echo coding algorithm performs poorly in additive and coloured noise environments, because the presence of noise affects the cepstrum used to extract the watermark bits. Decoding the bits involves evaluating the cepstrum at the two echo filter delays, so additive noise will increase the chance that bits are incorrectly decoded. Phase coding performs very well under noise conditions, and there is a good reason for this: in the study of communications systems, it has been shown that systems employing angle modulation (either frequency or phase) are more resilient to severe noise than amplitude modulation schemes (such as the spread spectrum approaches) [16].

3.6.4.3 Linear and Nonlinear Quantization


The purpose of this experiment was to investigate the robustness of each algorithm to distortion due to linear and nonlinear quantization. The watermarked signals, originally encoded at 16 bits per sample, were requantized to K bits per sample, for 1 ≤ K ≤ 15. For linear quantization, a function of the form in Figure 3.19(a) was used. For the nonlinear case, two different quantization functions were used, as shown in Figures 3.19(b)-(c). The first function allocates more quantization levels to mid-band samples, while the second function allocates more to the outlier samples. If 0 ≤ x ≤ 2^m - 1 represents the 2^m levels possible in an m-bit encoding, then Q(x) represents the 2^m quantization levels constrained to the interval 0 ≤ Q(x) ≤ 1. As before, for each algorithm the encoder was run on each audio signal, followed by quantization, and then decoding to extract the bits. A block size of M = 2048 samples was used, and the results of this experiment are shown in Figure 3.20.
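The linear requantization step can be sketched directly; the nonlinear curves of Figures 3.19(b)-(c) would replace the uniform staircase below with a companding function. This fragment is illustrative, not the thesis code:

```python
import numpy as np

def requantize(x, k):
    """Linearly requantize samples x in [0, 1] to k bits per sample
    (the uniform staircase of Figure 3.19(a)); output is back on [0, 1]."""
    levels = 2 ** k                      # 2^k output levels
    q = np.round(x * (levels - 1))       # nearest level index
    return q / (levels - 1)
```

The quantization error is bounded by half a step, 1 / (2 (2^k - 1)), which is the noise-variance argument used in the discussion of the results below.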

Figure 3.19: Linear and nonlinear quantization functions for K = 5 bits per sample. (a) Linear quantization function; (b) Nonlinear midband quantization function; (c) Nonlinear outlier quantization function. Each panel plots Q(x) against x on [0, 1].

Figure 3.20: Bit error rate after quantization using linear and two nonlinear bit allocation functions. (a) Linear quantization; (b) Nonlinear midband quantization; (c) Nonlinear outlier quantization. Each panel plots BER (percent) against bits per sample for the five algorithms.

From the plots, it can be seen that the DSSS, FHSS, and frequency masking algorithms provide the most resilience to linear and nonlinear quantization, even at low bits per sample. In the previous experiment it was shown that the spread spectrum algorithms perform well in noisy environments. Quantization of a signal introduces random noise with a variance that varies with the quantization step size, so the results of this section should correspond.

3.6.4.4 Lossy Compression


The goal of this experiment was to determine the robustness of each algorithm to lossy compression using the MPEG Layer III standard. This compression technique incorporates the frequency masking analysis of the MPEG Layer I standard described in Chapter 2, and builds upon it by adding analysis of the temporal masking properties of the host signal to achieve a high compression ratio while minimizing the perceptual loss of signal quality [19]. This is an important experiment, because the MPEG Layer III standard is commonly used for encoding music for distribution (legally and otherwise) across the Internet. As before, for each algorithm the encoder was run on each audio signal, followed by MPEG Layer III compression and decompression, and then decoding to extract the bits. Various bit rates between 8 and 256 kbps were used, along with a block size of M = 2048 samples. This process was performed 100 times for averaged results, and the results are shown in Figure 3.21. From the plot, it is clear that the frequency masking algorithm provides lower error rates than the other algorithms, followed by the spread spectrum approaches. This is not surprising, since the MPEG Layer III compression technique also computes the frequency masking threshold function, and allocates more bits to components at the same frequencies as the watermark signal. The spread spectrum and frequency masking algorithms offer greater robustness than the echo coding and phase coding algorithms at all bit rates. It is important to note that the error rates at bit rates between 96 and 160 kbps, corresponding to high

Figure 3.21: Bit error rate due to lossy compression as a function of bit rate. BER (percent) is plotted against MPEG Layer III bit rate (kbps) for the echo coding, phase coding, DSSS, FHSS, and frequency masking algorithms.

quality MPEG compression, are significantly lower than those of the echo and phase coding algorithms. These are the bit rates most commonly used for distributing music over the Internet.

3.7 Summary
The primary goals of this chapter were to review a selection of audio watermarking algorithms from the literature, and to evaluate them using the framework proposed in Chapter 1. Five algorithms were chosen from the literature to represent several unique approaches to embedding data within audio signals: echo coding, phase coding, direct sequence and frequency hopped spread spectrum, and frequency masking. Since the focus of this thesis is on public watermarks, the original signal could not be used at the decoder to assist in extracting data, presenting an interesting problem for some algorithms. In addition to a description of each algorithm, suggestions were provided on how they could be implemented and improved. Key among these improvements was the incorporation of a whitening filter at the receiver of the DSSS and FHSS algorithms, based on an autoregressive model of the host signal, in an effort to minimize the presence of the host signal.

Another goal of this chapter was to evaluate the algorithms using the performance analysis framework introduced in Chapter 1. It was found that the echo coding and phase coding algorithms provided the poorest quality output, while signals watermarked with the spread spectrum and frequency masking algorithms had a higher quality. It was shown that there is a tradeoff between algorithm robustness and bit rate.

With respect to signal processing, it was found that the echo coding and phase coding algorithms provided the best resilience to linear and nonlinear filtering operations, with the exception of highpass and high-emphasis filtering. The three spread spectrum algorithms (DSSS, FHSS, and frequency masking) proved considerably robust to additive noise, quantization distortion, and lossy compression. The three techniques were notably less resilient to coloured noise and lowpass filtering operations, with the exception of the frequency masking algorithm.


Chapter 4 Image Watermarking


In the previous chapter, a selection of digital audio watermarking algorithms was implemented and evaluated. However, a great deal of visual content is distributed across public networks such as the Internet, so it is also important to study watermarking algorithms for digital images. High quality digital images differ from wideband audio signals in several ways. First of all, grayscale images are typically represented with a coarser resolution, generally 8 bits per pixel (bpp) as opposed to 16 bits per sample for digital audio. In addition, the number of samples in an image is usually much smaller than in a piece of recorded music. A 512 × 512 image, for example, contains the equivalent of only three seconds of music sampled at 44.1 kHz. Many digital image watermarking algorithms have been introduced in the literature [45]. A variety of spatial domain and frequency domain approaches have been proposed, and many of these have been tested individually for robustness to signal processing operations, usually additive noise and JPEG compression. Many of the proposed algorithms are private watermarking systems, requiring access to the original (unwatermarked) image in order to extract watermark information. As described in Chapter 1, private watermarking schemes are more limited in application than public systems. No comparative analysis of image watermarking algorithms exists in the literature, and in general few signal processing operations are considered in addition to noise and lossy compression.

Six image watermarking algorithms will be reviewed in this chapter, and improvements to their encoder and decoder structures will be proposed. Another goal of this chapter is to apply the performance analysis framework proposed in Chapter 1 as a means of comparing the algorithms. The algorithms evaluated in this chapter were selected to represent the three different approaches to embedding data: spatial domain, frequency domain, and spatial/frequency (multiresolution). They were also chosen to represent a range of computational complexities and implementation structures. Since the focus of this thesis is on public watermarking algorithms, the techniques chosen do not require access to the original image in order to extract the watermark data. In some cases, however, having such access may improve the decoding process. The chapter is organized as follows. Sections 4.2 - 4.3 provide a description of the image watermarking algorithms, including the theory behind them, encoder and decoder structures, and implementation details. This is followed in Section 4.4 by a performance evaluation of the algorithms with respect to perceptual quality, bit rate, computational complexity, and robustness to signal processing operations. Finally, a review of the chapter's findings is provided in Section 4.5.

4.1 Conventions
Similar conventions to those used in the study of audio watermarking algorithms will be used in this chapter. First of all, it is assumed that x(n1, n2) represents a digital host image of size N1 × N2 pixels. This signal is divided into a set of M1 × M2 blocks of size M × M pixels, where M1 = ⌊N1/M⌋ and M2 = ⌊N2/M⌋, as shown in Figure 4.1. Like audio signals, the image is divided into blocks because although images are typically nonstationary as a whole, they exhibit local stationarity within smaller regions. In this case, second-order stationarity allows for analysis of the image's local mean and variance, which is useful for some of the algorithms.

Figure 4.1: Example of a 512 × 512 image divided into 16 × 16 blocks in the spatial domain. Each block will be used to embed one bit of data.

x̄(n1, n2) represents the watermarked image, while x_{m1,m2}(n1, n2) and x̄_{m1,m2}(n1, n2) indicate the <m1, m2> block in the original and watermarked images, respectively, for 0 ≤ m1 ≤ M1 - 1 and 0 ≤ m2 ≤ M2 - 1. Finally, it is assumed that one bit is embedded in each block, and this sequence of M1 × M2 bits is denoted by w(m1, m2) ∈ {-1, +1}, for 0 ≤ m1 ≤ M1 - 1 and 0 ≤ m2 ≤ M2 - 1. As mentioned in the previous chapter, dividing the image up into variable-sized blocks conveniently allows for a variable number of bits to be embedded within the image. A bit extracted from the watermarked image is denoted by w̃(m1, m2).

4.2 Spread Spectrum Techniques


Recall from the previous chapter that spread spectrum techniques from digital communications may be adapted for use in watermarking systems as a means of spreading watermark information throughout the spectrum of a host signal. For digital images, it will be shown in this section that the two spread spectrum watermarking techniques introduced in the previous chapter may be extended into two dimensions. Also recall

from the initial discussion that, for the basic spread spectrum algorithms, the noise power had to be maintained at a very low level in order to keep the distortion inaudible to the listener, as the Human Audio System is sensitive to low levels of noise at mid-band frequencies. As described in Chapter 2, the Gaussian optical point spread function of the Human Visual System has a lowpass frequency response. Therefore, it is predicted that the human eye will be more tolerant of high-frequency noise. Many image watermarking algorithms implicitly take advantage of the lowpass frequency response of the Human Visual System. Two groups (Hartung and Girod, and Cox et al.) introduced image watermarking algorithms that are based upon this premise [40, 41]. The approach of Hartung and Girod operates in the spatial domain, manipulating host image pixels in the same manner as the Direct Sequence Spread Spectrum (DSSS) algorithm introduced in Chapter 3. The algorithm of Cox et al. embeds watermark data into the two-dimensional DCT of the host image. However, in this investigation the two algorithms have been modified so that an arbitrary amount of watermark data may be added, as described in the following sections. In this study, PN sequences are again used as spreading signals for the same reasons they were employed in the previous chapter for audio watermarking: they possess the same statistical properties as white noise, they are deterministic, and they occupy frequencies in excess of the host image's spectrum. The two spread spectrum techniques introduced in the previous chapter, direct sequence spread spectrum (DSSS) and frequency hopped spread spectrum (FHSS), are extended in this discussion to the two-dimensional case. The extension of these algorithms is relatively straightforward, and it will be seen that similar assumptions and improvements apply to the 2D case.

4.2.1 Encoder Structures


In the discussion that follows, it is assumed that a bipolar two-dimensional PN sequence of the form p(n1, n2) ∈ {-1, +1} is available for use at the encoder and decoder, and that the sequence has zero mean, a relatively flat power spectrum, and variance of σ_p² = 1. Also recall from Section 4.1 that bipolar watermark data is of the form w(m1, m2) ∈ {-1, +1}.

4.2.1.1 Direct Sequence Spread Spectrum


The DSSS algorithm described here was originally introduced by Hartung and Girod [40]. The data bit for the current block, w(m1, m2), is spread by modulating it with p(n1, n2), the two-dimensional PN sequence, resulting in a noise-like signal that is added to the original block to construct the watermarked signal:

x̄_{m1,m2}(n1, n2) = x_{m1,m2}(n1, n2) + α w(m1, m2) p(n1, n2)    (4.1)

In the equation above, α represents a constant weighting factor that can be used to control the level of noise added to the host signal. In Section 4.2.4, suitable values for α will be established. Since w(m1, m2) is constant within the block, the spectrum of the added noise assumes the shape of the spectrum of p(n1, n2):

α w(m1, m2) p(n1, n2)  ⟷  α [W(ω1, ω2) * P(ω1, ω2)] = α w(m1, m2) P(ω1, ω2)    (4.2)

where * denotes the two-dimensional convolution operator.

4.2.1.2 Frequency Hopped Spread Spectrum


The FHSS algorithm described here was introduced by Cox et al. [41]. In this approach, the two-dimensional discrete cosine transform (2D-DCT), introduced in Chapter 1, is used to transform the original image block, x_{m1,m2}(n1, n2), into the frequency domain in accordance with:

X_{m1,m2}(k1, k2) = DCT[x_{m1,m2}(n1, n2)]    (4.3)

The result is a set of M × M frequency domain coefficients, where M is the size of the block in pixels. Then, a subset of S ≤ M × M coefficients is selected to contain watermark data:

S = { s_{i,j} ∈ Z | 0 ≤ s_{i,j} ≤ M - 1, 0 ≤ i, j ≤ S - 1 }    (4.4)

The coefficients are modified by using a PN sequence, p(k1, k2) ∈ {-1, +1}, of size S samples, modulating the bit to be embedded within the block with this short PN sequence, and then adding this noise-like sequence to the selected coefficients:

X̄_{m1,m2}(k1, k2) = X_{m1,m2}(k1, k2) + α w(m1, m2) p(k1, k2),  if <k1, k2> ∈ S
X̄_{m1,m2}(k1, k2) = X_{m1,m2}(k1, k2),  otherwise    (4.5)

where, as with the DSSS algorithm, α is a parameter used to control the noise power. The final step is to construct the watermarked image by using the inverse 2D-DCT to convert the modified frequency domain signal into the watermarked block:

x̄_{m1,m2}(n1, n2) = IDCT[X̄_{m1,m2}(k1, k2)]    (4.6)

As before, the subset of S modified coefficients may be fixed for the entire image, or it may vary with each block. Methods of selecting the coefficients will be discussed in Section 4.2.4.
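The FHSS embedding steps of Equations 4.3-4.6 can be sketched as follows. This Python/NumPy fragment is an illustration under stated assumptions (an orthonormal DCT-II matrix in place of a fast transform, and a caller-supplied coefficient subset and PN sequence), not the thesis implementation:

```python
import numpy as np

def dct_matrix(M):
    """Orthonormal DCT-II basis: X = C @ x @ C.T is the 2D-DCT of an
    M x M block, and x = C.T @ X @ C inverts it."""
    k = np.arange(M)[:, None]
    n = np.arange(M)[None, :]
    C = np.sqrt(2.0 / M) * np.cos(np.pi * (2 * n + 1) * k / (2.0 * M))
    C[0, :] = np.sqrt(1.0 / M)
    return C

def fhss_embed(block, bit, subset, pn, alpha=10.0):
    """Eqs. 4.3-4.6: add alpha * bit * pn to the selected 2D-DCT
    coefficients; `subset` is a list of (k1, k2) pairs excluding DC (0, 0)."""
    C = dct_matrix(block.shape[0])
    X = C @ block @ C.T                     # forward 2D-DCT
    for chip, (k1, k2) in zip(pn, subset):
        X[k1, k2] += alpha * bit * chip     # Eq. 4.5 on the subset S
    return C.T @ X @ C                      # inverse 2D-DCT, Eq. 4.6
```

Decoding (Section 4.2.2) repeats the forward transform and correlates the coefficients in the subset against the same PN sequence.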

4.2.2 Decoder Structures


A similar process is used for decoding both the DSSS and FHSS algorithms. In the DSSS approach, the embedded bit is extracted by computing the correlation of the watermarked block with a synchronized version of the PN sequence used at the encoder:

C = Σ_{n1=0}^{M-1} Σ_{n2=0}^{M-1} x̄_{m1,m2}(n1, n2) p(n1, n2)
  = Σ_{n1=0}^{M-1} Σ_{n2=0}^{M-1} [x_{m1,m2}(n1, n2) + α w(m1, m2) p(n1, n2)] p(n1, n2)
  = Σ_{n1=0}^{M-1} Σ_{n2=0}^{M-1} [x_{m1,m2}(n1, n2) p(n1, n2) + α w(m1, m2) p²(n1, n2)]    (4.7)

Given that p(n1, n2) is a noise-like signal with zero mean, the correlation of the original signal with the PN sequence in the equation above may be assumed to be low:

Σ_{n1=0}^{M-1} Σ_{n2=0}^{M-1} x_{m1,m2}(n1, n2) p(n1, n2) ≈ 0    (4.8)

resulting in a weighted bipolar watermark bit at the correlator output:

C ≈ α M² w(m1, m2)    (4.9)

The extracted bit may be obtained from the sign of this output. A similar procedure is used to decode watermark bits in the FHSS algorithm, but the correlation is performed only on the 2D-DCT coefficients that were modified during the embedding process. Note that the set of S modified coefficients must be available at the receiver.
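Equations 4.1 and 4.7-4.9 together form an embed/decode pair that is simple to sketch. The fragment below is illustrative (not the thesis code); note that the decoder leans on the Equation 4.8 assumption that the host/PN correlation is small, which the usage below satisfies exactly by pairing a zero-sum PN block with a flat host block:

```python
import numpy as np

def balanced_pn(M, rng):
    """Zero-sum bipolar PN block: equal numbers of +1 and -1, shuffled."""
    return rng.permutation(np.repeat([1.0, -1.0], M * M // 2)).reshape(M, M)

def dsss_embed(block, bit, pn, alpha=3.0):
    """Eq. 4.1: add the weighted, PN-spread bit to the spatial block."""
    return block + alpha * bit * pn

def dsss_decode(block_wm, pn):
    """Eqs. 4.7-4.9: correlate with the synchronized PN sequence and take the
    sign; relies on the host/PN correlation being small (Eq. 4.8)."""
    return 1 if np.sum(block_wm * pn) >= 0 else -1
```

On real image blocks the Equation 4.8 assumption often fails for small M, which is exactly why the prefiltering improvements of Section 4.2.4.4 are introduced.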

4.2.3 Probability of Bit Error


Like the spread spectrum algorithms for audio watermarking, a theoretical bound on the performance of the algorithm in the presence of additive white Gaussian noise may be derived. Assume that the watermarked image is corrupted by AWGN of zero mean and variance σ_v², and that the original image is available at the decoder and subtracted from the watermarked image. The resulting image block presented to the correlation receiver has the form:

x̃_{m1,m2}(n1, n2) = α w(m1, m2) p(n1, n2) + v(n1, n2)    (4.10)

Applying the correlation formula of Equation 4.7, the extracted bit with a noise term may be written as:

C = Σ_{n1=0}^{M-1} Σ_{n2=0}^{M-1} [α w(m1, m2) p²(n1, n2) + v(n1, n2) p(n1, n2)]    (4.11)

Since v(n1, n2) has zero mean and is uncorrelated with p(n1, n2), for large block sizes it is predicted that the algorithms possess a strong resilience to additive noise distortions. When the block size is not large, the probability of bit error may be approximated by the expression:

P_B = Q(α M / σ_v)    (4.12)

where, as before, Q(x) represents the complementary error function [16]:

Q(x) = (1/√(2π)) ∫_x^∞ exp(-u²/2) du    (4.13)

It is clear from the expression for P_B that either increasing the block size or increasing the watermark power has a significant effect on the reliability of the encoding.
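Equations 4.12-4.13 can be evaluated numerically through the identity Q(x) = erfc(x/√2)/2, using only the standard library; this is an illustrative sketch:

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x) of Eq. 4.13, computed via the
    complementary error function: Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bit_error_prob(alpha, M, sigma_v):
    """Eq. 4.12: P_B = Q(alpha * M / sigma_v) for an M x M block."""
    return q_func(alpha * M / sigma_v)
```

Doubling either α or the block size M doubles the argument of Q, which drives P_B down sharply, consistent with the remark above.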

4.2.4 Implementation and Proposed Improvements


4.2.4.1 Selection of α and S

It was found experimentally that the power of the watermark signal, controlled by α, had to be maintained at low levels in order to keep the distortion at an imperceptible level. This is because the additive watermark distortion is uniformly applied across the image, and so it is directly related to the peak signal-to-noise ratio (PSNR) of the image. In particular, α for the DSSS algorithm was limited to 3 intensity levels (for an 8-bit image), and limited to 10 for the FHSS algorithm (based on a block size of 16 × 16 pixels). In the original FHSS algorithm proposed by Cox et al., the subset of 2D-DCT coefficients used for watermarking, S, was constructed by sorting the coefficients by magnitude, and then selecting the S largest of them minus the DC coefficient. This was done in order to localize the watermark energy around the most "significant" portions of the host image. In this investigation, S was constructed by selecting coefficients pseudorandomly from the entire range of the 2D-DCT (except the DC coefficient).

4.2.4.2 Spatial Domain Masking Analysis: DSSS-SM


Note that the values of α in the previous section represent the maximum amount of global distortion allowed to keep the noise imperceptible. In Chapter 2, a number of visual models were implemented and introduced as a means of determining whether a given level of distortion would be perceptible to a viewer. The direct sequence spread spectrum algorithm may be improved by incorporating Girod's spatial domain model of the Human Visual System into the encoder. This improvement was originally proposed by Tewfik et al. in their image watermarking system [28]. Their modifications to Girod's visual model were described in Section 2.2.6.1. Prior to encoding, the masking values for the image as a whole are computed using Girod's visual model. This has the effect of producing a tolerable error image Δ(n1, n2) representing the maximum watermark distortion on a pixel-by-pixel basis. The analysis is performed for the entire image, because in some cases spatial masking effects will occur at the boundary between blocks. The allowable error is maximized within regions of uniform intensity and where spatial masking effects occur. Then, for each block to be watermarked, the minimum value of Δ(n1, n2) within the block is selected as the fixed watermark power:

α0 = min{ Δ_{m1,m2}(n1, n2) }    (4.14)

where α0 is the new constant watermark power used throughout the block. The minimum value is used to ensure that the watermark distortion is still imperceptible. By using the localized value of α0, the watermark can take advantage of masking characteristics of the image that are localized in nature. Note, however, that the benefit of this modification is maximized for smaller blocks, and decreases as the block size is increased.
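The per-block amplitude selection of Equation 4.14 can be sketched as follows; the tolerable-error map is assumed to have been produced by Girod's visual model, which is not reproduced here (illustrative code, not the thesis implementation):

```python
import numpy as np

def dsss_sm_embed(block, delta_block, bit, pn):
    """DSSS with spatial-masking amplitude (Eq. 4.14): the per-block gain is
    the minimum of the tolerable-error map over the block, so the distortion
    stays below the visibility threshold at every pixel.  delta_block is
    assumed to come from Girod's visual model (hypothetical input here)."""
    alpha0 = float(np.min(delta_block))
    return block + alpha0 * bit * pn
```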

4.2.4.3 Frequency Domain Masking Analysis: FHSS-FMW and FHSS-FMT

In Chapter 2, two frequency domain analysis techniques were introduced, one by Watson and another by Tewfik et al. [28, 32]. Although originally proposed as 8 × 8 2D-DCT coefficient quantization matrices for the JPEG image compression algorithm, Tewfik et al. proposed incorporating their quantization matrix into a watermarking algorithm described below. Modifications were introduced in Section 2.2.6.2 that allow the analysis techniques to be used on variable-sized blocks.

For each block to be watermarked, the 2D-DCT algorithm is first used to transform the block into the frequency domain, denoted X_{m1,m2}(k1, k2), as described earlier for the FHSS watermarking algorithm. In addition, a two-dimensional bipolar PN sequence p(k1, k2) is constructed for each block, and this sequence is combined with w(m1, m2), the bit to be embedded. After this, the following steps are taken:

1. Compute Q(k1, k2), the raised frequency detection threshold levels, using either the Tewfik or Watson analysis algorithm.

2. Quantize the 2D-DCT coefficients using the masking threshold levels, and then modify each quantized coefficient by plus or minus a quarter of the quantization level, according to the PN sequence:

X̄_{m1,m2}(k1, k2) = [X_{m1,m2}(k1, k2) / Q(k1, k2)] Q(k1, k2) + (1/4) w(m1, m2) p(k1, k2) Q(k1, k2)    (4.15)

where [·] denotes the rounding operator.

At the decoder, the embedded bit is extracted by first computing the frequency masking threshold levels, T_M(k1, k2), either from the original host image or an approximation based on the watermarked image. The bit is extracted from the block by determining the PN sequence bits from a quantized version of the watermarked 2D-DCT block:

w̃(m1, m2) p(k1, k2) ≈ ( X̄_{m1,m2}(k1, k2) - [X̄_{m1,m2}(k1, k2) / Q(k1, k2)] Q(k1, k2) ) / Q(k1, k2)    (4.16)

and then computing the correlation of the extracted bits with the original PN sequence. Note that the watermark is not added to the 2D-DCT coefficients, as was the case in the original FHSS algorithm. If the masking coefficients computed at the decoder are a close approximation to those used at the encoder, then the bit error will be zero in a distortionless environment (i.e., no additive noise or other corruption). Tewfik et al. only explored the use of their masking analysis technique in the proposed watermarking algorithm. In this investigation, both the Watson model and Tewfik's model were implemented for use in the modified FHSS algorithm, and they are denoted FHSS-FMW and FHSS-FMT, respectively. Watson's model does not incorporate the frequency masking analysis of Tewfik's approach, but it will be shown in Section 4.4.3 that the former is less computationally expensive to implement.
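A scalar sketch of the quantize-and-offset embedding (Equation 4.15) and the residual-based extraction (Equation 4.16) follows; it illustrates the mechanism and is not the thesis code. Because the offset is only a quarter of the step Q, re-quantizing the marked coefficient recovers the same level, and the sign of the residual carries the chip:

```python
import numpy as np

def qim_embed(X, Q, chip):
    """Eq. 4.15: quantize coefficient X to the masking step Q, then offset by
    a quarter step in the direction of chip = w * p in {-1, +1}."""
    return np.round(X / Q) * Q + 0.25 * chip * Q

def qim_extract(X_wm, Q):
    """Eq. 4.16: the quantization residual of the received coefficient has
    magnitude Q/4 and the sign of the embedded chip."""
    r = X_wm - np.round(X_wm / Q) * Q
    return 1 if r >= 0 else -1
```

In a distortionless channel the residual is exactly ±Q/4, so the extracted chips match the PN sequence and the bit error is zero, as stated above.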

4.2.4.4 Prefiltering to Improve Decoding Reliability


When the watermark decoder does not have access to the original image, such as in a public watermarking system, it is assumed in Equation 4.8 that the correlation of the host signal and the PN sequence will be very low. For a large block size, this assumption may be valid, but not for smaller blocks. Like the improvements proposed for audio watermarking, there are two techniques that can be incorporated to minimize the presence of the host image prior to decoding. The first improvement was suggested by Hartung and Girod [40], but the second improvement has not yet been applied to digital watermarks.

1. Assume that the host image block, x_{m1,m2}(n1, n2), has a lowpass magnitude spectrum characteristic, and employ a highpass prefilter to remove as much of the host signal as possible prior to decoding. Obviously, this approach can be expected to work well on smooth image blocks that do not contain sharp features and high-frequency components, and it is relatively simple to implement.

2. Attempt to minimize the presence of the original host image by employing a whitening lter constructed from a two-dimensional autoregressive (AR) model of the image block. A K K model is given by the expression:

A(z z )
1 2

K ;1 PK ;1 a(n 1 n1 =0 n2 =0

n )z;n1 z;n2
2

(4.17)

where a(0 0) = 1. The AR model coe cients for a particular image block may be obtained from an estimate of the two-dimensional autocorrelation function of the block, and then using either a 2D form of the normal equation, or a 2D

form of the Levinson-Durbin recursion [46]. The AR model may be used as a whitening filter by convolving the watermarked image block with the coefficients computed for that block. The result is a two-dimensional random process corresponding to the prediction error:

    x(n_1, n_2) * a(n_1, n_2) = v_x(n_1, n_2)    (4.18)

Assuming that the power of the noise-like watermark signal, controlled by \alpha, is much less than that of the host image, the AR coefficients computed for the watermarked block, x(n_1, n_2), will be close to those computed for the original image block. Therefore, the convolution of the watermarked block with the AR coefficients will result in \tilde{x}(n_1, n_2), the image presented to the correlator:

    \tilde{x}(n_1, n_2) = x(n_1, n_2) * a(n_1, n_2)
                        = v_x(n_1, n_2) + \alpha w(m_1, m_2) [p(n_1, n_2) * a(n_1, n_2)]    (4.19)

It was shown in the previous chapter that the convolution of a random process with a linear filter results in a cross-correlation between the filter output and the random input that depends only upon the variance of the random signal and the filter coefficients [7]. Therefore, from a derivation similar to that used in Section 3.4.4.2, the expected value of the correlator output for the block will be a weighted version of the watermark bit embedded within the block:

    E[C] = \alpha w(m_1, m_2) \sum_{n_1=0}^{K-1} \sum_{n_2=0}^{K-1} a(n_1, n_2)    (4.20)

where K is the size of the set of AR model coefficients, and the extracted bit, \tilde{w}(m_1, m_2), may be taken as the sign of the correlator output.

Figure 4.2 shows a plot of the magnitude response of the two-dimensional finite impulse response (FIR) filter considered in this investigation. The filter is 11 x 11 coefficients in size, and was constructed using the McClellan frequency transformation method [47]. While it is computationally expensive to construct such a filter, the construction only has to be performed at design time, with the result incorporated into the decoder.

Figure 4.2: Two-dimensional highpass filter used to prefilter host images watermarked with the DSSS and FHSS algorithms.

In Section 3.4.4.2 it was shown experimentally that a highpass prefilter greatly improves the decoding reliability of the audio DSSS algorithm, while the AR modeling technique improves the performance of the FHSS decoder. In a similar manner, it was found that the highpass prefiltering modification provided the best decoding performance for the DSSS and DSSS-SM algorithms, and that a 3 x 3 whitening filter provided the best decoding performance for the FHSS algorithm. This size of AR model is in agreement with studies of image compression using two-dimensional linear predictive coding (2D-LPC), where it was found that a square of coefficients larger than three or four samples per side provided little coding gain [46]. In the performance analysis section, this modification to the algorithm decoders will be used. It is important to remember that for the FHSS algorithms with frequency domain masking analysis (FHSS-FMT and FHSS-FMW), no such prefiltering is used at the decoder, because those algorithms use quantization rather than an additive watermark.
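The whitening-filter decoding step can be sketched in a few lines of numpy. This is a minimal illustration rather than the thesis's MATLAB implementation: the 3 x 3 AR model is estimated here by ordinary least squares instead of the autocorrelation / Levinson-Durbin route described above, and the smooth test image and strength value in the usage below are made-up examples.

```python
import numpy as np

def ar_whiten(block, K=3):
    """Fit a K x K causal 2D AR model with a(0,0) = 1 by least squares
    and return the prediction error (the whitened block, Eq. 4.18)."""
    N1, N2 = block.shape
    preds, cur = [], []
    for n1 in range(K - 1, N1):
        for n2 in range(K - 1, N2):
            # causal K x K neighbourhood ending at (n1, n2); after the
            # flip, index 0 is the current pixel, the rest are predictors
            patch = block[n1 - K + 1:n1 + 1, n2 - K + 1:n2 + 1][::-1, ::-1].ravel()
            cur.append(patch[0])
            preds.append(patch[1:])
    A, y = np.asarray(preds), np.asarray(cur)
    coeffs, *_ = np.linalg.lstsq(A, -y, rcond=None)  # minimize ||y + A c||^2
    resid = y + A @ coeffs
    return resid.reshape(N1 - K + 1, N2 - K + 1)

def decode_bit(wm_block, pn, K=3):
    """Whiten the watermarked block, correlate the residual with the
    PN sequence, and take the sign of the correlation (Eq. 4.20)."""
    resid = ar_whiten(wm_block, K)
    return int(np.sign(np.sum(resid * pn[K - 1:, K - 1:])))
```

For a smooth host block the fitted model cancels most of the image, leaving the noise-like watermark component to dominate the correlator output.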

4.2.4.5 Discussion
There are significant differences between the spread spectrum algorithms implemented for this study and the original versions from the literature. Most notable is the division of the host image into a set of blocks, which allows a variable number of bits to be embedded within a host image. In the original FHSS algorithm proposed by Cox et al., for example, the authors compute the 2D-DCT of the entire image. They also recommend constructing a watermark using samples drawn from a Gaussian process, and they require access to the original image in order to extract the watermark. The result is a signature embedded into the host image, rather than the arbitrary set of watermark data achieved by using a block-by-block approach. In addition, a novel approach used in this study is the incorporation of Watson's frequency domain masking analysis into the quantized FHSS-FMT algorithm proposed by Tewfik et al. It will be shown in Section 4.4 that Watson's model is less computationally expensive than Tewfik's model, but offers a similar level of performance when incorporated into

the FHSS-FMW algorithm. The basic DSSS and FHSS algorithms are straightforward to implement, and are computationally efficient. If the block size is a power of two, then the 2D-FFT or other fast algorithms can be used to compute the 2D-DCT of each image block. Also, the watermarking of blocks can be performed in parallel. It is predicted that incorporating spatial masking analysis into the DSSS algorithm will improve the imperceptibility of the distortion and maximize the watermark strength in regions of the image that possess sharp edges or uniform intensity. The more complex frequency domain masking analysis techniques added to the FHSS algorithm should also improve algorithm performance through their use of the luminance masking, frequency sensitivity, and frequency masking characteristics of the host image. However, the spread spectrum algorithms are subject to the same PN sequence synchronization problems described in Section 3.4.4.3, which makes them quite susceptible to cropping and geometric transformations. Also, it will be shown in the performance evaluation of Section 4.4 that incorporating masking analysis increases the computational requirements of the algorithms.


4.3 Multiresolution Embedding


In Section 2.2.1, it was discussed that the Human Visual System possesses frequency sensitivity characteristics. In particular, the contrast detection threshold function of Figure 2.8 illustrates how the sensitivity of human vision varies with the spatial frequency of stimuli. The FHSS-FMW and FHSS-FMT algorithms presented earlier in this chapter use the frequency sensitivity models of Watson and of Tewfik et al., constructed in the 2D-DCT domain on blocks of the host image. As mentioned in Section 4.2.4.5, however, a watermarked image becomes more vulnerable to cropping and geometric distortion as the block size decreases. Increasing the block size, on the other hand, limits the amount of watermark data that can be embedded within the image. In [48], Watson et al. derived quantization matrices for image compression based

                          Level
Orientation       1         2         3         4
LL            29.7551   20.3071   17.9397   19.7667
LH            53.1615   29.2656   21.8632   21.0311
HL            53.1615   29.2656   21.8632   21.0311
HH           155.0356   64.7006   38.4151   30.3196

Table 4.1: Wavelet quantization levels for a 512 x 512 image at the standard viewing distance.

on a four-level wavelet decomposition of the image using the 9-7 biorthogonal filters originally proposed for image compression [12]. In their approach, the authors determined the minimum amount of noise in wavelet coefficients, at each level of resolution and orientation, that would be detectable to a viewer seated at the standard viewing distance from the image. They did this by means of a psychovisual study with a large number of test subjects. Random noise was injected into wavelet coefficients at a single resolution and orientation, and the resulting image was presented to the viewer after computing an inverse wavelet transformation on the noisy coefficients. The noise was increased until it became detectable in the resulting image. The result was a set of quantization levels, one for each resolution and orientation in the four-level decomposition. The quantization levels from Watson et al. depend on the spatial resolution of the image, which in turn depends upon the image size and the distance from the viewer. For a 512 x 512 image located at the standard viewing distance of six times the image width, the quantization levels associated with four levels of decomposition are shown in Table 4.1. The four orientations LL, LH, HL, and HH correspond to the four subimages at each level of the multiresolution decomposition; they represent lowpass, horizontal, vertical, and diagonal components, respectively. Podilchuk and Zeng have attempted to incorporate these wavelet quantization levels into an image watermarking scheme [49]. However, their approach is deficient in several ways. First of all, their algorithm embeds a watermark signature into the

Figure 4.3: Decomposition filters used to compute the 2D-DWT: (a) lowpass filter; (b) highpass filter.

host image, not an arbitrary set of data bits. In addition, access to the original image is required for extracting the signature at the decoder (a private watermarking system). In the following sections, a modified version of their multiresolution watermarking scheme is proposed that allows the embedding of data, and does not require access to the original image.

4.3.1 Encoder Structure


As before, the bit to be embedded in each block is spread by a PN sequence p(n_1, n_2) \in {-1, +1} before being used to modify the image. The basis functions used to compute the discrete wavelet transform (DWT) are the 9-7 biorthogonal functions described in [12]. The decomposition lowpass and highpass filters resulting from this choice of basis functions are shown in Figure 4.3. To watermark the image, the following procedure is followed:

1. Compute the four-level two-dimensional DWT of the host image x(n_1, n_2). The result of each successive level of decomposition is a set of four (N/2^M) x (N/2^M) "subimages", where M denotes the level. These four downsampled subimages represent a lowpass representation of the image and three detail images corresponding to horizontal (LH), vertical (HL), and diagonal (HH) components. The lowpass

image is filtered and downsampled again to produce the next level of subimages.

2. From the set of ten subimages created by the DWT, construct an N x N composite image x_{comp}(n_1, n_2), as shown in Figure 4.4-(a) and Figure 4.5. The composite image has the same dimensions as the original image, but is comprised of the subimages.

3. Construct an N x N quantization matrix Q(n_1, n_2), based on the allowable quantization level for each subimage's resolution and orientation from Table 4.1. For example, the region of Q(n_1, n_2) corresponding to the lowpass (LL) subimage at the fourth level of decomposition would be (N/16) x (N/16) samples in size, and assigned a quantization level of 19.7667. Figure 4.4-(b) illustrates the structure of the N x N quantization matrix.

4. Divide the composite image into a set of blocks of size M x M pixels, and for each block construct a PN sequence p(n_1, n_2) \in {-1, +1} with which to spread the bit embedded within the block.

5. Embed the spread bits into the blocks of the composite image by quantizing the composite image coefficients using the quantization matrix, and then modifying each quantized coefficient by plus or minus a quarter of the quantization level, according to the PN sequence:

    \tilde{x}_{comp}(n_1, n_2) = round[x_{comp}(n_1, n_2) / Q(n_1, n_2)] Q(n_1, n_2) + (1/4) w(m_1, m_2) p(n_1, n_2) Q(n_1, n_2)    (4.21)

where round[.] denotes rounding to the nearest integer.

6. Finally, compute the inverse DWT of the modified subimages to produce the resulting watermarked image:

    \tilde{x}(n_1, n_2) = IDWT[\tilde{x}_{comp}(n_1, n_2)]    (4.22)

Note that the blocks containing bits can overlap with the multiresolution subimages, so that some of the data may be embedded within different spatial frequency bands

Figure 4.4: N x N composite images made from the multiresolution decomposition subimages and quantization levels: (a) composite image from subimages; (b) composite quantization matrix.

and different orientations. By dividing the composite image into M x M blocks, it can be made to hold the same amount of data as the other watermarking algorithms. Also note the similarity of this approach to the modified FHSS algorithm incorporating frequency domain masking analysis. In both algorithms, the frequency or spatial / frequency domain coefficients are quantized with respect to the maximum quantization level, resulting in perfect reconstruction of the data bits in the absence of distortion. An example of the wavelet decomposition performed on the LENNA image is shown in Figure 4.5.
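The composite quantization matrix of Figure 4.4-(b) can be assembled directly from the Table 4.1 levels. The sketch below is a minimal numpy illustration; the assignment of HL to the top-right quadrant and LH to the bottom-left is an assumed layout convention (harmless here, since the HL and LH levels are equal).

```python
import numpy as np

# Quantization levels from Table 4.1 (Watson et al.), indexed by
# decomposition level and orientation.
Q_LEVELS = {
    1: {"LL": 29.7551, "LH": 53.1615, "HL": 53.1615, "HH": 155.0356},
    2: {"LL": 20.3071, "LH": 29.2656, "HL": 29.2656, "HH": 64.7006},
    3: {"LL": 17.9397, "LH": 21.8632, "HL": 21.8632, "HH": 38.4151},
    4: {"LL": 19.7667, "LH": 21.0311, "HL": 21.0311, "HH": 30.3196},
}

def quantization_matrix(N=512, levels=4):
    """Build the N x N composite quantization matrix Q(n1, n2).
    At level m the subimages are (N/2^m) x (N/2^m); only the deepest
    level keeps its LL block, since shallower LL bands are decomposed
    further."""
    Q = np.zeros((N, N))
    for m in range(1, levels + 1):
        s = N >> m                         # subimage size at level m
        Q[:s, s:2 * s] = Q_LEVELS[m]["HL"]
        Q[s:2 * s, :s] = Q_LEVELS[m]["LH"]
        Q[s:2 * s, s:2 * s] = Q_LEVELS[m]["HH"]
    s = N >> levels
    Q[:s, :s] = Q_LEVELS[levels]["LL"]     # deepest-level lowpass block
    return Q
```

The resulting matrix lines up sample-for-sample with the composite image of Figure 4.4-(a), so embedding reduces to elementwise operations.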

4.3.2 Decoder Structure


The decoder structure is similar to that of the FHSS-FMW and FHSS-FMT algorithms, in that the spatial / frequency coefficients of the watermarked image are quantized again and used to determine the spread data signal. The procedure is as follows:

Figure 4.5: Example of a four-level wavelet decomposition of a 512 x 512 pixel version of LENNA.

1. Compute the four-level two-dimensional DWT of the watermarked image, and again construct an N x N composite image from the subimages:

    \tilde{x}_{comp}(n_1, n_2) = DWT[\tilde{x}(n_1, n_2)]    (4.23)

2. Divide the composite image into a set of M x M blocks, and determine the spread data sequence for each block from a quantized version of the watermarked block:

    w(m_1, m_2) p(n_1, n_2) = 4 [\tilde{x}_{comp}(n_1, n_2) - round(\tilde{x}_{comp}(n_1, n_2) / Q(n_1, n_2)) Q(n_1, n_2)] / Q(n_1, n_2)    (4.24)

3. Compute the correlation of the extracted spread sequence with the original PN sequence in order to extract the embedded data bit:

    C = \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} w(m_1, m_2) p^2(n_1, n_2) = M^2 w(m_1, m_2)    (4.25)

4. Finally, obtain the embedded bit from the sign of the correlation:

    \tilde{w}(m_1, m_2) = sign[C]    (4.26)
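The quantize-and-offset embedding of Eq. 4.21 and the extraction of Eqs. 4.24-4.26 can be sketched as a minimal numpy round trip. The function names, the uniform quantization matrix, and the random test coefficients in the usage below are illustrative, not from the thesis:

```python
import numpy as np

def embed_bit(coeffs, Q, w, pn):
    """Embed one spread bit per Eq. 4.21: quantize each coefficient to
    the nearest multiple of Q, then offset it by +/- Q/4 according to
    the spread bit w * pn."""
    return np.round(coeffs / Q) * Q + 0.25 * w * pn * Q

def extract_bit(wm_coeffs, Q, pn):
    """Recover the bit per Eqs. 4.24-4.26: re-quantize, scale the
    residual into a +/-1 spread sequence, correlate with the PN
    sequence, and take the sign."""
    spread = 4.0 * (wm_coeffs - np.round(wm_coeffs / Q) * Q) / Q
    return int(np.sign(np.sum(spread * pn)))
```

In the absence of distortion the residual is exactly +/- Q/4, so the embedded bit is recovered perfectly, consistent with the zero-error behaviour noted for these algorithms.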

4.3.3 Discussion
The multiresolution embedding algorithm possesses a number of advantages over the spread spectrum techniques discussed earlier. Like the FHSS algorithm, the multiresolution approach spreads watermark data throughout the spatial domain of the host image. However, the division into blocks in the 2D-DWT domain results in data being embedded within different frequency bands and orientations. The quantization levels are fixed for the given standard viewing distance and wavelet basis functions, so it is not necessary to compute them again at the receiver, as it is for the FHSS-FMW and FHSS-FMT algorithms. However, it should be noted that the quantization levels represent the sensitivity of the HVS to wavelet basis functions at various resolutions and orientations in the 2D-DWT domain. This sensitivity is image independent, so other aspects of perceptual masking, such as luminance masking and frequency masking, are not considered. In addition, the quantization levels are only valid for the 9-7 basis functions studied by Watson et al., and it is not clear how the levels could be adjusted for use with other basis functions.

4.4 Performance Evaluation


The purpose of this section is to compare the six image watermarking algorithms reviewed in the previous sections. Recall that in Chapter 1 a framework was introduced for evaluating watermarking algorithms. In particular, it was proposed that bit rate, perceptual quality, computational complexity, and robustness to signal processing are key aspects by which watermarking algorithms can be evaluated. In this section, this framework is used to evaluate the watermarking algorithms. For this investigation, a set of ten grayscale digital images was selected for watermarking, as shown in Figure 4.6. The images were chosen because they exhibit a variety of spatial and spectral properties. BARBARA, for example, contains strong angular frequency components, while LENNA possesses a mixture of low and high frequency components. Each image is 512 x 512 pixels in size, and quantized to 8 bits per pixel (bpp), or 256 intensity levels. The watermarking algorithms were implemented in MATLAB under Linux on an Intel Pentium PC running at 166 MHz. Each algorithm was implemented using the parameters specified with the implementation details presented earlier. In general, each of the images was watermarked using the six algorithms 100 times, and the results averaged for each algorithm. In all cases, a different and random watermark signal was generated for each run. This was done in order to remove any dependency of extraction on the watermark data itself.

Figure 4.6: Sample images used in the performance evaluation of image watermarking algorithms: (a) BARBARA, (b) BOAT, (c) FROG, (d) GOLDHILL, (e) LENNA, (f) MANDRILL, (g) MONARCH, (h) MOUNTAIN, (i) PEPPERS, (j) ZELDA.

Figure 4.7: Bit error rate versus block size for the six watermarking algorithms compared.

4.4.1 E ect of Block Size


This experiment was designed to determine the bit error rate of each watermarking algorithm as a function of M, the square block size in pixels. It is assumed that the decoders do not have access to the original image, so the algorithms must rely upon the improvements introduced earlier. For each algorithm, the encoder and decoder were run on each image, and the bit error rate (BER) was determined from the extracted bits. This process was performed for block sizes between 2 and 512 samples (the size of the images). The results of this experiment are shown in Figure 4.7. As in the investigation of audio watermarking algorithms in the previous chapter, it is clear from the plot that the error rate declines as the block size increases. This reinforces the tradeoff discussed in the previous chapter: the reliability of the encoding increases along with the block size. Note that the FHSS-FMT, FHSS-FMW, and multiresolution algorithms had an error rate of zero at all block sizes; it was explained earlier that in the absence of distortion, these algorithms are expected to produce no bit errors. All of the algorithms had an error rate of less than five percent at M x M = 16 x 16 samples. In the experiments that follow, a block size of 16 x 16 samples was used. As with audio watermarking, a similar problem exists with respect to blocks and watermarked images. If the image is cropped, then any bits embedded within affected blocks are lost. To prevent this, larger blocks should be used, but this limits the bit rate of the watermarking system; there is therefore a tradeoff between block size and bit rate. As an example, Figures 4.8 to 4.10 show the original 512 x 512 LENNA image along with versions watermarked using the six algorithms described in this chapter. It is clear from the images that the distortion introduced by the algorithms does not degrade the perceptual quality of the host image when seen from a standard viewing distance of six times the image width.

4.4.2 Perceptual Quality


This experiment was designed to determine the amount of distortion each watermarking algorithm introduces into the host image, using the peak signal-to-noise ratio (PSNR) of the watermarked image versus the original image. Each of the ten images was watermarked with the six algorithms, and the PSNR computed. A block size of M x M = 16 x 16 pixels was used, and the results of this experiment are shown in Table 4.2. From the table, it is clear that the techniques incorporating perceptual masking (DSSS-SM, FHSS-FMW, FHSS-FMT, and multiresolution) produce a lower PSNR than the algorithms that do not. The reason is that perceptual analysis allows the watermark strength to be increased in regions of the image that possess spatial and / or frequency domain masking properties. The PSNR of the DSSS and FHSS algorithms is directly dependent upon the value of \alpha, since that parameter controls

Figure 4.8: LENNA image watermarked with the DSSS and DSSS-SM algorithms: (a) original image; (b) DSSS; (c) DSSS-SM.

Figure 4.9: LENNA image watermarked using the FHSS, FHSS-FMW, and FHSS-FMT algorithms: (a) original image; (b) FHSS; (c) FHSS-FMW; (d) FHSS-FMT.

Figure 4.10: LENNA image watermarked using the multiresolution algorithm: (a) original image; (b) multiresolution embedding.

Image       DSSS    DSSS-SM   FHSS    FHSS-FMW  FHSS-FMT  Multiresolution
BARBARA     38.59    32.43    37.85    28.44     25.69     31.49
BOAT        38.59    29.96    37.93    27.74     26.03     32.01
FROG        38.62    39.73    37.16    32.77     26.02     31.81
GOLDHILL    38.59    32.09    38.00    29.00     25.93     31.86
LENNA       38.59    32.18    37.54    29.48     26.48     32.26
MANDRILL    38.59    30.12    38.45    26.33     24.53     30.57
MONARCH     38.60    37.09    35.95    27.90     23.94     29.63
MOUNTAIN    38.72    37.20    36.05    31.16     25.53     31.40
PEPPERS     38.59    31.49    37.17    28.62     26.31     32.15
ZELDA       38.59    32.80    38.01    30.26     26.68     32.45
Average     38.61    33.51    37.41    29.17     25.71     31.56

Table 4.2: PSNR of watermarked images versus original images (in decibels).

Image       DSSS    DSSS-SM   FHSS    FHSS-FMW  FHSS-FMT  Multiresolution
BARBARA      4.36     7.81    13.75    30.99     69.15      9.79
BOAT         4.31     7.69    13.69    30.97     71.09      9.77
FROG         4.32     7.66    13.66    30.85     74.52      9.75
GOLDHILL     4.31     7.63    13.64    30.85     72.09      9.71
LENNA        4.31     7.62    13.67    30.83     69.17      9.71
MANDRILL     4.28     7.61    13.64    30.84     71.50      9.71
MONARCH      4.47     7.85    14.09    31.85     69.78     10.02
MOUNTAIN     4.52     7.99    14.31    32.31     68.26     10.14
PEPPERS      4.64     8.20    14.80    33.30     68.86     10.45
ZELDA        4.62     8.18    14.64    32.97     68.09     10.35
Average      4.41     7.82    13.99    31.58     70.45      9.93

Table 4.3: Image watermarking algorithm timings (in seconds).

the power of the noise-like watermark signal added to the host image. In contrast, the PSNR of the FHSS-FMW and FHSS-FMT algorithms cannot be predicted as easily. Quantization of the 2D-DCT coefficients is performed by these two techniques, which introduces random noise in the form of quantization errors. If the quantization matrices were uniform, then the noise variance could be predicted from the quantization levels. However, the quantization levels vary with frequency and with the 2D-DCT of the host image block, so the PSNR tends to vary between images. The PSNR of the multiresolution algorithm is more constant across the sample images because the quantization level is uniform for each level and orientation of the decomposition.
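The PSNR figure of merit used throughout this section can be computed as follows; a minimal sketch, assuming 8-bpp images with a peak value of 255:

```python
import numpy as np

def psnr(original, watermarked, peak=255.0):
    """Peak signal-to-noise ratio in decibels: 10 log10(peak^2 / MSE)."""
    mse = np.mean((np.asarray(original, float) - np.asarray(watermarked, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform error of one intensity level, the smallest nonzero distortion possible at 8 bpp, gives a PSNR of about 48.1 dB, which is why the values in Table 4.2 all fall well below that ceiling.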

4.4.3 Computational Complexity


This experiment was designed to determine the computational cost of each watermarking algorithm. For each algorithm, CPU timings were recorded for the encoder and decoder run on each image. A block size of M x M = 16 x 16 pixels was used, and the results of this test are shown in Table 4.3. From the table, it is immediately obvious that DSSS is the most efficient, requiring an average of just over 4 seconds to encode and decode each 512 x 512 image. The DSSS with spatial masking analysis (DSSS-SM), along with the multiresolution algorithm, is slightly more expensive, because these algorithms must either filter the host image or compute a forward and inverse transform. The FHSS algorithm is more expensive, requiring an average of 14 seconds to encode and decode each host image. This is not surprising considering that each block must be transformed into the frequency domain using the 2D-DCT, and more resources are required to compute the AR model of the watermarked image at the decoder. Not surprisingly, the FHSS algorithms with frequency domain masking, FHSS-FMW and FHSS-FMT, require the most time to encode and decode the host images. In addition to computing the frequency threshold levels of each DCT block, the algorithms must perform the same steps at the decoder in order to approximate the masking levels of the original image. This is required because the original image is typically not available at the receiver. Of the two, the FHSS-FMT algorithm requires over twice the amount of time to run, since it involves computing the complex frequency masking characteristics of each image block.

4.4.4 Robustness to Signal Processing


As described in Chapter 1, robustness to signal processing is a desirable feature of any watermarking system. Common operations, such as noise reduction, image enhancement, or lossy compression, should not completely destroy a watermark embedded within an image. Measuring how well each algorithm survives distortion provides another tool for selecting between algorithms for a particular application. The processing operations for this evaluation were chosen because they do not severely distort the subjective quality of the watermarked images, and because they represent or simulate "real world" types of operations.

4.4.4.1 Mean and Lowpass Filtering


For mean filtering, a filter of size K x K samples was applied to the watermarked images, for 1 <= K <= 15, to replace each pixel with the average of a block of previous

pixels:

    \tilde{x}(n_1, n_2) = \frac{1}{K^2} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} x(n_1 - i, n_2 - j)    (4.27)

where \tilde{x} denotes the averaged pixel. Mean filtering, essentially a lowpass filtering operation, has the effect of removing high-frequency noise from a signal. For lowpass filtering, a lowpass symmetric half-band filter of size K x K samples was constructed using a frequency sampling technique, for 1 <= K <= 15. Lowpass filtering is used prior to downsampling an image, or to remove high-frequency noise. As the filter order increases, more high-frequency components of the watermarked image are attenuated by the mean and lowpass filtering operations. The results of this experiment are shown in Figures 4.11 and 4.12. The DSSS and multiresolution algorithms perform poorly under these operations. DSSS employs a highpass prefilter at the decoder under the assumption that the host image has a lowpass magnitude response, but mean and lowpass filtering remove the high-frequency components on which the prefilter relies. In Figure 4.5, it is clear that roughly 3/4 of the DWT coefficients lie within the high-frequency subimages at the first level of decomposition; therefore, roughly 3/4 of the watermark data will be corrupted by lowpass filtering operations. In contrast, the block-based FHSS algorithms perform better because they operate in the frequency domain and spread the watermark data from each bit throughout the spectrum. Lowpass filtering operations will affect the high-frequency components, but there will still be a correlation between the lowpass components and the PN sequence used to spread the data.

Figure 4.11: Bit error rate from mean filtering for image watermarking algorithms.

Figure 4.12: Bit error rate from lowpass filtering for image watermarking algorithms.
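The causal averaging of Eq. 4.27 can be sketched directly in numpy. Zero padding at the image borders is an assumption here; the thesis does not state its border handling:

```python
import numpy as np

def mean_filter(img, K):
    """Causal K x K mean filter of Eq. 4.27: each output pixel is the
    average of the K x K block of previous pixels (zero-padded edges)."""
    padded = np.pad(np.asarray(img, float), ((K - 1, 0), (K - 1, 0)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(K):
        for j in range(K):
            # accumulate the (i, j)-shifted copy of the image
            out += padded[K - 1 - i:padded.shape[0] - i,
                          K - 1 - j:padded.shape[1] - j]
    return out / (K * K)
```

Accumulating shifted copies instead of looping over pixels keeps the sketch vectorized over the image while following the equation term by term.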

4.4.4.2 Highpass Filtering


A highpass symmetric half-band filter of size K x K samples was constructed using a frequency sampling technique, for 1 <= K <= 15. Highpass filtering is commonly used to emphasize edges in an image, and as the filter order increases, low-frequency components of the watermarked image are attenuated more strongly. The results of this experiment are shown in Figure 4.13. From the plot, it is clear that most algorithms perform well under this operation, for two reasons. First, each of the techniques uses a spread spectrum approach to distribute watermark data throughout the spectrum of the host image. Second, the sample images possess more low-frequency components than high, and removing those components enhances the remaining watermark data. The exception to this result is the multiresolution algorithm, which showed a higher bit error rate than the others. From Figure 4.5, roughly 1/4 of the DWT coefficients will be attenuated by highpass filtering, and roughly the same percentage of watermark data will be corrupted.

Figure 4.13: Bit error rate from highpass filtering for image watermarking algorithms.

4.4.4.3 High-emphasis Filtering


A high-emphasis filter, 11 x 11 samples in size, was constructed using a weighted superposition of halfband lowpass and highpass filters:

    h(n_1, n_2) = (1 - A) h_{LP}(n_1, n_2) + A h_{HP}(n_1, n_2)    (4.28)

where h_{LP}(n_1, n_2) and h_{HP}(n_1, n_2) represent filters of 11 x 11 samples, with A varying over 0 <= A <= 1. The high-emphasis filtering operation, also known as unsharp masking, was included in this study because it is commonly used to remove high-frequency noise while retaining sharp edges and features in an image. It was expected that as A approaches 0 and 1, the performance of the watermarking algorithms would approximate that of the lowpass and highpass filtering experiments, respectively. The results, plotted in Figure 4.14, support this.
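The blend of Eq. 4.28 can be sketched with numpy. The tiny 3 x 3 kernels below are illustrative stand-ins, not the 11 x 11 frequency-sampled halfband filters used in the experiment:

```python
import numpy as np

# Illustrative 3 x 3 stand-ins for the halfband filters; the thesis
# uses 11 x 11 filters designed by frequency sampling.
h_lp = np.full((3, 3), 1.0 / 9.0)     # simple averaging lowpass
h_hp = -np.full((3, 3), 1.0 / 9.0)
h_hp[1, 1] += 1.0                     # delta minus lowpass = highpass

def high_emphasis(A):
    """Weighted superposition of Eq. 4.28: h = (1 - A) h_LP + A h_HP."""
    return (1.0 - A) * h_lp + A * h_hp
```

At A = 0 the kernel reduces to the lowpass filter and at A = 1 to the highpass filter, matching the limiting behaviour noted above.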

4.4.4.4 Wiener Filtering


For each watermarked image block, x_{m_1 m_2}(n_1, n_2), a K x K autoregressive model was computed and used as a forward prediction filter:

    \tilde{x}_{m_1 m_2}(n_1, n_2) = \sum_{(i, j) \in ROS} x_{m_1 m_2}(n_1 - i, n_2 - j) a(i, j)    (4.29)

Figure 4.14: Bit error rate from high-emphasis filtering for image watermarking algorithms.

where (i, j) \in ROS ranges over a K x K block of previous pixels. The coefficients a(i, j) were chosen to minimize the mean squared error (MSE) between x_{m_1 m_2}(n_1, n_2) and \tilde{x}_{m_1 m_2}(n_1, n_2), for square blocks of 1 <= K <= 15. Wiener filtering was used to simulate the effects of two-dimensional linear predictive image coding [46]. As the prediction filter order increases, the model more closely matches the watermarked image block. The output of the prediction filter is an approximation to the image block, plus random noise corresponding to the prediction error. The variance of this noise decreases as the filter order increases, and depends upon the host image being modeled. Figure 4.15 shows the result of this experiment.

4.4.4.5 Median Filtering


A median filter was used, designed to replace each sample of the watermarked image with the median value from the set of K x K neighbouring pixels. Median filtering is a non-linear process often used to reduce high-frequency noise in an image. The results are shown in Figure 4.16, and it is clear from the plot that as the filter order increases, more pixels are replaced and the performance of every algorithm quickly degrades. This is because the correlation of the watermarked image with the PN sequence

Figure 4.15: Bit error rate from Wiener filtering for image watermarking algorithms.

Figure 4.16: Bit error rate from median filtering for image watermarking algorithms.

suffers when more pixels are altered. Transform domain algorithms (FHSS-FMW, FHSS-FMT, and multiresolution) fare better because their coefficients are affected less by modifications to individual pixels.
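The median replacement described above can be sketched as follows; the clipped-neighbourhood handling at the image borders is an assumption:

```python
import numpy as np

def median_filter(img, K):
    """Replace each pixel with the median of its K x K neighbourhood;
    border pixels use the clipped (smaller) neighbourhood."""
    H, W = img.shape
    r = K // 2
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            patch = img[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            out[i, j] = np.median(patch)
    return out
```

Because the output is an order statistic rather than a linear combination of the input, isolated pixel values, including watermark-bearing ones, are simply discarded, which is why every algorithm degrades quickly here.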

4.4.4.6 Additive and Coloured Noise


The purpose of this experiment was to determine the performance of each algorithm in the presence of additive and coloured noise. As mentioned in the previous chapter, a common attack on watermarks is to introduce noise into the image in an effort to hamper decoding at the receiver. For additive noise, the watermarked image was corrupted with an AWGN process of zero mean and variance \sigma_v^2 in accordance with

    \tilde{x}(n_1, n_2) = x(n_1, n_2) + v(n_1, n_2)    (4.30)

and for coloured noise, the image was distorted with noise of the same power, but multiplied by a normalized version of the watermarked signal. Since the pixels of the host images lie in the interval 0 <= x(n_1, n_2) <= 255, the corrupted image may be written as

    \tilde{x}(n_1, n_2) = x(n_1, n_2) + \frac{x(n_1, n_2)}{255} v(n_1, n_2)    (4.31)

For each algorithm, the bit error rate was computed as a function of the peak signal-to-noise ratio (PSNR) in decibels. A block size of 16 x 16 pixels was used, and the results of this experiment are shown in Figure 4.17 and Figure 4.18. From the plots, it is clear that the FHSS, FHSS-FMW, FHSS-FMT, and multiresolution algorithms provided the best resilience to additive and coloured noise. Of these, the more complicated FHSS-FMT technique was the best overall, particularly at extremely low PSNR. The performance of the DSSS, DSSS-SM, and multiresolution algorithms was comparable for additive noise, but the multiresolution algorithm provides the least resilience to coloured noise. Error rates for the coloured noise case were lower than for additive noise because the noise power was determined before multiplication with the normalized image. This tends to skew the noise ratio, but it is not a serious problem because the interest lies in the performance of the algorithms relative to each other.
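The two corruption models of Eqs. 4.30 and 4.31 can be sketched together; the function name and the PSNR-to-variance conversion step are illustrative choices, not from the thesis:

```python
import numpy as np

def add_noise(img, psnr_db, coloured=False, rng=None):
    """Corrupt an 8-bpp image per Eq. 4.30 (additive) or Eq. 4.31
    (coloured).  As in the experiment, the noise variance is set from
    the target PSNR before any multiplication by the normalized image,
    so the coloured case ends up with somewhat less noise power."""
    if rng is None:
        rng = np.random.default_rng()
    var = 255.0 ** 2 / 10.0 ** (psnr_db / 10.0)  # PSNR = 10 log10(255^2 / var)
    v = rng.normal(0.0, np.sqrt(var), img.shape)
    if coloured:
        v = (img / 255.0) * v                    # scale by the normalized image
    return img + v
```

Because the scaling factor x/255 is at most one, the coloured variant injects less total power, which is the skew in the noise ratio noted above.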

Figure 4.17: Bit error rate (percent) versus PSNR (decibels) due to additive white Gaussian noise.

Figure 4.18: Bit error rate (percent) versus PSNR (decibels) due to coloured Gaussian noise.

Additive noise has a flat power spectrum, and so the distortion has the same level for all transform-domain coefficients. The transform domain algorithms (FHSS-FMW, FHSS-FMT, and multiresolution) perform better than the spatial-domain approaches because the quantization levels are not the same for each coefficient. The multiresolution algorithm's quantization levels vary with orientation and DWT level, and the FHSS-FMW and FHSS-FMT levels vary with each 2D-DCT coefficient. The same level of additive noise will affect a coefficient with a smaller quantization level more than one with a larger level.

4.4.4.7 Quantization
The purpose of this experiment was to investigate the performance of each algorithm under distortion due to linear quantization of the watermarked images. The images, originally represented as greyscale images with 8 bits per pixel (bpp), were linearly requantized to K bpp, for $1 \le K \le 7$. As before, for each algorithm the encoder was run on each image, followed by quantization, and then decoding to extract the bits. A block size of 16 × 16 pixels was used, and the results of this experiment are shown in Figure 4.19. From the plot, it can be seen that the FHSS, FHSS-FMW, and FHSS-FMT algorithms provide the most resilience to pixel quantization, particularly at extremely coarse quantization levels. However, the performance of the DSSS and DSSS-SM algorithms is comparable down to three bpp. The process of quantization introduces a noise-like error, referred to as quantization noise, into the watermarked image. The mean and variance of this distortion depend on the quantization step size and on whether rounding or truncation quantization is applied. With rounding quantization, used in this investigation, the noise has zero mean and a variance $\sigma_q^2$ equal to [7]:

$$\sigma_q^2 = \frac{q^2}{12} \qquad (4.32)$$

where q is the quantization step size. If the intensity levels of the watermarked image are uniformly distributed, and the quantization noise is uncorrelated with the

Figure 4.19: Bit error rate (percent) versus bits per pixel due to linear quantization.

Image      DSSS   DSSS-SM   FHSS   FHSS-FMW   FHSS-FMT   Multiresolution
BARBARA    2.66   2.93      0.47   0.00       0.00       1.07
BOAT       0.55   0.66      0.04   0.00       0.00       0.17
FROG       0.08   0.04      0.43   0.00       0.00       1.63
GOLDHILL   0.51   0.31      0.23   0.00       0.00       1.79
LENNA      0.27   0.08      0.04   0.00       0.00       0.35
MANDRILL   5.16   9.02      0.51   0.04       0.00       2.69
MONARCH    7.66   10.94     4.65   0.35       0.00       1.27
MOUNTAIN   0.27   0.27      1.80   0.00       0.00       1.46
PEPPERS    0.31   0.08      0.00   0.00       0.00       0.82
ZELDA      0.04   0.04      0.00   0.00       0.00       0.67
Average    1.75   2.44      0.82   0.04       0.00       1.19

Table 4.4: Bit error rate due to histogram equalization (in percent).

watermarked image, then it may be possible to predict the results of quantization using the additive noise data from Section 4.4.4.6. For a representation of 3 bits per pixel, or $2^3 = 8$ intensity levels, the quantization step size would be $256/8 = 32$ intensity levels, corresponding to quantization noise with variance $\sigma_q^2 \approx 85$. The PSNR from this noise is approximately 29 decibels. However, the bit error rates due to additive noise at this level, shown in Figure 4.17, do not correspond to those from quantization to 3 bpp. This discrepancy arises because the host images do not have a uniform distribution of intensity levels.
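The requantization operation and the Eq. (4.32) prediction can be checked with a short sketch (function names are illustrative, not from the thesis):

```python
import numpy as np

def requantize(img, k_bits):
    """Linearly requantize an 8-bit image to k_bits per pixel with rounding."""
    q = 256.0 / 2 ** k_bits            # quantization step size
    return np.clip(np.round(img / q) * q, 0.0, 255.0)

def predicted_quant_noise(k_bits):
    """Eq. (4.32): variance q^2/12 and the corresponding PSNR in decibels."""
    q = 256.0 / 2 ** k_bits
    var = q ** 2 / 12.0
    return var, 10.0 * np.log10(255.0 ** 2 / var)
```

For K = 3 bpp this gives a variance near 85 and a PSNR near 29 dB, matching the figures quoted above; as the text notes, the prediction only holds when the intensity histogram is roughly uniform.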

4.4.4.8 Histogram Equalization


The goal of this experiment was to determine the performance of each algorithm under histogram equalization, a common image enhancement technique [13]. For each algorithm the encoder was run on each image, followed by histogram equalization, and then decoding to extract the bits. A block size of 16 × 16 pixels was used. The results of this investigation are shown in Table 4.4. From the table, it is clear that the FHSS, FHSS-FMW, and FHSS-FMT algorithms are more robust to histogram equalization than the other techniques studied.

However, the DSSS and DSSS-SM algorithms do not have a large bit error rate, at roughly two percent. In this process, the dynamic range of image pixel values is increased so that the histogram distribution occupies all possible values (0 - 255 for an 8-bit image, for example), making the probability distribution roughly uniform. However, the spatial- and transform-domain relationships between adjacent pixels and frequency coefficients are preserved during the process, so it is less likely that the watermark data will be disrupted.
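A minimal histogram-equalization sketch, assuming the standard CDF-mapping form (the thesis cites [13] but does not print the algorithm):

```python
import numpy as np

def hist_equalize(img):
    """Map 8-bit intensities through the normalised cumulative histogram.

    The output occupies the full 0-255 range, making the intensity
    distribution roughly uniform while preserving pixel ordering.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(float)
    cdf_min = cdf[hist > 0][0]                 # first occupied bin
    lut = np.round(255.0 * (cdf - cdf_min) / (cdf[-1] - cdf_min)).astype(np.uint8)
    return lut[img]
```

Because the mapping is monotone, the ordering of neighbouring pixels survives, which is consistent with the small bit error rates reported in Table 4.4.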

4.4.4.9 Lossy Compression


The aim of this experiment was to determine the robustness of each algorithm to lossy compression using the Joint Photographic Experts Group (JPEG) standard. As described in [29], JPEG is a transform-based codec utilizing the 2D-DCT on 8 × 8 blocks of the image. 2D-DCT coefficients are quantized using a constant quantization matrix, and recoded using a combination of Huffman coding and run-length encoding. This is an important experiment, because the JPEG algorithm is widely used to compress images for distribution across the Internet, as well as in consumer devices such as digital cameras. As before, for each algorithm the encoder was run on each host image, followed by JPEG compression and decompression, and then decoding to extract the bits. JPEG quality settings from 10 to 90 percent were used. A plot of the bit error rates as a function of quality percentage is shown in Figure 4.20. The JPEG compression algorithm introduces error into watermarked images through quantization of 2D-DCT coefficients. The 8 × 8 quantization matrix is not uniform, but varies with frequency and the JPEG quality factor. As the quality decreases, the quantization matrix becomes coarser. From the plot, it is clear that the spatial-domain algorithms, DSSS and DSSS-SM, perform poorly. This is a result of quantization being performed in the 2D-DCT domain. Distortion of even a single 2D-DCT coefficient affects all of the spatial-domain pixels in the image block. In contrast, all of the transform-domain algorithms perform better because distortion

Figure 4.20: Bit error rate due to JPEG compression, as a function of compression quality.

of watermark data from coefficient quantization is limited to individual frequencies. A watermark bit is spread throughout the set of transform-domain coefficients, so it is less likely that quantization of a single coefficient will affect more than a portion of the watermark bit.
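The distortion mechanism described above, coarse quantization of 2D-DCT coefficients in 8 × 8 blocks, can be simulated without a full JPEG codec. The flat quantization matrix below is a simplifying assumption; real JPEG matrices vary with frequency and quality:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, so C @ x @ C.T is the 2D-DCT."""
    j = np.arange(n)
    C = np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2.0 * n))
    C[0, :] *= np.sqrt(1.0 / n)
    C[1:, :] *= np.sqrt(2.0 / n)
    return C

def jpeg_like_block(block, qmat):
    """Quantize the 2D-DCT of one 8x8 block with step matrix qmat, then invert."""
    C = dct_matrix(block.shape[0])
    coefs = C @ block @ C.T
    coefs = np.round(coefs / qmat) * qmat      # the lossy step
    return C.T @ coefs @ C
```

A coarser `qmat` (lower JPEG quality) produces a larger reconstruction error, which is the effect driving the bit error rates in Figure 4.20.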

4.5 Summary
The goal of this chapter was to review six image watermarking algorithms from the literature: direct sequence spread spectrum (DSSS) and DSSS with spatial masking analysis (DSSS-SM), frequency hopped spread spectrum (FHSS) and FHSS with two different frequency domain masking analysis improvements (FHSS-FMW and FHSS-FMT), and multiresolution embedding. The FHSS-FMW algorithm is the result of replacing the frequency masking analysis of Tewfik et al with a simpler frequency domain masking analysis process introduced by Watson. The multiresolution embedding algorithm described is an adapted version from the literature, modified so that an arbitrary amount of watermark data may be embedded within an image. Another goal of this chapter was to evaluate the algorithms using the performance evaluation framework introduced in Chapter 1. From these results it is clear that the multiresolution embedding algorithm, offering average computational complexity and the best perceptual quality, performed poorly under simple signal processing operations. Better performance was observed from the DSSS and DSSS-SM algorithms, but it should be noted that the added complexity of spatial masking analysis does not significantly improve the performance of the spread spectrum technique. Overall, the best performance was seen with the FHSS algorithm and its frequency masking variants, FHSS-FMW and FHSS-FMT. In all cases, the FHSS-FMT approach offered the most resilience to signal processing operations.


Chapter 5 Video Watermarking


Increasingly, video signals are being captured, edited, and distributed in digital form. For example, movies are now readily available for purchase or rental in Digital Versatile Disc (DVD) format, offering crisp images and CD-quality sound. In the United States, high definition television (HDTV) broadcasts have begun in many metropolitan areas, with all markets scheduled to be serviced by 2003 [50, 51]. As the bandwidth of network and cable channels into the home increases, streaming video from the Internet and other sources is also becoming available [52]. With all these sources of digital video, it is also apparent that piracy and other copyright violations are becoming rampant. For example, the encryption algorithm used to secure DVD content was recently broken by European researchers [53]. Clearly there is a need to provide additional means of identifying and protecting the rights of content creators. Compared to image watermarking, relatively few algorithms have been proposed specifically for embedding data within digital video signals. Many papers describing image watermarking algorithms claim that the techniques may be easily extended to video, but few details are provided. Of the algorithms specific to video, there are two approaches: compressed-domain and uncompressed-domain. Since video signals are often distributed in compressed form, such as MPEG, compressed-domain algorithms work by embedding and extracting watermark data to and from the video signal in its compressed form. This eliminates the need to decompress the signal

before extracting the watermark. Some of these approaches work by slightly adjusting the variable length codes of DCT block coefficients in "I" frames [54]. Another approach works by modifying the block motion vectors used to construct the B and P frames [45]. Uncompressed-domain algorithms work by embedding and extracting watermark data before and after any compression algorithms are applied, respectively. These approaches are more interesting because they permit study of how well a watermark survives compression. In this chapter the focus will be on uncompressed-domain algorithms. In the previous two chapters, a selection of digital audio and image watermarking algorithms were implemented and compared. The evaluation is extended in this chapter to the study of seven digital video watermarking algorithms. Another goal of this chapter is to apply the performance analysis framework proposed in Chapter 1 as a means of comparing the algorithms. The algorithms evaluated in this chapter were selected to represent the three different approaches to embedding data: spatial domain, frequency domain, and spatial / frequency (multiresolution). They were also chosen to represent a range of computational complexities and implementation structures. Since the focus of this thesis is on public watermarking algorithms, the techniques chosen do not require access to the original signal in order to extract the watermark data. The chapter is organized as follows. Sections 5.2 - 5.3 provide a description of the video watermarking algorithms, including the theory behind them, encoder and decoder structures, and implementation details. This is followed in Section 5.4 by a performance evaluation of the algorithms with respect to bit rate, perceptual quality, computational complexity, and robustness to signal processing operations. Finally, a review of the chapter's findings is provided in Section 5.5.

Figure 5.1: Example of an image sequence divided into blocks in the spatial domain, as well as blocks temporally. Each three-dimensional M × M × M block will be used to embed one bit of data.

5.1 Conventions
Similar conventions to those used in the previous two chapters will be used in this investigation of video algorithms. First of all, it is assumed that $x(n_1, n_2, n_3)$ represents a digital video signal of size $N_1 \times N_2$ pixels spatially, and $N_3$ frames temporally. Each frame of this signal is divided into a set of $M_1 \times M_2$ blocks of size $M \times M$ pixels, where $M_1 = \lfloor N_1 / M \rfloor$ and $M_2 = \lfloor N_2 / M \rfloor$, as shown in Figure 5.1. As explained in Chapters 3 and 4, a division of the host signal into blocks is a convenient way of embedding a variable amount of watermark bits. The sequence is further divided into a set of $M_3$ blocks of $M$ frames temporally, where $M_3 = \lfloor N_3 / M \rfloor$. $\tilde{x}(n_1, n_2, n_3)$ represents the watermarked video signal, while $x_{m_1 m_2 m_3}(n_1, n_2, n_3)$ and $\tilde{x}_{m_1 m_2 m_3}(n_1, n_2, n_3)$ indicate the $\langle m_1, m_2, m_3 \rangle$ block in the original and watermarked signals, respectively, for $0 \le m_1 \le M_1 - 1$, $0 \le m_2 \le M_2 - 1$, and $0 \le m_3 \le M_3 - 1$. Finally, it is assumed that one bit is embedded in each block, and this sequence of $M_1 M_2 M_3$ bits is denoted by $w(m_1, m_2, m_3) \in \{-1, +1\}$, for $0 \le m_1 \le M_1 - 1$, $0 \le m_2 \le M_2 - 1$, and $0 \le m_3 \le M_3 - 1$. A bit extracted from the watermarked signal is denoted by $\tilde{w}(m_1, m_2, m_3)$.
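Under these conventions, the block partition and the resulting watermark capacity can be sketched as follows (array layout and function names are assumptions of this sketch):

```python
import numpy as np

def capacity_bits(N1, N2, N3, M):
    """One bit per M x M x M block: floor(N1/M) * floor(N2/M) * floor(N3/M)."""
    return (N1 // M) * (N2 // M) * (N3 // M)

def block(x, m1, m2, m3, M):
    """The <m1, m2, m3>-th block of a video array x with shape (N1, N2, N3)."""
    return x[m1 * M:(m1 + 1) * M,
             m2 * M:(m2 + 1) * M,
             m3 * M:(m3 + 1) * M]
```

For the 256 × 256 × 64 test sequences used later in this chapter with M = 16, this gives a capacity of 16 · 16 · 4 = 1024 bits.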

5.2 Frame-By-Frame Watermarking


The extension of image watermarking algorithms to embed data within digital video may seem straightforward, since raw video is nothing more than a sequence of images. However, few authors of image watermarking techniques have investigated effective ways of adapting their techniques. Video signals possess several unique properties, which will be discussed further in Section 5.4. One key feature is the high level of redundancy between frames. Sequential frames are highly similar, so they are vulnerable to processing and malicious attacks, such as lossy compression and frame averaging, which do not severely disrupt the video quality. Therefore, it would not make sense to blindly embed data strictly within individual frames. That is why the video signal is divided into blocks both spatially and temporally. The image watermarking algorithms may be used to embed a single watermark bit within the same spatial block, but repeatedly over several temporal frames. As a result, the watermarked video may be more robust against compression and attacks. As described in Section 2.2.5, temporal masking effects of the Human Visual System are not terribly useful for watermarking, since digital video is usually not sampled at rates higher than the flicker frequency, and temporal masking is limited to scene changes, which are relatively sporadic. However, the visual masking effects introduced in Chapter 2 still occur within individual frames: frequency sensitivity and masking, luminance and spatial masking. So in addition to embedding watermark data within spatial blocks spread across multiple frames, perceptual analysis may also be performed on spatial blocks to weight the watermark signal according to the masking characteristics of the video signal. The six digital image watermarking algorithms implemented in Chapter 4 will

be adapted for embedding data into digital video signals: direct sequence spread spectrum (DSSS) and DSSS with spatial domain masking analysis (DSSS-SM), frequency hopped spread spectrum (FHSS) and FHSS with frequency domain masking analysis (FHSS-FMW and FHSS-FMT), and multiresolution embedding. In the following sections, extensions of the six algorithms will be described. It is important to note that no other researchers have analyzed the use of image watermarking algorithms in this manner. All of the techniques from the previous chapter employ a spreading signal, a two-dimensional pseudonoise (PN) sequence, to distribute the energy of the watermark data throughout the spectrum of the host image. For blocks of digital video, a three-dimensional PN sequence will be constructed of the form $p(n_1, n_2, n_3) \in \{-1, +1\}$, with the same dimensions as the block to be watermarked. This PN sequence will be used to spread the watermark data throughout the spectrum of the video block. Likewise, the embedded bit will be extracted using the correlation of the watermarked block, after possible prefiltering, with the original PN sequence. Recall that the properties and advantages of PN sequences were discussed in Section 3.4, and the value of using a correlation receiver was explained in Section 3.4.2.
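A minimal sketch of the spreading and correlation-receiver steps: the additive weight `alpha` is illustrative, and subtracting the block mean stands in for the prefiltering discussed in Section 3.4.2.

```python
import numpy as np

def pn_sequence(shape, seed):
    """Bipolar 3D pseudonoise sequence p(n1, n2, n3) in {-1, +1}."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.array([-1.0, 1.0]), size=shape)

def embed(block, bit, p, alpha=3.0):
    """Spread one bit (+1 or -1) over the whole block with weight alpha."""
    return block + alpha * bit * p

def detect(block, p):
    """Correlation receiver: the sign of <block - mean, p> recovers the bit."""
    corr = np.sum((block - block.mean()) * p)
    return 1 if corr >= 0 else -1
```

With real imagery the host itself contributes correlation noise, which is why the chapter relies on prefiltering before the correlator.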

5.2.1 Direct Sequence Spread Spectrum (DSSS)


The spread watermark data within the block is weighted with the same weight used for image watermarking: 1 - 2 percent of the dynamic range of pixel values. For an 8-bit grayscale video signal, this corresponds to a maximum distortion of 2 - 5 intensity levels. A two-dimensional highpass filter was proposed in the previous chapter for prefiltering the watermarked image prior to decoding. The filter is used again, but only on a frame-by-frame basis.

5.2.2 Spatial Masking Analysis: DSSS-SM


The spatial domain masking analysis is performed on each entire frame prior to encoding, and then divided into spatial blocks to correspond with the size of the host video block. For each block, the minimum masking level is selected as the single weighting value for the block, and is used to weight the spread watermark data added to the image block. The two-dimensional highpass prefilter is employed at the decoder, also on a frame-by-frame basis.

5.2.3 Frequency Hopped Spread Spectrum (FHSS)


The forward and inverse 2D-DCT is computed on spatial portions of the block to be watermarked and decoded. To embed data within images, the watermark weight was set to 1 - 2 percent of the DC coefficient magnitude. In this version, the minimum DC coefficient of the set of M spatial blocks is selected, and the watermark weight set to 1 - 2 percent of its magnitude. Autoregressive modeling was proposed in the previous chapter for whitening the watermarked image prior to decoding. The whitening filter is used again, but only on a frame-by-frame basis.

5.2.4 Frequency Domain Masking Analysis: FHSS-FMW and FHSS-FMT


The forward and inverse 2D-DCT is again computed on spatial portions of the video block to be watermarked. In addition, the frequency domain masking analysis techniques of Watson and Tewfik et al are performed on each frame within the block, resulting in a 3D set of quantization levels for the block. These levels are used to quantize the set of 2D-DCT coefficients, which are then perturbed with the spread watermark data, as described in Section 4.2.4.3.

5.2.5 Multiresolution Embedding


The forward and inverse 2D-DWT is computed on each entire frame prior to encoding, and then the composite image and quantization matrix are constructed for each frame. The set of composite images is divided into 3D blocks using M × M spatial blocks and M frames. The quantization levels are used to quantize the wavelet coefficients of the composite image, which are then perturbed with the spread watermark data, as described in Section 4.3.

5.2.6 Discussion
The frame-by-frame video watermarking algorithms possess a number of the same advantages and disadvantages as their constituent image watermarking techniques. First of all, they are simple to implement, as the only real difference is in the embedding of spread watermark data into blocks of video rather than within spatial blocks on individual frames. Individual frames may be processed in parallel for computing the 2D-DCT and 2D-DWT transformations, and for computing spatial or frequency domain masking analysis. However, each of the frame-by-frame video watermarking algorithms is subject to synchronization problems at the receiver. As described in Section 3.4.4.3, a watermarked block and the PN sequence used to spread the watermark data must be perfectly registered in order for the correlator to work properly.

5.3 Temporal Multiresolution Watermarking


A novel approach to video watermarking was introduced by Tewfik et al that employs a multilevel discrete wavelet transform (DWT) decomposition on the video signal, but only along the temporal axis [55]. The result, computed for a video signal of length $N_3$ frames, is a set of $N_3$ "images" of wavelet coefficients representing the video signal at varying temporal scales. The DC frame, at the deepest level of the multilevel DWT, corresponds to components that do not change throughout the frames, such as

a static background. Subsequent frames correspond to components that change with increasing temporal frequency. The value of embedding watermark data into the temporal wavelet domain should be obvious: after computing the inverse transform of the wavelet frames, the watermark will exist throughout the video signal, and at various temporal scales. Watermark data embedded into the DC frame, for example, would exist within every frame of the video signal. As a result, it is likely that the embedded watermark will be more resilient to compression and other signal processing operations. In the following sections, an implementation of the temporal multiresolution watermarking scheme of Tewfik et al will be described in more detail.

5.3.1 Encoder and Decoder Structures


It is again assumed that the watermark bit to be embedded within a video block will be spread by using a 3D pseudonoise (PN) sequence of the form $p(n_1, n_2, n_3) \in \{-1, +1\}$. The quantization method of embedding watermark data, from Equation 1.3, will be used in this implementation because it provides no loss of data in a distortion-free environment.

1. Compute a multilevel discrete wavelet transform (DWT) of the entire video signal along its temporal axis, to a depth of $\lfloor \log_2 N_3 \rfloor$ levels. The result is a set of $N_3$ wavelet frames:

$$X(n_1, n_2, k_3) = \mathrm{DWT}[x(n_1, n_2, n_3)] \qquad (5.1)$$

where $0 \le k_3 \le N_3 - 1$ indexes the temporal wavelet frames. The wavelet basis functions used to compute the forward and inverse DWT will be discussed in Section 5.3.2.

2. Divide the set of wavelet frames into a set of $M_1 \times M_2 \times M_3$ video blocks, denoted $X_{m_1 m_2 m_3}(n_1, n_2, k_3)$, as described in Section 5.1. For each video block, construct a 3D pseudonoise sequence with which to spread the watermark bit, $w(m_1, m_2, m_3)$, to be embedded in the block.

3. Quantize the coefficients within each block using a constant quantization level $\Delta$, and then perturb them by a quarter of the quantization level to embed the spread watermark data:

$$\tilde{X}_{m_1 m_2 m_3}(n_1, n_2, k_3) = \Delta \left( \mathrm{round}\!\left[ \frac{X_{m_1 m_2 m_3}(n_1, n_2, k_3)}{\Delta} \right] + \frac{1}{4}\, w(m_1, m_2, m_3)\, p(n_1, n_2, n_3) \right) \qquad (5.2)$$

Appropriate values for $\Delta$ will be discussed in Section 5.3.3.

4. Compute the inverse DWT on the watermarked wavelet frames:

$$\tilde{x}(n_1, n_2, n_3) = \mathrm{IDWT}[\tilde{X}(n_1, n_2, k_3)] \qquad (5.3)$$

The above procedure assumes that the original and watermarked video signals are real-valued, but in practice they will require a discrete representation (such as 8 bits per pixel, for example). Computation of the temporal DWT and application of the watermark data will be done using real-valued signals, but the watermarked signal will be rounded so that it fits within the same format as the host signal. Figure 5.2 shows the result of computing a single-level temporal DWT on a video signal of four frames. The outputs are two lowpass frames and two highpass frames, representing static and dynamic temporal components in the video signal.
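The four encoder steps can be sketched end to end. A single-level Haar transform stands in here for the multilevel 9-7 biorthogonal DWT (an assumption made for brevity), and the quantize-and-perturb rule follows step 3:

```python
import numpy as np

def haar_temporal(x):
    """Single-level temporal DWT (Haar stand-in for the 9-7 filters)."""
    lo = (x[:, :, 0::2] + x[:, :, 1::2]) / np.sqrt(2.0)
    hi = (x[:, :, 0::2] - x[:, :, 1::2]) / np.sqrt(2.0)
    return np.concatenate([lo, hi], axis=2)

def haar_temporal_inv(X):
    """Perfect-reconstruction inverse of haar_temporal."""
    n3 = X.shape[2] // 2
    lo, hi = X[:, :, :n3], X[:, :, n3:]
    x = np.empty_like(X)
    x[:, :, 0::2] = (lo + hi) / np.sqrt(2.0)
    x[:, :, 1::2] = (lo - hi) / np.sqrt(2.0)
    return x

def embed_bit(X, w, p, delta=8.0):
    """Quantize coefficients to step delta, then perturb by delta/4 (step 3)."""
    return delta * (np.round(X / delta) + 0.25 * w * p)

def extract_bit(Xw, p, delta=8.0):
    """Correlate the quantization residual with the PN sequence."""
    resid = Xw - delta * np.round(Xw / delta)
    return 1 if np.sum(resid * p) >= 0 else -1
```

In a distortion-free environment the residual is exactly $\pm\Delta/4$ times the PN sequence, so the bit is recovered without error, which is the property claimed for the quantization embedding method.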

5.3.2 Selection of Wavelet Basis Functions


In their paper, Tewfik et al do not specify the wavelet basis functions used to construct the lowpass and highpass decomposition / reconstruction filters for computing the temporal DWT. However, certain conditions are obviously required, such as using a wavelet filter bank with perfect reconstruction properties. The 9-7 biorthogonal wavelets from [12], used in Section 4.3 for image watermarking, were also used to compute the temporal DWT in this investigation. There are many other basis functions that can be used to construct similar filter banks, but this area was not explored [11]. As mentioned in Section 6.2, this is a possible area of further research.

Figure 5.2: Example of computing the temporal DWT on a video signal four frames in length, producing lowpass and highpass frames.

5.3.3 Selection of Quantization Levels


In their paper, Tewfik et al propose using perceptual analysis on the wavelet coefficient frames to determine the watermark strength. In particular, they used their frequency domain masking analysis procedure, introduced in Section 2.2.6.2, on each coefficient frame to determine maximum quantization levels to use for embedding watermark data. That approach was not used in this implementation for two reasons. First of all, the DC wavelet coefficient frame has values that lie within the same range as the original image frames (because it represents static image components), but the remaining coefficient frames vary greatly in their range. It is not clear that a frequency domain masking analysis performed on those frames would yield an imperceptible quantization matrix. Another reason is that the algorithm becomes expensive, having to compute both the temporal DWT as well as the 2D-DCT for each frame. In Section 4.2.4, it was revealed that a watermark magnitude equal to 1 - 2 percent of the image pixel range, or 2 - 5 intensity levels for an 8-bit image, was appropriate. For this study, a constant quantization level of $\Delta = 8$ was used for every wavelet coefficient frame, corresponding to a maximum watermark strength of $\Delta/2 = 4$ levels within each coefficient frame.

5.3.4 Discussion
As mentioned earlier, the temporal multiresolution watermarking algorithm possesses a number of attractive features. First of all, watermark data is spread throughout the video signal, and at various levels of temporal support. As a result, it is predicted that the algorithm is more robust to signal processing operations than the simpler frame-by-frame techniques introduced in Section 5.2. Limiting the DWT computation to the temporal axis means that the operation may be performed in parallel for each pixel. However, having to compute a multilevel DWT on the temporal axis for every pixel is still computationally expensive, particularly for long video signals. Tewfik et al suggest dividing the video signal into scenes, both to limit the number of frames required for each DWT computation, and to produce a DC wavelet frame containing static components from a single scene. Finally, it is not likely that this algorithm can be used to embed watermark data in real time, given the computational cost and the need to wait for a scene change before computing the DWT.

5.4 Performance Evaluation


The purpose of this section is to compare the seven video watermarking algorithms described in the previous sections. The performance evaluation framework, presented in Chapter 1 and employed in the previous chapters on audio and image watermarking, is used again to compare the algorithms on the basis of bit rate, perceptual quality, computational complexity, and robustness to signal processing. In this investigation, a set of six grayscale digital video signals was selected for watermarking, as shown in Figure 5.3. Each sequence is 256 × 256 pixels in size, sampled at 30 frames per second,

64 frames long, and represented by 8 bits per pixel. The watermarking algorithms were implemented in MATLAB under Linux on an Intel Pentium PC running at 166 MHz. Each algorithm was implemented using the parameters specified in the details presented earlier. In general, each of the video signals was watermarked using the algorithms 100 times, and the results averaged for each algorithm. In all cases, a different and random watermark signal was generated for each run. This was done in order to remove any dependency of extraction on the watermark data itself.

5.4.1 Effect of Block Size


This experiment was designed to determine the bit error rate of each video watermarking algorithm as a function of M, the cubic block size in pixels. For each algorithm, the encoder and decoder were run on each sequence, and the bit error rate (BER) determined from the extracted bits. This process was performed for block sizes between 2 and 64 samples (the length of the video signals). The results of this experiment are shown in Figure 5.4. Note that this experiment and the results are very similar to those from image watermarking, except that the video blocks contain more pixels than the blocks in images. Many of the algorithms have an error rate of zero for all block sizes, given that they use the quantization embedding procedure. All of the techniques have an error rate of less than one percent with a block size of M = 16 pixels cubed, so that size will be used for the remaining sections of the evaluation.

5.4.2 Perceptual Quality


This experiment was designed to determine the amount of distortion each watermarking algorithm introduced into the host video signal, by using the peak signal-to-noise ratio (PSNR) of the watermarked signal versus the original. Each of the six sequences was watermarked with the seven algorithms, and the PSNR computed. A block size of M × M × M = 16 × 16 × 16 pixels was used, with the results shown in Table 5.1. Not surprisingly, the frame-by-frame algorithms introduce a level of distortion similar

Figure 5.3: Sample sequences used in the performance evaluation of video watermarking algorithms: (a) AMERICA, (b) FOOTBALL, (c) SALESMAN, (d) TENNIS, (e) TREVOR, (f) WESTERN.

Figure 5.4: Bit error rate versus block size for video watermarking algorithms.

Video Signal  DSSS   DSSS-SM  FHSS   FHSS-FMW  FHSS-FMT  Multiresolution  Temporal MR
AMERICA       38.12  37.41    32.85  32.74     27.69     39.60            30.81
FOOTBALL      37.98  40.27    31.36  29.34     26.86     42.69            29.34
SALESMAN      38.03  37.86    37.47  31.92     28.25     40.82            31.59
TENNIS        38.02  38.11    32.48  30.83     27.59     38.71            28.33
TREVOR        37.99  39.07    33.84  31.54     25.44     45.67            32.43
WESTERN       38.07  40.14    30.37  33.35     26.57     42.29            34.25
Average       38.04  38.81    35.17  31.62     27.07     41.63            31.13

Table 5.1: PSNR of watermarked video signals versus original sequences (in decibels).

Video Signal  DSSS   DSSS-SM  FHSS    FHSS-FMW  FHSS-FMT  Multiresolution  Temporal MR
AMERICA       70.37  125.83   220.00  505.40    1128.16   154.81           1748.47
FOOTBALL      71.71  126.74   221.86  505.59    1126.29   153.80           1741.36
SALESMAN      69.97  124.43   220.54  506.72    1127.44   156.99           1734.92
TENNIS        72.74  125.98   223.02  504.93    1127.54   155.48           1740.87
TREVOR        70.42  126.37   220.79  505.90    1126.27   156.33           1735.48
WESTERN       70.67  123.53   222.13  506.08    1126.51   156.23           1734.39
Average       70.91  125.48   221.39  505.77    1127.37   155.61           1739.25

Table 5.2: Video watermarking algorithm CPU timings (in seconds).

to that found for static images in Section 4.4.2. The temporal multiresolution technique produced an average PSNR of 31 dB, which lies within the range of distortions introduced by the frame-by-frame algorithms.
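The quality metric reported in Table 5.1 can be computed with the standard PSNR definition (the formula itself is not printed in this section):

```python
import numpy as np

def psnr(original, watermarked, peak=255.0):
    """Peak signal-to-noise ratio in decibels between two sequences."""
    mse = np.mean((original.astype(float) - watermarked.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform distortion of 2 intensity levels on an 8-bit sequence gives a PSNR of about 42 dB, near the figures reported for the frame-by-frame algorithms.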

5.4.3 Computational Complexity


This experiment was designed to determine the computational cost of each watermarking algorithm in relation to the others. For each technique, CPU timings were extracted for the encoder and decoder run on each video signal; the results are shown in Table 5.2. From the table, it is immediately obvious that most of the frame-by-frame algorithms are considerably less expensive than the temporal multiresolution algorithm, which likely results from the need to compute the multilevel temporal DWT along every pixel in the video signal. The exception is FHSS-FMT, which is not surprising given the cost of computing the frequency domain masking analysis described in the previous chapter. In addition, the CPU timing results for the frame-by-frame techniques are predictable from the results of Section 4.4.3, since the image watermarking algorithms were essentially performed repeatedly on a sequence of smaller images.

It is interesting to note that the per-frame computational complexities of the frame-by-frame and temporal multiresolution algorithms do not increase for longer video signals. For the frame-by-frame techniques, the cost to watermark each frame depends upon the spatial dimensions N1 × N2, so processing five frames takes approximately five times longer than a single frame. The temporal multiresolution algorithm uses a multilevel DWT of depth ⌊log2(N3)⌋, where N3 is the number of frames, which would suggest logarithmically increasing complexity for larger N3. However, at each level of decomposition the length of the video signal is halved by downsampling. Therefore, the total number of operations per pixel is bounded by

$$\sum_{i=0}^{\lfloor \log_2 N_3 \rfloor} \frac{N_3}{2^i} < 2N_3 \qquad (5.4)$$

which corresponds to a bounded per-frame computational cost.
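The bound in (5.4) is a finite geometric series and is easy to verify numerically. The following sketch is illustrative only (the function name is mine, not the thesis's); it sums the per-pixel work over the DWT levels and checks it against the bound:

```python
import math

def dwt_ops_per_pixel(n3):
    """Total per-pixel work for a multilevel temporal DWT of depth
    floor(log2(n3)): at level i the sequence length has halved i times."""
    depth = int(math.floor(math.log2(n3)))
    return sum(n3 / 2 ** i for i in range(depth + 1))

# The geometric series always stays below the bound 2 * N3 from (5.4):
for n3 in (16, 64, 300, 1024):
    assert dwt_ops_per_pixel(n3) < 2 * n3
```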

5.4.4 Robustness to Signal Processing


It was mentioned earlier in this chapter that video signals possess several unique properties. High quality digital video is usually sampled at rates of 15 frames per second (fps) and higher [56]. This obviously leads to a very large number of pixels to be processed every second. A grayscale video signal in Common Intermediate Format (CIF), for example, has 352 × 288 pixels and is sampled at 30 fps, corresponding to a massive 3 million pixels per second. Intuitively, this high pixel volume should greatly increase the number of watermark bits that may be embedded within digital video.

However, video signals possess a high level of temporal redundancy between frames, in that differences between adjacent frames are usually minor and spatially limited. For example, the SALESMAN and TREVOR videos shown in Figure 5.3 have static backgrounds that change little throughout the sequence, and differences between frames are localized around the central portions. In addition to the additive noise, histogram equalization, and other image processing operations explored in the previous chapter, temporal redundancy leads to several unique signal processing operations that can be applied to digital video without a significant loss of signal quality: frame averaging, frame reordering, frame downsampling, and lossy compression. The

performance of digital video watermarking algorithms under these distortions will be examined further in the following sections.

5.4.4.1 Frame Averaging


In this operation, pixels of each watermarked frame are replaced with the average of those from the K previous frames:

$$\tilde{x}(n_1, n_2, n_3) = \frac{1}{K} \sum_{i=0}^{K-1} x(n_1, n_2, n_3 - i) \qquad (5.5)$$

where $\tilde{x}$ denotes the approximated frame. This is essentially mean filtering, a form of lowpass filtering considered in the study of the audio and image watermarking algorithms; in this case, however, it is performed along the temporal axis. For small K (less than 3), frame averaging does not have a significant impact on the quality of the test video signals, but the lowpass filtering effect increases with K. In this experiment, K was varied over 1 ≤ K ≤ 16 frames, and the results for each algorithm are shown in Figure 5.5.

From the plot, it is clear that the temporal multiresolution algorithm provided better resilience to the averaging filter than the frame-by-frame algorithms. One likely reason is that most of the test video signals did not contain a great deal of high frequency temporal components, the exception being the FOOTBALL sequence, and an averaging filter removes exactly such components. Of the frame-by-frame approaches, the frequency hopped spread spectrum algorithms with frequency domain masking analysis, FHSS-FMW and FHSS-FMT, performed better than the others. This result is not unexpected given the results of the image watermarking evaluation presented in the previous chapter.
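The temporal mean filter of equation (5.5) can be sketched as follows. This is an illustrative implementation under my own assumptions (a NumPy array of shape frames × rows × cols; early frames simply average over however many predecessors exist), not code from the thesis:

```python
import numpy as np

def frame_average(video, k):
    """Replace each frame with the mean of itself and the k-1 previous
    frames, per equation (5.5)."""
    video = np.asarray(video, dtype=np.float64)
    out = np.empty_like(video)
    for n3 in range(video.shape[0]):
        start = max(0, n3 - k + 1)          # clip the window at the first frame
        out[n3] = video[start:n3 + 1].mean(axis=0)
    return out
```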

5.4.4.2 Frame Reordering


Since there is often a high level of temporal redundancy between frames, it is likely that two highly "similar" frames may be safely interchanged without a noticeable difference to the viewer. This is particularly true for video sampled at higher frame rates, and for sequences with slowly moving components. It was noted in Section 3.4


Figure 5.5: Bit error rate versus frame averaging for video watermarking algorithms. (Plot of BER in percent against filter size in frames for the DSSS, DSSS-SM, FHSS, FHSS-FMW, FHSS-FMT, and temporal multiresolution algorithms.)

that spread spectrum algorithms require synchronization of the watermarked signal with the PN sequence used to spread watermark data. Obviously, frame reordering will affect the performance of these algorithms, since a portion of the 3D blocks will no longer be synchronized if a frame is replaced with another. In this experiment, similarity between two frames x(n1, n2, i) and x(n1, n2, j) was measured using the normalized mean squared error (MSE) between them:

$$\mathrm{MSE} = \frac{1}{N_1 N_2} \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \left[ x(n_1, n_2, i) - x(n_1, n_2, j) \right]^2 \qquad (5.6)$$
where 0 ≤ MSE ≤ 1, and it is assumed that the two frames are normalized to the interval [0, 1]. Each pair of frames in the sample video signals was examined using this similarity measure, and if the MSE was less than T, a variable threshold value, then the pair was placed into a pool of possible frame reordering candidates. After the set of candidate frame pairs was constructed, a random number of the frame pairs were selected for interchanging. This process was repeated for a range of threshold values, and a plot of the bit error rate is shown in Figure 5.6.

From the plot, it is clear that the performance of each algorithm degrades quickly with a decrease in the threshold value, because more frame pairs enter the pool of candidates for interchanging. However, the temporal multiresolution technique performs slightly better than the frame-by-frame algorithms at low threshold values. This likely results from the fact that reordering takes place in the time / spatial domain, while watermark embedding and extraction are performed in the temporal DWT domain. In contrast, the performances of the frame-by-frame algorithms are roughly similar to each other, because frame reordering disrupts the synchronization of the watermarked signal with the PN sequence in a similar manner for each algorithm. The FHSS-FMW and FHSS-FMT techniques appear to offer slightly more resilience to reordering, but the distortion these algorithms introduce into the host signal tends to increase the MSE between frame pairs, which in turn decreases the number of frames reordered for a given threshold value.
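The candidate-selection step described above can be sketched directly from equation (5.6). This is an illustrative reconstruction under my own assumptions (frames as a NumPy array normalized to [0, 1]; the function name is mine), not the thesis's code:

```python
import numpy as np

def reorder_candidates(video, threshold):
    """Return pairs of frame indices whose normalized MSE, per (5.6),
    falls below the threshold, making them candidates for interchange."""
    video = np.asarray(video, dtype=np.float64)
    pairs = []
    n_frames = video.shape[0]
    for i in range(n_frames):
        for j in range(i + 1, n_frames):
            mse = np.mean((video[i] - video[j]) ** 2)
            if mse < threshold:
                pairs.append((i, j))
    return pairs
```

A random subset of the returned pairs would then be swapped to simulate the reordering attack.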


Figure 5.6: Bit error rate versus frame reordering for video watermarking algorithms. (Plot of BER in percent against the reordering threshold for the DSSS, DSSS-SM, FHSS, FHSS-FMW, FHSS-FMT, and temporal multiresolution algorithms.)

5.4.4.3 Frame Downsampling


For digital video signals with high frame rates (over 15 frames per second) and smoothly varying components in the scene, it is possible to downsample the video signal along the temporal axis by a factor of K. If the video is lowpass filtered prior to downsampling to remove aliasing effects, then the missing frames may be reconstructed using an interpolation filter [7]. This would obviously be useful as a compression scheme, and for low downsampling factors (2 - 3) the distortion of the video would not be too significant. In this experiment, frames from the entire watermarked video signal were downsampled by a factor of K, for 2 ≤ K ≤ 16, and then reconstructed from the previous and next frames using simple bilinear interpolation of pixels. For example, for a factor of K = 2, the ith frame was reconstructed according to

$$\tilde{x}(n_1, n_2, i) = \frac{1}{2} \left[ x(n_1, n_2, i - 1) + x(n_1, n_2, i + 1) \right] \qquad (5.7)$$

where $\tilde{x}$ denotes the reconstructed frame. A plot of the bit error rate as a function of the downsampling factor is shown in Figure 5.7. From the plot, it is clear that the performance of each algorithm degrades quickly with an increase in the downsampling factor, particularly for the frame-by-frame algorithms. This is because more frames are removed, and the spread watermark data in reconstructed frames does not correlate well with the PN sequence used to spread the data. However, the temporal multiresolution technique performs slightly better than the frame-by-frame algorithms as the downsampling factor increases.
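For the K = 2 case, the drop-and-interpolate operation of equation (5.7) can be sketched as below. This is an illustrative version under my own assumptions (NumPy array of shape frames × rows × cols; interior odd frames dropped and rebuilt; function name is mine), not the thesis's implementation:

```python
import numpy as np

def drop_and_interpolate(video):
    """Remove every second frame (K = 2) and rebuild each missing frame
    as the average of its temporal neighbours, per equation (5.7)."""
    video = np.asarray(video, dtype=np.float64)
    out = video.copy()
    for i in range(1, video.shape[0] - 1, 2):   # interior odd-indexed frames
        out[i] = 0.5 * (video[i - 1] + video[i + 1])
    return out
```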

5.4.4.4 Lossy Compression


The high data rates required to represent high quality digital video signals mean that compression algorithms are invaluable for efficiently storing and transmitting digital video. Compression takes advantage of both spatial redundancy and the high temporal redundancy between frames, which are removed during the compression process to reduce the data rate. Many video compression algorithms are available, but the current


Figure 5.7: Bit error rate versus frame downsampling for video watermarking algorithms. (Plot of BER in percent against the downsampling factor for the DSSS, DSSS-SM, FHSS, FHSS-FMW, FHSS-FMT, and temporal multiresolution algorithms.)

Moving Picture Experts Group (MPEG) standard is widely used in many popular systems such as Digital Versatile Discs (DVDs), and has been adopted for use in the high definition television (HDTV) standard. Therefore, this is an important experiment. A tutorial on the MPEG standards may be found in [56].

In this investigation, the grayscale 256 × 256 sample video sequences were sampled at 30 frames per second, corresponding to a bit rate of approximately 15.7 million bits per second (Mbps). The MPEG codec was applied to the watermarked video signals at varying compression rates, measured in bits per pixel (bpp). The MPEG coder and decoder used were the "Berkeley MPEG-1 Video Encoder" and the "Berkeley MPEG Player", respectively [57, 58]. A standard group of pictures (GOP) of 15 frames was used, with a repeating pattern of "I", "B", and "P" frames of the form "IBBPBBP...". MPEG encoding was performed at varying bit rates, up to 6 bits per pixel, by modifying the scale factors applied to the block DCT quantization matrices. The compressed video was then decoded back into its raw digital format for extracting the watermark data.

The peak signal to noise ratio (PSNR), a common measure of digital image and video quality, is obviously dependent upon the level of compression applied to the signal. Figure 5.8 shows a plot of the PSNR for each of the sample video signals as a function of the compression ratio. Although the effects of DCT coefficient quantization occur at every compression level, previous researchers have found that artifacts produced by the MPEG standard only begin to become visible at levels below 2 bpp [56]. This is a result of the high level of temporal redundancy, and to a lesser extent spatial redundancy, present within video signals. It is important to note that the PSNR above 2 bpp is above 30 dB for each of the sample signals, which corresponds with the perceptual quality results presented in Section 5.4.2.
Figure 5.9 shows a plot of the bit error rate of the extracted watermarks as a function of the compression rate in bits per pixel. From these results, it is clear that the temporal multiresolution algorithm again outperformed the frame-by-frame algorithms, particularly at rates of less than 1 bpp. The main reason for this result


Figure 5.8: PSNR versus compression ratio for sample video signals. (Plot of PSNR in decibels against bits per pixel for the AMERICA, FOOTBALL, SALESMAN, TENNIS, TREVOR, and WESTERN sequences.)


Figure 5.9: Bit error rate due to MPEG compression as a function of bit rate. (Plot of BER in percent against bits per pixel for the DSSS, DSSS-SM, FHSS, FHSS-FMW, FHSS-FMT, and temporal multiresolution algorithms.)

is that the MPEG compression algorithm works on a frame-by-frame basis to remove temporal and spatial redundancies, while the temporal multiresolution algorithm embeds watermark data throughout the temporal axis of the video signal. Therefore, it is likely that more watermark bits may be correctly extracted. Of the frame-by-frame algorithms, the FHSS-FMW and FHSS-FMT algorithms performed slightly better than the others. This is not too surprising given the results of the lossy image compression comparison presented in Section 4.4.4.9, where it was found that the FHSS algorithms performed well under JPEG compression.
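The bit error rate reported throughout these robustness experiments is simply the fraction of watermark bits that change between embedding and extraction. A minimal illustrative sketch (the function name and percent convention are my assumptions; this is not code from the thesis):

```python
import numpy as np

def bit_error_rate(sent, received):
    """Percentage of watermark bits that differ after a signal
    processing operation is applied to the watermarked signal."""
    sent = np.asarray(sent)
    received = np.asarray(received)
    return 100.0 * np.mean(sent != received)
```

A BER of 50 percent corresponds to chance-level extraction, which is why the plots above top out near that value.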

5.5 Summary
Compared to image watermarking algorithms, few techniques exist in the literature for explicitly embedding watermark data into digital video signals. However, video is nothing more than a sequence of still images, so intuitively image watermarking approaches may be easily extended into the temporal dimension. In this chapter, the six watermarking algorithms of the previous chapter were adapted for use in frame-by-frame watermarking of digital video: DSSS, DSSS-SM, FHSS, FHSS-FMW, FHSS-FMT, and the spatial multiresolution technique. In addition, an implementation was described of a novel temporal multiresolution watermarking system specifically designed for video signals.

Another aspect of this chapter was to evaluate the algorithms using the performance evaluation framework introduced in Chapter 1. It was noted that high-quality video signals possess two unique properties: a high bit rate resulting from a relatively high temporal sampling rate, and a high level of temporal redundancy. The distortion of video from watermarking using the frame-by-frame algorithms, measured in PSNR, corresponds to the results of Section 4.4.2, while the temporal multiresolution technique introduced no more distortion than the former. With respect to computational complexity, it was found that the temporal multiresolution algorithm proved twice as expensive as

the most complex frame-by-frame algorithm, but the per-frame cost of the former does not increase with the length of the video signal. Resilience to signal processing was only measured using operations unique to video: frame averaging, reordering, downsampling, and lossy (MPEG) compression. For each of these experiments, the temporal multiresolution algorithm proved far more resilient to signal processing than the frame-by-frame techniques.


Chapter 6 Conclusions
The previous three chapters have provided a review of many techniques for embedding data within digital audio, image, and video signals, along with details on implementation strategies. In addition, a performance evaluation was conducted for each class of algorithms in order to compare them with respect to a common set of criteria. In this final chapter, the primary results of this investigation are summarized, and several key areas are listed as possible avenues of further research in this field.

6.1 Summary of Results


First of all, it was found that watermarking algorithms incorporating perceptual analysis possess two significant advantages over those without:

1. Perceptual analysis can, in theory, maximize watermark strength by taking advantage of masking properties present within the host signal. It was shown for many algorithms that by increasing the watermark strength, embedded data becomes more resilient to distortion from signal processing operations.

2. Perceptual masking may ensure that the watermark is imperceptible to the end user, which is a key requirement of most watermarking systems. Using signal to noise ratios (SNR and PSNR) as a perceptibility measure does not directly

reveal this result, because the only way to prove it is to perform a formal perceptual quality study as described in Section 1.4.2.

Results from the performance evaluations help to support these conclusions. In particular, the algorithms incorporating perceptual modeling (the frequency masking audio watermarking algorithm, and the spread spectrum techniques with masking analysis: DSSS-SM, FHSS-FMW, and FHSS-FMT) had a signal to noise ratio below that of the algorithms without perceptual modeling, indicating that these techniques introduce more distortion from watermarking. In addition, in many cases these algorithms performed better under common signal processing operations, as shown in Section 3.6.4 and Section 4.4.4, most notably lowpass and Wiener filtering, additive noise, and lossy compression.

It was also discovered that many watermarking algorithms from the literature rely upon spread spectrum techniques from digital communications theory in order to securely encode and decode watermark data. This is because spread spectrum systems possess several unique properties:

1. Watermark data, when "spread" using a pseudorandom (PN) sequence, is distributed throughout the spectrum of the host image, including portions not already occupied by image components.

2. As shown in Section 3.4.2, the spread spectrum correlation receiver is highly robust to additive noise distortion, and its reliability increases with the block size and the magnitude of the spread watermark data.

3. Extra security of the watermark data is obtained from using a PN sequence, because the correlation of two different PN sequences is very low. However, perfect synchronization of the watermarked host signal and the PN sequence is required to correctly extract watermark data at the receiver.
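The spread spectrum properties above can be illustrated with a toy example of additive spreading and a correlation receiver. This sketch is my own illustration of the general principle, not the thesis's DSSS embedder; the block size, strength parameter, and Gaussian stand-in for the host signal are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # the seed acts as the shared secret key
block = 256                          # host samples used per embedded bit
alpha = 0.5                          # watermark strength (illustrative value)

pn = rng.choice([-1.0, 1.0], size=block)   # pseudorandom +/-1 chip sequence
bit = 1                                    # watermark bit, mapped to +/-1 below
host = rng.normal(0.0, 1.0, size=block)    # stand-in for host signal samples

# Additive spread watermark: the bit modulates the whole PN block.
marked = host + alpha * (2 * bit - 1) * pn

# Correlation receiver: the host term behaves like noise, while the PN
# term adds coherently (contributing alpha * block to the correlation).
decision = int(np.dot(marked, pn) > 0)
```

Because the PN contribution grows linearly with the block size while the host interference grows only like its square root, larger blocks make the decision more reliable, matching property 2 above.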
From the performance evaluation conducted throughout this thesis, it is clear that a combination of spread spectrum techniques and transform domain embedding produces a more robust watermark. In particular, the frequency hopped spread spectrum

(FHSS) algorithm and its variants (FHSS-FMW and FHSS-FMT) proved particularly resilient to processing. The main reason is that embedding watermarks in the transform domain tends to distribute their energy throughout the temporal or spatial extent of the host signal.

Recall that the focus of this thesis is on public watermarking algorithms, where the original signal is not available at the receiver to assist in extracting watermark data from the host signal. In many cases, the presence of the host signal may interfere with extraction of the watermark. This is especially true of additive watermarks, such as the direct sequence and frequency hopped spread spectrum (DSSS and FHSS) techniques. In contrast, a quantization approach introduces no decoding error, but only if the watermarked signal is distributed in a distortion-free environment. In this thesis, improvements were introduced for the DSSS and FHSS approaches to audio, image, and video watermarking to reduce the interference of the host signal. Both a highpass prefilter and a "whitening" filter constructed from an autoregressive (AR) model of the host signal work well to reduce this interference.
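One way to realize the AR-based whitening idea mentioned above is to fit AR coefficients by the autocorrelation (Yule-Walker) method and apply the resulting prediction-error filter. The model order, estimator, and function below are my own illustrative choices, not the thesis's implementation:

```python
import numpy as np

def whitening_filter(signal, order=4):
    """Fit an AR(order) model via the Yule-Walker equations and return
    the prediction-error (approximately whitened) signal."""
    x = np.asarray(signal, dtype=np.float64)
    x = x - x.mean()
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) / len(x)
                  for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:] for AR coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    # Prediction error: e[n] = x[n] - sum_k a[k] * x[n - 1 - k]
    e = x.copy()
    for k in range(order):
        e[order:] -= a[k] * x[order - 1 - k : len(x) - 1 - k]
    return e[order:]
```

Applied to a strongly correlated host, the prediction error has much smaller variance than the original samples, so the host contributes less interference at the correlation receiver.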

6.1.1 Tradeoffs in Watermarking Systems


Throughout this thesis, it was noted that tradeoffs exist in the design and implementation of digital watermarking systems. The four desirable properties of watermarking systems (high bit rate, high perceptual quality, low computational complexity, and resilience to signal processing) are difficult to achieve all at once. In particular, this investigation revealed the following tradeoffs:

1. Higher bit rates, resulting from the use of smaller block sizes, typically mean less robustness against signal processing. This was observed for additive watermarking algorithms, such as the standard DSSS and FHSS techniques.

2. Higher computational complexity indirectly corresponds to more robustness against signal processing, because frequency domain algorithms and perceptual analysis offer more resilience at the cost of higher complexity.

3. Higher perceptual quality, resulting from lower distortion from embedded watermark data, usually corresponds to less robustness against signal processing. For example, consider results from the audio watermarking chapter: the phase coding algorithm for audio signals produces poor quality signals (measured in SNR) but provides high resilience to processing, while the DSSS approach has better quality (due to its low watermark strength) but is less robust to processing operations.

6.2 Opportunities for Further Research


6.2.1 Further Investigations and Improvements
One important improvement to this study would be conducting formal perceptual quality experiments. As described in Section 1.4.2, a formal study would employ a large number of subjects, listening to or viewing a combination of unmodified and watermarked signals and ranking them according to their perceived "quality", for each of the watermarking algorithms considered. The results would allow a more suitable watermark strength to be determined for the basic DSSS and FHSS algorithms, and would also allow a fairer comparison of the perceptual quality of the watermarking algorithms. Recall that the signal to noise ratio (SNR or PSNR) was used as a general measure of signal quality. Alternatively, it may be useful to investigate the use of audio, image, and video quality models from the literature [59, 18]. Quality models have traditionally been used for judging the transparency of lossy compression algorithms, and few authors have considered their use in evaluating digital watermarking systems.

It is likely that further improvements are possible for each of the watermarking algorithms implemented and studied in this thesis. In particular, it was noted that the multiresolution decomposition for image and temporal video watermarking used the same basis functions to construct decomposition and reconstruction filters. It may be useful to study the use of different wavelet coefficients to determine which

works "best" for watermarking different signals. It is clear that frequency domain approaches spread watermark data throughout the time or spatial domain of the host signal, which is valuable for providing resilience to signal processing operations. Transform- and wavelet-based compression algorithms achieve coding gains by seeking a more compact representation of a signal's time or spatial domain sample values. However, it is not clear that there is any benefit to using different transform kernels or basis functions for watermarking.

The performance evaluation only considered robustness to individual signal processing operations. In practice, it is likely that a number of operations would be performed on a given signal, and it would be useful to know which watermarking algorithm proves more resilient to combinations of operations. For example, an image from a digital camera may undergo lowpass filtering to remove noise, followed by histogram equalization to widen the dynamic range of the image, and completed by lossy (JPEG) compression before being posted onto an Internet web site. If the set of possible signal processing operations is known in advance of watermarking, then it may be possible to construct modifications to each watermarking algorithm so that the embedded data will survive the distortions.

In addition, robustness to more sophisticated signal processing operations should be considered. For example, the algorithms incorporating perceptual analysis (frequency masking, DSSS-SM, FHSS-FMW, and FHSS-FMT) produce localized increases in watermark strength due to localized masking effects within the host signal. However, it is possible for an attacker (as defined in Section 1.2.2) to use these same models to localize an attack on a watermark, possibly with little or no perceivable loss of signal quality.
Quite often, high quality digital signals are stored or transmitted in analog form, because not all consumers have access to the Internet or other sources of digital media. For example, digital video may be converted to a standard television format (NTSC) and then broadcast or recorded on an analog tape. Similarly, digital images are printed in magazines and newspapers, and audio is still recorded and sold on cheap

cassettes. Future investigation of the robustness of watermarking algorithms should take the digital-to-analog (D/A) and analog-to-digital (A/D) conversion process into account, for it is likely that some techniques would not survive the process well.

6.2.2 Watermark Invertibility


Craver et al. recently introduced the concept of invertibility as a set of attacks upon embedded watermark data [4]. The authors contend that many algorithms used to embed watermark data may be easily inverted, leaving watermarked signals vulnerable to two possible attacks. In the first approach, the watermark data and location are identified within the host signal, and then corrupted or replaced with other watermark data. The second attack works by embedding a second watermark within the host signal to coexist with the existing watermark. This process, referred to as overwatermarking, produces two valid sets of data within the host signal, which may cause problems for identifying copyright ownership of the material. Attacks exist for both public and private watermarking techniques, and the exact procedure differs for each algorithm. Further studies may consider the invertibility of watermarking algorithms as a design criterion.

6.2.3 Applications of Digital Watermarking


The focus of this thesis is on the means, rather than the ends, of digital watermarking. In Chapter 1, general insertion and extraction systems were introduced, but only brief mention was made of what constitutes a watermark: a set of arbitrary data stored or transmitted through a host signal. This limited definition reflects the narrow focus of the literature in this new research area. However, there is a serious need to develop intellectual property management systems to protect authors of digital media from illegal copying and distribution of their work [60]. If such systems are to be based on digital watermarking technologies, then watermark data must be designed more carefully to support them.

181 In addition to copyright protection frameworks, other interesting applications of watermarking are beginning to emerge in the literature. For example, two novel applications were recently described: audio-in-video and video-in-video 61]. As mentioned in Section 5.4.4, the high bit rate of raw digital video allows for a large amount of watermark data to be embedded within the signal. For audio-in-video, the authors embed four speech signals within a 360 240 pixel video signal at 30 frames per second. The speech is sampled at 8 kHz and represented with 8 bits per sample, and compressed to 2400 bits per second using a CELP speech compression algorithm 62]. The value of this approach is that the embedded speech signals could represent additional audio tracks, perhaps in di erent languages. Since the speech is embedded within the video signal itself, the bit rate of the signal does not need to be increased to accomodate the extra speech. In a similar manner, the authors embed a small video signal, compressed using the MPEG algorithm, within the host video signal as a form of video-in-video.

6.2.4 Information Theory and Digital Watermarking


It was first noted in Section 1.3 that digital watermarking is essentially a digital communications system: watermark data is encoded, embedded within a host signal for storage or transmission, and then extracted and decoded at a receiver. Information theory is often employed in the design and analysis of communications systems to set limits on the capacity of channels, to predict the reliability of communications systems, and to develop ways of coding data so that transmission errors may be detected and corrected. However, only recently has work been done to incorporate these tools, such as channel coding, into watermarking systems [63]. It is likely that such techniques could improve the robustness of watermarks to signal processing operations, and this is worth studying further.

Another largely ignored aspect is the effect of embedded watermark data on compression rates, particularly if lossless compression is required. Lossless image compression schemes, for example, attempt to remove the coding redundancy of individual pixels and the spatial (or interpixel) redundancy between pixels [46]. Addition of a noise-like watermark signal has the effect of increasing the "randomness" of the image, which would reduce both the coding and interpixel redundancies. Watermarks may affect the compression rate achieved by lossy techniques as well. For example, a watermarked block of 8 × 8 pixels used in the JPEG image compression algorithm may contain frequency components in the 2D-DCT domain that do not exist in the original block of pixels. These components may not be removed by coefficient quantization, leading to a larger compressed image size. Similar effects occur in the MPEG video compression scheme, so it is useful to consider ways of embedding watermark data that minimize the effects on compression.

6.2.5 Current Standardization Efforts


Although digital watermarking is a relatively new research area, it was repeatedly noted in this thesis that there is a pressing need for systems that can identify and protect authors of digital media from piracy. Standards are currently being developed to address these needs. The Secure Digital Music Initiative (SDMI), an organization of music recording industry and technology companies, is working on watermarking standards for devices used to download and play digital music [35]. It was mentioned in Chapter 5 that Digital Versatile Disc (DVD) players search for watermarks within video signals and limit access to or copying of the stored media accordingly [64]. Finally, the new MPEG-4 video compression standard incorporates an intellectual property management system, and allows for watermarking of scenes and individual objects in video signals [56].

Bibliography
1] Pamela Samuelson. Good News and Bad News on the Intellectual Property Front. Communications of the ACM, 42(3):19{24, March 1999. 2] J. S. Lauritzen, Adar Pelah, and David Tolhurst. Perceptual Rules for Watermarking Images: A Psychophysical Study of the Visual Basis for Digital Pattern Encryption. In Proceedings of SPIE Human Vision and Electronic Imaging IV, volume 3644, pages 392{402, 1999. 3] Ingemar Cox and Jean-Paul Linnartz. Some General Methods for Tampering With Watermarks. IEEE Journal on Selected Areas in Communications, 16(4):587{593, May 1998. 4] Scott Craver, Nasir Memon, Boon-Lock Yeo, and Minerva Yeung. Resolving Rightful Ownerships with Invisible Watermarking Techniques: Limitations, Attacks, and Implications. IEEE Journal on Selected Areas in Communications, 16(4):573{586, May 1998. 5] Bruce Schneier. Applied Cryptography. John Wiley & Sons, New York, 2nd edition, 1995. 6] Stephen Wicker. Error Control Systems for Digital Communication and Storage. Prentice Hall, Englewood Cli s, NJ, 1995. 7] Alan Oppenheim and Ronald Schafer. Discrete-Time Signal Processing. Prentice Hall, Englewood Cli s, NJ, 1989. 183

184 8] Nasir Ahmed, T. Raj Natarajan, and K. R. Rao. Discrete Cosine Transform. IEEE Transactions on Computers, C-23(1):90{93, January 1974. 9] Ephraim Feig and Shmuel Winograd. Fast Algorithms for the Discrete Cosine Transform. IEEE Transactions on Signal Processing, 40(9):2174{2193, September 1992. 10] Martin Vetterli. Multidimensional Subband Coding: Some Theory and Algorithms. Signal Processing, 6(2):97{112, April 1984. 11] Olivier Rioul and Martin Vetterli. Wavelets and Signal Processing. IEEE Signal Processing Magazine, 8(4):14{38, October 1991. 12] Marc Antonini, Michel Barlaud, Pierre Mathieu, and Ingrid Daubechies. Image Coding Using Wavelet Transform. IEEE Transactions on Image Processing, 1(2):205{220, April 1992. 13] Rafael Gonzalez and Richard Woods. Digital Image Processing. Addison Wesley, Reading, MA, 1992. 14] Martin Kutter and Fabien Petitcolas. Fair Benchmark for Image Watermarking Systems. In Proceedings of SPIE Security and Watermarking of Multimedia Contents, volume 3657, pages 226{239, 1999. 15] Arun Netravali and Barry Haskell. Digital Pictures: Representation, Compression, and Standards, chapter Visual Psychophysics. Plenum Press, New York, 2nd edition, 1995. 16] Bernard Sklar. Digital Communications: Fundamentals and Applications. Prentice Hall, Englewood Cli s, NJ, 2nd edition, 1988. 17] Niklaus Wirth. Algorithms and Data Structures. Prentice Hall, Englewood Cli s, NJ, 1986.

[18] Nikil Jayant, James Johnston, and Robert Safranek. Signal Compression Based on Models of Human Perception. Proceedings of the IEEE, 81(10):1385–1422, October 1993.
[19] Peter Noll. MPEG Digital Audio Coding. IEEE Signal Processing Magazine, 14(5):59–81, September 1997.
[20] Davis Pan. Tutorial on MPEG/Audio Compression. IEEE Multimedia Magazine, 2(2):60–74, Summer 1995.
[21] Mitchell Swanson, Bin Zhu, Ahmed Tewfik, and Laurence Boney. Robust Audio Watermarking Using Perceptual Masking. Signal Processing, 66(3):337–355, May 1998.
[22] Charles Stromeyer III and Bela Julesz. Spatial-Frequency Masking in Vision: Critical Bands and Spread of Masking. Journal of the Optical Society of America, 62(10):1221–1232, October 1972.
[23] Gordon Legge and John Foley. Contrast Masking in Human Vision. Journal of the Optical Society of America, 70(12):1458–1471, December 1980.
[24] J. F. Delaigle, C. De Vleeschouwer, and B. Macq. Watermarking Algorithm Based on a Human Visual Model. Signal Processing, 66(3):319–335, May 1998.
[25] Bernd Girod. The Information Theoretical Significance of Spatial and Temporal Masking in Video Signals. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display, volume 1077, pages 178–187, 1989.
[26] Martin Kutter, Frederic Jordan, and Frank Bossen. Digital Watermarking of Color Images Using Amplitude Modulation. Journal of Electronic Imaging, 7(2):326–332, April 1998.
[27] Bin Zhu and Ahmed Tewfik. Low Bit Rate Near-Transparent Image Coding. In Proceedings of SPIE Wavelet Applications II, volume 2491, pages 173–184, 1995.
[28] Mitchell Swanson, Bin Zhu, and Ahmed Tewfik. Robust Data Hiding for Images. In IEEE Digital Signal Processing Workshop, pages 37–40, 1996.
[29] Gregory Wallace. The JPEG Still Picture Compression Standard. Communications of the ACM, 34(4):30–44, April 1991.
[30] Albert Ahumada and Heidi Peterson. Luminance-Model-Based DCT Quantization for Color Image Compression. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display III, volume 1666, pages 365–374, 1992.
[31] Heidi Peterson, Albert Ahumada, and Andrew Watson. An Improved Detection Model for DCT Coefficient Quantization. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display IV, volume 1913, pages 191–201, 1993.
[32] Andrew Watson. DCT Quantization Matrices Visually Optimized for Individual Images. In Proceedings of SPIE Human Vision, Visual Processing, and Digital Display IV, volume 1913, pages 202–216, 1993.
[33] Mitchell Swanson, Mei Kobayashi, and Ahmed Tewfik. Multimedia Data-Embedding and Watermarking Technologies. Proceedings of the IEEE, 86(6):1064–1087, June 1998.
[34] Francois Gauthier. Fundamentals of Digital Radio Broadcasting (DRB) in Canada. Technical report, Communications Research Centre, October 1996.
[35] SDMI Portable Device Specification – Part 1. http://www.sdmi.org, July 1999.
[36] Walter Bender, Daniel Gruhl, Norishige Morimoto, and Anthony Lu. Techniques for Data Hiding. IBM Systems Journal, 35(3/4):313–335, 1996.
[37] Alan Oppenheim and Jae Lim. The Importance of Phase in Signals. Proceedings of the IEEE, 69(5):529–541, May 1981.
[38] Correspondence with Walter Bender, September 1999.
[39] Raymond Pickholtz, Donald Schilling, and Laurence Milstein. Theory of Spread-Spectrum Communications – A Tutorial. IEEE Transactions on Communications, COM-30(5):855–884, May 1982.
[40] Frank Hartung and Bernd Girod. Digital Watermarking of Uncompressed and Compressed Video. Signal Processing, 66(3):283–301, May 1998.
[41] Ingemar Cox, Joe Kilian, Thomas Leighton, and Talal Shamoon. Secure Spread Spectrum Watermarking for Multimedia. IEEE Transactions on Image Processing, 6(12):1673–1687, December 1997.
[42] Simon Haykin. Adaptive Filter Theory. Prentice Hall, Englewood Cliffs, NJ, 3rd edition, 1995.
[43] James L. Massey. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory, IT-15(1):122–127, January 1969.
[44] Lenore Blum, Manuel Blum, and Michael Shub. A Simple Unpredictable Pseudo-Random Number Generator. SIAM Journal of Computing, 15(2):364–383, May 1986.
[45] Frank Hartung and Martin Kutter. Multimedia Watermarking Techniques. Proceedings of the IEEE, 87(7):1079–1107, July 1999.
[46] Petros Maragos, Ronald Schafer, and Russell Mersereau. Two-Dimensional Linear Prediction and Its Application to Adaptive Predictive Coding of Images. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-32(6):1213–1229, December 1984.
[47] Dan Dudgeon and Russell Mersereau. Multidimensional Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1984.
[48] Andrew Watson, Gloria Yang, Joshua Solomon, and John Villasenor. Visual Thresholds for Wavelet Quantization Error. In Proceedings of SPIE Human Vision and Electronic Imaging, volume 2657, pages 382–392, 1996.
[49] Christine Podilchuk and Wenjun Zeng. Image-Adaptive Watermarking Using Visual Models. IEEE Journal on Selected Areas in Communications, 16(4):525–539, May 1998.
[50] Bhavesh Bhatt and David Birks. Digital Television: Making it Work. IEEE Spectrum Magazine, 34(10):19–28, October 1997.
[51] Advanced Television Systems Committee. ATSC Digital Television Standard. http://www.atsc.org, September 1995.
[52] Zdzislaw Papir and Andrew Simmonds. Competing for Throughput in the Local Loop. IEEE Communications Magazine, 37(5):61–66, May 1999.
[53] Sara Robinson. Copyright Lawsuits Test Limits of New Digital Media. The New York Times, January 24, 2000.
[54] Frank Hartung and Bernd Girod. Fast Public-Key Watermarking of Compressed Video. In IEEE International Conference on Image Processing, volume 1, pages 528–531, 1997.
[55] Mitchell Swanson, Bin Zhu, and Ahmed Tewfik. Multiresolution Scene-Based Video Watermarking Using Perceptual Models. IEEE Journal on Selected Areas in Communications, 16(4):540–550, May 1998.
[56] Thomas Sikora. MPEG Digital Video-Coding Standards. IEEE Signal Processing Magazine, 14(5):82–100, September 1997.
[57] Plateau Research Group. Berkeley MPEG-1 Video Encoder Users Guide. http://bmrc.berkeley.edu/research/mpeg.
[58] Plateau Research Group. Berkeley MPEG Player. http://bmrc.berkeley.edu/research/mpeg.
[59] Michael Eckert and Andrew Bradley. Perceptual Quality Metrics Applied to Still Image Compression. Signal Processing, 70(3):177–200, November 1998.
[60] Keith Hill. The Role of Identifiers in Managing and Protecting Intellectual Property in the Digital Age. Proceedings of the IEEE, 87(7):1228–1238, July 1999.
[61] Mitchell Swanson, Bin Zhu, and Ahmed Tewfik. Data Hiding for Video-in-Video. In IEEE International Conference on Image Processing, pages 676–679, 1997.
[62] Allan Gersho. Advances in Speech and Audio Compression. Proceedings of the IEEE, 82(6):900–918, June 1994.
[63] Tao Bo and Michael Orchard. Coding and Modulation in Watermarking and Data Hiding. In Proceedings of SPIE Security and Watermarking of Multimedia Contents, volume 3657, pages 503–510, 1999.
[64] Alan Bell. The Dynamic Digital Disk. IEEE Spectrum Magazine, 36(10):28–35, October 1999.