
Helwan University

Faculty of Engineering

Department of Electronics,
Communications, and Computers

MULTIMEDIA MIDDLEWARE

by

Nora Abdel gaffar Naguib El-morsy

B.Sc. in Telecommunication Engineering, 2005

nora.naguib@yahoo.com

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Telecommunications Engineering

Supervised by:
Prof. Mohamed I. El Adawy
Faculty of Engineering, Helwan University

Dr. Hesham A. Keshk


Faculty of Engineering, Helwan University

Dr. Ahmed E. Hussein


Faculty of Engineering, Helwan University

2010

ACKNOWLEDGEMENT

It is a pleasure to thank those who made this thesis possible. I would like to
express my gratitude to Prof. Mohamed I. El-Adawy for his constant support
and most valuable advice. I would also like to thank the rest of the supervisory
committee for all their help, and Dr. Ahmed E. Hussein for suggesting
reference titles.

I would also like to thank my family for the support they have provided me
throughout my entire life. In particular, I cannot fully express my gratitude
to my brother Yasser Naguib, who patiently proofread this entire thesis.
Special thanks go to my brother Wael Naguib, without whose motivation and
encouragement I would not have considered a postgraduate degree. Above all,
I thank my mother, who stood beside me all the time.

Lastly, I offer my regards to all of those who supported me in any respect
during the completion of the project.

I dedicate this thesis to My Mother



PUBLICATIONS

 Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I.
El-Adawy, "Contrast Error Distribution Measurement for Full Reference
Image Quality Assessment", The 18th International Conference on
Computer Theory and Applications, 2008, Alexandria, Egypt.

 Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I.
El-Adawy, "Using PFA in Feature Analysis and Selection for H.264
Adaptation", World Academy of Science, Engineering and Technology,
Volume 54, June 2009, Paris, France, ISSN: 2070-3724.

ABSTRACT

In today's world, users have heterogeneous devices connected to a mesh of networks, each
with different capabilities and restrictions. Multimedia content providers need innovative
approaches: not only keeping one version of each video, but also being able to offer
different bitstreams for a variety of client capabilities. The earlier "one size fits all" design
cannot be applied in the diverse environments present today, and a single bitstream with
static parameters cannot satisfy the diversity found on the client side. This is why
researchers in Universal Multimedia Access (UMA) are working on the development of new
techniques for coding multimedia objects with maximum compression efficiency along with
flexibility in the parameters of the delivered video when dealing with client devices.

The transcoding of multimedia objects requires intermediate systems that are capable of
altering the bitstream on demand. Those systems should be able to manipulate bitstreams of
different formats. A large number of adaptation techniques exist in today's literature, each
specialized in altering the video bitstream along only one dimension, namely temporal
(frame rate), spatial (resolution), Signal to Noise Ratio (SNR), or format conversion. In the
real world, the adaptation of video sequences should take the form of multi-dimensional
adaptation, allowing the system to combine reduction processes on different parameters of a
video sequence while providing the best possible quality.

In this thesis, we have focused on the transcoder policy module. While most of the previous
studies in multimedia transcoding focused on the transcoding techniques themselves, the lack
of a control algorithm has rendered those techniques of little practical use. The study was
therefore directed toward the creation of an offline data analysis model for the transcoder's
policy module.

The results and analysis provided in this thesis help toward the creation of a policy module
that controls the transcoder operation for universal multimedia access.

KEYWORDS: Multimedia Transcoding, Objective Quality Assessment, Universal
Multimedia Access.

TABLE OF CONTENTS

Introduction
1-1 Motivation
1-2 Problem Statement
1-3 Objectives and Contributions
1-4 Thesis Outline

Multimedia Communications Basics
2-1 ITU-T MediaCom2004 Project
2-2 MPEG-7 and MPEG-21
2-3 Coding Standards
2-4 Transcoding vs. Scalable Coding
2-5 Quality Assessment

Related Work
3-1 Quality Assessment
3-1-1 Background
3-1-2 Simple Quality Metrics
3-1-3 Objective Quality Metrics
3-1-3-1 Using DCT, DWT, and DFT
3-1-3-2 Perceptual Distortion Metric (PDM)
3-1-3-3 Structural Similarity
3-1-3-4 Visual Information Fidelity and Natural Scene Statistics
3-2 Subjective Experiments
3-2-1 Double Stimulus Impairment Scale (DSIS)
3-2-2 Double Stimulus Continuous Quality Scale (DSCQS)
3-2-3 Single Stimulus Continuous Quality Scale (SSCQS)
3-3 VQEG
3-4 Benchmark
3-4-1 Error Domains
3-4-2 Subjective Experiment
3-4-3 Realignment Process
3-4-4 Datasets
3-5 H.264 Review
3-6 Multimedia Transcoding
3-6-1 Transcoding Techniques
3-6-2 Control Schemes

Quality Assessment
4-1 Introduction
4-2 Proposed Metric
4-3 Metric Evaluation Process
4-3-1 Subjective Data Rescaling
4-3-2 Nonlinear Regression
4-3-3 Prediction Accuracy
4-3-4 Prediction Monotonicity
4-3-5 Prediction Consistency
4-4 Results
4-4-1 Overall Performance
4-4-2 Cross-Distortion Performance
4-4-3 Logistic Regression Performance
4-4-4 Complexity Performance

Data Analysis
5-1 Introduction
5-2 Offline Data Analysis Model
5-3 H.264 Setup
5-4 Test Sequences
5-5 Features
5-5-1 Feature Definitions
5-5-1-1 Source Domain Features
5-5-1-2 Resources Required
5-5-1-3 Coded Domain Features
5-5-2 Analysis and Selection
5-6 Results
5-7 Transcoder Configuration
5-8 Transcoder Setup
5-9 Clustering

Conclusion and Future Work
6-1 Conclusion
6-2 Future Work

Bibliography

LIST OF FIGURES

Figure 1-1 Multimedia Middleware
Figure 2-1 Multimedia Communications Study Areas (2001 ITU-T)
Figure 2-2 General Architecture of Coding Algorithms
Figure 2-3 Scalable Bitstreams
Figure 3-1 Block Diagram of the Perceptual Distortion Metric (PDM)
Figure 3-2 Block Diagram of the Structural Similarity
Figure 3-3 Block Diagram of the Multi-Scale Structural Similarity (L: Low-Pass Filtering; 2↓: Downsampling by 2)
Figure 3-4 Conceptual Diagram of the VIF
Figure 3-5 Subjective Experiments: Viewing Modes (Left) and Score Scales (Right). (A) Double Stimulus Impairment Scale (DSIS) (B) Double Stimulus Continuous Quality Scale (DSCQS) (C) Single Stimulus Continuous Quality Scale (SSCQS)
Figure 3-6 (A) Video Coding Layer (VCL) and Network Abstraction Layer (NAL) Arrangement (B) NAL Unit
Figure 3-7 Block Diagram of the H.264 Encoder
Figure 3-8 Block Diagram of the H.264 Decoder
Figure 3-9 H.264 Profiles
Figure 3-10 Homogeneous Transcoding
Figure 3-11 Transcoder Implementation
Figure 3-12 Utility Model
Figure 3-13 Info-Pyramid Based Control Scheme
Figure 3-14 Three-Dimensional View
Figure 3-15 System Overview
Figure 3-16 Adaptation, Resource, and Utility Spaces
Figure 4-1 Block Diagram of the Contrast Error Distribution (CED)
Figure 4-2 Scatter Plots of VQRs Against DMOS Values (Blue), with Nonlinear Logistic Fitting Curves (Black), for Six VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED)
Figure 4-3 Scatter Plots of Predicted DMOS (VQRs After Logistic Regression) Against DMOS Values, for Six VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED)
Figure 4-4 Calibration Curves for Each Error Domain: JPEG2K (Green), JPEG (Red), White Noise (Blue), Gaussian Blur (Magenta), Fast Fading (Cyan), and All Error Domains (Black), for Six VQMs: PSNR, SSIM, VIF, PD-VIF, CED, log(CED)
Figure 5-1 Block Diagram of Multimedia Middleware
Figure 5-2 Test Sequences Description
Figure 5-3 Standard Transcoder Configuration
Figure 5-4 Adopted Transcoder Configuration
Figure 5-5 Normalized Bitrate Against Different Transcoding Parameters for All Test Sequences
Figure 5-6 Dendrogram of the Generated Clusters
Figure 5-7 Normalized Bitrate After Adding the No-Transcoding Values

LIST OF TABLES

Table 1 Comparison Between PSNR, SSIM, CED, PD-VIF, log(CED), and log(VIF) with Respect to CC (Pearson Correlation Coefficient), SROCC (Spearman Rank Order Correlation Coefficient), and RMSE (Root Mean Square Error)
Table 2 Pearson Correlation Coefficient of SSIM, CED, PD-VIF, log(CED), and log(VIF), Calculated for the Distortion Domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 3 Spearman Rank Correlation Coefficient of SSIM, CED, PD-VIF, log(CED), and log(VIF), Calculated for the Distortion Domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 4 Root Mean Square Error of SSIM, CED, PD-VIF, log(CED), and log(VIF), Calculated for the Distortion Domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading
Table 5 Evaluation of the Quality Metrics
Table 6 Source Domain Features
Table 7 Resource Features
Table 8 Coded Domain Features
Table 9 Final Trial

ACRONYMS

ARU Adaptation / Resource / Utility
CED Contrast Error Distribution
CPDT Cascaded Pixel Domain Transcoder
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DMOS Differential Mean Opinion Score
DSCQS Double Stimulus Continuous Quality Scale
DSIS Double Stimulus Impairment Scale
DWT Discrete Wavelet Transform
FIR Finite Impulse Response
FR QA Full Reference Quality Assessment
HVS Human Visual System
ISO/IEC International Organization for Standardization / International Electrotechnical Commission
IT Information Technology
ITU-R International Telecommunication Union – Radiocommunication
ITU-T International Telecommunication Union – Telecommunication
MM FSA Multimedia Framework Study Areas
MPEG Moving Picture Experts Group
MSE Mean Square Error
NAL Network Abstraction Layer
NR QA No Reference Quality Assessment
NSS Natural Scene Statistics
PCA Principal Component Analysis
PDM Perceptual Distortion Metric
PFA Principal Feature Analysis
PSNR Peak Signal to Noise Ratio
QoE Quality of Experience
RR QA Reduced Reference Quality Assessment
SDOs Standards Development Organizations
SG Study Group
SNR Signal to Noise Ratio
SSCQS Single Stimulus Continuous Quality Scale
SSIM Structural Similarity
UMA Universal Multimedia Access
VCL Video Coding Layer
VIF Visual Information Fidelity
VQEG Video Quality Experts Group
VQM Video Quality Metric
VQR Video Quality Rating

Chapter 1

Introduction


1-1 Motivation

Multimedia plays an important role in our lives. Terms have been introduced
to industry, culture, and leisure that depend entirely on the evolution of the
multimedia communications field. Working with a team member overseas
through your laptop would never have been possible without video
conferencing. The term webinar only entered use a few years ago, when it was
found that a web-based seminar could reach its entire target audience
regardless of the distances involved.

Multimedia objects can be described as the most demanding type of object
transferred across networks, where the Quality of Experience (QoE) [1] is
paramount. The slightest delay or error heavily affects the quality and can
render the multimedia object useless. This, however, does not change the fact
that multimedia is the most popular type of data on the internet.

The growth in the number of users with internet access, along with the
tremendous increase in their network capabilities and mobility, has led to an
increase in the amount of data accessed and uploaded through the internet.
Multimedia objects account for at least 70% of this data, and those users
spend more than 20% of their time away from their primary workplace.

For a relatively long time now, we have been used to having two types of
networks available to us: telecommunications networks and IT (Information
Technology) networks. Though there are interconnections between them, the
two have not yet fully converged. To achieve this merge, the ITU-T
(International Telecommunication Union – Telecommunication) is working
on the standardization of what are called Next Generation Networks.

The work of Study Group 16 is focused on providing guidelines for a
"Network of Networks" that unifies the viewpoints of end users, standards
committees, and telecommunication and IT providers. This will allow the
convergence of all services under the umbrella of one network, and the
cooperation of content providers and network service providers to serve end
users better.

This advancement in telecommunications networks and device
interoperability has increased the importance of multimedia objects.
Multimedia communication is expected to dominate the field of
communications over the next ten years. This makes it crucial to tackle the
problem of exchanging multimedia objects seamlessly in these changing
environments. The research presented in this thesis is an attempt to examine
some of the open issues in the field of multimedia communications.

1-2 Problem Statement

Multimedia middleware is an intermediate system between the client and the
content server that provides a number of complementary services. The
generalized block diagram of multimedia middleware is illustrated in Figure
1-1. Such servers are used to transcode multimedia objects before delivery to
client devices. This transcoding helps in situations where we do not want to
exhaust network resources or device processing power, for example when
users are just previewing multimedia objects in order to select one, or when
the client device does not have a high screen resolution.

Figure 1-1 Multimedia Middleware

Transcoding can be done with respect to numerous domains, none of which
will result in the same combination of resources. The transcoding middleware
should be able to evaluate the client request, analyze the content of the
requested multimedia object, choose a transcoding scheme, then transcode
the object and deliver it to the user. This middleware server must fit within
the existing system and be transparent to both content server and client.

A multimedia middleware should possess the following qualities in order to
be transparent to the client side:

 When a new multimedia object is added to the content server, the time
required for the transcoding server to analyze the content of the video
should be minimized.

 The time from the reception of a client request until delivery of the
content back to the user should be minimized.

 The transcoding server should not require the presence of any pixel
domain information in any of its processes.

 The server should have the means to assess the quality of the
generated version of the multimedia object and to choose between
different transcoding schemes.

The above qualities provide a roadmap for the implementation of transcoding
servers. However, for those servers to function properly, a set of offline data
analysis studies for multimedia objects must be carried out. In the available
literature, a number of studies have worked on this point, but none has
reached the optimal criteria satisfying the qualities stated above. Our work in
multimedia middleware is focused on the implementation of the transcoder
policy module. We have divided the analysis into two parts: a quality
assessment model developed for use in offline data analysis, and an overall
feature analysis for the selection of transcoding schemes.

1-3 Objectives and contributions

The middleware server request cycle consists of the following:



 Data analysis of the pre-encoded video stream.

 Policy module: choosing the transcoding scheme that best fits the
client requirements and has the best quality of all possible solutions.

 Transcoding the video stream.

The objective of this research is to examine the first two stages. This work
will help toward the practical implementation of the middleware server
control module. The contribution of this research is concentrated in the
following:

 Discovering the features that best serve in clustering multimedia
objects and provide a means of predicting how those objects will react
to different transcoding schemes.

 Developing a new quality assessment metric for the evaluation and
choice of the best available transcoding scheme.

1-4 Thesis Outline

This thesis is organized as follows: chapter 2 introduces some of the
multimedia communications concepts used in the discussion presented in this
thesis; chapter 3 provides a review of the related literature; chapter 4
introduces the proposed objective quality assessment model along with the
evaluation of its performance; chapter 5 presents the offline data analysis and
the feature analysis for the implementation of the transcoder policy module;
and chapter 6 presents the conclusion and future work.

Chapter 2

Multimedia Communications
Basics


2-1 ITU-T MediaCom2004 project

Advances in multimedia communications depend not only on fields that
study multimedia objects but also on the development of the underlying
networks and services that allow the integration of complex multimedia
objects into resource-limited networks, taking into consideration the quality
received by end users.

ITU-T SG16, the lead Study Group for multimedia, is working on the
MEDIACOM 2004 (Multimedia Communication 2004) project [2]. The
objective of the MEDIACOM 2004 project is to establish a framework for
multimedia standardization for use both inside and outside the ITU. This
framework will support the harmonized and coordinated development of
global multimedia communication standards across all ITU-T and ITU-R
Study Groups, in close cooperation with other regional and international
standards development organizations (SDOs).

Figure 2-1 presents the multimedia framework study areas (MM FSA) as
defined by the MEDIACOM project.

Figure ‎2-1 Multimedia Communications Study areas (2001 ITU-T)

2-2 MPEG-7 and MPEG-21

Another important segment of research is the semantic annotation of
multimedia content. This annotation provides a bigger-picture view of the
overall information that resides in a webpage. As a result, the content of the
webpage can be classified based on its importance and then delivered
accordingly.

MPEG-7 and MPEG-21 are two standards developed by the Moving Picture
Experts Group (MPEG) in 2003. Unlike the preceding standards, they are not
intended for the coding of multimedia objects. Rather, they aim to integrate
with the other coding algorithms to allow the transmission of user preference
and context information back and forth between clients and content servers.

2-3 Coding Standards

Multimedia objects are known to contain a large amount of correlated data.
Coding algorithms are designed to decouple these correlations in both the
temporal and spatial dimensions and thereby achieve a high compression rate
without losing valuable information. Figure 2-2 illustrates the main
components of coding algorithms.

Figure ‎2-2 General Architecture of Coding Algorithms

MPEG-4 and H.264 are the newest standards for multimedia coding
developed by the MPEG. They both rely on the same coding principles but
with significantly different visions: MPEG-4 is mainly concerned with
flexibility, whereas H.264 features efficient compression and reliability.

As stated above, the difference between the two standards does not reside in
the theory of the compression module itself, but in how the input is treated.
In MPEG-4, the input of the compression module is a series of multimedia
objects contained in video frames, whereas H.264 uses frame-based
compression.

2-4 Transcoding vs. Scalable Coding

Scalable video coding encodes a video stream as a number of substreams that
can be decoded separately. The bitstream structure is shown in Figure 2-3.
First comes a base substream containing the most basic information, which
allows client devices to render the video at the lowest obtainable quality; this
is usually the case for mobile devices where the client is connected over a
low-bandwidth network. The base substream is followed by a series of
enhancement layers that can be downloaded on demand, typically when the
client can afford more resources to increase the quality of the received video.

Figure ‎2-3 Scalable Bitstreams

On the other hand, transcoding is achieved by placing intermediate systems
(multimedia middleware) between server and client. On these subsystems the
video is re-encoded upon receiving client requests, which contain the
characteristics of the client device along with the network resources available.
In this thesis the words transcoding and adaptation will be used
interchangeably.

The most basic form of a transcoder is a back-to-back decoder-encoder
configuration. However, this configuration requires heavy processing power
on the intermediate system. Another form is based on partially decoding the
stream and manipulating the data in its coded form without referring to the
pixel domain data. Such transcoding systems exploit dependencies between
coded domain and pixel domain information, along with a full understanding
of the coding scheme itself.

Scalable coding and transcoding are the two coexisting lines of UMA
research, each with its own advantages and limitations. Scalable coding has
the advantage of processing videos in advance and therefore does not require
any intermediate system. However, it means that the video bitstream
resource/quality degradation can be done only in predefined steps, and
therefore it does not comply with the exact client requirements.

In other words, scalable coding leaves an error margin between the provided
bitstream and the requested resources/quality, whereas transcoding tailors
video bitstreams to the exact device/network requirements provided by the
client requests.

Two other limitations of practical implementations of scalable coding are as
follows:

 Decoder compliance with the scalable coding format: non-compliant
decoders will decode only the base layer of the bitstream, yielding
low-quality video on clients that could support higher quality.

 An enormous number of single-layer video bitstreams already exists
on today's networks; to accommodate scalable coding techniques,
transcoding would be required for all existing videos.

2-5 Quality Assessment

Quality assessment is an important step in the transcoding/adaptation
process. In a proxy/middleware, the choice of the transcoding dimension and
its exact parameters depends on the quality produced. Although meeting
client requests and resource constraints is the steering wheel of the
transcoding middleware, the QoE on the client side is what the whole system
is about.

When assessing a reduced bitstream, we should bear in mind that quality
measurement of multimedia objects is not defined as the fidelity of the new
bitstream to the original. Quality, when it comes to multimedia objects, is
defined as the perceived quality, which means that some errors are more
important than others. The perceived quality is related to limitations within
the Human Visual System (HVS), whereby some errors are neutral while
others are severely perceived.

Peak Signal to Noise Ratio (PSNR) is considered the most recognized quality
metric. This metric calculates the error power within the image.
Consequently, it overlooks the significance of the affected data within the
image, along with the modification in HVS response due to this variation in
data.

The degree to which the alteration of a video bitstream has affected the
perceived quality can be measured by either subjective experiments or
objective quality metrics. Subjective experiments refer to the viewing of
videos by human observers, where each observer rates the video quality and a
mean opinion score is then calculated for the video. Objective quality metrics
measure the degradation of visual perceptual quality by defining a criterion
for describing perceptual error.

Chapter 3

Related Work


3-1 Quality Assessment

3-1-1 Background
The complexity and nonlinearity of the HVS are characteristic features that
have been used to trick audiences' eyes for ages. Throughout the study of the
HVS, a number of facts have been discovered that made the generation of
multimedia objects as we know them today possible. For example, using
frame rates above 50 Hz to deceive the human eye into seeing continuous
motion, along with using lossy coding algorithms that exploit dependencies in
spatio-temporal information to remove redundant data from the video
stream, both rely on the nonlinearity of the HVS.

However, when we consider the measurement of the perceived video quality,
this complexity poses a problem. The quality of most types of data can be
found by simply computing what is called the error power. When viewing
multimedia objects, on the other hand, the HVS never calculates the fidelity
of the images on the screen with respect to the original video stream. This
clarifies why any fidelity measure, such as the SNR, fails to describe the
opinion of the observer.

Although the HVS is a complex system, it is limited when it comes to error
perception. These limitations are the reason why an error with less power
might contribute much more severely to the degradation of image quality.

Up until now, subjective experiments have been used for the assessment of
multimedia quality. However, those experiments are impractical, expensive,
and time consuming. Hence, they cannot be used to estimate the quality of
multimedia objects during reproduction. Researchers in the field of
multimedia quality assessment are therefore working on the development of
objective metrics that can predict the observer's opinion of the quality of
multimedia objects.

3-1-2 Simple Quality Metrics


Simple error power models are the most widely recognized quality metrics.
Such a metric calculates the error power within the image. Consequently, it
overlooks the significance of the affected data within the image, along with
the modification in the HVS response due to this variation in data.

To calculate the PSNR between the original and distorted images, we start by
calculating the MSE (Mean Square Error) of the pixels' grayscale values [3]:

$$MSE = \frac{1}{F \cdot X \cdot Y}\sum_{f=1}^{F}\sum_{x=1}^{X}\sum_{y=1}^{Y}\big(Original(x,y,f) - Distorted(x,y,f)\big)^2$$

where the images have a width of X pixels and a height of Y pixels, and the
video sequence contains F frames.

$$PSNR = 10\log_{10}\frac{I^2}{MSE}$$

where I is the maximum value that a pixel can take [3].
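As a concrete illustration, the two formulas above translate directly into a few lines of code. The following is a minimal sketch in Python with NumPy; the function names and the 8-bit peak value I = 255 are our own illustrative choices:

```python
import numpy as np

def mse(original, distorted):
    # Mean square error over all pixels; a 3-D array (frames, height,
    # width) covers the video case with F frames in the formula above.
    diff = original.astype(np.float64) - distorted.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(original, distorted, peak=255.0):
    # PSNR in dB; peak is the maximum pixel value I (255 for 8-bit data).
    m = mse(original, distorted)
    return float('inf') if m == 0 else 10.0 * np.log10(peak ** 2 / m)
```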

From the above we can see that the MSE defines the difference between the
two signals, while the PSNR defines the fidelity of the distorted image to the
original. In [4] the authors illustrate why error power cannot be used as a
metric for perceptual quality. They considered the following cases:

 Different types of visual error with equal power introduced to the
same image.

 Identical error introduced to different images.

In these two cases, although the errors have identical power values, the two
images may exhibit different perceptual quality. In other words, the type of
error should be studied with respect to its effect on the HVS and the image at
hand.

3-1-3 Objective Quality Metrics


The above argument about error-power-based metrics led researchers to
explore and formulate a definition for the perceived quality. Some of the
metrics were designed to be generic and utilized a basic understanding of the
limitations of the HVS; the metric itself was designed to mimic the processing
done in the human eye and brain. Other metrics were more specific and
relied on prior information about the distortion process that the multimedia
object went through (for example, coding algorithms introduce blocking
artifacts).

Three types of references can be used for quality assessment: Full Reference
(FR), Reduced Reference (RR), and No Reference (NR). In FR QA (Full
Reference Quality Assessment) the original image is compared to the
reproduced image, while in RR QA only some features of the original image
are used in the comparison. NR QA refers to techniques that rely on natural
image features to decide on the quality of an image without referring to any
outside information. Obviously, FR and RR are not very suitable for the
transmission quality problem, due to the need for the original image or some
of its features at the receiver. However, FR and RR are very useful when
developing coding and transcoding techniques: these metrics are used to
judge the quality of the image where the original is already available.

In the following sections, we present a number of FR QA metrics that have
been developed by researchers in the quality assessment field, along with their
underlying definitions of perceptual quality.

3-1-3-1 Using DCT, DWT, and DFT


The authors in [5] examined the effect of decoupling inter-pixel dependencies
by using transforms such as the Discrete Cosine Transform (DCT), Discrete
Wavelet Transform (DWT), or Discrete Fourier Transform (DFT). Their
study shows that by transforming images to the frequency domain and then
taking a simple coefficient difference, the resulting performance can surpass
complex quality measures.
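The sketch below illustrates this general idea under stated assumptions: it decorrelates the images with a blockwise DCT and then pools a simple coefficient difference. The 8×8 block size and the absolute-difference pooling are illustrative assumptions on our part, not the exact formulation used in [5]:

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # 2-D DCT built from two orthonormal 1-D DCTs
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def dct_domain_difference(original, distorted, block=8):
    # Mean absolute difference of blockwise DCT coefficients.
    h, w = original.shape
    h, w = h - h % block, w - w % block   # crop to whole blocks
    total = 0.0
    for y in range(0, h, block):
        for x in range(0, w, block):
            o = dct2(original[y:y+block, x:x+block].astype(np.float64))
            d = dct2(distorted[y:y+block, x:x+block].astype(np.float64))
            total += np.abs(o - d).sum()
    return total / (h * w)
```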

3-1-3-2 Perceptual Distortion Metric (PDM)

Figure 3-1 Block diagram of the Perceptual Distortion Metric (PDM)

In [3], a generic model of the HVS is used as an objective quality assessment
metric. The block diagram of the metric is illustrated in Figure 3-1. The color
space conversion block relies on the fact that the HVS treats colors as
nonlinear color differences rather than RGB values, i.e., White-Black,
Red-Green, and Blue-Yellow. The perceptual decomposition is a set of
spatio-temporal filters that mimic the nonlinearity of the neuron responses in
the HVS to different spatio-temporal patterns. Since HVS sensitivity
decreases at high spatial frequencies, the contrast gain control module is used
to compensate for this effect.

3-1-3-3 Structural Similarity

Figure 3-2 Block diagram of the Structural Similarity

The argument used in this metric is based on the idea that the human eye is
tuned to detect structural error. By definition, there are three types of error
that can be introduced to multimedia objects: variation of the average local
luminance, variation of the contrast, and structural error. The first two do
not contribute to the degradation of the perceived quality. Thus, by removing
those two error types, we can calculate the structural error, which defines the
amount of degradation in image quality. The block diagram of the Structural
Similarity (SSIM) is shown in Figure 3-2.

The definition of those three types of error is as follows:

Luminance error:

$$l(x,y) = \frac{2\mu_x\mu_y}{\mu_x^2 + \mu_y^2}$$

Contrast error:

$$c(x,y) = \frac{2\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}$$

Structure error:

$$s(x,y) = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$$

where:

 µx: mean of image X

 µy: mean of image Y

 σx: standard deviation of image X

 σy: standard deviation of image Y

 σxy: covariance between images X and Y

From the above, the authors in [4], [6], and [7] present the structural error as
the cosine of the angle between the original (x − µx) and distorted (y − µy)
images. This logic assumes that after the removal of the luminance and
contrast errors, the remaining errors lie on a circle, where all errors have the
same error power but a different angle defining their effect on the perceived
quality.
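To make the three terms concrete, the sketch below computes them for a pair of image patches. Global patch statistics are used for brevity, whereas the published SSIM computes them in local sliding windows and averages the results; the stabilizing constants C1 and C2 belong to the published formulation and are an addition to the bare formulas above:

```python
import numpy as np

def ssim_components(x, y, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    sd_x, sd_y = x.std(), y.std()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)  # luminance
    c = (2 * sd_x * sd_y + C2) / (sd_x ** 2 + sd_y ** 2 + C2)  # contrast
    s = (cov_xy + C2 / 2) / (sd_x * sd_y + C2 / 2)             # structure
    return l, c, s, l * c * s  # the SSIM index is the product of the three
```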

Figure 3-3 Block diagram of the Multi-Scale Structural Similarity. L: low-pass
filtering; 2↓: downsampling by 2

In [8], an improvement of the metric showed that running it on downscaled
versions of the images and combining the results is more effective at
capturing all the structural error in the image, and compensates for different
viewing distances. A diagram of the multi-scale SSIM is shown in Figure 3-3.

3-1-3-4 Visual Information Fidelity and Natural Scene Statistics

Figure 3-4 Conceptual diagram of the VIF

At the beginning of this discussion we argued that fidelity measures do not
correlate well with perceived quality. Nevertheless, the authors in [9-10]
presented a fidelity measure that uses natural scene statistics to calculate the
amount of information conveyed correctly from the original image, through
the distortion channel, to the observer. This concept is illustrated in Figure
3-4.

Natural Scene Statistics (NSS) rely on the fact that natural scenes occupy a
tiny subspace of all possible permutations of pixel values; because of this,
natural undistorted images can be described with a small number of statistical
features. Visual Information Fidelity (VIF) defines the perceived quality as
the difference in mutual information between the input and output of the
HVS for the distortion-free and distortion channels.

3-2 Subjective Experiments

Subjective experiments [11] are required for the evaluation of Video Quality
Metrics (VQMs). In these experiments, human subjects are requested to
review, evaluate, and assess the quality of the images in a database. The
subjects are normally screened for visual acuity and color blindness, to make
sure their quality scores describe the accurately perceived quality of each
image. Moreover, a viewing session should last less than 30 minutes to
reduce the effect of fatigue on the observers.

The output of these experiments is the Differential Mean Opinion Score
(DMOS) of each image in the database. Those DMOS values serve as a
benchmark for perceived quality, and the output values of objective models
are compared against them during evaluation. Generally, the significance of
the evaluation is affected by the size of the database and the different error
types it contains.

There are a number of internationally accepted test methods for performing
subjective experiments. They are illustrated in Figure 3-5 and described in
the following:

3-2-1 Double Stimulus Impairment Scale (DSIS)


Human subjects review reference/test image sets, then rate the images on a
discrete scale: imperceptible, perceptible but not annoying, slightly annoying,
annoying, and very annoying.

3-2-2 Double Stimulus Continuous Quality Scale (DSCQS)


In this test method, subjects are blind as to which image is the reference.
Each reference/test set is viewed twice. The rating of the images is scored on
two scales, one continuous and one discrete.

3-2-3 Single Stimulus Continuous Quality Scale (SSCQS)


This method differs from DSCQS in the number of times the reference/test
sets are viewed. It is therefore used for longer sequences (several minutes),
whereas DSCQS is only suitable for sequences of about 20-30 seconds.
Furthermore, SSCQS resembles real viewing conditions more closely than
DSCQS.

Figure ‎3-5 Subjective Experiments: Viewing Modes (On the Left) Score
Scale (On the Right). (A) Double Stimulus Impairment Scale (DSIS) (B)
Double Stimulus Continuous Quality Scale (DSCQS) (C) Single Stimulus
Continuous Quality Scale (SSCQS)

3-3 VQEG

The Video Quality Experts Group (VQEG) was formed on 1997. Its main
objective was to validate and standardize objective quality assessment models.
Moreover, that group works toward standardization of performance metrics
for validating the objective models. So far, the VQEG have completed two
sets of tests.

 Phase I (1998): The subjective experiment used DSCQS. Nine


objective quality assessment models were evaluated. This test showed
that 8 out of 9 models gave results that are indistinguishable from
PSNR.
26 | P a g e

 Phase II (2001-2003) [12]: the phase focused on digitally encoded


television quality video. The database used for this test featured
abroad coverage of content: spatial details, motion complexity, and
color.

3-4 Benchmark

The evaluation cycle for the metric proposed in this thesis did not include the
execution of subjective experiments. Instead, we have used the University of
Texas "LIVE Image Quality Assessment Database Release 2" [13]. The
database contains 982 images, of which 203 are reference images and 779 are
distorted images.

3-4-1 Error Domains


This database was originally created from 29 source images. The distorted
images are divided into 5 different categories.

 JPEG2000: bit rate ranging from 0.028 bits per pixel to 3.15bpp

 JPEG: bit rate ranging from 0.15 to 3.34 bpp

 White noise: Gaussian noise with standard deviation sigma=0.12 to 2

 Gaussian blur: images were filtered with circular symmetric 2-D


Gaussian kernel of standard deviation sigma=0.42 to 15

 Fast Fading: Bit errors in JPEG2000 bitstream when transmitted over


a simulated fast-fading Rayleigh channel. The overall SNR is ranging
from 15.5 to 26.1dB

3-4-2 Subjective Experiment


The subjective experiment performed on this database used a continuous
linear scale divided into five equal regions marked with the adjectives "Bad",
"Poor", "Fair", "Good", and "Excellent". Each image was rated by 20-29
human observers using the single stimulus method. The database was rated
in 7 separate viewing sessions.

The fact that images were reviewed in more than one session led to a
mismatch in the scale of the scores given to those images. Therefore, an extra
round of review was performed using a double stimulus methodology on 50
randomly selected images.

3-4-3 Realignment Process


The raw scores for each subject were converted to difference scores (between
the test and the reference), then to Z-scores, and then scaled and shifted to
the full range (1 to 100). Finally, a Differential Mean Opinion Score (DMOS)
value was computed for each distorted image.

For a single image, a score is considered an outlier if it lies outside a certain
interval around the mean score for that image, measured in standard
deviations. Such a point is removed from the DMOS calculation for that
image.

A subject is rejected if their number of outliers exceeds a specified acceptance
rate; in that case, all ratings done by that subject are excluded from the final
dataset.
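A minimal sketch of this realignment pipeline is given below, assuming the raw scores are arranged as a subjects × images matrix and that ref_of[i] holds the column index of image i's reference; the outlier-rejection step described above is omitted for brevity:

```python
import numpy as np

def realign(raw, ref_of):
    # raw[s, i]: raw score of subject s for image i (subjects x images)
    diff = raw - raw[:, ref_of]                  # difference scores
    mu = diff.mean(axis=1, keepdims=True)        # per-subject mean
    sd = diff.std(axis=1, keepdims=True)         # per-subject std
    z = (diff - mu) / sd                         # Z-scores per subject
    scaled = 1 + 99 * (z - z.min()) / (z.max() - z.min())  # map to [1, 100]
    return scaled.mean(axis=0)                   # DMOS per image
```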

3-4-4 Datasets
The image database is accompanied by a number of datasets that define the
benchmark values of the perceived quality for each of the 982 images in the
database; a short sketch of loading them follows the list below.
 dmos.mat: contains two arrays of length 982 each: DMOS and orgs.

o orgs(i)==0 for distorted images, and orgs(i)==1 for reference
images.

o DMOS(1:227): JP2K, DMOS(228:460): JPEG, DMOS(461:634):
White Noise, DMOS(634:808): Gaussian Blur, DMOS(809:982):
Fast Fading.

o The DMOS values corresponding to orgs==1 are zero (they are
reference images).

 refnames_all.mat: contains a cell array refnames_all.

o refnames_all{i} is the name of the reference image for image i,
whose DMOS value is given by DMOS(i).

o If orgs(i)==0, then this is a valid DMOS entry; else if orgs(i)==1,
then image i is a copy of the reference image.

 DMOS_realigned.mat: DMOS values after realignment.
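In practice, these datasets can be read with any MATLAB-file reader. Below is a minimal sketch using SciPy; the variable names follow the dataset description above:

```python
import numpy as np
from scipy.io import loadmat

data = loadmat('dmos.mat')
dmos = data['dmos'].ravel()   # 982 DMOS values
orgs = data['orgs'].ravel()   # 1 = reference image, 0 = distorted image

# Valid DMOS entries correspond to distorted images (orgs(i) == 0).
distorted_dmos = dmos[orgs == 0]
print(distorted_dmos.shape)   # expected: (779,)
```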

3-5 H.264 Review

Throughout this study, the H.264 standard was used as the main compression
technique for encoding and transcoding all test sequences. In this section we
review this standard and demonstrate its new features.

H.264 is the newest standard in its series, known as International Standard
14496-10, or MPEG-4 Part 10 Advanced Video Coding, of ISO/IEC. The
standard was finalized in March 2003 and approved by the ITU-T in May
2003 [14-16].

The encoder/decoder configuration is separated into two stages: the Video
Coding Layer (VCL) and the Network Abstraction Layer (NAL). Figure 3-6
shows the arrangement of both layers.

Figure 3-6 (A) Video Coding Layer (VCL) and Network Abstraction Layer
(NAL) arrangement. (B) NAL unit

The VCL is responsible for efficiently coding the video frames and delivering
the coded information to be formatted by the NAL. The main aim of the
NAL is to arrange all of the coded information in a way that can be
understood by the receiver. All information is sent in what are known as
NAL units; these units act as packets that can be handled separately by the
transport layer for transmission, or stored in a file. Each NAL unit consists of
a NAL header, which specifies the type of the information within the unit,
and the payload data.

The H.264 coding standard falls into the category of block-based
motion-compensated video compression. Figure 3-7 and Figure 3-8 show
detailed block diagrams of the encoder and decoder.

Figure ‎3-7 Block diagram of H.264 Encoder

Figure ‎3-8 Block diagram of the H.264 Decoder

The term slice refers to a set of macroblocks in raster order that are to be
coded with the same type, i.e., I, P, B, SI, or SP. A macroblock is an area of
16 × 16 pixels; it is the main building block on which the processing occurs.

The slice type is defined by the type of coding applied to the macroblocks
contained in the slice. The different slice types are:

 I (Intra) slice: macroblocks are coded through prediction from
macroblocks in the same frame.

 P (Predicted) slice: macroblocks are coded with reference to
previously coded frames.

 B (Bi-directional predicted) slice: macroblocks use both previous and
next frames.

 SI and SP (Switching) slices: used to switch between different
substreams.

The processing in the macroblock layer is divided into two categories: intra
and inter coding. In intra coding, a macroblock is predicted using only spatial
information, i.e., macroblocks from the same frame. In inter coding, the
prediction relies on temporal dependencies. This is done by copying an area
from previously coded frames and assigning it to the currently encoded
macroblock. The encoder then sends the motion vectors, the reference
frames, and the error signal between the predicted and the current
macroblock. The motion vectors, however, are not sent in full: because
motion prediction in the encoder and decoder is identical, motion vectors are
predicted from the surrounding macroblocks, and only a differential motion
vector is sent to the receiver to correct the predicted value.

Motion prediction in H.264 supports half- and quarter-pixel accuracy. The
intensity values at fractional pixel positions are determined by interpolation
(a sketch of the half-pixel filter follows the list):

 Luma half-pixel: 6-tap FIR filter.

 Luma quarter-pixel: averaging of half- and integer-pixel values.

 Chroma: all fractional pixel values are computed through averaging.
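As an illustration of the luma half-pixel step, the standard's 6-tap kernel is (1, −5, 20, 20, −5, 1)/32. The sketch below applies it in one dimension; border padding, the vertical pass, and the exact clipping pipeline of the standard are omitted:

```python
import numpy as np

HALF_PEL_TAPS = np.array([1, -5, 20, 20, -5, 1])  # H.264 6-tap FIR kernel

def luma_half_pel(row):
    # Horizontal half-pixel samples between integer luma positions.
    acc = np.convolve(row.astype(np.int32), HALF_PEL_TAPS, mode='valid')
    return np.clip((acc + 16) >> 5, 0, 255)  # round, divide by 32, clip

def luma_quarter_pel(a, b):
    # Quarter-pixel sample: rounded average of two neighbouring samples.
    return (a.astype(np.int32) + b.astype(np.int32) + 1) >> 1
```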

The following is a list of differences between H.264 and earlier standards:

 H.264 includes a deblocking filter.

 H.264 allows multiple reference frames.

 H.264 introduces spatial prediction in intra frames.

 H.264 uses a 4×4 integer transform instead of the former 8×8 DCT.

The standard defines a set of profiles in which H.264 can operate: baseline,
main, and extended. Each profile defines the accepted syntax and tools to be
used. The profiles are shown in Figure 3-9. In this study we have used the
Baseline profile.

Figure ‎3-9 H.264 profiles

H.264 is the most efficient coding algorithm with respect to bit rate
reduction, yet the most complex among its peers. In [17] the authors
performed a number of tests to analyze the complexity-distortion relationship
within H.264. They found that P frames are more efficient with respect to
distortion and complexity, but require more bitrate than sequences containing
B frames. The authors in [18] show that the processing time of H.264 is
dominated by the deblocking filter (49.01%) and fractional pixel interpolation
(19.98%).

3-6 Multimedia Transcoding

Research in multimedia transcoding is categorized into the following:

 Transcoding techniques: the design of techniques that adapt the video
stream to fit fewer resources.

 Transcoder analysis: the analysis of resource utilization in transcoders
and its optimization schemes.

 Control schemes: controlling the selection of transcoding techniques,
along with the amount of transcoding done by each of them.

Although there is now a large number of studies on the design of transcoding
techniques, the lack of a policy module that supports transcoder
implementation leaves those designs of little practical use. In the following
section we review the first category to familiarize the reader with baseline
knowledge about transcoding. The rest of the section provides a review of
control schemes.

3-6-1 Transcoding Techniques


Transcoding has different types based on the kind of change induced in the
bitstream [19]:

 Homogeneous: the modification of one or more resources required by
the bitstream. The different types of resources are demonstrated in
Figure 3-10.

 Heterogeneous: the change of the bitstream syntax from one standard
coding scheme to another.

 Error resilience: the injection of bits to increase the bitstream's
robustness to error.

Figure ‎3-10 Homogeneous transcoding

Transcoding techniques can also be categorized from the implementation
point of view. The simplest implementation is the back-to-back
decoder-encoder configuration, also known as the cascaded pixel domain
transcoder (CPDT). The CPDT is the simplest yet most time-consuming
transcoder implementation. As Figure 3-11 demonstrates, the deeper we go
into the structure of the bitstream, the higher the quality of the transcoding,
at the cost of transcoder complexity.

Figure ‎3-11 Transcoder Implementation

3-6-2 Control Schemes


In [20], the authors proposed a utility model based on the maximization of
utility under a given amount of resources. The system supported neither
dynamic transcoding nor online transcoding. Three profiles were defined for
each multimedia object, namely gold, silver, and bronze.

Figure ‎3-12 Utility Model

The authors in [21] argued that offline-transcoded objects can be arranged in
what is called an info-pyramid, which is by definition a progressive data
representation scheme. Objects stored in the info-pyramid have different
resolutions and abstraction levels:

 Fidelity: the spatial and temporal resolution, using lossy compression
techniques.

 Modality: the selection of key frame images, the audio track, or
closed captions.

When the customization and selection module receives a client request, it
assigns the object that best fits the request and sends it back to the user. The
architecture of the system is illustrated in Figure 3-13.

Figure ‎3-13 Info-pyramid based control scheme

On the other hand, the authors in [22] proposed a model with three
dimensions:

 Device modality: display, audio, memory, CPU, and color.

 Network conditions: bandwidth, latency, and BER.

 User preferences.

The dimensions and the overall system architecture are illustrated in Figure
3-14 and Figure 3-15 respectively.

For each dimension a number of classes were defined, and offline transcoding
of the multimedia objects was done. Storage and mapping of these different
bitstreams is handled using the MPEG-7 standard. When a user's request is
received, the system chooses the most appropriate class from a matrix of
classes and sends it to the user.

Figure 3-14 Three-dimensional view

Figure ‎3-15 System overview

Another type of control scheme was proposed in [23]. The system operates
in real time and uses single-dimensional transcoding to fit videos to the
available bit rate. A buffer-based control scheme was used: the system
utilizes the relation between delay, buffer occupancy, and bitrate. Two types
of transcoding were used, re-quantization and frame dropping. The amount
of bits required to encode a frame is predicted using information gathered
from previously encoded frames.

A control scheme can be simplified to fit a specific application. In [24], the
authors proposed a control scheme for a map viewing application. The
scheme is user-centric, where information about the type of usage is
important in defining the amount of detail to be sent to the user. For
example, a hiker would require finer details than a car driver.

Transcoding is done offline, and the system requests the resources of the
highest-utility selection based on the user's preferences; if this fails, a
negotiation cycle is started until enough resource reduction is achieved.

Another application-based control scheme was developed in [25] for
computer graphics systems. The authors define a graph-based representation
embedded in a multidimensional utility space. Each object in the application
is represented in an independent utility subspace, in which nodes are used to
define relations between different resolutions.

The arguments set by the authors in [26-27] are summarized in the following:

 There is not enough work in the available literature on transcoding
strategies.

 Transcoding strategies select, out of all possible adaptation processes,
the one that satisfies the resource constraints and provides the best
utility.

 Transcoder complexity should be minimized, as decisions must be
made in real time to serve the user's request.

 The reduction in resources achieved by a specific adaptation process
is not the same across different types of content.

The authors developed the arrangement of ARU spaces, which stands for
Adaptation, Resource, and Utility respectively. Imagine an adaptation space
where each point is mapped onto the resource and utility spaces. In the
resource space, we can determine how much complexity or bitrate reduction
this point (adaptation process) would cause. In the utility space, the system
can compare two adaptation processes with respect to the quality of the
multimedia object. A conceptual illustration of the system is given in Figure
3-16.

Figure ‎3-16 Adaptation, Resource, Utility spaces

The curves for these three spaces cannot be developed from a single video
sequence, since each video sequence can react differently to adaptation
processes. The authors developed a system for generating utility functions by
extracting a set of features from video sequences. Those features are then
used to cluster the sequences into a number of predefined clusters that are
expected to behave in the same way with respect to different adaptation
processes. Those clusters are defined through the analysis of a set of test
sequences.

Chapter 4

Quality Assessment


4-1 Introduction

Our work in objective quality assessment was mainly driven by the need for
an objective model to be used in the policy module of the transcoding engine.
This FR QA model should possess the following properties in order to
replace the need for subjective experiments:

 High correlation with the output of subjective experiments.

 Consistency in its reaction to different types of visual error and image
content.

 Low cost with respect to time consumption.

These features are crucial for the metric to be used in place of human
observers in practice. Research in quality assessment has revealed different
perspectives on perceptual error. Although these definitions of perceptual
error make use of high-level image features, none of them has reached the
optimal criteria for providing the metric features described above.

In [28], the authors studied 10 state-of-the-art FR QA metrics. This extensive
evaluation shows that most of these metrics produce results worse than or
indistinguishable from PSNR. Although these metrics are based on high-level
visual features, they did not correlate well with the subjective data.

In this chapter, we present our work on the formulation of an objective
metric that complies with the above criteria, along with the logic behind its
design.

4-2 Proposed Metric

Studies examining how the HVS treats received visual information found that
the HVS does not treat images as luminance values but as contrast
differences. Moreover, this contrast-based response varies with the viewing
distance. This has led to the use of a contrast sensitivity function, after the
decomposition of the image into spatial and temporal bands, in HVS-based
metrics.

The metric presented here uses this fact. If the change in contrast values is
distributed evenly over the entire image, the HVS will not capture this type of
error, since the relations between the contrast values are maintained.
Conversely, a contrast change whose distribution has a large standard
deviation will modify the contrast relations in the image.

The proposed algorithm for calculating the Contrast Error Distribution


(CED) metric is as follows [29]:
P a g e | 45

 Calculate the local contrast for the original and distorted images using only the luminance component (the contrast of an image region is simply its standard deviation). Only the luminance values are used, as the HVS is known to have higher achromatic acuity than chromatic acuity.

 Calculate the contrast error:

$d = \mathrm{Contrast}(\text{Original Image}) - \mathrm{Contrast}(\text{Distorted Image})$

 Calculate the standard deviation of the error, $\mathrm{STD}(d)$:

$\mathrm{STD}(d) = \mathrm{STD}\big(\mathrm{Contrast}(\text{Original Image}) - \mathrm{Contrast}(\text{Distorted Image})\big)$

After the first evaluation cycle, we found that the above criterion holds quite well for the JPEG, JPEG2000, white noise, and fast fading distortion domains. However, for the Gaussian blur domain, the metric did not correlate with the subjective experiment outputs.

These results were reasonable: the contrast error introduced by Gaussian blur tends to have a small standard deviation, yet this type of error still modifies the local contrast information in the image. The analysis in [30] found that image content plays an important role in how strongly an error affects perceived quality. Therefore, we modified the metric by referencing its output values to the standard deviation of the reference image contrast. The block diagram of the metric is shown in Figure 4-1.

 Calculate the standard deviation of the original image's contrast: STD(Contrast of Original Image).

 Calculate the metric:

$CED = \dfrac{\mathrm{STD}(\text{Contrast of Original Image})}{\mathrm{STD}(d)}$

Figure ‎4-1 Block Diagram of the Contrast Error Distribution (CED)
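For concreteness, the following Python sketch illustrates the CED computation. The use of non-overlapping 8×8 luminance blocks follows the window size used in this thesis (see Section 6-2); the blocking scheme itself and the function names are illustrative assumptions.

import numpy as np

def local_contrast(luma, block=8):
    # Standard deviation of each non-overlapping block x block tile
    h = luma.shape[0] - luma.shape[0] % block
    w = luma.shape[1] - luma.shape[1] % block
    tiles = luma[:h, :w].reshape(h // block, block, w // block, block)
    return tiles.std(axis=(1, 3))

def ced(original, distorted, block=8):
    # CED = STD(contrast of original) / STD(d); larger values mean the
    # contrast error is more evenly distributed, i.e. less perceptible
    c_orig = local_contrast(original.astype(float), block)
    c_dist = local_contrast(distorted.astype(float), block)
    d = c_orig - c_dist
    return c_orig.std() / d.std()

Note that STD(d) vanishes for identical images, so the ratio is only meaningful for distorted inputs.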

4-3 Metric Evaluation Process

The metric evaluation process is not just a simple measurement of how closely the Video Quality Ratings (VQRs) resemble the DMOS values. A number of performance measures must be applied to the VQRs to confirm that the metric gives good results regardless of error type, image content, or even the amount of quality degradation.

In short, all of the above comply with a single definition: generalizability. VQEG defines it as "the ability of a model to perform reliably over a very broad set of video content." This is obviously a critical selection factor given the very wide variety of content found in real applications. There is no performance measure specific to generalizability, so the objective testing procedure requires the selection of as broad a set of representative test sequences as possible [12].

As stated above, to achieve this generalizability, we have to perform VQM tests over a wide range of images and use performance tests that describe every aspect of generalizability. For this reason, VQEG standardized the following evaluation domains for VQMs:

 Prediction Accuracy: the ability to predict the subjective quality ratings with low error.

 Prediction Monotonicity: the degree to which the model's predictions agree with the relative magnitudes of subjective quality ratings.

 Prediction Consistency: the degree to which the model maintains prediction accuracy over the range of video test sequences, i.e., its response is robust to a variety of video impairments.

4-3-1 Subjective Data Rescaling

DMOS values after realignment might take invalid values, for example negative values. Therefore, a linear scaling is required to map the values to the range 0 to 1, with zero being the worst perceived quality.

The scaling function is as follows:

$\text{Scaled Rating} = \dfrac{\text{Raw Difference Score} - \text{Minimum Value}}{\text{Maximum Value} - \text{Minimum Value}}$
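As an illustration, this rescaling is a one-line min-max normalization; the sketch below (in Python, with hypothetical variable names) maps realigned difference scores to the 0-1 range.

import numpy as np

def rescale(raw_scores):
    # Linear min-max scaling of realigned difference scores to [0, 1]
    return (raw_scores - raw_scores.min()) / (raw_scores.max() - raw_scores.min())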

4-3-2 Nonlinear Regression


The relation between DMOS and VQRs is not linear. Therefore, applying performance measures directly to the VQM output would lead to inaccurate results. This nonlinearity is due to the fact that subjective test results tend to be compressed at the extremes of the test range. Consequently, a nonlinear regression step is required to compensate for this.

We have used a logistic regression function with parameters $b_1$, $b_2$, and $b_3$, following the evaluation procedure of [28]:

$DMOS_p(VQR) = \dfrac{b_1}{1 + e^{-b_2\,(VQR - b_3)}}$

The nonlinear regression converts the VQRs into predicted DMOS values, DMOSp, which can then be compared with the subjective DMOS values.
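For illustration, the fitting step can be carried out with an off-the-shelf least-squares solver. The sketch below uses SciPy's curve_fit with the logistic function given above; the initial parameter guesses are assumptions chosen only to help convergence.

import numpy as np
from scipy.optimize import curve_fit

def logistic(vqr, b1, b2, b3):
    # Logistic mapping from VQR to predicted DMOS (see the formula above)
    return b1 / (1.0 + np.exp(-b2 * (vqr - b3)))

def predict_dmos(vqr, dmos):
    # Fit the logistic to (VQR, DMOS) pairs and return DMOSp
    p0 = [dmos.max(), 1.0, float(np.median(vqr))]  # heuristic initial guesses
    params, _ = curve_fit(logistic, vqr, dmos, p0=p0, maxfev=20000)
    return logistic(vqr, *params)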

4-3-3 Prediction Accuracy


Pearson linear correlation coefficient:

$r^2 = \dfrac{\sigma_{xy}^2}{\sigma_x^2\,\sigma_y^2}$

where $\sigma_{xy}$, $\sigma_x$, and $\sigma_y$ are defined as follows:

$\sigma_{xy} = \sum_i (x_i - \mu_x)(y_i - \mu_y)$

$\sigma_x^2 = \sum_i (x_i - \mu_x)^2$

$\sigma_y^2 = \sum_i (y_i - \mu_y)^2$

4-3-4 Prediction Monotonicity


The Spearman rank order correlation coefficient is a measure of monotonic association, used when the distribution of the data makes the Pearson correlation coefficient undesirable or misleading. With $x_i$ and $y_i$ denoting the mean-centered ranks of the two variables,

$r_s = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \, \sum_i y_i^2}}$
4-3-5 Prediction Consistency

Outlier Ratio:

$\text{Outlier Ratio} = \dfrac{N_o}{N}$

where:

 $N_o$ is the number of outlier points.

 $N$ is the total number of data points.

 A point $i$, with $Q_{error}[i] = DMOS[i] - DMOS_p[i]$ for $1 \le i \le N$, is considered an outlier if $|Q_{error}[i]| > 2 \times DMOS\_Standard\_Error[i]$.

The root mean square error (RMSE) is also considered a measure of consistency.
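Taken together, the performance measures of Sections 4-3-3 to 4-3-5 are small enough to sketch in a few lines of Python; scipy.stats supplies the two correlation coefficients, and the outlier rule follows the definition above.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def vqeg_measures(dmos, dmos_p, dmos_stderr):
    cc, _ = pearsonr(dmos_p, dmos)        # prediction accuracy
    srocc, _ = spearmanr(dmos_p, dmos)    # prediction monotonicity
    rmse = np.sqrt(np.mean((dmos - dmos_p) ** 2))
    q_error = dmos - dmos_p               # per-point prediction error
    outlier_ratio = np.mean(np.abs(q_error) > 2.0 * dmos_stderr)
    return cc, srocc, rmse, outlier_ratio

Since the logistic mapping is monotonic, the SROCC computed on DMOSp equals the one computed on the raw VQRs.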

4-4 Results

In the evaluation cycle we have chosen six FR QA metrics to be compared:

 Peak Signal to Noise Ratio (PSNR)

 Structural Similarity (SSIM) [31]

 Visual Information Fidelity (Log(VIF)) [32]

 Pixel-Domain Visual Information Fidelity (VIF-PD): a less complex implementation of the VIF [33]

 Contrast Error Distribution (CED) [Proposed]

 Contrast Error Distribution (Log(CED)) [Proposed]

4-4-1 Overall Performance


The overall performance was measured by computing the Pearson correlation coefficient, the Spearman rank correlation, and the root mean square error of the six quality assessment metrics mentioned above. The results are shown in Table 1 and demonstrate that CED gives results similar to those of more sophisticated metrics such as the VIF.

4-4-2 Cross-Distortion Performance


Table 2 through Table 4 give the detailed values of the above performance measures for each distortion domain. The tables show that CED's performance is consistent across all distortion domains, whereas the other metrics perform worse in the fast fading domain.

Table 1 Comparison between PSNR, SSIM, CED, PD-VIF, Log(CED), and Log(VIF) with respect to CC: Pearson Correlation Coefficient, SROCC: Spearman Rank Correlation Coefficient, and RMSE: Root Mean Square Error

        PSNR      SSIM      CED        PD-VIF    Log(CED)   Log(VIF)
                            (Proposed)           (Proposed)
CC      0.8700    0.8959    0.9369     0.9326    0.9525     0.9544
SROCC   0.8755    0.9075    0.9550     0.9471    0.9550     0.9637
RMSE    13.4713   12.1396   9.9549     9.8798    8.3168     8.1708

Table 2 Pearson Correlation Coefficient of SSIM, CED, PD-VIF, Log(CED), and Log(VIF), calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading

                      JP2K     JPEG     WN       GBlur    FF
SSIM                  0.9311   0.9436   0.9693   0.8622   0.9271
CED (Proposed)        0.9561   0.9688   0.9325   0.9368   0.9466
PD-VIF                0.9702   0.9749   0.9717   0.9538   0.8698
Log(CED) (Proposed)   0.9598   0.9738   0.9716   0.9696   0.9635
Log(VIF)              0.9744   0.9688   0.9804   0.9707   0.9490

Table 3 Spearman Rank Correlation Coefficient of SSIM, CED, PD-VIF, Log(CED), and Log(VIF), calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading

                      JP2k     JPEG     WN       GBlur    FF
SSIM                  0.9331   0.9389   0.9684   0.8827   0.9380
CED (Proposed)        0.9545   0.9712   0.9719   0.9699   0.9658
PD-VIF                0.9717   0.9840   0.9872   0.9695   0.8675
Log(CED) (Proposed)   0.9545   0.9712   0.9719   0.9699   0.9658
Log(VIF)              0.9698   0.9600   0.9856   0.9734   0.9658
52 | P a g e

Table 4 Root Mean Square Error of SSIM, CED, PD-VIF, Log(CED), and Log(VIF), calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading

                      JP2k     JPEG      WN        GBlur    FF
SSIM                  9.2222   10.5526   6.8789    9.3565   10.6995
CED (Proposed)        7.6804   8.2344    10.4274   6.8455   9.6306
PD-VIF                6.1433   7.1296    6.6276    5.5593   14.0610
Log(CED) (Proposed)   7.0897   7.2565    6.6182    4.5263   7.6321
Log(VIF)              5.6908   7.8561    5.5314    4.4474   9.0253

4-4-3 Complexity Performance


VQEG has not yet standardized a complexity measure for VQMs. However, the complexity of the metrics was evaluated on a Pentium M 1.86 GHz laptop, using the time consumed in calculating each quality metric for all the JPEG2000-distorted images (227 images). The complexity measures are shown in Table 5.

From the results, it can be seen that CED provides a good tradeoff between performance and complexity: it operates in about 1.4 seconds per image, whereas the metric with comparable accuracy (VIF) operates in about 12 seconds per image.

Table 5 Evaluation of the Quality Metrics

                    MSSIM         CED (Proposed)   PD-VIF        VIF
Total time          224.11 sec/   310.91 sec/      498.26 sec/   2768.4 sec/
                    227 images    227 images       227 images    227 images
Average per image   0.99 sec      1.37 sec         2.2 sec       12.2 sec

4-4-4 Logistic Regression Performance


Figure 4-2 shows the scatter plots of the VQM outputs against the DMOS values, along with the logistic regression fit of the data. The plot for CED shows that the VQR points are distributed evenly across the perceived quality range.

Figure 4-3 shows the scatter plots of DMOS against the predicted DMOS values; these plots reveal the outlier points. For a metric to perform well, the scatter points should lie near the diagonal of the graph and be distributed evenly across the range of perceived quality.

It can be seen from Figure 4-3 that the metrics have two empty spots, one near the origin and the other at the far side of the graph, as highlighted in red. The empty spot near the origin means that the zero point is translated to a different value in the predicted DMOS. The graph for CED shows that the empty spots have shrunk significantly, and therefore the CED's response is improved for errors located in those areas of the graph.

Figure 4-4 shows the calibration curves of the five distortion domains from the database used in the experiment. For a VQM's performance to be stable across different types of distortion, the calibration curves should be indistinguishable. In the figure, we can see that the calibration curves do not overlie each other; however, they are adjacent. The points of intersection mark the error levels at which the metric reacts to different types of error identically; elsewhere, the metric is more or less sensitive to certain types of error.
Figure 4-2 Scatter plots of VQRs against DMOS values (blue) with the nonlinear logistic fitting curve (black), calculated for the six VQMs: PSNR, SSIM, VIF, PD-VIF, CED, and Log(CED) respectively (six panels).

Figure 4-3 Scatter plots of predicted DMOS (VQRs after logistic regression) against DMOS values, calculated for the same six VQMs. The RMSE values shown on the panels are 13.4713, 12.1396, 9.8798, 8.1708, 9.9549, and 8.3168, matching the corresponding entries in Table 1.

Figure 4-4 Calibration curves for each error domain: JPEG2k (green), JPEG (red), White Noise (blue), Gaussian Blur (magenta), Fast Fading (cyan), and all error domains (black), calculated for the same six VQMs.

Chapter 5

Data Analysis


5-1 Introduction

Nowadays, a large number of video transcoding schemes exist. These schemes change a pre-encoded video bitstream into another that exhibits lower bit rate or complexity, and therefore lower quality.

Currently, the main problem in video adaptation is the management of the process itself. More specifically, the problem lies in how to determine the following:

 The transcoding scheme to be used.

 The amount of transcoding.

The problem stems from the fact that not all video sequences react in the same way to transcoding processes. A given amount of transcoding can result in different amounts of resource reduction in different video sequences, due to the varied complexity of video content.

5-2 Offline Data Analysis Model

The authors in [34] put together a systematic procedure for designing video adaptation technologies, as follows:

1. Identify the adequate entities for adaptation, e.g., frame, shot, sequence of shots, etc.
2. Identify the feasible adaptation operators, e.g., de-quantization, frame dropping, coefficient dropping, etc.
3. Develop models for measuring and estimating the resource and utility values associated with video entities undergoing the identified operators.
4. Given user preferences and constraints on resource or utility, develop strategies to find the optimal adaptation operator(s) satisfying the constraints.
Figure 5-1 shows a conceptual diagram of the three-stage transcoder: offline data analysis, policy module, and transcoding engine. Our work focused mainly on the offline data analysis module. The policy module decides which transcoding algorithm to use and how much transcoding is needed. This is done by extracting features from pre-encoded videos and mapping them to a certain class. Each of the classes defined in the policy module contains information about the resource–transcoding relations. Those classes are created in the offline data analysis stage.

The main aim of the offline data analysis stage is to define the main classes of multimedia objects. Each class has its own resource–transcoding–quality graph, which contributes to the policy module's decision.

Figure ‎5-1 Block diagram of Multimedia Middleware

The presented study relies mainly on finding key features that characterize the differences between video sequences. Those video sequences usually reach the transcoding server in pre-encoded form, so the transcoding server should determine the class of a sequence using only the information present in the coded domain.

5-3 H.264 Setup

The C++ reference implementation of the H.264 video coding algorithm [35], version JM 13.0, was used. The baseline profile was chosen as the profile for encoding the test sequences.

This profile contains the following features:

 I slices: intra-coding; only spatial prediction is allowed.

 P slices: inter-coding with forward temporal prediction.

 CAVLC: Context-Adaptive Variable Length Coding.

Configuration parameters for the coding algorithm:

 Baseline profile

 QP = 28

 Coded with an IPPP structure

5-4 Test Sequences

The test video sequences used in this study are presented in [36]. They are single-shot video segments; therefore, each video sequence is encoded with the first frame as an I-frame and the rest of the frames as P-frames. The complexity of each video sequence is described in Figure 5-2.

5-5 Features

By classifying videos based on their content, video bitstreams can be grouped according to their behavior within the transcoding engine. This classification depends mainly on features extracted from the video sequences. A number of studies on transcoding control schemes have adopted the idea of classifying video content based on its complexity; however, the choice of features has been the main point of debate in this concept. In this chapter, the proposed feature analysis is presented. This analysis covers most of the features used in the available literature [37-40]. The study conducted in this thesis concluded that many of these features convey the same information, and some of them can be omitted from the proposed model.

Figure ‎5-2 Test Sequences Description



5-5-1 Feature Definitions


All feature definitions described in this section are calculated on a per-frame basis. To obtain a single value for each sequence, the average over frames was computed. For the source domain features only, the averaged values were also compared against the first frame (I-frame) values.

5-5-1-1 Source Domain Features

 Variance: average variance of the luminance pixels

 Pelact: standard deviation of the luminance pixels

 Pelspread: standard deviation of Pelact

 Edgeact: magnitude of the pixel gradient

 Edgespread: standard deviation of Edgeact
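A minimal Python sketch of how these per-frame features might be computed from a luminance plane is given below; the 16×16 block size used for Pelact/Pelspread and the gradient operator are implementation assumptions, as the thesis does not fix them here.

import numpy as np

def source_features(luma, block=16):
    """Per-frame source domain features from a luminance plane (2-D array)."""
    luma = luma.astype(float)
    h = luma.shape[0] - luma.shape[0] % block
    w = luma.shape[1] - luma.shape[1] % block
    tiles = luma[:h, :w].reshape(h // block, block, w // block, block)
    pelact = tiles.std(axis=(1, 3))      # local luminance activity per block
    gy, gx = np.gradient(luma)
    grad = np.hypot(gx, gy)              # pixel gradient magnitude
    return {
        'Variance':   luma.var(),        # variance of the luminance pixels
        'Pelact':     pelact.mean(),
        'Pelspread':  pelact.std(),      # spread of the local activity
        'Edgeact':    grad.mean(),
        'Edgespread': grad.std(),
    }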

5-5-1-2 Resources Required

 bitcount: bitcount for coding each macroblock, accumulated over the whole frame

 bitcount Y: bitcount used for coding only the Y component of the frame

 ME time: time consumed in motion estimation

 SNR Y: signal-to-noise ratio calculated on the Y frame

 SNR U: signal-to-noise ratio calculated on the U frame

 SNR V: signal-to-noise ratio calculated on the V frame

 Time: time consumed in coding

5-5-1-3 Coded Domain Features

 MV magn: motion vector magnitude (calculated only for non-static macroblocks)

 MV magn var: motion vector magnitude variance (calculated only for non-static macroblocks)

 sub MV: percentage of MVs that require subpixel interpolation (either half-pixel or quarter-pixel)

 non zero MV: percentage of non-static macroblocks

 ave energy I: average energy of AC coefficients in I-frames

 ave energy P: average energy of AC coefficients in P-frames

 MV accel: motion vector acceleration

 MV dir: motion vector change of direction

5-5-2 Analysis and Selection

Using principal component analysis (PCA) [41-42] would only change the axes onto which the features are projected, selecting the axes of highest variance. PCA alone is therefore not suitable here, as the main purpose is to omit some features and to inspect whether the source video features are important for differentiating between the video sequences.

Principal Feature Analysis (PFA) [43] provides a way to do this: it clusters the features in the high-variance axes and finds the most dominant feature groups, so that only one feature from each dominant group need be chosen. This algorithm was first applied to each of the three feature domains separately. Then, a final trial was performed on selected features from both the source and coded domains. This trial examines the possibility of completely removing the pixel domain features, which would allow all the features required for transcoding to be extracted from the pre-encoded video alone, without transmitting any additional information from the content server [44].

The PFA algorithm is as follows (a sketch follows the list):

 Calculate the PCA of the covariance matrix of the feature set.

 Choose the desired retained variability after reduction: the sum of the first q eigenvalues divided by the sum of all eigenvalues. The retained variability describes the amount of information lost in the process.

 Cluster the row vectors of the eigenvector matrix into p clusters, where p is greater than or equal to q, using the k-means algorithm.
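The sketch below implements these steps in Python. The use of scikit-learn's KMeans and the choice p = q + 1 are implementation assumptions (the algorithm only requires p >= q); the "distance from center" values reported in Tables 6 to 9 correspond to the dist quantity computed here.

import numpy as np
from sklearn.cluster import KMeans

def pfa(X, retained=0.99):
    """Principal Feature Analysis; X has one row per sample, one column per feature."""
    C = np.cov(X, rowvar=False)                 # covariance matrix of the feature set
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    ratio = np.cumsum(eigvals) / eigvals.sum()  # retained variability
    q = int(np.searchsorted(ratio, retained)) + 1
    A_q = eigvecs[:, :q]                        # one row vector per feature
    p = q + 1                                   # number of clusters, p >= q (assumption)
    km = KMeans(n_clusters=p, n_init=10, random_state=0).fit(A_q)
    selected = []
    for k in range(p):
        members = np.where(km.labels_ == k)[0]
        dist = np.linalg.norm(A_q[members] - km.cluster_centers_[k], axis=1)
        selected.append(int(members[dist.argmin()]))  # feature nearest the center
    return sorted(selected)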

5-6 Results

In this section, the results of each trial of the algorithm are presented as follows:

 Running the algorithm on the source domain features

 Running the algorithm on the resource features

 Running the algorithm on the coded domain features

 Running the algorithm on selected features from both the coded domain and the source domain

Table 6 shows the trial on the source domain features. The results show that averaging the per-frame values and selecting the I-frame values are statistically indistinguishable. The three source features selected are Ave variance, Pelspread, and Edgeact. The retained variability equals 99.3974%.

Table 7 presents the trial on the resource features. This analysis demonstrates that ME time can be used instead of encoding time without any loss of information, and that SNR can be calculated on any of the frame components (Y, U, or V) without any difference. The retained variability of this trial was 99.77155%.

Table 8 shows the trial on the coded domain features. The four selected features are MV magn, sub MV, Ave energy I, and Ave energy P.

The final trial compares features from both the source and coded domains; its results are illustrated in Table 9. The retained variability for this trial is 99.9966%.

Table 6 Source Domain Features

Cluster Index   Feature                   Distance from center
2               Ave Variance (I-frame)    0.063633
2               Ave Variance (Averaged)   0.063633
3               Pelact (I-frame)          0.0015095
3               Pelact (Averaged)         0.0019788
3               Pelspread (I-frame)       0.00086648
3               Pelspread (Averaged)      0.0013133
1               Edgeact (I-frame)         0.0045721
1               Edgeact (Averaged)        0.0045721
3               Edgespread (I-frame)      0.012588
3               Edgespread (Averaged)     0.0014647

Table 7 Resource Features

Cluster Index   Feature      Distance from center
3               Bitcount     0.18841
3               Bitcount Y   1.1781
2               ME Time      0.0012094
1               SNR V        0.02584
1               SNR U        0.025837
1               SNR Y        0.02599
2               Time         0.0012094

Table 8 Coded Domain Features

Cluster Index   Feature        Distance from center
1               MV magn        0
1               MV magn var    0
2               Sub MV         0
2               Non zero MV    0
3               Ave energy I   0
4               Ave energy P   0
1               MV accel       0
1               MV dir         0

Table 9 Final Trial

Cluster Index   Feature        Distance from center
2               MV magn        0.0016
2               Sub MV         0.0018
3               Ave energy I   0
1               Ave energy P   0
2               Ave variance   0.0481
2               PelSpread      0
2               Edgeact        0.0096

5-7 Transcoder Configuration

Figure 5-3 presents the architecture of the transcoding system. Videos are pre-encoded with the best quality supported and then passed through a transcoder that decodes the NAL units into a set of VCL information only. The transcoder changes some of this information in the coded domain and then re-encodes it into NAL units. The modified bitstream is then sent to the decoder at the client side to retrieve the pixel domain video sequence.

Figure ‎5-3 Standard Transcoder Configuration

The implementation used for the transcoder is presented in Figure 5-4. This configuration was adopted to simplify the implementation: since the NAL encoder and decoder blocks are identical, they can be omitted.

Figure ‎5-4 Adopted transcoder configuration

5-8 Transcoder Setup

The transcoder implementation is based on the coefficient dropping transcoding scheme. It was applied to all test sequences, and the same features were extracted. The details are as follows:

Transcoding parameters and amount of reduction:

 Drop 1 coefficient (6.25% reduction)

 Drop 3 coefficients (18.75% reduction)

 Drop 5 coefficients (31.25% reduction)

 Drop 7 coefficients (43.75% reduction)
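These percentages correspond to the dropped share of the 16 coefficients of a 4×4 transform block (1/16 = 6.25%, 3/16 = 18.75%, and so on). The following Python sketch illustrates the dropping operation, under the assumption that the last coefficients in 4×4 zig-zag scan order, i.e., the highest-frequency ones, are the ones zeroed.

import numpy as np

# Zig-zag scan order of a 4x4 transform block (frame coding)
ZIGZAG_4X4 = [(0,0),(0,1),(1,0),(2,0),(1,1),(0,2),(0,3),(1,2),
              (2,1),(3,0),(3,1),(2,2),(1,3),(2,3),(3,2),(3,3)]

def drop_coefficients(block, n_drop):
    """Zero the last n_drop coefficients of a 4x4 block in zig-zag order."""
    out = np.array(block, dtype=float, copy=True)
    for r, c in ZIGZAG_4X4[16 - n_drop:]:
        out[r, c] = 0.0
    return out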

In this experiment we used the features selected by the feature analysis discussed in the previous section. Those features are as follows:

 Bitcount

 ME time

 SNR Y

 Sub MV

 Ave Energy I

 Ave Energy P

 MV Magn

Figure 5-5 shows the bitrate relations between the different bitstreams and transcoding parameters. The bit rate values are normalized using the zscore function in Matlab. For $Z = \mathrm{zscore}(D)$, each column vector $V$ of $D$ is transformed as:

$Z = \dfrac{V - \mathrm{mean}(V)}{\mathrm{std}(V)}$

Figure 5-5 Normalized Bitrate against different transcoding parameters for all the test sequences

5-9 Clustering

The cluster analysis functions in Matlab were used to cluster the video sequences; an equivalent sketch in Python follows below. The dendrogram of the clustering is presented in Figure 5-6. Figure 5-7 provides a graph of the normalized bitrate values after adding the non-transcoded values, referred to in the graph as 0% transcoding.
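The sketch below reproduces this clustering step with SciPy instead of Matlab; the linkage method ('average') and the two-cluster cut are assumptions chosen to mirror the two groups visible in the dendrogram.

import numpy as np
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

def cluster_sequences(bitrates, n_clusters=2):
    """bitrates: one row per video sequence, one column per transcoding
    level (0%, 6.25%, 18.75%, 31.25%, 43.75%)."""
    data = zscore(bitrates, axis=0)          # column-wise, like Matlab's zscore
    Z = linkage(data, method='average')      # hierarchical cluster analysis
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    dendrogram(Z)                            # a Figure 5-6 style plot
    plt.show()
    return labels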

In Figure 5-7, the inversion point marked in blue shows the difference in the response of each video sequence to the transcoding technique. From the graph, it can be seen that the video sequences are grouped into two different clusters, as marked in red. The first cluster contains video sequences whose bitrate decreases with any percentage of coefficient reduction. The other group contains video sequences whose bitrate increases with the 6.25% reduction of DCT coefficients.

Figure ‎5-6 Dendrogram of the generated clusters

Figure ‎5-7 Normalized Bitrate after adding the no transcoding values



The cluster analysis done in this study was able to predict the reaction of the test videos to the transcoding process. The dendrogram shows the presence of two clusters among the test sequences: one in which a video's bitrate without transcoding is higher than any of its transcoded bitrates, and a second in which the bitrate without transcoding is lower than some of the transcoded bitrates. These two clusters are marked on the bitrate graph in Figure 5-7.

Chapter 6

Conclusion and Future Work


6-1 Conclusion

Research in multimedia transcoding has become an essential part of the field of multimedia communications. This is due to the fact that users are turning to multimedia as a key source of information, while in reality most of them use devices or networks that cannot yet handle the large amount of resources required for transmitting multimedia objects.

Multimedia middleware servers perform the required transcoding to allow a video sequence to be transferred over these networks and devices seamlessly, without any intervention from the user's side. Such a system requires a thorough understanding of the video characteristics, device capabilities, and network resources. The overall objective of this structure is to provide users with exactly the right amount of information, excluding the possibility of requiring more resources than needed.

A large number of transcoding techniques have been developed in the available literature. These techniques can alter a video sequence by modifying one or more of its parameters, leading to a variety of potential transcoded objects that can be transferred to the user. Currently, the management scheme for providing video sequences that best fit the requirements of the client devices and networks remains a challenge.

The management system for multimedia content adaptation should provide efficient use of resources on the client side while keeping the response time to client requests minimal. The concept adopted in this thesis for the implementation of the transcoding system relies mainly on studying the video content while providing different transcoding plans for different content types.

The transcoding cycle starts with an offline analysis stage that clusters the multimedia objects into categories based on their characteristics. This analysis predicts the behavior of multimedia objects with respect to the transcoding techniques. Next, the best transcoding plan is chosen. This requires a quality assessment metric to evaluate the result and guarantee the transmission of the best option available given the resources at hand.

In our study we have explored those two points. The work done in this thesis will help toward the implementation of the transcoding server, and more specifically the policy module in that server.

First, we examined quality assessment methods in order to define a valid approach to computing the amount of degradation in object quality. We defined the Contrast Error Distribution (CED) metric, which provides a good tradeoff between performance and complexity. This makes it suitable for use in transcoders, where real-time response is valued greatly.

The results showed that CED is consistent across different error domains and visual content. This characteristic allows it to be used in the loopback analysis cycle, where both time and generalizability matter most.

The proposed metric defines the perceived quality using a simple mathematical model deduced from common knowledge about the HVS. Previously available studies of FR QA models suggested that, for a metric to perform well, it has to be based on complex analysis of the image. The CED overcomes this weak point: it shows performance as high as that of the complex metrics while requiring very low computational time.

Secondly, we ran an analytical study of the types of features to be included in the offline analysis of videos. This study led to a set of features that can be used in classifying videos and predicting their behavior with respect to changes in the transcoding parameters.

The analysis showed that the pixel domain features can be omitted. This is an important fact: all the videos on the content servers will be in pre-encoded form, and therefore the pixel domain features will not be available for use in the transcoding server. As a result, the offline analysis will not require any external information other than the pre-encoded video sequence.

In our study we ran preliminary experiments which showed that, using the selected features, a clustering system is able to predict the behavior of a set of video sequences.

6-2 Future Work

The contributions discussed so far have examined the implementation of the offline data analysis and the quality assessment metric. We have examined those two segments of the transcoding server separately. Consequently, the next step would be to integrate both of the proposed structures into the implementation of a transcoding server to validate the whole theory. Moreover, we need to expand the analysis done in this thesis to include the following:

 Expand the evaluation process of the CED to include a database that contains compound error components instead of a single error component.

 Change the CED to use 16×16 windows instead of 8×8, and apply it on DCT coefficients instead of luminance values.

 Build a transcoding server that uses multiple transcoding techniques, and validate the ability of the clustering algorithm to detect the most significant clusters.

BIBLIOGRAPHY

[1] Maarten Wijnants, Patrick Monsieurs, Peter Quax, Wim Lamotte, "Exploiting Proxy-Based Transcoding to Increase the User Quality of Experience in Networked Applications," in First International Workshop on Advanced Architectures and Algorithms for Internet Delivery and Applications, Florida, 2005, pp. 73-80.
[2] (March 2002) MEDIACOM 2004 - A Framework for Multimedia
Standardization Project Description - Version 3.0. [Online].
http://www.itu.int/itudoc/itu-t/com16/mediacom/projdesc-fr.html
[3] Stefan Winkler, Digital Video Quality: Vision Models and Metrics.: Wiley,
2005.
[4] Alan C. Bovik, "Structural Approaches to image quality assessment," in
Handbook of Image and Video Processing (Communications, Networking and
Multimedia).: Academic Press , 2000.
[5] Eugene Girshtel, Vitaliy Slobodyan, Jonathan S. Weissman, Ahmet M.
Eskicioglu, "Comparison of three full-reference color image quality
measures," Proceedings of SPIE, the International Society for Optical
Engineering, 2006.
[6] Z. Wang, L. Lu, and A. C. Bovik, "Video quality assessment based on
structural distortion measurement," Signal Processing: Image
Communication, special issue on Objective video quality metric, vol. 19, no. 2,
2004.
[7] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality
assessment: from error visibility to structural similarity," IEEE
Transactions on Image Processing, vol. 13, no. 4, 2004.
[8] Z. Wang, E. P. Simoncelli and A. C. Bovik, "Multi-scale Structural
Similarity for Image Quality Assessment," in Proc. 37th IEEE
Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA,
2003.
[9] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," in
Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP '04), Montreal, Canada, 2004.
[10] H. R. Sheikh, A. C. Bovik, and G. de Veciana, "An Information Fidelity Criterion for Image Quality Assessment Using Natural Scene Statistics," IEEE Transactions on Image Processing, vol. 14, no. 12, 2005.
[11] H.R. Wu, Digital Video Image Quality and Perceptual Coding. CRC Press, 2005. ISBN 978-1420027822.
[12] Philip Corriveau, Arthur Webster. (2003) VQEG Final Report of FR-TV
Phase II. [Online]. www.vqeg.org
[13] Hamid Rahim Sheikh, Zhou Wang, Lawrence Cormack, Alan C Bovik.
LIVE Image Quality Assessment Database Release 2. [Online].
http://live.ece.utexas.edu/research/quality
[14] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits and Systems Magazine, vol. 4, no. 1, 2004.
[15] Martin Fiedler. (Seminar Paper) Implementation of a basic H.264/AVC Decoder. [Online]. http://rtg.informatik.tu-chemnitz.de/docs/da-sa-txt/sa-mfie.pdf
[16] Iain Richardson, H.264 and MPEG-4 Video Compression: Video Coding for
Next Generation Multimedia.: Wiley, Aug. 2003.
[17] Alan Ray, Hayder Radha, "Complexity-Distortion Analysis of
H.264/JVT Decoders on Mobile Devices," in IEEE Int. Conf. on
Image Processing, Suntec City, Singapore, 2004.
[18] M. Alvarez, E. Salami, A. Ramirez, M. Valero, "A performance
characterization of high definition digital video decoding using
H.264/AVC," in Proceedings of the IEEE International In Workload
Characterization Symposium, 2005.
[19] Jijun Zhang, Andrew Perkis , Nicolas Georganas, "H.264/AVC and
Transcoding for Multimedia Adaptation," in Proceedings of the 6th
COST 276 workshop, Thessaloniki, Greece, 2004.
[20] Lei Chen, Shahadat Khan, Kin F. Li, Eric G. Manning, "Building an
adaptive multimedia system using the utility model," in Proceedings of
the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th
International Parallel Processing Symposium and 10th Symposium on Parallel
and Distributed Processing, 1999.
[21] J. R. Smith, R. Mohan, and C.-S. Li, "Adapting multimedia Internet content for universal access," IEEE Transactions on Multimedia, vol. 1, no. 1, 1999.
[22] Sumi Helal, Latha Sampath, Kevin Birkett, Joachim Hammer, "Adaptive
delivery of video data over wireless and mobile environments,"
Wireless Communications and Mobile Computing, vol. 3, no. 1, 2002.

[23] Zhijun Lei, Nicolas D. Georganas, "Rate adaptation transcoding for precoded video streams," in Proceedings of the tenth ACM international conference on Multimedia, Juan-les-Pins, France, 2002.
[24] Dan Chalmers, Morris Sloman, Naranker Dulay, "Map adaptation for
users of mobile systems," in Proceedings of the 10th international conference
on World Wide Web, Hong Kong, 2001.
[25] David Gotz, Ketan Mayer-Patel, "A general framework for
multidimensional adaptation," in Proceedings of the 12th annual ACM
international conference on Multimedia, New York, NY, USA, 2004.
[26] Jae-Gon Kim, Yong Wang, Shih-Fu Chang, Hyung-Myung Kim, "An
Optimal Framework of Video Adaptation and Its Application to
Rate Adaptation Transcoding," ETRI Journal, 2005.
[27] Yong Wang, Shih-Fu Chang, Alexander C. Loui, "Content-Based
Prediction of Optimal Video Adaptation Operations Using
Subjective Quality Evaluation," Columbia University, ADVENT
Technical Report 202-2004-2, 2004.
[28] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A Statistical Evaluation of
Recent Full Reference Image Quality Assessment Algorithms,"
IEEE Transactions on Image Processing, vol. 15, no. 11, 2006.
[29] Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I. El-Adawy, "Contrast Error Distribution Measurement for Full Reference Image Quality Assessment," in The 18th International Conference on Computer Theory and Applications, Alexandria, Egypt, 2008.
[30] Damon M. Chandler, Kenny H. Lim, Sheila S. Hemami, "Effects of
spatial correlations and global precedence on the visual fidelity of
distorted images," SPIE, Human Vision and Electronic Imaging XI, vol.
6057, 2006.
[31] SSIM Index version 1.0, Matlab Code. [Online]. http://www.cns.nyu.edu/~zwang/files/research/ssim/ssim_index.m
[32] VIF, Matlab Code. [Online].
http://live.ece.utexas.edu/research/quality/vifvecrelease.zip
[33] Pixel Domain VIF, Matlab Code. [Online]. http://live.ece.utexas.edu/research/quality/vifp_release.zip
[34] S-F Chang and A. Vetro, "Video Adaptation: Concepts, Technologies
and Open Issues," Proceedings of the IEEE, vol. 93, no. 1, 2005.
[35] H.264 C++ implementation, version JM 13.0. [Online]. http://iphome.hhi.de/suehring/tml/
[36] Test Sequences. [Online]. www.cipr.rpi.edu
[37] Yong Wang, Jae-Gon Kim, Shih-Fu Chang, Hyung-Myung Kim, "Utility-
Based Video Adaptation for Universal Multimedia Access (UMA)
and Content-Based Utility Function Prediction for Real-Time Video
Transcoding," IEEE Transactions on Multimedia, vol. 9, no. 2, 2007.
[38] Dimitrios Miras, "On Quality Aware Adaptation of Internet Video,"
University of London, PhD Dissertation 2004.
[39] Catalina Crespi de Arriba, "Subjective Video Quality Evaluation and Estimation for H.264 Codec and QVGA Resolution Sequences," PhD dissertation, 2007.
[40] M. van Der Schaar, Y. Andreopoulos, "Rate-distortion-complexity
modeling for network and receiver aware adaptation," IEEE
Transactions on Multimedia, vol. 7, no. 3, 2005.
[41] Jonathon Shlens. A tutorial on Principal Components Analysis. [Online].
http://www.cs.cmu.edu/~elaw/papers/pca.pdf
[42] Lindsay I Smith. A tutorial on Principal Components Analysis. [Online].
http://www.cs.otago.ac.nz/cosc453/student_tutorials/
[43] Yijuan Lu, Ira Cohen, Xiang Sean Zhou, Qi Tian, "Feature selection
using principal feature analysis," in Proceedings of the 15th international
conference on Multimedia, Augsburg, Germany, 2007.
[44] Nora A. Naguib, Ahmed E. Hussein, Hesham A. Keshk, and Mohamed I. El-Adawy, "Using PFA in Feature Analysis and Selection for H.264 Adaptation," World Academy of Science, Engineering and Technology, vol. 54, Paris, France, June 2009.
