Faculty of Engineering
Department of Electronics,
Communications, and Computers
MULTIMEDIA MIDDLEWARE
by
nora.naguib@yahoo.com
Supervised by:
Prof. Mohamed I. El Adawy
Faculty of Engineering, Helwan University
2010
ACKNOWLEDGEMENT
It is a pleasure to thank those who made this thesis possible. I would like to
express my gratitude to Prof. Mohamed I. El-Adawy for his constant support
and most valuable advice. I would like to thank the rest of the supervisory
committee for all their help and Dr. Ahmed E. Hussien for the suggestion of
reference titles.
I would also like to thank my family for the support they provided me
through my entire life and in particular, I really cannot express my full
gratitude to my brother Yasser Naguib who patiently proofread this entire
thesis. Special thanks go to my brother Wael Naguib without whose
motivation and encouragement I would not have considered a post graduate
degree. Above all, to my mother who stood beside me all the time.
PUBLICATIONS
ABSTRACT
In today's world, users have heterogeneous devices connected to a mesh of networks, each with different capabilities and restrictions. Multimedia content providers need innovative approaches: rather than keeping only one version of each video, they must be able to offer different bitstreams for a variety of client capabilities. The previously used "one size fits all" design cannot cope with the diverse environments present today, and a single bitstream with static parameters cannot satisfy the diversity found on the client side. This is why researchers in Universal Multimedia Access (UMA) are working on new techniques for coding multimedia objects with maximum compression efficiency along with flexibility in the parameters of the provided video when dealing with client devices.

The transcoding of multimedia objects requires intermediate systems that are capable of altering the bitstream on demand. Those systems should be able to manipulate different bitstream formats. A large number of adaptation techniques exist in today's literature, each specialized in altering the video bitstream with respect to only one dimension, namely temporal (frame rate), spatial (resolution), Signal to Noise Ratio (SNR), or format conversion. In the real world, adaptation of video sequences should take the form of multi-dimensional adaptation, allowing the system to apply a combination of reduction processes to different parameters of the video sequence while providing the best possible quality.

In this thesis, we have focused on the transcoder policy module. While most previous studies in multimedia transcoding concentrated on the transcoding techniques themselves, the lack of a control algorithm rendered those techniques of little practical use. The study was therefore directed toward the creation of an offline data analysis model for the transcoder's policy module.

The results and analysis provided in this thesis contribute toward the creation of a policy module that controls the transcoder operation for universal multimedia access.
TABLE OF CONTENTS
Introduction 1
1-1 Motivation 1
1-2 Problem Statement 3
1-3 Objectives and contributions 4
1-4 Thesis Outline 5
Multimedia Communications Basics 7
2-1 ITU-T MediaCom2004 project 7
2-2 MPEG-7 and MPEG-21 8
2-3 Coding Standards 9
2-4 Transcoding Vs Scalable Coding 10
2-5 Quality Assessment 12
Related Work 15
3-1 Quality Assessment 15
3-1-1 Background 15
3-1-2 Simple Quality Metrics 16
3-1-3 Objective Quality Metrics 17
3-1-3-1 Using DCT, DWT, and DFT 18
3-1-3-2 Perceptual Distortion Metric (PDM) 19
3-1-3-3 Structural Similarity 20
3-1-3-4 Visual Information Fidelity and Natural Scene Statistics 22
3-2 Subjective Experiments 23
3-2-1 Double Stimulus Impairment Scale (DSIS) 24
3-2-2 Double Stimulus Continuous Quality Scale (DSCQS) 24
3-2-3 Single Stimulus Continuous Quality Scale (SSCQS) 24
3-3 VQEG 25
3-4 Benchmark 26
3-4-1 Error Domains 26
3-4-2 Subjective Experiment 26
3-4-3 Realignment Process 27
3-4-4 Datasets 27
3-5 H.264 Review 28
3-6 Multimedia Transcoding 33
3-6-1 Transcoding Techniques 34
3-6-2 Control Schemes 36
Quality Assessment 43
4-1 Introduction 43
4-2 Proposed Metric 44
4-3 Metric Evaluation Process 46
4-3-1 Subjective data rescaling 47
4-3-2 Nonlinear Regression 47
Bibliography 83
LIST OF FIGURES
LIST OF TABLES
TABLE 1 COMPARISON BETWEEN THE PSNR, SSIM, CED, PD-VIF, LOG(CED), LOG(VIF) WITH RESPECT
TO CC: PEARSON CORRELATION COEFFICIENT, SROCC: SPEARMAN RANK CORRELATION
COEFFICIENT, RMSE: ROOT MEAN SQUARE ERROR ............................................................. 51
TABLE 2 PEARSON CORRELATION COEFFICIENT OF THE SSIM, CED, PD-VIF, LOG(CED), LOG(VIF).
CALCULATED FOR THE DISTORTION DOMAINS JPEG2000, JPEG, WHITE NOISE, GAUSSIAN BLUR,
AND FAST FADING ........................................................................................................... 51
TABLE 3 SPEARMAN RANK CORRELATION COEFFICIENT OF THE SSIM, CED, PD-VIF, LOG(CED),
LOG(VIF). CALCULATED FOR THE DISTORTION DOMAINS JPEG2000, JPEG, WHITE NOISE,
GAUSSIAN BLUR, AND FAST FADING................................................................................... 51
TABLE 4 ROOT MEAN SQUARE ERROR OF THE SSIM, CED, PD-VIF, LOG(CED), LOG(VIF). CALCULATED
FOR THE DISTORTION DOMAINS JPEG2000, JPEG, WHITE NOISE, GAUSSIAN BLUR, AND FAST
FADING. ........................................................................................................................ 52
TABLE 5 EVALUATION OF THE QUALITY METRICS ............................................................................ 52
TABLE 6 SOURCE DOMAIN FEATURES ........................................................................................... 71
TABLE 7 RESOURCE FEATURES..................................................................................................... 72
TABLE 8 CODED DOMAIN FEATURES ............................................................................................ 72
TABLE 9 FINAL TRIAL ................................................................................................................. 72
ACRONYM
Chapter 1
Introduction
1.
1-1 Motivation
Multimedia plays an important role in our lives. Terms have entered industry, culture, and leisure that depend entirely on the evolution of the Multimedia Communications field. Working with a team member overseas through your laptop would never have been possible were it not for video conferencing capabilities. The term webinar was not used until a few years ago, when it was found that a web-based seminar could reach its entire target audience more effectively, regardless of distance.
The growth in the number of users with access to the internet, along with the tremendous increase in their network capabilities and mobility, has paved the way for an increase in the amount of data accessed and uploaded through the internet. At least 70% of this data consists of multimedia objects, and those users spend more than 20% of their time away from their primary workplace.
For a relatively long time now, we have been used to having two types of networks available to us: telecommunications networks and IT (Information Technology) networks. Though there are interconnections between them, we have not yet reached a full combination of the two. To achieve this merge, the ITU-T (International Telecommunication Union - Telecommunication Standardization Sector) is working on the standardization of what are called Next Generation Networks.
Multimedia middleware consists of intermediate systems, placed between the client and the content server, that provide a number of complementary services. The generalized block diagram of multimedia middleware is illustrated in Figure 1-1. These servers transcode multimedia objects before delivery and deliver them to the user. The middleware server needs to fit within the existing system and be transparent to both the content server and the client.
When adding a new multimedia object to the content server, the time
required for the transcoding server to analyze the content of the
video should be minimized.
The time from the reception of a client request until the delivery of the content back to the user should be minimized.
The server should have the means to assess the quality of the
generated version of the multimedia object and choose between
different transcoding schemes.
The objective of this research is to examine the first two stages. This work
will help toward the practical implementation of the middleware server
control module. The contributions of this research are concentrated in the following:
Chapter 2
Multimedia Communications
Basics
2.
ITU-T SG16, the lead Study Group for multimedia, is working on the MEDIACOM 2004 (Multimedia Communication 2004) project [2]. The objective of the MEDIACOM 2004 project is to establish a framework for multimedia standardization for use both inside and outside the ITU. This framework will support the harmonized and coordinated development of global multimedia communication standards across all ITU-T and ITU-R Study Groups, in close cooperation with other regional and international standards development organizations (SDOs).
Figure 2-1 presents the Multimedia framework study areas (MM FSA) as
defined by the Mediacom project.
MPEG-4 and H.264 are the newest standards for multimedia coding developed by MPEG. They both rely on the same coding principles but with significantly different visions: MPEG-4 is mainly concerned with flexibility, whereas H.264 features efficient compression and reliability.
As stated above, the difference between the two standards does not reside in the theory of the compression module itself, but in how the input is treated. In MPEG-4, the input of the compression module is a series of multimedia objects contained in video frames, whereas H.264 uses frame-based compression.
Scalable coding and transcoding are the two coexisting lines of UMA research, each with its own advantages and limitations. Scalable coding has the advantage of processing videos in advance and therefore does not require any intermediate system. However, it means that the resource/quality degradation of the video bitstream can be performed only in predefined steps, so it does not match the exact client requirements.
In other words, scalable coding leaves an error margin between the provided bitstream and the requested resource/quality, whereas transcoding tailors video bitstreams to the exact device/network requirements provided by the client requests.
During the assessment of a reduced bitstream, we should bear in mind that quality measurement of multimedia objects is not defined as the fidelity of the new bitstream to the original. Quality, when it comes to multimedia objects, means perceived quality, which implies that some errors are more important than others. The perceived quality is related to the limitations of the Human Visual System (HVS): some errors are neutral while others are severely perceived.
The degree to which the alteration of a video bitstream has affected the perceived quality can be determined by either subjective experiments or objective quality metrics. Subjective experiments refer to the viewing of videos by human observers, where each observer rates the video quality and a mean opinion score is then calculated for the video. Objective quality metrics measure the degradation of visual perceptual quality by defining a criterion that describes the perceptual error.
Chapter 3
Related Work
3.
3-1-1 Background
The complexity and nonlinearity of the HVS are characteristic features that have been used to trick audiences' eyes for ages. Throughout the study of the HVS, a number of facts have been discovered that made the generation of multimedia objects as we know them today possible. For example, using a frame rate of more than 50 Hz to deceive the human eye into seeing a moving video, and using lossy coding algorithms that exploit dependencies in the spatio-temporal information to remove redundant data from the video stream, both rely on the nonlinearity of the HVS.
on the screen with respect to the original video stream. This clarifies why a fidelity measure such as the SNR would fail to describe the opinion of the observer.
Up until now, subjective experiments have been used for the assessment of multimedia quality. However, those experiments are impractical, expensive, and time consuming. Hence, they cannot be used to estimate the quality of multimedia objects during reproduction. Researchers in the field of multimedia quality assessment are therefore working on the development of objective metrics that can predict the observer's opinion about the quality of multimedia objects.
To calculate the PSNR between the original and distorted images, we start by
calculating the MSE (Mean Square Error) of pixels’ grayscale values.
$MSE = \dfrac{1}{F\,X\,Y}\sum_{f=1}^{F}\sum_{x=1}^{X}\sum_{y=1}^{Y}\big(Original(x, y, f) - Distorted(x, y, f)\big)^2$    [3]

where the images have a width of X pixels and a height of Y pixels, and the video sequence contains F frames.

$PSNR = 10\log_{10}\dfrac{I^2}{MSE}$    [3]

where I is the maximum possible intensity value (255 for 8-bit samples).
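As an illustration only, the following minimal Python sketch computes the two quantities above for a small synthetic sequence; the array shapes, the random test data, and the 8-bit peak value of 255 are assumptions of the example rather than settings taken from this study.

```python
import numpy as np

def mse(original, distorted):
    """Mean squared error over all frames and pixels (luminance values)."""
    original = np.asarray(original, dtype=np.float64)
    distorted = np.asarray(distorted, dtype=np.float64)
    return np.mean((original - distorted) ** 2)

def psnr(original, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum sample value."""
    err = mse(original, distorted)
    if err == 0:
        return float("inf")  # identical signals
    return 10.0 * np.log10(peak ** 2 / err)

# Usage: two toy 8-bit "video sequences" of shape (frames, height, width)
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(5, 64, 64))
dist = np.clip(ref + rng.normal(0, 5, size=ref.shape), 0, 255)
print(f"PSNR = {psnr(ref, dist):.2f} dB")
```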
From the above we can see that the MSE defines the difference between the
two signals and the PSNR defines the fidelity of the distorted image to the
original. In [4] the authors illustrate why error power cannot be used as a
metric for the perceptual quality. They considered the following cases:
In these two cases, although the error has identical power values, the two images may have different perceptual quality. In other words, the type of error should be studied with respect to its effect on the HVS and the image at hand.
specific and relied on prior information about the distortion process that the multimedia object went through (for example, coding algorithms introduce blocking artifacts).
Three types of references can be used for quality assessment: Full Reference
(FR), Reduced Reference (RR), and No Reference (NR). In FR QA (Full
Reference Quality Assessment) the original image is compared to the
reproduced image, while in RR QA only some features of the original image
are used in the comparison. NR QA refers to techniques that rely on natural image features to decide about the quality of the image without referring to any outside information. Obviously, FR and RR are not very suitable for the transmission quality problem, due to the need for the original image or some of its features at the receiver. However, FR and RR are very useful when developing coding and transcoding techniques. These metrics are used to judge the quality of an image when the original is already available.
3-1-3-2 Perceptual Distortion Metric (PDM)
3-1-3-3 Structural Similarity
The argument used in this metric is based on the idea that the human eye is tuned to detect structural error. By this definition, three types of error can be introduced to multimedia objects: variation of average local luminance, variation of contrast, and structural error. The first two do not contribute to the degradation of the perceived quality. Thus, by removing those two error types, we can calculate the structural error, which defines the amount of degradation in the image quality. The block diagram of the Structural Similarity (SSIM) index is shown in Figure 3-2.
Luminance error:

$l(x, y) = \dfrac{2\mu_x \mu_y}{\mu_x^2 + \mu_y^2}$

Contrast error:

$c(x, y) = \dfrac{2\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$

Structure error:

$s(x, y) = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$
where $\mu_x$ and $\mu_y$ are the mean luminance values of the original and distorted images, and $\sigma_x$, $\sigma_y$, and $\sigma_{xy}$ are their standard deviations and cross-covariance.
From the above, the authors in [4], [6], and [7] present the structural error as the cosine of the angle between the original $(x - \mu_x)$ and the distorted $(y - \mu_y)$ image vectors. This logic assumes that, after the removal of the luminance and contrast errors, the remaining errors can be pictured as lying on a circle: all of them have the same error power but different angles, and the angle defines the effect on the perceived quality.
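For illustration, the sketch below evaluates the three comparison terms above on a pair of toy images. It uses global image statistics for brevity, whereas the published SSIM index works on local windows and adds small stabilizing constants; the eps term here merely stands in for those constants and is an assumption of the sketch.

```python
import numpy as np

def ssim_components(x, y, eps=1e-8):
    """Luminance, contrast and structure comparison terms computed from
    global image statistics (per-window statistics are used in the full SSIM)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    sig_x, sig_y = x.std(), y.std()
    sig_xy = ((x - mu_x) * (y - mu_y)).mean()
    l = (2 * mu_x * mu_y + eps) / (mu_x ** 2 + mu_y ** 2 + eps)
    c = (2 * sig_x * sig_y + eps) / (sig_x ** 2 + sig_y ** 2 + eps)
    s = (sig_xy + eps) / (sig_x * sig_y + eps)
    return l, c, s

# Usage on a toy pair of images: contrast and luminance change, structure preserved
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64)).astype(float)
dist = ref * 0.9 + 10
print(ssim_components(ref, dist))
```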
Figure 3-3 Block Diagram of the Multi-scale Structural Similarity L: Low pass
filtering; 2↓: Down sampling by 2
3-1-3-4 Visual Information Fidelity and Natural Scene Statistics
conveyed correctly between the original and distorted image through to the
observer.
Natural Scene Statistics (NSS) rely on the fact that natural scenes occupy a tiny subspace of all possible permutations of pixel values; as a result, natural undistorted images can be described by a small number of statistical features. Visual Information Fidelity (VIF) defines the perceived quality as the difference in mutual information between the input and output of the HVS for the no-distortion and distortion channels.
Subjective experiments [11] are required for the evaluation of Video Quality Metrics (VQMs). In these experiments, human subjects are requested to review, evaluate, and assess the quality of the images available in the database. The subjects are normally screened for visual acuity and color blindness to make sure the quality scores accurately describe the perceived quality of each image. Moreover, a viewing session should last less than 30 minutes to reduce the effect of fatigue on the observers.
Figure 3-5 Subjective Experiments: Viewing Modes (On the Left) Score
Scale (On the Right). (A) Double Stimulus Impairment Scale (DSIS) (B)
Double Stimulus Continuous Quality Scale (DSCQS) (C) Single Stimulus
Continuous Quality Scale (SSCQS)
3-3 VQEG
The Video Quality Experts Group (VQEG) was formed in 1997. Its main objective is to validate and standardize objective quality assessment models. Moreover, the group works toward the standardization of performance metrics for validating those objective models. So far, the VQEG has completed two sets of tests.
3-4 Benchmark
The evaluation cycle for the metric proposed in this thesis did not include the execution of subjective experiments. Instead, we used the University of Texas "LIVE Image Quality Assessment Database Release 2" [13]. The database contains 982 images, of which 203 are reference images and 779 are distorted images.
JPEG2000: bit rate ranging from 0.028 bits per pixel to 3.15bpp
at least 20-29 human observers. The single stimulus method was used, and the database was rated in 7 separate viewing sessions.
The fact that the images were reviewed in more than one session led to a scale mismatch in the scores given to those images. Therefore, an extra round of review was performed using the double stimulus methodology on 50 randomly selected images.
For a single image, if an individual score is considered an outlier, i.e. it falls outside a certain interval, defined by the standard deviation, around the mean score for that image, the point is removed from the DMOS calculation for that image.
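A minimal sketch of this screening step is shown below, assuming a rejection interval of k standard deviations around the mean score; the exact interval used for the LIVE database is not restated here, so k is a placeholder.

```python
import numpy as np

def reject_outliers(scores, k=2.0):
    """Drop scores lying more than k standard deviations from the mean score
    of an image; return the remaining scores. The threshold k is an assumption."""
    scores = np.asarray(scores, dtype=np.float64)
    mu, sd = scores.mean(), scores.std()
    keep = np.abs(scores - mu) <= k * sd
    return scores[keep]

raw = [62, 58, 65, 60, 12, 63]          # one rater clearly off-scale
print(reject_outliers(raw).mean())      # mean opinion score after screening
```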
3-4-4 Datasets
The database of images is accompanied by a number of datasets that define the benchmark values of the perceived quality for each of the 982 images available in the database.
dmos.mat: contains two arrays of length 982 each: DMOS and orgs.
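For reference, the snippet below shows one way to load these arrays with scipy. The lowercase variable names and the assumption that an orgs value of zero marks a distorted image reflect common usage of the LIVE release and should be checked against the database documentation.

```python
from scipy.io import loadmat

mat = loadmat("dmos.mat")        # shipped with the LIVE release 2 database
dmos = mat["dmos"].ravel()       # one DMOS value per image (982 entries)
orgs = mat["orgs"].ravel()       # flag array of the same length marking reference images

distorted_dmos = dmos[orgs == 0]  # assumption: orgs == 0 marks a distorted image
print(distorted_dmos.shape, distorted_dmos.mean())
```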
Throughout this study, the H.264 standard was used as the main compression
technique for encoding and transcoding all test sequences. In this section we
are going to review this standard and demonstrate its new features.
Figure 3-6 (A) Video Coding Layer (VCL) and Network Abstraction Layer (NAL) arrangement. (B) NAL unit
The VCL is responsible for the efficient coding of the video frames and for delivering the coded information to be formatted by the NAL. The main aim of the NAL is to arrange all of the coded information in a way that can be understood by the receiver. All the information is sent in what are known as NAL units; these units act as packets that can be handled separately by the transport layer for transmission or for storage in a file. Each NAL unit consists of a NAL header, which specifies how the information within the unit is to be handled, and the payload data.
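To make the packaging concrete, the sketch below splits the single header byte that precedes each NAL unit payload into its three fields, following the field layout of the H.264 specification; the example byte 0x67 is a typical sequence parameter set header.

```python
def parse_nal_header(first_byte: int) -> dict:
    """Split the one-byte H.264 NAL unit header into its three fields."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x01,  # must be 0 in a valid stream
        "nal_ref_idc": (first_byte >> 5) & 0x03,          # importance as a reference
        "nal_unit_type": first_byte & 0x1F,               # e.g. 5 = IDR slice, 7 = SPS
    }

print(parse_nal_header(0x67))  # a commonly seen SPS header byte
```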
The term slice refers to a set of macroblocks in raster order that are to be coded with the same type, i.e. I, P, B, SI, or SP. Macroblocks are defined as the 16x16 pixel regions into which each frame is divided.
The processing in the macroblock layer is divided into two categories: intra and inter coding. In intra coding, a macroblock is predicted using only spatial information, i.e., macroblocks from the same frame. In inter coding, the prediction relies on temporal dependencies: an area from a previously coded frame is copied and assigned to the macroblock currently being encoded. The encoder then sends the motion vectors, the reference frames, and the error signal between the predicted and the current macroblock. Motion vectors, however, are not sent to the receiver as absolute values. Only a displacement motion vector is sent, to adjust the values predicted by the receiver. This relies on the fact that motion prediction in the encoder and the decoder is identical: motion vectors are predicted from the surrounding macroblocks, and a compensation MV is then sent to the receiver to correct the predicted value.
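A simplified sketch of this prediction-plus-correction scheme is given below. It uses the component-wise median of three neighbouring motion vectors, as H.264 does in the common case, and ignores the special cases defined in the standard.

```python
def predict_mv(mv_left, mv_top, mv_topright):
    """Median prediction of a motion vector from neighbouring macroblocks;
    each argument is an (x, y) pair in quarter-pel units."""
    xs = sorted(v[0] for v in (mv_left, mv_top, mv_topright))
    ys = sorted(v[1] for v in (mv_left, mv_top, mv_topright))
    return (xs[1], ys[1])  # component-wise median

def reconstruct_mv(predicted, mvd):
    """Decoder side: the transmitted displacement (MVD) corrects the prediction."""
    return (predicted[0] + mvd[0], predicted[1] + mvd[1])

pred = predict_mv((4, 0), (6, -2), (5, 1))
print(reconstruct_mv(pred, mvd=(1, -1)))   # the encoder sent only the small correction
```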
Motion prediction in H.264 supports half- and quarter-pixel accuracy; the intensity values at fractional pixel positions are determined by interpolation.
The following is a list of the differences between H.264 and earlier standards:
H.264 uses a 4x4 integer transform instead of the former 8x8 DCT transform (a sketch of this core transform is given below).
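The sketch below applies the 4x4 forward integer transform matrix defined by H.264 to a toy residual block; the per-coefficient scaling that the standard folds into quantization is deliberately omitted.

```python
import numpy as np

# Forward 4x4 integer core transform of H.264 (an integer approximation of the DCT).
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_transform_4x4(block):
    """Core transform Y = Cf . X . Cf^T; post-scaling is folded into quantization."""
    return Cf @ np.asarray(block) @ Cf.T

residual = np.arange(16).reshape(4, 4) - 8   # toy residual block
print(forward_transform_4x4(residual))
```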
The standard defines a set of profiles in which H.264 can operate: baseline, main, and extended. Each profile defines the accepted syntax and the tools that may be used. The profiles are shown in Figure 3-9. In this study we have used the Baseline profile.
H.264 is the most efficient coding algorithm with respect to bit rate reduction, yet the most complex among its peers. In [17] the authors performed a number of tests to analyze the complexity-distortion relationship within H.264. They found that P frames are more efficient with respect to distortion and complexity, but require a higher bitrate than sequences containing B frames. The authors in [18] show that the processing time of H.264 is dominated by the deblocking filter (49.01%) and fractional pixel interpolation (19.98%).
Another argument was made by the authors in [21]: offline transcoded objects can be arranged in what is called an info-pyramid. The info-pyramid is by definition a progressive data representation scheme. Objects stored in the info-pyramid have different resolutions and abstraction levels:
On the other hand, the authors in [22] proposed a model with three dimensions:
User preferences
Another type of control scheme was proposed in [23]. The system operates in real time and uses single-dimensional transcoding to fit videos to the available bit rate. A buffer-based control scheme is used: the system exploits the relation between delay, buffer occupancy, and bitrate. Two types of transcoding were used, re-quantization and frame dropping. The number of bits required to encode a frame is estimated from information gathered from previously encoded frames.
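The sketch below is not the algorithm of [23] itself but a hedged illustration of the general idea: the size of the next frame is estimated from recently coded frames, and the projected buffer occupancy decides between passing the frame, re-quantizing it, or dropping it. The thresholds, capacities, and averaging window are assumptions of the example.

```python
from collections import deque

class BufferController:
    """Toy buffer-occupancy control loop for a single-dimensional transcoder."""

    def __init__(self, capacity_bits, drain_bits_per_frame, history=5):
        self.capacity = capacity_bits
        self.drain = drain_bits_per_frame       # bits removed per frame interval
        self.occupancy = 0
        self.recent_sizes = deque(maxlen=history)

    def estimate_next(self):
        """Predict the next frame size as the mean of recently produced frames."""
        if not self.recent_sizes:
            return self.drain                   # no history yet: assume nominal size
        return sum(self.recent_sizes) / len(self.recent_sizes)

    def decide(self):
        projected = self.occupancy + self.estimate_next() - self.drain
        if projected > self.capacity:
            return "drop_frame"
        if projected > 0.8 * self.capacity:
            return "requantize"                 # coarser QP to shrink the frame
        return "pass_through"

    def commit(self, produced_bits):
        self.recent_sizes.append(produced_bits)
        self.occupancy = max(0, self.occupancy + produced_bits - self.drain)

ctrl = BufferController(capacity_bits=400_000, drain_bits_per_frame=40_000)
for bits in (90_000, 95_000, 120_000, 150_000, 160_000, 170_000):
    print(ctrl.decide())
    ctrl.commit(bits)
```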
Transcoding is done offline, and the system requests the resources of the highest-utility selection based on the user's preferences; if this fails, a negotiation cycle is started until enough resource reduction is achieved.
The authors developed the arrangement of ARU spaces, which stand for the Adaptation, Resource, and Utility spaces respectively. Imagine an adaptation space in which each point is mapped onto the resource and utility spaces. In the resource space, we can determine how much complexity reduction or bitrate reduction this point (adaptation process) would cause. In the utility space, the system can compare two adaptation processes with respect to the quality of the multimedia object. A conceptual illustration of the system is given in Figure 3-16.
The curves for these three spaces cannot be derived from a single video sequence, since each video sequence can react differently to adaptation processes. The authors developed a system for generating the utility functions by extracting a set of features from the video sequences. Those features are then used to cluster the sequences into a number of predefined clusters that are expected to behave in the same way with respect to the different adaptation processes. Those clusters are defined through the analysis of a set of test sequences.
Chapter 4
Quality Assessment
4.
4-1 Introduction
Our work on objective quality assessment was mainly driven by the need for an objective model to be used in the policy module of the transcoding engine. This FR QA model should possess the following properties in order to replace the need for subjective experiments:
These features are crucial for the metric to be used in practice in place of human observers. Research in quality assessment has revealed different perspectives on perceptual error. Although these definitions of perceptual error make use of high-level image features, none of them has reached the optimal criteria for providing the metric features described above.
Studies examining how the HVS treats the received visual information found that the HVS does not treat images as luminance values but as contrast differences. Moreover, this contrast-based response varies with the viewing distance. This has led HVS-based metrics to apply a contrast sensitivity function after decomposing the image into spatial and temporal bands.
The metric presented here builds on this fact. If the change in contrast values is well distributed over the entire image, the HVS will not capture this type of error, since the relations between the contrast values are maintained. Conversely, a contrast change caused by a distortion with a large standard deviation will modify the contrast relations in the image.
Calculate the local contrast for the original and distorted images using only the luminance component (the contrast of an image region is simply its standard deviation). We used only the luminance values, as the HVS is known to have higher achromatic acuity than chromatic acuity.
After the first evaluation cycle, we found that the above criterion holds quite well for the JPEG, JPEG2000, white noise, and fast fading distortion domains. However, for the Gaussian blur domain, the metric did not correlate with the subjective experiment outputs.
These results are reasonable: the contrast of the error introduced by Gaussian blur tends to have a weak standard deviation, yet this type of error still modifies the local contrast information in the image. The authors in [30] found that image content plays an important role in the effect of an error on the perceived quality. Therefore, we modified the metric by referencing its output values to the standard deviation of the reference image contrast. The block diagram of the metric is shown in Figure 4-1.
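The following Python sketch is one possible reading of the steps described above and is not the exact CED implementation used in this work: local contrast is taken as the standard deviation of luminance in 8x8 blocks, the spread of the contrast error is measured, and the result is referenced to the spread of the reference image's own contrast.

```python
import numpy as np

def local_contrast(lum, block=8):
    """Standard deviation of luminance within non-overlapping block x block windows."""
    h, w = lum.shape
    h, w = h - h % block, w - w % block              # crop to a multiple of the block size
    tiles = lum[:h, :w].reshape(h // block, block, w // block, block)
    return tiles.std(axis=(1, 3))

def ced(reference, distorted, block=8):
    """Contrast-error based score: spread of the local-contrast error,
    referenced to the spread of the reference image's own contrast."""
    c_ref = local_contrast(np.asarray(reference, float), block)
    c_dis = local_contrast(np.asarray(distorted, float), block)
    contrast_error = c_ref - c_dis
    return contrast_error.std() / (c_ref.std() + 1e-8)

rng = np.random.default_rng(2)
ref = rng.integers(0, 256, size=(128, 128)).astype(float)
noisy = np.clip(ref + rng.normal(0, 15, ref.shape), 0, 255)
print(ced(ref, noisy))
```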
The metric evaluation process is not just a simple measurement of the resemblance between the DMOS values and the Video Quality Ratings (VQRs). A number of measures are applied to the VQRs to confirm whether the metric gives good results regardless of the error type, the image content, or the amount of quality degradation.
$DMOS_p(VQR) = \dfrac{b_1}{1 + e^{-b_2 (VQR - b_3)}}$    [28]

$r^2 = \dfrac{\sigma_{xy}^2}{\sigma_x^2 \, \sigma_y^2}$

where

$\sigma_{xy} = \sum_i (x_i - \mu_x)(y_i - \mu_y)$,  $\sigma_x^2 = \sum_i (x_i - \mu_x)^2$,  $\sigma_y^2 = \sum_i (y_i - \mu_y)^2$

which, for mean-removed variables x and y, gives the correlation coefficient

$r = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \, \sum_i y_i^2}}$
4-3-5 Prediction Consistency
Outlier Ratio:

$Outlier\ Ratio = \dfrac{N_o}{N}$

where $N_o$ is the number of outlier points and $N$ is the total number of data points.
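A compact sketch of this evaluation pipeline is given below: the logistic mapping above is fitted with scipy, and the correlation, RMSE, and outlier ratio are computed on the fitted predictions. The synthetic data and the definition of an outlier as a residual larger than k residual standard deviations are assumptions of the sketch; VQEG practice commonly ties the interval to the per-image DMOS error instead.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(vqr, b1, b2, b3):
    """Three-parameter logistic mapping from raw metric scores (VQR) to predicted DMOS."""
    return b1 / (1.0 + np.exp(-b2 * (vqr - b3)))

def evaluate_metric(vqr, dmos, k=2.0):
    (b1, b2, b3), _ = curve_fit(logistic, vqr, dmos,
                                p0=[dmos.max(), 1.0, vqr.mean()], maxfev=10_000)
    dmos_p = logistic(vqr, b1, b2, b3)
    residual = dmos - dmos_p
    return {
        "pearson": pearsonr(dmos_p, dmos)[0],
        "spearman": spearmanr(vqr, dmos)[0],
        "rmse": np.sqrt(np.mean(residual ** 2)),
        # prediction consistency: share of points further than k residual std-devs away
        "outlier_ratio": np.mean(np.abs(residual) > k * residual.std()),
    }

# Synthetic example data standing in for real (VQR, DMOS) pairs
rng = np.random.default_rng(3)
vqr = rng.uniform(0, 1, 200)
dmos = 80 / (1 + np.exp(-6 * (vqr - 0.5))) + rng.normal(0, 4, 200)
print(evaluate_metric(vqr, dmos))
```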
4-4 Results
Table 4 Root Mean Square Error of the SSIM, CED, PD-VIF, Log(CED), Log(VIF), calculated for the distortion domains JPEG2000, JPEG, White Noise, Gaussian Blur, and Fast Fading.

Metric                JP2k     JPEG      WN        GBlur    FF
SSIM                  9.2222   10.5526   6.8789    9.3565   10.6995
CED (Proposed)        7.6804   8.2344    10.4274   6.8455   9.6306
PD-VIF                6.1433   7.1296    6.6276    5.5593   14.0610
Log(CED) (Proposed)   7.0897   7.2565    6.6182    4.5263   7.6321
Log(VIF)              5.6908   7.8561    5.5314    4.4474   9.0253
From the results, it can be seen that CED provides a good tradeoff between performance and complexity: it runs in about 1.5 seconds per image, whereas metrics with comparable results take about 12 seconds per image.
Figure 4-3 shows the scatter plot of the DMOS against the predicted DMOS values. This scatter plot reveals the outlier points. For the metric to perform well, the scatter points should lie close to the diagonal of the graph. Moreover, the points should be distributed evenly across the range of the perceived quality.
It can be seen from Figure 4-3 that the metrics have two empty spots, one near the origin and the other at the far side of the graph, as highlighted in red. The empty spot near the origin means that the zero point is translated to a different value in the predicted DMOS. The graph for the CED shows that these empty spots have shrunk significantly, so the response of the CED is improved for error figures located in those areas of the graph.
Figure 4-4 shows the calibration curves of the five distortion domains from the database used in the experiment. For a VQM to perform stably across different types of distortion, its calibration curves should be indistinguishable. In the figure, we can see that the calibration curves do not overlie each other, although they are adjacent. The points of intersection mark the amount of error at which the metric reacts to the different types of error indifferently; elsewhere, the metric is more or less sensitive to certain types of error.
Figure 4-2 Scatter plot of VQRs against DMOS values (Blue), and Nonlinear
Logistic fitting curve (Black). This was calculated for 6 VQM: PSNR, SSIM,
VIF, PD-VIF, CED, Log(CED) respectively
RMSE=13.4713
Figure 4-3 Scatter plot of predicted DMOS (VQRs after logistic regression)
against DMOS values. This was calculated for 6 VQM: PSNR, SSIM, VIF,
PD-VIF, CED, Log(CED) respectively
RMSE=12.1396
RMSE=9.8798
RMSE=8.1708
RMSE=9.9549
RMSE=8.3168
Figure 4-4 Calibration Curves for each error domain: JPEG2k (Green),
JPEG (Red), White Noise (Blue), Gaussian Blur (Magenta), Fast Fading
(Cyan) and all error domains (Black). This was calculated for 6 VQM: PSNR,
SSIM, VIF, PD-VIF, CED, Log(CED)
Chapter 5
Data Analysis
5.
5-1 Introduction
The problem lies in the fact that not all video sequences react in the same way to transcoding processes. A given amount of transcoding can result in a different amount of resource reduction for different video sequences. This is due to the varied complexity of the video content.
The authors in [34] put together a systematic procedure for designing video adaptation technologies; the steps are as follows:
The main aim of the offline data analysis stage is to define the main classes of multimedia objects. Each class has its own resource-transcoding-quality graph, which contributes to the policy module decision.
The presented study relies mainly on the idea of finding key features that characterize the differences between video sequences. Those video sequences usually reach the transcoding server in a pre-encoded form, so the transcoding server should be able to distinguish the class of a sequence using only the information present in the coded domain.
Baseline Profile
QP=28
To be coded in IPPP
The test video sequences used in this study are presented in [36]. Those video sequences are single-shot video segments; therefore, each video sequence is encoded with the first frame as an I-frame and the rest of the frames as P-frames. The complexity of each video sequence is described in Figure 5-2.
Figure 5-2 Test Sequences Description
5-5 Features
thesis concluded that many of these features convey the same information and that some of them can be omitted from the proposed model.
5-5-1-1 Source Domain Features
Variance: Average variance of the luminance pixels
5-5-1-2 Resources Required
bitcount: Bitcount for coding a macroblock, accumulated over the whole frame.
5-5-1-3 Coded Domain Features
first. Then, a final trial was performed on selected features from both the source and coded domains. This trial is used to examine the possibility of completely removing the pixel domain features. This will help toward extracting all the features required for transcoding from the pre-encoded video information alone, without the need to transmit any additional information from the content server [44].
5-6 Results
In this section, the results of each trial of the algorithm are presented as follows:
In Table I, the trial of the source domain features is presented. The results show that averaging the per-frame values or selecting the I-frame values are statistically
In Table II, the trial of the resource features is presented. This analysis demonstrates that the ME time can be used instead of the encoding time without any loss of information, and that the SNR can be calculated on any of the YUV frame components without any difference. The retained variability of this trial was 99.77155%.
In Table III, the trial of the coded domain features is presented. The four selected features are MV magn, sub MV, Ave energy I, and Ave energy P.
The final trial is where both the source and coded domain features are compared. The results of this trial are illustrated in Table IV. The retained variability for this trial is 99.9966%.
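The text does not restate here how the retained variability figures were obtained; the sketch below shows one common way to produce such a number, a principal component analysis of the standardized feature matrix, and is offered purely as an illustration under that assumption.

```python
import numpy as np

def retained_variability(features, n_components):
    """Fraction of total variance kept by the first n principal components
    of a standardized (sequences x features) matrix (PCA-style analysis)."""
    X = np.asarray(features, dtype=np.float64)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # eigenvalues of the covariance matrix, sorted in descending order
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    return eigvals[:n_components].sum() / eigvals.sum()

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))
X[:, 5] = X[:, 0] * 0.98 + rng.normal(0, 0.05, 30)   # one nearly redundant feature
print(f"{retained_variability(X, 5) * 100:.3f} %")
```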
Figure 5-3 presents the architecture of the transcoding system: videos are pre-encoded with the best supported quality and then passed through a transcoder that only decodes the NAL units into a set of VCL information. The transcoder changes some of this information in the coded domain and then re-encodes it into NAL units. This modified bitstream is then sent to the decoder at the client side to retrieve the pixel domain video sequence.
The implementation used for the transcoder is presented in Figure 5-4. This
configuration was adopted to simplify the implementation of the transcoder.
This relies on the fact that the NAL encoder and decoder blocks are identical and can therefore be omitted.
In this experiment we used the features selected by the feature analysis discussed in the previous section; those features are as follows:
Bitcount
ME time
SNR Y
Sub MV
Ave Energy I
Ave Energy P
MV Magn
$Z = zscore(D)$

$Z = \dfrac{V - \mathrm{mean}(V)}{\mathrm{standard\ deviation}(V)}$
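A direct Python equivalent of this normalization step, applied to a toy feature matrix with one row per sequence, is shown below; the numeric values are placeholders.

```python
import numpy as np

def zscore(features):
    """Standardize each feature column: Z = (V - mean(V)) / std(V)."""
    V = np.asarray(features, dtype=np.float64)
    return (V - V.mean(axis=0)) / V.std(axis=0)

# Toy feature matrix: one row per video sequence, one column per selected feature
D = np.array([[1.2e6, 30.1, 38.2],
              [0.8e6, 22.4, 40.5],
              [2.1e6, 55.0, 35.9]])
print(zscore(D))
```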
5-9 Clustering
The cluster analysis functions in Matlab were used to cluster the video
sequences. The Dendrogram of the cluster is presented in Figure 5-6. Figure
5-7 provides a graph of normalized bitrate values after adding non transcoded
In Figure 5-7, the inversion point marked in blue shows the difference in the response of each video sequence to the transcoding technique. From the graph, it can be seen that the video sequences are grouped into two different clusters, as marked in red. The first cluster contains video sequences that experience a decrease in bitrate for any percentage of coefficient reduction. The other group consists of video sequences whose bitrate increases with the 6.25% reduction of the DCT coefficients.
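The clustering itself was done with the Matlab cluster analysis functions; the sketch below reproduces the same idea in Python with scipy's hierarchical clustering, using synthetic z-scored features as a stand-in for the real feature matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed input: z-scored feature matrix, one row per test sequence.
rng = np.random.default_rng(5)
Z_features = np.vstack([rng.normal(0, 1, (8, 7)),    # one behavioural group
                        rng.normal(3, 1, (8, 7))])   # a second, well separated group

links = linkage(Z_features, method="ward")           # hierarchy shown by the dendrogram
labels = fcluster(links, t=2, criterion="maxclust")  # cut the tree into two clusters
print(labels)
```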
The cluster analysis done in this study was able to predict the reaction of the test videos to the transcoding process. The dendrogram shows the presence of two clusters in the test sequences: one in which the video's bitrate without transcoding is higher than the transcoded bitrate, and a second in which the video's bitrate without transcoding is lower than some of the transcoded bitrates. Those two clusters are marked in the bitrate graph in Figure 5-7.
Chapter 6
6.
6-1 Conclusion
The transcoding cycle starts with an offline analysis stage that clusters the multimedia objects into categories based on their characteristics. This analysis predicts the behavior of the multimedia objects with respect to the transcoding techniques. Next, the best transcoding plan is chosen. This requires the presence of a quality assessment metric to evaluate the result and guarantee the transmission of the best available option given the resources at hand.
In our study we have explored those two points. The work done in this thesis
will help toward the implementation of the transcoding server and more
specifically the policy module in that transcoding server.
The results showed that the CED is consistent across different error domains and visual content. This characteristic allows it to be used in the loopback analysis cycle, where both time and generalizability matter most.
The analysis showed that the pixel domain features can be omitted. This is an important fact, as all the videos on the content servers will be in a pre-encoded form and therefore the pixel domain features will not be available for use in the transcoding server. As a result, the offline analysis will not require any external information other than the pre-encoded video sequence.
In our study we ran a preliminary experiment which showed that, using the selected features, a clustering system is able to predict the behavior of a set of video sequences.
Change the CED to use 16x16 windows instead of 8x8 and apply it to the DCT coefficients instead of the luminance values.
BIBLIOGRAPHY