
ALPHA COLLEGE OF ENGINEERING

DEPARTMENT
OF
INFORMATION TECHNOLOGY

MINIMUM LEARNING MATERIAL


FOR THE FOURTH YEAR B.Tech(IT) DEGREE COURSE
(R-2013)

SEMESTER-VII

TABLE OF CONTENTS

LIST OF SUBJECTS

CURRICULUM

IT6701  Information Management
CS6701  Cryptography and Network Security
IT6702  Data Warehousing and Data Mining
CS6703  Grid and Cloud Computing
IT6004  Software Testing
IT6711  Data Mining Laboratory
IT6712  Security Laboratory
IT6713  Grid and Cloud Computing Laboratory

ANNA UNIVERSITY CHENNAI

AFFILIATED INSTITUTIONS
2013 REGULATION B.TECH. INFORMATION TECHNOLOGY
VII SEMESTER CURRICULUM AND SYLLABI

Code No.  Course Title

THEORY
IT6701  Information Management
CS6701  Cryptography and Network Security
IT6702  Data Warehousing and Data Mining
CS6703  Grid and Cloud Computing
IT6004  Software Testing

PRACTICAL
IT6711  Data Mining Laboratory
IT6712  Security Laboratory
IT6713  Grid and Cloud Computing Laboratory

TOTAL

CS6701

CRYPTOGRAPHY AND NETWORK SECURITY

L T P C
3 0 0 3

UNIT I   INTRODUCTION & NUMBER THEORY   10
Services, mechanisms and attacks - the OSI security architecture - Network security model - Classical encryption techniques (symmetric cipher model, substitution techniques, transposition techniques, steganography). FINITE FIELDS AND NUMBER THEORY: Groups, Rings, Fields - Modular arithmetic - Euclid's algorithm - Finite fields - Polynomial arithmetic - Prime numbers - Fermat's and Euler's theorem - Testing for primality - The Chinese remainder theorem - Discrete logarithms.

UNIT II   BLOCK CIPHERS & PUBLIC KEY CRYPTOGRAPHY   10
Data Encryption Standard - Block cipher principles - Block cipher modes of operation - Advanced Encryption Standard (AES) - Triple DES - Blowfish - RC5 algorithm. Public key cryptography: Principles of public key cryptosystems - The RSA algorithm - Key management - Diffie-Hellman key exchange - Elliptic curve arithmetic - Elliptic curve cryptography.

UNIT III   HASH FUNCTIONS AND DIGITAL SIGNATURES   8
Authentication requirement - Authentication function - MAC - Hash function - Security of hash function and MAC - MD5 - SHA - HMAC - CMAC - Digital signature and authentication protocols - DSS - El Gamal - Schnorr.

UNIT IV   SECURITY PRACTICE & SYSTEM SECURITY   8
Authentication applications - Kerberos - X.509 authentication services - Internet firewalls for trusted systems: roles of firewalls - firewall-related terminology - types of firewalls - firewall designs - SET for e-commerce transactions. Intruder - Intrusion detection system - Virus and related threats - Countermeasures - Firewall design principles - Trusted systems - Practical implementation of cryptography and security.

UNIT V   E-MAIL, IP & WEB SECURITY   9
E-mail Security: Security services for e-mail - attacks possible through e-mail - establishing keys - privacy - authentication of the source - Message integrity - Non-repudiation - Pretty Good Privacy - S/MIME. IP Security: Overview of IPSec - IP and IPv6 - Authentication Header - Encapsulating Security Payload (ESP) - Internet Key Exchange (phases of IKE, ISAKMP/IKE encoding). Web Security: SSL/TLS basic protocol - computing the keys - client authentication - PKI as deployed by SSL - attacks fixed in v3 - Exportability - Encoding - Secure Electronic Transaction (SET).

TOTAL: 45 PERIODS
TEXT BOOKS:
1. William Stallings, Cryptography and Network Security, 6th Edition, Pearson Education, March 2013. (UNITS I-IV)
2. Charlie Kaufman, Radia Perlman and Mike Speciner, Network Security, Prentice Hall of India, 2002. (UNIT V)

REFERENCES:
1. Behrouz A. Forouzan, Cryptography & Network Security, Tata McGraw-Hill, 2007.
2. Man Young Rhee, Internet Security: Cryptographic Principles, Algorithms and Protocols, Wiley Publications, 2003.
3. Charles Pfleeger, Security in Computing, 4th Edition, Prentice Hall of India, 2006.
4. Uyless Black, Internet Security Protocols, Pearson Education Asia, 2000.
5. Charlie Kaufman, Radia Perlman and Mike Speciner, Network Security: Private Communication in a Public World, Second Edition, PHI, 2002.
6. Bruce Schneier and Niels Ferguson, Practical Cryptography, First Edition, Wiley Dreamtech India Pvt Ltd, 2003.
7. Douglas R. Stinson, Cryptography: Theory and Practice, First Edition, CRC Press, 1995.
8. http://nptel.ac.in/

ALPHA COLLEGE OF ENGINEERING


Thirumazhisai, Chennai - 600124

LESSON PLAN

Faculty Name    : Prasath R.
Designation     : AP
Subject Name    : Cryptography & Network Security
Code            : CS6701
Degree & Branch : B.Tech/IT
Year            : IV
Semester        : 07

AIM:
To understand the OSI security architecture and classical encryption techniques; to acquire fundamental knowledge of the concepts of finite fields and number theory; to understand various block cipher and stream cipher models; and to describe the principles of public key cryptosystems, hash functions and digital signatures.
Sl. No.  Topics  (Text/Ref. Book)

UNIT I INTRODUCTION & NUMBER THEORY
1. Services, mechanisms and attacks - T1
2. OSI security architecture - Network security model - T1
3. Classical encryption techniques - T1
4. Groups, Rings, Fields - T1
5. Modular arithmetic - T1
6. Euclid's algorithm - T1
7. Finite fields - T1
8. Polynomial arithmetic - T1
9. Prime numbers - T1
10. Fermat's and Euler's theorem - T1
11. The Chinese remainder theorem, discrete logarithms - T1

UNIT II BLOCK CIPHERS & PUBLIC KEY CRYPTOGRAPHY
12. Data Encryption Standard - T1
13. Block cipher principles - T1
14. Block cipher modes of operation - T1
15. AES, Triple DES, RC5 - T1
16. Public key cryptosystems - T1
17. RSA algorithm - Key management - T1
18. Diffie-Hellman key exchange - T1
19. Elliptic curve cryptography - T1

UNIT III HASH FUNCTIONS AND DIGITAL SIGNATURES
20. Authentication requirement, authentication function - T1
21. MAC - T1
22. Hash function - T1
23. Security of hash function and MAC: MD5, SHA, HMAC - T1
24. CMAC - T1
25. Digital signature and authentication protocols - T1
26. DSS - T1
27. El Gamal, Schnorr - T1

UNIT IV SECURITY PRACTICE & SYSTEM SECURITY
28. Authentication applications - T1
29. Kerberos, X.509 - T1
30. Roles of firewalls, firewall-related terminology, types of firewalls - T1
31. Firewall designs, SET, IDS - T1
32. Virus and related threats - T1
33. Countermeasures, firewall design principles, trusted systems - T1
34. Practical implementation of cryptography and security - T1

UNIT V E-MAIL, IP & WEB SECURITY
35. E-mail security - T2
36. Message integrity - Non-repudiation - Pretty Good Privacy - S/MIME - T2
37. Establishing keys, privacy, authentication of the source - T2
38. IP Security: overview of IPSec - T2
39. IP and IPv6 - Authentication Header - T2
40. ESP, IKE - T2
41. Web security - T2
42. SSL/TLS basic protocol - T2
43. PKI as deployed by SSL, attacks fixed in v3 - T2
44. Exportability - Encoding - Secure Electronic Transaction (SET) - T2

UNIT I
PART A (TWO MARKS)
1. Specify the four categories of security threats.
Interruption, Interception, Modification, Fabrication

2. Explain active and passive attack with example. (June 15)


Passive attack: Monitoring the message during transmission. Eg: Interception
Active attack: It involves the modification of data stream or creation of false data
stream. E.g.: Fabrication, Modification, and Interruption
3. Define integrity and non-repudiation.
Integrity: Service that ensures that only authorized person able to modify the message.
Non-repudiation: This service provides proof of a transaction, so that a party who denies having taken part in it can be shown to be telling the truth or lying.
4. Differentiate symmetric and asymmetric encryption?
Symmetric encryption: a form of cryptosystem in which encryption and decryption are performed using the same key. Eg: DES, AES.
Asymmetric encryption: a form of cryptosystem in which encryption and decryption are performed using two keys. Eg: RSA, ECC.
5. Define cryptanalysis?
It is a process of attempting to discover the key or plaintext or both.
6. Compare stream cipher with block cipher with example. (May 15)
Stream cipher: processes the input stream continuously, producing one element at a time. Example: Caesar cipher.
Block cipher: processes the input one block of elements at a time, producing an output block for each input block. Example: DES.

7. Define security mechanism.


It is a process designed to detect, prevent, or recover from a security attack. Examples: encryption algorithms, digital signatures, authentication protocols.
8. Differentiate unconditionally secured and computationally secured.
An encryption algorithm is unconditionally secure if the ciphertext generated by the scheme does not contain enough information to determine the corresponding plaintext, no matter how much ciphertext is available. An encryption scheme is computationally secure if:
- the cost of breaking the cipher exceeds the value of the encrypted information, and
- the time required to break the cipher exceeds the useful lifetime of the information.

9. Define steganography.
Hiding the message into some cover media. It conceals the existence of a message.
10. Why does a network need security?
When systems are connected through the network, active attacks and passive attacks are
possible during transmission time from sender to receiver and vice versa. So network
needs security.
11. Define Encryption.
The process of converting from plaintext to cipher text is known as encryption.
12. Specify the components of encryption algorithm.
(a) Plaintext (b) Encryption algorithm (c) secret key (d) cipher text (e) Decryption
algorithm.
13. Define confidentiality and authentication
Confidentiality: It means how to maintain the secrecy of message. It ensures that the
information in a computer system and transmitted information are accessible only for
reading by authorized person.
Authentication: It helps to prove that the source entity only has involved the transaction.
14. Define cryptography.
It is a science of writing Secret code using mathematical techniques. The many schemes
used for enciphering constitute the area of study known as cryptography.
15. Compare Substitution and Transposition techniques. (Dec 14)

SUBSTITUTION: A substitution technique is one in which the letters of plaintext are replaced by other letters or by numbers or symbols. Eg: Caesar cipher.
TRANSPOSITION: A different kind of mapping is achieved by performing some sort of permutation on the plaintext letters. Eg: DES, AES.

16. Define Diffusion & confusion. (May 15)


Diffusion: each plaintext digit affects the value of many ciphertext digits, which is equivalent to saying that each ciphertext digit is affected by many plaintext digits. It can be achieved by performing permutation on the data. It dissipates the statistical relationship between the plaintext and the ciphertext.
Confusion: it can be achieved by a substitution algorithm. It obscures the relationship between the ciphertext and the key.

17. Define Multiple Encryptions.


It is a technique in which the encryption is used multiple times. Eg: Double DES, Triple DES.
18. Specify the design criteria of block cipher.
Number of rounds, Design of the function F, Key scheduling
19. Define Reversible mapping. (Nov 13)
Each plaintext block maps to a unique ciphertext block. This transformation is called a reversible mapping.
20. Specify the basic task for defining a security service.
A service that enhances the security of the data processing systems and the information
transfer of an organization. The services are intended to counter security attack, and they
make use of one or more security mechanism to provide the service.
PART B (16 mark)
1. Explain OSI architecture (May 11)
OSI Security Architecture
ITU-T X.800 Security Architecture for OSI defines a systematic way of defining and
providing security requirements.
Security Services
X.800 defines it as: a service provided by a protocol layer of communicating open
systems, which ensures adequate security of the systems or of data transfers
RFC 2828 defines it as: a processing or communication service provided by a system to
give a specific kind of protection to system resources
X.800 defines it in 5 major categories
Authentication - assurance that the communicating entity is the one claimed
Access Control - prevention of the unauthorized use of a resource
Data Confidentiality - protection of data from unauthorized disclosure
Data Integrity - assurance that data received is as sent by an authorized entity
Non-Repudiation - protection against denial by one of the parties in a communication
Security Mechanisms
specific security mechanisms:
encipherment, digital signatures, access controls, data integrity, authentication
exchange, traffic padding, routing control, notarization
pervasive security mechanisms:
trusted functionality, security labels, event detection, security audit trails, security
recovery
Classification of Security Attacks as
passive attacks - eavesdropping on, or monitoring of, transmissions to:
obtain message contents, or monitor traffic flows
active attacks - modification of the data stream to: masquerade as another entity, replay previous messages, modify messages in transit, or deny service
Security attacks are classified as:
Passive attack

Reading contents of messages

Also called eavesdropping

Difficult to detect passive attacks

Defence: to prevent their success

Active attacks
Modification or creation of messages (by attackers)
Four categories: modification of messages, replay, masquerade, denial of service.
Easy to detect but difficult to prevent.
Defense: detect attacks and recover from damages

2. Give a model for network security. (8 marks)


3. Explain classical Encryption techniques in detail. (May 12)

have two basic components of classical ciphers: substitution and transposition


in substitution ciphers letters are replaced by other letters

in transposition ciphers the letters are arranged in a different order

these ciphers may be:

monoalphabetic - only one substitution/ transposition is used, or

polyalphabetic - where several substitutions/ transpositions are used

several such ciphers may be concatenated together to form a product cipher

Caesar Cipher - a monoalphabetic cipher

replace each letter of the message by a letter a fixed distance away, e.g. use the letter 3 positions on

reputedly used by Julius Caesar

eg.

L FDPH L VDZ L FRQTXHUHG


I CAME I SAW I CONQUERED

ie mapping is
ABCDEFGHIJKLMNOPQRSTUVWXYZ
DEFGHIJKLMNOPQRSTUVWXYZABC
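
The shift-by-3 mapping above translates directly into code. Below is a minimal Python sketch of the Caesar cipher (illustrative, not from the source; the function name and structure are my own):

```python
# Minimal Caesar cipher sketch; shift=3 matches the example above.
import string

ALPHABET = string.ascii_uppercase

def caesar(text, shift):
    # Shift each letter a fixed distance through the alphabet;
    # leave non-letters (e.g. spaces) unchanged.
    return "".join(
        ALPHABET[(ALPHABET.index(c) + shift) % 26] if c in ALPHABET else c
        for c in text
    )

assert caesar("I CAME I SAW I CONQUERED", 3) == "L FDPH L VDZ L FRQTXHUHG"
assert caesar("L FDPH L VDZ L FRQTXHUHG", -3) == "I CAME I SAW I CONQUERED"
```

Decryption is just encryption with the negated shift, which is why the same function serves both directions.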
Mixed Alphabets

Most generally we could use an arbitrary mixed (jumbled) alphabet

each plaintext letter is given a different random ciphertext letter, hence key is 26
letters long
Plain: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher: DKVQFIBJWPESCXHTMYAUOLRGZN
Plaintext: IFWEWISHTOREPLACELETTERS
Cipher text: WIRFRWAJUHYFTSDVFSFUUFYA

Cryptanalysis

use frequency counts to guess letter by letter


also have frequencies for digraphs & trigraphs

General Monoalphabetic

special form of mixed alphabet

use key as follows:
- write key (with repeated letters deleted)
- then write all remaining letters in columns underneath
- then read off by columns to get ciphertext equivalents


Polyalphabetic Substitution

in general use more than one substitution alphabet

makes cryptanalysis harder since have more alphabets to guess

and because flattens frequency distribution

(since the same plaintext letter gets replaced by several ciphertext letters, depending on which alphabet is used)

Vigenère Cipher

basically multiple Caesar ciphers

key is multiple letters long K = k_(1) k_(2) ... k_(d)

ith letter specifies ith alphabet to use

use each alphabet in turn, repeating from start after d letters in message
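
A short Python sketch of the repeating-key scheme just described (illustrative, not from the source; the test vector is the well-known ATTACKATDAWN/LEMON example, which is my own addition):

```python
# Vigenère cipher sketch: the i-th key letter selects the Caesar alphabet
# used for the i-th plaintext letter, repeating the key every d letters.
def vigenere(text, key, decrypt=False):
    out = []
    for i, c in enumerate(text):
        k = ord(key[i % len(key)]) - ord('A')   # shift for this position
        if decrypt:
            k = -k
        out.append(chr((ord(c) - ord('A') + k) % 26 + ord('A')))
    return "".join(out)

ct = vigenere("ATTACKATDAWN", "LEMON")
assert ct == "LXFOPVEFRNHR"                      # classic worked example
assert vigenere(ct, "LEMON", decrypt=True) == "ATTACKATDAWN"
```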

4. Explain Euclid's algorithm and Fermat's Little Theorem. (May 12 & May 15)
The Euclidean Algorithm is a technique for quickly finding the GCD of two
integers.
The Algorithm
The Euclidean Algorithm for finding GCD(A,B) is as follows:

If A = 0 then GCD(A,B)=B, since the GCD(0,B)=B, and we can stop.

If B = 0 then GCD(A,B)=A, since the GCD(A,0)=A, and we can stop.

Write A in quotient remainder form (A = BQ + R)

Find GCD(B,R) using the Euclidean Algorithm since GCD(A,B) = GCD(B,R)
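
The steps above translate directly into a short recursive Python function (a sketch, not from the source; the test values come from the worked gcd exercises later in this unit):

```python
def gcd(a, b):
    # Euclidean Algorithm: GCD(A, B) = GCD(B, R) where A = B*Q + R.
    if b == 0:
        return a            # GCD(A, 0) = A
    return gcd(b, a % b)    # recurse on (B, remainder)

assert gcd(1970, 1066) == 2
assert gcd(24140, 16762) == 34
```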


Fermat's little Theorem
Fermat's little theorem states that if p is a prime number, then for any integer a, the number a^p - a is an integer multiple of p. In the notation of modular arithmetic, this is expressed as a^p ≡ a (mod p).
For example, if a = 2 and p = 7, then 2^7 = 128, and 128 - 2 = 126 = 7 × 18 is an integer multiple of 7.


If a is not divisible by p, Fermat's little theorem is equivalent to the statement that a^(p-1) - 1 is an integer multiple of p, or in symbols, a^(p-1) ≡ 1 (mod p).
For example, if a = 2 and p = 7, then 2^6 = 64 and 64 - 1 = 63 = 7 × 9 is thus a multiple of 7.


Fermat's little theorem is the basis for the Fermat primality test and is one of the
fundamental results of elementary number theory. The theorem is named after Pierre de
Fermat, who stated it in 1640. It is called the "little theorem" to distinguish it
from Fermat's last theorem.
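
Both identities can be checked numerically, and the second yields the Fermat primality test mentioned above: pick a base a and test whether a^(n-1) ≡ 1 (mod n). A Python sketch (the base list is an illustrative choice; note the well-known limitation that some composites, the pseudoprimes, can still pass for certain bases):

```python
# Check Fermat's little theorem, then use it as a probabilistic primality test.
assert pow(2, 7, 7) == 2          # a^p ≡ a (mod p) for p = 7, a = 2
assert pow(2, 6, 7) == 1          # a^(p-1) ≡ 1 (mod p)

def fermat_test(n, bases=(2, 3, 5, 7)):
    # n passes if a^(n-1) ≡ 1 (mod n) for every tested base a.
    # A composite can still pass (a pseudoprime), so this is only probabilistic.
    return all(pow(a, n - 1, n) == 1 for a in bases if a % n != 0)

print(fermat_test(97))    # True: 97 is prime
print(fermat_test(91))    # False: 91 = 7 * 13 fails for base 2
```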
5. Explain different types of attack in detail.
Classes of attack might include passive monitoring of communications, active
network attacks, close-in attacks, exploitation by insiders, and attacks through the service
provider. Information systems and networks offer attractive targets and should be
resistant to attack from the full range of threat agents, from hackers to nation-states. A
system must be able to limit damage and recover rapidly when attacks occur.
The main types of attack are described below:
PASSIVE ATTACK
A passive attack monitors unencrypted traffic and looks for clear-text passwords
and sensitive information that can be used in other types of attacks. Passive
attacks include traffic analysis, monitoring of unprotected communications, decrypting
weakly encrypted traffic, and capturing authentication information such as passwords.
Passive interception of network operations enables adversaries to see upcoming actions.
Passive attacks result in the disclosure of information or data files to an attacker without
the consent or knowledge of the user.
ACTIVE ATTACK
In an active attack, the attacker tries to bypass or break into secured systems.
This can be done through stealth, viruses, worms, or Trojan horses. Active attacks include
attempts to circumvent or break protection features, to introduce malicious code, and to
steal or modify information. These attacks are mounted against a network backbone,
exploit information in transit, electronically penetrate an enclave, or attack an authorized


remote user during an attempt to connect to an enclave. Active attacks result in the
disclosure or dissemination of data files, DoS, or modification of data.
DISTRIBUTED ATTACK
A distributed attack requires that the adversary introduce code, such as a Trojan horse
or back-door program, to a trusted component or software that will later be distributed
to many other companies and users. Distribution attacks focus on the malicious
modification of hardware or software at the factory or during distribution. These attacks
introduce malicious code such as a back door to a product to gain unauthorized access to
information or to a system function at a later date.
INSIDER ATTACK
An insider attack involves someone from the inside, such as a disgruntled employee, attacking the network. Insider attacks can be malicious or nonmalicious. Malicious insiders intentionally eavesdrop, steal, or damage information; use information in a fraudulent manner; or deny access to other authorized users. Nonmalicious attacks typically result from carelessness, lack of knowledge, or intentional circumvention of security for such reasons as performing a task.
CLOSE-IN ATTACK
A close-in attack involves someone attempting to get physically close to network components, data, and systems in order to learn more about a network. Close-in attacks
consist of regular individuals attaining close physical proximity to networks, systems, or
facilities for the purpose of modifying, gathering, or denying access to information. Close
physical proximity is achieved through surreptitious entry into the network, open access,
or both.
PHISHING ATTACK
In a phishing attack the hacker creates a fake web site that looks exactly like a popular site such as an SBI bank site or PayPal. The phishing part of the attack is that the
hacker then sends an e-mail message trying to trick the user into clicking a link that leads
to the fake site. When the user attempts to log on with their account information, the
hacker records the username and password and then tries that information on the real site.

HIJACK ATTACK
In a hijack attack, a hacker takes over a session between you and
another individual and disconnects the other individual from the communication. You still
believe that you are talking to the original party and may send private information to the
hacker by accident.
SPOOF ATTACK
In a spoof attack, the hacker modifies the source address of the
packets he or she is sending so that they appear to be coming from someone else. This
may be an attempt to bypass your firewall rules.
BUFFER OVERFLOW
A buffer overflow attack is when the attacker sends more data to an application than is expected. A buffer overflow attack usually results in the attacker gaining administrative access to the system in a command prompt or shell.
EXPLOIT ATTACK
In this type of attack, the attacker knows of a security problem
within an operating system or a piece of software and leverages that knowledge by
exploiting the vulnerability.
PASSWORD ATTACK
An attacker tries to crack the passwords stored in a network account
database or a password-protected file. There are three major types of password attacks: a
dictionary attack, a brute-force attack, and a hybrid attack. A dictionary attack uses a
word list file, which is a list of potential passwords. A brute-force attack is when the
attacker tries every possible combination of characters.
6. Explain Cipher Feedback and Output Feedback.
Cipher Feedback (CFB)
The message is treated as a stream of bits, added to the output of the block cipher; the result is fed back for the next stage (hence the name).
The standard allows any number of bits (1, 8, 64 or whatever) to be fed back, denoted CFB-1, CFB-8, CFB-64, etc.
It is most efficient to use all 64 bits (CFB-64): Ci = Pi XOR DESK1(Ci-1), with C-1 = IV.
Uses: stream data encryption, authentication.
Advantages and Limitations of CFB
- appropriate when data arrives in bits/bytes; the most common stream mode
- limitation is the need to stall while doing a block encryption after every n bits
- note that the block cipher is used in encryption mode at both ends
- errors propagate for several blocks after the error

Output Feedback (OFB)
The message is treated as a stream of bits; the output of the cipher is added to the message.
The output is then fed back (hence the name); the feedback is independent of the message, so it can be computed in advance.
Ci = Pi XOR Oi, Oi = DESK1(Oi-1), with O-1 = IV.
Uses: stream encryption over noisy channels.
Advantages and Limitations of OFB
- used when error feedback is a problem, or where encryption must be done before the message is available
- superficially similar to CFB, but the feedback is from the output of the cipher and is independent of the message; a variation of a Vernam cipher
- hence must never reuse the same sequence (key + IV)
- sender and receiver must remain in sync, and some recovery method is needed to ensure this occurs
- originally specified with m-bit feedback in the standards; subsequent research has shown that only OFB-64 should ever be used
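
The OFB keystream idea - feed the cipher output back into itself and XOR it with the message - can be sketched in a few lines of Python. Here SHA-256 stands in for the block cipher E_K, which is an assumption made purely for illustration; the standard's construction uses DES (or today AES), not a hash:

```python
# Toy OFB sketch: O_i = E_K(O_{i-1}), C_i = P_i XOR O_i, with O_{-1} = IV.
# A keyed SHA-256 stands in for the block cipher (assumption, illustration
# only); a real implementation would use DES/AES as the standard specifies.
import hashlib

def ofb_keystream(key, iv, nblocks, blocksize=8):
    o = iv
    for _ in range(nblocks):
        o = hashlib.sha256(key + o).digest()[:blocksize]  # O_i = E_K(O_{i-1})
        yield o

def ofb_xor(key, iv, data, blocksize=8):
    # XOR the data against the keystream; same operation encrypts and decrypts.
    out = bytearray()
    ks = ofb_keystream(key, iv, -(-len(data) // blocksize), blocksize)
    for i, block in enumerate(ks):
        chunk = data[i * blocksize:(i + 1) * blocksize]
        out.extend(p ^ k for p, k in zip(chunk, block))
    return bytes(out)

key, iv = b"secret-key", b"12345678"
msg = b"attack at dawn over a noisy channel"
ct = ofb_xor(key, iv, msg)
assert ofb_xor(key, iv, ct) == msg
```

Note how the keystream depends only on (key, IV), never on the message - which is exactly why reusing the same (key + IV) pair is forbidden.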
7. Explain the Chinese Remainder Theorem.
Chinese Remainder Theorem

Used to speed up modulo computations when working modulo a product of numbers, e.g. M = m1 × m2 × ... × mk.
The Chinese Remainder Theorem lets us work in each modulus mi separately.
Since computational cost is proportional to size, this is faster than working in the full modulus M.
can implement CRT in several ways
To compute (A mod M), first compute all (ai mod mi) separately and then combine the results to get the answer using:
A mod M = (Σ ai × ci) mod M, for 1 ≤ i ≤ k, where ci = Mi × (Mi^-1 mod mi) and Mi = M / mi.
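
A small Python sketch of this reconstruction formula (illustrative, not from the source; uses the Python 3.8+ forms pow(x, -1, m) for the modular inverse and math.prod):

```python
# Chinese Remainder Theorem sketch: recombine residues a_i mod m_i into A mod M.
from math import prod

def crt(residues, moduli):
    M = prod(moduli)
    total = 0
    for a_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        c_i = M_i * pow(M_i, -1, m_i)   # c_i = M_i * (M_i^-1 mod m_i)
        total += a_i * c_i
    return total % M

# 973 mod 5 = 3, mod 7 = 0, mod 11 = 5  ->  recover 973 mod 385
assert crt([3, 0, 5], [5, 7, 11]) == 973 % 385
```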

Primitive Roots
From Euler's theorem we have a^φ(n) mod n = 1. Consider a^m mod n = 1 with GCD(a, n) = 1: such an m must exist for m = φ(n) but may be smaller; once powers reach m, the cycle repeats.
If the smallest such m is φ(n), then a is called a primitive root.
If p is prime, then successive powers of a primitive root a "generate" the group mod p. These are useful but relatively hard to find.

Discrete Logarithms or Indices
The inverse problem to exponentiation is to find the discrete logarithm of a number modulo p: that is, to find x where a^x = b mod p.
This is written as x = log_a b mod p, or x = ind_a,p(b).
If a is a primitive root then x always exists; otherwise it may not:
x = log_3 4 mod 13 (x s.t. 3^x = 4 mod 13) has no answer;
x = log_2 3 mod 13 = 4, found by trying successive powers.
While exponentiation is relatively easy, finding discrete logarithms is generally a hard problem.
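
The two small examples above can be reproduced by brute force, which also shows why the problem is hard at scale - the search is linear in p (a Python sketch, not from the source):

```python
def discrete_log(a, b, p):
    # Find x with a^x ≡ b (mod p) by trying successive powers (O(p) work).
    x, power = 0, 1
    while x < p:
        if power == b % p:
            return x
        power = (power * a) % p
        x += 1
    return None                         # no solution: powers of a never hit b

assert discrete_log(2, 3, 13) == 4      # 2^4 = 16 ≡ 3 (mod 13)
assert discrete_log(3, 4, 13) is None   # log_3 4 mod 13 has no answer
```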

8. Explain confidentiality of Symmetric Encryption.


Confidentiality Using Symmetric Encryption
Traditionally symmetric encryption is used to provide message confidentiality. Consider a typical scenario:
- workstations on LANs access other workstations & servers on the LAN
- LANs are interconnected using switches/routers, with external lines or radio/satellite links
Consider the attacks and their placement in this scenario:
- snooping from another workstation
- using dial-in to the LAN or server to snoop
- using an external router link to enter & snoop
- monitoring and/or modifying traffic on external links
Have two major placement alternatives:
Link encryption
- encryption occurs independently on every link
- implies traffic must be decrypted between links
- requires many devices, but only paired keys
End-to-end encryption
- encryption occurs between the original source and final destination
- needs devices at each end with shared keys
Traffic Confidentiality
- when using end-to-end encryption, headers must be left in the clear so the network can correctly route the information
- hence although contents are protected, traffic pattern flows are not
- ideally want both at once: end-to-end protects data contents over the entire path and provides authentication; link encryption protects traffic flows from monitoring
Placement of Encryption
- can place the encryption function at various layers in the OSI Reference Model
- link encryption occurs at layers 1 or 2; end-to-end can occur at layers 3, 4, 6, 7
- as one moves higher, less information is encrypted but it is more secure, though more complex, with more entities and keys
Traffic Analysis
- is monitoring of communications flows between parties; useful in both military & commercial spheres
- can also be used to create a covert channel
- link encryption obscures header details, but overall traffic volumes in networks and at end-points are still visible
- traffic padding can further obscure flows

UNIT II

PART A (TWO MARKS)


1. Differentiate public key and conventional encryption? (Dec 11)
Conventional Encryption:
- The same algorithm with the same key is used for encryption and decryption.
- The sender and receiver must share the algorithm and the key.
- The key must be kept secret.
- It must be impossible, or at least impractical, to decipher a message if no other information is available.
- Knowledge of the algorithm plus samples of ciphertext must be insufficient to determine the key.
Public-key Encryption:
- One algorithm is used for encryption and decryption, with a pair of keys: one for encryption and another for decryption.
- The sender and receiver must each have one of the matched pair of keys.
- One of the two keys must be kept secret.
- It must be impossible, or at least impractical, to decipher a message if no other information is available.
- Knowledge of the algorithm plus one of the keys plus samples of ciphertext must be insufficient to determine the other key.
2. What are the principal elements of a public key cryptosystem?
The principal elements of a cryptosystem are:
- Plain text
- Encryption algorithm
- Public and private keys
- Cipher text
- Decryption algorithm

3. What are roles of public and private key?


The two keys used for public-key encryption are referred to as the public key and the
private key. Invariably, the private key is kept secret and the public key is known
publicly. Usually the public key is used for encryption and the private key is used for decryption.
4. Specify the applications of the public key cryptosystem?


The applications of the public-key cryptosystem can be classified as follows:

Encryption/Decryption: The sender encrypts a message with the recipient's public key.
Digital signature: The sender signs a message with its private key. Signing is achieved
by a cryptographic algorithm applied to a message or to a small block of data that is a
function of the message.
Key Exchange: Two sides cooperate to exchange a session key. Several different
approaches are possible, involving the private key(s) of one or both parties.
5. What requirements must a public key cryptosystem fulfill to be a secure algorithm?
The requirements of a public-key cryptosystem are as follows:
- It is computationally easy for a party B to generate a key pair (public key KUb, private key KRb).
- It is computationally easy for a sender A, knowing the public key and the message to be encrypted, M, to generate the corresponding ciphertext: C = EKUb(M).
- It is computationally easy for the receiver B to decrypt the resulting ciphertext using the private key to recover the original message: M = DKRb(C) = DKRb[EKUb(M)].
- It is computationally infeasible for an opponent, knowing the public key KUb, to determine the private key KRb.
- It is computationally infeasible for an opponent, knowing the public key KUb and a ciphertext C, to recover the original message.
- The encryption and decryption functions can be applied in either order: M = EKUb[DKRb(M)] = DKUb[EKRb(M)].
6. What is a one way function? (Dec 12)
A one-way function is one that maps a domain into a range such that every function value has a unique inverse, with the condition that the calculation of the function is easy whereas the calculation of the inverse is infeasible.
7. What is a trapdoor one way function? (Dec 12)
It is a function which is easy to calculate in one direction and infeasible to calculate in the other direction unless certain additional information is known. With the additional information, the inverse can be calculated in polynomial time. It can be summarized as: a trapdoor one-way function is a family of invertible functions fk such that
Y = fk(X) is easy to compute if k and X are known;
X = fk^-1(Y) is easy to compute if k and Y are known;
X = fk^-1(Y) is infeasible if Y is known but k is not known.


8. List four general schemes for the distribution of public keys. (May 11)
The four general schemes for the distribution of public keys are:

Public announcement
Publicly available directory
Public-key authority
Public-key certificate

9. What are essential ingredients of the public key directory?


The essential ingredients of the public key are as follows:
The authority maintains a directory with a {name, public key} entry for each
participant.
Each participant registers a public key with the directory authority. Registration
would have to be in person or by some form of secure authenticated
communication.
A participant may replace the existing key with a new one at any time, either because of the desire to replace a public key that has already been used for a large amount of data, or because the corresponding private key has been compromised in some way.
Periodically, the authority publishes the entire directory or updates to the
directory. For example, a hard-copy version much like a telephone book could be
published, or updates could be listed in a widely circulated newspaper.
Participants could also access the directory electronically. For this purpose,
secure, authenticated communication from the authority to the participant is
mandatory.
10. What are the design parameters of Feistel cipher network?
*Block size *Key size *Number of Rounds *Sub key generation algorithm *Round
function *Fast software Encryption/Decryption *Ease of analysis
11. Define Product cipher.
When two or more basic ciphers are combined, the resulting cipher is called a product cipher.
12. Explain Avalanche effect.
A desirable property of any encryption algorithm is that a small change in either the plaintext or the key produces a significant change in the ciphertext. In particular, a change in one bit of the plaintext or one bit of the key should produce a change in many bits of the ciphertext. If the change were small, this might provide a way to reduce the size of the plaintext or key space to be searched.
13. Give the five modes of operation of Block cipher. (Dec 14)
Electronic Codebook (ECB)
Cipher Block Chaining (CBC)
Cipher Feedback (CFB)
Output Feedback (OFB)
Counter (CTR)
14. State advantages of counter mode.
*Hardware Efficiency * Software Efficiency *Preprocessing * Random Access *
Provable Security * Simplicity.
15. Find gcd (1970, 1066) using Euclids algorithm? (Dec 13)
gcd(1970, 1066) = gcd(1066, 904) = gcd(904, 162) = gcd(162, 94) = gcd(94, 68)
= gcd(68, 26) = gcd(26, 16) = gcd(16, 10) = gcd(10, 6) = gcd(6, 4) = gcd(4, 2) = gcd(2, 0)
= 2
16. What is the primitive root of a number? (Dec 11)
We can define a primitive root of a number p as one whose powers generate all the integers from 1 to p - 1. That is, if a is a primitive root of the prime number p, then the numbers a, a^2, ..., a^(p-1) are distinct and consist of the integers from 1 to p - 1 in some permutation.
17. Determine the gcd (24140, 16762) using Euclids algorithm.
Soln: We know gcd(a, b) = gcd(b, a mod b)
gcd(24140, 16762) = gcd(16762, 7378) = gcd(7378, 2006) = gcd(2006, 1360)
= gcd(1360, 646) = gcd(646, 68) = gcd(68, 34) = gcd(34, 0) = 34
Therefore gcd(24140, 16762) = 34.
18. Perform encryption and decryption using RSA Alg. for the following.
P=7; q=11; e=17; M=8.
Soln: n = pq = 7 × 11 = 77
φ(n) = (p - 1)(q - 1) = 6 × 10 = 60
e = 17, so d = 53 (since 17 × 53 = 901 ≡ 1 mod 60)
C = M^e mod n = 8^17 mod 77 = 57
M = C^d mod n = 57^53 mod 77 = 8
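
These numbers can be verified with a few lines of Python (a sketch; pow(e, -1, phi) needs Python 3.8+):

```python
# Verify the worked RSA example: p=7, q=11, e=17, M=8.
p, q, e, M = 7, 11, 17, 8
n = p * q                      # 77
phi = (p - 1) * (q - 1)        # 60
d = pow(e, -1, phi)            # 53, since 17*53 = 901 ≡ 1 (mod 60)
C = pow(M, e, n)               # 8^17 mod 77 = 57
assert (n, phi, d, C) == (77, 60, 53, 57)
assert pow(C, d, n) == M       # 57^53 mod 77 = 8 recovers the message
```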
19. What is elliptic curve cryptography?
Elliptic curve cryptography (ECC) is an approach to public-key cryptography based on
the algebraic structure of elliptic curves over finite fields. ECC requires smaller keys
compared to non-ECC cryptography (based on plain Galois fields) to provide equivalent
security. Elliptic curves are applicable for encryption, digital signatures, pseudo-random
generators and other tasks.
20. What is Blowfish?
Blowfish is a symmetric-key block cipher, designed in 1993 by Bruce Schneier and
included in a large number of cipher suites and encryption products. Schneier designed
Blowfish as a general-purpose algorithm, intended as an alternative to the aging DES and
free of the problems and constraints associated with other algorithms.
PART B (16 marks)
1. Explain Diffie Hellman key Exchange in detail with an example (May 11 & Dec
12)
Diffie-Hellman key exchange (DH) is a specific method of securely exchanging cryptographic keys over a public channel, and was one of the first public-key protocols, as originally conceptualized by Ralph Merkle and named after Whitfield Diffie and Martin Hellman. DH is one of the earliest practical examples of public key exchange implemented within the field of cryptography. Traditionally, secure encrypted communication between two parties required that they first exchange keys by some secure physical channel, such as paper key lists transported by a trusted courier. The Diffie-Hellman key exchange method allows two parties that have no prior knowledge of each other to jointly establish a shared secret key over an insecure channel. This key can then be used to encrypt subsequent communications using a symmetric key cipher.
Diffie-Hellman is used to secure a variety of Internet services. However, research published in October 2015 suggests that the parameters in use for many DH Internet applications at that time were not strong enough to prevent compromise by very well-funded attackers, such as the security services of large governments.
The scheme was first published by Whitfield Diffie and Martin Hellman in 1976. By 1975, James H. Ellis, Clifford Cocks and Malcolm J. Williamson within GCHQ, the British signals intelligence agency, had already shown how public-key cryptography could be achieved; however, their work was kept secret until 1997.
Although Diffie-Hellman key agreement itself is a non-authenticated key-agreement protocol, it provides the basis for a variety of authenticated protocols, and is used to provide forward secrecy in Transport Layer Security's ephemeral modes (referred to as EDH or DHE depending on the cipher suite).
The method was followed shortly afterwards by RSA, an implementation of public-key cryptography using asymmetric algorithms.
General overview

Illustration of the Diffie-Hellman Key Exchange

Diffie-Hellman Key Exchange establishes a shared secret between two parties that can be used for secret communication when exchanging data over a public network. The conceptual diagram illustrates the general idea of the key exchange by using colors instead of very large numbers.
The process begins by having the two parties, Alice and Bob, agree on an arbitrary starting color that does not need to be kept secret (but should be different every time); in this example the color is yellow. Each of them selects a secret color (red and aqua respectively) that they keep to themselves. The crucial part of the process is that Alice and Bob now mix their secret color together with their mutually shared color, resulting in orange and blue mixtures respectively, then publicly exchange the two mixed colors. Finally, each of the two mixes the color received from the partner with their own private color. The result is a final color mixture (brown) that is identical to the partner's color mixture.
If another party (usually named Eve in cryptology publications, Eve being a third party considered to be an eavesdropper) had been listening in on the exchange, it would be computationally difficult for that person to determine the common secret color; in fact, when using large numbers rather than colors, this action is infeasible for modern supercomputers to do in a reasonable amount of time.
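
The color analogy maps directly onto modular exponentiation. A toy Python sketch with deliberately tiny, insecure parameters (p = 23, g = 5 and the secret exponents are illustrative values only, not from the source):

```python
# Toy Diffie-Hellman sketch (tiny parameters for illustration; real DH uses
# primes of 2048+ bits). Public values: prime p and generator g.
p, g = 23, 5

a = 6                          # Alice's secret ("red")
b = 15                         # Bob's secret ("aqua")

A = pow(g, a, p)               # Alice's public mix, sent in the clear
B = pow(g, b, p)               # Bob's public mix, sent in the clear

shared_alice = pow(B, a, p)    # Alice combines Bob's mix with her secret
shared_bob = pow(A, b, p)      # Bob combines Alice's mix with his secret

assert shared_alice == shared_bob == pow(g, a * b, p)   # identical "brown"
```

An eavesdropper sees only p, g, A and B; recovering a or b from them is the discrete logarithm problem described in Unit I.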

2. Explain DES in detail. (May 15)


Data Encryption Standard (DES)
- the most widely used block cipher in the world
- adopted in 1977 by NBS (now NIST) as FIPS PUB 46
- encrypts 64-bit data using a 56-bit key
- has widespread use
- has been the subject of considerable controversy over its security
DES History
IBM developed the Lucifer cipher, by a team led by Feistel, which used 64-bit data blocks with a 128-bit key. It was then redeveloped as a commercial cipher with input from NSA and others. In 1973 NBS issued a request for proposals for a national cipher standard; IBM submitted their revised Lucifer, which was eventually accepted as the DES.
DES Design Controversy
Although the DES standard is public, there was considerable controversy over the design: the choice of a 56-bit key (vs the Lucifer 128-bit key), and because the design criteria were classified. Subsequent events and public analysis show that in fact the design was appropriate. DES has become widely used, especially in financial applications.

Initial Permutation IP
- first step of the data computation
- IP reorders the input data bits
- even bits to the LH half, odd bits to the RH half
- quite regular in structure (easy in h/w)

DES Round Structure
- uses two 32-bit L & R halves
- expands R to 48 bits using permutation E
- XORs this with the subkey
- passes the result through 8 S-boxes to get a 32-bit result
- finally permutes this using the 32-bit permutation P

Substitution Boxes S
- have eight S-boxes which map 6 to 4 bits
- each S-box is actually 4 little 4-bit boxes
- outer bits 1 & 6 (row bits) select one of the rows
- inner bits 2-5 (col bits) are substituted
- result is 8 lots of 4 bits, or 32 bits
- row selection depends on both data & key; this feature is known as autoclaving (autokeying)
- example: S(18 09 12 3d 11 17 38 39) = 5fd25e03

DES Key Schedule
Forms the subkeys used in each round. It consists of:
- an initial permutation of the key (PC1) which selects 56 bits in two 28-bit halves
- 16 stages, each consisting of: rotating each half separately either 1 or 2 places depending on the key rotation schedule K, then selecting 24 bits from each half and permuting them by PC2, for use in function f

DES Decryption
- decryption must unwind the steps of the data computation
- with the Feistel design, do the encryption steps again, using subkeys in reverse order (SK16 ... SK1)
- note that IP undoes the final FP step of encryption
- 1st round with SK16 undoes the 16th encrypt round
- 16th round with SK1 undoes the 1st encrypt round
- then the final FP undoes the initial encryption IP
- thus recovering the original data value

Avalanche Effect
- a key desirable property of any encryption algorithm
- a change of one input or key bit results in changing approximately half the output bits
- makes attempts to home in by guessing keys infeasible
- DES exhibits a strong avalanche effect



3. Briefly explain block cipher design principles and modes of operation. (Dec13)
Block Cipher Design Principles and Modes of Operation

Basic principles (still like Feistel in the 1970s):
- number of rounds: more is better; exhaustive search is the best attack
- function f: provides confusion, is nonlinear, gives avalanche
- key schedule: complex subkey creation, key avalanche

Modes of Operation
Block ciphers encrypt fixed-size blocks, e.g. DES encrypts 64-bit blocks with a 56-bit key. We need a way to use them in practice, given that we usually have an arbitrary amount of information to encrypt. Four modes were defined for DES in ANSI standard ANSI X3.106-1983 "Modes of Use"; there are now 5 modes for DES and AES, covering both block and stream modes.
(i) Electronic Codebook (ECB)
- message is broken into independent blocks which are encrypted
- each block is a value which is substituted, like a codebook, hence the name
- each block is encoded independently of the other blocks
- uses: secure transmission of single values
Advantages and Limitations of ECB
- repetitions in the message may show in the ciphertext if aligned with the message block, particularly with data such as graphics, or with messages that change very little, which become a code-book analysis problem
- the weakness is due to encrypted message blocks being independent
- main use is sending a few blocks of data
(ii) Cipher Block Chaining (CBC)
- message is broken into blocks, but these are linked together in the encryption operation
- each previous cipher block is chained with the current plaintext block
Advantages and Limitations of CBC
- each ciphertext block depends on all message blocks, thus a change in the message affects all ciphertext blocks after the change, as well as the original block
- need an Initial Value (IV) known to sender & receiver; however, if the IV is sent in the clear, an attacker can change bits of the first block and change the IV to compensate; hence either the IV must be a fixed value (as in EFTPOS) or it must be sent encrypted in ECB mode before the rest of the message
- at the end of the message, handle a possible last short block by padding either with a known non-data value (e.g. nulls), or pad the last block with a count of the pad size, e.g. [b1 b2 b3 0 0 0 0 5] <- 3 data bytes, then 5 bytes pad+count
(iii) Cipher Feedback (CFB)
- message is treated as a stream of bits, added to the output of the block cipher; the result is fed back for the next stage (hence the name)
- the standard allows any number of bits (1, 8 or 64 or whatever) to be fed back, denoted CFB-1, CFB-8, CFB-64, etc.
- most efficient to use all 64 bits (CFB-64)
- uses: stream data encryption, authentication
Advantages and Limitations of CFB
- appropriate when data arrives in bits/bytes; the most common stream mode
- limitation is the need to stall while doing a block encryption after every n bits
- note that the block cipher is used in encryption mode at both ends
- errors propagate for several blocks after the error
(iv) Output Feedback (OFB)
- message is treated as a stream of bits; the output of the cipher is added to the message
- the output is then fed back (hence the name); the feedback is independent of the message, so it can be computed in advance
- uses: stream encryption over noisy channels
(v) Counter (CTR)
- a new mode, though proposed early on
- similar to OFB but encrypts a counter value rather than any feedback value
- must have a different key & counter value for every plaintext block (never reused)
- uses: high-speed network encryption
4. Explain RSA algorithm in detail with an example (May 11, May 12 & Dec 14)
RSA is one of the first practical public-key cryptosystems and is widely used for secure data transmission. In such a cryptosystem, the encryption key is public and differs from the decryption key, which is kept secret. In RSA, this asymmetry is based on the practical difficulty of factoring the product of two large prime numbers, the factoring problem. RSA is made of the initial letters of the surnames of Ron Rivest, Adi Shamir, and Leonard Adleman, who first publicly described the algorithm in 1977. Clifford Cocks, an English mathematician working for the UK intelligence agency GCHQ, had developed an equivalent system in 1973, but it was not declassified until 1997.
A user of RSA creates and then publishes a public key based on two large prime numbers, along with an auxiliary value. The prime numbers must be kept secret. Anyone can use the public key to encrypt a message, but with currently published methods, if the public key is large enough, only someone with knowledge of the prime numbers can feasibly decode the message. Breaking RSA encryption is known as the RSA problem; whether it is as hard as the factoring problem remains an open question.
RSA is a relatively slow algorithm, and because of this it is less commonly used to directly encrypt user data. More often, RSA passes encrypted shared keys for symmetric key cryptography, which in turn can perform bulk encryption-decryption operations at much higher speed.
The RSA algorithm involves four steps: key generation, key distribution, encryption and decryption.
RSA involves a public key and a private key. The public key can be known by everyone and is used for encrypting messages. The intention is that messages encrypted with the public key can only be decrypted in a reasonable amount of time using the private key. The basic principle behind RSA is the observation that it is practical to find three very large positive integers e, d and n such that, with modular exponentiation, (m^e)^d ≡ m (mod n) for all m, and that even knowing e and n, or even m, it can be extremely difficult to find d. Additionally, for some operations it is convenient that the order of the two exponentiations can be changed, so that this relation also implies (m^d)^e ≡ m (mod n).
1. Key distribution
To enable Bob to send his encrypted messages, Alice transmits her public key (n, e) to Bob via a reliable, but not necessarily secret, route. The private key is never distributed.
2. Encryption
Suppose that Bob would like to send message M to Alice. He first turns M into an integer m, such that 0 ≤ m < n and gcd(m, n) = 1, by using an agreed-upon reversible protocol known as a padding scheme. He then computes the ciphertext c, using Alice's public key e, corresponding to c ≡ m^e (mod n). This can be done efficiently, even for 500-bit numbers, using modular exponentiation. Bob then transmits c to Alice.
3. Decryption


Alice can recover m from c by using her private key exponent d, computing m ≡ c^d (mod n). Given m, she can recover the original message M by reversing the padding scheme.
4. Key generation
The keys for the RSA algorithm are generated in the following way:
1. Choose two distinct prime numbers p and q. For security purposes, the integers p and q should be chosen at random, and should be similar in magnitude but differ in length by a few digits to make factoring harder. Prime integers can be efficiently found using a primality test.
2. Compute n = pq. n is used as the modulus for both the public and private keys. Its length, usually expressed in bits, is the key length.
3. Compute φ(n) = φ(p)φ(q) = (p - 1)(q - 1) = n - (p + q - 1), where φ is Euler's totient function. This value is kept private.
4. Choose an integer e such that 1 < e < φ(n) and gcd(e, φ(n)) = 1; i.e., e and φ(n) are coprime.
5. Determine d as d ≡ e^-1 (mod φ(n)); i.e., d is the modular multiplicative inverse of e (modulo φ(n)). This is more clearly stated as: solve for d given de ≡ 1 (mod φ(n)).
e having a short bit-length and small Hamming weight results in more efficient encryption; most commonly e = 2^16 + 1 = 65,537. However, much smaller values of e (such as 3) have been shown to be less secure in some settings.
e is released as the public key exponent; d is kept as the private key exponent.
The public key consists of the modulus n and the public (or encryption) exponent e. The private key consists of the modulus n and the private (or decryption) exponent d, which must be kept secret. p, q, and φ(n) must also be kept secret because they can be used to calculate d.
An alternative, used by PKCS#1, is to choose d matching de ≡ 1 (mod λ) with λ = lcm(p - 1, q - 1), where lcm is the least common multiple. Using λ instead of φ(n) allows more choices for d. λ can also be defined using the Carmichael function, λ(n). Since any common factors of (p - 1) and (q - 1) are present in the factorisation of pq - 1, it is recommended that (p - 1) and (q - 1) have only very small common factors, if any, besides the necessary 2.
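
The key-generation steps above can be condensed into a small Python sketch (the toy primes 61 and 53 are illustrative choices; real keys use primes of 1024+ bits found with a primality test, and pow(e, -1, phi) needs Python 3.8+):

```python
# RSA key generation sketch following the five steps above (toy-sized primes).
from math import gcd

p, q = 61, 53                      # step 1: two distinct primes (toy values)
n = p * q                          # step 2: modulus, n = 3233
phi = (p - 1) * (q - 1)            # step 3: phi(n) = 3120
e = 17                             # step 4: 1 < e < phi(n), gcd(e, phi) = 1
assert gcd(e, phi) == 1
d = pow(e, -1, phi)                # step 5: d = e^-1 mod phi(n) = 2753

m = 65                             # a message already encoded as an integer
c = pow(m, e, n)                   # encryption: c = m^e mod n
assert pow(c, d, n) == m           # decryption recovers m
```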
5. Briefly explain the idea behind Elliptic Curve Cryptosystem. (16)
Elliptic curve cryptography (ECC) is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields. ECC requires smaller keys compared to non-ECC cryptography (based on plain Galois fields) to provide equivalent security.
Elliptic curves are applicable for encryption, digital signatures, pseudo-random generators and other tasks. They are also used in several integer factorization algorithms that have applications in cryptography, such as Lenstra elliptic curve factorization.
Public-key cryptography is based on the intractability of certain mathematical problems. Early public-key systems are secure assuming that it is difficult to factor a large integer composed of two or more large prime factors. For elliptic-curve-based protocols, it is assumed that finding the discrete logarithm of a random elliptic curve element with respect to a publicly known base point is infeasible: this is the "elliptic curve discrete logarithm problem" or ECDLP. The security of ECC depends on the ability to compute a point multiplication and the inability to compute the multiplicand given the original and product points. The size of the elliptic curve determines the difficulty of the problem.
The primary benefit promised by ECC is a smaller key size, reducing storage and transmission requirements, i.e. an elliptic curve group can provide the same level of security afforded by an RSA-based system with a large modulus and correspondingly larger key: for example, a 256-bit ECC public key should provide comparable security to a 3072-bit RSA public key.
The U.S. National Institute of Standards and Technology (NIST) has endorsed ECC in its Suite B set of recommended algorithms, specifically Elliptic Curve Diffie-Hellman (ECDH) for key exchange and the Elliptic Curve Digital Signature Algorithm (ECDSA) for digital signatures. The U.S. National Security Agency (NSA) allows their use for protecting information classified up to top secret with 384-bit keys. However, in August 2015 the NSA announced that it plans to replace Suite B with a new cipher suite due to concerns about quantum computing attacks on ECC.
While the RSA patent expired in 2000, there may be patents in force covering certain aspects of ECC technology, though some (including RSA Laboratories and Daniel J. Bernstein) argue that the Federal elliptic curve digital signature standard (ECDSA; NIST FIPS 186-3) and certain practical ECC-based key exchange schemes (including ECDH) can be implemented without infringing them.
6. Explain Key management in detail. (16 mark)
Key Management
Distribution of Public Keys

Public-Key Distribution of Secret keys


Distribution of Public Keys
can be considered as using one of:
Public announcement
Publicly available directory
Public-key authority
Public-key certificates
Public Announcement
Users distribute public keys to recipients or broadcast to community at large
eg. append PGP keys to email messages or post to news groups or email list
major weakness is forgery
anyone can create a key claiming to be someone else and broadcast it
until forgery is discovered can masquerade as claimed user
Publicly Available Directory

Can obtain greater security by registering keys with a public directory


directory must be trusted with properties:
contains {name,public-key} entries
participants register securely with directory
participants can replace key at any time
directory is periodically published
directory can be accessed electronically
still vulnerable to tampering or forgery
Public-Key Authority
Improve security by tightening control over distribution of keys from directory
has properties of directory
and requires users to know public key for the directory
then users interact with directory to obtain any desired public key securely

7. Explain Elliptic Curve Cryptography. (Dec12)


Elliptic Curve Cryptography
The majority of public-key crypto (RSA, D-H) uses either integer or polynomial arithmetic with very large numbers/polynomials, which imposes a significant load in storing and processing keys and messages. An alternative is to use elliptic curves, which offer the same security with smaller bit sizes.
Real Elliptic Curves
An elliptic curve is defined by an equation in two variables x & y, with coefficients. Consider a cubic elliptic curve of the form y^2 = x^3 + ax + b, where x, y, a, b are all real numbers; also define the zero point O.
Have an addition operation for the elliptic curve: geometrically, the sum Q + R is the reflection of the third intersection of the line through Q and R with the curve.
Finite Elliptic Curves
Elliptic curve cryptography uses curves whose variables & coefficients are finite. Two families are commonly used:
- prime curves Ep(a,b) defined over Zp: use integers modulo a prime; best in software
- binary curves E2m(a,b) defined over GF(2^m): use polynomials with binary coefficients; best in hardware
Elliptic Curve Cryptography
- ECC addition is the analog of modular multiplication
- ECC repeated addition is the analog of modular exponentiation
- need a hard problem equivalent to discrete log: Q = kP, where Q and P belong to a prime curve; it is easy to compute Q given k and P, but hard to find k given Q and P
- known as the elliptic curve logarithm problem
- Certicom example: E23(9,17)
ECC Diffie-Hellman
- can do key exchange analogous to D-H
- users select a suitable curve Ep(a,b)
- select a base point G = (x1, y1) with large order n, s.t. nG = O
- A & B select private keys nA < n, nB < n
- compute public keys: PA = nA G, PB = nB G
- compute the shared key: K = nA PB = nB PA
- same since K = nA nB G
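
A Python sketch of ECC Diffie-Hellman on the small Certicom-style curve E23(9,17), i.e. y^2 = x^3 + 9x + 17 over Z23 (the base point (16, 5) is a verified point on this curve, but the private keys below are illustrative choices, not from the source):

```python
# Toy ECDH on y^2 = x^3 + 9x + 17 mod 23. O (point at infinity) is None.
p, a = 23, 9

def ec_add(P, Q):
    # Standard chord-and-tangent addition formulas over Z_p.
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                        # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(k, P):
    # Double-and-add scalar multiplication: computes kP.
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

G = (16, 5)                     # on the curve: 5^2 ≡ 16^3 + 9*16 + 17 (mod 23)
nA, nB = 3, 9                   # private keys (illustrative)
PA, PB = ec_mul(nA, G), ec_mul(nB, G)        # public keys
assert ec_mul(nA, PB) == ec_mul(nB, PA)      # shared key K = nA*nB*G
```

Both sides compute the same point because scalar multiplication commutes: nA(nB G) = nB(nA G).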
ECC Encryption/Decryption
- several alternatives; consider the simplest
- must first encode any message M as a point Pm on the elliptic curve
- select a suitable curve & point G as in D-H
- each user chooses a private key nA < n and computes the public key PA = nA G
- to encrypt Pm: Cm = {kG, Pm + k PB}, k random
- to decrypt Cm compute: Pm + k PB - nB(kG) = Pm + k(nB G) - nB(kG) = Pm
ECC Security
- relies on the elliptic curve logarithm problem
- the fastest known method is the Pollard rho method
- compared to factoring, can use much smaller key sizes than with RSA etc.
- for equivalent key lengths, computations are roughly equivalent
- hence for similar security ECC offers significant computational advantages
Public-Key Certificates
- the public-key authority approach requires real-time access to the directory when keys are needed; certificates allow key exchange without real-time access to a public-key authority
- a certificate binds an identity to a public key, usually with other info such as period of validity, rights of use etc., with all contents signed by a trusted Public-Key or Certificate Authority (CA)
- can be verified by anyone who knows the public-key authority's public key

Public-Key Distribution of Secret Keys
- use the previous methods to obtain public keys
- can use them for secrecy or authentication, but public-key algorithms are slow
- so usually want to use private-key encryption to protect message contents, hence need a session key
- have several alternatives for negotiating a suitable session key

Simple Secret Key Distribution
Proposed by Merkle in 1979:
- A generates a new temporary public key pair
- A sends B the public key and their identity
- B generates a session key K and sends it to A encrypted using the supplied public key
- A decrypts the session key and both use it
- the problem is that an opponent can intercept and impersonate both halves of the protocol

8. Explain Advanced Encryption Standard


Advanced Encryption Standard (AES) Evaluation Criteria
AES Requirements
o private key symmetric block cipher o 128-bit data, 128/192/256-bit keys o stronger &
faster than Triple-DES
o active life of 20-30 years (+ archival use) o provide full specification & design details
o both C & Java implementations
o NIST have released all submissions & unclassified analyses
AES Evaluation Criteria
initial criteria:

security effort to practically cryptanalyse

cost computational

algorithm & implementation characteristics o final criteria

general security

software & hardware implementation ease

implementation attacks

flexibility (in en/decrypt, keying, other factors)

AES Cipher - Rijndael
 designed by Rijmen & Daemen in Belgium
 has 128/192/256-bit keys, 128-bit data
 an iterative rather than Feistel cipher
 treats data in 4 groups of 4 bytes
 operates on the entire block in every round
 designed to be:
o resistant against known attacks
o speed and code compactness on many CPUs
o design simplicity
 processes data as 4 groups of 4 bytes (the state); has 9/11/13 rounds in which the state undergoes:
o byte substitution (1 S-box used on every byte)
o shift rows (permute bytes between groups/columns)
o mix columns (subs using matrix multiply of groups)
o add round key (XOR state with key material)
 initial XOR of key material & an incomplete last round
 all operations can be combined into XOR and table lookups - hence very fast & efficient
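A minimal AES usage sketch, assuming a recent version of the third-party pyca/cryptography package. The key, IV and plaintext are throwaway demonstration values; the 16-byte plaintext is exactly one AES block, so no padding layer is shown.

# AES-128 in CBC mode (sketch, pyca/cryptography assumed)
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(16)              # 128-bit key (192/256-bit also allowed)
iv = os.urandom(16)               # AES block size is 128 bits
plaintext = b"exactly 16 bytes"   # one full block, no padding needed

enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
ciphertext = enc.update(plaintext) + enc.finalize()

dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
assert dec.update(ciphertext) + dec.finalize() == plaintext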
9. Briefly explain triple DES.
Triple DES
 clearly a replacement for DES was needed: theoretical attacks that can break it have been demonstrated, as have exhaustive key search attacks
 AES is the new cipher alternative; prior to it, the alternative was to use multiple encryption with DES implementations
Why Triple-DES? Why not Double-DES?
 NOT the same as some other single-DES use, but it is subject to a meet-in-the-middle attack
 works whenever a cipher is used twice
 since X = EK1[P] = DK2[C]
 attack by encrypting P with all keys and storing the results
 then decrypt C with all keys and match the X values
 can show this takes O(2^56) steps
Triple-DES with Two-Keys
 hence must use 3 encryptions
o would seem to need 3 distinct keys
o but can use 2 keys with an E-D-E sequence
 C = EK1[DK2[EK1[P]]]
 nb encrypt & decrypt equivalent in security
 if K1=K2 then can work with single DES
 standardized in ANSI X9.17 & ISO 8732
 no current known practical attacks
Triple-DES with Three-Keys
 although there are no practical attacks on two-key Triple-DES, there are some indications of weakness
 can use Triple-DES with three keys to avoid even these: C = EK3[DK2[EK1[P]]]
 has been adopted by some Internet applications, eg PGP, S/MIME
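A sketch of two-key Triple DES, assuming a pyca/cryptography version that still exposes TripleDES (newer releases move it to a "decrepit" module since 3DES is deprecated). Supplying a 16-byte key K1||K2 makes the library perform the E-D-E sequence C = EK1[DK2[EK1[P]]] internally with K3 = K1.

# Two-key 3DES, E-D-E (sketch; ECB used only to show a single block)
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

k1, k2 = os.urandom(8), os.urandom(8)
key = k1 + k2                      # 16-byte key => two-key EDE, K3 = K1
plaintext = b"8 byte !"            # one 64-bit DES block

enc = Cipher(algorithms.TripleDES(key), modes.ECB()).encryptor()
ciphertext = enc.update(plaintext) + enc.finalize()

dec = Cipher(algorithms.TripleDES(key), modes.ECB()).decryptor()
assert dec.update(ciphertext) + dec.finalize() == plaintext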

10. Explain Blowfish Algorithm.
Blowfish Encryption Algorithm:
Blowfish was designed in 1993 by Bruce Schneier as a fast alternative to existing encryption algorithms such as DES and Triple DES.
Blowfish is a symmetric block encryption algorithm designed with the following in mind:
 Fast: it encrypts data on large 32-bit microprocessors at a rate of 26 clock cycles per byte.
 Compact: it can run in less than 5K of memory.
 Simple: it uses addition, XOR and table lookup on 32-bit operands.
 Secure: the key length is variable and can be in the range of 32-448 bits; default 128-bit key length.
 It is suitable for applications where the key does not change often, like a communications link or an automatic file encryptor.
 Unpatented and royalty-free.
The Feistel structure of Blowfish
Description of Algorithm:
The Blowfish symmetric block cipher encrypts data in 64-bit blocks. It follows the Feistel network, and the algorithm is divided into two parts:
1. Key-expansion
2. Data encryption
Key-expansion:
It converts a key of at most 448 bits into several subkey arrays totaling 4168 bytes. Blowfish uses a large number of subkeys, which are generated prior to any data encryption or decryption.
The P-array consists of 18 32-bit subkeys:
P1, P2, ..., P18
Four 32-bit S-boxes consist of 256 entries each:
S1,0, S1,1, ..., S1,255
S2,0, S2,1, ..., S2,255
S3,0, S3,1, ..., S3,255
S4,0, S4,1, ..., S4,255
Generating the Subkeys:
The subkeys are calculated using the Blowfish algorithm:
1. Initialize first the P-array and then the four S-boxes, in order, with a fixed string. This string consists of the hexadecimal digits of pi (less the initial 3): P1 = 0x243f6a88, P2 = 0x85a308d3, P3 = 0x13198a2e, P4 = 0x03707344, etc.
2. XOR P1 with the first 32 bits of the key, XOR P2 with the second 32 bits of the key, and so on for all bits of the key (possibly up to P14). Repeatedly cycle through the key bits until the entire P-array has been XORed with key bits. (For every short key, there is at least one equivalent longer key; for example, if A is a 64-bit key, then AA, AAA, etc., are equivalent keys.)
3. Encrypt the all-zero string with the Blowfish algorithm, using the subkeys described in steps (1) and (2).
4. Replace P1 and P2 with the output of step (3).
5. Encrypt the output of step (3) using the Blowfish algorithm with the modified subkeys.
6. Replace P3 and P4 with the output of step (5).
7. Continue the process, replacing all entries of the P-array, and then all four S-boxes in order, with the output of the continuously changing Blowfish algorithm.
In total, 521 iterations are required to generate all required subkeys. Applications can store the subkeys rather than execute this derivation process multiple times.
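A structural sketch of this key-expansion schedule in Python. It is not real Blowfish: PI_DIGITS is a placeholder stream (not the true pi digits) and blowfish_encrypt_block is a stand-in for the 16-round Feistel encryption, so the sketch only shows the order of steps 1-7 and that exactly 521 block encryptions are performed.

# Blowfish key-expansion schedule, structure only (assumed placeholders)
import itertools

PI_DIGITS = itertools.count(0x243f6a88)   # placeholder, NOT the real pi digits

P = [next(PI_DIGITS) & 0xFFFFFFFF for _ in range(18)]   # P-array (step 1)
S = [[next(PI_DIGITS) & 0xFFFFFFFF for _ in range(256)]
     for _ in range(4)]                                 # four S-boxes (step 1)

def blowfish_encrypt_block(left, right, P, S):
    # Stand-in for the real 16-round Feistel cipher; a real implementation
    # would mix the S-boxes into each round through the F function.
    for p in P[:16]:
        left, right = right, left ^ p
    return left ^ P[16], right ^ P[17]

def key_expand(key_words, P, S):
    # Step 2: XOR the P-array with the key words, cycling through the key
    for i in range(18):
        P[i] ^= key_words[i % len(key_words)]
    # Steps 3-7: repeatedly encrypt the evolving all-zero block and replace
    # P entries, then all S-box entries, in order: 9 + 4*128 = 521 encryptions
    block = (0, 0)
    for i in range(0, 18, 2):
        block = blowfish_encrypt_block(*block, P, S)
        P[i], P[i + 1] = block
    for box in S:
        for j in range(0, 256, 2):
            block = blowfish_encrypt_block(*block, P, S)
            box[j], box[j + 1] = block

key_expand([0x01234567, 0x89abcdef], P, S)   # assumed 64-bit toy key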
UNIT III
PART A (TWO MARKS)
1. What is message authentication? (Dec 14)
It is a procedure that verifies whether the received message comes from the alleged source and has not been altered. It uses message authentication codes and hash algorithms to authenticate the message.
2. Define the classes of message authentication function.
Message encryption: the ciphertext of the entire message serves as the authenticator.
Message Authentication Code (MAC): a function of the message and a secret key that produces a fixed-length value serving as the authenticator.
Hash function: a function that maps a message of any length to a fixed-length value which serves as the authenticator.
3. What are the requirements for message authentication? (May 12)
The requirements for message authentication are:
i. Disclosure: release of message contents to any person or process not possessing the appropriate cryptographic key.
ii. Traffic analysis: discovery of the pattern of traffic between parties. In a connection-oriented application, the frequency and duration of connections could be determined. In either a connection-oriented or connectionless environment, the number and length of messages between parties could be determined.
iii. Masquerade: insertion of messages into the network from a fraudulent source. This includes the creation of messages by an opponent that are purported to come from an authorized entity. Also included are fraudulent acknowledgements of message receipt or non-receipt by someone other than the message recipient.
iv. Content modification: changes to the contents of a message, including insertion, deletion, transposition, and modification.
v. Sequence modification: any modification to a sequence of messages between parties, including insertion, deletion, and modification.
vi. Timing modification: delay or replay of messages. In a connection-oriented application, an entire session or sequence of messages could be a replay of some previous valid session, or individual messages in the sequence could be delayed or replayed. In a connectionless application, an individual message could be delayed or replayed.
vii. Source repudiation: denial of transmission of message by source.
viii. Destination repudiation: denial of receipt of message by destination.

4. What do you mean by hash function?
A cryptographic hash function is a hash function which takes an input (or 'message') and returns a fixed-size alphanumeric string, which is called the hash value (sometimes called a message digest, a digital fingerprint, a digest or a checksum). Functions with these properties are used as hash functions for a variety of purposes, not only in cryptography. Practical applications include message integrity checks, digital signatures, authentication, and various information security applications.
5. Differentiate MAC and Hash function? (May 13)
MAC: In a Message Authentication Code, a secret key is shared by sender and receiver. The MAC is appended to the message at the source at a time when the message is assumed or known to be correct.
Hash Function: The hash value is appended to the message at the source at a time when the message is assumed or known to be correct. The hash function itself is not considered to be secret.
6. Name any three hash algorithms.
MD5 (Message Digest version 5) algorithm.
SHA_1 (Secure Hash Algorithm).
RIPEMD_160 algorithm.
7. What are the requirements of the hash function?
H can be applied to a block of data of any size.
H produces a fixed length output.
H(x) is relatively easy to compute for any given x, making both hardware and
software implementations practical.
8. What do you mean by MAC?
MAC is Message Authentication Code. It is a function of the message and a secret key which produces a fixed-length value called the MAC.
MAC = CK(M), where M = variable-length message, K = secret key shared by sender and receiver, and CK(M) = fixed-length authenticator.
9. Differentiate internal and external error control.
Internal error control: an error-detecting code (also known as a frame check sequence or checksum) is computed and appended before encryption.
External error control: error-detecting codes are appended after encryption.
10. What is meant by meet in the middle attack?
This is a cryptanalytic attack that attempts to find a value in each of the range and domain of the composition of two functions, such that the forward mapping of one through the first function is the same as the inverse image of the other through the second function - quite literally meeting in the middle of the composed function.
11. What is the role of compression function in hash function?
The hash algorithm involves repeated use of a compression function f, which takes two inputs (an n-bit chaining variable and a b-bit block of the message) and produces an n-bit output. At the start of hashing, the chaining variable has an initial value that is specified as part of the algorithm. The final value of the chaining variable is the hash value. Usually b>n; hence the term compression.
12. What is the difference between weak and strong collision resistance?
Weak collision resistance: for any given block x, it is computationally infeasible to find y != x with H(y) = H(x). The effort required is proportional to 2^n.
Strong collision resistance: it is computationally infeasible to find any pair (x, y) such that H(x) = H(y). The effort required is proportional to 2^(n/2).
13. Compare MD5, SHA-1 and RIPEMD-160 algorithms. (Dec 13)

Feature                      | MD5                 | SHA-1               | RIPEMD-160
Digest length                | 128 bits            | 160 bits            | 160 bits
Basic unit of processing     | 512 bits            | 512 bits            | 512 bits
Number of steps              | 64 (4 rounds of 16) | 80 (4 rounds of 20) | 160 (5 paired rounds of 16)
Maximum message size         | infinity            | 2^64 - 1 bits       | 2^64 - 1 bits
Primitive logical functions  | 4                   | 4                   | 5
Additive constants used      | 64                  | 4                   | 9
Endianness                   | little-endian       | big-endian          | little-endian

14. Distinguish between direct and arbitrated digital signature.
Direct digital signature:
1. Involves only the communicating parties.
2. May be formed by encrypting the entire message with the sender's private key.
Arbitrated digital signature:
1. The arbiter plays a sensitive and crucial role.
2. Every signed message from a sender X to a receiver Y goes first to an arbiter A, who subjects the message and its signature to a number of tests to check its origin and content.

15. What are the properties a digital signature should have?
 It must verify the author and the date and time of signature.
 It must authenticate the contents at the time of signature.
 It must be verifiable by third parties to resolve disputes.
16. What requirements should a digital signature scheme satisfy?
The signature must be bit pattern that depends on the message being signed.
The signature must use some information unique to the sender, to prevent both
forgery and denial. It must be relatively easy to produce the digital signature.
It must be relatively easy to recognize and verify the digital signature.
It must be computationally infeasible to forge a digital signature, either by
constructing a new message for an existing digital signature or by constructing a
fraudulent digital signature for a given message.
It must be practical to retain a copy of the digital signature in storage.
17. Define CMAC
In cryptography, CMAC (Cipher-based Message Authentication Code) is a block cipher-based message authentication code algorithm. It may be used to provide assurance of the authenticity and, hence, the integrity of binary data. This mode of operation fixes the security deficiencies of CBC-MAC (which is secure only for fixed-length messages).
18. Define HMAC
Hash-based Message Authentication Code (HMAC) is a message authentication code that uses a secret cryptographic key in conjunction with a hash function. The key is shared between, and known only to, the specific server and the specific client. The client creates a unique HMAC per request by combining the request data with the secret key and hashing it, and sends it as part of the request. The server receives the request and regenerates its own HMAC over the same data. The server compares the two HMACs and, if they are equal, the client is trusted and the request is executed. This process is often called a secret handshake.
19. What is digital signature? (May 15)
A digital signature is a mathematical technique used to validate the authenticity and integrity of a message, software or digital document. Digital signatures can provide the added assurances of evidence of origin, identity and status of an electronic document, transaction or message, as well as acknowledging informed consent by the signer.
20. Give Elgamal Digital Signature Scheme. (May 13)
The ElGamal signature scheme is a digital signature scheme which is based on
the difficulty of computing discrete logarithms. It was described by Taher
ElGamal in 1984. The ElGamal signature scheme allows a third-party to confirm
the authenticity of a message sent over an insecure channel.
PART-B
1. Explain the classification of authentication functions in detail. (May 11)
Message authentication is concerned with:
o protecting the integrity of a message
o validating the identity of the originator
o non-repudiation of origin (dispute resolution)
 an authenticator, signature, or message authentication code (MAC) is the electronic equivalent of a signature on a message, and is sent along with the message
 the MAC is generated via some algorithm which depends on both the message and some secret key known only to the sender and receiver
 the message may be of any length
 the MAC may be of any length, but more often is some fixed size, requiring the use of some hash function to condense the message to the required size if this is not achieved by the authentication scheme
 need to consider replay problems with the message and MAC
o require a message sequence number, timestamp or negotiated random values

Fig: Authentication using Private-key Ciphers
 if a message is being encrypted using a session key known only to the sender and receiver, then the message may also be authenticated
o since only the sender or receiver could have created it
o any interference will corrupt the message (provided it includes sufficient redundancy to detect change)
o but this does not provide non-repudiation, since it is impossible to prove who created the message
 message authentication may also be done using the standard modes of use of a block cipher
o sometimes do not want to send encrypted messages
o can use either CBC or CFB modes and send the final block, since this will depend on all previous bits of the message
o no hash function is required, since this method accepts arbitrary length input and produces a fixed output
o usually use a fixed known IV
o this is the approach used in the Australian EFT standard AS8205
o major disadvantage is the small size of the resulting MAC, since 64 bits is probably too small

Hashing Functions
 hashing functions are used to condense an arbitrary length message to a fixed size, usually for subsequent signature by a digital signature algorithm
 a good cryptographic hash function h should have the following properties:
o h should destroy all homomorphic structures in the underlying public key cryptosystem (be unable to compute the hash value of 2 messages combined given their individual hash values)
o h should be computed on the entire message
o h should be a one-way function so that messages are not disclosed by their signatures
o it should be computationally infeasible, given a message and its hash value, to compute another message with the same hash value
o should resist birthday attacks (finding any 2 messages with the same hash value, perhaps by iterating through minor permutations of 2 messages)
 it is usually assumed that the hash function is public and not keyed
 traditional CRCs do not satisfy the above requirements
 length should be large enough to resist birthday attacks (64 bits is now regarded as too small; 128-512 bits proposed)

Snefru
 a one-way hash function designed by Ralph Merkle
 creates 128 or 256 bit long hash values (let m be the length)
 uses an algorithm H which hashes 512 bits to m bits, taking the first m output bits of H as the hash value
 H is based on a reversible block cipher E operating on 512-bit blocks
o H is the last m bits of the output of E XOR'd with the first m bits of the input of E
o E is composed of several passes; each pass has 64 rounds of an S-box lookup and XOR
o E can use 2 to 8 passes
 overview of algorithm:
o break message into (512-m)-bit chunks
o each chunk has the previous hash value appended (assuming an IV of 0)
o H is computed on this value, giving a new hash value
o after the last block (0-padded to size as needed) the hash value is appended to a message length value and H computed on this, the resulting value being the MAC
 Snefru has been broken by a birthday attack by Biham and Shamir for 128-bit hashes, and possibly for 256-bit when 2 to 4 passes are used in E
 Merkle recommends 8 passes, but this is slow

2. Describe MD5 algorithm in detail. Compare its performance with SHA-1. (Dec 13 & May 12)
MD2, MD4 and MD5
 a family of one-way hash functions by Ronald Rivest
 MD2 is the oldest, produces a 128-bit hash value, and is regarded as slower and less secure than MD4 and MD5
 MD4 produces a 128-bit hash of the message, using bit operations on 32-bit operands for fast implementation (R L Rivest, "The MD4 Message Digest Algorithm", RFC 1320)
 MD4 overview:
o pad message so its length is 448 mod 512
o append a 64-bit message length value to the message
o initialise the 4-word (128-bit) buffer (A,B,C,D)
o process the message in 16-word (512-bit) chunks, using 3 rounds of 16 operations each on the chunk & buffer
o output hash value is the final buffer value
 some progress at cryptanalyzing MD4 has been made, with a small number of collisions having been found
 MD5 was designed as a strengthened version, using four rounds, a little more complex than in MD4
 a little progress at cryptanalyzing MD5 has been made, with a small number of collisions having been found
 both MD4 and MD5 are still in use and considered secure in most practical applications
 both are specified as Internet standards (MD4 in RFC 1320, MD5 in RFC 1321)
SHA (Secure Hash Algorithm)
 SHA was designed by NIST & NSA and is the US federal standard for use with the DSA signature scheme (nb the algorithm is SHA, the standard is SHS)
 it produces 160-bit hash values
 SHA overview:
o pad message so its length is a multiple of 512 bits
o initialise the 5-word (160-bit) buffer (A,B,C,D,E) to (67452301, efcdab89, 98badcfe, 10325476, c3d2e1f0)
o process the message in 16-word (512-bit) chunks, using 4 rounds of 20 operations each on the chunk & buffer
o output hash value is the final buffer value
 SHA is a close relative of MD5, sharing much common design, but each having differences
 SHA has very recently been subject to modification following NIST identification of some concerns, the exact nature of which is not public
 the current version is regarded as secure
3. Describe RIPEMD-160 algorithm in detail. (Dec 13)
RIPEMD-160 was developed in Europe as part of the RIPE project in 1996 by researchers involved in attacks on MD4/5. The initial proposal was strengthened following analysis to become RIPEMD-160. It is somewhat similar to MD5/SHA, uses 2 parallel lines of 5 rounds of 16 steps, creates a 160-bit hash value, and is slower, but probably more secure, than SHA.
Overview:
(i) pad message so its length is 448 mod 512
(ii) append a 64-bit length value to the message
(iii) initialise the 5-word (160-bit) buffer (A,B,C,D,E) to (67452301, efcdab89, 98badcfe, 10325476, c3d2e1f0)
(iv) process the message in 16-word (512-bit) chunks: use 10 rounds of 16 operations on the message block & buffer, in 2 parallel lines of 5, and add the output to the input to form the new buffer value
(v) output hash value is the final buffer value
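The digest lengths compared above can be checked directly with Python's standard hashlib module; RIPEMD-160 is only available where the local OpenSSL build still provides it, hence the guard.

# MD5 / SHA-1 / RIPEMD-160 digests via the standard library
import hashlib

msg = b"abc"
print("MD5  :", hashlib.md5(msg).hexdigest())    # 128-bit digest (32 hex chars)
print("SHA-1:", hashlib.sha1(msg).hexdigest())   # 160-bit digest (40 hex chars)
try:
    print("RIPEMD-160:", hashlib.new("ripemd160", msg).hexdigest())  # 160-bit
except ValueError:
    print("RIPEMD-160 not available in this OpenSSL build")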
4. Write and explain the Digital Signature. (16)
Digital Signatures
 have looked at message authentication, but it does not address issues of lack of trust
 digital signatures provide the ability to:
o verify author, date & time of signature
o authenticate message contents
o be verified by third parties to resolve disputes
 hence they include the authentication function with additional capabilities
Digital Signature Properties
o must depend on the message signed
o must use information unique to the sender, to prevent both forgery and denial
o must be relatively easy to produce
o must be relatively easy to recognize & verify
o must be computationally infeasible to forge, either with a new message for an existing digital signature or with a fraudulent digital signature for a given message
o must be practical to save the digital signature in storage
Direct Digital Signatures
o involve only sender & receiver
o assume the receiver has the sender's public key
o digital signature is made by the sender signing the entire message or its hash with their private key
o can then encrypt using the receiver's public key; important to sign first, then encrypt message & signature
o security depends on the sender's private key
Arbitrated Digital Signatures
o involve the use of an arbiter A
o the arbiter validates any signed message
o it is then dated and sent to the recipient
o requires a suitable level of trust in the arbiter
o can be implemented with either private or public-key algorithms
o the arbiter may or may not see the message
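A sketch of a direct digital signature, assuming the third-party pyca/cryptography package with RSA-PSS over SHA-256 as the concrete scheme. Any party holding the public key can run the verify step, which is what makes third-party dispute resolution possible.

# Direct digital signature: sign with private key, verify with public key
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.exceptions import InvalidSignature

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
message = b"order #1234: transfer 100 to Bob"

pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)
signature = private_key.sign(message, pss, hashes.SHA256())

try:
    private_key.public_key().verify(signature, message, pss, hashes.SHA256())
    print("signature valid")
except InvalidSignature:
    print("signature forged or message altered")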
5. Explain Needham-Schroeder Protocol
o the original third-party key distribution protocol
o for a session between A and B mediated by a KDC
o protocol overview:
1. A -> KDC: IDA || IDB || N1
2. KDC -> A: EKa[Ks || IDB || N1 || EKb[Ks||IDA]]
3. A -> B: EKb[Ks||IDA]
4. B -> A: EKs[N2]
5. A -> B: EKs[f(N2)]
o used to securely distribute a new session key for communications between A & B
o but is vulnerable to a replay attack if an old session key has been compromised: message 3 can then be resent, convincing B that it is communicating with A
o modifications to address this require:
 timestamps (Denning 81)
 using an extra nonce (Neuman 93)
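A toy trace of the five messages, assuming pyca/cryptography's Fernet as the symmetric primitive E_K[.]. The identities, nonces, JSON packaging and f(N2) = N2 + 1 are all illustrative stand-ins, not part of the original protocol specification.

# Needham-Schroeder, simulated in one process (sketch)
import os, json
from cryptography.fernet import Fernet

Ka, Kb = Fernet.generate_key(), Fernet.generate_key()  # master keys with KDC
Ks = Fernet.generate_key()                             # new session key

# 1. A -> KDC : IDA || IDB || N1   (sent in the clear)
N1 = os.urandom(8).hex()

# 2. KDC -> A : E_Ka[Ks || IDB || N1 || E_Kb[Ks || IDA]]
ticket = Fernet(Kb).encrypt(json.dumps({"Ks": Ks.decode(), "IDA": "A"}).encode())
msg2 = Fernet(Ka).encrypt(json.dumps(
    {"Ks": Ks.decode(), "IDB": "B", "N1": N1, "ticket": ticket.decode()}).encode())

# 3. A -> B : E_Kb[Ks || IDA]  -- A forwards the ticket it cannot read
inner = json.loads(Fernet(Ka).decrypt(msg2))
assert inner["N1"] == N1                  # A checks freshness via N1
ks_at_b = json.loads(Fernet(Kb).decrypt(inner["ticket"].encode()))["Ks"].encode()

# 4. B -> A : E_Ks[N2]   5. A -> B : E_Ks[f(N2)], with f(N2) = N2 + 1 assumed
N2 = int.from_bytes(os.urandom(4), "big")
msg4 = Fernet(ks_at_b).encrypt(str(N2).encode())
n2_at_a = int(Fernet(Ks).decrypt(msg4))
msg5 = Fernet(Ks).encrypt(str(n2_at_a + 1).encode())
assert int(Fernet(ks_at_b).decrypt(msg5)) == N2 + 1   # handshake completes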
6. Explain authentication protocols in detail.

Authentication protocols overview


Authentication is a fundamental aspect of system security. It confirms the identity of
any user trying to log on to a domain or access network resources. Windows
Server 2003 family authentication enables single sign-on to all network resources. With
single sign-on, a user can log on to the domain once, using a single password or smart
card, and authenticate to any computer in the domain.
Authentication types
When attempting to authenticate a user, several industry-standard types of authentication may be used, depending on a variety of factors. The following are the types of authentication that the Windows Server 2003 family supports.
Kerberos V5 authentication: a protocol that is used with either a password or a smart card for interactive logon. It is also the default method of network authentication for services.
SSL/TLS authentication: a protocol that is used when a user attempts to access a secure Web server.
NTLM authentication: a protocol that is used when either the client or server uses a previous version of Windows.
Digest authentication: transmits credentials across the network as an MD5 hash or message digest.
Passport authentication: a user-authentication service which offers single sign-in service.

7. Explain authentication requirements.
Authentication Requirements
o Disclosure: release of message contents to any person or process not possessing the appropriate cryptographic key.
o Traffic analysis: discovery of the pattern of traffic between parties. In a connection-oriented application, the frequency and duration of connections could be determined. In either a connection-oriented or connectionless environment, the number and length of messages between parties could be determined.
o Masquerade: insertion of messages into the network from a fraudulent source, including fraudulent acknowledgements of message receipt or non-receipt by someone other than the message recipient.
o Content modification: changes to the contents of a message, including insertion, deletion, transposition, and modification.
o Sequence modification: any modification to a sequence of messages between parties, including insertion, deletion, and reordering.
o Timing modification: delay or replay of messages. In a connection-oriented application, an entire session or sequence of messages could be a replay of some previous valid session, or individual messages in the sequence could be delayed or replayed. In a connectionless application, an individual message (e.g., datagram) could be delayed or replayed.
o Source repudiation: denial of transmission of message by source.
o Destination repudiation: denial of receipt of message by destination.
Authentication Function
Any message authentication or digital signature mechanism can be viewed as having fundamentally two levels. At the lower level, there must be some sort of function that produces an authenticator: a value to be used to authenticate a message. This low-level function is then used as a primitive in a higher-level authentication protocol that enables a receiver to verify the authenticity of a message.
The types of function that may be used to produce an authenticator are grouped into three classes:
Message Encryption: the ciphertext of the entire message serves as its authenticator.
Message Authentication Code (MAC): a public function of the message and a secret key that produces a fixed-length value that serves as the authenticator.
Hash Function: a public function that maps a message of any length into a fixed-length hash value, which serves as the authenticator.
8. Explain HMAC
 specified as Internet standard RFC 2104
 uses the hash function on the message:
HMAC_K = Hash[(K+ XOR opad) || Hash[(K+ XOR ipad) || M]]
where K+ is the key padded out to the hash block size, and opad, ipad are specified padding constants
 overhead is just 3 more hash calculations than the message alone needs
 any of MD5, SHA-1, RIPEMD-160 can be used
HMAC Security
 the security of HMAC relates to that of the underlying hash algorithm
 attacking HMAC requires either:
o a brute force attack on the key used, or
o a birthday attack (but since HMAC is keyed, the attacker would need to observe a very large number of messages)
 choose the hash function used based on speed versus security constraints
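The formula can be implemented directly and checked against Python's standard hmac module; SHA-256 with its 64-byte block size is assumed here, and keys longer than one block (which the standard first hashes down) are not handled in this sketch.

# HMAC built by hand from the RFC 2104 formula, verified against hmac module
import hashlib, hmac

def hmac_manual(key: bytes, msg: bytes) -> bytes:
    block = 64                                    # SHA-256 block size in bytes
    k_plus = key.ljust(block, b"\x00")            # K+ : key padded to block size
    ipad = bytes(b ^ 0x36 for b in k_plus)        # K+ XOR ipad
    opad = bytes(b ^ 0x5C for b in k_plus)        # K+ XOR opad
    inner = hashlib.sha256(ipad + msg).digest()   # Hash[(K+ XOR ipad) || M]
    return hashlib.sha256(opad + inner).digest()  # Hash[(K+ XOR opad) || inner]

key, msg = b"secret key", b"attack at dawn"
assert hmac_manual(key, msg) == hmac.new(key, msg, hashlib.sha256).digest()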
UNIT- IV
PART-A (2 MARKS)
1. Define Kerberos. (Dec 11)
Kerberos is an authentication service developed as part of Project Athena at MIT. The problem that Kerberos addresses is this: assume an open distributed environment in which users at workstations wish to access services on servers distributed throughout the network.

2. What is Kerberos? What are the uses? (May 12)
Kerberos is an authentication service developed as a part of Project Athena at MIT. Kerberos provides a centralized authentication server whose function is to authenticate users to servers.

3. What are the requirements that were defined by Kerberos?
 Secure
 Reliable
 Transparent
 Scalable
4. In the context of Kerberos, what is a realm?
A full-service Kerberos environment, consisting of a Kerberos server, a number of clients and a number of application servers, requires the following: the Kerberos server must have the user ID and hashed password of all participating users in its database, and the Kerberos server must share a secret key with each server. Such an environment is referred to as a realm.

5. What is the purpose of the X.509 standard? (Dec 14)
X.509 defines a framework for the provision of authentication services by the X.500 directory to its users. X.509 also defines authentication protocols based on public-key certificates.
6. What are the services provided by PGP?
 Digital signature
 Message encryption
 Compression
 E-mail compatibility
 Segmentation
7. Why is the E-mail compatibility function in PGP needed?
Electronic mail systems only permit the use of blocks consisting of ASCII text. To accommodate this restriction, PGP provides the service of converting the raw 8-bit binary stream to a stream of printable ASCII characters. The scheme used for this purpose is radix-64 conversion.
8. Name any cryptographic keys used in PGP?
One-time session conventional keys.
Public keys.
Private keys.
Pass phrase based conventional keys.
9. Define key Identifier?
PGP assigns a key ID to each public key that is, with very high probability, unique within a user ID. It is also required for the PGP digital signature. The key ID associated with each public key consists of its least significant 64 bits.

10. List the limitations of SMTP/RFC 822?
1. SMTP cannot transmit executable files or binary objects.
2. It cannot transmit text data containing national language characters.
3. SMTP servers may reject mail messages over a certain size.
4. SMTP gateways cause problems while translating between ASCII and EBCDIC.
5. SMTP gateways to X.400 e-mail networks cannot handle non-textual data included in X.400 messages.
11. Define S/MIME (Dec 14)
Secure/Multipurpose Internet Mail Extension(S/MIME) is a security enhancement to the
MIME Internet E-mail format standard, based on technology from RSA Data Security.
12. What is a firewall?
A firewall is a network security system designed to prevent unauthorized access to or
from a private network. Firewalls can be implemented in both hardware and software, or
a combination of both.
A firewall is a single device used to enforce security policies within a network or between
networks by controlling traffic flows.
The Firewall Services Module (FWSM) is a very capable device that can be used to
enforce those security policies. The FWSM was developed as a module or blade that
resides in either a Catalyst 6500 series chassis or a 7600 series router chassis.
13. What are the types of firewalls?

Packet Filtering Firewalls


Reverse Proxy Firewalls
Host Based firewalls
Personal firewalls
Distributed Firewalls
Circuit level firewall
Application proxy firewall

14. What are limitations of firewalls?
 cannot protect from attacks bypassing it, eg sneaker net, utility modems, trusted organizations, trusted services (eg SSL/SSH)
 cannot protect against internal threats, eg disgruntled or colluding employees
 cannot protect against access via WLAN if improperly secured against external use
 cannot protect against malware imported via laptop, PDA, or storage infected outside
15. What is an intruder?
An Intruder is a person who attempts to gain unauthorized access to a system, to damage
that system, or to disturb data on that system. In summary, this person attempts to
violate Security by interfering with system Availability, data Integrity or data
Confidentiality.
16. What is IDS?
An intrusion detection system (IDS) is a device or software application that monitors
network or system activities for malicious activities or policy violations and produces
electronic reports to a management station.
17. What are the types of IDS?
Network Based IDS
Host Based IDS
Intrusion detection and prevention systems (IDPS)
18. Define virus
A computer virus is malware that, when executed, replicates by reproducing itself or infecting other programs by modifying them. Infected targets can include programs, data files, or the boot sector of the hard drive. When this replication succeeds, the affected areas are then said to be "infected".
19. Differentiate virus, worm and Trojan horse
VIRUS: A computer virus is malware that, when executed, replicates by reproducing itself or infecting other programs by modifying them.
WORM: A worm uses a computer network to spread itself. Unlike a computer virus, it does not need to attach itself to an existing program. Worms almost always cause at least some harm to the network, even if only by consuming bandwidth.
TROJAN HORSE: A Trojan horse at first glance will appear to be useful software but will actually do damage once installed or run on your computer.

20. Define worms.


A computer worm is a standalone malware computer program that replicates itself in
order to spread to other computers. Often, it uses a computer network to spread itself,
relying on security failures on the target computer to access it. Unlike a computer virus, it
does not need to attach itself to an existing program. Worms almost always cause at least
some harm to the network, even if only by consuming bandwidth, whereas viruses almost
always corrupt or modify files on a targeted computer.
PART B (16 marks)
1. Explain Kerberos in detail.
Kerberos
 trusted key server system from MIT
 provides centralised private-key third-party authentication in a distributed network
o allows users access to services distributed through the network
o without needing to trust all workstations
o rather, all trust a central authentication server
 two versions in use: 4 & 5
Kerberos Requirements
 the first published report identified its requirements as:
o security
o reliability
o transparency
o scalability
 implemented using an authentication protocol based on Needham-Schroeder
Kerberos 4 Overview
o a basic third-party authentication scheme
o have an Authentication Server (AS): users initially negotiate with the AS to identify themselves, and the AS provides a non-corruptible authentication credential (ticket granting ticket, TGT)
o have a Ticket Granting Server (TGS): users subsequently request access to other services from the TGS on the basis of the user's TGT
Kerberos Realms
o a Kerberos environment consists of:
 a Kerberos server
 a number of clients, all registered with the server
 application servers, sharing keys with the server
o this is termed a realm, typically a single administrative domain
o if there are multiple realms, their Kerberos servers must share keys and trust each other
Kerberos Version 5
o developed in the mid 1990s
o provides improvements over v4
o addresses environmental shortcomings: encryption algorithm, network protocol, byte order, ticket lifetime, authentication forwarding, interrealm authentication
o and technical deficiencies: double encryption, non-standard modes of use, session keys, password attacks
o specified as Internet standard RFC 1510

2. Explain Intrusion Detection Systems in detail. (May 15)
Intruders
 a significant issue for networked systems is hostile or unwanted access, either via the network or locally
 can identify classes of intruders: masquerader, misfeasor, clandestine user, with varying levels of competence
 a clearly growing, publicized problem: from the Wily Hacker in 1986/87 to clearly escalating CERT statistics
 may seem benign, but still costs resources; a compromised system may be used to launch other attacks
Intrusion Techniques
 aim to increase privileges on the system
 basic attack methodology: target acquisition and information gathering, initial access, privilege escalation, covering tracks
 a key goal is often to acquire passwords, so the attacker can then exercise the access rights of the owner
Password Guessing
 one of the most common attacks: the attacker knows a login (from email/web page etc) and attempts to guess its password
 try default passwords shipped with systems, try all short passwords, then try by searching dictionaries of common words
 intelligent searches try passwords associated with the user (variations on names, birthday, phone, common words/interests) before exhaustively searching all possible passwords
 check by login attempt or against a stolen password file
 success depends on the password chosen by the user; surveys show many users choose poorly
Password Capture
 another attack involves password capture: watching over the shoulder as a password is entered, using a trojan horse program to collect it, monitoring an insecure network login (eg. telnet, FTP, web, email), or extracting recorded info after a successful login (web history/cache, last number dialed etc)
 using a valid login/password the attacker can impersonate the user
 users need to be educated to use suitable precautions/countermeasures
Intrusion Detection
 inevitably there will be security failures, so intrusions must also be detected: to block them if detected quickly, to act as a deterrent, and to collect info to improve security
 assume the intruder will behave differently to a legitimate user, but there will be an imperfect distinction between the two
Approaches to Intrusion Detection
 statistical anomaly detection: threshold-based or profile-based
 rule-based detection: anomaly detection or penetration identification
Audit Records
 a fundamental tool for intrusion detection
 native audit records: part of all common multi-user O/S and already present for use, but may not have the info wanted in the desired form
 detection-specific audit records: created specifically to collect the wanted info, at the cost of additional overhead on the system
Statistical Anomaly Detection
 threshold detection: count occurrences of a specific event over time; if it exceeds a reasonable value, assume intrusion; alone this is a crude & ineffective detector
 profile-based: characterize the past behavior of users and detect significant deviations from this profile; usually multi-parameter
Audit Record Analysis
 the foundation of statistical approaches: analyze records to get metrics over time (counter, gauge, interval timer, resource use)
 use various tests on these to determine if current behavior is acceptable: mean & standard deviation, multivariate, markov process, time series, operational
 key advantage is that no prior knowledge is used
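A minimal sketch of the profile-based mean & standard deviation test mentioned above: flag an observed event count if it deviates from the user's historical profile by more than 3 sigma. The audit counts and the 3-sigma threshold are invented sample values.

# Profile-based statistical anomaly detection (toy sketch)
import statistics

history = [12, 15, 11, 14, 13, 16, 12, 15]   # past daily login-failure counts
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomalous(count, threshold=3.0):
    """True if the count deviates from the profile by > threshold sigmas."""
    return abs(count - mean) > threshold * stdev

print(is_anomalous(14))   # False: consistent with the user's profile
print(is_anomalous(90))   # True: significant deviation, raise an alert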
3. Explain Rule-Based Intrusion Detection
 observe events on the system & apply rules to decide if activity is suspicious or not
 rule-based anomaly detection: analyze historical audit records to identify usage patterns & auto-generate rules for them, then observe current behavior & match it against the rules to see if it conforms; like statistical anomaly detection, it does not require prior knowledge of security flaws
 rule-based penetration identification: uses expert systems technology, with rules identifying known penetrations, weakness patterns, or suspicious behavior; rules are usually machine & O/S specific and are generated by experts who interview & codify the knowledge of security admins; quality depends on how well this is done; compare audit records or states against the rules
Base-Rate Fallacy
 practically, an intrusion detection system needs to detect a substantial percentage of intrusions with few false alarms
o if too few intrusions are detected -> false sense of security
o if too many false alarms -> ignored / waste of time
 this is very hard to do; existing systems seem not to have a good record
Distributed Intrusion Detection
 traditional focus is on single systems, but we typically have networked systems
 a more effective defense has these working together to detect intrusions
 issues: dealing with varying audit record formats, integrity & confidentiality of networked data, centralized or decentralized architecture
Distributed Intrusion Detection Architecture
Distributed Intrusion Detection Agent Implementation

Honeypots
o decoy systems to lure attackers away from accessing critical systems, to collect information about their activities, and to encourage the attacker to stay on the system long enough for administrators to respond
o are filled with fabricated information
o instrumented to collect detailed information on the attacker's activities
o may be single or multiple networked systems

4. Explain design principles of firewall in detail.
Introduction
o we have seen the evolution of information systems
o now everyone wants to be on the Internet and to interconnect networks
o this brings persistent security concerns: one can't easily secure every system in an organization
o hence the need for "harm minimisation"
o a firewall is usually part of this
What is a Firewall?
o a choke point of control and monitoring
o interconnects networks with differing trust
o imposes restrictions on network services: only authorized traffic is allowed
o auditing and controlling access: can implement alarms for abnormal behavior
o is itself immune to penetration
o provides perimeter defence
Firewall Limitations
o cannot protect from attacks bypassing it, eg sneaker net, utility modems, trusted organisations, trusted services (eg SSL/SSH)
o cannot protect against internal threats, eg a disgruntled employee
o cannot protect against the transfer of all virus-infected programs or files, because of the huge range of O/S & file types

Firewalls - Packet Filters
o simplest of components
o foundation of any firewall system
o examine each IP packet (no context) and permit or deny according to rules
o hence restrict access to services (ports)
o possible default policies:
 that which is not expressly permitted is prohibited
 that which is not expressly prohibited is permitted
Attacks on Packet Filters
o IP address spoofing: fake source address to be trusted; add filters on the router to block
o source routing attacks: attacker sets a route other than the default; block source-routed packets
o tiny fragment attacks: split header info over several tiny packets; either discard or reassemble before checking
Firewalls - Stateful Packet Filters
o examine each IP packet in context
o keep track of client-server sessions
o check each packet validly belongs to one
o better able to detect bogus packets sent out of context
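A toy rule engine illustrating first-match packet filtering with the default policy "that which is not expressly permitted is prohibited". The rule fields and the rules themselves are invented for the example; real filters match many more header fields.

# Toy packet filter: first matching rule wins, default deny
RULES = [
    # (action, source-address prefix, destination port or None for any)
    ("deny",   "10.0.0.",  None),  # block a spoofable internal prefix at the edge
    ("permit", "",         80),    # allow HTTP from anywhere
    ("permit", "",         25),    # allow SMTP from anywhere
]

def filter_packet(src_ip: str, dst_port: int) -> str:
    for action, prefix, port in RULES:
        if src_ip.startswith(prefix) and port in (None, dst_port):
            return action
    return "deny"                  # default: not expressly permitted

print(filter_packet("192.0.2.7", 80))   # permit
print(filter_packet("10.0.0.5", 80))    # deny  (spoofed internal source)
print(filter_packet("192.0.2.7", 23))   # deny  (telnet not permitted)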
5. Explain Roles of Firewalls.
A firewall is a term used for a "barrier" between a network of machines and users that operate under a common security policy and generally trust each other, and the outside world. In recent years, firewalls have become enormously popular on the Internet. In large part, this is due to the fact that most existing operating systems have essentially no security, and were designed under the assumption that machines and users would trust each other.
There are two basic reasons for using a firewall at present: to save money by concentrating your security on a small number of components, and to simplify the architecture of a system by restricting access only to machines that trust each other. Firewalls are often regarded by some as an irritation because they are seen as an impediment to accessing resources. This is not a fundamental flaw of firewalls, but rather the result of failing to keep up with demands to improve the firewall.
There is a fairly large group of determined and capable individuals around the world who take pleasure in breaking into systems. Other than the sense of insecurity that it has instilled in society, the amount of actual damage that has been caused is relatively slight. It highlights the fact that essentially any system can be compromised if an adversary is determined enough. It is a tried and true method to improve security within DOD projects to have a "black hat" organization that attempts to break into systems, rather than have the vulnerabilities found by your real adversaries. By bringing the vulnerabilities of systems to the forefront, the Internet hackers have essentially provided this service, and an impetus to improve existing systems. It is probably a stretch to say that we should thank them, but I believe that it is better to raise these issues early rather than later, when our society will be almost 100% dependent on information systems.
6. Explain types of firewalls.
Types of Firewalls
The firewalls can be broadly categorized into the following three types:
 Packet Filters
 Application-level Gateways
 Circuit-level Gateways
Packet Filters: A packet-filtering router applies a set of rules to each incoming IP packet and then forwards or discards it. A packet filter is typically set up as a list of rules based on matches of fields in the IP or TCP header. The packet filter operates with positive filter rules: it is necessary to specify what should be permitted, and everything that is not explicitly permitted is automatically forbidden.
Application-level Gateway: An application-level gateway, also called a proxy server, acts as a relay of application-level traffic. Users contact the gateway using an application, and the request succeeds after authentication. The application gateway is service-specific, such as FTP, TELNET, SMTP or HTTP.
Circuit-level Gateway: A circuit-level gateway can be a standalone or a specialized system. It does not allow an end-to-end TCP connection; the gateway sets up two TCP connections. Once the TCP connections are established, the gateway relays TCP segments from one connection to the other without examining the contents. The security function determines which connections will be allowed and which are to be disallowed.
7. Explain types of secure systems.
Types of Secure Computing Systems
Dedicated (Single-Level) Systems
o handle subjects and objects with the same classification
o rely on other security procedures (eg physical)
System-High
o only provides need-to-know protection between users
o the entire system operates at the highest classification level
o all users must be cleared for that level of information
Compartmented
o a variation of System-High which can process two or more types of compartmented information
o not all users are cleared for all compartments, but all must be cleared to the highest level of information processed
Multi-Level Systems
o validated for handling subjects and objects with different rights and levels of security simultaneously
o major features of such systems include:
 user identification and authentication
 resource access control and object labeling
 audit trails of all security-relevant events
 external validation of the system's security
8. Explain active firewall elements.
An active firewall element is integrated in the communication interface between the insecure public network and the private network. To provide the necessary security services, the following components are required:
Integration Module: integrates the active firewall element into the communication system with the help of device drivers. In the case of packet filters, the integration is above the Network Access Layer, whereas it is above the Transport layer ports in the case of an Application Gateway.
Analysis Module: based on the capabilities of the firewall, the communication data is analysed in the Analysis Module. The results of the analysis are passed on to the Decision Module.
Decision Module: evaluates and compares the results of the analysis with the security policy definitions stored in the rule set; the communication data is allowed or prevented based on the outcome of the comparison.
Processing Module for Security-related Events: based on the rule set, configuration settings and the messages received from the Decision Module, it writes to the logbook and generates alarm messages to the Security Management System.
Authentication Module: responsible for the identification and authentication of the instances that communicate through the firewall system.
Rule set: contains all the information necessary to make a decision for or against the transmission of communication data through the firewall, and also defines the security-related events to be logged.
Logbook: all security-related events that occur during operation are recorded in the logbook based on the existing rule set.
Security Management System: provides an interface where the administrator enters and maintains the rule set. It also analyses the data entered in the logbook.
UNIT V
PART A (Two marks)
1. Define Public-Key Infrastructure.
Public-key infrastructure (PKI) as the set of hardware, software, people, policies, and procedures
needed to create, manage, store, distribute, and revoke digital certificates based on asymmetric
cryptography.
2. Define PGP. (Dec 14)
Pretty Good Privacy is an open-source freely available software package for e-mail security. It
provides authentication through the use of digital signature; confidentiality through the use of
symmetric block encryption; compression using the ZIP algorithm; e-mail compatibility using
the radix-64 encoding scheme; and segmentation and reassembly to accommodate long e-mails.
3. Define S/MIME (May 15)
Secure/Multipurpose Internet Mail Extension is an Internet standard approach to e-mail security
that incorporates the same functionality as PGP.
4. Write short notes on IP Security.
IPsec provides the capability to secure communications across a LAN, across private and public
WANs, and across the Internet.
5. Write short notes on Web Security
Secure socket layer (SSL) provides security services between TCP and applications that use TCP.
The Internet standard version is called transport layer service (TLS).
6. Write short notes on Secure Electronic Transaction.
Secure Electronic Transaction (SET) is an open encryption and security specification designed to
protect credit card transactions on the Internet.
7. What are the features of SET?
Confidentiality of information
Integrity of data
Cardholder account authentication
Merchant authentication
8. Write short notes on Transport Layer Security (TLS)? (Dec 11)
Transport Layer Security is defined as a Proposed Internet Standard in RFC 2246. RFC 2246 is
very similar to SSLv3. The TLS Record Format is the same as that of the SSL Record Format,
and the fields in the header have the same meanings. The one difference is in version number.
9. What are the function areas of IP security?
Authentication
Confidentiality
Key management.
10. Differentiate Transport and Tunnel mode in IPsec.
Transport mode:
1. Provides protection for upper-layer protocols between two hosts.
2. ESP in this mode encrypts and optionally authenticates the IP payload but not the IP header.
3. AH in this mode authenticates the IP payload and selected portions of the IP header.
Tunnel mode:
1. Provides protection for the entire IP packet.
2. ESP in this mode encrypts and optionally authenticates the entire IP packet.
3. AH in this mode authenticates the entire IP packet plus selected portions of the outer IP header.
11. What is dual signature? What is its purpose?
The purpose of the dual signature is to link two messages that are intended for two different recipients, to avoid misplacement of orders.
12. What do you mean by Replay Attack?
A replay attack is one in which an attacker obtains a copy of an authenticated packet and later transmits it to the intended destination. Each time a packet is sent, the sequence number in the counter is incremented by the sender.
13. Name any cryptographic keys used in PGP?
One-time session conventional keys.
Public keys.
Private keys.
Pass phrase based conventional keys.
14. Define Certification authority.
The issuer of certificates and certificate revocation lists (CRLs). It may also support a variety of
administrative functions, although these are often delegated to one or more Registration
Authorities.
15. List the Applications of IPsec.
Secure branch office connectivity over the Internet
Secure remote access over the Internet
Establishing extranet and intranet connectivity with partners
Enhancing electronic commerce security
16. What do you mean by Security Association?
An association is a one-way relationship between a sender and a receiver that affords security services to the traffic carried on it. A key concept that appears in both the authentication and confidentiality mechanisms for IP is the security association (SA).
17. Specify the parameters that identify the Security Association?
A security Association is uniquely identified by 3 parameters:
Security Parameter Index (SPI).
IP Destination Address.
Security Protocol Identifier
18. What are the header fields defined in MIME?
MIME version.
Content type.
Content transfer encoding.
Content id.
Content description.
PART B (16marks)
1. Explain in detail about Public-Key Infrastructure
RFC 2822 (Internet Security Glossary) defines public-key infrastructure (PKI) as the set of
hardware, software, people, policies, and procedures needed to create, manage, store, distribute,
and revoke digital certificates based on asymmetric cryptography. The principal objective for
developing a PKI is to enable secure, convenient, and efficient acquisition of public keys. The
Internet Engineering Task Force (IETF) Public Key Infrastructure X.509 (PKIX) working group
has been the driving force behind setting up a formal (and generic) model based on X.509 that is
suitable for deploying a certificate-based architecture on the Internet. This section describes the
PKIX model.

Figure shows the interrelationship among the key elements of the PKIX model. These elements are:
End entity: A generic term used to denote end users, devices (e.g., servers,
routers), or any other entity that can be identified in the subject field of a public
key certificate. End entities typically consume and/or support PKI-related
services.
Certification authority (CA): The issuer of certificates and (usually) certificate
revocation lists (CRLs). It may also support a variety of administrative functions,
although these are often delegated to one or more Registration Authorities.
Registration authority (RA): An optional component that can assume a number
of administrative functions from the CA. The RA is often associated with the End
Entity registration process, but can assist in a number of other areas as well.
CRL issuer: An optional component that a CA can delegate to publish CRLs.
Repository: A generic term used to denote any method for storing certificates and
CRLs so that they can be retrieved by End Entities.
PKIX Management Functions
PKIX identifies a number of management functions that potentially need to be supported by management protocols. These are indicated in the figure and include the following:
Registration: This is the process whereby a user first makes itself known to a CA
(directly, or through an RA), prior to that CA issuing a certificate or certificates
for that user. Registration begins the process of enrolling in a PKI. Registration
usually involves some offline or online procedure for mutual authentication.
Typically, the end entity is issued one or more shared secret keys used for
subsequent authentication.
Initialization: Before a client system can operate securely, it is necessary to
install key materials that have the appropriate relationship with keys stored
elsewhere in the infrastructure. For example, the client needs to be securely
initialized with the public key and other assured information of the trusted CA(s),
to be used in validating certificate paths.
Certification: This is the process in which a CA issues a certificate for a user's
public key, and returns that certificate to the user's client system and/or posts that
certificate in a repository.
Key pair recovery: Key pairs can be used to support digital signature creation
and verification, encryption and decryption, or both. When a key pair is used for
encryption/decryption, it is important to provide a mechanism to recover the
necessary decryption keys when normal access to the keying material is no longer
possible, otherwise it will not be possible to recover the encrypted data. Loss of
access to the decryption key can result from forgotten passwords/PINs, corrupted
disk drives, damage to hardware tokens, and so on. Key pair recovery allows end
entities to restore their encryption/decryption key pair from an authorized key
backup facility (typically, the CA that issued the End Entity's certificate).
Key pair update: All key pairs need to be updated regularly (i.e., replaced with a
new key pair) and new certificates issued. Update is required when the certificate
lifetime expires and as a result of certificate revocation.
Revocation request: An authorized person advises a CA of an abnormal situation
requiring certificate revocation. Reasons for revocation include private key
compromise, change in affiliation, and name change.
Cross certification: Two CAs exchange information used in establishing a cross-certificate. A cross-certificate is a certificate issued by one CA to another CA that
contains a CA signature key used for issuing certificates.
PKIX Management Protocols
The PKIX working group has defined two alternative management protocols between
PKIX entities that support the management functions listed in the preceding subsection.
RFC 2510 defines the certificate management protocols (CMP). Within CMP, each of the
management functions is explicitly identified by specific protocol exchanges. CMP is
designed to be a flexible protocol able to accommodate a variety of technical,
operational, and business models.
2. Write briefly about the e-mail security-PGP (Pretty Good Privacy). (May 15)
PGP is an open-source freely available software package for e-mail security. It provides
authentication through the use of digital signature; confidentiality through the use of symmetric
block encryption; compression using the ZIP algorithm; e-mail compatibility using the radix-64
encoding scheme; and segmentation and reassembly to accommodate long e-mails.
There are five important services in PGP:
Authentication (Sign/Verify)
Confidentiality (Encryption/Decryption)
Compression
Email compatibility
Segmentation and Reassembly
The last three are transparent to the user.

PGP: Authentication steps
Sender:
1. Creates a message
2. Hashes it to 160 bits using SHA-1
3. Encrypts the hash code using her private key, forming a signature
4. Attaches the signature to the message
Receiver:
1. Decrypts the attached signature using the sender's public key and recovers the hash code
2. Re-computes the hash code from the message and compares it with the received hash code
3. If they match, accepts the message as authentic
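The sign/verify flow above can be illustrated with a minimal Python sketch (an illustration only, not PGP's packet format; the Python "cryptography" package, the 2048-bit key size and the sample message are choices made here, while SHA-1 with RSA follows the steps above):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.exceptions import InvalidSignature

sender_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Sender: hash the message with SHA-1 and sign the digest with the private key
message = b"meet me at 9"
signature = sender_key.sign(message, padding.PKCS1v15(), hashes.SHA1())

# Receiver: verify with the sender's public key (re-hashes and compares internally)
try:
    sender_key.public_key().verify(signature, message, padding.PKCS1v15(), hashes.SHA1())
    print("signature valid - message accepted")
except InvalidSignature:
    print("signature invalid - message rejected")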

PGP: Confidentiality
Sender:
1. Generates a message and a random number (session key) used only for this message
2. Encrypts the message with the session key using AES, 3DES, IDEA or CAST-128
3. Encrypts the session key itself with the recipient's public key using RSA
4. Attaches it to message
Receiver:
1. Recovers session key by decrypting using his private key
2. Decrypts message using the session key.

Combining authentication and confidentiality in PGP

Authentication and confidentiality can be combined:
o A message can be both signed and encrypted; this is called authenticated confidentiality
o The encryption/decryption process is nested within the process shown for authentication alone

PGP Compression

Compression is done after signing the hash
o Saves having to compress the document every time you wish to verify its signature
It is also done before encryption
o To speed up the process (less data to encrypt)
o Also improves security: compressed messages are more difficult to cryptanalyze as they have less redundancy

PGP Email compatibility:

PGP is designed to be compatible with all email systems
o Makes no assumptions regarding the ability to handle attachments, etc.
o Handles both the simplest system and the most complex system
o The output of the encryption and compression functions is divided into 6-bit blocks, each mapped to a printable ASCII character (radix-64 encoding)
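Python's base64 module illustrates the same 6-bit-block idea (PGP's radix-64 additionally appends a CRC for transmission-error detection; the sample bytes are invented):

import base64
raw = bytes([0xDE, 0xAD, 0xBE, 0xEF])   # arbitrary binary output of encryption
armored = base64.b64encode(raw)          # each 6-bit group -> one printable char
print(armored)                           # b'3q2+7w==' - safe for any mail system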

PGP Segmentation/Reassembly:

Email protocols have a maximum allowed size for messages
o e.g., 100 KB
PGP divides messages that are too large into smaller ones
o Divide and conquer
Reassembly at the receiving end is required before verifying the signature or decrypting (a toy sketch follows).
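A toy Python sketch of the split/join (the 100 KB limit is just the example figure above):

MAX_MESSAGE = 100 * 1024   # hypothetical per-message limit, e.g. 100 KB

def segment(data: bytes, limit: int = MAX_MESSAGE):
    # split an over-long message into mail-sized pieces
    return [data[i:i + limit] for i in range(0, len(data), limit)]

def reassemble(pieces):
    # must run before signature verification or decryption
    return b"".join(pieces)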
PGP Key Identifiers:
Consider this:

A user may have many public/private key pairs at his disposal
He wishes to encrypt or sign a message using one of his keys, and the receiver must know which one was used
PGP solves this with a key ID: the least significant 64 bits of the public key, which is attached to the message
PGP Key Rings

PGP uses key rings to identify the key pairs that a user owns or trusts
Private-key ring contains public/private key pairs of keys he owns
Public-key ring contains public keys of others he trusts

PGP Public key management

Key rings are different from the certificate chains used in X.509
o There the user only trusts CAs, and people signed by the CAs
o Here he can trust anyone, and can add others signed by people he trusts
Thus, users do not rely on external CAs
o A user is his or her own CA

3. Explain in detail the architecture in IP security. (Dec 13)


IPsec is an Internet standard for network layer security
Components:
An authentication protocol (Authentication Header AH)
A combined encryption and authentication protocol (Encapsulated Security Payload
ESP)
Key management protocols (the default is ISAKMP/Oakley)
Important RFCs
RFC 2401: an overview of the IPsec security architecture
RFC 2402: specification of AH
RFC 2406: specification of ESP
RFC 2408: specification of ISAKMP
RFC 2412: specification of Oakley
IPsec is mandatory for IPv6 and optional for IPv4
The benefits of IPSec include:
IPSec can be transparent to end users; there is no need to train users on security mechanisms.
IPSec can provide security for individual users if desired.
When used in firewalls or routers, IPSec provides strong security for all traffic crossing the perimeter.
Applications:
IPSec provides the capability to secure communications across a LAN, across private and
public WANs, and across the Internet. Examples of its use include:
Secure branch office connectivity over the Internet
Secure remote access over the Internet
Security associations (SA):
an SA is a one-way relationship between a sender and a receiver system
an SA is used either for AH or for ESP but never for both


an SA is uniquely identified by three parameters: the Security Parameters Index (SPI), the IP destination address, and the security protocol identifier (AH or ESP)
SA parameters:
Sequence number counter
o Counts the packets sent using this SA
Sequence counter overflow flag
o indicates whether overflow of the sequence number counter should prevent further transmission using this SA
Anti-replay window
o used to determine whether an inbound AH or ESP packet is a replay
AH / ESP information
o Algorithm, key, and related parameters
Lifetime
o a time interval or byte count after which this SA must be terminated
protocol mode
o tunnel or transport mode
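The parameters above can be pictured as one record per SA; a hypothetical Python data structure (an illustration only, not a real kernel or library API; field names and defaults are invented):

from dataclasses import dataclass

@dataclass
class SecurityAssociation:
    spi: int                        # Security Parameters Index (identifier)
    dest_ip: str                    # IP destination address (identifier)
    protocol: str                   # "AH" or "ESP" (identifier)
    seq_counter: int = 0            # counts packets sent using this SA
    seq_overflow_stops: bool = True # stop transmitting on counter overflow?
    replay_window: int = 64         # anti-replay window size
    lifetime_bytes: int = 2 ** 30   # terminate SA after this many bytes
    mode: str = "transport"         # "transport" or "tunnel"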
SA selectors:
Security Policy Database (SPD)
each entry defines a subset of IP traffic and points to the SAs to be applied to that traffic
subset of IP traffic is defined in terms of selectors
o destination IP address (single, enumerated list, range, or mask)
o source IP address (single, enumerated list, range, or mask)
o transport layer protocol (single, enumerated list, or range)
o destination port (single, enumerated list, range, or wildcard)
Modes of operation:
Transport mode
provides protection primarily for upper layer protocols
protection is applied to the payload of the IP packet
ESP in transport mode encrypts and optionally authenticates the IP payload but not the IP
header
AH in transport mode authenticates the IP payload and selected fields of the IP header
usually used for end-to-end communication between two hosts (end users)
Tunnel mode
provides protection to the entire IP packet
the entire IP packet is considered as payload and encapsulated in another IP packet (with
potentially different source and destination addresses)
ESP in tunnel mode encrypts and optionally authenticates the entire inner IP packet

AH in tunnel mode authenticates the entire inner IP packet and selected fields of the
outer IP header
usually used between security gateways (routers, firewalls)

4. Explain in detail about secure electronic transaction.

SET is designed to protect credit card transactions on the Internet.


Features of SET:
Confidentiality of information
Integrity of data
Cardholder account authentication
Merchant authentication

SET participants

Cardholder: is an authorized holder of a payment card (e.g., MasterCard, Visa) that has
been issued by an issuer; the cardholder interacts with the merchant over the Internet.
Merchant: is a person or organization that has goods or services to sell to the cardholder.
Issuer: is a financial institution, such as a bank, that provides the cardholder with the
payment card.
Acquirer: is a financial institution that establishes an account with a merchant and
processes payment card authorizations and payments.
Payment gateway: is a function operated by the acquirer or a designated third party that
processes merchant payment messages.
Certification authority (CA): is an entity that is trusted to issue X.509v3 public-key
certificates for cardholders, merchants, and payment gateways.

Sequence of events that are required for a transaction


1. The customer opens an account.
2. The customer receives a certificate.
3. Merchants have their own certificates.
4. The customer places an order.
5. The merchant is verified.
6. The order and payment are sent.
7. The merchant requests payment authorization.
8. The merchant confirms the order.
9. The merchant provides the goods or service.
10. The merchant requests payment.
SET Transaction Types

Cardholder registration
Merchant registration
Purchase request
Payment authorization
Payment capture
Certificate inquiry and status
Purchase inquiry
Authorization reversal
Capture reversal
Credit
Credit reversal
Payment gateway certificate request
Batch administration
Error message

Purchase Request

Message from the customer to the merchant containing the OI (Order Information) for the
merchant and the PI (Payment Information) for the bank.
Consists of 4 messages
o Initiate Request
o Initiate Response
o Purchase Request
o Purchase Response
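In SET the PI and OI are linked by a dual signature, so the merchant sees only the OI and the bank only the PI. A minimal Python sketch of the hash chain (per Stallings, SET uses SHA-1; the final RSA signing step over POMD is omitted here, and the OI/PI strings are invented):

import hashlib

OI = b"order: 2 textbooks"          # order information for the merchant
PI = b"card: 5105 1051 0510 5100"   # payment information for the bank

PIMD = hashlib.sha1(PI).digest()            # payment digest
OIMD = hashlib.sha1(OI).digest()            # order digest
POMD = hashlib.sha1(PIMD + OIMD).digest()   # payment-order digest
# dual signature DS = customer's RSA signature over POMD; each party can
# verify POMD from its own half plus the other half's digest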

5. Explain in detail about Secure Socket Layer and Transport Layer Security.
SSL Architecture
SSL is designed to make use of TCP to provide a reliable end-to-end secure
service.

Fig: SSL Protocol Stack


Two important SSL concepts
1. Connection: A connection is a transport (in the OSI layering definition) that provides a suitable type of service.
Connection Parameters:
Server and client random
Server write MAC secret
Client write MAC secret
Server write key
Client write key
Initialization vectors
Sequence numbers
2. Session: An SSL session is an association between a client and a server.
Session identifier
Peer certificate
Compression method
Cipher spec
Master secret.
A session can be shared among (i.e., resumed by) multiple connections.
SSL Record Protocol

The SSL Record Protocol provides two services for SSL connections:
i) Confidentiality
ii) Message Integrity

Fig: SSL Record Protocol Operation


Steps involved

Fragmentation: The message is fragmented into blocks of at most 2^14 (16,384) bytes.


Compression: Optionally applied to reduce the message length.
Add MAC: A message authentication code is computed over the compressed data.
Encrypt: The compressed message plus the MAC are encrypted using symmetric encryption.
Prepend header: The final step prepends an SSL record header to the encrypted block.
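The five record-protocol steps can be sketched in Python (a simplification, not the wire format: real SSL MACs also cover header fields and use their own padding rules; the key sizes, SHA-1 HMAC and SSLv3 version bytes here are illustrative choices):

import hashlib, hmac, os, zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

FRAGMENT_LIMIT = 2 ** 14                       # step 1: fragments of <= 16,384 bytes

def ssl_record(fragment, mac_key, enc_key, iv, seq_num, content_type=23):
    compressed = zlib.compress(fragment)                        # step 2: compress
    mac = hmac.new(mac_key, seq_num.to_bytes(8, "big") + compressed,
                   hashlib.sha1).digest()                       # step 3: add MAC
    plaintext = compressed + mac
    plaintext += bytes(16 - len(plaintext) % 16)                # naive pad to block size
    enc = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).encryptor()
    body = enc.update(plaintext) + enc.finalize()               # step 4: encrypt
    header = bytes([content_type, 3, 0]) + len(body).to_bytes(2, "big")
    return header + body                                        # step 5: prepend header

record = ssl_record(b"hello", os.urandom(20), os.urandom(16), os.urandom(16), 0)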

SSL Record Format:

Content types
Change Cipher Specification Protocol
o This protocol consists of a single message which consists of a single byte with the
value 1.
o This is used to cause the pending state to be copied into the current state
Alert protocol
o The Alert Protocol is used to convey SSL-related alerts to the peer entity.
o Alerts that are fatal
Handshake protocol
o Is used for server and client to authenticate each other and protect data sent in SSL
record.
o Type (1 byte): Indicates one of 10 defined handshake message types.
o Length (3 bytes): The length of the message in bytes.
o Content (>= 0 bytes): The parameters associated with this message.
Application Data protocol
o Contains Opaque content
TLS (Transport Layer Security)
o TLS is defined as a proposed Internet standard in RFC 2246; its record format is
similar to the SSL record format, differing in the version number.
6. Write brief notes on malicious software.
Viruses and Other Malicious Content
Computer viruses have received a lot of publicity; they are one of a family of malicious
software whose effects are usually obvious. They have figured in news reports, fiction
and movies (often exaggerated), getting more attention than they deserve, but they are a
concern nonetheless.
Trapdoors
secret entry point into a program
allows those who know of it to gain access, bypassing the usual security procedures
have been commonly used by developers
a threat when left in production programs, allowing exploitation by attackers
very hard to block in the O/S; requires good s/w development practices


Logic Bomb
one of the oldest types of malicious software
o code embedded in a legitimate program
o activated when specified conditions are met, e.g., presence/absence of some file, a particular date/time, or a particular user
o when triggered, typically damages the system, e.g., modifies/deletes files/disks
Trojan Horse
program with hidden side-effects
which is usually superficially attractive, e.g., a game, s/w upgrade, etc.
when run, performs some additional tasks
allows an attacker to indirectly gain access they do not have directly
often used to propagate a virus/worm or install a backdoor
Zombie
program which secretly takes over another networked computer, then uses it to indirectly launch attacks
often used to launch distributed denial of service (DDoS) attacks
Viruses
a piece of self-replicating code attached to some other code (cf. a biological virus)
both propagates itself & carries a payload
7. Briefly explain worms.
Worms are replicating but not infecting programs that typically spread over a network.
The Morris Internet Worm of 1988 led to the creation of CERTs.
Worms spread using users' distributed privileges or by exploiting system vulnerabilities.
They are widely used by hackers to create zombie PCs, subsequently used for further attacks, especially DoS.
A major issue is the lack of security of permanently connected systems.
Worm Operation
worm phases, like those of viruses:
Dormant
Propagation
o search for other systems to infect
o establish connection to target remote system
o replicate self onto remote system
Triggering
Execution
Morris Worm
best known classic worm, released by Robert Morris in 1988
targeted Unix systems, using several propagation techniques:
o simple password cracking of the local password file
o exploiting a bug in the finger daemon
o exploiting a debug trapdoor in the sendmail daemon
if any attack succeeded, it then replicated itself
Recent Worm Attacks
new spate of attacks from mid-2001
Code Red
o exploited a bug in MS IIS to penetrate servers
o spread by probing random IPs for systems running IIS
o had a trigger time for a denial-of-service attack
o the 2nd wave infected 360,000 servers in 14 hours
Code Red 2
o had a backdoor installed to allow remote control
Nimda
o used multiple infection mechanisms: email, shares, web client, IIS, and the Code Red 2 backdoor
8. Explain countermeasures of viruses .
Virus Countermeasures
Viral attacks exploit the lack of integrity control on systems.
To defend, we need to add such controls, typically by one or more of:
prevention - block the virus infection mechanism
detection - of viruses in an infected system
reaction - restoring the system to a clean state
Anti-Virus Software
First-generation: scanner uses a virus signature to identify a virus, or a change in the length of programs
Second-generation: uses heuristic rules to spot viral infection, or uses program checksums to spot changes
Third-generation: memory-resident programs that identify a virus by its actions
Fourth-generation: packages with a variety of antivirus techniques, e.g., scanning & activity traps, access controls
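A toy Python illustration of the first-generation signature approach (the signature database, names and byte patterns below are invented for the example; real scanners use large curated databases):

SIGNATURES = {                      # hypothetical virus-name -> byte-pattern database
    "demo.boot": bytes.fromhex("deadbeef"),
    "demo.macro": b"AutoOpen payload",
}

def scan_file(path):
    data = open(path, "rb").read()
    return [name for name, pattern in SIGNATURES.items() if pattern in data]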
Advanced Anti-Virus Techniques
generic decryption
o use a CPU simulator to check program signature & behavior before actually running it


Digital immune system (IBM)
o general-purpose emulation & virus detection: any virus entering the organization is captured, analyzed, detection/shielding is created for it, and it is removed
Behavior-Blocking Software
integrated with the host O/S
monitors program behavior in real-time, e.g., file access, disk formats, executable modifications, system-settings changes, network access, for possibly malicious actions
if detected, can block the action, terminate the program, or ask the user
has an advantage over scanners, but the malicious code runs before detection
9. Explain types of viruses.
Types of Viruses: viruses can be classified on the basis of how they attack:
parasitic virus
memory-resident virus
boot sector virus
stealth
polymorphic virus
macro virus
Macro Virus
macro code attached to some data file
interpreted by the program using the file, e.g., Word/Excel macros, esp. using auto-execute & command macros
the code is now platform independent
a major source of new viral infections
blurs the distinction between data and program files, making the task of detection much harder
classic trade-off: "ease of use" vs "security"
Email Virus
spread using email with an attachment containing a macro virus (cf. Melissa)
triggered when the user opens the attachment, or worse, even when the mail is merely viewed, by using scripting features in the mail agent
usually targeted at the Microsoft Outlook mail agent & Word/Excel documents
10. Explain IP security in detail.
IP Security: We have considered some application-specific security mechanisms, e.g., S/MIME, PGP,
Kerberos, SSL/HTTPS. However, there are security concerns that cut across protocol layers;
we would like security implemented by the network for all applications.
IPSec provides general IP security mechanisms:
Authentication
Confidentiality
Key management
It is applicable to use over LANs, across public & private WANs, & for the Internet.
Benefits of IPSec
in a firewall/router, provides strong security for all traffic crossing the perimeter, and is resistant to bypass
is below the transport layer, hence transparent to applications
can be transparent to end users
can provide security for individual users if desired
IP Security Architecture
specification is quite complex
defined in numerous RFCs
incl. RFC 2401/2402/2406/2408 and many others, grouped by category
mandatory in IPv6, optional in IPv4
IPSec Services
Access control
Connectionless integrity
Data origin authentication
Rejection of replayed packets (a form of partial sequence integrity)
Confidentiality (encryption)
Limited traffic flow confidentiality

IT6702

DATA WAREHOUSING AND DATA MINING

LTPC 3003

OBJECTIVES: The student should be made to:


Be familiar with the concepts of data warehouse and data mining,
Be acquainted with the tools and techniques used for Knowledge Discovery in Databases.
UNIT I

DATA WAREHOUSING

Data warehousing Components Building a Data warehouse - Mapping the Data Warehouse to
a Multiprocessor Architecture DBMS Schemas for Decision Support Data Extraction,
Cleanup, and Transformation Tools Metadata.
UNIT II

BUSINESS ANALYSIS

Reporting and Query tools and Applications Tool Categories The Need for Applications
Cognos Impromptu Online Analytical Processing (OLAP) Need Multidimensional Data
Model OLAP Guidelines Multidimensional versus Multi relational OLAP Categories of
Tools OLAP Tools and the Internet.
UNIT III

DATA MINING

Introduction Data Types of Data Data Mining Functionalities Interestingness of Patterns


Classification of Data Mining Systems Data Mining Task Primitives Integration of a Data
Mining System with a Data Warehouse Issues Data Preprocessing.
UNIT IV ASSOCIATION RULE MINING AND CLASSIFICATION

Mining Frequent Patterns, Associations and Correlations Mining Methods Mining various
Kinds of Association Rules Correlation Analysis Constraint Based Association Mining
Classification and Prediction - Basic Concepts - Decision Tree Induction - Bayesian
Classification Rule Based Classification Classification by Back propagation Support Vector
Machines Associative Classification Lazy Learners Other Classification Methods
Prediction.
UNIT V

CLUSTERING AND TRENDS IN DATA MINING

Cluster Analysis - Types of Data Categorization of Major Clustering Methods K-means


Partitioning Methods Hierarchical Methods - Density-Based Methods Grid Based Methods
Model-Based Clustering Methods Clustering High Dimensional Data - Constraint Based
Cluster Analysis Outlier Analysis Data Mining Applications.

TOTAL: 45 PERIODS
OUTCOMES:
After completing this course, the student will be able to:
Apply data mining techniques and methods to large data sets.
Use data mining tools.
Compare and contrast the various classifiers.
TEXT BOOKS:
1. Alex Berson and Stephen J.Smith, Data Warehousing, Data Mining and OLAP, Tata
McGraw Hill Edition, Thirteenth Reprint 2008.
2. Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques, Third Edition,
Elsevier, 2012.
REFERENCES:
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Pearson
Education, 2007.
2. K.P. Soman, Shyam Diwakar and V. Ajay, Insight into Data Mining Theory and Practice,
Eastern Economy Edition, Prentice Hall of India, 2006.
3. G. K. Gupta, Introduction to Data Mining with Case Studies, Eastern Economy Edition,
Prentice Hall of India, 2006.
4. Daniel T.Larose, Data Mining Methods and Models, Wiley-Interscience, 2006.

Unit-I
Part-A
1. Define data warehouse. [Dec 2013][May 2012]
Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that support
analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing
involves data cleaning, data integration, and data consolidations.
2. List the Components of a Data Warehouse.
i. Source data components
ii. Data staging components
iii. Data storage components
iv. Information delivery components
v. Meta-data components
vi. Management and control components

3. How is a data warehouse different from a database? How are they similar?[May 2012]
Data warehouse: Online Analytical Processing; supports data analysis and decision making; query driven.
Database: Online Transaction Processing; supports day-to-day operations (purchasing, payroll, etc.); holds up-to-date data and is update driven.

4. Define Metadata
Metadata is simply defined as data about data; data that are used to represent other data
are known as metadata.
5. Define Data partitioning?
Partitioning is done to enhance performance and facilitate easy management of data.
Partitioning also helps in balancing the various requirements of the system. It optimizes the
hardware performance and simplifies the management of data warehouse by partitioning each
fact table into multiple separate partitions.
6. What are the five types of access tools?
i. Data query and reporting tools
ii. Application development tools
iii. Executive information system tools
iv. Online analytical processing tools
v. Data mining tools

7. What are the implementation considerations in the design of data warehouse?


i. Access tools
ii. Data extraction, clean-up, transformation, and migration
iii. Data placement strategies
iv. Meta data
v. User sophistication levels: casual users, power users, experts
8. What are the two types of meta data
i. Technical Metadata
ii. Business Metadata
9. Define Technical Metadata


This contains information about data warehouse data for use by warehouse
designers and administrators when carrying out warehouse development and
management tasks

10. Define Business Metadata


Contains information that gives users an easy-to-understand perspective of the
information stored in the data warehouse; business metadata documents the warehouse
from a business point of view.
11. How is a data warehouse different from a database?
Data warehouse is a repository of multiple heterogeneous data sources, organized
under a unified schema at a single site in order to facilitate management decision-making.
Database consists of a collection of interrelated data.
12. State Star Schema


Star Schema comprises one or more fact tables referencing one or more
dimension tables. It is a very simple style of schema and handles simple queries
effectively.
13. What is the need of cleaning?
The following are the reasons for cleaning:
- Incomplete
- Noisy
- Inconsistent
14. What are the two ways of transforming data?
- Multistage Data Transformation
- Pipelined Data Transformation
15. What is virtual warehouse?
It is a set of views over operational database for efficient query processing. Only
some of the possible summary views may be materialized. It is easy to build but requires
excess capability on operational database server.
16. What is ETL Process?
ETL (extract, transform, load) is the process by which a data warehouse obtains data from a
number of operational database systems, which can be based on RDBMS or ERP packages.
17. Mention 3 parallel architecture styles.
- Shared Memory Architecture
- Shared-Disk Architecture
- Shared-Nothing Architecture.
18. Define Speed-up.
As processors are added, the time taken to execute the same workload should be reduced by the same factor (ideal speed-up with N processors is a factor of N).
19. Define Scale-up.
As the capacities of the CPU and hard disk are increased in correspondence to the amount of data, performance should remain constant.
20. Differentiate data warehouse and data mart.
Data Warehouse: collects information about subjects that span the entire organization; scope is enterprise-wide; example: fact constellation schema.
Data Mart: focuses on selected subjects; scope is department-wide; example: star or snowflake schema.
Part-B
1. Describe the data warehouse Architecture? [Nov-Dec-2014]
A data warehouses adopts three-tier architecture. Following are the three tiers of the data
warehouse architecture.
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the
relational database system. We use the back end tools and utilities to feed data into the bottom
tier.
Middle Tier - In the middle tier is the OLAP server, which can be implemented either by the
Relational OLAP (ROLAP) model, an extended relational database management system that maps
operations on multidimensional data to standard relational operations, or by the Multidimensional
OLAP (MOLAP) model, which directly implements multidimensional data and operations.
Top-Tier - This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.

2. Discuss the components of Data warehouse?

Operational Source System


It is the traditional OLTP system which stores the transaction data of the organization's business.
It is generally used one record at a time and does not necessarily store the history of the
organization's information. Operational source systems are generally not used for reporting, unlike the data warehouse.
Data Staging Area
The data staging area is the storage area as well as the set of ETL processes that extract data from
the source systems. It is everything between the source systems and the data warehouse. The data
staging area is never used for reporting purposes. Data is extracted from the source systems and
stored, cleansed and transformed in the staging area before being loaded into the data warehouse.
The staging area is not necessarily a DBMS; it could be flat files also. It can be structured like the
normalized source systems; it totally depends on the choice and needs of the development process.
Data Presentation Area
The data presentation area is generally called the data warehouse. It is the place where cleaned,
transformed data is stored in a dimensionally structured warehouse and made available for
analysis purposes.
Data Access Tools
Once data is available in the presentation area, it is accessed using data access tools like Business
Objects.
3. Write the benefits of data warehousing
Data warehouses are designed to perform well with aggregate queries running on large amounts
of data.
The structure of data warehouses is easier for end users to navigate, understand and query
against, unlike the relational databases primarily designed to handle lots of transactions.
Data warehouses enable queries that cut across different segments of a company's operation,
e.g., production data could be compared against inventory data even if they were originally
stored in different databases with different structures.
Queries that would be complex in very normalized databases could be easier to build and
maintain in data warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that is from a variety of
sources, non-uniform and scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from lots of
users. Data warehousing provides the capability to analyze large amounts of historical data
for nuggets of wisdom that can provide an organization with competitive advantage.
4. Brief the way to build a Data warehouse.
There are two factors that drive you to build and use data warehouse.
They are:
Business factors: Business users want to make decision quickly and correctly
using all available data.
Technological factors:
To address the incompatibility of operational data stores.
IT infrastructure is changing rapidly; its capacity is increasing and its cost is
decreasing, so that building a data warehouse is easy.
There are several approaches to be considered while building a successful data warehouse.
They are:
i) Top-Down Approach (suggested by Bill Inmon)
In the top down approach suggested by Bill Inmon, we build a centralized repository to
house corporate wide business data. This repository is called Enterprise Data Warehouse
(EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data. The data in the EDW is stored at the most detail level. The reason to build the EDW
on the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
ii) Bottom-Up Approach (suggested by Ralph Kimball)
The bottom-up approach suggested by Ralph Kimball is an incremental approach
to build a data warehouse. Here we build the data marts separately at different points
of time as and when the specific subject area requirements are clear. The data marts
are integrated or combined together to form a data warehouse. Separate data marts are
combined through the use of conformed dimensions and conformed facts. A
conformed dimension and a conformed fact is one that can be shared across data
marts.
5. Mention the factors to be used to build successful data warehouse.
Data extraction, clean up, transformation and migration
When building a data warehouse, the following selection criteria, which affect the ability to
transform, consolidate, integrate and repair the data, should be considered:
Timeliness of data delivery to the warehouse.
i. The tool must have the ability to identify the particular data so that it can be read
by the conversion tool.
ii. The tool must support flat files and indexed files, since corporate data is still in this
type.
iii. The tool must have the capability to merge data from multiple data stores.
iv. The tool should have a specification interface to indicate the data to be extracted.
v. The tool should have the ability to read data from the data dictionary.
vi. The code generated by the tool should be completely maintainable.
vii. The tool should permit the user to extract the required data.
viii. The tool must have the facility to perform data type and character set translation.

ix. The tool must have the capability to create summarization, aggregation and
derivation of records.
x. The data warehouse database system must be able to perform loading of data directly
from these tools.
6. Explain the concept of mapping the data warehouse architecture to Multiprocessor
architecture. [Nov-Dec-2014].
The functions of data warehouse are based on the relational data base technology. The
relational data base technology is implemented in parallel manner. There are two
advantages of having parallel relational data base technology for data warehouse:
Linear Speed-up: refers to the ability to increase the number of processors to reduce
response time; ideally, N processors run the same workload N times faster (Speedup(N) = T(1)/T(N)).
Linear Scale-up: refers to the ability to provide the same performance on the same requests as
the database size increases, i.e., response time stays constant as data and resources grow proportionally.
Types of parallelism
i) Inter query Parallelism:
In which different server threads or processes handle multiple requests at
the same time.
ii)Intra query Parallelism:
This form of parallelism decomposes the serial SQL query into the lower
level operations such as scan, join, sort etc. Then these lower level operations are
executed concurrently in parallel.
Intra query parallelism can be done in either of two ways:
Horizontal parallelism:
The data base is partitioned across multiple disks and parallel processing occurs
within a specific task that is performed concurrently on different processors against
different set of data.
Vertical parallelism:
This occurs among different tasks. All query components such as scan, join, sort
etc are executed in parallel in a pipelined fashion. In other words, an output from one task
becomes an input into another task.
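A toy Python sketch of horizontal parallelism: the same scan task runs concurrently over different partitions of a fact table (the data, partitioning and product filter are invented for illustration):

from concurrent.futures import ProcessPoolExecutor

partitions = [                      # fact table split across two "disks"
    [("tv", 300.0), ("radio", 40.0)],
    [("tv", 250.0), ("phone", 120.0)],
]

def scan(partition, product="tv"):
    # identical scan task, executed concurrently against different data
    return sum(amount for name, amount in partition if name == product)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(scan, partitions)))   # 550.0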
7. Brief the types of meta data [Nov-Dec 2013,2014]
Meta data It is data about data. It is used for maintaining, managing and using the data
warehouse. It is classified into two:
a. Technical Meta data:
i. It contains information about data warehouse data used by warehouse designers and
administrators to carry out development and management tasks. It includes info about data stores.
ii. Transformation descriptions, that is, mapping methods from the operational db to the
warehouse database.
iii. Warehouse object and data structure definitions for target data.
iv. The rules used to perform clean-up and data enhancement.
v. Data mapping operations.
vi. Access authorization, backup history, archive history, info delivery history, data
acquisition history, data access, etc.

b. Business Meta data:


i. It contains info that presents the info stored in the data warehouse to users. It includes
subject areas and info object types, including queries, reports, images, video, audio clips, etc.
ii. Internet home pages.
iii. Info related to the info delivery system.
iv. Data warehouse operational info such as ownerships, audit trails, etc.
v. Meta data helps the users to understand the content and find the data. Meta data are stored
in a separate data store known as the informational directory or Meta data repository, which
helps to integrate, maintain and view the contents of the data warehouse.
8. List the points for data extraction and transformation tools. [April/May-2011]
i. Data extraction, clean-up, transformation and migration: timeliness of data delivery to the warehouse.
ii. The tool must have the ability to identify the particular data so that it can be read by the conversion tool.
iii. The tool must support flat files and indexed files, since corporate data is still in this type.
iv. The tool must have the capability to merge data from multiple data stores.
v. The tool should have a specification interface to indicate the data to be extracted.
vi. The tool should have the ability to read data from the data dictionary.
vii. The code generated by the tool should be completely maintainable.
viii. The data warehouse database system must be able to perform loading of data directly from these tools.
9. Give the steps involved to design a data warehouse.

The following nine-step method is followed in the design of a data warehouse:


1. Choosing the subject matter.
2. Deciding what a fact table represents.
3. Identifying and conforming the dimensions.
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the database
8. The need to track slowly changing dimensions.
9. Deciding the query priorities and query models


10. Discuss the types of Schema in data warehouse. [apr-May-2015]
Schema is a logical description of the entire database. It includes the name and description of
records of all record types including all associated data-items and aggregates. Much like a
database, a data warehouse also requires to maintain a schema. A database uses relational model,
while a data warehouse uses Star, Snowflake, and Fact Constellation schema. In this chapter, we
will discuss the schemas used in a data warehouse.
Star Schema
i. Each dimension in a star schema is represented with only a one-dimension table.
ii. This dimension table contains the set of attributes.
iii. The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
iv. There is a fact table at the center. It contains the keys to each of the four dimensions.
v. The fact table also contains the attributes, namely dollars sold and units sold.

Snowflake Schema
Some dimension tables in the Snowflake schema are normalized. The normalization splits up the
data into additional tables.
Unlike Star schema, the dimensions table in a snowflake schema are normalized. For example,
the item dimension table in star schema is normalized and split into two dimension tables,
namely item and supplier table.

i. Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier_key.
ii. The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.

Fact Constellation Schema


A fact constellation has multiple fact tables. It is also known as galaxy schema.
The following diagram shows two fact tables, namely sales and shipping.

i. The sales fact table is the same as that in the star schema.
ii. The shipping fact table has five dimensions, namely item_key, time_key, shipper_key,
from_location, to_location.
iii. The shipping fact table also contains two measures, namely dollars sold and units sold.
iv. It is also possible to share dimension tables between fact tables. For example, the time,
item, and location dimension tables are shared between the sales and shipping fact tables.
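A small pandas sketch of querying a star schema: the fact table joins to its dimension tables on the dimension keys (table contents are invented; assumes the pandas package):

import pandas as pd

sales = pd.DataFrame({"time_key": [1, 1, 2], "item_key": [10, 11, 10],
                      "dollars_sold": [250.0, 80.0, 300.0],
                      "units_sold": [5, 2, 6]})                 # fact table
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
item_dim = pd.DataFrame({"item_key": [10, 11], "brand": ["Acme", "Bolt"]})

report = (sales.merge(time_dim, on="time_key")      # join facts to dimensions
               .merge(item_dim, on="item_key")
               .groupby(["quarter", "brand"])["dollars_sold"].sum())
print(report)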
Unit-II
Part-A
1. What is the use of reporting tools? [May 2013]


Reporting tools can be divided into production reporting tools and desktop report writers.
Production reporting tools will let companies generate regular operational reports or support high
volume batch jobs, such as calculating and printing paychecks. Desktop tools designed for end
users.
2. What is the use of OLAP tools?
Online Analytical Processing (OLAP) tools provide an intuitive way to view corporate
data. These tools aggregate data along common business subjects or dimensions and then let
users navigate through the hierarchies and dimensions with the click of a mouse button. Users
can drill down, across, or up levels in each dimension or pivot and swap out dimensions to
change their view of the data.
3. What are the special reporting options in Impromptu tool?

Picklists and prompts

Custom templates

Exception reporting

Interactive reporting

Frames
4. List the categories of OLAP tools. [May 2011][Dec 2013]

MOLAP (Multidimensional OLAP)

ROLAP (Relational OLAP).

Hybrid OLAP (HOLAP)

Web OLAP
5. What are the OLAP guidelines? [Dec 2013]
Multidimensional conceptual view.
Transparency
Accessibility
Consistent reporting performance
Client/server architecture
6. Define Data cube.[June 2013]
A data cube is a three-dimensional (3D) (or higher) range of values that is generally
used to explain the time sequence of an image's data. It is a data abstraction used to evaluate
aggregated data from a variety of viewpoints.
7. Name some OLAP tools. [Dec 2013]
Arbor's Essbase, Oracle Express, Planning Sciences' Gentia, Kenan Technologies'
Acumate ES.
8. What is the need of tools for applications?
Easy-to-use
Point-and-click tools accept SQL or generate SQL statements to query relational
data stored in the warehouse.
Tools can format the retrieved data into easy-to-read reports
9. What are production reporting tools? Give examples[June 2013]
Third-generation language: COBOL; fourth-generation language: Information Builders'
Focus; client/server tools: MITI's SQR.
10. What is multidimensional database?[Dec 2011]
A multidimensional database (MDB) is a type of database that is optimized for data
warehouse and online analytical processing (OLAP) applications. Multidimensional
databases are frequently created using input from existing relational databases
11. Define OLTP systems.
The major task of online operational database system is to perform online transaction and
query processing. These systems are called On Line Transaction Processing (OLTP) systems.
They cover most of the day-to-day operations of an organization such as purchasing, inventory,
manufacturing and banking.
12. List the commercial tools used in data warehouse development?

Informatica

Cognos

Business objects

Data Storage

RapidMiner

Weka

R Language
13. State the components of Data Integrator.

Graphical Designer

Data Integration server

Meta Data Repository

Administrator
14. Define ETL.
ETL is short for extract, transform, and load, three database functions that are
combined into one tool to pull data out of one database and place it into another database.
15. Define data transformation. [May 2011]
Data transformation from one format to another on the basis of possible differences
between the source and the target platforms.
Ex: calculating age from the date of birth, replacing a possible numeric gender code with a more
meaningful male and female.
16. Write the categories of query Tools.
There are five categories of decision support tools


i. Reporting
ii. Managed query
iii. Executive information system
iv. OLAP
v. Data Mining
17. What is impromptu?
Impromptu is an interactive database reporting tool. It allows Power Users to query data
without programming knowledge. It is only capable of reading the data.
18. List the features of Impromptu.
i.
Interactive reporting capability
ii.
Enterprise-wide scalability
iii.
Superior user interface
iv. Fastest time to result
v. Lowest cost of ownership
19. Mention the elements of business intelligence.
Three main components of BI:
Data warehouse
OLAP
Data Mining
20. What is weka tool?
Weka is used in data mining, which is performed using many machine learning
algorithms.
Used in two ways.
- Calling them for java code.
- Applying them directly on a dataset.
Part-B
1. Explain the OLAP Operations.
Roll up (drill up)
Perform aggregation on a data cube by
o climbing up a concept hierarchy for a dimension
o dimension reduction
Drill down (roll down)
Drill-down is the reverse of roll-up
Navigates from less detailed data to more detailed data by
o stepping down a concept hierarchy for a dimension
o introducing additional dimensions
Slice and dice
- The slice operation performs a selection on one dimension of the given cube, resulting in a subcube
- The dice operation defines a subcube by performing a selection on two or more dimensions
Pivot (rotate)
A visualization operation that rotates the data axes in view in order to provide an alternative
presentation of the data
Drill across
An additional drilling operation that executes queries involving (i.e., across) more than one fact table
Drill through
An additional drilling operation that uses relational SQL facilities to drill through the
bottom level of a data cube down to its back-end relational tables
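These core operations map naturally onto pandas group-by, filter and pivot calls; a sketch on an invented sales cube (assumes the pandas package):

import pandas as pd

cube = pd.DataFrame({"city": ["Chennai", "Chennai", "Mumbai", "Mumbai"],
                     "quarter": ["Q1", "Q2", "Q1", "Q2"],
                     "item": ["phone", "phone", "tv", "phone"],
                     "sales": [100, 120, 90, 110]})

rollup = cube.groupby("quarter")["sales"].sum()     # roll up: aggregate away city/item
slice_q1 = cube[cube["quarter"] == "Q1"]            # slice: select on one dimension
dice = cube[(cube["quarter"] == "Q1") & (cube["item"] == "phone")]  # dice: two dims
pivot = cube.pivot_table(index="city", columns="quarter",
                         values="sales", aggfunc="sum")  # pivot: rotate the axes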
2. Explain Informatica and Cognos.
Both are among the most commonly used data warehousing tools. Informatica provides business
intelligence solutions from the server to the front-end, whereas Cognos is a rich set of tools
used to develop data marts and data warehouses.
Informatica tools.
Repository server administration console.
It is used to connect/disconnect to the repository server.
Repository Manager.
It is used to create or manage the repository. By this easy to handle task like creation of
repository and organization of folders and configuring permissions and privileges for the user
and group.
Designer
It is used to create mappings that contain transformation instruction for the informatica server.
Workflow manager
Used to create and run workflows and tasks.
Workflow monitor
It is used to monitor the scheduled and running workflows for each informatica server.
Cognos tools
Cognos Decision Stream
It is used to perform ETL & create meta data.
Cognos Impromptu
It is used to generate business intelligence reports.
Cognos scenario
It is used in data mining applications to find the hidden trends and pattern in data.
Cognos query
It facilitates data navigator, speed up the process of adhoc queries.
Cognos powerplay
It is used for carrying out the multidimensional online analysis of data.
3. Multidimensional versus Multirelational OLAP: OLAP Tools
The key role in multidimensional modelling is played by the concept of hierarchies while
implementing the OLAP.
Types of OLAP servers
There are mainly three types of OLAP servers: ROLAP, MOLAP and HOLAP. Let us discuss
these types of OLAP server.
ROLAP (Relational OLAP)
The preferred technology when the database size is large, i.e., greater than 100 GB. Here the data
will not be in summarized form; its response time is poor, minutes to hours, depending on the
query type. As the name implies, ROLAP systems are based on the relational data model. There
are ROLAP clients and a database server which is based on an RDBMS. The OLAP server sends
the request to the database server. The multidimensional cubes are generated dynamically as per
the requirement sent by the user. It supports mapping between the relational model and business
dimensions.
MOLAP (Multidimensional OLAP)
The system consists of OLAP Client that provides the front-end GUI for giving queries and
obtaining the reports. OLAP server is known as multidimensional DBMS. This is a proprietary
DBMS which stores the multidimensional data in multidimensional cubes and contains the data
in summarized form, based on the type of reports required.
A machine that carries out the data staging, which converts the data from RDBMS format to
MDBMS and sends the Multidimensional cube data to OLAP server.
HOLAP (Hybrid OLAP)
Hybrid OLAP is an amalgamation of ROLAP and MOLAP. It tries to accommodate the
advantages of both models. ROLAP has good database structure and simple queries can be
handled efficiently. On the other hand, MOLAP can handle complex aggregate queries faster.
However, MOLAP is computationally costlier. So we can have a midway: in HOLAP, the
relational database structure is preserved to handle simple and user-required queries. Instead of
computing all the cubes and storing them in a MOLAP server, the HOLAP server stores only
some important and partially computed cubes or aggregates, so that when higher scalability and
faster computation are required, the needed aggregates can be computed efficiently. Thus,
HOLAP possesses the advantages of both ROLAP and MOLAP.
4. Write the difference between OLTP vs. OLAP (Nov/Dec-2014)

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can
assume that OLTP systems provide source data to data warehouses.

OLTP (On-line Transaction Processing): characterized by a large number of short on-line
transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast
query processing, maintaining data integrity in multi-access environments, and effectiveness
measured by the number of transactions per second. An OLTP database contains detailed and
current data, and the schema used to store transactional data is the entity model.

Particulars: OLTP System (Online Transaction Processing, Operational System) versus OLAP System (Online Analytical Processing, Data Warehouse)

Source of data:
OLTP: Operational data; OLTPs are the original source of the data.
OLAP: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data:
OLTP: To control and run fundamental business tasks.
OLAP: To help with planning, problem solving, and decision support.

What the data reveals:
OLTP: A snapshot of ongoing business processes.
OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates:
OLTP: Short and fast inserts and updates initiated by end users.
OLAP: Periodic long-running batch jobs refresh the data.

Queries:
OLTP: Relatively standardized and simple queries returning relatively few records.
OLAP: Often complex queries involving aggregations.

Processing speed:
OLTP: Typically very fast.
OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements:
OLTP: Can be relatively small if historical data is archived.
OLAP: Larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design:
OLTP: Highly normalized with many tables.
OLAP: Typically de-normalized with fewer tables; uses star and/or snowflake schemas.

Backup and recovery:
OLTP: Operational data is critical to run the business; data loss is likely to entail significant monetary loss and legal liability.
OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

OLAP (On-line Analytical Processing)


It is characterized by relatively low volume of transactions. Queries are often very
complex and involve aggregations. For OLAP systems a response time is an effectiveness
measure. OLAP applications are widely used by Data Mining techniques.
The above table summarizes the major differences between OLTP and OLAP system
design.
5. Discuss the concepts of OLAP Tools. (May/June-2014)
OLAP Tools
OLAP tools are designed to manipulate and control multi-dimensional databases and help the
sophisticated user to analyze the data using clear multidimensional complex views. Their typical
applications include product performance and profitability, effectiveness of a sales program or a
marketing campaign, sales forecasting, and capacity planning.
OLAP Tools and the Internet:
The most significant premises in computing have been the Internet and data warehousing; thus
the integration of these two giant technologies is a necessity. The advantages of using the Web
for access are inevitable. The advantages are:
1. The internet provides connectivity between countries acting as a free resource.
2. The web eases administrative tasks of managing scattered locations.
3. The Web allows users to store and manage data and applications on servers that can be
managed, maintained and updated centrally.
These reasons indicate the importance of the Web in data storage and manipulation.
6. List the guidelines for OLAP

Codd's Rules for OLAP Systems

In 1993, E.F. Codd formulated twelve rules as the basis for selecting OLAP tools:
i. Multi-dimensional conceptual view
ii. Transparency
iii. Accessibility
iv. Consistent reporting performance
v. Client-server architecture
vi. Generic dimensionality
vii. Dynamic sparse matrix handling
viii. Multi-user support
ix. Unrestricted cross-dimensional operations
x. Intuitive data manipulation
xi. Flexible reporting
xii. Unlimited dimensions and aggregation levels
There are proposals to re-define or extend the rules, for example to also include:
Comprehensive database management tools
Ability to drill down to detail (source record) level
Incremental database refresh
SQL interface to the existing enterprise environment

7. Discuss the OLAP Tools and the Internet. [May-June-2014]


The most significant premises in computing have been the Internet and data warehousing; thus
the integration of these two giant technologies is a necessity. The advantages of using the Web
for access are inevitable.
These advantages are:
i.
The internet provides connectivity between countries acting as a free resource.
ii.
The web eases administrative tasks of managing scattered locations.
iii.
The Web allows users to store and manage data and applications on servers that can be
managed, maintained and updated centrally.
iv. These reasons indicate the importance of the Web in data storage and manipulation.
The web enabled data access has many significant features, such as:
i. The first
ii. The second
iii. The emerging third
iv. HTML publishing
v. Helper applications
vi. Plug-ins
vii. Server-centric components
viii. Java and Active-X applications

The primary key in the decision-making process is the amount of data collected and how well
this data is interpreted. Nowadays, managers aren't satisfied by getting direct answers to their
direct questions; instead, due to market growth and the increase of clients, their questions have
become more complicated.
8. Discuss the concept of Multidimensional data Model [Apr-May-2015]
The multidimensional data model is an integral part of On-Line Analytical Processing, or
OLAP. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries
during interactive sessions, not in batch jobs that run overnight. And because OLAP is also
analytic, the queries are complex. The multidimensional data model is designed to solve complex
queries in real time.
One way to understand the multidimensional data model is to view it as a cube. The cube at the
left contains detailed sales data by product, market and time. The cube on the right associates
sales numbers (units sold) with the dimensions product type, market and time, with the unit
variables organized as cells in an array.
This cube can be expanded to include another array, price, which can be associated with all or
only some dimensions. As the number of dimensions increases, the number of cube cells
increases exponentially.
Dimensions are hierarchical in nature; i.e., the time dimension may contain hierarchies for years,
quarters, months, weeks and days. GEOGRAPHY may contain country, state, city, etc.
9. List the Operations in Multidimensional Data Model.

Aggregation (roll-up)
dimension reduction: e.g., total sales by city
summarization over aggregate hierarchy:
e.g., total sales by city and year -> total sales by region and by year
Selection (slice) defines a subcube
e.g., sales where city = Palo Alto and date = 1/15/96
Navigation to detailed data (drill-down)
e.g., (sales - expense) by city, top 3% of cities by average income
Visualization Operations (e.g., Pivot or dice).

10. Explain the Reporting Query tools and Applications


The data warehouse is accessed using an end-user query and reporting tool from Business
Objects.
Business Objects provides several tools to securely access the data warehouse or personal data
files with a point-and-click interface including the following:
BusinessObjects (Reporter and Explorer) a Microsoft Windows based query and reporting tool.
InfoView - a web based tool that allows reports to be refreshed on demand
InfoBurst - a web based server tool that allows reports to be refreshed, scheduled and distributed.
It can be used to distribute reports and data to users or servers in various formats (e.g. Text,
Excel, PDF, HTML, etc.). For more information, see the documentation below:
o InfoBurst Usage Notes (PDF)
o InfoBurst User Guide (PDF).
Data Warehouse List Upload - a web based tool, that allows lists of data to be uploaded into the
data warehouse for use as input to queries.
o Data Warehouse List Upload Instructions (PDF) WSU has negotiated a contract with Business
Objects for purchasing these tools at a discount.
Selecting your Query Tool:
a. The query tools discussed in the next several slides represent the most commonly used query
tools at Penn State.
b. A Data Warehouse user is free to select any query tool, and is not limited to the ones
mentioned.
c. What is a Query Tool?
d. A query tool is a software package that allows you to connect to the data warehouse from your
PC and develop queries and reports
Unit-III
Part-A
1. What is Data mining?


Data mining refers to extracting or mining knowledge from large amount of data. It is
considered as a synonym for another popularly used term Knowledge Discovery in
Databases or KDD.

2. Give the classification of data mining tasks. [June 2014]


Descriptive: characterizes the general properties of the data in the database.
Predictive: performs inference on the current data in order to make predictions.

3. Define pattern evaluation.


Pattern evaluation is used to identify the truly interesting patterns representing knowledge
based on some interestingness measures. A pattern is interesting if it validates a hypothesis
that the user sought to confirm.

4. What are the types of data? [Nov 2014]
i) Qualitative data
ii) Quantitative data
5. List the data attribute types.
i) Nominal
ii) Ordinal
iii) Interval
iv) Ratio.
6. State the need for data pre-processing. [Dec 2013]


Real world data are generally
i) Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data.
ii) Noisy: containing errors or outliers.
iii) Inconsistent: containing discrepancies in codes or names. To remove all these problems, data pre-processing is needed.
7. What is meta learning? [Dec 2014]


Meta learning is a subfield of Machine learning where automatic learning algorithms are
applied on meta-data about machine learning experiments.

8. Differentiate between data characterization and discrimination. [Dec 2013]


Characterization: provides a concise and succinct summarization of the given collection of
data.
Discrimination or Comparison: provides descriptions comparing two or more collections of
data.

9. State the need of data cleaning. [Dec 2011] [May 2013]


Data cleaning removes noise and correct inconsistencies in the data. It cleans the data by
filling in missing values smoothing noisy data, identifying or removing outliers and
resolving inconsistencies.

10. Define data reduction.


It is used to obtain a reduced representation of the data set that is much smaller in volume
yet closely maintains the integrity of the original data, i.e., mining on the reduced set should
be more efficient yet produce the same analytical results.

11. Define data integration.
Data integration combines data from multiple sources into a coherent data store. These sources
may include multiple databases, data cubes or flat files.
12. What are the data preprocessing techniques?
Data preprocessing techniques are
i) Data cleaning-removes noise and correct inconsistencies in the data.
ii) Data integration-merges data from multiple sources into a coherent data store such as data
warehouse or a data cube.
iii) Data transformations-such as normalization improve the accuracy and efficiency of mining
algorithms involving distance measurements.
iv) Data reduction-reduces the data size by aggregating, eliminating redundant features, or
clustering.
13. List the data mining primitive tasks.
i) Task-relevant data
ii) Knowledge to be mined
iii) Background knowledge
iv) Interestingness measures
v) Presentation & visualization of discovered patterns

14. What kind of data can be mined?
Kinds of data are Database data, data warehouses, transactional data and other kinds of data like
time related data, data streams, spatial data, engineering design data, multimedia data and
web data.
15. Give the various data smoothing techniques.
i) Binning
ii) clustering
iii) regression
16. List the attributes of data.
- Nominal
- Ordinal
- Interval
- Ratio.
17. Give an example for Nominal type attribute.
Nominal attributes are divided into 2 kinds:
Simple (e.g., Professor, AP, Lecturer)
Binary (e.g., values 0 and 1)
18. Give an example for a numeric type attribute.
Interval (e.g., temperature in degrees Celsius)
Ratio (e.g., years of experience, age)
19. State no-coupling architecture.
In a no-coupling architecture, the data mining system does not work with any aspect of the
database or data warehouse system.
20. State loose-coupling architecture.
In a loose-coupling architecture, the data mining system collaborates with the database or
data warehouse system.
Part-B
1. Explain the architecture of a typical data mining system.
The architecture of a typical data mining system has the following major components:
i. Database, data warehouse, or other information repository: one or a set of databases, data
warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data
integration techniques may be performed on the data.
ii. Database or data warehouse server: responsible for fetching the relevant data, based on the
user's data mining request.
iii. Knowledge base: the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction. Knowledge such as
user beliefs, which can be used to assess a pattern's interestingness based on its
unexpectedness, may also be included.
iv. Data mining engine: essential to the data mining system; ideally consists of a set of
functional modules for tasks such as characterization, association analysis, classification, and
evolution and deviation analysis.
v. Pattern evaluation module: typically employs interestingness measures and interacts with
the data mining modules so as to focus the search towards interesting patterns. It may access
interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation
module may be integrated with the mining module, depending on the implementation of the
data mining method used.
vi. Graphical user interface: communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory data mining based
on the intermediate data mining results.

2. Explain Knowledge Discovery in Databases or KDD


Knowledge discovery as a process consists of an iterative sequence of the following steps:
i. Data cleaning (to remove noise or irrelevant data).
ii. Data integration (where multiple data sources may be combined).
iii. Data selection (where data relevant to the analysis task are retrieved from the database).
iv. Data transformation (where data are transformed or consolidated into forms appropriate
for mining, by performing summary or aggregation operations, for instance).
v. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns).
vi. Pattern evaluation (to identify the truly interesting patterns representing knowledge based
on some interestingness measures).
vii. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user).

3. Explain the various methods of data cleaning in detail. [May-June 2014]


Data cleaning routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
1. Missing values
i. Ignore the tuple: usually done when the class label is missing (assuming the mining task
involves classification or description). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
ii. Fill in the missing value manually: in general, this approach is time-consuming and may
not be feasible given a large data set with many missing values.
iii. Use a global constant to fill in the missing value: replace all missing attribute values by
the same constant, such as a label like "Unknown". If missing values are replaced by, say,
"Unknown", the mining program may mistakenly think that they form an interesting concept,
since they all have a value in common, namely "Unknown". Hence, although this method is
simple, it is not recommended.
iv. Use the attribute mean to fill in the missing value: for example, suppose that the average
income of All Electronics customers is $28,000. Use this value to replace the missing value
for income.
v. Use the attribute mean for all samples belonging to the same class as the given tuple: for
example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple.
vi. Use the most probable value to fill in the missing value: this may be determined with
inference-based tools using a Bayesian formalism or decision tree induction.
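The following is a hedged sketch of methods (iv) and (v) above using pandas; the column
names and toy values are assumptions made purely for illustration:

import pandas as pd

df = pd.DataFrame({
    "risk":   ["low", "low", "high", "high", "low"],
    "income": [30000, None, 18000, None, 26000],
})

# (iv) Use the attribute mean to fill in the missing value
df["income_global"] = df["income"].fillna(df["income"].mean())

# (v) Use the attribute mean for all samples in the same class (credit risk)
df["income_classwise"] = df.groupby("risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(df)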

2. Noisy data
Noise is a random error or variance in a measured variable. Smoothing techniques include:
i. Binning methods: binning methods smooth a sorted data value by consulting its
"neighbourhood", that is, the values around it. The sorted values are distributed into a number
of 'buckets', or bins. Because binning methods consult the neighbourhood of values, they
perform local smoothing.
ii. Clustering: outliers may be detected by clustering, where similar values are organized into
groups, or clusters. Intuitively, values which fall outside of the set of clusters may be
considered outliers.
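A minimal sketch of smoothing by bin means with equal-depth bins; the data values and the
bin depth of 3 are assumptions for illustration:

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3  # equal-depth (equal-frequency) binning

# Distribute the sorted values into buckets of 'depth' values each
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: replace every value by the mean of its bin
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]

print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]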
4. How data mining systems are classified? Discuss each classification with an example.
(Or) Give the classification of data mining system. Describe the issues related to data
mining. [May 2011] [Dec 2011] [Dec 2013] [May 2012][Dec 2014]
i)
Classification according to the kinds of databases mined
A data mining system can be classified according to the kinds of databases mined.
Database systems themselves can be classified according to different criteria (such as data
models, or the types of data or applications involved), each of which may require its own
data mining technique. Data mining systems can therefore be classified accordingly. For
instance, if classifying according to data models, we may have a relational, transactional,
object-oriented, object-relational, or data warehouse mining system. If classifying
according to the special types of data handled, we may have a spatial, time-series, text, or
multimedia data mining system, or a World-Wide Web mining system. Other system
types include heterogeneous data mining systems, and legacy data mining systems.
ii) Classification according to the kinds of knowledge mined.
Data mining systems can be categorized according to the kinds of knowledge they mine,
i.e., based on data mining functionalities, such as characterization, discrimination,
association, classification, clustering, trend and evolution analysis, deviation analysis,
similarity analysis, etc. A comprehensive data mining system usually provides multiple
and/or integrated data mining functionalities. Moreover, data mining systems can also be
distinguished based on the granularity or levels of abstraction of the knowledge mined,
including generalized knowledge (at a high level of abstraction), primitive-level
knowledge(at a raw data level), or knowledge at multiple levels (considering several
levels of abstraction). An advanced data mining system should facilitate the discovery of
knowledge at multiple levels of abstraction.
iii). Classification according to the kinds of techniques utilized.
Data mining systems can also be categorized according to the underlying data mining
techniques employed. These techniques can be described according to the degree of user
interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven
systems), or the methods of data analysis employed (e.g., database-oriented or data
warehouse-oriented techniques, machine learning, statistics, visualization, pattern


recognition, neural networks, and so on). A sophisticated data mining system will often
adopt multiple data mining techniques or work out an effective, integrated technique
which combines the merits of a few individual approaches.
5. Write short notes on interesting patterns. [Nov-Dec 2014]
A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
This raises some serious questions for data mining. A pattern is interesting if:
i. it is easily understood by humans,
ii. it is valid on new or test data with some degree of certainty,
iii. it is potentially useful, and
iv. it is novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An
interesting pattern represents knowledge. Several objective measures of pattern interestingness
exist, based on the structure of discovered patterns and the statistics underlying them.
An objective measure for association rules of the form X => Y is rule support, representing the
percentage of data samples that the given rule satisfies. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected association;
it is defined as the conditional probability that a pattern Y is true given that X is true.
More formally, support and confidence are defined as
support(X => Y) = P(X u Y)
confidence(X => Y) = P(Y | X)
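As a hedged numeric sketch, the two measures can be computed directly from a toy
transaction list (the items and transactions below are assumptions for illustration):

transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]
X, Y = {"computer"}, {"software"}
n = len(transactions)

# support(X => Y) = P(X u Y): fraction of transactions containing X and Y together
support = sum((X | Y) <= t for t in transactions) / n

# confidence(X => Y) = P(Y | X): of the transactions containing X,
# the fraction that also contain Y
confidence = support * n / sum(X <= t for t in transactions)

print(support)     # 0.5
print(confidence)  # 0.666...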
6. What is the need for pre-processing the data? (Nov/Dec 2007)
Incomplete, noisy, and inconsistent data are commonplace properties of large real world
databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of
interest may not always be available, such as customer information for sales transaction data.
Other data may not be included simply because it was not considered important at the time of
entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment
malfunctions. Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the history or modifications to the data may have been overlooked.
Missing data, particularly for tuples with missing values for some attributes, may need to be
inferred.
7. Explain the evolution of database technology.
Data mining primitives: a data mining query is defined in terms of the following primitives.
1. Task-relevant data: This is the database portion to be investigated. For example, suppose
that you are a manager of All Electronics in charge of sales in the United States and Canada.
In particular, you would like to study the buying trends of customers in Canada rather than
mining on the entire database. These are referred to as relevant attributes.
2. The kinds of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association, classification, clustering, or
evolution analysis. For instance, if studying the buying habits of customers in Canada, you
may choose to mine associations between customer profiles and the items that these
customers like to buy.
3. Background knowledge: Users can specify background knowledge, or knowledge about
the domain to be mined. This knowledge is useful for guiding the knowledge discovery
process, and for evaluating the patterns found. There are several kinds of background
knowledge.
4. Interestingness measures: These functions are used to separate uninteresting patterns from
knowledge. They may be used to guide the mining process, or after discovery, to evaluate the
discovered patterns. Different kinds of knowledge may have different interestingness
measures.
5. Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for knowledge
presentation, such as rules, tables, charts, graphs, decision trees, and cubes.
8. Brief the major issues in data mining. [Apr-May-2015]
1. Mining methodology and user-interaction issues: these reflect the kinds of knowledge mined,
the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad-hoc
mining, and knowledge visualization.
2. Mining different kinds of knowledge in databases. Since different users can be interested in
different kinds of knowledge, data mining should cover a wide spectrum of data analysis and
knowledge discovery tasks, including data characterization, discrimination, association,
classification, clustering, trend and deviation analysis, and similarity analysis. These tasks
may use the same database in different ways and require the development of numerous data
mining techniques.
3. Interactive mining of knowledge at multiple levels of abstraction. Since it is difficult to know
exactly what can be discovered within a database, the data mining process should be
interactive. For databases containing a huge amount of data, appropriate sampling technique
can first be applied to facilitate interactive data exploration. Interactive mining allows users to
focus the search for patterns, providing and refining data mining requests based on returned
results. Specifically, knowledge should be mined by drilling-down, rolling-up, and pivoting
through the data space and knowledge space interactively, similar to what OLAP can do on
data cubes. In this way, the user can interact with the data mining system to view data and
discovered patterns at multiple granularities and from different angles.
9. Explain the performance issues in data mining.[Apr-may-2011,2015]
Performance issues
These include efficiency, scalability, and parallelization of data mining algorithms.
Efficiency and scalability of data mining algorithms. To effectively extract information
from a huge amount of data in databases, data mining algorithms must be efficient and
scalable. That is, the running time of a data mining algorithm must be predictable and
acceptable in large databases. Algorithms with exponential or even medium-order
polynomial complexity will not be of practical use. From a database perspective on


knowledge discovery, efficiency and scalability are key issues in the implementation of
data mining systems. Many of the issues discussed above under mining methodology and
user-interaction must also consider efficiency and scalability.
Parallel, distributed, and incremental updating algorithms.
The huge size of many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the development of
parallel and distributed data mining algorithms. Such algorithms divide the data into
partitions, which are processed in parallel. The results from the partitions are then
merged. Moreover, the high cost of some data mining processes promotes the need for
incremental data mining algorithms which incorporate database updates without having
to mine the entire data again from scratch.
10. Short notes on data transformation.
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following:
1. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0. There are three main methods for data normalization:
min-max normalization, z-score normalization, and normalization by decimal scaling.
(i) Min-max normalization performs a linear transformation on the original data. Suppose
that minA and maxA are the minimum and maximum values of an attribute A. Min-max
normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA.
(ii) In z-score normalization (or zero-mean normalization), the values for an attribute A are
normalized based on the mean and standard deviation of A. A value v of A is normalized to
v' by computing
v' = (v - meanA) / std_devA,
where meanA and std_devA are the mean and standard deviation, respectively, of attribute A.
This method of normalization is useful when the actual minimum and maximum of attribute
A are unknown, or when there are outliers which dominate the min-max normalization.
(iii) Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal points moved depends on the maximum absolute value
of A. A value v of A is normalized to v' by computing
v' = v / 10^j,
where j is the smallest integer such that max(|v'|) < 1.
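A brief Python sketch of the three normalization formulas above; the attribute statistics
(e.g., minA = 12000 and maxA = 98000 for an income attribute) are assumed values used
only for illustration:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # v' = (v - meanA) / std_devA
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
    return v / 10 ** j

print(min_max(73600, 12000, 98000))   # about 0.716
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling(-986, 3))       # -0.986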
Unit-IV
Part-A
1. Define Association Rule Mining.
Association rule mining searches for interesting relationships among items in a given data
set.

2. When can we say the association rules are interesting?
Association rules are considered interesting if they satisfy both a minimum support threshold
and a minimum confidence threshold. Users or domain experts can set such thresholds.
3. How are association rules mined from large databases?
Step 1: find all frequent itemsets.
Step 2: generate strong association rules from the frequent itemsets.
4. Give a few techniques to improve the efficiency of the Apriori algorithm.
- Hash-based technique
- Transaction reduction
- Partitioning
- Sampling
- Dynamic itemset counting
5. Define constraint-based association mining.
Mining is performed under the guidance of various kinds of constraints provided by the user:
- Knowledge type constraints
- Data constraints
- Dimension/level constraints
- Interestingness constraints
- Rule constraints
6. Define itemset and frequent itemset.
An itemset is a collection or set of items that can be defined in one group.
A frequent itemset is an itemset whose occurrence (support) meets a minimum threshold;
all of its subsets are also frequent itemsets.
7. What is a decision tree?
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on
an attribute, each branch represents an outcome of the test, and leaf nodes represent classes
or class distributions. The topmost node in a tree is the root node.
8. Describe Tree pruning methods.
When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
9. Define Pre Pruning
A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf
may hold the most frequent class among the subset samples.
10. Define post pruning.
Post pruning removes branches from a fully grown tree. A tree node is pruned by removing its
branches. Example: the cost complexity algorithm.
11. What is the concept of prediction?
Prediction can be viewed as the construction and use of a model to assess the class of an
unlabeled sample or to assess the value or value ranges of an attribute that a given sample is
likely to have.
12. What is the purpose of Apriori Algorithm?
Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean
association rules. The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent item set properties.
13. State support vector machine.
It refers to an algorithm in which the training data are transformed into a higher dimension by
using a nonlinear mapping.
14. What factors affect the performance of the Apriori candidate generation technique?
- The need to generate a huge number of candidate sets.
- The need to repeatedly scan the database and check a large set of candidates by pattern
matching.
15. What is CHAID?
CHAID stands for Chi-squared Automatic Interaction Detection. It refers to an approach that
uses classification to handle nominal attributes, where for each input attribute ai there is a pair
of values vi which are the least different from the target attribute.
16. Define the anti-monotone property.
If a set cannot pass a test, all of its supersets will fail the same test as well.
17. What are multidimensional association rules?
Association rules that involve two or more dimensions or predicates.
- Inter-dimension association rule: a multidimensional association rule with no repeated
predicate or dimension.
- Hybrid-dimension association rule: a multidimensional association rule with multiple
occurrences of some predicates or dimensions.


18. Define the concept of classification.
Classification is a two-step process:
- A model is built describing a predefined set of data classes or concepts.
- The model is constructed by analyzing database tuples described by attributes.
Part-B
1. Compare Classification and Prediction:
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, while prediction models
continuous-valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures of potential
customers on computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered value, as opposed
to a categorical label. Regression analysis is a statistical methodology that is most often used for
numeric prediction. Many classification and prediction methods have been proposed by
researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory
resident, typically assuming a small data size. Recent data mining research has built on such
work, developing scalable classification and prediction techniques capable of handling large
disk-resident data.
2. What is Market basket analysis? [May-June-2013]
Market basket analysis: a market basket is a collection of items purchased by a customer in a
single transaction, which is a well-defined business activity. For example, a customer's visit to a
grocery store or an online purchase from a virtual store on the Web are typical customer
transactions. Retailers accumulate huge collections of transactions by recording business
activities over time. One common analysis run against a transactions database is to find sets of
items, or itemsets that appear together in many transactions. A business can use knowledge of
these patterns to improve the placement of these items in the store or the layout of mail-order
catalog pages and Web pages. An itemset containing i items is called an i-itemset. The percentage
of transactions that contain an item set is called the itemset's support. For an itemset to be
interesting, its support must be higher than a user-specified minimum. Such itemsets are said to
be frequent.


Rule support and confidence are two measures of rule interestingness. They respectively reflect
the usefulness and certainty of discovered rules. A support of 2% for association Rule means that
2% of all the transactions under analysis show that computer and financial management software
are purchased together. A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software. Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum confidence threshold.
3. Discuss in brief decision tree induction.
Classification by decision tree induction. A decision tree is a flow-chart-like tree structure, where:
- an internal node denotes a test on an attribute,
- a branch represents an outcome of the test,
- leaf nodes represent class labels or class distributions.
Decision tree generation consists of two phases:
- Tree construction: at the start, all the training examples are at the root; examples are then
partitioned recursively based on selected attributes.
- Tree pruning: identify and remove branches that reflect noise or outliers.
Use of a decision tree (classifying an unknown sample): test the attribute values of the sample
against the decision tree.
Algorithm (a greedy basic algorithm):
- The tree is constructed in a top-down recursive divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for
classifying the leaf).
- There are no samples left.
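A small hedged sketch of the algorithm above using scikit-learn; the encoded toy data set
and the feature names are assumptions, and criterion="entropy" selects test attributes by
information gain:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: two categorical attributes encoded as integers
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "yes", "no", "yes", "yes", "yes"]

# Attributes are selected top-down by a heuristic measure (information gain)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[0, 0]]))  # classify an unknown sample against the tree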
4. Explain Bayesian classification and rule-based classification. [Nov-Dec 2014]
Bayesian classification: Bayesian classifiers are statistical classifiers. They can predict class
membership probabilities, such as the probability that a given tuple belongs to a particular
class. Bayesian classification is based on Bayes' theorem.
Bayes' theorem: let X be a data tuple. In Bayesian terms, X is considered evidence, and it is
described by measurements made on a set of n attributes. Let H be some hypothesis, such as
that the data tuple X belongs to a specified class C. For classification problems, we want to
determine P(H|X), the probability that the hypothesis H holds given the evidence or observed
data tuple X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned
on X. Bayes' theorem is useful in that it provides a way of calculating the posterior probability
P(H|X) from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
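A minimal numeric sketch of Bayes' theorem; all probability values below are assumed
purely for illustration:

# P(H|X) = P(X|H) * P(H) / P(X)
p_h = 0.3          # prior: P(H), e.g. P(buys_computer = yes)
p_x_given_h = 0.4  # likelihood: P(X|H)
p_x = 0.2          # evidence: P(X)

posterior = p_x_given_h * p_h / p_x
print(posterior)   # 0.6 = P(H|X)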

5. Write short notes on SVM (Support Vector Machines). [April 2011]
SVM is a classification method for both linear and nonlinear data. It uses a nonlinear mapping
to transform the original training data into a higher dimension. In the new dimension, it
searches for the linear optimal separating hyperplane (i.e., decision boundary). With an
appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can
always be separated by a hyperplane. SVM finds this hyperplane using support vectors
(essential training tuples) and margins (defined by the support vectors).
Features: training can be slow, but accuracy is high owing to the ability to model complex
nonlinear decision boundaries (margin maximization). SVM is used both for classification and
prediction. Applications: handwritten digit recognition, object recognition, speaker
identification, and benchmarking time-series prediction tests.

SVM, linearly separable case: a separating hyperplane can be written as
W . X + b = 0,
where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias). For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0.
The hyperplanes defining the sides of the margin are:
H1: w0 + w1 x1 + w2 x2 >= 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 <= -1 for yi = -1.
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are
support vectors. Finding the maximal margin becomes a constrained (convex) quadratic
optimization problem: a quadratic objective function with linear constraints, solvable by
quadratic programming (QP) using Lagrangian multipliers.
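A hedged sketch of a linear SVM using scikit-learn; the toy 2-D points are assumptions. For
a linear kernel, the fitted W and b of the hyperplane W . X + b = 0 are exposed directly:

from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]
y = [-1, -1, -1, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)

print(clf.coef_)             # weight vector W
print(clf.intercept_)        # bias b
print(clf.support_vectors_)  # training tuples lying on the margin hyperplanes
print(clf.predict([[0, 1], [5, 5]]))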
6. Short notes on association mining.
Association rule mining finds frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational databases, and other
information repositories.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis,
clustering, classification, etc.
Example rule form: Body => Head [support, confidence].
Association rule basic concepts: given (1) a database of transactions, where (2) each
transaction is a list of items (purchased by a customer in a visit),
find all rules that correlate the presence of one set of items with that of another set of items.
E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
Applications:
- Maintenance agreement (what should the store do to boost maintenance agreement sales?)
- Home electronics (what other products should the store stock up on?)
- Attached mailing in direct marketing
- Detecting ping-ponging of patients, faulty collisions
7. Short notes on mining frequent patterns. [Nov-Dec 2014]
Apriori is the method that mines the complete set of frequent itemsets with candidate generation.
Apriori property:
- All nonempty subsets of a frequent itemset must also be frequent.
- If an itemset I does not satisfy the minimum support threshold min_sup, then I is not
frequent, i.e., support(I) < min_sup.
- If an item A is added to the itemset I, then the resulting itemset (I u A) cannot occur more
frequently than I.
Monotonic functions are functions that move in only one direction. This property is called
anti-monotone: if a set cannot pass a test, all of its supersets will fail the same test as well.
The property is monotonic in failing the test.

8. Write the Apriori algorithm. [Nov-Dec 2013]
Join step: the candidate set Ck is generated by joining Lk-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset,
so such candidates are removed.
9. Mention the categories of constraints.
1. Anti-monotone and monotone constraints:
- A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the
super-patterns of S can satisfy Ca.
- A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S
also satisfies it.
2. Succinct constraint:
- A subset of items Is is a succinct set if it can be expressed as sigma_p(I) for some selection
predicate p, where sigma is the selection operator.
- SP (a subset of 2^I) is a succinct power set if there is a fixed number of succinct sets
I1, ..., Ik such that SP can be expressed in terms of the strict power sets of I1, ..., Ik using
union and minus.
- A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set.
3. Convertible constraint: suppose all items in patterns are listed in a total order R.
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies
that each suffix of S w.r.t. R also satisfies C.
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that
each pattern of which S is a suffix w.r.t. R also satisfies C.
10. Illustrate the constraints in data mining.
Rule constraints in association mining. Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining, e.g.
P(x, y) ^ Q(x, w) => takes(x, "database systems").
- Rule (content) constraints: constraint-based query optimization (Ng, et al., SIGMOD'98),
e.g. sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000.
1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
- 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above.
- 2-var: a constraint confining both sides (L and R), e.g.
sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS).
Constraint-based association query:
- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price).
- A constrained association query (CAQ) is of the form {(S1, S2) | C}, where C is a set of
constraints on S1, S2, including a frequency constraint.
A classification of (single-variable) constraints:
- Class constraint: S ⊆ Item.
- Domain constraint: S theta v, with theta in {=, !=, <, <=, >, >=}, e.g. S.Price < 100;
or v theta S, with theta being ∈ or ∉, e.g. snacks ∉ S.Type;
or V theta S, or S theta V, with theta in {⊆, ⊂, =, !=}, e.g. {snacks, sodas} ⊆ S.Type.
Unit-V
Part-A
1. Define Clustering
Clustering is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.
2. List the types of Data in clustering.
- Data Matrix
- Dissimilarity Matrix.
3. What is CRM?
CRM (Customer Relationship Management) refers to the system of building and nurturing a
relationship with customers.
4. Define hierarchical clustering.
It refers to a clustering technique that builds a hierarchy of clusters by repeatedly
partitioning or merging them.
5. What do you mean by the partitioning method?
In the partitioning method, a partitioning algorithm arranges all the objects into various partitions,
where the total number of partitions is less than the total number of objects. Here each partition
represents a cluster. The two types of partitioning method are k-means and k-medoids.
6. Differentiate Agglomerative and Divisive Hierarchical Clustering?
Agglomerative Hierarchical clustering method works on the bottom-up approach. In
Agglomerative hierarchical method, each object creates its own clusters. The single Clusters are
merged to make larger clusters and the process of merging continues until all the singular
clusters are merged into one big cluster that consists of all the objects.
Divisive Hierarchical clustering method works on the top-down approach. In this method all the
objects are arranged within a big singular cluster and the large cluster is continuously divided
into smaller clusters until each cluster has a single object.
7. Define Density based method?
Density based method deals with arbitrary shaped clusters. In density-based method, clusters are
formed on the basis of the region where the density of the objects is high.
8. What do you mean by Grid Based Method?
In this method objects are represented by the multi resolution grid data structure. All the objects
are quantized into a finite number of cells and the collection of cells build the grid structure of
objects. The clustering operations are performed on that grid structure. This method is widely
used because its processing time is very fast and independent of the number of objects.
9. What is a STING?
Statistical Information Grid is called STING; it is a grid-based multi-resolution clustering
method. In STING, all the objects are contained in rectangular cells, and these cells are
arranged in a hierarchical structure.
10. What are the factors involved while choosing a data mining system?
i. Data types
ii. System issues
iii. Data sources
iv. Data mining functions and methodologies
v. Coupling data mining with database and/or data warehouse systems
vi. Scalability
vii. Visualization tools
viii. Data mining query language and graphical user interface
11. What is the model-based method?
Model-based methods are used to optimize a fit between a given data set and a mathematical
model. They assume that the data are distributed according to underlying probability
distributions. There are two basic approaches in this method:
i. Statistical approach
ii. Neural network approach

12. What is Time Series Analysis?


A time series is a set of attribute values over a period of time. Time Series Analysis may be
viewed as finding patterns in the data and predicting future values.
13. What are nominal variables?
A nominal variable is a generalization of the binary variable in that it can take more than two
states, e.g., red, yellow, blue, green. Methods to handle them:
i. Simple matching
ii. Use a large number of binary variables
14. Compare data matrix and dissimilarity matrix.
Data matrix: known as a two-mode matrix; rows and columns represent different entities
(objects and attributes).
Dissimilarity matrix: known as a one-mode matrix; rows and columns represent the same
entities (objects).
15. What is PAM?
PAM (Partitioning Around Medoids) starts from an initial set of medoids and iteratively
replaces medoids with non-medoids; as a result, the total distance within the final clusters is
improved.
16. What are outliers?
Outliers are objects that are considerably dissimilar from the remainder of the data. Example
from sports: Michael Jordan, Wayne Gretzky.
17. What are the Algorithms for mining distance-based outliers?
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
18. List the applications of data mining.
Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry
19. What are the requirements of cluster analysis?
The basic requirements of cluster analysis are:
- Dealing with different types of attributes
- Dealing with noisy data
- Constraints on clustering
- Dealing with arbitrary shapes
20. What is CURE?
Clustering Using REpresentatives is called CURE. Clustering algorithms generally work on
spherical and similar-size clusters; CURE overcomes this limitation and is more robust with
respect to outliers.

Part-B
1. Explain outlier analysis with an example. [Nov-Dec 2013]
Outlier analysis: what is outlier discovery? Outliers are objects that are considerably
dissimilar from the remainder of the data; examples from sports are Michael Jordan and
Wayne Gretzky.
Problem: find the top n outlier points.
Applications:
- Credit card fraud detection
- Telecom fraud detection
- Customer segmentation
- Medical analysis
Outlier discovery: statistical approaches

Assume a model of the underlying distribution that generates the data set (e.g., a normal
distribution), and use discordance tests that depend on:
- the data distribution,
- the distribution parameters (e.g., mean, variance), and
- the number of expected outliers.
Drawbacks: most tests are for a single attribute, and in many cases the data distribution may
not be known.
Outlier discovery: distance-based approach
Introduced to counter the main limitations imposed by statistical methods; we need
multi-dimensional analysis without knowing the data distribution.
Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a
fraction p of the objects in T lie at a distance greater than D from O.
Algorithms for mining distance-based outliers:
- Index-based algorithm
- Nested-loop algorithm
- Cell-based algorithm
Outlier discovery: deviation-based approach
Identifies outliers by examining the main characteristics of objects in a group; objects that
deviate from this description are considered outliers.
- Sequential exception technique: simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects.
- OLAP data cube technique: uses data cubes to identify regions of anomalies in large
multidimensional data.
2. How does agglomerative hierarchical clustering work? [Apr-May 2015]
Algorithms of hierarchical cluster analysis are divided into two categories: divisive
algorithms and agglomerative algorithms.
A divisive algorithm starts from the entire set of samples X and divides it into a partition of
subsets, then divides each subset into smaller sets, and so on. Thus, a divisive algorithm
generates a sequence of partitions that is ordered from a coarser one to a finer one.
An agglomerative algorithm first regards each object as an initial cluster. The clusters are
merged into a coarser partition, and the merging process proceeds until the trivial partition is
obtained: all objects are in one large cluster. This process of clustering is a bottom-up process,
proceeding from finer partitions to coarser ones.
Most agglomerative hierarchical clustering algorithms are variants of the single-link or
complete-link algorithms.
- In the single-link method, the distance between two clusters is the minimum of the distances
between all pairs of samples drawn from the two clusters (one element from the first cluster,
the other from the second).
- In the complete-link algorithm, the distance between two clusters is the maximum of all
distances between all pairs drawn from the two clusters.
The basic steps of the agglomerative clustering algorithm are the same. These steps are:
1. Place each sample in its own cluster. Construct the list of inter-cluster distances for all
distinct unordered pairs of samples, and sort this list in ascending order.
2. Step through the sorted list of distances, forming for each distinct threshold value dk a
graph of the samples where pairs of samples closer than dk are connected into a new cluster
by a graph edge. If all the samples are members of a connected graph, stop. Otherwise, repeat
this step.
3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at the desired
dissimilarity level, forming a partition (clusters) identified by simple connected components
in the corresponding subgraph.
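A brief sketch of these steps via SciPy's hierarchical clustering; the 2-D points and the cut
into three clusters are assumptions for illustration:

from scipy.cluster.hierarchy import linkage, fcluster

# Each sample starts as its own cluster and clusters are merged bottom-up
points = [[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]]

# method="single": inter-cluster distance = minimum pairwise distance (single-link)
Z = linkage(points, method="single")

# Cut the nested hierarchy at the level that yields three clusters
print(fcluster(Z, t=3, criterion="maxclust"))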

3. Brief the concept of density-based clustering methods.
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as a termination condition
Several interesting studies:
- DBSCAN: Ester, et al. (KDD'96)
- DENCLUE: Hinneburg and Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)
- OPTICS: Ankerst, et al. (SIGMOD'99)
Density-based clustering: background
Two parameters: Eps, the maximum radius of the neighbourhood, and MinPts, the minimum
number of points in an Eps-neighbourhood of that point.
N_Eps(p) = {q in D | dist(p, q) <= Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps,
MinPts if (1) p belongs to N_Eps(q), and (2) the core point condition |N_Eps(q)| >= MinPts
holds.
Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is
a chain of points p1, ..., pn with p1 = q and pn = p such that p_{i+1} is directly
density-reachable from p_i.
Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
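A hedged sketch of these definitions in practice using scikit-learn's DBSCAN; the points and
the Eps and MinPts values are assumptions for illustration:

from sklearn.cluster import DBSCAN

# eps = Eps (neighbourhood radius), min_samples = MinPts
points = [[1, 1], [1.1, 1], [0.9, 1.2], [5, 5], [5.1, 5.1], [9, 9]]
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)  # cluster ids; -1 marks noise (here the isolated point [9, 9])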

4. Explain the applications of data mining with an example.
Data mining is a young discipline with wide and diverse applications; there is still a
nontrivial gap between the general principles of data mining and domain-specific, effective
data mining tools for particular applications. Some application domains:
- Biomedical and DNA data analysis
- Financial data analysis
- Retail industry
- Telecommunication industry
Biomedical Data Mining and DNA Analysis
DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C),
guanine (G), and thymine (T).
Gene: a sequence of hundreds of individual nucleotides arranged in a particular order.
Humans have around 100,000 genes.
Tremendous number of ways that the nucleotides can be ordered and sequenced to form
distinct genes.
Semantic integration of heterogeneous, distributed genome databases is needed: currently
there is highly distributed, uncontrolled generation and use of a wide variety of DNA data.
Data cleaning and data integration methods developed in data mining will help.
Data Mining for Financial Data Analysis
Financial data collected in banks and financial institutions are often relatively complete,
reliable, and of high quality.
Design and construction of data warehouses for multidimensional data analysis and data
mining
View the debt and revenue changes by month, by region, by sector, and by other
factors.
Access statistical information such as max, min, total, average, trend, etc.
Loan payment prediction/consumer credit policy analysis
Feature selection and attribute relevance ranking
Loan payment performance.
Consumer credit rating
Data Mining for Retail Industry
Retail industry: huge amounts of data on sales, customer shopping history, etc.
Applications of retail data mining
Identify customer buying behaviours
Discover customer shopping patterns and trends
Improve the quality of customer service
Achieve better customer retention and satisfaction
Enhance goods consumption ratios
Design more effective goods transportation and distribution policies
Data Mining for Telecommunication Industry
A rapidly expanding and highly competitive industry and a great demand for data
mining.
Understand the business involved
Identify telecommunication patterns
Catch fraudulent activities
Make better use of resources
Improve the quality of service

5. Explain the types of partitioning algorithms in detail. [May-June 2014]
Partitioning algorithms: basic concept
A partitioning method constructs a partition of a database D of n objects into a set of k
clusters. Given a k, find a partition of k clusters that optimizes the chosen partitioning
criterion.
- Global optimal: exhaustively enumerate all partitions.
- Heuristic methods: k-means and k-medoids algorithms.
- k-means (MacQueen'67): each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman and Rousseeuw'87): each
cluster is represented by one of the objects in the cluster.
The k-means clustering method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid
is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.
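A minimal pure-Python sketch of these four steps; the 2-D points and random initial seeds
are assumptions for illustration:

import random

def kmeans(points, k, iters=100):
    # Step 1: pick k objects as the initial seed points (centroids)
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k), key=lambda c: (x - centroids[c][0]) ** 2
                                            + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Step 2: recompute each seed as the centroid (mean point) of its cluster
        new = [(sum(px for px, _ in c) / len(c), sum(py for _, py in c) / len(c))
               if c else centroids[j] for j, c in enumerate(clusters)]
        # Step 4: stop when no assignment changes (centroids are stable)
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

print(kmeans([(1, 1), (1, 2), (8, 8), (9, 8), (9, 9)], k=2))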

Comments on the k-means method
Strength: relatively efficient, O(tkn), where n is the number of objects, k the number of
clusters, and t the number of iterations; normally k, t << n. It often terminates at a local
optimum; the global optimum may be found using techniques such as deterministic
annealing and genetic algorithms.
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.
Variations of the k-means method
A few variants of k-means differ in:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
Handling categorical data: k-modes (Huang'98)
- Replaces means of clusters with modes
- Uses new dissimilarity measures to deal with categorical objects
- Uses a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: k-prototype method

6. Explain the ROCK algorithm and CHAMELEON.
ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi and K. Shim (ICDE'99).
- Uses links to measure similarity/proximity; it is not distance-based.
- Similarity function and neighbours: for example, let T1 = {1,2,3} and T2 = {3,4,5}; then
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2.
- Links: the number of common neighbours between two points. For example, among the
3-itemsets {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5},
{3,4,5}, the pair {1,2,3} and {1,2,4} has 3 common neighbours.
Algorithm:
- Draw a random sample
- Cluster with links
- Label the data on disk
CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E. H. Han
and V. Kumar ('99).
- Measures the similarity based on a dynamic model: two clusters are merged only if the
interconnectivity and closeness (proximity) between the two clusters are high relative to the
internal interconnectivity of the clusters and the closeness of items within the clusters.
A two-phase algorithm:
1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small
sub-clusters.
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by
repeatedly combining these sub-clusters.

7. Explain the grid-based model. [May-June 2014]
Grid-based methods use a multi-resolution grid data structure. Several interesting methods:
- STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97)
- WaveCluster by Sheikholeslami, Chatterjee and Zhang (VLDB'98): a multi-resolution
clustering approach using the wavelet method
- CLIQUE: Agrawal, et al. (SIGMOD'98)
STING: a statistical information grid approach
- The spatial area is divided into rectangular cells.
- There are several levels of cells corresponding to different levels of resolution.
- Each cell at a high level is partitioned into a number of smaller cells at the next lower level.
- Statistical information for each cell is calculated and stored beforehand and is used to
answer queries.
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level
cells: count, mean, s (standard deviation), min, max, and the type of distribution (normal,
uniform, etc.).
- A top-down approach is used to answer spatial data queries: start from a pre-selected layer,
typically with a small number of cells, and for each cell in the current level compute the
confidence interval.
- Remove the irrelevant cells from further consideration; when finished examining the
current layer, proceed to the next lower level. Repeat this process until the bottom layer is
reached.
Advantages: query-independent, easy to parallelize, incremental update; O(K), where K is
the number of grid cells at the lowest level.
Disadvantages: all the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected.

8. How to apply the wavelet transform to find clusters?
- Summarize the data by imposing a multidimensional grid structure onto the data space.
- These multidimensional spatial data objects are represented in an n-dimensional feature
space.
- Apply the wavelet transform on the feature space to find the dense regions in the feature
space.
- Apply the wavelet transform multiple times, which results in clusters at different scales,
from fine to coarse.
Why is the wavelet transformation useful for clustering?
- It supports unsupervised clustering: it uses hat-shaped filters to emphasize regions where
points cluster, while simultaneously suppressing weaker information at their boundaries.
- Effective removal of outliers.
- Multi-resolution.
- Cost efficiency.
Major features:
- Complexity O(N)
- Detects arbitrary-shaped clusters at different scales
- Not sensitive to noise, not sensitive to input order
- Only applicable to low-dimensional data

9. Discuss Clustering In QUEst (CLIQUE).
CLIQUE (Clustering In QUEst), by Agrawal, Gehrke, Gunopulos and Raghavan
(SIGMOD'98), automatically identifies subspaces of a high-dimensional data space that
allow better clustering than the original space. CLIQUE can be considered both
density-based and grid-based:
- It partitions each dimension into the same number of equal-length intervals.
- It partitions an m-dimensional data space into non-overlapping rectangular units.
- A unit is dense if the fraction of total data points contained in the unit exceeds an input
model parameter.
- A cluster is a maximal set of connected dense units within a subspace.
Major steps in CLIQUE:
1. Partition the data space and find the number of points that lie inside each cell of the
partition.
2. Identify the subspaces that contain clusters using the Apriori principle.
3. Identify clusters: determine dense units in all subspaces of interest, and determine
connected dense units in all subspaces of interest.
4. Generate a minimal description for the clusters: determine maximal regions that cover a
cluster of connected dense units for each cluster, and determine the minimal cover for each
cluster.

10. Discuss the model-based clustering methods. [Apr/May 2015]
Model-based clustering methods attempt to optimize the fit between the data and some
mathematical model. They follow statistical and AI approaches.
Conceptual clustering:
- A form of clustering in machine learning.
- Produces a classification scheme for a set of unlabeled objects.
- Finds a characteristic description for each concept (class).
COBWEB (Fisher'87):
- A popular and simple method of incremental conceptual learning.
- Creates a hierarchical clustering in the form of a classification tree.
- Each node refers to a concept and contains a probabilistic description of that concept.
Neural network approaches:
- Represent each cluster as an exemplar, acting as a prototype of the cluster.
- New objects are distributed to the cluster whose exemplar is the most similar, according to
some distance measure.
Competitive learning:
- Involves a hierarchical architecture of several units (neurons).
- Neurons compete in a winner-takes-all fashion for the object currently being presented.

Anna University Sample Question Paper

PART A (10 x 2 = 20 marks)
1. What is the need for the back-end process in data warehouse design?
2. What are the advantages of dimensional modeling?
3. List out two different types of reporting tools.
4. Define OLAP.
5. What is a legacy database?
6. What is descriptive and predictive data mining?
7. How does prediction differ from classification?
8. How do you choose the best split while constructing a decision tree?
9. What is a STING?
10. Define WaveCluster.

PART B (5 x 16 = 80 marks)
11. (a) (i) Explain the mapping of the data warehouse to multiprocessor architecture. (10)
(ii) Discuss about data warehouse metadata. (6)
Or
(b) With a neat diagram describe the various stages of building a data warehouse. (16)

12. (a) (i) Explain the data model which is suitable for the data warehouse, with an example. (8)
(ii) Write the difference between multidimensional OLAP and multirelational OLAP.
Or
(b) Explain the different types of OLAP tools.
13. (a) What is the use of data mining tasks? What are the basic types of data mining tasks?
Explain with examples. (16)
Or
(b) Explain various methods of data cleaning in detail.
14. (a) (i) Write and explain the algorithm for mining frequent itemsets without candidate
generation. (8)
(ii) A database has nine transactions. Let min_sup = 30%.
TID  List of item IDs
1    a, b, e
2    b, d
3    b, c
4    a, b, d
5    a, c
6    b, c
7    a, c
8    a, b, c, e
9    a, b, c
Find all frequent itemsets using the above algorithm.
Or
(b) With an example explain the various attribute selection measures in classification.
15. (a) (i) Explain the different types of data used in cluster analysis. (10)
(ii) Discuss the use of outlier analysis. (6)
Or
(b) (i) Write the difference between CLARA and CLARANS.
(ii) Explain how data mining is used in the retail industry.

References
1. Alex Berson and Stephen J. Smith, Data Warehousing, Data Mining and OLAP, Tata
McGraw Hill Edition, Thirteenth Reprint, 2008.
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Third
Edition, Elsevier, 2012.
3. http://www.tutorialspoint.com
4. https://anuradhasrinivas.files.wordpress.com

IT6004 SOFTWARE TESTING L T P C 3 0 0 3

UNIT I INTRODUCTION 9
Testing as an Engineering Activity - Testing as a Process - Testing axioms - Basic definitions -
Software Testing Principles - The Tester's Role in a Software Development Organization -
Origins of Defects - Cost of defects - Defect Classes - The Defect Repository and Test Design -
Defect Examples - Developer/Tester Support of Developing a Defect Repository - Defect
Prevention strategies.
UNIT II TEST CASE DESIGN 9
Test case Design Strategies - Using Black Box Approach to Test Case Design - Random Testing -
Requirements based testing - Boundary Value Analysis - Equivalence Class Partitioning - State
based testing - Cause-effect graphing - Compatibility testing - User documentation testing -
Domain testing - Using White Box Approach to Test design - Test Adequacy Criteria - Static
testing vs. structural testing - Code functional testing - Coverage and Control Flow Graphs -
Covering Code Logic - Paths - Code complexity testing - Evaluating Test Adequacy Criteria.
UNIT III LEVELS OF TESTING 9
The need for Levels of Testing - Unit Test - Unit Test Planning - Designing the Unit Tests - The
Test Harness - Running the Unit tests and Recording results - Integration tests - Designing
Integration Tests - Integration Test Planning - Scenario testing - Defect bash elimination - System
Testing - Acceptance testing - Performance testing - Regression Testing - Internationalization
testing - Ad-hoc testing - Alpha, Beta Tests - Testing OO systems - Usability and Accessibility
testing - Configuration testing - Compatibility testing - Testing the documentation - Website
testing.
UNIT IV TEST MANAGEMENT 9
People and organizational issues in testing - Organization structures for testing teams - Testing
services - Test Planning - Test Plan Components - Test Plan Attachments - Locating Test Items -
Test management - Test process - Reporting Test Results - The role of three groups in Test
Planning and Policy Development - Introducing the test specialist - Skills needed by a test
specialist - Building a Testing Group.
UNIT V TEST AUTOMATION 9
Software test automation - Skills needed for automation - Scope of automation - Design and
architecture for automation - Requirements for a test tool - Challenges in automation - Test
metrics and measurements - Project, progress and productivity metrics.
TOTAL: 45 PERIODS
TEXT BOOKS:
1. Srinivasan Desikan and Gopalaswamy Ramesh, Software Testing Principles and Practices,
Pearson Education, 2006.
2. Ron Patton, Software Testing, Second Edition, Sams Publishing, Pearson Education, 2007.
REFERENCES:
1. Ilene Burnstein, Practical Software Testing, Springer International Edition, 2003.
2. Edward Kit, Software Testing in the Real World: Improving the Process, Pearson
Education, 1995.
3. Boris Beizer, Software Testing Techniques, 2nd Edition, Van Nostrand Reinhold, New
York, 1990.
4. Aditya P. Mathur, Foundations of Software Testing: Fundamental Algorithms and
Techniques, Dorling Kindersley (India) Pvt. Ltd., Pearson Education, 2008.

ALPHA COLLEGE OF ENGINEERING
Thirumazhisai, Chennai 600124
LESSON PLAN

Faculty Name: Prema. S
Designation: Assistant Professor
Subject Name: Software Testing
Code: IT6001
Year: IV
Semester: 07
Degree & Branch: B.Tech/IT

AIM:
To understand the concepts of software testing, to expose the criteria for test cases, to learn the design of test cases, to be familiar with test management and test automation techniques, and to be exposed to test metrics and measurements.
TEXT BOOKS:
1. Srinivasan Desikan and Gopalaswamy Ramesh, Software Testing Principles and Practices, Pearson
Education, 2006.
2. Ron Patton, Software Testing, Second Edition, Sams Publishing, Pearson Education, 2007.
REFERENCES:
1. Ilene Burnstein, Practical Software Testing, Springer International Edition, 2003.
2. Edward Kit, Software Testing in the Real World Improving the Process, Pearson Education, 1995.
3. Boris Beizer, Software Testing Techniques, 2nd Edition, Van Nostrand Reinhold, New York, 1990.
4. Aditya P. Mathur, Foundations of Software Testing: Fundamental Algorithms and Techniques, Dorling Kindersley (India) Pvt. Ltd., Pearson Education, 2008.

Sl. No. | Topics | No. of Periods Required | Text/Ref. Book

UNIT I INTRODUCTION
1 | Introduction to software testing | 1 | T1
2 | Testing as an Engineering Activity | 1 | T1
3 | Testing as a Process, Testing axioms | 1 | T1
4 | Basic definitions | 1 | T1
5 | The Tester's Role in a Software Development Organization | 2 | T1
6 | Origins of Defects, Cost of defects | 2 | T1
7 | Defect Classes, The Defect Repository and Test Design | 2 | T1
8 | Developer/Tester Support of Developing a Defect Repository, Defect Prevention | 1 | T1

UNIT II TEST CASE DESIGN
9 | Introduction to test case design | 1 | T1
10 | Test case Design Strategies | 1 | T1
11 | Using Black Box Approach to Test Case Design, Random Testing, Requirements based testing | 1 | T1
12 | Boundary Value Analysis, Equivalence Class Partitioning | 1 | T1
13 | State based testing, Cause-effect graphing | 1 | T1
14 | Compatibility testing, user documentation testing | 1 | T1
15 | Domain testing | 1 | T1
16 | Using White Box Approach to Test design, Test Adequacy Criteria | 1 | T1
17 | Static testing vs. structural testing, code functional testing | 1 | T1
18 | Coverage and Control Flow Graphs | 1 | T1
19 | Covering Code Logic, Paths | 1 | T1
20 | Code complexity testing, Evaluating Test Adequacy Criteria | 1 | T2

UNIT III LEVELS OF TESTING
21 | Introduction to levels of testing | 1 | T2
22 | The need for Levels of Testing, Unit Test | 1 | T2
23 | Unit Test Planning, Designing the Unit Tests | 1 | T2
24 | The Test Harness, Running the Unit tests and Recording results | 2 | T2
25 | Integration tests, Designing Integration Tests, Integration Test Planning | 1 | T2
26 | Scenario testing, Defect bash elimination | 2 | T2
27 | System Testing, Acceptance testing, Performance testing | 1 | T2
28 | Regression Testing, Internationalization testing | 1 | T2
29 | Adhoc testing, Alpha and Beta Tests | 1 | T2
30 | Testing OO systems, Usability and Accessibility testing | 1 | T2
31 | Configuration testing, Compatibility testing, Testing the documentation, Website testing | 1 | T2

UNIT IV TEST MANAGEMENT
32 | Introduction to test management | 1 | T2
33 | People and organizational issues in testing | 1 | T2
34 | Organization structures for testing teams | 1 | T2
35 | Testing services, Test Planning | 1 | T2
36 | Test Plan Components | 1 | T2
37 | Test Plan Attachments, Locating Test Items | 1 | T2
38 | Test management, test process | 1 | T2
39 | Reporting Test Results, The role of three groups in Test Planning and Policy Development | 1 | T2
40 | Introducing the test specialist | 1 | T2
41 | Skills needed by a test specialist, Building a Testing Group | 1 | T2
42 | Test | - | T2

UNIT V TEST AUTOMATION
43 | Introduction to test automation | 1 | T2
44 | Software test automation | 1 | T2
45 | Skill needed for automation | 1 | T2
46 | Scope of automation | 1 | T2
47 | Design and architecture for automation | 1 | T2
48 | Requirements for a test tool | 1 | T2
49 | Challenges in automation | 1 | T2
50 | Test metrics and measurements | 1 | T2
51 | Project, progress and productivity metrics | 1 | T2
52 | Test | - | T2

UNIT: 1 (INTRODUCTION)
1) Define Software Engineering.
Software Engineering is a discipline that produces error-free software within time and budget.
2) Define software Testing.
Testing can be described as a process used for revealing defects in software, and
for establishing that the software has attained a specified degree of quality with respect to
selected attributes.
3) List the elements of the engineering disciplines.
Basic principles
Processes
Standards
Measurements
Tools
Methods
Best practices
Code of ethics
Body of knowledge
4) Differentiate between verification and validation. (Nov/Dec 2009/2012) [May/June 2006]

Verification:
1. Verification is the process of evaluating a software system or component to determine whether the products of a given development phase satisfy the conditions imposed at the start of that phase.
2. Verification is usually associated with activities such as inspections and reviews of the software deliverables.

Validation:
1. Validation is the process of evaluating a software system or component during, or at the end of, the development cycle to determine whether it satisfies specified requirements.
2. Validation is usually associated with traditional execution-based testing, i.e., exercising the code with test cases.

5) Define the term Testing.


Testing is generally described as a group of procedures carried out to evaluate
some aspect of a piece of software.
Testing can be described as a process used for revealing defects in software, and
for establishing that the software has attained a specified degree of quality with respect to
selected attributes.
6) Define process in the context of software quality. ( U.Q Nov/Dec 2009)
Process, in the software engineering domain, is a set of methods, practices, standards, documents, activities, policies, and procedures that software engineers use to develop and maintain a software system and its associated artifacts, such as project and test plans, design documents, code, and manuals.
7) Define the term Debugging or fault localization.
Debugging or fault localization is the process of
Locating the fault or defect
Repairing the code, and
Retesting the code.
8) List the levels of TMM.
The testing maturity model or TMM contains five levels. They are
Level1: Initial
Level2: Phase definition
Level3: Integration
Level4: Management and Measurement
Level5: Optimization /Defect prevention and Quality Control
9) List the members of the critical groups in a testing process (U.Q Nov/Dec 2008)
Manager
Developer/Tester
User/Client
10) Define Error.
An error is a mistake, misconception, or misunderstanding on the part of a software developer.
11) Define Faults (Defects).
A fault is introduced into the software as the result of an error. It is an anomaly in the software that may cause it to behave incorrectly, not according to its specification.
12) Define failures.
A failure is the inability of a software system or component to perform its required functions within specified performance requirements.
13) Distinguish between fault and failure. (U.Q May/June 2009)
Fault:
1. A fault is introduced into the software as the result of an error. It is an anomaly in the software that may cause it to behave incorrectly, not according to its specification.

Failure:
2. A failure is the inability of a software system or component to perform its required functions within specified performance requirements.
14) Define Test Cases. [Nov/Dec-2009]
A test case in a practical sense is a test-related item which contains the following information (a code sketch of these parts follows the list).
A set of test inputs. These are data items received from an external
source by the code under test. The external source can be hardware,
software, or human.
Execution conditions. These are conditions required for running the test,
for example, a certain state of a database, or a configuration of a
hardware device.
Expected outputs. These are the specified results to be produced by the
code under test.
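As an illustration, the three parts of a test case map directly onto an automated test. The following is a minimal sketch in Python's unittest module, where withdraw() is a hypothetical unit under test, not a function from the syllabus texts:

import unittest

def withdraw(balance, amount):
    # Hypothetical code under test: debit an account if funds allow.
    if amount <= 0 or amount > balance:
        raise ValueError("invalid amount")
    return balance - amount

class WithdrawTest(unittest.TestCase):
    def setUp(self):
        # Execution condition: a known starting state (balance of 100).
        self.balance = 100

    def test_valid_withdrawal(self):
        # Test input: 40; expected output: 60.
        self.assertEqual(withdraw(self.balance, 40), 60)

    def test_overdraft_rejected(self):
        # Invalid-input case: the expected outcome is a raised ValueError.
        with self.assertRaises(ValueError):
            withdraw(self.balance, 150)

if __name__ == "__main__":
    unittest.main()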
15) Write short notes on Test, Test Set, and Test Suite.
A test is a group of related test cases, or a group of related test cases and test procedures.
A group of related tests is sometimes referred to as a test set.
A group of related tests that are associated with a database, and are usually run together, is sometimes referred to as a test suite.
16) Define Test Oracle.
A test oracle is a document, or a piece of software, that allows the tester to determine whether a test has been passed or failed.
17) Define Test Bed.
A test bed is an environment that contains all the hardware and software needed to test a
software component or a software system.
18) Define Software Quality.
Quality relates to the degree to which a system, system component, or process meets
specified requirements.
Quality relates to the degree to which a system, system component, or process meets
Customer or user needs, or expectations.
19) List the Quality Attributes.

Correctness
Reliability
Usability
Integrity
Portability
Maintainability
Interoperability

20) Define SQA group.


The software quality assurance (SQA) group is a team of people with the necessary training and skills to ensure that all necessary actions are taken during the development process so that the resulting software conforms to established technical requirements.
21) Explain the work of SQA group.
The SQA group works with testers to develop quality-related policies and quality assurance plans for each project. The group is also involved in measurement collection and analysis, record keeping, and reporting. The SQA team members participate in reviews and audits, record and track problems, and verify that corrections have been made.
22) Define reviews.
A review is a group meeting whose purpose is to evaluate a software artifact or a set of software artifacts. Reviews and audits are usually conducted by an SQA group.
23) List the sources of Defects or Origins of defects. Or list the classification of defect (U.Q
May/June 2009)

Education
Communication
Oversight
Transcription
Process

24) Programmer A and Programmer B are working on a group of interfacing modules.


Programmer A tends to be a poor communicator and does not get along well with
Programmer B. Due to this situation, what types of defects are likely to surface in these
interfacing modules?
Communication defects.
25)Differentiate between testing and debugging. (U.Q Nov/Dec 2008)
Testing:
1. Testing is a dual-purpose process: it reveals defects, and it evaluates quality attributes.

Debugging:
1. Debugging or fault localization is the process of locating the fault or defect, repairing the code, and retesting the code.

PART B
UNIT 1 (INTRODUCTION)

1. The Role of process in Software quality.[Nov/Dec-2009]


Process
- Definition
Figure - Components of an engineered process.
Capability Maturity Model.
Testing Maturity model ( TMM )
2. Testing as a Process.
Validation
Verification
Figure : Example processes embedded in the software development process.
Testing
Debugging
3. Overview of the Testing Maturity Model (TMM) & the test-related activities that
should be done for V-model architecture.
Test related issues
Benefits of test process improvement
Introduction to TMM
TMM levels
Figure : The internal Structure of TMM maturity level.
Figure : Five level Structure of TMM.
Figure : The Extended / Modified V model
4. Software Testing Principles.[Nov/Dec-2009/2012]
Define - Principle.
Principle 1 : Revealing defects and evaluating quality.
Principle 2 : Effectiveness of testing effort.
Principle 3 : Test results should be inspected.
Principle 4 : Test case must contain the expected output.
Principle 5 : Test case should be developed for both valid and invalid input conditions.
Principle 6 : Defects ratio.
Principle 7 : Testing should be carried out by a group.
Principle 8 : Tests must be repeatable and reusable.
Principle 9 : Testing should be planned.
Principle 10: Testing activities should be integrated into software lifecycle.
Principle 11: Testing is a creative and challenging task.
5. Origins of defects.[Nov/Dec-2009/2012]
Sources of defects
Figure : Origins of defects.
Fault model.

6. Defect Classes, Defect Repository, and Test Design. [Nov/Dec-2012]


Figure: Defect classes and a defect repository.
Requirements and specification defects.
Functional Description defects.
Feature defects.
Feature interaction defects.
Interface description defects.
Design defects.
Algorithmic and processing defects.
Control, logic, and sequence defects.
Data defects.
Module interface description defects.
Functional description defects.
External interface description defects.
Coding defects.
Algorithmic and processing defects.
Control, logic, and sequence defects.
Typographical defects.
Initialization defects.
Dataflow defects.
Data defects.
Module interface defects.
Code document defects.
UNIT: 2 (TEST CASE DESIGN)
1. Define Smart Tester.
Software must be tested before it is delivered to users. It is the responsibility of testers to design tests that (i) reveal defects and (ii) can be used to evaluate software performance, usability, and reliability. To achieve these goals, the tester must select a finite number of test cases (inputs, outputs, and conditions).
2. Compare black box and white box testing [Nov/Dec-2012]
Black box testing:
1. In black box testing, the tester has no knowledge of the software's inner structure (i.e., how it works); the tester only has knowledge of what it does (the focus is only on inputs and outputs).
2. The black box approach is usually applied to large pieces of software.
3. Black box testing is sometimes called functional or specification-based testing.

White box testing:
1. The white box approach focuses on the inner structure of the software to be tested.
2. The white box approach is usually applied to small pieces of software.
3. White box testing is sometimes called clear box or glass box testing.
3. Draw the tester's view of black box and white box testing.

Test Strategy | Tester's View
Black box | Inputs -> (no knowledge about inner structure; focus only on inputs and outputs) -> Outputs
White box | Focuses on the inner structure of the software

4. Write short notes on Random testing and Equivalence class partitioning.

Each software module or system has an input domain from which test input data is selected. If a tester randomly selects inputs from the domain, this is called random testing.
In equivalence class partitioning, the input domain is partitioned into classes whose members are all expected to be treated the same way by the software, so one representative test case per class stands in for the whole class (see the sketch below).
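A small sketch of both techniques, assuming a hypothetical classify_triangle() function as the unit under test:

import random

def classify_triangle(a, b, c):
    # Hypothetical code under test: classify a triangle by its sides.
    if a + b <= c or b + c <= a or a + c <= b:
        return "not a triangle"
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

# Equivalence class partitioning: one representative input per class is
# assumed to stand for every member of that class.
partitions = {
    "equilateral":    (3, 3, 3),
    "isosceles":      (3, 3, 5),
    "scalene":        (3, 4, 5),
    "not a triangle": (1, 2, 9),
}
for expected, sides in partitions.items():
    assert classify_triangle(*sides) == expected

# Random testing: inputs drawn from the input domain without regard to
# classes; here the check is only that the unit never crashes.
for _ in range(1000):
    sides = [random.randint(1, 10) for _ in range(3)]
    classify_triangle(*sides)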
5. List the knowledge sources & methods of black box and white box testing.

Black box:
Knowledge sources: 1. Requirements document; 2. Specifications; 3. Domain knowledge; 4. Defect analysis data.
Methods: 1. Equivalence class partitioning (ECP); 2. Boundary value analysis (BVA); 3. State transition testing (STT); 4. Cause and effect graphing; 5. Error guessing.

White box:
Knowledge sources: 1. High-level design; 2. Detailed design; 3. Control flow graphs; 4. Cyclomatic complexity.
Methods: 1. Statement testing; 2. Branch testing; 3. Path testing; 4. Data flow testing; 5. Mutation testing; 6. Loop testing.

6. Define State.
A state is an internal configuration of a system or component. It is defined in terms of the
values assumed at a particular time for the variables that characterize the system or component.
7. Define Finite-State machine.
A finite-state machine is an abstract machine that can be represented by a state graph
having a finite number of states and a finite number of transitions between states.
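A minimal sketch of state transition testing built on these two definitions, assuming a hypothetical turnstile machine with states LOCKED and UNLOCKED:

# Hypothetical finite-state machine: a turnstile. The state table maps
# (current state, event) pairs to the next state.
TRANSITIONS = {
    ("LOCKED",   "coin"): "UNLOCKED",
    ("LOCKED",   "push"): "LOCKED",
    ("UNLOCKED", "push"): "LOCKED",
    ("UNLOCKED", "coin"): "UNLOCKED",
}

def step(state, event):
    return TRANSITIONS[(state, event)]

# State transition testing: exercise every transition in the state graph
# and check the resulting state against the state table.
for (state, event), expected in TRANSITIONS.items():
    assert step(state, event) == expected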
8. Define Error Guessing.
The tester/developer is sometimes able to make an educated guess as to which types of defects may be present and design test cases to reveal them. Error guessing is an ad hoc approach to test design in most cases.
9. Define COTS Components.
The reusable component may come from a code reuse library within the organization or, as is most likely, from an outside vendor who specializes in the development of specific types of software components. Components produced by vendor organizations are known as commercial off-the-shelf, or COTS, components.
10. Define usage profiles and Certification.
Usage profiles are characterizations of the population of intended uses of the software in
its intended environment. Certification refers to third party assurance that a product, process, or
service meets a specific set of requirements.
11. Write the application scope of adequacy criteria?

Helping testers to select properties of a program to focus on during test.


Helping testers to select a test data set for a program based on the selected properties.
Supporting testers with the development of quantitative objectives for testing
Indicating to testers whether or not testing can be stopped for that program.

12. What are the factors affecting less than 100% degree of coverage?

The nature of the unit


Some statements/branches may not be reachable.
The unit may be simple, and not mission- or safety-critical, so complete coverage is thought to be unnecessary.
The lack of resources
The time set aside for testing is not adequate to achieve complete coverage for all of the units.
There is a lack of tools to support complete coverage.
Other project-related issues such as timing, scheduling, and marketing constraints.

13. What are the basic primes for all structured program.

Sequential ( e.g., Assignment statements)


Condition (e.g., if/then/else statements)
Iteration (e.g., while, for loops)
Figure: graphical representation of the three primes (a sequence of nodes; a condition node with true/false branches; an iteration node with a loop-back edge).
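A small illustration containing all three primes (a hypothetical code fragment):

# Hypothetical fragment containing all three structured-programming primes.
total = 0                      # sequence: an assignment statement
for value in [4, 7, 1]:        # iteration: a loop
    if value % 2 == 0:         # condition: an if/then/else decision
        total += value
    else:
        total -= value
print(total)                   # 4 - 7 - 1 = -4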

14. Define path.


A path is a sequence of control flow nodes usually beginning from the entry node of a
graph through to the exit node.
15. Write the formula for cyclomatic complexity.
The complexity value is usually calculated from the control flow graph (G) by the formula
V(G) = E - N + 2
where E is the number of edges in the control flow graph and N is the number of nodes (a worked example follows).
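A worked example of the formula on a hypothetical control flow graph for a single if/else:

# Hypothetical control flow graph for code with one if/else decision.
# Nodes: entry, decision, then-branch, else-branch, exit.
edges = [
    ("entry", "decision"),
    ("decision", "then"),
    ("decision", "else"),
    ("then", "exit"),
    ("else", "exit"),
]
nodes = {n for edge in edges for n in edge}

E, N = len(edges), len(nodes)
V = E - N + 2          # V(G) = E - N + 2
print(V)               # 5 - 5 + 2 = 2, i.e. two independent paths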
16. List the various iterations of Loop testing (a sketch in code follows the list).
Zero iteration of the loop
One iteration of the loop
Two iterations of the loop
K iterations of the loop where k<n
n-1 iterations of the loop
n+1 iterations of the loop
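A sketch of these iteration counts applied to a hypothetical loop whose bound n ranges over a five-element list:

def sum_first(values, n):
    # Hypothetical unit under test: sum the first n elements of a list.
    total = 0
    for i in range(n):
        total += values[i]
    return total

data = [10, 20, 30, 40, 50]
n_max = len(data)

# Loop testing: drive the loop through the listed iteration counts
# (zero, one, two, k < n, n-1, and the full n iterations).
for n in (0, 1, 2, n_max - 1, n_max):
    assert sum_first(data, n) == sum(data[:n])

# n+1 iterations should expose the missing bounds check.
try:
    sum_first(data, n_max + 1)
    print("no defect revealed")
except IndexError:
    print("defect revealed: loop reads past the end of the list")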
17. Define test set.
A test set T is said to be mutation adequate for program P provided that for every inequivalent mutant Pi of P there is an element t in T such that Pi[t] is not equal to P[t] (see the illustration below).
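An illustration of this definition, using a hypothetical one-operator mutant:

def original(x, y):
    return x + y

def mutant(x, y):
    # A simple mutant: the '+' operator replaced by '-'.
    return x - y

tests = [(2, 3), (0, 0), (5, 0)]

# The test set handles this mutant adequately if some test input t makes
# mutant(t) differ from original(t), i.e., the mutant is "killed".
killed = any(original(x, y) != mutant(x, y) for x, y in tests)
print("mutant killed:", killed)  # True: (2, 3) gives 5 vs -1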
18. What are the errors uncovered by black box testing?
Incorrect or missing functions
Interface errors
Errors in data structures
Performance errors
Initialization or termination error

20. List the two basic Testing strategies.


Black box testing
White box testing.
PART-B
UNIT 2 (TEST CASE DESIGN)
1. Smart Tester.
Reveal defects
Evaluate quality
Finite no .of testcase
2. Test case design strategies.[Nov/Dec-2009]
Positive consequences
Two strategies
Whitebox (clear or glass box)
Black box(Functional or specification)
Figure: The two basic testing strategies.
3. Types of black box testing
Random testing
Randomly select the input.
Three conditions.
Equivalence class partitioning
Adv of Equivalence class partitioning
List of conditions.
Figure: A specification of a square root function
Example of equivalence class reporting table
Boundary value analysis
List the conditions
Figure: Boundaries of an equivalence partition
Example of Boundary value analysis.
4. Other Black box test design Approaches
Cause and Effect graphing
Steps of testcases with a Cause and Effect graph
Figure: Samples of Cause and Effect graph notations.
The input conditions
The output conditions
The rules or relationships.
Figure : Cause and Effect graph for the character search example.
Table: Decision table for character search example.
State Transition testing
State
Finite state machine
Figure: Simple state transition graph
Table: A state table for the machine
Error Guessing
Past experience
5. Black Box Testing and COTS (Commercial Off-the-shelf) components.
Usage Profiles

COTS
Certification
6. Types of white box testing[Nov/Dec-2009]
Coverage and control flow graph
Three basic primes
Sequential
Condition
Iteration
Coverage code logic
Figure: Code sample with branch and loop.
Figure: A control flow graph representation for the code.
Table: A test case for the code ,that satisfies the decision coverage criterion.
Table: Test cases for simple decision coverage
Table: Test cases for condition coverage
Table: Test cases for decision condition coverage.
Path Testing
Path
Cyclomatic complexity formula.
7. Additional white box test design approaches. .[Nov/Dec-2012]
Dataflow and white box test design
Variable.
Figure: sample code with data flow information
Loop Testing
Mutation Testing
The component programmer hypothesis
The copying effect
8. Evaluating Test adequacy Criteria
Axioms Set of assumptions
Applicability Property
Non exhaustive applicability property
Monotonicity Property
Inadequate Empty set
Antiextensionality Property
General multiple change Property
Anti decomposition Property
Renaming Property
Complexity Property
Statement Coverage Property
UNIT: 3 (LEVELS OF TESTING )
1. List the levels of Testing or Phases of testing.
Unit Test
Integration Test
System Test
Acceptance Test

2. Define Unit Test and characterized the unit test.


At a unit test a single component is tested. A unit is the smallest possible testable
software component.
It can be characterized in several ways:
A unit in a typical procedure-oriented software system performs a single cohesive function.
It can be compiled separately.
It contains code that can fit on a single page or a screen.
3. List the phases of unit test planning.
Unit test planning having set of development phases.
Phase1: Describe unit test approach and risks.
Phase 2: Identify unit features to be tested.
Phase 3: Add levels of detail to the plan.
4. List the work of test planner.
Identifies test risks.
Describes techniques to be used for designing the test cases for the units.
Describe techniques to be used for data validation and recording of test results.
Describe the requirement for test harness and other software that interfaces with
the unit to be tested, for ex, any special objects needed for testing object oriented.
5. Define integration Test.
At the integration level several components are tested as a group and the tester
investigates component interactions.
6. Define System test.
When integration test are completed a software system has been assembled and its
major subsystems have been tested. At this point the developers /testers begin to
test it as a whole. System test planning should begin at the requirements phase.
7. Define Alpha and Beta Test.[Nov/Dec-2009]
Alpha test: in-house users/developers use the software and note the problems.
Beta test: users use the software under real-world conditions and report the defects to the developing organization.
8. What are the approaches are used to develop the software?
There are two major approaches to software development:
Bottom-Up
Top-Down
These approaches are supported by two major types of programming languages:
Procedure-oriented
Object-oriented
9. List the issues of class testing.
Issue1: Adequately Testing classes
Issue2: Observation of object states and state changes.
Issue3: The retesting of classes-I
Issue4: The retesting of classes-II

10. Define test Harness. [Nov/Dec-2012]


The auxiliary code developed into support testing of units and components is
called a test harness. The harness consists of drivers that call the target code and
stubs that represent modules it calls.
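A minimal sketch of a harness, assuming a hypothetical unit report() whose collaborator is replaced by a stub and which is exercised by a driver:

# Hypothetical unit under test: formats a report using a lower-level module.
def report(fetch_total):
    return "total = %d" % fetch_total()

# Stub: stands in for the module that report() calls, returning canned data.
def fetch_total_stub():
    return 42

# Driver: calls the target code with test inputs and checks the result.
def driver():
    assert report(fetch_total_stub) == "total = 42"
    print("unit test passed")

driver()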
11. Define Test incident report.
The tester must determine from the test whether the unit has passed or failed the
test. If the test is failed, the nature of the problem should be recorded in what is
sometimes called a test incident report.
12. Define Summary report.
The causes of the failure should be recorded in the test summary report, which is
the summary of testing activities for all the units covered by the unit test plan.
13. Goals of Integration test.
To detects defects that occur on the interface of the units.
To assemble the individual units into working subsystems and finally a completed
system that ready for system test.
14. What are the Integration strategies?
Top_ Down: In this strategy integration of the module begins with testing the
upper level modules.
Bottom_ Up: In this strategy integration of the module begins with testing the
lowest level modules.
15. What is Cluster?
A cluster consists of classes that are related and they may work together to
support a required functionality for the complete system.
16. List the different types of system testing.
Functional testing
Performance testing
Stress testing
Configuration testing
Security testing
Recovery testing
The other types of system Testing are,
Reliability & Usability testing.
17. Define load generator and Load.
An important tool for implementing system tests is a load generator. A load generator is essential for testing quality requirements such as performance and stress.
A load is a series of inputs that simulates a group of transactions. A transaction is a unit of work seen from the system user's view. A transaction consists of a set of operations that may be performed by a person, software system, or device that is outside the system.
18. Define functional Testing.
Functional tests at the system level are used to ensure that the behavior of the system adheres to the requirements specification.
19. What are the two major requirements in the Performance testing.
Functional Requirement: Users describe what functions the software should perform. We test for compliance of the requirements at the system level with the functional-based system tests.
Quality Requirement: They are nonfunctional in nature but describe quality
levels expected for the software.
20. Define stress Testing.
Stress testing occurs when a system is tested with a load that causes it to allocate its resources in maximum amounts. It is important because it can reveal defects in real-time and other types of systems.
21. Define Breaking the System.
The goal of a stress test is to try to break the system, i.e., to find the circumstances under which it will crash. This is sometimes called "breaking the system".
22. What are the steps for top down integration?
Main control module is used as a test driver and stubs are substituted for all
components directly subordinate to the main module.
Depending on integration approach (Depth or breadth first) subordinate stubs are
replaced one at a time with actual components.
Tests are conducted as each component is integrated.
On completion of each set of tests, another stub is replaced with the real component.
Regression testing may be conducted to ensure that new errors have not been
introduced.
23. What is meant by regression testing?
Regression testing is used to check for defects propagated to other modules by
changes made to existing program. Thus, regression testing is used to reduce the side
effects of the changes.
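A sketch of regression testing as the rerun of a saved suite after a change, using a hypothetical discount() unit:

import unittest

def discount(price, rate):
    # Version 2 of the unit: a change that must not break old behaviour.
    return round(price * (1 - rate), 2)

class RegressionSuite(unittest.TestCase):
    # Tests written against version 1, kept and rerun after every change.
    def test_basic(self):
        self.assertEqual(discount(100.0, 0.1), 90.0)

    def test_zero_rate(self):
        self.assertEqual(discount(50.0, 0.0), 50.0)

if __name__ == "__main__":
    unittest.main()  # rerunning the saved suite is the regression test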
PART-B
UNIT- 3 (LEVELS OF TESTING)
1. Need for levels of testing.
Unit Test
Integration Test
System Test
Acceptance Test
Fig: Levels of Testing
Alpha And Beta Test
2. Levels of testing and software development paradigm
Fig: Levels of testing
Two Approaches
Bottom_Up
Top_Down
Two types of Language
Procedure Oriented
Object Oriented
3. Unit Test
Functions, procedures, classes and methods as units
Fig: Some components suitable for unit test
Unit Test: Need for preparation

Planning
Both black box and White box
Reviewer
Several Tasks
4. Unit Test Planning
Phase I: Describe unit test approach and Risks
Phase II: Identify unit features to be tested
Phase III: Add levels of detail to the planning
5. The class as testable unit
Issue1: Adequately Testing classes
Issue2: Observation of object states and state changes.
Issue3: The retesting of classes-I
Issue4: The retesting of classes-II
Fig: Sample stack class with multiple methods
Fig: Sample Shape class
6. Test harness
The auxiliary code developed into support testing of units and components is
called a test harness. The harness consists of drivers that call the target code
and stubs that represent modules it calls.
Fig: The test Harness
Driver
Stub
7. Integration Test [Nov/Dec-2009]
Goals
Integration strategies for procedures and functions
Top down
Bottom up
Fig: Simple Structure chart for integration test example
Integration strategies for classes
Fig: A generic class cluster
8. System test: Different Types
Functional testing
Performance testing
Stress testing
Configuration testing
Security testing
Recovery testing
The other types of system Testing are,
Reliability & Usability testing.
Fig: Types of System Tests
Fig: Example of special resources needed for a performance test
UNIT 4(TEST MANAGEMENT)
1) Write the different types of goals?

i. Business goal: to increase market share 10% in the next 2 years in the area of financial software.
ii. Technical goal: to reduce defects by 2% per year over the next 3 years.
iii. Business/technical goal: to reduce hotline calls by 5% over the next 2 years.
iv. Political goal: to increase the number of women and minorities in high management positions by 15% in the next 3 years.

2) Define Goal and Policy


A goal can be described as (i) a statement of intent or (ii) a statement of an accomplishment that an individual or an organization wants to achieve.
A policy can be defined as a high-level statement of principle or course of action that is used to govern a set of activities in an organization.
3) Define Milestones.
Milestones are tangible events that are expected to occur at a certain time in the project's lifetime. Managers use them to determine project status.
4) List the Test plan components.[Nov/Dec-2009]
Test plan identifier
Introduction
Items to be tested
Features to be tested
Approach
Pass/fail criteria
Suspension and resumption criteria
Test deliverables
Testing Tasks
Test environment
Responsibilities
Staffing and training needs
Scheduling
Risks and contingencies
Testing costs
Approvals.
5) Draw a hierarchy of test plans.
Software quality assurance (V&V) plan
  - Review plan: inspections and walkthroughs
  - Master test plan
      - Unit test plan
      - Integration test plan
      - System test plan
      - Acceptance test plan

6) Define a Work Breakdown Structure.(WBS)


A Work Breakdown Structure (WBS) is a hierarchical or treelike representation
of all the tasks that are required to complete a project.
7) Write the approaches to test cost Estimation?
The COCOMO model and heuristics
Use of test cost drivers
Test tasks
Tester/developer ratios
Expert judgment
8) Write short notes on Cost driver.
A cost driver can be described as a process or product factor that has an impact on overall project costs. Cost drivers for projects include:
Product attributes, such as the required level of reliability
Hardware attributes, such as memory constraints
Personnel attributes, such as experience level
Project attributes, such as tools and methods
9) Write the WBS elements for testing.
1. Project startup
2. Management coordination
3. Tool selection
4. Test planning
5. Test design
6. Test development
7. Test execution
8. Test measurement, and monitoring
9. Test analysis and reporting
10. Test process improvement
10) What is the function of Test Item Transmittal Report or Locating Test Items
Suppose a tester is ready to run tests on the date described in the test plan. The tester needs to be able to locate the item and have knowledge of its current status. This is the function of the Test Item Transmittal Report. Each Test Item Transmittal Report has a unique identifier.
11) What is the information present in the Test Item Transmittal Report or Locating
Test Items
1) Version/revision number of the item
2) Location of the item
3) Person responsible for the item (the developer)
4) References to item documentation and the test plan it is related to
5) Status of the item
6) Approvals: space for signatures of staff who approve the transmittal
12) Define Test incident Report
The tester should record in a test incident report (sometimes called a problem report) any event that occurs during the execution of the tests that is unexpected, unexplainable, and that requires a follow-up investigation.
13) Define Test Log.
The Test log should be prepared by the person executing the tests. It is a diary of
the events that take place during the test. It supports the concept of a test as a
repeatable experiment.
14) What are the Three critical groups in testing planning and test plan policy ?
Managers: task forces, policies, standards, planning, resource allocation, support for education and training, interact with users/clients.
Developers/Testers: apply black box and white box methods, test at all levels, assist with test planning, participate in task forces.
Users/Clients: specify requirements clearly, support with operational profiles, participate in acceptance test planning.

15) Define Procedure.
A procedure in general is a sequence of steps required to carry out a specific task.
16) What are the skills needed by a test specialist?
Personal and managerial skills: organizational and planning skills, the ability to work with others, resolve conflicts, mentor and train others, written/oral communication skills, and creative thinking.
Technical skills: general software engineering principles and practices, understanding of testing principles and practices, the ability to plan, design, and execute test cases, and knowledge of networks, databases, and operating systems.


17) Write the test team hierarchy.
Test Manager
Test Leader
Test Engineer
Junior Test Engineer

18) Define Plan.
A plan is a document that provides a framework or approach for achieving a set of goals.

PART-B
UNIT 4 (TEST MANAGEMENT)
1. Testing and Debugging goals and Policy
Debugging goal
Debugging policy
Testing policy: Organization X
Debugging policy: Organization X
2. Test planning
Planning
Milestones
Overall test objectives
What to test (scope of the tests)
Who will test?
How to test?
When to test?
When to stop testing?
3. Test Plan Components [Nov/Dec-2012]
Test plan identifier
Introduction
Items to be tested
Features to be tested
Approach
Pass/fail criteria
Suspension and resumption criteria
Test deliverables
Testing tasks
Test environment
Responsibilities
Staffing and training needs
Scheduling
Risks and contingencies
Testing costs
Approvals
4. Test Plan Attachments
Test design specifications
Test case specifications
Test procedure specifications
5. Reporting Test Results
Test log
Test log identifier
Description
Activity and event entries
Test incident report
Test incident report identifier
Summary
Impact
Test summary report
6. The role of the 3 critical groups [Nov/Dec-2009]
1. Managers
Task forces, policies, standards
Planning
Resource allocation
Support for education and training
Interact with users
2. Developers/ testers
Apply black and white box methods
Assist with test planning
Test at all levels
Train and mentor
Participate in task forces
Interact with users
3. Users/clients
Specify requirements clearly
Participate in usability test
UNIT: 5 (TEST AUTOMATION)
1. Define Project monitoring or tracking.[Nov/Dec-2012]
Project monitoring refers to the activities and tasks managers engage in to periodically check the status of each project. Reports are prepared that compare the actual work done to the work that was planned.
2. Define Project Controlling. [Nov/Dec-2012]
It consists of developing and applying a set of corrective actions to get a project on track when monitoring shows a deviation from what was planned.
3. Define Milestone.
Milestones are tangible events that are expected to occur at a certain time in the project's lifetime. Managers use them to determine project status.
4. Define SCM (Software Configuration management).[Nov/Dec-2012]


Software Configuration Management is a set of activities carried out for identifying,


organizing and controlling changes throughout the lifecycle of computer software.
5. Define Base line.
Baselines are formally reviewed and agreed-upon versions of software artifacts, from which all changes are measured. They serve as the basis for further development and can be changed only through formal change procedures.
6. Differentiate version control and change control.
Version Control combines procedures and tools to manage different versions of
configuration objects that are created during software process.
Change control is a set of procedures to evaluate the need of change and apply the
changes requested by the user in a controlled manner.
7. What is Testing?
Testing is generally described as a group of procedures carried out to evaluate some aspect of a piece of software. It is used for revealing defects in software and for evaluating the degree of quality.
8. Define Review.
Review is a group meeting whose purpose is to evaluate a software artifact or a
set of software artifacts.
9.

What are the goals of Reviewers?


Identify problem components or components in the software artifact that need
improvement.
Identify components of the software artifact that do not need improvement.
Identify specific errors or defects in the software artifact.
Ensure that the artifact conforms to organizational standards.

10. What are the benefits of a Review program?


Higher quality software
Increased productivity
Increased awareness of quality issues
Reduced maintenance costs
Higher customer satisfaction
11. What are the Various types of Reviews?[Nov/Dec-2009]
Inspections
WalkThroughs
12. What is Inspections?

It is a type of review that is formal in nature and requires pre-review preparation on the part of the review team. The inspection leader prepares the checklist of items that serves as the agenda for the review.
13. What is WalkThroughs?
It is a type of technical review where the producer of the reviewed material serves as the review leader and actually guides the progression of the review. Walkthroughs have traditionally been applied to design and code.
14. List out the members present in the Review Team.
SQA(Software Quality Assurance) staff
Testers
Developers
Users /Clients.
Specialists.
15. List the components of review plans.
Review Goals
Items being reviewed
Preconditions for the review.
Roles, team size, participants.
Training requirements.
Review steps.
Time requirements
PART-B
UNIT 5: TEST AUTOMATION
1. Measurements and milestones for monitoring and controlling
Measurements for monitoring testing status
Coverage measures
Test case development
Test execution
Test harness development
Measurements to monitor tester productivity
Measurements for monitoring testing costs
Measurements for monitoring errors, faults, and failures
Monitoring test effectiveness
Criteria for test completion
All the planned tests that were developed have been executed and passed
All specified coverage goals have been met
The detection of a specific number of defects has been accomplished
The rates of defect detection for a certain time period have fallen below a
specified level
Fault seeding ratios are favorable
2. Software configuration management [Nov/Dec-2009]
Identification of the configuration items
Change control
Configuration status reporting
Configuration audits
3. Types of reviews
Inspections as a type of technical review
Inspection process
Initiation
Preparation
Inspection meeting
Reporting results
Rework and follow up
Walkthroughs as a type of technical review
4. Components of review plans
Review goals
Preconditions and items to be reviewed
Roles, participants, team size, and time requirements
Review procedures
Review training
Review checklists
Requirements reviews
Design reviews
Code reviews
Test plan reviews


B.E./B.Tech. DEGREE EXAMINATIONS
INFORMATION TECHNOLOGY
SEVENTH SEMESTER (Nov/Dec 2012)
IT2032 - SOFTWARE TESTING
(REGULATION 2008)
Time: 3 hours                Maximum: 100 marks
Answer all the questions
Part A - (10 x 2 = 20 marks)

1. What is a test case? What information does it contain?
2. Differentiate verification and validation.
3. Write the two basic testing strategies used to design test cases.
4. State the need for code functional testing in test case design.
5. Write a workable definition for a software unit and characterize it.
6. Define test harness.
7. What is the purpose of the Test Item Transmittal Report and the test log?
8. Write the various approaches to test cost estimation.
9. Differentiate between project monitoring and project controlling.
10. What is software configuration management?
Part B - (5 x 16 = 80 marks)
11.a.i. Explain the various software testing principles.
ii. Define correctness, reliability, integrity, interoperability.
Or
b.i. Define defect and write the various origins of defects.
ii. Explain the concepts of defects with the coin problem.
12.a. Explain the concepts of equivalence class partitioning and boundary value analysis.
Or
b. Explain the various additional white box test design approaches.
13.a.i. Describe the activities or tasks and responsibilities of the developer or tester in support of multilevel testing.
ii. List the tasks that must be performed by the developer or tester during the preparation for unit testing.
Or
b.i. Write the importance of security testing and the consequences of security breaches; also write the various areas which have to be focused on during security testing.
ii. State the need for integration testing in procedural code.
14.a. Explain the components of a test plan.
Or
b.i. Write any four IEEE-recommended test-related documents in detail.
ii. Write the various technical skills needed by a test specialist.
15.a. Explain the five stop-test criteria that are based on a quantitative approach.
Or
b. Narrate the metrics or parameters to be considered for evaluating software quality.
B.E./B.Tech. DEGREE EXAMINATION
Eighth Semester
Information Technology
SOFTWARE TESTING - Nov/Dec 2009
(Regulation 2008)
Time: Three hours                Maximum: 100 marks

Part A
1. Give the role of process in software quality.
2. List the people who are associated with testing.
3. Identify the test case design strategies.
4. What is a control flow graph?
5. What is the difference between alpha and beta testing?
6. What are the various skills needed by a test specialist?
7. List out the test plan components.
8. State the need for test plan components.
9. Name any two test metrics.
10. What is the purpose of critical design review?
Part B
11.a.i. Explain the steps in developing a defect repository.
ii. Discuss the defect classification in detail.
Or
b.i. Discuss the origins of defects and explain the defect repository.
ii. Explain the principles of software testing and describe the role of the tester in a software development organization.
12.a.i. Explain briefly about paths and cyclomatic complexity.
ii. Explain the test factors that must be followed to design a customized test strategy.
Or
b. Describe the role of paths in white box testing and explain any two white box design approaches.
13.a. Explain integration testing and its design and planning.
Or
b. What is regression testing? Give its types. When is it necessary to perform regression testing and how is it done?
14.a.i. State and discuss the various stages that a test plan will consist of.
ii. Explain the role of testing.
Or
b.i. Explain the challenges and issues faced in testing services organizations. Also write how we can eliminate these challenges.
ii. How can we build a test group?
15.a.i. Test metrics. ii. Testing tools.
Or
b.i. Write about software configuration management.
ii. Write the components of review plans.
MODEL QUESTION PAPER
B.E./B.Tech. DEGREE EXAMINATIONS
INFORMATION TECHNOLOGY
SEVENTH SEMESTER
IT2032 - SOFTWARE TESTING
(REGULATION 2008)
Time: 3 hours                Maximum: 100 marks

Answer all the questions
Part A - (10 x 2 = 20 marks)
1. List the four components the software development process is comprised of.
2. Distinguish between fault and failure.
3. What is the black box approach?
4. What is a control flow graph?
5. What is unit testing?
6. Why is it so important to design a test harness for reusability?
7. List any two points of importance of a testing plan.
8. Write down the skills needed by a technical-level tester.
9. What should be included in a milestone report for testing?
10. Define the test measurement process.
Part B - (5 x 16 = 80 marks)
11. (a)(i) Why is it important to meticulously inspect test results? Give an example.
(ii) Discuss the drawbacks in case you fail to inspect.
(or)
11. (b)(i) Why is it necessary to develop test cases for both valid and invalid input conditions?
(ii) How important is documentation for a product? How will you test requirement and design documents?
12. (a) Develop black-box test cases using equivalence class partitioning and boundary value analysis to test a module for an ATM system.
(or)
12. (b) Imagine yourself as a developer of a flight control system. Describe any three test adequacy criteria you would consider applying to develop test cases for the flight control system.
13. (a) List and explain the types of system test.
(or)
13. (b) Develop a use case to describe a user's purchase of a laptop with a credit card from an online vendor using web-based software. With the use case, design a set of tests you would use during system test (general).
14. (a) Why is a testing plan important for developing a repeatable and managed testing process? Give an example.
(or)
14. (b) What role do users/clients play in the development of a test plan for a project? Should they be present at any of the test plan reviews? Justify your answer.
15. (a) If you are developing a patient record system for a health care centre, which of the stop-test criteria will be most appropriate for this system?
(or)
15. (b) What is the role of the tester in supporting, monitoring and controlling of testing?
REFERENCES:
WEBSITES:
1. www.tutorialspoint.com/software_testing/software_testing_quick_guide.htm
2. https://www.vidyarthiplus.com/vp/thread-9727.html
3. pass-in-annauniversityexams.blogspot.com/.../anna-university-IT2032SoftwareTesting
4. www.rejinpaul.com
5. www.vidyarthiplus.in/2013/05/it2032-software-testing-important.html
BOOKS:
1. Ilene Burnstein, Practical Software Testing, Springer International Edition, 2003.
2. Edward Kit, Software Testing in the Real World: Improving the Process, Pearson Education, 1995.
3. Boris Beizer, Software Testing Techniques, 2nd Edition, Van Nostrand Reinhold, New York, 1990.
4. Aditya P. Mathur, Foundations of Software Testing: Fundamental Algorithms and Techniques, Dorling Kindersley (India) Pvt. Ltd., Pearson Education, 2008.
5. Elfriede Dustin, Effective Software Testing, First Edition, Pearson Education, 2003.
6. Renu Rajani and Pradeep Oak, Software Testing: Effective Methods, Tools and Techniques, Tata McGraw Hill, 2004.

