Slides 1 Introduction To File Structures

Introduction to File Structures
Objectives At the end of the lesson, the student should be able to:
Explain the concepts behind physical file organization Differentiate logical and physical files Determine the advantages of using primary over secondary storage devices Identify the characteristics, capacities and access costs of different storage devices Explain the concepts behind file organization and file access Identify the different indexing mechanisms Use file compression techniques to improve space utilization and file access performance
CS 165 Database Systems
Lesson 1: Introduction to File Structures
Issues in Physical File Organization

Slow disk access Volative primary storage Virtually infinite secondary storage Cheaper secondary storage Dynamic nature of data Need for fast searching and retrieval
Ideally, we would like to get information with one access to the disk. If it is not possible, even with just as few accesses as possible. We also want to have file structures with information groups, or what we call records, to get a bulk of information in just one access to the disk. Solutions Made

Sequential access Indexed access Binary trees Tree-structured file organization: B-tree, B*-tree and B+-tree Direct access design in the form of hashing
Logical vs. Physical Files physical file a file in a disk, which means it exists physically logical file a file as viewed by the user Example: assign(input_file,input.txt); input.txt physical
input_file logical, Secondary Storage Devices
hold files that are not currently being used used for mid-term to long-term storage Types of Storage Devices
Direct Access Storage Devices (DASDs) also known as random access devices magnetic disks such as hard disks, floppy disks and optical disks Serial Devices use media such as magnetic tapes that permit only serial access.
Disks Use magnetism to store the data on a magnetic surface Advantages: high storage capacity, reliable, gives direct access to data A drive spins the disk very quickly underneath a read/write head Organization of Disks All magnetic disks are similarly formatted, or divided into areas, called tracks, sectors and cylinders.

Platters Tracks Sectors Disk pack Cylinder
Seeking Moving the r/w head to find the right location on the disk Usually the slowest part of reading information
Disk Capacity
Amount of data on a track depends upon how densely bits can be stored Low density High density To compute track, cylinder and drive capacity: CT = Track Capacity = #sectors/track x #bytes/sector CC = Cylinder Capacity = #tracks/cylinder x CT CD = Drive Capacity = #cylinders/drive x CC Example: What is the capacity of the drive with the following characteristics? #bytes/sector #sectors/track #tracks/cylinder #cylinders = 512 = 40 = 11 = 1331
Cost of a Disk Access (Latency)

Factors affecting disk access time: seek time [ts], rotational delay [tr], block transfer time [tbt]
Step 1. seek move the head to proper track
Measured as: (in ms) seek time
2. rotate rotational delay rotate disk under the head to the correct sector 3. settle head lowers to disk; wait for vibrations from moving to stop (actually touches only on floppies) 4. data transfer copy data to main memory settling time
data transfer time
Seek Time
Time it takes to move the arm to the correct cylinder Depends on the physical characteristics of the disk drive as well as on the #cylinders. Assumption: all cylinders are equally likely to be sources or destinations.
minimum seek time time it takes to move the access arm from a track to its adjacent track. maximum seek time time it takes to move the access arm from the outermost to the innermost track. average seek time the average of the minimum and maximum seek times. Rotational Delay
Time it takes for the head to reach the right place on the track. Assumption: Correct block can be recognized by some flag or block identifier at the beginning.
rotational time (tr ) time needed for a disk pack to complete one whole revolution.
Average latency time: [tr/2]= time it takes the disk drive to make revolution. tr/2 = (1/2) (60 x 1000) / rpm where rpm is revolutions per minute or if tr is given: tr/2 = tr / 2
Hard disks usually rotates at about 7,200 rpm ~ 1 rev. per 8.33 ms so the average rotational delay is 4.167 msec. For floppy disk: 360 rpm, tr/2 = 83.3 ms.
Block Transfer Time
Time it takes for the head to pass over a block The bytes of data can be transferred from disk to RAM and vice versa. Depends on rotational speed, recording density (#bytes/track), and #bytes in a block.
8 Lesson 1: Introduction to File Structures
Let tbt = block transfer time rbt = block transfer rate B = #bytes transferred tbt = B / rbt Example: Assume a block size of 1,000 bytes and that the blocks are stored randomly. The disk drive has the following characteristics: ts/2 = 30 ms tr = 16.67 ms tr/2 = 8.3 ms rbt = 806 KB/sec = 825,344 bytes/sec
What is the average access time per block? Average access time per block = seek + rotational latency + transfer = ts/2 + tr/2 + tbt =30 ms/block + 8.3 ms/block + 1,000bytes/block 825,344bytes/sec = (30 + 8.3 + 1.21) ms /block = 39.51 ms/block
Magnetic Tape
Sequential access Logical position of a byte within a file corresponds directly to its physical location relative to the start of the file. Good medium for archival storage, for transportation of data Tape drives come in many shapes, sizes and speeds. Performance of each drive can be measured in terms of the following quantities: Tape (or recording) density number of characters or bytes of data that can be stored per inch (bpi) Tape Speed commonly 30 to 200 inches per second (ips)
Disk vs. Tape DISK Random Access Use to store files in shorter terms Generally serves many processes Expensive Used as main secondary storage
CS 165 Database Systems 10
TAPE Sequential Access Long-term storage of files Dedicated to one process Less expensive Considered to be a tertiary storage
RAID: Improving File Access Performance by Parallel Processing
Redundant Array of Inexpensive Disks A set of physical disk drives that appear to the database users and programs as if they form one large logical storage unit. Parallel database processing speeds up writes and reads. This can be accomplished through using a RAID. RAID does not change the logical or physical structure of application programs or database queries.
11
RAID Levels
12
13
Fundamental File Structure Concepts Data is stored in a physical storage for later retrieval. In order to organize the data for easy retrieval, physical file organization and access mechanism have to be determined. Field and Record Organization Refers to the physical structure of records to store Fixed or variable in length A Stream File PROGRAM: writstrm get output file name and open it with the logical name OUTPUT get LAST name as input while (LAST name has a length > 0) get FIRST name, ADDRESS, CITY, STATE, and ZIP as input write LAST to the file OUTPUT write FIRST to the file OUTPUT write ADDRESS to the file OUTPUT Writes a stream of bytes write CITY to the file OUTPUT containing no added information: write STATE to the file OUTPUT hard to get the data back since write ZIP to the file OUTPUT there is no integrity of the get LAST name as input fundamental organizational units endwhile of the input data. close OUTPUT end PROGRAM
CS 165 Database Systems 14 Lesson 1: Introduction to File Structures
Field Structures Field A conceptual tool Smallest logically meaningful unit of information in a file A subdivision of a record containing a single attribute of the entity the record describes Methods of Field Organization 1. Fixed-length fields
Disadvantage: wasted space, insufficient space
2.Length indicator at the beginning of each field 3.Delimiters to separate fields 4.keyword = value expression to identify fields

used in combination with delimiters advantage: self-descriptive disadvantage: wasted space
15
Record Structures Records Also a conceptual tool Set of fields that belong together when the file is viewed in terms of a higher level of organization Methods of Record Organization 1. Fixed-length records
Does not imply that the sizes or number of fields in the record must be fixed
2. Fixed number of fields 3. Length indicator at the start of each record 4. Use of index to keep track of addresses
Index in another file or at the file's header.
5. Delimiter at the end of each record

File Organization The following are the main types of file organization: Sequential Relative Sequential File Organization Oldest type of file organization since during the 1950s and 1960s, the foundation of many information systems was sequential processing Sequential Files Files that are read from beginning to end Physical Characteristics

Physical order of the records is the same as the logical order Logical representation: (record 1) (record 2) (record 3) (record n)
Two Types of Sequential Files

Unordered pile files Sorted sequential files

Relative File Organization
There exists a predictable relationship between the key used to identify a record and that records absolute address on an external file Allows access to a record directly given only the key, regardless of the position of the record in the file Characterized as providing random access because the logical organization of the file need not correspond to its physical organization Simplest relative file organization: key value corresponds directly to the physical location of the record in the file Useful for dense keys, i.e., values of consecutive keys differ only by one. If the key collection is not dense, it might result to wasted space. This can be solved by mapping the large range of non-dense key values into smaller range of record positions in the file. The key is no longer the address of the record in the file. Hashing or indexing may be used.
18
File Access Refers to the manner data is retrieved File Access Methods Sequential Access O(n) expensive if done directly to the disk Relative Access
Addresses of records can be obtained directly from a key Uses an indexing technique that allows a user to locate a record in a file with as few accesses as possible, ideally with just one access The indexing scheme could be one of the following: Direct addressing Binary search tree B-trees Multiple-key indexing Hashing
If access is sequential, file organization used will not have significant effect on the access cost. However, relative access entails fixing the size of records or using an index.
Indexing Mechanisms Index Consists of keys and reference fields Works by indirection Used to provide multiple access path to a file Gives keyed access to variable-length record files Key Identifies a record based on the records contents not on the sequence of the records Primary keys Types of Indexes Keys that uniquely identify a single record Primary keys should be unchanging

Secondary keys Used to overcome the shortcomings of the primary key Used to access records according to data content
Primary Index on Unordered Files Primary Index on Ordered Files Clustering Index on Ordered Files Secondary Index
20
Primary Index on Unordered Files

The index is a simple arrays of structures that contain the keys and reference fields Allows binary search The index provides the order to unordered physical records Physical files are entry-sequenced
21
Primary Index on Ordered Files

Physical records are ordered based on the primary key The index is ordered but only one index record for each block Reduces the index requirement, enabling binary search over the values without having to read the entire file to perform binary search
22
Clustering Index on Ordered Files
23
Secondary Index

Used to facilitate faster access to commonly queried non-primary fields Typically point to the primary index Advantage: Record deletion and update cause less work Disadvantage: Less efficient
Retrieval Using Combination of Secondary Keys

use boolean AND operation, specifying the intersection of two subsets of the data file
Hashing
Used to obtain addresses from keys. A hash function is used to map a range of key values into a smaller range of relative addresses. Hashing is like indexing in that it involves associating a key with a relative record address. Unlike indexing, with hashing, there is no obvious connection between the key and the address generated since the function "randomly selects" a relative address for a specific key value, without regard to the physical sequence of the records in the file. Thus it is also referred to as randomizing scheme. Problem in hashing: Presence of collision. Collision happens when two or more input keys generate the same address when the same hash function is used.
Solution: Collision resolution technique or use of perfect hash functions.
25
Common Hashing Techniques Hash functions which are able to generate random addresses (not necessarily unique) Prime Number Division Method Digit Extraction Folding Mid-Square Radix Conversion Advantages Easier to implement Disadvantages Records may collide Records may be unevenly distributed. In the worst case, all records hash into a single address
Perfect Hashing Techniques Hash functions which are able to generate a perfectly uniform distribution of addresses

Quotient Reduction Remainder Reduction Associated Value Reciprocal Hashing
Provide 1-1 mapping of keys into addresses (no collisions)
Requires knowledge of the set of key values in order to generate a perfectly uniform distribution of keys Quite complicated to use Has a lot of pre-requisites It is hard to find a function that produces no collision
26
Other Indexing Mechanisms
Binary Search Trees: BST, Height-Balanced (AVL, Red-Black) B-Trees
Indexes That Are Too Large to Hold in Memory Index too large for memory must be maintained on secondary storage a lot of seeking is required. Alternatives: Use hashing for direct access or tree-structured index for both keyed access and sequential access
27
Organizing Files for Performance

Used to improve space utilization and file access times Some methods: data compression and reclaiming unused space
Data Compression The process of making files smaller by encoding the information to take up less space Methods Using a Different Notation Suppressing Repeated Sequences Assigning Variable-Length Codes Redundancy Reduction compression by reducing redundancy in data representation Using a Different Notation Compact Notation a compression technique which we decrease the number of bits in data representation.
Fixed-length fields of a record are good candidates for compression o Choosing minimal length that is still conservative enough o Using alternative values: for example, course code instead of course name
Suppressing Repeated Sequences Run-Length Encoding (RLE)
A compression technique in which runs of repeated codes are replaced by a count of the number of repetitions of the code, followed by the code that is repeated Images with repeated pixels are good candidates for RLE The Algorithm: Read through the pixels in the image, copying the values to the file sequence, except where the same pixel value occurs more that one successively. Substitute with the following 3 bytes the pixel values which occurred in succession: Run-length code indicator; Pixel value that is repeated; and The number of times it is repeated.
For example, we want to compress the following, with 0xFF not included in the image: 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24 The compressed version is 22 23 FF 24 07 25 FF 26 06 25 24
RLE does not guarantee any particular amount of space savings

Assigning Variable-Length Codes The most frequently used piece of data is assigned to have the shortest code Example 1: Morse Code E I S H T M O
However, Morse Code still need some delimiter to recognize characters
30
Example 2: Huffman Code Suppose we have an alphabet consisting of only seven characters: Letter Probability Code
a 0.4 1
b 0.1 010
c 0.1 011
d e f g 0.1 0.1 0.1 0.1 0000 0001 0010 0011
Each letter here occurs with the probability indicated Letter a has the greatest probability to occur more frequently, so it is assigned the one bit code So the string abacde is represented as 1010101100000001 Seven letters can be stored using three bits only, but in this example, as much as four bits are used to ensure that the distinct codes can be stored together without delimiters and still could be recognized
31

Slides 1 Introduction To File Structures

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Slides 1 Introduction To File Structures

Diunggah oleh

Hak Cipta:

Format Tersedia

Introduction to File Structures

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Issues in Physical File Organization

CS 165 Database Systems

Lesson 1: Introduction to File Structures

input_file logical, Secondary Storage Devices

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Platters Tracks Sectors Disk pack Cylinder

CS 165 Database Systems

Lesson 1: Introduction to File Structures

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Cost of a Disk Access (Latency)

Step 1. seek move the head to proper track

Measured as: (in ms) seek time

data transfer time

CS 165 Database Systems

Lesson 1: Introduction to File Structures

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Block Transfer Time

CS 165 Database Systems

CS 165 Database Systems

Lesson 1: Introduction to File Structures

RAID: Improving File Access Performance by Parallel Processing

CS 165 Database Systems

Lesson 1: Introduction to File Structures

CS 165 Database Systems

Lesson 1: Introduction to File Structures

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Disadvantage: wasted space, insufficient space

used in combination with delimiters advantage: self-descriptive disadvantage: wasted space

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Index in another file or at the file's header.

5. Delimiter at the end of each record

Two Types of Sequential Files

Unordered pile files Sorted sequential files

CS 165 Database Systems

Relative File Organization

CS 165 Database Systems

Lesson 1: Introduction to File Structures

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Primary Index on Unordered Files

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Primary Index on Ordered Files

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Clustering Index on Ordered Files

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Retrieval Using Combination of Secondary Keys

Solution: Collision resolution technique or use of perfect hash functions.

CS 165 Database Systems

Lesson 1: Introduction to File Structures

Quotient Reduction Remainder Reduction Associated Value Reciprocal Hashing

Provide 1-1 mapping of keys into addresses (no collisions)