File Structures UNIT 1 Notes

FILE STRUCTURES
UNIT 1: INTRODUCTION TO FILE STRUCTURES
Lecture notes

1.1 The heart of file structure design

• Disks
   o have enormous storage capacity
   o are nonvolatile
   o cost less than memory
   o but are very slow compared to memory
• ANALOGY
   o RAM access time: about 120 ns
   o Disk access time: about 30 ms (roughly 250,000 times slower)
   o Suppose finding something in the book in your hand takes 20 sec. If the information is not in the book and must be looked up in the library, then at the same ratio as memory access to disk access the library search would take 5 million sec, or almost 58 days.
• A disk's relatively slow access time, together with its enormous nonvolatile capacity, is the driving force behind file structure (FS) design.
• An FS should give access to all that capacity without making the application spend a lot of time waiting for the disk.
• An FS is a combination of representations for data in files and of operations for accessing the data.
   o It allows applications to read, write and modify data
   o and to find data
   o or to read data in a particular order.
• The efficiency of an FS design for a particular application depends on
   o the details of the representation of the data
   o the implementation of the operations.
• The large variety in the types of data and in the needs of applications makes FS design important.
• What is best for one situation may be terrible for another.

1.2 A Short History of FS Design

• The general goals of FS design
   o Ideally, one access to the disk to get the desired information.
   o Failing that, structures that take us to the information with as few accesses as possible: two or three trips to the disk.
   o Group information so that we get everything we need in one trip to the disk: name, address, phone number and account balance, all at once.
• All of these are easy to achieve if the files do not change. When information is added or deleted it is much more difficult.
• Early days
   o Initially the storage device was tape.
   o Access was sequential.
   o Access cost was directly proportional to the size of the file.
• Then came disk drives
   o Indexes were added to files.
   o A list of keys and pointers was kept in a smaller file (easily searchable).
   o It became easy to access a file directly even if it was very large.
   o As the indexes grew they too became difficult to manage.
• Early 1960s
   o The idea of applying tree structures emerged.
   o But trees can grow unevenly as records are added or deleted, resulting in long searches that require many disk accesses to find a record.
• In 1963
   o The AVL tree was developed: a self-adjusting binary tree structure for data in memory.
   o Some researchers implemented the AVL tree structure for files.
   o The problem was that dozens of accesses were required to find a record in even moderate-sized files.
   o A method was required to keep a tree balanced when each node of the tree was not a single record, as in a binary tree, but a file block containing dozens or hundreds of records.
• B-tree
   o The B-tree emerged after ten years of design work.
   o An AVL tree grows top-down, whereas a B-tree grows bottom-up.
   o It provides excellent access performance.
   o Sequential access was not efficient in a B-tree.
• B+ trees
   o Solved the problem of sequential access in the B-tree.
   o Added a linked list at the bottom level of the B-tree.
• The B-tree and B+ tree became the basis for many commercial file systems.
• They provide access times that grow in proportion to log_k N, where
   o N is the number of entries in the file
   o k is the number of entries indexed in a single block of the B-tree structure.
• In practice, you can find one file entry among millions of others with only three or four trips to the disk.
• A B-tree guarantees that performance stays about the same even if you add or delete entries.
• Hashing
   o Hashing is a good way to get what we want with a single request.
   o Hashed indexes were used to provide fast access to files, but they suited files of non-changing size, that is, files that do not grow or shrink much.
   o Extendible dynamic hashing retrieves the information with one or, at most, two disk accesses no matter how big the file becomes.

1.3 A Conceptual Toolkit

• Development of file structures over the last three decades
   o Sequential
   o Tree structures
   o Direct access
• Design tools keep emerging
   o Decrease the number of disk accesses by collecting data into
      - Buffers
      - Blocks
      - Buckets
   o Manage the growth of these collections by splitting them
      - This requires us to find ways to increase address or index space.
• These are called the conceptual tools.
   o They are methods of framing and addressing a design problem.

1.4 Fundamental File Operations: Physical Files and Logical Files

• A physical file
   o refers to a particular collection of bytes stored on a disk or tape.
   o It physically exists on the disk.
   o A disk drive might contain hundreds or thousands of these physical files.
• A logical file
   o To the program, a file is somewhat like a telephone line connected to the telephone network.
   o The program can send or receive bytes through this phone line.
   o Where do these bytes go? Where do they come from? The program does not know.
   o The program knows only which line it is talking to in order to get bytes in and send bytes out.
   o Bytes coming into the program can come down the line from a physical file, the keyboard or some other input device.
   o Bytes going out of the program might end up in a file or appear on the terminal screen.
   o The application relies on the operating system to take care of the telephone switching system.
   o This line is called the logical file.
   o Compare an office phone: an intercom message such as “you have a call on line three” identifies the call by its line, not by where it comes from.
• To open a file for use
   o The operating system must receive instructions to link the logical file with the physical file on the disk or a device.
   o A number is returned in response. It is used to refer to the file inside the program; that is the logical name.
• A single program is limited to using only 20 files at a time.
• ANALOGY
   o My office phone is connected to six telephone lines.
   o The receptionist does not say, “you have a call from 814-789-1903”. I need to have the call identified logically, not physically.
   o In the same way, the program knows nothing about the other end of its logical file.

1.4 Opening Files

• C++ Files and Streams
   o C++ views each file as a sequence of bytes.
   o Each file ends with an end-of-file marker.
   o When a file is opened, an object is created and a stream is associated with the object.
   o To perform file processing in C++, the header files <iostream> and <fstream> must be included (<iostream.h> and <fstream.h> in pre-standard compilers).
   o <fstream> declares the ifstream and ofstream classes.
• Creating a sequential file

   // Create a sequential file
   #include <iostream>
   #include <fstream>
   #include <cstdlib>   // exit
   using namespace std;

   int main()
   {
      // ofstream constructor opens file
      ofstream outClientFile( "clients.dat", ios::out );

      if ( !outClientFile ) {   // overloaded ! operator
         cerr << "File could not be opened" << endl;
         exit( 1 );
      }

      cout << "Enter the account, name, and balance.\n"
           << "Enter end-of-file to end input.\n? ";

      int account;
      char name[ 30 ];
      float balance;

      // one text record per line: account name balance
      while ( cin >> account >> name >> balance ) {
         outClientFile << account << ' ' << name << ' ' << balance << '\n';
         cout << "? ";
      }

      return 0;   // ofstream destructor closes file
   }

• Question
   o What does the above program do?
• How to open a file in C++?
   o ofstream outClientFile( "clients.dat", ios::out );
   OR
   o ofstream outClientFile;
     outClientFile.open( "clients.dat", ios::out );
• File Open Modes
   o ios::app - (append) write all output to the end of the file
   o ios::ate - move to the end of the file on opening; data can be written anywhere in the file
   o ios::in - (input) open a file for input
   o ios::out - (output) open a file for output
   o ios::trunc - (truncate) discard the file's contents if it exists
   o ios::binary - read/write data in binary format
   o ios::nocreate - if the file does NOT exist, the open operation fails (pre-standard flag, not part of ISO C++)
   o ios::noreplace - if the file exists, the open operation fails (pre-standard flag, standardized only in C++23)

1.5 Closing Files

• The file is closed implicitly when the destructor for the corresponding object is called
• or explicitly by using the member function close:
   o outClientFile.close();

1.6 Reading and Writing Files

• <ostream> member function write
   o The <ostream> member function write outputs a fixed number of bytes, beginning at a specific location in memory, to the specified stream.
   o When the stream is associated with a file, the data is written beginning at the location in the file specified by the “put” file pointer.
   o The write function expects a first argument of type const char *, hence reinterpret_cast<const char *> is used to convert the address of the record (blankClient, or client in the program below) to a const char *.
   o The second argument of write is an integer specifying the number of bytes to be written, thus sizeof( clientData ).
• Writing data randomly to a random file

   // Writing data randomly to a random file
   #include <iostream>
   #include <fstream>
   #include <cstdlib>
   #include "clntdata.h"   // defines clientData (header not reproduced in these notes)
   using namespace std;

   int main()
   {
      // ios::ate as in the original notes; ios::binary added because raw bytes
      // are written. Note that a standard ofstream opened this way still
      // truncates an existing file; to keep existing records, open with
      // ios::in | ios::out | ios::binary instead.
      ofstream outCredit( "credit.dat", ios::ate | ios::binary );

      if ( !outCredit ) {
         cerr << "File could not be opened." << endl;
         exit( 1 );
      }

      cout << "Enter account number "
           << "(1 to 100, 0 to end input)\n? ";

      clientData client;
      cin >> client.accountNumber;

      while ( client.accountNumber > 0 && client.accountNumber <= 100 ) {
         cout << "Enter lastname, firstname, balance\n? ";
         cin >> client.lastName >> client.firstName >> client.balance;

         // record n starts at byte ( n - 1 ) * sizeof( clientData )
         outCredit.seekp( ( client.accountNumber - 1 ) * sizeof( clientData ) );
         outCredit.write( reinterpret_cast<const char *>( &client ),
                          sizeof( clientData ) );

         cout << "Enter account number\n? ";
         cin >> client.accountNumber;
      }

      return 0;
   }

• Reading data from a random file

   // Reading data from a random file, record by record
   #include <iostream>
   #include <iomanip>
   #include <fstream>
   #include <cstdlib>
   #include "clntdata.h"
   using namespace std;

   void outputLine( ostream&, const clientData & );

   int main()
   {
      ifstream inCredit( "credit.dat", ios::in | ios::binary );

      if ( !inCredit ) {
         cerr << "File could not be opened." << endl;
         exit( 1 );
      }

      cout << setiosflags( ios::left ) << setw( 10 ) << "Account"
           << setw( 16 ) << "Last Name" << setw( 11 ) << "First Name"
           << resetiosflags( ios::left ) << setw( 10 ) << "Balance" << endl;

      clientData client;
      inCredit.read( reinterpret_cast<char *>( &client ), sizeof( clientData ) );

      while ( inCredit && !inCredit.eof() ) {
         if ( client.accountNumber != 0 )   // skip empty records
            outputLine( cout, client );

         inCredit.read( reinterpret_cast<char *>( &client ), sizeof( clientData ) );
      }

      return 0;
   }

   void outputLine( ostream &output, const clientData &c )
   {
      output << setiosflags( ios::left ) << setw( 10 ) << c.accountNumber
             << setw( 16 ) << c.lastName << setw( 11 ) << c.firstName
             << setw( 10 ) << setprecision( 2 ) << resetiosflags( ios::left )
             << setiosflags( ios::fixed | ios::showpoint ) << c.balance << '\n';
   }

• The <istream> function read
   o inCredit.read( reinterpret_cast<char *>( &client ), sizeof( clientData ) );
   o The <istream> function read inputs a specified number of bytes (here sizeof( clientData )) from the current position of the specified stream into an object.

1.7 Seeking

• Reading and printing a sequential file

   // Reading and printing a sequential file
   #include <iostream>
   #include <fstream>
   #include <iomanip>
   #include <cstdlib>
   using namespace std;

   void outputLine( int, const char *, double );

   int main()
   {
      // ifstream constructor opens the file
      ifstream inClientFile( "clients.dat", ios::in );

      if ( !inClientFile ) {
         cerr << "File could not be opened\n";
         exit( 1 );
      }

      // The rest of this listing is missing from the notes. The loop and
      // outputLine below are a reconstruction from the prototype and from the
      // record layout written by the sequential-file program in 1.4.
      int account;
      char name[ 30 ];
      double balance;

      while ( inClientFile >> account >> name >> balance )
         outputLine( account, name, balance );

      return 0;
   }

   void outputLine( int acct, const char *name, double bal )
   {
      cout << acct << ' ' << name << ' ' << bal << '\n';
   }

• File position pointer
   o The <istream> and <ostream> classes provide member functions for repositioning the file pointer (the byte number of the next byte in the file to be read or to be written).
   o These member functions are:
      - seekg (seek get) for the istream class
      - seekp (seek put) for the ostream class
• Examples of moving a file pointer
   o inClientFile.seekg( 0 ) - repositions the file get pointer to the beginning of the file
   o inClientFile.seekg( n, ios::beg ) - repositions the file get pointer to the n-th byte of the file
   o inClientFile.seekg( -m, ios::end ) - repositions the file get pointer to the m-th byte from the end of the file (the offset is added to the end position, so moving back takes a negative offset)
   o inClientFile.seekg( 0, ios::end ) - repositions the file get pointer to the end of the file
   o To move the pointer relative to the current location use ios::cur:
     inClientFile.seekg( n, ios::cur ) - moves the file get pointer n bytes forward
   o The same operations can be performed with the <ostream> member function seekp.
• Member functions tellg() and tellp()
   o Member functions tellg and tellp return the current locations of the get and put pointers, respectively.
   o long location = inClientFile.tellg();
• Updating a sequential file
   o Data that is formatted and written to a sequential file cannot be modified easily without the risk of destroying other data in the file.
   o If we want to modify a record of data, the new data may be longer than the old one and it could overwrite parts of the record following it.
• Problems with sequential files
   o Sequential files are inappropriate for so-called “instant access” applications in which a particular record of information must be located immediately.
   o These applications include banking systems, point-of-sale systems and airline reservation systems (or any database system).
• Random access files
   o Instant access is possible with random access files.
   o Individual records of a random access file can be accessed directly (and quickly) without searching many other records.

1.8 Special Characters in Files

• All computer systems reserve a number of characters for specific system functions.
• Examples:
   o Control-Z often indicates end-of-file in MS-DOS programs
   o Control-D often indicates end-of-file in Unix programs
   o CR (Carriage Return) and LF (Line Feed) characters together indicate end-of-line in MS-DOS (Unix uses LF alone)

1.9 Directory Structures

• Files are stored in directories. Thus directories are collections of files.
• Most modern systems maintain a tree directory structure.

1.10 Physical Devices and Logical Files

• I/O Redirection
   o I/O redirection allows the source of input to be a file instead of the keyboard:
      program < file     /* program reads input from a file instead of the keyboard */
   o I/O redirection allows output to go to a file instead of the screen:
      program > file     /* program writes to a file instead of the screen */
• Pipes
   o The output of one program can be used as the input to another program by using pipes.
   o Example:
      program1 | program2

1.11 Secondary Storage Management

• Secondary storage devices:
   o have much longer access times than main memory
   o have access times that vary from one access to another (some accesses are relatively fast and others are slower on the same device)
   o have a lot more storage than main memory
   o have storage that is nonvolatile
• Disks
   o Types of commonly used disks
      - hard disks
      - floppy disks
      - Iomega ZIP disks
      - Jaz disks
• The Organization of Disks
   o Data is stored on the surface of one or more platters.
   o Disk storage units are:
      - tracks and cylinders
      - sectors
• Disk Storage Capacity
   o The amount of data that can be held on a disk depends on how densely bits can be stored on the disk surface.
   o The capacity of the disk is a function of:
      - the number of cylinders
      - the number of tracks per cylinder
      - the capacity of a track
   o Track capacity = number of sectors per track × bytes per sector
   o Cylinder capacity = number of tracks per cylinder × track capacity
   o Drive capacity = number of cylinders × cylinder capacity
• How is the Data Read from or Written to a Disk?
   o The operating system sends control signals to the disk via a disk driver to read or write data in a given sector of a given cylinder.
   o The read/write head moves to the needed cylinder (seek time).
   o The disk rotates to position the needed sector under the read/write head (rotational delay).
• Specification of Disk Drives
   o Capacity: e.g. 2 GB
   o Minimum (track to track) seek time: e.g. 1 msec
   o Average seek time: e.g. 12 msec (milliseconds)
   o Maximum seek time: e.g. 22 msec
   o Spindle speed: e.g. 5200 rpm (rotations per minute)
   o Average rotational delay: e.g. 6 msec
   o Maximum transfer rate: e.g. 2796 bytes/msec = 2730K/sec
   o Bytes per sector: e.g. 512
   o Sectors per track: e.g. 63
   o Tracks per cylinder: e.g. 16
   o Cylinders: e.g. 4092
• Organizing Data by Sectors
   o Sectors that are logically consecutive are sometimes not physically consecutive on the track; this is called sector interleaving.
   o In the early 1990s, controller speeds improved so that disks can now offer noninterleaving (also known as 1:1 interleaving).
• Clusters
   o A cluster is a fixed number of consecutive (logical) disk sectors.
   o Some operating systems view each file as a series of clusters.
   o Clusters are designed to improve performance, since all sectors in one cluster can be accessed without an additional seek.
• Extents
   o Extents of a file are those parts of the file which are stored in contiguous clusters.
   o It is very beneficial to store the whole file in one extent (seek time is minimized).
• Fragmentation
   o Fragmentation is the disk space wasted because the smallest organizational unit of a disk is one sector.
   o If the sector size is 512 bytes, then even if we need to store only one byte we have to allocate one whole sector to it. Thus 511 bytes are wasted.
• Blocks
   o Some disks allow data to be stored in user-defined blocks instead of sectors.
   o Blocks can be either variable or fixed length.
   o When the data on a disk is organized in blocks, this usually means that the amount of data transferred in a single I/O operation can vary.
   o Block organization can be more efficient than sector organization but it is much more complex.
• Non-data Overhead
   o Non-data overhead includes, at the beginning of each sector:
      - sector address
      - track address
      - sector usability
• The Cost of Disk Access
   o Seek time
      - the time required to move the r/w head to the correct cylinder
   o Rotational delay
      - the time required to rotate the disk so that the correct sector is positioned under the r/w head
   o Transfer time
      - the time required to transfer the data:
        Transfer time = (number of bytes transferred / number of bytes on a track) × rotation time
• Disks as Bottlenecks
   o Disk speeds lag far behind
      - CPU
      - main memory
      - local network
   o Computer programs spend most of their time awaiting data from the disk.
• Improving Disk Performance
   o Disk striping
      - splitting the parts of a single file across several drives
   o RAID
      - Redundant Array of Inexpensive Disks
   o RAM disk
   o Disk caching
   o Buffering

1.12 Magnetic Tapes

• Tapes provide sequential data access.
• Current tape drives:
   o big variety
   o prices range from $150 to $150,000
   o data transfer rates range from 0.5 MB/sec to 10 MB/sec
   o capacities range from 200 MB to 50 GB
• Tape Organization
   o Tapes store data sequentially.
   o The logical position of a byte within a file corresponds directly to its physical position relative to the start of the file.
   o Data is usually stored on 9 parallel tracks, with 8 data bits and 1 parity bit (this is called a frame).
   o Frames are grouped into blocks separated by interblock gaps.
   o Tapes are read one block at a time.

1.13 Disks versus Tape

• Disks provide both random access and sequential access; tapes provide only sequential access.
• Buffering of disk data in main memory reduces seek time, thus disks are commonly used for sequential file processing too.
• Tapes are mostly used as backup devices.
• Tapes are still the most common long-term archival storage.

1.14 CD-ROM

• CD-ROM = Compact Disk, Read Only Memory
• CD-ROM data can be recorded only once but it can be read multiple times.
• CD-ROM technology started in the early 1970s as videodisk technology.
• The first modern CD-ROMs were built in 1985.
• CD-ROM Technology
   o CD-ROMs are encoded using laser technology.
   o The master is formed by using the data to be encoded to turn a powerful laser on and off very quickly.
   o Data is recorded by copying a master disk.
   o CLV (Constant Linear Velocity) format (spiral format)
• CD-ROM's Strengths
   o high storage capacity (600 MB)
   o low cost
   o durability
   o read-only access (a great advantage from the design point of view)
• CD-ROM's Weaknesses
   o long seek time (0.5 sec to 1 sec!)
   o slow transfer rate (1x = 150 KB/sec)
      - approx. 5 times faster than a floppy disk
      - orders of magnitude slower than a good hard disk
   o asymmetric writing and reading
      - once data has been recorded no write access is allowed; consequently other devices (such as a hard disk) must be available to allow for interactive use of programs on CD-ROMs.

1.15 Computer Storage Hierarchy

• Primary (registers, RAM)
   o fastest, smallest, most expensive
• Secondary (hard disk)
   o slower, larger, less expensive
• Offline (removable disks, tapes)
   o slowest, largest, least expensive
(see fig 3.17 p. 84 of your text)

1.16 A Journey of a Byte

• Application program (e.g. Word)
• Operating System (e.g. Windows NT)
• File manager (part of the operating system)
• I/O processor
• Disk controller
• Disk

1.17 Buffer Management

• File managers buffer data in main memory.
• This is called buffering.
• Buffering strategies used:
   o multiple buffering (the CPU uses one buffer and the I/O processor uses another)
   o move mode and locate mode (instead of using separate application and system buffers, only one buffer is used for both purposes)
   o scatter/gather I/O (reading or writing of disk blocks while separating headers and data into different buffers)
