| Q. What is VSAM? Never heard of it…! |
| VSAM stands for Virtual Storage Access Method. In Windows and many other operating systems, user data is stored in files. In a text file on Windows or Linux, the data consists of lines (records), one after the other. Such files are called Sequential Files (Sequential Datasets). VSAM is an improved way of storing data. It overcomes some of the limitations of older access methods like ISAM, and is hence a major boon. |
| Q. Hold on for a sec... What does a Sequential File look like? |
|
In the early days of computers, all data was stored in Sequential Files (Physical Sequential, or PS, Datasets). Data was stored in the form of records, one after the other. Suppose we wanted to store information about all the employees in our organization. Below, you’d find a picture of what a Sequential File/PS Dataset looks like :
As you can see, each record represents the data of a single, individual employee. This way, there would be thousands of records that make up the EMPDAT Sequential File. |
| Q. So, it’s pretty cool the way a Sequential File stores data. But how do you get it back? How do you search for a particular employee? |
|
Well, that’s the tough part, because sequential datasets work more or less like a cassette tape. Yup, an audio cassette tape. The songs recorded on the tape are analogous to records in a Sequential File/PS Dataset. If you want to play a particular song, you have to start from the beginning of the tape and wind forward until you reach the desired song. You can’t jump directly to a song and play it. On the same lines, when you want to search for a particular record, say Employee no. 502, you have to travel through the list of records, one by one, till you reach the desired record. The longer the Sequential File, the longer it takes to access the record, because you don’t know where the record lies in such a huge list. So, retrieval of data from a sequential file takes a very long time. Let me give you another analogy : you can compare a sequence of records stored in a Sequential File/PS Dataset to a row of books arranged on a shelf in a library. Suppose you wanted to insert a new book in the middle. Since there’s no space, you need to shift the remaining books to create a vacant slot for the new book. The same applies to records in a sequential file : insertion of records in Sequential File/PS Datasets is difficult. |
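To put rough numbers on the cassette-tape and bookshelf analogies, here is a minimal Python sketch. It is not how a mainframe actually reads a PS Dataset; the list `empdat`, the helper names and the even-numbered employee keys are all assumptions made up for illustration.

```python
def find_employee(records, emp_no):
    """Scan front to back, like winding a tape; return (record, reads)."""
    reads = 0
    for record in records:
        reads += 1
        if record["emp_no"] == emp_no:
            return record, reads
    return None, reads


def insert_in_order(records, new_record):
    """Insert in emp_no order; return how many records had to shift."""
    pos = 0
    while pos < len(records) and records[pos]["emp_no"] < new_record["emp_no"]:
        pos += 1
    shifted = len(records) - pos  # every later record moves one place
    records.insert(pos, new_record)
    return shifted


# Hypothetical EMPDAT file: 1000 employees with even numbers 2, 4, ..., 2000.
empdat = [{"emp_no": n, "name": f"Employee {n}"} for n in range(2, 2001, 2)]

found, reads = find_employee(empdat, 502)
shifted = insert_in_order(empdat, {"emp_no": 501, "name": "New Employee"})
print(reads, shifted)
```

Both costs grow with the size of the file, which is exactly the problem the rest of this article is about.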
| Q. I get it.. as far as Searching and Insertion goes – Sequential Files are not very efficient. What’s VSAM got to offer? |
|
You can use a more structured and organised way of storing this data, called VSAM. Though the abbreviation is a little geeky, VSAM solves both our problems with Sequential files :
- Searching data stored in VSAM Files takes less time.
- Inserting data in VSAM Files is easier, because free space can be left inside the dataset for new records (the FREESPACE option).
Apart from this, there are other advantages that VSAM has to offer :
- Free space within a KSDS dataset is reclaimed automatically when records are deleted.
- VSAM Datasets are device independent : records are addressed relative to the start of the dataset rather than by physical disk location, and a dataset can be exported and moved to another system that supports VSAM. |
| Q. What are the types of VSAM Datasets? |
|
VSAM Datasets are primarily of 3 types :
1) Entry Sequenced Dataset(ESDS)
2) Key Sequenced Dataset(KSDS)
3) Relative Record Dataset(RRDS)
VSAM Datasets are called Clusters. So, we typically refer to them as KSDS Cluster, ESDS Cluster and RRDS Cluster. |
| Q. What’s an Entry Sequenced Dataset(ESDS)? Can you explain in brief? |
|
An Entry Sequenced Dataset is similar to a Sequential File/PS Dataset. Records are stored one after the other, in the order in which they are entered. As you enter more records, they are simply appended to the end, and the ESDS Dataset grows in size. I shall talk about ESDS at length later in the article. Note : When you code a program in COBOL, you indicate an ESDS dataset by specifying : ORGANIZATION IS SEQUENTIAL. |
| Q. What’s a Key Sequenced Dataset(KSDS)? Can you explain in brief? |
|
- Concept of Key : In a KSDS, every record is identified by a unique key. Every individual employee will have a distinct key value. This key could be the Employee Identification No., since it is unique for each employee. No two employees can have the same key value.
- How data is loaded : When you first create a KSDS cluster, it is empty. You need to populate(Load) the KSDS dataset with real data. When data is being loaded into a KSDS Dataset, the data must be supplied in increasing(ascending) order of the key, because KSDS stores all the data records in ascending order of the key. Let me cite a simple, yet practical example. Suppose you have an empty VSAM KSDS Cluster. You’ve written a Batch JOB/JCL to read input records of employees and add them to the KSDS Dataset. You first add Employee E1 with key 01 to the KSDS File. It works. Next, you add Employee E2 with key 03. It works. Next, you try to add Employee E3 with key 02. The insert is unsuccessful, because key 02 is lower than key 03, which has already been loaded.
- KSDS Structure : A KSDS Cluster contains two parts : 1) Data Component, which stores the file records(actual data) 2) Index Component, which keeps track of the location of each record in the Data Component. Given below is a rough sketch which will give you a helicopter view of what a KSDS Cluster looks like. Of course, the details are explained further ahead.
- For Dummies - Concept of Memory Address : As shown in the figure above, in essence, KSDS has 2 parts : the INDEX Component and the DATA Component. The Data Component contains the data records. Every record is stored in 1 Mainframe storage location, and every storage location houses 1 record. Just like people live in the houses on a street, 1 record lives in each house/cell/storage location. Houses on a street have a residential address by which they can be easily reached. 
If you know the house address, you can reach the house directly, in much less time. In the same way, our houses/storage locations have unique addresses by which they can be accessed.
- For Dummies - Comparing a Book’s INDEX with a KSDS INDEX ; How search performance improves with the help of the INDEX Component : Imagine that a book had no index, and you wanted to find a keyword. You would have to scan the entire length of the book, page by page, till you found the word you wanted. The index simplifies this activity. The index of a book is quite similar to the INDEX Component that we have here, as you’ll find out. Basically, a book index has two columns : one is the keyword, and the other is the page no./location in the text where you’d find this keyword. Every page has a page number. On the same lines, every record in storage has an address. So, the INDEX Component of KSDS has an entry for every key(key-field). For example, employees 1, 2 and 3 each have an entry in the index. Besides the key, the index entry also stores the address where this employee record can be found in the Data Component. Moreover, the index is sorted according to the key-field(like a book index is sorted alphabetically by keyword). So, how does it work? Let’s say you wanted to find the name of Employee No. 4. Simple : you look up the row of Employee 4 in the index. This is easy, because the index is already sorted on the key field, Emp No. Now, you find the address of the storage location(house) in the Data Component where Employee 4 lives. This is location no. 600. Since you know the address, you can jump directly to address 600, and find that the name of the employee is Ratul. This is far quicker than scanning every record. The GIST of this concept is that the Index Component stores the key-field and a pointer(address in storage) to the corresponding record. 
This way, searching is faster and easier. The process of building an INDEX on a key-field for data records is called Indexing(or simply building an INDEX). Let me caution you that the diagram above is a very crude view of the INDEX Component. The INDEX Component of KSDS looks like a family tree, with father and sons and their progeny. In Computer Science, we call such a tree a B+ Tree. If you are curious to know what a B+ Tree is, and how the INDEX Component really looks, read on. If you feel you’ve absorbed a lot, you can call it a day! NOTE : THE ABOVE EXAMPLE, MEANT TO EXPLAIN THE WORKING OF THE DATA COMPONENT AND INDEXING, IS SIMPLIFIED AND INTENDED FOR LEARNING AT A PRELIMINARY LEVEL. IT DOES NOT COINCIDE WITH THE EXACT STRUCTURE AND WORKING OF KSDS. |
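The book-index idea can be sketched in a few lines of Python. This follows the simplified one-record-per-address picture above, not the real KSDS layout. Address 600 and the name Ratul come from the example; every other key, address and name is a made-up placeholder, and `lookup` is a hypothetical helper.

```python
import bisect

# Data Component: storage address -> record.
# Only address 600 / "Ratul" is from the article; the rest are placeholders.
data_component = {
    300: {"emp_no": 1, "name": "Employee One"},
    400: {"emp_no": 2, "name": "Employee Two"},
    500: {"emp_no": 3, "name": "Employee Three"},
    600: {"emp_no": 4, "name": "Ratul"},
}

# Index Component: keys kept in ascending order, each paired with an address.
index_keys = [1, 2, 3, 4]
index_addresses = [300, 400, 500, 600]


def lookup(emp_no):
    """Find the key in the sorted index, then jump straight to its address."""
    i = bisect.bisect_left(index_keys, emp_no)
    if i < len(index_keys) and index_keys[i] == emp_no:
        return data_component[index_addresses[i]]
    return None  # key is not in the index


print(lookup(4)["name"])
```

Because the index is sorted, the key can be found without touching the Data Component at all; only one record is read at the end.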
| Q. What’s a Relative Record Dataset(RRDS)? |
|
A Relative Record Dataset is organised as a collection/list of records. Each record has a Relative Record Number(RRN). The RRN indicates the relative position of the record in the file. Think of people standing in a long queue to get tickets. RRDS resembles this : the first person in the queue has RRN 1, and the person who is 6th in line has RRN 6. Note that in Relative Record Datasets there is NO KEY-FIELD. Instead, it is up to the user to draw a relationship between any UNIQUE VALUE and the RRN. Records are accessed using Relative Record Numbers. If you know the RRN, you can directly access the record(RANDOM ACCESS). If you do not know the RRN, you will have to scan the dataset from the beginning, till you reach the desired record. This is called SEQUENTIAL ACCESS. |
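Here is a rough Python sketch of why knowing the RRN gives you random access. It assumes fixed-length slots numbered from 1, which is how VSAM numbers RRDS slots. The 80-byte record length, the `REC00n` contents and the helper names are assumptions for illustration only.

```python
RECORD_LENGTH = 80  # hypothetical fixed slot size in bytes


def slot_offset(rrn, record_length=RECORD_LENGTH):
    """Byte offset of a slot, assuming slots are numbered from 1."""
    if rrn < 1:
        raise ValueError("RRN starts at 1")
    return (rrn - 1) * record_length


def read_by_rrn(dataset, rrn):
    """Jump straight to the slot; no scan from the beginning is needed."""
    start = slot_offset(rrn)
    if start >= len(dataset):
        raise IndexError("RRN is beyond the end of the dataset")
    return dataset[start:start + RECORD_LENGTH]


# Hypothetical dataset: ten records, each padded with spaces to 80 bytes.
dataset = b"".join(
    f"REC{n:03d}".encode("ascii").ljust(RECORD_LENGTH) for n in range(1, 11)
)

print(read_by_rrn(dataset, 6).rstrip())
```

The position is computed by plain arithmetic, so the cost of reaching the 6th record is the same as reaching the 6000th.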
| Q. How does VSAM manage the records in a KSDS Dataset/Cluster? |
|
VSAM stores the logical records of a file in fixed length blocks called Control Intervals(CI). The Control Interval(CI) is the VSAM counterpart of the BLOCK in a Sequential File/QSAM File. It is the basic unit of data transfer between the storage device and the machine. Hence, one should remember that the basic unit of I/O transfer in VSAM is the Control Interval(CI). Inside a Control Interval(CI), records are stored in ascending order of the key. A VSAM Cluster could have thousands of Control Intervals. It is important to ensure that a record is written into the right Control Interval, depending on its key value. However, note that in KSDS VSAM, records can be of any size or LENGTH. KSDS does not distinguish in particular between FIXED length records and VARIABLE length records. Instead, a collection of one or more records is stored in a FIXED SIZE CONTROL INTERVAL. A typical size of a Control Interval(CI) is 4K = 4096 bytes. The maximum size of a CI is 32K, whereas the minimum is 512 bytes. When a new KSDS Cluster is created, Control Intervals are created and records are written into them. LET ME FIRST START WITH A VERY SIMPLE AND IDEAL MODEL of the Control Interval. Control Interval (Very idealistic, Simplified) : Assume that all records have the SAME fixed size = 1024 bytes, and each CI = 4096 bytes. Then, in this example, no. of records in each CI = 4096/1024 = 4 records/CI
Once again, it is not necessary that all the records are of the same fixed size. Records could be of any length.
Control Interval often contains some empty/free space(Close to real model) : Consider the same example as above, with CI Size = 4096 bytes, but all records of length = 1,270 bytes. In this case, the no. of records that can be written to a Control Interval = 3 records, and some empty space is left over. 3 x 1,270 = 3,810. Therefore, 4,096 - 3,810 = 286 bytes FREESPACE. A Control Interval(CI) may contain FREESPACE. What happens when you try to write the 4th record? What happens when the CI does not contain enough space? Let us study this process of filling up the Control Interval(CI) once again in slow motion. The 4th record is 1,270 bytes in length. The Control Interval does not have sufficient space. A new Control Interval(CI) is created, and roughly one half of the records are transferred from the old Control Interval(CI) to the new Control Interval(CI). This is called a Control Interval Split.
Control Interval Split(Thumbrule) : When there is not enough space left in a CI for the given record, a new CI is created, half of the records are transferred from the old CI to the new CI, and the record in question is placed in the original CI.
Control Interval also contains Control Information(Real Model) : Apart from logical data records and FREESPACE, the Control Interval also stores some Control Information. Control Information contains 2 pieces of information :
- Control Interval Descriptor Field(CIDF), 4 bytes : one CIDF per CI (stores information about the amount and location of free space in the CI)
- Record Descriptor Field(RDF), 3 bytes : used to describe the length of the records. When all records have the same fixed length, there are 2 RDFs : 1 for storing the length, and 1 for storing the number of records with that length.
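The arithmetic above can be checked with a small Python sketch. The function `ci_capacity` and its flag are made-up names. The control information figure simply adds up the sizes quoted above (one 4-byte CIDF plus two 3-byte RDFs for same-length records), so treat it as an estimate under that assumption.

```python
CIDF_BYTES = 4  # one CIDF per Control Interval
RDF_BYTES = 3   # each RDF; two are assumed when all records share one length


def ci_capacity(ci_size, record_length, with_control_info=False):
    """Return (records that fit in one CI, bytes left over)."""
    usable = ci_size
    if with_control_info:
        usable -= CIDF_BYTES + 2 * RDF_BYTES
    count = usable // record_length
    return count, usable - count * record_length


print(ci_capacity(4096, 1024))        # idealistic model
print(ci_capacity(4096, 1270))        # free space left over
print(ci_capacity(4096, 1270, True))  # after reserving 10 control bytes
print(ci_capacity(4096, 1024, True))  # the ideal 4 records no longer fit
```

The last call shows why the first model was called idealistic : once the 10 bytes of control information are reserved, four 1,024-byte records no longer fit in a 4,096-byte CI.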
Spanned Records : What happens if a record has a length greater than the CI Size? For example, in the above scenario, suppose CI Size = 4096 and record length = 5,120 bytes. Then the record extends into another CI. Such a record, which occupies more than one Control Interval, is called a Spanned Record, since it spans multiple Control Intervals. Carrying forward the above example, the spanned record will completely fill up the first CI of size 4,096. The second CI will be partially filled : the spanned record will occupy 5,120 - 4,096 = 1,024 bytes of it. You might be tempted to think that since the second CI has a lot of vacant space left, 4,096 - 1,024 = 3,072 bytes, it can accommodate other records. However, this is not so. As a thumbrule, a spanned record begins at a CI boundary and spans 2 or more Control Intervals. The vacant space in the last CI can only be used to extend the original spanned record; it cannot be filled with new records. A new record can only be placed in a new CI.
Control Area(CA) : Control Intervals(CI) are further grouped into a larger entity called the Control Area(CA). When MVS O/S allocates space to a VSAM Cluster, it does so in terms of Control Areas. So, the Control Area forms the indivisible, atomic unit of storage space allocation in a VSAM File. Atomic means that when MVS allocates storage space to VSAM, it will allocate 1 Control Area, 2 Control Areas, 5 Control Areas,... and so on. You can’t allocate 1.5 Control Areas of space to a VSAM File.
Control Area(CA) Splits : Let us draw a parallel with the Control Interval. Let’s say you have a record which does not fit into a Control Interval. So, you would try to make a new Control Interval(CI), and a CI Split is likely to occur. But, to add to it, there’s not enough space in the Control Area to create a new Control Interval. Now what happens? 
A new Control Area is created, and half of the Control Intervals are transferred from the original Control Area to the new Control Area. This process is called a Control Area Split. Thus, when there’s not enough room inside a Control Area for a CI, a new Control Area is created and half of the CIs are put into it. |
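To watch the Control Interval Split thumbrule in slow motion, here is a toy Python model. It is only a sketch : each CI is a plain sorted list of keys, the capacity of 4 records is hypothetical, and `insert_key` is a made-up helper. One assumption differs from the wording of the thumbrule : after the split, the new record goes into whichever half its key belongs to, so that ascending key order is preserved.

```python
import bisect


def insert_key(cis, ci_index, key, capacity):
    """Insert key into cis[ci_index]; split the CI first if it is full.

    cis is a list of Control Intervals, each a sorted list of keys.
    Returns True when a split was needed.
    """
    ci = cis[ci_index]
    if len(ci) < capacity:
        bisect.insort(ci, key)  # room available: keep keys in order
        return False
    mid = len(ci) // 2
    new_ci = ci[mid:]           # upper half moves to the new CI
    del ci[mid:]
    cis.insert(ci_index + 1, new_ci)
    target = ci if key < new_ci[0] else new_ci
    bisect.insort(target, key)  # assumption: place by key order
    return True


control_area = [[10, 20, 30, 40]]  # one full CI, hypothetical keys
split_happened = insert_key(control_area, 0, 25, capacity=4)
print(split_happened, control_area)
```

After the split, both CIs have free space again, so the next few inserts in this key range proceed without another split. A Control Area Split follows the same pattern one level up, with CIs being moved instead of records.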
| Q. What does a KSDS Data Cluster look like? Can you show me a pic? |
|
Sure, this is what a KSDS Cluster looks like : a collection of Control Areas, each containing fixed length blocks of the same size called Control Intervals, which in turn contain the logical records. |
| Q. What does the INDEX Component of a KSDS Cluster look like? |
|
The INDEX component of KSDS has 2 parts : the Index Set and the Sequence Set. The Sequence Set is the lowest level of the INDEX Component. It contains one key value for each Control Interval(CI), along with the physical location(address) of that CI on the disk. The Index Set has one or more levels. The highest level contains only 1 record, and the records of the lowest level of the Index Set contain pointers to the Sequence Set. Records at the top level point to the records at the next level, which in turn point to the records at the next level, and so on... Structure : For every Control Area in the Data Component, there is one record in the Sequence Set. Within that record, there is one index entry for each Control Interval of the Control Area. The index entry holds the highest key value in the corresponding CI. Consider the following structure :
For example, in Control Area-1 :
- For Control Interval-1, Highest Key Value = 187
- For Control Interval-2, Highest Key Value = 247
- For Control Interval-3, Highest Key Value = 369
These will be the entries in the Sequence Set record corresponding to Control Area 1. The entries in the Index Set record contain the highest key value of each Sequence Set record. How is the record located? Suppose we wish to access record 501.
1) Look up the entry in the Index Set record to find the appropriate Sequence Set record(Control Area).
2) Look up the entry in the Sequence Set record to find the corresponding Control Interval.
Each step narrows the search, much like a binary search narrows a sorted list. |
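The two lookup steps can be sketched in Python as below. Only the Control Area 1 keys (187, 247, 369) come from the example; the keys for Control Areas 2 and 3 are invented so that record 501 has somewhere to land, and `locate` is a hypothetical helper, not a VSAM routine.

```python
import bisect

# One Sequence Set record per Control Area; each entry is the highest key
# in one Control Interval. Only the first row is from the article.
sequence_set = [
    [187, 247, 369],  # Control Area 1 (from the example)
    [420, 512, 608],  # Control Area 2 (hypothetical)
    [700, 815, 940],  # Control Area 3 (hypothetical)
]

# Index Set record: the highest key of each Sequence Set record.
index_set = [record[-1] for record in sequence_set]


def locate(key):
    """Return (control_area, control_interval), both numbered from 1."""
    ca = bisect.bisect_left(index_set, key)  # step 1: pick the CA
    if ca == len(index_set):
        return None  # key is higher than anything in the dataset
    ci = bisect.bisect_left(sequence_set[ca], key)  # step 2: pick the CI
    return ca + 1, ci + 1


print(locate(501))
```

With these made-up keys, record 501 can only be in Control Interval 2 of Control Area 2, since 420 < 501 <= 512. Only that one CI then needs to be read from disk and searched for the record itself.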
It is important to discuss the implications of the size of the Control Interval on access time. Since the CI is the unit of I/O transfer, the more CIs that have to be read, the more I/O transfer operations are required.