This post will be updated.

Introduction

This post briefly describes the topic of operating systems and the implementation of a UNIX-like kernel. It assumes that the reader is familiar with basic computer concepts such as:

  • translation and interpretation; high-level, assembly and machine languages
  • hardware organization: processor, addressed memory, input-output devices
  • programming and algorithmic concepts
  • some higher-level abstractions such as files

Also, it does not cover the history of operating systems (batch processing systems, time sharing systems, distributed and networking systems), different types of kernel organizations (monolithic kernels, hybrid kernels, microkernels and exokernels), different purposes of operating systems (generic, real-time etc.) or the picture seen by the user (it is only briefly mentioned).
It assumes a general-purpose operating system written in a high-level language and having a classical monolithic kernel (also, I assume an x86-based PC).
In particular, I am going to describe the concept of operating systems (which is mostly independent of what has been stated above), including a generic UNIX-like operating system, and the implementation of such an OS.

Other topics (including networking, the topics mentioned above, the x86 architecture and some specific
cases) may be covered in future posts. :)

Operating system concept

Basically, a computer without an operating system is a computer which runs only one particular program that has full control over all of its hardware resources.
In such a system:

  • only one program is available (from the definition)
  • this program can do anything it wants, including monopolizing the hardware and not allowing any other programs to run (the computer has to be restarted manually)
  • it is very hard to write programs, since every task has to be expressed in overly low-level terms (there are no files, only sectors on disk, so each program has to be divided between at least two levels of abstraction)
  • functionality will be duplicated across programs

Of course, the program can yield the hardware to another program (taking on some responsibilities of the non-existent operating system), but the last two problems remain even then, and the next program can still monopolize the processor.
In general, such a system can be justified only for very specific digital devices like calculators or digital watches.

If we speak about a generic modern system, it must satisfy the following requirements:

  • multiple programs of different types should be running simultaneously
  • each program must be unable to hurt anything beyond its own world
  • programs should be able to interact with each other
  • each program should be provided with some set of underlying abstractions, so that its code can concentrate only on higher-level tasks
  • duplication of program code should also be avoided at the level of higher-level routines (such code should be placed into libraries shared between different programs)

The software system which implements all these requirements is an operating system, which can be viewed as a layer of abstraction between the bare hardware and application programs.
The kernel is basically the part of the operating system executing in privileged mode, and in our case the kernel is the same as the operating system (since we consider monolithic systems). There is also another notion of “operating system”, which refers to the platform as a whole (including some applications),
but it is mostly user-oriented and I do not use it here.

Operating system structure

Such operating system consists of:

  • process manager (a process is basically an instance of a running program with its own address space; different processes can be launched from the same executable file), allowing multiple
    processes to run simultaneously (which may be real parallelism in the case of multiprocessors, or just switching between different processes at a high frequency) without any of them knowing this, and providing some mechanism of interaction between them.
    Thus, the process manager provides an abstraction of the CPU.

  • memory manager, providing each program with its own virtual address space with protection of some addresses (the first makes it impossible to hurt other programs, and the second makes it impossible to hurt the operating system itself, which is normally executed as a part of the current process). Also, the memory manager should make some physical addresses shared between different processes (although they can access them via different virtual addresses), in order to avoid duplicating code. Another very important part is making it possible for programs to operate with a larger amount of memory than is really available on the machine.
    Thus, the memory manager provides an abstraction of memory.

  • file system, which provides a high-level way of inputting and outputting (and also sharing) data without the necessity to operate with disks or other specific hardware devices. In this way,
    processes normally do not even know that such devices exist! (except for some specific cases, where they still ask the operating system to provide access to them).
    Thus, the file system provides an abstraction of I/O.

  • device drivers and the hardware control layer (I will also refer to it as the low-level layer),
    which does not provide functionality to applications directly, but instead takes responsibility for operating the actual hardware, making it possible for other parts of the operating system to provide their abstractions.
    Thus, the low-level layer provides a foundation for the abstractions.

  • finally, the system call layer is on the other side of the operating system; it is basically an interface from the first three parts to user-level applications. Still, it should be covered separately for two reasons: to emphasize its role as a gate between operating system internals and user applications, and because some of its parts are actually independent of any of the operating system parts described before.
    Thus, the system call layer provides an interface to user programs.

UNIX-like operating systems

A UNIX-like operating system normally follows the design described above (which basically originated in UNIX), although with some important additions:

  • the “everything is a file” concept. According to it, all entities in the system accessible from user code should be accessible as files. In other words, the OS not only provides an I/O abstraction allowing user programs to operate with files instead of the hard disk, but it should also treat the devices themselves (and any other entities like directories or pipes) as files accessible via the file system with the same set of system calls (although this idea has been violated historically, since a separate interface for sockets arrived).
  • processes are scheduled based on priority queues, with round-robin scheduling inside each of them (the classical algorithm in UNIXes).
  • modern UNIX systems also support kernel-visible threads inside each process (which are
    similar to processes but share the same virtual address space), although classical UNIXes did not support them.
  • a relatively small part of the system is written in assembly – only parts of the low-level layer and of system call handling (its entry gates). The rest should be written in a high-level language in order to ensure portability (typically C, which was designed for the purpose of writing UNIX).
  • virtual memory is normally page-based, without segmentation.
  • programs interact at the user level via pipes (they can be replaced with temporary files, and pipes themselves were once internally implemented in such a way, but now they are based on an in-memory shared buffer).
  • files are accessible as a stream of bytes – moreover, the OS does not provide any other treatment of them for user programs (unlike VMS, for instance).
  • directories are also files (see “everything is a file”). Typically they include 2 main kinds of information: inode numbers of the files “included” in the directory, and the respective file
    names (depending on the structure, additional information may be present, for instance in order to support variable-length names).
  • main file information is stored in a structure called “inode”, whose number uniquely identifies the file in the file system.
  • the file system should support hard links (namely, different directory entries referring to the same file inode; this is not the same as a symbolic link or a Windows shortcut, which is a separate file of a special type storing one of the full paths of the file it refers to).
  • two processor modes are used – kernel mode and user mode (although some architectures like x86 support 4).
  • a multiuser environment should be supported at the file system level, with users divided into groups independent from each other.
    Each file inode stores its UID, GID and permissions in the form of 9 logical bits rwxrwxrwx
    (physically each symbol can correspond to more than one bit).
    UID is the ID of the file owner, GID is the ID of the file group (and the file owner does not
    necessarily belong to this group).
    r means read access (and seeing the content in the case of a directory).
    w means write access (and removing/adding/renaming files in the case of a directory).
    x means execute access (and use as an intermediate node in filesystem paths in the case
    of a directory).
    The first 3 bits refer to the owner of the file, the second 3 bits refer to any other user in
    the file’s group, and the third 3 bits refer to all other users (a minimal sketch of this check in C is given right after this list).
  • the file system consists of logical units called blocks, each of them including an integer (and normally a power of two) number of disk sectors.
  • UNIX systems support an in-memory buffer cache for the file system,
    for 2 main reasons: ensuring that only one process at a time has access to a
    given file system block, and increasing efficiency, since if a block is in memory, there is no need to read it again from the disk.
  • UNIX systems support signals which allow the kernel to provide information to processes.
  • the only API for user programs provided by the kernel is the set of system calls (this derives from the monolithic kernel description above).
  • UNIX is a general-purpose system with general-purpose algorithms (although there have probably been some real-time implementations).
  • UNIX has a lot of small tools working together…
  • …and a specific hacking culture around them and UNIX itself (although the last two points are mostly user-oriented)
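As an illustration of the permission scheme above, here is a minimal sketch in C of such an access check (the names and layout are hypothetical, not taken from any particular kernel):

    /* Pick the owner, group or "other" rwx triple and test the needed bit. */
    #define R_BIT 04
    #define W_BIT 02
    #define X_BIT 01

    struct inode_perm {
        unsigned short uid, gid;   /* owner and group of the file */
        unsigned short mode;       /* 9 bits: rwxrwxrwx (owner, group, others) */
    };

    /* want is R_BIT, W_BIT or X_BIT; uid/gid describe the requesting process */
    int has_access(const struct inode_perm *ip, unsigned short uid, unsigned short gid, int want)
    {
        int shift;
        if (uid == ip->uid)
            shift = 6;             /* first triple: file owner */
        else if (gid == ip->gid)
            shift = 3;             /* second triple: users in the file's group */
        else
            shift = 0;             /* third triple: everyone else */
        return ((ip->mode >> shift) & want) != 0;
    }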

This list is approximate, as well as the definition of UNIX-like OS itself.
Let us now look at the implementation of a UNIX-like kernel, using my own simple OS ygg as an
example.

UNIX kernel implementation

This is an approximate diagram of the kernel we want to build (although I would place memory management separately from the process subsystem according to what has been stated above):

[Figure: Generic UNIX kernel]

Another important note is that this system is x86-specific, and a lot of code which could have been written in C was written in x86 assembly.
In this version of the post we will skip a complete discussion of the x86 architecture, except for some
small descriptions, since that topic is too complicated and deserves a separate post.

Kernel sources

Full sources of ygg are available here.
Here is their structure (excluding the user-level part of the repository and additional files), to
which I will refer:

ygg 
 |
 ├── bch (buffer cache)
 |   |
 |   ├── bch.h 
 |   └── bch.c
 |
 ├── boot (booting process: boot.S -> map.S -> main.c -> init.S)
 |   |
 |   ├── boot.S (bootloader)
 |   ├── init.S (first pseudo-process)
 |   ├── main.c (initialize all kernel subsystems, prepare the first pseudo-process,
 |   |		launch scheduler, which will launch it, the process then calls 
 |   |		exec() syscall, replacing itself with a real /bin/init)
 |   └── map.S (kernel entry, initial setting of virtual memory for kernel to operate)
 |
 ├── comm (common routines, just internal kernel library)
 |   |
 |   ├── comm.h (contains common operations and type definitions)
 |   ├── reg.S (contains procedures operating with x86 system registers to be accessible 
 |   |	       in C code)
 |   ├── reg.h 
 |   ├── str.S (string operations similar to those in string.h in the user-level C library)
 |   ├── str.h 
 |   ├── tm.c (time calculations)
 |   ├── tm.h 
 |   └── utsnm.h (system information)
 |
 ├── dev (device drivers)
 |   |
 |   ├── crt.h 
 |   ├── crt0.S (VGA 80x25 cursor operations)
 |   ├── crt1.c (VGA 80x25 symbol output)
 |   ├── ide.h 
 |   ├── ide0.S (IDE disk init, read and write)
 |   ├── ide1.c (IDE disk interrupts, request queue)
 |   ├── kbd.S (IBM PC keyboard)
 |   ├── kbd.h 
 |   ├── pic.S (interrupt controller)
 |   ├── pic.h 
 |   ├── rtc.h 
 |   ├── rtc0.S (BIOS realtime clock)
 |   ├── rtc1.c 
 |   ├── ser.S (serial port -- not implemented yet)
 |   ├── ser.h 
 |   └── tmr.S (100Hz timer for process scheduling)
 |   └── tmr.h 
 |
 ├── fs (file system)
 |   |
 |   ├── blk.c (block allocator, the lowest level)
 |   ├── blk.h 
 |   ├── dev.h (device files)
 |   ├── dir.c (directories)
 |   ├── dir.h 
 |   ├── fcntl.h (file control, both for kernel and user space)
 |   ├── fl.c (file table)
 |   ├── fl.h 
 |   ├── ind.c (inodes)
 |   ├── ind.h 
 |   ├── pth.c (pathnames parsing)
 |   ├── pth.h 
 |   ├── sblk.c (superblock)
 |   ├── sblk.h 
 |   ├── stat.h (file stats, both for kernel and user space)
 |   ├── sysfs.c (system calls)
 |   └── sysfs.h 
 |
 ├── int (x86 interrupt handling)
 |   |
 |   ├── crit.S (critical sections for the interrupt stack, so as not to lose 
 |   |		 interrupt requests while another interrupt is being processed)
 |   ├── intentr.S (entry point)
 |   ├── inthand.c (interrupt handler, invokes appropriate driver)
 |   └── intx86.h (interrupt frame and codes)
 |
 ├── io (higher level input-output)
 |   |
 |   ├── dbg.c (debugging messages inside the kernel)
 |   ├── dbg.h 
 |   ├── kpr.c (kernel printf, panic etc.)
 |   ├── kpr.h 
 |   ├── tty.c (terminal subsystem, some code is from xv6)
 |   └── tty.h 
 |
 ├── mm (memory management)
 |   |
 |   ├── alloc.c (kernel internal dynamic memory allocator)
 |   ├── alloc.h 
 |   ├── elf.h (ELF file standard)
 |   ├── mm.S (generic memory operations like in string.h of C standard library)
 |   ├── mm.h 
 |   ├── mmaddr.h (memory layout, macros can be changed to configure kernel)
 |   ├── page.h (x86 page geometry and page table entries)
 |   ├── pmm.c (physical memory allocator)
 |   ├── pmm.h 
 |   ├── seg.h (x86 segment descriptors)
 |   ├── sysmm.c (system calls)
 |   ├── sysmm.h 
 |   ├── vmm.c (virtual memory allocator)
 |   └── vmm.h 
 |
 ├── pp (pipes)
 |   |
 |   ├── pp.c 
 |   └── pp.h 
 |
 ├── prc (process management)
 |   |
 |   ├── mkprc.c (process creation and removal, preparation of the first process)
 |   ├── prc.h (process structure)
 |   ├── schd.c (scheduler)
 |   ├── sig.c (signals)
 |   ├── sig.h 
 |   ├── slp.c (sleep and wakeup)
 |   ├── swprc.S (register-level process switch)
 |   ├── sysprc.c (system calls)
 |   └── sysprc.h 
 |
 ├── syscl (system calls common code)
 |   |
 |   ├── callentr.S (entry point)
 |   ├── sysarg.S (fetch and invoke syscalls, for different argument count)
 |   ├── syscl.c (system call table)
 |   └── syscl.h 
 |
 ├── kern.ld (kernel linking script)
 └── mkfs.c (program to make file system image for ygg)

Bootloader

The bootloader is not a part of the operating system itself; it is a separate low-level program which is loaded from the boot medium by the computer's BIOS, sometimes performs some additional activities
(like the transition to protected CPU mode and setting up the initial stack), and loads the
kernel into memory.

ygg’s bootloader performs the transition to protected mode using an initial GDT (a memory management data
structure in x86 systems), and loads the kernel into memory.
The bootloader assumes that the filesystem superblock is stored in the next block right after itself (in other words, in block 1 on the disk), and that, without additional offsets, the superblock stores the kernel address (the kernel is stored as raw code separately from the filesystem, so that the bootloader avoids the additional complexity of parsing ELF binaries) and all information required to calculate the kernel size.
The bootloader also must be compiled with some headers from the kernel source tree included, since otherwise it would not know the size of an FS block.

The bootloader also includes a kind of mini-driver for IDE LBA28 hard disks, whose code is invoked to read the superblock and the kernel. This driver uses simple waiting loops and no interrupts, unlike the larger kernel disk driver.

[Figure: UNIX filesystem layout]

Initial setup

The kernel entry point is its initial mapper, which prepares the x86 virtual memory system for the kernel to use. After that it invokes main(), which initializes all kernel subsystems – one of them is the process subsystem, which prepares the first pseudo-process to run by artificially creating its return frame, pointing the instruction pointer to its code (embedded into the kernel) and marking it as READY.
After this main() finishes initialization and invokes the scheduler, which finds the only process in the process table and “returns” it to execution (actually launching its code for the first time).
The code of this process only invokes the exec() system call with the pathname of the real first process – /bin/init – as an argument.
In this way the first process is launched. /bin/init then opens the first user-level file descriptors (which will be accessible to its child processes) and invokes the shell, which is the main user interface to any UNIX system.
main() never returns, except when an error has occurred. The current process's code executes until one of the following happens:

  • its time quantum (set to 0.01 s) expires – in this case, because of the timer interrupt, the kernel code of the scheduler becomes active and chooses the next process to run
  • another interrupt arrives – in this case, after interrupt handling and unless the process goes to sleep, control flow returns to the current process code (if it goes to sleep, the scheduler is invoked)
  • it goes to sleep (because of initiating an I/O operation, for instance) – then the scheduler is invoked
  • it invokes a system call – in this case the situation is conceptually the same as with interrupts,
    except for the exit() system call, which causes the process to terminate.

If there are no processes in the process table, the scheduler simply stays in a waiting loop, occasionally switching to the timer interrupt handling code.

Memory management

x86 memory management relies on a few structures:

  • segment registers (CS, SS, DS, ES, FS, GS)
  • system registers (CR0, CR3, GDTR, IDTR and LDTR)
  • GDT (global descriptor table), IDT (interrupt descriptor table), multiple LDTs (local descriptor tables)
  • page directory and tables

This machinery was created to make possible the segmented-paged address space first introduced in the MULTICS operating system.
Segments are logical constructs which store content of a particular type with a particular protection (executable code, data etc.); they can be of different sizes, up to 4Gb each, and can overlap. Pages are fixed-size fragments of memory which can intersect with segments (a segment can start in the middle of a page) and which exist to make virtual address translation and memory allocation (and potentially swapping) easy. Typically x86 pages are 4K each (although some CPUs allow large 4Mb pages, and PAE adds more choices, but we will not cover those extensions here and ygg does not use any of them).

Now let us see how the actual addressing behaves in x86.
Let us assume that we have a 32-bit address inside a program, which in general
is generated according to the following rule: base + index*scale + offset, where offset is a 32-bit constant, base and index are registers, and scale is one of 1/2/4/8.
This address is the 32-bit part of a logical address, which also includes a 16-bit segment register value.
A segment register stores 3 configuration bits (1 of them selecting GDT/LDT, see below, and another 2 representing 1 of the 4 CPU privilege levels), and 13 bits storing the selector.
The selector is an index into either the GDT or an LDT (depending on the respective bit in the register), but I will cover only the GDT for simplicity.
The GDT is an in-memory table (although it is frequently cached inside the processor) storing 8-byte segment descriptors, whose format is shown below.

[Figure: x86 segment descriptor]

Each segment descriptor represents one particular segment, as described above, and stores its base linear address (see below) and its size limit. The GDT’s linear address itself, as well as its size, is stored in the 48-bit
register GDTR.
Once we have the pair of a 16-bit segment register value and a 32-bit software-generated address, which together form a 48-bit logical address, we obtain the linear address in the following way:

  • GDTR and selector together point to the appropriate segment descriptor
  • only if the privilege level stated in the segment register is sufficient (0 is the highest) compared to the one in the segment descriptor is further address formation allowed; otherwise the CPU raises an error (which is handled by the respective kernel part in a way similar to interrupt handling)
  • besides, only if the software-generated 32-bit address (which serves as an offset inside the segment)
    does not go beyond the stated limit is the address generation allowed; otherwise a CPU fault occurs
  • the segment descriptor base is added to the 32-bit address and the linear address is formed (a maximum of 4Gb can be represented with the 20-bit limit, because an additional flag in the segment descriptor switches to interpreting this limit as a count of 4K pages)

In this way a 32-bit linear address is created. Now, if paging is disabled (bit 31 in register CR0 is 0), then this address is the final physical address which is placed on the memory bus by the CPU.
Otherwise the CR3 register is used, which stores the physical address of the page directory.
Page directory entries point to page tables, and page table entries point to actual physical 4K pages. The page directory and each page table are 4K in size and contain 1024 4-byte entries, each entry consisting of a 20-bit base address (unlike segments, all pages are aligned on 4K boundaries, with the lower 12 bits of the address being 0x000) and 12 bits of flags (ygg uses only 3 of them – the presence bit, read-only/read-write access and user/kernel page).

x86 uses a two-level paging structure in order to avoid overly large page tables (with the same granularity of 4K physical pages, covering the 4Gb address space with a single page table of 4-byte entries would require a table 4M in size,
while with two levels a process using only one page table needs one page directory plus one page table, i.e. two 4K tables, or 8K in total, which is 512 times less).

[Figure: x86 page directory/table entry]

So, the 32-bit linear address is divided into 3 parts – a 10-bit directory part (the upper 10 bits of the address), giving the index into the page directory pointed to by the CR3 register,
a 10-bit table part (the middle 10 bits), giving the index into the page table pointed to by the page directory entry,
and a 12-bit offset, giving the required byte in the 4K page pointed to by the page table entry.
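As an illustration, here is a minimal C sketch (not the actual ygg code; the helper layout is made up) of how such a linear address is split and walked through the two paging levels, assuming the tables are directly accessible:

    /* Split a 32-bit linear address and walk the two-level x86 page tables. */
    #include <stdint.h>

    #define PTE_P 0x001                               /* "present" flag */

    /* pd points to the page directory (assumed directly accessible here) */
    uint32_t translate(uint32_t *pd, uint32_t linear)
    {
        uint32_t dir = (linear >> 22) & 0x3ff;        /* upper 10 bits: page directory index */
        uint32_t tbl = (linear >> 12) & 0x3ff;        /* middle 10 bits: page table index */
        uint32_t off = linear & 0xfff;                /* lower 12 bits: offset inside the 4K page */

        uint32_t pde = pd[dir];
        if (!(pde & PTE_P))
            return 0;                                 /* page table not present: would fault */

        uint32_t *pt = (uint32_t *)(uintptr_t)(pde & 0xfffff000); /* 20-bit base, 4K aligned */
        uint32_t pte = pt[tbl];
        if (!(pte & PTE_P))
            return 0;                                 /* page not present: would fault */

        return (pte & 0xfffff000) | off;              /* physical page base + offset */
    }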

The resulting 32-bit address is the physical address which is placed on the memory bus by the CPU.
The whole memory management diagram, starting from the logical address (omitting how the program itself forms its 32-bit part), is shown below:

[Figure: x86 memory management]

As we see, x86 always uses segmentation, but paging is optional.
However, we want the reverse! UNIX does not use segmentation – it is harder to handle, and it is harder to write compilers which generate the appropriate code. Besides, we definitely have to use paging (allocating memory with pages is very easy), and the combined structure is just too complex.
So, all UNIX implementations for x86 known to me (and ygg is no exception) follow the flat memory model, where segments are effectively pretended away – this is impossible from the x86 point of view, but their bases in the GDT are simply set to 0, and the limits to 4Gb. We still have to have different segments for code and data (this is another x86 requirement; for instance,
the CS register may point only to a code segment descriptor), but they completely overlap with
each other. Adding an offset to 0 gives the same offset, so the only translation which remains from the kernel's point of view (except for some initialization and low-level code) is paging,
which it enables.
In this way, the logical → (optional) linear → physical chain is reduced to
virtual → physical, where the virtual address coincides with the linear address and with the 32-bit part of the logical address.
Of course, the CPU does not know about this software point of view and still performs the complete translation (but the GDT is not changed and thus is always cached inside the CPU) – it would have been better if Intel engineers had added a bit to explicitly switch off segmentation.
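As an illustration of the flat model, here is a rough sketch of what the GDT entries amount to (the macro and structure names are made up, not ygg's actual seg.h): base 0, limit 0xFFFFF with 4K granularity, i.e. each segment covers the whole 4Gb space:

    #include <stdint.h>

    struct segdesc {
        uint16_t limit_lo;        /* limit bits 0..15 */
        uint16_t base_lo;         /* base bits 0..15 */
        uint8_t  base_mid;        /* base bits 16..23 */
        uint8_t  access;          /* present, DPL, type (code/data) */
        uint8_t  flags_limit_hi;  /* granularity, 32-bit flag, limit bits 16..19 */
        uint8_t  base_hi;         /* base bits 24..31 */
    } __attribute__((packed));

    /* base 0, limit 0xFFFFF, 4K granularity, 32-bit: the segment spans 0..4Gb-1 */
    #define FLAT_DESC(access) { 0xffff, 0x0000, 0x00, (access), 0xcf, 0x00 }

    struct segdesc gdt[] = {
        { 0 },                    /* mandatory null descriptor */
        FLAT_DESC(0x9a),          /* kernel code: present, DPL 0, executable/readable */
        FLAT_DESC(0x92),          /* kernel data: present, DPL 0, writable */
        FLAT_DESC(0xfa),          /* user code: present, DPL 3, executable/readable */
        FLAT_DESC(0xf2),          /* user data: present, DPL 3, writable */
    };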

After having understood this let us return to ygg memory management.

Memory layout of a process in ygg is as follows:

  • 4Gb total available (both for current user-level code and kernel)
  • page size is 4K (large pages are not supported yet).
  • user code goes from address 0, followed by read-only data, writable data and the BSS section (which is not stored in the executable file but is allocated when the process is loaded into memory).
    After that comes dynamic memory (the heap), allocated by requests to the kernel.
  • the kernel goes from the address KBASE (typically configured to be 3Gb) up to the highest address 0xffffffff; its layout is similar to that of a user process.
  • user and kernel stacks grow downward from the addresses KBASE and 4Gb - 1 respectively.
  • neither shared memory nor swapping is supported yet.
  • kernel pages are protected from user access.

[Figure: Process memory layout in Linux (in ygg user code starts from 0)]

Intentionally, no x86 segmentation is used – the x86 processor still requires it, but we simulate its absence by making all segments start at address 0 and end at address 4Gb-1.

ygg uses four memory managers, some of which depend on the others:

  • physical memory allocator, based on a doubly-linked list of free memory, where each node stores
    the first physical address of the respective free block, its size in 4K x86 pages and two pointers to the next and previous elements of the list (a minimal sketch of such a node is given after this list):
    [Figure: Each object is a memory block]
    The address is not a C pointer, since it cannot point to unused physical memory; it is just a number
    which the operating system treats as a physical address and which can afterwards be written into a page table, as expected by the x86 architecture, together with a corresponding virtual page address mapping (the process may then operate with those mapped virtual addresses, seeing them as new available memory in its virtual address space). This is the lowest-level allocator, not invoking any other one.

  • virtual memory allocator, which does not have any special data structure and just operates with x86 page tables as expected by the processor.
    [Figure: Two-level translation in x86 – page directory and table]

    It invokes the physical allocator when performing a mapping in an x86 table.

  • kernel dynamic memory allocator, which in turn invokes the virtual allocator and maintains a doubly-linked list of free memory. It is not implemented yet.

  • the “allocator” in the brk() and sbrk() system calls. It simply increases the amount of virtual memory available in user space by performing additional mappings (and thus invoking the virtual allocator).
    It is the foundation for C standard library functions such as malloc().
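As a small illustration of the first allocator, here is a minimal sketch (with hypothetical names, not the actual ygg structures) of a node in its doubly-linked free list:

    #include <stdint.h>

    struct free_block {
        uint32_t phys_addr;        /* first physical address of the free block (a number, not a pointer) */
        uint32_t npages;           /* block size in 4K pages */
        struct free_block *next;   /* next free block in the list */
        struct free_block *prev;   /* previous free block in the list */
    };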

Process management

A ygg process can be in 5 states (a 6th state is basically the absence of a process in a slot of the process table):

  • new – the process has just been created but cannot be launched yet, because its data structures are not filled in yet (in this way the mkprc() function marks the slot as already in use, but the scheduler cannot launch a half-ready process)
  • running – the process is currently being executed by the CPU (either in user or kernel mode)
  • ready-to-run – the process is not running but can be launched now
  • sleeping – the process is not running and cannot be launched now (most commonly it is waiting for some I/O transaction to finish)
  • zombie – the process has finished, but its slot in the process table remains so that its parent can obtain its return code and release all its remaining resources

A finite automaton representing the behaviour of a process in the system is shown below:
[Figure: 5 states of a process]
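A minimal sketch (hypothetical names) of the corresponding state enumeration, with the “empty slot” value serving as the 6th state:

    enum prc_state {
        PRC_UNUSED,   /* empty slot in the process table */
        PRC_NEW,      /* created, data structures not yet filled in */
        PRC_RUNNING,  /* currently executing on the CPU */
        PRC_READY,    /* ready-to-run, waiting to be scheduled */
        PRC_SLEEPING, /* waiting for an event (typically I/O) */
        PRC_ZOMBIE    /* finished, waiting for the parent to collect the exit code */
    };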

ygg does not support swapping to/from disk and does not distinguish between running in user or kernel mode in the process table.

ygg’s process management algorithm is currently very simple, implementing round-robin scheduling.
However, I am going to implement something similar to the 4.3BSD scheme, where there are a few priority queues, and only the processes from the highest-priority non-empty queue are executed in a round-robin manner; after using up the time allocated per queue, each process moves to the next lower queue, and periodically (with a period larger than any time quantum and a multiple of it) all processes are moved back to the highest-priority queue.
This is an updated version of an approach without per-queue time variables,
where instead each process that has used up its time quantum moves to the next lower priority queue,
and a process which released the CPU before finishing its quantum (in particular,
when initiating an I/O transaction) remains in the same queue.
The idea is the same – let the processes which rarely use the processor obtain it, in order to avoid starvation, while at the same time providing enough time for high-priority processes by executing only the processes from the highest-priority non-empty queue – but in the updated approach the well-known hack becomes impossible, where a process knowing the quantum size and able to measure its execution
time uses almost the whole quantum and then initiates I/O, remaining in the same queue and thus monopolizing the CPU.
Also, the algorithm can be extended by giving different quantum sizes to different queues (smaller ones for the highest priorities, in particular) – quantum sizes for lower-priority queues
also matter, because there can be a situation when there are no processes in the higher-priority queues (or they are all sleeping).
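For reference, here is a minimal sketch (hypothetical names, not the actual schd.c) of the simple round-robin selection ygg currently performs – starting from the slot after the previously scheduled process, pick the first ready one:

    #define NPRC 64                          /* assumed process table size */

    struct prc { int state; /* ... */ };
    extern struct prc prctab[NPRC];
    extern int prc_ready(struct prc *p);     /* assumed predicate: is this slot ready-to-run? */

    static int last;                         /* index of the previously scheduled slot */

    struct prc *pick_next(void)
    {
        for (int i = 1; i <= NPRC; i++) {
            int idx = (last + i) % NPRC;     /* walk the table circularly */
            if (prc_ready(&prctab[idx])) {
                last = idx;
                return &prctab[idx];
            }
        }
        return 0;                            /* no runnable process: the caller keeps waiting */
    }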

Besides, ygg supports signal handling (one of the means of interprocess interaction), but no message passing.

[Figure: Scheduling based on priority queues]

Sleep and wakeup on a given event are also implemented (the event is represented by the address of the object related to it; when something which can be considered a release of this object happens, all processes waiting for it are woken up, and each of them can then, depending on the situation, check whether it was indeed the required case, returning to sleep otherwise).
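A minimal sketch (hypothetical names, not the actual slp.c) of such sleep/wakeup on a channel, the channel being just the address of the awaited object:

    #define NPRC 64
    enum { PRC_READY = 1, PRC_SLEEPING = 2 };          /* just the states needed here */

    struct prc { int state; void *wchan; };
    extern struct prc prctab[NPRC];
    extern struct prc *curprc(void);                   /* assumed: the current process */
    extern void sched(void);                           /* assumed: switch to the scheduler */

    void sleep(void *chan)
    {
        struct prc *p = curprc();
        p->wchan = chan;                               /* remember what we are waiting for */
        p->state = PRC_SLEEPING;
        sched();                                       /* give up the CPU; returns after wakeup() */
        p->wchan = 0;
    }

    void wakeup(void *chan)
    {
        for (int i = 0; i < NPRC; i++)                 /* wake every process waiting on chan */
            if (prctab[i].state == PRC_SLEEPING && prctab[i].wchan == chan)
                prctab[i].state = PRC_READY;
    }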

Buffer cache

The buffer cache is a structure which contains buffers for file system blocks.
Each request for a file system block first looks into the buffer cache, and in many cases the content of that block is already in some buffer.
In this way the buffer cache serves two main goals:

  • it ensures consistency of the file system (namely, no two processes can operate on the same file system block at once, since the respective buffer is locked during the operation because of the buffer cache organization).
  • it increases the system's efficiency, since no redundant disk transactions are performed.

The buffer cache is organized as a hash table with a free list of buffers crossing its elements.
The hash table is not strictly necessary – it is just a way to reduce the average search time,
provided that the hashing rule produces a definite hash queue index from a given FS block number.

When a request for a given FS block is performed, 4 scenarios are possible:

  • the buffer for the block is in the buffer cache and is being used by another process. In this case the current process goes to sleep until the buffer is released.
  • the buffer for the block is in the buffer cache and is free. In this case the current process can take and lock it.
  • the buffer for the block is not in the buffer cache and there are no free buffers in the list. In this case the current process goes to sleep until the first buffer appears in the free list.
  • the buffer for the block is not in the buffer cache and there is a free buffer in the list. In this
    case the current process takes and locks the first free buffer, removes it from the free list,
    places it in the appropriate hash queue and reads the appropriate FS block into it.
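These four scenarios can be sketched in the style of the classical getblk() algorithm (the names below are hypothetical, not the actual bch.c):

    struct buf { int busy; /* ... data, block number, links ... */ };

    extern struct buf *hash_lookup(unsigned int blkno);   /* find the buffer for a block, if cached */
    extern void freelist_remove(struct buf *b);
    extern int  freelist_empty(void);
    extern struct buf *freelist_take_first(void);
    extern void hash_move(struct buf *b, unsigned int blkno);
    extern void disk_read(struct buf *b, unsigned int blkno);
    extern void sleep(void *chan);
    extern struct buf *freelist;                           /* used only as a sleep channel here */

    struct buf *getblk(unsigned int blkno)
    {
        struct buf *b;

        for (;;) {
            if ((b = hash_lookup(blkno)) != 0) {           /* block already has a buffer */
                if (b->busy) {                             /* scenario 1: used by another process */
                    sleep(b);                              /* wait until it is released */
                    continue;                              /* re-check everything after wakeup */
                }
                b->busy = 1;                               /* scenario 2: take and lock it */
                freelist_remove(b);
                return b;
            }
            if (freelist_empty()) {                        /* scenario 3: no free buffers at all */
                sleep(&freelist);
                continue;
            }
            b = freelist_take_first();                     /* scenario 4: reuse a free buffer */
            b->busy = 1;
            hash_move(b, blkno);                           /* place it into the right hash queue */
            disk_read(b, blkno);                           /* read the FS block into the buffer */
            return b;
        }
    }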

Buffer cache functions invoke the IDE driver interface to perform the actual read/write operations.

[Figure: Buffer cache]

File system

The ygg file system is organized in layers. Let us cover them in a bottom-up order.

The block layer is basically responsible for finding and allocating FS blocks, which are tracked in the block bitmap located after the superblock. Its code invokes buffer cache functions.

The inode layer is basically the on-disk file implementation layer.
An inode (the name can be read as “index node”) is the core data structure representing a file's attributes and content. Inodes are stored in a fixed-size array located after the bitmap.
An inode normally includes the following:

  • byte size of the file (it can also include a redundant block count, but ygg does not have it)
  • type of the file – regular, directory, character or block device, symbolic link, named pipe, or a free-inode mark (it can also be a socket, but ygg does not support them)
  • hard link count (see below)
  • user ID, group ID and 9 bits of permissions (it can also include the SUID, SGID and sticky bits, but ygg does not have them)
  • 3 timestamps – file last read, file last write, inode last write
  • (additionally in ygg) timestamp of file creation
  • (additionally in ygg) sort bit for directories (see below)
  • disk addresses (typically 13: 10 direct, one indirect, one double indirect and one triple indirect)

However, if the type of a file is character or block device, the inode includes its major number (representing the type of device) and minor number (representing a particular device of the given type) instead of some of the fields above (the block addresses in ygg), and if the file is a symbolic link, the inode stores its content in itself instead of the block addresses.
Inode size in ygg is 128 bytes.
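A minimal sketch of an on-disk inode with the fields listed above (the exact layout is hypothetical and does not add up to ygg's 128 bytes):

    #include <stdint.h>

    #define NADDR 13                     /* 10 direct + indirect + double + triple indirect */

    struct dinode {
        uint16_t type;                   /* regular, directory, device, symlink, pipe, or free */
        uint16_t nlink;                  /* hard link count */
        uint16_t uid;                    /* owner user ID */
        uint16_t gid;                    /* owner group ID */
        uint16_t mode;                   /* 9 permission bits rwxrwxrwx */
        uint16_t flags;                  /* ygg addition: e.g. the directory sort bit */
        uint32_t size;                   /* file size in bytes */
        uint32_t atime, mtime, ctime;    /* last read, last write, inode last write */
        uint32_t crtime;                 /* ygg addition: file creation time */
        uint32_t addr[NADDR];            /* block addresses (or major/minor for devices) */
    };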

Note that the inode does not store the file name. In UNIX, file names do not uniquely represent the respective files. Instead, a file name is simply a kind of link inside a directory which points to a particular inode. Multiple file names can point to the same inode, and only the inode is what actually represents the file. Each file name, however, corresponds to a particular hard link,
whose count is stored in the inode. After each removal of the file from a directory (see below) this count is decreased, and only when it reaches zero (namely, there are no paths to this inode anymore) is the file removed, by marking all its data blocks free in the FS bitmap and marking its inode free as well (the actual file content is not erased – it is still on the hard disk, in data blocks without links, so theoretically it can be restored using appropriate tools or by manually fixing the FS image).

The disk addresses store block numbers of the blocks actually keeping the file content and – in the case of indirect blocks – keeping more disk addresses of the respective level of indirection.
In this way the address information in the inode table is kept fixed-size (and thus it is possible to find the inode for a given file in constant time) and at the same time it does not have to be large in order to support large files (small files just do not use the indirect addresses).

There is also an in-core inode table kept in memory, whose elements have additional fields and which operates by rules similar to those of the buffer cache.

[Figure: UNIX on-disk inode (a bit different from ygg)]

The next layer is the directory layer.
In UNIX a directory is simply a file storing a list of other files (“included” in it).
It contains file names (which are parts of pathnames) and their respective inode numbers.
Depending on the particular filesystem, additional information can be present (for instance, to support variable-length names).
ygg uses the classical approach with a 2-byte inode number (which limits the number of
files in a file system to 65536) and a 14-byte name (which limits a file name to 14 symbols), with a zero inode number representing an empty entry (and also making the first inode number in the file system 1, normally corresponding to /) and the first zero byte in the name field marking the end of the name.
ygg has its own feature – a sort bit in the inodes corresponding to directories – and the respective system call drsrt(), which is explicitly invoked by user programs in order to sort directory content using the quicksort algorithm (unlike the ls program, which just presents directory content in a sorted way, sorting it in user space), so as to decrease search time.
If this bit is set, ygg will perform binary search instead of linear search while parsing pathnames (see below).
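A minimal sketch (hypothetical names) of the classical 16-byte directory entry format described above:

    #include <stdint.h>

    #define DIRNAMELEN 14

    struct dirent {
        uint16_t inum;              /* inode number; 0 means an empty entry */
        char name[DIRNAMELEN];      /* file name, terminated by the first zero byte
                                       (not terminated at all if exactly 14 characters long) */
    };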

[Figure: Directories in UNIX]

The next layer is the pathname layer.
Basically, each system call which receives a pathname in symbolic form (for instance,
“/etc/passwd”) should transform it into an in-core inode to work with (almost all inode operations are performed with in-core inodes, which serve as a buffer cache for on-disk inodes).
For this the namei() function checks whether the pathname starts with / or not – if not, the search must be performed from the inode corresponding to the current process directory (a pointer to it is stored in the current process table slot). Otherwise the path is absolute, and the kernel first checks the pathname cache, organized as a hash table where the inode number may already be present (not just because a hash table allows constant-time search in the best case, but, more importantly, to avoid an on-disk search when it can be performed in memory); if the pathname is not there, the search starts from inode 1 (the inode for /).
The kernel does not search the cache in the case of a relative path, because the cache is independent of the current process (this could be improved by providing per-process hash tables or by making the hash rule depend on the process ID).
In both cases of searching (either from / or from the current directory, which may also be / anyway),
the kernel simply takes each next token from the pathname, searches for it in the content of the current directory (corresponding to the current inode) – using either linear or binary search –
and, after finding it, takes its inode number and switches the current directory to the new inode.
This continues until either the file is found (and the pointer to its in-core inode is returned) or it turns out not to exist.
Each directory also contains the entries “.” and “..” (for the directory itself and its parent directory); “..” of the / directory also points to /.
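A minimal sketch (hypothetical names, not the actual pth.c) of this token-by-token lookup, for absolute paths only and without the pathname cache:

    struct inode;                                          /* in-core inode, details omitted */
    extern struct inode *iget(unsigned int inum);          /* assumed: fetch an in-core inode */
    extern void iput(struct inode *ip);                    /* assumed: release an in-core inode */
    extern unsigned int dir_lookup(struct inode *dir, const char *name);  /* 0 if not found */
    extern const char *next_token(const char **path, char *name);         /* next path component */

    struct inode *namei(const char *path)
    {
        struct inode *ip = iget(1);                        /* inode 1 corresponds to / */
        char name[14 + 1];

        while (next_token(&path, name) != 0) {             /* for each pathname component */
            unsigned int inum = dir_lookup(ip, name);      /* linear or binary search in the dir */
            if (inum == 0) {                               /* component not found */
                iput(ip);
                return 0;
            }
            iput(ip);
            ip = iget(inum);                               /* descend into the next inode */
        }
        return ip;                                         /* in-core inode of the target file */
    }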

The next layer is the open file layer. Here there is a global open file table, each entry of which stores a pointer to the respective entry of the in-core inode table, an offset in the file (where the next read or write operation will be performed) and the number of references to this entry (which represents the number of per-process pointers to it, see below; similarly, the number of references from the global file table is stored in each in-core inode table entry).
The offset cannot be stored in the in-core inode table because each inode (each file) can be opened by different processes working with it at different positions (offsets).
The file table itself cannot simply be scattered into per-process open file tables because some processes must share a file position (for instance, each child process shares file positions with its parent).

Finally, the highest layer is the file descriptor layer.
Here each process has a table of open file descriptors, which are basically pointers to the global file table described above.
An index into this per-process file table is exactly what is returned to the user by the open() system call,
and what is passed to other file system system calls.
This index is the abstraction with which the user operates.
The so-called stdin, stdout and stderr (standard input, output and error descriptors) in UNIX user-level programs are basically such indexes, by convention equal to 0, 1 and 2 respectively (but this is only a user-level convention not known to the kernel itself, and any other numbers could be used as well).
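A minimal sketch (hypothetical names and sizes) of these three levels – per-process descriptor table, global file table, in-core inode table:

    #define NOFILE 16                  /* assumed: open files per process */
    #define NFILE  64                  /* assumed: size of the global file table */

    struct inode;                      /* in-core inode, details omitted */

    struct file {
        struct inode *ip;              /* the inode this open file refers to */
        unsigned int offset;           /* position of the next read/write */
        int refcount;                  /* number of per-process descriptors pointing here */
    };

    struct file ftab[NFILE];           /* global open file table */

    struct prc_files {
        struct file *ofile[NOFILE];    /* per-process table; the fd returned by open()
                                          is just an index into this array */
    };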

[Figure: Relations between per-process file descriptor tables, the global file table and the in-core inode table in UNIX]

Device drivers

Device drivers are the least generic part of the kernel, each of them having a lot of specific code related not to UNIX but to the device itself, so I will cover them very briefly.

Theoretically, there are three main ways of implementing device drivers (plus a trivial fourth one, listed last):

  • waiting loop – this approach is purely software-based by its nature. The CPU simply executes a loop in which it checks the device controller's readiness to receive or send data. In this way a significant amount of CPU time is spent just on this checking (in the extreme case – all of it).
  • interrupts – this is the most common approach, based on the device sending interrupt signals to the CPU (and because of that it must be supported by hardware), the CPU switching from the current code execution to interrupt handling and restoring the situation afterwards. The CPU spends time on context switches, but on average it does not waste as much time as in the first approach.
  • DMA controller – this approach is also hardware-based. It introduces an external device –
    the direct memory access controller – which performs the same waiting, but instead of the CPU. The CPU is interrupted only twice – by interrupts at the beginning and at the end of the whole transaction. The DMA controller requires its own driver to operate.
  • simple one-time reading or writing of device registers or mapped memory (see below), without waiting for any data to arrive or for the device to become ready.

ygg uses interrupt handling in all cases, except VGA, RTC and PIC (VGA and PIC drivers follow the last approach, and RTC driver follows the first one).

Independently of these approaches, the input/output transactions themselves can be performed either via ports (which requires separate instructions in the CPU instruction set, like IN and OUT) or via the
memory address space (in this approach device control registers are mapped into some part
of physical memory, and that part of memory must not be used by other code or data, but no special instructions are required).

ygg uses the port approach for all devices except the VGA monitor, which is combined (control is performed via ports, while the actual 80x25 symbol output is performed via a specific physical memory region).
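For illustration, here is a sketch of port I/O wrappers around the IN and OUT instructions using GCC inline assembly (ygg keeps such low-level helpers in its assembly files instead):

    #include <stdint.h>

    static inline uint8_t inb(uint16_t port)
    {
        uint8_t data;
        __asm__ volatile ("inb %1, %0" : "=a"(data) : "d"(port));  /* read one byte from the port */
        return data;
    }

    static inline void outb(uint16_t port, uint8_t data)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(data), "d"(port)); /* write one byte to the port */
    }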

ygg supports drivers for the following devices:

  • programmable interrupt controller (PIC) – this is the main device making interrupt handling for multiple devices possible and flexible. Its output pin is connected to the CPU interrupt input pin,
    and it has 8 inputs. However, since the number of devices on a motherboard is larger,
    the IBM PC normally has one of these input pins connected to the output pin of another PIC, forming a 2-level cascade structure supporting 15 input devices.
  • IDE HDD – ygg supports IDE disks with LBA28 addressing only, although I hope they can be emulated by SATA or USB devices as well.
  • IBM PC keyboard – ygg supports the IBM PC keyboard, translating its scancode map to a set of ASCII symbols which are afterwards sent to the monitor output.
  • text mode VGA 80x25 – this is the typical output mode in which most PCs operate right after start-up, before switching to a graphical pixel-based mode. Old UNIX terminals had a similar output mode.
  • system timer – this is a device which generates digital signals with a given frequency, which in ygg's case is set to 100Hz. After each such signal the scheduler is invoked in order to consider the next process to run, ensuring multitasking in this way.
  • BIOS real-time clock – this circuit stores continuously updated date and time information which does not vanish while the computer is off, because it is powered by a separate battery. ygg reads this information only once after each launch in order to obtain the date and time, converting it into UNIX time – the number of seconds since midnight 01.01.1970 – which is then advanced by one every
    100 x86 timer ticks (see above) and written into file inodes.
    This notion of time is platform-independent and common to all UNIX systems, and x86 timer ticks are one of the ways of implementing it on the PC platform.
I will also add a serial port driver, and perhaps something else.

System call handling

Finally, let us cover system call handling.

Basically, here we have four main code areas:

  • the user program which invokes the system call
  • a user-level library written in assembly language which performs the system call (it cannot be performed simply by calling a function, but only via a special processor instruction, because a simple procedure call cannot change the processor mode and switch to the kernel) – this library is rewritten for each architecture on which the system should run, and it contains functions with names similar to those of the syscalls (so when a user program invokes syscall name(), it actually invokes name() from this library, not from the kernel).
    This function invokes the actual syscall by placing its number and arguments into registers (a stack approach is also possible) and executing that special instruction (a software interrupt in most UNIX implementations, and the long call x86 instruction – lcall – in ygg)
  • the kernel part where the syscall number is checked and the appropriate syscall function from the respective kernel subsystem (file system, memory manager or process manager) is invoked
  • the syscall implementation itself inside the kernel
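A minimal sketch (hypothetical names, not the actual syscl.c) of the kernel dispatch step from the third item above – the entry code has already saved the syscall number, and the kernel looks up a handler in a table:

    #define NSYSCALL 32

    extern int sys_exit(void), sys_fork(void), sys_read(void), sys_write(void); /* example handlers */

    static int (*syscalls[NSYSCALL])(void) = {
        /* [SYS_exit]  = */ sys_exit,
        /* [SYS_fork]  = */ sys_fork,
        /* [SYS_read]  = */ sys_read,
        /* [SYS_write] = */ sys_write,
        /* ... */
    };

    int syscall_dispatch(int num)
    {
        if (num < 0 || num >= NSYSCALL || syscalls[num] == 0)
            return -1;                 /* unknown syscall number */
        return syscalls[num]();        /* the handler fetches its own arguments */
    }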

The process is shown on the diagram below.

[Figure: Read system call handling in UNIX]

Screenshots

Of course, this is only an approximate description which I will evolve later, but it provides a brief overview of the generic monolithic and UNIX-like operating system concept and of a UNIX implementation.

Here you can see some screenshots of ygg launched in a QEMU virtual machine:

[Screenshot: ygg just launched]
[Screenshot: ls output for /; you can see permissions, hard link counts etc.]

[Screenshot: An example of some UNIX programs operating]

And now some questions for you (based on ls output shown above):

  • why are there 11 hard links for /, and 2 hard links for each directory on the next level except the last two?
  • why is the inode number for /var 13 and not 10?
  • how many application programs does ygg have (assume that all of them, and only they, are stored in /bin,
    that the byte size of each file is shown in the 7th column, and that the directory layout has been described above)?
  • (repeat) why is the inode number for / 1 and not 0?

Materials

Here are the books and other materials I studied in order to reach my current level of understanding of operating systems in general and of deeper UNIX kernel topics in particular:

P.S. By the way, ygg is named after Yggdrasil, the world tree in Norse mythology.