Emerald OS
This is an operating system that I'm building for fun and learning in Rust.
Please check it out, and if you have any questions, feel free to ask, open an issue, fix a bug, etc...
Running
If you don't want to build the project, you can download the latest artifacts from:
You get an ISO file containing the kernel, and a compressed filesystem directory containing the userspace programs.
The following command is what we normally use to run the OS, but it can be run by any VM with some setup:
qemu-system-x86_64 -cdrom <kernel.iso> -serial mon:stdio -m 512 -boot d -drive format=raw,file=fat:rw:<filesystem>
where <kernel.iso> is the path to the ISO file, and <filesystem> is the path to the decompressed filesystem directory.
Some extra info:
- -serial mon:stdio redirects the serial output to the terminal.
- -m 512 is the amount of memory to allocate for the VM, 512MB.
- -boot d boots from the CD-ROM we just loaded.
- -drive format=raw,file=fat:rw:<filesystem> passes the filesystem directory to the kernel as a disk. Here we use a QEMU feature, virtual FAT, where QEMU treats the directory as a FAT filesystem and passes it to the kernel as a disk.
Building
The whole build and packaging process is done by xtask.
The ISO file can be used to run on other VMs/hardware (not tested).
For building the ISO image, you can use make, but you need to have other dependencies installed to build and run the ISO:
xorriso mtools grub-pc-bin qemu-system-x86
Build kernel iso:
cargo xtask build-iso
Building userspace programs
This builds the userspace programs into the filesystem directory (used by QEMU):
The userspace programs are built using a custom rust toolchain (see more info here).
There are 2 options to build our userspace programs, and in general any other program.
Using the prebuilt toolchain
We distribute a prebuilt toolchain in:
- toolchain.zip, which you can install with:
bash tools/install_toolchain_and_link.sh <path_to_toolchain.zip>
This will install the toolchain into extern/toolchain and link it to rustup as emerald.
Then, xtask will use the installed toolchain to build userspace programs; if it's not installed, it will give an error.
cargo xtask userspace build
Building the toolchain
We don't build the toolchain automatically, i.e. if you don't have the toolchain, you can build it yourself from source if you don't want to install the prebuilt one.
cargo xtask toolchain
If you want to build and install from source into the extern/toolchain directory (you can then use rustup toolchain link ... to link to this folder):
cargo xtask toolchain --install
Building and running
To build and run kernel and userspace programs:
cargo xtask run
You need to have qemu-system-x86_64
installed.
Debugging
You can use gdb or lldb to debug this, but I have included vscode configs to enable easy debugging with the CodeLLDB extension.
To boot QEMU in debug mode you can use the following (it will wait for a debugger on port :1234):
cargo xtask run --gdb
Kernel
Link to the Kernel rust documentation
Here we explain the kernel and its components in detail.
From Boot, to drivers, filesystem, processes, memory management, and more...
Boot
What happens during boot?
This OS doesn't include a bootloader, the kernel is being loaded by default using grub
using multiboot2
.
The kernel
crate will be compiled as an ELF
file implementing the multiboot2
specification which can
be loaded by any bootloader that supports it.
For example, in grub we use the multiboot2
command to load the kernel.
menuentry "Kernel" {
insmod all_video
multiboot2 /boot/kernel # the kernel ELF file
boot
}
The kernel ELF file is in elf64-x86-64 format, but we start in x86 protected mode (32-bit) and then switch to long mode (64-bit) using assembly code in boot.S.
Initial boot, protected mode.
We start in assembly: since we are using rust and we target x64 for compilation, we have to add the 32-bit code manually using assembly.
The boot.S file is included in main.rs using the global_asm! macro, and we use the .code32 directive to generate 32-bit code.
The initial boot code in assembly performs the following:
- Setup initial page tables (required for 64-bit mode).
- Setup a basic GDT (Global Descriptor Table) (required for 64-bit mode).
- Switch to 64-bit long mode.
- Jump to the kernel_main entry point in main.rs.
The kernel ELF file is loaded at 0x100000 physically, and 0xFFFFFFFF80100000 virtually.
The virtual address 0xFFFFFFFF80000000 will map to the physical address 0x0 initially.
Note: you may notice that the code subtracts 0xFFFFFFFF80000000 a lot, such as in lgdt [gdtr64 - 0xFFFFFFFF80000000]. This is because we don't have virtual memory yet, so we are operating with physical memory, while the linker uses 0xFFFFFFFF80000000 as the base address. But since we know it maps to 0x0 physically, we can subtract the base to convert to physical addresses.
Multiboot2 header
We specify some information that we want from the bootloader in the multiboot2 header:
- Address: load address information, which includes the start and end of the kernel image, and bss_end, which is the end of the .bss section.
- Entry point: the entry point of the kernel, which is the entry function in boot.S, i.e. where execution will start. This is executed in 32-bit mode.
- Module alignment: modules will be page aligned (maybe not needed, as I thought it affected the kernel alignment, but it looks like it's for modules).
After that, execution will start at entry, where we check that the value in EAX is equal to the special multiboot2 magic value, just to make sure we are correct.
Switching to long mode
Then, we check that long mode is supported on the machine by making sure that the PAE feature is reported in CPUID:0000_0001 and the LM feature is reported in CPUID:8000_0001.
If it's not supported, we display an error and loop forever.
# check for PAE
mov eax, 0x00000001
cpuid
test edx, CPUID_FEAT_EDX_PAE # (1 << 6)
jz not_64bit
# check for long mode
mov eax, 0x80000001
cpuid
test edx, CPUID_FEAT_EX_EDX_LM # (1 << 29)
jz not_64bit
If all is good, we setup some basic page tables as follows
Initial page tables
We map the first 128MB of physical memory into two ranges:
- [0x0000000000000000..0x0000000007FFFFFF], a 1:1 mapping which gives us easy access and is required when we switch to 64-bit mode.
- [0xFFFFFFFF80000000..0xFFFFFFFF87FFFFFF], which is where the rust kernel ELF file is loaded, and all its addresses reference this range.
The structure of the page table is as follows:
- [0x0000000000000000..0x0000000007FFFFFF] - 1:1 mapping with the physical pages
- This will be:
- PML4[0]
- PDPT[0]
- PDT[0..63] # 2MB each
- [0xFFFFFFFF80000000..0xFFFFFFFF87FFFFFF] - the kernel ELF file virtual address space
- This will be:
- PML4[511]
- PDPT[510]
- PDT[0..63] # 2MB each (shared with the above)
The initial page tables are stored at boot_page_tables, a region of memory in the .boot_page_tables section (inside the .data section) that fits 4 page tables, each 4KB in size.
The usage of boot_page_tables is as follows:
PML4 (boot_page_tables[0])
PML4[0] -> PDPT-A (boot_page_tables[1])
PML4[511] -> PDPT-B (boot_page_tables[2])
PDPT-A[0] -> PDT (boot_page_tables[3])
PDPT-B[510] -> PDT (boot_page_tables[3]) // same PDT as above
And then PDT[0..63] is shared between the two and maps the first 128MB of physical memory.
Then, CR3 is set to the address of PML4, and CR4 is set to enable PAE.
GDT (Global Descriptor Table)
We setup a very basic GDT that contains kernel code and data segments (even though the data segment is probably not needed).
.align 16
gdt64:
.quad 0x0000000000000000 # null descriptor
# Code segment (0x8)
.long 0x00000000 # Limit & Base (low, bits 0-15)
.byte 0 # Base (mid, bits 16-23)
.byte GDT_CODE | GDT_NOT_SYS | GDT_PRESENT | GDT_WRITE # Access
.byte GDT_LONG_MODE # Flags & Limit (high, bits 16-19)
.byte 0x00 # Base (high, bits 24-31)
# Data segment (0x10)
.long 0x00000000 # Limit & Base (low, bits 0-15)
.byte 0 # Base (mid, bits 16-23)
.byte GDT_NOT_SYS | GDT_PRESENT | GDT_WRITE # Access
.byte 0x00 # Flags & Limit (high, bits 16-19)
.byte 0x00 # Base (high, bits 24-31)
I prefer it this way, but others just use quad static values. This won't change at all, so either is fine.
Then, we load the GDT using the lgdt instruction.
lgdt [gdtr64 - 0xFFFFFFFF80000000]
Switch to long mode
We need to do the following:
- Setup page tables.
- Enable PAE in CR4.
- Enable PG (Paging) and PE (Protection Enable) in CR0.
- Enable long mode in the EFER MSR:
  mov ecx, EFER_REG   # (0xC0000080)
  rdmsr
  or eax, EFER_LME    # (1 << 8)
  wrmsr
- Setup the GDT.
- Jump to kernel_main in main.rs.
But before we go to kernel_main directly, we jump to a location in boot.S (kernel_main_low), where we setup the data segment and the stack, and forward the multiboot2 information to kernel_main.
The stack used is 512 pages, i.e. 2MB in size, and is located in the .stack section, which is inside the .data section.
The stack is a bit large; we don't require that much stack most of the time, but it's needed like this due to our recursive AML (ACPI) parser, which we should improve.
We also setup a guard page for the stack that is unmapped later using the more advanced virtual memory setup.
Then, we jump to kernel_main in main.rs, and it's all rust from here.
CPU state in rust
When we jump into rust, here is the state of the CPU (do note, that the kernel may change the state later on):
- 64-bit mode
- Basic GDT setup with only 2 segments (kernel code and data) (see above GDT)
- Empty IDT setup (i.e. exceptions will trigger triple faults)
- Interrupts are disabled
- cr3 is set to the .boot_page_tables (which is a temporary set of page tables)
- cr0 = CR0_PG | CR0_PE | CR0_MP
- cr4 = CR4_PAE | CR4_OSFXSR | CR4_OSXMMEXCPT
- EFER = EFER_LME (EFER_LMA will be set by the CPU, indicating that long mode is active)
- The stack is setup at the end of the .stack section
- The multiboot info is passed in rdi (which is the same as ebx since we haven't touched it)
- rax is set to kernel_main and then jumped to
- The rest of the registers are arbitrary, so make sure kernel_main only takes one argument (rdi).
Command Line
The kernel receives a command line from the bootloader, where we can control some aspects of the kernel boot behavior and which features are enabled.
Command Line Format
The format is:
property1=value1 property2=value2,property3=value3, property4=value4
Between properties, a comma , and/or a space can be used as a separator.
Each property is a key-value pair. A property can be duplicated; in that case, the last value will be used.
Properties
This is implemented in
cmdline
and defined as:
```rust
cmdline_struct! {
    pub struct Cmd<'a> {
        #[default = true]
        pub uart: bool,
        #[default = 115200]
        pub uart_baud: u32,
        #[default = LogLevel::Info]
        pub max_log_level: LogLevel,
        #[default = "/kernel.log"]
        pub log_file: &'a str,
        #[default = true]
        pub allow_hpet: bool,
    }
}
```
Here are the supported properties:
Property | Type | Description | Default |
---|---|---|---|
uart | bool | Enable UART/serial interface | true |
uart_baud | u32 | UART baud rate | 115200 |
max_log_level | LogLevel (trace/debug/info/warn/error ) | Maximum log level | LogLevel::Info |
log_file | &str | Log file path | "/kernel.log" |
allow_hpet | bool | Allow HPET (if present), otherwise always use PIT | true |
log_aml | LogAml (off/normal/structured ) | Log the AML content as ASL code on boot from ACPI tables | LogAml::Off |
If we write these in a command line, it will look like:
uart=true uart_baud=115200 max_log_level=info log_file=/kernel.log allow_hpet=true
Memory
Explaining the memory management in the kernel.
We have several tools to manage memory in the kernel, each with its own purpose and goal.
First, let's look at the memory layout of the kernel, to know where everything is located.
Then, we can look at the physical allocator, which provides us with raw memory from the hardware, which we can then use with the virtual mapper to map it into virtual memory in order to access it.
The heap in the kernel is implemented with the heap allocator, and the PageAllocator will allocate pages using the virtual mapper.
Lastly, we have the virtual space, which is a very useful tool to get a virtual address for components that are not part of the kernel directly but whose physical location we know. This includes memory mapped IO, ACPI structures, Multiboot structures, etc.
Memory pointer types
Another thing you might notice in the code is that we use u64 sometimes and usize other times. Here is what they mean:
- u64: a 64-bit unsigned integer; it will be 64 bits no matter the platform. It is used to represent physical addresses, because in x86 with the CR0.PG bit enabled the CPU can map 40-bit physical addresses to 32-bit virtual addresses, and thus a physical address can be more than 32 bits.
- usize: a pointer-sized unsigned integer; it will be 32 or 64 bits depending on the platform. It is used to represent virtual addresses, and it is the same size as a pointer on the platform.
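To make the distinction concrete, here is a minimal illustrative sketch; the type aliases and values are hypothetical, not taken from the kernel:

```rust
// Hypothetical aliases illustrating the convention above (assumes a 64-bit host).
type PhysicalAddress = u64; // always 64 bits, even on 32-bit CPUs with wide physical addressing
type VirtualAddress = usize; // pointer-sized, matches the platform

fn as_pointer(addr: VirtualAddress) -> *const u8 {
    addr as *const u8
}

fn main() {
    let phys: PhysicalAddress = 0x1_0000_0000; // above 4GB, wouldn't fit in a 32-bit usize
    let virt: VirtualAddress = 0xFFFF_FFFF_8010_0000;
    println!("phys = {:#x}, virt = {:p}", phys, as_pointer(virt));
}
```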
On a tangent: for filesystem operations we only use u64, as hardware drives can easily have more than 4GB of space without the CPU needing to be 64-bit.
Memory layout
The memory layout constants and code are defined in memory_layout.
This is the structure of the memory for any process running in the system.
Layout
0000_0000_0000_0000 .. FFFF_FF7F_FFFF_FFFF - User (15.99~ EB)
FFFF_FF80_0000_0000 .. FFFF_FFFF_7FFF_FFFF - Process specific kernel (510 GB)
FFFF_FFFF_8000_0000 .. FFFF_FFFF_FFFF_FFFF - Kernel (2 GB)
Kernel layout
FFFF_FFFF_8000_0000..FFFF_FFFF_8010_0000 nothing
FFFF_FFFF_8010_0000..FFFF_FFFF_8XXX_XXXX kernel elf (text, rodata, data, bss)
FFFF_FFFF_8XXX_XXXX..FFFF_FFFF_8800_0000 (kernel end) physical allocator low (until 128MB mark pre-mapped in `boot`)
FFFF_FFFF_8800_0000..FFFF_FFFF_8900_0000 kernel heap (16MB)
FFFF_FFFF_8900_0000..FFFF_FFFF_890E_7000 interrupt stacks
FFFF_FFFF_8900_0000..FFFF_FFFF_8900_1000 interrupt stack 0 guard page (4KB) *not mapped by purpose*
FFFF_FFFF_8900_1000..FFFF_FFFF_8902_1000 interrupt stack 0 (32 * 4KB = 128KB)
... *repeat for 6 more stacks*
FFFF_FFFF_890E_7000..FFFF_FFFF_FFFF_F000 Kernel extra (virtual space, free virtual space to use)
The kernel is loaded by the bootloader at physical address 0x10_0000, and it will then map physical 0x0 into virtual 0xFFFF_FFFF_8000_0000 for 128MB, i.e. until the end of the initial physical page allocator. See more details in the boot chapter.
Look at virtual space for more info on it.
Virtual space
Virtual space is a concept I'm using (not sure what other OSes call it) that solves the issue of "I have a physical address of an object, but I don't have virtual space to map it to".
This is useful for reading structures that live at specific locations in physical memory, such as ACPI tables, PCI configuration space, memory mapped IO, etc.
It's very simple: it takes memory from the kernel extra space and maps it to the physical address.
Process specific kernel layout
FFFF_FF80_0000_0000..FFFF_FF80_0000_1000 process kernel stack guard page (4KB) *not mapped by purpose*
FFFF_FF80_0000_1000..FFFF_FF80_0004_1000 process kernel stack (64 * 4KB = 256KB)
This is a space specific to each process, but it resides in kernel space.
The idea is to keep structures that are specific to a process here, which other processes won't have access to, and thus reduce the need to setup a lock around them.
We use it currently for the kernel stack, which is where the kernel will store its stack when an interrupt happens while we are in user space.
It solves the issue where having a single kernel stack for all processes can't work: if two processes get interrupted while the first one is still in the kernel, the second one will overwrite the first one's stack.
As you might have expected, the previous paragraph describes the source of a crazy bug that introduced this kernel stack. Fixed in 0dc04f8
User layout
Not much to talk about here, as this will depend on the process itself and where to load the ELF file, currently we load at the address specified in the ELF file. This is of course not safe, as we don't do extra checks for that value.
But anyway, the other parts of the userspace are as follows:
XXXX_XXXX_XXXX_XXXX .. YYYY_YYYY_YYYY_YYYY - ELF file
YYYY_YYYY_YYYY_YYYY .. ZZZZ_ZZZZ_ZZZZ_ZZZZ - Heap. From the end of the ELF and grows up
ZZZZ_ZZZZ_ZZZZ_ZZZZ .. FFFF_FF7F_FFFF_D000 - Stack. From the top and grows down
FFFF_FF7F_FFFF_D000 .. FFFF_FF7F_FFFF_E000 - Stack guard page. *not mapped, just for reference*
FFFF_FF7F_FFFF_E000 .. FFFF_FF7F_FFFF_F000 - Process Metadata structure
A lot of symbols XD. But in general, the stack is at the top of the user space, the ELF file is at the bottom, and the heap is in the middle, starting right after the ELF file.
Physical Allocator
This is implemented in
physical_page_allocator
This provides physical memory allocation and deallocation.
Currently it is very basic: it allocates 4KB pages, and only 1 page at a time. This is due to our design.
Current design
The current design is built using a linked list: each free page points to the next free page. The reason we can't allocate more pages at once is that we create the list by freeing all the pages, and freeing just means adding the page to the list.
The issue is that we add them one by one, start to finish, and when we allocate we take the last one from the list, so it's disconnected from the rest of the pages; it doesn't know whether the next page is immediately after it in memory or not.
This could be solved by having a more complex allocator, like what we have in the heap allocator, but I want to use another design that is fast, since that one is slow.
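To illustrate the free-list idea, here is a simplified, self-contained sketch; the real allocator stores the next pointer inside the free page itself and works on physical addresses, while this model uses page indices just to show the behavior:

```rust
// Simplified model of a single-page free-list allocator (not the kernel's code).
struct FreeListAllocator {
    head: Option<usize>,      // most recently freed page
    next: Vec<Option<usize>>, // next[i] = page that was freed just before page i
}

impl FreeListAllocator {
    fn new(num_pages: usize) -> Self {
        let mut alloc = Self { head: None, next: vec![None; num_pages] };
        // Build the list by "freeing" every page one by one, start to finish,
        // exactly as described above.
        for page in 0..num_pages {
            alloc.free(page);
        }
        alloc
    }

    fn alloc(&mut self) -> Option<usize> {
        // Pop the head; we have no idea whether the page after it in the list
        // is physically contiguous, hence only 1 page at a time.
        let page = self.head?;
        self.head = self.next[page];
        Some(page)
    }

    fn free(&mut self, page: usize) {
        self.next[page] = self.head;
        self.head = Some(page);
    }
}

fn main() {
    let mut alloc = FreeListAllocator::new(4);
    assert_eq!(alloc.alloc(), Some(3)); // the last freed page comes out first
    alloc.free(3);
    assert_eq!(alloc.alloc(), Some(3));
}
```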
Another issue is that we only have 128MB of memory to allocate from, and we can't allocate more than that.
This is not a design issue; rather, the physical page allocator initially relies on the memory we have during boot, where we map the first 128MB of memory directly into the kernel space. See boot and memory layout for more details.
Design issues to fix
- Can only allocate 1 page at a time
- Only has 128MB of memory to allocate from
Ideas
Linux uses a tree-like structure, where each node in the tree is a page and the children are the sub-pages; this allows easy allocation in powers of 4KB pages, and also easy coalescing of pages.
I'm thinking of doing something like that.
For the 128MB limit, I'm thinking of having another allocator that won't have direct access to the memory, i.e. it is not mapped, and we will store metadata about those pages in the heap, which we already have access to. I.e. it will be a "late" physical page allocator.
Virtual Mapper
This is implemented in
virtual_memory_mapper
This is where we map physical memory to virtual memory, and where virtual memory pages are managed after boot.
The main features are mapping and unmapping physical memory to virtual memory, and allocating and freeing virtual memory pages.
The map function takes 1 argument, VirtualMemoryMapEntry, which is a struct containing information about the mapping (a usage sketch follows the field list below):
```rust
pub struct VirtualMemoryMapEntry {
    pub virtual_address: usize,
    pub physical_address: Option<u64>,
    pub size: usize,
    pub flags: u64,
}
```
- virtual_address is the virtual address to map to, and must be known of course.
- physical_address is the physical address to map to; if it's None, we will allocate new memory from the physical allocator.
- size is the size of the mapping, and must be 4K aligned.
- flags are the flags of the mapping, such as Writable and UserAccessible. For now, these are just constants mapping directly to x86 page table flags.
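For illustration, here is a self-contained sketch of a map call shaped like the description above; the map function here is a stand-in that only validates the entry, and the flag constant is an assumption, not the kernel's real API:

```rust
// Sketch only: mirrors the struct above, with a mock map() for illustration.
pub struct VirtualMemoryMapEntry {
    pub virtual_address: usize,
    pub physical_address: Option<u64>,
    pub size: usize,
    pub flags: u64,
}

const FLAG_WRITABLE: u64 = 1 << 1; // illustrative; mirrors the x86 PTE R/W bit

fn map(entry: &VirtualMemoryMapEntry) {
    assert_eq!(entry.size % 4096, 0, "size must be 4K aligned");
    match entry.physical_address {
        Some(phys) => println!("map {:#x} -> {:#x}", entry.virtual_address, phys),
        None => println!("map {:#x} -> freshly allocated physical pages", entry.virtual_address),
    }
}

fn main() {
    map(&VirtualMemoryMapEntry {
        virtual_address: 0xFFFF_FF80_0000_1000,
        physical_address: None, // None => take pages from the physical allocator
        size: 64 * 4096,
        flags: FLAG_WRITABLE,
    });
}
```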
The unmap function takes the same struct, but physical_address must be None and virtual_address must be known. It also takes is_allocated, a boolean indicating whether the memory was allocated by the virtual memory mapper; if it was, it will be freed.
This API can be improved, but currently we don't keep track of the mappings, so we rely on the caller to do that for us :D.
Besides that, we have other functionality used by processes, like:
- switch_to_this: switch to this VirtualMemoryMapper; each process has its own VirtualMemoryMapper.
- get_current_vm: get the current VirtualMemoryMapper of the current process.
- clone_current_vm_as_user: clone the kernel mappings of the current vm and mark it as a user vm, so it doesn't allow kernel mappings anymore.
Virtual space
This is implemented in
virtual_space
Virtual space is a concept I'm using (not sure what other OSes call it) that solves the issue of "I have a physical address of an object, but I don't have virtual space to map it to".
This is useful for reading structures that live at specific locations in physical memory, such as ACPI tables, PCI configuration space, memory mapped IO, etc.
It's very simple: it takes memory from the kernel extra space and maps it to the physical address.
It can be used through VirtualSpace, which is similar to Box, i.e. it's a wrapper around a pointer, and it will automatically unmap the memory when it goes out of scope.
```rust
let mut vs = unsafe { VirtualSpace::<u32>::new(0x1000).unwrap() };
*vs = 0x1234;
assert_eq!(*vs, 0x1234);
```
Drivers
This will cover all drivers that interact with hardware.
We may call them drivers or devices; for now at least, we treat them as the same thing.
Some of these devices and virtual devices may be available as devices in the filesystem.
IDE
This is implemented in
ide_device
IDE devices are the only type of PCI device we support for now. The PCI device type is MassStorageController(0x01, ..).
We support both ATA and ATAPI devices. ATA devices are hard drives, and ATAPI devices are CD/DVD drives.
It's a basic implementation without DMA or async; we perform reads with read_sync.
Keyboard
This is implemented in
keyboard
The keyboard driver is simple, and uses the legacy PS/2 interface at ports 0x60 and 0x64; it's implemented alongside the mouse driver in the same file.
The driver provides event broadcasts to all listeners using blinkcast. These listeners are mostly processes reading from the /devices/keyboard file (see keyboard reader).
```rust
pub struct Key {
    pub pressed: bool,
    // the state of the modifiers at the time of the fetch
    pub modifiers: u8,
    pub key_type: KeyType,
}
```
KeyType is an enum containing all keys from a US mapping.
The keyboard user can then use this as the origin and map it to any other key depending on the layout they want.
Currently, we use the US layout to get the character of a key using the function Key::virtual_key (used in the kernel and userspace).
The modifiers field is a bitflags value from modifier, so use these constants to check whether a specific modifier is on (see the sketch after the list below).
There are 2 types of modifiers:
- Held modifiers: SHIFT, CTRL, ALT
- Toggled modifiers: CAPSLOCK, NUMLOCK, SCROLLLOCK
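As a small illustration of checking those bitflags, here is a self-contained sketch; the bit values below are made up for the example, the real constants live in the kernel's modifier module:

```rust
#![allow(dead_code)]

// Illustrative values only; the real `modifier` constants may differ.
mod modifier {
    pub const SHIFT: u8 = 1 << 0;
    pub const CTRL: u8 = 1 << 1;
    pub const ALT: u8 = 1 << 2;
    pub const CAPSLOCK: u8 = 1 << 3;
    pub const NUMLOCK: u8 = 1 << 4;
    pub const SCROLLLOCK: u8 = 1 << 5;
}

fn main() {
    // Pretend this came from a Key's `modifiers` field.
    let modifiers: u8 = modifier::SHIFT | modifier::CAPSLOCK;

    let shift_held = modifiers & modifier::SHIFT != 0;
    let caps_on = modifiers & modifier::CAPSLOCK != 0;
    println!("shift held: {shift_held}, capslock toggled: {caps_on}");
}
```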
Keyboard reader
The keyboard driver provides a way to get a blinkcast reader using get_keyboard_reader, where the user can read keyboard events without blocking any time they want.
The console and userspace processes use this reader to read keyboard events.
Userspace processes can read keyboard events through the virtual device at /devices/keyboard.
A process can open a file descriptor to this device and read from it to get keyboard events.
The file descriptor holds a blinkcast reader to the keyboard driver, so each process can read events without blocking.
The user can open the file and read the content directly, but since we perform some encoding, it's better to use the emerald_runtime library, which provides an easy way to read the events.
Example:
```rust
use emerald_runtime::keyboard::Keyboard;

let mut keyboard = Keyboard::new();

if let Some(key) = keyboard.get_key_event() {
    println!("Key: {:?}", key);
}
// or
for key in keyboard.iter_keys() {
    println!("Key: {:?}", key);
}
```
Mouse
This is implemented in
mouse
The mouse driver is simple, and uses the legacy PS/2 interface at ports 0x60 and 0x64; it's implemented alongside the keyboard driver in the same file.
The driver provides event broadcasts to all listeners using blinkcast. These listeners are mostly processes reading from the /devices/mouse file (see mouse reader).
```rust
pub enum ScrollType {
    None = 0,
    VerticalUp = 1,
    VerticalDown = 2,
    HorizontalRight = 3,
    HorizontalNegative = 4,
}

pub struct MouseEvent {
    pub x: i16,
    pub y: i16,
    pub scroll_type: ScrollType,
    pub buttons: u8,
}
```
The buttons field is a bitflags value from buttons, so use these constants to check whether a button is pressed.
Note that this is the current state of the mouse, so you must keep the old state to know whether a button was pressed or released (see the sketch after the list below).
The buttons are:
- LEFT: 0b0000_0001
- RIGHT: 0b0000_0010
- MIDDLE: 0b0000_0100
- FORTH: 0b0000_1000
- FIFTH: 0b0001_0000
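Since events only carry the current button state, detecting a click requires comparing against the previous state. Here is a small self-contained sketch of that edge detection, using the bit values listed above:

```rust
// Derive press/release edges from two consecutive `buttons` states.
const LEFT: u8 = 0b0000_0001;

fn main() {
    let old_buttons: u8 = 0b0000_0000; // from the previous event
    let new_buttons: u8 = LEFT;        // from the current event

    let pressed = new_buttons & !old_buttons;  // bits that just turned on
    let released = old_buttons & !new_buttons; // bits that just turned off

    if pressed & LEFT != 0 {
        println!("left button pressed");
    }
    if released & LEFT != 0 {
        println!("left button released");
    }
}
```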
Mouse reader
The mouse driver provides a way to get a blinkcast reader using get_mouse_reader, where the user can read mouse events without blocking any time they want.
Userspace processes can read mouse events through the virtual device at /devices/mouse.
A process can open a file descriptor to this device and read from it to get mouse events.
The file descriptor holds a blinkcast reader to the mouse driver, so each process can read events without blocking.
The user can open the file and read the content directly, but since we perform some encoding, it's better to use the emerald_runtime library, which provides an easy way to read the events.
Example:
```rust
use emerald_runtime::mouse::Mouse;

let mut mouse = Mouse::new();

if let Some(event) = mouse.get_event() {
    println!("Event: {:?}", event);
}
// or
for event in mouse.iter_events() {
    println!("Event: {:?}", event);
}
```
UART
This is implemented in
uart
A very basic UART driver; it connects to port 0x3F8 (COM1) and can be read from and written to.
It is used by the console to print and read characters.
Virtual devices
These are "devices" that we interact with from other parts of the kernel, but may not be actually available in hardware. They provide a sort of abstraction layer above other drivers/devices and maybe other components.
One example is the console, a virtual device that we use to print and read characters from the screen; it uses the keyboard and uart drivers to do so.
Some of these devices may be available as devices in the filesystem.
Console
This is implemented in
console
The console is a virtual device that we use to print and read characters from the screen, it will use the keyboard and uart drivers to do so.
This is called by the print! and println! macros.
We have 2 consoles for now. I don't like the current design and would like to change it in the future.
EarlyConsole
This is a console object that is statically initialized, can only write, and doesn't have access to the keyboard.
LateConsole
This is the main console; it is initialized later, can both read and write, and has access to the keyboard.
The main purpose of this is to add it to the /devices directory so it acts as a kernel device and we can use it from userspace.
The design can be improved; the issue is that LateConsole lives inside an Arc<Mutex<>> (so it can be used as a device) while EarlyConsole is static. There are several differences between them, so there is a lot of code duplication, and I would like to improve it somehow.
Pipe
This is implemented in
pipe
A pipe is a virtual device that allows two processes to communicate with each other. It is a unidirectional communication channel used to pass data from one process to another. It is similar to a file, but it is stored in memory rather than on disk, and it behaves as a first-in-first-out (FIFO) data structure.
It acts as a special file, i.e. the process just writes to it as it would a normal file.
It is created with create_pipe_pair, which returns 2 File objects, one for reading and one for writing. The kernel then assigns those to the process.
Internally, the Pipe is a dyn Device, so it's stored in the INode as a device. See filesystem for more details on INode.
Power
This is implemented in
power
This is a very basic virtual device accessible from /devices/power, and it is used to issue power related events, for now shutdown or reboot.
Basic usage will be echo "shutdown" > /devices/power or echo "reboot" > /devices/power.
This will make the kernel shutdown or reboot the system.
Filesystem
This is implemented in
filesystem
In this kernel, the filesystem is implemented by several layers and components.
When you try to open a file path, the filesystem will look for the best entity that contains this path, and this is achieved by the mapping system.
Check FAT for more information about the FAT filesystem specifically.
Then, when you open a file, you can specify several flags (implemented in OpenOptions):
- read - Open the file for reading.
- write - Open the file for writing.
- create - Create the file if it doesn't exist.
- create_new - Fail if the file exists.
- truncate - Truncate the file if it exists.
- append (implicit write) - Append to the file if it exists.
With these, you can create new files and choose which mode to open them with. Of course, the filesystem may refuse to create the file if the operation is not supported, such as with the /devices directory mappings.
Mapping
This is implemented in
fs::mapping
The mapping is a structure we use to map a path prefix to a Filesystem.
For example, currently we have the following mappings:
'/' -> FAT (filesystem backed by disk)
'/devices' -> Devices (a virtual filesystem)
When you open a path, it will find the best mapping, i.e. the longest prefix that matches the path, and then use the resulting Filesystem manager to open the file.
For example, if you open the path /devices/console, it will use the Devices filesystem manager to open the file /console.
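Here is a self-contained sketch of that longest-prefix lookup; the real kernel stores the mappings in a tree of MappingNode (described below), but a flat list is enough to show the idea:

```rust
// Toy model: pick the mapping with the longest matching prefix, then hand the
// remaining path to that filesystem.
fn resolve<'a>(mappings: &[(&'a str, &'a str)], path: &str) -> Option<(&'a str, String)> {
    let mut best: Option<(&str, &str)> = None;
    for &(prefix, fs) in mappings {
        let matches = path == prefix
            || prefix == "/"
            || (path.starts_with(prefix) && path.as_bytes().get(prefix.len()) == Some(&b'/'));
        if matches && best.map_or(true, |(p, _)| prefix.len() > p.len()) {
            best = Some((prefix, fs));
        }
    }
    best.map(|(prefix, fs)| {
        let rest = if prefix == "/" { path } else { &path[prefix.len()..] };
        let rest = if rest.is_empty() { "/" } else { rest };
        (fs, rest.to_string())
    })
}

fn main() {
    let mappings = [("/", "FAT"), ("/devices", "Devices")];
    // `/devices/console` matches the longer prefix `/devices`,
    // so the Devices filesystem is asked to open `/console`.
    assert_eq!(
        resolve(&mappings, "/devices/console"),
        Some(("Devices", "/console".to_string()))
    );
    assert_eq!(resolve(&mappings, "/foo/bar"), Some(("FAT", "/foo/bar".to_string())));
}
```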
Internally, this mapping is stored in a recursive tree structure of MappingNode, where each node contains:
- The Filesystem object.
- A weak ref to the parent (to not get into trouble when dropping).
- A children BTreeMap (child component name -> MappingNode).
So, it will be something like this:
- / (root) = {
fs: object
parent: None
children: {
"devices": {
fs: object
parent: /
children: {}
}
}
}
This is used so that we can traverse between two mappings easily.
I.e., if we are at the beginning of a mapping and encounter a .. path component, we can go back to the parent mapping.
Also, when going forward in a mapping, we can check if the component is a child mapping of the current one and switch to it easily.
With this traversal, we can build the canonical path for a node.
Filesystem trait
The "manager" here is an implementor of the Filesystem
trait, which is a simple interface that controls all
the filesystem operations.
Operations supported are:
- open_root - Open the root directory; this is the entry point when traversing the filesystem.
- read_dir - Read the directory entries from a DirectoryNode.
- treverse_dir - Look through the directory and return the Node that matches the entry name, or an error if not found.
- create_node - Create a new file or directory inside a DirectoryNode.
- read_file - Read the file contents from a FileNode.
- write_file - Write the file contents to a FileNode.
- flush_file - Force the driver to flush the content to physical media (i.e. clear the cache, if any).
- close_file - Send a message that we are closing the file. If you notice, we don't have open_file; instead, the user can traverse the filesystem with open_root and read_dir until the file node is found, and then it can be used directly. This function is used to alert the filesystem to clean up any resources it might have allocated for this file.
- set_file_size - Set the file size to a custom value; this is similar to truncate in Unix systems. write_file will increase the file size if needed.
- unmount - Unmount the filesystem; this is called when the filesystem is no longer needed, and it should clean up all resources. You might ask why we don't just use Drop; there are several reasons I went with this:
  - We can't add Drop as a trait dependency to Filesystem, so I wanted something related to the trait itself.
  - Some filesystems might not be able to be dropped, like the DEVICES global filesystem, but we still want to tell it to clean itself, or the parts that can be cleaned, before shutdown/reboot.
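To give a feel for the shape of this trait, here is a heavily simplified, self-contained sketch; the real trait has more operations (listed above), different signatures, and proper error and node types, so treat it as illustrative only:

```rust
// Simplified sketch of a Filesystem-style trait; not the kernel's real definitions.
struct DirectoryNode {
    path: String,
}

trait Filesystem {
    fn open_root(&self) -> DirectoryNode;
    fn read_dir(&self, dir: &DirectoryNode) -> Vec<String>;
    fn unmount(&mut self);
}

struct EmptyFs;

impl Filesystem for EmptyFs {
    fn open_root(&self) -> DirectoryNode {
        DirectoryNode { path: "/".into() }
    }
    fn read_dir(&self, _dir: &DirectoryNode) -> Vec<String> {
        Vec::new() // this toy filesystem has no entries
    }
    fn unmount(&mut self) {
        // flush caches, release resources, etc.
    }
}

fn main() {
    let mut fs = EmptyFs;
    let root = fs.open_root();
    println!("entries in {}: {:?}", root.path, fs.read_dir(&root));
    fs.unmount();
}
```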
Node
See Node
The Filesystem will give us a Node when we open a directory or a file, and this Node can be either a FileNode or a DirectoryNode.
I'm calling it Node even though FAT doesn't have this concept; I'm using it to represent the file information.
Partition tables
Currently we only support the MBR partition table, and we can only read the first partition, we don't check the partition type, and just forward it to the FAT filesystem.
Devices
See Devices
This is a basic dictionary that maps a device name to an Arc<dyn Device>. When it's opened, a clone of the device is returned in a special FileNode, so we can act upon it as a file.
FAT (File Allocation Table) filesystem
This is implemented in
fat
The FAT filesystem is a simple filesystem that is widely used in many devices, such as USB drives, SD cards, and floppy disks.
In this kernel, we have support for:
- FAT12
- FAT16
- FAT32
- Long file names (LFN).
- reading and writing files, changing file size, and creating files and directories.
Network
This is implemented in
net
The network and its components.
Here, we have the structs and resources that can be used to communicate over the network:
- EthernetSocket - A socket that can send and receive packets over the network.
- NetworkPacket - A stack-based packet structure that holds several NetworkHeaders along with the payload. The socket can read/write this struct efficiently, and due to its implementation, it can be reused multiple times to send and receive packets.
Processor
This is implemented in
cpu
.
Here we talk about processor-related structures and functions, including:
- Interrupts and exceptions
- Global Descriptor Table (GDT)
- Advanced Programmable Interrupt Controller (APIC) and IO APIC
Let's first talk about other processor topics that are not covered by the above.
Saved CPU state
Each CPU (currently only 1) has a structure that contains the state related to that CPU.
Among other things, it contains:
- the id and apic_id of the cpu, for identification.
- the n_cli and old_interrupt_enable fields, which are used by the implementation of locks.
- the context, which is a process context used when switching between processes, as well as process_id for the current process and other scheduling related fields.
CPU initialization
Currently, we don't perform any additional initialization after boot, and it's causing some issues, as UEFI results in a different CPU state than BIOS and we need to handle that; there is an issue for it: #34.
Interrupts and exceptions
This is implemented in
interrupts
.
This is where interrupts are managed. We have 2 types of interrupts:
- Exceptions: errors that occur in the CPU, like division by zero, invalid opcode, etc.; these are defined at specific places in the IDT.
- Interrupts: events that are triggered by devices, like the keyboard, timer, etc.; these do not have a specific placement in the IDT, and are instead handled mostly in the user_interrupts range of the IDT, which is from 32 to 255.
The last 16 entries of the user interrupts are used specially by the kernel as such:
- SPECIAL_SCHEDULER_INTERRUPT=0xFF: used by the scheduler to switch between contexts.
- SPECIAL_SYSCALL_INTERRUPT=0xFE: used by the syscall instruction to switch to kernel mode; this is defined in kernel user link.
Generally we use functions like allocate_user_interrupt, which will allocate an interrupt entry and put our function there (see interrupt handlers below for the types of handlers), and then return where it was allocated so we can use it later, for example with the APIC.
The InterruptDescriptorTableEntry gives us some extra functionality that is not provided by default, such as:
- set_stack_index, which sets the stack index to use when handling the interrupt.
- set_privilege_level, which sets the privilege level of the interrupt.
- set_disable_interrupts, which sets whether the interrupts flag should be disabled when handling this interrupt.
- override_code_segment, which sets the code segment to use when handling the interrupt.
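To illustrate how allocation and configuration fit together, here is a self-contained mock of the flow; the real allocate_user_interrupt and InterruptDescriptorTableEntry have different signatures, so this only mirrors the prose above:

```rust
// Mock types mirroring the flow described above; not the kernel's real API.
struct InterruptDescriptorTableEntry {
    vector: u8,
    stack_index: Option<u8>,
    disable_interrupts: bool,
}

impl InterruptDescriptorTableEntry {
    fn set_stack_index(&mut self, index: Option<u8>) {
        self.stack_index = index; // which TSS IST stack to use, None = default stack
    }
    fn set_disable_interrupts(&mut self, disable: bool) {
        self.disable_interrupts = disable;
    }
}

// Pretend allocator: hands out the next free vector in the user range and lets
// the caller configure the entry before it would be written into the IDT.
fn allocate_user_interrupt(
    next_free: &mut u8,
    configure: impl FnOnce(&mut InterruptDescriptorTableEntry),
) -> InterruptDescriptorTableEntry {
    let mut entry = InterruptDescriptorTableEntry {
        vector: *next_free,
        stack_index: None,
        disable_interrupts: true,
    };
    *next_free += 1;
    configure(&mut entry);
    // ...the real code would install `entry` into the IDT here...
    entry
}

fn main() {
    let mut next_free = 32u8; // user interrupts start at vector 32
    let entry = allocate_user_interrupt(&mut next_free, |e| {
        e.set_stack_index(Some(0)); // use the first interrupt stack
        e.set_disable_interrupts(true);
    });
    println!(
        "installed vector {} (ist: {:?}, interrupts disabled: {})",
        entry.vector, entry.stack_index, entry.disable_interrupts
    );
}
```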
Interrupts handlers
There are 2 types of interrupt handlers based on what arguments they take:
- The minimal, which takes InterruptStackFrame64, i.e. the data provided by the CPU automatically:
```rust
pub struct InterruptStackFrame64 {
    pub rip: u64,
    pub cs: u8,
    pub rflags: u64,
    pub rsp: u64,
    pub ss: u8,
}
```
- The full, which takes all registers (except fxsave); this is used when we expect to switch between processes or need those extra registers:
```rust
pub struct InterruptAllSavedState {
    pub rest: RestSavedRegisters, // contains all the rest of the registers
    pub number: u64,
    pub error: u64,
    pub frame: InterruptStackFrame64,
}
```
Generally, the APIC is the only user of allocate_user_interrupt, as it uses it to allocate interrupts for hardware devices. Look at the APIC chapter for more details on what interrupts we have now.
Global Descriptor Table (GDT) and others
The Global Descriptor Table (GDT) is a data structure used by the x86 architecture to define the characteristics of the various memory and privileges of the segments used by the CPU.
The Interrupt Descriptor Table (IDT) is a data structure used by the x86 architecture to define the characteristics of the various interrupts and exceptions.
Interrupt Descriptor Table (IDT)
I'll start with this just to get it out of the way.
The setup for the IDT is very simple: we just have static memory holding "default" empty handlers, and we use the lidt instruction to load the IDT.
Later, when we add an interrupt, we just modify the IDT entry with the new handler, and it will be used from now on.
For more information about specific usage of interrupts, see Interrupts and exceptions and APIC.
Global Descriptor Table (GDT)
This kernel is x86_64, and segments are not used as much as they were in the past, so we have a very basic implementation of the GDT.
Currently, we have 4 segments excluding the NULL segment:
- KERNEL_CODE: the code segment for the kernel.
  - flags: flags::PRESENT | flags::CODE | flags::USER | flags::dpl(KERNEL_RING)
- USER_CODE: the code segment for the userspace.
  - flags: flags::PRESENT | flags::CODE | flags::USER | flags::dpl(USER_RING)
- KERNEL_DATA: the data segment for the kernel.
  - flags: flags::PRESENT | flags::USER | flags::WRITE | flags::dpl(KERNEL_RING)
- USER_DATA: the data segment for the userspace.
  - flags: flags::PRESENT | flags::USER | flags::WRITE | flags::dpl(USER_RING)
The code segments will have the LONG flag set. Technically, we don't need the KERNEL_DATA segment either, but it's included to be more consistent.
I won't go into details of the flags, you can check the documentation of GDT or the CPU manual.
Also an interesting note: flags::WRITE seems to be required, at least with qemu; it would crash when switching to a data segment where that bit is not set, even though the AMD64 manual says the CPU ignores those bits in 64-bit mode.
From above:
- KERNEL_RING is 0
- USER_RING is 3
As part of the GDT setup, we also setup the TSS (Task State Segment), which is generally used by the CPU to switch between tasks.
But since we don't use hardware tasks, we at least need to set it up to configure interrupt stacks.
Task State Segment (TSS)
The TSS is a structure used by the CPU to switch between tasks; it also contains the IST (Interrupt Stack Table), which provides separate stacks for interrupts, as well as the stacks used when going from user mode to kernel mode.
For us, we setup 7 stacks, usable by any interrupt. Look at memory layout for where those are located.
The interrupts manager can then choose the stack to use for each interrupt with set_stack_index (see Interrupts and exceptions).
A value of None means to use the default stack.
The default stack will be the current stack if the privilege level is the same as the current privilege level, otherwise it will change to the stack specified in the TSS based on the target privilege level.
Currently, we only have 1 stack for KERNEL_RING, which is the Process kernel stack in the memory layout.
I.e. this is a stack specific to each process, as it is only used when transitioning from user to kernel mode, and while in user mode we are always inside a process.
Advanced Programmable Interrupt Controller (APIC) and IO APIC
This is implemented in
apic
.
The APIC is a part of the CPU, and it is used to manage interrupts and exceptions, both those triggered by hardware devices and those triggered by the CPU itself.
Initialization
The initialization of the APIC is done using the ACPI tables, which contain data about the APIC, such as:
- The number of CPUs.
- The APIC ID of each CPU.
- The address of the IO APIC.
The APIC is memory-mapped at an address provided by the CPU; we can fetch it by reading the MSR register 0x1B (which is the APIC_BASE register).
Then we can map the APIC and IOAPIC (also memory-mapped) addresses into the virtual address space using virtual space.
Interrupts
The APIC can be used to allocate and assign interrupts to hardware devices. We can use these functions:
- assign_io_irq to assign an interrupt to a hardware device.
- assign_io_irq_custom, which is similar to the above but provides extra changes to the interrupt, such as:
  - with_interrupt_polarity_low: set the interrupt polarity to low/high (boolean).
  - with_trigger_mode_level: set the trigger mode to level/edge (boolean).
  - with_mask: set the interrupt to be masked or not (boolean).
These will set up the interrupt with the correct IO APIC based on the irq_num argument provided.
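As an illustration of wiring an IRQ line with those options, here is a self-contained mock; the names mirror the prose above, but the real kernel functions and signatures differ:

```rust
// Mock sketch of routing an IO APIC input pin to an interrupt vector.
struct IoIrqConfig {
    irq_num: u8,        // IO APIC input pin, e.g. 1 for the PS/2 keyboard
    vector: u8,         // IDT vector the pin is routed to
    polarity_low: bool,
    trigger_level: bool,
    masked: bool,
}

impl IoIrqConfig {
    fn new(irq_num: u8, vector: u8) -> Self {
        Self { irq_num, vector, polarity_low: false, trigger_level: false, masked: false }
    }
    fn with_interrupt_polarity_low(mut self, low: bool) -> Self {
        self.polarity_low = low;
        self
    }
    fn with_trigger_mode_level(mut self, level: bool) -> Self {
        self.trigger_level = level;
        self
    }
    fn with_mask(mut self, masked: bool) -> Self {
        self.masked = masked;
        self
    }
}

fn assign_io_irq_custom(config: IoIrqConfig) {
    // The real code would program the IO APIC redirection entry for this pin.
    println!(
        "IO APIC pin {} -> vector {:#x} (polarity_low: {}, level_triggered: {}, masked: {})",
        config.irq_num, config.vector, config.polarity_low, config.trigger_level, config.masked
    );
}

fn main() {
    // Example: a level-triggered, active-low line routed to vector 0x30.
    let config = IoIrqConfig::new(2, 0x30)
        .with_interrupt_polarity_low(true)
        .with_trigger_mode_level(true)
        .with_mask(false);
    assign_io_irq_custom(config);
}
```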
Currently, we have these interrupts configured:
- 1: used by the keyboard driver.
- 14 & 15: used by the IDE driver.
- The HPET timer, with the interrupt number specified dynamically based on its configuration, but it looks to be 2 on the VM.
We also have interrupts from the APIC
itself, such as:
- Timer interrupt: this is used by the scheduler to switch between processes.
- Error interrupt: This is mapped, but haven't seen it triggered yet.
- Spurious interrupt: This is mapped, but haven't seen it triggered yet.
Advanced Configuration and Power Interface (ACPI)
This is implemented in
acpi
.
ACPI is a standard for operating systems to discover and configure computer hardware components, to perform power management, and to configure the system's power management features. It is a successor to Advanced Power Management (APM), and it is a part of the UEFI specification.
We have support for some of the components of the ACPI standard.
ACPI comes in tables, and these tables include information about the system's hardware.
ACPI Tables Support
We have parsing support for these tables now; having support does not mean we use all the features inside them. It just means that we can read the table. Some of them are used, but not all.
- RSDP: This is either provided by multiboot2 during boot, or we search for it in memory; it just points to the RSDT or XSDT.
- RSDT: This is the Root System Description Table, and it contains pointers to other tables.
- MADT/APIC: This is the Multiple APIC Description Table, and it contains information about the APIC,see APIC for more details.
- FACP: This is the Fixed ACPI Description Table, and it contains information about the power management features of the system, also contain some info about the RTC.
- HPET: This is the High Precision Event Timer, and it contains information about the HPET, see HPET for more details.
- DSDT (not used*): This is the Differentiated System Description Table, and it contains AML code, which is used to configure the system.
- SSDT (not used*): This is the Secondary System Description Table, and it also contains AML code, which is used to configure the system.
- BGRT (not used): This is the Boot Graphics Resource Table, and it contains information about the boot logo, and it is used by the UEFI firmware.
- WAET (not used): This is the Windows ACPI Emulated Devices Table, and it contains information about the emulated devices, and it is used by the UEFI firmware.
- SRAT (not used): This is the System Resource Affinity Table, and it contains information about the system's memory and processors locality.
*: We don't use the DSDT and SSDT data for now, but we do parse it as AML code, see AML for more details.
We use virtual space to map the ACPI tables, and then we copy them to the heap; this makes them easier to use, and we can reclaim ACPI memory later.
ACPI Control
During boot, we take control of ACPI registers, and also register an interrupt for ACPI events. (Implemented in acpi::setup_enable_acpi).
So, currently, we can get ACPI events, such as PowerButton pressed, and so on, but we need to implement shutdown behavior to correctly react to them and not just print them in the logs.
ACPI Machine Language (AML)
This is implemented in
aml
.
AML is a language used to describe the hardware of the system, and it is used by the ACPI to configure the system; check the spec here (this is what's used to implement this parser).
This is an AML parser that is able to parse the code inside the DSDT and SSDT tables.
We have 2 forms of the parsed AML:
- Normal: this is what we get from the table; we just parse it directly.
- Structured: after getting the Normal version, we do some processing to order the terms and group them by Scope so that we can easily search for a given label.
Either of these can be printed by setting the cmdline property log_aml to normal or structured (see Cmdline).
Why is this needed?
There are some details of the hardware that are hidden in AML, such as:
- Finding out the sleep commands for the system, i.e. which registers to write to, and what values to write.
- Finding out interrupts configuration, for devices that share the same interrupt line.
Generally, most OSes will use acpica, which is a tool that can parse and execute AML code.
Some annoying parts
Just to share here, since this is a documentation and all.
Generally, parsing AML is not that hard; it's a binary format, so we can read the opcode and know what term type we need.
The issue is one thing: method calls.
Method calls are encoded as a NameString followed by N arguments, each a TermArg (which is a term type).
The issue is:
- The expression itself doesn't provide the number of arguments.
- The method call can happen before the method is defined (method definition provide number of arguments).
- The method can be external (I think).
So, we have one choice (acpica does it too), and that is to "guess" the number of arguments.
This is very error-prone, as a Term can also be considered an expression; for example, a NameString can be considered a variable if it is not a method call. Very messy :'(
The current implementation I have is, I think, quite good, and I got to fix all the bugs I found, but of course there could be more.
This seems kinda "complaining" XD, but I just wanted to share the experience.
Processes
This is implemented in
process
This module implements the process management of the kernel, including creating, destroying, and scheduling processes, as well as managing each process's memory and resources.
Currently, when we say process, we also mean thread, as we don't have multi-threading support yet.
Process Data
The process structure Process contains all the information relating to the process; some of the important fields are:
- id: the process id.
- parent_id: the id of the parent process.
- vm: the process's virtual memory, an instance of VirtualMemoryMapper, see virtual mapper.
- context: a saved state of the CPU before the process is scheduled; i.e. while the process is running, this is considered invalid and doesn't represent the current state of the process.
- open_filesystem_nodes: a map of open file nodes (see filesystem); a node can be a file or a directory. The mapping here is from usize; we use a map instead of a list since we can remove a file from the middle of the list, and we don't want to have to shift all the elements after it.
- argv: a string list of the arguments passed to the process.
- stack_ptr_end: the end of the stack; the stack grows down, so this is the highest address of the stack, and where the stack starts when the process is created.
- stack_size: the current size of the stack; currently it's constant, until we get growing stack support.
- heap_start: the start address of the heap; this will be padded by around 1MB from the end of the ELF file loaded into memory.
- heap_size: the current size of the heap. The user process can request more heap space with the inc_dec_heap system call.
- heap_max: the maximum possible size of the heap; this is not changed, currently set to 1GB.
- priority: the priority of the process, used by the scheduler (see PriorityLevel).
- exit_code: the exit code of the process; if the process has exited, this will be set to the exit code.
- children_exits: a list of the children processes that have exited, with their exit codes (see Process Exit later for more information).
Process Creation
Process creation (structure creation) is as follows:
- Load the ELF file; this doesn't load the whole thing, just the header, to make sure it's valid.
- Map the stack region.
- Load argv into the stack (check the argv structure for more information).
- Load the ELF regions into memory.
- Load the Process Metadata structure (check Process Metadata for more information).
- Add process-specific kernel memory regions, like the kernel stack (this must be done after loading the ELF, and must be the last manual modification to the VM, because we can't switch to this VM after this point unless it's by the scheduler; see the comments in process/mod.rs::allocate_process for more details).
- Add data about the heap, with size 0 and max size 1GB, i.e. no memory allocated yet.
- A default context is created, everything is 0, except for:
  - rip: the entry point of the ELF file.
  - rsp: the end of the stack.
  - rflags: enable the interrupts.
  - cs: the code segment for the USER_RING (see GDT).
  - ss/ds: the data segment for the USER_RING (see GDT).
  - rdi: the value of argc, since we use the SYSV calling convention from the userspace.
  - rsi: the address of the argv array, since we use the SYSV calling convention from the userspace.
Argv Structure
The argv
structure in the stack is as follows:
+-------------------+ <- stack_ptr_end
| arg0 |
+-------------------+
| arg1 |
+-------------------+
| .... |
+-------------------+
| null |
+-------------------+
| argv | <- pointer to `arg0`
+-------------------+
| argc | <- number of arguments
+-------------------+
| .... | <- this is where `rsp` will be set to, when the process is created
+-------------------+
Process Metadata Structure
Structure definition at
ProcessMetadata
.
This structure is placed in a static location in userspace memory (see user memory layout), and is used to store information about the process, such as:
- process id.
- image base address.
- image size.
- program header offset (even though it can probably be obtained by reading the image header).
- eh_frame address and size; this is used to implement unwinding.
- text address and size; this is useful for debugging and getting a backtrace from usermode.
Process Exit
When the exit syscall is called, the process is moved to the exited list, and its exit code is set.
The exited process will be removed from the scheduler's list at the next schedule call; see scheduler for more information.
When the process exits, it also does the following:
- It notifies all processes that are in the WaitingForPid state with this process's id, gives them the exit_code, and continues those processes.
- It adds itself to the parent's children_exits list, with the exit_code, so that parents can know when their children have exited without blocking on WaitingForPid (i.e. they can call waitpid without blocking, only because they are parents).
Scheduler
This is implemented in
scheduler
The scheduler
is responsible for scheduling the processes, and managing the CPU time between them.
Scheduling Algorithm
We are using a priority-queue based approach for scheduling processes.
The queue order is determined by a value priority_counter, which starts at u64::MAX and is decremented by a value derived from the process's priority level; a higher priority level decreases the value less, and thus the process stays on top more often.
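A tiny self-contained model of that ordering rule (the real scheduler's data structures and decrement values are different; this just shows why higher-priority processes surface more often):

```rust
// Toy model of the priority_counter ordering described above.
struct Proc {
    name: &'static str,
    priority_counter: u64,
    decrement: u64, // derived from the priority level: higher priority => smaller decrement
}

fn pick_and_run(procs: &mut [Proc]) -> &'static str {
    // Pick the process with the highest counter...
    let top = procs.iter_mut().max_by_key(|p| p.priority_counter).unwrap();
    // ...then charge it for the time slice it just received.
    top.priority_counter -= top.decrement;
    top.name
}

fn main() {
    let mut procs = [
        Proc { name: "high", priority_counter: u64::MAX, decrement: 1 },
        Proc { name: "low", priority_counter: u64::MAX, decrement: 4 },
    ];
    // Over a few rounds, "high" is picked more often than "low".
    let order: Vec<_> = (0..6).map(|_| pick_and_run(&mut procs)).collect();
    println!("{:?}", order);
}
```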
Each time we schedule a process we perform the following:
- Check all waiting processes, and wake them if it's time; currently, we have the ProcessState::WaitingForTime and ProcessState::WaitingForPid states that support waiting.
- After waking them (moving them to the scheduled list), pick the top scheduled process and run it, moving it to the running_and_waiting list.
- If we have exited processes, handle notifying waiters and parents and remove the process. It's important that we remove the process here, since we can't do it while the process is running (still handling the exit syscall): we still hold its virtual memory, and deleting the process would free it up and cause a page fault.
Running the process is simple:
- Copy the context of the process to the saved context of the CPU (see processor saved state), which will be used by the scheduler interrupt to jump to it.
- Set the pid of the process as the process_id of the CPU.
- Mark the process as ProcessState::Running, and move it to the running_and_waiting list as mentioned.
Yielding
When a process is running, it can yield to the scheduler in 2 ways now:
- Timer: the APIC timer (see APIC) will interrupt the CPU every once in a while, and this gives us preemptive multitasking.
- System call: when a syscall is executed, after the syscall we perform a yield as well, see syscalls.
When yielding, we perform the following:
- Save the all_state of the CPU to the context of the process. This all_state comes from the interrupt, i.e. we can only yield when an interrupt occurs from that process, since we have to save the exact CPU state from just before the interrupt.
- Reschedule the process, by putting it in the scheduled list and fixing up its priority_counter to be similar to the top process. This is important: if a process was sleeping for some time, we don't want it to hog the execution when it wakes up, because at that point its priority_counter would be much higher than any other process's.
Sleeping
When a process is running, it can sleep, and this is done through the sleep syscall, see syscalls.
When sleeping, we perform the following:
- Mark the process as ProcessState::WaitingForTime(deadline), where deadline is the expected time to finish the sleep, computed from the current time. See Clocks.
- The process is already in the running_and_waiting list, so no movement is done here.
And then, in the scheduler, we handle sleeping processes (see scheduling algorithm).
Scheduler Interrupt
This is interrupt 0xFF, see interrupts for more information.
This interrupt is used to easily change the execution context of the current cpu.
When the interrupt is triggered, we do the following:
- The cpu must contain a context, which we will move to.
- We must be in the kernel, of course; this is a private interrupt for the kernel, and the scheduler alone.
- Swap the all_state coming from the interrupt (which will be a point in the schedule function in the kernel, where we invoked this interrupt) with the context stored in the cpu, which holds the state of the process.
And with the last step, we achieve the context switch between two execution states: the kernel and the process.
Why?
I found this solution works quite well for switching between contexts. Is it the best? I don't know, but it works quite well now and is very stable.
Syscalls
This is implemented in
syscalls
The syscall numbers and some metadata like SyscallResult and SyscallError are defined in the kernel user link crate.
The syscall spec is as follows:
- Syscalls can have up to `7` arguments, passed in `RCX`, `RDX`, `RSI`, `RDI`, `R8`, `R9`, `R10`, in that order (see the sketch after this list).
- The syscall number is passed in `RAX`.
- The syscall return value is passed in `RAX`; the syscall may write to pointers passed in the other registers, but the result code will still be in `RAX`.
- The returned `RAX` value is an encoded `SyscallResult`, so it is never intended to be read as a raw `u64` directly.
- The arguments are not modified by the syscall, but the syscall may read from them, or write to memory pointed to by them.
- All pointers passed to the syscall have to be valid and point to user space memory only. The kernel will check that the memory is mapped before writing to it, but it doesn't guarantee that the memory is meaningful to use this way (i.e. if a pointer points into a random part of the process's heap, the write could corrupt that heap).
- The syscall may block execution, depending on the syscall itself, like `wait_pid` or a `read` from a blocking file with no data.
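As a concrete illustration of this convention, here is a hypothetical userspace stub for a 3-argument syscall. The trap instruction/interrupt vector the kernel actually uses is not stated in this section, so the `int 0xfe` below is only a placeholder, and `WRITE_NUMBER` in the usage comment stands for whatever number the kernel-user-link crate defines:

```rust
use core::arch::asm;

/// Issue a 3-argument syscall: number in RAX, arguments in RCX, RDX, RSI
/// (the first three of the seven), result back in RAX as an encoded `SyscallResult`.
unsafe fn syscall3(number: u64, arg1: u64, arg2: u64, arg3: u64) -> u64 {
    let encoded_result: u64;
    unsafe {
        asm!(
            "int 0xfe", // placeholder trap into the kernel
            inout("rax") number => encoded_result, // syscall number in, encoded SyscallResult out
            in("rcx") arg1,
            in("rdx") arg2,
            in("rsi") arg3,
        );
    }
    encoded_result
}

// e.g. `write(file_index, buf, size)` from the table below would be issued as:
// let encoded = unsafe { syscall3(WRITE_NUMBER, fd, buf.as_ptr() as u64, buf.len() as u64) };
// `encoded` must then be decoded as a `SyscallResult`, never read as a plain u64.
```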
Syscalls list
This shows the list of syscalls and their arguments and return values. Some notes on this:
- Arguments of the form `a: *const/*mut T, b: usize` are used in the kernel as a slice, for example `write(file_index: usize, buf: &[u8])`. But I didn't want to remove an argument and confuse the reader, so I kept them split into two, as they are sent from userspace.
- Types like `&Path`, `BlockingMode`, `SpawnFileMapping`, etc. are passed as `u64` like everything else, but the kernel checks validity and casts/parses these pointers/values into the correct types.
- All the return types are `SyscallResult` and will report any error during execution; for simplicity, I didn't repeat that in the table.
- When the return type is `()`, it means the kernel will return `SyscallResult::Ok(0)`, and the userspace will check that it is `0`.
Name | Arguments | Return value | Description |
---|---|---|---|
open | path: &Path, access_mode: u64, mode: u64 | file_index: usize | Opens a file |
write | file_index: usize, buf: *const u8, size: usize | bytes_written: usize | Writes to a file |
read | file_index: usize, buf: *mut u8, size: usize | bytes_read: usize | Reads from a file |
close | file_index: usize | () | Closes a file |
blocking_mode | file_index: usize, blocking_mode: BlockingMode | () | Sets the blocking mode of a file. This is DEPRECATED, and should be replaced with set_file_meta with FileMeta::BlockingMode |
exit | exit_code: i32 | ! | Exits the current process |
spawn | path: &Path, argv: *const *const u8, file_mappings: *const SpawnFileMapping, file_mappings_size: usize | pid: u64 | Spawns a new process |
inc_heap | increment: i64 | old_heap_end: usize | Increase/decrease the heap of the current process (similar to sbrk) |
create_pipe | read_fd: *mut usize, write_fd: *mut usize | () | Creates a pipe |
wait_pid | pid: u64, block: bool | exit_code: i32 | Waits for a process to exit |
stat | path: &Path, stat: *mut FileStat | () | Gets the file stat of a file |
open_dir | path: &Path | dir_index: usize | Opens a directory |
read_dir | dir_index: usize, buf: *mut DirEntry, len: usize | entries_read: usize | Reads from a directory |
get_cwd | buf: *mut u8, len: usize | needed_bytes: usize | Gets the current working directory, returns BufferTooSmall if the buffer is too small |
chdir | path: &Path | () | Changes the current working directory |
set_file_meta | file_index: usize, meta_id: u64, meta_data: u64 | () | Sets the file meta |
get_file_meta | file_index: usize, meta_id: u64, meta_data: *mut u64 | () | Gets the file meta |
sleep | seconds: u64, nanos: u64 | () | Sleeps for a duration |
get_time | clock_type: ClockType, time: *mut ClockTime | () | Gets the time based on the clock_type , see Clocks |
graphics | command: GraphicsCommand, extra: *mut () | () | Graphics operations, see Graphics:VGA |
seek | file_index: usize, whence: SeekWhence, offset: i64 | new_offset: u64 | Seeks a file |
priority | pid: u64, priority: Option<PriorityLevel> | PriorityLevel | Sets and gets the priority of a process |
Executables
This is implemented in
executable
Currently, we only support ELF, and it will probably stay like this for a while; I don't plan to support other formats soon.
ELF
The ELF file is the executable format used by most of the Unix-like systems, and it is the format we will be using for our executables.
This is implemented in
elf
We load elf on process creation, see process creation for more information.
For now, we support very basic loading, no dynamic linking, shared libraries, or relocation. Just loading segments.
Clocks
This is implemented in
clocks
.
Here we implement the clock functionality of the system.
This includes functionalities used in implementing:
- system time.
- timers.
- sleep (see scheduler).
We have several clock sources; these are devices that provide us with a "time". This "time" is not bound to a specific start, but it guarantees that:
- The time is always increasing.
  - TODO: We still haven't handled wrap-around.
- Querying the time at 2 points will give you the time difference at real-time speed.
  - This depends on the `granularity` of the source; for example, if we implement it for the `RTC`, then the `granularity` is 1 second, i.e. querying it within 1 second will mostly give you the same time.
And thus with these, we can keep track of the time.
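A sketch of what a clock source interface could look like, based on the guarantees above; the kernel's actual trait and method names are not shown in this section, so these are assumptions:

```rust
trait ClockSource {
    /// Monotonically increasing time since some arbitrary start, in nanoseconds.
    fn time_nanos(&mut self) -> u64;
    /// The smallest step this source can resolve, in nanoseconds
    /// (e.g. an RTC-backed source would be 1_000_000_000).
    fn granularity_nanos(&self) -> u64;
}

/// Measuring an interval only needs two queries: the difference tracks real
/// time, but only up to the source's granularity.
fn measure<C: ClockSource>(source: &mut C, mut work: impl FnMut()) -> u64 {
    let start = source.time_nanos();
    work();
    source.time_nanos().saturating_sub(start)
}
```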
We have 2 times in the system:
- boot time (uptime).
  - This is the time since the system booted.
  - This is calculated by periodically updating the time based on the best `clock source` we have.
- real time (unix time).
  - On boot, we get the current time from the RTC.
  - When we need this value, we calculate it based on the `boot time` and the `start` time we got at boot.

These times can be fetched with the `get_time` syscall.
RTC
This is implemented in
rtc
.
The RTC (Real Time Clock) is a hardware clock that is used to provide the current time and date for the system.
The `RTC` is used as the base time to determine the `unix` time of the system, and to provide this time to user space.
Besides that, the kernel generally doesn't care about `unix` time, and everything is based on "system boot" time.
The `RTC` can technically be used as a clock source, like the HPET and TSC, but its accuracy is very low, so for now it's not used as such.
High Precision Event Timer (HPET)
This is implemented in
hpet
.
See Intel's High Precision Event Timer (HPET) Specification for more details.
The HPET is a hardware timer that is used to provide a high-precision time base for the system.
Other clocks, such as the TSC, are calibrated using the `HPET`; the TSC is then used to provide the time for the system, as it is faster than the `HPET`.
Currently, we only use 1 timer for the clock, and we don't use the interrupts. But we could use them in the future to provide more accurate time-based events.
If the `HPET` is not available, or the user has `allow_hpet=false` in the cmdline (see Command Line), the `HPET` will be disabled and we will use the PIT instead.
Programmable Interval Timer (PIT)
This is implemented in
pit
.
See OsDev docs OsDev: Programmable Interval Timer (PIT) for more details.
The PIT is a hardware timer chip that is commonly used in x86-based systems to generate periodic interrupts and measure time intervals.
It's kind of a lesser version of the HPET, but it has wider support. We use it as an alternative when the `HPET` is not present, or when the user doesn't want to use it via the `allow_hpet=false` cmdline option (see Command Line).
We use the `PIT` as a timer to keep track of system ticks, and also to calibrate the TSC when the HPET isn't used; we don't use the interrupts generated by the PIT for anything else.
Time Stamp Counter (TSC)
This is implemented in
tsc
.
The TSC is a 64-bit register present in all x86 processors since the Pentium. It counts the number of cycles since the processor was reset. It is used to provide a high-precision time base for the system.
But it doesn't count human time, so we have to calibrate it using a clock that tracks real time, like the HPET or the RTC. The RTC is very slow (1 second granularity), so we use the HPET or the PIT, depending on which is usable, for calibration.
We also may need to recalibrate the `TSC` once in a while, as it may drift.
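A sketch of how a TSC-style counter can be calibrated against a reference timer (HPET or PIT here). This is only an illustration of the idea; the helper names are assumptions, and the kernel may avoid floating point where this sketch uses `f64`:

```rust
fn calibrate_tsc(
    read_tsc: impl Fn() -> u64,      // e.g. the RDTSC instruction
    wait_reference_ns: impl Fn(u64), // busy-wait on the HPET/PIT for a known interval
) -> f64 {
    const CALIBRATION_INTERVAL_NS: u64 = 10_000_000; // 10 ms window

    let start = read_tsc();
    wait_reference_ns(CALIBRATION_INTERVAL_NS);
    let end = read_tsc();

    // TSC ticks per nanosecond; with this ratio, later TSC deltas can be
    // converted into nanoseconds. Recalibrating periodically corrects drift.
    (end - start) as f64 / CALIBRATION_INTERVAL_NS as f64
}

fn tsc_delta_to_nanos(ticks: u64, ticks_per_ns: f64) -> u64 {
    (ticks as f64 / ticks_per_ns) as u64
}
```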
Graphics
Currently, we only have a VGA driver implemented, in VGA.
It only does CPU-based rendering, and just copies images to the VGA framebuffer provided to us by the hardware.
There is no hardware acceleration yet.
VGA
This is implemented in
vga
.
This is a basic graphics driver implementation.
We take the framebuffer info from the `multiboot2` structure coming from boot, and use it to write to that memory, which is then displayed on the screen.
This framebuffer is controlled by the kernel; the kernel can use it internally to display stuff, for example the console uses it to display characters on the screen.
But then, a userspace process can take ownership of the framebuffer. What this means is that the kernel will stop rendering to it, but it still owns the memory region. At this stage the kernel just waits for rendering commands coming from the owner process.
The only rendering command here is `Blit`, which is an operation that copies a region from one framebuffer (user allocated) to another (the VGA framebuffer, which the kernel owns).
Which means that all the rendering is done by the userspace processes, and the kernel just copies the images to the screen.
Userspace processes can be more efficient by telling the kernel which regions of the framebuffer have changed and only sending those regions to the kernel, so the kernel can copy only the changed regions to the screen.
These operations are accessible through the `graphics` syscall.
Graphics Command
There are 4 commands supported (see the sketch after this list):
- `TakeOwnership`: This is used to take ownership of the graphics device.
- `ReleaseOwnership`: This is used to release ownership of the graphics device, and is executed automatically when the process exits if it was not released manually.
- `GetFrameBufferInfo(&mut info_out)`: This is used to get information about the framebuffer, see FrameBufferInfo.
- `Blit(&BlitCommand)`: This is used to blit a region from userspace memory into the graphics framebuffer, it can control (see BlitCommand for more info):
  - `src_framebuffer`: memory reference to the source framebuffer (only read by the kernel)
  - `src_framebuffer_info`: The framebuffer info of the source framebuffer, i.e. its shape in memory, so that we can copy correctly from it.
  - `src_x`, `src_y`: The top-left corner of the source region (user memory)
  - `dest_x`, `dest_y`: The top-left corner of the destination region (kernel)
  - `width`, `height`: The width and height of the region to copy, applies to both
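A conceptual sketch of the blit flow described above, using hypothetical types; the real `GraphicsCommand`/`BlitCommand` definitions live in the kernel-user-link / emerald_std crates and may differ:

```rust
#[derive(Clone, Copy)]
struct FrameBufferInfo {
    width: usize,
    height: usize,
    // pitch, pixel format, ... omitted here
}

struct BlitCommand<'a> {
    src_framebuffer: &'a [u32],            // user-allocated pixels, only read by the kernel
    src_framebuffer_info: FrameBufferInfo, // shape of the source buffer in memory
    src_x: usize,
    src_y: usize,
    dest_x: usize,
    dest_y: usize,
    width: usize,
    height: usize,
}

/// Build a blit that copies only a dirty rectangle of the user framebuffer to
/// the same position on screen, as suggested above for efficiency.
fn blit_dirty_rect(
    pixels: &[u32],
    info: FrameBufferInfo,
    x: usize,
    y: usize,
    w: usize,
    h: usize,
) -> BlitCommand<'_> {
    BlitCommand {
        src_framebuffer: pixels,
        src_framebuffer_info: info,
        src_x: x,
        src_y: y,
        dest_x: x,
        dest_y: y,
        width: w,
        height: h,
    }
}
```

The typical sequence is then `TakeOwnership`, `GetFrameBufferInfo` to learn the screen's shape, a series of `Blit`s per frame, and finally `ReleaseOwnership` (or letting the kernel release it automatically on exit).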
Testing
We have a testing framework in the kernel that tries to replicate the testing framework in `rust-std`.
We can't just use `#[test]`, so instead we created our own macro for tests: `testing::test!`.
Example:
```rust
testing::test! {
    fn test_free_realloc() {
        let page = unsafe { alloc() };
        let addr = page as usize;
        unsafe { free(page) };
        let page2 = unsafe { alloc() };
        // the freed page should be reused for the next allocation
        assert_eq!(page2 as usize, addr);
        unsafe { free(page2) };
    }

    #[should_panic]
    fn test_unaligned_free() {
        let page = unsafe { alloc() };
        let addr_inside_page = unsafe { page.add(1) };
        unsafe { free(addr_inside_page) };
    }
}
```
We have added `macro_rules_attribute` temporarily, so instead of using `testing::test! {}` we can use this:

```rust
#[macro_rules_attribute::apply(testing::test)]
fn test() {
    // test code
}
```
The plan is to replace it with a custom `proc_macro` crate later.
When you create a new feature, be sure to add tests for it as much as possible.
Logging
We are using tracing
to implement logging in our kernel.
So, you can use the `trace!`, `debug!`, `info!`, `warn!`, and `error!` macros to log messages.
Currently, we don't support spans yet.
As of now, the level of tracing is hardcoded to ignore `trace!` and `debug!` messages, but we want to add kernel command-line arguments to change the level of tracing.
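For example, a hypothetical driver function could log like this (the messages here are made up for illustration):

```rust
use tracing::{debug, info, warn};

fn init_network_device() {
    info!("initializing network device");
    // with the current hardcoded level, this message is filtered out
    debug!("device registers mapped");
    warn!("device reported a recoverable error");
}
```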
The logs are also saved into a file that is replaced every time the kernel is booted. The file is /kernel.log
.
Power
This is implemented in
power
Check Virtual Devices/Power as well for more information.
Shutdown/Restart Mechanism
The process to perform shutdown/restart on any system is quite simple in principle. The system needs to:
- Stop all processes.
- Unmount all filesystems and flush data to disk.
- Uninitialize devices and connected peripherals.
- Issue the reboot/shutdown command to the hardware components that can actually do the electrical side; those would be the firmware, through ACPI or the PS2 interface.
Start
The `shutdown`/`reboot` process is initiated by the user or the system itself, and it can be done by calling `power::start_power_sequence`.
Stopping Processes
When the "power sequence" is started, the Scheduler
is informed
to not schedule anymore and stop all processes as they arrive for rescheduling.
Then the scheduler wait for all processes to exit,
and then returns to kernel_main
, which will then call
power::finish_power_sequence
which continues the shutdown process below.
Filesystem Unmounting
When all processes are stopped and cleaned up, we know that no `File` is being used except the `log_file` (see Logging).
So, we flush the kernel log and close the file; at this point no more logging is stored on disk, but some logging messages are still shown on the screen.
Then, we unmount all filesystems.
Device Uninitialization
This is not implemented yet, TODO.
Shutdown
For shutdown we use ACPI
to issue a shutdown command to the hardware.
During system startup, we read the AML
data and extract all the values of \_Sx
available. These values
are used to transition the system to several types of sleep states.
For now, we are only using S5
which is the shutdown state.
Reboot
For reboot, we use the PS2 interface to issue a reset command.
We just execute the below command:
outb(0x64, 0xFE);
`ACPI` does support system reset using `reset_register` and `reset_value`, which can be provided.
But in my QEMU environment they are not provided, so for now the PS2 method is what works in most cases, I guess?
We can implement the ACPI reset later and use it when the hardware supports it.
Userspace
This section will cover the userspace part of the operating system. It will cover the programs, libraries and other userspace related topics.
Building custom userspace programs
Since we are using Rust `STD`, you can (hopefully) build a lot of Rust projects that don't have any dependencies on `libc`, `linux`, or `windows` specific libraries.
Getting the toolchain
First, you need to get the toolchain, and you can do that by either:
Using the prebuilt toolchain
We distribute a prebuilt toolchain in:
- toolchain.zip, which you can install with:
sh tools/install_toolchain_and_link.sh <path_to_toolchain.zip>
This will install the toolchain into extern/toolchain
and link it to rustup
as emerald
.
Then, you can build your programs using our `cargo xtask`:
cargo xtask userspace [build/check/fmt/clippy]
Building the toolchain
You can build our toolchain with the command
cargo xtask toolchain --install
This will install a toolchain into `./extern/toolchain`, which you can then link as a `rustup` toolchain with `rustup toolchain link emerald ./extern/toolchain`.
Building the userspace programs
Then, you can build your project with the toolchain
cargo +emerald build --target x86_64-unknown-emerald
Please open an issue if you have any problems building your project, or if you want to add a new feature to the toolchain.
Programs
These can be found in userspace
Here is the list of programs that are found in the userspace of the operating system by default.
init
The first program that is run when the operating system boots. For now, the operating system requires this program and expects it to be found at `/init`.
Currently, init
performs the following:
- Sets `stdin` as blocking (we will see why in a bit).
- Creates a new `/shell` process, using `stdin: Piped`, and passes `stdout` and `stderr` normally inherited.
- Stays in the following loop (see the sketch below):
  - Check if `/shell` has exited (not blocking).
  - Read from `stdin` and buffer it until a newline is found, then send it to the pipe of `/shell`'s `stdin`, effectively giving us behavior similar to a normal terminal in Linux.
- If the process exits, it will spawn a new `/shell` process and go back to the loop.
This is a temporary behavior (maybe?), but we still need to improve file operations, as `init` is looping a lot.
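A rough sketch of that loop, written against the standard library's process API (whether Emerald's `std` exposes exactly these `Command`/`try_wait` APIs is an assumption; the structure is what this illustrates):

```rust
use std::io::{BufRead, Write};
use std::process::{Command, Stdio};

fn main() {
    loop {
        // Spawn /shell with a piped stdin; stdout/stderr are inherited.
        let mut shell = Command::new("/shell")
            .stdin(Stdio::piped())
            .spawn()
            .expect("failed to spawn /shell");
        let mut shell_stdin = shell.stdin.take().unwrap();

        let stdin = std::io::stdin();
        let mut line = String::new();
        loop {
            // Non-blocking check: has the shell exited? If so, respawn it.
            if let Ok(Some(_status)) = shell.try_wait() {
                break;
            }
            // Buffer one line from our own (blocking) stdin and forward it to
            // the shell, similar to a terminal line discipline.
            line.clear();
            if stdin.lock().read_line(&mut line).is_ok() && !line.is_empty() {
                let _ = shell_stdin.write_all(line.as_bytes());
            }
        }
    }
}
```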
shell
This is a basic shell, that can change directories, and execute programs.
It also supports output redirection (no piping between processes yet), so you can do something like:
ls > file.txt
or even append to a file:
ls >> file.txt
List of commands/programs
Name | Description |
---|---|
cd (internal) | Change directory |
pwd (internal) | Print working directory |
exit (internal) | Exit the shell, which will just cause another to come back up |
touch (internal) | Create a file, if not present |
ls | List directory contents |
tree | List directory contents recursively |
echo | Write arguments to the standard output |
cat | Print 1 file on the standard output (no concat yet XD) |
xxd | Hexdump utility |
keyboard | Keyboard test program |
mouse | Mouse test program |
graphics
Here we have simple graphics programs that take control of the graphics controller from the kernel, so it will look like you exited the shell; upon exiting the program, the shell will come back up.
List of commands/Programs
Name | Description |
---|---|
graphics | Simple graphics demo program, it will display a red ball and it will bounce around the screen |
video | Video player. It takes a video in image-zip format, that is, a zip file with jpg images inside it; check tools/video_to_zip.sh for how to convert normal videos to this format. You can specify the fps upon creation (default is 30), and specify it as well when running the program. |
Libraries
In this chapter we will talk about the libraries that we will use in our userspace programs.
Libraries are code shared between multiple programs, and that can mean "all" programs, as is the case for the Rust Standard Library.
Rust Standard Library
Currently, everything is built with Rust; we don't have a `libc` yet. Instead, we have a custom Rust `target` that we use to build our userspace programs.
This is implemented in my fork of rust-lang/rust
.
We have the target x86_64-unknown-emerald
. The changes needed are in rust/library/std/src/sys/emerald
and rust/library/std/src/os/emerald
.
These changes are provided by another crate emerald_std, where we have all the basic implementation of userspace functionalities including:
- `allocator`: See Heap Allocator for more details.
- `files` and `io`: For file operations and input/output.
- `clock`: For clock and time operations.
- `graphics`: For graphics operations.
- `process`: For process management.
This crate is very basic and mostly just performs `syscalls`, nothing much else; then in `rust` we implement the rest of the bindings.
Using `rust` like this gives us a lot of benefits, since we only need to implement the basic `unsafe` functionality, and it will handle synchronization, memory management, etc. Of course, we need to make sure our implementation is correct.
Extra
Here are extra components that are shared between the kernel and userspace.
You might find references to these components in both the kernel and userspace chapters.
Heap allocator
In the kernel, and also in userspace, we need a heap. A heap is simply a memory region that we can dynamically allocate and deallocate memory from as needed. So, we need a heap implementation for both the kernel and userspace.
This is achieved in `increasing_heap_allocator`, a crate implementing a heap allocator based on an "increasing" memory block.
What does that mean?
It means that this crate requires a `Page` provider, i.e. an entity that can provide pages of memory that reside one after the other, and the rest of the heap memory handling is done by the crate.
This example will make it clear; this is how it's implemented in userspace:
```rust
use increasing_heap_allocator::{HeapAllocator, PageAllocatorProvider};

const PAGE_4K: usize = 4096;

struct PageAllocator {
    heap_start: usize,
    mapped_pages: usize,
}

impl PageAllocator {
    fn new() -> Self {
        Self {
            heap_start: unsafe { syscall::inc_dec_heap(0).unwrap() as usize },
            mapped_pages: 0,
        }
    }
}

impl PageAllocatorProvider<PAGE_4K> for PageAllocator {
    fn allocate_pages(&mut self, pages: usize) -> Option<*mut u8> {
        assert!(pages > 0);

        // calculate the last heap base, using `heap_start` and `mapped_pages`;
        // the next heap base will be `last_heap_base + (pages * PAGE_4K)`
        let last_heap_base = self.heap_start + self.mapped_pages * PAGE_4K;
        let new_addr = unsafe { syscall::inc_dec_heap((pages * PAGE_4K) as isize) };

        let Ok(new_addr) = new_addr else {
            return None;
        };

        // `inc_dec_heap` returns the old heap end (the start of the new block),
        // so it must match where we think the heap ends
        assert!(new_addr as usize == last_heap_base);

        // update the mapped pages
        self.mapped_pages += pages;

        Some(new_addr)
    }
}

// just a simple usage
fn main() {
    let allocator = HeapAllocator::new(PageAllocator::new());
    let layout = std::alloc::Layout::new::<u64>(); // example of a layout

    // allocate and deallocate
    let ptr = allocator.alloc(layout);
    allocator.dealloc(ptr, layout);
}
```
The only thing users of this crate need to implement is the PageAllocatorProvider
trait, which is a simple trait that requires
a method to allocate pages.
`syscall::inc_dec_heap` is very similar to `sbrk` in Unix; it will increase or decrease the heap size by the given amount. An argument of `0` will return the current heap address without changing it, which we use at the beginning to get the initial heap address.
Allocator implementation
The internal implementation of the allocator is as follows:
- The `HeapAllocator` keeps a free linked-list; these are free heap blocks. We exploit that the blocks are free and store some metadata in the block itself:

  ```rust
  struct HeapFreeBlock {
      prev: *mut HeapFreeBlock,
      next: *mut HeapFreeBlock,
      // including this header
      size: usize,
  }
  ```

  We have pointers to the previous and next free blocks, and the size of the block. As you might have guessed, `HeapAllocator` is a `!Sync` type, because we use raw pointers.
- Whenever we need to allocate a block, we check the free list, pick the smallest block that fits the requested size, and split it if necessary. If we can't find a block, we allocate new pages from `PageAllocatorProvider` and add the new block to the free list.
- Whenever we add a free block to the list, i.e. by `dealloc` or when getting new pages, we will try to merge it with the free blocks around it to avoid fragmentation. This is implemented in `free_block` and explained later in the `dealloc` algorithm.
- The allocated block will contain metadata residing before it in memory; this is used to know the size of the block and any padding it has, which the `dealloc` function can use later to correctly deallocate it without leaving any memory behind. The structure is:

  ```rust
  struct AllocatedHeapBlockInfo {
      magic: u32,
      size: usize,
      pre_padding: usize,
  }
  ```

  The `magic` is a constant value `0xF0B0CAFE` that we use to check if some bug happened; it could probably be removed when we have a stable implementation, but it has helped me a lot in debugging. `size` is the size of the block, and `pre_padding` is the padding before the block, which is used to align the block to the correct alignment, see the `alloc` algorithm for more details.
`alloc` algorithm
First, we need to know how much memory we need for this allocation. We get the argument `Layout`, which contains the `size`, but we also need to make sure of:
- Including the `AllocatedHeapBlockInfo` metadata.
- Making sure the block is aligned to the correct alignment, `Layout::align`.
First, we get a rough estimation of the size by doing:
```rust
let block_info_layout = Layout::new::<AllocatedHeapBlockInfo>();
// `whole_layout` here is the layout of the info header + the requested block
// `block_offset_from_header` is the offset of the block after the info header
let (whole_layout, block_offset_from_header) = block_info_layout
    .extend(layout.align_to(block_info_layout.align()).unwrap())
    .unwrap();
let allocation_size = whole_layout.pad_to_align().size();
```
This will give us the raw size of the block along with any padding we may need to align it.
The alignment of this `whole_layout` is equal to the larger alignment of the two layouts.
Example:
```rust
use std::alloc::Layout;

#[repr(align(32))]
struct A(i32);
#[repr(align(128))]
struct C(i32);

let layout_a = Layout::new::<A>();
let layout_c = Layout::new::<C>();

let (whole_layout, _) = layout_a.extend(layout_c).unwrap();
// whole_layout.align() == 128
let (whole_layout, _) = layout_c.extend(layout_a).unwrap();
// whole_layout.align() == 128
```
Which means we could just use `allocation_size` and be done with it.
But the issue is that we don't know the address of the `free_block` we will get; i.e. for this to work, we must make sure the address of the block we return is aligned to `whole_layout.align()`.
This alignment of the returned `free_block` actually caused a bug before, which was fixed in 471eff0.
We know that `free_block` will always be aligned to `AllocatedHeapBlockInfo`, since all allocations are aligned to it. But if the `block` we require needs a higher alignment, we will probably need to increase `allocation_size` to account for the extra padding.
The algorithm is as follows:
- Get a free block from the free list that fits the `allocation_size`.
- Get the `allocated_block_offset`, which is an offset from the `free_block` pointer to the start of the block.
  - The new pointer from this `offset` is aligned to the `block` layout that the user will get.
  - This is computed as such:

    ```rust
    // `layout` is the input `block` layout
    // `align_up` is a function that aligns `base` up to `align`
    let possible_next_offset = align_up(base, layout.align()) - base;
    let allocated_block_offset = if possible_next_offset < mem::size_of::<AllocatedHeapBlockInfo>() {
        // if we can't fit the info header, we need to add to the offset
        possible_next_offset + mem::size_of::<AllocatedHeapBlockInfo>().max(layout.align())
    } else {
        possible_next_offset
    };
    ```
- Then, the resulting `ptr` will be `free_block + allocated_block_offset`, and the metadata will be stored at `ptr - mem::size_of::<AllocatedHeapBlockInfo>()`, i.e. directly before the `ptr`.

  It may seem that this is `unsafe`, since `ptr` may not be aligned to `AllocatedHeapBlockInfo`, and thus we shouldn't just use `ptr - mem::size_of::<AllocatedHeapBlockInfo>()`. But if `layout` has a lower alignment than `AllocatedHeapBlockInfo`, the previous `if` statement will be `true` and we will get `possible_next_offset + mem::size_of::<AllocatedHeapBlockInfo>().max(layout.align())`.

  Maybe this is not the best way, or at least it should be documented better.
- We need to check that `allocated_block_offset` doesn't exceed `block_offset_from_header`; if it does, then `allocation_size` is not enough anymore, as it relies on the previous `block_offset_from_header`, while we are using `allocated_block_offset`, which is higher.

  To fix this, we just increase `allocation_size`. Currently we don't check that `free_block` is enough for the new `allocation_size`, but we should. (Check TODO)
As a last step, we need to shrink/split the
free_block
. The idea is to see if the free block is larger than theallocation_size
, then we split it into two blocks, one for the allocation and the other for the remaining free block. The remaining free block will be kept in the free list.But here we must be careful, we need to make sure that the remaining free block is large enough to hold the
HeapFreeBlock
metadata, i.e.mem::size_of::<HeapFreeBlock>()
. Otherwise, when we update the metadata of the new split, we will overwrite the next block metadata.This was a bug before, and was fixed in cf53cf9.
The split is done as such:
#![allow(unused)] fn main() { let required_safe_size = allocation_size + mem::size_of::<HeapFreeBlock>(); // do we have empty space left? if free_block_size > required_safe_size { // split let new_free_block = free_block + allocation_size; // linked-list logic.... } else { // no space left, just remove the block from the free list // we need to note the size we took allocation_size = free_block_size; // linked-list logic.... } }
The
add_free_block
will add the new free block to the free list, and try to merge it with the free blocks around it. -
The
pre_padding
is theallocated_block_offset
value, thesize
is theallocation_size
at the end after all the considerations.
`dealloc` algorithm
This is a lot simpler than `alloc`, as the data is already provided by the `AllocatedHeapBlockInfo` metadata.
The algorithm is as follows:
- We get the pointer to the `AllocatedHeapBlockInfo` metadata, and we make sure the `magic` is correct.
- We get the `size` and `pre_padding` from the metadata; the block being freed will start at `ptr - pre_padding`, and its size will be `size`.
- We call `free_block` here, which is the same function used when getting new pages (a sketch follows this list).
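A minimal sketch of this, assuming the metadata layout shown earlier; the `free_block` callback here stands in for the real free-list function:

```rust
use core::mem;

const BLOCK_MAGIC: u32 = 0xF0B0_CAFE;

#[repr(C)]
struct AllocatedHeapBlockInfo {
    magic: u32,
    size: usize,
    pre_padding: usize,
}

unsafe fn dealloc(ptr: *mut u8, free_block: impl FnOnce(*mut u8, usize)) {
    unsafe {
        // The metadata lives directly before the pointer handed to the user.
        let info =
            ptr.sub(mem::size_of::<AllocatedHeapBlockInfo>()) as *const AllocatedHeapBlockInfo;
        assert_eq!((*info).magic, BLOCK_MAGIC, "heap corruption or invalid free");

        // Reconstruct the whole block (including padding) and hand it back to
        // the free list, where it may be merged with its neighbors.
        let block_start = ptr.sub((*info).pre_padding);
        free_block(block_start, (*info).size);
    }
}
```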
`free_block` algorithm
This is a function that will take a pointer to a block and its size, and add it to the free list.
In the beginning we look at all the free blocks (maybe slow?) and get the following information:
- the previous block, if present (this is the block that ends at `ptr - 1`)
- the next block, if present (this is the block that starts at `ptr + size`)
- the "closest previous block"; this is an optimization, where we find the closest block that ends before `ptr`. The new block (this one) will be put after it in the linked list if it's not merged with anything before it, as we need to keep the list sorted. And as you might expect, this was discovered after a bug in da23ee9 :'(.
The merge logic is very simple, as follows (a simplified sketch follows this list):
- If we have a previous AND a next block, we merge with both. The `next` block will be removed from the list as well.
- If we have only a previous block, we merge with it; very easy, `prev.size += size`.
- If we have only a next block, we merge with it; the `next` block will be removed and this block will replace it in the list.
- If we have none, we just add the block to the list, after the closest previous block.
  - If we don't have a closest previous block, then we are the first block in the list, and we set it as the `head` of the list.
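A simplified model of those merge cases, using a sorted `Vec` of `(start, size)` pairs instead of the real intrusive linked list, just to make the four cases explicit (this is not how the kernel stores them):

```rust
fn free_block(free_list: &mut Vec<(usize, usize)>, ptr: usize, size: usize) {
    // index of the first block that starts at or after the new one
    let idx = free_list.partition_point(|&(start, _)| start < ptr);

    let merges_prev = idx > 0 && {
        let (prev_start, prev_size) = free_list[idx - 1];
        prev_start + prev_size == ptr // previous block ends exactly at `ptr - 1`
    };
    let merges_next = idx < free_list.len() && free_list[idx].0 == ptr + size;

    match (merges_prev, merges_next) {
        (true, true) => {
            // absorb both: the previous block grows over the new block and the next one
            let next_size = free_list.remove(idx).1;
            free_list[idx - 1].1 += size + next_size;
        }
        (true, false) => free_list[idx - 1].1 += size, // prev.size += size
        (false, true) => {
            // the new block replaces the next one in the list
            free_list[idx].1 += size;
            free_list[idx].0 = ptr;
        }
        (false, false) => free_list.insert(idx, (ptr, size)), // keep the list sorted
    }
}
```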
TODO
- Add a special implementation for `realloc`, which would be smarter than just doing an `alloc` and a `dealloc`.
- I think there is a better way to implement `alloc`; currently, we waste a lot of space. Imagine this: `layout` is `align=512` and `AllocatedHeapBlockInfo` is `align=8,size=32`. The `whole_layout` will be `align=512` and the allocated size will be `1024`, even though we clearly don't need that. Also, we can improve how we handle an unaligned `free_block`.
- Handle cases where we can't increase `allocation_size` because the `free_block` isn't big enough.
Kernel user link
This is a crate that contains common definitions for the kernel and the user space, and it is implemented in the `emerald_kernel_user_link` crate.
It contains common definitions such as:
- `syscall` numbers.
- `SyscallResult` and `SyscallError` types.
- some `process` related arguments:
  - `SpawnFileMapping`: For mapping files from the current process into the new process.
- `file` related structures and arguments:
  - `DirEntry`: For directory entries.
  - `FileStat`: For file stats, such as size, type, etc.
  - `BlockingMode`: For blocking modes of operations on files.
  - `FileType`: For file types.
  - `FileMeta`: For assigning and getting metadata of files.
- `clock` related structures and arguments:
  - `ClockType`: For specifying the type of the clock to get the time from.
    - `RealTime`: For getting the real time, which is based on the `unix time`.
    - `SystemTime`: For getting the time since the system booted.
  - `ClockTime`: Structure holding the time, `seconds` and `nanos`.
- `STDIN`, `STDOUT`, `STDERR` file descriptor numbers.