In the previous post (Kernel booting process. Part 5.
) - Kernel decompression we stopped at the jump on the decompressed kernel:
jmp *%rax
and now we are in the kernel. There are many things to do before the kernel will start first init
process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is in the arch/x86/kernel/head_64.S. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the start_kernel
function from the init/main.c will be called.
So let's start.
Okay, we got address of the kernel from the decompress_kernel
function into rax
register and just jumped there. Decompressed kernel code starts in the arch/x86/kernel/head_64.S:
__HEAD
.code64
.globl startup_64
startup_64:
...
...
...
We can see definition of the startup_64
routine and it defined in the __HEAD
section, which is just:
#define __HEAD .section ".head.text","ax"
We can see definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
...
...
...
} :text = 0x9090
We can understand default virtual and physical addresses from the linker script. Note that address of the _text
is location counter which is defined as:
. = __START_KERNEL;
for x86_64
. We can find definition of the __START_KERNEL
macro in the arch/x86/include/asm/page_types.h:
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
#define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
Here we can see that __START_KERNEL
is the sum of the __START_KERNEL_map
(which is 0xffffffff80000000
, see post about paging) and __PHYSICAL_START
. Where __PHYSICAL_START
is aligned value of the CONFIG_PHYSICAL_START
. So if you will not use kASLR and will not change CONFIG_PHYSICAL_START
in the configuration addresses will be following:
- Physical address -
0x1000000
; - Virtual address -
0xffffffff81000000
.
Now we know default physical and virtual addresses of the startup_64
routine, but to know actual addresses we must to calculate it with the following code:
leaq _text(%rip), %rbp
subq $_text - __START_KERNEL_map, %rbp
Here we just put the rip-relative
address to the rbp
register and then subtract $_text - __START_KERNEL_map
from it. We know that compiled address of the _text
is 0xffffffff81000000
and __START_KERNEL_map
contains 0xffffffff81000000
, so rbp
will contain physical address of the text
- 0x1000000
after this calculation. We need to calculate it because kernel can't be run on the default address, but now we know the actual physical address.
In the next step we checks that this address is aligned with:
movq %rbp, %rax
andl $~PMD_PAGE_MASK, %eax
testl %eax, %eax
jnz bad_address
Here we just put address to the %rax
and test first bit. PMD_PAGE_MASK
indicates the mask for Page middle directory
(read paging about it) and defined as:
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT 21
As we can easily calculate, PMD_PAGE_SIZE
is 2 megabytes. Here we use standard formula for checking alignment and if text
address is not aligned for 2 megabytes, we jump to bad_address
label.
After this we check address that it is not too large:
leaq _text(%rip), %rax
shrq $MAX_PHYSMEM_BITS, %rax
jnz bad_address
Address most not be greater than 46-bits:
#define MAX_PHYSMEM_BITS 46
Okay, we did some early checks and now we can move on.
The first step before we started to setup identity paging, need to correct following addresses:
addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
Here we need to correct early_level4_pgt
and other addresses of the page table directories, because as I wrote above, kernel can't be run at the default 0x1000000
address. rbp
register contains actual address so we add to the early_level4_pgt
, level3_kernel_pgt
and level2_fixmap_pgt
. Let's try to understand what these labels means. First of all let's look on their definition:
NEXT_PAGE(early_level4_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level2_kernel_pgt)
PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
KERNEL_IMAGE_SIZE/PMD_SIZE)
NEXT_PAGE(level2_fixmap_pgt)
.fill 506,8,0
.quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
.fill 5,8,0
NEXT_PAGE(level1_fixmap_pgt)
.fill 512,8,0
Looks hard, but it is not true.
First of all let's look on the early_level4_pgt
. It starts with the (4096 - 8) bytes of zeros, it means that we don't use first 511 early_level4_pgt
entries. And after this we can see level3_kernel_pgt
entry. Note that we subtract __START_KERNEL_map + _PAGE_TABLE
from it. As we know __START_KERNEL_map
is a base virtual address of the kernel text, so if we subtract __START_KERNEL_map
, we will get physical address of the level3_kernel_pgt
. Now let's look on _PAGE_TABLE
, it is just page entry access rights:
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
more about it, you can read in the paging post.
level3_kernel_pgt
- stores entries which map kernel space. At the start of it's definition, we can see that it filled with zeros L3_START_KERNEL
times. Here L3_START_KERNEL
is the index in the page upper directory which contains __START_KERNEL_map
address and it equals 510
. After it we can see definition of two level3_kernel_pgt
entries: level2_kernel_pgt
and level2_fixmap_pgt
. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has:
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
_PAGE_DIRTY)
access rights. The second - level2_fixmap_pgt
is a virtual addresses which can refer to any physical addresses even under kernel space.
The next level2_kernel_pgt
calls PDMS
macro which creates 512 megabytes from the __START_KERNEL_map
for kernel text (after these 512 megabytes will be modules memory space).
Now we know Let's back to our code which is in the beginning of the section. Remember that rbp
contains actual physical address of the _text
section. We just add this address to the base address of the page tables, that they'll have correct addresses:
addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
At the first line we add rbp
to the early_level4_pgt
, at the second line we add rbp
to the level2_kernel_pgt
, at the third line we add rbp
to the level2_fixmap_pgt
and add rbp
to the level1_fixmap_pgt
.
After all of this we will have:
early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0] -> 512 MB kernel mapping
level2_fixmap_pgt[506] -> level1_fixmap_pgt
As we corrected base addresses of the page tables, we can start to build it.
Now we can see set up the identity mapping early page tables. Identity Mapped Paging is a virtual addresses which are mapped to physical addresses that have the same value, 1 : 1
. Let's look on it in details. First of all we get the rip-relative
address of the _text
and _early_level4_pgt
and put they into rdi
and rbx
registers:
leaq _text(%rip), %rdi
leaq early_level4_pgt(%rip), %rbx
After this we store physical address of the _text
in the rax
and get the index of the page global directory entry which stores _text
address, by shifting _text
address on the PGDIR_SHIFT
:
movq %rdi, %rax
shrq $PGDIR_SHIFT, %rax
leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
movq %rdx, 0(%rbx,%rax,8)
movq %rdx, 8(%rbx,%rax,8)
where PGDIR_SHIFT
is 39
. PGDIR_SHFT
indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories:
#define PGDIR_SHIFT 39
#define PUD_SHIFT 30
#define PMD_SHIFT 21
After this we put the address of the first level3_kernel_pgt
to the rdx
with the _KERNPG_TABLE
access rights (see above) and fill the early_level4_pgt
with the 2 level3_kernel_pgt
entries.
After this we add 4096
(size of the early_level4_pgt
) to the rdx
(it now contains the address of the first entry of the level3_kernel_pgt
) and put rdi
(it now contains physical address of the _text
) to the rax
. And after this we write addresses of the two page upper directory entries to the level3_kernel_pgt
:
addq $4096, %rdx
movq %rdi, %rax
shrq $PUD_SHIFT, %rax
andl $(PTRS_PER_PUD-1), %eax
movq %rdx, 4096(%rbx,%rax,8)
incl %eax
andl $(PTRS_PER_PUD-1), %eax
movq %rdx, 4096(%rbx,%rax,8)
In the next step we write addresses of the page middle directory entries to the level2_kernel_pgt
and the last step is correcting of the kernel text+data virtual addresses:
leaq level2_kernel_pgt(%rip), %rdi
leaq 4096(%rdi), %r8
1: testq $1, 0(%rdi)
jz 2f
addq %rbp, 0(%rdi)
2: addq $8, %rdi
cmp %r8, %rdi
jne 1b
Here we put the address of the level2_kernel_pgt
to the rdi
and address of the page table entry to the r8
register. Next we check the present bit in the level2_kernel_pgt
and if it is zero we're moving to the next page by adding 8 bytes to rdi
which contaitns address of the level2_kernel_pgt
. After this we compare it with r8
(contains address of the page table entry) and go back to label 1
or move forward.
In the next step we correct phys_base
physical address with rbp
(contains physical address of the _text
), put physical address of the early_level4_pgt
and jump to label 1
:
addq %rbp, phys_base(%rip)
movq $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f
where phys_base
mathes the first entry of the level2_kernel_pgt
which is 512 MB kernel mapping.
After that we jumped to the label 1
we enable PAE
, PGE
(Paging Global Extension) and put the physical address of the phys_base
(see above) to the rax
register and fill cr3
register with it:
1:
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
movq %rcx, %cr4
addq phys_base(%rip), %rax
movq %rax, %cr3
In the next step we check that CPU support NX bit with:
movl $0x80000001, %eax
cpuid
movl %edx,%edi
We put 0x80000001
value to the eax
and execute cpuid
instruction for getting extended processor info and feature bits. The result will be in the edx
register which we put to the edi
.
Now we put 0xc0000080
or MSR_EFER
to the ecx
and call rdmsr
instruction for the reading model specific register.
movl $MSR_EFER, %ecx
rdmsr
The result will be in the edx:eax
. General view of the EFER
is following:
63 32
--------------------------------------------------------------------------------
| |
| Reserved MBZ |
| |
--------------------------------------------------------------------------------
31 16 15 14 13 12 11 10 9 8 7 1 0
--------------------------------------------------------------------------------
| | T | | | | | | | | | |
| Reserved MBZ | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
| | E | | | | | | | | | |
--------------------------------------------------------------------------------
We will not see all fields in details here, but we will learn about this and other MSRs
in the special part about. As we read EFER
to the edx:eax
, we checks _EFER_SCE
or zero bit which is System Call Extensions
with btsl
instruction and set it to one. By the setting SCE
bit we enable SYSCALL
and SYSRET
instructions. In the next step we check 20th bit in the edi
, remember that this register stores result of the cpuid
(see above). If 20
bit is set (NX
bit) we just write EFER_SCE
to the model specific register.
btsl $_EFER_SCE, %eax
btl $20,%edi
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
1: wrmsr
If NX
bit is supported we enable _EFER_NX
and write it too, with the wrmsr
instruction.
In the next step we need to update Global Descriptor table with lgdt
instruction:
lgdt early_gdt_descr(%rip)
where Global Descriptor table defined as:
early_gdt_descr:
.word GDT_ENTRIES*8-1
early_gdt_descr_base:
.quad INIT_PER_CPU_VAR(gdt_page)
We need to reload Global Descriptor Table because now kernel works in the userspace addresses, but soon kernel will work in it's own space. Now let's look on early_gdt_descr
definition. Global Descriptor Table contains 32 entries:
#define GDT_ENTRIES 32
for kernel code, data, thread local storage segments and etc... it's simple. Now let's look on the early_gdt_descr_base
. First of gdt_page
defined as:
struct gdt_page {
struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));
in the arch/x86/include/asm/desc.h. It contains one field gdt
which is array of the desc_struct
structures which defined as:
struct desc_struct {
union {
struct {
unsigned int a;
unsigned int b;
};
struct {
u16 limit0;
u16 base0;
unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
};
};
} __attribute__((packed));
and presents familiar to us GDT descriptor. Also we can note that gdt_page
structure aligned to PAGE_SIZE
which is 4096 bytes. It means that gdt
will occupy one page. Now let's try to understand what is it INIT_PER_CPU_VAR
. INIT_PER_CPU_VAR
is a macro which defined in the arch/x86/include/asm/percpu.h and just concats init_per_cpu__
with the given parameter:
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
After this we have init_per_cpu__gdt_page
. We can see in the linker script:
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
As we got init_per_cpu__gdt_page
in INIT_PER_CPU_VAR
and INIT_PER_CPU
macro from linker script will be expanded we will get offset from the __per_cpu_load
. After this calculations, we will have correct base address of the new GDT.
Generally per-CPU variables is a 2.6 kernel feature. You can understand what is it from it's name. When we create per-CPU
variable, each CPU will have will have it's own copy of this variable. Here we creating gdt_page
per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with it's own copy of variable and etc... So every core on multiprocessor will have it's own GDT
table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about per-CPU
variables in the Theory/per-cpu post.
As we loaded new Global Descriptor Table, we reload segments as we did it every time:
xorl %eax,%eax
movl %eax,%ds
movl %eax,%ss
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
After all of these steps we set up gs
register that it post to the irqstack
(we will see information about it in the next parts):
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
where MSR_GS_BASE
is:
#define MSR_GS_BASE 0xc0000101
We need to put MSR_GS_BASE
to the ecx
register and load data from the eax
and edx
(which are point to the initial_gs
) with wrmsr
instruction. We don't use cs
, fs
, ds
and ss
segment registers for addressation in the 64-bit mode, but fs
and gs
registers can be used. fs
and gs
have a hidden part (as we saw it in the real mode for cs
) and this part contains descriptor which mapped to Model specific registers. So we can see above 0xc0000101
is a gs.base
MSR address.
In the next step we put the address of the real mode bootparam structure to the rdi
(remember rsi
holds pointer to this structure from the start) and jump to the C code with:
movq initial_code(%rip),%rax
pushq $0
pushq $__KERNEL_CS
pushq %rax
lretq
Here we put the address of the initial_code
to the rax
and push fake address, __KERNEL_CS
and the address of the initial_code
to the stack. After this we can see lretq
instruction which means that after it return address will be extracted from stack (now there is address of the initial_code
) and jump there. initial_code
defined in the same source code file and looks:
__REFDATA
.balign 8
GLOBAL(initial_code)
.quad x86_64_start_kernel
...
...
...
As we can see initial_code
contains address of the x86_64_start_kernel
, which defined in the arch/x86/kerne/head64.c and looks like this:
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
...
...
...
}
It has one argument is a real_mode_data
(remember that we passed address of the real mode data to the rdi
register previously).
This is first C code in the kernel!
We need to see last preparations before we can see "kernel entry point" - start_kernel function from the init/main.c.
First of all we can see some checks in the x86_64_start_kernel
function:
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
There are checks for different things like virtual addresses of modules space is not fewer than base address of the kernel text - __STAT_KERNEL_map
, that kernel text with modules is not less than image of the kernel and etc... BUILD_BUG_ON
is a macro which looks as:
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
Let's try to understand this trick works. Let's take for example first condition: MODULES_VADDR < __START_KERNEL_map
. !!conditions
is the same that condition != 0
. So it means if MODULES_VADDR < __START_KERNEL_map
is true, we will get 1
in the !!(condition)
or zero if not. After 2*!!(condition)
we will get or 2
or 0
. In the end of calculations we can get two different behaviors:
- We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because
MODULES_VADDR
can't be less than__START_KERNEL_map
will be in our case); - No compilation errors.
That's all. So interesting C trick for getting compile error which depends on some constants.
In the next step we can see call of the cr4_init_shadow
function which stores shadow copy of the cr4
per cpu. Context switches can change bits in the cr4
so we need to store cr4
for each CPU. And after this we can see call of the reset_early_page_tables
function where we resets all page global directory entries and write new pointer to the PGT in cr3
:
for (i = 0; i < PTRS_PER_PGD-1; i++)
early_level4_pgt[i].pgd = 0;
next_early_pgt = 0;
write_cr3(__pa_nodebug(early_level4_pgt));
soon we will build new page tables. Here we can see that we go through all Page Global Directory Entries (PTRS_PER_PGD
is 512
) in the loop and make it zero. After this we set next_early_pgt
to zero (we will see details about it in the next post) and write physical address of the early_level4_pgt
to the cr3
. __pa_nodebug
is a macro which will be expanded to:
((unsigned long)(x) - __START_KERNEL_map + phys_base)
After this we clear _bss
from the __bss_stop
to __bss_start
and the next step will be setup of the early IDT
handlers, but it's big theme so we will see it in the next part.
This is the end of the first part about linux kernel initialization.
If you have questions or suggestions, feel free to ping me in twitter 0xAX, drop me email or just create issue.
In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and many many more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to linux-internals.