This is the fourth part of the Kernel booting process series. Here we will see the first steps in protected mode: checking that the CPU supports long mode and SSE, setting up paging and initializing the page tables, and finally the transition to long mode at the end of this part.
NOTE: there will be a lot of assembly code in this part, so if you are not familiar with it, you may want to read a book about it first.
In the previous part we stopped at the jump to the 32-bit entry point in the arch/x86/boot/pmjump.S:
jmpl *%eax
Recall that the eax register contains the address of the 32-bit entry point. We can read about this in the linux kernel x86 boot protocol:
When using bzImage, the protected-mode kernel was relocated to 0x100000
Now we can verify that this is true. Let's look at the register values at the 32-bit entry point:
eax 0x100000 1048576
ecx 0x0 0
edx 0x0 0
ebx 0x0 0
esp 0x1ff5c 0x1ff5c
ebp 0x0 0x0
esi 0x14470 83056
edi 0x0 0
eip 0x100000 0x100000
eflags 0x46 [ PF ZF ]
cs 0x10 16
ss 0x18 24
ds 0x18 24
es 0x18 24
fs 0x18 24
gs 0x18 24
We can see here that the cs register contains 0x10 (as you may remember from the previous part, this selects the entry at index 2 in the Global Descriptor Table), the eip register is 0x100000, and the base address of all segments, including the code segment, is zero. So the physical address is 0:0x100000, or just 0x100000, as stated in the boot protocol. Now let's start with the 32-bit entry point.
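To make the register dump easier to read, here is a minimal Python sketch that splits a segment selector into its fields and computes a flat address. The function names decode_selector and linear_address are made up for this illustration; only the values 0x10, 0x18 and 0x100000 come from the dump above.

```python
def decode_selector(selector):
    """Split a 16-bit x86 segment selector into (index, table_indicator, rpl)."""
    index = selector >> 3                  # bits 3..15: descriptor index
    table_indicator = (selector >> 2) & 1  # bit 2: 0 = GDT, 1 = LDT
    rpl = selector & 0b11                  # bits 0..1: requested privilege level
    return index, table_indicator, rpl


def linear_address(segment_base, offset):
    """With a flat segment (base 0) the linear address is just the offset."""
    return (segment_base + offset) & 0xFFFFFFFF


for name, value in (("cs", 0x10), ("ds", 0x18)):
    print(name, decode_selector(value))
print(hex(linear_address(0, 0x100000)))
```

With the values from the dump, cs decodes to index 2 and the data segment registers to index 3, both in the GDT with privilege level 0.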
We can find the definition of the 32-bit entry point in arch/x86/boot/compressed/head_64.S:
__HEAD
.code32
ENTRY(startup_32)
....
....
....
ENDPROC(startup_32)
First of all, why the compressed directory? Actually bzImage is a gzipped vmlinux + header + kernel setup code. We saw the kernel setup code in all of the previous parts. So, the main goal of head_64.S is to prepare for entering long mode, enter it, and then decompress the kernel. We will see all of these steps except kernel decompression in this part.
Also note that there are two files in the arch/x86/boot/compressed directory:
- head_32.S
- head_64.S
We will look only at head_64.S because we are studying the linux kernel for x86_64; head_32.S is not even compiled in our case. Let's look at arch/x86/boot/compressed/Makefile, where we can see the following target:
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
$(obj)/string.o $(obj)/cmdline.o \
$(obj)/piggy.o $(obj)/cpuflags.o
Note $(obj)/head_$(BITS).o. It means that which of head_{32,64}.o gets compiled depends on the value of $(BITS). We can find it in another Makefile, arch/x86/Makefile:
ifeq ($(CONFIG_X86_32),y)
BITS := 32
...
...
else
...
...
BITS := 64
endif
Now we know where to start, so let's do it.
As I wrote above, we start in arch/x86/boot/compressed/head_64.S. First of all, right before the startup_32 definition we can see:
__HEAD
.code32
ENTRY(startup_32)
__HEAD is defined in include/linux/init.h and looks like this:
#define __HEAD .section ".head.text","ax"
We can find this section in the arch/x86/boot/compressed/vmlinux.lds.S linker script:
SECTIONS
{
. = 0;
.head.text : {
_head = . ;
HEAD_TEXT
_ehead = . ;
}
Note the . = 0; line. The . symbol is a special linker variable, the location counter. The value assigned to it is an offset relative to the start of the containing segment. Here we assign zero to it, and the comment in the linker script explains why:
Be careful parts of head_64.S assume startup_32 is at address 0.
Ok, now we know where we are, and it is a good time to look inside the startup_32 function.
At the start of startup_32 we can see the cld instruction, which clears the DF flag. After this, string operations like stosb and others will increment the index registers esi or edi.
Next we can see the check of the KEEP_SEGMENTS flag from loadflags. If you remember, we already saw loadflags in arch/x86/boot/head.S (there we checked the CAN_USE_HEAP flag). Now we need to check the KEEP_SEGMENTS flag. We can find a description of this flag in the linux boot protocol:
Bit 6 (write): KEEP_SEGMENTS
Protocol: 2.07+
- If 0, reload the segment registers in the 32bit entry point.
- If 1, do not reload the segment registers in the 32bit entry point.
Assume that %cs %ds %ss %es are all set to flat segments with
a base of 0 (or the equivalent for their environment).
So if KEEP_SEGMENTS is not set, we need to set the ds, ss and es registers to a flat segment with base 0. That is what we do:
testb $(1 << 6), BP_loadflags(%esi)
jnz 1f
cli
movl $(__BOOT_DS), %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %ss
Remember that __BOOT_DS is 0x18 (the selector of the data segment in the Global Descriptor Table). If KEEP_SEGMENTS is set, we jump to the label 1f; otherwise we update the segment registers with __BOOT_DS.
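The branch above can be sketched in a few lines of Python. This is only an illustration of the testb/jnz logic; the function segments_after_entry and the sample selector values passed to it are invented for this example.

```python
KEEP_SEGMENTS = 1 << 6   # bit 6 of loadflags, as described in the boot protocol
BOOT_DS = 0x18           # value of __BOOT_DS mentioned in the text


def segments_after_entry(loadflags, current):
    """Return the (ds, es, ss) values startup_32 would continue with."""
    if loadflags & KEEP_SEGMENTS:   # testb $(1 << 6), ...; jnz 1f -> skip the reload
        return current
    return (BOOT_DS, BOOT_DS, BOOT_DS)   # movl $(__BOOT_DS), %eax; movl %eax, %ds ...


# Hypothetical selectors left behind by a 32-bit bootloader:
print(segments_after_entry(0x00, (0x08, 0x08, 0x08)))
print(segments_after_entry(KEEP_SEGMENTS, (0x08, 0x08, 0x08)))
```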
If you read the previous part, you may remember that we already updated the segment registers in arch/x86/boot/pmjump.S, so why do we need to set them up again? Actually the linux kernel also has a 32-bit boot protocol, so startup_32 can be the first function executed right after a bootloader transfers control to the kernel.
Having checked the KEEP_SEGMENTS flag and put the correct value into the segment registers, the next step is to calculate the difference between where we were loaded and where we were compiled to run (remember that vmlinux.lds.S contains . = 0 at the start of the section):
leal (BP_scratch+4)(%esi), %esp
call 1f
1: popl %ebp
subl $1b, %ebp
Here the esi register contains the address of the boot_params structure. boot_params contains a special field scratch at offset 0x1e4. We take the address of the scratch field + 4 bytes and put it into the esp register (we will use it as a temporary stack for this calculation). After this we can see the call instruction with the 1f label as its operand. What does call do? It pushes the return address onto the stack, which is the address of the instruction following the call (here the label 1 itself), and jumps to the target. After this we pop the return address from the stack into the ebp register and subtract from it the link-time address of the label 1.
After this, ebp contains the address where we were loaded, 0x100000.
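The call/pop trick can be modelled in Python to show why the result is the load address. The function load_delta and the link-time offset 0x2a used below are hypothetical; the real offset of the label depends on the build.

```python
def load_delta(load_address, link_address_of_label):
    """Simulate 'call 1f; 1: popl %ebp; subl $1b, %ebp'."""
    stack = []
    # At run time the label lives at load address + link-time offset.
    runtime_label = load_address + link_address_of_label
    stack.append(runtime_label)     # call 1f pushes the return address
    ebp = stack.pop()               # popl %ebp
    ebp -= link_address_of_label    # subl $1b, %ebp
    return ebp


# 0x2a is an invented link-time offset of the label, used only for illustration.
print(hex(load_delta(0x100000, 0x2a)))
```

Whatever the offset of the label is, it cancels out, so only the load address remains in ebp.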
Now we can set up the stack and verify that the CPU supports long mode and SSE.
Next we can see the assembly code which sets up a new stack for kernel decompression:
movl $boot_stack_end, %eax
addl %ebp, %eax
movl %eax, %esp
boot_stack_end is in the .bss section; we can see its definition at the end of head_64.S:
.bss
.balign 4
boot_heap:
.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
First of all we put the address of boot_stack_end into the eax register and add the value of ebp to it (remember that ebp now contains the address where we were loaded, 0x100000). In the end we just put the eax value into esp and that's all, we have a correct stack pointer.
The next step is CPU verification. We need to check that the CPU supports long mode and SSE:
call verify_cpu
testl %eax, %eax
jnz no_longmode
It just calls the verify_cpu function from arch/x86/kernel/verify_cpu.S, which contains a couple of calls of the cpuid instruction. cpuid is the instruction used to get information about the processor. In our case it checks long mode and SSE support and returns 0 on success or 1 on failure in the eax register.
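As a rough illustration of what such a check looks at, here is a Python sketch that tests the two feature bits in EDX values as cpuid would return them. The bit positions are the architectural ones (SSE is bit 25 of EDX for leaf 1, long mode is bit 29 of EDX for leaf 0x80000001), but the function verify_cpu_sketch and the sample register values are invented here, and the real verify_cpu does more than this.

```python
SSE_BIT = 1 << 25         # CPUID leaf 0x00000001, EDX
LONG_MODE_BIT = 1 << 29   # CPUID leaf 0x80000001, EDX


def verify_cpu_sketch(leaf1_edx, ext_leaf1_edx):
    """Return 0 if both features are present, 1 otherwise (same convention as eax above)."""
    has_sse = bool(leaf1_edx & SSE_BIT)
    has_long_mode = bool(ext_leaf1_edx & LONG_MODE_BIT)
    return 0 if (has_sse and has_long_mode) else 1


# Made-up EDX values: the first CPU has both features, the second lacks long mode.
print(verify_cpu_sketch(SSE_BIT, LONG_MODE_BIT))
print(verify_cpu_sketch(SSE_BIT, 0))
```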
If eax is not zero, we jump to the no_longmode label, which just halts the CPU with the hlt instruction. If a hardware interrupt wakes the CPU up, the jmp sends it straight back to hlt:
no_longmode:
1:
hlt
jmp 1b
We have set up the stack and checked the CPU, and can now move on to the next step.
The next step is calculating the relocation address for decompression, if needed. We can see the following assembly code:
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
cmpl $LOAD_PHYSICAL_ADDR, %ebx
jge 1f
#endif
movl $LOAD_PHYSICAL_ADDR, %ebx
1:
addl $z_extract_offset, %ebx
First of all, note the CONFIG_RELOCATABLE macro. This configuration option is defined in arch/x86/Kconfig and, as we can read from its description:
This builds a kernel image that retains relocation information
so it can be loaded someplace besides the default 1MB.
Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
it has been loaded at and the compile time physical address
(CONFIG_PHYSICAL_START) is used as the minimum location.
In short, this code calculates the address to which the kernel should be moved for decompression and puts it into the ebx register. If the kernel is relocatable, that address is derived from where we were loaded; otherwise bzImage decompresses itself above LOAD_PHYSICAL_ADDR.
Let's look at the code. If we have CONFIG_RELOCATABLE=n in our kernel configuration file, it just puts LOAD_PHYSICAL_ADDR into the ebx register and adds z_extract_offset to ebx, so ebx will contain LOAD_PHYSICAL_ADDR + z_extract_offset. Now let's try to understand these two values.
LOAD_PHYSICAL_ADDR is a macro defined in arch/x86/include/asm/boot.h, and it looks like this:
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
+ (CONFIG_PHYSICAL_ALIGN - 1)) \
& ~(CONFIG_PHYSICAL_ALIGN - 1))
Here we calculate the aligned address where the kernel is loaded (0x100000 or 1 megabyte in our case). PHYSICAL_ALIGN is the alignment value to which the kernel should be aligned; it ranges from 0x200000 to 0x1000000 for x86_64. With these values we get 2 megabytes in LOAD_PHYSICAL_ADDR:
>>> 0x100000 + (0x200000 - 1) & ~(0x200000 - 1)
2097152
After we have the aligned address, we add z_extract_offset (which is 0xe5c000 in my case) to the 2 megabytes. In the end we get an offset of 17154048 bytes. You can find z_extract_offset in arch/x86/boot/compressed/piggy.S. This file is generated at compile time by the mkpiggy program.
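The arithmetic for the non-relocatable case can be checked with a short Python sketch. The helper align_up is a name chosen for this example; the three constants are the values used in the text above (z_extract_offset is the value from my build and will differ in yours).

```python
def align_up(address, alignment):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


CONFIG_PHYSICAL_START = 0x100000   # value used in this article
CONFIG_PHYSICAL_ALIGN = 0x200000   # value used in this article
Z_EXTRACT_OFFSET = 0xe5c000        # build-specific, taken from the text

load_physical_addr = align_up(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
ebx = load_physical_addr + Z_EXTRACT_OFFSET   # movl $LOAD_PHYSICAL_ADDR, %ebx; addl ...

print(hex(load_physical_addr), ebx)   # prints: 0x200000 17154048
```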
Now let's try to understand the code when CONFIG_RELOCATABLE is y.
First of all we put the ebp value into ebx (remember that ebp contains the address where we were loaded) and the kernel_alignment field from the kernel setup header into the eax register. kernel_alignment is the alignment required for the physical address of the kernel. Next we do the same as in the previous case (when the kernel is not relocatable), but we use the value of the kernel_alignment field as the alignment unit and ebx (the address where we were loaded) as the base address, instead of CONFIG_PHYSICAL_ALIGN and CONFIG_PHYSICAL_START.
After we have calculated the address, we compare it with LOAD_PHYSICAL_ADDR. If the calculated address is lower than that, we put LOAD_PHYSICAL_ADDR into ebx instead. In both cases we then add z_extract_offset to it.
After all of these calculations, ebp contains the address where we were loaded and ebx contains the address to which the kernel will be moved for decompression.
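Both branches of this calculation can be followed step by step in Python. This is a sketch of the instruction sequence shown above, not kernel code: the function decompression_target and its parameter names are invented, and the sample load addresses and alignments in the usage lines are only examples.

```python
MASK32 = 0xFFFFFFFF


def decompression_target(load_address, kernel_alignment,
                         load_physical_addr, z_extract_offset,
                         relocatable=True):
    """Return the value ebx holds after the block under CONFIG_RELOCATABLE."""
    if relocatable:
        ebx = load_address                  # movl %ebp, %ebx
        eax = kernel_alignment - 1          # movl BP_kernel_alignment(%esi), %eax; decl %eax
        ebx = (ebx + eax) & MASK32          # addl %eax, %ebx
        ebx &= ~eax & MASK32                # notl %eax; andl %eax, %ebx
        # cmpl/jge is a signed compare; for addresses below 2 GB it matches '<'.
        if ebx < load_physical_addr:
            ebx = load_physical_addr        # movl $LOAD_PHYSICAL_ADDR, %ebx
    else:
        ebx = load_physical_addr
    return (ebx + z_extract_offset) & MASK32    # addl $z_extract_offset, %ebx


# Loaded at 1 MB: aligned up to 2 MB, then the extract offset is added.
print(hex(decompression_target(0x100000, 0x200000, 0x200000, 0xe5c000)))
# Loaded at 16 MB: already aligned and above the minimum.
print(hex(decompression_target(0x1000000, 0x200000, 0x200000, 0xe5c000)))
```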
Now we need to do the last preparations before we can see the transition to 64-bit mode. First we need to update the Global Descriptor Table:
leal gdt(%ebp), %eax
movl %eax, gdt+2(%ebp)
lgdt gdt(%ebp)
Here we compute the run-time address of gdt (its link-time offset plus the load address in ebp) and put it into the eax register. Next we store this address at gdt+2, which is the base field of the GDT pointer, and load the Global Descriptor Table with the lgdt instruction.
Let's look at the Global Descriptor Table definition:
.data
gdt:
.word gdt_end - gdt
.long gdt
.word 0
.quad 0x0000000000000000 /* NULL descriptor */
.quad 0x00af9a000000ffff /* __KERNEL_CS */
.quad 0x00cf92000000ffff /* __KERNEL_DS */
.quad 0x0080890000000000 /* TS descriptor */
.quad 0x0000000000000000 /* TS continued */
It is defined in the same file, in the .data section. It contains 5 entries: a null descriptor, the kernel code segment, the kernel data segment, and two entries for the task state descriptor (in 64-bit mode a task state descriptor occupies two slots). We already loaded a GDT in the previous part and we are doing almost the same here, but now the code segment descriptor has CS.L = 1 and CS.D = 0 for execution in 64-bit mode.
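To see the L and D bits in the raw values, the two segment descriptors from the table can be decoded with a small Python sketch. The function decode_descriptor is written for this illustration; the field positions follow the standard 8-byte segment descriptor layout.

```python
def decode_descriptor(raw):
    """Decode the fields of an 8-byte x86 segment descriptor given as an integer."""
    limit = (raw & 0xFFFF) | (((raw >> 48) & 0xF) << 16)
    base = ((raw >> 16) & 0xFFFFFF) | (((raw >> 56) & 0xFF) << 24)
    return {
        "base": base,
        "limit": limit,
        "access": (raw >> 40) & 0xFF,   # present, DPL, type bits
        "L": (raw >> 53) & 1,           # 64-bit code segment
        "D": (raw >> 54) & 1,           # default operand size (must be 0 when L = 1)
        "G": (raw >> 55) & 1,           # granularity: limit counted in 4 KB units
    }


kernel_cs = decode_descriptor(0x00af9a000000ffff)
kernel_ds = decode_descriptor(0x00cf92000000ffff)
print("CS: L=%d D=%d" % (kernel_cs["L"], kernel_cs["D"]))
print("DS: L=%d D=%d" % (kernel_ds["L"], kernel_ds["D"]))
```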
After we have loaded the Global Descriptor Table, we must enable PAE mode by putting the value of the cr4 register into eax, setting bit 5 in it, and loading it back into cr4:
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
movl %eax, %cr4
Now we have finished almost all of the preparations before we can move into 64-bit mode. The last step is to build the page tables, but first some information about long mode.
Long mode is the native mode for x86_64 processors. First of all let's look at some differences between x86_64 and x86.
It provides features such as:
- 8 new general purpose registers from r8 to r15, and all general purpose registers are now 64-bit
- 64-bit instruction pointer - RIP
- New operating mode - Long mode
- 64-Bit Addresses and Operands
- RIP Relative Addressing (we will see an example of it in the next parts)
Long mode is an extension of legacy protected mode. It consists of two sub-modes:
- 64-bit mode
- compatibility mode
To switch into 64-bit mode we need to do the following things:
- enable PAE (we already did it, see above)
- build page tables and load the address of the top level page table into the cr3 register
- enable EFER.LME
- enable paging
We already enabled PAE by setting the PAE bit in the cr4 register. Now let's look at paging.
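The four steps listed above can be put together in a toy Python model. The bit positions are the architectural ones, but the CpuSketch class is entirely made up for this illustration and the cr3 value is a placeholder. A real CPU raises a fault if paging is enabled with EFER.LME set and PAE clear; this model simply leaves long mode inactive in that case.

```python
CR0_PE = 1 << 0     # protected mode enable
CR0_PG = 1 << 31    # paging enable
CR4_PAE = 1 << 5    # physical address extension
EFER_LME = 1 << 8   # long mode enable (set by software)
EFER_LMA = 1 << 10  # long mode active (set by the CPU)


class CpuSketch:
    """A toy register file, only good enough to show the required ordering."""

    def __init__(self):
        self.cr0 = CR0_PE   # we are already in protected mode
        self.cr3 = 0
        self.cr4 = 0
        self.efer = 0

    def enable_paging(self):
        self.cr0 |= CR0_PG | CR0_PE
        # The CPU activates long mode only when LME and PAE were set beforehand.
        if self.efer & EFER_LME and self.cr4 & CR4_PAE:
            self.efer |= EFER_LMA

    def long_mode_active(self):
        return bool(self.efer & EFER_LMA)


cpu = CpuSketch()
cpu.cr4 |= CR4_PAE      # 1. enable PAE
cpu.cr3 = 0x1000        # 2. address of the top level table (placeholder value)
cpu.efer |= EFER_LME    # 3. enable EFER.LME
cpu.enable_paging()     # 4. enable paging
print(cpu.long_mode_active())
```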
Before we can move into 64-bit mode, we need to build page tables, so let's look at the building of the early 4G boot page tables.
NOTE: I will not describe the theory of virtual memory here; if you need to know more about it, see the links at the end.
The linux kernel uses 4-level paging, and here we build 6 page tables:
- One PML4 table
- One PDP table
- Four Page Directory tables
Let's look at the implementation. First of all we clear the buffer for the page tables in memory. Every table is 4096 bytes, so we need a 24 kilobyte buffer:
leal pgtable(%ebx), %edi
xorl %eax, %eax
movl $((4096*6)/4), %ecx
rep stosl
We put the address stored in ebx (remember that ebx contains the address to which the kernel will be relocated for decompression) plus the pgtable offset into the edi register. pgtable is defined at the end of head_64.S and looks like this:
.section ".pgtable","a",@nobits
.balign 4096
pgtable:
.fill 6*4096, 1, 0
It is in the .pgtable section and its size is 24 kilobytes. After we put the address into edi, we zero out the eax register and write zeros to the buffer with the rep stosl instruction.
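The effect of rep stosl with DF cleared can be modelled on a bytearray. The function rep_stosl below is a simplified stand-in written for this example; the buffer is pre-filled with 0xff only to make the clearing visible.

```python
def rep_stosl(memory, edi, eax, ecx):
    """Store the 32-bit value eax ecx times, advancing edi by 4 each time (DF = 0)."""
    while ecx:
        memory[edi:edi + 4] = eax.to_bytes(4, "little")
        edi += 4
        ecx -= 1
    return edi   # edi ends up just past the last dword written


TABLE_SIZE = 4096
memory = bytearray(b"\xff" * (6 * TABLE_SIZE))   # stand-in for the pgtable buffer

count = (TABLE_SIZE * 6) // 4                    # movl $((4096*6)/4), %ecx
end = rep_stosl(memory, 0, 0, count)
print(count, end, any(memory))
```

The counter is the number of 4-byte stores, so 6144 iterations cover the whole 24 kilobyte buffer.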
Now we can build the top level page table, PML4, with:
leal pgtable + 0(%ebx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
Here we take the address stored in ebx plus the pgtable offset and put it into edi. Next we put this address plus 0x1007 into the eax register. 0x1007 is 4096 bytes (the size of the PML4) + 7 (the PML4 entry flags, PRESENT+RW+USER), so eax holds the address of the Page Directory Pointer table combined with these flags. Finally we write eax to the memory that edi points to. After these manipulations the first PML4 entry contains the address of the Page Directory Pointer table with the flags PRESENT+RW+USER.
In the next step we build 4 Page Directory entries in the Page Directory Pointer table. Every entry gets the flags 0x7, and each one points to the next 4096-byte Page Directory table:
leal pgtable + 0x1000(%ebx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
addl $0x00001000, %eax
addl $8, %edi
decl %ecx
jnz 1b
We put the base address of the Page Directory Pointer table into edi and the address of the first Page Directory table, with the flags 0x7, into eax. We put 4 into the ecx register, which will be the counter in the following loop. In each iteration we write eax to the entry that edi points to, add 0x1000 to eax so that it refers to the next Page Directory table (the flags 0x7 stay unchanged), and advance edi by 8 bytes to the next entry.
The next step is building 2048 Page Directory entries, each mapping a 2 megabyte page:
leal pgtable + 0x2000(%ebx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
addl $0x00200000, %eax
addl $8, %edi
decl %ecx
jnz 1b
Here we do almost the same as in the previous example, but every entry carries the flags 0x183 (present, writable, 2 megabyte page size, global), and the physical address in eax grows by 0x200000 on each iteration. In the end we have 2048 pages of 2 megabytes each.
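To check that these three steps really produce an identity mapping of the first 4 gigabytes, we can replay them in Python and walk the resulting tables by hand. This is a sketch under assumptions: build_tables and translate are names invented here, the tables are kept in a dict instead of real memory, and the pgtable address 0x2000000 is an arbitrary page-aligned placeholder.

```python
def build_tables(pgtable):
    """Return {physical address: 64-bit entry}, mirroring the three assembly blocks."""
    mem = {}
    # Level 4: one PML4 entry pointing at the Page Directory Pointer table.
    mem[pgtable] = pgtable + 0x1007
    # Level 3: four entries, each pointing at one Page Directory table.
    entry = (pgtable + 0x1000) + 0x1007
    for i in range(4):
        mem[pgtable + 0x1000 + 8 * i] = entry
        entry += 0x1000
    # Level 2: 2048 entries, each mapping one 2 MB page.
    entry = 0x183
    for i in range(2048):
        mem[pgtable + 0x2000 + 8 * i] = entry
        entry += 0x200000
    return mem


def translate(mem, cr3, vaddr):
    """Walk PML4 -> PDP -> PD for a 2 MB page and return the physical address."""
    pml4e = mem[cr3 + 8 * ((vaddr >> 39) & 0x1FF)]
    pdpte = mem[(pml4e & ~0xFFF) + 8 * ((vaddr >> 30) & 0x1FF)]
    pde = mem[(pdpte & ~0xFFF) + 8 * ((vaddr >> 21) & 0x1FF)]
    return (pde & ~0x1FFFFF) + (vaddr & 0x1FFFFF)


PGTABLE = 0x2000000   # placeholder, page-aligned address of the pgtable buffer
tables = build_tables(PGTABLE)
for address in (0x0, 0x100000, 0xFFFFFFFF):
    print(hex(address), "->", hex(translate(tables, PGTABLE, address)))
```

Every address below 4 gigabytes translates to itself, which is exactly what the decompressor needs at this stage.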
Our early page table structure is done: it maps 4 gigabytes of memory, and now we can put the address of the top level page table, PML4, into the cr3 control register:
leal pgtable(%ebx), %eax
movl %eax, %cr3
That's all. Now we can see the transition to long mode.
First of all we need to set the EFER.LME flag in the MSR at 0xC0000080:
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
wrmsr
Here we put the MSR_EFER value (which is defined in arch/x86/include/uapi/asm/msr-index.h) into the ecx register and execute the rdmsr instruction, which reads the MSR selected by ecx. After rdmsr has executed, we have the result in edx:eax. We set the EFER_LME bit with the btsl instruction and write the data back to the MSR with the wrmsr instruction.
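The read-modify-write sequence on the MSR can be imitated in Python. The dict msrs and the helper bts are stand-ins invented for this sketch, and the initial EFER value of 0 is an assumption; the MSR number and the bit number are the ones named in the text.

```python
MSR_EFER = 0xC0000080
_EFER_LME = 8            # bit number of Long Mode Enable


def bts(value, bit):
    """Bit test and set: return (old_bit, new_value); btsl leaves the old bit in CF."""
    old = (value >> bit) & 1
    return old, value | (1 << bit)


msrs = {MSR_EFER: 0}     # hypothetical initial MSR contents

ecx = MSR_EFER                                        # movl $MSR_EFER, %ecx
edx, eax = msrs[ecx] >> 32, msrs[ecx] & 0xFFFFFFFF    # rdmsr -> edx:eax
carry, eax = bts(eax, _EFER_LME)                      # btsl $_EFER_LME, %eax
msrs[ecx] = (edx << 32) | eax                         # wrmsr <- edx:eax

print(hex(msrs[MSR_EFER]), carry)
```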
In the next step we push the kernel code segment selector onto the stack (we defined the segment in the GDT) and put the address of the startup_64 routine into eax.
pushl $__KERNEL_CS
leal startup_64(%ebp), %eax
After this we push the address held in eax onto the stack as well and enable paging by setting the PG and PE bits in the cr0 register:
movl $(X86_CR0_PG | X86_CR0_PE), %eax
movl %eax, %cr0
and execute:
lret
Remember that we pushed the address of the startup_64 function onto the stack in the previous step; the lret instruction pops it, together with the code segment selector, and the CPU jumps there.
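The far return can be pictured as two pops from the stack we just prepared. In the Python sketch below, far_return is a made-up helper, 0x10 is assumed as the value of __KERNEL_CS, and the address of startup_64 is taken as the load address 0x100000 plus the 0x200 offset given by the .org directive.

```python
KERNEL_CS = 0x10   # assumed value of __KERNEL_CS for this sketch


def far_return(stack):
    """Model a 32-bit lret: pop the offset first, then the code segment selector."""
    eip = stack.pop()
    cs = stack.pop() & 0xFFFF
    return cs, eip


stack = []
stack.append(KERNEL_CS)         # pushl $__KERNEL_CS
startup_64 = 0x100000 + 0x200   # leal startup_64(%ebp), %eax
stack.append(startup_64)        # push of the address held in eax
cs, eip = far_return(stack)     # lret
print(hex(cs), hex(eip))
```

Because the selector loaded into cs refers to a descriptor with the L bit set, execution continues at startup_64 in 64-bit mode.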
After all of these steps we're finally in the 64-bit mode:
.code64
.org 0x200
ENTRY(startup_64)
....
....
....
That's all!
This is the end of the fourth part of the linux kernel booting process. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.
In the next part we will see kernel decompression and much more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.