Kernel initialization. Part 1.

First steps in the kernel code

In the previous post (Kernel booting process. Part 5. - Kernel decompression) we stopped at the jump to the decompressed kernel:

jmp	*%rax

and now we are in the kernel. There are many things to do before the kernel starts the first init process. Hopefully we will see all of these preparations in this big chapter. We will start from the kernel entry point, which is in arch/x86/kernel/head_64.S. We will see the first preparations, such as early page table initialization and the switch to a new descriptor in kernel space, and many more, before the start_kernel function from init/main.c is called.

So let's start.

First steps in the kernel

Okay, we got the address of the kernel from the decompress_kernel function into the rax register and just jumped there. The decompressed kernel code starts in arch/x86/kernel/head_64.S:

	__HEAD
	.code64
	.globl startup_64
startup_64:
	...
	...
	...

We can see the definition of the startup_64 routine. It is placed in the __HEAD section, which is just:

#define __HEAD		.section	".head.text","ax"

We can see the definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:

.text : AT(ADDR(.text) - LOAD_OFFSET) {
	_text = .;
	...
	...
	...
} :text = 0x9090

We can work out the default virtual and physical addresses from the linker script. Note that the address of _text is the value of the location counter, which is set with:

. = __START_KERNEL;

for x86_64. We can find the definition of the __START_KERNEL macro in arch/x86/include/asm/page_types.h:

#define __START_KERNEL	(__START_KERNEL_map + __PHYSICAL_START)

#define __PHYSICAL_START  ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)

Here we can see that __START_KERNEL is the sum of __START_KERNEL_map (which is 0xffffffff80000000, see the post about paging) and __PHYSICAL_START, where __PHYSICAL_START is the value of CONFIG_PHYSICAL_START aligned to CONFIG_PHYSICAL_ALIGN. So if you do not use kASLR and do not change CONFIG_PHYSICAL_START in the configuration, the addresses will be the following:

  • Physical address - 0x1000000;
  • Virtual address - 0xffffffff81000000.
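The arithmetic behind these two defaults can be sketched in C. This is only an illustration: START_KERNEL_MAP and CONFIG_PHYSICAL_START use the values from the text, the 2 MB CONFIG_PHYSICAL_ALIGN is an assumed setting, and align_up, physical_start and start_kernel are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP      0xffffffff80000000ULL /* from the text */
#define CONFIG_PHYSICAL_START 0x1000000ULL          /* default from the text */
#define CONFIG_PHYSICAL_ALIGN 0x200000ULL           /* assumption: 2 MB */

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Default physical load address (the role of __PHYSICAL_START). */
static uint64_t physical_start(void)
{
    return align_up(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN);
}

/* Default virtual address of _text (the role of __START_KERNEL). */
static uint64_t start_kernel(void)
{
    return START_KERNEL_MAP + physical_start();
}
```

With these values physical_start() gives 0x1000000 and start_kernel() gives 0xffffffff81000000, matching the list above.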

Now we know the default physical and virtual addresses of the startup_64 routine, but to know the actual addresses we must calculate them with the following code:

	leaq	_text(%rip), %rbp
	subq	$_text - __START_KERNEL_map, %rbp

Here we put the rip-relative address of _text into the rbp register and then subtract $_text - __START_KERNEL_map from it. We know that the compiled address of _text is 0xffffffff81000000 and that __START_KERNEL_map is 0xffffffff80000000, so the subtracted value is 0x1000000, the default physical address. After this calculation rbp contains the difference between the physical address the kernel is actually running at and the address it was compiled to run at; it is zero if the kernel was loaded at the default address. We need this value because the kernel is not guaranteed to be loaded at the default address.
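As a rough C model of the leaq/subq pair, assuming the default link-time addresses from the text: load_delta is a hypothetical helper, its argument stands for the rip-relative address of _text, and its return value stands for what ends up in rbp.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL
#define COMPILED_TEXT    0xffffffff81000000ULL /* link-time address of _text */

/*
 * runtime_text: address of _text as seen at run time (leaq _text(%rip)).
 * Returns the value the subq leaves in rbp.
 */
static uint64_t load_delta(uint64_t runtime_text)
{
    return runtime_text - (COMPILED_TEXT - START_KERNEL_MAP);
}
```

For a kernel loaded at the default 0x1000000 the result is 0; for a kernel loaded at, say, 0x2000000 the result is 0x1000000.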

In the next step we check that this address is aligned with:

	movq	%rbp, %rax
	andl	$~PMD_PAGE_MASK, %eax
	testl	%eax, %eax
	jnz	bad_address

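Here we copy the value to rax and mask it with ~PMD_PAGE_MASK, which keeps only the low 21 bits. If any of these bits is set, the address is not aligned. PMD_PAGE_MASK is the mask for the Page middle directory (read the paging post about it) and is defined as: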

#define PMD_PAGE_MASK           (~(PMD_PAGE_SIZE-1))

#define PMD_PAGE_SIZE           (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT       21

As we can easily calculate, PMD_PAGE_SIZE is 2 megabytes. Here we use the standard formula for checking alignment, and if the text address is not aligned to 2 megabytes, we jump to the bad_address label.
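The same alignment test can be written in C with the macros shown above. This is a small sketch; is_pmd_aligned is a hypothetical name and 1ULL replaces _AC(1, UL).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_SHIFT     21
#define PMD_PAGE_SIZE (1ULL << PMD_SHIFT)
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE - 1))

/* True when the low 21 bits are all zero, i.e. addr is 2 MB aligned. */
static bool is_pmd_aligned(uint64_t addr)
{
    return (addr & ~PMD_PAGE_MASK) == 0;
}
```

The default address 0x1000000 passes this check, while 0x1001000 (only 4 KB aligned) does not.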

After this we check that the address is not too large:

	leaq	_text(%rip), %rax
	shrq	$MAX_PHYSMEM_BITS, %rax
	jnz	bad_address

The address must fit in 46 bits:

#define MAX_PHYSMEM_BITS       46
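The shrq/jnz pair amounts to the following check, sketched here in C with a hypothetical helper name (fits_physmem):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PHYSMEM_BITS 46

/* True when no bit at position 46 or above is set. */
static bool fits_physmem(uint64_t addr)
{
    return (addr >> MAX_PHYSMEM_BITS) == 0;
}
```

Any address below 1 << 46 passes; 1 << 46 itself is the first value that would jump to bad_address.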

Okay, we did some early checks and now we can move on.

Fix base addresses of page tables

The first step, before we start to set up identity paging, is to correct the following addresses:

	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)

Here we need to correct the entries of early_level4_pgt and of the other page table directories because, as I wrote above, the kernel may not be running at the default 0x1000000 address. The rbp register contains the offset from the default address, so we add it to entries in early_level4_pgt, level3_kernel_pgt and level2_fixmap_pgt. Let's try to understand what these labels mean. First of all let's look at their definitions:

NEXT_PAGE(early_level4_pgt)
	.fill	511,8,0
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level3_kernel_pgt)
	.fill	L3_START_KERNEL,8,0
	.quad	level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.quad	level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level2_kernel_pgt)
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
		KERNEL_IMAGE_SIZE/PMD_SIZE)

NEXT_PAGE(level2_fixmap_pgt)
	.fill	506,8,0
	.quad	level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
	.fill	5,8,0

NEXT_PAGE(level1_fixmap_pgt)
	.fill	512,8,0

It looks complicated, but it is not.

First of all let's look at early_level4_pgt. It starts with (4096 - 8) bytes of zeros, which means that we don't use the first 511 early_level4_pgt entries. After this we can see the level3_kernel_pgt entry. Note that we subtract __START_KERNEL_map from its address and add _PAGE_TABLE. As we know, __START_KERNEL_map is the base virtual address of the kernel text, so if we subtract __START_KERNEL_map we get the physical address of level3_kernel_pgt. Now let's look at _PAGE_TABLE; it is just the page entry access rights:

#define _PAGE_TABLE     (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                         _PAGE_ACCESSED | _PAGE_DIRTY)

You can read more about it in the paging post.

level3_kernel_pgt stores entries which map kernel space. At the start of its definition, we can see that it is filled with zeros L3_START_KERNEL times. Here L3_START_KERNEL is the index in the page upper directory which contains the __START_KERNEL_map address, and it equals 510. After it we can see the definition of two level3_kernel_pgt entries: level2_kernel_pgt and level2_fixmap_pgt. The first is simple: it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has:

#define _KERNPG_TABLE   (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
                         _PAGE_DIRTY)

access rights. The second, level2_fixmap_pgt, covers the fixmap area: virtual addresses which can be mapped to arbitrary physical addresses, even within kernel space.
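To make the "address minus __START_KERNEL_map plus flags" pattern concrete, here is a minimal C sketch. The individual flag bit positions are the architectural x86 ones and are not shown in the text; make_table_entry and the sample address in the note below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL

/* x86 page entry flag bits (architectural positions, not from the text). */
#define PAGE_PRESENT  (1ULL << 0)
#define PAGE_RW       (1ULL << 1)
#define PAGE_USER     (1ULL << 2)
#define PAGE_ACCESSED (1ULL << 5)
#define PAGE_DIRTY    (1ULL << 6)

#define PAGE_TABLE   (PAGE_PRESENT | PAGE_RW | PAGE_USER | \
                      PAGE_ACCESSED | PAGE_DIRTY)
#define KERNPG_TABLE (PAGE_PRESENT | PAGE_RW | PAGE_ACCESSED | PAGE_DIRTY)

/*
 * Build a table entry the way the .quad lines do: link-time virtual
 * address of the next-level table, minus the kernel map base, plus flags.
 */
static uint64_t make_table_entry(uint64_t table_virt, uint64_t flags)
{
    return table_virt - START_KERNEL_MAP + flags;
}
```

PAGE_TABLE evaluates to 0x67 and KERNPG_TABLE to 0x63; the only difference is the user bit. For a table at the made-up address 0xffffffff81a0b000, make_table_entry with PAGE_TABLE yields 0x1a0b067.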

Next, level2_kernel_pgt uses the PMDS macro, which creates page middle directory entries that map 512 megabytes starting from __START_KERNEL_map for the kernel text (the modules memory space comes after these 512 megabytes).

Now that we know what these tables are, let's get back to the code from the beginning of this section. Remember that rbp contains the offset between the actual physical address of the kernel and its default address. We just add this offset to the page table entries, so that they hold correct addresses:

	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)

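The first line adjusts the early_level4_pgt entry that points to level3_kernel_pgt, the second line adjusts the level3_kernel_pgt entry that points to level2_kernel_pgt, the third line adjusts the level3_kernel_pgt entry that points to level2_fixmap_pgt, and the fourth line adjusts the level2_fixmap_pgt entry that points to level1_fixmap_pgt.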

After all of this we will have:

early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0]   -> 512 MB kernel mapping
level2_fixmap_pgt[506] -> level1_fixmap_pgt 

Now that we have corrected the base addresses of the page tables, we can start to build them.

Identity mapping setup

Now we can set up identity mapping in the early page tables. In identity mapped paging, virtual addresses are mapped to physical addresses that have the same value, 1 : 1. Let's look at it in detail. First of all we get the rip-relative addresses of _text and early_level4_pgt and put them into the rdi and rbx registers:

	leaq	_text(%rip), %rdi
	leaq	early_level4_pgt(%rip), %rbx

After this we store the physical address of _text in rax and get the index of the page global directory entry which covers the _text address by shifting it right by PGDIR_SHIFT:

	movq	%rdi, %rax
	shrq	$PGDIR_SHIFT, %rax

	leaq	(4096 + _KERNPG_TABLE)(%rbx), %rdx
	movq	%rdx, 0(%rbx,%rax,8)
	movq	%rdx, 8(%rbx,%rax,8)

where PGDIR_SHIFT is 39. PGDIR_SHIFT is the position of the page global directory index bits in a virtual address. There are macros for all types of page directories:

#define PGDIR_SHIFT     39
#define PUD_SHIFT       30
#define PMD_SHIFT       21
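A short C sketch of how these shifts turn an address into table indices. The helper names pgd_idx, pud_idx and pmd_idx are hypothetical, and the sketch always masks with 511, whereas the assembly above skips the mask for the page global directory index.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT    39
#define PUD_SHIFT      30
#define PMD_SHIFT      21
#define PTRS_PER_TABLE 512 /* entries in each 4096-byte table */

/* Index into the page global directory. */
static unsigned pgd_idx(uint64_t addr)
{
    return (unsigned)((addr >> PGDIR_SHIFT) & (PTRS_PER_TABLE - 1));
}

/* Index into the page upper directory. */
static unsigned pud_idx(uint64_t addr)
{
    return (unsigned)((addr >> PUD_SHIFT) & (PTRS_PER_TABLE - 1));
}

/* Index into the page middle directory. */
static unsigned pmd_idx(uint64_t addr)
{
    return (unsigned)((addr >> PMD_SHIFT) & (PTRS_PER_TABLE - 1));
}
```

For __START_KERNEL_map (0xffffffff80000000) this gives indices 511, 510 and 0, which is where the 511 and 510 in the page table definitions come from. For the default physical address 0x1000000 it gives 0, 0 and 8.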

After this we put the address of the page that follows early_level4_pgt (rbx + 4096), combined with the _KERNPG_TABLE access rights (see above), into rdx and write it to two consecutive early_level4_pgt entries, starting at the index we just calculated. That page serves as the page upper directory for the identity mapping.

After this we add 4096 (the size of one page table) to rdx, so that it refers to the next page, which serves as the page middle directory, and copy rdi (it contains the physical address of _text) into rax. Then we compute the page upper directory index of _text and write rdx to two consecutive page upper directory entries (the index wraps around at PTRS_PER_PUD):

	addq	$4096, %rdx
	movq	%rdi, %rax
	shrq	$PUD_SHIFT, %rax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)
	incl	%eax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)

In the next step the page middle directory entries for the identity mapping are written, and the last step is fixing up the kernel text+data mappings in level2_kernel_pgt:

	leaq	level2_kernel_pgt(%rip), %rdi
	leaq	4096(%rdi), %r8
1:	testq	$1, 0(%rdi)
	jz	2f
	addq	%rbp, 0(%rdi)
2:	addq	$8, %rdi
	cmp	%r8, %rdi
	jne	1b

Here we put the address of level2_kernel_pgt into rdi and the address of the end of this table (rdi + 4096) into the r8 register. Next we check the present bit of the current level2_kernel_pgt entry. If it is set, we add rbp to the entry; if it is zero, we skip the entry. In both cases we move to the next entry by adding 8 bytes to rdi. After this we compare rdi with r8 and either go back to label 1 or move forward.
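The loop can be modelled in C as follows. This is a sketch only: the table is an ordinary array rather than the real level2_kernel_pgt, and relocate_present_entries is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512
#define PAGE_PRESENT      1ULL /* bit 0, tested by testq $1 */

/*
 * Add delta (the role of rbp) to every present entry of a 512-entry
 * table and leave non-present entries untouched.
 * Returns the number of entries that were adjusted.
 */
static unsigned relocate_present_entries(uint64_t *table, uint64_t delta)
{
    unsigned fixed = 0;

    for (unsigned i = 0; i < ENTRIES_PER_TABLE; i++) {
        if (table[i] & PAGE_PRESENT) {
            table[i] += delta;
            fixed++;
        }
    }
    return fixed;
}
```

Since the delta is 2 MB aligned, adding it changes only the address part of each entry and leaves the flag bits in the low part intact.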

In the next step we adjust phys_base with rbp (the offset from the default load address), put the link-time physical address of early_level4_pgt into rax and jump to label 1:

	addq	%rbp, phys_base(%rip)
	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
	jmp 1f

where phys_base matches the first entry of level2_kernel_pgt, which is the 512 MB kernel mapping.

Last preparations

After we have jumped to label 1, we enable PAE and PGE (Page Global Enable), add phys_base (see above) to the address in the rax register and load the result into the cr3 register:

1:
	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
	movq	%rcx, %cr4

	addq	phys_base(%rip), %rax
	movq	%rax, %cr3

In the next step we check whether the CPU supports the NX bit with:

	movl	$0x80000001, %eax
	cpuid
	movl	%edx,%edi

We put the value 0x80000001 into eax and execute the cpuid instruction to get the extended processor info and feature bits. The result will be in the edx register, which we copy to edi.

Now we put 0xc0000080, or MSR_EFER, into ecx and execute the rdmsr instruction to read the model specific register.

	movl	$MSR_EFER, %ecx
	rdmsr

The result will be in edx:eax. The general layout of EFER is the following:

63                                                                              32
 --------------------------------------------------------------------------------
|                                                                               |
|                                Reserved MBZ                                   |
|                                                                               |
 --------------------------------------------------------------------------------
31                            16  15      14      13   12  11   10  9  8 7  1   0
 --------------------------------------------------------------------------------
|                              | T |       |       |    |   |   |   |   |   |   |
| Reserved MBZ                 | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
|                              | E |       |       |    |   |   |   |   |   |   |
 --------------------------------------------------------------------------------

We will not look at all the fields in detail here, but we will learn about this and other MSRs in a special part about them. After reading EFER into edx:eax, we set the _EFER_SCE bit (bit zero, System Call Extensions) to one with the btsl instruction. Setting the SCE bit enables the SYSCALL and SYSRET instructions. In the next step we check bit 20 of edi; remember that this register stores the result of cpuid (see above). If bit 20 (the NX bit) is not set, we jump straight to wrmsr and write EFER with only the SCE bit added.

	btsl	$_EFER_SCE, %eax
	btl	    $20,%edi
	jnc     1f
	btsl	$_EFER_NX, %eax
	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
1:	wrmsr

If the NX bit is supported we also set _EFER_NX in eax and the NX bit in early_pmd_flags, and then write EFER with the wrmsr instruction.
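The decision made between rdmsr and wrmsr can be summarised in C. This is a sketch under stated assumptions: the bit positions come from the EFER diagram and the text above, early_efer_low is a hypothetical helper, and the early_pmd_flags update is left out.

```c
#include <assert.h>
#include <stdint.h>

#define EFER_SCE_BIT     0  /* System Call Extensions */
#define EFER_NX_BIT      11 /* NXE in the diagram above */
#define CPUID_EDX_NX_BIT 20 /* NX support in cpuid 0x80000001 edx */

/*
 * efer_low:  low 32 bits of EFER as returned by rdmsr in eax.
 * cpuid_edx: edx from cpuid leaf 0x80000001 (saved in edi above).
 * Returns the low 32 bits that wrmsr writes back.
 */
static uint32_t early_efer_low(uint32_t efer_low, uint32_t cpuid_edx)
{
    efer_low |= 1u << EFER_SCE_BIT;            /* btsl $_EFER_SCE, %eax */
    if (cpuid_edx & (1u << CPUID_EDX_NX_BIT))  /* btl $20, %edi */
        efer_low |= 1u << EFER_NX_BIT;         /* btsl $_EFER_NX, %eax */
    return efer_low;
}
```

Starting from 0x500 (LME and LMA set), the result is 0xd01 on a CPU with NX support and 0x501 on one without.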

In the next step we need to update the Global Descriptor Table with the lgdt instruction:

lgdt	early_gdt_descr(%rip)

where the Global Descriptor Table descriptor is defined as:

early_gdt_descr:
	.word	GDT_ENTRIES*8-1
early_gdt_descr_base:
	.quad	INIT_PER_CPU_VAR(gdt_page)

We need to reload the Global Descriptor Table because the kernel is currently running at addresses in the userspace range, but soon it will run in its own address space. Now let's look at the early_gdt_descr definition. The Global Descriptor Table contains 32 entries:

#define GDT_ENTRIES 32

for kernel code, data, thread local storage segments and so on. This part is simple. Now let's look at early_gdt_descr_base. First of all, gdt_page is defined as:

struct gdt_page {
	struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));

in arch/x86/include/asm/desc.h. It contains one field, gdt, which is an array of desc_struct structures, defined as:

struct desc_struct {
         union {
                 struct {
                         unsigned int a;
                         unsigned int b;
                 };
                 struct {
                         u16 limit0;
                         u16 base0;
                         unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                         unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
                 };
         };
 } __attribute__((packed));

and represents the GDT descriptor that is already familiar to us. We can also note that the gdt_page structure is aligned to PAGE_SIZE, which is 4096 bytes. This means that the gdt will occupy one page. Now let's try to understand what INIT_PER_CPU_VAR is. INIT_PER_CPU_VAR is a macro which is defined in arch/x86/include/asm/percpu.h and just concatenates init_per_cpu__ with the given parameter:

#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

After this we have init_per_cpu__gdt_page. We can see this symbol in the linker script:

#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);

Since INIT_PER_CPU_VAR gives us init_per_cpu__gdt_page, and the INIT_PER_CPU macro from the linker script defines this symbol as gdt_page offset by __per_cpu_load, after these calculations we have the correct base address of the new GDT.

Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a per-CPU variable, each CPU has its own copy of this variable. Here we are creating the gdt_page per-CPU variable. There are many advantages to variables of this type, for example no locks are needed, because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own GDT table, and every entry in the table will represent a memory segment which can be accessed from the thread which runs on that core. You can read about per-CPU variables in detail in the Theory/per-cpu post.
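The token pasting itself is easy to try in plain C. In this sketch the symbol that the linker script would define is replaced by an ordinary constant, and both numeric values are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the macro in the text: paste a prefix onto the name. */
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

/* Hypothetical values: offset of gdt_page and __per_cpu_load. */
#define GDT_PAGE_OFFSET 0x9000ULL
#define PER_CPU_LOAD    0xffffffff81c00000ULL

/* Stand-in for the linker-script line INIT_PER_CPU(gdt_page). */
static const uint64_t init_per_cpu__gdt_page = GDT_PAGE_OFFSET + PER_CPU_LOAD;

/* Models the .quad INIT_PER_CPU_VAR(gdt_page) in early_gdt_descr_base. */
static uint64_t early_gdt_base(void)
{
    /* The preprocessor turns this into: return init_per_cpu__gdt_page; */
    return INIT_PER_CPU_VAR(gdt_page);
}
```

The macro never evaluates anything; it only builds the identifier init_per_cpu__gdt_page, whose value is supplied elsewhere (by the linker script in the kernel, by a constant here).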

Now that we have loaded the new Global Descriptor Table, we reload the segment registers as we have done every time:

	xorl %eax,%eax
	movl %eax,%ds
	movl %eax,%ss
	movl %eax,%es
	movl %eax,%fs
	movl %eax,%gs

After all of these steps we set up the gs register so that it points to the irqstack (we will see more information about it in the next parts):

	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr

where MSR_GS_BASE is:

#define MSR_GS_BASE             0xc0000101

We need to put MSR_GS_BASE into the ecx register and the value of initial_gs into eax (low 32 bits) and edx (high 32 bits), and then write it with the wrmsr instruction. We don't use the cs, ds, es and ss segment registers for addressing in 64-bit mode, but the fs and gs registers can be used. fs and gs have a hidden part (as we saw in real mode for cs) and this part contains a descriptor whose base is mapped to model specific registers. So 0xc0000101, which we can see above, is the gs.base MSR address.

In the next step we put the address of the real mode bootparam structure into rdi (remember that rsi has held a pointer to this structure from the start) and jump to the C code with:
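How a 64-bit value is split across eax and edx for wrmsr can be shown in C. This is a sketch: the struct and function names are hypothetical and the sample value in the note below is made up.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit halves that wrmsr takes in eax and edx. */
struct msr_halves {
    uint32_t eax; /* low 32 bits, like movl initial_gs(%rip),%eax */
    uint32_t edx; /* high 32 bits, like movl initial_gs+4(%rip),%edx */
};

/* Split a 64-bit MSR value into the edx:eax pair. */
static struct msr_halves split_msr_value(uint64_t value)
{
    struct msr_halves h;

    h.eax = (uint32_t)(value & 0xffffffffu);
    h.edx = (uint32_t)(value >> 32);
    return h;
}

/* Rebuild the 64-bit value, as rdmsr users do with edx:eax. */
static uint64_t join_msr_value(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}
```

For a value such as 0xffff88000000a000, eax receives 0x0000a000 and edx receives 0xffff8800. Reading initial_gs at offset 0 and offset 4 works because x86 is little-endian, so the low half is stored first.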

	movq	initial_code(%rip),%rax
	pushq	$0
	pushq	$__KERNEL_CS
	pushq	%rax
	lretq

Here we load the address stored in initial_code into rax and push a fake return address, __KERNEL_CS and the address from initial_code onto the stack. After this we can see the lretq instruction, which pops the return address (now the address from initial_code) and the code segment selector from the stack and jumps there. initial_code is defined in the same source code file and looks like this:

	__REFDATA
	.balign	8
	GLOBAL(initial_code)
	.quad	x86_64_start_kernel
	...
	...
	...

As we can see, initial_code contains the address of x86_64_start_kernel, which is defined in arch/x86/kernel/head64.c and looks like this:

asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
	...
	...
	...
}

It has one argument, real_mode_data (remember that we previously passed the address of the real mode data in the rdi register).

This is the first C code in the kernel!

Next to start_kernel

We need to see the last preparations before we can see the "kernel entry point", the start_kernel function from init/main.c.

First of all we can see some checks in the x86_64_start_kernel function:

BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

There are checks for different things, for example that the virtual address of the modules space is not lower than the base address of the kernel text, __START_KERNEL_map, that the space between __START_KERNEL_map and the modules area is not smaller than the kernel image size, and so on. BUILD_BUG_ON is a macro which looks like this:

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

Let's try to understand how this trick works. Let's take the first condition as an example: MODULES_VADDR < __START_KERNEL_map. !!(condition) is the same as condition != 0. So if MODULES_VADDR < __START_KERNEL_map is true, we get 1 from !!(condition), and zero if not. After 2*!!(condition) we get either 2 or 0, so the array size is either -1 or 1. At the end of the calculation we can get two different behaviors:

  • A compilation error, because we try to take the size of a char array with a negative size (this does not happen in our case, because MODULES_VADDR is not lower than __START_KERNEL_map);
  • No compilation errors.

That's all. It is an interesting C trick for getting a compile error which depends on some constants.
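The trick can be tried outside the kernel with the macro exactly as shown above. In this sketch MODULES_VADDR_EXAMPLE is an assumed value (512 MB above __START_KERNEL_map, following the earlier remark that the modules space comes after the 512 MB kernel mapping), and ARRAY_LEN_FOR and layout_checks_pass are hypothetical helpers.

```c
#include <assert.h>

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

/* The array length that BUILD_BUG_ON would ask for. */
#define ARRAY_LEN_FOR(condition) (1 - 2*!!(condition))

#define START_KERNEL_MAP      0xffffffff80000000ULL
#define MODULES_VADDR_EXAMPLE 0xffffffffa0000000ULL /* assumption */

/* Compiles only if the check holds; returns 1 to show it was reached. */
static int layout_checks_pass(void)
{
    /* Condition is false, so this becomes sizeof(char[1]) and compiles. */
    BUILD_BUG_ON(MODULES_VADDR_EXAMPLE < START_KERNEL_MAP);

    /*
     * Reversing the comparison makes the condition true and asks for
     * sizeof(char[-1]), which the compiler rejects:
     *
     * BUILD_BUG_ON(MODULES_VADDR_EXAMPLE > START_KERNEL_MAP);
     */
    return 1;
}
```

ARRAY_LEN_FOR(0) is 1 and ARRAY_LEN_FOR(5) is -1: any non-zero condition collapses to 1 through the double negation before it is doubled and subtracted.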

In the next step we can see the call of the cr4_init_shadow function, which stores a shadow copy of cr4 per CPU. Context switches can change bits in cr4, so we need to store cr4 for each CPU. After this we can see the call of the reset_early_page_tables function, where we reset the page global directory entries and write a new pointer to the PGT into cr3:

for (i = 0; i < PTRS_PER_PGD-1; i++)
	early_level4_pgt[i].pgd = 0;

next_early_pgt = 0;

write_cr3(__pa_nodebug(early_level4_pgt));

Soon we will build new page tables. Here we can see that we go through the Page Global Directory entries in a loop (PTRS_PER_PGD is 512, so all entries except the last one are visited) and set each of them to zero. After this we set next_early_pgt to zero (we will see details about it in the next post) and write the physical address of early_level4_pgt to cr3. __pa_nodebug is a macro which expands to:

((unsigned long)(x) - __START_KERNEL_map + phys_base)
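The effect of this reset can be sketched in C with an ordinary array standing in for early_level4_pgt. The function name, the sample addresses and the sample entries are hypothetical; only the loop bound and the address formula follow the code above.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PGD     512
#define START_KERNEL_MAP 0xffffffff80000000ULL

/*
 * Clear every entry except the last one (which maps the kernel) and
 * return the value that would be written to cr3.
 * pgt_virt:  link-time virtual address of the table.
 * phys_base: offset of the kernel from its default load address.
 */
static uint64_t reset_early_tables(uint64_t *pgt, uint64_t pgt_virt,
                                   uint64_t phys_base)
{
    for (int i = 0; i < PTRS_PER_PGD - 1; i++)
        pgt[i] = 0;

    /* Same arithmetic as the __pa_nodebug expansion. */
    return pgt_virt - START_KERNEL_MAP + phys_base;
}
```

Entry 511 survives because the loop stops at PTRS_PER_PGD-1, so the kernel mapping stays valid while the identity-mapping entries are dropped.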

After this we clear the bss section from __bss_start to __bss_stop. The next step will be the setup of the early IDT handlers, but that is a big topic, so we will see it in the next part.

Conclusion

This is the end of the first part about Linux kernel initialization.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and much more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links