The Boot-time Allocator

The bootmem allocator is used during the boot process to allocate memory before the kernel MM subsystem is usable. It is quite simple, though not as simple as its predecessor. The bootmem allocator is used, for example, to allocate the mem_map - the array of page structs used by the VM subsystem to keep track of the disposition of physical pages.

The bootmem allocator lives only until the kernel has set up the data structures necessary to support the zone allocator.

The 2.2 Method

The symbol _end represents the end of the loaded kernel data - that is, the next usable byte after the kernel code loaded by the bootloader. (_end, along with a number of other important addresses, is defined in the linker script arch/i386/vmlinux.lds.) This address is in the kernel's virtual memory space at address PAGE_OFFSET+physical_kernel_end. Pages < _end are, naturally, reserved for the kernel's use, and are never used by the VM subsystem.

In the 2.2 kernel, the kernel would reserve memory at boot time by simply incrementing _end as required, in PAGE_SIZE chunks. This was somewhat inefficient, as it wasn't often the case that the size of the data in question was a multiple of PAGE_SIZE; thus many sub-page chunks were unrecoverably lost during the boot process.
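The waste described above can be illustrated with a small user-space model. This is only a sketch of the idea, not the 2.2 code: boot_end, wasted_bytes and boot_reserve() are hypothetical names, and the starting address is arbitrary.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Hypothetical stand-in for the moving end-of-kernel marker (_end). */
static unsigned long boot_end = 0x00300000UL;
/* Running total of bytes lost to page rounding. */
static unsigned long wasted_bytes = 0;

/* Reserve `size` bytes by advancing boot_end in whole pages.
 * Returns the start address of the reserved block. */
static unsigned long boot_reserve(unsigned long size)
{
    unsigned long start = boot_end;
    unsigned long rounded;

    assert(size > 0);
    rounded = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

    boot_end += rounded;
    wasted_bytes += rounded - size;   /* the tail of the last page is lost */
    return start;
}
```

In this model a 100-byte request consumes a full page, and the unused remainder of that page can never be handed out again.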

The 2.4 Method

2.4 uses a dramatically refined version of the same basic idea. The bootmem allocator in the 2.4 kernel is capable both of performing sub-page-size allocations efficiently, and (when appropriate) of reclaiming pages for the zone allocator after boot. For example, the bootmem allocator's data itself is not used after the zone allocator is initialized, so those pages can be released and given to the zone allocator as free pages.

Bootmem Allocator Data

Memory is organized into "nodes", each node being a (more or less) contiguous chunk of physical RAM; on normal, non-NUMA machines, there is only one node. Each node is represented by a pg_data_t structure, and of course on non-NUMA machines, there is only one of these, contig_page_data. Each pg_data_t has a member bdata of type bootmem_data_t that contains the bootmem allocator's data.

The bootmem_data_t struct contains a pointer (node_bootmem_map) to a bitmap representing all the pages in the node, one bit per page. If a page's boot-map bit is 1, the page is reserved and will not be touched by the VM system during normal system operations; otherwise the page will be given to the zone allocator as a free page when early boot is complete.

bootmem_data_t->node_boot_start is the physical address of the start of the node. bootmem_data_t->node_low_pfn is the maximum low page frame number on the node -- the bootmem allocator never allocates high memory. The last_pos member is the index of the last-allocated page, and last_offset is the offset of the next free address within that page.
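As a rough picture of the data just described, the following sketch declares a structure with the members named in the text and a helper that tests a page's bit. The member names come from the text; the member types, their order, the 4K page size, and the helper bootmem_page_reserved() are assumptions made for illustration, so consult the kernel headers for the real definition.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumes 4K pages, as on i386 */

/* Member names follow the text; types and layout are assumptions. */
typedef struct bootmem_data {
    unsigned long node_boot_start;    /* physical address of the node start */
    unsigned long node_low_pfn;       /* maximum low page frame number */
    unsigned char *node_bootmem_map;  /* one bit per page, 1 = reserved */
    unsigned long last_pos;           /* index of the last-allocated page */
    unsigned long last_offset;        /* next free offset within that page */
} bootmem_data_t;

/* Hypothetical helper: is the page holding physical address addr reserved? */
static int bootmem_page_reserved(const bootmem_data_t *bdata,
                                 unsigned long addr)
{
    unsigned long idx;

    assert(addr >= bdata->node_boot_start);
    idx = (addr - bdata->node_boot_start) >> PAGE_SHIFT;
    return (bdata->node_bootmem_map[idx / 8] >> (idx % 8)) & 1;
}
```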

Overview of Bootmem Allocator Operation

When the system BIOS memory map is interrogated by the kernel in setup_arch(), all nodes are given to the bootmem allocator as "reserved" memory; subsequently, those pages which correspond to real, usable RAM are bootmem-freed and are available to kernel subsystems that need to do boot-time allocations before the zone allocator is enabled. Those allocations are done either by returning an address within a bootmem-reserved page that is not fully utilized, or by reserving new pages as necessary in the bootmem bitmaps. Bootmem-allocated memory can also be freed as long as it is done before the zone allocator is enabled. Once the zone allocator is functional, all non-reserved bootmem pages are given to it as free memory, and the bootmem allocator is henceforth unusable. The bootmem code is in the __init link area, so it is itself released to the zone allocator when system boot is complete.

Note that there may be "holes" in a node's address space; for example, there is frequently a 384K hole between (physical address) 640K and 1024K on PCs. In such cases, setup_arch() simply doesn't free the bootmem pages associated with the holes. This ensures that the nonexistent pages remain reserved, both at boot time and during normal system operation.
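The reserve-everything-then-free approach can be modelled in a few lines. The sketch below is a user-space illustration under stated assumptions: a single 16 MB node, 4K pages, and the PC-style hole between 640K and 1024K mentioned above. All function names here are hypothetical, not kernel interfaces.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SHIFT 12
#define NODE_PAGES 4096UL   /* a hypothetical 16 MB node of 4K pages */

static unsigned char boot_map[NODE_PAGES / 8];

/* Start with every page marked reserved. */
static void map_reserve_all(void)
{
    memset(boot_map, 0xff, sizeof(boot_map));
}

/* Clear the bits for page frames [start_pfn, end_pfn). */
static void map_free_pfns(unsigned long start_pfn, unsigned long end_pfn)
{
    unsigned long i;

    assert(start_pfn <= end_pfn && end_pfn <= NODE_PAGES);
    for (i = start_pfn; i < end_pfn; i++)
        boot_map[i / 8] &= (unsigned char)~(1u << (i % 8));
}

static int map_is_reserved(unsigned long pfn)
{
    return (boot_map[pfn / 8] >> (pfn % 8)) & 1;
}

/* Free only the ranges that hold real RAM; the hole is never freed. */
static void register_usable_ram(void)
{
    map_reserve_all();
    map_free_pfns(0, (640UL << 10) >> PAGE_SHIFT);            /* 0 .. 640K */
    map_free_pfns((1024UL << 10) >> PAGE_SHIFT, NODE_PAGES);  /* 1M .. 16M */
}
```

Because the hole's pages are never freed, their bits stay at 1 without any special-case code.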

Bootmem Allocator Code

We first meet the bootmem allocator (on Intel platforms) in arch/i386/kernel/setup.c in the setup_arch() function. Here, the size of the low and high memory areas is computed, and init_bootmem() is called in order to initialize the bootmem data. On non-NUMA machines, init_bootmem() just calls init_bootmem_core() passing &contig_page_data as the pg_data_t.

init_bootmem_core()

The first argument to init_bootmem_core(pg_data_t *pgdat, unsigned long mapstart, unsigned long start, unsigned long end) is the pg_data_t pointer whose contents are to be initialized. The second argument is the address of the page struct within the system memory map corresponding to the first page of the node. The last two arguments are the node's start and end addresses.

free_bootmem_core()

free_bootmem_core(bootmem_data_t *bdata, unsigned long addr, unsigned long size) is called by setup_arch() in order to register free RAM areas with the bootmem allocator, and by other kernel entities to free bootmem that they no longer need. While allocations can be done in power-of-two-byte-sized chunks, frees can only be done with page granularity -- any page that is even partially used by permanent kernel data is considered reserved.

The arguments to free_bootmem_core() are a bootmem_data_t describing the node, and the address and size of the block to free. The function converts the start address to a page frame index (rounding up), converts the end address to a page frame index (rounding down), and zeros the corresponding bits in the bootmem bitmap for the node.
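The rounding rule is easiest to see with numbers. The helper below is a hypothetical sketch, not the kernel function, and assumes 4K pages. It computes which page indices a free request may release, excluding pages that the range covers only partially.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical helper: compute the page indices [*first, *end), relative to
 * node_start, that a free of `size` bytes at physical address `addr` may
 * release. Partially covered pages at either edge are excluded. */
static void free_range_to_pages(unsigned long node_start, unsigned long addr,
                                unsigned long size,
                                unsigned long *first, unsigned long *end)
{
    assert(addr >= node_start);
    *first = (addr - node_start + PAGE_SIZE - 1) >> PAGE_SHIFT; /* round up */
    *end = (addr + size - node_start) >> PAGE_SHIFT;            /* round down */
    if (*end < *first)
        *end = *first;   /* range lies inside one page: nothing to free */
}
```

For example, freeing 0x2000 bytes at address 0x1800 on a node starting at 0 releases only page 2: pages 1 and 3 are each half covered, so they stay reserved.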

__alloc_bootmem_core()

__alloc_bootmem_core(bootmem_data_t *bdata, unsigned long size, unsigned long align, unsigned long goal) is used to allocate boot memory on a particular node. The more-generic interface __alloc_bootmem() simply tries to allocate from each of the extant nodes until it succeeds; we'll ignore multiple-node systems for the moment.

The arguments are the bootmem_data_t struct describing the node, the size of the requested block, the byte alignment requirement (which must be a power of 2), and a "goal" address. The allocator will return an address > the goal if it can [why?].
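To make the roles of the size, alignment and goal arguments concrete, here is a deliberately simplified first-fit sketch. It is not the kernel's algorithm: it works in whole pages only (the real allocator can also hand out sub-page blocks), uses one byte per page instead of a bitmap, and the names sketch_alloc(), align_up() and page_reserved are hypothetical. The fallback to the start of the node when nothing is free at or above the goal is likewise an assumption of this sketch.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define NPAGES     64UL

/* 1 = reserved, 0 = free; one byte per page keeps the sketch short. */
static unsigned char page_reserved[NPAGES];

/* Round x up to a multiple of align (align must be a power of 2). */
static unsigned long align_up(unsigned long x, unsigned long align)
{
    return (x + align - 1) & ~(align - 1);
}

/* First-fit search for free pages, starting at the goal and falling back to
 * the start of the node. Returns the block's offset from the node start,
 * or -1UL if no suitable run of pages exists. */
static unsigned long sketch_alloc(unsigned long size, unsigned long align,
                                  unsigned long goal)
{
    unsigned long npages, step, start, addr, first, i;
    int pass;

    assert(size > 0);
    assert(align != 0 && (align & (align - 1)) == 0);

    npages = align_up(size, PAGE_SIZE) >> PAGE_SHIFT;
    step = align > PAGE_SIZE ? align : PAGE_SIZE;
    start = align_up(goal, step);

    for (pass = 0; pass < 2; pass++) {
        for (addr = start; (addr >> PAGE_SHIFT) + npages <= NPAGES;
             addr += step) {
            first = addr >> PAGE_SHIFT;
            for (i = 0; i < npages && !page_reserved[first + i]; i++)
                ;
            if (i == npages) {
                for (i = 0; i < npages; i++)
                    page_reserved[first + i] = 1;   /* mark as allocated */
                return addr;
            }
        }
        start = 0;   /* nothing at or above the goal: retry from the start */
    }
    return -1UL;
}
```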

free_all_bootmem_core()

free_all_bootmem_core(pg_data_t* pgdat) is called when the bootmem allocator is being torn down after early boot. It releases all the non-reserved bootmem pages to the zone allocator.
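The teardown step amounts to a walk over the bitmap. The following is a minimal user-space sketch of that walk, assuming the one-bit-per-page layout described earlier. release_page_to_zone() is a hypothetical stand-in for whatever the kernel does to hand a page to the zone allocator, and the sketch does not model releasing the pages that hold the bitmap itself.

```c
#include <assert.h>

/* Hypothetical stand-in for handing one page to the zone allocator. */
static unsigned long pages_released;

static void release_page_to_zone(unsigned long pfn)
{
    (void)pfn;           /* a real implementation would free this page */
    pages_released++;
}

/* Walk a one-bit-per-page map and release every page whose bit is 0.
 * Returns the number of pages released. */
static unsigned long release_unreserved(const unsigned char *map,
                                        unsigned long npages)
{
    unsigned long pfn, count = 0;

    assert(map != 0);
    for (pfn = 0; pfn < npages; pfn++) {
        if (!((map[pfn / 8] >> (pfn % 8)) & 1)) {
            release_page_to_zone(pfn);
            count++;
        }
    }
    return count;
}
```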


Questions and comments to Joe Knapka

The LXR links in this page were produced by lxrreplace.tcl, which is available for free.
