Working Notes on the Linux VM Subsystem

My biggest problem when approaching the MM system was understanding how the kernel manages physical memory. Once that's understood, everything else is basically details of policy. Those details get very complicated, though.

Kernel Mapping Whys and Wherefores

Apparently kernel virtual memory must map all of physical memory. Why? Presumably so that kernel code can access any page. OK, so why is PAGE_OFFSET so large? Why not just make it 4K? Hypothesis: all processes share (at least in kernel mode) the mapping that starts at PAGE_OFFSET. That means that if user processes are to have address spaces starting at 0, PAGE_OFFSET has to be big enough to leave room below it for user-mode addresses. There's no reason user processes couldn't have a totally separate address mapping from the kernel, with both starting at low addresses; but that would make it a pain for the kernel to access user-space memory, and would mean that entry into the kernel required icky page-table manipulation.

I think the scheme used makes things very easy: user segments map everything up to PAGE_OFFSET, and kernel segments map PAGE_OFFSET and onward. The kernel segments always refer to the same memory (kernel mem), shared among all processes via shared page tables, while the user segments refer to each process's own memory via non-shared page tables (well, some shared and some private pages, perhaps). The kernel segments are privileged and can only be accessed in kernel mode. This hypothesis predicts that the >=PAGE_OFFSET part of swapper_pg_dir should be copied for each new process, so that transitions to kernel mode can keep using the current page directory. [Hypothesis confirmed: get_pgd_slow(), used indirectly in new_page_tables(), copies the kernel part of the pgd for a new task.]
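For concreteness, here is roughly what that copy looks like, modeled on 2.4's i386 get_pgd_slow() (a simplified sketch, not verbatim source; USER_PTRS_PER_PGD is the number of pgd slots below PAGE_OFFSET, 768 with the usual 3GB/1GB split):

    pgd_t *get_pgd_slow(void)
    {
        pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);

        if (pgd) {
            /* The user part of the new page directory starts out empty. */
            memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
            /* The kernel part is copied from swapper_pg_dir, so every
               process sees the same kernel mappings above PAGE_OFFSET. */
            memcpy(pgd + USER_PTRS_PER_PGD,
                   swapper_pg_dir + USER_PTRS_PER_PGD,
                   (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
        }
        return pgd;
    }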

PAGE_OFFSET

PAGE_OFFSET is a macro typically defined as 0xC0000000, or 3GB. This is the offset at which the kernel "sees" physical memory. That is, from the kernel's point of view, PAGE_OFFSET maps to physical address 0, and succeeding physical pages are mapped sequentially from PAGE_OFFSET up.
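On i386 the conversion between the two views is pure offset arithmetic; the 2.4 definitions are essentially:

    #define PAGE_OFFSET  0xC0000000UL
    #define __pa(x)      ((unsigned long)(x) - PAGE_OFFSET)
    #define __va(x)      ((void *)((unsigned long)(x) + PAGE_OFFSET))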

x86 Page Tables

Page tables are walked by the x86 hardware using physical addresses (naturally). The kernel uses the __va(phys_addr) and __pa(virt_addr) macros to convert from physical to virtual and vice versa; these merely add or subtract PAGE_OFFSET. So to map a virtual address V to a particular physical address P, the kernel finds the page-directory entry covering V, allocates a page-table page if necessary, puts the physical address of that page-table page in the pgdir entry (again, if necessary), converts the page-table address to virtual with __va(), finds the page-table entry for V, and writes P into that entry, along with some accounting info in the lower 12 bits. (Only the top 20 bits of a page-table entry are significant in address resolution, since a page is 4K on Intel.)
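Putting that procedure into code - a minimal sketch of the idea, not actual kernel source; alloc_zeroed_page() is a hypothetical stand-in for the real page-table allocator:

    extern unsigned long *alloc_zeroed_page(void);  /* hypothetical helper */

    /* Map virtual address V to physical address P in an x86 two-level
       table. pgd_base is the KERNEL-VIRTUAL address of the page
       directory; the entries the MMU walks hold PHYSICAL addresses,
       hence the __pa()/__va() conversions at each level. */
    void map_page(unsigned long *pgd_base, unsigned long V, unsigned long P)
    {
        unsigned long *pde = pgd_base + (V >> 22);      /* top 10 bits index the pgdir */
        unsigned long *pt;

        if (!(*pde & 1)) {                              /* no pagetable page here yet */
            pt = alloc_zeroed_page();
            *pde = __pa(pt) | 0x003;                    /* present | writable */
        }
        pt = (unsigned long *)__va(*pde & ~0xFFFUL);    /* back to a usable pointer */
        pt[(V >> 12) & 0x3FF] = (P & ~0xFFFUL) | 0x003; /* middle 10 bits index the pt */
    }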

For a description of the page struct fields, the comments in include/linux/mm.h are a good starting point.

When Is A Page Free?

A physical page is free when nothing holds a reference to it - its use count (page->count) is zero and it is sitting on the allocator's free lists.

Free physical pages have exactly one virtual mapping: in the kernel page tables, at PAGE_OFFSET+physical_page_address.

Page Flags

Does each process get its own page directory?

Kernel 2.2: When a process is first created by fork(), it shares the memory mapping of its parent, with writeable pages marked "copy-on-write". When either process writes to a copy-on-write page, that process gets its own copy of the page. Thus, many processes can share the same page tables.

Kernel 2.4: Page tables are never shared (except kernel ones, I hope! - Yep, get_pgd_slow() still copies the kernel mapping). fork() calls dup_mmap() for the new process, which copies all the page tables; the pages are write-protected in both processes for copy-on-write.
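The write-protect step looks roughly like this - a sketch in the spirit of 2.4's copy_page_range(), boiled down to a single PTE of a private writable mapping:

    /* Copy one PTE from parent to child. BOTH copies end up
       write-protected, so the first write by either process faults
       and triggers the actual copy-on-write. */
    static void copy_one_pte(pte_t *src_pte, pte_t *dst_pte)
    {
        pte_t pte = *src_pte;

        ptep_set_wrprotect(src_pte);  /* parent loses write permission too */
        pte = pte_wrprotect(pte);
        get_page(pte_page(pte));      /* the physical page gains a user */
        set_pte(dst_pte, pte);
    }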

When a process does an exec, it gets its own MM context into which the new executable is mapped.

Relocating EIP during boot

But there are some jump instructions referring to those labels prior to paging-enable. How do those jumps work? Short jumps (within about 128 bytes either way) are encoded as EIP-relative displacements in x86 machine code, and all jumps in head.S prior to paging-enable are short jumps, so the code in head.S never refers directly to an absolute address until after paging is enabled! (It appears that everything prior to the call into head.S takes place in real mode and is part of boot-time magic; I'm not concerned here with anything that happens before entry into head.S.)

kmap() page tables

It appears that (a) pagetable_init() maps all the physical RAM up to max_low_pfn, and (b) the fixmap_init() function for fixed mappings and kmap() pagetables co-opts a portion of the kernel address space.

kmap() etc

The top 128MB of kernel VIRTUAL space is reserved, and will not be included in the low zone. It's reserved for vmalloc() and, presumably, fixmaps - among which (though not really fixed) we find the kmap range. kmap() lets us map small areas in and out of the very top of kernel space as necessary, but it looks like those mappings must never persist for long - in fact, they are usually used and released entirely between calls to schedule().

kmap() is only relevant for KERNEL pages. User space pagetables can freely map highmem pages for as long as they like.
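A typical lifecycle for a kernel user of a highmem page, then, is map-touch-unmap (a hedged sketch using the real 2.4 calls; error handling elided):

    static void zero_a_high_page(void)
    {
        struct page *page;
        void *vaddr;

        page = alloc_page(GFP_HIGHUSER);  /* may well be a highmem page */
        if (!page)
            return;
        vaddr = kmap(page);               /* borrow a window at the top of kernel VM */
        memset(vaddr, 0, PAGE_SIZE);      /* pointer is valid only while mapped */
        kunmap(page);                     /* give the window back before sleeping */
        __free_page(page);
    }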

OK, *anyone* who does alloc_pages() with an appropriate gfp_mask might get a high page in return. So why doesn't everyone who does alloc_pages() also kmap() the page? Could we be counting on page->virtual==0 for free highmem pages? Eek, it seems so! But that's not necessarily going to be true: you could kmap() a high page, giving it a vaddr; kunmap() it (which doesn't clear the vaddr); free it; and then __get_free_pages() could return that page along with its stale, no-longer-mapped vaddr. This seems bad. We seem to be just trusting people to never call __get_free_pages() with __GFP_HIGHMEM, since __get_free_pages() doesn't check.

OK, you'd have to be pretty blind to accidentally call __get_free_pages() or alloc_pages() with __GFP_HIGHMEM or GFP_HIGHUSER, and the places that do use them seem to always kmap()/kunmap() the returned page. So now I think this is all clear to me. The only thing I'm a bit hazy on is how we prevent vmalloc() from overlapping the fixmap addresses, but that's for tomorrow.

Kernel allocators

Zone Allocator

In kernel version 2.4, some major changes have been made to __get_free_pages() et al. It seems the old buddy allocator had problems with memory fragmentation, and had no simple way to distinguish between kinds of memory (DMA-capable, cacheable, slow). For this reason, the notion of "zones" was introduced in 2.4. The zone allocator carves the physical address space up into a number of zones, and fills certain types of allocation preferentially from the appropriate zone. Thus user memory (that is, memory to be mapped into a process address space) is allocated preferentially from the "cacheable" zone, and from the DMA or slow zones only if the cacheable zone is exhausted; requests for DMA memory are filled exclusively from the "DMA" zone; and certain other requests (eg pagetables) are filled preferentially from the "slow" zone.

The zone allocator still uses the buddy system internally. The free area lists and buddy bitmaps are maintained on a per-zone basis rather than globally.
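The preference-order idea, minus the balancing logic complained about below, amounts to something like this sketch (zone_t and the null-terminated zonelist are real 2.4 structures; rmqueue() is 2.4's internal per-zone buddy routine, used here loosely):

    /* Walk a zonelist from most- to least-preferred zone and take the
       first one that can satisfy an order-`order` buddy allocation. */
    struct page *alloc_from_zonelist(zone_t **zones, unsigned int order)
    {
        zone_t *z;

        while ((z = *zones++) != NULL) {
            if (z->free_pages >= (1UL << order))
                return rmqueue(z, order);  /* per-zone buddy allocation */
        }
        return NULL;  /* caller can wake kswapd and retry */
    }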

In 2.2 and earlier kernels, the physical allocator did have a means of distinguishing DMA memory from other memory, but it was convoluted and ugly, and involved maintaining separate global freelists for DMA and normal RAM. The zone allocator is a major refinement.

Page lists

Hypothesis: "active" == mapped by some process. "inactive_clean" == not mapped, not dirty (different from page on disk). "inactive_dirty" == not mapped, but changed from disk copy - must be written out before being reused.

Q: Why have an "inactive_clean" list? How are "inactive clean" pages different from "free" pages? If those pages are treated differently from free pages, won't that contribute to memory fragmentation within zones?

Possible answer: inactive_clean pages contain data that has recently been used somewhere, and we may need it again. That's what LRU is all about, after all: reusing that page that has spent the longest time unreferenced.

Kernel threads

A kernel thread is a process with no VM of its own; it executes using the kernel's memory context. The kernel_thread() function starts a kernel thread. It's weird - it appears to invoke a system call with undefined register contents. But I can't really read the GCC inline-asm syntax very well...
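Usage is simple enough, though (a sketch; my_kthread is made up, the calls are real 2.4 ones):

    static int my_kthread(void *unused)
    {
        daemonize();  /* 2.4: shed the user-space context of the spawner */
        strcpy(current->comm, "my_kthread");
        for (;;) {
            /* do work, then sleep or schedule()... */
        }
        return 0;
    }

    /* somewhere in initialization code: */
    kernel_thread(my_kthread, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);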

Dirty pages

A page is said to be "dirty" if it has been altered from its on-disk representation. All pages (? [later: no, anonymous pages may not have a swap page allocated before the first time try_to_swap_out() looks at them]) have an on-disk representation, either in a swap file or in an mmap()ed file. A page might be an anonymous page to which swap space has not yet been allocated, in which case it must be considered dirty even though it doesn't actually have an on-disk representation: in order to re-use the page frame, we will have to allocate swap space and write the logical page out.
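In later 2.4 kernels the anonymous-page case is handled right in try_to_swap_out(); the relevant branch reads roughly like this (a sketch, not verbatim source):

    swp_entry_t entry;

    /* Anonymous page with no on-disk home yet: allocate a swap slot,
       enter the page in the swap cache, and mark it dirty so it will
       be laundered out to that slot. */
    entry = get_swap_page();
    if (!entry.val)
        return;  /* out of swap; leave the page alone */
    add_to_swap_cache(page, entry);
    set_page_dirty(page);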

VM General Shape Hypothesis

This looks basically LRU-ish. Instead of a plain LRU list, the unreferenced ("older") end of the list is split off and maintained separately, and furthermore it's split again into dirty and clean lists (presumably so page_launder() doesn't have to wade through clean pages looking for ones to write to disk?) (Yes, that, and also so we can keep a supply of more-or-less untainted inactive-clean pages with which to supply new page requests.)

There are tests on page_count when deactivating pages: it has to be <=1, unless there are buffers associated with the page, in which case <=2. I need to understand exactly how page reference counting works. So: where is page_count incremented?
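As a sketch, the deactivation test amounts to something like this (the helper name is made up; page_count() and page->buffers are real 2.4 names):

    /* May this page be deactivated? Only the page cache (1 reference),
       plus possibly its buffers (1 more), may be holding it. */
    static int page_refs_ok_to_deactivate(struct page *page)
    {
        int max = page->buffers ? 2 : 1;
        return page_count(page) <= max;
    }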

OK, pages on any of the lists are checked for being referenced at various points, so it must be that they can be mapped. [No, that's just paranoia. They really, truly won't be mapped.]

reclaim_pages() succeeds if the chosen page either (a) is in the swap cache, or (b) has a page->mapping (eg it is part of a file mapping). (a) and (b) can't both be true, if the code is any guide; and if neither is true, it seems that we have a bug.

OK, it seems that active pages are certainly mapped, and inactive ones are not. Well, maybe inactive_clean ones are?

try_to_swap_out() attempts to get rid of a PTE, and then do the Right Thing with the underlying page. If the page is clean, it gets freed by page_cache_release()! Yay! I knew it had to happen somewhere...

The comments in try_to_swap_out() are old and crusty. It really returns 1 to mean "stop trying to page out the current process", and 0 to mean "keep trying."

You only have to be "not recently used" to be deactivated and moved to the inactive_dirty list (though there are some not-very-strict checks about reference count there). However, to be "inactive_clean" you basically have to be 100% freeable. page_launder() has strict criteria for allowing a page to be inactive_clean - it tries to write the buffer out, and ensures that the page has only one user. (This seems bad though, since that user might eg fork() and have the page be mapped again... have to look that code over more.)

The "cache" counts as a page user. I assume the page cache is just the collection of lists (active, inactive_dirty, inactive_clean). But where is this reference acquired? Not in "add_page_*".

"A day in the life of a user page" would be a nice section.

What's up with __alloc_pages()?

It looks to me like the zone allocator goes out of its way not to act in a "zoney" fashion. I thought the point was to ensure that, eg, there's DMA memory around when we need it, not to have user pages spread willy-nilly around physical memory. But it seems that __alloc_pages() is more interested in balancing allocations among the different zones than in ensuring that less-preferred zones are chosen only as a last resort. It seems to me that the "Right Thing" would be to try like hell to fulfill an allocation from the most-preferred zone, and only if that fails to try the other zones on the list.

Questions

Page Cache Notes

Aargh!

It does not seem possible to understand the VM system in a modular way. The interfaces between, say, the zone allocator and the swap policy are many and varied, and you can't just look at one part and say, "OK, I understand that, now let's look at the layer above, or below." What a nightmare.

What's up with PG_referenced? It's set only in getblk(), and tested only in the vmscan code. So it propagates page-usage information from the buffer cache into the page cache, I guess.
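Assuming the 2.4 definitions, the propagation is one macro deep:

    /* include/linux/fs.h: using a buffer marks the page backing it. */
    #define touch_buffer(bh)  SetPageReferenced((bh)->b_page)

    /* The vmscan code later tests and clears the bit to judge recency: */
    if (PageReferenced(page)) {
        ClearPageReferenced(page);
        /* referenced recently - keep it around */
    }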


Questions and comments to Joe Knapka


Credits