Virtual Memory Management Policy

We've looked at process virtual memory mapping, the page cache, the swap mechanism... but how do these components work together to give processes a fair chance to utilize system memory?

The basic principle of the Linux VM system is page aging. We've seen that refill_inactive_scan() is invoked periodically to try to deactivate pages, and that it ages pages down as it does so, deactivating them when their age reaches 0. We've also seen that swap_out() will age referenced page frames up while scanning process memory maps. This is the fundamental mechanism for VM resource balancing in Linux: pages are aged down at a more-or-less steady rate, and deactivated when they become sufficiently old; but processes can keep pages "young" by referencing them frequently.
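As a rough illustration of the two aging directions described above, here is a minimal sketch in C. It is not the kernel's code: struct toy_page and both function names are hypothetical stand-ins, and only the +3 increment, the halving, and deactivation at age 0 are taken from this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_ADVANCE 3          /* age-up increment quoted in this document */

/* Hypothetical stand-in for a page frame; this is not the kernel's struct page. */
struct toy_page {
    unsigned int age;
    bool active;                /* true while the page is on the active list */
};

/* Roughly what swap_out() does when it finds a referenced PTE for the page. */
void toy_age_up(struct toy_page *page)
{
    assert(page != NULL);
    page->age += PAGE_ADVANCE;
}

/* Roughly what refill_inactive_scan() does on each visit:
 * halve the age, and deactivate the page once the age reaches 0. */
void toy_age_down(struct toy_page *page)
{
    assert(page != NULL);
    page->age /= 2;
    if (page->age == 0)
        page->active = false;
}
```

In this toy model, a page that is never referenced loses half its age on every scan and is eventually deactivated, while each reference seen through a PTE pushes it three steps away from that fate.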

Once a page has been deactivated, it goes into the pool of "old pages" - the inactive lists. Some attempt is made to keep inactive pages in a semblance of LRU order, but the kernel doesn't waste too much time on that - when a page needs to be reclaimed from the cache, we use the first clean, inactive page that comes to hand.

Note that there is a fundamental difference between the aging-up and aging-down processes: refill_inactive_scan() scans the page frames on the active_list to age pages down, while swap_out() scans process PTEs to age pages up. Since a single page frame might be referenced by multiple PTEs (this will certainly be the case, for example, with pages mapping frequently-used libraries such as glibc), a page could be aged up multiple times for each time it is aged down. So is that a good thing or a bad thing?

Well, let's take a look at the actual page aging numbers. When a page is aged up by swap_out(), its age is incremented by PAGE_ADVANCE == 3. When a page is aged down, its age is halved. That would seem to mean that for a page's age not to go to 0 quite quickly, it will need to be referenced approximately twice as often as refill_inactive_scan() examines it. Assuming that refill_inactive_scan() and swap_out() examine their respective memory spaces at approximately the same rate - that is, refill_inactive_scan() examines all active pages in about the same time it takes swap_out() to examine all process PTEs - then it would appear that a page needs to be referenced at least twice (by one or more processes) between encounters with refill_inactive_scan().

However, when page->age is small, halving the age is effectively a linear operation: 2/2 == 1, 1/2 == 0. And since PAGE_ADVANCE is 3, it now seems that one reference between encounters will suffice to keep a page active - given the assumption above regarding scanning rates. The more processes are mapping a page, the less frequently any one of them needs to reference it in order to keep it active.
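The arithmetic above can be checked with a small simulation. This is a sketch under the same assumption about scanning rates: the function name and parameters are hypothetical, refs_per_scan counts every PTE through which the page is found referenced between two visits by refill_inactive_scan(), and no upper limit on the age is modelled.

```c
#include <assert.h>

#define PAGE_ADVANCE 3          /* age-up increment quoted in this document */

/* Simulate alternating phases: refs_per_scan age-ups (one per referencing
 * PTE), then one halving by the active-list scan.
 * Returns the number of the scan on which the age reaches 0, or -1 if the
 * page is still active after max_scans scans. */
int scans_until_deactivated(unsigned int start_age,
                            unsigned int refs_per_scan,
                            int max_scans)
{
    unsigned int age = start_age;

    assert(max_scans >= 0);
    for (int scan = 1; scan <= max_scans; scan++) {
        age += refs_per_scan * PAGE_ADVANCE;   /* aged up via PTE references */
        age /= 2;                              /* aged down by the scan */
        if (age == 0)
            return scan;
    }
    return -1;
}
```

With no references, a page starting at age 2 is deactivated on the second scan (2/2 == 1, then 1/2 == 0). With a single reference per interval, the age settles at a small non-zero value and the function returns -1, which matches the observation that one reference between encounters is enough.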

On the face of it, this seems reasonable (although one might wonder why these particular values were chosen). Under non-memory-stressed conditions, kswapd() wakes once per second, and we would expect all the pages that any process uses on a regular basis to remain active, and the remainder to be deactivated and gradually trickled back into the free pool by page_launder() and kreclaimd().

When memory stress occurs, processes attempting to allocate memory wake kswapd(), so everything - aging, swapping out, laundering, reclamation - simply happens faster. We will scan the active list more-or-less proportionally to the number of processes that are having a hard time in alloc_pages(). We would then expect pages to age down more quickly and be evicted from process mappings more quickly, as well, during low-memory conditions.

The real story is quite a bit more complicated than the simple analysis above would suggest. There are subtleties in the VM code that can probably only be teased out by careful benchmarking and statistics-gathering. For example:

I'm currently running a kernel with a slight change: rather than age pages up, swap_out() merely sets the PG_referenced bit in the page frame when it finds a page has been referenced through a PTE. refill_inactive_scan() then ages pages up if they've been referenced (have the PG_referenced bit set) and down if not; and the up and down aging are both done linearly. This seems to work fine, though I haven't yet written the code necessary to discover the exact effects it has on the VM system.
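The modified scheme might look something like the following sketch. It is not the actual patch: the struct, the function names, and both step sizes are hypothetical, and clearing the flag during the scan is an assumption made so that the flag means "referenced since the last visit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_AGE_UP_STEP   3     /* hypothetical linear step, not from the text */
#define TOY_AGE_DOWN_STEP 1     /* hypothetical linear step, not from the text */

/* Hypothetical stand-in for a page frame; this is not the kernel's struct page. */
struct toy_page {
    unsigned int age;
    bool referenced;            /* plays the role of PG_referenced */
    bool active;
};

/* swap_out() side: only record that a PTE reference was seen. */
void toy_swap_out_saw_reference(struct toy_page *page)
{
    assert(page != NULL);
    page->referenced = true;
}

/* refill_inactive_scan() side: age up if referenced since the last visit,
 * otherwise age down; both directions are linear. */
void toy_refill_inactive_visit(struct toy_page *page)
{
    assert(page != NULL);
    if (page->referenced) {
        page->referenced = false;          /* assumption: flag is consumed here */
        page->age += TOY_AGE_UP_STEP;
    } else {
        if (page->age > TOY_AGE_DOWN_STEP)
            page->age -= TOY_AGE_DOWN_STEP;
        else
            page->age = 0;
        if (page->age == 0)
            page->active = false;
    }
}
```

One consequence of this arrangement is that a page mapped by many processes is aged up at most once per scan of the active list, however many PTEs reference it, because all the references collapse into a single flag.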

Questions and comments to Joe Knapka

The LXR links in this page were produced by lxrreplace.tcl, which is available for free.
