Swapping and the Page Cache

Pretty much every page of user process RAM is kept in either the page cache or the swap cache (the swap cache is just the part of the page cache associated with swap devices). Most user pages are added to the cache when they are initially mapped into process VM, and remain there until they are reclaimed for use either by another process or by the kernel itself. The purpose of the cache is simply to keep as much useful data in memory as possible, so that page faults can be serviced quickly. Pages in different parts of their life cycle must be managed in different ways by the VM system, and likewise for pages that are mapped into process VM in different ways.

The cache is a layer between the kernel memory management code and the disk I/O code. When the kernel swaps pages out of a task, they do not get written immediately to disk, but rather are added to the cache. The kernel then writes the cache pages out to disk as necessary in order to create free memory.

The kernel maintains a number of page lists which collectively comprise the page cache. The active_list, the inactive_dirty_list, and the inactive_clean_list are used to maintain a more-or-less least-recently-used sorting of user pages (the page replacement policy actually implemented is something like "not recently used", since Linux is fairly unconcerned about keeping a strict LRU ordering of pages). Furthermore, each page of an executable image or mmap()ed file is associated with a per-inode cache, allowing the disk file to be used as backing storage for the page. Finally, anonymous pages (those without a disk file to serve as backing storage - pages of malloc()'d memory, for example) are assigned an entry in the system swapfile, and those pages are maintained in the swap cache.
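
Throughout this section the text refers to fields such as page->age and page->count, to the (mapping, offset) pair that identifies a cached page, and to the list a page currently lives on. Here is a toy C sketch - invented for illustration, not the kernel's actual struct page - that gathers those ideas in one place:

    /* Toy model only; the field names mirror the concepts discussed in
     * this section, but the layout and types are made up for clarity. */

    struct toy_address_space;   /* stands in for struct address_space */

    enum toy_list { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN, NOT_ON_LRU };

    struct toy_page {
        struct toy_address_space *mapping; /* backing file; NULL for anonymous pages */
        unsigned long index;               /* offset of this page within that file */
        int count;                         /* references; 1 == "only the page cache" */
        int age;                           /* 0 means old, i.e. a candidate for eviction */
        unsigned dirty:1, referenced:1, locked:1;
        enum toy_list lru;                 /* which LRU list the page is on, if any */
        struct toy_page *lru_next, *lru_prev;
        struct toy_page *hash_next;        /* page-cache hash chain linkage */
    };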

Note that anonymous pages don't get added to the swap cache - and don't have swap space reserved - until the first time they are evicted from a process's memory map, whereas pages mapped from files begin life in the page cache. Thus, the character of the swap cache is different than that of the page cache, and it makes sense to make the distinction. However, the cache code is mostly generic, and I won't be too concerned about the differences between mapped pages and swap pages here.

The general characteristics of pages on the LRU page lists are as follows:

  * Pages on the active_list are considered to be in active use: they are mapped by at least one process, or they have been referenced recently enough that their age has not yet decayed to 0.
  * Pages on the inactive_dirty_list are no longer mapped by any process and have aged down to 0; they may still contain data that must be written to backing storage before the page can be freed.
  * Pages on an inactive_clean_list (there is one per memory zone) are unmapped, clean, and ready to be handed out as free pages by reclaim_page() or kreclaimd.

During page-fault handling, the kernel looks for the faulting page in the page cache. If it's found, it can be moved to the active_list (if it's not already there) and used immediately to service the fault.

Life Cycle of a User Page

I'll present the common case of a page (call it P) that's part of a data file mmap()ed by some process (call it Process A). (Executable text pages have a similar life cycle, except that they never get dirtied and thus never need to be written out to disk.)

  1. The page is read from disk into memory and added to the page cache. This can happen in a number of different ways: Process A may simply fault on the page after mmap()ing the file, the kernel's read-ahead may pull it in along with a nearby page, or some other process may read() the same file, since ordinary reads go through the page cache too.

  2. P is written by the process, and thus dirtied. At this point P is still on the active_list.

  3. P is not used for a while. The kernel swap daemon kswapd(), which wakes up more frequently as memory pressure increases, periodically invokes refill_inactive(); since P is not being referenced, its page->age gradually decays to 0.

  4. If memory is tight, swap_out() will eventually be called by kswapd() to try to evict pages from Process A's virtual address space. Since page P hasn't been referenced and has age 0, the PTE will be dropped, and the only remaining reference to P is the one resulting from its presence in the page cache (assuming, of course, that no other process has mapped the file in the meantime). swap_out() does not actually swap the page out; rather, it simply removes the process's reference to the page, and depends upon the page cache and swap machinery to ensure the page gets written to disk if necessary. (If a PTE has been referenced when swap_out() examines it, the mapped page is aged up - made younger - rather than being unmapped.)

  5. Time passes... a little or a lot, depending on memory demand.

  6. refill_inactive_scan() comes along, trying to find pages that can be moved to the inactive_dirty list. Since P is not mapped by any process and has age 0, it is moved from the active_list to the inactive_dirty list.

  7. Process A attempts to access P, but it's not present in the process VM since the PTE has been cleared by swap_out(). The fault handler calls __find_page_nolock() to try to locate P in the page cache, and lo and behold, it's there, so the PTE can be immediately restored, and P is moved to the active_list, where it remains as long as it is actively used by the process.

  8. More time passes... swap_out() clears Process A's PTE for page P, refill_inactive_scan() deactivates P, moving it to the inactive_dirty list.

  9. More time passes... memory gets low.

  10. page_launder() is invoked to clean some dirty pages. It finds P on the inactive_dirty_list, notices that it's actually dirty, and attempts to write it out to the disk. When the page has been written, it can then be moved to the inactive_clean_list. When page_launder() actually decides to write out a page, the sequence of events is roughly this: the page is locked, the writepage operation of the page's mapping is invoked (which for most filesystems ends up in __block_write_full_page()), buffers are created for the page and mapped to disk blocks if necessary, the buffers are submitted for asynchronous I/O, and the page is unlocked once that I/O completes.

  11. page_launder() runs again, finds the page unused and clean, and moves it to the inactive_clean_list of the page's zone.

  12. An attempt is made by someone to allocate a single free page from P's zone. Since the request is for a single page, it can be satisfied by reclaiming an inactive_clean page; P is chosen for reclamation. reclaim_page() removes P from the page cache (thereby ensuring that no other process will be able to gain a reference to it during page fault handling), and it is given to the caller as a free page.

    Or:

    kreclaimd comes along trying to create free memory. It reclaims P and then frees it.

Note that this is only one possible sequence of events: a page can live in the page cache for a long time, aging, being deactivated, being recovered by processes during page fault handling and thereby reactivated, aging, being deactivated, being laundered, being recovered and reactivated...
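
To make those transitions concrete, here is a small, self-contained toy program (not kernel code; every name in it is invented) that walks one page through the state changes described in steps 1-12 above:

    #include <stdio.h>

    enum state { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN, RECLAIMED };

    struct toy_page { enum state s; int age, mapped, dirty; };

    /* Each helper stands for the corresponding step(s) in the list above. */
    static void toy_fault_in(struct toy_page *p)   { p->s = ACTIVE; p->mapped = 1; p->age = 3; } /* 1, 7 */
    static void toy_write_to(struct toy_page *p)   { p->dirty = 1; }                             /* 2 */
    static void toy_age_down(struct toy_page *p)   { if (p->age) p->age--; }                     /* 3 */
    static void toy_unmap(struct toy_page *p)      { if (!p->age) p->mapped = 0; }               /* 4, 8 */
    static void toy_deactivate(struct toy_page *p) { if (!p->age && !p->mapped) p->s = INACTIVE_DIRTY; } /* 6 */
    static void toy_launder(struct toy_page *p)    { if (p->s == INACTIVE_DIRTY) { p->dirty = 0; p->s = INACTIVE_CLEAN; } } /* 10, 11 */
    static void toy_reclaim(struct toy_page *p)    { if (p->s == INACTIVE_CLEAN) p->s = RECLAIMED; } /* 12 */

    int main(void)
    {
        struct toy_page p = { RECLAIMED, 0, 0, 0 };

        toy_fault_in(&p); toy_write_to(&p);                           /* steps 1-2 */
        toy_age_down(&p); toy_age_down(&p); toy_age_down(&p);         /* step 3: age decays to 0 */
        toy_unmap(&p); toy_deactivate(&p);                            /* steps 4, 6 */
        toy_fault_in(&p);                                             /* step 7: recovered from the cache */
        toy_age_down(&p); toy_age_down(&p); toy_age_down(&p);
        toy_unmap(&p); toy_deactivate(&p); toy_launder(&p); toy_reclaim(&p); /* steps 8-12 */
        printf("final state: %s\n", p.s == RECLAIMED ? "reclaimed" : "still cached");
        return 0;
    }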

Pages can be recovered from the inactive_clean and active lists as well as from the inactive_dirty list. Read-only pages, naturally, are never dirty, so page_launder() can move them from the inactive_dirty_list to the inactive_clean_list "for free," so to speak.

Pages on the inactive_clean list are periodically examined by the kreclaimd kernel thread and freed. The purpose of this is to try to produce larger contiguous free memory blocks, which are needed in some situations.

Finally, note that P is in essence a logical page, though of course it is instantiated by some particular physical page.

Usage Notes

When a page is in the page cache, the cache holds a reference to the page. That is, any code that adds a page to the page cache must increment page->count, and any code that removes a page from the page cache must decrement page->count. Failure to honor these rules will cause Bad Things™ to happen, since the page reclamation code expects cached pages to have a reference count of exactly 1 (or 2 if the page is also in the buffer cache) in order to be reclaimed.
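
A minimal sketch of that invariant (toy code with invented names, not the kernel's interface) looks like this:

    /* Whoever puts a page in the cache takes a reference; whoever removes
     * it drops that reference. An otherwise-unused cached page therefore
     * sits at count == 1 (or 2 if it also has buffers). */

    struct toy_page { int count; int in_page_cache; };

    static void toy_cache_add(struct toy_page *p)
    {
        p->count++;              /* the page cache now holds a reference */
        p->in_page_cache = 1;
    }

    static void toy_cache_remove(struct toy_page *p)
    {
        p->in_page_cache = 0;
        p->count--;              /* drop the cache's reference */
    }

    /* The test the reclamation code effectively applies: nobody but the
     * page cache (and possibly the buffer cache) may still hold the page. */
    static int toy_reclaimable(const struct toy_page *p, int in_buffer_cache)
    {
        return p->in_page_cache && p->count == (in_buffer_cache ? 2 : 1);
    }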

The existing public interface to the page cache (add_to_page_cache() etc) already handles page reference counting properly. You shouldn't attempt to add pages to the cache in any other way.

Page Cache Data

The basic data structures involved in the page cache are the struct page itself, the struct address_space that connects cached pages to their backing file, the page hash table used to look pages up by (mapping, offset), and the LRU lists (active_list, inactive_dirty_list, and the per-zone inactive_clean_lists) described above.

Page Cache Code

add_to_page_cache()

void add_to_page_cache(struct page * page, struct address_space * mapping, unsigned long offset) is the public interface by which pages are added to the page cache. It adds the given page to the inode and hash queues for the specified address_space, and to the LRU lists.

The actual cache-add operation is performed by the helper function __add_to_page_cache(), in filemap.c at line 500.
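
As a rough illustration of what a cache-add amounts to, here is a toy version (the hash function, table size, and all names are invented; the real code also links the page into the per-inode/mapping list, which is omitted here):

    #include <stdint.h>

    #define TOY_HASH_SIZE 1024

    struct toy_page {
        void *mapping;                /* stands in for struct address_space * */
        unsigned long index;
        int count, age;
        struct toy_page *hash_next;   /* hash-chain linkage */
        struct toy_page *lru_next;    /* LRU-list linkage (singly linked toy) */
    };

    static struct toy_page *toy_hash[TOY_HASH_SIZE];
    static struct toy_page *toy_active_list;

    static unsigned toy_hashfn(void *mapping, unsigned long index)
    {
        return (unsigned)(((uintptr_t)mapping / sizeof(void *) + index) % TOY_HASH_SIZE);
    }

    static void toy_add_to_page_cache(struct toy_page *p, void *mapping, unsigned long index)
    {
        unsigned h = toy_hashfn(mapping, index);

        p->count++;                     /* the cache holds a reference (see Usage Notes) */
        p->mapping = mapping;
        p->index = index;
        p->age = 3;                     /* freshly cached pages start out "young" */

        p->hash_next = toy_hash[h];     /* onto the hash queue... */
        toy_hash[h] = p;

        p->lru_next = toy_active_list;  /* ...and onto the LRU (active) list */
        toy_active_list = p;
    }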

add_to_page_cache_unique()

This is the same as add_to_page_cache(), except that it first checks to be sure the page being added isn't already in the cache.

__find_page_nolock()

struct page * __find_page_nolock(struct address_space *mapping, unsigned long offset, struct page *page) is the basic cache-search function. It looks at the pages in the hash queue given by the page argument, examining only those that are associated with the given mapping (address_space), and returns the one with the matching offset, if such a page exists. The code is "interesting", but pretty straightforward.
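
Continuing the toy hash table from the add_to_page_cache() sketch above, the search boils down to walking one hash chain and comparing the (mapping, offset) key (again an invented illustration, not the kernel's code):

    static struct toy_page *toy_find_page_nolock(void *mapping, unsigned long index)
    {
        struct toy_page *p = toy_hash[toy_hashfn(mapping, index)];

        for (; p != NULL; p = p->hash_next)
            if (p->mapping == mapping && p->index == index)
                return p;   /* the caller should take a reference before using it */
        return NULL;        /* not cached */
    }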

As an example of how __find_page_nolock() is used, consider the kernel's activity when handling a page fault on an executable text page: the fault handler invokes the file's nopage method (normally filemap_nopage()), which searches the page cache with __find_page_nolock(); if the page is found it can be mapped into the faulting process right away, and if not, it is first read in from disk and added to the cache, as described next.

page_cache_read()

int page_cache_read(struct file * file, unsigned long offset) checks to see if a page corresponding to the given file and offset is already in the page cache. If it's not, it reads in the page and adds it to the cache.

Well, actually it's the other way around: the page is first allocated and added to the cache, and only then is the read from disk started, filling in the page that is already sitting in the cache.
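
Still using the toy cache from the previous sketches, the order of operations described above would look roughly like this (toy_alloc_page() and toy_start_read() are invented stand-ins for the real allocation and readpage machinery):

    #include <stdlib.h>

    static struct toy_page *toy_alloc_page(void)
    {
        return calloc(1, sizeof(struct toy_page));   /* stands in for page allocation */
    }

    static void toy_start_read(struct toy_page *p)
    {
        (void)p;   /* a real version would lock the page, submit the disk read,
                    * and have the I/O completion handler unlock it when done */
    }

    static int toy_page_cache_read(void *mapping, unsigned long index)
    {
        struct toy_page *p = toy_find_page_nolock(mapping, index);

        if (p != NULL)
            return 0;                             /* already cached - nothing to do */

        p = toy_alloc_page();
        if (p == NULL)
            return -1;
        toy_add_to_page_cache(p, mapping, index); /* add to the cache first... */
        toy_start_read(p);                        /* ...then read the data into it */
        return 0;
    }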

wait_on_page()

void wait_on_page(struct page *page) simply checks to see if the given page is locked, and if it is, waits for it to be unlocked. The actual waiting is done in ___wait_on_page() in filemap.c.

__block_write_full_page()

When we need to write a page from the page cache into backing storage, in page_launder() for example, we invoke the writepage member of the page mapping's address_space_operations structure. In most cases this will ultimately invoke __block_write_full_page(struct inode *inode, struct page *page, get_block_t *get_block), defined in buffer.c at line 1492. __b_w_f_p() writes out the given page via the buffer cache. Essentially, all we do here is create buffers for the page if that hasn't been done yet, map them to disk blocks, and submit them to the disk I/O subsystem to be physically written out asynchronously. We set the I/O completion handler to end_buffer_io_async(), which is a special handler that knows how to deal with page-cache read and write completions.

kswapd()

kswapd() is the kernel swap thread. It simply loops, deactivating and laundering pages as necessary. If memory is tight, it is more aggressive, but it will always trickle a few inactive pages per minute back into the free pool, if there are any inactive pages to be trickled.

When there's nothing to do, kswapd() sleeps. It can be woken explicitly by tasks that need memory using wakeup_kswapd(); a task that wakes kswapd() can either let it carry on asynchronously, or wait for it to free some pages and go to sleep again.

try_to_free_pages() does almost the same thing that kswapd() does, but it does it synchronously without invoking kswapd. Some tasks might deadlock with kswapd() if, for example, they are holding kernel locks that kswapd() needs in order to operate; such processes call try_to_free_pages() instead of wakeup_kswapd(). [An example would be good here...]

Most of kswapd's work is done by do_try_to_free_pages().
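
The overall control flow is easy to mimic in a toy form. The sketch below uses POSIX threads to stand in for kernel wait queues (everything here - the names, the placeholders, the condition variable - is invented for illustration, not kernel code):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t kswapd_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  kswapd_waitq = PTHREAD_COND_INITIALIZER;

    static bool memory_is_tight(void)          { return false; }  /* placeholder */
    static void toy_do_try_to_free_pages(void) { }                /* placeholder */

    /* Started once and left running, like the real kswapd thread. */
    static void *toy_kswapd(void *unused)
    {
        (void)unused;
        for (;;) {
            /* Deactivate and launder pages while there is a shortage. */
            while (memory_is_tight())
                toy_do_try_to_free_pages();

            /* Nothing to do: sleep until some task needs memory and wakes us. */
            pthread_mutex_lock(&kswapd_lock);
            pthread_cond_wait(&kswapd_waitq, &kswapd_lock);
            pthread_mutex_unlock(&kswapd_lock);
        }
        return NULL;
    }

    /* What wakeup_kswapd() amounts to in this toy model. */
    static void toy_wakeup_kswapd(void)
    {
        pthread_cond_signal(&kswapd_waitq);
    }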

do_try_to_free_pages()

int do_try_to_free_pages(unsigned int gfp_mask, int user) takes a number of actions in order to try to create free (or freeable) pages: it launders inactive dirty pages via page_launder() when free memory is short, and if there is still a shortage of freeable pages it prunes the dentry and inode caches and calls refill_inactive() to deactivate more pages; otherwise it simply reaps unused slab cache memory.

refill_inactive()

int refill_inactive(unsigned int gfp_mask, int user) is invoked to try to deactivate active pages. It tries to create enough freeable pages to erase any existing free memory shortage.

refill_inactive_scan()

refill_inactive_scan() looks at each page on the active list, aging pages up if they have been referenced since the last scan and down if they haven't. (This is counterintuitive: older pages have lower age values.)

If a page's age is 0, and the only remaining references to the page are those of the buffer cache and/or page cache, then the page is deactivated (moved to the inactive_dirty_list).

There is some misleading commentary in this function. It doesn't actually use age_page_down() to age down the page and deactivate it; rather, it ages the page down, and then additionally checks the reference count to be sure it is really unreferenced by process PTEs before actually deactivating the page.
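
The aging rule itself is simple enough to show in toy form (the constants and names here are invented; only the shape of the logic follows the description above):

    enum toy_list { ACTIVE, INACTIVE_DIRTY };

    struct toy_page {
        int age;          /* 0 == old */
        int referenced;   /* touched since the last scan? */
        int count;        /* 1 == only the page cache holds it */
        int has_buffers;  /* also in the buffer cache? */
        enum toy_list lru;
    };

    #define TOY_AGE_ADV 3
    #define TOY_AGE_MAX 64

    /* What one pass over a single active-list page amounts to. */
    static void toy_refill_inactive_scan_one(struct toy_page *p)
    {
        if (p->referenced) {
            p->referenced = 0;
            p->age += TOY_AGE_ADV;          /* age up: recently used */
            if (p->age > TOY_AGE_MAX)
                p->age = TOY_AGE_MAX;
            return;
        }

        p->age /= 2;                        /* age down: not used lately */

        /* Deactivate only if the page is old and nothing but the page
         * cache (and possibly the buffer cache) still references it. */
        if (p->age == 0 && p->count == (p->has_buffers ? 2 : 1))
            p->lru = INACTIVE_DIRTY;
    }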

swap_out()

int swap_out(unsigned int priority, int gfp_mask) scans process page tables and attempts to unmap pages from process VM. It computes the per-task swap_cnt value - basically a "kick me" rating, higher values make a process more likely to be victimized - and uses it to decide which process to try to "swap out" first. Larger processes which have not been swapped out in a while have the best chance of being chosen.

The actual unmapping is done by swap_out_mm(), which tries to unmap pages from a process's memory map; swap_out_vma(), which tries to unmap pages from a particular VM area within a process by walking the part of the page directory corresponding to the VM area looking for page middle directory entries to unmap; swap_out_pgd() which walks page middle directory entries looking for page tables to unmap; swap_out_pmd() which walks page tables looking for page table entries to unmap; and ultimately, try_to_swap_out(), where all the interesting stuff happens.
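
The shape of that walk, minus the pgd/pmd/pte machinery, can be sketched with toy types (the swap_cnt selection here is a plain maximum, and all names are invented; the kernel's accounting is more involved):

    #define TOY_PAGE_SIZE 4096UL

    struct toy_vma { unsigned long start, end; struct toy_vma *next; };
    struct toy_mm  { unsigned long swap_cnt; struct toy_vma *vmas; struct toy_mm *next; };

    /* Stand-in for try_to_swap_out(); returns nonzero if it unmapped a page. */
    static int toy_try_to_unmap_one(struct toy_mm *mm, unsigned long addr)
    {
        (void)mm; (void)addr;
        return 0;   /* the interesting work is described under try_to_swap_out() */
    }

    static int toy_swap_out(struct toy_mm *all_mms)
    {
        struct toy_mm *victim = NULL, *mm;
        struct toy_vma *vma;
        unsigned long addr;

        /* Victim selection: the process with the highest "kick me" rating. */
        for (mm = all_mms; mm != NULL; mm = mm->next)
            if (victim == NULL || mm->swap_cnt > victim->swap_cnt)
                victim = mm;
        if (victim == NULL)
            return 0;

        /* Walk the victim's VM areas a page at a time. The real kernel walks
         * the page directory, page middle directories, and page tables here. */
        for (vma = victim->vmas; vma != NULL; vma = vma->next)
            for (addr = vma->start; addr < vma->end; addr += TOY_PAGE_SIZE)
                if (toy_try_to_unmap_one(victim, addr))
                    return 1;   /* made some progress; good enough for now */
        return 0;
    }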

try_to_swap_out()

int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, pte_t * page_table, int gfp_mask) attempts to unmap the page indicated by the given pte_t from the given vm_area_struct. It returns 0 if the higher-level swap-out code should continue to scan the current process, 1 if the current scan should be aborted and another swap victim chosen. It's a bit complicated: the function has to check that the PTE actually maps a present page, honor the referenced bit (a recently referenced page is aged up and left mapped), transfer the PTE's dirty bit to the page so the data isn't lost, arrange a swap entry for anonymous pages, and drop the process's reference while leaving the page in the page (or swap) cache so that it can still be recovered by a later fault or written out.

Simple, eh?
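
Stripped of the locking, TLB flushing, and swap-entry bookkeeping, the decision it makes looks roughly like this toy version (invented types; the real function's return convention is the one described above, which this sketch does not try to reproduce):

    struct toy_page { int age, dirty, count; };
    struct toy_pte  { int present, referenced, dirty; struct toy_page *page; };

    #define TOY_AGE_ADV 3

    static void toy_try_to_unmap(struct toy_pte *pte)
    {
        struct toy_page *page;

        if (!pte->present)
            return;                        /* nothing mapped here */
        page = pte->page;

        if (pte->referenced) {
            /* Recently used: clear the referenced bit and make the page
             * younger instead of unmapping it. */
            pte->referenced = 0;
            page->age += TOY_AGE_ADV;
            return;
        }

        if (page->age != 0)
            return;                        /* not old enough to evict yet */

        /* Drop the process's mapping. The page itself stays in the page
         * cache (an anonymous page would also be given a swap entry and
         * added to the swap cache at this point - the sketch omits that),
         * so it can still be written out or recovered by a later fault. */
        if (pte->dirty)
            page->dirty = 1;               /* don't lose the fact that it's dirty */
        pte->present = 0;
        pte->page = NULL;
        page->count--;                     /* the process's reference goes away */
    }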

page_launder()

int page_launder(int gfp_mask, int sync) is responsible for moving inactive_dirty pages to the inactive_clean list, writing them out to backing disk if necessary. It's pretty straightforward. It makes at most two loops over the inactive_dirty_list. On the first loop it simply moves as many clean pages as it can to the inactive_clean_list. On the second loop it starts asynchronous writes for any dirty pages it finds, after locking the pages; but it leaves those pages on the inactive_dirty list. Eventually the writes will complete and the pages will be unlocked, at which point future invocations of page_launder() can move them to the inactive_clean_list.

page_launder() may also do synchronous writes if it's invoked by a user process that's trying to free up enough memory to complete an allocation. In this case we start the async writes as above (vmscan.c, line 561), but then wait for them to complete by calling try_to_free_buffers() with wait > 1 (vmscan.c, lines 595-603). Starting the page write causes the page to be added to the buffer cache; try_to_free_buffers() attempts to remove the page from the buffer cache by writing all its buffers to disk, if necessary.
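
The two-pass structure is easy to see in a toy version that uses an array in place of the inactive_dirty list (all names invented; asynchronous completion is reduced to a comment):

    enum toy_list { INACTIVE_DIRTY, INACTIVE_CLEAN };

    struct toy_page {
        int dirty;
        int locked;        /* a write is in progress */
        enum toy_list lru;
    };

    static void toy_start_async_write(struct toy_page *p)
    {
        p->locked = 1;     /* I/O completion would clear dirty and unlock the page */
    }

    static void toy_page_launder(struct toy_page *pages, int n)
    {
        int i;

        /* Pass 1: clean, idle pages move straight to inactive_clean. */
        for (i = 0; i < n; i++)
            if (pages[i].lru == INACTIVE_DIRTY && !pages[i].dirty && !pages[i].locked)
                pages[i].lru = INACTIVE_CLEAN;

        /* Pass 2: start writeback for the dirty ones and leave them where
         * they are; a later invocation will find them clean in pass 1. */
        for (i = 0; i < n; i++)
            if (pages[i].lru == INACTIVE_DIRTY && pages[i].dirty && !pages[i].locked)
                toy_start_async_write(&pages[i]);
    }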

reclaim_page()

struct page * reclaim_page(zone_t * zone) attempts to reclaim an inactive_clean page from the given zone. It examines each page in the zone's inactive_clean_list, moving referenced pages to the active list and dirty pages to the inactive_dirty list. (Both of those tests are basically paranoia and should never trigger, since code that remaps an inactive page in the cache also moves it to the active list.) When it finds a genuine clean page, which should happen immediately, it removes the page from the page cache and returns it as a free page.

Note the code from line 431 to line 439: every inactive_clean page is either in the swap cache (if it was an anonymous page) or else it's in the page cache associated with an mmap()ed file (page->mapping && !PageSwapCache(page)). Exactly one of those conditions will be true.

Note also line 455: this is just another paranoia check. Pages on the inactive_clean list are no longer in the buffer cache (since they've been written out and their buffers freed in page_launder()), so every reclaimed page will have a reference count of exactly 1 (the page cache's reference), in the absence of broken kernel modules or drivers.
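
Reduced to its essentials (and again using invented toy types rather than kernel code), the reclaim pass looks like this:

    enum toy_list { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN, NOT_ON_LRU };

    struct toy_page {
        int referenced, dirty, count;
        enum toy_list lru;
    };

    static void toy_remove_from_page_cache(struct toy_page *p)
    {
        p->count--;                     /* drop the cache's reference */
        p->lru = NOT_ON_LRU;
    }

    static struct toy_page *toy_reclaim_page(struct toy_page *pages, int n)
    {
        int i;

        for (i = 0; i < n; i++) {
            struct toy_page *p = &pages[i];

            if (p->lru != INACTIVE_CLEAN)
                continue;
            if (p->referenced) {        /* paranoia: shouldn't happen */
                p->lru = ACTIVE;
                continue;
            }
            if (p->dirty) {             /* more paranoia: shouldn't happen either */
                p->lru = INACTIVE_DIRTY;
                continue;
            }
            toy_remove_from_page_cache(p);
            return p;                   /* a free page for the caller */
        }
        return NULL;
    }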

deactivate_page()

deactivate_page() is called within try_to_swap_out() to try to deactivate a page. It uses deactivate_page_nolock() for the interesting bits. It checks that the page is unused by anyone except the caller (which presumably is about to release a reference) and the page and buffer caches; if so, it makes the page old by setting its age to 0 and, provided the page was on the active list, moves it to the inactive_dirty_list. We never want to deactivate a page that's not in the page cache, since we have no idea what else such pages may be being used for, and the page cache makes some pretty strong assumptions about what it can do with the pages under its control.
