Swapping and the Page Cache

Pretty much every page of user process RAM is kept in either the page cache or the swap cache (the swap cache is just the part of the page cache associated with swap devices). Most user pages are added to the cache when they are initially mapped into process VM, and remain there until they are reclaimed for use either by another process or by the kernel itself. The purpose of the cache is simply to keep as much useful data in memory as possible, so that page faults can be serviced quickly. Pages in different parts of their life cycle must be managed in different ways by the VM system, and likewise for pages that are mapped into process VM in different ways.

The cache is a layer between the kernel memory management code and the disk I/O code. When the kernel swaps pages out of a task, they do not get written immediately to disk, but rather are added to the cache. The kernel then writes the cache pages out to disk as necessary in order to create free memory.

The kernel maintains a number of page lists which collectively comprise the page cache. The active_list, the inactive_dirty_list, and the inactive_clean_list are used to maintain a more-or-less least-recently-used sorting of user pages (the page replacement policy actually implemented is something like "not recently used", since Linux is fairly unconcerned about keeping a strict LRU ordering of pages). Furthermore, each page of an executable image or mmap()ed file is associated with a per-inode cache, allowing the disk file to be used as backing storage for the page. Finally, anonymous pages (those without a disk file to serve as backing storage - pages of malloc()'d memory, for example) are assigned an entry in the system swapfile, and those pages are maintained in the swap cache.
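
Throughout this section the text refers to fields such as page->age and page->count, to the (mapping, offset) pair that identifies a cached page, and to the list a page currently lives on. Here is a toy C sketch - invented for illustration, not the kernel's actual struct page - that gathers those ideas in one place:

    /* Toy model only; the field names mirror the concepts discussed in
     * this section, but the layout and types are made up for clarity. */

    struct toy_address_space;   /* stands in for struct address_space */

    enum toy_list { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN, NOT_ON_LRU };

    struct toy_page {
        struct toy_address_space *mapping; /* backing file; NULL for anonymous pages */
        unsigned long index;               /* offset of this page within that file */
        int count;                         /* references; 1 == "only the page cache" */
        int age;                           /* 0 means old, i.e. a candidate for eviction */
        unsigned dirty:1, referenced:1, locked:1;
        enum toy_list lru;                 /* which LRU list the page is on, if any */
        struct toy_page *lru_next, *lru_prev;
        struct toy_page *hash_next;        /* page-cache hash chain linkage */
    };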

Note that anonymous pages don't get added to the swap cache - and don't have swap space reserved - until the first time they are evicted from a process's memory map, whereas pages mapped from files begin life in the page cache. Thus, the character of the swap cache is different than that of the page cache, and it makes sense to make the distinction. However, the cache code is mostly generic, and I won't be too concerned about the differences between mapped pages and swap pages here.

The general characteristics of pages on the LRU page lists are as follows:

  * Pages on the active_list are considered to be in active use: they are mapped by at least one process, or they have been referenced recently enough that their age has not yet decayed to 0.
  * Pages on the inactive_dirty_list are no longer mapped by any process and have aged down to 0; they may still contain data that must be written to backing storage before the page can be freed.
  * Pages on an inactive_clean_list (there is one per memory zone) are unmapped, clean, and ready to be handed out as free pages by reclaim_page() or kreclaimd.

During page-fault handling, the kernel looks for the faulting page in the page cache. If it's found, it can be moved to the active_list (if it's not already there) and used immediately to service the fault.

Life Cycle of a User Page

I'll present the common case of a page (call it P) that's part of a data file mmap()ed by some process (call it Process A). (Executable text pages have a similar life cycle, except that they never get dirtied and thus never need to be written out to disk.)

  1. The page is read from disk into memory and added to the page cache. This can happen in a number of different ways: Process A may simply fault on the page after mmap()ing the file, the kernel's read-ahead may pull it in along with a nearby page, or some other process may read() the same file, since ordinary reads go through the page cache too.

  2. P is written by the process, and thus dirtied. At this point P is still on the active_list.

  3. P is not used for a while. The kernel swap daemon kswapd(), which wakes up more frequently as memory pressure increases, periodically invokes refill_inactive(); since P is not being referenced, its page->age gradually decays to 0.

  4. If memory is tight, swap_out() will eventually be called by kswapd() to try to evict pages from Process A's virtual address space. Since page P hasn't been referenced and has age 0, the PTE will be dropped, and the only remaining reference to P is the one resulting from its presence in the page cache (assuming, of course, that no other process has mapped the file in the meantime). swap_out() does not actually swap the page out; rather, it simply removes the process's reference to the page, and depends upon the page cache and swap machinery to ensure the page gets written to disk if necessary. (If a PTE has been referenced when swap_out() examines it, the mapped page is aged up - made younger - rather than being unmapped.)

  5. Time passes... a little or a lot, depending on memory demand.

  6. refill_inactive_scan() comes along, trying to find pages that can be moved to the inactive_dirty list. Since P is not mapped by any process and has age 0, it is moved from the active_list to the inactive_dirty list.

  7. Process A attempts to access P, but it's not present in the process VM since the PTE has been cleared by swap_out(). The fault handler calls __find_page_nolock() to try to locate P in the page cache, and lo and behold, it's there, so the PTE can be immediately restored, and P is moved to the active_list, where it remains as long as it is actively used by the process.

  8. More time passes... swap_out() clears Process A's PTE for page P, refill_inactive_scan() deactivates P, moving it to the inactive_dirty list.

  9. More time passes... memory gets low.

  10. page_launder() is invoked to clean some dirty pages. It finds P on the inactive_dirty_list, notices that it's actually dirty, and attempts to write it out to the disk. When the page has been written, it can then be moved to the inactive_clean_list. When page_launder() actually decides to write out a page, the sequence of events is roughly this: the page is locked, the writepage operation of the page's mapping is invoked (which for most filesystems ends up in __block_write_full_page()), buffers are created for the page and mapped to disk blocks if necessary, the buffers are submitted for asynchronous I/O, and the page is unlocked once that I/O completes.

  11. page_launder() runs again, finds the page unused and clean, and moves it to the inactive_clean_list of the page's zone.

  12. An attempt is made by someone to allocate a single free page from P's zone. Since the request is for a single page, it can be satisfied by reclaiming an inactive_clean page; P is chosen for reclamation. reclaim_page() removes P from the page cache (thereby ensuring that no other process will be able to gain a reference to it during page fault handling), and it is given to the caller as a free page.

    Or:

    kreclaimd comes along trying to create free memory. It reclaims P and then frees it.

Note that this is only one possible sequence of events: a page can live in the page cache for a long time, aging, being deactivated, being recovered by processes during page fault handling and thereby reactivated, aging, being deactivated, being laundered, being recovered and reactivated...
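
To make those transitions concrete, here is a small, self-contained toy program (not kernel code; every name in it is invented) that walks one page through the state changes described in steps 1-12 above:

    #include <stdio.h>

    enum state { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN, RECLAIMED };

    struct toy_page { enum state s; int age, mapped, dirty; };

    /* Each helper stands for the corresponding step(s) in the list above. */
    static void toy_fault_in(struct toy_page *p)   { p->s = ACTIVE; p->mapped = 1; p->age = 3; } /* 1, 7 */
    static void toy_write_to(struct toy_page *p)   { p->dirty = 1; }                             /* 2 */
    static void toy_age_down(struct toy_page *p)   { if (p->age) p->age--; }                     /* 3 */
    static void toy_unmap(struct toy_page *p)      { if (!p->age) p->mapped = 0; }               /* 4, 8 */
    static void toy_deactivate(struct toy_page *p) { if (!p->age && !p->mapped) p->s = INACTIVE_DIRTY; } /* 6 */
    static void toy_launder(struct toy_page *p)    { if (p->s == INACTIVE_DIRTY) { p->dirty = 0; p->s = INACTIVE_CLEAN; } } /* 10, 11 */
    static void toy_reclaim(struct toy_page *p)    { if (p->s == INACTIVE_CLEAN) p->s = RECLAIMED; } /* 12 */

    int main(void)
    {
        struct toy_page p = { RECLAIMED, 0, 0, 0 };

        toy_fault_in(&p); toy_write_to(&p);                           /* steps 1-2 */
        toy_age_down(&p); toy_age_down(&p); toy_age_down(&p);         /* step 3: age decays to 0 */
        toy_unmap(&p); toy_deactivate(&p);                            /* steps 4, 6 */
        toy_fault_in(&p);                                             /* step 7: recovered from the cache */
        toy_age_down(&p); toy_age_down(&p); toy_age_down(&p);
        toy_unmap(&p); toy_deactivate(&p); toy_launder(&p); toy_reclaim(&p); /* steps 8-12 */
        printf("final state: %s\n", p.s == RECLAIMED ? "reclaimed" : "still cached");
        return 0;
    }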

Pages can be recovered from the inactive_clean and active lists as well as from the inactive_dirty list. Read-only pages, naturally, are never dirty, so page_launder() can move them from the inactive_dirty_list to the inactive_clean_list "for free," so to speak.

Pages on the inactive_clean list are periodically examined by the kreclaimd kernel thread and freed. The purpose of this is to try to produce larger contiguous free memory blocks, which are needed in some situations.

Finally, note that P is in essence a logical page, though of course it is instantiated by some particular physical page.

Usage Notes

When a page is in the page cache, the cache holds a reference to the page. That is, any code that adds a page to the page cache must increment page->count, and any code that removes a page from the page cache must decrement page->count. Failure to honor these rules will cause Bad Things™ to happen, since the page reclamation code expects cached pages to have a reference count of exactly 1 (or 2 if the page is also in the buffer cache) in order to be reclaimed.
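
A minimal sketch of that invariant (toy code with invented names, not the kernel's interface) looks like this:

    /* Whoever puts a page in the cache takes a reference; whoever removes
     * it drops that reference. An otherwise-unused cached page therefore
     * sits at count == 1 (or 2 if it also has buffers). */

    struct toy_page { int count; int in_page_cache; };

    static void toy_cache_add(struct toy_page *p)
    {
        p->count++;              /* the page cache now holds a reference */
        p->in_page_cache = 1;
    }

    static void toy_cache_remove(struct toy_page *p)
    {
        p->in_page_cache = 0;
        p->count--;              /* drop the cache's reference */
    }

    /* The test the reclamation code effectively applies: nobody but the
     * page cache (and possibly the buffer cache) may still hold the page. */
    static int toy_reclaimable(const struct toy_page *p, int in_buffer_cache)
    {
        return p->in_page_cache && p->count == (in_buffer_cache ? 2 : 1);
    }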

The existing public interface to the page cache (add_to_page_cache() etc) already handles page reference counting properly. You shouldn't attempt to add pages to the cache in any other way.

Page Cache Data

The basic data structures involved in the page cache are the struct page itself, the struct address_space that connects cached pages to their backing file, the page hash table used to look pages up by (mapping, offset), and the LRU lists (active_list, inactive_dirty_list, and the per-zone inactive_clean_lists) described above.

Page Cache Code

add_to_page_cache()

void add_to_page_cache(struct page * page, struct address_space * mapping, unsigned long offset) is the public interface by which pages are added to the page cache. It adds the given page to the inode and hash queues for the specified address_space, and to the LRU lists.

The actual cache-add operation is performed by the helper function __add_to_page_cache(), in filemap.c at line 500.
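
As a rough illustration of what a cache-add amounts to, here is a toy version (the hash function, table size, and all names are invented; the real code also links the page into the per-inode/mapping list, which is omitted here):

    #include <stdint.h>

    #define TOY_HASH_SIZE 1024

    struct toy_page {
        void *mapping;                /* stands in for struct address_space * */
        unsigned long index;
        int count, age;
        struct toy_page *hash_next;   /* hash-chain linkage */
        struct toy_page *lru_next;    /* LRU-list linkage (singly linked toy) */
    };

    static struct toy_page *toy_hash[TOY_HASH_SIZE];
    static struct toy_page *toy_active_list;

    static unsigned toy_hashfn(void *mapping, unsigned long index)
    {
        return (unsigned)(((uintptr_t)mapping / sizeof(void *) + index) % TOY_HASH_SIZE);
    }

    static void toy_add_to_page_cache(struct toy_page *p, void *mapping, unsigned long index)
    {
        unsigned h = toy_hashfn(mapping, index);

        p->count++;                     /* the cache holds a reference (see Usage Notes) */
        p->mapping = mapping;
        p->index = index;
        p->age = 3;                     /* freshly cached pages start out "young" */

        p->hash_next = toy_hash[h];     /* onto the hash queue... */
        toy_hash[h] = p;

        p->lru_next = toy_active_list;  /* ...and onto the LRU (active) list */
        toy_active_list = p;
    }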

add_to_page_cache_unique()

This is the same as add_to_page_cache(), except that it first checks to be sure the page being added isn't already in the cache.

__find_page_nolock()

struct page * __find_page_nolock(struct address_space *mapping, unsigned long offset, struct page *page) is the basic cache-search function. It looks at the pages in the hash queue given by the page argument, examining only those that are associated with the given mapping (address_space), and returns the one with the matching offset, if such a page exists. The code is "interesting", but pretty straightforward.
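
Continuing the toy hash table from the add_to_page_cache() sketch above, the search boils down to walking one hash chain and comparing the (mapping, offset) key (again an invented illustration, not the kernel's code):

    static struct toy_page *toy_find_page_nolock(void *mapping, unsigned long index)
    {
        struct toy_page *p = toy_hash[toy_hashfn(mapping, index)];

        for (; p != NULL; p = p->hash_next)
            if (p->mapping == mapping && p->index == index)
                return p;   /* the caller should take a reference before using it */
        return NULL;        /* not cached */
    }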

As an example of how __find_page_nolock() is used, consider the kernel's activity when handling a page fault on an executable text page: the fault handler invokes the file's nopage method (normally filemap_nopage()), which searches the page cache with __find_page_nolock(); if the page is found it can be mapped into the faulting process right away, and if not, it is first read in from disk and added to the cache, as described next.

page_cache_read()

int page_cache_read(struct file * file, unsigned long offset) checks to see if a page corresponding to the given file and offset is already in the page cache. If it's not, it reads in the page and adds it to the cache.

Well, actually it's the other way around: the page is first allocated and added to the cache, and only then is the read from disk started, filling in the page that is already sitting in the cache.
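
Still using the toy cache from the previous sketches, the order of operations described above would look roughly like this (toy_alloc_page() and toy_start_read() are invented stand-ins for the real allocation and readpage machinery):

    #include <stdlib.h>

    static struct toy_page *toy_alloc_page(void)
    {
        return calloc(1, sizeof(struct toy_page));   /* stands in for page allocation */
    }

    static void toy_start_read(struct toy_page *p)
    {
        (void)p;   /* a real version would lock the page, submit the disk read,
                    * and have the I/O completion handler unlock it when done */
    }

    static int toy_page_cache_read(void *mapping, unsigned long index)
    {
        struct toy_page *p = toy_find_page_nolock(mapping, index);

        if (p != NULL)
            return 0;                             /* already cached - nothing to do */

        p = toy_alloc_page();
        if (p == NULL)
            return -1;
        toy_add_to_page_cache(p, mapping, index); /* add to the cache first... */
        toy_start_read(p);                        /* ...then read the data into it */
        return 0;
    }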

wait_on_page()

void wait_on_page(struct page *page) simply checks to see if the given page is locked, and if it is, waits for it to be unlocked. The actual waiting is done in ___wait_on_page() in filemap.c.

__block_write_full_page()

When we need to write a page from the page cache into backing storage, in page_launder() for example, we invoke the writepage member of the page mapping's address_space_operations structure. In most cases this will ultimately invoke __block_write_full_page(struct inode *inode, struct page *page, get_block_t *get_block), defined in buffer.c at line 1492. __b_w_f_p() writes out the given page via the buffer cache. Essentially, all we do here is create buffers for the page if that hasn't been done yet, map them to disk blocks, and submit them to the disk I/O subsystem to be physically written out asynchronously. We set the I/O completion handler to end_buffer_io_async(), which is a special handler that knows how to deal with page-cache read and write completions.

kswapd()

kswapd() is the kernel swap thread. It simply loops, deactivating and laundering pages as necessary. If memory is tight, it is more aggressive, but it will always trickle a few inactive pages per minute back into the free pool, if there are any inactive pages to be trickled.

When there's nothing to do, kswapd() sleeps. It can be woken explicitly by tasks that need memory using wakeup_kswapd(); a task that wakes kswapd() can either let it carry on asynchronously, or wait for it to free some pages and go to sleep again.

try_to_free_pages() does almost the same thing that kswapd() does, but it does it synchronously without invoking kswapd. Some tasks might deadlock with kswapd() if, for example, they are holding kernel locks that kswapd() needs in order to operate; such processes call try_to_free_pages() instead of wakeup_kswapd(). [An example would be good here...]

Most of kswapd's work is done by do_try_to_free_pages().
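
The overall control flow is easy to mimic in a toy form. The sketch below uses POSIX threads to stand in for kernel wait queues (everything here - the names, the placeholders, the condition variable - is invented for illustration, not kernel code):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t kswapd_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  kswapd_waitq = PTHREAD_COND_INITIALIZER;

    static bool memory_is_tight(void)          { return false; }  /* placeholder */
    static void toy_do_try_to_free_pages(void) { }                /* placeholder */

    /* Started once and left running, like the real kswapd thread. */
    static void *toy_kswapd(void *unused)
    {
        (void)unused;
        for (;;) {
            /* Deactivate and launder pages while there is a shortage. */
            while (memory_is_tight())
                toy_do_try_to_free_pages();

            /* Nothing to do: sleep until some task needs memory and wakes us. */
            pthread_mutex_lock(&kswapd_lock);
            pthread_cond_wait(&kswapd_waitq, &kswapd_lock);
            pthread_mutex_unlock(&kswapd_lock);
        }
        return NULL;
    }

    /* What wakeup_kswapd() amounts to in this toy model. */
    static void toy_wakeup_kswapd(void)
    {
        pthread_cond_signal(&kswapd_waitq);
    }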

do_try_to_free_pages()

int do_try_to_free_pages(unsigned int gfp_mask, int user) takes a number of actions in order to try to create free (or freeable) pages: it launders inactive dirty pages via page_launder() when free memory is short, and if there is still a shortage of freeable pages it prunes the dentry and inode caches and calls refill_inactive() to deactivate more pages; otherwise it simply reaps unused slab cache memory.

refill_inactive()

int refill_inactive(unsigned int gfp_mask, int user) is invoked to try to deactivate active pages. It tries to create enough freeable pages to erase any existing free memory shortage.

refill_inactive_scan()

refill_inactive_scan() looks at each page on the active list, aging pages up if they have been referenced since the last scan and down if they haven't. (This is counterintuitive: older pages have lower age values.)

If a page's age is 0, and the only remaining references to the page are those of the buffer cache and/or page cache, then the page is deactivated (moved to the inactive_dirty_list).

There is some misleading commentary in this function. It doesn't actually use age_page_down() to age down the page and deactivate it; rather, it ages the page down, and then additionally checks the reference count to be sure it is really unreferenced by process PTEs before actually deactivating the page.
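
The aging rule itself is simple enough to show in toy form (the constants and names here are invented; only the shape of the logic follows the description above):

    enum toy_list { ACTIVE, INACTIVE_DIRTY };

    struct toy_page {
        int age;          /* 0 == old */
        int referenced;   /* touched since the last scan? */
        int count;        /* 1 == only the page cache holds it */
        int has_buffers;  /* also in the buffer cache? */
        enum toy_list lru;
    };

    #define TOY_AGE_ADV 3
    #define TOY_AGE_MAX 64

    /* What one pass over a single active-list page amounts to. */
    static void toy_refill_inactive_scan_one(struct toy_page *p)
    {
        if (p->referenced) {
            p->referenced = 0;
            p->age += TOY_AGE_ADV;          /* age up: recently used */
            if (p->age > TOY_AGE_MAX)
                p->age = TOY_AGE_MAX;
            return;
        }

        p->age /= 2;                        /* age down: not used lately */

        /* Deactivate only if the page is old and nothing but the page
         * cache (and possibly the buffer cache) still references it. */
        if (p->age == 0 && p->count == (p->has_buffers ? 2 : 1))
            p->lru = INACTIVE_DIRTY;
    }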

swap_out()

int swap_out(unsigned int priority, int gfp_mask) scans process page tables and attempts to unmap pages from process VM. It computes the per-task swap_cnt value - basically a "kick me" rating, higher values make a process more likely to be victimized - and uses it to decide which process to try to "swap out" first. Larger processes which have not been swapped out in a while have the best chance of being chosen.

The actual unmapping is done by swap_out_mm(), which tries to unmap pages from a process's memory map; swap_out_vma(), which tries to unmap pages from a particular VM area within a process by walking the part of the page directory corresponding to the VM area looking for page middle directory entries to unmap; swap_out_pgd() which walks page middle directory entries looking for page tables to unmap; swap_out_pmd() which walks page tables looking for page table entries to unmap; and ultimately, try_to_swap_out(), where all the interesting stuff happens.
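
The shape of that walk, minus the pgd/pmd/pte machinery, can be sketched with toy types (the swap_cnt selection here is a plain maximum, and all names are invented; the kernel's accounting is more involved):

    #define TOY_PAGE_SIZE 4096UL

    struct toy_vma { unsigned long start, end; struct toy_vma *next; };
    struct toy_mm  { unsigned long swap_cnt; struct toy_vma *vmas; struct toy_mm *next; };

    /* Stand-in for try_to_swap_out(); returns nonzero if it unmapped a page. */
    static int toy_try_to_unmap_one(struct toy_mm *mm, unsigned long addr)
    {
        (void)mm; (void)addr;
        return 0;   /* the interesting work is described under try_to_swap_out() */
    }

    static int toy_swap_out(struct toy_mm *all_mms)
    {
        struct toy_mm *victim = NULL, *mm;
        struct toy_vma *vma;
        unsigned long addr;

        /* Victim selection: the process with the highest "kick me" rating. */
        for (mm = all_mms; mm != NULL; mm = mm->next)
            if (victim == NULL || mm->swap_cnt > victim->swap_cnt)
                victim = mm;
        if (victim == NULL)
            return 0;

        /* Walk the victim's VM areas a page at a time. The real kernel walks
         * the page directory, page middle directories, and page tables here. */
        for (vma = victim->vmas; vma != NULL; vma = vma->next)
            for (addr = vma->start; addr < vma->end; addr += TOY_PAGE_SIZE)
                if (toy_try_to_unmap_one(victim, addr))
                    return 1;   /* made some progress; good enough for now */
        return 0;
    }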

try_to_swap_out()

int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, pte_t * page_table, int gfp_mask) attempts to unmap the page indicated by the given pte_t from the given vm_area_struct. It returns 0 if the higher-level swap-out code should continue to scan the current process, 1 if the current scan should be aborted and another swap victim chosen. It's a bit complicated: the function has to check that the PTE actually maps a present page, honor the referenced bit (a recently referenced page is aged up and left mapped), transfer the PTE's dirty bit to the page so the data isn't lost, arrange a swap entry for anonymous pages, and drop the process's reference while leaving the page in the page (or swap) cache so that it can still be recovered by a later fault or written out.

Simple, eh?
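
Stripped of the locking, TLB flushing, and swap-entry bookkeeping, the decision it makes looks roughly like this toy version (invented types; the real function's return convention is the one described above, which this sketch does not try to reproduce):

    struct toy_page { int age, dirty, count; };
    struct toy_pte  { int present, referenced, dirty; struct toy_page *page; };

    #define TOY_AGE_ADV 3

    static void toy_try_to_unmap(struct toy_pte *pte)
    {
        struct toy_page *page;

        if (!pte->present)
            return;                        /* nothing mapped here */
        page = pte->page;

        if (pte->referenced) {
            /* Recently used: clear the referenced bit and make the page
             * younger instead of unmapping it. */
            pte->referenced = 0;
            page->age += TOY_AGE_ADV;
            return;
        }

        if (page->age != 0)
            return;                        /* not old enough to evict yet */

        /* Drop the process's mapping. The page itself stays in the page
         * cache (an anonymous page would also be given a swap entry and
         * added to the swap cache at this point - the sketch omits that),
         * so it can still be written out or recovered by a later fault. */
        if (pte->dirty)
            page->dirty = 1;               /* don't lose the fact that it's dirty */
        pte->present = 0;
        pte->page = NULL;
        page->count--;                     /* the process's reference goes away */
    }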

page_launder()

int page_launder(int gfp_mask, int sync) is responsible for moving inactive_dirty pages to the inactive_clean list, writing them out to backing disk if necessary. It's pretty straightforward. It makes at most two loops over the inactive_dirty_list. On the first loop it simply moves as many clean pages as it can to the inactive_clean_list. On the second loop it starts asynchronous writes for any dirty pages it finds, after locking the pages; but it leaves those pages on the inactive_dirty list. Eventually the writes will complete and the pages will be unlocked, at which point future invocations of page_launder() can move them to the inactive_clean_list.

page_launder() may also do synchronous writes if it's invoked by a user process that's trying to free up enough memory to complete an allocation. In this case we start the async writes as above (vmscan.c, line 561), but then wait for them to complete by calling try_to_free_buffers() with wait > 1 (vmscan.c, lines 595-603). Starting the page write causes the page to be added to the buffer cache; try_to_free_buffers() attempts to remove the page from the buffer cache by writing all its buffers to disk, if necessary.
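
The two-pass structure is easy to see in a toy version that uses an array in place of the inactive_dirty list (all names invented; asynchronous completion is reduced to a comment):

    enum toy_list { INACTIVE_DIRTY, INACTIVE_CLEAN };

    struct toy_page {
        int dirty;
        int locked;        /* a write is in progress */
        enum toy_list lru;
    };

    static void toy_start_async_write(struct toy_page *p)
    {
        p->locked = 1;     /* I/O completion would clear dirty and unlock the page */
    }

    static void toy_page_launder(struct toy_page *pages, int n)
    {
        int i;

        /* Pass 1: clean, idle pages move straight to inactive_clean. */
        for (i = 0; i < n; i++)
            if (pages[i].lru == INACTIVE_DIRTY && !pages[i].dirty && !pages[i].locked)
                pages[i].lru = INACTIVE_CLEAN;

        /* Pass 2: start writeback for the dirty ones and leave them where
         * they are; a later invocation will find them clean in pass 1. */
        for (i = 0; i < n; i++)
            if (pages[i].lru == INACTIVE_DIRTY && pages[i].dirty && !pages[i].locked)
                toy_start_async_write(&pages[i]);
    }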

reclaim_page()

struct page * reclaim_page(zone_t * zone) attempts to reclaim an inactive_clean page from the given zone. It examines each page in the zone's inactive_clean_list, moving referenced pages to the active list and dirty pages to the inactive_dirty list. (Both of those tests are basically paranoia and should never trigger, since code that remaps an inactive page in the cache also moves it to the active list.) When it finds a genuine clean page, which should happen immediately, it removes the page from the page cache and returns it as a free page.

Note the code from line 431 to line 439: every inactive_clean page is either in the swap cache (if it was an anonymous page) or else it's in the page cache associated with an mmap()ed file (page->mapping && !PageSwapCache(page)). Exactly one of those conditions will be true.

Note also line 455: this is just another paranoia check. Pages on the inactive_clean list are no longer in the buffer cache (since they've been written out and their buffers freed in page_launder()), so every reclaimed page will have a reference count of exactly 1 (the page cache's reference), in the absence of broken kernel modules or drivers.
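
Reduced to its essentials (and again using invented toy types rather than kernel code), the reclaim pass looks like this:

    enum toy_list { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN, NOT_ON_LRU };

    struct toy_page {
        int referenced, dirty, count;
        enum toy_list lru;
    };

    static void toy_remove_from_page_cache(struct toy_page *p)
    {
        p->count--;                     /* drop the cache's reference */
        p->lru = NOT_ON_LRU;
    }

    static struct toy_page *toy_reclaim_page(struct toy_page *pages, int n)
    {
        int i;

        for (i = 0; i < n; i++) {
            struct toy_page *p = &pages[i];

            if (p->lru != INACTIVE_CLEAN)
                continue;
            if (p->referenced) {        /* paranoia: shouldn't happen */
                p->lru = ACTIVE;
                continue;
            }
            if (p->dirty) {             /* more paranoia: shouldn't happen either */
                p->lru = INACTIVE_DIRTY;
                continue;
            }
            toy_remove_from_page_cache(p);
            return p;                   /* a free page for the caller */
        }
        return NULL;
    }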

deactivate_page()

deactivate_page() is called within try_to_swap_out() to try to deactivate a page. It uses deactivate_page_nolock() for the interesting bits. It checks that the page is unused by anyone except the caller (which presumably is about to release a reference) and the page and buffer caches; if so, it makes the page old by setting its age to 0 and, provided the page was on the active list, moves it to the inactive_dirty_list. We never want to deactivate a page that's not in the page cache, since we have no idea what else such pages may be being used for, and the page cache makes some pretty strong assumptions about what it can do with the pages under its control.
