123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615 |
- This document describes the Linux memory management "Unevictable LRU"
- infrastructure and the use of this infrastructure to manage several types
- of "unevictable" pages. The document attempts to provide the overall
- rationale behind this mechanism and the rationale for some of the design
- decisions that drove the implementation. The latter design rationale is
- discussed in the context of an implementation description. Admittedly, one
- can obtain the implementation details--the "what does it do?"--by reading the
- code. One hopes that the descriptions below add value by provide the answer
- to "why does it do that?".
- Unevictable LRU Infrastructure:
- The Unevictable LRU adds an additional LRU list to track unevictable pages
- and to hide these pages from vmscan. This mechanism is based on a patch by
- Larry Woodman of Red Hat to address several scalability problems with page
- reclaim in Linux. The problems have been observed at customer sites on large
- memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB
- of main memory will have over 32 million 4k pages in a single zone. When a
- large fraction of these pages are not evictable for any reason [see below],
- vmscan will spend a lot of time scanning the LRU lists looking for the small
- fraction of pages that are evictable. This can result in a situation where
- all cpus are spending 100% of their time in vmscan for hours or days on end,
- with the system completely unresponsive.
- The Unevictable LRU infrastructure addresses the following classes of
- unevictable pages:
- + page owned by ramfs
- + page mapped into SHM_LOCKed shared memory regions
- + page mapped into VM_LOCKED [mlock()ed] vmas
- The infrastructure might be able to handle other conditions that make pages
- unevictable, either by definition or by circumstance, in the future.
- The Unevictable LRU List
- The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
- called the "unevictable" list and an associated page flag, PG_unevictable, to
- indicate that the page is being managed on the unevictable list. The
- PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active
- flag in that it indicates on which LRU list a page resides when PG_lru is set.
- The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU
- Kconfig option.
- The Unevictable LRU infrastructure maintains unevictable pages on an additional
- LRU list for a few reasons:
- 1) We get to "treat unevictable pages just like we treat other pages in the
- system, which means we get to use the same code to manipulate them, the
- same code to isolate them (for migrate, etc.), the same code to keep track
- of the statistics, etc..." [Rik van Riel]
- 2) We want to be able to migrate unevictable pages between nodes--for memory
- defragmentation, workload management and memory hotplug. The linux kernel
- can only migrate pages that it can successfully isolate from the lru lists.
- If we were to maintain pages elsewise than on an lru-like list, where they
- can be found by isolate_lru_page(), we would prevent their migration, unless
- we reworked migration code to find the unevictable pages.
- The unevictable LRU list does not differentiate between file backed and swap
- backed [anon] pages. This differentiation is only important while the pages
- are, in fact, evictable.
- The unevictable LRU list benefits from the "arrayification" of the per-zone
- LRU lists and statistics originally proposed and posted by Christoph Lameter.
- The unevictable list does not use the lru pagevec mechanism. Rather,
- unevictable pages are placed directly on the page's zone's unevictable
- list under the zone lru_lock. The reason for this is to prevent stranding
- of pages on the unevictable list when one task has the page isolated from the
- lru and other tasks are changing the "evictability" state of the page.
- Unevictable LRU and Memory Controller Interaction
- The memory controller data structure automatically gets a per zone unevictable
- lru list as a result of the "arrayification" of the per-zone LRU lists. The
- memory controller tracks the movement of pages to and from the unevictable list.
- When a memory control group comes under memory pressure, the controller will
- not attempt to reclaim pages on the unevictable list. This has a couple of
- effects. Because the pages are "hidden" from reclaim on the unevictable list,
- the reclaim process can be more efficient, dealing only with pages that have
- a chance of being reclaimed. On the other hand, if too many of the pages
- charged to the control group are unevictable, the evictable portion of the
- working set of the tasks in the control group may not fit into the available
- memory. This can cause the control group to thrash or to oom-kill tasks.
- Unevictable LRU: Detecting Unevictable Pages
- The function page_evictable(page, vma) in vmscan.c determines whether a
- page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions,
- page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's
- address space using a wrapper function. Wrapper functions are used to set,
- clear and test the flag to reduce the requirement for #ifdef's throughout the
- source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created.
- This flag remains for the life of the inode.
- For shared memory regions, AS_UNEVICTABLE is set when an application
- successfully SHM_LOCKs the region and is removed when the region is
- SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page
- tables for the region as does, for example, mlock(). So, we make no special
- effort to push any pages in the SHM_LOCKed region to the unevictable list.
- Vmscan will do this when/if it encounters the pages during reclaim. On
- SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the
- unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed
- region is destroyed, the pages are also "rescued" from the unevictable list in
- the process of freeing them.
- page_evictable() detects mlock()ed pages by testing an additional page flag,
- PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a
- non-NULL vma is supplied, page_evictable() will check whether the vma is
- VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
- update the appropriate statistics if the vma is VM_LOCKED. This method allows
- efficient "culling" of pages in the fault path that are being faulted in to
- VM_LOCKED vmas.
- Unevictable Pages and Vmscan [shrink_*_list()]
- If unevictable pages are culled in the fault path, or moved to the unevictable
- list at mlock() or mmap() time, vmscan will never encounter the pages until
- they have become evictable again, for example, via munlock() and have been
- "rescued" from the unevictable list. However, there may be situations where we
- decide, for the sake of expediency, to leave a unevictable page on one of the
- regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for
- such pages in all of the shrink_{active|inactive|page}_list() functions and
- will "cull" such pages that it encounters--that is, it diverts those pages to
- the unevictable list for the zone being scanned.
- There may be situations where a page is mapped into a VM_LOCKED vma, but the
- page is not marked as PageMlocked. Such pages will make it all the way to
- shrink_page_list() where they will be detected when vmscan walks the reverse
- map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list()
- will cull the page at that point.
- Note that for anonymous pages, shrink_page_list() attempts to add the page to
- the swap cache before it tries to unmap the page. To avoid this unnecessary
- consumption of swap space, shrink_page_list() calls try_to_munlock() to check
- whether any VM_LOCKED vmas map the page without attempting to unmap the page.
- If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page
- without consuming swap space. try_to_munlock() will be described below.
- To "cull" an unevictable page, vmscan simply puts the page back on the lru
- list using putback_lru_page()--the inverse operation to isolate_lru_page()--
- after dropping the page lock. Because the condition which makes the page
- unevictable may change once the page is unlocked, putback_lru_page() will
- recheck the unevictable state of a page that it places on the unevictable lru
- list. If the page has become unevictable, putback_lru_page() removes it from
- the list and retries, including the page_unevictable() test. Because such a
- race is a rare event and movement of pages onto the unevictable list should be
- rare, these extra evictabilty checks should not occur in the majority of calls
- to putback_lru_page().
- Mlocked Page: Prior Work
- The "Unevictable Mlocked Pages" infrastructure is based on work originally
- posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
- Nick posted his patch as an alternative to a patch posted by Christoph
- Lameter to achieve the same objective--hiding mlocked pages from vmscan.
- In Nick's patch, he used one of the struct page lru list link fields as a count
- of VM_LOCKED vmas that map the page. This use of the link field for a count
- prevented the management of the pages on an LRU list. Thus, mlocked pages were
- not migratable as isolate_lru_page() could not find them and the lru list link
- field was not available to the migration subsystem. Nick resolved this by
- putting mlocked pages back on the lru list before attempting to isolate them,
- thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated
- with the Unevictable LRU work, the count was replaced by walking the reverse
- map to determine whether any VM_LOCKED vmas mapped the page. More on this
- below.
- Mlocked Pages: Basic Management
- Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of
- unevictable pages. When such a page has been "noticed" by the memory
- management subsystem, the page is marked with the PG_mlocked [PageMlocked()]
- flag. A PageMlocked() page will be placed on the unevictable LRU list when
- it is added to the LRU. Pages can be "noticed" by memory management in
- several places:
- 1) in the mlock()/mlockall() system call handlers.
- 2) in the mmap() system call handler when mmap()ing a region with the
- MAP_LOCKED flag, or mmap()ing a region in a task that has called
- mlockall() with the MCL_FUTURE flag. Both of these conditions result
- in the VM_LOCKED flag being set for the vma.
- 3) in the fault path, if mlocked pages are "culled" in the fault path,
- and when a VM_LOCKED stack segment is expanded.
- 4) as mentioned above, in vmscan:shrink_page_list() with attempting to
- reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock().
- Mlocked pages become unlocked and rescued from the unevictable list when:
- 1) mapped in a range unlocked via the munlock()/munlockall() system calls.
- 2) munmapped() out of the last VM_LOCKED vma that maps the page, including
- unmapping at task exit.
- 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file.
- 4) before a page is COWed in a VM_LOCKED vma.
- Mlocked Pages: mlock()/mlockall() System Call Handling
- Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup()
- for each vma in the range specified by the call. In the case of mlockall(),
- this is the entire active address space of the task. Note that mlock_fixup()
- is used for both mlock()ing and munlock()ing a range of memory. A call to
- mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED
- is treated as a no-op--mlock_fixup() simply returns.
- If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas"
- below, mlock_fixup() will attempt to merge the vma with its neighbors or split
- off a subset of the vma if the range does not cover the entire vma. Once the
- vma has been merged or split or neither, mlock_fixup() will call
- __mlock_vma_pages_range() to fault in the pages via get_user_pages() and
- to mark the pages as mlocked via mlock_vma_page().
- Note that the vma being mlocked might be mapped with PROT_NONE. In this case,
- get_user_pages() will be unable to fault in the pages. That's OK. If pages
- do end up getting faulted into this VM_LOCKED vma, we'll handle them in the
- fault path or in vmscan.
- Also note that a page returned by get_user_pages() could be truncated or
- migrated out from under us, while we're trying to mlock it. To detect
- this, __mlock_vma_pages_range() tests the page_mapping after acquiring
- the page lock. If the page is still associated with its mapping, we'll
- go ahead and call mlock_vma_page(). If the mapping is gone, we just
- unlock the page and move on. Worse case, this results in page mapped
- in a VM_LOCKED vma remaining on a normal LRU list without being
- PageMlocked(). Again, vmscan will detect and cull such pages.
- mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will
- TestSetPageMlocked() for each page returned by get_user_pages(). We use
- TestSetPageMlocked() because the page might already be mlocked by another
- task/vma and we don't want to do extra work. We especially do not want to
- count an mlocked page more than once in the statistics. If the page was
- already mlocked, mlock_vma_page() is done.
- If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
- page from the LRU, as it is likely on the appropriate active or inactive list
- at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will
- putback the page--putback_lru_page()--which will notice that the page is now
- mlocked and divert the page to the zone's unevictable LRU list. If
- mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
- it later if/when it attempts to reclaim the page.
- Mlocked Pages: Filtering Special Vmas
- mlock_fixup() filters several classes of "special" vmas:
- 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind
- these mappings are inherently pinned, so we don't need to mark them as
- mlocked. In any case, most of the pages have no struct page in which to
- so mark the page. Because of this, get_user_pages() will fail for these
- vmas, so there is no sense in attempting to visit them.
- 2) vmas mapping hugetlbfs page are already effectively pinned into memory.
- We don't need nor want to mlock() these pages. However, to preserve the
- prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup()
- will call make_pages_present() in the hugetlbfs vma range to allocate the
- huge pages and populate the ptes.
- 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of
- kernel pages, such as the vdso page, relay channel pages, etc. These pages
- are inherently unevictable and are not managed on the LRU lists.
- mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls
- make_pages_present() to populate the ptes.
- Note that for all of these special vmas, mlock_fixup() does not set the
- VM_LOCKED flag. Therefore, we won't have to deal with them later during
- munlock() or munmap()--for example, at task exit. Neither does mlock_fixup()
- account these vmas against the task's "locked_vm".
- Mlocked Pages: Downgrading the Mmap Semaphore.
- mlock_fixup() must be called with the mmap semaphore held for write, because
- it may have to merge or split vmas. However, mlocking a large region of
- memory can take a long time--especially if vmscan must reclaim pages to
- satisfy the regions requirements. Faulting in a large region with the mmap
- semaphore held for write can hold off other faults on the address space, in
- the case of a multi-threaded task. It can also hold off scans of the task's
- address space via /proc. While testing under heavy load, it was observed that
- the ps(1) command could be held off for many minutes while a large segment was
- mlock()ed down.
- To address this issue, and to make the system more responsive during mlock()ing
- of large segments, mlock_fixup() downgrades the mmap semaphore to read mode
- during the call to __mlock_vma_pages_range(). This works fine. However, the
- callers of mlock_fixup() expect the semaphore to be returned in write mode.
- So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not
- support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore
- and reacquire it in write mode. In a multi-threaded task, it is possible for
- the task memory map to change while the semaphore is dropped. Therefore,
- mlock_fixup() looks up the vma at the range start address after reacquiring
- the semaphore in write mode and verifies that it still covers the original
- range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of
- mlock_fixup() have been changed to deal with this new error condition.
- Note: when munlocking a region, all of the pages should already be resident--
- unless we have racing threads mlocking() and munlocking() regions. So,
- unlocking should not have to wait for page allocations nor faults of any kind.
- Therefore mlock_fixup() does not downgrade the semaphore for munlock().
- Mlocked Pages: munlock()/munlockall() System Call Handling
- The munlock() and munlockall() system calls are handled by the same functions--
- do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock
- vs lock operation indicated by an argument. So, these system calls are also
- handled by mlock_fixup(). Again, if called for an already munlock()ed vma,
- mlock_fixup() simply returns. Because of the vma filtering discussed above,
- VM_LOCKED will not be set in any "special" vmas. So, these vmas will be
- ignored for munlock.
- If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off
- the specified range. The range is then munlocked via the function
- __mlock_vma_pages_range()--the same function used to mlock a vma range--
- passing a flag to indicate that munlock() is being performed.
- Because the vma access protections could have been changed to PROT_NONE after
- faulting in and mlocking some pages, get_user_pages() was unreliable for visiting
- these pages for munlocking. Because we don't want to leave pages mlocked(),
- get_user_pages() was enhanced to accept a flag to ignore the permissions when
- fetching the pages--all of which should be resident as a result of previous
- mlock()ing.
- For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling
- munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
- flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page()
- use the Test*PageMlocked() function to handle the case where the page might
- have already been unlocked by another task. If the page was mlocked,
- munlock_vma_page() updates that zone statistics for the number of mlocked
- pages. Note, however, that at this point we haven't checked whether the page
- is mapped by other VM_LOCKED vmas.
- We can't call try_to_munlock(), the function that walks the reverse map to check
- for other VM_LOCKED vmas, without first isolating the page from the LRU.
- try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
- not be on an lru list. [More on these below.] However, the call to
- isolate_lru_page() could fail, in which case we couldn't try_to_munlock().
- So, we go ahead and clear PG_mlocked up front, as this might be the only chance
- we have. If we can successfully isolate the page, we go ahead and
- try_to_munlock(), which will restore the PG_mlocked flag and update the zone
- page statistics if it finds another vma holding the page mlocked. If we fail
- to isolate the page, we'll have left a potentially mlocked page on the LRU.
- This is fine, because we'll catch it later when/if vmscan tries to reclaim the
- page. This should be relatively rare.
- Mlocked Pages: Migrating Them...
- A page that is being migrated has been isolated from the lru lists and is
- held locked across unmapping of the page, updating the page's mapping
- [address_space] entry and copying the contents and state, until the
- page table entry has been replaced with an entry that refers to the new
- page. Linux supports migration of mlocked pages and other unevictable
- pages. This involves simply moving the PageMlocked and PageUnevictable states
- from the old page to the new page.
- Note that page migration can race with mlocking or munlocking of the same
- page. This has been discussed from the mlock/munlock perspective in the
- respective sections above. Both processes [migration, m[un]locking], hold
- the page locked. This provides the first level of synchronization. Page
- migration zeros out the page_mapping of the old page before unlocking it,
- so m[un]lock can skip these pages by testing the page mapping under page
- lock.
- When completing page migration, we place the new and old pages back onto the
- lru after dropping the page lock. The "unneeded" page--old page on success,
- new page on failure--will be freed when the reference count held by the
- migration process is released. To ensure that we don't strand pages on the
- unevictable list because of a race between munlock and migration, page
- migration uses the putback_lru_page() function to add migrated pages back to
- the lru.
- Mlocked Pages: mmap(MAP_LOCKED) System Call Handling
- In addition the the mlock()/mlockall() system calls, an application can request
- that a region of memory be mlocked using the MAP_LOCKED flag with the mmap()
- call. Furthermore, any mmap() call or brk() call that expands the heap by a
- task that has previously called mlockall() with the MCL_FUTURE flag will result
- in the newly mapped memory being mlocked. Before the unevictable/mlock changes,
- the kernel simply called make_pages_present() to allocate pages and populate
- the page table.
- To mlock a range of memory under the unevictable/mlock infrastructure, the
- mmap() handler and task address space expansion functions call
- mlock_vma_pages_range() specifying the vma and the address range to mlock.
- mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in
- "Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will
- have already been set by the caller, in filtered vmas. Thus these vma's need
- not be visited for munlock when the region is unmapped.
- For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
- fault/allocate the pages and mlock them. Again, like mlock_fixup(),
- mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
- attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore
- back to write mode before returning.
- The callers of mlock_vma_pages_range() will have already added the memory
- range to be mlocked to the task's "locked_vm". To account for filtered vmas,
- mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
- callers then subtract a non-negative return value from the task's locked_vm.
- A negative return value represent an error--for example, from get_user_pages()
- attempting to fault in a vma with PROT_NONE access. In this case, we leave
- the memory range accounted as locked_vm, as the protections could be changed
- later and pages allocated into that region.
- Mlocked Pages: munmap()/exit()/exec() System Call Handling
- When unmapping an mlocked region of memory, whether by an explicit call to
- munmap() or via an internal unmap from exit() or exec() processing, we must
- munlock the pages if we're removing the last VM_LOCKED vma that maps the pages.
- Before the unevictable/mlock changes, mlocking did not mark the pages in any way,
- so unmapping them required no processing.
- To munlock a range of memory under the unevictable/mlock infrastructure, the
- munmap() hander and task address space tear down function call
- munlock_vma_pages_all(). The name reflects the observation that one always
- specifies the entire vma range when munlock()ing during unmap of a region.
- Because of the vma filtering when mlocking() regions, only "normal" vmas that
- actually contain mlocked pages will be passed to munlock_vma_pages_all().
- munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup()
- for the munlock case, calls __munlock_vma_pages_range() to walk the page table
- for the vma's memory range and munlock_vma_page() each resident page mapped by
- the vma. This effectively munlocks the page, only if this is the last
- VM_LOCKED vma that maps the page.
- Mlocked Page: try_to_unmap()
- [Note: the code changes represented by this section are really quite small
- compared to the text to describe what happening and why, and to discuss the
- implications.]
- Pages can, of course, be mapped into multiple vmas. Some of these vmas may
- have VM_LOCKED flag set. It is possible for a page mapped into one or more
- VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one
- of the active or inactive LRU lists. This could happen if, for example, a
- task in the process of munlock()ing the page could not isolate the page from
- the LRU. As a result, vmscan/shrink_page_list() might encounter such a page
- as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To
- handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED
- vmas while it is walking a page's reverse map.
- try_to_unmap() is always called, by either vmscan for reclaim or for page
- migration, with the argument page locked and isolated from the LRU. BUG_ON()
- assertions enforce this requirement. Separate functions handle anonymous and
- mapped file pages, as these types of pages have different reverse map
- mechanisms.
- try_to_unmap_anon()
- To unmap anonymous pages, each vma in the list anchored in the anon_vma must be
- visited--at least until a VM_LOCKED vma is encountered. If the page is being
- unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked
- pages are migratable. However, for reclaim, if the page is mapped into a
- VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap
- semphore of the mm_struct to which the vma belongs in read mode. If this is
- successful, try_to_unmap() will mlock the page via mlock_vma_page()--we
- wouldn't have gotten to try_to_unmap() if the page were already mlocked--and
- will return SWAP_MLOCK, indicating that the page is unevictable. If the
- mmap semaphore cannot be acquired, we are not sure whether the page is really
- unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN.
- try_to_unmap_file() -- linear mappings
- Unmapping of a mapped file page works the same, except that the scan visits
- all vmas that maps the page's index/page offset in the page's mapping's
- reverse map priority search tree. It must also visit each vma in the page's
- mapping's non-linear list, if the list is non-empty. As for anonymous pages,
- on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will
- attempt to acquire the associated mm_struct's mmap semaphore to mlock the page,
- returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not.
- try_to_unmap_file() -- non-linear mappings
- If a page's mapping contains a non-empty non-linear mapping vma list, then
- try_to_un{map|lock}() must also visit each vma in that list to determine
- whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit
- all vmas in the non-linear list to ensure that the pages is not/should not be
- mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate.
- However, there is no easy way to determine whether the page is actually mapped
- in a given vma--either for unmapping or testing whether the VM_LOCKED vma
- actually pins the page.
- So, try_to_unmap_file() handles non-linear mappings by scanning a certain
- number of pages--a "cluster"--in each non-linear vma associated with the page's
- mapping, for each file mapped page that vmscan tries to unmap. If this happens
- to unmap the page we're trying to unmap, try_to_unmap() will notice this on
- return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it
- will return SWAP_AGAIN, causing vmscan to recirculate this page. We take
- advantage of the cluster scan in try_to_unmap_cluster() as follows:
- For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap
- semaphore of the associated mm_struct for read without blocking. If this
- attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will
- retain the mmap semaphore for the scan; otherwise it drops it here. Then,
- for each page in the cluster, if we're holding the mmap semaphore for a locked
- vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This
- call is a no-op if the page is already locked, but will mlock any pages in
- the non-linear mapping that happen to be unlocked. If one of the pages so
- mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will
- return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan
- to cull the page, rather than recirculating it on the inactive list. Again,
- if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns
- SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but
- couldn't be mlocked.
- Mlocked pages: try_to_munlock() Reverse Map Scan
- TODO/FIXME: a better name might be page_mlocked()--analogous to the
- page_referenced() reverse map walker--especially if we continue to call this
- from shrink_page_list(). See related TODO/FIXME below.
- When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System
- Call Handling" above--tries to munlock a page, or when shrink_page_list()
- encounters an anonymous page that is not yet in the swap cache, they need to
- determine whether or not the page is mapped by any VM_LOCKED vma, without
- actually attempting to unmap all ptes from the page. For this purpose, the
- unevictable/mlock infrastructure introduced a variant of try_to_unmap() called
- try_to_munlock().
- try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
- mapped file pages with an additional argument specifing unlock versus unmap
- processing. Again, these functions walk the respective reverse maps looking
- for VM_LOCKED vmas. When such a vma is found for anonymous pages and file
- pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
- attempt to acquire the associated mmap semphore, mlock the page via
- mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
- pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs
- shrink_page_list() that the anonymous page should be culled rather than added
- to the swap cache in preparation for a try_to_unmap() that will almost
- certainly fail.
- If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap
- semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list()
- to recycle the page on the inactive list and hope that it has better luck
- with the page next time.
- For file pages mapped into non-linear vmas, the try_to_munlock() logic works
- slightly differently. On encountering a VM_LOCKED non-linear vma that might
- map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking
- the page. munlock_vma_page() will just leave the page unlocked and let
- vmscan deal with it--the usual fallback position.
- Note that try_to_munlock()'s reverse map walk must visit every vma in a pages'
- reverse map to determine that a page is NOT mapped into any VM_LOCKED vma.
- However, the scan can terminate when it encounters a VM_LOCKED vma and can
- successfully acquire the vma's mmap semphore for read and mlock the page.
- Although try_to_munlock() can be called many [very many!] times when
- munlock()ing a large region or tearing down a large address space that has been
- mlocked via mlockall(), overall this is a fairly rare event. In addition,
- although shrink_page_list() calls try_to_munlock() for every anonymous page that
- it handles that is not yet in the swap cache, on average anonymous pages will
- have very short reverse map lists.
- Mlocked Page: Page Reclaim in shrink_*_list()
- shrink_active_list() culls any obviously unevictable pages--i.e.,
- !page_evictable(page, NULL)--diverting these to the unevictable lru
- list. However, shrink_active_list() only sees unevictable pages that
- made it onto the active/inactive lru lists. Note that these pages do not
- have PageUnevictable set--otherwise, they would be on the unevictable list and
- shrink_active_list would never see them.
- Some examples of these unevictable pages on the LRU lists are:
- 1) ramfs pages that have been placed on the lru lists when first allocated.
- 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to
- allocate or fault in the pages in the shared memory region. This happens
- when an application accesses the page the first time after SHM_LOCKing
- the segment.
- 3) Mlocked pages that could not be isolated from the lru and moved to the
- unevictable list in mlock_vma_page().
- 3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't
- acquire the vma's mmap semaphore to test the flags and set PageMlocked.
- munlock_vma_page() was forced to let the page back on to the normal
- LRU list for vmscan to handle.
- shrink_inactive_list() also culls any unevictable pages that it finds
- on the inactive lists, again diverting them to the appropriate zone's unevictable
- lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became
- SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or
- pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from
- the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice
- the latter, but will pass on to shrink_page_list().
- shrink_page_list() again culls obviously unevictable pages that it could
- encounter for similar reason to shrink_inactive_list(). As already discussed,
- shrink_page_list() proactively looks for anonymous pages that should have
- PG_mlocked set but don't--these would not be detected by page_evictable()--to
- avoid adding them to the swap cache unnecessarily. File pages mapped into
- VM_LOCKED vmas but without PG_mlocked set will make it all the way to
- try_to_unmap(). shrink_page_list() will divert them to the unevictable list when
- try_to_unmap() returns SWAP_MLOCK, as discussed above.
- TODO/FIXME: If we can enhance the swap cache to reliably remove entries
- with page_count(page) > 2, as long as all ptes are mapped to the page and
- not the swap entry, we can probably remove the call to try_to_munlock() in
- shrink_page_list() and just remove the page from the swap cache when
- try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page()
- doesn't seem to allow that.
|