linux-kernel - [RFC] Using "page credits" as a solution for common thrashing scenarios

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <b64f365b0911171252q42a0c102jddb2cd29af64a7a7@mail.gmail.com>
Date:	Tue, 17 Nov 2009 22:52:51 +0200
From:	Eyal Lotem <eyal.lotem@...il.com>
To:	LKML <linux-kernel@...r.kernel.org>
Subject: [RFC] Using "page credits" as a solution for common thrashing 
	scenarios

I apologize for not sending a patch, but I am not yet skilled enough with
Linux kernel hacking to stop with the cheap talk and show you the code.

I have encountered numerous times the following situation:

  * Small processes A, B, C happily functioning on the system (e.g: bash,
    xterm, etc).

  * Large misbehaving process D (e.g firefox) is launched, and due to
    a bug or other issue, begins to allocate and write to an extremely
    large memory working set.

  * The entire operating system and all processes become highly
    unresponsive. Sometimes to the point where I cannot even kill the
    problematic process.  SysRq keys still work, but those are usually too
    coarse.  Sometimes the OOM will start firing at random.

My analysis of this:

  * I believe that process D in this scenario is basically causing the
    kernel MM to evict all of the pages of the well-behaving A, B, C
    processes.

  * I think it is wrong for the kernel to evict the 15 pages of the bash,
    xterm, X server's working set, as an example, in order for a
    misbehaving process to have 1000015 instead of 1000000 pages in its
    working set. EVEN if that misbehaving process is accessing its working
    set far more aggressively.

Suggested solution:

 If my analysis isn't off, then I believe the following solution might
 mitigate the problem elegantly.

 1. Maintain a per-process Most-Recently-Used (MRU) page listing. This
    listing is allowed to be approximate (e.g: by use of the page table's
    Accessed and Dirty bits).

 2. Assign a number of "page credits" to each process (sort of a memory
    "niceness" level) which the kernel automatically disperses on the MRU
    pages of that process.  When the MRU set changes, the "page credits"
    are *moved* from LRU entries to new MRU entries.

 3. Each physical page accumulates all of the "page credits" of the various
    processes that use it in their MRU.  This allows shared pages to
    accumulate more credit when they are useful for more processes.

 4. Page eviction should still be global (not per-process) but should evict
    pages whose accumulated "page credits" count is lowest.

 5. per-process "page credit" levels can be maintained by user-space (using
    setrlimit or such). It probably makes sense for fork() to split the
    "page credit" levels and not duplicate them (and spawn new page credits
    out of nowhere).  While having a good way to specify per-process "page
    credits" could improve things, I don't believe it is critical to do
    this accurately in order for this to effectively prevent thrashing.

Rationale: The usefulness of a page also stems from how big of a percentage
  it is of it's user process working set.  If a single page is the entire
  working set of a process, evicting that single page is most costly than
  even evicting 100 pages of another process if that other process's
  working set is extremely large.

Highly simplified example of suggested solution (numbers are made up):

NOTE: The example does not include shared library memory use and various
other details (I do believe this solution handles those elegantly), to
avoid clutter.

 1. 3 bash processes are assigned 1 million credits each, and each of them
    use a shared 10 pages (e.g mmap'd code) and another 40 unique pages in
    their MRU working set.

    The kernel automatically assigns 20,000 credits (1M / 50) to each of
    the pages in each bash process.  The shared 10 physical pages will
    accumulate 60,000 credits each.  Each of the 120 (40 * 3) unique
    physical pages accumulates 20,000 credits.

 2. xterm, X processes are assigned 1 million credits each, and each of
    them use 200 unique pages.

    The kernel automatically assigns 5,000 credits to each of the unique
    pages, so the physical pages accumulate 5,000 credits each.

 3. a firefox process is too assigned 1 million credits. It aggressively
    allocates and writes to as many pages as the kernel allows.  Instead of
    starting to evict the above processes (bash's, xterm's, X's) which
    hinder responsiveness to the user, the kernel will find the physical
    pages with the least amount of "credits" to evict.

 4. Assuming firefox has already allocated 1 million pages before eviction
    is required, the eviction decision is faced with the following data:

    * 10 physical pages shared by the bash processes, each have 60,000
      credits.

    * 120 physical pages (40 x 3 unique pages of the bash processes), each
      with 20,000 credits.

    * 400 physical pages (of xterm and X) with 5,000 credits each.

    * 1 million physical pages (of firefox) with 1 credit each.

 5. The kernel has a "no-brainer" choice here. Instead of effectively
    pausing all of the behaving processes in order to allocate a few more
    pages for firefox, it can and should make firefox pay for its
    wastefulness.  It will evict firefox's old pages to make room for
    firefox's new pages.

    In effect, firefox's DOS attack here (on the physical page resource)
    will attack only firefox itself, and not the rest of the system.
    Firefox's DOS attack on disk I/O and other resources is still underway
    (Perhaps requiring a different solution).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/