Message-Id: <20091028094343.d59fec94.kamezawa.hiroyu@jp.fujitsu.com>
Date: Wed, 28 Oct 2009 09:43:43 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: Hugh Dickins <hugh.dickins@...cali.co.uk>
Cc: vedran.furac@...il.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, kosaki.motohiro@...fujitsu.com,
minchan.kim@...il.com, akpm@...ux-foundation.org,
rientjes@...gle.com, aarcange@...hat.com
Subject: Re: Memory overcommit
On Tue, 27 Oct 2009 20:44:16 +0000 (GMT)
Hugh Dickins <hugh.dickins@...cali.co.uk> wrote:
> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Sigh, gnome-session has twice the badness value of the mmap(1G) program.
> > Of course, gnome-session itself only uses 6MB of anon memory.
> > I wonder whether this is because gnome-session has many children..but I
> > need to dig more. Does anyone have an idea?
>
> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer got
> it right the first time, but went wrong when I tried it a second time
> (I think because of what had already been swapped out the first time).
>
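(As an aside, for anyone who wants to reproduce this: a minimal
version of such a test could look like the code below. This is my
reconstruction, not Hugh's actual program; it needs "ulimit -l
unlimited" or root, as he describes later.)

	/* mmap an anonymous region the size of physical RAM, then
	 * mlock it to fault everything in and push the system OOM. */
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long pages = sysconf(_SC_PHYS_PAGES);
		long psize = sysconf(_SC_PAGE_SIZE);
		size_t len = (size_t)pages * psize;

		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		return mlock(p, len) ? 2 : 0;
	}
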
> I built up a patchset of fixes, but once I came to split them up for
> submission, not one of them seemed entirely satisfactory; and Andrea's
> fix to the KSM/mlock deadlock forced me to abandon even the first of
> the patches (we've since then fixed the way munlocking behaves, so
> in theory could revisit that; but Andrea disliked what I was trying
> to do there in KSM for other reasons, so I've not touched it since).
> I had to get on with KSM, so I set it all aside: none of the issues
> was a recent regression.
>
> I did briefly wonder about the reliance on total_vm which you're now
> looking into, but didn't touch that at all. Let me describe those
> issues which I did try but failed to fix - I've no more time to deal
> with them now than then, but ought at least to mention them to you.
>
Okay, thank you for the detailed information.
> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
Hmm.
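If I read the code correctly, the ordering you describe is roughly
this (a simplified sketch of the select_bad_process() loop, not the
exact source):

	for_each_process(p) {
		/* p->mm is cleared by exit_mm(), long before
		 * exit_mmap() actually frees the memory... */
		if (!p->mm)
			continue;
		...
		/* ...so an exiting TIF_MEMDIE task is skipped above
		 * and no longer holds off the next kill here. */
		if (test_tsk_thread_flag(p, TIF_MEMDIE))
			return ERR_PTR(-1UL);
		...
	}

So the window between exit_mm() and exit_mmap() is exactly where we
can pick a needless second victim.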
> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> 	if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> 	    !p->exit_state)
> 		return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting for memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>
OK, then, an easy fix can't help here.
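Maybe a bounded wait could be a middle ground. A purely hypothetical
sketch (memdie_jiffies and OOM_WAIT_TIMEOUT do not exist anywhere;
they are just illustrative names):

	/* Hypothetical: give a TIF_MEMDIE task a bounded time to
	 * exit before we consider killing anyone else. */
	if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
		if (time_before(jiffies,
				p->memdie_jiffies + OOM_WAIT_TIMEOUT))
			return ERR_PTR(-1UL);	/* keep waiting */
		/* no progress in time: fall through and select
		 * another victim. */
	}

But as you point out, any fixed timeout is only a heuristic, and
choosing it well is the hard part.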
> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
I can't agree with that part of the heuristics, either.
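For reference, this is roughly the code in question in badness()
(quoting from memory of the current source, so details may differ):

	/* Superuser processes are usually more important. */
	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
		points /= 4;

	/* Don't kill a process with direct hardware access. */
	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
		points /= 4;

The two tests are independent, so a root process doing raw I/O ends
up with points/16, as Hugh says.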
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>
ok.
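(Presumably the patched version just folds the three tests together,
something like:

	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
	    has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
	    has_capability_noaudit(p, CAP_SYS_RAWIO))
		points /= 4;

so the worst case is a single quartering.)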
> 3. badness() has a comment above it which says:
> * 5) we try to kill the process the user expects us to kill, this
> * algorithm has been meticulously tuned to meet the principle
> * of least surprise ... (be careful when you change it)
> But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
> refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
> adds plenty of surprise there by trying to factor children into the
> calculation. It was intended to deal with forkbombs, but any reasonable
> process whose purpose is to fork children (e.g. gnome-session)
> becomes very vulnerable. And whereas badness() itself goes on to
> refine the total_vm points by various adjustments peculiar to the
> process in question, those refinements have been ignored when
> adding the child's total_vm/2. (Andrea does remark that he'd
> rather have rewritten badness() from scratch.)
>
> I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
> part of the calculation up to select_bad_process(), making a
> solo_badness() function which makes all those adjustments to
> total_vm, then badness() itself a simple function adding half
> the children's solo_badness()es to the process' own solo_badness().
> But probably lots more needs doing - Andrea's rewrite?
>
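If I understand the restructuring correctly, it looks something like
this (just my sketch of your description; solo_badness() is your
name, everything else is illustrative, and locking is omitted):

	/* All per-task adjustments (cpu time, nice, capabilities,
	 * oom_adj, ...) live in solo_badness(); the PF_OOM_ORIGIN
	 * check moves up to select_bad_process(); badness() only
	 * combines a task with its children. */
	unsigned long badness(struct task_struct *p)
	{
		struct task_struct *child;
		unsigned long points = solo_badness(p);

		list_for_each_entry(child, &p->children, sibling)
			points += solo_badness(child) / 2;

		return points;
	}

That seems much more consistent than adding a child's raw
total_vm/2 to fully adjusted points.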
> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>
>
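Right - for reference, the child loop in badness() only excludes
children that share the parent's own mm (roughly the current code;
mm here is p's mm):

	list_for_each_entry(child, &p->children, sibling) {
		task_lock(child);
		/* A child sharing *our* mm is skipped, but several
		 * children sharing one mm with each other (e.g. via
		 * clone(CLONE_VM)) each add its total_vm again. */
		if (child->mm != mm && child->mm)
			points += child->mm->total_vm/2 + 1;
		task_unlock(child);
	}

So a set of children sharing one mm can inflate the parent's points
by the same total_vm many times over.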
> I hope these notes help someone towards a better solution
> (and be prepared to discover more on the way). I agree with
> Vedran that the present behaviour is pretty unimpressive, and
> I'm puzzled as to how people can have been tinkering with
> oom_kill.c down the years without seeing any of this.
>
Sorry, I usually don't use X on servers, and almost all of my recent
OOM testing was done under memcg ;(
Thank you for your investigation. Maybe I'll need to fix this in
several steps.
Thanks,
-Kame