linux-kernel - Re: [PATCH] aio: Add memcg accounting of user used data


Message-ID: <20171206090508.GF16386@dhcp22.suse.cz>
Date:   Wed, 6 Dec 2017 10:05:08 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     Kirill Tkhai <ktkhai@...tuozzo.com>
Cc:     axboe@...nel.dk, oleg@...hat.com, bcrl@...ck.org, tj@...nel.org,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-aio@...ck.org, jmoyer@...hat.com
Subject: Re: [PATCH] aio: Add memcg accounting of user used data

On Tue 05-12-17 19:02:00, Kirill Tkhai wrote:
> On 05.12.2017 18:43, Michal Hocko wrote:
> > On Tue 05-12-17 18:34:59, Kirill Tkhai wrote:
> >> On 05.12.2017 18:15, Michal Hocko wrote:
> >>> On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> >>>> Currently, the number of available aio requests may be
> >>>> limited only globally. There are two sysctl variables,
> >>>> aio_max_nr and aio_nr, which implement the limit
> >>>> and the request accounting. They help to avoid
> >>>> the situation where all the memory is eaten by in-flight
> >>>> requests, which are being written by a slow block device
> >>>> and which can't be reclaimed by a shrinker.
> >>>>
> >>>> This becomes a problem when many containers
> >>>> are used on a hardware node. Since aio_max_nr is
> >>>> a global limit, any container may occupy all of the
> >>>> available aio requests and deprive the others of the
> >>>> possibility to use aio at all. The situation may
> >>>> arise from malicious intent of the container's
> >>>> user or from a program error, when the user
> >>>> does this accidentally.
> >>>>
> >>>> The patch fixes the problem. It adds memcg
> >>>> accounting of user-used aio data (the biggest is
> >>>> the bunch of aio_kiocb; the ring buffer is the second
> >>>> biggest), so a user of a certain memcg won't be able
> >>>> to allocate more aio request memory than the cgroup
> >>>> allows, and will bump into the limit.
> >>>
> >>> So what happens when we hit the hard limit and oom kill somebody?
> >>> Are those charged objects somehow bound to a process context?
> >>
> >> There is exit_aio(), called from __mmput(), which waits until
> >> the charged objects complete and decrements the reference counter.
> > 
> > OK, so it is bound to _a_ process context. The oom killer will not
> > know which process has consumed those objects but the effect will be at
> > least reduced to a memcg.
> > 
> >> If there were a problem with oom in memcg, there would be
> >> the same problem on global oom, since, as can be seen, there are
> >> no __GFP_NOFAIL flags anywhere in the aio code.
> >>
> >> But it seems everything is safe.
> > 
> > Could you share your testing scenario and how the system behaved
> > during heavy aio?
> > 
> > I am not saying the patch is wrong, I am just trying to understand all
> > the consequences.
> 
> My test is a simple program which creates an aio context and then starts
> an infinite io_submit() loop. I've tested the cases where certain stages
> fail: io_setup() meets oom, io_submit() meets oom, io_getevents() meets
> oom. This was tested simply by inserting sleep() before the stage and
> moving the task to an appropriate cgroup with a low memory limit. In most
> cases, I get bash killed (I moved it to the cgroup too). Also, I've executed
> the test in parallel.
> 
> If you want I can send you the source code, but I don't think it will be
> easy to use it if you are not the author.

Well, not really. I was merely interested in the testing scenario,
mainly to see how the system behaved, because a memcg hitting the hard
limit will OOM kill something only if the failing charge comes from the
page fault path. All kernel allocations therefore return with ENOMEM.
The fact is that we are not considering per-task charged kernel memory,
and therefore a small task constantly allocating kernel memory can bring
the whole cgroup down. As I've said, this is something that _should_ be OK
because the bad behavior is isolated within the cgroup.
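
To make the behavior described above concrete, here is a toy model in
Python. It is only a sketch and not kernel code: the class ToyMemcg, its
methods, the page counts and the "pick the task with the most user memory"
victim policy are all assumptions made for illustration. It shows a small
task filling the group with kernel charges until it sees ENOMEM, after
which a page fault from a larger task triggers the OOM kill.

```python
import errno


class ToyMemcg:
    """Toy model of a memcg with a hard limit (hypothetical, not kernel code)."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.user = {}    # task -> pages charged from the page fault path
        self.kernel = {}  # task -> kernel pages; victim selection ignores this
        self.killed = []

    def add_task(self, task):
        self.user[task] = 0
        self.kernel[task] = 0

    def usage(self):
        return sum(self.user.values()) + sum(self.kernel.values())

    def charge_kernel(self, task, pages):
        """Kernel allocation: fails with ENOMEM at the limit, no OOM kill."""
        if self.usage() + pages > self.limit:
            return -errno.ENOMEM
        self.kernel[task] += pages
        return 0

    def charge_fault(self, task, pages):
        """Page fault charge: OOM kills tasks until the charge fits.

        Returns False if the faulting task itself was killed.
        """
        while task in self.user and self.usage() + pages > self.limit:
            # Assumption: the victim is chosen by user memory only, because
            # kernel memory is not considered per task.
            victim = max(self.user, key=self.user.get)
            self.killed.append(victim)
            del self.user[victim]
            del self.kernel[victim]
        if task not in self.user:
            return False
        self.user[task] += pages
        return True


def demo():
    cg = ToyMemcg(limit_pages=100)
    for task in ("bash", "aio_hog"):
        cg.add_task(task)
    cg.charge_fault("bash", 10)
    cg.charge_fault("aio_hog", 1)
    granted = 0
    # The small task keeps allocating kernel memory until it gets ENOMEM.
    while cg.charge_kernel("aio_hog", 1) == 0:
        granted += 1
    # The next page fault in the group is what triggers the OOM kill.
    survived = cg.charge_fault("bash", 1)
    return granted, survived, cg.killed


if __name__ == "__main__":
    print(demo())
```

In this model the small task is never selected as the victim even though
it holds most of the charged memory, but the damage stays inside the one
group.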

If that is the expected behavior for your use case then OK.
-- 
Michal Hocko
SUSE Labs