linux-kernel - Re: [PATCH 1/2 v2] mm: Allow small allocations to fail

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5509410F.2080000@suse.cz>
Date:	Wed, 18 Mar 2015 10:10:39 +0100
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Michal Hocko <mhocko@...e.cz>, Johannes Weiner <hannes@...xchg.org>
CC:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
	akpm@...ux-foundation.org, david@...morbit.com, mgorman@...e.de,
	riel@...hat.com, fengguang.wu@...el.com, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2 v2] mm: Allow small allocations to fail

On 03/17/2015 08:41 PM, Michal Hocko wrote:
> On Tue 17-03-15 13:26:28, Johannes Weiner wrote:
>> On Tue, Mar 17, 2015 at 03:17:29PM +0100, Michal Hocko wrote:
>>> On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
>>>> On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
>>>>> On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
>>>>>> A sysctl certainly doesn't sound appropriate to me because this is not
>>>>>> a tunable that we expect people to set according to their usecase.  We
>>>>>> expect our model to work for *everybody*.  A boot flag would be
>>>>>> marginally better but it still reeks too much of tunable.
>>>>>
>>>>> I am OK with a boot option as well if the sysctl is considered
>>>>> inappropriate. It is less flexible though. Consider a regression testing
>>>>> where the same load is run 2 times once with failing allocations and
>>>>> once without it. Why should we force the tester to do a reboot cycle?
>>>>
>>>> Because we can get rid of the Kconfig more easily once we transitioned.
>>>
>>> How? We might be forced to keep the original behavior _for ever_. I do
>>> not see any difference between runtime, boottime or compiletime option.
>>> Except for the flexibility which is different for each one of course. We
>>> can argue about which one is the most appropriate of course but I feel
>>> strongly we cannot go and change the semantic right away.
>>
>> Sure, why not add another slab allocator while you're at it.  How many
>> times do we have to repeat the same mistakes?  If the old model sucks,
>> then it needs to be fixed or replaced.  Don't just offer another one
>> that sucks in different ways and ask the user to pick their poison,
>> with a promise that we might improve the newer model until it's
>> suitable to ditch the old one.
>>
>> This is nothing more than us failing and giving up trying to actually
>> solve our problems.
>
> I probably fail to communicate the primary intention here. The point
> of the knob is _not_ to move the responsibility to userspace. Although
> I would agree that the knob as proposed might look like that and that is
> my fault.
>
> The primary motivation is to actually help _solving_ our long standing
> problem. Default non-failing allocations policy is simply wrong and we
> should move away from it. We have a way to _explicitly_ request such a
> behavior. Are we in agreement on this part?
>
> The problem, as I see it, is that such a change cannot be pushed to
> Linus tree without extensive testing because there are thousands of code
> paths which never got exercised. We have basically two options here.
> Either have a non-upstream patch (e.g. sitting in mmotm and linux-next)
> and have developers to do their testing. This will surely help to
> catch a lot of fallouts and fix them right away. But we will miss those
> who are using Linus based trees and would be willing to help to test
> in their loads which we never dreamed of.
> The other option would be pushing an experimental code to the Linus
> tree (and distribution kernels) and allow people to turn it on to help
> testing.
>
> I am not ignoring the rest of the email, I just want to make sure we are
> on the same page before we go into a potentially lengthy discussion just
> to find out we are talking past each other.
>
> [...]

After reading this discussion, my impression is: as I understand your 
motivation, the knob is supposed to expose code that has broken handling 
of small allocation failures, because the handling was never exercised 
(and thus found to be broken) before. The steps you are proposing are to 
allow this to be tested by those who understand that it might break 
their machines, until those broken allocation sites are either fixed or 
converted to __GFP_NOFAIL. We want the change of implicit nofail 
behavior to happen, as then we limit the potential deadlocks to 
explicitly annotated allocation sites, which simplifies efforts to 
prevent the deadlocks (e.g. with reserves).

AFAIU, Johannes is worried that the knob adds some possibility that 
allocations will fail prematurely, even though further trying would 
allow it to succeed and would not introduce a deadlock. The probability 
of this is hard to predict even inside MM, yet we assume that userspace 
will set the value. This might discourage some of the volunteers that 
would be willing to test the new behavior, since they could get extra 
spurious failures. He would like to see this to be as reliable as 
possible, failing allocation only when it's absolutely certain that 
nothing else can be done, and not depend on a magically set value from 
userspace. He also believes that we can still improve on the what "can 
be done" part.

I'll add that I think if we do improve the reclaim etc, and make 
allocations failures rarer, then the whole testing effort will have much 
lower chance of finding the places where allocation failures are not 
handled properly. Also Michal says that catching those depend on running 
all "their loads which we never dreamed of". In that case, if our goal 
is to fix all broken allocation sites with some quantifiable 
probability, I'm afraid we might be really better off with some form of 
fault injection, which will trigger the failures with the probability we 
set, and not depend on corner case low memory conditions manifesting
just at the time the workload is at one of the broken allocation sites.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/