Message-ID: <CAJgzZopVb5CC-T6GTYTAGKwY+vQ9Muvi3g6jKNmOC1+4BTMNvQ@mail.gmail.com>
Date: Wed, 15 Nov 2017 14:10:17 -0800
From: enh <enh@...gle.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Rob Landley <rob@...dley.net>, Kees Cook <keescook@...omium.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
toybox@...ts.landley.net
Subject: Re: Regression: commit da029c11e6b1 broke toybox xargs.
On Sun, Nov 5, 2017 at 12:46 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> On Sat, Nov 4, 2017 at 5:39 PM, Rob Landley <rob@...dley.net> wrote:
>> On 11/03/2017 08:07 PM, Linus Torvalds wrote:
>>>>
>>>> But it boils down to "got the limit wrong, the exec failed after the
>>>> fork(), dynamic recovery from which is awkward so I'm trying to figure
>>>> out the right limit".
>>
>> Sounds later like dynamic recovery is what you recommend. (Awkward
>> doesn't mean I can't do it.)
>
> So my preferred solution would certainly be for xargs to realize that
> _SC_ARG_MAX is at most an "informed guess" rather than some absolute
> value.
>
> Or, alternatively, simply use a _SC_ARG_MAX that is small enough that
> we can say "yes, that's what we've always supported". That, in
> practice, is the 128kB value.
>
> It's probably also big enough that nobody cares any more. Sure, you
> *could* feed bigger arguments, and we do end up basically supporting
> it because people who don't write proper scripts will simply do
>
> grep some-value $(find . -name '*.c')
>
> and that works pretty well if your code-base isn't humongous.
>
> It still works for the kernel, for example, but there's no question
> that it's still not something we *guarantee* works.
>
> It doesn't work if you don't limit it to *.c files:
>
> [torvalds@i7 linux]$ grep some-value $(find .)
> -bash: /usr/bin/grep: Argument list too long
>
> so when the kernel expanded the argument list from the traditional
> 128kB, it was for _convenience_, it was not meant for people who do
> proper scripting and use xargs.
>
> See the difference?
>
>>> What _is_ the stack limit when using toybox? Is it just entirely unlimited?
>>
>> Answer to second question on ubuntu 14.04:
>>
>> landley@...ftwood:~/linux/linux/fs$ ulimit -s 999999999
>> landley@...ftwood:~/linux/linux/fs$ ulimit -s
>> 999999999
>
> Oh, I can do that on Fedora too.
>
> But the thing is, nobody actually seems to do that.
>
> And our regression rule has never been "behavior doesn't change".
> That would mean that we could never make any changes at all.
>
> For example, we do things like add new error handling etc all the
> time, which we then sometimes even add tests for in our kselftest
> directory.
>
> So clearly behavior changes all the time and we don't consider that a
> regression per se.
>
> The rule for a regression for the kernel is that some real user
> workflow breaks. Not some test. Not a "look, I used to be able to do
> X, now I can't".
>
> So that's why this whole "is this a test or a real workload" question
> is important to me, and what odd RLIMIT_STACK people actually use.
Android's default RLIMIT_STACK is 8192KiB (8MiB).
the specific bug report was from an interactive shell user. i'm not
aware of any part of the platform itself that relies on this, nor of
any third-party app that does.
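
for reference, here's a minimal C sketch (error handling trimmed) that
prints the two values in play here: the stack rlimit that the kernel's
"stack / 4" cap is based on, and what libc reports for _SC_ARG_MAX:

    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        struct rlimit rl;

        /* the stack limit the kernel's "stack / 4" argv+envp cap
         * is derived from. */
        if (getrlimit(RLIMIT_STACK, &rl) == 0) {
            if (rl.rlim_cur == RLIM_INFINITY)
                printf("RLIMIT_STACK: unlimited\n");
            else
                printf("RLIMIT_STACK: %llu bytes\n",
                       (unsigned long long) rl.rlim_cur);
        }

        /* what libc claims the execve() argument-size limit is. */
        printf("sysconf(_SC_ARG_MAX): %ld bytes\n", sysconf(_SC_ARG_MAX));
        return 0;
    }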
> I'm not interested in "can you reproduce it", because that's simply
> not the issue for us from a regression standpoint.
>
> That said, I tried this under Fedora:
>
> ulimit -s unlimited
> find / -print0 2> /dev/null | xargs -0 /usr/bin/echo | wc
>
> and it reported 991 lines. That would match just using 128kB as the
> breaking point for xargs.
>
> So apparently at least Fedora doesn't seem to use that "stack limit /
> 4" thing, but the traditional 128kB value.
>
> I don't know if that is because of 'xargs', or because of library
> differences - I didn't check.
>
> And I have a hard time judging whether the regression report is a
> "look, this behavior changed", or a "we actually have a real
> regression visible to actual users".
>
> See above why that matters to me.
>
>> It sounds like "probe and recover" is my best option.
>
> I actually suspect "just use 128kB" is the actual best option in practice.
for libc's sysconf(_SC_ARG_MAX) too? i'm fine changing bionic back to
reporting 128KiB if there's an lkml "Linus says" mail that i can link
to in the comment. it certainly seems like an overly conservative
choice is better than the current situation...
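
to be concrete, what i have in mind is a hypothetical sketch like this
(not bionic's actual code): just hard-code the traditional value
instead of deriving anything from RLIMIT_STACK:

    /* hypothetical sketch: report the traditional 128KiB that every
     * linux kernel has always supported, rather than a value derived
     * from the current RLIMIT_STACK. */
    long __sysconf_arg_max(void)
    {
        return 128 * 1024;
    }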
> Sure, "probe and recover" is the best option in theory, but it's a lot
> more complex, and there doesn't actually seem to be a lot of upside.
>
> So to take my silly example of piping "find /" to xargs: I can't
> imagine that anybody sane should care really whether it results in
> 900+ execve's (for that 128kB limit) or just 20 (for some "optimal"
> 6MB limit).
>
> And there is actually a somewhat real fear with the
> "probe-and-recover" model: the already mentioned nasty case "sometimes
> we don't return E2BIG, we might just succeed the execve, but then kill
> the process very early due to some _other_ limit being triggered".
>
> That nasty case really shouldn't happen, but it was the case we
> considered (and rejected) for suid binaries.
>
> So that's the real reason the kernel behavior changed: we had a
> security issue with the "we allow basically unlimited stack growth"
> where a huge argv/envp could be used to grow the stack into some other
> segment.
>
> Realistically, it was only a security issue with suid binaries, but as
> mentioned in this thread, the problem is that we do the argument
> copying even before we've decided whether the execve is going to be a
> suid execve.
>
> So then - exactly so that we do *not* have that nasty case of "execve
> succeeds, but we kill the process immediately if it turns into a
> potential security issue", we do that "limit the stack growth for
> everybody".
>
> Linus
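
fwiw, a rough sketch (not toybox's actual code) of what the "probe and
recover" loop would look like, just to show the complexity we'd be
taking on: try a batch, and if execve() fails with E2BIG, halve the
batch and retry:

    #include <errno.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* run cmd with args[0..count-1] (args[0] being the command name);
     * the caller must leave args[count] writable for the terminator.
     * returns the batch size that ran, or -1 if even a tiny batch
     * failed. */
    static int run_batch(char *cmd, char **args, int count)
    {
        while (count > 1) {
            pid_t pid = fork();

            if (pid < 0) return -1;
            if (pid == 0) {
                args[count] = 0;
                execvp(cmd, args);
                /* E2BIG is the "too many arguments" probe failing;
                 * report that to the parent via the exit status. */
                _exit(errno == E2BIG ? 126 : 127);
            }

            int status;
            waitpid(pid, &status, 0);
            if (WIFEXITED(status) && WEXITSTATUS(status) == 126) {
                count /= 2;  /* probe failed: halve the batch, retry */
                continue;
            }
            return count;
        }
        return -1;
    }

note that this still doesn't cover the nasty case above where the
execve() itself succeeds but the child is killed early by some other
limit being triggered.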
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.