linux-kernel - Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <AM6PR03MB5170B4426D360326DA86FE60E4DC0@AM6PR03MB5170.eurprd03.prod.outlook.com>
Date:   Sun, 12 Apr 2020 08:01:10 +0200
From:   Bernd Edlinger <bernd.edlinger@...mail.de>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Oleg Nesterov <oleg@...hat.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Waiman Long <longman@...hat.com>,
        Ingo Molnar <mingo@...nel.org>, Will Deacon <will@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Alexey Gladkov <gladkov.alexey@...il.com>
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Linus,

On 4/11/20 11:57 PM, Linus Torvalds wrote:
> On Sat, Apr 11, 2020 at 2:21 PM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
>>
>> You're confused.
>>
>> There *was* that callback already. It happened when the thread exited and became a zombie. No ordering has changed.
> 
> Also, stop calling this a "deadlock" like you do, when you're talking
> about the kernel behavior.
> 

Agreed, we must first define the terms.  Let's call this a pseudo-deadlock.

I always talk about the tracer, so it is a deadlock in the tracer yes.
And it is also a deadlock in the process trying to execve the process
that execve's is doing something silly.  That is not shutting down
the threads that are maybe doing dangerous things, while one of them
does execve.  In my programs I do never do that because it is always
a race condition and dangerous not shut down threads before exiting
or execve.  But the tracer wants to debug this silly program, just to
understand why it is occasionally malfunctioning.  Then a deadlock in
both processes happens at the same time, repeatably.  Now a user called
Bernd is scratching his head why that happens.....

> That only - again - shows that you're confused.
> 

Sorry, I was writing another more important mail :), and interrupted
that quickly, to help you with one aspect of the problem that may
have been overlooked, so it is quite possible that I missed something.

So the intention was just to make sure you make your decisions based
on the some facts that I realized because I have debugged the tracer,
and found out that the kernel is expecting the tracer to have knowledge
of future events, and that is impossible for the tracer, so the tracer
must fail when he predicts the future wrong.  Again that is just my
impression I had when debugging the tracer.

The trace looks like this
do {
   wait()
   if(needed) ptrace_access();
   if(needed) ptreace_otherstuff();
} while(1);

when the thread death event is in the tracer's input queue,
the tracer does not know of it unless it looks,
but from time to time the tracer wants to do other things
like attach a process that may be doing random things.
But if I think about it, the parent changes in the moment.
The original parent did not listen and make the zombie
thread go away, but now the trace is the parent, but
it is unable to call wait since it is just doing
ptrace_access.  It is possible that I have missed something,
then it woud be good that others with more experience help
with me putting the puzzle together.

> The kernel at no point was deadlocked.
> 
> The kernel had two events - the exit of one thread, and the execve()
> of another - that it wanted to return to the tracer.
> 
> The tracer isn't listening to the first one, and the kernel is waiting
> for the tracer to do so. The tracer wants to get the second one, but
> it's not happening until the tracer has dealt with the first one.
> 
> That's a deadlock FOR THE TRACER.
> 

I just say, changing the order of the events is not what the tracer
wants.  The tracer wants to get out of the ptrace_access and do
listen to the events, in the correct order.  If the events happen
in a different order than before the tracer may be confused.
That is the point I am trying to make.  And yes I may be wrong.
So please correct me, if that is the case.

> But it's not a deadlock for the kernel. See the difference?
> 
> It literally is absolutely not at all different from a user space
> application that deadlocks on a full pipe buffer, because it's not
> emptying it - and it's not emptying it because it's trying to just
> write more data to it.
> 
> See? It's a user space bug. It really is that simple.
> 
> Here, I'll give you this program just so that you can run it and see
> for yourself how it hangs:
> 
>     #include <unistd.h>
> 
>     char buffer[1024*1024];
> 
>     int main(void)
>     {
>         int fd[2];
>         pipe(fd);
>         write(fd[1], buffer, sizeof(buffer));
>         read(fd[0], buffer, sizeof(buffer));
>     }
> 
> and it's *exactly* the same thing. The kernel buffer for the pipe
> fills up, so the write() ends up hanging. There's a read() there to
> empty the buffer, but that program will never get to it, because the
> write hangs.
> 

Yes, and no.  To solve this problem we have non-blocking sockets
and pipes.  When I write a server I never use any blocking APIs
because I know a write can block at any time, when the server and
the client use a blocking strategy.  So also when I write a client
it is always using non-blocking sockets.  But not all clients are
written this way, therefore the server must be forgiving.

> Is the above a kernel deadlock? No.
> 
> And anybody who calls it a "deadlock" when talking to kernel people is
> confused and should be ignored.
> 

It is not a good idea to ignore people who are barely aware of the right
terms.  It is better to try to understand what they really mean and
translate what they said to what they meant, then ask if you understood
correctly what was meant.  At least that is how I would like the discussion
like that to be.  It is a suggestion and not meant as an offense.

> So stop calling the ptrace() issue a deadlock when talking to kernel
> people - it just means annoys and confuses them. I literally thought
> something completely different was going on initially because you and
> Eric called it a deadlock: I thought that one thread was tracing
> another, and that the SIGKILL didn't resolve it.
> 

Once again, I am new to this lkml list, and it is not meant as an offense.

Yes SIGKILL does resolve this.
Either the tracer, or one of the tracess, but which one?

But SIGKILL is impolite, I prefer SIGTERM. And that does not resolve
the pseudo-deadlock.

> But no. It's not a kernel deadlock at all, it's something much
> "simpler". It is the same exact thing as the stupid buggy example
> program above, except instead of using "write()" and waiting for
> something that will never happen because the write needs another thing
> to happen first, the buggy ptrace test program is using 'ptrace()' to
> wait for something that will never happen.
> 
> Now, admittedly there is _one_ big difference: ptrace() is a garbage
> interface. Nobody disputes that.
> 
> ptrace() is absolutely horrible for so many reasons. One reason it's
> horrible is that it doesn't have a nonblocking mode with poll(), like
> the write() above would have, and that people who use read and write
> can do. You can fix the stupid deadlock with the above read/write loop
> by switching over to nonblocking IO and a poll loop - doing the same
> with ptrace is more difficult, because ptrace() just isn't that kind
> of interface.
> 

What my other patch (you know the one you hated ;-) ) does, is making
ptrace non-blocking when it must, by returning -EAGAIN.
But keep the order of events as they are.
I think your solution will change the order of events in order
to make the kernel more simple.  Right?

> And with read/write, you could also (and this is even more common)
> just used multiple processes (or threads), so that one process does
> reading, and another does writing. Again, ptrace() is a shitty
> interface and doesn't really allow for that very well, although maybe
> it could be made to work that way too.
> 
> But even with ptrace, you could have made the tracing program set a
> signal handler for SIGCHLD *before* it started doing forks and stuff,
> and then exit of the first thread should have caused a signal, and you
> could have reaped it there, and it should work with ptrace.
> 
> But that garbage ptrace() test program didn't do even that. So it ends
> up hanging, waiting for an event that never happens, because it didn't
> do that other thing it needed to do - *EXACTLY* like the "write()"
> ends up hanging, waiting for that read() that will never happen,
> because the garbage test-program did things wrong.
> 

The test case does that on purpose to demonstrate something that
may also happen in strace, but in a very unlikely case, but the
likelihood is not zero and when it happens users are surprised.

> So as long as you guys keep talking about "deadlocks", the _only_
> thing you are showing is that you don't understand the problem.
> 
> It's not a deadlock - it's a buggy test.
> 

The test case makes only sense together with the last part of
the patch.  If you prefer another solution then you must change
the test case as well.  That is allowed.
Changing strace is not allowed.  And breaking something in the
strace-5.5 test suite is also not allowed.

> Now, what I've suggested are a couple of ways to make ptrace()
> friendlier to use, and thus allow that stupid test to work the way you
> seem to want it to work.

No do not make my test case happy.  That is not what it was written
for.

> 
> Basically, execve() doesn't have a lot of reasons to really wait for
> the threads it waits for, and the only real thing it needs to do is to
> make sure they are killed. But maybe it can just ignore them once they
> are dead.
> 
> Or alternatively, it can just say "if you didn't reap the zombie
> threads, I'll reap them for you".
> 

Isn't that is an API change?

I think, there is no way how we can avoid an API change, but we have the
choice which API change we want.

> Or, as mentioned, we can do nothing, and just let buggy user space
> tracing programs hang. It's not the _kernels_ problem if you write
> buggy code in user space, this is not anything we've ever helped with
> before, so it's not like it's a regression.
> 

Also that is a possible allowed solution.
Just leave the test case as is, or change to KFAIL so known FAIL.

The remaining problem is certainly not important enough to make you
unhappy.


Bernd.

> (Although it can be a regression for your buggy program if you used to
> - for example - do threading entirely using some user-space library,
> and didn't have threads that the kernel knew about, and so when your
> program did 'execve()' it just got rid of those threads
> automatically).
> 
>                  Linus
>