netdev - Re: 9p/trans

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200806123849.GA2640@nautica>
Date:   Thu, 6 Aug 2020 14:38:49 +0200
From:   Dominique Martinet <asmadeus@...ewreck.org>
To:     Alexey Kardashevskiy <aik@...abs.ru>
Cc:     v9fs-developer@...ts.sourceforge.net, Greg Kurz <groug@...d.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: 9p/trans_fd lockup

Alexey Kardashevskiy wrote on Thu, Aug 06, 2020:
> I am seeing another bug in 9p under syzkaller, the reprocase is:
> 
> r0 = open$dir(&(0x7f0000000040)='./file0\x00', 0x88142, 0x182)
> 
> r1 = openat$null(0xffffffffffffff9c, &(0x7f0000000640)='/dev/null\x00',
> 0x0, 0x0)
> mount$9p_fd(0x0, &(0x7f0000000000)='./file0\x00',
> &(0x7f00000000c0)='9p\x00', 0x0, &(0x7f0000000100)={'trans=fd,',
> {'rfdno', 0x3d, r1}, 0x2$, {'wfdno', 0x3d, r0}})
> 
> 
> 
> The default behaviour of syzkaller is to call syscalls concurrently (I
> think), at least it forks by default and executes the same sequence in
> both threads.
> 
> In this example both threads makes it to:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/9p/client.c?h=v5.8#n757
> 
> and sit there with the only difference which is thread#1 goes via
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/9p/client.c?h=v5.8#n767
> 
> I am pretty sure things should not have gone that far but I cannot
> clearly see what needs fixing. Ideas? Thanks,

Unkillable threads there happen with the current p9_client_rpc when
there is no real server (or server bug etc); the code is really stupid.

Basically what happens is that when you send a first signal (^C or
whatever), the function catches the signal, sends a flush, and
indefinitely waits for the flush to come back.
If you send another signal there no more flush comes but it goes back to
waiting -- it's using wait_event_killable but it's a lie it's not really
killable it just loops on that wait until the flush finally comes, which
will never come in your case.
(the rpc that came by the way is probably version or whatever is first
done on mount)

Dmitry reported that to me ages ago and I have a fix which is just to
stop waiting for the flush -- just make it asynchronous, send and
forget. That removes the whole signal handling logic and it won't hang
there anymore.

I sent the patches to the list last year, but didn't get much feedback
and didn't have time to run all the tests I wanted to run on it.

I have some free time at the end of the month so I was planning to
finish it for 5.10 (e.g. won't send it for 5.9 but once 5.9 initial
merge window passed leave it in -next for a couple of months and push it
for 5.10), so your timing is pretty good :)
An extra pair of eyes would be more than appreciated.

You can find the original mails there:
https://lore.kernel.org/lkml/1544532108-21689-3-git-send-email-asmadeus@codewreck.org/

They're also in my 9p-test branch on git://github.com/martinetd/linux

Cheers & thanks for the attention,
-- 
Dominique