Message-ID: <ZE2QhyNzgMo8KFVS@mit.edu>
Date: Sat, 29 Apr 2023 17:47:51 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: syzbot <syzbot+ecab51a4a5b9f26eeaa1@...kaller.appspotmail.com>
Cc: adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org
Subject: Re: [syzbot] WARNING in ext4_dirty_folio
#syz set subsystems: mm
On Wed, Jun 08, 2022 at 04:36:20AM -0700, syzbot wrote:
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit: cf67838c4422 selftests net: fix bpf build error
> git tree: net
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=123c2173f00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=fc5a30a131480a80
> dashboard link: https://syzkaller.appspot.com/bug?extid=ecab51a4a5b9f26eeaa1
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1342d5abf00000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11ecafebf00000
The root cause of this failure is a fundamental bug / design flaw in
get_user_pages and related functions, which file system developers
have been complaining about for literally **years**. See the recent
discussion at [1] and going back earlier to 2018[2][3] and 2019[4].
[1] https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com/
[2] https://lwn.net/Articles/753027/
[3] https://lwn.net/Articles/774411/
[4] https://lwn.net/Articles/784574/
I'm going to reassign this to the mm subsystem, since there's not much
we can do on the file system end. The WARNING is considered a good
thing because users can see silent data corruption/loss if they use
process_vm_writev() or RDMA to write to memory backed by a file. And
while most users at large hyperscale scientific compute farms probably
won't be paying attention to the system logs, at least we've done
something to warn them.
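
For illustration only, here is a minimal sketch of the access pattern
described above: a process_vm_writev() call whose destination is a
MAP_SHARED mapping of a file. It assumes Linux and a libc that exports
process_vm_writev; the helper name write_via_process_vm_writev and the
IoVec class are hypothetical, and the temporary file may not live on
ext4. It does not try to trigger the WARNING or the underlying race,
which depends on writeback timing; it only shows the kind of write
that reaches file-backed pages through get_user_pages rather than
through the normal page-fault path.

```python
import ctypes
import mmap
import os
import tempfile


class IoVec(ctypes.Structure):
    """Mirror of struct iovec from <sys/uio.h>."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def write_via_process_vm_writev(payload: bytes):
    """Copy payload into a shared file mapping with process_vm_writev().

    Returns the bytes read back from the file, or None if the call is
    unavailable or refused (non-Linux libc, seccomp policy, and so on).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fn = getattr(libc, "process_vm_writev", None)
    if fn is None:
        return None
    fn.restype = ctypes.c_ssize_t
    fn.argtypes = [ctypes.c_int,
                   ctypes.POINTER(IoVec), ctypes.c_ulong,
                   ctypes.POINTER(IoVec), ctypes.c_ulong,
                   ctypes.c_ulong]

    with tempfile.NamedTemporaryFile() as f:
        f.truncate(mmap.PAGESIZE)
        # On Unix, mmap.mmap defaults to MAP_SHARED with read/write access.
        mm = mmap.mmap(f.fileno(), mmap.PAGESIZE)
        view = (ctypes.c_char * mmap.PAGESIZE).from_buffer(mm)
        try:
            src = ctypes.create_string_buffer(payload, len(payload))
            local = IoVec(ctypes.addressof(src), len(payload))
            # The "remote" side is this same process: the destination is
            # the file-backed mapping rather than anonymous memory.
            remote = IoVec(ctypes.addressof(view), len(payload))
            n = fn(os.getpid(), ctypes.byref(local), 1,
                   ctypes.byref(remote), 1, 0)
            if n != len(payload):
                return None
            mm.flush()
            return os.pread(f.fileno(), len(payload), 0)
        finally:
            del view  # release the buffer export before closing the map
            mm.close()
```

Calling write_via_process_vm_writev(b"hello") returns either the bytes
read back from the file or None when the syscall is not permitted.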
Fortunately, data corruption is rare (though when it happens it could
really screw with your results!), but if they are doing some
large-scale simulation to evaluate the safety of nuclear weapons (for
example), it would be nice if they got at least some hint.
There is a potential solution discussed at [1], but there is pushback
since it could break users by disallowing the thing that might cause
data corruption. While breaking user applications is bad, turning a
possible silent data corruption into a very visible, hard failure is
arguably a good thing....
- Ted