Date: Fri, 5 Jul 2019 15:18:06 +0200
From: Dmitry Vyukov <dvyukov@...gle.com>
To: "Theodore Ts'o" <tytso@....edu>, syzbot <syzbot+4bfbbf28a2e50ab07368@...kaller.appspotmail.com>,
	Andreas Dilger <adilger.kernel@...ger.ca>, David Miller <davem@...emloft.net>,
	eladr@...lanox.com, Ido Schimmel <idosch@...lanox.com>, Jiri Pirko <jiri@...lanox.com>,
	John Stultz <john.stultz@...aro.org>, linux-ext4@...r.kernel.org,
	LKML <linux-kernel@...r.kernel.org>, netdev <netdev@...r.kernel.org>,
	syzkaller-bugs <syzkaller-bugs@...glegroups.com>, Thomas Gleixner <tglx@...utronix.de>
Cc: syzkaller <syzkaller@...glegroups.com>
Subject: Re: INFO: rcu detected stall in ext4_write_checks

On Wed, Jun 26, 2019 at 8:43 PM Theodore Ts'o <tytso@....edu> wrote:
>
> On Wed, Jun 26, 2019 at 10:27:08AM -0700, syzbot wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit: abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1435aaf6a00000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=e5c77f8090a3b96b
> > dashboard link: https://syzkaller.appspot.com/bug?extid=4bfbbf28a2e50ab07368
> > compiler: gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11234c41a00000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15d7f026a00000
> >
> > The bug was bisected to:
> >
> > commit 0c81ea5db25986fb2a704105db454a790c59709c
> > Author: Elad Raz <eladr@...lanox.com>
> > Date: Fri Oct 28 19:35:58 2016 +0000
> >
> >     mlxsw: core: Add port type (Eth/IB) set API
>
> Um, so this doesn't pass the laugh test.
>
> > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=10393a89a00000
>
> It looks like the automated bisection machinery got confused by two
> failures getting triggered by the same repro; the symptoms changed
> over time. Initially, the failure was:
>
> crashed: INFO: rcu detected stall in {sys_sendfile64,ext4_file_write_iter}
>
> Later, the failure changed to something completely different, and much
> earlier (before the test was even started):
>
> run #5: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" "-P" "22" "-F" "/dev/null" "-o" "UserKnownHostsFile=/dev/null" "-o" "BatchMode=yes" "-o" "IdentitiesOnly=yes" "-o" "StrictHostKeyChecking=no" "-o" "ConnectTimeout=10" "-i" "/syzkaller/jobs/linux/workdir/image/key" "/tmp/syz-executor216456474" "root@...128.15.205:./syz-executor216456474"]: exit status 1
> Connection timed out during banner exchange
> lost connection
>
> Looks like an opportunity to improve the bisection engine?

Hi Ted,

Yes, these infrastructure errors plague bisections episodically. That's
https://github.com/google/syzkaller/issues/1250

They did not confuse the bisection explicitly, since it understands that
these are infrastructure failures rather than a kernel crash. For example,
here you can see that it correctly identified this run as OK and started
the bisection on the v4.9..v4.10 range despite the 2 scp failures:

testing release v4.9
testing commit 69973b830859bc6529a7a0468ba0d80ee5117826 with gcc (GCC) 5.5.0
run #0: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" ...]: exit status 1
Connection timed out during banner exchange
run #1: basic kernel testing failed: failed to copy test binary to VM: failed to run ["scp" ....]: exit status 1
Connection timed out during banner exchange
run #2: OK
run #3: OK
run #4: OK
run #5: OK
run #6: OK
run #7: OK
run #8: OK
run #9: OK
# git bisect start v4.10 v4.9

Though, of course, they may confuse the bisection indirectly by reducing
the number of tests per commit.

So far I haven't been able to gather any significant information about
these failures. We gather console logs, but on these runs they are empty.
It's easy to blame everything on GCE, but I don't have any information
that would point either way. These failures just appear randomly in
production, usually in batches...
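To make the distinction concrete, here is a rough sketch in Go of how a
per-commit verdict could be computed when some runs fail for infrastructure
reasons. This is not the actual syzkaller bisection code; all type and
function names below are made up for illustration. The idea is that
infrastructure failures are simply excluded from the verdict, so a commit is
judged only on runs where the kernel booted and the repro actually ran, and
the indirect cost is just fewer usable runs per commit.

// Illustrative sketch only -- not the real syzkaller bisection code.
package main

import "fmt"

type runResult int

const (
	runOK runResult = iota
	runCrashed
	runInfraFailure // e.g. scp to the VM timed out before the test started
)

// commitVerdict ignores infrastructure failures entirely: the commit is
// judged only on runs that booted and executed the repro. Fewer usable
// runs means a weaker verdict, which is how infra failures can still
// hurt the bisection indirectly.
func commitVerdict(runs []runResult) (good bool, usable int) {
	crashed := 0
	for _, r := range runs {
		switch r {
		case runInfraFailure:
			continue // don't count against the commit
		case runCrashed:
			crashed++
			usable++
		case runOK:
			usable++
		}
	}
	return crashed == 0, usable
}

func main() {
	// Mirrors the v4.9 log above: 2 scp failures followed by 8 OK runs.
	runs := []runResult{runInfraFailure, runInfraFailure,
		runOK, runOK, runOK, runOK, runOK, runOK, runOK, runOK}
	good, usable := commitVerdict(runs)
	fmt.Printf("good=%v based on %d usable runs\n", good, usable)
}

With the input above this prints "good=true based on 8 usable runs", which
matches how the real bisection treated the v4.9 test despite the scp errors.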