linux-ext4 - Re: How does newbie find bugs in ext4?


Message-ID: <Yx9fUHiiZaKXeLUw@mit.edu>
Date:   Mon, 12 Sep 2022 12:33:20 -0400
From:   "Theodore Ts'o" <tytso@....edu>
To:     JunChao Sun <sunjunchao2870@...il.com>
Cc:     linux-ext4@...r.kernel.org
Subject: Re: How does newbie find bugs in ext4?

Hi,

So first of all, I would recommend that you learn how to use
kvm-xfstests.  The reason for this is that kvm-xfstests is very useful
for testing any changes that you make.  The same test appliance can be
used for testing file systems on Android and on Google Compute
Engine VMs (which is one of the best ways to use it).  Please take a
look at these references:

      https://thunk.org/gce-xfstests
      https://github.com/tytso/xfstests-bld/blob/master/Documentation/what-is-xfstests.md
      https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
      https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

In addition to using this as a quick "playground" where you
can test patches, this can also be a good way to (for example) test
syzbot reports.

Another thing which you could potentially do is to manually backport
ext4 patches which didn't automatically get applied because the
patch required some adjustments (or required backporting some
additional commits, etc.) to fix a particular problem.  So for
example, you could try running xfstests using the latest 5.10.y or
5.15.y stable kernels, since as we fix bugs, we often add tests to
check for regressions.  For example, if you look at the header of the
test ext4/058, you'll find:

# Set 256 blocks in a block group, then inject I/O pressure,
# it will trigger off kernel BUG in ext4_mb_mark_diskspace_used
#
# Regression test for commit
# a08f789d2ab5 ext4: fix bug_on ext4_mb_use_inode_pa

So if you find out that a particular test fails on an LTS kernel
(e.g., 5.15.y or 5.10.y), but it passes on upstream, it could be that
a missing commit needs to be backported.  We don't currently have
anyone doing this on a regular basis for the LTS kernels (I might
do this once every few months, when I have time), so this could be a
good way for you to contribute and also learn more about ext4 as you
go.
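
As a rough sketch of that triage step (this is not part of xfstests-bld;
the helper names and the sample failure sets are made up for
illustration, and the regex assumes the header layout quoted above), a
few lines of Python can pull the fix commit out of a test header and
list the tests that fail on an LTS kernel but pass upstream:

```python
import re

# Matches the "Regression test for commit" marker followed, on the next
# comment line, by an abbreviated commit hash and its subject line.
COMMIT_RE = re.compile(
    r"^#\s*Regression test for commit\s*\n#\s*([0-9a-f]{7,40})\s+(.*)$",
    re.MULTILINE,
)


def fix_commits(header: str) -> list[tuple[str, str]]:
    """Return (hash, subject) pairs named in an xfstests test header."""
    return [(m.group(1), m.group(2).strip()) for m in COMMIT_RE.finditer(header)]


def backport_candidates(lts_failures: set[str], upstream_failures: set[str]) -> list[str]:
    """Tests that fail on the LTS kernel but not on upstream."""
    return sorted(lts_failures - upstream_failures)


# Header of ext4/058 as quoted above.
HEADER = """\
# Set 256 blocks in a block group, then inject I/O pressure,
# it will trigger off kernel BUG in ext4_mb_mark_diskspace_used
#
# Regression test for commit
# a08f789d2ab5 ext4: fix bug_on ext4_mb_use_inode_pa
"""

# Hypothetical failure lists, for illustration only; real ones would
# come from your own test runs on each kernel.
lts = {"ext4/058", "generic/455"}
upstream = {"generic/455"}

for test in backport_candidates(lts, upstream):
    print(test)
print(fix_commits(HEADER))
```

A test that shows up in the first list is only a candidate: the fix
named in its header may already be in the stable tree, so it is worth
checking the stable branch's git log before starting a backport.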

Finally, I'll note that although I do run xfstests regularly, and will
reject patches that cause regressions, there are still some tests
that fail.  For example, here is my latest test report:

TESTRUNID: ltm-20220912073217
KERNEL:    kernel 6.0.0-rc4-xfstests #760 SMP PREEMPT_DYNAMIC Mon Sep 12 07:23:13 EDT 2022 x86_64
CMDLINE:   full --kernel gs://gce-xfstests/kernel.deb
CPUS:      4
MEM:       7680

ext4/4k: 515 tests, 27 skipped, 4093 seconds
ext4/1k: 511 tests, 2 failures, 40 skipped, 5095 seconds
  Flaky: generic/475: 40% (2/5)   generic/476: 40% (2/5)
ext4/ext3: 507 tests, 115 skipped, 3514 seconds
ext4/encrypt: 493 tests, 3 failures, 129 skipped, 2583 seconds
  Failures: generic/681 generic/682 generic/691
ext4/nojournal: 510 tests, 4 failures, 94 skipped, 3610 seconds
  Failures: ext4/301 ext4/304 generic/455
  Flaky: generic/077: 40% (2/5)
ext4/ext3conv: 512 tests, 27 skipped, 3650 seconds
ext4/adv: 512 tests, 3 failures, 34 skipped, 3860 seconds
  Failures: generic/475 generic/477
  Flaky: generic/455: 80% (4/5)
ext4/dioread_nolock: 513 tests, 27 skipped, 4235 seconds
ext4/data_journal: 511 tests, 2 failures, 87 skipped, 3647 seconds
  Failures: generic/231 generic/455
ext4/bigalloc: 489 tests, 2 failures, 34 skipped, 3904 seconds
  Failures: generic/455 shared/298
ext4/bigalloc_1k: 488 tests, 2 failures, 51 skipped, 3826 seconds
  Failures: generic/455 shared/298
ext4/dax: 502 tests, 127 skipped, 2520 seconds
Totals: 6135 tests, 792 skipped, 80 failures, 0 errors, 44288s

(This was done using gce-xfstests, which is a cloud VM variant of
kvm-xfstests.  The equivalent run would take roughly 12 to 24 hours
using kvm-xfstests; with gce-xfstests the tests get run on multiple VMs
at the same time, so the wall clock time needed is perhaps two to two
and a half hours.)
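
If you want to track results like these over time, the per-config
summary lines are regular enough to parse.  The following is only a
sketch: the regex is inferred from the report above rather than from
any documented format, and the function name is hypothetical.  It also
converts the 44288s from the Totals line into hours, which comes out at
about 12.3, in line with the lower end of the estimate above.

```python
import re

# One summary line per config, e.g.
#   "ext4/1k: 511 tests, 2 failures, 40 skipped, 5095 seconds"
# The failures field is omitted when a config has none.
LINE_RE = re.compile(
    r"^(?P<config>\S+): (?P<tests>\d+) tests, "
    r"(?:(?P<failures>\d+) failures, )?"
    r"(?P<skipped>\d+) skipped, (?P<seconds>\d+) seconds$"
)


def parse_summary(report: str) -> dict[str, dict[str, int]]:
    """Map each config name to its test, failure, skip and runtime counts."""
    result = {}
    for line in report.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            # A missing failures field is treated as zero failures.
            fields = m.groupdict(default="0")
            config = fields.pop("config")
            result[config] = {k: int(v) for k, v in fields.items()}
    return result


# Excerpt of the report above; the indented Flaky/Failures detail lines
# do not match the regex and are skipped.
REPORT = """\
ext4/4k: 515 tests, 27 skipped, 4093 seconds
ext4/1k: 511 tests, 2 failures, 40 skipped, 5095 seconds
  Flaky: generic/475: 40% (2/5)   generic/476: 40% (2/5)
ext4/encrypt: 493 tests, 3 failures, 129 skipped, 2583 seconds
  Failures: generic/681 generic/682 generic/691
"""

summary = parse_summary(REPORT)
clean = [config for config, stats in summary.items() if stats["failures"] == 0]
print(clean)

TOTAL_SECONDS = 44288  # from the Totals line
print(round(TOTAL_SECONDS / 3600, 1))
```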

In general, I try very hard to make sure that ext4/4k (ext4 with the
default 4k block size) is free of failures when running the xfstests
"auto" group.  However, you'll see that there are other configs where
there are failures, some of which have been around for a while.
The challenge is that these are often bugs that more senior
ext4 developers have tried looking at for, say, an hour or two, and
then said, "I have higher priority fires to fight".  So these might
not be the best test failures to ask an ext4 newbie to debug.  That
being said, if you don't mind a bit (or a lot) of frustration, it
could be that you might be able to root cause some of these failed tests.

(But starting with testing the LTS kernels might be a better place to
start.)

Cheers,

					- Ted