Date:   Tue, 15 Sep 2020 18:45:41 -0400
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Jens Axboe <axboe@...nel.dk>, linux-ext4@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        linux-block@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling
 into blk_mq_get_driver_tag

On Tue, Sep 15, 2020 at 03:33:03PM +0800, Ming Lei wrote:
> Hi Theodore,
> 
> On Tue, Sep 15, 2020 at 12:45:19AM -0400, Theodore Y. Ts'o wrote:
> > On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> > > Worse, right now, -rc1 and -rc2 are causing random crashes in my
> > > gce-xfstests framework.  Sometimes it happens before we've run even
> > > a single xfstests test; sometimes it happens after we have
> > > successfully completed all of the tests and are doing a shutdown of
> > > the VM under test.  Other times it happens in the middle of a test
> > > run.  Given that I'm seeing this at -rc1, which is before my late
> > > ext4 pull request to Linus, it's probably not an ext4-related bug.
> > > But it also means that I'm partially blind in terms of my kernel
> > > testing at the moment, so I can't even tell Linus that I've run
> > > lots of tests and I'm 100% confident your one-line change is 100%
> > > safe.
> > 
> > I was finally able to bisect it down to the commit:
> > 
> > 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
> 
> 37f4a24c2469 has been reverted in:
> 
> 	4e2f62e566b5 Revert "blk-mq: put driver tag when this request is completed"
> 
> And the patch was later committed again, after being fixed, as:
> 
> 	568f27006577 blk-mq: centralise related handling into blk_mq_get_driver_tag
> 
> So can you reproduce the issue by running a kernel at commit 568f27006577?

Yes.  And things work fine if I try 4e2f62e566b5.

> If yes, can the issue be fixed by reverting 568f27006577?

The problem is that it's a bit tricky to revert 568f27006577, since
there is a merge conflict in blk_kick_flush().  I attempted to do the
revert manually here, but it's clearly not right, since the kernel
doesn't boot after the revert:

https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870

branch:

https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577

Can you send me a patch which correctly reverts 568f27006577?  I can
try it against -rc1 .. -rc4, whichever is most convenient.
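
(For the record, the naive attempt is roughly the following --- the
base tag here is just an example, and the conflict shows up in
block/blk-flush.c, which is where blk_kick_flush() lives; resolving
that conflict correctly is the part I clearly got wrong:)

	% git checkout v5.9-rc4          # or whichever -rc is convenient
	% git revert 568f27006577
	# ... fix up the conflict in block/blk-flush.c by hand, then:
	% git add block/blk-flush.c
	% git revert --continue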

> Can you share the exact mount command line used to set up the
> environment, and the exact xfstests test?

It's a variety of mount command lines, since gce-xfstests[1][2] runs a
variety of file system scenarios --- but the basic one, ext4 with the
default 4k block size, is failing (they all are failing).

[1] https://thunk.org/gce-xfstests
[2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
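
(Concretely, the basic ext4/4k case is nothing exotic --- it amounts
to roughly the following, where /dev/sdX stands in for whatever
scratch device the test appliance happens to attach:)

	% mkfs.ext4 /dev/sdX        # 4k is the mkfs.ext4 default block size
	% mount -t ext4 /dev/sdX /mnt/test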

It's also not one consistent xfstests test which is failing, but it
does tend to be tests that load up the storage stack with a lot of
small random reads/writes, especially ones involving metadata blocks
and writes.  (For example, tests which run fsstress; a sample
invocation is below.)
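
(A typical fsstress run from those tests looks something like the
following --- the directory, process count, op count, and seed are
illustrative rather than the exact parameters any one test uses:)

	# -d: scratch dir, -p: processes, -n: ops per process, -s: seed
	% fsstress -d /mnt/test/fsstress -p 8 -n 1000 -s 42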

Since this reliably triggers for me, and other people running
kvm-xfstests or running xfstests in their own test environments
aren't seeing it, I'm assuming it must be some kind of interesting
interaction between virtio-scsi and how Google Persistent Disk is
behaving (maybe timing related?  who knows?).  Darrick Wong did say
he saw something like it once using Oracle's Cloud infrastructure,
but as far as I know it hasn't reproduced since.  On Google Compute
Engine VMs, it reproduces *extremely* reliably.

I expect that if you were to set up gce-xfstests and get a free GCE
account with the initial $300 of free credits, you could run
"gce-xfstests -c ext4/4k -g auto" and it would reproduce within an
hour or so.  (So under a dollar's worth of VM credits, so long as you
notice that it's hung and shut down the VM after gathering debugging
data.)
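
(In other words, once gce-xfstests is configured per [2], the whole
reproduction boils down to something like the following; the gcloud
commands at the end are just the generic way to find and shut down a
hung test VM, with the instance name and zone being whatever the
framework actually created:)

	% gce-xfstests -c ext4/4k -g auto
	# If the test VM appears hung, find it and delete it by hand:
	% gcloud compute instances list
	% gcloud compute instances delete <instance-name> --zone <zone>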

The instructions are at [2], and the image xfstests-202008311554 in
the xfstests-cloud project is a public copy of the VM test appliance I
was using.

% gcloud compute images describe --project xfstests-cloud xfstests-202008311554
archiveSizeBytes: '1720022528'
creationTimestamp: '2020-09-15T15:09:30.544-07:00'
description: Linux Kernel File System Test Appliance
diskSizeGb: '10'
family: xfstests
guestOsFeatures:
- type: VIRTIO_SCSI_MULTIQUEUE
- type: UEFI_COMPATIBLE
id: '1558420969906537845'
kind: compute#image
labelFingerprint: V-2Qgcxt2uw=
labels:
  blktests: g8a75bed
  e2fsprogs: v1_45_6
  fio: fio-3_22
  fsverity: v1_2
  ima-evm-utils: v1_3_1
  nvme-cli: v1_12
  quota: g13bb8c2
  util-linux: v2_36
  xfsprogs: v5_8_0-rc1
  xfstests: linux-v3_8-2838-geb439bf2
  xfstests-bld: gb5085ab
licenseCodes:
- '5543610867827062957'
licenses:
- https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-10-buster
name: xfstests-202008311554
selfLink: https://www.googleapis.com/compute/v1/projects/xfstests-cloud/global/images/xfstests-202008311554
sourceDisk: https://www.googleapis.com/compute/v1/projects/xfstests-cloud/zones/us-east1-d/disks/temp-xfstests-202008311554
sourceDiskId: '5824762850044577124'
sourceType: RAW
status: READY
storageLocations:
- us

Cheers,

					- Ted

P.S.  As you can see, I can also easily run blktests using this VM
Test Appliance.  If you think it would be helpful for me to try
running one or more of the blktests test groups, I can do that.

I've been focusing on xfstests mainly because it is my primary way of
running regression tests, and I'm blocked on ext4 testing until I can
figure out this particular problem.  That is why a proper revert of
568f27006577 would be extremely helpful, if only so that I can apply
it on top of the ext4 git tree temporarily and run my file system
regression tests.
