[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5b1605cceba6e7de753bce8a78e1ce5c2d545546.camel@linux.ibm.com>
Date: Mon, 30 Jan 2023 08:17:49 -0500
From: James Bottomley <jejb@...ux.ibm.com>
To: Yu Kuai <yukuai1@...weicloud.com>,
Zhong Jinghua <zhongjinghua@...wei.com>,
gregkh@...uxfoundation.org, martin.petersen@...cle.com,
hare@...e.de, bvanassche@....org, emilne@...hat.com
Cc: linux-kernel@...r.kernel.org, linux-scsi@...r.kernel.org,
yi.zhang@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH-next v2 2/2] scsi: fix iscsi rescan fails to create
block device
On Mon, 2023-01-30 at 11:46 +0800, Yu Kuai wrote:
> Hi,
>
> 在 2023/01/30 11:29, James Bottomley 写道:
> > On Mon, 2023-01-30 at 11:07 +0800, Yu Kuai wrote:
> > > Hi,
> > >
> > > 在 2023/01/30 1:30, James Bottomley 写道:
> > > > On Sat, 2023-01-28 at 17:41 +0800, Zhong Jinghua wrote:
> > > > > This error will cause a warning:
> > > > > kobject_add_internal failed for block (error: -2 parent:
> > > > > 1:0:0:1). In the lower version (such as 5.10), there is no
> > > > > corresponding error handling, continuing to go down will
> > > > > trigger a kernel panic, so cc stable.
> > > >
> > > > Is this is important point and what you're saying is that this
> > > > only panics on kernels before 5.10 or so because after that
> > > > it's correctly failed by block device error handling so there's
> > > > nothing to fix in later kernels?
> > > >
> > > > In that case, isn't the correct fix to look at backporting the
> > > > block device error handling:
> > >
> > > This is the last commit that support error handling, and there
> > > are many relied patches, and there are lots of refactor in block
> > > layer. It's not a good idea to backport error handling to lower
> > > version.
> > > Althrough error handling can prevent kernel crash in this case, I
> > > still think it make sense to make sure kobject is deleted in
> > > order, parent should not be deleted before child.
> >
> > Well, look, you've created a very artificial situation where a
> > create closely followed by a delete of the underlying sdev races
> > with the create of the block gendisk devices of sd that bind
> > asynchronously to the created sdev. The asynchronous nature of the
> > bind gives the elongated race window so the only real fix is some
> > sort of check that the sdev is still viable by the time the bind
> > occurs ... probably in sd_probe(), say a scsi_device_get of sdp at
> > the top which would ensure viability of the sdev for the entire
> > bind or fail the probe if the sdev can't be got.
>
> Sorry, I don't follow here. 😟
In the current kernel the race is mitigated because add_device fails
due to the parent being torn down. That parent is the sdev->gendev so
it seems we can detect this in the probe by looking at the sdev->gendev
state, which scsi_device_get() will do.
> I agree this is a very artificial situation, however I can't tell our
> tester not to test this way...
>
> The problem is that kobject session is deleted and then sd_probe()
> tries to create a new kobject under hostx/sessionx/x:x:x:x/. I don't
> see how scsi_device_get() can prevent that, it only get a kobject
> reference and can prevent kobject to be released, however,
> kobject_del() can still be done.
So your contention is there's no way that we could make scsi_device_get
see the kernfs deactivation? I would have thought checking sdev-
>sdev_gendev.kobj.sd.active would give that ... although the check
would have to be via an API since KN_DEACTIVATED_BIAS is internal.
James
> In this patch, we make sure remove session and sd_probe() won't
> concurrent, remove session will wait for all child kobject to be
> deleted, what do you think?
>
> Thanks,
> Kuai
>
Powered by blists - more mailing lists