lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240213155954.GA1210633@bhelgaas>
Date: Tue, 13 Feb 2024 09:59:54 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Leon Romanovsky <leonro@...dia.com>
Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>,
	Jim Harris <jim.harris@...sung.com>,
	Bjorn Helgaas <bhelgaas@...gle.com>, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org, Jason Gunthorpe <jgg@...dia.com>,
	Alex Williamson <alex.williamson@...hat.com>,
	Pierre Crégut <pierre.cregut@...nge.com>
Subject: Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs
 sriov_numvfs reads vs writes"

On Tue, Feb 13, 2024 at 09:34:50AM +0200, Leon Romanovsky wrote:
> On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote:
> > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote:
> > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote:
> > > > On 2/9/24 3:52 PM, Jim Harris wrote:
> > > > > If an SR-IOV enabled device is held by vfio, and the device
> > > > > is removed, vfio will hold device lock and notify userspace
> > > > > of the removal. If userspace reads the sriov_numvfs sysfs
> > > > > entry, that thread will be blocked since sriov_numvfs_show()
> > > > > also tries to acquire the device lock. If that same thread
> > > > > is responsible for releasing the device to vfio, it results
> > > > > in a deadlock.
> > > > >
> > > > > The proper way to detect a change to the num_VFs value is to
> > > > > listen for a sysfs event, not to add a device_lock() on the
> > > > > attribute _show() in the kernel.
> > 
> > The lock was not about detecting a change; Pierre did this:
> > 
> >   ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \
> >     cat ${path}/device/sriov_numvfs; \
> > 
> > which I assume works by listening for sysfs events.  
> 
> It is not, "ip monitor ..." listens to netlink events emitted by
> netdev core and not sysfs events. Sysfs events are not involved in
> this case.

Thanks for correcting my hasty assumption!

> > The problem was that after the event occurred, the sriov_numvfs
> > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991).
> 
> Yes, and it is outcome of such cross-subsytem involvement, which
> is racy by definition. Someone can come with even simpler example of why
> locking sysfs read and write is not a good idea. 
> 
> For example, let's consider the following scenario with two CPUs and
> locks on sysfs read and write:
> 
>  CPU1					CPU2
>  echo 1 > ${path}/device/sriov_numvfs
> 		 context_switch ->
>  					cat ${path}/device/sriov_numvfs
> 					lock
> 					return 0
> 					unlock
> 		context_switch <-
>  lock
>  set 1
>  unlock
> 
>  CPU1					CPU2
>  echo 1 > ${path}/device/sriov_numvfs
>  lock
>  set 1
>  unlock
> 		 context_switch ->
>  					cat ${path}/device/sriov_numvfs
> 					lock
> 					return 1
> 					unlock
> 
> So same scenario will return different values if user doesn't protect
> such case with external to the kernel lock.
> 
> But if we return back to Pierre report and if you want to provide
> completely bullet proof solution to solve cross-subsystem interaction,
> you will need to prohibit device probe till sriov_numvfs update is completed.
> However, it is overkill for something that is not a real issue.

Pierre wanted to detect the configuration change and learn the new
num_vfs, which seems like a reasonable thing to do.  Is there a way to
do both via netlink or some other mechanism?

> > So I would drop this sentence because I don't think it accurately
> > reflects the reason for 35ff867b7657.
> > 
> > > > Since you are reverting a commit that synchronizes SysFS read
> > > > /write, please add some comments about why it is not an
> > > > issue anymore.
> > > 
> > > It was never an issue, the idea that sysfs read and write should be
> > > serialized by kernel is not correct by definition. 
> > 
> > I think it *was* an issue.  The behavior Pierre observed at was
> > clearly wrong, 
> 
> I disagree with this sentence. 
> 
> > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs
> > sriov_numvfs reads vs writes") to resolve it.
> > 
> > We should try to avoid reintroducing the problem, so I think we should
> > probably squash these two patches and describe it as a deadlock fix
> > instead of dismissing 35ff867b7657 as being based on false premises.
> > 
> > It would be awesome if you had time to verify that these patches also
> > resolve the problem you saw, Pierre.
> 
> They won't resolve his problem, because he is not listening to sysfs
> events, but rely on something from netdev side.

I guess that means that if we apply this revert, the problem Pierre
reported will return.  Obviously the deadlock is more important than
the inconsistency Pierre observed, but from the user's point of view
this will look like a regression.

Maybe listening to netlink and then looking at sysfs isn't the
"correct" way to do this, but I don't want to just casually break
existing user code.  If we do contemplate doing the revert, at the
very least we should include specific details about what the user code
*should* do instead, at the level of the actual commands to use
instead of "ip monitor dev; cat ${path}/device/sriov_numvfs".

Bjorn

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ