Message-ID: <20210520122626.GW1002214@nvidia.com>
Date: Thu, 20 May 2021 09:26:26 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Halil Pasic <pasic@...ux.ibm.com>
Cc: Tony Krowiak <akrowiak@...ux.ibm.com>, linux-s390@...r.kernel.org,
linux-kernel@...r.kernel.org, borntraeger@...ibm.com,
cohuck@...hat.com, pasic@...ux.vnet.ibm.com, jjherne@...ux.ibm.com,
alex.williamson@...hat.com, kwankhede@...dia.com
Subject: Re: [PATCH v3 2/2] s390/vfio-ap: control access to PQAP(AQIC) interception handler

On Thu, May 20, 2021 at 10:48:57AM +0200, Halil Pasic wrote:
> On Wed, 19 May 2021 21:08:15 -0400
> Tony Krowiak <akrowiak@...ux.ibm.com> wrote:
>
> > >
> > > This is nonsense too:
> > >
> > > 	if (vcpu->kvm->arch.crypto.pqap_hook) {
> > > 		if (!try_module_get(vcpu->kvm->arch.crypto.pqap_hook->owner))
> > > 			return -EOPNOTSUPP;
> > > 		ret = vcpu->kvm->arch.crypto.pqap_hook->hook(vcpu);
> > >
> > > It should have a lock around it of some kind, not a
> > > try_module_get. module_get is not a lock.
> >
> > As I said earlier, I don't know why the author did this.
>
> Please have a look at these links from the archive to get some
> perspective:
> https://lkml.org/lkml/2020/12/4/994
> https://lkml.org/lkml/2020/12/3/987
> https://www.lkml.org/lkml/2019/3/1/260
>
> We can ask the original author, but I don't think we have to. BTW the
> patch that introduced it has your r-b.
>
> > My best guess
> > is that he wanted to ensure that the module was still loaded; otherwise,
> > the data structures contained therein - for example, the pqap_hook
> > and matrix_mdev that contains it - would be gonzo.
>
> More precisely prevent the module from unloading while we execute code
> from it. As I've pointed out in a previous email the module may be gone
> by the time we call try_module_get().

No, this is a common misconception.

The module_get prevents a module unload from even being attempted.
Code should acquire this if it has done something that would
cause a module remove function to hang indefinitely, such as a design
that waits for a user FD to close.

This provides a good user experience but should generally not be
required for correctness.

All code passing function pointers across subsystems should always
fully fence those function pointers during removal. This means it
interacts with some kind of locking that guarantees nothing is
currently calling, or ever will call again, those function pointers.

This is not just to protect the function pointer code itself; the
lock should also protect the data that the function pointer almost
always accesses. This is the bug here: ap is accessing the matrix_dev
data from a function pointer without any locking or serialization
against kfree(matrix_dev). Fencing to guarantee the hook isn't and
won't run also serves as a strong enough serialization to allow the
kfree().

The basic logic is that a module removal cannot complete until all
its function pointers have been removed from everywhere and all the
locking that protects those removals is satisfied.

Jason