[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZWpaTD9dVge+suyv@Asurada-Nvidia>
Date: Fri, 1 Dec 2023 14:12:28 -0800
From: Nicolin Chen <nicolinc@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: "Tian, Kevin" <kevin.tian@...el.com>,
"Liu, Yi L" <yi.l.liu@...el.com>,
"joro@...tes.org" <joro@...tes.org>,
"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"robin.murphy@....com" <robin.murphy@....com>,
"baolu.lu@...ux.intel.com" <baolu.lu@...ux.intel.com>,
"cohuck@...hat.com" <cohuck@...hat.com>,
"eric.auger@...hat.com" <eric.auger@...hat.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"mjrosato@...ux.ibm.com" <mjrosato@...ux.ibm.com>,
"chao.p.peng@...ux.intel.com" <chao.p.peng@...ux.intel.com>,
"yi.y.sun@...ux.intel.com" <yi.y.sun@...ux.intel.com>,
"peterx@...hat.com" <peterx@...hat.com>,
"jasowang@...hat.com" <jasowang@...hat.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>,
"lulu@...hat.com" <lulu@...hat.com>,
"suravee.suthikulpanit@....com" <suravee.suthikulpanit@....com>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
"Duan, Zhenzhong" <zhenzhong.duan@...el.com>,
"joao.m.martins@...cle.com" <joao.m.martins@...cle.com>,
"Zeng, Xin" <xin.zeng@...el.com>,
"Zhao, Yan Y" <yan.y.zhao@...el.com>
Subject: Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
On Fri, Dec 01, 2023 at 04:43:40PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2023 at 11:58:07AM -0800, Nicolin Chen wrote:
> > > It seems there is not a simple way to realize this error back to
> > > userspace since we can't block the global command queue and we proceed
> > > to complete commands that the real HW would not have completed.
> > >
> > > To actually emulate this the gerror handler would have to capture all
> > > the necessary registers, return them back to the thread doing
> > > invalidate_user and all of that would have to return back to userspace
> > > to go into the virtual version of all the same registers.
> > >
> > > Yes, it can be synchronous it seems, but we don't have any
> > > infrastructure in the driver to do this.
> > >
> > > Given this is pretty niche maybe we just don't support error
> > > forwarding and simply ensure it could be added to the uapi later?
> >
> > If arm_smmu_cmdq_issue_cmdlist in arm_smmu_cache_invalidate_user
> > fails with ETIMEOUT, we polls the CONS register to get the error
> > code. This can cover CERROR_ABT and CERROR_ATC_INV.
>
> Why is timeout linked to these two? Or rather, it doesn't have to be
> linked like that. Any gerror is effectively synchronous because it
> halts the queue and allows SW time to inspect which command failed and
> record the gerror flags. So each and every command can get an error
> indication.
>
> Restarting the queue is done by putting sync in there to effectively
> nop the failed command and we hope for the best and let it rip.
I see that SMMU driver only restarts the queue when dealing with
CERROR_ILL. So only CERROR_ABT or CERROR_ATC_INV would result in
-ETIMEOUT.
> > As you remarked that we can't block the global CMDQ, so we have
> > to let a real CERROR_ILL go. Yet, we can make sure commands to
> > be fully sanitized before being issued, as we should immediately
> > reject faulty commands anyway, for errors such as unsupported op
> > codes, unzero-ed reserved fields, and unlinked vSIDs. This can
> > at least largely reduce the probability of a real CERROR_ILL.
>
> I'm more a little more concerend with ATC_INV as a malfunctioning
> device can trigger this..
How about making sure that the invalidate handler always issues
one CMD_ATC_INV at a time, so each arm_smmu_cmdq_issue_cmdlist()
call has a chance to timeout? Then, we can simply know which one
in the user array fails.
> > So, combining these two, we can still have a basic synchronous
> > way by returning an errno to the invalidate ioctl? I see Kevin
> > replied something similar too.
>
> It isn't enough information, you don't know which gerror bits to set
> and you don't know what cons index to stick to indicate the error
> triggering command with just a simple errno.
>
> It does need to return a bunch of data to get it all right.
The array structure returns req_num to indicate the index. This
works, even if the command consumption stops in the middle:
* @req_num: Input the number of cache invalidation requests in the array.
* Output the number of requests successfully handled by kernel.
So we only need an error code of CERROR_ABT/ILL/ATC_INV.
Or am I missing some point here?
> Though again, there is no driver infrastructure to do all this and it
> doesn't seem that important so maybe we can just ensure there is a
> future extension possiblity and have userspace understand an errno
> means to generate CERROR_ILL on the last command in the batch?
Yea, it certainly can be a plan.
Thanks
Nicolin
Powered by blists - more mailing lists