[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b46210ce00b46ce42b8487e5670cc56b4458031f.camel@linux.ibm.com>
Date: Wed, 23 Aug 2023 16:21:14 +0200
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Robin Murphy <robin.murphy@....com>, Joerg Roedel <joro@...tes.org>,
Matthew Rosato <mjrosato@...ux.ibm.com>, Will Deacon <will@...nel.org>,
Wenjia Zhang <wenjia@...ux.ibm.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: Gerd Bayer <gbayer@...ux.ibm.com>, Julian Ruess <julianr@...ux.ibm.com>,
Pierre Morel <pmorel@...ux.ibm.com>,
Alexandra Winter
<wintera@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik
<gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Christian
Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle
<svens@...ux.ibm.com>,
Suravee Suthikulpanit
<suravee.suthikulpanit@....com>,
Hector Martin <marcan@...can.st>, Sven
Peter <sven@...npeter.dev>,
Alyssa Rosenzweig <alyssa@...enzweig.io>,
David
Woodhouse <dwmw2@...radead.org>,
Lu Baolu <baolu.lu@...ux.intel.com>, Andy
Gross <agross@...nel.org>,
Bjorn Andersson <andersson@...nel.org>,
Konrad
Dybcio <konrad.dybcio@...aro.org>,
Yong Wu <yong.wu@...iatek.com>,
Matthias
Brugger <matthias.bgg@...il.com>,
AngeloGioacchino Del Regno
<angelogioacchino.delregno@...labora.com>,
Gerald Schaefer
<gerald.schaefer@...ux.ibm.com>,
Orson Zhai <orsonzhai@...il.com>,
Baolin
Wang <baolin.wang@...ux.alibaba.com>,
Chunyan Zhang <zhang.lyra@...il.com>, Chen-Yu Tsai <wens@...e.org>,
Jernej Skrabec <jernej.skrabec@...il.com>,
Samuel Holland <samuel@...lland.org>,
Thierry Reding
<thierry.reding@...il.com>,
Krishna Reddy <vdumpa@...dia.com>,
Jonathan
Hunter <jonathanh@...dia.com>,
Jonathan Corbet <corbet@....net>, linux-s390@...r.kernel.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
iommu@...ts.linux.dev, asahi@...ts.linux.dev,
linux-arm-kernel@...ts.infradead.org, linux-arm-msm@...r.kernel.org,
linux-mediatek@...ts.infradead.org, linux-sunxi@...ts.linux.dev,
linux-tegra@...r.kernel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCH v11 5/6] iommu/dma: Allow a single FQ in addition to
per-CPU FQs
On Fri, 2023-08-18 at 19:16 +0100, Robin Murphy wrote:
> On 2023-07-17 12:00, Niklas Schnelle wrote:
> > In some virtualized environments, including s390 paged memory guests,
> > IOTLB flushes are used to update IOMMU shadow tables. Due to this, they
> > are much more expensive than in typical bare metal environments or
> > non-paged s390 guests. In addition they may parallelize poorly in
> > virtualized environments. This changes the trade off for flushing IOVAs
> > such that minimizing the number of IOTLB flushes trumps any benefit of
> > cheaper queuing operations or increased paralellism.
> >
> > In this scenario per-CPU flush queues pose several problems. Firstly
> > per-CPU memory is often quite limited prohibiting larger queues.
> > Secondly collecting IOVAs per-CPU but flushing via a global timeout
> > reduces the number of IOVAs flushed for each timeout especially on s390
> > where PCI interrupts may not be bound to a specific CPU.
> >
> > Let's introduce a single flush queue mode that reuses the same queue
> > logic but only allocates a single global queue. This mode is selected by
> > dma-iommu if a newly introduced .shadow_on_flush flag is set in struct
> > dev_iommu. As a first user the s390 IOMMU driver sets this flag during
> > probe_device. With the unchanged small FQ size and timeouts this setting
> > is worse than per-CPU queues but a follow up patch will make the FQ size
> > and timeout variable. Together this allows the common IOVA flushing code
> > to more closely resemble the global flush behavior used on s390's
> > previous internal DMA API implementation.
> >
> > Link: https://lore.kernel.org/linux-iommu/3e402947-61f9-b7e8-1414-fde006257b6f@arm.com/
> > Reviewed-by: Matthew Rosato <mjrosato@...ux.ibm.com> #s390
> > Signed-off-by: Niklas Schnelle <schnelle@...ux.ibm.com>
> > ---
> > drivers/iommu/dma-iommu.c | 178 +++++++++++++++++++++++++++++++++------------
> > drivers/iommu/iommu.c | 20 +----
>
> The hunks in iommu.c appear to be an inadvertent reversion of patch #1?
>
> A few bonus nits below which you can take or leave, but otherwise, with
> the rebase-mishap fixed:
>
> Acked-by: Robin Murphy <robin.murphy@....com>
>
> > drivers/iommu/s390-iommu.c | 3 +
> > include/linux/iommu.h | 2 +
> > 4 files changed, 142 insertions(+), 61 deletions(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index e57724163835..4ada8b9749c9 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -43,14 +43,23 @@ enum iommu_dma_cookie_type {
> > IOMMU_DMA_MSI_COOKIE,
> > };
> >
> > +struct dma_iommu_options {
> > +#define IOMMU_DMA_OPTS_PER_CPU_QUEUE 0L
>
> Nit: if the intent is to add more flags then that will no longer make
> sense, and if not then we may as well just have a bool ;)
>
> > +#define IOMMU_DMA_OPTS_SINGLE_QUEUE BIT(0)
My thinking was that the above two options are mutually exclusive with
per-CPU encoded as BIT(0) unset and single queue as set. Then other
options could still use the other bits. It's true though that the below
use of IOMMU_DMA_OPTS_PER_CPU_QUEUE is a nop so maybe just drop that?
Or we could use an enum even if I don't forsee more than these 2 queue
types.
> > + u64 flags;
> > +};
> > +
> >
---8<---
> >
> > +static void fq_ring_free(struct iommu_dma_cookie *cookie, struct iova_fq *fq)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&fq->lock, flags);
> > + fq_ring_free_locked(cookie, fq);
> > + spin_unlock_irqrestore(&fq->lock, flags);
> > +}
> > +
> > static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
> > {
> > atomic64_inc(&cookie->fq_flush_start_cnt);
> > @@ -152,23 +172,29 @@ static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
> > atomic64_inc(&cookie->fq_flush_finish_cnt);
> > }
> >
> > +static void fq_flush_percpu(struct iommu_dma_cookie *cookie)
> > +{
> > + int cpu;
> > +
> > + for_each_possible_cpu(cpu) {
> > + struct iova_fq *fq;
> > +
> > + fq = per_cpu_ptr(cookie->percpu_fq, cpu);
> > + fq_ring_free(cookie, fq);
> > + }
> > +}
> > +
> > static void fq_flush_timeout(struct timer_list *t)
> > {
> > struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> > - int cpu;
> >
> > atomic_set(&cookie->fq_timer_on, 0);
> > fq_flush_iotlb(cookie);
> >
> > - for_each_possible_cpu(cpu) {
> > - unsigned long flags;
> > - struct iova_fq *fq;
> > -
> > - fq = per_cpu_ptr(cookie->fq, cpu);
> > - spin_lock_irqsave(&fq->lock, flags);
> > - fq_ring_free(cookie, fq);
> > - spin_unlock_irqrestore(&fq->lock, flags);
> > - }
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + fq_ring_free(cookie, cookie->single_fq);
> > + else
> > + fq_flush_percpu(cookie);
>
> Nit: honestly I'd just inline that as:
>
> else for_each_possible_cpu(cpu)
> fq_ring_free(cookie, per_cpu_ptr(cookie->percpu_fq, cpu));
>
> (possibly with extra braces if you don't feel brave enough to join the
> elite "else for" club...)
I might but it looks like checkpatch.pl isn't a fan:
...
ERROR: trailing statements should be on next line
#119: FILE: drivers/iommu/dma-iommu.c:185:
+ else for_each_possible_cpu(cpu)
So braces it is.
>
> > }
> >
> > static void queue_iova(struct iommu_dma_cookie *cookie,
> > @@ -188,7 +214,11 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
> > */
> > smp_mb();
> >
> > - fq = raw_cpu_ptr(cookie->fq);
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + fq = cookie->single_fq;
> > + else
> > + fq = raw_cpu_ptr(cookie->percpu_fq);
> > +
> > spin_lock_irqsave(&fq->lock, flags);
> >
> > /*
> > @@ -196,11 +226,11 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
> > * flushed out on another CPU. This makes the fq_full() check below less
> > * likely to be true.
> > */
> > - fq_ring_free(cookie, fq);
> > + fq_ring_free_locked(cookie, fq);
> >
> > if (fq_full(fq)) {
> > fq_flush_iotlb(cookie);
> > - fq_ring_free(cookie, fq);
> > + fq_ring_free_locked(cookie, fq);
> > }
> >
> > idx = fq_ring_add(fq);
> > @@ -219,31 +249,90 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
> > jiffies + msecs_to_jiffies(IOVA_FQ_TIMEOUT));
> > }
> >
> > -static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
> > +static void iommu_dma_free_fq_single(struct iova_fq *fq)
> > +{
> > + int idx;
> > +
> > + if (!fq)
> > + return;
>
> Nit: That should never be true if cookie->fq_domain was set
True and the _percpu variant doesn't check it either so will drop.
>
> > + fq_ring_for_each(idx, fq)
> > + put_pages_list(&fq->entries[idx].freelist);
> > + vfree(fq);
> > +}
> > +
> > +static void iommu_dma_free_fq_percpu(struct iova_fq __percpu *percpu_fq)
> > {
> > int cpu, idx;
> >
> > - if (!cookie->fq)
> > - return;
> > -
> > - del_timer_sync(&cookie->fq_timer);
> > /* The IOVAs will be torn down separately, so just free our queued pages */
> > for_each_possible_cpu(cpu) {
> > - struct iova_fq *fq = per_cpu_ptr(cookie->fq, cpu);
> > + struct iova_fq *fq = per_cpu_ptr(percpu_fq, cpu);
> >
> > fq_ring_for_each(idx, fq)
> > put_pages_list(&fq->entries[idx].freelist);
> > }
> >
> > - free_percpu(cookie->fq);
> > + free_percpu(percpu_fq);
> > +}
> > +
> > +static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
> > +{
> > + if (!cookie->fq_domain)
> > + return;
> > +
> > + del_timer_sync(&cookie->fq_timer);
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + iommu_dma_free_fq_single(cookie->single_fq);
> > + else
> > + iommu_dma_free_fq_percpu(cookie->percpu_fq);
> > +}
> > +
> > +static void iommu_dma_init_one_fq(struct iova_fq *fq)
> > +{
> > + int i;
> > +
> > + fq->head = 0;
> > + fq->tail = 0;
> > +
> > + spin_lock_init(&fq->lock);
> > +
> > + for (i = 0; i < IOVA_FQ_SIZE; i++)
> > + INIT_LIST_HEAD(&fq->entries[i].freelist);
> > +}
> > +
> > +static int iommu_dma_init_fq_single(struct iommu_dma_cookie *cookie)
> > +{
> > + struct iova_fq *queue;
> > +
> > + queue = vzalloc(sizeof(*queue));
>
> Nit: vmalloc() - no need to zero the whole thing when the percpu path
> doesn't.
Done
>
> > + if (!queue)
> > + return -ENOMEM;
> > + iommu_dma_init_one_fq(queue);
> > + cookie->single_fq = queue;
> > +
> > + return 0;
> > +}
> > +
> > +static int iommu_dma_init_fq_percpu(struct iommu_dma_cookie *cookie)
> > +{
> > + struct iova_fq __percpu *queue;
> > + int cpu;
> > +
> > + queue = alloc_percpu(struct iova_fq);
> > + if (!queue)
> > + return -ENOMEM;
> > +
> > + for_each_possible_cpu(cpu)
> > + iommu_dma_init_one_fq(per_cpu_ptr(queue, cpu));
> > + cookie->percpu_fq = queue;
> > + return 0;
> > }
> >
> > /* sysfs updates are serialised by the mutex of the group owning @domain */
> > int iommu_dma_init_fq(struct iommu_domain *domain)
> > {
> > struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > - struct iova_fq __percpu *queue;
> > - int i, cpu;
> > + int rc;
> >
> > if (cookie->fq_domain)
> > return 0;
> > @@ -251,26 +340,16 @@ int iommu_dma_init_fq(struct iommu_domain *domain)
> > atomic64_set(&cookie->fq_flush_start_cnt, 0);
> > atomic64_set(&cookie->fq_flush_finish_cnt, 0);
> >
> > - queue = alloc_percpu(struct iova_fq);
> > - if (!queue) {
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + rc = iommu_dma_init_fq_single(cookie);
> > + else
> > + rc = iommu_dma_init_fq_percpu(cookie);
> > +
> > + if (rc) {
> > pr_warn("iova flush queue initialization failed\n");
> > return -ENOMEM;
> > }
> >
> > - for_each_possible_cpu(cpu) {
> > - struct iova_fq *fq = per_cpu_ptr(queue, cpu);
> > -
> > - fq->head = 0;
> > - fq->tail = 0;
> > -
> > - spin_lock_init(&fq->lock);
> > -
> > - for (i = 0; i < IOVA_FQ_SIZE; i++)
> > - INIT_LIST_HEAD(&fq->entries[i].freelist);
> > - }
> > -
> > - cookie->fq = queue;
> > -
> > timer_setup(&cookie->fq_timer, fq_flush_timeout, 0);
> > atomic_set(&cookie->fq_timer_on, 0);
> > /*
> > @@ -297,6 +376,7 @@ static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)
> > if (cookie) {
> > INIT_LIST_HEAD(&cookie->msi_page_list);
> > cookie->type = type;
> > + cookie->options.flags = IOMMU_DMA_OPTS_PER_CPU_QUEUE;
>
> You see, this confused me into thinking it does something and I had to
> go back and check ;)
I have no real preference we can drop it, make it an enum, a bool or
keep it. I felt like this explicit assignment documented per-CPU as the
default so see my variant below but yes it does nothing since the
cookie was already kzalloced.
>
> > }
> > return cookie;
> > }
> > @@ -614,10 +694,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
> > if (ret)
> > goto done_unlock;
> >
> > - /* If the FQ fails we can simply fall back to strict mode */
> > - if (domain->type == IOMMU_DOMAIN_DMA_FQ &&
> > - (!device_iommu_capable(dev, IOMMU_CAP_DEFERRED_FLUSH) || iommu_dma_init_fq(domain)))
> > - domain->type = IOMMU_DOMAIN_DMA;
> > + if (domain->type == IOMMU_DOMAIN_DMA_FQ) {
> > + /* Expensive shadowing IOTLB flushes require some tuning */
> > + if (dev->iommu->shadow_on_flush)
> > + cookie->options.flags |= IOMMU_DMA_OPTS_SINGLE_QUEUE;
>
> It would probably be better to set this regardless of the domain type,
> in case the FQ is only brought up later via sysfs.
>
> > +
> > + /* If the FQ fails we can simply fall back to strict mode */
> > + if (!device_iommu_capable(dev, IOMMU_CAP_DEFERRED_FLUSH) ||
> > + iommu_dma_init_fq(domain)) {
> > + domain->type = IOMMU_DOMAIN_DMA;
> > + cookie->options.flags &= ~IOMMU_DMA_OPTS_SINGLE_QUEUE;
>
> ...and either way there should be no need to clear it again - if it was
> true once it will always be true.
>
> Cheers,
> Robin.
>
I'm now experimenting with an iommu_dma_init_options() helper that gets
called from the main path of iommu_dma_init() here is how it looks like
at the end:
static void iommu_dma_init_options(struct iommu_dma_options *options, struct device *dev)
{
/* Expensive shadowing IOTLB flushes do better with a single large queue */
if (dev->iommu->shadow_on_flush) {
options->flags = IOMMU_DMA_OPTS_SINGLE_QUEUE;
options->fq_timeout = IOVA_SINGLE_FQ_TIMEOUT;
options->fq_size = IOVA_SINGLE_FQ_SIZE;
} else {
options->flags = IOMMU_DMA_OPTS_PER_CPU_QUEUE;
options->fq_size = IOVA_DEFAULT_FQ_SIZE;
options->fq_timeout = IOVA_DEFAULT_FQ_TIMEOUT;
}
}
Also I noticed that my options struct used the prefix "dma_iommu_"
while everything else is "iommu_dma_" so that got fixed too.
> > + }
> > + }
> >
> > ret = iova_reserve_iommu_regions(dev, domain);
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index fd9f79731d6a..caaf563d38ae 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -2413,17 +2413,8 @@ int iommu_map(struct iommu_domain *domain, unsigned long iova,
> > return -EINVAL;
> >
> > ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
> > - if (ret == 0 && ops->iotlb_sync_map) {
> > - ret = ops->iotlb_sync_map(domain, iova, size);
> > - if (ret)
> > - goto out_err;
> > - }
> > -
> > - return ret;
> > -
> > -out_err:
> > - /* undo mappings already done */
> > - iommu_unmap(domain, iova, size);
> > + if (ret == 0 && ops->iotlb_sync_map)
> > + ops->iotlb_sync_map(domain, iova, size);
> >
> > return ret;
> > }
> > @@ -2564,11 +2555,8 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> > sg = sg_next(sg);
> > }
> >
> > - if (ops->iotlb_sync_map) {
> > - ret = ops->iotlb_sync_map(domain, iova, mapped);
> > - if (ret)
> > - goto out_err;
> > - }
> > + if (ops->iotlb_sync_map)
> > + ops->iotlb_sync_map(domain, iova, mapped);
> > return mapped;
Good find that is indeed a rebase mishap ;-(
> >
> > out_err:
> > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> > index 020cc538e4c4..63c56a440c9d 100644
> > --- a/drivers/iommu/s390-iommu.c
> > +++ b/drivers/iommu/s390-iommu.c
> > @@ -468,6 +468,9 @@ static struct iommu_device *s390_iommu_probe_device(struct device *dev)
> > if (zdev->end_dma > ZPCI_TABLE_SIZE_RT - 1)
> > zdev->end_dma = ZPCI_TABLE_SIZE_RT - 1;
> >
> > + if (zdev->tlb_refresh)
> > + dev->iommu->shadow_on_flush = 1;
> > +
> > return &zdev->iommu_dev;
> > }
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 182cc4c71e62..c3687e066ed7 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -409,6 +409,7 @@ struct iommu_fault_param {
> > * @priv: IOMMU Driver private data
> > * @max_pasids: number of PASIDs this device can consume
> > * @attach_deferred: the dma domain attachment is deferred
> > + * @shadow_on_flush: IOTLB flushes are used to sync shadow tables
> > *
> > * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
> > * struct iommu_group *iommu_group;
> > @@ -422,6 +423,7 @@ struct dev_iommu {
> > void *priv;
> > u32 max_pasids;
> > u32 attach_deferred:1;
> > + u32 shadow_on_flush:1;
> > };
> >
> > int iommu_device_register(struct iommu_device *iommu,
> >
Powered by blists - more mailing lists