[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.21.1902152242120.1683@nanos.tec.linutronix.de>
Date: Sat, 16 Feb 2019 00:00:29 +0100 (CET)
From: Thomas Gleixner <tglx@...utronix.de>
To: Ming Lei <ming.lei@...hat.com>
cc: LKML <linux-kernel@...r.kernel.org>,
Christoph Hellwig <hch@....de>,
Bjorn Helgaas <helgaas@...nel.org>,
Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
Sagi Grimberg <sagi@...mberg.me>,
linux-nvme@...ts.infradead.org, linux-pci@...r.kernel.org,
Keith Busch <keith.busch@...el.com>,
Marc Zyngier <marc.zyngier@....com>,
Sumit Saxena <sumit.saxena@...adcom.com>,
Kashyap Desai <kashyap.desai@...adcom.com>,
Shivasharan Srikanteshwara
<shivasharan.srikanteshwara@...adcom.com>
Subject: Re: [patch V5 4/8] nvme-pci: Simplify interrupt allocation
On Fri, 15 Feb 2019, Thomas Gleixner wrote:
> On Fri, 15 Feb 2019, Ming Lei wrote:
> > > + * If only one interrupt is available, combine write and read
> > > + * queues. If 'write_queues' is set, ensure it leaves room for at
> > > + * least one read queue.
> > > + */
> > > + if (nrirqs == 1)
> > > + nr_read_queues = 0;
> > > + else if (write_queues >= nrirqs)
> > > + nr_read_queues = nrirqs - 1;
> > > + else
> > > + nr_read_queues = nrirqs - write_queues;
> > > +
> > > + dev->io_queues[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
> > > + affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
> > > + dev->io_queues[HCTX_TYPE_READ] = nr_read_queues;
> > > + affd->set_size[HCTX_TYPE_READ] = nr_read_queues;
> > > + affd->nr_sets = nr_read_queues ? 2 : 1;
> > > }
> >
> > .calc_sets is called only if more than .pre_vectors is available,
> > then dev->io_queues[HCTX_TYPE_DEFAULT] may not be set in case of
> > (nvecs == affd->pre_vectors + affd->post_vectors).
>
> Hmm, good catch. The delta patch below should fix that, but I have to go
> through all the possible cases in pci_alloc_irq_vectors_affinity() once
> more with brain awake.
The existing logic in the driver is somewhat strange. If there is only a
single interrupt available, i.e. no separation of admin and I/O queue, then
dev->io_queues[HCTX_TYPE_DEFAULT] is still set to 1.
Now with the callback scheme (independent of my changes) that breaks in
various ways:
1) irq_create_affinity_masks() bails out early:
if (nvecs == affd->pre_vectors + affd->post_vectors)
return NULL;
So the callback won't be invoked and the last content of
dev->io_queues is preserved. By chance this might end up with
dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
dev->io_queues[HCTX_TYPE_READ] = 0;
but that does not happen by design.
2) pci_alloc_irq_vectors_affinity() has the following flow:
__pci_enable_msix_range()
for (...) {
__pci_enable_msix()
}
If that fails because MSIX is not supported, then the affinity
spreading code has not been called yet and neither the callback.
Now it goes on and tries MSI, which is the same as the above.
When that fails because MSI is not supported, same situation as above
and the code tries to allocate the legacy interrupt.
Now with the initial initialization that case is covered, but not the
following error case:
Assume MSIX is enabled, but __pci_enable_msix() fails with -ENOMEM
after having called irq_create_affinity_masks() and subsequently the
callback with maxvecs.
Then pci_alloc_irq_vectors_affinity() will try MSI and fail _BEFORE_
the callback is invoked.
The next step is to install the leagcy interrupt, but that does not
invoke the affinity code and neither the callback.
At the end dev->io_queues[*] are still initialized with the failed
attempt of enabling MSIX based on maxvecs.
Result: inconsistent state ...
I have an idea how to fix that, but that's again a task for brain being
awake. Will take care of that tomorrow morning.
Thanks,
tglx
Powered by blists - more mailing lists