[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZzVxIfb5KpL97P4Q@LeoBras>
Date: Thu, 14 Nov 2024 00:40:17 -0300
From: Leonardo Bras <leobras@...hat.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Leonardo Bras <leobras@...hat.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Jiri Slaby <jirislaby@...nel.org>,
Tony Lindgren <tony@...mide.com>,
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
John Ogness <john.ogness@...utronix.de>,
Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
Uwe Kleine-König <u.kleine-koenig@...gutronix.de>,
Florian Fainelli <florian.fainelli@...adcom.com>,
Shanker Donthineni <sdonthineni@...dia.com>,
linux-kernel@...r.kernel.org,
linux-serial@...r.kernel.org
Subject: Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY
On Fri, Feb 23, 2024 at 01:37:39AM -0300, Leonardo Bras wrote:
> On Wed, Feb 21, 2024 at 04:41:20PM +0100, Thomas Gleixner wrote:
> > On Wed, Feb 21 2024 at 02:39, Leonardo Bras wrote:
> > > On Mon, Feb 19, 2024 at 12:03:07PM +0100, Thomas Gleixner wrote:
> > >> >> Is scenarios where there is no need to keep track of IRQ handled, convert
> > >> >> it back to IRQ_HANDLED.
> > >> >
> > >> > That's not really workable as you'd have to update tons of drivers just
> > >> > to deal with that corner case. That's error prone and just extra
> > >> > complexity all over the place.
> > >
> > > I agree, that's a downside of this implementation.
> >
> > A serious one which is not really workable. See below.
> >
> > > I agree the above may be able to solve the issue, but it would make 2 extra
> > > atomic ops necessary in the thread handling the IRQ, as well as one extra
> > > atomic operation in note_interrupt(), which could increase latency on this
> > > IRQ deferring the handler to a thread.
> > >
> > > I mean, yes, the cpu running note_interrupt() would probably already have
> > > exclusiveness for this cacheline, but it further increases cacheline
> > > bouncing and also adds the mem barriers that incur on atomic operations,
> > > even if we use an extra bit from threads_handled instead of allocate a new
> > > field for threads_running.
> >
> > I think that's a strawman. Atomic operations can of course be more
> > expensive than non-atomic ones, but they only start to make a difference
> > when the cache line is contended. That's not the case here for the
> > normal operations.
> >
> > Interrupts and their threads are strictly targeted to a single CPU and
> > the cache line is already hot and had to be made exclusive because of
> > other write operations to it.
> >
> > There is usually no concurrency at all, except for administrative
> > operations like enable/disable or affinity changes. Those administrative
> > operations are not high frequency and the resulting cache line bouncing
> > is unavoidable even without that change. But does it matter in the
> > larger picture? I don't think so.
>
> That's a fair point, but there are some use cases that use CPU Isolation on
> top of PREEMPT_RT in order to reduce interference on a CPU running an RT
> workload.
>
> For those cases, IIRC the handler will run on a different (housekeeping)
> CPU when those IRQs originate on an Isolated CPU, meaning the above
> described cacheline bouncing is expected.
>
>
> >
> > > On top of that, let's think on a scenario where the threaded handler will
> > > solve a lot of requests, but not necessarily spend a lot of time doing so.
> > > This allows the thread to run for little time while solving a lot of
> > > requests.
> > >
> > > In this scenario, note_interrupt() could return without incrementing
> > > irqs_unhandled for those IRQ that happen while the brief thread is running,
> > > but every other IRQ would cause note_interrupt() to increase
> > > irqs_unhandled, which would cause the bug to still reproduce.
> >
> > In theory yes. Does it happen in practice?
> >
> > But that exposes a flaw in the actual detection code. The code is
> > unconditionally accumulating if there is an unhandled interrupt within
> > 100ms after the last unhandled one. IOW, if there is a periodic
> > unhandled one every 50ms, the interrupt will be shut down after 100000 *
> > 50ms = 5000s ~= 83.3m ~= 1.4h. And it neither cares about the number of
> > actually handled interrupts.
> >
> > The spurious detector is really about runaway interrupts which hog a CPU
> > completely, but the above is not what we want to protect against.
>
> Now it makes a lot more sense to me.
> Thanks!
Hi Thomas,
I would like to go back to this discussion :)
>From what I could understand, and read back the thread:
- The spurious detector is used to avoid cpu hog when a lots of IRQs are
hitting a cpu, but few ( < 100 / 100k) are being handled. It works by
disabling that interruption.
- The bug I am dealing with (on serial8250), happens to fit exactly at
above case: lots of requests, but few are handled.
The reason: threaded handler, many requests, and they are dealt with in
batch: multiple requests are handled at once, but a single IRQ_HANDLED
returned.
- My proposed solution: Find a way of accounting the requests handled.
- Implementation: add an option for drivers voluntarily report how
many requests they handled. Current drivers need no change.
- Limitation: If this issue is found on another driver, we need to
implement accounting there as well. This may only happen on drivers
which handle over 1k requests at once.
What was left for me TODO:
Think on a generic solution for this issue, to avoid dealing with that
in a per-driver basis.
That's what I was able to think about:
- Only the driver code knows how many requests it handled, so without
touching them we can't know how many requests were properly handled.
- I could try thinking a different solution, which involves changing only
the spurious detector.
- For that I would need to find a particular characteristic we would want
to avoid spurious detection against, and make sure it won't miss an
actual case we want to be protected about.
Generic solutions(?) proposed:
- Zero irqs_unhandled if threaded & handles a single request in 100k
- Problem: A regular issue with the interruption would not be detected
in the driver.
- Skip detection if threaded & the handling thread is running
- Problem 1: the thread may run shortly and batch handle a lot of stuff,
not being detected by the spurious detector.
- Problem 2: the thread may get stuck, not handle the IRQs and also not
being detected by the spurious handler. (IIUC)
In the end, I could not find a proper way of telling apart
a - "this is a real spurious IRQ behavior, which needs to be disabled", and
b - "this is just a handler that batch-handles it's requests",
without touching the drivers' code.
Do you have any suggestion on how to do that?
Thanks!
Leo
Powered by blists - more mailing lists