Message-ID: <xkp3zaclvqx6pv4qh4wv7jttqby4lp2atsrareb63ofi7sy2qp@hfugy3pyac4o>
Date:   Tue, 30 May 2023 10:56:56 -0700
From:   Jerry Snitselaar <jsnitsel@...hat.com>
To:     Péter Ujfalusi <peter.ujfalusi@...ux.intel.com>
Cc:     Lino Sanfilippo <l.sanfilippo@...bus.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Lino Sanfilippo <LinoSanfilippo@....de>, peterhuewe@....de,
        jgg@...pe.ca, hdegoede@...hat.com, oe-lkp@...ts.linux.dev,
        lkp@...el.com, peterz@...radead.org, linux@...ewoehner.de,
        linux-integrity@...r.kernel.org, linux-kernel@...r.kernel.org,
        lukas@...ner.de, p.rosenberger@...bus.com
Subject: Re: [PATCH 1/2] tpm, tpm_tis: Handle interrupt storm

On Mon, May 29, 2023 at 09:46:08AM +0300, Péter Ujfalusi wrote:
> Hi Lino,
> 
> On 23/05/2023 23:46, Lino Sanfilippo wrote:
> >> On the other hand any new functionality is objectively a maintenance
> >> burden of some measure (applies to any functionality). So how do we know
> >> that taking this change is less of a maintenance burden than just adding
> >> new table entries, as they come up?
> >>
> > 
> > Initially this set was created as a response to this 0-day bug report which you asked me
> > to have a look at:
> > 
> > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@kernel.org/
> > 
> > My hope was that it could also avoid some (existing or future) DMI entries. But even if it does not
> > (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus
> > needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once
> > 6.4 is released.
> 
> I'm surprised that there is a need for a storm detection in the first
> place... Do we have something else on the same IRQ line on the affected
> devices which might have a bug or no driver at all?
> It is hard to believe that a TPM (Trusted Platform Module) is integrated
> so poorly ;)
> 
> But put that aside: I think the storm detection is good given that there
> is no other way to know which machines have sloppy TPM integration.
> There are machines where this happens, so it is a known integration
> issue, right?
> 
> My only 'nitpick' is with the printk level to be used.
> The ERR level is not correct as we know the issue and we handle it, so
> all is under control.
> If we want to add these machines to the quirk list then WARN is a good
> level to gain attention but I'm not sure if a user will know how to get
> the machine into the quirk list (where to file a bug).
> If we only want the quirk to be used for machines like UPX-i11 which
> simply have a broken (likely floating) IRQ line then WARN is too
> high a level; INFO or even DBG would be appropriate as you are not going
> to update the quirk, it is just handled under the hood (which is a great
> thing, but on the other hand you will have the storm nevertheless and
> that is not a nice thing).
> 
> It is a matter of how this is going to be handled in the long term: add
> quirks for all the known machines with either a stormy or a plain broken IRQ
> line, or handle the stormy ones and quirk only the broken ones.
> 
> >>> Detect an interrupt storm by counting the number of unhandled interrupts
> >>> within a 10 ms time interval. In case that more than 1000 were unhandled
> >>> deactivate interrupts, deregister the handler and fall back to polling.
> >>
> >> I know it can be sometimes hard to evaluate but can you try to explain
> >> how you came up with the 10 ms sampling period and 1000 interrupt
> >> threshold? I just don't like arbitrary numbers.
> > 
> > At least the 100 ms is not plucked out of thin air but it's the same time period
> > that the generic code in note_interrupt() uses - I assume for a good reason.
> > Not only this number but the whole irq storm detection logic is taken from
> > there:
> > 
> >>
> >>> This equals the implementation that handles interrupt storms in
> >>> note_interrupt() by means of timestamps and counters in struct irq_desc.
> > 
> > The number of 1000 unhandled interrupts is still far below the 99900 used in
> > note_interrupt() but IMHO enough to indicate that there is something seriously
> > wrong with interrupt processing and it is probably safer to fall back to polling.
> 
> Except that if the line got the spurious designation in core, the
> interrupt line will be disabled while the TPM driver will think that it
> is still using IRQ mode and will not switch to polling.
> 
> A storm of 1000 is better than a storm of 99900 for sure but quirking
> these would be the desired final solution, IMHO.

If that is the case, then a message could probably be printed to the console
detailing the DMI info needed to update the table.
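
For illustration only, here is a minimal user-space sketch of the counting
scheme from the quoted patch description, together with what such a console
message could look like. It is not the code from the patch set:
storm_note_unhandled(), format_quirk_hint(), struct storm_state and
struct dmi_ident are hypothetical names, the message wording is made up, and
the window and threshold are passed in as parameters instead of being fixed.
Resetting the counter whenever the gap since the previous unhandled interrupt
exceeds the window is my assumption about how the "timestamps and counters"
approach works. In the kernel the identification strings would presumably
come from dmi_get_system_info() rather than from a struct filled in by hand.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the DMI strings the kernel would provide. */
struct dmi_ident {
	const char *sys_vendor;
	const char *product_name;
	const char *product_version;
};

/* Hypothetical per-device bookkeeping: one timestamp and one counter. */
struct storm_state {
	unsigned long last_unhandled_ms; /* time of previous unhandled IRQ */
	unsigned int unhandled;          /* unhandled IRQs in current burst */
};

/*
 * Record one unhandled interrupt seen at time now_ms.
 *
 * Assumption: the counter restarts at 1 when the previous unhandled
 * interrupt is more than window_ms in the past, otherwise it increments.
 * Returns true once the counter exceeds threshold, i.e. when the caller
 * should disable interrupts and fall back to polling.
 */
bool storm_note_unhandled(struct storm_state *s, unsigned long now_ms,
			  unsigned long window_ms, unsigned int threshold)
{
	if (now_ms - s->last_unhandled_ms > window_ms)
		s->unhandled = 1;
	else
		s->unhandled++;
	s->last_unhandled_ms = now_ms;

	return s->unhandled > threshold;
}

/* DMI strings may be missing; print a placeholder instead of NULL. */
static const char *or_unknown(const char *str)
{
	return str ? str : "unknown";
}

/*
 * Format a hint with the fields a quirk table entry would need.
 * Returns the snprintf() result: the length the full message requires.
 */
int format_quirk_hint(char *buf, size_t len, const struct dmi_ident *id)
{
	return snprintf(buf, len,
			"IRQ storm detected, falling back to polling; "
			"DMI: vendor=\"%s\" product=\"%s\" version=\"%s\"",
			or_unknown(id->sys_vendor),
			or_unknown(id->product_name),
			or_unknown(id->product_version));
}
```

With window_ms = 10 and threshold = 1000, as in the patch description, the
first call that returns true is the 1001st unhandled interrupt of a burst,
and that would be the point at which to print the hint once.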

Regards,
Jerry

> 
> There are many buts around this ;)
> 
> -- 
> Péter
