Date:   Tue, 12 Mar 2019 16:56:15 -0400
From:   Mimi Zohar <zohar@...ux.ibm.com>
To:     Calvin Owens <calvinowens@...com>
Cc:     Peter Huewe <peterhuewe@....de>,
        Jarkko Sakkinen <jarkko.sakkinen@...ux.intel.com>,
        Jason Gunthorpe <jgg@...pe.ca>, Arnd Bergmann <arnd@...db.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "linux-integrity@...r.kernel.org" <linux-integrity@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH] tpm: Make timeout logic simpler and more robust

On Tue, 2019-03-12 at 20:08 +0000, Calvin Owens wrote:
> On Tuesday 03/12 at 13:04 -0400, Mimi Zohar wrote:
> > On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > > We're having lots of problems with TPM commands timing out, and we're
> > > seeing these problems across lots of different hardware (both v1/v2).
> > > 
> > > I instrumented the driver to collect latency data, but I wasn't able to
> > > find any specific timeout to fix: it seems like many of them are too
> > > aggressive. So I tried replacing all the timeout logic with a single
> > > universal long timeout, and found that makes our TPMs 100% reliable.
> > > 
> > > Given that this timeout logic is very complex, problematic, and appears
> > > to serve no real purpose, I propose simply deleting all of it.
> > 
> > Normally before sending such a massive change like this, included in
> > the bug report or patch description, there would be some indication as
> > to which kernel introduced a regression.  Has this always been a
> > problem? Is this something new? How new?
> 
> Honestly we've always had problems with flakiness from these devices,
> but it seems to have regressed sometime between 4.11 and 4.16.

Well, that's a start.  Around 4.10 is when we started noticing TPM
performance issues due to a change in kernel timer scheduling.  This
resulted in commit a233a0289cf9 ("tpm: msleep() delays - replace with
usleep_range() in i2c nuvoton driver"), which was upstreamed in 4.12.

At the other end, James was referring to commit 424eaf910c32 ("tpm:
reduce polling time to usecs for even finer granularity"), which was
introduced in 4.18.

> 
> I wish I had a better answer for you: we need on the order of a hundred
> machines to see the difference, and setting up these 100+ machine tests
> is unfortunately involved enough that e.g. bisecting it just isn't
> feasible :/

> What I can say for sure is that this patch makes everything much better
> for us. If there's anything in particular you'd like me to test, I have
> an army of machines I'm happy to put to use, let me know :)

I would assume not all of your machines are the same, nor have the same
TPM.  Could you verify that this problem occurs across the board, and
is not limited to a particular TPM?

BTW, are you seeing this problem with both TPM 1.2 and 2.0?

thanks!

Mimi
