Message-Id: <1552424175.24794.105.camel@linux.ibm.com>
Date: Tue, 12 Mar 2019 16:56:15 -0400
From: Mimi Zohar <zohar@...ux.ibm.com>
To: Calvin Owens <calvinowens@...com>
Cc: Peter Huewe <peterhuewe@....de>,
Jarkko Sakkinen <jarkko.sakkinen@...ux.intel.com>,
Jason Gunthorpe <jgg@...pe.ca>, Arnd Bergmann <arnd@...db.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"linux-integrity@...r.kernel.org" <linux-integrity@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH] tpm: Make timeout logic simpler and more robust
On Tue, 2019-03-12 at 20:08 +0000, Calvin Owens wrote:
> On Tuesday 03/12 at 13:04 -0400, Mimi Zohar wrote:
> > On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > > We're having lots of problems with TPM commands timing out, and we're
> > > seeing these problems across lots of different hardware (both v1/v2).
> > >
> > > I instrumented the driver to collect latency data, but I wasn't able to
> > > find any specific timeout to fix: it seems like many of them are too
> > > aggressive. So I tried replacing all the timeout logic with a single
> > > universal long timeout, and found that makes our TPMs 100% reliable.
> > >
> > > Given that this timeout logic is very complex, problematic, and appears
> > > to serve no real purpose, I propose simply deleting all of it.
> >
> > Normally before sending such a massive change like this, included in
> > the bug report or patch description, there would be some indication as
> > to which kernel introduced a regression. Has this always been a
> > problem? Is this something new? How new?
>
> Honestly we've always had problems with flakiness from these devices,
> but it seems to have regressed sometime between 4.11 and 4.16.
Well, that's a start. Around 4.10 is when we started noticing TPM
performance issues due to the change in the kernel timer scheduling.
This resulted in commit a233a0289cf9 ("tpm: msleep() delays - replace
with usleep_range() in i2c nuvoton driver"), which was upstreamed in
4.12.

At the other end, James was referring to commit 424eaf910c32 ("tpm:
reduce polling time to usecs for even finer granularity"), which was
introduced in 4.18.
>
> I wish I had a better answer for you: we need on the order of a hundred
> machines to see the difference, and setting up these 100+ machine tests
> is unfortunately involved enough that e.g. bisecting it just isn't
> feasible :/
> What I can say for sure is that this patch makes everything much better
> for us. If there's anything in particular you'd like me to test, I have
> an army of machines I'm happy to put to use, let me know :)
I would assume not all of your machines are the same, nor do they all
have the same TPM. Could you verify that this problem is across the
board, not limited to a particular TPM?

BTW, are you seeing this problem with both TPM 1.2 and 2.0?
thanks!
Mimi