Message-ID: <Zv1810ZfEBEhybmg@earth.li>
Date: Wed, 2 Oct 2024 18:03:19 +0100
From: Jonathan McDowell <noodles@...th.li>
To: linux-integrity@...r.kernel.org, Jarkko Sakkinen <jarkko@...nel.org>,
	James Bottomley <James.Bottomley@...senpartnership.com>
Cc: Peter Huewe <peterhuewe@....de>, Jason Gunthorpe <jgg@...pe.ca>,
	linux-kernel@...r.kernel.org
Subject: Problems with TPM timeouts
We have been seeing a large number of TPM transmit problems across our
fleet, with frequent
tpm tpm0: tpm_try_transmit: send(): error -62
errors being logged. I don't have an on-demand reproducer, which makes
diagnosis difficult. In almost all cases it's a transient issue, and a
subsequent attempt to execute a command succeeds, but especially when
the kernel resource broker is involved that can still cause problems, as
the kernel is not doing retries here.  Uptime does not seem to be a
factor.
This is not yet using the new HMAC session bits; kernels affected range
from at least 6.9 back to 5.12. Historically we've not paid attention to
TPMs long after initial boot, but these days we look at them throughout
the uptime of the machine, so we are perhaps discovering something
that's been latent for a while.
I have a few things to try, which I'll describe below, but running
through them will take several months due to the difficulties in trying
to track the issue down over a production fleet. I'm posting here in
case anyone has any insight or ideas I might have missed.
First, I've seen James' post extending the TPM timeouts back in 2018
(https://lore.kernel.org/linux-integrity/1531329074.3260.9.camel@HansenPartnership.com/),
which doesn't seem to have been picked up. Was an alternative resolution
found, or are you still using this, James?
That was for a Nuvoton device; ours are Infineon devices. The behaviour
is not firmware specific; we see the problem with the latest 7.85
firmware as well as the older 7.62.
Things we are going to try:
 * Direct usage of /dev/tpm0 rather than /dev/tpmrm0. This is not a long
   term solution as we want multiple processes to be able to access the
   TPM, but is easier to deploy. The expectation is this will lower the
   number of issues due to fewer TPM commands being executed, but that
   this is not the root cause.
 * Retrying command submission on status timeout. We've had details of
   an erratum where the status register can become stuck, with the
   workaround being command resubmission. I've got a patch for this ready to
   test - I'll follow up to this mail with it, but need to actually roll
   it out and test before I'll submit it for inclusion.
 * Instrumenting other timeout points to see if we're hitting a
   different timeout.
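
To illustrate the resubmission idea in the second item, here is a rough
userspace sketch. It is not the kernel patch mentioned above, only an
assumption about what "retry on timeout" could look like from a caller's
side. The names transmit_with_retry and submit, and the attempt count
and delay, are hypothetical. The one fixed fact is that error -62 in the
log corresponds to ETIME ("Timer expired") on Linux.

```python
import time

# Linux errno value for ETIME ("Timer expired"); the kernel logs it as -62.
ETIME_LINUX = 62


def transmit_with_retry(submit, command, attempts=3, delay_s=0.05):
    """Call submit(command), resubmitting only when it fails with ETIME.

    submit is a hypothetical caller-supplied function that sends one TPM
    command and returns the response, raising OSError on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return submit(command)
        except OSError as exc:
            # Anything other than ETIME, or the final attempt, is re-raised.
            if exc.errno != ETIME_LINUX or attempt == attempts:
                raise
            time.sleep(delay_s)
```

A caller would pass whatever function it already uses to write a command
to the TPM device and read the response. Whether resubmission is safe for
every command is a separate question that this sketch does not address.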
J.
-- 
101 things you can't have too much of : 8 - Hard drive space.