Message-ID: <Zv1810ZfEBEhybmg@earth.li>
Date: Wed, 2 Oct 2024 18:03:19 +0100
From: Jonathan McDowell <noodles@...th.li>
To: linux-integrity@...r.kernel.org, Jarkko Sakkinen <jarkko@...nel.org>,
	James Bottomley <James.Bottomley@...senpartnership.com>
Cc: Peter Huewe <peterhuewe@....de>, Jason Gunthorpe <jgg@...pe.ca>,
	linux-kernel@...r.kernel.org
Subject: Problems with TPM timeouts

We have been seeing a large number of TPM transmit problems across our
fleet, with frequent

tpm tpm0: tpm_try_transmit: send(): error -62

errors being logged. I don't have an on-demand reproducer, which makes
diagnosis difficult. In almost all cases it's a transient issue, and a
subsequent attempt to execute a command succeeds, but especially when
the kernel resource broker is involved that can still cause problems, as
the kernel is not doing retries here.  Uptime does not seem to be a
factor.
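
Since the kernel does not retry here, one stop-gap is to retry in the caller. The following is a minimal userspace sketch, not something taken from our deployment: it assumes the failure reaches the caller as an OSError carrying ETIME (errno 62 on Linux, matching the -62 in the log line above), and the names transmit_with_retry and send_via_device, the attempt count and the delay are all hypothetical. It also assumes the command is safe to resubmit, which will not hold for every TPM command.

```python
import errno
import os
import time


def transmit_with_retry(send, command, attempts=3, delay=0.05):
    """Call send(command), retrying only when it fails with ETIME.

    `send` is any callable that takes the command bytes and returns the
    response bytes. Errors other than ETIME are re-raised immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return send(command)
        except OSError as exc:
            # Only the transient timeout is retried; give up on the last attempt.
            if exc.errno != errno.ETIME or attempt == attempts:
                raise
            time.sleep(delay)


def send_via_device(path="/dev/tpmrm0"):
    """Hypothetical helper: build a sender that talks to a TPM character device."""
    def send(command):
        fd = os.open(path, os.O_RDWR)
        try:
            os.write(fd, command)      # submit the command
            return os.read(fd, 4096)   # read back the response
        finally:
            os.close(fd)
    return send
```

A caller would use it as transmit_with_retry(send_via_device(), command_bytes). This only papers over the transient failures; it says nothing about why the timeout fires in the first place.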

This is not yet using the new HMAC session bits; kernels affected range
from at least 6.9 back to 5.12. Historically we've not paid attention to
TPMs long after initial boot; these days we're looking at them
throughout the uptime of the machine, so we are perhaps discovering
something that's been latent for a while.

I have a few things to try, which I'll describe below, but running
through them will take several months due to the difficulties in trying
to track the issue down over a production fleet. I'm posting here in
case anyone has any insight or ideas I might have missed.

First, I've seen James' post extending the TPM timeouts back in 2018
(https://lore.kernel.org/linux-integrity/1531329074.3260.9.camel@HansenPartnership.com/),
which doesn't seem to have been picked up. Was an alternative resolution
found, or are you still using this, James?

That was for a Nuvoton device; ours are Infineon devices. The behaviour
is not firmware specific; we see the problem with the latest 7.85
firmware as well as the older 7.62.

Things we are going to try:

 * Direct usage of /dev/tpm0 rather than /dev/tpmrm0. This is not a long
   term solution as we want multiple processes to be able to access the
   TPM, but is easier to deploy. The expectation is this will lower the
   number of issues due to fewer TPM commands being executed, but that
   this is not the root cause.

 * Retrying command submission on status timeout. We've had details of
   an erratum where the status register can become stuck, with the
   workaround being command resubmission. I've got a patch for this ready
   to test - I'll follow up to this mail with it, but need to actually
   roll it out and test before I'll submit it for inclusion.

 * Instrumenting other timeout points to see if we're hitting a
   different timeout.

J.

-- 
101 things you can't have too much of : 8 - Hard drive space.
