lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 27 May 2009 17:47:44 -0700
From:	"Yu, Fenghua" <fenghua.yu@...el.com>
To:	'Andrew Morton' <akpm@...ux-foundation.org>
CC:	"'dwmw2@...radead.org'" <dwmw2@...radead.org>,
	"'mingo@...e.hu'" <mingo@...e.hu>,
	"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
	"'iommu@...ts.linux-foundation.org'" 
	<iommu@...ts.linux-foundation.org>,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>
Subject: RE: [PATCH] Time out for possible dead loops during queued
 invalidation wait

>> >> Which error code is better? Is EAGAIN ok?
>> >
>> >That depends on driver details - probably EIO would be suitable, dunno.
>> >
>> >But all the callers of qi_submit_sync() seem to just drop the error
>> >code on the floor:
>> >
>> >	/* should never fail */
>> >	qi_submit_sync(&desc, iommu);
>> >
>> >and may well cause a kernel crash as a result.
>>
>> Should the code go to kernel panic after timeout in qi_submit_sync()
>loops? When timeout (10 seconds) in the loops, something in hardware could
>be wrong.
>>
>
>The most important thing to do when an error is detected is to protect
>the user's data, perhaps by reliably halting the kernel.
>
>The second most important thing is to report what happened, so people
>can fix things.
>
>The third most important thing is to attempt to recover from the error
>so that the kernel continues to function.  This third requirement often
>makes the second one more successful: a still-running kernel can report
>things which a crashed kernel cannot.
>
>In this particualr case I'd suggest that the driver be converted to
>correctly recognise errors, clean things up and propagate the errors
>back in an orderly fashion, as usual.
>
>
>otoh...  How likely is it that this timeout actually occurs?  If it's
>only conceivable that this can happen when the hardware is busted then
>I'd suggest that we not patch the kernel at all - the kernel really has
>little chance of surviving broken silicon so why bother adding a little
>bit of code here to handle a tiny subset of it?
>
>If the chip is indeed busted then afacit the kernel will get stuck in
>an infinite loop.  That's OK, we can still diagnose those with NMI
>watchdogs, sysrq-p handlers, etc.

This timeout can occur either when the hardware is busted or when iommu kernel software or BIOS has a bug. The timeout detection with a suitable error code can help debug the hardware or software issues.

In some cases (e.g. interrupt migration in interrupt remapping), the timeout is not a blocking issue; it just impacts performance. If there is not the timeout detection, the cpu ends up in an infinite loop although system should be able to continue to run. But the timeout detection can dump warning info for user to debug and the system can continue to run.

Seems I will also need to change the upper level callers to correctly handle the error code (EIO) generated by qi_submit_sync().

Thanks.

-Fenghua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ