Message-Id: <20090527165117.8b768d00.akpm@linux-foundation.org>
Date:	Wed, 27 May 2009 16:51:17 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	"Yu, Fenghua" <fenghua.yu@...el.com>
Cc:	dwmw2@...radead.org, mingo@...e.hu, linux-kernel@...r.kernel.org,
	iommu@...ts.linux-foundation.org
Subject: Re: [PATCH] Time out for possible dead loops during queued
 invalidation wait

On Wed, 27 May 2009 16:25:52 -0700
"Yu, Fenghua" <fenghua.yu@...el.com> wrote:

> >> Which error code is better? Is EAGAIN ok?
> >
> >That depends on driver details - probably EIO would be suitable, dunno.
> >
> >But all the callers of qi_submit_sync() seem to just drop the error
> >code on the floor:
> >
> >	/* should never fail */
> >	qi_submit_sync(&desc, iommu);
> >
> >and may well cause a kernel crash as a result.
> 
> Should the code panic the kernel when the qi_submit_sync() loops time out? If the loops time out (after 10 seconds), something in the hardware is probably wrong.
> 

The most important thing to do when an error is detected is to protect
the user's data, perhaps by reliably halting the kernel.

The second most important thing is to report what happened, so people
can fix things.

The third most important thing is to attempt to recover from the error
so that the kernel continues to function.  This third requirement often
makes the second one more successful: a still-running kernel can report
things which a crashed kernel cannot.

In this particular case I'd suggest that the driver be converted to
correctly recognise errors, clean things up and propagate the errors
back in an orderly fashion, as usual.
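A minimal user-space sketch of "recognise errors, clean up, propagate" for a bounded wait follows. It is not the intel-iommu code. All names are hypothetical: `fake_hw_done` stands in for the hardware completion status the real driver polls, `QI_TIMEOUT_MS` is shortened from the 10 seconds mentioned in the thread, and `qi_wait_sketch()` / `flush_context_sketch()` only illustrate the shape. The wait returns -EIO on timeout instead of spinning forever or panicking, and the caller checks the result, undoes partial state, and passes the error up.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the hardware completion flag.  It stays 0
 * here, simulating a wedged queued-invalidation unit. */
static volatile int fake_hw_done = 0;

/* Hypothetical timeout, shortened from the 10 s in the thread. */
#define QI_TIMEOUT_MS 50

/* Milliseconds elapsed since *start on the monotonic clock. */
static long elapsed_ms(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) * 1000L +
	       (now.tv_nsec - start->tv_nsec) / 1000000L;
}

/* Bounded wait: poll until the "hardware" reports completion or the
 * deadline passes.  Returns 0 on success, -EIO on timeout. */
static int qi_wait_sketch(void)
{
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (!fake_hw_done) {
		if (elapsed_ms(&start) > QI_TIMEOUT_MS)
			return -EIO;
	}
	return 0;
}

/* Caller that recognises the error, reports it, cleans up and
 * propagates it instead of dropping it on the floor. */
static int flush_context_sketch(void)
{
	int ret = qi_wait_sketch();

	if (ret) {
		fprintf(stderr, "invalidation wait timed out (%d)\n", ret);
		/* ... undo any partially applied state here ... */
		return ret;
	}
	return 0;
}
```

With `fake_hw_done` left at 0, `flush_context_sketch()` prints a diagnostic and returns -EIO after roughly 50 ms; once the flag is set it returns 0. Each caller up the chain would apply the same check-and-return pattern.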


otoh...  How likely is it that this timeout actually occurs?  If it's
only conceivable that this can happen when the hardware is busted then
I'd suggest that we not patch the kernel at all - the kernel really has
little chance of surviving broken silicon so why bother adding a little
bit of code here to handle a tiny subset of it?

If the chip is indeed busted, then afaict the kernel will get stuck in
an infinite loop.  That's OK, we can still diagnose those with NMI
watchdogs, sysrq-p handlers, etc.

