Date:	Thu, 14 Jun 2012 10:57:08 +1000
From:	Jason Stubbs <jasonbstubbs@...il.com>
To:	Matt Wilson <msw@...zon.com>
CC:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"xen-devel@...ts.xen.org" <xen-devel@...ts.xen.org>
Subject: Re: PROBLEM: Possible race between xen, md, dm and/or xfs

On 2012-6-14 10:18, Matt Wilson wrote:
> On Tue, Jun 12, 2012 at 05:11:37AM -0700, Jason Stubbs wrote:
>> I'll be sure to give Amazon ample opportunity to diagnose things from
>> their side should the issue occur again, and hopefully there won't be
>> any more people reporting extraneous issues.
> 
> If you're able to reproduce this hang, I'm sure that we can get to the
> root of the problem quite quickly. Short of that, if you can provide a
> running instance that is exhibiting the problem we can do some
> live-system debugging. It is much more difficult to determine root
> cause and verify fixes without reproduction instructions.

We've got about 50 instances using the same disk layout, but have only
been running these new instances for a couple of months. We've been
using EC2 and EBS for three years now though, which is why I thought
it was likely something to do with the disk layout of the new instances.
Thinking that, my first concern was to get the instance working again
to keep the service running smoothly.

Come to think of it though, I think I might have had this issue once
before with EBS. Still, that makes two occurrences in somewhere around
70 years of combined uptime, so it was either a one-off or a very rare
corner case.

Either way, I think all that can be done is to wait for it to happen
again, at which time I'll take it out of production, leave it running
and set up a new instance for production instead.

> Given the kernel version you reported in your traces, I can at least
> rule out one known bug that caused blkfront to wait forever for an IO
> to complete: 
>   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=dffe2e1
> 
> The kernel version you're using includes the follow-on change to
> use fasteoi:
>   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=3588fe2

Yep, this is exactly the sort of corner case I thought it might be.
I've confirmed that this change is in the sources for the kernel
I'm using, though.
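
(For what it's worth, one way to do that kind of check is the rough
sketch below. It assumes a git clone of torvalds/linux.git checked out
at the version we're running; the short hashes are taken from the
commitdiff URLs above.)

#!/usr/bin/env python
# Rough sketch: report whether the two fixes referenced above are
# reachable from the currently checked-out kernel tree. Assumes a git
# clone of torvalds/linux.git with the running kernel's version
# checked out.
import subprocess

# the blkfront fix and the fasteoi follow-on referenced above
FIXES = ["dffe2e1", "3588fe2"]

def commit_in_tree(sha):
    # git merge-base prints the common ancestor of the two commits; if
    # that ancestor is the commit itself, it is already part of HEAD.
    try:
        full = subprocess.check_output(
            ["git", "rev-parse", sha]).strip()
        base = subprocess.check_output(
            ["git", "merge-base", sha, "HEAD"]).strip()
    except subprocess.CalledProcessError:
        return False  # hash unknown in this clone
    return full == base

for sha in FIXES:
    state = "present" if commit_in_tree(sha) else "MISSING"
    print("%s: %s" % (sha, state))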

> I'm sorry that I can't be of more immediate help. If you
> encounter the problem again, please contact developer support.

No problem. We have a support contract and I did go there first, but
the response was basically that nothing could be done without the
instance running. I supplied the traces, but it wasn't clear whether
they'd actually been investigated, hence I chose to report here.
In hindsight, I realize I should have kept the instance running, but
I don't tend to think so clearly when it's the middle of the night. ;)

As for not being able to solve the problem, I don't mind at all. I just
wanted to make sure that an adequate attempt had been made to find the
cause. We "architect for failure" as much as possible, so the problem
in itself is not such a big deal. Thanks for looking into it!

--
Regards,
Jason Stubbs
