linux-kernel - Re: Oops after 30 days of uptime

Message-ID: <20060910082649.GA20814@1wt.eu>
Date:	Sun, 10 Sep 2006 10:26:49 +0200
From:	Willy Tarreau <w@....eu>
To:	Ondrej Zary <linux@...nbow-software.org>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	kaber@...sh.net
Subject: Re: Oops after 30 days of uptime

Hi Ondrej,

OK, I've analysed your oops with your kernel. My conclusion is that you
have a hardware problem (most probably the CPU), because you've hit an
impossible case:

ip_nat_cheat_check() pushed the size of the data (8) onto the stack, followed
by the pointer to the data, then called csum_partial():

c01e657f:       6a 08                   push   $0x8
c01e6581:       52                      push   %edx
c01e6582:       e8 a5 85 00 00          call   c01eeb2c <csum_partial>

In csum_partial(), ECX is loaded with the size (8) and ESI with the data
pointer (0xc0227ce8):

c01eeb32:       8b 4c 24 10             mov    0x10(%esp),%ecx
c01eeb36:       8b 74 24 0c             mov    0xc(%esp),%esi

Then the size is divided by 32 to count how many 32-byte blocks can be read.
If the size is less than 32, the count is zero and the code branches to a
special location which reads one word at a time:

c01eeb78:       89 ca                   mov    %ecx,%edx
c01eeb7a:       c1 e9 05                shr    $0x5,%ecx
c01eeb7d:       74 32                   je     c01eebb1 <csum_partial+0x85>
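
To make the expected control flow explicit, here is a minimal C sketch of the
decision those three instructions encode. The enum and the function name
select_path() are hypothetical and only illustrate the shift and the test for
zero; this is not the kernel's csum_partial() source.

```c
#include <stdint.h>

/* Hypothetical labels for the two code paths described above. */
enum csum_path {
    CSUM_PATH_WORDS,   /* fewer than 32 bytes: read one word at a time */
    CSUM_PATH_BLOCKS   /* at least one full 32-byte block to read */
};

/* Mirrors "shr $0x5,%ecx" followed by "je": the branch is taken when the
 * shifted length is zero. */
static enum csum_path select_path(uint32_t len)
{
    uint32_t blocks = len >> 5;          /* number of 32-byte blocks */

    return blocks == 0 ? CSUM_PATH_WORDS : CSUM_PATH_BLOCKS;
}
```

With len = 8 the shift yields 0, so the word-at-a-time path is the only valid
outcome.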

Your oops comes from a few instructions below. The branch was not taken
although it should have been, because (8 >> 5) == 0. You can also see from
EDX in the oops that ECX really was 0x8 when it was copied. The rest is pretty
obvious: the data are read 32 bytes at a time starting at ESI, and ECX is
decremented by 1 every 32 bytes. When ESI+0x18 reaches an unmapped area
(0xc2000000), you get the oops, with ECX = 0xfff113e8 as in your oops.
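
The numbers above can be cross-checked with a little arithmetic. The sketch
below assumes that ECX enters the loop as 0 (that is, 8 >> 5), that each
iteration advances ESI by 32 bytes and decrements ECX once, and that ECX wraps
as an unsigned 32-bit value. The macro and function names are made up for the
illustration; the constants are the ones quoted in this message.

```c
#include <stdint.h>

/* Values quoted in the message above. */
#define ESI_START   0xc0227ce8u  /* data pointer passed to csum_partial() */
#define FAULT_ADDR  0xc2000000u  /* first unmapped address */
#define LOAD_OFFSET 0x18u        /* the faulting load reads from ESI+0x18 */
#define BLOCK_SIZE  32u          /* bytes consumed per loop iteration */

/* Iterations completed before the load at ESI+LOAD_OFFSET touches
 * FAULT_ADDR. */
static uint32_t completed_iterations(void)
{
    return (FAULT_ADDR - LOAD_OFFSET - ESI_START) / BLOCK_SIZE;
}

/* ECX after that many decrements, starting from (8 >> 5) == 0.
 * Unsigned subtraction wraps modulo 2^32, as the register does. */
static uint32_t ecx_at_fault(void)
{
    uint32_t ecx = 8u >> 5;

    return ecx - completed_iterations();
}
```

Under those assumptions the loop completes 0xeec18 iterations, ESI ends up at
0xc1ffffe8, and ECX wraps to 0xfff113e8, which matches the register dump.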

Given that the failing instruction is the most common conditional jump, it
is very fortunate that your system can run for 30 days before crashing. I
think that your CPU might be running too hot and might get wrong results
during branch prediction. It's also possible that you have a poor power
supply. However, I'm pretty sure that this is not a RAM problem.

Best regards,
Willy
