linux-kernel - kernel oops: assertion failure at journal:576 (ext3 issue?)

Message-ID: <20061117152722.GP28000@renesys.com>
Date:	Fri, 17 Nov 2006 10:27:22 -0500
From:	John Rouillard <rouilj@...esys.com>
To:	linux-kernel@...r.kernel.org
Subject: kernel oops: assertion failure at journal:576 (ext3 issue?)

Hello all:

We have a few (3) systems that are crashing with:

  Assertion failure in journal_next_log_block() at fs/jbd/journal.c:576:
  "journal->j_free > 1" 

  Kernel BUG at journal:576
  invalid operand: 0000 [1] SMP
  CPU 1
  Modules linked in: 
  md5 ipv6 parport_pc lp parport w83627hf eeprom adm1026 hwmon_vid hwmon
  i2c_sensor i2c_isa i2c_amd756 i2c_amd8111 i2c_dev i2c_core nfs lockd
  nfs_acl sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter
  ip_tables button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot
  dm_zero dm_mirror ext3 jbd dm_mod 3w_9xxx sata_mv libata sd_mod
  scsi_mod
  Pid: 1603, comm: kjournald Not tainted 2.6.9-42.0.3.ELsmp
  RIP: 0010:[<ffffffffa006c18a>]
  <ffffffffa006c18a>{:jbd:journal_next_log_block+76}
  RSP: 0018:0000010476327b88 EFLAGS: 00010212
  RAX: 0000000000000060 RBX: 0000010283163e00 RCX: ffffffff803e1fe8
  RDX: ffffffff803e1fe8 RSI: 0000000000000246 RDI: ffffffff803e1fe0
  RBP: 0000000000000040 R08: ffffffff803e1fe8 R09: 0000010283163e00
  R10: 0000000100000000 R11: ffffffff8011e884 R12: 0000010283163e24
  R13: 0000010476327be0 R14: 0000010283163e00 R15: 000000000000002e
  FS: 0000002a95560b00(0000) GS:ffffffff804e5200(0000)
  knlGS:00000000f7ff36c0
  CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000002a9556c000 CR3: 0000000037e42000 CR4: 00000000000006e0
  Process kjournald (pid: 1603, threadinfo 0000010476326000, task
  0000010478d777f0)
  Stack: 0000010453f4afa8 0000010310072240 0000000000000040
  0000010147528be0
  000001044240a880 ffffffffa0067dfe 00000e7c00000000
  00000101c33f2184
  0000000000000000 0000010310b12f50
  Call Trace:<ffffffffa0067dfe>{:jbd:journal_commit_transaction+1834}
  <ffffffff80135756>{autoremove_wake_function+0}
  <ffffffff80135756>{autoremove_wake_function+0}
  <ffffffffa006a914>{:jbd:kjournald+250}
  <ffffffff80135756>{autoremove_wake_function+0}
  <ffffffff80135756>{autoremove_wake_function+0}
  <ffffffffa006a814>{:jbd:commit_timeout+0}
  <ffffffff80110f47>{child_rip+8}
  <ffffffffa006a81a>{:jbd:kjournald+0}
  <ffffffff80110f3f>{child_rip+0}

  Code: 0f 0b bd e2 06 a0 ff ff ff ff 40 02 48 8b ab 18 01 00 00 48
  RIP <ffffffffa006c18a>{:jbd:journal_next_log_block+76} RSP
  <0000010476327b88>
  <0>Kernel panic - not syncing: Oops

(Note: I edited together some lines in the "Modules linked in"
section. The rest is copied from the serial console (size 80x24) on
the system.)
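
As a cross-check on the trace, the first bytes of the "Code:" line can
be decoded by hand. The sketch below assumes the x86_64 BUG() layout
used by 2.6.9-era kernels: a 2-byte ud2 opcode followed by a packed
8-byte filename pointer and a 2-byte line number, all little-endian.
That layout is an assumption about this kernel build, not something
the oops itself states.

```python
import struct

# First 12 bytes of the "Code:" line from the oops above.
code = bytes.fromhex("0f0bbde206a0ffffffff4002")

# Assumed packed layout: ud2 opcode (2 bytes), filename pointer (8 bytes),
# line number (2 bytes), little-endian with no padding.
ud2, filename_ptr, line = struct.unpack("<2sQH", code)

print("opcode:      ", ud2.hex())
print("filename ptr:", hex(filename_ptr))
print("line number: ", line)
```

Under that assumption the line field decodes to 576, which agrees with
"Kernel BUG at journal:576", and the pointer falls in the same
0xffffffffa006.... range as the jbd addresses in the call trace.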

We are running the CentOS 4.4 kernel. uname -a shows:

  Linux cook05 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006
  x86_64 x86_64 x86_64 GNU/Linux 

The disk subsystem on this machine is 4 SATA disks on a 3ware 9550
(see the attached dmesg output for more info) with a mix of Western
Digital and Seagate drives. It has also crashed with sysrq enabled,
and (not surprisingly) the system is totally dead afterwards. We have
to power cycle it to reboot it.

Other systems experiencing the same crash have:

   * the non-SMP version of the same kernel with the software md RAID
     drivers
   * the same kernel running a MegaRAID RAID card

The same crash has also been seen with an earlier kernel version
2.6.9-42.ELsmp.

It seems to crash when we expect the system to have high IO, but we
don't have any hard evidence of throughput/transactions to disk to
support that.
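
One way to collect that evidence would be to sample /proc/diskstats at
a fixed interval and log per-device rates somewhere that survives the
crash (serial console or a remote host). The following is only a
sketch: it assumes the 14-field whole-disk row layout of 2.6 kernels
and a Python 3 interpreter, which the stock install may not have, and
the function names are made up for illustration. The parsing functions
can equally be run on another machine against saved snapshots.

```python
import time


def parse_diskstats(text):
    """Map device name -> (reads, writes, sectors_read, sectors_written)."""
    stats = {}
    for row in text.splitlines():
        fields = row.split()
        # Whole-disk rows have at least 14 fields; shorter rows
        # (for example old-style partition rows) are skipped.
        if len(fields) < 14:
            continue
        stats[fields[2]] = (int(fields[3]), int(fields[7]),
                            int(fields[5]), int(fields[9]))
    return stats


def rates(before, after, seconds):
    """Per-second deltas for every device present in both samples."""
    out = {}
    for name, new in after.items():
        old = before.get(name)
        if old is not None:
            out[name] = tuple((n - o) / seconds for n, o in zip(new, old))
    return out


def monitor(interval=10, path="/proc/diskstats"):
    """Print one line per device every `interval` seconds until interrupted."""
    with open(path) as f:
        prev = parse_diskstats(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            cur = parse_diskstats(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name, r in sorted(rates(prev, cur, interval).items()):
            print("%s %s r/s=%.1f w/s=%.1f rsec/s=%.1f wsec/s=%.1f"
                  % ((stamp, name) + r), flush=True)
        prev = cur
```

Calling monitor() with its output redirected to a remote log would
leave a record of the last few intervals before each crash.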

We can try setting up a remote kernel dump if that would be
useful/would work.

We get a crash every couple of days on average (sometimes two crashes
with 30 min-2 hours between them) so we can try applying patches/new
kernels if needed and see how the system does.

I have attached selected lines from dmesg to give some additional info
about the hardware and config of the system. I tried to attach
/proc/kallsyms from the system as requested by the mailing list FAQ
at: http://www.tux.org/lkml/#s4-3. However, it has been two days since
I originally sent that email and I haven't seen it arrive in the
archives, so that info is available on request.  The dmesg info is
from a post crash boot that should be identical to the pre-crash boot.

If you require more/different information just let me know and I will
try to obtain it.

Thank you for your help.

--
				-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-643-9300 x 111

View attachment "cook05.dmesg_selected.txt" of type "text/plain" (6190 bytes)