linux-kernel - HD5500 with weak signal locks up (and stays locked up), but does not return error to applications

Message-ID: <480FED34.1090003@gmail.com>
Date:	Wed, 23 Apr 2008 21:15:16 -0500
From:	Roger Heflin <rogerheflin@...il.com>
To:	LKML <linux-kernel@...r.kernel.org>
Subject: HD5500 with weak signal locks up (and stays locked up), but does
 not return error to applications

Ok,

Note that the mailing list given in the linux-kernel maintainers file is a 
subscriber-only list and outright rejects all posts from non-subscribers, 
though it is listed that way in the maintainers file.

Every so often my HD5500 stops working in mythtv (empty/no file). Once it 
starts happening, getatsc fails as well. Stracing getatsc and/or mythtv 
shows getatsc hanging in read() (blocking read) and mythbackend getting 
EAGAIN from read() (nonblocking read).
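
Both symptoms (a blocking reader stuck in read() forever, a nonblocking reader 
seeing only EAGAIN) look like what you would expect if the device simply never 
becomes readable again. An application can at least notice this by bounding its 
wait with poll(). A minimal sketch in C, assuming the reader already holds an 
open file descriptor for the capture device (the device path is not given in 
this report); read_with_timeout and enum read_status are hypothetical names, and 
the timeout value is an arbitrary assumption to tune.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical outcome codes for one read attempt. */
enum read_status { READ_OK, READ_EOF, READ_STALLED, READ_ERROR };

/*
 * Wait up to timeout_ms for fd to become readable, then read once.
 * Returns READ_STALLED when no data arrives in time, instead of blocking
 * forever (blocking fd) or spinning on EAGAIN (nonblocking fd).
 */
static enum read_status read_with_timeout(int fd, void *buf, size_t len,
                                          int timeout_ms, ssize_t *got)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(fd >= 0 && buf != NULL && got != NULL);
    *got = 0;

    do {
        rc = poll(&pfd, 1, timeout_ms);   /* each retry restarts the timeout */
    } while (rc < 0 && errno == EINTR);
    if (rc < 0)
        return READ_ERROR;
    if (rc == 0)
        return READ_STALLED;              /* nothing within timeout_ms */
    if (pfd.revents & (POLLERR | POLLNVAL))
        return READ_ERROR;

    /* POLLIN or POLLHUP: read() reports either data or end-of-file. */
    ssize_t n = read(fd, buf, len);
    if (n < 0)
        return (errno == EAGAIN || errno == EWOULDBLOCK) ? READ_STALLED
                                                         : READ_ERROR;
    *got = n;
    return n == 0 ? READ_EOF : READ_OK;
}
```

A READ_STALLED result on its own cannot tell a weak signal from the lockup 
described here; it only guarantees the application gets control back and can 
decide what to do.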

I found this in messages on 2.6.23, and it does appear to happen around the 
time the failures start (it also happens on 2.6.24.4):

kernel: cx88[0]: mpeg risc op code error
kernel: cx88[0]: mpeg - dma channel status dump
kernel: cx88[0]:   cmds: initial risc: 0x37bcf000
kernel: cx88[0]:   cmds: cdt base    : 0x00180800
kernel: cx88[0]:   cmds: cdt size    : 0x0000000a
kernel: cx88[0]:   cmds: iq base     : 0x001807c0
kernel: cx88[0]:   cmds: iq size     : 0x00000010
kernel: cx88[0]:   cmds: risc pc     : 0x37bcf048
kernel: cx88[0]:   cmds: iq wr ptr   : 0x000001f2
kernel: cx88[0]:   cmds: iq rd ptr   : 0x000001f6
kernel: cx88[0]:   cmds: cdt current : 0x00000818
kernel: cx88[0]:   cmds: pci target  : 0x350115e0
kernel: cx88[0]:   cmds: line / byte : 0x01650000
kernel: cx88[0]:   risc0: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   risc1: 0x350115e0 [ arg #1 ]
kernel: cx88[0]:   risc2: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   risc3: 0x350118d0 [ arg #1 ]
kernel: cx88[0]:   iq 0: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   iq 1: 0x1aa78490 [ arg #1 ]
kernel: cx88[0]:   iq 2: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   iq 3: 0x350112f0 [ arg #1 ]
kernel: cx88[0]:   iq 4: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   iq 5: 0x350115e0 [ arg #1 ]
kernel: cx88[0]:   iq 6: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   iq 7: 0x350118d0 [ arg #1 ]
kernel: cx88[0]:   iq 8: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   iq 9: 0x35011bc0 [ arg #1 ]
kernel: cx88[0]:   iq a: 0x18000150 [ write sol count=336 ]
kernel: cx88[0]:   iq b: 0x35011eb0 [ arg #1 ]
kernel: cx88[0]:   iq c: 0x140001a0 [ write eol count=416 ]
kernel: cx88[0]:   iq d: 0x1aa78000 [ arg #1 ]
kernel: cx88[0]:   iq e: 0x1c0002f0 [ write sol eol count=752 ]
kernel: cx88[0]:   iq f: 0x1aa781a0 [ arg #1 ]
kernel: cx88[0]: fifo: 0x00186400 -> 0x187400
kernel: cx88[0]: ctrl: 0x001807c0 -> 0x180820
kernel: cx88[0]:   ptr1_reg: 0x00186790
kernel: cx88[0]:   ptr2_reg: 0x00180818
kernel: cx88[0]:   cnt1_reg: 0x00000014
kernel: cx88[0]:   cnt2_reg: 0x00000000
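
The RISC instruction words in the dump can be decoded by hand. A minimal sketch 
in C, assuming the bit layout inferred from the three distinct decoded words 
above (opcode in the top nibble, SOL and EOL as the next two bits, byte count in 
the low bits); the macro and function names are hypothetical, not the cx88 
driver's own identifiers, and the 12-bit count width is an assumption. Decoded 
this way, each full write is 752 bytes, i.e. 4 x 188-byte MPEG transport-stream 
packets, and iq a plus iq c (336 + 416 = 752) appear to be one such line split 
across two buffers (0x35011eb0 and 0x1aa78000).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Masks inferred from the dump (assumed, not copied from the driver). */
#define OPCODE_MASK 0xf0000000u
#define OP_WRITE    0x10000000u
#define FLAG_SOL    0x08000000u   /* start of line */
#define FLAG_EOL    0x04000000u   /* end of line */
#define COUNT_MASK  0x00000fffu   /* byte count; width is an assumption */

struct risc_write {
    int is_write;
    int sol, eol;
    unsigned count;
};

static struct risc_write decode_risc(uint32_t insn)
{
    struct risc_write w;
    w.is_write = (insn & OPCODE_MASK) == OP_WRITE;
    w.sol = (insn & FLAG_SOL) != 0;
    w.eol = (insn & FLAG_EOL) != 0;
    w.count = insn & COUNT_MASK;
    return w;
}

/* Print in roughly the same style as the kernel log lines. */
static void print_risc(uint32_t insn)
{
    struct risc_write w = decode_risc(insn);
    printf("0x%08x [ %s%s%s count=%u ]\n", (unsigned)insn,
           w.is_write ? "write" : "other",
           w.sol ? " sol" : "", w.eol ? " eol" : "", w.count);
}

/* Sanity check against three instruction words taken from the dump. */
static void check_against_dump(void)
{
    struct risc_write full  = decode_risc(0x1c0002f0u);  /* risc0 */
    struct risc_write first = decode_risc(0x18000150u);  /* iq a  */
    struct risc_write last  = decode_risc(0x140001a0u);  /* iq c  */

    assert(full.is_write && full.sol && full.eol && full.count == 752);
    assert(first.is_write && first.sol && !first.eol && first.count == 336);
    assert(last.is_write && !last.sol && last.eol && last.count == 416);
    /* The split line adds up to one full line of 4 x 188-byte packets. */
    assert(first.count + last.count == full.count && full.count == 4 * 188);
}
```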

Once it starts happening, it takes a module unload/reload or a reboot to get 
things working again.

From viewing the recording that was in progress at the time of the error, I 
believe this is a lockup caused by a less-than-perfect signal, and that after 
enough less-than-perfect-signal events something eventually stops working and 
locks up.

Is there any more graceful recovery possible than just not working?

Or does something fail at a lower level than the code reporting the above error?

At a minimum, it would probably be good to return errors to the applications 
accessing the devices when this sort of thing happens. Right now the 
applications don't notice the failure at all (except for not getting any 
data, which could just be a weak signal). But once this fault happens, it 
happens on every channel, even channels that never have signal issues, and 
ioctls and opens still appear to succeed even though the underlying modules 
are messed up and will never return any data until something is done.
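
Until the driver reports an error, an application could approximate one using 
the observation that this fault hits every channel, while a weak signal is 
specific to one channel. A minimal sketch in C of such an application-side 
heuristic, not a kernel fix; struct stall_tracker, tracker_record, 
MAX_CHANNELS and the threshold DISTINCT_STALLS_FOR_FAULT are all hypothetical, 
and the threshold of three distinct channels is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CHANNELS 128
#define DISTINCT_STALLS_FOR_FAULT 3   /* arbitrary threshold, tune as needed */

struct stall_tracker {
    bool stalled[MAX_CHANNELS];   /* channels that stalled since last data */
    int distinct;                 /* number of distinct stalled channels */
};

static void tracker_reset(struct stall_tracker *t)
{
    memset(t, 0, sizeof *t);
}

/*
 * Record whether a recording/tuning attempt on `channel` produced data.
 * Returns true when stalls have spread across enough different channels
 * (with no data anywhere in between) to suspect the device rather than
 * a weak signal on a single channel.
 */
static bool tracker_record(struct stall_tracker *t, int channel, bool got_data)
{
    assert(channel >= 0 && channel < MAX_CHANNELS);
    if (got_data) {
        tracker_reset(t);   /* any data at all: the device still works */
        return false;
    }
    if (!t->stalled[channel]) {
        t->stalled[channel] = true;
        t->distinct++;
    }
    return t->distinct >= DISTINCT_STALLS_FOR_FAULT;
}
```

Repeated stalls on one weak channel never trip the check, while stalls on 
several different channels in a row do. On a true result the application could 
surface an error to the user instead of silently producing empty recordings.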

                         Roger

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/