Message-ID: <54B199B6.3030103@apollo.lv>
Date:	Sat, 10 Jan 2015 23:29:26 +0200
From:	Raimonds Cicans <ray@...llo.lv>
To:	linux-kernel@...r.kernel.org
Subject: [REGRESSION] media: cx23885 broken by commit
 453afdd9ce33293f640e84dc17e5f366701516e8  (was: Help needed: complex case
 bisection (TBS6981))

TL;DR:
media: cx23885 broken by commit 453afdd9ce33293f640e84dc17e5f366701516e8
"[media] cx23885: convert to vb2"
"Broken" means: up to this commit the driver was rock solid; after it I
started to receive IOMMU-related warnings, and sometimes the card stopped
working.


Full report:

On 09.01.2015 10:34, Raimonds Cicans wrote:
> History of problem:
> 1) I own computer based on AMD Athlon(tm) II X2 240e Processor on Asus 
> M5A97 LE R2.0 motherboard
> 2) I own TBS6981 card (Dual DVB-S/S2 PCIe receiver, in kernel driver)
> 3) I used kernel 3.13.something
> 4) everything was fine
> 5) from time to time I tried to upgrade to newer kernels
>     but I hit an AMD IOMMU driver regression
>     (AMD-Vi: Completion-Wait loop timed out)
> 6) I tried to disable the IOMMU, but this led to problems with the NIC
> and the USB controller
> 7) I was forced to upgrade to newer kernel (I needed all new fixes for 
> BTRFS file system)
> 8) I bought TBS6285 (Quad DVB-T/T2 PCIe receiver)
> 9) I upgraded to kernel 3.17.7
>     the AMD IOMMU driver regression disappeared,
>     but two IOMMU-related problems appeared with the TBS6981:
>
> WARNING: CPU: 0 PID: 13204 at drivers/iommu/amd_iommu.c:2625 
> dma_ops_domain_unmap.part.9+0x4d/0x56()
>
> and
>
> AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x001c 
> address=0x0000000001355000 flags=0x0000]
>
> As I understand it, the first message means "we tried to unmap the same
> DMA region twice"
> and the second means "we tried to DMA to/from a region that does not exist"
>
> IMHO this means that the cause of both messages can be a single commit
>
I was naive... In reality it is not so simple.

> Hypotheses:
> 1) Bug(s) in motherboard's hardware/BIOS (why it worked before?)
Disproved: the card works rock solid with a plain 3.13.10 kernel

> 2) TBS6981 conflicts with TBS6285
Disproved: the card has the same problems with the new drivers,
                   even without the TBS6285 in the computer

> 3) Bug(s) in TBS6981 driver
Very likely: the problems started with commit
453afdd9ce33293f640e84dc17e5f366701516e8
                  which changed the cx23885 driver

> 4) Bug(s) in media subsystem (video buffer dma part)
Unlikely: only if the cx23885 driver uses this part in some specific way

> 5) Bug(s) in AMD IOMMU driver
Very unlikely: this would cause problems with other systems and drivers

> Bisection plan:
> 0. take out from computer all unnecessary hardware (including TBS6285)
> 1. install latest known good kernel (3.13.something)
>     I will use this kernel because I want to rule out AMD IOMMU 
> regression
> 2. cold reboot and test everything is working
Everything was working... almost: the card lost one receiver.
I took the card out of the second PCIe slot and put it in the third:
the first receiver appeared, but the card lost the second receiver.
I put the card in the fourth slot and both receivers reappeared.
(I need to investigate this more deeply.)

> 3. take linux-media tree and compile drivers from HEAD
It took several hours to get the linux-media build system to work

> 4. cold reboot and test
>     if everything is working, then
>                     a) problem is fixed in HEAD
>                         or
>                     b) compatibility problem with TBS6285
>                         or
>                     c) problem is related to AMD IOMMU driver 
> regression in
>                         newer kernels
>                     to distinguish between this cases I should build
>                     newest kernel with HEAD media drivers
>                     if everything is working then case a) or b)
>                             and I must put TBS6285 back and test again
>                     else this is case c) and I should bisect 
> linux-kernel tree for this problem
>                             (git bisect start; git bisect bad v3.14; 
> git bisect good v3.13)
>
>                     end of testing
>
>     if TBS6981 driver misbehaves, then I should git bisect linux-media 
> tree
Driver misbehaved

> 5. bisect linux-media tree
>     (git bisect start -- drivers/media; git bisect bad v3.17; git 
> bisect good v3.13)
>
>     if I find single commit that is cause for both messages then stop
>
>     if at some commit only one message appear, then I should write down
>     good/bad region and continue with first message and then do
>     new bisection for other message but on reduced region
Soon a new bug appeared:
kernel BUG at mm/slub.c:1394!

The main problem was that this bug appeared much faster than the first two
bugs and often caused the computer to lock up.
I decided to treat this error the same way as the previous two:  git bisect bad

When I tested the offending commit I received two errors: the second
(IO_PAGE_FAULT) and the third (kernel BUG at mm/slub.c)
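
The rule used above (any of the three symptoms means "git bisect bad") can
be written down as a small helper. This is only a sketch: the function name
classify_log and the idea of saving the kernel log to a file after each test
run are assumptions for illustration, not something that was used during this
bisection. The return codes follow the convention of "git bisect run"
(0 = good, 1 = bad, 125 = skip).

```shell
#!/bin/sh
# classify_log FILE (hypothetical helper): map a saved kernel log to a
# git-bisect verdict.
#   1   -> "bad":  at least one of the three symptoms from this report
#   0   -> "good": none of them found
#   125 -> "skip": the log file is missing or unreadable
classify_log() {
    [ -r "$1" ] || return 125
    if grep -q -e 'dma_ops_domain_unmap' \
               -e 'IO_PAGE_FAULT' \
               -e 'kernel BUG at mm/slub.c' "$1"; then
        return 1
    fi
    return 0
}
```

After each test boot one could save the output of dmesg to a file and call
classify_log on it. Because the slub bug often locks up the machine, the log
may never be written; in that case the verdict still has to be entered by
hand.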



What now?

I see two ways:

1) Investigate the problem more deeply and fix the bug in the current driver.
     This is the ideal solution, but IMHO it is far beyond my abilities.

2) Just revert the offending commit(s).
     BTW, can somebody show me how to create a new branch and revert
     a commit together with all the commits that depend on it?
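
For option 2, one possible sequence is sketched below on a throwaway
repository. Everything in it is hypothetical (the file driver.c, the commit
messages, the branch name revert-vb2); it only illustrates the idea: create a
branch, list the offending commit and every later commit touching the same
path (newest first), and revert them one by one in that order.

```shell
#!/bin/sh
# Sketch on a throwaway repository; file, branch and commit names are
# hypothetical stand-ins for the real driver history.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

echo "old buffer code" > driver.c
git add driver.c
git commit -q -m "baseline (known good)"

echo "vb2 buffer code" > driver.c
git commit -q -a -m "convert to vb2"
culprit=$(git rev-parse HEAD)

echo "vb2 buffer code, follow-up fix" > driver.c
git commit -q -a -m "follow-up that depends on the conversion"

# 1. Create a new branch at the current tip and switch to it.
git checkout -q -b revert-vb2

# 2. Walk from the culprit (inclusive) to HEAD, newest commit first,
#    limited to the affected path, and revert each commit in that order.
for c in $(git rev-list "$culprit^..HEAD" -- driver.c); do
    git revert --no-edit "$c" >/dev/null
done

result=$(cat driver.c)
echo "$result"
```

In a real kernel tree the path would be the driver directory instead of
driver.c, and the list printed by git rev-list should be reviewed first:
later commits may touch other files as well, and a revert can stop with
conflicts that have to be resolved by hand.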



Raimonds Cicans
