linux-kernel - RE: Linux 3.8-rc4


Message-ID: <A3397C8B8B789E45844E7EC5DEAD89D02DF0A37D@sausexdag04.amd.com>
Date:	Tue, 29 Jan 2013 20:13:08 +0000
From:	"Deucher, Alexander" <Alexander.Deucher@....com>
To:	Shuah Khan <shuahkhan@...il.com>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: RE: Linux 3.8-rc4

> -----Original Message-----
> From: Shuah Khan [mailto:shuahkhan@...il.com]
> Sent: Tuesday, January 29, 2013 2:11 PM
> To: Deucher, Alexander
> Cc: Linus Torvalds; Linux Kernel Mailing List
> Subject: Re: Linux 3.8-rc4
> 
> On Tue, Jan 29, 2013 at 6:05 AM, Deucher, Alexander
> <Alexander.Deucher@....com> wrote:
> >> -----Original Message-----
> >> I was out sick for a few days and finally picked this bisect back up
> >> again. I started at 3.7 tag instead of 3.8-rc1 that I did in the past
> >> and also did bisect at drivers/gpu/drm/radeon instead. Here are the
> >> results:
> >>
> >> 6253e4c75d96006c06b9ac8f417eba873de2497b is the first bad commit
> >> commit 6253e4c75d96006c06b9ac8f417eba873de2497b
> >> Author: Alex Deucher <alexander.deucher@....com>
> >> Date:   Wed Dec 12 14:30:32 2012 -0500
> >>
> >>     drm/radeon: improve mc_stop/mc_resume on r5xx-r7xx
> >>
> >>     Along the same lines of what was done for evergreen+
> >>     in the last kernel.
> >>
> >>     Signed-off-by: Alex Deucher <alexander.deucher@....com>
> >>
> >> git bisect log attached.
> >>
> >
> > Try the attached patch.  I think it should fix the issue.  I just applied a similar
> > patch for newer asics.
> >
> > Alex
> >
> 
> I reverted 6253e4c75d96006c06b9ac8f417eba873de2497b and DMAR faults
> went away. Undid the revert and applied your new patch. DMAR faults
> are back again.
> 
> 
> [   25.158653] [drm] PCIE GART of 512M enabled (table at
> 0x0000000000040000).
> [   25.158715] radeon 0000:01:00.0: WB enabled
> [   25.158719] radeon 0000:01:00.0: fence driver on ring 0 use gpu
> addr 0x0000000008000c00 and cpu addr 0xffff88002f143c00
> [   25.158721] radeon 0000:01:00.0: fence driver on ring 3 use gpu
> addr 0x0000000008000c0c and cpu addr 0xffff88002f143c0c
> 
> A few observations and questions about r600_startup() code sequence:
> 
> I notice DMAR faults right after
> 
> [drm] Loading RV620 Microcode message which is from
> r600_init_microcode(). This routine does a series of
> request_firmware() calls. btw. don't see release_firmware() calls in
> regular code path, only from error legs in r600_init_microcode().
> 
> However, this routine doesn't do any loading yet. When this routine
> returns, I am assuming request_firmware() step isn't complete yet
> based on my reading request_firmware() interface. At this point
> r600_startup() keeps chugging along, and does r600_mc_program() which
> in turn calls rv515_mc_stop() which was changed with the
> 6253e4c75d96006c06b9ac8f417eba873de2497b commit.
> 
> I am thinking the changes somehow eliminated a wait or delay that used
> to be there for request_firmware() step to complete (?)
> 
> I can see from dmesg that the faults occur right after:
> 
> r600_init_microcode(rdev);
> 
> and stop before r600_pcie_gart_enable()

r600_init_microcode() doesn't actually touch the hardware; it just calls request_firmware() to fetch the microcode images from disk.  The microcode doesn't get loaded onto the hardware until r600_cp_load_microcode(), much later in the function.  I don't think the microcode has anything to do with this.

rv515_mc_stop() stops GPU memory clients (e.g., the displays) and blacks out the GPU memory controller so that we can change the location of VRAM within the GPU's address space.  If a display controller's request to stop issuing memory requests takes too long to go through for some reason, the display hardware may momentarily attempt to read from a GPU memory location that is no longer backed by VRAM (since we changed the location of VRAM in r600_mc_program()) until the stop request goes through.  Does the attached updated version of the patch help?  Alternatively, you can try adding delays to the end of rv515_mc_stop() and see if that helps.
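
To illustrate the timing issue described above, here is a rough user-space sketch of the "wait until the client has really stopped, with a bounded timeout" idea.  It is not the attached patch and not radeon code: struct fake_client, fake_client_idle() and wait_for_client_idle() are hypothetical names made up for this example, and the poll counter merely stands in for a hardware status read.  In a driver, each failed poll would normally be followed by a short delay such as udelay(1).

```c
#include <stdbool.h>

/* Hypothetical stand-in for a display controller; not a radeon structure. */
struct fake_client {
	unsigned int polls_until_idle;	/* failed polls left before the stop request lands */
};

/*
 * Hypothetical status read: returns true once the client has stopped
 * issuing memory requests.  Real code would read a hardware register.
 */
bool fake_client_idle(struct fake_client *c)
{
	if (c->polls_until_idle == 0)
		return true;
	c->polls_until_idle--;
	return false;
}

/*
 * Poll the client up to max_polls times.  Returns true if it went idle
 * in time, so the caller knows it is safe to move VRAM; false means the
 * stop request was still pending when we gave up.
 */
bool wait_for_client_idle(struct fake_client *c, unsigned int max_polls)
{
	unsigned int i;

	for (i = 0; i < max_polls; i++) {
		if (fake_client_idle(c))
			return true;
		/* a driver would delay here, e.g. udelay(1) */
	}
	return false;
}
```

A caller would only go on to reprogram the VRAM location when wait_for_client_idle() returns true, and could log a warning otherwise.  A fixed delay at the end of rv515_mc_stop() is the cruder version of the same experiment.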

Alex
 

Download attachment "0001-drm-radeon-fix-MC-blackout-on-r5xx-r7xx-v2.patch" of type "application/octet-stream" (3347 bytes)