Message-ID: <8cdf0dd0-2a2f-bae9-71ea-89a88fdb14a5@redhat.com>
Date: Mon, 2 Nov 2020 10:48:54 +0100
From: Hans de Goede <hdegoede@...hat.com>
To: rwright@....com, jani.nikula@...ux.intel.com,
joonas.lahtinen@...ux.intel.com, rodrigo.vivi@...el.com,
airlied@...ux.ie, daniel@...ll.ch, sumit.semwal@...aro.org,
christian.koenig@....com, wambui.karugax@...il.com,
chris@...is-wilson.co.uk, matthew.auld@...el.com,
akeem.g.abodunrin@...el.com, prathap.kumar.valsan@...el.com,
mika.kuoppala@...ux.intel.com
Cc: intel-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, linux-media@...r.kernel.org
Subject: Re: [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang
Hi,
On 11/1/20 6:41 PM, rwright@....com wrote:
> From: Randy Wright <rwright@....com>
>
> For several months, I've been experiencing GPU hangs when starting
> Cinnamon on an HP Pavilion Mini 300-020 if I try to run an upstream
> kernel. I reported this recently in
> https://gitlab.freedesktop.org/drm/intel/-/issues/2413 where I have
> attached the requested evidence including the state collected from
> /sys/class/drm/card0/error and debug output from dmesg.
>
> I ran a bisect to find the problem, which indicates this is the
> troublesome commit:
>
> [47f8253d2b8947d79fd3196bf96c1959c0f25f20] drm/i915/gen7: Clear all EU/L3 residual contexts
>
> The nature of that commit suggested to me that reducing the
> batch size used in the context clear operation might help this
> relatively low-powered system to avoid the hang.... and it did!
> I simply forced this system to take the smaller batch length that is
> already used for non-Haswell systems.
>
> The first two versions of this patch were posted as RFC
> patches to the Intel-gfx list, implementing the same
> algorithmic change in function batch_get_defaults,
> but without employing a properly constructed quirk.
>
> I've now cleaned up the patch to employ a new QUIRK_RENDERCLEAR_REDUCED.
> The quirk is presently set only for the aforementioned HP Pavilion Mini
> 300-020. The patch now touches three files to define the quirk, set it,
> and then check for it in function batch_get_defaults.
Note that I'm not really an i915 dev.
With that said, I do wonder whether we should use the
reduced batch size in a lot more cases. The machine in question uses a
3558U CPU; if the iGPU of that CPU has this issue, then I would expect
pretty much all Haswell U models (at a minimum) to have this issue.
So solving this with a quirk for just the HP Pavilion Mini 300-020
seems wrong to me. I think we need a more generic way of enabling
the reduced batch size. I even wonder if we should not simply use
it everywhere. Since you do have a proper Haswell CPU, I guess
it being a U model makes the hang easier to trigger, but I suspect
the higher-TDP ones may also still be susceptible ...
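To make the three options concrete, here is a minimal standalone sketch of
the decision being discussed. It is not the actual i915 code: the struct,
the is_ult flag, the policy enum and the quirk bit value are all
hypothetical stand-ins, and only the name QUIRK_RENDERCLEAR_REDUCED is taken
from the patch. The one behaviour it assumes from the cover letter is that
non-Haswell systems already use the smaller batch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the real i915 types. */
enum gpu_platform { PLATFORM_IVYBRIDGE, PLATFORM_HASWELL };

/* Quirk name from the patch; the bit value here is an assumption. */
#define QUIRK_RENDERCLEAR_REDUCED (1UL << 0)

struct gpu_info {
	enum gpu_platform platform;
	bool is_ult;		/* hypothetical: true for U-series parts */
	unsigned long quirks;	/* bitmask of per-machine quirks */
};

/* The three approaches discussed in this thread. */
enum policy {
	POLICY_QUIRK_ONLY,	/* v3 patch: only machines with the quirk set */
	POLICY_ALL_HSW_U,	/* broader: every Haswell U model */
	POLICY_EVERYWHERE,	/* simplest: always use the reduced batch */
};

/* Returns true when the context clear should use the reduced batch size. */
static bool use_reduced_batch(const struct gpu_info *gpu, enum policy p)
{
	bool quirked = (gpu->quirks & QUIRK_RENDERCLEAR_REDUCED) != 0;

	/* Per the cover letter, non-Haswell already takes the smaller batch. */
	if (gpu->platform != PLATFORM_HASWELL)
		return true;

	switch (p) {
	case POLICY_QUIRK_ONLY:
		return quirked;
	case POLICY_ALL_HSW_U:
		return gpu->is_ult || quirked;
	case POLICY_EVERYWHERE:
		return true;
	}
	return false;
}
```

With POLICY_QUIRK_ONLY, a second Haswell U machine without a quirk entry
would still get the full batch, which is exactly my concern above.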
Regards,
Hans