linux-kernel - Re: [PATCH v11 2/7] drm/i915/skl: Add support for the SAGV, fix underrun hangs

Message-ID: <20160812120410.GY4329@intel.com>
Date:	Fri, 12 Aug 2016 15:04:10 +0300
From:	Ville Syrjälä <ville.syrjala@...ux.intel.com>
To:	Lyude <cpaul@...hat.com>
Cc:	intel-gfx@...ts.freedesktop.org,
	Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
	Matt Roper <matthew.d.roper@...el.com>,
	Daniel Vetter <daniel.vetter@...ll.ch>, stable@...r.kernel.org,
	Daniel Vetter <daniel.vetter@...el.com>,
	Jani Nikula <jani.nikula@...ux.intel.com>,
	David Airlie <airlied@...ux.ie>,
	dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v11 2/7] drm/i915/skl: Add support for the SAGV, fix
 underrun hangs

On Thu, Aug 11, 2016 at 03:54:31PM -0400, Lyude wrote:
> Since the watermark calculations for Skylake are still broken, we're apt
> to hitting underruns very easily under multi-monitor configurations.
> While it would be lovely if this was fixed, it's not. Another problem
> that's been coming from this however, is the mysterious issue of
> underruns causing full system hangs. An easy way to reproduce this with
> a skylake system:
> 
> - Get a laptop with a skylake GPU, and hook up two external monitors to
>   it
> - Move the cursor from the built-in LCD to one of the external displays
>   as quickly as you can
> - You'll get a few pipe underruns, and eventually the entire system will
>   just freeze.
> 
> After doing a lot of investigation and reading through the bspec, I
> found the existence of the SAGV, which is responsible for adjusting the
> system agent voltage and clock frequencies depending on how much power
> we need. According to the bspec:
> 
> "The display engine access to system memory is blocked during the
>  adjustment time. SAGV defaults to enabled. Software must use the
>  GT-driver pcode mailbox to disable SAGV when the display engine is not
>  able to tolerate the blocking time."
> 
> The rest of the bspec goes on to explain that software can simply leave
> the SAGV enabled, and disable it when we use interlaced pipes/have more
> then one pipe active.
> 
> Sure enough, with this patchset the system hangs resulting from pipe
> underruns on Skylake have completely vanished on my T460s. Additionally,
> the bspec mentions turning off the SAGV	with more then one pipe enabled
> as a workaround for display underruns. While this patch doesn't entirely
> fix that, it looks like it does improve the situation a little bit so
> it's likely this is going to be required to make watermarks on Skylake
> fully functional.
> 
> Changes since v10:
>  - Apparently sandybridge_pcode_read actually writes values and reads
>    them back, despite it's misleading function name. This means we've
>    been doing this mostly wrong and have been writing garbage to the
>    SAGV control. Because of this, we no longer attempt to read the SAGV
>    status during initialization (since there are no helpers for this).
>  - mlankhorst noticed that this patch was breaking on some very early
>    pre-release Skylake machines, which apparently don't allow you to
>    disable the SAGV. To prevent machines from failing tests due to SAGV
>    errors, if the first time we try to control the SAGV results in the
>    mailbox indicating an invalid command, we just disable future attempts
>    to control the SAGV state by setting dev_priv->skl_sagv_status to
>    I915_SKL_SAGV_NOT_CONTROLLED and make a note of it in dmesg.
>  - Move mutex_unlock() a little higher in skl_enable_sagv(). This
>    doesn't actually fix anything, but lets us release the lock a little
>    sooner since we're finished with it.
> Changes since v9:
>  - Only enable/disable sagv on Skylake
> Changes since v8:
>  - Add intel_state->modeset guard to the conditional for
>    skl_enable_sagv()
> Changes since v7:
>  - Remove GEN9_SAGV_LOW_FREQ, replace with GEN9_SAGV_IS_ENABLED (that's
>    all we use it for anyway)
>  - Use GEN9_SAGV_IS_ENABLED instead of 0x1 for clarification
>  - Fix a styling error that snuck past me
> Changes since v6:
>  - Protect skl_enable_sagv() with intel_state->modeset conditional in
>    intel_atomic_commit_tail()
> Changes since v5:
>  - Don't use is_power_of_2. Makes things confusing
>  - Don't use the old state to figure out whether or not to
>    enable/disable the sagv, use the new one
>  - Split the loop in skl_disable_sagv into it's own function
>  - Move skl_sagv_enable/disable() calls into intel_atomic_commit_tail()
> Changes since v4:
>  - Use is_power_of_2 against active_crtcs to check whether we have > 1
>    pipe enabled
>  - Fix skl_sagv_get_hw_state(): (temp & 0x1) indicates disabled, 0x0
>    enabled
>  - Call skl_sagv_enable/disable() from pre/post-plane updates
> Changes since v3:
>  - Use time_before() to compare timeout to jiffies
> Changes since v2:
>  - Really apply minor style nitpicks to patch this time
> Changes since v1:
>  - Added comments about this probably being one of the requirements to
>    fixing Skylake's watermark issues
>  - Minor style nitpicks from Matt Roper
>  - Disable these functions on Broxton, since it doesn't have an SAGV
> 
> Signed-off-by: Lyude <cpaul@...hat.com>
> Cc: Matt Roper <matthew.d.roper@...el.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>
> Cc: Daniel Vetter <daniel.vetter@...ll.ch>
> Cc: Ville Syrjälä <ville.syrjala@...ux.intel.com>
> Cc: stable@...r.kernel.org
> ---
>  drivers/gpu/drm/i915/i915_drv.h      |  7 +++
>  drivers/gpu/drm/i915/i915_reg.h      |  4 ++
>  drivers/gpu/drm/i915/intel_display.c | 12 +++++
>  drivers/gpu/drm/i915/intel_drv.h     |  2 +
>  drivers/gpu/drm/i915/intel_pm.c      | 89 ++++++++++++++++++++++++++++++++++++
>  5 files changed, 114 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 7971c76..d74d166 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1945,6 +1945,13 @@ struct drm_i915_private {
>  	struct i915_suspend_saved_registers regfile;
>  	struct vlv_s0ix_state vlv_s0ix_state;
>  
> +	enum {
> +		I915_SKL_SAGV_UNKNOWN = 0,
> +		I915_SKL_SAGV_DISABLED,
> +		I915_SKL_SAGV_ENABLED,
> +		I915_SKL_SAGV_NOT_CONTROLLED
> +	} skl_sagv_status;
> +
>  	struct {
>  		/*
>  		 * Raw watermark latency values:
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 73b3d4d..4980cfe 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -7144,6 +7144,10 @@ enum {
>  #define   HSW_PCODE_DE_WRITE_FREQ_REQ		0x17
>  #define   DISPLAY_IPS_CONTROL			0x19
>  #define	  HSW_PCODE_DYNAMIC_DUTY_CYCLE_CONTROL	0x1A
> +#define   GEN9_PCODE_SAGV_CONTROL		0x21
> +#define     GEN9_SAGV_DISABLE			0x0
> +#define     GEN9_SAGV_IS_DISABLED		0x1
> +#define     GEN9_SAGV_DYNAMIC_FREQ              0x3

Hmm. The definition of these bits is definitely peculiar. Unfortunately
the spec doesn't seem to explain them. First I thought bit 0 might be
more of an ack bit, but then why would we set it for the "enable" request?

I'd maybe do s/DYNAMIC_FREQ/ENABLE/ to make it more clear what is the
counterpart to DISABLE.

>  #define GEN6_PCODE_DATA				_MMIO(0x138128)
>  #define   GEN6_PCODE_FREQ_IA_RATIO_SHIFT	8
>  #define   GEN6_PCODE_FREQ_RING_RATIO_SHIFT	16
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index c5c0c35..35bdd67 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -14142,6 +14142,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		     intel_state->cdclk_pll_vco != dev_priv->cdclk_pll.vco))
>  			dev_priv->display.modeset_commit_cdclk(state);
>  
> +		/*
> +		 * SKL workaround: bspec recommends we disable the SAGV when we
> +		 * have more then one pipe enabled
> +		 */

It doesn't actually say that AFAICS. What it says is that *if* you
guarantee that in single pipe mode you never run afoul of the SAGV
block time (30us on SKL), then you can simply disable SAGV depending
on the number of active pipes. I guess that can sort of imply that you
shouldn't enable it with multiple pipes, but it's a very vague way of
saying that for sure.

Then it goes on to say something to the effect that if you can't enable
watermark levels with latency >= SAGV block time, then you shouldn't
enable SAGV either. I would interpret that to mean that you have to have
enabled at least the first watermark level with higher latency than
the SAGV block time, since that means the plane can also tolerate
the SAGV block time.

Eg. on one SKL I have the following WM latencies:
WM0 latency 2 (2.0 usec)
WM1 latency 19 (19.0 usec)
WM2 latency 28 (28.0 usec)
WM3 latency 32 (32.0 usec)
WM4 latency 63 (63.0 usec)
WM5 latency 77 (77.0 usec)
WM6 latency 83 (83.0 usec)
WM7 latency 99 (99.0 usec)

So if we can't enable WM3 for all of the active planes, then we should
not enable SAGV. But if all planes can enable WM3 or higher, we can
also enable SAGV.
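
A rough sketch of that check, only to illustrate the rule as I read it.
It is standalone C, not i915 code: the function names, the per-plane
max_level[] input and the hard-coded 30us constant are all made up for
the example, and it assumes the latency table is sorted in ascending
order like the one above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* SAGV block time on SKL as quoted above; hard-coded for illustration. */
#define SAGV_BLOCK_TIME_US 30

/*
 * Return the lowest WM level whose latency covers the SAGV block time,
 * or -1 if no level does. Assumes latency_us[] is sorted ascending.
 */
static int first_level_covering_sagv(const unsigned int *latency_us,
				     size_t num_levels)
{
	for (size_t i = 0; i < num_levels; i++)
		if (latency_us[i] >= SAGV_BLOCK_TIME_US)
			return (int)i;
	return -1;
}

/*
 * max_level[i] is the highest WM level enabled for active plane i.
 * This is a hypothetical input; the driver does not expose it this way.
 */
static bool can_enable_sagv(const unsigned int *latency_us, size_t num_levels,
			    const int *max_level, size_t num_planes)
{
	int needed;

	assert(num_levels > 0);

	needed = first_level_covering_sagv(latency_us, num_levels);
	if (needed < 0)
		return false;

	/* Every active plane must reach the level that covers the block time. */
	for (size_t i = 0; i < num_planes; i++)
		if (max_level[i] < needed)
			return false;

	return true;
}
```

With the latencies listed above the required level comes out as WM3, so
a single plane stuck at WM2 would be enough to keep SAGV off.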

Anyways, doing it all properly seems like a bit more work, so we can
definitely leave it for a future improvement.

> +		if (IS_SKYLAKE(dev_priv) &&
> +		    hweight32(intel_state->active_crtcs) > 1)
> +			skl_disable_sagv(dev_priv);
> +
>  		intel_modeset_verify_disabled(dev);
>  	}
>  
> @@ -14215,6 +14223,10 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		intel_modeset_verify_crtc(crtc, old_crtc_state, crtc->state);
>  	}
>  
> +	if (IS_SKYLAKE(dev_priv) && intel_state->modeset &&
> +	    hweight32(intel_state->active_crtcs) <= 1)
> +		skl_enable_sagv(dev_priv);
> +
>  	drm_atomic_helper_commit_hw_done(state);
>  
>  	if (intel_state->modeset)
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 9539f0e..76e78b8 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -1721,6 +1721,8 @@ void ilk_wm_get_hw_state(struct drm_device *dev);
>  void skl_wm_get_hw_state(struct drm_device *dev);
>  void skl_ddb_get_hw_state(struct drm_i915_private *dev_priv,
>  			  struct skl_ddb_allocation *ddb /* out */);
> +int skl_enable_sagv(struct drm_i915_private *dev_priv);
> +int skl_disable_sagv(struct drm_i915_private *dev_priv);
>  uint32_t ilk_pipe_pixel_rate(const struct intel_crtc_state *pipe_config);
>  bool ilk_disable_lp_wm(struct drm_device *dev);
>  int sanitize_rc6_option(struct drm_i915_private *dev_priv, int enable_rc6);
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index 8752730..0a202264 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -2883,6 +2883,95 @@ skl_wm_plane_id(const struct intel_plane *plane)
>  	}
>  }
>  
> +/*
> + * SAGV dynamically adjusts the system agent voltage and clock frequencies
> + * depending on power and performance requirements. The display engine access
> + * to system memory is blocked during the adjustment time. Having this enabled
> + * in multi-pipe configurations can cause issues (such as underruns causing
> + * full system hangs), and the bspec also suggests that software disable it
> + * when more then one pipe is enabled.
> + */
> +int
> +skl_enable_sagv(struct drm_i915_private *dev_priv)
> +{
> +	int ret;
> +
> +	if (dev_priv->skl_sagv_status &&

Was that supposed to check for NOT_CONTROLLED?
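
Going by the enum values alone, the condition as posted already returns
early for NOT_CONTROLLED as well as ENABLED, it just doesn't say so. A
standalone sketch comparing the two spellings; the helper names are
hypothetical and the enum only mirrors the one added in i915_drv.h:

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the enum this patch adds to struct drm_i915_private. */
enum sagv_status {
	SAGV_UNKNOWN = 0,
	SAGV_DISABLED,
	SAGV_ENABLED,
	SAGV_NOT_CONTROLLED,
};

/* The condition as quoted: any nonzero status other than DISABLED skips. */
static bool skip_enable_as_posted(enum sagv_status s)
{
	return s && s != SAGV_DISABLED;
}

/* Hypothetical explicit spelling of the same early return. */
static bool skip_enable_explicit(enum sagv_status s)
{
	return s == SAGV_NOT_CONTROLLED || s == SAGV_ENABLED;
}

/* Check that both spellings agree for every status value. */
static void check_guards_agree(void)
{
	for (int s = SAGV_UNKNOWN; s <= SAGV_NOT_CONTROLLED; s++)
		assert(skip_enable_as_posted(s) == skip_enable_explicit(s));
}
```

The explicit form is longer but doesn't depend on UNKNOWN being 0.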

> +	    dev_priv->skl_sagv_status != I915_SKL_SAGV_DISABLED)
> +		return 0;
> +
> +	mutex_lock(&dev_priv->rps.hw_lock);
> +	DRM_DEBUG_KMS("Enabling the SAGV\n");
> +

printk can be outside the mutex.

> +	ret = sandybridge_pcode_write(dev_priv, GEN9_PCODE_SAGV_CONTROL,
> +				      GEN9_SAGV_DYNAMIC_FREQ);
> +	mutex_unlock(&dev_priv->rps.hw_lock);
> +
> +	if (!ret) {
> +		dev_priv->skl_sagv_status = I915_SKL_SAGV_ENABLED;
> +	} else if (ret == -EINVAL) {
> +		DRM_DEBUG_DRIVER("No SAGV found on system, ignoring\n");
> +		dev_priv->skl_sagv_status = I915_SKL_SAGV_NOT_CONTROLLED;
> +		ret = 0;
> +	} else {
> +		DRM_ERROR("Failed to enable the SAGV\n");
> +	}
> +
> +	/* We don't need to wait for SAGV when enabling */
> +	return ret;
> +}
> +
> +static int
> +skl_do_sagv_disable(struct drm_i915_private *dev_priv)
> +{
> +	int ret;
> +	uint32_t temp = GEN9_SAGV_DISABLE;
> +
> +	ret = sandybridge_pcode_read(dev_priv, GEN9_PCODE_SAGV_CONTROL,
> +				     &temp);
> +	if (ret) {
> +		/*
> +		 * Some very early Skylake systems don't actually let you
> +		 * control the SAGV, which is normal.
> +		 */
> +		if (ret != -EINVAL)
> +			DRM_ERROR("Failed to disable the SAGV\n");

Why are we printing errors both here and in the caller?

> +
> +		return ret;
> +	}
> +
> +	return temp & GEN9_SAGV_IS_DISABLED;
> +}
> +
> +int
> +skl_disable_sagv(struct drm_i915_private *dev_priv)
> +{
> +	int ret, result;
> +
> +	if (dev_priv->skl_sagv_status &&

same question about NOT_CONTROLLED

> +	    dev_priv->skl_sagv_status != I915_SKL_SAGV_ENABLED)
> +		return 0;
> +
> +	mutex_lock(&dev_priv->rps.hw_lock);
> +	DRM_DEBUG_KMS("Disabling the SAGV\n");

this printk can move up as well

> +
> +	/* bspec says to keep retrying for at least 1 ms */
> +	ret = wait_for(result = skl_do_sagv_disable(dev_priv), 1);
> +	mutex_unlock(&dev_priv->rps.hw_lock);
> +
> +	if (!ret) {
> +		dev_priv->skl_sagv_status = I915_SKL_SAGV_DISABLED;
> +	} else if (ret == -ETIMEDOUT) {
> +		DRM_ERROR("Request to disable SAGV timed out\n");
> +	} else if (result == -EINVAL) {
> +		DRM_DEBUG_DRIVER("No SAGV found on system, ignoring\n");
> +		dev_priv->skl_sagv_status = I915_SKL_SAGV_NOT_CONTROLLED;
> +		ret = 0;
> +	}

The ret/result thing here looks messy. Maybe deal with 'ret' first,
and then with 'result':

if (ret) {
	DRM_ERROR...
	return ret;
}

switch (result) {
case -EINVAL:
	...
case 0:
case 1:
	...
default:
	...
}

or whatever are all the different cases you want to distinguish. Or did
I miss some subtle logic here?
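
To make the suggestion concrete, here is a standalone sketch of the
"ret first, then result" ordering. It is not driver code: the function
sagv_disable_finish() and the enum names are invented for the example,
and it assumes the wait returns 0 or -ETIMEDOUT while 'result' is either
a negative errno or the "is disabled" bit (1).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sagv_status {
	SAGV_UNKNOWN = 0,
	SAGV_DISABLED,
	SAGV_ENABLED,
	SAGV_NOT_CONTROLLED,
};

/*
 * wait_ret: what the polling loop returned (0 or -ETIMEDOUT).
 * result:   last value from the pcode request (negative errno, or 1 once
 *           the hardware reports SAGV as disabled).
 * Returns 0 on success or a negative errno; *status is only updated when
 * the outcome is known.
 */
static int sagv_disable_finish(int wait_ret, int result,
			       enum sagv_status *status)
{
	assert(status != NULL);

	/* Deal with the wait itself first: a timeout tells us nothing more. */
	if (wait_ret)
		return wait_ret;

	switch (result) {
	case -EINVAL:
		/* Mailbox rejected the command: treat SAGV as not controllable. */
		*status = SAGV_NOT_CONTROLLED;
		return 0;
	case 1:
		*status = SAGV_DISABLED;
		return 0;
	default:
		/* Anything else is passed back unchanged, status untouched. */
		return result;
	}
}
```

The error print then lives in exactly one place per branch, which also
takes care of the duplicate DRM_ERROR noted earlier.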

> +
> +	return ret;
> +}
> +
>  static void
>  skl_ddb_get_pipe_allocation_limits(struct drm_device *dev,
>  				   const struct intel_crtc_state *cstate,
> -- 
> 2.7.4

-- 
Ville Syrjälä
Intel OTC