[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <063f4e49-94c2-4e2e-9ddf-70f0afc3ab05@intel.com>
Date: Wed, 17 Jan 2024 11:40:13 +0800
From: Xiaochen Shen <xiaochen.shen@...el.com>
To: Tony Luck <tony.luck@...el.com>, Fenghua Yu <fenghua.yu@...el.com>,
Reinette Chatre <reinette.chatre@...el.com>,
Peter Newman <peternewman@...gle.com>, x86@...nel.org
Cc: James Morse <james.morse@....com>, Jamie Iles <quic_jiles@...cinc.com>,
Babu Moger <babu.moger@....com>, linux-kernel@...r.kernel.org,
patches@...ts.linux.dev, xiaochen.shen@...el.com
Subject: Re: [PATCH] x86/resctrl: Implement new MBA_mbps throttling heuristic
Tested-by: Xiaochen Shen <xiaochen.shen@...el.com>
On 1/10/2024 6:00, Tony Luck wrote:
> The MBA_mbps feedback loop increases throttling when a group is using
> more bandwidth than the target set by the user in the schemata file, and
> decreases throttling when below target.
>
> To avoid possibly stepping throttling up and down on every poll a
> flag "delta_comp" is set whenever throttling is changed to indicate
> that the actual change in bandwidth should be recorded on the next
> poll in "delta_bw". Throttling is only reduced if the current bandwidth
> plus delta_bw is below the user target.
>
> This algorithm works well if the workload has steady bandwidth needs.
> But it can go badly wrong if the workload moves to a different phase
> just as the throttling level changed. E.g. if the workload becomes
> essentially idle right as throttling level is increased, the value
> calculated for delta_bw will be more or less the old bandwidth level.
> If the workload then resumes, Linux may never reduce throttling because
> current bandwidth plu delta_bw is above the target set by the user.
>
> Implement a simpler heuristic by assuming that in the worst case the
> currently measured bandwidth is being controlled by the current level of
> throttling. Compute how much it may increase if throttling is relaxed to
> the next higher level. If that is still below the user target, then it
> is ok to reduce the amount of throttling.
>
> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
> Reported-by: Xiaochen Shen <xiaochen.shen@...el.com>
> Signed-off-by: Tony Luck <tony.luck@...el.com>
> ---
>
> This patch was previously posted in:
>
> https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t
>
> as part of a proposal to allow the mba_MBps mount option to base its
> feedback loop input on total bandwidth instead of local bandwidth.
> Sending it now as a standalone patch because Xiaochen reported that
> real systems have experienced problems when delta_bw is incorrectly
> calculated.
>
> arch/x86/kernel/cpu/resctrl/internal.h | 4 ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++--------------------
> 2 files changed, 10 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..71bbd2245cc7 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -296,14 +296,10 @@ struct rftype {
> * struct mbm_state - status for each MBM counter in each domain
> * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
> * @prev_bw: The most recent bandwidth in MBps
> - * @delta_bw: Difference between the current and previous bandwidth
> - * @delta_comp: Indicates whether to compute the delta_bw
> */
> struct mbm_state {
> u64 prev_bw_bytes;
> u32 prev_bw;
> - u32 delta_bw;
> - bool delta_comp;
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..1961823b555b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>
> cur_bw = bytes / SZ_1M;
>
> - if (m->delta_comp)
> - m->delta_bw = abs(cur_bw - m->prev_bw);
> - m->delta_comp = false;
> m->prev_bw = cur_bw;
> }
>
> @@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> {
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> - u32 cur_bw, delta_bw, user_bw;
> + u32 cur_bw, user_bw;
> struct rdt_resource *r_mba;
> struct rdt_domain *dom_mba;
> struct list_head *head;
> @@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>
> cur_bw = pmbm_data->prev_bw;
> user_bw = dom_mba->mbps_val[closid];
> - delta_bw = pmbm_data->delta_bw;
>
> /* MBA resource doesn't support CDP */
> cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
> @@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> cur_bw += cmbm_data->prev_bw;
> - delta_bw += cmbm_data->delta_bw;
> }
>
> /*
> * Scale up/down the bandwidth linearly for the ctrl group. The
> * bandwidth step is the bandwidth granularity specified by the
> * hardware.
> - *
> - * The delta_bw is used when increasing the bandwidth so that we
> - * dont alternately increase and decrease the control values
> - * continuously.
> - *
> - * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
> - * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
> - * switching between 90 and 110 continuously if we only check
> - * cur_bw < user_bw.
> + * Always increase throttling if current bandwidth is above the
> + * target set by user.
> + * But avoid thrashing up and down on every poll by checking
> + * whether a decrease in throttling is likely to push the group
> + * back over target. E.g. if currently throttling to 30% of bandwidth
> + * on a system with 10% granularity steps, check whether moving to
> + * 40% would go past the limit by multiplying current bandwidth by
> + * "(30 + 10) / 30".
> */
> if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
> new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
> } else if (cur_msr_val < MAX_MBA_BW &&
> - (user_bw > (cur_bw + delta_bw))) {
> + (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
> new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
> } else {
> return;
> }
>
> resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
> -
> - /*
> - * Delta values are updated dynamically package wise for each
> - * rdtgrp every time the throttle MSR changes value.
> - *
> - * This is because (1)the increase in bandwidth is not perfectly
> - * linear and only "approximately" linear even when the hardware
> - * says it is linear.(2)Also since MBA is a core specific
> - * mechanism, the delta values vary based on number of cores used
> - * by the rdtgrp.
> - */
> - pmbm_data->delta_comp = true;
> - list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> - cmbm_data->delta_comp = true;
> - }
> }
>
> static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
>
> base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
Best regards,
Xiaochen
Powered by blists - more mailing lists