Message-ID: <CALPaoCg76nUsJ7eYcU61gied8WBuAAmqy0Pqpsq5=Z-S52Qg6w@mail.gmail.com>
Date:   Fri, 12 May 2023 15:25:37 +0200
From:   Peter Newman <peternewman@...gle.com>
To:     Reinette Chatre <reinette.chatre@...el.com>
Cc:     Fenghua Yu <fenghua.yu@...el.com>, Babu Moger <babu.moger@....com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>,
        Stephane Eranian <eranian@...gle.com>,
        James Morse <james.morse@....com>,
        linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to
 collect CPUs' MBM events

Hi Reinette,

On Thu, May 11, 2023 at 11:37 PM Reinette Chatre
<reinette.chatre@...el.com> wrote:
> On 4/21/2023 7:17 AM, Peter Newman wrote:
> > Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM
> > event counts into its current software RMID. The delta for each CPU is
> > determined by tracking the previous event counts in per-CPU data.  The
> > software byte counts reside in the arch-independent mbm_state
> > structures.
>
> Could you elaborate why the arch-independent mbm_state was chosen?

It largely had to do with how many soft RMIDs to implement. For our
own needs, we were mainly concerned with getting back to the number of
monitoring groups the hardware claimed to support, so there wasn't
much internal motivation to support an unbounded number of soft RMIDs.

However, breaking this artificial connection between supported HW and
SW RMIDs in order to support arbitrarily many monitoring groups could
make the implementation conceptually cleaner. If you agree, I would be
happy to give it a try in the next series.


> > +     /* cache occupancy events are disabled in this mode */
> > +     WARN_ON(!is_mbm_event(evtid));
>
> If this is hit it would trigger a lot, perhaps WARN_ON_ONCE?

Ok
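
Something like this, perhaps (bailing out on the warning is my own
addition here; the original just warned and carried on):

	/* cache occupancy events are disabled in this mode */
	if (WARN_ON_ONCE(!is_mbm_event(evtid)))
		return;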

>
> > +
> > +     if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID) {
> > +             counter = &state->local;
> > +     } else {
> > +             WARN_ON(evtid != QOS_L3_MBM_TOTAL_EVENT_ID);
> > +             counter = &state->total;
> > +     }
> > +
> > +     /*
> > +      * Propagate the value read from the hw_rmid assigned to the current CPU
> > +      * into the "soft" rmid associated with the current task or CPU.
> > +      */
> > +     m = get_mbm_state(d, soft_rmid, evtid);
> > +     if (!m)
> > +             return;
> > +
> > +     if (resctrl_arch_rmid_read(r, d, hw_rmid, evtid, &val))
> > +             return;
> > +
>
> This all seems unsafe to run without protection. The code relies on
> the rdt_domain but a CPU hotplug event could result in the domain
> disappearing underneath this code. The accesses to the data structures
> also appear unsafe to me. Note that resctrl_arch_rmid_read() updates
> the architectural MBM state and this same state can be updated concurrently
> in other code paths without appropriate locking.

The domain is supposed to always be the current CPU's, but I see that
even a get_domain_from_cpu(smp_processor_id(), ...) call needs to walk
the resource's domain list to find a matching entry, and that list can
be modified concurrently as domains are added or removed.
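
For reference, the walk in question looks roughly like this
(paraphrasing get_domain_from_cpu() from core.c), and nothing in it
protects against the list changing underneath a reader:

	struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
	{
		struct rdt_domain *d;

		list_for_each_entry(d, &r->domains, list) {
			/* Find the domain that contains this CPU */
			if (cpumask_test_cpu(cpu, &d->cpu_mask))
				return d;
		}

		return NULL;
	}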

Similarly, when soft RMIDs are enabled, it should not be possible to
call resctrl_arch_rmid_read() on anything other than the current CPU's
HW RMID.

I'll need to confirm whether it's safe to access the current CPU's
rdt_domain in an atomic context. If it isn't, I assume I would have to
arrange for all of the state used during a flush to be per-CPU.

I expect the constraints on what data can safely be accessed where
will dictate how the state is ultimately arranged, so I will need to
settle this before I can come back to the other questions about
mbm_state.
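
To sketch what I mean by per-CPU state (hypothetical names, not code
from this series):

	/*
	 * Per-CPU copy of the previous hardware counts, so the flush
	 * path never has to follow a pointer into a domain that a
	 * concurrent hotplug event could be tearing down.
	 */
	struct mbm_soft_counter {
		u64	prev_bytes;
		bool	initialized;
	};

	struct mbm_flush_state {
		struct mbm_soft_counter	local;
		struct mbm_soft_counter	total;
	};

	static DEFINE_PER_CPU(struct mbm_flush_state, mbm_flush_state);

The flush path would then only touch this_cpu_ptr(&mbm_flush_state)
plus the target mbm_state.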

>
> > +     /* Count bandwidth after the first successful counter read. */
> > +     if (counter->initialized) {
> > +             /* Assume that mbm_update() will prevent double-overflows. */
> > +             if (val != counter->prev_bytes)
> > +                     atomic64_add(val - counter->prev_bytes,
> > +                                  &m->soft_rmid_bytes);
> > +     } else {
> > +             counter->initialized = true;
> > +     }
> > +
> > +     counter->prev_bytes = val;
>
> I notice a lot of similarities between the above and the software controller,
> see mbm_bw_count().

Thanks for pointing this out, I'll take a look.
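
At a glance, both reduce to the same first-read-initializes,
delta-on-later-reads shape, so they could plausibly share a helper
along these lines (illustrative only, hypothetical name):

	static void mbm_accumulate_delta(struct mbm_soft_counter *c, u64 val,
					 atomic64_t *total)
	{
		/* Only count bandwidth after the first successful read. */
		if (c->initialized && val != c->prev_bytes)
			atomic64_add(val - c->prev_bytes, total);
		c->initialized = true;
		c->prev_bytes = val;
	}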

>
> > +}
> > +
> > +/*
> > + * Called from context switch code __resctrl_sched_in() when the current soft
> > + * RMID is changing or before reporting event counts to user space.
> > + */
> > +void resctrl_mbm_flush_cpu(void)
> > +{
> > +     struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> > +     int cpu = smp_processor_id();
> > +     struct rdt_domain *d;
> > +
> > +     d = get_domain_from_cpu(cpu, r);
> > +     if (!d)
> > +             return;
> > +
> > +     if (is_mbm_local_enabled())
> > +             __mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
> > +     if (is_mbm_total_enabled())
> > +             __mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
> > +}
>
> This (potentially) adds two MSR writes and two MSR reads to what could possibly
> be quite slow MSRs if it was not designed to be used in context switch. Do you
> perhaps have data on how long these MSR reads/writes take on these systems to get
> an idea about the impact on context switch? I think this data should feature
> prominently in the changelog.

I can probably use ftrace to determine the cost of an __rmid_read()
call on a few implementations.
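
For example, with throwaway instrumentation around the call, the
durations can be pulled out of the trace buffer afterwards:

	/* Debug-only timing sketch, not intended for the series. */
	u64 t0, t1, val;
	int err;

	t0 = ktime_get_ns();
	err = __rmid_read(rmid, evtid, &val);
	t1 = ktime_get_ns();
	trace_printk("__rmid_read: err=%d ns=%llu\n", err, t1 - t0);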

To understand the overall impact on context switch, I can put together
a scenario where I control whether the measured context switches
result in a change of soft RMID, to prevent the data from being
diluted by non-flushing switches.


Thank you for looking over these changes!

-Peter
