Message-ID: <20190828153033.GA15512@pauld.bos.csb>
Date: Wed, 28 Aug 2019 11:30:34 -0400
From: Phil Auld <pauld@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Matthew Garrett <mjg59@...f.ucam.org>,
Vineeth Remanan Pillai <vpillai@...italocean.com>,
Nishanth Aravamudan <naravamudan@...italocean.com>,
Julien Desfossez <jdesfossez@...italocean.com>,
Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
tglx@...utronix.de, pjt@...gle.com, torvalds@...ux-foundation.org,
linux-kernel@...r.kernel.org, subhra.mazumdar@...cle.com,
fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
Aaron Lu <aaron.lwe@...il.com>,
Aubrey Li <aubrey.intel@...il.com>,
Valentin Schneider <valentin.schneider@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3
On Tue, Aug 27, 2019 at 11:50:35PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> > Apple have provided a sysctl that allows applications to indicate that
> > specific threads should make use of core isolation while allowing
> > the rest of the system to make use of SMT, and browsers (Safari, Firefox
> > and Chrome, at least) are now making use of this. Trying to do something
> > similar using cgroups seems a bit awkward. Would something like this be
> > reasonable?
>
> Sure; like I wrote earlier, I only did the cgroup thing because I was
> lazy and it was the easiest interface to hack on in a hurry.
>
> The rest of the ABI nonsense can 'trivially' be done later, if/when we
> decide to actually do this.
I think something that allows the tag to be set from outside the task
may be needed. One of the use cases for this is virtualization stacks,
where you really want to keep the higher CPU count that SMT provides
while setting up the isolation from management processes on the host.
The current cgroup interface doesn't work for that because it doesn't
apply the tag to children. We've been unable to fully test it in a virt
setup because our VMs are built with a child cgroup per vCPU.
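To make that concrete, here is a rough userspace sketch of what a
management process would do today (the cpu.tag file name is from the v3
cgroup interface as I understand it; the hierarchy layout is just a
typical libvirt-style example, not our exact setup):

	#include <stdio.h>

	/* Write "1" to the cpu.tag file of the given cpu cgroup. */
	static int tag_cgroup(const char *cg)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/fs/cgroup/cpu/%s/cpu.tag", cg);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fputs("1", f);
		return fclose(f);
	}

	int main(void)
	{
		/* Tagging the VM's cgroup does not propagate to its
		 * children, so the vcpu threads living in
		 * machine/vm1/vcpu0, machine/vm1/vcpu1, ... stay
		 * untagged. */
		return tag_cgroup("machine/vm1");
	}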
>
> And given MDS, I'm still not entirely convinced it all makes sense. If
> it were just L1TF, then yes, but now...
I was thinking MDS is really the reason for this. L1TF has mitigations,
but the only current mitigation for MDS on an SMT system is ... nosmt.
The current core scheduler implementation, I believe, still has
(theoretical?) holes involving interrupts; if/when those are closed it
may be even less attractive.
>
> > Having spoken to the Chrome team, I believe that the
> > semantics we want are:
> >
> > 1) A thread to be able to indicate that it should not run on the same
> > core as anything not in possession of the same cookie
> > 2) Descendants of that thread to (by default) have the same cookie
> > 3) No other thread be able to obtain the same cookie
> > 4) Threads not be able to rejoin the global group (ie, threads can
> > segregate themselves from their parent and peers, but can never rejoin
> > that group once segregated)
> >
> > but don't know if that's what everyone else would want.
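Those semantics map pretty naturally onto the prctl below. As a minimal
usage sketch (assuming the PR_CORE_ISOLATE value from the patch; none
of this is in a released uapi header yet):

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_CORE_ISOLATE
	#define PR_CORE_ISOLATE 55	/* value proposed below */
	#endif

	int main(void)
	{
		/* Tag this thread (1); children forked after this point
		 * inherit the cookie (2); no other thread can obtain it
		 * and there is no call to clear it again (3 and 4). */
		if (prctl(PR_CORE_ISOLATE, 0, 0, 0, 0))
			perror("prctl"); /* -EINVAL without CONFIG_SCHED_CORE */
		return 0;
	}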
> >
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 094bb03b9cc2..5d411246d4d5 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -229,4 +229,5 @@ struct prctl_mm_map {
> > # define PR_PAC_APDBKEY (1UL << 3)
> > # define PR_PAC_APGAKEY (1UL << 4)
> >
> > +#define PR_CORE_ISOLATE 55
> > #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 12df0e5434b8..a054cfcca511 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > return -EINVAL;
> > error = PAC_RESET_KEYS(me, arg2);
> > break;
> > + case PR_CORE_ISOLATE:
> > +#ifdef CONFIG_SCHED_CORE
> > + current->core_cookie = (unsigned long)current;
>
> This needs to then also force a reschedule of current. And there's the
> little issue of what happens if 'current' dies while its children live
> on, and current gets re-used for a new process and does this again.
sched_core_get() too?
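Something like the below, maybe (a completely untested sketch;
sched_core_get() is from the v3 series, and whether setting the
need-resched flag is enough to force the re-evaluation you mention is
an open question):

	case PR_CORE_ISOLATE:
	#ifdef CONFIG_SCHED_CORE
		/* take a ref on the core-sched static key */
		sched_core_get();
		current->core_cookie = (unsigned long)current;
		/* make the pick-next path re-evaluate the siblings */
		set_tsk_need_resched(current);
	#else
		error = -EINVAL;
	#endif
		break;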
Cheers,
Phil
>
> > +#else
> > + error = -EINVAL;
> > +#endif
> > + break;
> > default:
> > error = -EINVAL;
> > break;
> >
> >
> > --
> > Matthew Garrett | mjg59@...f.ucam.org