Message-ID: <VE1PR04MB66701E2D8CE661F280D6A6C48B350@VE1PR04MB6670.eurprd04.prod.outlook.com>
Date: Fri, 3 May 2019 06:34:29 +0000
From: Vakul Garg <vakul.garg@....com>
To: Steffen Klassert <steffen.klassert@...unet.com>
CC: Florian Westphal <fw@...len.de>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: RE: [RFC HACK] xfrm: make state refcounting percpu
> -----Original Message-----
> From: Steffen Klassert <steffen.klassert@...unet.com>
> Sent: Friday, May 3, 2019 11:52 AM
> To: Vakul Garg <vakul.garg@....com>
> Cc: Florian Westphal <fw@...len.de>; netdev@...r.kernel.org
> Subject: Re: [RFC HACK] xfrm: make state refcounting percpu
>
> On Fri, May 03, 2019 at 06:13:22AM +0000, Vakul Garg wrote:
> >
> >
> > > -----Original Message-----
> > > From: Steffen Klassert <steffen.klassert@...unet.com>
> > > Sent: Friday, May 3, 2019 11:38 AM
> > > To: Florian Westphal <fw@...len.de>
> > > Cc: Vakul Garg <vakul.garg@....com>; netdev@...r.kernel.org
> > > Subject: Re: [RFC HACK] xfrm: make state refcounting percpu
> > >
> > > On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
> > > > I'm not sure this is a good idea to begin with; the refcount is
> > > > right next to the state spinlock, which is taken for both tx and
> > > > rx ops, plus this complicates debugging quite a bit.
> > >
> > >
> > > Hm, what would be the use case where this could help?
> > >
> > > The only thing that comes to my mind is a TX state with wide
> > > selectors. In that case you might see traffic for this state on a
> > > lot of cpus. But in that case we have a lot of other problems too:
> > > state lock, replay window, etc. It might make more sense to install
> > > a full state per cpu, as this would solve all the other problems too
> > > (I've talked about that idea at the IPsec workshop).
> > >
> > > In fact, RFC 7296 allows inserting multiple SAs with the same
> > > traffic selector, so it is possible to install one state per cpu.
> > > We did a PoC for this at the IETF meeting the week after the IPsec
> > > workshop.
> > >
> >
> > On a 16-core arm64 processor, I am getting very high cpu usage
> > (~40%) in refcount atomics.
> > E.g. in the function dst_release() itself, I get 19% cpu usage in the
> > refcount API.
> > Will the PoC help here?
>
> If your use case is what I described above, then yes.
>
> I guess the high cpu usage comes from cacheline bounces because one SA
> is used from many cpus simultaneously.
> Is that the case?
I don't see the kernel code taking care of reservation-granule (or cacheline) size
alignment for refcount variables, so it is possible that wasteful reservation loss
is happening in the atomics.
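
For illustration, a minimal sketch of what I mean (the struct and field
names are hypothetical, not the actual xfrm_state layout;
____cacheline_aligned_in_smp is the kernel's usual annotation for this):

#include <linux/cache.h>
#include <linux/refcount.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Hypothetical state layout, for illustration only. */
struct demo_state {
	spinlock_t	lock;		/* hot: taken for both tx and rx ops */
	u64		replay_seq;	/* hot: other per-packet fields */

	/*
	 * Keep the refcount on its own cacheline so that
	 * refcount_inc()/refcount_dec() from other cpus do not
	 * invalidate the line holding the state lock.
	 */
	refcount_t	refcnt ____cacheline_aligned_in_smp;
};
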
>
> Also, is this a new problem or was it always like that?
It has always been like this. On 4-core and 8-core platforms as well, these
atomics consume significant cpu (usage on 8 cores is higher than on 4 cores).
On the 16-core system, we see no throughput scaling beyond 8 cores.
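
For reference, a rough sketch (not the RFC patch itself) of how per-cpu
refcounting could look with the kernel's percpu_ref API, which keeps
gets/puts cpu-local until teardown; demo_state and the helpers are
made-up names:

#include <linux/kernel.h>
#include <linux/percpu-refcount.h>
#include <linux/slab.h>

/* Hypothetical stand-in for struct xfrm_state. */
struct demo_state {
	struct percpu_ref refcnt;
	/* ... */
};

/* Runs once the last reference is dropped after percpu_ref_kill(). */
static void demo_state_release(struct percpu_ref *ref)
{
	struct demo_state *x = container_of(ref, struct demo_state, refcnt);

	percpu_ref_exit(ref);
	kfree(x);
}

static struct demo_state *demo_state_alloc(void)
{
	struct demo_state *x = kzalloc(sizeof(*x), GFP_KERNEL);

	if (!x)
		return NULL;

	/* Holders hit a per-cpu counter, so the hot path writes no
	 * shared cacheline; the one shared atomic is touched only at
	 * init and kill time. */
	if (percpu_ref_init(&x->refcnt, demo_state_release, 0, GFP_KERNEL)) {
		kfree(x);
		return NULL;
	}
	return x;
}

static inline void demo_state_hold(struct demo_state *x)
{
	percpu_ref_get(&x->refcnt);	/* cpu-local increment */
}

static inline void demo_state_put(struct demo_state *x)
{
	percpu_ref_put(&x->refcnt);	/* cpu-local decrement */
}

static void demo_state_destroy(struct demo_state *x)
{
	/* Switches the ref to atomic mode and drops the initial
	 * reference; demo_state_release() runs once all other
	 * holders have dropped theirs. */
	percpu_ref_kill(&x->refcnt);
}

The tradeoff is that percpu_ref cannot observe the count hitting zero
while in per-cpu mode, so an explicit kill point is needed; whether xfrm
state teardown has such a point is exactly what an RFC like this would
have to sort out.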