Message-ID: <alpine.LFD.2.00.1004292055471.2951@localhost.localdomain>
Date: Thu, 29 Apr 2010 21:19:36 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: Stephen Hemminger <shemminger@...tta.com>
cc: Eric Dumazet <eric.dumazet@...il.com>,
Andi Kleen <ak@...goyle.fritz.box>, netdev@...r.kernel.org,
Andi Kleen <andi@...stfloor.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: OFT - reserving CPU's for networking
On Thu, 29 Apr 2010, Stephen Hemminger wrote:
> > On Thursday 29 April 2010 at 19:42 +0200, Andi Kleen wrote:
> > > > Andi, what do you think of this one ?
> > > > Don't we have a function to send an IPI to an individual CPU instead?
> > >
> > > That's what this function already does. You only set a single CPU
> > > in the target mask, right?
> > >
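For reference, the single-CPU primitive in question is
smp_call_function_single(). A minimal sketch of its use, with a made-up
callback name, assuming a context where preemption is disabled:

	#include <linux/smp.h>

	/* Hypothetical callback, runs on the remote CPU in IPI context */
	static void remote_kick(void *info)
	{
		/* e.g. raise a softirq to drain a per-CPU queue */
	}

	static void kick_cpu(int cpu)
	{
		/* wait=0: fire and forget, don't spin for completion */
		smp_call_function_single(cpu, remote_kick, NULL, 0);
	}
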
> > > IPIs are unfortunately always a bit slow. Nehalem-EX systems have X2APIC
> > > which is a bit faster for this, but that's not available in the lower
> > > end Nehalems. But even then it's not exactly fast.
> > >
> > > I don't think the IPI primitive can be optimized much. It's not a cheap
> > > operation.
> > >
> > > If it's a problem do it less often and batch IPIs.
> > >
> > > It's essentially the same problem as interrupt mitigation or NAPI
> > > are solving for NICs. I guess we just need a suitable mitigation mechanism.
> > >
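One way to sketch such a mitigation (hypothetical code, all names
invented for illustration): only pay for an IPI on the empty to
non-empty transition of the remote queue, so a burst of packets costs a
single IPI.

	#include <linux/skbuff.h>
	#include <linux/smp.h>
	#include <linux/spinlock.h>

	struct remote_queue {
		spinlock_t lock;
		struct sk_buff_head pkts;
	};

	/* Runs on the remote CPU in IPI context */
	static void drain_queue(void *info)
	{
		/* dequeue from q->pkts and process; left out of the sketch */
	}

	static void enqueue_and_maybe_kick(struct remote_queue *q,
					   struct sk_buff *skb, int cpu)
	{
		bool was_empty;

		spin_lock(&q->lock);
		was_empty = skb_queue_empty(&q->pkts);
		__skb_queue_tail(&q->pkts, skb);
		spin_unlock(&q->lock);

		/* Only the first packet of a burst raises the IPI */
		if (was_empty)
			smp_call_function_single(cpu, drain_queue, q, 0);
	}
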
> > > Of course that would move more work to the sending CPU again, but
> > > perhaps there's no alternative. I guess you could make it cheaper by
> > > minimizing access to packet data.
> > >
> > > -Andi
> >
> > Well, IPIs are already batched, and the rate is auto-adaptive.
> >
> > After various changes, things seem to be going better; maybe there is
> > something related to cache line thrashing.
> >
> > I 'solved' it by using idle=poll, but you might take a look at
> > clockevents_notify (acpi_idle_enter_bm) abuse of a shared and highly
> > contended spinlock...
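For context, the path Eric refers to looks roughly like this,
paraphrased from the 2.6.3x ACPI idle code (not a verbatim quote):

	/* In lapic_timer_state_broadcast(), called from
	 * acpi_idle_enter_bm() around a deep C-state, because the
	 * local APIC timer stops in C2/C3. clockevents_notify()
	 * serializes all CPUs on global locks (clockevents_lock, and
	 * tick_broadcast_lock inside the broadcast code), which is the
	 * contention Eric sees. */
	reason = broadcast ? CLOCK_EVT_NOTIFY_BROADCAST_ENTER
			   : CLOCK_EVT_NOTIFY_BROADCAST_EXIT;
	clockevents_notify(reason, &pr->id);
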
Say thanks to Intel/AMD for providing us with timers which stop in
lower C-states.
There is not much we can do about the broadcast lock when several cores
are going idle and we need to set up a global timer to work around the
issue of the LAPIC timer stopping in C2/C3.
Simply put, C-state timer broadcasting does not scale, and it was never
meant to scale. It's a workaround so that laptops can have functional
NOHZ.
There are several ways to work around that on larger machines (see the
command line examples after this list):
- Restrict c-states
- Disable NOHZ and highres timers
- idle=poll is definitely the worst of all possible solutions
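Concretely, the first two can be requested on the kernel command line;
a sketch (processor.max_cstate, nohz and highres are documented boot
parameters, pick what fits the machine):

	# Restrict C-states so the local APIC timer keeps running:
	processor.max_cstate=1

	# Or disable NOHZ and highres timers instead:
	nohz=off highres=off
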
> I keep getting asked about taking some cores away from the clock and
> scheduler, to be reserved just for network processing. Seeing this kind
> of stuff makes me wonder if maybe that isn't a half-bad idea.
This comes up every few months, and we have pointed out several times
what needs to be done to make this work w/o these weird hacks which put
a core offline and then start some magic undebuggable binary blob on it.
We have not seen anyone working on this, but the "set cores aside and
let them do X" idea seems to stick in people's heads.
Seriously, that's not a solution. It's going to be some hacked up
nightmare which is completely unmaintainable.
Aside from that, I seriously doubt that you can do networking w/o time
and timers.
Thanks,
tglx