Message-ID: <CAEc1PS1gpA6p0Nb11bCkmi6q6e-yxN=63R1duPvimOFnjC9kLQ@mail.gmail.com>
Date: Sun, 15 Jan 2012 17:45:29 -0500
From: Yuehai Xu <yuehaixu@...il.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
yhxu@...ne.edu
Subject: Re: Why the number of /proc/interrupts doesn't change when nic is
under heavy workload?
On Sun, Jan 15, 2012 at 5:27 PM, Yuehai Xu <yuehaixu@...il.com> wrote:
> Thanks for replying! Please see below:
>
> On Sun, Jan 15, 2012 at 5:09 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> Le dimanche 15 janvier 2012 à 15:53 -0500, Yuehai Xu a écrit :
>>> Hi All,
>>>
>>> The NIC on my server is an Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller, the driver is e1000e, and my Linux version is
>>> 3.1.4. I have a Memcached server running on this 8-core box. The
>>> weird thing is that when the server is under heavy workload, the
>>> counts in /proc/interrupts don't change at all. Below are some details:
>>> =======
>>> cat /proc/interrupts | grep eth0
>>>  68:   330887   330861   331432   330544   330346   330227   330830   330575   PCI-MSI-edge   eth0
>>> =======
>>> cat /proc/irq/68/smp_affinity
>>> ff
>>>
>>> I know that when the network is under heavy load, NAPI disables the
>>> NIC interrupt and polls the ring buffer on the NIC. My question is:
>>> when is the NIC interrupt enabled again? It seems it will never be
>>> re-enabled as long as the heavy workload continues, since the counts
>>> shown by /proc/interrupts don't change at all. In my case, one of the
>>> cores is saturated by ksoftirqd, because lots of softirqs are pending
>>> on that core. I just want to distribute those softirqs to the other
>>> cores. Even with RPS enabled, that core is still occupied by
>>> ksoftirqd, nearly 100%.
>>>
>>> I dug into the code and found these statements:
>>> __napi_schedule ==>
>>> local_irq_save(flags);
>>> ____napi_schedule(&__get_cpu_var(softnet_data), n);
>>> local_irq_restore(flags);
>>>
>>> here "local_irq_save" actually invokes "cli" which disable interrupt
>>> for the local core, is this the one that used in NAPI to disable nic
>>> interrupt? Personally I don't think it is because it just disables
>>> local cpu.
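
(For reference, ____napi_schedule() in net/core/dev.c around 3.1 is
essentially the following, comments added:

    static inline void ____napi_schedule(struct softnet_data *sd,
                                         struct napi_struct *napi)
    {
            /* Queue this napi context on the local CPU's poll list
             * and arrange for NET_RX_SOFTIRQ to run. */
            list_add_tail(&napi->poll_list, &sd->poll_list);
            __raise_softirq_irqoff(NET_RX_SOFTIRQ);
    }

It never touches the NIC's interrupt mask; the local_irq_save() around
it only protects the per-cpu poll_list, confirming the suspicion above.)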
>>>
>>> I also found "enable_irq/disable_irq/e1000_irq_enable/e1000_irq_disable"
>>> under drivers/net/e1000e. Are these what NAPI uses to disable the NIC
>>> interrupt? I couldn't find any evidence that they are called on the
>>> NAPI code path.
>>
>> This is done in the device driver itself, not in generic NAPI code.
>>
>> When NAPI poll() gets fewer packets than the budget, it re-enables
>> chip interrupts.
>>
>>
>
> So you mean that if NAPI poll() gets as many packets as the budget, it
> will not re-enable chip interrupts, right? In that case, one core
> still suffers the whole load. Can you briefly show me where this
> control statement is in the kernel source? I have looked for it for
> several days without luck.
>
I went through the code again. NAPI poll() actually invokes
e1000_clean() (my driver is e1000e), and that routine contains:
    ....
    adapter->clean_rx(adapter, &work_done, budget);
    ....
    /* If budget not fully consumed, exit the polling mode */
    if (work_done < budget) {
            ....
            e1000_irq_enable(adapter);
            ....
    }
I think the above is what you described. Correct me if I am wrong:
"work_done" is the number of packets NAPI has polled, while "budget" is
a parameter the administrator can tune. So, if work_done is always
greater than or equal to budget, the chip interrupt will never be
re-enabled. That makes sense. However, it also means one particular
core has to handle all the softirqs, because the irq smp_affinity
doesn't take effect here. Even though RPS can shift some softirqs to
other cores, it doesn't solve the problem completely.
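
To make that pattern concrete, here is a minimal sketch of the
canonical NAPI poll handler (the example_* names are made up, not the
real e1000e symbols):

    /* Minimal sketch of the NAPI poll pattern discussed above;
     * the example_* types and helpers are hypothetical. */
    static int example_napi_poll(struct napi_struct *napi, int budget)
    {
            struct example_adapter *adapter =
                    container_of(napi, struct example_adapter, napi);
            int work_done;

            /* Pull at most 'budget' packets out of the RX ring. */
            work_done = example_clean_rx_ring(adapter, budget);

            if (work_done < budget) {
                    /* Ring drained: leave polling mode and let the
                     * NIC interrupt fire again for the next packet. */
                    napi_complete(napi);
                    example_irq_enable(adapter);
            }

            /* If work_done == budget we stay on the poll list and the
             * chip interrupt remains masked, which matches the frozen
             * counters in /proc/interrupts. */
            return work_done;
    }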
>
>>>
>>> My current situation is that the other 7 cores are idle almost 60%
>>> of the time, while the one core occupied by ksoftirqd is 100% busy.
>>>
>>
>> You could post some info, like "cat /proc/net/softnet_stat"
>>
>> If you use RPS under a very high workload on a single-queue NIC, it is
>> best to dedicate one cpu, for example cpu0, to packet dispatching, and
>> the other cpus to IP/UDP handling.
>>
>> echo 01 >/proc/irq/68/smp_affinity
>> echo fe >/sys/class/net/eth0/queues/rx-0/rps_cpus
>>
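
(For context: RPS picks a destination CPU by scaling the packet's flow
hash over the CPUs enabled in rps_cpus; the selection step inside
get_rps_cpu() in net/core/dev.c is roughly the following, with the
example_ name made up:

    /* 'map' is built from the rps_cpus bitmask; 'rxhash' is the flow
     * hash, so all packets of one flow land on the same CPU. */
    static u16 example_rps_pick_cpu(const struct rps_map *map, u32 rxhash)
    {
            return map->cpus[((u64)rxhash * map->len) >> 32];
    }

So with rps_cpus=fe, flows are spread over cpus 1-7, while cpu0, pinned
via smp_affinity, only runs the driver's poll loop.)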
>> Please keep in mind that if your memcache uses a single UDP socket, you
>> probably hit a lot of contention on the socket spinlock and various
>> counters. So maybe it would be better to _reduce_ the number of cpus
>> handling network load, to reduce false sharing.
>
> My memcached uses 8 different UDP sockets (8 different UDP ports), so
> there should be no lock contention on a single UDP rx queue.
>
>>
>> echo 0e >/sys/class/net/eth0/queues/rx-0/rps_cpus
>>
>> Really, if you have a single UDP queue, it would be best not to use
>> RPS and to have only:
>>
>> echo 01 >/proc/irq/68/smp_affinity
>>
>> Then you could post the result of "perf top -C 0" so that we can spot
>> obvious problems on the hot path for this particular cpu.
>>
>>
>>
>
> Thanks!
> Yuehai