Date:	Tue, 17 Mar 2009 12:08:48 +0100
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	David Miller <davem@...emloft.net>, kchang@...enacr.com,
	netdev@...r.kernel.org, cl@...ux-foundation.org, bmb@...enacr.com
Subject: Re: Multicast packet loss

Peter Zijlstra wrote:
> On Mon, 2009-03-16 at 23:22 +0100, Eric Dumazet wrote:
> 
>> Here is the last incarnation of the patch, which of course should be
>> split in two parts, with better changelogs, for further discussion on lkml.
> 
> I read the entire thread up to now, and I still don't really understand
> the Changelog, sorry :(

Sure, I should have taken more time; I will repost this in a couple of hours,
with nice changelogs and split patches.

> 
>> [PATCH] softirq: Introduce mechanism to defer wakeups
>>
>> Some network workloads need to call the scheduler too many times. For example,
>> each received multicast frame can wake up many threads. ksoftirqd is then
>> not able to drain NIC RX queues in time and we get frame losses and high
>> latencies.
>>
>> This patch adds an infrastructure to delay work done in
>> sock_def_readable() until the end of do_softirq(). This requires making
>> current->softirq_context available even if !CONFIG_TRACE_IRQFLAGS.
> 
> How does that solve the wakeup issue?

Apparently, on SMP machines this actually helps a lot in the case of multicast
traffic handled by many subscribers. skb cloning involves atomic ops on
route cache entries, and if we wake up threads as we currently do, they
start consuming skbs while the feeder is still doing skb clones for
other sockets. Many cache line ping-pongs slow down the softirq.

I will post the test program to reproduce the problem.
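
In the meantime, the reproducer is roughly shaped like this (a userspace
sketch, not the actual test program; the group address, port and thread
count are arbitrary). N threads join the same multicast group, so every
datagram wakes all N sleepers from a single softirq:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

#define NTHREADS 8

static void *subscriber(void *arg)
{
	struct sockaddr_in addr;
	struct ip_mreq mreq;
	char buf[2048];
	int on = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* allow all threads to bind the same port */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(4242);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));

	/* every thread subscribes to the same group */
	mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
	setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

	for (;;)	/* one wakeup per datagram per thread */
		recv(fd, buf, sizeof(buf), 0);
	return NULL;
}

int main(void)		/* build with -lpthread */
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, subscriber, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}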

> 
>> Signed-off-by: Eric Dumazet <dada1@...mosbay.com>
>> ---
> 
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -158,6 +158,42 @@ void local_bh_enable_ip(unsigned long ip)
>>  }
>>  EXPORT_SYMBOL(local_bh_enable_ip);
>>  
>> +
>> +#define SOFTIRQ_DELAY_END (struct softirq_delay *)1L
>> +static DEFINE_PER_CPU(struct softirq_delay *, softirq_delay_head) = {
>> +	SOFTIRQ_DELAY_END
>> +};
> 
> Why the magic termination value? Can't we NULL-terminate the list?

Yes, we can; you are right.

> 
>> +
>> +/*
>> + * Caller must disable preemption, and take care of appropriate
>> + * locking and refcounting
>> + */
> 
> Shouldn't we call it __softirq_delay_queue() if the caller needs to
> disable preemption?

I was wondering if a BUG_ON() could be added to crash if preemption is enabled
at this point. I could not find an existing check, and redoing the
'if (running_from_softirq())' test might be overkill. Should I document that
the caller should do the following?

skeleton:

    lock_my_data(data); /* barrier here */
    sdel = &data->sdel;
    if (running_from_softirq()) {
        if (softirq_delay_queue(sdel)) {
            /* queued: hold a refcount on data */
        } else {
            /* already queued, nothing to do */
        }
    } else {
        /* cannot queue the work, must do it right now */
        do_work(data);
    }
    release_my_data(data);
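
As for the check itself, perhaps something like this in the enqueue path
(a sketch on top of the posted function; note that preemptible() is only a
meaningful test under CONFIG_PREEMPT):

int __softirq_delay_queue(struct softirq_delay *sdel)
{
	WARN_ON_ONCE(preemptible());	/* caller must have preemption disabled */
	if (!sdel->next) {
		sdel->next = __get_cpu_var(softirq_delay_head);
		__get_cpu_var(softirq_delay_head) = sdel;
		return 1;
	}
	return 0;
}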

> 
> Furthermore, don't we always require the caller to take care of lifetime
> issues when we queue something?

You mean the comment is too verbose... or?

> 
>> +int softirq_delay_queue(struct softirq_delay *sdel)
>> +{
>> +	if (!sdel->next) {
>> +		sdel->next = __get_cpu_var(softirq_delay_head);
>> +		__get_cpu_var(softirq_delay_head) = sdel;
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Because locking is provided by the subsystem, please note
>> + * that sdel->func(sdel) is responsible for setting sdel->next to NULL
>> + */
>> +static void softirq_delay_exec(void)
>> +{
>> +	struct softirq_delay *sdel;
>> +
>> +	while ((sdel = __get_cpu_var(softirq_delay_head)) != SOFTIRQ_DELAY_END) {
>> +		__get_cpu_var(softirq_delay_head) = sdel->next;
>> +		sdel->func(sdel);	/* sdel->next = NULL; */
>> +	}
>> +}
> 
> Why can't we write:
> 
>   struct softirq_delay *sdel, *next;
> 
>   sdel = __get_cpu_var(softirq_delay_head);
>   __get_cpu_var(softirq_delay_head) = NULL;
> 
>   while (sdel) {
>     next = sdel->next;
>     sdel->func(sdel);
>     sdel = next;
>   }
> 
> Why does it matter what happens to sdel->next? We've done the callback.
> 
> Aah, the crux is in the re-use policy.. that most certainly does deserve
> a comment.

Hum, so my comment was not verbose enough :)

> 
> How about we make sdel->next point to itself in the init case?
> 
> Then we can write:
> 
>   while (sdel) {
>     next = sdel->next;
>     sdel->next = sdel;
>     sdel->func(sdel);
>     sdel = next;
>   }
> 
> and have the enqueue bit look like:
> 
> int __softirq_delay_queue(struct softirq_delay *sdel)
> {
>   struct softirq_delay **head;
> 
>   if (sdel->next != sdel)
>     return 0;

Yes, we could do that.

> 
>   head = &__get_cpu_var(softirq_delay_head);
>   sdel->next = *head;
>   *head = sdel;
>   return 1;
> }
>      
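
And the init side would then look something like this (a sketch; assuming we
add a softirq_delay_init() helper to establish the 'next points to itself ==
not queued' convention):

static inline void softirq_delay_init(struct softirq_delay *sdel,
				      void (*func)(struct softirq_delay *))
{
	sdel->func = func;
	sdel->next = sdel;	/* not queued: next points to itself */
}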
>> @@ -1691,6 +1694,43 @@ static void sock_def_readable(struct sock *sk, int len)
>>  	read_unlock(&sk->sk_callback_lock);
>>  }
>>  
>> +/*
>> + * helper function called by softirq_delay_exec(),
>> + * if inet_def_readable() queued us.
>> + */
>> +static void sock_readable_defer(struct softirq_delay *sdel)
>> +{
>> +	struct sock *sk = container_of(sdel, struct sock, sk_delay);
>> +
>> +	sdel->next = NULL;
>> +	/*
>> +	 * At this point, we don't own a lock on the socket, only a reference.
>> +	 * We must commit the above write, or another cpu could miss a wakeup.
>> +	 */
>> +	smp_wmb();
> 
> Where's the matching barrier?

Check the softirq_delay_exec(void) comment, where I stated that synchronization
has to be done by the subsystem.

In this socket case, the caller of softirq_delay_exec() has a lock on the socket.

The problem is that I don't want to take this lock again in the
sock_readable_defer() callback.

If sdel->next is not committed, another cpu could call softirq_delay_queue() and
find sdel->next non-NULL (or != sdel with your suggestion). Then sdel->func()
won't be called as it should (or will be called a little bit too soon).

So the matching barrier is the 'lock_my_data(data)' in the previous skeleton?
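
To spell out the pairing as I understand it (sdel == &sk->sk_delay; the right
column is whoever takes the socket lock next):

/*
 * CPU A: sock_readable_defer()         CPU B: next inet_def_readable()
 *
 * sdel->next = NULL;                   lock socket   <- acquire barrier
 * smp_wmb();   <- commit the store     sees sdel->next == NULL
 * sock_def_readable(sk, 0);            softirq_delay_queue(sdel) -> 1
 * sock_put(sk);                        sock_hold(sk), socket queued again
 */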

> 
>> +	sock_def_readable(sk, 0);
>> +	sock_put(sk);
>> +}
>> +
>> +/*
>> + * Custom version of sock_def_readable()
>> + * We want to defer scheduler processing until the end of do_softirq().
>> + * Called with socket locked.
>> + */
>> +void inet_def_readable(struct sock *sk, int len)
>> +{
>> +	if (running_from_softirq()) {
>> +		if (softirq_delay_queue(&sk->sk_delay))
>> +			/*
>> +			 * If we queued this socket, take a reference on it
>> +			 * Caller owns socket lock, so write to sk_delay.next
>> +			 * will be committed before unlock.
>> +			 */
>> +			sock_hold(sk);
>> +	} else
>> +		sock_def_readable(sk, len);
>> +}
> 
> OK, so the idea is to handle a bunch of packets and instead of waking N
> threads for each packet, only wake them once at the end of the batch?
> 
> Sounds like a sensible idea.. 

Yes, the idea is to batch wakeups, and if we receive several packets for
the same socket(s), we reduce the number of wakeups to one. In Athena CR's
multicast stress situation it really helps: no packets dropped, instead
of 30%.
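
As a standalone illustration of the batching (a userspace model of the
mechanism, using your 'next points to itself == not queued' convention;
not the kernel code):

#include <stdio.h>

struct softirq_delay {
	struct softirq_delay *next;
	void (*func)(struct softirq_delay *);
};

static struct softirq_delay *delay_head;	/* per-cpu in the real patch */

static int softirq_delay_queue(struct softirq_delay *sdel)
{
	if (sdel->next != sdel)
		return 0;		/* already queued */
	sdel->next = delay_head;
	delay_head = sdel;
	return 1;
}

static void softirq_delay_exec(void)
{
	struct softirq_delay *sdel, *next;

	sdel = delay_head;
	delay_head = NULL;
	while (sdel) {
		next = sdel->next;
		sdel->next = sdel;	/* mark 'not queued' again */
		sdel->func(sdel);
		sdel = next;
	}
}

static void wakeup(struct softirq_delay *sdel)
{
	printf("wakeup\n");
}

int main(void)
{
	struct softirq_delay sock_delay = { &sock_delay, wakeup };
	int i, queued = 0;

	for (i = 0; i < 10; i++)	/* ten packets for one socket... */
		queued += softirq_delay_queue(&sock_delay);
	printf("10 packets -> %d queue op(s)\n", queued);
	softirq_delay_exec();		/* ...and a single wakeup */
	return 0;
}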

Thanks Peter
