netdev - Re: [PATCH net-next] tcp: Add tracepoint for rxtstamp coalescing

Message-ID: <46eed1f7-e3bc-4d30-a5b6-edf049160fd8@linux.alibaba.com>
Date: Tue, 18 Jun 2024 20:11:38 +0800
From: Philo Lu <lulie@...ux.alibaba.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
 Mike Maloney <maloney@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
 netdev@...r.kernel.org, rostedt@...dmis.org, mhiramat@...nel.org,
 mathieu.desnoyers@...icios.com, davem@...emloft.net, dsahern@...nel.org,
 kuba@...nel.org, xuanzhuo@...ux.alibaba.com, dust.li@...ux.alibaba.com,
 Soheil Hassas Yeganeh <soheil@...gle.com>
Subject: Re: [PATCH net-next] tcp: Add tracepoint for rxtstamp coalescing



On 2024/6/14 20:02, Willem de Bruijn wrote:
>>>> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
>>>>> During tcp coalescence, rx timestamps of the former skb ("to" in
>>>>> tcp_try_coalesce), will be lost. This may lead to inaccurate
>>>>> timestamping results if skbs come out of order.
>>>>>
>>>>> Here is an example.
>>>>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
>>>>> are processed by tcp in the following order:
>>>>> A -(1us)-> C -(1ms)-> B
>>>>
>>>> IMHO the above order makes the changelog confusing
>>>>
>>>>> If C is coalesced to B, the final rx timestamps of the message will be
>>>>> those of C. That is, the timestamps show that we received the message
>>>>> when C came (including hardware and software). However, we actually
>>>>> received it 1ms later (when B came).
>>>>>
>>>>> With the added tracepoint, we can recognize such cases and report them
>>>>> if we want.
>>>>
>>>> We really need very good reasons to add new tracepoints to TCP. I'm
>>>> unsure if the above example match such requirement. The reported
>>>> timestamp actually matches the first byte in the aggregate segment,
>>>> inferring anything more is IMHO stretching too far the API semantic.
>>>>
>>>
>>> Note the current behavior was a conscious choice, see
>>> commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
>>> to TCP recvmsg")
>>> for the rationale.
>>>
>>
>> IIUC, the behavior of returning the timestamp of the skb with highest
>> sequence number works well without disorder. But once disorder occurs,
>> tcp coalescence can cause this issue.
>>
>>> Perhaps another application would need to add a new timestamp to report
>>> both the oldest and newest timestamps.
>>
>> I prefer this way, we do need both oldest and newest timestamps of a
>> message to find if any packet is unexpected delayed after sending.
>> But given there can be both hardware and software timestamps, we may
>> need more fields in sk_buff to carry these new timestamps.
> 
> Unfortunately returning multiple timestamps in tcp_recv_timestamp
> requires a new extended struct scm_timestamping, and likely an extra
> field to store both after coalescing.
> 
> FWIW, I maintain a patch that also changes semantics, by returning not
> the timestamp associated with the last byte in the message (which is
> the current defined behavior), but the first byte that makes the
> socket readable. Usually just the first byte, unless SO_RCVLOWAT is
> set.
> 
> It is definitely easier to define a flag like SOF_TIMESTAMPING_POLLIN
> that changes behavior of the one timestamp returned, than to return
> two timestamps.

I believe this is a step forward, because we could then choose between
the oldest and the newest timestamp. However, even with this option, it
is still unclear whether, and for how long, an skb was stuck in the
out-of-order queue.
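
To make the A/C/B example from the changelog concrete, here is a minimal
user-space sketch of the behavior discussed in this thread. It is not
kernel code: struct model_skb, model_coalesce() and model_replay_example()
are hypothetical names made up for illustration, and the sequence ranges
and nanosecond values are assumptions chosen to match "A -(1us)-> C
-(1ms)-> B". The only behavior modeled is that the timestamp of "from"
replaces the timestamp of "to" when the two are merged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for an skb: only the fields this example needs. */
struct model_skb {
	uint32_t seq;		/* first sequence number covered */
	uint32_t end_seq;	/* one past the last sequence number covered */
	bool has_rxtstamp;	/* was an rx timestamp recorded? */
	uint64_t tstamp_ns;	/* arrival time in nanoseconds */
};

/*
 * Merge "from" into "to" if "from" starts exactly where "to" ends.
 * The timestamp of "from" replaces the timestamp of "to".
 * *lost_ns is set to how much later "to" arrived than the timestamp
 * that survives the merge (0 if it did not arrive later).
 */
static bool model_coalesce(struct model_skb *to, const struct model_skb *from,
			   uint64_t *lost_ns)
{
	*lost_ns = 0;
	if (from->seq != to->end_seq)
		return false;	/* not contiguous, cannot merge */

	if (from->has_rxtstamp) {
		if (to->has_rxtstamp && to->tstamp_ns > from->tstamp_ns)
			*lost_ns = to->tstamp_ns - from->tstamp_ns;
		to->has_rxtstamp = true;
		to->tstamp_ns = from->tstamp_ns;
	}
	to->end_seq = from->end_seq;
	return true;
}

/*
 * Replay the example: A at t=0, C 1us later, B 1ms after C.
 * Returns the timestamp left on the merged B+C skb, or UINT64_MAX if
 * the model did not behave as the example assumes.
 */
static uint64_t model_replay_example(uint64_t *lost_ns)
{
	struct model_skb a = { 0, 100, true, 0 };
	struct model_skb c = { 200, 300, true, 1000 };
	struct model_skb b = { 100, 200, true, 1001000 };
	uint64_t ignored;

	/* C leaves a hole after A, so it cannot be merged and has to wait. */
	if (model_coalesce(&a, &c, &ignored))
		return UINT64_MAX;

	/* B fills the hole; C is then merged into B. */
	if (!model_coalesce(&b, &c, lost_ns))
		return UINT64_MAX;

	return b.tstamp_ns;
}
```

In this model the merged skb ends up with C's timestamp, and lost_ns is
the 1ms that C spent waiting for B. That delay is exactly what the
receiver cannot see from the reported timestamp alone.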

>>>
>>> Or add a socket flag to prevent coalescing for applications needing
>>> precise timestamps.
>>>
>>> Willem might know better about this.
>>>
>>> I agree the tracepoint seems not needed. What about solving the issue instead ?
>> Thanks.
> 
> A tracepoint is also not needed as a bpftrace program with kfunc on
> tcp_try_coalesce should be able to access this information already
> without kernel modifications. Or if it has to be at this line, a
> program with kprobe at offset, but that requires manual register
> reading.

Though using bpf fentry/kprobe could be feasible, I wonder if there are
better solutions with lower cost.
(Also, bpf fentry/kprobe may fail to attach, because tcp_try_coalesce is
static and may be inlined.)

What I mean is that we usually use timestamping to work out where jitter
occurs, so it is expected to keep running in the background alongside
the target application. In that case, a kprobe without offset adds
significant overhead to such a hot function, while a kprobe at an offset
is too fragile across kernel builds for production use.

The tracepoint can solve the problem: it does not bother receivers that
have TIMESTAMPING disabled, and it is stable enough. I'm looking forward
to any suggestions.

Thanks.
-- 
Philo