lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 9 Oct 2023 09:37:20 -0700
From: Stanislav Fomichev <sdf@...gle.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: bpf@...r.kernel.org, ast@...nel.org, daniel@...earbox.net, 
	andrii@...nel.org, martin.lau@...ux.dev, song@...nel.org, yhs@...com, 
	john.fastabend@...il.com, kpsingh@...nel.org, haoluo@...gle.com, 
	jolsa@...nel.org, kuba@...nel.org, toke@...nel.org, willemb@...gle.com, 
	dsahern@...nel.org, magnus.karlsson@...el.com, bjorn@...nel.org, 
	maciej.fijalkowski@...el.com, yoong.siang.song@...el.com, 
	netdev@...r.kernel.org, xdp-hints@...-project.net
Subject: Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata

On 10/09, Jesper Dangaard Brouer wrote:
> 
> 
> On 03/10/2023 22.05, Stanislav Fomichev wrote:
> > When we get a packet on port 9091, we swap src/dst and send it out.
> > At this point we also request the timestamp and checksum offloads.
> > 
> > Checksum offload is verified by looking at the tcpdump on the other side.
> > The tool prints pseudo-header csum and the final one it expects.
> > The final checksum actually matches the incoming packets checksum
> > because we only flip the src/dst and don't change the payload.
> > 
> > Some other related changes:
> > - switched to zerocopy mode by default; new flag can be used to force
> >    old behavior
> > - request fixed tx_metadata_len headroom
> > - some other small fixes (umem size, fill idx+i, etc)
> > 
> > mvbz3:~# ./xdp_hw_metadata eth3
> > ...
> > 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> > rx_hash: 0x2E1B50B9 with RSS type:0x2A
> > rx_timestamp:  1691436369532047139 (sec:1691436369.5320)
> > XDP RX-time:   1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
> 
> I guess system time isn't configured to be in sync with NIC HW time,
> as delta is a negative offset.
> 
> > AF_XDP time:   1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
> The AF_XDP time is also software system time and compared to XDP
> RX-time, it shows a delta of 122 usec.  This number indicate to me that
> the CPU is likely configured with power saving sleep states.

Yes, I don't do any synchronization and don't disable the sleep states.

> > 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
> > 0x1062cb8: complete tx idx=0 addr=10
> > 0x1062cb8: tx_timestamp:  1691436369598419505 (sec:1691436369.5984)
> 
> Could we add something that we can relate tx_timestamp to?
> 
> Like we do with the other delta calculations, as it helps us to
> understand/validate if the number we get back is sane. Like negative
> offset aboves tells us that system time sync isn't configured, and that
> system have configures C-states.
> 
> I suggest delta comparing "tx_timestamp" to "rx_timestamp", as they are
> the same clock domain.  It will tell us the total time spend from HW RX
> to HW TX, counting all the time used by software "ping-pong".
> 
>  1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
> 
> When implementing this, it could be (1) practical to store the
> "rx_timestamp" in the metadata area of the completion packet, or (2)
> should we have a mechanism for external storage that can be keyed on the
> umem "addr"?

Sounds good. I can probably just store last rx_timestamp somewhere in the
global var and do a delta on tx? Since the test is single threaded
and sequential, not sure we need the mechanism to pass the tstamp around.
LMK if you disagree and I'm missing something.

> > 0x1062cb8: complete rx idx=128 addr=80100
> > 
> > mvbz4:~# nc  -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
> > 
> > mvbz4:~# tcpdump -vvx -i eth3 udp
> > 	tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> > 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
> >          0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
> >          0x0010:  1270 fdff fe48 1087 fe80 0000 0000 0000
> >          0x0020:  1270 fdff fe48 1077 d9ff 2383 000b 3b8e
> >          0x0030:  7864 70
> > 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
> >          0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
> >          0x0010:  1270 fdff fe48 1077 fe80 0000 0000 0000
> >          0x0020:  1270 fdff fe48 1087 2383 d9ff 000b de7e
> >          0x0030:  7864 70
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@...gle.com>
> > ---
> >   tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
> >   1 file changed, 192 insertions(+), 10 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > index 613321eb84c1..ab83d0ba6763 100644
> > --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > @@ -10,7 +10,9 @@
> >    *   - rx_hash
> >    *
> >    * TX:
> > - * - TBD
> > + * - UDP 9091 packets trigger TX reply
> > + * - TX HW timestamp is requested and reported back upon completion
> > + * - TX checksum is requested
> >    */
> >   #include <test_progs.h>
> > @@ -24,14 +26,17 @@
> [...]
> > @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
> [...]
> > @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
> [...]
> > @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
> >   	printf("skb hwtstamp is not found!\n");
> >   }
> > +static bool complete_tx(struct xsk *xsk)
> > +{
> > +	struct xsk_tx_metadata *meta;
> > +	__u64 addr;
> > +	void *data;
> > +	__u32 idx;
> > +
> > +	if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
> > +		return false;
> > +
> > +	addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> > +	data = xsk_umem__get_data(xsk->umem_area, addr);
> > +	meta = data - sizeof(struct xsk_tx_metadata);
> > +
> > +	printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> > +	printf("%p: tx_timestamp:  %llu (sec:%0.4f)\n", xsk,
> > +	       meta->completion.tx_timestamp,
> > +	       (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> > +	xsk_ring_cons__release(&xsk->comp, 1);
> > +
> > +	return true;
> > +}
> > +
> > +#define swap(a, b, len) do { \
> > +	for (int i = 0; i < len; i++) { \
> > +		__u8 tmp = ((__u8 *)a)[i]; \
> > +		((__u8 *)a)[i] = ((__u8 *)b)[i]; \
> > +		((__u8 *)b)[i] = tmp; \
> > +	} \
> > +} while (0)
> > +
> > +static void ping_pong(struct xsk *xsk, void *rx_packet)
> > +{
> > +	struct xsk_tx_metadata *meta;
> > +	struct ipv6hdr *ip6h = NULL;
> > +	struct iphdr *iph = NULL;
> > +	struct xdp_desc *tx_desc;
> > +	struct udphdr *udph;
> > +	struct ethhdr *eth;
> > +	__sum16 want_csum;
> > +	void *data;
> > +	__u32 idx;
> > +	int ret;
> > +	int len;
> > +
> > +	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> > +	if (ret != 1) {
> > +		printf("%p: failed to reserve tx slot\n", xsk);
> > +		return;
> > +	}
> > +
> > +	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> > +	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
> > +	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> > +
> > +	meta = data - sizeof(struct xsk_tx_metadata);
> > +	memset(meta, 0, sizeof(*meta));
> > +	meta->flags = XDP_TX_METADATA_TIMESTAMP;
> > +
> > +	eth = rx_packet;
> > +
> > +	if (eth->h_proto == htons(ETH_P_IP)) {
> > +		iph = (void *)(eth + 1);
> > +		udph = (void *)(iph + 1);
> > +	} else if (eth->h_proto == htons(ETH_P_IPV6)) {
> > +		ip6h = (void *)(eth + 1);
> > +		udph = (void *)(ip6h + 1);
> > +	} else {
> > +		printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
> > +		xsk_ring_prod__cancel(&xsk->tx, 1);
> > +		return;
> > +	}
> > +
> > +	len = ETH_HLEN;
> > +	if (ip6h)
> > +		len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
> > +	if (iph)
> > +		len += ntohs(iph->tot_len);
> > +
> > +	swap(eth->h_dest, eth->h_source, ETH_ALEN);
> > +	if (iph)
> > +		swap(&iph->saddr, &iph->daddr, 4);
> > +	else
> > +		swap(&ip6h->saddr, &ip6h->daddr, 16);
> > +	swap(&udph->source, &udph->dest, 2);
> > +
> > +	want_csum = udph->check;
> > +	if (ip6h)
> > +		udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
> > +					       ntohs(udph->len), IPPROTO_UDP, 0);
> > +	else
> > +		udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
> > +						 ntohs(udph->len), IPPROTO_UDP, 0);
> > +
> > +	meta->flags |= XDP_TX_METADATA_CHECKSUM;
> > +	if (iph)
> > +		meta->csum_start = sizeof(*eth) + sizeof(*iph);
> > +	else
> > +		meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
> > +	meta->csum_offset = offsetof(struct udphdr, check);
> > +
> > +	printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
> > +	       xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
> > +
> > +	memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
> > +	tx_desc->options |= XDP_TX_METADATA;
> > +	tx_desc->len = len;
> > +
> > +	xsk_ring_prod__submit(&xsk->tx, 1);
> > +}
> > +
> >   static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
> >   {
> >   	const struct xdp_desc *rx_desc;
> > @@ -250,6 +378,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> >   	while (true) {
> >   		errno = 0;
> > +
> > +		for (i = 0; i < rxq; i++) {
> > +			ret = kick_rx(&rx_xsk[i]);
> > +			if (ret)
> > +				printf("kick_rx ret=%d\n", ret);
> > +		}
> > +
> >   		ret = poll(fds, rxq + 1, 1000);
> >   		printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
> >   		       ret, errno, bpf_obj->bss->pkts_skip,
> > @@ -280,6 +415,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> >   			       xsk, idx, rx_desc->addr, addr, comp_addr);
> >   			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
> >   					    clock_id);
> > +
> > +			if (!skip_tx) {
> > +				/* mirror the packet back */
> > +				ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
> > +
> > +				ret = kick_tx(xsk);
> > +				if (ret)
> > +					printf("kick_tx ret=%d\n", ret);
> > +
> > +				for (int j = 0; j < 500; j++) {
> > +					if (complete_tx(xsk))
> > +						break;
> > +					usleep(10*1000);
> 
> I don't fully follow why we need this usleep here.

To avoid the busypoll here (since we don't care too much about perf in
the test). But I agree, should be ok to drop, will do.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ