linux-kernel - Re: [PATCH net-next v2] vsock/test: Add test for null ptr deref when transport changes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <bq6hxrolno2vmtqwcvb5bljfpb7mvwb3kohrvaed6auz5vxrfv@ijmd2f3grobn>
Date: Fri, 28 Mar 2025 14:48:08 +0100
From: Stefano Garzarella <sgarzare@...hat.com>
To: Luigi Leonardi <leonardi@...hat.com>, Michal Luczaj <mhal@...x.co>
Cc: virtualization@...ts.linux.dev, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Hyunwoo Kim <v4bel@...ori.io>
Subject: Re: [PATCH net-next v2] vsock/test: Add test for null ptr deref when
 transport changes

On Wed, Mar 26, 2025 at 05:21:03PM +0100, Stefano Garzarella wrote:
>On Wed, Mar 26, 2025 at 04:14:20PM +0100, Luigi Leonardi wrote:
>>Hi Michal,
>>
>>On Wed, Mar 19, 2025 at 01:27:35AM +0100, Michal Luczaj wrote:
>>>On 3/14/25 10:27, Luigi Leonardi wrote:
>>>>Add a new test to ensure that when the transport changes a null pointer
>>>>dereference does not occur[1].
>>>>
>>>>Note that this test does not fail, but it may hang on the client side if
>>>>it triggers a kernel oops.
>>>>
>>>>This works by creating a socket, trying to connect to a server, and then
>>>>executing a second connect operation on the same socket but to a
>>>>different CID (0). This triggers a transport change. If the connect
>>>>operation is interrupted by a signal, this could cause a null-ptr-deref.
>>>
>>>Just to be clear: that's the splat, right?
>>>
>>>Oops: general protection fault, probably for non-canonical address 0xdffffc000000000c: 0000 [#1] PREEMPT SMP KASAN NOPTI
>>>KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
>>>CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted
>>>Workqueue: vsock-loopback vsock_loopback_work
>>>RIP: 0010:vsock_stream_has_data+0x44/0x70
>>>Call Trace:
>>>virtio_transport_do_close+0x68/0x1a0
>>>virtio_transport_recv_pkt+0x1045/0x2ae4
>>>vsock_loopback_work+0x27d/0x3f0
>>>process_one_work+0x846/0x1420
>>>worker_thread+0x5b3/0xf80
>>>kthread+0x35a/0x700
>>>ret_from_fork+0x2d/0x70
>>>ret_from_fork_asm+0x1a/0x30
>>>
>>
>>Yep! I'll add it to the commit message in v3.
>>>>...
>>>>+static void test_stream_transport_change_client(const struct test_opts *opts)
>>>>+{
>>>>+	__sighandler_t old_handler;
>>>>+	pid_t pid = getpid();
>>>>+	pthread_t thread_id;
>>>>+	time_t tout;
>>>>+
>>>>+	old_handler = signal(SIGUSR1, test_transport_change_signal_handler);
>>>>+	if (old_handler == SIG_ERR) {
>>>>+		perror("signal");
>>>>+		exit(EXIT_FAILURE);
>>>>+	}
>>>>+
>>>>+	if (pthread_create(&thread_id, NULL, test_stream_transport_change_thread, &pid)) {
>>>>+		perror("pthread_create");
>>>
>>>Does pthread_create() set errno on failure?
>>It does not, very good catch!
>>>
>>>>+		exit(EXIT_FAILURE);
>>>>+	}
>>>>+
>>>>+	tout = current_nsec() + TIMEOUT * NSEC_PER_SEC;
>>>
>>>Isn't 10 seconds a bit excessive? I see the oops pretty much immediately.
>>Yeah it's probably excessive. I used because it's the default 
>>timeout value.
>>>
>>>>+	do {
>>>>+		struct sockaddr_vm sa = {
>>>>+			.svm_family = AF_VSOCK,
>>>>+			.svm_cid = opts->peer_cid,
>>>>+			.svm_port = opts->peer_port,
>>>>+		};
>>>>+		int s;
>>>>+
>>>>+		s = socket(AF_VSOCK, SOCK_STREAM, 0);
>>>>+		if (s < 0) {
>>>>+			perror("socket");
>>>>+			exit(EXIT_FAILURE);
>>>>+		}
>>>>+
>>>>+		connect(s, (struct sockaddr *)&sa, sizeof(sa));
>>>>+
>>>>+		/* Set CID to 0 cause a transport change. */
>>>>+		sa.svm_cid = 0;
>>>>+		connect(s, (struct sockaddr *)&sa, sizeof(sa));
>>>>+
>>>>+		close(s);
>>>>+	} while (current_nsec() < tout);
>>>>+
>>>>+	if (pthread_cancel(thread_id)) {
>>>>+		perror("pthread_cancel");
>>>
>>>And errno here.
>>>
>>>>+		exit(EXIT_FAILURE);
>>>>+	}
>>>>+
>>>>+	/* Wait for the thread to terminate */
>>>>+	if (pthread_join(thread_id, NULL)) {
>>>>+		perror("pthread_join");
>>>
>>>And here.
>>>Aaand I've realized I've made exactly the same mistake elsewhere :)
>>>
>>>>...
>>>>+static void test_stream_transport_change_server(const struct test_opts *opts)
>>>>+{
>>>>+	time_t tout = current_nsec() + TIMEOUT * NSEC_PER_SEC;
>>>>+
>>>>+	do {
>>>>+		int s = vsock_stream_listen(VMADDR_CID_ANY, opts->peer_port);
>>>>+
>>>>+		close(s);
>>>>+	} while (current_nsec() < tout);
>>>>+}
>>>
>>>I'm not certain you need to re-create the listener or measure the time
>>>here. What about something like
>>>
>>>	int s = vsock_stream_listen(VMADDR_CID_ANY, opts->peer_port);
>>>	control_expectln("DONE");
>>>	close(s);
>>>
>>Just tried and it triggers the oops :)
>
>If this works (as I also initially thought), we should check the 
>result of the first connect() in the client code. It can succeed or 
>fail with -EINTR, in other cases we should report an error because it 
>is not expected.
>
>And we should check also the second connect(), it should always fail, 
>right?
>
>For this I think you need another sync point to be sure the server is 
>listening before try to connect the first time:
>
>client:
>    // pthread_create, etc.
>
>    control_expectln("LISTENING");
>
>    do {
>        ...
>    } while();
>
>    control_writeln("DONE");
>
>server:
>    int s = vsock_stream_listen(VMADDR_CID_ANY, opts->peer_port);
>    control_writeln("LISTENING");

We found that this needed to be extended by adding an accept() loop to 
avoid filling up the backlog of the listening socket.
But by doing accept() and close() back to back, we found a problem in 
AF_VSOCK, where connect() in some cases would get stuck until the 
timeout (default: 2 seconds) returning -ETIMEDOUT.

Fix is coming.

Thanks,
Stefano

>    control_expectln("DONE");
>    close(s);
>
>Thanks,
>Stefano