Message-ID: <833d4550f9f6436ac8d61919e05272ba@anonaddy.me>
Date: Sat, 07 Feb 2026 18:56:49 +0000
From: agpn1b92@...naddy.me
To: sgarzare@...hat.com
Cc: virtualization@...ts.linux.dev, netdev@...r.kernel.org
Subject: Re: [BUG] vsock: poll() not waking on data arrival, causing
 multi-second SSH delays

Hi Stefano and all,

Thank you Stefano for your response and skepticism about whether this was
a kernel issue - you were absolutely right to question it!

After extensive debugging with strace on both guest and host, I've
determined this was NOT a kernel bug at all, but rather an OpenSSH issue
specific to vsock connections.

Root Cause:
-----------
The 10-20 second delay was caused by OpenSSH's sshd attempting DNS lookups
on the literal string "UNKNOWN" (the placeholder hostname used for vsock
connections where no IP address exists). This triggered two 5-second DNS
timeouts during login recording and audit subsystem operations, totaling
~10 seconds of delay.

The strace showed:
   17:11:14.465 sendmmsg(13, DNS query for "UNKNOWN")
   17:11:14.465 poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005s>
   17:11:19.472 sendmmsg(13, DNS query for "UNKNOWN") [RETRY]
   17:11:19.472 poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005s>
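
For anyone who wants to check whether their system stalls the same way
outside of sshd, here is a minimal Python sketch that times a single lookup
of a name. It is not what sshd does internally; time_lookup() is a
hypothetical helper name, and it assumes the system resolver (via
socket.getaddrinfo) follows the same nsswitch/DNS path that sshd's lookup
took. The resolver argument exists only so the helper can be exercised
without touching the network.

```python
import socket
import time


def time_lookup(name, resolver=socket.getaddrinfo):
    """Time one lookup of `name`; return (seconds_elapsed, resolved).

    `resolver` defaults to the system resolver. It is injectable
    (an assumption of this sketch) so the helper can be tested offline.
    """
    start = time.monotonic()
    try:
        resolver(name, None)
        resolved = True
    except OSError:
        # socket.gaierror is a subclass of OSError, so failed lookups land here.
        resolved = False
    return time.monotonic() - start, resolved


# Example use (performs a real lookup, may block for several seconds):
#   elapsed, resolved = time_lookup("UNKNOWN")
#   print(f"resolved={resolved} after {elapsed:.3f}s")
```

If the elapsed time is on the order of the resolver timeout rather than a
few milliseconds, the host is likely exposed to the same delay.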

Why I Initially Thought It Was a Kernel Issue:
----------------------------------------------
- bpftrace showed ppoll() timeouts while data appeared to be queued
- The pattern looked like a classic lost wakeup race condition

However, the vsock kernel modules were working perfectly. The delay
happened in userspace during sshd's session setup, specifically when
mm_record_login() tried to resolve the peer hostname for logging.

The Fix:
--------
OpenSSH 10.1 and 10.2 include fixes to prevent passing "UNKNOWN" to
subsystems that would attempt DNS resolution:

- 10.1: Skip audit logging for UNKNOWN hostnames
- 10.2: Don't set PAM_RHOST when remote host is "UNKNOWN"

References:
- https://github.com/openssh/openssh-portable/pull/388
- https://gitlab.archlinux.org/archlinux/packaging/packages/openssh/-/issues/16
- https://www.openssh.org/releasenotes.html

Workaround for older OpenSSH versions:
Add to /etc/hosts: 127.0.0.1 UNKNOWN
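
To confirm the workaround is in place, a small sketch like the following can
check whether a hosts file already maps the name. hosts_maps_name() is a
hypothetical helper, and the case-insensitive comparison is an assumption
about how the resolver matches names in /etc/hosts.

```python
def hosts_maps_name(hosts_text, name):
    """Return True if any non-comment entry in hosts_text lists `name`."""
    wanted = name.lower()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address + hostnames.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the address; the rest are the hostname and aliases.
        if wanted in (f.lower() for f in fields[1:]):
            return True
    return False


# Example use against the real file:
#   with open("/etc/hosts") as fh:
#       print(hosts_maps_name(fh.read(), "UNKNOWN"))
```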

Apologies for the noise on netdev - the vsock kernel implementation is
working correctly. The misleading symptoms (PTY-specific, ppoll timeouts,
state between connections) made it appear kernel-related when it was
actually sshd's login recording code hitting DNS timeouts.

Thanks again for your help and for maintaining the vsock subsystem!

Best regards,
[Your name - don't forget to update it this time or you'll look even 
more stupid]


