Message-ID: <s5hwpog377v.wl-tiwai@suse.de>
Date:	Sat, 02 Apr 2016 07:33:40 +0200
From:	Takashi Iwai <tiwai@...e.de>
To:	"Luis R. Rodriguez" <mcgrof@...nel.org>
Cc:	Stefano Stabellini <stefano.stabellini@...citrix.com>,
	Andrew Cooper <andrew.cooper3@...rix.com>,
	Stephen Hemminger <stephen@...workplumber.org>,
	Jim Gettys <jg@...edesktop.org>,
	Dario Faggioli <dario.faggioli@...rix.com>,
	George.Dunlap@...citrix.com, Ingo Molnar <mingo@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Steven Rostedt <rostedt@...dmis.org>,
	Mel Gorman <mgorman@...e.de>,
	Andy Lutomirski <luto@...capital.net>,
	Konstantin Ozerkov <kozerkov@...allels.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	ALSA development <alsa-devel@...a-project.org>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Borislav Petkov <bp@...en8.de>,
	xen-devel <xen-devel@...ts.xenproject.org>,
	Joerg Roedel <joro@...tes.org>
Subject: Re: Getting rid of inside_vm in intel8x0

On Sat, 02 Apr 2016 00:28:31 +0200,
Luis R. Rodriguez wrote:
> 
> On Fri, Apr 01, 2016 at 07:34:10AM +0200, Takashi Iwai wrote:
> > On Fri, 01 Apr 2016 00:26:18 +0200,
> > Luis R. Rodriguez wrote:
> > > 
> > > On Wed, Mar 30, 2016 at 08:07:04AM +0200, Takashi Iwai wrote:
> > > > On Tue, 29 Mar 2016 23:37:32 +0200,
> > > > Andy Lutomirski wrote:
> > > > > 
> > > > > Would it be possible to revert:
> > > > > 
> > > > > commit 228cf79376f13b98f2e1ac10586311312757675c
> > > > > Author: Konstantin Ozerkov <kozerkov@...allels.com>
> > > > > Date:   Wed Oct 26 19:11:01 2011 +0400
> > > > > 
> > > > >     ALSA: intel8x0: Improve performance in virtual environment
> > > > > 
> > > > > Presumably one or more of the following is true:
> > > > > 
> > > > > a) The inside_vm == true case is just an optimization and should apply
> > > > > unconditionally.
> > > > > 
> > > > > b) The inside_vm == true case is incorrect and should be fixed or disabled.
> > > > > 
> > > > > c) The inside_vm == true case is a special case that makes sense when
> > > > > IO is very slow but doesn't make sense when IO is fast.  If so,
> > > > > why not literally measure the time that the IO takes and switch over
> > > > > to the "inside VM" path when IO is slow?
> > > 
> > > BTW can we simulate this on bare metal by throttling an IO bus, or
> > > perhaps mucking with the scheduler ?
> > > 
> > > I ask as I wonder if a similar type of optimization may be useful
> > > to first simulate with other types of buses for other IO devices
> > > we might use in virtualization environments. If so, I'd be curious
> > > to know if similar types of optimizations might be possible for
> > > other sound cards, or other IO devices.
> > 
> > There aren't so many sound devices requiring such a workaround.
> 
> Why not? What makes this special?

The hardware bugginess.

> > > > More important condition is rather that the register updates of CIV
> > > > and PICB are atomic.
> > > 
> > > To help with this, can you perhaps elaborate a bit more on what the
> > > code does? As I read it, snd_intel8x0_pcm_pointer() returns the
> > > position within some sort of audio frame we are in, and uses two
> > > values to see if we are evaluating the right frame; an optimization
> > > of some sort skips one check in virtual environments. We seem to
> > > need this given that on a virtual environment the sound card is
> > > assumed to be emulated, and as such an IO read there is rather
> > > expensive.
> > > 
> > > Can you confirm and/or elaborate a bit more what this does ?
> > > 
> > > To try to help understand what is going on can you describe what CIV
> > > and PICB are exactly ?
> > 
> > CIV and PICB registers are a pair, and we calculate the linear
> > position in the ring buffer from the two together.
> 
> What type of ring buffer is this ?

A normal PCM ring buffer via PCI DMA transfer.

> > However, they sometimes fall out of sync under stress, and the
> > position calculated from such values may wrongly go backward.  To
> > avoid this, the driver reads the PICB register a second time,
> > compares it with the previous value, and loops until the two match.
> > This check is skipped on a VM.
> 
> I see. Is this a software emulation *bug*, or an IO issue due to
> virtualization? I tried to read what the pointer() callback (that's
> what it's called) does, but since there is no documentation for any
> of the callbacks I have no clue whatsoever.
> 
> If the former, could we somehow detect an emulated device other than
> through this type of check? Or could we *add* a capability of some
> sort to detect it in the driver? This would not address the removal,
> but it could mean finding a way to address emulation issues.
> 
> If it's an IO issue -- exactly what is causing the delays in IO?

Luis, there is no problem with the emulation itself.  It's rather an
optimization to lighten the host-side load, as I/O access on a VM is
much heavier.
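Schematically, the pointer calculation and the double-read workaround
look like this.  This is a simplified userspace sketch, not the
driver's actual code: the CIV/PICB register names are real, but the
struct layout, the callback, the retry limit, and the demo values are
illustrative assumptions.

```c
#include <stdint.h>

/* Illustrative model of one CIV/PICB read pair.  In the real driver
 * these come from I/O reads on the bus-master registers. */
struct i8x0_regs {
    uint8_t  civ;   /* Current Index Value: which period is playing */
    uint16_t picb;  /* Position In Current Buffer: samples remaining */
};

typedef struct i8x0_regs (*read_regs_fn)(void *ctx);

/* Linear position in samples: full periods already consumed, plus the
 * consumed part of the current period (PICB counts down). */
static unsigned int linear_pos(struct i8x0_regs r,
                               unsigned int period_samples)
{
    return r.civ * period_samples + (period_samples - r.picb);
}

/* The workaround: on bare metal the CIV/PICB pair is not guaranteed to
 * be read consistently, so re-read until two successive reads agree.
 * On a VM the emulated device updates both atomically, so the extra
 * (expensive) I/O reads are skipped. */
static unsigned int read_pos(void *ctx, read_regs_fn read_regs,
                             unsigned int period_samples, int inside_vm)
{
    struct i8x0_regs r = read_regs(ctx);

    if (!inside_vm) {
        for (int tries = 0; tries < 1000; tries++) {
            struct i8x0_regs r2 = read_regs(ctx);
            if (r2.civ == r.civ && r2.picb == r.picb)
                break;          /* stable pair: safe to use */
            r = r2;             /* inconsistent read: retry */
        }
    }
    return linear_pos(r, period_samples);
}

/* Demo stub: a device that always returns a stable pair. */
static struct i8x0_regs stable_read(void *ctx)
{
    (void)ctx;
    struct i8x0_regs r = { .civ = 2, .picb = 100 };
    return r;
}
```

With a 512-sample period and the stub above, both paths agree:
2 periods consumed plus 412 samples of the third.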

> > > > This is satisfied mostly only on a VM, and can't
> > > > be measured easily, unlike the IO read speed.
> > > 
> > > Interesting, note the original patch claimed it was for KVM and
> > > Parallels hypervisor only, but since the code uses:
> > > 
> > > +#if defined(__i386__) || defined(__x86_64__)
> > > +               inside_vm = inside_vm || boot_cpu_has(X86_FEATURE_HYPERVISOR);
> > > +#endif
> > > 
> > > This makes it apply to Xen as well, which makes this hack broader.
> > > But is it only applicable when an emulated device is used?  What
> > > about if a hypervisor is used and PCI passthrough is used?
> > 
> > A good question.  Xen was added there at the time based on positive
> > results from quick tests, but it might show an issue if it's running
> > on a very old chip with PCI passthrough.  But I'm not sure whether
> > PCI passthrough would work on such old chipsets at all.
> 
> If it did have an issue, then that would have to be special-cased;
> that is, the module parameter would not need to be enabled for such
> types of systems, and heuristics would be needed. As you note,
> fortunately this may not be common though...

Actually, this *is* controlled by a module parameter.  If set to a
boolean value, the workaround can be forcibly applied or skipped.  So,
if there had been a problem on Xen, it should have been reported.
That's why I wrote it's not a common case.  This comes from real
experience.
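The tri-state override can be sketched like this.  The parameter name
inside_vm matches the driver, but this resolution helper and its
conventions (-1 for auto-detect) are a simplified illustration, not
the driver's code.

```c
/* inside_vm module parameter, schematically:
 *   -1 = auto-detect via the hypervisor check (default)
 *    0 = force the bare-metal path (keep the double-read workaround)
 *    1 = force the VM path (skip the workaround)
 */
static int resolve_inside_vm(int param, int detected)
{
    if (param >= 0)
        return !!param;   /* user forced the behavior explicitly */
    return !!detected;    /* otherwise fall back to detection */
}
```

So a user hitting a problem on Xen with PCI passthrough could simply
boot with the parameter set to 0 and get the bare-metal behavior back.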

> but if this type of workaround is taken as a precedent to enable
> other types of hacks in other drivers, I'm very fearful of more hacks
> later needing these considerations as well.
> 
> > > > > There are a pile of nonsensical "are we in a VM" checks of various
> > > > > sorts scattered throughout the kernel, they're all a mess to maintain
> > > > > (there are lots of kinds of VMs in the world, and Linux may not even
> > > > > know it's a guest), and, in most cases, it appears that the correct
> > > > > solution is to delete the checks.  I just removed a nasty one in the
> > > > > x86_32 entry asm, and this one is written in C so it should be a piece
> > > > > of cake :)
> > > > 
> > > > This cake looks sweet, but a worm is hidden behind the cream.
> > > > The loop in the code itself is already a kludge for the buggy
> > > > hardware, where the inconsistent read happens not so often (only
> > > > at the boundary, and in a racy way).  It would be nice if we
> > > > could have a more reliable way to know about the hardware
> > > > bugginess, but it's difficult, unsurprisingly.
> > > 
> > > The concern here is setting precedents for VM cases sprinkled in
> > > the kernel.  The assumption here is that such special cases are
> > > really papering over another type of issue, so it's best to
> > > ultimately try to root-cause the issue in a more generalized
> > > fashion.
> >  
> > Well, it's rather bare metal that shows the buggy behavior, thus we
> > need to paper over it.  In that sense, it's the other way round; we
> > don't tune for the VM.  The VM check we're discussing is rather for
> > skipping the strange workaround.
> 
> What is it exactly about a VM that enables this work around to be skipped?
> I don't quite get it yet.

A VM -- at least a full one with sound hardware emulation -- doesn't
have the hardware bug.  So, the check isn't needed there.
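For reference, the X86_FEATURE_HYPERVISOR check quoted earlier boils
down to CPUID leaf 1, ECX bit 31, which real hardware leaves clear and
hypervisors set.  A userspace sketch (x86-only, illustrative; the
kernel reads this via its own cpufeature machinery, not like this):

```c
#include <cpuid.h>

/* Userspace analogue of boot_cpu_has(X86_FEATURE_HYPERVISOR):
 * returns 1 when running under a hypervisor, 0 on bare metal. */
static int cpu_has_hypervisor(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;            /* CPUID leaf 1 unavailable */
    return (ecx >> 31) & 1;  /* hypervisor-present bit */
}
```

Note this only says "some hypervisor is present"; it cannot tell an
emulated device from a passed-through one, which is exactly the PCI
passthrough concern raised above.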

> > You may ask whether we can remove the whole workaround instead.
> > It's practically impossible.  We don't know which models do so and
> > which don't.  And the hardware in question is (literally) thousands
> > of variants of damn old PC mobos.  Any fundamental change would need
> > to be verified on all these machines...
> 
> What if we came up with an algorithm for the ring buffer that would
> satisfy both cases without special-casing? Is removing this VM check
> really impossible?

Yes, it's practically impossible; see my comment above.
Whatever you change, you need to verify it on the real machines, and
that's very difficult to achieve.


thanks,

Takashi
