linux-kernel - RE: [PATCH 2/2] Revert "ACPI / button: Change default behavior to lid_init

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1AE640813FDE7649BE1B193DEA596E886CEA4C0A@SHSMSX101.ccr.corp.intel.com>
Date:   Wed, 17 May 2017 07:32:02 +0000
From:   "Zheng, Lv" <lv.zheng@...el.com>
To:     Benjamin Tissoires <benjamin.tissoires@...hat.com>
CC:     "Rafael J. Wysocki" <rafael@...nel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Jiri Eischmann <jeischma@...hat.com>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 2/2] Revert "ACPI / button: Change default behavior to
 lid_init_state=open"

Hi, Benjamin

> > What's that?
> > I mean, the bad faith?
> I already explained 4 times why we need to revert these two patches and
> why we need to keep 'method'. And you keep answering with long emails
> that you would rather not. I call it bad faith, sorry.

The 4 times explanations didn't answer my questions.
But that's OK, let's clarify it again.

> > > This is a REGRESSION. It used to work on thousands of devices, it
> > > doesn't anymore. So any regression has to be chased down and no good
> > > reason can justify such a regression.
> > I triggered many such kind of layered regressions and did fix them 1 by 1 in different places.
> > However, this might be different.
> No. It is a regression. It used to work for thousands of devices befor
> v4.11, and now it's broken for those devices. It's a regression.
> Some new devices are broken with "method", it's a bug, and we can't fix
> them by regressing on all the others.
...
> I call this "fixing by users", and this is wrong. It used to work for
> years for almost everybody, you can not ask users to fix this one by
> one.

What about regressions triggered by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23de5d9ef2a4
Before that (year 2007), "ignore" is the default mode.

Other than this, I just had concerns related to fixing things back and forth, but you didn't reply properly.
Again, that's OK, let's just clarify it.

> Yes, it's called a quirk. And the good practice is to register those
> quirks and make them available to everybody. Being in hwdb in user space
> or in acpi/button in kernel space doesn't matter, we need them.

I have no objections but concerns related to the combination of "default mode" and "quirk responsibles".
From my point of view, these are my conclusions:
1. If you want to use libinput to generate quirks, you should use "ignore" rather than "method" mode as default mode;
2. If you want to use button driver to generate quirks, we need "close" mode;
3. If GDM can change or users are ok use command lines, we can remain to use "open" as the default behavior.
(I'll send technical details in private about these conclusions)
But you seem to always:
1. Say no to "ignore" which makes 1 impossible;
2. Say no to "close" which makes 2 impossible;
3. Say no to "open" which makes 3 impossible.

> > We haven't asked user space to change.
> > We are just discussing the correctness of some user space behaviors.
> 
> They *are* correct.
> They are following the exported ACPI documentation

I doubt. In ACPI world, Windows is the only standard.

> and the input node documentation.
> Quoting the input doc:
> file Documentation/input/event-codes.rst:
> EV_SW
> -----
> 
> EV_SW events describe stateful binary switches. For example, the SW_LID code is
> used to denote when a laptop lid is closed.
> 
> Upon binding to a device or resuming from suspend, a driver must report
> the current switch state. This ensures that the device, kernel, and userspace
> state is in sync.
> 
> Upon resume, if the switch state is the same as before suspend, then the input
> subsystem will filter out the duplicate switch state reports. The driver does
> not need to keep the state of the switch at any time.
> ----

That's really a convenient feature for driver.
If I'm the driver writers, I would be very appreciated for being able to use such features.
So you see I don't have objections to having this feature.

I just have concerns related to:
1. Is it required to have a timeout in systemd, forcing platform to suspend again, just due to event delays?
2. Is it required to use SW_LID to determine whether an internal display should be lit on?
I don't see any conflicts between the ABI of EV_SW and the 2 questions.

> So no, you can't have 'ignore' or 'open' to be the default, because user
> space expects the switch to reflect the current state of the hardware.

Then what's the benefit of having 'method' to be the default,
Given it is still not able to reliably deliver the current state of hardware?
Actually both 'ignore/open/method' modes are trying to be compliant to EV_SW.
Among them, "ignore" did the best IMO.
And cases broken in "ignore" mode but not broken in "method" mode are all issues:
 - Platform doesn't send notification after boot/resume.
   IMO, we should also collect them and indicate them to desktop managers.

So in the end, we just have differences related to picking which default mode.

> > > You can not also change the semantic of an input switch. An input
> > > switch, as per the input subsystem is supposed to forward an actual
> > > state of the underlying hardware. Any fake information is bad and has to
> > > be avoided.
> > Since fake events are harmful, why do we fake an event after boot/resume?
> > button.lid_init_state=method seems can fake such an event.
> We don't fake an event, we are syncing the input switch state with the
> hardware.
> Faking an event is when you send "switch is open" while you know there
> is a possibility it's actually closed.

No we are faking events in this mode.
_LID can return cached state, and:
1. it might be "close" while LID is "open" after suspend.
2. it might be "open" while LID is "close" after suspend.
In this case, it explains the side effect of having type 1 fake event:
https://bugzilla.kernel.org/show_bug.cgi?id=106151
If we don't evaluate _LID and send such a "close", the platform can correctly send "open" just with a delay.

> > > I already gave you 2 solutions to fix the 7 machines you see that are
> > > problematic, and you just seem to ignore them:
> > > - revert to the v4.10 behavior and let libinput fix that for you
> >
> > I already chose this.
> > But I just raised a concern that button.lid_init_state=method could bring troubles to libinput
> quirks.
> 
> No. We can do whatever we want in libinput. And we can fix hardware when
> it appears. We don't need to have the correct solution for everybody,
> ever. libinput is a library and it can be updated, and we can ask users
> to upgrade. We can even break the API by bumping the soname. We have
> much more liberty that the kernel has, so the libinput implementation
> shouldn't be a concern.

It seems you don't know all the details I was looking at.
It's about a timing problem, with button.lid_init_state=method and libinput quirk, we actually have 2 quirks.
And libinput write can appear before/after the faked init notification triggered by "method" mode.
Causing some cases not workaroundable.
See below for details.

> > No, I currently cannot persuade myself to revert to "method" mode.
> > But that doesn't mean I don't want to or I want to.
> > You just didn't prove that by answering my questions in PATCH 1/2 and in this email.
> > Especially:
> > 1. After reverting to "method" mode, can libinput work all cases around?
> If it doesn't, we'll fix it. So yes, it will eventually.

That's great.

> >    Are there any timing issues that can prevent libinput from working some sucks around.
> >    Considering this case, it's likely such problems can happen.
> >    https://bugzilla.kernel.org/show_bug.cgi?id=106151
> 
> logind has a delay between the time it starts and the time it starts
> polling the lid switch state. The default value is a little bit too
> short on Fedora considering the faulty devices are running small CPUs.
> But this can be changed as wish by the user and by systemd. They told us
> they took one arbitrary value by default, and we can change it.
> 
> We can ask systemd/logind to change part of the behavior for new devices
> when there is a bug. But we can't ask them to change the whole design
> because there is a regression in the kernel.

That sounds great!
If systemd/logind can be changed, can we ask systemd/logind to change it to be longer enough.
For example, allowing users to configure this to "INFINITE"?
That solves many other problems.

NOTE, EV_SW is a good feature to reduce duplicated events,
but that doesn't mean users of EV_SW need to be so strict to the timings of SW_LID, right?

If it can be configured to "INFINITE", with "ignore" mode,
systemd/logind should always be able to see an "open" event before seeing a new BIOS triggered "close" events.

So why do you refuse the possibilities of using "open/ignore" as default modes?

> > 2. Who should be responsible for fixing this issue after reverting to "method" mode:
> >    https://bugzilla.kernel.org/show_bug.cgi?id=106151
> 
> libinput will change the value to open given the heuristics we have.

Yes, I see.
But I also can see one broken case cannot be worked around by libinput if you insist to use "method".

> >    What's the root cause of this issue?
> 
> Poorly written EC?
> It doesn't matter. We know the machine is buggy, we'll just need a
> workaround.

It seems you know better than the SAMSUNG official supporters?
The 10-20 seconds lag can be seen even on Windows:
http://www.samsung.com/uk/support/model/NP-N210-JP02UK

Since you seem to be able to see something I'm not aware of.
Let me ask, which do you mean by using word "poorly", EC driver or EC firmware?
If it's EC driver, can you fix it?

> > 3. How can we generate the quirk for the following possible breakage:
> >    No lid notification and _LID return cached open after boot/resume
> 
> "method" will fetch and report the cached _LID value as open, so there
> is no breakage. Unless the lid is actually closed and the EC is plain
> wrong, but that is something we can also explain to users or fix by an
> other mean. But nothing generic will work. For instance, on the Surface
> 3, the LID notification for open isn't forwarded, so I wrote a specific
> driver to get the correct behavior based on a careful analysis of the
> DSDT.

First, I couldn't see anything related to EC here but you kept on talking EC.
Do you have personal feelings against EC?

In fact, this is just a theoretical issue showing that something cannot be worked around by libinput when "method" is the default mode.
I don't even know any platform acting in this way.
But it is likely there are such platforms as the default "method" mode may cause them invisible to us.

The broken cases related to the timing are:
1. If libinput writes "close" after button driver sends initial notification using _LID return value:
   For a platform that has no lid notification and _LID return cached "open" after boot/resume.
   If such a system connects to an external monitor, and users configure to use external display, then
   "open" will arrive user space programs first, thus internal monitor will be lit on and external one will be lit off.
   Which is not what users want.
2. If libinput writes "close" before button driver sends initial notification using _LID return value:
   For a platform that has no lid notification and _LID return cached "close" after boot/resume.
   If such a system connects to an external monitor, and users configure to use internal display, then
   "close" will arrive userspace programs first, thus internal monitor will be lit off and external one will be lit on.
   Which is not what users want.
   There are even more cases broken if libinput writes "close" before button driver sends initial notification using _LID return value.
So either 1 or 2 will be broken.
What I supposed was 1 would be broken thus I listed such theoretical platform.

Cheers,
Lv