HP laptop keyboard won’t type on Linux

Here’s another story from my “WTF, computer?!” files (and also my “oh I’m dumb” files).

As I regularly do, I recently updated my Fedora machines. This includes the crappy HP 2000-2b30DX Notebook PC that I bought as a refurb in 2013. After dnf finished, I rebooted the laptop and put it away. Then while I was at a conference last week, my wife sent me a text telling me that she couldn’t type on it.

When I got home I took a look. Sure enough, they keyboard didn’t key. But it was weirder than that. I could type in the decryption password for the hard drive at the beginning of the boot process. And when I attached a wireless keyboard, I could type. Knowing the hardware worked, I dropped to runlevel 3. The built-in keyboard worked then.

I tried applying the latest updates, but that didn’t help. Some internet searching lead me to Freedesktop.org bug 103561. Running dnf downgrade libinput and rebooting gave me a working keyboard again. The bug is closed as NOTABUG, since the maintainers say it’s an issue in the kernel, which is fixed in the 4.13 kernel release. So I checked to see if Fedora 27, which was released last week, includes the 4.13 kernel. It does, and so does Fedora 26.

That’s when I realized I still had the kernel package excluded from dnf updates on that machine because of a previous issue where a kernel update caused the boot process to hang while/after loading the initrd. I removed the exclusion, updated the kernel, and re-updated libinput. After a reboot, the keyboard still worked. But if you’re using a kernel version from 4.9 to 4.12, libinput 1.9, and an HP device, your keyboard may not work. Update to kernel 4.13 or downgrade libinput (or replace your hardware. I would not recommend the HP 2000 Notebook. It is not good.)

Disappearing WiFi with rt2800pci

I recently did a routine package update on my Fedora 24 laptop. I’ve had the laptop for three years and have been running various Fedorae the whole time, so I didn’t think much of it. So it came as some surprise to me when after rebooting I could no longer connect to my WiFi network. In fact, there was no indication that any wireless networks were even available.

Since the update included a new kernel, I thought that might be the issue. Rebooting into the old kernel seemed to fix it (more on that later!), so I filed a bug, excluded kernel packages from future updates, and moved on.

But a few days later, I rebooted and my WiFi was gone again. The kernel hadn’t updated, so what could it be? I spent a lot of time flailing around until I found a “solution”. A four-year-old forum post said don’t reboot. Booting from off or suspending and resuming the laptop will cause the wireless to work again.

And it turns out, that “fixed” it for me. A few other posts seemed to suggest power management issues in the rt2800pci driver. I guess that’s what’s going on here, though I can’t figure out why I’m suddenly seeing it after so long. Seems like a weird failure mode for failing hardware.

Here’s what dmesg and the systemd journal reported:

Aug 01 14:54:24 localhost.localdomain kernel: ieee80211 phy0: rt2800_wait_wpdma_ready: Error - WPDMA TX/RX busy [0x00000068]
Aug 01 14:54:24 localhost.localdomain kernel: ieee80211 phy0: rt2800pci_set_device_state: Error - Device failed to enter state 4 (-5)

Hopefully, this post saves someone else a little bit of time in trying to figure out what’s going on.

The tricky problem dilemma

A good sysadmin believes in treating the cause, not the symptom. Unfortunately, pragmatism sometimes gets in the way of that. A recent example: we just rolled out a kernel update to a few of our compute clusters. About 3% of the machines ended up in a troubled state. By troubled, I mean that the permissions on a few directories (/bin, /lib, /dev, /etc, /proc, and /sys) were set to 700, making the machine effectively unusable. For the most part, we didn’t notice this on the affected machines until after they did their post-upgrade reboot, but fortunately we were able to catch a few that hadn’t yet rebooted.

What we found was that / had a sysroot directory and an init file. These are created by the mkinitrd script, which is called by the new-kernel-pkg script, which is in turn called in the postinstall script of the kernel RPM. The relevant part of the mkinitrd script seems to be

TMPDIR=""
    for t in /tmp /var/tmp /root ${PWD}; do
        if [ ! -d $t ]; then continue; fi
        if ! access -w $t ; then continue; fi

        fs=$(df -T $t 2>/dev/null | awk '{line=$1;} END {printf $2;}')
        if [ "$fs" != "tmpfs" ]; then
            TMPDIR=$t
            break
        fi
    done

which creates a working directory in /tmp under normal conditions. However, there seemed to be something that caused / to be used instead of /tmp. Later in the script, several directories are created in $TMPDIR, which correspond to the wrongly-permissioned directories. There’s not a clear indication of why this happens, but if we clean up and reinstall the updated kernel package it doesn’t necessarily repeat itself. After some soul-searching, we decided that it was more important to return the nodes to service than to try to track down an easily-correctable-but-difficult-to-solve problem. We’ll see if it happens again with the next kernel upgrade.