nvidia-smi fails to communicate with the driver

After a periodic Linux kernel update, you may run into an "nvidia-smi failure": even if the driver was working correctly before the update, running nvidia-smi prints an error message like the following:

dluser1@cnpvgl903653:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Here is how to resolve the problem:

step 1: install DKMS and check which driver version has its source on disk

$ sudo apt-get install dkms
$ cd /usr/src
$ ls
source code on disk
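
The version string needed for step 2 can also be read from the directory names instead of by eye. This is a minimal sketch, assuming the usual DKMS layout /usr/src/nvidia-<version>; list_nvidia_sources is a hypothetical helper name, not part of any package.

```shell
# List driver versions that have a source tree on disk.
# Assumption: sources live in <root>/nvidia-<version> (default root /usr/src).
list_nvidia_sources() (
    src_root="${1:-/usr/src}"
    for dir in "$src_root"/nvidia-*; do
        [ -d "$dir" ] || continue              # skip plain files and an unexpanded glob
        basename "$dir" | sed 's/^nvidia-//'   # keep only the version part
    done
)

list_nvidia_sources
```

Each printed line is a candidate for the -v argument of dkms.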

step 2: rebuild the NVIDIA driver against the current kernel

sudo dkms install -m nvidia -v 440-440.64.00
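
Before rebooting, it can help to confirm that DKMS really registered the module for the running kernel. This is a rough sketch, assuming `dkms status` prints one line per module and kernel pair and marks finished builds with the word "installed" (the exact layout differs between DKMS versions); nvidia_installed_for is a hypothetical helper name.

```shell
# Succeeds when the text on stdin (output of `dkms status`) reports an
# nvidia module as installed for the given kernel release.
nvidia_installed_for() {
    grep -E '^nvidia' | grep -F -- "$1" | grep -q 'installed'
}

# Usage:
#   dkms status | nvidia_installed_for "$(uname -r)" && echo "module ready"
```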

After upgrading the kernel to 6.2.0-26-generic with NVIDIA driver 520.61.05, you may hit the following build error:

'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.2.0-26-generic modules....(bad exit status: 2)
ERROR (dkms apport): binary package for nvidia: 520.61.05 not found
Error! Bad return status for module build on kernel: 6.2.0-26-generic (x86_64)
Consult /var/lib/dkms/nvidia/520.61.05/build/make.log for more information.

The root cause can be found in /var/lib/dkms/nvidia/520.61.05/build/make.log, which shows:

/var/lib/dkms/nvidia/520.61.05/build/nvidia-drm/nvidia-drm-drv.c:245:21: error: 'struct drm_mode_config' has no member named 'fb_base'
/var/lib/dkms/nvidia/520.61.05/build/nvidia-drm/nvidia-drm-connector.c:101:18: error: 'struct drm_connector' has no member named 'override_edid'
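
make.log is long, so filtering it down to the distinct compiler errors makes the root cause easier to spot. This is a small sketch, assuming the gcc-style "file:line:col: error: message" format shown above; build_errors is a hypothetical helper name.

```shell
# Print the distinct compiler error messages recorded in a DKMS make.log.
build_errors() {
    grep -E ': error: ' "$1" | sed -E 's/^.*: error: //' | sort -u
}

# Usage:
#   build_errors /var/lib/dkms/nvidia/520.61.05/build/make.log
```

Errors about struct members that no longer exist, as here, usually mean the driver source predates the kernel's API and a newer driver is needed.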

How to solve: upgrade to the newer NVIDIA driver 535 (see https://forums.linuxmint.com/viewtopic.php?t=399323):

sudo apt-get update
sudo apt-get upgrade
sudo apt install nvidia-driver-535 nvidia-dkms-535

You may then hit the error "epoch: 0%| Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory". In that case, check whether libcuda.so is in /usr/local/cuda/lib64/stubs, then add this path to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64/stubs

To make this apply to every shell, run sudo vim /etc/bash.bashrc and add the export line with /usr/local/cuda/lib64/stubs explicitly.
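
If the export line ends up in a file that is sourced more than once, the same path gets appended repeatedly. The sketch below adds the directory only when it is missing and avoids a leading colon when LD_LIBRARY_PATH is empty; add_lib_path is a hypothetical helper name.

```shell
# Append a directory to LD_LIBRARY_PATH only if it is not already listed.
add_lib_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

add_lib_path /usr/local/cuda/lib64/stubs
```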

Additional tip: how to clean up legacy kernels effectively.

The complete guide is at https://www.cyberciti.biz/faq/ubuntu-18-04-remove-all-unused-old-kernels/. Here is a summary for Ubuntu 16.04.

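step 1: determine your current kernel version and build the pattern of packages to keep

$ v="$(uname -r | sed -E 's/-[a-z]+$//')" # running kernel version without the flavour suffix, e.g. 6.2.0-26
$ i="${v}|linux-(image|headers)-(generic|virtual|lowlatency)" # keep the running kernel and the meta packages

step 2: remove your legacy kernels

$ dpkg --list | grep -Ei 'linux-image|linux-headers' | awk '/^ii/{ print $2}' | grep -Ev "$i" # list all legacy versions
$ sudo apt-get --purge remove $(dpkg --list | grep -Ei 'linux-image|linux-headers' | awk '/^ii/{ print $2}' | grep -Ev "$i") # clean up your legacy kernels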

Another issue: "Failed to initialize NVML: Driver/library version mismatch"

step 1: determine mismatched version

dmesg | grep NVRM

The output shows the mismatch:

[   39.818114] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.57.02  Tue Jul 13 16:14:05 UTC 2021
[ 9235.513579] NVRM: API mismatch: the client has the version 470.63.01, but
               NVRM: this kernel module has the version 470.57.02.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
[ 9550.325002] NVRM: API mismatch: the client has the version 470.63.01, but
               NVRM: this kernel module has the version 470.57.02.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
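
The two versions can be pulled out of the log automatically, which is handy when checking several machines. This is a minimal sketch, assuming the message wording shown above; nvrm_mismatch is a hypothetical helper name.

```shell
# Read kernel log text on stdin and print the most recent client/module
# version pair reported in an "API mismatch" message.
nvrm_mismatch() {
    awk '
        # Return the dotted version number that follows "the version ".
        function version_after(text) {
            sub(/.*the version /, "", text)
            return match(text, /^[0-9]+(\.[0-9]+)+/) ? substr(text, RSTART, RLENGTH) : ""
        }
        /the client has the version/    { client = version_after($0) }
        /kernel module has the version/ { module = version_after($0) }
        END { if (client != "" && module != "") print "client=" client " module=" module }
    '
}

# Usage:
#   dmesg | nvrm_mismatch
```

Here the client is the user-space library and the module is what the kernel currently has loaded. No output means no mismatch message was found.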

Fix steps:

1. Add the Ubuntu driver PPA (see https://askubuntu.com/questions/173721/how-do-i-update-my-nvidia-modules-after-updating-my-kernel):

sudo add-apt-repository ppa:xorg-edgers/ppa

2. Download the driver from https://www.nvidia.com/Download/driverResults.aspx/179599/en-us and install it.

How to downgrade the kernel on Ubuntu 22.04

To keep the kernel stable and compatible with NVIDIA driver 520.61.05, you must avoid upgrading to kernel 5.19.0-051900. If you upgraded to 6.0.26 by mistake, see https://askubuntu.com/questions/1404722/downgrade-kernel-for-ubuntu-22-04-lts for how to downgrade. On this latest Ubuntu version you would otherwise have to get CUDA 12.2 ready for the system, which is a large effort in environment preparation.
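
A quick guard can warn when a machine has already moved past the kernel you want to stay below. This is a sketch under assumptions: the 5.19.0 threshold is taken from the paragraph above, the comparison relies on `sort -V` from GNU coreutils, and kernel_older_than is a hypothetical helper name.

```shell
# Succeeds when kernel release $1 sorts strictly before version $2.
kernel_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

if kernel_older_than "$(uname -r)" 5.19.0; then
    echo "kernel $(uname -r) predates 5.19.0"
else
    echo "kernel $(uname -r) is 5.19.0 or newer; consider downgrading"
fi
```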

Additional checks on Ubuntu 22.04

Check the detailed failures in /var/log/nvidia-installer.log and /var/log/cuda-installer.log. In /var/log/nvidia-installer.log you may see:

ERROR: The Nouveau kernel driver is currently in use by your system.  This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory.  Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written.  For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk.  Please reboot your system and attempt NVIDIA driver installation again.  Note if you later wish to re-enable Nouveau, you will need to delete these files: /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Here is how to fix the issue (see https://askubuntu.com/questions/841876/how-to-disable-nouveau-kernel-driver and https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau):

sudo nano /etc/modprobe.d/blacklist-nouveau.conf

Add the following lines:

blacklist nouveau
options nouveau modeset=0
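
On a headless machine or in a provisioning script, the same file can be written without opening an editor. This is a minimal sketch; nouveau_blacklist is a hypothetical helper name, and the target path is the one used above.

```shell
# Print the two modprobe configuration lines that disable nouveau.
nouveau_blacklist() {
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0'
}

# Usage (writes the file as root):
#   nouveau_blacklist | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
```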

Regenerate the kernel initramfs and reboot

sudo update-initramfs -u && sudo reboot

For more details, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ixzz4rQODN0jy

Next, install the cuDNN local repository package from https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/11.8/cudnn-local-repo-ubuntu2204-8.8.1.3_1.0-1_amd64.deb/

then install the cuDNN libraries:

sudo apt-get install libcudnn8 && sudo apt-get install libcudnn8-dev

You may also hit the following issue, caused by an NVIDIA device that is still in use:

exx@nlp-in-477-l:/usr/local$ cat /var/log/nvidia-installer.log 
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Dec 14 11:26:41 2023
installer version: 520.61.05

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --ui=none
    --no-questions
    --accept-license
    --disable-nouveau
    --no-cc-version-check
    --install-libglvnd

Using built-in stream user interface
-> Detected 48 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

To fix it, follow these steps (from https://unix.stackexchange.com/questions/440840/how-to-unload-kernel-module-nvidia-drm):

1) Download the latest CUDA Toolkit

2) Switch to tty3 by pressing Ctrl+Alt+F3

3) Unload nvidia-drm before proceeding.

3a) Isolate multi-user.target

sudo systemctl isolate multi-user.target

3b) Note that nvidia-drm is currently in use.

lsmod | grep nvidia.drm

3c) Unload nvidia-drm

sudo modprobe -r nvidia-drm

3d) Note that nvidia-drm is not in use anymore.

lsmod | grep nvidia.drm

5) Go to your download folder and run the cuda installation.

sudo sh cuda_10.1.168_418.67_linux.run

6) Answer any prompts during installation.

7) When installation has finished, confirm that the CUDA Version has been updated.

nvidia-smi

8) Start the GUI again.

sudo systemctl start graphical.target
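
The before and after checks around modprobe -r can also be scripted by reading the use count from lsmod. This is a small sketch, assuming the standard three-column lsmod layout (name, size, use count), in which the module appears as nvidia_drm with an underscore; module_use_count is a hypothetical helper name.

```shell
# Read `lsmod` output on stdin and print the use count (third column)
# of the given module; prints nothing when the module is not loaded.
module_use_count() {
    awk -v mod="$1" '$1 == mod { print $3 }'
}

# Usage:
#   lsmod | module_use_count nvidia_drm
```

A non-zero count means something still holds the module. Empty output after modprobe -r confirms it was unloaded.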