nvidia-smi fails to communicate with the NVIDIA driver

After a periodic Linux kernel update, you may run into the "nvidia-smi failure": even though the driver was installed and working correctly before the update, running nvidia-smi prints an error message like the following:
dluser1@cnpvgl903653:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Here is how to resolve the problem:
step 1: check your installed driver version (the nvidia-<version> directory under /usr/src)
$ sudo apt-get install dkms
$ cd /usr/src
$ ls

step 2: rebuild the nvidia driver against the current kernel, using the version found in step 1.
sudo dkms install -m nvidia -v 440-440.64.00
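Steps 1 and 2 can be combined in a small helper. The following is a minimal sketch, assuming the driver sources live in directories named nvidia-<version> as in step 1; the function names nvidia_src_versions and print_dkms_command are hypothetical, and the second one only prints the dkms command instead of running it.

```shell
# List the driver versions that have sources under a directory (default: /usr/src).
# Assumption: source directories are named nvidia-<version>.
nvidia_src_versions() {
    local src_dir="${1:-/usr/src}"
    local d
    for d in "$src_dir"/nvidia-*/; do
        [ -d "$d" ] || continue          # skip the unexpanded glob when nothing matches
        d="${d%/}"                       # drop the trailing slash
        printf '%s\n' "${d##*/nvidia-}"  # keep only the part after "nvidia-"
    done
}

# Print (do not run) the dkms command for the highest version found.
print_dkms_command() {
    local version
    version="$(nvidia_src_versions "$1" | sort -V | tail -n 1)"
    if [ -z "$version" ]; then
        echo "no nvidia sources found" >&2
        return 1
    fi
    echo "sudo dkms install -m nvidia -v $version"
}
```

Calling print_dkms_command with no argument inspects /usr/src; review the printed command before running it.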
After upgrading the kernel to 6.2.0-26-generic with nvidia driver 520.61.05, the DKMS build may fail with the following error:
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.2.0-26-generic modules....(bad exit status: 2)
ERROR (dkms apport): binary package for nvidia: 520.61.05 not found
Error! Bad return status for module build on kernel: 6.2.0-26-generic (x86_64)
Consult /var/lib/dkms/nvidia/520.61.05/build/make.log for more information.
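The root cause can be traced in /var/lib/dkms/nvidia/520.61.05/build/make.log, which shows that the driver source refers to kernel structure members that no longer exist in this kernel: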
/var/lib/dkms/nvidia/520.61.05/build/nvidia-drm/nvidia-drm-drv.c:245:21: error: 'struct drm_mode_config' has no member named 'fb_base'
/var/lib/dkms/nvidia/520.61.05/build/nvidia-drm/nvidia-drm-connector.c:101:18: error: 'struct drm_connector' has no member named 'override_edid'
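To see at a glance which structure members are missing, the log can be summarized. This is a rough sketch, assuming the compiler messages use plain ASCII quotes as in the excerpt above; missing_kernel_members is a hypothetical helper name.

```shell
# Summarize "has no member named" compile errors from a DKMS make.log
# as unique "struct name.member" pairs.
# Assumption: messages use ASCII single quotes, as in the excerpt above.
missing_kernel_members() {
    sed -n "s/.*error: '\(struct [A-Za-z0-9_]*\)' has no member named '\([A-Za-z0-9_]*\)'.*/\1.\2/p" "$1" \
        | sort -u
}
```

For example, missing_kernel_members /var/lib/dkms/nvidia/520.61.05/build/make.log prints one line per missing member, which makes it easier to recognize a driver that is too old for the running kernel.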
How to solve: upgrade to the newer nvidia driver 535 (see https://forums.linuxmint.com/viewtopic.php?t=399323):
sudo apt-get update
sudo apt-get upgrade
sudo apt install nvidia-driver-535 nvidia-dkms-535
You may then see the error "epoch: 0%| Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory". In that case, check whether libcuda.so exists in /usr/local/cuda/lib64/stubs, then add this path to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64/stubs
To make the change permanent, run sudo vim /etc/bash.bashrc and add the export line with /usr/local/cuda/lib64/stubs explicitly.
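The check and the export can be scripted so that the path is added only when libcuda.so is really there, and only once. This is a minimal sketch; the function names append_ld_library_path and add_cuda_stubs are hypothetical, and the default directory is the one named above.

```shell
# Append a directory to LD_LIBRARY_PATH only if it is not already listed.
append_ld_library_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

# Add the stubs directory only when it actually contains libcuda.so.
# Assumption: the default location is /usr/local/cuda/lib64/stubs.
add_cuda_stubs() {
    local stubs_dir="${1:-/usr/local/cuda/lib64/stubs}"
    if [ -e "$stubs_dir/libcuda.so" ]; then
        append_ld_library_path "$stubs_dir"
    else
        echo "libcuda.so not found in $stubs_dir" >&2
        return 1
    fi
}
```

Compared with a plain export, this avoids duplicate entries when the file is sourced repeatedly and avoids an empty entry when LD_LIBRARY_PATH was unset.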
Additional tip: how to effectively clean up your legacy kernels.
The complete guide is at https://www.cyberciti.biz/faq/ubuntu-18-04-remove-all-unused-old-kernels/. Here is a summary for ubuntu 16.04.
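step 1: determine your current kernel version and set $i to a pattern of packages to keep (the pattern shown is one reasonable choice; adjust it to your system)
$ v="$(uname -r | awk -F '-virtual|-generic' '{ print $1}')"
$ i="linux-headers-virtual|linux-image-virtual|linux-headers-generic|linux-image-generic|linux-headers-${v}|linux-image-${v}"
step 2: remove your legacy kernels (review the listed packages before purging)
$ dpkg --list | grep -E -i 'linux-image|linux-headers' | awk '/ii/{ print $2}' | grep -E -v "$i" # list all legacy versions
$ sudo apt-get --purge remove $(dpkg --list | grep -E -i 'linux-image|linux-headers' | awk '/ii/{ print $2}' | grep -E -v "$i") # clean up your legacy kernels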
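The listing in step 2 can be wrapped in a function that takes the keep pattern as an explicit argument, which makes a dry run easy and lets an empty pattern be rejected. This is a sketch under assumptions: list_old_kernel_packages is a hypothetical name, and it matches only lines that start with ii (fully installed packages), which is slightly stricter than the awk filter above.

```shell
# Read `dpkg --list` output on stdin and print the installed linux-image /
# linux-headers packages that do NOT match the keep pattern (an extended regex).
list_old_kernel_packages() {
    local keep_pattern="$1"
    if [ -z "$keep_pattern" ]; then
        # An empty pattern matches every line, so -v would hide all packages.
        echo "refusing to run with an empty keep pattern" >&2
        return 1
    fi
    grep -E -i 'linux-image|linux-headers' \
        | awk '/^ii/ { print $2 }' \
        | grep -E -v "$keep_pattern"
}
```

A dry run could look like dpkg --list | list_old_kernel_packages "$i", with the output checked by eye before anything is passed to apt-get.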
Another issue: "Failed to initialize NVML: Driver/library version mismatch"
step 1: determine the mismatched versions
dmesg | grep NVRM
The output shows an error like the following:
[ 39.818114] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
[ 9235.513579] NVRM: API mismatch: the client has the version 470.63.01, but
NVRM: this kernel module has the version 470.57.02. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[ 9550.325002] NVRM: API mismatch: the client has the version 470.63.01, but
NVRM: this kernel module has the version 470.57.02. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
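The two versions can be pulled out of that output automatically. The following is a rough sketch, assuming the message wording shown above; nvrm_mismatch_versions is a hypothetical helper that reads the dmesg text on stdin and reports the most recent mismatch.

```shell
# Read `dmesg` output on stdin and print the client (user-space) version and
# the loaded kernel module version from the last "API mismatch" message.
# Assumption: the message wording matches the excerpt above.
nvrm_mismatch_versions() {
    tr '\n' ' ' \
        | sed -n 's/.*client has the version \([0-9.]*\),.*kernel module has the version \([0-9.]*\)\..*/client=\1 module=\2/p'
}
```

Typical usage would be dmesg | grep NVRM | nvrm_mismatch_versions; it prints nothing when no mismatch message is present.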
Fix steps:
1. Add the ubuntu driver source (see https://askubuntu.com/questions/173721/how-do-i-update-my-nvidia-modules-after-updating-my-kernel):
sudo add-apt-repository ppa:xorg-edgers/ppa
2. Download the driver from https://www.nvidia.com/Download/driverResults.aspx/179599/en-us and install it.
How to downgrade the kernel on ubuntu 22.04
To keep the kernel stable and consistent with nvidia driver 520.61.05 (the version reported by nvidia-smi), you must avoid upgrading to kernel 5.19.0-051900. If you upgraded to 6.0.26 by mistake, see https://askubuntu.com/questions/1404722/downgrade-kernel-for-ubuntu-22-04-lts for how to downgrade. On the latest ubuntu version you have to get CUDA 12.2 ready for the system, which is a huge effort in environment preparation.
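A quick way to tell whether the running kernel is still below a given version is a version-aware comparison. This is only a sketch: version_lt is a hypothetical helper built on GNU sort -V, and the threshold you pass in is your own choice.

```shell
# Succeed if version $1 sorts strictly before version $2.
# Assumption: GNU coreutils `sort -V` is available (it is on Ubuntu).
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, version_lt "$(uname -r)" 5.19.0-051900 && echo "kernel is below the threshold" can be run before and after an upgrade to confirm which side of the threshold the system is on.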
Additional check-up on ubuntu 22.04
Check the detailed failures in /var/log/nvidia-installer.log and /var/log/cuda-installer.log. For example, /var/log/nvidia-installer.log may contain:
ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory. Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written. For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk. Please reboot your system and attempt NVIDIA driver installation again. Note if you later wish to re-enable Nouveau, you will need to delete these files: /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
To fix the issue, disable the Nouveau driver (see https://askubuntu.com/questions/841876/how-to-disable-nouveau-kernel-driver and https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau):
sudo nano /etc/modprobe.d/blacklist-nouveau.conf
Add the following lines:
blacklist nouveau
options nouveau modeset=0
Then regenerate the kernel initramfs and reboot:
sudo update-initramfs -u && sudo reboot
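The blacklist file can also be written without an editor, and the result checked after the reboot. This is a minimal sketch, assuming the same file path and contents as above; write_nouveau_blacklist and nouveau_is_loaded are hypothetical helper names, and writing under /etc requires root.

```shell
# Write the two blacklist lines to a modprobe configuration file.
# Default path is the one used above; writing there needs root privileges.
write_nouveau_blacklist() {
    local conf_file="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$conf_file"
}

# Succeed if the `lsmod` output given on stdin lists the nouveau module.
nouveau_is_loaded() {
    grep -q '^nouveau '
}
```

After the reboot, lsmod | nouveau_is_loaded && echo "nouveau is still loaded" gives a quick confirmation that the blacklist took effect (no output means the module is gone).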
For more details, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ixzz4rQODN0jy
Then install or update cuDNN:
sudo apt-get install libcudnn8 && sudo apt-get install libcudnn8-dev
You may also hit the following error, caused by an nvidia kernel module that is still in use:
exx@nlp-in-477-l:/usr/local$ cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Dec 14 11:26:41 2023
installer version: 520.61.05
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
--ui=none
--no-questions
--accept-license
--disable-nouveau
--no-cc-version-check
--install-libglvnd
Using built-in stream user interface
-> Detected 48 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
To resolve it, follow these steps (from https://unix.stackexchange.com/questions/440840/how-to-unload-kernel-module-nvidia-drm):
1) Download the latest CUDA Toolkit
2) Switch to tty3 by pressing Ctrl+Alt+F3
3) Unload nvidia-drm before proceeding.
3a) Isolate multi-user.target
sudo systemctl isolate multi-user.target
3b) Note that nvidia-drm is currently in use.
lsmod | grep nvidia.drm
3c) Unload nvidia-drm
sudo modprobe -r nvidia-drm
3d) Confirm that nvidia-drm is no longer loaded (the command should print nothing).
lsmod | grep nvidia.drm
4) Go to your download folder and run the CUDA installer (substitute the name of the runfile you downloaded).
sudo sh cuda_10.1.168_418.67_linux.run
5) Answer any prompts during installation.
6) When installation has finished, confirm that the CUDA Version has been updated.
nvidia-smi
7) Start the GUI again.
sudo systemctl start graphical.target
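The before and after checks around the unload step can be made more explicit by reading the use count that lsmod reports. This is a sketch under assumptions: module_use_count is a hypothetical helper, and it relies on the standard lsmod layout where the third column is the use count.

```shell
# Read `lsmod` output on stdin and print the use count of the named module.
# Returns non-zero when the module is not loaded.
# Assumption: lsmod columns are "Module  Size  Used by", so $3 is the count.
module_use_count() {
    awk -v m="$1" '$1 == m { print $3; found = 1 } END { exit !found }'
}
```

For example, lsmod | module_use_count nvidia_drm prints the current use count before the unload step, and fails with no output once the module has been removed.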