I have a Raspberry Pi 4 with an SSD running Raspberry Pi OS Lite. I run Docker containers for Home Assistant, Pi-Hole, and NetDaemon. I've been experiencing random hangs where the system becomes unresponsive.
Here's a detailed breakdown of what I've tried and observed:
Symptoms Observed:
The first sign of an issue is usually the internet going offline, which indicates that Pi-Hole is unresponsive.
My Home Assistant app shows "disconnected," and automations stop working, indicating that the container has stopped running.
I cannot log into Pi-Hole.
When trying to connect via SSH, I receive a "connection closed/refused" error (I can't remember the exact error message).
It is quicker and easier to just pull the power and restart the system.
Initial Suspicion:
I initially suspected a brute force attack on SSH might be causing the issue.
Action Taken: Installed fail2ban to mitigate potential SSH attacks.
Outcome: Seemed to improve initially, but the issue resurfaced after a few weeks.
OS Upgrade:
Action Taken: Upgraded the OS to Bullseye.
Outcome: Continued to experience the same problem, and additionally encountered issues switching from a 32-bit to a 64-bit kernel.
Recent Activity:
I went on holiday, and within three weeks the system hung six times, each time requiring a manual restart.
Action Taken: Made a backup of my config and data, then reflashed the SSD with a new image of Bookworm 64-bit Lite.
Outcome: Reinstalled everything on Sunday, but the system hung again on Wednesday morning between 4am and 5am for no apparent reason.
Request for Assistance:
I am looking for guidance on how to effectively troubleshoot this issue. From what I understand, Linux systems are usually very stable, and some people run them for years without needing to reboot. Here are some specific questions and areas where I need help:
Hardware Issues:
Could this be a hardware problem with my Raspberry Pi 4 or SSD?
Log Files:
Are there specific logs I should check to identify what is causing these hangs?
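As a starting point for the log question, here is a minimal sketch of pulling warnings and errors from the boot that ended in the hang. It assumes the systemd journal is persistent (that is, /var/log/journal exists or Storage=persistent is set in journald.conf); otherwise the previous boot's entries are lost on restart. The function names are hypothetical, and the default priority and line count are arbitrary choices.

```python
import subprocess


def previous_boot_journal_cmd(priority="warning", lines=200):
    """Build a journalctl command for the tail of the previous boot's log."""
    return [
        "journalctl",
        "--boot=-1",               # -1 = the boot before the current one
        f"--priority={priority}",  # this priority and anything more severe
        f"--lines={lines}",        # only the last N matching entries
        "--no-pager",
    ]


def show_previous_boot_journal(priority="warning", lines=200):
    """Run the command and return its output (or the error text if it fails)."""
    cmd = previous_boot_journal_cmd(priority, lines)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout or result.stderr
```

The last entries before the hang are the interesting ones, which is why the sketch limits the output to the tail of the previous boot instead of dumping everything.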
Telemetry and Monitoring:
Are there tools or methods to log telemetry data such as CPU temperatures, memory usage, etc., to help narrow down the problem?
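To make the telemetry question concrete, here is a minimal sketch of the kind of logger that could run in the background and leave a trail up to the moment of a hang. It assumes the usual Linux paths /sys/class/thermal/thermal_zone0/temp and /proc/meminfo are present; the log file name, the function names, and the 60-second interval are all hypothetical choices. This is not a substitute for a proper monitoring tool.

```python
import csv
import os
import time
from pathlib import Path

# Assumed paths: common on Raspberry Pi OS, but verify them on your system.
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
MEMINFO_PATH = Path("/proc/meminfo")
LOG_PATH = Path("/var/log/pi-telemetry.csv")  # hypothetical name; needs root to write


def read_cpu_temp_c(path=THERMAL_PATH):
    """Return the CPU temperature in degrees Celsius, or None if unreadable."""
    try:
        # The kernel reports millidegrees Celsius as a plain integer.
        return int(path.read_text().strip()) / 1000.0
    except (OSError, ValueError):
        return None


def read_mem_available_kib(path=MEMINFO_PATH):
    """Return the MemAvailable value from meminfo in KiB, or None if not found."""
    try:
        for line in path.read_text().splitlines():
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass
    return None


def sample():
    """Collect one row: timestamp, CPU temperature, available memory, 1-minute load."""
    load1 = os.getloadavg()[0]
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    return [stamp, read_cpu_temp_c(), read_mem_available_kib(), load1]


def append_row(row, log_path=LOG_PATH):
    """Append one CSV row and force it to disk so it survives a hard power cycle."""
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow(row)
        f.flush()
        os.fsync(f.fileno())


def run(interval_s=60, iterations=None):
    """Log a sample every interval_s seconds; iterations=None means run forever."""
    count = 0
    while iterations is None or count < iterations:
        append_row(sample())
        count += 1
        time.sleep(interval_s)

# To log continuously, call run() from a service or a background shell.
```

The fsync after every row is deliberate: since recovery means pulling the power, anything still sitting in a write buffer would be lost, and the last few rows before the hang are the ones that matter.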
Also posted here: https://www.reddit.com/r/raspberry_pi/comments/1de9q05/rpi_4_keeps_hanging_docker_services_ssh_fail/