showing system's health on /dev/tty1

showing system's health on /dev/tty1 - freem - 17-12-2020

Hello.

I am growing tired of /dev/tty1 showing a boot sequence which is... useless.

It is useless because, basically, it spits out a lot of lines that no one will ever be able to read (yes, even with the --noclear option of getty).
Also, in practice, my system starts services in parallel (with runit) and most of the logs are redirected to svlogd, which means several things:

1) gettys are initialized far _before_ the system is completely up and running (especially network interfaces)
2) you don't know if something is wrong unless it's so wrong that you can't log in, or there is some other similarly big issue

So, I had the idea of *not* using /dev/tty1 for logins (you know, the usual line in ~/.profile: `test "$(tty)" = /dev/tty1 && exec xinit`) but for showing a quick health summary instead.
What I would need is a tool which can show:

* (only "bad"?) lines in dmesg
* daemons' status and their age (so that one can tell if they are restarting every few seconds)
* system's uptime
* system's NICs and IPs
* system's hostname
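
As a rough illustration of how little is needed to collect most of the items above, here is a minimal Python sketch (not the statically linked C++ tool discussed in this thread). It assumes Linux for `/proc/uptime` and for the `/dev/kmsg` record layout (`prio,seq,usec,flags;message`, with the level in the low three bits of `prio`). The function names are made up for the example, only interface names are listed (not their IPs), and daemon status is left out.

```python
import socket


def format_uptime(seconds):
    """Render a duration in seconds as 'Nd HH:MM:SS'."""
    seconds = int(seconds)
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours:02d}:{minutes:02d}:{secs:02d}"


def read_uptime(path="/proc/uptime"):
    """Linux-only: the first field of /proc/uptime is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])


def parse_kmsg_record(record):
    """Split one /dev/kmsg record into (level, first line of the message)."""
    header, _, message = record.partition(";")
    priority = int(header.split(",")[0])
    # priority = facility * 8 + level, so the level is the low three bits.
    return priority & 7, message.split("\n")[0]


def bad_kernel_lines(records, max_level=4):
    """Keep messages at level 'warning' (4) or more severe (lower number)."""
    kept = []
    for record in records:
        level, message = parse_kmsg_record(record)
        if level <= max_level:
            kept.append(message)
    return kept


def summary_lines():
    """Collect hostname, uptime and NIC names without talking to any daemon."""
    lines = [f"host:   {socket.gethostname()}"]
    try:
        lines.append(f"uptime: {format_uptime(read_uptime())}")
    except OSError:
        lines.append("uptime: unknown")
    try:
        names = [name for _, name in socket.if_nameindex()]
    except (AttributeError, OSError):
        names = []
    lines.append("nics:   " + " ".join(names))
    return lines


if __name__ == "__main__":
    print("\n".join(summary_lines()))
```

Everything here reads kernel-provided state directly, which fits the "minimal dependencies" goal: nothing in it needs another daemon to be up.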

Constraints:

* run the checks either in real time or at a reasonably slow polling interval
* minimal dependencies, especially in terms of daemons already being up and running (obviously: the goal is to be able to quickly see what is wrong when something is)

Ideally, it would work with any POSIX system and only be "active" when the TTY it runs on is actually displayed (I know it's doable), but those are just nice-to-haves, not mandatory (I could hack my way in there if necessary).

Does anyone know of such a tool? (If I write one myself, it will likely be a statically linkable C++ application using Linux's framebuffer API, which means it would not run on non-Linux kernels.)

RE: showing system's health on /dev/tty1 - movq - 19-12-2020

(17-12-2020, 08:45 AM)freem Wrote: Does anyone know of such a tool?

Not really, but it’s a nice idea. Output on tty1 has become super useless over the years (starting with KMS, which nukes half the output midway through). Showing something meaningful instead would be lovely.

It’s really strange, now that I think about it. Why are we flying blind? Why isn’t what you proposed already the standard? Servers have proper monitoring (hopefully), but on the desktop, nope, let’s just hope for the best. Weird.

Actually, this might even be interesting for our servers. Sometimes I do reboot them and watch their VNC/serial console, I see all that useless output scrolling by … but no idea if everything’s fine. I have to check something like Icinga for that.

Hmm … Might be a project for the upcoming vacation? :) (If I do it, I’d probably go for something systemd-based, though, since that’s what we run.)

(17-12-2020, 08:45 AM)freem Wrote: Ideally, it would work with any POSIX system

This might be a dumb question, but how do you check if a particular daemon is running in a POSIX-ly portable way?

RE: showing system's health on /dev/tty1 - freem - 19-12-2020

(19-12-2020, 05:45 PM)vain Wrote: This might be a dumb question, but how do you check if a particular daemon is running in a POSIX-ly portable way?

Well... that's the problem actually.
I have already started working on it, or at least started thinking about how the monitoring should be done without adding dependencies to daemons, that is, by just observing the system's state.
For now, my answer is basically: scan the logs, and hope they don't lie. For that to work, each daemon must have a single "current" log file which, at the very least, records when the daemon has actually started (that is, after all pre-checks are done) and when it dies.
Based on that, you usually also need a configurable delay before you can say the daemon is *really* doing its job (it might be a few seconds, it might be minutes...), because many daemons won't say "hey, I've opened my socket and am listening on it!".

Doing that while handling file rotation is annoying. Knowing when a file has been written to is rather easy: using `poll` should do the job (yet to be verified), but what happens when there's a rotation? POSIX does not expose an API to watch a directory, so I'll need to use the Linux-specific inotify. Ideally, I would have an #ifdef to use whatever BSD mechanism there might be...
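
One portable fallback, sketched here in Python rather than C++, is to skip directory watching entirely and call `stat()` on the log path at each refresh. `poll` reports regular files as always readable, so it cannot signal new data by itself. The sketch assumes rotation happens by renaming the file and creating a new one under the same name, or by truncating it in place. `LogFollower` is a made-up name, not part of any existing tool.

```python
import os


class LogFollower:
    """Follow a log file by polling stat(), reopening it after rotation."""

    def __init__(self, path):
        self.path = path
        self.file = None
        self.inode = None
        self.buffer = b""

    def _open(self):
        self.file = open(self.path, "rb")
        self.inode = os.fstat(self.file.fileno()).st_ino
        self.buffer = b""

    def _drain(self):
        """Return the complete lines written since the last call."""
        self.buffer += self.file.read()
        *lines, self.buffer = self.buffer.split(b"\n")
        return [line.decode(errors="replace") for line in lines]

    def poll(self):
        """Call at each refresh; returns the new lines, oldest first."""
        if self.file is None:
            try:
                self._open()
            except FileNotFoundError:
                return []
        lines = self._drain()
        try:
            st = os.stat(self.path)
        except FileNotFoundError:
            return lines  # rotated away, replacement not created yet
        if st.st_ino != self.inode:
            # The path now names a different file: the old one has been
            # drained above, so flush any unterminated tail and switch over.
            if self.buffer:
                lines.append(self.buffer.decode(errors="replace"))
            self.file.close()
            self._open()
            lines += self._drain()
        elif st.st_size < self.file.tell():
            # Same file, but shorter than our read position: truncated.
            self.file.seek(0)
            self.buffer = b""
            lines += self._drain()
        return lines
```

Because the old file stays open until the switch, its inode cannot be reused for the replacement, so the inode comparison is reliable on ordinary local filesystems. The price is latency: new lines and rotations are only noticed at the next call to `poll()`.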

RE: showing system's health on /dev/tty1 - movq - 23-12-2020

I once saw someone give a talk on how hard “tail -f” is and how it is/was broken on several OSes. Not sure if doing that is a lot of fun. (I can’t find that talk anymore, instead YouTube is full of “tutorials” on how to use tail …)

Maybe this should be up to the user. If the source code was organized in a way that makes it easy for the user to provide her own “is_daemon_running(char *name)” function, then maybe everybody™ would be happy? :)
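
A minimal sketch of what such a user-supplied hook could look like, in Python for brevity. Apart from `is_daemon_running`, every name here is hypothetical. The PID-file check is only one possible strategy, and a weak one: PID files can be stale and PIDs get reused.

```python
import os

# Hypothetical registry: daemon name -> zero-argument callable returning bool.
CHECKS = {}


def register(name, check):
    """Let the user plug in their own liveness check for one daemon."""
    CHECKS[name] = check


def is_daemon_running(name):
    """Unknown daemons are reported as not running."""
    check = CHECKS.get(name)
    return bool(check and check())


def pidfile_check(path):
    """Build a check that reads a PID file and probes the PID with signal 0."""
    def check():
        try:
            with open(path) as f:
                pid = int(f.read().strip())
        except (OSError, ValueError):
            return False  # missing, unreadable or malformed PID file
        if pid <= 0:
            return False  # 0 and negative values address process groups
        try:
            os.kill(pid, 0)  # signal 0 only tests for existence
        except ProcessLookupError:
            return False
        except PermissionError:
            return True  # the process exists but belongs to another user
        return True
    return check
```

Usage would be along the lines of `register("sshd", pidfile_check("/run/sshd.pid"))` (the path is only an example), with log-based or supervisor-based checks registered the same way for other daemons.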

RE: showing system's health on /dev/tty1 - Wildefyr - 26-12-2020

Quote:It is useless because, basically, it spits out a lot of lines that no one will ever be able to read (yes, even with the --noclear option of getty).

You know it is possible to modify things like loglevel, rd.systemd.show_status, rd.udev.log_priority etc on your kernel boot to show less verbose messages? I agree the defaults could be better in showing the user /what/ is wrong rather than /everything/

RE: showing system's health on /dev/tty1 - freem - 20-02-2021

(26-12-2020, 12:19 PM)Wildefyr Wrote: You know it is possible to modify things like loglevel, rd.systemd.show_status, rd.udev.log_priority etc on your kernel boot to show less verbose messages?

I know it's possible to tweak the output. But would what remains actually be meaningful? I don't think so.

Quote:I agree the defaults could be better in showing the user /what/ is wrong rather than /everything/

Good point. The code I have for now does not show /what is wrong/, only which daemons are not up, and it drops the /what is wrong/ information in the process. If run on the main console, it might even hinder recovery by getting in the user's way; I need to fix that somehow (it is still alpha: it works in an X11 terminal emulator, but has not been tested in situ, not even in a VM...).
But showing /everything/ at least shows the boot's progress, which is something my code does too, since one can consider the system to be up once certain daemons are up.

A few days ago I was thinking that, well, not everything in the boot sequence is a daemon, and my tool is itself a daemon, so it would suffer from failing to start (but I wrote the code to avoid failure possibilities as best I could, even avoiding dynamic allocations in the tools I use, hopefully).
I'm also hiding /why/ a daemon can't come up, so I probably need to improve on those 2 points, somehow.
I'm reluctant to make something interactive, because that would introduce potential for failures (I am trying to write a rock-solid tool which does not weaken the system; that's very important to me).

I work by considering that a daemon can be in 3 states: "down", "waking", and "up".
A service becomes "down" when a log line matches one pattern (configurable).
A service becomes "waking" when a log line matches another pattern (configurable).
A service becomes "up" after a delay (configurable), if it was previously "waking".
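
To make the three states concrete, here is a small Python sketch of that state machine (the real tool is not written in Python). The class name, the method names, the initial "down" state and the choice to restart the timer on every "waking" line are all assumptions made for the example.

```python
import re


class DaemonState:
    """Track one daemon as 'down', 'waking' or 'up' from its log lines."""

    def __init__(self, down_pattern, waking_pattern, settle_seconds):
        self.down_re = re.compile(down_pattern)
        self.waking_re = re.compile(waking_pattern)
        self.settle_seconds = settle_seconds
        self.state = "down"  # assumption: nothing is up until the logs say so
        self.since = None    # time of the last transition, None if never seen

    def _set(self, state, now):
        self.state = state
        self.since = now

    def feed(self, line, now):
        """Apply one log line; 'now' is a timestamp in seconds."""
        if self.down_re.search(line):
            self._set("down", now)
        elif self.waking_re.search(line):
            # Also restarts the timer when a daemon is stuck in a start loop.
            self._set("waking", now)

    def tick(self, now):
        """Promote 'waking' to 'up' once the delay has elapsed; return the state."""
        if self.state == "waking" and now - self.since >= self.settle_seconds:
            self._set("up", now)
        return self.state

    def age(self, now):
        """Seconds spent in the current state, or None before any transition."""
        return None if self.since is None else now - self.since


if __name__ == "__main__":
    sshd = DaemonState(r"exiting", r"starting", settle_seconds=5)
    sshd.feed("sshd starting", now=10)
    print(sshd.tick(now=12), sshd.tick(now=15))
```

Passing `now` in explicitly keeps the logic independent of the clock, and `age()` gives the "age" figure that reveals a daemon restarting every few seconds.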

Do you think it would make sense to only show logs of daemons which are "waking"? Or those which are "down"?