Archive for May, 2008|Monthly archive page

Daddy, the network is down…

The gateway of my home network, is not one of those “broadband router”. Instead, it’s an old Pentium 200Hz machine in my basement, running, of cause, QNX. Why am I doing this? I think it’s one of those “because I can” thing. Since I compiled my own TCPIP stack, I can really know every detail of the packets in and out of my gateway. Another reason is, of cause, I like to “live on the edge”.

Yes, it’s really “bleeding edge”, though a lot of benifit and fun of running the HEAD branch stack, one of the disadvantage is, while in it’s early stage, the stack “crashes”. The good thing is I have the core dump I could look at, but the bad thing is, that’s also when my kids started shutting at me.

Those of you who had been managed a home network, would really understand how stressful this is. :) Fortunatly, soon my kids find out the “engineering way” to fix the problem. They went down to the basement, press the little reset button on the old Pentinum, give it a couple of minutes, and wola, everything comes back.

This works well for a while, but one day while I was at home along, the stack on gateway gone again. I have to get out of my comfort couch, went down to the basement and reset it myself. I said to myself, “why can’t I just write a program to resetart the network if it’s crashed”, after all, QNX is all about Micro Kernel and Modular System, isn’t it?

That’s where my “sockmon” program cames from. Once started, it will keeps on monitoring if TCPIP stack is still running, if it disappered, “sockmon” will try to execute a shell script you gave it on command line, to re-start the network. If the restart somehow failed after some try, then it will just reboot the system.

You may wonder “how do you know if TCPIP stack is there or not”? Well, QNX resource manager have builtin notification to all connected clients if their server disappeared. So all you need is to establish a connection to the tcpip stack (by call socket()), and setup to waitfor the notification events.

I have include the source here, the “notification” thing above is true for ALL resource manager (unless the manager is written in such way that turned off this feature), so you can easilly extended my program to any resource manager. Just give it a config file to read about which resource manager (what namespace you care) to watch out, and what to do (which script to execute) if the manager went away.

I will leave this for reader exercise, but if you did that, you would realiz you just got yourself a simple, basic, HA program.


#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <process.h>
#include <signal.h>
#include <string.h>
#include <syslog.h>
#include <sys/procmgr.h>
#include <sys/sysmgr.h>
#include <sys/neutrino.h>
#include <sys/socket.h>

int main(int argc, char **argv)
        char *script;
        int sd, chid, fcount;
        struct _pulse pulse;

        if (argc < 2) {
                fprintf(stderr, “sockmon <re-start script>\n”);
                return -1;

        script = argv[1];
        if (access(script, X_OK) != 0) {
                fprintf(stderr, “access(‘%s’): %s\n”, script, strerror(errno));
                return -1;

        /* creat a channel for accept COIDDEATH pulse */
        if ((chid = ChannelCreate(_NTO_CHF_COID_DISCONNECT)) == -1) {
                return -1;

        /* don’t care about the child */
        signal(SIGCHLD, SIG_IGN);

#ifdef NDEBUG
        if (procmgr_daemon(0, PROCMGR_DAEMON_NOCLOSE) == -1) {
                return -1;

        openlog(“sockmon”, LOG_PID, LOG_DAEMON);

        for (;;)
                fcount = 0;

                /* connect to tcpip to monitoring, give it 30 seconds, if still can’t
                 * connect, reboot the system
                while ((sd = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
                        if (++fcount >= 6) {
                                syslog(LOG_ERR, “Can’t connect to socket after 3 minutes, reboot…”
                                spawnl(P_NOWAIT, “/bin/slay”, “slay”, “-f”, “syslogd”, NULL);
                                return 0;
                        syslog(LOG_INFO, “Connect to Socket failed: %m”);

                syslog(LOG_INFO, “Connected to Network, start monitoring…”);
                if (MsgReceivePulse(chid, &pulse, sizeof(pulse), NULL) == -1) {
                        syslog(LOG_ERR, “MsgReceivePulse(): %m”);
                        return -1;

                if (pulse.code != _PULSE_CODE_COIDDEATH) {
                        syslog(LOG_ERR, “MsgReceivePulse(): %m”);
                        return -1;

                if (pulse.value.sival_int != sd) {
                        syslog(LOG_ERR, “COIDDEATH pulse for %d\n”, pulse.value);

                syslog(LOG_INFO, “Network gone, restarting…”);
                spawnl(P_WAIT, “/bin/ksh”, “/bin/ksh”, script, NULL);

        return 0;