Daddy, the network is down…

The gateway of my home network is not one of those “broadband routers”. Instead, it’s an old Pentium 200MHz machine in my basement, running, of course, QNX. Why am I doing this? I think it’s one of those “because I can” things. Since I compile my own TCP/IP stack, I can see every detail of the packets going in and out of my gateway. Another reason is, of course, that I like to “live on the edge”.

Yes, it’s really “bleeding edge”. Running the HEAD branch of the stack brings a lot of benefit and fun, but one disadvantage is that, while it’s in its early stages, the stack sometimes “crashes”. The good thing is that I get a core dump to look at; the bad thing is that’s also when my kids start shouting at me.

Those of you who have managed a home network will understand how stressful this is. :) Fortunately, my kids soon found the “engineering way” to fix the problem. They went down to the basement, pressed the little reset button on the old Pentium, gave it a couple of minutes, and voilà, everything came back.

This worked well for a while, but one day while I was home alone, the stack on the gateway went away again. I had to get off my comfortable couch, go down to the basement and reset it myself. I said to myself, “Why can’t I just write a program to restart the network if it has crashed?” After all, QNX is all about the microkernel and a modular system, isn’t it?

That’s where my “sockmon” program comes from. Once started, it keeps monitoring whether the TCP/IP stack is still running. If the stack disappears, “sockmon” executes the shell script you gave it on the command line to restart the network. If the stack still can’t be reached after several tries, it simply reboots the system.
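The “retry a few times, then give up and reboot” part can be looked at separately from the QNX calls. Here is a minimal, portable sketch of that policy. The names wait_for_stack and fake_probe are made up for illustration and are not part of sockmon; in the real program the probe is the socket() call and giving up means sysmgr_reboot().

```c
#include <stddef.h>

/* Hypothetical probe type: returns 0 when the stack answers, -1 otherwise. */
typedef int (*probe_fn)(void *ctx);

/* Call probe() up to max_tries times, pausing between failed attempts.
 * Returns the number of failed attempts before the first success,
 * or -1 if every attempt failed (the caller should escalate, e.g. reboot).
 * delay may be NULL; it is injected so the sketch can run without sleeping. */
int wait_for_stack(probe_fn probe, void *ctx, int max_tries, void (*delay)(void))
{
    int failures;

    for (failures = 0; failures < max_tries; failures++) {
        if (probe(ctx) == 0)
            return failures;
        /* no point in waiting after the last attempt */
        if (delay != NULL && failures + 1 < max_tries)
            delay();
    }
    return -1;
}

/* Stand-in probe for illustration only: fails until *remaining reaches 0. */
int fake_probe(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return -1;
    }
    return 0;
}
```

With max_tries set to 6 and a delay of 5 seconds this is roughly what sockmon does: a stack that comes back within the window is picked up again, and one that never comes back ends in a reboot.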

You may wonder, “How do you know whether the TCP/IP stack is there or not?” Well, QNX has built-in notification for connected clients when their server disappears. So all you need to do is establish a connection to the TCP/IP stack (by calling socket()) and set up a channel to wait for the notification event (a channel created with _NTO_CHF_COID_DISCONNECT receives a _PULSE_CODE_COIDDEATH pulse).

I have included the source here. The “notification” described above works for ALL resource managers (unless the manager is written in a way that turns this feature off), so you can easily extend my program to any resource manager. Just give it a config file that says which resource manager (which path in the namespace) to watch, and what to do (which script to execute) if the manager goes away.
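To make the config file idea a bit more concrete, here is a minimal sketch of the parsing side only. The file format (one “path script” pair per line, with # for comments), the struct watch_entry type and the parse_watch_line helper are all my assumptions for illustration; sockmon itself has nothing like this.

```c
#include <stdio.h>

#define WATCH_FIELD_MAX 256

/* One config entry: which resource manager path to watch and what to run
 * when it goes away. The layout is an assumption, not part of sockmon. */
struct watch_entry {
    char path[WATCH_FIELD_MAX];    /* path in the namespace to watch */
    char script[WATCH_FIELD_MAX];  /* script to execute if the manager dies */
};

/* Parse one line of the hypothetical form "<path> <script>".
 * Returns 1 on success, 0 for blank lines and '#' comments, -1 on error. */
int parse_watch_line(const char *line, struct watch_entry *out)
{
    /* skip leading whitespace */
    while (*line == ' ' || *line == '\t')
        line++;

    if (*line == '\0' || *line == '\n' || *line == '#')
        return 0;

    /* %255s keeps each field within its buffer, terminator included */
    if (sscanf(line, "%255s %255s", out->path, out->script) != 2)
        return -1;

    return 1;
}
```

A line such as “/dev/socket /etc/restart-net.sh” (the script name is made up) would then give you one entry. For each entry, the monitor would open the path, remember the returned descriptor, and run the matching script when a COIDDEATH pulse arrives for it.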

I will leave this as an exercise for the reader, but if you do it, you will realize you have just got yourself a simple, basic HA (high availability) program.

 -xtang

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <process.h>
#include <signal.h>
#include <string.h>
#include <syslog.h>
#include <sys/procmgr.h>
#include <sys/sysmgr.h>
#include <sys/neutrino.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
        char *script;
        int sd, chid, fcount;
        struct _pulse pulse;

        if (argc < 2) {
                fprintf(stderr, "sockmon <re-start script>\n");
                return -1;
        }

        script = argv[1];
        if (access(script, X_OK) != 0) {
                fprintf(stderr, "access('%s'): %s\n", script, strerror(errno));
                return -1;
        }

        /* create a channel to accept the COIDDEATH pulse */
        if ((chid = ChannelCreate(_NTO_CHF_COID_DISCONNECT)) == -1) {
                perror("ChannelCreate");
                return -1;
        }

        /* don’t care about the child */
        signal(SIGCHLD, SIG_IGN);

#ifdef NDEBUG
        if (procmgr_daemon(0, PROCMGR_DAEMON_NOCLOSE) == -1) {
                perror("procmgr_daemon");
                return -1;
        }
#endif

        openlog("sockmon", LOG_PID, LOG_DAEMON);
        setlogmask(LOG_UPTO(LOG_INFO));

        for (;;)
        {
                fcount = 0;

                /* connect to tcpip to monitor it; retry every 5 seconds and,
                 * if we still can't connect after about 30 seconds (6 tries),
                 * reboot the system
                 */
                while ((sd = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
                        if (++fcount >= 6) {
                                syslog(LOG_ERR, "Can't connect to socket after 30 seconds, reboot...");
                                /* slay syslogd and give it a second before rebooting */
                                spawnl(P_NOWAIT, "/bin/slay", "slay", "-f", "syslogd", NULL);
                                sleep(1);
                                sysmgr_reboot();
                                return 0;
                        }
                        /* log before sleeping so %m still reflects socket()'s errno */
                        syslog(LOG_INFO, "Connect to Socket failed: %m");
                        sleep(5);
                }

                syslog(LOG_INFO, "Connected to Network, start monitoring...");
                if (MsgReceivePulse(chid, &pulse, sizeof(pulse), NULL) == -1) {
                        syslog(LOG_ERR, "MsgReceivePulse(): %m");
                        return -1;
                }

                if (pulse.code != _PULSE_CODE_COIDDEATH) {
                        syslog(LOG_ERR, "Unexpected pulse code %d", pulse.code);
                        return -1;
                }

                if (pulse.value.sival_int != sd) {
                        /* some other connection died, not our socket */
                        syslog(LOG_ERR, "COIDDEATH pulse for %d", pulse.value.sival_int);
                        close(sd);      /* the outer loop opens a fresh socket */
                        continue;
                }

                syslog(LOG_INFO, "Network gone, restarting...");
                close(sd);      /* release the dead connection before reconnecting */
                spawnl(P_WAIT, "/bin/ksh", "/bin/ksh", script, NULL);
        }

        return 0;
}

 

12 comments so far

  1. hp on

    Hi Colin,
    why not using HAM, SLM or of course srv-starter ;-))
    /hp

  2. colinburgess on

    Hmmm, Xiaodan wrote this, HP :-)

    But HAM could be used, yes. As for those others? Never heard of them! ;-)

  3. hp on

    oh man … the one who could read, has definitely advantage.
    thought it was you, Xiaodan is new to this blog?
    should have wondered when him/her (sorry for this) was writing about the kids.
    /hp

  4. name on

    Good day!,

  5. name on

    Of *course* not cause.

  6. Mamur on

    Thanks. I read it with interest. Added the blog to my favorites =)

  7. stroitelstvo on

    For some reason your RSS only shows the first 300 posts. Is that how it is supposed to be?

  8. car floor jacks on

    It’s the first time I commented here and I must say you share us genuine, and quality information for bloggers! Great job.
    p.s. You have an awesome template for your blog. Where have you got it from?

  9. TurdeMonacos on

    Search often returns no results. A pity…

  10. IvanPetrovichBur on

    I happened upon a site about the Soviet Union. It actually brought a tear to my eye. Nostalgia, what can you say.

  11. PortoPaderto on

    Amazing! I did not expect that…)

