Archive for October, 2007 | Monthly archive page
One of the hardest things about working through problems with some of our customers is that, since I work here at the QNX mothership office, I have full and complete visibility into all of the source from our products … and they do not. This of course means that I can almost always come out looking like a hero: for any hard question a customer has about how something works (or appears not to, as the case may be), I can generally root through the source repository, figure out what is going on and explain it back to them.
It is nice to be a hero … but sometimes things just get lost in translation.
Since I couldn’t just share the code with the customer’s engineers, we would spend a lot of time going back and forth discussing topics in the “abstract” until I could paint enough of a mental picture for them about a particular implementation of a technology. By the time they understood what to do about the problem they were facing, I’d pretty much given them a tour through our code. Kind of a waste of time, considering that most of the software developers I end up working with can read through a blob of C, C++ or Java code as easily as a paperback novel.
This is why I think that Foundry27 is an awesome idea. We are working to get to the point where, for nearly all of the Neutrino technology that has been developed, I’ll be able to talk customers through a problem and bring them right into the guts of the code during the explanation. It is going to take some time to get all of that source out, but we have already pushed out the C library source code and the kernel/process manager code, and those represent quite a few of the important glue bits that help you understand how to get the most out of your applications.
Even better than the source being published is the fact that the different development teams within QNX are starting to take on much more active public roles themselves, helping others interpret and navigate the source code. If you missed it, you should really check out the webinar that Colin and Sebastien presented as a guided tour through some of the source code included in the QNX Operating System project. We’ve also got most of the kernel developers (and a large number of other QNX developers) online now and monitoring the forums, so there is lots of good information flowing around.
Next up on the source release roadmap is the networking stack and driver source and I’m also going to see if I can get a set of missing OS utilities out the door as well before Christmas!
Dan posted recently on his blog about implementing an unblock handler in one of our media servers. Unblock handling is tricky, and it can take a bit of fiddling to arrive at an implementation that makes the appropriate trade-off between unblock response time and implementation complexity. In this case a significant customer needed the work done in short order, so it was done.
The funny thing is that a few days later the customer started getting all sorts of spurious errors in their system. Normally this wouldn’t be a challenge to track down, but in this case the system was tightly coupled and composed of hundreds of threads across tens of processes, with close to a dozen active interrupts. There was enough other system activity going on that a small failure in one path took a long time to ripple through to a point where the error was noticeable. The gut feeling was that some part of the client software was being unblocked as a result of the recent server unblock changes, and the client wasn’t ready to deal with the consequences.
… but we needed to prove it …
The good news is that the system was running the instrumented kernel, so we could capture some traces and look at the behaviour of the entire system. So traces we took, and then we loaded them into the System Profiler.
The question was: how could we quickly check whether the unblock behaviour was at fault? The first thing to do is to see if there is any unblock activity in the log at all. This is easily accomplished using the Trace Search capability, Search > Search… (or CTRL+H for the keyboard addicts). Click Add to create a new search condition. The condition we want to look for is Communication class events with the code Unblock Pulse. This will pull up all of the unblock pulses generated in the trace. In this case we didn’t have very many to consider:
We could have gone through and taken a look at each instance, but in this case we knew that we had just put the unblock handling code into the mme server. The unblock pulse event contains the name of the target process, so we could look specifically for events targeting our server. We can do this by going back into the Trace Search and either editing the existing condition or creating a new one. We will still specify the Communication class and the Unblock Pulse event, but this time we will also add a condition matching only those instances where the process data field is equal to the mme server.
Now when we run the search, BINGO! We get just a single hit. From this point it is easy to look at the Timeline editor pane and track the behaviour of the client process and thread that the mme unblocks, and what happens after the unblock (bad things, it turns out).
Also, with the system trace we can take a look at the specific cause of the signal. It turns out that in this case a timer was firing a signal that got dropped on the thread, causing it to unblock. This in turn exposed a general flaw in the customer’s design: they were running a multi-threaded application in which any thread could potentially be blasted by this signal, instead of giving the process a dedicated signal handling thread.
Hopefully this quick example gives you some more ideas about how to navigate effectively through instrumentation trace files; Trace Search is a powerful tool.