Finding an unblock victim
Dan posted recently on his blog about implementing an unblock handler in one of our media servers. Unblock handling is tricky and it can take a bit of fiddling to get an implementation that makes the appropriate trade off between unblock response time and the complexity of implementation. In this case a significant customer needed the work done in short order so it was done.
The funny thing is that a few days later the customer was then starting to get all sort of spurious errors in their system. Normally this isn’t a challenge to track down, but in this case the system was tightly coupled and composed of hundreds of threads with tens of processes and close to a dozen active interrupts. There was enough other system activity going on that a small failure in one path took a long time to ripple through to a point where the error was noticeable. The gut feeling was that some part of the client software was being unblocked as a result of the recent server unblock changes and the client wasn’t ready to deal with the consequences.
… but we needed to prove it …
The good news is that the system was running the instrumented kernel and we could take some traces and take a look at the entire system, so traces we took and then we loaded them into the System Profiler.
The question was, how can we quickly check and see if the unblock behaviour is at fault? Well the first thing to do is to see if there is any unblock activity in the log. This is easily accomplished with using the Trace Search capability, Search > Search… (or CTRL+H for the keyboard addicts). Click Add to create a new search condition. The condition we want to look for is the Communication class events with the code of Unblock Pulse. This will pull up all of the unblock pulses generated in the trace. In this case we didn’t have very many to consider:
We could have gone through and taken a look at each instance, but in this case we knew that we had just put the unblock handling code into the mme server. The unblock pulse event contains the name of the target process, so we could take a look specifically to see if we had targeted our server. We can do this by going back into the Trace Search and either editing the existing condition or creating a new condition. We will still specify the Communication class and the Unblock Pulse event, but this time we will also add in a specific search for instances where the processdata is equal to the mme server.
Now when we run the search BINGO!we get just one single hit. From this point it is easy now to look at the Timeline editor pane and track the behaviour of the client process and thread that the mme unblocks and what happens after the unblock (bad things it turns out).
Also, with the system trace we can take a look at the specific cause of the signal. It turns out in this case that there was a timer firing a signal that got dropped on the thread which caused it to unblock. This in turn exposed a general flaw in the customer’s design since they were running a multi-threaded application where any thread could have potentially been blasted by this signal instead of the process having a dedicated signal handling thread.
Hopefully this quick example gives you some more ideas how you can effectively navigate through the instrumentation trace files, Trace Search is a powerful tool.