Archive for the ‘Tracing’ Category
At long last, the dtrace project at Foundry27 is public. It’s really still in the formative stages, but if you want to download the source and hack around a bit, or even just laugh at the dirty hacks, feel free!
I’m slowing working on a more thorough port, but the prototype is a nice forum to play around in.
As some of you may know, Thomas and I have been working on a port of dtrace, mostly in our spare time. I was intrigued and enthused to see that it had also been ported to FreeBSD and Mac OS X Leopard.
I won’t repeat the details of it here, but the gist of it is that Apple allows for certain processes to be hidden from any debug/tracing systems. While you may or may not agree on whether that is Truly Evil Behaviour or not, it doesn’t really matter – what it means for sure is that the Leopard dtrace port is severely broken.
When a system wide tracer cannot trace the whole system, it pretty much ceases to be of any use!
You might as well pack up, and go home, folks!
The really sad thing is the way they broke it! They basically disable any probes from running at the same time as a protected process, the prime example being iTunes.
This means, for example, that anything based on the timer tick is useless, since you lose the regular heartbeat, and that any kernel activity that may have been deferred (say disk io) is lost, unnoticed by your dtrace probes!
It really means that you cannot trust any of the data that dtrace is supplying, as it may be at best incomplete, at worst incorrect.
I shake my head, I really do.
I recently received an interesting kernel trace from a customer. It showed several threads going from STATE_RUNNING to STATE_MUTEX without an intervening SyncMutexLock kernel call. He wondered if perhaps the trace was corrupted.
After looking at the trace it seemed to be sane, but the behaviour shown was somewhat puzzling. However it is explainable, and it has to do with our implementation of mutexes.
Dan posted recently on his blog about implementing an unblock handler in one of our media servers. Unblock handling is tricky and it can take a bit of fiddling to get an implementation that makes the appropriate trade off between unblock response time and the complexity of implementation. In this case a significant customer needed the work done in short order so it was done.
The funny thing is that a few days later the customer was then starting to get all sort of spurious errors in their system. Normally this isn’t a challenge to track down, but in this case the system was tightly coupled and composed of hundreds of threads with tens of processes and close to a dozen active interrupts. There was enough other system activity going on that a small failure in one path took a long time to ripple through to a point where the error was noticeable. The gut feeling was that some part of the client software was being unblocked as a result of the recent server unblock changes and the client wasn’t ready to deal with the consequences.
… but we needed to prove it …
The good news is that the system was running the instrumented kernel and we could take some traces and take a look at the entire system, so traces we took and then we loaded them into the System Profiler.
The question was, how can we quickly check and see if the unblock behaviour is at fault? Well the first thing to do is to see if there is any unblock activity in the log. This is easily accomplished with using the Trace Search capability, Search > Search… (or CTRL+H for the keyboard addicts). Click Add to create a new search condition. The condition we want to look for is the Communication class events with the code of Unblock Pulse. This will pull up all of the unblock pulses generated in the trace. In this case we didn’t have very many to consider:
We could have gone through and taken a look at each instance, but in this case we knew that we had just put the unblock handling code into the mme server. The unblock pulse event contains the name of the target process, so we could take a look specifically to see if we had targeted our server. We can do this by going back into the Trace Search and either editing the existing condition or creating a new condition. We will still specify the Communication class and the Unblock Pulse event, but this time we will also add in a specific search for instances where the processdata is equal to the mme server.
Now when we run the search BINGO!we get just one single hit. From this point it is easy now to look at the Timeline editor pane and track the behaviour of the client process and thread that the mme unblocks and what happens after the unblock (bad things it turns out).
Also, with the system trace we can take a look at the specific cause of the signal. It turns out in this case that there was a timer firing a signal that got dropped on the thread which caused it to unblock. This in turn exposed a general flaw in the customer’s design since they were running a multi-threaded application where any thread could have potentially been blasted by this signal instead of the process having a dedicated signal handling thread.
Hopefully this quick example gives you some more ideas how you can effectively navigate through the instrumentation trace files, Trace Search is a powerful tool.