Archive for February, 2009|Monthly archive page
… at least that is true in the embedded space where your CPU footprint is just as important as your memory footprint and your storage (“on-disk”) footprint.
As a follow up to the Crank WebKit performance “shark week” we decided that since runtime performance was starting to look pretty good we should turn our sights to the memory footprint that the WebKit port was taking on our systems.
Due to the fact that this was a brand new browser engine (WebKit), with a minimal custom front end (Crank Launcher), ported to a brand new graphics framework from QNX (AG/GF), there was really no “apples to apples” comparable to use (though Crank does have many internal versions of Firefox). This isn’t as bad as it seems … in the world of software, attempting to “match” what previous software generations did rarely helps to advance the state of the technology. Our goal was to take a fixed block of time and make the biggest impact that we could.
We took base measurements off many different sites, but here I’ll use the plain old www.google.com as the reference. When we loaded that site up, after all of the shared objects were loaded and we finished rendering the page, we found that we would often hit close to 30M of memory being used! That was a pretty outrageous number.
There are lots of different tools for memory analysis, most of which don’t scale appropriately or provide accurate enough results when used on an application as disjoint as WebKit: it uses both C and C++ object smart pointers, it mixes custom allocators and the system allocator (malloc), and the usage and behaviour of memory allocation is completely different for each of the shared libraries used (i.e. libc, libxml, freetype, icu, libjpeg, …).
Since we were coming off the week of CPU performance analysis we had lots of different trace files hanging around so we decided to take a crank at them with the System Profiler. In addition to the new Path Manager event in 6.4.0, there are two system events for tracking mmap events … one for named events and one for un-named events.
We set up some traces, using the same technique as with the Path Manager events, and leveraged that information to give us some insight into what major allocations (those that routed to mmap()) were occurring and correspondingly to see what shared libraries, files and other objects were being mmap’ed by WebKit.
This tuning once again pointed us to the font handling and some less than optimal code in the way that WebKit uses the FreeType and FontConfig APIs. After last week’s CPU tuning experience with the font configuration, I’m rapidly coming to the conclusion that fonts and internationalization are nothing but a work generator!
Just for confirmation, we dropped the entire application into the debugger, put a breakpoint on the mmap() function and stepped through the major WebKit operations all while double verifying using pidin mem to see the memory effect on the system.
To make a short story long, we concluded that there is no reason to map in Asian fonts, especially at 6M a pop, if you have no characters on screen that require such functionality! Not a revelation that most people would be knocked over by, but the effect on this software was. With a little bit of tuning around the fonts and a few other memory areas, our memory footprint rocketed down to a very respectable (and stable) 4M when we target www.google.com.
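The "don't map it until you need it" idea can be sketched in a few lines of POSIX C. This is only an illustration and not the code from our port: the `lazy_font` structure and the `lazy_font_data()` / `lazy_font_release()` helpers are hypothetical names made up for this example, and a real port would hang this off whatever font objects it already has.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical descriptor for a font file that is mapped only on demand. */
typedef struct {
    const char *path;   /* font file on disk */
    void       *data;   /* NULL until the first glyph request */
    size_t      size;   /* mapped length in bytes */
} lazy_font;

/* Return the font data, mapping the file on first use.
 * Returns NULL if the file cannot be opened or mapped. */
const void *lazy_font_data(lazy_font *f)
{
    assert(f != NULL);
    if (f->data != NULL)
        return f->data;              /* already mapped, no new mmap() */

    int fd = open(f->path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    f->data = p;
    f->size = (size_t)st.st_size;
    return f->data;
}

/* Drop the mapping so the memory can be reclaimed. */
void lazy_font_release(lazy_font *f)
{
    assert(f != NULL);
    if (f->data != NULL) {
        munmap(f->data, f->size);
        f->data = NULL;
        f->size = 0;
    }
}
```

A font that never has a glyph requested never calls `lazy_font_data()`, so it never shows up as an mmap event in the trace or as extra memory in pidin mem.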
… and of course to speak to the title, when you drop from 32M to 4M (the less) we pick up a few extra CPU cycles (the more) in the general operation. We’ve got a few more tricks up our sleeve but we’re getting very close to our GF release point!
File system access is pretty fast … if you are running on a desktop system with lots of caching resources and no real requirement to synch/flush content to the backing media until you have idle cycles. You can do a lot of churning and never even have it poke to the top of any desktop performance metering tools.
If you are running an embedded system though, filesystem access is your nemesis. Not only does the media that you are interfacing with have generally slower access times, but the environment is much more demanding. You get the double whammy of fewer CPU cycles to spare and little or no additional cache memory to compensate for slow device access. All that, and you’ve got some pesky power monitor that keeps trying to get the system to idle so it can sleep and save a few milliwatts of power.
The customers that have approached Crank Software about WebKit integration all have embedded systems that they are running, so as part of my WebKit optimization week I started out with the easy targets: excessive filesystem access.
- Open a trace log file and open the search dialog; Search > Search or just Ctrl+H in the editor.
- Select the Trace Search tab icon and select Add to create a new condition
- Create the condition using the System class and the Path Manager code
Now if you use this condition you will see all of the path based file operations. You can use the data fields in the event such as process or pid to filter the events down to what you are interested in. Running this query, focusing in on the WebKit based application as it launched against a test website yielded:
Unbelievable! Considering that this was a trace that lasted only 30 seconds, 2100 file accesses seems to be a little bit unreasonable and either a bug or an area ripe for optimization. The pathname provided us with lots of insight into what was happening (lots and lots of font access). Our current port uses FontConfig to manage the font mappings and FreeType to perform the rendering.
Our first change was to create a more ’embedded friendly’ font configuration file. By default the font configuration is scanned every few seconds to support dynamic font addition and removal. Useful for desktops, but not needed by most embedded systems. Doing that, we picked up a few seconds of improvement, but were still churning.
Time to dive into the code and correlate it with our trace results:
The traces showed repetitive file access for the same font, so we added a simple filesystem name cache and used FreeType’s internal cache to avoid this churn. This dropped the file accesses by about an eighth and picked up another few seconds of improvement. Not bad for a bit of effort, but not the big gain we were looking for.
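For the curious, a name cache of this kind can be very small. The sketch below is not the cache from our port, just a minimal illustration of the technique: `name_cache`, `name_cache_lookup()` and the counting `slow_resolve_font()` stand-in (which fakes the expensive filesystem lookup by building a made-up `/fonts/<family>.ttf` path) are all hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 16
#define FAMILY_LEN  64
#define PATH_LEN    256

unsigned slow_lookups;   /* how many times the slow path was taken */

/* Stand-in for the expensive lookup that touches the filesystem. */
int slow_resolve_font(const char *family, char *out, size_t len)
{
    static const char prefix[] = "/fonts/";
    static const char suffix[] = ".ttf";
    slow_lookups++;
    if (sizeof prefix - 1 + strlen(family) + sizeof suffix > len)
        return -1;                       /* result would not fit */
    strcpy(out, prefix);
    strcat(out, family);
    strcat(out, suffix);
    return 0;
}

typedef struct {
    char family[FAMILY_LEN];
    char path[PATH_LEN];
    int  used;
} name_cache_entry;

typedef struct {
    name_cache_entry slots[CACHE_SLOTS];
    unsigned         next;               /* next slot to overwrite */
} name_cache;

void name_cache_init(name_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Return the cached path for a family, resolving it at most once
 * while the entry stays in the cache. Returns NULL on failure. */
const char *name_cache_lookup(name_cache *c, const char *family)
{
    char path[PATH_LEN];
    name_cache_entry *e;

    assert(c != NULL && family != NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->slots[i].used && strcmp(c->slots[i].family, family) == 0)
            return c->slots[i].path;     /* hit: no filesystem access */
    }
    if (strlen(family) >= FAMILY_LEN)
        return NULL;
    if (slow_resolve_font(family, path, sizeof path) != 0)
        return NULL;

    e = &c->slots[c->next];              /* miss: overwrite round robin */
    strcpy(e->family, family);
    strcpy(e->path, path);
    e->used = 1;
    c->next = (c->next + 1) % CACHE_SLOTS;
    return e->path;
}
```

Asking for the same family twice only bumps `slow_lookups` once, which is exactly the kind of repeated access that the Path Manager events made visible in the trace.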
The traces still showed that we were hitting the font configuration directories several times over, definitely not the intention of the source as far as we could tell. After a day of code inspection, the culprit turned out to be an innocent looking routine that was responsible for cleaning up temporary font resources … unfortunately the cleanup also destroyed a static font configuration, causing it to be re-created each time a new font request was made! Fixing that bug and re-running our test load:
Now that’s looking better! While the number of file accesses is still high (318) it isn’t completely ridiculous (2100) and even better, most of those initial accesses are the shared library loading and don’t occur during the steady state operations of the browser. The even better news is the time savings that came with this reduction … it dropped a full ten (10) seconds off the general load time! We’re totally moving from Super Sloppy to Super Shiny! (thanks Mario and Paul!).
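The shape of the cleanup bug is worth a quick sketch, since it is an easy one to reintroduce. This is a simplified illustration and not the actual WebKit or FontConfig code: `font_config`, `font_request` and every function below are hypothetical stand-ins, with a counter in place of the expensive directory scan.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a font configuration that is expensive to build. */
typedef struct {
    int generation;
} font_config;

static font_config *shared_config;   /* built once, then reused */
unsigned config_builds;               /* how many times we rebuilt it */

font_config *get_font_config(void)
{
    if (shared_config == NULL) {
        /* In real code the font directory scan would happen here. */
        shared_config = malloc(sizeof *shared_config);
        if (shared_config == NULL)
            return NULL;
        shared_config->generation = (int)++config_builds;
    }
    return shared_config;
}

/* Per-request temporary resources. */
typedef struct {
    void *scratch;
} font_request;

int font_request_begin(font_request *r)
{
    assert(r != NULL);
    r->scratch = malloc(64);
    return r->scratch != NULL ? 0 : -1;
}

void font_request_cleanup(font_request *r)
{
    assert(r != NULL);
    free(r->scratch);
    r->scratch = NULL;
    /* The buggy shape also released shared_config at this point,
     * which forced a full rebuild on the very next font request. */
}

/* The shared configuration is released only at shutdown. */
void font_config_shutdown(void)
{
    free(shared_config);
    shared_config = NULL;
}
```

With the cleanup limited to the per-request resources, any number of font requests leaves `config_builds` at 1 instead of climbing by one per request.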
I spent most of the past week doing some performance tuning on the port of WebKit to QNX Neutrino that Crank Software is doing. There are lots of different tools I could use but nothing beats the System Profiler when you want to get a quick overview of what is going on with your application and its effect on the rest of the system.
Since we’re building WebKit ourselves, I was able to add in a number of user trace events that had specific meaning for the WebKit performance metrics we were looking at. With a few events, it made repetitive measurements of things like application load time, page load time and network latency a snap to calculate.
Prior to the 6.4.0 release, if you wanted to add in your own custom trace events you had to use the TraceEvent() API. This API does way more than just insert trace events, it is the swiss army knife of calls to configure and control the entire kernel instrumentation system. At the end of the day, I was always having to go back to the documentation just to double check the arguments required to push out an event with a string in it:
TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT, <id>, <string>)
Of course if you wanted anything more than a simple string, you had to fiddle with sprintf’s, allocate buffers etc. A number of times I simply would insert five or six trace events in a row rather than doing all that extra work … which of course meant having log files cluttered with extra events I eventually needed to filter out to see what I wanted.
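The extra work in question looks roughly like the wrapper below. Treat it as a sketch under assumptions rather than anything official: `trace_printf()` and `emit_user_string()` are hypothetical helper names, the 128 byte buffer is an arbitrary choice, and the non-QNX branch is only a stand-in that keeps the last message in a buffer so the example builds and runs on a desktop.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#ifdef __QNXNTO__
#include <sys/neutrino.h>
#include <sys/trace.h>
#endif

char last_trace[128];   /* only filled in by the non-QNX stand-in */

/* Push one string event into the trace (or into the stand-in buffer). */
int emit_user_string(int id, const char *msg)
{
#ifdef __QNXNTO__
    return TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT, id, msg);
#else
    (void)id;
    strncpy(last_trace, msg, sizeof last_trace - 1);
    last_trace[sizeof last_trace - 1] = '\0';
    return 0;
#endif
}

/* Format into a local buffer, then emit a single string event.
 * Output longer than the buffer is truncated. */
int trace_printf(int id, const char *fmt, ...)
{
    char buf[128];
    va_list ap;

    assert(fmt != NULL);
    va_start(ap, fmt);
    vsnprintf(buf, sizeof buf, fmt, ap);
    va_end(ap);
    return emit_user_string(id, buf);
}
```

A call such as `trace_printf(1, "page %s loaded in %d ms", name, ms)` then produces one event instead of five or six.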
With 6.4.0 a whole slew of trace_* functions were added into the <sys/trace.h> header. Now instead of having to remember the specific define for the TraceEvent() call, I can just do:
trace_logf(<id>, <printf style format string>);
That is way easier and far more convenient … makes tracing almost as much fun as using my ShamWow!
Unfortunately the functions aren’t documented yet, but they were modelled on the slog_* functions. You have to be careful about using them within interrupt handlers, but things like the trace_here() function can offer interesting insight when combined with the other optimization tools.
… and in case you were wondering about the WebKit optimization results, they are looking good. We’ve eliminated most of the superfluous work that hit embedded systems (especially non-x86 targets) harder than desktops and hope to be making a very usable GF based version available soon!