Combining System Profiler search results

In the last few posts [1][2][3] I’ve been talking about how the developers at Crank have been finding and addressing some of the performance issues with WebKit.   

One of the ways in which we highlighted this performance issue was to take advantage of the System Profiler’s Trace Search and the new Path Manager trace events that are available in the 6.4.0 Neutrino kernel.  This search gives us a quick view into what files are being opened:

fileopensearch 

These search results give us very useful information, but they are rather isolated.  One of the neat things that the Trace Search dialog allows you to do is combine search conditions using the && and || operators (AND and OR), which can give your search results a bit more impact.

searchandor 

For example, if we wanted a rough estimate of the effect that each of these file opens was having on our system, we might do a compound search that joins the FileOpen information with the number of times we perform a ConnectAttach to actually establish a connection to a server:

searchconnect1

Then enter a compound condition:

searchcombined

Now when we do our search, we get a much more interesting set of results that gives us an indication of just how many different servers might be involved when we are doing all of this additional file access:

searchresults2

I’ve cut out some of the columns here to try and fit the content in, but we can now see the FileOpen results with the ConnectAttach() calls intermingled.  The handy thing with this information is that you can now quickly establish a few cost metrics based on the observed results, which helps when you are trying to determine what areas may be worth pursuing for further optimization gains.

Happy Optimizing.

Thomas

Less WebKit is More WebKit

… at least that is true in the embedded space where your CPU footprint is just as important as your memory footprint and your storage (“on-disk”) footprint.

As a follow up to the Crank WebKit performance “shark week” we decided that since runtime performance was starting to look pretty good we should turn our sights to the memory footprint that the WebKit port was taking on our systems.  

Due to the fact that this was a brand new browser engine (WebKit), with a minimal custom front end (Crank Launcher), ported to a brand new graphics framework from QNX (AG/GF), there was really no “apples to apples” comparable to use (though Crank does have many internal versions of Firefox).  This isn’t as bad as it seems … in the world of software, attempting to “match” what previous software generations did rarely helps to advance the state of the technology.  Our goal was to take a fixed block of time and make the biggest impact that we could.

We took base measurements off many different sites, but here I’ll use the plain old www.google.com as the reference.  When we loaded that site up, after all of the shared objects were loaded and we finished rendering the page, we found that we would often hit close to 30M of memory being used!  That was a pretty outrageous number.

There are lots of different tools for memory analysis, most of which don’t scale appropriately or provide accurate enough results when used on an application as disjoint as WebKit.  It uses both C and C++ object smart pointers, it mixes custom allocators with the system allocator (malloc), and the usage and behaviour of memory allocation is completely different for each of the shared libraries used (i.e. libc, libxml, freetype, icu, libjpeg, …).

Since we were coming off the week of CPU performance analysis we had lots of different trace files hanging around so we decided to take a crank at them with the System Profiler.   In addition to the new Path Manager event in 6.4.0, there are two system events for tracking mmap events … one for named events and one for un-named events. 

We set up some traces, using the same technique as with the Path Manager events, and leveraged that information to give us some insight into what major allocations (those that routed to mmap()) were occurring and, correspondingly, what shared libraries, files and other objects were being mmap’ed by WebKit.

This tuning once again pointed us to the font handling and some less than optimal code in the way that WebKit uses the FreeType and FontConfig APIs.  After last week’s CPU tuning experience with the font configuration, I’m rapidly coming to the conclusion that fonts and internationalization are nothing but a work generator!

Just for confirmation, we dropped the entire application into the debugger, put a breakpoint on the mmap() function and stepped through the major WebKit operations, all while double-checking with pidin mem to see the memory effect on the system.

To make a short story long, we concluded that there is no reason to map in Asian fonts, especially at 6M a pop, if you have no characters on screen that require such functionality!  Not a revelation that most people would be knocked over by, but the effect on this software was.  With a little bit of tuning around the fonts and a few other memory areas, our memory footprint rocketed down to a very respectable (and stable) 4M when we target www.google.com.
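To make the "don't map it until you need it" idea concrete, here is a minimal sketch in plain POSIX C.  This is not the code from our port; the lazy_font structure and both function names are made up for illustration.  The only point is that the mmap() call is deferred until the first time somebody actually asks for the font data.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical wrapper: a font file that is only mapped on first use. */
struct lazy_font {
    const char *path;   /* where the font lives on the filesystem */
    void       *data;   /* NULL until the file has been mapped */
    size_t      size;   /* size of the mapping, 0 until mapped */
};

/* Return the mapped font data, mapping the file on the first call only.
 * Returns NULL if the file cannot be opened or mapped. */
static const void *lazy_font_data(struct lazy_font *f)
{
    struct stat st;
    void *p;
    int fd;

    assert(f != NULL && f->path != NULL);
    if (f->data != NULL)
        return f->data;              /* already mapped, no extra cost */

    fd = open(f->path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    f->data = p;
    f->size = (size_t)st.st_size;
    return f->data;
}

/* Drop the mapping again, for example when no visible text needs it. */
static void lazy_font_release(struct lazy_font *f)
{
    if (f->data != NULL) {
        munmap(f->data, f->size);
        f->data = NULL;
        f->size = 0;
    }
}
```

A font that never gets asked for never costs you its mapping, which is the whole trick for a large font that most pages don't touch.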

… and of course to speak to the title, when you drop from 32M to 4M (the less) you pick up a few extra CPU cycles (the more) in the general operation.  We’ve got a few more tricks up our sleeve but we’re getting very close to our GF release point!

Thomas

Find font, open font, close font … Go WebKit! Go!

File system access is pretty fast … if you are running on a desktop system with lots of caching resources and no real requirement to sync/flush content to the backing media until you have idle cycles.  You can do a lot of churning and never even have it poke to the top of any desktop performance metering tools.

If you are running an embedded system though, filesystem access is your nemesis.  Not only does the media that you are interfacing with generally have slower access times, but the environment is much more demanding.  You get the double whammy of fewer CPU cycles to spare and little or no additional cache memory to compensate for slow device access.  All that, and you’ve got some pesky power monitor that keeps trying to get the system to idle so it can sleep and save a few milliwatts of power.

The customers that have approached Crank Software about WebKit integration all have embedded systems that they are running, so as part of my WebKit optimization week I started out with the easy targets: excessive filesystem access.

Using the System Profiler (a handy hammer for all my problems) and QNX 6.4.0 it is dead easy to highlight those expensive pathname-based filesystem operations (open, opendir, stat, etc.):

  1. Open a trace log file and open the search dialog; Search > Search or just  Ctrl+H in the editor. 
  2. Select the Trace Search tab icon and select Add to create a new condition
  3. Create the condition using the System class and the Path Manager code

conditiondef1

Now if you use this condition you will see all of the path-based file operations.  You can use the data fields in the event, such as process or pid, to filter the events down to what you are interested in.  Running this query, focusing in on the WebKit-based application as it launched against a test website, yielded:

 

fileopensearch

Unbelievable!  Considering that this was a trace that lasted only 30 seconds, 2100 file accesses seems more than a little unreasonable: either a bug or an area ripe for optimization.  The pathnames provided us with lots of insight into what was happening (lots and lots of font access).  Our current port uses FontConfig to manage the font mappings and FreeType to perform the rendering.

Our first change was to create a more ‘embedded friendly’ font configuration file.  By default the font configuration is scanned every few seconds to support dynamic font addition and removal.  Useful for desktops, but not needed by most embedded systems.  Doing that, we picked up a few seconds of improvement, but were still churning.

Time to dive into the code and correlate it with our trace results:

The traces showed repetitive file access for the same font, so we added a simple filesystem name cache and used FreeType’s internal cache to avoid this churn.  This dropped the file accesses by about an eighth and picked up another few seconds of improvement.  Not bad for a bit of effort, but not the big gain we were looking for.
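For the curious, a name cache of this kind doesn't need to be fancy.  The sketch below is not our actual patch; every name in it is hypothetical, and demo_resolver() is only a stand-in for the expensive step that really walks the font directories.  It simply remembers which path a font family resolved to so the second request for the same family never touches the filesystem.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FONT_CACHE_SLOTS 16
#define FONT_NAME_MAX    64
#define FONT_PATH_MAX    256

/* Resolver callback: fills in 'path' for 'family', returns 0 on success. */
typedef int (*font_resolver_fn)(const char *family, char *path, size_t len);

struct font_name_cache {
    char family[FONT_CACHE_SLOTS][FONT_NAME_MAX];
    char path[FONT_CACHE_SLOTS][FONT_PATH_MAX];
    int  used;                 /* number of valid slots */
    int  next;                 /* round-robin victim once the cache is full */
    font_resolver_fn resolve;  /* the slow path, called only on a miss */
};

static void font_cache_init(struct font_name_cache *c, font_resolver_fn fn)
{
    memset(c, 0, sizeof *c);
    c->resolve = fn;
}

/* Return the cached path for 'family', resolving it once on a miss.
 * Returns NULL if the family cannot be resolved or is too long to cache. */
static const char *font_cache_lookup(struct font_name_cache *c,
                                     const char *family)
{
    char tmp[FONT_PATH_MAX];
    int i, slot;

    assert(c != NULL && family != NULL);
    for (i = 0; i < c->used; i++) {
        if (strcmp(c->family[i], family) == 0)
            return c->path[i];           /* hit: no filesystem access */
    }
    if (strlen(family) >= FONT_NAME_MAX)
        return NULL;
    if (c->resolve(family, tmp, sizeof tmp) != 0)
        return NULL;                     /* leave the cache untouched */

    slot = (c->used < FONT_CACHE_SLOTS) ? c->used : c->next;
    snprintf(c->family[slot], FONT_NAME_MAX, "%s", family);
    snprintf(c->path[slot], FONT_PATH_MAX, "%s", tmp);
    if (c->used < FONT_CACHE_SLOTS)
        c->used++;
    else
        c->next = (c->next + 1) % FONT_CACHE_SLOTS;
    return c->path[slot];
}

/* Stand-in for the real (slow) lookup; the /fonts prefix is made up. */
static int demo_resolver_calls;
static int demo_resolver(const char *family, char *path, size_t len)
{
    int n = snprintf(path, len, "/fonts/%s.ttf", family);
    demo_resolver_calls++;
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Initialise it once with font_cache_init(&cache, demo_resolver) and route every family lookup through font_cache_lookup().  The round-robin replacement is just the simplest thing that works; anything smarter only matters if you juggle more families than slots.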

The traces still showed that we were hitting the font configuration directories several times over, definitely not the intention of the source as far as we could tell.  After a day of code inspection, the culprit turned out to be an innocent looking routine that was responsible for cleaning up temporary font resources … unfortunately the cleanup also destroyed a static font configuration, causing it to be re-created each time a new font request was made!  Fixing that bug and re-running our test load:
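The shape of that bug is easy to reproduce in miniature.  The following is a hypothetical reconstruction, not the WebKit source: a lazily built shared configuration, a "buggy" cleanup that frees it along with the temporary resources, and a fixed cleanup that leaves it alone.  All of the names are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for an expensive, long-lived font configuration. */
struct font_config {
    int loaded;
};

static struct font_config *shared_config;  /* built once, meant to persist */
static int config_builds;                  /* how many times we rebuilt it */

/* Build the shared configuration on first use only. */
static struct font_config *get_font_config(void)
{
    if (shared_config == NULL) {
        shared_config = calloc(1, sizeof *shared_config);
        assert(shared_config != NULL);
        shared_config->loaded = 1;   /* real code would scan font dirs here */
        config_builds++;
    }
    return shared_config;
}

/* Test helper: forget everything and start counting from zero. */
static void reset_font_config(void)
{
    free(shared_config);
    shared_config = NULL;
    config_builds = 0;
}

/* Buggy cleanup: releases the temporary buffer AND the shared config. */
static void cleanup_temp_buggy(char **scratch)
{
    free(*scratch);
    *scratch = NULL;
    free(shared_config);             /* the innocent looking mistake */
    shared_config = NULL;
}

/* Fixed cleanup: only the temporary resources are released. */
static void cleanup_temp_fixed(char **scratch)
{
    free(*scratch);
    *scratch = NULL;
}

/* One font request: use the config, use a scratch buffer, clean up. */
static void request_font(int use_buggy_cleanup)
{
    char *scratch = malloc(32);

    (void)get_font_config();
    if (use_buggy_cleanup)
        cleanup_temp_buggy(&scratch);
    else
        cleanup_temp_fixed(&scratch);
}
```

With the buggy cleanup every call to request_font() rebuilds the configuration; with the fixed one it is built exactly once no matter how many requests follow.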

fileopensearchnew1

Now that’s looking better!  While the number of file accesses is still high (318), it isn’t completely ridiculous (2100), and even better, most of those initial accesses are the shared library loading and don’t occur during the steady state operations of the browser.  The even better news is the time savings that came with this reduction … it dropped a full ten (10) seconds off the general load time!  We’re totally moving from Super Sloppy to Super Shiny!  (thanks Mario and Paul!).

If you want more details on the WebKit port, we’ve just opened up the Crank Software blog, where we’re planning on posting updates as we make progress and make the port available.
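If you like to keep score, the numbers above reduce to two one-line calculations: 2100 accesses over a 30 second trace is 70 accesses per second, and going from 2100 to 318 is a reduction of roughly 85%.  The helper names below are just for illustration.

```c
#include <assert.h>

/* Average event rate over a trace of the given length. */
static double events_per_second(long events, double seconds)
{
    assert(seconds > 0.0);
    return (double)events / seconds;
}

/* Percentage drop from 'before' to 'after'. */
static double percent_reduction(long before, long after)
{
    assert(before > 0);
    return 100.0 * (double)(before - after) / (double)before;
}
```

Calling events_per_second(2100, 30.0) and percent_reduction(2100, 318) gives the two figures quoted above.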

Thomas

Tracing WebKit made easy

I spent most of the past week doing some performance tuning on the port of WebKit to QNX Neutrino that Crank Software is doing.  There are lots of different tools I could use but nothing beats the System Profiler when you want to get a quick overview of what is going on with your application and its effect on the rest of the system.

Since we’re building WebKit ourselves, I was able to add in a number of user trace events that had specific meaning for the WebKit performance metrics we were looking at.  With a few events, it made repetitive measurements of things like application load time, page load time and network latency a snap to calculate.

Prior to the 6.4.0 release, if you wanted to add in your own custom trace events you had to use the TraceEvent() API.  This API does way more than just insert trace events; it is the Swiss Army knife of calls to configure and control the entire kernel instrumentation system.  At the end of the day, I was always having to go back to the documentation just to double check the arguments required to push out an event with a string in it:

TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT, <id>, <string>)

Of course if you wanted anything more than a simple string, you had to fiddle with sprintf’s, allocate buffers etc.  A number of times I would simply insert five or six trace events in a row rather than doing all that extra work … which of course meant having log files cluttered with extra events I eventually needed to filter out to see what I wanted.
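For anyone who hasn't suffered through it, the "fiddle with sprintf" routine looks roughly like the sketch below.  The helper names and the event id are hypothetical, and the non-Neutrino branch is only there so the example builds on a desktop; on a real target the work is done by the TraceEvent() call shown above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#ifdef __QNXNTO__
#include <sys/neutrino.h>
#include <sys/trace.h>
#endif

/* Hypothetical id for illustration; on Neutrino, user event ids are
 * expected to sit between _NTO_TRACE_USERFIRST and _NTO_TRACE_USERLAST. */
#define WEBKIT_EVT_PAGE_LOAD 10

/* Format a message into a caller-supplied buffer.  Returns the number of
 * characters stored (not counting the terminator), truncating if needed. */
static int format_trace_message(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && len > 0);
    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    if (n < 0) {
        buf[0] = '\0';
        return 0;
    }
    return ((size_t)n >= len) ? (int)(len - 1) : n;
}

/* Emit one user string event.  Off target this only prints to stderr. */
static int emit_user_string_event(int id, const char *msg)
{
#ifdef __QNXNTO__
    return TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT, id, msg);
#else
    return (fprintf(stderr, "[trace %d] %s\n", id, msg) < 0) ? -1 : 0;
#endif
}
```

Every call site then needs its own buffer, a format step and an emit step, which is exactly the busywork that made stuffing in five plain string events look attractive.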

With 6.4.0 a whole slew of trace_* functions were added into the <sys/trace.h> header.  Now instead of having to remember the specific define for the TraceEvent() call, I can just do:

 trace_logf(<id>, <printf style format string>);

That is way easier and far more convenient … makes tracing almost as much fun as using my ShamWow.
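As a rough idea of how this reads in practice, here is a sketch of bracketing a page load with a pair of events.  The event ids and helper names are hypothetical.  On Neutrino 6.4.0 the real trace_logf() comes from <sys/trace.h>; the stand-in in the #else branch only exists so the example builds elsewhere, and all it does is remember the last message.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#ifdef __QNXNTO__
#include <sys/trace.h>
#else
/* Off-target stand-in with the same calling shape as the call above. */
static char last_trace_msg[128];
static int  last_trace_id = -1;

static int trace_logf(int id, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(last_trace_msg, sizeof last_trace_msg, fmt, ap);
    va_end(ap);
    last_trace_id = id;
    return 0;
}
#endif

/* Hypothetical user event ids for the two ends of a page load. */
enum {
    EVT_PAGE_LOAD_BEGIN = 20,
    EVT_PAGE_LOAD_END   = 21
};

/* Drop a marker when a page load starts ... */
static void note_page_load_begin(const char *url)
{
    assert(url != NULL);
    trace_logf(EVT_PAGE_LOAD_BEGIN, "load begin: %s", url);
}

/* ... and another when it finishes, with a status for good measure. */
static void note_page_load_end(const char *url, int status)
{
    assert(url != NULL);
    trace_logf(EVT_PAGE_LOAD_END, "load end: %s status=%d", url, status);
}
```

With the two markers in the log, page load time is just the difference between the timestamps of the matching begin and end events in the System Profiler.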

Unfortunately the functions aren’t documented yet, but they were modelled on the same style as the slog_* functions.  You do have to be careful about their use within interrupt handlers, but things like the trace_here() function can offer interesting insight alongside the other optimization tools.

… and in case you were wondering about the WebKit optimization results, they are looking good.  We’ve eliminated most of the superfluous work that hit embedded systems (especially non-x86 targets) harder than desktops and hope to be making a very usable GF based version available soon!

Happy Tracing!

Thomas

I’ve been workin’ on the WebKit all the live long day …

WebKit for Neutrino 6.4.0 that is

Lots has gone on since my last post on SRR.  After finishing up my parental leave last year  I decided that after working at QNX for nearly 10 years it was time for a change and took on a role at Crank Software.

We’re doing a lot of interesting things at Crank, mostly to do with graphics and improving the way people integrate graphical content into their embedded products. 

Crank is doing a lot of work with customers who base their user interface designs on what Apple’s iPhone and iTouch can do, but want to do it on systems with less powerful CPUs, smaller memory capacity and less capable graphics engines.

Oh yeah … and for those devices that are network connected they also want to run WebKit, the engine under Apple’s Safari web browser.

As part of addressing these needs (better, faster, smaller) Crank has ported WebKit to QNX Neutrino, and since web browsers and graphical applications go hand in hand these days, we plan to provide assistance and support on this technology. 

If you are keen, you can try out an advance pre-release version of the WebKit engine by downloading it from  http://www.cranksoftware.com/pre-release/.

This is a snapshot of our development from a month ago, after we got the initial port running and passing the basic browser tests.  Lots of improvements have been made since then and we’ll update it when we hit a good stability point.

Our initial ports run on both Neutrino 6.3.2 and 6.4.0.  We currently have Photon microGUI versions and expect to have an AGTDK based version out in the coming weeks.

The development on WebKit is very active, with lots of work going on in all areas from user interface to scalability to scripting performance.  Getting the initial port up and running wasn’t a trivial amount of work, and we hope to work with the WebKit developers to back port our changes. 

In the meantime, if you run Neutrino and want to be like all the other cool kids out there running WebKit, go check out the pre-release and let us know how it works for you!

Thomas
