So it is time for a shameless plug … but one with a distinct QNX angle =;-)
After many months of hard work the developers at Crank Software (myself included) are ready to release Storyboard Suite 1.0, a multi-operating-system, multi-platform, designer-driven graphical application builder.
At this point you may be either bored or intrigued … in either case you should go and start downloading your free version of Storyboard Suite now! If you are curious and don’t want to read … go watch a Storyboard video.
Still reading? Great … now let’s talk about why this is great for QNX developers. Up to this point if you wanted to build single mode graphical applications on top of Advanced Graphics TDK (aka GF aka Core Graphics) you had a few different options:
- Use Flash. Flash is an awesome platform/tool and you can build incredibly creative and eye-popping applications using all of the power and features that Adobe has provided, with a ready pool of skilled Flash developers to draw on … but sometimes Flash is just too big a solution when all you want are a couple of screens and some data displays.
- Use the rich C APIs and a (ported) widget library. This isn’t impossible, but it also means that your software developers are the ones doing the user interface development (and porting) and re-interpreting the graphical design that some artist has created for the product. This is time consuming and in my experience the software developer rarely captures the vision the graphical designer had in mind when they built the application screens.
Crank Storyboard Suite is composed of two pieces that give you an awesome third alternative:
Storyboard Designer is an Eclipse based graphical application development environment. It is based on an application hierarchy composed of screens, layers and controls and is driven by a simple, but extensible, event model. Visual content comes directly from designers as direct image imports, including through a sweet PSD file importer that can transform Photoshop layer elements directly into an application’s screens, layers and controls.
The tool exports a data bundle which is used as input for the Storyboard Embedded Engine, the runtime component that lives on your embedded system. The engine is easily configurable and plugin based so that it is extensible to any new technologies or trends that appear in the market.
… did I mention that there are videos that you can watch?
By now your download should be finished, so go and give it a whirl. A few simple examples are included along with a simulator that allows you to run applications you build on your native system if you don’t have an embedded target handy, and if you’ve got questions … well we’ve got forums to ask them in.
With 1.0 now out the door, 1.1 (or 1.0.1?) is coming along quickly with a focus on user driven feedback and enhancements! Things like providing additional WebKit integration for various customers and making simple graphical transforms even easier … oh and I thought I saw an OpenGL ES renderer running in the office too!
Note: I didn’t include Photon in the above list, since really if you are using Photon you’ve already got a solution for building applications, complete with widgets and an application builder. It isn’t likely that your graphic designers will find it an environment they want to work in, but they could learn it if they wanted to.
One of the ways in which we highlighted this performance issue was to take advantage of the System Profiler’s Trace Search and the new Path Manager trace events that are available in the 6.4.0 Neutrino kernel. This search gives us a quick view into which files are being opened.
These search results give us very useful information, but they are kind of isolated. One of the neat things that the Trace Search dialog allows you to do is to combine search conditions using the && (AND) and || (OR) operators, which can give your search results a bit more impact.
For example, if we wanted to get a rough estimate of the kind of effect that each of these file opens was having on our system, then we might do a compound search joining both the FileOpen information with the number of times that we perform a ConnectAttach to actually establish a connection to a server:
Then enter a compound condition:
Now when we do our search, we get a much more interesting set of results that gives us an indication of just how many different servers might be involved when we are doing all of this additional file access:
I’ve cut out some of the columns here to try and fit the content in, but we can now see the FileOpen results with the ConnectAttach() calls intermingled. The handy thing with this information is that you can now quickly establish a few cost metrics based on the observed results. This can be handy when you are trying to determine what areas may be worth pursuing for further optimization gains.
… at least that is true in the embedded space where your CPU footprint is just as important as your memory footprint and your storage (“on-disk”) footprint.
As a follow up to the Crank WebKit performance “shark week” we decided that since runtime performance was starting to look pretty good we should turn our sights to the memory footprint that the WebKit port was taking on our systems.
Due to the fact that this was a brand new browser engine (WebKit), with a minimal custom front end (Crank Launcher), ported to a brand new graphics framework from QNX (AG/GF), there was really no “apples to apples” comparable to use (though Crank does have many internal versions of Firefox). This isn’t as bad as it seems … in the world of software, attempting to “match” what previous software generations did rarely helps to advance the state of the technology. Our goal was to take a fixed block of time and make the biggest impact that we could.
We took base measurements off many different sites, but here I’ll use the plain old www.google.com as the reference. When we loaded that site up, after all of the shared objects were loaded and we finished rendering the page, we found that we would often hit close to 30M of memory being used! That was a pretty outrageous number.
There are lots of different tools for memory analysis, most of which don’t scale appropriately or provide accurate enough results when used on an application as disjoint as WebKit: it uses both C and C++ object smart pointers, it mixes custom allocators and the system allocator (malloc), and the usage and behaviour of memory allocation is completely different for each of the shared libraries used (i.e. libc, libxml, freetype, icu, libjpeg, …).
Since we were coming off the week of CPU performance analysis we had lots of different trace files hanging around so we decided to take a crank at them with the System Profiler. In addition to the new Path Manager event in 6.4.0, there are two system events for tracking mmap events … one for named events and one for un-named events.
We set up some traces, using the same technique as with the Path Manager events, and leveraged that information to give us some insight into what major allocations (those that routed to mmap()) were occurring and correspondingly to see what shared libraries, files and other objects were being mmap’ed by WebKit.
This tuning once again pointed us to the font handling and some less than optimal code in the way that WebKit uses the FreeType and FontConfig APIs. After last week’s CPU tuning experience with the font configuration, I’m rapidly coming to the conclusion that fonts and internationalization are nothing but a work generator!
Just for confirmation, we dropped the entire application into the debugger, put a breakpoint on the mmap() function and stepped through the major WebKit operations all while double verifying using pidin mem to see the memory effect on the system.
To make a short story long, we concluded that there is no reason to map in Asian fonts, especially at 6M a pop, if you have no characters on screen that require such functionality! Not a revelation that most people would be knocked over by, but the effect on this software was. With a little bit of tuning around the fonts and a few other memory areas, our memory footprint rocketed down to a very respectable (and stable) 4M when we target www.google.com.
… and of course to speak to the title, when we drop from 32M to 4M (the less) we pick up a few extra CPU cycles (the more) in the general operation. We’ve got a few more tricks up our sleeve but we’re getting very close to our GF release point!
File system access is pretty fast … if you are running on a desktop system with lots of caching resources and no real requirement to sync/flush content to the backing media until you have idle cycles. You can do a lot of churning and never even have it poke to the top of any desktop performance metering tools.
If you are running an embedded system though, filesystem access is your nemesis. Not only does the media that you are interfacing with generally have slower access times, but the environment is also much more demanding. You get the double whammy of fewer CPU cycles to spare and little or no additional cache memory to compensate for slow device access. All that, and you’ve got some pesky power monitor that keeps trying to get the system to idle so it can sleep and save a few milliwatts of power.
The customers that have approached Crank Software about WebKit integration all have embedded systems that they are running, so as part of my WebKit optimization week I started out with the easy targets: excessive filesystem access.
- Open a trace log file and open the search dialog; Search > Search or just Ctrl+H in the editor.
- Select the Trace Search tab icon and select Add to create a new condition.
- Create the condition using the System class and the Path Manager code.
Now if you use this condition you will see all of the path based file operations. You can use the data fields in the event such as process or pid to filter the events down to what you are interested in. Running this query, focusing in on the WebKit based application as it launched against a test website, yielded:
Unbelievable! Considering that this was a trace that lasted only 30 seconds, 2100 file accesses seems a little bit unreasonable: either a bug or an area ripe for optimization. The pathname provided us with lots of insight into what was happening (lots and lots of font access). Our current port uses FontConfig to manage the font mappings and FreeType to perform the rendering.
Our first change was to create a more ‘embedded friendly’ font configuration file. By default the font configuration is scanned every few seconds to support dynamic font addition and removal. Useful for desktops, but not needed by most embedded systems. Doing that, we picked up a few seconds of improvement, but were still churning.
Time to dive into the code and correlate it with our trace results:
The traces showed repetitive file access for the same font, so we added a simple filesystem name cache and used FreeType’s internal cache to avoid this churn. This dropped the file accesses by about an eighth and picked up another few seconds of improvement. Not bad for a bit of effort, but not the big gain we were looking for.
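To give a feel for the kind of name cache involved, here is a minimal sketch. It is not the code from our port: the table size, the `font_path_cached()` and `resolve_font_path_slow()` names and the `/usr/share/fonts/` path are all made up for illustration, and the "slow" resolver simply counts how often it is called instead of touching the filesystem.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_CACHE_SLOTS 16   /* hypothetical size, tune for your font set */
#define NAME_MAX_LEN     64
#define PATH_MAX_LEN     256

typedef struct {
    char name[NAME_MAX_LEN];  /* font name used as the lookup key */
    char path[PATH_MAX_LEN];  /* resolved file path for that font */
    int  used;
} name_cache_entry;

static name_cache_entry name_cache[NAME_CACHE_SLOTS];
static int slow_lookups;      /* counts how often we fall through to the slow path */

/* Hypothetical stand-in for the expensive lookup that walks the font
 * directories; here it only builds a made-up path and bumps a counter. */
static void resolve_font_path_slow(const char *name, char *out, size_t out_len)
{
    slow_lookups++;
    snprintf(out, out_len, "/usr/share/fonts/%s.ttf", name);
}

/* Return the path for a font name, hitting the slow resolver only on a miss. */
const char *font_path_cached(const char *name)
{
    name_cache_entry *slot = NULL;
    int i;

    assert(name != NULL);
    for (i = 0; i < NAME_CACHE_SLOTS; i++) {
        if (name_cache[i].used && strcmp(name_cache[i].name, name) == 0) {
            return name_cache[i].path;      /* cache hit: no file access */
        }
        if (!name_cache[i].used && slot == NULL) {
            slot = &name_cache[i];          /* remember the first free slot */
        }
    }
    if (slot == NULL) {
        slot = &name_cache[0];              /* table full: crudely reuse slot 0 */
    }
    /* Names longer than NAME_MAX_LEN - 1 are truncated and will always miss. */
    snprintf(slot->name, sizeof slot->name, "%s", name);
    resolve_font_path_slow(name, slot->path, sizeof slot->path);
    slot->used = 1;
    return slot->path;
}
```

Calling `font_path_cached("arial")` twice only reaches the slow resolver once; a real cache would also want a smarter eviction policy than "reuse slot 0".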
The traces still showed that we were hitting the font configuration directories several times over, definitely not the intention of the source as far as we could tell. After a day of code inspection, the culprit turned out to be an innocent looking routine that was responsible for cleaning up temporary font resources … unfortunately the cleanup also destroyed a static font configuration, causing it to be re-created each time a new font request was made! Fixing that bug and re-running our test load:
Now that’s looking better! While the number of file accesses is still high (318) it isn’t completely ridiculous (2100) and even better, most of those initial accesses are the shared library loading and don’t occur during the steady state operations of the browser. The even better news is the time savings that came with this reduction … it dropped a full ten (10) seconds off the general load time! We’re totally moving from Super Sloppy to Super Shiny! (thanks Mario and Paul!).
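For the curious, the shape of the cleanup bug described above can be reproduced in a few lines. This is a simplified illustration only: the `font_config` type and every function name here are hypothetical and none of it is the actual WebKit, FontConfig or FreeType code. A counter stands in for the directory scans that a real configuration load would trigger.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shared (static) font configuration. */
typedef struct { int loaded; } font_config;

static font_config *shared_config;
static int config_loads;   /* each load would mean re-scanning the font directories */

/* Lazily create the shared configuration on first use. */
font_config *get_font_config(void)
{
    if (shared_config == NULL) {
        shared_config = malloc(sizeof *shared_config);
        if (shared_config == NULL) {
            abort();
        }
        shared_config->loaded = 1;
        config_loads++;
    }
    return shared_config;
}

/* Buggy cleanup: meant to release temporary resources, but it also
 * tears down the shared configuration, forcing a reload next time. */
void cleanup_temp_fonts_buggy(void)
{
    free(shared_config);
    shared_config = NULL;
}

/* Fixed cleanup: only per-request resources would be released here;
 * the shared configuration is left alone. */
void cleanup_temp_fonts_fixed(void)
{
}

/* Issue n font requests, each followed by a cleanup, starting from a
 * clean state, and report how many configuration loads that cost. */
int simulate_requests(int n, void (*cleanup)(void))
{
    int i;

    assert(n >= 0 && cleanup != NULL);
    free(shared_config);
    shared_config = NULL;
    config_loads = 0;
    for (i = 0; i < n; i++) {
        (void)get_font_config();
        cleanup();
    }
    return config_loads;
}
```

With the buggy cleanup every request pays for a full configuration load; with the fixed one only the first request does, which is the same pattern we saw in the traces before and after the fix.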
I spent most of the past week doing some performance tuning on the port of WebKit to QNX Neutrino that Crank Software is doing. There are lots of different tools I could use but nothing beats the System Profiler when you want to get a quick overview of what is going on with your application and its effect on the rest of the system.
Since we’re building WebKit ourselves, I was able to add in a number of user trace events that had specific meaning for the WebKit performance metrics we were looking at. With a few events, it made repetitive measurements of things like application load time, page load time and network latency a snap to calculate.
Prior to the 6.4.0 release, if you wanted to add in your own custom trace events you had to use the TraceEvent() API. This API does way more than just insert trace events; it is the Swiss Army knife of calls to configure and control the entire kernel instrumentation system. At the end of the day, I was always having to go back to the documentation just to double check the arguments required to push out an event with a string in it:
TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT, <id>, <string>)
Of course if you wanted anything more than a simple string, you had to fiddle with sprintf’s, allocate buffers etc. A number of times I simply would insert five or six trace events in a row rather than doing all that extra work … which of course meant having log files cluttered with extra events I eventually needed to filter out to see what I wanted.
With 6.4.0 a whole slew of trace_* functions were added into the <sys/trace.h> header. Now instead of having to remember the specific define for the TraceEvent() call, I can just do:
trace_logf(<id>, <printf style format string>);
That is way easier and far more convenient … makes tracing almost as much fun as using my ShamWow!
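Side by side, the two styles look roughly like the sketch below. Treat it as an illustration under assumptions rather than a reference: the event id `EV_PAGE_LOAD`, the helper names and the message format are invented for this example, and since the trace_* functions are undocumented the `trace_logf()` call is only known to take the (id, printf style format) form shown above. The QNX calls are guarded by `__QNXNTO__` so the formatting logic can still be compiled and tried on a desktop host.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#ifdef __QNXNTO__
#include <sys/neutrino.h>
#include <sys/trace.h>
#endif

#define EV_PAGE_LOAD 10        /* hypothetical user event id */

static char last_event[128];   /* most recently formatted event text */

/* Pre-6.4.0 style: format into a buffer yourself, then hand the finished
 * string to TraceEvent() as a user string event. */
int log_event_old(int id, const char *fmt, ...)
{
    va_list ap;

    assert(fmt != NULL);
    va_start(ap, fmt);
    vsnprintf(last_event, sizeof last_event, fmt, ap);
    va_end(ap);
#ifdef __QNXNTO__
    return TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT, id, last_event);
#else
    (void)id;                  /* no kernel tracing off-target */
    return 0;
#endif
}

/* 6.4.0 style: trace_logf() does the formatting for us. Off-target we
 * fall back to the helper above so the call site stays the same. */
int log_page_load(const char *url, int ms)
{
#ifdef __QNXNTO__
    return trace_logf(EV_PAGE_LOAD, "page %s loaded in %d ms", url, ms);
#else
    return log_event_old(EV_PAGE_LOAD, "page %s loaded in %d ms", url, ms);
#endif
}
```

A single call such as `log_page_load("www.google.com", 1234)` then replaces the five or six separate events I used to drop in.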
Unfortunately the functions aren’t documented yet, but they were modelled using the same style as the slog_* functions. You have to be careful about their use within interrupt handlers, but things like the trace_here() function can offer interesting insight when used with the other optimization tools.
… and in case you were wondering about the WebKit optimization results, they are looking good. We’ve eliminated most of the superfluous work that hit embedded systems (especially non-x86 targets) harder than desktops and hope to be making a very usable GF based version available soon!