Archive for April, 2008|Monthly archive page
At long last, the dtrace project at Foundry27 is public. It’s really still in the formative stages, but if you want to download the source and hack around a bit, or even just laugh at the dirty hacks, feel free!
I’m slowly working on a more thorough port, but the prototype is a nice forum to play around in.
If you’ve ever taken a kernel trace of an application starting up on a kernel that is 6.3.2 or later, you might have noticed a thread state in your application called STATE_WAITPAGE.
To understand what is going on here, we have to first look at mmap() and how it allocates and initializes memory.
When you allocate memory with mmap(), it first allocates a virtual address range, and then (unless you’ve passed MAP_LAZY), it will allocate the physical memory needed to back that object.
What it doesn’t do, though, is set up all the page table entries for the mapping (in practice it will set up a certain number up front, depending on heuristics based on the type of mapping).
Instead, it waits until the program first accesses a page before setting up a page table entry for it.
This is a change from pre-6.3.2 kernels, where all mappings were set up immediately. You can re-enable the old behaviour by using the procnto option -mL in your buildfile.
The benefit of this scheme, known as demand paging, is a speedup when the application doesn’t actually access all of the pages in a mapping. This is quite useful when mapping executable objects, since there are often quite a few pages in the text section which may never be referenced.
In some patterns of usage, though, this can induce a performance penalty, especially when there are many threads running at the same priority.
This brings us back to STATE_WAITPAGE. When a process is executing and it accesses a page that has not been mapped, or tries to write to a page that is marked read-only, it will induce a page fault. This is caught by the kernel.
The fast path in the kernel will peek at the process’s page tables, and if it finds an entry it will bang it into the TLB (this is done in hardware on some architectures, such as the x86). If it can’t find one then it needs to defer the work of handling the fault to the process manager, since it may need to do some complex work such as communicating with device drivers, and also the necessary structures may not be accessible/consistent.
This, then, is where the thread gets blocked in STATE_WAITPAGE, and a pulse is sent to procnto, to wake up a thread to handle the request.
The procnto worker does the dirty work of initializing the memory (it may need to read a page from a disk driver for example) and setting up the pagetables.
The first time a page is read it is set up with a read-only mapping, even if the mapping is writeable, unless the underlying page has already been marked as modified. For MAP_LAZY mappings, this is the point at which physical memory is actually allocated for the page. If the allocation fails, then the application will have a SIGBUS signal delivered to it (this was the same in pre-6.3.2 kernels).
This delayed initialisation supports two handy schemes.
The first is Copy On Write semantics (COW). This is where we don’t bother making a private copy of a page that was originally shared with another mapping until the page is modified.
The second is support for writeable mappings to files. Prior to 6.3.2 you could only map a file read-only, or with the extra flag, MAP_NOSYNCFILE. This was because there was no tracking of modified pages.
When you call msync() on a shared mapping to a file, all the modified pages are written to the backing store. Now the pages can be marked read-only again, and the modified indicator can be turned off.
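The difference between the two schemes can be seen from user space with a small C sketch. This is only an illustration under a few assumptions: a POSIX system, a writable /tmp directory, and a made-up helper name, write_reaches_file. It writes through a mapping of a temporary file and then reads the file back through the descriptor. With MAP_PRIVATE the write lands in a private copy of the page, so the file is unchanged; with MAP_SHARED, msync() pushes the modified page to the backing store.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write through a mapping created with `map_flags`
 * (MAP_SHARED or MAP_PRIVATE) and report whether the file itself changed.
 * Returns 1 if the file saw the write, 0 if it did not, -1 on error.
 */
int write_reaches_file(int map_flags)
{
    char path[] = "/tmp/mapdemo_XXXXXX";  /* assumes /tmp is writable */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);  /* the file disappears once fd is closed */

    const size_t len = 5;
    int result = -1;

    if (write(fd, "clean", len) == (ssize_t)len) {
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flags, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p, "dirty", len);  /* the first write faults the page in */

            /* Only a shared mapping has modified pages to write back. */
            int synced = 0;
            if (map_flags & MAP_SHARED)
                synced = msync(p, len, MS_SYNC);

            char buf[5];
            if (synced == 0 && pread(fd, buf, len, 0) == (ssize_t)len)
                result = (memcmp(buf, "dirty", len) == 0);

            munmap(p, len);
        }
    }

    close(fd);
    return result;
}
```

write_reaches_file(MAP_SHARED) should report that the file changed, while write_reaches_file(MAP_PRIVATE) should report that it did not.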
Now all this is great, but what about that performance penalty? Well, all that context switching can make the page fault processing take quite a while if there are lots of threads in STATE_READY at the same priority. The procnto thread will be placed at the back of the queue, wait for the others to run, then it will run, potentially talk to a device driver, and then make the original thread ready again, which will in turn be placed at the end of the queue.
Another thing to think about is that in some circumstances you want to take the hit of talking to the backing store all at once, for determinism reasons.
A way to control this behaviour is via the mlock() and mlockall() functions. These tell procnto that you want some (or all) pages made memory resident right now.
This means that the mapping will have been fully read in from disk by the time the mmap() call returns to your program.
You still get page faults on the first write to a page, though. In that case we set up a read-only mapping (or if the underlying pages have already been modified, a read-write mapping).
If you TRULY don’t want page faults for mappings, then there are two options open to you. Since device drivers may have their code run in an interrupt handler, and they may also be running code with interrupts disabled, they can’t take page faults at all. So when you request IO privity with the ThreadCtl() call, you get marked as being super locked. This means that all mappings are set up and ready to go.
Of course, only processes running with superuser privilege can request this special status. To have all processes be marked as super locked, you can pass the -mL option to procnto. To have all processes simply be locked (the equivalent of calling mlockall(MCL_FUTURE|MCL_CURRENT)), you can pass the -ml option.
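For a single process, the mlockall() call mentioned above can be made directly. The following is a small sketch assuming a POSIX system; lock_whole_process is a hypothetical name, and the call may be refused depending on privileges and memory limits, so the errno value is returned for the caller to inspect.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: ask for every current and future mapping of this
 * process to be locked. Returns 0 on success, otherwise the errno value
 * reported by mlockall().
 */
int lock_whole_process(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;
    return 0;
}
```

The lock stays in effect until the process calls munlockall() or exits.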
Well, this post has gone all over the map <groan>, so I’m going to sign off.
Well sorry for not having written for so long. We’ve all been very busy working on the 6.4 kernel.
BTW – if you’ve been wondering about the CoreOS source repository being out of date, the sync process between the internal and external repositories broke down. :-(
The good news though is that it’s fixed now, so you can get all the goodness straight from the source. :-)