
A performance mystery (part two)

In part one, we discovered that our multicast receipt thread was being stalled by page faults.

In part two, we’ll dig down into the causes of those page faults, and with some help from our friends at Informatica, get to the bottom of things.

Systems Cavalry Arrives

Let’s focus on just the first stack trace to begin with.

We can have a bit of a guess at what’s going on just by looking at the symbol names. Let’s flip the stack so we can go from out to in:

  1. Java_com_latencybusters_lbm_LBMContext_lbmContextProcessEvents [/opt/lbm/lib/liblbmj.so.5.0.0]

This is the function that the LBM thread repeatedly calls; presumably a Java wrapper for the native library’s equivalent.

  2. lbm_context_process_events [/opt/lbm/lib/liblbm.so.7.0.1]

The native version of process events.

  3. lbm_timer_expire [/opt/lbm/lib/liblbm.so.7.0.1]

Somewhere in LBM, a timer has expired.

  4. lbm_rate_ctlr_handle_timer [/opt/lbm/lib/liblbm.so.7.0.1]

Somewhere in LBM’s rate controller, we’re handling timer expiry?

  5. _dl_runtime_resolve [/lib64/ld-2.12.so]

Which means we need to resolve a symbol?

  6. _dl_fixup [/lib64/ld-2.12.so]

Which needs fixing up?

This already feels odd. Why are we resolving a symbol more than fifteen minutes after we started our application? Is there code in LBM that is only triggered for the first time at this point in the performance run?

What we want to know is what happens at

 0x7efeba91b7a3 : lbm_rate_ctlr_handle_timer+0x9e3/0xbe0 [/opt/lbm/lib/liblbm.so.7.0.1]

Let’s see if we can work that out.

objdump

The stack trace tells us where to look for a symbol, so we invoke

# objdump -D /opt/lbm/lib/liblbm.so.7.0.1 > ~/lbm-symbols.out
# less ~/lbm-symbols.out
# /lbm_rate_ctlr_handle_timer
000000000015ddc0 <lbm_rate_ctlr_handle_timer>:
  15ddc0:       41 57                   push   %r15
  15ddc2:       41 56                   push   %r14

The stack trace tells us we are +0x9e3 from the start of that function, which puts us at 0x15ddc0 + 0x9e3 = 0x15e7a3:

  15e799:       e8 62 af f4 ff          callq  a9700 <pthread_mutex_unlock@plt>
  15e79e:       e8 0d 26 f4 ff          callq  a0db0 <sched_yield@plt>
  15e7a3:       48 8b 7c 24 18          mov    0x18(%rsp),%rdi
  15e7a8:       e8 13 a2 f4 ff          callq  a89c0 <pthread_mutex_lock@plt>
  15e7ad:       e9 ab fc ff ff          jmpq   15e45d <lbm_rate_ctlr_handle_timer+0x69d>

We guess that the mov instruction at 15e7a3 can’t really be requiring symbol resolution. The address in a stack trace is a return address, so it points at the instruction after the call that was actually executing; the instruction immediately before it, the callq to a0db0 <sched_yield@plt>, is the likely culprit.

We theorise that LBM is calling sched_yield, and for some reason, it doesn’t already know where sched_yield is. ld is enlisted to find it for us, and that takes a while; perhaps longer than usual because of the memory pressure the system is under.

We’ll take a brief detour to explain how our application is supposed to work out where functions like sched_yield are, then expand our theory based on that.

ELF

No, not Emerson, Lake and Palmer. And yes, before you ask: DWARF and ELF are from the same people.

ELF is a file format used by binaries (and binary libraries).

# file /opt/lbm/lib/liblbm.so.7.0.1
/opt/lbm/lib/liblbm.so.7.0.1: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped

The @plt we see in the output of objdump expands to procedure linkage table. This is a specific section of the ELF format which lists the symbols that were not available at link time, i.e. when this library was being built at Informatica. This is sensible – we definitely want LBM to call the currently running OS’s implementation of sched_yield rather than having the build machine’s version statically linked into it.

What happens when we call sched_yield for the first time in liblbm.so.7.0.1, though? Well, it depends on who our program interpreter is. We don’t expect liblbm.so.7.0.1 to have an ELF interpreter entry, as it’s a library rather than a binary; our java binary should have one, though.

# readelf -l /opt/zvm/bin/java

Elf file type is EXEC (Executable file)
Entry point 0x400540
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001c0 0x00000000000001c0  R E    8
  INTERP         0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

So, when we load the native LBM libraries during JNI initialisation, those .sos will be interpreted by /lib64/ld-linux-x86-64.so.2.

By default, this interpreter lazily initialises references in the plt. Just as malloced memory might only be assigned a physical memory page when we attempt to write to it, the runtime will only resolve (and then cache) a symbol location in the plt at first invocation time.
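To make the lazy/eager distinction concrete, here is a minimal C sketch (purely illustrative – it loads libm.so.6 simply because it is a handy shared library) showing the two modes dlopen offers. Loading a JNI library via System.loadLibrary boils down to a dlopen under the hood, and dlopen(3) notes that LD_BIND_NOW in the environment overrides RTLD_LAZY, which is why an environment variable on the java process can change how the LBM libraries are bound.

/* lazy_vs_now.c - illustrative only. Build with: gcc lazy_vs_now.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* RTLD_LAZY: symbols referenced through the plt are resolved on first
     * call - the default behaviour described above. */
    void *lazy = dlopen("libm.so.6", RTLD_LAZY);

    /* RTLD_NOW: every undefined symbol is resolved before dlopen returns,
     * so no _dl_runtime_resolve work happens later on a hot thread.
     * LD_BIND_NOW=y forces this even when the caller asked for RTLD_LAZY. */
    void *eager = dlopen("libm.so.6", RTLD_NOW);

    printf("lazy handle=%p, eager handle=%p\n", lazy, eager);
    return 0;
}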

With this knowledge, we are down to three possibilities.

  1. This is the very first time LBM has invoked sched_yield (possible, but unlikely)
  2. Our dynamic linker is misbehaving (astronomically unlikely)
  3. Our understanding of how this is supposed to work is incorrect (likely)

There’s only one thing for it: it’s time to write a tiny C program.

# cat call_sched_yield.c
#include <sched.h>

int main(int argc, char* argv[]) {
  sched_yield();
  sched_yield();

  return 0;
}

We compile this program:

gcc call_sched_yield.c

Examine its ELF interpreter:

# readelf -l a.out

Elf file type is EXEC (Executable file)
Entry point 0x4003e0
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001c0 0x00000000000001c0  R E    8
  INTERP         0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

and then run it with some special ld environment variables to see if our understanding is correct.

# LD_DEBUG=all ./a.out > ld_debug_one.out 2>&1
# LD_BIND_NOW=y LD_DEBUG=all ./a.out > ld_debug_two.out 2>&1

LD_DEBUG is an instruction to ld-linux to print debugging information about the work the linker is doing. LD_BIND_NOW is an instruction to ld-linux to override the lazy initialisation behaviour and load all symbols in the plt at start time.

# tail -n 15 ld_debug_one.out
     74235: 
     74235: initialize program: ./a.out
     74235: 
     74235: 
     74235: transferring control: ./a.out
     74235: 
     74235: symbol=sched_yield;  lookup in file=./a.out [0]
     74235: symbol=sched_yield;  lookup in file=/lib64/libc.so.6 [0]
     74235: binding file ./a.out [0] to /lib64/libc.so.6 [0]: normal symbol `sched_yield' [GLIBC_2.2.5]
     74235: 
     74235: calling fini: ./a.out [0]
     74235: 
     74235: 
     74235: calling fini: /lib64/libc.so.6 [0]
     74235: 
     
# tail -n 11 ld_debug_two.out
     76976: initialize program: ./a.out
     76976: 
     76976: 
     76976: transferring control: ./a.out
     76976: 
     76976: 
     76976: calling fini: ./a.out [0]
     76976: 
     76976: 
     76976: calling fini: /lib64/libc.so.6 [0]
     76976: 

‘transferring control’ is the point at which our program’s main is entered. We can see that without LD_BIND_NOW, the first call to sched_yield triggers a lookup. That lookup succeeds and then nothing else happens until the program terminates.

With LD_BIND_NOW=y, however, we see no ld behaviour at all during main.

Our small test program has the same interpreter as java, and behaves exactly as we expect. We should start to consider case 1 (LBM calling sched_yield for the first time) as a more serious contender.

How can we test this? Well, let’s set LD_BIND_NOW=y for the market data service and see what happens.

Result!

In this chart, each block represents a run through the performance environment. The calming blue blocks are NAK free, the yellow blocks saw LBM recovery, and the red blocks experienced a NAK. Guess where we set LD_BIND_NOW=y. 1

Five days of NAKs 

Conclusions

  1. Record all the things. The number of tables in our perftest schema has grown to seventeen. Only three of those contain the latency distributions we originally set out to measure; much of this investigation was made easier by the data in the others.

  2. Abstractions are great. No way do I want to have to think about this stuff all the time.

  3. We will sometimes, unfortunately, have to look behind the curtain and figure out what is going on. Linux has phenomenal tooling to help do this, even in cases where you don’t have the source code (although I think we’d have really struggled if the LBM libs were stripped).

  4. systemtap really shines in problems of this ilk, where you need to capture a lot of detail about a single event.

Epilogue

We still have a few things to explain. Why is this the first time LBM calls sched_yield? What triggers that behaviour?

We got an excellent response from Informatica explaining how this might happen:

 
Looking through the traces and the UM source code, the
sched_yield() in the lbm_rate_ctlr_handle_timer() function will only
occur under very rare circumstances.

The context has to hit the LBTRM or LBTRU rate limit.  On wakeup, the
lbm_rate_ctlr_handle_timer() function attempts to "wakeup" all of the
blocked transports.  The wakeup can fail due to an internal lock being
held.  In that particular case, the lbm_rate_ctlr_handle_timer() will
call sched_yield().

One other thing that the market data service does is publish orderbook snapshots to a market data viewer application. It does this once per second. It turns out that sending all the state of all the orderbooks in our performance environment can occasionally trigger LBM’s rate limiting.

Blocked Send Count
N.B. that key records how many times a send call was blocked in a reporting interval – so we block one or two writes most seconds.

What we didn’t know was that hitting a rate limit on a publisher could potentially cause jitter on the receive path. Informatica have plans to amend this in future versions, and we can work around it by bringing rate limiting inside our own code.
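To illustrate the shape of that workaround, here is a rough sketch in C (not our production code, which lives on the Java side): budget a number of bytes per interval, mirroring the 20ms interval we give LBM, and defer a snapshot publish rather than letting LBM’s rate controller block the send.

/* interval_limiter.c - illustrative sketch only, not the LMAX implementation.
 * Budget a number of bytes per interval (e.g. 20ms) and refuse sends that
 * would exceed it, so the LBM rate controller never has to engage.
 * Build: gcc interval_limiter.c (add -lrt on older glibc for clock_gettime) */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef struct {
    uint64_t capacity_bytes;    /* bytes allowed per interval            */
    uint64_t available_bytes;   /* bytes left in the current interval    */
    uint64_t interval_ns;       /* e.g. 20 * 1000 * 1000 for 20ms        */
    uint64_t interval_start_ns;
} interval_limiter;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Returns 1 if the caller may send `bytes` now; 0 means defer the publish
 * (e.g. try the snapshot again in the next interval). */
static int limiter_try_acquire(interval_limiter *l, uint64_t bytes) {
    uint64_t now = now_ns();
    if (now - l->interval_start_ns >= l->interval_ns) {
        l->interval_start_ns = now;            /* new interval: refill   */
        l->available_bytes = l->capacity_bytes;
    }
    if (bytes > l->available_bytes)
        return 0;
    l->available_bytes -= bytes;
    return 1;
}

int main(void) {
    interval_limiter l = { 1000000, 1000000, 20000000ull, now_ns() };
    printf("send allowed: %d\n", limiter_try_acquire(&l, 65536));
    return 0;
}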

This rate limiting behaviour also explains some other anomalies we discovered when we analysed when NAKs happened:

  1. We only ever had one market data service problem per run. This matches exactly with the lazy initialisation of the plt.
  2. The growth in the depth of the OrderBookEvents socket buffer always began within 20ms of a second boundary. This matches with the trigger for republishing orderbook snapshots in the market data service (the start of a second), and the rate limiting interval we give LBM (20ms).

Unanswered questions

We didn’t look into the other stack traces captured by stap. We can reasonably assume they had the same cause, as they disappear with the adoption of LD_BIND_NOW.

We didn’t look too deeply at why symbol resolution causes a page fault that proves so difficult to service. Other processes are surely calling sched_yield frequently, so why does looking it up take so long? We might get to this in a future post.

And finally…

We should point out that there was a great deal more trial and error involved in this investigation than this post really communicates.

Much of the monitoring information we used exists only because we were stuck, had a theory about what the problem could be, then added monitoring to confirm or disprove that theory. It took us a lot of those cycles (and a fair few annoyances like perf’s timestamp behaviour) over a number of months before we had enough information to dive in with systemtap.


  1. This picture was taken a little after we originally turned LD_BIND_NOW on. The first runs that were free of market data service NAKs showed us that another important application in the environment had been suffering packet drops since its host was last rebooted. This was adding more red to the NAK display on the dashboard, but we’d missed the issue because we had grown so used to the red. We fixed the packet loss, disabled LD_BIND_NOW to show how bad things were, then enabled it again to generate the unbroken sequence of calm blue blocks afterward.

A performance mystery (part one)

We recently fixed a long standing performance issue at LMAX.

The path we followed to fixing it was sufficiently windy to merit a couple of posts.

In this first post we’ll define our issue and then attempt to figure out its cause.

Problem Identified

At LMAX we have an application named the market data service. Its multicast receipt thread is not keeping up – it occasionally stalls for around 150ms. This causes data loss; the service has to request retransmission (it NAKs) from upstream, which degrades system latency – something we want to avoid.

Two quick definitions:

LBM: the technology we use to talk multicast (this is an Informatica product). Under the hood, it uses epoll_wait.

OrderBookEvents: the name of the high volume multicast topic we’re not keeping up with.

Ruling out the obvious suspects

Here are a few things we ruled out early on:

  1. A burst of traffic

We took packet captures and the arrival rate was constant. We took captures on other subscribers to the same data and the arrival rate was constant there, too.

  2. Preemption of multicast receipt thread

We’ve talked about reducing this sort of jitter here and here.

The code and configuration we use to do multicast receipt is shared across several applications. We’d expect to see this issue everywhere if we’d managed to screw this up, and we don’t. 1

  3. LMAX code being slow to respond to data passed up from LBM

The LBM thread calls us back with data from that topic. Helpfully, it also records a statistic that captures the maximum callback time for a given interval. That maximum is not big enough to explain the issue.

Worried that the sampling rate of that metric might be hiding an issue, we added our own monitoring to time each individual callback. This also found no evidence of a problem large enough to be responsible.

  4. Getting blocked in epoll_wait for longer than the interval passed to it.

We configured our LBM thread to ‘spin’: we pass it a parameter so that its calls to epoll_wait specify a timeout of 0, which tells epoll_wait to return immediately if no events are present. The problem remained.
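We can’t see inside LBM, but a spinning receive loop of this shape is roughly what that configuration amounts to; a hedged sketch:

/* spin_poll.c - a sketch of a 'spinning' event loop: epoll_wait with a
 * timeout of 0 polls and returns immediately instead of sleeping.
 * Illustrative only; we cannot see what LBM actually does internally. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

int main(void) {
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return 1;
    }

    struct epoll_event events[64];
    for (;;) {
        /* timeout = 0: return immediately if nothing is ready, so the
         * thread never blocks in the kernel waiting for data. */
        int ready = epoll_wait(epfd, events, 64, 0);
        if (ready > 0) {
            /* read from the ready sockets / dispatch callbacks here */
        }
        /* ready == 0: nothing to do, loop straight back round,
         * burning the (isolated) core. */
    }
}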

Let’s zoom out a little.

What’s different about the Market Data Service?

Or, better: why isn’t this happening to any of the other OrderBookEvents subscribers? We have several running in this environment and they behave flawlessly.

Well, it does a lot more disk IO. The other subscribers tend to be gateways. This is LMAX speak for "translates internal LMAX multicast traffic into different format client sockets". One example might be a FIX market data application.

The market data service is special; it journals the OrderBookEvents topic to disk, creating a system of record. It maintains another disk based format to support market surveillance. If that weren’t enough, it also manages the recording of various csv files that end up being part of our market data API. We didn’t do a tremendous job of keeping this application simple!

We might think that the root cause is slow disk access, but this cannot be the case. The writes to disk occur either on or downstream of our business logic thread, not the LBM thread. The metrics we capture on the buffer between LBM and those threads show us that we’re easily keeping up.

We’ve ruled out preemption via isolation. We’ve ruled out a bursty network. We’ve ruled out the effect of slow disk writes.

What other shared system resources are there? Are they able to make the market data service behave so much worse than our other applications?

The Linux Memory Subsystem

The programming interface for much of the market data service might well be files, but at runtime, Linux may cache many of those files in memory. A brief google for "Linux Page Cache" will find a variety of introductions to this topic; we won’t go into it here.

A picture might speed us up a little bit here. Here’s a chart of how much system memory is free across a performance run, in kilobytes.

Eating Memory Under Pressure

We can see that around fifteen minutes into the run, we’ve eaten up all but 2GB of the available RAM. The shape of the graph changes into a sawtooth. We’re still eating memory, but the OS periodically frees some by evicting the least recently used files from memory.

If this is our issue, we’d guess that all of our NAKs should be occurring after the system enters that sawtooth area. Indeed, a brief look at the timings in the performance test results database shows this to be the case.

How could this be causing problems on the LBM thread, though? It shouldn’t be touching the file system, so whether a file is cached or not should not impact it. Perhaps it attempts to allocate memory at a really inopportune moment and the page cache has to evict some files back to the disk in order to supply it.

In the parlance of the memory subsystem this is a major page fault. Let’s see if we can capture the page faults occurring on the LBM thread and correlate them with an OrderBookEvents NAK.
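For a concrete picture of what a major fault costs, here is a small, self-contained C sketch (the file path below is just an example): a fault that can be satisfied from pages already in memory is minor, while one that has to wait for the disk is major, and under memory pressure that wait can be long.

/* major_fault_demo.c - touch a page of a file-backed mapping. If the page
 * is not in the page cache the access is a major fault and this thread
 * stalls until the disk read completes. Illustrative only; run under
 * /usr/bin/time -v to see the major/minor fault counts. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    const char *path = argc > 1 ? argv[1] : "/tmp/some_large_file";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* This read faults if the page is not resident; a major fault here
     * means waiting for I/O - the kind of stall we suspect on the LBM
     * thread. */
    volatile char first_byte = p[0];
    (void)first_byte;

    munmap(p, st.st_size);
    close(fd);
    return 0;
}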

Capturing page faults

We want to be able to see every page fault the LBM thread encounters that takes more than x milliseconds. There are a few ways to do this; let’s pick one.

  1. perf

This is, apparently, a one liner.

perf record -e major-faults -ag

This records all major faults system wide – we could add more flags to make it process specific.

A couple of things are a bit harder to do at this point:

  • how do we work out how long it takes to service a fault?
  • how do we correlate the fault timing with our NAKs?

The second is the real kicker. perf, by default, uses the system uptime as the ‘time’ of an event it records. perf record‘s man page claims that -T also samples epoch timestamps, and that perf report -D can view them. As far as I could work out, the timestamps are buried somewhere in the binary part shown to you in the report.

The pain of having to parse the perf format and rework the timestamps was enough to put me off.

  2. ftrace

While this has sensible timestamps, how do we measure the duration of a page fault rather than just when one occurs?

  3. systemtap

This is what I ended up using.

Not least because googling "systemtap trace page faults" takes you straight to a working example.

Systemtap

Let’s not go into any detail about how systemtap works. Suffice to say it allows you to execute a section of code at particular points in existing binaries, even those in kernel space.

Here’s the script we found earlier:

#! /usr/bin/env stap

global fault_entry_time, fault_address, fault_access
global time_offset

probe begin { time_offset = gettimeofday_us() }

probe vm.pagefault {
  t = gettimeofday_us()
  fault_entry_time[tid()] = t
  fault_address[tid()] = address
  fault_access[tid()] = write_access ? "w" : "r"
}

probe vm.pagefault.return {
  t=gettimeofday_us()
  if (!(tid() in fault_entry_time)) next
  e = t - fault_entry_time[tid()]
  if (vm_fault_contains(fault_type,VM_FAULT_MINOR)) {
    ftype="minor"
  } else if (vm_fault_contains(fault_type,VM_FAULT_MAJOR)) {
    ftype="major"
  } else {
    next #only want to deal with minor and major page faults
  }

  printf("%d:%d:%id:%s:%s:%d\n",
    t - time_offset, tid(), fault_address[tid()], fault_access[tid()], ftype, e)
  #free up memory
  delete fault_entry_time[tid()]
  delete fault_address[tid()]
  delete fault_access[tid()]
}

This script adds two code sections (‘probes’ in systemtap lingo) at the beginning and end of something called vm.pagefault. The interpretation of the probe code is left as an exercise to the reader.

Let’s try to run that script; there’s a small chance it might immediately tell us the answer.

# stap pfaults.stp 
WARNING: cannot find module kernel debuginfo: No DWARF information found [man warning::debuginfo]
semantic error: while resolving probe point: identifier 'kernel' at /usr/share/systemtap/tapset/linux/memory.stp:68:8
        source:                      kernel.function("__handle_mm_fault@mm/memory.c").call
                                     ^

semantic error: no match (similar functions: __handle_mm_fault, handle_mm_fault, handle_pte_fault, drm_vm_fault, shm_fault)
semantic error: while resolving probe point: identifier 'kernel' at :67:22
        source: probe vm.pagefault = kernel.function("handle_mm_fault@mm/memory.c").call !,
                                     ^

semantic error: no match (similar functions: handle_mm_fault, __handle_mm_fault, handle_pte_fault, drm_vm_fault, shm_fault)
semantic error: while resolving probe point: identifier 'vm' at pfaults.stp:8:7
        source: probe vm.pagefault {
                      ^

semantic error: no match
Pass 2: analysis failed.  [man error::pass2]
Number of similar error messages suppressed: 3.
Number of similar warning messages suppressed: 3.
Rerun with -v to see them.

If only! This is disheartening, but we’re getting quite used to things that don’t work as we expect by this stage, so let’s look at the error output properly for clues.

Systemtap is telling us two things, both of which are bad news:

WARNING: cannot find module kernel debuginfo: No DWARF information found [man warning::debuginfo]

"I can’t find debug information for the kernel you’re running :-("

semantic error: while resolving probe point: identifier 'kernel' at /usr/share/systemtap/tapset/linux/memory.stp:68:8
        source:                      kernel.function("__handle_mm_fault@mm/memory.c").call

"I can’t find the place you asked me to probe"

…and one thing which is very, very helpful indeed.

semantic error: no match (similar functions: __handle_mm_fault, handle_mm_fault, handle_pte_fault, drm_vm_fault, shm_fault)

"Did you mean one of these?"

It looks like systemtap is trying to tell us that it doesn’t know what __handle_mm_fault@mm/memory.c is, but it has heard of __handle_mm_fault. Perhaps vm.pagefault might be some sort of alias.

# man stap
STAP(1)

NAME
       stap - systemtap script translator/driver

SYNOPSIS
       stap [ OPTIONS ] FILENAME [ ARGUMENTS ]
       stap [ OPTIONS ] - [ ARGUMENTS ]
       stap [ OPTIONS ] -e SCRIPT [ ARGUMENTS ]
       stap [ OPTIONS ] -l PROBE [ ARGUMENTS ]
       stap [ OPTIONS ] -L PROBE [ ARGUMENTS ]
       stap [ OPTIONS ] --dump-probe-types
       stap [ OPTIONS ] --dump-probe-aliases
       stap [ OPTIONS ] --dump-functions

--dump-probe-aliases, you say?

# stap --dump-probe-aliases | grep pagefault
vm.pagefault = kernel.function("handle_mm_fault@mm/memory.c").call!, kernel.function("__handle_mm_fault@mm/memory.c").call
vm.pagefault.return = kernel.function("handle_mm_fault@mm/memory.c").return!, kernel.function("__handle_mm_fault@mm/memory.c").return

Perhaps we can avoid the alias altogether and just bind probes to kernel.function("handle_mm_fault") and kernel.function("__handle_mm_fault")?

We make the following change:

-probe vm.pagefault {
+probe kernel.function("handle_mm_fault").call!, kernel.function("__handle_mm_fault").call {

-probe vm.pagefault.return {
+probe kernel.function("handle_mm_fault").return!, kernel.function("__handle_mm_fault").return {

…and now we get the following output:

[root@ldperf-pf2-mmmd01 perf]# stap pfaults-dealiased.stp 
WARNING: never-assigned local variable 'address' (similar: fault_address, t, fault_access, write_access, time_offset): identifier 'address' at pfaults-dealiased.stp:11:26
 source:   fault_address[tid()] = address
                                  ^
WARNING: never-assigned local variable 'write_access' (similar: fault_access, address, t, fault_address, time_offset): identifier 'write_access' at :12:25
 source:   fault_access[tid()] = write_access ? "w" : "r"
                                 ^
WARNING: never-assigned local variable 'fault_type' (similar: ftype, e, fault_access, t, fault_address): identifier 'fault_type' at :19:25
 source:   if (vm_fault_contains(fault_type,VM_FAULT_MINOR)) {

I confess, I am still confused as to what this means. I also don’t really know what those variables are meant to refer to in that script; if I look at the kernel source that defines __handle_mm_fault for our kernel version, there are no parameters named write_access or fault_type, nor are any variables declared with those names in the function body. Perhaps that example script works on an older kernel.

I assumed this meant that I would probably be getting untrustworthy output. With the addition of the first error we got (WARNING: cannot find module kernel debuginfo: No DWARF information found [man warning::debuginfo]), I went back to google with some slightly different keywords, and eventually discovered DWARFless probing.

I rewrote the script using that approach, and then realised that I really didn’t care about the address or the write_access. All I needed was the type of fault, and that was just as easily accessible in the return value of __handle_mm_fault.

I ended up with a script that looks like this

# Trace page faults for a process (and any of its threads?)
#
# Usage: stap process_page_faults.stp -x <pid>

global fault_entry_time

probe begin {
  printf("systemtap has started\n")
}

probe kernel.function("handle_mm_fault").call!, kernel.function("__handle_mm_fault").call {
  pid = pid()
  if (pid == target())
  {
    id = tid()
    t = gettimeofday_us()
    fault_entry_time[id] = t
  }
}

probe kernel.function("handle_mm_fault").return!, kernel.function("__handle_mm_fault").return {
  pid = pid()
  if (pid == target()) {
    id = tid()
    t = gettimeofday_us()
    if (!(id in fault_entry_time)) next
    start_time = fault_entry_time[id]
    e = t - start_time
    if (e > 1000)
    {
      printf("%d:%d:%d:%d:%d:%d\n",
             start_time, t, id, pid, e, returnval())
    }
    #free up memory
    delete fault_entry_time[id]
  }
}

I also wrote a few wrapper scripts to detect the presence of the market data service and start and stop the systemtap script at the right time.

We start to get the following output:

server not there; waiting a bit
2016-10-14 12:32:52.331562 : 200 OK
mds lbm thread not there yet, waiting a bit
2016-10-14 12:33:07.343710 : 200 OK
2016-10-14 12:33:07.344100 : found lbm tid: 133881
2016-10-14 12:33:07.429192 : found mds pid: 133210
2016-10-14 12:33:07.429236 : ['python', 'kill_stap_on_no_mds.py']
2016-10-14 12:33:07.452395 : ['stap', '-x', '133210', 'process_page_faults.stp']

1475682078014325:1475682078038965:133881:133210:24640:1028 
1475682078039052:1475682078079234:133881:133210:40182:1028
1475682078079293:1475682078119595:133881:133210:40302:1028
1475682078119649:1475682078135669:133881:133210:16020:1028
1475682078135731:1475682078157507:133881:133210:21776:1028
1475682078157571:1475682078163527:133881:133210:5956:1028

That’s almost 150 milliseconds of wall clock time spent servicing major page faults on the LBM thread. (The fifth column is the time spent in microseconds.)

If we take that first microsecond timestamp and poke it into epochconverter.com, we get

1475682078014325
Assuming that this timestamp is in microseconds (1/1,000,000 second):
GMT: Wed, 05 Oct 2016 15:41:18 GMT

Splitting that timestamp into seconds (1475682078) and microseconds (014325), we see that we’re only 14 milliseconds into 15:41:18.
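The same split is easy to reproduce without a website; a quick sketch of the arithmetic in C:

/* ts_split.c - split a microsecond epoch timestamp into seconds and
 * microseconds, as done by hand above. */
#include <inttypes.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    uint64_t micros_since_epoch = 1475682078014325ull; /* first stap column */
    time_t seconds = (time_t)(micros_since_epoch / 1000000);
    unsigned micros = (unsigned)(micros_since_epoch % 1000000);

    char buf[64];
    struct tm utc;
    gmtime_r(&seconds, &utc);
    strftime(buf, sizeof buf, "%a, %d %b %Y %H:%M:%S GMT", &utc);

    /* prints: Wed, 05 Oct 2016 15:41:18 GMT + 014325us */
    printf("%s + %06uus\n", buf, micros);
    return 0;
}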

Let’s take a look at the OrderBookEvents UDP socket depth at that time:

Buffer full correlates

This is excellent news. Now, can we work out what is triggering the fault?

Well, outside of application start/stop time, these are the only faults of significant duration the LBM thread sees. This means we can add further debugging information without worrying too much about data volume, so let’s print a userspace stack trace for each recorded fault using print_ubacktrace.

We go through several cycles fixing warnings of the form

WARNING: missing unwind data for module, rerun with 'stap -d some_module.so'

until we end up with a slightly longer command line

['stap', '-x', '133210', '-d', '/lib64/ld-2.12.so', '-d', '/opt/lbm/lib/liblbm.so.7.0.1', '-d', '/opt/lbm/lib/liblbmj.so.5.0.0', 'process_page_faults.stp']

and stack traces that look a bit like this (only the first is included for space sanity purposes):

 0x30e0e0dfe0 : _dl_fixup+0x2e/0x1c0 [/lib64/ld-2.12.so]
 0x30e0e148f5 : _dl_runtime_resolve+0x35/0x70 [/lib64/ld-2.12.so]
 0x7efeba91b7a3 : lbm_rate_ctlr_handle_timer+0x9e3/0xbe0 [/opt/lbm/lib/liblbm.so.7.0.1]
 0x7efeba879c82 : lbm_timer_expire+0x372/0x490 [/opt/lbm/lib/liblbm.so.7.0.1]
 0x7efeba895eb5 : lbm_context_process_events+0x2d5/0x7f0 [/opt/lbm/lib/liblbm.so.7.0.1]
 0x7efebae54ca1 : Java_com_latencybusters_lbm_LBMContext_lbmContextProcessEvents+0x41/0x80 [/opt/lbm/lib/liblbmj.so.5.0.0]
 0x31d292ad

At this point, I am solidly out of my depth – the symbols in ld-2.12.so are opaque to me. I don’t have the source for LBM to work out what’s going on in those frames, either. I enlist help from our systems team, and send an email to Informatica with some questions. We’ll pick up from there in the next post.


  1. We’ve gone as far as to write monitoring that detects when more than a single thread runs on an isolated core. This has caught several bugs where we’ve forgotten to correctly pin/affinitise a new thread in several applications, but it didn’t find anything wrong here.