
Who hurt you?

That’s the recurring question every confirmed bachelor, Lothario, and playboy expects to hear around this Valentine’s Day Season. That time of year when couples exchange gifts of love, share romantic dinners, or whisk each other away on a mini-trip. Meanwhile, free-wheeling casual daters skillfully dodge the festivities, placating their multiple paramours with well-rehearsed scripts criticizing the holiday’s emptiness and commercial nature. Once that Celebration of Exclusivity passes, they come back out of hiding, ready to mingle! But why is that? Could it just be that some of us enjoy the excitement of sparking a romance with new and interesting people? Or might it stem from a fear of commitment due to some past heartbreak? “Who hurt you?” assumes the latter.

You and I have formed a pretty solid bond over these past couple of years, so I’ll come clean with you: I fear commitment. . . to the 1st CPU core. And not just core 0, either, which we all know usually catches the brunt of interrupts and typically gets singled out as a housekeeping core. I’m talkin’ about the 1st core on *any* CPU socket. And you’re right – it’s not due to some free-wheeling excitement gained from the willy-nilly scheduling of threads across cores. It’s because the 1st CPU core once hurt me deeply, and I haven’t looked at it the same way since. Here’s my story.

The 1st CPU Core and How We Met

My meet-cute with the 1st CPU core began around 2014. RHEL 7 had just been released, but very few in the HFT space were considering a migration from RHEL 6 and its 2.6.32 kernel until they could suss out all the latency gotchas in RHEL 7’s 3.10.0 kernel. It was against this backdrop that I worked with a startup Options Trading Group at a Chicago-based HFT firm.

I deployed all the optimal configurations you’d expect for a low-latency shop, many of which I’ve alluded to in past articles. Paramount among these settings was isolating all the cores (using isolcpus) on the second CPU socket (where all the odd-numbered cores reside), which was home to the PCIe lanes attached to the high-performance NIC. Free from the vagaries of the Linux kernel scheduler, these cores served as low-jitter locations on which to pin latency-critical application threads. We pinned one of the more important threads to core 1, the 1st core on that 2nd socket. And for a few weeks, it was a fairy-tale affair. High IPC, low jitter, low latency – the pairing seemed destined! But nothing lasts forever, does it?
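
For anyone who hasn’t set this up before, the recipe boils down to something like the sketch below. The core list assumes a 16-core box with the odd-numbered cores on the second socket, and the pinning line borrows the TradeApp name you’ll see in the traces later – treat both as illustrative, not the firm’s actual config:

# Appended to the kernel command line (e.g., via grub): hide the 2nd socket's
# odd-numbered cores from the general scheduler
isolcpus=1,3,5,7,9,11,13,15

# Pin a latency-critical thread to core 1, the 1st core on that socket
taskset -c 1 ./TradeApp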

Cracks Begin To Show

“Mark, we’re suddenly experiencing sporadic latency spikes of over 1ms on core 1!”

Weeks had gone by without a hitch on the Options Trading Group’s system. So how could this be? Now, for those of you who have never serviced customers in IT, I’ll let you in on a little secret – they *never* tell you the whole story. It’s always “No, we didn’t change anything – we were just moseying along and then BOOM!” But hey, it is what it is. So, I fired up the FTrace facility and began tracking scheduler events on that core:

echo sched_switch > /sys/kernel/debug/tracing/current_tracer
echo latency-format > /sys/kernel/debug/tracing/trace_options
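
If you’re following along at home, you can also narrow the trace to just the suspect core so the rest of the box doesn’t drown out the output. The mask below assumes core 1 (bit 1, i.e., 0x2):

# Optional: restrict tracing to core 1 only (hex CPU mask)
echo 2 > /sys/kernel/debug/tracing/tracing_cpumask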

Here’s some of the tracing output I found after enabling those FTrace options:

TradeApp-31594   1d.s.. 70251098us!:  31594:115:R   + [001]     6:120:R ksoftirqd
TradeApp-31594   1d.... 70252191us!:  31594:115:R ==> [001]     6:120:R ksoftirqd
ksoftirqd-6      1d.... 70252599us!:      6:120:S ==> [001] 31594:115:R TradeApp

<<< above represents ~400 microsecond scheduling latency >>>

TradeApp-31594   1d.s.. 85246192us!:  31594:115:R   + [001]     6:120:R ksoftirqd
TradeApp-31594   1d.... 85246652us!:  31594:115:R ==> [001]     6:120:R ksoftirqd
ksoftirqd-6      1d.... 85262686us!:      6:120:R ==> [001] 31594:115:R TradeApp

<<< above represents a whopping 16 millisecond scheduling latency >>>

Ksoftirqd was taking as much as 16 *milliseconds* of CPU time from the trading application thread! This kernel thread jumps into action when interrupt-context softirq processing is overwhelmed with work and thus needs to offload it to a schedulable, process-context entity. But what type of work overloaded the softirq so much that it had to enlist the help of ksoftirqd?

For this, I enabled the function_graph tracer in FTrace to track the flow of kernel functions called during ksoftirqd’s runtime. What did I find? A bunch of block request queue completion work (i.e., the cleanup of block layer structures created while handling I/O) from the megasas disk I/O controller.
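
If you want to reproduce that kind of hunt, the incantation looks roughly like this (PID 6 being core 1’s ksoftirqd thread, as seen in the trace above):

# Follow only the ksoftirqd thread for core 1 (PID 6 in the earlier trace)
echo 6 > /sys/kernel/debug/tracing/set_ftrace_pid
echo function_graph > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace_pipe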

But why? No one pinned the megasas driver IRQ to core 1. So why was all its associated background processing running there?

The 1st CPU Core and its Wandering Eye

Every now and then, you might miss a configuration step or two, even while using configuration management tools like Puppet or Ansible. So I verified my IRQ-assignment assumptions by checking /proc/interrupts, which confirmed my understanding. However, I *did* find a torrent of “Function Call Interrupts” (or CAL in /proc/interrupts parlance) hitting our beloved core 1:

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:         47          0          0          0          0          0          0          0   IO-APIC-edge      timer
  1:         25          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  8:          1          0          0          0          0          0          0          0   IO-APIC-edge      rtc0
  9:        528          0          0        287          0          0          0          0   IO-APIC-fasteoi   acpi
 12:        207          0          0          0          0          0          0          0   IO-APIC-edge      i8042
 16:        493          0        187          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, nouveau
 17:      29465          0          0          0          0          0          0          0   IO-APIC-fasteoi   ath9k, snd_hda_intel
 19:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   xhci_hcd:usb3
 23:       9352          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
 45:          0          0          0          0          0          0          0          0   PCI-MSI-edge      eth0
 46:     192342          0          0          0          0          0          0          0   PCI-MSI-edge      ahci
 47:         14          0          0          0          0          0          0          0   PCI-MSI-edge      mei
 48:        294          0          0          0          0          0          0          0   PCI-MSI-edge      snd_hda_intel
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:     552219    1010298    2272333    3179890    1445484    1226202    1800191    1894825   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0   Performance monitoring interrupts
IWI:          0          0          0          0          0          0          0          0   IRQ work interrupts
RTR:          7          0          0          0          0          0          0          0   APIC ICR read retries
RES:      18708       9550        771        528        129        170        151        139   Rescheduling interrupts
CAL:        711    9153934       1312       1261       1446       1411       1433       1432   Function call interrupts
TLB:       4493       6108       4981       5014       1788       2327       1967        914   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts

That’s over 3 orders of magnitude more CALs on core 1 than on any other core! Further debugging and source code reading (explained in the next section) revealed the culprit as the /sys/block/sda/queue/rq_affinity setting. From the kernel docs:

“If this option is ‘1’, the block layer will migrate request completions to the cpu “group” that originally submitted the request. For some workloads this provides a significant reduction in CPU cycles due to caching effects.”
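
That setting lives right next to the rest of the queue tunables in sysfs, so confirming what a given disk is doing takes all of one command (sda here is the device from this story – substitute your own):

# Inspect the block layer's completion-steering mode for this disk
cat /sys/block/sda/queue/rq_affinity    # '1' (the default) enables the group migration described above

# '0' would leave completions on whichever core happens to field the IRQ
echo 0 > /sys/block/sda/queue/rq_affinity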

And of course, it was at this point that I noticed another core on that same CPU socket performing a heavy stream of writes to the disk attached to the megasas driver. “Oh yeah, Mark, we pinned a new thread to core 3 on that socket to perform our data writes. Did we forget to mention that?” *sigh*

Just to make sure you all are following me, here’s a recap: Core 1 and I agreed that it would dedicate itself to our trading workload, but then weeks later some other I/O workload came struttin’ by and stole its attention away from me. Really? Just like that, huh? What ever happened to trust? To fidelity? To ’til decommission do us part’?!

This I/O-heavy workload on core 3 adversely impacted latency on core 1 due to rq_affinity interrupting it via CALs to perform block request completion cleanup (as we’ll learn in the next section). But why was it always core 1? Why not share it with other cores within that same LLC-sharing “group”? To answer that, we gotta read the source.

How rq_affinity *Used* to Work

The workhorse for rq_affinity is the __blk_complete_request() function shown below from a 2.6.32 kernel release:

void __blk_complete_request(struct request *req)
{
	struct request_queue *q = req->q;
	unsigned long flags;
	int ccpu, cpu, group_cpu;

	BUG_ON(!q->softirq_done_fn);

	local_irq_save(flags);
	cpu = smp_processor_id();
	group_cpu = blk_cpu_to_group(cpu);

	/*
	 * Select completion CPU
	 */
	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
		ccpu = req->cpu;
	else
		ccpu = cpu;

	if (ccpu == cpu || ccpu == group_cpu) {
		struct list_head *list;
do_local:
		list = &__get_cpu_var(blk_cpu_done);
		list_add_tail(&req->csd.list, list);

		/*
		 * if the list only contains our just added request,
		 * signal a raise of the softirq. If there are already
		 * entries there, someone already raised the irq but it
		 * hasn't run yet.
		 */
		if (list->next == &req->csd.list)
			raise_softirq_irqoff(BLOCK_SOFTIRQ);
	} else if (raise_blk_irq(ccpu, req))
		goto do_local;

	local_irq_restore(flags);
}

Here, the code grabs the core that is currently servicing the IRQ (the cpu variable), compares it to the completion core recorded in the block I/O request struct (the req->cpu variable), and checks whether rq_affinity is enabled (i.e., the QUEUE_FLAG_SAME_COMP flag). One detail from that “further debugging” worth spelling out: with rq_affinity enabled, req->cpu isn’t the raw ID of the core that submitted the I/O – it gets filled in at submission time with blk_cpu_to_group() of that core. If the resulting completion target is neither the current core nor the first core of the current core’s LLC-sharing group, the kernel interrupts it to process the block request completion using the raise_blk_irq() function:

/*
 * Setup and invoke a run of 'trigger_softirq' on the given cpu.
 */
static int raise_blk_irq(int cpu, struct request *rq)
{
	if (cpu_online(cpu)) {
		struct call_single_data *data = &rq->csd;

		data->func = trigger_softirq;
		data->info = rq;
		data->flags = 0;

		__smp_call_function_single(cpu, data, 0);
		return 0;
	}

	return 1;
}

Aha, so now we know the source of all those CALs we saw in /proc/interrupts – the __smp_call_function_single() function!

Ok, so this all seems reasonable now, right? We had cores 1 and 3 in the same “group”, with rq_affinity enabled, and the megasas IRQ pinned to a core which was NOT part of that group. Case closed, right? Wrong, because that still doesn’t explain why core 1 is *always* the victim among all the cores on that second CPU socket. Let’s take a look at the blk_cpu_to_group() function referenced in the code snippet above for more insight:

static inline int blk_cpu_to_group(int cpu)
{
#ifdef CONFIG_SCHED_MC
	const struct cpumask *mask = cpu_coregroup_mask(cpu);
	return cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
	return cpumask_first(topology_thread_cpumask(cpu));
#else
	return cpu;
#endif
}

Uh, what’s with that “_first” suffix to the cpumask_first() function? Makes me nervous. Hmm, maybe I’m being paranoid. Let’s check out its definition:

/**
 * cpumask_first - get the first cpu in a cpumask
 * @srcp: the cpumask pointer
 *
 * Returns >= nr_cpu_ids if no cpus set.
 */
static inline unsigned int cpumask_first(const struct cpumask *srcp)
{
	return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
}

NOOOO!!! It *always* picks the first core in the group, thus causing core 1’s milliseconds-long hiccups! You broke my heart, core 1. You broke my heart. I’ll never trust the 1st core on any CPU socket with a low-latency assignment ever again, especially not when there’s a cpumask_first() function defined in the kernel just waiting for someone to use in the future!

The 1st CPU Core Attempts To Rekindle Our Flame

Nowadays, rq_affinity offers an extra option that makes the core generating all the I/O clean up after itself. Again, from the kernel docs:

For storage configurations that need to maximize distribution of completion processing setting this option to ‘2’ forces the completion to run on the requesting cpu (bypassing the “group” aggregation logic).
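
Flipping a disk over to that behavior is a one-line sysfs write. Note that, like most sysfs tunables, it won’t survive a reboot unless you persist it via a udev rule, a tuned profile, or your config management of choice:

echo 2 > /sys/block/sda/queue/rq_affinity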

The updated kernel also removes any consideration of the “first core in the group”, as depicted in this snippet from the same __blk_complete_request() function in a more recent kernel release:

	/*
	 * Select completion CPU
	 */
	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && ccpu != -1) {
		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
			shared = cpus_share_cache(cpu, ccpu);
	} else
		ccpu = cpu;

	/*
	 * If current CPU and requested CPU share a cache, run the softirq on
	 * the current CPU. One might concern this is just like
	 * QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is
	 * running in interrupt handler, and currently I/O controller doesn't
	 * support multiple interrupts, so current CPU is unique actually. This
	 * avoids IPI sending from current CPU to the first CPU of a group.
	 */
	if (ccpu == cpu || shared) {
		struct list_head *list;

But how can I be sure? When you go through a betrayal of this magnitude, you’re not eager to jump right back into that relationship. Let’s set up a test for this new setting on Rocky 8.6 (kernel release 4.18.0-372.9.1) using the fio microbenchmark and the osnoise latency tracker.

Testing the Updated rq_affinity = 1

For this test, we have isolated the cores of the 2nd CPU socket (all odd-numbered cores). The target disk IRQ is assigned to core 7, our fio microbenchmark is pinned to core 3, and rq_affinity is set to ‘1’ (the default).
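
Condensed, the prep for this run looks roughly like the following. The IRQ number (47, for the ahci controller) comes from the trace output further below, while sda is my shorthand for whichever device backs /corefiles:

# Steer the disk controller's IRQ to core 7
echo 7 > /proc/irq/47/smp_affinity_list

# Make sure we're testing the default completion-steering mode
echo 1 > /sys/block/sda/queue/rq_affinity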

In a separate window, I’ve set up the osnoise tracer to track any disturbance lasting for longer than a millisecond against any core on the 2nd CPU socket *except* core 3, where our fio workload will run:

[root@eltoro ~]# cd /sys/kernel/tracing/
[root@eltoro tracing]# echo osnoise > current_tracer
[root@eltoro tracing]# echo osnoise > set_event
[root@eltoro tracing]# echo "1,5,7,9,11,13,15" > osnoise/cpus
[root@eltoro tracing]# echo 1000 > osnoise/stop_tracing_us

In another window, I fire up fio to perform random writes for 300 seconds against a 5GB file using one thread pinned to core 3:

[root@eltoro ~]# fio -name=blogpost -numjobs=1 -cpus_allowed=3 -filename=/corefiles/mytest.file -direct=0 -rw=randwrite -bs=4k -size=5G -iodepth=64 -ioengine=libaio -time_based=1 -runtime=300

Once I hit ‘ENTER’ on the fio command, I start the trace in the other window like so:

[root@eltoro tracing]# echo 1 > tracing_on
[root@eltoro tracing]# cat trace_pipe

After a few seconds, the trace automatically stops and displays this output as its final few lines:

       . . .
       . . .
       . . .
       osnoise/7-13353 [007] d.h. 37384.833988: irq_noise: ahci[0000:87:00.0]:47 start 37384.833985696 duration 2390 ns
       osnoise/7-13353 [007] ..s. 37384.840123: softirq_noise:    BLOCK:4 start 37384.833988294 duration 6134479 ns
       osnoise/7-13353 [007] .... 37384.840124: sample_threshold: start 37384.833985490 duration 6138539 ns interference 2
[root@eltoro tracing]#

I keep re-enabling the trace during the fio run, but it keeps stopping at lines showing disturbances only on core 7, from softirqs executing block-related code. This is great because it means the block work isn’t bugging core 1 like in older kernel releases! We get further proof from the preponderance of reported “BLOCK” softirqs – 256,724 in all, versus 0 on every other isolated core – landing on core 7 in /proc/softirqs:

[root@eltoro ~]# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
          HI:          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
       TIMER:    1311064       4732       4732       4733       4732       4731       4731       4732       4731       4731       4731       4731       4731       4731       4731       4731
      NET_TX:      14369          1          2          1          2          0          1          0          1          0          0          0          2          0          1          0
      NET_RX:     547910        525     151351       2080          3          0          3          0          3          2          4          1          4          1          3          0
       BLOCK:        106          0          0          0          0          0          0     256724          0          0          0          0          0          0          0          0
    IRQ_POLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
     TASKLET:      33514          0          0          0          0          0          0          0         19          0         64          0          0          0          0          0
       SCHED:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
     HRTIMER:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
         RCU:     999445       4754       4733       5599       4734       4765       4732       5147       4732       4823       4732       4767       4732       4816       4732       4751
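
(If you want to watch that imbalance build in real time while fio runs, polling the BLOCK row works nicely – a small convenience I’ll toss in here, not part of the original trace:)

watch -n1 "grep -E 'CPU|BLOCK' /proc/softirqs"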

Now let’s see what happens after a clean reboot with rq_affinity set to ‘2’.

Testing the Updated rq_affinity = 2

Everything is configured the exact same way as in the previous section except rq_affinity is now set to ‘2’. I fire up fio, then I commence tracing and begin reading from the pipe:

[root@eltoro tracing]# echo 1 > tracing_on
[root@eltoro tracing]# cat trace_pipe
       osnoise/5-2065  [005] d.h.   475.145177: irq_noise: local_timer:236 start 475.145176545 duration 584 ns
       osnoise/5-2065  [005] ..s.   475.145181: softirq_noise:    TIMER:1 start 475.145177244 duration 3489 ns
       osnoise/5-2065  [005] d.H.   475.145183: irq_noise: irq_work:246 start 475.145182922 duration 425 ns
       osnoise/5-2065  [005] ....   475.145184: sample_threshold: start 475.145176397 duration 8003 ns interference 3
       osnoise/7-2066  [007] d.h.   475.156856: irq_noise: ahci[0000:87:00.0]:73 start 475.156846144 duration 9333 ns
       osnoise/7-2066  [007] d.h.   475.156858: irq_noise: irq_work:246 start 475.156856920 duration 1280 ns
       osnoise/7-2066  [007] ....   475.156859: sample_threshold: start 475.156845776 duration 12710 ns interference 2
       osnoise/7-2066  [007] d.h.   475.160222: irq_noise: ahci[0000:87:00.0]:73 start 475.160219278 duration 2770 ns
       osnoise/7-2066  [007] d.h.   475.160225: irq_noise: irq_work:246 start 475.160223535 duration 1506 ns
       osnoise/7-2066  [007] ....   475.160225: sample_threshold: start 475.160218911 duration 6392 ns interference 2
      osnoise/15-2070  [015] dNh.   475.184177: irq_noise: reschedule:253 start 475.184176989 duration 170 ns
      osnoise/15-2070  [015] dNh.   475.184180: irq_noise: irq_work:246 start 475.184178651 duration 1389 ns
      . . .
      . . .
      . . .

It just goes on and on until I explicitly disable tracing. No softirq hiccups to be had on cores 1, 5, 7, 9, 11, 13, or 15. So, then, where did all that block completion work go? It shifted to the core responsible for all the disk I/O – core 3, where fio runs – which racked up 114,729 BLOCK softirqs according to /proc/softirqs:

[root@eltoro ~]# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
          HI:          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
       TIMER:     110220         88         88         89         88         88         88         89         88         88         87         87         87         87         87         87
      NET_TX:        658          1          2          1          0          0          1          0          1          0          2          0          2          0          1          0
      NET_RX:      47970         86       2752         88          2          0          1          0          2          2          3          1          2          1          1          0
       BLOCK:       6387          0          0     114729          0          0          0       4348          0          0          0          0          0          0          0          0
    IRQ_POLL:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
     TASKLET:       3140          0          0          0          0          0          0          0         19          0         64          0          0          0          0          0
       SCHED:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
     HRTIMER:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0
         RCU:      51021         98         90        707         89        118         89        214         89        132         88        118         88        119         88        119

It looks like the updated rq_affinity code works as advertised. So. . . does this mean the 1st CPU core can be trusted again?

What We Had Was Beautiful But. . .

You guys wanted to know why I can’t commit to the 1st CPU core. You wanted my answer to, “Who hurt you?” Well, now you know the history. True, the 1st CPU core has come a long way toward rebuilding the foundation of trust we once had. But for as long as the kernel defines a cpumask_first() function, I will always worry that some other Linux feature might employ it to force unrelated, latency-spiking workloads upon that core. For this reason, my fear of commitment to the 1st CPU core remains. And I will continue admonishing my clients to avoid the 1st CPU core for any latency-sensitive application threads.

Will I continue maintaining my wildly varied “confirmed bachelor” lifestyle, too? Well, that’s a topic for another day on another blog. 😉 Happy Valentine’s Day!
