<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <generator uri="http://jekyllrb.com" version="3.10.0">Jekyll</generator>
  
  
  <link href="https://dasl.cc/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://dasl.cc/" rel="alternate" type="text/html" />
  <updated>2026-04-01T21:57:57+00:00</updated>
  <id>https://dasl.cc/</id>

  
    <title type="html">dasl.cc</title>
  

  
  

  

  
  
    <entry>
      
      <title type="html">Debugging Our New Linux Kernel</title>
      
      
      <link href="https://dasl.cc/2025/01/01/debugging-our-new-linux-kernel/" rel="alternate" type="text/html" title="Debugging Our New Linux Kernel" />
      
      <published>2025-01-01T00:00:00+00:00</published>
      <updated>2025-01-01T00:00:00+00:00</updated>
      <id>https://dasl.cc/2025/01/01/debugging-our-new-linux-kernel</id>
      <content type="html" xml:base="https://dasl.cc/2025/01/01/debugging-our-new-linux-kernel/">&lt;p&gt;Read on to learn how we used network packet captures and BPF to debug web server performance, ultimately uncovering a Linux kernel performance issue. This investigation was a collaboration between myself and my colleagues.&lt;/p&gt;

&lt;p&gt;See the &lt;a href=&quot;https://news.ycombinator.com/item?id=43046174&quot;&gt;discussion of this post&lt;/a&gt; on Hacker News.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#background&quot; id=&quot;markdown-toc-background&quot;&gt;Background&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#what-are-listen-overflows&quot; id=&quot;markdown-toc-what-are-listen-overflows&quot;&gt;What are listen overflows?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#whats-causing-listen-overflows&quot; id=&quot;markdown-toc-whats-causing-listen-overflows&quot;&gt;What’s causing listen overflows?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#why-were-web-requests-served-slowly-in-the-first-few-minutes-after-new-hosts-were-pooled&quot; id=&quot;markdown-toc-why-were-web-requests-served-slowly-in-the-first-few-minutes-after-new-hosts-were-pooled&quot;&gt;Why were web requests served slowly in the first few minutes after new hosts were pooled?&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#is-it-the-network-no&quot; id=&quot;markdown-toc-is-it-the-network-no&quot;&gt;Is it the network? (no)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#is-it-due-to-elevated-system-cpu-yes&quot; id=&quot;markdown-toc-is-it-due-to-elevated-system-cpu-yes&quot;&gt;Is it due to elevated system CPU? (yes)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#why-did-we-initially-suspect-the-network&quot; id=&quot;markdown-toc-why-did-we-initially-suspect-the-network&quot;&gt;Why did we initially suspect the network?&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#whats-causing-elevated-system-cpu&quot; id=&quot;markdown-toc-whats-causing-elevated-system-cpu&quot;&gt;What’s causing elevated system CPU?&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#is-it-writeback-no&quot; id=&quot;markdown-toc-is-it-writeback-no&quot;&gt;Is it writeback? (no)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#is-it-inode-cgroup-switching-yes&quot; id=&quot;markdown-toc-is-it-inode-cgroup-switching-yes&quot;&gt;Is it inode cgroup switching? (yes)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-fix&quot; id=&quot;markdown-toc-the-fix&quot;&gt;The fix&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#followup-questions&quot; id=&quot;markdown-toc-followup-questions&quot;&gt;Followup questions&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#do-we-have-a-minimal-reproduction-script&quot; id=&quot;markdown-toc-do-we-have-a-minimal-reproduction-script&quot;&gt;Do we have a minimal reproduction script?&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#why-wasnt-centos-affected&quot; id=&quot;markdown-toc-why-wasnt-centos-affected&quot;&gt;Why wasn’t CentOS affected?&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#why-couldnt-we-reproduce-when-running-rsync-manually&quot; id=&quot;markdown-toc-why-couldnt-we-reproduce-when-running-rsync-manually&quot;&gt;Why couldn’t we reproduce when running rsync manually?&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#is-this-a-hypervisor-or-a-kernel-performance-issue&quot; id=&quot;markdown-toc-is-this-a-hypervisor-or-a-kernel-performance-issue&quot;&gt;Is this a hypervisor or a kernel performance issue?&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;background&quot;&gt;Background&lt;/h1&gt;

&lt;p&gt;We’ve been upgrading the operating system from CentOS to Ubuntu on hosts across our fleet. Our CentOS hosts run an outdated Linux kernel version (3.10), whereas our Ubuntu hosts run a more modern kernel version (6.8). In August 2024, we began rolling out the Ubuntu upgrade across our Apache web servers. When we migrated larger portions of our fleet to Ubuntu, we began seeing elevated listen overflow errors. This elevated error rate prompted us to roll back the Ubuntu upgrade while we debugged:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i1.png&quot; alt=&quot;graph of web server listen overflows&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;what-are-listen-overflows&quot;&gt;What are listen overflows?&lt;/h1&gt;

&lt;p&gt;Apache listens on a socket for incoming web requests. When incoming requests arrive more quickly than Apache can serve them, the queue of requests waiting to be served grows longer. This queue is capped to a configurable size. When the queue overflows its maximum size, we have a &lt;em&gt;listen overflow&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Listen overflows are a symptom of one or both of two things: the rate of incoming web requests is too high, or Apache is serving web requests too slowly.&lt;/p&gt;

&lt;p&gt;Each listen overflow that occurs means we failed to serve a web request. This can result in user-facing errors. Furthermore, if the listen overflows are a symptom of web requests being served slowly, it means users may be experiencing slow page loads.&lt;/p&gt;

&lt;h1 id=&quot;whats-causing-listen-overflows&quot;&gt;What’s causing listen overflows?&lt;/h1&gt;

&lt;p&gt;Listen overflows occurred a few minutes after a newly autoscaled web server was pooled. They did not tend to recur subsequently. Furthermore, web requests had elevated latency during this same time period:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i2.png&quot; alt=&quot;web server listen overflow and web request latency graphs&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We believed listen overflows were occurring because web requests served in the first few minutes after the host was pooled were being served unusually slowly.&lt;/p&gt;

&lt;h1 id=&quot;why-were-web-requests-served-slowly-in-the-first-few-minutes-after-new-hosts-were-pooled&quot;&gt;Why were web requests served slowly in the first few minutes after new hosts were pooled?&lt;/h1&gt;

&lt;h2 id=&quot;is-it-the-network-no&quot;&gt;Is it the network? (no)&lt;/h2&gt;

&lt;p&gt;Log lines, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strace&lt;/code&gt; timing information, and &lt;a href=&quot;https://github.com/adsr/phpspy&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phpspy&lt;/code&gt;&lt;/a&gt; flame graphs showed that network operations were executing particularly slowly. But investigating further, we found evidence that the network was performing normally. Instead, PHP had seemingly stalled for over a second. The below log line indicates that a memcached SET command was slow, but network packet captures on both the client and the server that we analyzed in Wireshark indicate that the SET command experienced normal network latency. The client waited over 1 second before sending the subsequent GET command, as if our PHP script stalled after the packet was received but before we recorded the elapsed time.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[Tue Sep 24 21:20:54 2024] [info] Memcached operation exceeded 20ms :operation=&quot;set&quot; &lt;b&gt;latency=&quot;1354.68292236&quot;&lt;/b&gt; key=&quot;warmup_key_7746_2&quot;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i3.png&quot; alt=&quot;wireshark packet capture&quot; /&gt;
&lt;i&gt;&lt;b&gt;Left:&lt;/b&gt; client side (web server) packet capture. &lt;b&gt;Right:&lt;/b&gt; server side (memcached) packet capture&lt;/i&gt;&lt;br /&gt;
&lt;i&gt;See a &lt;a href=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i3.png&quot;&gt;bigger version of this image&lt;/a&gt;&lt;/i&gt;&lt;/p&gt;

&lt;h2 id=&quot;is-it-due-to-elevated-system-cpu-yes&quot;&gt;Is it due to elevated system CPU? (yes)&lt;/h2&gt;

&lt;p&gt;Adding to the evidence pointing away from network problems, we saw a large spike in system CPU usage about four minutes after newly autoscaled hosts were booted. If we waited until after this spike in system CPU to pool the hosts, we experienced no listen overflows. This spike in system CPU occurred only on Ubuntu hosts – CentOS did not have this problem:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i4.png&quot; alt=&quot;system CPU graph&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This made us realize that the problems we were seeing were unrelated to pooling the hosts – the spike in system CPU occurred regardless of whether the hosts were pooled. During the spike in system CPU, we saw logs in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dmesg&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;workqueue: inode_switch_wbs_work_fn hogged CPU for &amp;gt;10000us 4 times, consider switching to WQ_UNBOUND&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;htop&lt;/code&gt; showed kernel workers were using lots of CPU in a function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inode_switch_wbs&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i5.png&quot; alt=&quot;htop screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We made a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;perf&lt;/code&gt; recording demonstrating that when system CPU spikes, the kernel is busy inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inode_switch_wbs_work_fn&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/perf.png&quot; alt=&quot;perf recording flamegraph&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;why-did-we-initially-suspect-the-network&quot;&gt;Why did we initially suspect the network?&lt;/h2&gt;

&lt;p&gt;Our logs and profiling tools showed that network operations were executing particularly slowly. However, when the CPU is busy, network operations may appear to slow down disproportionately. A blocking network call typically involves &lt;a href=&quot;https://en.wikipedia.org/wiki/Context_switch&quot;&gt;context switches&lt;/a&gt;: the process is taken off the CPU while it waits for a response. When the CPU is busy, a process that was blocked on the network may then spend a long time waiting in the CPU &lt;a href=&quot;https://en.wikipedia.org/wiki/Run_queue&quot;&gt;run queue&lt;/a&gt; after the response arrives, before it gets scheduled again. Network operations that appear slow at the user space level may therefore be a symptom of CPU busyness.&lt;/p&gt;

&lt;p&gt;Although there can be cases where this is not true, it has been my experience that when the network is the cause of slowness, CPU usage on client hosts is often lower than normal. When the client is blocked waiting for the network, it is often more idle. In retrospect, perhaps the fact that elevated CPU was one of the symptoms we were seeing should have pointed us away from network issues.&lt;/p&gt;

&lt;h1 id=&quot;whats-causing-elevated-system-cpu&quot;&gt;What’s causing elevated system CPU?&lt;/h1&gt;

&lt;h2 id=&quot;is-it-writeback-no&quot;&gt;Is it writeback? (no)&lt;/h2&gt;

&lt;p&gt;We believed something in the kernel function &lt;a href=&quot;https://github.com/torvalds/linux/blob/906bd684e4b1e517dd424a354744c5b0aebef8af/fs/fs-writeback.c#L490&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inode_switch_wbs_work_fn&lt;/code&gt;&lt;/a&gt; was causing elevated system CPU. This function is in a file &lt;a href=&quot;https://github.com/torvalds/linux/blob/master/fs/fs-writeback.c&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fs-writeback.c&lt;/code&gt;&lt;/a&gt;, which contains functions related to the &lt;a href=&quot;https://github.com/firmianay/Life-long-Learner/blob/master/linux-kernel-development/chapter-16.md&quot;&gt;writeback functionality of the Linux page cache&lt;/a&gt;. We knew that system CPU was elevated about four minutes after new hosts were booted. One of the first things a new host does is download the latest code – this is part of our host bootstrapping process and involves writing thousands of files to disk. We wondered if the process of flushing the dirty pages to disk was causing the elevated system CPU. While we did not see elevated disk write metrics during the system CPU spike, we decided to test this theory. We added a &lt;a href=&quot;https://www.man7.org/linux/man-pages/man1/sync.1.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sync&lt;/code&gt;&lt;/a&gt; call after the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; command that downloads the code on new hosts. In theory, that should synchronously write the dirty pages in the page cache to disk. Perhaps by controlling when the page cache was flushed to disk, we could control when the spike in system CPU occurred and ensure that it occurred before we pooled the host. This attempt, however, was unsuccessful. 
We saw no spike in system CPU when calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sync&lt;/code&gt;. Furthermore, we still saw the spike in system CPU a minute or two later.&lt;/p&gt;

&lt;p&gt;We were back to the drawing board. As we mentioned above, one of the first things a new host does is download the latest code. This process is called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt;, and it runs as a &lt;a href=&quot;https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html&quot;&gt;systemd oneshot&lt;/a&gt; service. We found something perplexing: if we prevented the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; service from running, we never saw the spike in system CPU. And if we subsequently ran the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; service manually, we would see the spike in system CPU a couple minutes later. This implies that something in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; was the cause of the issues. However, if we ran the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; command in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; manually in our interactive shell, we saw no spikes in system CPU. This apparent contradiction led us on a wild goose chase of trying to determine if some systemd service that was dependent on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; was the cause of the elevated system CPU, or if there was some subtle difference in the way we were running the commands in our shell versus how systemd was running the commands.&lt;/p&gt;

&lt;h2 id=&quot;is-it-inode-cgroup-switching-yes&quot;&gt;Is it inode cgroup switching? (yes)&lt;/h2&gt;

&lt;p&gt;Each of the thousands of files that is written by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; service has a corresponding &lt;a href=&quot;https://www.kernel.org/doc/html/latest/filesystems/ext4/inodes.html&quot;&gt;index node&lt;/a&gt;, also known as an inode, that the kernel uses to store file metadata. &lt;a href=&quot;https://www.kernel.org/doc/html/v5.0/admin-guide/cgroup-v2.html#control-group-v2&quot;&gt;Control groups&lt;/a&gt;, also known as cgroups, are a Linux feature that allows setting per-process limits on system resources. For example, cgroups allow us to prevent a given process from consuming too much memory, disk I/O, network bandwidth, etc. Every process belongs to a cgroup. Cgroups form a tree-like hierarchical structure. Processes in a given cgroup are subject to limits set both by the cgroup to which they belong and by that cgroup’s ancestors.&lt;/p&gt;

&lt;p&gt;In the context of cgroups, page cache writeback is tracked at the inode level. A given inode is assigned to whichever cgroup contains the process that is responsible for the majority of writes to the inode’s file. If a new process starts writing a lot to a file, the file’s inode may switch to the new process’s cgroup. Likewise, if a process managed by systemd is terminated, systemd will &lt;a href=&quot;https://systemd.io/CGROUP_DELEGATION/#controller-support&quot;&gt;remove the process’s cgroup&lt;/a&gt;, at which point any inodes assigned to the process’s cgroup will be moved to the parent cgroup.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i6.png&quot; alt=&quot;cgroup diagram&quot; /&gt;
&lt;em&gt;Initially, inodes are assigned to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt;’s cgroup&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i7.png&quot; alt=&quot;cgroup diagram&quot; /&gt;
&lt;em&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; finishes. Systemd removes its cgroup, and inodes are moved to the parent cgroup&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/torvalds/linux/blob/906bd684e4b1e517dd424a354744c5b0aebef8af/fs/fs-writeback.c#L490&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inode_switch_wbs_work_fn&lt;/code&gt;&lt;/a&gt; that we believed was causing the elevated system CPU is responsible for moving an inode from one cgroup to another in the context of writeback. We got more insight by running this &lt;a href=&quot;https://github.com/bpftrace/bpftrace&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bpftrace&lt;/code&gt;&lt;/a&gt; command on a newly booted host:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% sudo bpftrace -e &apos;
tracepoint:writeback:inode_switch_wbs {
  printf(
    &quot;[%s] inode is switching! inode: %d old cgroup: %d new cgroup: %d\n&quot;,
    strftime(&quot;%H:%M:%S&quot;, nsecs),
    args-&amp;gt;ino,
    args-&amp;gt;old_cgroup_ino,
    args-&amp;gt;new_cgroup_ino
  );
}&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;During the spike in system CPU, we saw thousands of these lines printed out by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bpftrace&lt;/code&gt; command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[20:49:30] inode is switching! inode: 3730800 old cgroup: 22438 new cgroup: 88
[20:49:30] inode is switching! inode: 3730799 old cgroup: 22438 new cgroup: 88
...

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each line corresponds to a file written by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; that was switching from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt;’s dying cgroup to the parent cgroup. The old cgroup identifier (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;22438&lt;/code&gt;) corresponds to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt;’s cgroup. The new cgroup identifier (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;88&lt;/code&gt;) corresponds to the parent cgroup.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bpftrace&lt;/code&gt; command prints out data from a &lt;a href=&quot;https://github.com/torvalds/linux/blob/906bd684e4b1e517dd424a354744c5b0aebef8af/fs/fs-writeback.c#L415&quot;&gt;kernel tracepoint&lt;/a&gt; in the Linux kernel’s &lt;a href=&quot;https://github.com/torvalds/linux/blob/906bd684e4b1e517dd424a354744c5b0aebef8af/include/trace/events/writeback.h#L216-L243&quot;&gt;writeback code&lt;/a&gt;. The fields available to print in this tracepoint can be viewed via:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% sudo cat /sys/kernel/debug/tracing/events/writeback/inode_switch_wbs/format
name: inode_switch_wbs
ID: 886
format:
    field:unsigned short common_type;   offset:0;   size:2; signed:0;
    field:unsigned char common_flags;   offset:2;   size:1; signed:0;
    field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
    field:int common_pid;   offset:4;   size:4; signed:1;

    field:char name[32];    offset:8;   size:32;    signed:0;
    field:ino_t ino;    offset:40;  size:8; signed:0;
    field:ino_t old_cgroup_ino; offset:48;  size:8; signed:0;
    field:ino_t new_cgroup_ino; offset:56;  size:8; signed:0;

print fmt: &quot;bdi %s: ino=%lu old_cgroup_ino=%lu new_cgroup_ino=%lu&quot;, REC-&amp;gt;name, (unsigned long)REC-&amp;gt;ino, (unsigned long)REC-&amp;gt;old_cgroup_ino, (unsigned long)REC-&amp;gt;new_cgroup_ino
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We found that when we added a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sleep 3600&lt;/code&gt; to the end of the script that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; executes, we could delay the spike in system CPU by one hour. Because systemd only removes a service’s cgroup when its process exits, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sleep&lt;/code&gt; delayed when inodes switched from one cgroup to another.&lt;/p&gt;

&lt;h1 id=&quot;the-fix&quot;&gt;The fix&lt;/h1&gt;

&lt;p&gt;We found a systemd directive that allows us to turn off certain cgroup accounting features: &lt;a href=&quot;https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html#Control%20Group%20Management&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DisableControllers&lt;/code&gt;&lt;/a&gt;. If either the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory&lt;/code&gt; controller is disabled, the kernel &lt;a href=&quot;https://github.com/torvalds/linux/blob/c291c9cfd76a8fb92ef3d66567e507009236ce90/include/linux/backing-dev.h#L172&quot;&gt;will not&lt;/a&gt; perform cgroup writeback or any of the related accounting and cgroup switching. We found that by creating a systemd &lt;a href=&quot;https://gist.github.com/dasl-/87b849625846aed17f1e4841b04ecc84#file-dasl-slice-L5&quot;&gt;slice with those controllers disabled&lt;/a&gt; and configuring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt;’s unit file to &lt;a href=&quot;https://gist.github.com/dasl-/06ced03d4b905fd79d8d58283ecaf67d#file-setupwebroot-service-L7&quot;&gt;use that slice&lt;/a&gt;, we could no longer reproduce the elevated system CPU. We had solved our performance issue.&lt;/p&gt;
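
&lt;p&gt;As a rough sketch of what this configuration can look like (the linked gists are the authoritative versions; the file paths and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Description&lt;/code&gt; text below are assumptions for illustration), the slice disables the two controllers and the service opts into the slice. The block shows excerpts of two separate unit files:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# /etc/systemd/system/dasl.slice (assumed path)
[Unit]
Description=Slice without io and memory cgroup controllers

[Slice]
# Disable both controllers, as described above; with them disabled
# the kernel performs no cgroup writeback accounting or inode switching.
DisableControllers=io memory

# /etc/systemd/system/setupwebroot.service (assumed path; excerpt only)
[Service]
Type=oneshot
# Run the service inside the slice defined above.
Slice=dasl.slice
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After editing unit files like these, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemctl daemon-reload&lt;/code&gt; makes systemd pick up the changes.&lt;/p&gt;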

&lt;p&gt;No more system CPU spike and no more listen overflows:
&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i8.png&quot; alt=&quot;system CPU graph&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/2025-01-01-debugging-our-new-linux-kernel/i9.png&quot; alt=&quot;listen overflows graph&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;followup-questions&quot;&gt;Followup questions&lt;/h1&gt;

&lt;h2 id=&quot;do-we-have-a-minimal-reproduction-script&quot;&gt;Do we have a minimal reproduction script?&lt;/h2&gt;

&lt;p&gt;We came up with a minimal reproduction of the issue:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$ sudo mkdir -p /var/random-files &amp;amp;&amp;amp; sudo systemd-run --property=Type=oneshot bash -c &apos;dd if=/dev/urandom bs=1024 count=400000 | split -a 16 -b 1k - /var/random-files/file.&apos;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command creates 400,000 files, each consisting of 1,024 random bytes. The files have names like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var/random-files/file.aaaaaaaaaaaaaaaa&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var/random-files/file.aaaaaaaaaaaaaaab&lt;/code&gt;. This command is run as a systemd oneshot service. Anywhere from 30 seconds to 3 minutes after this command finishes, we see a spike in system CPU. Viewing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;htop&lt;/code&gt; confirms this (press &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shift + k&lt;/code&gt; to show kernel threads in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;htop&lt;/code&gt;) – we see kernel workers using lots of CPU in the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inode_switch_wbs&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;why-wasnt-centos-affected&quot;&gt;Why wasn’t CentOS affected?&lt;/h2&gt;

&lt;p&gt;The initial release of &lt;a href=&quot;https://man7.org/linux/man-pages/man7/cgroups.7.html&quot;&gt;cgroups&lt;/a&gt;, known as cgroups v1, was in kernel version 2.6.24. Cgroups v1 has since been replaced by a new implementation: cgroups v2. Cgroups v2 was officially released in kernel version 4.5. Our old CentOS operating system used kernel version 3.10. We believe this inode switching CPU issue is related to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory&lt;/code&gt; controllers introduced in cgroups v2. Thus CentOS, which uses cgroups v1, is not vulnerable to this issue.&lt;/p&gt;

&lt;h2 id=&quot;why-couldnt-we-reproduce-when-running-rsync-manually&quot;&gt;Why couldn’t we reproduce when running rsync manually?&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;#is-it-writeback-no&quot;&gt;Recall&lt;/a&gt; that when we ran the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; command from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setupwebroot&lt;/code&gt; manually in our interactive shell, we saw no spike in system CPU. It turns out that each interactive ssh session you have open creates its own cgroup. Below is the output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-cgls&lt;/code&gt; on a web server on which I have two interactive ssh sessions open. One session is running a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sleep 100&lt;/code&gt; command, and the other session is running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-cgls&lt;/code&gt;. The two cgroups are called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session-17450.scope&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session-17455.scope&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;
% sudo systemd-cgls
Control group /:
-.slice
├─user.slice
│ └─user-10101.slice
│   ├─user@10101.service …
│   │ └─init.scope
│   │   ├─1710746 /lib/systemd/systemd --user
│   │   └─1710793 (sd-pam)
│   ├─&lt;b&gt;session-17450.scope&lt;/b&gt;
│   │ ├─1708943 sshd: dleibovic [priv]
│   │ ├─1711073 sshd: dleibovic@pts/0
│   │ ├─1711171 -zsh
│   │ └─1716022 sleep 100
│   └─&lt;b&gt;session-17455.scope&lt;/b&gt;
│     ├─1780667 sshd: dleibovic [priv]
│     ├─1781414 sshd: dleibovic@pts/1
│     ├─1781577 -zsh
│     └─1791367 systemd-cgls
...
&lt;/pre&gt;

&lt;p&gt;These &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session-*.scope&lt;/code&gt; cgroups stick around until you terminate your ssh session. After terminating your ssh session, systemd removes the corresponding cgroup. With this insight, we tested terminating the interactive ssh session after manually running the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; commands. Sure enough, about 2 minutes after we terminated the ssh session, we saw the big spikes in system CPU caused by inode cgroup switching.&lt;/p&gt;

&lt;h2 id=&quot;is-this-a-hypervisor-or-a-kernel-performance-issue&quot;&gt;Is this a hypervisor or a kernel performance issue?&lt;/h2&gt;

&lt;p&gt;We suspected that this performance issue was caused by either the hypervisor or the kernel. We shared our findings with Canonical, the company behind Ubuntu. Canonical confirmed that it is a kernel issue that was likely introduced by a Linux kernel commit from 2021. More details are available in the &lt;a href=&quot;https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.5/+bug/2038492&quot;&gt;public bug report&lt;/a&gt;, in which I have commented. We are hopeful that Canonical will engage with the Linux kernel developers and eventually fix this performance issue.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>David Leibovic</name>
          
          
        </author>
      

      

      

      
        <summary type="html">Read on to learn how we used network packet captures and BPF to debug web server performance, ultimately uncovering a Linux kernel performance issue. This investigation was a collaboration between myself and my colleagues.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Setting custom timestamps for prometheus metrics</title>
      
      
      <link href="https://dasl.cc/2024/07/07/setting-custom-timestamps-for-prometheus-metrics/" rel="alternate" type="text/html" title="Setting custom timestamps for prometheus metrics" />
      
      <published>2024-07-07T00:00:00+00:00</published>
      <updated>2024-07-07T00:00:00+00:00</updated>
      <id>https://dasl.cc/2024/07/07/setting-custom-timestamps-for-prometheus-metrics</id>
      <content type="html" xml:base="https://dasl.cc/2024/07/07/setting-custom-timestamps-for-prometheus-metrics/">&lt;h1 id=&quot;tldr&quot;&gt;TLDR&lt;/h1&gt;

&lt;p&gt;To associate a custom timestamp with a Prometheus metric:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Write a &lt;a href=&quot;https://prometheus.github.io/client_python/collector/custom/&quot;&gt;custom collector&lt;/a&gt; - you can’t use the built-in &lt;a href=&quot;https://prometheus.github.io/client_python/instrumenting/gauge/&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Gauge&lt;/code&gt;&lt;/a&gt; class in the Python client.&lt;/li&gt;
  &lt;li&gt;Write a &lt;a href=&quot;https://prometheus.io/docs/instrumenting/writing_exporters/&quot;&gt;custom exporter&lt;/a&gt;. This is a web server that exposes the timestamped metrics in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.prom&lt;/code&gt; file such that Prometheus can scrape them. The built-in &lt;a href=&quot;https://github.com/prometheus/node_exporter?tab=readme-ov-file#textfile-collector&quot;&gt;textfile collector&lt;/a&gt; does not support timestamped metrics.&lt;/li&gt;
  &lt;li&gt;Update your Prometheus config scrape targets with the address of the new exporter.&lt;/li&gt;
  &lt;li&gt;Update your Prometheus config to set &lt;a href=&quot;https://prometheus.io/docs/prometheus/2.53/configuration/configuration/#tsdb&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out_of_order_time_window&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#tldr&quot; id=&quot;markdown-toc-tldr&quot;&gt;TLDR&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#background&quot; id=&quot;markdown-toc-background&quot;&gt;Background&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#custom-collectors&quot; id=&quot;markdown-toc-custom-collectors&quot;&gt;Custom collectors&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#custom-exporters&quot; id=&quot;markdown-toc-custom-exporters&quot;&gt;Custom exporters&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#updating-prometheus-config-scrape-targets&quot; id=&quot;markdown-toc-updating-prometheus-config-scrape-targets&quot;&gt;Updating Prometheus config scrape targets&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#updating-prometheus-config-out_of_order_time_window&quot; id=&quot;markdown-toc-updating-prometheus-config-out_of_order_time_window&quot;&gt;Updating Prometheus config: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out_of_order_time_window&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot; id=&quot;markdown-toc-conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;background&quot;&gt;Background&lt;/h1&gt;

&lt;p&gt;By default, the timestamp associated with a Prometheus metric is the timestamp at which the metric was scraped by Prometheus. But sometimes, one wants to associate a custom timestamp with a data point. Prometheus is very opinionated and does not make this easy, but it is possible.&lt;/p&gt;

&lt;p&gt;My use case for setting custom timestamps was monitoring local air quality. The New York City government has &lt;a href=&quot;https://a816-dohbesp.nyc.gov/IndicatorPublic/data-features/realtime-air-quality/&quot;&gt;an API&lt;/a&gt; where the city’s air quality is reported. While the API is “realtime”, in practice the data is delayed by 2 to 4 hours. The data returned from the API is formatted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;timestamp&amp;gt;,&amp;lt;air_quality&amp;gt;&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;2024-07-07T10:00:00,5.59
2024-07-07T11:00:00,6.35
2024-07-07T12:00:00,6.75
2024-07-07T13:00:00,6.59
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I want Prometheus to associate each data point with the timestamp of the air quality reading rather than the timestamp at which Prometheus scraped the metric, which may be 2 to 4 hours later.&lt;/p&gt;
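&lt;p&gt;As a minimal sketch of the first step, the lines above can be parsed into a value and a Unix timestamp. The function names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parse_reading&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;latest_reading&lt;/code&gt; are hypothetical, and the time zone of the API’s timestamps is an assumption: it is passed in as a parameter and defaults to UTC here only for illustration.&lt;/p&gt;

```python
from datetime import datetime, timezone


def parse_reading(line, tz=timezone.utc):
    """Parse one 'timestamp,air_quality' line into (value, epoch_seconds).

    Assumption: the naive timestamp is interpreted in tz. UTC is only a
    placeholder default; pass the zone that the data actually uses.
    """
    raw_timestamp, raw_value = line.strip().split(",")
    reading_time = datetime.fromisoformat(raw_timestamp).replace(tzinfo=tz)
    return float(raw_value), reading_time.timestamp()


def latest_reading(csv_text, tz=timezone.utc):
    """Return the (value, epoch_seconds) pair with the newest timestamp."""
    readings = [
        parse_reading(line, tz) for line in csv_text.splitlines() if line.strip()
    ]
    return max(readings, key=lambda reading: reading[1])
```

&lt;p&gt;The result is in seconds since the Unix epoch; multiply by 1000 wherever milliseconds are needed.&lt;/p&gt;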

&lt;h1 id=&quot;custom-collectors&quot;&gt;Custom collectors&lt;/h1&gt;

&lt;p&gt;I’ll be working with Prometheus’s &lt;a href=&quot;https://github.com/prometheus/client_python&quot;&gt;Python client&lt;/a&gt; in these examples. I was using a &lt;a href=&quot;https://prometheus.github.io/client_python/instrumenting/gauge/&quot;&gt;gauge&lt;/a&gt; to record the air quality data. Unfortunately, the gauge’s &lt;a href=&quot;https://github.com/prometheus/client_python/blob/09a5ae30602a7a81f6174dae4ba08b93ee7feed2/prometheus_client/metrics.py#L432&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set&lt;/code&gt;&lt;/a&gt; method exposes no parameter for a custom timestamp. I discovered on &lt;a href=&quot;https://github.com/prometheus/client_python/issues/588#issuecomment-1054724554&quot;&gt;GitHub&lt;/a&gt; that we can accomplish our goal with a &lt;a href=&quot;https://prometheus.github.io/client_python/collector/custom/&quot;&gt;custom collector&lt;/a&gt;. The &lt;a href=&quot;https://github.com/prometheus/client_python/blob/09a5ae30602a7a81f6174dae4ba08b93ee7feed2/prometheus_client/metrics_core.py#L172&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_metric&lt;/code&gt;&lt;/a&gt; method exposes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp&lt;/code&gt; as an optional parameter. In the exposition format that Prometheus scrapes, the timestamp is a &lt;a href=&quot;https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information&quot;&gt;Unix epoch timestamp in milliseconds&lt;/a&gt;. Note that the Python client takes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp&lt;/code&gt; argument in seconds and converts it to milliseconds when it renders the output.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;prometheus_client&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;prometheus_client.core&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GaugeMetricFamily&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CustomTimestampedGaugeCollector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prometheus_client&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;registry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;gauge_with_custom_timestamp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GaugeMetricFamily&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&apos;my_metric_name&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;my_metric_description.&apos;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;gauge_with_custom_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_metric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gauge_with_custom_timestamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For reference, here’s &lt;a href=&quot;https://github.com/dasl-/pitools/blob/e144fc6e9e92bc908eb30556e3575c4f520f0fa5/sensors/measure_city_data&quot;&gt;my air quality monitoring code&lt;/a&gt; before I started associating timestamps with the data. I was &lt;a href=&quot;https://github.com/dasl-/pitools/blob/e144fc6e9e92bc908eb30556e3575c4f520f0fa5/sensors/measure_city_data#L20&quot;&gt;using&lt;/a&gt; the gauge’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set&lt;/code&gt; method and was unable to pass a custom timestamp. This code would write a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.prom&lt;/code&gt; file that looked like:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# HELP city_pm25 Concentration of PM 2.5 in local city. Units: μg/m^3.
# TYPE city_pm25 gauge
city_pm25 6.51
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And here’s &lt;a href=&quot;https://github.com/dasl-/pitools/blob/873ff1eac2dd2bbf3e7521c42a3bf2460cb0b6ad/sensors/measure_city_data&quot;&gt;my code&lt;/a&gt; after I started associating timestamps with the data via a &lt;a href=&quot;https://github.com/dasl-/pitools/blob/873ff1eac2dd2bbf3e7521c42a3bf2460cb0b6ad/sensors/measure_city_data#L66&quot;&gt;custom collector&lt;/a&gt;. This code writes a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.prom&lt;/code&gt; file that looks like:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# HELP city_pm25 Concentration of PM 2.5 in local city. Units: μg/m^3.
# TYPE city_pm25 gauge
city_pm25 6.51 1720375200000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
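&lt;p&gt;To make the format of that last line concrete, here is a rough sketch that renders a timestamped gauge sample by hand using only the standard library. This is not how the Python client is implemented; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;render_timestamped_gauge&lt;/code&gt; is a hypothetical helper, and it assumes the timestamp arrives in seconds and that the help text needs no escaping.&lt;/p&gt;

```python
def render_timestamped_gauge(name, help_text, value, epoch_seconds):
    """Render one gauge sample with an explicit timestamp in the text format.

    Assumptions: epoch_seconds is in seconds, and help_text contains no
    backslashes or newlines (those would need escaping in the real format).
    """
    # The text exposition format carries an integer number of milliseconds.
    epoch_millis = int(float(epoch_seconds) * 1000)
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value} {epoch_millis}\n"
    )
```

&lt;p&gt;Called with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;6.51&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1720375200&lt;/code&gt;, the sample line ends in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1720375200000&lt;/code&gt;, matching the file above.&lt;/p&gt;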

&lt;h1 id=&quot;custom-exporters&quot;&gt;Custom exporters&lt;/h1&gt;

&lt;p&gt;Problem solved, right? Wrong. I had been using the node_exporter’s &lt;a href=&quot;https://github.com/prometheus/node_exporter?tab=readme-ov-file#textfile-collector&quot;&gt;textfile collector&lt;/a&gt; to export my data. But the textfile collector does not support custom timestamps:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: Timestamps are not supported.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You’ll see something like this in logs if you try to use the textfile collector with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.prom&lt;/code&gt; file that has timestamps:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;NODE_EXPORTER[669]: ts=2024-07-06T11:16:03.618Z caller=textfile.go:227 level=error collector=textfile msg=&quot;failed to collect textfile data&quot; file=city_data.prom err=&quot;textfile \&quot;/tmp/city_data.prom\&quot; contains unsupported client-side timestamps, skipping entire file&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the &lt;a href=&quot;https://github.com/prometheus/node_exporter/blob/4cc1c177d05e80176f26fe1ca2a1f193c03c67a0/collector/textfile.go#L299-L301&quot;&gt;node_exporter code&lt;/a&gt; that explicitly disallows timestamps.&lt;/p&gt;

&lt;p&gt;I discovered on the &lt;a href=&quot;https://groups.google.com/g/prometheus-users/c/Zg1EJltYwp0/m/XrZFtClRBQAJ&quot;&gt;Prometheus mailing list&lt;/a&gt; that a &lt;a href=&quot;https://prometheus.io/docs/instrumenting/writing_exporters/&quot;&gt;custom exporter&lt;/a&gt; can solve the problem. It’s easy enough to write an exporter: the basic idea is to expose the timestamped metrics in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.prom&lt;/code&gt; file via a web server so that Prometheus can scrape them. I wrote a custom exporter in Python, using Python’s built-in &lt;a href=&quot;https://docs.python.org/3/library/http.server.html#http.server.ThreadingHTTPServer&quot;&gt;HTTP server&lt;/a&gt;. Here’s &lt;a href=&quot;https://github.com/dasl-/pitools/blob/873ff1eac2dd2bbf3e7521c42a3bf2460cb0b6ad/observability/timestamped_exporter&quot;&gt;the code&lt;/a&gt;; it exposes the metrics at &lt;a href=&quot;https://github.com/dasl-/pitools/blob/873ff1eac2dd2bbf3e7521c42a3bf2460cb0b6ad/observability/timestamped_exporter#L79&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://&amp;lt;hostname&amp;gt;:9101/metrics&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The bulk of the work happens in the &lt;a href=&quot;https://github.com/dasl-/pitools/blob/873ff1eac2dd2bbf3e7521c42a3bf2460cb0b6ad/observability/timestamped_exporter#L20C9-L20C15&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;do_GET&lt;/code&gt;&lt;/a&gt; method. Here’s a simplified version of the code:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;do_GET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;&apos;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;prom_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/prom_dir&apos;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# read all *.prom files in the directory and concatenate their contents
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;listdir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prom_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endswith&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;.prom&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prom_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;r&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# serve the data that was read.
&lt;/span&gt;    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;send_response&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;send_header&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Content-Type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;text/plain; charset=utf-8&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end_headers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;utf-8&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
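&lt;p&gt;For readers who want something they can run as-is, here is a self-contained sketch along the same lines. It is not the linked exporter: the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_prom_files&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TimestampedMetricsHandler&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make_server&lt;/code&gt; are hypothetical, and serving only the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/metrics&lt;/code&gt; path is an assumption of this sketch.&lt;/p&gt;

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def read_prom_files(prom_dir):
    """Concatenate the contents of every *.prom file in prom_dir."""
    parts = []
    for name in sorted(os.listdir(prom_dir)):
        if name.endswith(".prom"):
            path = os.path.join(prom_dir, name)
            # A context manager closes each file promptly after reading.
            with open(path, "r", encoding="utf-8") as prom_file:
                parts.append(prom_file.read())
    return "".join(parts)


class TimestampedMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Assumption of this sketch: only /metrics is served.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = read_prom_files(self.server.prom_dir).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(prom_dir, port=9101):
    """Build (but do not start) a server exposing prom_dir on the given port."""
    server = ThreadingHTTPServer(("", port), TimestampedMetricsHandler)
    # Stored on the server so the handler can reach it via self.server.
    server.prom_dir = prom_dir
    return server
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make_server(&apos;/prom_dir&apos;).serve_forever()&lt;/code&gt; would start serving on port 9101. Keeping the directory on the server object, rather than hardcoding it in the handler, makes it easy to point the exporter at a temporary directory while testing.&lt;/p&gt;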

&lt;h1 id=&quot;updating-prometheus-config-scrape-targets&quot;&gt;Updating Prometheus config scrape targets&lt;/h1&gt;

&lt;p&gt;After creating the custom exporter and exposing the timestamped metrics on port 9101, we need to update the scrape targets in Prometheus config with the address of the new exporter. Here’s what &lt;a href=&quot;https://gist.github.com/dasl-/c208118d01d4d6b5e826e5fdabaf583d#file-prometheus-yaml-L18&quot;&gt;my config&lt;/a&gt; looked like.&lt;/p&gt;

&lt;h1 id=&quot;updating-prometheus-config-out_of_order_time_window&quot;&gt;Updating Prometheus config: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out_of_order_time_window&lt;/code&gt;&lt;/h1&gt;

&lt;p&gt;Problem solved, right? Wrong again. Checking your Prometheus logs, you may see errors like this:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PROMETHEUS[1733468]: ts=2024-07-06T13:12:40.583Z caller=scrape.go:1729 level=warn component=&quot;scrape manager&quot; scrape_pool=node target=http://study:9101/metrics msg=&quot;Error on ingesting samples that are too old or are too far into the future&quot; num_dropped=1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I believe this error occurs when the timestamps of the metrics you are attempting to ingest are more than approximately 1 hour old. Prometheus has an experimental feature that solves this problem: &lt;a href=&quot;https://prometheus.io/docs/prometheus/2.53/configuration/configuration/#tsdb&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out_of_order_time_window&lt;/code&gt;&lt;/a&gt;. See also this &lt;a href=&quot;https://promlabs.com/blog/2022/10/05/whats-new-in-prometheus-2-39/#experimental-out-of-order-ingestion&quot;&gt;blog post&lt;/a&gt; announcing the feature. Since the air quality data I need to ingest is delayed by at most 4 hours, any value greater than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4h&lt;/code&gt; would be sufficient. I decided to use a value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1d&lt;/code&gt; to be conservative. Here’s what &lt;a href=&quot;https://gist.github.com/dasl-/eb28a9f692fdc76dc2c8033b40557616#file-prometheus-yaml-L44&quot;&gt;my config&lt;/a&gt; looked like.&lt;/p&gt;

&lt;p&gt;I can now see the NYC air quality data in Grafana:
&lt;img src=&quot;/assets/posts/2024-07-07-prometheus-timestamps/grafana.png&quot; alt=&quot;grafana panel showing air quality data&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;This was a lot harder than I anticipated. The out-of-the-box features couldn’t accomplish our goal, so we had to write a custom collector and a custom exporter instead. Prometheus is quite opinionated, and the maintainers seem to think that there are few valid use cases for setting custom timestamps on metrics.&lt;/p&gt;</content>

      
      
      
      
      

      
        <author>
            <name>David Leibovic</name>
          
          
        </author>
      

      

      

      
        <summary type="html">TLDR</summary>
      

      
      
    </entry>
  
  
</feed>
