
Scaling fork(2) in the real world

Some details on a typical problem with fork(2) on Linux/Unix that I worked on a while back: Fork is The Way.

Most solutions to this are minimal, with just enough hackery to meet the narrow needs of the particular piece of software facing the issue. I tried to make the solution generic and widely applicable.

Boot Environments on Linux

I have been busy working on a bunch of exciting tech for the last few years, with very little time for anything else apart from family. The typical startup grind. However, better late than never, I found a bit of an opportunity to write up something I have been hacking on (along with a couple of others) that has been put out in open source.

We first saw the concept of Boot Environments with ZFS on Solaris. Because the filesystem supports snapshotting and cloning, it is possible to create another virtual copy of the entire root filesystem as a clone of the currently booted root. It is then possible to mount that clone, upgrade packages inside that prefix and finally reboot into the upgraded clone after adding a grub entry for it (a minimal sketch of this workflow follows the list below). There are several advantages to this approach.

  • The operation of cloning the filesystem is instantaneous due to the Copy-On-Write design. There is no need to perform a costly copy of the entire root filesystem image.
  • During the upgrade the box is up and running and in production use because the currently booted environment remains untouched. The only downtime is when we reboot.
  • Since the currently booted environment is untouched the upgrade is very safe. Any failures or breakage during upgrade would mean destroying the clone and starting again from the beginning, in the worst case.
  • If we discover issues after booting into the updated clone, we can easily reboot back to the previous working Boot Environment.
  • Because of the COW design the upgraded clone only occupies space for the packages that were actually upgraded. Unchanged package content is shared with the parent dataset of the clone, which makes the scheme very space-efficient.
  • It is possible to have multiple boot environments on the box without having to worry about partitioning.
  • Because of the pooled storage design all datasets or clones share the same pooled storage space avoiding hard partitioning and fixed space allocation headaches.
  • Multiple boot environments are *extremely* useful in a testing and development setup where multiple software releases can reside on the box. To select a certain release for testing or experimenting, just boot into the appropriate BE (Boot Environment). Messing around is also easy: create another BE from the current one, boot into it and mess to your heart’s content. The original BE remains safe, as long as you do not screw up the disk or the bootloader of course!
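For the curious, here is a minimal sketch of the ZFS workflow described above, as it looks on Solaris 11 with its beadm tool (the BE name is illustrative):

beadm create be1              # instantaneous COW clone of the active BE
beadm mount be1 /mnt          # mount the clone
pkg -R /mnt update            # upgrade packages inside that prefix
beadm activate be1            # point the boot menu at the new BE
init 6                        # reboot into the upgraded clone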

I can go on some more, but let’s draw a line here. Now, we wanted to get the same stuff on Linux. It is possible, given we have the nice beast called Btrfs. Till some time back Btrfs was controversial and criticised quite a bit. However, I noticed that it has been maturing of late. The number of serious issues has dropped to a negligible level, and many fixes and improvements have come in. All of the rants I found via Google were at least a couple of years old, and most were older than that. This gave us the confidence to start testing it and eventually use it in the products my employer sells. We did have to go through a learning curve getting to grips with the nitty-gritties and idiosyncrasies of Btrfs. That caused a few initial teething troubles and deployment issues, but it was manageable.

I looked around to see if the BE capability already existed on Linux but found none. I came across things like apt-snapshot or Snapper, which are similar but not quite the same. Our hard requirement was that an upgrade must not touch the running environment. So, in the end, we came up with our own scripts to implement the BE feature.

Since our Linux environment is based on Ubuntu, the root subvolume is ‘@’. We then create a writable snapshot of ‘@’ as our initial root subvolume and that becomes our first BE. Subsequent upgrades create writable snapshots of the currently booted subvolume. In addition, the ‘@’ subvolume is always mounted at /.rootbe in our environment and all the BE subvolumes are mounted under it, including the currently booted one which is, obviously, also mounted at ‘/’.
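In terms of raw Btrfs commands, the scheme boils down to something like this sketch (the BE names are illustrative; our scripts do the real bookkeeping):

btrfs subvolume snapshot /.rootbe /.rootbe/be0        # first BE: writable snapshot of '@'
btrfs subvolume snapshot /.rootbe/be0 /.rootbe/be1    # next BE: snapshot of the booted BE
btrfs subvolume list /.rootbe                         # lists all the BEs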

Btrfs has the concept of a default subvolume; however, we do not change that. Rather, we just use the ‘rootflags=subvol=…’ kernel parameter to select the BE to boot. This allows us to keep the primary grub menu in a single place, always accessible via /.rootbe/boot/grub/grub.cfg.
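To make that concrete, after booting into a BE the kernel command line shows the subvolume selection; a hedged illustration, where the UUID and BE name are placeholders rather than our actual configuration:

cat /proc/cmdline
BOOT_IMAGE=/be1/boot/vmlinuz root=UUID=1234-abcd ro rootflags=subvol=@/be1

Each BE gets its own such entry in the single grub.cfg under /.rootbe/boot/grub/.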

The entire show is managed via two shell scripts: one to do the BE management (create, delete, list etc.) called ‘beadm’, and one to upgrade the current environment by creating a new BE, called ‘pn-apt-get’. Both are available at this URL: https://github.com/PluribusNetworks/pluribus_linux_userland/tree/master/components/bootenv-tools

The ‘pn-apt-get’ script creates a new BE and runs ‘apt-get dist-upgrade’ inside a chroot. It assumes that ‘sources.list’ has been set up properly.
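Conceptually, the flow inside ‘pn-apt-get’ is something like the following sketch; the bind mounts and BE name are illustrative, and the real details are in the script at the URL above:

beadm create be2                       # writable snapshot of the booted BE
mount --bind /dev  /.rootbe/be2/dev    # prepare the chroot
mount --bind /proc /.rootbe/be2/proc
mount --bind /sys  /.rootbe/be2/sys
chroot /.rootbe/be2 apt-get update
chroot /.rootbe/be2 apt-get dist-upgrade
# finally add a grub entry for be2; the running environment is never touched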

In addition to all this, I wanted to optimize space usage by having dpkg not replace files in a package being upgraded if the new file is identical to the one already installed. I needed to add a small dpkg patch to achieve this: https://github.com/PluribusNetworks/pluribus_linux_userland/blob/master/components/dpkg-ubuntu/debian/patches/skip_unchanged.patch

All this allows us to do safe, fast, in-production upgrades and only incur downtime during the reboot. One quirk I had to deal with was BE space usage reporting. I had to turn on the ‘quota’ feature of Btrfs to get accurate space accounting, even though I do not use quotas in practice. This also meant that I hit a couple of obscure quota bugs (especially after subvolume deletes) in the 4.4 Ubuntu Xenial kernel release that we have been using (logistics issues). To work around them, I found it sufficient to do a periodic “quota rescan” every few hours.
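The workaround amounts to something like this (the mountpoint and schedule are illustrative):

btrfs quota enable /.rootbe    # once, to get accurate per-subvolume accounting
btrfs quota rescan /.rootbe    # run from cron every few hours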

The Funny KVM benchmarks

RedHat Summit 2013 concluded recently, and while browsing some of the presentation PDFs I came across something funny. In general the content is good and there is a bunch of interesting stuff available. However, this particular PDF ruffled me up: http://rhsummit.files.wordpress.com/2013/06/sarathy_t_1040_kvm_hypervisor_roadmap_and_overview.pdf

This presentation talks about KVM technology in general with a bunch of marketing content thrown in, which is all fine. However, fast forward to slide 12 and something looks odd. The slide seems to scream about KVM’s outstanding performance on SPECvirt_sc2010 as compared to ESXi5/4. Great, isn’t it? The “Eureka” feeling lasts till you look at the bottom of the graphs. Every comparison is done on dissimilar hardware! Suddenly Archimedes comes crashing to the floor.

Take for example the 2-socket, 16-core benchmarks. The HP DL385 G7 box is a Generation 7 piece with AMD Bulldozer processors while the DL380p Gen8 is a Generation 8 piece with Intel Sandy Bridge. RedHat is putting ESXi5 on older-generation hardware and KVM on the latest and greatest. If we consider the highest-bin processors, then the DL385 gets AMD Opteron 6220 3.0 GHz processors with 16MB cache while the DL380p gets Xeon E5-2690 2.9 GHz processors with 20MB cache. Even if the Opteron’s clock is marginally higher, a Bulldozer is simply no match for a big juicy Sandy Bridge beast. Second, the Bulldozers get HyperTransport links with 6.4 GT/s throughput while the Xeons get QPI with 8 GT/s throughput. The Gen7 box gets PCIe Gen 2.0 while the Gen8 box gets PCIe Gen 3.0. And so the story goes on and on. We have a no-contest here: the Gen8 box wins hands down even if one puts fewer VMs on the Gen7 box.

Let’s look at the 4-socket, 40-core comparison. First, the two boxes are from two different vendors. Second, they are comparing ESXi 4.1 with the latest KVM. Whatever happened to ESXi5 here? Does it not support that hardware? At least the processors on the two boxes, the IBM x3850 X5 and the DL580 G7, are comparable 10-core Xeon E7-4870 ones (considering the highest-bin 10-core processors). However, the older ESX version skews the game.

Similarly, the processors in the other comparisons are comparable, but the ESX version is the older one that everyone is migrating off. If I am going to do a comparison, I will install the latest ESX on a piece of hardware, measure, then install the latest KVM on the same hardware and measure again, not play games.

RedHat is nonchalantly tying one hand behind ESX’s back. Helpfully for the marketing fuzz types, we have this fine print at the bottom: “Comparison based on best performing Red Hat and VMware solutions by cpu core count published at http://www.spec.org”. That is, we are going by earlier measurements that our competitors published, so everyone chant after us: KVM is faster than ESX, KVM is faster than ESX, KVM is faster than ESX … ah well, let me grab that can of Diet Coke sitting nearby (or should it be salt, rather?).

Disclaimer

I am NOT a Linux or KVM hater. On the contrary, I use Linux Mint day in and day out and work with open source in general. However, above all I am a technologist and I like to take things as they really are, free of all the fuzz. Fuzz dilutes the value that various technologies bring to the table.

R.I.P. Atul Chitnis – End of a Chapter

I was very saddened this morning to hear the news of Atul Chitnis passing away. He had been battling cancer for a while and finally lost the fight. I am sure he will be at peace in the Happy Computing Community.

I have known of him from his early PCQuest days, and my awareness of Linux was primarily due to his PCQLinux distribution initiative. However, he will be remembered most for the FOSS.IN conference; without him FOSS.IN has lost a father figure. I have been visiting the conference since the time it was originally called Linux Bangalore, and his influence over the flocks gathering there was unmistakable. He did have his quirks and his share of disagreements with others in the Indian FOSS community, but his far-reaching contributions to the Indian FOSS scene overshadow everything else.

http://www.firstpost.com/tech/indian-tech-world-mourns-death-of-open-source-guru-atul-chitnis-837705.html

http://en.wikipedia.org/wiki/Atul_Chitnis


GCC Link time Optimizations need some Salt

I wanted to make use of the much-talked-about feature in Gcc 4.5+ called Link Time Optimization (LTO) in Pcompress. Link-time optimization promises to bring advanced optimizations across compilation units. For example, if you declare a function as “inline” in a source file and call it from other places in the same source file, then all those call sites will be replaced with the function body itself. However, if you call the same function from another source file that is compiled separately, the inlining will not happen across the files. Earlier Gcc support for this was called whole-program optimization, but it was cumbersome and in some cases impossible to use. LTO enables a repeat optimization and re-compilation pass during linking of the final executable, providing the ability to do such cross-file optimizations elegantly.
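Enabling it is just a matter of adding -flto to both the compile and link steps; a minimal sketch with illustrative file names:

gcc -O3 -flto -c lzma_enc.c -o lzma_enc.o    # object files carry GIMPLE bytecode
gcc -O3 -flto -c main.c -o main.o
gcc -O3 -flto lzma_enc.o main.o -o prog      # the optimizer runs again at link time,
                                             # allowing cross-file inlining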

I am using Fedora 16, which has Gcc 4.6.3 with support for LTO, so it was a simple matter of tweaking the makefile. However, after adding the LTO flag something did not feel right. The program felt a tad slower. When doing performance tweaks, “feel” is the last thing you want to depend upon, so I added simple timing measurements to various modules within Pcompress. They record the starting and ending monotonic wall clock times in milliseconds for processing a chunk of data and compute the throughput in terms of MBs per second. What I saw after that was quite strange.

LTO appeared to actually reduce throughput by as much as 60%! My laptop had the following specs: Core i5 430M 2.27GHz, 8GB RAM. I was using the native Gcc 4.6.3 build on Fedora. Suspecting something with the distro or the specific Gcc build, I looked at another laptop I had: an AMD Piledriver 1.9 GHz with 4GB RAM running Linux Mint 13. It also had Gcc 4.6.3 but, of course, a different Ubuntu-derived build. This produced the same result, a 50% performance drop with LTO enabled. The outputs are below.

Output of “Normal” build without LTO

time ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar
Scaling to 2 threads
Original size: 86528000, blknum: 12726
Number of maxlen blocks: 116
Total Hashtable bucket collisions: 4477
Merge count: 12320
Deduped size: 85195878, blknum: 562, delta_calls: 0, delta_fails: 0
Dedupe speed 107.758 MB/s
LZP: Insize: 85193594, Outsize: 35087799
LZP: Processed at 178.023 MB/s
DELTA2: srclen: 35087799, dstlen: 35086691
DELTA2: header overhead: 192
DELTA2: Processed at 247.827 MB/s
Chunk compression speed 1.513 MB/s

real    0m24.058s
user    0m40.485s
sys     0m0.448s

Output of build with LTO enabled

time ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar
Scaling to 2 threads
Original size: 86528000, blknum: 12726
Number of maxlen blocks: 116
Total Hashtable bucket collisions: 4477
Merge count: 12320
Deduped size: 85195878, blknum: 562, delta_calls: 0, delta_fails: 0
Dedupe speed 32.146 MB/s
LZP: Insize: 85193594, Outsize: 35087799
LZP: Processed at 67.585 MB/s
DELTA2: srclen: 35087799, dstlen: 35086691
DELTA2: header overhead: 192
DELTA2: Processed at 84.044 MB/s
Chunk compression speed 0.828 MB/s

real    0m45.138s
user    1m17.835s
sys     0m0.612s

The pcompress invocations above use LZMA compression in extreme mode (-l14) and also enable Deduplication, LZP and Delta2 encoding. The tarfile “w020n40.tar” is Global Topographic Data (a Digital Elevation Model) from USGS. I used that dataset since it contains a lot of embedded tables of repeating numeric data that Delta2 encoding and LZP (as you can see above, in this case) can detect and collapse.

The performance drop with LTO is drastic, to say the least. To rule out any mistake in my throughput computation I also included the “time” command, and the performance difference is clear in its output as well. Similar results were visible on the AMD laptop running Linux Mint 13. In addition to LTO I was passing the following flags to the compiler:

-m64 -msse3 -c  -O3  -ftree-vectorize -floop-interchange -floop-block

I then experimented with omitting those flags and reducing the optimization level to -O2, but to no avail. I even tried “-fno-inline-functions”, thinking excessive inlining might be causing cache overflows in loops. That too produced the same results. LTO kept churning out lower performance numbers regardless of what I did. Subsequently I tried Gcc 4.7.2, built from upstream sources, and repeated my experiments with exactly the same results as before! Something was broken.

So my next step was to ask for help from a gem of a utility on Linux called simply Perf. It can profile the system and individual apps and also report CPU performance counter metrics. The tool is similar to some OpenSolaris tools like cpustat, intrstat etc. Perf has a whole bunch of features, but I looked at two capabilities: I used “perf record” to collect profiling data from a pcompress run and “perf stat -d” to dump detailed performance counter statistics. The outputs of perf stat were interesting but provided little clue as to the root cause:
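For reference, the invocations were of this form:

perf stat -d ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar   # counter statistics
perf record ./pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar    # collect profile samples
perf report                                                           # browse the hot functions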

Output of “perf stat -d” for normal non-LTO build

Performance counter stats for './pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar':

   46724.903534 task-clock                #    1.682 CPUs utilized
         21,763 context-switches          #    0.466 K/sec
          1,800 CPU-migrations            #    0.039 K/sec
         49,634 page-faults               #    0.001 M/sec
 97,919,627,864 cycles                    #    2.096 GHz                     [40.01%]
 60,976,690,063 stalled-cycles-frontend   #   62.27% frontend cycles idle    [39.99%]
 42,558,445,705 stalled-cycles-backend    #   43.46% backend  cycles idle    [39.97%]
 83,132,473,549 instructions              #    0.85  insns per cycle
                                          #    0.73  stalled cycles per insn [49.93%]
 10,757,966,536 branches                  #  230.241 M/sec                   [49.95%]
    674,594,850 branch-misses             #    6.27% of all branches         [50.01%]
 20,128,203,574 L1-dcache-loads           #  430.781 M/sec                   [50.05%]
    819,620,127 L1-dcache-load-misses     #    4.07% of all L1-dcache hits   [50.02%]
    470,081,090 LLC-loads                 #   10.061 M/sec                   [40.06%]
    291,066,964 LLC-load-misses           #   61.92% of all LL-cache hits    [39.99%]

    27.781254809 seconds time elapsed

Output of “perf stat -d” for LTO enabled build

Performance counter stats for './pcompress -D -c lzmaMt -l14 -L -P -s110m w020n40.tar':

   80593.375461 task-clock                #    1.759 CPUs utilized
         26,507 context-switches          #    0.329 K/sec
            889 CPU-migrations            #    0.011 K/sec
          2,696 page-faults               #    0.033 K/sec
169,713,512,965 cycles                    #    2.106 GHz                     [40.02%]
116,359,748,762 stalled-cycles-frontend   #   68.56% frontend cycles idle    [40.02%]
 44,479,101,832 stalled-cycles-backend    #   26.21% backend  cycles idle    [40.04%]
147,677,082,105 instructions              #    0.87  insns per cycle
                                          #    0.79  stalled cycles per insn [50.04%]
 12,176,677,580 branches                  #  151.088 M/sec                   [50.05%]
    710,765,008 branch-misses             #    5.84% of all branches         [50.01%]
 73,456,165,490 L1-dcache-loads           #  911.442 M/sec                   [49.98%]
    576,238,614 L1-dcache-load-misses     #    0.78% of all L1-dcache hits   [50.00%]
    361,655,574 LLC-loads                 #    4.487 M/sec                   [39.94%]
    236,828,595 LLC-load-misses           #   65.48% of all LL-cache hits    [39.97%]

    45.811620815 seconds time elapsed

Note some of the differences of interest. As you can see, LLC or Last Level Cache (L3 cache) misses increased by 5% with LTO, while L1 data cache loads actually increased with a lower miss ratio. The frontend (primarily instruction fetch and decode) had more stalled cycles with LTO. While a stall can be due to many reasons, coupled with the higher LLC miss ratio it appears that more time was spent waiting to fetch data from memory.

Side note: you may be wondering why the LLC miss ratio is so high even in the normal non-LTO case. Does that point to severe inefficiencies within Pcompress? Not really; LZMA is the culprit here. LZMA’s memory access pattern is cache-unfriendly, a cost one has to bear to get its almost unbeatable compression ratio. For comparison, I have included the perf stat output for LZ4 compression in the Appendix below.

The next item on the agenda was to capture profile data via “perf record” and view it via “perf report”. Perf report lists the functions or modules within an executable along with the percentage of execution time each hogged, sorted in descending order. I am reproducing the top 6 items from each of the profiles below.

Perf report snippet for normal non-LTO case

Samples: 172K of event 'cycles', Event count (approx.): 88229375574                                                                                 
 50.31%  pcompress  pcompress              [.] GetMatchesSpec1
 32.22%  pcompress  pcompress              [.] LzmaEnc_CodeOneBlock
  4.02%  pcompress  pcompress              [.] MixMatches3
  2.61%  pcompress  pcompress              [.] GetHeads4b
  1.63%  pcompress  pcompress              [.] dedupe_compress
  1.45%  pcompress  pcompress              [.] MatchFinderMt_GetMatches

Perf report snippet for LTO build

Samples: 326K of event 'cycles', Event count (approx.): 169052159163                                                                                
 39.39%  pcompress  pcompress              [.] GetMatchesSpec1
 29.70%  pcompress  pcompress              [.] GetOptimum.4390
  4.91%  pcompress  pcompress              [.] LitEnc_GetPriceMatched.4145
  3.22%  pcompress  pcompress              [.] MixMatches3
  2.87%  pcompress  pcompress              [.] dedupe_compress
  2.72%  pcompress  pcompress              [.] GetHeads4b.5262

We can immediately notice a few differences. The LTO-enabled build lists a few new symbols that are not present in the non-LTO case. These are actually functions within the LZMA implementation; the numeric suffix on the function names is an LTO artifact. Why are these functions not showing up in the non-LTO build? Let’s check the different binaries using objdump.

Objdump output for normal non-LTO binary

objdump -t pcompress | grep GetOptimum 
[pcompress]$

Objdump output for LTO enabled binary

objdump -t pcompress | grep GetOptimum 
00000000004284c4 g     F .text	00000000000004d5        .hidden GetOptimumFast.4368
000000000042a351 g     F .text	0000000000002412        .hidden GetOptimum.4390

Objdump output for non-LTO DEBUG binary (No Optimizations)

objdump -t pcompress | grep GetOptimum 
000000000040fd05 l     F .text	0000000000002570        GetOptimum
0000000000412275 l     F .text	0000000000000515        GetOptimumFast

Something jumps out as obvious now. The GetOptimum function is not visible in the optimized non-LTO build but is visible in the DEBUG build. This is a static function in LzmaEnc.c which, though large, still gets inlined in the normal optimized build. LTO, however, is actually preventing some useful inlining from happening. I inspected the disassembly using “objdump -d” to verify that. In the LzmaEnc.c source file one of the call sequences is thus:

LzmaEnc_CodeOneMemBlock -> LzmaEnc_CodeOneBlock -> GetOptimum

In the normal non-LTO optimized build, LzmaEnc_CodeOneMemBlock() is a single large function with the other functions above inlined into it and further optimized. In the LTO build the call sequence is thus:

LzmaEnc_CodeOneMemBlock -> LzmaEnc_CodeOneBlock.4411 -> LzmaEnc_CodeOneBlock.part.8.4405 -> GetOptimum.4390

No wonder this causes a major performance drop: inlining in turn enables a series of other optimizations like value range propagation and code motion, all of which LTO forgoes here by preventing the inlining. So the lesson of the day is to take LTO with a spoonful of salt. Do not just enable it and be happy. Actually benchmark and verify whether you are benefiting from it or not. We should see some improvement in this space with Gcc 4.8 (see below).

Appendix

Output of perf stat -d for LZ4 compression method

Performance counter stats for './pcompress -D -c lz4 -l14 -L -P -s110m w020n40.tar':

    3943.008621 task-clock                #    0.996 CPUs utilized
            440 context-switches          #    0.112 K/sec
              9 CPU-migrations            #    0.002 K/sec
          2,179 page-faults               #    0.553 K/sec
  8,275,383,339 cycles                    #    2.099 GHz                     [39.97%]
  3,510,592,770 stalled-cycles-frontend   #   42.42% frontend cycles idle    [39.91%]
  2,314,266,052 stalled-cycles-backend    #   27.97% backend  cycles idle    [39.93%]
 10,907,435,475 instructions              #    1.32  insns per cycle
                                          #    0.32  stalled cycles per insn [49.95%]
  1,754,196,387 branches                  #  444.888 M/sec                   [49.93%]
     60,384,786 branch-misses             #    3.44% of all branches         [50.05%]
  2,588,170,552 L1-dcache-loads           #  656.395 M/sec                   [50.10%]
    196,213,428 L1-dcache-load-misses     #    7.58% of all L1-dcache hits   [50.17%]
     57,790,609 LLC-loads                 #   14.656 M/sec                   [40.14%]
      2,870,453 LLC-load-misses           #    4.97% of all LL-cache hits    [40.11%]

      3.957535121 seconds time elapsed

The interesting thing to note here is the LLC miss ratio. Due to its near-linear access pattern, LZ4 makes good use of the cache, which is one of the reasons why it is so fast. Of course, LZ4 cannot even think of matching LZMA in terms of compression ratio, but then that is not its intent either. In addition, it shows that LZP and my Dedupe and Delta2 encoding implementations are cache-efficient as well.