Many times we need software that can load a system and check its stability under adverse conditions. This can be a burn-in test, or it can be load generation intended to make borderline faulty hardware start acting up. This allows one to isolate system crashes to either hardware or software issues, for example. In my experience the latter has been a common scenario: system crash logs look like hardware issues, but diagnostic tools, including the vendor-provided ones, come out clean, and then the system crashes again soon after being put back into production use. Eventually, after some back-and-forth investigation and trial and error, potentially faulty components are replaced until things are stable again, or the box itself is replaced.
One of the choices here is to run something that loads the system hard, causing hidden faults to surface faster than they otherwise would. When it comes to stress testing tools there are a whole bunch of choices, but most of them focus on one piece at a time. Most commonly it starts with the CPU, then RAM and of course disk I/O. However, I have yet to come across something that comprehensively loads the entire system. By entire system I mean CPU, RAM, disk and network together. In addition, by loading the CPU I mean loading virtually every component inside the CPU: FPU, SSE, AVX, fetch, decode and so on. Just running a single computation, for example Prime95, may heat up the CPU and/or RAM modules, but it exercises only a few components within. The key here is to stress test everything in parallel.
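To make the idea of loading several subsystems at once a little more concrete, here is a minimal Python sketch. It is not code from the project; the function names (`cpu_worker`, `ram_worker`, `disk_worker`, `run_parallel`) and the buffer sizes are made up for illustration. It only covers CPU, RAM and a scratch file on disk, leaves the network out, and interpreted Python cannot target specific CPU units such as SSE or AVX the way native code could.

```python
import math
import multiprocessing as mp
import os
import tempfile
import time


def cpu_worker(seconds):
    """Mix floating-point and integer arithmetic until the deadline passes."""
    deadline = time.monotonic() + seconds
    iterations = 0
    x = 1.0001
    n = 1
    while time.monotonic() < deadline:
        x = math.sqrt(x * x + 1.0) % 1000.0  # floating-point work
        n = (n * 6364136223846793005 + 1442695040888963407) & 0xFFFFFFFFFFFFFFFF  # 64-bit integer work
        iterations += 1
    return ("cpu", iterations)


def ram_worker(seconds, size_bytes=8 * 1024 * 1024):
    """Repeatedly fill a buffer with a byte pattern and verify every byte."""
    deadline = time.monotonic() + seconds
    passes = 0
    while time.monotonic() < deadline:
        pattern = (passes % 255) + 1
        buf = bytearray([pattern]) * size_bytes
        if buf.count(pattern) != size_bytes:
            raise RuntimeError("memory pattern mismatch")
        passes += 1
    return ("ram", passes)


def disk_worker(seconds, chunk_bytes=1024 * 1024):
    """Write random data to a scratch temp file, sync it, read it back and compare."""
    deadline = time.monotonic() + seconds
    passes = 0
    data = os.urandom(chunk_bytes)
    with tempfile.TemporaryFile() as f:
        while time.monotonic() < deadline:
            f.seek(0)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            f.seek(0)
            if f.read(chunk_bytes) != data:
                raise RuntimeError("disk readback mismatch")
            passes += 1
    return ("disk", passes)


def run_parallel(seconds=1.0):
    """Run one worker per subsystem in separate processes for the same duration."""
    jobs = [(cpu_worker, (seconds,)), (ram_worker, (seconds,)), (disk_worker, (seconds,))]
    with mp.Pool(processes=len(jobs)) as pool:
        pending = [pool.apply_async(fn, args) for fn, args in jobs]
        return dict(p.get() for p in pending)
```

Saved as a script, this could be driven by calling `run_parallel(60)` under an `if __name__ == "__main__":` guard. A real burn-in would want one CPU worker per core and much larger memory buffers than this sketch uses.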
Eventually all this should raise the system’s ambient temperature by a few degrees, even when it is located inside a chilled datacenter and even when the server’s fans are spinning at a higher RPM. Once we have stressed the box we can then look at diagnostic logs like the IML (HP Integrated Management Log) and run diagnostic tools that will hopefully have a better chance of picking up something odd.
I have worked on something like this at work, where we have successfully used it on several occasions for troubleshooting faults, evaluating new server models and commissioning new datacenter field layouts. I have now started an open-source project along the same lines but more comprehensive: https://github.com/moinakg/systemroller
At the moment this is a work in progress, and one will only find a few items in that GitHub repo, mostly dealing with creating a mini Fedora live image, which is a core part of the system. The objectives for this system are listed below.
- Parallel stress testing of CPU, RAM, Disk, and Network together or a chosen subset on Linux. Of course the core test framework should lend itself to being ported to other platforms like BSD or Illumos.
- Attempt to load virtually every sub-component.
- Non-destructive disk tests.
- Network interface Card testing that will not flood the network with packets or frames.
- Post-test verification and diagnostics scan.
- Self-contained live-bootable environment to allow scheduling tests via PXE boot for example.
- Ability to pass parameters via PXE/DHCP options.
- Live environment should allow restricted root access that notably omits filesystem utilities like mount but allows reading from the block device. In addition the restricted shell should provide only a small subset of Linux utilities to prevent backdoors. This will allow systems engineers to do diagnostics etc. while providing no ability to access production data on the disk filesystems.
- An HTTP-based graphical console to remotely access the live environment and look at logs, run tests, do diagnostics etc.
- The live bootable image should be as small as feasible and should be able to load itself entirely in RAM and boot and run off a ramdisk.
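As an illustration of the non-destructive disk test objective, the following Python sketch reads a device or file through a read-only descriptor and checksums it, then repeats the read to see whether both passes agree. This is my own sketch under stated assumptions, not code from the repo: `read_only_digest` and `scan_twice` are hypothetical names, and the choice of SHA-256 is arbitrary. Note that the second pass may well be served from the page cache, so a serious test would need to drop caches or use direct I/O.

```python
import hashlib
import os


def read_only_digest(path, chunk_bytes=1024 * 1024, max_bytes=None):
    """Read `path` sequentially without writing; return (bytes_read, sha256 hex digest)."""
    digest = hashlib.sha256()
    total = 0
    fd = os.open(path, os.O_RDONLY)  # read-only descriptor: writes through it are rejected
    try:
        while max_bytes is None or total < max_bytes:
            want = chunk_bytes if max_bytes is None else min(chunk_bytes, max_bytes - total)
            chunk = os.read(fd, want)
            if not chunk:  # end of file or device
                break
            digest.update(chunk)
            total += len(chunk)
    finally:
        os.close(fd)
    return total, digest.hexdigest()


def scan_twice(path, **kwargs):
    """Read the same range twice; return (digests_match, bytes_read)."""
    first = read_only_digest(path, **kwargs)
    second = read_only_digest(path, **kwargs)
    return first == second, first[0]
```

On the live image the path would be a block device, for example `scan_twice("/dev/sda", max_bytes=1 << 30)`, where the device name is only a placeholder. Since nothing is mounted and nothing is written, the data on disk is left untouched.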
The GitHub project repo currently provides a Fedora kickstart file that goes to great effort to minimize the live bootable ISO image (approx. 139MB including EFI boot capability). The live environment boots and automatically logs in to a restricted root environment. One will require Fedora 18 and the Fedora livecd-creator to build it (see the README).