Install trouble

Postby moredruid » Tue Aug 17, 2010 1:58 pm

This is the tale of a not quite smooth Amahi Express CD install... I'll probably use a lot of words, scroll to the tldr section (or search for it to go to it directly) if you just want to read the gist of it 8-)

After my server hardware arrived (HP ML110 G6; quad Xeon 2.6GHz, 4GB RAM) I naturally wanted to migrate my old Amahi install to this new server. But since I'd run it headless anyway I thought I'd use the Amahi ExpressCD to install it.

Well, after about a week of installing, reinstalling, patching, searching and troubleshooting, I figured it out. I'll just walk you through the steps I took.
I created a backup of my old server (only the data; the modifications I made to the system weren't that impressive and could be set up again in about 15 minutes or less).
I then put the old disks (2x750GB & 2x1TB) in my new rig and went on to install Amahi with the ExpressCD. Of course I chose the expert install, since I wanted to set up RAID & LVM myself. During the installation I noticed that the partitions weren't created as I expected: after I set up RAID (2x750GB in RAID 1 on /dev/md0 and 2x1TB in RAID 1 on /dev/md1, with PVs on the md devices and LVM on top) the installer hung while formatting my large data LV. I had accepted all the defaults except for the LV/partitioning. So I aborted the install (it wasn't going to continue anyway, and the hardware log was spewing more ATA resets and errors than you'd believe).
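For reference, the layout I was aiming for looks roughly like this from the command line (device names are assumptions for illustration; the installer does the equivalent through its UI):

```shell
# Two RAID 1 mirrors (sda/sdb = 750GB pair, sdc/sdd = 1TB pair - assumed names):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

# LVM physical volumes on top of the md devices, grouped into one VG:
pvcreate /dev/md0 /dev/md1
vgcreate vg_data /dev/md0 /dev/md1

# One big data LV - the installer hung while formatting the equivalent of this:
lvcreate -n lv_data -l 100%FREE vg_data
```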

Started again and chose just one 750GB disk with LVM, figuring I'd set up RAID on the 2x1TB later and use LVM mirroring with the other 750GB disk. At around 1.9% of the 1TB RAID format the process hung. Bleh. The creation of /dev/md0 itself had gone flawlessly (even though it didn't mirror properly, as I found out later). Again there were copious amounts of ATA resets and errors in dmesg. The LVM mirroring with the 750GB disks also failed at around 2.4%. More bleh. I ran a couple of smartctl tests and some of the disks weren't finishing their SMART self-tests anymore, even though they were perfectly fine before (I had run smartctl before shutting the old server down and moving the disks to the new box). Even more bleh. So wait: this machine is RHEL 5 certified, and I have some spare RHEL 5.4 discs lying around from my Red Hat Clustering course, so let's try those. Install works fine. Read/writes work fine. But lots of SMART errors. Yuck, 2 of my disks are trashed :o Even morer bleh.
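The SMART checks I mean are along these lines (device name assumed; needs root and the smartmontools package):

```shell
# Overall health verdict from the drive's own SMART data:
smartctl -H /dev/sda

# Kick off a short self-test, wait for it, then read the result log.
# A self-test that never reaches "Completed without error" is a bad sign.
smartctl -t short /dev/sda
sleep 120
smartctl -l selftest /dev/sda
```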

I also tested with a default Ubuntu Server CD, and I was very _very_ __VERY__ impressed. Booting with DNS, Samba, SSH and DHCP services installed & configured took around 8 seconds, counting from the moment the BIOS screen went away. Very lean install, very low memory footprint (around 150M right after boot). But that's an aside - though I am very tempted to set everything up myself, since I don't use a lot of the features Amahi gives me anymore.

Took the disks out, put them in my old box and ran the installer. Same issue: lots of resets and other errors. Well, that machine was obviously working fine before, so I've ruled out flaky new hardware. Maybe it was this newfangled greyhole thingy that was borking the format? chkconfig'ed it out, stopped the service, tested again: no dice.
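"chkconfig'ed it out" being, on Fedora's SysV tooling of the day, something like this (the service name is as Amahi installs it):

```shell
# Stop the running greyhole service and keep it from starting at boot:
service greyhole stop
chkconfig greyhole off

# Sanity check - should show "off" in every runlevel:
chkconfig --list greyhole
```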

Took 2 other disks (the 250GB disk that came with the new rig and an old 160GB disk I had lying around, which has SMART errors but works anyhow) and tried to set things up again. Only this time I meticulously thought about what was different between the old (F10) install and the new one, as well as the differences from the RHEL and Ubuntu Server installations. Only the greyhole service. Hmpf, not a likely candidate for trashing your disks this thoroughly, especially not when it's turned off as soon as possible. What then? Well, my old data LV was formatted with ext3 instead of ext4, and both RHEL and Ubuntu use ext3 - but that shouldn't be a problem now, should it?

Ha, let's test it. I kept the install as small as I could (10G RAID 1 for /, the rest for a big data LV, but without assigning a mount point yet) and went on testing: created an ext4 filesystem on the big LV... process hung! Reboot, remove LV, recreate LV, create an ext3 filesystem: process completed. Heh. Ran numerous read/write tests against the ext3 LV without issue. OK. Take the 750GB disk and the 1TB disk that didn't have SMART errors, put them into the new rig and take the others out. Reinstall, create a 750GB RAID 1 mirror with those 2 HDDs, install, test lots of reads/writes without issue. The RAID sync also completed OK. Great, now let's set the thing back up for production (remember: this system is used a lot for my GF's business - which in vacation time doesn't have that much going on, but still - and the server had been down for about a week by then! That's absolutely unacceptable. I took 2 days off in between to go sightseeing with the GF in Germany, so I can't really count those, but still :D)
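The experiment boils down to this (VG/LV names are my own for illustration; run against a scratch LV only, as mkfs destroys whatever is on it):

```shell
# Round 1: ext4 on the big LV - on my hardware this hung part-way through.
lvcreate -n lv_data -l 100%FREE vg_data
mkfs.ext4 /dev/vg_data/lv_data

# Round 2: same LV recreated, formatted as ext3 - completed without issue.
lvremove -f vg_data/lv_data
lvcreate -n lv_data -l 100%FREE vg_data
mkfs.ext3 /dev/vg_data/lv_data
```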

So I reinstalled yet again (and I have another reinstall waiting for me when my other disks come back from RMA, since I want to use RAID 1) and the box has now been up and running for about 3 days. No hardware errors or weird stuff anymore. I did do some minor tweaking (I disabled a lot of services I don't use) but nothing fancy.

This leads me to a point I've been making for a long, long time: use a stable distro for production work, not the latest & (not so) greatest cutting-edge distro. This tour de force took me - a seasoned Linux administrator - around 3.5 working days to figure out, and in the process it trashed 2 disks that may or may not fall under warranty replacement. If not, I'm looking at another 170 euro expense for new disks. One reason it took a bit longer is that I was also experimenting with the onboard RAID controller in the HP rig and getting support for it into the kernel. In the end I gave up on that and just use it as a plain SATA controller. I couldn't really get the driver loaded into the Fedora kernel (error when using the dd file), and trying an rpm --force of the HP-supplied rpm didn't work on Fedora either. It did work on the RHEL install, but it stopped working after updating RHEL to 5.5, since that kernel isn't supported by the HP package (sigh).

All in all I do understand that this is one person's experience, but what an experience it has been! And it's not that I don't know enough about Amahi or Linux in general: I work in a team of 5 that currently manages around 500 servers, most of them loaded with heavy applications like Oracle, SAP, JBoss and Tomcat, which may or may not be clustered (Oracle RAC or just plain Linux clustering), in datacenters with technology that's not readily available to the public unless you have around $40K lying around for just a server - and I'm not counting the $1.6M DMX fibre channel storage and the related network equipment that's also connected... One of our largest databases is around 150TB - that's not a typo BTW ;) - just to give you a picture of what I'm used to wrt troubleshooting: exotic hardware, sometimes abhorrent driver support, and some quite challenging software setups :geek:

tldr:
use ext3 for your filesystems, even though Fedora 12 defaults to ext4. ext4 borked 2 of my HDDs, which will hopefully be replaced under warranty (if the manufacturer doesn't decide it's user error). This might be related to my SATA controller, but I haven't been able to confirm that yet. I have an open ticket with HP, but I think they'll reject it on the grounds that I don't use RHEL.

Cheers, Moredruid
echo '16i[q]sa[ln0=aln100%Pln100/snlbx]sbA0D2173656C7572206968616D41snlbxq' | dc
Galileo - HP Proliant ML110 G6 quad core Xeon 2.4GHz, 4GB RAM, 2x750GB RAID1 + 2x1TB RAID1 HDD
