Help: Catastrophic failure of management and database server: DB server unrecoverable

September 01, 2013 14:42

Here is the summary of the issue:
This started with two problems. First, there were iscsi connection issues (I know that's vague, but although it said the connection was "ok", things didn't work right).
Second, we then lost the master server (xenserver-manager2). The system became unresponsive, so it was rebooted. On restart, it threw a lot of "read-only" errors until it finally stopped responding at all. It failed to find the boot partition on a second reboot.

For manager2, I reinstalled xenserver and then left it alone.

Xenserver manager1 was in its own bad state. It thought the VMs were running even though it had been rebooted. A second reboot of that one got it back to a normal state with the VMs being offline. I had to force xenserver to say that the VMs that had been running on manager2 were now off. I then removed the "old" manager2 from the pool.

This left us with iscsi connection problems. Xenmanage2 and 3 SRs would not repair. The Xenmanage4 SR seems to have come back healthy.
Using the iscsiadm tool, I can successfully connect to the manage2 and manage3 SRs.
I told xenserver to "forget" manage2 in the hopes I could just re-attach it. As it turns out, you can only add a new one by formatting it in xencenter. To reattach, you must do it from the command prompt. (I was trying to use sr-introduce)

(The following is the short version of the steps that actually worked/gave me forward progress)
Since it was forgotten, of course that doesn't work. The drive didn't show up in the pvlist (or vgs). Basically, it wasn't listed as any of the known physical/logical volumes.
I managed to recreate it with the same uuid, which I found by searching through /var/log/SMlog, so it was then listed.
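A minimal sketch of that SMlog search, for anyone retracing the step. It assumes the log mentions the volume group by its VG_XenStorage-<uuid> name (the naming seen in /etc/lvm/backup); the function names are hypothetical and this is not necessarily how the uuid was found originally.

```python
import re
from collections import Counter

# Matches the 8-4-4-4-12 hex form of an SR uuid, e.g. the one in
# VG_XenStorage-55a045f3-e6fe-71c8-6042-08fa76255b05.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def count_uuids(log_text):
    """Return a Counter of every uuid-shaped string found in log_text."""
    return Counter(UUID_RE.findall(log_text))


def sr_uuid_candidates(log_text, prefix="VG_XenStorage-"):
    """Return the sorted, de-duplicated uuids that directly follow prefix."""
    pattern = re.compile(re.escape(prefix) + "(" + UUID_RE.pattern + ")")
    return sorted(set(pattern.findall(log_text)))
```

To use it, read /var/log/SMlog into a string and pass it to sr_uuid_candidates; a uuid that shows up there but is missing from the current SR list is a reasonable candidate for the forgotten SR.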
Next I encountered an error that the scsi id was wrong, so following http://support.citrix.com/article/CTX118641 I was able to reintroduce the correct scsi-id to it.
Finally, I'm still stuck on the next step. It is missing the Volume Group information. I found a backup of the old one in /etc/lvm/backup/VG_XenStorage-55a045f3-e6fe-71c8-6042-08fa76255b05 (where that string matches the uuid I found in SMlog, and it has the two images we're looking for on that drive.)
According to http://support.citrix.com/article/CTX116017 we should be able to use that to restore the device. I got as far as step 4, but the command returns with "Please specify a single volume group to restore."
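The message quoted above is what vgcfgrestore prints when it is not given exactly one volume group name as a positional argument, so a likely fix is to pass the name along with the backup file. A minimal sketch, assuming the volume group is named after the backup file (as LVM does for files under /etc/lvm/backup). The helper names are hypothetical, and the exact steps in CTX116017 are not reproduced here.

```python
import os
import subprocess


def vgcfgrestore_argv(backup_path, test_only=True):
    """Build the vgcfgrestore command line for a metadata backup file.

    Assumes the volume group name equals the backup file's basename.
    """
    vg_name = os.path.basename(backup_path)
    argv = ["vgcfgrestore", "-f", backup_path]
    if test_only:
        # --test asks LVM for a dry run that does not update metadata.
        argv.append("--test")
    argv.append(vg_name)  # the single volume group name the error asks for
    return argv


def run_restore(backup_path, test_only=True):
    """Run vgcfgrestore (needs root and the LVM tools); return its exit code."""
    return subprocess.call(vgcfgrestore_argv(backup_path, test_only))
```

Defaulting to the dry run is deliberate: on storage that is already in a fragile state, it seems safer to see what LVM would do before letting it rewrite metadata.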

I have purposely not forgotten manage3, but it cannot reconnect either. I can see it from "xe sr-list", but its host is listed as "<not in database>". It is not, however, listed under "xe pbd-list".
My assumption is that it is having similar issues to manage2.

I copied all the logs to /opt/saved_logs just in case the old ones are needed and start to get deleted.

So, I spent two days this week working with XenServer Support to get access to my storage and my management and database virtual machines restored. We got the access to the storage restored to both management servers, but all attempts to recover the database VM were unsuccessful. Once we were able to reconnect to the storage we tried starting the database VM and it failed to start. We determined the problem was the VM header information was corrupted. XenServer support attempted a block-block copy of the VM with no success - same errors and problems. This problem started with the following:

So, does anyone know what our options are here? The cloud pool was unaffected, so all VMs that were in the cloud are still there. Is there any way to re-install the management servers and database and bring the VMs in the cloud pool back under management?

We really would appreciate any help we can get. Thanks.

1 comment

Marc Jensen

March 23, 2016 07:21

No way to recover unless there is either a backup of the database (obviously), a snapshot of the database VM (obviously), or a database dump from a previous time period.