Bare Metal Recovery of Windows 2000 Servers Using Bacula and BartPE

October 13, 2005
© Brian McDonald, 2005
brianmcd at columbus dot rr dot com

Introduction

Bacula had been working as our backup/restore solution for just over a year and a half. Aside from some juggling of backup pools to better accommodate growth, we'd had to do very little to keep it running. Restoration of our FreeBSD servers from bare metal never really worried me. The FreeBSD installation media itself has powerful recovery features and, based on my reading, it seemed that recovering a UNIX system would be fairly straightforward (I've done it with just dumps dozens of times).

I was far less certain about restoring our Windows 2000 systems from bare metal. The file set that can be used on a Windows 2000 system intentionally skips a number of system files, because the Windows kernel file locking mechanisms prevent a simple copy of them. Volume Shadow Copy (finally) corrected this problem, but that service is only available on Windows 2003 and above, and it's not a good idea to make that upgrade just for that feature.

So, we embarked on a project to determine if we could recover a Windows 2000 system from bare metal using only Bacula backups. There were many false starts. Several pitfalls. We did some things to these systems that would make a Microsoft Professional Support Services engineer shake with fear, anger, and revulsion. In the end though, we can recover our Windows systems from catastrophic failure fairly quickly.

This guide is divided into three sections which I'll call Bliss, Panic, and Relief.

Phase I - Bliss

All of the servers are up. All of the customers pay on time. All the data is intact, and available. Now is the time to plan for proper disaster recovery, before the excrement hits the air circulation device. There are two things to do here - set up your servers and build your recovery disk.

Server Setup:

  1. Record the drive partition information. During recovery you may have to rebuild the partition table, and when restoring data, it's important that it all fit. The size does not have to be exact, but it has to be close. It's also very important to have the same letters assigned to partitions during restore, lest something "interesting" happen. Pay attention to where your CD-ROM drives are assigned. If you're thinking far ahead, re-assigning them all to O: leaves plenty of room for fixed disk partitions. Ours all end up D:, E: or even F:, which is a pain in several other areas.
  2. If you're using IIS, then please, back up your IIS metabase. We found that many, many times the IIS metabase did not restore properly, and it is not included in a system state backup. There are tools in the IIS ResKit to do this from a script, or you can do it from the IIS manager snap-in. Either automate it if you're into that, or do it manually every time you alter any IIS parameters. The IIS management snap-in puts the resulting file in %SystemRoot%\system32\inetsrv\MetaBack. We've found that it's fairly rare to go into IIS aside from adding new customers, but your mileage may vary.
  3. If you're using certificates for anything, then back up your certificate keyring using the Certificates MMC console. These are also not stored in the system state backup, at least not obviously enough that I was confident about being able to get them back. The Certificates MMC snap-in allows you to create an encrypted keyring file of all of your certificates which you can import into any Windows 2000/2003 system (provided you know the key). I drop a backup out whenever we import a new cert for a customer.
  4. Back up the system state using ntbackup, as described in the Bacula documentation. Don't skip this. You can't get the system back without it. System state backup files are usually 200-500MB, depending on what is on the server. This includes any Active Directory data on a domain controller, but really, don't rely on this. Have two domain controllers at all times. Restoring a big Active Directory from anything is a pain you don't want, whereas just bringing up a new domain controller and rejoining is fairly simple by comparison, especially if that's all the server does. The system state backup contains copies of the registry, key system files, and the entire contents of the dllcache directory, which is where Windows File Protection stores backups of dlls. The dllcache makes up the bulk of the file. Technically you only have to do this backup as often as your system changes significantly. We do it about once a week, because I don't like storing 3.0GB of superfluous data per server on my tapes.
  5. Back up your system using Bacula daily. We use the fileset mentioned at the end, which skips a lot of the files nobody cares about, and a lot of files that can't be backed up thanks to the file locking mechanisms in Windows. We use a GFS backup scheme with tape cartridge rotations on the 1st Sunday of each month, with an optional rotation on the 3rd Sunday if things are filling up too fast for some reason. The drive has an 8-tape DDS4 autochanger, giving us about 240GB of space (eight 20GB tapes at the roughly 1.5x compression we get in practice).
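
The rotation in step 5 could be written as a Schedule resource in the director's configuration. This is only a sketch of one way to do it: the schedule name, the pool names, the backup levels, and the run times below are all assumptions made for illustration, not our production values.

```
# bacula-dir.conf (sketch; names, levels, and times are assumptions)
Schedule {
  Name = "gfs-cycle"
  # 1st Sunday of the month: full backup onto the monthly pool
  Run = Level=Full Pool=Monthly 1st sun at 23:05
  # Remaining Sundays: differential onto the weekly pool
  Run = Level=Differential Pool=Weekly 2nd-5th sun at 23:05
  # Every other night: incremental onto the daily pool
  Run = Level=Incremental Pool=Daily mon-sat at 23:05
}
```

A Job picks this up with Schedule = "gfs-cycle", and each pool named in a Run line needs its own Pool resource.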

By convention, we drop/move all of the files resulting from Steps 2-4 into C:\SystemState. It's a very obvious and standard looking place to find things.
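
One way to automate the system state backup is to have the file daemon run a small batch file before the backup job starts, using the ClientRunBeforeJob directive. The sketch below is hypothetical: the job, client, storage, and pool names and the batch file path are placeholders. The batch file itself would hold a single line adapted from the ntbackup command in the Bacula documentation, something like ntbackup backup systemstate /F c:\SystemState\systemstate.bkf.

```
# bacula-dir.conf (sketch; every name and path here is a placeholder)
Job {
  Name = "win2k-weekly"
  Type = Backup
  Level = Full
  Client = win2k-fd
  FileSet = "the-set"
  Storage = dds4-changer
  Pool = Weekly
  Messages = Standard
  # Runs on the client before any files are read, so the fresh
  # system state file in C:\SystemState is picked up by this backup.
  ClientRunBeforeJob = "c:/SystemState/systemstate.bat"
}
```

If the batch file exits with a non-zero status, Bacula cancels the job, so a failed system state dump does not go unnoticed.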

Recovery Disk

Build a BartPE recovery disk from the instructions located at http://www.nu2.nu/pebuilder/. The guy is a genius - use that to your advantage. I tweaked my recovery disk somewhat to allow me to choose what network and bacula server were being used (I use the disk both at home and at the office) but you don't have to - the bacula plug-in works fine once you tweak the configuration files.
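
For reference, the part of the plug-in's bacula-fd.conf that usually needs tweaking looks roughly like this. The director name, client name, password, and working directory are all placeholders. The working directory has to be somewhere writable, which on a CD-booted system means a RAM disk or the freshly partitioned hard disk.

```
# bacula-fd.conf on the recovery disk (sketch; all values are placeholders)
Director {
  Name = office-dir          # must match the Name of your director
  Password = "change-me"     # must match the Client resource on the director
}

FileDaemon {
  Name = bartpe-fd
  FDport = 9102              # default file daemon port
  WorkingDirectory = "c:/bacula-work"
  Pid Directory = "c:/bacula-work"
}
```

The file also needs a Messages resource, and the director needs a Client resource whose address and password match what you put here.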

Phase II - Panic

One (or more!) of the servers is down. Customers are threatening to never pay you again. Data integrity is now a question mark, hovering over your head. Now is the time to not panic (despite your intense desire to do so). First - analyze the situation carefully and determine if you really, really need to do a bare metal restore. You can usually tell if the system is a smoking pile of ruin (I had an NCR Windows NT server catch on fire once in my presence. It was - exciting.) Some Windows errors cause the OS to get into such a damaged state that even if you recover it, you're not going to be sure it's okay. This is important, because after this procedure, you can't really guarantee it'll be 100% right anyway. If you want that - use UNIX, my friend. I restored a partition on one of my UNIX boxes while I was logged in and using KDE - both KDE and bacula-fd were resident on that partition, and continued to run even after rm had its way with the file system (it was late, my finger slipped, honest.) If you show me a Windows install that can survive the trashing of half of the file system, I'll eat a car. But enough about me.

Recovering Your Server

  1. Replace the hardware with as close to the same system as you can get. Windows is a lot more tolerant of changing hardware than you might think, but it's not invincible. If you replace the system with something that has a brand spanking new controller, etc, you're going to run into problems because, while you may get BartPE to boot, the system it restores may not have the mass storage drivers to read the disk you just lovingly recreated. We pair servers in production/development roles so that if a production server goes south, and we have to forcibly restore it, we can blast a development box and get the production system up and running again quickly, rebuilding the development server when Panic is complete.
  2. Rebuild your drives. This includes rebuilding any drive arrays and repartitioning the disks according to the instructions you wrote out in Step 1 of the Blissful days. The diskpart utility from Windows works just fine for this, including building fault tolerant volumes. I'd deep link to the documentation for this utility on Microsoft's website but it won't be there in a day or two anyway. Google for "diskpart reference".
  3. Restore all data from bacula. Run the bacula-fd from the BartPE menu, and connect to it from your director. Restore to that client, specifying nothing as the restore path - it'll get all the C:/ and D:/ mess right. This takes a while, and if you didn't add !restored to the Messages resource on the client, be prepared to get a really, really big email.
  4. I'm not entirely sure this step is required. It feels evil, though, so I'll leave it in here. If I had another 10 hours to kill on this project, I might re-build my test environment and go to town, but I don't. Anyway. You have to run ntbackup from the restored file system. It won't want to start because the Removable Storage service is not working, but you can cajole it by ignoring the warning. Once in ntbackup, select your system state backup in the SystemState directory, and restore it. This is important - specify an alternate path for the recovery, because you really don't want to let it restore to your BartPE CD, which it will certainly give a good try, despite the read-onliness of most CDs. Once it's restored, you'll notice that it didn't do what you wanted - it restored everything to C:\boot_files. This is a consequence of how the system state backup is stored on the tape. At this point, copy everything out of C:\boot_files into the root of the drive, effectively laying C:\boot_files\WINNT into C:\WINNT, etc. Replace everything as needed. You'll end up with your restored system merged with the last system state backup (more or less).
  5. The system state restore above won't actually correctly restore your registry, unfortunately. You don't need all of it (yet), you just need a registry, so copy all of the files from c:\winnt\repair to c:\winnt\system32\config. Now, these represent an older registry - possibly one from the early days of the system, which means the Administrator account password may revert to an older version. A Windows guru can probably say exactly when these files are created, but the important thing is that you need them to reboot the system.
  6. Reboot the system. No process involving Windows is complete without a rebooting step.
  7. Log back into the system as a local Administrator. Remember the caveat that the password may have changed. It's likely a lot of system services will not run or be entirely missing at this point. Don't panic.
  8. Restore that system state backup again, only this time, to the original location.
  9. Reboot when it tells you to. No process involving Windows is complete without a second rebooting step either.
  10. If you have IIS and Certificate store backups, restore them now. At this point the box should be substantially like it was after the last system state backup, with updated content files from the last bacula restore.
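
The !restored setting mentioned in step 3 goes in the Messages resource of the client's bacula-fd.conf. A minimal sketch, assuming a director named office-dir (a placeholder):

```
# bacula-fd.conf (sketch; the director name is a placeholder)
Messages {
  Name = Standard
  # Send everything to the director except the per-file "skipped" and
  # "restored" notices, which otherwise list every single restored file.
  director = office-dir = all, !skipped, !restored
}
```

With this in place the restore report stays short, while errors and warnings still reach the director.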

Phase III - Relief

The system is back up. Customers stopped harassing your help desk. All of the data appears to be there. Now would probably be a good time to get some sleep or at least a meal. You're not done yet.

After Action Tasks

  1. Verify that all of the data and configurations are back to normal. If you have a regular test harness script you run against your server, run it. If you have a monitoring system in place like Nagios or OpenNMS, verify that it thinks everything has returned to normal. Look in the event log for strange errors which occurred AFTER the last system state restore. Ask the help desk if customers are still reporting problems.
  2. Make sure whatever spares you used during the recovery are replaced ASAP, to minimize the amount of time you're under risk of another catastrophic failure without proper backup.

At this point, you're probably out of the woods.

Additional Recommendations

We treat the systems and data pretty independently. System software is on C:, data is on D:, for instance. By doing this, we ensure that bacula is always getting our customers' data all the time (off of the D: drive) and it's getting enough of C: to let the server come back up and do what it was doing before without a lot of manual reconfiguration. Something else we can do is restore a dead box's SystemState directory and D: to another server and "merge" them together. This can be done very efficiently, since we don't have to wait for a large amount of useless system software to come off the tape, and we don't have to laboriously go through file system selection to get there.

Windows 2003 with IIS6 has better backup/recovery tools than IIS5. However, there are a lot of caveats to upgrading which may make this impossible or impractical (search for "CDONTS Windows 2003" some time). Windows 2003 also has Volume Shadow Copy and ASR, making this entire document fairly moot.

We keep copies of our SQL database backups on two local servers. If one dies, we have the option to restore those databases to another DB server or recover the dead DB server from bare metal. Before bacula, we were using Amanda and really didn't have this all down pat, and we ended up with 10 hours of downtime for our trouble after a RAID tossed two disks out at once, killing an 8 disk array. By storing our SQL backups on multiple servers instead of just the target and tape, we can start to recover from this situation within minutes and be done in a fraction of the time.

VMware (or qemu, or bochs) makes testing this sort of thing very fast, because you can quickly iterate through various BartPE configurations. You can also test the backup restore process itself much faster than blasting a real box. Do the test on a real box, though, both to verify it works with your hardware and frankly to get practice doing it. People who remain calm under pressure work better, get more raises, and are more attractive to their chosen dating pool. You learn how to remain calm by knowing what you're doing, and by having done it before.

Appendix A - FileSet Configuration

Here is our Windows File Set Configuration. Some of the items in here are specific to our setup, but you should be able to get the gist. We use "WIN*" to catch both WINNT and WINDOWS, since we use this set on both Win2K and Win2K3 boxes.

FileSet {
  Name = the-set
  IgnoreFileSetChanges = yes
  Include {
    Options {
	wilddir = "C:/Documents and Settings/*/Application Data/*/Profiles/*/*/Cache"
	wilddir = "C:/Documents and Settings/*/Desktop"
	wilddir = "C:/Documents and Settings/*/Local Settings/History"
	wilddir = "C:/Documents and Settings/*/Local Settings/Temporary Internet Files"
	wilddir = "C:/Documents and Settings/*/Local Settings/Temp"
	wilddir = "C:/WIN*/$Nt*Uninstall*"
	wilddir = "C:/WIN*/CSC"
	wilddir = "C:/WIN*/Internet Logs"
	wilddir = "C:/WIN*/Microsoft.NET/Framework/v1*/Temporary ASP.NET Files"
	wilddir = "C:/WIN*/msdownld.tmp"
	wilddir = "C:/WIN*/system32/LogFiles"
	wilddir = "C:/WIN*/system32/MsDtc/Trace"
	wilddir = "C:/WIN*/system32/Perflib*"
	wilddir = "C:/WIN*/system32/config"
	wilddir = "C:/WIN*/system32/wbem/Repository/FS"
	wilddir = "C:/WIN*/SYSVOL/domain/DO_NOT_REMOVE_NtFrs_PreInstall_Directory"
	wilddir = "C:/WIN*/SYSVOL/sysvol/*/DO_NOT_REMOVE_NtFrs_PreInstall_Directory"
	wilddir = "C:/WIN*/Temp"
	wilddir = "[A-Z]:/RECYCLER"
	wilddir = "[A-Z]:/System Volume Information"
	wilddir = "[A-Z]:/Temp"
	wilddir = "[A-Z]:/Tmp"
	wilddir = "[A-Z]:/WUTemp"
	wildfile = "C:/Documents and Settings/*/Application Data/Microsoft/CLR Security Config/v1*/security.config.cch*"
	wildfile = "C:/Documents and Settings/*/ASPNET/Application Data/Microsoft/CLR Security Config/v1*/security.config.cch*"
	wildfile = "C:/Documents and Settings/*/Local Settings/Application Data/Microsoft/Windows/USRCLASS.*"
	wildfile = "C:/Documents and Settings/*/NTUSER.*"
	wildfile = "C:/Documents and Settings/*/Cookies/*"
	wildfile = "C:/WIN*/Debug/PASSWD.LOG"
	wildfile = "C:/WIN*/Debug/NtFrs*.log"
	wildfile = "C:/WIN*/Microsoft.NET/Framework/v1*/CONFIG/enterprisesec.config.cch*"
	wildfile = "C:/WIN*/Microsoft.NET/Framework/v1*/CONFIG/security.config.cch*"
	wildfile = "C:/WIN*/NETLOGON.CHG"
	wildfile = "C:/WIN*/NTDS/edb.log"
	wildfile = "C:/WIN*/NTDS/ntds.dit"
	wildfile = "C:/WIN*/NTDS/temp.edb"
	wildfile = "C:/WIN*/ntfrs/jet/log/edb.log"
	wildfile = "C:/WIN*/ntfrs/jet/ntfrs.jdb"
	wildfile = "C:/WIN*/ntfrs/jet/temp/tmp.edb"
	wildfile = "C:/WIN*/Registration/*.crmlog"
	wildfile = "C:/WIN*/SchedLgU.Txt"
	wildfile = "C:/WIN*/security/logs/scepol.log"
	wildfile = "C:/WIN*/security/edb.log"
	wildfile = "C:/WIN*/security/edbtmp.log"
	wildfile = "C:/WIN*/security/log.edb"
	wildfile = "C:/WIN*/system32/DTCLog/MSDTC.LOG"
	wildfile = "C:/WIN*/system32/ias/*.ldb"
	wildfile = "C:/WIN*/system32/ias/*.mdb"
	wildfile = "C:/WIN*/system32/MsDtc/MSDTC.LOG"
	wildfile = "C:/WIN*/system32/inetsrv/urlscan/urlscan.*.log"
	wildfile = "C:/WIN*/system32/wbem/Repository/CIM.REP"
	wildfile = "C:/WIN*/system32/windows media/server/NamespaceDelta.xml"
	wildfile = "C:/WIN*/Tasks/SchedLgU.Txt"
	wildfile = "[A-Z]:/pagefile.sys"
	wildfile = "[A-Z]:/Program Files/APC/PowerChute Business Edition/agent/data.dat"
	wildfile = "[A-Z]:/Program Files/APC/PowerChute Business Edition/agent/DataLog"
	wildfile = "[A-Z]:/Program Files/APC/PowerChute Business Edition/server/data.dat"
	wildfile = "[A-Z]:/Program Files/APC/PowerChute Business Edition/server/debug.txt"
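	# Anything matching one of the patterns above is left out of the backup.
	Exclude = yes
    }
    # No patterns here, so this applies to every file that was not excluded.
    Options { signature=MD5; }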
    File = C:/
    File = D:/
  }
}