SQL Server 2005 Fail-over Cluster Upgrade Post Mortem
I just finished up a SQL Server 2000 to SQL Server 2005 upgrade on a large fail-over production cluster. I ran into a couple of issues that extended the upgrade outside the maintenance window and caused me some serious headaches. The information below may save you some time and sanity before upgrading your SQL Server Fail-Over cluster from 2000 to 2005.
Issue one: The new SQL Server 2005 Password Policy
I'm all for security features - password policies rank up there as an important feature for any product that requires authentication. SQL Server 2005 combined with Windows Server 2003 brought SQL Server into this century as far as SQL Account are concerned (though I prefer using Windows Logins with SSPI when possible but most third party vendors whose software runs on SQL Server don't seem to get this).
One of the worst things about using SQL Accounts in SQL Server 2000 is that SQL account passwords are case insensitive if you're using the default collation (which 99% of you are). That's right, there's nothing like reducing the security of a password by removing an additional 26 combinations from the keyspace.
So back to SQL Server 2005's password policy, which I feel I need to say again, is a welcome addition as it not only resolves the case insensitivity issue with the default collation in SQL 2000, but also integrates Windows 2003 password complexity requirements into SQL Accounts as well (including strong password enforcement by extending passfilt.dll).
My main issue with the new Password Policy is not so much with how it's implemented in SQL Server 2005, but how it applies to the SQL Server 2005 upgrade process. The SQL Server 2005 Setup fails to check the service identity's password strength against the Windows 2003 policy before performing the upgrade thus potentially causing a major issue near the end of the installation.
The million dollar question: What happens if the windows service account that you're using to run SQL Server 2000 doesn't meet the complexity requirements in the NEW and IMPROVED Password Policy in SQL Server 2005 when you do the upgrade?
Answer: The Setup on a high-availability cluster will reach about 85% on one cluster node and 100% on the other cluster node and THEN it checks the service account password to see if it meets complexity requirements. This wouldn't necessarily be a problem if they gave you the opportunity to change it here, but they don't - the only option appears to be to roll back the entire SQL Server 2005 installation back to SQL 2000...OH...MY...GOD.
I'm too stubborn to just let it rollback - I'm stunned, staring at a modal window that offers two options: "Retry" and "Cancel" - canceling seems like a terrible idea - especially since one of the nodes 100% completed. I check the status and the setup program - it is in the process of migrating the msdb database with SQL Scripts. I also reviewed the Setup Logs files to see where it was at. All signs seem to point to SQL Server 2005 up and running and executing migration scripts - basically SQL code on the newly upgraded engine. I checked the Services snap-in and the install folders and it looks like everything is in place and running. I even connected to it with SQL Server Management Studio and looked at the version. My Analysis? The SQL Service instance was UP AND RUNNING SQL SERVER 2005 within the cluster! The Setup is basically done! The System tables are the only things that still need to be migrated.
And why does the Setup program want to rollback the entire installation? What is the critical failure that requires such a heavy hand?
My service account password doesn't meet complexity requirements.
That's right - I don't have two upper, two lower, two numbers, and two symbols (or whatever our enterprise requirements are) for the SQL service account password so I just better uninstall the whole darn thing and start over. Genius.
Some poor soul ran into this exact issue here - AND HAD TO ROLLBACK!
"We, unfortunately, had to fall back to SQL 2000 (painful) and then re-do the upgrade."
I REFUSED to hit cancel. I figured the Retry button must be there for a reason. So after pushing it a few times in a hopeless attempt for it to just give in and move on, I did some digging with Google.
Turns out there's a real simple fix: Just add the trace flag -T4606 to the service command arguments on SQL Server 2000 before upgrading and it will disable the password policy check on upgrade and viola, upgrade works just fine OR change your service account password to meet the complexity requirements and then upgrade. WOW, wish I would have known that!
I realize I might be able to fix this by simply changing the SQL Server 2005 startup parameters, restarting the SQL Service, and then hitting "Retry" on the SQL Server 2005 Setup program and if all went as planned, the setup would move on.
Turns out, this solution WORKS just fine and I would have been able to add the trace flag, cycle the service, click "Retry" and finished the upgrade...except for another small issue...
Issue Two - SQL Server 2005 Configuration Manager Unexpected Behavior
In order to change the startup parameters to SQL Server 2005 to add the additional Trace flag of -T4606, you have do something fairly simple.
The Microsoft Knowledge Base Article describes it like this:
1. Start the SQL Server service by using trace flag 4606.
a. Open SQL Server Configuration Manager.
b. Click SQL Server 2005 Services, and then double-click SQL Server (InstanceName).
c. In the SQL Server (InstanceName) Properties dialog box, click the Advanced tab.
d. On the Advanced tab, add the following text at the end of the existing string in the Startup Parameters box: -T4606
e. Click OK.
Now I've changed startup parameters on probably hundreds of programs and I expect this sort of thing to work a certain way...and the instructions above fail to mention the MOST CRITICAL part.
You need a semicolon between the previous parameter and the -T4505 parameter. A semicolon? Why would I need that? What happens if you don't put a semicolon between the parameters? Something unexpected if you're not familiar with SQL Server 2005 Configuration Manager.
So I change the startup parameter from this:
...-lc:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\mastlog.ldf
...-lc:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\mastlog.ldf -T4606
That looks fine, doesn't it....I thought so too. But when I took down and started back up the SQL 2005 Service via the Cluster manager, the error when the database service tries to start in ERRORLOG is:
Can't open file "c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\mastlog.ldf -T4606"
That's right, it appended -T4606 to the master LDF file name. Talk about unexpected.
I remove the -T4606 the same way I added it - using the SQL Server Configuration Manager then recycle the SQL service. But I'm getting the same error. I look in the Configuration Manager again and the -T4606 parameter is back. I remove it again, recycle - but -T4606 shows back up as a part of the filename!
This is maddening. So I go to the registry and find the startup parameters, completely bypassing SQL Server 2005 Configuration Manager. The settings are here:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer\Parameters
SQLArg0 = -dc:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\master.mdf"
SQLArg1 = -ec:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\ERRORLOG"
SQLArg2 = -lc:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\mastlog.ldf -T4606"
So I modify SQLArg2 to remove the tailing trace flag and then add:
SQLArg3 = "-T4606"
Then I cycle the service - guess what (and if you know clustering services well, you know exactly what's happening) - the SQLArg2 is back with the -T4606 appended to the LDF filename and the SQLArg3 is GONE.
A little background on Windows Clustering Services
Windows Clustering Services is in charge of SQL Server and its' startup parameters (which live in the registry location listed above), you never start the SQL Service directly on a cluster. The Windows Clustering Service is in charge because in a situation where there's a failure on one node, the other node needs to know when to take over - that's Windows Clustering Service job. It notices the failure and the nodes work out which one should bring up the SQL Server service.
For this to function properly, the SQL Server registry configuration must match between the two servers. Windows Clustering Services stores the 'master' settings for these services in the CPT hive file in the cluster quorum - this is called a checkpoint. When a cluster resource comes online, Windows Clustering Services copies out these particular sections of the registry to each servers so they are consistent. If a change is made while the resource is online, these changes get check-pointed back into the CPT hive. If a change happens while the cluster resource is offline, then the changes DO NOT get check-pointed into the CPT hive in the cluster quorum.
So here's what I did wrong:
My 'bad' change (master ldb file with -T4606 appended to it) was check-pointed to the cluster quorum because I did it when the cluster resource, that is SQL Server, was on line. Since the resource couldn't come back online due to the bad change, my attempt to fix the setting was never check-pointed back to the quorum.
Then when I tried to bring the resource back online, the service parameters in the registry were overwritten with the last 'good' (yea right) checkpoint in the quorum which contained the bad setting.
What took me a while to figure out is that the best way to deal with this issue is to delete the Cluster checkpoint for the registry key that handles the Startup Parameters for SQL Server 2005.
1. Disable the cluster checkpoint for the specific registry key:
c:\> cluster res "SQL Server" /removecheck: "Software\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLSERVER"
2. Correct the trace flag mistake in the SQL 2005 Configuration Manager OR just edit the registry directly which is the approach I took.
3. Enable the cluster checkpoint for the specific registry key:
c:\> cluster res "SQL Server" /addcheck: "Software\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLSERVER"
Once the trace flag -T4606 was added in, I brought up the Cluster node, hit "Retry" in the SQL Server 2005 setup and then received another error about losing the connection to the SQL Server. This is expected since I cycled the service. Then I click "Retry" again and the setup process completes successfully.
What a pain!
Next steps : remove the -T4606 flag from the startup parameters and increase the complexity of my service account password.
Microsoft KB Articles
Error message in the SQL Server ERRORLOG file after you upgrade SQL Server 2000 to SQL Server 2005 on a server that is running Windows Server 2003: "Unable to update password policy"
The SQL Server service cannot start when you change a startup parameter for a clustered instance of SQL Server 2000 or of SQL Server 2005 to a value that is not valid
Unable to correct invalid SQL Server Network Configuration on clustered SQL Server causes clustered SQL Server fail to start “permanently”
(Discusses Clustering Services Checkpoint Behavior)