A tale of SQL Server installation

I had to reinstall SQL Server 2000 on a box a few weeks ago. The mission sounded simple when I first started but quickly turned into an eventful drama. Below is an account of the events as they unfolded.

One afternoon, a user reported that he could no longer run reports against a database. He also claimed to have successfully extracted data earlier in the day. The symptom didn’t sound too complicated.

After having difficulty connecting to the SQL Server from SSMS, I went onto the box and discovered that the SQL Server service was stopped.

That’s strange but still no big deal. I clicked on the green start button in the service manager, once…twice…three times, and the server still would not start. I sensed something was not right.

A trip to the event viewer revealed the problem:

17052: SQL Server evaluation period has expired.

Huh? All along I didn’t realise we were running an evaluation version. Now what? We had proper licences and a reinstallation would be the logical step, but due to the time constraint I didn’t have time to muck around; it was getting late and the user needed his data urgently.

One of my team mates suggested a quick and dirty trick at this point: move the system clock back one day and pretend we were still within the evaluation period. You know what, it actually worked. I thought I was smart, but this guy was a real “problem solver”.

The system clock trick was nice, but it was just a sticky-tape solution; we really needed to install a properly licensed version.

Come the next day, I naturally selected the Enterprise Edition (since it was to replace the evaluation Enterprise Edition) and happily clicked on setup.exe.

Microsoft SQL Server 2000 Enterprise Edition server component is not supported on this operating system. Only client components will be available for installation.

Oops. I later found out that neither the Enterprise nor the Standard Edition was supported on a Windows XP box, so I was left with no choice but to settle for the Developer Edition. Fast forward, and I clicked on setup.exe again, but this time I was presented with a different message:

A previous program installation created pending file operations on the installation machine. You must restart the computer before running setup.

Fine, so I restarted the box…three times, and the message was still lingering around. Between the restarts I had uninstalled SQL Server 2000 (and inadvertently lost all the DTS packages stored in msdb; luckily I kept a pretty recent backup of msdb, phew!).

In the end, I had to delete a registry entry following the instructions in the KB article: SQL Server 2000 installation fails with “…previous program installation…” error message

Thirty minutes later, SQL Server 2000 was successfully installed. OK, time to restore the msdb database to recover all the DTS packages and jobs, etc. It didn’t take me long to locate the msdb backup and type in the command to restore it…

Msg 3168, Level 16, State 1, Line 1
The backup of the system database on device C:\Backup\msdb.bak cannot be restored because it was created by a different version of the server (134219767) than this server (134217922).

Now what?! After staring at the error message for a while, I remembered that the previous SQL Server was on SP4. That had to be it.

Click…click…click, SP4 was installed without any hiccups. I attempted the msdb restore again and, thank god, it worked this time. I had finally restored the server to the state it was in two days ago. What a “fun” experience!
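
For anyone repeating the exercise, here is roughly what the final check and restore looked like. This is a minimal sketch rather than the exact commands I ran; the backup path comes from the error message above, and everything else is illustrative:

-- confirm the newly installed instance is on the same build (SP4) as the
-- instance that produced the msdb backup
select serverproperty('ProductVersion') as product_version,
       serverproperty('ProductLevel')   as product_level

-- restore msdb over the freshly installed copy (stop SQL Server Agent first,
-- since it keeps msdb in use)
restore database msdb
from disk = 'C:\Backup\msdb.bak'
with replace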

How a wrong join made the query MUCH slower

I was monitoring a bunch of batch jobs one day and noticed one job was taking much longer than usual. Normally the job completes in around 3 minutes, but this time it had been running for 3 hours and was still going.

After running a series of troubleshooting commands like “sp_who2 active” and “dbcc inputbuffer()”, I relied on the query below to identify the runaway query.

select 
    s.session_id  as 'session_id'
    , s.status                  as 'session_status'
    , r.status                  as 'request_status'
    , s.last_request_start_time as 'last_request_start_time'
    , db_name(r.database_id)    as 'database_name'
    , t.text                    as 'command'
    , r.wait_type               as 'wait_type'
    , r.blocking_session_id     as 'blocking_session_id'
    , s.login_name              as 'login_name'
    , s.host_name               as 'host_name'
    , s.program_name            as 'program_name'
    , p.query_plan              as 'query_plan'
from sys.dm_exec_sessions s
left join sys.dm_exec_requests r     on s.session_id = r.session_id
outer apply sys.dm_exec_sql_text(r.sql_handle) t
outer apply sys.dm_exec_query_plan(r.plan_handle) p
where s.is_user_process = 1
and s.status = 'running'
and s.session_id != @@spid

Both blocking_session_id and wait_type were null; the only useful information that could be extracted from the query was the execution plan:

query_plan_nested_loop

Before I go on and explain the execution plan, I will provide some background information. The table was joined against itself and it had 3.8 million rows. Worst of all, the table didn’t have any indexes on it at all, not even a clustered index. Blessed with such insight, it was not difficult to see that the execution plan was sub-optimal. The infamous table scans were due to the lack of indexes; unforgivable, but they were not the main offender. The nested loops were the real culprit killing the query. How so? For every row out of the 3.8 million, it was scanning through 3.8 million rows, as illustrated by the formula below:

3,800,000 x 3,800,000 = 14,440,000,000,000

Adding an index to the table (and fixing the statistics at the same time) was the most sensible approach to improving the performance of this query. However, the aim of this post is to demonstrate the effect of a different join operator on the query.

I was able to locate a historical execution plan which was generated by the query optimizer. The hash join in the execution plan made it obvious why the query never had any problems in the past.

query_plan_hash_join

This was the same as saying:

3,800,000 + 3,800,000 = 7,600,000

What an astronomical difference! Once the root cause was identified, it was quite easy to work out what to do about it. A solution could be:

  • Updating the statistics on the table;
  • Implementing a join hint;
  • Or applying the most effective performance tuning technique: adding an index.

An index was added in the end and the job was completed in no time.
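
For completeness, the kind of fix applied looked something like the sketch below. The table and column names are made up, since the original query is not reproduced in this post:

-- give the optimizer something to seek on (and fresh statistics as a side effect);
-- dbo.staging_table and business_key are illustrative names only
create clustered index ix_staging_table_business_key
    on dbo.staging_table (business_key)

-- alternatively, a join hint forces the hash join the optimizer used to choose
-- when the estimates were still sane
select a.business_key, b.amount
from dbo.staging_table a
inner hash join dbo.staging_table b
    on a.business_key = b.business_key

-- or simply refresh the statistics
update statistics dbo.staging_table with fullscan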

If you would like to get a deeper understanding of the join operators or query execution in general, I highly recommend Craig Freeman’s blog. Below is what he wrote about hash join:

When it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

Dropdown box in DTS Transform Data Task is too narrow

It is a bit unfortunate that some of us still need to work with DTS packages today. When confronted with problems using this antique toolset, the first thing one naturally does is curse the longevity of old technologies.

I encountered one such problem today while working with a simple DTS package. No matter how much I wished for SSIS as my next Christmas present, it was not December. At the end of the day, the problem still needed to be addressed in the DTS package.

The package had a Transform Data Task which loaded a file into a database. I was happily clicking around until I ended up on the destination tab and was stopped by the dreadful dropdown box. It was not wide enough to show the full table names; all I could see was the first few characters of each table name. How would I know which table I had selected?

TransformDataTask_SecondaryMonitor

I was pretty sure there had to be a way around it. After all, DTS packages have been around for yonks, so someone must have seen this behaviour and reported it. After searching on Google for 2 minutes, I stumbled across this little gem. I couldn’t believe my luck; the solution was dead simple, but who would’ve guessed? All I had to do was move the DTS designer window to the primary monitor!

TransformDataTask_PrimaryMonitor

SSIS package executed without errors but wrote 0 rows

I encountered and resolved an SSIS issue today where a package executed successfully without errors, but close inspection of the data flow pipeline showed no row count. I think it is a good idea to write a post on this topic so other people can benefit from my observations.

I am going to reproduce the issue with a quick demo.

1. Create a new SSIS package with a data flow task. Use OLE DB as the source and Flat File as the destination.

image

2. Configure the source using a SQL command.

image
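
The screenshot does not survive here, so below is a hedged illustration of the sort of command used in the source. The database and table names are made up; the important part is the USE statement at the top, which the workarounds in step 6 deal with:

use StagingDB

select customer_id, order_date, order_total
from dbo.daily_orders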

3. Before we go any further, let’s test the SQL command by

a) Clicking on “Parse Query”

image

b) Clicking on “Preview…”

image

c) Looking at “Columns”

image

All seems to be okay at this stage, right? Wait and see.

4. Configure the destination. Since this is not an important aspect of the demo, I will skip the details; I am using a Flat File destination, but I think any other destination type is fine.

5. Execute the package and you should see something like this:

image

image

Notice there is no row count, and the output shows the pipeline wrote 0 rows.

6. I have found two ways of working around the issue:

a) Remove the USE statement at the top of the SQL command.

image

b) Insert a SET NOCOUNT ON statement at the start of the SQL command.

image
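
Putting the two workarounds together, the corrected command would look something like this (again, the object names are illustrative only; either change on its own was enough in my case):

set nocount on

select customer_id, order_date, order_total
from dbo.daily_orders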

7. Now execute the package again and you should see it working properly.

image

image

Database ‘xxx’ cannot be opened because it is offline

Got an error the other day which seemed pretty straightforward at first.

Msg 942, Level 14, State 4, Procedure test, Line 4
Database 'ghost' cannot be opened because it is offline.

I will provide some context for the error message before I go on with the story. A database was being migrated from one server to another. As part of the migration, a DBA had implemented a strategy using a bridging database so that the database would appear available on both servers.

image

It was all nice and slick during the transition phase of the migration project. Then on the day of the cut-over, the DBA migrated the database following these steps:

1. Rename the database on Server B. For example, sp_renamedb ‘active’, ‘ghost’

2. Migrate the database from Server A to Server B.

3. Set the renamed database ‘ghost’ offline so no one could access it anymore.

In hindsight, the ‘ghost’ database should have been removed instead of being renamed and taken offline.

After the database was migrated, users started to get error messages like the one shown at the start of the blog post. My initial assumption was that somewhere someone was still referencing the bridging database.

The next natural thing to do was to search through all the code and see if anyone, anywhere, was still referencing the database ‘ghost’. You probably already guessed: the search revealed nothing at all. Since I wasn’t the person working on the migration project, I wasn’t sure I understood everything the DBA had done. Whilst I was discussing the error with the DBA and getting him to confirm that all the migration steps had been done, another senior DBA listening on the side interrupted with a comment that was like a lightning bolt from a clear sky: “It would be the execution plans.”

Because the offline ‘ghost’ database had not been removed from the server, its database id was still cached in the execution plans.

image

The solution was easy once the root cause was determined: the execution plans just needed to be refreshed, which was done by detaching the database.
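
If you ever need to confirm this kind of root cause yourself, the query below is one hedged way to do it on SQL Server 2005 or later. It lists the cached plans that still carry the database id of the offline database:

-- list cached plans whose 'dbid' attribute points at the offline 'ghost' database
select cp.plan_handle, cp.usecounts, cp.objtype
from sys.dm_exec_cached_plans cp
cross apply sys.dm_exec_plan_attributes(cp.plan_handle) pa
where pa.attribute = 'dbid'
and cast(pa.value as int) = db_id('ghost')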

It took forever to delete one record

Have you ever tried to delete a record from a table and it seemed to take forever? Furthermore, you checked everything and just couldn’t find any performance problems with the server.

Well, check again. Check the foreign key references on the primary key of the table. Now you see it? Yes, SQL Server checks the foreign key constraints on every table, big or small, that references your primary key table. It will only delete the record(s) after it has made sure that your statement will not introduce data integrity issues.

I am now going to use the AdventureWorks.HumanResources.Employee table to demonstrate how you can work around the delete performance issue.

Check table

Assuming you already have the table name in a query window, now highlight the table and press Alt+F1. It is the same as running the “sp_help” system proc on the table.

If you get the following error instead of proper results, make sure you surround the schema and table names with quotes or square brackets.

Msg 102, Level 15, State 1, Line 1
Incorrect syntax near '.'.

Like this:

exec sp_help 'HumanResources.Employee'
exec sp_help "HumanResources.Employee"

exec sp_help [HumanResources.Employee]

You will get several result sets; pay attention to the one titled “Table is referenced by foreign key”.

image

Disable foreign key constraints

Copy the list of constraints to a new query window.

image

Replace the beginning of the lines with “alter table “.

image

Replace the semicolons with “ nocheck constraint “.

image

Your script should look like the one below

image

By executing the script, you are effectively disabling the foreign key constraints on these tables.
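
In case the screenshots are hard to follow, the finished disable script ends up looking something like this. The constraint names below are an illustration only; yours will come straight from the sp_help output:

alter table HumanResources.EmployeeAddress nocheck constraint FK_EmployeeAddress_Employee_EmployeeID
alter table HumanResources.EmployeeDepartmentHistory nocheck constraint FK_EmployeeDepartmentHistory_Employee_EmployeeID
alter table Sales.SalesPerson nocheck constraint FK_SalesPerson_Employee_SalesPersonID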

Delete the record(s)

Now you should be able to delete the desired records with a much faster response time (assuming you’ve got appropriate indexes and everything).

delete AdventureWorks.HumanResources.Employee
where EmployeeID = 280

Re-Enable the foreign key constraints

We can re-enable the constraints using the same script we prepared earlier for disabling the foreign key constraints. We just need to replace “nocheck” with “with check check”.

image
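
So the re-enable script ends up looking something like this (same caveat: the constraint names are illustrative):

alter table HumanResources.EmployeeAddress with check check constraint FK_EmployeeAddress_Employee_EmployeeID
alter table HumanResources.EmployeeDepartmentHistory with check check constraint FK_EmployeeDepartmentHistory_Employee_EmployeeID
alter table Sales.SalesPerson with check check constraint FK_SalesPerson_Employee_SalesPersonID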

In case you are wondering why CHECK CHECK, here is the official explanation from Books Online:

The WITH CHECK CHECK CONSTRAINT re-enables the constraint, and also validates the existing data against the re-enabled constraint.

So it should be read like WITH CHECK (pause) CHECK CONSTRAINT.

Finally

I hope you find the above tip helpful. A word of warning though: this tactical operation is better reserved for non-production environments.

There is no SUBSTRING function in my code!

Today I encountered something really annoying; it took me at least an hour to work out what went wrong.

Let’s say I had the following query in a stored procedure:

select left(name, charindex(' ', name) - 1) as 'first_name'
from (select 'louie bao' as name) a

which returned:

image

Now, let’s pretend that I had decided to change the name delimiter from a space to a semicolon and I had forgotten to update the CHARINDEX function:

select left(name, charindex(' ', name) - 1) as 'first_name'
from (select 'louie;bao' as name) a

which returned an error:

Msg 536, Level 16, State 5, Line 1
Invalid length parameter passed to the SUBSTRING function.

The error message looked pretty straightforward. Look again! That’s right, there was no SUBSTRING function in my code!

When I saw that error message, I searched through my stored procedure for the word “substring” over and over again, but to no avail. Just when I was about to give up, I noticed the line number. Why not go to that line and see what was actually causing the error? How? Easy, I will demonstrate with an example:

use tempdb
go
create proc no_substring
as
 
print 'Blah'
 
select left(name, charindex(' ', name) - 1) as 'first_name'
from (select 'louie;bao' as name) a
 
print 'Blah'

return 0
go

Now run the stored proc:

exec no_substring

and you would expect to get:

Blah
Msg 536, Level 16, State 5, Procedure no_substring, Line 6
Invalid length parameter passed to the SUBSTRING function.

Blah

Note that the code starting at Line 6 generated the error. Let’s go there and find the code:

exec sp_helptext 'no_substring'

image

Now, back to the problem at hand: I managed to reduce the code to the query below and was still able to produce the error message:

select left('', (select -1))

and got:

Msg 536, Level 16, State 5, Line 1
Invalid length parameter passed to the SUBSTRING function.

I should mention that I was running SQL Server 2005. Just before writing this post, I tried the same code in SQL Server 2008 and got the following error message instead:

Msg 537, Level 16, State 2, Line 1
Invalid length parameter passed to the LEFT or SUBSTRING function.

A new error message was introduced in SQL Server 2008, which is much more helpful at identifying the cause of the error.
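
As a side note, once the offending line has been found, a guard like the one below avoids the negative length when the delimiter is missing. This is just one way to write it:

-- nullif turns the "not found" result (0) into null, and isnull then falls back
-- to the whole string when there is no delimiter
select left(name, isnull(nullif(charindex(';', name), 0) - 1, len(name))) as 'first_name'
from (select 'louie bao' as name) a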

Restore 500GB SQL2000 database from tape using NetBackup 6.5.3.1

The other day I was restoring a 500GB SQL Server 2000 database from tape using NetBackup 6.5.3.1 and it didn’t work well. Well, it didn’t work at all. No matter how many times I tried, I kept getting this error:

Status Code 41: network connection timed out

This was not the first database I’d tried, and I had been able to restore other databases successfully, so I knew there was nothing wrong with the network. The one thing I could think of was the relatively large database size.

At that moment, I was 50% sure that the error was caused by a timeout setting somewhere in NetBackup. Since I had limited experience working with NetBackup, I turned to Google and found the following article:

DOCUMENTATION: Restores of large Microsoft SQL server databases using the NetBackup for Microsoft SQL Server database extension fail before jobs start reading data from tape.

With Microsoft SQL server, the restore process starts by having SQL allocate all of the data files that will be used for the SQL Database. SQL then writes zeroes to all of these files. If this is a very large SQL database, this process can take a significant amount of time. Only after Microsoft SQL finishes writing zeros will it start requesting data from the NetBackup agent. Generally there is no appearance of activity until the Microsoft SQL server is ready to start requesting data to recover the SQL database.

If the NetBackup “Client Read Timeout” for that client is not large enough, the restore processes will have already timed out before the SQL Server requests the first byte of data.

It all made perfect sense to me, and luckily I had read about instant file initialization before, so I knew that was exactly it. I managed to connect all the dots and came to the understanding that:

if file initialization time > client read timeout
    status code 41
else
    ok

Since I had no control over the database size and had to restore to a SQL Server 2000 box, I definitely needed to modify “Client Read Timeout”, but to what?

A quick search on the net returned an overwhelming number of blog posts, including Symantec documentation, advising people to increase that number.

That just didn’t seem right to me. My experience tells me that when you code something like a timeout, you always leave behind a way to disable it, and the convention is: zero = disabled.

In the end, I set the timeout to zero and guess what? It worked. I was happily restoring the 500GB database and looking forward to restoring the 2TB data warehouse next.

–oo00oo–

For more info on instant file initialization, please refer to the following blog posts:
Instant Initialization – What, Why and How?
Misconceptions around instant file initialization
Search Engine Q&A #24: Why can’t the transaction log use instant initialization?

Rebuild Master failed with error -1

The other day I was trying to rebuild the master database for a SQL Server 2000 box. Everything was going well until the “Configuring Server” dialog popped up.

D0B9A22DC37EAE4F_164_0

The progress bar went across about 4 times and then an error message popped up.

D0B9A22DC37EAE4F_164_1

That’s weird; I had rebuilt the master database a few times before and had never encountered an error like this one. I looked for the error details, starting with the cnfgsvr.out log file located in C:\Program Files\Microsoft SQL Server\MSSQL\Install.

D0B9A22DC37EAE4F_164_2

The log file confirmed the error, but it didn’t explain why the connections had failed. So I ran eventvwr. One of the system log errors mentioned that the MSSQLSERVER service had failed to start.

D0B9A22DC37EAE4F_164_3

At that point, I was still not sure what had happened, so I ran services.msc and tried to start the MSSQLSERVER service manually.

D0B9A22DC37EAE4F_164_4

The error message matched what I had seen in the event viewer. Path? What path? Suddenly, as if struck by lightning, I noticed that “Path to executable:” showed the short file names for the path. The question became obvious: what was MICROS~3?

D0B9A22DC37EAE4F_164_5

A quick look in Explorer revealed 5 long-name folders starting with the word “Microsoft”.

D0B9A22DC37EAE4F_164_6

MICROS~3 happened to be “Microsoft Office” instead of “Microsoft SQL Server”. Now that the cause of the issue had been identified, I needed a way to work around it. For what I was doing at the time, I could not simply reinstall SQL Server, so I decided to move the “80” and “mssql” folders into “Microsoft Office”.

D0B9A22DC37EAE4F_164_7

What I had effectively done was make C:\PROGRA~1\MICROS~3\MSSQL\binn\sqlservr.exe a valid path.

Then I reran rebuildm.exe and as expected, everything worked.

D0B9A22DC37EAE4F_164_8

–oo00oo–

For more info on short file names, please refer to the following Microsoft Knowledge Base articles:
How Windows Generates 8.3 File Names from Long File Names
8.3 Short File Names not Retained after Tape Restore
How to Disable Automatic Short File Name Generation