How a wrong join made the query MUCH slower

I was monitoring a bunch of batch jobs one day and noticed that one of them was taking much longer than usual. Normally the job completes in around 3 minutes, but this time it had been running for 3 hours and was still going.

After running a series of troubleshooting commands such as “sp_who2 active” and “dbcc inputbuffer()”, I relied on the diagnostic query below to identify the runaway statement.
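
Those first-pass commands look roughly like this; the session id passed to dbcc inputbuffer is only illustrative:

exec sp_who2 'active';     -- list the sessions that are currently doing work
dbcc inputbuffer(57);      -- show the last statement sent by a suspect session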

-- active user sessions with their current request, statement text and execution plan
select
    s.session_id  as 'session_id'
    , s.status                  as 'session_status'
    , r.status                  as 'request_status'
    , s.last_request_start_time as 'last_request_start_time'
    , db_name(r.database_id)    as 'database_name'
    , t.text                    as 'command'
    , r.wait_type               as 'wait_type'
    , r.blocking_session_id     as 'blocking_session_id'
    , s.login_name              as 'login_name'
    , s.host_name               as 'host_name'
    , s.program_name            as 'program_name'
    , p.query_plan              as 'query_plan'
from sys.dm_exec_sessions s
left join sys.dm_exec_requests r     on s.session_id = r.session_id
outer apply sys.dm_exec_sql_text(r.sql_handle) t
outer apply sys.dm_exec_query_plan(r.plan_handle) p
where s.is_user_process = 1
and s.status = 'running'
and s.session_id != @@spid

Both blocking_session_id and wait_type were NULL, so the only useful piece of information the query returned was the execution plan:

[Execution plan: nested loops join]

Before I go on and explain the execution plan, some background information: the table was joined against itself and it had 3.8 million rows. Worst of all, the table didn’t have any indexes on it at all, not even a clustered index. With that insight, it was not difficult to see that the execution plan was sub-optimal. The infamous table scans were due to the lack of indexes; unforgivable, but they were not the main offender. The nested loops join was the real culprit killing the query. How so? For every one of the 3.8 million outer rows, it scanned through all 3.8 million inner rows, which can be illustrated by the formula below:

3,800,000 x 3,800,000 = 14,440,000,000,000

Adding an index to the table (and fixing the statistics at the same time) was the most sensible way to improve the performance of this query. However, the aim of this post is to demonstrate the effect of a different join operator on the query.

I was able to locate a historical execution plan for the same query. The hash join in that plan made it obvious why the job had never had any problems in the past.

[Execution plan: hash join]

This was the same as saying:

3,800,000 + 3,800,000 = 7,600,000
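
To make the difference concrete, here is a minimal sketch that forces each operator on a self-join. The table and column names (dbo.StagingData, BusinessKey) are made up, since the real table isn’t named in this post.

-- hypothetical repro: dbo.StagingData is an unindexed heap with ~3.8 million rows
-- forcing nested loops: the inner input is walked again for every outer row
select count(*)
from dbo.StagingData a
inner loop join dbo.StagingData b on a.BusinessKey = b.BusinessKey;

-- forcing hash join: each input is read roughly once (build the hash table, then probe it)
select count(*)
from dbo.StagingData a
inner hash join dbo.StagingData b on a.BusinessKey = b.BusinessKey;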

What an astronomical difference! Once the root cause was identified, it was quite easy to work out what to do about it. Possible solutions, each sketched below the list, included:

  • Updating the statistics on the table;
  • Implementing a join hint;
  • Or applying the most effective performance tuning technique: adding an index.
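
Here is a rough sketch of what each option could look like, again using the made-up dbo.StagingData and BusinessKey names from the earlier snippet:

-- 1. refresh the statistics the optimizer uses to cost the join
update statistics dbo.StagingData with fullscan;

-- 2. override the optimizer's operator choice for this query only
select count(*)
from dbo.StagingData a
join dbo.StagingData b on a.BusinessKey = b.BusinessKey
option (hash join);

-- 3. give the join something to seek or merge on
create clustered index IX_StagingData_BusinessKey on dbo.StagingData (BusinessKey);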

In the end an index was added, and the job completed in no time.

If you would like a deeper understanding of the join operators, or of query execution in general, I highly recommend Craig Freedman’s blog. Below is what he wrote about hash join:

When it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.