In a recent talk with the Sacramento SQL Server User Group (@SACSQLUG), a question was asked I wanted to address, and the answer seemed more like a short blog than a tweet.

I was presenting on how to use the TOP clause to break down large operations into short, fast, bite-sized operations. The mechanics are things I learned from writing processes that do garbage collection, backfill new columns, and anonymizing PII data on existing tables. I’ve just posted the slides and example scripts here if you are interested.

Are they the same?

The question was whether the SET ROWCOUNT command would work just the same, and the answer is sometimes yes but largely no.

SET ROWCOUNT will limit the number of rows returned by the following statements, and this functions like a connection level setting. It will remain in effect until you change it or clear it by setting the rowcount to 0 (unlimited).

I would be concerned about setting the option and not clearing it, causing it to affect additional statements.

So let’s adapt some of my examples and see how it behaves.

The simple case with SET ROWCOUNT

Here’s a script to return 100 rows from a table in WideWorldImporters.

USE WideWorldImporters
GO
SELECT TOP 100 
	sol.OrderLineID, 
	sol.UnitPrice 
FROM Sales.OrderLines sol 
WHERE sol.OrderLineID < 1000 
	AND sol.UnitPrice > 50;
GO

SET ROWCOUNT 100;

SELECT 
	sol.OrderLineID, 
	sol.UnitPrice 
FROM Sales.OrderLines sol 
WHERE sol.OrderLineID < 1000 
	AND sol.UnitPrice > 50;
GO

Fairly simple, and the plans are much the same.

100 rows, as promised

The only difference is that the TOP operator is missing. But the query is behaving much like it is present. I surmise that the SELECT operator itself is limiting the number of rows by not requesting additional data after the first 100.

Quick aside on Row Mode

SQL Server queries typically operate in row mode. Batch mode is used in some cases with columnstore indexes and can be used on rowstore indexes in SQL Server 2019 and above. But I want to focus on how a query processes in row mode (with thanks to Grant Fritchey for his description of this).

The activity is driven by the root node, in this latest plan the Select operator. It asks the operator under it (the Top operator) to go get a row. The Select operator doesn’t really care how that operator does what it does, it just wants a row. And for the Top operator to do it’s job, it asks the operator below it for the same.

Until we get down to the last Nested Loops operator. It asks the Index Scan operator for a row, and that operator can finally get us some data. It returns a row and, being a join operator, Nested Loops asks the Key Lookup for data related to its row. It joins the results and pass the joined row up to the operator above it.

When data flows up to an operator, it will typically reply, “Now give me another row.”

The Top operator changes this be simply not asking for more rows once it has received the number of rows it needs. So all the activity for the underlying operators stops because the Top hasn’t told it to continue.

In the simple example of my two queries, we can see the Clustered Index Seek only returned 100 rows, then the query ended. It seems with SET ROWCOUNT in effect, the Select operator itself stopped the operation by not requesting more data.

But then it falls apart

Here’s a stored procedure I wrote to delete old records from the Sales.Orders table and all related tables.

USE WideWorldImporters
GO
CREATE OR ALTER PROCEDURE Sales.Orders_GarbageCollection(
@BatchSize INT = 100,
@Duration INT = 30
)AS
SET NOCOUNT ON;

	DECLARE 
		@NestedTransaction BIT = 0,
		@EndTime DATETIME;

	CREATE TABLE #OrdersGC(
		OrderID INT,
		INDEX IX_OrdersGC(OrderID)
	);

	CREATE TABLE #InvoicesGC(
		InvoiceID INT,
		INDEX IX_InvoicesGC(InvoiceID)
	);

	SET @EndTime = DATEADD(SECOND, @Duration, GETUTCDATE());

	WHILE (GETUTCDATE() < @EndTime)
	BEGIN
		TRUNCATE TABLE #OrdersGC;
		TRUNCATE TABLE #InvoicesGC;

		IF @@NESTLEVEL > 0
		BEGIN
			SET @NestedTransaction = 1;
		END
		ELSE
		BEGIN
			BEGIN TRANSACTION;
		END;

		INSERT INTO #OrdersGC
		SELECT TOP (@BatchSize)
			so.OrderID
		FROM Sales.Orders so
		WHERE
			so.OrderDate < 
				DATEADD(month, -50, GETUTCDATE());

		DELETE sol
		FROM #OrdersGC gc
		JOIN Sales.OrderLines sol
			ON sol.OrderID = gc.OrderID;
	
		INSERT INTO #InvoicesGC
		SELECT --TOP (@BatchSize)
			inv.InvoiceID
		FROM #OrdersGC gc
		JOIN Sales.Invoices inv
			ON inv.OrderID = gc.OrderID;


			DELETE ctr
			FROM #InvoicesGC inv
			JOIN Sales.CustomerTransactions ctr
				ON ctr.InvoiceID = inv.InvoiceID;

			DELETE sti
			FROM #InvoicesGC inv
			JOIN Warehouse.StockItemTransactions sti
				ON sti.InvoiceID = inv.InvoiceID;

			DELETE invl
			FROM #InvoicesGC inv
			JOIN sales.InvoiceLines invl
				ON invl.InvoiceID = inv.InvoiceID;

		DELETE inv
		FROM #OrdersGC gc
		JOIN Sales.Invoices inv
			ON inv.OrderID = gc.OrderID;

		DELETE so
		FROM #OrdersGC gc
		JOIN Sales.Orders so
			ON so.OrderID = gc.OrderID;
		
		IF @NestedTransaction = 0
		BEGIN
			COMMIT TRANSACTION;
		END;

	END;

GO

BEGIN TRANSACTION

EXEC Sales.Orders_GarbageCollection
	@BatchSize = 100,
	@Duration = 5;

ROLLBACK TRANSACTION
GO


--CREATE INDEX IX_Sales_Orders_OrderDate ON Sales.Orders (OrderDate);

This proc identifies 100 rows in Sales.Orders that we want to delete, and puts those OrderIDs in a temp table. We then delete records from related tables based on those OrderIDs, and we similarly delete records related to rows we want to delete from Sales.Invoices. And this procedure was written to loop and keep deleting records for a specified number of seconds.

One key thing to remember is that we only use a TOP clause when we populate the first temp table. We want to delete those 100 rows and everything related to them, not those 100 rows and only 100 related rows from the related tables. If we only delete some of the related rows, our process will fail on a foreign key violation when we try to delete some rows from the Sales.Orders table.

If you were to change this to use a SET ROWCOUNT, you would need to reset that to 0 after the statement inserting into the temp table. If not, the next statement looks like this:

With SET ROWCOUNT 100

Note the Top operator in this DELETE statement. All is flowing as expected, but then the Top operator shuts this down prematurely. Top only passes on 100 rows, when then get deleted. So, later when we try to DELETE from the Sales.Orders table, we fail because we haven’t deleted all the related rows.

And you could just SET ROWCOUNT 0 once you don’t need it, but I’d be concerned I’d forget to do so and cause issues.

Also, it’s deprecated

If you take a look at the remarks section of the documentation for this statement, there’s something interesting.

Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax. For more information, see TOP (Transact-SQL).

Microsoft

So they aren’t using the word deprecated here, but it amounts to that. This will stop affecting DELETE, INSERT, and UPDATE statements in future, so you wouldn’t want to write a process using SET ROWCOUNT only to have it fail at some undefined point in the future.

It doesn’t state whether this will also affect MERGE statements, but I’d rather not have to use different syntax on some statements anyway, so I’ll just stick with my TOP clauses.

Happy Holidays

Should have another post or two coming in the next few weeks thanks to some extra time off and not traveling much for the holidays. Hope yours are great.

If you have any topics related to performance in SQL Server you would like to hear more about, please feel free to make a suggestion. You can follow me on twitter (@sqljared) and contact me if you have questions.

Also, subscribe to get emails when I post again.

Slow Garbage Collection

I encountered a curious issue recently, and immediately knew I needed to blog about it. Having already blogged about implicit conversions and how the TOP operator interacts with blocking operators, I found a problem that looked like the combination of the two.

I reviewed a garbage collection process that’s been in place for some time. The procedure populates a temp table with the key values for the table that is central to the GC. We use the temp table to delete from the related tables, then delete from the primary table. However, the query populating our temp table was taking far too long, 84 seconds when I tested it. We’re scanning and returning 1.4 million rows from the first table, doing a key lookup on all of them. We scan another table to look up the retention period for the related account, as this database has information from multiple accounts, and return 8.4 million rows there. We join two massive record sets, then have a filter operator that gets us down to a more reasonable size.

In general, when running a complex query you want your most effective filter, that results in the fewest rows, to happen first. We want to get the result set small early, because every operation after becomes less expensive. Here’s part of our anonymized query:

 WHERE
   Object5.Column10 = ?
   AND Object5.Column6 < Variable4 - COALESCE(Object12.Column1,Variable1,?)

So, Object5 is our main table. Column10 is a status column, and we compare that to a scalar value to make sure this record is ready for GC. That’s fine. The second is checking if the record is old enough to GC, and this is where things get interesting. The code here is odd; we’re subtracting a COALESCE of three values from a DATETIME variable, relying on behavior that subtracting an INT from a DATETIME subtracts that number of days from the DATETIME. There’s an implicit conversion, but it is on the variable side of the inequality. And the plan above appears to be our result.

So, Bad Date Math Code?

Seems like a good root cause, and I don’t like this lazy coding pattern, but let’s test it out with tables we can all look at. Would this be any better if we had written this to use the DATEADD function to do the math properly?


USE AdventureWorks2014
GO
--CREATE INDEX IX_SalesOrderHeader_ModifiedDate ON Sales.SalesOrderHeader (ModifiedDate);

SELECT TOP 100
	soh.ModifiedDate,
	soh.SalesOrderID
FROM Sales.SalesOrderHeader soh
WHERE
	soh.ModifiedDate < DATEADD(day, -1, GETUTCDATE());
	
SELECT TOP 100
	soh.ModifiedDate,
	soh.SalesOrderID
FROM Sales.SalesOrderHeader soh
WHERE
	soh.ModifiedDate < GETUTCDATE() - 1;

So, if we had an index on the OrderDate column, would it perform differently?

Apparently not. Same plan, same cost. But when you think about it, this tests the date math by subtracting an integer from the DATETIME provided by GETUTCDATE(). The original was subtracting a COALESCE of three values. One of those values was a float. Could the COALESCE or the resulting data type have made this more complicated?

Testing COALESCE


USE AdventureWorks2014
GO
DECLARE
	@AccoutSetting FLOAT = 8.0,
	@Default INT = 365;
	
SELECT TOP 100
	soh.ModifiedDate,
	soh.SalesOrderID
FROM Sales.SalesOrderHeader soh WITH(INDEX(IX_SalesOrderHeader_ModifiedDate))
WHERE
	soh.ModifiedDate < DATEADD(DAY, -COALESCE(@AccoutSetting, @Default), GETUTCDATE());

Running this, again we see a nice seek that reads and returns 100 rows. So the different data type and the COALESCE makes no difference.

Looking at the original query again, the first value isn’t a variable, it’s a column from a different table. We can’t filter by this column until we’ve read the other table, which affects our join order. But we have no criteria to seek the second table with.

Joined Filtering

One more test. Let’s see what the behavior looks like if we join to the Customers table to look for the RetentionPeriod. First, I’ll create and populate some data in that column:


USE AdventureWorks2014
GO
IF NOT EXISTS(
	SELECT 1
	FROM INFORMATION_SCHEMA.COLUMNS isc
	WHERE
		isc.TABLE_NAME = 'Customer'
		AND isc.TABLE_SCHEMA = 'Sales'
		AND isc.COLUMN_NAME = 'RetentionPeriod'
)
BEGIN
	ALTER TABLE Sales.Customer
		ADD RetentionPeriod INT NULL;
END;
GO
UPDATE TOP (100) sc 
	SET sc.RetentionPeriod = (sc.CustomerID%4+1)*90
FROM Sales.Customer sc;
GO
CREATE INDEX IX_Customer_RetentionPeriod ON Sales.Customer (RetentionPeriod);
GO

I only populated a few of the records to better match the production issue; only some accounts have the value I’m looking for, hence the COALESCE.


USE AdventureWorks2014
GO
DECLARE
	@Default INT = 365;

SELECT TOP 100
	soh.ModifiedDate,
	soh.SalesOrderID,
	sc.RetentionPeriod
FROM Sales.SalesOrderHeader soh
LEFT JOIN Sales.Customer sc
	ON sc.CustomerID = soh.CustomerID
	AND sc.RetentionPeriod IS NOT NULL
WHERE
	soh.ModifiedDate < DATEADD(DAY, -COALESCE(sc.RetentionPeriod, @Default), GETUTCDATE());

Now we’re trying to filter records in SalesOrderHeader, based on the Customer’s RetentionPeriod. How does this perform?

 

Well, the row counts aren’t terrible, but we are scanning both tables. The optimizer opted to start with Customer table, which is much smaller. We’re not filtering on the date until the filter operator, after the merge join.

I’d be worried that with a larger batch size or tables that don’t line up for a merge join, we’d just end up doing a hash match. That would force us to scan the first table, and without any filter criteria that would be a lot of reads.

Solution

The solution I applied to my production query was to create a temp table that had the account ID and the related retention period. I joined this to my primary table, and the query that was taking 84 seconds was replaced with one that took around 20 milliseconds.

 


USE AdventureWorks2014
GO
IF OBJECT_ID('tempdb..#AccountDates') IS NULL
BEGIN
	CREATE TABLE #AccountDates(
		CustomerID INT,
		GCDate DATETIME
	);
END
GO
DECLARE
	@Default INT = 365;

INSERT INTO #AccountDates
SELECT 
	sc.CustomerID,
	DATEADD(DAY, -COALESCE(sc.RetentionPeriod, @Default), GETUTCDATE()) AS GCDate
FROM Sales.Customer sc
WHERE
	sc.RetentionPeriod IS NOT NULL;

SELECT 
	soh.SalesOrderID
FROM #AccountDates ad
JOIN Sales.SalesOrderHeader soh
	ON soh.CustomerID = ad.CustomerID
	AND soh.ModifiedDate < ad.GCDate;

TRUNCATE TABLE #AccountDates;
GO

Here’s how I’d apply that thought to our example. I’ve created a temp table with CustomerID and its associated RetentionDate, so we could use that to search Sales.SalesOrderHeader.

You may have noticed my filter on the Sales.Customer table. In the live issue, the temp table had dozens of rows for dozens of accounts, not the tens of thousands I’d get from using all rows Sales.Customer in my example. So, I filtered it to get a similar size in my temp table.

Infinitely better. Literally in this case, since the ElapsedTime and ActualElapsedms indicators in the plan XML are all 0 milliseconds. I’m declaring victory.

If you liked this post, please follow me on twitter or contact me if you have questions.

How Relevant is the TOP operator?

I’ve explained what a blocking operator is and provided a few examples, but maybe this doesn’t seem important. It’s affecting the TOP operator, sure, but don’t people just use this to look at the TOP 1000 rows of their tables in SSMS?

The TOP operator is useful for many operations, especially in a large environment. Large operation can timeout or fail for a variety of reasons, consuming resources without providing the results you need. A small, batch-sized operation is more likely to succeed and tends to perform more consistently. Many maintenance operations make sense to run with a TOP operator, so we should make sure those operations aren’t stymied by blocking operators. Some examples:

  • Garbage collection on a table with many millions rows. You want this to perform quickly, but you really can’t afford for this to time out (whatever that timeout period may be). We can limit how long this runs by GC’ing a small batch at a time, but this can be hampered badly by an extra sort or a hash join.
  • Archiving data applies for the same reasoning as GC. If you archive data to another table\database\server, you’ll want to keep your operation small enough to manage.
  • Backfilling a new column. If the existing table is large, you can’t just UPDATE the whole table; you’ll lock the table and block all your users. A batched UPDATE in a loop or in a scheduled process can resolve this without causing an outage.
  • GDPR is here, and CCPA is coming. You may need to anonymize data across many related tables. We need this to perform well to cleanse our existing data, and we’ll continue running this process going forward.
  • Queries producing potentially large results to your application may need to be batched as well. If this times out, you’re still wasting a lot of time and reads, even if no data is changed.

Out the Window

I examined one process recently that was similar to this query, causing a GC operation to time out.

SELECT TOP 100
	inv.InvoiceID,
	ili.InvoiceLineID,
	ROW_NUMBER() OVER(Partition By inv.CustomerID ORDER BY inv.InvoiceDate) AS SortID
FROM Sales.Invoices inv
JOIN Sales.InvoiceLines ili
	ON ili.InvoiceID = inv.InvoiceID
WHERE
	inv.InvoiceDate < DATEADD(day, -60, GETUTCDATE()) 
	AND ili.LastEditedWhen < DATEADD(day, -60, GETUTCDATE()) 
ORDER BY SortID 

It was a SELECT statement, and it inserted the important fields into a temp table, and ran DELETEs against multiple tables based on the contents of the temp table. But it was the initial SELECT that had a poor plan and caused the timeout.

I quickly saw a SORT in the execution plan and wondered why. The actual query didn’t have an ORDER BY clause. But it did have a ROW_NUMBER OVER in the select list; took me a minute to notice that.

But was the sorting necessary? “We need to delete really old records in this table, but it’s vitally important that we delete them in the order they were originally inserted!”

It seemed a poor reason to sort a table with tens of millions of rows. Coupled with the very small batch size, we were doing an extraordinary amount of work to get rid of a few rows. So what if we commented the ROW_NUMBER and the ORDER BY out?

Even though the new plan has a scan operator, we only read 110 rows from it because we are using the TOP operator properly. Note the row counts from the first plan again. We have 479 thousand rows going through multiple operators in the first, but only 100 per operation in the second.

Avoid one Blocking Operator, find another

Here’s an example from recent work. I was looking at a query in a GC that was populating a temp table to use for later DELETEs. I was anticipating the optimizer to try a hash match join, so I used an INNER LOOP JOIN hint to avoid that blocking operator. The results were quite unpleasant, as you can see in this anonymized plan.

So, I avoided the hash match join, but the SQL Server optimizer didn’t see it the way I did. The first table is a temp table, Object6, which we are joining to a normal table, Object11. But this plan includes a table spool that not only forces us to read all 587741 rows in the seek against our table, it seems to create a cross join in memory between the results of the clustered index scan on the temp table and the clustered index seek against the base table (538 * 587741 = 316204658).

That hint obviously wouldn’t work. I reversed the order of the tables, then removed the hint altogether, giving the following:

 

2000 rows returned across the board. There is a SORT operation, but it’s after the TOP so there’s no harm.. Our results are being sorted before inserting them into a second temp table. And a much more performant query, taking ~8 ms instead of 75000 ms.

Sort out blocking operators

Hopefully this has been informative. I honestly wasn’t aware of blocking operators until a month or two ago. That’s the frustrating and interesting thing of SQL Server sometimes; you can work on this as long as I have and, even putting new development aside, there’s always more to learn.

Hope this helps you optimize some of your own operations.