Thursday, June 27, 2013

Disk Space Forecasting Using Linear Regression and SQL CLR


Many times I have needed to forecast a future value based on past history. Most often this has to do with growth in one form or another. In my first experiment in this area it was to proactively evaluate daily imported record counts to identify obvious failures in the import process. Too many records could indicate a double import and too few could indicate a corrupted import file. My most recent and most common use for this has been trying to keep ahead of the rapidly growing databases by making sure that disk space is being allocated before it is being needed.

If you maintain a history of past values at a meaningful increment, you can use that data with Linear Regression to forecast the likely future value assuming that no new variables are introduced. Common variables include, changes in usage, more indexing, modification of archiving processes.... Us DBA's are good but we're not psychic and we can not predict everything that could happen, but what we can do is give a pretty accurate estimate based on the data available.

 I have been working on this problem for some time and have finally moved into SQL-CLR to make another improvement to my process. Lets start with some background data on the method I chose to solve this problem. Lets start with some basic parts to the formula.

Metric M.
Forecast Key X.
Count C.
Slope: B= (C * SXY - (SX)(SY)) / (C * SX2 - (SX)2) 
Y-Intercept: A= (SY - B(SX)) / C 
Seasonality: Metric/A+BX.
Forecast: A+BX. 

Essentially we need to calculate Slope(B) and Y-Intercept(A) from the past Data to then calculate the Forecast Value.

I originally had done this entirely in T-SQL and will probably post some examples of it later but in this article, I want to focus on a few new SQL-CLR Functions I created based on David Poole's article A Genuine Use for a SQL CLR Aggregate which do this much faster and more robustly. The three new functions I created were Slope, RSquared, and Intercept.

Now that I have these functions as Aggregates, I can calculate each new record to be forecast. First I start by evaluating the Historical Data with something Like this:

SELECT [DatabaseName]
,dbaadmin.dbo.slope([Period],[DataSize]) B_DataSize
,dbaadmin.dbo.Intercept([Period],[DataSize]) A_DataSize
,dbaadmin.dbo.slope([Period],[IndexSize]) B_IndexSize
,dbaadmin.dbo.Intercept([Period],[IndexSize]) A_IndexSize FROM [DBSizeHistory]

GROUP BY [DatabaseName]

From this I got the main ingredients needed to calculate the future entries and this was just fine when I was validating daily record imports because I was only looking one day into the future but I have found that when looking a year out, looking at the last year to plot the next year created strange artifacts as s;\pikes or dips that were based on events like some of those "Variables" I mentioned earlier like archiving a bunch of data, importing new data, or dropping a table. Most of those things shouldn't be part of my calculations. For these reasons I modified my process a bit to smooth out the bumps and use a sliding window to calculate the slope and intercept. First I used a specialized moving average to smooth out the data.


;WITH Window
AS
(
SELECT DISTINCT
Period
,([Period]*100)/(SELECT MAX(Period) FROM #CleanHistory)/10 [Pos]
,(((SELECT MAX(Period)+1 FROM #CleanHistory)- Period)*100)

/(SELECT MAX(Period) FROM #CleanHistory)/10 [pre]
FROM #CleanHistory
)
select R.*
,(SELECT avg(isnull(DataSize,0)) FROM #CleanHistory Where ServerName = R.ServerName 

AND DatabaseName = R.DatabaseName and Period 
Between (R.Period - [Pre]) AND (R.Period + [Pos])  
) as Smoothed_MetricA
,(SELECT avg(isnull(IndexSize,0)) FROM #CleanHistory Where ServerName = R.ServerName 
AND DatabaseName = R.DatabaseName and Period 
Between (R.Period - [Pre]) AND (R.Period + [Pos])  
) as Smoothed_MetricB
INTO #Results  
from #CleanHistory R
JOIN Window W
ON R.Period = W.Period




With this chunk of code I dynamically adjust the number of records before and after that I am averaging. I want the deviation from the actual value to be much less at the beginning and end of the line. The effect I am going for is like if the line of data was a string and I was simple pulling it a little tighter but leaving the end points where they were. I am not sure how this affects the accuracy, but I was pleased with the cosmetics in a chart.

Then my next step was to use a sliding window to re-calculate the slope and intercept for each new record I created.

SELECT [ServerName]
,[DatabaseName]
,dbaadmin.dbo.slope([Period],Smoothed_MetricA) B_MetricA
,dbaadmin.dbo.slope([Period],Smoothed_MetricB) B_MetricB
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricA) A_MetricA
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricB) A_MetricB
FROM #Results
WHERE [Period] IN (SELECT DISTINCT TOP (50) [Period] FROM #Results ORDER BY [Period] DESC)
GROUP BY [ServerName]
,[DatabaseName]


This Snippet uses a lookup in the where clause to make sure I am only evaluating the 50 latest periods in the table. In this case the last 50 Days. But, as each new record is added to the results it is then used in the next days calculation and the inevitable output is a much softer cure and eventually a smooth line in the general direction of growth rather than a hard, jagged or spiky line that often generated hard to answer questions about why we needed 500GB of space added to a drive for 5 days in the future before the line dropped back below our current limit.

Now this brings us to the main part of the process the generation of the forecasted values.

;WITH B
AS
(
SELECT [ServerName]
,[DatabaseName]
,dbaadmin.dbo.slope([Period],Smoothed_MetricA) B_MetricA
,dbaadmin.dbo.slope([Period],Smoothed_MetricB) B_MetricB
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricA) A_MetricA
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricB) A_MetricB
FROM #Results
WHERE [Period] IN (SELECT DISTINCT TOP (50) [Period] FROM #Results ORDER BY [Period] DESC)
GROUP BY [ServerName]
,[DatabaseName]
)
INSERT INTO #Results ([ServerName],[DatabaseName],[Period],[DataSize],[IndexSize],[Smoothed_MetricA],[Smoothed_MetricB])
SELECT A.[ServerName]
,A.[DatabaseName]
,A.[Period] + 1
,B.A_MetricA + (B.B_MetricA * (A.[Period] + 1))
,B.A_MetricB + (B.B_MetricB * (A.[Period] + 1))
,B.A_MetricA + (B.B_MetricA * (A.[Period] + 1))
,B.A_MetricB + (B.B_MetricB * (A.[Period] + 1))
FROM #Results A
INNER JOIN B
ON A.[ServerName] = B.[ServerName]
AND A.[DatabaseName] = B.[DatabaseName]
WHERE A.[Period] = (SELECT MAX([Period]) FROM #Results)



This is executed one time for each new period (Day) forecasted in the table. In this example I am putting the same value into the value and smoothed-value fields. and as you can see here the formula is as simple as..
    (A+BX)  or....... Intercept + (Slope * New_Period) ................Ta Da.......

There are many things that I have learned in taking this information from an idea to reality such as the previously mentioned steps to smooth the data and only use the last 50 days in the regression. One of the other key things I have done in order to improve the quality of the output is to ensure that the the data being evaluated is clean and regular. For example, when I started this process I was working of a per-existing table that another dba had created and populated. his population process was only weekly but could be run more often in some cases like after a reboot or when a health check report was kicked off. this made the data uneven in the gaps between records and sometimes several records for the same day. This caused the forecasting curve to be uneven and odd. For this reason I decided to populate a holding table that had exactly 1 record per day and filled in the empty spots with the last known values.

To start this process off, I chose to create a couple of functions that should be in every dba's toolbox. dbaudf_TimeTable and dbaudf_TimeDimmension. These first function just generated a table output of datetime values starting from one input parameter going to another and incrementing by two other parameters. If you can't figure out how to use it, you probably shouldn't. The second function uses the first but returns all of the values broken out into all of the parts used for a datetime dimension. I have found it extremely useful for data warehouses, cubes, and linear regression and it will get even more use once I have more real historical data and start to incorporate seasonality into my calculations.

IF OBJECT_ID('dbaudf_TimeTable') IS NOT NULL
DROP FUNCTION [dbo].[dbaudf_TimeTable]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO


CREATE FUNCTION [dbo].[dbaudf_TimeTable]  
    ( 
    @StartDateTime DateTime
    ,@EndDateTime DateTime
    ,@Interval sysname
    ,@IntervalCount Int 
    ) 
RETURNS @TimeTable TABLE(DateTimeValue DATETIME NOT NULL)
AS 
BEGIN 
    DECLARE @DateTime DateTime
    SELECT @DateTime = @StartDateTime
    WHILE @DateTime  <= @EndDateTime 
    BEGIN 
        INSERT INTO @TimeTable (DateTimeValue)
        SELECT @DateTime
        
        SET @DateTime = CASE @Interval
WHEN 'year' THEN Dateadd(year ,@IntervalCount, @DateTime)
WHEN 'quarter' THEN Dateadd(quarter ,@IntervalCount, @DateTime)
WHEN 'month' THEN Dateadd(month ,@IntervalCount, @DateTime)
WHEN 'dayofyear' THEN Dateadd(dayofyear ,@IntervalCount, @DateTime)
WHEN 'day' THEN Dateadd(day ,@IntervalCount, @DateTime)
WHEN 'week' THEN Dateadd(week ,@IntervalCount, @DateTime)
WHEN 'weekday' THEN Dateadd(weekday ,@IntervalCount, @DateTime)
WHEN 'hour' THEN Dateadd(hour ,@IntervalCount, @DateTime)
WHEN 'minute' THEN Dateadd(minute ,@IntervalCount, @DateTime)
WHEN 'second' THEN Dateadd(second ,@IntervalCount, @DateTime)
WHEN 'millisecond' THEN Dateadd(millisecond ,@IntervalCount, @DateTime)
WHEN 'microsecond' THEN Dateadd(microsecond ,@IntervalCount, @DateTime)
WHEN 'nanosecond' THEN Dateadd(nanosecond ,@IntervalCount, @DateTime)
ELSE @DateTime + @IntervalCount
END
    END 
    RETURN  
END 
GO


IF OBJECT_ID('dbaudf_TimeDimension') IS NOT NULL
DROP FUNCTION [dbo].[dbaudf_TimeDimension]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE FUNCTION [dbo].[dbaudf_TimeDimension]
    ( 
    @StartDateTime DateTime
    ,@EndDateTime DateTime
    ,@Interval sysname
    ,@IntervalCount Int 
    ) 
RETURNS TABLE
AS RETURN
(
SELECT CONVERT(bigint,REPLACE(REPLACE(REPLACE(REPLACE(CONVERT(VarChar(50),[DateTimeValue],121),' ',''),':',''),'-',''),'.','')) AS TimeKey
,[DateTimeValue]
,DATEPART(year,[DateTimeValue]) AS DatePart_year
,DATEPART(quarter,[DateTimeValue]) AS DatePart_quarter
,DATEPART(month,[DateTimeValue]) AS DatePart_month
,DATEPART(dayofyear,[DateTimeValue]) AS DatePart_dayofyear
,DATEPART(day,[DateTimeValue]) AS DatePart_day
,DATEPART(week,[DateTimeValue]) AS DatePart_week
,DATEPART(weekday,[DateTimeValue]) AS DatePart_weekday
,DATEPART(hour,[DateTimeValue]) AS DatePart_hour
,DATEPART(minute,[DateTimeValue]) AS DatePart_minute
,DATEPART(second,[DateTimeValue]) AS DatePart_second
,DATEPART(millisecond,[DateTimeValue]) AS DatePart_millisecond
,DATEPART(microsecond,[DateTimeValue]) AS DatePart_microsecond
,DATEPART(nanosecond,[DateTimeValue]) AS DatePart_nanosecond
,DATEPART(ISO_WEEK,[DateTimeValue]) AS DatePart_ISO_WEEK
,DATENAME(year,[DateTimeValue]) AS DateName_year
,DATENAME(quarter,[DateTimeValue]) AS DateName_quarter
,DATENAME(month,[DateTimeValue]) AS DateName_month
,DATENAME(dayofyear,[DateTimeValue]) AS DateName_dayofyear
,DATENAME(day,[DateTimeValue]) AS DateName_day
,DATENAME(week,[DateTimeValue]) AS DateName_week
,DATENAME(weekday,[DateTimeValue]) AS DateName_weekday
,DATENAME(hour,[DateTimeValue]) AS DateName_hour
,DATENAME(minute,[DateTimeValue]) AS DateName_minute
,DATENAME(second,[DateTimeValue]) AS DateName_second
,DATENAME(millisecond,[DateTimeValue]) AS DateName_millisecond
,DATENAME(microsecond,[DateTimeValue]) AS DateName_microsecond
,DATENAME(nanosecond,[DateTimeValue]) AS DateName_nanosecond
FROM dbaadmin.dbo.dbaudf_TimeTable(@StartDateTime,@EndDateTime,@Interval,@IntervalCount)
)
GO



With those two functions ready to go I was then able to use them in a CTE which built my #CleanHistory table to work from.

;WITH DateDimension
AS
(
SELECT ROW_NUMBER()OVER(ORDER BY [TimeKey]) [Period]
,T2.*
FROM (
SELECT MinDate = CAST(CONVERT(VarChar(12),MIN([rundate]),101)AS DateTime)
,MaxDate = CAST(CONVERT(VarChar(12),MAX([rundate]),101)AS DateTime)
FROM [dbaperf].[dbo].[db_stats_log] [db_stats_log] WITH(NOLOCK)
) T1
CROSS APPLY [dbaadmin].[dbo].[dbaudf_TimeDimension](T1.MinDate,T1.MaxDate,'Day',1) T2
)
,SlidingWindow
AS
(
SELECT DISTINCT
Period
,([Period]*100)/(SELECT MAX(Period) FROM DateDimension)/10 [Pos]
,(((SELECT MAX(Period)+1 FROM DateDimension)- Period)*100)

/(SELECT MAX(Period) FROM DateDimension)/10 [pre]
FROM DateDimension
)
,RawData
AS
(
SELECT [ServerName]
,[DatabaseName]
,CAST(CONVERT(VarChar(12),[rundate],101)AS DateTime) [rundate]
,SUM(CAST([data_space_used_KB] AS FLOAT) / 1024) DataSize
,SUM(CAST([index_size_used_KB] AS FLOAT) / 1024) IndexSize
FROM [dbaperf].[dbo].[db_stats_log] [db_stats_log] WITH(NOLOCK)
GROUP BY [ServerName]
,[DatabaseName]
,[rundate]
)
,ResultKey
AS
(
SELECT DISTINCT
[RawData].[ServerName]
,[RawData].[DatabaseName]
,[DateDimension].[Period]
,[DateDimension].[DateTimeValue]
FROM [DateDimension]
CROSS JOIN [RawData]
)
,History
AS
(
SELECT [ResultKey].[ServerName]
,[ResultKey].[DatabaseName]
,[ResultKey].[Period]
,[RawData].[DataSize]
,[RawData].[IndexSize]
FROM [ResultKey]
LEFT JOIN [RawData]
ON [ResultKey].[DateTimeValue] = [RawData].[rundate]
AND [ResultKey].[ServerName] = [RawData].[ServerName]
AND [ResultKey].[DatabaseName] = [RawData].[DatabaseName]
)
SELECT [ServerName]
,[DatabaseName]
,[Period]
,CASE
WHEN [DataSize] IS NULL
THEN (SELECT TOP 1 [DataSize] FROM History WHERE [DataSize] IS NOT NULL 

AND [ServerName] = H.[ServerName] 
AND [DatabaseName] = H.[DatabaseName] 
AND [Period] < H.[Period] ORDER BY [Period] DESC)
ELSE [DataSize] END [DataSize]

,CASE
WHEN [IndexSize] IS NULL
THEN (SELECT TOP 1 [IndexSize] FROM History WHERE [IndexSize] IS NOT NULL 

AND [ServerName] = H.[ServerName] 
AND [DatabaseName] = H.[DatabaseName] 
AND [Period] < H.[Period] ORDER BY [Period] DESC)
ELSE [IndexSize] END [IndexSize]
INTO #CleanHistory
FROM History H




And now we have gone full circle to feed the query at the top of this article.



C# Code for CLR Aggregate Functions.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;


[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
        Format.Native,                      //use clr serialization to serialize the intermediate result
        IsInvariantToNulls = false,         //optimizer property
        IsInvariantToDuplicates = false,    //optimizer property
        IsInvariantToOrder = true           //optimizer property
        )]
public struct Slope
{
    private long count;
    private SqlDouble SumY;
    private SqlDouble SumXY;
    private SqlDouble SumX2;
    private SqlDouble SumY2;
    private SqlDouble SumX;

    public void Init()
    {
        count = 0;
        SumX = SumY = SumXY = SumX2 = SumY2 = 0;
    }

    public void Accumulate(SqlDouble x, SqlDouble y)
    {
        if (!x.IsNull && !y.IsNull)
        {
            count++;
            SumX += (SqlDouble)x;
            SumY += (SqlDouble)y;
            SumXY += (SqlDouble)x * (SqlDouble)y;
            SumX2 += (SqlDouble)x * (SqlDouble)x;
            SumY2 += (SqlDouble)y * (SqlDouble)y;
        }
    }

    public void Merge(Slope other)
    {
        count += other.count;
        SumY += other.SumY;
        SumXY += other.SumXY;
        SumX2 += other.SumX2;
        SumY2 += other.SumY2;
        SumX += other.SumX;
    }

    public SqlDouble Terminate()
    {
        if (count > 1)
        {
            SqlDouble value = (count * SumXY - (SumX * SumY)) / ((count * SumX2) - (SumX * SumX));
            return value;
        }
        else
        {
            return 0;
        }
    }

}


[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
        Format.Native,                      //use clr serialization to serialize the intermediate result
        IsInvariantToNulls = false,         //optimizer property
        IsInvariantToDuplicates = false,    //optimizer property
        IsInvariantToOrder = true           //optimizer property
        )]
public struct RSquared
{
    private long count;
    private SqlDouble SumY;
    private SqlDouble SumXY;
    private SqlDouble SumX2;
    private SqlDouble SumY2;
    private SqlDouble SumX;

    public void Init()
    {
        count = 0;
        SumX = SumY = SumXY = SumX2 = SumY2 = 0;
    }

    public void Accumulate(SqlDouble x, SqlDouble y)
    {
        if (!x.IsNull && !y.IsNull)
        {
            count++;
            SumX += (SqlDouble)x;
            SumY += (SqlDouble)y;
            SumXY += (SqlDouble)x * (SqlDouble)y;
            SumX2 += (SqlDouble)x * (SqlDouble)x;
            SumY2 += (SqlDouble)y * (SqlDouble)y;
        }
    }

    public void Merge(RSquared other)
    {
        count += other.count;
        SumY += other.SumY;
        SumXY += other.SumXY;
        SumX2 += other.SumX2;
        SumY2 += other.SumY2;
        SumX += other.SumX;
    }

    public SqlDouble Terminate()
    {
        if (count > 1)
        {
            if (Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY))) != 0)
            {
                SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
                return value * value;
            }
            else
            {
                return 0;
            }
        }
        else
        {
            return 0;
        }
    }

}


[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
        Format.Native,                      //use clr serialization to serialize the intermediate result
        IsInvariantToNulls = false,         //optimizer property
        IsInvariantToDuplicates = false,    //optimizer property
        IsInvariantToOrder = true           //optimizer property
        )]
public struct Intercept
{
    private long count;
    private SqlDouble SumY;
    private SqlDouble SumXY;
    private SqlDouble SumX2;
    private SqlDouble SumY2;
    private SqlDouble SumX;

    public void Init()
    {
        count = 0;
        SumX = SumY = SumXY = SumX2 = SumY2 = 0;
    }

    public void Accumulate(SqlDouble x, SqlDouble y)
    {
        if (!x.IsNull && !y.IsNull)
        {
            count++;
            SumX += (SqlDouble)x;
            SumY += (SqlDouble)y;
            SumXY += (SqlDouble)x * (SqlDouble)y;
            SumX2 += (SqlDouble)x * (SqlDouble)x;
            SumY2 += (SqlDouble)y * (SqlDouble)y;
        }
    }

    public void Merge(Intercept other)
    {
        count += other.count;
        SumY += other.SumY;
        SumXY += other.SumXY;
        SumX2 += other.SumX2;
        SumY2 += other.SumY2;
        SumX += other.SumX;
    }

    public SqlDouble Terminate()
    {

        if (count > 1)
        {
            SqlDouble slopeValue = (count * SumXY - (SumX * SumY)) / ((count * SumX2) - (SumX * SumX));
            SqlDouble interceptValue = (SumY - (slopeValue * SumX)) / count;

            return interceptValue;
        }
        else
        {
            return 0;
        }
    }

}