Thursday, June 27, 2013

Disk Space Forecasting Using Linear Regression and SQL CLR


Many times I have needed to forecast a future value based on past history. Most often this has to do with growth in one form or another. My first experiment in this area was to proactively evaluate daily imported record counts to identify obvious failures in the import process: too many records could indicate a double import, and too few could indicate a corrupted import file. My most recent and most common use has been trying to keep ahead of rapidly growing databases by making sure that disk space is allocated before it is needed.

If you maintain a history of past values at a meaningful increment, you can use that data with linear regression to forecast the likely future value, assuming that no new variables are introduced. Common variables include changes in usage, more indexing, modifications to archiving processes, and so on. We DBAs are good, but we're not psychic and we cannot predict everything that could happen; what we can do is give a pretty accurate estimate based on the data available.
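For context, the history table used in the examples below doesn't need much. Here is a minimal sketch (the exact DDL is an assumption on my part; the column names simply mirror the queries later in this post):

CREATE TABLE [dbo].[DBSizeHistory]
(
     [ServerName]   SYSNAME NOT NULL
    ,[DatabaseName] SYSNAME NOT NULL
    ,[Period]       INT     NOT NULL  -- one row per database per day; the X value in the regression
    ,[DataSize]     FLOAT   NULL      -- data size metric; a Y value in the regression
    ,[IndexSize]    FLOAT   NULL      -- index size metric; a Y value in the regression
)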

I have been working on this problem for some time and have finally moved into SQL CLR to make another improvement to my process. Let's start with some background on the method I chose and the basic parts of the formula.

Metric: M (the measured value; the Y in the sums below)
Forecast Key: X (the period)
Count: C (the number of history points)
Slope: B = (C * Sum(X*Y) - Sum(X) * Sum(Y)) / (C * Sum(X*X) - Sum(X) * Sum(X))
Y-Intercept: A = (Sum(Y) - B * Sum(X)) / C
Seasonality: Metric / (A + B*X)
Forecast: A + B*X

Essentially, we need to calculate the slope (B) and Y-intercept (A) from the past data and then use them to calculate the forecast value.
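To make those formulas concrete before getting to the CLR versions, here is a minimal sketch of the slope and Y-intercept calculated with nothing but built-in aggregates, against a hypothetical #History table with [Period] as X and [DataSize] as Y (illustration only; it assumes at least two distinct periods):

;WITH S AS
(
    SELECT  C   = COUNT(*)
           ,SX  = SUM(CAST([Period] AS FLOAT))
           ,SY  = SUM(CAST([DataSize] AS FLOAT))
           ,SXY = SUM(CAST([Period] AS FLOAT) * [DataSize])
           ,SX2 = SUM(CAST([Period] AS FLOAT) * [Period])
    FROM #History
)
SELECT  B.B                             -- Slope:       (C*SXY - SX*SY) / (C*SX2 - SX*SX)
       ,A = (S.SY - (B.B * S.SX)) / S.C -- Y-Intercept: (SY - B*SX) / C
FROM S
CROSS APPLY (SELECT B = (S.C * S.SXY - (S.SX * S.SY)) / (S.C * S.SX2 - (S.SX * S.SX))) B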

I originally did this entirely in T-SQL and will probably post some examples of that later, but in this article I want to focus on a few new SQL CLR aggregate functions I created, based on David Poole's article A Genuine Use for a SQL CLR Aggregate, which do this much faster and more robustly. The three new functions are Slope, RSquared, and Intercept.
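For completeness, deploying them looks roughly like this. The assembly name and DLL path below are assumptions on my part; the parameter order, X then Y, matches how the aggregates are called in the queries that follow.

-- Assembly name and path are illustrative; compile the C# code at the end of this post first.
CREATE ASSEMBLY [LinearRegression]
FROM 'C:\CLR\LinearRegression.dll'
WITH PERMISSION_SET = SAFE
GO
-- Each aggregate takes the X (Period) column first and the Y (metric) column second.
CREATE AGGREGATE [dbo].[Slope]     (@x FLOAT, @y FLOAT) RETURNS FLOAT EXTERNAL NAME [LinearRegression].[Slope]
GO
CREATE AGGREGATE [dbo].[Intercept] (@x FLOAT, @y FLOAT) RETURNS FLOAT EXTERNAL NAME [LinearRegression].[Intercept]
GO
CREATE AGGREGATE [dbo].[RSquared]  (@x FLOAT, @y FLOAT) RETURNS FLOAT EXTERNAL NAME [LinearRegression].[RSquared]
GO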

Now that I have these functions as aggregates, I can calculate each new record to be forecast. I start by evaluating the historical data with something like this:

SELECT [DatabaseName]
,dbaadmin.dbo.Slope([Period],[DataSize]) B_DataSize
,dbaadmin.dbo.Intercept([Period],[DataSize]) A_DataSize
,dbaadmin.dbo.Slope([Period],[IndexSize]) B_IndexSize
,dbaadmin.dbo.Intercept([Period],[IndexSize]) A_IndexSize
FROM [DBSizeHistory]
GROUP BY [DatabaseName]

From this I got the main ingredients needed to calculate the future entries. This was just fine when I was validating daily record imports, because I was only looking one day into the future. But I have found that when looking a year out, using the last year of data to plot the next year created strange artifacts, such as spikes or dips, driven by events like some of those "variables" I mentioned earlier: archiving a bunch of data, importing new data, or dropping a table. Most of those things shouldn't be part of my calculations. For these reasons I modified my process a bit to smooth out the bumps and to use a sliding window to calculate the slope and intercept. First, I used a specialized moving average to smooth out the data.


;WITH Window
AS
(
SELECT DISTINCT
Period
,([Period]*100)/(SELECT MAX(Period) FROM #CleanHistory)/10 [Pos]
,(((SELECT MAX(Period)+1 FROM #CleanHistory) - Period)*100)/(SELECT MAX(Period) FROM #CleanHistory)/10 [Pre]
FROM #CleanHistory
)
SELECT R.*
,(SELECT AVG(ISNULL(DataSize,0)) FROM #CleanHistory
  WHERE ServerName = R.ServerName
  AND DatabaseName = R.DatabaseName
  AND Period BETWEEN (R.Period - [Pre]) AND (R.Period + [Pos])
 ) AS Smoothed_MetricA
,(SELECT AVG(ISNULL(IndexSize,0)) FROM #CleanHistory
  WHERE ServerName = R.ServerName
  AND DatabaseName = R.DatabaseName
  AND Period BETWEEN (R.Period - [Pre]) AND (R.Period + [Pos])
 ) AS Smoothed_MetricB
INTO #Results
FROM #CleanHistory R
JOIN Window W
ON R.Period = W.Period




With this chunk of code I dynamically adjust the number of records before and after each point that I am averaging. I want the deviation from the actual value to be much smaller at the beginning and end of the line. The effect I am going for is as if the line of data were a string and I were simply pulling it a little tighter while leaving the end points where they were. I am not sure how this affects the accuracy, but I was pleased with the cosmetics in a chart.
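To see what the window math actually works out to, here is an illustration only, assuming 100 total periods of history: near the start the window reaches mostly backward past the data (so the average is effectively the actual value), near the end it reaches mostly forward past the data, and in the middle it becomes a roughly 11-row centered average.

-- Illustration: evaluate the [Pos]/[Pre] expressions for a few sample periods,
-- hard-coding MAX(Period) = 100. Integer division is intentional, as in the CTE above.
SELECT  Period
       ,[Pos] = (Period * 100) / 100 / 10
       ,[Pre] = (((100 + 1) - Period) * 100) / 100 / 10
FROM (VALUES (1),(10),(50),(90),(100)) AS T(Period)
-- Period   1: Pre = 10, Pos =  0 -> window reaches back past the data; average ~ actual value
-- Period  50: Pre =  5, Pos =  5 -> ~11-row centered average; real smoothing
-- Period 100: Pre =  0, Pos = 10 -> window reaches forward past the data; average ~ actual value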

Then my next step was to use a sliding window to re-calculate the slope and intercept for each new record I created.

SELECT [ServerName]
,[DatabaseName]
,dbaadmin.dbo.slope([Period],Smoothed_MetricA) B_MetricA
,dbaadmin.dbo.slope([Period],Smoothed_MetricB) B_MetricB
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricA) A_MetricA
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricB) A_MetricB
FROM #Results
WHERE [Period] IN (SELECT DISTINCT TOP (50) [Period] FROM #Results ORDER BY [Period] DESC)
GROUP BY [ServerName]
,[DatabaseName]


This snippet uses a lookup in the WHERE clause to make sure I am only evaluating the 50 latest periods in the table, in this case the last 50 days. As each new record is added to the results, it is used in the next day's calculation. The inevitable output is a much softer curve, and eventually a smooth line in the general direction of growth, rather than a hard, jagged, or spiky line that often generated hard-to-answer questions about why we needed 500GB added to a drive for 5 days in the future before the line dropped back below our current limit.

Now this brings us to the main part of the process: the generation of the forecasted values.

;WITH B
AS
(
SELECT [ServerName]
,[DatabaseName]
,dbaadmin.dbo.slope([Period],Smoothed_MetricA) B_MetricA
,dbaadmin.dbo.slope([Period],Smoothed_MetricB) B_MetricB
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricA) A_MetricA
,dbaadmin.dbo.Intercept([Period],Smoothed_MetricB) A_MetricB
FROM #Results
WHERE [Period] IN (SELECT DISTINCT TOP (50) [Period] FROM #Results ORDER BY [Period] DESC)
GROUP BY [ServerName]
,[DatabaseName]
)
INSERT INTO #Results ([ServerName],[DatabaseName],[Period],[DataSize],[IndexSize],[Smoothed_MetricA],[Smoothed_MetricB])
SELECT A.[ServerName]
,A.[DatabaseName]
,A.[Period] + 1
,B.A_MetricA + (B.B_MetricA * (A.[Period] + 1))
,B.A_MetricB + (B.B_MetricB * (A.[Period] + 1))
,B.A_MetricA + (B.B_MetricA * (A.[Period] + 1))
,B.A_MetricB + (B.B_MetricB * (A.[Period] + 1))
FROM #Results A
INNER JOIN B
ON A.[ServerName] = B.[ServerName]
AND A.[DatabaseName] = B.[DatabaseName]
WHERE A.[Period] = (SELECT MAX([Period]) FROM #Results)



This is executed one time for each new period (day) forecasted in the table. In this example I am putting the same value into both the actual-value and smoothed-value fields, and as you can see, the formula is as simple as:
    (A + B*X), or Intercept + (Slope * New_Period). Ta da.
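One way to drive it is a simple loop, one pass per day to forecast. A minimal sketch (the loop itself and the 365-day horizon are just illustrations, not the exact mechanism I use):

DECLARE @DaysToForecast INT = 365;  -- example horizon
DECLARE @Day INT = 1;
WHILE @Day <= @DaysToForecast
BEGIN
    -- The ";WITH B ... INSERT INTO #Results ..." statement from above goes here.
    -- Each pass recalculates the slope and intercept over the latest 50 periods,
    -- which now include the forecast rows added by earlier passes.
    SET @Day += 1;
END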

There are many things I have learned in taking this from an idea to reality, such as the previously mentioned steps to smooth the data and to only use the last 50 days in the regression. One of the other key things I have done to improve the quality of the output is to ensure that the data being evaluated is clean and regular. For example, when I started this process I was working off a pre-existing table that another DBA had created and populated. His population process ran only weekly, but could be run more often in some cases, such as after a reboot or when a health check report was kicked off. This made the gaps between records uneven, and sometimes there were several records for the same day, which caused the forecasting curve to be uneven and odd. For this reason I decided to populate a holding table that had exactly one record per day and filled in the empty spots with the last known values.

To start this process off, I chose to create a couple of functions that should be in every DBA's toolbox: dbaudf_TimeTable and dbaudf_TimeDimension. The first function just generates a table of datetime values starting from one input parameter, going to another, and incrementing by the other two parameters. If you can't figure out how to use it, you probably shouldn't. The second function uses the first but returns all of the values broken out into the parts used for a datetime dimension. I have found it extremely useful for data warehouses, cubes, and linear regression, and it will get even more use once I have more real historical data and start to incorporate seasonality into my calculations.

IF OBJECT_ID('dbaudf_TimeTable') IS NOT NULL
DROP FUNCTION [dbo].[dbaudf_TimeTable]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO


CREATE FUNCTION [dbo].[dbaudf_TimeTable]  
    ( 
    @StartDateTime DateTime
    ,@EndDateTime DateTime
    ,@Interval sysname
    ,@IntervalCount Int 
    ) 
RETURNS @TimeTable TABLE(DateTimeValue DATETIME NOT NULL)
AS 
BEGIN 
    DECLARE @DateTime DateTime
    SELECT @DateTime = @StartDateTime
    WHILE @DateTime  <= @EndDateTime 
    BEGIN 
        INSERT INTO @TimeTable (DateTimeValue)
        SELECT @DateTime
        
        SET @DateTime = CASE @Interval
            WHEN 'year'        THEN DATEADD(year        ,@IntervalCount, @DateTime)
            WHEN 'quarter'     THEN DATEADD(quarter     ,@IntervalCount, @DateTime)
            WHEN 'month'       THEN DATEADD(month       ,@IntervalCount, @DateTime)
            WHEN 'dayofyear'   THEN DATEADD(dayofyear   ,@IntervalCount, @DateTime)
            WHEN 'day'         THEN DATEADD(day         ,@IntervalCount, @DateTime)
            WHEN 'week'        THEN DATEADD(week        ,@IntervalCount, @DateTime)
            WHEN 'weekday'     THEN DATEADD(weekday     ,@IntervalCount, @DateTime)
            WHEN 'hour'        THEN DATEADD(hour        ,@IntervalCount, @DateTime)
            WHEN 'minute'      THEN DATEADD(minute      ,@IntervalCount, @DateTime)
            WHEN 'second'      THEN DATEADD(second      ,@IntervalCount, @DateTime)
            WHEN 'millisecond' THEN DATEADD(millisecond ,@IntervalCount, @DateTime)
            WHEN 'microsecond' THEN DATEADD(microsecond ,@IntervalCount, @DateTime)
            WHEN 'nanosecond'  THEN DATEADD(nanosecond  ,@IntervalCount, @DateTime)
            ELSE @DateTime + @IntervalCount
            END
    END 
    RETURN  
END 
GO
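A quick usage example (illustration only): one datetime value per day for the first week of 2013.

-- Returns seven rows, Jan 1 through Jan 7, 2013, stepping one day at a time.
SELECT DateTimeValue
FROM [dbo].[dbaudf_TimeTable]('2013-01-01','2013-01-07','day',1)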


IF OBJECT_ID('dbaudf_TimeDimension') IS NOT NULL
DROP FUNCTION [dbo].[dbaudf_TimeDimension]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE FUNCTION [dbo].[dbaudf_TimeDimension]
    ( 
    @StartDateTime DateTime
    ,@EndDateTime DateTime
    ,@Interval sysname
    ,@IntervalCount Int 
    ) 
RETURNS TABLE
AS RETURN
(
SELECT CONVERT(bigint,REPLACE(REPLACE(REPLACE(REPLACE(CONVERT(VarChar(50),[DateTimeValue],121),' ',''),':',''),'-',''),'.','')) AS TimeKey
,[DateTimeValue]
,DATEPART(year,[DateTimeValue]) AS DatePart_year
,DATEPART(quarter,[DateTimeValue]) AS DatePart_quarter
,DATEPART(month,[DateTimeValue]) AS DatePart_month
,DATEPART(dayofyear,[DateTimeValue]) AS DatePart_dayofyear
,DATEPART(day,[DateTimeValue]) AS DatePart_day
,DATEPART(week,[DateTimeValue]) AS DatePart_week
,DATEPART(weekday,[DateTimeValue]) AS DatePart_weekday
,DATEPART(hour,[DateTimeValue]) AS DatePart_hour
,DATEPART(minute,[DateTimeValue]) AS DatePart_minute
,DATEPART(second,[DateTimeValue]) AS DatePart_second
,DATEPART(millisecond,[DateTimeValue]) AS DatePart_millisecond
,DATEPART(microsecond,[DateTimeValue]) AS DatePart_microsecond
,DATEPART(nanosecond,[DateTimeValue]) AS DatePart_nanosecond
,DATEPART(ISO_WEEK,[DateTimeValue]) AS DatePart_ISO_WEEK
,DATENAME(year,[DateTimeValue]) AS DateName_year
,DATENAME(quarter,[DateTimeValue]) AS DateName_quarter
,DATENAME(month,[DateTimeValue]) AS DateName_month
,DATENAME(dayofyear,[DateTimeValue]) AS DateName_dayofyear
,DATENAME(day,[DateTimeValue]) AS DateName_day
,DATENAME(week,[DateTimeValue]) AS DateName_week
,DATENAME(weekday,[DateTimeValue]) AS DateName_weekday
,DATENAME(hour,[DateTimeValue]) AS DateName_hour
,DATENAME(minute,[DateTimeValue]) AS DateName_minute
,DATENAME(second,[DateTimeValue]) AS DateName_second
,DATENAME(millisecond,[DateTimeValue]) AS DateName_millisecond
,DATENAME(microsecond,[DateTimeValue]) AS DateName_microsecond
,DATENAME(nanosecond,[DateTimeValue]) AS DateName_nanosecond
FROM dbaadmin.dbo.dbaudf_TimeTable(@StartDateTime,@EndDateTime,@Interval,@IntervalCount)
)
GO
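And the dimension version, which is what the forecasting query below actually calls (again, illustration only):

-- Daily rows with the date broken out into its parts. TimeKey is a numeric key
-- built from the full yyyy-mm-dd hh:mi:ss.mmm string with the separators removed.
SELECT TimeKey
      ,DateTimeValue
      ,DatePart_year
      ,DatePart_month
      ,DatePart_day
      ,DateName_weekday
FROM [dbo].[dbaudf_TimeDimension]('2013-01-01','2013-01-07','day',1)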



With those two functions ready to go, I was then able to use them in a CTE that built the #CleanHistory table I work from.

;WITH DateDimension
AS
(
SELECT ROW_NUMBER()OVER(ORDER BY [TimeKey]) [Period]
,T2.*
FROM (
SELECT MinDate = CAST(CONVERT(VarChar(12),MIN([rundate]),101)AS DateTime)
,MaxDate = CAST(CONVERT(VarChar(12),MAX([rundate]),101)AS DateTime)
FROM [dbaperf].[dbo].[db_stats_log] [db_stats_log] WITH(NOLOCK)
) T1
CROSS APPLY [dbaadmin].[dbo].[dbaudf_TimeDimension](T1.MinDate,T1.MaxDate,'Day',1) T2
)
,SlidingWindow
AS
(
SELECT DISTINCT
Period
,([Period]*100)/(SELECT MAX(Period) FROM DateDimension)/10 [Pos]
,(((SELECT MAX(Period)+1 FROM DateDimension) - Period)*100)/(SELECT MAX(Period) FROM DateDimension)/10 [Pre]
FROM DateDimension
)
,RawData
AS
(
SELECT [ServerName]
,[DatabaseName]
,CAST(CONVERT(VarChar(12),[rundate],101)AS DateTime) [rundate]
,SUM(CAST([data_space_used_KB] AS FLOAT) / 1024) DataSize
,SUM(CAST([index_size_used_KB] AS FLOAT) / 1024) IndexSize
FROM [dbaperf].[dbo].[db_stats_log] [db_stats_log] WITH(NOLOCK)
GROUP BY [ServerName]
,[DatabaseName]
,[rundate]
)
,ResultKey
AS
(
SELECT DISTINCT
[RawData].[ServerName]
,[RawData].[DatabaseName]
,[DateDimension].[Period]
,[DateDimension].[DateTimeValue]
FROM [DateDimension]
CROSS JOIN [RawData]
)
,History
AS
(
SELECT [ResultKey].[ServerName]
,[ResultKey].[DatabaseName]
,[ResultKey].[Period]
,[RawData].[DataSize]
,[RawData].[IndexSize]
FROM [ResultKey]
LEFT JOIN [RawData]
ON [ResultKey].[DateTimeValue] = [RawData].[rundate]
AND [ResultKey].[ServerName] = [RawData].[ServerName]
AND [ResultKey].[DatabaseName] = [RawData].[DatabaseName]
)
SELECT [ServerName]
,[DatabaseName]
,[Period]
,CASE
WHEN [DataSize] IS NULL
THEN (SELECT TOP 1 [DataSize] FROM History
      WHERE [DataSize] IS NOT NULL
      AND [ServerName] = H.[ServerName]
      AND [DatabaseName] = H.[DatabaseName]
      AND [Period] < H.[Period]
      ORDER BY [Period] DESC)
ELSE [DataSize] END [DataSize]
,CASE
WHEN [IndexSize] IS NULL
THEN (SELECT TOP 1 [IndexSize] FROM History
      WHERE [IndexSize] IS NOT NULL
      AND [ServerName] = H.[ServerName]
      AND [DatabaseName] = H.[DatabaseName]
      AND [Period] < H.[Period]
      ORDER BY [Period] DESC)
ELSE [IndexSize] END [IndexSize]
INTO #CleanHistory
FROM History H
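Because the whole process depends on the history being even, a quick sanity check like the sketch below can confirm that every database really does end up with exactly one row per period. The check itself is just an illustration, not part of the process above.

-- Any rows returned here mean the history is still uneven (missing or duplicate periods).
SELECT [ServerName]
      ,[DatabaseName]
      ,COUNT(*)                          AS RowCnt
      ,COUNT(DISTINCT [Period])          AS Periods
      ,MAX([Period]) - MIN([Period]) + 1 AS ExpectedPeriods
FROM #CleanHistory
GROUP BY [ServerName]
        ,[DatabaseName]
HAVING COUNT(*) <> COUNT(DISTINCT [Period])
    OR COUNT(DISTINCT [Period]) <> MAX([Period]) - MIN([Period]) + 1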




And now we have gone full circle to feed the query at the top of this article.
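With #Results holding both the historical and the forecast rows, pulling the projected sizes for a given horizon is just a lookup. A minimal sketch (both variables below are illustrations; @LastActualPeriod would be the highest period that came from real history, captured before the forecast passes ran):

DECLARE @LastActualPeriod INT = 365;  -- illustration: last period of real history
DECLARE @DaysOut INT = 90;            -- how far ahead to look

SELECT [ServerName]
      ,[DatabaseName]
      ,[Period]
      ,[Smoothed_MetricA] AS Forecast_DataSize   -- MB, given the KB/1024 conversion earlier
      ,[Smoothed_MetricB] AS Forecast_IndexSize  -- MB, given the KB/1024 conversion earlier
FROM #Results
WHERE [Period] = @LastActualPeriod + @DaysOut
ORDER BY [ServerName]
        ,[DatabaseName]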



C# Code for CLR Aggregate Functions.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;


[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
        Format.Native,                      //use the native serialization format for the intermediate result
        IsInvariantToNulls = false,         //optimizer property
        IsInvariantToDuplicates = false,    //optimizer property
        IsInvariantToOrder = true           //optimizer property
        )]
public struct Slope
{
    private long count;
    private SqlDouble SumY;
    private SqlDouble SumXY;
    private SqlDouble SumX2;
    private SqlDouble SumY2;
    private SqlDouble SumX;

    public void Init()
    {
        count = 0;
        SumX = SumY = SumXY = SumX2 = SumY2 = 0;
    }

    public void Accumulate(SqlDouble x, SqlDouble y)
    {
        if (!x.IsNull && !y.IsNull)
        {
            count++;
            SumX += (SqlDouble)x;
            SumY += (SqlDouble)y;
            SumXY += (SqlDouble)x * (SqlDouble)y;
            SumX2 += (SqlDouble)x * (SqlDouble)x;
            SumY2 += (SqlDouble)y * (SqlDouble)y;
        }
    }

    public void Merge(Slope other)
    {
        count += other.count;
        SumY += other.SumY;
        SumXY += other.SumXY;
        SumX2 += other.SumX2;
        SumY2 += other.SumY2;
        SumX += other.SumX;
    }

    public SqlDouble Terminate()
    {
        if (count > 1)
        {
            SqlDouble value = (count * SumXY - (SumX * SumY)) / ((count * SumX2) - (SumX * SumX));
            return value;
        }
        else
        {
            return 0;
        }
    }

}


[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
        Format.Native,                      //use the native serialization format for the intermediate result
        IsInvariantToNulls = false,         //optimizer property
        IsInvariantToDuplicates = false,    //optimizer property
        IsInvariantToOrder = true           //optimizer property
        )]
public struct RSquared
{
    private long count;
    private SqlDouble SumY;
    private SqlDouble SumXY;
    private SqlDouble SumX2;
    private SqlDouble SumY2;
    private SqlDouble SumX;

    public void Init()
    {
        count = 0;
        SumX = SumY = SumXY = SumX2 = SumY2 = 0;
    }

    public void Accumulate(SqlDouble x, SqlDouble y)
    {
        if (!x.IsNull && !y.IsNull)
        {
            count++;
            SumX += (SqlDouble)x;
            SumY += (SqlDouble)y;
            SumXY += (SqlDouble)x * (SqlDouble)y;
            SumX2 += (SqlDouble)x * (SqlDouble)x;
            SumY2 += (SqlDouble)y * (SqlDouble)y;
        }
    }

    public void Merge(RSquared other)
    {
        count += other.count;
        SumY += other.SumY;
        SumXY += other.SumXY;
        SumX2 += other.SumX2;
        SumY2 += other.SumY2;
        SumX += other.SumX;
    }

    public SqlDouble Terminate()
    {
        if (count > 1)
        {
            if (Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY))) != 0)
            {
                SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
                return value * value;
            }
            else
            {
                return 0;
            }
        }
        else
        {
            return 0;
        }
    }

}


[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
        Format.Native,                      //use the native serialization format for the intermediate result
        IsInvariantToNulls = false,         //optimizer property
        IsInvariantToDuplicates = false,    //optimizer property
        IsInvariantToOrder = true           //optimizer property
        )]
public struct Intercept
{
    private long count;
    private SqlDouble SumY;
    private SqlDouble SumXY;
    private SqlDouble SumX2;
    private SqlDouble SumY2;
    private SqlDouble SumX;

    public void Init()
    {
        count = 0;
        SumX = SumY = SumXY = SumX2 = SumY2 = 0;
    }

    public void Accumulate(SqlDouble x, SqlDouble y)
    {
        if (!x.IsNull && !y.IsNull)
        {
            count++;
            SumX += (SqlDouble)x;
            SumY += (SqlDouble)y;
            SumXY += (SqlDouble)x * (SqlDouble)y;
            SumX2 += (SqlDouble)x * (SqlDouble)x;
            SumY2 += (SqlDouble)y * (SqlDouble)y;
        }
    }

    public void Merge(Intercept other)
    {
        count += other.count;
        SumY += other.SumY;
        SumXY += other.SumXY;
        SumX2 += other.SumX2;
        SumY2 += other.SumY2;
        SumX += other.SumX;
    }

    public SqlDouble Terminate()
    {

        if (count > 1)
        {
            SqlDouble slopeValue = (count * SumXY - (SumX * SumY)) / ((count * SumX2) - (SumX * SumX));
            SqlDouble interceptValue = (SumY - (slopeValue * SumX)) / count;

            return interceptValue;
        }
        else
        {
            return 0;
        }
    }

}
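One more note: the RSquared aggregate isn't used in the queries above, but it is a handy way to gauge how well a straight line actually fits each database's history before trusting the forecast. A sketch of how it could be used (values near 1 mean the smoothed history is very close to linear; low values suggest taking the forecast with a grain of salt):

SELECT [ServerName]
,[DatabaseName]
,dbaadmin.dbo.RSquared([Period],Smoothed_MetricA) R2_MetricA
,dbaadmin.dbo.RSquared([Period],Smoothed_MetricB) R2_MetricB
FROM #Results
WHERE [Period] IN (SELECT DISTINCT TOP (50) [Period] FROM #Results ORDER BY [Period] DESC)
GROUP BY [ServerName]
,[DatabaseName]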

