Programación desordenada: noviembre 2020

A propósito del post que ha escrito Jose A Bautista, Series temporales en SQL, este post es complementario (o casi idéntico y entonces se podría considerar un plagio) para terminar de entender cómo calcular las series temporales, o expresado en un lenguaje más llano, crear intervalos.

El problema al que nos enfrentamos (y resolvió, quiero insistir, Jose A Bautista de forma efectiva) era el siguiente:

Saber que productos han tenido stock pero no ventas, en un periodo determinado.

El problema del problema (un metaproblema por así decirlo) no es averiguar lo anterior, para eso no es necesario crear series temporales, sino trabajar con un gran volumen de datos y que la consulta siga ejecutándose en un tiempo razonable, por eso crear intervalos ofrece una solución porque, presumiblemente (y depende mucho de la dispersión de los datos), reducirá el conjunto de filas sobre el que se ejecuta la consulta (asumiendo también que crear esta tabla intermedia tiene un coste, tanto en tiempo como en almacenamiento).

El set de datos inicial es el siguiente:

CREATE TABLE [dbo].[Sales](
	[ProductId] [int] NOT NULL,
	[PointOfSaleId] [int] NOT NULL,
	[Date] [date] NOT NULL,
	[Quantity] [int] NOT NULL
) ON [PRIMARY]
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-01' AS Date), 1)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-02' AS Date), 2)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-03' AS Date), 3)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-10' AS Date), 10)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-11' AS Date), 11)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-20' AS Date), 20)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 1, CAST(N'2020-01-01' AS Date), 1)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 1, CAST(N'2020-01-02' AS Date), 2)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 1, CAST(N'2020-01-05' AS Date), 5)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 2, CAST(N'2020-01-15' AS Date), 15)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 2, CAST(N'2020-01-16' AS Date), 16)
GO
INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 2, CAST(N'2020-01-25' AS Date), 25)

Finalmente tenemos los siguientes datos:

ProductId	PointOfSaleId	Date	Quantity
1	1	01-01-2020	1
1	1	01-02-2020	2
1	1	01-03-2020	3
1	1	01-10-2020	10
1	1	01-11-2020	11
1	1	01-20-2020	20
2	1	01-01-2020	1
2	1	01-02-2020	2
2	1	01-05-2020	5
2	2	01-15-2020	15
2	2	01-16-2020	16
2	2	01-25-2020	25

El siguiente paso es entender la función ROW_NUMBER

-- Sin PARTITION, simplemente tenemos un contador
SELECT ROW_NUMBER() OVER (ORDER BY ProductId) AS rn, * FROM Sales

rn	ProductId	PointOfSaleId	Date	Quantity
1	1	1	01-01-2020	1
2	1	1	01-02-2020	2
3	1	1	01-03-2020	3
4	1	1	01-10-2020	10
5	1	1	01-11-2020	11
6	1	1	01-20-2020	20
7	2	1	01-01-2020	1
8	2	1	01-02-2020	2
9	2	1	01-05-2020	5
10	2	2	01-15-2020	15
11	2	2	01-16-2020	16
12	2	2	01-25-2020	25

-- PARTITION BY cambia, así que se reinicia el contador
    SELECT ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY ProductId) AS rn, * FROM Sales

rn	ProductId	PointOfSaleId	Date	Quantity
1	1	1	01-01-2020	1
2	1	1	01-02-2020	2
3	1	1	01-03-2020	3
4	1	1	01-10-2020	10
5	1	1	01-11-2020	11
6	1	1	01-20-2020	20
1	2	1	01-01-2020	1
2	2	1	01-02-2020	2
3	2	1	01-05-2020	5
1	2	2	01-15-2020	15
2	2	2	01-16-2020	16
3	2	2	01-25-2020	25

Ahora que ya sabemos como funciona ROW_NUMBER, vamos a centranos en lo que queremos conseguir, el resultado final, los intervalos:

ProductId	PointOfSaleId	StartDate	EndDate	Quantity
1	1	01-01-2020	01-03-2020	6
1	1	01-10-2020	01-11-2020	21
1	1	01-20-2020	01-20-2020	20
2	1	01-01-2020	01-02-2020	3
2	1	01-05-2020	01-05-2020	5
2	2	01-15-2020	01-16-2020	31
2	2	01-25-2020	01-25-2020	25

Para llegar del conjunto inicial de datos al resultado final, la SQL necesaria es:

WITH sales_cte AS
    (SELECT ProductId,
            PointOfSaleId, 
            Date,
            SUM(Quantity) AS Quantity
     FROM Sales
     GROUP BY ProductId,
              PointOfSaleId,
              Date),
       groups_cte AS
    (SELECT ROW_NUMBER() OVER (ORDER BY date) AS row_number,
      DATEADD(DAY, -ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY Date), Date) AS [group],
          ProductId,
          PointOfSaleId , 
          Date,
          Quantity
     FROM sales_cte)   
  SELECT ProductId,
         PointOfSaleId,
         MIN(Date) AS StartDate,
         MAX(Date) AS EndDate,
         SUM(Quantity) AS Quantity
  FROM groups_cte
  GROUP BY ProductId,
           PointOfSaleId,
           [group]

La clave de la SQL está en -ROW_NUMBER(), vamos a ver el resultado intermedio y entender así que los días consecutivos están creando un grupo que podremos usar posteriormente:

SELECT 
	ROW_NUMBER() OVER (ORDER BY date) AS row_number,
	ProductId,
    PointOfSaleId,
	Date,
	Quantity,
	-ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY Date) AS negative_row_number,
    DATEADD(DAY, -ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY Date), Date) AS [group]
INTO #groups
FROM (SELECT ProductId,
          PointOfSaleId, 
		  Date,
		  SUM(Quantity) AS Quantity
   FROM Sales   
   GROUP BY ProductId,
            PointOfSaleId,
            Sales.Date) as T;
SELECT * FROM #groups;

row_number	ProductId	PointOfSaleId	Date	Quantity	negative_row_number	group
1	1	1	01-01-2020	1	-1	12-31-2019
3	1	1	01-02-2020	2	-2	12-31-2019
5	1	1	01-03-2020	3	-3	12-31-2019
7	1	1	01-10-2020	10	-4	01-06-2020
8	1	1	01-11-2020	11	-5	01-06-2020
11	1	1	01-20-2020	20	-6	01-14-2020
2	2	1	01-01-2020	1	-1	12-31-2019
4	2	1	01-02-2020	2	-2	12-31-2019
6	2	1	01-05-2020	5	-3	01-02-2020
9	2	2	01-15-2020	15	-1	01-14-2020
10	2	2	01-16-2020	16	-2	01-14-2020
12	2	2	01-25-2020	25	-3	01-22-2020

Después de esto ya podemos agrupar por el grupo creado (la columna [group]) y usar funciones de agregado para sacar el resto de los datos:

SELECT ProductId,
       PointOfSaleId,
	   [group],
       MIN(Date) AS StartDate,
       MAX(Date) AS EndDate,
	   SUM(Quantity) AS Quantity
FROM #groups
GROUP BY ProductId,
         PointOfSaleId,
         [group]

ProductId	PointOfSaleId	group	StartDate	EndDate	Quantity
1	1	12-31-2019	01-01-2020	01-03-2020	6
1	1	01-06-2020	01-10-2020	01-11-2020	21
1	1	01-14-2020	01-20-2020	01-20-2020	20
2	1	12-31-2019	01-01-2020	01-02-2020	3
2	1	01-02-2020	01-05-2020	01-05-2020	5
2	2	01-14-2020	01-15-2020	01-16-2020	31
2	2	01-22-2020	01-25-2020	01-25-2020	25

Por acabar con algún dato del mundo real, para el cliente en el que estamos haciendo algunas pruebas la reducción de filas ha sido la siguiente:

	Original rows	Intervals rows
Stock	560.089.291	12.236.634
Sales	22.981.825	16.794.972

Un saludo!

Programación desordenada

Páginas

domingo, 22 de noviembre de 2020

Series temporales en SQL

Etiquetas

Archivo del blog