A propósito del post que ha escrito Jose A Bautista, Series temporales en SQL, este post es complementario (o casi idéntico y entonces se podría considerar un plagio) para terminar de entender cómo calcular las series temporales, o expresado en un lenguaje más llano, crear intervalos.
El problema al que nos enfrentamos (y resolvió, quiero insistir, Jose A Bautista de forma efectiva) era el siguiente:
Saber que productos han tenido stock pero no ventas, en un periodo determinado.
El problema del problema (un metaproblema por así decirlo) no es averiguar lo anterior, para eso no es necesario crear series temporales, sino trabajar con un gran volumen de datos y que la consulta siga ejecutándose en un tiempo razonable, por eso crear intervalos ofrece una solución porque, presumiblemente (y depende mucho de la dispersión de los datos), reducirá el conjunto de filas sobre el que se ejecuta la consulta (asumiendo también que crear esta tabla intermedia tiene un coste, tanto en tiempo como en almacenamiento).
El set de datos inicial es el siguiente:
CREATE TABLE [dbo].[Sales]( [ProductId] [int] NOT NULL, [PointOfSaleId] [int] NOT NULL, [Date] [date] NOT NULL, [Quantity] [int] NOT NULL ) ON [PRIMARY] GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-01' AS Date), 1) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-02' AS Date), 2) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-03' AS Date), 3) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-10' AS Date), 10) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-11' AS Date), 11) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (1, 1, CAST(N'2020-01-20' AS Date), 20) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 1, CAST(N'2020-01-01' AS Date), 1) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 1, CAST(N'2020-01-02' AS Date), 2) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 1, CAST(N'2020-01-05' AS Date), 5) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 2, CAST(N'2020-01-15' AS Date), 15) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 2, CAST(N'2020-01-16' AS Date), 16) GO INSERT [dbo].[Sales] ([ProductId], [PointOfSaleId], [Date], [Quantity]) VALUES (2, 2, CAST(N'2020-01-25' AS Date), 25)
Finalmente tenemos los siguientes datos:
ProductId | PointOfSaleId | Date | Quantity |
1 | 1 | 01-01-2020 | 1 |
1 | 1 | 01-02-2020 | 2 |
1 | 1 | 01-03-2020 | 3 |
1 | 1 | 01-10-2020 | 10 |
1 | 1 | 01-11-2020 | 11 |
1 | 1 | 01-20-2020 | 20 |
2 | 1 | 01-01-2020 | 1 |
2 | 1 | 01-02-2020 | 2 |
2 | 1 | 01-05-2020 | 5 |
2 | 2 | 01-15-2020 | 15 |
2 | 2 | 01-16-2020 | 16 |
2 | 2 | 01-25-2020 | 25 |
El siguiente paso es entender la función ROW_NUMBER
-- Sin PARTITION, simplemente tenemos un contador SELECT ROW_NUMBER() OVER (ORDER BY ProductId) AS rn, * FROM Sales
rn | ProductId | PointOfSaleId | Date | Quantity |
1 | 1 | 1 | 01-01-2020 | 1 |
2 | 1 | 1 | 01-02-2020 | 2 |
3 | 1 | 1 | 01-03-2020 | 3 |
4 | 1 | 1 | 01-10-2020 | 10 |
5 | 1 | 1 | 01-11-2020 | 11 |
6 | 1 | 1 | 01-20-2020 | 20 |
7 | 2 | 1 | 01-01-2020 | 1 |
8 | 2 | 1 | 01-02-2020 | 2 |
9 | 2 | 1 | 01-05-2020 | 5 |
10 | 2 | 2 | 01-15-2020 | 15 |
11 | 2 | 2 | 01-16-2020 | 16 |
12 | 2 | 2 | 01-25-2020 | 25 |
-- PARTITION BY cambia, así que se reinicia el contador SELECT ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY ProductId) AS rn, * FROM Sales
rn | ProductId | PointOfSaleId | Date | Quantity |
1 | 1 | 1 | 01-01-2020 | 1 |
2 | 1 | 1 | 01-02-2020 | 2 |
3 | 1 | 1 | 01-03-2020 | 3 |
4 | 1 | 1 | 01-10-2020 | 10 |
5 | 1 | 1 | 01-11-2020 | 11 |
6 | 1 | 1 | 01-20-2020 | 20 |
1 | 2 | 1 | 01-01-2020 | 1 |
2 | 2 | 1 | 01-02-2020 | 2 |
3 | 2 | 1 | 01-05-2020 | 5 |
1 | 2 | 2 | 01-15-2020 | 15 |
2 | 2 | 2 | 01-16-2020 | 16 |
3 | 2 | 2 | 01-25-2020 | 25 |
Ahora que ya sabemos como funciona ROW_NUMBER, vamos a centranos en lo que queremos conseguir, el resultado final, los intervalos:
ProductId | PointOfSaleId | StartDate | EndDate | Quantity |
1 | 1 | 01-01-2020 | 01-03-2020 | 6 |
1 | 1 | 01-10-2020 | 01-11-2020 | 21 |
1 | 1 | 01-20-2020 | 01-20-2020 | 20 |
2 | 1 | 01-01-2020 | 01-02-2020 | 3 |
2 | 1 | 01-05-2020 | 01-05-2020 | 5 |
2 | 2 | 01-15-2020 | 01-16-2020 | 31 |
2 | 2 | 01-25-2020 | 01-25-2020 | 25 |
Para llegar del conjunto inicial de datos al resultado final, la SQL necesaria es:
WITH sales_cte AS (SELECT ProductId, PointOfSaleId, Date, SUM(Quantity) AS Quantity FROM Sales GROUP BY ProductId, PointOfSaleId, Date), groups_cte AS (SELECT ROW_NUMBER() OVER (ORDER BY date) AS row_number, DATEADD(DAY, -ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY Date), Date) AS [group], ProductId, PointOfSaleId , Date, Quantity FROM sales_cte) SELECT ProductId, PointOfSaleId, MIN(Date) AS StartDate, MAX(Date) AS EndDate, SUM(Quantity) AS Quantity FROM groups_cte GROUP BY ProductId, PointOfSaleId, [group]
La clave de la SQL está en -ROW_NUMBER(), vamos a ver el resultado intermedio y entender así que los días consecutivos están creando un grupo que podremos usar posteriormente:
SELECT ROW_NUMBER() OVER (ORDER BY date) AS row_number, ProductId, PointOfSaleId, Date, Quantity, -ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY Date) AS negative_row_number, DATEADD(DAY, -ROW_NUMBER() OVER (PARTITION BY ProductId, PointOfSaleId ORDER BY Date), Date) AS [group] INTO #groups FROM (SELECT ProductId, PointOfSaleId, Date, SUM(Quantity) AS Quantity FROM Sales GROUP BY ProductId, PointOfSaleId, Sales.Date) as T; SELECT * FROM #groups;
row_number | ProductId | PointOfSaleId | Date | Quantity | negative_row_number | group |
1 | 1 | 1 | 01-01-2020 | 1 | -1 | 12-31-2019 |
3 | 1 | 1 | 01-02-2020 | 2 | -2 | 12-31-2019 |
5 | 1 | 1 | 01-03-2020 | 3 | -3 | 12-31-2019 |
7 | 1 | 1 | 01-10-2020 | 10 | -4 | 01-06-2020 |
8 | 1 | 1 | 01-11-2020 | 11 | -5 | 01-06-2020 |
11 | 1 | 1 | 01-20-2020 | 20 | -6 | 01-14-2020 |
2 | 2 | 1 | 01-01-2020 | 1 | -1 | 12-31-2019 |
4 | 2 | 1 | 01-02-2020 | 2 | -2 | 12-31-2019 |
6 | 2 | 1 | 01-05-2020 | 5 | -3 | 01-02-2020 |
9 | 2 | 2 | 01-15-2020 | 15 | -1 | 01-14-2020 |
10 | 2 | 2 | 01-16-2020 | 16 | -2 | 01-14-2020 |
12 | 2 | 2 | 01-25-2020 | 25 | -3 | 01-22-2020 |
Después de esto ya podemos agrupar por el grupo creado (la columna [group]) y usar funciones de agregado para sacar el resto de los datos:
SELECT ProductId, PointOfSaleId, [group], MIN(Date) AS StartDate, MAX(Date) AS EndDate, SUM(Quantity) AS Quantity FROM #groups GROUP BY ProductId, PointOfSaleId, [group]
ProductId | PointOfSaleId | group | StartDate | EndDate | Quantity |
1 | 1 | 12-31-2019 | 01-01-2020 | 01-03-2020 | 6 |
1 | 1 | 01-06-2020 | 01-10-2020 | 01-11-2020 | 21 |
1 | 1 | 01-14-2020 | 01-20-2020 | 01-20-2020 | 20 |
2 | 1 | 12-31-2019 | 01-01-2020 | 01-02-2020 | 3 |
2 | 1 | 01-02-2020 | 01-05-2020 | 01-05-2020 | 5 |
2 | 2 | 01-14-2020 | 01-15-2020 | 01-16-2020 | 31 |
2 | 2 | 01-22-2020 | 01-25-2020 | 01-25-2020 | 25 |
Por acabar con algún dato del mundo real, para el cliente en el que estamos haciendo algunas pruebas la reducción de filas ha sido la siguiente:
Original rows | Intervals rows | |
Stock | 560.089.291 | 12.236.634 |
Sales | 22.981.825 | 16.794.972 |
Un saludo!