Imagine one has two SQL tables containing daily time-series data.
Both of the tables have primary key (and since it’s SQL Server the key is also clustered). The key is composite and includes field BUSINESS_DATE of type Date.
Both table are partitioned monthly.
Table_Short contains only 2 years worth of data.
Table_Long contains 50 years worth of data.
The end-user can only query maximum last 1 year worth of data at a time. Therefore, the front-end (UI) will always make a SELECT statement like
SELECT <some columns>
FROM <Table>
WHERE BUSINESS_DATE >= t1
AND BUSINESS_DATE <= t2
for some dates t1, t2.
Given that the user only needs to look into the last year and the fact that the tables are partitioned and ordered by business_date, is it true, that the query latency should not depend on the table size, thus query against Table_Long will take the same time as against Table_Long?
My understanding was that SQL Server stores tables as Binary Tree, where search is at worst O(log n).
However, given that we only need the last year in either 2 years or 50 years, should the time-complexity become O(1)?
>Solution :
Query latencies may still depend on table size even though SQL Server uses a B-tree index to efficiently search and retrieve data. In certain cases, latency can depend on table size even if the data is partitioned.
In theory, while time complexity of the search operation may still be O(log n). When it comes to practice, SELECT statement shouldn’t depend on the overall size of the tables, but more on the number of partitions that need to be scanned to get data within a specific date range.(the partitioning scheme allows for efficient pruning of irrelevant data)
- In your case, we are querying for the last year of data, and both tables are partitioned by month. SQL Server query optimizer should be able to use the partitioning scheme to prune the partitions that do not contain the relevant data, resulting in a smaller search space and faster query execution.
Ps. SQL Server uses a clustered index physically ordering the data in a table, which further improves query performance.
- in your case, the primary key in both tables is clustered and includes
BUSINESS_DATEfield (data within each partition will be ordered byBUSINESS_DATE, making it easier and faster to locate data within each partition, resulting in query execution times that are not dependent on the overall size of the tables)