How does data_engine maintenance and retention work?

Document ID : KB000033704
Last Modified Date : 14/02/2018
Show Technical Document Details
Question:
How does data_engine maintenance and retention work?

"Delete historic data older than" should be set to a higher number than raw data, and the way it works is that after the number of days set for Raw Data, it rolls it into historic. In other words, if you have Raw Data set to 30 days, and Historic Data set to 60 days, then from day 1-30 you will have Raw Data and on the 31st day, the oldest raw data gets put into day one of historic data.

Technical Details

Data maintenance is a two-step process:

1. For raw data, look at the retention time, and for data that is older than the retention days, run a summarization query that stores the hourly summaries into the corresponding hourly (HN) table. Something similar is done to store daily (DN) summaries from the HN table. The process runs on the schedule defined by the customer. This is done in the same way, regardless of whether the table is partitioned or not. The retention days tell you how old, from midnight this morning, the rows in the RN or HN must be in order to be considered for summarization. The summarization query is very expensive, but there is no other way to do it. If the process is running smoothly, the query will only summarize one day of data for each QoS. If the process has stalled or failed for any reason, the query becomes significantly more expensive as more days need to be summarized. Database transactions aggravate this process, but we are stuck with how the RDBMS works. Partitioning can have a positive effect on this summarization query, since summarization and insertion occur against different partitions and there is no contention.

2. Again look at the retention time, and determine what data needs to be deleted, which was summarized above. DN data is obviously not summarized and it is just aged out as per the number of days since midnight this AM. Here is where the process is different for partitioned and unpartitioned tables:

? For partitioned tables, we simply drop partitions for old data on each table. Each partition holds data for one day. Dropping a partition is a very fast and cheap operation. It is almost instantaneous. This is why we overwhelming recommend partitioning databases. For 8.4, we are changing the default installation to partition the database if it is supported.

? For unpartitioned tables, this is very expensive. The pruning process is driven by a temporal delete statement: delete from <table> where sampletime < <retention_time>. The delete operation is one of the most expensive queries that you can run on a DB. I this case, it has to scan the index for the range, re-arrange data, indexes, storage, ? It is more expensive than the summarization queries in step 1.

Additional Information

Large Environments

***In large installations, this may not finish in a 24 hour period. This results in data maintenance not finishing, which causes the summarization in step 1 to get progressively more expensive. You quickly get into a death cycle, which can only be resolved by partitioning the DB and possibly manually deleting old data from the RN tables. The DN and HN tables are generally pretty small, so they are not an issue.

If on top of that, if you have inadequate resources, poorly configured DBs, shared storage with other applications, ? pretty quickly, things get out of hand and degrade badly in large environments. Partitioning a large, e.g., > 1 TB, and already struggling DB is like doubling the resource requirements until partitioning is completed.

You need to have enough space available on the primary filegroup to enable partitioning.
or you need to wait for CA UIM to support partitioning when files are not on the primary filegroup (which is at least several releases away as of the date of this Article)

?