When your Big Data changes, you can refresh the extract by selecting the data source from the data menu in Business Intelligence. By default, the extracts are fully refreshed. That means, all the rows are replaced with the new data. This full refresh which we spoke about in part-1 of this series is an expensive refresh as the extract can be big and time consuming.
You can now load only new or changed data to enable very large datasets in the Power BI Premium service.
There is an alternative in the BI tool.
It’s called the incremental data refresh.
It’s called the incremental data refreshHere your data source is updated with the daily sales transaction instead of refreshing the entire data. Therefore, you can add rows of the new transactions that occur daily and conduct a full refresh maybe once a week. The advantages are that you have a quicker refresh, they are more reliable as you don’t need to maintain long-running connections to source systems and the resource consumption is reduced since there is less data to refresh.
How to use incremental refresh?
Incremental refresh polices are defined in Power BI Desktop and they can be applied once they are published to the Power BI service.
- You can begin by enabling incremental refresh in the preview features.
- Next, you need to filter the large datasets in the BI Desktop.
- The huge datasets with multiple rows may not fit into the Power BI Desktop because the tool is restricted by the resources available on the user’s desktop PC. Such datasets are therefore filtered upon import to fit into the Business Intelligence Desktop. This happens regardless of whether you are using incremental refresh or not.
- To leverage incremental refresh in the Power BI service, filtering is required using Power Query date/time parameters with the reserved, case-sensitive names ‘RangeStart’ and ‘RangeEnd’.
- Once published, the parameter values are overridden automatically by the BI service with no need to set them in the dataset settings in Business Intelligence.
- It is crucial here that the filter is pushed to the source system when queries are submitted for refresh operations. This means the data source should support ‘Query Folding’. Since there are various levels of ‘Query-Folding’ support for each data source, it is suggested that you verify if the filter logic is included in the source queries. If this does not happen, each query will pull all the Big Data from the source, and that defeats the purpose of an incremental refresh.
- The filter partitions the data into ranges in the Power BI service and is not designed for updating the filtered date column. This will interprete the update as an insertion and a deletion (not an update). If the deletion occurs in the historical range and not the incremental range, it won’t register.
- In the Power Query Editor, you have to select ‘Manage Parameters’ to define the parameters with default values.
- With the parameters defined, you need to apply the filter by selecting the ‘Custom Filter’ menu option for a column.
- You must ensure that the rows are filtered where the column value is after or equal to ‘RangeStart’ and before ‘RangeEnd’.
- Although the data type of the parameters must be date/time oriented, they can be converted to match the requirements of the data source. In the example below, Power Query function converts a date/time value to resemble an integer surrogate key of the form yyyymmdd, which is common for data warehouses. The function can be called the filter step.
- Next, you must select ‘Close and Apply’ from the ‘Power Query Editor’ and you should have a subset of the dataset in the Power BI Desktop.
Define the refresh policy in the BI tool
- Incremental refresh is available on the context menu for tables, except for live connection models.
- The incremental refresh dialog is displayed, and you can use the toggle to enable the dialog.
- If the Power Query expression for the table doesn’t refer to the parameters with reserved names, the toggle is disabled.The header text explains the following:
- Incremental refresh is supported only for workspaces on premium capacity. Refresh policies are defined in Power BI Desktop and they are applied by refresh operations in the BI service.
- If you download the PBIX file containing an incremental-refresh policy from the Power BI service, it will not open in the BI Desktop and you will be unable to download everything. While this may just be supported in the future, you need to note that these datasets can grow so big, that it becomes impractical to download and open them on a desktop PC.
Refresh ranges in the Business Intelligence service
The first refresh in the Power BI service may take a little longer since it must import the 5-year data, but subsequent refreshes may be quicker.
- Add a new day of data.
- Refresh 10 days up to the present date.
- Remove calendar years that are older than 5 years prior to the present date. For example, if the current date is January 1st, 2019, the year 2013 is removed.
The below mentioned example defines a refresh policy to store 5 years of data in total, and incrementally refresh 10 days of data. If the dataset is refreshed on day to day basis, the following steps will be carried out for each refresh operation.
Definition of these ranges might be all what you need, and you can go to the publishing step below. The additional dropdowns are for advanced features.
Incremental refresh of 10 days is more efficient than a full refresh of 5 years in the Power BI service. However, this is achieved better if you select the ‘Detect Data Changes’ checkbox, and then a date/time column to identify and refresh only the days where the data has changed. This assuming such a column exists in the source system, which is typically for auditing purposes. But be mindful that this should not be the same column used to partition the data with the RangeStart/RangeEnd parameters. The maximum value of this column is evaluated for each of the periods in the incremental range and if it hasn’t changed much since the previous refresh, you don’t need to refresh the period. In the example below, you can see how this can further reduce the days incrementally refreshed from 10 to 2.
The current design requires that the column to detect Big Data changes is retained and cached into memory. You may want to consider one of the following techniques to reduce cardinality and memory consumption.
- Retain only the maximum value of this column at the time of refresh, by using a ‘Power Query function’.
- Reduce the precision to a level that is acceptable in accordance with your refresh-frequency requirements.
- Allow the definition of custom queries for data-change detection later. This could be used to avoid the persisting of the column value altogether.
- For refreshing Big Data from a financial system where data for the previous month is approved on the 12th calendar day of the monthly, you could set the incremental range to 1 month and schedule the refresh to run on the 12th day of the next month. For e.g. February data would be refreshed on March 12th.
- You must refresh operations in the BI service run under UTC time to determine the effective date and affect complete periods. You can plan to add the ability to override the effective date for a refresh operation.
- Since incremental refresh is a premium feature, the publish dialog only allows selection of a workspace on premium capacity.
- Now, you can refresh the model. The first refresh may take longer to import the historical data, but subsequent refreshes will be faster because of an incremental refresh.
- Queries can also be restricted due to the default timeout for the data source, but most sources allow overriding timeouts in the M expression.
This is all what there is to know on Power BI data refresh and scheduling. But we would love to hear your thoughts and experiences too while refreshing your Big Data while using the Business Intelligence tool. Please feel free to write to us at email@example.com.