Given two lists of values that I want to concatenate into a single list of distinct values, I noticed some interesting behavior: It didn’t matter whether I used DISTINCT in Load 1, Load 2, or both to make the final list of values distinct.
//Load 1 Base: LOAD [DISTINCT] * INLINE [ FV 1 1 2 2 3 3 ];
//Load 2 CONCATENATE (Base) LOAD [DISTINCT] * INLINE [ FV 3 3 4 4 ];
Here are the results of all combinations of DISTINCT usage:
|Scenario||Load 1||Load 2||Resulting Rows|
|1||No distinct||No distinct||10|
What number of rows would you have expected to result from the second and third scenarios? I would have guessed incorrectly.
Looking in particular at the second scenario, it retroactively made Load 1 distinct, even though there was no DISTINCT for that load. The number of rows in the table went down between Loads 1 and 2, and the log file bears that out. (When concatenate loading, the “lines fetched” is the total rows in our table at that point in the script execution.)
Load 1 (Non-distinct) -- 1 fields found: FV, 6 lines fetched Load 2 (Non-distinct) -- 1 fields found: FV, 10 lines fetched
Load 1 (Non-distinct) -- 1 fields found: FV, 6 lines fetched Load 2 (Distinct) -- 1 fields found: FV, 4 lines fetched
So what it must be doing in these cases is reloading the entire resident table when concatenating more rows. If that is true, then what would the scripting and reload performance implications be? For one thing, when creating concatenated fact tables, I would load the source tables in ascending order of row count to minimize the reload time. But, sticking to the original scenario…
Expanding my sources to two 10 million row QVDs with randomly generated integer values, then repeating scenarios 2, 3, and 4, here are my results:
//Load 1 Base: LOAD [DISTINCT] FV FROM Base1.qvd (qvd); //Load 2 CONCATENATE (Base) LOAD [DISTINCT] FV FROM Base2.qvd (qvd);
|Scenario||Load 1||Load 2||Reload Time (seconds)|
While scenario 4 is what I would naturally be inclined to write, it turned out to be a few seconds faster to allow the first load to be non-distinct, then wait and apply the DISTINCT only once, after all of the data had been combined. I repeated this test several times with the same result, and I would wager that the improvement would be even more pronounced with more rows or columns or more than two sources.