Concatenate loading rows: behavior and performance




Given two lists of values that I want to concatenate into a single list of distinct values, I noticed some interesting behavior: It didn't matter whether I used DISTINCT in Load 1, Load 2, or both to make the final list of values distinct.

//Load 1
Base:
LOAD [DISTINCT] *
INLINE [
    FV
    1
    1
    2
    2
    3
    3
];
//Load 2
CONCATENATE (Base)
LOAD [DISTINCT] *
INLINE [
    FV
    3
    3
    4
    4
];

Here are the results of all combinations of DISTINCT usage:

Scenario Load 1 Load 2 Resulting Rows
1 No distinct No distinct 10
2 No distinct Distinct 4
3 Distinct No distinct 4
4 Distinct Distinct 4

What number of rows would you have expected to result from the second and third scenarios? I would have guessed incorrectly.

Looking in particular at the second scenario, it retroactively made Load 1 distinct, even though there was no DISTINCT for that load. The number of rows in the table went down between Loads 1 and 2, and the log file bears that out. (When concatenate loading, the "lines fetched" is the total rows in our table at that point in the script execution.)

Scenario 1:

Load 1 (Non-distinct) -- 1 fields found: FV, 6 lines fetched
Load 2 (Non-distinct) -- 1 fields found: FV, 10 lines fetched

Scenario 2:

Load 1 (Non-distinct) -- 1 fields found: FV, 6 lines fetched
Load 2 (Distinct) -- 1 fields found: FV, 4 lines fetched

So what it must be doing in these cases is reloading the entire resident table when concatenating more rows. If that is true, then what would the scripting and reload performance implications be? For one thing, when creating concatenated fact tables, I would load the source tables in ascending order of row count to minimize the reload time. But, sticking to the original scenario...

Expanding my sources to two 10 million row QVDs with randomly generated integer values, then repeating scenarios 2, 3, and 4, here are my results:

//Load 1
Base:
LOAD [DISTINCT] FV
FROM Base1.qvd (qvd);

//Load 2
CONCATENATE (Base)
LOAD [DISTINCT] FV
FROM Base2.qvd (qvd);

Scenario Load 1 Load 2 Reload Time (seconds)
1 N/A N/A N/A
2 No distinct Distinct 26
3 Distinct No distinct 30
4 Distinct Distinct 30

While scenario 4 is what I would naturally be inclined to write, it turned out to be a few seconds faster to allow the first load to be non-distinct, then wait and apply the DISTINCT only once, after all of the data had been combined. I repeated this test several times with the same result, and I would wager that the improvement would be even more pronounced with more rows or columns or more than two sources.

TAGS: Performance, QlikView, Data Model, Load Script

Related News

IS YOUR COMPANY UPGRADING TO QLIKVIEW 12?

Axis Group was approached by a client asking for help...

Read More

Part 1 of this series describes what Qlik Sense does better than QlikView, from the perspective...

Read More