Speed up Power BI merges with record lookups
14 November 2024

You’re in Power BI, in the Power Query editor, getting data. Imagine you’ve got a CSV like this:
| id | name | salary |
| --- | --- | --- |
| EMP-1 | Johnny Bravo | 10000 |
| EMP-2 | Donald Duck | 20000 |
| EMP-3 | Finn the Human | 15000 |
| EMP-4 | Jake the Dog | 5000 |
And another CSV like this:
| id | country |
| --- | --- |
| EMP-1 | New Zealand |
| EMP-2 | China |
| EMP-3 | United Kingdom |
| EMP-4 | Australia |
And you want to combine the two to get:
| id | name | salary | country |
| --- | --- | --- | --- |
| EMP-1 | Johnny Bravo | 10000 | New Zealand |
| EMP-2 | Donald Duck | 20000 | China |
| EMP-3 | Finn the Human | 15000 | United Kingdom |
| EMP-4 | Jake the Dog | 5000 | Australia |
The first thing I’d usually do is jump to the “Merge queries” function (in the “Combine” group on the Home ribbon) and merge the two tables together on the “id” column. After this, I could expand the resulting merged table to get the country value in my employees table.
This is how this might look:
Getting the employees table:
/* table name: employees */
let
    Source = Csv.Document(File.Contents("C:\path\to\employees.csv"), [Delimiter=",", Columns=3, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {{"id", type text}, {"name", type text}, {"salary", Int64.Type}})
in
    #"Changed Type"
Getting the countries table:
/* table name: employeecountries */
let
    Source = Csv.Document(File.Contents("C:\path\to\employeecountries.csv"), [Delimiter=",", Columns=2, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {{"id", type text}, {"country", type text}})
in
    #"Changed Type"
And I’m going to make a third table which combines these with a merge:
/* table name: merge */
let
    Source = #"employees",
    #"Merged Queries" = Table.NestedJoin(Source, {"id"}, employeecountries, {"id"}, "employeecountries", JoinKind.LeftOuter),
    #"Expanded employeecountries" = Table.ExpandTableColumn(#"Merged Queries", "employeecountries", {"country"}, {"country"})
in
    #"Expanded employeecountries"
This works great for some small CSV files on your local machine, but if you’ve got tables with hundreds of thousands of rows, and your sources are from across the internet, you’re going to hit some performance problems.
A faster way
First we want to convert the countries table into a record. As far as I can tell there’s no way to do this in the GUI, so you’re going to have to jump into the Advanced Editor to smash in this code:
/* table name: employeecountriesrecord */
let
    Source = #"employeecountries",
    #"Renamed Columns" = Table.RenameColumns(Source, {{"id", "Name"}, {"country", "Value"}}),
    ToRecord = Record.FromTable(#"Renamed Columns")
in
    ToRecord
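For reference, Record.FromTable expects exactly two columns named Name and Value (hence the rename above), and for our four-row example the resulting record should look roughly like this (field names containing hyphens need M’s #"…" quoting syntax):

```
[#"EMP-1" = "New Zealand", #"EMP-2" = "China", #"EMP-3" = "United Kingdom", #"EMP-4" = "Australia"]
```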
After you’ve added this in, it’s important to right-click this query and switch off “Enable load”; otherwise Power BI will automatically try to turn this query back into a table.
Now to grab the data. You can use the Record.FieldOrDefault function to do a lookup on the related record. This will default to null if it can’t find the key in the lookup record. If you want it to fail when there is no matching key, you can use Record.Field instead.
/* table name: lookup */
let
    Source = #"employees",
    #"Added Custom" = Table.AddColumn(Source, "country", each Record.FieldOrDefault(employeecountriesrecord, [id]))
in
    #"Added Custom"
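For completeness, the strict variant would look like this (a sketch with a hypothetical query name; the only change is swapping Record.FieldOrDefault for Record.Field, so any employee id missing from the record produces an error value in that row instead of null):

```
/* table name: lookupstrict (hypothetical) */
let
    Source = #"employees",
    /* Record.Field raises an error when [id] has no matching field in the record */
    #"Added Custom" = Table.AddColumn(Source, "country", each Record.Field(employeecountriesrecord, [id]))
in
    #"Added Custom"
```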
The bigger your data is, the bigger the difference this will make: a naive merge can take O(n × m) time (the work grows with the product of the two tables’ row counts, so it increases roughly quadratically as both grow), whereas a record field lookup is effectively constant time per row, so the total work only grows linearly with the number of rows.
For demonstration, you can use Git Bash to automatically build larger tables with these scripts (okay, these were written by ChatGPT, but I used my extensive bash knowledge to audit them 👀).
#!/bin/bash
# Creates employees table

# Set the output file and the number of rows
OUTPUT_FILE="employees2.csv"
NUM_ROWS=100000 # Adjust this for the number of rows you want

# Define some sample names
NAMES=("Johnny Bravo" "Donald Duck" "Finn the Human" "Jake the Dog" "Mickey Mouse" "SpongeBob SquarePants" "Rick Sanchez" "Morty Smith" "Homer Simpson" "Peter Griffin")

# Create the CSV header
echo "id,name,salary" > "$OUTPUT_FILE"

# Generate random data for each row
for i in $(seq 1 $NUM_ROWS); do
    # Unique employee ID
    ID="EMP-$i"
    # Random name from the list
    NAME="${NAMES[$RANDOM % ${#NAMES[@]}]}"
    # Random salary between 5000 and 30000
    SALARY=$(( (RANDOM % 25000) + 5000 ))
    # Write row to file
    echo "$ID,$NAME,$SALARY" >> "$OUTPUT_FILE"
done

echo "Generated $NUM_ROWS rows in $OUTPUT_FILE"
And this one:
#!/bin/bash
# Creates countries lookup table

# Set the output file and the number of rows
OUTPUT_FILE="employeecountries2.csv"
NUM_ROWS=100000 # Adjust this for the number of rows you want

# Define a list of sample countries
COUNTRIES=("New Zealand" "China" "United Kingdom" "Australia" "United States" "Canada" "Germany" "France" "Japan" "India")

# Create the CSV header
echo "id,country" > "$OUTPUT_FILE"

# Generate random data for each row
for i in $(seq 1 $NUM_ROWS); do
    # Unique employee ID
    ID="EMP-$i"
    # Random country from the list
    COUNTRY="${COUNTRIES[$RANDOM % ${#COUNTRIES[@]}]}"
    # Write row to file
    echo "$ID,$COUNTRY" >> "$OUTPUT_FILE"
done

echo "Generated $NUM_ROWS rows in $OUTPUT_FILE"
Now we can add some more queries to measure how long these take, using Chris Webb’s technique for timing query durations:
let
    StartTime = DateTime.LocalNow(),
    Source = if StartTime <> null then #"merge" else null,
    //insert all other steps here
    NumberOfRows = Number.ToText(Table.RowCount(Source)),
    EndTime = DateTime.LocalNow(),
    Output = "Query returned " & NumberOfRows & " rows and took " & Duration.ToText(EndTime - StartTime),
    end = Table.FromList({Output})
in
    end
and:
let
    StartTime = DateTime.LocalNow(),
    Source = if StartTime <> null then #"lookup" else null,
    //insert all other steps here
    NumberOfRows = Number.ToText(Table.RowCount(Source)),
    EndTime = DateTime.LocalNow(),
    Output = "Query returned " & NumberOfRows & " rows and took " & Duration.ToText(EndTime - StartTime),
    end = Table.FromList({Output})
in
    end
At the end of this, my lookup takes 40% of the time of the merge with 10,000-row files. This gap will only widen as the number of rows increases and as network latency slows your query down. Keep this method in your toolbox, because it’s a handy way to speed up queries.
Is that really a merge?
Now if you’ve been scrutinising this, you’ll notice that this isn’t a real merge! If you have a one-to-many or a many-to-many relationship between your two tables, the code above won’t help you.
What I mean is: suppose an employee worked in more than one country, so you had 2 rows in the country table for that single employee, and you wanted to end up with 2 rows for them in your output. A record lookup won’t do this for you, and you should go back to using the merge queries function.
The thing is, most of the time I find myself reaching for the merge function, I don’t want to end up with more rows. In fact, I usually want to actively avoid this, as the resulting duplicates might cause bugs. Using a record lookup actively enforces that you end up with the same number of rows you started with, with no duplicate IDs.
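If you do hit a one-to-many relationship but still want exactly one output row per employee, one option (a sketch of my own, not part of the timings above) is to collapse the country table first with Table.Group and Text.Combine, then build the record from the grouped table exactly as before:

```
/* hypothetical query: collapses duplicate ids before building the lookup record */
let
    Source = #"employeecountries",
    /* one row per id, with all of that employee's countries joined into one string */
    Grouped = Table.Group(Source, {"id"}, {{"country", each Text.Combine([country], ", "), type text}}),
    #"Renamed Columns" = Table.RenameColumns(Grouped, {{"id", "Name"}, {"country", "Value"}}),
    ToRecord = Record.FromTable(#"Renamed Columns")
in
    ToRecord
```

An employee with rows for both “China” and “Japan” would then look up to a single "China, Japan" value, keeping the one-row-per-id guarantee.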
If you have more Power BI performance tips, or you want to complain about this post, hit me up on Bluesky.