Speed up Power BI merges with record lookups
14 November 2024

You’re in Power BI, in the Power Query editor, getting data. Imagine you’ve got a CSV like this:
| id | name | salary |
| --- | --- | --- |
| EMP-1 | Johnny Bravo | 10000 |
| EMP-2 | Donald Duck | 20000 |
| EMP-3 | Finn the Human | 15000 |
| EMP-4 | Jake the Dog | 5000 |
And another CSV like this:
| id | country |
| --- | --- |
| EMP-1 | New Zealand |
| EMP-2 | China |
| EMP-3 | United Kingdom |
| EMP-4 | Australia |
And you want to combine the two to get:
| id | name | salary | country |
| --- | --- | --- | --- |
| EMP-1 | Johnny Bravo | 10000 | New Zealand |
| EMP-2 | Donald Duck | 20000 | China |
| EMP-3 | Finn the Human | 15000 | United Kingdom |
| EMP-4 | Jake the Dog | 5000 | Australia |
The first thing I’d usually do is jump to the “Merge queries” function (in the “Combine” group on the Home ribbon) and merge the two tables together on the “id” column. After this, I could expand the resulting merged table to get the country value in my employees table.
This is how this might look:
Getting the employees table:
/* table name: employees */
let
    Source = Csv.Document(File.Contents("C:\path\to\employees.csv"), [Delimiter=",", Columns=3, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {{"id", type text}, {"name", type text}, {"salary", Int64.Type}})
in
    #"Changed Type"
Getting the countries table:
/* table name: employeecountries */
let
    Source = Csv.Document(File.Contents("C:\path\to\employeecountries.csv"), [Delimiter=",", Columns=2, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {{"id", type text}, {"country", type text}})
in
    #"Changed Type"
And I’m going to make a third table which combines these with a merge:
/* table name: merge */
let
    Source = #"employees",
    #"Merged Queries" = Table.NestedJoin(Source, {"id"}, employeecountries, {"id"}, "employeecountries", JoinKind.LeftOuter),
    #"Expanded employeecountries" = Table.ExpandTableColumn(#"Merged Queries", "employeecountries", {"country"}, {"country"})
in
    #"Expanded employeecountries"
This works great for some small CSV files on your local machine, but if you’ve got tables with hundreds of thousands of rows, and your sources are from across the internet, you’re going to hit some performance problems.
A faster way
First we want to convert the countries table into a record. As far as I can tell there’s no way to do this in the GUI, so you’re going to have to jump into the Advanced Editor to smash in this code:
/* table name: employeecountriesrecord */
let
    Source = #"employeecountries",
    #"Renamed Columns" = Table.RenameColumns(Source, {{"id", "Name"}, {"country", "Value"}}),
    ToRecord = Record.FromTable(#"Renamed Columns")
in
    ToRecord
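For reference, Record.FromTable expects exactly two columns named Name and Value (hence the rename above), and for our four-row example the resulting record should look roughly like this (field names containing hyphens need M’s #"…" quoting syntax):

```
[#"EMP-1" = "New Zealand", #"EMP-2" = "China", #"EMP-3" = "United Kingdom", #"EMP-4" = "Australia"]
```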
After you’ve added this in, it’s important to right-click this query and switch off “Enable load”; otherwise Power BI will automatically try to turn this query back into a table.
Now to grab the data. You can use the Record.FieldOrDefault function to do a lookup on the related record. This will default to null if it can’t find the key in the lookup record. If you want it to fail when there is no matching key, you can use Record.Field instead.
/* table name: lookup */
let
    Source = #"employees",
    #"Added Custom" = Table.AddColumn(Source, "country", each Record.FieldOrDefault(employeecountriesrecord, [id]))
in
    #"Added Custom"
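For completeness, the strict variant would look like this (a sketch with a hypothetical query name; the only change is swapping Record.FieldOrDefault for Record.Field, so any employee id missing from the record produces an error value in that row instead of null):

```
/* table name: lookupstrict (hypothetical) */
let
    Source = #"employees",
    /* Record.Field raises an error when [id] has no matching field in the record */
    #"Added Custom" = Table.AddColumn(Source, "country", each Record.Field(employeecountriesrecord, [id]))
in
    #"Added Custom"
```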
The bigger your data is, the bigger the difference this will make: a naive merge can take O(n × m) time (the work grows with the product of the two tables’ row counts, so it increases roughly quadratically as both grow), whereas a record field lookup is effectively constant time per row, so the total work only grows linearly with the number of rows.
For demonstration, you can use Git Bash to automatically build larger tables with these scripts (okay, these were written by ChatGPT, but I used my extensive bash knowledge to audit them 👀).
#!/bin/bash
# Creates employees table

# Set the output file and the number of rows
OUTPUT_FILE="employees2.csv"
NUM_ROWS=100000 # Adjust this for the number of rows you want

# Define some sample names
NAMES=("Johnny Bravo" "Donald Duck" "Finn the Human" "Jake the Dog" "Mickey Mouse" "SpongeBob SquarePants" "Rick Sanchez" "Morty Smith" "Homer Simpson" "Peter Griffin")

# Create the CSV header
echo "id,name,salary" > "$OUTPUT_FILE"

# Generate random data for each row
for i in $(seq 1 $NUM_ROWS); do
    # Unique employee ID
    ID="EMP-$i"
    # Random name from the list
    NAME="${NAMES[$RANDOM % ${#NAMES[@]}]}"
    # Random salary between 5000 and 30000
    SALARY=$(( (RANDOM % 25000) + 5000 ))
    # Write row to file
    echo "$ID,$NAME,$SALARY" >> "$OUTPUT_FILE"
done

echo "Generated $NUM_ROWS rows in $OUTPUT_FILE"
And this one:
#!/bin/bash
# Creates countries lookup table

# Set the output file and the number of rows
OUTPUT_FILE="employeecountries2.csv"
NUM_ROWS=100000 # Adjust this for the number of rows you want

# Define a list of sample countries
COUNTRIES=("New Zealand" "China" "United Kingdom" "Australia" "United States" "Canada" "Germany" "France" "Japan" "India")

# Create the CSV header
echo "id,country" > "$OUTPUT_FILE"

# Generate random data for each row
for i in $(seq 1 $NUM_ROWS); do
    # Unique employee ID
    ID="EMP-$i"
    # Random country from the list
    COUNTRY="${COUNTRIES[$RANDOM % ${#COUNTRIES[@]}]}"
    # Write row to file
    echo "$ID,$COUNTRY" >> "$OUTPUT_FILE"
done

echo "Generated $NUM_ROWS rows in $OUTPUT_FILE"
Now we can add some more queries to measure how long these take, using Chris Webb’s technique for timing query durations:
let
    StartTime = DateTime.LocalNow(),
    Source = if StartTime <> null then #"merge" else null,
    //insert all other steps here
    NumberOfRows = Number.ToText(Table.RowCount(Source)),
    EndTime = DateTime.LocalNow(),
    Output = "Query returned " & NumberOfRows & " rows and took " & Duration.ToText(EndTime - StartTime),
    end = Table.FromList({Output})
in
    end
and:
let
    StartTime = DateTime.LocalNow(),
    Source = if StartTime <> null then #"lookup" else null,
    //insert all other steps here
    NumberOfRows = Number.ToText(Table.RowCount(Source)),
    EndTime = DateTime.LocalNow(),
    Output = "Query returned " & NumberOfRows & " rows and took " & Duration.ToText(EndTime - StartTime),
    end = Table.FromList({Output})
in
    end
At the end of this, my lookup takes 40% of the time of the merge with 10,000-row files. This gap will only widen as the number of rows increases and as network latency slows your query down. Keep this method in your toolbox, because it’s a handy way to speed up queries.
Is that really a merge?
Now if you’ve been scrutinising this, you’ll notice that this isn’t a real merge! If you have a one-to-many or a many-to-many relationship between your two tables, the code above won’t help you.
What I mean is: suppose an employee worked in more than one country, so you had 2 rows in the country table for that single employee, and you wanted to end up with 2 rows for them in your output. A record lookup won’t do this for you, and you should go back to using the merge queries function.
The thing is, most of the time I find myself reaching for the merge function, I don’t want to end up with more rows. In fact, I usually want to actively avoid this, as the resulting duplicates might cause bugs. Using a record lookup actively enforces that you end up with the same number of rows you started with, with no duplicate IDs.
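If you do hit a one-to-many relationship but still want exactly one output row per employee, one option (a sketch of my own, not part of the timings above) is to collapse the country table first with Table.Group and Text.Combine, then build the record from the grouped table exactly as before:

```
/* hypothetical query: collapses duplicate ids before building the lookup record */
let
    Source = #"employeecountries",
    /* one row per id, with all of that employee's countries joined into one string */
    Grouped = Table.Group(Source, {"id"}, {{"country", each Text.Combine([country], ", "), type text}}),
    #"Renamed Columns" = Table.RenameColumns(Grouped, {{"id", "Name"}, {"country", "Value"}}),
    ToRecord = Record.FromTable(#"Renamed Columns")
in
    ToRecord
```

An employee with rows for both “China” and “Japan” would then look up to a single "China, Japan" value, keeping the one-row-per-id guarantee.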
If you have more Power BI performance tips, or you want to complain about this post, hit me up on Bluesky.