Today I had a problem that is a non-starter when dealing with relational databases: I needed to do an inner join on two sets of data.
Instead what I had was two Excel spreadsheets, and it ended up taking much more time than I would have liked.
I wanted to "join" the spreadsheets based on a common email column. I thought this would be easy: there must be a built-in Excel wizard to do just that, right? Seemingly not, and the solution I ended up finding was to treat Excel like some Frankenstein SQL database.
At first I thought the "Consolidate" feature would accomplish what I needed. I wasn't managing to make it do what I wanted after looking at a few "How to use Excel consolidate" guides though. The guides I was finding were very "point-here, click-that, there you go—magic", without explaining what pointing here and clicking that actually did. I gave up on it.
Then I tried searching for "join tables excel" and got a lot of relevant add-ons that did exactly that, but they all cost money.
Finally I stumbled across a Microsoft article explaining how to do exactly what I wanted with Microsoft Query.
The only problem(s) with the article are:
- It is aimed at mail merges in Word.
- I wasn't able to add more than one Excel file as a data source, as they are able to do in the article.
- It's a ten year-old article. Also it seems the Microsoft Query interface hasn't been updated since Windows 3.1.
It took me some back-and-forth between the article, Excel and Microsoft Query to get the results that I needed so I'm documenting the steps for posterity.
Get the data in the same place
All the data has to be in the same spreadsheet; I didn't find a way to add more than one spreadsheet to Microsoft Query.
In my case, I was trying to end up with the rows from Spreadsheet 1 whose email was the same as in Spreadsheet 2. I just copied the email column from Spreadsheet 2 into Spreadsheet 1.
Define names to use as tables
Microsoft Query uses names defined in Excel to create its "tables". I needed two of these "tables": one for Spreadsheet 1's information (id and email) and one for Spreadsheet 2's information (email, which I'd pasted into Spreadsheet 1).
To create the first table I selected the two columns I wanted and then clicked "Define Name" in the "Formulas" menu, "Named cells" panel. I called it "emailsall".
The second table was created the same way, except selecting the "emails" column that I had pasted in from Spreadsheet 2, and naming it "emailsshort".
Query the "tables"
To get the results of the faux join I wanted I created a new Excel spreadsheet and then clicked on the "Data" menu, then on "From Other Sources" inside the "Get External Data" panel, and finally "From Microsoft Query".
At the "Choose Data Source" screen I selected "Excel Files" and opened Spreadsheet 1. In "Choose Columns", I added the two "tables" that had just been created. When clicking "Next" the Wizard complained about not being able to join the tables itself and they had to be manually joined. I clicked "OK" and in the Microsoft Query window clicked the little "SQL" button.
In the SQL window I edited the query to do an inner join and got the results I needed. The query went something like this:
SELECT emailall.id, emailall.email
FROM emailall emailall
INNER JOIN emailshort on emailall.email = emailshort.email
Finally go to the "File" menu -> "Return Data to Microsoft Excel"