How many different ways can you write an address? Turns out… a LOT. Typically, when I am cleaning address data (yes, that’s something I do quite regularly, what with being a spreadsheet whisperer and all), I am preparing to either (1) upload a single file into an online database or (2) create a mail merge of letters and address labels. Rarely do I need to compare the list of addresses to ANOTHER list of addresses, but that’s exactly the conundrum that came my way from a social justice org that was doing some targeted organizing based on two lists of addresses, and needed my help to compare them. This one was juicy and I even learned a new formula!
There were a couple of tricky issues that we needed to tackle in the source files.
Clean up and standardize addresses
One source file had the full Street Address in one cell (ie 123 Main Street) and the other had the parts of the Street Address dispersed in separate cells (ie 123 | Main | Street). Our first idea used the “delimiter” function to splice the “123 Main Street” into separate cells. We were hoping that we would have one column of Street/St/St./Str etc so that we could easily capture all of the variations and “normalize” them, but any addresses that had additional modifiers (like “West Main Street” or “Suite 1200”) used wayyy more columns. There was no common column of all of … what I am calling address designations (street/road/avenue).
So instead of splicing the Street Address into separate columns, we decided to go the other way and “glue” the separate cells together into one! This used the =Concatenate() function. While we successfully ended up with a Street Address in one cell, we introduced a new problem. Since some addresses had more words (like “West” or “Suite”), we had to concatenate many cells together. The addresses that did *mot* have those extra parts wound up with “trailing spaces” at the end. In other words, since there was no value in the cells that we smooshed together, Excel added an invisible spacebar. I could have added logic to skip the splicing rule if there was no value in the cell, but instead, I used another trick. One of my favorite formulas in my toolbox is the TRIM() function, which casually snips invisible spaces at the end of a cell. Voila!
Additionally, one source file used all caps (which is pretty common in data pulls) but it felt like the spreadsheet was yelling at my eyeballs! I used the =PROPER() function to easily change “123 MAIN STREET” to “123 Main Street”.
Now that we got all of the parts of the address (building #, street name, street designation) into the same cell, properly capitalized, and no extra spaces, we can move on to the next phase… taking a guillotine and cutting off the street designation. Since it is TOO HARD to change all of the variations/abbreviations of street/avenue/etc, we decided to standardize addresses by limiting to just “123 Main.” This will allow us to compare the most important part of the addresses to each other. Since we were working within only once city’s worth of data, we decided that we had limited risk of having “123 Main St” and “123 Main Avenue” in the same dataset. YMMV 🙂
In order to accomplish this goal, I turned to google and found a very robust conversation on Reddit about how to do this. Ultimately, I used a combo of formula that included REPT(), SUBSTITUTE(), TRIM(), and RIGHT(). This is the winning formula:
=TRIM(RIGHT(SUBSTITUTE(A2, " ", REPT(" ", 100)), 100))
What does it do? In words format, it takes “123 Main Street” and turns it into “Street”; or with “123 Main St”, this formula will return “St”. It finds the last “word” in the cell (aka anything after the last space bar) and gives it back to us in its own cell.
In logic format,
- First, it replaces every “space” with 100 spaces using the REPT() function and the Substitute() function (woah) – this seems bananas until…
- Then it finds the last 100 characters of the cell value (starting from the right, aka the last letter) (which in this case is ” [space x ~100] street”
- And then it removes the extra space bars (whoooo hoooo handy dandy TRIM() is making another appearance!)
- And then it is left with just the remaining letters
- (If you replace the “Right” with “Left” you would get the FIRST word of the cell)
Now that we knew the last “word” of the Street Address, we could subtract it from the rest of the Street Address with a formula like this:
Let's assume: Cell A2 = full Street Address "123 Main Street" Cell B2 = result of prior formula, "Street" Cell C2 = Cell where new formula will go =LEFT(A2, (Len(A2)-Len(B2))))
This formula figures out how many letters “street” or “st” are, and removes that many letters from the original Street Address. Therefore, “123 Main Street” becomes “123 Main” and “123 Main St” becomes “123 Main”. Ding, ding, ding!
Now we have to wash, rinse, and repeat. These data transformations need to occur in BOTH address lists, so that we get the equivalent of “123 Main” for every address. We went back and did manual cleanup for the ones that had “Suite xxxx,” there were relatively few of them, but this formula logic does not work for that! A more advanced solution would need to deal with that!
Another enhancement – we could do a similar formula to find cardinal directions and standardize them if necessary (ie West/W, South/S, Northwest/NW… if you are Washington DC and you have NW, that is!)
Now that we have two, clean, consistent columns of addresses, we can use use a (relatively) simple VLOOKUP to compare them! I go into much more detail about how to compare data this way in my post on combining email lists. But really, it’s the same type of thinking.
Final round of cleaning – look for anything that seems nonstandard about the addresses (like additional abbreviations (ie JFK vs. John F. Kennedy [Boulevard]). One way that I did an extra check was to add a fake address to both lists to make sure that my compare formula was accurate.
VLOOKUP() is an extremely effective formula, not only for finding data in a list (it’s most common mainstream use), but also for comparing two lists (most common use for me!).
This was SUCH a fun project, which took something that seemed impossibly tedious and turned it into something that was still tedious, but a lot more efficient. And much, much better than the “highlighter method.” It will still take human eyeballs and street smarts 😉 to interpret and refine the data comparison. However, with the computer doing SOME of the work for us, which only took about 30 mins to set up, my friend can spend a LOT more time *organizing* the living, breathing humans at these addresses and a lot LESS time organizing the addresses themselves.
And heck, if they ever need to send a letter to all of these buildings, the spreadsheet is in great shape for a mail merge. We just gotta add the street designations back in there and let ‘er rip 🙂