Most data analysts will admit that up to 80% of their time is spend doing “data prep”, sorting out the data that will be analysed and presented to the end user. So anything that we analysts can do to reduce that time means more actual analysis, right?
For me, part of this data prep involved joining lots of reports together, adding variables manually that I would later use in Tableau. It takes ages and is likely to incur human error into your data. Tableau’s ability to union your data together is a massive time saver so I’m quickly going to show you how its done.
What is a data union?
A data union is sort of the opposite to a join or merge, which combines data that link together on common fields. E.g. one set of data about peoples’ shoe sizes can link to a set of data about peoples’ heights, so long as there is a common identifier in each data set like name. In a union, you are adding one set of data to the bottom of another set, so long as they share common fields (preferably all) E.g. one set of data about peoples’ shoe sizes can be unioned to a second set of data about different peoples’ shoe sizes to build a bigger list.
So how is it done? In this example I have a folder with monthly sales data and monthly returns data about the products that I sell.
I want to make a Tableau dashboard showing the monthly sales of my products. Luckily the data in each of my “sales” sheets has the same fields in it, but unluckily for me one of those fields is not the date.
When I open Tableau and select Excel from the list of data sources, I just pick any of the sales workbooks, as this is the type I want to analyse.
I then click the dropdown arrow and select Convert to Union…
The following menu appears.
This allows you to either specifically tick which workbooks you want to union, which is good for a one-time bit of data analysis but if you are likely to be refreshing this data, say, monthly, you’ll want to set it up so that it automatically adds in new data.
So I’ve selected Wildcard (automatic). You can specify certain Excel Sheet Names to include or exclude (you can just type in part-strings as well), but as all my data has “Sheet1” I leave this blank. On Workbooks however, I only want those with sales data, and I want to exclude the returns data. So I type “sales*” into the matching pattern. This will only include workbooks in the union that have “sales” in the title.
There are two optional extras here, whether you search through subfolders or through the parent folder. This is useful in my example because in addition to my 2018 folder I also have a 2017 folder, so by ticking these two boxes, I grab all the data in the 2017 folder as well.
So once I click OK I get two new fields in my data; sheet and path.
As you can see my data includes 2018 and 2017 files. The Path field is going to be very useful as I can use it to generate a date field. The Sheet field is not that useful in this example as they are all the same.
Now in my data I create a date field using my Path field.
First I use the MID function to extract the month. If you haven’t used it before, MID is a really useful function, and takes a string from within another string. The first number is the starting position, and the second number is the length of the string you want to take. E.g. MID([Path],6,5)=”sales”.
I use the LEFT function to take the year from the Path name, which is the first four characters thanks to my folder structure. I then put a “/” in the middle to make it look like a regular date field.
If I wanted to I could add in the day as well, but as they’re all the first of the month I shan’t bother here.
So there you have it, I can now use this dataset as if it was one Excel workbook with a date field in it. If I save it as a Tableau Workbook it will refresh each time I open it, so any new data will automatically load. As I said, too useful not to share so if you haven’t used it before, have a go yourself.
As a side note I’ve found this really useful to use with Microsoft Flow. If there are daily/weekly/monthly Excel reports you get by email from the same person/email address you can set up a Flow to move the attachment into a folder on your computer/OneDrive etc. which can then feed the workbook. A simple easy way to speed up some of that 80% data prep time.