This blog post is brought to you by Diego Oppenheimer a Program Manager on the Excel team.
I am very happy to be writing this blog post today. Not just because I will be showing you another way Excel can make your data analysis easier but also because I will be introducing the new Data Model and Relationships features that will hopefully change the way you use Excel for data analysis forever.
For those of you who are not familiar with the power and usefulness of Pivot Tables you might want to check out this article (Overview of PivotTable and PivotChart reports) or this training (PivotTable I: Get started with PivotTable reports) . Some of these articles are a bit old but the principles and functionality still apply .
Ok, here we go…
Finding a home
Around this time last year my wife and I were considering purchasing a house in the Seattle area, even if it meant dealing with some of the worst traffic in the US. So like any self-respecting Excel nerd I started a spreadsheet with a table of data that fit our parameters. This data was easy to find on the many real-estate sites out there like Zillow.com or Redfin.com. One thing I noticed though was that none of the these sites by themselves had all the relevant data I wanted to make an informed decision and this is where the Data Model came into play by allowing me to combine data from multiple sources and perform a richer analysis.
You can see what I started with below or just download the workbook for yourself.
My first look at the Data Model
If you open the file above you will see I have a table with a lot of data. The first thing I am going to do is create a PivotTable so that I can sift through it easily. Under the INSERT tab, hit PivotTable and the
following dialog should pop-up:
I have highlighted a new option in the create PivotTable dialog which is to “Add this data to the Data Model”. So what is this Data Model I speak of?
“A Data Model is a new approach for integrating data from multiple tables, effectively building a relational data source inside the Excel workbook. Within Excel, Data Models are used transparently, providing data used in PivotTables, PivotCharts, and Power View reports“. Read more here…
In other words, the new Data Model allows for building a “model” where data from a lot of different sources can be combined by creating “relationships” between the data sources. For those of you with some database knowledge this is similar to creating joins between tables, except all the tables live in Excel.
I chose to add this data to my Data Model because I am going to be combining it with data I will get from other sources to make my analysis more complete.
The first thing I want to do is look at the number of houses I have selected by zip code. To do this I drag the ZIP field to ROWS and the LISTING ID field to VALUES. By default, this will give me a SUM of LISTING ID’s, but we want a COUNT. To do this, right click on the header that says SUM of LISTING ID -> Value Field Settings… and change to COUNT. As you can see, I only have to sift through 165 house listings now (L).
I have decided to add a couple more fields to my PivotTable to help with my analysis. I added LIST PRICE, DAYS ON MARKET and SQFT and changed the Value Field Settings to AVERAGE.
At this time my PivotTable looked something like this:
So now I have a layout that shows me the number of houses that met my criteria per Zip code and some extra data like Average Price, Square footage and Days on Market (my realtor says this comes in real handy when negotiating).
What do I know about the Zip codes?
One thing I noticed about all of the real-estate listing sites is that they give you a ton of detail about the listing but don’t really tell you much about the neighborhood. I want to better understand the demographics of the Zip codes I have selected.
Good thing there is a marketplace for just this type of inquiry (http://datamarket.azure.com). This place is great, and you can read all about it here. Make sure you have a Microsoft/Live-ID or sign up for one free.
Type “Demographics” in the search box and find a data set called “2010 Key US Demographics by ZIP Code, Place and County (Trial)”. If I click on that link to the data set, the first option is to “Explore Data Set” and that’s exactly what I want to do.
Next step is to narrow down the data set to only what is relevant, so I make GeographyType: Zip code and StateAbbreviation “WA” and hit “RUN QUERY”. Great, I think I have now what I need.
(1) The URL for the current expressed query
(2) Your private account key
Now that I have all of this figured out, I can easily add what I found to the Data Model. In Excel, go to the DATA tab and select “From Other Sources”, “From Windows Azure Marketplace”.
Fill out the information with what you have saved from the website:
Hit “Finish” and then select “Only Create Connection”:
Note: Some of you might be wondering why I chose “Only Create Connection”. I chose to do so, so that the data was never brought onto the Excel sheet, but directly into the Data Model. I only plan on using it in combination with my original table so there was no need to bring it in as a table onto my sheet or create a new PivotTable or PivotChart. If you’re wondering what that PowerView Report does check out this post by Sean Boon.
Combining my Data
You can see a tab called “Active” (which is selected) and a tab called “All”. The field list is the best way to explore our newly created Data Model, and the “All” tab lets us do just that. Any connection or table that was added to the Data Model will show up in this tab. Let’s take a look.
We can see that our original Table (Table 1) and the Table we brought in from the Data Market are both there. Great!
Now for some magic. I want to know a bit more about each of my Zip Codes, so what I am going to do is add those fields from the demog1 table to my PivotTable. The first one I’ll add is MedianAge2010 (we want to live around people our age so we have a better chance of meeting neighbors we have something in common with).
There are two things to notice here. First, the median age is the same for every zip code, which gives me a hint that something is wrong:
And second, a little message popped up in the field list.
Excel is telling me that a relationship between the two tables might be needed, so I go ahead and create it.
A couple of things to notice when working with relationships:
· Both columns chosen need to contain the same type of data (Zip Code in this case).
· They only work when one of the table’s columns contains unique values.
· The Related Column (Primary) should always be the one containing the unique (no duplicates) values.
· In a PivotTable, to be able to put something from the Related Table on Values you will have to have a field from the Related Table on Rows or Columns.
To satisfy this last condition, I remove ZIP and add GeographyId to Rows. Now my PivotTable looks something like this:
The reason for all those ugly blank spaces is because I have no houses that match my criteria from Table1 for every single Zip Code from the Marketplace data. I can easily filter these out by clicking the downward arrow next to Row Labels, Value Filters, Does Not Equal
Great, now I only have the Zip Codes I care about. I went ahead and added some more fields, conditional formatting and some sorting to figure out what the best zip code for us is and ended up with this:
Seems like 98103 is a really good candidate, falling in the middle in price and Average SQFT but towards the bottom in both Unemployment and % Vacant units. I can go ahead and add a COUNT of ListingIDs and see that I have 8 houses to go check out in this neighborhood that meet my criteria.
More Details on the Data Model
In this last example, I really am only scratching the surface of what the Data Model can do and I plan on showing much more in future blog posts. Here are some quick facts though:
- Any given workbook will only have one Data Model.
- Any table in Excel can be added to the Data Model.
- Almost all Data Sources can be added to the Data Model (SQL, Odata, Atom feeds, Excel tables and more).
- Tables in the Data Model have no limit in terms of rows.
- Relationships can be defined across multiple tables.
I hope this is enough information to make you dangerous with the new Data Model capabilities. Feel free to leave your questions and comments below or message me on Twitter @doppenhe.