PASS BAC PREVIEW SERIES: SQL Professionals and the World of Self-service BI and Big Data

As the PASS Business Analytics Conference approaches, we are excited to host this new blog series featuring content from some of the amazing speakers that you will experience at PASS BAC, April 10-12 in Chicago. This new conference is the place where Business Analytics professionals will gather to connect, share and learn in world class sessions covering a wide range of important topics. You can find out more about the sessions, the speakers, or register here.

*     *     *


 A member of Microsoft SQLCAT, Cindy Gross (@SQLCindy) works with Microsoft Business Intelligence, SQL Server, Hadoop, and Hive to make data more accessible to decision makers and generate more business value. She has extensive data platform experience, with roles that span engineering and direct customer engagement, and was honored to earn the title of SQL Server 2008 Microsoft Certified Master. As an active member of the technical community, she regularly speaks at local and national events and contributes to whitepapers, blogs, technical articles, and books.

Cindy will be co-presenting at PASS BAC with Eduardo Gamez of Intel in a 60 minute Breakout Session “How Intel Integrates Self-Service BI with IT for Better Business Results.”

The IT buzzword of the year seems to be “big data.” If you listen to the hype, big data will solve all the world’s problems and do so at a bargain price. While that’s obviously an exaggeration, the hype does point to an underlying change in the way we approach our data and how we make that data available within a business. Self-service BI allows business experts to use familiar tools such as Excel to quickly create and share data models and reports that are highly impactful to their portion of the business. They may base their data models and reports on data sources that weren’t available in the recent past, providing new insight not previously available. External data, archival data, previously unknown and unexplored data, big data, and large volumes of data are cheaper to store and shared more easily than ever before. This changes how IT interacts with business users.

For anyone looking for the latest and greatest in BI, or someone trying to cut costs at every opportunity, the concepts of big data sound like a siren song: Scale out, distributed storage and processing, inexpensive hardware, elasticity, schema on read instead of write, flexibility, speed of end-to-end answers, data too big to store affordably in existing systems. As a result, some people are trying to replace their relational and multi-dimensional systems such as SQL Server and Analysis Services with big data technologies such as Hadoop as if they are completely interchangeable. This presents a problem, as they fill different needs, have different core strengths, and often work together in a complementary fashion rather than opposing each other.

For example, the SQL Server PDW V2 appliance has a feature, codenamed Polybase, which allows a single TSQL query to combine SQL Server PDW data with Hadoop data in a single result set. A frequent use case in the big data world, used by the likes of Yahoo! and Klout, is to do the initial data processing in a batch-based, scaled-out Hadoop system, then pull the aggregated data into Analysis Services where individual queries are fast and the business can easily and quickly get the answers needed for their standard questions. Self-service BI tools such as PowerPivot (highly compressed in-memory data models in Excel 2013 or in Excel + SharePoint) and Power View (easy, powerful visualizations within Excel reports) can see Hadoop data, including Microsoft’s distribution of Hadoop called HDInsight, as just another ODBC data source if you use the Hadoop ecosystem component Hive to impose a “rows and columns” structure on the data. Once a PowerPivot data model is deemed to be “important enough” to be shared or if it needs more memory than is available on a client machine, it can be easily imported into a server-side Analysis Services tabular model which can also be a data source for Power View reports, Excel spreadsheets, and other BI tools. The bottom line here: Your users may be accessing and incorporating big data into their self-service BI reports without evening knowing it and you, the IT professional, need to know if/when/how to adjust your tools, skill sets, and processes to handle these new realities.

Sometimes Hadoop and other big data technologies are not the only – or the best – way to provide that processing power or data sources for all those self-service reports. If the data is already well-known, well-curated, and loaded into an existing system that answers the important questions for the business units involved, then copying or moving that data to a new Hadoop-based system may not be the best or most economical answer. SQL Server systems can handle many terabytes of data and the new Parallel Data Warehouse (PDW) V2 appliances will have several petabytes of storage per appliance. But, if you have new/expanded data sources (such as keeping more columns or rows than the existing relational system is designed for), data that isn’t well defined or has multiple formats or interpretations, or you aren’t sure yet what questions you will ask, then one of the big data technologies such as Hadoop with the Hive component may be a great fit for exploring the data. With Hadoop you can easily and quickly change the structure you impose on the data and do a lot of batch and ad hoc processing across a scaled out system of relatively inexpensive nodes. Each individual program or query is exploring an entire data set (for SQL Server folks this is like a table scan), so individual queries, especially if they go against subsets of the data, are likely to be slower than well-known, well-indexed queries on a relational system such as SQL Server. But the end-to-end time from when you decide you need to explore some particular combination of often previously unknown and/or uncurated data sets to when you are able to provide some sort of programmatic answer to a set of often previously undefined questions is likely faster in Hadoop. This fits well with the idea of self-service BI. Once you figure out what data is most valuable and the sorts of questions that provide the most value, you may be able to pull a subset of the data (fewer columns and/or fewer rows) into SQL Server and/or Analysis Services, add indexes or aggregations, and run those individual queries faster. As a SQL Server or BI implementer it is very important to understand enough about Hadoop and the other big data technologies to discern which projects need to include SQL Server, Analysis Services or PowerPivot, StreamInsight, PDW, or Hadoop, and at what stage. The more you know about both current systems and new ones, the more successful you, your team, and your company will be at implementing impactful projects at a reasonable cost.

The upcoming PASS Business Analytics Conference is a great opportunity to dip your toes or really dive into analytics, self-service BI, and big data. Whether you are new to these areas or looking to gain more depth you will find sessions to help meet your goals, starting with a keynote by Dr. Steven Levitt, author of Freakonomics. This is a great opportunity to network with other professionals working with Microsoft BI and Business Analytics including data scientists, architects, analysts, implementers, and business decision makers. I look forward to talking to you at the conference and learning more about your BI/BA needs and solutions!