This post is brought to you by Runar Olsen, an Architect at Microsoft Search Center of Excellence, and by Andrew Moll and Greg Guillen, Associate Consultants at Microsoft Big Data Center of Excellence.
Your e-commerce site can increase revenue opportunities, and understanding what user behavior led to a purchase can help retain repeat customers and gain new ones. In this post, we explore how you can connect SharePoint with Hadoop and map-reduce analysis in the background to create a web browsing experience that is personalized to the end user’s intent and past behavior, thus increasing the success of a sale on your e-commerce site.
We’ll use a fictitious e-commerce site, Contoso Electronics, to show how we integrated it with HDInsight Service to provide a Big Data use case for helping end users find results easier on the search result page. We’ll also look at how we used HDInsight Service to add functionality to the electronics search result page by doing clickstream analysis on user events.
The Contoso Electronics site is running on SharePoint 2013, using the new cross-site publishing model. The product catalog data is maintained in Dynamics AX and pushed to SharePoint 2013 where it’s rendered in a web site format. (The data can also be rendered on tablets and phones.
The new functionality offers a quicker way for people to find what they were looking for. For instance, if a user is searching for the keyword tablet, the search results will display products that other users ended up buying after searching for that term. By analyzing the purchase history and search query patterns of previous users, we are able to provide this information to the new user who just entered the site and searched for a tablet. On the web site it would look like this:
There are a number of streams of user event data we can use. For this analysis, we need the search query log and purchase logs for all users. We can also look within the web server logs, and find query term referrals from web search engines. For example, a user might search for tablet on Bing.com. In the search results, they see a Contoso Electronics product and browse directly to the product to buy it without using the search box on the Contoso Electronics site. Often, e-commerce sites already have web analytics tools in place to collect this data, and the logs can be retrieved from there. For our scenario, we auto-generated several gigabytes of user search and purchase logs for our fictitious site.
Since we have large amounts of unstructured events to parse in log files, we choose to use HDInsight Service to analyze the data. Below is the architecture we settled on for our pilot. The Contoso Electronics web site is running in Azure IaaS, so everything is already in the cloud.
Because the web logs are stored in Azure Storage Vault (ASV), we never have to bring any of the data on-premises-we can keep everything in the cloud. When data is stored in ASV, it is easily accessible by our HDInsight Service via Sqoop, which can take multiple data files from ASV and move them into the Hadoop Distributed File System (HDFS) where mappers/reducers can manipulate it.
Another option is to create a Hive External Table. When data is stored in ASV, it is easily accessible by our HDInsight Service via external Hive tables. External Hive tables allow a metadata wrapper to be layered on top of your unstructured data sitting in ASV, ultimately providing a structured table format that can be exported to a relational database for further business analytics. This provides a cost-effective IaaS method for customers to utilize their HDInsight Service Cluster for analytical purposes, leaving the cheap storage aspect of your data to Azure Storage Vault while still relying on HDInsight to perform the data analytics.
Hadoop Streaming takes 4 inputs: mapper, reducer, input files, and output location. The beauty of this is that your mapper and reducer do not need to be written in Java. Hadoop mappers are natively written in Java, but we wrote ours in C#, and then used Hadoop Streaming to make the necessary adjustments. Not every problem can be solved with MapReduce (M/R) but even fewer can be solved with only one M/R job. This holds true in our example as we have multiple MapReduce jobs. In our example, there are two M/R jobs that are “chained” together, which means that the output of one job is used as the input for the second job.
Our first mapper takes the web logs and pulls out <User ID> and <Event> (purchased, viewed, or queried)
From here we have a reducer that organizes the user’s session and outputs the sorted data to a file.
The second mapper picks up the data from the first mapper output. From here the mapper adds a value of 1 or 10, depending on if the product was only viewed (1) or bought (10) times.
The reducer then takes this data, sorts it, and adds up the values. Therefore a higher number = more products purchased. This data is then exported as a text file to complete the analysis.
The Hadoop portal shows details about the Map Reduce Job that is running on our cluster.
After Hadoop has completed its analysis, we have a big text file in this format:
<product id>, [<search query:count>, <search query:count>, …]
Now you can create a Hive table on top of the text files, giving the data a table format structure which can be exported into any relational data store using SQL Server Integrated Services.
Our source data in the Product Catalog in SharEPoint is of this format:
<product id>, <category>, <price>, <brand>, <weight>, …, …
We are now enriching the content of a given product, by adding relevant search queries that lead to the purchase of that given item.
<product id>, <category>, <price>, <brand>, <weight>, …, <purchasedqueryfor>
There is no overwriting of the existing data, so it can always be updated, independent of our Hadoop data. When the original data changes (for instance, a price update), it will not change the column where we put our output from the analysis conducted above. We used the new RESTful APIs in SharePoint2013 to update the metadata column in the product catalog. (Product Catalog is a SharePoint list.)
On the front page in SharePoint we added a new Content By Search web part, which adds a custom search query:
The result is a set of products that users ended up buying after searching for the given input query term. The web part that was added to the search page is configuration only, and there is no need for any custom code to achieve this.
What we’ve seen in this blog post is a concrete example of how to integrate HDInsight Service with SharePoint 2013 to provide functionality on the search page on the Constoso.com website. By increasing search relevance for customers we are able to provide users with the actual and similar products they are looking for, thus increasing the opportunity for a company to boost revenue through products purchased. In addition, a company can also understand what user views led to a purchase, which can assist with planning market product positioning for future customers.