Case Study: Predicting Customer Purchase Behaviour for an Online Contact Lens Retailer

Prologue

I met my current boss in a serendipitous manner. I was visiting my Indonesian friend in his dorm, eating a kebab and contemplating the more fundamental quarter-life-crisis questions lingering on my mind (while swiping through cat posts on Instagram). He was chatting about his recent LinkedIn exchange with one of the company's founders. I jokingly said to him, "Dude, you should tell him about me as well", which he apparently later did. I then exchanged a few emails with my (current) boss, and to my surprise he was very nice: he offered us an opportunity to undertake a capstone project at his company, working on a live client project (let's call it the "XYZ" online retailer) building predictive models for customer churn.

A few days later, with the excitement of having a project under my belt, I met my boss at their (former) office a couple of blocks from Southbank, London. Long story short, I ended up conducting my dissertation there, using the same client's data for my research topic, and ultimately jumped on board as an intern, spending three months in our London office until the end of January this year.

Introduction

Purchase recommendation based on implicit feedback

Going back to my dissertation work, in XYZ's online setting the implicit feedback comes in the form of 'buys', in contrast to an explicit feedback system where customers give an item a rating, say from one to five. In the latter case, a higher rating normally implies a stronger indication that the customer fancies the item, whereas in our case we cannot be sure whether a customer failed to purchase an item because they did not know about it or because they were simply less interested.

Therefore, implicit feedback data is generally characterised by highly imbalanced classes and a very sparse dataset. This happens because customers only ever observe a small fraction of the entire product collection. Consequently, in order to turn the problem into a binary classification task, we need a way to introduce negative samples, and this requires a framework that is both inexpensive and flexible.

Collaborative filtering

Traditionally, most online retailers have relied on collaborative filtering (CF) algorithms to provide recommendations. The underlying principle behind CF is that if several users share a common interest in buying an item, their preferences may overlap strongly (user-based). The other way around, if several items are bought by the same group of customers, those items probably have something in common (item-based). Pure CF quantifies this relationship between items and customers using the mathematical concept of distance.

In our case, the similarity was determined by purchase history. The approach sounds very logical: if a customer's preferences are very similar to another customer's, goods that the first customer buys might also interest the second.
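To make the idea concrete, here is a minimal sketch of item-based similarity computed purely from purchase history. The toy purchase log and column names are my own illustration, not the client's actual schema.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy purchase log; column names are illustrative only.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "item_id": ["daily_A", "eye_drops", "daily_A", "monthly_B", "eye_drops"],
})

# Binary customer-item matrix: 1 if the customer has ever bought the item.
user_item = pd.crosstab(purchases["customer_id"], purchases["item_id"]).clip(upper=1)

# Item-item similarity: items bought by the same customers end up close together.
item_sim = pd.DataFrame(
    cosine_similarity(user_item.T),
    index=user_item.columns,
    columns=user_item.columns,
)
print(item_sim.round(2))
```

The same matrix, transposed, gives customer-to-customer similarity for the user-based variant.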

However, pure CF is content-agnostic, and this raises an important issue given the heterogeneous content of the items. For instance, customers may buy contact lenses from a specific top brand or manufacturer. If there is no purchase history to analyse, unpopular or new items would not be recommended, and the recommendations could end up having a very narrow focus on certain items.

Content-based filtering

There are various types of contact lenses, namely disposable, reusable, astigmatic and multifocal, each addressing different visual requirements. All of this information about a contact lens is reflected in its replacement and wearing schedules, each offering its own benefits and perks.

In addition, different lifestyle, occupational, and leisure needs could also be used to fuel the recommendation engine. Someone who needs to wear contact lenses six days a week is different from someone who wears them occasionally during the holiday season.

The content-based approach has intuitive appeal: it gets its name from the principle of matching items to customers based on as much of the existing knowledge as is available. Furthermore, with content-based filtering, we can turn the problem into a supervised learning case and apply a machine learning classifier.
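As a rough sketch of that framing, each (customer, item) pair becomes a row of content attributes with a purchase label. The attribute names, toy data, and the choice of logistic regression below are illustrative assumptions, not the actual model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Each row is a (customer, item) pair described by item content and a simple
# customer attribute, labelled 1 if the item was bought and 0 otherwise.
pairs = pd.DataFrame({
    "lens_type":      ["daily", "monthly", "daily", "toric"],
    "brand":          ["Acme", "Bright", "Acme", "Clear"],
    "wears_per_week": [6, 2, 6, 1],
    "bought":         [1, 0, 1, 0],
})

X = pd.get_dummies(pairs.drop(columns="bought"))   # one-hot encode the content attributes
y = pairs["bought"]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])   # purchase likelihood for each pair
```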

However, one downside of a content-based model is that eliciting user preferences is onerous. Only a few aspects of an item are available, and there are many more factors that might influence customer preferences. For instance, one customer may use a reusable lens on weekdays and then switch to a disposable lens at the weekend. This makes the recommendation problem one that must also reflect very specific individual needs.

The hybrid approach

We tried to tackle the limitations of both content-based filtering and collaborative filtering by training a binary regression model to predict the likelihood of an item being purchased, some of whose features were obtained from collaborative filtering models. We think that by straddling both techniques we could get the best of each.

The underlying idea of this approach is that collaborative filtering models view both purchased items and customers as vectors and define the similarity between them as the angle between those vectors. If the angle is small, the items are probably similar. Once we have the similarity model, we can create a feature from the predicted score that an item will be purchased by a user, computed as a weighted sum of similarities to the items the user has already bought. Moreover, we can add more features from the user's past purchase history, such as the total transaction amount and the total number of items bought in the last 90 days.
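Below is a hedged sketch of how such features could be assembled. The similarity values, the `history` schema, and the helper names are all made-up illustrations of the weighted-sum and recency ideas described above.

```python
import pandas as pd

# Toy item-item cosine similarities (illustrative numbers only).
items = ["daily_A", "monthly_B", "eye_drops"]
item_sim = pd.DataFrame(
    [[1.0, 0.2, 0.7],
     [0.2, 1.0, 0.6],
     [0.7, 0.6, 1.0]],
    index=items, columns=items,
)

def cf_score(bought_items, candidate):
    """Weighted-sum score: similarity of the candidate to the items already bought."""
    return item_sim.loc[candidate, bought_items].sum()

def recency_features(history, as_of, days=90):
    """Past-purchase features such as spend and item count in the last `days` days."""
    recent = history[history["date"] >= as_of - pd.Timedelta(days=days)]
    return {"amount_last_90d": recent["amount"].sum(),
            "items_last_90d": len(recent)}

history = pd.DataFrame({
    "date": pd.to_datetime(["2014-11-01", "2014-12-20"]),
    "amount": [25.0, 40.0],
})
features = {"cf_score": cf_score(["daily_A", "eye_drops"], "monthly_B"),
            **recency_features(history, pd.Timestamp("2015-01-15"))}
print(features)
```

Each (customer, candidate item) pair then carries both a CF-derived score and behavioural recency features as inputs to the classifier.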

To visualise this similarity amongst items, we can arrange the scores obtained from the cosine similarity computation into a matrix. As we can see, eye drops sit quite close to dailies and monthlies, which makes perfect sense: someone who buys daily or monthly contact lenses probably also buys eye drops.
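One simple way to produce such a view is a heatmap of the similarity matrix; the category names and numbers below are toy values, not the client's figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

categories = ["dailies", "monthlies", "eye_drops", "solutions"]
sim = pd.DataFrame(
    [[1.0, 0.3, 0.7, 0.5],
     [0.3, 1.0, 0.6, 0.4],
     [0.7, 0.6, 1.0, 0.3],
     [0.5, 0.4, 0.3, 1.0]],
    index=categories, columns=categories,
)

# Plot the matrix as a heatmap with category labels on both axes.
fig, ax = plt.subplots()
im = ax.imshow(sim, cmap="viridis")
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(categories, rotation=45, ha="right")
ax.set_yticks(range(len(categories)))
ax.set_yticklabels(categories)
fig.colorbar(im, ax=ax, label="cosine similarity")
plt.tight_layout()
plt.show()
```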

Similarity-based item features

The cosine similarity formulation can be used to find items that look alike. In this regard, we have six variables that could explain item similarity, namely brand, manufacturer, lens type, sub-category, and combinations of two or three of these variables (for example lens type and brand). Using the predicted score for a given item, we can find other items in the test set whose predicted scores are close to it in terms of cosine distance. As a result, we obtain the n most similar items to a given item (be aware that the first item returned is always the item itself).
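A minimal sketch of this retrieval step, assuming a toy catalogue with a few of the attributes named above (the item IDs and attribute values are invented):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue with a few of the content attributes mentioned above.
catalogue = pd.DataFrame({
    "item_id": ["daily_A", "daily_B", "monthly_C", "toric_D"],
    "brand": ["Acme", "Acme", "Bright", "Bright"],
    "lens_type": ["daily", "daily", "monthly", "toric"],
    "sub_category": ["spherical", "spherical", "spherical", "astigmatic"],
}).set_index("item_id")

vectors = pd.get_dummies(catalogue)            # one-hot item attribute vectors
sim = pd.DataFrame(cosine_similarity(vectors),
                   index=catalogue.index, columns=catalogue.index)

def top_n_similar(item_id, n=3):
    # The highest-scoring item is always the item itself.
    return sim.loc[item_id].sort_values(ascending=False).head(n)

print(top_n_similar("daily_A"))
```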

For sure, these combinations are far from perfect. But considering that these characteristics can explain the relationship between items and customer preferences, the results are sufficiently good. Another approach I can think of is perhaps Latent Dirichlet Allocation, a graphical model that could learn the mixture of 'information' in each item.

Model Building

Data Splitting Strategy

The primary objective of our models is to predict which products are most likely to be bought by returning customers. To reflect this, our models should be able to handle future periods in which new products are launched in the subsequent year.

Therefore, our approach is to build models on the 2014 dataset and tune the hyperparameters using a holdout set covering January 2015 to April 2015. After finding the best configuration, we rebuild the model on the entire set (both 2014 and the holdout period). Finally, we evaluate the model on the remaining 2015 data.
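In code, the split is simply time-based. This sketch assumes a transaction table with a `date` column; the function name and schema are illustrative.

```python
import pandas as pd

def split_by_period(df):
    """2014 for fitting, Jan-Apr 2015 for tuning, the rest of 2015 for final evaluation."""
    df = df.assign(date=pd.to_datetime(df["date"]))
    train = df[df["date"] < "2015-01-01"]
    holdout = df[(df["date"] >= "2015-01-01") & (df["date"] < "2015-05-01")]
    test = df[df["date"] >= "2015-05-01"]
    return train, holdout, test

# After tuning on `holdout`, the final model is refit on train + holdout
# and evaluated once on `test`.
```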

Negative sampling scheme

Finding representative negative examples (items that are not bought by a customer) is quite hard, since missing data and truly negative data are mixed together. One extreme approach is to label all missing data as negative examples; this intuitively works, but there is always the possibility that a missing entry is actually a positive target, not to mention that applying this approach is expensive.

What we do instead is use a random sampling algorithm to create new training data from the original training data. We take all positive samples into the new training set, and then sample negative examples from the missing data, drawn from the set of items sold in 2014 as well as the new collections in 2015. Slightly different from uniform random sampling, where all missing data share the same probability of being sampled as negatives, we put more weight on items in the categories a customer has bought from in the past. This ensures the sampling takes into account the categories of items a customer has bought; for instance, if she bought dailies and eye solutions, the majority of her negative samples would be drawn from these two categories.
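A hedged sketch of that category-weighted sampling follows. The catalogue, the purchase history, the boost factor, and the function name are all illustrative assumptions rather than the exact scheme used on the client data.

```python
import numpy as np
import pandas as pd

def sample_negatives(bought, catalogue, n_negatives, boost=3.0, seed=42):
    """Sample unpurchased items, up-weighting the categories the customer has bought from."""
    candidates = catalogue[~catalogue["item_id"].isin(bought["item_id"])]
    bought_categories = set(bought["category"])
    weights = np.where(candidates["category"].isin(bought_categories), boost, 1.0)
    return candidates.sample(n=n_negatives, weights=weights, random_state=seed)

catalogue = pd.DataFrame({
    "item_id": ["d1", "d2", "m1", "sol1", "drops1"],
    "category": ["dailies", "dailies", "monthlies", "solutions", "eye_drops"],
})
bought = pd.DataFrame({"item_id": ["d1", "sol1"],
                       "category": ["dailies", "solutions"]})
print(sample_negatives(bought, catalogue, n_negatives=2))
```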

Training with past purchase data

Now comes the coolest part: what do the parameters of the learning algorithm try to learn? We want to optimise the area under the ROC curve, meaning we want the classifier to assign a pattern to the positive class if its predicted probability is above the selected threshold and to the negative class if it is below.

We look for models that score highly on both AUROC and sensitivity. Sensitivity can be intuitively understood as the recall rate for positive events: it corresponds to the proportion of all positive samples whose outcomes are correctly predicted. Put simply, we want the set of features we have created to explain the relationship between an item and a customer, that is, whether an item has a high likelihood of being bought or not.
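For reference, both metrics are standard and easy to compute; the labels and scores below are made-up illustrations.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.3, 0.7, 0.4, 0.2, 0.6, 0.8, 0.1])
y_pred = (y_score >= 0.5).astype(int)            # selected threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                     # recall on the positive class
auc = roc_auc_score(y_true, y_score)             # threshold-free ranking quality
print(f"sensitivity={sensitivity:.2f}, AUC={auc:.2f}")
```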

We experimented with four machine learning algorithms, of which Stochastic Gradient Boosting performed best, with a sensitivity of 72% and an AUC of 89%. The resulting model also outperformed the baseline of recommending popular items by a 3.5% margin. To get the best performance, we used grid search to tune six parameters of the gradient boosting model, including the learning rate, the maximum tree depth, and the number of boosting rounds.
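As a simplified sketch of that tuning step, here is a grid search over a scikit-learn gradient boosting classifier. The synthetic data, the parameter grid, and the use of cross-validation (rather than the temporal holdout described earlier) are my own assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [100, 300],   # number of boosting rounds
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```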

Conclusion

In this article, I have given an overview of my dissertation work. I have explained my approach of infusing item-based collaborative filtering outputs as feature vectors in a content-based classification model. Using this approach, we could learn the distinctive characteristics of each customer's preferences on a per-item basis.

There are several potential ideas worth investigating further. Apart from using LDA to mine the textual information available on the website, as mentioned before, one improvement could come from exploring other ways of drawing negative samples for the training and test sets. Interested readers could explore the work of Pan et al. (2008).

Stream Intelligence is a great place to work. The culture fosters great team collaboration, and the coolest thing is that they fully support employees attending conferences and presenting their work. Last month, I presented my dissertation paper at the 2017 ISI Regional Statistics Conference in Bali. It was a great experience to meet leaders in their fields, such as Agus Sudjianto (Corporate Risk Management Modeling). I also had the chance to meet my bachelor thesis advisor, Dr Dumaria Tampubolon, who introduced me to Prof Ken Sen Tang, a leading researcher in actuarial science at Waterloo University whose papers I used as references, so it was a great moment to meet someone whose work I really admire.

By the way, Stream Intelligence is currently seeking data scientists to join the team in the Jakarta office. This year's cohort will be the very first batch to receive the new graduate programme curriculum that we have been developing over the past six months. In addition, new joiners will also receive regular one-to-one coaching as well as communication skills and consumer psychology workshops. If you are interested in learning more about recommendation engines, churn models, demand forecasting or anything related to analytics, please don't hesitate to contact me or drop your CV through the careers email available on our website. We are also normally open to internship applications all year round in both the London and Jakarta offices, so if you are interested, please do get in touch with us by email.

Hermawan Budyanto is the Senior Data Scientist at Stream Intelligence. He received his Master’s degree in Business Analytics with Specialisation in Computer Science from University College London. Connect with him through LinkedIn.