Exploratory data analysis and the application of unsupervised learning techniques (K-means and HDBSCAN)
Hi there! It’s been a while since my last post, and I’m excited to share an analysis I’ve been working on as part of my new journey in pursuing a Master’s degree in Computer Science at the same university where I completed my undergraduate studies (UFSCar).
The first course I took in this program was focused on Unsupervised and Semi-supervised Machine Learning. It’s been a fascinating experience so far, and I’ve learned some interesting techniques that I’m eager to explore in this post. Specifically, I’ll be using K-means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) in order to create clusters and identify patterns in products from Adidas and Nike.
For this analysis, I used a dataset that contains detailed information on sales and other relevant aspects of Adidas and Nike products. The dataset consists of 3,268 products from both brands, with 10 attributes related to these products. You can access this dataset for free on Kaggle via this link. Here is an overview of it:
Data summary | Values |
---|---|
Name | brands |
Number of rows | 3268 |
Number of columns | 10 |
Column type frequency: | |
character | 5 |
numeric | 5 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Product.Name | 0 | 1 | 4 | 73 | 0 | 1531 | 0 |
Product.ID | 0 | 1 | 6 | 10 | 0 | 3179 | 0 |
Brand | 0 | 1 | 4 | 24 | 0 | 5 | 0 |
Description | 0 | 1 | 0 | 687 | 3 | 1763 | 0 |
Last.Visited | 0 | 1 | 19 | 19 | 0 | 318 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Listing.Price | 0 | 1 | 6868.02 | 4724.66 | 0 | 4299.0 | 5999.0 | 8999.0 | 29999 | ▇▆▂▁▁ |
Sale.Price | 0 | 1 | 6134.27 | 4293.25 | 449 | 2999.0 | 4799.0 | 7995.0 | 36500 | ▇▂▁▁▁ |
Discount | 0 | 1 | 26.88 | 22.63 | 0 | 0.0 | 40.0 | 50.0 | 60 | ▇▁▁▅▆ |
Rating | 0 | 1 | 3.24 | 1.43 | 0 | 2.6 | 3.5 | 4.4 | 5 | ▃▁▅▆▇ |
Reviews | 0 | 1 | 40.55 | 31.54 | 0 | 10.0 | 37.0 | 68.0 | 223 | ▇▆▁▁▁ |
I started by creating visualizations that focus on the Sale Price and Listing Price distributions. By overlaying these histograms, we aim to identify patterns in pricing strategies, such as differences in pricing ranges, which can provide valuable insights into market behavior.
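For reference, a minimal sketch of how this overlay could be produced with ggplot2 (I'm assuming the data frame is named `products`, with the column names shown in the skim output; the number of bins is arbitrary):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape the two price columns into long format so they share one plot
prices_long <- products %>%
  pivot_longer(cols = c(Listing.Price, Sale.Price),
               names_to = "price_type", values_to = "price")

# Overlay the two histograms with some transparency
ggplot(prices_long, aes(x = price, fill = price_type)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 40) +
  labs(x = "Price", y = "Count", fill = "Price type",
       title = "Listing Price vs. Sale Price distributions")
```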
Before diving deeper into this, I’d like to review the other variables. This might help identify any inconsistencies or elements that may not be useful for the analysis.
With the insight brought by the visualization above, the products mislabeled as Adidas Adidas Originals were reassigned to the correct category, Adidas Originals. This leaves us with only 4 brand categories instead of 5, resolving the inconsistency found in the Brand assignment.
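This reassignment amounts to a simple recode; a sketch with dplyr (the label strings follow the text above and may differ slightly in the raw data):

```r
library(dplyr)

# Collapse the duplicated label into the correct brand category
products <- products %>%
  mutate(Brand = if_else(Brand == "Adidas Adidas Originals",
                         "Adidas Originals", Brand))

# Confirm that only 4 brand categories remain
count(products, Brand)
```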
After visualizing the information in Figure 3, a couple of interesting questions come to mind:
- How many products have a Listing Price higher than the Sale Price?
- Why is there a difference between Listing Price and Sale Price? Is it always due to discounts?
Well, no: among the 876 products with an available Listing Price and no explicit discount, 217 have a Sale Price lower than the Listing Price. This suggests that the difference isn’t always due to planned discounts, which opens up a few possible explanations.
- Which brand segment is most affected by these unrecorded changes?
As seen above, all products that do not show a discount in the data but had changes in the final sale price are Nike products. In fact, every Nike product with a listing price underwent adjustments. This suggests either that these products do not have the correct listing price, or the discount is not being recorded for these products (the latter is more likely, as no Nike product in the dataset has a documented discount).
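These counts can be reproduced with a couple of filters; a sketch of the logic (column names as in the skim output, `products` assumed as the data frame name):

```r
library(dplyr)

# Products with an available listing price and no recorded discount
no_discount <- products %>%
  filter(Listing.Price > 0, Discount == 0)

# How many of them still sell below the listing price, and for which brands?
no_discount %>%
  filter(Sale.Price < Listing.Price) %>%
  count(Brand)
```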
- Are there any products with a Sale Price higher than the Listing Price?
This is unexpected, right? The Listing Price should be equal to or higher than the Sale Price, not the other way around. I’ll take a closer look at these Nike products.
An intriguing issue in this table is that these products with a higher Sale Price have a Listing Price of 0, which suggests a potential error in the data generation process, as it’s unlikely for a product to be listed without a price. To correct this, I will assign the Sale Price value to these cases, adjusting it based on the Discount column where available.
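One way to express this adjustment (backing the listing price out of the discount is my reading of the intent, so treat it as a sketch rather than the exact rule):

```r
library(dplyr)

# Where the listing price is 0, rebuild it from the sale price.
# If a discount is recorded, back out the pre-discount price;
# otherwise simply reuse the sale price.
products <- products %>%
  mutate(Listing.Price = case_when(
    Listing.Price == 0 & Discount > 0 ~ Sale.Price / (1 - Discount / 100),
    Listing.Price == 0                ~ Sale.Price,
    TRUE                              ~ Listing.Price
  ))
```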
As I mentioned before, none of the Nike products have a discount available in the dataset. But what about the Adidas products?
The flow chart in Figure 5 shows that the CORE / NEO segment had the highest percentage of products with discounts over 30%. Now, let’s explore ratings, descriptions, and reviews before building our product clusters.
The difference between the brands’ ratings is notable; however, it is important to point out that they have different numbers of products, with Nike having the fewest representatives in the dataset, which may explain the high standard deviation observed. Additionally, the dataset includes product descriptions, and given that it contains only shoes, it would be interesting to explore the common words used in these descriptions to gain insights into product characteristics or marketing trends.
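One way to get at those common words is to tokenize the descriptions and count word frequencies; a sketch with tidytext (an assumed approach, not necessarily the one behind the original figure):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Break descriptions into individual words and drop common stop words
word_counts <- products %>%
  select(Description) %>%
  unnest_tokens(word, Description) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Plot the most frequent words across all descriptions
word_counts %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = NULL,
       title = "Most common words in product descriptions")
```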
Naturally, the first question that arises from this plot is whether there’s any relationship between the words used and the product ratings. To explore this further, I conducted an analysis and visualized the findings in the form of this eye-catching word cloud:
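The word cloud itself can be drawn directly from those word counts; a minimal sketch with the wordcloud package (this version only encodes frequency, while relating words to ratings, as in the figure, would require joining the ratings back onto the tokens):

```r
library(wordcloud)

# Word size reflects how often each word appears in the descriptions
set.seed(123)
wordcloud(words = word_counts$word, freq = word_counts$n,
          max.words = 100, random.order = FALSE)
```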
The word cloud and box plot together highlight the contrasting dynamics between Nike and Adidas in terms of customer engagement and review distribution. In the word cloud, Nike-related terms like “modernised” and “air” are prominent, suggesting that these products dominate discussions and receive significant attention. This is mirrored in the box plot, where Nike shows an irregular review distribution, with numerous outliers indicating that while some Nike products receive a large volume of reviews, others are reviewed much less frequently. In contrast, Adidas displays more balanced word frequencies in the word cloud and a more consistent review distribution across its product lines, as seen in the box plot. This suggests that Adidas products tend to attract steadier, more uniform customer engagement, without the extremes observed in Nike’s reviews.
Before moving into modeling, it’s important to note that our data does not exhibit spherical dispersion. As a result, algorithms that form globular clusters, such as K-means, may not be ideal. Nonetheless, I will still test K-means to compare the clusters it forms.
The figure above shows the distribution of two key variables: Listing Price and Sale Price. These variables are essential for analyzing pricing strategies and product performance, and they play a crucial role in forming clusters to understand business dynamics in retail and e-commerce.
First, the data was scaled to minimize the impact of varying units across different variables. The following variables were initially tested:
As a reminder, variables where the Listing Price was 0 were adjusted to match the Sale Price, accounting for any available Discount.
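A sketch of the scaling step (which numeric variables were kept is an assumption based on the columns available in the skim output):

```r
# Select the numeric variables and standardize them so that no single
# variable dominates the distance calculations
features <- products[, c("Listing.Price", "Sale.Price", "Discount",
                         "Rating", "Reviews")]
features_scaled <- scale(features)
```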
The elbow method suggests that two clusters are optimal for this dataset. The K-means algorithm was applied to the scaled data, resulting in two distinct clusters. The plot shows the distribution of data points based on Sale Price and Listing Price, with each point colored according to its assigned cluster. While the two clusters exhibit a roughly globular dispersion, the data appears to have more complexity and does not fully adhere to this structure, indicating that K-means may not be the best algorithm to capture the underlying characteristics of the products.
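For reference, a sketch of the elbow method and the final K-means fit (the range of k, the seed, and `nstart` are assumptions):

```r
set.seed(42)

# Elbow method: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(features_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the final model with the suggested two clusters
km <- kmeans(features_scaled, centers = 2, nstart = 25)
table(km$cluster)
```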
The HDBSCAN algorithm identified 10 distinct clusters and 972 points considered noise. Noise points are those that do not fit well into any of the formed clusters. The sizes of these clusters vary significantly, which is typical for density-based algorithms like HDBSCAN.
For instance, Cluster 4 is the largest, containing 930 objects, indicating a region of high density. In contrast, Cluster 5 has only 12 objects, suggesting a region of much lower density. Interestingly, Cluster 0 represents noise, with 972 points, indicating a significant portion of the data does not clearly belong to any cluster. The remaining clusters capture variations in sale and listing prices, reviews, and ratings.
The maximum sale prices across clusters range from $3,197 in smaller clusters up to $36,500 in the noise points. Cluster 4 stands out with its large number of reviews and higher average rating of 3.9, while Cluster 5 has no reviews or ratings, indicating limited customer interaction. Clusters like Cluster 1 and Cluster 2 are smaller in size, with moderate sale prices and no significant discounts.
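A sketch of how these clusters and the per-cluster summary could be produced with the dbscan package (minPts is an assumption, so the exact cluster count and sizes may differ from the results reported above; label 0 marks noise):

```r
library(dbscan)
library(dplyr)

# Fit HDBSCAN on the scaled features; minPts sets the minimum cluster size
hdb <- hdbscan(features_scaled, minPts = 15)

# Attach the labels back to the original data (cluster 0 = noise)
products$cluster <- hdb$cluster

# Summarize each cluster on the original, unscaled variables
products %>%
  group_by(cluster) %>%
  summarise(n = n(),
            max_sale_price = max(Sale.Price),
            mean_rating    = mean(Rating),
            mean_reviews   = mean(Reviews))
```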
Table summarizing the clusters formed by HDBSCAN:
In conclusion, HDBSCAN effectively identifies regions of varying density in the dataset, capturing both high and low-density clusters. This clustering offers valuable insights into different pricing strategies and product performance, where larger clusters indicate products with higher customer engagement and more stable pricing, while smaller clusters may indicate niche products or items with less market activity. The noise points, accounting for a large proportion, might represent outliers or products that don’t fit well into the overall market patterns.
In a real-world scenario, I would delve deeper into the modeling process. It would be particularly interesting to model the brands separately to enable direct comparisons, but I’ll save that exploration for another time, as this post is already quite extensive. But that was it!
Note: As you may have noticed, exploratory data analysis has been a topic I’ve really enjoyed lately, and I plan to share more posts on this subject in the future.