EDA + Clustering Data

Exploratory data analysis and the application of unsupervised learning techniques (K-means and HDBSCAN)

Gabriel de Freitas Pereira
2024-10-23

Introduction

 

Hi there! It’s been a while since my last post, and I’m excited to share an analysis I’ve been working on as part of my new journey in pursuing a Master’s degree in Computer Science at the same university where I completed my undergraduate studies (UFSCar).

The first course I took in this program was focused on Unsupervised and Semi-supervised Machine Learning. It’s been a fascinating experience so far, and I’ve learned some interesting techniques that I’m eager to explore in this post. Specifically, I’ll be using K-means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) in order to create clusters and identify patterns in products from Adidas and Nike.

 

 

Dataset

 

For this analysis, I used a dataset that contains detailed information on sales and other relevant aspects of Adidas and Nike products. The dataset consists of 3,268 products from both brands, with 10 attributes related to these products. You can access this dataset for free on Kaggle via this link. Here is an overview of it:

Table 1: Data summary
Name: brands
Number of rows: 3268
Number of columns: 10
Column type frequency: character (5), numeric (5)
Group variables: None

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---------------|-----------|---------------|-----|-----|-------|----------|------------|
| Product.Name  | 0 | 1 | 4  | 73  | 0 | 1531 | 0 |
| Product.ID    | 0 | 1 | 6  | 10  | 0 | 3179 | 0 |
| Brand         | 0 | 1 | 4  | 24  | 0 | 5    | 0 |
| Description   | 0 | 1 | 0  | 687 | 3 | 1763 | 0 |
| Last.Visited  | 0 | 1 | 19 | 19  | 0 | 318  | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---------------|-----------|---------------|------|----|----|-----|-----|-----|------|------|
| Listing.Price | 0 | 1 | 6868.02 | 4724.66 | 0   | 4299.0 | 5999.0 | 8999.0 | 29999 | ▇▆▂▁▁ |
| Sale.Price    | 0 | 1 | 6134.27 | 4293.25 | 449 | 2999.0 | 4799.0 | 7995.0 | 36500 | ▇▂▁▁▁ |
| Discount      | 0 | 1 | 26.88   | 22.63   | 0   | 0.0    | 40.0   | 50.0   | 60    | ▇▁▁▅▆ |
| Rating        | 0 | 1 | 3.24    | 1.43    | 0   | 2.6    | 3.5    | 4.4    | 5     | ▃▁▅▆▇ |
| Reviews       | 0 | 1 | 40.55   | 31.54   | 0   | 10.0   | 37.0   | 68.0   | 223   | ▇▆▁▁▁ |

 

 

Exploratory Data Analysis

 

I started by creating visualizations that focus on the Sale Price and Listing Price distributions. By overlaying these histograms, we aim to identify patterns in pricing strategies, such as differences in pricing ranges, which can provide valuable insights into market behavior.
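For reference, the overlay could be produced along these lines (a minimal sketch assuming ggplot2; the binning, colours, and any faceting by brand used in the actual figure are my assumptions):

library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape the two price columns into long format and overlay their histograms
brands %>%
  pivot_longer( c( Listing.Price, Sale.Price ),
                names_to = "Price.Type", values_to = "Price" ) %>%
  ggplot( aes( x = Price, fill = Price.Type ) ) +
  geom_histogram( position = "identity", alpha = 0.5, bins = 40 ) +
  labs( title = "Sale Price vs Listing Price", x = "Price", y = "Count" )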


Figure 1: Comparative Analysis of Sale Price vs Listing Price for Adidas and Nike

Before diving deeper into this, I'd like to review the other variables. This might help in identifying any inconsistencies or elements that may not be useful for the analysis.


Figure 2: Frequency of Brand Categories

Based on the insight from the visualization above, products labeled Adidas Adidas Originals were reassigned to the correct category, Adidas Originals. With this inconsistency in the Brand assignment fixed, we are left with 4 categories instead of 5.
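A recoding along these lines would handle it (a minimal sketch; the exact spelling and casing of the label in the raw data is an assumption):

library(dplyr)

# Reassign the duplicated label to the correct Adidas Originals category
brands <- brands %>%
  mutate( Brand = if_else( Brand == "Adidas Adidas Originals",
                           "Adidas Originals", Brand ) )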


Figure 3: Comparative Analysis of Sale Price vs Listing Price by Brand

After visualizing the information contained in Figure 3, a couple of interesting questions come to mind:

- How many products have a Listing Price higher than the Sale Price?

Figure 4: Comparison of Listing Price vs Sale Price by Brand

- Why is there a difference between Listing Price and Sale Price? Is it always due to discounts?

Well, no. Among the 876 products with an available Listing Price and no explicit discount, 217 have a Sale Price lower than the Listing Price. This suggests that the difference isn't always due to planned discounts; I explore a few possible explanations below.
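For transparency, those two counts can be reproduced with a couple of filters (a sketch; here “an available Listing Price” is assumed to mean Listing.Price greater than 0):

# Products with a listing price available and no recorded discount (876 per the text)
brands %>%
  filter( Listing.Price > 0, Discount == 0 ) %>%
  nrow()

# ... of which the Sale Price is below the Listing Price (217 per the text)
brands %>%
  filter( Listing.Price > 0, Discount == 0, Sale.Price < Listing.Price ) %>%
  nrow()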

Which brand segment is most affected by these unrecorded changes?

library(dplyr)
library(reactable)

# Count, per brand, the products whose sale price changed even though
# no discount is recorded in the data
reactable(
  brands %>%
    group_by( Brand ) %>%
    filter( Listing.Price > Sale.Price, Discount == 0 ) %>%
    count()
)

As seen above, all products that do not show a discount in the data but had changes in the final sale price are Nike products. In fact, every Nike product with a listing price underwent adjustments. This suggests either that these products do not have the correct listing price, or the discount is not being recorded for these products (the latter is more likely, as no Nike product in the dataset has a documented discount).

- Are there any products with a Sale Price higher than the Listing Price?

This is unexpected, right? The Listing Price should be equal to or higher than the Sale Price, not the other way around. I’ll take a closer look at these Nike products.

# Products whose Sale Price is higher than their Listing Price
reactable(
  brands %>%
    select( Product.Name, Listing.Price, Sale.Price, Discount ) %>%
    filter( Sale.Price > Listing.Price ),
  defaultPageSize = 5
)

An intriguing issue in this table is that these products with a higher Sale Price have a Listing Price of 0, which suggests a potential error in the data generation process, as it's unlikely for a product to be listed without a price. To correct this, I will assign the Sale Price to these cases, adjusting based on the Discount column (if available):

brands_input <- 
  brands %>%
  mutate(
    Listing.Price = case_when(
      # No listing price and no discount: assume the sale price is the listing price
      Listing.Price == 0 & Discount == 0 ~ Sale.Price,
      # No listing price but a discount: back out the listing price from the
      # sale price, treating Discount as the percentage off the listing price
      Listing.Price == 0 & Discount > 0  ~ Sale.Price / ( 1 - Discount / 100 ),
      # Otherwise keep the original listing price
      TRUE ~ Listing.Price
    )
  )

As I mentioned before, none of the Nike products have a discount available in the dataset. But what about the Adidas products?


Figure 5: Distribution of Discounts Across Adidas Segments

The flow chart in Figure 5 shows that the CORE / NEO segment had the highest percentage of products with discounts over 30%. Ok, so let's explore ratings, descriptions, and reviews before building our product clusters.


Figure 6: Distribution of Ratings Across Segments

The difference between the brands’ ratings is notable; however, it is important to point out that they have different numbers of products, with Nike having the fewest representatives in the dataset, which may explain the high standard deviation observed. Additionally, the dataset includes product descriptions, and given that it contains only shoes, it would be interesting to explore the common words used in these descriptions to gain insights into product characteristics or marketing trends.
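One way to obtain these word counts is a tidytext-style tokenisation of the Description column (a sketch; the cleaning and stop-word handling behind Figure 7 may well differ):

library(dplyr)
library(tidytext)

# Split descriptions into words, drop common English stop words, count by brand
word_freq <- brands %>%
  select( Brand, Description ) %>%
  unnest_tokens( word, Description ) %>%
  anti_join( stop_words, by = "word" ) %>%
  count( Brand, word, sort = TRUE )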


Figure 7: Frequency of Words by Brand

Naturally, the first question that arises from this plot is whether there’s any relationship between the words used and the product ratings. To explore this further, I conducted an analysis and visualized the findings in the form of this eye-catching word cloud:


Figure 8: Frequency of Words by Brand


Figure 9: Distribution of Reviews by Brand

The word cloud and box plot together highlight the contrasting dynamics between Nike and Adidas in terms of customer engagement and review distribution. In the word cloud, Nike-related terms like “modernised” and “air” are prominent, suggesting that these products dominate discussions and receive significant attention. This is mirrored in the box plot, where Nike shows an irregular review distribution, with numerous outliers indicating that while some Nike products receive a large volume of reviews, others are reviewed much less frequently. In contrast, Adidas displays more balanced word frequencies in the word cloud and a more consistent review distribution across its product lines, as seen in the box plot. This suggests that Adidas products tend to attract steadier, more uniform customer engagement, without the extremes observed in Nike’s reviews.

Before moving into modeling, it's important to note that our data does not exhibit spherical dispersion. As a result, algorithms that form globular clusters, such as K-means, may not be ideal. Nonetheless, I will still test K-means to compare the clusters it forms.


Figure 10: Distribution of Data Based on Sale and Listing Prices

The figure above shows the distribution of two key variables: Listing Price and Sale Price. These variables are essential for analyzing pricing strategies and product performance, and they play a crucial role in forming clusters to understand business dynamics in retail and e-commerce.

 

 

Building Clusters with K-means

 

First, the data was scaled to minimize the impact of varying units across different variables. The following variables were initially tested: Listing Price, Sale Price, Discount, Rating, and Reviews.

As a reminder, variables where the Listing Price was 0 were adjusted to match the Sale Price, accounting for any available Discount.
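The scaling and elbow-method steps could look roughly like this (a minimal sketch; the exact variable set and the range of k are assumptions based on the summary above):

library(dplyr)
library(ggplot2)

# Scale the numeric variables so no single unit dominates the distances
vars_scaled <- brands_input %>%
  select( Listing.Price, Sale.Price, Discount, Rating, Reviews ) %>%
  scale()

# Total within-cluster sum of squares for k = 1..10
set.seed( 123 )  # hypothetical seed for reproducibility
wss <- sapply( 1:10, function( k ) {
  kmeans( vars_scaled, centers = k, nstart = 25 )$tot.withinss
} )

ggplot( data.frame( k = 1:10, wss = wss ), aes( x = k, y = wss ) ) +
  geom_line() +
  geom_point() +
  labs( title = "Elbow Method", x = "Number of clusters (k)",
        y = "Total within-cluster sum of squares" )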


Figure 11: Elbow Method for Determining the Optimal Number of Clusters on K-means algorithm


Figure 12: K-means Clusters

The elbow method suggests that two clusters are optimal for this dataset. The K-means algorithm was applied to the scaled data, resulting in two distinct clusters. The plot shows the distribution of data points based on Sale Price and Listing Price, with each point colored according to its assigned cluster. While the two clusters exhibit a roughly globular dispersion, the data appears to have more complexity and does not fully adhere to this structure, indicating that K-means may not be the best algorithm to capture the underlying characteristics of the products.
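With k = 2, the fit and the cluster plot could be produced along these lines (a sketch reusing the scaled variables from above; the actual figure may have been drawn differently, e.g. with factoextra):

set.seed( 123 )  # hypothetical seed
km_fit <- kmeans( vars_scaled, centers = 2, nstart = 25 )

# Plot the clusters on the original price axes
brands_input %>%
  mutate( Cluster = factor( km_fit$cluster ) ) %>%
  ggplot( aes( x = Sale.Price, y = Listing.Price, colour = Cluster ) ) +
  geom_point( alpha = 0.6 ) +
  labs( title = "K-means Clusters" )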

 

 

Building Clusters with HDBSCAN

 

The HDBSCAN algorithm identified 10 distinct clusters and 972 points considered noise. Noise points are those that do not fit well into any of the formed clusters. The sizes of these clusters vary significantly, which is typical for density-based algorithms like HDBSCAN.
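In R, HDBSCAN is available through the dbscan package; a fit along these lines matches the setup described above (the minPts value is an assumption, since the post does not state which one was used):

library(dbscan)

# HDBSCAN on the same scaled variables; cluster 0 denotes noise
hdb <- hdbscan( vars_scaled, minPts = 15 )
hdb                   # number of clusters and noise points
table( hdb$cluster )  # cluster sizes (0 = noise)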


Figure 13: HDBSCAN Cluster Plot Showing Consistent Branches

For instance, Cluster 4 is the largest, containing 930 objects, indicating a region of high density. In contrast, Cluster 5 has only 12 objects, suggesting a region of much lower density. Interestingly, Cluster 0 represents noise, with 972 points, indicating a significant portion of the data does not clearly belong to any cluster. The remaining clusters capture variations in sale and listing prices, reviews, and ratings.


Figure 14: HDBSCAN Clusters

The maximum sale prices across clusters range from $3,197 in smaller clusters up to $36,500 in the noise points. Cluster 4 stands out with its large number of reviews and higher average rating of 3.9, while Cluster 5 has no reviews or ratings, indicating limited customer interaction. Clusters like Cluster 1 and Cluster 2 are smaller in size, with moderate sale prices and no significant discounts.
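A per-cluster summary like the one described above could be assembled roughly as follows (a sketch; the columns shown in the post's table may differ):

library(dplyr)
library(reactable)

# Summarise size, prices, discounts, ratings, and reviews for each HDBSCAN cluster
brands_input %>%
  mutate( Cluster = hdb$cluster ) %>%
  group_by( Cluster ) %>%
  summarise(
    n             = n(),
    max_sale      = max( Sale.Price ),
    mean_discount = mean( Discount ),
    mean_rating   = mean( Rating ),
    mean_reviews  = mean( Reviews )
  ) %>%
  reactable()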

Table summarizing the clusters formed by HDBSCAN:

 

 

Conclusion

 

In conclusion, HDBSCAN effectively identifies regions of varying density in the dataset, capturing both high and low-density clusters. This clustering offers valuable insights into different pricing strategies and product performance, where larger clusters indicate products with higher customer engagement and more stable pricing, while smaller clusters may indicate niche products or items with less market activity. The noise points, accounting for a large proportion, might represent outliers or products that don’t fit well into the overall market patterns.

In a real-world scenario, I would delve deeper into the modeling process. It would be particularly interesting to model the brands separately to enable direct comparisons, but I'll save that exploration for another time, as this post is already quite extensive. That's it for now!

Note: As you may have noticed, exploratory data analysis has been a topic I’ve really enjoyed lately, and I plan to share more posts on this subject in the future.