Exploratory data analysis and the application of unsupervised learning techniques (K-means and HDBSCAN)
Hi there! It’s been a while since my last post, and I’m excited to share an analysis I’ve been working on as part of my new journey in pursuing a Master’s degree in Computer Science at the same university where I completed my undergraduate studies (UFSCar).
The first course I took in this program was focused on Unsupervised and Semi-supervised Machine Learning. It’s been a fascinating experience so far, and I’ve learned some interesting techniques that I’m eager to explore in this post. Specifically, I’ll be using K-means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) in order to create clusters and identify patterns in products from Adidas and Nike.
For this analysis, I used a dataset that contains detailed information on sales and other relevant aspects of Adidas and Nike products. The dataset consists of 3,268 products from both brands, with 10 attributes related to these products. You can access this dataset for free on Kaggle via this link. Here is an overview of it:
Data summary | Values |
---|---|
Name | brands |
Number of rows | 3268 |
Number of columns | 10 |
Column type frequency: | |
character | 5 |
numeric | 5 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Product.Name | 0 | 1 | 4 | 73 | 0 | 1531 | 0 |
Product.ID | 0 | 1 | 6 | 10 | 0 | 3179 | 0 |
Brand | 0 | 1 | 4 | 24 | 0 | 5 | 0 |
Description | 0 | 1 | 0 | 687 | 3 | 1763 | 0 |
Last.Visited | 0 | 1 | 19 | 19 | 0 | 318 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Listing.Price | 0 | 1 | 6868.02 | 4724.66 | 0 | 4299.0 | 5999.0 | 8999.0 | 29999 | ▇▆▂▁▁ |
Sale.Price | 0 | 1 | 6134.27 | 4293.25 | 449 | 2999.0 | 4799.0 | 7995.0 | 36500 | ▇▂▁▁▁ |
Discount | 0 | 1 | 26.88 | 22.63 | 0 | 0.0 | 40.0 | 50.0 | 60 | ▇▁▁▅▆ |
Rating | 0 | 1 | 3.24 | 1.43 | 0 | 2.6 | 3.5 | 4.4 | 5 | ▃▁▅▆▇ |
Reviews | 0 | 1 | 40.55 | 31.54 | 0 | 10.0 | 37.0 | 68.0 | 223 | ▇▆▁▁▁ |
I started by creating visualizations that focus on the Sale Price and Listing Price distributions. By overlaying these histograms, we aim to identify patterns in pricing strategies, such as differences in pricing ranges, which can provide valuable insights into market behavior.
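For reference, a minimal sketch of how this overlay could be produced with ggplot2 (I'm assuming the data frame is named `products`, with the column names shown in the skim output; the number of bins is arbitrary):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape the two price columns into long format so they share one plot
prices_long <- products %>%
  pivot_longer(cols = c(Listing.Price, Sale.Price),
               names_to = "price_type", values_to = "price")

# Overlay the two histograms with some transparency
ggplot(prices_long, aes(x = price, fill = price_type)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 40) +
  labs(x = "Price", y = "Count", fill = "Price type",
       title = "Listing Price vs. Sale Price distributions")
```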
Before diving deeper into this, I’d like to review the other variables. This might help identify any inconsistencies or elements that may not be useful for the analysis.
With the insight brought by the visualization above, the products mislabeled as Adidas Adidas Originals were reassigned to the correct category, Adidas Originals. This leaves us with only 4 brand categories instead of 5, resolving the inconsistency found in the Brand assignment.
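This reassignment amounts to a simple recode; a sketch with dplyr (the label strings follow the text above and may differ slightly in the raw data):

```r
library(dplyr)

# Collapse the duplicated label into the correct brand category
products <- products %>%
  mutate(Brand = if_else(Brand == "Adidas Adidas Originals",
                         "Adidas Originals", Brand))

# Confirm that only 4 brand categories remain
count(products, Brand)
```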
After visualizing the information in Figure 3, a couple of interesting questions come to mind:
- How many products have a Listing Price higher than the Sale Price?
- Why is there a difference between Listing Price and Sale Price? Is it always due to discounts?
Well, no: among the 876 products with an available Listing Price and no explicit discount, 217 have a Sale Price lower than the Listing Price. This suggests that the difference isn’t always due to planned discounts, which opens up a few possible explanations.
- Which brand segment is most affected by these unrecorded changes?
As seen above, all products that do not show a discount in the data but had changes in the final sale price are Nike products. In fact, every Nike product with a listing price underwent adjustments. This suggests either that these products do not have the correct listing price, or the discount is not being recorded for these products (the latter is more likely, as no Nike product in the dataset has a documented discount).
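These counts can be reproduced with a couple of filters; a sketch of the logic (column names as in the skim output, `products` assumed as the data frame name):

```r
library(dplyr)

# Products with an available listing price and no recorded discount
no_discount <- products %>%
  filter(Listing.Price > 0, Discount == 0)

# How many of them still sell below the listing price, and for which brands?
no_discount %>%
  filter(Sale.Price < Listing.Price) %>%
  count(Brand)
```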
- Are there any products with a Sale Price higher than the Listing Price?
This is unexpected, right? The Listing Price should be equal to or higher than the Sale Price, not the other way around. I’ll take a closer look at these Nike products.
An intriguing issue in this table is that these products with a higher Sale Price have a Listing Price of 0, which suggests a potential error in the data generation process, as it’s unlikely for a product to be listed without a price. To correct this, I will assign the Sale Price value to these cases, adjusting it based on the Discount column where available.
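One way to express this adjustment (backing the listing price out of the discount is my reading of the intent, so treat it as a sketch rather than the exact rule):

```r
library(dplyr)

# Where the listing price is 0, rebuild it from the sale price.
# If a discount is recorded, back out the pre-discount price;
# otherwise simply reuse the sale price.
products <- products %>%
  mutate(Listing.Price = case_when(
    Listing.Price == 0 & Discount > 0 ~ Sale.Price / (1 - Discount / 100),
    Listing.Price == 0                ~ Sale.Price,
    TRUE                              ~ Listing.Price
  ))
```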
As I mentioned before, none of the Nike products have a discount available in the dataset. But what about the Adidas products?
The flow chart in Figure 5 shows that the CORE / NEO segment had the highest percentage of products with discounts over 30%. Now, let’s explore ratings, descriptions, and reviews before building our product clusters.
The difference between the brands’ ratings is notable; however, it is important to point out that they have different numbers of products, with Nike having the fewest representatives in the dataset, which may explain the high standard deviation observed. Additionally, the dataset includes product descriptions, and given that it contains only shoes, it would be interesting to explore the common words used in these descriptions to gain insights into product characteristics or marketing trends.
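One way to get at those common words is to tokenize the descriptions and count word frequencies; a sketch with tidytext (an assumed approach, not necessarily the one behind the original figure):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Break descriptions into individual words and drop common stop words
word_counts <- products %>%
  select(Description) %>%
  unnest_tokens(word, Description) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Plot the most frequent words across all descriptions
word_counts %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = NULL,
       title = "Most common words in product descriptions")
```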
Naturally, the first question that arises from this plot is whether there’s any relationship between the words used and the product ratings. To explore this further, I conducted an analysis and visualized the findings in the form of this eye-catching word cloud:
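The word cloud itself can be drawn directly from those word counts; a minimal sketch with the wordcloud package (this version only encodes frequency, while relating words to ratings, as in the figure, would require joining the ratings back onto the tokens):

```r
library(wordcloud)

# Word size reflects how often each word appears in the descriptions
set.seed(123)
wordcloud(words = word_counts$word, freq = word_counts$n,
          max.words = 100, random.order = FALSE)
```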
The word cloud and box plot together highlight the contrasting dynamics between Nike and Adidas in terms of customer engagement and review distribution. In the word cloud, Nike-related terms like “modernised” and “air” are prominent, suggesting that these products dominate discussions and receive significant attention. This is mirrored in the box plot, where Nike shows an irregular review distribution, with numerous outliers indicating that while some Nike products receive a large volume of reviews, others are reviewed much less frequently. In contrast, Adidas displays more balanced word frequencies in the word cloud and a more consistent review distribution across its product lines, as seen in the box plot. This suggests that Adidas products tend to attract steadier, more uniform customer engagement, without the extremes observed in Nike’s reviews.
Before moving into modeling, it’s important to note that our data does not exhibit spherical dispersion. As a result, algorithms that form globular clusters, such as K-means, may not be ideal. Nonetheless, I will still test K-means to compare the clusters it forms.
The figure above shows the distribution of two key variables: Listing Price and Sale Price. These variables are essential for analyzing pricing strategies and product performance, and they play a crucial role in forming clusters to understand business dynamics in retail and e-commerce.
First, the data was scaled to minimize the impact of varying units across different variables. The following variables were initially tested:
As a reminder, variables where the Listing Price was 0 were adjusted to match the Sale Price, accounting for any available Discount.
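A sketch of the scaling step (which numeric variables were kept is an assumption based on the columns available in the skim output):

```r
# Select the numeric variables and standardize them so that no single
# variable dominates the distance calculations
features <- products[, c("Listing.Price", "Sale.Price", "Discount",
                         "Rating", "Reviews")]
features_scaled <- scale(features)
```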
The elbow method suggests that two clusters are optimal for this dataset. The K-means algorithm was applied to the scaled data, resulting in two distinct clusters. The plot shows the distribution of data points based on Sale Price and Listing Price, with each point colored according to its assigned cluster. While the two clusters exhibit a roughly globular dispersion, the data appears to have more complexity and does not fully adhere to this structure, indicating that K-means may not be the best algorithm to capture the underlying characteristics of the products.
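For reference, a sketch of the elbow method and the final K-means fit (the range of k, the seed, and `nstart` are assumptions):

```r
set.seed(42)

# Elbow method: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(features_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the final model with the suggested two clusters
km <- kmeans(features_scaled, centers = 2, nstart = 25)
table(km$cluster)
```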
The HDBSCAN algorithm identified 10 distinct clusters and 972 points considered noise. Noise points are those that do not fit well into any of the formed clusters. The sizes of these clusters vary significantly, which is typical for density-based algorithms like HDBSCAN.
For instance, Cluster 4 is the largest, containing 930 objects, indicating a region of high density. In contrast, Cluster 5 has only 12 objects, suggesting a region of much lower density. Interestingly, Cluster 0 represents noise, with 972 points, indicating a significant portion of the data does not clearly belong to any cluster. The remaining clusters capture variations in sale and listing prices, reviews, and ratings.
The maximum sale prices across clusters range from $3,197 in smaller clusters up to $36,500 in the noise points. Cluster 4 stands out with its large number of reviews and higher average rating of 3.9, while Cluster 5 has no reviews or ratings, indicating limited customer interaction. Clusters like Cluster 1 and Cluster 2 are smaller in size, with moderate sale prices and no significant discounts.
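A sketch of how these clusters and the per-cluster summary could be produced with the dbscan package (minPts is an assumption, so the exact cluster count and sizes may differ from the results reported above; label 0 marks noise):

```r
library(dbscan)
library(dplyr)

# Fit HDBSCAN on the scaled features; minPts sets the minimum cluster size
hdb <- hdbscan(features_scaled, minPts = 15)

# Attach the labels back to the original data (cluster 0 = noise)
products$cluster <- hdb$cluster

# Summarize each cluster on the original, unscaled variables
products %>%
  group_by(cluster) %>%
  summarise(n = n(),
            max_sale_price = max(Sale.Price),
            mean_rating    = mean(Rating),
            mean_reviews   = mean(Reviews))
```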
Table summarizing the clusters formed by HDBSCAN:
In conclusion, HDBSCAN effectively identifies regions of varying density in the dataset, capturing both high and low-density clusters. This clustering offers valuable insights into different pricing strategies and product performance, where larger clusters indicate products with higher customer engagement and more stable pricing, while smaller clusters may indicate niche products or items with less market activity. The noise points, accounting for a large proportion, might represent outliers or products that don’t fit well into the overall market patterns.
In a real-world scenario, I would delve deeper into the modeling process. It would be particularly interesting to model the brands separately to enable direct comparisons, but I’ll save that exploration for another time, as this post is already quite extensive. But that was it!
Note: As you may have noticed, exploratory data analysis has been a topic I’ve really enjoyed lately, and I plan to share more posts on this subject in the future.