Applying K-Means Clustering in the Retail Industry - Fall 2025

Author

Charan Gullipalli, Lavanya Guntupalli, Arun Pandian (Advisor: Dr. Cohen)

Published

October 20, 2025


Introduction

1. Introduction

In today’s data-driven corporate environment, understanding customer behavior and managing inventory effectively are key to organizational success. With the increased availability of transactional, demographic, and behavioral data, data mining and clustering techniques, particularly K-Means clustering, have become critical decision-making tools in retail, logistics, and marketing. Recent studies emphasize that algorithm selection and data preprocessing directly affect the quality of insights derived from clustering (Prasetiya and Prayoga 2025).

K-Means clustering is an unsupervised machine learning technique that partitions data into distinct clusters by assigning each observation to the nearest cluster centroid, allowing businesses to identify patterns that would not be visible using traditional analysis. Research has demonstrated that it can improve consumer segmentation, inventory forecasting, and marketing optimization across various industries. For example, a weighted K-Means model produced more meaningful and stable customer clusters than the standard model (Omol et al. 2024). Similarly, comparative work found that while more complex models can sometimes yield higher precision, K-Means remains valuable for its interpretability and speed in operational settings (Prasetiya and Prayoga 2025).
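To make the idea concrete, the following is a minimal sketch of K-Means in base R. It is not part of the analysis in this report: the two customer features, their values, and the choice of k = 2 are all assumptions made purely for illustration.

```r
# Minimal K-Means sketch on synthetic data (all values are invented).
set.seed(42)

# Two hypothetical customer features: annual spend and number of store visits.
customers <- data.frame(
  annual_spend = c(rnorm(50, mean = 200, sd = 30), rnorm(50, mean = 1000, sd = 100)),
  visits       = c(rnorm(50, mean = 5, sd = 1), rnorm(50, mean = 25, sd = 3))
)

# Standardize so both features contribute comparably to Euclidean distance.
scaled <- scale(customers)

# k = 2 is an assumption; nstart = 25 reruns the algorithm from several random
# starts and keeps the solution with the lowest within-cluster sum of squares.
fit <- kmeans(scaled, centers = 2, nstart = 25)

# Cluster sizes, then per-cluster feature means on the original scale.
table(fit$cluster)
aggregate(customers, by = list(cluster = fit$cluster), FUN = mean)
```

Scaling matters here because K-Means relies on Euclidean distance, so an unscaled feature with a large range (such as spend) would otherwise dominate the clustering.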

In emerging markets, K-Means has also proven beneficial for small and resource-limited businesses. Clustering transactional and demographic data enables grocery stores to identify clear patterns of consumer purchasing behavior (Omol et al. 2024). Likewise, studies show that clustering supports targeted marketing, loyalty programs, and more efficient stock management (Bala 2012). Collectively, these works underscore that even small or localized retailers can leverage K-Means-based segmentation to align supply, demand, and promotional strategy.

Beyond retail, advances in clustering algorithms, including weighted variants (Omol et al. 2024) as well as Global and MiniBatch K-Means, have improved computational efficiency and accuracy, reducing sensitivity to random initialization and easing scalability limits on large datasets. These enhancements make K-Means increasingly suitable for real-time analytics and modern business intelligence frameworks.

In summary, the literature shows that K-Means clustering is more than a statistical technique; it is a strategic enabler for data-informed decision-making. By uncovering relevant behavioral patterns in large datasets, firms can deepen customer understanding, optimize resource allocation, and forecast demand more accurately. Integrating K-Means into corporate analytics frameworks is thus an important step toward operational excellence and maintaining competitiveness in the digital economy.

2. Literature Review

Research shows that K-Means clustering is widely used in retail to improve customer segmentation and inventory forecasting. In their systematic review of studies published between 2019 and 2024, Prasetiya and Prayoga (2025) found that combining K-Means with association rule learning (such as FP-Growth) helps retailers better understand customer behavior and create focused marketing campaigns. The review also highlights that K-Means performs reliably on large datasets and can be applied to classify products by sales speed, for example as fast-moving or slow-moving goods.

Omol et al. (2024) applied K-Means clustering to grocery stores in Kenya using transactional and demographic data. Their study showed that age and income strongly influence shopping behavior. Retailers could use this information to manage inventory and target marketing efforts more effectively.

Bala (2012) introduced a forecasting model combining clustering and demand prediction. This approach improved inventory accuracy, reduced excess stock, and minimized shortages — showing that clustering can make supply chains leaner and more efficient. Similarly, Sitorus et al. (2023) used K-Means to group products by demand at a retail store and identified three categories: fast, moderate, and slow sellers. This helped optimize stock levels and reduce costs.
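The demand-based grouping described above can be sketched as follows. This is an illustrative example only, not the procedure used in the cited studies: the product names and unit counts are synthetic, and k = 3 is chosen to mirror the fast, moderate, and slow categories.

```r
# Sketch: group products into three demand classes with K-Means (synthetic data).
set.seed(1)

# Hypothetical monthly units sold for 30 products.
products <- data.frame(
  item  = paste0("item_", 1:30),
  units = c(runif(10, 1, 10), runif(10, 40, 60), runif(10, 150, 200))
)

# Cluster on the single demand variable; k = 3 mirrors slow/moderate/fast.
fit <- kmeans(products$units, centers = 3, nstart = 25)

# Cluster ids are arbitrary, so rank the cluster centers to attach ordered labels.
center_rank  <- rank(fit$centers[, 1])
class_labels <- c("slow", "moderate", "fast")
products$demand_class <- factor(
  class_labels[center_rank[fit$cluster]],
  levels = class_labels
)

table(products$demand_class)
```

Ranking the centers before labeling is the key step: K-Means numbers its clusters arbitrarily, so cluster 1 is not guaranteed to be the slowest-moving group.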

Together, these studies demonstrate that K-Means clustering helps retailers improve both marketing precision and inventory control by identifying customer patterns and linking them to product demand. For example, clustering results can be used to align promotional strategies with inventory decisions, creating an integrated, data-driven retail management system (Prasetiya and Prayoga 2025; Omol et al. 2024; Bala 2012; Sitorus et al. 2023).

Methods

The dataset was explored in RStudio using descriptive analytics and data visualization. Data cleaning consisted of renaming columns, handling missing values, and checking that numerical and categorical variables were consistently typed.
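As a rough illustration of these cleaning steps, the sketch below uses a small invented data frame whose column names imitate the style of the sales file. The snake_case naming scheme is an assumption; the "Unknown Type" label and zero-filling follow the treatment used in the code later in this report.

```r
# Toy data frame mimicking the column style of the sales file (values invented).
raw <- data.frame(
  `ITEM TYPE`    = c("WINE", "BEER", NA),
  `RETAIL SALES` = c(3.25, NA, 0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rename columns to lower snake_case so they no longer need backticks.
names(raw) <- tolower(gsub(" ", "_", names(raw)))

# Count missing values per column before deciding how to handle them.
colSums(is.na(raw))

# Label missing categories and zero-fill missing sales.
raw$item_type[is.na(raw$item_type)] <- "Unknown Type"
raw$retail_sales[is.na(raw$retail_sales)] <- 0

# Confirm the column types are what later steps expect.
str(raw)
```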

Exploratory data analysis (EDA) was used to find patterns, seasonal trends, and relationships among variables such as sales, suppliers, and item types. Sales distributions, monthly trends, and category comparisons were visualized with ggplot2.

Because the primary objective was to understand sales behavior rather than to make predictions, no predictive or machine learning models were fitted. The selected methods are appropriate because they make important trends, supplier performance, and sales anomalies in the dataset easy to identify.

Analysis and Results

Data Exploration and Visualization

RStudio was used to clean and visualize the data. The analysis focused on monthly trends, supplier performance, and product categories.

The visualizations show that sales peak around the holidays and that a small group of suppliers generates the most revenue. Wine and spirits were the best-selling products, and warehouse sales were typically higher than retail sales.

These patterns point to strong supplier concentration and seasonal demand in the dataset.
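One simple way to quantify supplier concentration is the cumulative share of total sales held by the largest suppliers. The sketch below uses invented supplier names and totals, not figures from the dataset.

```r
# Synthetic supplier totals (invented values) to illustrate concentration.
sales <- data.frame(
  supplier    = c("A", "B", "C", "D", "E"),
  total_sales = c(500, 300, 100, 60, 40)
)

# Sort from largest to smallest, then compute each supplier's share of the
# total and the running (cumulative) share.
sales <- sales[order(-sales$total_sales), ]
sales$share     <- sales$total_sales / sum(sales$total_sales)
sales$cum_share <- cumsum(sales$share)

sales
```

In this toy example the two largest suppliers account for 80% of sales; the same calculation on the real supplier totals would show how concentrated the market actually is.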

Data Sources and Collection:

The data come from the Montgomery County, Maryland Open Data Portal, where they are published by the Department of Liquor Control. The dataset contains monthly retail sales, retail transfers, and warehouse sales by supplier and item type; it is updated regularly and is publicly available under a Public Domain license.

Initial Findings and Insights:

Sales peak during holiday seasons.

A small number of suppliers dominate overall sales.

Wine and spirits are the most popular product categories.

Warehouse sales are higher than retail sales.

Unexpected Patterns or Anomalies:

Some records have duplicate item codes or missing supplier names.

Certain months show abnormally high or low sales figures.

Increases in retail transfers are probably related to restocking events.
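Checks like these can be scripted. The sketch below runs on small invented vectors rather than the real dataset, and the rule for flagging unusual months (more than three median absolute deviations from the median) is an assumed threshold, not one taken from this analysis.

```r
# Toy records (invented) containing a repeated item code and a missing supplier.
records <- data.frame(
  item_code = c(101, 102, 102, 103, 104, 105),
  supplier  = c("A", "B", "B", NA, "C", "A"),
  stringsAsFactors = FALSE
)

# Item codes that appear more than once.
dup_codes <- unique(records$item_code[duplicated(records$item_code)])

# Number of rows with no supplier name.
n_missing_supplier <- sum(is.na(records$supplier))

# Hypothetical monthly totals with one unusually high month (index 12).
monthly <- c(100, 98, 105, 102, 97, 101, 99, 103, 100, 104, 102, 180)

# Flag months far from the median; the 3 * MAD cutoff is an assumption.
flagged <- which(abs(monthly - median(monthly)) > 3 * mad(monthly))

dup_codes
n_missing_supplier
flagged
```

A median-based rule is used here instead of the mean and standard deviation because a single extreme month would otherwise inflate the very threshold meant to detect it.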

Code

# Load packages
library(tidyverse)  # attaches dplyr, tidyr, and ggplot2
library(readxl)     # read_excel()
library(lubridate)  # make_date()
library(scales)     # comma() axis labels

# Read Excel file (update the file path if needed)

data <- read_excel("Retail_Sales.xlsx")

# View basic structure
glimpse(data)

Rows: 307,645
Columns: 9
$ YEAR               <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
$ MONTH              <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, …
$ SUPPLIER           <chr> "A I G WINE & SPIRITS", "A I G WINE & SPIRITS", "AI…
$ `ITEM CODE`        <dbl> 119229, 120200, 306528, 97039, 69953, 69954, 69490,…
$ `ITEM DESCRIPTION` <chr> "TROCADERO SPARK(BRUT) - 750ML", "DOM DES FONTANELL…
$ `ITEM TYPE`        <chr> "WINE", "WINE", "WINE", "BEER", "BEER", "BEER", "KE…
$ `RETAIL SALES`     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 3.25, 0.0…
$ `RETAIL TRANSFERS` <dbl> 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ `WAREHOUSE SALES`  <dbl> 4, 3, 1, 3, 12, 12, 3, 45, 2, 0, 20, 6, 11, 10, 13,…

Code

summary(data)

      YEAR          MONTH          SUPPLIER           ITEM CODE      
 Min.   :2017   Min.   : 1.000   Length:307645      Min.   :      2  
 1st Qu.:2017   1st Qu.: 3.000   Class :character   1st Qu.:  48104  
 Median :2019   Median : 7.000   Mode  :character   Median :  84023  
 Mean   :2018   Mean   : 6.424                      Mean   : 164114  
 3rd Qu.:2019   3rd Qu.: 9.000                      3rd Qu.: 318087  
 Max.   :2020   Max.   :12.000                      Max.   :3480003  
                                                    NA's   :54       
 ITEM DESCRIPTION    ITEM TYPE          RETAIL SALES      RETAIL TRANSFERS  
 Length:307645      Length:307645      Min.   :  -6.490   Min.   : -38.490  
 Class :character   Class :character   1st Qu.:   0.000   1st Qu.:   0.000  
 Mode  :character   Mode  :character   Median :   0.320   Median :   0.000  
                                       Mean   :   7.024   Mean   :   6.936  
                                       3rd Qu.:   3.268   3rd Qu.:   3.000  
                                       Max.   :2739.000   Max.   :1990.830  
                                       NA's   :3                            
 WAREHOUSE SALES  
 Min.   :-7800.0  
 1st Qu.:    0.0  
 Median :    1.0  
 Mean   :   25.3  
 3rd Qu.:    5.0  
 Max.   :18317.0

Code

# Remove duplicate rows
data <- distinct(data)

# Handle missing values (replace or remove)
data <- data %>%
  mutate(
    SUPPLIER = ifelse(is.na(SUPPLIER), "Unknown Supplier", SUPPLIER),
    `ITEM TYPE` = ifelse(is.na(`ITEM TYPE`), "Unknown Type", `ITEM TYPE`),
    `RETAIL SALES` = ifelse(is.na(`RETAIL SALES`), 0, `RETAIL SALES`),
    `RETAIL TRANSFERS` = ifelse(is.na(`RETAIL TRANSFERS`), 0, `RETAIL TRANSFERS`),
    `WAREHOUSE SALES` = ifelse(is.na(`WAREHOUSE SALES`), 0, `WAREHOUSE SALES`)
  )

# Create a Total Sales column
data <- data %>%
  mutate(TOTAL_SALES = `RETAIL SALES` + `RETAIL TRANSFERS` + `WAREHOUSE SALES`)

# Combine YEAR and MONTH into a proper Date column
data <- data %>%
  mutate(Date = make_date(YEAR, MONTH, 1))

Code

# Total Sales Trend Over Time

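# Aggregate to one total per month; the raw data has one row per item record
monthly_trend <- data %>%
  group_by(Date) %>%
  summarise(Total_Sales = sum(TOTAL_SALES, na.rm = TRUE), .groups = "drop")

ggplot(monthly_trend, aes(x = Date, y = Total_Sales)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Total Sales Trend Over Time", x = "Date", y = "Total Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal()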

Code

# Top 10 Suppliers by Total Sales

top_suppliers <- data %>%
  group_by(SUPPLIER) %>%
  summarise(Total_Sales = sum(TOTAL_SALES, na.rm = TRUE)) %>%
  arrange(desc(Total_Sales)) %>%
  head(10)

ggplot(top_suppliers, aes(x = reorder(SUPPLIER, Total_Sales), y = Total_Sales, fill = SUPPLIER)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Suppliers by Total Sales", x = "Supplier", y = "Total Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "none")

Code

# Sales by Item Type

ggplot(data, aes(x = `ITEM TYPE`, y = TOTAL_SALES, fill = `ITEM TYPE`)) +
  geom_bar(stat = "summary", fun = "sum") +
  labs(title = "Total Sales by Item Type", x = "Item Type", y = "Total Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "none")

Code

# Comparison: Retail vs Warehouse Sales

sales_compare <- data %>%
  summarise(
    Retail_Sales = sum(`RETAIL SALES`, na.rm = TRUE),
    Warehouse_Sales = sum(`WAREHOUSE SALES`, na.rm = TRUE)
  ) %>%
  tidyr::pivot_longer(cols = everything(), names_to = "Category", values_to = "Sales")

ggplot(sales_compare, aes(x = Category, y = Sales, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Retail vs Warehouse Sales", x = "", y = "Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "none")

Code

# Monthly Sales Heatmap
library(viridis)

monthly_sales <- data %>%
  group_by(YEAR, MONTH) %>%
  summarise(Total_Sales = sum(TOTAL_SALES, na.rm = TRUE), .groups = "drop")

ggplot(monthly_sales, aes(x = factor(MONTH), y = factor(YEAR), fill = Total_Sales)) +
  geom_tile() +
  scale_fill_viridis(option = "magma", direction = -1, labels = comma) +
  labs(title = "Monthly Sales Heatmap", x = "Month", y = "Year", fill = "Total Sales") +
  theme_minimal()

Modeling and Results


Conclusion


References

Bala, Pradip Kumar. 2012. “Improving Inventory Performance with Clustering-Based Demand Forecasts.” Decision Tree-Based Demand Forecasts for Improving Inventory Performance. https://www.researchgate.net/publication/261023207_Decision_tree_based_demand_forecasts_for_improving_inventory_performance.

Omol, Edwin, Dorcas Onyangor, Lucy Mburu, and Paul Abuonji. 2024. “Application of k-Means Clustering for Customer Segmentation in Grocery Stores in Kenya.” International Journal of Science, Technology and Management (IJSTM). https://ijstm.inarah.co.id/index.php/ijstm/article/view/1024.

Prasetiya, Yanda Rizky, and Firlan Prayoga. 2025. “Systematic Literature Review: Customer Segmentation in Retail.” Journal of Technovasia. https://journal-iam.com/index.php/technovasia/article/view/11.

Sitorus, Zulham, Irwan Syahputra, Chairul Indra Angkat, and Dewi Sartika. 2023. “Implementation of k-Means Clustering for Inventory Projection.” International Journal of All Research in Science, Technology and Computers (IJARSCT). https://www.ijarsct.co.in/Paper19084.pdf.