Introduction
1. Introduction
In today’s data-driven corporate environment, understanding customer behavior and managing inventory effectively are key to organizational success. With the increased availability of transactional, demographic, and behavioral data, data mining and clustering techniques, particularly K-Means clustering, have become critical decision-making tools in retail, logistics, and marketing. Recent studies emphasize that algorithm selection and data preprocessing directly affect the quality of insights derived from clustering (Prasetiya and Prayoga 2025).
K-Means clustering is an unsupervised machine learning technique that partitions data into a chosen number of clusters based on similarity, allowing businesses to identify patterns that may not be visible using traditional analysis. Research has demonstrated that it can greatly enhance consumer segmentation, inventory forecasting, and marketing optimization across various industries. For example, a weighted K-Means model produced more meaningful and stable customer clusters than the standard model (Omol et al. 2024). Similarly, comparative work found that while more complex models can sometimes yield higher precision, K-Means remains valuable for its interpretability and speed in operational settings (Prasetiya and Prayoga 2025).
In emerging markets, K-Means has also proven beneficial for small and resource-limited businesses. Clustering transactional and demographic data enables grocery stores to identify clear patterns of consumer purchasing behavior (Omol et al. 2024). Likewise, studies show that clustering supports targeted marketing, loyalty programs, and more efficient stock management (Bala 2012). Collectively, these works underscore that even small or localized retailers can leverage K-Means-based segmentation to align supply, demand, and promotional strategy.
Beyond retail, advances in clustering algorithms, including weighted variants (Omol et al. 2024) as well as Global and MiniBatch K-Means, have improved computational efficiency and accuracy, mitigating sensitivity to random initialization and scalability limits on large datasets. These enhancements make K-Means increasingly suitable for real-time analytics and modern business intelligence frameworks.
In summary, the literature shows that K-Means clustering is more than a statistical technique; it is a strategic enabler for data-informed decision-making. By uncovering relevant behavioral patterns in large datasets, firms can deepen customer understanding, optimize resource allocation, and forecast demand more accurately. Integrating K-Means into corporate analytics frameworks is thus a crucial step toward operational excellence and maintaining competitiveness in the digital economy.
2. Literature Review
Research shows that K-Means clustering is widely used in retail to improve customer segmentation and inventory forecasting. According to Prasetiya and Prayoga (2025), studies published between 2019 and 2024 found that combining K-Means with association rule learning (such as FP-Growth) helps retailers better understand customer behavior and create focused marketing campaigns. The review also highlights that K-Means performs reliably on large datasets and can be applied to classify products by sales speed, for example as fast-moving and slow-moving goods.
Omol et al. (2024) applied K-Means clustering to grocery stores in Kenya using transactional and demographic data. Their study showed that age and income strongly influence shopping behavior. Retailers could use this information to manage inventory and target marketing efforts more effectively.
Bala (2012) introduced a forecasting model combining clustering and demand prediction. This approach improved inventory accuracy, reduced excess stock, and minimized shortages — showing that clustering can make supply chains leaner and more efficient. Similarly, Sitorus et al. (2023) used K-Means to group products by demand at a retail store and identified three categories: fast, moderate, and slow sellers. This helped optimize stock levels and reduce costs.
Together, these studies demonstrate that K-Means clustering helps retailers improve both marketing precision and inventory control by identifying customer patterns and linking them to product demand. For example, clustering results can be used to align promotional strategies with inventory decisions, creating an integrated, data-driven retail management system (Prasetiya and Prayoga 2025; Omol et al. 2024; Bala 2012; Sitorus et al. 2023).
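To make the product-segmentation idea from the reviewed studies concrete, here is a minimal R sketch that groups products into slow, moderate, and fast sellers with base R's kmeans(). It is an illustration only: the products data frame, the item_id and avg_monthly_units columns, and all values are synthetic assumptions, not data or code from the cited papers or from this report.

```r
# Minimal sketch on synthetic data; every name and value here is hypothetical.
set.seed(42)  # k-means starts from random centers, so fix the seed

# Hypothetical average monthly demand for 60 products at three demand levels
products <- data.frame(
  item_id = sprintf("P%02d", 1:60),
  avg_monthly_units = c(rnorm(20, mean = 10, sd = 2),
                        rnorm(20, mean = 50, sd = 5),
                        rnorm(20, mean = 120, sd = 10))
)

# Fit k-means with three clusters; nstart reruns from several random starts
# and keeps the solution with the lowest within-cluster sum of squares
fit <- kmeans(products$avg_monthly_units, centers = 3, nstart = 25)

# Cluster ids are arbitrary, so rank clusters by their center before labeling
rank_of_cluster <- rank(fit$centers[, 1])
segment_labels <- c("slow", "moderate", "fast")
products$segment <- factor(
  segment_labels[rank_of_cluster[fit$cluster]],
  levels = segment_labels
)

# Number of products per segment
table(products$segment)
```

With more than one input variable, the columns would normally be standardized (for example with scale()) before clustering, because K-Means relies on Euclidean distance.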
Methods
The dataset was investigated in RStudio using descriptive analytics and data visualization. Data cleaning consisted of renaming columns, handling missing values, and making sure numerical and categorical variables were consistent.
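The cleaning steps described above can be sketched in R as follows. This is a hedged illustration on a hypothetical three-row tibble, not the report's actual pipeline: the raw and clean objects and the snake_case column names are assumptions, and replacing missing numbers with zero is a choice that should be checked against the data.

```r
library(dplyr)

# Hypothetical toy rows that mimic the dataset's column style (not real records)
raw <- tibble(
  `ITEM TYPE`    = c("WINE", NA, "BEER"),
  `RETAIL SALES` = c(1.5, NA, 0)
)

clean <- raw %>%
  # Rename columns to lower snake_case so they no longer need backticks
  rename_with(~ tolower(gsub(" ", "_", .x))) %>%
  # Label missing categories and replace missing numbers with zero (assumption)
  mutate(
    item_type    = coalesce(item_type, "Unknown Type"),
    retail_sales = coalesce(retail_sales, 0)
  )

clean
```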
Exploratory data analysis (EDA) was used to find patterns, seasonal trends, and relationships between variables such as sales, suppliers, and item types. Sales distributions, monthly trends, and category comparisons were visualized with ggplot2.
Since the primary objective was to understand sales behavior rather than to make predictions, neither predictive nor machine learning models were used. The selected approaches are appropriate because they clearly surface important trends, supplier performance, and sales anomalies within the dataset.
Analysis and Results
Data Exploration and Visualization
RStudio was used to clean and visualize the data. The analysis focused on monthly trends, supplier performance, and product categories.
The visuals show that sales peak during the holidays and that a small group of suppliers generates the most revenue. Wine and spirits were the best-selling products, and warehouse sales were typically higher than retail sales.
These trends point to strong supplier concentration and seasonal demand in the dataset.
Data Sources and Collection:
The data come from the Montgomery County, Maryland Open Data Portal and are published by the county's Department of Liquor Control. The dataset contains monthly retail sales, retail transfers, and warehouse sales records by supplier and item type; it is updated regularly and is publicly available under a Public Domain License.
Initial Findings and Insights:
Sales peak during holiday seasons.
A small number of suppliers account for most of the overall sales.
Wine and spirits are the most popular product categories.
Warehouse sales are higher than retail sales.
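One way to quantify how concentrated sales are among suppliers is to compute each supplier's share of total sales and the cumulative share of the top suppliers. The R sketch below uses a hypothetical supplier_sales table with made-up totals; it is not derived from the dataset.

```r
library(dplyr)

# Hypothetical supplier totals, not figures from the dataset
supplier_sales <- tibble(
  supplier = c("A", "B", "C", "D", "E"),
  total    = c(500, 300, 100, 60, 40)
)

shares <- supplier_sales %>%
  arrange(desc(total)) %>%
  mutate(
    share     = total / sum(total),  # each supplier's share of all sales
    cum_share = cumsum(share)        # running share of the top-n suppliers
  )

shares
```

Reading cum_share at row n gives the fraction of sales covered by the n largest suppliers, which is a compact way to support a claim about dominance.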
Unexpected Patterns or Anomalies:
Some item codes are duplicated and some supplier names are missing.
Certain months have abnormally high or low sales figures.
Increases in retail transfers are probably related to restocking events.
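Checks like the ones behind these observations can be sketched in R as follows. The toy tibble, its column names, and the outlier threshold (three median absolute deviations from the median) are all assumptions for illustration; treating a repeated item code within the same month as a duplicate is also an assumption, since the report does not define the term.

```r
library(dplyr)

# Hypothetical toy records; column names are simplified stand-ins
toy <- tibble(
  supplier  = c("A", NA, "B", "B", "C", "C"),
  item_code = c(101, 102, 103, 103, 104, 105),
  month     = c(1, 1, 2, 2, 3, 3),
  sales     = c(10, 12, 11, 9, 250, 8)
)

# Count rows with a missing supplier name
n_missing_supplier <- sum(is.na(toy$supplier))

# Item codes that appear on more than one row within the same month
dup_codes <- toy %>%
  count(month, item_code) %>%
  filter(n > 1)

# Monthly totals, flagged when far from the median in MAD units
monthly <- toy %>%
  group_by(month) %>%
  summarise(total = sum(sales), .groups = "drop") %>%
  mutate(flag = abs(total - median(total)) > 3 * mad(total))

n_missing_supplier
dup_codes
monthly
```

A median-based rule is used here because a single extreme month would inflate the mean and standard deviation and could hide itself from a z-score check.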
# Load packages
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
library(scales)

# Read Excel file (update the file path if needed)
data <- read_excel("Retail_Sales.xlsx")

# View basic structure
glimpse(data)

# Summary statistics for each column
summary(data)
      YEAR          MONTH           SUPPLIER           ITEM CODE
 Min.   :2017   Min.   : 1.000   Length:307645      Min.   :      2
 1st Qu.:2017   1st Qu.: 3.000   Class :character   1st Qu.:  48104
 Median :2019   Median : 7.000   Mode  :character   Median :  84023
 Mean   :2018   Mean   : 6.424                      Mean   : 164114
 3rd Qu.:2019   3rd Qu.: 9.000                      3rd Qu.: 318087
 Max.   :2020   Max.   :12.000                      Max.   :3480003
                                                    NA's   :54
 ITEM DESCRIPTION    ITEM TYPE          RETAIL SALES       RETAIL TRANSFERS
 Length:307645      Length:307645      Min.   :  -6.490   Min.   : -38.490
 Class :character   Class :character   1st Qu.:   0.000   1st Qu.:   0.000
 Mode  :character   Mode  :character   Median :   0.320   Median :   0.000
                                       Mean   :   7.024   Mean   :   6.936
                                       3rd Qu.:   3.268   3rd Qu.:   3.000
                                       Max.   :2739.000   Max.   :1990.830
                                       NA's   :3
 WAREHOUSE SALES
 Min.   :-7800.0
 1st Qu.:    0.0
 Median :    1.0
 Mean   :   25.3
 3rd Qu.:    5.0
 Max.   :18317.0
# Remove duplicate rows
data <- distinct(data)

# Handle missing values (replace or remove)
data <- data %>%
  mutate(
    SUPPLIER = ifelse(is.na(SUPPLIER), "Unknown Supplier", SUPPLIER),
    `ITEM TYPE` = ifelse(is.na(`ITEM TYPE`), "Unknown Type", `ITEM TYPE`),
    `RETAIL SALES` = ifelse(is.na(`RETAIL SALES`), 0, `RETAIL SALES`),
    `RETAIL TRANSFERS` = ifelse(is.na(`RETAIL TRANSFERS`), 0, `RETAIL TRANSFERS`),
    `WAREHOUSE SALES` = ifelse(is.na(`WAREHOUSE SALES`), 0, `WAREHOUSE SALES`)
  )

# Create a Total Sales column
data <- data %>%
  mutate(TOTAL_SALES = `RETAIL SALES` + `RETAIL TRANSFERS` + `WAREHOUSE SALES`)

# Combine YEAR and MONTH into a proper Date column
data <- data %>%
  mutate(Date = make_date(YEAR, MONTH, 1))
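# Total Sales Trend Over Time
# Aggregate to one total per month first; plotting the item-level rows
# directly would join every record that shares a date into a vertical zigzag.
sales_trend <- data %>%
  group_by(Date) %>%
  summarise(TOTAL_SALES = sum(TOTAL_SALES, na.rm = TRUE), .groups = "drop")

ggplot(sales_trend, aes(x = Date, y = TOTAL_SALES)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Total Sales Trend Over Time", x = "Date", y = "Total Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal()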
# Top 10 Suppliers by Total Sales
top_suppliers <- data %>%
  group_by(SUPPLIER) %>%
  summarise(Total_Sales = sum(TOTAL_SALES, na.rm = TRUE)) %>%
  arrange(desc(Total_Sales)) %>%
  head(10)

ggplot(top_suppliers, aes(x = reorder(SUPPLIER, Total_Sales), y = Total_Sales, fill = SUPPLIER)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Suppliers by Total Sales", x = "Supplier", y = "Total Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "none")
# Sales by Item Type
ggplot(data, aes(x = `ITEM TYPE`, y = TOTAL_SALES, fill = `ITEM TYPE`)) +
  geom_bar(stat = "summary", fun = "sum") +
  labs(title = "Total Sales by Item Type", x = "Item Type", y = "Total Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "none")
# Comparison: Retail vs Warehouse Sales
sales_compare <- data %>%
  summarise(
    Retail_Sales = sum(`RETAIL SALES`, na.rm = TRUE),
    Warehouse_Sales = sum(`WAREHOUSE SALES`, na.rm = TRUE)
  ) %>%
  tidyr::pivot_longer(cols = everything(), names_to = "Category", values_to = "Sales")

ggplot(sales_compare, aes(x = Category, y = Sales, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Retail vs Warehouse Sales", x = "", y = "Sales") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "none")
# Monthly Sales Heatmap
library(viridis)

monthly_sales <- data %>%
  group_by(YEAR, MONTH) %>%
  summarise(Total_Sales = sum(TOTAL_SALES, na.rm = TRUE))

ggplot(monthly_sales, aes(x = factor(MONTH), y = factor(YEAR), fill = Total_Sales)) +
  geom_tile() +
  scale_fill_viridis(option = "magma", direction = -1, labels = comma) +
  labs(title = "Monthly Sales Heatmap", x = "Month", y = "Year", fill = "Total Sales") +
  theme_minimal()
Omol, Edwin, Dorcas Onyangor, Lucy Mburu, and Paul Abuonji. 2024. “Application of k-Means Clustering for Customer Segmentation in Grocery Stores in Kenya.” International Journal of Science, Technology and Management (IJSTM). https://ijstm.inarah.co.id/index.php/ijstm/article/view/1024.
Sitorus, Zulham, Irwan Syahputra, Chairul Indra Angkat, and Dewi Sartika. 2023. “Implementation of k-Means Clustering for Inventory Projection.” International Journal of All Research in Science, Technology and Computers (IJARSCT). https://www.ijarsct.co.in/Paper19084.pdf.