Logo Image
Mike Benas

Turtle Games Ltd

Analyzing real-world data to address a problem faced by Turtle Games. Carried out descriptive, diagnostic, prescriptive, and predictive analysis in Python, while also carrying out an analysis in R. Data cleaning and wrangling in order to produce informative visualizations employing both programming languages.

Project Image

Key Skills & Resources Used

Python
R
K-Means
Clustering
Forecasting
Regression Models
NLP
Sentiment Analysis
Tokenisation
Storytelling

Project Overview

In the given case study, the individual was provided with data from a manufacturer and retailer known as Turtle Games. Turtle Games operates globally and offers a range of products, including books, toys, board games, and video games. The individual was tasked with analyzing the company's sales data and customer reviews to gain insights into overall sales performance and identify areas for improvement.
To conduct the data analysis and present the findings, the individual utilized R/RStudio and Python as the primary tools.
The analysis also encompassed the examination of the following key areas:
• How customers earn and accumulate loyalty points.
• How different customer segments within the customer base can be targeted for specific market strategies.
• How social data, such as customer reviews, can be utilized to shape marketing campaigns.
• The impact of each product on sales performance.
• The reliability of the data, considering factors like normal distribution, skewness, and kurtosis.
• Identifying any potential relationships between sales in North America, Europe, and the global market.

Fig1
Figure 1:OLS Regression results and Scatter plot to show Spending Score vs Loyalty points with Regression line.

Fig2
Figure 2: OLS Regression results and Scatter plot to show Renumeration vs Loyalty points with Regression line.

Fig3
Figure 3: OLS Regression results and Scatter plot to Age vs Loyalty points with Regression line.

Based on the findings presented in Figures 1-3, it is evident that the r-squared values for all three models are considerably low. Specifically, the r-squared values are 0.45, 0.38, and 0.002 for spending score vs. loyalty points, renumeration vs. loyalty points, and age vs. loyalty points, respectively. A higher r-squared value, closer to 1, indicates a better fit of the regression model and explains a larger proportion of the total variation in the data.
The scatter plots reveal a slight positive relationship or linearity between spending score vs. loyalty points and renumeration vs. loyalty points. The regression lines in the plots help visualize this connection. However, no discernible relationship exists between age and loyalty points, suggesting that further exploration in that direction is unnecessary.
Considering the involvement of multiple variables, it might be beneficial to consider a multiple linear regression model instead.
Expanding upon the analysis conducted with the linear regression models, let's focus on the utility of renumeration and spending scores in identifying distinct groups within the customer base for targeted market segmentation. To accomplish this, we can employ k-means clustering.
To determine the optimal number of clusters for our dataset, we can utilize the Elbow or Silhouette methods. These methods indicate that testing out 4-6 clusters would be appropriate.

Fig4
Figure 4: The Elbow method to determine optimal number of clusters.

Fig5
Figure 5: The Silhouette method to determine the optimal number of clusters.

Based on the analysis and evaluation of the methods and models mentioned above, it can be concluded that utilizing 5 clusters yields the most distinct separation and grouping. It is important to note that this selection is subject to some degree of subjectivity, as unsupervised models like k-means clustering do not provide predetermined labels. Without expertise in the specific field, determining the ideal labels for the clusters becomes challenging and uncertain. This is recognized as one of the limitations of unsupervised models, where the interpretation of results can be subjective, especially when clear and definitive labels are not available.

Fig6
Figure 6: k-means model using 5 clusters.

Upon observing the k-means model depicted in Figure 6, it becomes evident that the clusters/groups exhibit well-defined boundaries, with minimal overlap among the data points from different clusters. The characteristics of each cluster are as follows:
- Cluster 1 (red): Represents a higher income group that spends more.
- Cluster 2 (green): Corresponds to an average income group with average spending habits.
- Cluster 3 (blue): Represents a higher income group that spends less.
- Cluster 4 (black): Represents a lower income group that spends more.
- Cluster 5 (orange): Corresponds to a lower income group that spends less.
To gain deeper insights into customer reviews, Natural Language Processing (NLP) techniques can be employed. Conducting sentiment analysis allows us to identify the 15 most commonly used words, as well as the top 20 positive and negative reviews received. By importing and cleaning the data, we were able to tokenize the words and generate a word cloud, aiding in better understanding the sentiment expressed in the reviews.

Fig7
Figure 7: Word cloud of the online customer reviews.

Upon analyzing the word cloud, it becomes evident that the prominently featured words are predominantly positive in nature. To validate this observation, we proceeded to plot the top 15 most frequently used words (Figure 8). Additionally, a histogram was created to illustrate the sentiment score polarity (Figure 9). These visualizations serve as further confirmation that the majority of the sentiment expressed in the reviews is positive.

Fig8
Figure 8: Count plot for the top 15 words used in customer product reviews.

Fig9
Figure 9: Histogram of sentiment score polarity of reviews.

The aforementioned visualizations indeed validate the existence of a positive sentiment towards the products, and the polarity extends beyond the neutral and positive ends of the sentiment scale.
Furthermore, we successfully presented the top 20 positive and negative reviews. However, it is important to acknowledge one of the limitations of utilizing Natural Language Processing (NLP), which lies in the potential for incorrect interpretations of words or reviews. Figure 10 illustrates this issue when examining the list of the top 20 negative reviews. It becomes apparent that some of the reviews are actually positive in nature, but the NLP model has erroneously classified them as negative. This highlights the importance of cautious interpretation and manual review when working with NLP results to account for potential misclassifications.

Fig10
Figure 10: List of top 20 negative sentiment reviews with positive reviews mixed.

When considering the sales data, it is challenging to precisely determine the specific impact of a product on overall sales based solely on the available data. This limitation arises from the absence of sales trend data over time, as our dataset only includes information for the year each product was released.
Nevertheless, by grouping the data according to the "Product" category, we can identify the highest and lowest selling products within the two regions. This analysis enables the marketing team to focus their efforts on promoting the right products, thereby enhancing sales. While we may not have a comprehensive understanding of the overall impact of each product, this information empowers the marketing team to strategically tailor their marketing campaigns to drive improved sales performance.

Fig11
Figure 11: Bar chart to show the highest selling products in £millions for North America

Fig12
Figure 12: Bar chart to show the highest selling products in £millions for Europe

Fig13
Figure 13: Boxplot to show the number of games sold (£millions) for each region.

We can see from Figure 13 that North America has a higher median number of sales compared to Europe and all regions have many outliers. The Histogram helped to visualise this also.

Fig14
Figure 14: Histogram to show the distribution for Global sales.

Figure 14 shows that the data is skewed to the right and so the few larger values raise the mean higher than the median. This is not a normal distribution. We can confirm this by running the Shapiro-Wilk test; skewness and the kurtosis functions.

Fig15
Figure 15: Shapiro Wilk test results for all regions

Figure 15 shows that the p-value is less than 0.05 for all three regions, meaning that the null hypothesis should be rejected and therefore the sales data is not normally distributed.

Fig16
Figure 16: Skewness and Kurtosis scores for all regions

Figure 16 shows us that the skewness values are positive indicating that the tail is on the right side of distribution and the Kurtosis is leptokurtic (heavy-tailed), meaning that data produces more outliers than a normal distribution.
If data is not normally distributed it could impact any machine learning models that we use for predictive analytics. We would need to remove any outliers, but this could reduce the data sample and so it would be best to get a larger data set so that we can get a more distribution.
To confirm if there are any relationships between North American, European, and Global sales, we are able to create a linear regression model

Fig17
Figure 17: Using cor() function to create correlations between the variable, North American, European, and global sales

The cor() values show us that there is a positive relationship between all 3 regions but the strongest correlation (91.6%) is between North American Sales and Global Sales. By plotting the relationships, we are able to visualise this relationship (Figure 18).

Fig18
Figure 18: Scatter plot for the correlation between North American Sales and Global sales (£millions)

As there are multiple variables, a multiple linear regression model is more appropriate to assess the effect of the variables on the dependent variable.

Fig19
Figure 19: Using cor() function to show the correlations between the North American, European, and global sales in a multiple linear regression model.

Fig20
Figure 20: Summary of the multiple linear regression model.

The summary statistics for the multiple linear regression show us that we can be very confident with our model; the p-values are less than 0.05, and the adjusted R-squared value is very high (96.6%). This tells us that EU_Sales and NA_Sales are highly significant variables in relationship to Global sales. We can use this model to test its ability to predict the global sales and compare with what was actually observed.

Fig21
Figure 21: Using our multiple linear regression to compare predicted global sales values to what was actually observed.

From our analysis, we have been able to identify the key groups of customers shopping at Turtles Games, based on renumeration and spending scores and the best groups to target to boost sales (Groups 1 and 0 from Figure 6).
Our sentiment analysis demonstrated that customers had predominantly positive sentiments towards the company and the sentiment polarity distribution was more neutral to positive. We were able to identify the highest and lowest selling products for all regions and so Turtles Games will be better informed when it comes to marketing so that they promote the most popular products.
In terms of relationships relating to sales data, our analysis showed that there was a positive relationship with North American, Europeans sales and Global Sales; although, North American Sales had the strongest relationship with Global sales and so this would be a key market to focus on also. We were able to predict the Global sales with very strong accuracy using a multiple linear regression model. Questions were raised regarding the distribution of the data and so to further our analysis, we would suggest getting data from a bigger sample that is more representative of the normal distribution or data for a longer time frame.