SQL:Tokopedia Simple Sentiment Analysis

personalProject | Aug. 26, 2024, 10:18 p.m.

Introduction


As part of my journey to becoming a proficient data analyst, I recently undertook a personal project that involved analyzing product reviews from Tokopedia, one of Indonesia's largest e-commerce platforms. This project not only allowed me to apply and showcase my SQL skills but also provided valuable insights into customer sentiment based on product ratings. In this article, I’ll go through the steps of the project, breaking down the process.

---

1. Understanding the Data Structure


Before diving into any analysis, it’s crucial to understand the structure of the data. The dataset I used for this project was sourced from on Kaggle titled "tokopedia product reviews" and contained various columns, including:

- no: Number.
- text: Text of the review.
- rating: User rating (1-5).
- category: Product category.
- product_name: Name of the product.
- product_id: Unique identifier for the product.
- sold: Number of products sold.
- shop_id: ID of the shop selling the product.
- product_url: URL of the product.

Understanding the data layout is essential, as it forms the foundation for the entire analysis.

2. Data Cleaning and Normalization


Before importing the data into PostgreSQL for analysis, I needed to clean and normalize it. The `sold` column, which records the number of products sold, had some inconsistencies. For instance, some entries were blank, some were labeled as 'broken,' and others used 'rb' to denote thousands.

To handle these inconsistencies, I used Excel to perform some initial data cleaning. I applied the `COUNTIF()` function by `product_id` to handle missing and 'broken' value and used the `SUBSTITUTE()` function to convert 'rb' values into actual numbers by multiplying them by 1,000.

3. Classifying Sentiment Based on Ratings


The heart of this project was sentiment analysis, where I classified customer sentiment based on their ratings. Given that the ratings were on a scale of 1 to 5, I categorized them as follows:

- Positif: Ratings of 4 and 5.
- Netral: Rating of 3.
- Negatif: Ratings of 1 and 2.

To implement this classification in SQL, I used the following query:

ALTER TABLE reviews
ADD COLUMN sentiment text;
UPDATE reviews
SET sentiment = CASE
    WHEN rating >= 4 THEN 'positif'
    WHEN rating = 3 THEN 'netral'
    WHEN rating <= 2 THEN 'negatif'
END;


This step was crucial as it allowed me to segment the data into meaningful categories that could be analyzed further.

4. Analyzing Sentiment by Product Category


To analyze sentiment distribution across product categories, I used the following query:


SELECT
category,
sentiment,
COUNT(*) AS sentiment_count
FROM
reviews
GROUP BY
category, sentiment
ORDER BY
category, sentiment_count DESC;


This analysis helps in understanding which categories are receiving more positive or negative feedback.


5. Sentiment Analysis by Store (shop_id)


To identify stores with the most positive or negative reviews, I executed:

SELECT
shop_id,
sentiment,
COUNT(*) AS sentiment_count
FROM
reviews
GROUP BY
shop_id, sentiment
ORDER BY
shop_id, sentiment_count DESC;

This query provides insights into store performance based on customer sentiment.


6. Products with the Most Negative Reviews


To pinpoint products with the highest number of negative reviews, I used:


SELECT
product_name,
COUNT(*) AS negative_reviews
FROM
reviews
WHERE
sentiment = 'negative'
GROUP BY
product_id, product_name
ORDER BY
negative_reviews DESC
LIMIT 10;


This identifies products that might require attention due to frequent negative feedback.


7. Product Popularity by Rating Volume


Finally, to assess product popularity based on rating volume, I ran:


SELECT
product_name,
SUM(sold) AS sum_sold,
COUNT(rating) AS num_rating
FROM
reviews
GROUP BY
product_name
ORDER BY
num_rating DESC;


This analysis reveals which products are the most reviewed and popular.


---


Conclusion

This project was a rewarding experience, allowing me to leverage SQL for practical, real-world data analysis. It demonstrated my ability to clean and normalize data, classify sentiment, and derive meaningful insights that could influence business strategy. I’m excited to apply these skills in a professional setting and continue growing as a data analyst. This is a breakdown article from my personal project as part of my portfolio, you can review it on my github.