Used Car Prices Analysis with Python

personalProject | Feb. 8, 2023, 2:39 p.m.

This is a case study from "Data Analisys with Python" course by IBM cognitiveclass.ai. The dataset used in this course is an open dataset, by Jeffrey C. Schlimmer containing 205 number of instances and consist of 26 attributes (detail).

Download this notebook here

In [ ]:

import pandas as pd
import numpy as np

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

df = pd.read_csv(url, header = None)

headers = ["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location",
           "wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke",
           "compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]

df.columns = headers
# df = pd.read_csv(url, names = headers)

df.head(20)

Out[ ]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450
5	2	?	audi	gas	std	two	sedan	fwd	front	99.8	...	136	mpfi	3.19	3.40	8.5	110	5500	19	25	15250
6	1	158	audi	gas	std	four	sedan	fwd	front	105.8	...	136	mpfi	3.19	3.40	8.5	110	5500	19	25	17710
7	1	?	audi	gas	std	four	wagon	fwd	front	105.8	...	136	mpfi	3.19	3.40	8.5	110	5500	19	25	18920
8	1	158	audi	gas	turbo	four	sedan	fwd	front	105.8	...	131	mpfi	3.13	3.40	8.3	140	5500	17	20	23875
9	0	?	audi	gas	turbo	two	hatchback	4wd	front	99.5	...	131	mpfi	3.13	3.40	7.0	160	5500	16	22	?
10	2	192	bmw	gas	std	two	sedan	rwd	front	101.2	...	108	mpfi	3.50	2.80	8.8	101	5800	23	29	16430
11	0	192	bmw	gas	std	four	sedan	rwd	front	101.2	...	108	mpfi	3.50	2.80	8.8	101	5800	23	29	16925
12	0	188	bmw	gas	std	two	sedan	rwd	front	101.2	...	164	mpfi	3.31	3.19	9.0	121	4250	21	28	20970
13	0	188	bmw	gas	std	four	sedan	rwd	front	101.2	...	164	mpfi	3.31	3.19	9.0	121	4250	21	28	21105
14	1	?	bmw	gas	std	four	sedan	rwd	front	103.5	...	164	mpfi	3.31	3.19	9.0	121	4250	20	25	24565
15	0	?	bmw	gas	std	four	sedan	rwd	front	103.5	...	209	mpfi	3.62	3.39	8.0	182	5400	16	22	30760
16	0	?	bmw	gas	std	two	sedan	rwd	front	103.5	...	209	mpfi	3.62	3.39	8.0	182	5400	16	22	41315
17	0	?	bmw	gas	std	four	sedan	rwd	front	110.0	...	209	mpfi	3.62	3.39	8.0	182	5400	15	20	36880
18	2	121	chevrolet	gas	std	two	hatchback	fwd	front	88.4	...	61	2bbl	2.91	3.03	9.5	48	5100	47	53	5151
19	1	98	chevrolet	gas	std	two	hatchback	fwd	front	94.5	...	90	2bbl	3.03	3.11	9.6	70	5400	38	43	6295

20 rows × 26 columns

Sekilas Insight¶

In [ ]:

df.columns

Out[ ]:

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

In [ ]:

df.dtypes

Out[ ]:

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

tipe data bore seharusnya bukan berupa object karena bore berisi diameter of each cylinder in engine sesuai dengan keterangan dataset bahwa data continuous from 2.54 to 3.94., selain bore ada normalized-losses , stroke, horsepower, peak-rpm dan price

In [ ]:

df.describe()

Out[ ]:

	symboling	wheel-base	length	width	height	curb-weight	engine-size	compression-ratio	city-mpg	highway-mpg
count	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000
mean	0.834146	98.756585	174.049268	65.907805	53.724878	2555.565854	126.907317	10.142537	25.219512	30.751220
std	1.245307	6.021776	12.337289	2.145204	2.443522	520.680204	41.642693	3.972040	6.542142	6.886443
min	-2.000000	86.600000	141.100000	60.300000	47.800000	1488.000000	61.000000	7.000000	13.000000	16.000000
25%	0.000000	94.500000	166.300000	64.100000	52.000000	2145.000000	97.000000	8.600000	19.000000	25.000000
50%	1.000000	97.000000	173.200000	65.500000	54.100000	2414.000000	120.000000	9.000000	24.000000	30.000000
75%	2.000000	102.400000	183.100000	66.900000	55.500000	2935.000000	141.000000	9.400000	30.000000	34.000000
max	3.000000	120.900000	208.100000	72.300000	59.800000	4066.000000	326.000000	23.000000	49.000000	54.000000

In [ ]:

df.describe(include="all")

Out[ ]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
count	205.000000	205	205	205	205	205	205	205	205	205.000000	...	205.000000	205	205	205	205.000000	205	205	205.000000	205.000000	205
unique	NaN	52	22	2	2	3	5	3	2	NaN	...	NaN	8	39	37	NaN	60	24	NaN	NaN	187
top	NaN	?	toyota	gas	std	four	sedan	fwd	front	NaN	...	NaN	mpfi	3.62	3.40	NaN	68	5500	NaN	NaN	?
freq	NaN	41	32	185	168	114	96	120	202	NaN	...	NaN	94	23	20	NaN	19	37	NaN	NaN	4
mean	0.834146	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	98.756585	...	126.907317	NaN	NaN	NaN	10.142537	NaN	NaN	25.219512	30.751220	NaN
std	1.245307	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6.021776	...	41.642693	NaN	NaN	NaN	3.972040	NaN	NaN	6.542142	6.886443	NaN
min	-2.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	86.600000	...	61.000000	NaN	NaN	NaN	7.000000	NaN	NaN	13.000000	16.000000	NaN
25%	0.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	94.500000	...	97.000000	NaN	NaN	NaN	8.600000	NaN	NaN	19.000000	25.000000	NaN
50%	1.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	97.000000	...	120.000000	NaN	NaN	NaN	9.000000	NaN	NaN	24.000000	30.000000	NaN
75%	2.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	102.400000	...	141.000000	NaN	NaN	NaN	9.400000	NaN	NaN	30.000000	34.000000	NaN
max	3.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	120.900000	...	326.000000	NaN	NaN	NaN	23.000000	NaN	NaN	49.000000	54.000000	NaN

11 rows × 26 columns

describe digunakan untuk melihat perhitungan umum dasar dataset yang mempunyai tipe data yang bisa dihitung(include all menambah seluruh tipe data:notice unique, freq, top) atau istilahnya statistical summary

In [ ]:

df[["symboling", "normalized-losses"]].describe()

Out[ ]:

	symboling
count	205.000000
mean	0.834146
std	1.245307
min	-2.000000
25%	0.000000
50%	1.000000
75%	2.000000
max	3.000000

In [ ]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               205 non-null    object 
 19  stroke             205 non-null    object 
 20  compression-ratio  205 non-null    float64
 21  horsepower         205 non-null    object 
 22  peak-rpm           205 non-null    object 
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              205 non-null    object 
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB

Preprocessing¶

Missing Values¶

In [ ]:

# replace data yang bernilai "?"(sesuai keterangan dari iformasi dataset) menjadi NaN untuk memudahkan identifikasi
df.replace("?", np.NaN, inplace=True)
df.head(5)

Out[ ]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	1	NaN	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450

5 rows × 26 columns

In [ ]:

# cek missing data. bisa menggunakan isnull() atau notnull()
missing_data = df.isnull()

# missing_data.columns.values
# menghitung missing values per kolom dengan for
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")  
# missing_data.head()

symboling
False    205
Name: symboling, dtype: int64

normalized-losses
False    164
True      41
Name: normalized-losses, dtype: int64

make
False    205
Name: make, dtype: int64

fuel-type
False    205
Name: fuel-type, dtype: int64

aspiration
False    205
Name: aspiration, dtype: int64

num-of-doors
False    203
True       2
Name: num-of-doors, dtype: int64

body-style
False    205
Name: body-style, dtype: int64

drive-wheels
False    205
Name: drive-wheels, dtype: int64

engine-location
False    205
Name: engine-location, dtype: int64

wheel-base
False    205
Name: wheel-base, dtype: int64

length
False    205
Name: length, dtype: int64

width
False    205
Name: width, dtype: int64

height
False    205
Name: height, dtype: int64

curb-weight
False    205
Name: curb-weight, dtype: int64

engine-type
False    205
Name: engine-type, dtype: int64

num-of-cylinders
False    205
Name: num-of-cylinders, dtype: int64

engine-size
False    205
Name: engine-size, dtype: int64

fuel-system
False    205
Name: fuel-system, dtype: int64

bore
False    201
True       4
Name: bore, dtype: int64

stroke
False    201
True       4
Name: stroke, dtype: int64

compression-ratio
False    205
Name: compression-ratio, dtype: int64

horsepower
False    203
True       2
Name: horsepower, dtype: int64

peak-rpm
False    203
True       2
Name: peak-rpm, dtype: int64

city-mpg
False    205
Name: city-mpg, dtype: int64

highway-mpg
False    205
Name: highway-mpg, dtype: int64

price
False    201
True       4
Name: price, dtype: int64

Deal with missing data

How to deal with missing data?

Drop data
a. Drop the whole row(price, karena variabel dependant)
b. Drop the whole column
Replace data
a. Replace it by mean(norm-loss, stroke, bore, horsepower, dan peak-rpm)
b. Replace it by frequency(num-of-doors)
c. Replace it based on other functions

Berdasarkan mean

In [ ]:

mean_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
#print("mean norm loss : ", mean_norm_loss)
df["normalized-losses"].replace(np.nan, mean_norm_loss, inplace=True)

In [ ]:

mean_stroke = df["stroke"].astype("float").mean(axis=0)
#print("mean stroke : ", mean_stroke)
df["stroke"].replace(np.nan, mean_stroke, inplace=True)

In [ ]:

mean_bore = df["bore"].astype("float").mean(axis=0)
# print("mean bore: ", mean_bore)
df["bore"].replace(np.nan, mean_bore, inplace=True)

In [ ]:

mean_horsepower = df["horsepower"].astype("float").mean(axis=0)
df["horsepower"].replace(np.nan, mean_horsepower, inplace=True)

In [ ]:

mean_peakrpm = df["peak-rpm"].astype("float").mean(axis=0)
df["peak-rpm"].replace(np.nan, mean_peakrpm, inplace=True)

In [ ]:

## lebih clean tapi NGGA JALAN LOL
# def change_mean(col):
#     mean_col = col.astype("float").mean(axis=0)
#     col.replace(np.nan, mean_col, inplace=True)
#     return col

# # df[["normalized-losses","stroke","bore","peak-rpm", "horsepower"]].apply(lambda col: col.replace(np.nan, col.astype("float").mean(), inplace=True))
# df[["normalized-losses","stroke","bore","peak-rpm", "horsepower"]] = df[["normalized-losses","stroke","bore","peak-rpm", "horsepower"]].apply(lambda col: col.fillna(col.mean()))

# df.info()

Berdasarkan frekuensi

In [ ]:

## melihat jumlah value dari num-of-doors
# df["num-of-doors"].value_counts()

## melihat value yang paling banyak muncul
df['num-of-doors'].value_counts().idxmax()

Out[ ]:

'four'

In [ ]:

df["num-of-doors"].replace(np.nan, 'four', inplace=True)

Drop(remove) baris(row)

In [ ]:

## remove baris jika value price kosong
df.dropna(subset=["price"], axis=0, inplace=True)

In [ ]:

## reset index setelah menghapus sebagian baris
df.reset_index(drop=True, inplace=True)
df.head()

Out[ ]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	122.0	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	3	122.0	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	1	122.0	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450

5 rows × 26 columns

Data formatting¶

In [ ]:

df[["bore", "stroke", "price", "peak-rpm"]] = df[["bore", "stroke", "price", "peak-rpm"]].astype("float")
df[["normalized-losses", "horsepower"]] = df[["normalized-losses", "horsepower"]].astype("int")
# df[["price"]] = df[["price"]].astype("float")
# df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

df.dtypes

Out[ ]:

symboling              int64
normalized-losses      int64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower             int64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

In [ ]:

# mengubah mile/galon ke L/100km
df["city-mpg"] = 235/df["city-mpg"]
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)
df["city-L/100km"].head()

Out[ ]:

0    11.190476
1    11.190476
2    12.368421
3     9.791667
4    13.055556
Name: city-L/100km, dtype: float64

In [ ]:

df["highway-mpg"] = 235/df["highway-mpg"]
df.rename(columns={"highway-mpg": "highway-L/100km"}, inplace=True)
df["highway-L/100km"].head()

Out[ ]:

0     8.703704
1     8.703704
2     9.038462
3     7.833333
4    10.681818
Name: highway-L/100km, dtype: float64

Normalisasi¶

In [ ]:

## menormalisasi width dan length menjadi range 0 - 1(menggunakan simple feature Xold/Xmax)
df["width"] = df["width"]/df["width"].max()
df["height"] = df["height"]/df["height"].max()
df["length"] = df["length"]/df["length"].max()

df[["width", "height", "length"]].head()

Out[ ]:

	width	height	length
0	0.890278	0.816054	0.811148
1	0.890278	0.816054	0.811148
2	0.909722	0.876254	0.822681
3	0.919444	0.908027	0.848630
4	0.922222	0.908027	0.848630

Data Binning¶

In [ ]:

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

Out[ ]:

Text(0.5, 1.0, 'horsepower bins')

In [ ]:

## membuat edge bin 4 value dengan min awal dan max akhir untuk digunakan dalam 3 bin([48-119], [119-190], [190-262])
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins

Out[ ]:

array([ 48.        , 119.33333333, 190.66666667, 262.        ])

In [ ]:

group_names = ['Low', 'Medium', 'High']

## 
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df["horsepower-binned"].value_counts()

Out[ ]:

Low       153
Medium     43
High        5
Name: horsepower-binned, dtype: int64

In [ ]:

pyplot.bar(group_names, df["horsepower-binned"].value_counts())

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

Out[ ]:

Text(0.5, 1.0, 'horsepower bins')

In [ ]:

# menggunakan historgram "horsepower" dengan bins = 3
plt.pyplot.hist(df["horsepower"], bins = 3)

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

Out[ ]:

Text(0.5, 1.0, 'horsepower bins')

Indicator Variable¶

In [ ]:

## dummy variabel untuk mengkategorikan tipe bahan bakar
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()

Out[ ]:

	diesel	gas
0	0	1
1	0	1
2	0	1
3	0	1
4	0	1

In [ ]:

dummy_variable_1.rename(columns={'gas':'bensin', 'diesel':'solar'}, inplace=True)
dummy_variable_1.head()

Out[ ]:

	solar	bensin
0	0	1
1	0	1
2	0	1
3	0	1
4	0	1

In [ ]:

# merge data frame "df" dg "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop "fuel-type"
df.drop("fuel-type", axis = 1, inplace=True)
df.head()

Out[ ]:

	symboling	normalized-losses	make	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	length	...	stroke	compression-ratio	horsepower	peak-rpm	city-L/100km	highway-L/100km	price	horsepower-binned	bensin
0	3	122	alfa-romero	std	two	convertible	rwd	front	88.6	0.811148	...	2.68	9.0	111	5000.0	11.190476	8.703704	13495.0	Low	1
1	3	122	alfa-romero	std	two	convertible	rwd	front	88.6	0.811148	...	2.68	9.0	111	5000.0	11.190476	8.703704	16500.0	Low	1
2	1	122	alfa-romero	std	two	hatchback	rwd	front	94.5	0.822681	...	3.47	9.0	154	5000.0	12.368421	9.038462	16500.0	Medium	1
3	2	164	audi	std	four	sedan	fwd	front	99.8	0.848630	...	3.40	10.0	102	5500.0	9.791667	7.833333	13950.0	Low	1
4	2	164	audi	std	four	sedan	4wd	front	99.4	0.848630	...	3.40	8.0	115	5500.0	13.055556	10.681818	17450.0	Low	1

5 rows × 28 columns

In [ ]:

dummy_aspirasi = pd.get_dummies(df["aspiration"])
dummy_aspirasi.head()

Out[ ]:

	std	turbo
0	1	0
1	1	0
2	1	0
3	1	0
4	1	0

In [ ]:

df = pd.concat([df, dummy_aspirasi], axis=1)
df.drop("aspiration", axis=1, inplace=True)

df.head()

Out[ ]:

	symboling	normalized-losses	make	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	length	width	...	horsepower	peak-rpm	city-L/100km	highway-L/100km	price	horsepower-binned	bensin	std
0	3	122	alfa-romero	two	convertible	rwd	front	88.6	0.811148	0.890278	...	111	5000.0	11.190476	8.703704	13495.0	Low	1	1
1	3	122	alfa-romero	two	convertible	rwd	front	88.6	0.811148	0.890278	...	111	5000.0	11.190476	8.703704	16500.0	Low	1	1
2	1	122	alfa-romero	two	hatchback	rwd	front	94.5	0.822681	0.909722	...	154	5000.0	12.368421	9.038462	16500.0	Medium	1	1
3	2	164	audi	four	sedan	fwd	front	99.8	0.848630	0.919444	...	102	5500.0	9.791667	7.833333	13950.0	Low	1	1
4	2	164	audi	four	sedan	4wd	front	99.4	0.848630	0.922222	...	115	5500.0	13.055556	10.681818	17450.0	Low	1	1

5 rows × 29 columns

Exploratory Data Analysis¶

In [ ]:

## melihat korelasi
df[['bore','stroke','compression-ratio','horsepower']].corr()

Out[ ]:

	bore	stroke	compression-ratio	horsepower
bore	1.000000	-0.055390	0.001263	0.566903
stroke	-0.055390	1.000000	0.187871	0.098128
compression-ratio	0.001263	0.187871	1.000000	-0.214489
horsepower	0.566903	0.098128	-0.214489	1.000000

Continues Numerical Variable¶

strong linear correlation¶

Melihat atribut mana yang memiliki korelasi yang kuat terhadap price

In [ ]:

import seaborn as sns

# Semakin besar engine semakin mahal
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)

Out[ ]:

(0.0, 53228.55080277679)

In [ ]:

## test nilai korelasi
df[["engine-size", "price"]].corr()

Out[ ]:

	engine-size	price
engine-size	1.000000	0.872335
price	0.872335	1.000000

In [ ]:

## korelasi seberapa boros di jalan lepas terhadap harga
## semakin boros semakin murah
sns.regplot(x="highway-L/100km", y="price", data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f0d464ddcd0>

In [ ]:

df[["highway-L/100km", "price"]].corr()

Out[ ]:

	highway-L/100km	price
highway-L/100km	1.000000	0.801118
price	0.801118	1.000000

weak linear correlation¶

In [ ]:

sns.regplot(x="peak-rpm", y="price", data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f0d447ac820>

In [ ]:

df[['peak-rpm','price']].corr()

Out[ ]:

	peak-rpm	price
peak-rpm	1.000000	-0.101616
price	-0.101616	1.000000

In [ ]:

df[["stroke","price"]].corr()

Out[ ]:

	stroke	price
stroke	1.000000	0.082269
price	0.082269	1.000000

In [ ]:

sns.regplot(x="stroke", y="price", data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f0d44764be0>

categorical variable¶

In [ ]:

sns.boxplot(x="body-style", y="price", data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f0d4456c100>

body-style terlalu overlapping antara satu dengan yang lain

In [ ]:

sns.boxplot(x="engine-location", y="price", data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f0d4449da90>

engine-location terlihat sangat kontras antar 2 variable sehingga potensial digunakan sebagai salah satu predictor

In [ ]:

sns.boxplot(x="drive-wheels", y="price", data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f0d4448c3d0>

drive-wheels terlihat cukup kontras antar satu sama lain sehingga potensial digunakan sebagai salah satu predictor

Descriptive Statistical Analysis¶

In [ ]:

df.describe()

Out[ ]:

	symboling	normalized-losses	wheel-base	length	width	height	curb-weight	engine-size	bore	stroke	compression-ratio	horsepower	peak-rpm	city-L/100km	highway-L/100km	price	solar	bensin	std	turbo
count	201.000000	201.00000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000	201.000000
mean	0.840796	122.00000	98.797015	0.837102	0.915126	0.899108	2555.666667	126.875622	3.330692	3.256874	10.164279	103.402985	5117.665368	9.944145	8.044957	13207.129353	0.099502	0.900498	0.820896	0.179104
std	1.254802	31.99625	6.066366	0.059213	0.029187	0.040933	517.296727	41.546834	0.268072	0.316048	4.004965	37.365650	478.113805	2.534599	1.840739	7947.066342	0.300083	0.300083	0.384397	0.384397
min	-2.000000	65.00000	86.600000	0.678039	0.837500	0.799331	1488.000000	61.000000	2.540000	2.070000	7.000000	48.000000	4150.000000	4.795918	4.351852	5118.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	101.00000	94.500000	0.801538	0.890278	0.869565	2169.000000	98.000000	3.150000	3.110000	8.600000	70.000000	4800.000000	7.833333	6.911765	7775.000000	0.000000	1.000000	1.000000	0.000000
50%	1.000000	122.00000	97.000000	0.832292	0.909722	0.904682	2414.000000	120.000000	3.310000	3.290000	9.000000	95.000000	5125.369458	9.791667	7.833333	10295.000000	0.000000	1.000000	1.000000	0.000000
75%	2.000000	137.00000	102.400000	0.881788	0.925000	0.928094	2926.000000	141.000000	3.580000	3.410000	9.400000	116.000000	5500.000000	12.368421	9.400000	16500.000000	0.000000	1.000000	1.000000	0.000000
max	3.000000	256.00000	120.900000	1.000000	1.000000	1.000000	4066.000000	326.000000	3.940000	4.170000	23.000000	262.000000	6600.000000	18.076923	14.687500	45400.000000	1.000000	1.000000	1.000000	1.000000

In [ ]:

df.describe(include=object)
# df["num-of-doors"].value_counts()

Out[ ]:

	make	num-of-doors	body-style	drive-wheels	engine-location	engine-type	num-of-cylinders	fuel-system
count	201	201	201	201	201	201	201	201
unique	22	2	5	3	2	6	7	8
top	toyota	four	sedan	fwd	front	ohc	four	mpfi
freq	32	115	94	118	198	145	157	92

In [ ]:

df['drive-wheels'].value_counts().to_frame()

Out[ ]:

	drive-wheels
fwd	118
rwd	75
4wd	8

In [ ]:

# melihat value engine location
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)

Out[ ]:

	value_counts
engine-location
front	198
rear	3

engine-location memiliki data yang sangat timpang sehingga bukan merupakan predictor yang baik

Grouping¶

In [ ]:

df['drive-wheels'].unique()

Out[ ]:

array(['rwd', 'fwd', '4wd'], dtype=object)

Membentuk grup untuk di analisa

In [ ]:

df_gptest = df[['drive-wheels','body-style','price']]
df_group_wheel = df_gptest.groupby(['drive-wheels', 'body-style'],as_index=False).mean()
df_group_wheel

Out[ ]:

	drive-wheels	body-style	price
0	4wd	hatchback	7603.000000
1	4wd	sedan	12647.333333
2	4wd	wagon	9095.750000
3	fwd	convertible	11595.000000
4	fwd	hardtop	8249.000000
5	fwd	hatchback	8396.387755
6	fwd	sedan	9811.800000
7	fwd	wagon	9997.333333
8	rwd	convertible	23949.600000
9	rwd	hardtop	24202.714286
10	rwd	hatchback	14337.777778
11	rwd	sedan	21711.833333
12	rwd	wagon	16994.222222

Ubah menjadi pivot agar lebih mudah dianalisa

In [ ]:

pivot_group = df_group_wheel.pivot(index='drive-wheels',columns='body-style')
pivot_group

Out[ ]:

	price
body-style	convertible	hardtop	hatchback	sedan	wagon
drive-wheels
4wd	NaN	NaN	7603.000000	12647.333333	9095.750000
fwd	11595.0	8249.000000	8396.387755	9811.800000	9997.333333
rwd	23949.6	24202.714286	14337.777778	21711.833333	16994.222222

In [ ]:

## isi NaN dengan 0
pivot_group = pivot_group.fillna(0)
pivot_group

Out[ ]:

	price
body-style	convertible	hardtop	hatchback	sedan	wagon
drive-wheels
4wd	0.0	0.000000	7603.000000	12647.333333	9095.750000
fwd	11595.0	8249.000000	8396.387755	9811.800000	9997.333333
rwd	23949.6	24202.714286	14337.777778	21711.833333	16994.222222

In [ ]:

df_body = df[["body-style", "price"]]
df_body = df_body.groupby("body-style", as_index=False).mean()
df_body

Out[ ]:

	body-style	price
0	convertible	21890.500000
1	hardtop	22208.500000
2	hatchback	9957.441176
3	sedan	14459.755319
4	wagon	12371.960000

Visualisasikan dengan heatmap

In [ ]:

plt.pcolor(pivot_group, cmap='RdBu')
plt.colorbar()
plt.show()

Membuat heatmap lebih mudah dibaca¶

In [ ]:

fig, ax = plt.subplots()
im = ax.pcolor(pivot_group, cmap='RdBu')

#label names
row_labels = pivot_group.columns.levels[1]
col_labels = pivot_group.index

#move ticks and labels to the center
ax.set_xticks(np.arange(pivot_group.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(pivot_group.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

Menghitung korelasi dengan Pearson Corelation lebih jauh¶

dengan memperhitungkan p-value :

p-value is < 0.001: we say there is strong evidence that the correlation is significant
the p-value is < 0.05: there is moderate evidence that the correlation is significant
the p-value is < 0.1: there is weak evidence that the correlation is significant.
the p-value is > 0.1: there is no evidence that the correlation is significant.

wheel-base : price¶

In [ ]:

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.584641822265508  with a P-value of P = 8.076488270733218e-20

koefisien wheel-base terhadap price tidak begitu tinggi di angka 0,58 sedangkan p-value sangat kecil di 0,0001

horsepower : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.8096068016571054  with a P-value of P =  6.273536270650504e-48

koefisien horsepower terhadap price cukup tinggi di angka 0,8 dan p-value sangat kecil 0,001

length : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.6906283804483639  with a P-value of P =  8.016477466159328e-30

koefisien length terhadap price termasuk moderate dan p-value sangat kecil

width : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )

The Pearson Correlation Coefficient is 0.7512653440522673  with a P-value of P = 9.200335510481646e-38

koefisien width terhadap price termasuk cukup tinggi dan p-value sangat kecil

curb-weight : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.8344145257702843  with a P-value of P =  2.189577238894065e-53

koefisien curb-weight terhadap price termasuk tinggi dan p-value sangat kecil

engine-size : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.8723351674455185  with a P-value of P = 9.265491622198389e-64

koefisien engine-size terhadap price termasuk tinggi dan p-value sangat kecil

bore : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =  ", p_value )

The Pearson Correlation Coefficient is 0.5431553832626602  with a P-value of P =   8.049189483935489e-17

koefisien bore terhadap price termasuk moderate dan p-value sangat kecil

city-L/100km : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['city-L/100km'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.7898975136626943  with a P-value of P =  3.903106400939802e-44

koefisien city-L/100km terhadap price termasuk tinggi dan p-value sangat kecil(korelasi koefisien seharusnya negative karena semakin sedikit bahan bakar semakin baik)

highway-L/100km : price¶

In [ ]:

pearson_coef, p_value = stats.pearsonr(df['highway-L/100km'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.8011176263981974  with a P-value of P =  3.046784581041456e-46

koefisien highway-L/100km terhadap price termasuk tinggi dan p-value sangat kecil(korelasi koefisien seharusnya negative karena semakin sedikit bahan bakar semakin baik)

ANOVA (Analysist of Variance)¶

drive-wheels¶

In [ ]:

df_group_wheel2 = df_gptest[["drive-wheels", "price"]].groupby(["drive-wheels"])
df_group_wheel2.head(2)

Out[ ]:

	drive-wheels	price
0	rwd	13495.0
1	rwd	16500.0
3	fwd	13950.0
4	4wd	17450.0
5	fwd	15250.0
136	4wd	7603.0

In [ ]:

df_group_wheel2.get_group('rwd')['price']

Out[ ]:

0      13495.0
1      16500.0
2      16500.0
9      16430.0
10     16925.0
        ...   
196    16845.0
197    19045.0
198    21485.0
199    22470.0
200    22625.0
Name: price, Length: 75, dtype: float64

In [ ]:

# ANOVA
f_val, p_val = stats.f_oneway(df_group_wheel2.get_group('fwd')['price'], df_group_wheel2.get_group('rwd')['price'], df_group_wheel2.get_group('4wd')['price'])  

print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23

nilai f-test tergolong tinggi dan p-value yang sangat kecil

fwd & rwd¶

In [ ]:

f_val, p_val = stats.f_oneway(df_group_wheel2.get_group('fwd')['price'], df_group_wheel2.get_group('rwd')['price'])  

print( "ANOVA results: F=", f_val, ", P =", p_val )

ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23

nilai f-test tergolong tinggi dan p-value yang sangat kecil

4wd & rwd¶

In [ ]:

f_val, p_val = stats.f_oneway(df_group_wheel2.get_group('4wd')['price'], df_group_wheel2.get_group('rwd')['price'])  

print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333

nilai f-test tergolong cukup tinggi dan p-value yang kecil

4wd & fwd¶

In [ ]:

f_val, p_val = stats.f_oneway(df_group_wheel2.get_group('4wd')['price'], df_group_wheel2.get_group('fwd')['price'])  

print("ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666

nilai f-test tergolong rendah dan p-value yang cukup tinggi. hal ini menunjukan ekmungkinan kesamaan antara dua kategori tersebut

Kesimpulan¶

terdapat beberapa variabel yang dapat dijadikan acuan dalam menentukan harga mobil bekas diantaranya :

Continuous numerical :

Length
Width
Curb-weight
Engine-size
Horsepower
City-mpg
Highway-mpg
Wheel-base
Bore

Categorical:

Drive-wheels

Central Java Dengue Cases in Relation to Population and Rainfall - Aug. 5, 2024

Cyclistic Data Analysist Case Study - Jan. 6, 2023

Analyzing Former vs Current BoomID Carry Players: A Data-Driven Approach - March 11, 2023

Used Car Prices Analysis with Python

Sekilas Insight¶

Preprocessing¶

Missing Values¶

Deal with missing data

Data formatting¶

Normalisasi¶

Data Binning¶

Indicator Variable¶

Exploratory Data Analysis¶

Continues Numerical Variable¶

strong linear correlation¶

weak linear correlation¶

categorical variable¶

Descriptive Statistical Analysis¶

Grouping¶

Membuat heatmap lebih mudah dibaca¶

Menghitung korelasi dengan Pearson Corelation lebih jauh¶

wheel-base : price¶

horsepower : price¶

length : price¶

width : price¶

curb-weight : price¶

engine-size : price¶

bore : price¶

city-L/100km : price¶

highway-L/100km : price¶

ANOVA (Analysist of Variance)¶

drive-wheels¶

fwd & rwd¶

4wd & rwd¶

4wd & fwd¶

Kesimpulan¶

Related Articles