Diamond Price Prediction Using Machine Learning

A data science project analyzing 53,940 diamonds to build and evaluate regression models predicting price based on physical and quality attributes.

GitHub Repository

Course: Data Science Project, UT Austin
Stack: Python, Scikit-learn, XGBoost, Pandas, NumPy, Matplotlib, Seaborn

Overview

This project examines how carat, cut, clarity, color, and an engineered dimensional volume feature influence diamond price. Four regression models (linear regression, a pruned decision tree, random forest, and XGBoost) were trained and compared to identify the most important price drivers and to maximize predictive accuracy.

Can we accurately predict the price of a diamond using its physical and quality attributes?
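A minimal sketch of how such a comparison can be set up with scikit-learn. It is not the project's actual code: the data is synthetic (two made-up features standing in for carat and an ordinal clarity grade), `max_depth=6` is only an assumed stand-in for whatever pruning the project used, and all hyperparameters are illustrative. XGBoost is added only if the package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the diamonds dataset): price grows
# nonlinearly with a carat-like feature and interacts with clarity.
rng = np.random.default_rng(0)
n = 2000
carat = rng.uniform(0.2, 3.0, n)
clarity = rng.integers(0, 8, n)  # assumed ordinal encoding, 0 (worst) to 7 (best)
X = np.column_stack([carat, clarity])
y = 3000 * carat**2 + 150 * clarity * carat + rng.normal(0, 200, n)

# Hold out 20% of the rows so every model is scored on the same unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    # max_depth is an assumption; the project's pruning method is not stated.
    "Decision Tree (Pruned)": DecisionTreeRegressor(max_depth=6, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
try:
    from xgboost import XGBRegressor  # optional dependency
    models["XGBoost"] = XGBRegressor(n_estimators=200, random_state=0)
except ImportError:
    pass  # the sketch still runs with the scikit-learn models only

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = (rmse, float(r2_score(y_test, pred)))
    print(f"{name:<24} RMSE={rmse:9.2f}  R2={results[name][1]:.3f}")
```

Scoring every model on the same held-out split is what makes the RMSE and R² columns directly comparable across rows.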

Modeling Results

Model	RMSE (USD)	R²
Linear Regression	1204.86	0.91
Decision Tree (Pruned)	1333.14	0.89
Random Forest	535.14	0.98
XGBoost	542.84	0.98

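Random Forest performed best, with an R² of 0.98 and a root-mean-square error (RMSE) of approximately $535. XGBoost was close behind on RMSE (about $543), while linear regression and the pruned decision tree had errors more than twice as large.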
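For reference, the two metrics can be computed from scratch as below, and they are linked by R² = 1 - RMSE² / Var(y) when both are measured on the same test set. This sketch assumes all models were scored on one shared test set and uses the rounded figures from the table, so the derived numbers are approximate. The helper `implied_price_std` is hypothetical and exists only for this illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def implied_price_std(rmse_value, r2_value):
    """Test-set price standard deviation implied by an (RMSE, R^2) pair.

    Rearranges R^2 = 1 - RMSE^2 / Var(y); hypothetical helper for illustration.
    """
    return rmse_value / math.sqrt(1 - r2_value)

# Linear Regression row: RMSE 1204.86, R^2 0.91 (rounded in the table).
price_std = implied_price_std(1204.86, 0.91)

# R^2 implied by Random Forest's RMSE, assuming the same test set.
rf_r2 = 1 - (535.14 / price_std) ** 2
print(round(price_std), round(rf_r2, 2))
```

Under these assumptions the implied spread of test prices is roughly $4,000, and an RMSE near $535 against that spread corresponds to an R² of about 0.98, matching the Random Forest row.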

Feature Insights

• Carat and the engineered volume feature were the most influential price predictors
• Clarity and color provided additional predictive power
• Cut, depth, and table contributed minimally relative to carat, volume, clarity, and color
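The sketch below illustrates how an engineered volume feature and a feature-importance ranking of this kind might be produced. Several things here are assumptions rather than details from the project: the dimension columns are named `x`, `y`, and `z`, volume is taken as their product, clarity is ordinal-encoded with the grade order shown, and the data (including the price formula) is synthetic. Carat is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed clarity grades, ordered worst to best, for ordinal encoding.
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Synthetic stand-in data (NOT the diamonds dataset).
rng = np.random.default_rng(42)
n = 1500
df = pd.DataFrame({
    "x": rng.uniform(4.0, 9.0, n),       # assumed length in mm
    "y": rng.uniform(4.0, 9.0, n),       # assumed width in mm
    "z": rng.uniform(2.5, 5.5, n),       # assumed height in mm
    "depth": rng.uniform(58.0, 66.0, n),
    "table": rng.uniform(54.0, 62.0, n),
    "clarity": rng.choice(clarity_order, n),
})

# Engineered feature: bounding-box volume from the three dimensions.
df["volume"] = df["x"] * df["y"] * df["z"]

# Ordinal encoding: map each grade to its position in the ordered list.
df["clarity_code"] = df["clarity"].map({g: i for i, g in enumerate(clarity_order)})

# Made-up price formula: driven by volume and clarity, not by depth or table.
df["price"] = 25 * df["volume"] + 200 * df["clarity_code"] + rng.normal(0, 150, n)

features = ["volume", "clarity_code", "depth", "table"]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df[features], df["price"])

# Impurity-based importances sum to 1; higher means more influence on splits.
importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be split across strongly correlated features (such as carat and volume), so a ranking like this is best read as a rough guide rather than an exact attribution.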

Visualizations

Feature importance ranking

Random Forest actual vs predicted
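An actual-vs-predicted plot like the one named above could be drawn along these lines. This is a sketch with simulated predictions, not the project's plotting code; the function name `plot_actual_vs_predicted` and the styling choices are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_actual_vs_predicted(y_true, y_pred, ax=None):
    """Scatter predictions against actual values with a y = x reference line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, s=8, alpha=0.4)
    lo = min(np.min(y_true), np.min(y_pred))
    hi = max(np.max(y_true), np.max(y_pred))
    # Points on this diagonal are predicted exactly right.
    ax.plot([lo, hi], [lo, hi], color="red", linewidth=1, label="perfect prediction")
    ax.set_xlabel("Actual price (USD)")
    ax.set_ylabel("Predicted price (USD)")
    ax.set_title("Actual vs predicted")
    ax.legend()
    return ax

# Simulated data: predictions scattered around the truth (illustrative noise level).
rng = np.random.default_rng(1)
actual = rng.uniform(300, 18000, 500)
predicted = actual + rng.normal(0, 535, 500)
ax = plot_actual_vs_predicted(actual, predicted)
```

To write the figure to disk, call `ax.figure.savefig("actual_vs_predicted.png")`; in an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead.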

Key Takeaways

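• The engineered volume feature ranked alongside carat as one of the most influential predictors, which suggests the feature engineering was worthwhile
• Linear regression (RMSE of about $1,205) struggled with the nonlinear relationships common in pricing data
• Tree-based ensembles (Random Forest and XGBoost) captured complex interactions and cut RMSE by more than half relative to linear regression, whereas a single pruned decision tree did not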