Diamond Price Prediction Using Machine Learning

A data science project analyzing 53,940 diamonds to build and evaluate regression models predicting price based on physical and quality attributes.

GitHub Repository
Course: Data Science Project, UT Austin
Stack: Python, Scikit-learn, XGBoost, Pandas, NumPy, Matplotlib, Seaborn

Overview

This project examined how carat, cut, clarity, color, and an engineered volume feature influence a diamond's price. Four regression models (linear regression, a pruned decision tree, random forest, and XGBoost) were trained and compared to identify which features drive price and to maximize predictive accuracy.
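
The write-up does not define the engineered volume feature. A minimal sketch, assuming it is the product of three dimension columns (named `x`, `y`, and `z` here, as in the public diamonds dataset); the column names, the helper `add_volume`, and the sample rows are assumptions for illustration, not the project's code:

```python
import pandas as pd


def add_volume(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'volume' column computed as x * y * z."""
    out = df.copy()
    # Assumption: x, y, z hold length, width, and depth in millimetres.
    out["volume"] = out["x"] * out["y"] * out["z"]
    return out


# Illustrative rows only; these values are not taken from the project data.
sample = pd.DataFrame(
    {
        "carat": [0.25, 1.00],
        "x": [4.0, 6.5],
        "y": [4.0, 6.5],
        "z": [2.5, 4.0],
    }
)
sample = add_volume(sample)
print(sample[["carat", "volume"]])
```

Because volume and carat both measure size, they are likely to be strongly correlated, which is worth keeping in mind when reading feature importance rankings.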

Can we accurately predict the price of a diamond using its physical and quality attributes?

Modeling Results

Model                    RMSE (USD)   R²
Linear Regression        1204.86      0.91
Decision Tree (Pruned)   1333.14      0.89
Random Forest            535.14       0.98
XGBoost                  542.84       0.88

Random Forest performed best, with an R² of 0.98 and a root mean squared error (RMSE) of approximately $535.
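
For reference, the two reported metrics can be computed as below. This is a minimal NumPy sketch of the standard definitions; the function names and the four example prices are made up for illustration and are not the project's evaluation code:

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical prices in USD, chosen so the arithmetic is easy to follow.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"R^2:  {r2(actual, predicted):.2f}")
```

RMSE is expressed in the same units as price, but it weights large errors more heavily than a plain average of absolute errors would.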

Feature Insights

  • Carat and the engineered volume feature were the most influential price predictors
  • Clarity and color provided additional predictive power
  • Cut, depth, and table contributed minimally relative to size (carat, volume) and the clarity and color grades
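
A ranking like the one above is commonly read from a fitted random forest's `feature_importances_` attribute in scikit-learn. The sketch below assumes that approach and uses synthetic data in which price depends mostly on carat; the data, the integer encoding of clarity, and the hyperparameters are all assumptions, not the project's setup:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the diamonds data (assumption for illustration).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(
    {
        "carat": rng.uniform(0.2, 3.0, size=n),
        "clarity": rng.integers(0, 8, size=n),  # hypothetical ordinal grade code
        "table": rng.uniform(50.0, 70.0, size=n),  # unrelated to price here
    }
)
y = 4000.0 * X["carat"] ** 2 + 300.0 * X["clarity"] + rng.normal(0.0, 100.0, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances split credit between correlated features (such as carat and volume) and can favor high-cardinality features, so permutation importance is a common cross-check.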

Visualizations

Feature Importance

[Figure: feature importance ranking]

Random Forest Performance

[Figure: Random Forest actual vs. predicted prices]

Key Takeaways

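  • The engineered volume feature ranked alongside carat as one of the most influential predictors
  • Linear regression struggled with the nonlinear relationships in pricing data: its RMSE (1204.86) was more than twice that of Random Forest (535.14)
  • Tree-based ensembles (Random Forest and XGBoost) captured these nonlinear effects and interactions and reached the lowest RMSE, roughly $535 to $543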
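
The gap between linear and tree-based models can be illustrated on synthetic data where price grows with the square of carat. This is a hedged sketch, not a reproduction of the project's results: the data-generating formula, the split, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a nonlinear price curve plus Gaussian noise.
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, size=1000)
price = 4000.0 * carat**2 + rng.normal(0.0, 100.0, size=1000)
X = carat.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its RMSE on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

for name, score in results.items():
    print(f"{name}: RMSE = {score:.2f}")
```

On data like this, a straight line cannot follow the curve, so the linear model's held-out error stays well above the forest's.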