Data Science
Summer Internship at Prodigal
Scoring Model, Imbalanced classes, XGBoost
Debt Collections, Recovery, Credit Scoring
I was responsible for creating custom scoring models to predict which account holders are most likely to pay within a specific payment window. Implemented end-to-end models to identify the accounts most likely to pay within 7-day and 21-day payment windows. Sourced data from MySQL and APIs. Performed extensive data cleaning and feature engineering. Built and tuned the models in Python using XGBoost.
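One common way to handle imbalanced classes with XGBoost is to weight the positive class by the ratio of negative to positive examples and pass that value as the `scale_pos_weight` parameter. The snippet below is only a minimal sketch of that calculation, not the code used during the internship; the helper name and the label counts are made up for illustration.

```python
from collections import Counter


def scale_pos_weight(labels):
    """Return negatives / positives for a list of 0/1 labels.

    This ratio is the value commonly suggested for XGBoost's
    `scale_pos_weight` parameter on imbalanced binary problems.
    """
    counts = Counter(labels)
    positives, negatives = counts[1], counts[0]
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Hypothetical labels: 1 = paid within the window, 0 = did not pay.
labels = [0] * 95 + [1] * 5
weight = scale_pos_weight(labels)
print(weight)  # 19.0
```

The resulting number would then be supplied when constructing the classifier, so that errors on the rare "paid" class count more during training.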
Link to Blog (Coming Soon)
Elo Merchant Category Recommendation
BUDT 758T Data Mining and Predictive Analysis
Team Members: Akshat Vaidya, Deep Talati, Harsh Patel, Lei Xia and Xingyi Wang
The aim of this project was to predict a customer loyalty score for each unique card ID. The focus was not only on predicting the loyalty score accurately but also on using as few variables as possible and on keeping the model interpretable and explainable.
Used Boosting and Random Forests from the LightGBM library to implement the model. The model was implemented entirely in Python, with Matplotlib for visualization. Used Bayesian optimization to find the optimal values of the hyperparameters.
To make the model more interpretable, I used SHAP values, which are both consistent and accurate. SHAP values also allow feature contributions to be compared across models. Used the SHAP library in Python to compute these values. Also plotted graphs of the most important individual features to understand how their values affect the model's predictions.
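To give a feel for what a SHAP value is, here is a small sketch that computes exact Shapley values for a toy function by averaging each feature's marginal contribution over every feature ordering. It is an illustration of the underlying idea only: the project used the SHAP library on LightGBM models, whereas the toy model, the single all-zero baseline, and the function names below are assumptions made for this example.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input `x`.

    Features that have not been "added" yet keep their baseline value.
    Each feature's value is its marginal contribution averaged over
    all orderings in which features can be added.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


# Hypothetical toy model with an interaction between features 0 and 1.
def toy_model(features):
    a, b, c = features
    return 2.0 * a + a * b + 3.0 * c


x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [3.0, 1.0, 3.0]
```

The three values add up to 7.0, which is the difference between the prediction for `x` and the prediction for the baseline. This additivity is what makes it possible to read each value as that feature's share of the prediction. Enumerating all orderings only works for a handful of features; the SHAP library uses much faster algorithms for tree models.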
Analysis of the NCI HINTS (Cycle 5) Dataset
Data Analysis, Data Visualization, Python, R, Tableau
As part of the UMD Data Challenge, I took on the task of analyzing the National Cancer Institute HINTS dataset. The aim was to analyze how much people trust different sources of cancer information, broken down by region and ethnicity.
I also focused on the problems people face when searching for cancer information and on their use of social networks for sharing cancer-related information.
It was my first time working with a huge amount of survey data (read categorical features!). A wonderful learning opportunity!
Stock Price Prediction using Prophet
Time-series analysis, Python, Prophet
Used Prophet to predict Apple's stock price based on the closing price. Visualized the results using Plotly in Python. This was my first time doing time-series analysis with Prophet, and the project turned out to be a great learning experience.
Link to Kaggle notebook
Income Prediction based on Demographic variables
Classification, Python
A binary classification problem: predicting whether or not a person earns $50,000 or more in annual income. Used Random Forests for the classification task and tuned the model to improve its accuracy.
Link to Kaggle notebook
Mileage (MPG) Prediction
Prediction, Python
The goal here was to predict the MPG of a car given information such as the model, make, and horsepower. I used a regression model to predict the mileage and also explained potential problems with linear regression, such as heteroscedasticity and non-normal residuals.
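As a rough illustration of the heteroscedasticity check, the sketch below fits a one-variable least-squares line and compares the average squared residual in the upper half of the predictor range with that in the lower half. A ratio far from 1 hints that the error variance is not constant. The data and function names are invented for this example and it is not the analysis from the notebook; a formal test (for example Breusch-Pagan) would be the more rigorous choice.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residual_spread_ratio(xs, ys):
    """Mean squared residual for large x divided by that for small x."""
    slope, intercept = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))
    residuals = [y - (slope * x + intercept) for x, y in pairs]
    half = len(residuals) // 2
    low = mean([r ** 2 for r in residuals[:half]])
    high = mean([r ** 2 for r in residuals[-half:]])
    return high / low


# Hypothetical data: the noise grows with x, so the spread is not constant.
xs = list(range(1, 21))
noise = [(1 if i % 2 == 0 else -1) * 0.1 * x for i, x in enumerate(xs)]
ys = [40.0 - x + e for x, e in zip(xs, noise)]

ratio = residual_spread_ratio(xs, ys)
print(ratio > 1)  # True: residuals are more spread out for large x
```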
Link to Kaggle notebook