Data Science


Summer Internship at Prodigal

Scoring Model, Imbalanced classes, XGBoost

Debt Collections, Recovery, Credit Scoring

I was responsible for building custom scoring models that predict which account holders are most likely to pay within a specific payment window. Implemented end-to-end models for 7-day and 21-day payment windows. Sourced data from MySQL and APIs, performed extensive data cleaning and feature engineering, and built and tuned the models in Python (XGBoost).
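
To give a flavour of the setup, here is a minimal sketch of how payment-window targets can be labelled and how class imbalance can be quantified. It is illustrative only and is not the code used at Prodigal: the account records, the dates, and the helper names `paid_within` and `negative_to_positive_ratio` are all made up. The negative-to-positive ratio is the value commonly suggested as a starting point for XGBoost's `scale_pos_weight` parameter; whether that parameter was used in the internship project is an assumption.

```python
from datetime import date


def paid_within(assigned_on, paid_on, window_days):
    """Return 1 if the first payment fell within window_days of assignment, else 0."""
    if paid_on is None:  # account never paid
        return 0
    delta = (paid_on - assigned_on).days
    return int(0 <= delta <= window_days)


def negative_to_positive_ratio(labels):
    """Count of negatives divided by count of positives in a 0/1 label list."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Hypothetical accounts: (assignment date, first payment date or None)
accounts = [
    (date(2020, 6, 1), date(2020, 6, 4)),   # paid after 3 days
    (date(2020, 6, 1), date(2020, 6, 15)),  # paid after 14 days
    (date(2020, 6, 1), None),               # never paid
    (date(2020, 6, 1), date(2020, 7, 20)),  # paid after 49 days
    (date(2020, 6, 1), None),               # never paid
]

labels_7 = [paid_within(a, p, 7) for a, p in accounts]
labels_21 = [paid_within(a, p, 21) for a, p in accounts]

# Candidate starting values for scale_pos_weight, one per target
weight_7 = negative_to_positive_ratio(labels_7)
weight_21 = negative_to_positive_ratio(labels_21)
```

The shorter the window, the rarer the positive class, so the 7-day target ends up more imbalanced than the 21-day one.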

Link to Blog (Coming Soon)
Elo Merchant Category Recommendation

BUDT 758T Data Mining and Predictive Analysis

Team Members: Akshat Vaidya, Deep Talati, Harsh Patel, Lei Xia and Xingyi Wang

The aim of this project was to predict a customer loyalty score for each unique card ID. The focus was not only on predicting the loyalty score accurately but also on using as few variables as possible and on keeping the model interpretable and explainable.
Used Boosting and Random Forests from the LightGBM library to implement the model. The model was implemented entirely in Python, with Matplotlib for visualization. Used a Bayesian optimizer to find the optimum values of the hyper-parameters.
To make the model more interpretable, used SHAP values, which are both consistent and accurate. With SHAP values, feature contributions can be compared consistently across models. Used the SHAP library in Python to compute these values. Also plotted graphs of important individual features to understand how their values affect the model predictions.
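
The idea behind SHAP values can be shown with a tiny brute-force example. The sketch below computes exact Shapley attributions for a made-up three-feature model by enumerating every feature coalition. It is not what the SHAP library does internally for tree models (that uses much faster algorithms) and it is not the project code. The names `shapley_values` and `toy_model` and the all-zeros baseline are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions of predict(x) relative to predict(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this only suits tiny illustrative models.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(v):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Local accuracy: attributions sum to the prediction minus the baseline prediction
gap = sum(phi) - (toy_model(x) - toy_model(baseline))
```

The "accurate" property mentioned above is the local accuracy check at the end: the attributions add up to the difference between the prediction and the baseline prediction, and the interaction term is split evenly between the two features that form it.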

Link to Github Code File
Analysis of the NCI HINTS (Cycle 5) Dataset

Data Analysis, Data Visualization, Python, R, Tableau

As part of the UMD Data Challenge, I took on the task of analyzing the National Cancer Institute HINTS dataset. The aim was to analyze how much people trust different sources of cancer information and to break the results down by region and ethnicity.
I also focused on the problems people face when searching for cancer information and on their use of social networks for sharing cancer-related information. It was my first time working with a huge amount of survey data (read categorical features!). A wonderful learning opportunity!

Link to Presentation
Stock Price Prediction using Prophet

Time-series analysis, Python, Prophet

Used Prophet to predict Apple's stock price based on its closing price. Visualized the results using Plotly in Python. This was the first time I did time-series analysis using Prophet, and the project turned out to be a great learning experience.

Link to Kaggle notebook
Income Prediction based on Demographic variables

Classification, Python

A binary classification problem that involved predicting whether or not a person earns $50,000 or more in annual income. Used Random Forests for the classification task and tuned the model to improve accuracy.

Link to Kaggle notebook
Mileage (MPG) Prediction

Prediction, Python

The goal here was to predict the MPG of a car given information such as the model, make, horsepower, etc. I used a regression model to predict the mileage and also explained potential problems with linear regression, such as heteroscedasticity and non-normal residuals.
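
As a rough illustration of what heteroscedasticity looks like in numbers, the sketch below fits a straight line by ordinary least squares and compares the residual variance in the upper and lower halves of the predictor. It is a crude check in the spirit of a Goldfeld-Quandt comparison, not a formal test and not the notebook code. The horsepower and MPG values are synthetic, and the names `fit_line` and `residual_variance_ratio` are made up for this example.

```python
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_variance_ratio(xs, ys):
    """Residual variance in the upper half of x divided by that in the lower half.

    A ratio far from 1 hints that the error variance changes with x.
    """
    intercept, slope = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = len(residuals) // 2
    return pvariance(residuals[-half:]) / pvariance(residuals[:half])


# Synthetic data: MPG falls with horsepower; the noise alternates in sign
horsepower = [50 + 10 * i for i in range(20)]
signs = [1 if i % 2 == 0 else -1 for i in range(20)]

# Noise that grows with horsepower (heteroscedastic)
mpg_growing = [45 - 0.1 * hp + s * 0.01 * hp for hp, s in zip(horsepower, signs)]
# Noise of constant size (homoscedastic)
mpg_constant = [45 - 0.1 * hp + s * 1.0 for hp, s in zip(horsepower, signs)]

ratio_growing = residual_variance_ratio(horsepower, mpg_growing)
ratio_constant = residual_variance_ratio(horsepower, mpg_constant)
```

With growing noise the ratio comes out well above 1, while with constant noise it stays close to 1. In practice a residuals-versus-fitted plot is the usual first look.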

Link to Kaggle notebook