------------------------💡😀  Updates are on hold due to a busy work schedule. I will resume when time allows 😀💡------------------------
I am currently working on hosting the chatbot application on either AWS EC2 or Hugging Face Spaces.
In the meantime, please open this Google Colab Notebook to interact with the chatbot. Simply go to Runtime and select Run all to generate the application.
Note that it takes about 3 minutes to generate the chatbot app in Google Colab. Thank you for your patience while I work on hosting the app permanently.
Many companies use AI chatbots for their customer support services. This project is my attempt to build an LLM chatbot that uses RAG (Retrieval-Augmented Generation) to retrieve company-specific information and help answer questions from customers who speak different languages.
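As a rough illustration of the retrieval step behind RAG, here is a minimal sketch assuming a sentence-transformers embedding model and a FAISS index; the documents, model name, and question are placeholders rather than the app's actual setup.

```python
# Minimal RAG retrieval sketch (illustrative; the hosted app may use a
# different embedding model and vector store).
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Hypothetical company-specific documents the bot can ground its answers in.
docs = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support is available in English, Spanish, and Mandarin.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vecs = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(doc_vecs, dtype="float32"))

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = model.encode([question], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q, dtype="float32"), k)
    return [docs[i] for i in idx[0]]

# The retrieved passages would then be prepended to the LLM prompt.
print(retrieve("Can I get my money back?"))
```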
Listed below are some of the projects I led or contributed to at work.
Every year, our company runs thousands of A/B tests to demonstrate the impact of our products and marketing campaigns to many top-tier clients (e.g., Kraft, P&G, LEGO). These tests are crucial for revenue generation and contract negotiation. As our company expanded rapidly from a start-up to a public company, demand for A/B tests increased drastically as well. Therefore, C-suite leaders requested an expansion of testing capacity and the possibility of real-time analysis to stay competitive in the industry.
However, fulfilling this request meant overcoming many bottlenecks and challenges. Our A/B test analysis pipeline was slow and not scalable because of historical tech debt accumulated over the years. Specifically, the slowness and scalability issues were caused by:
Heavy use of for-loops in the orchestration scripts
Frequent conversions from Spark DataFrames to pandas
Many tasks running serially on Airflow
Arbitrary cluster choices without memory or CPU optimization
Runtime reduction: from 300+ hours to approximately 3 hours for 1,000 campaigns
Volume boost: from capping analyses at 50 campaigns per night to running an unlimited number of tests
Redesigned the entire pipeline with 30 scripts utilizing parallel tasks and sensors on Airflow
Increased scalability by refactoring Python for-loops into PySpark and UDFs (see the sketch after this list)
Worked with analytics engineers to optimize cluster choices
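To illustrate the kind of refactor mentioned above, here is a minimal sketch of replacing a per-campaign pandas loop with a single distributed PySpark aggregation; the table path, column names, and metric are hypothetical, not our production code.

```python
# Sketch: replace a per-campaign pandas loop with one PySpark aggregation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ab-test-refactor-sketch").getOrCreate()
df = spark.read.parquet("s3://bucket/ab_test_events")  # hypothetical path

# Before (slow): pull each campaign into pandas and process it serially.
# for cid in df.select("campaign_id").distinct().toPandas()["campaign_id"]:
#     pdf = df.filter(F.col("campaign_id") == cid).toPandas()
#     ...compute lift for one campaign at a time...

# After (scalable): one groupBy computes per-campaign metrics in parallel.
lift = (
    df.groupBy("campaign_id", "variant")
      .agg(F.avg("conversion").alias("conv_rate"), F.count("*").alias("n"))
)
lift.show()
```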
In this project, we encountered a situation where a holdout was not possible for an A/B test due to client restrictions.
To measure the effect of our products, our Data Science team built an ensemble of time series models to mimic the effect of a holdout (baseline). For details, please check out this Medium article.
          Medium Article Link
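As a toy illustration of the idea (not the team's actual ensemble, which the article describes), the sketch below fits two simple forecasters on the pre-campaign period and averages their forecasts to serve as a pseudo-holdout baseline; the data and model choices are entirely illustrative.

```python
# Toy sketch of a synthetic baseline: train several forecasters on the
# pre-period and average their forecasts to stand in for a holdout.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
pre = pd.Series(100 + np.cumsum(rng.normal(0, 1, 90)))  # 90 pre-period days
horizon = 30                                            # campaign period

forecasts = [
    ARIMA(pre, order=(1, 1, 1)).fit().forecast(horizon),
    ExponentialSmoothing(pre, trend="add").fit().forecast(horizon),
]
baseline = np.mean(forecasts, axis=0)  # ensemble "pseudo-holdout"

# Lift = observed treated series minus the modeled baseline.
observed = pre.iloc[-1] + np.cumsum(rng.normal(0.5, 1, horizon))
lift = observed - baseline
print(f"estimated cumulative lift: {lift.sum():.1f}")
```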
Clustering is a common technique in data science, but how is clustering done spatially? For example, how do you statistically test and verify that certain groups of your customers who share specific characteristics are concentrated in specific locations?
In this project, spatial clustering analysis techniques were applied to examine whether different ethnic groups are residentially clustered in Denver and, if so, what characterizes those locations.
This was a project submitted to the 2022 D2P (Data to Policy) competition in Denver. The goal was to use data analysis to help our city government make Denver a better place.
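For readers curious how such a claim can be tested statistically, one common approach is a spatial autocorrelation test such as global Moran's I. The sketch below assumes the libpysal and esda packages and a hypothetical GeoDataFrame of Denver census tracts with a group_share column; it is an illustration, not the project's exact code.

```python
# Hedged sketch: global Moran's I tests whether a group's population share
# is spatially clustered across tracts, rather than spatially random.
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

tracts = gpd.read_file("denver_tracts.shp")  # hypothetical tract shapefile
w = Queen.from_dataframe(tracts)             # contiguity-based spatial weights
w.transform = "r"                            # row-standardize the weights

mi = Moran(tracts["group_share"].values, w, permutations=999)
print(f"Moran's I = {mi.I:.3f}, pseudo p-value = {mi.p_sim:.3f}")
# A significantly positive I indicates residential clustering.
```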
Fires impact our ecosystems. To study the impact of past fires and forecast future patterns, a reliable and consistent record of fires is essential. However, very few agencies consistently track fire occurrence over space and time, and the incompleteness of fire data makes assessing fire trends and impacts challenging.
To address this issue, the USGS (United States Geological Survey) developed a machine learning algorithm to map burned areas across the conterminous US using Landsat satellite imagery from 1984 to the present.
Prior to expanding its fire-mapping coverage from the conterminous US to the next biggest territory, Alaska, the USGS sought to modify the current algorithm for better prediction performance.
In this project, five aspects of the existing USGS algorithm were examined and modified: data split, model evaluation metric, classifier, feature selection, and hyperparameter tuning.
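As a simplified illustration of how these five components fit together, the sketch below uses scikit-learn stand-ins and synthetic data, not the actual USGS features, classifier, or split.

```python
# Illustrative sketch: grouped data split, Average Precision as the metric,
# a classifier, feature selection, and hyperparameter tuning in one pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GroupKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9], random_state=0)  # imbalanced classes
groups = np.repeat(np.arange(20), 100)  # e.g., Landsat scenes, to split by group

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),             # feature selection
    ("clf", GradientBoostingClassifier(random_state=0)),  # classifier choice
])
search = RandomizedSearchCV(
    pipe,
    {"clf__n_estimators": [100, 300], "clf__max_depth": [2, 3, 4]},  # tuning
    scoring="average_precision",  # metric suited to class imbalance
    cv=GroupKFold(n_splits=5),    # grouped split avoids spatial leakage
    n_iter=4,
    random_state=0,
)
search.fit(X, y, groups=groups)
print(f"best CV Average Precision: {search.best_score_:.3f}")
```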
Model prediction accuracy: an improvement of approximately 5.5% in Average Precision was achieved, while omission error and commission error were reduced by 9.74% and 2.28%, respectively.
Model training efficiency: the model training, tuning, and evaluation process was shortened from 3 days 18 hours (90 hours) to 19 hours, using fewer computational and parallelization resources.
When it comes to alternative, renewable energy, what are customers' views and attitudes?
Are they interested in trying it? What would drive or nudge them to switch to renewable energy? What price points or services are they willing to pay for?
In this project, we conducted a large-scale survey for our client, then used PCA and clustering to help them identify the key factors that distinguished and segmented their customers.
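Here is a minimal sketch of that segmentation approach, using synthetic data in place of the actual survey responses.

```python
# Minimal sketch: reduce survey responses with PCA, then cluster in the
# reduced space (synthetic data for illustration only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
responses = rng.normal(size=(500, 25))  # 500 respondents x 25 survey items

scaled = StandardScaler().fit_transform(responses)
components = PCA(n_components=5).fit_transform(scaled)  # key survey factors
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)

# Each segment can then be profiled on attitudes, price sensitivity, etc.
print(np.bincount(segments))
```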
Still using tiles to display multiple graphs, leaving stakeholders confused by the number of small tiles on your dashboard? Make it attractive with Plotly dropdown menu buttons and interactive features! This can greatly improve how you visualize and present your performance metrics.
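Here is a small sketch of the dropdown idea: one figure holds several traces, and updatemenus buttons toggle which metric is visible (the metric names and values are made up for illustration).

```python
# Sketch: a Plotly dropdown that switches between metrics in one figure.
import plotly.graph_objects as go

months = ["Jan", "Feb", "Mar", "Apr"]
fig = go.Figure([
    go.Scatter(x=months, y=[10, 12, 15, 14], name="Revenue"),
    go.Scatter(x=months, y=[3, 4, 6, 5], name="Conversions", visible=False),
])
fig.update_layout(updatemenus=[{
    "buttons": [
        {"label": "Revenue", "method": "update",
         "args": [{"visible": [True, False]}, {"title": "Revenue"}]},
        {"label": "Conversions", "method": "update",
         "args": [{"visible": [False, True]}, {"title": "Conversions"}]},
    ],
}])
fig.show()
```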
Github Project Demo
Coming Soon