------------------------💡😀  Updates are on hold due to a busy work schedule. I will resume when time allows 😀💡------------------------
I am currently working on hosting the chatbot application on either AWS EC2 or Hugging Face Spaces.
In the meantime, please open this Google Colab Notebook to interact with the chatbot. Simply go to Runtime and select Run all to generate the application.
Note that it takes about 3 minutes to generate the chatbot app in Google Colab. Thank you for your patience while I work on hosting the app permanently.
Many companies use AI chatbots for their customer support services. This project is my attempt to build an LLM chatbot that uses RAG (Retrieval-Augmented Generation) to retrieve company-specific information and help answer questions from customers who speak different languages.
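As a rough illustration of the retrieval step behind RAG, here is a minimal sketch assuming a sentence-transformers embedding model and a FAISS index; the documents, model name, and question are placeholders rather than the app's actual setup.

```python
# Minimal RAG retrieval sketch (illustrative; the hosted app may use a
# different embedding model and vector store).
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Hypothetical company-specific documents the bot can ground its answers in.
docs = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support is available in English, Spanish, and Mandarin.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vecs = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(doc_vecs, dtype="float32"))

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = model.encode([question], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q, dtype="float32"), k)
    return [docs[i] for i in idx[0]]

# The retrieved passages would then be prepended to the LLM prompt.
print(retrieve("Can I get my money back?"))
```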
Listed below are some of the projects I led or contributed to at work.
Every year, our company runs thousands of A/B tests to demonstrate the impact of our products and marketing campaigns to many top-tier clients (e.g., Kraft, P&G, LEGO). These tests are crucial for revenue generation and contract negotiation. As our company expanded rapidly from a start-up to a public company, demand for A/B tests increased drastically as well. Therefore, C-suite leaders requested an expansion of testing capacity and the possibility of real-time analysis to stay competitive in the industry.
However, fulfilling this request meant overcoming many bottlenecks and challenges. Our A/B test analysis pipeline was slow and not scalable because of historical tech debt accumulated over the years. Specifically, the slowness and scalability issues were caused by:
Heavy use of for-loops in the orchestration scripts
Frequent conversions from Spark DataFrames to pandas
Many tasks running serially on Airflow
Arbitrary cluster choices without memory or CPU optimization
Runtime reduction: from 300+ hours to approximately 3 hours for 1,000 campaigns
Volume boost: from capping analyses at 50 campaigns per night to running an unlimited number of tests
Redesigned the entire pipeline with 30 scripts utilizing parallel tasks and sensors on Airflow
Increased scalability by refactoring Python for-loops into PySpark and UDFs (see the sketch after this list)
Worked with analytics engineers to optimize cluster choices
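To illustrate the kind of refactor mentioned above, here is a minimal sketch of replacing a per-campaign pandas loop with a single distributed PySpark aggregation; the table path, column names, and metric are hypothetical, not our production code.

```python
# Sketch: replace a per-campaign pandas loop with one PySpark aggregation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ab-test-refactor-sketch").getOrCreate()
df = spark.read.parquet("s3://bucket/ab_test_events")  # hypothetical path

# Before (slow): pull each campaign into pandas and process it serially.
# for cid in df.select("campaign_id").distinct().toPandas()["campaign_id"]:
#     pdf = df.filter(F.col("campaign_id") == cid).toPandas()
#     ...compute lift for one campaign at a time...

# After (scalable): one groupBy computes per-campaign metrics in parallel.
lift = (
    df.groupBy("campaign_id", "variant")
      .agg(F.avg("conversion").alias("conv_rate"), F.count("*").alias("n"))
)
lift.show()
```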
In this project, we encountered a situation where a holdout was not possible for an A/B test due to client restrictions.
To measure the effect of our products, our Data Science team built an ensemble of time series models to mimic the effect of a holdout (baseline). For details, please check out this Medium article.
          Medium Article Link
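As a toy illustration of the idea (not the team's actual ensemble, which the article describes), the sketch below fits two simple forecasters on the pre-campaign period and averages their forecasts to serve as a pseudo-holdout baseline; the data and model choices are entirely illustrative.

```python
# Toy sketch of a synthetic baseline: train several forecasters on the
# pre-period and average their forecasts to stand in for a holdout.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
pre = pd.Series(100 + np.cumsum(rng.normal(0, 1, 90)))  # 90 pre-period days
horizon = 30                                            # campaign period

forecasts = [
    ARIMA(pre, order=(1, 1, 1)).fit().forecast(horizon),
    ExponentialSmoothing(pre, trend="add").fit().forecast(horizon),
]
baseline = np.mean(forecasts, axis=0)  # ensemble "pseudo-holdout"

# Lift = observed treated series minus the modeled baseline.
observed = pre.iloc[-1] + np.cumsum(rng.normal(0.5, 1, horizon))
lift = observed - baseline
print(f"estimated cumulative lift: {lift.sum():.1f}")
```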
Clustering is a common technique in data science, but how is clustering done spatially? For example, how do you statistically test and verify that certain groups of your customers who share specific characteristics are concentrated in specific locations?
In this project, spatial clustering analysis techniques were applied to examine whether different ethnic groups are residentially clustered in Denver and, if so, what characterizes those locations.
This was a project submitted to the 2022 D2P (Data to Policy) competition in Denver. The goal was to use data analysis to help our city government make Denver a better place.
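For readers curious how such a claim can be tested statistically, one common approach is a spatial autocorrelation test such as global Moran's I. The sketch below assumes the libpysal and esda packages and a hypothetical GeoDataFrame of Denver census tracts with a group_share column; it is an illustration, not the project's exact code.

```python
# Hedged sketch: global Moran's I tests whether a group's population share
# is spatially clustered across tracts, rather than spatially random.
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

tracts = gpd.read_file("denver_tracts.shp")  # hypothetical tract shapefile
w = Queen.from_dataframe(tracts)             # contiguity-based spatial weights
w.transform = "r"                            # row-standardize the weights

mi = Moran(tracts["group_share"].values, w, permutations=999)
print(f"Moran's I = {mi.I:.3f}, pseudo p-value = {mi.p_sim:.3f}")
# A significantly positive I indicates residential clustering.
```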
Fires impact our ecosystems. To study the impact of past fires and forecast future patterns, a reliable and consistent record of fires is essential. However, very few agencies consistently track fire occurrence over space and time, and the incompleteness of fire data makes assessing fire trends and impacts challenging.
To address this issue, the USGS (United States Geological Survey) developed a machine learning algorithm to map burned areas across the conterminous US using Landsat satellite imagery from 1984 to the present.
Prior to expanding its fire-mapping coverage from the conterminous US to the next biggest territory, Alaska, the USGS sought to modify the current algorithm for better prediction performance.
In this project, five aspects of the existing USGS algorithm were examined and modified: data split, model evaluation metric, classifier, feature selection, and hyperparameter tuning.
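As a simplified illustration of how these five components fit together, the sketch below uses scikit-learn stand-ins and synthetic data, not the actual USGS features, classifier, or split.

```python
# Illustrative sketch: grouped data split, Average Precision as the metric,
# a classifier, feature selection, and hyperparameter tuning in one pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GroupKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9], random_state=0)  # imbalanced classes
groups = np.repeat(np.arange(20), 100)  # e.g., Landsat scenes, to split by group

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),             # feature selection
    ("clf", GradientBoostingClassifier(random_state=0)),  # classifier choice
])
search = RandomizedSearchCV(
    pipe,
    {"clf__n_estimators": [100, 300], "clf__max_depth": [2, 3, 4]},  # tuning
    scoring="average_precision",  # metric suited to class imbalance
    cv=GroupKFold(n_splits=5),    # grouped split avoids spatial leakage
    n_iter=4,
    random_state=0,
)
search.fit(X, y, groups=groups)
print(f"best CV Average Precision: {search.best_score_:.3f}")
```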
Model prediction accuracy: an improvement of approximately 5.5% in Average Precision was achieved, while omission error and commission error were reduced by 9.74% and 2.28%, respectively.
Model training efficiency: the model training, tuning, and evaluation process was shortened from 3 days 18 hours (90 hours) to 19 hours, using fewer computational and parallelization resources.
When it comes to alternative, renewable energy, what are customers' views and attitudes?
Are they interested in trying it? What would drive or nudge them to switch to renewable energy? What price points or services are they willing to pay for?
In this project, we conducted a large-scale survey for our client, then used PCA and clustering to help them identify the key factors that distinguished and segmented their customers.
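Here is a minimal sketch of that segmentation approach, using synthetic data in place of the actual survey responses.

```python
# Minimal sketch: reduce survey responses with PCA, then cluster in the
# reduced space (synthetic data for illustration only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
responses = rng.normal(size=(500, 25))  # 500 respondents x 25 survey items

scaled = StandardScaler().fit_transform(responses)
components = PCA(n_components=5).fit_transform(scaled)  # key survey factors
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)

# Each segment can then be profiled on attitudes, price sensitivity, etc.
print(np.bincount(segments))
```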
Still using tiles to display multiple graphs, leaving stakeholders confused by the number of small tiles on your dashboard? Make it attractive with Plotly dropdown menu buttons and interactive features! This can greatly improve how you visualize and present your performance metrics.
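Here is a small sketch of the dropdown idea: one figure holds several traces, and updatemenus buttons toggle which metric is visible (the metric names and values are made up for illustration).

```python
# Sketch: a Plotly dropdown that switches between metrics in one figure.
import plotly.graph_objects as go

months = ["Jan", "Feb", "Mar", "Apr"]
fig = go.Figure([
    go.Scatter(x=months, y=[10, 12, 15, 14], name="Revenue"),
    go.Scatter(x=months, y=[3, 4, 6, 5], name="Conversions", visible=False),
])
fig.update_layout(updatemenus=[{
    "buttons": [
        {"label": "Revenue", "method": "update",
         "args": [{"visible": [True, False]}, {"title": "Revenue"}]},
        {"label": "Conversions", "method": "update",
         "args": [{"visible": [False, True]}, {"title": "Conversions"}]},
    ],
}])
fig.show()
```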
Github Project Demo
Coming Soon