Business Data Analytics, Recommendation and Rating Prediction
Problem Statement
- Many businesses rely on customers' online reviews, tips, and ratings. Explicit feedback and tips are especially important for small and medium-sized businesses and in e-commerce, where customer engagement is shaped by these ratings. In addition, business information such as location, opening hours, and contact details is available online. The Yelp dataset includes all of this business information, organized by geographical area.
- Small and medium-sized business owners who want to open a new business in a city need information to support their decisions: Where is the best location? What are the key features needed to beat competitors? What are the current ratings and reviews of existing businesses? A unified business data analytics platform addresses this problem by providing detailed business insights and recommendations for different industries, along with rating prediction.
Project Overview
The goal of this project is to build a unified business data analytics platform (web/mobile). The major functionalities are:
- An integrated data analytics platform that provides a data pipeline, data dashboard, business insights, geographical analysis, rating prediction, and business recommendation
- A deep learning sentiment analysis model to predict ratings (improving accuracy to 80%+)
- Hybrid Recommendation Engine (Content-based Filtering/Knowledge-based Filtering)
- Deployment on GCP, AWS, or a mobile platform (the free version of Herokuapp supports only 5000 lines)
- Current Work:
- Data ETL: Jupyter Notebook
- Rating prediction and item-based collaborative filtering recommendation algorithm: Jupyter Notebook
- Flask web server and API functionalities: app.py
- Data analytics web application: JavaScript/D3
Data Sets
- The Yelp dataset includes 1,223,094 tips by 1,637,138 users. There are over 1.2 million business attributes such as hours, parking, availability, and ambience, and check-ins aggregated over time for each of the 192,609 businesses. We use the latest version of the dataset from the Yelp Dataset website. You can download it from this Link
Project Architecture and Functionalities
- This project is a full-stack data analytics application. The whole process includes:
- Get the raw data (from Yelp.com)
- Data Preprocessing, Extract-Transform-Load (JSON to CSV, Database: PostgreSQL 10)
- Data Visualization and EDA - Discover and visualize the data to gain insights (Matplotlib, Seaborn, JavaScript, D3, plot.ly and Leaflet mapping)
- Feature Engineering - Numeric features, categorical features, time series features, text features, and handling of missing data
- Select the machine learning models, then train and fine-tune them (Logistic Regression, XGBoost, LightGBM and ensemble models)
- Select the recommendation algorithm (item-based collaborative filtering)
- Deploy the system and provide the API capabilities (Python Flask Web Server)
- Project Functionalities
- Dashboard:
- Provides an overview dashboard of Yelp GTA businesses, including the total number of businesses, ratings, and reviews, with key filters for more detailed information. The application also provides full data tables that display all business information (33,412 businesses in total).
- Business Search:
- Using the Yelp Fusion API, the application can query businesses in any category and at any location. The results are plotted on a map with detailed contact information (phone, address, rating, etc.).
- Categories Chart:
- Categories in the Yelp dataset are complicated: a business's category is often described by a long text string, because categories are added by business owners. We therefore created a categorization algorithm that simplifies the category descriptions and makes them easy to query.
- Recommendation Chart:
- Shows the full list of businesses and recommends businesses based on the user's input. The results are plotted on a map with detailed contact information.
- Rating Prediction:
- The application trains four machine learning models to predict the rating: logistic regression, XGBoost, LightGBM, and an ensemble model. With the threshold set to 70%, the best model accuracy is around 68%. Feature importances are also provided.
- Rating Maps:
- The application shows business ratings versus review counts as a heatmap. The map is split into layers based on rating and review count.
Workflow Engine and API format
Workflow
- Raw Data Transform: JSON to CSV
- Data Storage: PostgreSQL
- Workflow Engine (WFE): Flask Web Server/SQLAlchemy/Python
- Front End: Web Application/GUI, HTML/CSS, JavaScript, D3, Leaflet.js, Plot.ly
- Back End: Feature Engineering, Machine Learning Models, item-based collaborative filtering algorithm for recommendation
- Production Deployment on Heroku.com or GCP
API format
- Flask API JSON Data Route:
- @app.route("/yelp_metadata")
- @app.route("/yelp_metadata/")
- @app.route("/city/")
- @app.route("/stars/")
- @app.route("/yelp_metadata/pages/")
- @app.route("/city//")
- @app.route("/stars//")
- @app.route("/apiquery//")
- @app.route("/category_feature/")
- @app.route("/category_feature_count/")
- @app.route("/category_feature/keyword/")
- @app.route("/category_feature/keyword//")
- @app.route("/recsys//")
- @app.route("/yelp_rec_metadata/pages/")
- @app.route("/yelp_rec_metadata")
Data Extract Transform and Load
- The raw dataset is in JSON format, so we first convert JSON to CSV, then use Python to preprocess the data and load it into PostgreSQL.
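The JSON to CSV step could look roughly like the sketch below. It assumes the input file holds one JSON object per line, as the Yelp dataset files do, and that only flat fields are exported. The function name `json_lines_to_csv` and the field list are illustrative, not taken from the project's notebook, and loading into PostgreSQL is not covered here.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path, fields):
    """Convert a JSON Lines file (one object per line) to CSV.

    Only the keys listed in `fields` are kept; missing keys become empty cells.
    Returns the number of rows written.
    """
    rows_written = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            # Keep only the requested columns (hypothetical selection).
            writer.writerow({key: record.get(key, "") for key in fields})
            rows_written += 1
    return rows_written
```

A call such as `json_lines_to_csv("business.json", "business.csv", ["business_id", "name", "city", "stars"])` would then produce a CSV file ready for preprocessing; the file names here are placeholders.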
Data Dashboard
Feature Engineering
- Categories in the Yelp dataset are complicated: a business's category is often described by a long text string, because categories are added by business owners. We therefore created a categorization algorithm that simplifies the category descriptions and makes them easy to query.
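The project's own categorization algorithm is not shown here, so the following is only a sketch of one way such a simplification might work: the comma-separated category string is split into words and mapped to the first group whose keywords match. The group names, the keyword sets in `CATEGORY_GROUPS`, and the function `simplify_category` are all assumptions made for illustration.

```python
import re

# Hypothetical keyword table; the project's real mapping is not documented.
CATEGORY_GROUPS = {
    "Restaurants": {"restaurants", "pizza", "sushi", "burgers"},
    "Cafes": {"coffee", "cafes", "tea"},
    "Nightlife": {"bars", "pubs", "nightlife"},
    "Shopping": {"shopping", "fashion"},
}


def simplify_category(categories, default="Other"):
    """Map a long category string to one simplified group name.

    Matching is done on whole words, so "Steakhouses" does not match "tea".
    Groups are checked in the order they are declared; the first hit wins.
    """
    if not categories:
        return default
    words = set(re.findall(r"[a-z]+", categories.lower()))
    for group, keywords in CATEGORY_GROUPS.items():
        if words & keywords:
            return group
    return default
```

For example, `simplify_category("Italian, Pizza, Restaurants")` falls into the `"Restaurants"` group, while a string with no known keyword falls back to `"Other"`.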
Recommendation Algorithm
- An item-based collaborative filtering algorithm is used as the business recommendation engine.
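As a rough illustration of item-based collaborative filtering, the sketch below scores businesses by the cosine similarity of their user rating vectors and returns the closest matches. The data layout (`business_id -> {user_id: stars}`), the choice of cosine similarity, and the names `cosine` and `recommend_similar` are assumptions; the project's notebook may use a different similarity measure or library.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {user: rating}."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_similar(ratings, business_id, top_n=3):
    """Return up to top_n (business_id, similarity) pairs, most similar first.

    `ratings` maps business_id -> {user_id: stars}.
    """
    target = ratings[business_id]
    scores = [
        (other, cosine(target, vector))
        for other, vector in ratings.items()
        if other != business_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]
```

Businesses rated similarly by the same users score close to 1, while businesses with no users in common score 0.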
Rating Prediction Model
- The application trains four machine learning models to predict the rating: logistic regression, XGBoost, LightGBM, and an ensemble model. With the threshold set to 70%, the best model accuracy is around 68%. Feature importances are also provided.
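How the ensemble combines the individual models is not described, so the sketch below shows one common option, soft voting, purely as an illustration: each model's predicted probability is averaged and a decision threshold is applied. Reading the 70% threshold as a probability cutoff is an assumption, as are the function names `soft_vote` and `accuracy`.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average per-model probabilities and apply a decision threshold.

    `probabilities` holds one list of positive-class probabilities per model;
    all inner lists must have the same length.
    """
    n_models = len(probabilities)
    averaged = [sum(column) / n_models for column in zip(*probabilities)]
    return [1 if p >= threshold else 0 for p in averaged]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

With three models and a cutoff of 0.7, `soft_vote([[0.9, 0.2], [0.8, 0.4], [0.7, 0.3]], threshold=0.7)` labels the first sample positive and the second negative; raising the cutoff trades recall for precision.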
Rating Maps Analytics
Models and Model Performances
Feature Importances
API Query
- Using the Yelp Fusion API, the application visualizes the query results.