Analysis and Recommendation on the Yelp Dataset
The goal is to provide useful insights for businesses from the Yelp dataset through big data analytics, identifying strengths and weaknesses so that existing and prospective business owners can make decisions about new businesses or business expansion. The project also provides recommendations to both business owners and users through extensive analysis of the data.
The project involves analysis of the dataset, visualization based on that analysis, and recommendations. The major modules of the project are:
- Validation of reviews on businesses based on user information.
- Classification of positive and negative reviews using Machine Learning techniques.
- Recommending location-based “buzzwords” to future business owners by analyzing positive and negative reviews for businesses in a state.
- User-specific recommendations using a user’s history of availed services. Recommendations are based on the categories of the services, the location of the business, user reviews and user ratings.
Analysis was done on the dataset to understand the correlation between different metrics, such as the location of a business and its success. Business trends were analyzed based on location, ratings, category and attributes of the business. Trends among closed businesses were observed using user reviews and ratings.
Visualizations for the project were created using Python libraries and are stored in the visualization folder.
Project presentation can be found at
Prezi WIN ARYD.exe
executed on a Windows OS fr
The dataset for the project should be downloaded from the Yelp Dataset Challenge and stored in the yelp-dataset folder. The scripts should be executed in the following order:
${SPARK_HOME}/bin/spark-submit business_etl.py
${SPARK_HOME}/bin/spark-submit user_etl.py
${SPARK_HOME}/bin/spark-submit review_classification.py
${SPARK_HOME}/bin/spark-submit review_etl.py
The following files can be executed in any order:
${SPARK_HOME}/bin/spark-submit user_recomm.py "'CxDOIDnH8gp9KXzpBHJYXw'"
# the user ID argument can be changed to obtain recommendations for a different user
${SPARK_HOME}/bin/spark-submit user_analysis.py
${SPARK_HOME}/bin/spark-submit top_reviews.py
${SPARK_HOME}/bin/spark-submit business_analysis.py
${SPARK_HOME}/bin/spark-submit restaurant_analysis.py
${SPARK_HOME}/bin/spark-submit topic_mod_pos.py
${SPARK_HOME}/bin/spark-submit topic_mod_neg.py
${SPARK_HOME}/bin/spark-submit topics.py
${SPARK_HOME}/bin/spark-submit word_cloud.py
${SPARK_HOME}/bin/spark-submit ngram_word_cloud.py
Optional step for converting data to JSON format for visualization:
${SPARK_HOME}/bin/spark-submit converttojson.py
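The ordering requirement for the first four scripts can also be enforced from a small driver. The following is a minimal sketch, not part of the project: the function names `build_commands` and `run_in_order` are hypothetical, and it assumes the scripts sit in the current working directory and that `spark-submit` lives under `bin/` in the Spark installation, as in the commands above.

```python
import os
import subprocess

# Script names and their order are taken from the ordered commands above.
ETL_STEPS = [
    "business_etl.py",
    "user_etl.py",
    "review_classification.py",
    "review_etl.py",
]


def build_commands(spark_home, scripts=ETL_STEPS):
    """Return one spark-submit argument list per script, in execution order."""
    spark_submit = os.path.join(spark_home, "bin", "spark-submit")
    return [[spark_submit, script] for script in scripts]


def run_in_order(commands, runner=subprocess.run):
    """Run each command in turn; check=True stops at the first failing step."""
    for command in commands:
        runner(command, check=True)
```

With `SPARK_HOME` set, calling `run_in_order(build_commands(os.environ["SPARK_HOME"]))` would submit the four steps one after another and raise `subprocess.CalledProcessError` if any step fails, so later steps never run on incomplete output.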
-- business location - outliers removed using Euclidean distance from the average location of businesses in the state (Data Cleaning)
-- users' location -- user validation score
-- classification of reviews (Machine Learning)
-- joined classes to reviews and dropped less useful columns
-- location based recommendations -- category based recommendations -- overall recommendations
-- most availed category of business by a user -- average stars given by a user for each category -- number of positive and negative reviews given by a user
-- chose the top 10 positive and top 10 negative reviews based on validation score for the business with the maximum number of reviews
-- average review count and stars by city and category -- average review count and stars by state and category -- business attribute based analysis -- average stars for open and closed businesses -- top 15 business categories -- top 15 business categories - city-wise -- cities with most businesses -- businesses with more 5-star ratings
-- top 20 restaurants on yelp (viz) -- restaurants with most funny, cool, useful reviews (viz)
-- topic modeling using positive reviews for businesses in Pennsylvania
-- topic modeling using negative reviews for businesses in Ontario
-- extracted terms and topics from the model saved from topic modeling
-- most frequent words from tips and reviews for Earl (viz) -- most frequent words from tips and reviews for Ontario (viz) -- most frequent words from tips and reviews for top 20 restaurants (viz) -- most frequent words from tips and reviews for bottom 20 restaurants (viz)
-- word cloud of n-grams from tips and reviews -- word cloud of n-grams from tips and reviews for Arizona
-- converting the ETLed Parquet files to JSON format for visualization purposes
-- outputs after classification of reviews and ETL steps on datasets will be stored here
-- outputs of all the visualizations will be stored here
-- all results of topic modeling will be saved here
-- all results of analysis will be stored here
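The business-location cleaning step listed above (outliers removed using Euclidean distance from the average location of businesses in a state) can be illustrated with a minimal sketch. It uses plain Python rather than Spark so it runs without a cluster, and it is not the project's actual code: the function name `remove_location_outliers` and the `max_distance` threshold are assumptions, and the distance is measured directly in degrees of latitude and longitude. The field names `state`, `latitude` and `longitude` follow the Yelp business records.

```python
import math
from collections import defaultdict


def remove_location_outliers(businesses, max_distance):
    """Keep businesses within max_distance (in degrees, an assumed threshold)
    of the mean latitude/longitude of all businesses in the same state."""
    by_state = defaultdict(list)
    for business in businesses:
        by_state[business["state"]].append(business)

    kept = []
    for rows in by_state.values():
        # Average location of every business recorded for this state.
        mean_lat = sum(r["latitude"] for r in rows) / len(rows)
        mean_lon = sum(r["longitude"] for r in rows) / len(rows)
        for r in rows:
            # Euclidean distance from the state's average location.
            distance = math.hypot(r["latitude"] - mean_lat,
                                  r["longitude"] - mean_lon)
            if distance <= max_distance:
                kept.append(r)
    return kept
```

A record whose coordinates place it far from the rest of its state, for example a business tagged with the wrong state, ends up much farther from the state's average location than the others and is dropped. The threshold has to be chosen with the size of the state in mind, since a single fixed value may be too strict for large states.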