hyunjoonbok/R-projects


R Portfolio

A portfolio of data science projects in R, drawn from original work or revised from existing work for study and learning purposes. The projects in this repo are presented as .R and .Rmd (R Markdown) files.

Each folder represents a field of application (e.g., Timeseries, Deeplearning, MachineLearning, etc.).

For detailed code examples and images, please refer to the readme file presented below.

Note: Data used in the projects is for learning and demo purposes only.


Motivation / Thought Process

These days R is less preferred in industry for various reasons (e.g., it is seen as less production-ready and less scalable). However, I think R is still a very powerful language. I am personally fond of R and use it for everyday analysis, from simple EDA to creating stunning visualizations and building complex ML/DL models. I think R's main advantage is that code and results can be inspected side by side in a controlled environment.

This repository was originally meant to record project progress and my own learning process, but I found that it could help anyone who wants to take their data science skills to the next level using R. It contains numerous real-life data science examples and notebooks created by @hyunjoonbok, as well as code borrowed from authors who produced state-of-the-art results.

I tried to include packages and methods that are consistently used in industry to solve the problems (even if it's a toy example). The repo contains use cases that can be readily applied to many real-world datasets.


Table of contents


Projects

This workbook covers the complete, advanced steps for creating a SOTA time-series forecasting model at scale. We use the Walmart M4 Kaggle competition dataset to create forecasts for seven different time series. It introduces the latest functions in Modeltime and related techniques in R, covering data loading, preprocessing, modeling, fitting, calibration, ensembling, and visualization. The code is experiment-ready and can be applied to any custom time-series dataset.

Oftentimes, a business running any kind of time-series forecasting model needs to scale it. This example leverages the "Nest" function to handle several time series in a single dataset at the same time, where the best-chosen ML algorithm is applied to forecast every group. The possibilities are endless: with the help of the "Nest" function, the approach can be scaled to build thousands of models in parallel.
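The nesting idea can be illustrated without Modeltime. The following is a minimal base-R sketch, assuming a synthetic long-format data frame `sales` with hypothetical columns `id`, `t`, and `value`. A linear trend stands in for the forecasting model, so this shows the split-and-fit-per-group pattern only, not the workbook's actual workflow.

```r
# Synthetic long-format data: 3 series, 24 observations each (all names hypothetical)
set.seed(1)
sales <- data.frame(
  id = rep(c("A", "B", "C"), each = 24),
  t  = rep(1:24, times = 3)
)
slopes <- c(A = 1, B = 2, C = 3)
sales$value <- 10 + unname(slopes[sales$id]) * sales$t + rnorm(nrow(sales))

# "Nest": one data frame per series
nested <- split(sales, sales$id)

# Fit one model per series (a linear trend stands in for the real forecasting model)
fits <- lapply(nested, function(d) lm(value ~ t, data = d))

# Forecast the next 3 time steps for every series
horizon   <- data.frame(t = 25:27)
forecasts <- lapply(fits, predict, newdata = horizon)
```

Because every group is fitted independently, the `lapply()` call is the step that could be distributed across workers when the number of series grows.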

This file walks through, at a high level, the key steps needed to produce a time-series forecast. We look at the bike_sharing_daily time series data from 2011 to 2013 to predict its sales for the next 3 months. We set aside the last 3 months of data as the testing set and leverage the modeltime package to build different SOTA time-series models, including ARIMA, Prophet, XGBoost, and random forest. We then evaluate the models, refit them in light of the errors from the initial fits, and finally visualize the models side by side.
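The hold-out step described above can be sketched in base R. This example assumes the built-in monthly `AirPassengers` series as a stand-in for the bike-sharing data and uses `stats::arima()` rather than the modeltime package, so it illustrates the train/test logic only.

```r
# AirPassengers is a built-in monthly series; it stands in for the bike-sharing data
y <- log(AirPassengers)
n <- length(y)

# Hold out the last 3 months as the test set
train <- ts(y[seq_len(n - 3)], start = start(y), frequency = frequency(y))
test  <- as.numeric(y[(n - 2):n])

# Seasonal ARIMA fitted on the training window only
fit <- arima(train, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the held-out months and measure the error on the original scale
fc   <- predict(fit, n.ahead = length(test))$pred
mape <- mean(abs(exp(fc) - exp(test)) / exp(test)) * 100
```

Keeping the test window strictly after the training window avoids leaking future information into the fit, which a random split would not guarantee.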

Looking at customer transaction data to segment customers into groups and better stratify the business strategy. Uses k-means clustering and bootstrap evaluation to group customers effectively, and creates strategy points that could be discussed with business stakeholders.

Building a deep learning model using H2O and performing hyperparameter tuning through random grid search to solve a multi-label classification problem.

An end-to-end recommendation system built on game titles, from data wrangling to building the algorithm and deploying it to a Shiny web app. It gives a full understanding of the recommender algorithm, which can be applied to any real-world data.
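A rough sketch of k-means with a bootstrap stability check, using base R only. The customer features (`spend`, `visits`) are synthetic and hypothetical, and the Jaccard-based stability score is one common way to run such a check, not necessarily the routine used in the notebook.

```r
set.seed(42)
# Synthetic customers: two hypothetical features, three well-separated groups
customers <- rbind(
  cbind(spend = rnorm(50, 20, 3),  visits = rnorm(50, 2, 0.5)),
  cbind(spend = rnorm(50, 60, 3),  visits = rnorm(50, 6, 0.5)),
  cbind(spend = rnorm(50, 100, 3), visits = rnorm(50, 10, 0.5))
)
x <- scale(customers)                      # k-means is scale-sensitive

km <- kmeans(x, centers = 3, nstart = 25)  # reference clustering

# Jaccard similarity between two sets of row indices
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# For each bootstrap resample: re-cluster the resampled rows (duplicates dropped
# for simplicity), then record for every original cluster its best Jaccard match
B <- 50
stability <- replicate(B, {
  idx <- unique(sample(nrow(x), replace = TRUE))
  kb  <- kmeans(x[idx, ], centers = 3, nstart = 25)
  sapply(1:3, function(k) {
    orig <- intersect(which(km$cluster == k), idx)
    max(sapply(1:3, function(j) jaccard(orig, idx[kb$cluster == j])))
  })
})
mean_stability <- rowMeans(stability)      # one value per original cluster
```

Values near 1 suggest a cluster is recovered consistently across resamples; clearly lower values suggest a segment that may not be worth building a strategy around.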


  • Machine Learning

    Predicting future beer sales using historical data. Uses H2O's AutoML feature to easily obtain state-of-the-art ensemble results, and plots the errors to find improvements.

    Oftentimes, ML models are criticized as being black boxes (untraceable, complex internals that magically solve the problem). Here we look at the problem of predicting apartment prices using Linear Regression, SVM, and Random Forest, and use the DALEX package to see how much each variable affects the prediction.

    Employee Churn Modeling

    Builds an ML model with the help of the powerful Caret package. Includes the complete steps to pre-process, fine-tune, train, and plot the ROC curve. Then I use LIME (Local Interpretable Model-Agnostic Explanations) to understand the resulting model. H2O is used to initiate modeling and, with the help of LIME, the notebook gives both global and local interpretations of the predictor variables. It gives a clear visual explanation of variable importance and how the model is affected by each variable.

    The Naïve Bayes classifier is a simple probabilistic classifier which is based on Bayes Theorem but with strong assumptions regarding independence. Historically, this technique became popular with applications in email filtering, spam detection, and document categorization. Here, I built a simple classification model with Caret and H2O.

    Builds a simple ML model with H2O to predict which customers are more likely to enroll in a bank's term deposit. Shows how random grid search combined with stacked ensembles is a very powerful combination.

    Contains the complete steps of model building with XGBoost in R, from CV, grid search, and hyperparameter tuning to feature selection, optimization, training/evaluation, and prediction. Solves a real-world binary classification problem.

    Contains the complete steps of model building and an explanation of what is actually going on inside the ML model. Using 4 different methods/packages (PDP, ICE, LIME, Shapley), it shows how machine learning can be made explainable to some extent.

    A toy example showing how H2O can be used to predict arrival delay from historical airline data for flights with Chicago airport as the destination. Gives a quick look at how easily the H2O package can be used in a simple ML problem.
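Of the explanation methods mentioned in the Machine Learning projects, partial dependence (PDP) is the simplest to write by hand. The sketch below assumes the built-in `mtcars` data and a plain linear model; the helper `partial_dependence()` is a hypothetical name written for this illustration and does not come from any package.

```r
# mtcars is a built-in dataset; the model and feature choice are illustrative only
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Partial dependence of the prediction on one feature:
# set the feature to a grid value for every row, predict, and average
partial_dependence <- function(model, data, feature, grid) {
  sapply(grid, function(v) {
    d <- data
    d[[feature]] <- v
    mean(predict(model, newdata = d))
  })
}

grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
pd   <- partial_dependence(fit, mtcars, "wt", grid)
```

Calling `plot(grid, pd, type = "l")` draws the curve. The same function works for any model with a `predict()` method, which is what makes the technique model-agnostic.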




  • Database & Parallel Computing

    Analyzing Google Analytics data (built in as sample data) with BigQuery through an R interface. Shows how to connect to BigQuery from a local machine using the DBI package.

    Connecting to BigQuery, using dplyr commands, computing k-means on the data, and finally visualizing the data with ggplot.

    R provides a number of convenient facilities for parallel computing. This script shows how to set up and run a parallel process on your current multi-core device, without the need for additional hardware.

    Introduces an R interface for Apache Spark and shows how to connect to Spark from a local machine. Learn to use distributed computing by fully utilizing Spark's engine, as Hadoop-based data lakes are becoming common practice at companies.
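As a small illustration of running a parallel process on a multi-core machine, the sketch below uses the `parallel` package that ships with R. The function `slow_square()` is a hypothetical stand-in for real per-item work, and the worker count is capped at 2 purely for the demo.

```r
library(parallel)

# A deliberately slow toy task (hypothetical stand-in for real per-item work)
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Use at most 2 workers here; detectCores() may return NA on some systems
n_workers <- max(1, min(2, detectCores(), na.rm = TRUE))
cl <- makeCluster(n_workers)

# parLapply distributes the inputs across the worker processes
res <- unlist(parLapply(cl, 1:8, slow_square))

stopCluster(cl)   # always release the workers when done
```

`makeCluster()` starts separate R processes, so any packages or global objects a task depends on have to be made available on the workers explicitly, for example with `clusterExport()`.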


  • Text Mining / Social Media Analysis

    For real-world text data that goes beyond GB/TB in file size, it's necessary to leverage the Spark engine to load and transform the data. The script eventually generates a list of the most used words and creates a basic word cloud.

    Looking at the text of Jane Austen's books to learn the full range of text-mining functions (tidying data, sentiment analysis, word frequency, TF-IDF, word clouds, tokenizing by n-gram, topic modeling). Ready to be used on any real-world dataset.

    Learn to search tweets by length, location, or any other set criteria. Retrieve a list of all the accounts a user follows. Then plot the frequency of tweets for each user over time.
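TF-IDF, one of the text-mining steps listed above, can be computed by hand in a few lines. This base-R sketch uses three tiny made-up documents instead of the Jane Austen texts and the common definition idf = ln(number of documents / documents containing the term); text-mining packages may differ in details.

```r
# Three tiny hypothetical "documents"
docs <- c(
  a = "the cat sat on the mat",
  b = "the dog sat on the log",
  c = "the cat chased the dog"
)

# Tokenize: lower-case and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Term frequency: share of each word within its document (rows = documents)
tf <- t(vapply(tokens, function(w) {
  as.numeric(table(factor(w, levels = vocab))) / length(w)
}, numeric(length(vocab))))
dimnames(tf) <- list(names(docs), vocab)

# Inverse document frequency: ln(number of docs / docs containing the term)
idf <- log(length(docs) / colSums(tf > 0))

# TF-IDF: words that appear in every document (such as "the") get weight 0
tf_idf <- sweep(tf, 2, idf, `*`)
```

Sorting a row of `tf_idf` in decreasing order gives the words most characteristic of that document relative to the others.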


  • Visualization (ggplot2)

    Ready-to-Use ggplot2

    A curated list of ggplot code snippets that generate beautiful plots, with examples. A basic understanding of ggplot code is required.


  • Statistical Concepts with real-world examples

    Concept of Logistic Regression displayed in R code. Solves a binary classification problem.

    Multinomial regression is similar to logistic regression, but fits better when the response variable is a categorical variable with more than 2 levels.

    Ordinal logistic regression can be used to model an ordered factor response. Here, we use ordered logistic regression to predict the car evaluation.

    Ridge Regression is a commonly used technique to address the problem of "multi-collinearity". We look at the results of Linear Regression vs. Ridge Regression.

    Social Network Analysis is a set of methods used to visualize networks, describe specific characteristics of overall network structure, and build mathematical and statistical models of network structures and dynamics.
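The linear vs. ridge regression comparison mentioned in this section can be reproduced in miniature with closed-form estimates. The sketch below assumes synthetic data with two nearly collinear predictors (`x1`, `x2`) and an arbitrary penalty `lambda = 1`; all names and values are hypothetical.

```r
# Synthetic data with two nearly collinear predictors (all names hypothetical)
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # almost a copy of x1
y  <- 3 * x1 + rnorm(n)

X  <- scale(cbind(x1, x2))       # standardize so the penalty treats both alike
yc <- y - mean(y)                # center y so no intercept is needed

# Closed-form estimates: OLS solves (X'X) b = X'y, ridge adds lambda * I
ols_coef   <- solve(crossprod(X), crossprod(X, yc))
lambda     <- 1
ridge_coef <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, yc))
```

With collinear predictors the OLS coefficients tend to be large and unstable, while the ridge penalty shrinks them toward each other and toward zero. In practice `lambda` would be chosen by cross-validation rather than fixed.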


Setup

  • Simply click one of the R files above and copy/paste it into your own R script.
    • Please make sure to change the working directory!

TO-DOs

List of features ready and TO-DOs for future development

  • Introduction to R Shiny Apps : in progress
  • More Kaggle examples : in progress
  • Data cleaning .ipynbs : in progress --> In Python Portfolio

Contact

Created by @hyunjoonbok - feel free to contact me!