Merge pull request #3 from drkostas/dev

Port the original code into the template repository
drkostas · May 15, 2020 · 82b2291
2 parents 7d50e37 + 1bc0cea
commit 82b2291

Showing 46 changed files with 17,376 additions and 1,020 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -2,7 +2,7 @@ version: 2 # use CircleCI 2.0
 jobs: # A basic unit of work in a run
   build: # runs not using Workflows must have a `build` job as entry point 
     # directory where steps are run
-    working_directory: ~/template_python_project
+    working_directory: ~/hgn
     docker: # run the steps with Docker
       # CircleCI Python images available at: https://hub.docker.com/r/circleci/python/
       - image: circleci/python:3.6.9

diff --git a/.gitignore b/.gitignore
@@ -130,4 +130,10 @@ dmypy.json
 
 # PyCharm
 /.idea
-/tests/test_data/test_dropbox_cloudstore/*.txt
+
+# Custom
+/data/checkpoints/*
+/data/csv_data/*
+/data/dataframes/*
+/data/plots/*
+/data/spark-warehouse/*
diff --git a/Procfile b/Procfile
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # Template Python Project
-[![CircleCI](https://circleci.com/gh/drkostas/template_python_project/tree/master.svg?style=svg)](https://circleci.com/gh/drkostas/template_python_project/tree/master)
-[![GitHub license](https://img.shields.io/badge/license-GNU-blue.svg)](https://raw.githubusercontent.com/drkostas/template_python_project/master/LICENSE)
+[![CircleCI](https://circleci.com/gh/drkostas/HGN/tree/master.svg?style=svg)](https://circleci.com/gh/drkostas/HGN/tree/master)
+[![GitHub license](https://img.shields.io/badge/license-GNU-blue.svg)](https://raw.githubusercontent.com/drkostas/HGN/master/LICENSE)
 
 ## Table of Contents
 + [About](#about)
@@ -24,9 +24,18 @@
 + [Acknowledgments](#acknowledgments)
 
 ## About <a name = "about"></a>
-This is a template repository for python projects.
-
-<i>This README serves as a template too. Feel free to modify it until it describes your project.</i>
+Code for the paper "[A Distributed Hybrid Community Detection Methodology for Social Networks.](https://www.mdpi.com/1999-4893/12/8/175)"
+<br><br>
+The proposed methodology is an iterative, divisive community detection process that combines the network topology features 
+of loose similarity and local edge betweenness with the user content information in order to remove the 
+inter-connection edges and thus reveal the underlying community structure. Although this iterative process might sound 
+computationally demanding, its application is not prohibitive: the experimental results indicate 
+that the aforementioned measures are informative and representative enough 
+that only a few iterations are required to converge to the final community hierarchy.
+<br><br>
+Implementation last tested with [Python 3.6](https://www.python.org/downloads/release/python-36), 
+[Apache Spark 2.4.5](https://spark.apache.org/docs/2.4.5/) 
+and [GraphFrames 0.8.0](https://github.com/graphframes/graphframes/tree/v0.8.0)
 
 ## Getting Started <a name = "getting_started"></a>
 
@@ -36,7 +45,8 @@ and testing purposes. See deployment for notes on how to deploy the project on a
 
 ### Prerequisites <a name = "prerequisites"></a>
 
-You need to have a machine with Python > 3.6 and any Bash based shell (e.g. zsh) installed.
+You need a machine with Python 3.6, Apache Spark 2.4.5, GraphFrames 0.8.0 
+and any Bash-based shell (e.g. zsh) installed. Apache Spark 2.4.5 also requires Java 8.
 
 
 ```
@@ -47,31 +57,34 @@ echo $SHELL
 /usr/bin/zsh
 ```
 
-You will also need to setup the following:
-- Gmail: An application-specific password for your Google account. 
-[Reference 1](https://support.google.com/mail/?p=InvalidSecondFactor), 
-[Reference 2](https://security.google.com/settings/security/apppasswords) 
-- Dropbox: An Api key for your Dropbox account. 
-[Reference 1](http://99rabbits.com/get-dropbox-access-token/), 
-[Reference 2](https://dropbox.tech/developers/generate-an-access-token-for-your-own-account) 
-- MySql: If you haven't any, you can create a free one on Amazon RDS. 
-[Reference 1](https://aws.amazon.com/rds/free/), 
-[Reference 2](https://bigdataenthusiast.wordpress.com/2016/03/05/aws-rds-instance-setup-oracle-db-on-cloud-free-tier/) 
-
-
 ### Set the required environment variables <a name = "env_variables"></a>
 
 In order to run the [main.py](main.py) or the tests you will need to set the following 
-environmental variables in your system:
+environment variables in your system (or in the [spark.env file](spark.env)):
 
 ```bash
-$ export DROPBOX_API_KEY=<VALUE>
-$ export MYSQL_HOST=<VALUE>
-$ export MYSQL_USERNAME=<VALUE>
-$ export MYSQL_PASSWORD=<VALUE>
-$ export MYSQL_DB_NAME=<VALUE>
-$ export EMAIL_ADDRESS=<VALUE>
-$ export GMAIL_API_KEY=<VALUE>
+$ export SPARK_HOME="<Path to Spark Home>"
+$ export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.0-spark2.4-s_2.11 pyspark-shell"
+$ export JAVA_HOME="<Path to Java 8>"
+
+$ cd $SPARK_HOME
+
+/usr/local/spark
+$ ./bin/pyspark --version
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
+      /_/
+                        
+Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252
+Branch HEAD
+Compiled by user centos on 2020-02-02T19:38:06Z
+Revision cee4ecbb16917fa85f02c635925e2687400aa56b
+Url https://gitbox.apache.org/repos/asf/spark.git
+Type --help for more information.
+
 ```
 
 ## Installing, Testing, Building <a name = "installing"></a>
@@ -123,24 +136,24 @@ make help
 ```bash
 $ make clean server=local
 make delete_venv
-make[1]: Entering directory '/home/drkostas/Projects/template_python_project'
+make[1]: Entering directory '/home/drkostas/Projects/HGN'
 Deleting venv..
 rm -rf venv
-make[1]: Leaving directory '/home/drkostas/Projects/template_python_project'
+make[1]: Leaving directory '/home/drkostas/Projects/HGN'
 make clean_pyc
-make[1]: Entering directory '/home/drkostas/Projects/template_python_project'
+make[1]: Entering directory '/home/drkostas/Projects/HGN'
 Cleaning pyc files..
 find . -name '*.pyc' -delete
 find . -name '*.pyo' -delete
 find . -name '*~' -delete
-make[1]: Leaving directory '/home/drkostas/Projects/template_python_project'
+make[1]: Leaving directory '/home/drkostas/Projects/HGN'
 make clean_build
-make[1]: Entering directory '/home/drkostas/Projects/template_python_project'
+make[1]: Entering directory '/home/drkostas/Projects/HGN'
 Cleaning build directories..
 rm --force --recursive build/
 rm --force --recursive dist/
 rm --force --recursive *.egg-info
-make[1]: Leaving directory '/home/drkostas/Projects/template_python_project'
+make[1]: Leaving directory '/home/drkostas/Projects/HGN'
 
 ```
 
@@ -183,35 +196,91 @@ running install
 
 ## Running the code locally <a name = "run_locally"></a>
 
-In order to run the code now, you will only need to change the yml file if you need to 
-and run either the main or the created console script.
+To run the code, place the graph whose communities you want to identify under 
+[data/input_graphs](data/input_graphs).<br>
+You also need to create a yml file for each new graph before executing [main.py](main.py).
 
 ### Modifying the Configuration <a name = "configuration"></a>
 
-There is an already configured yml file under [confs/template_conf.yml](confs/template_conf.yml) with the following structure:
+There are two already configured yml files: [confs/quakers.yml](confs/quakers.yml) 
+and [confs/hamsterster.yml](confs/hamsterster.yml) with the following structure:
 
 ```yaml
-tag: production
-cloudstore:
-  config:
-    api_key: !ENV ${DROPBOX_API_KEY}
-  type: dropbox
-datastore:
-  config:
-    hostname: !ENV ${MYSQL_HOST}
-    username: !ENV ${MYSQL_USERNAME}
-    password: !ENV ${MYSQL_PASSWORD}
-    db_name: !ENV ${MYSQL_DB_NAME}
-    port: 3306
-  type: mysql
-email_app:
-  config:
-    email_address: !ENV ${EMAIL_ADDRESS}
-    api_key: !ENV ${GMAIL_API_KEY}
-  type: gmail
+tag: dev  # Required
+spark:
+  - config:  # The spark settings
+      spark.master: local[*]  # Required
+      spark.submit.deployMode: client  # Required
+      spark_warehouse_folder: data/spark-warehouse  # Required
+      spark.ui.port: 4040
+      spark.driver.cores: 5
+      spark.driver.memory: 8g
+      spark.driver.memoryOverhead: 4096
+      spark.driver.maxResultSize: 0
+      spark.executor.instances: 2
+      spark.executor.cores: 3
+      spark.executor.memory: 4g
+      spark.executor.memoryOverhead: 4096
+      spark.sql.broadcastTimeout: 3600
+      spark.sql.autoBroadcastJoinThreshold: -1
+      spark.sql.shuffle.partitions: 4
+      spark.default.parallelism: 4
+      spark.network.timeout: 3600s
+    dirs:
+      df_data_folder: data/dataframes  # Folder to store the DataFrames as parquets
+      spark_warehouse_folder: data/spark-warehouse
+      checkpoints_folder: data/checkpoints
+      communities_csv_folder: data/csv_data  # Folder to save the computed communities as csvs
+input:
+  - config:  # All properties required
+      name: Quakers
+      nodes:
+        path: data/input_graphs/Quakers/quakers_nodelist.csv2  # Path to the nodes file
+        has_header: true  # Whether they have a header with the attribute names
+        delimiter: ','
+        encoding: ISO-8859-1
+        feature_names:  # You can rename the attribute names (the number should be the same as the original)
+          - id
+          - Historical_Significance
+          - Gender
+          - Birthdate
+          - Deathdate
+          - internal_id
+      edges:
+        path: data/input_graphs/Quakers/quakers_edgelist.csv2  # Path to the edges file
+        has_header: true  # Whether they have a header with the source and dest
+        has_weights: false  # Whether they have a weight column
+        delimiter: ','
+    type: local
+run_options:  # All properties required
+  - config:
+      cached_init_step: false  # Whether the cosine similarities and edge_betweenness have already been computed
+      # See the paper for info regarding the following attributes
+      feature_min_avg: 0.33
+      r_lvl1_thres: 0.50
+      r_lvl2_thres: 0.85
+      max_edge_weight: 0.50
+      betweenness_thres: 10
+      max_sp_length: 2
+      min_comp_size: 2 
+      max_steps: 30  # Max steps for the algorithm to run if it doesn't converge
+      features_to_check:  # Which attributes to take into consideration for the cosine similarities
+        - id
+        - Gender
+output:  # All properties required
+  - config:
+      logs_folder: data/logs
+      save_communities_to_csvs: false  # Whether to save the computed communities in csvs or not
+      visualizer:
+        dimensions: 3  # Dimensions of the scatter plot (2 or 3)
+        save_img: true
+        folder: data/plots
+        steps:  # The steps to plot
+          - 0   # The step before entering the main loop
+          - -1  # The Last step
 ```
 
-The `!ENV` flag indicates that a environmental value follows. 
+The `!ENV` flag indicates that an environment variable follows. For example, you can set: <br>`logs_folder: !ENV ${LOGS_FOLDER}`<br>
 You can change the values/environmental var names as you wish.
 If a yaml variable name is changed/added/deleted, the corresponding changes should be reflected 
 on the [Configuration class](configuration/configuration.py) and the [yml_schema.json](configuration/yml_schema.json) too.
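The `${NAME}` substitution behind the `!ENV` flag can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the repository's Configuration class: it assumes the YAML has already been parsed and that `!ENV` scalars arrive as raw strings such as `${LOGS_FOLDER}`. The function name `resolve_env` is hypothetical.

```python
import os
import re

# Matches ${NAME} placeholders such as ${LOGS_FOLDER}.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Recursively replace ${NAME} placeholders in strings, lists and dicts."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        def _lookup(match):
            name = match.group(1)
            if name not in env:
                # Fail loudly rather than leave an unresolved placeholder.
                raise KeyError(f"Environment variable {name} is not set")
            return env[name]
        return _ENV_PATTERN.sub(_lookup, value)
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    return value  # Numbers, booleans and None pass through unchanged.


# Example mirroring the `output` section of the configuration.
conf = {"output": [{"config": {"logs_folder": "${LOGS_FOLDER}",
                               "save_communities_to_csvs": False}}]}
print(resolve_env(conf, {"LOGS_FOLDER": "data/logs"}))
```

Raising on a missing variable is a design choice for this sketch; a real loader might instead fall back to a default value.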
@@ -223,59 +292,55 @@ First, make sure you are in the created virtual environment:
 ```bash
 $ source venv/bin/activate
 (venv) 
-OneDrive/Projects/template_python_project  dev 
+OneDrive/Projects/HGN  dev 
 
 $ which python
-/home/drkostas/Projects/template_python_project/venv/bin/python
+/home/drkostas/Projects/HGN/venv/bin/python
 (venv) 
 ```
 
-Now, in order to run the code you can either call the `main.py` directly, or the `template_python_project` console script.
+Now, in order to run the code you can either call the `main.py` directly, or the `hgn` console script.
 
 ```bash
-$ python main.py --help
-usage: main.py -m {run_mode_1,run_mode_2,run_mode_3} -c CONFIG_FILE [-l LOG]
-               [-d] [-h]
+$ python main.py -h
+usage: main.py -c CONFIG_FILE [-d] [-h]
 
-A template for python projects.
+A Distributed Hybrid Community Detection Methodology for Social Networks.
 
-required arguments:
-  -m {run_mode_1,run_mode_2,run_mode_3}, --run-mode {run_mode_1,run_mode_2,run_mode_3}
-                        Description of the run modes
+Required Arguments:
   -c CONFIG_FILE, --config-file CONFIG_FILE
                         The configuration yml file
-  -l LOG, --log LOG     Name of the output log file
 
-optional arguments:
-  -d, --debug           enables the debug log messages
+Optional Arguments:
+  -d, --debug           Enables the debug log messages
+  -h, --help            Show this help message and exit
+
 
 # Or
 
-$ template_python_project --help
-usage: template_python_project -m {run_mode_1,run_mode_2,run_mode_3} -c
-                               CONFIG_FILE [-l LOG] [-d] [-h]
+$ hgn --help
+usage: hgn -c CONFIG_FILE [-d] [-h]
 
-A template for python projects.
+A Distributed Hybrid Community Detection Methodology for Social Networks.
 
-required arguments:
-  -m {run_mode_1,run_mode_2,run_mode_3}, --run-mode {run_mode_1,run_mode_2,run_mode_3}
-                        Description of the run modes
+Required Arguments:
   -c CONFIG_FILE, --config-file CONFIG_FILE
                         The configuration yml file
-  -l LOG, --log LOG     Name of the output log file
 
-optional arguments:
-  -d, --debug           enables the debug log messages
+Optional Arguments:
+  -d, --debug           Enables the debug log messages
   -h, --help            Show this help message and exit
+
 ```
 
 ## Deployment <a name = "deployment"></a>
 
-The deployment is being done to <b>Heroku</b>. For more information
-you can check the [setup guide](https://devcenter.heroku.com/articles/getting-started-with-python). 
+It is recommended that you deploy the application to a Spark Cluster.<br>Please see: 
+- [Spark Cluster Overview \[Apache Spark Docs\]](https://spark.apache.org/docs/latest/cluster-overview.html)
+- [Apache Spark on Multi Node Cluster \[Medium\]](https://medium.com/ymedialabs-innovation/apache-spark-on-a-multi-node-cluster-b75967c8cb2b)
+- [Databricks Cluster](https://docs.databricks.com/clusters/index.html)
+- [Flintrock \[Cheap & Easy EC2 Cluster\]](https://github.com/nchammas/flintrock)
 
-Make sure you check the defined [Procfile](Procfile) ([reference](https://devcenter.heroku.com/articles/getting-started-with-python#define-a-procfile)) 
-and that you set the [above-mentioned environmental variables](#env_variables) ([reference](https://devcenter.heroku.com/articles/config-vars)).
 
 ## Continuous Integration <a name = "ci"></a>
 
@@ -291,17 +356,11 @@ Read the [TODO](TODO.md) to see the current task list.
 
 ## Built With <a name = "built_with"></a>
 
-* [Dropbox Python API](https://www.dropbox.com/developers/documentation/python) - Used for the Cloudstore Class
-* [Gmail Sender](https://github.com/paulc/gmail-sender) - Used for the EmailApp Class
-* [Heroku](https://www.heroku.com) - The deployment environment
+* [Apache Spark 2.4.5](https://spark.apache.org/docs/2.4.5/) - Fast and general-purpose cluster computing system
+* [GraphFrames 0.8.0](https://github.com/graphframes/graphframes/tree/v0.8.0) - A package for Apache Spark which provides DataFrame-based Graphs.
 * [CircleCI](https://www.circleci.com/) - Continuous Integration service
 
 
 ## License <a name = "license"></a>
 
 This project is licensed under the GNU License - see the [LICENSE](LICENSE) file for details.
-
-## Acknowledgments <a name = "acknowledgments"></a>
-
-* Thanks το PurpleBooth for the [README template](https://gist.github.com/PurpleBooth/109311bb0361f32d87a2)
-