
GIS-DataScience_Pipeline

Using the PySAL libraries to perform GeoSpatial Data Science on PostGIS-hosted data. We perform these tasks through Jupyter Notebooks, with a view to using these methods in a full-scale environment featuring a PostGIS server to host the datasets.

This series of Jupyter Notebooks goes through using Python libraries in a database administration context. The first two Notebooks are written in French; I am writing the latest ones in English, as a means to showcase my ability to present my work in English. Download links for the data are supplied in the Notebooks. The data is free and widely available.

It's a beginner's guide to setting up VS Code for SQL, Markdown and Python. It deals with installing the extensions we need and working inside your virtual environments.

The only file that is not a Jupyter Notebook! It's a Markdown file. I originally sent it as a demo to my peers, who were curious about my SQL proficiency. It features a single, very short SQL query, but a mighty one: I describe the purpose of the query and the thinking process leading to the actual script. I wrote it while I was new to VS Code, and I was pleased with its capabilities as a Postgres client, which came in handy for the jobs the next Notebooks accomplish.

In this Notebook, we use Python to create a hexagonal grid over a geographical polygon. There is a single parameter, the mesh size. We use the results of this script in the next Notebooks. As I discuss inside the Notebook, I could certainly have done that job in SQL if I had to in a production environment. Still, the challenge allowed me to set up a workflow from Python to PostGIS, and back, through SQLAlchemy's bound parameters.
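As a flavour of that workflow, here is a minimal sketch (not the notebook's exact code) of the round trip, assuming a local PostGIS database and a `study_area` table holding the source polygon:

```python
import geopandas as gpd
from sqlalchemy import create_engine, text

# Hypothetical connection string; adjust to your own PostGIS instance.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/gis")

# Pull the study polygon with a bound parameter rather than string formatting.
query = text("SELECT geom FROM study_area WHERE name = :name")
area = gpd.read_postgis(query, engine, geom_col="geom", params={"name": "Hérault"})

# ... build the hexagonal grid from `area` with the chosen mesh size ...
# hexgrid = make_hexgrid(area, mesh_size=2000)   # hypothetical helper

# Push the result back to PostGIS for the next Notebooks.
# hexgrid.to_postgis("hex_grid", engine, if_exists="replace", index=False)
```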

We use a complete dataset, manipulate various categories of interpolation variables (intensive and categorical), and combine the results into the hexagonal mesh we calculated before. We use the Tobler package, which is part of the PySAL library. This is done in a Jupyter Notebook, and the code builds upon the previous Notebooks.
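As a hedged sketch of what that interpolation looks like with Tobler (the variable names below are placeholders, not the notebook's actual columns):

```python
from tobler.area_weighted import area_interpolate

# GeoDataFrames loaded earlier, e.g. via gpd.read_postgis (placeholder names):
#   source_df : polygons carrying the original variables
#   hex_grid  : the hexagonal mesh built in the previous Notebook
interpolated = area_interpolate(
    source_df=source_df,
    target_df=hex_grid,
    intensive_variables=["population_density"],   # assumed column name
    categorical_variables=["land_use"],           # assumed column name
)
# `interpolated` is a GeoDataFrame whose rows are the hexagonal cells.
```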

We make use of Exploratory Spatial Data Analysis (ESDA), also part of the PySAL library, to come up with a quantitative measure of how much neighboring cells affect each other. Our example uses the same dataset we have been using throughout, focusing on the population distribution over an administrative division of France, the Hérault département.
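A minimal sketch of that measure with `esda`, assuming the hexagonal mesh carries a `population` column (a placeholder name):

```python
from libpysal.weights import Queen
from esda.moran import Moran

w = Queen.from_dataframe(hex_grid)    # contiguity weights between hexagonal cells
w.transform = "r"                     # row-standardise the weights
mi = Moran(hex_grid["population"], w)
print(mi.I, mi.p_sim)                 # global Moran's I and its pseudo p-value
```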

The purpose of this notebook is to display a map of our data, nested with histograms pointing at their corresponding hexagonal cell on the map. To do so, we do some data manipulation in Python: sorting lists and creating dictionaries of pandas DataFrames. We use GridSpec to arrange several plots inside a single figure, and contextily to enhance the visual experience.
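A sketch of that layout, assuming `hex_grid` (the mesh), `selected_cells` (a list of cell ids) and `histograms` (a dictionary of pandas DataFrames keyed by cell id) already exist — those names are mine, not the notebook's:

```python
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import contextily as cx

fig = plt.figure(figsize=(12, 8))
gs = GridSpec(3, 2, figure=fig, width_ratios=[2, 1])

# The map spans the whole left column.
ax_map = fig.add_subplot(gs[:, 0])
hex_grid.to_crs(epsg=3857).plot(column="population", ax=ax_map, alpha=0.6)
cx.add_basemap(ax_map, source=cx.providers.OpenStreetMap.Mapnik)

# One histogram per selected cell in the right column.
for i, cell_id in enumerate(selected_cells[:3]):
    ax = fig.add_subplot(gs[i, 1])
    histograms[cell_id].plot.hist(ax=ax, legend=False)
    ax.set_title(f"cell {cell_id}")

plt.tight_layout()
plt.show()
```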

PCA (Principal Component Analysis) is an unsupervised machine learning method that reduces the dimensionality of your data. It is the first method from the realm of data mining that we use and interpret. I adapt Ostwal Prasad's Notebook to my workflow so I can pull an interpretation of a selected subset of my PostGIS-hosted dataset from the previous Notebooks. This is the first Notebook featuring the scikit-learn library.
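A minimal scikit-learn PCA sketch under those assumptions (the column names stand in for the selected subset):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# `df` holds the numeric columns pulled from PostGIS (placeholder names).
X = StandardScaler().fit_transform(df[["population", "jobs", "dwellings"]])

pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```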

We use the findings of our PCA to let a Decision Tree model predict the classes for the whole dataset from a few manually picked examples. We use another scikit-learn estimator here, and we also use the Random Forest classifier to perform the same process. We use dictionaries of datasets to run a parallel analysis throughout on different subsets of our dataset.
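A hedged sketch of that step: train on the hand-labelled rows, then predict over the whole mesh (`labelled`, `features` and the column names are placeholders):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X_train = labelled[features]            # the manually picked examples
y_train = labelled["manual_class"]

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Predict classes for every hexagonal cell.
hex_grid["class_tree"] = tree.predict(hex_grid[features])
hex_grid["class_forest"] = forest.predict(hex_grid[features])
```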

We use yet another scikit-learn classifier. As with the Random Forest classifier, we use the k-Nearest Neighbors classifier to predict classes. We then plot elbow charts to determine our best value for k. We also make maps to get a sense of the resulting classification.
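A minimal sketch of the k selection, plotting the misclassification rate against k as an elbow chart (the train/test split names are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

ks = range(1, 21)
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))   # misclassification rate

plt.plot(ks, errors, marker="o")
plt.xlabel("k")
plt.ylabel("error rate")
plt.show()
```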

k-Means is our first dip into clustering. We perform k-Means clustering on a few different subsets. We then try to evaluate its performance with the Calinski-Harabasz index, as well as trying out geosilhouettes. Lastly, we try to get a feel for how consistent the clustering is over a few iterations.
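A hedged sketch of the clustering and its scoring (`X` stands for one standardised subset; the geosilhouette step is only hinted at in a comment):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(calinski_harabasz_score(X, labels))   # higher means better-separated clusters

# PySAL's esda also ships geosilhouette measures (boundary and path silhouettes)
# as a spatially aware counterpart to these scores.
```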

We made creative use of Moran's $I$ binary cases throughout. This allowed us to quickly evaluate differences in clustering by the color palette of the resulting maps.