
GIS-DataScience_Pipeline

Using the PySAL libraries to perform GeoSpatial Data Science on PostGIS-hosted data. We perform these tasks through Jupyter Notebooks, with a view to using these methods in a full-scale environment featuring a PostGIS server to host the datasets.

This series of Jupyter Notebooks goes through using Python libraries in a database administration context. The first two Notebooks are written in French; I am writing the latest ones in English, as a means to showcase my ability to present my work in English. Download links for the data are supplied in the Notebooks. The data is free and widely available.

It's a beginner's guide to setting up VS Code for SQL, Markdown and Python. It deals with installing the extensions we need and working inside your virtual environments.

The only file that is not a Jupyter Notebook! It's a Markdown file. I originally sent it as a demo to my peers, who were curious about my SQL proficiency. It features a single, very short SQL query, but a mighty one: I describe the purpose of the query and the thinking process leading to the actual script. I wrote it while I was new to VS Code, and I was pleased with its capabilities as a Postgres client, which came in handy for the jobs the next Notebooks accomplish.

In this Notebook, we use Python to create a hexagonal grid over a geographical polygon. There is a single parameter, the mesh size. We use the results of this script in the next Notebooks. As I discuss inside the Notebook, I could certainly have done that job in SQL if I had to in a production environment. Still, the challenge allowed me to set up a workflow from Python to PostGIS, and back, through SQLAlchemy's bound parameters.
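As a flavour of that workflow, here is a minimal sketch (not the notebook's exact code) of the round trip, assuming a local PostGIS database and a `study_area` table holding the source polygon:

```python
import geopandas as gpd
from sqlalchemy import create_engine, text

# Hypothetical connection string; adjust to your own PostGIS instance.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/gis")

# Pull the study polygon with a bound parameter rather than string formatting.
query = text("SELECT geom FROM study_area WHERE name = :name")
area = gpd.read_postgis(query, engine, geom_col="geom", params={"name": "Hérault"})

# ... build the hexagonal grid from `area` with the chosen mesh size ...
# hexgrid = make_hexgrid(area, mesh_size=2000)   # hypothetical helper

# Push the result back to PostGIS for the next Notebooks.
# hexgrid.to_postgis("hex_grid", engine, if_exists="replace", index=False)
```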

We use a complete dataset, manipulate various categories of interpolation variables (intensive and categorical), and combine the results into the hexagonal mesh we calculated before. We use the Tobler package, which is part of the PySAL library. This is done in a Jupyter Notebook, and the code builds upon the previous Notebooks.
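As a hedged sketch of what that interpolation looks like with Tobler (the variable names below are placeholders, not the notebook's actual columns):

```python
from tobler.area_weighted import area_interpolate

# GeoDataFrames loaded earlier, e.g. via gpd.read_postgis (placeholder names):
#   source_df : polygons carrying the original variables
#   hex_grid  : the hexagonal mesh built in the previous Notebook
interpolated = area_interpolate(
    source_df=source_df,
    target_df=hex_grid,
    intensive_variables=["population_density"],   # assumed column name
    categorical_variables=["land_use"],           # assumed column name
)
# `interpolated` is a GeoDataFrame whose rows are the hexagonal cells.
```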

We make use of Exploratory Spatial Data Analysis (ESDA), also part of the PySAL library, to come up with a quantitative measure of how much neighboring cells affect each other. Our example uses the same dataset we have been using throughout, focusing on the population distribution over an administrative division of France, the Hérault département.
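A minimal sketch of that measure with `esda`, assuming the hexagonal mesh carries a `population` column (a placeholder name):

```python
from libpysal.weights import Queen
from esda.moran import Moran

w = Queen.from_dataframe(hex_grid)    # contiguity weights between hexagonal cells
w.transform = "r"                     # row-standardise the weights
mi = Moran(hex_grid["population"], w)
print(mi.I, mi.p_sim)                 # global Moran's I and its pseudo p-value
```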

The purpose of this notebook is to display a map of our data, nested with histograms pointing at their corresponding hexagonal cell on the map. To do so, we do some data manipulation in Python: sorting lists and creating dictionaries of pandas DataFrames. We use GridSpec to arrange several plots inside a single figure, and contextily to enhance the visual experience.
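A sketch of that layout, assuming `hex_grid` (the mesh), `selected_cells` (a list of cell ids) and `histograms` (a dictionary of pandas DataFrames keyed by cell id) already exist — those names are mine, not the notebook's:

```python
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import contextily as cx

fig = plt.figure(figsize=(12, 8))
gs = GridSpec(3, 2, figure=fig, width_ratios=[2, 1])

# The map spans the whole left column.
ax_map = fig.add_subplot(gs[:, 0])
hex_grid.to_crs(epsg=3857).plot(column="population", ax=ax_map, alpha=0.6)
cx.add_basemap(ax_map, source=cx.providers.OpenStreetMap.Mapnik)

# One histogram per selected cell in the right column.
for i, cell_id in enumerate(selected_cells[:3]):
    ax = fig.add_subplot(gs[i, 1])
    histograms[cell_id].plot.hist(ax=ax, legend=False)
    ax.set_title(f"cell {cell_id}")

plt.tight_layout()
plt.show()
```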

PCA (Principal Component Analysis) is an unsupervised machine learning method that reduces the dimensionality of your data. It is the first method from the realm of data mining that we use and interpret. I adapt Ostwal Prasad's Notebook to my workflow so I can pull an interpretation of a selected subset of my PostGIS-hosted dataset from the previous Notebooks. This is the first Notebook featuring the scikit-learn library.
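A minimal scikit-learn PCA sketch under those assumptions (the column names stand in for the selected subset):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# `df` holds the numeric columns pulled from PostGIS (placeholder names).
X = StandardScaler().fit_transform(df[["population", "jobs", "dwellings"]])

pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```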

We use the findings of our PCA to let a Decision Tree model predict the classes for the whole dataset from a few manually picked examples. We use another scikit-learn estimator here, and we also use the Random Forest classifier to perform the same process. We use dictionaries of datasets to run a parallel analysis throughout on different subsets of our dataset.
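A hedged sketch of that step: train on the hand-labelled rows, then predict over the whole mesh (`labelled`, `features` and the column names are placeholders):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X_train = labelled[features]            # the manually picked examples
y_train = labelled["manual_class"]

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Predict classes for every hexagonal cell.
hex_grid["class_tree"] = tree.predict(hex_grid[features])
hex_grid["class_forest"] = forest.predict(hex_grid[features])
```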

We use yet another scikit-learn classifier. As with the Random Forest classifier, we use the k-Nearest Neighbors classifier to predict classes. We then plot elbow charts to determine our best value for k. We also make maps to get a sense of the resulting classification.
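A minimal sketch of the k selection, plotting the misclassification rate against k as an elbow chart (the train/test split names are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

ks = range(1, 21)
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))   # misclassification rate

plt.plot(ks, errors, marker="o")
plt.xlabel("k")
plt.ylabel("error rate")
plt.show()
```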

k-Means is our first dip into clustering. We perform k-Means clustering on a few different subsets. We then try to evaluate its performance with the Calinski-Harabasz index, as well as trying out geosilhouettes. Lastly, we try to get a feel for how consistent the clustering is over a few iterations.
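A hedged sketch of the clustering and its scoring (`X` stands for one standardised subset; the geosilhouette step is only hinted at in a comment):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(calinski_harabasz_score(X, labels))   # higher means better-separated clusters

# PySAL's esda also ships geosilhouette measures (boundary and path silhouettes)
# as a spatially aware counterpart to these scores.
```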

We made creative use of Moran's $I$ binary cases throughout. This allowed us to quickly evaluate differences in clustering by the color palette of the resulting maps.