This repository contains all the data-related logic for GeoExplorer, an AI-driven search application for Berlin's geo data. It includes:
- A Node.js scraper script to collect the data.
- A script that creates embeddings and writes them to a database using the OpenAI and Supabase APIs.
- A Jupyter notebook to analyze and export the embeddings.
The scraper (located in the scraper folder) collects all WFS- and WMS-related metadata from Berlin's Open Data Portal and Berlin's Geo Data Portal (FisBroker) and writes a markdown file (.mdx) for each dataset. The scraper runs in multiple steps, which you can enable or disable by (un)commenting them in index.js, as sketched below.
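For orientation, index.js might be organized roughly like this (a minimal sketch; the step names below are hypothetical placeholders, check the actual file for the real ones):

```js
// index.js — illustrative sketch of the scraper's step structure.
// Enable or disable a step by (un)commenting its call; the function
// names here are hypothetical, not the actual exports.
import { fetchPortalMetadata } from "./steps/fetchPortalMetadata.js";
import { writeMdxFiles } from "./steps/writeMdxFiles.js";

// Step 1: collect WFS & WMS metadata from both portals
const datasets = await fetchPortalMetadata();

// Step 2: write one .mdx file per dataset
await writeMdxFiles(datasets);

// Step 3 (optional): update existing .mdx files instead of recreating them
// await updateMdxFiles(datasets);
```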
Before running the scraper, make sure Node.js and npm are installed, then install the dependencies:
npm i
Run the scraper like so:
npm run scrape
Or if you want to update the data:
npm run scrape:update
1. Set up a local Supabase DB (optional)
The initialization of the database, including the setup of the pgvector extension, is stored in the supabase/migrations folder and is applied automatically to your local Postgres instance when you run npx supabase start.
Make sure you have Docker installed and running locally. Then run
npx supabase start
This will set up a local Supabase DB for you.
2. Provide connection details
Duplicate the .env.example file and rename it to .env. Then provide either your local connection details or those of your hosted Supabase project, depending on where you want to save your data.
- To retrieve your local NEXT_PUBLIC_SUPABASE_ANON_KEY and SUPABASE_SERVICE_ROLE_KEY, run:
npx supabase status
You will also need to provide an API key for the OpenAI API.
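For orientation, a filled-in .env might look roughly like this (the URL and OpenAI key names are assumptions; .env.example is the authoritative reference):

```
NEXT_PUBLIC_SUPABASE_URL=http://localhost:54321
NEXT_PUBLIC_SUPABASE_ANON_KEY=<anon key from npx supabase status>
SUPABASE_SERVICE_ROLE_KEY=<service_role key from npx supabase status>
OPENAI_API_KEY=<your OpenAI API key>
```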
3. Generate embeddings
This script requests an embedding for each markdown file created earlier and writes it to your Supabase DB (see the sketch after the note below). To run the script:
npm run embeddings
Note: Make sure Supabase is running. To check, run npx supabase status. If it is not running, run npx supabase start.
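Conceptually, the script does something like the following for each .mdx file (a simplified sketch using the openai and @supabase/supabase-js clients; the file path, embedding model, and table/column names are assumptions, not the script's actual values):

```js
import fs from "node:fs/promises";
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL,
  process.env.SUPABASE_SERVICE_ROLE_KEY
);

// Read one of the .mdx files produced by the scraper (path is illustrative)
const content = await fs.readFile("scraper/data/example-dataset.mdx", "utf8");

// Request an embedding for the file content (model name may differ)
const response = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: content.replaceAll("\n", " "),
});

// Write the vector into the pgvector column (table/column names assumed)
const { error } = await supabase
  .from("nods_page_section")
  .insert({ content, embedding: response.data[0].embedding });
if (error) throw error;
```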
4. Link your local development project to a hosted Supabase project (optional)
You can do this like so (your data will not be uploaded):
npx supabase login
npx supabase link --project-ref your-project-ref
npx supabase db push
The link step will prompt you for your database password (SUPABASE_DB_PASSWORD).
5. Analyze and export the embeddings
Go to the graphical interface of your Supabase DB (e.g., http://localhost:54323/project/default/editor) and export the nods_page_section_rows table as a .csv file. Save the file in the createGraph folder. Then install Jupyter Notebook via pip if you haven't installed it yet:
pip install notebook
Run the notebook like so:
npm run embedgraph
This will open a new window in your browser.
You can also access the notebook directly via http://localhost:8888/notebooks/embeds.ipynb.
Run the notebook. It will show a scatterplot of the embedding vectors reduced to two dimensions.
At the bottom of the notebook, you will find a link called tsne_data.csv. It lets you download the 2D coordinates together with the dataset titles. This data is used to update the scatterplot displayed in GeoExplorer.
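As an illustration, the exported file can be consumed like this (column names and order are assumptions based on the description above):

```js
import fs from "node:fs";

// Parse tsne_data.csv into { title, x, y } points for the scatterplot.
// Naive comma split: adjust if titles can contain commas.
const [, ...rows] = fs
  .readFileSync("createGraph/tsne_data.csv", "utf8")
  .trim()
  .split("\n");

const points = rows.map((row) => {
  const [title, x, y] = row.split(",");
  return { title, x: Number(x), y: Number(y) };
});

console.log(points.slice(0, 3));
```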
The Notebook script is based on OpenAI guides.
Before you create a pull request, please open an issue so we can discuss your changes.
Thanks goes to these wonderful people (emoji key):
- Hans Hack 💻 🖋 🔣 📖 📆
- alsino 💻
This project follows the all-contributors specification. Contributions of any kind welcome!
Texts and content available as CC BY.