Talk to the City is an application that:
ingests unstructured natural language, e.g.:
citizen surveys / public deliberations
newsgroups
forums
discussion archives
uses LLMs to extract and classify:
atomic claims
topics and subtopics
generates interactive reports
The Heal Michigan report is a video-based survey and an in-depth look into the challenges and daily lives of the Michigan community.
The Taiwan same-sex marriage report is a very large survey of the Taiwanese population, covering their views on same-sex marriage in Taiwan.
The Mina protocol report features the results of a user-survey carried out by Mina Zero Knowledge Protocol on their users.
Repo link: https://github.com/AIObjectives/talk-to-the-city-reports
On the technology front, tttc uses a dependency graph as its data and computation model: nodes connected by directed edges. Together the nodes and edges form a pipeline in which some nodes provide data whilst others provide computation steps. Computation consists of a topological sort (possible because the edges are directed and the graph is acyclic), after which the output of each node is passed to the inputs of its downstream nodes. At each step the node's "compute" function is invoked with the upstream data, and so on until all nodes have been computed.
Computation has two modes: "run" when the pipeline creator actively runs the pipeline, and "load" which is called when the resulting report page is loaded by a viewer.
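As a rough illustration of the computation model described above, here is a minimal, dependency-free sketch. The type names (`PipelineNode`, `PipelineEdge`, `ComputeMode`) and the functions `topologicalSort` and `runPipeline` are hypothetical and are not taken from the tttc codebase; the real compute functions are likely asynchronous and carry more state.

```typescript
// Hypothetical shapes for illustration only; the real tttc types may differ.
type ComputeMode = "run" | "load";

interface PipelineNode {
  id: string;
  // Receives the outputs of upstream nodes, keyed by upstream node id.
  compute: (inputs: Record<string, unknown>, mode: ComputeMode) => unknown;
}

interface PipelineEdge {
  source: string; // upstream node id
  target: string; // downstream node id
}

// Kahn's algorithm: repeatedly emit nodes whose upstream nodes are all done.
function topologicalSort(nodes: PipelineNode[], edges: PipelineEdge[]): PipelineNode[] {
  const inDegree = new Map<string, number>();
  for (const n of nodes) inDegree.set(n.id, 0);
  for (const e of edges) inDegree.set(e.target, (inDegree.get(e.target) ?? 0) + 1);

  const byId = new Map(nodes.map((n) => [n.id, n] as const));
  const ready = nodes.filter((n) => inDegree.get(n.id) === 0);
  const sorted: PipelineNode[] = [];

  while (ready.length > 0) {
    const node = ready.shift()!;
    sorted.push(node);
    for (const e of edges) {
      if (e.source !== node.id) continue;
      const remaining = inDegree.get(e.target)! - 1;
      inDegree.set(e.target, remaining);
      if (remaining === 0) ready.push(byId.get(e.target)!);
    }
  }
  if (sorted.length !== nodes.length) throw new Error("pipeline contains a cycle");
  return sorted;
}

// Runs every node in dependency order, feeding upstream outputs downstream.
function runPipeline(
  nodes: PipelineNode[],
  edges: PipelineEdge[],
  mode: ComputeMode
): Record<string, unknown> {
  const outputs: Record<string, unknown> = {};
  for (const node of topologicalSort(nodes, edges)) {
    const inputs: Record<string, unknown> = {};
    for (const e of edges) {
      if (e.target === node.id) inputs[e.source] = outputs[e.source];
    }
    outputs[node.id] = node.compute(inputs, mode);
  }
  return outputs;
}

// Usage: two data nodes feeding an adder node.
const demoNodes: PipelineNode[] = [
  {
    id: "sum",
    compute: (inputs) => {
      let total = 0;
      for (const value of Object.values(inputs)) total += value as number;
      return total;
    },
  },
  { id: "a", compute: () => 2 },
  { id: "b", compute: () => 3 },
];
const demoEdges: PipelineEdge[] = [
  { source: "a", target: "sum" },
  { source: "b", target: "sum" },
];

const demoResult = runPipeline(demoNodes, demoEdges, "run");
console.log(demoResult["sum"]); // 5
```

The `mode` argument is simply passed through to each compute function, so a node can decide for itself whether to do expensive work on "run" or reuse stored results on "load".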
Reusability with the MVC pattern
The graph is also used for the UI. Pipelines have two rendering modes: graph and standard. The graph view uses Svelteflow whilst the standard view performs a topological sort and renders the nodes in a single column.
Nodes use the MVC pattern. The compute functions hold the Model and the Controller. The graph UI components hold the View.
Since the MC and V are decoupled, we can use different combinations of MC <-> V to yield many combinations of compute + UI entities whilst minimizing code and maximizing reusability.
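The decoupling described above can be pictured as a small registry that pairs compute functions with views. This is only a sketch: `ComputeFn`, `ViewId`, `NodeBinding`, `nodeBindings` and `evaluate` are hypothetical names, the views are plain string stand-ins for Svelte components, and the project's actual bindings live in src/lib/node_types.ts.

```typescript
// All names below are hypothetical stand-ins, not the project's actual exports.
type ComputeFn = (input: unknown) => unknown; // Model + Controller
type ViewId = "TextView" | "TableView"; // View (a Svelte component in practice)

const stringify: ComputeFn = (input) => JSON.stringify(input);
const passthrough: ComputeFn = (input) => input;

// Each node type pairs one compute function with one view.
interface NodeBinding {
  compute: ComputeFn;
  view: ViewId;
}

// Two compute functions and two views already yield three distinct node types.
const nodeBindings: Record<string, NodeBinding> = {
  stringify_v0: { compute: stringify, view: "TextView" },
  preview_v0: { compute: passthrough, view: "TextView" },
  table_v0: { compute: passthrough, view: "TableView" },
};

// Looks up a node type, runs its compute function, and reports which view renders it.
function evaluate(typeName: string, input: unknown): { view: ViewId; output: unknown } {
  const binding = nodeBindings[typeName];
  if (!binding) throw new Error(`unknown node type: ${typeName}`);
  return { view: binding.view, output: binding.compute(input) };
}

console.log(evaluate("stringify_v0", { a: 1 })); // { view: 'TextView', output: '{"a":1}' }
```

Adding a new node type in this sketch means adding one registry entry, which is the reuse the MC <-> V split is aiming for.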
Our AI Pipeline Engineering Guide #1 takes the reader step by step over the process of creating a report pipeline.
Our user docs provide a very high level overview of the application for non-technical users.
$ git clone https://github.com/AIObjectives/talk-to-the-city-reports
The application can be hosted anywhere, although the persistence layer is currently coupled with Firestore and Google Cloud Storage.
Setting up a firebase instance
Since the app uses Firebase, you'll need a dev / staging firebase instance for local development, and for deployment. To do so, you have two options:
setting up your own instance.
using AOI's dev instance.
Deploying and maintaining Google Cloud Platform resources is fairly straightforward, although it requires the gcloud and gsutil CLI applications. Before getting started, make sure both are correctly installed and authenticated.
https://cloud.google.com/sdk/docs/install
Setting up your own instance
To set up your own instance:
Head over to https://console.firebase.google.com/
Click "add project" and enter a project name
Disable google analytics
Click "create project" & continue
Under "Get started by adding Firebase to your app" click on the web </> icon
Add an app nickname (same as earlier)
Click "firebase hosting" if you intend to deploy the app
Click "register app"
Copy .env.example to .env in the turbo directory
Copy & paste the values of the variables.
Click next.
npm install -g firebase-tools
firebase login
Setting up authentication
In the project overview, click on "Authentication"
Click on "set up sign-in method"
Click 'Google'
Click 'enable'
Select a support email address
Click 'save'
In the project overview, in the left side panel, click on "build"
Click on "firestore database"
Click "Create Database"
Select your region / multi region
Click 'next'
Click 'Start in test mode'
Click 'enable'
N.B. Firestore rules are still being finalized. Please contact @lightningorb to find out more.
Setting up Google Cloud Storage
In the project overview, in the left side panel, click on "build"
Click on 'storage'
Click 'get started'
Click 'start in test mode'
Click next
Click done
Install and configure the gsutil application
Save the following in a temporary cors.json file:
[
  {
    "origin": ["http://localhost:5173", "https://<optional_deployment_url>"],
    "method": ["GET", "HEAD", "DELETE"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
Run the following:
gsutil cors set cors.json gs://<project-name>.appspot.com
Setting up the service account
Authenticated backend endpoints require the service account file:
in the console for the project, click on project settings (the cog icon)
click on "service accounts"
click on Manage service account permissions
look for the email address that matches the project id
click actions
click create key
save the json private key to turbo/src/lib/service-account-pk.json
add the environment variable to your shell: export GOOGLE_APPLICATION_CREDENTIALS="src/lib/service-account-pk.json"
After launching the app for the first time, check your dev console, as it will contain a link for creating an index for datasets.
Talk to the City turbo uses pipeline templates, so end users do not have to construct their own graphs.
You can manage templates via http://localhost:5173/templates or https://tttc-turbo.web.app/templates .
The .env file contains a VITE_ADMIN variable that should be filled in with your user id, which can be acquired from the Firestore database.
Contact @brittneygallagher or @lightningorb for credentials files
save the provided .env in turbo/
optional steps for deployment:
save the provided service-account-pk.json in turbo/src/lib/
npm install -g firebase-tools
firebase login
Disclaimer: by using a shared dev instance, you acknowledge that the data you upload is shared by nature, and therefore no privacy guarantees can be made for it. For better privacy, consider setting up your own instance.
Once you're done making your changes, you can deploy to firebase with:
Firebase allows easily deploying to multiple sites that use the same project resources.
To specify a different site:
modify .hosting.site in turbo/firebase.json
run firebase deploy --only hosting:<alt-site-name>
Once you have set up a Firebase instance:
Node version tested: v18.0.0
$ cd talk-to-the-city-reports/turbo
$ npm install --legacy-peer-deps # or --force
$ npm run dev
Adding new node types
To add pipeline computation nodes:
create the compute function in src/lib/compute/
look for a suitable UI component in src/components/
In the vast majority of cases, you should be able to simply use an existing UI component. If a UI component does not suit your needs, then feel free to create a new one.
Bind the node's compute type with a component in src/lib/node_types.ts
add the node to src/lib/templates.ts
add node documentation to src/lib/docs
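To make the first step above more concrete, here is a minimal sketch of what a compute function for a new node might look like. Everything in it is an assumption for illustration: `WordCountNode`, `WordCountData`, `CsvRow` and the `word_count` column are hypothetical, and the real base types and method signatures in src/lib/compute/ may differ (for example, they may be asynchronous). It follows two conventions the existing test names suggest: the node marks itself as no longer dirty after computing, and it does not mutate its input.

```typescript
// Hypothetical row and node data shapes; not taken from the tttc codebase.
type CsvRow = Record<string, unknown>;

interface WordCountData {
  column: string; // name of the text column to count words in
  dirty: boolean; // true when the node needs recomputing
  output: CsvRow[];
}

class WordCountNode {
  data: WordCountData;

  constructor(data: WordCountData) {
    this.data = data;
  }

  // Adds a word_count column to a copy of each input row.
  compute(input: CsvRow[] | undefined): CsvRow[] | undefined {
    if (!input) return undefined;
    this.data.output = input.map((row) => {
      const text = String(row[this.data.column] ?? "").trim();
      const word_count = text === "" ? 0 : text.split(/\s+/).length;
      return { ...row, word_count }; // spread copies the row, so the input stays untouched
    });
    this.data.dirty = false;
    return this.data.output;
  }
}

// Usage
const wordCountNode = new WordCountNode({ column: "comment", dirty: true, output: [] });
console.log(wordCountNode.compute([{ comment: "hello big world" }]));
// [ { comment: 'hello big world', word_count: 3 } ]
```

A matching test file would typically assert the output, the cleared dirty flag, and that the input rows were left unchanged.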
Node UI component hierarchy
The primary UI components displayed to users are called "nodes" as they are part of a dependency graph.
The docs that appear when the user presses the ? mark are stored in: src/lib/docs
Adding text inside nodes:
The UI nodes are stored in ./turbo/src/components/graph/nodes.
DGNode is the 'base' node that all nodes reuse.
DefaultNode is an empty generic node, used when a node doesn't have a specialized UI.
DefaultNode is the generic file upload, which CSVNode and JSON reuse.
PromptNode is used by the "Argument Extraction", "Cluster Extraction" and similar nodes: essentially all nodes that require prompts to interact with GPTs.
Internationalization
src/lib/i18n/en.json
src/lib/zh-TW.json
Since we use internationalization, UI strings use:
<script lang="ts">
  import { _ as __ } from 'svelte-i18n';
</script>

<p>{$__('this_is_a_string')}</p>
The localized strings are then added to their respective src/lib/<lang>.json files.
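When adding strings to several locale files, it is easy to forget one. The following is a small, dependency-free sketch of a check for that; `missingKeys` is a hypothetical helper (not part of svelte-i18n or of this repo), and it assumes the locale files are flat key/value JSON objects, which may not hold for the real files.

```typescript
// Assumes flat locale files: { "key": "translated string", ... }.
type Messages = Record<string, string>;

// Returns the keys present in the reference locale but absent from the other one.
function missingKeys(reference: Messages, other: Messages): string[] {
  return Object.keys(reference).filter((key) => !(key in other));
}

// Inline stand-ins for the contents of the en and zh-TW locale files.
const enMessages: Messages = { this_is_a_string: "This is a string", report: "Report" };
const zhTWMessages: Messages = { this_is_a_string: "這是一個字串" };

console.log(missingKeys(enMessages, zhTWMessages)); // [ 'report' ]
```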
Tests & TDD
The core functionalities of the nodes are tested. Thus it is strongly recommended to run the tests, and keep them running (vitest uses a daemon with file watch) while you make changes.
brew install xorg-server
pip install chromedriver-autoinstaller selenium pyvirtualdisplay
DISPLAY=:99 python src/test/test_selenium.py
| Metric | Count |
| --- | --- |
| Total Test Suites | 106 |
| Passed Test Suites | 106 |
| Failed Test Suites | 0 |
| Pending Test Suites | 0 |
| Total Tests | 215 |
| Passed Tests | 215 |
| Failed Tests | 0 |
| Pending Tests | 0 |
| Todo Tests | 0 |
| Test | Status | Duration (ms) |
| --- | --- | --- |
| testing vimeo claim | passed | |
| testing yt claim | passed | |
| testing yt link has si | passed | |
| testing yt link has timestamp | passed | |
| testing yt link has si and timestamp | passed | |
| testing no video | passed | |
| testing no claim throws error | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should concatenate multiple CSV inputs into a single output array | passed | |
| should handle empty input arrays | passed | |
| should handle a single input array | passed | |
| should set dirty to false after compute | passed | |
| should return an empty array if no inputs are provided | passed | |
| should not mutate the input data | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| extract the given arguments | passed | |
| should not extract the arguments if no csv | passed | |
| should not extract the arguments if no open_ai_key and no GCS | passed | |
| should load from GCS if no open ai key | passed | |
| should not extract the arguments if no prompt and no system prompt | passed | |
| test GCS caching | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| extract the given arguments | passed | |
| extract the given arguments with missing rows in CSV | passed | |
| should not extract the arguments if no csv | passed | |
| should not extract the arguments if no open_ai_key and no GCS | passed | |
| should load from GCS if no open ai key | passed | |
| should not extract the arguments if no prompt and no system prompt | passed | |
| test GCS caching | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should return the cached output if not dirty and output exists | passed | |
| should read audio from GCS and update size and mime_type if download is true | passed | |
| should create an empty audio file if download is false | passed | |
| should set dirty to false after compute | passed | |
| should return undefined if gcs_path is not set | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| compute should set output to messages and dirty to false | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| extract the cluster | passed | |
| should not extract the cluster if no csv | passed | |
| should not extract the cluster if no open_ai_key | passed | |
| should not extract the cluster if no prompt and no system prompt | passed | |
| test GCS caching | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| extract the cluster | passed | |
| should not extract the cluster if no csv | passed | |
| should not extract the cluster if no open_ai_key | passed | |
| should not extract the cluster if no prompt and no system prompt | passed | |
| test GCS caching | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should concatenate comments until reaching 100 words, then start a new chunk | passed | |
| should start a new chunk when the interview field changes | passed | |
| should handle an empty input array | passed | |
| should not lose the last comment if it does not exceed 100 words | passed | |
| should correctly handle comments with exactly 100 words | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should correctly count tokens in input data | passed | |
| should not count tokens if input data length matches and node is not dirty | passed | |
| should count tokens if the input data is a string | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should process CSV data correctly from GCS | passed | |
| should handle empty CSV data from GCS | passed | |
| should handle rows with uneven columns from GCS | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| Find by compute type | passed | |
| Simple pipeline run test | passed | |
| Full pipeline run test | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| generates new columns | passed | |
| deletes columns | passed | |
| renames columns | passed | |
| returns undefined if input is undefined | passed | |
| handles multiple operations | passed | |
| does not modify input if no operations are specified | passed | |
| does not crash if input is empty | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should filter CSV data inclusively based on provided filters | passed | |
| should filter CSV data exclusively based on provided filters | passed | |
| should return all data if no filters are set | passed | |
| should handle multiple filters correctly | passed | |
| should set dirty to false after compute | passed | |
| should not mutate the input data | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should compute embeddings for input data | passed | |
| should not compute embeddings if no open_ai_key is provided | passed | |
| should load embeddings from GCS if data length matches and save_to_gcs is true | passed | |
| should handle no data input | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| general prompt | passed | |
| json prompt | passed | |
| json prompt with text | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| sets the output of the node to the input data | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should process data correctly with JQ filter | passed | |
| should handle invalid JQ filter | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should process data correctly with JQ filter | passed | |
| should handle invalid JQ filter | passed | |
| should return an empty array when no matches found | passed | |
| should process data correctly with a complex JQ filter | passed | |
| should return undefined if the input is null or undefined | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should process JSON data correctly from GCS | passed | |
| should handle invalid JSON data from GCS | passed | |
| should update dirty state correctly | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| evaluates JSONata expressions | passed | |
| returns undefined if no expression is provided | passed | |
| catches errors when evaluating expressions | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should let all data pass through if number is left blank | passed | |
| should limit the number of rows correctly, for an object | passed | |
| should return all rows if limit is greater than number of rows | passed | |
| should return an empty array if input is empty | passed | |
| should not mutate the input node | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should set markdown data if input is a string | passed | |
| should combine multiple string inputs with separation | passed | |
| should wrap non-string inputs within code block | passed | |
| should handle an empty input object | passed | |
| should preserve the order of inputs when combining | passed | |
| should stringify and wrap arrays in code blocks | passed | |
| should throw an error if input data contains circular references | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| merges cluster_extraction and argument_extraction data | passed | |
| does not merge if cluster_extraction data is missing | passed | |
| does not merge if argument_extraction data is missing | passed | |
| does not merge if cluster_extraction data has no topics | passed | |
| sets node data output to the merged data and dirty to false after merge | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| merges cluster extraction data | passed | |
| does not merge if cluster extractions are missing | passed | |
| uses cached data if available and not dirty | passed | |
| does not merge if no open_ai_key is provided | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should merge cluster extractions into a single output | passed | |
| should handle empty input data | passed | |
| should not process if no open_ai_key is provided | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should return the cached output if not dirty and output exists | passed | |
| should read audio from GCS and update size and mime_type if download is true | passed | |
| should create empty audio files if download is false | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should split CSV into chunks and process each chunk | passed | |
| should handle empty CSV input | passed | |
| should not process if no open_ai_key is provided | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should process multiple prompts | passed | |
| should process multiple differing prompts | passed | |
| should join outputs if join_output is true | passed | |
| should not process if no open_ai_key is provided | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should process multiple audio files | passed | |
| should handle empty audio input | passed | |
| should update node_info with results from WhisperNode computations | passed | |
| should remove entries from node_info that are not in the audio list | passed | |
| should mark node_info entry as dirty if WhisperNode output is null | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should set the key in cookies if the UI key is valid | passed | |
| if ui key is set but invalid use local key | passed | |
| should set the node text to "Invalid key" if the UI key is not valid and there is no local key | passed | |
| should not mutate the node if the UI key and local key are both valid | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| filters participants based on the provided name | passed | |
| removes subtopics with no claims after filtering | passed | |
| removes topics with no subtopics after filtering | passed | |
| returns undefined if input data does not contain topics | passed | |
| does not filter claims if interview key is missing | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should set the key in cookies if the UI key is provided | passed | |
| should use the local key from cookies if available | passed | |
| should return an empty string if no key is provided or available in cookies | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should initialize Pinecone with the provided API key | passed | |
| should create a new index if it does not exist and upsert embeddings | passed | |
| should list Pinecone indexes | passed | |
| should provide tools for querying Pinecone index | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should execute python script and return outputData | passed | |
| should be able to pass input to outputData | passed | |
| test passing in complex data from jsonapi | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should execute python script and return outputData | passed | |
| should be able to pass input to outputData | passed | |
| should be able to make get requests to jsonapi | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should execute python script and return output | passed | |
| should handle fetch errors gracefully | passed | |
| should handle invalid JSON response | passed | |
| should handle non-string JSON response | passed | |
| should update node data output with the response | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| test node registeration | passed | |
| Load all nodes | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should set the output of the node to the input data | passed | |
| should handle empty input data | passed | |
| should not mutate the input node | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| sets the output of the node to the input data | passed | |
| handles translation | passed | |
| uploads data to GCS on run | passed | |
| reads data from GCS on load if gcs_path is set and input data is empty | passed | |
| clears gcs_path if readFileFromGCS throws an error | passed | |
| sets message if merge and csv data are present | passed | |
| sets message to empty string if merge or csv data are missing | passed | |
| does not mutate the input node | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| scores the relevance of arguments | passed | |
| uses cached data if available and not dirty | passed | |
| does not score if argument_extraction data is missing | passed | |
| does not score if open_ai_key is missing | passed | |
| does not score if prompts are missing | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should set the key in cookies if the UI key is provided | passed | |
| should use the local key from cookies if available | passed | |
| should return an empty string if no key is provided or available in cookies | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should process CSV data correctly from GCS | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should correctly stringify input data | passed | |
| should return input if it cannot be stringified | passed | |
| should handle different types of input | passed | |
| should not mutate the input node | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should generate summaries for topics and subtopics | passed | |
| should load summaries from GCS if data length matches | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| integer node | passed | |
| adder node | passed | |
| dataset run adder | passed | |
| dataset run multi input multi output | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should convert a single text input to CSV format | passed | |
| should convert multiple text inputs to CSV format | passed | |
| should handle empty text input | passed | |
| should split text into chunks if it exceeds the number of tokens | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| translates the input data | passed | |
| loads translations from GCS if data has not changed | passed | |
| does not translate if required inputs are missing | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should return unique values based on the specified property | passed | |
| should return an empty array if input is empty | passed | |
| should return undefined if no property is specified | passed | |
| should set dirty to false after compute | passed | |
| should not mutate the input data | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| Test secondsToHHMMSS | passed | |
| Test secondsToHHMMSS with string | passed | |
| Test HHMMSSToSeconds | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should load from cache if data is not dirty and gcs_path is set | passed | |
| should load from GCS if data is not dirty, gcs_path is set, and output is empty and audio size matches | passed | |
| should transcribe audio and upload to GCS if data is dirty | passed | |
| should return undefined and set message if open_ai_key is missing | passed | |
| should convert transcription to internal format if response_format is custom | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should load from cache if data is not dirty and gcs_path is set | passed | |
| should load from GCS if data is not dirty, gcs_path is set, and output is empty and audio size matches | passed | |
| should transcribe audio and upload to GCS if data is dirty | passed | |
| should return undefined and set message if open_ai_key is missing | passed | |
| should convert transcription to internal format if response_format is custom | passed | |

| Test | Status | Duration (ms) |
| --- | --- | --- |
| should execute function in workerpool | passed | |
| should execute delayed function in workerpool | passed | |