Merge branch 'develop' into 2435-hhs-env-vars
elipe17 authored Nov 1, 2024
2 parents d9d2680 + f2f91ea commit d823eff
Showing 56 changed files with 6,562 additions and 3,108 deletions.
2 changes: 1 addition & 1 deletion .circleci/build-and-test/commands.yml
@@ -25,7 +25,7 @@
fi
echo "export CURRENT_FLAG=$CURRENT_FLAG" >> $BASH_ENV
- run:
name: Upload code coverage report if target branch
name: Upload code coverage report of target branch
command: codecov -t "$CODECOV_TOKEN" -f <<parameters.coverage-report>> -F "$CURRENT_FLAG"

install-nodejs-machine:
1 change: 1 addition & 0 deletions .gitconfig
@@ -14,3 +14,4 @@
allowed = .git/config:.*
allowed = .gitconfig:.*
allowed = .*DJANGO_SECRET_KEY=.*
allowed = ./tdrs-backend/plg/loki/manifest.yml:*
5 changes: 5 additions & 0 deletions .gitignore
@@ -109,4 +109,9 @@ cypress.env.json

# Patches
*.patch

# Logs
*.log

# DB seeds
tdrs-backend/*.pg
9 changes: 2 additions & 7 deletions Taskfile.yml
@@ -2,11 +2,6 @@ version: '3'

tasks:

upload-kibana-objs:
desc: Upload dashboards to Kibana server
cmds:
- 'curl -X POST localhost:5601/api/saved_objects/_import -H "kbn-xsrf: true" --form file=@tdrs-backend/tdpservice/search_indexes/kibana_saved_objs.ndjson'

create-network:
desc: Create the external network
cmds:
@@ -251,7 +246,7 @@ tasks:
desc: Open a shell in the frontend container
dir: tdrs-frontend
cmds:
- docker-compose -f docker-compose.yml exec tdp-frontend sh
- docker-compose -f docker-compose.yml exec tdp-frontend bash

up:
desc: Start both frontend and backend web servers
@@ -268,4 +263,4 @@ tasks:
help:
desc: Show this help message
cmds:
- task --list
- task --list
5 changes: 4 additions & 1 deletion codecov.yml
@@ -42,4 +42,7 @@ flags:
carryforward: true

ignore:
- "tdrs-backend/tdpservice/scheduling/db_backup.py"
- "tdrs-backend/tdpservice/scheduling/db_backup.py"
- "tdrs-backend/tdpservice/search_indexes/admin/mulitselect_filter.py"
- "tdrs-backend/tdpservice/email/helpers/account_access_requests.py"
- "tdrs-backend/tdpservice/search_indexes/admin/filters.py"
302 changes: 301 additions & 1 deletion docs/Security-Compliance/diagram.drawio

Large diffs are not rendered by default.

Binary file modified docs/Security-Compliance/diagram.png
@@ -0,0 +1,68 @@
# 22. Monitoring Application Health and Performance

Date: 2024-09-30

## Status

Pending

## Context
Historic feedback highlighted an ongoing desire for improved alerting and monitoring mechanisms, originating in issue [#831](https://github.com/raft-tech/TANF-app/issues/831) circa 2021. Currently, our cloud platform has limited logging features and user interface issues, leading to a "blindness" to errors and stack traces that have occurred and ultimately impairing our ability to maintain system stability. Additionally, the existing dashboards only offer live performance data, with no data over time or archives. Without context for either performance or system logging, we cannot determine whether system behavior is anomalous or erroneous.

Additionally, we have experienced critical blocking issues related to our updates to both Elasticsearch (ES) and PostgreSQL, which have compounded the need for more proactive alerting and load-testing in lower environments. Without timely notifications, we risk delays in addressing failures that could escalate into more significant problems.


## Decision
We will build out a suite of tools in accordance with industry best practices to monitor our applications. Implementing a comprehensive monitoring and alerting ecosystem will not only help in identifying errors in real time but also enable us to establish benchmarks based on historical data. This approach will foster a more proactive response strategy, ensuring that potential issues are mitigated before they impact our users and, where users have already been impacted, that system owners and system admins are aware of it.

<p style="text-align:center; margin:0; padding:0;">Cloud Environments Workflow</p>

![Environments](../diagrams/TDP_Environments.png)

### Why Sentry
Sentry captures unhandled exceptions and records detailed context about them, including error messages, stack traces, affected URLs, and user data. Such information is essential for diagnosing the cause of an error.

Additionally, as can be seen in the image below, the following information is available:

- Frequency: how often the error occurs
- Timeline: when the error occurred over a given period
- Ticketing: a ticket can be created and assigned automatically
- Variables at each step of the stack trace, which is very important for debugging

<p style="text-align:center; margin:0; padding:0;">Issues with filter enabled</p>

![Issues with filter enabled](../images/sentry/1.%20Issues%20with%20filter%20enabled.png)

<p style="text-align:center; margin:0;padding:0;">Detail exceptions</p>

![Detail exceptions](../images/sentry/3.%20detail%20about%20exception.png)

<p style="text-align:center; margin:0; padding:0;">Full stack trace of the exceptions</p>

![Full stack trace of the exceptions](../images/sentry/4.%20full%20stack%20trace%20of%20the%20exceptions.png)


Performance monitoring in Sentry can greatly benefit the backend application by providing real-time insight into how the TANF app is performing. Sentry tracks metrics such as response time, database queries, and external API calls. These metrics will help identify performance bottlenecks in the backend app.

A distinguishing ability of Sentry is that it links performance issues and groups them together. This lets us visualize areas that consistently perform poorly, so that we can swarm and resolve the most frequent, highest-impact offenders. Sentry also detects issues with web transactions, database queries, and function regressions (where the duration of a function has increased).
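
The two ideas in the paragraph above, ranking the most frequent slow transactions and flagging a function whose duration has grown, can be illustrated with a minimal standard-library sketch. This is not Sentry's algorithm or API; the endpoint names, the 500 ms threshold, the 1.25x tolerance, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median

# Hypothetical sample data: (endpoint, duration in milliseconds).
events = [
    ("/v1/data_files/", 950),
    ("/v1/data_files/", 1200),
    ("/v1/login/", 80),
    ("/v1/reports/", 700),
    ("/v1/data_files/", 1100),
    ("/v1/reports/", 90),
]


def top_offenders(events, slow_ms=500):
    """Count slow events per endpoint, most frequent first."""
    counts = Counter(endpoint for endpoint, ms in events if ms >= slow_ms)
    return counts.most_common()


def is_regression(baseline_ms, recent_ms, tolerance=1.25):
    """True when the recent median exceeds the baseline median by more than `tolerance` times."""
    if not baseline_ms or not recent_ms:
        return False
    return median(recent_ms) > tolerance * median(baseline_ms)


print(top_offenders(events))
print(is_regression([110, 120, 115, 130, 118], [180, 175, 190, 185, 200]))
```

Grouping by endpoint before ranking is what makes the "most frequent offenders" visible; comparing medians rather than single samples keeps one slow request from being reported as a regression.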

### Why Prometheus-Loki-Grafana

Grafana shall provide a visualization dashboard for these tools, which will collect and aggregate performance metrics and system logs and allow deeper analysis of all aspects of our systems: frontend, proxies, backend, databases, and even networking. Additionally, the development team will seek to hone a proactive alerting system for out-of-threshold issues and errors to improve visibility of system issues.

The storing of system logs will allow more expedient troubleshooting and debugging, which is currently out of reach with Cloud.gov's existing Kibana interface for logging. The ability to find and correlate log events is critical to technical analysis of faults, performance degradation, and the system's overall health.

By having our monitoring ecosystem take in performance metrics, we will gather them over time as opposed to seeing only the live snapshot that is currently provided. This will allow us to spot anomalous or out-of-bounds behaviors such as out-of-memory errors, high memory usage, CPU spikes, and disk thrashing.
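
To show why a history of samples matters more than a live snapshot, here is a minimal sketch that flags out-of-bounds behavior only when it is sustained. It is not Prometheus or Grafana configuration; the sample values, the 85% threshold, the window of three samples, and the function name are assumptions.

```python
from collections import deque


def sustained_breaches(samples, threshold, window=3):
    """Return the indexes at which the last `window` samples were all above `threshold`."""
    recent = deque(maxlen=window)
    alerts = []
    for index, value in enumerate(samples):
        recent.append(value)
        if len(recent) == window and all(v > threshold for v in recent):
            alerts.append(index)
    return alerts


# Hypothetical memory utilisation (%) sampled at a fixed interval.
memory_pct = [41, 43, 95, 44, 88, 91, 93, 94, 60]
print(sustained_breaches(memory_pct, threshold=85))  # -> [6, 7]
```

The isolated spike at index 2 raises no alert, while the sustained run from index 4 onward does. Requiring several consecutive breaches is one common way to keep alerting from becoming noisy.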

Finally, having all of this data in one place will allow technical staff to easily cross-reference given time periods with problematic performance, ongoing issues, or error stack traces, leading to a holistic view of all of our applications, both in lower-tier development sites and in critical production.
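
Cross-referencing a time period amounts to selecting every log line that falls within a margin of a moment of interest, such as a performance spike. The sketch below shows the idea with the standard library only; the log messages, timestamps, two-minute margin, and function name are hypothetical.

```python
from datetime import datetime, timedelta


def logs_near(log_entries, moment, margin=timedelta(minutes=2)):
    """Return the messages logged within `margin` of `moment`."""
    return [msg for ts, msg in log_entries if abs(ts - moment) <= margin]


# Hypothetical log entries: (timestamp, message).
logs = [
    (datetime(2024, 9, 30, 12, 0, 5), "backend: request started"),
    (datetime(2024, 9, 30, 12, 4, 40), "backend: worker timeout"),
    (datetime(2024, 9, 30, 12, 5, 10), "postgres: connection pool exhausted"),
    (datetime(2024, 9, 30, 12, 30, 0), "backend: healthy"),
]

# Hypothetical moment at which a metrics dashboard showed a spike.
spike = datetime(2024, 9, 30, 12, 5, 0)
print(logs_near(logs, spike))
```

With metrics and logs in one place, this kind of lookup becomes a single query rather than a manual comparison across separate tools.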

## Consequences

* Increased platform costs for running these tools
* Time and effort maintaining and configuring these new systems
* "Noisy" notifications from out-of-tune alerting
* Efforts made towards security compliance as these systems have intimate access to our systems and data
* Learning curve for technical staff

## Notes
Given the prohibitive costs of self-hosting Sentry in Cloud.gov, we propose using Sentry's Cloud SaaS offering, which will alter the [boundary diagram](../../Security-Compliance/diagram.png). The other tools in use (the PLG stack and associated tooling) will be self-hosted and maintained by the technical staff at both Raft and OFA.
@@ -1,4 +1,4 @@
![image](https://github.com/user-attachments/assets/97e8fbf1-954b-4f3b-9000-ecb3b0f1d0d9)![image](https://github.com/user-attachments/assets/39a7b775-2771-438f-a270-d09dc263ef3c)# Stakeholders and Personas
# Stakeholders and Personas

Last updated for [Issue #3100](https://github.com/raft-tech/TANF-app/issues/3100)

14 changes: 7 additions & 7 deletions scripts/cf-check.sh
@@ -3,15 +3,15 @@ set -e
if command -v cf /dev/null 2>&1; then
echo The command cf is available
else

apt-get update
apt-get install wget gnupg2 apt-transport-https

wget -q -O - https://packages.cloudfoundry.org/debian/cli.cloudfoundry.org.key | sudo apt-key add -

echo "deb https://packages.cloudfoundry.org/debian stable main" | sudo tee /etc/apt/sources.list.d/cloudfoundry-cli.list

apt-get update
apt-get install cf7-cli
NEXUS_ARCHIVE="cf7-cli_7.7.13_linux_x86-64.tgz"
NEXUS_URL="https://tdp-nexus.dev.raftlabs.tech/repository/tdp-bin/cloudfoundry-cli/$NEXUS_ARCHIVE"
curl $NEXUS_URL -o $NEXUS_ARCHIVE # prefers anonymous, use of -u failed.
tar xzf $NEXUS_ARCHIVE
mv ./cf /usr/local/bin/
chmod +x /usr/local/bin/cf
cf --version

fi
25 changes: 25 additions & 0 deletions scripts/deploy-backend.sh
@@ -114,6 +114,27 @@ update_kibana()
cf run-task $CGAPPNAME_BACKEND --command "$CMD" --name kibana-obj-upload
}

prepare_promtail() {
pushd tdrs-backend/plg/promtail
CONFIG=config.yml
yq eval -i ".scrape_configs[0].job_name = \"system-$backend_app_name\"" $CONFIG
yq eval -i ".scrape_configs[0].static_configs[0].labels.job = \"system-$backend_app_name\"" $CONFIG
yq eval -i ".scrape_configs[1].job_name = \"backend-$backend_app_name\"" $CONFIG
yq eval -i ".scrape_configs[1].static_configs[0].labels.job = \"backend-$backend_app_name\"" $CONFIG
popd
}

update_plg_networking() {
# Need to switch the space after deploy since we're not always in dev space to handle specific networking from dev
# PLG apps to the correct backend app.
cf target -o hhs-acf-ofa -s tanf-dev
cf add-network-policy prometheus "$CGAPPNAME_BACKEND" -s "$CF_SPACE" --protocol tcp --port 8080
cf target -o hhs-acf-ofa -s "$CF_SPACE"

# Promtail needs to send logs to Loki
cf add-network-policy "$CGAPPNAME_BACKEND" loki -s "tanf-dev" --protocol tcp --port 8080
}

update_backend()
{
cd tdrs-backend || exit
@@ -152,6 +173,9 @@ update_backend()
# Add network policy to allow frontend to access backend
cf add-network-policy "$CGAPPNAME_FRONTEND" "$CGAPPNAME_BACKEND" --protocol tcp --port 8080

# Add PLG routing
update_plg_networking

if [ "$CF_SPACE" = "tanf-prod" ]; then
# Add network policy to allow backend to access tanf-prod services
cf add-network-policy "$CGAPPNAME_BACKEND" clamav-rest --protocol tcp --port 9000
@@ -238,6 +262,7 @@ else
CYPRESS_TOKEN=$CYPRESS_TOKEN
fi

prepare_promtail
if [ "$DEPLOY_STRATEGY" = "rolling" ] ; then
# Perform a rolling update for the backend and frontend deployments if
# specified, otherwise perform a normal deployment
5 changes: 3 additions & 2 deletions scripts/deploy-frontend.sh
@@ -13,6 +13,7 @@ CF_SPACE=${5}
ENVIRONMENT=${6}

env=${CF_SPACE#"tanf-"}
frontend_app_name=$(echo $CGHOSTNAME_FRONTEND | cut -d"-" -f3)

# Update the Kibana name to include the environment
KIBANA_BASE_URL="${CGAPPNAME_KIBANA}-${env}.apps.internal"
@@ -52,7 +53,7 @@ update_frontend()

cf set-env "$CGHOSTNAME_FRONTEND" BACKEND_HOST "$CGHOSTNAME_BACKEND"
cf set-env "$CGHOSTNAME_FRONTEND" KIBANA_BASE_URL "$KIBANA_BASE_URL"

npm run build:$ENVIRONMENT
unlink .env.production
mkdir deployment
@@ -86,7 +87,7 @@ update_frontend()
else
cf map-route "$CGHOSTNAME_FRONTEND" app.cloud.gov --hostname "${CGHOSTNAME_FRONTEND}"
fi

cd ../..
rm -r tdrs-frontend/deployment
}
4 changes: 4 additions & 0 deletions scripts/localstack-setup.sh
@@ -5,3 +5,7 @@ awslocal s3api create-bucket --bucket $AWS_BUCKET --region $AWS_REGION_NAME

# Enable object versioning on the bucket
awslocal s3api put-bucket-versioning --bucket $AWS_BUCKET --versioning-configuration Status=Enabled

# Add bucket for Loki to store logs
awslocal s3api create-bucket --bucket loki-logs --region $AWS_REGION_NAME
awslocal s3api put-bucket-versioning --bucket loki-logs --versioning-configuration Status=Enabled
15 changes: 10 additions & 5 deletions tdrs-backend/Dockerfile
@@ -9,18 +9,23 @@ ENV DJANGO_SETTINGS_MODULE=tdpservice.settings.local
ENV DJANGO_CONFIGURATION=Local
# Allows docker to cache installed dependencies between builds
COPY Pipfile Pipfile.lock /tdpapp/
COPY sources.list /etc/apt/sources.list
WORKDIR /tdpapp/
# Download latest listing of available packages:
RUN apt-get -y update
# Upgrade already installed packages:
RUN apt-get -y upgrade
# Install packages:
RUN apt-get install -y gcc graphviz graphviz-dev libpq-dev python3-dev vim curl ca-certificates

# Postgres client setup
RUN apt --purge remove postgresql postgresql-* && apt install -y postgresql-common curl ca-certificates && install -d /usr/share/postgresql-common/pgdg && \
#RUN bash -c 'echo "deb [trusted=yes] https://tdp-nexus.dev.raftlabs.tech/repository/apt-proxy-postgres/ bullseye-pdpg main" >> /etc/apt/sources.list'
RUN apt-get update -y && apt-get upgrade -y
RUN apt install -y postgresql-common && install -d /usr/share/postgresql-common/pgdg && \
sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc trusted=yes] https://tdp-nexus.dev.raftlabs.tech/repository/apt-proxy-postgres/ bullseye-pgdg main" >> /etc/apt/sources.list' && \
curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc && \
sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] https://apt.postgresql.org/pub/repos/apt bullseye-pgdg main" > /etc/apt/sources.list.d/pgdg.list' && \
apt -y update && apt install postgresql-client-15 -y
# Install packages:
RUN apt install -y gcc graphviz graphviz-dev libpq-dev python3-dev vim
apt -y update && apt -y upgrade && apt install postgresql-client-15 -y

# Install pipenv
RUN pip install --upgrade pip pipenv
RUN pipenv install --dev --system --deploy
5 changes: 3 additions & 2 deletions tdrs-backend/Pipfile
@@ -1,6 +1,7 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
trusted-host = "https://tdp-nexus.dev.raftlabs.tech/"
url = "https://tdp-nexus.dev.raftlabs.tech/repository/pypi-proxy/simple"
verify_ssl = true

[dev-packages]
@@ -62,4 +63,4 @@ django_prometheus = "==2.3.1"
sentry-sdk = "==2.11.0"

[requires]
python_version = "3.10.8"
python_version = "3.10.8"