Log Streaming

| Key | Value |
| --- | ----- |
| Author(s) | Jordan.Brockopp |
| Reviewers | Neal.Coleman, David.May, Emmanuel.Meinen, Kelly.Merrick, David.Vader, Matthew.Fevold |
| Date | July 22nd, 2021 |
| Status | Accepted |

Background

Please provide a summary of the new feature, redesign or refactor:

This enhancement will enable viewing logs for a service/step in near real-time.

Currently, when watching logs for a service/step, they aren't provided in real time.

Instead, we use a buffer mechanism for controlling how logs are published:

  1. service/step starts running on a worker producing logs
  2. logs that are produced from the service/step are pushed to a buffer
  3. if the buffer exceeds 1000 bytes
    • publish the logs from the buffer via API call to the server
    • flush the buffer so we can push more logs to it from the service/step
  4. return to step 2 until the service/step is complete
  5. once the service/step is complete, publish remaining logs from the buffer

The end behavior produced by this method is that the logs appear in delayed chunks.

Please briefly answer the following questions:

  1. Why is this required?
    • provide compatible functionality with existing CI solutions
    • improve user experience when viewing logs for a service/step
    • improve ability to troubleshoot pipelines by seeing which parts take longer
  2. If this is a redesign or refactor, what issues exist in the current implementation?

The current implementation leaves a lot to be desired for user experience.

With the logs appearing in delayed chunks, it's difficult to troubleshoot pipelines.

This behavior can make processes appear to be "stuck" or "hung" when inspecting the logs.

This can make it almost impossible to determine if something is running or how long it takes to run.

  3. Are there any other workarounds, and if so, what are the drawbacks?

We could explore decreasing the limit we impose on the log buffer we use.

Currently, we're using 1000 bytes as that limit but we could set that to something smaller (100, 500, etc.).

This would mean that the worker uploads logs to the server more frequently which could improve the experience.

  4. Are there any related issues? Please provide them below if any exist.

#156

Design

Please describe your solution to the proposal. This includes, but is not limited to:

  • new/updated endpoints or url paths
  • new/updated configuration variables (environment, flags, files, etc.)
  • performance and user experience tradeoffs
  • security concerns or assumptions
  • examples or (pseudo) code snippets

Pipeline

To resolve the undesired behavior, we need to craft options that can fix it.

However, to evaluate those options, we first need a pipeline that exhibits the behavior.

We'll use the pipeline below to demonstrate it:

```yaml
version: "1"

steps:
  - name: logs
    image: alpine:latest
    pull: not_present
    commands:
      - sleep 1
      - echo "hello one"
      - sleep 2
      - echo "hello two"
      - sleep 3
      - echo "hello three"
      - sleep 4
      - echo "hello four"
      - sleep 5
      - echo "hello five"
```

DISCLAIMER:

The options below only cover updating code for the backend components for log streaming.

This means the proposal does not cover any updates to the go-vela/ui codebase.

The reason for this is that no UI changes are required to produce a streaming effect for logs.

This is because the UI already polls roughly every 5s in its current state.

However, this also means we leave room for improvement with regard to the user experience.

This experience could be augmented by making changes to the UI (and server?) along with the below options.

Option 1

This option involves updating the go-vela/pkg-executor codebase to upload logs on a regular time interval to simulate a streaming effect.

A brief explanation of how the code works:

  1. service/step starts running on a worker producing logs
  2. worker creates a channel to signal when to stop processing logs
  3. logs that are produced from the service/step are pushed to a buffer
  4. worker spawns a goroutine to start polling the buffer
    • (inside the goroutine) spawn an "infinite" for loop
      • (inside the for loop) sleep for 1s
      • if the channel is closed, terminate the goroutine
      • if the channel is not closed
        • publish the logs from the buffer via API call to the server
        • flush the buffer so we can push more logs to it from the service/step
  5. once the service/step is complete, worker closes the channel to terminate the goroutine

The code changes can be found below:

NOTE:

The time interval I chose to use in the above code is 1s.

However, we could choose any time interval we deem fit for this use-case.

Also, we'd likely make this time interval configurable to provide more flexibility.

Option 2

This option involves updating the go-vela/pkg-executor codebase to stream logs to the go-vela/server via HTTP.

To accomplish this, new endpoints were added to the server that can accept streaming connections.

Once a streaming connection is open, the server will capture and upload logs to the database on a regular time interval.

A brief explanation of how the code works:

  1. service/step starts running on a worker producing logs
  2. worker begins streaming logs via HTTP call to the server
  3. server accepts the streaming logs from the worker
  4. server creates a channel to signal when to stop processing streamed logs
  5. streamed logs are pushed to a buffer by the server
  6. server spawns a goroutine to start polling the buffer
    • (inside the goroutine) spawn an "infinite" for loop
      • (inside the for loop) sleep for 1s
      • if the channel is closed, terminate the goroutine
      • if the channel is not closed
        • publish the streamed logs from the buffer to the database
        • flush the buffer so we can push more logs to it from the service/step
  7. once the service/step is complete, worker terminates the HTTP call
  8. once the streaming is complete, server closes the channel to terminate the goroutine

The code changes can be found below:

NOTE:

The time interval I chose to use in the above code is 1s.

However, we could choose any time interval we deem fit for this use-case.

Also, we'd likely make this time interval configurable to provide more flexibility.

Option 3

This option involves updating the go-vela/pkg-executor codebase to stream logs to the go-vela/server via WebSocket.

To accomplish this, new endpoints were added to the server that can accept websocket connections.

Once a websocket connection is open, the server will capture and upload logs to the database on a regular time interval.

A brief explanation of how the code works:

  1. service/step starts running on a worker producing logs
  2. worker opens a websocket connection to the server
  3. server accepts the websocket connection for streaming logs from the worker
  4. worker begins streaming logs via websocket to the server
  5. server creates a channel to signal when to stop processing the streamed logs
  6. streamed logs from the websocket connection are pushed to a buffer by the server
  7. server spawns a goroutine to start polling the buffer
    • (inside the goroutine) spawn an "infinite" for loop
      • (inside the for loop) sleep for 1s
      • if the channel is closed, terminate the goroutine
      • if the channel is not closed
        • publish the streamed logs from the buffer to the database
        • flush the buffer so we can push more logs to it from the service/step
  8. once the service/step is complete, worker closes the websocket connection
  9. once the streaming is complete, server closes the channel to terminate the goroutine

The code changes can be found below:

NOTE:

The time interval I chose to use in the above code is 1s.

However, we could choose any time interval we deem fit for this use-case.

Also, we'd likely make this time interval configurable to provide more flexibility.

Implementation

Please briefly answer the following questions:

  1. Is this something you plan to implement yourself?

Yes

  2. What's the estimated time to completion?

1 month

Please provide all tasks (gists, issues, pull requests, etc.) completed to implement the design:

After some discussion amongst the team, we've decided to progress forward with Option 2.

This decision was driven by a vote, with the results provided here.

A concern raised during those discussions was how much CPU/RAM each option requires.

As we look to actually implement the functionality for Option 2, we should evaluate what changed in resource consumption (if any).

This will likely involve looking at how much CPU/RAM is consumed by both the server and worker when streaming logs with Option 2.

Questions

Please list any questions you may have:

N/A