Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add monitor application to EVE #4410

Merged
merged 11 commits into from
Nov 28, 2024

Conversation

rucoder
Copy link
Contributor

@rucoder rucoder commented Oct 31, 2024

UPDATE: all comments were fixed. Remaining 3 tasks will be converted into issues and fixed later
UPDATE 2: fixed last comments from @eriknordmark.

This PR introduces a monitor agent to communicate with rust application that implements a TUI (TextUI) and allows the user to monitor system status and perform (as of now) two operaions:

  1. Change IP address of Ethernet interface (DHCP <-> static)
  2. Set server URL if the node is not yet onboarded

The monitor agent opens a Unix socket and waits for incoming connection. When the connection is established the NodeStatus structure is sent to the other party to provide current node status. Then other information is sent periodically as soon as it become available in PubSub

Current limitations/issues

  • network: yes is used in build.yaml while we are working on proper source vendoring approach for rust application
  • there is no way to completely remove manual DPC from the list
  • EVE structures should not be sent as-is to Rust. Every time the structure is changed on Eve side there is no way to catch this situation at build time. Stable EVE-Rust API is needed.
  • define git commit SHA in Dockerfile when Fix SetDPC request and implement SetServer request and UI eve-monitor-rs#3 is merged
  • write a documentation in pkg/pillar/docs/monitor.md

Copy link

codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 20.93%. Comparing base (dd27fe1) to head (4734622).
Report is 20 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4410   +/-   ##
=======================================
  Coverage   20.93%   20.93%           
=======================================
  Files          13       13           
  Lines        2895     2895           
=======================================
  Hits          606      606           
  Misses       2163     2163           
  Partials      126      126           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

lastResort *types.DevicePortConfig
manualDPC *types.DevicePortConfig
Copy link
Contributor

@milan-zededa milan-zededa Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that forceManualDPC and manualDPC are unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milan-zededa this was a placeholder to implement merging of DPCs, not it is done on rust side but I do not like to have so many dependencies on eve types. I'll remove it for now

@@ -782,6 +809,12 @@ func (n *nim) handleDPCImpl(key string, configArg interface{}, fromFile bool) {
n.forceLastResort = false
n.reevaluateLastResortDPC()
}
// if device can connect to controller it may get a new DPC in global config. This global DPC
// will have higher priority but can be invalid and the device will loose connectifity again
// at least temporraraly while DPC is beeing tested. To avoid this we reset the timestamp on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporraraly -> temporarily

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and connectifity in the line above)

@@ -0,0 +1,444 @@
// Copyright (c) 2024 Zededa, Inc.
Copy link
Contributor

@milan-zededa milan-zededa Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the filename

RUN cp /app/target/$CARGO_BUILD_TARGET/release/monitor /app/target/


FROM lfedge/eve-alpine:1f7685f95a475c6bbe682f0b976f12180b6c8726 AS runtime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use 591df01e581889c3027514c8a91feaca1c8ad49f now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OhmSpectator yeah, CI told me that. I'll fix.

nodeStatus.ZedAgentStatus = ctx.getZedAgentStatus()
nodeStatus.AppSummary = ctx.getAppSummary()

if ctx.lastNodeStatus != nil && *ctx.lastNodeStatus == nodeStatus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason behind this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christoph-zededa I have a plan to reuse this function to send a node stats to abstract types used in EVE from types used in Go-Rust interface

if item, err := sub.Get("global"); err == nil {
zedAgentStatus := item.(types.ZedAgentStatus)
return zedAgentStatus
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the error should be logged

var logger *logrus.Logger
var log *base.LogObject

type monitorContext struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type monitorContext struct {
type monitor struct {

It is not a context, is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christoph-zededa we call the field in SubscriptionOptions Ctx aka context, and the type is typically "$agent"Context.

Sure, this is not the golang context.Context. But there is more than one kind of context; which one you refer to depends on your context ;-)

Copy link
Contributor

@milan-zededa milan-zededa Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would side with Christoph on this one, I think it is cleaner (and Go-Idiomatic) to just call it monitor. I haven't see this <something>Context terminology used with custom structs in any other Go project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eriknordmark @milan-zededa @christoph-zededa so, which one should I use? I have no personal preferences but as Erik mentioned we use this style all over EVE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIM, zedrouter, mmagent and their components no longer use the Context suffix. I think it reads better and avoids confusion with context.Context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to 'monitor'

if _, err := os.Stat(types.ServerFileName); os.IsNotExist(err) {
ctx.serverNameAndPort = ""
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU

better to just check if err from os.Readfile is an ENOENT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.IsNotExist is the golang way to check this AFAIK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://pkg.go.dev/os#IsNotExist

This function predates errors.Is. It only supports errors returned by the os package. New code should use errors.Is(err, fs.ErrNotExist).

good that we talked about it :-)

findCmd := exec.Command("/sbin/findfs", "PARTLABEL=CONFIG")
deviceBytes, err := findCmd.Output()
if err != nil {
return fmt.Errorf("failed to find CONFIG partition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("failed to find CONFIG partition: %v", err)
return fmt.Errorf("failed to find CONFIG partition: %v -- output was: %s", err, string(deviceBytes))

// 3. Mount the CONFIG partition as read-write
mountCmd := exec.Command("mount", "-t", "vfat", "-o", "rw,iocharset=iso8859-1", devicePath, tempDir)
if err := mountCmd.Run(); err != nil {
return fmt.Errorf("failed to mount CONFIG partition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please include the stdout/stderr of mount in the error message.

if err := mountCmd.Run(); err != nil {
return fmt.Errorf("failed to mount CONFIG partition: %v", err)
}
defer exec.Command("umount", tempDir).Run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check for errors

Ctx: ctx,
CreateHandler: handleDPCCreate,
ModifyHandler: handleDPCModify,
DeleteHandler: handleDPCDelete,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this one not have ErrorTime set?

Comment on lines 218 to 382

subDevicePortConfigList, err := ps.NewSubscription(
pubsub.SubscriptionOptions{
AgentName: "nim",
MyAgentName: agentName,
Persistent: true,
TopicImpl: types.DevicePortConfigList{},
Activate: false,
Ctx: ctx,
CreateHandler: handleDPCCreate,
ModifyHandler: handleDPCModify,
DeleteHandler: handleDPCDelete,
})
if err != nil {
log.Error("Cannot create subscription for DevicePortConfigList")
return err
}

subAppInstanceSummary, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "zedmanager",
MyAgentName: agentName,
TopicImpl: types.AppInstanceSummary{},
Activate: false,
Ctx: ctx,
CreateHandler: handleAppInstanceSummaryCreate,
ModifyHandler: handleAppInstanceSummaryModify,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for AppInstanceSummary")
return err
}

subAppInstanceStatus, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "zedmanager",
MyAgentName: agentName,
TopicImpl: types.AppInstanceStatus{},
Activate: false,
Ctx: ctx,
CreateHandler: handleAppInstanceStatusCreate,
ModifyHandler: handleAppInstanceStatusModify,
DeleteHandler: handleAppInstanceStatusDelete,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for AppInstanceStatus")
return err
}

subDownloaderStatus, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "downloader",
MyAgentName: agentName,
TopicImpl: types.DownloaderStatus{},
Activate: false,
Ctx: ctx,
CreateHandler: handleDownloaderStatusCreate,
ModifyHandler: handleDownloaderStatusModify,
DeleteHandler: handleDownloaderStatusDelete,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for DownloaderStatus")
return err
}

subLedBlinkCounter, err := ps.NewSubscription(
pubsub.SubscriptionOptions{
AgentName: "",
MyAgentName: agentName,
TopicImpl: types.LedBlinkCounter{},
Activate: false,
Ctx: ctx,
CreateHandler: handleLedBlinkCreate,
ModifyHandler: handleLedBlinkModify,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for LedBlinkCounter")
return err
}

subZedAgentStatus, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "zedagent",
MyAgentName: agentName,
TopicImpl: types.ZedAgentStatus{},
Activate: false,
Ctx: ctx,
CreateHandler: handleZedAgentStatusCreate,
ModifyHandler: handleZedAgentStatusModify,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for ZedAgentStatus")
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for loop?

clientConnected chan bool
}

// constructor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// constructor
// factory method

ipcMessage := IpcMessage{Type: t, Message: json.RawMessage(data)}
if data, err = json.Marshal(ipcMessage); err == nil {
log.Noticef("Sending IPC message: %s", string(data))
_, err = s.codec.Write(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please wrap the error so that the reader of the log can see later which function the error came from

Comment on lines 218 to 235
if err := json.Unmarshal(r.RequestData, &dpc); err == nil {
if err := ctx.IPCServer.validateDPC(dpc); err != nil {
return r.errResponse("Failed to validate DPC", err)
}
// unpublish current manual DPC first
ctx.pubDevicePortConfig.Unpublish(dpc.Key)
// publish the DPC
if err := ctx.pubDevicePortConfig.Publish(dpc.Key, dpc); err != nil {
return r.errResponse("Failed to publish DPC", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
case "SetServer":
var server string
if err := json.Unmarshal(r.RequestData, &server); err == nil {
if err := ctx.updateServerFile(server); err != nil {
return r.errResponse("Failed to update server file", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes you check if err != nil and then have an else case, sometimes you check if err == nil and then have an else case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christoph-zededa why is this bad?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just wondering what is the reason behind it.

The "bad" thing is that it is deeply nested.

Suggested change
if err := json.Unmarshal(r.RequestData, &dpc); err == nil {
if err := ctx.IPCServer.validateDPC(dpc); err != nil {
return r.errResponse("Failed to validate DPC", err)
}
// unpublish current manual DPC first
ctx.pubDevicePortConfig.Unpublish(dpc.Key)
// publish the DPC
if err := ctx.pubDevicePortConfig.Publish(dpc.Key, dpc); err != nil {
return r.errResponse("Failed to publish DPC", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
case "SetServer":
var server string
if err := json.Unmarshal(r.RequestData, &server); err == nil {
if err := ctx.updateServerFile(server); err != nil {
return r.errResponse("Failed to update server file", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
if err := json.Unmarshal(r.RequestData, &dpc); err != nil {
return r.malformedRequestResponse(err)
}
if err := ctx.IPCServer.validateDPC(dpc); err != nil {
return r.errResponse("Failed to validate DPC", err)
}
// unpublish current manual DPC first
ctx.pubDevicePortConfig.Unpublish(dpc.Key)
// publish the DPC
if err := ctx.pubDevicePortConfig.Publish(dpc.Key, dpc); err != nil {
return r.errResponse("Failed to publish DPC", err)
}
return r.okResponse()
case "SetServer":
var server string
if err := json.Unmarshal(r.RequestData, &server); err != nil {
return r.malformedRequestResponse(err)
}
if err := ctx.updateServerFile(server); err != nil {
return r.errResponse("Failed to update server file", err)
}
return r.okResponse()

Comment on lines 281 to 282
log.Notice("Accepted connection")

if err != nil {
log.Warnf("Accept for RPC call failed: %v", err)
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Notice("Accepted connection")
if err != nil {
log.Warnf("Accept for RPC call failed: %v", err)
continue
}
if err != nil {
log.Warnf("Accept for RPC call failed: %v", err)
continue
}
log.Notice("Accepted connection")

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a markdown file in pkg/pillar/docs/monitor.md which describes the relationship between the code inside pillar/zedbox and the rust code.


# building the final image
FROM toolchain AS builder
ADD --keep-git-dir=true https://github.com/lf-edge/eve-monitor-rs.git /app
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did somebody review this code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the entire code, but there is at least one PR to be reviewed: lf-edge/eve-monitor-rs#3

Before this commit PCR indexes were present only in error text.
Add them to VaultStatus to simplify further analysis

Signed-off-by: Mikhail Malyshev <[email protected]>
- The monitor agent establishes a connection over Unix socket and send
  messages to rust application about the system state.
- It handles two requests: SetDPC to set new configuration for
  networking interfaces and SetServer to change the contents of
/config/server file

Signed-off-by: Mikhail Malyshev <[email protected]>
- the monitor agent depends on a module to parse binary stream from Unix
  socket in predefined format

Signed-off-by: Mikhail Malyshev <[email protected]>
- The application itself is hosted at https://github.com/lf-edge/eve-monitor-rs
- The container is built using lfedge/eve-rust:1.80.1 base image and
  uses cross-compilation
- RISCV target is not yet supported
- cargo-chef is used to speedup development. It won't speedup CI when
  --no-cache is used for building
- SBOM is generated using cargo-sbom

Brief design description
1. The container is started on /dev/tty3 using openvt and not started on
   any other  consoles including SSH and serial consoles. It is
   intended to be used only by a local operator
2. Full dmesg output is disabled and only panic output is enabled. Full
   dmesg output is available in the application
3. Standard keymap is changed to support CTRL+left|right|up|down
   key combinations

Signed-off-by: Mikhail Malyshev <[email protected]>
/dev/console ins the current console for kernel so even when we switch
to /dev/tty3 messages still appear and corrupt monitor application. Fo
now print them on /dev/tty which is the default kernel tty.

TODO:
1. diag output must appear on /dev/ttyS* if defined in the list of
   kernel consoles
2. The same applies to other subsystems

In order to do so the kernel command line must be parsed and the last
console= must be detected. Unfortunately there is no easier way to do
this

Signed-off-by: Mikhail Malyshev <[email protected]>
}

func handleDownloaderStatusDelete(ctxArg interface{}, key string,
statusArg interface{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this function for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this function for?

@christoph-zededa It seems I have to handle it to inform UI that downloading is over. It is empty for now because I do not display the status (only in debug tab which is not visible in production build).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay, so this is wip?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christoph-zededa yes, it is a WIP and will be fixed in coming PRs

@@ -320,7 +326,7 @@ func (m *DpcManager) compressDPCL() {
i, dpc)
} else {
// Retain the lastresort if enabled. Delete everything else.
if dpc.Key == LastResortKey && m.enableLastResort {
if (dpc.Key == LastResortKey && m.enableLastResort) || (dpc.Key == ManualDPCKey) {
m.Log.Tracef("compressDPCL: Retaining last resort. i = %d, dpc: %+v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace text doesn't capture the manual case. Maybe saying "manual or last resort"?

Comment on lines 184 to 185
// Remove by matching TimePriority and Key
func (m *DpcManager) removeDPC(dpc types.DevicePortConfig) {
func (m *DpcManager) removeDPC(dpc types.DevicePortConfig, keyOnly bool) {
Copy link
Contributor

@eriknordmark eriknordmark Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more clear to introduce a separate removeDPCByKey(key string) function. Furthermore, that function should not log an error if none found since that will happen the first time a manual DPC is added.


These limitations can be relaxed by using a custom font with 256 glyphs comparing to the standard one that uses 512 glyphs. In this case an extra bit can be used to render 16 colors. Besides extra pseudo-graphics glyphs can be added instead of unused characters to display e.g. rounded boxes.

As of now a standard font is used so the look of the application on the host and on EVE is deferent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deferent or different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this one and several other mistakes e.g missed articles. @eriknordmark would it make sense to add more advanced linters to check grammar/spelling of the *.md files?

- 'manual' DPC has  highest priority and applied immediately
- there can be only one 'manual' DPC. so we nedd to delete all backup
  copies of it from a DPC list. EVE should not care about previous attempts
to change network configuration

Signed-off-by: Mikhail Malyshev <[email protected]>
- monitor app is run only on physical monitor so we need to run QEMU
  with -vga or similar to see it

Signed-off-by: Mikhail Malyshev <[email protected]>
- Setting server URL in both /config/server and real config partition

Signed-off-by: Mikhail Malyshev <[email protected]>
- Add comments for SubscriptionOptions.{WarningTime, ErrorTime} fields

Signed-off-by: Mikhail Malyshev <[email protected]>
- add documentation for pkg/monitor container
- update NIM documentation
- add documentation for pkg/pillar/cmd/monitor service

Signed-off-by: Mikhail Malyshev <[email protected]>
- Update tag to v0.1.1 which fixes is_onboarded() logic

Signed-off-by: Mikhail Malyshev <[email protected]>
// at least temporarily while DPC is being tested. To avoid this we reset the timestamp on
// the Manual DPC to the current time
// TODO: do it. or check for ManualDPCKey in DPCManager
// TODO 2: we should not try lastresort DPC if the user set the DPC to manual
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR ready for review&merge despite these two TODOs?

Copy link
Contributor Author

@rucoder rucoder Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milan-zededa I hope yes. It addresses immediate chicken and egg problem with incorrect global network configuration. I can remove the comment

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

return fmt.Errorf("failed to update shadow copy: %v", err)
}

if remountErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is always false, isn't it?

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-run

@eriknordmark eriknordmark merged commit 57d6bc0 into lf-edge:master Nov 28, 2024
74 of 87 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants