Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add monitor application to EVE #4410

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

rucoder
Copy link
Contributor

@rucoder rucoder commented Oct 31, 2024

This PR introduces a monitor agent to communicate with rust application that implements a TUI (TextUI) and allows the user to monitor system status and perform (as of now) two operaions:

  1. Change IP address of Ethernet interface (DHCP <-> static)
  2. Set server URL if the node is not yet onboarded

The monitor agent opens a Unix socket and waits for incoming connection. When the connection is established the NodeStatus structure is sent to the other party to provide current node status. Then other information is sent periodically as soon as it become available in PubSub

Current limitations/issues

  • network: yes is used in build.yaml while we are working on proper source vendoring approach for rust application
  • there is no way to completely remove manual DPC from the list
  • EVE structures should not be sent as-is to Rust. Every time the structure is changed on Eve side there is no way to catch this situation at build time. Stable EVE-Rust API is needed.
  • define git commit SHA in Dockerfile when Fix SetDPC request and implement SetServer request and UI eve-monitor-rs#3 is merged
  • write a documentation in pkg/pillar/docs/monitor.md

AgentName: "monitor",
MyAgentName: agentName,
TopicImpl: types.DevicePortConfig{},
Persistent: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This most likely does not need to be persistent since we persist currently known DPCs using pubDevicePortConfigList (i.e. in /persist/status/nim/DevicePortConfigList) - I assume this includes the manual DPC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milan-zededa yes, agree. I had this set to false for experements, but this change fall through the cracks.

Copy link

codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 20.93%. Comparing base (dd27fe1) to head (4734622).
Report is 20 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4410   +/-   ##
=======================================
  Coverage   20.93%   20.93%           
=======================================
  Files          13       13           
  Lines        2895     2895           
=======================================
  Hits          606      606           
  Misses       2163     2163           
  Partials      126      126           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

lastResort *types.DevicePortConfig
manualDPC *types.DevicePortConfig
Copy link
Contributor

@milan-zededa milan-zededa Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that forceManualDPC and manualDPC are unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milan-zededa this was a placeholder to implement merging of DPCs, not it is done on rust side but I do not like to have so many dependencies on eve types. I'll remove it for now

@@ -782,6 +809,12 @@ func (n *nim) handleDPCImpl(key string, configArg interface{}, fromFile bool) {
n.forceLastResort = false
n.reevaluateLastResortDPC()
}
// if device can connect to controller it may get a new DPC in global config. This global DPC
// will have higher priority but can be invalid and the device will loose connectifity again
// at least temporraraly while DPC is beeing tested. To avoid this we reset the timestamp on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporraraly -> temporarily

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and connectifity in the line above)

@@ -0,0 +1,444 @@
// Copyright (c) 2024 Zededa, Inc.
Copy link
Contributor

@milan-zededa milan-zededa Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the filename

RUN cp /app/target/$CARGO_BUILD_TARGET/release/monitor /app/target/


FROM lfedge/eve-alpine:1f7685f95a475c6bbe682f0b976f12180b6c8726 AS runtime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use 591df01e581889c3027514c8a91feaca1c8ad49f now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OhmSpectator yeah, CI told me that. I'll fix.

nodeStatus.ZedAgentStatus = ctx.getZedAgentStatus()
nodeStatus.AppSummary = ctx.getAppSummary()

if ctx.lastNodeStatus != nil && *ctx.lastNodeStatus == nodeStatus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason behind this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christoph-zededa I have a plan to reuse this function to send a node stats to abstract types used in EVE from types used in Go-Rust interface

if item, err := sub.Get("global"); err == nil {
zedAgentStatus := item.(types.ZedAgentStatus)
return zedAgentStatus
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the error should be logged

var logger *logrus.Logger
var log *base.LogObject

type monitorContext struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type monitorContext struct {
type monitor struct {

It is not a context, is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christoph-zededa we call the field in SubscriptionOptions Ctx aka context, and the type is typically "$agent"Context.

Sure, this is not the golang context.Context. But there is more than one kind of context; which one you refer to depends on your context ;-)

Copy link
Contributor

@milan-zededa milan-zededa Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would side with Christoph on this one, I think it is cleaner (and Go-Idiomatic) to just call it monitor. I haven't see this <something>Context terminology used with custom structs in any other Go project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eriknordmark @milan-zededa @christoph-zededa so, which one should I use? I have no personal preferences but as Erik mentioned we use this style all over EVE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIM, zedrouter, mmagent and their components no longer use the Context suffix. I think it reads better and avoids confusion with context.Context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to 'monitor'

if _, err := os.Stat(types.ServerFileName); os.IsNotExist(err) {
ctx.serverNameAndPort = ""
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU

better to just check if err from os.Readfile is an ENOENT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.IsNotExist is the golang way to check this AFAIK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://pkg.go.dev/os#IsNotExist

This function predates errors.Is. It only supports errors returned by the os package. New code should use errors.Is(err, fs.ErrNotExist).

good that we talked about it :-)

findCmd := exec.Command("/sbin/findfs", "PARTLABEL=CONFIG")
deviceBytes, err := findCmd.Output()
if err != nil {
return fmt.Errorf("failed to find CONFIG partition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("failed to find CONFIG partition: %v", err)
return fmt.Errorf("failed to find CONFIG partition: %v -- output was: %s", err, string(deviceBytes))

// 3. Mount the CONFIG partition as read-write
mountCmd := exec.Command("mount", "-t", "vfat", "-o", "rw,iocharset=iso8859-1", devicePath, tempDir)
if err := mountCmd.Run(); err != nil {
return fmt.Errorf("failed to mount CONFIG partition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please include the stdout/stderr of mount in the error message.

if err := mountCmd.Run(); err != nil {
return fmt.Errorf("failed to mount CONFIG partition: %v", err)
}
defer exec.Command("umount", tempDir).Run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check for errors

Ctx: ctx,
CreateHandler: handleDPCCreate,
ModifyHandler: handleDPCModify,
DeleteHandler: handleDPCDelete,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this one not have ErrorTime set?

Comment on lines 218 to 382

subDevicePortConfigList, err := ps.NewSubscription(
pubsub.SubscriptionOptions{
AgentName: "nim",
MyAgentName: agentName,
Persistent: true,
TopicImpl: types.DevicePortConfigList{},
Activate: false,
Ctx: ctx,
CreateHandler: handleDPCCreate,
ModifyHandler: handleDPCModify,
DeleteHandler: handleDPCDelete,
})
if err != nil {
log.Error("Cannot create subscription for DevicePortConfigList")
return err
}

subAppInstanceSummary, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "zedmanager",
MyAgentName: agentName,
TopicImpl: types.AppInstanceSummary{},
Activate: false,
Ctx: ctx,
CreateHandler: handleAppInstanceSummaryCreate,
ModifyHandler: handleAppInstanceSummaryModify,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for AppInstanceSummary")
return err
}

subAppInstanceStatus, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "zedmanager",
MyAgentName: agentName,
TopicImpl: types.AppInstanceStatus{},
Activate: false,
Ctx: ctx,
CreateHandler: handleAppInstanceStatusCreate,
ModifyHandler: handleAppInstanceStatusModify,
DeleteHandler: handleAppInstanceStatusDelete,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for AppInstanceStatus")
return err
}

subDownloaderStatus, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "downloader",
MyAgentName: agentName,
TopicImpl: types.DownloaderStatus{},
Activate: false,
Ctx: ctx,
CreateHandler: handleDownloaderStatusCreate,
ModifyHandler: handleDownloaderStatusModify,
DeleteHandler: handleDownloaderStatusDelete,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for DownloaderStatus")
return err
}

subLedBlinkCounter, err := ps.NewSubscription(
pubsub.SubscriptionOptions{
AgentName: "",
MyAgentName: agentName,
TopicImpl: types.LedBlinkCounter{},
Activate: false,
Ctx: ctx,
CreateHandler: handleLedBlinkCreate,
ModifyHandler: handleLedBlinkModify,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for LedBlinkCounter")
return err
}

subZedAgentStatus, err := ps.NewSubscription(pubsub.SubscriptionOptions{
AgentName: "zedagent",
MyAgentName: agentName,
TopicImpl: types.ZedAgentStatus{},
Activate: false,
Ctx: ctx,
CreateHandler: handleZedAgentStatusCreate,
ModifyHandler: handleZedAgentStatusModify,
WarningTime: warningTime,
ErrorTime: errorTime,
})
if err != nil {
log.Error("Cannot create subscription for ZedAgentStatus")
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for loop?

clientConnected chan bool
}

// constructor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// constructor
// factory method

ipcMessage := IpcMessage{Type: t, Message: json.RawMessage(data)}
if data, err = json.Marshal(ipcMessage); err == nil {
log.Noticef("Sending IPC message: %s", string(data))
_, err = s.codec.Write(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please wrap the error so that the reader of the log can see later which function the error came from

Comment on lines 218 to 235
if err := json.Unmarshal(r.RequestData, &dpc); err == nil {
if err := ctx.IPCServer.validateDPC(dpc); err != nil {
return r.errResponse("Failed to validate DPC", err)
}
// unpublish current manual DPC first
ctx.pubDevicePortConfig.Unpublish(dpc.Key)
// publish the DPC
if err := ctx.pubDevicePortConfig.Publish(dpc.Key, dpc); err != nil {
return r.errResponse("Failed to publish DPC", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
case "SetServer":
var server string
if err := json.Unmarshal(r.RequestData, &server); err == nil {
if err := ctx.updateServerFile(server); err != nil {
return r.errResponse("Failed to update server file", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes you check if err != nil and then have an else case, sometimes you check if err == nil and then have an else case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christoph-zededa why is this bad?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just wondering what is the reason behind it.

The "bad" thing is that it is deeply nested.

Suggested change
if err := json.Unmarshal(r.RequestData, &dpc); err == nil {
if err := ctx.IPCServer.validateDPC(dpc); err != nil {
return r.errResponse("Failed to validate DPC", err)
}
// unpublish current manual DPC first
ctx.pubDevicePortConfig.Unpublish(dpc.Key)
// publish the DPC
if err := ctx.pubDevicePortConfig.Publish(dpc.Key, dpc); err != nil {
return r.errResponse("Failed to publish DPC", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
case "SetServer":
var server string
if err := json.Unmarshal(r.RequestData, &server); err == nil {
if err := ctx.updateServerFile(server); err != nil {
return r.errResponse("Failed to update server file", err)
} else {
return r.okResponse()
}
} else {
return r.malformedRequestResponse(err)
}
if err := json.Unmarshal(r.RequestData, &dpc); err != nil {
return r.malformedRequestResponse(err)
}
if err := ctx.IPCServer.validateDPC(dpc); err != nil {
return r.errResponse("Failed to validate DPC", err)
}
// unpublish current manual DPC first
ctx.pubDevicePortConfig.Unpublish(dpc.Key)
// publish the DPC
if err := ctx.pubDevicePortConfig.Publish(dpc.Key, dpc); err != nil {
return r.errResponse("Failed to publish DPC", err)
}
return r.okResponse()
case "SetServer":
var server string
if err := json.Unmarshal(r.RequestData, &server); err != nil {
return r.malformedRequestResponse(err)
}
if err := ctx.updateServerFile(server); err != nil {
return r.errResponse("Failed to update server file", err)
}
return r.okResponse()

Comment on lines 281 to 282
log.Notice("Accepted connection")

if err != nil {
log.Warnf("Accept for RPC call failed: %v", err)
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Notice("Accepted connection")
if err != nil {
log.Warnf("Accept for RPC call failed: %v", err)
continue
}
if err != nil {
log.Warnf("Accept for RPC call failed: %v", err)
continue
}
log.Notice("Accepted connection")

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a markdown file in pkg/pillar/docs/monitor.md which describes the relationship between the code inside pillar/zedbox and the rust code.


# building the final image
FROM toolchain AS builder
ADD --keep-git-dir=true https://github.com/lf-edge/eve-monitor-rs.git /app
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did somebody review this code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the entire code, but there is at least one PR to be reviewed: lf-edge/eve-monitor-rs#3

Before this commit PCR indexes were present only in error text.
Add them to VaultStatus to simplify further analysis

Signed-off-by: Mikhail Malyshev <[email protected]>
if err != nil {
return fmt.Errorf("failed to find CONFIG partition: %v", err)
}
devicePath := strings.TrimSpace(string(deviceBytes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
devicePath := strings.TrimSpace(string(deviceBytes))
devicePath := strings.TrimRight(string(deviceBytes), "\n\r")

- /containers:/containers:rshared,rbind

devices:
# all block devices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need all these?

network: yes
config:
pid: host
binds:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here, is all these access needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shjala probably not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you go through them, pick the ones you really need and add a comment above each one that is needed stating why it is needed? it makes creating apparmor profile much easier.

same for this.

@shjala
Copy link
Member

shjala commented Nov 4, 2024

@rucoder let's cook an apparmor profile for it while it is being introduced.

@rucoder
Copy link
Contributor Author

rucoder commented Nov 6, 2024

@rucoder let's cook an apparmor profile for it while it is being introduced.
@shjala let's cook it in a separate PR then. I'll create a ticket

COPY --from=runtime /out/usr/lib /usr/lib

# copy busybox and sh but just for debugging
COPY --from=runtime /out/bin/sh /bin/sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, /bin/sh will be a symbolic link to busybox, right? So instead of copy both sh and busybox (line below), it makes sense to copy only busybox and run /bin/busybox --install -s <DESTINATION_DIR> to create all symbolic links of supported commands...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rene your suggestion will not work as-is but I implemented a similar one. check out updated code. pretty interesting fact

@@ -289,7 +289,7 @@ func RebootReason(reason string, bootReason types.BootReason, agentName string,
}

// Printing the reboot reason to the console
filename = "/dev/console"
Copy link
Contributor

@rene rene Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some scripts from pkg/pillar/scripts also write to /dev/console, doesn't make sense to change them as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rene I checked those and they do not affect monitor app. They actually should write to console so we see the output on /dev/ttyS* if connected. Monitor is started after they finish so it should be ok. I still need to modify getty to properly handle console devices anyways

- The monitor agent establishes a connection over Unix socket and send
  messages to rust application about the system state.
- It handles two requests: SetDPC to set new configuration for
  networking interfaces and SetServer to change the contents of
/config/server file

Signed-off-by: Mikhail Malyshev <[email protected]>
- the monitor agent depends on a module to parse binary stream from Unix
  socket in predefined format

Signed-off-by: Mikhail Malyshev <[email protected]>
@rucoder
Copy link
Contributor Author

rucoder commented Nov 21, 2024

@rucoder let's cook an apparmor profile for it while it is being introduced.
@shjala let's cook it in a separate PR then. I'll create a ticket

ok, I fixed that (mostly)

- The application itself is hosted at https://github.com/lf-edge/eve-monitor-rs
- The container is built using lfedge/eve-rust:1.80.1 base image and
  uses cross-compilation
- RISCV target is not yet supported
- cargo-chef is used to speedup development. It won't speedup CI when
  --no-cache is used for building
- SBOM is generated using cargo-sbom

Brief design description
1. The container is started on /dev/tty3 using openvt and not started on
   any other  consoles including SSH and serial consoles. It is
   intended to be used only by a local operator
2. Full dmesg output is disabled and only panic output is enabled. Full
   dmesg output is available in the application
3. Standard keymap is changed to support CTRL+left|right|up|down
   key combinations

Signed-off-by: Mikhail Malyshev <[email protected]>
/dev/console ins the current console for kernel so even when we switch
to /dev/tty3 messages still appear and corrupt monitor application. Fo
now print them on /dev/tty which is the default kernel tty.

TODO:
1. diag output must appear on /dev/ttyS* if defined in the list of
   kernel consoles
2. The same applies to other subsystems

In order to do so the kernel command line must be parsed and the last
console= must be detected. Unfortunately there is no easier way to do
this

Signed-off-by: Mikhail Malyshev <[email protected]>
- 'manual' DPC has  highest priority and applied immediately
- there can be only one 'manual' DPC. so we nedd to delete all backup
  copies of it from a DPC list. EVE should not care about previous attempts
to change network configuration

Signed-off-by: Mikhail Malyshev <[email protected]>
- monitor app is run only on physical monitor so we need to run QEMU
  with -vga or similar to see it

Signed-off-by: Mikhail Malyshev <[email protected]>
- Setting server URL in both /config/server and real config partition

Signed-off-by: Mikhail Malyshev <[email protected]>
- Add comments for SubscriptionOptions.{WarningTime, ErrorTime} fields

Signed-off-by: Mikhail Malyshev <[email protected]>
- add documentation for pkg/monitor container
- update NIM documentation
- add documentation for pkg/pillar/cmd/monitor service

Signed-off-by: Mikhail Malyshev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants