Troubleshooting problems

Jason Volk edited this page Mar 24, 2023 · 15 revisions

TROUBLESHOOTING


Useful program options

Start the daemon with one or more of the following program options to make it easier to troubleshoot and perform maintenance:

  • -debug increases the logging level to the terminal.

  • -single starts in "single user mode", a convenience combination of the -nolisten, -wa and -console options described below.

  • -safe is more restrictive than -single by denying outbound network requests and background database tasks.

  • -nolisten will disable the loading of any listener sockets during startup.

  • -nobackfill specifically skips the initial-backfill background task.

  • -wa (write-avoid) will discourage, but not deny, writes to the database. This prevents many background tasks and other noise, making it easier to conduct maintenance (implies -nobackfill).

  • -ro (read-only) is more restrictive than -wa by denying all writes to the database.

  • -slave allows read-only access to a live database by additional instances of Construct. Only one instance of Construct may have write access to a database at a time; additional instances use this option.

  • -console is a convenience to drop to the administrator console immediately after startup.
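
As an illustration, several of the options above can be combined on one command line. The sketch below only prints the command instead of starting the daemon. The binary name `construct`, the placeholder server name `example.org`, and the placement of options before the server name are assumptions; substitute whatever your installation uses.

```shell
#!/bin/sh
# Dry-run helper: prints a command line rather than starting the daemon.
# ASSUMPTIONS: the binary is called "construct", options precede the
# server name, and "example.org" is a placeholder server name.
CONSTRUCT_BIN="construct"
SERVER_NAME="example.org"

maintenance_cmd() {
    # "$*" joins the requested options with single spaces.
    echo "$CONSTRUCT_BIN $* $SERVER_NAME"
}

# Verbose single-user session (-single implies -nolisten -wa -console).
maintenance_cmd -debug -single

# Read-only inspection with no listeners, dropping to the console.
maintenance_cmd -ro -nolisten -console
```

Printing the command first makes it easy to review the chosen options before running it against a live database.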

Recovering from broken configurations

If your server ever fails to start because of an errant conf item, you can override any item using an environment variable before starting the program. To do this, replace the '.' characters with '_' in the name of the item when setting it in the environment. The name is otherwise the same, including its lower case.

Otherwise, the program can be run with the option -defaults. This will prevent initial loading of the configuration from the database. It will not prevent environment variable overrides (as mentioned above). Values will not be written back to the database unless they are explicitly set by the user in the console.
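
The naming rule for environment overrides can be sketched as a small shell helper. The conf item used here appears elsewhere on this page; the binary name `construct`, the server name `example.org`, and the value 65535 are placeholders for illustration only, and the final command is printed rather than executed.

```shell
#!/bin/sh
# Convert a conf item name to its environment variable name by
# replacing every '.' with '_' (case is left unchanged).
conf_env_name() {
    printf '%s\n' "$1" | tr '.' '_'
}

item="ircd.client.max_client_per_peer"
var="$(conf_env_name "$item")"
echo "$var"

# Dry run: print how the override could be passed for one invocation.
# ASSUMPTIONS: binary name "construct", server name "example.org" and
# the value 65535 are placeholders, not recommendations.
echo "env $var=65535 construct example.org"
```

Using `env` scopes the override to a single invocation, so nothing lingers in the shell environment after the troubleshooting session.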

Recovering from database corruption

In very rare cases after a hard crash, the journal cannot completely restore the data written before the crash. Due to the design of RocksDB and the way we apply it for Matrix, data is lost in chronological order starting from the most recent transaction (Matrix event). The database is consistent for all events up until the first corrupt event, called the point-in-time.

When any loss has occurred, the daemon will fail to start normally. To enable point-in-time recovery, use the command-line option -recoverdb point at the next invocation. Some recent events may be lost. If -recoverdb point does not work, other techniques may be used, as detailed below.

In some cases the daemon will start normally without the need for any recovery mode but later encounter hard corruption. Only the -recoverdb repair mode is effective against this.

❗ It is advised that any recovery be performed in -single mode. Additional program options such as -safe or -ro may be useful for some salvage techniques.

❗ After employing a salvage mode one should strongly consider an events dump and rebuild.

-recoverdb <option>
  • 🟢 point - Recovery mode; rewinds the database to the last consistent state before corruption.
  • 🔴 skip - Salvage mode; drops recent corrupt data, which will leave the database in an inconsistent state.
  • 🔴 tolerate - Salvage mode; expert use only.
  • 🔴 repair - Salvage mode; finds and drops deep corruption. This will leave the database in an inconsistent state.
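
The distinction between the recovery mode and the salvage modes can be captured in a small dry-run helper. This is only a sketch: it prints a suggested command and a reminder, and does not run anything. The binary name `construct` and the server name `example.org` are placeholders, and placing options before the server name is an assumption.

```shell
#!/bin/sh
# Classify a -recoverdb mode as listed above and print a dry-run command.
# ASSUMPTIONS: "construct" and "example.org" are placeholders, and
# options are assumed to precede the server name.
recoverdb_cmd() {
    case "$1" in
        point)
            kind="recovery" ;;
        skip|tolerate|repair)
            kind="salvage" ;;
        *)
            echo "unknown -recoverdb mode: $1" >&2
            return 1 ;;
    esac
    echo "# $kind mode"
    # Recovery is advised in -single mode (see the note above).
    echo "construct -single -recoverdb $1 example.org"
    if [ "$kind" = "salvage" ]; then
        echo "# afterwards, consider an events dump and rebuild"
    fi
}

recoverdb_cmd point
recoverdb_cmd repair
```

Trying `point` first and reaching for a salvage mode only if it fails follows the order described in this section.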

Trouble with reverse proxies and middlewares

Construct is designed to be capable internet service software and should perform best when directly interfacing with remote parties. Nevertheless, some users wish to employ middlewares known as "reverse-proxies" through which all communication is forwarded. This gives the appearance, from the server's perspective, that all clients are connecting from the same IP address on different ports.

At this time there are some known issues with reverse proxies, which administrators may be able to mitigate by reviewing the following:

  1. Construct now supports plaintext listener sockets and this point can be ignored. If the proxy generates ACME certificates you can use those same certificates to encrypt the link to Construct. The proxy will have to be configured to forward SNI, for example with Caddy:
https://construct.chat:8448
reverse_proxy https://localhost:1234 {
    transport http {
        tls_server_name construct.chat
    }
}
  2. If the proxy does not run on localhost, the connection limit from a single remote IP address must be raised from its default, for example by entering the following in !control or console:
conf set ircd.client.max_client_per_peer 65535
  3. Avoid rewriting the Host: header which is sent to Construct. The header should appear as sent by remote clients. This is no longer a hard requirement with recent versions of the Matrix protocol, and it is likely not the source of your trouble.

  4. Ensure the reverse-proxy is not setting Connection: close when communicating with Construct. The ideal middleware is configured to maintain a pool of persistent connections and pipeline requests. As a hint based on Construct's default settings at the time of this writing, the optimal connection count from the middleware is 64, and up to 128.

Timeouts while resolving domain names

Due to the abnormal loads of Matrix, Construct implements custom DNS resolution directly over UDP. Construct does not use 127.0.0.1 or any locally provided DNS servers by default, because nearly all users reported issues which required them to reconfigure or upgrade their DNS service. To ship the least-broken solution by default, Construct is pre-configured with an array of public servers to query in a load-balanced round-robin. To view the DNS configuration in its entirety use the command: conf ircd.net.dns. If resolution still times out, try the following:

  1. Lower the query rate to slow down queries made to the servers. This is controlled by conf ircd.net.dns.resolver.send_rate, which is the number of milliseconds to wait between requests; higher is slower: conf set ircd.net.dns.resolver.send_rate 300

  2. The conf ircd.net.dns.resolver.send_burst can be tweaked in conjunction with the conf ircd.net.dns.resolver.send_rate to more effectively shape the load as tolerated by the remote server's rate-limiting scheme. The burst is important for keeping requests in flight to utilize the array while minimizing local delays when many resolutions are pending. If necessary, try setting a lower value, or 1 to never exceed the send_rate with any burst.

  3. Add to or replace the default configured array of servers. The configuration at conf ircd.net.dns.resolver.servers is a string of IP addresses separated by spaces. It is better to add more servers than to replace the existing ones, but it is worse to add a server which is configured very differently from the others. The default servers were chosen because they have reasonably high rates and are consistent among themselves; they may not be the best choice for all users, especially in Europe and Asia.

👉 Administrators are often tempted to simply replace the array with 127.0.0.1 to use their own high-performance service. This is okay, but the ircd.net.dns.resolver.send_rate may need to be configured significantly faster (lower) than the default if only one server is configured rather than the default six.
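
A rough back-of-the-envelope estimate shows why a single server may need a lower send_rate. The sketch assumes, as a simplification not confirmed by this page, that send_rate is the wait between requests to each server, so the aggregate query rate scales with the number of servers; send_burst is ignored. The value 300 is the example value used above and is not necessarily the default.

```shell
#!/bin/sh
# Rough aggregate queries per second (integer division).
# ASSUMPTION: send_rate (ms) applies per server; send_burst is ignored.
dns_qps() {
    servers="$1"
    send_rate_ms="$2"
    echo $(( servers * 1000 / send_rate_ms ))
}

# send_rate (ms) a differently sized array would need in order to
# match the aggregate rate of the old array, under the same assumption.
equivalent_send_rate() {
    old_servers="$1"
    old_rate_ms="$2"
    new_servers="$3"
    echo $(( old_rate_ms * new_servers / old_servers ))
}

dns_qps 6 300                 # six servers at 300 ms: prints 20
dns_qps 1 300                 # one server at 300 ms: prints 3
equivalent_send_rate 6 300 1  # one server keeping pace: prints 50
```

Treat the numbers as a starting point for experimentation rather than a tuned recommendation; the real behavior also depends on send_burst and on how the local resolver handles load.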