1. DESCRIPTION
A Python code that monitors the available egress bandwidth of selected PNI interfaces and status of the pertinent eBGP sessions on a Cisco IOS-XR router acting as an ASBR, and make selective decisions to block / unblock the ingress traffic at its source if it is on a local interface (typically a CDN cache directly-connected to the router)
2. DEPENDENCIES
Python
The code contains certain modules / functions that are only available in Python 2.7 or later minor releases (< 3.x). Any developer who wishes to run the code on an older Python release will have to override these functions and / or replace them with their functional equivalents as applicable.
OS
The code has been written and tested solely on a Debian Linux distribution (rel 7.4). And its portability to other (specifically non-Linux) operating systems may be limited.
NetSNMP
The code has been tested with NetSNMP rel 5.4.3.
MIB translation MUST be enabled in the snmp.conf
file. This is due to the differences in output formatting of
NetSNMP with and without MIB translation enabled. The program does NOT require the vendor MIB files to operate.
3. CONFIGURATION
The program can optionally be run with a configuration file (pniMonitor.conf
) that resides inside the same
directory with the Python file. If the program is started without a configuration file or with any or all of the
configuration lines missing or commented out, it will then apply its default configuration settings for the missing
parameter(s) and continue running.
Note: Although the main program can operate without a configuration file, the utility scripts pniDiscovery.py
and pniMonitory_livenessCheck.py
do require it.
3.1. STARTUP CONFIGURATION
The following parameters can be configured during startup only. Any modifications during runtime will be silently ignored.
inventory_file=[<filename>
(default:inventory.txt
)]
The inventory details (list of node names) MUST be provided in a text file with each node written on a separate line. Example:
###inventory.txt
asbr-1
asbr-2
#asbr-3
As shown above, the pound sign (#
) can be used to create comment lines or comment out a selected node.
The program does not perform - nor was seen necessary to do - regex checks to the provided node names. Hence, invalid entries in the inventory file will not be ignored straight away, however they will be retried in every polling cycle and then ignored due to DNS lookup failures. (This behaviour will be modified in the next releases, where the name resolution check will be accompanied by system OS validation during startup.)
pni_interface_tag=[<string>
(default:CDPautomation_PNI
)]
A user-defined label to identify the PNI interfaces that are intended for monitoring. The label will be searched within the description strings of all Ethernet Bundle interfaces of a router, when the discovery function is run.
Interfaces with a no-mon
string applied will be excluded from monitoring. (Including a new interface or
excluding an existing one from monitoring requires a manual discovery trigger in the current release.)
cdn_interface_tag=[<string>
(default:CDPautomation_CDN
)]
A user-defined label to identify the PNI interfaces that are intended for monitoring. The label will be searched within the description strings of all Ethernet Bundle and HundredGigabit Ethernet interfaces of a router, when the discovery function is run. It is important NOT to label the interfaces that are members of an Ethernet Bundle.
Interfaces with a no-mon
string applied will be excluded from monitoring. (Including a new interface or
excluding an existing one from monitoring requires a manual discovery trigger in the current release.)
acl_name=[<string>
(default:CDPautomation_UdpRhmBlock
)]
User-defined name of the IPv4 access-list as configured on the router(s). Missing ACL configuration on the router
or misconfiguration of the acl_name in the pniMonitor.conf file will cause the SSH session(s) to be stalled, until
the protection mechanism in the MainThread kicks in terminates all threads, including itself. This will trigger a
CRITICAL
alert. (see Section-5 'Multi-Threading' for further details)
3.2. RUNTIME CONFIGURATION
The following parameters can be modified while the program is running, and any changes will be acted on accordingly
in the next polling cycle. Invalid configurations will be ignored, accompanied with a WARNING
alert, and the
program will revert back to either default (during startup) or last known good configuration.
Note: Commenting out a configuration line or removing it while the program is running will NOT revert it back to its default configuration. Once the program is running, preferred settings must be configured explicitly.
frequency=[<30-300>
(default:30
)]
The running frequency of the mainThread, configured in seconds, defining how frequently the subThreads should be re-initialized when the system time is within the peak hours. Also referred to as the polling cycle in the other sections of this document.
off_peak_frequency=[<30-300>
(default:180
)]
The running frequency of the mainThread, configured in seconds, defining how frequently the subThreads should be re-initialized when the system time is outside the peak hours. Also referred to as the polling cycle in the other sections of this document.
peak_hours=[<start_time(hh:mm)-end_time(hh:mm)>
(default:17:30-23:59
)]
The peak hours during the day where the PNI links are expected more likely to be congested due to higher than normal utilization caused by the increased subscriber demand on the CDN caches.
rising_threshold=[<0-100>
(default:95
)]
The PNI utilisation threshold value, which will trigger the RHM blocking algorithm, when reached or exceeded. See section-8 'Process (Decision Making)' for further details.
Must always be set to a value that is higher than the configured falling_threshold paramater, which will otherwise be ignored by the program and reset to the default / last known good configuration, followed by a warning message.
falling_threshold=[<0-100>
(default:90
)]
The PNI utilisation threshold value, which will trigger the RHM unblocking algorithm, when reached or deceeded. See section-8 'Process (Decision Making)' for further details.
Must always be set to a value that is smaller than the configured rising_threshold paramater, which will otherwise be ignored by the program and reset to the default / last known good configuration, followed by a warning message.
ipv4_min_prefixes=[<integer>
(default:0
)]
Minimum number of prefixes 'accepted' from a BGPv4 peer with unicast IPv4 AFI. Default value is '0', which means the PNI interface will be considered 'usable' until ALL accepted prefixes are withdrawn by the peer.
ipv6_min_prefixes=[<integer>
(default:100
)]
Minimum number of prefixes 'accepted' from a BGPv6 peer with unicast IPv6 AFI. Default value is '100', which is intentionally set high, in order to avoid a PNI interface running with a single IPv6 stack from being considered usable.
cdn_serving_cap=[<0-100>
(default:90
)]
Maximum serving capacity of a CDN node relative to its wire rate. Default value is '90'.
Where applicable, this parameter must be configured to the lowest of the bit-cap or flit-limit values. For instance; if the maximum expected throughput from a CDN region with 200 Gbps physical capacity is 160 Gbps due to its manually overridden bit-limit, then the cdn_serving_cap must be set to '80'. When the bit- limit is removed, it should be reset to a value (typically >90) that is indicative of the highest achievable throughput without the region being flit-limited.
log_level=[<INFO|WARNING|ERROR|CRITICAL|DEBUG>
(default:INFO
)]
The log_level can be specified as one of INFO
, WARNING
, ERROR
, CRITICAL
or DEBUG
. If none specified, the
program will run with default level INFO.
Log files saved on disk will be rotated and compressed with Gzip daily at midnight local time.
log_retention=[<0-90>
(default:7
)]
The number of days for the rotated log files to be kept on disk.
email_distribution_list=[[email protected],[email protected]
(default:[email protected]
)]
The list of email addresses to be notified when an event occurs. Email addresses that are outside the @domain1.com
or
@domain2.com
domains will NOT be accepted. Multiple entries must be separated by a comma (,
).
Emails alerts will be sent stateless and will not be retried or repeated.
email_alert_severity=[<WARNING|ERROR|CRITICAL>
(default:ERROR
)]
The minimum level of event severity to trigger an email alert.
runtime=[<integer>
(default:infinite
)]
An integer value, if configured, is used to calculate the number of polling cycles left before the program terminates itself. It could be useful in scenarios where it is desired to gracefully exit the program after a certain amount of time, such as C-Auth password expiry.
simulation_mode=[<on|off>
(default:off
)]
If switched on; node discovery, probing and decision-making functions will continue, however NO configuration changes will be made to the router(s).
data_retention=[<2-60>
(default:2
)]
The number of polling cycles for the collected probe data to be kept on disk. Setting this value too high might cause increased memory utilisation, which might lead to program terminations (by the self-protection mechanism) on a relatively busy or under-spec'd server.
snmp_retries=[<integer>
(default:2
)]
The number of retries to be used in the SNMP requests. The default value is '2'.
snmp_timeout=[<integer>
(default:3
)]
The timeout in seconds between SNMP retries. The default value is '3'.
4. USAGE
4.1. HOW TO RUN
- Verify the configuration and inventory files.
- Verify the environment settings:
- Python version must be 2.7.x;
python -V
- If the correct version not found, add the following line to the
.bash_profile
file in user$HOME
directory;source ~nadt/nadt-aliases.include.bash
- Re-verify the Python version;
vrun python -V
- Python version must be 2.7.x;
- Browse to the script directory;
cd /scripts/laphroaig/
- Run the script:
- If using the native Python installation on the system;
./pniMonitor.py
OR - If using the virtualized NADT environment as described above;
vrun ./pniMonitor.py
- If using the native Python installation on the system;
- Wait for the prompt and enter C-Auth password
- Once the
Authentication Successful
message is displayed:- Pause the script;
ctrl^z
- Send it to background;
bg
- And finally de-attach it from your terminal session;
disown -h %1
- Pause the script;
4.2. HOW TO TERMINATE
Graceful:
Set the runtime
parameter in the pniMonitor.conf
file to 1
. The program will gracefully terminate itself once
the next threading cycle is completed.
Forced:
Browse to the working directory and run the following command:
kill -9 "$(<pniMonitor.pid)"
This will terminate the mainThread and all subThreads immediately.
5. MULTI-THREADING
The program will initiate a subThread for each node (router) specified in the inventory file, so that the interface status on multiple routers can be managed simultaneously and independently.
For convenience in operations and diagnostics, a subThread's name will be comprised of the hostname of the router that it is relevant to. And the thread names will be included in every log line and alert produced by the program.
If for any reason (such as a stalled SSH session or high CPU / Memory utilisation on the host system) one or more
of the subThreads take too long (i.e. longer than the pre-defined running frequency of the mainThread) to complete,
then the program will no longer terminate (new in release 1.4), but hibernate itself along with all inactive
subThreads that are waiting in the queue. The hibernation will be preceded and followed by WARNING
severity
alerts, issued by the MainThread; the former including the name of all subThread(s) that were detected to be in
hung state, and the latter indicating successful resume of operation. This behaviour is designed intentionally.
Although this may incur unintended delays to monitoring, it would otherwise constitute a greater risk to allow the
program to continue while the reason of the hang is unknown.
6. DISCOVERY
The program has a built-in discovery function which will be auto-triggered either during the first run or any time
the inventory file is updated. Collected data is stored in local files on disk; .DO_NOT_MODIFY_<nodename>.dsc
.
Discovery function uses the description tags configured on the router interfaces in order to build an inventory of all interfaces to be included in the decision-making process, as well as their IP addresses and the relevant BGP neighbors. For any BGP neighbor to be associated with a PNI, the session MUST be sourced from the IP address (IPv4 or IPv6) of the local interface.
Addition or removal of an interface to / from the monitoring (once the interfaces are labeled correctly) can be
achieved by using the pniDiscovery.py
script which can be found inside the same directory. The correct syntax
to run the script is as follows;
`pniDiscovery [-c <filename>] [--config <filename>]`
`pniDiscovery -c pniMonitor.conf`
Note: The first release of the code do not have persistence enabled. Hence, at any time the discovery function is triggered to run, which should not be too frequent, it will cause the previously collected data to be lost. This does not incur any risk other than delaying the process (decision making) functions by one (1) polling period.
At the time of development, the original intent of the code was to make it operate over SNMP only, to keep it fast and light-touch on the network equipment. However, since Cisco IOS-XR routers do not support the ACL-MIB; the discovery function had to evolve in a hybrid mode of operation where the ACL status on the interfaces is verified via an SSH session, while the rest of the inventory details are polled via SNMP.
Once discovery is completed; ACL configuration status per-interface is saved on disk, and not re-checked in every polling cycle. This behaviour is added in v1.1.0 to prevent overloading the management-plane on the routers with too many SSH connections. In the same release the script is also updated to keep the discovery file updated with any ACL configuration change it performs on the router interface(s) and alert in the case of failure to update, in order to prevent data inconsistencies.
7. PROBE (Data Collection)
The probe function collects the administrative and operational interface status, in and out octets per interface,
state of the BGP sessions and the number of received and accepted routes per-neighbor from each node simultaneously
and stores the data in local hidden files on disk; .DO_NOT_MODIFY_<nodename>.prb
, while also tagging the data it
collects with timestamps.
Since the process function (see Section-8) specifically relies on the timestamps of the previously collected data and is capable of measuring the timeDelta in its operation, interface utilisation can always be reliably calculated regardless of any interruptions in polling.
8. PROCESS (Decision Making)
The entire decision making logic resides in a function called _process(). The main function constantly runs in
the background (as a daemon-like process) and uses subThreads to re-assess the usable PNI egress capacity and
recalculate the actual risk factor
as actualPniOut / usablePniOut * 100
in the preferred polling frequency,
using the data collected by probe.
For any PNI interface and its available physical egress capacity to be considered 'usable', it must satisfy the following requirements;
- Interface operational status MUST be `UP` (this will typically be a Ethernet Bundle interface, and in the
case of partial link failures, the total bandwidth of the remaining interfaces will be considered available)
AND
- State of the BGPv4 session sourced from the interface's local IPv4 address MUST be `ESTABLISHED` AND the
number of IPv4 prefixes received and `accepted` from the remote BGP peer MUST NOT be lower than the
configured `ipv4_min_prefixes`
OR
- State of the BGPv6 session sourced from the interface's local IPv6 address MUST be `ESTABLISHED` AND the
number of IPv6 prefixes received and `accepted` from the remote BGP peer MUST NOT be lower than the
configured `ipv6_min_prefixes`
Once the usable PNI egress capacity is calculated:
If at any time;
- There is NO usable PNI egress capacity left on the local router:
OR
- There is a partial PNI failure scenario on the local router / traffic overflow from another site, which
causes the risk factor (ratio of the actual PNI egress to usable PNI egress capacity) to be equal or greater
than the configured rising_threshold:
ALL DIRECTLY-ATTACHED CDN INTERFACES WILL BE BLOCKED
Else, if at any time;
- Usable PNI egress capacity is present on the local router
AND
- There is usable PNI egress capacity is present on the local router,
AND
- The risk factor (ratio of the actual PNI egress to usable PNI egress capacity) is smaller than the configured
falling_threshold,
AND
- The sum of the maximum serving capacity of the unblocked local CDN caches and the actual non-local traffic (P2P
+ Overflow) egressing the local PNI and the maximum serving capacity of any directly-attached (but blocked) CDN
region is smaller than the usable PNI egress capacity on the local router:
DIRECTLY-ATTACHED CDN INTERFACES WILL START BEING UNBLOCKED, ONE BY ONE, AS SOON AS THE AFOREMENTIONED RULE IS
SATISFIED
Otherwise;
NO ACTION WILL BE TAKEN
9. LOGGING
The program saves its logs in two separate local files saved on the disk and rotated daily;
- pniMonitor_main.log: All events produced by the MainThread and its subThreads. Configurable severity.
- pniMonitor_ssh.log: All events that are logged by the SSH module. Has a fixed severity setting; WARNING.
- pniMonitor_cron.log: Generated and used by the livenessCheck script running on the crontab (see Section-10)
In addition to local log files, high severity events are also available to be distributed as email alerts (see Section-3 for configuration details).
Definition of available log / alert severities are as follows:
DEBUG
Detailed information, typically of interest only when diagnosing problems.
INFO
Confirmation that things are working as expected.
WARNING
An indication that something unexpected happened (such as a misconfiguration), or indicative of event (PNI failure,
BGP prefix withdrawal, etc) which will soon trigger automated recovery actions. The program is still working as
expected.
ERROR
Due to a more serious problem, the program has not been able to perform some function (such as a `Data Collection`
or `Configuration Attempt` failures).
CRITICAL
A serious error, indicating that the program itself will be unable to continue running (`Dying gasp`).
10. LIVENESS CHECKS
The distribution includes an audit script pniMonitor_livenessCheck.py
which can be added into the operating
system's crontab configuration to verify the liveness of the main program at regular intervals. It reads the PID of
the main process from the pniMonitor.pid
file, which is created by the main process during startup, and verifies
the existence of a matching entry in the operating system's /proc/
folder.
Any of the following conditions will cause the liveness check to fail and send out an email alert with CRITICAL
severity to the email distribution list found in the pniMonitor.conf
file;
- A process with the given PID is not running,
- A process ID could not be found in the pniMonitor.pid file,
- The pniMonitor.pid file could not be located,
- The pniMonitor.conf file could not be located. (This results in the email alerts being sent to a default
distribution list)
Non-critical events (INFO
, WARNING
or ERROR
) will be sent to console or a cronlog file (if one configured).
Sample crontab configuration to schedule the liveness checks to be run in every 5 minutes;
*/5 * * * * cd /<path>/laphroaig/; ./pniMonitor_livenessCheck.py -c pniMonitor.conf >> pniMonitor_cron.log 2>&1
Note: Using the above log file naming convention (pniMonitor_cron.log
) will allow the main script to handle the
rotation of the cronlogs with no additional configuration effort.
11. PLANNED FOR FUTURE RELEASES
- P1 STDOUT handling
- P1 Requirements.txt
- P1 Kill the reliance on local SMTP listener for better portability
- P2 Per-region cdn_serving_cap setting (It is available as a Global parameter in the current release)
- P2 Netcool integration (might outsource this)
- P2 Multi-ASN support
- P2 IPv6 ACL for RHM Blocking
- P2 Persistence (of the previously recorded interface utilization data upon a new node or interface discovery)
- P2 Automated discovery of new interfaces (In the current release it is manually triggered)
- P3 Nokia 7750 support
- P3 IOS-XR / SROS version check
- P4 Catch SIGTERM KILL and report in logging (SIGKILL cannot be caught & there is a seperate liveness check script)
- P4 Activate node reachability checks upon SSH failures for improved logging (Currently available for SNMP failures)
- P4 Graphical email updates with interface utilisation charts
- P4 Ordered directory structure (/logs, /data, /conf, etc.)
- P4 Replace SNMP & SSH with more reliable & convenient alternatives (eg. Netconf/RestAPI)