Monitoring
Kublade provides comprehensive monitoring capabilities for your Kubernetes clusters, automatically collecting and aggregating metrics at multiple levels. The monitoring system is designed to provide real-time insights into cluster health, resource usage, and performance.
Monitoring Levels
The system monitors resources at different levels:
Global Level
- Overall system health
- Cross-cluster resource utilization
- Global alert status
- System-wide metrics aggregation
- Performance trends across all clusters
Project Level
- Project-wide resource usage
- Aggregated cluster metrics
- Project-specific alerts
- Resource allocation across clusters
- Project health status
Cluster Level
- Cluster health status
- Node status and metrics
- Resource utilization
- Network performance
- Storage capacity
- Pod distribution
Node Level
- CPU usage and allocation
- Memory consumption
- Storage utilization
- Network I/O
- Pod count and status
- System load
Automated Monitoring
Kublade automatically sets up and maintains monitoring through several components:
Status Monitoring
StatusMonitoring
job runs periodically- Tracks cluster availability
- Monitors node health
- Updates cluster status
- Reports real-time health
Resource Monitoring
LimitMonitoring
job collects metrics- Tracks CPU utilization
- Monitors memory usage
- Measures storage consumption
- Counts pod usage
- Calculates resource percentages
Metrics Collection
- Collects node metrics via cAdvisor
- Aggregates cluster-level metrics
- Tracks resource quotas
- Monitors namespace usage
- Records historical data
Status Aggregation
The system uses boolean values to track health status at different levels:
Cluster Status
true
: Cluster is healthy and operationalfalse
: Cluster is unhealthy or unreachable
Node Status
true
: Node is healthy and operationalfalse
: Node is unhealthy or unreachable
Health Aggregation
- Cluster health is determined by the health of its nodes
- Project health is determined by the health of its clusters
- Global health is determined by the health of all clusters
- A single unhealthy component will set the parent's health to
false
Alert System
Kublade implements a comprehensive alert system:
Alert Levels
info
: Informational alertswarning
: Potential issuescritical
: Immediate attention needed
Alert Types
cluster
: Cluster-related alertsnode
: Node-specific alertsresource
: Resource usage alertsnetwork
: Network-related alertsstorage
: Storage capacity alerts
Alert Triggers
- Resource thresholds exceeded
- Cluster status changes
- Node status changes
- Network issues detected
- Storage capacity warnings
Metrics and Dashboards
Available Metrics
- CPU utilization per node/cluster
- Memory usage and allocation
- Storage capacity and usage
- Network throughput and latency
- Pod count and distribution
- Resource quota utilization
- Node health status
- Cluster performance metrics
Dashboard Views
- Global system overview
- Project-level dashboards
- Cluster-specific views
- Node detail pages
- Resource utilization graphs
- Historical trend analysis
- Alert status overview
- Performance metrics
Best Practices
-
Monitoring Setup
- Configure appropriate thresholds
- Set up meaningful alerts
- Define clear escalation paths
- Regular review of metrics
- Update monitoring rules
-
Alert Management
- Define clear alert priorities
- Set up proper notifications
- Regular alert review
- Update alert thresholds
- Document alert procedures
-
Resource Planning
- Monitor resource trends
- Plan capacity upgrades
- Track usage patterns
- Optimize resource allocation
- Regular capacity planning
Troubleshooting
Common Issues
-
Monitoring Problems
- Check metrics server status
- Verify cAdvisor access
- Validate permissions
- Review resource quotas
- Check network connectivity
-
Alert Issues
- Verify alert configuration
- Check notification settings
- Review alert thresholds
- Validate alert delivery
- Test alert triggers
-
Metric Collection
- Verify data collection
- Check storage capacity
- Review retention policies
- Validate metric accuracy
- Monitor collection jobs
-
Status Reporting
- Verify status updates
- Check aggregation logic
- Review status definitions
- Validate status changes
- Monitor status jobs