Autonomous CubeSat Health Monitoring & Fault Detection
Simulation-first, on-board autonomy without cloud or external APIs

Simulation & Dashboard

Live | Deterministic | Fault Injection

Mode

Nominal
Power/Thermal | Comms | Payload

SoC

80%
Energy Margin

Battery Voltage

8.0 V
EPS

OBC Temp

25 °C
Thermal

CPU Load

30%
Software

Downlink Success

0%
Comm

EPS Health

0%

Thermal Health

0%

Software Health

0%

Comm Health

0%
[Telemetry chart: Battery Voltage, SoC, OBC Temp, CPU Load, Downlink Success]

Fault Events

    Action Queue

      Fleet Management (Bulk Profiles & Linking)

      Profiles | Aggregation | Autonomy
      Create CubeSat Profile
      Model parameters are hardware-agnostic and map to EPS/thermal/comm budgets.
      Fleet Simulation Controls
      Fleet Status

      Nominal

      0

      Limited

      0

      Safe

      0
      Fleet Link & Aggregation
      +---------------- Fleet Aggregation Layer ----------------+
      | Per-satellite profiles (UID, model params, thresholds)  |
      |   -> Telemetry streams (1–2 Hz, local)                  |
      |   -> Health scores and faults (local)                   |
      |   -> Fleet aggregator (counts, lists, priority downlink)|
      | No inter-satellite control; autonomy per spacecraft.    |
      +---------------------------------------------------------+
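
The aggregation layer in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `SatStatus`, `aggregate_fleet`, and the priority rule (Safe first, then Limited, then by fault count) are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SatStatus:
    """Hypothetical per-satellite summary produced locally on each spacecraft."""
    uid: str
    mode: str          # "Nominal", "Limited", or "Safe"
    fault_count: int


def aggregate_fleet(statuses):
    """Count satellites per mode and order UIDs for priority downlink.

    The aggregator only reads summaries; it issues no commands, matching
    the "no inter-satellite control" note in the diagram.
    """
    counts = Counter(s.mode for s in statuses)
    # Assumed priority rule: Safe first, then Limited, then higher fault counts.
    rank = {"Safe": 0, "Limited": 1, "Nominal": 2}
    ordered = sorted(statuses, key=lambda s: (rank[s.mode], -s.fault_count, s.uid))
    return {
        "counts": {m: counts.get(m, 0) for m in ("Nominal", "Limited", "Safe")},
        "priority_downlink": [s.uid for s in ordered],
    }


fleet = [
    SatStatus("CS-01", "Nominal", 0),
    SatStatus("CS-02", "Safe", 3),
    SatStatus("CS-03", "Limited", 1),
]
summary = aggregate_fleet(fleet)
```

Keeping the aggregator read-only means a fleet-level bug cannot change any spacecraft's mode; each satellite remains responsible for its own recovery.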
                  
      UID Name Mode SoC Vbatt CPU Downlink Faults Actions

      Config & Tools

      Thresholds | Beacon | Sweep
      Threshold Editor
      Updates apply immediately to monitoring and fault logic.
      Beacon Builder (Current Simulation)
      {}
      Validation Sweep
      
              

      Bulk Launch Operations (Rideshare)

      Rideshare | Deployers | Ops Readiness
      Context
      • Modern rideshare missions deploy dozens to hundreds of CubeSats in a single separation sequence.
      • Common launchers: Falcon 9 Transporter (SSO), PSLV, Vega, Soyuz, Electron; altitude typically 450–600 km SSO.
      • Aggregators: Exolaunch (EXOpod), Planetary Systems (CSD), Spaceflight Inc., ISILAUNCH; deployers ensure mechanical and electrical isolation.
      Deployers & Standards
      • P-POD/QuadPack, CSD, EXOpod, and similar rail/deployer interfaces per the Cal Poly CubeSat Design Specification.
      • Requirements: no protrusions, kill-switches engaged, remove-before-flight (RBF) pins until integration.
      • Spring ejection induces tumble; satellites must detumble autonomously (e.g., B-dot control) within minutes.
      Regulatory & Spectrum
      • Licensing: FCC/NTIA (US) or national regulator; ITU filings by sponsoring administration.
      • Amateur bands require IARU coordination; unique call signs and beacon IDs for identification in crowded deployments.
      • Debris mitigation: deorbit compliance (≤25 years traditional, trend toward ≤5 years); drag augmentation recommended.
      Post-Deployment Operations
      • Radio silence period often mandated (e.g., ≥30 min) to avoid interference during separation; then low-duty beacon.
      • Initial acquisition uses generic TLE clouds; identification via unique beacon patterns and Doppler signatures.
      • ADCS detumbles first and the spacecraft defaults to Safe mode; power up rails and payloads only after health checks pass.
      Scaling Impacts on Flight Software
      • Minimal ground contact initially; autonomy must manage power, thermal, and comms without commands.
      • Prioritized downlink: beacon frames include health summary, fault events, and basic ephemeris.
      • Conjunction risk increases; timely ephemeris updates and identifiable beacons improve cataloging.
      Launch Readiness Checklist
      • Default Safe mode on boot; detumble enabled; transmit inhibited until post-separation timer expires.
      • Beacon content: subsystem vitals, fault counts, mode, unique ID; short frame size to fit low bitrate.
      • Battery charge rules enforced (0–45 °C) from first boot; payloads disabled until margins verified.
      • Hysteresis prevents oscillation in crowded RF; rate limits on restarts and mode changes.
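
To illustrate the "short frame size" item in the checklist, here is one possible fixed-size beacon layout. The 8-byte format, field order, scaling, and the names `BEACON_FMT`, `pack_beacon`, and `unpack_beacon` are all assumptions for this sketch; the document does not define a frame format.

```python
import struct

# Hypothetical little-endian layout, 8 bytes total:
#   uid: uint16 | mode: uint8 | soc_pct: uint8 | vbatt_mv: uint16
#   obc_temp_c: int8 | fault_count: uint8
BEACON_FMT = "<HBBHbB"
MODES = ("Nominal", "Limited", "Safe")


def pack_beacon(uid, mode, soc_pct, vbatt_v, obc_temp_c, fault_count):
    """Pack subsystem vitals into a compact frame for a low-bitrate link."""
    return struct.pack(
        BEACON_FMT,
        uid,
        MODES.index(mode),
        max(0, min(100, int(round(soc_pct)))),
        int(round(vbatt_v * 1000)),               # volts -> millivolts
        max(-128, min(127, int(round(obc_temp_c)))),
        min(fault_count, 255),                    # saturate instead of overflowing
    )


def unpack_beacon(frame):
    """Decode a frame produced by pack_beacon (ground-side helper)."""
    uid, mode_idx, soc, mv, temp, faults = struct.unpack(BEACON_FMT, frame)
    return {
        "uid": uid,
        "mode": MODES[mode_idx],
        "soc_pct": soc,
        "vbatt_v": mv / 1000.0,
        "obc_temp_c": temp,
        "fault_count": faults,
    }


frame = pack_beacon(0x1234, "Safe", 80.0, 8.0, 25.0, 2)
decoded = unpack_beacon(frame)
```

A real beacon would normally also carry a checksum and a call sign; those are omitted here to keep the sketch short.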

      Validation & Metrics Runner

      Deterministic | Repeatable | Quantitative
      Scenario Controls
      Ground Truth (Expected Events)
        Metric | Value | Notes
        Detection latency (avg) | | Seconds from injection to detection
        Detection latency (max) | | Worst-case across events
        Recall | | Detected truth events / total truth
        Precision | | Correct detections / all detections
        Survival time | | Operational time fraction (SoC ≥20%)
        Power saved | | Wh saved via shedding vs nominal payload
        False positives | | Detections without matching truth
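
The latency, recall, precision, and false-positive rows above could be computed along these lines. This is a sketch under assumptions: events are `(time_s, fault_type)` tuples, a detection matches a truth event of the same type if it occurs within `match_window_s` seconds after injection, and the function name `score_run` is hypothetical.

```python
def score_run(truth, detections, match_window_s=60.0):
    """Match detections to ground-truth events and compute run metrics.

    truth, detections: lists of (time_s, fault_type) tuples.
    Each detection can satisfy at most one truth event.
    """
    latencies = []
    used = set()
    for t_time, t_type in truth:
        best = None
        for i, (d_time, d_type) in enumerate(detections):
            if i in used or d_type != t_type:
                continue
            if 0.0 <= d_time - t_time <= match_window_s:
                # Keep the earliest qualifying detection.
                if best is None or d_time < detections[best][0]:
                    best = i
        if best is not None:
            used.add(best)
            latencies.append(detections[best][0] - t_time)

    matched = len(latencies)
    return {
        "latency_avg_s": sum(latencies) / matched if matched else None,
        "latency_max_s": max(latencies) if matched else None,
        "recall": matched / len(truth) if truth else None,
        "precision": matched / len(detections) if detections else None,
        "false_positives": len(detections) - matched,
    }


truth_events = [(100.0, "PowerFault"), (400.0, "ThermalFault")]
detected = [(104.0, "PowerFault"), (250.0, "CommFault")]
metrics = score_run(truth_events, detected)
```

In this example one of two truth events is detected (after 4 s) and one detection has no matching truth event, so recall and precision are both 0.5.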

        Executive Summary

        Local | Low-overhead | No cloud
        Objective
        • Monitor telemetry, detect/classify faults, and autonomously recover.
        • Operate under constrained power, CPU, memory, and bandwidth.
        • Simulation-first with deterministic scenarios; fully local execution.
        Key Capabilities
        • Rule-based checks plus sliding-window trend analysis.
        • Structured fault events with severity and timestamps.
        • Safety-prioritized recovery with hysteresis and rate-limiting.
        Constraints
        • No cloud services, external APIs, or heavy frameworks.
        • Hardware-agnostic configuration; mission thresholds parameterized.
        • Deterministic simulator; 1–2 Hz ingest; bounded memory via ring buffers.
        Explainable | Trend-aware | Safety-first

        System Architecture

        Modules | Data Flow
        +------------------------------ CubeSat Flight Software ------------------------------+
        | [Subsystem Drivers] -> [Telemetry Ingest] -> [Health Monitoring] -> [Fault Classifier]
        |         |                   |                   |                    |
        |     EPS, OBC,          timestamped         rules + trends       type/severity/
        |     Thermal, COMM      frames (1–2 Hz)      sliding windows     time/subsystem
        |                                                                                     |
        |                                 -> [Decision & Recovery] -> [Action Queue]          |
        |                                   safe-mode, restarts, payload shedding             |
        |                                                                                     |
        |                       [Logging & Telemetry Output] (ring buffers, priority)         |
        |                                                                                     |
        |                                     [Simulator]                                     |
        |                           nominal + injected faults (deterministic)                 |
        +-------------------------------------------------------------------------------------+
                  
        End-to-End Behavior
        • Ingest frames at 1–2 Hz with UTC timestamps.
        • Combine rules and trends into subsystem health scores.
        • Emit structured fault events when anomalies persist.
        • Execute recovery actions under a safety hierarchy.
        • Log to ring buffers and prioritize emergency downlink.
        • Drive validation via deterministic simulations.
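
The ring-buffer logging and prioritized downlink steps might look like the following. It is only a sketch: the `RingLog` class, the record fields `t` and `severity`, and the ordering rule (highest severity first, newest first within a severity) are assumptions, not the project's actual logger.

```python
from collections import deque


class RingLog:
    """Bounded log: once full, the oldest record is overwritten."""

    # Assumed severity ranking for downlink ordering.
    _RANK = {"Critical": 0, "Warning": 1, "Info": 2}

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)   # deque drops the oldest item itself

    def append(self, record):
        self._buf.append(record)

    def __len__(self):
        return len(self._buf)

    def downlink_order(self):
        """Return records with the most severe and most recent first."""
        return sorted(self._buf, key=lambda r: (self._RANK[r["severity"]], -r["t"]))


log = RingLog(capacity=3)
log.append({"t": 1, "severity": "Critical", "msg": "undervoltage"})
log.append({"t": 2, "severity": "Info", "msg": "pass start"})
log.append({"t": 3, "severity": "Warning", "msg": "cpu high"})
log.append({"t": 4, "severity": "Critical", "msg": "overtemp"})  # evicts t=1
order = [r["t"] for r in log.downlink_order()]
```

Note that a plain ring buffer can evict an old Critical record, as happens to `t=1` here; a flight design might keep a separate small buffer for Critical events so they survive until downlinked.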

        Workflow

        Ingest | Monitor | Detect | Decide | Act
        1. Subsystem drivers publish telemetry frames for EPS, Thermal, OBC, COMM, Payload.
        2. Health engine evaluates thresholds and updates sliding-window statistics incrementally.
        3. Fault classifier converts anomalies into events: type, severity, timestamp, subsystem.
        4. Decision engine selects mode and actions using safety hierarchy, hysteresis, rate limits.
        5. Action queue executes payload shedding, rail isolation, restarts, and beacon telemetry.
        6. Logging stores telemetry, health scores, and fault events for downlink and analysis.

        Phase 1 — Telemetry & Operating Ranges

        Ranges | Warnings | Criticals
        Subsystem | Parameter | Nominal | Warning | Critical | Rationale
        EPS | Battery Voltage | 7.0–8.4 V | <6.9 V | <6.6 V | Prevents brownouts and deep discharge; governs safe-mode entry.
        EPS | Pack Current | −0.1 to −1.5 A charge; +0.1 to +3.0 A discharge | Discharge >3.0 A >10 s or charge <−2.0 A | Discharge >3.5 A or step >1.0 A in 1 s | Detects overloads and shorts; protects rails and battery.
        EPS | 3.3 V Rail | 3.3 V ±5% | ±8% | ±10% | Ensures CPU/peripheral stability; avoids resets.
        EPS | 5 V Rail | 5.0 V ±5% | ±8% | ±10% | Supports radio/payload reliability; avoids damage.
        EPS | State of Charge | 30–100% | <25% >5 min | <20% >2 min | Maintains eclipse survivability; triggers payload shedding.
        Thermal | OBC Temperature | −10 to +55 °C | <−15 or >60 °C | <−20 or >70 °C | Protects electronics and timing integrity.
        Thermal | Battery Temperature | −10 to +40 °C | <−15 or >45 °C | <−20 or >50 °C | Charges only 0–45 °C; unsafe ranges risk venting.
        Thermal | Payload Temperature | −20 to +60 °C | <−25 or >65 °C | <−30 or >75 °C | Protects sensors and calibration stability.
        OBC | CPU Load | 10–60% | >75% for 30 s | >90% for 60 s with missed heartbeat | Prevents missed deadlines and watchdog resets.
        OBC | Memory Usage | 40–80% | >90% for 30 s | >95% for 60 s | Detects leaks and fragmentation.
        OBC | Heartbeat | ~1 Hz | Missing >5 s | Missing >30 s | Detects process starvation or deadlocks.
        COMM | Downlink Success | >70% in pass | 30–70% | <30% >60 s in pass | Indicates link health; triggers bitrate/FEC and resets.
        COMM | Uplink Success | >60% in pass | 30–60% | <30% >120 s across pass | Detects ground contact or antenna issues.
        Payload | Payload Current | 0.2–1.0 A typical | >1.2× nominal >10 s | >1.5× nominal or rail droop >±8% | Captures shorts/misconfigurations; isolates faulty payload.
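
Two patterns recur in the table: a two-level limit (for example battery voltage, Warning <6.9 V and Critical <6.6 V) and a limit that must hold for some time (for example CPU load >75% for 30 s). A minimal sketch of both follows; `LowLimit`, `classify_low`, and `Persistence` are hypothetical names, and only the numeric limits come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LowLimit:
    """Low-side thresholds: values below these are flagged."""
    warning: float
    critical: float


def classify_low(value, limit):
    """Return the severity level for a parameter with low-side limits."""
    if value < limit.critical:
        return "Critical"
    if value < limit.warning:
        return "Warning"
    return "Nominal"


class Persistence:
    """Report True only once a condition has held continuously for hold_s seconds."""

    def __init__(self, hold_s):
        self.hold_s = hold_s
        self._since = None          # time the condition first became active

    def update(self, active, t):
        if not active:
            self._since = None      # any gap resets the timer
            return False
        if self._since is None:
            self._since = t
        return t - self._since >= self.hold_s


VBATT = LowLimit(warning=6.9, critical=6.6)          # from the EPS row
levels = [classify_low(v, VBATT) for v in (7.2, 6.8, 6.5)]

cpu_warning = Persistence(hold_s=30.0)               # ">75% for 30 s"
flags = [cpu_warning.update(load > 75.0, float(t)) for t, load in enumerate([80.0] * 31)]
```

With 1 Hz samples starting at t = 0, the CPU warning first asserts at t = 30 s; a single sample at or below 75% restarts the count.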

        Phase 2 — Telemetry Simulator & Fault Injection

        Deterministic | Orbits | Faults
        Required Behaviors
        • Day/night cycling across a 90-min orbit with eclipse fraction.
        • Thermal lag via first-order dynamics with distinct time constants.
        • Task-driven CPU load spikes during payload ops and passes.
        • Ground-pass windows with increased comm success rates.
        Fault Injection Scenarios
        • Battery undervoltage with negative slope and low SoC.
        • Thermal overheating due to reduced radiative coupling or increased duty.
        • Payload power surge causing 5 V rail droop.
        • Communication blackout during scheduled pass.
        • CPU hang and watchdog-triggered recovery.
        Orbit Mapping
        • Eclipse drives energy budget; solar input zero in darkness.
        • Thermal inertia causes lag between load changes and temperature.
        • Pass geometry governs link availability and performance.
        • Runaway processes and payload events mirror flight risks.
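
A stripped-down version of such a simulator, covering only eclipse cycling, first-order thermal lag, and one injected thermal fault, might look like this. The 90-minute orbit is from the text; the eclipse fraction, equilibrium temperatures, time constant, solar power, noise level, and the `overheat_from_s` parameter are illustrative assumptions.

```python
import random

ORBIT_S = 90 * 60           # 90-minute orbit (from the text)
ECLIPSE_FRACTION = 0.35     # assumed fraction of each orbit spent in shadow


def in_eclipse(t):
    """Sunlit phase first, then eclipse for the last part of each orbit."""
    return (t % ORBIT_S) >= ORBIT_S * (1.0 - ECLIPSE_FRACTION)


def simulate(duration_s, dt=1.0, seed=42, tau_s=600.0, overheat_from_s=None):
    """Generate telemetry frames; the same arguments always give the same frames."""
    rng = random.Random(seed)           # private, seeded RNG for repeatability
    temp = 25.0                         # assumed initial OBC temperature, deg C
    frames = []
    for i in range(int(duration_s / dt)):
        t = i * dt
        sunlit = not in_eclipse(t)
        target = 35.0 if sunlit else 5.0            # assumed equilibrium temps
        if overheat_from_s is not None and t >= overheat_from_s:
            target += 30.0                          # injected fault: reduced heat rejection
        # First-order lag: dT/dt = (target - T) / tau
        temp += (target - temp) * dt / tau_s
        frames.append({
            "t": t,
            "sunlit": sunlit,
            "solar_w": 6.0 if sunlit else 0.0,      # zero solar input in darkness
            "obc_temp_c": temp + rng.gauss(0.0, 0.05),
        })
    return frames


nominal = simulate(2000)
faulted = simulate(2000, overheat_from_s=0.0)
```

Because the RNG is seeded per run, a nominal and a faulted run share the same noise sequence, which makes the effect of the injected fault easy to isolate.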

        Phase 3 — Health Monitoring Engine

        Rules | Trends | Low Overhead
        Approach
        • Rule-based thresholds for immediate, explainable flags.
        • Sliding-window trend analysis for slow drifts.
        • Incremental statistics for mean, variance, and slope.
        Incremental Statistics
        • Mean and variance via numerically stable updates.
        • Linear slope from windowed sums for best-fit trend.
        • Min/max and percentiles for bounds and spikes.
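
As a sketch of the first two items, the block below uses Welford's update for a running mean and variance, and running sums for a least-squares slope over a fixed-size window. The class names are hypothetical, and the document does not say which update formulas the engine uses; these are common choices. Note that `Welford` here covers the whole stream, while `WindowSlope` covers only the last `size` samples.

```python
from collections import deque


class Welford:
    """Numerically stable running mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0      # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


class WindowSlope:
    """Best-fit slope of y against x over the last `size` samples, via running sums."""

    def __init__(self, size):
        self.size = size
        self.win = deque()
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        self.win.append((x, y))
        self.sx += x; self.sy += y; self.sxx += x * x; self.sxy += x * y
        if len(self.win) > self.size:
            ox, oy = self.win.popleft()     # evict the oldest sample from the sums
            self.sx -= ox; self.sy -= oy; self.sxx -= ox * ox; self.sxy -= ox * oy

    def slope(self):
        n = len(self.win)
        denom = n * self.sxx - self.sx ** 2
        if n < 2 or denom == 0.0:
            return 0.0
        return (n * self.sxy - self.sx * self.sy) / denom


stats = Welford()
for sample in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(sample)

trend = WindowSlope(size=10)
for k in range(20):
    trend.update(float(k), 8.0 - 0.01 * k)   # battery voltage falling 10 mV per step
```

Each update costs O(1) time and the window bounds memory. Over long runs the running sums can accumulate rounding error when `x` grows large, so a flight version might use time relative to the window start.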
        Health Outputs
        • Per-parameter scores normalized by thresholds.
        • Subsystem health scores from weighted aggregation.
        • Early warnings when a parameter stays beyond a limit for its configured duration.
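
One way to read "normalized by thresholds" is a linear score that is 1.0 at the edge of the nominal range and 0.0 at the critical limit. That mapping, the weights, and the function names below are assumptions for illustration; the document does not specify the scoring formula.

```python
def param_score(value, nominal_edge, critical_edge):
    """Linear health score in [0, 1].

    1.0 at or inside the nominal edge, 0.0 at or beyond the critical edge.
    Works for low-side limits (nominal_edge > critical_edge) and
    high-side limits (nominal_edge < critical_edge).
    """
    frac = (value - critical_edge) / (nominal_edge - critical_edge)
    return max(0.0, min(1.0, frac))


def subsystem_health(scored):
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(w for _, w in scored)
    return sum(s * w for s, w in scored) / total_weight


# Battery voltage: nominal from 7.0 V, critical below 6.6 V (low-side limit).
v_score = param_score(6.8, nominal_edge=7.0, critical_edge=6.6)
# OBC temperature: nominal up to 55 C, critical above 70 C (high-side limit).
t_score = param_score(60.0, nominal_edge=55.0, critical_edge=70.0)
# Assumed weights: voltage counts twice as much as SoC in EPS health.
eps_health = subsystem_health([(v_score, 2.0), (param_score(30.0, 30.0, 20.0), 1.0)])
```

Here 6.8 V sits halfway between the nominal edge and the critical limit, so its score is about 0.5, and the weighted EPS health comes out near 0.67.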
        Example
        • Track per-orbit V_batt minima and midday SoC maxima.
        • Raise warning for >50 mV/orbit drop and >5% SoC decline.
        • Escalate to critical for >100 mV/orbit or SoC <30% with nominal insolation.
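
The example rules can be written as a small function that compares one orbit with the previous one. The function name and argument names are hypothetical, and the sketch assumes the "5% SoC decline" is measured in percentage points between consecutive orbits, which the text does not state explicitly.

```python
def orbit_trend_level(vmin_prev, vmin_now, soc_max_prev, soc_max_now,
                      nominal_insolation=True):
    """Classify the orbit-to-orbit battery trend.

    vmin_*: per-orbit minimum battery voltage in volts.
    soc_max_*: per-orbit midday SoC maximum in percent.
    """
    drop_mv = (vmin_prev - vmin_now) * 1000.0
    soc_decline = soc_max_prev - soc_max_now        # percentage points (assumed)

    # Critical: >100 mV/orbit, or SoC <30% even though insolation is nominal.
    if drop_mv > 100.0 or (nominal_insolation and soc_max_now < 30.0):
        return "Critical"
    # Warning: >50 mV/orbit drop together with >5% SoC decline.
    if drop_mv > 50.0 and soc_decline > 5.0:
        return "Warning"
    return "Nominal"


warning_case = orbit_trend_level(7.40, 7.33, 85.0, 78.0)    # about 70 mV drop, 7 points
critical_case = orbit_trend_level(7.40, 7.28, 85.0, 84.0)   # about 120 mV drop
steady_case = orbit_trend_level(7.40, 7.39, 85.0, 84.5)     # about 10 mV drop
```

Gating the SoC check on insolation avoids flagging a low SoC that is simply explained by a long eclipse or an off-pointed orbit.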

        Phase 4 — Fault Detection & Classification

        Types | Severity | Timestamps
        Categories
        • Power: undervoltage, low SoC, overcurrent.
        • Thermal: overtemp, unsafe battery charge, rapid heating.
        • Communication: blackout, persistent uplink failure.
        • Software: CPU hang, memory leak, telemetry gap.
        • Payload: overcurrent/overheat, non-responsive reset.
        Fault Record
        {
          "fault_type": "PowerFault",
          "severity": "Critical",
          "timestamp": "2026-01-05T12:34:56Z",
          "subsystem": "EPS"
        }
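
A record with these four fields can be produced from a small dataclass. The field names and the timestamp format are taken from the record above; the `FaultRecord` class and `make_fault` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FaultRecord:
    fault_type: str     # e.g. "PowerFault"
    severity: str       # e.g. "Warning" or "Critical"
    timestamp: str      # UTC, ISO 8601 with trailing "Z"
    subsystem: str      # e.g. "EPS"


def make_fault(fault_type, severity, subsystem, when):
    """Build a record, normalizing the time to UTC at one-second resolution."""
    ts = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return FaultRecord(fault_type, severity, ts, subsystem)


record = make_fault(
    "PowerFault", "Critical", "EPS",
    datetime(2026, 1, 5, 12, 34, 56, tzinfo=timezone.utc),
)
encoded = json.dumps(asdict(record))
```

Making the record immutable (`frozen=True`) keeps logged events from being altered after classification.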

        Phase 5 — Autonomous Decision & Recovery

        Modes | Safety Hierarchy | Hysteresis
        Operating Modes
        • Nominal: full operations and standard thresholds.
        • Limited: payload duty reduced, conservative limits, prioritized telemetry.
        • Safe: payloads off, high-power rails off, beacon telemetry only.
        Safety Hierarchy
        • Power and thermal survival first.
        • Communication continuity second.
        • Payload operation last.
        Recovery Actions
        • Payload shedding and rail isolation.
        • Subsystem restarts, escalating to OBC reboot.
        • Safe-mode entry with beacon telemetry.
        Hysteresis & Rate-Limiting
        • Debounce entries: require persistence beyond short spikes.
        • Exit only after values clear the limits by a margin for a minimum duration.
        • Rate-limit restarts and escalation to avoid oscillation.

        Phase 6 — Validation & Metrics

        Scenarios | Metrics | Deterministic
        Scenarios
        • Single fault: undervoltage during eclipse.
        • Cascading faults: payload surge, bus droop, CPU spike, comm degradation.
        • Long-term degradation: capacity loss across multiple orbits.
        • False-positive avoidance: short comm fades and thermal spikes.
        Metrics
        • Detection latency: Critical <10 s, Warning <60 s.
        • Classification accuracy: Critical precision >95%, recall >85%.
        • Survival time across eclipses under degraded EPS.
        • Power saved via payload shedding and duty cycling.
        Method
        • Deterministic seeds and scenario scripts at 1–2 Hz.
        • Logs of telemetry, health scores, and fault events to ring buffers.
        • Post-run metrics and parameter sweeps for robustness.

        Phase 7 — Presentation Summary

        Technical | Concise | Professional
        Problem
        • Tight power and thermal budgets; intermittent communications.
        • Delayed ground intervention requires on-board autonomy.
        Solution
        • Explainable, lightweight monitoring and recovery aligned to NanoSat constraints.
        • Deterministic simulator for validation of nominal and faulted operations.
        Architecture
        • Telemetry ingest, monitoring, classification, decision/recovery, logging/downlink.
        • Simulator generates nominal and injected faults for repeatable tests.
        Key Decisions
        • Incremental statistics for low CPU/memory.
        • Hysteresis and rate limits to prevent oscillation.
        • Safety hierarchy and mode transitions protecting spacecraft.
        Validation Targets
        • Critical detection latency <10 s; Warning <60 s.
        • Critical precision >95%; recall >85% across scenarios.
        • Improved eclipse survival under degraded EPS.
        • Quantified energy savings via payload shedding.
        Future Extensions
        • Lightweight ML anomaly detectors trained offline, deployed on-board.
        • Explainability preserved by combining ML scores with rule/trend evidence.

        Operational Logic & Thresholds

        Safety-first | Hysteresis | Priority
        Priority Order
        • Power and thermal survival.
        • Communication continuity.
        • Payload operation.
        Mode Transitions
        • Critical EPS/Thermal persists ≥10 s → Safe.
        • Clear exit: V_batt ≥7.0 V and SoC ≥40% for ≥5 min, no active Critical faults.
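
The two transition rules can be sketched as a small state holder with separate entry and exit timers. The numeric values come from the bullets above; the class name is hypothetical, the Limited mode is left out for brevity, and returning to Nominal on exit is an assumption since the text does not name the mode entered after Safe.

```python
class ModeController:
    ENTER_HOLD_S = 10.0     # Critical EPS/Thermal must persist this long
    EXIT_HOLD_S = 300.0     # healthy conditions must hold for 5 min
    EXIT_VBATT_V = 7.0
    EXIT_SOC_PCT = 40.0

    def __init__(self):
        self.mode = "Nominal"
        self._crit_since = None
        self._ok_since = None

    def update(self, t, vbatt, soc, critical_active):
        """Advance the mode logic with one telemetry sample at time t (seconds)."""
        if self.mode != "Safe":
            if critical_active:
                if self._crit_since is None:
                    self._crit_since = t
                if t - self._crit_since >= self.ENTER_HOLD_S:
                    self.mode = "Safe"
                    self._ok_since = None
            else:
                self._crit_since = None         # short spikes do not accumulate
        else:
            healthy = (vbatt >= self.EXIT_VBATT_V and soc >= self.EXIT_SOC_PCT
                       and not critical_active)
            if healthy:
                if self._ok_since is None:
                    self._ok_since = t
                if t - self._ok_since >= self.EXIT_HOLD_S:
                    self.mode = "Nominal"       # assumed exit target
                    self._crit_since = None
            else:
                self._ok_since = None           # any relapse restarts the exit timer
        return self.mode


controller = ModeController()
entry = [controller.update(float(t), 6.5, 15.0, True) for t in range(11)]
recovery = [controller.update(float(t), 7.2, 50.0, False) for t in range(11, 312)]
```

The exit conditions are stricter than the entry conditions, which is what keeps the controller from bouncing between modes when the battery hovers near a limit.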
        Recovery Ladder
        • Process restart → Subsystem re-init → OBC reboot if persistent.
        • Rate-limit: ≤1 restart per 5 min; 2 retries then escalate.
        • Beacon telemetry when in Safe with summarized status.
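
A rate-limited ladder following these rules could be sketched as below. The rung names and the class are hypothetical, and "2 retries then escalate" is interpreted here as two attempts per rung before moving to the next one.

```python
LADDER = ("process_restart", "subsystem_reinit", "obc_reboot")


class RecoveryLadder:
    MIN_INTERVAL_S = 300.0      # at most one restart per 5 min
    ATTEMPTS_PER_RUNG = 2       # assumed reading of "2 retries then escalate"

    def __init__(self):
        self.rung = 0
        self.attempts = 0
        self._last_action_t = None

    def next_action(self, t):
        """Return the action to run at time t, or None while rate-limited."""
        if (self._last_action_t is not None
                and t - self._last_action_t < self.MIN_INTERVAL_S):
            return None
        if self.attempts >= self.ATTEMPTS_PER_RUNG and self.rung < len(LADDER) - 1:
            self.rung += 1          # escalate once this rung is exhausted
            self.attempts = 0
        self.attempts += 1
        self._last_action_t = t
        return LADDER[self.rung]

    def reset(self):
        """Call when the fault clears so the next fault starts at the bottom rung."""
        self.rung = 0
        self.attempts = 0


ladder = RecoveryLadder()
actions = [ladder.next_action(t) for t in (0.0, 100.0, 300.0, 600.0, 900.0, 1200.0)]
```

The request at t = 100 s is suppressed by the rate limit, and an OBC reboot is reached only after two process restarts and two subsystem re-inits have failed to clear the fault.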
        Communication Handling
        • In-pass blackout: lower bitrate or increase FEC if available.
        • Radio re-init and emergency telemetry prioritization.