Autonomous CubeSat Health Monitoring & Fault Detection

Simulation & Dashboard

Live | Deterministic | Fault Injection

Controls: Seed, Rate
Fault injection: Battery Undervoltage, Thermal Overheating, Payload Power Surge, Communication Blackout, CPU Hang

Mode: Nominal

Power/Thermal | Comms | Payload

Vital	Value	Category
SoC	80%	Energy Margin
Battery Voltage	8.0 V	EPS
OBC Temp	25 °C	Thermal
CPU Load	30%	Software
Downlink Success	0%	Comm

Health Score	Value
EPS Health	0%
Thermal Health	0%
Software Health	0%
Comm Health	0%

Charts: Battery Voltage, SoC, OBC Temp, CPU Load, Downlink Success

Panels: Fault Events, Action Queue

Fleet Management (Bulk Profiles & Linking)

Profiles | Aggregation | Autonomy

Create CubeSat Profile

Fields: Name, UID, Battery (Wh), Solar (W), Payload (A), Seed

Model parameters are hardware-agnostic and map to EPS/thermal/comm budgets.

Fleet Simulation Controls

Controls: Rate, Rideshare Profile (radio silence + detumble)

Fleet Status

Nominal: 0, Limited: 0, Safe: 0

Fleet Link & Aggregation

+---------------- Fleet Aggregation Layer ----------------+
| Per-satellite profiles (UID, model params, thresholds)  |
|   -> Telemetry streams (1–2 Hz, local)                  |
|   -> Health scores and faults (local)                   |
|   -> Fleet aggregator (counts, lists, priority downlink)|
| No inter-satellite control; autonomy per spacecraft.    |
+---------------------------------------------------------+

UID	Name	Mode	SoC	Vbatt	CPU	Downlink	Faults	Actions
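The aggregation layer in the diagram above only collects locally computed status; it never commands a spacecraft. A minimal sketch of such a read-only aggregator follows. The `SatStatus` schema, the UIDs, and the ordering rule (Safe first, then by fault count) are illustrative assumptions, not part of the tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SatStatus:
    """Status computed locally by one spacecraft (hypothetical schema)."""
    uid: str
    mode: str          # "Nominal", "Limited", or "Safe"
    fault_count: int


def aggregate_fleet(statuses):
    """Count spacecraft per mode and order UIDs for priority downlink.

    Read-only: nothing here feeds back into any spacecraft's autonomy.
    """
    counts = Counter(s.mode for s in statuses)
    mode_rank = {"Safe": 0, "Limited": 1, "Nominal": 2}
    # Assumed ordering: Safe-mode spacecraft first, then most faults first.
    ordered = sorted(statuses, key=lambda s: (mode_rank[s.mode], -s.fault_count))
    return {
        "counts": {m: counts.get(m, 0) for m in ("Nominal", "Limited", "Safe")},
        "priority_downlink": [s.uid for s in ordered],
    }


fleet = [
    SatStatus("CS-001", "Nominal", 0),
    SatStatus("CS-002", "Safe", 3),
    SatStatus("CS-003", "Limited", 1),
]
summary = aggregate_fleet(fleet)
```

Here `summary["counts"]` feeds the Fleet Status counters and `summary["priority_downlink"]` gives the order in which spacecraft would be serviced.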

Config & Tools

Thresholds | Beacon | Sweep

Threshold Editor

Fields: Vbatt Warn, Vbatt Crit, SoC Warn, SoC Crit, CPU Warn, CPU Crit
Action: Import Config

Updates apply immediately to monitoring and fault logic.

Beacon Builder (Current Simulation)

{}

UID

Validation Sweep

Controls: Scenario, Runs, Base Seed, Rate

Bulk Launch Operations (Rideshare)

Rideshare | Deployers | Ops Readiness

Context

Modern rideshare missions deploy dozens to hundreds of CubeSats in a single separation sequence.
Common launchers: Falcon 9 Transporter (SSO), PSLV, Vega, Soyuz, Electron; altitude typically 450–600 km SSO.
Aggregators and dispenser providers: Exolaunch (EXOpod), Planetary Systems Corporation (CSD), Spaceflight Inc., ISILaunch; deployers ensure mechanical and electrical isolation.

Deployers & Standards

P-POD, QuadPack, CSD, EXOpod, and similar rail/deployer interfaces per the Cal Poly CubeSat Design Specification.
Requirements: no protrusions, deployment (kill) switches engaged, and remove-before-flight (RBF) pins in place until integration.
Spring ejection induces tumble; satellites must detumble autonomously (e.g., B-dot control), typically within minutes to a few orbits depending on tip-off rate and actuator authority.
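B-dot control, mentioned above, commands a magnetic dipole opposing the measured rate of change of the body-frame magnetic field. The sketch below shows the basic law with a finite-difference derivative; the gain and the sample values are made-up illustrations, and a flight version would also need actuator saturation and magnetometer filtering.

```python
def bdot_dipole(b_prev, b_curr, dt, gain):
    """Commanded dipole m = -gain * dB/dt, using a finite-difference dB/dt.

    b_prev, b_curr: consecutive body-frame field samples in tesla (x, y, z).
    """
    return tuple(-gain * (curr - prev) / dt for prev, curr in zip(b_prev, b_curr))


# Two magnetometer samples taken 0.5 s apart (illustrative values).
m = bdot_dipole(
    b_prev=(2.0e-5, 0.0, -1.0e-5),
    b_curr=(2.1e-5, 0.0, -1.2e-5),
    dt=0.5,
    gain=5.0e4,  # assumed gain; tuned per spacecraft inertia and torquer size
)
```

Each dipole component has the opposite sign to the field change on that axis, which is what removes rotational energy over time.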

Regulatory & Spectrum

Licensing: FCC/NTIA (US) or national regulator; ITU filings by sponsoring administration.
Amateur bands require IARU coordination; unique call signs and beacon IDs for identification in crowded deployments.
Debris mitigation: deorbit compliance (≤25 years traditional, trend toward ≤5 years); drag augmentation recommended.

Post-Deployment Operations

A radio-silence period is often mandated (e.g., ≥30 min) to avoid interference during separation, followed by a low-duty-cycle beacon.
Initial acquisition uses generic TLE clouds; identification relies on unique beacon patterns and Doppler signatures.
The ADCS detumbles and the spacecraft stays in Safe mode by default; rails and payloads power up only after health checks pass.

Scaling Impacts on Flight Software

Minimal ground contact initially; autonomy must manage power, thermal, and comms without commands.
Prioritized downlink: beacon frames include health summary, fault events, and basic ephemeris.
Conjunction risk increases; timely ephemeris updates and identifiable beacons improve cataloging.

Launch Readiness Checklist

Default Safe mode on boot; detumble enabled; transmit inhibited until post-separation timer expires.
Beacon content: subsystem vitals, fault counts, mode, unique ID; short frame size to fit low bitrate.
Battery charge rules enforced (0–45 °C) from first boot; payloads disabled until margins verified.
Hysteresis prevents mode oscillation when links are marginal in crowded RF; restarts and mode changes are rate-limited.
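The checklist asks for a short beacon frame carrying subsystem vitals, fault count, mode, and a unique ID. One way to keep the frame small is fixed-width binary packing, sketched below. The 10-byte layout, field order, and scaling are assumptions chosen for illustration, not the tool's actual beacon format.

```python
import struct

# Assumed layout (big-endian, 10 bytes):
# uid:uint16, mode:uint8, fault_count:uint8, vbatt_mV:uint16,
# soc_pct:uint8, obc_temp_C:int8, cpu_pct:uint8, downlink_pct:uint8
BEACON_FMT = ">HBBHBbBB"
MODES = ("Nominal", "Limited", "Safe")


def pack_beacon(uid, mode, fault_count, vbatt_v, soc_pct, obc_temp_c, cpu_pct, downlink_pct):
    """Encode a health summary into a fixed-size frame."""
    return struct.pack(
        BEACON_FMT, uid, MODES.index(mode), min(fault_count, 255),
        round(vbatt_v * 1000), soc_pct, obc_temp_c, cpu_pct, downlink_pct,
    )


def unpack_beacon(frame):
    """Decode a frame produced by pack_beacon."""
    uid, mode, faults, mv, soc, temp, cpu, dl = struct.unpack(BEACON_FMT, frame)
    return {
        "uid": uid, "mode": MODES[mode], "fault_count": faults,
        "vbatt_v": mv / 1000, "soc_pct": soc, "obc_temp_c": temp,
        "cpu_pct": cpu, "downlink_pct": dl,
    }


frame = pack_beacon(0x0A01, "Safe", 2, 8.0, 80, 25, 30, 0)
decoded = unpack_beacon(frame)
```

The fault count saturates at 255 so that a fault storm cannot overflow the one-byte field.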

Validation & Metrics Runner

Deterministic | Repeatable | Quantitative

Scenario Controls

Controls: Scenario, Seed, Rate

Ground Truth (Expected Events)

Metric	Value	Notes
Detection latency (avg)	–	Seconds from injection to detection
Detection latency (max)	–	Worst-case across events
Recall	–	Detected truth events / total truth
Precision	–	Correct detections / all detections
Survival time	–	Operational time fraction (SoC ≥20%)
Power saved	–	Wh saved via shedding vs nominal payload
False positives	–	Detections without matching truth
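The latency, recall, precision, and false-positive rows above can be computed by matching detections to ground-truth injections. The sketch below assumes events are `(time_s, fault_type)` tuples and that a detection counts only if it has the same type and falls within a hypothetical 60 s window after the injection; the runner's real matching rule may differ.

```python
def score_run(truth, detections, match_window_s=60.0):
    """Match each truth event to the earliest unused detection of the same type."""
    unmatched = sorted(detections)
    latencies = []
    for t_inj, kind in sorted(truth):
        hit = next(
            (d for d in unmatched
             if d[1] == kind and 0.0 <= d[0] - t_inj <= match_window_s),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)
            latencies.append(hit[0] - t_inj)
    matched = len(latencies)
    return {
        "recall": matched / len(truth) if truth else None,
        "precision": matched / len(detections) if detections else None,
        "false_positives": len(unmatched),
        "latency_avg_s": sum(latencies) / matched if matched else None,
        "latency_max_s": max(latencies) if matched else None,
    }


truth = [(100.0, "undervoltage"), (400.0, "comm_blackout")]
detections = [(104.0, "undervoltage"), (250.0, "overtemp"), (412.0, "comm_blackout")]
metrics = score_run(truth, detections)
```

In this example both injected faults are found (recall 1.0), one detection has no matching truth event (one false positive, precision 2/3), and the latencies are 4 s and 12 s.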

Executive Summary

Local | Low-overhead | No cloud

Objective

Monitor telemetry, detect/classify faults, and autonomously recover.
Operate under constrained power, CPU, memory, and bandwidth.
Simulation-first with deterministic scenarios; fully local execution.

Key Capabilities

Rule-based checks plus sliding-window trend analysis.
Structured fault events with severity and timestamps.
Safety-prioritized recovery with hysteresis and rate-limiting.

Constraints

No cloud services, external APIs, or heavy frameworks.
Hardware-agnostic configuration; mission thresholds parameterized.
Deterministic simulator; 1–2 Hz ingest; bounded memory via ring buffers.
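Bounded memory via ring buffers can be illustrated with a fixed-capacity log that silently drops the oldest frame when full. This sketch uses Python's `collections.deque` with `maxlen`; the `TelemetryLog` name and its methods are hypothetical.

```python
from collections import deque


class TelemetryLog:
    """Fixed-capacity log: appending to a full buffer evicts the oldest frame."""

    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)

    def append(self, frame):
        self._frames.append(frame)

    def latest(self, n):
        """Return up to n most recent frames, oldest first."""
        if n <= 0:
            return []
        return list(self._frames)[-n:]

    def __len__(self):
        return len(self._frames)


log = TelemetryLog(capacity=4)
for sample in range(10):
    log.append(sample)
```

After ten appends the log still holds four frames, so memory use is fixed by `capacity` regardless of run length.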

Explainable | Trend-aware | Safety-first

System Architecture

Modules | Data Flow

+------------------------------ CubeSat Flight Software ------------------------------+
| [Subsystem Drivers] -> [Telemetry Ingest] -> [Health Monitoring] -> [Fault Classifier]
|         |                   |                   |                    |
|     EPS, OBC,          timestamped         rules + trends       type/severity/
|     Thermal, COMM      frames (1–2 Hz)      sliding windows     time/subsystem
|                                                                                     |
|                                 -> [Decision & Recovery] -> [Action Queue]          |
|                                   safe-mode, restarts, payload shedding             |
|                                                                                     |
|                       [Logging & Telemetry Output] (ring buffers, priority)         |
|                                                                                     |
|                                     [Simulator]                                     |
|                           nominal + injected faults (deterministic)                 |
+-------------------------------------------------------------------------------------+

End-to-End Behavior

Ingest frames at 1–2 Hz with UTC timestamps.
Combine rules and trends into subsystem health scores.
Emit structured fault events when anomalies persist.
Execute recovery actions under a safety hierarchy.
Log to ring buffers and prioritize emergency downlink.
Drive validation via deterministic simulations.
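Prioritizing emergency downlink, as described in the list above, can be sketched with a heap keyed on severity and arrival order. The three priority classes and the `DownlinkQueue` name are assumptions made for this example.

```python
import heapq
import itertools

# Assumed priority classes; lower number is sent first.
PRIORITY = {"Critical": 0, "Warning": 1, "Telemetry": 2}


class DownlinkQueue:
    """Pops the most severe item first; FIFO within the same severity."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving arrival order

    def push(self, kind, payload):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), payload))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = DownlinkQueue()
queue.push("Telemetry", "t1")
queue.push("Critical", "f1")
queue.push("Telemetry", "t2")
queue.push("Warning", "w1")
sent = [queue.pop() for _ in range(len(queue))]
```

The sequence counter keeps ordering stable, so routine telemetry frames are never reordered among themselves.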

Workflow

Ingest | Monitor | Detect | Decide | Act

Subsystem drivers publish telemetry frames for EPS, Thermal, OBC, COMM, Payload.
Health engine computes thresholds and sliding-window statistics with incremental updates.
Fault classifier converts anomalies into events: type, severity, timestamp, subsystem.
Decision engine selects mode and actions using safety hierarchy, hysteresis, rate limits.
Action queue executes payload shedding, rail isolation, restarts, and beacon telemetry.
Logging stores telemetry, health scores, and fault events for downlink and analysis.

Phase 1 — Telemetry & Operating Ranges

Ranges | Warnings | Criticals

Subsystem	Parameter	Nominal	Warning	Critical	Rationale
EPS	Battery Voltage	7.0–8.4 V	<6.9 V	<6.6 V	Prevents brownouts and deep discharge; governs safe-mode entry.
EPS	Pack Current	−0.1 to −1.5 A (charge); +0.1 to +3.0 A (discharge)	Discharge >3.0 A for >10 s, or charge <−2.0 A	Discharge >3.5 A, or step >1.0 A within 1 s	Detects overloads and shorts; protects rails and battery.
EPS	3.3 V Rail	3.3 V ±5%	±8%	±10%	Ensures CPU/peripheral stability; avoids resets.
EPS	5 V Rail	5.0 V ±5%	±8%	±10%	Supports radio/payload reliability; avoids damage.
EPS	State of Charge	30–100%	<25% for >5 min	<20% for >2 min	Maintains eclipse survivability; triggers payload shedding.
Thermal	OBC Temperature	−10 to +55 °C	<−15 or >60 °C	<−20 or >70 °C	Protects electronics and timing integrity.
Thermal	Battery Temperature	−10 to +40 °C	<−15 or >45 °C	<−20 or >50 °C	Charging is allowed only at 0–45 °C; unsafe ranges risk venting.
Thermal	Payload Temperature	−20 to +60 °C	<−25 or >65 °C	<−30 or >75 °C	Protects sensors and calibration stability.
OBC	CPU Load	10–60%	>75% for 30 s	>90% for 60 s with missed heartbeat	Prevents missed deadlines and watchdog resets.
OBC	Memory Usage	40–80%	>90% for 30 s	>95% for 60 s	Detects leaks and fragmentation.
OBC	Heartbeat	~1 Hz	Missing >5 s	Missing >30 s	Detects process starvation or deadlocks.
COMM	Downlink Success	>70% in pass	30–70%	<30% for >60 s in pass	Indicates link health; triggers bitrate/FEC changes and resets.
COMM	Uplink Success	>60% in pass	30–60%	<30% for >120 s across pass	Detects ground contact or antenna issues.
Payload	Payload Current	0.2–1.0 A typical	>1.2× nominal for >10 s	>1.5× nominal, or rail droop >8%	Captures shorts/misconfigurations; isolates faulty payload.
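Several rows above combine a limit with a persistence time (for example, State of Charge below 25% for more than 5 min). A minimal sketch of such a rule for a lower limit follows; the `LowLimitRule` class is hypothetical, and the thresholds in the example are taken from the State of Charge row.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LowLimitRule:
    """Reports a level once the value has stayed below a limit longer than its hold time."""
    warn_below: float
    crit_below: float
    warn_hold_s: float
    crit_hold_s: float
    _warn_since: Optional[float] = None
    _crit_since: Optional[float] = None

    @staticmethod
    def _track(since, violated, t):
        # Remember when the violation began; forget it as soon as it clears.
        if not violated:
            return None
        return t if since is None else since

    def update(self, t, value):
        self._warn_since = self._track(self._warn_since, value < self.warn_below, t)
        self._crit_since = self._track(self._crit_since, value < self.crit_below, t)
        if self._crit_since is not None and t - self._crit_since > self.crit_hold_s:
            return "Critical"
        if self._warn_since is not None and t - self._warn_since > self.warn_hold_s:
            return "Warning"
        return "Nominal"


# State of Charge row: warning <25% for >5 min, critical <20% for >2 min.
soc_rule = LowLimitRule(warn_below=25.0, crit_below=20.0, warn_hold_s=300.0, crit_hold_s=120.0)
samples = [(0, 24.0), (301, 24.0), (302, 19.0), (423, 19.0), (424, 35.0)]
levels = [soc_rule.update(t, soc) for t, soc in samples]
```

A single recovered sample resets both timers, so brief dips never accumulate toward a fault.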

Phase 2 — Telemetry Simulator & Fault Injection

Deterministic | Orbits | Faults

Required Behaviors

Day/night cycling across a 90-min orbit with eclipse fraction.
Thermal lag via first-order dynamics with distinct time constants.
Task-driven CPU load spikes during payload ops and passes.
Ground-pass windows with increased comm success rates.
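Two of the behaviors above, day/night cycling and first-order thermal lag, can be sketched in a few lines with a seeded random generator for repeatability. The eclipse fraction, equilibrium temperatures, time constant, and solar power figure are placeholder assumptions, not values from the simulator.

```python
import math
import random

ORBIT_S = 90 * 60          # 90-minute orbit, as stated above
ECLIPSE_FRACTION = 0.35    # assumed; depends on the actual orbit


def in_eclipse(t_s):
    """Sunlit first, then dark for the last ECLIPSE_FRACTION of each orbit."""
    return (t_s % ORBIT_S) >= (1.0 - ECLIPSE_FRACTION) * ORBIT_S


def simulate(seed, duration_s, dt=1.0, tau_s=600.0):
    """Return (t, dark, obc_temp_c, solar_w) frames; same seed gives same output."""
    rng = random.Random(seed)
    temp_c = 25.0
    frames = []
    for i in range(int(duration_s / dt)):
        t = i * dt
        dark = in_eclipse(t)
        target_c = -5.0 if dark else 35.0  # assumed equilibrium temperatures
        # First-order lag dT/dt = (target - T) / tau, exact discrete update.
        temp_c += (target_c - temp_c) * (1.0 - math.exp(-dt / tau_s))
        solar_w = 0.0 if dark else 6.0 + rng.gauss(0.0, 0.1)
        frames.append((t, dark, round(temp_c, 3), round(solar_w, 3)))
    return frames


run_a = simulate(seed=42, duration_s=ORBIT_S)
run_b = simulate(seed=42, duration_s=ORBIT_S)
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps runs repeatable even when other code draws random numbers.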

Fault Injection Scenarios

Battery undervoltage with negative slope and low SoC.
Thermal overheating due to reduced radiative coupling or increased duty.
Payload power surge causing 5 V rail droop.
Communication blackout during scheduled pass.
CPU hang and watchdog-triggered recovery.
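Scenarios like those above can be scripted as time-windowed modifications applied to nominal frames, which keeps injection deterministic and separate from the nominal model. The `Injection` record, its field names, and the example numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    """One scripted fault, active for start_s <= t < end_s."""
    start_s: float
    end_s: float
    field: str
    ramp_per_s: float = 0.0   # added per second since start (e.g., voltage sag)
    override: object = None   # if set, replaces the field's value outright


def apply_injections(frame, injections):
    """Return a modified copy of a nominal telemetry frame (a dict with key 't')."""
    out = dict(frame)
    for inj in injections:
        if inj.start_s <= frame["t"] < inj.end_s:
            if inj.override is not None:
                out[inj.field] = inj.override
            else:
                out[inj.field] += inj.ramp_per_s * (frame["t"] - inj.start_s)
    return out


scenario = [
    Injection(100.0, 400.0, "vbatt", ramp_per_s=-0.005),    # undervoltage with negative slope
    Injection(200.0, 260.0, "downlink_ok", override=False),  # blackout during a pass
]
nominal = {"t": 220.0, "vbatt": 7.8, "downlink_ok": True}
faulted = apply_injections(nominal, scenario)
```

At t = 220 s both faults are active: the battery has sagged by 0.6 V and the downlink is forced off, while the nominal frame itself is left untouched.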

Orbit Mapping

Eclipse drives energy budget; solar input zero in darkness.
Thermal inertia causes lag between load changes and temperature.
Pass geometry governs link availability and performance.
Runaway processes and payload events mirror flight risks.

Phase 3 — Health Monitoring Engine

Rules | Trends | Low Overhead

Approach

Rule-based thresholds for immediate, explainable flags.
Sliding-window trend analysis for slow drifts.
Incremental statistics for mean, variance, and slope.

Incremental Statistics

Mean and variance via numerically stable updates.
Linear slope from windowed sums for best-fit trend.
Min/max and percentiles for bounds and spikes.
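As an illustration of the first two items, the sketch below uses Welford's algorithm for a numerically stable running mean and variance, and running sums for a least-squares slope over a fixed window. Class names are hypothetical; for long runs the slope sums would be better computed on times relative to the window start to limit rounding error.

```python
from collections import deque


class Welford:
    """Running mean and sample variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


class WindowSlope:
    """Best-fit slope over the last `size` samples using windowed sums."""

    def __init__(self, size):
        self.size = size
        self._buf = deque()
        self._st = self._sy = self._stt = self._sty = 0.0

    def _add(self, t, y, sign):
        self._st += sign * t
        self._sy += sign * y
        self._stt += sign * t * t
        self._sty += sign * t * y

    def update(self, t, y):
        self._buf.append((t, y))
        self._add(t, y, +1)
        if len(self._buf) > self.size:
            self._add(*self._buf.popleft(), -1)  # drop the oldest sample's terms

    @property
    def slope(self):
        n = len(self._buf)
        denom = n * self._stt - self._st ** 2
        return (n * self._sty - self._st * self._sy) / denom if n > 1 and denom else 0.0


stats = Welford()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)

trend = WindowSlope(size=10)
for t in range(20):
    trend.update(float(t), 8.0 - 0.01 * t)  # battery sagging 10 mV per second
```

Each update costs a constant number of operations, which is what makes the approach suitable for a 1–2 Hz loop on a small OBC.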

Health Outputs

Per-parameter scores normalized by thresholds.
Subsystem health scores from weighted aggregation.
Early warnings when a parameter stays beyond its limit for a sustained period.
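One plausible reading of "scores normalized by thresholds" is a linear map from 0.0 at the critical limit to 1.0 at the edge of the nominal range, with subsystem health as a weighted mean. The sketch below follows that reading; the mapping and the weights are assumptions, while the limits come from the Phase 1 table.

```python
def low_limit_score(value, nominal_min, critical):
    """1.0 at or above the nominal minimum, 0.0 at or below the critical limit."""
    return max(0.0, min(1.0, (value - critical) / (nominal_min - critical)))


def subsystem_health(scores, weights):
    """Weighted mean of per-parameter scores, each in [0, 1]."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


eps_scores = {
    "vbatt": low_limit_score(6.8, nominal_min=7.0, critical=6.6),   # halfway to critical
    "soc": low_limit_score(80.0, nominal_min=30.0, critical=20.0),  # well inside nominal
}
eps_health = subsystem_health(eps_scores, weights={"vbatt": 0.6, "soc": 0.4})
```

With these inputs the voltage score is about 0.5, the SoC score is 1.0, and EPS health comes out near 0.7.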

Example

Track per-orbit V_batt minima and midday SoC maxima.
Raise a warning when V_batt minima drop >50 mV/orbit and midday SoC maxima decline >5%.
Escalate to critical for a drop >100 mV/orbit, or for SoC <30% under nominal insolation.
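The example rules can be expressed as a small function over per-orbit summaries. This sketch assumes the comparison is between the two most recent orbits and that the SoC decline is measured in percentage points; both are interpretations, not stated in the text.

```python
def classify_orbit_trend(vbatt_minima_v, soc_maxima_pct, nominal_insolation=True):
    """Compare the latest orbit with the previous one (lists ordered oldest first)."""
    drop_mv = (vbatt_minima_v[-2] - vbatt_minima_v[-1]) * 1000.0
    soc_decline = soc_maxima_pct[-2] - soc_maxima_pct[-1]
    if drop_mv > 100.0 or (soc_maxima_pct[-1] < 30.0 and nominal_insolation):
        return "Critical"
    if drop_mv > 50.0 and soc_decline > 5.0:
        return "Warning"
    return "Nominal"


mild = classify_orbit_trend([7.40, 7.33], [92.0, 85.0])    # 70 mV drop, 7-point decline
severe = classify_orbit_trend([7.40, 7.28], [92.0, 90.0])  # 120 mV drop
steady = classify_orbit_trend([7.40, 7.39], [92.0, 91.0])
```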

Phase 4 — Fault Detection & Classification

Types | Severity | Timestamps

Categories

Power: undervoltage, low SoC, overcurrent.
Thermal: overtemp, unsafe battery charge, rapid heating.
Communication: blackout, persistent uplink failure.
Software: CPU hang, memory leak, telemetry gap.
Payload: overcurrent/overheat, non-responsive reset.

Fault Record

{
  "fault_type": "PowerFault",
  "severity": "Critical",
  "timestamp": "2026-01-05T12:34:56Z",
  "subsystem": "EPS"
}
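A record with these four fields could be produced on board by a small immutable type that serializes to compact JSON. The `FaultEvent` and `make_event` names are illustrative; only the field names and the UTC timestamp format come from the record above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FaultEvent:
    fault_type: str
    severity: str
    timestamp: str   # UTC, ISO 8601 with trailing "Z"
    subsystem: str

    def to_json(self):
        # Compact separators keep the record small for downlink.
        return json.dumps(asdict(self), separators=(",", ":"))


def make_event(fault_type, severity, subsystem, when):
    """Build an event from a timezone-aware datetime, normalized to UTC."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return FaultEvent(fault_type, severity, stamp, subsystem)


event = make_event(
    "PowerFault", "Critical", "EPS",
    datetime(2026, 1, 5, 12, 34, 56, tzinfo=timezone.utc),
)
encoded = event.to_json()
```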

Phase 5 — Autonomous Decision & Recovery

Modes | Safety Hierarchy | Hysteresis

Operating Modes

Nominal: full operations and standard thresholds.
Limited: payload duty reduced, conservative limits, prioritized telemetry.
Safe: payloads off, high-power rails off, beacon telemetry only.

Safety Hierarchy

Power and thermal survival first.
Communication continuity second.
Payload operation last.

Recovery Actions

Payload shedding and rail isolation.
Subsystem restarts, escalating to OBC reboot.
Safe-mode entry with beacon telemetry.

Hysteresis & Rate-Limiting

Debounce entries: require persistence beyond short spikes.
Exit only with margin beyond the entry threshold and after a minimum duration.
Rate-limit restarts and escalation to avoid oscillation.
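Debounced entry and a margined, timed exit can be combined in one small state holder. The sketch below handles a low-side condition such as battery voltage; the class name and the example numbers (enter below 6.6 V for 10 s, clear above 7.0 V for 300 s) are illustrative choices.

```python
class HysteresisFlag:
    """Sets after the value stays below enter_below for enter_hold_s;
    clears only after it stays at or above exit_above for exit_hold_s."""

    def __init__(self, enter_below, exit_above, enter_hold_s, exit_hold_s):
        self.enter_below = enter_below
        self.exit_above = exit_above
        self.enter_hold_s = enter_hold_s
        self.exit_hold_s = exit_hold_s
        self.active = False
        self._since = None

    def update(self, t, value):
        # While active we wait for recovery; while inactive we wait for a violation.
        condition = value >= self.exit_above if self.active else value < self.enter_below
        if not condition:
            self._since = None
        else:
            if self._since is None:
                self._since = t
            hold = self.exit_hold_s if self.active else self.enter_hold_s
            if t - self._since >= hold:
                self.active = not self.active
                self._since = None
        return self.active


undervoltage = HysteresisFlag(enter_below=6.6, exit_above=7.0, enter_hold_s=10.0, exit_hold_s=300.0)
samples = [(0, 6.5), (5, 6.7), (6, 6.5), (16, 6.5), (17, 6.8), (18, 7.1), (318, 7.1)]
states = [undervoltage.update(t, v) for t, v in samples]
```

The brief recovery at t = 5 s restarts the entry timer, and the 6.8 V sample at t = 17 s is not enough to clear the flag because it sits inside the band between the two thresholds.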

Phase 6 — Validation & Metrics

Scenarios | Metrics | Deterministic

Scenarios

Single fault: undervoltage during eclipse.
Cascading faults: payload surge, bus droop, CPU spike, comm degradation.
Long-term degradation: capacity loss across multiple orbits.
False-positive avoidance: short comm fades and thermal spikes.

Metrics

Detection latency: Critical <10 s, Warning <60 s.
Classification accuracy: Critical precision >95%, recall >85%.
Survival time across eclipses under degraded EPS.
Power saved via payload shedding and duty cycling.

Method

Deterministic seeds and scenario scripts at 1–2 Hz.
Logs of telemetry, health scores, and fault events to ring buffers.
Post-run metrics and parameter sweeps for robustness.

Phase 7 — Presentation Summary

Technical | Concise | Professional

Problem

Tight power and thermal budgets; intermittent communications.
Delayed ground intervention requires on-board autonomy.

Solution

Explainable, lightweight monitoring and recovery aligned to nanosatellite constraints.
Deterministic simulator for validation of nominal and faulted operations.

Architecture

Telemetry ingest, monitoring, classification, decision/recovery, logging/downlink.
Simulator generates nominal and injected faults for repeatable tests.

Key Decisions

Incremental statistics for low CPU/memory.
Hysteresis and rate limits to prevent oscillation.
Safety hierarchy and mode transitions that protect the spacecraft.

Validation Targets

Critical detection latency <10 s; Warning <60 s.
Critical precision >95%; recall >85% across scenarios.
Improved eclipse survival under degraded EPS.
Quantified energy savings via payload shedding.

Future Extensions

Lightweight ML anomaly detectors trained offline, deployed on-board.
Explainability preserved by combining ML scores with rule/trend evidence.

Operational Logic & Thresholds

Safety-first | Hysteresis | Priority

Priority Order

Power and thermal survival.
Communication continuity.
Payload operation.

Mode Transitions

Critical EPS/Thermal persists ≥10 s → Safe.
Clear exit: V_batt ≥7.0 V and SoC ≥40% for ≥5 min, no active Critical faults.
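The two transition rules can be sketched as a small controller. The text does not say which mode follows Safe, so this example assumes a return to Limited; the class name and the update signature are also hypothetical.

```python
class ModeController:
    ENTER_HOLD_S = 10.0    # Critical EPS/Thermal must persist this long
    EXIT_HOLD_S = 300.0    # healthy conditions must hold for 5 min

    def __init__(self):
        self.mode = "Nominal"
        self._crit_since = None
        self._ok_since = None

    def update(self, t, vbatt_v, soc_pct, critical_active):
        if self.mode != "Safe":
            if not critical_active:
                self._crit_since = None
            elif self._crit_since is None:
                self._crit_since = t
            elif t - self._crit_since >= self.ENTER_HOLD_S:
                self.mode, self._ok_since = "Safe", None
        else:
            healthy = vbatt_v >= 7.0 and soc_pct >= 40.0 and not critical_active
            if not healthy:
                self._ok_since = None
            elif self._ok_since is None:
                self._ok_since = t
            elif t - self._ok_since >= self.EXIT_HOLD_S:
                # Assumption: leave Safe via Limited rather than straight to Nominal.
                self.mode, self._crit_since = "Limited", None
        return self.mode


ctl = ModeController()
timeline = [
    (0, 6.5, 18.0, True), (10, 6.5, 18.0, True),
    (20, 7.2, 45.0, False), (200, 6.9, 45.0, False),
    (210, 7.2, 45.0, False), (510, 7.2, 45.0, False),
]
modes = [ctl.update(*step) for step in timeline]
```

The voltage dip at t = 200 s restarts the 5-minute exit timer, so the spacecraft leaves Safe only at t = 510 s.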

Recovery Ladder

Process restart → Subsystem re-init → OBC reboot if persistent.
Rate-limit: ≤1 restart per 5 min; 2 retries then escalate.
In Safe mode, transmit beacon telemetry with a summarized status.
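The ladder and its rate limit can be sketched as follows. The example reads "2 retries then escalate" as two attempts per rung and applies the 5-minute limit to every recovery action; both readings are assumptions, as are the action names.

```python
LADDER = ("process_restart", "subsystem_reinit", "obc_reboot")


class RecoveryLadder:
    MIN_INTERVAL_S = 300.0   # at most one recovery action per 5 min
    ATTEMPTS_PER_RUNG = 2    # assumed meaning of "2 retries then escalate"

    def __init__(self):
        self._rung = 0
        self._attempts = 0
        self._last_action_t = None

    def request(self, t):
        """Return the next action, or None while rate-limited."""
        if self._last_action_t is not None and t - self._last_action_t < self.MIN_INTERVAL_S:
            return None
        if self._attempts >= self.ATTEMPTS_PER_RUNG and self._rung < len(LADDER) - 1:
            self._rung += 1
            self._attempts = 0
        self._attempts += 1
        self._last_action_t = t
        return LADDER[self._rung]

    def reset(self):
        """Call once the fault clears so the next fault starts at the first rung."""
        self._rung = 0
        self._attempts = 0


ladder = RecoveryLadder()
actions = [ladder.request(t) for t in (0, 100, 300, 600, 900, 1200)]
```

The request at t = 100 s is suppressed by the rate limit, and the ladder reaches an OBC reboot only after two process restarts and two subsystem re-inits have failed to clear the fault.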

Communication Handling

In-pass blackout: lower bitrate or increase FEC if available.
Radio re-init and emergency telemetry prioritization.