Health Checks

BRIDGEPORT performs four types of health checks (container health, URL, TCP port, and TLS certificate) to continuously verify that your services are running correctly. The results feed bounce logic, which prevents alert storms, and deployment orchestration, which uses them to gate rollouts.

  1. Go to Services and select a service.
  2. Set a Health Check URL (e.g., http://localhost:8080/health).
  3. Click Health Check to run an immediate check.
  4. View results at Monitoring > Health Checks.

For TCP and certificate checks:

  1. In the service detail page, configure TCP Checks (host:port pairs) and/or Certificate Checks.
  2. These require the server to be in agent mode: the agent performs the checks and reports results.

flowchart TD
    SCHED[Scheduler Timer] -->|Every SCHEDULER_SERVICE_HEALTH_INTERVAL| HC_RUN[Run Health Checks]
    MANUAL[UI: Run Health Check] --> HC_RUN

    HC_RUN --> CONTAINER[Container Health<br/>Docker inspect]
    HC_RUN --> URL[URL Health<br/>curl via SSH/agent]
    HC_RUN --> TCP[TCP Check<br/>agent only]
    HC_RUN --> CERT[Certificate Check<br/>agent only]

    CONTAINER --> LOG[Log to HealthCheckLog]
    URL --> LOG
    TCP --> SVC_UPDATE[Update Service record]
    CERT --> SVC_UPDATE

    LOG --> BOUNCE{Bounce Logic}
    BOUNCE -->|Threshold reached| NOTIFY[Send Notification]
    BOUNCE -->|Below threshold| SKIP[Suppress alert]
    BOUNCE -->|Recovery detected| RECOVER[Send Recovery Notification]

BRIDGEPORT inspects the Docker container’s native health status using the Docker API.

| Status | Meaning |
| --- | --- |
| healthy | Docker HEALTHCHECK passes |
| unhealthy | Docker HEALTHCHECK fails |
| none | No HEALTHCHECK defined in the Dockerfile |
| starting | Container is still initializing |

This check runs automatically during both SSH and agent metrics collection. No configuration is needed — it works for every discovered container.

BRIDGEPORT (or the agent) performs an HTTP request to a user-configured URL and checks the response.

How to configure:

  1. Go to the service detail page.
  2. Set the Health Check URL field.
  3. The URL is typically an internal endpoint (e.g., http://localhost:8080/health from within the server’s network).

What gets logged:

| Field | Description |
| --- | --- |
| status | success if HTTP 2xx, failure otherwise |
| httpStatus | The HTTP status code (200, 500, etc.) |
| durationMs | How long the request took, in milliseconds |
| errorMessage | Error details on failure (timeout, connection refused, etc.) |
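
The logged fields above can be illustrated with a minimal sketch of a URL check. This is not BRIDGEPORT's implementation: the function name `check_url` and the `timeout_s` parameter are hypothetical, and only the returned field names come from the table.

```python
import time
import urllib.error
import urllib.request


def check_url(url: str, timeout_s: float = 5.0) -> dict:
    """Run one HTTP GET and return a record shaped like the logged fields.

    Hypothetical helper for illustration; not part of BRIDGEPORT.
    """
    started = time.monotonic()
    http_status = None
    error_message = None
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            http_status = resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx, but the status code is still available.
        http_status = exc.code
        error_message = f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, connection refused, DNS failures, and timeouts.
        error_message = str(exc)
    duration_ms = int((time.monotonic() - started) * 1000)
    ok = http_status is not None and 200 <= http_status < 300
    return {
        "status": "success" if ok else "failure",
        "httpStatus": http_status,
        "durationMs": duration_ms,
        "errorMessage": None if ok else error_message,
    }
```

Calling `check_url("http://localhost:8080/health")` on the server itself returns a dictionary that can be compared field by field with what appears in the health check log.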

The agent tests TCP connectivity to specified host:port pairs and reports success/failure with latency.

How to configure:

  1. Go to the service detail page.
  2. In the TCP Checks section, add entries:
    [
    { "host": "db.internal", "port": 5432, "name": "PostgreSQL" },
    { "host": "redis.internal", "port": 6379, "name": "Redis" }
    ]

What gets reported:

| Field | Description |
| --- | --- |
| success | Whether the TCP connection succeeded |
| durationMs | Connection time in milliseconds |
| error | Error message on failure |
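
To make the reported fields concrete, here is a rough sketch of a TCP connectivity probe. The agent's actual code is not shown in this document, so `check_tcp` and `timeout_s` are hypothetical names; only the result keys mirror the table above.

```python
import socket
import time


def check_tcp(host: str, port: int, timeout_s: float = 3.0) -> dict:
    """Try to open a TCP connection and report success, latency, and error.

    Hypothetical helper for illustration; not the agent's real code.
    """
    started = time.monotonic()
    error = None
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            pass
    except OSError as exc:
        # Includes refused connections, timeouts, and DNS resolution errors.
        error = str(exc) or exc.__class__.__name__
    duration_ms = int((time.monotonic() - started) * 1000)
    return {"success": error is None, "durationMs": duration_ms, "error": error}
```

Running `check_tcp("db.internal", 5432)` from the server where the agent runs is a quick way to tell a configuration problem apart from a network problem.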

The agent connects via TLS, retrieves the certificate, and reports expiry information.

How to configure:

  1. Go to the service detail page.
  2. In the Certificate Checks section, add entries:
    [
    { "host": "api.example.com", "port": 443, "name": "API SSL" }
    ]

What gets reported:

| Field | Description |
| --- | --- |
| expiresAt | Certificate expiry timestamp |
| daysUntilExpiry | Days remaining |
| issuer | Certificate issuer |
| subject | Certificate subject |
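
The sketch below shows one way the reported fields could be derived from a TLS handshake using Python's standard `ssl` module. It is an illustration under assumptions, not the agent's code: `fetch_peer_cert` and `summarize_cert` are hypothetical names, and the string formats chosen for `issuer` and `subject` are guesses.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_peer_cert(host: str, port: int = 443, timeout_s: float = 5.0) -> dict:
    """Connect via TLS and return the peer certificate as a dict."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()


def _flatten(name) -> str:
    # getpeercert() encodes names as a tuple of RDNs, each a tuple of (key, value).
    return ", ".join(f"{key}={value}" for rdn in name for key, value in rdn)


def summarize_cert(cert: dict, now: datetime) -> dict:
    """Reduce a getpeercert() dict to the four reported fields."""
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return {
        "expiresAt": expires_at.isoformat(),
        "daysUntilExpiry": (expires_at - now).days,
        "issuer": _flatten(cert.get("issuer", ())),
        "subject": _flatten(cert.get("subject", ())),
    }
```

A typical call is `summarize_cert(fetch_peer_cert("api.example.com"), datetime.now(timezone.utc))`. Because this sketch uses the default verifying context, an already expired or untrusted certificate raises an `ssl.SSLCertVerificationError` during the handshake instead of producing a summary.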

TCP and certificate checks are only available in agent mode. They are configured per service but executed by the agent and included in its metrics push.

On the service detail page, click the Health Check button. This runs an immediate container + URL check and displays the result.

Navigate to Monitoring > Health Checks and click Run Health Checks. Choose:

  • All — Check all servers and services in the environment.
  • Servers only — SSH connectivity check for all servers.
  • Services only — URL health check for all services with a health check URL configured.

Results are displayed immediately and also logged to HealthCheckLog.

Go to Monitoring > Agents & SSH and click Test SSH for a server, or Test All to check every server in the environment.

The scheduler runs health checks automatically based on configurable intervals.

Intervals are global (not per-environment), set by the SCHEDULER_* env vars — see Configuration Reference → Scheduler.

Server health checks

  • What: SSH connectivity test (or agent push for agent-mode servers).
  • Interval: SCHEDULER_SERVER_HEALTH_INTERVAL (default: 60 seconds).
  • Skips: Agent-mode servers are skipped because the agent reports health directly.

Service health checks

  • What: Container health + URL health check.
  • Interval: SCHEDULER_SERVICE_HEALTH_INTERVAL (default: 60 seconds).
  • Skips: Services on agent-mode servers are skipped because the agent performs URL checks.

Container discovery

  • What: Discovers running Docker containers and updates service statuses.
  • Interval: SCHEDULER_DISCOVERY_INTERVAL (default: 5 minutes).
  • Runs on: Healthy servers only.

Each service has three timing parameters that control health verification during deployment orchestration:

| Setting | Default | Description |
| --- | --- | --- |
| healthWaitMs | 5000 | Initial wait before the first health check (milliseconds) |
| healthRetries | 3 | Maximum number of health check attempts |
| healthIntervalMs | 10000 | Wait between retries (milliseconds) |

These are used by the Deployment Plans system. When a deployment plan includes a health_check step, it calls verifyServiceHealth(), which:

  1. Waits healthWaitMs for the service to stabilize after deploy.
  2. Checks container health and URL health.
  3. If unhealthy, retries up to healthRetries times, waiting healthIntervalMs between each.
  4. If all retries fail, the deployment plan triggers auto-rollback of all previously deployed services.
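
The wait-and-retry steps above can be sketched as follows. This is not the source of verifyServiceHealth(): the `check` callback and the injectable `sleep` are hypothetical, and the sketch assumes that healthRetries is the total number of attempts, as the settings table describes it.

```python
import time
from typing import Callable


def verify_service_health(
    check: Callable[[], bool],
    health_wait_ms: int = 5000,
    health_retries: int = 3,
    health_interval_ms: int = 10000,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `check` passes, or False after all attempts fail.

    `check` stands in for the combined container + URL check (assumption).
    """
    # Step 1: give the freshly deployed service time to stabilize.
    sleep(health_wait_ms / 1000)
    for attempt in range(1, health_retries + 1):
        # Step 2: run the health check.
        if check():
            return True
        # Step 3: wait before the next attempt, but not after the last one.
        if attempt < health_retries:
            sleep(health_interval_ms / 1000)
    # Step 4: the caller (the deployment plan) decides to roll back.
    return False
```

With the defaults, a service that never becomes healthy is declared failed after roughly 25 seconds of waiting (5 s initial wait plus two 10 s intervals), not counting the time the checks themselves take.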

To configure:

  1. Go to the service detail page.
  2. Open the Health Check Config section.
  3. Adjust wait, retries, and interval as needed.

All health check results are stored in the HealthCheckLog table and viewable at Monitoring > Health Checks (/monitoring/health).

Filter logs by:

| Filter | Options |
| --- | --- |
| Resource type | server, service, container |
| Check type | ssh, url, container_health, discovery |
| Status | success, failure, timeout |
| Resource ID | Specific server or service |
| Time range | 1 hour to 7 days |

The health logs page shows summary counts broken down by resource type:

Server: 42 success | 2 failure | 0 timeout
Service: 156 success | 3 failure | 1 timeout
Container: 312 success | 5 failure | 0 timeout
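
As an illustration of how such a summary can be computed, the sketch below groups log rows by resource type and status. The row keys `resourceType` and `status` are assumptions modeled on the filter names; the actual HealthCheckLog column names are not given in this document.

```python
from collections import Counter


def summarize(logs: list) -> list:
    """Build one summary line per resource type from health check log rows."""
    # Count rows per (resource type, status) pair; missing pairs count as 0.
    counts = Counter((log["resourceType"], log["status"]) for log in logs)
    lines = []
    for resource_type in sorted({rt for rt, _ in counts}):
        lines.append(
            f"{resource_type.capitalize()}: "
            f"{counts[(resource_type, 'success')]} success | "
            f"{counts[(resource_type, 'failure')]} failure | "
            f"{counts[(resource_type, 'timeout')]} timeout"
        )
    return lines
```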

Health check logs are automatically cleaned up based on:

| Setting | Default | Where |
| --- | --- | --- |
| healthLogRetentionDays | 30 | Global, in Admin > System Settings > Retention |

The scheduler runs daily cleanup and hot-reloads this setting each tick (no restart needed).

Bounce logic prevents alert storms when a resource repeatedly fails. Instead of sending a notification on every failure, BRIDGEPORT tracks consecutive failures and only sends an alert when a threshold is reached.

sequenceDiagram
    participant HC as Health Check
    participant BT as BounceTracker
    participant N as Notifications

    HC->>BT: recordFailure(server, srv-1, offline)
    BT->>BT: consecutiveFailures: 1
    BT-->>HC: shouldAlert: false

    HC->>BT: recordFailure(server, srv-1, offline)
    BT->>BT: consecutiveFailures: 2
    BT-->>HC: shouldAlert: true (threshold=2)
    HC->>N: Send "Server Offline" notification

    Note over BT: Cooldown starts (15 min default)

    HC->>BT: recordFailure(server, srv-1, offline)
    BT->>BT: consecutiveFailures: 3
    BT-->>HC: shouldAlert: false (in cooldown)

    HC->>BT: recordSuccess(server, srv-1, offline)
    BT->>BT: consecutiveFailures: 0, alertSentAt: null
    BT-->>HC: wasRecovered: true
    HC->>N: Send "Server Back Online" notification

Bounce thresholds and cooldowns are configured per notification type in Admin > Notifications:

| Setting | Description | Default |
| --- | --- | --- |
| bounceEnabled | Whether bounce logic applies | Varies by type |
| bounceThreshold | Failures before first alert | 3 (or 2 for server offline) |
| bounceCooldown | Seconds before re-alerting | 900 (15 minutes) |

Notification types with bounce enabled:

| Type | Threshold | Cooldown |
| --- | --- | --- |
| system.health_check_failed | 3 | 15 min |
| system.server_offline | 2 | 15 min |
| system.container_crash | 3 | 15 min |
| system.database_unreachable | 3 | 15 min |

The BounceTracker table tracks state per resource:

| Field | Description |
| --- | --- |
| resourceType | server, service, or database |
| resourceId | ID of the resource |
| eventType | health_check, offline, crash, backup |
| consecutiveFailures | Current failure count |
| lastFailedAt | Timestamp of last failure |
| lastSuccessAt | Timestamp of last success |
| alertSentAt | When the last alert was sent (for cooldown calculation) |

When a resource recovers (success after alert was sent), BRIDGEPORT sends a recovery notification and resets the tracker.
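
The threshold, cooldown, and recovery behavior described in this section can be sketched with a small in-memory tracker. This is an approximation under assumptions, not BRIDGEPORT's code: the real tracker is a database table, and the exact cooldown rule used here (re-alert once the cooldown has elapsed and failures continue) is inferred from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BounceState:
    consecutive_failures: int = 0
    alert_sent_at: Optional[float] = None


class BounceTracker:
    """In-memory stand-in for the BounceTracker table (illustrative only)."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 900.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._states = {}

    def _state(self, resource_type, resource_id, event_type) -> BounceState:
        key = (resource_type, resource_id, event_type)
        return self._states.setdefault(key, BounceState())

    def record_failure(self, resource_type, resource_id, event_type, now: float) -> bool:
        """Return True when an alert should be sent for this failure."""
        state = self._state(resource_type, resource_id, event_type)
        state.consecutive_failures += 1
        if state.consecutive_failures < self.threshold:
            return False  # below threshold: suppress
        if state.alert_sent_at is not None and now - state.alert_sent_at < self.cooldown_s:
            return False  # already alerted recently: in cooldown
        state.alert_sent_at = now
        return True

    def record_success(self, resource_type, resource_id, event_type) -> bool:
        """Reset the tracker; return True if an alert had been sent (recovery)."""
        state = self._state(resource_type, resource_id, event_type)
        was_recovered = state.alert_sent_at is not None
        state.consecutive_failures = 0
        state.alert_sent_at = None
        return was_recovered
```

With `BounceTracker(threshold=2)`, two failures for `("server", "srv-1", "offline")` trigger an alert, a third failure one minute later is suppressed by the cooldown, and the next success reports a recovery.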

Health checks are a first-class part of the deployment orchestration system. When you create a Deployment Plan:

  1. Dependencies define order: health_before dependencies require the upstream service to be healthy before deploying the downstream service.
  2. Health verification after deploy: Each deploy step is followed by a health_check step that calls verifyServiceHealth().
  3. Auto-rollback on failure: If a health check fails after exhausting retries, the plan triggers rollback of all previously deployed services.

flowchart LR
    DEPLOY_A[Deploy Service A] --> HC_A[Health Check A]
    HC_A -->|Healthy| DEPLOY_B[Deploy Service B]
    HC_A -->|Failed after retries| ROLLBACK[Rollback A]
    DEPLOY_B --> HC_B[Health Check B]
    HC_B -->|Healthy| DONE[Plan Complete]
    HC_B -->|Failed| ROLLBACK_ALL[Rollback B + A]

If a health check keeps failing:

  1. Check the URL: Ensure the health check URL is accessible from the server (not from your local machine). Use Test SSH to confirm the server is reachable, then run curl <url> on the server to verify.
  2. Check the container: If the container is not running, the health check will always fail. Check container status first.
  3. Check timing: If the service takes a long time to start, health checks may fail during startup. Increase healthWaitMs.

If you receive too many alerts:

  • Enable bounce logic for the notification type in Admin > Notifications.
  • Increase the bounceThreshold to require more consecutive failures before alerting.
  • Increase the bounceCooldown to wait longer before re-alerting.
  • Check if the underlying issue can be fixed (e.g., flaky health endpoint).

Health checks pass but service shows “unhealthy”

This can happen if the Docker HEALTHCHECK in the Dockerfile is failing even though the URL check passes. BRIDGEPORT uses both container health and URL health to determine overall status, and both must pass for the service to be healthy. Run docker inspect <container> on the server to see the Docker health status.

If TCP or certificate checks report no results:

  • Verify the server is in agent mode.
  • Check that the agent is active (not stale or offline).
  • Verify the TCP/cert check configuration on the service detail page.
  • Check that the target host:port is reachable from the server where the agent runs.