MQTT QoS for Industrial IoT: What the Spec Does Not Tell You

QoS in Plain Terms

MQTT defines three delivery guarantees between a client and a broker. The right choice depends on what the data is and how expensive a missed or duplicate message is.

QoS 0 — At most once. Fire and forget. The client publishes; the broker may or may not receive it. No acknowledgement, no retry. Best for high-frequency telemetry (temperature every second) where occasional loss is acceptable and bandwidth is constrained.
QoS 1 — At least once. The broker sends a PUBACK. The client retries until it receives one. The message is guaranteed to arrive, but it may arrive more than once. Your subscriber must handle duplicates — use a timestamp or sequence ID in the payload.
QoS 2 — Exactly once. A four-step handshake (PUBLISH → PUBREC → PUBREL → PUBCOMP) guarantees delivery with no duplicates. Appropriate for safety-critical commands and financial events. Too expensive for sensor telemetry on LTE-M: the handshake takes two round trips where QoS 1 takes one, and the extra radio time drains the battery.

Industrial rule of thumb: Use QoS 0 for periodic sensor readings. Use QoS 1 for threshold alerts, state changes, and routine actuator commands, with a unique message ID in the payload so the receiver can discard the duplicates that at-least-once delivery produces. Reserve QoS 2 for payments, safety-critical commands, or regulatory audit trails where the cost of the four-step handshake is justified.
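
Duplicate handling on the subscriber side can be sketched as follows. This is a minimal illustration, not a prescribed design: the Deduplicator class, the handle_alert helper, and the "msg_id" and "event" payload fields are hypothetical names chosen for the example, and the cache size is arbitrary.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers recently seen message IDs so QoS 1 redeliveries are applied once."""

    def __init__(self, max_ids: int = 1024):
        self.max_ids = max_ids
        self._seen = OrderedDict()  # insertion-ordered, oldest ID first

    def is_new(self, msg_id: str) -> bool:
        """Return True the first time an ID is seen, False for a duplicate."""
        if msg_id in self._seen:
            return False
        self._seen[msg_id] = True
        if len(self._seen) > self.max_ids:
            self._seen.popitem(last=False)  # evict the oldest ID to bound memory
        return True


def handle_alert(dedup: Deduplicator, alert: dict, actions: list) -> None:
    """Apply an alert once; silently drop a redelivered copy."""
    if dedup.is_new(alert["msg_id"]):
        actions.append(alert["event"])
```

The cache is bounded because a duplicate normally arrives shortly after the original. A device that can be offline for days would need a larger window or a persistent store.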

Last Will and Testament — The Underused Feature

Last Will and Testament (LWT) is one of the most useful MQTT features for industrial monitoring, and consistently the most overlooked. When a client registers an LWT during connection, the broker stores a will message. If the client disconnects ungracefully — network failure, power loss, firmware crash — the broker automatically publishes the will message to the specified topic.

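Without LWT, a dashboard or monitoring system has no way to distinguish between "device is alive and quiet" and "device is dead." With LWT, a retained online: false message appears on the status topic as soon as the broker detects the loss, and any subscriber or alerting system knows from then on. Detection is not instantaneous: after a silent network failure the broker only declares the client dead once about 1.5 times the keepalive interval has passed without traffic, so choose the keepalive with your alerting latency in mind.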
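
On the monitoring side, interpreting the status topic might look like the sketch below. It assumes a JSON payload with a boolean "online" field; parse_status is a hypothetical helper, not part of any library.

```python
import json


def parse_status(payload: bytes) -> str:
    """Map a status-topic payload to 'online', 'offline', or 'unknown'."""
    try:
        status = json.loads(payload)
    except (ValueError, TypeError):
        # Empty or malformed payloads (for example a cleared retained message)
        return "unknown"
    if not isinstance(status, dict) or "online" not in status:
        return "unknown"
    return "online" if status["online"] is True else "offline"
```

Treating anything unparseable as "unknown" rather than "offline" keeps a malformed message from raising a false alarm.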

Python — paho-mqtt with LWT

import paho.mqtt.client as mqtt
import json, time, ssl

BROKER   = "your-broker.example.com"
PORT     = 8883          # TLS
CLIENT_ID = "sensor-node-001"
STATUS_TOPIC = f"devices/{CLIENT_ID}/status"
TELEMETRY_TOPIC = f"devices/{CLIENT_ID}/telemetry"

def build_client() -> mqtt.Client:
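    # Written against the paho-mqtt 1.x constructor. paho-mqtt 2.x additionally
    # accepts a callback_api_version argument; mqtt.CallbackAPIVersion.VERSION1
    # matches the callback signatures used in this example.
    client = mqtt.Client(client_id=CLIENT_ID,
                         clean_session=False)   # persistent session

    # Last Will and Testament: the broker publishes this on ungraceful disconnect.
    # The payload is fixed at connect time, so "ts" cannot carry the time of
    # failure; 0 is a placeholder and subscribers should use their receive time.
    will_payload = json.dumps({"online": False, "ts": 0, "reason": "lost"})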
    client.will_set(STATUS_TOPIC,
                    payload=will_payload,
                    qos=1,
                    retain=True)   # retained: new subscribers see last state immediately

    # TLS — use CA cert to verify broker identity
    client.tls_set(ca_certs="/etc/ssl/certs/ca-certificates.crt",
                   tls_version=ssl.PROTOCOL_TLSv1_2)

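    client.username_pw_set(username="device-user",
                           password="device-secret")

    # Register the connect callback defined below; without this it is never invoked
    client.on_connect = on_connect
    return client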


def on_connect(client, userdata, flags, rc):
    if rc == 0:
        # Publish online status with retain so dashboards pick it up immediately
        online_payload = json.dumps({"online": True, "ts": int(time.time())})
        client.publish(STATUS_TOPIC, online_payload, qos=1, retain=True)
    else:
        print(f"Connection failed, rc={rc}")


def publish_telemetry(client: mqtt.Client, reading: dict):
    payload = json.dumps(reading)

    # Periodic sensor data — QoS 0, low overhead
    client.publish(TELEMETRY_TOPIC, payload, qos=0)


def publish_alert(client: mqtt.Client, alert: dict):
    payload = json.dumps(alert)

    # Alerts must arrive — QoS 1
    client.publish(f"devices/{CLIENT_ID}/alerts", payload, qos=1)
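
A clean DISCONNECT makes the broker discard the will, so after a planned shutdown the retained status would still read online: true. One way to handle that is to publish the offline state explicitly before disconnecting. The sketch below assumes a paho-mqtt client (1.6 or later for the timeout argument of wait_for_publish) whose network loop was started with loop_start(); shutdown_gracefully and the "shutdown" reason string are hypothetical.

```python
import json
import time


def shutdown_gracefully(client, status_topic: str, timeout: float = 5.0) -> None:
    """Publish a retained offline status, wait for delivery, then disconnect."""
    payload = json.dumps(
        {"online": False, "ts": int(time.time()), "reason": "shutdown"}
    )
    info = client.publish(status_topic, payload, qos=1, retain=True)
    # Block until the broker acknowledges the message or the timeout expires
    info.wait_for_publish(timeout=timeout)
    client.disconnect()  # clean DISCONNECT: the broker will not publish the will
    client.loop_stop()   # stop the background network thread
```

Subscribers can then tell a planned shutdown ("shutdown") from a failure ("lost") by the reason field.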

Persistent Sessions — Broker-Side Queuing

When clean_session=False, the broker keeps the client's subscriptions while it is offline and queues any QoS 1 or 2 messages published to those topics (for subscriptions made at QoS 1 or 2), then delivers them when the client reconnects. This is essential for devices that go offline regularly: solar-powered field sensors, devices in coverage-poor areas, anything that sleeps between readings.
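
Whether the broker actually resumed a stored session is reported in the CONNACK. A connect callback can use that to decide whether it needs to subscribe again. The sketch below assumes paho-mqtt's v1-style callback, where flags is a dict with a "session present" key (v2-style callbacks expose flags.session_present instead); the "command_topics" entry in userdata is a hypothetical convention for this example.

```python
def on_connect_resubscribe(client, userdata, flags, rc):
    """Resubscribe only when the broker did not resume a stored session."""
    if rc != 0:
        return  # connection refused; nothing to subscribe to
    # Assumption: v1-style callback, flags is a dict keyed by "session present"
    if not flags.get("session present", 0):
        for topic in userdata["command_topics"]:
            client.subscribe(topic, qos=1)
```

If the session was resumed, the existing subscriptions and any queued messages are still in place and no SUBSCRIBE is needed.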

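The trade-off: the broker must hold per-client state for as long as the session exists, and memory use grows with the number of offline devices and the messages queued for each. On self-hosted brokers (Mosquitto), set a sensible max_queued_messages to prevent runaway memory use, and consider persistent_client_expiration so abandoned sessions are eventually dropped. Managed MQTT services typically enforce their own queue and session-lifetime limits, so check what yours does.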

Client ID stability matters: Persistent sessions are keyed on client ID. If your device changes its client ID between reboots (e.g. using a random UUID each time), the broker creates a new session and the queued messages for the old ID are orphaned. Use a stable, hardware-derived ID — MAC address, serial number, or provisioning certificate Common Name.
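
A stable ID can be derived deterministically from a hardware identifier. The sketch below hashes a serial number and truncates to 23 alphanumeric characters, the length and character set that every MQTT 3.1.1 broker is required to accept. stable_client_id and the "node" prefix are hypothetical, and how the serial number is read from the hardware is left out.

```python
import hashlib


def stable_client_id(serial_number: str, prefix: str = "node") -> str:
    """Derive a deterministic client ID of at most 23 characters."""
    # Normalise so formatting differences do not change the ID
    normalised = serial_number.strip().upper()
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    return (prefix + digest)[:23]
```

The same serial number always yields the same ID across reboots, so the device reattaches to its existing session.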

TLS Overhead on Constrained Devices

TLS is non-negotiable for industrial MQTT — unencrypted connections expose device credentials and sensor data. The question is how much overhead to expect.

TLS handshake: 3–8 KB of data exchange. On LTE-M with a typical 50 ms round trip, the handshake adds 400–800 ms to connection establishment. This is a one-time cost per connection, not per message.
Per-message overhead: each TLS record adds roughly 25 bytes (a 5-byte header plus the cipher's nonce and authentication tag; the exact figure depends on the TLS version and cipher suite). At QoS 0 this is the only overhead. At QoS 1 the PUBACK travels in its own record and incurs the same overhead, but the total remains well within LTE-M capabilities.
LoRaWAN gateways: The gateway typically terminates LoRa on the device side and runs MQTT to the cloud on the gateway's Linux side. TLS runs on the gateway — not the end node — so the constrained radio link is unaffected.
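
To put the per-message figure in context, the arithmetic can be sketched as below. It uses the approximate 25-byte TLS figure from the list above, assumes one PUBLISH per TLS record, and ignores TCP/IP headers; publish_bytes and daily_bytes are hypothetical helpers.

```python
def publish_bytes(topic: str, payload_len: int, tls_overhead: int = 25) -> int:
    """Approximate bytes for one QoS 0 PUBLISH carried in one TLS record."""
    topic_len = len(topic.encode("utf-8"))
    # Variable header: 2-byte topic length prefix + topic, followed by the payload
    remaining = 2 + topic_len + payload_len
    # Fixed header: 1 type/flags byte plus a 1-4 byte "remaining length" field
    if remaining < 128:
        length_field = 1
    elif remaining < 16384:
        length_field = 2
    elif remaining < 2097152:
        length_field = 3
    else:
        length_field = 4
    return 1 + length_field + remaining + tls_overhead


def daily_bytes(topic: str, payload_len: int, interval_s: float) -> int:
    """Approximate daily volume for periodic QoS 0 publishing."""
    messages_per_day = int(86400 / interval_s)
    return messages_per_day * publish_bytes(topic, payload_len)
```

For the 33-byte topic devices/sensor-node-001/telemetry and a 40-byte payload this gives 102 bytes per message, about 8.8 MB per day at one reading per second. The topic string costs almost as much as the payload, so short topic names matter on metered links.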

Python — reconnect loop with backoff

import random
import time

import paho.mqtt.client as mqtt

MAX_BACKOFF = 120   # seconds

def connect_with_backoff(client: mqtt.Client, host: str, port: int):
    """
    Exponential backoff reconnect loop.
    On LTE-M the TLS handshake takes 400–800 ms — do not hammer the broker.
    """
    delay = 1
    while True:
        try:
            client.connect(host, port, keepalive=60)
            client.loop_start()
            return
        except OSError as e:   # covers refused connections, timeouts, DNS and TLS errors
            # Add up to 20% jitter so a fleet of devices does not retry in lockstep
            wait = delay + random.uniform(0, delay * 0.2)
            print(f"Connect failed ({e}). Retry in {wait:.1f}s")
            time.sleep(wait)
            delay = min(delay * 2, MAX_BACKOFF)

Broker Selection at a Glance

Mosquitto: Lightweight, single-binary, excellent for under 500 concurrent devices. No native clustering. Good starting point for on-premise industrial deployments.
EMQX: Clustering, rule engine for routing messages to databases or HTTP endpoints, Sparkplug B support. Better suited to large-scale industrial deployments.
HiveMQ: Enterprise features, Sparkplug B native, good documentation. Higher cost. Often seen in automotive and manufacturing OEM integrations.
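AWS IoT Core / Azure IoT Hub: Managed, no broker to operate, scales automatically. Neither is a fully general MQTT broker: both impose their own topic conventions, and their documentation lists QoS 2 as unsupported, so verify feature coverage before committing. Good choice when your data pipeline is already cloud-native.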

Sparkplug B: If your deployment needs interoperability with SCADA systems such as Ignition, look at the Sparkplug B specification, which is layered on top of MQTT. It standardises the topic namespace and the payload encoding (Protobuf), so consuming applications do not need custom per-vendor parsing.
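
As an illustration of the standardised namespace, Sparkplug B topics follow the pattern spBv1.0/group_id/message_type/edge_node_id/[device_id]. The sketch below builds such topics; sparkplug_topic is a hypothetical helper, the STATE message used by host applications is left out, and the Protobuf payload encoding is not shown.

```python
SPARKPLUG_NAMESPACE = "spBv1.0"
MESSAGE_TYPES = {
    "NBIRTH", "NDEATH", "NDATA", "NCMD",   # edge-node level
    "DBIRTH", "DDEATH", "DDATA", "DCMD",   # device level
}


def sparkplug_topic(group_id, message_type, edge_node_id, device_id=None):
    """Build a Sparkplug B topic string for node- or device-level messages."""
    if message_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {message_type}")
    # Device-level messages carry a device ID; node-level messages do not
    if message_type.startswith("D") and device_id is None:
        raise ValueError("device-level messages need a device_id")
    if message_type.startswith("N") and device_id is not None:
        raise ValueError("node-level messages take no device_id")
    parts = [SPARKPLUG_NAMESPACE, group_id, message_type, edge_node_id]
    if device_id is not None:
        parts.append(device_id)
    for part in parts[1:]:
        # Topic separators and wildcards are not valid inside an identifier
        if not part or any(ch in part for ch in "/+#"):
            raise ValueError(f"invalid topic element: {part!r}")
    return "/".join(parts)
```

For example, sparkplug_topic("plant1", "DDATA", "edge7", "pump3") yields spBv1.0/plant1/DDATA/edge7/pump3.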