Troubleshooting: Unexpected expansion leads to degradation or attach failure
Applicable versions
Confirmed in:
- Longhorn v1.3.2 - v1.3.3
- Longhorn v1.4.0 - v1.4.2
- Longhorn v1.5.0
Potentially mitigated in:
- Longhorn v1.4.3
- Longhorn v1.5.1
Complete fix planned in:
- Longhorn v1.4.x
- Longhorn v1.5.x
- Longhorn v1.6.0
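As a quick aid, the "Confirmed in" ranges above can be encoded in a small check. This is only a sketch: the function names are hypothetical, it covers only the confirmed ranges listed above, and it assumes plain `vX.Y.Z` version strings (anything else, such as a pre-release suffix, is rejected).

```python
import re

# Inclusive ranges copied from the "Confirmed in" list above.
CONFIRMED_AFFECTED = [
    ((1, 3, 2), (1, 3, 3)),
    ((1, 4, 0), (1, 4, 2)),
    ((1, 5, 0), (1, 5, 0)),
]


def parse_version(text):
    """Turn 'v1.4.2' or '1.4.2' into (1, 4, 2); raise ValueError otherwise."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_confirmed_affected(text):
    """Return True if the version falls inside a confirmed-affected range."""
    version = parse_version(text)
    return any(low <= version <= high for low, high in CONFIRMED_AFFECTED)
```

A result of False only means the version is outside the confirmed ranges; it does not show that the version is free of the issue.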
Symptoms
While the root cause is always the same, symptoms vary depending on other factors, such as whether there are multiple healthy replicas and which specific version of Longhorn is in use.
Generic symptoms that are not in and of themselves evidence of this issue include:
- A volume is degraded with multiple failed rebuild attempts.
- A volume fails to attach and/or appears to be in an attach/detach loop.
- A volume experiences the above and has fewer replicas than expected.
More specific symptoms include the following. Not all symptoms are present in all cases.
Expansion error in the UI
A volume shows as expanding in the UI with a red info symbol indicating a problem. Hovering over the red info symbol yields a message like:
Expansion Error: the expected size <small_size> of engine <engine> should not be smaller than the current size <large_size>. You can cancel the expansion to avoid volume crash.
No expansion is actually in progress, so it cannot be cancelled. Attempting to cancel it yields an error like:
unable to cancel expansion for volume <volume>: volume expansion is not started
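If the UI message has been copied into a ticket or note, the two sizes can be pulled out for comparison. The sketch below is illustrative only: the function name is hypothetical, and it assumes that `<small_size>` and `<large_size>` appear as decimal integers and that the engine name contains no spaces.

```python
import re

# Pattern built from the error text quoted above.
# Assumption: sizes are decimal integers and the engine name has no spaces.
EXPANSION_ERROR = re.compile(
    r"the expected size (?P<expected>\d+) of engine (?P<engine>\S+) "
    r"should not be smaller than the current size (?P<current>\d+)"
)


def parse_expansion_error(message):
    """Return the engine name and both sizes, or None if the text does not match."""
    match = EXPANSION_ERROR.search(message)
    if match is None:
        return None
    return {
        "engine": match["engine"],
        "expected": int(match["expected"]),
        "current": int(match["current"]),
    }
```

In the situation described above, the parsed `expected` value is smaller than `current`.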
Instance-manager logs
Instance-manager pods responsible for rebuilding new or pre-existing replicas repeatedly log failures to do so because of a size mismatch:
<time> time="<time>" level=error msg="failed to prune <snapshot>.img based on <snapshot>.img: file sizes are not equal and the parent file is larger than the child file"
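To gauge how often this failure repeats, saved instance-manager log text can be scanned for the message quoted above. This is a minimal sketch: the function name is hypothetical, and the log text is assumed to have already been collected into a string by whatever means is available.

```python
import re

# Pattern built from the log message quoted above; snapshot file names vary.
PRUNE_FAILURE = re.compile(
    r"failed to prune \S+\.img based on \S+\.img: "
    r"file sizes are not equal and the parent file is larger than the child file"
)


def count_prune_failures(log_text):
    """Count occurrences of the size-mismatch prune failure in log text."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    normalized = " ".join(log_text.split())
    return len(PRUNE_FAILURE.findall(normalized))
```

A single match is weak evidence; a count that keeps growing across rebuild attempts fits the symptom described above.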
It is sometimes possible to catch this issue at its origin. The instance-manager pod for an engine logs that it will expand a replica and then fails to add it. Note that the log below is normal and is not by itself an indication of a problem. However, it can be a red flag if no expansion has been requested:
<time> [longhorn-instance-manager] time="<time>" level=debug msg="Adding replica <replica_address>" currentSize=<size> restore=false serviceURL="<engine_address>" size=<size>
<time> [longhorn-instance-manager] time="<time>" level=info msg="Prepare to expand new replica to size <size>"
<time> [longhorn-instance-manager] time="<time>" level=info msg="Adding replica <replica_address> in WO mode"
Similarly, the instance-manager pod for a replica logs that it is expanding:
<time> [<replica>] time="<time>" level=info msg="Replica server starts to expand to size <large_size>"
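The expansion messages above can be collected from saved log text so their sizes can be compared against the size the volume is supposed to have. This is a rough sketch: the function name is hypothetical, and it assumes that `<size>` and `<large_size>` are logged as decimal integers.

```python
import re

# Message fragments taken from the log excerpts above.
# Assumption: the size placeholders are decimal integers in the real logs.
ENGINE_EXPAND = re.compile(r"Prepare to expand new replica to size (\d+)")
REPLICA_EXPAND = re.compile(r"Replica server starts to expand to size (\d+)")


def find_expansion_sizes(log_text):
    """Return the sizes named in engine-side and replica-side expansion messages."""
    return {
        "engine": [int(size) for size in ENGINE_EXPAND.findall(log_text)],
        "replica": [int(size) for size in REPLICA_EXPAND.findall(log_text)],
    }
```

As noted above, these messages are normal during a requested expansion, so non-empty results only matter when no expansion was requested.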