Cultural Learnings of SORA Upgrades for Make Benefit Glorious Developers of SORA

On July 7th at 13:33, the SORA network experienced an outage while upgrading the blockchain runtime. This report outlines the outage timeline and causes, the actions taken to mitigate it, and the steps being taken to prevent a similar situation from happening again.

Incident Timeline:

7 July 2023, 13:30 – Runtime upgrade 1.11.0 was dispatched by SORA governance.
13:33 – A migration failure (panic) was discovered and the team began working on a fix. Total downtime was one hour.
14:25 – Users began to notice slow block production and reported the issue.
23:29 – The Hashi bridge was disabled as a security measure.
8 July 2023, 01:00 – The MOF nodes were recovered and block production began stabilising.
09:10 – Full network recovery was achieved and normal block production resumed.
10:08 – The Hashi bridge was re-enabled.

Detailed Description:

The pallet migration to runtime version 1.11.0 failed due to an issue with the handling of string fields within the Hermes platform. This issue was not flagged during testing, as the test environment had no polls on the Hermes platform that exceeded the 64-byte string limit.
On mainnet, the presence of polls with descriptions over the 64-byte limit caused the migration to fail (panic). This failure made it impossible to initialise the next block, which stopped block production entirely.
SORA developers provided an updated runtime WASM and, in the SORA Devs Telegram chat, a guide for validators to implement the fix. As validators who implemented the fix started to produce blocks again, those who still could not produce blocks were considered offending and were subsequently slashed.
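The failure mode described above can be illustrated with a minimal, std-only Rust sketch. It is not the Hermes pallet code and not the fix that was shipped: `BoundedText`, `MAX_DESCRIPTION_BYTES`, `migrate_panicking` and `migrate_defensive` are hypothetical names, and truncation is only one possible way to handle oversized values (skipping or logging the entry would be others). The sketch contrasts a migration that panics on an oversized description with one that stays within the limit and reports how many entries it had to shorten.

```rust
// Hypothetical stand-in for the 64-byte string limit mentioned in the report.
const MAX_DESCRIPTION_BYTES: usize = 64;

/// Hypothetical bounded string type: holds at most MAX_DESCRIPTION_BYTES bytes.
#[derive(Debug, Clone, PartialEq)]
struct BoundedText(Vec<u8>);

impl BoundedText {
    /// Returns Err(actual_length) when the input is over the limit.
    fn try_new(s: &str) -> Result<Self, usize> {
        if s.len() <= MAX_DESCRIPTION_BYTES {
            Ok(BoundedText(s.as_bytes().to_vec()))
        } else {
            Err(s.len())
        }
    }
}

/// Panics as soon as one stored description is over the limit.
/// Inside a runtime upgrade, a panic like this prevents the block from being initialised.
fn migrate_panicking(old: &[String]) -> Vec<BoundedText> {
    old.iter()
        .map(|s| BoundedText::try_new(s).expect("description fits the limit"))
        .collect()
}

/// Cuts a string to at most MAX_DESCRIPTION_BYTES bytes without splitting a UTF-8 character.
fn truncate_to_limit(s: &str) -> &str {
    if s.len() <= MAX_DESCRIPTION_BYTES {
        return s;
    }
    let mut end = MAX_DESCRIPTION_BYTES;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Never panics: oversized descriptions are truncated and counted.
fn migrate_defensive(old: &[String]) -> (Vec<BoundedText>, usize) {
    let mut truncated = 0;
    let migrated: Vec<BoundedText> = old
        .iter()
        .map(|s| {
            let kept = truncate_to_limit(s);
            if kept.len() != s.len() {
                truncated += 1;
            }
            BoundedText(kept.as_bytes().to_vec())
        })
        .collect();
    (migrated, truncated)
}
```

With a list containing one short description and one 100-byte description, `migrate_defensive` returns both entries and a truncation count of 1, whereas `migrate_panicking` aborts on the second entry.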

Actions Taken to Resolve the Failure

A runtime WASM without the failed pallet migration was prepared.
The runtime WASM update was applied by 2 validators.
SORA developers prepared a guide for the community validators to implement the runtime WASM update, after which block production gradually resumed. Validators who did not implement the fix immediately could not produce blocks, so the validators that had resumed block production considered them offending and they were slashed.
An archive node was set up to monitor the network during public node downtime.
Issues with the slashes and the validator election provider were discovered.
An attempt was made to cancel the slashes. However, this proved difficult, as Polkadot.js did not have the option to modify the CancelSlashes extrinsic threshold.
The SORA Council was contacted to create an internal proposal to cancel the slashes.
The Hashi bridge was disabled temporarily to mitigate any potential issues.
Once the slashes were cancelled, proposals were raised to restart the election provider.
Local testing was conducted to anticipate and address any further issues.
The SORA Council members voted to cancel the slashes and elect new validators.
The Hashi bridge was re-enabled once acceptance tests had passed and the network was stable.

Actions Taken to Prevent This Issue from Recurring

Production migration testing has been implemented in the CI pipeline, and other hooks will be tested for panic scenarios.
A streamlined communication channel with the other development teams involved has been established, and the teams are collaborating to define common development policies and processes for SORA.
A reliable archive node, maintained 24/7, has been included in client applications, and restricted permissions have been granted to the Technical Committee for emergency cases.
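
The idea behind testing migrations and other hooks for panic scenarios can be sketched as a small host-side harness in plain Rust. This is an illustration under stated assumptions, not the SORA CI setup: `run_hook_checked` and `sample_migration` are hypothetical names, and the "snapshot" is simply a list of strings standing in for production-like state. The harness runs a hook against the snapshot and turns a panic into an error that a CI job can report, instead of letting it surface for the first time on mainnet. It relies on `std::panic::catch_unwind`, which is suitable for native tests but is not available inside a runtime built to abort on panic.

```rust
use std::panic::{self, UnwindSafe};

// Hypothetical limit, mirroring the 64-byte string limit from the report.
const MAX_DESCRIPTION_BYTES: usize = 64;

/// Hypothetical migration hook that panics on an oversized description.
fn sample_migration(descriptions: Vec<String>) -> Vec<Vec<u8>> {
    descriptions
        .into_iter()
        .map(|d| {
            assert!(
                d.len() <= MAX_DESCRIPTION_BYTES,
                "description exceeds {} bytes",
                MAX_DESCRIPTION_BYTES
            );
            d.into_bytes()
        })
        .collect()
}

/// Runs `hook` against `snapshot` and converts a panic into Err(message).
fn run_hook_checked<S, T, F>(snapshot: S, hook: F) -> Result<T, String>
where
    S: UnwindSafe,
    F: FnOnce(S) -> T + UnwindSafe,
{
    panic::catch_unwind(move || hook(snapshot)).map_err(|payload| {
        // Panic payloads are usually a &str or a String; anything else is reported generically.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "non-string panic payload".to_string()
        }
    })
}
```

Run against testnet-like data that contains only short descriptions, the hook passes. Run against data that includes a 65-byte description, `run_hook_checked` returns an error carrying the panic message, which is the case that testing with production data is meant to catch.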

Lessons Learnt

Situations such as this one are often difficult to predict when tests go through without a hitch. The first lesson is to avoid releases on a Friday. Although the blockchain space never sleeps, people usually have difficulty responding as quickly as such a situation requires.

Another way to mitigate the delays is for the Council to grant the Technical Committee the permissions required to resolve technical issues, as its members are constantly online to ensure that the entire network is running smoothly.

More test coverage should be implemented ahead of launches to ensure that no stone is left unturned before deploying any major upgrade. Additionally, testing automation will help make processes more robust and reduce the risk of human error.

Next Steps

Release 1.11.1 will be dispatched to production to fix the pallet migration, and the runtime WASM will be made available to validators so that they can upgrade their nodes and stop using the extra runtime WASM update.

Update
At the time of publishing (19 July), the 1.11.1 fix had been successfully deployed, and new node runners are encouraged to set up their nodes using the 1.11.1 Docker image.

Thank you for your support in resolving this issue.