Skip to content

💥 Use Temporal Failures for Nexus Error Serialization#1336

Open
Quinn-With-Two-Ns wants to merge 5 commits intomainfrom
nexus-error-serialization-quinn
Open

💥 Use Temporal Failures for Nexus Error Serialization#1336
Quinn-With-Two-Ns wants to merge 5 commits intomainfrom
nexus-error-serialization-quinn

Conversation

@Quinn-With-Two-Ns
Copy link
Contributor

@Quinn-With-Two-Ns Quinn-With-Two-Ns commented Feb 19, 2026

Note This PR can't be merged until the corresponding Core PR (temporalio/sdk-core#1109) is merged and updated in this branch and the nexus-rpc PR (nexus-rpc/sdk-python#45) is merged and released

What was changed

  • Nexus error serialization now uses Temporal Failures: Instead of the custom Nexus Failure JSON format (nexus.v1.Failure with metadata/details), Nexus handler errors and operation errors are now serialized as standard Temporal failure.v1.Failure protos via the SDK's DataConverter/FailureConverter. This includes HandlerErrornexus_handler_failure_info and OperationErrorCancelledError/ApplicationError conversions.
  • Simplified _nexus.py worker code: Removed the _nexus_error_to_nexus_failure_proto, _operation_error_to_proto, and _handler_error_to_proto helper methods. Errors are now encoded directly through data_converter.encode_failure() into the completion/response proto's failure field.
  • Simplified _exception_to_handler_error: Removed the workaround that inserted an extra ApplicationError at the head of the HandlerError cause chain (previously needed to preserve the HandlerError message when hoisted to the Nexus Failure).
  • Converter improvements: _error_to_failure for HandlerError now uses error.message and error.stack_trace instead of str(error). from_failure uses match/case instead of if/elif chains.
  • Updated protos and sdk-core: Updated nexus_pb2, workflow_activation_pb2, workflow_commands_pb2, and the sdk-core submodule to support the new Nexus failure fields on completion protos.
  • Removed HTTP-based Nexus tests: Deleted test_handler.py and test_handler_async_operation.py which tested via direct HTTP calls (the HTTP interface is not user-facing). Converted remaining tests (test_workflow_run_operation.py, test_dynamic_creation_of_user_handler_classes.py) to use workflow callers.
  • Refactored error chain tests: Replaced the class-based ErrorConversionTestCase registry pattern with typed dataclasses (ExpectedNexusOperationError, ExpectedHandlerError, ExpectedApplicationError, ExpectedCancelledError) and explicit ErrorTestService handler methods.
  • Added failure converter unit tests: New tests in test_converter.py for round-trip serialization of HandlerError, NexusOperationError, OperationError, and related types.
  • Cleaned up stale TODOs: Removed TODO comments marked as "Won't Do" in the Nexus TODO triage, and removed the unused self._interceptors variable in the Nexus worker.

Why?

The previous Nexus error serialization used a custom JSON-based format (nexus.v1.Failure with metadata and JSON details) that was specific to the Nexus HTTP protocol. With the move to have Core SDK handle protocol-level conversion based on server capabilities, the Python SDK now only needs to produce standard Temporal Failure protos.

Checklist

  1. How was this tested:

    • Existing Nexus workflow caller tests converted and expanded to validate error chains end-to-end
    • New unit tests for FailureConverter round-trip serialization of Nexus error types in tests/test_converter.py
    • New typed error chain validation in tests/nexus/test_workflow_caller_error_chains.py
  2. Any docs updates needed?

    • No

Note

Medium Risk
Changes the wire-level/proto surface for Nexus operation error handling and introduces a new failure variant, which may affect compatibility with server/core expectations and callers handling the deprecated operation_error path.

Overview
Updates Nexus protocol bindings to support Temporal failure.v1.Failure-based error responses.

The generated nexus.v1 protos now include a Request.capabilities.temporal_failure_responses flag and a new StartOperationResponse.failure variant (with the old operation_error variant marked deprecated), plus expanded nexus.v1.Failure fields (stack_trace, cause).

Also bumps the nexus-rpc dependency to 1.4.0, tweaks a generated docstring in deployment protos, adds a doc comment for NexusOperationFailureInfo, and exports CountSchedulesRequest/Response from workflowservice.v1.

Written by Cursor Bugbot for commit 7f78ec3. This will update automatically on new commits. Configure here.

@Quinn-With-Two-Ns Quinn-With-Two-Ns changed the title Nexus error ser. 💥 Use Temporal Failures for Nexus Error Serialization Feb 19, 2026
@Quinn-With-Two-Ns Quinn-With-Two-Ns marked this pull request as ready for review February 19, 2026 20:07
@Quinn-With-Two-Ns Quinn-With-Two-Ns requested a review from a team as a code owner February 19, 2026 20:07
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b58ce8660f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +403 to +407
match err.state:
case nexusrpc.OperationErrorState.CANCELED:
raise CancelledError(err.message) from err.__cause__
case nexusrpc.OperationErrorState.FAILED:
raise ApplicationError(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle unknown OperationError states explicitly

The match err.state block only raises for CANCELED and FAILED; there is no default branch. If an unexpected state is received (e.g., from a newer nexus-rpc enum value or a malformed custom raise), this except nexusrpc.OperationError path falls through without returning a StartOperationResponse, and _handle_start_operation_task proceeds as if start-operation succeeded with an invalid empty response variant instead of surfacing a failure.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odds we add a new enum here are basically zero, I can handle it though if the team would like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our configured linters should catch the scenario where a new enum value is created in the nexus-rpc library.

The risk of a malformed custom raise that codex pointed out is real though it seems unlikely.

content: nexusrpc.Content, # type:ignore[reportUnusedParameter]
as_type: type[Any] | None = None,
) -> Any:
payload = self.payload
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind breaking it into two steps?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking it into two steps allows us to automatically retry for errors that may be ephemeral during codec decoding (such as a KMS being temporarily unavailable) while immediately failing if the payload cannot be deserialized into the expected input type.

The main driver behind the difference in behavior here compared to activities is that Nexus operations are subject to a circuit breaker. A malformed input being retried could trigger the circuit breaker and potentially deny other operations on the Nexus endpoint that would succeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More details are here temporalio/features#762, we are updating this or have for all core based SDKs

Copy link
Contributor

@tconley1428 tconley1428 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks okay to me. Approval pending nexus sdk release.

@VegetarianOrc
Copy link
Contributor

It looks okay to me. Approval pending nexus sdk release.

The Nexus SDK has now been released and updated on this branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants