TLDR: a state manager as a standalone component for Terraform / OpenTofu is needed because most real-world deployments involve multiple state files, often depending on one another. The DAG of states is itself stateful - a successful or failed apply in one node may or may not require applying another, depending on whether the outputs have changed. There are other benefits to a centralised state manager: it can store audit logs, abstract away storage and locking mechanisms, and take care of versioning / rollbacks.
Nothing wrong; just insufficient. Even if you use something like Terragrunt to manage the complexity of hundreds of states, you still have a bunch of json files in a bucket, with no safeguards in place to ensure overall system consistency. Sole responsibility for ensuring consistency lies with the CLI (more on that below).
A practical example: if you run a bunch of applies with run-all and one fails for some reason, your setup is, in some sense, in an “invalid state” - say a VPC ID has changed - but there is no way to tell that something is wrong by looking at the state bucket. The picture of what was supposed to happen - which states are up to date and which aren’t - exists entirely in the head of the engineer who runs run-all. And if by unlucky coincidence at this very moment their laptop battery dies or the wifi connection drops, there is no way another engineer can pick up where the first one left off - there simply isn’t enough information in the state bucket.
Nothing wrong; just a bit too much. Higher-level constructs like Stacks are designed to solve this problem; but to make use of them you have to commit to a bigger cloud-based offering. Both HCP Terraform and Spacelift have a concept of Stacks, but the semantics are different (it has become a loaded word in terraform land). It is great that we have a solution; but the “core” of it seems to be much smaller than the commercial offerings built around it. If history is any guide, such pieces of technology (databases, queues etc) tend to eventually be extracted as standalone open-source components and become a community good. The same is likely to happen here.
There are several open issues (931, 2860) and discussions are ongoing. In my view, however, this functionality simply does not belong in the CLI. The semantics of state dependencies is much like that of a database; the only way to reliably ensure consistency with client code in charge of the mutations is to operate via snapshots and distributed locks - and the Terraform CLI does exactly that (obtain lock → change state in memory → upload new state → release lock). But when multiple state files are involved - often hundreds, and state files can get heavy over time - it becomes impractical to send snapshots of the entire DAG over the network.
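To make the mechanics concrete, here is a minimal sketch of that lock → mutate → upload → release cycle in Go. `StateBackend` and all of its methods are hypothetical stand-ins for a real backend (say, an object store plus a lock table), not Terraform’s actual internals:

```go
package sketch

import (
	"context"
	"fmt"
)

// StateBackend is a hypothetical stand-in for a real state backend
// (e.g. an object store plus a lock table); not Terraform's actual API.
type StateBackend interface {
	Lock(ctx context.Context, stateID string) (lockID string, err error)
	Unlock(ctx context.Context, stateID, lockID string) error
	Download(ctx context.Context, stateID string) ([]byte, error)
	Upload(ctx context.Context, stateID string, snapshot []byte) error
}

// applyOne mirrors the obtain-lock → mutate-in-memory → upload → release
// cycle the CLI performs. Note that it operates on a single state file:
// extending this pattern to a whole DAG means locking and shipping
// snapshots of every involved state, which is what makes it impractical.
func applyOne(ctx context.Context, b StateBackend, stateID string,
	mutate func([]byte) ([]byte, error)) error {
	lockID, err := b.Lock(ctx, stateID)
	if err != nil {
		return fmt.Errorf("lock %s: %w", stateID, err)
	}
	defer b.Unlock(ctx, stateID, lockID)

	snapshot, err := b.Download(ctx, stateID)
	if err != nil {
		return err
	}
	next, err := mutate(snapshot) // the actual "apply", done client-side
	if err != nil {
		return err
	}
	return b.Upload(ctx, stateID, next)
}
```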
I’m a huge fan of Terragrunt and Gruntwork’s work in general. Their take on Stacks is perhaps the most community-friendly of all the solutions to the multi-state problem that exist to date. Still, I believe a much cleaner solution can be achieved by dropping the “CLI only” constraint. As with a language-level solution, in my view this functionality does not belong in the CLI (see above). If you have a system with multiple clients (CLIs running in CI environments and / or on developer machines) and a single centralised stateful “source of truth” (a DAG of state files), then the further away from the “center” the consistency logic lives, the more error-prone your system becomes; in some sense this is the opposite of the “shift left” best practice.
For the reasons outlined above, I believe that it has to be a server. But at the same time I think it should be as small a piece as possible - definitely not a full-blown TACO, not concerned with orchestration of jobs, integration with third-party systems, or anything of that sort. It’s basically a “database of states” that:

- stores state files, abstracting away the underlying storage and locking mechanisms
- tracks the DAG of dependencies between states and their up-to-date status
- keeps audit logs and state versions, enabling rollbacks
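Sketched as a Go interface, one possible shape of that “database of states” could be the following; every name and signature here is mine, an illustration of the idea rather than any existing API:

```go
package sketch

import "context"

// StateManager sketches the minimal surface of the proposed server.
// All names and signatures are hypothetical illustrations of the idea.
type StateManager interface {
	// Storage and locking, abstracted away from clients.
	Lock(ctx context.Context, stateID string) (lockID string, err error)
	Unlock(ctx context.Context, stateID, lockID string) error
	Get(ctx context.Context, stateID string) (snapshot []byte, version int, err error)
	// Put rejects the write if baseVersion is stale (compare-and-swap).
	Put(ctx context.Context, stateID string, snapshot []byte, baseVersion int) error

	// The DAG and its per-node status, so clients can order applies.
	DAG(ctx context.Context) (map[string][]string, error) // stateID -> upstream stateIDs
	Stale(ctx context.Context) ([]string, error)          // states whose inputs have changed

	// Audit logs, versioning, rollbacks.
	History(ctx context.Context, stateID string) ([]int, error)
	Rollback(ctx context.Context, stateID string, toVersion int) error
}
```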
Stretching the database parallel a bit, the DAG of dependencies is like a schema; it determines whether or not a data operation - eg an apply - is valid.
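Continuing the parallel, here is a sketch of the kind of “schema check” such a server could run before admitting an apply; the types and the exact validity rule are mine, for illustration:

```go
package sketch

// NodeStatus captures what the manager knows about one state file.
type NodeStatus struct {
	Applied        bool // last apply succeeded
	OutputsChanged bool // outputs differ from what dependents last consumed
}

// Graph holds the dependency "schema": which states feed which.
type Graph struct {
	Parents map[string][]string // stateID -> upstream stateIDs
	Status  map[string]NodeStatus
}

// CanApply is the validity check: an apply of stateID is admitted only
// if every upstream state has been applied and its outputs are settled;
// otherwise the apply would read stale or missing outputs.
func (g *Graph) CanApply(stateID string) bool {
	for _, p := range g.Parents[stateID] {
		s := g.Status[p]
		if !s.Applied || s.OutputsChanged {
			return false
		}
	}
	return true
}
```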
Ensuring that applies run in the correct order is the main use case for higher-level constructs like Stacks. This is relatively straightforward locally in the CLI (terragrunt constructs an in-memory run queue), but a bit trickier in a distributed setting where applies need to be server-side jobs (CI or “runs” in your favourite TACO). HashiCorp’s approach is “orchestration rules” and “deferred changes” for Stacks in HCP Terraform, but it strikes me as another example of “too much” functionality tied to an enterprise cloud offering. This stuff - the purely technical part of it - belongs in open source land.
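For reference, the in-memory run queue is essentially a topological sort of the dependency graph. A minimal version (Kahn’s algorithm; names are mine, not terragrunt’s) looks like this:

```go
package sketch

import "fmt"

// runQueue orders states so that every state comes after everything it
// depends on. deps maps each state to its upstream dependencies.
func runQueue(deps map[string][]string) ([]string, error) {
	indegree := map[string]int{}
	dependents := map[string][]string{}
	for node, parents := range deps {
		if _, ok := indegree[node]; !ok {
			indegree[node] = 0
		}
		for _, p := range parents {
			indegree[node]++
			dependents[p] = append(dependents[p], node)
			if _, ok := indegree[p]; !ok {
				indegree[p] = 0
			}
		}
	}

	// Start from states with no dependencies, peel layer by layer.
	var queue, order []string
	for n, d := range indegree {
		if d == 0 {
			queue = append(queue, n)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range dependents[n] {
			if indegree[m]--; indegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(indegree) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return order, nil
}
```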
I think a more sensible approach is to decouple execution from ordering. The state manager should expose an API to get the up-to-date state of the DAG; a client can then decide in which order to run the applies. A client can be a CLI on a developer’s laptop - the run-all scenario - or a TACO like Atlantis, or an orchestrator like Digger. If the client decides to run applies out of order, that’s OK (a bit like a force push in git), but it needs to be an explicit decision.
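A sketch of what that could look like from the client’s side; the endpoint path and the JSON shape are assumptions of mine, not any existing tool’s API:

```go
package sketch

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Node is one state file as reported by the (hypothetical) state manager.
type Node struct {
	ID      string   `json:"id"`
	Parents []string `json:"parents"`
	Stale   bool     `json:"stale"` // upstream outputs changed since last apply
}

// fetchDAG asks the state manager for the current view of the DAG.
// All ordering decisions stay entirely on the client side: the client
// can feed this into a run queue, or knowingly apply out of order.
func fetchDAG(baseURL string) ([]Node, error) {
	resp, err := http.Get(baseURL + "/v1/dag") // hypothetical endpoint
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("state manager returned %s", resp.Status)
	}
	var nodes []Node
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		return nil, err
	}
	return nodes, nil
}
```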