Support disallowing inconsistent metadata in cli-migrations images (close #10599) #10602

Open
wants to merge 4 commits into
base: master

@chardo chardo commented Nov 15, 2024

Description

The Hasura CLI's hasura metadata apply command supports a --disallow-inconsistent-metadata flag, which catches breaking metadata changes before they're applied, rather than leaving them to be discovered through hasura metadata ic list or, worse, via runtime application errors. However, in production environments, it's common to deploy graphql-engine using the cli-migrations Docker image and avoid exposing the metadata API entirely. This means that CI/CD workflows have no guaranteed way of preventing inconsistent metadata from landing in production. The risk is compounded by the fact that metadata changes are automatically picked up by any already-running instances, even if that metadata is inconsistent.

This change attempts to address that by exposing an optional HASURA_GRAPHQL_DISALLOW_INCONSISTENT_METADATA env variable that can be provided to the cli-migrations Docker images in order to activate the corresponding --disallow-inconsistent-metadata flag on the hasura metadata apply step. If this is set and the metadata is inconsistent, metadata application will fail, the docker-entrypoint.sh script will exit early, and the container will fail to start up.

Also, just to say: I know that I opened this PR before I got any traction on the associated issue. If there's good reason for not implementing this change, I will understand and won't mind throwing this work away.

Changelog

Component: build

Type: feature

Product: community-edition

Short Changelog

Add support for disallowing inconsistent metadata in cli-migrations image

Long Changelog

Related Issues

#10599
#8095

Solution and Design

This follows the existing pattern for configuration env vars in the docker-entrypoint.sh script(s), though it is strict in requiring the value of the new variable to be "true" (case insensitive) rather than accepting any value that happens to be set.
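A minimal sketch of the strict check described above, not the code in this PR. The helper name `disallow_ic_flag` is hypothetical; only the env variable name and the `--disallow-inconsistent-metadata` flag come from the description.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): prints the CLI flag only when
# the given value is "true" in any letter case, and prints nothing for
# every other value, including an empty or unset one.
disallow_ic_flag() {
    value=$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')
    if [ "$value" = "true" ]; then
        printf '%s' "--disallow-inconsistent-metadata"
    fi
}

# Sketch of how an entrypoint-style script might use it:
# hasura-cli metadata apply $(disallow_ic_flag "$HASURA_GRAPHQL_DISALLOW_INCONSISTENT_METADATA")
```

With this shape, values such as "1" or "yes" leave the flag off, which matches the strictness described above.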

In this first draft, I've not made any effort to fail gracefully in the event of inconsistent metadata. Ideally, I think we'd capture the exit code, shut down the temporary graphql-engine server, and then exit with the original code. That said, this implementation is consistent with how the script already handles possible non-zero exit codes for the hasura-cli commands (e.g., if the server is unreachable, or the metadata contains invalid YAML).
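The graceful-failure idea could look roughly like the sketch below. It is not part of this PR: `run_then_cleanup`, `stop_temp_server`, and `failing_apply` are hypothetical names, and the stand-in functions replace the real server shutdown and hasura-cli call.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, always run a cleanup step
# afterwards, then return the command's original exit code.
#   $1      name of the cleanup command
#   $2...   the command to run, with its arguments
run_then_cleanup() {
    cleanup_cmd=$1
    shift
    status=0
    "$@" || status=$?          # remember the failure instead of exiting
    "$cleanup_cmd" || true     # cleanup must not mask the original code
    return "$status"
}

# Stand-ins for illustration only. A real script would stop the
# temporary graphql-engine process and call hasura-cli here.
stop_temp_server() { echo "stopping temporary server"; }
failing_apply() { return 3; }
```

Calling `run_then_cleanup stop_temp_server failing_apply` runs the cleanup and still returns 3, so the container would exit with the code the failing command produced.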

Zooming out a little bit, it's maybe also worth mentioning that I've deliberately chosen to isolate this feature to the cli-migrations image, rather than making it a server-level config variable that would change the graphql-engine's default behavior when receiving new metadata updates. The latter seems a bit far-reaching, and I'd rather leverage an existing API than broaden/complicate its scope in a significant way.

Steps to test and verify

I updated the existing test scripts so that, after confirming the "good" behavior works as intended, they also attempt to apply some inconsistent metadata and then confirm that the docker image is unable to start up.

I couldn't find any tools/docs for running tests locally, but I was able to get both test scripts running and passing locally by:

  • installing a copy of the hasura CLI into a local hasura-cli
  • building the cli-migrations images manually
  • setting the necessary test env vars
  • running the test scripts locally

I did set the new disallow-inconsistent-metadata flag to "true" on both of the test docker-compose.yaml files, so that I could just augment the existing test files. If you'd prefer to have this test run in a separate, isolated file with a different env configuration, I'm willing to do that too! Just thought this was a simpler first revision.

Limitations, known bugs & workarounds

Server checklist

Catalog upgrade

Does this PR change Hasura Catalog version?

  • No
  • Yes
    • Updated docs with SQL for downgrading the catalog

Metadata

n/a

Does this PR add a new Metadata feature?

  • No
  • Yes
    • Does run_sql automatically manage the new metadata through schema diffing?
      • Yes
      • Not required
    • Does run_sql automatically manage the definitions of metadata on renaming?
      • Yes
      • Not required
    • Does export_metadata/replace_metadata support the newly added metadata?
      • Yes
      • Not required

GraphQL

  • No new GraphQL schema is generated
  • New GraphQL schema is being generated:
    • New types and typenames are correlated

Breaking changes

  • No Breaking changes

  • There are breaking changes:

    1. Metadata API

      Existing query types:

      • Modify args payload which is not backward compatible
      • Behavioural change of the API
      • Change in response JSON schema
      • Change in error code
    2. GraphQL API

      Schema Generation:

      • Change in any NamedType
      • Change in table field names

      Schema Resolve:

      • Change in treatment of null value for any input fields
    3. Logging

      • Log JSON schema has changed
      • Log type names have changed

@chardo chardo requested a review from a team as a code owner November 15, 2024 17:42

CLAassistant commented Nov 15, 2024

CLA assistant check
All committers have signed the CLA.

@@ -77,7 +87,7 @@ if [ -d "$HASURA_GRAPHQL_METADATA_DIR" ]; then
 echo "version: 3" > config.yaml
 echo "endpoint: http://localhost:$HASURA_GRAPHQL_MIGRATIONS_SERVER_PORT" >> config.yaml
 echo "metadata_directory: metadata" >> config.yaml
-hasura-cli metadata apply
+hasura-cli metadata apply $HASURA_GRAPHQL_DISALLOW_INCONSISTENT_METADATA

@chardo chardo Nov 15, 2024

question for the codeowners - is there a good reason that the v3 image applies metadata updates before applying db migrations, while v2 does them the other way around?

I suppose either order might result in temporary metadata inconsistencies if DB updates and metadata updates are bundled into the same release. if you're dropping a column or table, you probably want metadata applied first; if you're adding a column or table, you probably want migrations applied first -- so we probably need to accept some unavoidable ICs either way. just checking to see if there's a good reason that we're picking one side for v2 and another for v3.

Update on this - I went ahead and modified this script so that it has the same ordering (and thus the same behavior) as v2. The need for this became more clear after my updates to the test scripts, which revealed that the v2 test could pass while the v3 test would fail under the same circumstances, because the current v3 setup relies on DB migrations being run before metadata can be applied, in order to remain strictly consistent.

@chardo chardo force-pushed the cli-migrations-support-disallow-ic branch from faa2c1f to 034fd38 Compare November 15, 2024 21:44
@chardo chardo changed the title from "Support disallowing inconsistent metadata in cli-migrations images (fixes #10599)" to "Support disallowing inconsistent metadata in cli-migrations images (close #10599)" Nov 29, 2024

@scriptnull scriptnull left a comment

Hi @chardo, Thanks for the PR! Sorry that we couldn't take a look at it earlier.

It looks good overall 👍 The one thing I am a bit doubtful about is the change of execution order of the metadata and migrations command in the v3 cli migrations image.

I feel like there was a reason to preserve the execution order when we originally introduced it but I am unable to find the reason behind it. So I am going to ask around internally (cc: @scriptonist) to see if I can find the original reason for this. Please give me a little more time to arrive at a conclusion on this matter.

@scriptnull scriptnull left a comment

The reason behind running the metadata command before the migrations command in the v3 cli migrations image seems to be:

  • CLI uses the run_sql API to apply migrations of a database.
  • The information about "connected" databases is in hasura metadata.
  • So, if we are trying to apply migrations on a database A, it should already be connected to hasura. Otherwise hasura will throw an error saying it does not know what database A is.
    • For this reason, we apply metadata first.
    • i.e., let hasura know about A first, even though there might be some metadata inconsistencies if we are tracking tables from A, etc.
    • Now, migrations can be applied without any errors.

So let us avoid changing the order of the commands in the v3 cli migrations image.
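The dependency in the list above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the real CLI or server: `apply_metadata`, `apply_migrations`, `KNOWN_SOURCES`, and `TABLES` are hypothetical names, and state is kept in two space-separated shell variables.

```shell
#!/bin/sh
# Toy model of the ordering constraint. Not the real hasura-cli.
KNOWN_SOURCES=""   # databases the engine knows about (via metadata)
TABLES=""          # tables that exist (created by migrations)

# $1 = source name, $2 = table the metadata tracks.
# Always registers the source. Returns 0 when the tracked table exists
# (consistent) and 1 when it does not exist yet (inconsistent).
apply_metadata() {
    KNOWN_SOURCES="$KNOWN_SOURCES $1"
    case " $TABLES " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# $1 = source name, $2 = table the migration creates.
# Fails when the source has not been registered by metadata yet.
apply_migrations() {
    case " $KNOWN_SOURCES " in
        *" $1 "*) TABLES="$TABLES $2"; return 0 ;;
        *)        return 1 ;;
    esac
}
```

On a fresh state, `apply_migrations A users` fails because A is unknown. `apply_metadata A users` registers A but reports an inconsistency, after which `apply_migrations A users` succeeds and a second `apply_metadata A users` is consistent.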

chardo commented Dec 30, 2024

@scriptnull great, thanks so much for the review and the context! I can change the v3 docker entrypoint script on this branch so that it preserves the existing order, and then I'll re-run the updated test suites to make sure they still pass.

Just want to point out, though, that this will probably result in different behavior between v2 and v3 when disallowing inconsistent metadata. In v2, running migrations first means you can create a new table and define metadata for that table in the same change (since the table will exist by the time the metadata apply command runs). But v3 applying metadata updates first will prevent you from applying both of those changes in one deploy - the metadata will be inconsistent since the table doesn't exist yet.

I'm totally fine with this (it preserves existing behavior, and the flag is opt-in rather than being on by default) but just wanted to mention it because it might be confusing for others. Do you think it's worth clarifying this difference in the docs in some way?

chardo commented Dec 30, 2024

Just to follow up on my previous comment: I just ran the test script on the v3 image with inconsistent metadata disallowed, and confirmed that the graphql-engine container can't start up from scratch because of the ordering of metadata application/migrations (the engine can't apply metadata that references a table which hasn't been created yet).

So as is, the baseline v3 tests are failing with the step ordering reverted to the original state. I think I can get the tests working properly by starting up the test gql-engine container without disallowing ICs, applying metadata and migrations, then restarting it with the HASURA_GRAPHQL_DISALLOW... flag and confirming that we get an error.
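The two-phase plan could be structured roughly as follows. This is a sketch under assumptions, not the test scripts in this PR: `expect_failure` and `two_phase_check` are hypothetical helpers, and the two commands they receive would be supplied by the test script (for example, wrappers around `docker compose up` with different env settings).

```shell
#!/bin/sh
# Hypothetical test helper: succeeds only when the given command fails.
expect_failure() {
    if "$@"; then
        echo "expected failure, but command succeeded: $*" >&2
        return 1
    fi
    return 0
}

# Hypothetical two-phase check. Both arguments are commands supplied
# by the caller.
#   $1: starts the container with inconsistent metadata allowed
#   $2: restarts it with inconsistent metadata disallowed
two_phase_check() {
    "$1" || return 1       # phase 1 must start cleanly
    expect_failure "$2"    # phase 2 must refuse to start
}
```

The check passes only when the first start succeeds and the restart with the strict setting fails, which is the outcome the plan above is looking for.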

Again, happy to do all this, but wanted to follow up with a confirmation that I'm seeing the predicted consequences of the current ordering and double check whether you have any concerns with my planned approach. I'll hold off on getting too deep into fiddling with tests for now, in case you have any other ideas.
