Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

config: Stop server startup if an unrecognized option is found in a config file #9855

Merged
merged 17 commits into from
Apr 8, 2019

Conversation

kolbe
Copy link
Contributor

@kolbe kolbe commented Mar 21, 2019

What problem does this PR solve?

#7103

What is changed and how it works?

Previously, tidb-server would simply ignore invalid config options encountered in a config.toml file. This can cause big problems if a user puts an option in the wrong section of the config file, or if they mis-spell an option name. This could conceivably lead to data loss (if, for example, data is not written to the correct device), security issues (if listeners are not bound to the correct interface), or administration headaches (if someone puts an option in the wrong section and cannot figure out why it isn't taking effect).

This change uses the metaData returned by toml.DecodeFile to identify any items not decoded. A list of them is made and is returned as an error from Config.Load, which causes server startup to abort.

Check List

Tests

  • Unit test
    Yes, check added to config/config_test.go to make sure that an unrecognized option in a config file throws the correct error.

Code changes

  • Has exported function/method change
    No
  • Has exported variable/fields change
    No
  • Has interface methods change
    No
  • Has persistent data change
    No

Side effects

  • Possible performance regression
    No
  • Increased code complexity
    No
  • Breaking backward compatibility
    Yes, if someone is using a config file in product that has unrecognized options.

Related changes

  • Need to cherry-pick to the release branch
    No
  • Need to update the documentation
    Yes
  • Need to update the tidb-ansible repository
    Maybe, if it has a strange config file with invalid options!
  • Need to be included in the release note
    Yes

@kolbe kolbe requested review from shenli and IANTHEREAL March 21, 2019 22:31
@codecov
Copy link

codecov bot commented Mar 21, 2019

Codecov Report

Merging #9855 into master will decrease coverage by 0.3011%.
The diff coverage is 100%.

@@               Coverage Diff                @@
##             master      #9855        +/-   ##
================================================
- Coverage   77.5172%   77.2161%   -0.3011%     
================================================
  Files           403        405         +2     
  Lines         81827      81641       -186     
================================================
- Hits          63430      63040       -390     
- Misses        13689      13932       +243     
+ Partials       4708       4669        -39

@morgo
Copy link
Contributor

morgo commented Mar 21, 2019

For context: this is the mysql server behavior with parsing config files. If an item is discovered that is unknown, an error is returned. As Kolbe pointed out, the logic is:

This could conceivably lead to data loss (if, for example, data is not written to the correct device), security issues (if listeners are not bound to the correct interface), or administration headaches (if someone puts an option in the wrong section and cannot figure out why it isn't taking effect).

@morgo morgo changed the title Stop server startup if an unrecognized option is found in a config file config: Stop server startup if an unrecognized option is found in a config file Mar 21, 2019
@kolbe kolbe changed the base branch from master to rc4 March 22, 2019 18:51
@kolbe kolbe changed the base branch from rc4 to master March 22, 2019 18:52
Copy link
Contributor

@winkyao winkyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mind compatibility

@winkyao
Copy link
Contributor

winkyao commented Mar 25, 2019

@jackysp PTAL

@jackysp
Copy link
Member

jackysp commented Mar 25, 2019

/run-all-tests

@jackysp
Copy link
Member

jackysp commented Mar 25, 2019

It seems that there are some illegal configurations in the test, which causes TiDB to fail to start.

@xiekeyi98
Copy link
Contributor

/run-all-tests

@xiekeyi98
Copy link
Contributor

/run-all-tests tidb-test=pr/764

@xiekeyi98
Copy link
Contributor

/run-unit-tests tidb-test=pr/764

@xiekeyi98
Copy link
Contributor

/run-mybatis-tests tidb-test=pr/764

@xiekeyi98 xiekeyi98 closed this Mar 25, 2019
@xiekeyi98 xiekeyi98 reopened this Mar 25, 2019
config/config.go Outdated
for _, item := range undecoded {
undecodedItems += fmt.Sprintf(" %s \n", item.String())
}
err = errors.Errorf("config file %s contained unknown configuration options:\n%s", confFile, undecodedItems)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error generated by errors.Errorf() is attached with a call stack, we don't need to wrap a errors.Trace() in line 391.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me, but the trace is not my code (it was there before, so that a non-existent config file for example would be traced for some reason).

config/config.go Outdated Show resolved Hide resolved
@@ -371,10 +372,22 @@ func GetGlobalConfig() *Config {

// Load loads config options from a toml file.
func (c *Config) Load(confFile string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When upgrade from old version tidb to a newer version, some old config item maybe is not used in the newer version tidb. The rolling upgrade may meet error in this case, then the rolling upgrade failed.

How about only check the config items when -config-check is provided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this suggestion. I think the behavior should be that tidb-server simply refuses to start if there are unknown configuration options, for the reasons I provided above.

A couple points:

  1. Now that there's a --config-check option, it can be a part of the recommended procedure for a rolling upgrade, so that the only people running into the issue you describe are those not following the recommended procedure.

  2. The benefit of a rolling upgrade is that you never have to take the entire cluster offline. When upgrading the first node, it will refuse to start with this error. That's a fine time to address the problem, apply the change to the config files, get that first node upgraded, and then proceed with the rolling upgrade.

@morgo, your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because our own CI broke, we may need to ease this change in slowly. Here would be my suggestion:

  1. Keep --config-check as is, it is independently very useful.
  2. Change from refuse to start to print a WARN about each incorrect config item.
  3. Add a new flag for --config-strict, which defaults to FALSE. We can document it that enabling it is a best practice, and we may change the default to TRUE in the future. This will result in errors for invalid options.

It is not bulletproof of course, because the default log level is INFO it is possible that the warnings may not be seen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the very fact our own CI broke is evidence of the value of this behavior change. Even we weren't keeping track of whether our config files were well-formed!

  1. Your suggestion above is difficult, because the logging system isn't set up yet when the config file is parsed. Either we have to keep around the list of undecoded items and check them later after logging is possible, or we have to pass some other string/structure/flag around, or we have to simply print something to stderr, or ...? @coocood Morgan mentioned that you may have some thoughts on this?

  2. I dislike this. Adding this new flag means that people may start relying on it, and then we have to keep it around as a no-op if we later make that behavior the default.

Copy link
Contributor

@winkyao winkyao Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with Morgan's first and second suggestions. Our users may use old version ansible to upgrade, that doesn't pre-check the config using tidb-server --config-check, we need to keep the cluster stable when rolling upgrade.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, it sounds like there's a consensus. Can someone offer a suggestion of the best way to technically implement a "warning" before the logging system is set up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolbe I will create a pull request to your branch later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@winkyao I added some comments on your PR. I re-wrote my own interpretation of how I see this working and pushed it to my branch of my fork, please take a look.

Basically I added a kind of unsettling amount of additional stuff to create a custom error in config.go, then use that in the case of failed config validation. That enables checking for the specific error type in main.go, and if configStrict is not enabled, it'll get the string from the warning and keep that until ater logging has been set up.

This is kind of a lot of extra work and adds some ugly things, but the whole idea behind strict config checking (and the --config-strict option) is that this is a temporary situation until we make --config-strict behavior the default (and hopefully only!) behavior of TiDB Server in the future. I think at that point we'd simply remove all this extra stuff with the custom error and strange handling in the server.

Thoughts?

@xiekeyi98
Copy link
Contributor

/run-all-tests tidb-test=pr/764

… defer a warning from loadConfig until after logging has been set up. Virtually all of this should be removed after configStrict is made the default behavior of TiDB Server!
@winkyao
Copy link
Contributor

winkyao commented Apr 2, 2019

Ci failed by

config_test.go:71:
    c.Assert(conf.Load(configFile), IsNil)
... value *config.ErrConfigValidationFailed = &config.ErrConfigValidationFailed{err:"config file /go/src/github.com/pingcap/tidb/config/config.toml contained unknown configuration options: tikv-client.token-limit"} ("config file /go/src/github.com/pingcap/tidb/config/config.toml contained unknown configuration options: tikv-client.token-limit"

The test case is wrong, you should replace (https://github.com/pingcap/tidb/blob/master/config/config_test.go#L48)

_, err = f.WriteString(`[performance]
[tikv-client]
commit-timeout="41s"
max-batch-size=128

token-limit = -1
`)

with

_, err = f.WriteString(`
token-limit = 0
[performance]
[tikv-client]
commit-timeout="41s"
max-batch-size=128
`)

Copy link
Contributor

@winkyao winkyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @zz-jason PTAL

@winkyao
Copy link
Contributor

winkyao commented Apr 3, 2019

/run-all-tests tidb-test=pr/764

@morgo morgo added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 3, 2019
@codecov-io
Copy link

Codecov Report

Merging #9855 into master will decrease coverage by 0.5034%.
The diff coverage is 100%.

@@               Coverage Diff                @@
##             master      #9855        +/-   ##
================================================
- Coverage   78.0225%   77.5191%   -0.5035%     
================================================
  Files           404        403         -1     
  Lines         82016      81932        -84     
================================================
- Hits          63991      63513       -478     
- Misses        13324      13711       +387     
- Partials       4701       4708         +7

@xiekeyi98
Copy link
Contributor

/run-all-tests tidb-test=pr/764

@xiekeyi98
Copy link
Contributor

/run-common-test tidb-test=pr/764

Copy link
Contributor

@xiekeyi98 xiekeyi98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/server status/LGT1 Indicates that a PR has LGTM 1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants