Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set the default charset at startup #7757

Closed
wants to merge 11 commits into from

Conversation

gregwebs
Copy link
Contributor

What problem does this PR solve?

This adds MySQL compatibility for setting a default charset (and collation).
To some extent it is just a work around for the lack of charset support that makes existing applications that set utf8mb4 fail. Here is a bug report when someone tried to store data from Grafana.

I am a unsure if the current approach is good, so I haven't followed through with writing tests yet. I would be happy if the TiDB team took this over and used a better approach.

What is changed and how it works?

I am now running tidb with the arguments:

-character-set-server utf8mb4
-collation-server utf8mb4_unicode_ci

My understaning is that this doesn't change the behavior of TiDB (it is
equivalent to the current configuration).
However, it allows TiDB to be compatible with standard MySQL setups.
Previously, I would see this failure message when using an application:

unsupported modify column charset utf8 not match origin utf8mb4

Using the above command line flags, the application now works.
There is still more to do to have proper charset support,
but this is enough to fix my use case.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test

As per the above description, I manually tested that this fixed the issue with the application I am running (prisma).

Code changes

This change creates backwards compatible additions/enhancements.

Side effects

  • Increased code complexity

Related changes

  • Need to update the documentation
  • Need to be included in the release note

I am now running tidb with the arguments:

    -character-set-server utf8mb4
    -collation-server utf8mb4_unicode_ci

My understaning is that this doesn't change the behavior of TiDB (it is
equivalent to the current configuration).
However, it allows TiDB to be compatible with standard MySQL setups.
Previously, I would see this failure message when using an application:

    unsupported modify column charset utf8 not match origin utf8mb4

Using the above command line flags, the application now works.
There is still more to do to have proper charset support,
but this is enough to fix my use case.
@morgo
Copy link
Contributor

morgo commented Sep 24, 2018

/run-all-tests

@gregwebs gregwebs requested a review from lysu September 26, 2018 00:06
if err != nil {
return errors.Trace(err)
}
err = variable.SetSessionSystemVar(sessionVars, variable.CollationDatabase, types.NewStringDatum("utf8_unicode_ci"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @XuHuaiyu is any by design in here to using utf8_unicode_ci instead of utf8_bin like other place...and can we change this~?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackysp its hard for me to tell which question was being answered. Is this a good change to make?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's a good change.

@@ -164,7 +164,7 @@ var defaultSysVars = []*SysVar{
{ScopeGlobal, "innodb_max_undo_log_size", ""},
{ScopeGlobal | ScopeSession, "range_alloc_block_size", "4096"},
{ScopeGlobal, ConnectTimeout, "10"},
{ScopeGlobal | ScopeSession, "collation_server", charset.CollationUTF8},
{ScopeGlobal | ScopeSession, "collation_server", mysql.DefaultCollationName},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found charset.CharsetUTF8, charset.CollationUTF8 was refered in another pkg, e.g. expression/ builtin_*, does them also need change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it in more places now :)

@lysu
Copy link
Contributor

lysu commented Sep 26, 2018

@gregwebs the question that we want to solve is could not modify column charset from utf8 to utf8mb4

how abount only modify modify column ddl logic to open a door that can change utf8 to utf8mb4(IMHO, we can support it, because they are same in tidb and golang)?

this PR seems open a door to modify global charset configuration, user maybe use this to set gbk, latin and other charset, for current TiDB's situation support multiple charset is not easy and maybe meet some other question.(but for utf8mb4, tidb just see it same as utf8)

if we just want to support alter column from utf8 to utf8mb4 modify ddl logic check log to adapte utf8<->utf8mb4 seems be the simplest way.

maybe need PTAL @winkyao

@gregwebs
Copy link
Contributor Author

I think making a change in modify column might be a hack and that the real way to fix this is to accept the charset setting somewhere where it is not being accepted on the initial schema creation (I am not sure where exactly it is not working right now: it may be the database, table, or column level). If you can outline how we would support setting the charset on schema creation we can look at making those changes.

Otherwise it is simple to validate that only utf8 and utf8mb4 are the allowed charsets for this setting: I can push a commit for that.

@gregwebs gregwebs force-pushed the set-default-charset branch 3 times, most recently from 41666dd to 8883e5d Compare September 26, 2018 17:57
@@ -557,7 +557,7 @@ func (s *testEvaluatorSuite) TestConv(c *C) {
tp := f.GetType()
c.Assert(tp.Tp, Equals, mysql.TypeVarString)
c.Assert(tp.Charset, Equals, charset.CharsetUTF8)
c.Assert(tp.Collate, Equals, charset.CharsetUTF8)
c.Assert(tp.Collate, Equals, charset.CollationUTF8)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bugfix from this PR, although I am not sure what part of this changset fixed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My pr also fixed it... :)

@lysu
Copy link
Contributor

lysu commented Sep 27, 2018

@gregwebs if we start tidb-server with

-character-set-server utf8mb4
-collation-server utf8mb4_unicode_ci

can we change utf8mb4 back to utf8? it seems mysql can do this. 🤣

@coocood
Copy link
Member

coocood commented Sep 27, 2018

@gregwebs
I think the change is too much relative to the problem it solves.
Support modify column charset from utf8 to utf8mb4 is much simpler.

@gregwebs
Copy link
Contributor Author

@lysu no, you cannot change it back. That would be an additional feature to make the global variables work, but hopefully an easy change we can add on. Maybe that is needed to make this PR look right?

Right now this changeset is focused on being good enough to solve the most common real-world usage where utf8mb4 is used for everything. We could also solve this by just switching our default charset from utf8 -> utf8mb4, is there a reason why we default to utf8?

@coocood can you outline the changes to support modify column? But after that it is quite possible for there to be confusion and bugs because the real problem is that the column was supposed to be created with utf8mb4, not utf8. The modify issue is just an artifact of that. But it is useful to properly support modify, and I can test to see if that solves my problem for now.

I agree this PR touches a lot for solving just my issue. Maybe you are thinking of a more elegant way of doing it? However, I think this change is necessary at some point to have proper MySQL compatibility, particularly if we add some of the related global variable setting.

@morgo
Copy link
Contributor

morgo commented Sep 27, 2018

This will also become an issue for MySQL 8.0 compatibility, where utf8mb4 (not utf8) is the default character set. Some applications expect/depend on utf8mb4.

@gregwebs
Copy link
Contributor Author

Maybe instead of flags for setting charset, we should have --compatibility 8.

@morgo
Copy link
Contributor

morgo commented Sep 28, 2018

Where possible, I think it's better to target a new GA as 8.0 compatible. Having modes can hurt usability, and users rarely change from defaults.

@gregwebs
Copy link
Contributor Author

gregwebs commented Sep 28, 2018

I think we should separate UX from internal implementation. Internally, I doubt TiDB can afford to have to backport fixes to an old 5.7 only release and we will want one system that supports both modes, particularly as we start to add 8.0 compatibility but have not yet completed it.

This could be exposed to the user via a single config option (compatibility mode), a collection of config options (config files), or two different binary builds with different default settings built-in. With any of these options, at some point we can switch to 8.0 as default (in the binary case that is more about naming the binaries).

@gregwebs gregwebs changed the title [WIP] set the default charset at startup Set the default charset at startup Oct 9, 2018
@@ -337,7 +352,7 @@ func overrideConfig() {
}
}

func validateConfig() {
func validateConfig() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think It's better to print the error log then exit instead of return an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its really problematic to have exits anywhere other then the main function. The linters warn against this. In this case it is a private function in main.go, but still any function which exits is not testable.

I agree that being in an intermediate state where some return error and some still exit is not great, but I don't want to start cleaning up other code in this PR. Although it is still possible I might be forced to when I start writing test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. But it still looks so strange. We need to clean up the other code anyway, but may not in this PR. It's better to add a TODO here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

}

func setGlobalVars() {
func setGlobalVars() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also avoid this return value.

@gregwebs
Copy link
Contributor Author

I am still hoping for agreement that this PR can be merged once the quality is good enough. Once agreed, I can write tests (write down any specific test cases you have in mind here).

Copy link
Member

@jackysp jackysp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jackysp
Copy link
Member

jackysp commented Oct 14, 2018

It seems there is no need to modify the ddl package, so I remove the label component/DDL-need-LGT3.

@gregwebs
Copy link
Contributor Author

I will remove the command line option to change the default charset. We need to just switch the default to utf8mb4, which is the correct MySQL terminology.

@winkyao
Copy link
Contributor

winkyao commented Oct 19, 2018

Maybe we should merge #7965 first?

expression/builtin_control.go Show resolved Hide resolved
@@ -81,19 +80,19 @@ func inferType4ControlFuncs(lhs, rhs *types.FieldType) *types.FieldType {
}
}
if types.IsNonBinaryStr(lhs) && !types.IsBinaryStr(rhs) {
resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = charset.CharsetUTF8, charset.CollationUTF8, 0
resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = mysql.DefaultCharset, mysql.DefaultCollationName, 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change to mysql.DefaultCharset, mysql.DefaultCollationName. This type is string, so the charset is definite utf8mb4(after my pr merged.) and my PR will change here to charset.CharsetUTF8MB4, charset.CollationUTF8MB4

@@ -557,7 +557,7 @@ func (s *testEvaluatorSuite) TestConv(c *C) {
tp := f.GetType()
c.Assert(tp.Tp, Equals, mysql.TypeVarString)
c.Assert(tp.Charset, Equals, charset.CharsetUTF8)
c.Assert(tp.Collate, Equals, charset.CharsetUTF8)
c.Assert(tp.Collate, Equals, charset.CollationUTF8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My pr also fixed it... :)

@winkyao
Copy link
Contributor

winkyao commented Oct 22, 2018

@gregwebs I will support change any charset to utf8 or utf8mb4 later, I think it will solve your problem.

if mysql.HasBinaryFlag(lhs.Flag) || !types.IsNonBinaryStr(rhs) {
resultFieldType.Flag |= mysql.BinaryFlag
}
} else if types.IsNonBinaryStr(rhs) && !types.IsBinaryStr(lhs) {
resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = charset.CharsetUTF8, charset.CollationUTF8, 0
resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = mysql.DefaultCharset, mysql.DefaultCollationName, 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Allow setting of the default collation on the CLI.
This allows setting -collation-server utf8mb4_unicode_ci,
which is not correct, but could enhance compatibility
@winkyao
Copy link
Contributor

winkyao commented Nov 12, 2018

@gregwebs #7965 was merged, you can update your PR now :).

@gregwebs
Copy link
Contributor Author

@winkyao should we go forward with this PR? We decided that we don't need to let the user set the charset at startup if we just default to utf8mb4.
However, setting the default collation to match MySQL (utf8mb4_general_ci) could still be useful for compatibility purposes.

@gregwebs
Copy link
Contributor Author

I am just going to close this for now. We can revisit this if we have a concrete use case for collation support, etc.

@gregwebs gregwebs closed this Nov 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/LGT1 Indicates that a PR has LGTM 1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants