Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend System.Guid with a new creation API for v7 #103658

Closed
tannergooding opened this issue Jun 18, 2024 · 73 comments · Fixed by #104124
Closed

Extend System.Guid with a new creation API for v7 #103658

tannergooding opened this issue Jun 18, 2024 · 73 comments · Fixed by #104124
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Milestone

Comments

@tannergooding
Copy link
Member

Rationale

The UUID specification (https://datatracker.ietf.org/doc/rfc9562) defines several different UUID versions which can be created and which allow developers to produce and consume UUIDs that have a particular structure.

As such, we should expose helpers to allow creating such UUIDs. For .NET 9, the set of potential versions is detailed below, but only UUIDv7 is proposed for the time being.

As per the UUID spec, GUID is a valid alternative name and so the name of the new APIs remains NewGuid* for consistency with our existing APIs:

This specification defines UUIDs (Universally Unique IDentifiers) -- also known as GUIDs (Globally Unique IDentifiers)

API Proposal

namespace System;

public partial struct Guid
{
    // Alternatively: MaxValue -or- AllBitsSet
    public static Guid Max { get; }

    public int Variant { get; }
    public int Version { get; }

    // v1
    //     60-bit timestamp 100ns ticks since 00:00:00.00, 15 October 1582 UTC
    //     14-bit clock sequence, intended to be random once in the lifetime of system
    //     48-bit node identifier, as in IEEE 802 Node IDs

    // v2
    //     DCE Security UUIDs

    // v3
    //     128-bit MD5 of namespaceId + name converted to canonical octet sequence
    //     6-bits then replaced with the version/variant fields

    // v4
    //     Already exposed as NewGuid()

    // v5
    //     192-bit SHA1 of namespaceId + name converted to canonical octet sequence, taking the most significant 128-bits
    //     6-bits then replaced with the version/variant fields

    // v6
    //     60-bit timestamp 100ns ticks since 00:00:00.00, 15 October 1582 UTC
    //     14-bit clock sequence, intended to be random for every new UUID, can be compatible with v1
    //     48-bit node identifier, intended to be random for every new UUID, can be compatible with v1

    // v7
    //     48-bit timestamp millisecond ticks since Unix Epoch
    //     12-bits of random data -or- submillisecond timestamp (M3)
    //         M3 - Scale remaining precision to fit into the 12-bits at even intervals
    //     62-bits of random data -or- carefully seeded counter (M1 or M2)
    //         M1 - Dedicated counter, simply increment per UUID created within a given tick stamp
    //         M2 - Monotonic random, seed random then increment by random amount within a given tick stamp
    public static Guid NewGuidv7(); // uses DateTime.UtcNow
    public static Guid NewGuidv7(DateTime timestamp);

    // v8
    //     48-bits of custom data
    //     12-bits of custom data
    //     62-bits of custom data
}
@tannergooding tannergooding added area-System.Runtime blocking Marks issues that we want to fast track in order to unblock other important work api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jun 18, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jun 18, 2024
@julealgon
Copy link

@tannergooding shouldn't it be cased GuidV7 instead of Guidv7?

On another note, what about having the method take an enum parameter speficying the version (instead of having all the New method variations)?

Could then have something like:

var guid = Guid.NewGuid(GuidVersion.Version7);

Which would look much cleaner with:

of course:

Guid guid = NewGuid(Version7);

@tannergooding
Copy link
Member Author

shouldn't it be cased GuidV7 instead of Guidv7?

I'll let API review decide if I got it "wrong" or not. I prefer the look of Guidv7 personally.

On another note, what about having the method take an enum parameter speficying the version (instead of having all the New method variations)?

This doesn't work because there are unique parameters/overloads per version. i.e. v7 takes a DateTime while v5 does not (it would presumably take a string namespace, string name, ignoring the name/keyword conflict)

@tannergooding tannergooding added this to the 9.0.0 milestone Jun 18, 2024
@teo-tsirpanis teo-tsirpanis removed the untriaged New issue has not been triaged by the area owner label Jun 18, 2024
@Ilchert
Copy link

Ilchert commented Jun 19, 2024

Hello @tannergooding, a few questions:

  1. How to handle Local and Unspecified date time conversion to Unix milliseconds?
  2. Maybe add DateTimeOffset overload?
  3. Please look at provided code, how to handle comparation of guid? Should BCL provide Guidv7Comparer for proper sorting/comparation?
var t1 = DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506).DateTime; // 11/02/2005 20:02:37
var t2 = DateTimeOffset.FromUnixTimeMilliseconds(0x020203040505).DateTime; // 16/12/2039 15:56:25
var g1 = Create(t1);
var g2 = Create(t2);

Console.WriteLine(g1); // 03040506-0102-0000-0000-000000000000
Console.WriteLine(g2); // 03040505-0202-0000-0000-000000000000
Console.WriteLine(g1 < g2); // False
Console.ReadKey();

static Guid Create(DateTime dateTime)
{
    var dto = new DateTimeOffset(dateTime); // handle Local and Unspecified
    var ms = (ulong)dto.ToUnixTimeMilliseconds();
    ms &= (1ul << 49) - 1;
    Span<byte> data = stackalloc byte[16];
    BinaryPrimitives.WriteUInt64LittleEndian(data, ms);
    return new Guid(data);
}

@tannergooding
Copy link
Member Author

tannergooding commented Jun 19, 2024

How to handle Local and Unspecified date time conversion to Unix milliseconds?

That's largely an implementation detail and is not really relevant to the API proposal. DateTime tracks enough information to know what point of time it represents, the DateTimeOffset constructor knows how to get a correct offset for local and utc. It treats unspecified the same as local, by design.

Should BCL provide Guidv7Comparer for proper sorting/comparation?

There is no such thing as "proper" sorting/comparison, that is there is no formal definition of how to compare UUIDs. The way .NET does it is to treat it effectively as an unsigned 128-bit integer represented in hex form. The output string is already in big endian format.

The algorithm you've used to create the Guid in the sample code is notably incorrect, however. The actual definition is as follows, where the RFC lists it with most significant bytes first and in terms of simple octets:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           unix_ts_ms                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          unix_ts_ms           |  ver  |       rand_a          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |var|                        rand_b                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            rand_b                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The internal layout used by Guid, which is allowed to differ as per the spec, is then:

  • int32 a
  • int16 b
  • int16 c
  • int8 d, e, f, g, h, i, j, k

Given a 48-bit Unix Timestamp of 0x010203040506, the Guid string should always appear as 01020304-0506-7???-8???-???????????? (where the 8??? is [8000, BFFF]). Thus, you need to store the 48-bit timestamp as:

_a = (int)(msSinceUnixEpoch >> 16); // store the most significant 4-bytes of the timestamp
_b = (short)(msSinceUnixEpoch); // store the least significant 2-bytes of the timestamp

// Fill c, d, e, f, g, h, i, j, and k with random bits

_c = (short)(_c & ~0xF000) | 0x7000; // Set the version to 7
_d = (byte)(_d & ~0xC0) | 0x80; // Set the variant to 2

This ensures that the Unix timestamps are already effectively sorted. The exception is for the random data that exists for two UUIDs created in the same "Unix tick". The spec allows for these random bits to contain better structed data, however that is extended functionality and can be provided at a later point in time if/when there is enough ask for it. I'd guess that we'd likely do that via an additional overload that looks something like Guid NewGuidv7(DateTime timestamp, bool trackSubmillisecond, long counter)

Maybe add DateTimeOffset overload?

It can be discussed in API review, but isn't really necessary to get the core functionality supported. It is also likely a far stretch from what the typical use-case would be.

The actual implementation is likely going to be effectively: long msSinceUnixEpoch = (long)((timestamp.ToUniversalTime() - DateTime.UnixEpoch).TotalMilliseconds), which is about as cheap as you can get the actual computation work.

@Ilchert
Copy link

Ilchert commented Jun 19, 2024

Given a 48-bit Unix Timestamp of 0x010203040506, the Guid string should always appear as 01020304-0506-7???-8???-???????????? (where the 8??? is [8000, BFFF]).

Thanks for the explanation, you will perform some additional operation to make string, sorting and binary big-endian representation consistent.

var t1 = DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506).DateTime; // 11/02/2005 20:02:37
var t2 = DateTimeOffset.FromUnixTimeMilliseconds(0x020203040505).DateTime; // 16/12/2039 15:56:25
var g1 = Create(t1);
var g2 = Create(t2);

Console.WriteLine(g1); // 01020304-0506-7000-8000-000000000000
Console.WriteLine(g2); // 02020304-0505-7000-8000-000000000000
Console.WriteLine(g1 < g2); // true

Console.WriteLine(Convert.ToHexString(g1.ToByteArray(true))); // 01020304_0506_7000_8000000000000000
Console.WriteLine(Convert.ToHexString(g2.ToByteArray(true))); // 02020304_0505_7000_8000000000000000

Console.ReadKey();

static Guid Create(DateTime dateTime)
{
    var dto = new DateTimeOffset(dateTime); // handle Local and Unspecified
    var msSinceUnixEpoch = (ulong)dto.ToUnixTimeMilliseconds();

    var a = (int)(msSinceUnixEpoch >> 16); // store the most significant 4-bytes of the timestamp
    var b = unchecked((short)(msSinceUnixEpoch)); // store the least significant 2-bytes of the timestamp

    var c = (short)0x7000; // Set the version to 7
    var d = (byte)0x80; // Set the variant to 2

    return new Guid(a, b, c, d, 0, 0, 0, 0, 0, 0, 0);
}

@tannergooding
Copy link
Member Author

you will perform some additional operation to make string, sorting and binary big-endian representation consistent.

There's not really anything "additional" to do here.

0x00, 0x00, 0x01, 0x23 (big endian) and 0x23, 0x01, 0x00, 0x00 (little endian) both represent the same value 0x123 which will always be less than 0x124 (which would be 0x00, 0x00, 0x01, 0x24 as big endian and 0x24, 0x01, 0x00, 0x00 as little endian)

Which is to say, the underlying storage format doesn't matter except for when it applies to serialization/deserialization. The actual value stored is what matters and is what is used in the context of doing operations such as ToString or CompareTo.

You can use any storage format you'd like, provided that the underlying operations understand how to interpret it as the actual value. Different platforms then use different formats typically based on what is the most efficient or convenient (hence why most CPUs natively use little-endian and why networking typically use big-endian). .NET happens to use a format for Guid that is compatible with the Microsoft _GUID structure and which historically worked well for COM and other scenarios but that doesn't make the actual value it represents any different.

@huoyaoyuan
Copy link
Member

How to handle Local and Unspecified date time conversion to Unix milliseconds?

That's largely an implementation detail and is not really relevant to the API proposal. DateTime tracks enough information to know what point of time it represents, the DateTimeOffset constructor knows how to get a correct offset for local and utc. It treats unspecified the same as local, by design.

DateTimeOffset is actually more "correct" about timestamp. It's tolerant from changes in local time zone, including DST changing. However it also has more overhead.

@vanbukin
Copy link
Contributor

vanbukin commented Jun 20, 2024

There is no such thing as Guidv7, v2, v8 or any other v-something. There's Uuid and there's Guid, which Microsoft developed. They're literally different structures. Uuid is 16 consecutive bytes. For Uuid, there's only one way to roundtrip from binary to string representation and back. Guid is a structure of the same length, but with a specific layout (int, short, short, byte, byte, byte..byte) and an API that "masks" its layout. Because of this, there are 2 ways to roundtrip from binary to string representation and back. We all know this perfectly well, this topic was previously discussed in #86084 and #86798.

Guid does not exist outside of technologies related to Microsoft in one way or another, or technologies that it has had a hand in, to some extent.
Uuid, on the other hand, is a universally accepted standard that has those very versions, variants, etc., which determine what exact values should be written at certain places within the Uuid.

Since there's no Uuid in BCL (and @tannergooding specifically insisted that no Uuid should be introduced, closing #86084), the existing ecosystem is forced to use Guid as a container for Uuid. The only safe way to do this is to rely solely on the string representation.

And now let's look at how this (doesn't) work in the real world.

RFC 9562, 6.13. DBMS and Database Considerations

For many applications, such as databases, storing UUIDs as text is unnecessarily verbose, requiring 288 bits to represent 128-bit UUID values. Thus, where feasible, UUIDs SHOULD be stored within database applications as the underlying 128-bit binary value.
For other systems, UUIDs MAY be stored in binary form or as text, as appropriate. The trade-offs to both approaches are as follows:

  • Storing in binary form requires less space and may result in faster data access.
  • Storing as text requires more space but may require less translation if the resulting text form is to be used after retrieval, which may make it simpler to implement.

Using a specialized data type provided by a specific RDBMS in conjunction with a particular way of generating Uuid can increase data access speed. This is precisely the reason for Uuidv7's existence.

Uuidv7 stores the number of milliseconds that have passed since the start of Unix time in the first 48 bits (unix_ts_ms), followed by 4 bits for the version (ver), 12 bits for the first random part (rand_a), 2 bits for the variant (var), and 62 bits for rand_b. It's important that in Section 6.2 (Method 3) the use of part of rand_a to store the time-based part is allowed in order to increase time precision (from millisecond to sub-millisecond), thereby bringing the time-based part up to 60 bits. This allows for the use of unix_ts_ms and rand_a to store the number of 100-nanosecond intervals that have passed since the start of the unix-epoch, specifically - Ticks (overflow will occur on June 18, 5623 at 9:21 UTC - this is far enough in the future to directly use ticks in the time-based part when generating Uuidv7).

An important point is that the time-based part is stored in big-endian. Since RDBMS typically indexes binary data from left to right, this method of generation ensures monotonically increasing values, just like an integer counter. This allows for maintaining low levels of index fragmentation, fast search, and constant insertion time.

And now, having finished with the introductory part, let's dive into the peculiarities of how popular RDBMS and their .NET drivers work with Uuid and Guid (which is used as a lousy transport for Uuid).

Welcome to hell.

Uuidv7

Let's write a simple function for generating Uuidv7, which will use the fields unix_ts_ms and rand_a to store the number of ticks since the start of the Unix epoch.

static string GenerateUuidV7()
{
    Span<byte> uuidv7 = stackalloc byte[16];
    ulong unixTimeTicks = (ulong)DateTimeOffset.UtcNow.Subtract(DateTimeOffset.UnixEpoch).Ticks;
    ulong unixTsMs = (unixTimeTicks & 0x0FFFFFFFFFFFF000) << 4;
    ulong unixTsMsVer = unixTsMs | 0b0111UL << 12;
    ulong randA = unixTimeTicks & 0x0000000000000FFF;
    // merge "unix_ts_ms", "ver" and "rand_a"
    ulong hi = unixTsMsVer | randA;
    BinaryPrimitives.WriteUInt64BigEndian(uuidv7, hi);
    // fill "rand_b" and "var"
    RandomNumberGenerator.Fill(uuidv7[8..]);
    // set "var"
    byte varOctet = uuidv7[8];
    varOctet = (byte)(varOctet & 0b00111111);
    varOctet = (byte)(varOctet | 0b10111111);
    uuidv7[8] = varOctet;
    return Convert.ToHexString(uuidv7);
}

The hexadecimal representation is used here as the base because for Guid, only the string representation is considered valid.
This string can be passed to the Guid constructor, and when calling ToString, we will get the same value.

PostgreSQL

God bless the developers of PostgreSQL and Npgsql.
This is the only database and driver where everything works without any problems.
We take the string representation of Uuidv7, pass it to the Guid constructor, write the Guid to the database (uuid column), and obtain the records in the same order in which they were written.
Without modifying the connection string or any strange behavior.
It just works.

MySQL

It can only use binary(16) or varbinary(16) because it does not have a dedicated data type for storing Uuid.

Okay, let's test how it works in practice.

docker run --name mysql -e MYSQL_ROOT_PASSWORD=root -p 3306:3306 --cpus=2 --memory=1G --rm -it mysql:latest

Somehow connect to the server and create a database and its schema there.

CREATE DATABASE `dotnet`;

And after selecting our newly created database:

CREATE TABLE `uuids`
(
    `uuid` BINARY(16) NOT NULL PRIMARY KEY,
    `order` BIGINT NOT NULL
);

Let's write a simple program that generates a Uuidv7, inserts its value along with an ordinal number (for validation of sorting).
Annnd... it doesn't work.
Because out of the box, you can't write a Guid to binary(16).

This happens because the MySQL driver has a connection string parameter with a default value of Char36.
To make everything work correctly, you need to add the connection string parameter GUID Format=Binary16.

If we set the values to TimeSwapBinary16 or LittleEndianBinary16, everything will also work (for a while).

However, if you execute

SELECT * FROM uuids ORDER BY uuid ASC;

You will notice that for TimeSwapBinary16 and LittleEndianBinary16 values, the order in the order column is NOT sequential! Due to this, the data will become fragmented, resulting in degraded insertion time as the number of records in the table increases.

I would like to remind you that we are passing absolutely correct Uuidv7 (according to the specification) values as parameters, using Guid as a container for the value.

Okay, let's generate Uuidv7 and insert them into the database, using various GUID Format values, and construct graphs for visualizing the process. After all, everyone loves graphs.

For this, I wrote a pair of small programs:

  • The first one generates Uuidv7 and inserts its value (using Guid) along with its sequential order number (the order in which it was generated) into the database.
  • The second one reads all entries from the database using the SQL query: SELECT uuid, order FROM uuids ORDER BY uuid ASC; then it calculates the distribution statistics of records between the actual order number row (in which order it was read) and the one recorded during generation, grouping the deviations in buckets of 100k records.

And here are the results by insertion time:

mysql-inserts-compare

And here is the deviation:

mysql-deviation-all

The denser the points are to the left part and the higher the values there, the more such values resemble a monotonically increasing sequence.

Notably, when using Binary(16), the maximum deviation of the order in which the record was read from the order number at recording is 1. And this is easy to explain.

I ran BenchmarkDotNet and saw the following picture.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.302
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-FTFTEX : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Server=True

| Method         | Mean     | Error    | StdDev   | Gen0   | Allocated |
|--------------- |---------:|---------:|---------:|-------:|----------:|
| GenerateUuidV7 | 63.96 ns | 0.654 ns | 0.612 ns | 0.0001 |      88 B |

This means that sometimes 2 Uuids are generated within a 100-nanosecond interval (one Tick), and the random part of the second one outpaces the first.

Thus, Uuidv7 + Guid as transport + GUID Format=Binary16 in the connection string parameter give us the expected behavior. Only in this combination do we have a truly monotonic increasing sequence, which behaves like an auto-incrementing integer counter.

However, such visualization does not allow for a clear assessment of the situation for LittleEndianBinary16 and TimeSwapBinary16. Let's remove Binary16 and look at the deviation graph again without it:

mysql-deviation-timeswap-and-little-endian

It is seen that when using LittleEndianBinary16, values still resemble sequential ones (this is also noticeable by the insertion time), but nevertheless, they have dispersion, and with a large amount of data, the degradation of the insertion performance will still be there.

In the case of TimeSwapBinary16, everything is as bad as possible.
Press F to pay respects to a table with such a primary key or unique index.

Microsoft SQL Server

The final boss of the next Doom will be the uniqueidentifier.
This is the quintessence of obscure technologies multiplied by outright poor engineering decisions.

Let's start by generating a Uuidv7 with a correct text representation in the form of Guid, insert them into the database, and measure the insertion time.

We get the following graph:

mssql-inserts-raw

We see a significant slowdown in insertion over time. We observed a similar pattern with MySQL.

And here is the deviation:

mssql-deviation-only-raw

Let's check the state of our index:

SELECT * FROM sys.dm_db_index_physical_stats(db_id('dotnet'), object_id('uuids'), NULL, NULL, NULL);

and we see that avg_fragmentation_in_percent = 99.25588285567774
Our table is as bad as it can possibly be.

From here we dive into obscure technologies.
This behavior occurs because the uniqueidentifier has its own sort order. But there is no documentation on this order. There is only one single description on the entire internet in an MSDN article from 2006, which has already been deleted.
Fortunately, we have the internet archive and we can see what was written there:

More technically, we look at bytes {10 to 15} first, then {8-9}, then {6-7}, then {4-5}, and lastly {0 to 3}.

But this is insufficient.

This is where poor engineering decisions come into play, in the form of Guid's internal layout.
This is because Microsoft.Data.SqlClient uses the ToByteArray() call on System.Guid.
The ToByteArray() call simply dumps the internal contents of Guid into a byte array.

Now we multiply obscure technologies by poor engineering decisions.
As a result, in order to write Uuidv7 to a table column of type uniqueidentifier and to get a monotonically increasing sequence in return, we need to make 2 permutations.

This is done as follows (the code is written as simply as possible so that anyone can understand what's going on):

string ReorderUuid(string uuid)
{
    var src = Convert.FromHexString(uuid);
    var dst = new byte[16];
    // reorder for SQL SERVER Sort order
    dst[0] = src[12];
    dst[1] = src[13];
    dst[2] = src[14];
    dst[3] = src[15];
    dst[4] = src[10];
    dst[5] = src[11];
    dst[6] = src[8];
    dst[7] = src[9];
    dst[8] = src[6];
    dst[9] = src[7];
    dst[10] = src[0];
    dst[11] = src[1];
    dst[12] = src[2];
    dst[13] = src[3];
    dst[14] = src[4];
    dst[15] = src[5];
    // reorder for guid internal layout
    var tmp0 = dst[0];
    var tmp1 = dst[1];
    var tmp2 = dst[2];
    var tmp3 = dst[3];
    dst[0] = tmp3;
    dst[1] = tmp2;
    dst[2] = tmp1;
    dst[3] = tmp0;
    var tmp4 = dst[4];
    var tmp5 = dst[5];
    dst[4] = tmp5;
    dst[5] = tmp4;
    var tmp6 = dst[6];
    var tmp7 = dst[7];
    dst[6] = tmp7;
    dst[7] = tmp6;
    return Convert.ToHexString(dst);
}

And if we construct a Guid in the following way new Guid(ReorderUuid(GenerateUuidV7())) and write it to the database by passing it as a parameter to the INSERT query, only in this case will we actually get a monotonically increasing sequence.

We get the following picture for insertion time:

mssql-inserts-compare

And for deviation:

mssql-deviation-compare

The maximum deviation is 1 (the reason is the same - generating 2 values in 1 Tick).
The avg_fragmentation_in_percent value is 0.6696610861576153 (compared to 99.25588285567774 for insertion without reorder).

BINARY(16)

Despite the specification explicitly requiring the use of a specialized data type for storing UUIDs in the database, not everyone in the real world follows these recommendations. Or simply, a project could have started when there were neither recommendations nor .NET (let alone Core, even Framework). And databases may simply not have had a specialized type for working with UUIDs.

This is where binary(16) comes into play. The catch is that neither Npgsql nor Microsoft.Data.SqlClient allow the use of Guid as a parameter for values of this type.

Explicit conversion to a byte array on the calling side is required, or the parameter value from uuid / uniqueidentifier should be converted to binary(16) on the database side in the SQL query itself.

In case of binary(16), all three described RDBMS use the same sorting order - the value is interpreted as big endian as a whole.
When it comes to a public API returning a Uuidv7 wrapped in Guid - to transform such a Guid into a byte array you will need to call only ToByteArray(bigEndian: true), and construct only through new Guid(bytes, bigEndian: true).

Only such combination of APIs will ensure a correct roundtrip of values and a monotonically increasing sequence of values at the database level.

Summary

In the case of a Guid as a container for Uuidv7 and using a specialized type on the database side:

  • MySql: it is mandatory to pass a specific value in the connection string (GUID Format=Binary16)
  • PostgreSQL: everything will work out of the box
  • MS SQL Server: manual rearrangement of content will be required to keep the database from breaking down

In the case of a Guid as a container for Uuidv7 and using binary(16) on the database side:

  • MySql: depending on the parameter value in the connection string, you are required to either pass a specific value in the connection string (GUID Format=Binary16), or convert Guid to byte array and back only using ToByteArray(bigEndian: true) and new Guid(bytes, bigEndian: true).
  • PostgreSQL and MS SQL Server: convert Guid to byte array and back only using ToByteArray(bigEndian: true) and new Guid(bytes, bigEndian: true).

Do we need this?

@tannergooding, given everything listed above, I have a question

Whose problems and how exactly will this API solve?

At the moment, it appears to be a feature only for PostgreSQL and MySql users (under certain conditions).
If we generate Guidv7 (Uuidv7) optimized for the use of uniqueidentifier in MS SQL Server, this will firstly violate the specification (because neither the string nor the binary representation will correspond to the uuidv7 described in the specification), and secondly, it will automatically lead to performance degradation in other databases.
Reordering will be necessary in any case.
The only question is - in which scenarios: when working with Microsoft SQL Server or when working with PostgreSQL / MySQL.

IMHO: it should not be added. It's a minefield. This will definitely do more harm than good.

@tannergooding
Copy link
Member Author

Whose problems and how exactly will this API solve?

The multitude of users, both internal and external, that have asked for such an API to exist and which are currently using 3rd party packages that are doing effectively what this API will be providing.

Since last year, #88290 was opened explicitly asking for this on Guid with users continuing to come back and ask for this to be supported. This got renewed interest last month due to the new version of RFC 9562 having been published/finalized. It also has explicitly highlighted, including in new comments from the community, the problems that a custom Uuid represents and how it hurts integration with the ecosystem.

There are a multitude of NuGet packages that provide this as well, most of which (particularly UUIDNext which has 482k downloads) are explicitly doing so over System.Guid: https://www.nuget.org/packages?q=uuidv7&includeComputedFrameworks=true&prerel=true&sortby=relevance


I am not interested in getting to another elongated discussion around the pros vs cons of using System.Guid to represent this information.

GUIDs are not a Microsoft specific concept, they are an alternative name for UUIDs and that is explicitly covered in the official RFC 9562 in the very first sentence

This specification defines UUIDs (Universally Unique IDentifiers) -- also known as GUIDs (Globally Unique IDentifiers) -- and a Uniform Resource Name namespace for UUIDs.

The RFC further discusses layout and how GUIDs underlying values are represented and that this is distinct from the concept of saving it to binary format. The basic quote is as below, but I have given consistent in depth analysis and additional citations as replies other threads where this has been asked:

Saving UUIDs to binary format is done by sequencing all fields in big-endian format. However, there is a known caveat that Microsoft's Component Object Model (COM) GUIDs leverage little-endian when saving GUIDs. The discussion of this (see [MS_COM_GUID]) is outside the scope of this specification.

System.Guid is not a binary format, it is a strong type that represents a UUID, otherwise known as a GUID. It has the capability to serialize to a binary format and for the developer to pick whether that binary format should be big endian or little endian as per the needs of the consuming context.

Citing worst case performance characteristics is likewise not the correct basis for deciding whether or not a feature is suitable. We do not provide worst case or naive implementations of core APIs. We provide optimized routines that efficiently handle the data and which try to do a lesser number of operations where possible. Serializing to a big endian binary format requires up to 3x "byte swap" operations, which can be emitted as part of the general storage code. If it were found to be a significant performance bottleneck in real world code, we could optimize this further to be a single instruction handling the byte swap for all 8 bytes that need it simultaneously.

@ImoutoChan
Copy link

There are a multitude of NuGet packages that provide this as well, most of which (particularly UUIDNext which has 482k downloads) are explicitly doing so over System.Guid: https://www.nuget.org/packages?q=uuidv7&includeComputedFrameworks=true&prerel=true&sortby=relevance

This is a good example because this library actually creates "GUID-like" UUIDv7 by accepting the database where it would be used:
Guid sequentialUuid = Uuid.NewDatabaseFriendly(Database.SQLite); // from readme

Why is that? It's because GUID fails to represent UUID in a single format, and each database needs its own representation of UUIDv7 within a GUID to function correctly.

@vanbukin
Copy link
Contributor

@tannergooding

GUIDs are not a Microsoft specific concept

This is not true. Guid is a structure that appeared for COM/OLE. In its original form, with its layout and all subsequent advantages and disadvantages, it was invented by Microsoft and is used in the Microsoft technology stack. Literally all other languages use Uuid, which from the perspective of the public API is either 16 bytes in big endian, or a string, the format of which is defined in the specification.

We provide optimized routines that efficiently handle the data and which try to do a lesser number of operations where possible.

I am 100% sure that the .NET runtime team can implement the fastest and most optimized Uuidv7 generation algorithm in the world. But I did not raise the issue of generation performance. I highlighted the topic of what happens to such a Uuid AFTER generation. When it is used as an identifier in a database that will live there for 50 years.

What happens next - when it starts living its own life?
That's the topic I was referring to. And I looked at this question through the prism of databases.

Uuidv7 is needed in order to NOT fragment indexes in databases. This is why it exists.

And from the perspective of PostgreSQL or MySQL (under certain conditions), a Uuidv7, packed into a Guid, will be written to the database in such a way that it won't cause index fragmentation.

In case of using Microsoft SQL Server and writing such a Uuidv7, packed into a Guid, into a column of type uniqueidentifier - index fragmentation will occur. For such a scenario, the Uuidv7 generation algorithm described in the RFC is not suitable (due to the specific sort order and the use by the SQL Server driver of an API that was originally intended for COM). A Uuidv7 with a changed byte order is required. Only in this case will the database get the benefits for which Uuidv7 was created. If the database does not receive them, then such a generation algorithm for Microsoft SQL Server makes no sense.

When changing the connection string in MySQL, data begins to be written to the database in a not optimal way. This leads to the loss of all benefits from such a generation algorithm. I would characterize the benefit of such an algorithm for MySQL as positive under a certain driver configuration.

For PostgreSQL, everything is fine.

So we have a situation, where the proposed API will generate Uuidv7, which when written to databases will have the following characteristics:

  • Microsoft SQL Server - no sense
  • PostgreSQL - makes sense
  • MySQL - makes sense under a certain configuration of the database driver.

As @ImoutoChan rightly noted, UUIDNext contains an API for generating Uuids, packed into Guids, that are specific to each database and its driver.

For the API to generate Uuidv7, packed into a Guid to make sense, it is necessary to specify for which database it is generated. But this is not something that can be added to the BCL.

So if the Microsoft platform can't make an algorithm that would work equally well for both the Microsoft-developed database and all other databases - then maybe it's not worth adding such a method at all and leave the implementations to the community?

It seems to have been doing pretty well all these years.

@tannergooding
Copy link
Member Author

A UUID of f81d4fae-7dec-11d0-a765-00a0c91e6bf6 is always exactly f81d4fae-7dec-11d0-a765-00a0c91e6bf6 regardless of whether it is stored in big-endian format, little-endian format, or some arbitrary other format with say even/odd bytes swapped.

This is no different than the value 2 is always the value 2, regardless of whether it is an 8-bit integer 0x02, a little-endian 16-bit integer 0x02, 0x00, a big-endian 16-bit integer 0x00, 0x02, a 32-bit integer, a 64-bit integer, a 3-bit integer, etc.

Binary serialization and deserialization is fully independent of the value represented at runtime in the named type. If you have a destination that requires the data to be stored in a particular binary format, you should explicitly use the APIs which ensure that the data is serialized as that format and ensure the inverse APIs are used when loading from that format.

The underlying storage format used by the type at runtime is fully independent of the value represented by the UUID. It is not safe to rely on and is not something that is observable to the user outside of unsafe code. The RFC itself clearly dictates that GUID is an alternative name for UUID and that the spec lists it as a ordered set of 16 octets in big-endian format for simplicity and that it is the expected format for the variant 2 UUIDs discussed by the spec when saving to a binary format. It explicitly calls out the fact that the standard Microsoft format used by COM defaults to the inverse behavior (a little-endian storage format) and that it is out of scope of the spec to describe how to handle that. .NET explicitly states that it can be handled using the dedicated APIs to specify you would like to load or store the data as big-endian.

At this point, you appear to be explicitly ignoring how the code actually works, the considerations that actually exist, and what the official UUID specification actually calls out. That is not productive, it does not assist anyone, and it is blatantly pushing the discussion in a direction that makes it appear as though this is not a viable solution when there is no actual difference in how System.Guid is used by any database system provided you are correctly saving and loading using the appropriate endianness convention as required by the underlying database system. In all the examples you've listed so far, this is a requirement to save and restore as big-endian format, so if you decide call NewGuidv7() and then do not appropriately call TryWriteBytes(destination, bigEndian: true) when storing and new Guid(source, bigEndian: true) when loading, it is a bug in your code and it is your responsibility to fix it. The same would be true if you were reading data from a PE Header file and did not appropriately use ReadInt32BigEndian when reading data from the linker header; or if you failed to do the same when processing or creating network packets, etc.

It is purely a consideration of serialization.

@vanbukin
Copy link
Contributor

Okay. Let's imagine a situation where such an API was added. Let's go through a User Story.

I, as an ordinary .NET developer, install the .NET 9 SDK, see that a new method NewGuidv7() has appeared in Guid.
I go to read about what a Guidv7 is, find the specification.
I start using it with Microsoft SQL Server.
I use it as a primary key or in columns with a unique index.
And I get avg_fragmentation_in_percent = 99.

And my friend, who uses this API for generating Guids and writing them to PostgreSQL, where values are stored in a column of the uuid type - everything is perfect.
No fragmentation, excellent insertion.

At the same time, we both use specialized data types (as required by the specification) and pass the Guid as a query parameter without any preliminary conversion.
Both of us have code like this:

await using var cmd = connection.CreateCommand();
cmd.CommandText = "INSERT INTO someTable (id, payload) VALUES (@id, @payload);";
cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());
cmd.Parameters.AddWithValue("payload", payload);
await cmd.ExecuteNonQueryAsync();

So it turns out that when I use Microsoft technology (.NET) with a recently added API in combination with a database driver developed by Microsoft and an RDBMS developed by Microsoft - I don't get the absence of fragmentation.

But when I use an OpenSource database (PostgreSQL) with an OpenSource driver (Npgsql) for this database, I do get it.

@tannergooding
Copy link
Member Author

tannergooding commented Jun 20, 2024

The problem would be caused by failure to properly serialize the data as big endian and therefore not storing the UUIDv7 that was generated, but rather a different GUID instead

It is not an issue with the Guid type nor an issue with how the underlying v7 UUID is generated or stored at runtime. It is solely an issue of the developer who with failed to serialize/deserialize the data in the format expected by PostgreSQL

@vanbukin
Copy link
Contributor

You seem like you're not reading what I'm writing.
In PostgreSQL everything is just fine.
Everything goes to hell in combination with Microsoft SQL Server.

@vanbukin
Copy link
Contributor

If I had a separate data type for Uuid, I could oblige the Microsoft SQL Server driver team to do automatic binary representation conversion at the driver level when writing and reading, so that it would be written into uniqueidentifier with the understanding that the value in Uuid is in big-endian, while the database sorting order is different. And implement the transformation like:

dst[0] = src[12]; 
dst[1] = src[13]; 
dst[2] = src[14]; 
dst[3] = src[15]; 
dst[4] = src[10]; 
dst[5] = src[11]; 
dst[6] = src[8]; 
dst[7] = src[9]; 
dst[8] = src[6]; 
dst[9] = src[7]; 
dst[10] = src[0]; 
dst[11] = src[1]; 
dst[12] = src[2]; 
dst[13] = src[3]; 
dst[14] = src[4]; 
dst[15] = src[5]; 

at the driver level.

Likewise, the MySQL and PostgreSQL driver teams could easily adapt such a type.
But instead, we're being suggested to introduce even more workarounds, like what happened with bigEndian: true.

@tannergooding
Copy link
Member Author

tannergooding commented Jun 20, 2024

The same statement holds true in the inverse, I had merely misunderstood which of the two had the problem.

It is fundamentally the fault of the developer for not serializing/deserializing in the format expected by the database. If the database expects little endian format, you must use TryWriteBytes(destination, isBigEndian: false). If the database expects big endian format, you must use TryWriteBytes(Destination, isBigEndian: true)

The actual underlying storage format used by the Guid struct doesn't matter. We could choose to change it to be ulong _lower; ulong _upper, we could choose to change it to be fixed byte _data[16], we could choose to change it to be UInt128 _value, etc. It is never safe to assume the underlying data structure and there is already no guarantee the raw bytes are in little-endian order, as the data will be stored (internally) in big-endian order on a big-endian machine (such as the IBM z9). The internal storage format is how it is today because that was the most convenient throughout the history of .NET.

@vanbukin
Copy link
Contributor

vanbukin commented Jun 20, 2024

The problem isn't about the binary representation.

It's about how the existing ecosystem of RDBMS drivers works with the Guid data type.
Because developers feed Guid into the driver. They do this either directly (through ADO) or indirectly (through Dapper, EF Core, or any other ORM, which in turn passes the value to ADO driver without changes).

This already exists and is already "somehow" working. And you can't change it without breaking a huge amount of code.

Due to how it works with Guid - different database drivers require different workarounds:

  • PostgreSQL does not need workarounds.
  • MySQL requires a certain parameter in the connection string.
  • Microsoft SQL Server, for uniqueidentifier, needs to generate a Guid and reshuffle its contents in such a way that it has the byte order in which the database itself sorts data internally.

Therefore, the presence of such an API might create a misconception in the minds of developers.
That it's a silver bullet that allows you to generate IDs on the client-side and write them to the database without fragmenting indices.
And this is indeed the case, but you can't just call the new API and feed the generated Guid into the driver.
You need to know which database you're working with and what workarounds are needed for its driver.

With the current proposed implementation, this will be a feature for PostgreSQL and MySQL (remember about the parameter). But those who use Microsoft SQL Server need to know that their database and its driver require mandatory reshuffling to get a "Uuidv7 that doesn't ruin indexes" (remember it needs 2 reshuffles, one to compensate for ordering at the database level, another to compensate for using the COM-intended Guid API at the database driver level. Yes, they can be collapsed into one, but anyway). Because the Guid that the new API will generate can't be provided to the driver in its unchanged form - there will be index fragmentation.

@vanbukin
Copy link
Contributor

@tannergooding

I expect synergy between Microsoft products. With the current state of affairs, there will be none.

If we go down the path of problem-solving, there are two ways. Either introduce a new data type or fix the Microsoft SQL Server driver. We will put aside the first option for reasons we both know. The second can be implemented in two ways - either through a breaking change in the driver, or a feature toggle. A breaking change is not an option, so let's consider the second one. It can be implemented in different ways - through environment variables or connection string parameters. And it seems that in this case, it becomes a problem of the database driver.

But if you add the proposed API BEFORE the Microsoft SQL Server driver has support for "alternative Guid handling mode", it turns out that you will roll out a feature that does not synergize with your own database.

As a developer, I expect that such a line of code:

cmd.Parameters.AddWithValue("id", Guid.NewGuidv7()); 

will generate identifiers for me that, when written to a database into a column of the uniqueidentifier type, will have an index that won't be fragmented (as stated in the specification). And it's the responsibility of the ADO driver to do something with the Guid to achieve such an effect.

Therefore, I suggest that you, as the author, create an issue in the Microsoft SQL Server driver repository, and discuss the possibility of adding support for "native Uuid" at the database driver level through some alternative mode toggle mechanism, or in some other way.

@tannergooding
Copy link
Member Author

But if you add the proposed API BEFORE

Developers already can create a valid v7 UUID using System.Guid in many different ways, including parsing a string, passing in the individual fields to the constructor, or by reading a byte sequence (big or little endian) that some other piece of code created.

This API proposal changes nothing with that regard, it simply gives developers an easier way to generate a v7 UUID that will serialize as expected if stored using bigEndian: true or if converting to a string using ToString.

Therefore, I suggest that you, as the author, create an issue in the Microsoft SQL Server driver repository, and discuss the possibility of adding support for "native Uuid" at the database driver level through some alternative mode toggle mechanism, or in some other way.

.NET is producing correct UUIDs that serialize as expected when using the relevant APIs such as ToString (in which case there are multiple ways to separate the bytes but they are always in big-endian format) or TryCopyTo (in which case you decide if they should be serialized in bigEndian or littleEndian format). It is up to downstream components to consume System.Guid correctly using these APIs, just as they would be required to consume int, long, or Int128 (all of which have the same general endianness considerations).

If there is a scenario that you believe is not covered by the downstream component, then you as the interested party should be the one to file the feature request and to correctly articulate the problem space you believe exists and to optionally provide input as to how you believe it should be resolved.

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 21, 2024

@vanbukin Have you filed any issues for your complaints about the Microsoft.Data.SqlClient performance? I regularly contribute performance improvements but if no-one tells me about them how would I know what needs improving to help your codebase?

@vanbukin
Copy link
Contributor

vanbukin commented Jun 21, 2024

@Wraith2 There's no problem with the driver's performance itself. The driver simply takes Guid as input, does something with it, and forms bytes, which are sent to Microsoft SQL Server over the TDS protocol. The problem is that when a Guidv7 formed through the proposed API gets into a database in a column of the uniqueidentifier type - there will not be a monotonically increasing sequence. I conducted a small study, made measurements, plotted graphs. You can read more about it here.

To make Guid.NewGuidv7() produce a sequence that would be optimal for index building specifically when working with Microsoft SQL Server, it's necessary to reshuffle the bytes. And there are two reasons to do this.

  1. The order in which Microsoft SQL Server sorts uniqueidentifier (and consequently builds indexes, which is clearly seen by the dispersion of values, the fragmentation index indication, and insertion slowdown)
  2. The order in which bytes are located in Guid (because the driver uses ToByteArryay() and absolutely disregards bigEndian: true and something else there).
    If we do not compensate for this with reshuffling, then when writing Guidv7 to uniqueidentifier, we will get complete index fragmentation. And Uuidv7, described in the RFC, was basically created for indexing optimization.

The irony of the situation is that Microsoft develops both .NET itself, the proposed API, the driver, and the database.
However, a construct like

cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());

will write data, the index of which will be completely fragmented right from the start.

@vanbukin
Copy link
Contributor

Developers already can create a valid v7 UUID using System.Guid in many different ways, including parsing a string, passing in the individual fields to the constructor, or by reading a byte sequence (big or little endian) that some other piece of code created.

That's right. And the BCL can only provide one single way to do this, in order to meet the RFC. Yet, this option will not work with your own products right now. It's absurd.

This API proposal changes nothing with that regard, it simply gives developers an easier way to generate a v7 UUID that will serialize as expected if stored using bigEndian: true or if converting to a string using ToString.

The driver developed by Microsoft for its own database is not doing this right now. And it doesn't even have any options to change this behavior. You propose to create an API that won't work as expected. The very generation of Uuidv7, without regard to how it will be indexed by the database, is devoid of meaning. Because Uuidv7 was created to be optimized when inserted into the database. It doesn't matter who's to blame - the API of Guid, the driver developers, the creators of the TDS protocol, or the database developers. As a consumer, what matters to me is that it doesn't work as it should.

And this will only happen when I use the combination of Microsoft .NET together with Microsoft SQL Server. However, if I take PostgreSQL and its driver, I will have no problems.

So what am I paying for?

@vanbukin
Copy link
Contributor

cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());

This line of code will give me proper indexes in PostgreSQL.
With the correct setup through connection string parameters, it will provide me with normal indexes in MySQL.
In Microsoft SQL Server, it will NEVER give me normal indexes.

I have to write my own function to rearrange the internals of the Guid and call it before each insertion.

Yes, you will be following the RFC.

But there is zero synergy between your products.

@jodydonetti
Copy link
Contributor

jodydonetti commented Jul 13, 2024

Hi @tannergooding , thanks for sharing so much juicy info.

When all the details will be settled (are they already?) I think a blogpost on devblogs with a recap of all of this would be great!

@jscarle
Copy link

jscarle commented Jul 21, 2024

For something like v7 where some bits are not random but rather seeded based on some input state, the chance of collision differs. Assuming that you aren't changing the source timestamp provider, then there is a guarantee of no conflicts provided that you are calling CreateVersion7 no more than once per millisecond (at least for the next 10k years or so, but that's not something that really needs to be accounted for in practice). If you call it more frequently then there is 74-bits of randomly seeded data that tries to help ensure uniqueness, but it has the same fundamental consideration as v4 which is that there is a possibility (although an extremely minimal one) that you can end up with a conflict.

@tannergooding Really looking forward to using this in November! Thanks for your work on this.

I'm curious on what your thought are regarding Ulid, which implements the monotonic random counter as per the Ulid spec. The data structure between UUIDv7 and Ulid seems identical apart from a few bits reserved for the version.

I also wonder what your thoughts are when it comes to involving MAC addresses in the generation of these types of unique identifiers, like MassTransit does with their NewId.

@Timovzl
Copy link

Timovzl commented Jul 22, 2024

@jscarle Here are some thoughts on the use of MAC addresses.

A MAC address takes 48 bits. This eats off so much of the random portion that you lose the unpredictable property - IDs can become too guessible, especially once you have observed one single ID.

You'd also want to be very sure that a MAC address (or especially if you'd opt to use a partial MAC address) is unique, because as soon as you get a "collision" on those, you have two instances that continue to have a much increased chance of generating colliding IDs for their entire lifetime.

As with UUIDv4, the very large random portion of UUIDv7 is supposed to make the use of any "instance identifier" unnecessary.

@tannergooding
Copy link
Member Author

I also wonder what your thoughts are when it comes to involving MAC addresses in the generation of these types of unique identifiers, like MassTransit does with their NewId.

As per https://datatracker.ietf.org/doc/rfc9562/

Privacy and network security issues arise from using a Media
Access Control (MAC) address in the node field of UUIDv1.
Exposed MAC addresses can be used as an attack surface to locate
network interfaces and reveal various other information about
such machines (minimally, the manufacturer and, potentially,
other details). Additionally, with the advent of virtual
machines and containers, uniqueness of the MAC address is no
longer guaranteed.

I'm curious on what your thought are regarding Ulid, which implements the monotonic random counter as per the Ulid spec. The data structure between UUIDv7 and Ulid seems identical apart from a few bits reserved for the version.

At a glance the only difference between it and v7 is that the 6 bits otherwise required for the variant/version are no longer reserved and so there doesn't appear to be a significant benefit to it and using the more broadly standardized thing likely outweighs any minor benefit you might otherwise get from the 6 extra bits of randomness. -- UUIDv7 then of course also has the option to allow non-random data, instead of random data if the domain needs it which is not something ULID appears to allow.

@Timovzl
Copy link

Timovzl commented Jul 22, 2024

I was curious to know whether it was considered "generally safe" to use v7 IDs in such scenarios, and at what point (number of nodes) would one expect to see a non-trivial number of collisions.

@glen-84 I'll post some results at the bottom, but here is the approach to calculate this:

  • Only IDs with the same timestamp can collide, so everything that follows applies solely to reused timestamps.
  • The birthday paradox's approximation formula tells use that approximately ProbabilityOfCollision = (NoIdsGenerated)^2 / (2 * NoPossibleValues) = (NoIdsGenerated)^2 / (2 * 2^BitCount).
  • The expected number of repetitions of this experiment for which 1 collision occurs is 1 / ProbabilityOfCollision.
  • The expected number of IDs we have then generated per 1 collision is NoIdsGenerated * (1 / ProbabilityOfCollision).
  • For example, 1 million IDs generated on the same millisecond, repeatedly, with 74 bits of randomness leads to an expected 1 collision per 1000000 * (1/(1000000^2/(2*2^74)))) or roughly 38,000,000,000,000,000 (38 quadrillion) IDs generated.

Those are some fine numbers. There's still one issue. If we generate a handful of invoice lines for an invoice and insert them into a database, they'll have the same millisecond component followed by a random component, and end up getting shuffled! That's arguably astonishing. One way to solve it is to use smaller pseudorandom increments to the previous ID, for example by a 57-bit number.

The calculations get very interesting with this approach:

  • Since we still start each millisecond with a 74-bit random number, there is no difference to separate servers/instances each generating a single ID during a millisecond. The previous calculation applies. Let's call the scenario of each server generating one or very few IDs per millisecond Online Transaction Processing (OTP).
  • However, things change when each server generates a significant number of IDs per millisecond. Let's call this scenario batch processing.
  • Instead of a random value out of 2^74 possible values for each server, each server now "saturates" that range of values to a greater degree.
  • The calculation depends strongly on the degree of saturation, i.e. the number of IDs each server generates per millisecond.
  • The best case (with the fewest collisions) is 1 ID per server per millisecond. The worst case is where each sever fully saturates its 2^74 value range. We will calculate the worst case, bearing in mind that the reality is going to be far, far more optimistic.
  • A server starts with a random value of up to 2^74, which on average is 2^73. It then repeatedly adds a random value of up to 2^57, which on average is 2^56. This means that, on average, the range from 2^73 to 2^74 is divided into 2^73 / 2^56 = 131072 segments. Each segment has 2^56 possible values, out of which 1 is used on average.
  • In other words, under these worst-case, maximum-saturation conditions, for each ID, there are on average [number of simultaneously acting servers] IDs being generated in the same 2^56-sized segment that could collide.
  • We can apply the original calculation, except with 2^56 as the possible number of values and [number of simultaneously acting servers] as the number of IDs being generated.
  • Having 100K servers bulk generate IDs with maximum saturation on the same millisecond introduces a collision chance of 100000^2/(2*2^56) per 100K IDs generated.
  • We can expect one collision if we repeat that experiment 1 / (100000^2/(2*2^56)) times.
  • Generating 100K IDs per experiment, we expect to generate 100000 * (1 / (100000^2/(2*2^56))) or 1,400,000,000,000 (1.4 trillion) IDs with one single collision.
  • Generating the same number of IDs with lower saturation and a greater number of servers brings the calculation closer to the original, non-incrementing scenario and thus gives us a more optimistic result. (And spreading them out over multiple milliseconds yields a far more optimistic results.)
  • The reality is likely closer to the OTP scenario than to 100K IDs per server per millisecond (that's 100M per server per second). Our worst case is very, very pessimistic.

I suspect that both calculations offer some level of comfort. 😊

@gatapia
Copy link

gatapia commented Aug 15, 2024

Given that this is closed, this may not be the right place, but... Is it possible to parse the timestamp back from a ULID? in Guid v7? Will there be a Guid.ParseDateTime(Guid ulid) method? This will save adding a DateTime column in a database, etc, since that information is already included in the CreateV7(DateTime) ULID.

@rcollina
Copy link

rcollina commented Aug 15, 2024

Just to confirm, have @Timovzl’s considerations been taken into account?

I believe he mentioned some crucial V7 features necessary for actual distributed systems.

@osexpert
Copy link

osexpert commented Aug 15, 2024

Just to confirm, have @Timovzl’s considerations been taken into account?

I believe he mentioned some crucial V7 features necessary for actual distributed systems.

AFICK there is no such v7 features for distributed systems. For distributed systems, the best one (to prevent collisions) is version 4 (all random). v7 is about getting a better order/locality by replacing some of the random data with time and this will increase change of collisions in distributed systems, and there is nothing to do about that. Pick your poison:-)

The .net v7 implementation AFAIK uses a simple implementation where it take a v4 Guid, replace the 6 first bytes with the current unix epoc milliseconds and set version as 7. Plain and simple. It does not use rand_a to create a local sub sequence of time, it does not try to make it monotonic, it does not make sure that the newly generated v7 Guid is "greater" that the previously generated v7. But all these things are about improving local monotony. For distributed systems, its best with as much random data as possible (use rand_a for random data).

Long story short: the current .net implementation is the best for distributed systems. But it is the worst for local monotony. We can not have both and need to choose, and IMO .net made the right choice (all thou it may change in the future).

My 5cent.

@rcollina
Copy link

rcollina commented Aug 15, 2024

@osexpert

Just to confirm, have @Timovzl’s considerations been taken into account?
I believe he mentioned some crucial V7 features necessary for actual distributed systems.

AFICK there is no such v7 features for distributed systems. For distributed systems, the best one (to prevent collisions) is version 4 (all random). v7 is about getting a better order/locality by replacing some of the random data with time and this will increase change of collisions in distributed systems, and there is nothing to do about that. Pick your poison:-)

That was bad wording on my part - I stand corrected. I'm still curious if @Timovzl's feedback was considered all the same (reference: #103658 (comment)).

it does not try to make it monotonic, it does not make sure that the newly generated v7 Guid is "greater" that the previously generated v7. But all these things are about improving local monotony.

That's is why I'm not sure about the use of this new implementation in an actual system (let's leave out the distributed part).

@Timovzl says:

There's still one issue. If we generate a handful of invoice lines for an invoice and insert them into a database, they'll have the same millisecond component followed by a random component, and end up getting shuffled! That's arguably astonishing. One way to solve it is to use smaller pseudorandom increments to the previous ID, for example by a 57-bit number.

As it stands, unless the above scenario is considered, I for one would have no use for this new implementation - at least, none come to mind right now but I'm open to suggestions.

@osexpert
Copy link

osexpert commented Aug 15, 2024

@Timovzl says:

There's still one issue. If we generate a handful of invoice lines for an invoice and insert them into a database, they'll have the same millisecond component followed by a random component, and end up getting shuffled! That's arguably astonishing. One way to solve it is to use smaller pseudorandom increments to the previous ID, for example by a 57-bit number.

As it stands, unless the above scenario is considered, I for one would have no use for this new implementation - at least, none come to mind right now but I'm open to suggestions.

Monotony can be solved locally in the node, but then it would need to be implemented by the os/kernel itself, like Guid v1/v6 generators typically are. The .net implementation could attempt to make it monotonic, thou it would only be monotonic locally in that application/process. And even if monotonic in node, distributed monotony can not be guaranteed.

For batch generation, worst case order of .net v7 Guid's is the same as for v4 Guid's. But for all other uses, locally and distributed, .net v7 Guid's will be ordered as they were generated (good for locality/database indexes).

Maybe we can suggest a new method, eg. IEnumerable<Guid> CreateVersion7Sequence() that could help in the batch processing use case.

@matteocontrini
Copy link

matteocontrini commented Aug 15, 2024

it does not try to make it monotonic, it does not make sure that the newly generated v7 Guid is "greater" that the previously generated v7. But all these things are about improving local monotony.

That's is why I'm not sure about the use of this new implementation in an actual system (let's leave out the distributed part).

Could you clarify why you're saying that UUIDv7 can't be used if it's not monotonic? As far as I know it's by design that UUIDv7 is k-sorted and not monotonic, and that seems to be enough to work well with B-tree primary key indexes in databases, which is the primary use case.

In distributed systems where collisions are a concern you'll likely want to use Snowflake or variants of Snowflake, which are also usually k-sorted and not monotonic.

@rcollina
Copy link

@matteocontrini

Could you clarify why you're saying that UUIDv7 can't be used if it's not monotonic?

@Timovzl wrote:

There's still one issue. If we generate a handful of invoice lines for an invoice and insert them into a database, they'll have the same millisecond component followed by a random component, and end up getting shuffled! That's arguably astonishing. One way to solve it is to use smaller pseudorandom increments to the previous ID, for example by a 57-bit number.

I'd also wager to say that any reliable Transactional Outbox implementation, which is my use case, won't be able to use this v7 GUID, since keeping the insertion order of messages in any batch is paramount. Batches can be reordered, that wouldn't be a problem.

As far as I know it's by design that UUIDv7 is k-sorted and not monotonic

I'm going to have to use something else then :) Perhaps what @Timovzl implemented or https://github.com/Cysharp/Ulid are better suited for my needs.

@osexpert

Maybe we can suggest a new method, eg. IEnumerable CreateVersion7Sequence() that could help in the batch processing use case.

That could be useful.

@tannergooding
Copy link
Member Author

Maybe we can suggest a new method, eg. IEnumerable CreateVersion7Sequence() that could help in the batch processing use case.

As per the original proposal, there is room to expose additional Guid CreateVersion7(...) overloads that allow controlling the additional optional metadata, such as including submillisecond timestamps and/or providing the counter value:

    // v7
    //     48-bit timestamp millisecond ticks since Unix Epoch
    //     12-bits of random data -or- submillisecond timestamp (M3)
    //         M3 - Scale remaining precision to fit into the 12-bits at even intervals
    //     62-bits of random data -or- carefully seeded counter (M1 or M2)
    //         M1 - Dedicated counter, simply increment per UUID created within a given tick stamp
    //         M2 - Monotonic random, seed random then increment by random amount within a given tick stamp

Please feel free to open an API proposal covering these, I imagine the full overload might look something like (possibly with better parameter names or different parameter order):

public static Guid CreateVersion7(DateTime timestamp, ulong counter, bool submillisecondResolution);

@osexpert
Copy link

Maybe we can suggest a new method, eg. IEnumerable CreateVersion7Sequence() that could help in the batch processing use case.

As per the original proposal, there is room to expose additional Guid CreateVersion7(...) overloads that allow controlling the additional optional metadata, such as including submillisecond timestamps and/or providing the counter value:

    // v7
    //     48-bit timestamp millisecond ticks since Unix Epoch
    //     12-bits of random data -or- submillisecond timestamp (M3)
    //         M3 - Scale remaining precision to fit into the 12-bits at even intervals
    //     62-bits of random data -or- carefully seeded counter (M1 or M2)
    //         M1 - Dedicated counter, simply increment per UUID created within a given tick stamp
    //         M2 - Monotonic random, seed random then increment by random amount within a given tick stamp

Please feel free to open an API proposal covering these, I imagine the full overload might look something like (possibly with better parameter names or different parameter order):

public static Guid CreateVersion7(DateTime timestamp, ulong counter, bool submillisecondResolution);

Personally I am not missing such overloads (to specify the counter etc.) but maybe others do.

But back to my suggestion :-)
A way to get a sequence of monotonic v7 Guid's could be helpful and if such an api existed, it would be even less use for an api to specify counter etc. manually, because such api would IMO mainly be useful for code that want to make a monotonic sequence? So if .net provided this api directly, it would be best.

I would suggest something like (slightly adjusted from before):

// Create a monotonic sequence of Guid's
public static IEnumerable<Guid> CreateVersion7Sequence(TimeProvider timeProvider);

Maybe a better name could be CreateVersion7MonotonicSequence, a bit long thou.

The implementer (.net) can then choose how to make it monotonic: to increase the timestamp after rand_a rollover, to stall until the next millisecond when rand_a rollover, generate new random v7 Guid's until it randomly becomes greater than the previous etc.

Personally, I like this suggestion best (from the RFC):
"Implementations MAY use the following logic to ensure UUIDs featuring embedded counters are monotonic in nature:
Compare the current timestamp against the previously stored timestamp.
If the current timestamp is equal to the previous timestamp, increment the counter according to the desired method.If the current timestamp is greater than the previous timestamp, re-initialize the desired counter method to the new timestamp and generate new random bytes (if the bytes were frozen or being used as the seed for a monotonic counter)."
So, use rand_a as 12bit counter, initially seeded from Guid.NewGuid(), incrementing it +1 until > 4095, then increase timestamp_ms +1. Reseed 12bit counter (from Guid.NewGuid()), rinse and repeat.

I have made a prof of concept implementation for this api, in case of interest:-)
Like this comment if you want me to make an api proposal for this. If at least 10 like, I can think about it:-)

@Timovzl
Copy link

Timovzl commented Aug 19, 2024

It is my experience that without the following two properties, the usefulness and pit of success of a UUIDv7 implementation is severely limited:

  1. Intuitive in-process monotonicity, i.e. without complexity on the caller's end.
  2. Retention of the unpredictable property even when providing point 1.

As per the original proposal, there is room to expose additional Guid CreateVersion7(...) overloads that allow controlling the additional optional metadata, such as including submillisecond timestamps and/or providing the counter value

Sadly, providing a counter value is not a pit of success when it comes to point 1, and it also sacrifices point 2.

Maybe we can suggest a new method, eg. IEnumerable<Guid> CreateVersion7Sequence() that could help in the batch processing use case.

Requiring the user to do special things for batches is a huge sacrifice, and also not a pit of success. For example, consider the very valid use case of a DDD entity that self-validates in the constructor and is to be complete and reliable after construction. It makes sense for such an entity to populate its (readonly) ID property with a UUIDv7 during construction. If such entities are constructed in batch, the intuition is for them to get stored in creation order. It is far from ideal to have to (A) remember to do something special to get the intuitive result and (B) pollute the constructor with technical details to achieve it.

I have spent a lot of time on this topic, and my conclusions (as implemented) are:

  • Between distributed systems, there are always race conditions, during which there is no monotonicity.
  • Within each individual system, strict monotonicity can and should be automatically provided.
  • We can best achieve strict monotonicity without losing unpredictability by using large, pseudorandom increments on "reused" milliseconds. Think on the order of 57-bit increments for UUIDv7.
  • To uphold monotonicity, we should guard against backward clock adjustments. For example, we can guard against up to 1 second's worth of adjustment by permitting a sleep of up to 1 second. For example, a backward NTP adjustment of 2 ms would have us sleep for 2 ms if an ID was generated immediately after.
  • We can avoid sleeping by allowing the timestamp to be off by up to 1 second. For example, we can default to a timestamp 1 second in the past. This gives us room to creep forward by up to 1 second without ever touching any timestamps that are truly in the future. Whenever the clock is adjusted backwards, instead of sleeping, we encroach on this buffer space. As time overtakes us again, the buffer space regenerates. (The same technique can also be employed to allow greater "burst" generation before exhausting a timestamp, although at these high bit counts the generation rate is hardly a constraint.)

@rcollina
Copy link

It is my experience that without the following two properties, the usefulness and pit of success of a UUIDv7 implementation is severely limited

Wonderfully articulated, @Timovzl. I appreciate your input and apologize for tagging you multiple times, but I felt it was crucial to urgently reconsider your points.

Additionally, your analysis on collision probabilities in a distributed system was quite convincing. The calculations you shared in another post were impressive. Is it just me, or is there a possibility to enjoy the best of both worlds here? 😃

@tannergooding
Copy link
Member Author

A balance exists to exposing APIs and that includes covering the relevant set that allow users to roll their own appropriate implementation without adding every potential overload or consideration. Developers are responsible for writing code to handle and use a type in a manner that works best for their application. It is therefore not the responsibility of the BCL to expose "everything", but rather enough that allows developers to achieve success for common needs by providing foundational building blocks.

We also have the consideration of dependencies, and Guid doesn't exist at a layer where it can or should handle things like system clock adjustments. Not only would this add significant complexity to ensure it is valid, but it wouldn't satisfy many developers who have the want or need to use their own time provider. It doesn't compose or allow building something that works for each unique scenario. -- Even in the short discussion so far, there's been multiple suggestions on what is the "correct" way and while there have been common themes, it's clear that there isn't one right answer.

The right foundational building block is therefore something like:

public static Guid CreateVersion7(DateTimeOffset timestamp, ulong counter, bool submillisecondResolution);

This API "works" because it simply gives access to the 3 "keys" that represent the state of the UUIDv7, inline with the RFC allowances:

    // v7
    //     48-bit timestamp millisecond ticks since Unix Epoch
    //     12-bits of random data -or- submillisecond timestamp (M3)
    //         M3 - Scale remaining precision to fit into the 12-bits at even intervals
    //     62-bits of random data -or- carefully seeded counter (M1 or M2)
    //         M1 - Dedicated counter, simply increment per UUID created within a given tick stamp
    //         M2 - Monotonic random, seed random then increment by random amount within a given tick stamp

That is, you get:

  • the required 48-bit timestamp passed in via the DateTime timestamp
  • a bool submillisecondResolution that allows you to opt-in to M3
    • where true means you get 12-bits additional for timestamp, allowing submillisecond accuracy
    • where false means the data is seeded as random
  • a ulong counter that allows you to opt-in to M1 or M2
    • more accurately this gives precise control over the 62-bit field, allowing you to decide if it should be random or a counter that increments at your own rate (and this can therefore start at any value and increment by any amount, as appropriate for your domain/scenario)

Such an API therefore serves as a building block and helps trivialize building your own "create many GUIDs" API. It keeps all the control in the hands of the developer while removing some of the worst complexity in the typical case. What remains in complexity is then domain or scenario specific requirements, such as synchronizing and providing the base timestamp and ensuring that the counter changes are within the domain requirements (whether that be random, monotonic, unique, etc)

If the BCL were to provide some "create many GUIDs" API, it would end up being a fairly naive implementation that looked something like (haven't validated this entirely for correctness, mostly meant to be a rough example):

ulong previousUnixTs  = _unixTs;
ulong unixTs = GetUnixTs(DateTimeOffset.UtcNow, _submillisecondResolution);

ulong counter = _counter;

if (unixTs == previousUnixTs)
{
    counter += _increment;

    if (counter > 0x3FFFFFFF_FFFFFFFF)
    {
        counter -= 0x3FFFFFFF_FFFFFFFF;
    }
}
else
{
    counter = (ulong)Random.Shared.NextInt64(0x3FFFFFFF_FFFFFFFF);
    _increment = (ulong)Random.Shared.NextInt64(0x3FFFFFFF_FFFFFFFF);
}

Unsafe.SkipInit(out Guid result);

Unsafe.AsRef(in result._a) = (int)(unixTs >> 28);
Unsafe.AsRef(in result._b) = (short)(unixTs >> 12);
Unsafe.AsRef(in result._c) = (short)(unixTs & 0xFFF) | Version7Value;
Unsafe.As<byte, ulong>(ref Unsafe.AsRef(in result._d)) = counter | (Variant10xxValue << 56);

_current = result;
_counter = counter;
_unixTs = unixTs;

@jscarle
Copy link

jscarle commented Aug 21, 2024

public static Guid CreateVersion7(DateTimeOffset timestamp, ulong counter, bool submillisecondResolution);

I really like that overload, as it would allow anyone to toll their own implementation while remaining a Guid.

@Timovzl
Copy link

Timovzl commented Aug 21, 2024

@tannergooding Would it be possible to grant a bit more control over all the 74 random bytes, rather than just the last 62 of them?

For example, when doing 57-bit random increments (for which the collision resistance is arguably decent), if we have only 62 bits of space, that leaves only 62-57 = 5 bits to flow into. That would limit the generation rate to an insufficient 2^5 = 32 IDs per millisecond.

Is there a strong reason for the first 12 bits of randomness to be treated differently from the last 62? I'm guessing the reasons are ease of formulation (especially with regards to each of M1, M2, and M3), parameter simplicity (ulong), and ease of implementation (especially with regards to the variant indicator).

@Timovzl
Copy link

Timovzl commented Aug 21, 2024

To answer my own question from the RFC's perspective (section 5.7):

Alternatively, implementations MAY fill the 74 bits, jointly, with a combination of the following subfields, in this order from the most significant bits to the least, to guarantee additional monotonicity within a millisecond: [...]

@tannergooding
Copy link
Member Author

tannergooding commented Aug 21, 2024

If you really want full control, then there is already the new Guid(ReadOnlySpan<byte>) and new Guid(int, short, short, byte, byte, byte, byte, byte, byte, byte, byte) overloads.

The primary reason to differentiate the two from the perspective of the public constructor is that the RFC has a specific definition of those bits. They are either random or the represent an extension of the timestamp.

The RFC itself calls out an appropriate way to use the 62 random bits in a way that guarantees uniqueness while still keeping it random/unpredictable. Namely per timestamp (whether that is 48 or 60-bit timestamp) you generate a 62-bit random seed that serves as the basis. Then you pick a singular 62-bit random increment that is used for all UUIDs generated within that timestamp. When the timestamp changes, you then change the random seed that serves as the basis and the random increment used for that timestamp window.

Such a setup ensures it is sufficiently random (62-bits) to avoid predictability while also ensuring it is unique (2^62 unique values can be generated in that timestamp window) for a non-zero increment.

Of course, in practice you need much fewer than 62-bits because the 48-bit timestamp gives millisecond accuracy and the 60-bit timestamp gives you around 245 nanosecond accuracy. Even with the fastest computers today, you really only need up to 1_000_000 unique entries per millisecond to guarantee uniqueness while still generating 1 unique value per nanosecond (which is impractical/impossible just because you'll be saturated by memory bandwidth and other factors far earlier). So in practice, 20-bits per timestamp window is enough to guarantee something "unique" while still giving you 42-bits of truly random data and if done properly, those 20-bits are still effectively unpredictable to an external viewer, even though they've been guaranteed to be unique.

@osexpert
Copy link

I would (also) suggest a more generic overload, like:
public static Guid CreateVersion7(DateTimeOffset timestamp, ushort rand_a, ulong rand_b);
I do not completely understand this overload:
public static Guid CreateVersion7(DateTimeOffset timestamp, ulong counter, bool submillisecondResolution);
Is counter rand_b in this case? Does it make sense to combine submillisecondResolution = false with counter? Would this mean that rand_a would use random data and rand_b use counter? Because this would not create monotonic Guid's (rand_a's randomness would control the order, not the counter)

@tannergooding
Copy link
Member Author

Is counter rand_b in this case? Does it make sense to combine submillisecondResolution = false with counter? Would this mean that rand_a would use random data and rand_b use counter? Because this would not create monotonic Guid's (rand_a's randomness would control the order, not the counter)

It really depends on the scenario and what considerations are important. counter is not strictly replacing rand_b, but rather replacing the left-most bits following the timestamp to ensure monotonicity. -- A lot of these considerations are covered under 6.2. Monotonicity and Counters and other sections of the RFC.

A consumer of this method would simply need to ensure that counter contains the relevant ordered count and whether that needs to be a seeded random or not. It would then be placed into the appropriate bits.

@iSazonov
Copy link
Contributor

@tannergooding You have shared a lot of very useful information. For better discoverability it would be great to have it in the remarks section of this API and perhaps a separate blog post with the details. Many thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Projects
None yet
Development

Successfully merging a pull request may close this issue.