
Unity's spinlocks are bullying the Garbage Collector, and Refunding is not helping the situation. #21

Closed
Lisias opened this issue Jun 23, 2021 · 12 comments
Assignees
Labels
Not My Fault Not my Fault! I'm innocent! :) task Not a problem, but something I must do and should not forget about
Milestone

Comments

@Lisias
Contributor

Lisias commented Jun 23, 2021

Launch this craft with Recall installed.

See hell breaking loose over the poor Garbage Collector's head.

(attached image: bully-scholarship-edition-thumb04)

(There's also a memory leak happening in 3rd-party code, completely unrelated to Recall)

@Lisias Lisias changed the title Possible Memory Leak involving KSP-Recall somehow. Refunding is bullying the Garbage Collector. A lot! Jun 23, 2021
@Lisias Lisias self-assigned this Jun 23, 2021
@Lisias Lisias added the bug Something isn't working label Jun 23, 2021
@Lisias
Contributor Author

Lisias commented Jun 23, 2021

Process Sample (macOS) of the 261 in progress:

Sample of KSP.txt

Note the many threads at 100% CPU in semaphore_wait_trap (they did spinlocks on a bytecode virtual machine? really???), and the GC thread also being chewed at 100% in GC_header_cache_miss and GC_generic_malloc.

@Lisias Lisias added task Not a problem, but something I must do and should not forget about and removed bug Something isn't working labels Jun 23, 2021
@Lisias
Contributor Author

Lisias commented Jun 23, 2021

This issue is not a bug on Refunding. The inherent problem is a terrible decision made by Unity to use spinlocks instead of mutexes in memory-related critical sections of code.

This also explains a lot of different problems KSP has been suffering from for a long time.

See these posts on the forum:

@Lisias Lisias added this to the 0.2.0.1 milestone Jun 23, 2021
@Lisias Lisias changed the title Refunding is bullying the Garbage Collector. A lot! Unity's spinlocks are bullying the Garbage Collector, and Refunding is not helping on the situation. Jun 23, 2021
@Lisias
Contributor Author

Lisias commented Jun 23, 2021

Renaming the issue to properly address the situation.

KSP-Recall's Refunding is, indeed, abusing the GC a bit - but this is not what's driving the process into a deadlock. At worst, it would cause some stuttering while launching the craft.

But since fixing Unity is beyond the scope of KSP-Recall, I will have to tackle this somehow in Refunding, perhaps hindering Refunding itself a bit (it will depend on whether I manage to keep the critical code inside FixedUpdate, or have to move it to Update and risk it not being called).

@mgalyean

Is the spinlock at the mono level and perhaps tunable via mono environment variables? Ideally there would be a way to tell it to use mutexes instead of spinlocks for the GC at least, but this is very unlikely to be doable. There are a lot of GC env var options but I don't recall seeing anything specific to how the locking works skimming the vars

@Lisias
Contributor Author

Lisias commented Jun 23, 2021

Is the spinlock at the mono level and perhaps tunable via mono environment variables? Ideally there would be a way to tell it to use mutexes instead of spinlocks for the GC at least, but this is very unlikely to be doable. There are a lot of GC env var options but I don't recall seeing anything specific to how the locking works skimming the vars

Nope, they are in the C code:

    1279 Thread_2729384: Finalizer
    + 1279 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff73d0040d]
    +   1279 _pthread_start  (in libsystem_pthread.dylib) + 66  [0x7fff73d04249]
    +     1279 _pthread_body  (in libsystem_pthread.dylib) + 126  [0x7fff73d012eb]
    +       1279 GC_start_routine  (in libmonobdwgc-2.0.dylib) + 28  [0x115be9987]
    +         1279 GC_inner_start_routine  (in libmonobdwgc-2.0.dylib) + 90  [0x115be99f2]
    +           1279 start_wrapper  (in libmonobdwgc-2.0.dylib) + 652  [0x115b75b1b]
    +             1279 finalizer_thread  (in libmonobdwgc-2.0.dylib) + 671  [0x115baee76]
    +               1279 semaphore_wait_trap  (in libsystem_kernel.dylib) + 10  [0x7fff73c42256]

https://github.com/Unity-Technologies/bdwgc/tree/6e8db07c8d2e45fe60016a95c343912e038e5e9d

But even if we had a way to limit this using some configuration option, we would still have to co-exist with the rest of the Operating System.

The only (and I really mean only) place where spinlocks are safe to use is on embedded systems, where you have FULL control of every piece of the hardware and so can define in advance how many threads the whole system will have.

@mgalyean

So the only option would be a purpose made c(++?) library on an OS specific basis with a spinlock API that actually did mutexes under the hood...and the KSP env would have to be setup so mono could only see that flavor of the library. And if the library is compiled into the runtime then there is nothing that can be done without binary editing/redirecting in the runtime.

@Lisias
Contributor Author

Lisias commented Jun 23, 2021

So the only option would be a purpose made c(++?) library on an OS specific basis with a spinlock API that actually did mutexes under the hood...and the KSP env would have to be setup so mono could only see that flavor of the library. And if the library is compiled into the runtime then there is nothing that can be done without binary editing/redirecting in the runtime.

One size does not fit all.

You need a specialised library for every OS, exploiting its strengths and avoiding its weaknesses.

Lisias added a commit that referenced this issue Jun 27, 2021
…bage collecting (due the destroyance of the unneeded resources).

Works around #21 .
@Lisias
Contributor Author

Lisias commented Jun 27, 2021

I reworked the code to prevent removing a Refunding resource once it's created. It solved the deadlocks; the process gets only 13 to 15% of CPU load (an i7 with 8 threads -- max CPU load is 800%).

However, the memory continued to go through the roof. With that 1,000-part craft from the OP, my machine just can't launch it - the process gets to 32GB of allocated memory, with 30GB of compressed memory and about 2GB of "Real Memory".

Technically, this solved the issue: by avoiding destroying the Resource, I prevented the need for GC and, so, the spinlocks didn't fire. But the memory continued to blow through the roof. :(

Code on commit 9b74455

@Lisias
Contributor Author

Lisias commented Jun 27, 2021

The changes on commit 9b74455 do the following:

  1. Create the resource on demand.
  2. this.part.Resources.Add it.
  3. Eventually this.part.Resources.Remove it when appropriate, to prevent the fake resource from screwing up parts where some add'ons (or even KSP) don't expect any resource. Stackable parts also get the fake resource removed whether it's needed or not, as I didn't manage to support them properly.
  4. The resource is this.part.Resources.Add'ed back when there is an expectation that the craft could be recovered.

The fake resource is not destroyed anymore - it is saved in a private variable and constantly reused.

So, Refunding is not responsible for the memory leak - it's only the trigger. The memory leak is happening in the this.part.Resources.Add call!!!

IT'S A KSP BUG. Yeah, another one.

This is royally screwing up the hack - I just can't shove the fake resource into every part via Module Manager, as this would screw up add'ons that rely on some parts having NO resources. Stackable parts are just one of the problems I need to prevent.

But if I add them at runtime, lots and lots of memory are allocated for no apparent reason - 93% of the memory ends up compressed; only 7% of the process memory is really in use at any given moment. This strongly hints at a pretty nasty memory leak in the Resources.Add code.

@Lisias
Contributor Author

Lisias commented Jun 27, 2021

Flagging this as "Not my fault".

Refunding was not as efficient as possible, but the code was working fine (and given my time constraints at that time, I'm not ashamed of it).

Refunding is now as efficient as possible (besides wasting some memory, as the fake resource is now only destroyed when the part is destroyed), which kinda "fixes" the problem described in this issue.

However, the memory leak persists. It will be tackled (or not....) in issue #23.

@Lisias
Contributor Author

Lisias commented Jun 27, 2021

Ping @mgalyean .

@Lisias
Contributor Author

Lisias commented Jun 27, 2021

Closing this, as the code was successful in preventing the spinlock madness from being triggered. (The madness is still there, ready to be triggered by other innocent code, but there's only so much I can do about it.)

@Lisias Lisias closed this as completed Jun 27, 2021