Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cardano-testnet] Fix flaky test workspace cleanup and node port allocation #5875

Merged
merged 1 commit into from
Jul 31, 2024

Conversation

carbolymer
Copy link
Contributor

@carbolymer carbolymer commented Jun 5, 2024

Description

This PR fixes testnet startup nondeterministic issues:

  1. workspace cleanup for successful tests: sometimes the directory could not get removed. Fixed by waiting and retrying.
  2. on MacOS, when starting node with a random port number we were receiving "port in use" error, due to holding to the port by darwin kernel, resulting in port being stuck in TIME_WAIT state. Fixed by waiting and retrying.
  3. stemming from nr 2: the operating system was also holding the lock to the stderr log file, resulting in node start retry failure. Fixed by additional manual closing of the file handle and using a different name of the log file.

Checklist

  • Commit sequence broadly makes sense and commits have useful messages
  • New tests are added if needed and existing tests are updated. These may include:
    • golden tests
    • property tests
    • roundtrip tests
    • integration tests
      See Runnings tests for more details
  • Any changes are noted in the CHANGELOG.md for affected package
  • The version bounds in .cabal files are updated
  • CI passes. See note on CI. The following CI checks are required:
    • Code is linted with hlint. See .github/workflows/check-hlint.yml to get the hlint version
    • Code is formatted with stylish-haskell. See .github/workflows/stylish-haskell.yml to get the stylish-haskell version
    • Code builds on Linux, MacOS and Windows for ghc-8.10.7 and ghc-9.2.7
  • Self-reviewed the diff

Note on CI

If your PR is from a fork, the necessary CI jobs won't trigger automatically for security reasons.
You will need to get someone with write privileges. Please contact IOG node developers to do this
for you.

@carbolymer carbolymer requested a review from a team as a code owner June 5, 2024 22:20
@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch from 0cb4c03 to a6a8f55 Compare June 6, 2024 09:58
@carbolymer
Copy link
Contributor Author

carbolymer commented Jun 6, 2024

Uhm, seems fix does not work. Still failing with:

               ┏━━ test/cardano-testnet-test/Cardano/Testnet/Test/SanityCheck.hs ━━━
            41 ┃ hprop_ledger_events_sanity_check :: Property
            42 ┃ hprop_ledger_events_sanity_check = integrationWorkspace "ledger-events-sanity-check" $ \tempAbsBasePath' -> H.runWithDefaultWatchdog_ $ do
               ┃ │ Workspace: /private/tmp/tmp.01xnrEP0ti/ledger-events-sanity-check-test-4333b7de3f5722d3
               ┃ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
               ┃ │ ━━━ Exception (IOException) ━━━
               ┃ │ /private/tmp/tmp.01xnrEP0ti/ledger-events-sanity-check-test-4333b7de3f5722d3/logs: removePathForcibly:removePathForcibly:removeDirectory: unsatisfied constraints (Directory not empty)
            43 ┃   -- Start a local test net
            44 ┃   conf <- mkConf tempAbsBasePath'

Most likely because epoch state logging gets in the way and gets cleaned up in MonadResouce after the workspace cleanup code.

@carbolymer carbolymer marked this pull request as draft June 6, 2024 11:07
@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch from d2aac0a to 75979d6 Compare June 13, 2024 10:56
@Jimbo4350 Jimbo4350 force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch from f340160 to 9432f8d Compare June 14, 2024 09:30
@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch 2 times, most recently from 69dc77a to f0d1ae9 Compare June 19, 2024 13:34
@carbolymer carbolymer changed the title Fix flaky test workspace cleanup [cardano-testnet] Fix flaky test workspace cleanup and node port allocation Jun 19, 2024
@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch 2 times, most recently from 84ae07d to 3ec2905 Compare June 19, 2024 14:59
@carbolymer
Copy link
Contributor Author

carbolymer commented Jul 16, 2024

Failures because of slow port closing after bind when reserving port numbers:

             324 ┃     let keysWithPorts = L.zip3 [1..] poolKeys portNumbers
            325 ┃     ePoolNodes <- H.forConcurrently keysWithPorts $ \(i, key, (portReleaseKey, port)) -> do
            326 ┃       let nodeName = mkNodeName i
            327 ┃           keyDir = tmpAbsPath </> poolKeyDir i
            328 ┃       H.note_ $ "Node name: " <> nodeName
                ┃       │ Node name: pool1
                ┃       │ Node name: pool2
                ┃       │ Node name: pool3
            329 ┃       eRuntime <- runExceptT $
            330 ┃         startNode (TmpAbsolutePath tmpAbsPath) nodeName testnetDefaultIpv4Address port (Just portReleaseKey) testnetMagic
            331 ┃           [ "run"
            332 ┃           , "--config", unFile configurationFile
            333 ┃           , "--topology", keyDir </> "topology.json"
            334 ┃           , "--database-path", keyDir </> "db"
            335 ┃           , "--shelley-kes-key", keyDir </> "kes.skey"
            336 ┃           , "--shelley-vrf-key", keyDir </> "vrf.skey"
            337 ┃           , "--byron-delegation-certificate", keyDir </> "byron-delegation.cert"
            338 ┃           , "--byron-signing-key", keyDir </> "byron-delegate.key"
            339 ┃           , "--shelley-operational-certificate", keyDir </> "opcert.cert"
            340 ┃           ]
            341 ┃       pure $ flip PoolNode key <$> eRuntime
            342 ┃ 
            343 ┃     let (failedNodes, poolNodes) = partitionEithers ePoolNodes
            344 ┃     unless (null failedNodes) $ do
            345 ┃       H.noteShow_ . vsep $ prettyError <$> failedNodes
                ┃       │ Cardano node process did not start: 
                ┃       │ cardano-node: DiffusionError Network.Socket.bind: resource busy (Address already in use)
                ┃       │ Cardano node process did not start: 
                ┃       │ cardano-node: DiffusionError Network.Socket.bind: resource busy (Address already in use)
            346 ┃       H.failure
                ┃       ^^^^^^^^^

This happens on slow machines, like macos runners.

@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch 2 times, most recently from 5b6dc81 to c3f5ec6 Compare July 18, 2024 18:15
@carbolymer
Copy link
Contributor Author

carbolymer commented Jul 18, 2024

New failure after fixing the previous two:

File error: /private/tmp/tmp.WyCYpKhEZj/propose-new-constitution-spo-test-52485b463ccb6b99/logs/pool2/stderr.log: openFile: resource busy (file is locked)
              

@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch 10 times, most recently from 813f7c0 to a63f19e Compare July 29, 2024 07:50
@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch from a63f19e to 479504b Compare July 29, 2024 15:27
@carbolymer carbolymer marked this pull request as ready for review July 29, 2024 17:31
Copy link
Contributor

@Jimbo4350 Jimbo4350 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👍 . One suggestion surrounding exponential backoff. waitForPortClosed will loop forever; we should probably fail at some point.

writeIORef lastResult isOpen
when isOpen $ do
-- repeat when port open
MT.threadDelay interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on using an exponential back off so this doesn't loop forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wrapped in timeout, so it will be interrupted eventually. I'll use here retrying instead. It will simplify the code.

H.runFinallies $ workspace (workspaceName <> "-" <> show i) f

-- | Create a workspace directory which will exist for at least the duration of
-- the supplied block.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by "supplied block"?

Copy link
Contributor Author

@carbolymer carbolymer Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the comment doesn't make much sense. I've copied haddock as-is from https://hackage.haskell.org/package/hedgehog-extras-0.6.5.0/docs/Hedgehog-Extras-Test-Base.html#v:workspace
I'll upstream it after merging this PR. I'll fix it in hedgehog-extras

f ws
when (os /= "mingw32" && maybeKeepWorkspace /= Just "1") $ do
-- try to delete the directory 5 times, 100ms apart
let retryPolicy = R.constantDelay 100_000 <> R.limitRetries 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another spot for exponential backoff? I see retry has the exponentialBackoff combinator :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we will gain much with exponential backoff. The operation is not expensive so we're not using CPU usage or anything.

I'd say exponential backoff would make sense in resource intensive operations, like network communication.

Right h -> pure h
Left e
-- give up after 1000 attempts
| n >= 1000 -> throwE e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential spot for exponential backoff

Copy link
Contributor Author

@carbolymer carbolymer Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we're not retrying anything here. I'm assuming if the file exists, it'll be locked for a long time, and just try to create a file with the next name.

@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch from 479504b to d8956c8 Compare July 30, 2024 13:43
@carbolymer carbolymer force-pushed the mgalazyn/test/fix-flaky-test-cleanup branch from d8956c8 to 53a3f65 Compare July 30, 2024 13:44
@carbolymer carbolymer disabled auto-merge July 31, 2024 08:50
@carbolymer carbolymer merged commit 8829321 into master Jul 31, 2024
21 of 24 checks passed
@carbolymer carbolymer deleted the mgalazyn/test/fix-flaky-test-cleanup branch July 31, 2024 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants