
8359827: Test runtime/Thread/ThreadCountLimit.java need loop increasing the limit #26401


Closed
wants to merge 11 commits into from

Conversation

sendaoYan
Member

@sendaoYan sendaoYan commented Jul 19, 2025

Hi all,

The test runtime/Thread/ThreadCountLimit.java has been observed to fail when run concurrently with other tests. The test starts a subprocess with the shell prefix command ulimit -u 4096, which is intended to limit the number of threads the subprocess can create. However, this causes the test to fail when it runs alongside other tests. I created a demo to demonstrate that.

I start a Java process that creates 5k threads, and then I cannot start a new Java process with the prefix ulimit -u 4096 on the same machine.

image

ManyThreads.java.txt
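
The content of the attached ManyThreads.java.txt is not shown in this thread. As a rough illustration only, a background process of that kind could look like the minimal sketch below. The class name `ManyThreadsSketch`, the default of 5000 threads, and the "wait for input" shutdown are assumptions made for this example, not the attachment itself.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the attached demo; not the original file.
final class ManyThreadsSketch {
    // Counted down once to let all parked threads finish.
    static final CountDownLatch RELEASE = new CountDownLatch(1);

    // Starts up to max daemon threads that park on RELEASE.
    // Returns how many threads were actually started.
    static int createThreads(int max) {
        int created = 0;
        try {
            for (; created < max; created++) {
                Thread t = new Thread(() -> {
                    try {
                        RELEASE.await();
                    } catch (InterruptedException ignored) {
                        // Exit quietly if interrupted.
                    }
                });
                t.setDaemon(true);
                t.start();
            }
        } catch (OutOfMemoryError e) {
            // HotSpot reports a failed native thread creation as OutOfMemoryError.
            System.out.println("cannot create more threads: " + e.getMessage());
        }
        return created;
    }

    public static void main(String[] args) throws IOException {
        int max = args.length > 0 ? Integer.parseInt(args[0]) : 5000;
        System.out.println("created " + createThreads(max) + " threads");
        // Keep the process (and its threads) alive until input or end of input arrives.
        System.in.read();
    }
}
```

With this running in one terminal, the scenario described above can be tried by launching a second JVM through bash -c "ulimit -u 4096 && java -version" in another terminal.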

So this test needs to run separately from other tests in order to pass.
The change has been verified locally; it is a test-only fix, so the risk is low.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8359827: Test runtime/Thread/ThreadCountLimit.java need loop increasing the limit (Enhancement - P4)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26401/head:pull/26401
$ git checkout pull/26401

Update a local copy of the PR:
$ git checkout pull/26401
$ git pull https://git.openjdk.org/jdk.git pull/26401/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26401

View PR using the GUI difftool:
$ git pr show -t 26401

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26401.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 19, 2025

👋 Welcome back syan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 19, 2025

@sendaoYan This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8359827: Test runtime/Thread/ThreadCountLimit.java need loop increasing the limit

Co-authored-by: David Holmes <[email protected]>
Reviewed-by: dholmes

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 50 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 19, 2025
@openjdk

openjdk bot commented Jul 19, 2025

@sendaoYan The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Jul 19, 2025

@dholmes-ora
Member

@sendaoYan exclusiveAccess.dirs does not work the way you expect/require. It simply indicates that only one test at a time may run from the given directory. It does not mean no other tests from any other directories may run.

@dholmes-ora
Member

FWIW we see no issue running this test, but we ensure we already have a high ulimit setting available in our test machines by default.

@sendaoYan
Member Author

sendaoYan commented Jul 21, 2025

FWIW we see no issue running this test, but we ensure we already have a high ulimit setting available in our test machines by default.

  1. Maybe this test has been excluded by TEST.groups
  2. The error reported by this test case should not be related to the ulimit configuration of the test environment, but it may be related to the number of CPU cores on the machine. On a machine with many CPU cores, each test case starts more GC threads and JIT threads, and jtreg's concurrency is also relatively high, so the total number of threads across all test cases can easily exceed 4096. For example, in the example below, ulimit -u in my environment is unlimited. I first start a background Java process, which starts 5000 threads and does not exit; then I use the shell prefix ulimit -u to start a Java process (similar to what this test case does), and I cannot start Java.
image

ManyThreads.java.txt

…t/hotspot/jtreg/resourcehogs/runtime/Thread/
@sendaoYan
Member Author

@sendaoYan exclusiveAccess.dirs does not work the way you expect/require. It simply indicates that only one test at a time may run from the given directory. It does not mean no other tests from any other directories may run.

Thanks for the correction, @dholmes-ora.
I have moved this test to test/hotspot/jtreg/resourcehogs, similar to JDK-8227645.

@dholmes-ora
Member

I first start a background Java process, which starts 5000 threads and does not exit; then I use the shell prefix ulimit -u to start a Java process (similar to what this test case does), and I cannot start Java.

Okay, but in that scenario what is it you are actually running out of?

You are changing the test to suit the way you need to run it, but I'm not aware of anyone else reporting issues running this test.

@sendaoYan
Member Author

sendaoYan commented Jul 21, 2025

Okay, but in that scenario what is it you are actually running out of?

I think it is running out of "user processes", which are limited by ulimit -u 4096: Java cannot start because the user processes allowed by ulimit -u are exhausted.
I created a small example in C to illustrate this problem. The create-thread program keeps creating threads until it can no longer create any, or until it has created 5000 threads.

  1. Running bash -c "ulimit -u 1000 && ./create-thread" shows that the create-thread program can create only about 580 threads;
image
  2. First start the create-thread program directly (in the background), and then run bash -c "ulimit -u 4096 && ./create-thread" to test how many threads a second create-thread process can create. The second create-thread process cannot create any threads. This shows that the max user processes count limited by ulimit -u 4096 includes the threads created by the first create-thread process.
    This C example shows that this test case is not suitable for running concurrently with other test cases; otherwise we may encounter the failure described in the issue.
image
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

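/* pthread start routines must take a single void * argument. */
void *thread_func(void *arg) {
    (void)arg;
    /* Keep the thread alive so it keeps counting against the limit. */
    while (1) {
        sleep(1);
    }
    return NULL;
}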

int main() {
    pthread_t tid;
    int thread_count = 0;
    int ret;

    while (thread_count < 5000) {
        ret = pthread_create(&tid, NULL, thread_func, NULL);
        if (ret != 0) {
            if (ret == EAGAIN) {
                printf("can not create thread(EAGAIN) anymore: %d\n", thread_count);
                break;
            } else {
                printf("pthread_create error: %s\n", strerror(ret));
                break;
            }
        }
        thread_count++;
        if (thread_count % 1000 == 0) {
            printf("already create thread number %d\n", thread_count);
        }
    }
    printf("total created thread number: %d\n", thread_count);

    while (1) {
        sleep(1);
    }

    return 0;
}

You are changing the test to suit the way you need to run it, but I'm not aware of anyone else reporting issues running this test.

I think the failure described in the issue only appears on machines with a very large number of CPU cores.

@dholmes-ora
Member

This shows that the max user processes count limited by ulimit -u 4096 includes the threads created by the first create-thread process.

That's not the way ulimit should work in different sub-shells. What is the ulimit in the parent shell? I think the subshells are limited by the parent.

@sendaoYan
Member Author

sendaoYan commented Jul 22, 2025

That's not the way ulimit should work in different sub-shells.

I also initially thought that ulimit shouldn't work like that in different sub-shells. But ulimit actually does behave in this unexpected way across sub-shells.
The test case runtime/Thread/ThreadCountLimit.java attempts to limit the number of user processes to 4096 by starting the child process with the prefix "bash -c ulimit -u 4096", but in practice ulimit does not work as expected. If this test case runs simultaneously with other tests, the number of threads it can create may be zero, and is in any case always less than 4096; it depends on how many user processes have already been created on the test machine.

What is the ulimit in the parent shell? I think the subshells are limited by the parent.

The ulimit in the parent shell is unlimited. The fact that the first "./create-thread" process can create 5k threads shows that the parent shell has no limit.

image

@dholmes-ora
Member

There is definitely something unexpected/odd about the behaviour of ulimit when used in this way, though I do not observe the exact problems you describe unless I run a number of test processes concurrently - which is simply demonstrating machine overloading.

First, what does it even mean to use ulimit -u? The manpage says it limits the maximum number of processes the user can create - it doesn't say "per shell" (and setrlimit confirms this). But you can easily demonstrate that the user can create far more processes/threads than have been set by a ulimit command running in another shell. So perhaps there is something else that affects how ulimit works, and that something is different between our systems and yours. ?? (I know there are capabilities that disable the limit but I couldn't see any indication such capabilities were present.)

Second, I observe that with ulimit -u 1024 I can't even run java -version - which makes no sense in terms of number of threads created. Relatedly with a 4096 limit the test typically can only create around 2500 threads - so where did the other 1500+ go?

The use of ulimit was added to the test, for Linux only, because we found we could exhaust other resources that could then cause fatal errors in the VM in unexpected places - rather than the failure of pthread_create that we were trying to induce.

I'm really not sure how to proceed here. The change you propose affects all platforms, but there is only an issue for you on Linux.

@sendaoYan
Member Author

sendaoYan commented Jul 22, 2025

Hi @dholmes-ora

which is simply demonstrating machine overloading.

I think it's not machine overloading, because the 'ulimit -u' setting on my machine is 'unlimited'. I can create 5000 threads many times, as shown below:

image

you can easily demonstrate that the user can create far more processes/threads than have been set by a ulimit command running in another shell

I think the 'ulimit -u' in a sub-shell takes effect in that sub-shell only; it is a temporary setting and will not affect the parent shell.

Relatedly with a 4096 limit the test typically can only create around 2500 threads - so where did the other 1500+ go?

It seems that the sub-shell with the 'ulimit -u 4096' prefix counts all of the user's processes. That's just my speculation. That's why this test is not suitable for running simultaneously with other tests.

Anyway, I changed this PR to use docker run --pids-limit 4096 instead of the original 'ulimit -u 4096'. It makes this test more complicated, but more elegant and more robust.

@openjdk openjdk bot removed the rfr Pull request is ready for review label Jul 22, 2025
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 22, 2025
@dholmes-ora
Member

You can't just change the test to use docker! This is not a container test. We use special test tasks to run container tests in an environment where containers are enabled.

SendaoYan added 2 commits July 22, 2025 20:13
@dholmes-ora
Member

I think the 'ulimit -u' in a sub-shell takes effect in that sub-shell only; it is a temporary setting and will not affect the parent shell.

I'm finding some of these statements to be contradictory to the problem being stated. If the ulimit setting only affects the sub-shell then it can't cause other concurrent tests to hit the limit and fail to create threads!

It seems that the sub-shell with the 'ulimit -u 4096' prefix counts all of the user's processes. That's just my speculation. That's why this test is not suitable for running simultaneously with other tests.

If the sub-shell counts all processes/threads belonging to the user and applies the new ulimit then that would make some sense. But again how does that then cause any problem in a different shell?

@sendaoYan
Member Author

You can't just change the test to use docker! This is not a container test. We use special test tasks to run container tests in an environment where containers are enabled.

Okay, I have reverted the docker commit.

@sendaoYan
Member Author

sendaoYan commented Jul 22, 2025

If the ulimit setting only affects the sub-shell then it can't cause other concurrent tests to hit the limit and fail to create threads!

Maybe some of my previous statements caused misunderstandings.
The use of ulimit in this test case will not cause other concurrent tests to hit the limit, but it will leave this test itself without enough user processes to start Java.
On a machine with a very large number of cores, every test creates more JIT compiler threads and more GC worker threads. So when this test runs simultaneously with other tests, it cannot start the Java subprocess with the "ulimit -u" prefix, and the subprocess reports Failed to start thread "GC Thread#0", because the subprocess is limited by "ulimit -u 4096" and the user process resources may already be occupied by the other tests running at the same time. The other tests run normally, because they do not set 'ulimit -u' explicitly.

@dholmes-ora
Member

dholmes-ora commented Jul 22, 2025

The use of ulimit in this test case will not cause other concurrent tests to hit the limit, but it will leave this test itself without enough user processes to start Java.

Sorry - right I get it now. So basically with enough other activity on the machine all the 4096 process/thread capacity may have already been used up before the test can run. It isn't this test that is the "resource hog" as such. What we really want to do is more like ulimit -u <current process/thread count> + 4096 but getting the current value is tricky.
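
To make the "current count + 4096" idea concrete, here is a minimal, Linux-only sketch that approximates the user's current thread count by summing the Threads field of every /proc/<pid>/status file whose real UID matches. The class `UserThreadCount` and its methods are hypothetical helpers for illustration; they are not part of the test or the JDK test library. The result is inherently racy, since other processes keep starting and exiting while /proc is being scanned, which is part of why obtaining this value is tricky.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; an approximation, not an exact accounting of RLIMIT_NPROC.
final class UserThreadCount {

    // Returns the first numeric token of a "Key:<whitespace>value ..." line, or -1 if absent.
    static long firstField(List<String> statusLines, String key) {
        for (String line : statusLines) {
            if (line.startsWith(key + ":")) {
                String[] parts = line.substring(key.length() + 1).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    // Sums the "Threads:" values of all numeric directories under procRoot
    // whose status file reports the given real UID (first token of "Uid:").
    static long countThreads(Path procRoot, long uid) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(procRoot, "[0-9]*")) {
            for (Path dir : dirs) {
                try {
                    List<String> lines = Files.readAllLines(dir.resolve("status"));
                    if (firstField(lines, "Uid") == uid) {
                        total += Math.max(0, firstField(lines, "Threads"));
                    }
                } catch (IOException e) {
                    // The process exited (or is unreadable) while scanning; skip it.
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path proc = Paths.get("/proc");
        long uid = firstField(Files.readAllLines(proc.resolve("self/status")), "Uid");
        long current = countThreads(proc, uid);
        // Print the shell prefix that would leave roughly 4096 extra slots.
        System.out.println("ulimit -u " + (current + 4096));
    }
}
```

Taking the /proc root as a parameter keeps the counting logic testable against a fake directory tree.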

One possible solution would be to loop so that if we get exitcode 1 then we retry with 4096*2 etc.

      if (Platform.isLinux()) {
        // On Linux this test sometimes hits the limit for the maximum number of memory mappings,
        // which leads to various other failure modes. Run this test with a limit on how many
        // threads the process is allowed to create, so we hit that limit first. What we want is
        // for another "limit" processes to be available, but ulimit doesn't work that way and
        // if there are already many running processes we could fail to even start the JVM properly.
        // So we loop increasing the limit until we get a successful run. This is not foolproof.
        int pLimit = 4096;
        final String ULIMIT_CMD = "ulimit -u ";
        ProcessBuilder pb = ProcessTools.createTestJavaProcessBuilder(ThreadCountLimit.class.getName());
        String javaCmd = ProcessTools.getCommandLine(pb);
        for (int i = 1; i <= 10; i++) {
            // Relaunch the test with args.length > 0, and the ulimit set
            String cmd = ULIMIT_CMD + Integer.toString(pLimit * i) + " && " + javaCmd + " dummy";
            System.out.println("Trying: bash -c " + cmd);
            OutputAnalyzer oa = ProcessTools.executeCommand("bash", "-c", cmd);
            int exitValue = oa.getExitValue();
            switch (exitValue) {
              case 0: System.out.println("Success!"); return;
              case 1: System.out.println("Retry ..."); continue;
              default: oa.shouldHaveExitValue(0); // generate error report
            }
        }
        throw new Error("Failed to perform a successful run!");
      } else {
        // Not Linux so run directly.
        test();
      }

Took 5 loops to run on my system (ulimit -u 24576) when I had an unrestricted version of the test running which has about 20700 live threads.
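
The snippet above depends on the jtreg test library (ProcessTools, OutputAnalyzer). For experimenting outside jtreg, the same retry idea can be sketched with only java.lang.ProcessBuilder. Everything named here (`UlimitRetrySketch`, `firstWorkingLimit`, the use of java -version as the command) is an assumption for illustration and not what the test does; it also treats any exit code other than 0 or 1 as an error, mirroring the default branch above.

```java
import java.io.IOException;
import java.util.function.IntUnaryOperator;

// Hypothetical standalone variant of the retry loop; not the test's code.
final class UlimitRetrySketch {

    // Builds the shell command that applies the limit before running the command.
    static String ulimitCommand(int limit, String command) {
        return "ulimit -u " + limit + " && " + command;
    }

    // Tries base*1, base*2, ... base*attempts. runWithLimit returns an exit code.
    // Returns the first limit that exits with 0, or -1 if none did.
    static int firstWorkingLimit(int base, int attempts, IntUnaryOperator runWithLimit) {
        for (int i = 1; i <= attempts; i++) {
            int limit = base * i;
            int exit = runWithLimit.applyAsInt(limit);
            if (exit == 0) {
                return limit;
            }
            if (exit != 1) {
                throw new IllegalStateException("unexpected exit code " + exit);
            }
            // Exit code 1: assume the limit was too low to start, so retry higher.
        }
        return -1;
    }

    // Runs the command through bash and returns its exit code.
    static int runInBash(String command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("bash", "-c", command).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) {
        String javaCmd = "java -version"; // stand-in for the real test command line
        int limit = firstWorkingLimit(4096, 10, l -> {
            try {
                return runInBash(ulimitCommand(l, javaCmd));
            } catch (IOException | InterruptedException e) {
                throw new IllegalStateException(e);
            }
        });
        System.out.println(limit < 0 ? "no limit worked" : "succeeded with ulimit -u " + limit);
    }
}
```

Passing the runner in as a function keeps the loop itself independent of how the subprocess is launched, so it can be exercised with a fake runner.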

@sendaoYan
Member Author

Thanks @dholmes-ora very much.

I will verify this patch in our CI.

/contributor add @dholmes-ora

@openjdk

openjdk bot commented Jul 23, 2025

@sendaoYan
Contributor David Holmes <[email protected]> successfully added.

Member

@dholmes-ora dholmes-ora left a comment


Looks good to me :)

I will run this through our CI as well.

@dholmes-ora
Member

Testing in our CI was fine - test passes on first iteration with 4096 in all cases. I ran this standalone and within tier5.

@sendaoYan
Member Author

sendaoYan commented Jul 23, 2025

Testing in our CI was fine - test passes on third iteration with 12288 in all cases. I ran this as part of all the jtreg tests.

ThreadCountLimit_id0.jtr.txt

@sendaoYan
Member Author

/issue JDK-8359827

@openjdk openjdk bot changed the title 8359827: Test runtime/Thread/ThreadCountLimit.java should run exclusively 8359827: Test runtime/Thread/ThreadCountLimit.java need loop increasing the limit Jul 23, 2025
@openjdk

openjdk bot commented Jul 23, 2025

@sendaoYan This issue is referenced in the PR title - it will now be updated.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 23, 2025
@sendaoYan
Member Author

Thanks for your patient reviews and suggestions, @dholmes-ora.

/integrate

@openjdk

openjdk bot commented Jul 24, 2025

Going to push as commit fc80384.
Since your change was applied there have been 52 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 24, 2025
@openjdk openjdk bot closed this Jul 24, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 24, 2025
@openjdk

openjdk bot commented Jul 24, 2025

@sendaoYan Pushed as commit fc80384.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@sendaoYan sendaoYan deleted the jbs8359827 branch July 24, 2025 01:56
@dholmes-ora
Member

@sendaoYan any hotspot related changes, including tests, require two reviews before integration.

It is probably prudent to wait for any feedback from other CI maintainers before backporting this change.

@sendaoYan
Member Author

@sendaoYan any hotspot related changes, including tests, require two reviews before integration.

It is probably prudent to wait for any feedback from other CI maintainers before backporting this change.

Sorry, I will pay attention next time.

Labels
hotspot-runtime [email protected] integrated Pull request has been integrated

2 participants