
Training ends shortly after entering play #5999

@popcron

Description

Describe the bug
When running mlagents-learn (with or without --force) and entering Play mode, training exits almost immediately; the console prints Debug.Log output about 33 times before the traceback below appears.

To Reproduce
Steps to reproduce the behavior:

  1. Start training with `mlagents-learn --force` and press Play in the Unity Editor
  2. Observe training close almost immediately, with traceback errors in the console output

Console logs / stack traces

PS C:\repos\the big game\Saw and UFO> mlagents-learn --force
[W ..\torch\csrc\utils\tensor_numpy.cpp:77] Warning: Failed to initialize NumPy: module compiled against API version 0x10 but this version of numpy is 0xe (function operator ())

            ┐  ╖
        ╓╖╬│╡  ││╬╖╖
    ╓╖╬│││││┘  ╬│││││╬╖
 ╖╬│││││╬╜        ╙╬│││││╖╖                               ╗╗╗
 ╬╬╬╬╖││╦╖        ╖╬││╗╣╣╣╬      ╟╣╣╬    ╟╣╣╣             ╜╜╜  ╟╣╣
 ╬╬╬╬╬╬╬╬╖│╬╖╖╓╬╪│╓╣╣╣╣╣╣╣╬      ╟╣╣╬    ╟╣╣╣ ╒╣╣╖╗╣╣╣╗   ╣╣╣ ╣╣╣╣╣╣ ╟╣╣╖   ╣╣╣
 ╬╬╬╬┐  ╙╬╬╬╬│╓╣╣╣╝╜  ╫╣╣╣╬      ╟╣╣╬    ╟╣╣╣ ╟╣╣╣╙ ╙╣╣╣  ╣╣╣ ╙╟╣╣╜╙  ╫╣╣  ╟╣╣
 ╬╬╬╬┐     ╙╬╬╣╣      ╫╣╣╣╬      ╟╣╣╬    ╟╣╣╣ ╟╣╣╬   ╣╣╣  ╣╣╣  ╟╣╣     ╣╣╣┌╣╣╜
 ╬╬╬╜       ╬╬╣╣      ╙╝╣╣╬      ╙╣╣╣╗╖╓╗╣╣╣╜ ╟╣╣╬   ╣╣╣  ╣╣╣  ╟╣╣╦╓    ╣╣╣╣╣
 ╙   ╓╦╖    ╬╬╣╣   ╓╗╗╖            ╙╝╣╣╣╣╝╜   ╘╝╝╜   ╝╝╝  ╝╝╝   ╙╣╣╣    ╟╣╣╣
   ╩╬╬╬╬╬╬╦╦╬╬╣╣╗╣╣╣╣╣╣╣╝                                             ╫╣╣╣╣
      ╙╬╬╬╬╬╬╬╣╣╣╣╣╣╝╜
          ╙╬╬╬╣╣╣╜
             ╙

 Version information:
  ml-agents: 0.30.0,
  ml-agents-envs: 0.30.0,
  Communicator API: 1.5.0,
  PyTorch: 1.13.1+cpu
[W ..\torch\csrc\utils\tensor_numpy.cpp:77] Warning: Failed to initialize NumPy: module compiled against API version 0x10 but this version of numpy is 0xe (function operator ())
[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
[INFO] Connected to Unity environment with package version 3.0.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Cell?team=0
[WARNING] Behavior name Cell does not match any behaviors specified in the trainer configuration file. A default configuration will be used.
[WARNING] Deleting TensorBoard data events.out.tfevents.1697750225.pop.26916.0 that was left over from a previous run.
[INFO] Hyperparameters for behavior name Cell:
        trainer_type:   ppo
        hyperparameters:
          batch_size:   1024
          buffer_size:  10240
          learning_rate:        0.0003
          beta: 0.005
          epsilon:      0.2
          lambd:        0.95
          num_epoch:    3
          shared_critic:        False
          learning_rate_schedule:       linear
          beta_schedule:        linear
          epsilon_schedule:     linear
        network_settings:
          normalize:    False
          hidden_units: 128
          num_layers:   2
          vis_encode_type:      simple
          memory:       None
          goal_conditioning_type:       hyper
          deterministic:        False
        reward_signals:
          extrinsic:
            gamma:      0.99
            strength:   1.0
            network_settings:
              normalize:        False
              hidden_units:     128
              num_layers:       2
              vis_encode_type:  simple
              memory:   None
              goal_conditioning_type:   hyper
              deterministic:    False
        init_path:      None
        keep_checkpoints:       5
        checkpoint_interval:    500000
        max_steps:      500000
        time_horizon:   64
        summary_freq:   50000
        threaded:       False
        self_play:      None
        behavioral_cloning:     None
[INFO] Exported results\ppo\Cell\Cell-0.onnx
[INFO] Copied results\ppo\Cell\Cell-0.onnx to results\ppo\Cell.onnx.
Traceback (most recent call last):
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\learn.py", line 264, in main
    run_cli(parse_command_line())
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\learn.py", line 260, in run_cli
    run_training(run_seed, options, num_areas)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\learn.py", line 136, in run_training
    tc.start_learning(env_manager)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\trainer_controller.py", line 175, in start_learning
    n_steps = self.advance(env_manager)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\trainer_controller.py", line 233, in advance
    new_step_infos = env_manager.get_steps()
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\env_manager.py", line 124, in get_steps
    new_step_infos = self._step()
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 408, in _step
    self._queue_steps()
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 302, in _queue_steps
    env_action_info = self._take_step(env_worker.previous_step)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 543, in _take_step
    all_action_info[brain_name] = self.policies[brain_name].get_action(
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 130, in get_action
    run_out = self.evaluate(decision_requests, global_agent_ids)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 93, in evaluate
    masks = self._extract_masks(decision_requests)
  File "C:\Users\phill\AppData\Local\Programs\Python\Python310\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 77, in _extract_masks
    mask = torch.as_tensor(
RuntimeError: Could not infer dtype of numpy.int32
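
For reference, the crash reproduces outside ml-agents once torch's NumPy interop fails to initialize (the "Failed to initialize NumPy" warning at startup): any numpy-array-to-tensor conversion then raises the same error. A minimal sketch under this environment; the array shape and the `masks` name are just illustrative, not the actual ml-agents data:

```python
# Minimal repro sketch (assumes the same broken torch 1.13.1 / numpy pairing
# as above; with a numpy matching torch's compiled C API these lines succeed).
import numpy as np
import torch

masks = np.ones((1, 4), dtype=np.int32)  # stand-in for the action masks built in _extract_masks
tensor = torch.as_tensor(masks)          # raises: RuntimeError: Could not infer dtype of numpy.int32
```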

Environment (please complete the following information):

  • Unity 2023.3.0a10
  • Windows 11, Torch 1.13.1+cpu, Python 3.10.0, numpy 1.12.1
  • The mlagents and mlagents.extensions packages are sourced from the develop branch (as UPM references)
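
If I'm reading NumPy's C API version table correctly, the startup warning (torch built against API 0x10, installed numpy reporting 0xe) means torch 1.13.1 was compiled against numpy 1.23+ while the installed numpy is a 1.20/1.21-era release. A quick diagnostic sketch of my own (not part of ml-agents) to confirm whether torch/numpy interop is broken in a given environment:

```python
# Diagnostic sketch: print the installed versions and probe torch's NumPy interop.
import numpy as np
import torch

print("numpy:", np.__version__)
print("torch:", torch.__version__)
try:
    torch.from_numpy(np.zeros(1, dtype=np.int32))
    print("torch/numpy interop: OK")
except RuntimeError as err:
    # In the broken environment this branch is taken instead of the "OK" print.
    print("torch/numpy interop broken:", err)
```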
