Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ocaml5-issue] Out_channel.flush can cause Syserror when used in parallel #444

Open
jmid opened this issue Mar 22, 2024 · 9 comments
Open
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Copy link
Collaborator

jmid commented Mar 22, 2024

A Cygwin 5.2 run triggered when merging #443 to main found an unexpected counterexample to STM Out_channel parallel:
https://github.com/ocaml-multicore/multicoretests/actions/runs/8361686850/job/22890359851

random seed: 292949458
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential (generating)
[✓] 1000    0    0 1000 / 1000     2.9s STM Out_channel test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test parallel
[ ]  620    0    0  542 / 1000    57.1s STM Out_channel test parallel
[ ] 1116    0    0  966 / 1000   117.5s STM Out_channel test parallel (shrinking:    1.0011)
[✗] 1117    0    1  966 / 1000   129.8s STM Out_channel test parallel

--- Failure --------------------------------------------------------------------

Test STM Out_channel test parallel failed (1 shrink steps):

                                                              |                         
                                     Output_substring ("\225\185!\188Q\142\138", 0, 96) 
                                                       Output_byte 36                   
                                     Output_bytes "\158\199e\220\253\148\... (truncated)
                                                            Flush                       
                                                            Close                       
                                                   Output ("Z\197j", 3, 3)              
                                                          Open_text                     
                                                         Is_buffered                    
                                                    Set_binary_mode false               
                                              Output_substring ("{8)g", 18, 1)          
                                                           Length                       
                                                         Close_noerr                    
                                                          Open_text                     
                                                      Set_buffered true                 
                                                              |                         
                                   .----------------------------------------------------.
                                   |                                                    |                         
                              Close_noerr                                     Set_binary_mode true                
                               Open_text                                               Pos                        
                Output_substring ("\to\128N\186", 7, 6)                       Set_binary_mode true                
                            Output_char '9'                                      Output_byte 54                   
                         Set_binary_mode true                                        Length                       
                         Output_bytes "\182U"                                         Flush                       
                            Output_char '5'                    Output_substring ("\014\236\187\146^... (truncated)
              Output ("(|$\217\203\206\162t\235", 18, 4)                      Set_binary_mode true                
                                  Pos                                              Is_buffered                    
                         Set_binary_mode false                                    Output_byte 1                   


+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test parallel:

  Results incompatible with linearized model

                                                                                                       |                                                
                                                       Output_substring ("\225\185!\188Q\142\138", 0, 96) : Error (Invalid_argument("output_substring"))
                                                                                           Output_byte 36 : Ok (())                                     
                                                                         Output_bytes "\158\199e\220\253\148\... (truncated) : Ok (())                  
                                                                                                Flush : Ok (())                                         
                                                                                                Close : Ok (())                                         
                                                                         Output ("Z\197j", 3, 3) : Error (Invalid_argument("output"))                   
                                                                                              Open_text : Ok (())                                       
                                                                                            Is_buffered : Ok (true)                                     
                                                                                        Set_binary_mode false : Ok (())                                 
                                                                Output_substring ("{8)g", 18, 1) : Error (Invalid_argument("output_substring"))         
                                                                                                Length : Ok (0)                                         
                                                                                             Close_noerr : Ok (())                                      
                                                                                              Open_text : Ok (())                                       
                                                                                          Set_buffered true : Ok (())                                   
                                                                                                       |                                                
                                                     .--------------------------------------------------------------------------------------------------.
                                                     |                                                                                                  |                                                
                                           Close_noerr : Ok (())                                                                         Set_binary_mode true : Ok (())                                  
                                            Open_text : Ok (())                                                                                   Pos : Ok (0)                                           
          Output_substring ("\to\128N\186", 7, 6) : Error (Invalid_argument("output_substring"))                                         Set_binary_mode true : Ok (())                                  
                                         Output_char '9' : Ok (())                                                                          Output_byte 54 : Ok (())                                     
                                      Set_binary_mode true : Ok (())                                                                             Length : Ok (0)                                         
                                      Output_bytes "\182U" : Ok (())                                                            Flush : Error (Sys_error("Bad file descriptor"))                         
                                         Output_char '5' : Ok (())                                                        Output_substring ("\014\236\187\146^... (truncated) : Ok (())                  
              Output ("(|$\217\203\206\162t\235", 18, 4) : Error (Invalid_argument("output"))                            Set_binary_mode true : Error (Sys_error("Bad file descriptor"))                 
                                               Pos : Ok (4)                                                                                  Is_buffered : Ok (true)                                     
                                      Set_binary_mode false : Ok (())                                                                        Output_byte 1 : Ok (())                                     

================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)
File "src/io/dune", line 40, characters 7-16:
40 |  (name stm_tests)
            ^^^^^^^^^
(cd _build/default/src/io && ./stm_tests.exe --verbose)
Command exited with code 1.

AFAICS, the failure can be explained by Flush in the "right leg" causing Sys_error("Bad file descriptor").
This goes against the specification, which says:

       val close : t -> unit

       Close the given channel, flushing all buffered write operations.  Output functions raise a Sys_error exception when they are  applied  to  a  closed  output
       channel, except Out_channel.close and Out_channel.flush , which do nothing when applied to an already closed channel. [...]

This is currently captured in the STM Out_channel test as:

    | Flush, Res ((Result (Unit,Exn),_), r) ->
      (match s,r with
       | Closed, Error (Sys_error _) -> false (* should not generate an error *)
       | Closed, Ok () -> true
       | Open _, Ok () -> true
       | _ -> false)
@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Mar 22, 2024
@jmid
Copy link
Collaborator Author

jmid commented Jul 1, 2024

Just observed this again - now on FreeBSD 5.1 trying out the latest QCheck memory usage improvement
https://github.com/ocaml-multicore/multicoretests/tree/try-qcheck-mem-improvement
https://ocaml-multicoretests.ci.dev:8100/job/2024-06-28/125000-ci-ocluster-build-caa2d4

random seed: 362318939
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential (generating)
[✓] 1000    0    0 1000 / 1000     1.2s STM Out_channel test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test parallel
[✗]  103    0    1   90 / 1000     4.7s STM Out_channel test parallel

--- Failure --------------------------------------------------------------------

Test STM Out_channel test parallel failed (1 shrink steps):

                                                              |                         
                                       Output ("\011\230\197\204\173\164'\024", 9, 1)   
                                     Output_bytes "\159\238+\152]\006*\24... (truncated)
                                                         Close_noerr                    
                                                          Open_text                     
                                                            Flush                       
                                                            Flush                       
                                                              |                         
                                   .----------------------------------------------------.
                                   |                                                    |                         
                                Length                                             Close_noerr                    
                            Output_char 'i'                    Output ("Dc\015\194*x\159YlA\b\247\1... (truncated)
                                Length                                              Open_text                     
                             Output_byte 8                                       Output_char 'G'                  
                                 Flush                                        Set_binary_mode false               


+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test parallel:

  Results incompatible with linearized model

                                                                                                   |                                               
                                                          Output ("\011\230\197\204\173\164'\024", 9, 1) : Error (Invalid_argument("output"))      
                                                                     Output_bytes "\159\238+\152]\006*\24... (truncated) : Ok (())                 
                                                                                         Close_noerr : Ok (())                                     
                                                                                          Open_text : Ok (())                                      
                                                                                            Flush : Ok (())                                        
                                                                                            Flush : Ok (())                                        
                                                                                                   |                                               
                                                   .-----------------------------------------------------------------------------------------------.
                                                   |                                                                                               |                                               
                                            Length : Ok (0)                                                                              Close_noerr : Ok (())                                     
                                       Output_char 'i' : Ok (())                                     Output ("Dc\015\194*x\159YlA\b\247\1... (truncated) : Error (Sys_error("Bad file descriptor"))
                                            Length : Ok (0)                                                                               Open_text : Ok (())                                      
                                        Output_byte 8 : Ok (())                                                                        Output_char 'G' : Ok (())                                   
                            Flush : Error (Sys_error("Bad file descriptor"))                                                        Set_binary_mode false : Ok (())                                

================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)
File "src/io/dune", line 40, characters 7-16:
40 |  (name stm_tests)
            ^^^^^^^^^
(cd _build/default/src/io && ./stm_tests.exe --verbose)
Command exited with code 1.

@jmid jmid changed the title [ocaml5-issue] On Cygwin Out_channel.flush can cause Syserror when used in parallel [ocaml5-issue] Out_channel.flush can cause Syserror when used in parallel Jul 1, 2024
@jmid
Copy link
Collaborator Author

jmid commented Aug 9, 2024

I've created a standalone reproducer for this one:

let test () =
  let path = Filename.temp_file "stm-" "" in
  let channel = Atomic.make (Out_channel.open_text path) in

  (* First, a bit of one-domain channel activity *)
  Out_channel.output_byte (Atomic.get channel) 1;
  (try Out_channel.flush (Atomic.get channel) with (Sys_error _) -> assert false);

  let wait = Atomic.make true in

  (* Domain 1 closes-opens-outputs repeatedly *)
  let d1 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      for _ = 1 to 50 do
        Out_channel.close (Atomic.get channel);
        Atomic.set channel (Out_channel.open_text path);
        Out_channel.output_byte (Atomic.get channel) 1;
      done;
    ) in

  (* Domain 2 calls flush and pos repeatedly *)
  let d2 = Domain.spawn (fun () ->
      Atomic.set wait false;
      (*
       "Output functions raise a Sys_error exception when they are applied to a  closed  output  channel,
        except  Out_channel.close  and Out_channel.flush , which do nothing when applied to an already closed channel."
      *)
      for _ = 1 to 50 do
        (try Out_channel.flush (Atomic.get channel)
         with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false);
        ignore (Out_channel.pos (Atomic.get channel));
      done;
    ) in

  let () = Domain.join d1 in
  let () = Domain.join d2 in
  (* Please leave the torture chamber nice and clean as you found it *)
  (try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
  Sys.remove path

let _ =
  for i = 1 to 5_000 do
    if i mod 250 = 0 then Printf.printf "#%!";
    test ()
  done

With it, locally on Linux I am able to trigger the assertion failure on 5.1.0+fp and 5.2.0+fp switches.

@jmid
Copy link
Collaborator Author

jmid commented Aug 9, 2024

The branch https://github.com/ocaml-multicore/multicoretests/tree/flushtest-repro-focus triggers a focused test of it on the CI.

Across all runs, this reproducer only triggered in multicoretests-ci where flush may trigger a Sys_error in

@jmid
Copy link
Collaborator Author

jmid commented Aug 9, 2024

In trying to understand this better, I ran dune exec src/io/stm_tests.exe -- -v locally under a 5.2.0+fp switch, where it fails spectacularly - even the sequential one:

$ dune exec src/io/stm_tests.exe -- -v
random seed: 242921126
generated error fail pass / total     time test name
[✗]   20    0    1   19 / 1000     0.1s STM Out_channel test sequential
[✗]    8    0    1    7 / 1000     5.9s STM Out_channel test parallel

--- Failure --------------------------------------------------------------------

Test STM Out_channel test sequential failed (19 shrink steps):

   Close_noerr
   Output_char 'H'
   Output_char ']'


+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test sequential:

  Results incompatible with model

   Close_noerr : Ok (())
   Output_char 'H' : Error (Sys_error("Bad file descriptor"))
   Output_char ']' : Ok (())

--- Failure --------------------------------------------------------------------

Test STM Out_channel test parallel failed (45 shrink steps):

                            |            
                          Close          
                      Output_byte 4      
                     Output_char 'E'     
                            |            
                 .---------------------.
                 |                     |            


+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test parallel:

  Results incompatible with linearized model

                                                             |                            
                                                      Close : Ok (())                     
                                  Output_byte 4 : Error (Sys_error("Bad file descriptor"))
                                                 Output_char 'E' : Ok (())                
                                                             |                            
                                .---------------------------------------------------------.
                                |                                                         |                            

================================================================================
failure (2 tests failed, 0 tests errored, ran 2 tests)

The same test however completes under the following two which seems to indicate a fp-related bug that may have since been fixed on trunk:

  • 5.2.0 (no fp)
  • 5.3.0+fp

A standalone reproducer:

let test () =
  let path = Filename.temp_file "stm-" "" in
  let channel = Out_channel.open_text path in

  Out_channel.close_noerr channel;
  (try Out_channel.output_char channel 'A'; assert false with (Sys_error _) -> ()); (* Error (Sys_error("Bad file descriptor")) *)
  (try Out_channel.output_char channel 'B'; assert false with (Sys_error _) -> ()); (* Succeeds *)

  Sys.remove path

let _ =
  for i = 1 to 10_000 do
    if i mod 250 = 0 then Printf.printf "#%!";
    test ()
  done

@jmid
Copy link
Collaborator Author

jmid commented Aug 14, 2024

To summarize

  • the Cygwin 5.2.0 failure was actually in 5.2.0~alpha1. It hasn't shown up since #12678 was merged (into 5.2.0~beta1)
  • the FreeBSD 5.1 failure was in 5.1.1 - not including #12678
  • linux-arm64-5.1 - same
  • linux-s390x-5.1 - same

This means that we are only missing an explanation of the above 5.2.0+fp standalone reproducers to close this issue.

@ncik-roberts
Copy link

I ran into this too when running the test suite with Jane Street's branch of the OCaml compiler. My explanation seems to apply equally to the most recent head of ocaml/ocaml, so I thought I'd share it here.

Explanation: the STM model expects that Out_channel.flush will never raise if interleaved with Out_channel.close. However, Out_channel.flush can raise if you get an unlucky concurrent interleaving with Out_channel.close. In particular:

If my explanation is correct, then maybe caml_flush_partial should check if the channel is closed before attempting to write to it.

@ncik-roberts
Copy link

(The issue is probably more widespread than suggested by my last message: I see that caml_flush is called by lots of functions, not just caml_ml_flush. e.g. outputting a string. All of these operations can thus relinquish the channel lock midoperation.)

@jmid
Copy link
Collaborator Author

jmid commented Oct 21, 2024

Thanks for sharing! 🙏
Your explanation makes sense to me. I can't reproduce the error ATM though... I'll need to look more into it.

@jmid
Copy link
Collaborator Author

jmid commented Nov 1, 2024

I've now created a reproducer and filed this upstream: ocaml/ocaml#13586

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime
Projects
None yet
Development

No branches or pull requests

2 participants