Why is my program slow? Same logic is 3 times faster with Ocaml

Hello!

I’m half-way through reading the book “Get Programming with F#”. As a fun exercise, I decided to go a little off-track and implement a program resembling the system utility tree:

That’s because I wanted to play with a recursive type, which is a notion I first came upon while studying Elm.

Out of curiosity, I then decided to implement the same logic in Ocaml (beginner as well):

I basically copy/pasted the original F# code, then adapted the pieces bit by bit until the code compiled.

To my surprise, I found the Ocaml code 3 times faster!

Here’s the output of the tool trying to sift trough 175872 files. I piped the output into wc in order to exclude from the measurement the slowness associated to writing to stdin (or printing to stdout).

Ocaml

$ time dune exec tree ~/code/explore/angular/ | wc -l
175872

real    0m2.131s
user    0m1.087s
sys     0m1.034s

F#

$ time ./bin/Release/net6.0/DiskUsage ~/code/explore/angular/ | wc -l
175872

real    0m6.776s
user    0m5.006s
sys     0m3.928s

System utility, for reference

$ time tree ~/code/explore/angular/ | wc -l
175875

real    0m1.156s
user    0m0.711s
sys     0m0.488s

I tried to inspect things via a profiler (from my IDE, without piping into wc), but nothing jumped at me. I can see I’m spending most time printing and looping over the next directory which seems normal.

Is anything obvious jumping at you? Could you suggest me possible ways to improve performance?

You could try annotating summary as a struct if it is getting allocated (i.e., created a lot).

[<Struct>]
type Summary =

For what it’s worth, I haven’t seen that actually have any effect, but as far as I understand, it’s somewhat official guidance for when to mark something as a struct. Wait … never mind. Haha. As I’m typing this out, I don’t think it works for recursive types, if I remember right.

The thing that mainly jumps out at me is that in the F# code, you are using a seq, while you are using a list in OCaml. In F#, seq will definitely be slower than list and array and should really only be used if you need lazy evaluation, truly generic sequences, or the few generic functions that only exist in the Seq module. Lists and arrays should be the default collections used. Try switching that out and see if it improves the F# performance. Seq.toList can be used to covert the type returned by the IO functions.

This is not tail-recursive, so any deep nesting will lead to lots of overhead.

Same here:

Generally, using Seq.append is going to be very slow in a recursive position, as each Seq.append will create a new sequence, wrap it into a new class so to say, with its own GetEnumerator() to prevent mutation of the underlying original sequence and for some other reasons. Using seq with yield! could give you better perf, but not sure if that’ll really help you in this scenario.

Replacing that with List though is a good start, as that’s an eager collection type (whereas seq is delayed, as mentioned above), but List.append or similar is going to be rather bad as well.

If you can code your recursion in a similar way like the following with lists and do a List.rev at key points, you’d be blazingly fast. I didn’t look very closely at your implementation, but these were the things that stood out.

Example:

let rec doSomething ls aggregate =
    match ls with
    | [] -> aggregate
    | x::rest -> 
        x = x + 1
        doSomething rest (x::aggregate) // tail recursive
1 Like

Thanks very much for your suggestions.

I replaced all Seqs with Lists but the performance didn’t improve one bit.

I also tried to tag my Summary type with [<Struct>] but that also didn’t make a dent.

@abelbraaksma if I understand things correctly having my function not being tail-recursive means I can recompute the same data and incur a performance penalty, correct?

I don’t think that’s the case here. Each visit of a “tree node” either stops if it’s a file or generates other function calls if it’s a dir and until the whole tree is consumed.

To verify this, I introduced a little counter:

let mutable calls = 0 // NEW
let rec toSummary path =
    calls <- calls + 1 // NEW
    printfn $"CALLS: %d{calls}" // NEW

    let fi = System.IO.FileInfo path
   // Rest omitted for brevity...

Then logged the output:

I can observe the tree being built up while the CALLS counter is incrementing, prior to rendering.

I can see 179070 function calls, which is roughly the number of files and directories contained within the path.

$ find ~/code/explore/angular/ | wc -l
179071

I’ve got a slight variation here because I skip empty directories in my program.

Did I misunderstand the problem you were pointing at?

Not quite. It’s just the overhead of increasing the stack and having extra allocations on the stack, plus extra allocations for the lambdas. It’s a common optimisation, but often not an easy one, to use TCO.

As a comparison, try a small function calling Seq.append recursively. Then do the same in a loop using mutable state (to mimic list).

It’ll likely give you a tenfold perf difference or somewhere around that ballpark. Things may have improved in that area, but likely not that much that the compiler can automatically change non tailcall code into tailcall code.

If I have a moment later, I’ll try to come up with a more concrete example, or just show how you can fix the code.

PS, if anything, definitely move away from seq. There’s nothing in your code that warrants the overhead of a delayed sequence, and you’re comparing apples with pears because OCaml has lists where you have seq.

One more thing, you have this:

 if children <> [] then

This executes generic equality and typically means that the lists get iterated (hopefully it stops early, but the equal-operator does not necessarily know you’re comparing lists). Basically, the faster, or at least more idiomatic approach is to use List.isEmpty.

When you replaced the sequences with lists, did you make sure to make all calls to the Seq module List calls? The Seq functions will still work on lists.

Also, your enumEntries function should use Array.toList. You could just use arrays as well.

I’m not terribly sure where the difference between the F# and OCaml code is coming from, although there are some subtle differences and functions used. I toyed around with adjusting those and didn’t see a performance difference in the F# code.

I think in general, it doesn’t really matter. I doubt OCaml is 3x times faster than F# in general, so there’s not much to worry about and not much use in a comparison. This is a cop-out, so to speak, but as mentioned, there are so many uses of append and collect that this could probably be re-written. Those functions shouldn’t appear so often, particularly inside a recursive function. For example, the base case of toSummary is returning a single element list. In a more typical recursion of building up a list, you would be consing (:: in F#) things such that the base case returns an empty list. Actually, inside toSummary and compute, there’s no need to use List.collect. You’re mapping and then collecting single element lists. Just use List.map and return a single Summary element in toSummary.

Here’s what I played around with along these lines: test.fsx (github.com)

You can run it with dotnet fsi test.fsx. I didn’t start doing any major rewrite though, and I don’t know if the NuGet reference gets cached. Timing the script is probably not an accurate benchmark (what I was doing).

Thanks for the tip :slight_smile:

For sure, I’m just playing around to learn so I suspected my code was sub-optimal from the start.

I tried integrating some of your changes but it didn’t move the needle unfortunately :frowning:

It would be my guess that the performance difference is more related to the runtime or .NET than F# given how similar the code is. I may be wrong though, so hopefully someone who knows can chime in. All that to say, don’t let it turn you off of F#. It’s plenty fast.

Okay, so I’ve made progress and I found the main bottleneck.

Surprisingly, the main performance killer was printing! Even though I piped my output to the next process.


Conclusion: similar performance once I removed sorting + printing.

F#    : 2.2s-2.6s (sift through 179070 files)
Ocaml : 1.6s-1.9s (sift through 179070 files)

Add 200ms-300ms for sorting the list prior to printing
An extra 300ms to sort by a 2nd column, which only the F# code did.


I then experimented a little more by calling this code instead:

New F# code

module Printer

let print _ =
    let mutable i = 0
    while i < 200000 do
        i <- i + 1
        printfn $"Line %d{i}"
time ./bin/Release/net6.0/DiskUsage bogus_param
...
Line 199998
Line 199999
Line 200000

real    0m2.275s
user    0m0.941s
sys     0m1.140s
$ time ./bin/Release/net6.0/DiskUsage bogus_param | wc -l
200000

real    0m1.114s
user    0m0.793s
sys     0m0.943s

New Ocaml code

let print _ =
  let i = ref 0 in
  while !i < 200000 do
    let _ = i := !i + 1 in
    Printf.printf "Line %d\n" !i
  done
;;

Output

dune build && time ./_build/default/bin/main.exe bogus_param
...
Line 199998
Line 199999
Line 200000

real    0m0.381s
user    0m0.054s
sys     0m0.224s
$ dune build && time ./_build/default/bin/main.exe bogus_param | wc -l
200000                              

real    0m0.058s
user    0m0.063s
sys     0m0.001s

So I deduced that Fsharp’s printfn must be flushing systematically (??)

2nd F# experiment

let print2 _ =
    let mutable i = 0
    let sb = System.Text.StringBuilder()
    while i < 200000 do
        i <- i + 1
        sb.Append($"Line %d{i}\n") |> ignore
        
    System.Console.WriteLine(sb.ToString())

Output

$ time ./bin/Release/net6.0/DiskUsage bogus_param | wc -l
...
Line 199998
Line 199999
Line 200000

real    0m0.557s
user    0m0.185s
sys     0m0.225s
$ time ./bin/Release/net6.0/DiskUsage bogus_param | wc -l
200001

real    0m0.210s
user    0m0.149s
sys     0m0.041s

With this last experiment, I would be tempted to conclude that the Ocaml print library probably better handles Unix concepts such as piping and I suppose some of the code can take a short circuit internally.

But if I think about it more, Windows does have the concept of piping to a process as well (via cmd) so I’m not sure.

c:\>tasklist | findstr /i chrome
chrome.exe                   13664 Console                   10    184,432 K
chrome.exe                    1488 Console                   10      8,600 K
chrome.exe                   14992 Console                   10     59,976 K
chrome.exe                    7192 Console                   10     35,072 K
chrome.exe                   13224 Console                   10     19,176 K
...

Please see the following go code:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func printBuff() {
	/*
		$ time ./print
		...
		Line 199996
		Line 199997
		Line 199998
		Line 199999

		real    0m0.348s
		user    0m0.042s
		sys     0m0.183s


		---

		$ time ./print | wc -l
		200000

		real    0m0.043s
		user    0m0.052s

	*/
	buffer := bufio.NewWriter(os.Stdout)
	defer buffer.Flush()

	for i := 0; i < 200000; i++ {
		fmt.Fprintf(buffer, "Line %d\n", i)
	}
}

func printNoBuff() {
	/*
		$ time ./print
		...
		Line 199997
		Line 199998
		Line 199999

		real    0m0.686s
		user    0m0.217s
		sys     0m0.449s

		---

		$ time ./print | wc -l
		200000

		real    0m0.281s
		user    0m0.183s
		sys     0m0.277s

	*/
	for i := 0; i < 200000; i++ {
		fmt.Printf("Line %d\n", i)
	}
}

func main() {
	printBuff()
	// printNoBuff()
}

It seems that Ocaml’s Printf module behaves as the printBuff() function above, while Fsharp’s printf or System.Console.WriteLine() functions behave as the printNoBuff() function above.

And this is why I got much performance in the first place.

I tried to play with streams or async constructs, but I haven’t been able to achieve better write performance with F#.

I’d like to achieve the same behavior as in printBuff(). Any ideas?

F#'s printfn calls System.Console.WriteLine under the hood, after parsing and creating the final string. The piping you do in the end on the commandline is not something either the F# or OCaml program has influence on, so any timings from that I would take with a grain of salt.

Printing to the console is relatively slow in .NET, I’ve often been bitten by that as well. For instance, at least on Windows, there’s a huge difference in perf or Console.WriteLine of any program when I run it on a console that has 10k lines open and is very wide. I often just do > log.txt to get around this slowness.

If you look at the TextWriter, which is the Out that is used by Console.WriteLine, you can see that under the hood, all calls ultimately run in a loop calling TextWriter.Write -> char -> unit, in other words, writing is done char by char.

If we then look at CreateOutputWriter we see that it uses a Synchronized textwriter, which suggests a lot of locking/unlocking may go on behind the scenes to ensure writing is safe. The default backing StreamWriter, in that same code, shows that the buffer size is 256, which isn’t that much, and suggests a lot of flushing will go on.

Finally, in the implementation of Write -> char -> unit of StreamWriter we can confirm that each time that buffer is reached, a flush takes place:

// System.IO.StreamWriter
public override void Write(char value)
{
	CheckAsyncTaskInProgress();
	if (_charPos == _charLen)
	{
		Flush(flushStream: false, flushEncoder: false);
	}
	_charBuffer[_charPos] = value;
	_charPos++;
	if (_autoFlush)
	{
		Flush(flushStream: true, flushEncoder: false);
	}
}

This is regardless of whether you write with a single large string, or a lot of smaller strings. To improve this performance, which is more of a dotnet challenge than an F# one tbh, you’ll need to set the buffer size higher if you want to dump a lot of stuff to the console.

This is rarely needed of course, the console is not meant for perf critical stuff. Use an in-memory buffered stream, or IO buffered IO stream instead. I do wonder though what OCaml has behind the scenes in this same area.

Disclaimer: this is just from looking at the dotnet sdk code, I haven’t tried to verify if the above assumptions are correct.

Interesting exercise, btw!

Thanks for the extra feedback.

I tried to find the right incantation to stop this auto-flushing behavior, but alas it looks like it’s really tricky to go against the defaults. I don’t seem to have much control here.

I somebody knows how to do it, please tell me, I’d like to know :slight_smile:

On a side, I’ll close with a couple of observations:

  1. In my go example, if I don’t flush stdout in printBuff() I’ll get truncated output:
Line 199720
Line 199721
Line 199722
Li
real    0m0.352s
user    0m0.038s
sys     0m0.185s

From this, I conclude that the OS does not flush stdout on program exit. So I’m not sure how this is handled on the Ocam side.

  1. While wanting to print so much to a terminal may be considered a bad idea, I can see that redirecting stdout to a file also behaves badly

Ocaml with its standard toolchain

This would take roughly 300ms to print on my terminal.

$ dune build && time ./_build/default/bin/main.exe bogus_param | wc -l
200000                              

real    0m0.057s
user    0m0.048s
sys     0m0.015s

---

$ dune build && time ./_build/default/bin/main.exe bogus_param >/tmp/file.log
                                 
real    0m0.060s
user    0m0.055s
sys     0m0.004s

F# with its standard toolchain

This would take roughly 1s to print on my terminal.

$ time ./bin/Release/net6.0/DiskUsage ~/code/explore/angular/ | wc -l
200000

real    0m0.473s
user    0m0.399s
sys     0m0.319s

---

$ time ./bin/Release/net6.0/DiskUsage ~/code/explore/angular/ >/tmp/file.log

real    0m0.835s
user    0m0.340s
sys     0m0.488s

It’s possible that you’re hitting a snafu with Linux + Dotnet. If I have a moment, I’ll try your example on Windows to see what’s going on. Though I haven’t been playing around with OCaml for some time.

On setting the flushing behavior, I believe you need to create a Stream to stdout by hand and set Console.Out to be that stream. Something along these lines at least.

F# string functions are currently slow. Fully compiled string functions are tracked in Make F# string functions fast and AOT/linker friendly · Issue #919 · fsharp/fslang-suggestions · GitHub .

1 Like

Thanks @abelbraaksma @charlesroddie

I did try to play with streams, however I kept bumping into the same issue since the System.Console itself seems to be the bottleneck and I must pass through it.

I’ll just have to remember to log lightly to stdout for now.

1 Like

I think what reason of this diff is slow loading dependences dlls by dotnet. Please, try it compile to native with trimming on sdk NET8 RC2.