May 1, 2025

Measuring the correlation between type system and performance

A research project on the correlation between a programming language's type system and its performance, written as a secondary-school final paper.

Final research paper · VLOT! Campus Sint-Laurentius, 2024–2025 ·

Download the original PDF →

Abstract

This paper details a research project spanning a school year; it describes the entire process, along with the results and conclusions. The project aimed to determine the correlation between type system and performance. The results indicate a relatively mild correlation in favor of static languages for performance. A number of tests were done to draw this conclusion; the main ones, however, were the correlation calculations. The correlation was calculated in two ways. The first was the Point-Biserial correlation. This test indicated a correlation with medium effect size, but was not particularly statistically significant according to the resulting p-value. This statistical insignificance was likely due to the effect of outliers on the distribution of raw data and the non-normal distribution of that data. To counteract this, the Rank-Biserial correlation (which is influenced far less by outliers) was calculated as well, which resulted in a very similar effect size but with an extremely low p-value, confirming its significance. The Rank-Biserial correlation therefore allowed for a more robust conclusion, and the combination of these methods provides more insight into the results of this project.

However, this project was subject to a number of limitations (mostly in skill, experience, knowledge, scale, time, and resources), which, by extension, allow for various valid criticisms. The main problem is the small sample size of only 14 languages: 7 static and 7 dynamic. The fact that only the execution time per language was measured, as opposed to a combination of metrics, can also be called into question. This paper also describes several additional analyses which help clarify the relationship between type system and performance.

Introduction

Motivation and practicality

My name is Arne Vercauteren and I study computer science. The subject of this paper — type systems — was chosen because I have a great personal interest in programming languages and their design and functionality.

In real-world, large-scale systems, performance is of incredible importance; choosing the right framework for your system can entail massive benefit when done right, or great loss when done wrong. In large corporations, for example, making the wrong choice can cause massive economic strain, while making the right choice can give you an edge over the competition. Performance is important not only for developers, but also for consumers: smoother experiences, shorter waiting times during startups, and quicker web pages are just a few examples. By measuring the correlation between type system and performance, this paper attempts to help computer science at large in obtaining a clearer view of what the benefits of static and dynamic type systems are, and therefore in making the right choice between them considering the specific use case and requirements (although it only touches on a single aspect of what influences such a choice, namely the performance).

Concept details

What is a type? A type is a classification or group, like a type of physical object or a type of abstract idea. You can also have types within other types. For example, a dog is a type of animal, so is a cat; an animal is a type of living being. Nine is a number, specifically a natural number; a natural number is a type of number, and a number is a type of character. In programming, types are used to define not only what exactly something is, but also how we can use it — it has what we call an interface. In practice, an interface is a set of functions: a dog can bark, a cat can meow, but they can both eat. So we would give both the cat type and the dog type separate sound functions, but the type animal would get an eat function. Type systems are used to make the language make sense; the type system will throw an exception when you try to count with dogs or make a number meow.

What is a type system? "A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute." (Pierce, 2002, p. 1) There are many distinctions one could make between type systems. For this project, the sole distinction will be between static and dynamic type systems. In dynamic type systems, type checking is performed at runtime and variables can change types throughout the program. In static type systems, type checking is performed during compile time and variables are generally not allowed to change types throughout the program. (GeeksForGeeks, 2019)

What does performance mean in this context? "Performance" could be interpreted in a wide variety of ways; we could measure execution time, but also memory usage, compilation speed (for compiled languages), or even energy efficiency. To keep this project manageable, we will take the simple (and most obvious) approach and only measure execution time.

Considering the audience for this paper and its low-level academic nature, some topics and concepts that might not usually need to be explained in a professional paper are given a short explanation here. A big thank you to the supervising teacher for the support of this project.

Research

Program and language selection / homogenization

To accurately measure the correlation, 14 languages were chosen, and 7 (albeit simple) programs were implemented in every language, resulting in a total of 98 unique source-code files. The ideas for the implemented programs came from rosettacode.org. Rosetta Code is an online database containing thousands of code snippets: crowdsourced programs implemented in many languages. Although the selection of programs came from Rosetta, the actual implementations were written by AI and then verified — and, if necessary, edited — by the researcher.

Below is a list of the selected languages and programs:

Dynamic languages: Python, JavaScript, PHP, Perl, Lua, Julia, Awk.

Static languages: C, C#, Swift, Go, Java, Rust, Kotlin.

Programs:

  • 100 doors (×10000)
  • factorial with n = 10 (×100000)
  • fibonacci sequence with n = 10 (×100000)
  • fizzbuzz (×10000)
  • reverse a string using "Hello, World!" (×100000)
  • selection sort using {29, 10, 14, 37, 13, 54, 76, 98, 12, 1, 64, 13, 512} (×100000)
  • sieve of eratosthenes using n = 10000 (×10000)

For-loops were wrapped around each program to obtain practical execution times (so as to avoid measuring in microseconds or nanoseconds for some languages); the numbers next to the programs correspond to the number of iterations.

The process of homogenizing such programs across different languages is one that comes with a plethora of challenges. Some languages have similar syntax but include different operations within the (seemingly) equivalent function; some do things automatically that others don't. Even though the programs used in this research were quite simple, and even though the chosen languages had very similar syntax, there are still countless corner cases one could encounter — not to mention the trouble that would be encountered if the languages and programs were selected randomly.

Program descriptions

100 doors creates an array of 100 booleans (doors), sets them all to false, then iterates over all of them and toggles them; then does the same but only with every 2nd boolean, then every 3rd, and so on, until it only toggles the 100th boolean; then it prints out every boolean's state.

Factorial calculates the factorial of n iteratively and then prints it; 10 is used as n here.

Fibonacci sequence calculates and prints Fibonacci numbers up to the n-th number recursively; 10 is used as n here.

FizzBuzz writes each number from 1 to 100, but writes Fizz when the number is divisible by 3, Buzz when it's divisible by 5, and FizzBuzz when it's divisible by 15.

Reverse a string reverses a string and then prints both the original and the reversed string; "Hello, World!" is used here.

Selection sort performs a selection sort on a given array of integers and prints out the result; this array is used here: {29, 10, 14, 37, 13, 54, 76, 98, 12, 1, 64, 13, 512}.

Sieve of eratosthenes calculates all prime numbers up to n; 10000 is used as n here.

The C# versions of each program have been provided in the appendix of the original paper as examples. All of the inputs have been hardcoded for ease of automation.

Methodological limitations

Because of the limited time, resources, and scope of this project, along with the limitations in skill, knowledge, and experience on the part of the researcher, there were certain limitations to the research method. The programs chosen for this project were mostly selected based on how easy they would be to verify given the knowledge and skill of the researcher, whereas in a project with more resources, larger, more realistic, and more diverse programs could be selected for a more accurate and comprehensive end result. Similarly, the selection of languages was mostly based on how accessible they were and on their relative lack of complexity; once again, in a project with more resources, this selection could be done more randomly. Most of all, the sample size of both languages and programs could be higher in such a project. As we will see in the conclusions, some conclusions cannot be pronounced definitive purely because of these limitations.

Script and measurements

After all of the source code had been collected, a script was used to automate the running of these programs in Docker using WSL, meaning Bash was used instead of a Windows-native alternative like Batch (the script was mostly written by AI, since the researcher had next to no knowledge of Bash). Originally, an attempt was made to use Batch, but support and available information on how to use it for these specific purposes proved to be lacking. The decision to use Docker was made to minimize interference from other processes while the programs were running. Docker is also highly useful because containers can be stopped and started within seconds, which makes scripting hundreds of runs easier. For the majority of the selected languages, an official Docker image is available; e.g., popular languages like Python and Go have official images maintained by the Docker community. Depending on the chosen languages, there will likely need to be some usage of non-official images (luckily, Docker Hub contains many qualitative alternatives), e.g., for Lua.

First, the script starts a separate container for each language and initiates some extra procedures, like installing the required libraries into each container. For the static languages, the script builds each program first within the container using the unique build commands for each language, combined with the path of the program. Most languages have built-in timing features, which allow for a more accurate measurement; if we were to do the timing in the script, we would be including the startup time for the runtime and other processes in our measurement. Unfortunately, Awk doesn't have built-in timing features, so instead the script measures an average overhead time by repeatedly running an empty program and calculating a trimmed average. This average should correspond to the average time it takes all the processes Awk needs to start. When that average time is subtracted from the execution time of each Awk program, the result should be an approximation of the actual execution time (although even if this overhead was not accurate, it turned out to be rather insignificant compared to the actual execution time for Awk programs).

Next, the programs are run one by one; the number of times this is done can be passed as a parameter when starting the script. Using this round-robin approach (instead of running a program the specified number of times and then moving on to the next one) ensures that the measurements are only minimally affected by temporary changes in processing speed, especially when the number of iterations is high. In terms of output, the script logs what it is doing in detail in a log file for easier debugging and prints the final results into a table in a CSV file. The Bash script is included in the appendix of the original paper.

This script should be run with a high number of iterations: the more iterations, the more accurate the results, and the more similar two separate runs of the script will be when compared. In this project, the script was run twice with 120 iterations; the results were sufficiently similar to be acceptable for use in analysis. A higher number of iterations would have resulted in more similarity, but it would have taken an unreasonable amount of extra iterations to produce a significantly higher accuracy (unreasonable meaning it would have taken too much time to run the script so many times — 120 iterations took around 16 hours, though a higher level of available resources could decrease this time dramatically).

Finally, Python was used to calculate the correlation, the details of which are explained further along. Some graphs were also made to clarify the relationship between the variables.

Results

Raw data

ProgramCC#SwiftGoJavaRustKotlinPythonJavaScriptPHPPerlLuaJuliaAwk
100 doors113895139260019075051891137522427356701104161684943
factorial6126464720445206862025213167593524
fibonacci sequence1138545047251108198110520122525146871861037620819097
fizzbuzz42103534858018174561828536221259321674857451661
reverse a string2214420713039950340222347818153239411601508
selection sort17520821388187529504462955172735421874170311031654411005
sieve of eratosthenes25716462141768221011822266274727901356347918145652220753

Table 1: Benchmark results (ms) for 120 runs.

The results have a massive range, even for the same programs. This is further illustrated in the boxplots below.

Boxplot of mean execution time per language.
Figure 1: Boxplot for mean time of every language.
Boxplots of mean execution time, static versus dynamic languages.
Figure 2: Boxplot for mean time of every language (static and dynamic separate).

The two main outliers are Awk and Julia, both of which are dynamic, which can be seen clearly in Figure 2: the range for dynamic languages is many times larger than that of the static languages. It isn't just the range that's smaller for static languages — the median (1999 for dynamics, 860 for statics) and the average (4235 for dynamics, 874 for statics) are also much smaller for the static languages.

Correlation

Before calculating the correlation, we have to decide how to group our data points. We have 98 data points in total, but only 14 languages; if we were to use all of these data points for the correlation, we would be artificially inflating the sample size. Instead, we compute the mean of each language and use one data point per language to compute the correlation.

Point-Biserial correlation

Let's start with a simple approach: the Point-Biserial correlation. This is in effect the same as a Pearson correlation, except that one of the variables is dichotomous (type system) while the other is continuous (execution time). For clarity, the Pearson correlation coefficient rr is a measure of the strength and direction of the linear relationship between two quantitative variables XX and YY. (Wikipedia contributors, 2025b)

This formula — for the Point-Biserial correlation — is equivalent to calculating the Pearson correlation with the XX value coded as ones and zeros. For this calculation, static languages have been coded as 1 and dynamics as 0:

rpb=M1M0syn1n0n2r_{pb} = \frac{M_1 - M_0}{s_y}\sqrt{\frac{n_1 \cdot n_0}{n^2}}
  • M1M_1 is the mean of the static observations
  • M0M_0 is the mean of the dynamic observations
  • sys_y is the standard deviation of all observations
  • n1n_1 is the number of static observations
  • n0n_0 is the number of dynamic observations
  • nn is the total number of observations

When we use Python to calculate this correlation, we end up with a coefficient of 0.4695-0.4695 and a p-value of 0.09030.0903. These results can be interpreted as follows: a value of 0.4695-0.4695 indicates a moderate negative relationship between the variables. Considering the way the binary variables have been assigned, this means that static languages tend to have lower execution times than dynamic ones. However, a p-value of 0.09030.0903 is significantly above the standard 0.050.05 threshold; practically, this means that if there were no difference in the distributions of both groups, there would be a 9.03% chance that we would observe a correlation this extreme or more.

Rank-Biserial correlation

Since there are significant outliers, this first method was rather naive. The Point-Biserial correlation also assumes that the continuous variable is normally distributed, which is not the case here; this might also explain the high p-value. Instead, we can use the Rank-Biserial correlation, as it is much less sensitive to outliers because it is based on the ranking of data instead of raw values. It is also more robust with such a small sample size, and it does not assume a normal distribution. The Rank-Biserial correlation is derived from the Mann–Whitney U test, a nonparametric test for comparing two independent samples: it ranks all observations, computes the sum of the ranks for both groups, and calculates the U statistics based on those ranks. (Wikipedia contributors, 2025a)

These are the required formulas:

U1=R1n1(n1+1)2,U2=n1n2U1U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}, \qquad U_2 = n_1 n_2 - U_1
  • n1n_1 is the number of static observations
  • n2n_2 is the number of dynamic observations
  • R1R_1 is the sum of the ranks in the static group
  • R2R_2 is the sum of the ranks in the dynamic group (not needed for this version of the formula)

Based on these two statistics, we can derive the Rank-Biserial correlation:

rrb=U1U2n1n2r_{rb} = \frac{U_1 - U_2}{n_1 n_2}

Once again we use Python to calculate this. We end up with a coefficient of rrb=0.4052r_{rb} = -0.4052, which is reassuringly similar to the calculated Point-Biserial correlation. Just like the first method, it indicates a moderate negative relationship, meaning static languages tend to have lower execution times than dynamic ones. The Mann–Whitney U test also gave a p-value of about 0.000550.00055, which tells us that the difference between the average execution time of each group is statistically significant (the chance that the rank differences observed are this extreme or more, given no difference between the distributions of both groups, is 0.055%).

Per-program analysis

We can also compute this same correlation per program, which can reveal whether dynamic languages prevail in performance for any of the programs, and whether any individual programs are skewing the results massively in favor of static languages. Below is a table of results for the Rank-Biserial correlation per program, along with their respective effect sizes and p-values, computed using Python:

Programp-valueRank-BiserialEffect size
100 doors0.0728−0.5918medium–large
factorial0.1649−0.4694medium
fibonacci sequence0.0023−0.9184very large
fizzbuzz0.4557−0.2653small
reverse a string0.0973−0.5510medium
selection sort0.3829−0.3061small–medium
sieve of eratosthenes0.0262−0.7143large

Table 2: Per-program Mann–Whitney p-values and signed Rank-Biserial effect sizes.

The table above coincides very well with our previously observed correlation: not a single program displays a deviation in direction from our overall correlation; all of the coefficients are negative, meaning the results all suggest that static type systems tend to have better performance. There is a wide variation in effect size, however, meaning that the results of some programs are closer across languages. Some of the p-values also suggest that the correlation is not at all statistically significant — like that of the selection-sort program. The fibonacci-sequence program in particular shows a very large and statistically significant correlation, with a p-value of about one-fourth of a percentage point. Overall, the programs that contribute most to the overall correlation are the sieve of eratosthenes and the fibonacci-sequence programs.

Per-language analysis

To further exemplify the degree to which the pattern of statics being faster than dynamics persists across languages, we can compute the mean of each language. This way, we can see whether all static languages are faster than all dynamic languages, or whether there are outliers:

LanguageAvg time (ms)
C104.00
Rust476.43
Go675.00
Swift860.86
Lua886.29
PHP894.14
C#968.86
Python1243.71
Java1513.57
Kotlin1521.86
JavaScript1998.71
Perl2849.71
Awk8498.71
Julia13277.14

Table 3: Average execution time by language, sorted from fastest to slowest.

Bar chart of average execution time by language, fastest to slowest.
Figure 3: Average execution time by language, sorted from fastest to slowest.

Not every static language is faster than every dynamic language, although there is still a general trend in that direction, even when ignoring Awk and Julia. Both Lua and PHP are among the fastest languages, while Java and Kotlin are some of the slowest. So out of the 14 languages, 4 are outliers in the general trend of static being faster than dynamic. We can also see a rather massive level of performance spread: the fastest language (C at 104 ms) is about 128 times faster than the slowest (Julia at 13277 ms). Overall, this chart supports the observed trend even further, even when ignoring the two main outliers.

Conclusion

Interpretation of results

Based on the obtained results, we can conclude that there is a moderate negative correlation between type system and performance, meaning that static languages tend to perform better.

Something else we can learn from the results is the effect of the genre of program on performance. As we can see in the table of correlations per program, the sieve and fibonacci-sequence tasks display a much more significant correlation between type system and performance, while tasks like fizzbuzz and selection sort display a very weak and statistically insignificant correlation. This suggests that heavier tasks, such as recursive ones (the fibonacci sequence), tend to perform better in static type systems than tasks using simple loop structures and I/O operations.

Evaluation of criticisms

The conclusions drawn above can be called into question based on a few criticisms; the first being the small sample size. Even though we had 98 individual data points, to prevent artificial inflation of the sample size (which could skew the correlation calculations) we used the mean of the 14 languages. As we can see in the boxplots, there were major outliers present. Both of these outliers were dynamic, but it is entirely possible that this observation is coincidental, and that if different languages had been chosen we would have obtained a very different distribution of outliers. Although this criticism is valid to some extent, it does not mean that the obtained correlation is invalid: the method used to compute the overall correlation — the Rank-Biserial correlation — was based on rank, not raw value, which partially mitigates the effect of those outliers. However, the relatively small sample size of languages (14) allows these outliers to remain significant. What also dampens this criticism somewhat is the fact that, when looking at the mean times for each language, we can still clearly see the same pattern even when ignoring those two dynamic outliers (although not every static language is faster than every dynamic one).

Another valid criticism is that the idea that "heavier" programs (such as recursion-heavy ones) exemplify this correlation further is based on the results of only 7 programs. This observed pattern can therefore not be declared definitive.

Some other, more general criticisms are that, as stated earlier, the programs and languages were not selected randomly, and that the sample size simply isn't high enough, making it difficult to state any conclusions definitively.

Additional considerations

To be clear, an observed correlation between static typing and better performance does not entail that static typing causes better performance. There are many other factors which might generally coincide with static typing that cause better overall performance.

This research also ignores certain factors about each language, such as the compilation stage for the compiled languages and the startup time for the compilers and interpreters; since we only measured execution time, compilation was cut off completely from the measurements. If a language has horrible compilation speeds but great execution speeds, it would perform well in this type of research compared to a language with fast compilation but slow execution — i.e., even if the first language took 10 times longer after pressing the start button, it might still get a better result (though, practically, this would only matter when, for example, creating code in an editor; most of the time, programs will already have been compiled). But overall, execution time is still the most logical metric for performance.

Another consideration is the simplicity of these example programs compared to real-world production code. Real-world, large-scale code could give us a radically different result; e.g., real-world programs often integrate a vast array of libraries, or use concurrency, which none of the implementations here do. This correlation alone is also not a fully-fledged reason to pick static type systems over dynamic ones for practical use cases — there are many other considerations to make, like ease of use, complexity, and versatility.

The consensus among computer scientists seems to be that dynamic type systems are generally more useful for prototyping, while programs written in static languages take more time and effort but are better for large-scale, long-term projects. This paper supports that claim, at least in terms of the idea that statically typed languages display better performance and therefore have an advantage for large-scale projects.

Overall, the conclusions we can draw are that the results suggest a moderate correlation between type system and performance, that static languages tend to perform better, and that heavier tasks (e.g., recursive programs) exemplify this correlation even further. But this conclusion is not a final verdict; more expansive and fleshed-out research could provide a more definitive result.

Incentive for future research

Considering the limitations of this paper, there is great incentive for a successive research project following a similar structure. A follow-up like this should have far more resources at its disposal to achieve significantly independent and worthwhile results. With these resources, the selection of languages could be not only much larger but also random, which should achieve a much more statistically significant result. Bias caused by interfering processes could be minimized by running the programs on a "clean-slate" machine, and the number of iterations and programs could be scaled up with a more powerful processor (along with the program complexity). Using more programs could also give a clearer view of the observed correlation between the "heaviness" (such as recursivity) of a program and the tendency of static languages to perform (even) better.

References

  • GeeksForGeeks. (2019). Type systems: dynamic typing, static typing & duck typing. geeksforgeeks.org
  • Pierce, B. C. (2002). Types and programming languages. Cambridge, MA: MIT Press.
  • Wikipedia contributors. (2025a). Mann–Whitney U test. en.wikipedia.org
  • Wikipedia contributors. (2025b). Pearson correlation coefficient. en.wikipedia.org

Appendix

The original paper includes a full appendix with the Bash benchmark script and the C# source code for every program. Those are omitted here for readability; the complete appendix is in the original PDF.