
Use number of semicolons instead

#cs

We often see the complexity of a codebase described in terms of how many lines of code (LOC) it contains. We hear that large projects such as Chromium contain over 35 million lines of code, while other projects attempt to do more with fewer lines. Yet it is also apparent, especially for smaller projects, that the number of lines of code is not a very accurate depiction of how complex a project actually is. Comments, constructor calls spread across multiple lines, and similar style choices that turn what could have been one line into several all inflate this flawed metric.

One reason LOC has stuck as a metric is its simplicity. Counting the lines in a text file is trivial, unlike alternatives such as counting the number of function calls. It also appears intuitive: a codebase with 100,000 lines of code clearly has more to it than one with only, say, 10,000.

Perhaps the biggest offender here is comments. Documentation is an invaluable part of a codebase, so in any LOC count a major chunk inevitably comes from comments alone. Of course, we could count the lines of code excluding comments (LOCEC), but that defeats the reason LOC is so popular in the first place: its simplicity. Having to decide whether a line is a comment means the metric is no longer language-agnostic, and it is not as easy as filtering on the starting tokens of a line (consider multi-line comments).

So, is there an alternative that matches the simplicity of LOC, but is also more accurate?

Introducing semicolons

C-like languages offer a clear solution: semicolons.

Semicolons delimit the end of a logical statement in these languages. They are also what allow you to fit multiple “lines” of code onto one line, which highlights another shortcoming of LOC: you can deflate the number just by stuffing several statements onto one line with semicolons. This isn’t limited to C and C-like languages; even Python allows it!

This is why number of semicolons (NOS) is better: by latching on to the token the compiler uses to delimit logical statements, we exclude comments by default (unless a comment itself happens to contain a semicolon), and we treat constructors and function calls spread over multiple lines as a single entity, all while preserving the simplicity of counting lines of code.
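To make the comparison concrete, here is a minimal Python sketch that computes both metrics for the same three statements formatted two different ways. The helper names (count_loc, count_nos) and the sample C snippets are made up for illustration; they are not part of any existing tool.

```python
def count_loc(source: str) -> int:
    """Count lines the naive way: every line of text counts, comments included."""
    return len(source.splitlines())


def count_nos(source: str) -> int:
    """Count semicolons, wherever they appear in the text."""
    return source.count(";")


# The same three statements, formatted two different ways (hypothetical sample).
SPREAD_OUT = """\
// Set up the point.
Point p = make_point(
    1,
    2
);
int x = p.x;
int y = p.y;
"""

COMPACT = "Point p = make_point(1, 2); int x = p.x; int y = p.y;\n"

if __name__ == "__main__":
    for name, src in (("spread out", SPREAD_OUT), ("compact", COMPACT)):
        print(f"{name}: LOC={count_loc(src)}, NOS={count_nos(src)}")
```

The spread-out version has 7 lines and the compact one has 1, yet both contain exactly 3 semicolons, one per statement.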

What about control structures?

Of course, this means that certain control structures, such as if-else and while statements, will be ignored (a C-style for loop header, by contrast, contributes two semicolons of its own). You could try to solve this by also adding braces into the count with NOSAB (number of semicolons and braces), but this runs the risk of overcounting in languages that use braces for struct construction, or in Rust, for example, where braces also serve as placeholders in format strings.
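The overcounting risk is easy to see on a small example. The sketch below is only an illustration under one assumption: NOSAB is taken to mean semicolons plus opening braces, so that each block is counted once. The post does not pin down whether closing braces count too, and the Rust sample and function names are hypothetical.

```python
def count_nos(source: str) -> int:
    """Number of semicolons."""
    return source.count(";")


def count_nosab(source: str) -> int:
    """Semicolons plus opening braces (assumed definition of NOSAB)."""
    return source.count(";") + source.count("{")


# Hypothetical Rust snippet: two blocks (fn and if), one struct literal,
# and one format string with two {} placeholders.
RUST_SAMPLE = """\
fn main() {
    let p = Point { x: 1, y: 2 };
    if p.x > 0 {
        println!("{} {}", p.x, p.y);
    }
}
"""

if __name__ == "__main__":
    print("NOS:  ", count_nos(RUST_SAMPLE))
    print("NOSAB:", count_nosab(RUST_SAMPLE))
```

Here NOS is 2 and NOSAB is 7. Of the five opening braces, only two open a block; the other three come from the struct literal and the format placeholders.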

However, I claim that this undercounting is negligible overall, and that it even prevents the count from being inflated by useless control structures. A control structure is only useful if it runs some code within it, and that code must contain a semicolon that NOS already counts.

How to count the number of semicolons

Just as it is trivial to count lines of code, it is trivial to count semicolons. This can be done in a bash one-liner with grep -ro ';' . | wc -l, where the r flag searches directories recursively and the o flag prints each match on its own line, so that wc -l counts every semicolon rather than every matching line.
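For anyone without grep at hand, roughly the same count can be sketched in Python. This is an approximation rather than an exact equivalent: it reads every file as raw bytes, so results can differ from grep on binary files, and the function name and the optional extensions filter are assumptions added for illustration.

```python
import os
from typing import Optional, Tuple


def count_semicolons_in_tree(root: str = ".",
                             extensions: Optional[Tuple[str, ...]] = None) -> int:
    """Sum the semicolons in every file under root.

    extensions is a hypothetical convenience filter, e.g. (".c", ".h");
    when it is None, every file is counted.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if extensions is not None and not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            try:
                # Read bytes so files in any encoding can be counted.
                with open(path, "rb") as handle:
                    total += handle.read().count(b";")
            except OSError:
                # Skip files that cannot be opened or read.
                continue
    return total


if __name__ == "__main__":
    print(count_semicolons_in_tree("."))
```

Passing something like extensions=(".c", ".h") keeps build artifacts and version-control metadata out of the count, which the bare grep one-liner would include.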
