How do you measure software quality?

There are two major license types in the free/open source software world: copyleft (e.g. GPL) and permissive (e.g. BSD). Because of the different legal ramifications of the licenses, it’s possible to make theoretical arguments that either license would tend to produce higher quality software. For my master’s thesis, I would like to investigate the quality of projects licensed under these paradigms, and whether there’s a significant difference. In order to do this, I’ll need some objective mechanism for measuring some aspect(s) of software quality. This is where you come in: if you have any suggestions for measures to use, or tools to get these measures, please let me know. It will have to be language-independent and preferably not rely on bug reports or other similar data. Operating on source would be preferable, but I have no objections to building binaries if I have to.

The end goal (apart from graduating) is to provide guidance for license selection in open source projects when philosophical considerations are not a concern. I have no intention or desire to turn this into a philosophical debate on the merits of different license types.

6 thoughts on “How do you measure software quality?”

  1. I would say the proof is in the pudding.

    I think that there’s a strong argument that the success of a piece of software can be gauged by the number of users of that software. In a vacuum, that makes a completely closed-source license the single most successful licensing scheme ever. I’m pretty sure that’s not what you’re going for.

    Instead, given your goal, would it make sense to do a survey of software that’s hosted on places like GitHub and SourceForge, weight it according to the number of separate contributing authors, and discern which license has the most active community? With enough samples, you could probably correct for financial backing, project lifetime, etc., so that things like the Linux kernel and Firefox don’t overly weight things.

  2. @Matt, you raise good points, but I don’t think adoption = quality. For that matter, the activeness of a community doesn’t necessarily equal quality either (though with the right governance, I’d expect more active communities to trend toward higher quality). There is some merit to investigating the effect of license on community participation, but that’s not what I’m aiming at. Maybe someone else will do that research? Or maybe I’ll suffer brain trauma at some point and decide a PhD sounds like a good idea.

  3. Ben, this ain’t easy. The obvious measure is defects, and if you have access to the Bugzilla archives for a project this isn’t too hard. However, to make meaningful comparisons between different pieces of software, you need to calculate defects as a function of some measure of size. The best measure of size is still function points. There are a few things you can use as a proxy for function points, but at the end of the day, function points are the realistic way to do it.

    Of course you can look at SLOC, which is easier to get, but FP/SLOC is dependent on language – heavily dependent. Again, an easy approach is to look at SLOC and convert via one of the tables out there. Might not be the best, but at least it is a way that isn’t too burdensome, and if you get enough samples, probably not too far off the mark.

    You need to be careful because there is a good chance that there are other characteristics that differ between software released under the different licenses, and you want to look to be sure you aren’t measuring those characteristics.

    Suppose, for example, that a larger portion of the BSD licensed SW was developed commercially rather than by a community. The development environment is likely to be quite different, and you might see quality differences as a result of the environment rather than the license. Of course, the pitfall here is that you can’t know the differences. If a lot of the BSD software was written by people wearing baseball caps, and that caused the difference in quality, you would never guess that.

    When you are doing this sort of thing, R is your friend. You’re gonna need to get real cozy with R (or one of the high-priced spreads like Minitab or JMP). Having used all three (and even taught Minitab to developers at Motorola, SAS, Xerox and others), I really prefer R, even though it is harder to learn. For something like a thesis it is a big win because every analysis is documented by definition, whereas with Minitab or JMP reproducing your analysis requires a lot of extra work initially that nobody does.

    Look at some of the work of Capers Jones. He was prolific but on target.
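The size normalization suggested in the comment above (defects divided by function points, with function points estimated from SLOC through a per-language conversion table) could be sketched roughly as follows. This is only an illustration: the `SLOC_PER_FP` ratios are placeholder values rather than figures from any published table, and the `Project` record and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median

# ASSUMPTION: placeholder SLOC-per-function-point ratios, chosen only to
# make the arithmetic easy to follow. Replace them with values from a
# published conversion table before drawing any conclusions.
SLOC_PER_FP = {"c": 100.0, "java": 50.0}


@dataclass
class Project:
    """Hypothetical record for one sampled project."""
    name: str
    license_family: str  # e.g. "copyleft" or "permissive"
    language: str
    sloc: int
    defects: int


def estimated_function_points(project: Project) -> float:
    """Estimate size in function points from SLOC and language."""
    return project.sloc / SLOC_PER_FP[project.language.lower()]


def defects_per_fp(project: Project) -> float:
    """Defect density: defects per estimated function point."""
    return project.defects / estimated_function_points(project)


def median_density_by_license(projects: list[Project]) -> dict[str, float]:
    """Group projects by license family and take the median density."""
    groups: dict[str, list[float]] = defaultdict(list)
    for project in projects:
        groups[project.license_family].append(defects_per_fp(project))
    return {family: median(values) for family, values in groups.items()}
```

The median is used here instead of the mean so that a few very large or very buggy projects do not dominate a group. It does nothing about the confounding factors the comment warns about, such as commercial versus community development.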

  4. hi,

    this is definitely a challenging task.

    The question is: How do you define software quality?

    While defects per FP or defects per SLOC looks like a reasonable approach, there may be issues. For example, this metric does not factor in maintainability, usability or a number of other non-functional requirements. It also depends on how many people use a piece of software: more users means more bugs being found. You also have to factor in the project history. Maybe the most recent version is really good stuff, but you use all those bug reports from years ago. Or the most recent version appeared a few days ago and there are just very few bug reports. Or there is just one bug report, because the bug tracker is not used correctly and everything is added to the same report as a comment. It might also be that this one bug is so serious that no one uses the tool.

    There are some standards out there. You may want to have a look at SQUALE, even if the tool they provide is Java-specific. Also, most open source projects lack a quality process. That means there are no test cases, test plans or continuous integration. SQUALE integrates into such tools, so it may not be usable with such projects. One might say it actually measures the effort that is made to improve or ensure quality. However, they give you an idea of what you may want to have.

    Also reading ISO/IEC 9126 may be a good idea.

    Please mind that software quality is open research. There are several groups which work on automated approaches to improve or measure software quality. Off the top of my head, I recall Andreas Zeller at Saarland University and Nikolai Tillmann at Microsoft Research, but there are many more. The Cooperative Bug Isolation project is another example.

    The main problem is that quality comes down to whether the program has certain properties, like functional correctness. Unfortunately, Rice’s theorem tells us that every non-trivial semantic property of a program is undecidable. “Quality”, no matter how you define it, is a non-trivial property, so it is undecidable. Thereby all measures are no more than an educated guess.


  5. You need some automatic method of counting defects. (Counting bug reports brings in lots of extra parameters, such as number of users, ease of filing a bug report, technical knowledge of users, etc.)

    There are some tools for finding defects: compiler warnings, valgrind, Coverity, etc. If there were a perfect tool for this, it would be easy for people to find and fix the issues in their code, so overall code quality would be much higher. But you may just end up measuring whether the project has used these tools already. For example, some projects have -Werror in their makefiles and so will have no compiler warnings, while others consider some compiler warnings to be false alarms that should be ignored.
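As one way to turn the compiler-warning idea from the comment above into a number, the sketch below counts distinct warnings in a saved build log and normalizes by size. It assumes GCC/Clang-style diagnostics of the form `file:line:col: warning: message` and Unix-style paths without colons. The function names are hypothetical, and the SLOC figure has to come from a separate counting tool.

```python
import re

# Matches GCC/Clang-style warning lines such as:
#   src/main.c:42:7: warning: unused variable 'x' [-Wunused-variable]
# ASSUMPTION: paths contain no colons (so Windows drive letters would not match).
WARNING_RE = re.compile(
    r"^(?P<file>[^:\s][^:]*):(?P<line>\d+):(?:(?P<col>\d+):)?"
    r"\s*warning:\s*(?P<msg>.*)$"
)


def count_warnings(build_log: str) -> int:
    """Count distinct warnings in a build log.

    The same warning is often printed once per translation unit that
    includes a header, so (file, line, message) triples are deduplicated.
    """
    seen = set()
    for raw in build_log.splitlines():
        match = WARNING_RE.match(raw.strip())
        if match:
            seen.add((match["file"], int(match["line"]), match["msg"]))
    return len(seen)


def warnings_per_kloc(build_log: str, sloc: int) -> float:
    """Distinct warnings per thousand source lines of code."""
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    return count_warnings(build_log) * 1000 / sloc
```

For the comparison to mean anything, every project would have to be built with the same compiler and the same warning flags, overriding whatever the project's own makefiles set. Otherwise the number mostly reflects each project's warning policy, which is exactly the caveat raised above.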

  6. Pingback: License Management for Installed Software — Not Only Luck
