Debugging like a boss

This article is about debugging.

Debugging is the process of finding and resolving defects or problems within a computer program that prevent correct operation of computer software or a system.

— Wikipedia

I’ll focus on the first part of the above definition: finding a bug. Most of the methods and tips discussed here may seem simple or common, but my experience has taught me that many developers, especially the less experienced ones, struggle in such situations.

I’ll present the most common techniques first, since they are the ones you will probably use most often. The techniques are not bound to a specific programming language or technology, but occasionally I may refer to specific examples.

So, let’s start!

Trust no one

Yes, I know… but it has to be said.

To do proper debugging you have to doubt everything!

You have to doubt your code (written by you/your colleagues/whoever)
You have to doubt your logs
You have to doubt the libraries that you have used
You have to doubt the language implementation that you are using
You have to doubt the OS that you are using
You have to doubt the whole infrastructure and external dependencies
And finally you have to doubt YOUR CODE

99.999% (see? 5 nines!) OF THE TIME THE BUG IS IN "YOUR" CODE!

Now, having said that, we may proceed :-)

Listen to your users, ask questions

Many times your production system users will find the bug first. Other times the manual testers will find the bug while testing. In any case someone else will find that nasty bug before you do.

Asking the right questions will save you much time and frustration. The most favorable outcome is to establish specific reproduction steps.

Then the problem could be crystal clear or at worst you will know where to start and continue your search from there.

However, if establishing the reproduction steps isn’t easy, there are still some questions that may be worth asking:

Ask when the incident took place. (This will help you especially when checking the logs.)
If it is about an existing application that was functioning without issues, ask the user whether she performed any unusual action.
Ask the user to perform the same actions and watch her while she does so.
Ask whether the issue has happened before or this is the first time. If possible, establish how often the issue appears.
If you cannot establish the reproduction steps, ask the questions required to at least exclude some scenarios that definitely won’t lead to the bug.
If the issue was introduced after a new version, check a diff between your releases. If you are using git, git diff is especially useful for doing so (other VCSs have similar functionality). Tagging your releases is also extremely handy in that case (see git tag). If not, check whether something else in the system has changed.

However, do not forget, TRUST NO ONE! The users may unintentionally report something wrong.

Check the logs

If you are reading this and you don’t have logs, then you should! Go now and add logging to your application! However, don’t overdo it! Try to keep only the essential logging!

Checking the logs may give you the exact information about the error that happened. An error status or exception stack trace may lead you straight to the faulty line of code and give you a very good understanding of the problem. Most of the time this should be enough to get you going.

In case there is an error that doesn’t make sense, the log entries before or even after the incident may give you a better understanding of why it happened. For example, a race condition may become obvious if you see log messages in an unexpected order.

Finally, even if there is no entry in the logs that points to the problem, you may verify the reproduction steps acquired from the users or deduce them yourself, and that is a pretty good start.

Usually there are also other logs besides those of your application that are worth checking. Get familiar with your infrastructure and don’t forget to gather every bit of useful information while you can.
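To spot an unexpected order of messages, each log entry needs at least a precise timestamp and the name of the thread that wrote it. Here is a minimal Python sketch of such a setup, not taken from any real system: the logger name "orders", the worker function and the message texts are all made up for illustration.

```python
import logging
import threading

# The default asctime already includes milliseconds; the thread name
# shows which thread wrote each entry, so interleaving becomes visible.
LOG_FORMAT = "%(asctime)s %(levelname)s [%(threadName)s] %(name)s: %(message)s"


def build_logger(name, handler=None):
    """Return a logger whose entries carry a timestamp and the thread name."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if handler is None:
        handler = logging.StreamHandler()  # writes to standard error
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


def worker(logger, order_id):
    # Hypothetical unit of work: log when it starts and when it ends.
    logger.debug("start processing order %s", order_id)
    logger.debug("done processing order %s", order_id)


if __name__ == "__main__":
    log = build_logger("orders")  # "orders" is an illustrative name
    threads = [
        threading.Thread(target=worker, args=(log, i), name=f"worker-{i}")
        for i in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

If a "done" entry of one worker shows up between the "start" and "done" entries of another, you know the two ran concurrently, which may or may not be what you expected.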

Fire up the debugger

I know many people who aren’t fans of using the debugger or even consider it lame. However, I think the debugger is the handiest tool for understanding and validating the flow of the program, or for checking that the state is the expected one at any given point in time.

Of course, it is not at the top of my list because it usually requires somewhat clear reproduction steps, and I think it is more time consuming, especially for a larger application, than the previous methods.

However, definitely invest some time to learn at least the basics of using the debugger. It pays off.

Debugging in the browser

I have decided to add a separate section specific to JavaScript because I have observed that people often under-utilize the debugging capabilities of their browser.

Modern browsers have some kind of developer tools built-in, often invoked by pressing F12, that include a debugger.

You may add breakpoints in your JavaScript code and inspect the state of your program at any point.

Also, you may inspect other insightful things such as the raw headers of your requests and responses, the status codes returned, the body of the responses, JavaScript errors etc.

Again, as before, master these tools. Whichever side of the wire you write code for, having a good command of client-side debugging techniques is extremely useful.

Review that code

If you don’t have any other hint about what the problem is, then a fast approach is to start reading the code, provided you are familiar with the code-base.

The rubber duck technique is very useful in that case. Try to verify the correctness of your code while reading it. Pairing with a colleague instead of a rubber duck may be even faster.

If the bug isn’t obvious by now, things get trickier.

Add extra logging

In order to trace the root of the evil you may add extra logging in the places that you suspect to be the culprit.

I know many people who prefer to add extra traces instead of using the debugger. I actually disapprove of that usage.

Add the extra logs and send your program for testing (if you are into that) or release it to production in case you absolutely cannot reproduce the problem.

Just make sure to add all the essential logging, so that the next time the bug happens will be the last.

I once faced a bug in a process that was consuming RabbitMQ messages. The exact same system was running in a dozen similar installations, but one particular installation was demonstrating strange behavior. My colleagues and I couldn’t figure out why this was happening.

Adding some extra logs just for this case showed us that there were two competing instances of the consumer process running where there should have been only one.

The fault in that case happened because at some point there had been an undocumented manual intervention in the init script that was starting this process.

This was a very special kind of bug that couldn’t be reproduced in our development or testing environment.

The point is, if you cannot reproduce the bug, do not despair. Add extra logging wherever you think it will help you. Feel free to add unnecessary logging; you can always, and probably should, remove it later. Don’t overdo it though: extra logging adds a tiny overhead that may spawn new nightmares in a badly designed piece of software…

Also, there are some cases where working with the debugger may be bothersome.

There are two cases that come to mind.

Trying to find slow code. Running in the debugger is often slow. A fast way to measure your code is to add time-tracking traces in many places (e.g. with a Stopwatch) and work your way to the source of the problem. In some cases you may achieve the same result with a profiler, but I usually find it easier and faster to place some time-tracking code in specific places.
Working with many threads. Usually working with the debugger is OK, but there are some cases where properly inspecting the flow of the program becomes challenging.

All production logging should be written consistently to a file. Permanent messages sent to standard output should be avoided unless they are meant to be displayed to the user. However, you may add temporary messages to standard output in the dev/test environment during debugging, but do not forget to remove them later!
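The time-tracking traces from the first case could look roughly like the Python sketch below. It is only an illustration of the idea: the stopwatch helper and the two steps it measures (load_rows and transform) are hypothetical, and the slow step is simulated with a short sleep.

```python
import time
from contextlib import contextmanager

# Collected measurements: label -> list of elapsed seconds.
timings = {}


@contextmanager
def stopwatch(label, sink=None):
    """Measure the wall-clock time spent inside the `with` block."""
    sink = timings if sink is None else sink
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the sample even if the measured code raised.
        sink.setdefault(label, []).append(time.perf_counter() - start)


def load_rows():
    # Hypothetical slow step, simulated with a short sleep.
    time.sleep(0.05)
    return list(range(1000))


def transform(rows):
    # Hypothetical fast step.
    return [r * 2 for r in rows]


if __name__ == "__main__":
    with stopwatch("load_rows"):
        rows = load_rows()
    with stopwatch("transform"):
        transform(rows)
    # Print the slowest steps first.
    for label, samples in sorted(timings.items(), key=lambda kv: -sum(kv[1])):
        print(f"{label}: {sum(samples) * 1000:.1f} ms over {len(samples)} call(s)")
```

Start with a few coarse blocks, then move the stopwatch inside whichever block turns out to be the slowest and repeat until you reach the offending lines.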

Moar logs

If your system communicates with other systems or services, many times the answer could be in a foreign log file.

For example, if your application talks to the database and it is slow, checking your database for table locks or slow queries may give you a valuable hint for continuing your search.

Sometimes you may need to explicitly enable logging for such systems and try to reproduce the problem or wait until the problem happens again.

Get real data

Often, the demon hides in the data. If a bug has been spotted and is reproducible in a foreign system but not in the development/testing environment, then try to replicate it with the specific data.

This may be as easy as getting an SQL dump from a remote machine and importing it locally, or very tricky, e.g. when the database is way too big or there is essential information that you are not permitted to get locally.

Sometimes, if it is tricky, it may be reasonable to create a custom mechanism to make this process easier. It will save you time in the future.

It’s never too late to write a test

This point is not about testing as a precaution to avoid bugs in the first place.

What I am suggesting is to write some kind of test in order to reproduce the bug.

There are three relevant scenarios in which this is very useful.

The specific part where the bug happens is too deep in the program flow and/or requires many actions by you to reproduce it locally. In that case writing a test in any form, from a separate main() to a well-written unit test, will probably speed up the whole process.
The bug happens randomly. This could be a race condition that appears only when the machine is under stress, or a bug that is influenced by other factors. I once faced a strange bug that was causing the whole JVM to crash and was related to the graphics acceleration on the specific machines. I wrote a simple script that repeated some mouse clicks with xdotool and found that the error happened randomly, once every few dozen actions. The solution was simply to disable the hardware acceleration by passing -Dprism.order=sw to the JVM.
Production systems may be under a different load than the development or testing environment. This may affect whether a bug appears. Writing simple stress tests, custom or with tools such as JMeter, may make the bug come to the surface sooner.

As an added bonus, writing such a test makes validating the fix a piece of cake.
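For a bug that happens randomly, the simplest form of such a test is a loop that repeats the suspicious action many times and records every failure. Below is a minimal Python sketch of that idea. Both run_repeatedly and FlakyAction are hypothetical names, and FlakyAction only simulates an intermittent failure so that the example runs on its own; in practice you would pass in the real operation.

```python
def run_repeatedly(action, attempts):
    """Call `action` `attempts` times and collect every failure.

    Returns a list of (attempt_number, exception) tuples.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            action()
        except Exception as exc:  # record the failure and keep going
            failures.append((attempt, exc))
    return failures


class FlakyAction:
    """Stand-in for the real operation: fails on every `period`-th call.

    Purely illustrative; a real intermittent bug would not be this regular.
    """

    def __init__(self, period):
        self.period = period
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls % self.period == 0:
            raise RuntimeError(f"simulated failure on call {self.calls}")


if __name__ == "__main__":
    failures = run_repeatedly(FlakyAction(period=12), attempts=100)
    print(f"{len(failures)} failure(s) in 100 attempts")
    for attempt, exc in failures:
        print(f"  attempt {attempt}: {exc!r}")
```

Knowing roughly how often the failure appears also tells you how many clean runs you need before you can believe that a fix actually works.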

Write an MCVE

An MCVE is a Minimal, Complete, and Verifiable example.

In forums or question/answer sites like Stack Overflow an MCVE is often required by the people interested in answering the question. This is because it clearly demonstrates a very specific problem and is easily reproducible.

In case you are facing a bug in a large and complex code-base where even you are not sure where exactly the bug lives, you may create a separate program with just the essential parts.

This is a bit similar to the previous section, but the point of this process is mostly to help you better understand the nature of the problem, by actually removing unrelated parts that could possibly confuse you.

Then, having the minimum required code, you may try solutions until you find the right one. Then integrate the changes back into your more complex code-base.

Be sure to document the approaches you are trying well, preferably by using the VCS of your choice, to avoid repeating yourself. You may think of this process as a much simplified variation of genetic programming.

As an added bonus you’ll have in your hands an MCVE that likely won’t contain any intellectual property or private information, which you may share in forums, SO, etc. in case you are stuck.

Don’t be afraid to ask strangers publicly, but do it as a last resort after you have exhausted all the other means in your arsenal. Be kind, be thankful, provide the right amount of detail in your original question (no more, no less), respond to other people in a timely manner and be ready to never receive a response.

In case you are writing client code and there is a problem in the communication between the client and the server, try to repeat the request by using an external tool such as Postman, SoapUI, curl or a similar tool. In case you are consuming messages from a message queue and you face a problem, use an external tool to inspect the queue, such as rabbitmqadmin or rabbitmqctl for RabbitMQ. In general, when your software communicates with another piece of software, try to validate with a reliable external tool that you send and receive the expected data. There are times when the bug lies in ill-defined specs, obsolete documentation etc. This process is actually a more generic MCVE, not bound to your specific language, framework or libraries.

Trust no one

I cannot stress that enough! You have to doubt yourself, your code and everything else in order to effectively hunt bugs.

There will be some rare times when the bug is not in your code. It may be a rare bug in the language or JVM, in the browser or even in the desktop environment that you are using. It may also be a different configuration in one of those components that triggers the different behavior.

In the first case start by searching the interwebz for the problem. There may be a reported issue or a Stack Overflow question that describes your issue and, if you are lucky enough, it will be accompanied by a solution or a workaround.

Checking the change-logs of your dependencies and/or trying out different versions (usually newer versions but you never know!) is something worth trying. Desperate times call for desperate measures after all!

In the case of different configurations, mind that the same software may have a different default configuration on different operating systems. On Linux, diff is pretty handy for comparing configuration files. Also, if the files are under git, git status and git log are your friends.

Finally, make sure that your application was built correctly and that the file permissions are the expected ones. Check the dependencies too. Having a standardized build procedure, be it a simple bash script or a CI/CD system, will save you from much frustration, and I find it pretty time saving in the long run.
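If you would rather compare configurations from a script, for example because the two files live on different machines and you have already collected their contents, Python’s standard difflib module produces the same kind of unified diff. This is only a sketch: the file names and the configuration keys below are invented for the example.

```python
import difflib


def diff_configs(old_text, new_text, old_name="old", new_name="new"):
    """Return the unified diff between two configuration texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


# Two hypothetical configurations of the same service on different machines.
STAGING = """\
max_connections = 100
timeout = 30
log_level = info
"""

PRODUCTION = """\
max_connections = 20
timeout = 30
log_level = info
"""

if __name__ == "__main__":
    for line in diff_configs(STAGING, PRODUCTION, "staging.conf", "production.conf"):
        print(line)
```

Lines starting with "-" exist only in the first configuration and lines starting with "+" only in the second; an empty result means the two texts are identical.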

Honorary mentions

There are various debugging tools that may be worth knowing and using.

Some such tools are mentioned in the Write an MCVE section above.

There are also other cases of communication between systems, such as when your application communicates with a database. If there is a problem there, try to run the query manually. Mind that when using tools such as an ORM the underlying SQL may not be the one that you expect. Log the query and repeat it manually to validate the expected results. Many times the bug may be that there are unexpected entries in the database.

Also, there are tools such as various kinds of proxies that let you intercept the communication and validate that the requests and responses are the expected ones. Some tools may let you interfere and change the original request or response for testing or debugging purposes.

Some examples include mysql-proxy for MySQL and Fiddler for HTTP communication.

Being aware of the tools available for each case and ecosystem may be a game changer :-)
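As a small illustration of logging the SQL that really reaches the database, here is a sketch using Python’s built-in sqlite3 module, whose set_trace_callback hook reports every statement the connection executes. SQLite is only a stand-in here, and the users table and its rows are invented for the example; ORMs and other database drivers usually offer their own way to log the generated SQL.

```python
import sqlite3


def connect_with_sql_log(statements):
    """Open an in-memory SQLite database that records every statement it runs."""
    conn = sqlite3.connect(":memory:")
    # set_trace_callback is part of the standard sqlite3 module; the callback
    # receives each executed SQL statement as a string.
    conn.set_trace_callback(statements.append)
    return conn


if __name__ == "__main__":
    executed = []
    conn = connect_with_sql_log(executed)
    # Hypothetical schema and data, for illustration only.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
    # The same query you would repeat by hand: look for unexpected duplicates.
    rows = conn.execute(
        "SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1"
    ).fetchall()
    print("duplicates:", rows)
    for sql in executed:
        print("SQL>", sql)
```

The recorded statements can then be pasted into a database console and run by hand, which separates the question "is the query wrong?" from the question "is the data wrong?".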

Conclusion

I have introduced many methods for finding a bug, ordered by which I think you should try first.

Of course one has to be methodical and pick the most appropriate one first.

In any case, if the code is our own it is very important to actually own it. I mean to know it well enough that, when facing a new problem, we are able to jump to the appropriate place in the code and also have a pretty good idea of which parts are likely to produce problems.

I hope you found my suggestions useful. I see many people struggling to find a bug, and usually without clear reproduction steps they get stuck. In such cases one of the above tips might get them going.

Do you have any other methodology that you find yourself often applying? Do you think that my ordering could be improved? Have a specific real life example that demonstrates debugging in action? Please feel free to share a comment, I would love to read your thoughts!

Cheers and happy debugging!