Understanding How LeakHunter Can Help with Memory Problems
There are two types of memory problems:
1. Memory leaks, where memory is lost steadily over time as the application runs.
2. Memory outages, where there is simply no more memory available. Running out of memory does not necessarily mean you have a memory leak.
Based on its configuration and the type of monitored application, LeakHunter can consume significant additional memory. LeakHunter monitors every collection in your application, which adds overhead and can degrade the performance of an unstable application. Therefore, employ LeakHunter only after your application is truly stable. Below are some guidelines for identifying memory usage problems, and for employing LeakHunter successfully when it is needed.
Testing for a Memory Leak or Memory Outage
1. Exercise the application under such a load that you can both a) reproduce the problem, and b) monitor it for at least one hour.
2. Get a historical view of the Heap:BytesInUse metric and see if the baseline (the lower heap value after each GC) is increasing. If so, you may have a memory leak.
If you instead find that memory is suddenly exhausted and the application fails, then you have a memory outage.
Also note the initialization signature of your application. How long does it take to actually start up? How do you know it is really ready? Do you exercise a single user load before a full load test?
3. At the end of the load test, force a full GC. To confirm a leak, run the same load test at least two more times. Force a GC after each run. What is happening to the Heap baseline? Leaks are fairly easy to find, but other memory problems are more difficult to detect because they may be difficult to reproduce. You need to be sure that a more significant memory problem, resulting in memory outage, is not present.
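Steps 2 and 3 hinge on reading a clean heap baseline after a forced full GC. As a minimal plain-Java sketch (the JMX calls are standard; how you drive the load test is up to your harness, and the class name here is ours), the baseline can be sampled like this:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Minimal sketch: sample the post-GC heap baseline, analogous to the low
// point of the Heap:BytesInUse metric after each collection.
public class HeapBaseline {
    public static long sampleAfterGc() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        memory.gc(); // requests a full collection (like System.gc())
        return memory.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        long before = sampleAfterGc();
        // ... drive the load test here ...
        long after = sampleAfterGc();
        System.out.println("Baseline growth (bytes): " + (after - before));
    }
}
```

Note that a GC call is only a request; JVMs may honor it lazily, so treat these numbers as indicative rather than exact.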
Should you deploy LeakHunter now?
1. If your problem is actually a memory outage, LeakHunter will amplify it: you will run out of memory sooner, and you may mask other, more subtle problems. First, understand and verify the nature of the problem, which the base Introscope product will help you do. Second, establish a baseline for normal application behavior so that you know both the criteria for an application failure and the potential impact of LeakHunter.
2. To identify the memory outage (when you can reproduce it), look at the Concurrency metrics to see which components are active. Also look at the invocation rates to see if there is a surge in activity that correlates with the outage. You may need to run a variety of Use Cases. You also may need to do two more load tests, increasing the load each time (5, 10, 25 simulated users, for example) in order to truly confirm the source of the sudden growth. In addition, allow the application to clean up or recover between load tests. At this phase of testing, don't restart the server for every load test.
3. Take a historical view of all the load tests and see whether the baseline is increasing evenly from test to test, or in proportion to the simulated load. Force a GC if necessary to get a solid endpoint for each load test. If you identify the source of a memory outage, resolve it first, as it may mask other problems in the application. Apply the code change, rerun the load, and examine the Heap baseline growth. If the baseline is increasing over the duration of the run, and from test to test (and the other problems are addressed), you have a memory leak.
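The trend check in step 3 reduces to simple arithmetic on the post-GC baselines recorded after each run. A hypothetical helper (class and method names are ours, not part of Introscope):

```java
// Hypothetical helper: average per-test growth of the post-GC heap baseline.
// Steady positive growth from run to run at the same load suggests a leak;
// growth that scales with the simulated load points toward an outage instead.
public class BaselineTrend {
    public static long averageGrowth(long[] baselines) {
        if (baselines.length < 2) {
            return 0; // need at least two runs to measure a trend
        }
        long totalGrowth = baselines[baselines.length - 1] - baselines[0];
        return totalGrowth / (baselines.length - 1);
    }
}
```

For example, baselines of 100 MB, 150 MB, and 200 MB after three identical runs give an average growth of 50 MB per test, the classic leak signature.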
Testing for Other Leaks
So you have confirmed that a leak exists and hopefully eliminated any memory outage issues. Before you deploy LeakHunter, consider other types of leaks, such as Connection Leaks. Do you have a wrapper around your database or JMS connections? Do you clean up that connection properly? You can prove it by deploying CMTs on the Create and Cleanup methods, using incrementers and decrementers, respectively. LeakHunter only looks at Java Collection Classes, but memory leaks can take other forms. Again, the base Introscope configuration enables you to do all of this. Make sure the easy leaks are eliminated or ruled-out.
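The incrementer/decrementer idea can be prototyped in plain Java before wiring up CMTs. This sketch is illustrative, not an Introscope API: hook your create and cleanup paths so that a climbing live count exposes a connection leak.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative connection counter (not an Introscope API): increment on
// create, decrement on cleanup. A live count that climbs under steady load
// and never settles back down is the signature of a connection leak.
public class ConnectionCounter {
    private static final AtomicInteger live = new AtomicInteger();

    public static void onCreate()  { live.incrementAndGet(); }
    public static void onCleanup() { live.decrementAndGet(); }
    public static int  liveCount() { return live.get(); }
}
```

With CMTs, the same increment and decrement would be attached to the wrapper's create and cleanup methods instead of called by hand.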
Deploy LeakHunter first in the default configuration so that you can assess the impact of the additional memory that LeakHunter will consume. Initially, leave it disabled, since it may increase the time the server takes to start up and may also affect the initialization of the application. Compare the initialization times of the earlier three load tests. Are they approximately equal? If not, do two more runs at the same load. Is this second set approximately equal? If so, enable LeakHunter and rerun the same load so that you can directly compare the effect on initialization. If you cannot get consistent values, you probably have other problems in your test environment; resolve these before going any further.
The default timeout value for LeakHunter, after which it stops looking for new collections, is 120 minutes. This covers most application initializations, but the initialization phase can take two to four hours or longer, depending on the nature of the application. If you see that your application has not finished initializing after 120 minutes, consider doubling the default timeout value.
If you are under time constraints and cannot afford to run successive tests to determine an appropriate timeout value, set the value to zero. This causes LeakHunter to look for new collections continuously (there is no timeout), which may create a memory outage and/or an application slowdown: LeakHunter must maintain a data structure for every collection it encounters, and memory is not unlimited. The timeout parameter is therefore a balancing act, not an invitation to run LeakHunter continuously.
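For reference, the timeout is typically set in the agent profile. The property name below is an assumption modeled on common Introscope agent settings; verify the exact key in your own IntroscopeAgent.profile before relying on it.

```
# Assumed property name -- verify against your agent profile.
# 240 doubles the 120-minute default for slow-initializing applications;
# 0 means "never stop looking for new collections" (use with care).
introscope.agent.leakhunter.timeoutInMinutes=240
```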
Make LeakHunter part of your testing plan, but do not use it continuously. If you have a code drop every Monday, first ensure that the application is functional and stable, and get those issues resolved. Then activate LeakHunter and do a couple of runs. Measure the rate at which memory leaks and track that value from test to test. Then turn LeakHunter off. Do not run performance tests and hunt leaks at the same time. If there is any significant change in response time or memory footprint, quantify it so that you can manage expectations in case you must deploy LeakHunter in production because leaks manifest themselves only there.
Every application is different in terms of how it uses collection classes and how LeakHunter may affect it. Not all memory problems will benefit from LeakHunter; make sure you know the difference and can estimate the impact. To deploy LeakHunter in production, you need answers to the following questions, and you must be able to prove them.
How do I deploy and back out LeakHunter?
How do I know it is a memory leak and not some other problem?
At what rate is memory being leaked?
How much of a fix is acceptable to get back to production?
What is the additional startup time for the application server with LeakHunter deployed?
Do I have to increase the application's warm-up period before I connect it to the cluster?
How much additional memory will be consumed?
What is the relationship to the user load?
How will LeakHunter affect my application response time?