Fixing Java Production Problems with Application Performance Management
Tracking down issues in a production system can be a nightmare, but application performance management systems such as New Relic – which combine isolated log files and network and database monitoring – can help.
Finding and fixing issues in a production system can be really difficult. Usually by the time the problem is visible, users are already complaining. Fixing these problems under the eye of management is no fun for anybody, especially when you don't know where the problems may be.
You may or may not have access to the servers in question, and you may have to diagnose an issue involving multiple servers. And sometimes there’s a third party involved, such as a database administrator (DBA) or hosting company, for whom your problem is not a priority. Depending on how detailed your log files are, you might be able to search through them and find some hints. Your code may also rely on third-party jars that do not log the level of detail you need.
How APM can help
It's often possible to derive useful information
from log files, network monitoring, database server monitoring, and
the like. The problem there is that you're trying to infer things
about your code's behavior from the information that you’ve already
decided to log. If you change your logging to add more information,
it's too late. The error has already happened.
Application Performance Management (APM) systems
allow you to remotely instrument your code and log data to an
external system continuously. This is advantageous for several
reasons. Since this data collection and logging is happening in the
background, you don’t need to think about logging metrics during
software development. When you need information about the
performance of your software in production, the information has
already been gathered for you during the normal operation of the
system. It has been gathered under real system load on the actual
production environment, as opposed to data from a test system under
simulated load. It also means that when an error occurs in
production, such as a performance problem or a threading problem,
data about it has already been gathered and is already
available.
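To make "instrumenting your code" concrete, here is a minimal sketch of the timing you would otherwise have to write by hand around every call you care about. The CallTimer class and its method names are hypothetical, invented for illustration; this is not New Relic's API, only an approximation of the kind of bookkeeping an APM agent takes off your hands.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical helper: records how often each named operation runs and how
// long it takes in total. An APM agent gathers comparable data automatically.
final class CallTimer {
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> callCounts = new ConcurrentHashMap<>();

    // Runs the operation and records its elapsed time, even if it throws.
    <T> T measure(String name, Supplier<T> operation) {
        long start = System.nanoTime();
        try {
            return operation.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            totalNanos.computeIfAbsent(name, k -> new LongAdder()).add(elapsed);
            callCounts.computeIfAbsent(name, k -> new LongAdder()).increment();
        }
    }

    // Number of recorded calls for the operation, or 0 if it never ran.
    long callCount(String name) {
        LongAdder count = callCounts.get(name);
        return count == null ? 0 : count.sum();
    }

    // Total elapsed nanoseconds for the operation, or 0 if it never ran.
    long totalNanos(String name) {
        LongAdder total = totalNanos.get(name);
        return total == null ? 0 : total.sum();
    }
}
```

Wrapping every interesting call this way means touching the source and redeploying, which is exactly the work that background instrumentation avoids.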
In addition to providing help diagnosing
problems, an APM system can provide more visibility into your
code's performance and usage patterns by providing metrics about
which pages are accessed the most often and how much time the
server is taking to generate those pages. Once a page has been
identified as needing improvement, an APM can help you drill in and
see where the server is spending the most time. This lets you
prioritize your fixes.
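The ranking described above can be sketched in a few lines. This is an illustrative aggregation only, assuming response times arrive as (transaction name, milliseconds) pairs; the TransactionStats class is hypothetical and says nothing about how New Relic stores or computes its metrics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical aggregator: collects response times per transaction and ranks
// transactions by average response time, slowest first.
final class TransactionStats {
    private final Map<String, List<Long>> timingsMs = new LinkedHashMap<>();

    // Stores one observed response time for the named transaction.
    void record(String transaction, long elapsedMs) {
        timingsMs.computeIfAbsent(transaction, k -> new ArrayList<>()).add(elapsedMs);
    }

    // Average response time in milliseconds, or 0.0 if nothing was recorded.
    double averageMs(String transaction) {
        List<Long> samples = timingsMs.get(transaction);
        if (samples == null) {
            return 0.0;
        }
        return samples.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    // Transaction names ordered from slowest to fastest average.
    List<String> slowestFirst() {
        List<String> names = new ArrayList<>(timingsMs.keySet());
        names.sort(Comparator.<String>comparingDouble(this::averageMs).reversed());
        return names;
    }
}
```

The first entry returned by slowestFirst() is the natural place to start looking, though a page that is slow but rarely visited may matter less than a moderately slow page that is hit constantly.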
For example, this page shows statistics for
our office’s site, which lists people’s contact information. It’s a
small site, but it gives a feel for what APM can tell you. We can see
usage spikes and how much time is being spent in
application code versus database code. Figure 1 also identifies the
PeopleController#phonenumbers page
as the slowest on the site.
Figure 1: Summary dashboard: Shows general statistics about an app in New Relic
In this article, I’ll demonstrate using New
Relic's APM system to help identify production performance issues.
I created a demo app with a single servlet that takes in a first
name and last name and searches for entries with that name in a
database using Hibernate. Adding APM to a system is fairly simple:
to get started, I only had to set up an additional directory
holding the code and configuration unpacked from a
zip file downloaded from New Relic.
6:/opt/local/apache-tomcat-7.0.34/newrelic% ls
CHANGELOG           newrelic-extension-example.xml
LICENSE             newrelic-extension.xsd
README.text         newrelic.jar
logs                newrelic.yml
newrelic-api.jar
7:/opt/local/apache-tomcat-7.0.34/newrelic%
After the directory is created, you can activate
New Relic with a simple change to the launch script. In this case,
the change is in Tomcat’s catalina.sh script.
# ---- New Relic switch automatically added to start command on 2013 Jan 08, 11:43:26
NR_JAR=/opt/local/apache-tomcat-7.0.34/newrelic/newrelic.jar; export NR_JAR
JAVA_OPTS="$JAVA_OPTS -javaagent:$NR_JAR"; export JAVA_OPTS
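The -javaagent option is a standard JVM feature rather than anything specific to New Relic: the JVM loads the named jar and calls a premain method before the application starts. The sketch below shows the general shape of such an agent using the JDK's java.lang.instrument API. The class names and the com/example/ prefix are hypothetical, and this transformer only logs class names; it is not how New Relic's agent is implemented.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Hypothetical agent entry point. The JVM locates it through the
// Premain-Class attribute in the agent jar's manifest. It is package-private
// here only to keep the sketch in one file; a real agent class is normally
// public and lives in its own file.
final class SketchAgent {
    // Called by the JVM before the application's main method runs.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer("com/example/"));
    }
}

// Logs each class under the given prefix as it is loaded. A real APM agent
// would rewrite the bytecode here to insert timing calls.
final class LoggingTransformer implements ClassFileTransformer {
    private final String prefix;

    LoggingTransformer(String prefix) {
        this.prefix = prefix;
    }

    // Class names arrive in internal form, e.g. "com/example/QueryServlet".
    boolean matches(String className) {
        return className != null && className.startsWith(prefix);
    }

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        if (matches(className)) {
            System.err.println("loading " + className);
        }
        return null; // null tells the JVM to keep the original bytecode
    }
}
```

Because the hook sits at class-loading time, the application's own source never has to change, which is why a one-line edit to the launch script is enough.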
Figure 2: Web Transactions page showing four very slow servlet calls
Once your server has been launched with this new
flag, it will report data to New
Relic (see Figure 2). The data can then be mined to help you monitor your code as
it runs.
In this case, the performance problem seen in
Figure 3 is easy to spot. My single servlet is
taking between 8000 and 9000 milliseconds every time it runs.
Figure 3: Transaction Trace page showing where time was spent in a specific servlet call
The dashboard shows us that the issue lies with
the QueryServlet, which is taking a long time to run. The trace reveals
that a database query accounts for all but 6ms of the slow request.
Since I used Hibernate in my persistence layer, it’s generating the SQL
for me, so tweaking the SQL code may not be a simple task
(Figure 4).
Drilling a little deeper shows us exactly which
query was slow:
select person0_.id as id0_, person0_.fname as fname0_, person0_.lname as lname0_, person0_.middlename as middlename0_ from person person0_ where fname=? and lname=?
Figure 4: SQL Detail Tab on the Transaction Trace showing the SQL as captured by New Relic
Now I can send this query to my DBA and ask what
can be done to make it run faster (Figure
5). It turns out to be a simple fix. The query runs
against a single table with over 21 million rows, and none of
the columns in its 'where' clause is indexed.
Figure 5: DBA tool showing that the table being queried isn’t indexed for our query
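To see why a missing index hurts so much on a large table, consider this in-memory analogy. It is only a sketch: the Person and PersonTable classes are hypothetical, and a HashMap stands in for the index, whereas a real database index is typically a B-tree maintained on disk. The point is the difference between examining every row and jumping straight to the matching ones.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row type mirroring the id, fname and lname columns.
final class Person {
    final long id;
    final String fname;
    final String lname;

    Person(long id, String fname, String lname) {
        this.id = id;
        this.fname = fname;
        this.lname = lname;
    }
}

// In-memory stand-in for a table, with and without an index on (fname, lname).
final class PersonTable {
    private final List<Person> rows = new ArrayList<>();
    private final Map<String, List<Person>> byName = new HashMap<>();

    private static String key(String fname, String lname) {
        return fname + '\u0000' + lname; // separator keeps the two parts distinct
    }

    // Adds the row and keeps the name index up to date.
    void insert(Person p) {
        rows.add(p);
        byName.computeIfAbsent(key(p.fname, p.lname), k -> new ArrayList<>()).add(p);
    }

    // No index: every row is examined, so cost grows with table size.
    List<Person> findByScan(String fname, String lname) {
        List<Person> matches = new ArrayList<>();
        for (Person p : rows) {
            if (p.fname.equals(fname) && p.lname.equals(lname)) {
                matches.add(p);
            }
        }
        return matches;
    }

    // With an index: a single lookup finds the matching rows directly.
    List<Person> findByIndex(String fname, String lname) {
        return byName.getOrDefault(key(fname, lname), Collections.<Person>emptyList());
    }
}
```

Both methods return the same rows, but the scan touches every row on every request. The trade-off is that each insert now has to update the index as well, which is also true of a real database.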
The DBA has added some indexes to
the table. Now I can run the app again and see the results of the
change in Figure 6.
Figure 6: Transaction Trace showing the same servlet call after the table indexes were added
Conclusion
We improved the system response time from 8470ms
to 20ms, a huge improvement in a simple case. But most importantly,
I was able to get all the information I needed in an organized
fashion in the browser. I didn’t waste any time logging into
servers, viewing log files or anything like that. I also didn’t
need to change anything in my source code to enable this data
collection. I added the New Relic jar to the server launch scripts,
and after that, my server logged information to New Relic in the
background. From the New Relic website, I was able to track down my
performance problem. I drilled through to the slow web transaction,
looked at different parts of the transaction to see what was the
slowest, and acted on those results.
This was a simple demonstration where the fix
was obvious once the slow query was identified, but it illustrates
the value of app performance management. Not only can it be used to
find performance problems, it can also be used to measure your app
in your production environment so you can know where to spend your
time and money to make your system better.
Author Bio: Dan has been
writing Java code since 1996, and is currently a senior software
engineer at New Relic in Portland. When he is not at work, he
enjoys playing with trains with his son and writing model train
related software for his iPhone.
This article first appeared in JAX Magazine:
Socket to Them.