Performance Tips and Techniques for Java Programmers

The Java programming language has long been criticised for poor performance, yet for enterprise applications Java has been the first choice of architects. The two statements sound conflicting, but they are not, and you need to know why. A Java architect knows the different performance attributes of Java programming, when and where each one matters, and which tool to use.

We all know Google: it handles 40k transactions per second concurrently. CERN crunches 25Gb per second with an efficient I/O engine. And financial hedge funds run highly responsive systems with sub-microsecond latency. From these facts we can summarise that performance has 3 axes: responsiveness, data size (I/O) and concurrency. Your application needs all 3, in different proportions. Google, CERN and hedge funds are outliers with state-of-the-art technology, and you don't need anything that expensive for your own development. You just need to do a few things that are pretty well defined. Concurrency is the hardest of the three to get right, but it is achievable once you know what to look for.

When you talk about performance, you must ask whether your goal is achievable and, if it is not, whether reaching it is going to cost a fortune (your project must have a budget). These 3 steps will help you define your performance goals:

  1. Decide what is achievable and reasonable; you are not targeting Google. Understanding your objective is the first step.
  2. Once you know what is reasonable, define the targets for your application and apply standard techniques.
  3. If you miss your targets, know which tools to use to reach them.

A request to your Java application is either stateful or stateless. A stateful request changes something at the persistence layer and responds with the result once the change has succeeded, like a bank transaction. A stateless request does not change anything. So tip 1 for you: aim for stateless requests wherever you can, so that you can use concurrency effectively.
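To illustrate why stateless requests are easy to run concurrently, here is a minimal sketch. The names `StatelessDemo`, `priceWithTax` and `handleAll` are hypothetical and made up for this example. The only point is that a handler which reads its arguments and writes no shared state can be called from many threads without any locking.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class StatelessDemo {
    // Stateless: the result depends only on the arguments and nothing is written anywhere.
    static long priceWithTax(long netCents, int taxPercent) {
        return netCents + netCents * taxPercent / 100;
    }

    // Runs the same handler for many requests in parallel; no synchronization is needed
    // because the handler shares no mutable state between calls.
    static List<Long> handleAll(List<Long> netPrices, int taxPercent) {
        return netPrices.parallelStream()
                .map(p -> priceWithTax(p, taxPercent))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> prices = IntStream.rangeClosed(1, 5)
                .mapToObj(i -> i * 1000L)
                .collect(Collectors.toList());
        System.out.println(handleAll(prices, 20));
    }
}
```

A stateful handler in the same position would need a lock or a transaction around the shared data, and that is where the concurrency limit comes from.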

For stateful concurrency you must understand the CAP theorem: you cannot have all 3 of consistency, availability and partition tolerance at the same time.

The database industry has a nice benchmark called TPC-C (the 1-box TPC-C benchmark), and 2000 transactions per second with a response time of 5 (presumably ms) is easily achievable. If you go to Amazon, get a 32-core machine and install your database, you can reach that performance without much hard work. With eventual consistency, 1 box can support 300 requests per second with a 100 ms response time, all concurrently, and such an architecture scales horizontally with ease. If you are not achieving this, you need to know which tools will tell you why.

When data is the performance attribute, there is a little more to understand. The first problem is that storage capacity has increased and in-memory capacity has become affordable, but moving data from disk storage into memory is still a bottleneck. The first boundary is the CPU cache line: if two objects reside in the same cache line, then even when they are accessed by 2 different threads they contend with each other, because the cores keep trying to update the same cache line. In general this is not noticeable, but for a highly concurrent system with low latency it matters a lot.
(Speedup your Java Apps with Hardware Counters https://www.info.com/presentatoins/rrnu-hwc-java)
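The cache-line contention described above is usually called false sharing. The sketch below is one way to observe it, under the assumption that a cache line is 64 bytes. Two threads increment their own slot of an `AtomicLongArray`, first in adjacent slots and then in slots 16 longs (128 bytes) apart. The class name `FalseSharingDemo` and the stride value are choices made for this example, and the timings depend heavily on the hardware and JVM, so treat it as a rough experiment and not as a benchmark.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingDemo {
    // Assumption: cache lines are 64 bytes, so a stride of 16 longs (128 bytes)
    // puts the two counters on different lines on typical hardware.
    static final int PADDED_STRIDE = 16;

    // Two threads each increment their own slot; 'stride' is the distance between the slots.
    // Returns { counter A, counter B, elapsed nanoseconds }.
    static long[] run(int stride, long iterations) throws InterruptedException {
        AtomicLongArray counters = new AtomicLongArray(stride + 1);
        Thread a = new Thread(() -> {
            for (long i = 0; i < iterations; i++) counters.incrementAndGet(0);
        });
        Thread b = new Thread(() -> {
            for (long i = 0; i < iterations; i++) counters.incrementAndGet(stride);
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        a.join();
        b.join();
        long elapsedNanos = System.nanoTime() - start;
        return new long[] { counters.get(0), counters.get(stride), elapsedNanos };
    }

    public static void main(String[] args) throws InterruptedException {
        long n = 5_000_000L;
        long[] adjacent = run(1, n);            // slots 0 and 1: likely the same cache line
        long[] padded = run(PADDED_STRIDE, n);  // slots 0 and 16: likely different cache lines
        System.out.printf("adjacent slots: %d ms%n", adjacent[2] / 1_000_000);
        System.out.printf("padded slots:   %d ms%n", padded[2] / 1_000_000);
    }
}
```

Both runs produce the same counts; only the elapsed time differs, which is why this kind of contention is hard to spot without measuring.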

The 2nd boundary is the NUMA node. Take a 32-core machine with 4 sockets of 8 cores each. If you keep a lot of data in the heap, you must make sure that each socket has enough local memory for its share of the heap; otherwise you have a problem.

The 3rd boundary is total RAM.
The 4th boundary is local space.
The 5th boundary is remote persistence: the type of update and the type of transaction.

Third Axis is Responsiveness

Garbage collection is the main bottleneck here, along with the type of application you have (stateless vs stateful), and you need to be aware of jitter.

For a stateful application, hundreds of requests are achievable and it is reasonable to demand that. Watch out for, and avoid, loading large data structures such as XML, configuration files and document processing. Replacing an entire object is expensive and has consequences for GC. Prefer the concurrent mark-sweep (CMS) collector (note that CMS was deprecated in JDK 9 and removed in JDK 14, so this advice applies to older JVMs). The problem with CMS is that it does not defragment the heap, so sooner or later it takes a long pause. To solve this, detach one node from the cluster, let it defragment, return it to the cluster, and repeat for each node in turn.

Stateless responsiveness is much easier to handle. 40 ms is the limit of human perception and is easily achievable. Even 20 ms is not hard, often with GC tuning alone. For 5 ms you need GC tuning plus object lifecycle tuning, which focuses on cutting the object churn rate, on object allocation, and on object pooling (ugly, but it needs to be done). 2 ms is hard: you cannot afford GC pauses, so you create all objects up front and keep the data structures in place in memory, in a large heap that you refresh once a day. Banks follow this kind of architecture. Such low latency needs a different architecture, tuned to the shared CPU cache and using thread affinity to keep specific operations on specific cores.
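As an illustration of the object pooling and up-front allocation mentioned above, here is a minimal sketch of a fixed-size buffer pool. The class `BufferPool` and its methods are hypothetical and written for this example; a production pool would also need to think about clearing buffers, bounding growth and lock contention.

```java
import java.util.ArrayDeque;

final class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;

    // All buffers are allocated once, up front, so steady-state requests create no garbage.
    BufferPool(int count, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < count; i++) {
            free.push(new byte[bufferSize]);
        }
    }

    // Hands out a pooled buffer; falls back to a fresh allocation if the pool is empty.
    synchronized byte[] acquire() {
        byte[] b = free.poll();
        return b != null ? b : new byte[bufferSize];
    }

    // Returns a buffer to the pool; buffers of the wrong size are simply dropped.
    synchronized void release(byte[] b) {
        if (b.length == bufferSize) {
            free.push(b);
        }
    }

    synchronized int available() {
        return free.size();
    }

    public static void main(String[] args) {
        BufferPool pool = new BufferPool(4, 8192);
        byte[] buf = pool.acquire();
        try {
            buf[0] = 42; // use the buffer for one request
        } finally {
            pool.release(buf); // always hand it back, or the pool drains
        }
        System.out.println("buffers available: " + pool.available());
    }
}
```

The trade-off is the one the text calls ugly: the caller now owns the lifecycle of the object, and forgetting to release it turns the pool into a leak.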

Avoid these design mistakes

  1. Fine-grained communication is bad; this is known as a chatty app
  2. Treating local calls and remote calls as equivalent is bad
  3. Designs that are difficult to add caching to are bad
  4. A non-parallelizable workflow is bad
  5. No load balancing in the architecture is bad
  6. Long transactions are bad
  7. Big differences between the data model and the object model are bad
  8. Slow, extensive embedded security checks are bad
  9. Non-intuitive user interfaces are bad
  10. Lack of feedback in the user interface is bad
  11. Locking shared resources for non-short periods is bad
  12. Not paging data is bad (JDBC, for example, supports paging; if you return a huge data set in one go, it churns a huge amount of memory)

You can refer to the troubleshooting diagram given by Oracle: https://shipilev.net/talks/devoxx-Nov2012-perfMethodology-mindmap.pdf

There are shared resources, and any of them could be a bottleneck:

  1. CPU utilisation (CPU)
  2. CPU run queue (CPU + Locks): when your system is doing nothing but is still slow, it is because of locks
  3. System context switching (CPU + Locks)
  4. Threads (CPU + Memory)
  5. Physical memory utilisation (Memory)
  6. Pages/swap memory utilisation (CPU + Memory + IO)
  7. Disk IO latency (IO)
  8. Disk IO throughput (IO)
  9. Disk queue (IO)
  10. Database connections (Locks + Database): wait for acquisition and pool utilisation
  11. Database SQL latency (IO + Database)
  12. Database communication throughput (IO + Database)
  13. Network IO bandwidth use (IO)
  14. Network IO latency of communication (IO)
  15. Network IO frequency of communication (IO)
  16. GC throughput (CPU + Memory)
  17. GC pause times (CPU + Memory + Locks)
  18. Thread contention (Locks)
  19. Race conditions
  20. Deadlocks
  21. Spin locks
  22. Data flow across threads
  23. Wait for acquisition
  24. Pool utilisation
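A few of the shared resources listed above can be sampled from inside the JVM with the standard `java.lang.management` beans. The sketch below is one possible snapshot, assuming a HotSpot-style JVM; the class name `ResourceSnapshot` and the output format are invented for this example. Disk, network and database figures are not visible this way and need operating system or database tooling.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

final class ResourceSnapshot {
    // Returns a one-line summary of CPU, thread, deadlock and heap figures.
    static String sample() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Returns null when no deadlock is found.
        long[] deadlocked = threads.findDeadlockedThreads();

        Runtime rt = Runtime.getRuntime();
        long heapUsedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);

        return String.format(
                "cpus=%d loadAvg=%.2f threads=%d deadlocked=%d heapUsedMB=%d",
                os.getAvailableProcessors(),
                os.getSystemLoadAverage(),   // negative when the platform cannot report it
                threads.getThreadCount(),
                deadlocked == null ? 0 : deadlocked.length,
                heapUsedMb);
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

A load average well above the CPU count points at the run queue, while a low load average on a slow system is the "doing nothing but slow" case, where thread dumps and lock contention are the next thing to check.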

Application Stack Monitoring

• Client initiated end-to-end response time
– Recorded by a load testing tool
– Ideally recorded by the actual UI
• Network communication overhead
– Ping is inaccurate for application measurements
– Infer network times from other measurements: (response time) minus (server-side service time)
• Server-side total service time
– E.g. from entering “doGet()” until it exits
• Webserver access logs
• Webserver/Servlet/Application-server/JVM level metrics such as heap size, thread pool size, etc.
• JVM execution profiles
• Stack sampling
• Logs – what to log:
– Any I/O points
– Any large data conversion points
» Marshalling, Parsing
– Transactions
– Component boundaries
– Request service times
• Server to server intercommunication times
• Component execution times, execution profiles and inter-component communication times
• Component lifecycle and transaction times
• Application to database communications
• Heap & Garbage Collection
• Database statistics
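The server-side service time and the inferred network time from the list above can be sketched as follows. `ServiceTimer`, `timed` and `inferredNetworkNanos` are hypothetical names for this example; in a servlet the timed section would wrap the body of `doGet()`, and the client-side response time would come from the load testing tool.

```java
import java.util.function.Supplier;

final class ServiceTimer {
    // Runs the handler and stores its elapsed time (in nanoseconds) in elapsedNanosOut[0].
    // The time is recorded even if the handler throws.
    static <T> T timed(Supplier<T> handler, long[] elapsedNanosOut) {
        long start = System.nanoTime();
        try {
            return handler.get();
        } finally {
            elapsedNanosOut[0] = System.nanoTime() - start;
        }
    }

    // Network overhead inferred as (client-side response time) minus (server-side service time).
    // Clamped at zero because the two clocks are measured on different machines.
    static long inferredNetworkNanos(long responseNanos, long serviceNanos) {
        return Math.max(0L, responseNanos - serviceNanos);
    }

    public static void main(String[] args) {
        long[] elapsed = new long[1];
        String body = timed(() -> "hello", elapsed);
        System.out.println(body + " served in " + elapsed[0] + " ns");

        // Example figures only: 120 ms seen by the client, 80 ms measured on the server.
        long networkNanos = inferredNetworkNanos(120_000_000L, 80_000_000L);
        System.out.println("inferred network time: " + networkNanos / 1_000_000 + " ms");
    }
}
```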

The most common problems are easy to find and fix, yet many people still make them.
Most non-human errors are resource leaks:

  1. Memory leaks
  2. JDBC connections
  3. File handles
  4. Disk space filling up
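Most leaks of connections and file handles come from a `close()` call that is skipped when an exception is thrown. A minimal sketch of the usual remedy, try-with-resources, is shown below. `TrackedResource` is a stand-in written for this example so that the open count can be observed; JDBC connections, statements, result sets and streams all implement `AutoCloseable` and can be used the same way.

```java
final class LeakDemo {
    // Stand-in for a connection or file handle; counts how many instances are still open.
    static final class TrackedResource implements AutoCloseable {
        static int open = 0;

        TrackedResource() {
            open++;
        }

        @Override
        public void close() {
            open--;
        }
    }

    // The resource is closed on every path out of the try block, including exceptions.
    // Returns the number of open resources observed while the work was running.
    static int useSafely(boolean fail) {
        try (TrackedResource r = new TrackedResource()) {
            if (fail) {
                throw new IllegalStateException("simulated failure");
            }
            return TrackedResource.open;
        }
    }

    public static void main(String[] args) {
        useSafely(false);
        try {
            useSafely(true);
        } catch (IllegalStateException expected) {
            // the failure is simulated; the resource was still closed
        }
        System.out.println("resources still open: " + TrackedResource.open);
    }
}
```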

Memory Leaks, Heap & GC
GC logging has negligible impact on performance. The GC logging flags are not used much, but they are quite interesting: when you read the logs you will see a lot of stops, such as GC pauses, context switching etc.
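GC logging itself is switched on with JVM flags (for example `-verbose:gc`, or `-Xlog:gc*` on JDK 9 and later). As a complement, the same counters can be read from inside the application through the standard `GarbageCollectorMXBean`. The sketch below shows one way to do that; the class name `GcStats` and the output format are assumptions of this example, and the collector names it prints depend on the JVM and the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStats {
    // One line per collector: how many collections ran and how long they took in total.
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    // Accumulated collection time across all collectors, in milliseconds.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.print(summary());
        System.out.println("total GC time: " + totalCollectionMillis() + " ms");
    }
}
```

Sampling these numbers at intervals and looking at the difference gives a rough GC throughput figure, but the GC log remains the better source for individual pause times.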

Don't take a heap dump while the system is serving traffic. Instead, remove the node from the cluster and then take the heap dump, because dumping the heap is an expensive operation and it costs a lot.

An alternative to a heap dump is a heap histogram, taken with jmap. It can be taken periodically, but not too often.
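On the command line the histogram is produced with `jmap -histo <pid>`. The sketch below builds and optionally runs that command from Java, assuming a JDK with `jmap` on the `PATH` and Java 9 or later for `ProcessHandle`. The class `HeapHistogram` and its methods are hypothetical helpers for this example, and running `jmap` against a busy process still pauses it briefly, which is why it should not be done too often.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class HeapHistogram {
    // Builds the command line for a class histogram of the given process.
    static List<String> jmapHistoCommand(long pid) {
        return Arrays.asList("jmap", "-histo", Long.toString(pid));
    }

    // Runs jmap and sends its output to this process's stdout.
    // Assumption: the jmap binary from the JDK is on the PATH.
    static int run(long pid) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(jmapHistoCommand(pid)).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command for the current JVM; call run(pid) to execute it.
        long pid = ProcessHandle.current().pid();
        System.out.println(String.join(" ", jmapHistoCommand(pid)));
    }
}
```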

When you use JDBC and iterate through the records, the driver uses internal pages.
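In the standard JDBC API the size of those internal pages can be suggested with `Statement.setFetchSize`, which is a hint that a driver may honour or ignore. The sketch below processes a result set row by row so that only one page needs to be in memory at a time. `PagedQuery`, `RowHandler` and `forEachRow` are hypothetical names for this example, and the SQL and fetch size in the usage note are placeholders.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class PagedQuery {
    // Callback invoked once per row; the ResultSet is positioned on the current row.
    interface RowHandler {
        void handle(ResultSet row) throws SQLException;
    }

    // Streams the query result through the handler and returns the number of rows seen.
    static int forEachRow(Connection conn, String sql, int fetchSize, RowHandler handler)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            // Hint to the driver: fetch this many rows per round trip instead of everything.
            ps.setFetchSize(fetchSize);
            int rows = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs); // process and discard; do not collect rows in a list
                    rows++;
                }
            }
            return rows;
        }
    }
}
```

A call might look like `PagedQuery.forEachRow(conn, "SELECT id FROM orders", 500, row -> process(row.getLong(1)))`, where `conn`, the table and `process` are placeholders. The memory saving only holds if the handler does not keep every row it sees.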