Low Latency Slides

Last weekend was LJC Open Conference #4, and like many people I got a lot out of it.

My talk was up first which meant I could relax for the rest of the day.

Here are the slides

Note: the message size was 16 longs or 128 bytes. This makes a difference at higher throughputs.

In answer to @aix's questions

On slide 12; what are we measuring here, elapsed time from the message hitting the buffer to the "echo" reply showing up on the client socket?
In each message I add a timestamp as it is written. When each message is read, I compare the timestamp with the current time. I sort the results and take the middle / 50%tile as the typical timing and the worst 0.01% as the 99.99%tile.
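As a minimal sketch of this approach (not the original benchmark code; the message transport is elided and the class name is a placeholder), the timestamp-and-percentile logic looks something like this:

```java
import java.util.Arrays;

// Sketch: each message carries the nanosecond timestamp at which it was
// written; when the echoed message is read back, the latency is the
// difference from the current time.
public class LatencyPercentiles {
    public static void main(String[] args) {
        int count = 1_000_000;
        long[] latencies = new long[count];
        for (int i = 0; i < count; i++) {
            long written = System.nanoTime(); // stored in the message
            // ... message travels to the server and is echoed back ...
            latencies[i] = System.nanoTime() - written;
        }
        // Sort, then report the 50%tile (typical) and the start of the
        // worst 0.01% (the 99.99%tile).
        Arrays.sort(latencies);
        System.out.println("50%tile:    " + latencies[count / 2] + " ns");
        System.out.println("99.99%tile: " + latencies[count - count / 10_000] + " ns");
    }
}
```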
What's causing the large difference between "nanoTime(), Normal" and "RDTSC, Normal" in the bottom half of the slide (2M/s)?
The reason for taking the RDTSC directly (9 ns) is that System.nanoTime() is quite slow on Centos 5.7 (180 ns), and the latter is a system call which may disturb the cache. At a modest message rate of 200K/s (one message every 5000 ns) the difference is minor; however, as the message rate increases to 2M/s (one message every 500 ns), the added latency is significant. It's not possible to send messages at over 5 M/s using System.nanoTime(), whereas with RDTSC I got up to 12 M/s. Without timing each message, I got a throughput of 17 M/s.

In answer to @Matt's questions
Is that a safe way to call tsc on a multicore machine? Is it consistent across cores?
Calling RDTSC on a multi-core system is safe as there is one counter per socket. However, calling RDTSC on a multi-socket system is not safe; there will be differences, and possibly drift, between the sockets' counters.

This is not a problem if you have only one socket, or you do all your timing on one socket, e.g. you have used thread affinity and know the timings will all be taken on the same socket.

Don't you need some sort of serialisation in the instruction stream to avoid out of order behaviour?
This is a potential problem. However, the total time taken with JNI is around 9 ns, which is more than 40 instructions. This is longer than the CPU pipeline can re-order (typically around 32 instructions). If this instruction were embedded the way some Unsafe methods are, it could be re-ordered with Java instructions. However, provided this re-ordering is not random, the difference is likely to be a bias of << 10 ns. If the re-ordering were random, it could add up to 10 ns of jitter.

Given I am timing latencies to an accuracy of 100 ns, I decided it wasn't a problem for me. It could be a problem if you want 10 ns accuracy.
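For reference, the cost of System.nanoTime() itself can be estimated with a simple loop. This is a rough sketch, not the original test; as the figures above show, the result depends heavily on the OS and its clocksource:

```java
// Rough estimate of the average cost of a System.nanoTime() call.
// The sum is printed so the JIT cannot eliminate the calls as dead code.
public class NanoTimeCost {
    public static void main(String[] args) {
        int runs = 10_000_000;
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++)
            sum += System.nanoTime();
        long time = System.nanoTime() - start;
        System.out.printf("Average System.nanoTime() cost: %.1f ns (checksum %d)%n",
                (double) time / runs, sum);
    }
}
```

On a fast clocksource this reports tens of nanoseconds per call; on Centos/RHEL 5 it can be much higher.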

On the "reproducible results" theme, to be completely anal you need to repeat the test n times (20-30?) and examine the distribution of results.
What I do is repeatedly run the test for 5 seconds and print the results. Each run consists of a minimum of one million individual messages. This is repeated 30 times and I print an aggregate distribution. I compare the individual distributions with the aggregate to see if they are "close".
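The repeat-and-aggregate step can be sketched as follows. This is a simplified illustration: `runTest` is a placeholder for the real 5-second test and generates dummy latencies here.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: run the test N times, print each run's percentiles, and merge all
// runs into one aggregate distribution to compare the individual runs against.
public class AggregateRuns {
    static final Random RANDOM = new Random(1);

    // Placeholder for the real 5-second test; returns sorted latencies in ns.
    static long[] runTest(int messages) {
        long[] latencies = new long[messages];
        for (int i = 0; i < messages; i++)
            latencies[i] = 100 + RANDOM.nextInt(1000); // dummy data
        Arrays.sort(latencies);
        return latencies;
    }

    static long percentile(long[] sorted, double p) {
        return sorted[(int) (sorted.length * p)];
    }

    public static void main(String[] args) {
        int runs = 30, messages = 100_000;
        long[] aggregate = new long[runs * messages];
        for (int r = 0; r < runs; r++) {
            long[] run = runTest(messages);
            System.out.printf("run %2d: 50%%tile %d ns, 99.99%%tile %d ns%n",
                    r, percentile(run, 0.5), percentile(run, 0.9999));
            System.arraycopy(run, 0, aggregate, r * messages, messages);
        }
        Arrays.sort(aggregate);
        System.out.printf("aggregate: 50%%tile %d ns, 99.99%%tile %d ns%n",
                percentile(aggregate, 0.5), percentile(aggregate, 0.9999));
    }
}
```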

I will try to document how the tests are done in more detail when I am happy with the reproducibility of the 99.99%tile values. :|

Comments

  1. Peter,

    Thanks for sharing. I have a couple of questions about slide 12:

    1. What are we measuring here, elapsed time from the message hitting the buffer to the "echo" reply showing up on the client socket?

    2. What's causing the large difference between "nanoTime(), Normal" and "RDTSC, Normal" in the bottom half of the slide (2M/s)?

    P.S. I wish I had known about the conference so I could have attended.

    -aix

  2. @aix, I have added your excellent questions with answers to the main entry.

    I missed Devoxx for the same reason, and it's one of the biggest Java conferences in Europe. :|

  3. Is that a safe way to call TSC on a multicore machine? Is it consistent across cores? Don't you need some sort of serialisation in the instruction stream to avoid out-of-order behaviour?

    On the "reproducible results" theme, to be completely anal you need to repeat the test n times (20-30?) and examine the distribution of results.

  4. For comparison, I've timed System.nanoTime() on a couple of systems I had to hand, and the results are quite interesting:

    1. 64-bit Ubuntu 10.04 running on Xeon E5410: 50ns per call.

    2. 64-bit RHEL 5.6 running on Xeon L5520: 350ns per call.

    As to RDTSC, I presume it can't be used to measure intervals across CPU chips? What about different cores on the same chip?

  5. @aix, My understanding is there is only one TSC per socket/chip and I haven't seen a problem comparing times between cores on the same socket.

    For testing purposes, when there is more than one socket, I would dedicate a whole socket to the core application (not the socket with cpu0, as the OS runs there by default) and only use the TSC in threads on that socket.

  6. Do you know of a good open-source Java wrapper for RDTSC/RDTSCP? Something that would have low overhead yet take care of the subtleties (out-of-order execution etc)?

  7. @aix, In that case I would use System.nanoTime(). If you are making a few 100K/s calls it shouldn't make much difference. Also, not using Centos/RHEL 5 appears to give a significant improvement.

  8. Hello,

    I am no specialist on the question so I have a few questions:
    - RDTSC calls are made using JNI, right? If so, I thought the overhead was much more than 9 ns (just for the JNI call). Do you have any recent figures on the latency overhead of a JNI call?
    - Are there any decent benchmarks on the performance of System.nanoTime across various OSes / versions?

  9. @HareRama, I tend to use JNI as little as possible, so I am no expert on what makes JNI slow. However, my understanding is that the number and type of parameters make a difference.

    In the case of RDTSC, it takes no arguments and returns a primitive which may explain its speed.

    Another reason JNI calls can be slow is that they often make a system call. The cost of even a simple system call can be quite high (System.nanoTime() is basically a wrapper for a system call). RDTSC is a single machine code instruction, and there are not many useful things you can do purely in machine code that you can't already do in Java (without a system call).

    It would be interesting to compare how long System.nanoTime() takes on different OSes (with the same hardware), and its accuracy and resolution. Unfortunately, the time it takes to install just about every mainstream OS on one box and test each of them is more than most people would consider. ;)

  10. Hi Peter,

    The library that you use for setting the thread affinity, is that home grown or is there an open source version floating around?

    Mike

  11. @Michael, It's home grown. The tricky (non-Java) code is in the slides. I could write an open source version if people are interested. ;)

  12. Hi, I can't see the slides.
    How can I get them?

