Simplifying low latency services

Overview

Java Chronicle is a persisted, inter process messaging system which is very fast when used in a low level way.  However, if you don't need this extreme speed, there is a couple of simpler ways to use this open source library.  One of these to use Chronicle's distributed collections.  This is very simple to use but rather slower. 

This post explores an intermediate solution.  It is fast (sub 10 microsecond 99.9% of the time), ultra low GC, and performs well even if you have burst of data larger than the main memory size.

This post continues from Low latency services and the demo is an implementation of the gateways and processing engine in the diagram.

Service by Contract

A way to model the service is to have an interface for the methods/requests/events you want to support and another interface for events out of the processing engine.  A demo has been added to demonstrate this approach.


A simple request to pass to the processing engine from the gateway.  You can have any number of arguments and use primitives instead of command objects, but we are trying to be simple here rather than ultra fast.

public interface Gw2PeEvents {
    public void small(MetaData metaData, SmallCommand command);
}


public class SmallCommand implements ExcerptMarshallable {

    public StringBuilder clientOrderId = new StringBuilder();

    public String instrument;
    public double price;
    public int quantity;
    public Side side;

A simple response per request is as follows

public interface Pe2GwEvents {
    public void report(MetaData metaData, SmallReport smallReport);
}

public class SmallReport implements ExcerptMarshallable {
    public CharSequence clientOrderId = new StringBuilder();
    public ReportStatus status;
    public CharSequence rejectedReason = new StringBuilder();

The MetaData class wraps the timestamps for the end to end process.  It records a tenth of micro-second time stamp for when
  • the request is written
  • the request is read
  • the response is written
  • the response is read.
  • Includes a sourceId and eventId triggering the response, needed for restart

How does it perform?

The throughput on a 3.8 GHz i7 with two gateways producing 10 million inbound and 10 million outbound messages each took 12.2 second to return to the gateways which sent them or 1.6 millon request/responses per second. For 200 million messages the speed of the SSDs starts to matter as disk cache size is exceeded and the performance dropped to 200 million in 176 seconds or 1.1 million per second.

The critical latencies to look at are not the average or typical latencies but the higher percentiles. In this test, the 90th, 99th and 99.9th percentiles (worst 10%, 1% and 0.1%) were 3.3 µs, 4.9 µs, 9.8 µs.  Under higher load with 200 million request and 200 million responses, a burst exceeding the main memory size, the latencies increased to 4.2 µs, 6.8 µs, and 28.8 µs

There are very few systems which can handle bursts of activity which exceed the main memory size, without dropping in performance too much. (1.5x to 3x worse)

If you run this test with -verbosegc you may see a minor GC on startup with a small heap size of 16 MB, however the demo is designed to create less than one object per request/response and you don't get additional GCs.

Note: this test didn't include a warm up to make the numbers more representative of what you might see in a real application. i.e. worse than a micro-benchmark usually shows.

What does the memory profile look like?

The memory profile is basically flat.  If you look at the processes with VisualVM you will see the memory used to poll the process (i.e. it's the same as when you are not doing anything)

If you use -XX:-UseTLAB for more accurate memory usage and jstat -gccapacity it doesn't show any memory usage for 200 million request and responses.  This is not correct as Chronicle does use some heap but jstat only shows page usage (multiples of 4 KB) i.e. the usage is less than 4 KB.

What is ExcerptMarshallable?

This interface is basically the same as Externalizable, however it supports all of Excerpts functionality for improved performance and make the stream more compact. Given services can be limited by you disk sub-system memory bandwidth when you have large bursts of data, this can matter.

Compared with Externalizable in real programs you might expect to have half of the data serialized and double the throughput.

ExcerptMarshallable is designed to support object recycling. A large percentage of the cost of de-serialization is having to create new objects which fills your CPU caches with garbage slowing down your whole program, not just the de-serialization code but all the code running on the same socket (including other programs on that socket)




@Override

public void readMarshallable(Excerpt in) throws IllegalStateException {
    // changes often.
    clientOrderId.setLength(0);
    in.readUTF(clientOrderId);
    // cachable.
    instrument = in.readEnum(String.class);
    price = in.readDouble();
    quantity = in.readInt();
    side = in.readEnum(Side.class);
}

@Override
public void writeMarshallable(Excerpt out) {
    out.writeUTF(clientOrderId);
    out.writeEnum(instrument);
    out.writeDouble(price);
    out.writeInt(quantity);
    out.writeEnum(side);
}

Other benefits

When you consider it is recording every message sent including detailed 0.1 µs time stamp, you are getting a lot of support for very accurate tracing of the timings with very little cost to the application.

More importantly you get a deterministic service which can be replayed to ensure reproducibility of behaviour as well as performance.

For example, you can take the gateway Chroincles from production and replay them in your test environment without the need for realistic test gateways to be running.

You can replicate the chronicles to a test system in real time, just as you can production and see that your test system behaves correctly with real data.

Conclusion

While coding for Chronicle to get maximum performance is very low level, too low level for most developers, you can still get very good performance with higher level designs which make it easier to use.

Summary of performance
  • 20 million: 1.6 million per second, latencies 90/99/99.9%:  3.3 µs, 4.9 µs, 9.8 µs. 
  • 200 million: 1.1 million per second, latencies 90/99/99.9%:  4.2 µs, 6.8 µs, 28.8 µs
20 million request and responses fit in disk cache and were not significantly impacted by the disk sub-system's performance.
200 million request and responses exceeded the disk cache.


Comments

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues