Part 0 - Introduction
Part 1 - Hardware
Part 2 - Replay the market
Part 3 - Build the LOB
Part 4 - Measurements
Part 5 - Overclocking
Part 6 - Linux Tuning
Appendix - Costs of opening a one-man HFT firm
This text is a primer on how to develop a high-frequency-trading system/simulation lab, with a focus on the Nasdaq exchange and the ITCH protocol.
The code is entirely written in C and follows the data-oriented design paradigm.
The reason for picking C instead of C++, when the latter is the de-facto language in the industry, is that C is a very simple language to understand (and optimize), which makes this primer more encompassing than if I were to write the same in C++ (which would require you to be very familiar with advanced concepts of that language).
The topics covered are:
- The necessary hardware to build a simulation lab, as well as advice on ideal PC components.
- How to “replay the market” by sending market data from a “fake exchange” to a “trading server”.
- Parsing incoming data and building the limit order book (LOB)
- Performance measurement techniques.
- RAM and CPU overclocking for optimal use of the hardware.
- SIMD instructions.
- Possibly other topics that will later come to mind.
By the end of this text, you will know how to set up your own simulation lab, how to build a very fast LOB, how to measure the performance of your code and will be in a great position to do research on market microstructure and experiment with HFT strategies.
I am still working on this text so feel free to check for updates every few weeks.
To get a simulation lab as close as possible to the live environment, you will need two computers (running Linux), two network cards (NICs) with both kernel bypass and hardware timestamping functionality, as well as two cables to connect them (1 in and 1 out on each NIC).
One of the computers (computer A) will act as a fake exchange by sending the historical market data to the second computer (computer B - the trading server), which itself handles the parsing of the market data, builds the LOB and does the desired computations to decide when it should trade.
Regarding which Linux distro to use on “computer B”, I recommend Arch Linux if you have experience with Linux: since it is a rolling-release distro, you can very easily test updates made to compilers and see if they improve your code, and additionally you can run it without the X server (the graphical display server), in which case it occupies only around 300 MB of RAM.
If you are a less experienced user, Manjaro is a good alternative.
For “computer A”, any distro will do.
The objective of this setup is to accurately simulate a real world live trading situation, which means that
you will have zero control over computer A (the fake exchange) but you have full control over computer B (the trading server).
Frameworks, such as those that you find on GitHub, won’t cut it because when you are designing a low latency system you have to consider, measure and optimize every single detail of your system, since your success/profitability will very likely depend on it.
Your system must be tuned, both hardware and software, and it must do only one thing (the trading logic), without rogue processes running in the background interfering with the CPU cache.
To have such high precision measurements you need to physically simulate what happens once your server is live, and that is the reason why you need two physically independent computers.
As mentioned above, computer A simulates the exchange by sending historical market data over the NIC (in other words, it replays the market), and computer B acts as the trading server by receiving the data sent by computer A, parsing it, building the order book, doing any desired analysis and eventually sending a message (i.e. an order) back to computer A.
It’s now easy to understand that if you were reading data from disk and processing it on the same computer, you would be unable to get proper measurements, since the computer would be performing other tasks (reading data, piping it to another process, etc.) that skew the numbers.
You would also be unable to test and get an answer for questions such as:
How do you deal with the network side when you are trading in the real world and no longer reading data from disk?
How can you test if your server is able to handle spikes of millions of packets per second?
How much time does it take to copy the data from the NIC buffer to user space? And does that extra time have a negative impact on your trading algorithms?
on and on…
Before going a bit more in depth on how to accurately measure the performance of computer B (the trading server), I would first like to clearly define “performance”.
Performance, in this context, is the sum of the time it takes to retrieve the data from the NIC, plus the time to do all the computations you want to do, plus the time to send data back out of the NIC.
Now, there are two ways you can truly measure performance of a system. (I alluded to one before - the one I took - but I will clarify both now):
The first approach is to use a switch with hardware timestamping functionality, and connect both computers to it.
This switch simply timestamps every packet that passes through it, both coming from “computer A” and “computer B”, and from there you can calculate how much time “computer B” takes to process and reply to incoming data.
These switches can cost from around $4K for the “old” Metamako units to $40K for the newer Arista/Cisco models.
The second approach is the one I took, and which I recommend you do.
I bought two second-hand NICs for around $200 each on eBay, and these NICs offer the same hardware timestamping functionality that you will find on the much more expensive switches, with a precision of around 10ns.
The specific models I bought were the following: Exablaze X2 and Exablaze X4.
You can buy two X2, or two X4, it doesn’t matter, I picked one of each because those were the ones that were for sale at the moment.
If you end up purchasing the Exablaze cards, check their documentation online since they go over installation and some tips on how to get the most out of the cards and most importantly, remember to update the firmware.
Note: You can also check some of the Solarflare cards, which can be had pretty cheap too - just make sure that the model you buy supports OpenOnload / ef_vi.
After plugging both cards into their respective motherboards (use the PCIe slot closest to the CPU!), download and install the libraries that allow you to access them, connect the two cables between the cards and run the vendor’s status tool (e.g. exanic-config).
If the tool reports both cards and links as up, you are ready to follow along: you will be able to replicate the C code that I will share below and will end up with a professional, high-accuracy simulation lab, where you can test multiple ideas/algorithms and do research on market microstructure.
I will now discuss which pc components are ideal for a low latency server.
Remember that for computer A (the fake exchange) you really can use any computer (i.e. an old one).
If you have the possibility to upgrade some components, my recommendation is to purchase the fastest NVMe drive you can afford, since the faster you can read from disk, the faster you can push data out of the NIC.
In the next section, I will show you how to load the historical market data into memory and push it out of the card as fast as possible - this is a great way to test how computer B (the trading server) behaves when you saturate it with data - however it’s not always possible to hold the entire dataset in memory, and that is when having a very fast disk really helps.
For computer B, however, choices of components will very much impact the performance you can get. Let’s go one by one.
You will want an Intel “K” CPU, that is, a CPU that allows you to overclock (OC) both its core frequency and its cache frequency via the BIOS.
Personally, I have an 11900K which, while it didn’t receive great feedback from the public due to its poor multi-core performance, is the fastest single-core processor you can buy at the moment, and hence perfect for an HFT application.
In a later section I will go in depth into overclocking and show you how much other techniques such as disabling extra (unnecessary) cores, disabling sleep states and raising the aforementioned frequencies, improve the overall performance of the system.
Another important feature of the latest-generation Intel CPUs is that they support AVX-512 (previous generations only supported up to AVX2), and I will also show you how using such intrinsics can lead to extremely high-performance code when doing real-time analysis of data.
Alternatives: The previous-gen 10850K is also a very decent CPU and, with a good cooling solution (and some luck on the specific bin), can be pushed up to 5.4 GHz on all cores, or 5.5 GHz in a one/two-core configuration.
Here you should definitely consider a 2-DIMM-slot motherboard. Such boards have shorter traces to the CPU, each of the two slots is connected directly to one memory channel of the CPU, and with fewer DIMMs there is less signal loss.
These motherboards are commonly used for overclocking (do you see a pattern?), not only for the reasons mentioned above but also because their BIOSes allow extremely fine-grained tuning of RAM/CPU parameters, so you can really get the best out of your system.
On my current system I have an Asus Apex XIII installed and in my opinion the Asus BIOS is second to none in terms of organization, accessibility and fine tuning abilities.
Alternatives: Asus Apex XII, Gigabyte Aorus Tachyon, EVGA (z590) Dark and the AsRock OC Formula.
For memory you will want a 2x8GB kit with Samsung B-die chips.
B-die kits are the most stable to overclock (in comparison to something like Micron Rev-E), which means you can push more voltage through them and increase the frequency without crashes.
If you know you will need 32GB of RAM (2x16), note that such kits will be around 5-6ns slower than the 2x8 variants.
The best 2x8 kits:
G.Skill Trident Z Royal DDR4-4800 CL18
Team T-Force XTREEM 16GB 4500 CL18
OLOy Blade RGB 4000 CL14
The best 2x16 kits:
G.Skill Trident Z RGB 4266 CL17
G.Skill Ripjaws V 4000 CL16
Team T-Force XTREEM ARGB 3600 CL14
I currently own a pair of each of the XTREEMs and they are excellent to OC.
Any NVMe/SSD will do. If you are looking for an NVMe drive, either the
Samsung 980 Pro or the Sabrent Rocket 4 Plus are great picks.
If you want to store market data, which comes with a few caveats of its own (see Appendix - Costs of opening a one-man HFT firm), you could look into the enterprise Micron 9300 Pro, perhaps the best SSD-format disk, with up to 15TB of capacity.
This is an extremely important part of the system: when you OC both RAM and CPU they will run hotter, and you will need to cool them down, otherwise the motherboard will either lower their frequency or simply shut down the system to avoid burning the components.
There are a few parts that you have to cool down to achieve maximum system performance and stability and those are:
CPU cooler: I recommend that you go with an AIO (All in One) cooling solution with 360mm (3 x 120) fans. I currently own the Arctic Liquid Freezer II 360 and it’s an excellent cooler which is also very well priced.
Alternatively, you could build your own liquid cooling loop by purchasing and assembling the necessary components from a place like AquaTuning, but a problem arises if you have to ship the server to a colocation site: you would need someone to fill up the coolant on site, since shipping the server with liquid already in the loop could cause leaks.
You could also go with a “normal” cooler but I discourage it since they won’t be able to cool the CPU as well as the AIO counterparts.
To cool the RAM chips and the NIC you should place a fan facing each of the components.
This will greatly improve memory stability while OC’ing and while the cooling for the NIC is not as important, it does make a difference and provides extra airflow inside the case.
You can also add one or two small fans to exhaust the air at the back of the case.
Regarding which fans to purchase, I have always had great results with Noctua fans (and the Arctic fans included with the AIO); they are very capable and silent, so you can easily work with the system by your side. However, if noise is not a concern (i.e. you have the server in a different room or it is ready to be shipped to the trading venue), you can look into more powerful fans such as those from Delta Electronics.
Any 650W+ model from a known brand such as EVGA, Corsair, XPG, etc. will be a good pick. I do recommend, however, that you pick one with a fully modular design (i.e. you only attach the cables that you need to the PSU), as this will lead to much less clutter inside the case and consequently improve airflow.
I currently have an XPG 750W and it’s been great.
Assuming that you want to colocate the server at a trading venue, you will need a 4U case, since otherwise the CPU cooler I recommended won’t fit.
Now, finding decent 4U cases is not the easiest task, and even if you find one it will probably be cluttered with unnecessary things.
I recommend that you remove the front panel and purchase some metallic mesh from a hardware store and use that instead, and while this procedure takes a bit of work, it will greatly improve airflow. (Have a look at a few pictures of my system below, where I did just that).
If your system is meant to be at home, any ATX desktop case with decent airflow will do.
I will end this section with a few pictures of my own lab. Hope you are enjoying the text so far.
Replay the market
In this section I will show you how to “replay the market” by sending market data from computer A (the fake exchange) to computer B (the trading server).
For the sake of accessibility and applicability of this primer, I made some decisions about how to actually proceed, and I will explain them now.
When market data is captured, it is captured in a format called pcap (packet capture)
and what this means is that you store the ethernet packets exactly as they arrive, with
all the different OSI layers information, instead of only storing the payload (the messages from the exchange).
This is done mainly because having the raw data allows you to troubleshoot and do a deep analysis of your network.
If you have the Exablaze/Cisco NICs, you can use their optimized capture library to perform this capture.
However, there are some disadvantages of working with .pcap files.
First, these files are huge (a single day of market data for a single exchange can be 80+ GB) and second, acquiring market data in this format is quite expensive and very few data vendors sell it.
What I decided to do was the following:
Instead of showing you how to send .pcap market data from computer A to computer B, I will show you how to send the actual payload that those files contain, and because Nasdaq makes some of these samples available for download at ftp://emi.nasdaq.com/ITCH/ , you will be able to grab them for free and follow along implementing the code on this series.
Now, to clarify, the aforementioned payload follows a specific protocol called ITCH, which is a protocol developed by Nasdaq for disseminating market data.
I made one more decision that I must clarify prior to continuing:
That decision is that computer A will send only one message in each Ethernet frame instead of (possibly) multiple messages, as you would see in a real-world environment.
I decided to do this because it slightly simplifies the code in both computer A and computer B.
Basically, what will happen on computer B is this: you receive one frame, parse the single message within, then receive the next frame, parse its message, and so on. In a live environment, you might receive multiple messages in the same Ethernet frame, and you would have to iterate over the receive buffer and parse all the messages within.
The change is minimal but I argue that it aids readability in both computer A and computer B code and at this point that is more important in my opinion.
When I finish this primer, I will write an appendix where I will show you how to both send .pcap data from computer A to computer B, with multiple messages in the same frame, as well as how to handle multiple messages in the same frame in computer B.
With that out of the way, here’s the code you should run on computer A:
Compile with: clang -O3 -Weverything -Werror replay.c -o replay -lexanic
And turn on Kernel Bypass on the NIC: $ exanic-config exanic0:0 bypass-only on
Note that this is the simplest “fake exchange” you can have, because as of now it neither receives orders from computer B nor builds an order book representation (which you could then use to trade against).
Later on in this primer, or in a follow-up post, I will show you how to build a “fake exchange” that addresses those issues.
Before moving on to the next section, I would like to clarify what exactly “kernel bypass” is, why it is a necessary technique when you are designing a low latency system, and why you need a specific NIC to achieve it.
When you receive data in a regular network card (such as an ethernet card on your motherboard or a wifi adapter), what happens is that
the Linux Kernel reads/copies those bytes from the hardware to a Kernel buffer (which resides in the Kernel space)
and then copies them again from that buffer to a buffer in user space (so whatever application that is waiting on that data can use it).
These copies take a considerable amount of time (in HFT terms) and also the Kernel has a limited buffer size, which means that if you are being flooded by packets, these will have to wait in a queue until they eventually get copied to user space, resulting in extra delay.
Similarly, when you want to send data, a system call is issued behind the scenes and what happens is that the Kernel takes control of the execution, and on behalf of the user space process copies the data to kernel space and eventually sends it.
What kernel bypass libraries/NICs allow you to do is skip the copies made by the Kernel, by giving you direct access to the hardware (NIC) buffers: you copy the bytes directly from the hardware to user space and from user space to the hardware, without going through the Kernel.
In other words, you bypass the Kernel, and hence the name.
Build the LOB
Before you start this section, I recommend that you go over the ITCH protocol document I linked above, as this will help you understand the data parsing that you will see in the code below.
I will first start by explaining some of the design choices I made in terms of data structures
and overall architecture, and then go in detail over each function used.
By the end of this section you will have a very minimal (i.e. easily expandable) foundation of a simulation lab, as well as a deep understanding of the mechanics of an order book.
Before the deep-dive, however, I want to clarify a few points:
A limit order book represents the will of all market participants at any given moment, and because of that it is the only source of truth at any time T.
An exchange, such as Nasdaq, hosts a LOB for each tradable asset and disseminates each update to each LOB (which can be a buy order, delete order, etc) to all market participants at the same time, via multicast, through a data feed called ITCH (same name of the protocol).
The ITCH feed is a firehose of data, as all the updates to each LOB will be pushed through it, which means that if you are receiving data from the ITCH feed you will receive updates on tradable assets that you might not even be interested in trading yourself.
To handle that, you will have to filter out the incoming messages for tradable assets that you are not interested in trading, and you can do it either in the NIC (by programming the logic in Verilog/VHDL directly on the card) or in software. (In this post I will cover the latter option.)
It’s also important to note that not all types of messages received via the ITCH feed are useful for building the LOB, since not all of them directly represent a change in a LOB (some messages just signal the beginning/end of the trading day, whether there’s a halt in trading, etc.), and while the parser should still recognize them, handling them is not mandatory.
In an attempt at brevity, I will only show you the functions that parse the messages that actually matter for building the LOB; however, writing the remaining functions should be trivial for you after this section (and with the help of the ITCH protocol specification document).
Now, when you receive a message that affects the LOB, you must store it somewhere. For example, if you receive an “add order” message, you need to store it so that if you later receive a “delete” message (with the same ID), you can make the appropriate changes to the order book.
Since all such messages have an ID, the most common data structure to use is a hash-map, using the ID as the key.
This is what you will see in many LOB implementations, and it is a valid option, in particular if you are developing in C++ and pick a good hash-map implementation such as Google’s dense_hash_map or similar.
I will, however, show you a different approach to building the LOB, which has several performance benefits (and beauty) in comparison to a hash-map, at the expense of some complexity.
Without further ado, here’s how you build a super optimized LOB.
I will start to show you the header file - don’t try to grasp everything right now as it will only make sense when you see the implementation files.
There are a few points that I want to explain from the code above:
First one is “DEPTH”, which stands for the number of price levels an order book can have.
Each $1 has 100 price levels (0, 1, …, 99), so if you want to hold information for $10 of depth in each side of the book, you will need 1000 levels for the Bid side, 1000 levels for the Ask side, and 1 level for the initial spread.
You see that formula, 2 * DEPTH + 1, used in the order_book struct.
In the order_book struct, you see that there are multiple pointers; these point exclusively to addresses that belong to either the prices or the volume arrays.
In the order struct, the pointer within points exclusively to an address in an order_book volume array.
(Note that I said “an” not “the”, as, in my implementation, an order has no idea that it belongs to its symbol’s order book; it only knows at which address its quantity is stored - you will see why this is, soon.)
Also note that there are only 5 functions, and these are all the functions you need to build and maintain a LOB.
I will go over each of their implementations in detail soon, but before that, here’s the “main” file of the program, which runs the loop that gets data from the NIC, checks the message type and delegates it to the appropriate function to be parsed.
The most important things to understand from the code above are the following:
The array “interested_in_trading” holds 10,000 bools, because Nasdaq has eight to nine thousand tradable assets (fewer than 10,000).
What will happen is that in the initial messages of every trading day, each stock is assigned a numeric ID for that day (i.e. NVDA = 5823). If we are interested in trading NVDA stock on that day, we set “interested_in_trading[5823] = 1;”, which then allows other functions, such as “add_order()”, to check whether the particular message we are receiving belongs (via its stock ID) to one of the stocks we are interested in trading. If so, we parse the message; otherwise we don’t proceed further.
This, if you remember, is what I meant by filtering messages earlier.
For the same reasons as above, I create 10000 “order_books”, and again only some of them will actually be used, but it is much preferable to pre-allocate everything.
In the array “order_ids”, I allocate room for 1 billion orders; a common day will have around 500-600 million orders, but it’s important to have some headroom in case a trading day has more activity than usual.
Note that every day order IDs start from 0, and if an order was left in the order book from the previous day, that order will be re-entered the following day and given a new order ID.
And here’s the final piece of the puzzle, the code which parses the ITCH protocol and builds the LOB.
Compile with: clang -O3 parser.c main.c -Weverything -Werror -o program -lexanic
Enable Kernel Bypass: $ exanic-config exanic0:0 bypass-only on
Enable promiscuous mode: $ exanic-config exanic0:0 promisc on
A few notes on the code above:
Start by reading the parse_stock_id_and_initialize_orderbook() function and then move on to add_order() and the remaining functions.
Still in the parse_stock_id_and_initialize_orderbook() function, it is important to understand that the value ptr->previous_day_closing_price would normally be requested from a database.
The function replace_order() is identical to the function add_order() except for one extra check. However, the reason why you can’t just call add_order() from replace_order() after performing that check is that the message fields to be parsed are at different positions within the payload.
Work in Progress
Work in Progress
Work in Progress
Appendix: Costs of opening a one-man HFT firm
Going from theory to practice in HFT is not straightforward and below is something you should know (from a purely technical perspective, discarding bureaucracy etc) if you have intentions to open your own HFT shop.
In the simplest of setups, you will need to host a minimum of one server with a colocation provider such as TSN/Pico/Options-IT that provides hosting at Carteret (the data center where the Nasdaq Matching Engine (NME) is located), and for a single server expect to pay around $3.5K MRC (monthly recurring cost) with a $2K NRC (non-recurring cost).
This will include layer 1 access to UDP full book market data (1 hop), which means that your server will be connected to one very fast switch (either a Cisco or Arista L1 switch) which itself is directly connected to the NME.
The average latency from the market data being sent by the NME to it arriving at your network interface card (NIC) will be around 80ns.
To be able to place orders you will have to go through a broker and if you
are capital constrained, probably your only option here will be Lime Execution
(which will cost you around $7K MRC per server).
Your server will be connected to a (not as fast) switch owned by the colocation provider you chose, which itself is connected to an equally not-so-fast switch owned by Lime, which will receive your order, perform some checks (i.e. can you afford what you want to buy, do you own what you want to sell, etc.) and finally send it on to the NME.
From the data leaving your server to reaching Lime’s server will take around 2 * 380ns on average, plus at least a few microseconds for the checks.
Assuming that it takes 3us for the checks to be made and the order to be sent to the NME, your order will take 760ns in transit + 3us in processing for a total of almost 4us from leaving your server to arrive at the NME.
That is ~4,000 nanoseconds, and it is an important figure to understand: there are currently systems (hardware/FPGA based) that have direct market access (DMA) to the exchange and can send an order to the exchange, in response to market data, in ~20ns (in those 20ns they parse the data, make some simple computations and send the order). This means that while your order is in transit to the NME, such systems would have been able to place around 200 orders (serially) in the same period of time, which could alter the market to the point that, when your order arrives at the NME, the price/quantity you were considering trading at is no longer available.
Note that unless you have around $5M to invest you won’t be able to get “sponsored access”/DMA to the exchange, which is provided by one of the big investment banks, and so you are constrained to go through a broker and incur the latency mentioned above.
There is also a somewhat hidden cost that you will have to pay in one form or another, and that is related to market data collection.
If you are recording the market data yourself you have a couple of options:
In your main process (where your parsing/trading logic is), you will have to add extra functionality to save the data (and even if you are writing to RAM you will eventually have to save to disk), which likely implies the use of threads.
Or you have a second process reading the data from the NIC and saving it. (You will have to synchronize it with the main process so that both get to read the data before it’s evicted from the NIC.)
Whatever option you choose, you will eventually have to make system calls to save the data (which adds latency), you will be putting more stress on the kernel task scheduler (which is non-deterministic), and additionally you will be adding more complexity to the code, which by itself will put more stress on the instruction and data caches.
If you can’t accept the added latency/jitter, you also have a couple of options:
Colocate another server which basically only records market data (it doesn’t trade).
This will cost you another 3.5k MRC + 2k NRC + the cost of the server components (and any necessary maintenance).
Purchase the data from a data provider such as Maystreet (also check with your colocation provider as they might offer such an add-on service) and get it by the end of the trading day.
This will cost around $500-600 MRC and it’s probably the way to go if you are not in a position where you have rented a whole rack at the data center and have space for an extra server tasked only with recording market data.
As you can see, it is a substantial investment just to be able to “play the game”, however if you have the capital and the expertise to keep pushing the boundaries, it is definitely worth considering.
August 5th, 2021: Initial commit, covering the first 4 sections of the primer + the appendix “costs of opening a one-man hft firm”.
August 10th, 2021: Minor change in the valid_stocks struct and in the parse_stock_id_and_initialize_orderbook() function. Thank you for the feedback, Anton.
Files edited: parser.h, parser.c
August 14th, 2021: Improved tracking of orders that won’t be added to the order_book (because they fall outside of the predefined range).
Files edited: parser.h, parser.c