I have posted three articles, and they generated a lot of great comments and questions.
Each of these articles seeks to explain the technology we build along with our Alpaca algo trading brokerage. The articles led to active discussions on Reddit and Medium, and it became clear to us that there is a lot of interest in, and a real need for, a timeseries database dedicated to financial market data. The database world and software engineering in general have changed a great deal over the last decade as we’ve seen an explosion in open source programming and databases. We are now seeing people actively using the open source code and contributing to the GitHub repository.
On social media and offline, we’ve been answering questions and responding to comments, but today we wanted to gather all the questions and answers in one post and share it with the entire community.
Q: Does MarketStore store data in memory?
A: No. MarketStore is designed to run on a reasonably sized host without a huge hardware investment. If you have lots of cash, software technology matters less, but what software engineering can bring is a much better result on cheaper hardware. MarketStore’s primary use case is to store and distribute years of data at second-level granularity for tens of thousands of series or more (US equities and crypto coins across exchanges can easily reach this size). The data size can be a few terabytes, and it is still not common to have that much RAM in commodity hardware. MarketStore instead stores everything on disk, but the on-disk format is nearly identical to the in-memory layout, and thanks to the evolution of SSDs, MarketStore can load the data at a speed competitive with in-memory storage.
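To make the "on-disk format nearly identical to the in-memory layout" idea concrete, here is a minimal sketch in Python. The record layout (`RECORD`), the file name and the helper functions are all hypothetical and are not MarketStore's actual file format. The point is only that with fixed-width binary records, any bar can be located by arithmetic on its index and decoded directly from a memory-mapped file, with no parsing step in between.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: epoch seconds, then open, high, low, close, volume.
# This is NOT MarketStore's real format, only an illustration of the idea.
RECORD = struct.Struct("<q5d")  # little-endian int64 + five float64 values = 48 bytes


def write_bars(path, bars):
    """Write every bar as one fixed-width binary record."""
    with open(path, "wb") as f:
        for bar in bars:
            f.write(RECORD.pack(*bar))


def read_bar(path, index):
    """Memory-map the file and decode only the record at `index`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # The record's position is pure arithmetic: index * record size.
            return RECORD.unpack_from(view, index * RECORD.size)


def demo():
    """Write 1,000 synthetic one-minute bars and read one back by index."""
    bars = [
        (1514764800 + 60 * i, 100.0 + i, 101.0 + i, 99.0 + i, 100.5 + i, 1000.0 * i)
        for i in range(1000)
    ]
    path = os.path.join(tempfile.mkdtemp(), "bars.bin")
    write_bars(path, bars)
    return bars, read_bar(path, 500), os.path.getsize(path)
```

Because the operating system pages the mapped file in on demand, reading one record touches only the part of the file that holds it, which is why fast SSDs narrow the gap with keeping everything in RAM.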
Q: How does it make sense to compare with PostgreSQL and include DataFrame loading in the comparison?
A: Storing the data outside your application processes is not useful if you cannot then use it. MarketStore is mainly used for machine learning and backtesting, where the application typically loads the data into a tabular structure such as a pandas DataFrame. That is why MarketStore’s network protocol carries the data as a byte sequence in MessagePack, which avoids inefficient JSON deserialization. The client can load the delivered bytes into memory as a C array, which is what backs a DataFrame.
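The difference between the two routes can be sketched with nothing but the Python standard library. This is an illustration of the idea, not MarketStore's client code: the real protocol wraps the bytes in MessagePack, and a real client would more likely hand them to NumPy (for example with `numpy.frombuffer`). Here the `array` module stands in for the C array, and the `closes` column is made-up data.

```python
import json
from array import array

# Hypothetical payload: one column of closing prices.
closes = [100.0 + 0.25 * i for i in range(10_000)]

# Text route: every number is printed as text and parsed back one by one.
json_payload = json.dumps(closes).encode("utf-8")
from_json = json.loads(json_payload)

# Binary route: the column travels as raw float64 bytes (native byte order)...
binary_payload = array("d", closes).tobytes()

# ...and the receiver copies those bytes straight into a typed C array.
received = array("d")
received.frombytes(binary_payload)
```

Both routes recover the same values, but the binary payload is exactly 8 bytes per value and needs no per-number parsing on the receiving side.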
Q: How is it better compared to InfluxDB?
A: We have not compared performance with InfluxDB, but InfluxDB and other general-purpose timeseries databases target use cases such as system metrics or activity log analysis. Those require a more flexible data structure and don’t necessarily need specific functions such as timezone-aware aggregation. Flexibility always comes with overhead as a tradeoff, so MarketStore should be much faster and more cost-effective when the use case is financial market data.
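As an example of why timezone-aware aggregation matters for market data, the sketch below sums volume per calendar day, once in UTC and once in New York time. The `daily_volume` helper and the sample ticks are hypothetical and only illustrate the concept; they are not MarketStore's aggregate functions. If no timezone database is available, the code falls back to a fixed UTC-5 offset, which ignores daylight saving time.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo
    NEW_YORK = ZoneInfo("America/New_York")
except Exception:
    # Assumption: fall back to a fixed EST offset when no tz database is installed.
    NEW_YORK = timezone(timedelta(hours=-5))


def daily_volume(ticks, tz):
    """Sum volume per calendar day as seen in timezone `tz`.

    `ticks` is an iterable of (epoch_seconds, volume) pairs.
    """
    totals = defaultdict(float)
    for epoch, volume in ticks:
        local_day = datetime.fromtimestamp(epoch, tz).date()
        totals[local_day] += volume
    return dict(totals)


def utc_epoch(*parts):
    """Epoch seconds for a UTC wall-clock time given as datetime arguments."""
    return int(datetime(*parts, tzinfo=timezone.utc).timestamp())


# Made-up ticks: the second one is already January 3 in UTC,
# but still the evening of January 2 in New York.
ticks = [
    (utc_epoch(2018, 1, 2, 20, 30), 100.0),
    (utc_epoch(2018, 1, 3, 0, 30), 50.0),
    (utc_epoch(2018, 1, 3, 15, 0), 25.0),
]

by_utc = daily_volume(ticks, timezone.utc)
by_new_york = daily_volume(ticks, NEW_YORK)
```

The same three ticks produce different daily totals depending on the timezone used for the day boundary, which is the kind of detail a database built for financial data has to get right.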
Q: Why are you comparing with PostgreSQL when Timescale should be faster?
A: You can send us the benchmark results if you have them, but in our internal experiments, Timescale was even slower than PostgreSQL, and both were slower than MarketStore. Loading at the database server level took 2–3x longer with Timescale than with PostgreSQL, since Timescale makes use of table partitioning (relying on constraint exclusion), which needs to open lots of files on disk. Partitioning is an advantage when you filter a small slice out of a large amount of data, but it does not help if you scan most of it. MarketStore lays the data out on disk so that it can be read sequentially, directly into memory, which is why it is much faster than those relational databases for this kind of workload.
Q: MarketStore can be used only for historical data, not for real-time data, right?
A: There is a new feature coming soon to MarketStore that will allow streaming and real-time push on every new data write. MarketStore was originally designed to support our algo trading platform, which builds trading algorithms using deep learning and runs them in the real market, and it had JSON WebSocket streaming. The feature had been taken out for the time being so that MarketStore could fit a wider range of use cases, but thankfully it is now back as a plugin. We have been testing this with thousands of updates every few seconds, and so far it is working perfectly.
Q: Why do I need this for machine learning? I can load the data from disk without a problem.
A: If your training process doesn’t use much data (e.g. just daily bars from one stock), then yes, you probably don’t need MarketStore for performance reasons. What we needed to do on the Alpaca trading platform requires a server large enough to store intraday data across the entire market (which can reach the terabyte range) and to load the necessary series on demand. If you are familiar with a typical machine learning training process, you know that each training iteration can load random data from the pool. That said, MarketStore is not just about performance. It also gives you the convenience of a uniform way to access historical and real-time timeseries data, without worrying about how to manage local files and so on. The built-in data ingestor can also load data without you writing any code.
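To illustrate the access pattern, here is a small sketch of a training loop's data loader drawing random windows from a pool of series. The `sample_windows` function, the symbols and the prices are all made up for the example; in practice the pool would be the intraday data for the whole market rather than three short lists in memory.

```python
import random


def sample_windows(pool, window, batch_size, rng):
    """Draw `batch_size` random contiguous windows from a pool of series.

    `pool` maps a symbol to its list of prices; every series must hold
    at least `window` points.
    """
    symbols = sorted(pool)
    batch = []
    for _ in range(batch_size):
        symbol = rng.choice(symbols)
        series = pool[symbol]
        start = rng.randrange(len(series) - window + 1)
        batch.append((symbol, start, series[start:start + window]))
    return batch


# Hypothetical pool: three made-up symbols with synthetic prices.
pool = {
    "AAA": [100.0 + i for i in range(500)],
    "BBB": [50.0 + 0.5 * i for i in range(800)],
    "CCC": [10.0 + 0.1 * i for i in range(300)],
}
batch = sample_windows(pool, window=60, batch_size=32, rng=random.Random(0))
```

Every batch touches an unpredictable mix of symbols and time ranges, so the storage layer has to serve any slice of any series quickly. That is hard to do with ad hoc local files once the pool grows to terabytes.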
Q: Where is the installer?
A: Sorry, at the moment we are not providing a one-click installer! Instead, we package the server process into a Docker container image, so if you have Docker, you can start it in seconds.
Q: Why is it open sourced?
A: Because there is a problem to be solved! MarketStore was originally built as proprietary software for our internal use and has been running in our production, but we have also seen the same problems affecting many people in the space. Our mission at Alpaca is to help individual investors with technology and improve the algo trading environment, regardless of whether we give that technology away to users or offer it in a premium package. This kind of product has only been accessible to financial institutions with large capital resources. But now we are making it available to anyone who is eager to try it out! That’s awesome, isn’t it!?
Q: I found a bug
A: Please report it as an issue on GitHub!