Data, Data, Data! 11 Great Financial Data Vendors

Hey everyone, Intern Rao here. A few weeks ago, I worked with Hitoshi and Yoshi to put together “9 Great Tools for Algo Trading”. 

Looking at the response to that article, we decided to write a follow up. Whether you’re a financial firm or an individual trader, financial data is key for putting together any good strategy. With so many vendors on the market today, many good options get lost in the noise. Here are 11 great financial data vendors. 

By Chris Lee on Unsplash

If your service is featured and is being misrepresented in some form, please reach out to me ASAP (through my medium page, linked at the bottom), and an immediate correction will be made.

Vendors for the Individual Investor

The following data vendors are those that see their target audience as the individual investor. While they do have larger premium plans for high volume usage, the plan options and community are structured around the individual investor. 

(1) IEX

IEX, which stands for the Investor’s Exchange, was founded by four former employees of the Royal Bank of Canada. Upset by the unfair conditions of today’s stock market, they looked to create an exchange that leveled the playing field for long-term investors. Their story was chronicled in the 2014 Michael Lewis Book Flash Boys: A Wall Street Revolt.

IEX has a significant amount of free data, as well as multiple APIs for accessing it. The catch is that all pricing data reflects the price of that security on IEX itself. That said, their coverage of securities is second to none, spanning US equities, fixed income, and ETFs. Find out more on the IEX website here.

(2) Quandl

If you’ve used Zipline for backtesting at some point, you’ve probably used free data from Quandl. Based out of Toronto, Quandl first started as a search engine for financial data sets. As algo trading grew as a field, Quandl started repackaging their search engine of data sets as individual bundles. 

Quandl offers a bundle with US equities for free, which is actually the default bundle you ingest on Quantopian’s Zipline backtesting engine. Be warned that this bundle doesn’t contain any fixed income or ETFs. All other bundles are priced with a one-time fee, and Quandl even serves as a marketplace for third parties to sell their bundles. If you’re newer to the field, Quandl’s not a bad place to start. 

(3) Alpha Vantage

AlphaVantage is a community-based data vendor. In the last few years, AlphaVantage has worked hard to foster a developer community, not unlike Quantopian. The community has helped build the product, as well as discussing investment strategies and other apps built on AlphaVantage data. 

AlphaVantage is a generally free resource with APIs available in multiple languages. They provide both real-time data streams and historical market data for backtesting. The API doesn’t have a daily/weekly/monthly limit, but the free subscription does have limited latency. For serious investors making a larger volume of API calls, a premium subscription is the better fit. 
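As a rough illustration of how light the integration can be, here is a minimal sketch that pulls daily bars over HTTP with Python's requests library. The endpoint and field names reflect Alpha Vantage's standard query API, but check their documentation for current parameters and rate limits, and substitute your own API key.

# Minimal sketch: daily bars from Alpha Vantage's HTTP query endpoint.
# "demo" is a placeholder API key; replace it with your own.
import requests

resp = requests.get(
    "https://www.alphavantage.co/query",
    params={
        "function": "TIME_SERIES_DAILY",
        "symbol": "MSFT",
        "apikey": "demo",
    },
)
daily = resp.json().get("Time Series (Daily)", {})
# Each entry is keyed by date, e.g. daily["2018-05-14"]["4. close"]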

(4) Intrinio

Intrinio was founded in 2012 with the idea of making financial data easy to access. Intrinio gears its product towards developers, and is currently the data provider for the backtesting engine QuantRocket.

They currently offer two forms of data — streams and packets. Data streams offer a free trial and are then subscription-based, depending on the stream. Packets are bundles of historical data. These bundles can be bought for a one-time payment and are priced based on the width and depth of data coverage. They also offer applications like the Intrinio screener for Excel. 

(5) Zacks

Zacks was founded in 1985 around a quantitative ranking of potential buys based on estimated earnings. That tool, Zacks Rank, became the cornerstone of what Zacks offers today. 

Zacks is more than just a data vendor. It offers both historical and real-time data, as well as a variety of third-party applications that use that data. Zacks also features newsletters, advice from industry experts, and a community of developers. For traders who are looking to get their feet wet, Zacks may be the place to go. 

(6) Polygon

Polygon is far newer on the data vendor scene, having been founded in 2015. They have a variety of pricing plans based on data coverage as well as latency of connection. 

Polygon provides real-time quotes and data streams, and has official clients in 10 different languages including Python, C, and Go. In keeping with a trend among other data vendors of late, Polygon has done its best to foster a community of developers, which has led to the creation of community clients for Perl, .NET, and Scala. 

Disclaimer: Polygon is the data vendor of choice for Alpaca

(7) EOD Historical Data

The quirkily named EOD Historical Data was founded in May of last year, but has already established itself with a combination of accurate data and affordable prices. The Lyon, France-based company offers monthly rates ranging from as low as $9.99 to a high of $39.99.

Business-Facing Vendors

The vendors listed in the previous section are geared towards the individual investor, even though they offer premium subscriptions for businesses. The following vendors are specifically geared towards firms, and are out of the price range of the average individual investor. 

(8) Xignite

Xignite was initially founded as a wealth management platform. When founder Stephane Dubois realized the challenge of obtaining accurate market data, he pivoted Xignite to become a data vendor. 

Xignite offers real-time data streams, daily quotes, and historical market data. For developers, they offer multiple APIs for historical data. Their products are expensive for the individual investor, and Xignite itself identifies as business-facing. Some of their more famous clients include Robinhood, Wealthfront, and StockTwits.

(9) Thomson Reuters

Thomson Reuters is the result of a merger between the media company Thomson Corporation and the information group Reuters. Founded in London, Reuters has been providing financial information since 1951. Like Bloomberg, Reuters transitioned into information providing in the late 20th century. 

Thomson Reuters provides pricing data for assets on over 200 exchanges, covering both equities and fixed income. Their software includes not just pricing data, but research and analytical tools, a dedicated news stream, and a mobile interface. The full Eikon software costs $22,000 a year, but individual investors can find a stripped-down version with just pricing data for $3,600. 

(10) Bloomberg

Bloomberg was founded in 1981 by Michael Bloomberg, who would later serve as mayor of New York City. Bloomberg may not have Reuters’ pedigree, but it has firmly established itself as the industry standard.

The Bloomberg Terminal packs minute-by-minute data, trading tools, and analytics for over 300 exchanges around the world. For algo traders, the terminal subscription even offers an API accessible from multiple languages. But clocking in at $24,000 a year, it’s a hard sell for the individual investor. If you do have access, whether through your firm or institution, you can’t get better than Bloomberg. 

(11) YCharts

YCharts was founded in 2009 by Ara Anjargolian and Shawn Carpenter as an attempt to compete directly with the Bloomberg Terminal. The self-proclaimed “financial terminal of the web” balances a wealth of data with a variety of useful tools and integrations.

They offer Excel integrations and data visualizations of niche metrics, which is in line with their target audience: hedge fund managers and sales representatives, both groups accustomed to doing in-depth research on behalf of their clients. 

YCharts Professional clocks in at $199/month, which is expensive, but only about 10% of the cost of a Bloomberg Terminal. For those interested in a stripped-down subscription with access to only the simplest data metrics, YCharts offers a membership at $49/month, which is affordable for most individual investors. 

Lastly...

I hope this data vendor list was useful. If you think there are services that I missed, please let me know! I always appreciate any and all feedback.

So, good luck trading everyone!

by Rao Vinnakota

 

/

Algo Trading for Dummies  - 3 Useful Tips When Storing Trade Signals (Part 2)

Handling & Storing Trading Signals Are Hard

The calculation of simple trading indicators is made easy with any one of the technical analysis libraries available. However, the efficient handling and storage of trading signals can be one of the most complex aspects of a live trading system.

Photo by Jeremy Thomas on Unsplash

Calculating Basic Indicators? No Problem

While it’s often necessary to create custom indicators and trading signals, there is still significant benefit to using a standard library such as TA-Lib for the basics. This saves a lot of time compared with reimplementing a set of common indicators in your language of choice. It also has the added bonus of increased processing speed compared to calculations done in native Python, for example.

When it comes to moving averages and other simple time-series indicators, the process is fairly self-explanatory — at every time step you calculate the next numerical value, which is then used as the most up-to-date signal to trade against.

(Code Snippet to read data CSV files and process into trading indicators) https://gist.github.com/yoshyoshi/73f130026c25a7dcdb9d6909b1990277
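The gist isn’t reproduced here, but a minimal sketch of the same idea, reading OHLCV bars from a CSV and computing a couple of simple rolling indicators with pandas (assuming a timestamp index and a 'close' column; TA-Lib could be swapped in for the calculations, and the filename is a placeholder), might look like this:

# Sketch: load OHLCV bars from CSV and derive simple per-bar signals.
# Assumes the CSV has a timestamp index and a 'close' column.
import pandas as pd

bars = pd.read_csv("AAPL_1min.csv", index_col=0, parse_dates=True)

# Stateless, per-bar indicators
bars["sma_20"] = bars["close"].rolling(20).mean()
bars["ema_20"] = bars["close"].ewm(span=20, adjust=False).mean()

# A basic crossover signal: fast average above slow average
bars["signal"] = bars["close"].rolling(5).mean() > bars["sma_20"]

print(bars[["close", "sma_20", "ema_20", "signal"]].tail())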

The signals themselves will be stateless in that respect — you aren’t concerned with previous signals that have been made, only the combination of indicators present at that moment. However, you may still wish to store some of the information from the indicators, if only for external analysis at a later point.

Different Story For Advanced Pattern Recognition

Meanwhile, more advanced pattern recognition cannot be handled in such a simple manner. If, for example, your strategy relies on finding divergence between indicators, it’s possible to get a significant performance boost by storing some past data points from which to construct the signal at each new step, rather than having to reprocess the full set of data in the look-back period every time.

This is the trade-off between storage/RAM efficiency and processing efficiency, with the latter also requiring greater software complexity to achieve.
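As a toy illustration of that trade-off (illustrative only, not from the code referenced above), here is a rolling mean kept incrementally with a small buffer versus one recomputed from the full look-back window at every step:

# Illustrative comparison: incremental rolling mean vs. full recomputation.
from collections import deque

class IncrementalSMA:
    """Keeps only the last `window` prices plus a running sum."""
    def __init__(self, window):
        self.window = window
        self.buf = deque(maxlen=window)
        self.total = 0.0

    def update(self, price):
        if len(self.buf) == self.window:
            self.total -= self.buf[0]   # drop the oldest point from the sum
        self.buf.append(price)
        self.total += price
        return self.total / len(self.buf)

def full_recompute_sma(prices, window):
    """Recomputes the mean over the whole look-back window every call."""
    recent = prices[-window:]
    return sum(recent) / len(recent)

prices = [100.0, 101.5, 99.8, 102.2, 103.0]
sma = IncrementalSMA(window=3)
for i, p in enumerate(prices, 1):
    assert abs(sma.update(p) - full_recompute_sma(prices[:i], 3)) < 1e-9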

How You Should Store Signals Depends On How Fast You Need It To Be

For optimal processing efficiency, you would not only store all the previously calculated signals from past time-stamps, but also the relevant information to calculate the next step in as few operations as possible.

While this would be completely unnecessary for any system with a polling interval above a second, it is exactly the kind of consideration you would have for a higher-frequency strategy.

Meanwhile, a portfolio re-balancing system, or even most day-trading strategies, have all the time in the world (relatively). You could easily recalculate all the relevant signals at each time-step, which would cut down on the need for the handling of historical indicator sets.

Depending on the trading period of the system, it may also be worth using a hybrid approach to indicator and signal storage. Rather than permanently saving the data, you could calculate the full set of indicators at start-up and periodically dump and refresh the data to keep only what’s going to be used in RAM.

The precise design trade-offs should be considered on an individual basis, as holding more data in RAM may not be an option when running the software from lower-power cloud computing instances, nor, at the other end of the spectrum, would you be able to spare the seconds to recalculate everything for a market-making bot.

3 Useful Tips When Storing Trade Signals

As mentioned in part 1 of this series, there is a range of different storage solutions that can be used for trading data. However, several best practices apply across all of them:

  1. Keep indicators in a numeric or boolean format where possible for storage; for example, split a more complex signal set into boolean components. This particular problem caused me several issues in projects I’ve had to work on in the past.
  2. Only store what is complex or time-consuming to recalculate. If a set of signals can be calculated quickly in a stateless manner, it’s probably easier to do so than to add the design complexity of storing extra information.
  3. Plan out the flow of data through your system before you start programming anything. What market data is going to be pulled for each time-step? What will then be calculated from this and what is necessary to store? A well thought-out design will reduce complexity and hassle down the line.

Past this, common sense applies. It’s probably best to store the indicators and signals in the same time-series format as, and alongside, the underlying symbols they’re derived from. More complex signals, or indicators derived from multiple symbols, may even warrant their own calculation and storage process.

You could even go as far as to create a separate indicator feed script which calculates and stores everything separately from the trading bot software itself. The database could then be read by each bot as just another data feed. This not only has the benefit of keeping the system more modular, but also allowing you to create a highly optimized calculation function without the complexity of direct integration into a live system.

Whatever flavour of system you end up using, make sure to plan out the data storage and access first and foremost, before starting the rest of the design and implementation process.

By Matthew Tweed

/

Algo Trading for Dummies  -  Collecting & Storing The Market Data (Part 1)

The lifeblood of any algorithmic trading system is, of course, its data — so that’s what we’ll cover in the first two posts of the mini-series.

Photo by Farzad Nazifi on Unsplash

Always Always Collect Any Live Data

For the retail trader, most platforms and brokers are broadly the same: you’ll be provided with a simple wrapper for a relatively simple REST or Websocket API. It’s usually worth modifying the provided wrapper to suit your purposes, and potentially creating your own custom wrapper — however, that can be done later once you have a better understanding of the structure and requirements of your trading system.

Depending on the nature of the trading strategy, there are various types of data you may need to access and work with — OHLCV data (candlesticks), bid/asks, and fundamental or exotic data. OHLCV is usually the easiest to get historical data for, which will be important later for back-testing of strategies. While there are some sources for tick data and historical bid/ask or orderbook snapshots, they generally come at high costs.

With this last point in mind, it’s always good to collect any live data which will be difficult or expensive to access at a later date. This can be done by setting up simple polling scripts to periodically pull and save any data that might be relevant for back-testing in the future, such as bid/ask spread. This data can provide helpful insight into the market structure, which you wouldn’t be able to track otherwise.
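A bare-bones polling script might look like the sketch below; fetch_quote is a hypothetical placeholder for whatever live bid/ask call your broker or exchange API exposes.

# Sketch: periodically append live bid/ask snapshots to a CSV.
# fetch_quote() is a hypothetical placeholder for your data vendor's API call.
import csv
import time
from datetime import datetime, timezone

def fetch_quote(symbol):
    # Placeholder: call your broker/exchange API and return (bid, ask).
    raise NotImplementedError

def poll_spread(symbol, path, interval_sec=60):
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            bid, ask = fetch_quote(symbol)
            ts = datetime.now(timezone.utc).isoformat()
            writer.writerow([ts, symbol, bid, ask, ask - bid])
            f.flush()                     # persist each snapshot immediately
            time.sleep(interval_sec)

# Example: poll_spread("AAPL", "aapl_spread.csv", interval_sec=60)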

Alpaca Python Wrapper Lets You Start Off Quickly

The Alpaca Python Wrapper provides a simple way to begin working with the API and to create initial proof-of-concept scripts. It serves well for both downloading bulk historical data and pulling live data for quick calculations, so it will need little modification to get going.

It’s also worth noting that the Alpaca Wrapper returns market data in the form of pandas DataFrames, which have slightly different syntax compared to a standard Python list or dictionary — although this is covered thoroughly in the documentation, so it shouldn’t be an issue.
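If you’re coming from plain lists or dictionaries, the main adjustment is label- and position-based indexing; here is a quick illustration with a hand-built frame (made-up values, not actual wrapper output):

# Quick illustration of DataFrame access; the values here are made up.
import pandas as pd

bars = pd.DataFrame(
    {"open": [172.0, 172.4], "close": [172.3, 171.9], "volume": [1200, 950]},
    index=pd.to_datetime(["2018-05-14 09:30", "2018-05-14 09:31"]),
)

last_close = bars["close"].iloc[-1]    # column by name, row by position
that_day = bars.loc["2018-05-14"]      # label-based: all rows for that date
closes = bars["close"].tolist()        # back to a plain Python list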

Keeping A Local Cache Of Data

While data may be relatively quick and easy to access on the fly via the market API for live trading, even small delays become a serious slowdown when running batches of backtests across large time periods or multiple trading symbols. As such, it’s best to keep a local cache of data to work with. This also allows you to create consistent data samples to design and verify your algorithms against.

There are many different storage solutions available, and in most cases it will come down to what you’re most familiar with. But, we’ll explore some of the options anyway.

No Traditional RDB For Financial Data Please

Financial data is time-series, meaning that each attribute is indexed by its associated time-stamp. Depending on the volume of data-points, traditional relational databases can quickly become impractical, as in many cases it is best to treat each data column as a list rather than the database as a collection of separate records.

On top of this, a database manager can add a lot of unnecessary overhead and complexity for a simple project that will have limited scaling requirements. Sure, if you’re planning to make a backend data storage solution which will be constantly queried by dozens of trading bots for large sets of data, you’ll probably want a full specialised time-series database.

However, in most cases you’ll be able to get away with simply storing the data in CSV files — at least initially.

Cutting Down Dev Time By Using CSVs

(Code Snippet to download and store OHLCV data into a CSV) https://gist.github.com/yoshyoshi/5a35a23ac263747eabc70906fd037ff3
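The gist above isn’t reproduced here, but a minimal cache-to-CSV sketch looks roughly like the following; fetch_bars is a hypothetical stand-in for whatever historical-data call your broker wrapper exposes.

# Sketch: cache OHLCV bars to CSV and read them back for backtests.
# fetch_bars() is a hypothetical placeholder for your data API.
import os
import pandas as pd

def fetch_bars(symbol, timeframe="1Min"):
    # Placeholder: return a DataFrame indexed by timestamp with
    # open/high/low/close/volume columns from your data vendor.
    raise NotImplementedError

def load_bars(symbol, cache_dir="data", timeframe="1Min"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{symbol}_{timeframe}.csv")
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0, parse_dates=True)
    bars = fetch_bars(symbol, timeframe)
    bars.to_csv(path)
    return bars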

The use of CSVs, or another simple format, significantly cuts down on the usage of a key resource — development time. Unless you know that you will absolutely need a higher-speed storage solution in the future, it’s better to keep the project as simple as possible. You’re unlikely to be using enough data to make local storage speed much of an issue.

Even an SQL database can easily handle the storage and querying of hundreds of thousands of lines of data. To put that in perspective, 500k lines is equivalent to the 1-minute bars for a symbol between June 2013 and June 2018 (depending on trading hours). A well-optimized system which only pulls and processes the necessary data will have no problem with overheads, meaning that any storage solution should be fine, whether that be an SQL database, NoSQL, or a collection of CSV files in a folder.

Additionally, it isn’t infeasible to store the full working dataset in RAM while in use. The 500k lines of OHLCV data used just over 700MB of RAM when serialized into lists (Tested in Python with data from the Alpaca client mentioned earlier).

When it comes to the building blocks of a piece of software, it’s best to keep everything as simple and efficient as possible, while keeping the components suitably modular so they may be adjusted in the future if the design specification of the project changes.

By Matthew Tweed

/

I am a College Student and I Built My Own Robo Advisor

I’m Rao, and I’m an intern at Alpaca working on building an open-source robo advisor. I don’t have much experience in the space, and I had to find a lot of answers. While there’s a wealth of material available on the web, very little is organized. This post is the combination of various threads and forums that I read through looking for answers.

“Yes, you can do it too!”

What is a Robo Advisor anyway?!

Robo advisors are automated advising services that require little to no user interaction. They specialize in maintaining portfolios based on the investor’s chosen risk level. They first launched around the start of the 2008 financial crisis.

The actual logic for a robo advisor is straightforward.

  • “Allocation” — Given a risk level, portions of capital are allocated to different positions.
  • “Distance” — Over a regular time interval, the adviser scans to see if there’s a significant change in the portfolio balance, and
  • “Rebalancing” — If necessary, rebalances.

Allocation:

The process of allocation is where different weights are assigned to asset classes based on the developer’s algorithm. These asset classes cover the various aspects of the market and help keep the portfolio diversified. Wealthfront defines these asset classes as:

  • US Stocks
  • Foreign Stocks
  • US Bonds
  • Foreign Bonds
  • Inflation Protection

Each of these asset classes is represented with ETFs, which capture a broad view of the asset class.

Distance:

Rebalancing a portfolio comes with a certain amount of overhead and cost. To avoid this, the robo advisor computes a “distance” vector. To trigger a rebalance, this distance will have to be greater than a given threshold. Most robo advisors usually have this threshold set to 5 percent.

The distance, in this case, is the individual change in weight of each position. It is found by first calculating the current allocation: the value of a position relative to the total value of the portfolio. If any one position’s weight has grown more than 5 percent beyond its target (current_weight/target_weight > 1.05), then the robo advisor will trigger a rebalance.
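For example, with illustrative numbers: if a position’s target weight is 30% and market moves push it to 32% of the portfolio’s value, then current_weight/target_weight = 0.32/0.30 ≈ 1.067, which is above the 1.05 threshold, so a rebalance would be triggered.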

Rebalancing:

Rebalancing is the act of buying and selling to return the portfolio to its target state. It’s important that all selling is done first. Once all cash available is gathered, necessary purchases are made.

To calculate how much to trade, take the target allocation in cash (weight * portfolio value) and convert it into a target number of shares at the current price. The difference between the number of shares currently held and that target is the amount to buy or sell.
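As a quick illustration of that share math, with made-up numbers rather than backtest output:

# Illustrative numbers only: converting a target weight into a share trade.
portfolio_value = 10000.00
target_weight = 0.294        # e.g. VTI at risk level 5
price = 140.00               # hypothetical current share price
current_shares = 25

target_cash = target_weight * portfolio_value    # 2940.0
target_shares = int(target_cash / price)          # 21 whole shares
trade = target_shares - current_shares            # -4, i.e. sell 4 shares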

Implementing a Robo Advisor with the Vanguard Core Series:

Now that we’ve established the various parts of what makes a robo advisor, we’ll start executing the steps to build one. We’ll build our robo advisor using Quantopian’s IDE.

When implementing our advisor, I find it easiest to work in steps. Quantopian’s IDE can only run backtests on a full algorithm, which means there’s no “halfway” development. So, each step below is a self-contained algorithm by itself.

Note: The implementation of Vanguard’s Core Series is using the information found here. The portfolio includes a 2% investment in a money market fund, but that isn’t available on Quantopian, so for the purposes of this tutorial, we’ll ignore it.

Risk-Based Allocation:

The first step is to determine our universe and assign weights using the risk level. Our universe will be the Vanguard Core series:

  • VTI — Domestic Equity
  • VXUS — International Equity
  • BND — Domestic Fixed Income (Bonds)
  • BNDX — International Fixed Income (Bonds)

The risk level is essentially the ratio of fixed income to equity, with equity being more volatile. So, a risk level of 0 would have all capital allocated to fixed income (BND, BNDX), while a risk level of 5 would be split 50/50. All 11 possible risk levels are catalogued in a dictionary, with the keys being the risk levels and the values being tuples containing the weight allocations:

risk_level = 5
risk_based_allocation = {0: (0,0,0.686,0.294),
                         1: (0.059,0.039,0.617,0.265),
                         2: (0.118,0.078,0.549,0.235),
                         3: (0.176,0.118,0.480,0.206),
                         4: (0.235,0.157,0.412,0.176),
                         5: (0.294,0.196,0.343,0.147), 
                         6: (0.353,0.235,0.274,0.118),
                         7: (0.412,0.274,0.206,0.088),
                         8: (0.470,0.314,0.137,0.059),
                         9: (0.529,0.353,0.069,0.029),
                         10: (0.588,0.392,0,0)}

The next step will be to implement our allocation in Quantopian. For this step, the only methods we’ll need are initialize and handle_data. The rest are superfluous.

The initialize method is the entry point: when a backtest is started, it automatically calls initialize. So, initialize will be the function that contains the dictionary of risk levels, as well as the chosen risk level. The handle_data method is called automatically after initialize; it’s here that we’ll actually purchase the positions outlined in initialize.

Both initialize and handle_data take the arguments context and data. Context allows the user to store global variables that get passed from method to method. Data helps the algorithm fetch different sorts of data. (Note: this is a change from Quantopian 1, where data was the object in which global variables were stored.)

Let’s get started. In initialize, copy and paste the following code:

context.stocks = symbols('VTI', 'VXUS', 'BND', 'BNDX')
context.bought = False

risk_level = 5
risk_based_allocation = {0: (0,0,0.686,0.294),
                         1: (0.059,0.039,0.617,0.265),
                         2: (0.118,0.078,0.549,0.235),
                         3: (0.176,0.118,0.480,0.206),
                         4: (0.235,0.157,0.412,0.176),
                         5: (0.294,0.196,0.343,0.147),
                         6: (0.353,0.235,0.274,0.118),
                         7: (0.412,0.274,0.206,0.088),
                         8: (0.470,0.314,0.137,0.059),
                         9: (0.529,0.353,0.069,0.029),
                         10: (0.588,0.392,0,0)}
#Saves the weights to easily access during rebalance
context.target_allocation = dict(zip(context.stocks,
                                     risk_based_allocation[risk_level]))

#To make initial purchase
context.bought = False

The variable context.stocks is a list of stock objects. The symbols function turns the strings into objects. The objects have attributes like current price, closing price, etc. Using that list, as well as the dictionary for allocation, we’ll create a second dictionary context.target_allocation. The keys are each ticker (VTI, VXUS, etc.) and the values are the ticker’s weights. This dictionary will be useful for both allocation and rebalancing.

Copy and paste the following code to create handle_data:

if not context.bought:
    for stock in context.stocks:
        amount = (context.target_allocation[stock] *
                  context.portfolio.cash) / data.current(stock, 'price')
        if (amount != 0):
            order(stock, int(amount))
            #log purchase
            log.info("buying " + str(int(amount)) + " shares of " +
                     str(stock))

    #now won't purchase again and again
    context.bought = True

The variable context.bought refers to the flag originally set to False in initialize. Since handle_data is called every time a market event occurs, using context.bought ensures the shares are only bought once.

To buy shares for each stock, the list of stock objects is iterated through. For each stock, the total number of shares is calculated by allocating the proper amount of capital (weight * capital) and then dividing that capital by the current price. Since we can only buy whole shares, leftover capital is added back to the cash on hand.

Lastly, for a smoother backtesting experience, all transactions are logged. The log appears in the lower right-hand corner. Build the current algorithm, and set the start date to June 2013 or later, as the ETF BNDX first became tradeable on June 7, 2013.

Calculating Distance:

Before rebalancing a portfolio, it’s best to first calculate if the current portfolio is worth rebalancing. In more sophisticated robo advisors, this step would have a target portfolio calculated with an algorithm. For this example, our target portfolio is going to be the initial allocation defined by the risk level in initialize.

As the value of each position fluctuates, their portfolio weight may change too. The distance vector will be the change in weight of each individual position. Calculating the distance takes place in three steps:

  1. Calculating the current weights of each position
  2. Calculating the target weights (already done)
  3. Comparing the two weights

We want to check the balance daily, so we’ll go ahead and add the following function, before_trading_starts, to our code:

def before_trading_starts(context, data):
    #total value
    value = (context.portfolio.portfolio_value +
             context.portfolio.cash)
    #calculating current weights for each position
    for stock in context.stocks:
        if (context.target_allocation[stock] == 0):
            continue
        current_holdings = (data.current(stock, 'close') *
                            context.portfolio.positions[stock].amount)
        weight = current_holdings / value
        growth = (float(weight) /
                  float(context.target_allocation[stock]))
        if (growth >= 1.025 or growth <= 0.975):
            log.info("Need to rebalance portfolio")
            break

We first calculate the total value. Since that value doesn’t change, it can be done outside the loop. Then, each position’s individual weight is calculated and compared to the target weight (we’re using division since we’re interested in the relative growth of the position, not the absolute). If that growth exceeds the threshold (currently 2.5%) or falls under it (de-valued), then a rebalance is triggered. We don’t have an actual rebalance function written, so for now, we’ll simply log that a rebalance is necessary. It’s important to note the break added: once the algorithm realizes that it’s time to rebalance, there’s no need to continue checking the rest of the stocks. This improves the best-case running time.

However, we’re not actually done. We have to call before_trading_starts, but calling it in a loop would be inefficient. That’s why we’ll use the schedule_function command in initialize. Add this call to the end of the function block:

    schedule_function(
    func=before_trading_starts,
    date_rule=date_rules.every_day(),
    time_rule=time_rules.market_open(hours=1))

This schedules a distance calculation one hour after the market opens each day. By computing the distance in the morning, we have the time and space necessary to execute rebalancing actions.

Rebalancing

The act of rebalancing a portfolio occurs in two steps. First, all assets that need to be sold are sold, and then all assets that need to be bought are bought. Selling first makes sure that we don’t run out of cash.

The following is the code for a rebalance function:

def rebalance(context, data):
    for stock in context.stocks:
        current_weight = ((data.current(stock, 'close') *
                           context.portfolio.positions[stock].amount) /
                          context.portfolio.portfolio_value)
        target_weight = context.target_allocation[stock]
        distance = current_weight - target_weight
        if (distance > 0):
            amount = (-1 * (distance * context.portfolio.portfolio_value) /
                      data.current(stock, 'close'))
            if (int(amount) == 0):
                continue
            log.info("Selling " + str(int(amount)) + " shares of " +
                     str(stock))
            order(stock, int(amount))
    for stock in context.stocks:
        current_weight = ((data.current(stock, 'close') *
                           context.portfolio.positions[stock].amount) /
                          context.portfolio.portfolio_value)
        target_weight = context.target_allocation[stock]
        distance = current_weight - target_weight
        if (distance < 0):
            amount = (-1 * (distance * context.portfolio.portfolio_value) /
                      data.current(stock, 'close'))
            if (int(amount) == 0):
                continue
            log.info("Buying " + str(int(amount)) + " shares of " +
                     str(stock))
            order(stock, int(amount))

The selling and buying mechanics are the same. The only difference is that when selling stock, you pass a negative value to the order function. In both cases, we take the difference in weight (current minus target) and use that value to find the number of shares to buy or sell.

Multiple Universes

Vanguard provides several more universes beyond the core-series. Currently, we’re able to manipulate the risk level, and observe the outcome. Let’s add the functionality of also being able to select a universe.

The first and most straightforward method is to implement multiple dictionaries. Here’s how something like that would look with the CRSP series added to our algorithm; the user now chooses both the universe and the risk level.

    core_series = symbols('VTI', 'VXUS', 'BND', 'BNDX')
    crsp_series = symbols('VUG', 'VTV', 'VB', 'VEA', 'VWO', 'BSV', 'BIV', 'BLV', 'VMBS', 'BNDX')
    #universe risk based allocation
    core_series_weights = {0: (0,0,0.686,0.294),
                           1: (0.059,0.039,0.617,0.265),
                           2: (0.118,0.078,0.549,0.235),
                           3: (0.176,0.118,0.480,0.206),
                           4: (0.235,0.157,0.412,0.176),
                           5: (0.294,0.196,0.343,0.147),
                           6: (0.353,0.235,0.274,0.118),
                           7: (0.412,0.274,0.206,0.088),
                           8: (0.470,0.314,0.137,0.059),
                           9: (0.529,0.353,0.069,0.029),
                           10: (0.588,0.392,0,0)}
    crsp_series_weights = {0: (0,0,0,0,0,0.273,0.14,0.123,0.15,0.294),
                           1: (0.024,0.027,0.008,0.03,0.009,0.245,0.126,0.111,0.135,0.265),
                           2: (0.048,0.054,0.016,0.061,0.017,0.218,0.112,0.099,0.12,0.235),
                           3: (0.072,0.082,0.022,0.091,0.027,0.191,0.098,0.086,0.105,0.206),
                           4: (0.096,0.109,0.03,0.122,0.035,0.164,0.084,0.074,0.09,0.176),
                           5: (0.120,0.136,0.038,0.152,0.044,0.126,0.07,0.062,0.075,0.147),
                           6: (0.143,0.163,0.047,0.182,0.053,0.109,0.056,0.049,0.06,0.118),
                           7: (0.167,0.190,0.055,0.213,0.061,0.082,0.042,0.037,0.045,0.088),
                           8: (0.191,0.217,0.062,0.243,0.071,0.055,0.028,0.024,0.030,0.059),
                           9: (0.215,0.245,0.069,0.274,0.079,0.027,0.014,0.013,0.015,0.029),
                           10: (0.239,0.272,0.077,0.304,0.088,0,0,0,0,0)}

    #set universe and risk level
    context.stocks = crsp_series
    risk_based_allocation = crsp_series_weights
    risk_level = 1

The user can use the three variables, context.stocks, risk_based_allocation, and risk_level, to set both the universe and the risk level.

What’s Next

Developing in Quantopian is a great experience — they provide lots of useful tools. But it’s also limiting, being forced to work in only one file, etc. To build further on this robo advisor, I plan to migrate my work so far to the backend and use Quantopian’s open-source Python library Zipline to run backtests. The next installment will cover all this and more!

You can find the algorithm here. Try it yourself!

by Rao Vinnakota

/

9 Most Commonly Asked Questions About MarketStore And Answers To Them

Photo by William Stitt on Unsplash

Each of these articles seeks to explain the technology we build along with our Alpaca algo trading brokerage. These articles led to active discussions on Reddit and Medium, and it became clear to us that there is a lot of interest, and a pretty large need, in the community for a timeseries database dedicated to financial market data. The database world and software engineering in general have changed so much over the last decade, as we’ve seen an explosion in open source programming and databases. We are now seeing people actively using the open source code and contributing to the GitHub repository.

In social media and offline, we’ve been answering questions and responding to comments, but today we wanted to take the opportunity to put all the queries and responses together and share them with the entire community in a single post.

Q: Does MarketStore store data in memory?

A: No. MarketStore is designed to run on a reasonably sized host without a huge hardware investment. If you have lots of cash, software technology is irrelevant, but what good software engineering brings is the ability to do a much better job with cheaper hardware. MarketStore’s primary use case is to store and distribute years of data at second-level granularity for tens of thousands of series or more (US equities and crypto coins across exchanges can easily reach this size). The data can run to a few terabytes, and it is still not very common to have that much RAM in commodity hardware. MarketStore instead stores everything on disk, but the on-disk format is nearly identical to the layout in memory, and thanks to the evolution of SSDs, MarketStore can load the data at a speed competitive with in-memory storage.

Q: How does it make sense to compare with PostgreSQL, and why include DataFrame loading?

A: Even if you can store the data, offloading it from application processes, it is not useful if you cannot consume it. MarketStore is mainly used in the context of machine learning and backtesting, where the application typically loads the data into a tabular structure such as a pandas DataFrame. That is why MarketStore’s network protocol is a byte sequence in MessagePack format, so inefficient JSON deserialization can be avoided. The client can load the delivered bytes into memory as a C array, which is what backs a DataFrame.

Q: How is it better compared to InfluxDB?

A: We have not compared performance with InfluxDB, but InfluxDB and other general-purpose timeseries databases target use cases such as system metrics or activity-log analysis. Those require more flexible data structures and don’t necessarily need specific functions such as timezone-aware aggregation. As always, that flexibility comes with overhead as a trade-off, and MarketStore should be much faster and more cost effective when the use case is financial market data.

Q: Why are you comparing with PostgreSQL when Timescale should be faster?

A: You can send us benchmark results if you have them, but in our internal experiments Timescale was even slower than PostgreSQL, let alone MarketStore. The loading time at the database-server level for Timescale is 2-3x slower than PostgreSQL, since Timescale relies on table partitioning (aka constraint exclusion), which requires opening lots of files from disk. That gives an advantage when filtering a small slice out of a large amount of data, but it does not work better if you scan most of it. MarketStore lays the data out optimally on disk and reads it sequentially straight into memory, so compared to those relational databases it is much faster.

Q: MarketStore can only be used for historical data, not real-time data, right?

A: There is a new feature coming soon to MarketStore that will allow streaming and real-time push on every new data write. MarketStore was originally designed to support our algo trading platform, which builds trading algorithms using deep learning and runs them in the real market, and it originally had JSON websocket streaming. That feature was set aside for the time being so that MarketStore could fit a broader range of use cases, but thankfully it is now back in as a plugin. We have been testing it with thousands of updates every few seconds and so far it is working perfectly.

Q: Why do I need this for machine learning? I can load the data from disk without a problem

A: If your training process doesn’t use much data (e.g. just daily bars from one stock), then you probably don’t need MarketStore for performance reasons. What we needed to do on the Alpaca trading platform requires a server large enough to store intraday data across the entire market (which can reach the terabyte range) and to load the necessary series back and forth. If you are familiar with a typical machine learning training process, you know how each training iteration loads random data from the pool. That said, MarketStore is not just about performance; it also provides the convenience of a uniform way to access historical and real-time timeseries data without worrying about how to manage local files, etc. And the built-in data ingestor can load the data without your writing any code.

Q: Where is the installer?

A: Sorry, at the moment we do not provide a one-click installer. Instead, we package the server process into a Docker container image, so if you have Docker, you can start it in seconds.

Q: Why is it open sourced?

A: Because there is a problem to be solved! MarketStore was originally implemented as proprietary software for our internal use and has run in our production environment, but we have also seen the same problems affecting many people in the space. Our mission at Alpaca is to help individual investors with technology and to improve the algo trading environment, regardless of whether we give that technology away or offer it in a premium package. This kind of product has only been accessible to financial institutions with large capital resources, but now we are making it available to anyone who is eager to try it out. That’s awesome, isn’t it!?

Q: I found a bug

A: Please report it as a GitHub issue!

/

50x Faster Bitcoin Price Data Powered by MarketStore for AI Trading

In our last post, “How to Setup Bitcoin Historical Price Data for Algo Trading in Five Minutes”, we showed how to set up bitcoin price data in five minutes, and we got a lot of good feedback and contributions to the open source MarketStore.

Photo by chuttersnap on Unsplash

The data speed is really important

Today, I wanted to tell you how fast MarketStore is using the same data so that you can see the performance benefit of using the awesome open source financial timeseries database.

Faster data means more backtesting and more training in machine learning

Faster data means more backtesting and more training in machine learning for our trading algorithms. We are seeing a number of successful machine learning-based trading algos in the space, but one of the key lessons we learned is that data speed is really important. It matters not just for backtesting, but also for training AI-style algorithms, since that is by nature an iterative process.

This is another post to walk you through step by step. But TL;DR, it is really fast.

Setup

Last time, we showed how to set up historical daily bitcoin price data with MarketStore.

This time, we store all the minute-level historical prices using the same mechanism, a background worker, but with a slightly different configuration.

 

root_directory: /project/data/mktsdb
listen_port: 5993
# timezone: "America/New_York"
log_level: info
queryable: true
stop_grace_period: 0
wal_rotate_interval: 5
enable_add: true
enable_remove: false
enable_last_known: false
triggers:
 - module: ondiskagg.so
   on: "*/1Min/OHLCV"
   config:
     destinations:
       - 5Min
       - 15Min
       - 1H
       - 1D
bgworkers:
 - module: gdaxfeeder.so
   name: GdaxFetcher
   config:
     query_start: "2016-01-01 00:00"
     base_timeframe: "1Min"
     symbols:
       - BTC

Almost 2.5 years with more than 1 million bars

The difference from last time is that the background worker is configured to fetch 1-minute bar data instead of 1-day bar data, starting from 2016-01-01. That is almost 2.5 years with more than 1 million bars. You will need to keep the server up and running for a day or so to fill all the data, since GDAX’s historical price API does not allow you to fetch that many data points quickly.

Again, the data fetch worker carefully controls the fetch speed in case the API returns a "Rate Limit" error, so you just need to sleep on it.

An additional piece of configuration here is the "on-disk aggregate" trigger. What it does is aggregate the 1-minute bar data into lower resolutions (here 5 minutes, 15 minutes, 1 hour, and 1 day).

Check the longer time horizon to verify the entry/exit signals

In a typical trading strategy, you will need to check a longer time horizon to verify the entry/exit signals even if you are working at the minute level, so it is a pretty important feature. You would need a fairly complicated LEFT JOIN query to achieve the same time-windowed aggregation in SQL, but with MarketStore all you need is this small section in the configuration file.
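Once the aggregates are being written, querying one of the lower resolutions from Python is the same one-liner as for 1-minute bars, just with a different timeframe string. This is a sketch using the pymarketstore client that appears later in this post, and it assumes the on-disk aggregates are registered under the same symbol:

# Sketch: pull the 15-minute aggregate for BTC with pymarketstore.
# Uses the listen_port (5993) from the config above.
import pymarketstore as pymkts

params = pymkts.Params('BTC', '15Min', 'OHLCV')
df_15min = pymkts.Client('http://localhost:5993/rpc').query(params).first().df()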

The machine we are using for this test is a typical Ubuntu virtual machine with 8 Intel(R) Xeon(R) E5-2673 v3 CPUs @ 2.40GHz, 32GB of RAM, and an SSD.

The Benchmark

Unfortunately lots of people in this space are using some sort of SQL database

We are going to build a DataFrame object in Python that holds all the minute-level historical bitcoin price data since January 2016, pulled from the server. We compare MarketStore and PostgreSQL.

PostgreSQL is not really meant to be the data store for this type of data, but unfortunately lots of people in this space are using some sort of SQL database for this purpose since there is no other alternative.  That’s why we built MarketStore.

The table definition of the bitcoin data on the PostgreSQL side looks like this.

btc=# \d prices
              Table "public.prices"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 t      | timestamp without time zone |
 open   | double precision            |
 high   | double precision            |
 low    | double precision            |
 close  | double precision            |
 volume | double precision            |

The code looks like this.

# For postgres
def get_df_from_pg_one(conn, symbol):
    tbl = f'"{symbol}"'
    cur = conn.cursor()
    # order by timestamp, so the client doesn’t have to do it
    cur.execute(f"SELECT t, open, high, low, close, volume FROM {tbl} ORDER BY t")
    times = []
    opens = []
    highs = []
    lows = []
    closes = []
    volumes = []
    for t, open, high, low, close, volume in cur.fetchall():
        times.append(t)
        opens.append(open)
        highs.append(high)
        lows.append(low)
        closes.append(close)
        volumes.append(volume)

    return pd.DataFrame(dict(
        open=opens,
        high=highs,
        low=lows,
        close=closes,
        volume=volumes,
    ), index=times)

# For MarketStore
def get_df_from_mkts_one(symbol):
    params = pymkts.Params(symbol, '1Min', 'OHLCV')
    return pymkts.Client('http://localhost:6000/rpc'
                         ).query(params).first().df()

You don’t need much client code to get the DataFrame object

The input and output are basically the same: a symbol name is given, the remote server is queried over the network, and one DataFrame comes back. One strong benefit of MarketStore is that you don’t need much client code to get the DataFrame object, since the wire protocol is designed to deliver an array of numbers efficiently.

The Result

First, PostgreSQL

%time df = example.get_df_from_pg_one(conn, 'prices')
CPU times: user 8.11 s, sys: 414 ms, total: 8.53 s
Wall time: 15.3 s

And MarketStore

%time df = example.get_df_from_mkts_one('BTC')
CPU times: user 109 ms, sys: 69.5 ms, total: 192 ms
Wall time: 291 ms

Both results of course look the same, as shown below.

In [21]: df.head()
Out[21]:
                       open    high     low   close   volume
2016-01-01 00:00:00  430.35  430.39  430.35  430.39   0.0727
2016-01-01 00:01:00  430.38  430.40  430.38  430.40   0.9478
2016-01-01 00:02:00  430.40  430.40  430.40  430.40   1.6334
2016-01-01 00:03:00  430.39  430.39  430.36  430.36  12.5663
2016-01-01 00:04:00  430.39  430.39  430.39  430.39   1.9530

 

50 times difference

A bitcoin was about $430 back then… Anyway, you can see the difference: 0.3 seconds vs 15 seconds, roughly a 50x difference. Remember, you may need to fetch the same data again and again for different kinds of backtesting and optimization, as well as ML training.

Also, you may want to query not just bitcoin but also other coins, stocks, and fiat currencies, since the entire database usually won’t fit into main memory.

Scalability advantage in MarketStore

MarketStore can serve multiple symbols/timeframes in one query pretty efficiently, whereas with PostgreSQL and other relational databases you need to query one table at a time, so there is also a scalability advantage in MarketStore when you need multiple instruments.

Querying 7.7K symbols for US stocks

To give some sense of this power, here is the result of querying 7.7K US stock symbols, done as an internal test.

%time dfs = example.get_dfs_from_pg(symbols)
CPU times: user 52.9 s, sys: 2.33 s, total: 55.3 s
Wall time: 1min 26s
%time dfs = example.get_dfs_from_mkts(symbols)
CPU times: user 814 ms, sys: 313 ms, total: 1.13 s
Wall time: 6.24 s

Again, the amount of data is the same, and in this case each DataFrame is not as large as in the bitcoin case, yet the difference when expanding to a large number of instruments is significant (more than 10x). You can imagine that in real life these two factors (per-instrument and multi-instrument) multiply the data cost.

Alpaca has been using MarketStore in our production

Alpaca has been using MarketStore in production for algo trading use cases, both for our proprietary customers and for our own purposes. It is actually amazing that this software is available to everyone for free, and we leverage this technology to help our algo trading customers (early access signup is here).

Thanks for reading and enjoy algo trading!


 

/

Algo Trading Daily News Scan by Alpaca - May 14th, 2018

The Top Algo Trading News Stories and Headlines From Around the Web

Algorithmic Trading and TCA Adoption Driven by FX Global Code

(www.tradersmagazine.com)

“Today the speed of trading, coupled with the number of venues and increasing variety of order types, make it impossible for a human trader to replicate algorithmic behaviour on one order, let alone if the trader has multiple orders to trade simultaneously.”

 

Algo Traders Warn U.K. Over MiFID II Rules

(www.bloomberg.com)

“Financial firms are pressing the U.K. to stick as closely as possible to the European Union’s MiFID II restrictions on algorithmic trading, warning the Bank of England against imposing even tougher rules on the industry ahead of Brexit.”

 

HOW AI IS CHANGING THE FACE OF FINANCE

(www.i4u.com)

“It is a little like a meteorologist looking at two weather systems and seeking to predict how they will change over time, and how they might interact with one another. To stretch the analogy, when you try to decide on the best currency pair for Forex trading, you are selecting two out of more than 100 different weather systems to pitch against one another.”

 

Why Algorithmic Trading Really Works

(www.moneymorning.com.au)

“People often ask how I got my start. They typically want to know how I became a ‘system trader’ — it’s not the sort of job you hear about at a school careers night.”

 

How to become an algo trader

(news.efinancialcareers.com)

“Knowing the data structures, algorithms and C++ or another programming language inside and out is critical. That attribute is not easily available, and people really value that. Most [financial services and fintech] firms will hire a great programmer without trading experience over a mediocre programmer who knows trading.”

 

Automated Trading Broadens Accessibility to Crypto

(www.investopedia.com)

“While traditional asset markets like stocks have set trading hours and no weekend trading, cryptocurrency prices move around the clock, every day of the year. For most investors, staying alert and monitoring an investment portfolio around the clock is not just unfeasible, but also impractical. While automated trading has been around for years, it was largely inaccessible for most ordinary traders, remaining in the purview of sophisticated investors until this point. However, in keeping with the democratization of finance instilled by cryptocurrency, algorithmic trading is being simplified for mass-market consumption.”

/

MarketStore, the financial time series database, is now open source

We are happy to announce that MarketStore is now open source! MarketStore is a database server optimized for financial timeseries data, written in pure Go, designed and developed by Alpaca. You can think of it as an extensible DataFrame service that is accessible from anywhere in your system, with higher scalability.

Read More
/