
Re: Strix Devlogs

Posted: Mon Dec 03, 2018 4:08 am
by dendiz
A new month, a new dev log for the stuff that's been going on :P Huge change yet again for the engine, which I will get into later. First order of business is a summary of the new stuff:

Summary
  1. get rid of all but USD pairings for cryptos
  2. reduce symbol listing response size from 1.2M to 0.8M
  3. new scanner: trend start
  4. S/R zones to charts
  5. using TaLib java API
  6. Candle stick patterns scanners
  7. port back over to java w/o spring + hibernate
  8. eclim, idea + X forward, Che adventures
  9. market overview calculations in SQL
  10. store technical calculations in KV store
  11. correlation calculations in technical calculations
Goodbye pairings
Some of the alt coins are worth so little compared with BTC that they end up taking 8-9 decimal places. This is a disaster for the display and the layout of the app in general. At first I had thought about displaying these types of currency pairs in satoshis, but that was met with high resistance from my previous team members: the pair name is XXX/BTC, so you cannot display it in satoshis, was the justification. Fair enough, I guess. Another fix could be decreasing the font size when there are a lot of decimal places, but I haven't tried this. It sounds like it should work in theory but I'd have to experiment with it to be convinced that it does. So for the time being the easiest solution is to drop pairings with BTC and keep only the pairs that are traded against USD. All of the big coins are included in this, so no big loss there.

slim responses
The autocomplete component of Vuetify requires a list of the items to complete (though I'm pretty sure there should be a version that can do partial searches with AJAX), and in the previous version that list was huge, around 1.2 megabytes. The initial delay after the autocomplete trigger was around 2 seconds, which made it appear to be unresponsive. My primary solution was to fire a request to the symbol listing endpoint in the background after the app loaded; since requests are cached, searches would not have that initial lag. This worked out like I thought, but the payload size was still more than I cared for, so I got rid of some of the fields in the response and changed the structure from a JSON object to an array, dropping the keys and reducing the size even further.
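Roughly, the compaction just goes from a list of keyed objects to a list of positional arrays, something like this (a sketch with Jackson; the fields are made up, not the real response):

Code: Select all

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.stream.Collectors;

// sketch of the compaction; field names are made up, not the real symbol listing
public class SlimSymbolList {
    public static class SymbolInfo {
        public String symbol;   // e.g. "AAPL"
        public String name;     // e.g. "Apple Inc."
        public String exchange; // e.g. "IEX"
    }

    // old shape: [{"symbol":"AAPL","name":"Apple Inc.","exchange":"IEX"}, ...]
    // new shape: [["AAPL","Apple Inc.","IEX"], ...] - same data, keys dropped
    static String slim(List<SymbolInfo> symbols) throws Exception {
        List<Object[]> rows = symbols.stream()
                .map(s -> new Object[]{s.symbol, s.name, s.exchange})
                .collect(Collectors.toList());
        return new ObjectMapper().writeValueAsString(rows);
    }
}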

more scanners
I saw a scanner on STB that I wanted to incorporate as I think it's important: the new trend started scanner. It fires when the ADX crosses the 25 line. Even though ADX lags, it's still useful to know that a trend has begun.
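The trigger boils down to a cross check on the last two ADX values, something like this:

Code: Select all

// sketch: fire when ADX crosses up through the 25 line
static boolean trendStarted(double[] adx) {
    int len = adx.length;
    return adx[len - 2] < 25 && adx[len - 1] >= 25;
}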

more charts
Data visualizations are always cool, you can never get enough of them. This motivated me to add a chart displaying the support and resistance zones calculated using the clustering method I blogged about a couple of weeks back.
Screenshot 2018-12-02 at 6.39.44 PM.png
I plan on adding charts to each triggered scan with the relevant indicators, like MAs for MA crossovers etc., but it's a low priority task right now.

New technical analysis library
Ta-lib.org is a mature technical analysis library that supports way more indicators than I built into talib4j. I was quite happy with the results I got when using the Python wrapper, so it's back in. I plan on doing a write-up on the performance vs talib4j. The code is probably transpiled from the C version and impossible to read; the Java API is god awful because of this generated code, which means the C API is just as disgusting. To limit the exposure to this filth I wrapped the API with a custom class, so I can swap talib4j back in any time once it has all the indicators coded in.
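The wrapper is basically just a thin indicator interface that the rest of the engine codes against, something along these lines (the method names are illustrative, not the actual class):

Code: Select all

// illustrative shape of the wrapper interface the scanners code against
public interface Indicators {
    double[] sma(double[] close, int period);
    double[] adx(double[] high, double[] low, double[] close, int period);
    double[][] bollinger(double[] close, int period, double stdDevs);
}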

Even more scanners
Using Talib also gives me access to candle pattern recognition. I didn't want to write a scanner for each of the patterns, so I resorted to the Java reflection API to figure out the correct method to call from the scan definition file. A definition looks something like this:

Code: Select all

{
    "id": "...",
    "name": "some scanner",
    "module": "CDL2CROWS"
}
I generated all the definitions from the documentation on the talib site with a small Python script (around 60 different scans) and look the method up from the module attribute of the definition at runtime. With the addition of all the candle patterns, each run now consists of around 16K scans!
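The lookup itself is plain reflection over TA-Lib's Core class, roughly like this (a sketch assuming the com.tictactec.ta.lib wrapper and the usual candle pattern signature without optional parameters; the helper names are mine, not the engine's):

Code: Select all

import com.tictactec.ta.lib.Core;
import com.tictactec.ta.lib.MInteger;
import java.lang.reflect.Method;

public class CandlePatternScan {
    // Resolve e.g. "CDL2CROWS" -> Core.cdl2Crows(...). Case-insensitive match lets the
    // C-style names from the TA-Lib docs go straight into the definition file; the
    // double[] check skips the float[] overloads.
    static Method resolve(String module) {
        for (Method m : Core.class.getMethods()) {
            Class<?>[] params = m.getParameterTypes();
            if (m.getName().equalsIgnoreCase(module)
                    && params.length == 9 && params[2] == double[].class) {
                return m;
            }
        }
        throw new IllegalArgumentException("unknown TA-Lib module: " + module);
    }

    // Assumed signature: (startIdx, endIdx, open, high, low, close, outBegIdx, outNbElem, outInteger).
    // A non-zero value in the last filled slot of outInteger means the pattern fired on the latest bar.
    static boolean fired(String module, double[] o, double[] h, double[] l, double[] c) throws Exception {
        Core core = new Core();
        MInteger begIdx = new MInteger();
        MInteger nbElem = new MInteger();
        int[] out = new int[c.length];
        resolve(module).invoke(core, 0, c.length - 1, o, h, l, c, begIdx, nbElem, out);
        return nbElem.value > 0 && out[nbElem.value - 1] != 0;
    }
}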

Back to the drawing board
Now that I am running 16K scans, the Python code took 48 hours to complete a day's worth of scans. This of course is unacceptable, as the scans need to be done quickly - it's no use to display scans 2 days after the market closes. I poked around with multi-threading in the Python code but to no avail, as there is this thing called the Global Interpreter Lock which limits you to the total performance of a single core. Multi-threaded code also consumed too much memory - something I will not have much of in production. So back to a faster language: Java. But this time I didn't want all the bloat that comes with Spring + Hibernate; they consume way too much memory, which was one of the reasons for going with Python in the first place. So this time around it's a plain old Java application with small utility libraries for database operations. Now 16K scans take 1 hour to complete without multi-threaded code. I was initially thinking of keeping the Python engine for other stuff like synchronization and technical calculations, but ended up porting all the engine code back to Java.

Have Chromebook will code
I got a new toy from the Black Friday sales, so naturally I want to use it all the time. But it's not beefy in terms of hardware, so I needed to find a way of coding Java without running a full blown IDE on my Chromebook. It would probably run the IDE OK, but the engine + database would put too much strain on it and it would start to crawl.

So I started coding in Emacs. Plain old Emacs with no packages. It sounds crazy and it actually is quite crazy. It was nice just suspending the machine and reattaching the session to continue where I left off, but no syntax checking and no auto imports make it a hassle. Not being able to see the parameters of a method call was the worst. So I checked out what packages people were using for Java on Emacs. The most effective one seemed to be Eclim. I had used Eclim before with vim (which is the original purpose of Eclim: Eclipse + vim), and somebody had written a wrapper around the binary for Emacs. I tried to get it running on my Debian development server but I could not get it to work - it would just not connect. So I gave up on Eclim and checked out another package called Meghanada. I found that one too complicated and didn't really find the functions it provided useful.

Then I tried X forwarding with IntelliJ IDEA. This felt like home, a familiarity that was much appreciated. It lasted for a week or two until the inefficient X protocol drove me nuts with the stupid lagging of the UI. I looked around for alternatives and came across xpra, which was supposed to be faster but really wasn't, and the fonts and graphics were blurry, so it went out the window. I thought that maybe it was Swing that wasn't playing nicely, so I tried X forwarding with VS Code. Same problems. On a side note, I really liked the Java plugin for VS Code - it's lightweight and provides all the functionality I was searching for. Another bad thing with X forwarding was that once the laptop suspended, the SSH connection was lost and the applications died on me. Since I have to leave the computer to do other stuff (like burping, diaper changes, etc.) this was bugging me quite a bit.

Then I remembered Eclipse Che, a web based IDE. I had tried Che before in its early stages and wasn't really impressed back then. But now they've created a Docker image which is super easy to install and get running, and they've also added Git support, so at worst I could use Che to write the code, push, and compile/test on my development server. But Che workspaces already come with Java + Maven, so I can basically do everything I want in the Che environment. The tab stays open across suspends and I can continue where I left off without a problem.

bugs bugs bugs
Porting code over is also a great chance to review what I've actually written. I found major bugs in the technical calculations code, which I fixed. I was doing all of these calculations in application code, which is probably slower than doing them on the DB side, so I moved the things I could calculate on the DB over there.

a few architectural changes
I was letting the API calculate some of the data on the symbol detail view on the fly, but now I decided to pre-calculate these for the last trade date and store them in the key/value store. I really want to refrain from having the API do calculations, to consolidate all the logic in the engine code base. This meant moving the technical and overview tab data to the key/value store. I also wanted to merge these two requests, as they had a couple of overlapping values. I also decided to store the correlation data in the key/value store, as it was taking up a lot of rows in the table - I was keeping a history of the correlated items, which is probably not necessary. While at it I also removed the dedicated job for correlation calculations and merged it into the technical calculations.

Wow that was quite a long post - a lot has happened in the past week.

Re: Strix Devlogs

Posted: Tue Dec 11, 2018 6:13 am
by dendiz
So, some more progress this week, mostly optimization of existing jobs - I need to get things running smoothly before I can concentrate on the new features I have planned. The issues for the new features are on gitea waiting to be tackled, but accumulating too much technical debt makes it harder in the long run to get a smooth running system. Here's a summary of this devlog:

Bugfixes
- stoch overfiring
- fix premature exit in predictions

Optimization
- Perf Eval runs in parallel now
- parallel scans, ohlcv
- Module run optimizations: don't run if no new data, don't sync if no data on IEX
- Skip untriggered events in perf eval
- Class property connections on dao
- Parallel correlation proc
- sql2o batching sux, implement grouped insert
- Save ohlcv to KV store for faster access
- track last scan date
- extract exchange param (job pipeline optimization)

Experiments
- Turtle exit for predictions
- BBSqueeze detection change


New features
- Bollinger/ma/ohlcv in KV store, display charts for scans

Bugfixes
First off, a couple of bug fixes. Finding bugs in this system is notoriously difficult, as testing isn't done on behaviors but on generated data. Verifying the data against a source of truth is quite time consuming and I only have a certain amount of time I can dedicate to manual testing. One of the defects that caught my attention this week was stochastic signals firing for both oversold and overbought on the same day, which is absolute nonsense. At first I thought it was a date issue, because the symbols I checked were from the top gainers and had huge increases that could pull the stochastic from oversold to overbought in a day. My guess was that the stochastic was oversold on day T-1 and became overbought on day T. But deeper investigation and hours of debugging showed that this was not the case. The triggering code for the indicator is fairly straightforward:

Code: Select all

if (def.srt.contains("overbought") && d[len - 2] < ob && ob < d[len - 1]) {
    return new ScanResult(data, def);
}
if (def.srt.contains("oversold") && d[len - 2] > os && os > d[len - 1]) {
    return new ScanResult(data, def);
}
if (def.srt.contains("neutral") && (d[len - 2] > ob && ob > d[len - 1]) ||
        (d[len - 2] < os && os < d[len - 1])) {
    return new ScanResult(data, def);
}
There is a subtle bug in this code even though it looks quite simple. The 3rd condition has the shape a && b || c. Since && binds tighter than ||, it parses as (a && b) || c, so the last part fires regardless of the scan type, which is how a symbol ends up flagged as both overbought and oversold :roll: The correct code is a && (b || c). Yet again the great syntax of Java produces a bug that is easy to miss and will make you go blind in the process. Well, it did cost a couple of hours but at least it was an easy fix.
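For reference, here's the fixed neutral condition with the parentheses in place:

Code: Select all

// fixed: parenthesize so the "neutral" check gates both cross conditions
if (def.srt.contains("neutral") && ((d[len - 2] > ob && ob > d[len - 1]) ||
        (d[len - 2] < os && os < d[len - 1]))) {
    return new ScanResult(data, def);
}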

Another bug, one that I introduced during the performance evaluator developments, came to light after I tackled some optimization tasks on that module. I was expecting way more scans to be scored than there were, which led me to investigate the scoring code and revealed a premature exit from the evaluation loop. A misplaced return statement instead of a continue was causing the loop to terminate early and not score the remaining predictions. Another easy fix, at least.


Optimizations
Optimizations were the meat of this week's work. Originally I thought that running the modules on a single thread with good caching would suffice in terms of performance, but I was wrong. :| So I went ahead and parallelized the portions of the modules where it made sense. These were

- Performance Evaluations
- Scans
- Correlations processing

Thanks to Java 8 parallel streams this turned out to be quite easy. I just had to make sure that critical sections in the code were atomic and that I used thread safe classes. One aspect to take into consideration is preserving cache locality: don't evict items from the cache only to have a second loop query them again. So the order of processing is important. Processing in the order

Code: Select all

symbols -> combos -> dates
will make sure that the cache contains the symbol data ready to be processed and will not trigger a DB query. Initially I had just parallelized the dates loop, but that loop doesn't contain enough work to make it worthwhile - the CPU cores were only about 60% busy, which is not ideal. So I moved the parallel streaming up two levels to the symbol level, which now utilizes all cores at 100%. PerfEval and Scans use the same code base, so that was a bit easier than the Correlation Processor, which needed some extra attention due to memory issues. I operate under a RAM constraint (because I need to keep server costs to a minimum), so I had to implement a specific cache for correlation calculations that just holds the symbols and their last 30 closing prices.
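Roughly, the parallelized loop looks like this (a sketch with made-up class names, not the actual engine code):

Code: Select all

import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the symbol -> combos -> dates ordering with Java 8 parallel streams.
// Candle, ScanDef, Dao and runScan are stand-ins, not the actual engine classes.
public class ParallelScanSketch {
    static class Candle { LocalDate date; double open, high, low, close; long volume; }
    static class ScanDef { String id, name, module; }
    interface Dao { List<Candle> loadOhlcv(String symbol); }

    private final Map<String, List<Candle>> cache = new ConcurrentHashMap<>();

    void run(List<String> symbols, List<ScanDef> defs, List<LocalDate> dates, Dao dao) {
        // Parallelize at the symbol level: each worker loads one symbol's OHLCV once
        // and the inner combo/date loops reuse it instead of hitting the DB again.
        symbols.parallelStream().forEach(symbol -> {
            List<Candle> ohlcv = cache.computeIfAbsent(symbol, dao::loadOhlcv);
            for (ScanDef def : defs) {           // "combos"
                for (LocalDate date : dates) {   // innermost loop: too little work to parallelize alone
                    runScan(def, symbol, ohlcv, date);
                }
            }
        });
    }

    void runScan(ScanDef def, String symbol, List<Candle> ohlcv, LocalDate date) {
        // actual scan logic lives here in the engine
    }
}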

The second area of optimization was syncing data from IEX and running scans on the synced data. I currently orchestrate the module jobs via Jenkins, and the pipeline only supports triggering based on a fail/success return code. This means that every time I run a sync job, a scan job will trigger on success - even if there is no new data. It will just calculate the same results over and over again. This wasn't really a problem when the scan job only took a couple of minutes to complete with 1 year's worth of data, but it doesn't work with 5 years of data. So I implemented a check in the scan module that will only run the scans if the latest scan date for a symbol is earlier than the latest OHLC date. The job will still run, but it will just skip the scans, so it takes only about 3-4 minutes to complete as opposed to 1-1.5 hours.

Another issue is that IEX doesn't clearly define when they will update the API with the day's stock data. Previously I was fetching the data at 7 pm and 11 pm local time and processing it even if it was old (the 7 pm run; they would have the data updated by 11 pm for sure). But this means the new scans/signals are not shown until almost 12 am, which is less than ideal. I didn't want to check every hour or two because it would trigger the whole job pipeline and it's a lot of data to download. My solution came after I discovered an API endpoint that lists the symbols and has a date field showing when it was updated. Why didn't I just check a symbol for its last date? Because on any given date a symbol may not be traded. The probability of AAPL not being traded is rather low, but I still prefer a robust solution if there is one. The cost of making this API call to the symbol list is low, so now I poll for new data every hour between 4 pm and 11 pm on weekdays. There is still no way to abort the pipeline without a failure on Jenkins, but with the scan checking this is now less of a problem.
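The check itself is tiny; something like this (stand-in names, not the real classes):

Code: Select all

import java.time.LocalDate;

// Sketch of the "nothing new to scan" guard. KvStore and OhlcvDao are stand-ins.
public class ScanGuard {
    interface KvStore { LocalDate lastScanDate(String symbol); }
    interface OhlcvDao { LocalDate latestOhlcDate(String symbol); }

    // Only scan when there is OHLC data newer than the last recorded scan date.
    static boolean shouldScan(String symbol, KvStore kv, OhlcvDao dao) {
        LocalDate lastScan = kv.lastScanDate(symbol);
        LocalDate lastOhlc = dao.latestOhlcDate(symbol);
        return lastScan == null || lastScan.isBefore(lastOhlc);
    }
}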

A huge pain point was the duration of the performance evaluation. I realized this week that I was doing a lot of unnecessary processing that was causing the job to take forever: I only needed to calculate performance for the scans that had actually triggered for a symbol, but I was running all the scans for that symbol. Duh! :roll: With a new filter that skips irrelevant scans and with parallel processing, the performance evaluation now takes 1 hour to complete, which is reasonable. Even though I won't be running this job that often, it kind of became my holy grail to optimize it. Looking back at the comments on the gitea issue, the first iteration resulted in 1K scored scans and 6K unscored (this was due to the bug I mentioned previously, which at the time I didn't know about). This was way too little, so I thought I'd throw more data at it and increased the data interval from 1 year to 5. This increase resulted in 2.4K vs 6K. Still not good enough. After fixing the premature loop termination the final ratio is 5K/6K, which looks OK to me. It is possible that some scans just didn't occur frequently enough to be scored.

The DAO layer also got some love this week. I had previously implemented the DAOs in a way that each operation would open a new connection to the DB, which is not good practice. I didn't want to integrate a connection pooling solution as the processes are not long lived, so I just refactored the connection objects to be reused class wide. I'm using a relatively new SQL library called sql2o, which has a nice plain API for DB operations, but the way they implemented batch insertions is not optimal: it just wraps the inserts in a transaction and still inserts each row individually. MySQL's grouped insert performs much better than this, so I refactored the batch inserts to generate a grouped insert query. This increased insert performance quite a bit, even though I didn't measure by how much.
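For illustration, the grouped insert boils down to building one multi-row VALUES statement (a sketch in plain JDBC; table and column names are made up):

Code: Select all

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Sketch of the grouped insert, shown with plain JDBC for illustration.
// Table and column names are made up; chunk the rows if the list gets huge.
public class GroupedInsertSketch {
    static class Row { String symbol; java.sql.Date date; double close; }

    static void insertAll(Connection con, List<Row> rows) throws Exception {
        StringBuilder sql = new StringBuilder("INSERT INTO ohlcv (symbol, date, close) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?,?,?)" : ",(?,?,?)");
        }
        try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
            int p = 1;
            for (Row r : rows) {         // one statement, one round trip for all rows
                ps.setString(p++, r.symbol);
                ps.setDate(p++, r.date);
                ps.setDouble(p++, r.close);
            }
            ps.executeUpdate();
        }
    }
}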

After examining the MySQL advisor in phpMyAdmin I saw that it complained about a lot of row sorting. The culprit was that each query to the OHLCV table needed a sort by date for the caching and range query to work properly. The problem is that the data already comes sorted from the data API, and that ordering is lost after insertion into the OHLCV table. I thought about getting rid of the table and querying the JSON data directly, but some of the functions in the TOP module and the market overview module take advantage of this table to reduce the amount of code and offload processing to the database, so the table had to stay. I now store the API response in the K/V store, only for queries by the ChartData component, for caching. I also saw that selecting the last scan date from the scan_result table was doing full table scans, so I save those dates in the K/V store too. This type of optimization can be good for performance, but it's important not to let these data points get out of sync. I also implemented a fallback mechanism that queries the DB if the value is not found in the K/V store. This case can occur when a symbol is introduced and its last scan date is not yet inserted.
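The fallback is the usual read-through pattern (again, stand-in names rather than the real classes):

Code: Select all

import java.time.LocalDate;

// Sketch of the K/V-first lookup with a DB fallback; interfaces are stand-ins.
public class LastScanDateLookup {
    interface KvStore { LocalDate get(String key); void put(String key, LocalDate value); }
    interface ScanResultDao { LocalDate maxScanDate(String symbol); }

    static LocalDate lastScanDate(String symbol, KvStore kv, ScanResultDao dao) {
        LocalDate cached = kv.get("last_scan:" + symbol);
        if (cached != null) {
            return cached;                          // fast path: K/V hit
        }
        LocalDate fromDb = dao.maxScanDate(symbol); // slow path, e.g. a freshly added symbol
        kv.put("last_scan:" + symbol, fromDb);      // backfill so the next lookup is cheap
        return fromDb;
    }
}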

The Jenkins job pipeline also needs to be separated by exchange to reduce the amount of processing. This can be achieved by extracting the exchange as a CLI parameter to the jobs. No need to run scans for IEX after BFX data syncs, just run the BFX scans :P



Experiments
I decided to change the way the prediction checker scores scans. My initial implementation was to exit at +/- 2 x ATR of the symbol. This yielded around 50% average scores. Next I tried a 15 period low/high exit strategy. There is really no correct way of doing this, as exit strategy is very personal. My reasoning was that on an up trend the 15 period low would still capture decent gains, and on a down trend it would exit quite early to cut losses. After the performance evaluation run, the average score was 0.23959762958591568 and the average error was 0.17946781069881107 at 95% confidence.
I will still run the 2xATR and 3xATR strategies and a percentage based exit to compare their outputs. Maybe running multiple strategies and showing the best performing one could also be a nice feature, but again it's very personal. I believe a 1:1 take profit / stop loss ratio will result in an average score of around 50%, and 2:1 in an average score of about 33%, basically by expected value, since the price changes are distributed randomly.
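To sanity check that expected value intuition, here's a toy simulation with symmetric unit steps (not real price data) that lands at about 1/3 for a 2:1 target/stop:

Code: Select all

import java.util.Random;

// Toy check of the expected value argument: with symmetric +/-1 steps, a 2:1
// take-profit/stop-loss is hit first about 1/3 of the time (gambler's ruin);
// a 1:1 ratio would come out around 1/2.
public class ExitOddsSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int trials = 100_000, wins = 0;
        int target = 2, stop = -1; // 2:1 reward/risk in unit steps
        for (int t = 0; t < trials; t++) {
            int p = 0;
            while (p < target && p > stop) {
                p += rnd.nextBoolean() ? 1 : -1;
            }
            if (p >= target) wins++;
        }
        System.out.printf("win rate ~ %.3f (expected ~0.333)%n", wins / (double) trials);
    }
}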

I also changed the way the Bollinger Bands squeeze scan works. It was scanning the last N periods and triggering if the last period's bands were within a certain limit of the minimum band width over those N periods. This is kind of overcomplicated, as I just want to get the periods where the bandwidth is low, so now it triggers when the bandwidth is < 4%.
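The new check boils down to a one-liner, assuming the usual band width definition (upper minus lower, relative to the middle band):

Code: Select all

// sketch: trigger the squeeze scan when the relative band width drops below 4%
static boolean squeezed(double upper, double middle, double lower) {
    double bandwidth = (upper - lower) / middle; // 0.04 == 4%
    return bandwidth < 0.04;
}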

New features
As I was sifting through the scans I realized that I kept looking for a chart to see the relevant data that triggered the scan. If the scan is a Bollinger Bands squeeze, I want to see the bands on the chart. So I added this feature. It required storing the band data in the K/V store, because that's the only storage the API will access. To be consistent I also stored the last 100 periods of OHLC data in the K/V store to generate the candlestick data for the chart. The K/V store is backed by MySQL but this may change in the future; Redis is a strong contender for K/V storage, but I'll cross that bridge when I get there. I added a new field to the scan definitions file that defines which indicators will be shown on the chart when that scan is triggered.
Here is a screenshot of this new feature in action:
Screenshot 2018-12-10 at 11.49.36 PM.png

Wow, yet another very long post for a short week. Looks like a lot has been done and development is going full steam ahead.