• Plotting WordPress downloads with Pandas and Matplotlib

    Recently I wanted to figure out a bit more about how popular WordPress plugins are used. Helpfully, the WordPress plugin repository provides a JSON API endpoint for fetching the download statistics of any plugin in the repository, e.g.:

    https://api.wordpress.org/stats/plugin/1.0/downloads.php?slug=akismet&limit=365
    

    It is used for plotting the stats on plugin pages, and it works for my purpose as well.

    There are two things I wanted to know from these stats: 1) how many downloads happen on average on each day of the week; 2) how many downloads happen in total each week.

    First I started by manually importing the data into LibreOffice Calc and calculating the moving weekly sums and moving 7-day averages myself (for each daily downloads cell, summing and averaging over the last 7 values).

    Akismet downloads

    For a quick plot it worked, but it was not very convenient to update if you wanted to look at multiple plugins every so often. So this was my excuse to play around with Pandas and Matplotlib and build something a bit more convenient in Python.

    Setup

    To begin we’ll use several libraries:

    • Requests to download the JSON stats;
    • Pandas for data manipulation and to run our extra calculations;
    • Matplotlib to plot the results.

    As usual, we start by creating a virtual environment. For this I used Python 3:

    virtualenv -p python3 venv
    source venv/bin/activate
    

    And installed the dependencies in our virtual environment:

    pip install requests matplotlib pandas
    

    We start by importing the dependencies and downloading the stats from the WP plugin repo:

    import pandas as pd
    import requests
    import matplotlib.pyplot as plt
    from matplotlib import style
    
    url = 'https://api.wordpress.org/stats/plugin/1.0/downloads.php?slug=akismet&limit=365'
    r = requests.get(url)
    data = r.json()
    

    data now contains a dict with dates for keys and download counts for values. DataFrame expects a list-like object for its data parameter, so we will need to reshape the data first from a dict to a list of dicts. While we’re at it, we’ll convert the date strings into date objects:

    data = [
        { 'timestamp': pd.to_datetime(timestamp, utc=False), 'downloads': int(downloads) }
        for (timestamp, downloads) in data.items()]
    data = sorted(data, key=lambda item: item['timestamp'])
    

    At this point we have converted the initial stats dictionary:

    {'2017-03-21': '36723', '2016-08-17': '18061', '2016-06-26': '17821', '2016-09-14': '52539', ...}
    

    Into a list of dictionaries:

    [{'timestamp': Timestamp('2016-06-25 00:00:00'), 'downloads': 18493}, ...]
    

    However, if we pass the resulting list into a Pandas DataFrame, we will see that it is indexed by row number, instead of by date:

    >>> pd.DataFrame(data)
         downloads  timestamp
    0        18493 2016-06-25
    1        17821 2016-06-26
    2        29163 2016-06-27
    3        28918 2016-06-28
    4        27276 2016-06-29
    ...
    

    To fix that, we reindex the resulting DataFrame on the timestamp column:

    df = pd.DataFrame(data)
    df = df.set_index('timestamp')
    

    And get:

    >>> df
                downloads
    timestamp            
    2016-06-25      18493
    2016-06-26      17821
    2016-06-27      29163
    2016-06-28      28918
    ...
    
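
    As a side note, the same reshaping can be done more directly with DataFrame.from_dict. A minimal sketch, where raw stands for the original dict returned by r.json():

    # `raw` is the unprocessed stats dict, e.g. {'2017-03-21': '36723', ...}
    df = pd.DataFrame.from_dict(raw, orient='index')
    df.columns = ['downloads']
    df['downloads'] = df['downloads'].astype(int)
    df.index = pd.to_datetime(df.index)   # parse the date strings in the index
    df = df.sort_index()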

    Now that we have our raw data, let’s calculate the 7-day average. DataFrame.rolling() works nicely for this:

    weekly_average = df.rolling(window=7).mean()
    

    It works by taking a sliding window of sequential data points (in our case, 7 days) and applying an aggregate function (in our case, mean()) over each window to create a new DataFrame:

    >>> weekly_average
                   downloads
    timestamp               
    2016-06-25           NaN
    2016-06-26           NaN
    2016-06-27           NaN
    2016-06-28           NaN
    2016-06-29           NaN
    2016-06-30           NaN
    2016-07-01  24177.857143
    2016-07-02  23649.857143
    2016-07-03  23186.857143
    2016-07-04  22186.428571
    ...
    
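
    The leading NaN rows appear because the first six windows contain fewer than 7 data points. If partial windows are acceptable, rolling() also takes a min_periods parameter, for example:

    # Average over however many points are available so far (1 to 7),
    # instead of producing NaN for the first six days:
    df.rolling(window=7, min_periods=1).mean()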

    We can use the same approach to calculate weekly sums:

    >>> weekly_sum = df.rolling(window=7).sum()
    >>> weekly_sum
                downloads
    timestamp            
    2016-06-25        NaN
    2016-06-26        NaN
    2016-06-27        NaN
    2016-06-28        NaN
    2016-06-29        NaN
    2016-06-30        NaN
    2016-07-01   169245.0
    2016-07-02   165549.0
    2016-07-03   162308.0
    ...
    
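
    As an aside, question 1 from the beginning - the average number of downloads on each day of the week - can be answered with a groupby over the weekday of the index. A minimal sketch:

    # Group rows by weekday of the date index (0 = Monday, ..., 6 = Sunday)
    # and average the downloads within each group:
    per_weekday = df.groupby(df.index.dayofweek).mean()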

    Now onto plotting. We want to display all 3 series (raw downloads, weekly averages and weekly sums) on a single plot. To do that, we create a subplot:

    # Setting the style is optional
    style.use('ggplot')
    fig, ax = plt.subplots()
    

    fig is the figure, and ax is the axes object we’ll draw on.

    We could call the df.plot() command now and it would work, but before we do, let’s set up more presentable column names for the legend and label the axes:

    df.columns = ['Daily downloads']
    weekly_average.columns = ['Weekly average']
    weekly_sum.columns = ['Weekly sum']
    
    ax.set_xlabel('Time')
    ax.set_ylabel('Downloads')
    

    And lastly we plot the data (the legend labels are picked up from the column names we just set):

    df.plot(ax=ax, legend=True, style='r-', title="Downloads")
    weekly_average.plot(ax=ax, legend=True, style='b-')
    weekly_sum.plot(ax=ax, legend=True, style='r--')
    ax.legend(loc='best')
    plt.show()
    

    For line styles, r- refers to a solid red line, r-- is a dashed red line, and b- is a solid blue line. You can refer to the line_styles.py example and the line styles reference in the Matplotlib documentation for more details on styling.

    Results

    Akismet downloads plotted with Python

    Visually it looks similar to the LibreOffice Calc variant, but with some extra effort we could now generate graphs for a much larger number of plugins, add extra data or metrics, or generate image files. And we can use Matplotlib’s interactive tools to zoom in and out of different sections of the plot while looking at specific details.
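
    For example, wrapping the whole thing in a function makes it easy to render and save charts for several plugins in one go. A rough sketch under the same assumptions as above (the plugin slugs are just examples, and the chart is written out with Matplotlib’s savefig):

    def plot_plugin_stats(slug):
        url = 'https://api.wordpress.org/stats/plugin/1.0/downloads.php?slug=%s&limit=365' % slug
        raw = requests.get(url).json()
        data = sorted(
            ({'timestamp': pd.to_datetime(ts), 'downloads': int(count)}
             for (ts, count) in raw.items()),
            key=lambda item: item['timestamp'])
        df = pd.DataFrame(data).set_index('timestamp')
        df.columns = ['Daily downloads']
    
        fig, ax = plt.subplots()
        df.plot(ax=ax, legend=True, style='r-', title=slug)
        df.rolling(window=7).mean().plot(ax=ax, legend=False, style='b-')
        fig.savefig('%s.png' % slug)  # write the chart to an image file
        plt.close(fig)
    
    for slug in ['akismet', 'jetpack']:  # example slugs
        plot_plugin_stats(slug)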

    The final code looked like this:

    import pandas as pd
    import requests
    import matplotlib.pyplot as plt
    from matplotlib import style
    
    url = 'https://api.wordpress.org/stats/plugin/1.0/downloads.php?slug=akismet&limit=365'
    r = requests.get(url)
    data = r.json()
    data = [
        { 'timestamp': pd.to_datetime(timestamp, utc=False), 'downloads': int(downloads) }
        for (timestamp, downloads) in data.items()]
    data = sorted(data, key=lambda item: item['timestamp'])
    
    df = pd.DataFrame(data)
    df = df.set_index('timestamp')
    
    weekly_average = df.rolling(window=7).mean()
    weekly_sum = df.rolling(window=7).sum()
    
    style.use('ggplot')
    fig, ax = plt.subplots()
    
    df.columns = ['Daily downloads']
    weekly_average.columns = ['Weekly average']
    weekly_sum.columns = ['Weekly sum']
    
    ax.set_xlabel('Time')
    ax.set_ylabel('Downloads')
    
    df.plot(ax=ax, legend=True, style='r-', title="Downloads")
    weekly_average.plot(ax=ax, legend=True, style='b-')
    weekly_sum.plot(ax=ax, legend=True, style='r--')
    ax.legend(loc='best')
    plt.show()
    


  • Tracking market activity with InfluxDB

    I really enjoy trading as my primary activity in the online game Eve Online - it’s fun seeing your wallet grow. People call it a “spreadsheet simulator in space” for a reason - it’s especially true if you’re into trading or industry, as you have to calculate your profits!

    It can be quite difficult to gauge the profitability of a particular trade - you have to consider the margin, market movement and competition. With thousands of tradable items and thousands of markets, it can be challenging to find profitable items, which is why people build various tools to do that automatically. Estimating competition is particularly difficult: unless you’re actively trading in an item, you often don’t know whether someone will outbid you every minute, or whether your order will stay at the top for days.

    And this is what this mini project is about. I’ll focus more on technical details, since this is not a gaming blog and the game is just an outlet for the project.

    Eve Online Market Orders

    What is trading and how it works, problem area

    If you know how real life stock trading works, this will probably sound familiar. If you don’t, here’s trading in a nutshell.

    There are buy and sell orders. If you want to sell something, you can create a sell order for that item by specifying a price, and it will be placed on the market. If someone wants to buy the item from you, they will have to pay your quoted price in exchange for the item. Sell orders are always fulfilled from the lowest price to the highest, meaning that if there are two orders at $1 and $2 and someone wants to buy an item, the $1 order will always be fulfilled first regardless of whether the buyer is willing to pay $2 for it.

    Respectively, if you want to buy something, you can put up a buy order for a specific price and number of items (e.g. “I am willing to pay $X each for 20 items”). If someone chooses to sell you an item, an exchange is made: the seller gets your money, and you get the item. Buy orders are fulfilled from the highest price to the lowest, so higher-priced orders are fulfilled first - hence the goal of keeping your orders on top.
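
    To make the matching rule concrete, here is a toy illustration in Python (not game code - just the priority ordering described above):

    # Toy order book: sell orders match lowest-price-first,
    # buy orders match highest-price-first.
    sell_orders = [{'price': 2.00}, {'price': 1.00}]
    buy_orders = [{'price': 10.00}, {'price': 12.00}]
    
    best_sell = min(sell_orders, key=lambda order: order['price'])  # the $1 order sells first
    best_buy = max(buy_orders, key=lambda order: order['price'])    # the $12 order buys first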

    You want to keep your orders on top to sell (or buy) your items the fastest. However, if there are 10 other people trying to do the same thing at the same time, that will clearly be difficult. And in fierce markets your orders can be outbid in under 10 seconds!

    Also, it’s worth mentioning that in Eve Online you cannot update an order for 5 minutes after it has been created or last updated.

    So, knowing the times when your competition is active can be very helpful. Then you can come in just after they’re gone and stay at the top for much longer, or spend less effort doing so.

    Concept and technical setup

    Going into this mini project, there is one main question I want answered: at what time(s) of day do traders tend to update their orders? Optionally, it would also be good to know how often they update their orders, or how many items are sold at specific times of day.

    The idea is to passively watch the market by taking snapshots of active orders, noting their IDs, prices and latest-update timestamps. Since orders can be updated only once every 5 minutes, and the endpoint is cached for 5 minutes, taking a snapshot every 5 minutes is likely a good enough approximation. Once we have that information, we can use the database to group order updates into intervals and simply count the number of rows in each interval to get the number of updated orders.

    I learned about InfluxDB (a time series database) at work, and since then I’ve been meaning to find a personal project to use it on. Their Why time series data? page suggests several use cases, such as tracking financial markets, sensor data or infrastructure data, which seems like a good fit for my problem.

    I set up InfluxDB following the Getting Started guide, and used the influxdb-python package for interacting with the DB.

    Eve Online is special here, because they very actively expose game data via public APIs for 3rd party developers. So we can use their helpfully provided new ESI JSON API to query market data.

    Implementation

    We can fetch the market data from the JSON endpoint by making an HTTP request (Market > Orders endpoint docs). This can be done with the requests library:

    import requests
    
    def get_item_orders(item_id, region_id, page=1, order_type='all'):
        esi_root = 'https://esi.tech.ccp.is/latest'
        url = '%s/markets/%d/orders/?type_id=%d&order_type=%s&page=%d&datasource=tranquility' % (
            esi_root,
            region_id,
            item_id,
            order_type,
            page
        )
        r = requests.get(url)
        response = r.json()
        if 'error' in response:
            raise Exception('Failed to get item orders', item_id, region_id, response)
        else:
            return response
    

    Which returns a response in this format:

    [{
      u'volume_remain': 4,
      u'type_id': 40520,
      u'order_id': 4696910017,
      u'issued': u'2016-12-01T10:20:35Z',
      u'price': 787000000.0,
      u'min_volume': 1,
      u'is_buy_order': False,
      u'range': u'region',
      u'duration': 90,
      u'volume_total': 4,
      u'location_id': 61000990
    }, ...]
    

    Here we mostly care about issued, type_id, location_id, order_id, is_buy_order, price and volume_remain. issued denotes when the order was created or last updated, and type_id denotes which item is being sold or bought.
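
    The endpoint also takes a page parameter. A hypothetical helper for collecting all pages might look like the sketch below; it assumes that an empty page marks the end of the results, which is my assumption here rather than documented behaviour:

    def get_all_item_orders(item_id, region_id, order_type='all'):
        # Assumption: an empty page means we have fetched everything.
        orders = []
        page = 1
        while True:
            batch = get_item_orders(item_id, region_id, page=page, order_type=order_type)
            if not batch:
                break
            orders.extend(batch)
            page += 1
        return orders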

    Storing data in InfluxDB was straightforward with the influxdb-python library:

    import datetime
    from influxdb import InfluxDBClient
    
    db = InfluxDBClient('localhost', 8086, 'root', 'root', 'market')
    
    def fetch_orders():
        item_type_id = 40520
        region_id = 10000046
    
        try:
            orders = get_item_orders(item_type_id, region_id)
        except Exception as ex:
            print('Failed to fetch orders')
            print(ex)
            return
    
        measurements = []
        for order in orders:
            measurements.append({
                'measurement': 'trade_history',
                'tags': {
                    'region_id': region_id,
                    'location_id': order['location_id'],
                    'type_id': order['type_id'],
                    'order_id': order['order_id'],
                    'order_type': 'buy' if order['is_buy_order'] else 'sell'
                },
                'time': order['issued'],
                'fields': {
                    'price': order['price'],
                    'volume_remain': order['volume_remain'],
                }
            })
    
        db.write_points(measurements)
        print('[%s] %d orders fetched' % (datetime.datetime.now(), len(orders)))
    

    Here we simply iterate over the list of order objects and construct a list of measurement points out of them. region_id, location_id, type_id, order_id and order_type don’t change even when an order is updated. The important point is to use the issued timestamp as the measurement time. That way we can accurately track when an order has been updated, and successive data fetches won’t duplicate the data: InfluxDB treats points with an identical measurement, tag set and timestamp as the same point and simply overwrites them.

    If we run the code and check results in InfluxDB, we can see what we have:

    > SELECT * FROM "trade_history" LIMIT 10;
    name: trade_history
    -------------------
    time                    location_id     order_id        order_type      price                   region_id       type_id volume_remain
    2016-10-30T20:37:10Z    61000647        4597357707      buy             5.6300000052e+08        10000046        40520   4
    2016-11-21T07:07:57Z    61000990        4633264567      sell            7.8800000001e+08        10000046        40520   8
    2016-12-01T10:20:35Z    61000990        4696910017      sell            7.87e+08                10000046        40520   4
    2016-12-03T10:14:26Z    61000647        4699246159      sell            7.8300000052e+08        10000046        40520   2
    2016-12-19T23:59:49Z    61000896        4657031429      buy             881                     10000046        40520   10
    2016-12-28T19:58:22Z    61000647        4697110149      buy             4.13e+08                10000046        40520   1
    2016-12-28T22:55:09Z    61000990        4667323418      buy             4.130000001e+08         10000046        40520   2
    2016-12-31T08:13:49Z    61000990        4733297519      sell            7.19999998e+08          10000046        40520   5
    2016-12-31T08:30:24Z    61000990        4733307476      sell            7.19999997e+08          10000046        40520   5
    2016-12-31T08:39:25Z    61000990        4729427547      sell            7.1999999699e+08        10000046        40520   4
    

    And lastly, to run this function every 5 minutes, the schedule library comes in quite handy:

    import time
    import schedule
    
    schedule.every(5).minutes.do(fetch_orders)
    while True:
        schedule.run_pending()
        time.sleep(1)
    

    And now we have a Python script which fetches outstanding market orders every 5 minutes and logs the data to InfluxDB. At this point I left it running for several hours to gather some data.

    The last step is to query the data from InfluxDB. Using the order’s issued value as the timestamp was important, because it changes each time an order is updated: most of an order’s data stays the same on update, but the issued and price fields change. Hence, if an order was not updated during a particular time chunk, it will not be included in that chunk’s group. That allows us to write this query:

    SELECT COUNT("volume_remain") FROM "trade_history" WHERE "order_type"='sell' AND type_id='40520' AND time > now() - 7h GROUP BY time(30m)
    

    Which yields this result:

    > SELECT COUNT("volume_remain") FROM "trade_history" WHERE "order_type"='sell' AND type_id='40520' AND time > now() - 7h GROUP BY time(30m)
    name: trade_history
    -------------------
    time                    count
    2016-12-31T12:00:00Z    4
    2016-12-31T12:30:00Z    0
    2016-12-31T13:00:00Z    3
    2016-12-31T13:30:00Z    0
    2016-12-31T14:00:00Z    5
    2016-12-31T14:30:00Z    7
    2016-12-31T15:00:00Z    6
    2016-12-31T15:30:00Z    11
    2016-12-31T16:00:00Z    8
    2016-12-31T16:30:00Z    9
    2016-12-31T17:00:00Z    0
    2016-12-31T17:30:00Z    6
    2016-12-31T18:00:00Z    6
    2016-12-31T18:30:00Z    2
    2016-12-31T19:00:00Z    0
    

    It’s not very important for this exercise, but we can take one more step and graph the results:

    Order update frequency
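
    For the record, here is roughly how such a graph could be produced with influxdb-python and Matplotlib. This is a sketch assuming the same local InfluxDB setup as above; the plotting details are my own choice:

    import matplotlib.pyplot as plt
    from influxdb import InfluxDBClient
    
    db = InfluxDBClient('localhost', 8086, 'root', 'root', 'market')
    query = ('SELECT COUNT("volume_remain") FROM "trade_history" '
             "WHERE \"order_type\"='sell' AND type_id='40520' "
             'AND time > now() - 7h GROUP BY time(30m)')
    points = list(db.query(query).get_points())
    
    times = [point['time'] for point in points]
    counts = [point['count'] for point in points]
    plt.bar(range(len(counts)), counts)
    plt.xticks(range(len(counts)), times, rotation=90)
    plt.ylabel('Order updates per 30 min')
    plt.tight_layout()
    plt.show()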

    Graphing the price was not nearly as interesting: compared to the price itself, the differences between updates are at most several thousand ISK, which doesn’t show up well on the graph.

    Gotchas

    At first I was struggling to get a WHERE clause to work when writing InfluxDB queries: whenever I used WHERE, my queries returned no results. It turns out that InfluxQL requires single quotes for denoting string values, so this query will work:

    SELECT * FROM trade_history WHERE "order_type"='sell'
    

    While this will not:

    SELECT * FROM trade_history WHERE "order_type"="sell"
    

    Final thoughts

    So in the 7 hours the script was running there were over 50 order updates, with up to 11 updates in a 30 minute window, which is very competitive. If you wanted your orders to stay on top, you’d have to trade quite actively in this particular location. But this is a great result: just by glancing at outstanding orders it’s easy to misjudge how competitive a market will be, while longer term data paints a clearer picture. It would be useful to collect several days’ worth of data for several different items to evaluate the situation better, but this is already a good starting point.

    It would be quite interesting to get access to the raw data of actual real life stock exchanges and run the same script on it. Some Bitcoin exchanges publicly display these orders as well, e.g. the Real Time Bitcoin Ticker can be interesting to watch.


  • Break Helper 1.6 update

    Several years ago I made a simple extension for Google Chrome called Break Helper, with the intent of having a tool that constantly reminds me to take a break. However, for the past year BH has been slightly broken due to changes in Chrome’s notifications mechanism. Previously raw HTML notifications were allowed, but that API has long been deprecated in favor of Rich Notifications, which are more rigid and less powerful - you couldn’t have buttons or JS code running in the notification, and only raw text was allowed. I recently noticed that support for notification buttons has been added, as well as a mechanism to update existing notifications, which allowed for some much needed changes to BH.

    If you’re interested in trying it out, you can find it for free here: Download Break Helper from Chrome Web Store

    Confirmations for notifications

    In the old, simplified version there were no confirmations. When it was time for a break, Break Helper would simply display a notification saying “Hey, time’s up. Time for a break!” and would automatically go into break mode without any user interaction. After the break was over, it automatically went back into work mode. The problem with this approach was that it was too easy to ignore. I found myself completely ignoring it for days and skipping a lot of breaks, which is not good for my health.

    Break Helper's old break notification

    Instead, with the addition of button support to the Rich Notifications mechanism, I was able to add back confirmation buttons. That way, if you’re still working, the notification won’t go away without user interaction. It’s easy enough to dismiss by closing it or clicking “Skip this break”, but much harder to ignore, as it will stay in the corner of your screen. And if you’re working on something important, it gives you a chance to finish your work.

    Break Helper's time for a break notification

    Break time notification

    Once the break started, the extension used to simply state “You’re on a break” without a clear indication of how much time you had left.

    But with the improved Rich Notifications I was able to add a progress bar and a countdown timer clearly displaying the time left.

    Break Helper's break time left notification

    Sound notification

    Sound notifications were broken for a while, and they’re fixed again in the 1.6 release. I also found the old notification sound, a clip of a ringing old school mobile phone, too jarring, so this time I switched to a softer “Solemn” chime I found on notificationsounds.com.

    Thoughts for future improvements

    • Requiring user interaction when the break ends. Sometimes one may not notice that it’s time to get back to work, and by the time you return from your break, 5 minutes of work time have already passed. It may be a bit too annoying for frequent breaks, but perhaps it would be worth asking for user confirmation to get back into work mode.
    • Code. You can find Break Helper’s code on GitHub (flakas/Break-Helper), and it’s not pretty. It’s quite error prone and difficult to maintain. It needs some dev love to refactor the code and clean up old, obsolete pieces.
    • Graphics and sales copy on the Chrome Web Store. They’re fairly outdated and don’t clearly convey the purpose of this extension.

    So there’s still some work to do even on this very simple extension, but that’s a worry for 2017. However, I’m very happy to see that over 1000 users still actively use it (or ignore it).

    And if you’re interested in checking out Break Helper, you can find it here: Download Break Helper from Chrome Web Store. It’s Free!


  • The Clean Coder: A Code of Conduct for Professional Programmers

    “The Clean Coder” by Robert “Uncle Bob” C. Martin is related to his previous book “Clean Code”, except this time the focus is on how a professional programmer should work, organize their work and behave.

    The book covers a broad range of topics, from how to behave as a professional, work with a team or with your clients, and grow as a professional, all the way to various software construction practices like testing, QA, estimation and tools. At 200 pages it is a fairly short book, but it brings up interesting ideas and good recommendations. I’m going to share several takeaways I got from this book, but of course there’s much more in it.

    Working under pressure. There are times when you have a hard deadline, something critical needs to be done quickly, or perhaps a deadline is approaching that you’re not sure you can meet. At times like these you may start looking for corners to cut, and disciplines are often the first suggested sacrifice: “perhaps we don’t need to write tests right now”, or “let’s skip the design phase, we can do it on the spot”. This leads to messes, which slow you down now or at some point in the future. E.g. skipping tests “for now” with the intent of catching up later requires a significant amount of manual testing, and later you may find the code too difficult to test.

    Uncle Bob suggests finding a discipline you will adhere to even in times of crisis, if you’re not confident in your current one. A discipline will guide you through times of high pressure - “trust your discipline!” Crisis is the time to pay special attention to your discipline.

    As for how to handle the pressure personally, he recommends to stop stressing - “fretting won’t help”. Instead, slow down, think through the problem, and communicate issues to the team, bosses and clients. A piece of software slipping from January to July is much easier to deal with and plan for if you know about it in advance than if you learn about the pushback the day before release. Issues need to be actively handled; wishful thinking that “hopefully it will fix itself” does not help and can be dangerous.

    Professionalism, practicing and mentoring. The author outlines several guidelines a professional follows, like taking responsibility for one’s actions. “Do no harm. […] A software developer can do harm to both the function and structure of the software.” We want our software to work, and our customers want it to work even more. Uncle Bob suggests that “in order to be professionals, we must not create bugs”. Make sure your software works, either by testing it manually or, preferably, by testing it automatically. Don’t hope that your software works, prove it! “QA should find nothing.”

    A professional developer knows his or her field. That means dedicating some time to learning about fundamentals, new developments and advances. Uncle Bob advocates continuous learning and deliberate practice: just like musicians or sportspeople practice, so should professional programmers. Some companies dedicate some time each week for learning and fun projects, but that’s not the norm. We own our careers, not our employers, so it is our responsibility to nurture and care for them. As ways of practicing, the author suggests doing code katas or mentoring others.

    Some other interesting soft skill topics are touched upon as well, like saying yes or saying no, time estimation, expectation management and working with a team. It is an interesting book, providing food for thought about the way we work. I would recommend it mostly to newer programmers, although seasoned vets may find ideas on how to improve themselves too.

    If you’re interested in the book, you can find it here on Amazon: The Clean Coder


  • Taking notes with Vim and Guake

    It is very easy to forget all the specific details you deal with during your work day, especially if you need to recall them several weeks later. I certainly tend to forget specifics or tasks I’m supposed to do. And when writing weekly status reports for work, I end up second guessing myself as to where my time went over the week.

    Recently I wrote a small note taking script and made an effort to take notes throughout the day about what I’m doing and what is being discussed with others. Here it is:

    #!/bin/sh
    
    if [ $# -eq 0 ]
    then
      FILENAME=$(date +%Y-%m-%d)
    else
      FILENAME=$1
    fi
    
    FILE=~/notes/$FILENAME
    
    vim "$FILE"
    

    The script is saved as notes in /usr/bin and is marked executable.

    Usage:

    • notes opens up the notes file for the current day. Files are named using the YYYY-MM-DD format. If the day’s file already exists, it is opened for further editing; if it doesn’t, a new file is created.
    • notes <filename> opens up a specific notes file in ~/notes/<filename>.

    The basic idea is simple. Each day, running the notes command opens up a file in the notes/ folder for that day, e.g. ~/notes/2016-08-28. Notes can be written in free form, though I usually end up using Markdown style lists. For browsing older notes files, a file system explorer works well; for content search, tools like grep or ack work fine. Sharing notes across multiple machines is easy with a tool like Dropbox.

    Guake is a drop-down terminal application. It is useful because a single hotkey opens a terminal above all other apps. So you can do work in your normal terminal or browser, hit F12 to open up the Guake terminal (with the note file already open in vim), take notes, hit F12 to hide Guake and continue working.

    It ended up working really well. Looking up the details of a particular day or week is as simple as opening a few files. Good integration with the environment I already use minimizes friction, and I’ve been using it almost every day for the past month. Certainly a positive change.