Everything Counts

Cloud Native Blogging

I’ve migrated the blog to run on Cloud Foundry using Pelican instead of Octopress. Pelican is written in Python, which I am much more fluent in than Ruby. Besides, Pelican supports an important feature for blogging about data science: you can embed IPython notebooks in blog posts with the IPython plugin for Pelican or Jake VanderPlas’ liquid_tags plugin.

With Jake’s plugin you can simply write

{% notebook path/to/notebook.ipynb [cells[i:j]] %}

and Pelican will insert and render the notebook for you.

For example, this post embeds a small demo notebook:

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

x = np.random.random(50)
plt.plot(x)

[The embedded notebook renders the resulting line plot of the 50 random values inline.]

We live in 2016: in a cloud native world, setting up and maintaining a web server just to deploy a website is no longer necessary. My PaaS of choice is Cloud Foundry (CF). It is open source and infrastructure agnostic, which means I can deploy my blog to any CF endpoint in seconds without caring about the underlying infrastructure.

All I need to do to make this work is to write this short manifest.yml file:

---
applications:
- name: everything-counts
  memory: 1024M
  instances: 1
  buildpack: https://github.com/ronert/heroku-buildpack-pelican.git
  timeout: 180
  env:
    PELICAN_SITEURL: "http://everything-counts.cfapps.io"
  domain: ronert-obst.com

Everything here should be fairly self-explanatory except the buildpack specification. The Cloud Foundry documentation describes buildpacks as follows:

Buildpacks provide framework and runtime support for your applications. Buildpacks typically examine user-provided artifacts to determine what dependencies to download and how to configure applications to communicate with bound services.

That is quite a mouthful. Essentially there is a buildpack for every programming language that is supported by CF. The buildpack fetches all the dependencies (in the case of Python those specified in your requirements.txt file) and runs your code for you. Buildpacks are a very neat abstraction, because they free developers from caring about any of the underlying infrastructure. The operating system layer is completely abstracted away, so you can focus on your code. I use a custom buildpack here since I also need to run Pandoc to convert my old org-mode blog posts from Octopress to Pelican.

Once I have set up my CF endpoint, I can cd into the root of my blog’s directory, run cf push, and Cloud Foundry will deploy my blog for me. Cloud Foundry provides some other useful features I will go into in later posts. Besides health monitoring, routing and logging, CF can scale an application from 1 instance to 20 instances in seconds. So if my blog ever goes viral (very unlikely), I can simply run cf scale everything-counts -i 20 to serve the millions of new incoming requests.
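For reference, the whole workflow boils down to a handful of CF CLI commands. The API URL below is just a placeholder for whichever CF endpoint you target; the app name comes from the manifest above.

cf api https://api.example-cf-endpoint.com
cf login
cf push                              # reads manifest.yml and deploys the blog
cf scale everything-counts -i 20     # scale out to 20 instances if traffic spikes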

How to read HAWQ Parquet tables from Spark

Spark SQL still lags behind many SQL-on-Hadoop engines in performance, reliability and functionality. As a data scientist, I really value being able to write PySpark code in Jupyter notebooks for exploratory analysis, run unit tests with unittest/py.test and nosetests, and use the growing ecosystem of machine learning libraries on top of Spark. Yet typical SQL tasks like joining datasets can be tedious and slow in Spark SQL, and it also lacks functionality such as window functions.

How can we get the best of both worlds? SQL-on-Hadoop engines such as HAWQ can read and write Parquet files, and so can Spark. Using Parquet, we can get the two systems to exchange data, with HDFS acting as the shared storage layer.

In HAWQ, you can create a table using Parquet as a storage format and gzip as compression like so:

CREATE TABLE :target_schema.my_table
WITH (
  appendonly = TRUE,
  orientation = parquet,
  compresstype = gzip,
  compresslevel = 4) AS ..

Spark can read those Parquet tables from HDFS with a simple sqlContext.read.parquet call, passing one path per HAWQ segment:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the SparkContext of the running PySpark session

# HAWQ writes one Parquet file per segment, so build the path for each of the 24 segments
filenames = ['/hawq_data/gpseg' + str(i) + '/16385/26349/61298' for i in range(24)]
df = sqlContext.read.parquet(*filenames)

One downside is that HAWQ stores a table across its segments under an automatically generated file path. To find the HDFS path programmatically, you can query the database itself. The first part of the path (16385 in our case) always stays the same (it is determined by the database).

The middle part (26349) is the database OID, which you can look up with SELECT oid, datname FROM pg_database, and the final part (61298) is the table OID, which SELECT 'my_table'::regclass::oid returns.

Another downside is that you need to provide the number of HAWQ segments (24 in this case). If you scale up the number of segments, you also have to adjust this in your Spark code, or you will silently lose some of your data in Spark.
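Putting the pieces together, here is a rough sketch of how the path construction could be automated so that the OIDs are not hard-coded. The connection parameters and the psycopg2 dependency are assumptions on my part, the segment count still has to match your cluster, and sc is the SparkContext of an already running PySpark session:

from pyspark.sql import SQLContext
import psycopg2

# Look up the two variable parts of the segment file path in the HAWQ catalog.
# Host, database and user are placeholders -- adjust them for your cluster.
conn = psycopg2.connect(host="hawq-master", dbname="gpadmin", user="gpadmin")
cur = conn.cursor()

# Middle part of the path: the OID of the database we are connected to
cur.execute("SELECT oid FROM pg_database WHERE datname = current_database()")
db_oid = cur.fetchone()[0]

# Final part of the path: the OID of the table itself
cur.execute("SELECT 'my_table'::regclass::oid")
table_oid = cur.fetchone()[0]
conn.close()

num_segments = 24  # must match the current number of HAWQ segments
paths = ['/hawq_data/gpseg%d/16385/%d/%d' % (i, db_oid, table_oid)
         for i in range(num_segments)]

sqlContext = SQLContext(sc)
df = sqlContext.read.parquet(*paths)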

I am sure we will see more elegant ways for HAWQ and Spark to exchange data in the near future.

Linkfest

I am going to try and start posting regularly again, including proper blog posts and not just linkfests.

Statistics and Machine Learning

ML/Stats Package of the Week

Paper(s) of the Week

Programming

Other cool stuff

Weekend Linkfest 1.12.2013

Statistics and Machine Learning

R Package of the Week

Paper(s) of the Week

Programming

Weekend Linkfest 9.11.2013

I successfully defended my master’s thesis this week, which is why I was too busy to post linkfests in the meantime. But now I am back! I have also released my first package on CRAN: parboost. Expect more on parboost in another post.

Statistics and Machine Learning

R Package of the Week

Paper(s) of the Week

Programming

Weekend Linkfest 20.10.2013

Statistics and Machine Learning

R Package of the Week

Paper(s) of the Week

Programming

Weekend Linkfest 5.10.2013

Statistics and Machine Learning

R Package of the Week

Paper of the Week

Programming

Weekend Linkfest 28.9.2013

Statistics and Machine Learning

R Package of the Week

Paper of the Week

Programming

Elsewhere

Weekend Linkfest 15.9.2013

Statistics and Machine Learning

R Package of the Week

Paper of the Week

Programming

Elsewhere

Weekend Linkfest 8.9.2013

Statistics and Machine Learning

R Package of the Week

Paper of the Week

Programming

Elsewhere