Big Data Analysis
Someone should write an easy-to-use app – like a spreadsheet, but for handling Big Data.
Big Data is a popular topic of discussion on the Internet today. It’s all about managing gigantic data stores – usually many millions of rows of data points, each carrying attributes such as the day and time the point was taken, and often dozens of other data values alongside it.
Tools We Have Today
Think about spreadsheet apps like Microsoft Excel, OpenOffice’s Calc, or the spreadsheet in Google Docs – think about how good they are for data manipulation at a fundamental level. They present a 2-dimensional grid of cells, and each cell holds a single data point. A cell can also hold words, or a formula written in a formula language, which calculates a newly-derived data point that appears in that cell. That cell therefore has two views: the data value that results from the formula, and the formula itself.

How can one cell display itself two different ways? The original inventors of this idea figured it out: by default, display the data value, since that’s what you’re fundamentally working with – data. If you single-click the cell, the formula appears on a “formula line” near the top of the app. And if you double-click the cell, you can edit the formula – it appears temporarily right there in the cell. If it’s too large to fit in the cell’s space, a larger rectangle temporarily opens up so you have enough room to edit.

Once you’re done editing, the app recalculates the formula and quickly redraws the data value in place. It also recalculates any formulas that depended on that cell for one of their values – and then anything dependent on THOSE cells, and so on. Spreadsheets are powerful and useful for so many purposes because of this capability.
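That ripple-through recalculation can be sketched in a few lines. This is a hypothetical toy, not any real spreadsheet’s engine: each formula cell declares which cells it depends on, and changing a value recalculates everything downstream.

```python
# Toy dependency-driven recalculation (illustrative names, not a real API).

class Sheet:
    def __init__(self):
        self.values = {}      # cell name -> current value
        self.formulas = {}    # cell name -> function of the sheet's values
        self.dependents = {}  # cell name -> cells whose formulas use it

    def set_value(self, name, value):
        self.values[name] = value
        self._recalc_dependents(name)

    def set_formula(self, name, func, depends_on):
        self.formulas[name] = func
        for dep in depends_on:
            self.dependents.setdefault(dep, set()).add(name)
        self._recalc(name)

    def _recalc(self, name):
        self.values[name] = self.formulas[name](self.values)
        self._recalc_dependents(name)   # ...and THAT cell, and THAT cell...

    def _recalc_dependents(self, name):
        for dep in self.dependents.get(name, ()):
            self._recalc(dep)

s = Sheet()
s.set_value("A1", 10)
s.set_formula("B1", lambda v: v["A1"] * 2, depends_on=["A1"])
s.set_formula("C1", lambda v: v["B1"] + 5, depends_on=["B1"])
s.set_value("A1", 20)     # ripples through B1, then C1
print(s.values["C1"])     # 45
```

A real engine would also sort the dependency graph topologically and detect cycles, but the cascading idea is the same.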
Big Data …Sheet
But Big Data is different – very often you are working with entire data sets, not individual values. Or with summarized data: if you have daily data points, maybe you want to view a whole week of them averaged down to a single value, so the resolution becomes weekly. Maybe you need summing, not averaging. Or maybe you need Bézier-smoothed curve data derived from this data.
What would a Big Data Sheet look like?
Maybe each column in a Big Data Sheet represents an extraction of one column from a whole 2-D set of data.
Maybe you have functions that, instead of calculating on a single data value, perform operations on a whole data set at a time, transforming it into a new set of data. Like Moving Average, or EMA (Exponential Moving Average).
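Here is what such set-level functions might look like: each one takes a whole series and returns a new, transformed series. This is a minimal sketch with illustrative names, using the textbook definitions of a simple moving average and an EMA.

```python
# Whole-set operations: a list of values in, a new list of values out.

def moving_average(data, window):
    """Simple moving average; output begins once a full window exists."""
    return [sum(data[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(data))]

def ema(data, alpha):
    """Exponential moving average with smoothing factor alpha (0..1)."""
    out = [data[0]]
    for x in data[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

prices = [10, 11, 12, 13, 14, 15]
print(moving_average(prices, 3))   # [11.0, 12.0, 13.0, 14.0]
print(ema(prices, 0.5))
```

In a Big Data Sheet, a “cell” might hold the result of one of these transforms, with the function call playing the role a formula plays in Excel.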
The data set element would need to have meta-data about itself – such as whether the data has been de-exponentialized for graphing purposes, or not. Either the X or Y axis, or both, could be exponential, or logarithmic.
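One way to carry that meta-data along with the data is to wrap each set in a small structure describing its axes. The field names here are hypothetical – just a sketch of the shape such a thing might take:

```python
# A hypothetical data-set wrapper that carries axis meta-data.

from dataclasses import dataclass, field

@dataclass
class Axis:
    name: str               # e.g. "time", "cpu_load"
    scale: str = "linear"   # "linear", "log", or "exp"
    unit: str = ""

@dataclass
class DataSet:
    axes: list
    values: list
    notes: dict = field(default_factory=dict)  # e.g. {"de-exponentialized": True}

ds = DataSet(
    axes=[Axis("time", unit="s"), Axis("load", scale="log", unit="%")],
    values=[[0, 12.5], [1, 13.1]],
)
print(ds.axes[1].scale)   # log
```

A graphing function could then read `Axis.scale` and decide for itself how to draw each axis, instead of the user remembering.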
Data sets can be more than 2-D, though – they can be 3-D, 4-D, or really any number of dimensions. So another thing you’d need is a way to extract 2-D slices from that data.
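Slicing an N-D set in any direction is mostly an addressing problem. A common trick (it is how array libraries like NumPy work internally) is to store the data flat and compute per-axis strides; a 2-D slice then means fixing all but two axes. A sketch, with invented function names:

```python
# Extracting a 2-D slice from an N-D set stored as a flat, row-major list.

def strides_for(shape):
    """Row-major strides: flat-index jump per unit step along each axis."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.insert(0, step)
        step *= dim
    return strides

def slice_2d(flat, shape, fixed):
    """Fix all axes but two; `fixed` maps axis index -> coordinate.
    Returns the remaining 2-D slab as nested lists."""
    strides = strides_for(shape)
    free = [ax for ax in range(len(shape)) if ax not in fixed]
    assert len(free) == 2, "exactly two axes must remain free"
    base = sum(strides[ax] * pos for ax, pos in fixed.items())
    a, b = free
    return [[flat[base + i * strides[a] + j * strides[b]]
             for j in range(shape[b])]
            for i in range(shape[a])]

# A 2x3x4 set flattened row-major; each value encodes its coordinates.
shape = (2, 3, 4)
flat = [100 * i + 10 * j + k
        for i in range(2) for j in range(3) for k in range(4)]
print(slice_2d(flat, shape, fixed={0: 1}))   # the j-by-k plane at i=1
```

The same `fixed` dictionary is exactly the kind of thing the app could remember, so a slice knows how it was taken and can be re-taken differently later.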
If the axes of a data set (or a slice of a data set) had meta-data about what the data type is, then you could match them up with a completely separate data set on that common axis. Maybe the common axis is Time – that’s a popular one for a lot of data sets. If you have a data set of CPU load from server A over the past year, and a similar data set from server B, you might want to do calculations and comparisons between the two, and derive useful knowledge from them.

The two data sets may not start and end at exactly the same times; you could choose to extract just the data where the timestamps overlap – that should be a Function in this system. Maybe there’s also a function that reduces the data even further to “whole days” of data – discarding the first day and/or last day if they are not complete days’ worth (midnight to midnight). The two sets of data might result in 3 “columns” of data in this application (time, cpu load A, cpu load B).
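The overlap Function described above might look like this – a sketch with illustrative names, taking two (timestamp, value) series and keeping only the timestamps present in both, which yields the three “columns”:

```python
# Align two time series on their common Time axis, keeping only the
# overlapping timestamps -> rows of (time, value_a, value_b).

def align_on_time(series_a, series_b):
    b_by_time = dict(series_b)
    return [(t, a_val, b_by_time[t])
            for t, a_val in series_a
            if t in b_by_time]

cpu_a = [(1, 0.20), (2, 0.35), (3, 0.50), (4, 0.40)]
cpu_b = [(2, 0.60), (3, 0.55), (4, 0.70), (5, 0.65)]
print(align_on_time(cpu_a, cpu_b))
# [(2, 0.35, 0.6), (3, 0.5, 0.55), (4, 0.4, 0.7)]
```

The “whole days only” reduction would be another function of the same shape: aligned rows in, a trimmed set of aligned rows out.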
And, of course, you need to be able to graph all these data sets, derived sets, slices, merges, etc.
Can’t Excel Already Do That?
It just occurred to me that 75% of what I’m talking about can already be done today with Excel. There are just a few places where it falls short. Working with data sets so large they won’t fit into a computer’s memory is one of them. Excel also doesn’t know how to take a 2-D slice of a 5-D data set and remember where the slice came from and how it was taken, so you can re-slice in different ways, and with different ranges, while doing “what-if” scenarios. Those slices would need to be cached somehow, for future calculation speed. For example, if you summarize 5 million rows of every-second data spanning years into a new data set that covers a sub-set of the date range and has been averaged into daily data points, that’s a much smaller set. The math for that only needs to be done once, and the result stored under a new name that the user gives it. But it must point back to the original data set it came from, with indications of how it was derived. Then, if the original data set changes, the derived data must be recalculated – it changed too.
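That cache-plus-provenance idea can be sketched with a derived set that remembers its parent, how it was derived, and which version of the parent it was derived from. All the class and field names here are hypothetical:

```python
# A derived data set that caches its result and recomputes only when
# its parent data set has changed (tracked with a version counter).

class SourceSet:
    def __init__(self, data):
        self.data = data
        self.version = 0
    def update(self, data):
        self.data = data
        self.version += 1

class DerivedSet:
    def __init__(self, name, parent, derivation):
        self.name = name
        self.parent = parent          # provenance: where it came from
        self.derivation = derivation  # provenance: how it was derived
        self.cache = None
        self.parent_version = None

    def data(self):
        # The math is done once, then reused until the parent changes.
        if self.cache is None or self.parent_version != self.parent.version:
            self.cache = self.derivation(self.parent.data)
            self.parent_version = self.parent.version
        return self.cache

raw = SourceSet([1, 2, 3, 4, 5, 6])
daily = DerivedSet("daily_avg", raw,
                   lambda d: [sum(d[i:i+2]) / 2 for i in range(0, len(d), 2)])
print(daily.data())       # [1.5, 3.5, 5.5]  (computed and cached)
raw.update([2, 2, 4, 4])  # source changed...
print(daily.data())       # [2.0, 4.0]       (...so the derivation reruns)
```

Chain several of these and you get exactly the cascading recalculation a spreadsheet does – but over whole data sets instead of single cells.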
Maybe the right thing is for existing spreadsheet application authors to add a few Big Data features to their apps to handle these things.
Calculations on Big Data sets (and on slices and merges of sets) are, much of the time, inherently parallelizable. So it would be great if this new Big Data Sheet application could farm out the calculations across many servers the user has pre-configured – or maybe users could rent a CPU farm, a centrally located set of compute servers that the application’s creator, or others, have set up.
The science of Map/Reduce for data sets is growing, and is a powerful way to perform many operations in parallel on really large data sets. That could be the technology behind a front-end of high-level operations that can be performed on Big Data sets, or slices of sets.
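The map/reduce shape behind such a front-end is simple to sketch: split the set into chunks, map a partial computation over each chunk (these are the independent units you could farm out to servers), then reduce the partial results. Here a thread pool stands in for the server farm; the operation is a global average of an “every-second” series:

```python
# Map/reduce sketch: chunked partial sums, reduced to one average.

from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Partial result per chunk: (sum, count) -- cheap to ship back."""
    return (sum(chunk), len(chunk))

def reduce_partials(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

data = list(range(1, 1001))                       # stand-in for millions of rows
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

with ThreadPoolExecutor() as pool:                # stand-in for a server farm
    partials = list(pool.map(map_chunk, chunks))

print(reduce_partials(partials))                  # 500.5
```

The key design point: each partial result is tiny compared to its chunk, so only the small `(sum, count)` pairs travel over the network, never the raw data.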
We might need to invent a new kind of database server storage system for gigantic N-dimensional data sets, so that we can extract slices from them in any direction/dimension we want, as quickly as is reasonably possible. You can assume it might take more than one database server to store such a large set of data in this flexible manner, so the servers might all need to work together when queries arrive from clients retrieving data from the gigantic data set.
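One storage scheme that makes any-direction slicing reasonably fast is chunking, the idea behind formats like HDF5 and Zarr: cut the N-D set into fixed-size chunks keyed by their grid coordinates, so a slice only touches the chunks it crosses – and the chunk key also tells you which server holds each piece. A toy single-machine sketch:

```python
# Chunked N-D storage: chunk key -> {local coordinate -> value}.
# In a distributed version, the chunk key would also pick the server.

CHUNK = 4  # chunk edge length along every axis (an arbitrary choice)

def chunk_key(coord):
    """Which chunk a point lives in, e.g. (5, 9) -> (1, 2)."""
    return tuple(c // CHUNK for c in coord)

class ChunkedStore:
    def __init__(self):
        self.chunks = {}

    def write(self, coord, value):
        local = tuple(c % CHUNK for c in coord)
        self.chunks.setdefault(chunk_key(coord), {})[local] = value

    def read(self, coord):
        local = tuple(c % CHUNK for c in coord)
        return self.chunks[chunk_key(coord)][local]

store = ChunkedStore()
for x in range(8):
    for y in range(8):
        store.write((x, y), 10 * x + y)

# A row slice at x=5 touches only the chunks whose key starts with 1:
row = [store.read((5, y)) for y in range(8)]
print(row)                                    # [50, 51, 52, 53, 54, 55, 56, 57]
print(chunk_key((5, 0)), chunk_key((5, 7)))   # (1, 0) (1, 1)
```

Redundancy then falls out naturally: replicate each chunk on two or more servers, and back up the set chunk-by-chunk to off-site storage.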
How do we store the data so slices can be taken from any direction, of any amount?
How do we create redundancy, so any one server going down doesn’t take the whole database with it?
How do we backup such a data set, to keep it in off-site storage?
It’s fun to ponder these things. I know there are good answers to all of them.
It would be fun to see how the world explores this topic more over time.