** note To start out, let me just say that when it comes to computers I'm completely self taught. I started learning web development and databases about 2 months ago.
I posted this yesterday, but it didn't show up.
It's longer than I intended it to be; if you don't want to read the whole thing, skip to the table diagrams and "My Concerns so far".
What I'm Doing
Building an advanced online metering system, which for the most part is easier than I thought (except for the database stuff, which I think I'll get eventually).
How I'm Doing it now
The device is currently uploading 123 columns of floats* and a datetime (time stamp). This is uploaded/inserted every 2 seconds for a total of 43,200 rows per day.
The device is also uploading calculated aggregations (sums, min, max, average, and some trig and integrals) for time periods of 60 s (1 min), 300 s (5 min), and 900 s (15 min), all of which are calculated off of the 2 second data.
There are additional tables with event data (1 row per event) and daily statistics (a single row per day).
*(values range from 0 to 1,000,000, rounded to the 5th decimal place, i.e. 0.00001)
example database layout per meter:
Code:
# LAYOUT PER DEVICE
meter_data <- database name
----meter1_2ss <- 2 second data table with 124 columns
----meter1_60ss <- 60 second data table with the same 124 columns plus 86 more
----meter1_300ss <- 300 second data table with the same 124 columns plus 86 more
----meter1_900ss <- 900 second data table with the same 124 columns plus 86 more
----meter1_events <- events table which I haven't designed yet
----meter1_daily <- daily statistics that can't be easily or quickly calculated off of the data
---- **currently there is a set of tables like this for each device**
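For illustration, creating one of these tables with SQLAlchemy looks roughly like this (a simplified sketch; the "ts" column name and the connection string are placeholders, and the real table has the timestamp plus 123 data columns):
Code:
# Sketch of creating one per-device 2-second table with SQLAlchemy Core.
from sqlalchemy import create_engine, MetaData, Table, Column, DateTime, Float

engine = create_engine("mysql+mysqlconnector://user:password@localhost/meter_data")
metadata = MetaData()

meter1_2ss = Table(
    "meter1_2ss", metadata,
    Column("ts", DateTime, primary_key=True),               # time stamp, one row every 2 s
    *[Column("data%d" % i, Float) for i in range(1, 124)]   # data1 ... data123
)

metadata.create_all(engine)   # issues CREATE TABLE for tables that don't exist yet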
Extra information
I'm using InnoDB with the Barracuda file format and innodb_file_per_table=ON.
With this setup, one month's worth of data (all tables for one device) comes in at just over 1.3 million rows and a file size (with Barracuda compression enabled) of about 200 MB. By rebasing and other methods I can probably shrink this to 100 MB.
For reference, I learned everything I know about databases on this project in the last 2 months.
The goal is to keep data in the database for as long as possible, and when necessary move it to a compressed file structure (or store a day's worth of data in a single row/column of a different database).
Compressing a single row of data (in CSV format) yields very small gains, from 374 bytes to 299 bytes, or about 20%; but per 100 rows (again in CSV) it goes from 35.9 KB to 3.9 KB (about 90%; at 800,000 rows it maxes out at 93%).
That was with LZMA compression; zip is about 4-6% less effective on average.
I'm trying to keep at least 2 years of data per meter between the database and the archive.
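A minimal sketch of the archive step, assuming Python's built-in lzma module is used for the compression (file names are placeholders; ratios will obviously vary with the data):
Code:
import lzma

# Compress one day's CSV export before archiving it.
with open("meter1_2ss_2015-06-01.csv", "rb") as src, \
        lzma.open("meter1_2ss_2015-06-01.csv.xz", "wb", preset=9) as dst:
    dst.write(src.read())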
Tools I'm using
MySQL 5.x (not sure which version; I don't know enough yet to know what to use)
Python SQLAlchemy (just for table creation for now; it seems ORMs are just as hard or harder to learn than SQL, especially when you don't know much about databases or SQL)
The official Python MySQL connector with the compressed protocol and prepared statements
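To show what the upload side looks like, here's a stripped-down sketch of a prepared INSERT over the compressed protocol with the official connector (host, credentials, and the cut-down column list are placeholders; the real statement has the timestamp plus 123 data columns):
Code:
import mysql.connector

# Rough sketch of the device-side uploader: one prepared INSERT every 2 s.
conn = mysql.connector.connect(
    host="db.example.com", user="meter", password="secret",
    database="meter_data", compress=True)   # compressed client/server protocol
cur = conn.cursor(prepared=True)            # server-side prepared statement

sql = "INSERT INTO meter1_2ss (ts, data1, data2, data3) VALUES (%s, %s, %s, %s)"
cur.execute(sql, ("2015-06-01 00:00:02", 1.23456, 2.34567, 3.45678))
conn.commit()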
My Concerns so far
- For now, with 2-10 devices, this works well, but what about when I have more: 100s, 1000s?
- At 100 devices the database will be processing about 4.43 million inserts a day (roughly a 120 GB database size increase per year). Can a database handle that many inserts? If not, is there a technology that would? (Rough math is in the sketch after this list.)
- Is MySQL the right database for this? Would something like Postgres or Mongo be better?
- Should I be using the MySQL database connector to upload rows to the database (will that be blocked by some firewalls and such), or is there another method?
- Is this scalable to 10,000+ devices (roughly 450 million inserts per day, about 12 TB per year)?
- Would it be more cost effective at 1000+ devices to host my own database on my own hardware?
- I've come across schema designs where all the time-series data was placed in a single table with device identifiers. Would that work for my database? Is it advisable? Would it perform okay? (http://stackoverflow.com/questions/4...ational-or-non) That link suggests that storing all data in a single table, as illustrated below, would be a good idea.
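Before the setups, here's the rough math behind those insert-rate numbers (my own back-of-envelope figures, not measured):
Code:
# Back-of-envelope insert rates per day (2 s rows + 60 s + 300 s + 900 s + daily row).
rows_per_device_per_day = 86400 // 2 + 1440 + 288 + 96 + 1
for devices in (100, 1000, 10000):
    print(devices, "devices ->", rows_per_device_per_day * devices, "inserts/day")
# 100 devices    -> ~4.5 million inserts/day
# 1,000 devices  -> ~45 million inserts/day
# 10,000 devices -> ~450 million inserts/day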
Setup 1
Code:
meter_data <- database name
----clients <- columns (customer, location, device_ids) device_ids will probably be some kind of list or dictionary as there can be more than 1 device per location
----mdata_2 <- 124 cols + (device_id, ...) 2 second data only
----mdata_rest <- 210 cols + (device_id, data_freq, ...) this would hold the 60, 300, and 900 second data for all devices
----mdata_events <- columns (device_id, ...)
----mdata_daily <- columns (device_id, ...)
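A sketch of what the shared 2-second table from Setup 1 might look like in SQLAlchemy, with a composite index on (device_id, ts) so per-device queries stay fast (column names are placeholders):
Code:
from sqlalchemy import MetaData, Table, Column, Integer, DateTime, Float, Index

metadata = MetaData()

# Shared 2-second table for all devices (Setup 1).  Only a few of the
# 123 data columns are shown.
mdata_2 = Table(
    "mdata_2", metadata,
    Column("ts", DateTime, nullable=False),        # time stamp
    Column("device_id", Integer, nullable=False),  # which meter the row came from
    Column("data1", Float),
    Column("data2", Float),
    # ... data3 through data123 ...
    Index("ix_mdata_2_device_ts", "device_id", "ts"),  # composite index for per-device queries
)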
Setup 2
The same as Setup 1, but with mdata_2 and mdata_rest combined into a single table.
Both tables share the same 124 columns; the mdata_rest table just has an additional 86 columns.
I could merge these into a single table and just null (or 0?) out the columns that mdata_2 doesn't have. Illustrated below.
Code:
# 2 second data:
|datetime, device_id, data1, data2, data3, ... data122, data123|
| 2015:.., 000000001, val 1, val 2, val 3, ... val 122, val 123|
| ^index^| ^index^ |
# 60, 300, 900 second data:
|datetime, device_id, data_freq, data1, data2, data3, ... data122, data123, data124, ... data209, data210|
| 2015:.., 000000001, 300, val 1, val 2, val 3, ... val 122, val 123, val 124, ... val 209, val 210|
| 2015:.., 000000001, 900, val 1, val 2, val 3, ... val 122, val 123, val 124, ... val 209, val 210|
| 2015:.., 000000001, 60, val 1, val 2, val 3, ... val 122, val 123, val 124, ... val 209, val 210|
| ^index^| ^index^ | ^index^ |
# NEW TABLE WITH BOTH MERGED INTO 1
|datetime, device_id, data_freq, data1, data2, data3, ... data122, data123, data124, ... data209, data210|
| 2015:.., 000000001, 300, val 1, val 2, val 3, ... val 122, val 123, val 124, ... val 209, val 210|
| 2015:.., 000000001, 900, val 1, val 2, val 3, ... val 122, val 123, val 124, ... val 209, val 210|
| 2015:.., 000000001, 60, val 1, val 2, val 3, ... val 122, val 123, val 124, ... val 209, val 210|
| 2015:.., 000000001, 2, val 1, val 2, val 3, ... val 122, val 123, null, ... null, null |
| ^index^| ^index^ | ^index^ | from this ^ point on it's all nulls
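Querying the merged Setup 2 table would then just be a filtered SELECT, something like this (the table name mdata_all, the ts column, and the connection details are placeholders; assumes an index on (device_id, data_freq, ts)):
Code:
import mysql.connector

# Pull one day of 5-minute (300 s) rows for a single device from the merged table.
conn = mysql.connector.connect(host="db.example.com", user="meter",
                               password="secret", database="meter_data")
cur = conn.cursor()
cur.execute(
    "SELECT * FROM mdata_all"
    " WHERE device_id = %s AND data_freq = %s"
    " AND ts >= %s AND ts < %s",
    (1, 300, "2015-06-01", "2015-06-02"))
rows = cur.fetchall()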
Thanks
Jake