** Note: to start out, let me just say that when it comes to computers I'm completely self-taught. I started learning web development and databases about 2 months ago; most of my experience is with Linux admin and general programming.
My project
I am an electrician by trade and am in charge of (i.e. doing it myself) building an online metering system that consists of a device you install next to your main power panel and a web portal where you can view your data. Inside the device is a small Linux-based computer that handles all of the data logging, local storage (SQLite or MySQL), and real-time analytics. Once the computer has finished its calculations, it uploads the meter data to a cloud-based database (MySQL on DigitalOcean) where the data can be viewed from a web app (which I'm also building).
My Data/schema
The device is currently uploading 122 columns of floats* plus an incrementing id (which is probably not needed) and a datetime. This is uploaded every 2 seconds, for a total of 43,200 rows per day. (Yes, I need it that frequently; I need some of it at 0.125-second intervals, or as fast as I can get it.)
The device is also uploading calculated aggregations (sums, min, max, average, and some trig and integrals) for time periods of 60 s (1 min), 300 s (5 min), and 900 s (15 min), all of which are calculated from the 2-second data plus additional data stored in buffers on the device.
There are additional tables for event data (1 row/record per event) and daily statistics (a single row per day).
*(Values range from 0 to 1,000,000, rounded to the 5th decimal place, i.e. 0.00001.)
Example database layout per meter:
meter_data ..................... <- database name
----meter1_2ss ................. <- 2-second data table with 124 columns
----meter1_60ss ................ <- 60-second data table with 210 columns
----meter1_300ss ............... <- 300-second data table with 210 columns
----meter1_900ss ............... <- 900-second data table with 210 columns
----meter1_events .............. <- events table, which I haven't designed yet
----meter1_daily ............... <- daily statistics table, not implemented yet
----meter2_2ss ................. <- 2-second data for another device
----meter2_60ss ................ <- 60-second... every meter has the same set of tables
I'm using InnoDB with the Barracuda file format and innodb_file_per_table=ON.
With this setup, one month's worth of data comes in at just over 1.3 million rows, with a file size (with Barracuda compression enabled) of about 200 MB.
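Just to make that concrete, here is roughly what one of the per-meter 2-second tables looks like as a SQLAlchemy Core definition. This is only a sketch: the reading_XXX column names are placeholders I've made up here (the real table has 122 named measurement columns).

```python
from sqlalchemy import Table, Column, Integer, Float, DateTime, MetaData

metadata = MetaData()

# Sketch of a per-meter 2-second table: 122 float measurements + id + datetime
# = 124 columns. Measurement column names are placeholders.
meter1_2ss = Table(
    "meter1_2ss", metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),  # probably not needed
    Column("ts", DateTime, nullable=False, index=True),           # one row every 2 seconds
    # Float matches the current setup; Numeric(12, 5) would store the
    # 5-decimal values (0 to 1,000,000) exactly if exactness ever matters.
    *[Column("reading_%03d" % i, Float, nullable=False) for i in range(1, 123)],
    mysql_engine="InnoDB",  # Barracuda compression is enabled on top of this
)
```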
Just looking at the above, I noticed that at a minimum I could merge the 60-, 300-, and 900-second tables together and add an indexed time-period identifier.
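Something like this is what I have in mind: one aggregation table per meter with an indexed period column (60, 300, or 900) instead of three nearly identical tables. Again, the aggregate column names here are placeholders:

```python
from sqlalchemy import (Table, Column, SmallInteger, Float, DateTime,
                        MetaData, Index)

metadata = MetaData()

# One table replacing meter1_60ss / meter1_300ss / meter1_900ss.
meter1_aggregates = Table(
    "meter1_aggregates", metadata,
    Column("ts", DateTime, nullable=False),          # start of the aggregation window
    Column("period", SmallInteger, nullable=False),  # 60, 300, or 900 (seconds)
    # ~208 aggregate columns (sums, mins, maxes, averages, ...); placeholders:
    *[Column("agg_%03d" % i, Float) for i in range(1, 209)],
    Index("ix_meter1_agg_period_ts", "period", "ts"),
    mysql_engine="InnoDB",
)
```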
For reference, I learned everything I know about databases on this project. I learn best by just jumping right in and then having to fix whatever I screw up (note: this only applies to my own projects...).
I don't expect anyone here to teach me the nitty-gritty details of schema/database planning or design. That's on my personal to-learn list (although if you have suggested learning material or links, I'm grateful :) ). When I designed this I knew just about nothing about databases in general, and now I know enough to know that this is probably a bad setup, and that when I can afford it I should hire someone to help me design a better system. But until then (could be a while) I'm stuck doing this by myself, and I need to come up with a design/method that will scale. One of my goals is to store the data for as long as possible, and at a minimum 2 years. As of right now I have 2 devices set up and working, and I'm in the process of adding 10 more. Our company specializes in designing power-quality-improvement equipment, and we are going to be installing one of these devices with every system we sell, so I'll probably have 100 more by the end of the year.
Tools I'm using
MySQL 5.5
Python SQLAlchemy (just for table creation for now; ORMs are just as hard to learn as SQL, especially when you don't even know much SQL)
The official Python MySQL connector, with the compressed protocol and prepared statements
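For what it's worth, the upload path is roughly along these lines. This is a simplified sketch, not my actual code: the host, credentials, and table/column names are placeholders, and the real insert has 122 value columns per row.

```python
import mysql.connector

# Simplified sketch of an upload: mysql-connector-python with the compressed
# protocol and a prepared statement, sending a batch of rows per round trip.
conn = mysql.connector.connect(
    host="db.example.com", user="meter", password="secret",
    database="meter_data",
    compress=True,            # compressed client/server protocol
)
cur = conn.cursor(prepared=True)

sql = "INSERT INTO meter1_2ss (ts, reading_001, reading_002) VALUES (%s, %s, %s)"
rows = [
    ("2016-05-01 00:00:00", 120.00013, 0.99871),
    ("2016-05-01 00:00:02", 119.99842, 0.99870),
]
cur.executemany(sql, rows)    # batching rows per round trip matters at this insert rate
conn.commit()
cur.close()
conn.close()
```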
My Concerns (yeah, I know you probably have more)
- For now, with 2-10 devices, this works; but what about when I have more: hundreds, or thousands?
- As it's set up now, with 100 devices the database will be between 150 GB and 240 GB within a year. We're also going to start selling this device standalone, so that number could shoot up fast. Right now it's set up on a cloud database that purges the 2-second data after 8 days and replicates to a local database I set up at my office for long-term storage, but that will only work until the replication maxes out our internet connection. What's the most cost-effective way to host this?
- Is MySQL the right database for this? Would something like Postgres or Mongo be better?
- At 100 devices, the database will be processing about 4.43 million inserts a day (roughly 50 inserts per second on average). Can a database handle that many inserts? If not, is there a technology that can?
- I've come across schema designs where all the time-series data was placed in a single table with device identifiers. Would that work for my 2-second data? Is it advisable? (http://stackoverflow.com/questions/4...ational-or-non)
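If I understand it right, that design would look something like this for the 2-second data, with every meter writing to one shared table (again just a sketch with made-up column names):

```python
from sqlalchemy import Table, Column, Integer, Float, DateTime, MetaData

metadata = MetaData()

# One shared 2-second table for all devices instead of one table per meter.
readings_2s = Table(
    "readings_2s", metadata,
    Column("meter_id", Integer, primary_key=True, autoincrement=False),  # which device
    Column("ts", DateTime, primary_key=True),  # together: composite (meter_id, ts) key
    *[Column("reading_%03d" % i, Float, nullable=False) for i in range(1, 123)],
    mysql_engine="InnoDB",
)
```

Queries for a single meter would then filter on meter_id, and with InnoDB the composite (meter_id, ts) primary key keeps each meter's rows clustered together on disk.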
I neither need nor expect anyone to post a 5,000-word explanation of how best to do anything; if you want to take the time to do that, that's fine, and thank you. I'm looking for basic advice and community experience, as it's pretty obvious I don't have any. I'll be grateful for whatever I get, even if I have to google for a week to understand it.
Thanks
Jake