
Large Database, advice needed

Hi, I have a very large database that stores website visitor data for a bunch of websites. It keeps information like IP address and location.

It tracks about 350,000 visitors per day and stores all the data in a single table so I can run MySQL queries on it and do lots of different analyses.

But the database is now very large and takes a long time to run complex SQL statements.

The DB is hosted on a DigitalOcean droplet.

Does anyone have advice on how to make the database faster, or suggestions for a better setup or structure?


UPDATE 1 YEAR LATER (May 4th 2021):
Thanks to everyone for all the advice. The solution I have been using successfully for about a year now is TimescaleDB. It works really well with large amounts of time-based data such as website visitor tracking.

I originally started with Timescale Forge (the official hosted Timescale solution), but the prices were quite high, so I created a DigitalOcean managed PostgreSQL database with Timescale as an add-on. This is much cheaper, but it is not the community edition, so it does not have the materialized-view feature that would let me create a self-updating view containing just the latest 30 days of records.

So I just created a second table that contains only the last 30 days of data. This has helped a lot in speeding up 30-day reports.

I was also able to speed things up by adding extra indexes to the tables to filter the relevant data. Timescale automatically indexes on time, but I added an additional index on customer order so it's quick to extract all the rows for a specific order within a specific time period.

Thanks again to everyone who provided advice. If anyone else needs help with this topic, there are a lot of really great solutions written in the comments below.

Posted to Developers on June 5, 2020
  1. 6

    PostgreSQL has "materialized views", which are essentially pre-calculated views, so the result doesn't have to be recomputed every time the view is queried. Unfortunately MySQL/MariaDB doesn't have that option, but it does have scheduled-job functionality. What I've done in the past is create the queries I need in advance.
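
    For reference, a minimal Postgres sketch (table and column names here are just placeholders):

    -- Pre-compute the expensive aggregation once...
    CREATE MATERIALIZED VIEW daily_visits AS
      SELECT project_id, date_time::date AS day, COUNT(*) AS visits
      FROM visitor_log
      GROUP BY project_id, date_time::date;

    -- ...then re-run it on whatever schedule you like (cron, pg_cron, etc.).
    REFRESH MATERIALIZED VIEW daily_visits;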

    Another option might be to split the data out into various DBs based on site or data type.

    Yet another option is adding indexes to speed up search operations.

    Yet another option is to dump the old data to a backup and only keep the more recent data that you regularly use in the DB.

    Yet another option: I've heard sites like Facebook and YouTube use a tiered storage model where old, rarely used data is stored on a slower storage medium. So you could have one DB on an SSD storing only one month of data, and all historic data on a traditional HDD for the rare times you need to access it.

    Finally, I believe newer MySQL/MariaDB versions have the ability to use a JSON text file. That might at least speed up the one-time write operations if you're just appending to the end of a text file. I haven't actually gotten around to looking into this option, so I'm not sure how well it would work.

    Hope that helps,

    1. 5

      Just want to mention that this is a very insightful comment. You didn't just offer up general advice. You understood the problem and gave multiple valid solutions.

      Always love these kinds of responses!

      1. 2

        Thanks, literally the reason I'm here. I love helping and learning.

    2. 2

      Thanks, these tips are really helpful!

      Can you explain what you mean by creating the queries in advance? I know that I can create views that are pretty much just saved SQL queries that you can run. But I don't know how you can have a query run, say, every hour, and then call up the results from that query at any time.

      1. 1

        Fair enough. In PostgreSQL a materialized view can be refreshed every hour but, as I said, MySQL doesn't have that functionality.

        MySQL has an event scheduler: https://dev.mysql.com/doc/refman/5.7/en/event-scheduler.html
        What I've done in the past is to create a stored procedure/function along the lines of:
        DROP TABLE my_table; CREATE TABLE my_table AS SELECT * FROM base_table WHERE date_created > NOW() - INTERVAL 1 HOUR;
        You then call that function regularly with the event scheduler to rebuild the table. The downside is that you're duplicating data, so you're taking up more storage space. But storage is cheaper than RAM and processors.
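
        A rough sketch of the scheduling part (the procedure name is made up, and the event scheduler has to be enabled first):

        -- Requires: SET GLOBAL event_scheduler = ON;
        CREATE EVENT IF NOT EXISTS rebuild_my_table_hourly
          ON SCHEDULE EVERY 1 HOUR
          DO CALL rebuild_my_table();  -- stored procedure wrapping the DROP/CREATE above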

        Not sure if that answers questions or creates more but hope it helps.

        1. 2

          Ah! That's a great idea. Most of my reports are for the last 30 days, so I can create a new table with just the latest 30 days of data to make 30-day queries faster.

          Thanks for all your advice!

          1. 3

            Pro-tip:
            CREATE TABLE tmp_my_table AS SELECT...
            DROP TABLE IF EXISTS my_table;
            ALTER TABLE tmp_my_table RENAME TO my_table;

            That results in:

            • no downtime, because you build the new table first, then drop the existing table and rename the tmp table.
            • no errors if the table is missing: without IF EXISTS the function errors out if for some reason the table doesn't exist.
            1. 1

              Thanks a lot for this!

    3. 1

      As said, this is all great advice. My DB probably isn’t as large as yours (~60MM rows) but here’s what I tackled first.

      I would look at indexing first since it’s a baseline SQL feature (I’m familiar with Postgres). See which columns you query against most often that are slow and make sure those are indexed. WARNING: many cloud DB solutions bill/throttle based on memory, so if you get too aggressive with indexing you can pop your cap even if you aren’t exceeding the storage amount. I’m not sure if MySQL has composite indices, but if it does I would look into that as well.
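
      (For what it’s worth, MySQL does support multi-column indexes. A minimal sketch, with illustrative table and column names:)

      -- Composite index matching the common filter: one project over a date range.
      CREATE INDEX idx_project_time ON visits (project_id, date_time);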

      My data is also chunked by time intervals (internally I call them periods). I scrape a large volume of data once a month, then run analytics against it. I usually only access (write and then read) my data once a month, but I need it quickly. I recently started implementing partitioned tables for my largest ones.
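
      In Postgres, declarative partitioning looks roughly like this (a sketch with made-up names):

      -- Parent table partitioned by month; queries that filter on date_time
      -- only have to scan the matching partitions.
      CREATE TABLE events (
        project_id  int,
        date_time   timestamptz NOT NULL
      ) PARTITION BY RANGE (date_time);

      CREATE TABLE events_2020_06 PARTITION OF events
        FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');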

      Someone mentioned using JSON. If you go that route I’d recommend JSONB if available. Know, though, that some queries don’t work well or don’t work at all if the data is within JSON (especially if it’s relational data).

      Another option is to use something like S3, which is what I did for the first few years. If a large portion of your data is just “content” you can store it in a bucket and just keep the key in your DB, most likely the PK.

      Best of luck!

      [EDIT] Two things I forgot:

      1. I wouldn’t follow the advice of people recommending you switch DBs at this point. Though I prefer PG, MySQL is just as capable as the rest. Switching would be a waste of time.

      2. The biggest perf boost I got was when I switched my PG instance from the free tier to small. The extra memory and core 10x’d the speed of my logic-intensive queries.

  2. 2

    Check out Timescale.com!

    Assuming you are looking at this analytical data based on time, it's a great choice for you.

    It's based on Postgres so it will be pretty easy to move over your existing MySQL logic to it.

    I've just built 2 pretty complex systems on it so feel free to reach out if you have any questions about it!

    1. 1

      I looked into this more and maybe this DB system could work well for my needs. Do you know if there is some kind of web interface I can use with Timescale? Something like phpMyAdmin for MySQL.

      1. 2

        It's built on Postgres, so any Postgres client can connect to it! Personally I use a desktop client called TablePlus that I really like.

    2. 1

      That's great. I had a look at the website and it seems quite interesting. The SQL queries I run are time-based (last 30 days, grouped by day), so I think this could be useful. I will have to research and learn more about it first.

  3. 2

    Hi Victor,

    I believe Lakebed_io has already provided you with great options. Here are a few things that come to mind.

    First, I'm assuming you're talking about a MySQL database here. Second, you didn't mention whether you're running these "complex" queries behind the scenes or ON DEMAND as the user requests them. I mention that because there are a few different strategies you can employ depending on the answer.

    The first thing I'd suggest is to make sure you have proper indexes set up on the database itself. This will help make your queries faster.

    Next, I'd say (as mentioned before) try to cache the data. This caching can happen at various levels: either directly in the DATABASE in the form of temporary tables, or server-side using something like MEMCACHED or REDIS. Then, rather than querying the DB directly, you query the cache, which can speed things up. Not sure how real-time your data needs to be, so this may or may not apply.

    If all of this is still NOT enough and you need more speed, see if MySQL was really the right choice for your use case. Perhaps you may want to look into a NoSQL DB such as MongoDB or DynamoDB? Not saying they're better choices. I personally think MySQL is great and can scale very well. But I just don't know enough about your workload details to be able to answer that properly.

    Hope that helps :)

    1. 1

      Thanks for this answer!

      All my SQL queries are currently on demand. The complex queries are mostly for internal use, so I don't mind if those are slow.

      But currently, I provide visitor data in real time by creating graphs and tables on a webpage. These queries are taking maybe 20 seconds to run, which is way too slow for someone to wait for a chart to load.

      The table structure is something like this:
      project_id
      visitor_IP
      visitor_country
      visitor_language
      date_time

      The primary key is (project_id, visitor_ip, date_time)

      It currently has 15 million rows, so collecting the last 30 days of visitors grouped by day takes some time to run.
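
      Roughly, the query shape is something like this (simplified sketch; the table name is a placeholder):

      SELECT DATE(date_time) AS day, COUNT(*) AS visitors
      FROM visits
      WHERE project_id = 123
        AND date_time >= NOW() - INTERVAL 30 DAY
      GROUP BY DATE(date_time)
      ORDER BY day;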

      If you have a better way of structuring this table I would be interested to know.

      There are a few hundred projects so I don't want to cache each project in its own table, but if there is some kind of middle-man cache system I could use, that would help a lot.

      1. 1

        15M rows should not be a big issue for MySQL. What does your query look like (at least one of them)? Do you have any other indices besides the primary key? What does it say when you prepend "EXPLAIN" to your SELECT statement?
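
        For example, something along these lines (placeholder names, matching the sketch above):

        EXPLAIN SELECT DATE(date_time) AS day, COUNT(*) AS visitors
        FROM visits
        WHERE project_id = 123 AND date_time >= NOW() - INTERVAL 30 DAY
        GROUP BY DATE(date_time);
        -- In the output, check the "key" and "rows" columns: key = NULL or a huge
        -- row estimate usually means the query isn't using a suitable index.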

      2. 1

        If you've already created the right indexes on your DB and it's still not fast enough, then I believe what you need is caching, my friend. Look into Redis. Rather than querying data by connecting to the database every time a user requests a webpage, you'll fetch it from Redis, which is much faster. You'll need to think about how long you want to cache the data for, etc., but I think that'll help you out. You can either host it yourself or use one of the hosted services. There's even a serverless version of Redis now which was just launched recently (https://lambda.store). I'm a huge proponent of serverless. I don't like managing servers myself :)

  4. 2

    I'm no expert in this stuff, but a month or two ago I did some research into these areas.

    a) This might be a great use case for Google BigQuery, plus an ETL like Stitch to get it migrated: https://www.stitchdata.com/integrations/destinations/

    b) If you have control over the sites the data is coming from, you may want to customize your pipeline and look at Snowplow. While a stock Snowplow implementation can be expensive, some folks have figured out how to do some cheap tricks using the default pixel/analytics tracker, e.g. https://github.com/ownyourbusinessdata/snowplow-s3-enrich

    1. 1

      Thanks for sharing! I'm going to check these out. Maybe Google BigQuery is a faster database provider.

  5. 2

    Is this database purely for querying?

    One thing I can think of is to create a separate database purely as a read-only data warehouse. You can create a schema that is more efficient for the types of queries you are running (plus overall normalization compared to a single table) and create a process for either copying or offloading the data from your current database to the new one.
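
    For example, a nightly copy step could look roughly like this (illustrative names; assumes both schemas are reachable from one connection, otherwise a dump/restore or replication does this step):

    -- Aggregate yesterday's raw rows into a warehouse table shaped for reporting.
    INSERT INTO warehouse.daily_visits (project_id, day, visitors)
    SELECT project_id, DATE(date_time), COUNT(*)
    FROM production.visits
    WHERE date_time >= CURRENT_DATE - INTERVAL 1 DAY
      AND date_time < CURRENT_DATE
    GROUP BY project_id, DATE(date_time);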

    This of course depends on how 'live' you need the data to be as there will be a gap between the production database and the warehouse.

    1. 1

      So the read-only database would essentially act as a cache, but with faster read speeds since it's read-only?

      I only just started learning about normalization. I believe the data is already at least in 3NF (which is as far as I got into normalization)

      Can you explain more about creating a schema? I do not quite understand that part.

      1. 1

        An out-of-the-box option for this is Amazon’s Aurora (the PostgreSQL-compatible edition). It is pretty spendy but comes with separate read and write replicas by default. I started using it but the bill crept up fast.

        1. 1

          I just had a look and the prices don't really look too bad.

          What was your experience in terms of speed? Was the database a lot faster than what you would expect from DigitalOcean?

          1. 1

            I didn’t get far enough to really stretch its legs; supposedly it’s very fast. I wouldn’t “blame” DigitalOcean, per se. For the same tier of DB you’re going to get more or less the same performance regardless of cloud provider. If I remember correctly, Aurora is built on top of S3 but has a Postgres compat mode. If I were in your shoes I would stick with what you’ve got as far as provider and DB flavor and just bump up your tier by one. So from, say, 1 core and 10 GB to 2 cores and XX GB. That made a huge difference for me and took only about 45 minutes for the migration.

  6. 1

    hey there, if your data is of "event" type and is immutable (sounds like it's append-only, right?) then I'd recommend TimescaleDB: https://www.timescale.com/

    We used to use a Postgres table for holding similar event data and it just became a bit unmanageable after a while for lots of different reasons.

    The great thing about Timescale is that it's actually a Postgres extension, so it sits on top of Postgres and is optimised for querying large amounts of data, as long as we all agree it's append-only (which events usually are :-) ).

    FYI, since it's on Postgres, you can host it yourself or use a hosted solution. We host it ourselves on AWS EC2.
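
    Getting an existing table into Timescale is basically this (a sketch; table and column names are placeholders):

    -- Turn a plain Postgres table into a hypertable that Timescale
    -- automatically chunks and indexes by time.
    CREATE EXTENSION IF NOT EXISTS timescaledb;
    SELECT create_hypertable('visits', 'date_time', migrate_data => true);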

  7. 1

    A single SQL database is probably not the best tool for this job.

    I would recommend something with high write scalability (such as DynamoDB or Kafka or even S3) for ingesting the raw data. Then create an ETL pipeline (or pipelines) for extracting and aggregating the data you want in a SQL database (or databases) designed for your current real-time needs.

    This is not unlike the “materialized views” mentioned above, but you get the added advantage of having the raw data available for future applications. You're also not constrained to using SQL in transforming the data—you can use the language of your choice.

  8. 1

    This comment was deleted 4 years ago.
