2010-10-25

Observations regarding large data sets

I recently started a new position at a startup that shall go nameless. It's been an interesting time so far, as we're dealing with fairly large data sets (not huge, but large-ish). Additionally, I've been dealing with MySQL a lot more than I'm accustomed to, as I've generally been something of a Postgres partisan. Some observations:



  • Sometimes, the query plan for a given query (as described by EXPLAIN) looks absolutely horrible, but then MySQL actually performs well.

  • Sometimes, the query plan for a given query looks great, using a highly selective index and so minimizing the rows for which any non-indexed criteria are considered. And yet the performance sucks.

  • When the performance is sucking, it is rather unhelpful to not have iostat nearby.

  • It is incredibly painful how long the simplest operations can take with large data sets.

  • It is similarly painful that were I using Postgres, I would know exactly what to do to address some of the performance nastiness I'm seeing. And in MySQL I don't have the same level of knowledge. I think it's also fair to say that in MySQL the ability for introspection is fairly limited compared to Postgres, which doesn't help. But I'm hoping that these conclusions are wrong and that it's just a matter of acclimation.

  • Hope is not useful when solving engineering problems. Rather, it's probably detrimental.

  • It probably makes relatively little sense to object to the notion of eventual consistency if one is, within the same algorithm, sending writes to a master whilst reading from a slave, and especially so if said master/slave servers are in the crappy-io-osphere known as "the cloud".
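
To make that last observation concrete, here is a toy sketch (not our actual setup) of a read-modify-write that reads from a slave and writes to the master. The Master and Replica classes, the catch_up method, and the "hits" counter are all made up for illustration; real MySQL replication is asynchronous in roughly this spirit, but far more involved.

```python
class Master:
    """Hypothetical in-memory stand-in for the master server."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered replication log of (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Hypothetical slave that only sees writes once it replays the log."""

    def __init__(self, master):
        self.master = master
        self.data = {}
        self.applied = 0  # number of log entries replayed so far

    def read(self, key):
        return self.data.get(key)

    def catch_up(self):
        # Replay every log entry not yet applied (simulates lag ending).
        pending = self.master.log[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)


def increment_via_replica(master, replica, key):
    # Read from the slave, write to the master: the pattern in question.
    current = replica.read(key) or 0
    master.write(key, current + 1)


master = Master()
replica = Replica(master)

# Two increments while the slave is lagging: both read a missing value
# and both write 1, so one increment is silently lost.
increment_via_replica(master, replica, "hits")
increment_via_replica(master, replica, "hits")
stale_result = master.data["hits"]

# Once the slave has caught up, the next increment behaves as expected.
replica.catch_up()
increment_via_replica(master, replica, "hits")
fresh_result = master.data["hits"]
```

After two increments against a lagging slave, stale_result is 1 rather than 2. Code written this way is already living with eventual consistency, whether or not anyone signed up for it.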



Some of the data manipulation issues we see could be readily assisted by using a distributed datastore of the Cassandra/Riak/Voldemort/HBase variety. Consider the implications of such a step: certain things (like arbitrary querying, or range queries, or fast sorting) become a pain in the ass. Notice that all of those things are already a pain in the ass using a relational database. Thus, it looks like a no-brainer; why spend time propping up an approach that will be horribly expensive and difficult and rickety, when you can spend that time building something with horizontal scalability, availability, and the means to exercise control over the inconsistency inherent in replication/distribution built in?
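
For a rough sense of why range queries get painful, here is a minimal sketch assuming keys are placed on nodes by hashing. The node names, the node_for helper, and the "user:NNNN" key format are invented for illustration and are not any particular store's API; stores that keep keys sorted across partitions behave differently.

```python
import hashlib

# Hypothetical four-node cluster.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def node_for(key):
    # Hash the key and map the digest onto a node. Adjacent keys get
    # unrelated digests, so key order says nothing about placement.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


def nodes_for_range(keys):
    # A "range query" has to visit every node that holds any key in the range.
    return {node_for(k) for k in keys}


# One hundred lexically adjacent keys, as a range scan would want them.
keys = ["user:%04d" % i for i in range(100)]
placement = {k: node_for(k) for k in keys}
touched = nodes_for_range(keys)
```

A point lookup hits exactly one node, while the contiguous range of keys ends up spread over the cluster, so a scan turns into a scatter-gather across everything in touched.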



Anyway, that's all for now. It's a good bunch of folks and there are a variety of interesting problems to solve, so my brain's wanderlust is happily contained.
