cannedplatypus

Data Gravity

Mon, 10 Feb 2014 18:20:27 GMT

Jeff Darcy
Mon, 10 Feb 2014 18:20:27 GMT

Not too surprisingly, Randy has responded to several of the points made here. Since responding to all of his evasions on Twitter would be rude for all of my followers who aren't interested in this particular discussion, I'll try to address them here instead.

"That kind of sync is not nearly as chatty unless it's fully synchronous. People do async across WAN." (https://twitter.com/randybias/status/432927781018562560)

"No it's not harder. Async can be done aggressively fast where large multi-block data is xferred and written." (https://twitter.com/randybias/status/432927989701963776)

The problem is that "chattiness" directly corresponds to consistency. If you do async with a long period, you're just not going to be consistent enough for many workloads. You're also highly likely to impose an extra I/O penalty when your replication has to re-read data that's no longer in cache. As your period decreases, your "chattiness" increases - all the way to full sync, which requires constant back-and-forth communication. This is basic CAP Theorem stuff: if you want consistency, you have to communicate. You can have very efficient communication *or* you can have consistency, but you can't have both across long distances.

"Data gravity means I have to bring my compute to my data. If syncing it as you indicate then no gravity exists." (https://twitter.com/randybias/status/432928717006852096)

That's moving the goalposts a little. While it's correct that there's no gravity when doing full synchronous replication, that's only a small part of what we were originally talking about because of the performance and storage-efficiency issues involved. I work on full synchronous replication for a living. Believe me, even on a fast local network these are issues. If people will recall, the conversation was not limited to that style of replication when Randy said data gravity and speed of light were *entirely* unrelated. Entirely. No qualifiers. It's that tendency to speak in such an absolute way, attempting to gainsay people who actually (and often clearly) know the subject matter better than he does, which has put Randy and me at odds before.

In fact, as I have explained, latency and gravity are very deeply related in *most* cases. Even Google, which already had all that dark fiber, recognized this when they spent years developing multiple iterations of cross-datacenter storage solutions. Colossus and Spanner wouldn't exist if these problems were unimportant or easily solved with other tools.

Gravity is a force that exists at all levels of strength. It doesn't have to be absolute. Whenever it's more feasible to move computation to data than to move, copy, or sync that data to where the computation will be, that's gravity. I'm not saying there aren't problems at Google or elsewhere that are latency-insensitive. Of course there are, and having access to immense bandwidth surely helps in those cases. What I'm saying is that there are *also* problems where latency issues cause computation to be attracted to data more than vice versa. As long as that's true, gravity is a relevant concern and Randy's claim counts as misinformation.
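As a rough illustration of the trade-off described above, here is a minimal back-of-envelope sketch (with invented latencies and periods, not measurements) of how shrinking an async replication period trades worst-case staleness against the constant round trips of full sync:

```python
# Back-of-envelope model of the consistency/latency trade-off described above.
# All numbers here are illustrative assumptions, not measurements.

LOCAL_WRITE_MS = 0.5   # assumed local commit time
WAN_RTT_MS = 70.0      # assumed cross-country round trip

def sync_write_latency_ms():
    """Fully synchronous replication: every write waits for the remote ack."""
    return LOCAL_WRITE_MS + WAN_RTT_MS

def async_worst_case_staleness_s(replication_period_s):
    """Asynchronous replication: writes complete locally, but the remote copy
    can lag by up to one replication period (transfer time ignored here)."""
    return replication_period_s

if __name__ == "__main__":
    print(f"sync: {sync_write_latency_ms():.1f} ms added to every write")
    for period_s in (300, 60, 5, 1):
        print(f"async, {period_s:>3d}s period: remote copy up to "
              f"{async_worst_case_staleness_s(period_s)}s stale")
    # Driving the period toward zero restores consistency, but re-introduces
    # the constant back-and-forth ("chattiness") of the synchronous case.
```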

randyb
Thu, 13 Feb 2014 18:26:24 GMT

Jeff, Thanks again for your comments and posting on this. This has to be my last comment on this topic due to time constraints; however, I hope to more fully clarify the disconnect we seem to be having.

First, let me say that I don’t think there is any doubt of your domain expertise in writing software for storage systems. And, in fact, I can’t actually see any kind of disconnect in our understandings of the technologies we are talking about. For background, I have been dealing with storage and WANs/LANs since the mid 1980s with UUCP, then NFS over 56K leased lines, then synchronous and asynchronous storage replication throughout the late 90s and until today. As recently as 2010 I experienced the exact issues you highlight with synchronous replication in our KT uCloud deployment, where we tried to use sync replication between Solaris ZFS boxes in a datacenter. Storage systems normally capable of running at ~20Gbps suddenly throttled down to 1Gbps using sync rep. I think it’s pretty safe to say that I also have significant experience in storage, WANs, and LANs. Primarily field experience. So, I may not be familiar with the code in synchronous replication, but I have probably implemented it in real world scenarios more than most people.

Second, I agree that I was out of line in my characterization that data gravity is entirely unrelated to latency (speed of light) and I apologize for my hubris there. However, I don’t think I was far off the mark. What I should have said was “data gravity is rarely related to latency”. That, of course, may not be satisfactory to you, so I’ll have to explain and hope you see my perspective.

For me, data gravity has three key elements: one, the size of the data; two, the amount of data access that occurs. You could have a massive footprint of data (10PB+) that is infrequently accessed, like archives and backups. There is very little data gravity for such a data set, as the infrequent access means that in the most extreme example you can just FedEx disk drives around, which is certainly a high latency activity. Conversely, if a dataset has very heavy access but a much smaller size (10TB, 100B) it could have significantly more data gravity. Given this, somewhat equal weight needs to be applied to both the data set size and the level of access.

A dataset with significant access is inherently concerned with performance. Performance is negatively impacted, as you said, even in a datacenter environment with little or no latency. There is significantly worse impact to performance over the WAN. Similarly, large datasets are more difficult to move around, making data gravity more relevant the larger the data set size is, as you have effective “lock-in” to whatever datacenter environment that data is stored in. This usually means you want a copy of it elsewhere, typically in another datacenter.

That’s why, for me, data gravity is much less relevant as an issue *except* when talking about moving data across the WAN. How could it be otherwise? There is a 10x or even 100x difference between datacenter bandwidth and typical WAN bandwidth. Sure, data gravity still exists as an issue, but it’s much easier to move large datasets around, copy them, back them up, or perform synchronous replication in an environment with lots of bandwidth. This is why the best practice for data replication, both for storage systems and databases, is to run local synchronous replication inside the datacenter environment and asynchronous replication between datacenters.
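As a rough illustration of the sync-replication throttling described above, here is a crude sketch of why per-write round trips cap effective throughput well below what the storage can do locally. The write size, queue depth, and round-trip times are assumptions, not a model of the ZFS deployment itself:

```python
# Crude model of why synchronous replication throttles throughput: each write
# must wait for a remote acknowledgement, so only a limited amount of data is
# ever in flight per round trip. Write size, queue depth, and RTTs are assumed.

def sync_rep_throughput_gbps(write_size_bytes, outstanding_writes, rtt_ms):
    """Upper bound: outstanding_writes * write_size_bytes per round trip."""
    bits_per_rtt = write_size_bytes * 8 * outstanding_writes
    return bits_per_rtt / (rtt_ms / 1000.0) / 1e9

if __name__ == "__main__":
    raw_link_gbps = 20.0                    # what the storage can push locally
    for rtt_ms in (0.1, 1.0, 10.0):         # same rack, metro, long-haul
        eff = sync_rep_throughput_gbps(64 * 1024, 8, rtt_ms)
        print(f"RTT {rtt_ms:5.1f} ms -> effective {min(eff, raw_link_gbps):6.2f} Gbps"
              f" (link/storage capable of {raw_link_gbps:.0f} Gbps)")
```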
I have almost never seen folks run sync replication across the WAN given the performance hit. WAN-based replication is typically for Disaster Recovery, Archives / Backups, or for Data Migration purposes. In such cases, minutes of data loss are considered acceptable for most use cases. Of course, this depends on RTO, RPO, DR policies, and data retention policies generally, which vary by industry and business. Still, in my experience, even in financial services, synchronous replication over the WAN is used in much less than 5% of real world use cases.

That means that asynchronous replication is far more typical between datacenters, and asynchronous replication is for the most part simply large sets of ordered data being shipped around, either as blocks, bin logs, or something similar. The very nature of async replication means that the setup and tear down of the connection is a nominal part of the round trip cost, so latency is largely irrelevant. Instead, bandwidth becomes most important, as the amount of bandwidth that you have between datacenters directly relates to how much data you can transfer at one time in a given window. Ideally you might run async rep aggressively in a 5 minute window and, depending on the change set size (which is related to the amount of access of the dataset), you would do a full transfer of all changes in that 5 minute window. The more the data is accessed (meaning the more data gravity it has), the larger the expected changes would be, which would require more WAN bandwidth to complete in a fixed time window.

So, that’s really my point. Given that data gravity is primarily related to WAN-based replication and that asynchronous methods are the most prevalent approach to shipping data between datacenters, then in my experience in real world deployments, bandwidth between datacenters is the single biggest factor here, not latency. Yes, there are cases where businesses must have synchronous replication over the WAN, but that’s the exception, not the rule.

The disconnect we have is that you are focusing on the consistency of the data and I’m focusing on what’s most common in practice. People give up consistency for performance and cost reasons all the time, and most data sets can handle some level of inconsistency, particularly between datacenters, as there is usually a local copy in the datacenter for redundancy purposes already.

Again, apologies for being overly vehement in my initial Twitter response. I appreciate your point of view and feel that it’s accurate, but hopefully you will see my perspective here, as I think it is also accurate and entirely relevant. Regards, —Randy
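A small, hypothetical sizing check along the lines of the windowed async replication described above; the change rates, window length, WAN bandwidth, and utilization share are all assumptions for illustration:

```python
# Hypothetical sizing check for windowed async replication: accumulate changes
# for `window_s` seconds, then see whether the WAN can ship the change set
# before the next window starts. Rates, window, and bandwidth are assumptions.

def window_fits(change_rate_mb_s, window_s, wan_gbps, utilization=0.7):
    """Return (changeset_gb, transfer_s, fits) for one replication window."""
    changeset_gb = change_rate_mb_s * window_s / 1000.0
    usable_gbps = wan_gbps * utilization   # assumed share of the WAN for replication
    transfer_s = changeset_gb * 8 / usable_gbps
    return changeset_gb, transfer_s, transfer_s <= window_s

if __name__ == "__main__":
    for rate_mb_s in (50, 400, 1600):      # MB/s of changed data (assumed)
        size_gb, t_s, ok = window_fits(rate_mb_s, window_s=300, wan_gbps=10)
        print(f"{rate_mb_s:4d} MB/s of changes -> {size_gb:6.1f} GB per 5-minute window,"
              f" {t_s:4.0f} s of WAN time -> {'fits' if ok else 'falls behind'}")
```

Note how only the change rate (how heavily the data is accessed) and the WAN bandwidth appear in the calculation; the round-trip latency never does, which is the point being made above.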

randyb
Thu, 13 Feb 2014 18:27:44 GMT

Sorry, *two* key elements...

Jeff Darcy
Thu, 13 Feb 2014 18:44:05 GMT

Thanks for stopping by, Randy. Yes, I agree that sync within the data center and async between is a best practice. In fact, I even responded to some popular violations of that best practice in (checks) March 2010. http://pl.atyp.us/wordpress/?p=2824

I think the important thing, and what rankled me about what seemed like a dismissive comment, is that data gravity is an issue even with best practices. For example, using locality to place computation near relevant data is key to how Hadoop works even on the fastest networks. I'd call that gravity. Even if we're only talking about a WAN environment, placing new jobs near the primary/active copy of a dataset is distinctly better than placing them elsewhere. For one thing, a secondary site that exists for DR/archive purposes might not be running on the same hardware as its primary. Maybe we can extend the metaphor to say that faster storage systems and new data are "denser" than slower storage and old data, so they exert a greater gravitational pull. ;)

In any case, the *exact* role of data locality in designing and deploying storage solutions is a rich topic that I hope we can discuss further (and less contentiously) some day. Cheers.
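As a toy illustration of the locality point above, here is a sketch that scores candidate placements by how much data would have to move plus a crude hardware slowdown factor; the site names, dataset size, bandwidths, and slowdowns are all made up for the example:

```python
# Toy placement calculation for the locality point above. Site names, sizes,
# bandwidths, and slowdown factors are all invented for illustration.

def site_cost_s(input_tb, local_fraction, link_gbps, hw_slowdown, base_runtime_s=3600):
    """Time to pull the non-local part of the input plus runtime on that site's gear."""
    move_s = input_tb * (1 - local_fraction) * 1e12 * 8 / (link_gbps * 1e9)
    return move_s + base_runtime_s * hw_slowdown

if __name__ == "__main__":
    sites = {
        # site: (fraction of the 2 TB input already local, Gbps to the data, slowdown)
        "primary":    (1.0, 100.0, 1.0),   # active copy on the fastest gear
        "dr-site":    (1.0, 100.0, 2.5),   # full replica, but older hardware
        "new-region": (0.0,  10.0, 1.0),   # fast gear, no local data
    }
    for name, params in sites.items():
        print(f"{name:10s}: ~{site_cost_s(2.0, *params):7.0f} s total")
    print("cheapest placement:", min(sites, key=lambda s: site_cost_s(2.0, *sites[s])))
```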