https://github.com/tinkerpop/blueprints
https://ssudan16.medium.com/storage-structures-used-in-databases-and-distributed-systems-1405f1851afc
https://courses.grainger.illinois.edu/ece598pv/sp2021/lectureslides2021/ECE_598_PV_course_notes10.pdf
https://archive.docs.influxdata.com/influxdb/v0.13/concepts/storage_engine/
Advantages of Sharding
Solve Scalability Issue: With a single database server architecture any application experience performance degradation when users start growing on that application. Reads and write queries become slower and the network bandwidth starts to saturate. At some point, you will be running out of disk space. Database sharding fixes all these issues by partitioning the data across multiple machines.
High Availability: A problem with single server architecture is that if an outage happens then the entire application will be unavailable which is not good for a website with more number of users. This is not the case with a sharded database. If an outage happens in sharded architecture, then only some specific shards will be down. All the other shards will continue the operation and the entire application won’t be unavailable for the users.
Speed Up Query Response Time: When you submit a query in an application with a large monolithic database and have no sharded architecture, it takes more time to find the result. It has to search every row in the table and that slows down the response time for the query you have given. This doesn’t happen in sharded architecture. In a sharded database a query has to go through fewer rows and you receive the response in less time.
More Write Bandwidth: For many applications writing is a major bottleneck. With no master database serializing writes sharded architecture allows you to write in parallel and increase your write throughput.
Scaling Out: Sharding a database facilitates horizontal scaling, known as scaling out. In horizontal scaling, you add more machines in the network and distribute the load on these machines for faster processing and response. This has many advantages. You can do more work simultaneously and you can handle high requests from the users, especially when writing data because there are parallel paths through your system. You can also load balance web servers that access shards over different network paths, which are processed by different CPUs, and use separate caches of RAM or disk IO paths to process work.
Disadvantages of Sharding
Adds Complexity in the System: You need to be careful while implementing a proper sharded database architecture in an application. It’s a complicated task and if it’s not implemented properly then you may lose the data or get corrupted tables in your database. You also need to manage the data from multiple shard locations instead of managing and accessing it from a single entry point. This may affect the workflow of your team which can be potentially disruptive to some teams.
Rebalancing Data: In a sharded database architecture, sometimes shards become unbalanced (when a shard outgrows other shards). Consider an example that you have two shards of a database. One shard store the name of the customers begins with letter A through M. Another shard store the name of the customer begins with the letters N through Z. If there are so many users with the letter L then shard one will have more data than shard two. This will affect the performance (slow down) of the application and it will stall out for a significant portion of your users. The A-M shard will become unbalance and it will be known as database hotspot. To overcome this problem and to rebalance the data you need to do re-sharding for even data distribution. Moving data from one shard to another shard is not a good idea because it requires a lot of downtimes.
Joining Data From Multiple Shards is Expensive: In a single database, joins can be performed easily to implement any functionalities. But in sharded architecture, you need to pull the data from different shards and you need to perform joins across multiple networked servers You can’t submit a single query to get the data from various shards. You need to submit multiple queries for each one of the shards, pull out the data, and join the data across the network. This is going to be a very expensive and time-consuming process. It adds latency to your system.
No Native Support: Sharding is not natively supported by every database engine. For example, PostgreSQL doesn’t include automatic sharding features, so there you have to do manual sharding. You need to follow the “roll-your-own” approach. It will be difficult for you to find the tips or documentation for sharding and troubleshoot the problem during the implementation of sharding.
Sharding Architectures
Consider an example that you have 3 database servers and each request has an application id which is incremented by 1 every time a new application is registered. To determine which server data should be placed on, we perform a modulo operation on these applications id with the number 3. Then the remainder is used to identify the server to store our data.
Keybased-Sharding
The downside of this method is elastic load balancing which means if you will try to add or remove the database servers dynamically it will be a difficult and expensive process. For example, in the above one if you will add 5 more servers then you need to add more corresponding hash values for the additional entries. Also, the majority of the existing keys need to be remapped to their new, correct hash value and then migrated to a new server. The hash function needs to be changed from modulo 3 to modulo 8. While the migration of data is in effect both the new and old hash functions won’t be valid. During the migration, your application won’t be able to service a large number of requests and you’ll experience downtime for your application till the migration completes.
Note: A shard shouldn’t contain values that might change over time. It should be always static otherwise it will slow down the performance.
Range-Based-Sharding
Range-based sharding is the simplest sharding method to implement. Every shard holds a different set of data but they all have the same schema as the original database. In this method, you just need to identify in which range your data falls, and then you can store the entry to the corresponding shard. This method is best suitable for storing non-static data (example: storing the contact info for students in a college.)
The drawback of this method is that the data may not be evenly distributed on shards. In the above example, you might have a lot of customers whose names fall into the category of A-P. In such cases, the first shard will have to take more load than the second one and it can become a system bottleneck.
Vertical-Sharding
In this method, you can separate and handle the critical part (for example user profiles) non-critical part of your data (for example, blog posts) individually and build different replication and consistency models around it. This is one of the main advantages of this method.
The main drawback of this scheme is that to answer some queries you may have to combine the data from different shards which unnecessarily increases the development and operational complexity of the system. Also, if your application will grow later and you add some more features in it then you will have to further shard a feature-specific database across multiple servers.
Directory-Based-Sharding
The lookup table holds a static set of information about where specific data can be found. In the above image, you can see that we have used the delivery zone as a shard key. Firstly the client application queries the lookup service to find out the shard (database partition) on which the data is placed. When the lookup service returns the shard it queries/updates that shard.
Directory-based sharding is much more flexible than range based and key-based sharding. In range-based sharding, you’re bound to specify the ranges of values. In key-based, you are bound to use a fixed hash function which is difficult to change later. In this approach, you’re free to use any algorithm you want to assign to data entries to shards. Also, it’s easy to add shards dynamically in this approach.
The major drawback of this approach is the single point of failure of the lookup table. If it will be corrupted or failed then it will impact writing new data or accessing existing data from the table.
https://www.geeksforgeeks.org/database-sharding-a-system-design-concept/
https://en.pingcap.com/case-studies/how-we-use-a-mysql-alternative-to-avoid-sharding-and-provide-strong-consistency
https://docs.oracle.com/en/database/other-databases/nosql-database/19.5/admin/shard-capacity.html#GUID-CA937E53-DA80-4499-B9B3-FBFD6F2A622F
https://queue.acm.org/detail.cfm?id=3220266
http://www.cs.cornell.edu/~saeed/skiptree.pdf
https://readingxtra.github.io/docs/RMDA/icmd19-ziegler.pdf
https://iopscience.iop.org/article/10.1088/1742-6596/1202/1/012020/pdf
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.3197&rep=rep1&type=pdf
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Flevelup.gitconnected.com%2Fmysql-sharding-tutorial-7340d2c26a3e
https://en.pingcap.com/case-studies/goodbye-mysql-sharding-use-scale-out-mysql-alternative-to-store-160-tb-of-data
https://dzone.com/articles/challenges-of-sharding-mysql
https://blog.twitter.com/engineering/en_us/a/2010/introducing-gizzard-a-framework-for-creating-distributed-datastores
https://www.geeksforgeeks.org/what-is-sharding/
https://medium.com/system-design-blog/database-sharding-69f3f4bd96db
https://freshdesk.com/product-updates/how-freshdesk-scaled-using-sharding-blog/
https://www.acodersjourney.com/database-sharding/
https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Flevelup.gitconnected.com%2Fwhat-is-database-sharding-and-how-is-it-done-f36b9cb653e8
https://hevodata.com/learn/database-sharding-a-comprehensive-guide/
https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6
http://www.startuplessonslearned.com/2009/01/sharding-for-startups.html
https://dev.to/renaissanceengineer/database-sharding-explained-2021-database-scaling-tutorial-5cej
https://docs.mongodb.com/manual/core/sharding-data-partitioning/
https://slack.engineering/scaling-datastores-at-slack-with-vitess/
https://zhuanlan.zhihu.com/p/57185574
https://freecontent.manning.com/wp-content/uploads/designing-and-architecting-for-internet-scale-sharding.pdf
https://docs.yugabyte.com/latest/explore/linear-scalability/sharding-data/
https://golangexample.com/go-package-for-sharding-databases/
drop box 分表加两阶段提交
https://dropbox.tech/infrastructure/cross-shard-transactions-at-10-million-requests-per-second
https://dropbox.tech/infrastructure/reintroducing-edgestore
https://systemdesignprimer.com/dropbox-system-design/
https://weifoo.gitbooks.io/systemdesign/content/system-design-examples/dropbox.html
https://sudonull.com/post/13754-Database-development-in-Dropbox-The-path-from-one-global-MySQL-database-to-thousands-of-servers
https://github.com/donnemartin/system-design-primer
https://proprogramming.org/design-dropbox-a-system-design-interview-question/