tree store

by 夏泽民 Nov 20, 2021

https://github.com/tinkerpop/blueprints
https://ssudan16.medium.com/storage-structures-used-in-databases-and-distributed-systems-1405f1851afc

https://courses.grainger.illinois.edu/ece598pv/sp2021/lectureslides2021/ECE_598_PV_course_notes10.pdf

https://archive.docs.influxdata.com/influxdb/v0.13/concepts/storage_engine/
Advantages of Sharding
Solve Scalability Issue: With a single database server architecture any application experience performance degradation when users start growing on that application. Reads and write queries become slower and the network bandwidth starts to saturate. At some point, you will be running out of disk space. Database sharding fixes all these issues by partitioning the data across multiple machines.
High Availability: A problem with single server architecture is that if an outage happens then the entire application will be unavailable which is not good for a website with more number of users. This is not the case with a sharded database. If an outage happens in sharded architecture, then only some specific shards will be down. All the other shards will continue the operation and the entire application won’t be unavailable for the users.
Speed Up Query Response Time: When you submit a query in an application with a large monolithic database and have no sharded architecture, it takes more time to find the result. It has to search every row in the table and that slows down the response time for the query you have given. This doesn’t happen in sharded architecture. In a sharded database a query has to go through fewer rows and you receive the response in less time.
More Write Bandwidth: For many applications writing is a major bottleneck. With no master database serializing writes sharded architecture allows you to write in parallel and increase your write throughput.
Scaling Out: Sharding a database facilitates horizontal scaling, known as scaling out. In horizontal scaling, you add more machines in the network and distribute the load on these machines for faster processing and response. This has many advantages. You can do more work simultaneously and you can handle high requests from the users, especially when writing data because there are parallel paths through your system. You can also load balance web servers that access shards over different network paths, which are processed by different CPUs, and use separate caches of RAM or disk IO paths to process work.
Disadvantages of Sharding
Adds Complexity in the System: You need to be careful while implementing a proper sharded database architecture in an application. It’s a complicated task and if it’s not implemented properly then you may lose the data or get corrupted tables in your database. You also need to manage the data from multiple shard locations instead of managing and accessing it from a single entry point. This may affect the workflow of your team which can be potentially disruptive to some teams.
Rebalancing Data: In a sharded database architecture, sometimes shards become unbalanced (when a shard outgrows other shards). Consider an example that you have two shards of a database. One shard store the name of the customers begins with letter A through M. Another shard store the name of the customer begins with the letters N through Z. If there are so many users with the letter L then shard one will have more data than shard two. This will affect the performance (slow down) of the application and it will stall out for a significant portion of your users. The A-M shard will become unbalance and it will be known as database hotspot. To overcome this problem and to rebalance the data you need to do re-sharding for even data distribution. Moving data from one shard to another shard is not a good idea because it requires a lot of downtimes.
Joining Data From Multiple Shards is Expensive: In a single database, joins can be performed easily to implement any functionalities. But in sharded architecture, you need to pull the data from different shards and you need to perform joins across multiple networked servers You can’t submit a single query to get the data from various shards. You need to submit multiple queries for each one of the shards, pull out the data, and join the data across the network. This is going to be a very expensive and time-consuming process. It adds latency to your system.
No Native Support: Sharding is not natively supported by every database engine. For example, PostgreSQL doesn’t include automatic sharding features, so there you have to do manual sharding. You need to follow the “roll-your-own” approach. It will be difficult for you to find the tips or documentation for sharding and troubleshoot the problem during the implementation of sharding.
Sharding Architectures

Key Based Sharding
This technique is also known as hash-based sharding. Here, we take the value of an entity such as customer ID, customer email, IP address of a client, zip code, etc and we use this value as an input of the hash function. This process generates a hash value which is used to determine which shard we need to use to store the data. We need to keep in mind that the values entered into the hash function should all come from the same column (shard key) just to ensure that data is placed in the correct order and in a consistent manner. Basically, shard keys act like a primary key or a unique identifier for individual rows.

Consider an example that you have 3 database servers and each request has an application id which is incremented by 1 every time a new application is registered. To determine which server data should be placed on, we perform a modulo operation on these applications id with the number 3. Then the remainder is used to identify the server to store our data.

Keybased-Sharding

The downside of this method is elastic load balancing which means if you will try to add or remove the database servers dynamically it will be a difficult and expensive process. For example, in the above one if you will add 5 more servers then you need to add more corresponding hash values for the additional entries. Also, the majority of the existing keys need to be remapped to their new, correct hash value and then migrated to a new server. The hash function needs to be changed from modulo 3 to modulo 8. While the migration of data is in effect both the new and old hash functions won’t be valid. During the migration, your application won’t be able to service a large number of requests and you’ll experience downtime for your application till the migration completes.

Note: A shard shouldn’t contain values that might change over time. It should be always static otherwise it will slow down the performance.

Horizontal or Range Based Sharding
In this method, we split the data based on the ranges of a given value inherent in each entity. Let’s say you have a database of your online customers’ names and email information. You can split this information into two shards. In one shard you can keep the info of customers whose first name starts with A-P and in another shard, keep the information of the rest of the customers.

Range-Based-Sharding

Range-based sharding is the simplest sharding method to implement. Every shard holds a different set of data but they all have the same schema as the original database. In this method, you just need to identify in which range your data falls, and then you can store the entry to the corresponding shard. This method is best suitable for storing non-static data (example: storing the contact info for students in a college.)

The drawback of this method is that the data may not be evenly distributed on shards. In the above example, you might have a lot of customers whose names fall into the category of A-P. In such cases, the first shard will have to take more load than the second one and it can become a system bottleneck.

Vertical Sharding
In this method, we split the entire column from the table and we put those columns into new distinct tables. Data is totally independent of one partition to the other ones. Also, each partition holds both distinct rows and columns. Take the example of Twitter features. We can split different features of an entity in different shards on different machines. On Twitter users might have a profile, number of followers, and some tweets posted by his/her own. We can place the user profiles on one shard, followers in the second shard, and tweets on a third shard.

Vertical-Sharding

In this method, you can separate and handle the critical part (for example user profiles) non-critical part of your data (for example, blog posts) individually and build different replication and consistency models around it. This is one of the main advantages of this method.

The main drawback of this scheme is that to answer some queries you may have to combine the data from different shards which unnecessarily increases the development and operational complexity of the system. Also, if your application will grow later and you add some more features in it then you will have to further shard a feature-specific database across multiple servers.

Directory-Based Sharding
In this method, we create and maintain a lookup service or lookup table for the original database. Basically we use a shard key for lookup table and we do mapping for each entity that exists in the database. This way we keep track of which database shards hold which data.

Directory-Based-Sharding

The lookup table holds a static set of information about where specific data can be found. In the above image, you can see that we have used the delivery zone as a shard key. Firstly the client application queries the lookup service to find out the shard (database partition) on which the data is placed. When the lookup service returns the shard it queries/updates that shard.

Directory-based sharding is much more flexible than range based and key-based sharding. In range-based sharding, you’re bound to specify the ranges of values. In key-based, you are bound to use a fixed hash function which is difficult to change later. In this approach, you’re free to use any algorithm you want to assign to data entries to shards. Also, it’s easy to add shards dynamically in this approach.

The major drawback of this approach is the single point of failure of the lookup table. If it will be corrupted or failed then it will impact writing new data or accessing existing data from the table.

https://www.geeksforgeeks.org/database-sharding-a-system-design-concept/

https://en.pingcap.com/case-studies/how-we-use-a-mysql-alternative-to-avoid-sharding-and-provide-strong-consistency

https://docs.oracle.com/en/database/other-databases/nosql-database/19.5/admin/shard-capacity.html#GUID-CA937E53-DA80-4499-B9B3-FBFD6F2A622F

https://queue.acm.org/detail.cfm?id=3220266

http://www.cs.cornell.edu/~saeed/skiptree.pdf
https://readingxtra.github.io/docs/RMDA/icmd19-ziegler.pdf
https://iopscience.iop.org/article/10.1088/1742-6596/1202/1/012020/pdf
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.3197&rep=rep1&type=pdf

https://www.digitalocean.com/community/tutorials/understanding-database-sharding

https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Flevelup.gitconnected.com%2Fmysql-sharding-tutorial-7340d2c26a3e

https://en.pingcap.com/case-studies/goodbye-mysql-sharding-use-scale-out-mysql-alternative-to-store-160-tb-of-data

https://dzone.com/articles/challenges-of-sharding-mysql

https://blog.twitter.com/engineering/en_us/a/2010/introducing-gizzard-a-framework-for-creating-distributed-datastores

https://www.geeksforgeeks.org/what-is-sharding/

https://medium.com/system-design-blog/database-sharding-69f3f4bd96db

https://freshdesk.com/product-updates/how-freshdesk-scaled-using-sharding-blog/

https://www.acodersjourney.com/database-sharding/

https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Flevelup.gitconnected.com%2Fwhat-is-database-sharding-and-how-is-it-done-f36b9cb653e8

https://hevodata.com/learn/database-sharding-a-comprehensive-guide/

https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6

http://www.startuplessonslearned.com/2009/01/sharding-for-startups.html

https://dev.to/renaissanceengineer/database-sharding-explained-2021-database-scaling-tutorial-5cej

https://docs.mongodb.com/manual/core/sharding-data-partitioning/

https://slack.engineering/scaling-datastores-at-slack-with-vitess/

https://zhuanlan.zhihu.com/p/57185574

https://freecontent.manning.com/wp-content/uploads/designing-and-architecting-for-internet-scale-sharding.pdf

https://docs.yugabyte.com/latest/explore/linear-scalability/sharding-data/

https://golangexample.com/go-package-for-sharding-databases/

drop box 分表加两阶段提交
https://dropbox.tech/infrastructure/cross-shard-transactions-at-10-million-requests-per-second

https://dropbox.tech/infrastructure/reintroducing-edgestore

https://systemdesignprimer.com/dropbox-system-design/

https://weifoo.gitbooks.io/systemdesign/content/system-design-examples/dropbox.html

https://sudonull.com/post/13754-Database-development-in-Dropbox-The-path-from-one-global-MySQL-database-to-thousands-of-servers

https://github.com/donnemartin/system-design-primer

https://proprogramming.org/design-dropbox-a-system-design-interview-question/

Category mysql