{"id":240,"date":"2016-10-11T19:35:19","date_gmt":"2016-10-11T19:35:19","guid":{"rendered":"http:\/\/vargas-solar.com\/big-data-analytics\/?page_id=240"},"modified":"2016-10-31T16:00:32","modified_gmt":"2016-10-31T16:00:32","slug":"sharding","status":"publish","type":"page","link":"http:\/\/vargas-solar.com\/big-data-analytics\/hands-on\/sharding\/","title":{"rendered":"Sharding data collections with MongoDB"},"content":{"rendered":"<p>&nbsp;<\/p>\n<h2>Objective<\/h2>\n<p style=\"text-align: left;\"><strong><a class=\"reference internal\" href=\"https:\/\/en.wikipedia.org\/wiki\/Shard_(database_architecture)\"><span class=\"xref std std-term\">Sharding<\/span><\/a><\/strong>\u00a0is a method for\u00a0<em>partitioning<\/em>\u00a0and\u00a0<em>distributing<\/em>\u00a0a large data collection across\u00a0a\u00a0<strong>cluster<\/strong>\u00a0of\u00a0<em>database servers<\/em>\u00a0called\u00a0<strong>shards<\/strong>\u00a0(cf. image below). During this exercise you will learn more about sharding by:<\/p>\n<ul>\n<li style=\"text-align: justify;\">Configuring a\u00a0<a href=\"http:\/\/mongodb.com\">MongoDB<\/a>\u00a0cluster for sharding\u00a0a data collection<\/li>\n<li style=\"text-align: justify;\">Studying the sharding techniques implemented by\u00a0MongoDB<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone aligncenter\" src=\"https:\/\/docs.mongodb.com\/v3.0\/_images\/sharded-collection.png\" alt=\"\" width=\"303\" height=\"290\" \/><\/p>\n<h2>Requirements<\/h2>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Comparison_of_SSH_clients\">SSH client<\/a> (e.g. <a href=\"https:\/\/the.earth.li\/~sgtatham\/putty\/latest\/x86\/putty.exe\">Putty<\/a> for Windows)<\/li>\n<li>Virtual Machine containing\u00a0<a href=\"http:\/\/mongodb.com\">MongoDB<\/a>\u00a0(provided during the course)<\/li>\n<\/ul>\n<h2>Configuring\u00a0a\u00a0sharded cluster<\/h2>\n<p>MongoDB supports sharding through the configuration of a\u00a0<a href=\"https:\/\/docs.mongodb.com\/v3.0\/core\/sharding-introduction\/\">sharded cluster<\/a>. A sharded cluster is composed of the following components\u00a0(cf. figure below):<\/p>\n<ul>\n<li><span class=\"xref std std-term\"><a class=\"reference internal\" href=\"https:\/\/docs.mongodb.com\/v3.0\/reference\/glossary\/#term-shard\">Shards<\/a><i>\u00a0\u2014store the data<\/i><\/span>.<\/li>\n<li><a class=\"reference internal\" href=\"https:\/\/docs.mongodb.com\/v3.0\/reference\/glossary\/#term-mongos\"><span class=\"xref std std-term\">Query routers<\/span><\/a>\u00a0\u2014<em>direct operations to the appropriate shard (or shards).<\/em><\/li>\n<li><span class=\"xref std std-term\"><a class=\"reference internal\" href=\"https:\/\/docs.mongodb.com\/v3.0\/reference\/glossary\/#term-config-server\">Config servers<\/a>\u00a0\u2014<em>store the cluster\u2019s metadata.\u00a0The query router uses this metadata to target operations to specific shards.\u00a0<\/em><\/span><\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone aligncenter\" src=\"https:\/\/docs.mongodb.com\/v3.0\/_images\/sharded-cluster-production-architecture.png\" alt=\"\" width=\"322\" height=\"221\" \/><\/p>\n<p>For the sake of simplicity we have pre-configured your\u00a0virtual machine with a cluster composed of<em>\u00a01\u00a0query router<\/em> and <em>1\u00a0config server<\/em> (cf. image below).<\/p>\n<p><a href=\"http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/cluster.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-255 aligncenter\" src=\"http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/cluster-300x264.png\" alt=\"cluster\" width=\"317\" height=\"279\" srcset=\"http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/cluster-300x264.png 300w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/cluster-768x677.png 768w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/cluster-1024x902.png 1024w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/cluster-624x550.png 624w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/cluster.png 1221w\" sizes=\"auto, (max-width: 317px) 100vw, 317px\" \/><\/a><\/p>\n<h3>Adding a shard to the cluster<\/h3>\n<ul>\n<li>Connect to the\u00a0virtual machine using your SSH\u00a0client.<\/li>\n<\/ul>\n<p><a href=\"http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/Capture-d\u2019\u00e9cran-2016-10-11-\u00e0-23.06.40.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-246 aligncenter\" src=\"http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/Capture-d\u2019\u00e9cran-2016-10-11-\u00e0-23.06.40-300x266.png\" alt=\"capture-decran-2016-10-11-a-23-06-40\" width=\"300\" height=\"266\" srcset=\"http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/Capture-d\u2019\u00e9cran-2016-10-11-\u00e0-23.06.40-300x266.png 300w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/Capture-d\u2019\u00e9cran-2016-10-11-\u00e0-23.06.40-768x682.png 768w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/Capture-d\u2019\u00e9cran-2016-10-11-\u00e0-23.06.40-1024x909.png 1024w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/Capture-d\u2019\u00e9cran-2016-10-11-\u00e0-23.06.40-624x554.png 624w, http:\/\/vargas-solar.com\/big-data-analytics\/wp-content\/uploads\/sites\/35\/2016\/10\/Capture-d\u2019\u00e9cran-2016-10-11-\u00e0-23.06.40.png 1192w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<ul>\n<li>Once connected to the virtual machine, connect to the cluster\u00a0<em>query router<\/em>:<\/li>\n<\/ul>\n<pre>mongo --host localhost:27019<\/pre>\n<ul>\n<li>Add a shard to the cluster by executing the following instructions:<\/li>\n<\/ul>\n<div class=\"page\" title=\"Page 2\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<pre>use admin\r\ndb.runCommand( { addShard: \"localhost:27021\", name: \"shard1\" } )<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<ul>\n<li>Verify the state of the cluster:<\/li>\n<\/ul>\n<pre>sh.status()<\/pre>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Question 2.<\/strong>\u00a0Which is the important information reported by <strong>sh.status()<\/strong> ?<\/span><\/li>\n<\/ul>\n<h2>Sharding a database collection<\/h2>\n<p>In MongoDB\u00a0<em>sharding<\/em> is enabled on a <em>per-basis collection.\u00a0<\/em>When sharding is enabled on a collection, MongoDB partitions the collection and distributes its\u00a0documents\u00a0across the shards of a cluster using a <strong><a href=\"https:\/\/docs.mongodb.com\/manual\/core\/sharding-shard-key\/\">shard key<\/a>\u00a0<\/strong>and a <strong>partitioning strategy<\/strong>.<\/p>\n<ul>\n<li><strong>Range based partitioning<\/strong>: collection\u00a0is partitioned into ranges [min, max] determined by the shard key. Each range is called\u00a0a <a href=\"https:\/\/docs.mongodb.com\/manual\/reference\/glossary\/#term-chunk\">chunk<\/a>.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/docs.mongodb.com\/v3.0\/_images\/sharding-range-based.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone \" src=\"https:\/\/docs.mongodb.com\/v3.0\/_images\/sharding-range-based.png\" alt=\"\" width=\"500\" height=\"164\" \/><\/a><\/p>\n<ul>\n<li><strong>Hash based partitioning<\/strong>: data is partitioned into chunks using a hash function.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/docs.mongodb.com\/v3.0\/_images\/sharding-hash-based.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone \" src=\"https:\/\/docs.mongodb.com\/v3.0\/_images\/sharding-hash-based.png\" alt=\"\" width=\"499\" height=\"178\" \/><\/a><\/p>\n<p>We have\u00a0pre-configured <strong>Shard 1<\/strong> with a\u00a0collection called\u00a0<strong>mydb.cities<\/strong>.\u00a0In what follows you will shard copies of <strong>mydb.cities<\/strong> using <em>range based<\/em> and<em> hash based partitioning<\/em>.<\/p>\n<h3><strong>Range based partitioning<\/strong><\/h3>\n<ul>\n<li>Create a new collection called <strong>mydb.cities1<\/strong> :<\/li>\n<\/ul>\n<pre>use mydb\r\ndb.createCollection(\"cities1\")\r\nshow collections<\/pre>\n<ul>\n<li>Enable sharding on <strong>mydb.cities1<\/strong> using the attribute <strong>state\u00a0<\/strong>as shard key :<\/li>\n<\/ul>\n<pre>sh.enableSharding(\"mydb\") \r\nsh.shardCollection(\"mydb.cities1\", { \"state\": 1} )<\/pre>\n<ul>\n<li>Verify the <strong>number of chunks<\/strong>:<\/li>\n<\/ul>\n<pre>sh.status()<\/pre>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Question 3.<\/strong> How many chunks did you create? Which are their associated ranges? Include a screen copy of the results of the command in your answer to support your answer ?<\/span><\/li>\n<\/ul>\n<ul>\n<li>Populate <strong>mydb.cities1<\/strong> using the content of\u00a0<strong>mydb.cities<\/strong>:<\/li>\n<\/ul>\n<pre>db.cities.find().forEach( function(d) { \r\n    db.cities1.insert(d); \r\n})<\/pre>\n<ul>\n<li>Verify the number of chunks after population:<\/li>\n<\/ul>\n<pre>sh.status()<\/pre>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Question 4.\u00a0<\/strong>How many chunks are there now? Which are their associated ranges? Which changes can you identify in particular? Include a screen copy of the results of the command in your answer to support your answer.<\/span><\/li>\n<\/ul>\n<h3>Hash-based partitioning<\/h3>\n<ul>\n<li>Create a new collection called <strong>mydb.cities2<\/strong>:<\/li>\n<\/ul>\n<pre>db.createCollection(\"cities2\")\r\nshow collections<\/pre>\n<ul>\n<li>Enable <strong>hashed<\/strong> sharding on\u00a0<strong>mydb.cities2\u00a0<\/strong>using\u00a0the attribute\u00a0<strong>state\u00a0<\/strong>as <em>shard key<\/em>.<\/li>\n<\/ul>\n<pre>sh.shardCollection(\"mydb.cities2\", { \"state\": \"hashed\"} )<\/pre>\n<ul>\n<li>Verify the <strong>number of chunks<\/strong> before populating the new collection.<\/li>\n<\/ul>\n<pre>sh.status()<\/pre>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Question 5.\u00a0<\/strong>How many chunks did you create? What differences do you see with respect to the same task in the range sharding strategy? Include a screen copy of the results of the command in your answer to support your answer.<\/span><\/li>\n<\/ul>\n<ul>\n<li>Populate\u00a0<strong>mydb.cities2<\/strong>\u00a0:<\/li>\n<\/ul>\n<pre>db.cities.find().forEach( function(d) { \r\n    db.cities2.insert(d); } \r\n)<\/pre>\n<ul>\n<li>Verify the number of chunks after population:<\/li>\n<\/ul>\n<pre>sh.status()<\/pre>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Question 6.\u00a0<\/strong>How many chunks are there now? Include a screen copy of the results of the command in your answer to support your answer. Compare the result with respect to the range sharding. Include a screen copy of the results of the command in your answer to support your answer.<\/span><\/li>\n<\/ul>\n<h2>Balancing data across a sharded cluster<\/h2>\n<p>When a shard has too many chunks\u00a0compared to other shards, MongoDB automatically redistributes the chunks\u00a0across the shards. This process is called <strong><em>balancing<\/em><\/strong>.\u00a0In what follows you will analyze the behavior of the MongoDB balancing process by adding more shards to your cluster.<\/p>\n<ul>\n<li>Add another shard\u00a0to the cluster:<\/li>\n<\/ul>\n<pre>use admin\r\ndb.runCommand( { addShard: \"localhost:27022\", name: \"shard2\" } )<\/pre>\n<ul>\n<li>Wait a few seconds and check the status of the cluster:<\/li>\n<\/ul>\n<pre>sh.status()<\/pre>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Question 7.\u00a0<\/strong>Draw the new configuration of the cluster and label each element (router, config server and shards) with its corresponding port as you defined in the previous tasks.<\/span><\/li>\n<\/ul>\n<h3>Tag aware balancing<\/h3>\n<p>MongoDB balancer also supports <a href=\"https:\/\/docs.mongodb.com\/manual\/core\/tag-aware-sharding\/\"><strong>tagging<\/strong><\/a> a range of shard key values. Using tags you can:<\/p>\n<ul>\n<li>Isolate specific subset of data on a specific set of shards.<\/li>\n<li>Ensure that relevant data reside on shards that are geographically close to the\u00a0user.<\/li>\n<\/ul>\n<p>For the final part of this exercise you will observe\u00a0the behavior of the <em>MongoDB\u00a0balancing process<\/em>\u00a0when adding <em>tagged shards<\/em> to your cluster.<\/p>\n<ul>\n<li>Add a\u00a0new shard\u00a0to your cluster:<\/li>\n<\/ul>\n<pre>use admin\r\ndb.runCommand( { addShard: \"localhost:27023\", name: \"shard3\" } ) \r\nsh.status()<\/pre>\n<ul>\n<li>Associate tags to shards:<\/li>\n<\/ul>\n<pre>sh.addShardTag(\"shard1\", \"CA\") \r\nsh.addShardTag(\"shard2\", \"NY\") \r\nsh.addShardTag(\"shard3\", \"Others\")<\/pre>\n<ul>\n<li>Create, shard and populate a new collection named <strong>mydb.cities3<\/strong>:<\/li>\n<\/ul>\n<pre>db.createCollection(\"cities3\") \r\nsh.shardCollection(\"mydb.cities3\", { \"state\": 1} ) \r\n\r\nuse mydb \r\ndb.cities.find().forEach( function(d) { \r\n    db.cities3.insert(d); \r\n})<\/pre>\n<ul>\n<li>Associate <strong>shard key ranges<\/strong> to <em>tagged shards<\/em>:<\/li>\n<\/ul>\n<pre>sh.addTagRange(\"mydb.cities3\", { state: MinKey }, { state: \"CA\" }, \"Others\") \r\nsh.addTagRange(\"mydb.cities3\", { state: \"CA\" }, { state: \"CA_\" }, \"CA\") \r\nsh.addTagRange(\"mydb.cities3\", { state: \"CA_\" }, { state: \"NY\" }, \"Others\") \r\nsh.addTagRange(\"mydb.cities3\", { state: \"NY\" }, { state: \"NY_\" }, \"NY\") \r\nsh.addTagRange(\"mydb.cities3\", { state: \"NY_\" }, { state: MaxKey }, \"Others\")<\/pre>\n<ul>\n<li>Display <em>cluster<\/em> information:<\/li>\n<\/ul>\n<pre>sh.status()<\/pre>\n<ul>\n<li><span style=\"text-decoration: underline;\"><strong>Question 8<\/strong>. Analyze the results and explain the logic behind this tagging strategy. Connect to the shard that contains the data about California, and count the documents. Do the same operation with the other shards. Is the sharded data collection complete with respect to initial one? Are shards orthogonal?\u00a0<\/span><\/li>\n<\/ul>\n<h2><\/h2>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; Objective Sharding&nbsp;is a method for&nbsp;partitioning&nbsp;and&nbsp;distributing&nbsp;a large data collection across&nbsp;a&nbsp;cluster&nbsp;of&nbsp;database servers&nbsp;called&nbsp;shards&nbsp;(cf. image below). During this exercise you will learn more about sharding by: Configuring a&nbsp;MongoDB&nbsp;cluster for sharding&nbsp;a data collection Studying the sharding techniques implemented by&nbsp;MongoDB Requirements SSH client (e.g. Putty for Windows) Virtual Machine containing&nbsp;MongoDB&nbsp;(provided during the course) Configuring&nbsp;a&nbsp;sharded cluster MongoDB supports sharding through the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":38,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-240","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/pages\/240","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/comments?post=240"}],"version-history":[{"count":16,"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/pages\/240\/revisions"}],"predecessor-version":[{"id":259,"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/pages\/240\/revisions\/259"}],"up":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/pages\/38"}],"wp:attachment":[{"href":"http:\/\/vargas-solar.com\/big-data-analytics\/wp-json\/wp\/v2\/media?parent=240"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}