Overview
  Scope
  Sources
System Definition
  Use Cases
  Constraints and assumptions
Define the Schema
  Identify System Operations
  Identify Entities and Fields
MongoDb Best Practices and Considerations
  Entity Relationships
  Size of Data
  Indexing
    Adding indexes
      Filter Criteria
      Sorting
    Considerations
    Query Optimization
  Sharding
    Automatic Sharding
    Sharding Key
    Considerations
      Using the _id (or date based data) as the shard key
      Read / Write Ratio
      Related Data
      Unique Keys
      Result Order
Bringing it all together
  Entities
    Product
    Category
    User
    Shopping Cart
  Actions
    Search for product based on SKU
    Search for products by product name
    Search for products by category identifier
    Increment / decrement stock item
    Add / Edit products
    Create Shopping Cart
      Problem
      Define the correct shard key
      Split read and write data
    Add / Remove products to / from shopping cart
    Pay for cart by credit card
    Search for all categories
    Search for products less than reorder threshold
    Search for sub-categories by category identifier
    Search total product value
    Search cart total per date
    Discard Shopping Cart
Infrastructure
  Deployment
    Mongo Processes
    Replica Sets[12]
  Operating System
  RAM
  Network
Next Steps
References
Overview
MongoDb has garnered much attention over the last couple of years. It is said to be fast and reliable, and to automate some of the processes that are usually very time consuming and error prone. Adoption seems to be growing steadily, as it is being used in more and more high transaction volume systems such as Foursquare, Bit.ly and Sourceforge. MongoDb seemed like the 'way to go', but then some reports of down time surfaced, as was the case with Foursquare (MongoDB Auto-sharding and Foursquare Downtime[21]), and I realised that it is not a 'quick fix' solution that can be applied to all scenarios. Financial systems seemed to be the least suitable type of application for a MongoDb back-end. I am still not 100% convinced that MongoDb can be used with all types of financial systems, especially not banking systems, but I believe that it may be suitable for most e-commerce systems.
I found the following factors to be the most obvious issues when starting a MongoDb implementation:
Schema Design: The schema designs used for MongoDb and MySql implementations are vastly different, but because developers are generally used to designing for relational databases they are prone to making bad design decisions.
Sharding: MongoDb has many built-in features that reduce the operational procedures that must be in place, but not understanding how these features work could cause serious system problems.
Experience: MongoDb is a relatively new technology compared to relational counterparts like MySql, which means that there are correspondingly few experienced MongoDb developers and administrators in the field.
This document tries to address these issues by providing an end-to-end overview of an imaginary e-commerce system built on MongoDb, instead of the numerous disjointed examples found on the internet.
Scope
The document covers the creation of the data schema for the e-commerce system, and provides an overview of the infrastructure and some of the operational procedures that must be in place to get started with a MongoDb implementation. It does not however discuss the actual e-commerce website implementation.
We will assume that the system has a limited amount of functionality, as defined in subsequent sections. This provides a set of parameters for the use case and avoids an overly complex design that could be confusing and therefore hide some of the lessons that can be taken away from it.
Sources
This document is based on theoretical knowledge of the topic, but all statements, conclusions and examples in it are based on information found on the MongoDb site, other use cases and various blogs that are freely available on the internet. All sources are noted at the end of the document. It is recommended that these additional resources also be read in order to get the maximum benefit from this document.
System Definition
Based on what we have been taught about relational database design, there is only one correct design for a given problem. The approach would normally be to analyse the data, identify all the prominent entities that are represented by the data, create a table for each and then create the appropriate relationships between the tables. Once all of the data normalization (and sometimes de-normalization) rules have been applied, the design is done. With MongoDb databases this process differs, as the data schema cannot be designed without first evaluating what the system will do with the data.
Use Cases
The system will be limited to the following use cases:
A user can:
1. register on the site
2. log in on the site with username (email) and password
3. view products from a specific category
4. search the product list based on the name of the product
5. view a specific product
6. add n number of different products to a shopping cart
7. remove products from a shopping cart
8. discard a shopping cart
9. pay for a shopping cart by credit card
The system must:
10. track product stock levels
An accountant can view the following reports:
11. Total daily, monthly and yearly income earned from online sales, ordered by date
12. Total value of stock on hand
An inventory clerk can:
13. add / edit products
14. set the inventory stock level order threshold per product (the level at which an order must be placed, otherwise the shop will run out of stock)
Entity Relationships
Each of the entities would most probably be modelled as an individual table in a relational database, but this is not necessarily the case with a MongoDb database. One of the biggest factors in deciding how the data is modelled is how the entities are accessed in relation to one another. For example, if an invoice and its line items are always accessed together then it would be better for performance to model them as one entity. Alternatively, if line items are regularly accessed individually, then it would probably be better to model them as separate entities. Based on the current use case we will model the Shopping Cart and Cart Line Items as one document.
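As a rough illustration of why embedding helps when entities are always read together, the sketch below computes a cart total from a single document. The cart shape, the second SKU and the cartTotal helper are assumptions made for this example only, not part of the design that follows.

```javascript
// Hypothetical cart document with its line items embedded as an array.
const cart = {
  _id: "cart-1", // illustrative identifier, not a real ObjectId
  line_items: [
    { sku: "10001-23424-9098", name: "Ipad", selling_price: 320, qty: 2 },
    { sku: "10001-23424-9099", name: "Ipod Nano", selling_price: 120, qty: 1 } // made-up SKU
  ]
};

// Everything needed for the total is already in the one document,
// so no second lookup against a separate line-items collection is required.
function cartTotal(cartDoc) {
  return cartDoc.line_items.reduce(
    (sum, item) => sum + item.selling_price * item.qty,
    0
  );
}

console.log(cartTotal(cart)); // 760
```

Had the line items been stored as separate documents, the same total would need at least one additional query to fetch them.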
Size of Data
The maximum size of a document in MongoDb is currently limited to 16 MB (older releases allowed 4 MB), and the limit may be raised in future releases. It may sound like a good idea to store very large objects in a document, but consider that the whole document must travel across the network between the database server and the application server when it is accessed. In cases where only part of the document is used each time it is retrieved, it would be less resource intensive to split the document into smaller documents.
Indexing
Adding indexes to your collections could significantly increase the query performance as MongoDb can quickly navigate the index to find the relevant document by key instead of scanning each document in the collection. The following shows a simplified depiction of how the system is able to navigate the index to find the relevant information (in this case the user with the surname of Straub) without having to scan each and every document in the collection.
[Figure: simplified index tree over user surnames (King, Harris, Rice, Bachman, Graham, Koontz, Straub)]
MongoDb automatically creates an index on the _id field, but additional indexes can be added as required.
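The benefit of navigating an index instead of scanning can be sketched with a sorted list of the surnames from the illustration. This is only an analogy: a binary search over a sorted array stands in for the tree structure of a real index, and the function names are invented for the example.

```javascript
// Surnames from the index illustration, kept in sorted order as an index would keep them.
const surnameIndex = ["Bachman", "Graham", "Harris", "King", "Koontz", "Rice", "Straub"];

// Full scan: inspect entries one by one until the key is found.
function scanSteps(keys, target) {
  let steps = 0;
  for (const key of keys) {
    steps++;
    if (key === target) return steps;
  }
  return steps;
}

// Index-style lookup: halve the remaining search range on every comparison.
function indexSteps(sortedKeys, target) {
  let lo = 0;
  let hi = sortedKeys.length - 1;
  let steps = 0;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    steps++;
    if (sortedKeys[mid] === target) return steps;
    if (sortedKeys[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return steps;
}

console.log(scanSteps(surnameIndex, "Straub"));  // 7
console.log(indexSteps(surnameIndex, "Straub")); // 3
```

With only seven entries the difference is small, but the scan grows linearly with the collection while the ordered lookup grows logarithmically.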
Adding indexes
Filter Criteria
Which fields to index depends on the queries that the system runs. In our use case the system will 'Search for products based on SKU', so we can define an index on the SKU field of the document.
Sorting
Based on the 'Search cart total per date (ordered)' system action we would also need to add an index on the date as the query is sorted by date. Adding an index on the field that is sorted on enables MongoDb to sort the data without having to open each document.
Considerations
The following must be taken into consideration when applying indexes:
Additional Overhead: Values are added to or removed from an index whenever documents are added to or removed from the collection. This does not pose a problem in systems that do mostly read operations, but in write heavy systems it may incur significant overhead as the index must be continuously updated.
Initial Index Blocking: No queries can be run against the database while an index is first being built, except when the {background:true} option[9] is used.
Case Sensitive: MongoDb indexes are case sensitive.
Indexes per Collection: There is a limit of 40 indexes per collection. In most cases this number is more than sufficient.
Index Key Size: The maximum key length that can currently be indexed is 800 bytes.
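The case sensitivity point is easy to overlook, so here is a small sketch of its effect. A Map stands in for an index with exact-match keys, and indexing a lower-cased copy of the name is one possible workaround that is assumed here for illustration; it is not something the design in this document prescribes.

```javascript
// A Map stands in for an index on the product name (exact, case-sensitive keys).
const products = [
  { sku: "10001-23424-9098", name: "Ipad" }
];

const nameIndex = new Map(products.map(p => [p.name, p]));

// Hypothetical workaround: also index a lower-cased copy of the name
// and lower-case the search term before looking it up.
const nameLowerIndex = new Map(products.map(p => [p.name.toLowerCase(), p]));

console.log(nameIndex.has("ipad"));                    // false
console.log(nameLowerIndex.has("IPAD".toLowerCase())); // true
```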
Query Optimization
As with applying indexes on a relational database, you sometimes get unexpected results so it is good practice to verify that the query uses the intended index and that using the index actually results in better performance. This can be done by examining the query execution plan by issuing the explain()[10] command.
Sharding
Automatic Sharding
MongoDb supports automatic sharding[1] where data is automatically spread out across multiple servers in order to distribute the transaction load. The system accomplishes this by storing data in multiple files (called chunks[2]) across multiple servers. Each chunk can be up to a maximum of 200 MB in size by default but can be overridden to be larger. Once a chunk reaches approximately 50%-75% (100 MB to 150 MB) of the maximum size, MongoDb will create a snapshot of the chunk and copy the snapshot data to the new chunk. Writes can still be done to the original chunk while this copy operation is in process. Once the copy process is completed, the changes made to the original chunk will be applied to the new chunk before it is made available.
Sharding Key
MongoDb uses a key called a shard key to decide which chunk data will be allocated to. The shard key will by default be based on the _id field, which holds a BSON ObjectId (see BSON ObjectId Specification[3]), but this can be overridden by user code to consist of any user defined value. A shard key for a user document could, for example, be based on the user's last name. With that in mind, imagine that we have three chunks with user data. The first chunk may contain all the users that have a surname starting with B to H, the second Ki to Ko and the third chunk R to S.
If a user with a last name of Barker is added, it will be written to the first chunk, whereas a user with a last name of Smith will be written to the last chunk.
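The routing decision in the surname example can be sketched as a range lookup. The chunk names and exact boundaries below are assumptions chosen to fit the example (each chunk owns a half-open key range), and chunkFor is a made-up helper rather than anything MongoDb exposes.

```javascript
// Each chunk owns the key range [min, max); null stands for "unbounded".
// The boundaries are assumptions chosen to match the surname example.
const chunks = [
  { name: "chunk-1", min: null, max: "Ki" },
  { name: "chunk-2", min: "Ki", max: "R" },
  { name: "chunk-3", min: "R", max: null }
];

// Find the single chunk whose range contains the shard key value.
function chunkFor(lastName) {
  return chunks.find(c =>
    (c.min === null || lastName >= c.min) &&
    (c.max === null || lastName < c.max)
  ).name;
}

console.log(chunkFor("Barker")); // chunk-1
console.log(chunkFor("Koontz")); // chunk-2
console.log(chunkFor("Smith"));  // chunk-3
```

Because the ranges cover the whole key space without overlapping, every key value maps to exactly one chunk.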
Considerations
Deciding on the correct shard key may be one of the most significant design decisions that are made during the design process as it could have a major impact, positive or negative, on system performance. The following are some considerations to note.
Using the _id (or date based data) as the shard key
MongoDb automatically adds an _id attribute to each document (if not overridden by application code) and populates it with a unique value (see BSON ObjectId Specification[3]). The ObjectId consists of several values that are concatenated to form a (relatively) unique value. The first part of this value is calculated from the current date and time. This could be an advantage, as data is automatically stored in date order, which would increase the performance of queries that select data by date range or need to order results by date. This fact can also be exploited in other ways. For example, most drivers support extracting the creation date and time from the _id, which means that storing a 'created at' value in the document is not required. On the other hand, based on the MongoDb website, it could also have some implications for scalability. At the beginning of each month documents will be written to the same server until the data chunks are migrated across to other servers. This issue can be mitigated by adding some uniqueness to the key and pre-splitting chunks[7].
If the system experiences an exceptionally high volume of writes, then the way that the MongoDb balancer handles the splitting of chunks could also become an issue, as described in the 'MongoDB Pre-Splitting for Faster Data Loading and Importing'[8] article.
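To make the 'creation date inside the _id' point concrete, the sketch below decodes the leading four bytes of an ObjectId, which hold the creation time as seconds since the Unix epoch. The objectIdToDate helper is written for this example; drivers ship their own equivalents, and the id used is the sample product _id from the Entities section.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId hold the creation
// time as seconds since the Unix epoch.
function objectIdToDate(hexId) {
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000);
}

const createdAt = objectIdToDate("4e1b091559a4f01109000000");
console.log(createdAt.toISOString()); // 2011-07-11T14:30:45.000Z
```

The same property explains the scalability concern: ids generated close together in time share a prefix and therefore sort next to each other.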
Related Data
Keeping related data close together will improve system performance as all the data can be retrieved from one chunk or shard. In a system with lots of user related content we may prefix the shard key with the user id. We could 'force' the system to store different documents containing user related information like personal data, uploaded media and purchase history close together by prefixing each document _id with the particular user id.
Unique Keys
Shard keys should normally be as unique as possible. MongoDb can only shard data if the key can be split into smaller parts. Depending on the system, there may be some performance issues that start appearing once chunks grow past the default 200 MB maximum size. For example, using State (e.g. Texas and Ohio) as the shard key for user related data may cause problems in the future, as MongoDb will have to write data for ALL users that live in a particular state to the same chunk, and because it cannot split the chunk it would grow to be very large. If the key is changed to include City, it would allow MongoDb to create a chunk for each State+City combination, which allows for a lot more granularity. If it is also considered that each State+City chunk is potentially stored on a different server and that some cities have more users than others, it becomes clear that some servers will experience higher loads than others.
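The granularity argument can be checked by counting how many distinct values a candidate key produces, since documents that share a key value can never be separated by a split. The user records and the distinctKeys helper below are invented for illustration.

```javascript
// Hypothetical user records; only the fields used by the shard key are shown.
const users = [
  { state: "Texas", city: "Austin" },
  { state: "Texas", city: "Dallas" },
  { state: "Texas", city: "Houston" },
  { state: "Ohio", city: "Columbus" },
  { state: "Ohio", city: "Cleveland" }
];

// Count the distinct values a candidate shard key produces.
function distinctKeys(docs, fields) {
  return new Set(docs.map(d => fields.map(f => d[f]).join("|"))).size;
}

console.log(distinctKeys(users, ["state"]));         // 2
console.log(distinctKeys(users, ["state", "city"])); // 5
```

A key on state alone offers at most two split points for this data, while the compound key offers five.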
Result Order
The order in which search results are returned to the client can also affect the selection of an appropriate shard key. Continuing with the State / City example, let us imagine that we defined a shard key of {state:1,city:1} on our data and that the relevant data returned by a query is stored on multiple servers. If the query returns data ordered by city, each server will need to compile the search results and then sort the data. The data is then returned from each server and the results are merged into one by the mongos process (see the Deployment section). The extra sorting step has to be completed because there is no index defined on the city field alone, only on the combination of State+City. If, on the other hand, the query sorts by state or state+city, then each server will compile the data and stream it back in order to the mongos process without having to sort the results first, as it will be able to utilise the defined index.
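When each shard can already return its results in (state, city) order, combining them only requires a merge of pre-sorted lists. The sketch below illustrates that merge step; it is not a description of how mongos is implemented, and the shard contents and function names are made up.

```javascript
// Order two documents by state first, then by city.
function compareKey(a, b) {
  if (a.state !== b.state) return a.state < b.state ? -1 : 1;
  if (a.city !== b.city) return a.city < b.city ? -1 : 1;
  return 0;
}

// Merge per-shard result lists that are each already sorted by (state, city).
function mergeSorted(shardResults) {
  const positions = shardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < shardResults.length; i++) {
      if (positions[i] >= shardResults[i].length) continue; // shard exhausted
      if (best === -1 ||
          compareKey(shardResults[i][positions[i]], shardResults[best][positions[best]]) < 0) {
        best = i;
      }
    }
    if (best === -1) return merged; // every shard exhausted
    merged.push(shardResults[best][positions[best]++]);
  }
}

const shardA = [{ state: "Ohio", city: "Columbus" }, { state: "Texas", city: "Dallas" }];
const shardB = [{ state: "Ohio", city: "Cleveland" }, { state: "Texas", city: "Austin" }];
console.log(mergeSorted([shardA, shardB]).map(d => d.city));
// [ 'Cleveland', 'Columbus', 'Austin', 'Dallas' ]
```

Sorting by city alone would not work this way, because neither shard holds its documents in city order.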
Entities
Based on the 'Identify Entities and Fields' section we can assume that the documents would resemble the following samples. The structure and content of these documents may change further as the different actions are considered in the following section.
Product
Each product document will have the following structure and will be allocated to the products collection. Categories will also be stored in the product document but will be discussed in detail in a subsequent section.
Collection: products
{
  "_id": ObjectId("4e1b091559a4f01109000000"),
  "name": "Ipad",
  "sku": "10001-23424-9098",
  "cost_price": 300,
  "selling_price": 320,
  "items_in_stock": 9,
  "reorder_threshold": 10
}
{
  "_id": ObjectId("4e1b08e159a4f01608000000"),
  "name": "Ipod Nano",
  "sku": "10001-23424-9098",
  "cost_price": 100,
  "selling_price": 120,
  "items_in_stock": 10,
Category
Category documents will be allocated to the categories collection and will not be sharded, as all the category documents together make up a relatively small amount of data. We will also override the default generated _id as it is very long. The reason for this will be explained later on. Categories will fortunately not be updated often, which means that the performance hit of using a custom incremental _id for categories is acceptable.
Collection: categories
{ "_id": "1", "name": "Electronics", "subcats": [2, 3] }
{ "_id": "2", "name": "Cellular", "parents": [1], "subcats": [3] }
{ "_id": "3", "name": "Nokia", "parents": [1, 2] }
User
User documents will be allocated to their own collection called users.
Collection: users
{
  "_id": ObjectId("4e1bfba789a4f02207000000"),
  "firstname": "John",
  "lastname": "Doe",
  "email": "john@gmail.com",
  "password": "[encrypted_text]",
  "password_salt": "[salt_text]",
  "shipping_address": {
Shopping Cart
The shopping cart, the products in the cart and the payment made for the cart will always be queried together, which means that the data can be stored as one document. Each of the line items will become an array item in the document. Some of the product data is duplicated into the cart object, which prevents additional database lookups when completing actions like previewing the cart, generating an invoice or even reprinting an invoice a year after it was paid. The payment details and some of the user details will also be stored in the document.
Collection: cart
{
  "_id": ObjectId("4e1bfba559a4f02207000000"),
  "line_items": [{
    "_id": "1_4e1b091559a4f01109000000",
    "cost_price": 300,
    "name": "Ipad",
    "selling_price": 320,
    "sku": "10001-23424-9098",
    "qty": 2
  }, {
    "_id": ObjectId("4e1b08e159a4f01608000000"),
    "cost_price": 100,
    "name": "Ipod Nano",
    "selling_price": 120,
    "sku": "10001-23424-9098",
    "qty": 1
  }],
  "payment": {
    "card_number": "[encrypted_text]",
    "expiry": "11\/12",
    "card_holder": "Mr J Doe"
  },
Actions
The actions are not ordered as defined in the 'Identify System Operations' section, as some of the discussions build on previous ones. Note: All of the following examples refer to the document examples defined in the 'Entities' section unless otherwise specified.
We could opt to model the product and category entities as separate documents which means that these documents should somehow reference each other. In our design we will add the category _id to the product document like this:
{ "_id": ObjectId("4e1b091559a4f01109000000"), "name": "Ipad", .... "category" : 10 }
We could then add an index on the category field in order to quickly find all products in a particular category. We could alternatively embed the whole category document in the category field if required. This approach would take more disk space because of the duplicated data, but if the product needs to be displayed on the front end together with its category information it could prevent an extra query to the database. This may only be an option if the category information is relatively static. In cases where a product can belong to multiple categories we could use an array of category ids.
{ "_id": ObjectId("223b091559a4f01109000000"), "name": "Nokia", .... "categories": ["1", "2"] }
Querying for a specific value in an array field is supported by MongoDb with the Multikey feature[13].
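The matching rule for array fields can be sketched in plain JavaScript: a document matches when any element of the array equals the requested value, which is how a shell query such as db.products.find({ categories: "2" }) behaves. The sample documents and the findByCategory helper are assumptions made for the example.

```javascript
// Hypothetical product documents, each with an array of category ids.
const products = [
  { name: "Nokia", categories: ["1", "2"] },
  { name: "Ipad", categories: ["1"] },
  { name: "Desk", categories: ["7"] }
];

// A document matches when any element of its categories array equals the value.
function findByCategory(docs, categoryId) {
  return docs.filter(d => d.categories.includes(categoryId));
}

console.log(findByCategory(products, "1").map(p => p.name)); // [ 'Nokia', 'Ipad' ]
console.log(findByCategory(products, "2").map(p => p.name)); // [ 'Nokia' ]
```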
But we can use a modifier[15], which is much more efficient and can be used for atomic updates[16] on the document. We will most probably query for a product by _id, which automatically has an index defined on it. Using the $inc operator, items_in_stock can be incremented or decremented without retrieving the whole document.
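A minimal sketch of what the $inc modifier does to a document is shown below, using an in-memory object in place of the database. The applyInc helper is invented for this example; against a real server the equivalent shell call would be along the lines of db.products.update({ _id: productId }, { $inc: { items_in_stock: -1 } }), where productId is a placeholder.

```javascript
// In-memory stand-in for applying a { $inc: { field: amount } } modifier.
// A field that does not exist yet is treated as 0 before the increment.
function applyInc(doc, incSpec) {
  for (const [field, amount] of Object.entries(incSpec)) {
    doc[field] = (doc[field] || 0) + amount;
  }
  return doc;
}

const product = { name: "Ipad", items_in_stock: 9 };
applyInc(product, { items_in_stock: -1 }); // one item sold
applyInc(product, { items_in_stock: 5 });  // stock delivery
console.log(product.items_in_stock); // 13
```

The point of the real modifier is that the read, the arithmetic and the write happen on the server as one atomic step, so two concurrent sales cannot overwrite each other's result.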
Another side effect of prepending the category, for systems where a product can only belong to one category, is that we potentially do not have to store the category as a separate field.
We are, however, able to make use of a map-reduce[17] function. In this use case the query will access all the product documents in the collection, it does not have any filter criteria and it does not require sorting, which makes it a good candidate for map-reduce. The following example is adapted from the 'Finding Max And Min Values for a given Key' article[18]. Based on the example data (Entities section) the result is expected to look like this:
{ _id : "1_497ce4051ca9ca6d3efca323", value : { product : { name : "Ipod Nano", items_below_level : 5 } } }
{ _id : "1_678ce4051ca9ca6d3efca323", value : { product : { name : "Ipad", items_below_level : 1 } } }
Explaining map / reduce is out of scope for this document, but suffice it to say that the map function is applied to each document. Our map function checks whether the items in stock for a particular product are below the set threshold and, if they are, emits the value. The reduce function would normally be used to aggregate values (e.g. sums, counts and averages), but that is not needed in our case, so the function just returns the result.
> map = function () { if (this.items_in_stock < this.reorder_threshold) {
> map = function () {
    emit("sub_total", this.items_in_stock * this.cost_price);
  }
> reduce = function (key, values) {
    var grand_total = 0;
    for (var i = 0; i < values.length; i++) {
      grand_total += values[i];
    }
    return grand_total;
  }
> db.products.mapReduce(map, reduce, {out:{inline : true}});
{
  "result" : "tmp.mr.mapreduce_1310392963_13",
  "timeMillis" : 3,
  "counts" : { "input" : 2, "emit" : 2, "output" : 1 },
  "ok" : 1,
}
> db.tmp.mr.mapreduce_1310392963_13.find()
{ "_id" : "sub_total", "value" : 4200 }
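For readers new to the pattern, the flow of the shell session above can be emulated in a few lines of plain JavaScript: map runs once per document, emitted values are grouped by key, and reduce folds each group. This is a simplified sketch; the mapReduce helper is hypothetical, emit is passed in as a parameter rather than being a global as in the shell, and the stock figures are illustrative rather than the data behind the output shown above.

```javascript
// Minimal in-memory emulation of map/reduce: run map over every document,
// group the emitted values by key, then reduce each group to a single value.
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach(doc => map.call(doc, emit)); // "this" is the current document
  return [...groups].map(([key, values]) => ({ _id: key, value: reduce(key, values) }));
}

// Illustrative stock figures, not taken from the document's sample output.
const products = [
  { name: "A", items_in_stock: 4, cost_price: 50 },
  { name: "B", items_in_stock: 3, cost_price: 100 }
];

const map = function (emit) {
  emit("sub_total", this.items_in_stock * this.cost_price);
};
const reduce = (key, values) => values.reduce((total, v) => total + v, 0);

console.log(mapReduce(products, map, reduce)); // [ { _id: 'sub_total', value: 500 } ]
```

Because every document emits under the same key, the reduce step collapses the whole collection into one grand total.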
Infrastructure
Deployment
Based on the MongoDb documentation[11] we will start with a setup as shown in the following diagram. This setup ensures that queries are distributed across multiple shards, which improves performance. It also ensures that there are three replicas of the data available (one on each of the servers in the replica set[12]), and it allows for disaster recovery scenarios by replicating to servers in another data centre.
Mongo Processes
The bulk of MongoDb processing is handled by the processes depicted in the diagram. Mongod is the main database process. It completes the actual querying and editing of the data contained in the database. Mongos, on the other hand, is only a routing service. A client application will communicate with the mongos process, which in turn will query the configuration store (config mongod in the diagram) to find out which shard(s) to communicate with. It will then route the query to the appropriate shard(s) and merge the results from the different shards where applicable, before it returns the combined result to the client application. This method ensures that the client application only needs to be aware of one process to communicate with and does not need intimate knowledge of all the mongod processes. Note that the mongos processes can be run in many different configurations. They can be installed on all of the servers or only on some. They can also be installed on separate servers with no mongod processes installed. There may be a performance boost if the service is installed on each server, as it will be able to communicate over the localhost interface.
Replica Sets[12]
A replica set consists of two or more servers with the mongod process installed. One server in a replica set will be 'nominated' as master and will service all read and write requests. If the master fails or becomes unavailable the slave will automatically become the master and start serving requests.
Operating System
MongoDb uses memory-mapped files to manage data, which means that the database size is limited to 2 GB on 32-bit operating systems. Use a 64-bit operating system to support larger databases.
RAM
MongoDb uses memory-mapped files to manage data, which allows it to map data in memory as it appears on the hard disk. MongoDb will keep data in memory once it is queried for the first time (if possible) and use the in-memory data for subsequent queries, which is more efficient than reading from disk. Having a lot of memory available could speed up queries significantly.
Network
Setting up replication and backups will increase network traffic which could affect the query performance. Adding an extra network card and creating a separate network on which the servers can communicate with replication and backup servers could also reduce network 'noise'.
Next Steps
References
1. Sharding: http://www.mongodb.org/display/DOCS/Sharding+Introduction
2. Chunk: http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroductionChunks
3. BSON Object: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification
4. Choosing a Shard Key: http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
5. Indexing: http://www.mongodb.org/display/DOCS/Indexes
6. MongoTips: http://mongotips.com/b/a-few-objectid-tricks/
7. Splitting Chunks: http://www.mongodb.org/display/DOCS/Splitting+Chunks
8. MongoDB Pre-Splitting for Faster Data Loading and Importing: http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
9. Indexing as a Background Operation: http://www.mongodb.org/display/DOCS/Indexing+as+a+Background+Operation
10. Explain: http://www.mongodb.org/display/DOCS/Explain
11. Simple Initial Sharding Architecture: http://www.mongodb.org/display/DOCS/Simple+Initial+Sharding+Architecture
12. Replica Sets: http://www.mongodb.org/display/DOCS/Replica+Sets
13. Multikeys: http://www.mongodb.org/display/DOCS/Multikeys
14. Update: http://www.mongodb.org/display/DOCS/Updating#Updating-update%28%29
15. Modifiers: http://www.mongodb.org/display/DOCS/Updating#Updating-ModifierOperations
16. Atomic Operations: http://www.mongodb.org/display/DOCS/Atomic+Operations
17. Map Reduce Basics: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
18. Finding Max And Min Values for a given Key: http://cookbook.mongodb.org/patterns/finding_max_and_min_values_for_a_key/
19. Calculate the hex value of an IP address: http://www.pocketnes.org/hexa.html
20. GUID: http://en.wikipedia.org/wiki/Globally_unique_identifier
21. MongoDB Auto-sharding and Foursquare Downtime: http://nosql.mypopescu.com/post/1251523059/mongodb-auto-sharding-and-foursquare-downtime