Both management and IT will understand this masterpiece written by the world's top authorities on Teradata and data warehousing, describing how Teradata is built to achieve data warehouse utopia.

Tera-Tom on Teradata Basics: Teradata Explained Through Unimaginable Simplicity

Table of Contents
Introduction
Teradata: The Shining Star
Teradata Databases, Users and Space
Data Protection
Loading the Data
Conclusion: A Final Thought on Teradata
All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the publisher. No patent liability is assumed with respect to the use of information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, nor is any liability assumed for damages resulting from the use of information contained herein. For information, address: Coffing Publishing, 7810 Kiester Rd., Middletown, OH 45042. ISBN 0-9704980-1-2. All terms mentioned in this book that are known to be trademarks or service marks have been noted as such. Coffing Publishing cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark. Acknowledgements and Special Thanks: This book is dedicated to Americans and friends of liberty and freedom. We also want to thank our wives, Leona Coffing and Janie Jones. Thanks to a great editor and friend, Cheryl N. Buford.
Introduction
Overview
A full 40% of Fortune's "U.S. Most Admired" companies use Teradata. What do they know that your company needs to know? I've been in the computer business for more than 27 years. I've witnessed so much since the early days of punch cards, assembler languages, and COBOL programming. With that in mind, the most magnificent, ingenious technology that I've ever seen is a database from the NCR Corporation called "Teradata." "The wave of the future is coming and there is no fighting it." Anne Morrow Lindbergh Teradata is absolutely the wave of the future in data warehousing. I introduced this technology to a great friend, Morgan Jones. He immediately recognized that Teradata is the gold standard for all data warehousing, and as a result, we've partnered to write this book. So, sit back, relax, and enjoy. With our guidance, you will soon realize why Teradata is the greatest technology on the planet!
send a person to the moon, or that someone could run a mile in under four minutes? Ingenuity and the desire to improve are attributes of the human race, and both are found in numerous avenues, from sports to business. "Expect the unexpected, or you won't find it." Roger von Oech When Frank Lloyd Wright began to design the Imperial Hotel in Tokyo, he discovered the unexpected: just eight feet below the surface of the ground lay a sixty-foot bed of soft mud. Since Japan is a land of frequent shakes and tremors, Wright was faced with what appeared to be an insurmountable obstacle. This gave him an idea: Why not float the Imperial Hotel on the bed of mud, and let it absorb the shock of any quake? Critics and cynics alike laughed at such an impossible idea. Frank Lloyd Wright built the hotel anyway. Shortly after the grand opening of the hotel, Japan suffered its worst earthquake in fifty-two years. All around Tokyo buildings were destroyed, but the Imperial Hotel stood firm. For a long time the mainframe and OLTP industry laughed at those who recommended the data warehouse design principles set forth in this book. But those companies that build one based upon these rules will join the ranks of the elite. Consider this: ten of the top 13 global communications companies use Teradata; nine of the top 16 global retailers use Teradata; and eight of the top 20 global banks use Teradata. The ability to continually improve is one of Teradata's greatest strengths. The database was designed in 1976 and has continually improved ever since. Teradata has averaged one data warehouse installation per week for the past decade. Through continual improvement based on customer feedback from many of the largest data warehouse sites, Teradata has earned its reputation as "the data warehouse of choice for award-winning data warehouses." This book begins with the 10 cardinal rules to follow for data warehouse success.
It illustrates how Teradata helps customers follow these rules. Then it explains the brilliance of how Teradata works. By the end, the reader will have a real grasp of essential Teradata concepts.
Had everyone involved with the USS Indianapolis adhered to a single version of the truth, with detail data to back them up, this disaster might never have occurred. Likewise, if your company doesn't maintain detail data in a centralized data warehouse, you will never know which version of the truth to believe. Each division of a business will have its own view of the truth. Summarized data, such as a data mart, does have its place in knowledge management, but it should always be built from the detail data within the central data warehouse. Most companies don't have a central data warehouse. Why? Because they don't have proper leadership or direction. Company leaders often let different branches of the company create data marts as short-term solutions. Departmental leaders favor such solutions because they don't plan on being with a particular department forever; they are interested only in keeping things simple, controlled, and beneficial to them. "We're all in this alone." Lily Tomlin For example, imagine a company that makes cars on an assembly line. Instead of using a giant plant with the latest and greatest technology, the company builds cars in 300 small garages. Each garage is owned by a different department and has different needs. In addition, every user's access is restricted to his or her own garage. With this structure, leaders feel safe, but building cars, logistically, is a nightmare. In fact, just moving cars from one garage to the next would be a joke. This scenario may seem simple-minded, but that is how most data warehouses are built. Each part of some data warehouses operates alone. Now, imagine a giant car assembly plant where the assembly line is managed by the idea of "There is no I in Team." This plant would continually improve processes, finding better ways to work together. Everyone has an idea what the others are doing, and new ideas are welcome.
Management is able to run the entire plant with one team of dedicated professionals, and decisions are made cooperatively, concisely, and clearly. This style of management is the idea behind a central data warehouse. From the top layer of management down through the entire company, they are one solid team. An experienced data warehouse team saves valuable money and resources, and users can manage the entire data warehouse. Executives may ask any question targeted to any part of the business. Decisions are made with long-term vision, and every employee is confident that when they need answers, the data warehouse will provide them. "If I have seen further it is by standing on the shoulders of giants." Isaac Newton When asked how he had discovered the Law of Gravity, Isaac Newton did not grab all of the glory for himself. He claimed that his work stood on the foundation of those early scientists who had gone before him. Likewise, a central data warehouse allows users to stand on the shoulders of another giant. This giant, built right, allows major corporations to make decisions and act on those decisions quickly. In 1993, I was asked to train one of the world's largest retailers on its Teradata data warehouse. I flew to Bentonville, Arkansas, and an employee met me at the airport and escorted me to the classroom. As we walked down the hallways, most employees were moving at a pace I had never seen before. They were practically running. I asked, "What's up? Why is everyone hurrying?" The employee replied, "It's work time!" I was shocked. In all the places I had previously worked, we strolled. This place had a leadership that I've never encountered anywhere. H. Ross Perot described this kind of team when he said, "When building a team, I first look for people who love to win. If I can't find any of those, then I look for people who hate to lose." This was a cohesive team of employees so motivated and so empowered that they thought they could take over the world!
As I grew to know the team, I asked them how long it took top management to make a decision, and how long it took to implement that decision at thousands of stores nationwide. They simply said, "About two hours!" I was amazed. Today, this team continues to have one of the single greatest data warehouses ever built. They use it extensively and it grows stronger every day. While I was visiting with this team, management decided at one point that stores across the country should place Halloween displays and candy near the cash registers. In less than two hours, stores moved their Halloween candy from the normal candy aisles to end-caps near the cash registers. Every store participated but one! When asked why he didn't participate, the store manager said he had simply run out of time to create the displays and move the Halloween candy from his normal candy aisle to the end-caps. Management was ticked. Telling the manager they would get back to him, they asked the DBA to query the data warehouse to see how much this snafu had cost the company. The DBA came back and reported that the store actually sold almost the same amount of Halloween candy as forecasted. Management was surprised and honestly a little disappointed with the answer. But then the DBA added somewhat sheepishly, "I found something else, too." "Go ahead," replied members of the management team. He said, "I found out they actually sold about 40% more normal candy than we forecasted for this holiday." Management got on the phone immediately and told the other thousand stores: "Move those goblins and Halloween candy back to the normal candy aisles!" What that DBA did was to use his instinct and the data warehouse to find out exactly what was going on with the business at that time. He was armed with a system that allowed cross-functional analysis. A central data warehouse gives good management great confidence because they see the whole picture.
When users can ask any question, at any time, and on any data, their knowledge is unlimited. Most Teradata Central Data Warehouse sites will tell you most of their Return On Investment (ROI) came from areas they never suspected. Thomas Jefferson once said, "We don't know one millionth of a percent about anything." When we explained Teradata to Jefferson he did not build another Monticello, but he did retract his statement! Companies with a centralized data warehouse know about a million percent more than companies that have invested in stovepipe applications and 300 different data marts. Actually, any company planning on competing in this millennium must think long-term and begin building a centralized data warehouse. If not, that company will be on the short end of the stick when competing with a company that chose to build one. That thought should sound scarier than a goblin near the cash registers on Halloween! If you think about it, every major decision in business makes someone happy. If you are armed with facts supported by a central data warehouse and you do your homework, your business decisions will make your shareholders happy. However, if you are making decisions with a data mart strategy, those decisions are more likely to make your competitors happy. There are many companies that are fearful of such an undertaking. They want a central data warehouse, but wonder: "What if it fails? Which database should we choose? What type of hardware do we need? Should we do an RFP?" Decisions, decisions! It would literally take me about 30 seconds to make a decision on Teradata. There would be no RFP. We used to wade in swimming pools of data; today we are swamped in oceans of data. Teradata is built for this type of environment. This book explains the fundamentals of Teradata. Anyone with any experience or knowledge about data warehouse environments will clearly see why Teradata is the best solution.
Claude Levi-Strauss The user is the heart of the data warehouse, and users get better with each day of experience. The user makes decisions that affect the company's bottom line. That's why the data warehouse is built around the business user. Building a data warehouse is simple: find out what data the business users need and what type of queries they want to ask but are not able to ask today. Then, find out if the data is available and if the queries can be answered. With those answers, you will exceed users' expectations. An experienced data warehouse user is usually shocked when he or she first uses Teradata. Its sheer power and flexibility enable users to ask questions they have never been able to ask before. On a recent consulting trip of mine, a young DBA got antsy when a particular query took more than a minute or so with Teradata. So I asked, "Well, how long did that same query take with your OLTP-based data warehouse?" He retorted, "We couldn't even run this query on the old system." I said, "So, what's wrong with two minutes?" He added, "You know, some of our business users are so used to how long our queries used to run that they will be sitting, staring at the screen, without realizing that Teradata has already brought back the answer!" With Teradata, users can expand their thinking by using intuition and keen business sense without technology barriers. The building of an enterprise data warehouse begins with top management, but then cascades down to a relationship between the IT department and the business user community. The IT department must realize it has a supporting role. That role is to please the business user by making data available so the business user can easily ask questions and get answers. It's also the IT department's role to build a system that allows users to ask questions on their own without IT intervention. Forget about building a system where users ask IT to run the queries for them.
When users need information, the IT department should eventually be able to say, "Ask the question yourself; it is all available to you." The business users are actually the stars; however, the entire business community must take responsibility for the warehouse's success. These users must continually educate themselves and other users on the capabilities of the data warehouse, new tools, and new techniques that will enhance its potential. Those same users must help IT help them. If both understand their respective roles and work together to help the company, then the data warehouse will be a huge success.
environment can be extremely different from anything an IT department has ever built or used before. Therefore, it's a bad idea to build a data warehouse without the help of experienced people. An OLTP environment gets more and more predictable each month. It is designed to be tweaked and tuned in order to maximize a company's environment. On the other hand, a data warehouse is an unpredictable environment where the only way to gain control is to actually give up control. In data warehousing, users must be allowed the freedom to ask questions, and they will blossom in an environment where flexibility is accepted and welcomed. "The only sure weapon against bad ideas is better ideas." A. Whitney Griswold If the IT department decides to build hundreds of data marts to please each and every department, then it is missing the boat. Data warehouse experience is a hard teacher because it gives the test first and the lesson afterwards. Abraham Lincoln once said, "A house divided cannot stand." With that in mind, build the data warehouse so it will stand strong for a long time. What's the formula? First and foremost, start by building your data warehouse around detail data. Bring transaction data, along with key details, from the OLTP systems into the data warehouse. Then, as known queries are identified, build data marts to enhance their performance, and insist that those data marts are created and maintained directly from the detail data. Doing so will build a foundation that will stand. Next, the IT department needs to keep an open mind about creating an environment called "User Utopia." Have you ever been there? In "User Utopia" the user confidently asks queries without fear of being charged by the minute. The user has meta-data, so he or she becomes intimate with the data and makes informed decisions. The user should also be able to ask monster queries with the full backing of IT. Recently, on one such query, the IT department wanted to pull the plug.
But the DBA held out, granting the user more time. When the query finished running, the information it brought back from the detail data saved the company millions of dollars. Overall, a user will get the majority of his or her answers back quickly from data marts, but he or she also needs the capability of going back to the detail data for more information. This is "User Utopia." Here is the message for IT: Don't follow the idea that "if you build it, they will come." Instead, become a leader: go to the users and build it together.
"A bird does not sing because it has the answers, it sings because it has a song." A data warehouse built on detail data does not "sing because it has a song"; it "sings because it has the answers." When you capture detail data, answers to an infinite number of questions are available. But if this is truly the case, then why doesn't everybody build around detail data? Well, there are two reasons. One is price! Like a bird, many companies decide to go "cheap cheap." But watch out! The real expense is not the cost of the data warehouse; it is the money that you will not make without one. The second reason is power! Many companies don't have the wingspan to fly through the detail, so they "soar" with the summary. In addition, some companies don't want to pay for the disk space it actually takes to keep detail data, but believe me, that cost is a small price to pay for success. "Once you miss the first buttonhole it becomes difficult to button your shirt." Many companies use the same database for their data warehouse as they use for their OLTP system. This is a critical mistake. In essence, they have missed the first buttonhole and will most likely lose their shirt on their data warehouse adventure. At this point, companies no longer have a choice of using detail data. They must summarize for performance reasons. As one marine jokingly told his boot camp recruits, "The beatings will continue until the morale improves." Similarly, a database designed for OLTP takes a continual beating when it tries to query large amounts of detail data. Companies building true data warehouses don't compromise on price, and will have a data warehouse that is built for decision support, not one that specializes in OLTP. With this decision, you have buttoned the first buttonhole and are well on your way to reaching the top. Detail data is the foundation that data warehouses are built upon.
Users can ask any question, anytime, and conduct data mining, OLAP, ROLAP, SQL and SPL functions, build data marts directly from the detail data, and can easily maintain and grow the environment on a daily basis. Now that's a tune well worth singing. Make a note of it!
write on the card, "I'm sorry. I love you." The beautiful bouquet arrived at the door. But then his wife read the words the florist had actually written in haste, "I'm sorry I love you." The top reasons to build data marts directly from detail data are:
1. Users can get answers from the data mart, but can validate their findings or check out additional information from the detail that built it.
2. There is only one consistent version of the truth.
3. Maintenance is easy.
If a user comes up with a data mart answer that does not make sense, then he or she has the ability to drill down into the detail and investigate. Sometimes summary data can spark interest, and finding out the "why" can result in big bucks. If users don't trust the data, they won't use the system. When a data warehouse is built on a foundation of detail data and data marts are then erected from that foundation, you have a winning combination. The results will always be consistent and trustworthy. However, you should only build data marts when there is a credible business case, and you should be ready to drop them when they are no longer needed. The life span of a data mart is relatively short compared to that of its mother and father (better known as the detail data). If you build data marts from the detail, it makes them easy to manage, easy to drop, and easy to change.
In today's fast-paced world, gigabytes soon become terabytes. It may not sound like much, but it weighs a ton on the shoulders of giants. Consider these measurements and pick your data warehouse's life span. For example, if you lived for a million seconds (megabyte), you would live for 11.6 days. In comparison, if you lived for a billion seconds (gigabyte), you would live for 31.7 years. And if you lived for a trillion seconds (terabyte), you would live for 31,688 years! How nice it would be on your 31,688th birthday to have people say, "You sure look good for your age." Data warehouses hit the wall of scalability because they cannot grow at the same rate as the amount of data being gathered. Teradata allows for unlimited "linear scalability." "Linear scalability" is a building-block approach to data warehousing that ensures that as building blocks are added, the system continues at the same performance level. This is why the largest data warehouses in the world use Teradata. I was lucky to be in the right place at the right time, and taught the beginning classes at what are considered the two largest data warehouse sites in the world: Southwestern Bell (SBC) and Wal-Mart. Wal-Mart's data warehouse started with less than 30 gigabytes, and SBC started with less than 200 gigabytes and 100 users. Both warehouses:
Started small and simple;
Used Teradata from the beginning;
Have built the largest Enterprise Data Warehouse in their respective industries;
Continue to realize additional Return On Investment (ROI) on an annual basis;
Have grown to more than 10 terabytes of data, and are still growing;
Have thousands of users (some estimates are shocking);
Have educated and experienced data warehouse staffs;
Have educated and experienced data warehouse users;
Experience continual growth without boundaries;
Have experienced linear performance from Teradata in every single upgrade (from gigabytes to terabytes and from terabytes to tens of terabytes).
Both companies are impressed with Teradata's power and performance, and both SBC and Wal-Mart are committed to the excellence of Teradata.
A data warehouse is built in small building blocks. Linear scalability is described in three ways: First, building blocks are added until the performance requirements of your environment are met (guaranteed success). Second, every time the data doubles, the building blocks are doubled, and the system maintains its performance level (guaranteed success). Third, any time the environment changes, building blocks are added until performance requirements are met (guaranteed success). Scalability is not just about growing the data volume. It also means growing, or increasing, the number of users. Many systems work flawlessly until as few as 5 users are added, then slow to a crawl. Companies need a system where growth and performance are easily calculated and implemented: one where the number of users, the size and complexity of queries, the volume of data, and the number of applications in use can be calculated and compared to the current system's actual capacity. If more power, speed, or size is needed, then the company can simply add building blocks to the system until the requirements are met.
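The "guaranteed success" claims above reduce to simple arithmetic: if total throughput grows in proportion to the number of building blocks, then doubling both the data and the blocks leaves response time unchanged. Here is a toy model of that math (the per-node throughput figure is illustrative, not a Teradata benchmark):

```python
# Toy model of linear scalability: scan time = data volume / total throughput.
def scan_time(data_gb: float, nodes: int, gb_per_sec_per_node: float = 0.5) -> float:
    """Seconds to scan the data when work is spread evenly across the nodes."""
    return data_gb / (nodes * gb_per_sec_per_node)

base    = scan_time(data_gb=1_000, nodes=10)   # 1 TB on 10 building blocks
doubled = scan_time(data_gb=2_000, nodes=20)   # data doubled, blocks doubled

print(base, doubled)  # 200.0 200.0 -- the performance level is maintained
```

Adding blocks without adding data works the other way: the same scan simply finishes proportionally faster.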
Join tables;
Aggregate data;
Sort data;
Scan large volumes of data.
In order to get around these system limitations, vendors will suggest a model to avoid joins, use summarized data to avoid aggregation, store data in sorted order to avoid sorts, and overuse indexes to avoid large scans. In avoiding these operations, vendors are also avoiding the ability to compete! That is like placing a ball and chain around a runner's leg and saying, "I wish you all the best in the marathon!" Come on! Whose side are these vendors really on? Teradata is the only database engine I have seen that has the power and maturity to use a "3rd Normal Form" physical model on databases exceeding a terabyte in size. Because of these physical limitations, other databases have had to use a "Star-Schema" model to enhance performance, but have given up the ability to perform ad-hoc queries and data mining.
A "normalized" model is one that should be used for the central data warehouse. It allows users to ask any question, at any time, on information from any place within the enterprise. This is the central philosophy of a data warehouse. It leads to the power of ad-hoc queries and data mining, whereby advanced tools discover relationships that are not easily detected, but do exist naturally in the business environment. A "Star-Schema" model enhances performance on known queries because we build our assumptions into the model. While these assumptions may be correct for the first application, they may not be correct for others. Flexibility is a big issue, but data marts can be dropped and added with relative ease if each is built directly from the detail data. Remember, build the data warehouse around detail data using a normalized model. Then, as query patterns emerge and performance for well-known queries becomes a priority, "Star Schema" data marts can be created by extracting summarized or departmental data from the centralized data warehouse. The user will then have access to both the data marts for repetitive queries, and the central warehouse for other queries. Because data marts can be an administrative nightmare, Teradata enables "Star-Schema" access without requiring physical data marts. By setting up a join index as the intersection of your "Star-Schema" model, you can create a "Star-Schema" structure directly from your "3rd Normal Form" data model. Best of all, once it is created, the data is automatically maintained as the underlying tables are updated. Keep in mind, 80% of data warehouse queries are repetitive, but 80% of the Return On Investment (ROI) is actually provided by the other 20% of the queries that go against detailed data in an iterative environment. 
By using a normalized model for your central data warehouse and a "Star-Schema" model on data marts, you can enhance the possibility of realizing an 80% Return on Investment and still enhance the performance on 80% of your queries.
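Conceptually, a join index gives you a pre-joined, system-maintained structure over the normalized base tables: updates to the base data flow into it automatically. The sketch below is not Teradata SQL or Teradata internals; it is a few hypothetical Python lines that illustrate only the maintenance idea, where every insert into the base table updates the pre-joined summary in the same step.

```python
from collections import defaultdict

# Normalized base data: a small product dimension (3NF style).
products = {1: "candy", 2: "soda"}

# The "join index" idea: a pre-joined summary the system keeps in step.
sales_by_product = defaultdict(float)

def insert_sale(product_id: int, amount: float) -> None:
    """Insert a base sales row; the pre-join is maintained automatically."""
    sales_by_product[products[product_id]] += amount

insert_sale(1, 5.00)
insert_sale(1, 2.50)
insert_sale(2, 1.00)
print(dict(sales_by_product))  # {'candy': 7.5, 'soda': 1.0}
```

The payoff is the one described above: repetitive "Star-Schema" queries read the pre-joined structure, while ad-hoc queries still go straight to the normalized detail.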
Rule # 8 - Don't Let a Technical Issue Make Your Data Warehouse a Failure Statistic
"Experience is a hard teacher because she gives the test first, the lesson afterwards." Scottish Proverb Did you know that three-quarters of the people in the world hate fractions, and that 40% of data warehouse failures are caused by a technical issue? There are many traps and pitfalls in every data warehouse venture. One winter day a hunter met a bear in the forest. The bear said, "I'm hungry. I want a full stomach." The man replied, "Well, I'm cold. I would like a fur coat." "Let's compromise," said the bear, and he quickly gobbled up the hunter. They both got what they asked for. The bear went away with a full belly and the man left wrapped in a fur coat. With that in mind, good judgment comes from experience; experience comes from bad judgment. You have shown good judgment by reading this book, so let our experience keep your company from having a bad data warehouse experience. Author Daniel Boorstin wrote in The Discoverers, "The greatest obstacle to discovering the shape of the earth, the continents, and the oceans was not ignorance, but rather the illusion of knowledge." There is a lot of "illusion of knowledge" being spread around in the data warehousing environment. Before you decide on any data warehouse product, ask yourself, and the vendor, these questions:
As my data demands increase, will the system be able to physically load the data? Our experience shows that many systems are not capable of handling very large volumes of data. Do the math!
As the data grows in volume, can the system meet the performance requirements? Do the math!
As the number of users grows, will the system be able to scale? Do the math!
As my environment changes, will the system be flexible enough to allow changes quickly and easily? Do the math!
Will the system need so many Database Administrators (DBAs) that my system's cost skyrockets? Do the math!
If we suddenly merged with another company and needed to incorporate their mainframe or LAN environment, would the system be able to connect and include them? Do the math!
Can I continue to meet my batch window timeframes? Do the math!
Could I become the hero of the company one day, only to have some technical glitch blamed on me because of my poor foresight, and be thrown out of the company into a giant mud puddle? Do the bath!
As the environment changes in terms of users, data, complexity, capacity, batch windows, time changes, events, or opportunities, users should be able to continue building applications and architecture. The more a Teradata system grows, the more Teradata outshines the competition.
vision: to network enough PC chips together that the mainframe would be overpowered, yet costs would be hundreds of times cheaper than a mainframe. The Teradata team estimated the power surge would come in 1990. IBM laughed out loud! They said, "Let's get this straight: you are going to network a bunch of PC chips together and overpower our mainframes? That's like plowing a field with 1,000 chickens!" In fact, IBM salespeople are still trying to dismiss Teradata as just a bunch of PCs in a cabinet. Teradata was convinced it could produce a product that would power large amounts of data and achieve the impossible: using PC technology in mainframe territory. Its founders agreed with Napoleon Bonaparte, who asserted, "The word impossible is not in my dictionary!" Sure enough, when we looked in his dictionary, that word was not there. And it is not in Teradata's Data Dictionary, either! The Teradata team set two goals: build a database that could
Driving in the car one evening, Morgan's eight-year-old daughter Kara piped up from the back seat, "Daddy, can you buy Teradata in the store? I mean, what does Teradata really do?" Morgan thought for a moment and then replied, "Do you remember when you went on the Easter egg hunt last spring? Well, imagine that we had fifty eggs and you were the only child there. If I asked you to find all the purple eggs, would you be able to do that?" Kara said, "Sure! But it might take me a long time." Morgan continued, "What if we now let fifty children go in and I asked them to show me all of the purple eggs. How long would that take?" His daughter responded, "It wouldn't take any time at all because each child would only have to look at one egg." That is precisely how Teradata works. It divides up huge tasks among its processors and tackles each portion simultaneously, with amazing speed. And it doesn't matter if you have a trillion eggs in your basket! In 1984, the DBC/1012 was introduced. Since then, Teradata has been the dominant force in data warehousing. Teradata got the chickens plowing, and is considered outstanding in its field. Meanwhile, IBM's plow is out rusting in its field.
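The egg hunt maps directly onto divide-and-conquer parallelism: hand each worker an even share of the data, let every worker scan only its share, then combine the partial answers. Here is a small sketch of that pattern in plain Python (a stand-in for the idea, not Teradata's actual architecture):

```python
from concurrent.futures import ThreadPoolExecutor

# Fifty eggs; every fifth one is purple.
eggs = ["purple" if i % 5 == 0 else "white" for i in range(50)]

def find_purple(share):
    """One worker scans only its own slice and reports what it found."""
    return [i for i, egg in share if egg == "purple"]

n_workers = 10
rows = list(enumerate(eggs))
shares = [rows[w::n_workers] for w in range(n_workers)]  # deal rows out evenly

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = pool.map(find_purple, shares)

purple_eggs = sorted(i for part in partials for i in part)
print(purple_eggs)  # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
```

Each worker touches one-tenth of the eggs, so the scan finishes in roughly one-tenth of the time, and adding more workers shrinks each share further.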
Parallel Processing
"An invasion of armies can be resisted, but not an idea whose time has come." Victor Hugo The idea of parallel processing gives Teradata the ability to have unlimited users, unlimited power, and unlimited scalability. This is an idea whose time has come. And, it all starts with something called "parallel processing". So what is parallel processing? Let us explain: It was 10 p.m. on a Saturday night and two friends were having dinner and drinks. One of the friends looked at his watch and said, "I have to get going." The other friend responded, "What's the hurry?" His friend went on to tell him that he had to leave to do his laundry at the Laundromat. The other friend could not believe his ears. He responded, "What?! You're leaving to do your laundry on a Saturday night?! Do it tomorrow!" His buddy went on to explain that there were only 10 washing machines at the laundry. "If I wait until tomorrow, it will be crowded and I will be lucky to get one washing machine. I have 10 loads of laundry, so I will be there all day. If I go now, there will be nobody there, and I can do all 10 loads at the same time. I'll be done in less than an hour and a half."
This story describes what we call "Parallel Processing." Teradata is the only database in the world that loads data, backs up data, and processes data in parallel. Teradata was born to be parallel, and instead of allowing just 10 loads of wash to be done simultaneously, Teradata allows for hundreds, even thousands, of loads to be done simultaneously. Teradata users may not be washing clothes, but this is the technology that has been cleaning every database's clock in performance tests.
"After enlightenment, the laundry" Zen Proverb "After parallel processing the laundry, enlightenment!" Teradata Zen Proverb With the computer world now seeing terabytes of data, hundreds to thousands of users are asking a wide variety of complex questions and need instantaneous access to data. In short, this is the technology needed in a data warehouse environment. What we find most fascinating is that Teradata has unlimited power, grows without boundaries, and was born out of the PC (personal computer) world by people with vision.
Processor Chip: This is the brain of the computer. All tasks are done at the direction of the processor.

Memory: This is the hand of the computer. The memory allows data to be viewed, manipulated, or changed. Data is brought in from the hard drive, and the processor works with the data in memory. Once changes are made in memory, the processor can command that the information be written back to disk.

Hard Drive: This is the spine of the computer. The hard drive stores data, applications, and the Operating System inside the PC. The hard drive, also called the disk drive, holds the contents of the data for the system on its disk.

For example, suppose you made three new good friends this month and want to add their names to your list. Opening that document brings it up from the hard drive and displays it on your screen. As you type in the new names, the processor executes your request on the document while it is still displayed in memory. Upon completion, you close the document and the processor writes all the changes to the disk where it is stored. In the picture below, we see the basic components of a Personal Computer. Note that it also holds a file called "Best_Friends listing," and lists eight best friends.
we call it parallel processing. In the previous example, one processor listed eight best friends on its disk. In that case, Teradata would read eight rows. The Teradata example on the next page shows two processors, each having direct access to its own physical disk. The "Best_Friends" table has been spread out evenly across both processors. When we ask the system for a list of best friends, both processors will retrieve data in parallel and return combined results over the connecting network. This could easily double the speed of the previous example. Even though we still need to read eight records, each processor is responsible for reading only four records while the other processor simultaneously reads the remaining four. So, how could we double the speed of this system again?
rarely calls in sick, and lives to take direction from its boss, the Parsing Engine (PE). The best example is to think of each AMP as a computer processor attached to its own disk. Every AMP has its own disk, and it's the only AMP allowed to read or write data to that disk. This is referred to as a "Shared-Nothing" architecture. Although AMPs are the perfect workers, they are not the perfect playmates. Even as children, AMPs would never share toys with other AMPs on the playground. Each AMP has its own disk, and it shares this with no other AMP, hence a "Shared-Nothing" architecture. Teradata spreads the rows of a table evenly across all AMPs in the system. When the PE asks the AMPs to get the data, each AMP will read only the rows on its particular disk. If this is done simultaneously, all AMPs should finish at about the same time. As a matter of fact, when we explained this philosophy to Confucius he stated, "A query is only as fast as the slowest AMP." Confucius, however, did say not to quote him! Again, an AMP's job is to read and write data to its disk. The AMP takes its direction from the Parsing Engine (PE). The number of AMPs varies per system. Today, some Teradata systems have just four AMPs, while others have more than 2,000!
The BYNET
"Even if you're on the right track, you'll still get run over if you just sit there." Will Rogers The BYNET ensures communication between AMPs and PEs is on the right track and that it happens rapidly. When communication between AMPs and PEs is necessary, the BYNET operates as a communication superhighway. There are always two BYNETs per system. They are called "BYNET 0" and "BYNET 1." The duplication is insurance in case one BYNET fails, and it also enhances performance. As an example, think of the two BYNETs as two telephone lines in your home. AMPs and PEs can talk to one another over either BYNET, or over both. Morgan Jones, co-author, has been talking to his four-year-old son, David, about AMPs, PEs, and the BYNET. Little David asked, "Daddy, what happens when the AMPs and PEs get lonely?" Morgan replied, "They talk to each other over the BYNET". Here are the steps that outline exactly how the AMPs, PEs, and BYNETs work together: A user performs a LOGON to Teradata. A PE is assigned to manage all SQL for that particular user. The user then asks Teradata a question. Next,
1. The PE checks the user's SQL syntax;
2. The PE checks the user's security rights;
3. The PE comes up with a plan for the AMPs to follow;
4. The PE passes the plan along to the AMPs over the BYNET;
5. The AMPs follow the plan and retrieve the data requested;
6. The AMPs pass the data to the PE over the BYNET; and
7. The PE then passes the final data to the user.
Anonymous Teradata builds its data warehouses in building blocks called "nodes." Each building block is a gem composed of four Intel processors. Each node is connected flawlessly to other nodes through two BYNETs. The AMPs and PEs reside inside the node's memory. Each node is connected to a disk array where each AMP has direct access to one virtual disk. Below is a picture of a Teradata system. It has four Intel processors, and the AMPs and PEs reside in memory. Each AMP is directly attached to its one virtual disk.
The following picture shows two nodes connected together over the BYNETs.
Teradata Tables
"Nearly everyone takes the limits of his own vision for the limits of the world. A few do not. Join them."
Arthur Schopenhauer Do you have one of those notoriously messy "junk" drawers in your kitchen? You know the one we're talking about: the one next to the silverware drawer. This drawer may often contain old washer and dryer warranties, matches, half-used flashlight batteries, straws, odd nuts, bolts and washers, corncob holders, etc. Fortunately, the dresser drawers in your bedroom are typically much more organized! In fact, you probably store your clothing in those drawers much more neatly so you can get to what you need quickly. Relational databases store data much like we organize our dresser drawers: Just as you might put all of your T-shirts in one drawer and your socks in another, the database will store data about one topic in one table, while data that pertains to another topic is kept in another table. For example, a database might contain a CustomerTable containing items to track such as customer number, CustomerName, city, and order number. Another table, the OrderTable, might hold data like Order Number, Order Date, CustomerName, Item No, and Quantity. An example of each table follows:

CUSTOMER TABLE called "CustomerTable"

CustomerID (PK)  CustomerName   CityName  Order Number (FK)
1001             JC Penney      Dallas    105372
1002             Office Depot   Columbia  105799
1003             Dillards       Atlanta   106227

ORDER TABLE called "OrderTable"

Order Number (PK)  Order Date   Customer Rep  Customer ID (FK)
105372             03/07/2001   Dreyer        1001
105799             04/18/2001   Crocker       1002
106227             10/17/2001   Smith         1003
The data stored in the CustomerTable is logically related to the data stored in the OrderTable. The two tables both have columns called "Order Number." These tables make up an extended family, joined by the "marriage" of the columns named "Order Number" in each table. Earlier programming languages referred to files, records, and fields. Relational databases use the terms "Tables," "Rows," and "Columns." Each row of a table is made up of one or more fields, each identified by a column name. A row is the smallest unit that can be inserted into a table. A column is the smallest unit within a table that can be updated or modified. The data value stored in each column must match the data type for that column. For example, you cannot enter the name of a city in a column that is defined as a decimal data type. Columns that are defined but have no data value will display a "null", sometimes represented by a "?". One column, or combination of columns, in each table is chosen to be the "Primary Key (PK)". This is a logical modeling term. The primary key contains a unique value for each row, and enforces the uniqueness of that row. The PK cannot be null, and should contain values that will not change. In the CustomerTable, the primary key is the CustomerID column. Each customer has a unique CustomerID. The data in the columns of every row must be consistent with the unique CustomerID for that row. The rows in a table need not be stored in any particular order. This is also called being "arbitrary" or an "unordered set." Before the table is defined, the order of the columns is also arbitrary. It doesn't matter if you place CustomerName before CityName or
after it. However, once the table is created, the order of the columns (i.e., the row format for the table) must remain the same. Plus, you cannot have multiple row formats within a table. What forms the relationship between the tables in a relational database? A key that is common to both tables. A "Foreign Key (FK)" is a key in a table that is a Primary Key (PK) in another table. The PK and FK relationship allows the two tables to relate to one another. When you need to display data from more than one table, you can JOIN the two tables by matching a common key between them. A great choice is to match the primary key of one table to the foreign key of the other table. Remember that a table may have only one PK, but it may have multiple FKs. Here is a quick reference chart for Primary and Foreign Keys:

PRIMARY KEY                         FOREIGN KEY
Not optional                        Optional
Can have only one PK per table      Can have multiple FKs per table
No duplicates allowed               Duplicates allowed
No changes allowed                  Changes allowed
No nulls allowed                    Nulls allowed
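To see the PK/FK relationship in action, here is a sketch of a join that matches the primary key of the CustomerTable to the foreign key of the OrderTable. (The table and column spellings follow the sample tables above and are illustrative, not an exact schema.)

```sql
SELECT  c.CustomerName
       ,o.Order_Number
       ,o.Order_Date
FROM    CustomerTable AS c
INNER JOIN OrderTable AS o
   ON   c.CustomerID = o.CustomerID;   -- PK of CustomerTable matched to FK of OrderTable
```

Each customer row finds its matching order row through the common key, and the answer set combines columns from both tables.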
BYNET to the Parsing Engine (PE), and the PE ensures the data is delivered to the user. Keep in mind, the BYNET is an internal Teradata network over which the PEs and the AMPs communicate. The example below shows the information we have just discussed. Notice that the system has four AMPs and three tables: "Employee," "Customer," and "Order." Notice each AMP holds a portion of the rows for every table. AMP1, for example, holds 1/4th of the Employee table rows, 1/4th of the Customer table rows, and 1/4th of the Order table rows. Plus, the data is spread evenly for all tables. If a query asks for all rows in the Customer table, then each AMP will retrieve its Customer table rows in parallel with the other AMPs. Each AMP will then pass its data to the PE via the BYNET. Because the data in the Customer table is spread evenly among all AMPs, each should finish reading at about the same time. Also, notice how each AMP separates each table. Just like schools of fish, the rows of the Employee table are grouped together. Likewise, the Customer and Order table rows are each grouped together. This is important in a data warehouse environment because most queries read millions of rows to satisfy a single request. Performance is enhanced when table rows are grouped together and Teradata is permitted to bring blocks of rows into memory.
Primary Indexes
"Every road has two directions." Russian Proverb When world-renowned explorer, Dr. David Livingstone, was working in Africa, a group of friends wrote to him saying, "We would like to send other men to you. Have you found a good road into your area yet?" According to a member of his family, Dr. Livingstone sent this message in response, "If you have men who will only come if there is a good road, I don't want them. I want men who will come if there is no road at all." Although it doesn't have to cut its way through the dense African jungle, the PRIMARY INDEX (PI) is the trailblazer in Teradata that paves the way for the rest of the data to follow. The PI is so important to Teradata functionality that every table in the database is required to have one. As the quote above states, "Every road has two directions." The Primary Index is used in two directions: 1. The Primary Index WILL DETERMINE which rows go to which AMPs; and 2. The Primary Index is ALWAYS the FASTEST RETRIEVAL method. If the user doesn't define a PRIMARY INDEX when creating a table, the system will automatically choose one by default. Once it is defined, the PI column cannot be dropped or changed. The table would need to be recreated in order to change the PI.
An example of creating a Non-Unique Primary Index is listed below. Notice you never see the prefix "NON":
CREATE TABLE TomC.employee
  (emp        INTEGER
  ,dept       INTEGER
  ,lname      CHAR(20)
  ,fname      VARCHAR(20)
  ,salary     DECIMAL(10,2)
  ,hire_date  DATE)
PRIMARY INDEX(dept);
PRIMARY INDEXES may be defined on one column, or on a set of columns viewed as a composite unit. Up to 16 columns may be defined as a Primary Index. An example of creating a multi-column Unique Primary Index follows:
CREATE TABLE employee
  (emp        INTEGER
  ,dept       INTEGER
  ,lname      CHAR(20)
  ,fname      VARCHAR(20)
  ,salary     DECIMAL(10,2)
  ,hire_date  DATE)
UNIQUE PRIMARY INDEX(emp, dept);
All of the tables in a Teradata database are related to each other, but it is the Primary Key and Primary Index that make those relationships work in day-to-day use. What is the difference between a PRIMARY KEY and a PRIMARY INDEX? A Primary Key is a logical term used to label the column(s) that enforce the uniqueness of each row in a table. PKs determine relationships among tables. A Primary Index is a physical term used to label the column(s) used to store and locate rows of data. To illustrate, imagine a library. The Primary Key, being logical, is like the actual layout of the library. Do you know what part of the library is reserved for fiction? What about for non-fiction? And where will the card catalog reside? Once the library is logically correct, it is ready to receive books. A Primary Key on a table likewise helps to logically determine what data to track in the table. The Primary Index is much like the card catalog in the library. Inside the card catalog drawers are thousands of index cards that provide the book's title, author, publisher, and the Dewey Decimal number. By taking that index card, you can immediately find where that book is shelved within the library. The Primary Index column value for a Teradata table tells where the row should reside. It's also the fastest mechanism to retrieve data. Teradata uses the Primary Index to distribute each table's rows to the proper AMPs. Teradata also uses the Primary Index to retrieve rows at lightning speed. Exactly how does Teradata actually accomplish this? Well, I'm glad you asked! Let's look at the HASH MAP next:
(Hash-map figure: a four-AMP hash map, repeating the bucket sequence 1 2 3 4 over and over.)

The next diagram shows the hash map for an eight-AMP system. As before, this is for simulation purposes. Notice that the AMP numbers for this hash map go 1, 2, 3, 4, 5, 6, 7, 8, and then start over again. Why? Because this hash map is for an eight-AMP system.

(Hash-map figure: an eight-AMP hash map, repeating the bucket sequence 1 2 3 4 5 6 7 8 over and over.)
Best_Friends Table

Friend_Num  Friend_Name
 4          Joe Davis
 6          Mary Gray
 8          John Davis
10          Don Roy
12          Sam Mills
14          Kyle Marx
16          Lyn Jones
For this example, Teradata will attempt to spread the table rows among the four-AMP system. A picture of the four-AMP configuration follows:
Since there is a four-AMP configuration, the system will use a four-AMP hash map. Here is an illustration:

(Hash-map figure: the four-AMP hash map, repeating the bucket sequence 1 2 3 4 over and over.)

Instead of trying to figure out the NCR Wiz-Bang formula (a secret), we can show you the theory of distributing and retrieving data with our own formula. It is called the Coffing/Jones Wiz-Bang formula: take a table's Primary Index column value and divide it by 2. The answer points to a hash map bucket, and that bucket tells which AMP will hold the row.
Let's take our first row and determine on which AMP it will reside. Remember, we will take the Primary Index value of the row, apply the Coffing/Jones Wiz-Bang Formula (divide by 2), and the answer will point to a bucket in the hash map. Inside that bucket will be the number of the AMP where the row will reside. Let's take our first row and determine its proper location:

Friend_Num  Friend_Name
2           Ben Hon

Since we designated Friend_Num as the Primary Index, we merely divide the value of Friend_Num (2) by the Coffing/Jones Wiz-Bang divisor (2): 2 divided by 2 = 1. The hash map bucket number is one. Let's check the hash map to see bucket number 1 and what AMP number is inside that bucket. As seen in the picture below, the first bucket in the hash map says the row's destination is AMP 1.
Let's look at another random row:

Friend_Num  Friend_Name
16          Lyn Jones

Since we designated Friend_Num as the Primary Index, we merely divide the value of Friend_Num (16) by the Coffing/Jones Wiz-Bang divisor (2): 16 divided by 2 = 8. Thus, the hash map bucket number is now eight. Let's check our hash map bucket number eight and determine which AMP number is inside that bucket. As you can see below, bucket eight in the hash map says the row's destination is AMP 4.
If we continue the process until all data is laid out, the system would look like this:
Best_Friends Table

Friend_Num  Friend_Name
 2          Ben Hon
 4          Joe Davis
 6          Mary Gray
 8          John Davis
10          Don Roy
12          Sam Mills
14          Kyle Marx
16          Lyn Jones

HASH MAP

1 2 3 4  1 2 3 4  1 2 3 4  1 2 3 4  1 2 3 4  1 2 3 4
1 2 3 4  1 2 3 4  1 2 3 4  1 2 3 4  1 2 3 4  1 2 3 4

Remember, the Teradata hashing formula is a secret, and the Coffing/Jones Whiz-Bang Formula did not crack the code. Its purpose is simply to show you how the hash map works, in theory, to distribute and locate rows. Simply understand that the real formula is mathematical (similar in spirit to the Coffing/Jones Whiz-Bang Formula) and that it is consistent. When we divided Friend_Num two by two, we got bucket one in the hash map. If we ran the formula on this row a million times, we would still get the same result. "If you always do what you always did, you'll always get what you always got." Verne Hill In summary, Teradata will always be able to find a row if it knows the Primary Index value. It can rerun the hash formula, point to the bucket in the hash map, and then retrieve the row from the correct AMP. The Teradata hashing formula always does what it always did, and always gets what it always got. Since it always runs the same formula, it is consistent.
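On a real system you can watch this distribution yourself. Teradata exposes its hashing through the HASHROW, HASHBUCKET, and HASHAMP functions, which reveal the row hash, the hash map bucket, and the owning AMP for a Primary Index value. A sketch against the Best_Friends table above:

```sql
SELECT  Friend_Num
       ,HASHROW(Friend_Num)                       AS Row_Hash   -- the hashed Primary Index value
       ,HASHBUCKET(HASHROW(Friend_Num))           AS Bucket_No  -- the hash map bucket it points to
       ,HASHAMP(HASHBUCKET(HASHROW(Friend_Num)))  AS AMP_No     -- the AMP that owns the row
FROM Best_Friends;
```

Run this on systems of different sizes and you will see the same row hash every time, but the bucket-to-AMP assignment change with the number of AMPs, just as the two hash maps above illustrate.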
The Parsing Engine understands that the user wants two columns, "Friend_Num" and "Friend_Name," returned. The PE gets excited when it notices that we are after Friend_Num eight. It recognizes that Friend_Num is the PRIMARY INDEX. The PE then runs the hash formula for eight. For explanation purposes the Coffing/Jones hash formula is used, which merely divides the PI value by two. When the PE divides the value eight by two, it receives an answer of four. It looks in bucket four and sees the AMP number. The PE passes a plan to retrieve the data to ONLY AMP number four, as this is a one-AMP operation.
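The single-row Primary Index retrieval described here would take a shape like this (reconstructed for illustration from the discussion above):

```sql
SELECT Friend_Num, Friend_Name
FROM   Best_Friends
WHERE  Friend_Num = 8;    -- Friend_Num is the Primary Index, so only one AMP is involved
```

Because the WHERE clause supplies the complete Primary Index value, the PE can hash it, find the bucket, and send the plan to a single AMP. This is the fastest retrieval path in Teradata.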
even thousands of AMPs? Well, one major telecommunications company copied a 3.5 billion-row table in just 18 minutes. The 1,900 AMPs in its system helped return results very rapidly. Talk about efficiency! Most Full Table Scans bring traditional databases to their knees, but Teradata was born to be parallel. Teradata was specifically designed for data warehousing. When you ask decision support questions like, "Who are my best and worst customers?" you are asking the system to read through an entire table. Full Table Scans are a fundamental and important part of data warehousing. They allow users to literally ask any question, about any data, at any time. Teradata has the experience, power, and architecture to allow Full Table Scans. An example of a query asking for a Full Table Scan is:
SELECT Friend_Num, Friend_Name FROM Best_Friends;
In this example, the Parsing Engine receives the SQL and checks the syntax and security. If the user passes these tests, the query continues. The PE knows this query asks to return all records. This is a Full Table Scan. Therefore, it passes the AMPs a plan that says, "Retrieve all of your Best_Friends table rows, and then pass them to me (PE) over the BYNET." With that in mind:
Each AMP reads its own Best_Friends table rows. Each AMP then passes its rows to the PE over the BYNET.
Let's run through the SQL again and see the result:
SELECT Friend_Num, Friend_Name FROM Best_Friends;
8 rows returned

Friend_Num  Friend_Name
 6          Mary Gray
14          Kyle Marx
 8          John Davis
16          Lyn Jones
 2          Ben Hon
10          Don Roy
 4          Joe Davis
12          Sam Mills
In this chapter, we have shown you two opposite approaches to retrieving data. In our first query, we used the Primary Index to retrieve one row. In the next query, we used a Full Table Scan (FTS) to retrieve all the rows. One approach is the fastest way, and the other is the slowest way. But are these the only options for retrieving data? No. There is another option: the Secondary Index.
Secondary Indexes
"Measure a thousand times and cut once." Turkish Proverb Secondary Indexes provide an alternate path to the data, and should be used on queries that run thousands of times. Teradata runs extremely well without secondary indexes, but since secondary indexes use up space and overhead, they should only be used for "KNOWN QUERIES", or queries that are run over and over again. Once you know the data warehouse environment, you can create secondary indexes to enhance its performance. "Measure a thousand query times and create a secondary index." Turkish Teradata Certified Professional There are two types of secondary indexes: Unique Secondary Indexes and Non-Unique Secondary Indexes, respectively referred to as USI and NUSI. A table may have up to 32 secondary indexes. The good news about secondary indexes is that they speed up queries. The bad news is that every time someone creates a secondary index on a table, Teradata creates and maintains a separate secondary index sub-table. This not only takes up space, but also adds overhead. A classic secondary index is itself a table, made up of rows having two main parts. The first is the data column inside the secondary index sub-table, and the second is a pointer showing the location of the row in the base table. Teradata brilliantly uses the hash formula and the hash map to build its secondary index sub-tables. There are three values stored in every secondary index sub-table row: the Secondary Index data value; the Secondary Index Row-ID (the hashed version of the value); and the Primary Index Row-ID (which locates the AMP and the base row). When a secondary index is created, the Teradata PE tells each AMP to hash the secondary index column value for each of its rows. Each AMP then places the hash in a secondary index sub-table along with the Row-ID that points to the base row where the desired value resides. Let's create a secondary index on our Best_Friends table.
The syntax to create a secondary index on the column Friend_Name in the table called Best_Friends is:
CREATE UNIQUE INDEX(Friend_Name) on Best_Friends;
The example above shows the theory behind creating a secondary index. There are four AMPs in this system. The base table is the Best_Friends table, seen near the top of each AMP's disk. We created a Unique Secondary Index (USI) on Friend_Name, and Teradata automatically created a secondary index sub-table on each AMP. Next, the AMPs "hashed" the secondary index values. These values went to the AMP to which they hashed, along with a pointer to the base row. The design is simplified for display purposes: a symbol represents the base row-id. For example, Ben Hon, who is Friend_Num 2, has a smiley-face for his symbol. Notice that in the Secondary Index Sub-table (located at the bottom of the AMP's disk) there is also a smiley face. Here is how the design works for retrieval. Let's look at how the following query plays out:
SELECT Friend_Num, Friend_Name
FROM   Best_Friends
WHERE  Friend_Name = 'Ben Hon';
The Teradata Parsing Engine takes the SQL and checks the syntax and security access rights. If all is well, the PE notices that the WHERE clause of the query asks WHERE Friend_Name = 'Ben Hon'. The PE recognizes that Friend_Name is a Unique Secondary Index. The PE will hash 'Ben Hon', and then use the hash map to find the AMP that holds 'Ben Hon' in its secondary index sub-table. As you can see, the AMP involved is number two (notice the smiley face on AMP 2). The PE instructs AMP 2 to retrieve the 'Ben Hon' row from its Secondary Index Sub-table. Once complete, Teradata can see the real row-id and find the base row. In our example, once the 'Ben Hon' Secondary Index Sub-table row is found, the row-id (the "smiley face" in this example) is revealed, and the PE can find the matching smiley face in the base table. This approach allows all USI requests in the WHERE clause of SQL to become two-AMP operations. A NUSI used in the WHERE clause still requires all AMPs, but the AMPs can easily check their secondary index sub-tables to see if they have one or more qualifying rows. Create secondary indexes only on columns used repeatedly in the WHERE clause of ongoing queries. Secondary indexes take up space and overhead, but boy can they speed up queries.
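For comparison, a Non-Unique Secondary Index (NUSI) is created the same way, just without the UNIQUE keyword. A sketch, assuming the Best_Friends table had a non-unique column such as City (a hypothetical column, added here only for illustration):

```sql
CREATE INDEX (City) ON Best_Friends;    -- no UNIQUE keyword, so this is a NUSI
```

Queries that then say WHERE City = 'Dallas' would use all AMPs, but each AMP can consult its NUSI sub-table instead of scanning the whole base table.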
Join Indexes
"A bend in the road is not the end of the road unless you fail to make the turn." A join is an SQL query that gathers its information from more than one table. Teradata can join up to 64 tables in a single query. Many databases can't handle join processing, so either the database is modeled in a dimensional fashion or summary tables are created. Teradata allows you to travel down a faster and straighter highway. Because data marts or summary tables can be an administrative nightmare, Teradata enables fast join access without requiring physical data marts. This is accomplished by creating a join index. When you create a join index, the tables involved are pre-joined: an actual table is built containing the joined data. The users don't ever query the join index directly. They run their normal joins, and the PE checks to see if the join can be satisfied by the Join Index table. If it can, Teradata will pull the data from the Join Index table. Best of all, once it is created, the data is automatically maintained as the underlying base tables are updated.
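In Teradata, the pre-join is defined with a CREATE JOIN INDEX statement. Here is a sketch using the Customer and Order tables from earlier in the chapter (the index name and column spellings are illustrative):

```sql
CREATE JOIN INDEX Cust_Ord_JI AS
SELECT  c.CustomerID
       ,c.CustomerName
       ,o.Order_Number
       ,o.Order_Date
FROM    CustomerTable AS c
INNER JOIN OrderTable AS o
   ON   c.CustomerID = o.CustomerID;   -- the pre-joined rows are stored and maintained by Teradata
```

From then on, a user's ordinary join of these two tables can be satisfied from Cust_Ord_JI whenever the PE decides it covers the query.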
Logical Picture of System Space The DBC is now ready to distribute space, but because DBC is so powerful this can be dangerous. What if the DBC user forgets the password? What if a disgruntled employee knows the DBC password and is looking for revenge? The DBC password must be protected and as a result, many companies create a new user called SYSDBA. This user owns about 80% of the space, while the DBC owns the remaining 20% that is allocated for the Data Dictionary and the Transient Journal (see Data Protection chapter). The DBC password can then be locked in a safe, and it is now up to the SYSDBA to distribute space.
The SYSDBA now owns 80% of the system space. The user does NOT have to be called SYSDBA. It could be called Morgan, or Tom, or anything. SYSDBA, however, is a standard name that most systems utilize. As you can see in the following picture, the DBC still owns about 20% of the total space. The user "SYSDBA" has given some space to a database called "MRKT", and to another one called "SALES". It has also given space to a user called "Morgan". NOTE: Morgan has given some of his space to Tom. Therefore, both Morgan and Tom can now own tables.
Logical Picture of System Space Remember either a database or a user can own space. What's the difference between a database and a user? That topic follows.
"Perm space" defines the upper limit of space that a database or user can use to hold tables, secondary index sub-tables, and permanent journals (see the protection features). "Spool space" defines the upper limit of space that a user has to run a query. When a user runs a query, the AMPs build the answer set in spool space. Once the query is done, the spool space is released. If the query exceeds the spool space's upper limit, the query aborts; the user is out of spool space. "Temp space" defines the upper limit that a user or database has to hold Global and Volatile Temporary tables. These tables will be discussed in another chapter. The SYSDBA knows that tenaciously holding onto its space will not provide any value to your company. A bank that holds onto all of its capital will not be successful. If it's destined for success, it will lend out its capital in the form of credit lines or mortgages. These actions provide the bank with a healthy profit. The SYSDBA likewise gladly gives up space to each new user or database in an effort to make the Teradata system profitable. SYSDBA gives out two kinds of space: Perm space and Spool space. When you receive a credit card from the bank, you are given an upper limit to your line of credit. In order to spend more than that limit, you must get approval from the bank. In the same way, the SYSDBA gives a new user an upper limit of space to use. When that amount is used up, the user must request an increase. Another way to free up some space is to drop some tables from the database. Perm space is actually used to store real data such as tables, views and macros. If you give some of your perm space to a child object, then you must subtract that same amount from the total perm space you own. Spool space is the area where AMPs temporarily place the answer to a query. Once the answer is delivered to the person making the query, the AMPs release that spool space to be used for another query!
Unlike perm space, spool space is not lost if it is given away. You can actually give users below you as much spool as you
would like, yet still have the original amount. Spool is like a speed limit on the highway. If your own speed limit is 65 mph, you can still allow every other driver to drive up to 65 mph. Some users may not receive perm space if their job is just to run queries, not create tables. These users will receive only spool. The following picture shows a logical view of a CustomerTable. Note: the table is stored in PERM space. When a user submits a query against this table, the answer is stored temporarily in SPOOL. When the query is completed, the answer is delivered to the user, and then the SPOOL is released. The next picture shows a logical Teradata system. In the PERM area there is a table called "Employee". This table has five columns: Emp, Dept, Lname, Fname, and Sal. The table has four employees. Notice the SQL statement at the bottom of the picture is asking to see all columns where the employee's department is equal to 10. To complete the query, the AMPs will read the rows of the table, and each time they find a row where Dept is equal to 10, a row is added to spool. Then, when the answer is returned, the spool is released.
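Perm and spool limits are handed out when a user is created. A sketch of how SYSDBA might create the user Morgan (the byte counts and password here are purely illustrative):

```sql
CREATE USER Morgan FROM SYSDBA AS
   PERM     = 1000000000   -- upper limit for tables Morgan will own
  ,SPOOL    = 500000000    -- upper limit for Morgan's answer sets
  ,PASSWORD = secret123 ;
```

The PERM amount is subtracted from SYSDBA's own perm space, while the SPOOL amount, like the speed limit above, can be granted without SYSDBA giving anything up.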
What is a View?
At Christmas time no one cares about the past or the future. All that matters is the present! One year, my wife and I were in New York City during the holiday season. We had always heard about how wonderful the window displays are in the large department stores. As we window-shopped, we got lots of ideas for gifts. We could see products displayed in the windows, but we could not actually touch them. We only had a pleasant view. Display windows are designed to show shoppers what store management wants you to see. In Teradata, a view is like a department store window because you can see selected portions of a table, yet you aren't able to see sensitive data. Instead, you can view data within your access rights and you determine what data portions you want others to see. Views are real sticklers for protecting sensitive data from inquiring eyes. For example, the Human Resources database might contain an employee table. Management can create a view of the table that hides the salary column, yet still allows an administrative associate to view names, phone numbers and department numbers of employees. In this scenario, the salary column is not shown. As a result, views are the best choice for protecting sensitive data.
Another benefit of views is that their definitions are stored in the Data Dictionary. When you select from a view of a table(s), the data is not stored on the disks, so it does not duplicate data and take up more space. In this scenario, you are looking at a filtered picture of the data.

The Employee Table

Emp  Dept  Lname     Fname   Sal
1    10    Johnson   Manny   100000
2    20    Carlsbad  Jan     77000
22   30    Winter    Steve   120000
25   10    Lester    Bonnie  ...
33   10    Samuels   Todd    ...
99   20    Walter    Misha   ...
The previous table shows the employee table. In nearly every company, employees are curious about the salaries of co-workers. Providing access to the employee table above will actually allow users to see everyone else's salary. To avoid disclosing salary information, a view should be created to limit certain columns and rows. It's simple to create a view:
CREATE VIEW EMPLOY_V AS SELECT Emp ,Dept ,Lname ,Fname FROM EMPLOYEE;
In the SQL statement above, salary is not selected. However, if users are denied access to the employee table but are given access rights to the EMPLOY_V view, there is enhanced security. With this restriction, no user can actually see the list of employee salaries. Perm space is required to create a table, but it is not needed to create a view. The definition of a view is stored in the Data Dictionary, which is maintained by the DBC. However, anyone can create a view, provided that person has the proper privileges. Once a view has been created, users can select data from the view. An example is:
SELECT * FROM Employ_V;
6 rows returned Emp Dept Lname Fname 1 10 Johnson Manny 2 20 Carlsbad Jan 22 30 Winter Steve 25 10 Lester Bonnie 33 10 Samuels Todd 99 20 Walter Misha
What is a Macro?
"The axe soon forgets, but the tree always remembers." Anonymous When you run specific queries often, or if you want to ensure you don't forget an SQL step you should use a macro. The user sometimes forgets, but the macro always remembers. A macro is a group of one or more SQL statements that are given a name and that are executed with a simple command. If there are multiple commands, Teradata treats them as one single transaction. In other words, either they all work or none of them work. Like views, the definition statement for a macro is stored in the Data Dictionary. If your manager asks you for three reports, he may want to know:
What employees are in department 10; What employees are in department 20; and A list of employee names sorted by last name.
A macro can easily be created to run all three commands. The syntax would be:
CREATE MACRO Emp_mac AS
( SELECT * FROM Employ_v WHERE Dept = 10;
  SELECT * FROM Employ_v WHERE Dept = 20;
  SELECT * FROM Employ_v ORDER BY Lname; );
Once the macro has been created and stored in the Data Dictionary, it's time for a test run. To run this macro, the user merely executes the SQL:
Execute Emp_mac;
Here is a handy reference chart that compares views with macros:

Views                                          Macros
We select from views.                          We execute macros.
Uses the keyword AS.                           Uses the keyword AS.
Definition is stored in the Data Dictionary.   Definition is stored in the Data Dictionary.
Accesses certain portions of the data.         Accesses the real data itself.
Is changed using the keyword REPLACE.          Is changed using the keyword REPLACE.
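The last row of the chart can be seen in action: to change either object, you simply re-submit its definition with REPLACE instead of CREATE. A short sketch using the objects created earlier:

```sql
REPLACE VIEW Employ_v AS
SELECT Emp, Dept, Lname, Fname
FROM Employee;

REPLACE MACRO Emp_mac AS
( SELECT * FROM Employ_v ORDER BY Lname; );
```

If the object does not exist yet, REPLACE creates it, which makes these statements safe to re-run.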
I taught in one place that was so rough that security actually checked me for weapons. When they found out that I had no weapons, they gave me some! Actually, on a recent consulting trip I was signed in each morning by a friendly security guard. This customer site had tons of highly sensitive data. As long as I stayed in my assigned work area, the guard and I got along just fine. However, as soon as I needed to move to a different room, someone had to accompany me and give me access. In Teradata, the Parsing Engine is the vigilant guard who never lets anyone get close to data if he or she doesn't have the right permissions. Every time an SQL request comes to the PE, it first checks the SQL syntax for validity. Its next step, every single time, is to see if the user has permission to perform the given operation on the specified Teradata object.
In the picture above, the DBC has Implicit rights on all databases and users. Plus, SYSDBA has Implicit rights on every user listed below him. MRKT has implicit rights over Mary, and Morgan has the same rights over Tom. Implicit rights simply means it is implied that those listed above you (in a hierarchy chart) can GRANT or REVOKE privileges on you. For example, if Tom or Morgan decides to give certain privileges to Mary, either could EXPLICITLY grant her those permissions.
In comparison, Automatic Rights means when Morgan created Tom he automatically received 20 access rights (on Tom), plus Tom was given 16 access rights on himself.
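Explicit rights are handed out with GRANT and taken back with REVOKE. A small sketch reusing the names from the hierarchy above (the table name is hypothetical):

```sql
GRANT SELECT, INSERT ON Morgan.Employee TO Mary;

REVOKE INSERT ON Morgan.Employee FROM Mary;
```

The Parsing Engine consults these recorded access rights on every single request before any AMP does any work.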
Data Protection
Overview
As a man was driving down the interstate highway, his cell phone rang. When he answered, he heard his wife warn him urgently, "George, I just heard on the news that there's a car going the wrong way on I-26!" George replied, "I'm on I-26 right now and it's not just one car. It's hundreds of them!" How do you protect your data when things go the wrong way? Murphy's law states, "The more mission critical a data warehouse, the more likely the system will crash at the most critical moment of the mission." Ironically, most DBAs think Murphy was an optimist. "Please sleep on it tonight, and if you wake up in the morning, let me know what you think." Morgan's Life Insurance Agent A database not prepared to defend itself is like an unsigned contract: it is not worth the paper it is written on. However, Teradata is always prepared, and it will protect your data better than a wild pit bull. As a matter of fact, the difference between Teradata and a pit bull is that eventually the pit bull will get bored and let go. System and user errors are inevitable in any large system. For example, an associate may accidentally give everyone a 100% raise instead of a 10% raise. Or what if a million-dollar transaction fails at the worst possible time? Or an AMP or a disk goes down? In any of these cases, Teradata has many ways to protect your data. Some protection processes are automatic and some are optional. The protection features we will discuss are:
Transaction Concept Transient Journal FALLBACK RAID Clustering Cliques Permanent Journaling
"Transaction Concept," which means that an SQL statement is viewed as a transaction. Simply stated, either it works or it fails. The Transient Journal's job is to ensure if things do fail, then the rows affected can be reverted back to their original state. In Teradata, all SQL statements are considered transactions. This applies whether you have one statement or multiple statements executing (MACRO). If all SQL statements cannot be performed successfully, the following happens:
The user receives immediate feedback in the form of a failure message; The entire transaction is rolled back, and any changes made to the database are reversed; Locks are released; and Spool files are discarded
Wouldn't it be great if every time you got a haircut, the barber or stylist took a picture of your hairdo before they cut a single strand? Then after he or she cut your hair, asked if you liked it? If you didn't like it, then you could ask to have it restored? Well, that is what the Transient Journal does. If a row is going to change because of an INSERT, UPDATE, or DELETE, it takes a BEFORE picture. If the transaction fails, then the journal restores the row to the way it was. The TRANSIENT JOURNAL is an automatic system function. It is not optional. The BEFORE image is actually stored in the AMP's Transient Journal. Every AMP has a Transient Journal that is maintained in DBC's PERM space. If the transaction is aborted for any reason, the AMP restores the data to match the before-image stored in the Transient Journal. The data then reverts to its original state. When a transaction is successful, the PE and the AMPs shake hands on it and the Transient Journal is wiped clean. The handshake is called the "COMMIT." After a COMMIT, all the AMPs have a party to celebrate, and the user is invited to join in the festivities! In other words, "Transient Journal cleanliness is next to godliness." If it is clean, then things went well!
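Grouping several statements into one all-or-nothing transaction can be sketched like this. In Teradata mode, BT and ET open and close an explicit transaction; the raise amounts below are hypothetical:

```sql
BT;  /* BEGIN TRANSACTION: before-images start accumulating */
UPDATE Employee SET Sal = Sal * 1.10 WHERE Dept = 10;
DELETE FROM Employee WHERE Emp = 99;
ET;  /* END TRANSACTION: the COMMIT, after which the Transient Journal is wiped clean */
```

If either statement fails, both are rolled back from the before-images in the Transient Journal, just as the haircut analogy promises.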
FALLBACK Protection
I asked my dentist if I had to floss all my teeth, and he responded, "No, just the ones you want to keep." "If you're not TRUE to your teeth, they'll be FALSE to you." Morgan's Dentist FALLBACK is a table protection feature used in case an AMP fails. You can use FALLBACK on all tables, some tables or no tables. When I asked my dentist if I should use FALLBACK on all tables, he responded, "No, just the ones you want to keep running when an AMP fails." Below is a four-AMP system and the Best_Friends table. In this example, data is spread evenly and the system is ready to run in parallel. It is brilliant, but vulnerable. What happens if we lose AMP one? We can no longer get to the Best_Friends rows containing "Ben Hon" and "Don Roy." FALLBACK, however, will correct this situation.
In the picture below, you can see the Best_Friends table and the FALLBACK protected rows.
In this picture, the BASE table Best_Friends is illustrated at the top of the disk and the FALLBACK rows are placed at the bottom of the disk. If we lose AMP1, then we can get "Ben Hon" from AMP2 and "Don Roy" from AMP4. Keep in mind, FALLBACK tables use twice as much disk space as NON-FALLBACK tables. In the picture above there were eight base rows in the Best_Friends table and eight Best_Friends FALLBACK rows. With FALLBACK, we can lose any AMP and still get to the data.

"You can't step into the same river twice." Heraclitus The data in a company's database tables is constantly changing, much like a flowing river. As every footstep really encounters a different river, likewise each update really makes a different table. That is why FALLBACK protection can be vital for mission critical tables. It actually allows the user to step into the same table twice, if necessary. If we can lose any one AMP/disk, what happens if we lose two? The chance of losing two AMPs in a four-AMP system is rare; however, some systems have nearly 2,000 AMPs, and the chance of losing two AMPs in a 2,000-AMP system is much greater than in a four-AMP system. That's why Teradata designed Clustering. Let's look at this next example with a little larger system:
Let's discuss the picture above in detail. This is an eight-AMP system: four AMPs are in cluster one and four AMPs are in cluster two. The base table Best_Friends (listed at the top of all disks) is spread evenly across all eight AMPs. Taking the Primary Index and running it through the hashing algorithm completes this allocation. Next, the output of the hashing algorithm points to a bucket in the hash map, and inside that bucket is the AMP number, the row's destination.

Notice the FALLBACK rows in this example. In the top cluster (cluster 1), FALLBACK rows are backups for the top cluster's base rows. In the bottom cluster (cluster 2), FALLBACK rows are backups for the bottom cluster's base rows. With this protection, WE CAN AFFORD TO LOSE ONE AMP IN EACH CLUSTER! The brilliance behind this protection is the hash map. There is a base-row hash map used to distribute the base rows, called the Primary Hash Map. There is also the Fallback Hash Map, which knows exactly how AMPs are clustered and which AMP should host a FALLBACK row.

In most systems, AMPs are clustered in groups of four. The next most popular clustering scheme is a group of three. The minimum number of AMPs per cluster is two, and the maximum is 16. Let's look at the extremes of both (two versus 16). The advantage of clustering in groups of two is that both AMPs would have to fail before the system stopped. The disadvantage is that if one AMP fails, the other must do its own work plus the work of the down AMP, so while that AMP is down, every complex query will take twice as long to process. The advantage of clustering in groups of 16 is that if one AMP fails, there are 15 other AMPs doing their own work and sharing the work of the failed AMP. The disadvantage is an increased risk of losing two AMPs in the same cluster. This is the reason four-AMP cluster configurations are so popular.
The chances of losing two AMPs out of four are quite low. However, if one AMP is lost, the other three will share in the extra work. FALLBACK is an optional means of protection specified at the database or table level. It may be requested when the table is first created, or you may add or drop FALLBACK at any time by using the ALTER TABLE command. (For more information, refer to Teradata SQL Unleash the Power by Mike Larkins and Tom Coffing).
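Adding or dropping FALLBACK after the fact is a one-line change, sketched here against the Best_Friends table from the examples above:

```sql
ALTER TABLE Best_Friends, FALLBACK;      /* start duplicating rows within the cluster */

ALTER TABLE Best_Friends, NO FALLBACK;   /* reclaim the duplicate disk space */
```

The same FALLBACK or NO FALLBACK option can be written into the CREATE TABLE statement itself, as shown later in the Permanent Journal section.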
Let's review FALLBACK and clarify related issues. When a new row is inserted into a table, FALLBACK always places a second copy of that row on another AMP in the same group, or cluster. Keep in mind that a cluster usually consists of four AMPs. From that point on, any manipulation of the data in the primary row also happens to the FALLBACK row. FALLBACK rows are distributed evenly across all the AMPs within the same cluster. If one AMP fails, processing continues, and all subsequent changes to that AMP's rows are applied to the FALLBACK copies.

FALLBACK provides an optional insurance policy for a failed AMP; however, there is a cost for that insurance. FALLBACK requires twice as much disk space to store both the primary and duplicate rows of a table. Another cost that should not be overlooked is twice the I/O (Input/Output) on inserts, updates and deletes, because there are always two copies to write. However, because Teradata AMPs operate in parallel, both rows are placed on their respective AMPs at nearly the same time.

Although FALLBACK may be created on any, all or no tables, its extra cost causes most companies to use it only for mission critical tables. As you might suspect, the Data Dictionary is automatically FALLBACK protected. FALLBACK may not protect your system from all failures, but it certainly is an excellent fault tolerant solution.
In the previous picture there are two clusters, but notice that AMP one has failed. After the failure, the other AMPs in the top cluster open the Down-AMP Recovery Journal (DARJ). Notice that none of the AMPs in the bottom cluster have the DARJ open. Why? Simply because the FALLBACK rows for the down AMP are housed within its own cluster. If anything happens while the "AMP is sleeping," it has three extremely cute ticket takers that will store all information pertaining to the down AMP.
RAID Protection

In the picture above, one AMP has one Virtual Disk, but that Virtual Disk is made up of four physical disks. Plus, each disk has a mirror in case a disk is lost. The four disks together form a "Rank of Disks". Two disks in a rank may be lost, as long as they are not a data disk and its mirror. In this example, the data from the Best_Friends table is displayed: it is on the first disk, and a mirrored copy of the information is on the second disk. If a disk goes down, the system does not even flinch. It sends the operations personnel a message about the failure and keeps on running.
Cliques
In high school you can walk into the cafeteria and immediately identify the cliques (pronounced "clicks"). In other words, they are groups of students that hang around together because they have formed a common identity and a common bond. The cliques in Teradata are similar to, yet different from high school cliques.
CLIQUES (pronounced "cleeks") in Teradata are a method of system protection against the failure of an entire node. Multiple processing nodes (SMPs) are not only connected with an unbroken line to their own disks, but are also connected with a dotted line to each other's disks. This shared disk arrangement forms a CLIQUE. If a node fails, then its virtual processors (AMPs and PEs) migrate to other nodes in its CLIQUE like birds flying south in winter. The receiving node now has twice as many VPROCs, so its performance slows down. The important factor is that the migrated VPROCs can still access their own disks, and business continues until the failed node is repaired or replaced.
The picture above shows two nodes. A node can be thought of as a powerful PC with four Intel processors. AMPs and PEs reside inside the node's memory; there are about 10-16 AMPs per node and two to three PEs per node. This configuration is a two-node, 32-AMP system. Let's focus on AMP 16 in node one and AMP 17 in node two (look at the arrows). AMP 16 has its own virtual disk and, similarly, AMP 17 has its own virtual disk. Remember, no AMP is allowed in another AMP's virtual disk. What if an entire node is lost? Well, then AMPs 1-16 cannot access any disks. To prevent this, let's create a clique in our next picture. The idea of a clique is to connect both nodes to one another's disks. That way, if either node goes down, its AMPs can migrate over the BYNET and join the other 16 AMPs in the surviving node's memory. However, each AMP will still have a connection to its original virtual disk.
In the illustration above, cables have been added. If node one or node two goes down, the AMPs can migrate to the other node and still have access to their own disks. The only difference is that the migrating AMPs now reside in memory on a different node, and they access their own virtual disks via a different physical cable. People who come from colder climates to spend their winters in sunny Florida are often called "snowbirds." Do you know what bird migrates farther than any other bird on the planet? It is the Arctic tern. This bird leaves its Arctic Circle home in August for its winter vacation home in Antarctica, a round trip of more than 11,000 miles! In the same way, when a node goes down, the software AMPs and PEs migrate over the BYNET to a temporary "home" on another node.
Permanent Journal
"The absent are always in the wrong." English Proverb If a system had five million rows and used FALLBACK protection, then it would have five million FALLBACK rows. However, this would be quite costly because FALLBACK actually stores a duplicate copy of all the rows on other AMPs within the same cluster. FALLBACK is used either because the system is mission critical or the system is not backed up regularly. For customers who backup data regularly, another option for data restoration is the "Permanent Journal." When a company is not severely impacted by a couple of hours for a restoration to be completed, this is a very good option. The Permanent Journal works in conjunction with backup procedures, plus it's a lot more cost effective than FALLBACK. The Permanent Journal stores only images of rows that have been changed due to an INSERT, UPDATE, or DELETE command. It keeps track of all new, deleted or modified data since the last Permanent Journal backup. This option is usually less expensive than storing the additional five million FALLBACK rows. Like FALLBACK, the Permanent Journal is optional. It may be used on specific tables of your choosing or on no tables at all. It provides the flexibility to customize a Journal to meet specific needs. The Permanent Journal must be manually purged from time to time. There are four image options for the Permanent Journal: 1. The BEFORE JOURNAL stores an image of a table row before it changes. It is used to perform a manual rollback to a specific "point in time" should there be a programming error. 2. The AFTER JOURNAL stores an image of a table row after it changes. It is used to manually roll forward from a specific "point in time". 3. A DUAL BEFORE JOURNAL captures two images of a table row before it changes. This type of journal stores the duplicate images on two different AMPs. 4. A DUAL AFTER JOURNAL captures two images of a table row after it changes and stores those images on two different AMPs.
In order to explain journaling, let's say that the Customer Representative table is created with a BEFORE Journal. After it's created, a programmer is told to move every Customer Representative from the Western Region to the newly designated Southwest Region. However, every representative from every region is accidentally transferred to the Southeast Region. Because there is a BEFORE Journal, a programmer has the ability to manually rollback the data to the specific point in time BEFORE this update occurred. Note that this was not a transaction failure. The update was successful but it was not accurate. The BEFORE Journal saves the day! The AFTER JOURNAL works in the opposite way. In this scenario, company officials decided not to use FALLBACK on any tables. The data was not mission-critical, and it could be restored from backup tapes if necessary. A FULL SYSTEM BACKUP takes place on the first day of each month. Plus, an AFTER JOURNAL has been placed on all the tables in the system. Every time a new row is added or a change is made to an existing row, Teradata captures the AFTER image. Suppose a hardware failure occurs on the 5th day of the month and data is lost. To recover the data, the hardware problem should be fixed, and then the data should be reloaded from the FULL SYSTEM BACKUP done on the 1st of the month. The AFTER JOURNAL is then used to capture the transactions that either added or modified data between the 1st and 5th day of the month. As you can see, an AFTER JOURNAL is used to roll forward and is usually done to restore data lost as a result of a hardware problem. The following example shows the use of FALLBACK and the PERMANENT JOURNAL:
CREATE TABLE TomC.Employee, FALLBACK,
BEFORE JOURNAL, DUAL AFTER JOURNAL
( emp        INTEGER
 ,dept       INTEGER
 ,lname      CHAR(20)
 ,fname      VARCHAR(20)
 ,salary     DECIMAL(10,2)
 ,hire_date  DATE FORMAT 'YYYY-MM-DD' )
UNIQUE PRIMARY INDEX(emp);
The example above created the table called "Employee" in the TomC database, and it is FALLBACK protected. A BEFORE Journal and a DUAL AFTER Journal are specified. Remember that both FALLBACK and JOURNALING default to "NO," meaning that if you don't specify this protection at either the table or database level, the default is NO FALLBACK and NO JOURNALING.
Teradata allows hundreds, even thousands, of users to access the data warehouse concurrently. However, there would be a lot of confusion about which user had access to a table first if it were not for the LOCKING MODES. No one likes to wait a long time in a line only to have someone cut in front of him or her. Teradata uses LOCKS to help maintain data integrity. Locks are activated on the targeted database, table, or row while the SQL request is executed. Those locks are released upon query completion. There are four modes of locking:

1. The EXCLUSIVE LOCK is the mother of all locks. It's placed only on databases or tables, and it restricts access to them whenever a structural change is made. EXCLUSIVE LOCKing reminds me of what happens when there is a structural change being made to a parking garage. A construction company will wrap what seems like thousands of yards of bright orange plastic fencing around the garage in order to keep people out and protect them from falling debris. To this day, I have not seen a database or table fall on top of a user! The EXCLUSIVE LOCK prevents any access, period.

2. The WRITE LOCK jumps into action whenever a user asks for an INSERT, DELETE, or UPDATE. Keep in mind, these commands are writing actions. No other EXCLUSIVE, WRITE, or READ locks can cut in line ahead of an existing WRITE LOCK. The only exception is an ACCESS LOCK, one that allows a user to read data that may not be totally accurate due to modifications being made at the time it is accessed. This kind of read is called a "stale" or "dirty read."

3. Everybody loves the READ LOCK. It's placed whenever the SELECT command is used. With a READ LOCK, a thousand users can simultaneously SELECT from a table. A READ LOCK will prevent either an EXCLUSIVE or a WRITE LOCK from jumping ahead of it in the queue.

4. When a user is not concerned with precisely accurate data, he or she may request an ACCESS LOCK. This lock can jump in line ahead of either a READ or a WRITE LOCK, but not an EXCLUSIVE LOCK.
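A user asks for the lenient ACCESS LOCK explicitly by placing a locking modifier in front of the query. A small sketch against the Employee table used earlier:

```sql
LOCKING TABLE Employee FOR ACCESS
SELECT * FROM Employee;
```

Because this query does not wait behind WRITE locks, it may return a "dirty read" if another session happens to be updating rows at that moment.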
Referential Integrity
Just how important is it to protect the integrity of your data? This story says it all: After reading an advertisement offering split, dry firewood for $60 a cord (including delivery), Jeff decided to place a phone order. Upon delivery, Jeff was upset when the deliveryman finished stacking the wood. Jeff objected, "That's not a full cord of wood!" "Well, that's what I call a cord," the man answered firmly. Grudgingly, Jeff pulled some money out of his pocket and thrust it into the man's hands. "Hey, just a minute," the man said after counting the money. "You only gave me $30!" Jeff shrugged his shoulders and replied, "Well, that's what I call $60."

Imagine getting fired from your job, and the company deletes you from its employee table but forgets to delete you from the payroll table. That's not like getting fired; it's more like getting fired up for a Bahamas vacation. Referential Integrity would have stopped this oversight. RI, as it is called, would not allow anyone to be deleted from the employee table unless he or she was also deleted from the payroll table. REFERENTIAL INTEGRITY (RI) is the relational concept that mandates that a row cannot be inserted into a table unless the value in its referencing column also exists in another table within the database. Conversely, a row with a corresponding value in another table may not be deleted unless the common value is first removed from the referencing table. An important function of RI on a newly created table is that it will not allow invalid data values to be entered into a column. If RI is added to an existing table that contains RI violations, the ALTER TABLE will still proceed; however, Teradata copies the violating rows into an error table for review and correction. The user will then need to locate that error table and make the corrections to the original table.
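In SQL terms, RI is declared as a FOREIGN KEY constraint. A hedged sketch using the employee/payroll scenario above (the constraint, table, and column names are hypothetical):

```sql
ALTER TABLE Payroll
ADD CONSTRAINT fk_emp
FOREIGN KEY (Emp) REFERENCES Employee (Emp);
```

With this constraint in place, no one can be deleted from the Employee table while a matching Payroll row still exists, which is exactly the oversight the story warns about.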
Fastload
Fastload is designed to load flat-file data from a mainframe or LAN directly into an empty Teradata table. This is how a Teradata table is populated the first time. I have personally seen Teradata load over one billion large rows in less than six hours. Plus, I have seen Teradata load millions of rows in minutes. Teradata has the quickest time to solution and the most powerful performance in the data warehousing industry. How are Teradata's speed and performance accomplished? Through parallel processing. Fastload understands one SQL command - INSERT. It inserts rows into an empty table. The process is as follows: A flat file is prepared for loading on a mainframe or LAN. The FASTLOAD utility needs three pieces of information to run: where the flat file is located, what its file definition is, and what Teradata table the data should be loaded into. When the Fastload utility starts, the Parsing Engine comes up with a plan for the AMPs. The Parsing Engine then steps back and lets the AMPs do their work. The data is loaded in large 64K blocks; each AMP is given a 64K block of rows for loading. Like a line of workers passing sandbags to prevent a flood, Teradata passes these blocks from AMP to AMP until all the data is on Teradata. Next, all AMPs take the blocks they received, hash the rows in those blocks (in parallel), and send the rows to the proper AMP over the BYNET. Once this is done, each AMP sorts its data by Row ID, and the table is ready for business. Fastload Basics:
Loads data to Teradata from a Mainframe or LAN flat file; Only one table may be loaded at a time; The table to be loaded must be empty; There can be no secondary indexes, referential integrity, or triggers;
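A minimal Fastload script might look like the sketch below. The logon string, file name, and record layout are hypothetical; the commands (LOGON, BEGIN LOADING, DEFINE, INSERT, END LOADING) are the utility's own:

```sql
LOGON tdp1/tomc,mypassword;

BEGIN LOADING TomC.Employee
      ERRORFILES TomC.Emp_err1, TomC.Emp_err2;

DEFINE emp   (INTEGER)
      ,dept  (INTEGER)
      ,lname (CHAR(20))
      ,fname (VARCHAR(20))
      ,sal   (DECIMAL(10,2))
FILE = employee.dat;

INSERT INTO TomC.Employee
VALUES (:emp, :dept, :lname, :fname, :sal);

END LOADING;
LOGOFF;
```

The two error files catch rows that fail conversion or end up as duplicates, so a bad record never silently disappears.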
Multiload
Where Fastload is meant to populate empty tables with INSERTs, Multiload is meant to process INSERTs, UPDATEs, and DELETEs on tables that have existing data. Multiload is extremely fast. One major Teradata data warehouse company processes 120 million inserts, updates, and deletes during its nightly batch. Multiload works similarly to Fastload. Data originates as a flat file on either a mainframe or LAN. When the Multiload utility is executed, the Parsing Engine creates a plan for the AMPs to follow. The data is then passed to the AMPs, in parallel, in 64K blocks, and the AMPs hash the rows to the proper AMP. Last, the INSERTs, UPDATEs, and DELETEs are applied. In the previous diagram, the mainframe/LAN is talking to the Parsing Engine. The PE passes the data across the BYNET for the AMPs to retrieve. Keep in mind, many systems have hundreds to thousands of AMPs. The load takes place continually, in parallel, as the 64K blocks are delivered to the AMPs. Multiload has been designed for users who have a "need for speed". Multiload locks at the table level; therefore, while Multiload is running, the table is unavailable. Multiload Basics:
Loads data to Teradata from a Mainframe or LAN flat file; Up to 20 INSERTS, UPDATES, or DELETES may be executed on up to 5 tables; Receiving tables are usually populated; There can be no Unique secondary indexes, referential integrity, or triggers; It doesn't support Multi-set tables; and It locks at the table level.
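A skeletal Multiload script, under the same caveats as the Fastload sketch (the logon string, file, layout, and label names are all hypothetical):

```sql
.LOGTABLE TomC.Emp_log;
.LOGON tdp1/tomc,mypassword;

.BEGIN MLOAD TABLES TomC.Employee;

.LAYOUT emp_layout;
.FIELD emp  * INTEGER;
.FIELD sal  * DECIMAL(10,2);

.DML LABEL raise_dml;
UPDATE TomC.Employee SET salary = :sal WHERE emp = :emp;

.IMPORT INFILE employee_changes.dat
        LAYOUT emp_layout
        APPLY raise_dml;

.END MLOAD;
.LOGOFF;
```

The LOGTABLE is Multiload's restart checkpoint: if the job dies mid-batch, it can pick up where it left off instead of starting over.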
Tpump
The Tpump utility is designed to allow OLTP transactions to immediately load into a data warehouse. When I started working with Teradata, more than 10 years ago, most companies loaded data on a monthly basis.
Suddenly, companies began to load data weekly. Today, most companies load data nightly, and industry leaders are loading data hourly. Tpump is the first step toward an "Active Data Warehouse" (ADW). An ADW combines OLTP transactions with a Decision Support System (DSS). "You don't drown by falling into the water; you drown by staying in the water." Edwin Louis Cole If the data is not flowing, a company can drown in it! The utility is called Tpump because it theoretically acts like a water faucet. Tpump can be set to full throttle to load millions of transactions during off-peak hours or "turned down" to trickle small amounts of data during the data warehouse rush hour. It can also be preset automatically to load at different levels at certain times during the day, and it can be modified at any time. Also, Tpump locks at the row level, so users have access to the rest of the rows while the table is being loaded. Tpump Basics:
Loads data to Teradata from a Mainframe or LAN flat file; Processes INSERTS, UPDATES, or DELETES; Tables are usually populated; It can have secondary indexes, triggers, and referential integrity; It doesn't support Multi-set tables; and It locks at the row level.
Parallel processing for unlimited performance;
Unlimited scalability of data, users, and applications;
Ability to answer extremely complex queries;
Ease of setup and maintenance;
Only one DBA needed;
Ability to load data at lightning speeds from a mainframe or LAN;
Ability to answer any question on any data without any DBA intervention or tuning; and
Performance capabilities to model detail data in 3rd Normal Form or Dimensional Models.