
The Agile House of Straw

30 March 2011 by Tony Davis


The ideal Agile application developer welcomes changing requirements, even late in development. The DBA or Database Developer doesn't. Why is that?

You can't create complex databases in the Agile way, by breaking tasks into small increments, with minimal planning. Building a database that will perform quickly, reliably and securely over time, as it grows, is more like building a house: it involves architecture and has to be planned in detail beforehand. The tables, their data types, their key structure, constraints, and the relationships needed between the various tables, need to be thought out before you start the first cut. The database structure then needs to be loaded with realistic transactional data and rigorously tested to make sure that the expected results are always returned in the various reports. It needs to be stress tested under load to ensure adequate performance, smooth scalability, and that there are no data integrity issues caused by predicted levels of concurrent access/modification of the data. (If you think SQL Server automatically protects you from such eventualities, it doesn't, as Alex Kuznetsov proves in his Defensive Database Programming book.)

When a database architect finally gets all this right, it's only with the greatest reluctance that any changes to the database structure are allowed. Sure, you can make changes, but it is far trickier to do. Every change necessitates another round of regression, stress testing and so on. This need to test data structures plus data is the critical difference between database and application testing. Data retrieval logic is not enshrined in the SQL code; the database engine decides how to do it based on a range of factors including table/index structure, the data that is stored, its volume, distribution, and so on. This means that, often, each round of testing must use not just "realistic" data but also consistent data, in terms of the data types, ranges of values, and distribution of data within those ranges. The test data must stay the same between builds so that you can compare aggregations to ensure that nothing is broken, and get relative performance figures.

All of this makes Agile database development hard. This becomes a severe problem when the database development becomes a full participant in the Agile process. There is no easy way to accommodate the evolutionary changes that some developers would like to freely make part of an Agile sprint, without destroying the resilience, maintainability and performance of the database. After all, you wouldn't build a house by an evolutionary process, would you?

Cheers, Tony.


Life After Retirement: Replacing Visual SourceSafe


31 March 2011 by John O'Neill, SourceGear
Source control systems aren't exciting, and they don't come less exciting than Visual SourceSafe. Developers continue to use it but Microsoft will soon be retiring the product officially. What's the best strategy then? TFS? Not if you are looking for the most painless approach.

Software changes. As facts of life go, that one's pretty cut and dried, and most of us are used to it.
But sometimes software that's central to the core of your business, such as the version control solution that safeguards your entire product line, abruptly changes, and suddenly you need to rethink the way you develop, store, and plan your code. These are the kinds of decisions that make you wish you'd listened to your Mom and studied dentistry.

Early last year, Microsoft announced it was retiring support for Visual SourceSafe (VSS), one of the most popular source control tools ever created. Development had gradually been winding down for years, as Microsoft shifted strategy to a more ambitious money-maker, Team Foundation Server (TFS). But the install base of VSS remained huge, and the switchover to TFS, especially among smaller development shops that were clearly not the target for an enterprise tool like TFS, hasn't been as rapid as Microsoft may have hoped. So to signal users that now is the time to make a switch, Microsoft will stop supporting VSS in June 2012, officially rendering the product obsolete. For the tens of thousands of users who still rely on it, this is a little like being told their pacemaker will stop working just as bikini season opens.

What's the right thing to do? If you're like many of those users, your strategy has been to delay. Let's face it, VSS may have its warts, but up to now it's done the job. But while it may be okay to rely on that old VHS player in your basement, rather than upgrade all those Disney videos to Blu-ray, keeping an obsolete solution at the heart of your corporate development process is a Bad Idea. No support means no new bug fixes, no upgrades when the latest version of Windows rolls around, and no tech support. As every good IT guy will tell you, the key to a good Mission Critical Solution is having a vendor you can get out of bed at midnight when disaster strikes. Bluntly, if you haven't done so already, the time to replace Visual SourceSafe is now. There's a wide variety of modern source control tools to choose from, many with significant advantages over the way you're doing things today, and skilled professional help for a smooth transition.

To understand the choices we're faced with today, it pays to look at how we got here. In the early 1990s One Tree Software revolutionized version control with SourceSafe. Source control was still relatively new as far as established procedures went, but when Microsoft bought One Tree in 1994 and put real marketing weight behind SourceSafe, all that changed. SourceSafe was revolutionary in many ways, especially its great Windows support and friendly user interface. Just as importantly, it was distributed by Microsoft as part of nearly every MSDN subscription, which put it into the hands of many developers who had never used source control. It won hearts and minds and quickly became the de facto standard.

From a modern perspective, SourceSafe had numerous problems. Perhaps chief among these were its lack of atomic check-ins and its reliance on Windows file sharing, both of which contributed to gradual data corruption. If a check-in fails for any reason, atomic check-ins make sure no files are changed, ensuring the repository is not left in a partially updated state. Without them, the next user can grab a mix of old and new files. Before you know it, a simple network hiccup can lead to a rapidly propagating version control nightmare. SourceSafe was also designed before the explosion in internet usage, meaning it wasn't built around a modern client/server model.
That left users who wanted to access code remotely out in the cold. My company, SourceGear, got its first big break when one of our developers here in rural Illinois, tired of being unable to check in code at home where he could feed his chickens, wrote a remote client for SourceSafe in 1997. It worked so well his chickens grew fat and happy. Plus, we were able to turn his chicken-friendly add-on into our first commercial software hit, SourceGear SourceOffSite, the first tool that allowed developers to access a Visual SourceSafe database over the Internet. It still sells well for us today, over a decade later. SourceOffSite was our first toe in the water of the SourceSafe ecosystem, but pretty soon we were clutching an inflatable duck and captaining a life raft. To fully service our customers, we had to become experts in the inner workings of VSS, and especially the kinds of problems that caused the most heartache. Over the last decade we've assisted countless customers to diagnose troubled and corrupted VSS databases.

He Didn't Jump, He Was Pushed


As a consequence of Microsoft retiring support, there is more urgency to migrate away from VSS than ever before, but also more and better source control options for you to choose from. Frequently our customers come to us looking for a painless rollover. They simply want to be up and running again with a new source control solution, with the shortest amount of downtime.

That's understandable and, with the right VSS replacement and technology partner, entirely achievable. I'll discuss some of the best options for you to get underway with a VSS replacement quickly in the next section. But at the risk of being predictable, I'd suggest you take a minute to look past your immediate pain, and think a bit about what you want your version control solution to be two, five, even ten years from now. Considering how many customers we have who've been using VSS for over a decade, that's not as far-fetched as it may seem.

For example, one recent trend has been towards products with integrated source control and bug tracking, which many have found not only makes tasks easier, but also helps dev teams standardize on one tool and work a little more collaboratively. Most of us are still waiting to see the benefits of ALM (Application Lifecycle Management), with its promise of integrated dev, test, and build environments (and associated Cadillac price tag), but the marriage of version control and bug tracking makes sense.

Another recent trend is the move towards distributed version control solutions (DVCS). A DVCS differs from traditional version control in that it has no central server, and users can share change-sets peer to peer. It can bring with it a sometimes radical change in development methodology, particularly for those who approach a merge with all the enthusiasm of a root canal, but it can also bring tremendous benefits. Unfortunately, virtually all DVCSs on the market today, including Mercurial and git, are not ready for the larger corporate market, lacking in things like user accounts, file locks, and an enterprise license agreement.

In the absence of a solid DVCS choice, most customers we work with today choose one of three products: Subversion, Microsoft Team Foundation Server, or our own VSS replacement, SourceGear Vault Pro. All three are solid choices, but they are by no means the only ones. For now, I'd like to focus on the issues involved in migrating away from VSS to a modern replacement.

The Big Move


The move to a new version control solution is a little daunting for most teams. It costs money, it's risky, it upsets established routines and, worst of all, it means developer downtime. And even with the most carefully planned move, the amount of downtime can be unpredictable. Even so, none of that is as bad as you think. There are modern version control solutions for every budget, from high-end systems designed for global enterprises (like TFS) to free open source solutions (like SVN and Mercurial). A well-planned move and common sense backups will help reduce risk. And perturbing the way your developers work can be a good thing, especially if it means they start working more efficiently.

For most teams the real hurdle is developer downtime, not just the time required to prepare and import a 30 gigabyte source code repository, but the effort it takes to train a team on the new tool. For most, downtime of more than a few days is ultimately far more costly than writing the check for the software, or professional services during the install. Fortunately, there's good news here as well. With the right approach, switching can be a lot more painless than you think.

First, you need to get past the fact that there is probably corruption in your VSS database, possibly substantial corruption. For most of our customers, the import process is the first time they see the extent of the damage to their repository. Frequently, this induces panic. What does this mean for the integrity of their source code? For future development? Usually the panic is unjustified and, regardless of the degree of the damage, it's always better to address the problem sooner rather than later. Once they finish the import and resume working, the panic subsides for the vast majority of our customers, and they quickly move forward with a solution that prevents any further data loss.

Most solutions also have VSS import tools. We've spent years refining our import tool for SourceGear Vault, and it's obvious the competition has done the same. Most of them will give you a good indication of how long the import will take, with steady updates along the way. There are also alternatives to doing a full import. After years of helping customers with imports, we explored one alternative with Vault's VSS Handoff tool. VSS Handoff is essentially a continuous window into your VSS repository. It imports the latest version of every file and gets you up and running immediately, while still maintaining links to your old history, shares, and pins in the SourceSafe repository. It's one way to do a safe VSS import in minutes, and can have you underway with your new version control solution in hours, even with the largest repositories.

Finally, if you do run into problems, assistance is available. At SourceGear we've helped numerous companies untangle thorny VSS import problems, and we're not alone. Many IT consulting shops have also developed solid expertise in VSS, and can help you get through the tough spots with a modest number of billable hours. Ultimately, the technical details (what import approach to take, and the best way to deal with problems) will depend on what replacement tool you've chosen. Since we can't really avoid talking about that, and since SourceGear is also a player in the space, I've decided to discuss only those VSS replacement solutions we usually recommend, and focus chiefly on the strengths of each. Comparative information on other tools is obtainable on a number of blogs and commercial websites.
The three replacement tools we most commonly recommend for VSS users are TFS, SVN, and Vault Pro. Each is a solid choice, with its own strengths and weaknesses. The questions we're most frequently asked are, "Which of these will allow me to get up and running the fastest?" and "Which has all the features of VSS, including Share and Pin?" With that in mind, here's a high-level feature comparison chart, starting with the key features of VSS that users frequently miss the most when they adopt a new tool (such as Share, Pin, and Shadow Folders), and continuing with the new features that gain them the most (such as shelve and integrated bug tracking).

Feature Comparison of Popular VSS Replacement Tools

Microsoft TFS is an excellent choice, especially for large teams. It's Microsoft's future, and they've made a substantial bet on its success. That means it's well funded, with some nice reporting tools and a lot of nifty new features such as Branch Visualization. It also scales well, handling teams of 1,000 developers or more better than any other tool on this list.

Subversion is another popular choice, and for good reasons. Unlike TFS, it's completely free and has more modest install costs. Many larger companies that typically steer away from open source have embraced it, partly because of its excellent community support and reputation for reliability. While its cross-platform support was once criticized as shaky, this has improved significantly in the past five years.

SourceGear Vault Pro was designed from the ground up as a VSS replacement tool, with a deliberately familiar user interface and support for all of VSS's features, including Share, Pin, and Shadow Folders. SourceSafe users can transition to Vault and work the way they always have, with less time lost on the learning curve. Vault is the only tool to offer the VSS Handoff feature, which gets you up and running immediately, cutting actual downtime to a few hours.

Which tool you choose, and what strategy you use to minimize the risk and downtime associated with switching version control solutions, will ultimately be dictated by your unique setup and requirements. But I hope I've illustrated that there is a VSS replacement option that will fit your needs, and that there's no reason to put it off any longer. The time to upgrade is now. Try SourceGear Vault for free, the only tool with VSS Handoff.


A look at Microsoft Lync 2010


22 March 2011 by Johan Veldhuis
Lync manages office communications such as voice calls, video calls, instant messages, meetings, and shared whiteboard sessions, all from a single interface. If this sounds like OCS, then you're right: it is the new, improved version. Johan explains the background, why it is better, and how you can get started with it.

Microsoft Lync 2010 is the replacement for Office Communications Server (OCS). Lync, just like OCS, manages various forms of communication from a single user-interface: voice calls, video calls, instant messages, meetings, and shared whiteboard sessions. It works with Office, SharePoint and Exchange, and it makes it possible to converse with other people, such as the users of Windows Live Messenger.

History
On the 22nd of March 2010, at VoiceCon, Microsoft officially announced Lync, at that moment called Wave 14. Some features were presented to the public: Call Admission Control, Location Awareness, Branch Office Survivability and tighter integration with SharePoint, Office and Exchange. In the background, a Technology Adoption Program (TAP) had already started with several customers and partners. In June 2010 Microsoft gave a lot more information at TechEd North America, which was held in Orlando. Many presentations were given that included demos; these were then published on TechEd Online. On the 13th of September Microsoft released the release candidate (RC) of Communications Server 14, also known at that time as Wave 14.

Rebrand
Communications Server 14 and Wave 14 were just code names for the new version. A number of rumors about the new name spread around the internet: among the possibilities that were mentioned were Office Communications Server 2010 and Communications Server 10. Together with the release candidate (RC), Microsoft also revealed the official new name of the product: Lync. Some people might think "Why Lync?" That was my reaction too. On this TechNet blog, Kirk Gregersen, Senior Director of Microsoft Communications, explains how they chose the name. Lync is a combination of "link" and "sync", and was chosen because Microsoft wanted a new name that reflected the major transformation of the product. On the 17th of November Lync reached General Availability (GA) status. This table gives the new names of all the members of the product family:
2010 Release:
- Product Family: Microsoft Lync
- Server: Microsoft Lync Server 2010
- Client: Microsoft Lync 2010
- Web Client: Microsoft Lync Web App
- Service: Microsoft Lync Online

2007 Release:
- Product Family: Microsoft Office Communications
- Server: Microsoft Office Communications Server 2007 R2
- Client: Microsoft Office Communicator 2007 R2
- Web Client: Microsoft Office Communicator Web Access
- Service: Microsoft Office Communications Online

Because it would create an enormous article to list all the new features, we will now have a look at just a few of the more interesting ones offered by Microsoft Lync Server 2010.

Central Management Store


In Office Communications Server 2007 R2, the configuration was stored in several places: Active Directory, a SQL database, and locally on each server. Starting with Lync, a Central Management Store (CMS) is used to store the configuration for all server roles. To be more specific, the following information is stored in the database:

- Configuration of the Lync Server components
- Policies of the Lync Server components
- XML documents containing the deployment topology

In the case of an Enterprise pool, this database is placed on a dedicated SQL Server. When using a Standard pool, a local installation of SQL Server 2008 Express is installed on the first Front End Server and the Central Management Store database is stored on that server. Each server in a Lync environment contains a replica of the Central Management Store. The advantage of this is that servers can continue to work even when the connection to the Central Management Store has been lost.

But how does Lync keep all these databases up-to-date? When a configuration change is made, the changes are first applied to the Central Management Store and after that to the other databases. Before the changes are distributed to the other databases, they are all verified. The changes are then replicated as read-only data to all other servers, including the Edge. In OCS 2007 R2 the Edge server did not share its configuration with any other server but stored it locally.

The update process can be divided into the following steps:
1. The administrator makes a change to the current configuration using the GUI or PowerShell.
2. The Master Replicator generates a snapshot containing the new configuration.
3. The File Transfer Agent distributes the snapshot to all other servers in the Lync environment.
4. The Local Replicator is notified about the new snapshot, applies the changes, and sends a status update to the CMS.
5. The replication status is sent back to the master, and the Master Replicator updates the status of the server.

The replication traffic between the Edge server(s) and the CMS is carried over HTTPS. If security policies do not allow this, a manual update will need to be performed every time the configuration is changed.
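As a rough sketch of what this looks like from the Lync Server Management Shell (the server FQDN and file path below are placeholders), you can check the state of each replica, force a replication cycle, and fall back to a manual export/import for an Edge server that is not allowed to replicate over HTTPS:

# Check whether every server holding a replica of the CMS is up-to-date
Get-CsManagementStoreReplicationStatus

# Check a single server (placeholder FQDN)
Get-CsManagementStoreReplicationStatus -ReplicaFqdn "edge01.contoso.com"

# Force a replication cycle if a replica reports UpToDate : False
Invoke-CsManagementStoreReplication

# Manual alternative for an isolated Edge server: export the configuration on a
# Front End Server, copy the file across, then import it locally on the Edge
Export-CsConfiguration -FileName "C:\CsConfiguration.zip"
Import-CsConfiguration -FileName "C:\CsConfiguration.zip" -LocalStore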

Management utilities
Lync has two management utilities: Lync Server Windows PowerShell and the Lync Server 2010 Control Panel. The PowerShell modules can be used for several tasks and are the equivalent of the Exchange Management Shell in Exchange 2010. The GUI uses Silverlight, which means that one of the requirements for using the tool is the latest version of Silverlight. If it isn't detected during startup, the tool will display both a warning and a link to download Silverlight. So there is no MMC for Lync anymore.
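As a small sketch of what working in the shell looks like (assuming you are on a server where the Lync module is installed; the account, pool and SIP address are placeholders):

# Load the Lync module in an ordinary PowerShell session
Import-Module Lync

# See how many cmdlets the module exposes
(Get-Command -Module Lync).Count

# A typical task: enable an existing AD user for Lync
Enable-CsUser -Identity "CONTOSO\jveldhuis" -RegistrarPool "pool01.contoso.com" -SipAddress "sip:jveldhuis@contoso.com"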

Role-Based Access Control


Just like Exchange 2010, Microsoft Lync Server 2010 now also contains Role-Based Access Control, or RBAC for short. Using RBAC, administrative privileges can be granted to users through pre-defined administrative roles. Depending on the role assigned, a user can perform tasks using either the Lync Server Management Shell or the Lync Control Panel. Lync contains 11 predefined RBAC roles. Besides these predefined roles, you can create custom roles yourself. A complete overview of the predefined RBAC roles and the associated cmdlets can be found here.
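A quick sketch from the management shell (the account name is a placeholder) shows how to inspect what the predefined roles allow and which roles a given administrator holds:

# List the predefined RBAC roles and the cmdlets each role may run
Get-CsAdminRole | Select-Object Identity, Cmdlets

# Show the roles assigned to a particular administrator
Get-CsAdminRoleAssignment -Identity "jveldhuis"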

Virtualization Support
A change to the virtualization support for Lync is very welcome, because a lot of customers already use, or are starting to use, virtualization technologies. Microsoft made a lot of modifications to the support policy compared to OCS 2007 R2. The following Lync Server environments are now supported using virtualization technologies:

- Standard Edition server topology, for proof-of-concept, pilot projects, and small businesses. This topology supports up to 2,000 users per virtual Standard Edition server.
- Data center topology, for larger deployments. This topology supports up to 5,000 users per virtual Enterprise Edition Front End Server.

At this moment only Windows 2008 R2 Hyper-V and VMware ESX 4.0 are supported. All server roles will need to have Windows 2008 R2 as the Operating System (OS).

Enhanced Voice Resilience


Each user that signs in authenticates with a registrar, which provides authentication and routing services. The registrar is installed on each Front End Server and Branch Appliance. In Lync you can configure a primary and a secondary registrar. By creating the secondary registrar you create a backup registrar, which is used by the user in case the primary registrar fails. If you have several branch offices you might not want to deploy a complete Lync environment to offer the Enterprise Voice functionality. Microsoft introduced two options for this specific case:

- Survivable Branch Appliance (SBA): the appliances are offered by several vendors, including AudioCodes, Dialogic and HP. The appliance contains both a server and a PSTN gateway.
- Survivable Branch Server (SBS): the SBS is just like a default Lync Server and needs to be connected to a gateway or SIP trunk to offer Enterprise Voice functionality.

Both the SBA and SBS can be configured as the backup registrar for the users in the branch office. If a WAN link failure occurs, users will re-register with the backup registrar and can continue to use basic voice functionality. Can an SBA or SBS host users? Yes, both solutions can host users; before doing this, verify that the PSTN connectivity works correctly.

End User Experience


In the previous sections we had a look at the server side of Lync. In this section we will have a look at what has been improved or changed for the end user.

As you can see in this screenshot, the look of the client has completely changed. We can split the client window into a few parts:

- The upper part of the screen shows your current status and location and, if available, a picture of yourself. This can either be configured manually by the user or retrieved from Active Directory.
- The next part is the so-called communication bar, which gives you quick access to IMs, received calls and voicemails. Besides this, it contains a dial pad which can be used to place calls.
- The biggest part of the client is the contact list, which now contains pictures of your contacts, if available, and their status text.
- At the bottom of the client a few options are available:
  - Phone icon: gives you the opportunity to configure the primary audio device.
  - Call forwarding: if enabled for Enterprise Voice, a user can configure call forwarding easily.
  - Information icon: displays information and errors if applicable. Note that this icon disappears if no errors or warnings are present.

Besides searching for contacts by name, you now have the opportunity to search for people based on their skills. This functionality requires at least SharePoint 2007 and enables you to search the profiles of users.

Implementation process
There are two ways to start the implementation of Lync:
- using the planning tool
- using the setup

The first option might look a little bit strange to you. How could you start an implementation with a planning tool that is, in most cases, used before the implementation? Well, you're completely right about this; but, starting from Lync, you can use the output from the planning tool as input for the setup. The planning tool can be downloaded for free from this website. The other option can be compared to the method which was available in OCS 2007 R2.

Before starting the installation you will need to install some prerequisites. First start with .NET 3.5 SP1 and the hotfixes mentioned in KB959209 and KB967190. Once these are installed, the Web Server role needs to be installed with some additional features. This can be done by running the following command:

ServerManagerCmd.exe -Install Web-Server Web-Http-Redirect Web-Scripting-Tools Web-Windows-Auth Web-Client-Auth Web-Asp-Net Web-Log-Libraries Web-Http-Tracing Web-Basic-Auth

Once everything is installed you can start the setup of Lync. The same steps are required for preparing the Active Directory as with OCS 2007 R2, so I do not need to explain them. Make sure you've got a backup of your Active Directory environment and, if possible, try the installation in a test environment before deploying it in production. Depending on your deployment, you will either choose the option to deploy the first Front End Server (only used for Standard Edition) or install the Topology Builder:

- Standard Edition: install the first Front End Server, followed by the Topology Builder
- Enterprise Edition: install the Topology Builder
In Lync you won't deploy a pool immediately but will first create a configuration using the Topology Builder. Once the tools are installed, you can start the Topology Builder to create your Lync environment. The Topology Builder contains some wizards which will guide you through the process of setting up your Lync environment. As I said earlier, you can also use the output from the planning tool. This will save you some time because you won't have to use the wizards to set up the environment. When you've finished building your Lync environment, it's time to publish the configuration to the Central Management Store. Once this is done, you can deploy the servers. Each server will look in the Central Management Store to determine which components it needs to install during the setup. This has the advantage that you don't have to select the components manually anymore, because you've already specified them.
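Publishing is normally done from the Topology Builder itself, but the same step can also be driven from the management shell. A rough sketch, with a placeholder file path:

# Export the current topology as XML, then publish it to the Central Management Store
(Get-CsTopology -AsXml).ToString() > "C:\Topologies\Topology.xml"
Publish-CsTopology -FileName "C:\Topologies\Topology.xml"

# Create the Active Directory and file store settings the published topology requires
Enable-CsTopology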

The setup can be started by selecting the option "Install or Update Lync Server System". The first step is to install the local configuration store by selecting "Install Local Configuration Store". This will install a SQL Server 2008 Express instance with a replica of the CMS database. Once the local configuration store is installed, select "Setup or Remove Server Components". You will receive a prompt to select the method which should be used to gather the configuration file. As long as you can reach the CMS, you should leave the default option checked, which is to retrieve it directly from the Central Management Store. The second option, import from a file, is only needed when performing the installation of the Edge Server. When you make a change to the topology, for example by adding the conferencing feature, you will need to run the setup again and select the "Install or Update Lync Server System" option.

Once everything is installed, the certificate needs to be requested and assigned. This can be done by selecting the option "Request, Install or Assign Certificates". Press the request button to create a certificate request. After the certificate has been requested, you will get the option to assign it immediately. This option is only relevant if the CA doesn't require a CA administrator to approve the certificate. If approval is required, you will need to choose the "Process Pending Certificates" button, followed by "Assign", to assign the certificate to the services. When the certificate is installed it's time to start the Lync services: this can be done by selecting the "Start Services" option. Once this task has been performed you have the option to check whether all services are running by selecting the "Check Services" button. This will launch services.msc, which will give you the ability to check that all services are in the started state.
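The same steps can also be performed from the management shell. A minimal sketch, assuming an online enterprise CA (the CA path and thumbprint are placeholders):

# Request one certificate covering the default and web services usages
Request-CsCertificate -New -Type Default,WebServicesInternal,WebServicesExternal -CA "ca01.contoso.com\ContosoCA" -FriendlyName "Lync Standard Edition"

# Assign an issued certificate to the services by its thumbprint (placeholder value)
Set-CsCertificate -Type Default,WebServicesInternal,WebServicesExternal -Thumbprint "B142918E463981A8A1C48E9D1D0A1B9B1D7F6C1E"

# Start the Lync services and verify that they are running
Start-CsWindowsService
Get-CsWindowsService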


The unnecessary evil of the shared development database


23 March 2011 by Troy Hunt
One of the greatest pain-points in developing a database-driven application happens when the application is in source control, but the database isn't. When the development database is shared, the pain increases, and it is not alleviated by source control alone. Troy Hunt spells out why each database developer must have their own version of the database.

I have memories from the late 1990s of building classic ASP apps in VB script using Dreamweaver, side by side with my fellow developers, working on the same set of files using the same mapped path. You had to talk a lot: "Have you closed the CSS file? I need to add a class." And I remember the painful find and replace process, safe to execute only once every developer had saved their work and closed all their files. We huddled around shared drives mapped to the same UNC path and recklessly worked on the same set of files before firing them up in the browser right off the same server.

Back then, there was source control of a kind, by which I mean Visual SourceSafe, CVS or even Rational ClearCase: but normally it would be the classic pattern of selecting the root folder of the app, and then going CTRL-C > CTRL-V > Rename with the date appended at key points in the project. In hindsight, there were probably better ways of doing it a dozen years back, but the practices were reasonably common. Today, no one in their right mind would consider building apps this way. So why are so many people still using these same methods to build the databases behind their web applications? Is the data layer really that special that it needs to be approached entirely differently? Don't those hard-learned lessons from the last century apply to database development?

What are we talking about here?


Just so there's no confusion, let's try to be clear what we're talking about. When using the shared database development model, the developers build the web app layer in the usual fashion (Visual Studio running locally, files stored on the developer's PC) yet all connect to a remote database server and work directly on the same DB. It looks something like this:

Usually they'd also be working with SQL Management Studio running locally and connected to the remote database server. When the app actually runs locally, all data connections go out to the shared, central development database. There is no local instance of SQL Server on the developers' machines. The alternative, of course, is dedicated development databases. Things now look a little bit different:

Obviously, each developer has their own version of the database, but the biggest difference from the earlier model is the presence of a version control system. Why?

The problems of the Shared Development database


The last writer wins problem
The obvious problem with collectively working on a shared database is that, when multiple developers work on the same object at the same time, it's a case of last writer wins. No merge, no management of editing conflicts, just the last guy getting his way. In order to mitigate the risk of this happening, you have to implement social mechanisms to work around the problem. So you're back to developers communicating backwards and forwards in an attempt to keep their fellow coders out of their work in progress. It's clumsy, it's labour intensive and it will fail. It's just a matter of time.

The experimentation problem


An essential part of software development is experimentation. Unless you're a true believer in the waterfall mantra of "design, then build", and that the two shall only ever progress in that sequence, you're inevitably going to build some code that sucks and consequently change direction. This is healthy; it's part of the continuous improvement cycle we all implicitly exercise every time we build software. What's not healthy is to unintentionally impede the work of the other developers in the team by exposing the consequences of your experimentation to them. It could happen in all manner of ways: it could, for instance, be that your work breaks their code by removing dependencies, or by unexpectedly changing the result of their ORM layer generation by adding objects. Experimentation should be about you trying different approaches, not forcing it upon your team. In short, you need to play in your own sandbox; a sandbox that spans each layer of the application. Martin Fowler has a nice summary in his article about Evolutionary Database Design:

"Evolutionary design recognizes that people learn by trying things out. In programming terms, developers experiment with how to implement a certain feature and may make a few attempts before settling down to a preferred alternative. Database design can be like that too. As a result, it's important for each developer to have their own sandbox where they can experiment, and thereby prevent their changes from affecting anyone else."

To develop code, you need to be able to experiment alone, so you must insist selfishly on your own play-space.

The unpredictable data state problem


Whilst the problems of last writer wins can be largely ameliorated by such strategies as segmenting the tasks or announcing changes before making them, the unpredictable state of the data is a whole new level of trouble. The problem is simply that databases tend to change as the client applications are used. I don't so much mean the changes at the object level as much as the data within the tables. It's the whole reason we've got the things in the first place: so that we may manipulate and persist the state of the data. Let's take a typical example. We've got an app with an authentication module and administration layer, and each of the developers has created accounts which they can test with. But the guy building the administration layer wants to test how the app behaves with no accounts in it. Now we've got a problem, because deleting the existing accounts is likely to hinder the other team members. As things get more complex, the problems worsen. Applications that are highly dependent on the state of data, for example, become a nightmare when you simply can't control how it's being changed. Unpredictability is not your friend when it comes to building software.

The unstable integration tests problem


If you have complex data dependencies, then when the time comes to do integration tests it is essential to have a predictable, stable set of data to run against. Otherwise, with constantly changing data, you can abandon all hope of a function returning a predictable result. You'll never achieve this if the rest of the team is continuously evolving both the schema and the data.

The objects missing in VCS problem


You will never get continuous codebase integration with the work of your team if every change is made centrally, without source control, and is immediately available to everyone. What's the motivation to commit your code? Of course what's really happening here is that the shared model is simply allowing a bad practice to creep in without any repercussions. If changes have to be committed, then the benefits of Version Control will become apparent. Dedicated development databases nurture good VCS practice.

The disconnected commuter problem


By using a shared development database, you are forced to be connected to the network at all times. If you want to take work with you and slip some development time into the train journey, or if you want to work through the weekend in the comfort of your own home, you're only going to have half the solution available. Yes, there are often VPN solutions or other means of remote access, but you're starting to increase the friction of working productively. The same problem extends to working effectively with those outside the network segment which contains the shared database. Want to send work out to a partner for a few weeks? Sure, you can do that by backing up the DB and sending it out, but the chances are you're going to have issues integrating back into the trunk of development if the project is really not geared for distributed database development. The project has been tightly coupled to a single internal implementation of a database server, and this is always going to result in more difficulties further down the line.

The sysadmin problem


One of the problems with development in any server-based environment, including a database server, is that there are times when elevated privileges are required. When this environment is shared, there are potential consequences well beyond the scope of a single application. Here's a case in point: I recently suggested to a developer that their performance tuning could benefit from some SQL Server Profiler analysis to take a closer look at things like reads and writes. This particular case involved a shared database, so the next thing that happens is I get an email back with an image like this:

Frankly, I don't want the guy to be sysadmin on a box that may contain totally unrelated databases to which he probably shouldn't have access. I could give him ALTER TRACE permissions (and ultimately, I did), but of course this has to be set at the master database level, so now he has the right to inspect every query across every database. This discussion would never have even taken place in the dedicated local database scenario. He would have simply already had the rights and it would have been dealt with locally. There are plenty of similar occasions where the rights a developer needs to do their job exceed what should be granted in a shared environment.
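For comparison, this is roughly what that compromise looks like on the shared instance; the T-SQL is run here through sqlcmd from PowerShell, and the server and login names are placeholders. On a dedicated local instance the developer is simply an administrator of their own SQL Server and the question never comes up.

# ALTER TRACE is a server-level permission granted in master: it lets the login
# run Profiler traces against EVERY database on the shared instance, which is
# exactly the over-granting problem described above.
sqlcmd -S "SHAREDDEV01" -d master -Q "GRANT ALTER TRACE TO [CONTOSO\devuser];"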

The unexpected performance profiling problem


Continuing from the previous point, performance profiling in a shared environment where you have no control over the other processes running on the machine is an absolute nightmare. That query which takes 20 seconds to run in one test run can easily blow out to 50 seconds a few moments later. Why? You have no idea. Whilst it's always a bit tricky getting consistent results from any sort of performance profiling, the worst thing that can happen in the midst of this is other processes getting in your way. When you're doing this locally, you have both visibility and control over these processes. Of course there are cases, such as where huge volumes of data would normally be queried on serious servers, where performance profiling on a PC is not going to yield constructive results. However, for the majority of performance-tuning tasks, a developer needs a predictable environment more than anything else.

The non-representational latency problem


So let's say you're working on a shared database, which is almost inevitably located on a server in a data centre. Where exactly is that? How many milliseconds of latency are being added to each round trip? The problem is that, in a world of data centre consolidation, you're quite possibly going to be loading up a whole heap of additional latency onto each ADO.NET connection which isn't going to represent the target live environment. I'm guessing you don't have the same sort of gigabit Ethernet connectivity from your PC as the production web application server will have, and that creates a little bit of a problem. It's a problem in that the application performance in the development environment is going to be comparatively slow. The degree of sluggishness will depend on the latency and the amount of activity over the wire but, for example, 150ms to a remote SQL server coupled with a chatty application is not going to make for a very accurate representation of real-world app performance.

The "my development machine is slow" problem


Of course the "problem" with developing databases locally is that it is necessary to run SQL Server on the PC. I say "problem" in quotes because the issue is not so much that SQL Server is asking too much of the machine, it's that developers are all too frequently given slow, underspecified PCs. If developers are given machines which struggle to concurrently run Visual Studio, SQL Server and the usual business-as-usual tools (Outlook, Word, etc.), there is a bigger underlying problem: developers are not cheap. In Australia, your average developer is costing about $90k a year. There are then a whole bunch of other costs to contend with, such as floor space, equipment (other than the PC), operating expenses (such as payroll) and on and on and on. Conservatively call it $100k annually, or around $420 for each day they work.

On the other hand, fast PCs are cheap. I recently replaced an aging laptop and the difference in price between a run-of-the-mill machine designed for the desk jockey who lives in the Office productivity suite and an 8GB, SSD, i7 machine was $0.60 a day over a three-year lifespan. Put it this way: if you fit the $90k/y bill and you've read this far (say 10 minutes), you've just consumed three weeks' worth of super-fast machine upgrade, based on your hourly rate and the time cost of reading this post. Enough said.

And yes, yes, I know developers and costs are a lot cheaper in other countries. So let's assume only $25k annually; you're still looking at over $100 a day for these guys and a $0.60 cost to fundamentally improve their productivity. If you need to debate the mathematics of this with anyone, it's probably time to have a good hard look at how conducive the environment is to having a productive, fulfilling role; for some essential further reading, check out Jeff Atwood's Programmer's Bill of Rights.

Getting to the solution of dedicated development databases


The SQL Server Developer Edition solution
Microsoft provides the perfect means of developing DBs locally in the SQL Server Developer Edition. This is effectively a full-blown Enterprise edition licensed for non-production use. Chances are you already have a license if you have an MSDN subscription but, even if you don't, it's dirt cheap. Installed locally, it can easily be configured so the service doesn't start automatically, if you're really worried about it dragging down the performance of your PC when you're not even using it:

But having said that, the resource usage is actually pretty small unless you're seriously pounding it. Mine is sitting there consuming only 340MB of memory (about 4% of what's on the machine) and 0.4% of CPU. So unless you're running under-specced hardware (again, this is reflective of a deeper problem), the performance impact shouldn't even be noticeable.
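If you still want the local instance completely out of the way when you are not using it, the service can be switched to manual start and spun up on demand. A small sketch, assuming a default (unnamed) instance whose Windows service is MSSQLSERVER:

# Stop SQL Server from starting automatically with Windows
Set-Service -Name "MSSQLSERVER" -StartupType Manual

# Start it only when you need to work on the database...
Start-Service -Name "MSSQLSERVER"

# ...and stop it again afterwards to free the memory
Stop-Service -Name "MSSQLSERVER" -Force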

The "script it all" solution


One great thing about decentralising your development database is that it forces you to script a ready state of data. Versioning of database objects is one thing, and it's obviously essential, but we all know that most applications won't play nice if they start out with zero records in them. All sorts of reference data is usually required to initialise a database, so the problem now becomes how you get it in there. If you need to initialise the state of the database via SQL scripts, you are forced to think clearly about what your application needs to function. You need to think through the purpose of each table and what it needs to contain in order to achieve that ready state, rather than just organically growing reference records as required. The other big bonus is that this script then goes into source control. It gets versioned along with the DB objects and persists in VCS in perpetuity. Finally, scripts are fantastic for automation. It means that at any time you can pull a revision from VCS and have a clean, repeatable installation of the application. Tie that into a continuous integration environment and you now have one-click deployment of the entire app.
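As a minimal sketch of that automation (the paths, instance and database names are hypothetical, and a real pipeline might use a migration or comparison tool instead of hand-rolled scripts), a small PowerShell wrapper can rebuild a clean local copy from whatever is currently in version control:

# Rebuild a clean local development database from the scripts held in VCS.
# Assumes sqlcmd is installed and the scripts live under .\database\ in the working copy.
$server   = "localhost"   # local Developer Edition instance
$database = "MyAppDev"    # hypothetical database name

# Drop and recreate the database so every build starts from the same state
sqlcmd -S $server -d master -Q "IF DB_ID('$database') IS NOT NULL DROP DATABASE [$database]; CREATE DATABASE [$database];"

# Apply the schema scripts, then the reference ('ready state') data scripts
Get-ChildItem ".\database\schema\*.sql" | Sort-Object Name | ForEach-Object { sqlcmd -S $server -d $database -i $_.FullName }
Get-ChildItem ".\database\reference-data\*.sql" | Sort-Object Name | ForEach-Object { sqlcmd -S $server -d $database -i $_.FullName }

Because the same scripts run on a developer's PC and on the build server, the one-click deployment mentioned above comes almost for free.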

The "this will damn well force you to do it right" solution


By working on dedicated local databases, the developer is forced into a number of good practices which could otherwise be circumvented in the shared world. The obvious one is source control; if you're not versioning your database objects and reference data, you're doing it wrong. You simply won't be able to get away with it any more if the only way your colleagues can get the changes is via VCS. So as to work effectively and not break builds, work needs to be modularised and committed atomically. You can no longer get away with randomly changing unrelated parts of the application; otherwise you begin breaking the build for others, something which is generally not received very positively by your peers. This is a good thing; it forces more thoughtful design and conscious completion of tasks. And of course you can't get away with running SQL Server on that tired, cheap PC with a single GB of RAM and an old 5,000 RPM disk. You actually have to get a half-decent machine, I mean one that is actually suitable for building software on! So you see, the whole shared development database model can disguise the use of those practices you might not be doing properly to begin with. Working autonomously on a local DB becomes a self-perpetuating cycle of practice improvement as it simply won't let you get away with taking nasty shortcuts.

Summary
If you're using a shared development database, the chances are that you've simply inherited the practice. Take a good look around; are you really working this way because it's the most effective possible way of building software? In times gone by, it wasn't easy to version-control databases, but we've now got tools at our disposal to do it.

In terms of .NET, there's obviously the official Microsoft Team Foundation route, but there are also offerings from third parties such as Red Gate's SQL Source Control. Around the middle of last year I wrote about Rocking your SQL Source Control world with Red Gate and then Foolproof Atomic Versioning of Applications a little after that, both of which go into detail about the importance and value of versioning your databases. So I won't repeat the message here. Just make sure you're doing it, OK?

Developing locally on dedicated databases is not only better for the process of database development, it's better for configuration, which means better for deployment. It's also better for development processes in general, such as experimentation and modularisation of work. It solves all sorts of other problems which are engendered by the communal DB model. So really, what's stopping you?


Steve Furber: Geek of the Week


28 March 2011 by Richard Morris
Professor Stephen Byram Furber CBE, FRS, FREng was one of the designers of the BBC Micro and the ARM 32-bit RISC microprocessor. The result of his work, the ARM chip, is in most mobile phones, calculators and other low-power electronic devices in the world. At the University of Manchester, he is working on the radical SpiNNaker project which could one day change the whole nature of the personal computer.

Steve Furber is not as well known as he should be, which is surprising given that he is one of the leading pioneers of personal computing.
As part of the key team at Acorn Computers in the early 1980s the developers and manufacturers of the famed BBC Micro ( or Beeb as it was affectionately known) he was instrumental in designing the ARM (Acorn Risc Machine) chip which made the companys hugely successful PCs almost twice as fast as anything else on the market. And it is this innovation which underpinned the rapid growth in mobile communications, which has opened up economic opportunities for millions in the developing and developed world. The ARM first appeared in the Acorn Archimedes in 1987, making Acorn the first company to ship Riscbased personal computers for the mass market. Acorn founder Hermann Hauser has said that Steve Furber is one of the brightest guys I've ever worked with - brilliant and when we decided to do a microprocessor on our own I made two great decisions - I gave them two things which National, Intel and Motorola had never given their design teams: the first was no money; the second was no people. The only way they could do it was to keep it really simple. Nearly 30 years on Steve Furber (now the ICL Professor of Computer Engineering in the School of Computer Science at the University of Manchester) is still working with ARM processors, although on a much grander scale than the humble Archimedes. Steve, I think Im right in saying that you were a member of the Cambridge University Processor Group, a club for computer hobbyists when you were a student there. Was this rather like the Homebrew clubs in the US? When were you bitten by the computer bug? Yes, CUPG was very much a homebrew computer club, formed by Cambridge students. There the real men built their computers from TTL only the wimps like me used microprocessors! I got bitten by the bug as a result of being drawn into CUPG, which I joined because I was interested in flying and flight simulators, and computers seemed a good way to build a flight simulator. Was there anything that drew you into computers other than I seem to be good at this? I guess it was a combination of my interest in flight simulators and my amateur electronics experience. Id got rather put off building electronics in my teens because I struggled to make transistor circuits work (though I did get on better with valves!), but then I discovered the 741 op amp. As a Maths student the 741 gave me an abstraction I could work with, hiding all the low-level transistor details inside a clean black box. I built guitar effects boxes and two 8-channel sound mixing desks using 741s and PCBs I etched in my kitchen sink. Digital electronics offered another clean abstraction that enabled me to build stuff that worked in a different domain computers.

RM: SF:

RM: SF:

RM: SF:

One of your first major projects was designing the BBC Micro, a machine designed to accompany a computer literacy programme set up by the BBC. Did you in your wildest dreams expect it to take off as it did? We expected the BBC Micro to be a success, which is why we were so pleased to get the contract. But success meant selling the expected 12,000 machines. No-one anticipated the way home computers would take off in the early 1980s, to the extent that total Beeb sales were around 1.5 million. The first sense I got that this thing might exceed our wildest dreams was when we were lined up to give a seminar at the (then) IEE Savoy Place. I think this was 1982. The main lecture theatre at Savoy Place seats several hundred, but three times the capacity turned up Coach-loads of people had come some distance, for example from Birmingham, to hear about the BBC Micro. A lot had to be sent away to avoid exceeding the safe capacity of the lecture theatre, and we were booked back to give the seminar two more times (and many other times around the UK and Ireland) to meet demand.

RM: SF:

What do you think inspired people to crowd to see you and buy PCs in the numbers that they did? I think there was a widespread realisation that home computing was coming, and it was going to be exciting, useful and

fun. But the wider public was nervous about the great diversity of machines available, all produced by small companies they hadnt heard of and found it hard to trust. Then in came the BBC Micro, bearing one of the most trusted names in the land. That was the signal they needed to take a step into the unknown. Sure, the BBC Micro was a bit more expensive than competing machines, but if Im buying a product I dont fully understand I always prefer to pay a bit more for a name that I trust. And I like to think that the machine did live up to the brand expectations it was solidly built (some Beebs survived ten years in the hands of primary school kids) and had sound educational credentials, attracting extensive educational support. I still, frequently, come across folk who tell me that the BBC Micro introduced them to programming and was the foundation for their subsequent career.

RM: Acorn had huge success in the late 1970s and saw its profits rise from £3,000 in 1979 to £8.6m in July 1983, but it stumbled two years later and was later taken over by Olivetti. Do you think the company could have been saved had the ARM architecture project happened sooner?

SF: Having ARM earlier wouldn't have saved Acorn. ARM had to get out from the constrained Acorn market into the much more open System-on-Chip market that got them into mobile phones, and the SoC business only became technically feasible (with enough transistors on a chip to integrate all of the non-memory functions) in the 1990s.

RM: When you were designing processors at Acorn they generally had a power consumption of less than a watt. Do you feel rather glum at the power demands of today's high-end processors? Is this a consequence of the fact we don't have a sufficient grip on building parallel software?

SF: Yes, and yes! The energy-efficiency of computers is a growing concern, and the lengths we have gone to in order to maximise single-thread performance at the cost of energy-efficiency are justifiable only up to a point. There is no way forward now apart from going parallel, and even the high-end boys have thrown in the towel on single-thread performance. They are selling us multicore processors we still don't know how to use. Once we do know how to exploit parallelism there will be no need for high-end processors at all, because we will be able to get the same performance with much greater energy-efficiency by using larger numbers of simpler processors. I expect to see this transition soon in data centres, where the load has a lot of easily accessible parallelism and where energy concerns are already at the top of the agenda, and even in high-performance computers, where I see this as the most promising route to exascale.

RM: You're now working on the SpiNNaker project that you're leading at Manchester, which aims to mimic the complex interactions in the human brain. What's the higher motivation with the project, and does this in any way come from Doug Engelbart's vision about augmenting the human mind and the interaction between machine and its user, which led directly to the invention of the PC?

SF: The higher motivation for SpiNNaker is the observation that computers aren't the only information processing systems on the planet, and they aren't even the best at some tasks. But we still don't know how the other sort (biological brains) work. This seems to me to be a fundamental gap in scientific knowledge. Computers are now approaching the performance required to build real-time models of brains (but they aren't quite there yet; a computer model of a human brain would require at least an Exascale machine), so can we accelerate the understanding of the brain by designing a computer that is optimised for this task? This will then offer a platform for neuroscientists, psychologists and others to develop and test hypotheses on a new scale. Scale is important. We usually like to start small, get some understanding, and then scale up building on this. But there are some places, including some ideas in the neural network field, where starting small doesn't work. There are good theoretical insights into why this should be, relating to the counter-intuitive properties of high-dimensional spaces. The maths simply stops working if the problem is below a certain (large) size. So sometimes you have to jump in at the deep end, and SpiNNaker offers a very deep pool to do this in.

RM: You're working with the Royal Society to figure out why the number of students taking computing classes has halved in the past eight years. What are the most important but fundamental things the computing industry can learn from its past? Do you think that the industry has been wrong about what computing is, where it should go and how to improve it?

SF: I'll be able to answer this better when the study has drawn its conclusions. But on the evidence of other studies of this area, the problem seems to lie in the transition from the computer as a universal programmable platform for exploring ideas (as with the BBC Micro in the 1980s) to an office tool that runs productivity software. Much of what is taught in schools is IT rather than computer science. IT is important but intellectually unchallenging, and often dull. It's as if all that was taught in Maths was arithmetic, or in English, spelling. IT, arithmetic and spelling are all important skills, but there is *so* much more in all these subjects.

RM: Lots of people have tried to come up with languages or programming systems that allow non-programmers to program. Is that a doomed enterprise?

SF: Currently programming is an extremely demanding discipline, requiring mental abilities that range from scaling multiple levels of abstraction to chasing very low-level details around at the bottom of a vast software system to track down a bug. I don't think there is any way you can train the entire population to become skilled programmers at this level; it is a peculiar skill, not unlike being a theoretical physicist in its requirement to think abstractly while paying painstaking attention to detail. If the goal is to introduce a wider audience to the ideas in programming (computational thinking) then there may be scope for a simpler language that is less universal than those used by professional programmers, has fewer death traps for the unwary, and is perhaps more visual than symbolic in its representation. I always thought BBC BASIC was an excellent introductory language, but I just get laughed at when I suggest it these days!

RM: Over the last 20 years the Internet has scaled in growth, but operating systems and other computer software haven't grown exponentially. Do you think that the internet concept could be imitated and used as a basis for an operating environment that doesn't have an operating system?

SF: It seems to me that operating systems have grown exponentially, in memory requirement if not in functionality! I'm not sure how this relates to the question, but we have seen cycles of shift of computing power between central resources and the user at the periphery. With cloud computing we are seeing a shift back towards the centre, partly driven by improving communication services and partly by the need to mobilise the user terminal device. This is seeing the PC give way to the smart phone and iPad-like terminal, which moves quite a lot of the operating system functionality up into the Cloud.

RM: Do you think many people working in technology are unaware of its history and have little curiosity about where languages came from?

SF: Technology tends to attract folk who are more interested in the future than the past, so they often have very little sense of the history of their subject, and very little curiosity. But as folk get older their horizons get wider, and I think most technologists in the second halves of their careers develop some sense of the historical path that has led to the way things are today.

RM: Do you feel we're progressing in technology even though sometimes it seems that we are leaping backwards?

SF: We are definitely making progress in technology, faster than ever. After over 30 years in the business I still find new products astonishing. I carry my entire CD collection on my iPhone, but I remember a time at Acorn when we debated whether solid state music would ever be economically feasible. The iPad is a long-standing dream come true; again, back in the Acorn days we talked about similar products for schools (though without the connectivity, which was inconceivable then), but of course the technology just wasn't ready.

RM: When you look back at your career, at all the things you have done, is there one time or period that stands out among all the others?

SF: I guess my early years at Acorn would have to stand out, from 1981 to 1985, since that period covers the BBC Micro and the first ARM chip, and those are the foundations of my subsequent career. The success of the BBC Micro was tangible at the time, as described above, whereas the ARM was a long time coming to fruition, and required a great deal of work by a lot of other folk, not to mention a fair dose of serendipity, to get to the 20 billion total shipments to date. But it's knowing that this scale of impact is possible that drives me on and determines the directions I choose to take my research today. SpiNNaker has the potential to generate similar impact, though there are many contingencies and, as with all research, it's highly speculative.

Simple-Talk.com

Using SQL Server Integration Services to Bulk Load Data


29 March 2011 by Robert Sheldon
The most flexible way to bulk-load data into SQL Server is to use SSIS. It can also be the fastest, and most scalable, way of doing so. There are three different SSIS components that can be used to do this, so which do you choose? As always, Rob Sheldon is here to explain the basics.
In previous articles, I discussed ways in which you can use the bcp utility and the Transact-SQL statements BULK INSERT and INSERT...SELECT (with the OPENROWSET function) to bulk load external data into a SQL Server database. Another effective, and indeed the most flexible, method you can use to bulk load data is SQL Server Integration Services (SSIS). SSIS can read from a variety of data sources, data can be easily transformed in memory, and you can bulk load the data without needing to stage it. Because SSIS runs as a process separate from the database engine, much of the CPU-intensive work can be performed without taxing the database engine, and you can run SSIS on a separate computer. As a result, you can easily scale out your bulk load operations in order to achieve extremely high throughput. SSIS provides several task and destination components that facilitate bulk load operations:

SQL Server destination
OLE DB destination
BULK INSERT task


In this article, I provide an overview of each of these components and show you how they work. To demonstrate the components, I first created the following three tables in the AdventureWorks2008 database:

IF OBJECT_ID('Employees1', 'U') IS NOT NULL
  DROP TABLE dbo.Employees1;
CREATE TABLE dbo.Employees1 (
  EmployeeID INT NOT NULL,
  FirstName NVARCHAR(50) NOT NULL,
  LastName NVARCHAR(50) NOT NULL,
  JobTitle NVARCHAR(50) NOT NULL,
  City NVARCHAR(30) NOT NULL,
  StateProvince NVARCHAR(50) NOT NULL,
  CountryRegion NVARCHAR(50) NOT NULL,
  CONSTRAINT PK_Employees1 PRIMARY KEY CLUSTERED (EmployeeID ASC)
);

IF OBJECT_ID('Employees2', 'U') IS NOT NULL
  DROP TABLE dbo.Employees2;
CREATE TABLE dbo.Employees2 (
  EmployeeID INT NOT NULL,
  FirstName NVARCHAR(50) NOT NULL,
  LastName NVARCHAR(50) NOT NULL,
  JobTitle NVARCHAR(50) NOT NULL,
  City NVARCHAR(30) NOT NULL,
  StateProvince NVARCHAR(50) NOT NULL,
  CountryRegion NVARCHAR(50) NOT NULL,
  CONSTRAINT PK_Employees2 PRIMARY KEY CLUSTERED (EmployeeID ASC)
);

IF OBJECT_ID('Employees3', 'U') IS NOT NULL
  DROP TABLE dbo.Employees3;
CREATE TABLE dbo.Employees3 (
  EmployeeID INT NOT NULL,
  FirstName NVARCHAR(50) NOT NULL,
  LastName NVARCHAR(50) NOT NULL,
  JobTitle NVARCHAR(50) NOT NULL,
  City NVARCHAR(30) NOT NULL,
  StateProvince NVARCHAR(50) NOT NULL,
  CountryRegion NVARCHAR(50) NOT NULL,
  CONSTRAINT PK_Employees3 PRIMARY KEY CLUSTERED (EmployeeID ASC)
);

The tables are identical except for their names and the names of the primary key constraints. After I added the tables to the AdventureWorks2008 database (on a local instance of SQL Server 2008), I ran the following bcp command to create a text file in a local folder:

bcp "SELECT BusinessEntityID, FirstName, LastName, JobTitle, City, StateProvinceName, CountryRegionName FROM AdventureWorks2008.HumanResources.vEmployee ORDER BY BusinessEntityID" queryout C:\Data\EmployeeData.csv -c -t, -S localhost\SqlSrv2008 -T

The bcp command retrieves data from the vEmployee view in the AdventureWorks2008 database and saves it to the EmployeeData.csv file in the folder C:\Data. The data is saved as character data and uses a comma-delimited format. I use the text file as the source data in order to demonstrate the three SSIS components. I next created an SSIS package named BulkLoadPkg.dtsx and added the following two connection managers:

OLE DB. Connects to the AdventureWorks2008 database on the local instance of SQL Server 2008. I named this connection manager AdventureWorks2008.
Flat File. Connects to the EmployeeData.csv file in the C:\Data folder. I named this connection manager EmployeeData.
After I added the connection managers, I added three Sequence containers to the control flow, one for each bulk insert operation. Each operation is associated with one of the tables I created above. For example, the first Sequence container will contain the components necessary to bulk load data into the Employees1 table. To each container I added an Execute SQL task that includes a TRUNCATE TABLE statement. The statement truncates the table associated with that bulk load operation. This allows me to execute the container or package multiple times in order to test different configurations, without having to be concerned about primary key violations. I then added to each of the first two containers a Data Flow task, and to the third container I added a Bulk Insert task. Figure 1 shows the control flow of the BulkLoadPkg.dtsx package. Notice that I connected the precedence constraint from each Execute SQL task to the Data Flow or Bulk Insert task.

Figure 1: Control flow showing three options for bulk loading data
After I created the basic package, I configured the Data Flow task and Bulk Insert task components, which I describe in the following sections. You can download the completed package at [include link? URL?] In the meantime, you can find details about how to create an SSIS package, configure the control flow, set up the Execute SQL task, or add tasks and containers in SQL Server Books Online. Now let's look at how to work with the components necessary to bulk load the data.

SQL Server Destination


The first SSIS component that we'll look at is the SQL Server destination. You should consider using this component within the data flow if you must first transform or convert the data and then bulk load that data into a local instance of SQL Server. You cannot use the SQL Server destination to connect to a remote instance of SQL Server. To demonstrate how the SQL Server destination works, I added a Flat File source and Data Conversion transformation to the data flow. The Flat File source uses the Flat File connection manager EmployeeData to connect to the EmployeeData.csv file. The Data Conversion transformation converts the first column of the source data to a four-byte signed integer (an SSIS data type) and renames the output column to BusinessEntityID (to match the source column in the vEmployee view). The transformation converts the other columns to the Unicode string data type and again renames the columns to match the column names in the view. In addition, I've set the length to 50 in all the string columns except City, which I've set to 30. The Data Conversion Transformation Editor should now look similar to what is shown in Figure 2.

Figure 2: Data Conversion Transformation Editor


After I configured the Data Conversion transformation, I added a SQL Server destination to the data flow. The data flow should now look similar to the data flow in Figure 3.

Figure 3: Data flow that uses the SQL Server destination component to load data
Now I can configure the SQL Server destination. To do so, I double-click the component to launch the SQL Destination editor, which opens on the Connection Manager screen. I then select the OLE DB connection manager I created when I first set up the SSIS package (AdventureWorks2008), and then select Employees1 as the destination table. Figure 4 shows the Connection Manager screen after it's been configured.

Figure 4: Connection Manager screen of the SQL Server Destination editor


Next, I want to ensure that my source columns properly sync up with my destination columns. I do this on the Mappings screen of the SQL Destination editor and map the columns I output in the Data Conversion transformation to the columns in the Employees1 target table, as shown in Figure 5.

Figure 5: Mappings screen of the SQL Server Destination editor

Notice that I mapped the BusinessEntityID source column to the EmployeeID destination column. All other column names should match between the source and destination. After you ensure that the mapping is correct, you can configure the bulk load options, which you do on the Advanced screen of the SQL Destination editor, shown in Figure 6. On this screen, you can specify such options as whether to maintain the source identity values, apply a table-level lock during a bulk load operation, or retain null values.

Figure 6: Advanced screen of the SQL Server Destination editor


As you can see, for this bulk load operation, I am choosing to retain identity and null values and to apply a table-level lock on the destination table during the bulk load operation. In addition, I'm not checking constraints or firing triggers during the operation. I'm also specifying that the data is ordered according to the EmployeeID column. Because I sorted the data (based on the ID) when I exported it to the CSV file, I can now use the Order columns option to specify that sort order. This works just like the ORDER option of the BULK INSERT statement. You might have noticed that the Advanced screen does not include any options related to batch sizes. SSIS handles batch sizes differently from other bulk-loading options. By default, SSIS creates one batch per pipeline buffer and commits that batch when it flushes the buffer. You can override this behavior by modifying the Maximum Insert Commit Size property in the SQL Server Destination advanced editor. You access the editor by right-clicking the component and then clicking Show Advanced Editor. On the Component Properties tab, modify the property with the desired setting:
A setting of 0 means the entire data set is committed as a single batch. This is the same as the BULK INSERT option of BATCHSIZE = 0.
A setting less than the buffer size but greater than 0 means that the rows are committed whenever that number is reached, and also at the end of each buffer.
A setting greater than the buffer size is ignored. (The only way to work with batch sizes larger than the current buffer size is to modify the buffer size itself, which is done in the data flow properties.)
For a complete description of how to configure the SQL Server destination, see the topic SQL Server Destination in SQL Server Books Online.
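Since the Order columns option and the commit-size behaviour both map onto familiar BULK INSERT options, a rough T-SQL equivalent may help to make the parallel concrete. This is only a sketch: the target table and file path come from the example above, while the delimiter settings are assumptions about how the bcp export was laid out.

-- A hedged T-SQL approximation of the Advanced screen settings described above:
-- a table-level lock, data presorted on EmployeeID, and (by omitting BATCHSIZE)
-- the whole file committed as one batch, which is what BATCHSIZE = 0 amounts to.
BULK INSERT dbo.Employees1
FROM 'C:\Data\EmployeeData.csv'
WITH (
    DATAFILETYPE = 'char',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK,
    ORDER (EmployeeID ASC)
);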

OLE DB Destination
The OLE DB destination is similar to the SQL Server destination except that your destination is not limited to a local instance of SQL Server (and you can connect to OLE DB target data sources other than SQL Server). One advantage of using this component is that you can run SSIS on a computer other than the one where the target table is located, which lets you more easily scale out your SSIS solution. To demonstrate how the OLE DB destination works, I set up a data flow similar to the one I set up for the SQL Server destination. As you can see in Figure 7, I've added a Flat File source and Data Conversion transformation, configured just as you saw above.

Figure 7: Data flow that uses the OLE DB Destination component to load data
After I added and configured the Data Conversion transformation, I added an OLE DB destination, opened the OLE DB Destination editor, and configured the settings on the Connection Manager screen, as shown in Figure 8.

Figure 8: Connection Manager screen of the OLE DB Destination editor


As you can see, I specified the AdventureWorks2008 connection manager and Employees2 as the target table. Notice also that in the Data access mode drop-down list, I selected the option Table or view - fast load. The OLE DB destination supports two fast-load options: the one I've selected, and one that lets you specify the name of the table or view within a variable (Table name or view name variable - fast load). You must specify a fast-load option for data to be bulk inserted into the destination. When you select one of the fast-load options, you're provided with options related to bulk loading the data, such as whether to maintain the identity or null values or whether to implement a table-level lock. Notice in Figure 8 that you can also specify the number of rows per batch, without having to access the advanced settings. As with the SQL Server destination, the rows per batch are tied to the SSIS buffer.

NOTE: The OLE DB Destination editor does not include an Advanced screen like the SQL Destination editor, but it does include an Error Output screen that lets you specify error handling options, something not available in the SQL Destination editor.

I next used the Mappings screen to ensure that my source columns properly sync up with my destination columns, as I did with the SQL Server destination. Figure 9 shows the mappings as they appear in the OLE DB Destination editor.

Figure 9: Mappings screen of the OLE DB Destination editor


Not all properties related to bulk loading are available through the OLE DB Destination editor. For instance, if you want to specify a sort order, as I did for the SQL Server destination, you must use the advanced editor. To access the editor, right-click the component, click Show Advanced Editor, and then select the Component Properties tab, shown in Figure 10.

Figure 10: Advanced Editor for the OLE DB Destination editor


Notice that the FastLoadOptions property setting is TABLOCK, ORDER(EmployeeID ASC) . The TABLOCK argument was added when I selected the Table lock option on the Connection Manager screen of the OLE DB Destination editor. However, I added the ORDER argument, along with the name of the column and the sort order (ASC). Also note that I used a comma to separate the TABLOCK argument from the ORDER argument. You can add other arguments as well. For a complete description of how to configure the OLE DB destination, see the topic OLE DB Destination in SQL Server Books Online.

Bulk Insert Task


Of those SSIS components related to bulk loading, the simplest to implement is the Bulk Insert task. What makes it so easy is the fact that you do not have to define a data flow. You define both the source and destination within the task itself. However, you can use the Bulk Insert task only for data that can be directly imported from the source text file. In other words, the data must not require any conversions or transformations and cannot originate from a source other than a text file. As you'll recall from Figure 1, I added the Bulk Insert task to the third Sequence container, right after the Execute SQL task. To configure the task, double-click it to launch the Bulk Insert Task editor, which opens on the General screen, as shown in Figure 11.

Figure 11: General screen of the Bulk Insert Task editor


On the General screen, you simply provide a name and description for the task. It's on the Connection screen, shown in Figure 12, that you specify how to connect to both the source and destination.

Figure 12: Connection screen of the Bulk Insert Task editor


As the figure indicates, I specified AdventureWorks2008 as the OLE DB connection manager. I also specified the Employees3 table as the target table.

In the Format section of the Connection screen, I select the Specify option, which indicates that I will specify the format myself, rather than use a format file. If I wanted to use a format file, I would have selected the Use File option and then specified the format file to use. When you select the Specify option, you must also specify the row delimiter and column delimiter. In this case, I selected {CR}{LF} and comma {,}, respectively. These settings match how the source CSV file was created. Finally, in the Source Connection section of the Connection screen, I specify the name of the Flat File connection manager I created when I set up the package (EmployeeData). Note, however, that the Bulk Insert task editor uses the connection manager only to locate the source file. The task ignores other options you might have configured in the connection manager, which is why you must specify the row and column delimiters within the task. After I configured the Connection screen of the Bulk Insert Task editor, I selected the Options screen, as shown in Figure 13. The screen lets you configure the options related to your bulk load operation.

Figure 13: Options screen of the Bulk Insert Task editor


Notice that for the DataFileType property I selected char (character) because that's how the source file was created. I also specified the EmployeeID column in the SortedData property because the source data was sorted by ID. Most of the other properties, I left with their default values. However, for the Options property, I selected specific bulk load options. To do so, I clicked the down arrow associated with the property to open a box of options, shown in Figure 14.

Figure 14: Selecting load options in the Bulk Insert Task editor
As you can see, you can choose whether to fire triggers, check constraints, maintain null or identity values, or apply a table-level lock during the bulk load operation. The options you select are then listed in the Options box, with the options themselves separated by commas. Once you've configured your options, you're ready to bulk load your data. For a complete description of how to configure the Bulk Insert task, see the topic Bulk Insert Task in SQL Server Books Online.
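Once the package has run, it is worth confirming that all three routes loaded the same data. The following query is a hypothetical post-run check rather than part of the package; it simply compares row counts across the three target tables created earlier.

-- Each Sequence container loads the same source file, so the counts should match.
SELECT 'Employees1' AS TargetTable, COUNT(*) AS RowsLoaded FROM dbo.Employees1
UNION ALL
SELECT 'Employees2', COUNT(*) FROM dbo.Employees2
UNION ALL
SELECT 'Employees3', COUNT(*) FROM dbo.Employees3;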

Bulk Inserting Data into a SQL Server Database

Clearly, the three SSIS components available for bulk loading data into a SQL Server database offer a great deal of flexibility in terms of loading the data and scaling out your solution. If you're copying data out of a text file and that data does not need to be converted or transformed in any way, the Bulk Insert task is the simplest solution. However, you should use the SQL Server destination or OLE DB destination if you must perform any conversions or transformations or if you're retrieving data from a source other than a text file. As for which of the two to choose, if you're loading the data into a local instance of SQL Server and scaling out is not a consideration, you can probably stick with the SQL Server destination. On the other hand, if you want the ability to scale out your solution or you must load data into a remote instance of SQL Server, use the OLE DB destination. Keep in mind, however, that if your requirements are such that more than one scenario will work, you should consider testing them all and determining from there which solution is the most effective. You might find that simpler is not always better, or vice versa.

Simple-Talk.com

How to Import Data from HTML pages


30 March 2011 by Phil Factor
It turns out that there are plenty of ways to get data into SQL Server from websites, whether the data is in tables, lists or DIVs. Phil finds to his surprise that it is easier to use Powershell and the HTML Agility Pack than some of the more traditional approaches. Web-scraping suddenly becomes more resilient.
Quite a lot of developers would like to read data reliably from websites, usually in order to subsequently load it into a database. There are several ways of doing so, and I've used most of them. If it is a one-off process, such as getting the names of countries, colours, or words for snow, then it isn't much of a problem. If you need to do it more regularly when data gets updated, then it can become more tedious. Any system that you use is likely to require constant maintenance because of the shifting nature of most websites. There are a number of snags which aren't always apparent when you're starting out with this sort of web-scraping technology. From a distance, it seems easy. An HTML table is the most obvious place to find data. An HTML table isn't in any way equivalent to a database table. For a start, there seems to be a wide range of opinions about how an HTML data table should be structured. The data, too, must always be kept at arm's length within the database until it is thoroughly checked. Some people dream of being able to blast data straight from the web page into a data table. So easily, the dream can become a nightmare. Imagine that you have an automated routine that is set up to get last week's price movements for a commodity from a website: Sugar Beet, let us say. Fine. Because the table you want doesn't have any id or class to uniquely identify it within the website, you choose instead to exploit the fact that it is the second table on the page. It all works well, and you are unaware that the first table, used for formatting the headings and logo prettily, is replaced by some remote designer in another country with a CSS solution using DIVs. The second table then becomes the following table, containing the prices for an entirely different commodity: Oilseed Rape. Because the prices are similar and you do not check often, you don't notice, and the business takes decisions on buying and selling sugar beet based on the fluctuations in the price of Oilseed Rape. Other things can go wrong. Designers can change tables by combining cells, either vertically or horizontally. The order of columns can change (some designers apparently don't think that column headings are 'cool'). Other websites use different table structures or don't use TH tags for headings. Some designers put in extra columns in a table of data purely as spacers, for sub-headings, pictures, or fancy borders. There are plenty of ways of fighting back against sloppy web pages to do web-scraping. You can get to your data more reliably if it is identified by its ID or the class assigned to it, or if it is contained within a consistent structure. To do this, you need a rather efficient way of getting to the various parts of the webpage. If you are doing an aggregation and warehousing of data from a number of sources, you need to do sanity checks. Is the data valid and reasonable? Has it fluctuated beyond the standard deviation? Is it within range? You need to do occasional cross-checks of a sample of the data with other sources of the same data.
In one company I worked for, the quantitative data for product comparisons with competitors were stored in Metric units for some products and Imperial units for others, without anyone noticing for months. It only came to light when we bought one of the competitors and their engineers then questioned the data. Any anomalies, however slight they seem, have to be logged, notified and then checked by an Administrator. There must be constant vigilance. Only when you are confident can you allow data updates. I see DBAs as being more like data-zoo-keepers than data-police. I like to check the data in a transit area within the database in order to do detailed checking before then using the data to update the database.
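To give a flavour of the sort of check I mean, here is a minimal T-SQL sketch. The Staging.CommodityPrice table, its columns and the three-standard-deviation threshold are all hypothetical; the point is only that suspect rows get flagged in the transit area rather than applied straight to the live data.

-- Flag any staged price more than three standard deviations from the recent mean.
WITH Recent AS (
    SELECT AVG(Price) AS MeanPrice, STDEV(Price) AS StdevPrice
    FROM Staging.CommodityPrice
    WHERE PriceDate >= DATEADD(DAY, -90, GETDATE())
)
SELECT s.PriceDate, s.Price
FROM Staging.CommodityPrice AS s
CROSS JOIN Recent AS r
WHERE ABS(s.Price - r.MeanPrice) > 3 * r.StdevPrice;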

The alternatives
Of the various methods of extracting data from HTML tables, these are the ones I've come across:

Regex dissection. This can be made to work but it is bad. Regexes were not designed for slicing up hierarchical data. An HTML page is deeply hierarchical. You can get them to work but the results are overly fragile. I've used programs in a language such as Perl or PHP in the past to do this, though any .NET language will do as well.

Recursive dissection. This will only work for well-formed HTML. Unfortunately, browsers are tolerant of badly-formed syntax. If you can be guaranteed to eat only valid XHTML, then it will work, but why bother when the existing tools exist to do it properly.

Iterative dissection. This is a handy approach in a language like SQL, when you have plenty of time to do the dissection to get at the table elements. You have more control than in the recursive solution, but the same fundamental problems have to be overcome: the HTML that is out there just doesn't always have close-tags such as </p>, and there are many interpretations of the table structure out there. Robyn and I showed how this worked in TSQL here.

ODBC. Microsoft released a Jet driver that was able to access a table from an HTML page, if you told it the URL, and you used it if you were feeling lucky. It is great when this just works, but it often disappoints. It is no longer actively maintained by Microsoft. I wrote about it here.

Driving the browser. Sometimes, you'll need to drive the IE browser as a COM object. This is especially the case if the site's contents are dynamic or are refreshed via AJAX. You then have to read the DOM via the document property to get at the actual data. Let's not go there in this article!

OCR. In the old days, we used to occasionally use the Lynx text browser for this, and then parse the data out of the subsequent text file. You may come across a Flash site, or one where the data is done as 'text as images'. Just screendump it and OCR the results. (For one-offs I use Abbyy screenshot reader!)

XPath queries. You can use either the .NET classes for XML or the built-in XQuery in SQL Server to do this. It will only work for valid XHTML.

XSLT. This is always a good conversation-stopper: there will always be someone who will say that it is soooo much easier to use XSLT. I've never tried it for this purpose. The problem is that you are not dealing with XML or, for that matter, XHTML.

HTML Agility Pack. This is so easy that it makes XPath seem quite fun. It works like standard XPath, but on ordinary HTML, warts and all. With it you can slice and dice HTML to your heart's content.

Using the HTML Agility Pack with PowerShell.


The HTML Agility Pack (HAP) was originally written by Simon Mourier, who spent fourteen years at Microsoft before becoming the CTO and co-founder of SoftFluent. The HAP is an assembly that works as a parser, building a read/write DOM and supporting plain XPath or XSLT. It exploits dotNet's implementation of XPath to allow you to parse HTML files straight from the web. The parser has no intrinsic understanding of the significance of the HTML tags; it treats HTML as if it were slightly zany XML. It works in a similar way to System.XML, but is very tolerant of the malformed HTML documents, fragments or streams that are so typically found around the Web. It can be used for a variety of purposes such as fixing or generating HTML pages, fixing links, or adding to the pages. It is ideal for automating the drudgery of creating web pages, such as creating tables of contents, footnotes, or references for HTML documents. It is perfect for web-scrapers too. It can even turn a web page into an RSS feed with just an XSLT file as the binding. It can be downloaded from here. It has recently been converted so as to support LINQ to Objects via a 'LINQ to Xml'-like interface. It isn't possible to use the HTML Agility Pack easily from within SQL Server, though you could probably write a CLR library pretty quickly. The downside is that you'd need to access the internet from your production server, which would be silly. However, driving the HAP from PowerShell, or any other language that you might prefer, is pretty easy.

First step
Here is some PowerShell that lists out some classical insults as text, from a test website. Quite simply, it downloads the file, gulps it into the HAP as a parsed file and then pulls the text out of the second table. ('(//table)[2]' means the second table, wherever it is in the document, whereas '//table[2]' means the second table within its parent element, which isn't quite the same thing.)
add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$result = $HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/quotations.html"))
$HTMLDocument.DocumentNode.SelectSingleNode("(//table)[2]").InnerText

This will give a list of insults. You would need to install the Agility Pack and give the path to it or put it in your Global Assembly Cache (GAC). In this, and subsequent PowerShell scripts, I'm giving quick 'n' dirty code that gets results, along with some test HTML pages, just so you can experiment and get a feel for it. You'll notice that this first script extracts the meat out of the sandwich by means of XPath queries. Normally, ordinary humans have no business looking at these, but in certain circumstances they are very useful. Of course, if you're a database programmer, you'll probably be familiar with XPath. Here is an example of shredding XHTML document fragments via XPath using SQL. I give just a small part of the document. Make sure that, in your mind's eye, the precious text is surrounded by pernicious adverts and other undesirable content.
DECLARE @xml xml
SELECT @XML='<body>
  <ol id="insults">
    <li>You have two ears and one mouth so you might listen the more and talk the less.</li>
    <li>You like your friends to be just clever enough to comprehend your cleverness and just stupid enough to admire it.</li>
    <li>You own and operate a ferocious temper. </li>
    <li>There is nothing wrong with you, that a miracle couldn''t cure.</li>
    <li>Why be disagreeable, when with a little effort you could be impossible? </li>
    <li>You look as if you had been poured into your clothes and had forgotten to say when.</li>
  </ol>
</body>
'
SELECT @xml.query('//li[2]/text()') --text of second element
SELECT Loc.query('.')
FROM @xml.nodes('/body/ol/li/text()') as T2(Loc)

However, that works with XML documents, or XHTML fragments, but not HTML.

Getting the contents of a list

Let's try PowerShell's .NET approach with a list (OL, in this case), which is the other main data-container you're likely to find in an HTML page.
add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/listExample.html"))
$list = $HTMLDocument.DocumentNode.SelectNodes("(//ol)[1]/li") #only one list
foreach ($listElement in $list)
  {$listElement.InnerText.Trim() -replace ('\n', "`r`n ") -replace('(?<=\s)\s' ,'') }

This is fun. Not only that, but it is resilient too. You'll notice that I've corrected, in this example, the interpretation of all the white-space within the text of the tag as being significant, just by using a couple of regex substitutions. If you execute this from xp_cmdshell in SQL Server, you're laughing, since you can read it straight into a single-column table and from there into the database.
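For anyone who wants to try the xp_cmdshell route, a minimal sketch follows. The script path is hypothetical, and xp_cmdshell has to be enabled on the instance (it is off by default), which is one more reason to keep this sort of thing well away from a production server.

-- Capture the PowerShell output, one line per row, into a single-column table.
CREATE TABLE #ScrapedLines (Line NVARCHAR(4000) NULL);

INSERT INTO #ScrapedLines (Line)
EXEC master.dbo.xp_cmdshell 'powershell.exe -NoProfile -File "C:\Scripts\GetInsultList.ps1"';

SELECT Line FROM #ScrapedLines WHERE Line IS NOT NULL;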

Getting data that isn't in a table into an array


Let's try something more serious. The front page of Simple-Talk.
add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/default.aspx"))
$Table = @()
$articleDetails = $HTMLDocument.DocumentNode.SelectNodes("//div[@class='middle']/div[@class='articlesummary']")
foreach ($articleDiv in $articleDetails)
{
  $Tuple = "" | Select Title, Author, Synopsis
  $Tuple.Title = $articleDiv.SelectSingleNode("a").innertext
  $Tuple.Author = $articleDiv.SelectSingleNode("div[1]").innertext
  $Tuple.Synopsis = $articleDiv.SelectSingleNode("div[2]").innertext
  $Table += $Tuple
}
$Table

Here, you'll immediately notice that we are specifying nodes by their class. We could use their IDs too. We've specified a couple of DIVs by their order within each of the articleDetails nodes. Where am I getting all this information on how to address individual parts of an HTML document via XPath? Well, we have a new wallchart, which we have just published, that has the information in it, and maps the functionality between XPath, CSS and javascript DOM. It should help a lot with the trickier bits of web-scraping. We have also created a multidimensional array in which to put the data. This makes it easier to manipulate once we've finished.

Getting your data into SQL Server


Now that we have this data, what can we do with it? The easiest approach is to write it to file and then read it into SQL Server via BCP. There is no need to explain how to read a file into SQL Server, as this is already well-covered in Simple-Talk, here, here, and here. For the PowerShell file, we simply replace the last line with:
$Table | export-CSV 'SimpleTalkArticles.csv'

I've left out any optional switches for simplicity. Or you can use the Export-Clixml cmdlet, if you wish, to export it as an XML file. There are a number of switches you can use to fine-tune the export of this XML data to file. You can even do this ...
$Table |ConvertTo-HTML

...in order to convert that data into a table in an HTML file (or a table fragment if you prefer)! If you then write it to file, it can then be read into Excel easily. At this stage, I hope you feel empowered to create some quick-n-dirty PowerShell routines to grab HTML data from the internet and produce files that can then be imported into SQL Server or Excel. The next stage would be to import into a SQL Server table from a PowerShell script. You can call a PowerShell script directly from SQL Server and import the data it returns into a table. Chad Miller has already covered an ingenious way of doing this without requiring a DataTable. If you are likely to want to import a lot of data, then a DataTable might be a good alternative.
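As a starting point, and assuming the CSV was written somewhere the server can see, something like the following would stage the exported file. This is a hedged sketch rather than a robust import: the path and staging table are illustrative, FIRSTROW = 3 assumes Export-CSV wrote its #TYPE line and a header row first, and quoted fields that contain commas would need a format file or further cleaning, so treat it as a starting point only.

-- Hypothetical staging table matching the three columns written by Export-CSV.
CREATE TABLE dbo.StagedArticles (
    Title    NVARCHAR(400) NULL,
    Author   NVARCHAR(200) NULL,
    Synopsis NVARCHAR(MAX) NULL
);

BULK INSERT dbo.StagedArticles
FROM 'C:\Data\SimpleTalkArticles.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 3);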

Using a datatable
In this next quick and dirty script we'll create a DataTable that can then be exported to SQL Server. The most effective way of doing this is to use Chad Miller's Write-DataTable function. You can, if the urge takes you, write to an XML file as if it were a multidimensional array, and thence gulp it into SQL Server.

$DataTable.WriteXml(".\TableData.xml")
$DataTable.WriteXmlSchema(".\TableData.xsd")

To take out some tedious logic, we'll do a generic table-scraper that doesn't attempt to give the proper column names but gives, for each cell, the cell value, the row of the cell and the column of the cell. The headers are added as row 0 if they can be found. With this information, it is pretty easy to produce a SQL Server result, as we'll show later on.
add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$result = $HTMLDocument.LoadHTML($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/TestForHTMLAgilityPack.html"))
#create the table and the columns
$DataTable = New-Object system.Data.DataTable 'TableData'
$datacol1 = New-Object system.Data.DataColumn Row,([int])
$datacol2 = New-Object system.Data.DataColumn Col,([int])
$datacol3 = New-Object system.Data.DataColumn Value,([string])
#add the columns to the DataTable
$DataTable.columns.add($datacol1)
$DataTable.columns.add($datacol2)
$DataTable.columns.add($datacol3)
#iterate through the cells in the third table
foreach ($Cell in $HTMLDocument.DocumentNode.SelectNodes("(//table)[3]//tr/td"))
{
  $ThisRow = $DataTable.NewRow()
  $ThisRow.Item('Value') = $Cell.InnerText.Trim() -replace ('\n', "`r`n ") -replace('(?<=\s)\s' ,'')
  $ThisRow.Item('Col') = $Cell.XPath -creplace '(?sim)[^\s]*td\[([\d]{1,6})\]\s*\z', '$1' #column
  $ThisRow.Item('Row') = $Cell.XPath -creplace '(?sim)[^\s]*tr\[([\d]{1,6})\]/td\[[\d]{1,6}\]\s*\z', '$1' #row
  $DataTable.Rows.Add($ThisRow)
}
#and iterate through the headers (these may be in a thead block)
foreach ($heading in $HTMLDocument.DocumentNode.SelectNodes("(//table)[3]//th"))
{
  $ThisRow = $DataTable.NewRow()
  $ThisRow.Item('Value') = $heading.InnerText.Trim()
  $ThisRow.Item('Col') = $heading.XPath -creplace '(?sim)[^\s]*th\[([\d]{1,6})\]\s*\z', '$1' #column
  $ThisRow.Item('Row') = 0
  $DataTable.Rows.Add($ThisRow)
}
$DataTable

This may look a bit wacky. The reason is that we're taking advantage of a short-cut. The HTML Agility Pack returns the XPath of every HTML element that it returns, so you have all the information you need about the column number and row number without iterative counters. It assumes that the table is regular, without colspans or rowspans. Here is an example from one of the cells that we just parsed.
/html[1]/body[1]/table[1]/tr[47]/td[1]

See that? It is the first table (table[1]), row 47, column 1, isn't it? You want the rightmost figures too, in case you have nesting. You just have to parse it out of the XPath string. I used regex to get the ordinal values for the final elements in the path. The sample file has three tables; you can try it out with the other two by changing the table[1] to table[2] or table[3]. If we export the DataTable to XML using the
$DataTable.WriteXml(".\TableData.xml")

...at the end of the last PowerShell script, then we can soon get the result into a SQL table that is easy to manipulate.
DECLARE @Data XML
SELECT @Data = BulkColumn
FROM OPENROWSET(BULK 'C:\workbench\TableData.XML', SINGLE_BLOB) AS x

In this example, I've left out most of the rows so that you can try this, out of sequence, without needing the table.
DECLARE @data XML
SET @Data= '<?xml version="1.0" standalone="yes" ?>
<DocumentElement>
  <TableData>
    <Row>1</Row>
    <Col>1</Col>
    <Value>Advertising agencies are eighty-five per cent confusion and fifteen per cent commission.</Value>
  </TableData>
  <TableData>
    <Row>1</Row>
    <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>2</Row>
    <Col>1</Col>
    <Value>An associate producer is the only guy in Hollywood who will associate with a producer.</Value>
  </TableData>
  <TableData>
    <Row>2</Row>
    <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>3</Row>
    <Col>1</Col>
    <Value>California is a fine place to live, if you happen to be an orange.</Value>
  </TableData>
  <TableData>
    <Row>3</Row>
    <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>0</Row>
    <Col>1</Col>
    <Value>Quotation</Value>
  </TableData>
  <TableData>
    <Row>0</Row>
    <Col>2</Col>
    <Value>Author</Value>
  </TableData>
</DocumentElement>'

SELECT max(case when col=1 then value else '' end) as Quote,
       max(case when col=2 then value else '' end) as Author
FROM (SELECT x.y.value('Col[1]', 'int') AS [Col],
             x.y.value('Row[1]', 'int') AS [Row],
             x.y.value('Value[1]', 'VARCHAR(200)') AS [Value]
      FROM @data.nodes('//DocumentElement/TableData') AS x ( y )
     ) rawTableData
GROUP BY row
HAVING row > 0
ORDER BY row

Yes, you're right, we've used XPath once again to produce a SQL result.

By having the data in the same three columns whatever the source, you can simplify the transition to proper relational data. You can define a Table-valued Parameter to do a lot of the work for you, or even pass the data from PowerShell to SQL Server using a TVP. It makes the whole process very simple.
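To make that idea concrete, here is a hedged sketch of what the TVP side might look like. The table type mirrors the generic Row/Col/Value shape produced above; the Quotations table and the procedure name are purely illustrative.

-- A table type for the scraped cells, and a procedure that pivots them into rows.
CREATE TABLE dbo.Quotations (Quote NVARCHAR(400) NULL, Author NVARCHAR(400) NULL);
GO
CREATE TYPE dbo.ScrapedCellType AS TABLE (
    [Row]   INT NOT NULL,
    [Col]   INT NOT NULL,
    [Value] NVARCHAR(400) NULL
);
GO
CREATE PROCEDURE dbo.LoadQuotations
    @Cells dbo.ScrapedCellType READONLY
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.Quotations (Quote, Author)
    SELECT MAX(CASE WHEN [Col] = 1 THEN [Value] END) AS Quote,
           MAX(CASE WHEN [Col] = 2 THEN [Value] END) AS Author
    FROM @Cells
    WHERE [Row] > 0
    GROUP BY [Row];
END;

From PowerShell, the DataTable built in the previous section can be passed straight into @Cells as a parameter whose SqlDbType is Structured, which avoids the intermediate file altogether.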

Using a web-scraper within an application


So, how would you go about scraping a site? The data is there but generally you have no idea of the structure of the site. I usually start by checking that the site can be read by this technique (dynamic content requires a very different technique) and that robots have not been specifically forbidden. If it requires authentication, a POST, or any other complication such as the user-agent, then it is better to use a more specialised tool to do the download. I like both CURL and WGET. With most pages, it is easy to work out where the data that you want is held, just by inspecting the source. FireBug in Firefox will help you locate an element that is difficult to find. For the more complicated sites, it is simple to scan a site for a particular word or phrase in the InnerText using HAP, and get back the absolute XPath for the elements that have the phrase in their InnerText. You can also sometimes locate your data by the HREF of an anchor. This snippet displays the link for every anchor on the page and its corresponding XPath.
foreach ($link in $HTMLDocument.DocumentNode.SelectNodes("//a[@href]"))
{
  $link.Attributes["href"].value + ' is at address ' + $link.XPath
}

...or you can search for text, either within the element (contains) or at its start (starts-with), and get the XPath of the matching elements:

foreach ($element in $HTMLDocument.DocumentNode.SelectNodes("//td[contains(text(),'Fred')]"))
{
  $element.XPath + ' --- ''' + $element.innerText + ''''
}

With a fair wind and an XPath reference wallchart, a great deal is possible. If, for example, the data always has the same title, even if its location in the page varies, you can write a script that gets its location purely by looking for the heading. Some data is in ordinary paragraph tags, but you can still get at them via XPath if they follow a particular heading. XPath has a great deal of magic for awkward data-gathering. For data-gathering, I generally use dedicated PCs within the domain. These need very little power, so you can use old nags. I never let SQL Server itself anywhere near the internet. On these PCs, I have a scheduled task that runs a script that downloads the next task (ID and parameters) from the SQL Server, and runs it if it is due, returning the data to SQL Server, using Windows authentication. Each task corresponds to a single data collection on one site. All errors and warnings are logged, together with the taskID, the user, and time of day, within the SQL Server database. When the data is received, it is scrubbed, checked, compared with the existing data and then, if all is well, the delta is entered into the database.

Conclusions
Using the HTML Agility Pack is great for the run-of-the-mill reading of data, and you may never hit its limitations, unless of course you are scraping the contents of an AJAX site, or the data is in Flash. It isn't perfect, since the HTML file is treated without understanding the semantics of HTML. This is fine up to a level of detail, but HTML tables really are awful because they allow you to mix presentation and meaning. The Colspan and Rowspan attributes have no equivalent meaning in any sensible data table, and make extracting the data more tiresome than it need be. Although I've tended to use RegEx queries in the past, I'm now convinced that the HTML Agility Pack is a more sensible approach for general use in extracting data from HTML in .NET.

Simple-Talk.com

Authentication and Authorization with Windows Accounts in ASP.NET


30 March 2011 by Matteo Slaviero
If you are providing web-based information for a closed group of users, such as a company or similar organisation with roles and membership, then Windows authentication makes a great deal of sense for ASP.NET websites or even .NET applications. Why, and how do you implement it? Matteo explains all.
Probably almost all of you have developed, or are developing, ASP.NET applications that allow users to manage their own data and resources in a multi-user environment. These will require that each user has his own user name and password, which he uses to log into the web application and access his information. To accomplish this, you may be using, or have used, ASP.NET Forms authentication. The user enters his username and password in the login page and, after they are authenticated against some database tables, he is ready to operate. In this article I would like to propose a different schema that relies on users' Windows accounts rather than Forms authentication, and show the benefits that this approach can offer. We will consider only those ASP.NET applications that are owned by an organization in which all users have their own Windows account, maybe stored in the company's Active Directory.

Authentication and Authorization


When we create a web application, we want to expose the application's users to information. This might be text, data, documents, multimedia content, and so on. Sometimes, we also need to manage access to this information, restricting certain users' access to some of it. This is where authentication and authorization come in. Before presenting this Windows account authentication and authorization proposal, I would like to define what authentication and authorization mean, the difference between the two, and how the .NET Framework manages them. If you are already confident with these concepts you can skip to the next section.

Authentication
Generally speaking, Authentication is the ability to identify a particular entity. The need for authentication occurs when we have some resources that we want to make available to different entities. We store these resources in a centralized place and instruct the system that manages them to prevent entities that we don't recognize from having access. Anonymous authentication refers to a situation in which we grant access to resources to all users, even if we don't know them. In web applications, we expose resources to users. We authenticate each user by requesting his credentials, normally a username and password, that we have assigned to him, or that he got during what we call the registration process. The .NET Framework uses the following authentication terminology:

Principal: this represents the security context under which code is running. Every executing thread has an associated principal. Identity: this represents the identity of the authenticated user. Every Principal has an associated identity.
It also defines the following classes, contained in the System.Security assembly:
GenericPrincipal, WindowsPrincipal
GenericIdentity, WindowsIdentity
As their names suggest, WindowsPrincipal and WindowsIdentity are related to Principals and Identities associated with a Windows account, while GenericPrincipal and GenericIdentity are related to generic authentication mechanisms. GenericPrincipal and WindowsPrincipal implement the IPrincipal interface, while GenericIdentity and WindowsIdentity implement the IIdentity interface.

Authorization
Authorization is the ability to grant or deny access to resources, according to the rights defined for the different kinds of entities requesting them. When dealing with the Windows Operating System, and its underlying NTFS file system, authorizations are managed by assigning to each object (files, registry keys, cryptographic keys and so on) a list of the permissions granted to each user recognized by the system. This list is commonly called the Access Control List or ACL (the correct name is actually Discretionary Access Control List or DACL, to distinguish it from the System Access Control List or SACL). The ACL is a collection of Access Control Entries or ACEs. Each ACE contains the identifier for a specific user (Security Identifier or SID) and the permissions granted to it. As you probably already know, to view the ACL for a specific file, you right-click the file name, select Properties and click on the Security tab. You will see something like this:

Figure 1: ACL editor for a demo file.


The Group or user names section lists all the users and groups, by name, which have at least one ACE in the ACL, while the Permissions section lists all the permissions associated with a specific group or user (or, rather, with its SID). You can modify the ACL by pressing the Edit button. To view the ACL of a specific file using the .NET Framework, you can use the FileSecurity class that you can find under the System.Security.AccessControl namespace. The following example shows how to browse the ACL of a file named C:\resource.txt:

FileSecurity f = File.GetAccessControl(@"c:\resource.txt");
AuthorizationRuleCollection acl = f.GetAccessRules(true, true, typeof(NTAccount));
foreach (FileSystemAccessRule ace in acl)
{
    Console.WriteLine("Identity: " + ace.IdentityReference.ToString());
    Console.WriteLine("Access Control Type: " + ace.AccessControlType);
    Console.WriteLine("Permissions: " + ace.FileSystemRights.ToString() + "\n");
}
By running this code in a console application, you get the following output:

Figure 2: Output of a console application that lists the ACEs of a demo file.

Authentication in IIS 7 and 7.5


With definitions out of the way, we're ready to see how to set up a Windows account authentication and authorization schema in an ASP.NET application. First, we'll look at how authentication with Windows accounts works. It's important to note that this type of authentication doesn't involve the ASP.NET engine. It works at the Internet Information Server (IIS) level instead, so all that's required is the correct IIS configuration. The authentication types available in IIS can be viewed by using the IIS Manager:

Figure 3: List of all authentication methods implemented in IIS 7.0 and 7.5.
Anonymous Authentication: this is the most commonly used type of authentication. With it, all users can access the web site.
ASP.NET Impersonation: this is not really an authentication method, but relates to authorizations granted to a web site's users. We will see later how impersonation works.

Basic Authentication: this is a Windows account authentication, in the sense that the user needs to have a username and password,
recognized by the operating system, to use the application. When the user calls a web page, a dialog box asking for his credentials appears. If the user provides valid credentials for a valid Windows account, the authentication succeeds. This type of authentication is not considered secure because authentication data is transmitted to the server as plain text.

Digest Authentication: this is similar to Basic Authentication, but more secure. Authentication data is sent to the server as a hash, rather than
plain text. Basic Authentication and Digest Authentication are both standardized authentication methods. They are defined in RFC 2617.

Forms Authentication: this is ASP.NETs own authentication, based on the login page and the storage of users credentials in a database, or
similar location.

Windows Authentication: this type of authentication uses the NTLM or Kerberos Windows authentication protocols, the same protocols used
to log into Windows machines. As with Basic Authentication and Digest Authentication, the credentials provided by the user must match a valid Windows account. There are two other authentication methods that I have not mentioned here: Active Directory Client Certificate Mapping Authentication and IIS Client Certificate Mapping Authentication. Both use the X.509 digital certificate installed on the client; how they work is outside the scope of this article. For the purpose of this article, we can use Basic Authentication, Digest Authentication or Windows Authentication, each of which relies on Windows accounts. When they're used, the current executing thread is associated with a Principal object that is able to give us information about the authenticated user. I wrote a simple application that shows you how to do that. Its source code is available at the top of this article as a zip file. The application defines a method, called WritePrincipalAndIdentity(), which gives us the following information:
1. The name of the authenticated user.
2. The user's role, by checking his role membership.
3. The type of authentication performed.
The method's body is given by:

/// <summary>
/// Explore the authentication properties of the current thread.
/// </summary>
public void WritePrincipalAndIdentity()
{
    IPrincipal p = Thread.CurrentPrincipal;
    IIdentity i = Thread.CurrentPrincipal.Identity;
    WriteToPage("Identity Name: " + i.Name);
    WriteToPage("Is Administrator: " + p.IsInRole(@"BUILTIN\Administrators"));
    WriteToPage("Is Authenticate: " + i.IsAuthenticated);
    WriteToPage("Authentication Type: " + i.AuthenticationType);
    WriteToPage("&nbsp");
}
Where the WriteToPage() method is a helper method that encapsulates the logic needed to write text inside the page. Rather than using Thread.CurrentPrincipal, we could use the User property of the Page object to achieve the same result. I prefer to use Thread.CurrentPrincipal, to point out that the principal is always associated with the executing thread. The importance of this will be clearer in the Role-Based Security paragraph. When we run this application using, for example, Digest Authentication (remembering to disable Anonymous Authentication), the logon window asks us for our credentials.

Figure 4: Logon dialog box. To access the web site we need a valid account defined in a domain named CASSANDRA.
If we provide a valid account defined in the CASSANDRA domain, we will be able to log on to the application. Once we've provided it, we obtain something like this:

Figure 5: Demo web application's output.


Figure 5 shows that the identity of the user who performed the request has been authenticated. It also shows that his user name is CASSANDRA\matteo, the domain account used to perform the request, that the authentication method used was Digest Authentication, and that the user is not an administrator.

Suppose that we need to write a web application that associates each user with his own data, for example a list of contacts or some appointments. It is easy to see that, at this stage, we have all the information needed to manage all the data (contacts or appointments) related to a single user. If we save it all in a database, using the username (or better, a hash of it) provided by the authentication stage as the table key, we can fill the application's web pages with only that user's specific content, just as we would with Forms Authentication, and without having to write any of the authentication plumbing ourselves.

Another important advantage comes from the fact that, by using the Principal object, we are able to check whether an authenticated user belongs to a specific security group. With this information, we can develop applications that are role-enabled, in the sense that we can allow a specific user to use only the features available to his role. Suppose, for example, that the web application has an admin section and we want only administrators to see it: we can check the role of the authenticated user and hide the links to the admin page if the user is not an administrator. If we use Active Directory as the container for users' credentials, we can take advantage of its flexible group structures to build role-based permissions for even very heterogeneous kinds of users.

However, from a security point of view, authentication alone is not enough. If, for example, we hide the link to the admin page from non-administrator users, they can nonetheless reach the admin page using its URL, breaking the security of the site. For this reason, authorization plays a very important role in designing our application. We will now see how to prevent this security issue from occurring.
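To make that role check concrete, here is a minimal sketch (my own illustration, not code from the demo application; the lnkAdmin control name is an assumption) of hiding an admin link from non-administrators:

// Sketch: hide a hypothetical admin link unless the authenticated user is an administrator.
// Assumes using System.Security.Principal and System.Threading in the code-behind,
// and a HyperLink control named lnkAdmin on the page.
protected void Page_Load(object sender, EventArgs e)
{
    IPrincipal p = Thread.CurrentPrincipal;
    lnkAdmin.Visible = p.IsInRole(@"BUILTIN\Administrators");
}

As the paragraph above points out, hiding the link is only a usability measure, not a security boundary; the page itself must still be protected by authorization.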

Authorization in ASP.NET Applications


Suppose that we have a file, resource.txt, inside the web application root that we want to make available only to administrators. We can prevent users who aren't administrators from accessing the file by setting up its ACL properly. For simplicity, let's say we want to prevent CASSANDRA\matteo from accessing it. Figure 6 shows how to do that:

Figure 6: ACL for the CASSANDRA\matteo user with denied permissions.


We have denied the Read and Read & execute permissions to the CASSANDRA\matteo account; now we want to see what happens when our demo application tries to open the file. To do so, we add a new method to it:

/// <summary>
/// Check if a resource can be loaded.
/// </summary>
public void CanLoadResource()
{
    FileStream stream = null;
    try
    {
        stream = File.OpenRead(Server.MapPath("resource.txt"));
        WriteToPage("Access to file allowed.");
    }
    catch (UnauthorizedAccessException)
    {
        WriteException("Access to file denied.");
    }
    finally
    {
        if (stream != null)
            stream.Dispose();
    }
}

The CanLoadResource() method tries to open resource.txt in order to read its content. If the load succeeds, the "Access to file allowed." message is written to the page. If an UnauthorizedAccessException is thrown, the "Access to file denied." message is written to the page as an error. The WriteException() method is a helper method used to write an exception message on the page. Now we launch our application with the permissions set as in Figure 6 and use CASSANDRA\matteo to log into the application. Doing that, we obtain something that should look strange:

Figure 7: Logon with user CASSANDRA\matteo with permissions as in Figure 6.


As you can see in Figure 7, resource.txt can be loaded by the application even though the credentials provided for the login refer to an account that has no permission to access it. This happens because, in this case, the Application Pool associated with the web application works in Integrated mode, which ties authentication and authorization to different users. Specifically, authentication involves the user identified by the credentials provided, while authorization involves the user account under which the Application Pool associated with the application runs. In our example, the Application Pool uses the NETWORK SERVICE account, which has permission to access the file. We'll try to deny these permissions by modifying the ACL of the resource.txt file:

Figure 8: ACL for the NETWORK SERVICE account with denied permissions.
If we launch our application, we now obtain:

Figure 9: Logon with user CASSANDRA\matteo, still with the permissions in Figure 8.
As you can see, the file is no longer available, demonstrating that the authorization process involves the NETWORK SERVICE account. To apply authorization at the level of the authenticated user, we need to use impersonation. With impersonation, requests execute with the permissions associated with the authenticated user rather than those of the Application Pool identity. Impersonation only works when the Application Pool runs in Classic Mode (in Integrated mode the web application generates a 500 Internal Server Error). To enable impersonation, we need to enable the ASP.NET Impersonation feature, as noted in Figure 3 and the discussion that followed it. If we switch our Application Pool to Classic Mode (enabling the ASP.NET 4.0 ISAPI filters, too) and enable ASP.NET impersonation, the demo application output becomes:

Figure 10: Logon with user CASSANDRA\matteo, with permissions as in Figure 8 and Application Pool in Classic Mode.
We are now able to load resource.txt even though the NETWORK SERVICE account has no permission to access it. This shows that the permissions used were those associated with the authenticated user, not with the Application Pool's identity.

To take advantage of Integrated mode without having to abandon impersonation, we can use a different approach: run our application in Integrated mode and enable impersonation at the code level when we need it. To do so, we use the WindowsImpersonationContext class, defined in the System.Security.Principal namespace. We modify the CanLoadResource() method as follows:

/// <summary>
/// Check if a resource can be loaded.
/// </summary>
public void CanLoadResource()
{
    FileStream stream = null;
    WindowsImpersonationContext imp = null;
    try
    {
        IIdentity i = Thread.CurrentPrincipal.Identity;
        imp = ((WindowsIdentity)i).Impersonate();
        stream = File.OpenRead(Server.MapPath("resource.txt"));
        WriteToPage("Access to file allowed.");
    }
    catch (UnauthorizedAccessException)
    {
        WriteException("Access to file denied.");
    }
    finally
    {
        if (imp != null)
        {
            imp.Undo();
            imp.Dispose();
        }
        if (stream != null)
            stream.Dispose();
    }
}

With this modification, we force the application to impersonate the authenticated user before opening the file. To achieve this, we have used the Impersonate() method of the WindowsIdentity class (the class to which the Identity property belongs), which creates a WindowsImpersonationContext object. This object has a method, Undo(), that reverts the impersonation after the resource has been used. If we run our application with the permissions as in Figure 8, we see that we are able to access resource.txt even though the Application Pool is working in Integrated Mode.

Now we can resolve the security issue presented earlier. If we want to use Windows accounts to develop a role-based application, we can use authentication to identify the user requesting resources, and we can use authorization, based on the user's identity, to prevent access to resources not available to the user's role. If, for example, the resource we want to protect is a web page (like the admin page), we need to set its ACL with the right ACEs and use impersonation to force the Application Pool to use the authenticated user's permissions. However, as we have seen, when the Application Pool uses Integrated mode, impersonation is available only at code level. So, although it's easy in this situation to prevent access to resources (like the resource.txt file) needed by a web page, it's not so easy to prevent access to the web page itself. For this, we need to use another IIS feature available in IIS Manager, .NET Authorization Rules:

Figure 11: .NET Authorization Rules feature of IIS7 and IIS7.5.

.NET Authorization Rules is an authorization feature that works at the ASP.NET level, not at the IIS or file system level (as ACLs do). It therefore lets us ignore how IIS works and use impersonation both in Integrated Mode and in Classic Mode. I leave you to test how it works.
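As a rough sketch of what those rules mean at runtime (my own addition, not part of the original demo; the page path is an assumption), the same check can be made programmatically through UrlAuthorizationModule:

// Sketch: ask ASP.NET's URL authorization whether the principal attached to the
// executing thread may GET a protected page. "~/Admin/Default.aspx" is an assumed path.
using System.Security.Principal;
using System.Threading;
using System.Web.Security;

public static bool CanReachAdminPage()
{
    IPrincipal user = Thread.CurrentPrincipal;
    return UrlAuthorizationModule.CheckUrlAccessForPrincipal(
        "~/Admin/Default.aspx",   // virtual path covered by the authorization rules
        user,                     // the authenticated principal
        "GET");                   // HTTP verb to check
}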

Role-Based Security
A further advantage of using Windows account authentication is the ability to use a .NET Framework security feature called Role-Based Security. Role-Based Security permits us to protect our resources from unauthorized authenticated users. It relies on checking whether an authenticated user belongs to a specific role that has authorization to access a specific resource. We have already seen how to do that: use the IsInRole() method of the thread's Principal object.

The .NET Framework security team decided to align this type of security check with Code Access Security (which I wrote about in previous articles) by defining a similar programming model. Specifically, a class named PrincipalPermission, found under the System.Security.Permissions namespace, has been defined. It permits us to check the role membership of an authenticated user both declaratively (using attributes) and imperatively (using objects), in the same manner as CAS checks. Suppose that we want resource.txt to be readable only by administrators. We can perform a declarative Role-Based Security check in this way:

/// <summary>
/// Load a resource
/// </summary>
[PrincipalPermissionAttribute(SecurityAction.Demand, Name = "myname", Role = "administrators")]
public void LoadResource()
{
    ..

where myname is the username that we want to check. If declarative Role-Based Security is not what we need (because, in this case, we need to know the identity of the user first), we can use an imperative Role-Based Security check:

/// <summary>
/// Load a Resource
/// </summary>
public void LoadResource()
{
    try
    {
        // Create a PrincipalPermission object.
        PrincipalPermission permission =
            new PrincipalPermission(Thread.CurrentPrincipal.Identity.Name, "Administrators");
        // Demand this permission.
        permission.Demand();
        ..
    }
    catch (SecurityException e)
    {
        ..
    }
}

In both cases, if the user does not belong to the Administrators group, a security exception is thrown. The PrincipalPermission class doesn't add anything to our ability to check the permissions of an authenticated user. In my opinion, the IsInRole() method gives us all the instruments we need, and is simpler to use. Despite this, I've included PrincipalPermission in this discussion for completeness. Maybe this is the same reason the .NET development team added this class to the .NET Framework base classes.

I end this section by mentioning that Role-Based Security can also be implemented in desktop applications. In this case, the authenticated user is the user who logs into the machine. When a desktop application starts, by default, the identity of the authenticated user is not attached to the executing thread. The Principal property of the current thread and its Identity property are set to GenericPrincipal and GenericIdentity respectively, and the Name property of the Identity is empty. If we launch the following code in a Console application:

static void Main(string[] args)
{
    Console.WriteLine("Type of Identity: " + Thread.CurrentPrincipal.Identity.GetType());
    Console.WriteLine("Identity Name: " + Thread.CurrentPrincipal.Identity.Name);
}

We get:

Figure 12: Default Identity in a Console Application.


So we see that the application is not able to recognize the user who has logged in. This is, however, a feature we can turn on. We need to modify the previous code as follows:

static void Main(string[] args)
{
    AppDomain.CurrentDomain.SetPrincipalPolicy(PrincipalPolicy.WindowsPrincipal);
    Console.WriteLine("Type of Identity: " + Thread.CurrentPrincipal.Identity.GetType());
    Console.WriteLine("Identity Name: " + Thread.CurrentPrincipal.Identity.Name);
}

Launching the application, we now get:

Figure 13: Identity in a Console Application.


As you can see from Figure 13, Identity has been initialized with a WindowsIdentity object and a reference to the logged-in user has been added. We are now able to use Role-Based Security even in desktop applications.

Conclusion
In this article we have seen how Windows accounts can be used to implement authentication and authorization in ASP.NET applications. Even though this approach is rarely used, Forms Authentication being the commonly adopted solution, it has several advantages:

1. Less code to develop and maintain. Authorization and authentication with Windows accounts do not require the developer to write specific code for the management of user credentials, authorizations, password recovery and so on.

2. Centralization of user credentials, access rights, password policies, role-based policies and identity management in general. All the security information related to a specific user is stored in a centralized place, Active Directory. When a new employee arrives at an organization, permissions have to be added only in the directory structure, not in each web server used by the company, making the authorization process simpler to manage.

3. Improved security. In a decentralized security environment, users sometimes have to remember more than one username and password, and are sometimes forced to write them down to remember them; security experts consider this one of the most dangerous security issues. Moreover, if an employee with, say, ten accounts for ten different applications, stored in ten different places, leaves an organization, it's easy to forget to remove all of their credentials, allowing them to access, or even steal, confidential data.


Anatomy of a .NET Assembly - The DOS stub


28 March 2011 by Simon Cooper
The DOS stub at the top of the file is the first thing you notice when you open a .NET assembly in a hex editor. But what do those bytes mean, and what do they do? As I discussed in a previous post, the first 64 bytes are the DOS header, and the next 64 bytes are the stub program. What does that program do?

Real mode & DOS


All modern x86-compatible processors, right up to the latest Pentiums and Athlons, boot up into real mode, the addressing mode used in the original 8086, and the mode used by all DOS programs. In this mode, processor registers are all 16 bits long, and the currently running program is entirely responsible for the 1024kB of physical memory available to it (how times have changed...). Memory addresses are specified by adding an offset to the (shifted) value of one of four segment registers. These specify the 'base' addresses of the stack, code (the program loaded into memory), and data segments within memory (the fourth segment register is an 'extra' that can be used as the program sees fit). When a DOS program is run, the contents of the executable are loaded into memory, processor registers are set to values specified in the DOS header, and the bytes immediately following the header are executed on the processor.
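To make the segment:offset arithmetic concrete, here is a tiny sketch of my own (not from the original post) showing how a real-mode physical address is formed:

// Sketch: compute a real-mode physical address from a segment:offset pair.
// The values in the example are purely illustrative.
static uint RealModeAddress(ushort segment, ushort offset)
{
    // segment << 4 is the "shifted" segment value mentioned above
    return ((uint)segment << 4) + offset;
}

// Example: segment 0x1234, offset 0x0010 gives physical address 0x12350.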

Disassembling the DOS stub

Most of the DOS stub consists of the string This program cannot be run in DOS mode.\r\r\n$. In the example .NET assembly above, this string starts at file offset 0x4E. Before that, we have the following bytes:

0E 1F BA 0E 00 B4 09 CD 21 B8 01 4C CD 21

If we run this through a disassembler, we get the following 16-bit x86 code:
push cs
pop ds
mov dx, 0xe
mov ah, 0x9
int 0x21
mov ax, 0x4c01
int 0x21

Well, what on earth is that doing? Let's break it down into stages.
1. push cs
   pop ds

These two instructions operate on the stack available to real mode programs; the push instruction is essentially shorthand for 'decrement the stack pointer and copy the specified value to the location it now points to', while pop does the opposite. To start off with, the cs register points to where the program code is loaded into memory. These two instructions are copying the value of the cs (code segment) register to the ds (data segment) register via the stack.

2. mov dx, 0xe

This sets the value of the dx register to the constant 0xe. If you have a look at the stub above, this is the offset at which the text string begins, counting from the start of the code.

3. mov ah, 0x9
   int 0x21

Let's start off with int 0x21. This instruction invokes a software interrupt; interrupt 0x21 is the interrupt number for the DOS API (yes, such a thing exists!). The ah register (the high byte of ax) contains the number of the API function to call. If you have a look at the list of function codes, 0x9 corresponds to 'Write string to STDOUT', with the string to write pointed to by ds:dx and terminated by $. As ds has been set to the same value as cs, and the offset 0xe is in dx, this prints the string This program cannot be run in DOS mode.\r\r\n to the console.

4. mov ax, 0x4c01
   int 0x21

This is another DOS function call. The high byte of ax, 0x4c, corresponds to the Exit function, and the low byte, 0x01, specifies the return value. So this stops execution of the program and returns to the DOS prompt with a return value of 1.
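If you want to inspect the stub in your own assemblies, the following sketch (my own, not from the post) dumps the stub bytes; the file name is a placeholder and it assumes a well-formed PE file:

// Sketch: hex-dump the DOS stub of a .NET assembly (or any PE file).
// "MyAssembly.exe" is a placeholder path.
using System;
using System.IO;

class DumpDosStub
{
    static void Main()
    {
        byte[] file = File.ReadAllBytes("MyAssembly.exe");

        // e_lfanew (the offset of the PE header) lives at offset 0x3C of the DOS header.
        int peHeaderOffset = BitConverter.ToInt32(file, 0x3C);

        // The DOS header is 64 bytes; everything between it and the PE header is the stub.
        for (int i = 0x40; i < peHeaderOffset; i++)
        {
            Console.Write("{0:X2} ", file[i]);
        }
        Console.WriteLine();
    }
}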

Why is this here?


When the PE file format was introduced in Windows NT 3.1, there was a real possibility users would try to execute PE files in a DOS environment. So the existence of the DOS stub was specified in the PE file specification. When .NET 1.0 was introduced in 2002 for Windows 98, ME, NT, 2000 and XP, there was still a possibility that assemblies would be run under DOS, so the DOS stub was also specified in the CLR specification. For backwards compatibility, this stub still exists in .NET 4 assemblies.

That's it!
There we go, nice and simple. However, the CLR loader stub, which I'll be looking at in the next post, is significantly more complicated!

by Simon Cooper

Anatomy of a .NET Assembly - The CLR Loader stub


28 March 2011 by Simon Cooper
In Windows XP and above, the OS loader knows natively what to do with .NET executable assemblies, and fires up an instance of the CLR. However, .NET also runs on Windows 98, ME, NT 4.0, and 2000. When you run a .NET assembly on those older operating systems, the CLR has to be loaded somehow. This is the job of the CLR loader stub: a small section of native code within a .NET assembly.

Executing a PE file
Unlike the DOS stub I discussed in my previous post, PE executables don't have full access to the entire physical memory. Instead, they are loaded into virtual memory, split into pages that the OS maps onto physical memory as required. The header of each PE file contains information telling the loader how to map each section of the file into pages, and what access permissions to apply to those pages. Within a normal PE file, the executable code can execute jumps and calls to functions in other dlls, such as the Windows API. These dlls are loaded (imported) into the process' virtual memory address space as required by the OS loader. However, this loading into virtual memory causes several problems. Firstly, you need some way of storing calls to imported functions in a PE file that isn't a direct jmp <memory address>, as the memory address of the function is not known until the dll is loaded into memory. Secondly, the memory address at which the PE file itself is loaded is not known until load time. This means that internal function calls can't use a direct call either! Within a PE file, there are two structures that solve these problems: the import table, and relocations.

Import Table
Each entry in the import table specifies the information for a single imported dll. Along with the ASCII name of the dll, the entry contains the RVAs of two initially identical structures, the Import Address Table (IAT) and the Import Lookup Table (ILT). The IAT and ILT each contain an entry for every function imported from the dll, pointing at a two-byte hint and an ASCII function name. The import table and IAT are referenced from the 2nd and 13th data directory entries respectively, at the top of the file. This is the import table in my TinyAssembly example:

The single entry in the import table has the following highlighted bytes:

1. RVA of the ILT (0x2874, file offset 0xa74)
2. RVA of the dll name to import, as ASCII (0x288e, file offset 0xa8e)
3. RVA of the IAT (0x2000, file offset 0x200). You can see the IAT located before the CLI header.

The ILT and IAT store their information in the form of an RVA to an entry in the Hint/Name table (0x2880, file offset 0xa80), which contains the name of the function to call; in this case, "_CorExeMain". Calls to imported methods within the assembly are compiled as indirect jumps through IAT entries. When a PE file is loaded, the loader looks through the import table and overwrites each IAT entry with the actual address of the corresponding function in memory (but leaves the ILT alone). Then, when an indirect jmp through an IAT entry is executed, control is transferred to the imported function at the address the loader put there.

After the mscoree.dll string comes the loader stub itself. This is referenced from the AddressOfEntryPoint field in the PE header, and so is the first code executed when the assembly is loaded on a pre-XP Windows OS:

FF 25 00 20 40 00    jmp dword ptr [0x402000]

This is an indirect jump through the first entry in the IAT, at RVA 0x2000 (file offset 0x200). At runtime it reads the address that the loader wrote into that IAT entry and transfers execution there, which lands in the _CorExeMain function in mscoree.dll.
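As an aside, the RVA-to-file-offset translations quoted in this post fall out of the section headers: find the section whose virtual address range contains the RVA, then rebase it onto that section's raw-data pointer. A minimal sketch of my own (not from the post), using the values this particular assembly appears to use:

// Sketch: translate an RVA to a file offset using one section's header values.
// In this assembly the section holding these RVAs seems to have VirtualAddress 0x2000
// and PointerToRawData 0x200, which is consistent with RVA 0x2874 mapping to offset 0xa74.
static uint RvaToFileOffset(uint rva, uint sectionVirtualAddress, uint sectionPointerToRawData)
{
    // Caller is assumed to have picked the section that contains the RVA.
    return rva - sectionVirtualAddress + sectionPointerToRawData;
}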

Relocations
That's solved the problem for imported function calls, but what about internal jumps? These include jumps to IAT table entries, as well as direct jumps. Using a structure similar to the IAT would be quite inefficient, as that would introduce an extra level of indirection to every single jump performed in the executable. Instead, the PE header at the top of the file contains an ImageBase field that gives a preferred memory location that the file would like to be loaded at (in this file, 0x400000). All the internal and IAT jumps are compiled to use that preferred image base. If, when the file is loaded, it can be loaded at that virtual memory address, everything works as expected. However, if it can't (say, another dll has been loaded there instead), then all the jump addresses in the assembly need to be modified to take account of the new image base. This is done using the relocations table. The relocations table is stored in the .reloc section of the file, and contains an entry for every address that needs to be modified. In a .NET assembly, the only address that needs to be modified is the argument to the jmp instruction in the loader stub. In this assembly, the .reloc section starts at file offset 0x1200 and consists of the following bytes:

00 20 00 00 0c 00 00 00 a0 38 00 00
Now, in standard PE files there are expected to be quite a lot of relocations, so they are grouped into blocks. The first 8 bytes of each block specify the base RVA of the block and the size of the block (including the 8-byte header itself); the two-byte entries that follow specify the offsets within that block at which relocations have to be applied. At each specified offset, the loader modifies the address stored there to take account of the new ImageBase at which the file has been loaded. So, to interpret the relocation entry above:

1. 0x2000 - the base RVA of the block
2. 0xc - the size of this relocation block
3. 0x38a0 - the entry describing the relocation to apply. The high 4 bits specify the type of relocation (for .NET assemblies, this is always 0x3), so the offset within the block is 0x8a0.

This entry specifies that the address at RVA 0x28a0 (file offset 0xaa0) needs to be modified if the ImageBase changes. And, as you can see, this corresponds to the argument to the jmp instruction of the CLR loader stub.
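For anyone who wants to decode such a block themselves, here is a rough sketch of my own (not code from the post); relocBytes is assumed to hold the raw .reloc data, such as the twelve bytes shown above:

// Sketch: decode one .reloc block (base RVA, block size, then 2-byte entries).
using System;

static void DumpRelocBlock(byte[] relocBytes)
{
    uint baseRva   = BitConverter.ToUInt32(relocBytes, 0);   // 0x2000 in this assembly
    uint blockSize = BitConverter.ToUInt32(relocBytes, 4);   // 0xc

    for (int pos = 8; pos < blockSize; pos += 2)
    {
        ushort entry = BitConverter.ToUInt16(relocBytes, pos);
        int type   = entry >> 12;      // high 4 bits: relocation type (0x3 here)
        int offset = entry & 0x0FFF;   // low 12 bits: offset within the block
        if (type == 0) continue;       // type 0 entries are just padding
        Console.WriteLine("Relocate the address at RVA 0x{0:X}", baseRva + offset);
    }
}

Run against the bytes above, this prints RVA 0x28A0, matching the address identified in the post.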

Putting it all together


We've now got enough information to work out what happens when a .NET assembly is executed on a platform that doesn't natively understand .NET:

1. The file is loaded into memory, preferably at virtual memory offset 0x400000.
2. If the file couldn't be loaded at its preferred ImageBase, the addresses at the RVAs specified in the .reloc section are modified to take account of the new ImageBase.
3. The entries in the IAT are overwritten with the actual addresses of the imported functions in memory.
4. The code at the AddressOfEntryPoint RVA is executed. In .NET assemblies, this is an indirect jump through the first IAT entry.
5. That jump lands in the _CorExeMain function in mscoree.dll, which then loads the CLR, reads all the CLR information in the assembly, and starts executing the method specified by the entrypoint token in the CLI header.

Of course, in Windows XP and up, the loader natively knows that any PE file with a non-zero 15th data directory entry needs to be passed to the CLR. This code still needs to exist just in case the assembly is executed on a pre-XP OS.
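To make that check concrete, here is a sketch of my own (not from the post) that reads the 15th data directory entry of a PE file; the file name is a placeholder and error handling is omitted:

// Sketch: check whether a PE file has a non-zero 15th data directory entry
// (the CLI header), i.e. whether it is a .NET assembly. "MyAssembly.exe" is a placeholder.
using System;
using System.IO;

class IsDotNetCheck
{
    static void Main()
    {
        byte[] file = File.ReadAllBytes("MyAssembly.exe");

        int peOffset  = BitConverter.ToInt32(file, 0x3C);       // e_lfanew
        int optHeader = peOffset + 4 + 20;                       // skip "PE\0\0" + COFF header
        ushort magic  = BitConverter.ToUInt16(file, optHeader);  // 0x10B = PE32, 0x20B = PE32+

        // Data directories start at offset 96 (PE32) or 112 (PE32+) of the optional header;
        // the CLI header is directory index 14, and each entry is 8 bytes (RVA + size).
        int dirStart = optHeader + (magic == 0x20B ? 112 : 96);
        uint cliRva  = BitConverter.ToUInt32(file, dirStart + 14 * 8);
        uint cliSize = BitConverter.ToUInt32(file, dirStart + 14 * 8 + 4);

        Console.WriteLine(cliRva != 0 && cliSize != 0
            ? "Non-zero CLI header entry: this is a .NET assembly."
            : "No CLI header entry: not a .NET assembly.");
    }
}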

Or does it...?
What if the assembly is compiled as x64-only? The first Windows version to run 64-bit code was Windows XP, so an x64 assembly cannot run on any earlier OS. In that case, the CLR loader stub is not added to the output assembly (at least by the C# compiler); the assembly has a zero PE entrypoint, no .reloc section, and no import table, IAT or ILT. It still has the DOS stub, though.

Well, that's the CLR loader stub covered! I'll probably look at signature encodings next, but if anyone has any preferences please do comment below or email me.

by Simon Cooper
