by Tony Davis
Software changes. As facts of life go, that one's pretty cut and dried, and most of us are used to it.
But sometimes software that's central to the core of your business (such as the version control solution that safeguards your entire product line) abruptly changes, and suddenly you need to rethink the way you develop, store, and plan your code. These are the kinds of decisions that make you wish you'd listened to your Mom and studied dentistry. Early last year, Microsoft announced it was retiring support for Visual SourceSafe (VSS), one of the most popular source control tools ever created. Development had gradually been winding down for years, as Microsoft shifted strategy to a more ambitious money-maker, Team Foundation Server (TFS). But the install base of VSS remained huge, and the switchover to TFS, especially among smaller development shops that were clearly not the target for an enterprise tool like TFS, hasn't been as rapid as Microsoft may have hoped. So to signal users that now is the time to make a switch, Microsoft will stop supporting VSS in June 2012, officially rendering the product obsolete. For the tens of thousands of users who still rely on it, this is a little like being told their pacemaker will stop working just as bikini season opens.

What's the right thing to do? If you're like many of those users, your strategy has been to delay. Let's face it, VSS may have its warts, but up to now it's done the job. But while it may be okay to rely on that old VHS player in your basement, rather than upgrade all those Disney videos to Blu-ray, keeping an obsolete solution at the heart of your corporate development process is a Bad Idea. No support means no new bug fixes, no upgrades when the latest version of Windows rolls around, and no tech support. As every good IT guy will tell you, the key to a good Mission Critical Solution is having a vendor you can get out of bed at midnight when disaster strikes. Bluntly, if you haven't done so already, the time to replace Visual SourceSafe is now.
There's a wide variety of modern source control tools to choose from, many with significant advantages over the way you're doing things today, and skilled professional help for a smooth transition. To understand the choices we're faced with today, it pays to look at how we got here.

In the early 1990s, One Tree Software revolutionized version control with SourceSafe. Source control was still relatively new as far as established procedures went, but when Microsoft bought One Tree in 1994 and put real marketing weight behind SourceSafe, all that changed. SourceSafe was revolutionary in many ways, especially its great Windows support and friendly user interface. Just as importantly, it was distributed by Microsoft as part of nearly every MSDN subscription, which put it into the hands of many developers who had never used source control. It won hearts and minds and quickly became the de facto standard.

From a modern perspective, SourceSafe had numerous problems. Perhaps chief among these were its lack of atomic check-ins and its reliance on Windows file sharing, both of which contributed to gradual data corruption. Atomic check-ins ensure that if a check-in fails for any reason, no files are changed, so the repository is never left in a partially updated state. Without them, the next user can grab a mix of old and new files. Before you know it, a simple network hiccup can lead to a rapidly propagating version control nightmare.

SourceSafe was also designed before the explosion in internet usage, meaning it wasn't built around a modern client/server model. That left users who wanted to access code remotely out in the cold. My company, SourceGear, got its first big break when one of our developers here in rural Illinois, tired of being unable to check in code at home where he could feed his chickens, wrote a remote client for SourceSafe in 1997. It worked so well his chickens grew fat and happy.
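To make the atomic check-in point above concrete, here is a minimal sketch, in plain Python with nothing to do with real VSS internals: changes are staged on a private copy, and the repository is only touched once every write has succeeded, so a failure half-way through never leaves a mix of old and new files.

```python
# Toy model of all-or-nothing check-in behaviour. The "repository" is
# just a dict mapping file paths to contents.

def atomic_checkin(repo, changes):
    """Apply every change or none: stage on a copy, then swap in one step."""
    staged = dict(repo)                      # work on a private copy
    for path, content in changes:
        if content is None:                  # simulate a failed write
            raise IOError("network hiccup while writing " + path)
        staged[path] = content
    repo.clear()                             # commit point: atomic swap
    repo.update(staged)

repo = {"main.c": "v1", "util.c": "v1"}
try:
    # the second file "fails" half-way through the check-in
    atomic_checkin(repo, [("main.c", "v2"), ("util.c", None)])
except IOError:
    pass

print(repo)  # unchanged: no partial update
```

Without the staging copy, `main.c` would already be at v2 when the failure hit, which is exactly the partially-updated state a non-atomic tool leaves behind.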
Plus, we were able to turn his chicken-friendly add-on into our first commercial software hit, SourceGear SourceOffSite, the first tool that allowed developers to access a Visual SourceSafe database over the Internet. It still sells well for us today, over a decade later. SourceOffSite was our first toe in the water of the SourceSafe ecosystem, but pretty soon we were clutching an inflatable duck and captaining a life raft. To fully serve our customers, we had to become experts in the inner workings of VSS, and especially in the kinds of problems that caused the most heartache. Over the last decade we've helped countless customers diagnose troubled and corrupted VSS databases.
That's understandable, and with the right VSS replacement and technology partner, entirely achievable. I'll discuss some of the best options for getting underway with a VSS replacement quickly in the next section. But at the risk of being predictable, I'd suggest you take a minute to look past your immediate pain, and think a bit about what you want your version control solution to be two, five, even ten years from now. Considering how many customers we have who've been using VSS for over a decade, that's not as far-fetched as it may seem.

For example, one recent trend has been towards products with integrated source control and bug tracking, which many have found not only makes tasks easier, but also helps dev teams standardize on one tool and work a little more collaboratively. Most of us are still waiting to see the benefits of ALM (Application Lifecycle Management), with its promise of integrated dev, test, and build environments (and associated Cadillac price tag), but the marriage of version control and bug tracking makes sense.

Another recent trend is the move towards distributed version control solutions (DVCS). A DVCS differs from traditional version control in that it has no central server, and users can share change-sets peer to peer. It can bring with it a sometimes radical change in development methodology, particularly for those who approach a merge with all the enthusiasm of a root canal, but it can also bring tremendous benefits. Unfortunately, virtually all DVCSs on the market today, including Mercurial and Git, are not ready for the larger corporate market, lacking things like user accounts, file locks, and an enterprise license agreement.

In the absence of a solid DVCS choice, most customers we work with today choose one of three products: Subversion, Microsoft Team Foundation Server, or our own VSS replacement, SourceGear Vault Pro. All three are solid choices, but they are by no means the only ones.
For now, I'd like to focus on the issues involved in migrating away from VSS to a modern replacement.
Microsoft TFS is an excellent choice, especially for large teams. It's Microsoft's future, and they've made a substantial bet on its success. That means it's well funded, with some nice reporting tools and a lot of nifty new features such as Branch Visualization. It also scales well, handling teams of 1,000 developers or more better than any other tool on this list.

Subversion is another popular choice, and for good reasons. Unlike TFS, it's completely free and has more modest install costs. Many larger companies that typically steer away from open source have embraced it, partly because of its excellent community support and reputation for reliability. While its cross-platform support was once criticized as shaky, this has improved significantly in the past five years.

SourceGear Vault Pro was designed from the ground up as a VSS replacement tool, with a deliberately familiar user interface and support for all of VSS's features, including Share, Pin, and Shadow Folders. SourceSafe users can transition to Vault and work the way they always have, with less time lost on the learning curve. Vault is the only tool to offer the VSS Handoff feature, which gets you up and running immediately, cutting actual downtime to a few hours.

Which tool you choose, and what strategy you use to minimize the risk and downtime associated with switching version control solutions, will ultimately be dictated by your unique setup and requirements. But I hope I've illustrated that there is a VSS replacement option that will fit your needs, and that there's no reason to put it off any longer. The time to upgrade is now.
Simple-Talk.com
History
On the 22nd of March 2010, at VoiceCon, Microsoft officially announced Lync, at that moment called Wave 14. Some features were presented to the public: Call Admission Control, Location Awareness, Branch Office Survivability, and tighter integration with SharePoint, Office and Exchange. In the background, a Technology Adoption Program (TAP) had already started with several customers and partners. In June 2010, Microsoft gave a lot more information at TechEd North America, which was held in Orlando. Many presentations were given that included demos; these were then published on TechEd Online. On the 13th of September, Microsoft released the release candidate (RC) of Communications Server 14, also then known as Wave 14.
Rebrand
Communications Server 14 and Wave 14 were just code names for the new version. A number of rumors about the new name spread around the internet: among the possibilities mentioned were Office Communications Server 2010 and Communications Server 10. Together with the release candidate (RC), Microsoft also revealed the official new name of their product: Lync. Some people might think "Why Lync?" That was my reaction too. On this TechNet blog, Kirk Gregersen, Senior Director of Microsoft Communications, explains how they chose the name. Lync is a combination of "link" and "sync", and was chosen because Microsoft wanted a new name that reflected the major transformation of the product. On the 17th of November, Lync reached General Availability (GA) status. In this table, we give the new names of all the members of the product family:
2010 Release:
  Product Family: Microsoft Lync
  Server: Microsoft Lync Server 2010
  Client: Microsoft Lync 2010
  Web Client: Microsoft Lync Web App
  Service: Microsoft Lync Online

2007 Release:
  Product Family: Microsoft Office Communications
  Server: Microsoft Office Communications Server 2007 R2
  Client: Microsoft Office Communicator 2007 R2
  Web Client: Microsoft Office Communicator Web Access
  Service: Microsoft Office Communications Online
Because listing all the new features would create an enormous article, we will now have a look at just a few of the more interesting ones offered by Microsoft Lync Server 2010.
Changes are made first to the Central Management Store and after that to the other databases. Before distributing the changes to the other databases, all changes will be verified. The changes are replicated as read-only data to all other servers, including the Edge. In OCS 2007 R2, the Edge server did not share its configuration with any other server but stored it locally.
The update process can be divided into the following steps:
1. The administrator makes a change to the current configuration using the GUI or PowerShell.
2. The Master Replicator generates a snapshot containing the new configuration.
3. The File Transfer Agent distributes the snapshot to all other servers in the Lync environment.
4. The Local Replicator is notified about the new snapshot, applies the changes, and sends a status update to the CMS.
5. The replication status is sent back to the master, and the Master Replicator updates the status of the server.
The replication traffic between the Edge server(s) and the CMS is performed over HTTPS. If security policies do not allow this, a manual update will need to be performed every time the configuration is changed.
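The five replication steps above can be modelled with a short sketch. This is illustrative Python only, not anything that ships with Lync; the class and method names are invented for the illustration.

```python
# Toy model of the CMS replication flow: the master snapshots the
# configuration, the snapshot is distributed to every replica, and each
# replica applies it and reports its status back.

class CentralManagementStore:
    def __init__(self):
        self.config = {}
        self.version = 0
        self.replica_status = {}      # last applied version per server

    def change(self, key, value):
        # Step 1: administrator changes the configuration
        self.config[key] = value
        # Step 2: Master Replicator generates a versioned snapshot
        self.version += 1
        return {"version": self.version, "config": dict(self.config)}

class Replica:
    def __init__(self, name):
        self.name = name
        self.config = {}
        self.version = 0

    def apply(self, snapshot):
        # Step 4: Local Replicator applies the snapshot (read-only copy)
        self.config = dict(snapshot["config"])
        self.version = snapshot["version"]
        return (self.name, self.version)  # status update for the master

def replicate(cms, replicas, snapshot):
    # Step 3: distribute the snapshot to every server;
    # Step 5: record each replica's status back at the master
    for r in replicas:
        name, version = r.apply(snapshot)
        cms.replica_status[name] = version

cms = CentralManagementStore()
replicas = [Replica("FrontEnd01"), Replica("Edge01")]
snap = cms.change("ConferencingEnabled", True)
replicate(cms, replicas, snap)
print(cms.replica_status)
```

The master's status table is how an administrator can tell whether every server, including the Edge, has caught up with the latest configuration version.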
Management utilities
Lync has two management utilities: Lync Server Windows PowerShell and the Lync Server 2010 Control Panel. The PowerShell modules can be used for several tasks and are the equivalent of the Exchange Management Shell in Exchange 2010. The Control Panel GUI uses Silverlight, which means that one of the requirements for using the tool is the latest version of Silverlight. If it isn't detected during startup, the tool displays both a warning and a link to download Silverlight. So there is no MMC for Lync anymore.
Virtualization Support
A change to the virtualization support for Lync is very welcome, because a lot of customers already use, or are starting to use, virtualization technologies. Microsoft made a lot of modifications to the support policy compared to OCS 2007 R2. The following Lync Server environments are now supported using virtualization technologies:
- Standard Edition server topology, for proof-of-concept, pilot projects, and small businesses. This topology supports up to 2,000 users per virtual Standard Edition server.
- Data center topology, for larger deployments. This topology supports up to 5,000 users per virtual Enterprise Edition Front End Server.
At this moment only Windows 2008 R2 Hyper-V and VMware ESX 4.0 are supported. All server roles need to have Windows 2008 R2 as their operating system (OS).
Both the SBA and SBS can be configured as a backup registrar for the users in the branch office. If a WAN link failure occurs, users will re-register with the backup registrar and can continue to use basic voice functionality. Can an SBA or SBS host users? Yes, both solutions can. Before doing this, verify that the PSTN connectivity works correctly.
As you can see in this screenshot, the look of the client has completely changed. We can split the client window into a few parts:
- The upper part of the screen shows your current status, your location, and, if available, a picture of yourself. This can either be configured manually by the user or be retrieved from Active Directory.
- The next part is the so-called communication bar, which gives you quick access to IMs, received calls and voicemails. Besides this, it contains a dial pad which can be used to place calls.
- The biggest part of the client is the contact list, which now contains pictures of your contacts, if available, and their status text.
- At the bottom of the client a few options are available:
  - Phone icon: gives you the opportunity to configure the primary audio device.
  - Call forwarding: if Enterprise Voice is enabled, a user can easily configure call forwarding.
  - Information icon: displays information and errors if applicable. Note that this icon will disappear if no errors or warnings are present.
Besides searching for contacts by name, you now have the opportunity to search for people based on their skills. This functionality requires at least SharePoint 2007 and will enable you to search the profiles of users.
Implementation process
There are two ways to start the implementation of Lync:
- Using the planning tool
- Using the setup
The first option might look a little bit strange to you. How could you start an implementation with a planning tool that is, in most cases, used before the implementation? Well, you're completely right about this; but, starting from Lync, you can use the output from the planning tool as input for the setup. The planning tool can be downloaded for free from this website. The other option can be compared to the method which was available in OCS 2007 R2.

Before starting the installation you will need to install some prerequisites. First start with .NET 3.5 SP1 and the hotfixes mentioned in KB959209 and KB967190. Once these are installed, the Web Server role needs to be installed with some additional features. This can be done by running the following command:

ServerManagerCmd.exe -Install Web-Server Web-Http-Redirect Web-Scripting-Tools Web-Windows-Auth Web-Client-Auth Web-Asp-Net Web-Log-Libraries Web-Http-Tracing Web-Basic-Auth

Once everything is installed you can start the setup of Lync. The same steps are required for preparing the Active Directory as with OCS 2007 R2, so I do not need to explain them. Make sure you've got a backup of your Active Directory environment and, if possible, try the installation in a test environment before deploying it in production. Depending on your deployment, you will either choose the option to deploy the first Front End Server (only used for Standard Edition) or install the Topology Builder:
- Standard Edition: install the first Front End, followed by the Topology Builder
- Enterprise Edition: install the Topology Builder
In Lync you won't deploy a pool immediately; you first create a configuration using the Topology Builder. Once the tools are installed, you can start the Topology Builder to create your Lync environment. The Topology Builder contains some wizards which will guide you through the process of setting up your Lync environment. As I said earlier, you can also use the output from the planning tool; this will save you some time because you won't have to use the wizards to set up the environment. When you've finished building your Lync environment, it's time to publish the configuration to the Central Management Store. Once this is done, you can deploy the servers. Each server will look in the Central Management Store to determine which components it needs to install during setup. This has the advantage that you don't have to select the components manually anymore, because you've already specified them.
The setup can be started by selecting the option Install or Update Lync Server System. The first step is to install the local configuration store by selecting Install Local Configuration Store. This will install a SQL 2008 Express instance with a replica of the CMS database. Once the local configuration store is installed, select Setup or Remove Server Components. You will receive a prompt to select the method which should be used to gather the configuration file. As long as you can reach the CMS, you should leave the default option checked, which is to retrieve it directly from the Central Management Store. The second option, importing from a file, is only needed when performing the installation of the Edge Server. When you make a change to the topology, for example by adding the conferencing feature, you will need to run the setup again and select the Install or Update Lync Server System option.
Once everything is installed, the certificate needs to be requested and assigned. This can be done by selecting the option Request, Install or Assign Certificates. Press the Request button to create a certificate request. After the certificate has been requested, you will get the option to assign it immediately. This last option is only relevant if the CA doesn't require a CA administrator to approve the certificate. If approval is required, you will need to choose the Process Pending Certificates button, followed by Assign, to assign the certificate to the services. When the certificate is installed, it's time to start the Lync services. This can be done by selecting the Start Services option. Once this task has been performed, you have the option to check whether all services are running by selecting the Check Services button. This will launch services.msc, which gives you the ability to check that all services are in the Started state.
Usually they'd also be working with SQL Management Studio running locally and connected to the remote database server. When the app actually runs locally, all data connections go out to the shared, central development database; there is no local instance of SQL Server on the developers' machines. The alternative, of course, is dedicated development databases. Things now look a little bit different:
Obviously each developer has their own version of the database, but the biggest difference from the earlier model is the presence of a version control system. Why?
Frankly, I don't want the guy to be sysadmin on a box that may contain totally unrelated databases to which he probably shouldn't have access. I could give him ALTER TRACE permissions (and ultimately, I did), but of course this has to be set at the master database level, so now he has the right to inspect every query across every database. This discussion would never have even taken place in the dedicated local database scenario. He would simply have had the rights already, and it would have been dealt with locally. There are plenty of similar occasions where the rights a developer needs to do their job exceed what should be granted in a shared environment.
But having said that, the resource usage is actually pretty small unless you're seriously pounding it. Mine is sitting there consuming only 340MB of memory (about 4% of what's on the machine) and 0.4% of CPU. So unless you're running under-specced hardware (again, this is reflective of a deeper problem), the performance impact shouldn't even be noticeable.
Summary
If you're using a shared development database, the chances are that you've simply inherited the practice. Take a good look around; are you really working this way because it's the most effective possible way of building software? In times gone by, it wasn't easy to version-control databases, but we've now got tools at our disposal to do it.
In terms of .NET, there's obviously the official Microsoft Team Foundation route, but there are also offerings from third parties such as Red Gate's SQL Source Control. Around the middle of last year I wrote about Rocking your SQL Source Control world with Red Gate, and then Foolproof Atomic Versioning of Applications a little after that, both of which go into detail about the importance and value of versioning your databases. So I won't repeat the message here. Just make sure you're doing it, ok? Developing locally on dedicated databases is not only better for the process of database development, it's better for configuration, which means better for deployment. It's also better for development practices in general, such as experimentation and modularisation of work. It solves all sorts of other problems which are engendered by the communal DB model. So really, what's stopping you?
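As a small illustration of what version-controlling a database buys you: once each schema revision is stored as plain text, any two versions can be diffed, reviewed, and rolled back. The schema snippets below are hypothetical, and the diffing here uses Python's standard library purely for demonstration; in practice your VCS (TFS, Subversion, or a tool like SQL Source Control) gives you this history for free.

```python
# Diff two hypothetical revisions of a table script, the way a VCS would
# show a schema change in a code review.
import difflib

schema_r1 = """CREATE TABLE Orders (
    OrderId INT PRIMARY KEY,
    Total DECIMAL(10,2)
);"""

schema_r2 = """CREATE TABLE Orders (
    OrderId INT PRIMARY KEY,
    CustomerId INT NOT NULL,
    Total DECIMAL(10,2)
);"""

diff = list(difflib.unified_diff(
    schema_r1.splitlines(), schema_r2.splitlines(),
    fromfile="Orders.sql@r1", tofile="Orders.sql@r2", lineterm=""))

print("\n".join(diff))
```

The added `CustomerId` column shows up as a single `+` line, which is exactly the kind of visibility a shared, unversioned development database never gives you.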
Steve Furber is not as well known as he should be, which is surprising given that he is one of the leading pioneers of personal computing.
As part of the key team at Acorn Computers in the early 1980s (the developers and manufacturers of the famed BBC Micro, or "Beeb" as it was affectionately known), he was instrumental in designing the ARM (Acorn RISC Machine) chip, which made the company's hugely successful PCs almost twice as fast as anything else on the market. And it is this innovation which underpinned the rapid growth in mobile communications, which has opened up economic opportunities for millions in the developing and developed world. The ARM first appeared in the Acorn Archimedes in 1987, making Acorn the first company to ship RISC-based personal computers for the mass market. Acorn founder Hermann Hauser has said that Steve Furber is "one of the brightest guys I've ever worked with - brilliant - and when we decided to do a microprocessor on our own I made two great decisions - I gave them two things which National, Intel and Motorola had never given their design teams: the first was no money; the second was no people. The only way they could do it was to keep it really simple." Nearly 30 years on, Steve Furber (now the ICL Professor of Computer Engineering in the School of Computer Science at the University of Manchester) is still working with ARM processors, although on a much grander scale than the humble Archimedes.

RM: Steve, I think I'm right in saying that you were a member of the Cambridge University Processor Group, a club for computer hobbyists, when you were a student there. Was this rather like the Homebrew clubs in the US? When were you bitten by the computer bug?

SF: Yes, CUPG was very much a homebrew computer club, formed by Cambridge students. There the real men built their computers from TTL; only the wimps like me used microprocessors! I got bitten by the bug as a result of being drawn into CUPG, which I joined because I was interested in flying and flight simulators, and computers seemed a good way to build a flight simulator.
RM: Was there anything that drew you into computers other than "I seem to be good at this"?

SF: I guess it was a combination of my interest in flight simulators and my amateur electronics experience. I'd got rather put off building electronics in my teens because I struggled to make transistor circuits work (though I did get on better with valves!), but then I discovered the 741 op amp. As a Maths student, the 741 gave me an abstraction I could work with, hiding all the low-level transistor details inside a clean black box. I built guitar effects boxes and two 8-channel sound mixing desks using 741s and PCBs I etched in my kitchen sink. Digital electronics offered another clean abstraction that enabled me to build stuff that worked in a different domain: computers.
RM: One of your first major projects was designing the BBC Micro, a machine designed to accompany a computer literacy programme set up by the BBC. Did you in your wildest dreams expect it to take off as it did?

SF: We expected the BBC Micro to be a success, which is why we were so pleased to get the contract. But success meant selling the expected 12,000 machines. No-one anticipated the way home computers would take off in the early 1980s, to the extent that total Beeb sales were around 1.5 million. The first sense I got that this thing might exceed our wildest dreams was when we were lined up to give a seminar at the (then) IEE Savoy Place; I think this was 1982. The main lecture theatre at Savoy Place seats several hundred, but three times the capacity turned up. Coach-loads of people had come some distance, for example from Birmingham, to hear about the BBC Micro. A lot had to be sent away to avoid exceeding the safe capacity of the lecture theatre, and we were booked back to give the seminar two more times (and many other times around the UK and Ireland) to meet demand.
RM: What do you think inspired people to crowd in to see you, and to buy PCs in the numbers that they did?

SF: I think there was a widespread realisation that home computing was coming, and it was going to be exciting, useful and fun. But the wider public was nervous about the great diversity of machines available, all produced by small companies they hadn't heard of and found it hard to trust. Then in came the BBC Micro, bearing one of the most trusted names in the land. That was the signal they needed to take a step into the unknown. Sure, the BBC Micro was a bit more expensive than competing machines, but if I'm buying a product I don't fully understand I always prefer to pay a bit more for a name that I trust. And I like to think that the machine did live up to the brand expectations: it was solidly built (some Beebs survived ten years in the hands of primary school kids) and had sound educational credentials, attracting extensive educational support. I still, frequently, come across folk who tell me that the BBC Micro introduced them to programming and was the foundation for their subsequent career.
RM: Acorn had huge success in the late 1970s, and saw its profits rise from £3,000 in 1979 to £8.6m in July 1983, but it stumbled two years later and was later taken over by Olivetti. Do you think the company could have been saved had the ARM architecture project happened sooner?

SF: Having ARM earlier wouldn't have saved Acorn. ARM had to get out from the constrained Acorn market into the much more open System-on-Chip market that got them into mobile phones, and the SoC business only became technically feasible (with enough transistors on a chip to integrate all of the non-memory functions) in the 1990s.

RM: When you were designing processors at Acorn they generally had a power consumption of less than a watt. Do you feel rather glum at the power demands of today's high-end processors? Is this a consequence of the fact that we don't have a sufficient grip on building parallel software?

SF: Yes, and yes! The energy-efficiency of computers is a growing concern, and the lengths we have gone to in order to maximise single-thread performance at the cost of energy-efficiency are hard to justify. There is no way forward now apart from going parallel, and even the high-end boys have thrown in the towel on single-thread performance. They are selling us multicore processors we still don't know how to use. Once we do know how to exploit parallelism there will be no need for high-end processors at all, because we will be able to get the same performance with much greater energy-efficiency by using larger numbers of simpler processors. I expect to see this transition soon in data centres, where the load has a lot of easily accessible parallelism and where energy concerns are already at the top of the agenda, and even in high-performance computers, where I see this as the most promising route to exascale.

RM: You're working now on the SpiNNaker project that you're leading at Manchester, which aims to mimic the complex interactions in the human brain.
What's the higher motivation for the project, and did this in any way come from Doug Engelbart's vision about augmenting the human mind and the interaction between a machine and its user, which led directly to the invention of the PC?

SF: The higher motivation for SpiNNaker is the observation that computers aren't the only information processing systems on the planet, and they aren't even the best at some tasks. But we still don't know how the other sort, biological brains, work. This seems to me to be a fundamental gap in scientific knowledge. Computers are now approaching the performance required to build real-time models of brains (but they aren't quite there yet; a computer model of a human brain would require at least an exascale machine), so can we accelerate the understanding of the brain by designing a computer that is optimised for this task? This will then offer a platform for neuroscientists, psychologists and others to develop and test hypotheses on a new scale. Scale is important. We usually like to start small, get some understanding, and then scale up building on this. But there are some places, including some ideas in the neural network field, where starting small doesn't work. There are good theoretical insights into why this should be, relating to the counter-intuitive properties of high-dimensional spaces: the maths simply stops working if the problem is below a certain (large) size. So sometimes you have to jump in at the deep end, and SpiNNaker offers a very deep pool to do this in.
Youre working with the Royal Society to figure out why the number of students taking computing classes has halved in the past eight years, what are the most important but fundamental things, the computing industry can learn from its past? Do you think that the industry has been wrong about what computing is and where it should go and how to improve it? Ill be able to answer this better when the study has drawn its conclusions. But on the evidence of other studies of this area, the problem seems to lie in the transition from the computer as a universal programmable platform for exploring ideas (as with the BBC Micro in the 1980s) to an office tool that runs productivity software. Much of what is taught in schools is IT rather than computer science. IT is important but intellectually unchallenging, and often dull. Its as if all that was taught in Maths was arithmetic, or in English spelling. IT, arithmetic and spelling are all important skills, but there is *so* much more in all these subjects. Lots of people have tried to come up with languages or programming systems that allow non-programmers to program. Is that a doomed enterprise? Currently programming is an extremely demanding discipline, requiring metal abilities from scaling multiple levels of abstraction to chasing very low-level details around at the bottom of a vast software system to track down a bug. I dont think there is any way you can train the entire population to become skilled programmers at this level it is a peculiar skill, not unlike being a theoretical physicist in its requirement to think abstractly while paying painstaking attention to detail. If the goal is to introduce a wider audience to the ideas in programming computational thinking then there may be
scope for a simpler language that is less universal than those used by professional programmers, has fewer death traps for the unwary, and is perhaps more visual than symbolic in its representation. I always thought BBC BASIC was an excellent introductory language, but I just get laughed at when I suggest it these days!
RM: Over the last 20 years the Internet has scaled in growth, but operating systems and other computer software haven't grown exponentially. Do you think that the internet concept could be imitated and used as the basis for an operating environment that doesn't have an operating system?

SF: It seems to me that operating systems have grown exponentially, in memory requirement if not in functionality! I'm not sure how this relates to the question, but we have seen cycles of shift of computing power between central resources and the user at the periphery. With cloud computing we are seeing a shift back towards the centre, partly driven by improving communication services and partly by the need to mobilise the user terminal device. This is seeing the PC give way to the smartphone and iPad-like terminal, which moves quite a lot of the operating system functionality up into the cloud.
RM: Do you think many people working in technology are unaware of its history and have little curiosity about where languages came from?

SF: Technology tends to attract folk who are more interested in the future than the past, so they often have very little sense of the history of their subject, and very little curiosity. But as folk get older their horizons get wider, and I think most technologists in the second halves of their careers develop some sense of the historical path that has led to the way things are today.

RM: Do you feel we're progressing in technology, even though sometimes it seems that we are leaping backwards?

SF: We are definitely making progress in technology, faster than ever. After over 30 years in the business I still find new products astonishing. I carry my entire CD collection on my iPhone, but I remember a time at Acorn when we debated whether solid-state music would ever be economically feasible. The iPad is a long-standing dream come true; again, back in the Acorn days we talked about similar products for schools (though without the connectivity, which was inconceivable then), but of course the technology just wasn't ready.

RM: When you look back at your career, on all the things you have done, is there one time or a period that stands out among all the others?

SF: I guess my early years at Acorn would have to stand out, from 1981 to 1985, since that period covers the BBC Micro and the first ARM chip, and those are the foundations of my subsequent career. The success of the BBC Micro was tangible at the time, as described above, whereas the ARM was a long time coming to fruition, and required a great deal of work by a lot of other folk, not to mention a fair dose of serendipity, to get to the 20 billion total shipments to date. But it's knowing that this scale of impact is possible that drives me on and determines the directions I choose to take my research today.
SpiNNaker has the potential to generate similar impact, though there are many contingencies and, as with all research, it's highly speculative.
Simple-Talk.com
The tables are identical except for their names and the names of the primary key constraints. After I added the tables to the AdventureWorks2008 database (on a local instance of SQL Server 2008), I ran the following bcp command to create a text file in a local folder:

bcp "SELECT BusinessEntityID, FirstName, LastName, JobTitle, City, StateProvinceName, CountryRegionName FROM AdventureWorks2008.HumanResources.vEmployee ORDER BY BusinessEntityID" queryout C:\Data\EmployeeData.csv -c -t, -S localhost\SqlSrv2008 -T

The bcp command retrieves data from the vEmployee view in the AdventureWorks2008 database and saves it to the EmployeeData.csv file in the folder C:\Data. The data is saved as character data and uses a comma-delimited format. I use the text file as the source data in order to demonstrate the three SSIS components. I next created an SSIS package named BulkLoadPkg.dtsx and added the following two connection managers:
OLE DB. Connects to the AdventureWorks2008 database on the local instance of SQL Server 2008. I named this connection manager AdventureWorks2008.

Flat File. Connects to the EmployeeData.csv file in the C:\Data folder. I named this connection manager EmployeeData.
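To make the shape of the source data concrete: the bcp export produces character data, comma-delimited, one line per employee, with the seven columns of the SELECT list. A quick sketch of reading such a row (in Python, just for illustration; the field values here are illustrative samples, not guaranteed to match the real view's contents):

```python
import csv
import io

# One illustrative line of EmployeeData.csv, as bcp -c -t, would write it
sample = "1,Ken,Sanchez,Chief Executive Officer,Seattle,Washington,United States\r\n"

reader = csv.reader(io.StringIO(sample))
row = next(reader)
print(len(row))   # 7 columns, matching the SELECT list in the bcp command
print(row[1])     # the FirstName field
```

The Flat File connection manager does essentially this parsing for the package, using the delimiters you configure.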
After I added the connection managers, I added three Sequence containers to the control flow, one for each bulk insert operation. Each operation is associated with one of the tables I created above. For example, the first Sequence container will contain the components necessary to bulk load data into the Employees1 table. To each container I added an Execute SQL task that includes a TRUNCATE TABLE statement. The statement truncates the table associated with that bulk load operation. This allows me to execute the container or package multiple times in order to test different configurations, without having to be concerned about primary key violations. I then added to each of the first two containers a Data Flow task, and to the third container I added a Bulk Insert task. Figure 1 shows the control flow of the BulkLoadPkg.dtsx package. Notice that I connected the precedence constraint from each Execute SQL task to the Data Flow or Bulk Insert task.
Figure 1: Control flow showing three options for bulk loading data
After I created the basic package, I configured the Data Flow task and Bulk Insert task components, which I describe in the following sections. You can download the completed package at [include link? URL?] In the meantime, you can find details about how to create an SSIS package, configure the control flow, set up the Execute SQL task, or add tasks and containers in SQL Server Books Online. Now let's look at how to work with the components necessary to bulk load the data.
Figure 3: Data flow that uses the SQL Server destination component to load data
Now I can configure the SQL Server destination. To do so, I double-click the component to launch the SQL Destination editor, which opens in the Connection Manager screen. I then select the OLE DB connection manager I created when I first set up the SSIS package (AdventureWorks2008), and then select Employees1 as the destination table. Figure 4 shows the Connection Manager screen after it's been configured.
Notice that I mapped the BusinessEntityID source column to the EmployeeID destination column. All other column names should match between the source and destination. After you ensure that the mapping is correct, you can configure the bulk load options, which you do on the Advanced screen of the SQL Destination editor, shown in Figure 6. On this screen, you can specify such options as whether to maintain the source identity values, apply a table-level lock during a bulk load operation, or retain null values.
OLE DB Destination
The OLE DB destination is similar to the SQL Server destination except that your destination is not limited to a local instance of SQL Server (and you can connect to OLE DB target data sources other than SQL Server). One advantage of using this task is that you can run SSIS on a computer other than the one where the target table is located, which lets you more easily scale out your SSIS solution. To demonstrate how the OLE DB destination works, I set up a data flow similar to the one I set up for the SQL Server destination. As you can see in Figure 7, I've added a Flat File source and Data Conversion transformation, configured just as you saw above.
Figure 7: Data flow that uses the OLE DB Destination component to load data
After I added and configured the Data Conversion transformation, I added an OLE DB destination, opened the OLE DB Destination editor, and configured the settings on the Connection Manager screen, as shown in Figure 8.
NOTE: The OLE DB Destination editor does not include an Advanced screen like the SQL Destination editor, but it does include an Error Output screen that lets you specify error handling options, something not available in the SQL Destination editor.
I next used the Mappings screen to ensure that my source columns properly sync up with my destination columns, as I did with the SQL Server destination. Figure 9 shows the mappings as they appear in the OLE DB Destination editor.
In the Format section of the Connection screen, I selected the Specify option, which indicates that I will specify the format myself, rather than use a format file. If I had wanted to use a format file, I would have selected the Use File option and then specified the format file to use. When you select the Specify option, you must also specify the row delimiter and column delimiter. In this case, I selected {CR}{LF} and comma {,}, respectively. These settings match how the source CSV file was created. Finally, in the Source Connection section of the Connection screen, I specified the name of the Flat File connection manager I created when I set up the package (EmployeeData). Note, however, that the Bulk Insert task editor uses the connection manager only to locate the source file. The task ignores other options you might have configured in the connection manager, which is why you must specify the row and column delimiters within the task. After I configured the Connection screen of the Bulk Insert Task editor, I selected the Options screen, as shown in Figure 13. The screen lets you configure the options related to your bulk load operation.
Figure 14: Selecting load options in the Bulk Insert Task editor
As you can see, you can choose whether to fire triggers, check constraints, maintain null or identity values, or apply a table-level lock during the bulk load operation. The options you select are then listed in the Options box, separated by commas. Once you've configured your options, you're ready to bulk load your data. For a complete description of how to configure the Bulk Insert task, see the topic "Bulk Insert Task" in SQL Server Books Online.
Clearly, the three SSIS components available for bulk loading data into a SQL Server database offer a great deal of flexibility in terms of loading the data and scaling out your solution. If you're copying data out of a text file and that data does not need to be converted or transformed in any way, the Bulk Insert task is the simplest solution. However, you should use the SQL Server destination or OLE DB destination if you must perform any conversions or transformations, or if you're retrieving data from a source other than a text file. As for which of the two to choose: if you're loading the data into a local instance of SQL Server and scaling out is not a consideration, you can probably stick with the SQL Server destination. On the other hand, if you want the ability to scale out your solution or you must load data into a remote instance of SQL Server, use the OLE DB destination. Keep in mind, however, that if your requirements are such that more than one scenario will work, you should consider testing them all and determining from there which solution is the most effective. You might find that simpler is not always better, or vice versa.
The alternatives
Of the various methods of extracting data from HTML tables that I've come across:

Regular expressions. This can be made to work, but it is bad. Regexes were not designed for slicing up hierarchical data, and an HTML page is deeply hierarchical. You can get them to work, but the results are overly fragile. I've used them in programs in a language such as Perl or PHP in the past to do this, though any .NET language will do as well.

XML parsing. This will only work for well-formed HTML. Unfortunately, browsers are tolerant of badly-formed syntax. If you can be guaranteed to eat only valid XHTML, then it will work, but why bother when the existing tools exist to do it properly? You can use either the .NET classes for XML or the built-in XQuery in SQL Server to do this; again, it will only work for valid XHTML.

String parsing in SQL. This is a handy approach in a language like SQL, when you have plenty of time to do the dissection to get at the table elements. You have more control than in the recursive solution, but the same fundamental problems have to be overcome: the HTML that is out there just doesn't always have close-tags such as </p>, and there are many interpretations of the table structure out there. Robyn and I showed how this worked in TSQL here.

ODBC. Microsoft released a Jet driver that was able to access a table from an HTML page, if you told it the URL and used it if you were feeling lucky. It is great when this just works, but it often disappoints, and it is no longer actively maintained by Microsoft. I wrote about it here.

COM automation. Sometimes, you'll need to drive the IE browser as a COM object. This is especially the case if the site's contents are dynamic or refreshed via AJAX. You then have to read the DOM via the document property to get at the actual data. Let's not go there in this article! In the old days, we used to occasionally use the Lynx text browser for this, and then parse the data out of the subsequent text file.

OCR. You may come across a Flash site, or one where the data is done as 'text as images'. Just screendump it and OCR the text.

XSLT. This is always a good conversation-stopper: there will always be someone who will say that it is soooo much easier to use XSLT. I've never tried it for this purpose. The problem is that you are not dealing with XML or, for that matter, XHTML.

The HTML Agility Pack. This is so easy that it makes XPath seem quite fun. It works like standard XPath, but on ordinary HTML, warts and all. With it you can slice and dice HTML to your heart's content.
First step
Here is some PowerShell that lists out some classical insults as text, from a test website. Quite simply, it downloads the file, gulps it into the HAP as a parsed file and then pulls the text out of the second table. ('(//table)[2]' means the second table, wherever it is in the document, whereas '//table[2]' means the second table of the parent element, which isn't quite the same thing.)
add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$HTMLDocument.LoadHtml($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/quotations.html"))
$HTMLDocument.DocumentNode.SelectSingleNode("(//table)[2]").InnerText
This will give a list of insults. You would need to install the Agility Pack and give the path to it, or put it in your Global Assembly Cache (GAC). In this, and subsequent PowerShell scripts, I'm giving quick 'n' dirty code that gets results, along with some test HTML pages, just so you can experiment and get a feel for it. You'll notice that this first script extracts the meat out of the sandwich by means of XPath queries. Normally, ordinary humans have no business looking at these, but in certain circumstances they are very useful. Of course, if you're a database programmer, you'll probably be familiar with XPath. Here is an example of shredding XHTML document fragments via XPath using SQL. I give just a small part of the document. Make sure that, in your mind's eye, the precious text is surrounded by pernicious adverts and other undesirable content.
DECLARE @xml xml
SELECT @XML='<body>
 <ol id="insults">
  <li>You have two ears and one mouth so you might listen the more and talk the less.</li>
  <li>You like your friends to be just clever enough to comprehend your cleverness and just stupid enough to admire it.</li>
  <li>You own and operate a ferocious temper. </li>
  <li>There is nothing wrong with you, that a miracle couldn''t cure.</li>
  <li>Why be disagreeable, when with a little effort you could be impossible? </li>
  <li>You look as if you had been poured into your clothes and had forgotten to say when.</li>
 </ol>
</body>'
SELECT @xml.query('//li[2]/text()') --text of second element
SELECT Loc.query('.') FROM @xml.nodes('/body/ol/li/text()') as T2(Loc)
However, that works with XML documents, or XHTML fragments, but not HTML.
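The distinction is easy to demonstrate with any strict XML parser. Below, a short sketch (in Python, purely as a neutral illustration; the fragments are invented for the demo): a well-formed XHTML list parses fine, but the same list written the way browsers tolerate it, with the close-tags omitted, is rejected outright. This is exactly why XQuery-style tools fail on real-world HTML.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML: every <li> is closed, so a strict parser accepts it
xhtml = "<ol><li>first</li><li>second</li></ol>"
items = [li.text for li in ET.fromstring(xhtml).iter("li")]
print(items)  # ['first', 'second']

# Browser-tolerated HTML: missing </li> close-tags break strict parsing
html = "<ol><li>first<li>second</ol>"
try:
    ET.fromstring(html)
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)  # False
```

The HTML Agility Pack's whole reason for being is that it accepts the second form as readily as the first.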
Let's try PowerShell's .NET approach with a list (OL, in this case), which is the other main data-container you're likely to find in an HTML page.
add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$HTMLDocument.LoadHtml($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/listExample.html"))
$list = $HTMLDocument.DocumentNode.SelectNodes("(//ol)[1]/li") #only one list
foreach ($listElement in $list)
  {$listElement.InnerText.Trim() -replace ('\n', "`r`n ") -replace('(?<=\s)\s' ,'') }
This is fun. Not only that, but it is resilient too. You'll notice that I've corrected, in this example, the interpretation of all the white-space within the text of the tag as being significant, just by using a couple of regex substitutions. If you execute this from xp_cmdshell in SQL Server, you're laughing, since you can read it straight into a single-column table and from there into the database.
Here, you'll immediately notice that we are specifying nodes by their class. We could use their IDs too. We've specified a couple of DIVs by their order within each of the articleDetails nodes. Where am I getting all this information on how to address individual parts of an HTML document via XPath? Well, we have a new wallchart, which we have just published, that has the information in it, and maps the functionality between XPath, CSS and JavaScript DOM. It should help a lot with the trickier bits of web-scraping. We have also created a multidimensional array in which to put the data. This makes it easier to manipulate once we've finished.
I've left out any optional switches for simplicity. Or you can use the Export-Clixml cmdlet, if you wish, to export it as an XML file. There are a number of switches you can use to fine-tune the export of this XML data to file. You can even do this ...
$Table |ConvertTo-HTML
...in order to convert that data into a table in an HTML file (or a table fragment if you prefer)! If you then write it to file, it can then be read into Excel easily. At this stage, I hope you feel empowered to create some quick 'n' dirty PowerShell routines to grab HTML data from the internet and produce files that can then be imported into SQL Server or Excel. The next stage would be to import into a SQL Server table from a PowerShell script. You can call a PowerShell script directly from SQL Server and import the data it returns into a table. Chad Miller has already covered an ingenious way of doing this without requiring a DataTable. If you are likely to want to import a lot of data, then a DataTable might be a good alternative.
Using a datatable
In this next quick and dirty script we'll create a DataTable that can then be exported to SQL Server. The most effective way of doing this is to use Chad Miller's Write-DataTable function. You can, if the urge takes you, write to an XML file as if it were a multidimensional array, and thence gulp it into SQL Server.
$DataTable.WriteXml(".\TableData.xml") $DataTable.WriteXmlSchema(".\TableData.xsd")
To take out some tedious logic, we'll do a generic table-scraper that doesn't attempt to give the proper column names but gives, for each cell, the cell value, the row of the cell and the column of the cell. The headers are added as row 0 if they can be found. With this information, it is pretty easy to produce a SQL Server result, as we'll show later on.
add-type -Path 'C:\Program Files (x86)\htmlAgilityPack\HtmlAgilityPack.dll'
$HTMLDocument = New-Object HtmlAgilityPack.HtmlDocument
$wc = New-Object System.Net.WebClient
$HTMLDocument.LoadHtml($wc.DownloadString("http://www.simple-talk.com/blogbits/philf/TestForHTMLAgilityPack.html"))
#create the table and the columns
$DataTable = New-Object system.Data.DataTable 'TableData'
$datacol1 = New-Object system.Data.DataColumn Row,([int])
$datacol2 = New-Object system.Data.DataColumn Col,([int])
$datacol3 = New-Object system.Data.DataColumn Value,([string])
#add the columns to the DataTable
$DataTable.columns.add($datacol1)
$DataTable.columns.add($datacol2)
$DataTable.columns.add($datacol3)
#iterate through the cells in the third table
foreach ($Cell in $HTMLDocument.DocumentNode.SelectNodes("(//table)[3]//tr/td"))
  {
  $ThisRow=$DataTable.NewRow()
  $ThisRow.Item('Value')=$Cell.InnerText.Trim() -replace ('\n', "`r`n ") -replace('(?<=\s)\s' ,'')
  $ThisRow.Item('Col')=$Cell.XPath -creplace '(?sim)[^\s]*td\[([\d]{1,6})\]\s*\z', '$1' #column
  $ThisRow.Item('Row')=$Cell.XPath -creplace '(?sim)[^\s]*tr\[([\d]{1,6})\]/td\[[\d]{1,6}\]\s*\z', '$1' #row
  $DataTable.Rows.Add($ThisRow)
  }
#and iterate through the headers (these may be in a thead block)
foreach ($heading in $HTMLDocument.DocumentNode.SelectNodes("(//table)[3]//th"))
  {
  $ThisRow=$DataTable.NewRow()
  $ThisRow.Item('Value')=$heading.InnerText.Trim()
  $ThisRow.Item('Col')=$heading.XPath -creplace '(?sim)[^\s]*th\[([\d]{1,6})\]\s*\z', '$1' #column
  $ThisRow.Item('Row')=0
  $DataTable.Rows.Add($ThisRow)
  }
$DataTable
This may look a bit wacky. The reason is that we're taking advantage of a short-cut: the HTML Agility Pack returns the XPath of every HTML element that it returns, so you have all the information you need about the column number and row number without iterative counters. It assumes that the table is regular, without colspans or rowspans. Here is an example from one of the cells that we just parsed.
/html[1]/body[1]/table[1]/tr[47]/td[1]
See that? It is the first table (table[1]), row 47, column 1, isn't it? You want the rightmost figures, in case you have nesting. You just have to parse them out of the XPath string; I used regex to get the ordinal values for the final elements in the path. The sample file has three tables; you can try it out with the other two by changing the table[1] to table[2] or table[3]. If we export the DataTable to XML using the
$DataTable.WriteXml(".\TableData.xml")
...at the end of the last PowerShell script, then we can soon get the result into a SQL table that is easy to manipulate.
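As an aside, the ordinal-parsing trick described above, pulling the rightmost tr/td pair out of the XPath string that the Agility Pack hands back, boils down to one anchored regex. A quick sketch of the same idea (in Python, for compactness; the -creplace expressions in the script do the equivalent):

```python
import re

# The XPath the HTML Agility Pack reported for one parsed cell
xpath = "/html[1]/body[1]/table[1]/tr[47]/td[1]"

# Anchor at the end of the string so nested tables can't confuse us:
# only the rightmost tr[...]/td[...] pair is captured
match = re.search(r"tr\[(\d+)\]/td\[(\d+)\]$", xpath)
row, col = int(match.group(1)), int(match.group(2))
print(row, col)  # 47 1
```

The `$` anchor is what makes this safe when tables are nested: an outer table's ordinals appear earlier in the path and are ignored.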
DECLARE @Data XML SELECT @Data = BulkColumn FROM OPENROWSET(BULK 'C:\workbench\TableData.XML', SINGLE_BLOB) AS x
In this example, I've left out most of the rows so that you can try this, out of sequence, without needing the table.
DECLARE @data XML
SET @Data= '<?xml version="1.0" standalone="yes" ?>
<DocumentElement>
  <TableData>
    <Row>1</Row> <Col>1</Col>
    <Value>Advertising agencies are eighty-five per cent confusion and fifteen per cent commission.</Value>
  </TableData>
  <TableData>
    <Row>1</Row> <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>2</Row> <Col>1</Col>
    <Value>An associate producer is the only guy in Hollywood who will associate with a producer.</Value>
  </TableData>
  <TableData>
    <Row>2</Row> <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>3</Row> <Col>1</Col>
    <Value>California is a fine place to live, if you happen to be an orange.</Value>
  </TableData>
  <TableData>
    <Row>3</Row> <Col>2</Col>
    <Value>Fred Allen b 1894</Value>
  </TableData>
  <TableData>
    <Row>0</Row> <Col>1</Col>
    <Value>Quotation</Value>
  </TableData>
  <TableData>
    <Row>0</Row> <Col>2</Col>
    <Value>Author</Value>
  </TableData>
</DocumentElement>'
SELECT max(case when col=1 then value else '' end) as Quote,
       max(case when col=2 then value else '' end) as Author
FROM (SELECT x.y.value('Col[1]', 'int') AS [Col],
             x.y.value('Row[1]', 'int') AS [Row],
             x.y.value('Value[1]', 'VARCHAR(200)') AS [Value]
      FROM @data.nodes('//DocumentElement/TableData') AS x ( y )
     ) rawTableData
GROUP BY row
HAVING row > 0
ORDER BY row
Yes, you're right: we've used XPath once again to produce a SQL result.
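The pivot that the MAX(CASE...) and GROUP BY perform is worth seeing in miniature. A sketch of the same operation (Python here, purely for illustration; the triples follow the XML fragment above, with row 0 holding the headers):

```python
# (Row, Col, Value) triples, as produced by the generic table-scraper
triples = [
    (0, 1, "Quotation"), (0, 2, "Author"),
    (1, 1, "Advertising agencies are eighty-five per cent confusion and fifteen per cent commission."),
    (1, 2, "Fred Allen b 1894"),
    (2, 1, "An associate producer is the only guy in Hollywood who will associate with a producer."),
    (2, 2, "Fred Allen b 1894"),
]

# Group the triples by row number: rows[row][col] = value
rows = {}
for row, col, value in triples:
    rows.setdefault(row, {})[col] = value

# Rebuild the data rows (row > 0, as the HAVING clause does) as
# (Quote, Author) pairs, in row order
data = [(rows[r].get(1, ""), rows[r].get(2, "")) for r in sorted(rows) if r > 0]
print(data[0][1])  # Fred Allen b 1894
```

Whatever the source table looked like, the three-column intermediate form makes this regrouping trivial, which is the point of scraping into it in the first place.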
By having the data with the same three columns whatever the data, you can simplify the transition to proper relational data. You can define a Table-valued Parameter to do a lot of the work for you, or even pass the data from PowerShell to SQL Server using a TVP. It makes the whole process very simple.
{
$link.Attributes["href"].value +' is at address ' + $link.XPath }
...or you can search for text, either within the element (contains) or at its start (starts-with), and get the XPath of the elements
{
$element.XPath + ' --- '''+$element.innerText+ '''' }
With a fair wind and an XPath reference wallchart, a great deal is possible. If, for example, the data always has the same title, even if its location in the page varies, you can write a script that gets its location purely by looking for the heading. Some data is in ordinary paragraph tags, but you can still get at it via XPath if it follows a particular heading. XPath has a great deal of magic for awkward data-gathering. For data-gathering, I generally use dedicated PCs within the domain. These need very little power, so you can use old nags. I never let SQL Server itself anywhere near the internet. On these PCs, I have a scheduled task that runs a script that downloads the next task (ID and parameters) from the SQL Server, and runs it if it is due, returning the data to SQL Server, using Windows authentication. Each task corresponds to a single data collection on one site. All errors and warnings are logged, together with the task ID, the user, and time of day, within the SQL Server database. When the data is received, it is scrubbed, checked, compared with the existing data and then, if all is well, the delta is entered into the database.
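The agent loop just described is simple enough to sketch. Below is a minimal outline (in Python, as a language-neutral illustration; get_next_task, run_task and log_error are hypothetical stand-ins for the real calls into the central SQL Server):

```python
from datetime import datetime

def agent_cycle(get_next_task, run_task, log_error, now=None):
    """One pass of the scheduled collection agent: fetch the next task,
    run it if it is due, and log any failure with its task ID."""
    now = now or datetime.now()
    task = get_next_task()              # returns (task_id, due_time, params) or None
    if task is None:
        return "idle"                   # nothing queued for this agent
    task_id, due, params = task
    if due > now:
        return "not due"                # leave it for a later cycle
    try:
        run_task(task_id, params)       # scrape one site, return the delta upstream
        return "ran"
    except Exception as exc:
        log_error(task_id, str(exc), now)   # errors recorded centrally, per task
        return "error"
```

Each scheduled-task firing calls agent_cycle once; because every run is logged against a task ID, the central database stays the single source of truth for what was collected, when, and what went wrong.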
Conclusions
Using the HTML Agility Pack is great for the run-of-the-mill reading of data, and you may never hit its limitations, unless of course you are scraping the contents of an AJAX site, or the data is in Flash. It isn't perfect, since the HTML file is treated without understanding the semantics of HTML. This is fine up to a level of detail, but HTML tables really are awful because they allow you to mix presentation and meaning. Colspan and rowspan have no equivalent meaning in any sensible data table, and make extracting the data more tiresome than it need be. Although I've tended to use RegEx queries in the past, I'm now convinced that the HTML Agility Pack is a more sensible approach for general use in extracting data from HTML in .NET.
Authentication
Generally speaking, authentication is the ability to identify a particular entity. The need for authentication occurs when we have some resources that we want to make available to different entities. We store these resources in a centralized place and instruct the system that manages them to prevent entities that we don't recognize from having access. Anonymous authentication refers to a situation in which we grant access to resources to all users, even if we don't know them. In web applications, we expose resources to users. We authenticate each user by requesting his credentials, normally a username and password, that we have assigned to him, or that he got during what we call the registration process. The .NET Framework uses the following authentication terminology:
Principal: this represents the security context under which code is running. Every executing thread has an associated principal.

Identity: this represents the identity of the authenticated user. Every Principal has an associated identity.
It also defines the following classes, contained in the System.Security.Principal namespace: GenericPrincipal, WindowsPrincipal, GenericIdentity and WindowsIdentity. As their names suggest, WindowsPrincipal and WindowsIdentity are related to principals and identities associated with a Windows account, while GenericPrincipal and GenericIdentity are related to generic authentication mechanisms. GenericPrincipal and WindowsPrincipal implement the IPrincipal interface, while GenericIdentity and WindowsIdentity implement the IIdentity interface.
Authorization
Authorization is the ability to grant or deny access to resources, according to the rights defined for the different kinds of entities requesting
them. When dealing with the Windows operating system, and its underlying NTFS file system, authorizations are managed by assigning to each object
(files, registry keys, cryptographic keys and so on) a list of the permissions granted to each user recognized by the system. This list is commonly called the Access Control List or ACL (the correct name is actually Discretionary Access Control List or DACL, to distinguish it from the System Access Control List or SACL). The ACL is a collection of Access Control Entries or ACEs. Each ACE contains the identifier for a specific user (Security Identifier or SID) and the permissions granted to it. As you probably already know, to view the ACL for a specific file, you right-click the file name, select Properties and click on the Security tab. You will see something like this:
FileSecurity f = File.GetAccessControl(@"c:\resource.txt");
AuthorizationRuleCollection acl = f.GetAccessRules(true, true, typeof(NTAccount));
foreach (FileSystemAccessRule ace in acl)
{
    Console.WriteLine("Identity: " + ace.IdentityReference.ToString());
    Console.WriteLine("Access Control Type: " + ace.AccessControlType);
    Console.WriteLine("Permissions: " + ace.FileSystemRights.ToString() + "\n");
}

By running this code in a console application, you get the following output:
Figure 2: Output of a console application that lists the ACEs of a demo file.
Figure 3: List of all authentication methods implemented in IIS 7.0 and 7.5.

Anonymous Authentication: this is the most commonly used type of authentication. With it, all users can access the web site.

ASP.NET Impersonation: this is not really an authentication method, but relates to authorizations granted to a web site's users. We will see
later how impersonation works.
Basic Authentication: this is a Windows account authentication, in the sense that the user needs to have a username and password,
recognized by the operating system, to use the application. When the user calls a web page, a dialog box asking for his credentials appears. If the user provides valid credentials for a valid Windows account, the authentication succeeds. This type of authentication is not considered secure because authentication data is transmitted to the server as plain text.
Digest Authentication: this is similar to Basic Authentication, but more secure. Authentication data is sent to the server as a hash, rather than
plain text. Basic Authentication and Digest Authentication are both standardized authentication methods. They are defined in RFC 2617.
Forms Authentication: this is ASP.NET's own authentication, based on a login page and the storage of users' credentials in a database, or
similar location.
Windows Authentication: this type of authentication uses the NTLM or Kerberos Windows authentication protocols, the same protocols used
to log into Windows machines. As with Basic Authentication and Digest Authentication, the credentials provided by the user must match a valid Windows account. There are two other authentication methods that I have not mentioned here: Active Directory Client Certificate Mapping Authentication and IIS Client Certificate Mapping Authentication. Both use an X.509 digital certificate installed on the client; how they work is outside the scope of this article. For the purpose of this article, we can use Basic Authentication, Digest Authentication or Windows Authentication, each of which relies on Windows accounts. When they're used, the current executing thread is associated with a Principal object that is able to give us information about the authenticated user. I wrote a simple application that shows you how to do that. Its source code is available at the top of this article as a zip file. The application defines a method, called WritePrincipalAndIdentity(), which gives us the following information: 1. The name of the authenticated user. 2. The user's role, by checking its role membership. 3. The type of authentication performed. The method's body is given by:
/// <summary>
/// Explore the authentication properties of the current thread.
/// </summary>
public void WritePrincipalAndIdentity()
{
    IPrincipal p = Thread.CurrentPrincipal;
    IIdentity i = Thread.CurrentPrincipal.Identity;

    WriteToPage("Identity Name: " + i.Name);
    WriteToPage("Is Administrator: " + p.IsInRole(@"BUILTIN\Administrators"));
    WriteToPage("Is Authenticated: " + i.IsAuthenticated);
    WriteToPage("Authentication Type: " + i.AuthenticationType);
    WriteToPage(" ");
}
The WriteToPage() method is a helper method that encapsulates the logic needed to write text to the page. Rather than using Thread.CurrentPrincipal, we could use the User property of the Page object to achieve the same result. I prefer Thread.CurrentPrincipal, to point out that the principal is always associated with the executing thread. The importance of this will become clearer in the Role-Based Security section. When we run this application using, for example, Digest Authentication (remembering to disable anonymous authentication), the logon window asks us for our credentials.
Figure 4: Logon dialog box. To access the web site we need a valid account defined in a domain named CASSANDRA.
If we provide a valid account defined in the CASSANDRA domain, we will be able to log on to the application. Once we've provided it, we obtain something like this:
CASSANDRA\matteo, the domain account used to perform the request, that the authentication method used was Digest Authentication, and
that the user is not an administrator. Suppose that we need to write a web application that associates each user with their own data, for example a list of contacts or some appointments. It is easy to see that, at this stage, we have all the information needed to manage all the data (contacts or appointments) related to a single user. If we save it all in a database, using the username (or better, a hash of it) provided by the authentication stage as the table key, we are able to fill the application's web pages with only that user's content, just as we would with Forms Authentication. This is possible without having to write any credential-management code. Another important advantage comes from the fact that, by using the Principal object, we are able to check whether an authenticated user belongs to a specific security group. With this information, we can develop applications that are role-enabled, in the sense that we can allow a specific user to use only the features available to his role. Suppose, for example, that the web application has an admin section and we want only administrators to see it: we can check the role of the authenticated user and hide the links to the admin page if the user is not an administrator. If we use Active Directory as the container for user credentials, we can take advantage of its ability to generate group structures flexible enough to provide role-based permissions for even very heterogeneous kinds of users. However, from a security point of view, authentication alone is not enough. If, for example, we hide the link to the admin page from non-administrator users, they can nonetheless reach the admin page using its URL, breaking the security of the site. For this reason, authorization plays a very important role in designing our application. We will now see how to prevent this security issue.
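The idea of keying per-user data on a hash of the authenticated account name can be sketched as follows. This is a minimal illustration in Python rather than the article's C#, and the user_key function is entirely hypothetical; the point is only that normalizing and hashing the DOMAIN\user name yields a stable table key:

```python
import hashlib

def user_key(account_name: str) -> str:
    # Normalize the DOMAIN\user name so that case differences don't
    # produce different keys, then hash it so that the raw account
    # name is not used directly as a table key.
    normalized = account_name.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

key = user_key(r"CASSANDRA\matteo")
```

Because Windows account names are case-insensitive, normalizing before hashing ensures the same user always maps to the same row.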
/// <summary>
/// Check if a resource can be loaded.
/// </summary>
public void CanLoadResource()
{
    FileStream stream = null;
    try
    {
        stream = File.OpenRead(Server.MapPath("resource.txt"));
        WriteToPage("Access to file allowed.");
    }
    catch (UnauthorizedAccessException)
    {
        WriteException("Access to file denied.");
    }
    finally
    {
        if (stream != null)
            stream.Dispose();
    }
}
The CanLoadResource() method tries to open resource.txt in order to read its content. If the load succeeds, the "Access to file allowed." message is written on the page. If an UnauthorizedAccessException is thrown, the message "Access to file denied." is written on the page, as an error. The WriteException() method is a helper method used to write an exception message on the page. Now we launch our application with authorizations set as in Figure 6 and use CASSANDRA\matteo to log into the application. Doing that, we obtain something that should seem strange:
Figure 8: ACL for the NETWORK SERVICE account with denied permissions.
If we launch our application, we now obtain:
Figure 9: Logon with user CASSANDRA\matteo, still with the permissions in Figure 8.
As you can see, the file is no longer available, demonstrating that the authorization process involves the NETWORK SERVICE account. To apply authorization at the level of the authenticated user, we need to use Impersonation. With impersonation, we are able to make the Application Pool run with the permissions associated with the authenticated user. Impersonation only works when the Application Pool runs in Classic Mode (in Integrated Mode the web application returns a 500 Internal Server Error). To enable impersonation, we need to enable the ASP.NET Impersonation feature, as noted in Figure 3 and the discussion that followed it. If we switch our Application Pool to Classic Mode (enabling the ASP.NET 4.0 ISAPI filters, too) and enable ASP.NET Impersonation, the demo application output becomes:
Figure 10: Logon with user CASSANDRA\matteo, with permissions as in Figure 8 and Application Pool in Classic Mode.
We are now able to load resource.txt even though the NETWORK SERVICE account has no permissions to access it. This shows that the permissions used were those associated with the authenticated user, not with the Application Pool's identity. To take advantage of Integrated Mode without having to abandon impersonation, we can use a different approach: running our application in Integrated Mode and enabling impersonation at the code level when we need it. To do so, we use the WindowsImpersonationContext class, defined in the System.Security.Principal namespace. We modify the CanLoadResource() method as follows:
/// <summary>
/// Check if a resource can be loaded.
/// </summary>
public void CanLoadResource()
{
    FileStream stream = null;
    WindowsImpersonationContext imp = null;
    try
    {
        IIdentity i = Thread.CurrentPrincipal.Identity;
        imp = ((WindowsIdentity)i).Impersonate();
        stream = File.OpenRead(Server.MapPath("resource.txt"));
        WriteToPage("Access to file allowed.");
    }
    catch (UnauthorizedAccessException)
    {
        WriteException("Access to file denied.");
    }
    finally
    {
        if (imp != null)
        {
            imp.Undo();
            imp.Dispose();
        }
        if (stream != null)
            stream.Dispose();
    }
}
With this modification, we force the application to impersonate the authenticated user before opening the file. To achieve this, we have used the Impersonate() method of the WindowsIdentity class (the class to which the Identity property belongs). With it, we have created a WindowsImpersonationContext object. This object has a method, Undo(), that reverts the impersonation after the resource has been used. If we run our application with permissions as in Figure 8, we see that we are able to access resource.txt even though the Application Pool is working in Integrated Mode. Now we can resolve the security issue presented earlier.
If we want to use Windows accounts to develop a role-based application, we can use authentication to identify the user requesting resources, and we can use authorization, based on the user's identity, to prevent access to resources not available to the user's role. If, for example, the resource we want to protect is a web page (like the admin page), we need to set its ACL with the right ACEs, and use impersonation to force the Application Pool to use the authenticated user's permissions. However, as we
have seen, when the Application Pool uses Integrated Mode, impersonation is available only at code level. So, although in this situation it's easy to prevent access to resources (like the resource.txt file) needed by a web page, it's not so easy to prevent access to a web page itself. For this, we need to use another IIS feature available in IIS Manager, .NET Authorization Rules:
Figure 11: .NET Authorization Rules feature of IIS7 and IIS7.5. .NET Authorization Rules is an authorization feature that works at the ASP.NET level, not at the IIS or file system level (as ACLs do). It therefore lets us
ignore how IIS works and use Impersonation in both Integrated Mode and Classic Mode. I leave you to test how it works.
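The rules that this IIS Manager feature edits are stored as standard ASP.NET URL authorization entries in web.config. As a sketch (the Admin path and the group name are illustrative, not taken from the demo application), restricting an admin section to administrators might look like:

```xml
<location path="Admin">
  <system.web>
    <authorization>
      <allow roles="BUILTIN\Administrators" />
      <deny users="*" />
    </authorization>
  </system.web>
</location>
```

Because these rules are evaluated by ASP.NET itself, they protect the page URL regardless of whether the Application Pool runs in Integrated or Classic Mode.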
Role-Based Security
A further advantage of using Windows account authentication is the ability to use a .NET Framework security feature called Role-Based Security. Role-Based Security permits us to protect our resources from unauthorized authenticated users. It relies on checking whether an authenticated user belongs to a specific role that is authorized to access a specific resource. We have already seen how to do that: use the IsInRole() method of the thread's Principal object. The .NET Framework security team decided to align this type of security check with Code Access Security (which I wrote about in previous articles) by defining a similar programming model. Specifically, a class named PrincipalPermission, found in the System.Security.Permissions namespace, has been defined. It permits us to check the role membership of an authenticated user both declaratively (using attributes) and imperatively (using objects), in the same manner as CAS checks. Suppose that we want resource.txt to be readable only by administrators. We can perform a declarative Role-Based Security check in this way:
/// <summary>
/// Load a resource.
/// </summary>
[PrincipalPermissionAttribute(SecurityAction.Demand, Name = "myname", Role = "administrators")]
public void LoadResource()
{
    ..
where myname is the username that we want to check. If declarative Role-Based Security is not what we need (because, in this case, we need to know the identity of the user first), we can use an imperative Role-Based Security check:
/// <summary>
/// Load a resource.
/// </summary>
public void LoadResource()
{
    try
    {
        // Create a PrincipalPermission object.
        PrincipalPermission permission =
            new PrincipalPermission(Thread.CurrentPrincipal.Identity.Name, "Administrators");
        // Demand this permission.
        permission.Demand();
        ..
    }
    catch (SecurityException e)
    {
        ..
    }
}
In both cases, if the user does not belong to the Administrators group, a security exception is thrown. The PrincipalPermission class doesn't add anything to our ability to check the permissions of an authenticated user. In my opinion, the IsInRole() method gives us all the instruments we need, and is simpler to use. Despite this, I've included PrincipalPermission in this discussion for completeness. Maybe this is the same reason the .NET development team added this class to the .NET Framework base classes. I end this section by mentioning that Role-Based Security can even be implemented in desktop applications. In this case, the authenticated user is the user who logs into the machine. When a desktop application starts, by default, the identity of the authenticated user is not attached to the executing thread. The Principal property of the current thread and the Identity property of that Principal are set to GenericPrincipal and GenericIdentity respectively, and the Name property of the Identity is empty. If we launch the following code in a Console application:
static void Main(string[] args)
{
    Console.WriteLine("Type of Identity: " + Thread.CurrentPrincipal.Identity.GetType());
    Console.WriteLine("Identity Name: " + Thread.CurrentPrincipal.Identity.Name);
}
We get:
added. We are now able to use Role-Based Security even in desktop applications.
Conclusion
In this article we have seen how Windows accounts can be used to implement authentication and authorization in ASP.NET applications. Even though this approach is rarely used, Forms Authentication being the commonly adopted solution, it can have a lot of advantages:
1. Less code to develop and maintain. Authentication and authorization with Windows accounts do not require the developer to write specific code for the management of user credentials, authorizations, password recovery and so on.
2. Centralization of user credentials, access rights, password policies, role-based policies and identity management in general. All the security information related to a specific user is stored in a centralized place, Active Directory. When a new employee arrives at an organization, permissions have to be added only to the Directory structure, not to each web server used by the company, making the authorization process simpler to manage.
3. More security. In a decentralized security environment, users sometimes have to remember more than one username and password, and are sometimes forced to write them down to remember them. Security experts consider this one of the most dangerous security issues. Moreover, if an employee with, say, ten accounts for ten different applications, stored in ten different places, leaves an organization, it's easy to forget to remove all their credentials, allowing them to access, or even steal, confidential data.
Simple-Talk.com
Most of the DOS header consists of the string 'This program cannot be run in DOS mode.\r\r\n$'. In the example .NET assembly above, this string starts at file offset 0x4E. Before that, we have the following bytes:
0E 1F BA 0E 00 B4 09 CD 21 B8 01 4C CD 21
If we run this through a disassembler, we get the following 16-bit x86 code:
push cs pop ds mov dx, 0xe mov ah, 0x9 int 0x21 mov ax, 0x4c01 int 0x21
Well, what on earth is that doing? Let's break it down into stages.
1. push cs
pop ds
These two instructions operate on the stack available to real mode programs; the push instruction is essentially shorthand for 'Decrement the stack pointer and copy the specified value to the location it points to', while pop does the opposite. To start off with, the cs register points to where the program code is loaded into memory. These two instructions copy the value of the cs (code segment) register to the ds (data segment) register via the stack.
2. mov dx, 0xe
This sets the dx register to the constant 0xe. If you have a look at the stub above, this is the offset at which the text string begins, measured from the start of the code.
3. mov ah, 0x9
int 0x21
Let's start off with int 0x21. This instruction invokes a software interrupt; interrupt 0x21 is the interrupt number for the DOS API (yes, such a thing exists!). The ah register (the high byte of ax) contains the number of the API function to call. If you have a look at the list of function codes, 0x9 corresponds to 'Write string to STDOUT', with the string to write pointed to by ds:dx and terminated by $. As ds has been set to the same value as cs, and the offset 0xe is in dx, this prints the string This program cannot be run in DOS mode.\r\r\n to the console.
4. mov ax, 0x4c01
int 0x21
This is another DOS function call. The high byte of ax, 0x4c, corresponds to the Exit function, and the low byte, 0x01, specifies the return value. So this stops execution of the program and returns to the DOS prompt, with a return value of 1.
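The byte layout above can be checked mechanically. As a rough sketch (in Python rather than assembler; the stub_string_offset helper is made up), we can find the mov dx, imm16 opcode (0xBA) in the stub bytes and confirm that its operand lands exactly on the message string:

```python
# The 14 stub bytes quoted above, followed by the message they print.
STUB = bytes([0x0E, 0x1F, 0xBA, 0x0E, 0x00, 0xB4, 0x09, 0xCD, 0x21,
              0xB8, 0x01, 0x4C, 0xCD, 0x21])
MESSAGE = b"This program cannot be run in DOS mode.\r\r\n$"
CODE = STUB + MESSAGE

def stub_string_offset(code: bytes) -> int:
    # 'mov dx, imm16' is encoded as 0xBA followed by a little-endian
    # 16-bit immediate; locate the opcode and decode its operand.
    i = code.index(0xBA)
    return code[i + 1] | (code[i + 2] << 8)

offset = stub_string_offset(CODE)
# The operand (0xE) equals the length of the stub code, so ds:dx
# points at the first byte of the '$'-terminated message.
```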
That's it!
There we go, nice and simple. However, the CLR loader stub, which I'll be looking at in the next post, is significantly more complicated!
by Simon Cooper
Executing a PE file
Unlike the DOS stub I discussed in my previous post, PE executables don't have full access to the entire physical memory. Instead, they are loaded into virtual memory, split into pages, which the OS maps onto physical memory as required. The header of each PE file contains information telling the loader how to map each section of the file into pages, and what access permissions to apply to each page. Within a normal PE file the executable code can execute jumps and calls to functions in other dlls, such as the Windows API. These dlls are loaded (imported) into the process' virtual memory address space as required by the OS loader. However, this loading into virtual memory causes several problems. Firstly, you need some way of storing calls to imported functions in a PE file that isn't a direct jmp <memory address>, as the memory address of the function is not known until the dll is loaded into memory. Secondly, the memory address at which the PE file itself is loaded is not known until load time. This means that internal function calls can't use a direct call either! Within a PE file, there are two structures that solve these problems: the import table, and relocations.
Import Table
Each entry in the import table specifies the information for a single imported dll. Along with the ASCII name of the dll, the entry contains the RVAs of two identical structures, the Import Address Table (IAT) and Import Lookup Table (ILT). The IAT and ILT each contain an entry for every function imported from the dll, in the form of a two-byte hint and an ASCII function name. The import table and IAT are referenced from the 2nd and 13th data directory entries respectively, at the top of the file. This is the import table in my TinyAssembly example:
The single entry in the import table has the following highlighted bytes:
1. RVA of the ILT (0x2874, file offset 0xa74)
2. RVA of the dll name to import, as ASCII (0x288e, file offset 0xa8e)
3. RVA of the IAT (0x2000, file offset 0x200). You can see the IAT located before the CLI header.
The ILT and IAT store their information in the form of an RVA to an entry in the Hint/Name table (0x2880, file offset 0xa80), which contains the name of the function to call; in this case, "_CorExeMain". Calls to imported methods within the assembly are compiled as indirect jumps through IAT entries. When a PE file is loaded, the loader looks through the import table and replaces each IAT entry with the address of the specified function in memory (but leaves the ILT alone). Then, when a jmp <IAT entry> instruction is executed, control transfers through the address the loader put in that IAT entry to the actual location of the imported function in memory. After the mscoree.dll string comes the loader stub itself. This is referenced from the AddressOfEntryPoint field in the PE header, and so is the first instruction executed when the assembly is started on a pre-XP Windows OS:
FF 25 00 20 40 00 jmp dword ptr [0x402000]
This references the first entry in the IAT, at RVA 0x2000 (file offset 0x200). The jmp is an indirect jump: it transfers execution
to the address that the loader stored in that IAT entry, which is the location of the _CorExeMain function in mscoree.dll.
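We can decode those six bytes by hand. A small sketch (Python; the indirect_jmp_target helper is invented for illustration) that pulls the little-endian operand out of an FF 25 indirect jmp:

```python
def indirect_jmp_target(code: bytes) -> int:
    # FF 25 encodes 'jmp dword ptr [addr]' in 32-bit x86: jump through
    # the pointer stored at the absolute address held in the next
    # four little-endian bytes.
    if code[0:2] != b"\xFF\x25":
        raise ValueError("not an FF 25 indirect jmp")
    return int.from_bytes(code[2:6], "little")

# The loader stub bytes quoted above.
target = indirect_jmp_target(bytes.fromhex("FF2500204000"))
# 0x402000 is the preferred ImageBase (0x400000) plus the IAT RVA (0x2000).
```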
Relocations
That solves the problem of imported function calls, but what about internal jumps? These include jumps to IAT entries, as well as direct jumps. Using a structure similar to the IAT would be quite inefficient, as it would introduce an extra level of indirection to every single jump performed in the executable. Instead, the PE header at the top of the file contains an ImageBase field that gives the preferred memory address at which the file would like to be loaded (in this file, 0x400000). All the internal and IAT jumps are compiled against that preferred image base. If, when the file is loaded, it can be loaded at that virtual memory address, everything works as expected. However, if it can't (say, because another dll has been loaded there instead), then all the jump addresses in the assembly need to be modified to take account of the new image base. This is done using the relocations table. The relocations table is stored in the .reloc section of the file, and contains an entry for every address that needs to be modified. In a .NET assembly, the only address that needs to be modified is the argument to the jmp instruction in the loader stub. In this assembly, the .reloc section starts at file offset 0x1200 and consists of the following bytes:
00 20 00 00 0c 00 00 00 a0 38 00 00
Now, in standard PE files, there are expected to be quite a lot of relocations, so they are grouped into blocks. The first 8 bytes of each block specify the base RVA of the block and the total size of the block (including the header itself); the following bytes specify offsets within that block at which relocations have to be applied. At each specified offset, the loader modifies the address there to take account of the new ImageBase at which the file has been loaded. So, to interpret the relocation entry above:
1. 0x2000 The base RVA of the block
2. 0xc The size of this relocation block
3. 0x38a0 The relocation entry itself. The high 4 bits specify the type of relocation (for .NET assemblies, this is always 0x3), so the offset within the block is 0x8a0.
This entry specifies that the address at RVA 0x28a0 (file offset 0xaa0) needs to be modified if the ImageBase changes. And, as you can see, this corresponds to the argument to the jmp instruction of the CLR loader stub.
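The walk a loader does over one of these blocks can be sketched as follows (Python; parse_reloc_block is a made-up helper, and the trailing 00 00 entry is treated as the alignment padding it is):

```python
import struct

def parse_reloc_block(data: bytes):
    # Block header: base RVA and total block size, both little-endian DWORDs.
    base_rva, block_size = struct.unpack_from("<II", data, 0)
    fixups = []
    # Each following entry is a 16-bit value: the high 4 bits give the
    # relocation type, the low 12 bits an offset within the block.
    for pos in range(8, block_size, 2):
        (entry,) = struct.unpack_from("<H", data, pos)
        rtype, offset = entry >> 12, entry & 0x0FFF
        if rtype != 0:  # type 0 entries are alignment padding
            fixups.append((rtype, base_rva + offset))
    return fixups

# The 12 .reloc bytes quoted above.
fixups = parse_reloc_block(bytes.fromhex("002000000c000000a0380000"))
```

Running this over the section's bytes recovers exactly one fixup: type 0x3 at RVA 0x28a0, the argument of the loader stub's jmp.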
Of course, in Windows XP and up, the loader natively knows that any PE file with a non-zero 15th data directory entry needs to be passed to the CLR. This code still needs to exist just in case the assembly is executed on a pre-XP OS.
Or does it...?
What if the assembly is compiled as x64-only? The first version of Windows available in a 64-bit x64 edition was Windows XP, so an x64 assembly cannot run on any earlier OS. In that case, the CLR loader stub is not added to the output assembly (at least by the C# compiler); the assembly has a zero PE entry point,
no .reloc section, and no import table, IAT or ILT. It still has the DOS stub, though. Well, that's the CLR loader stub covered! I'll probably look at signature encodings next, but if anyone has any preferences, please do comment below or email me.
by Simon Cooper